User Manual

Overview

GO-DMBC identifies potential cancer biomarkers using deep learning that integrates protein-protein interaction networks with Gene Ontology and knowledge graph embeddings.

Quick Start

  1. Go to the Analysis page
  2. Select your cancer type
  3. Enter a gene list OR upload PPI network files
  4. Click "Run Analysis"
  5. View results and download predictions

Example Data

Download example files to test the tool or use as templates for your own data.

nodes.csv

Example nodes file containing gene symbols and optional degree information.

Download nodes.csv

edges.csv

Example edges file containing source-target gene pairs with confidence scores.

Download edges.csv

All Example Files

Download all example files as a single ZIP archive.

Download All (ZIP)

Sample Gene List

You can also copy-paste this gene list directly into the analysis form:

BRCA1
BRCA2
TP53
EGFR
MYC
CDK4
RB1
PTEN
PIK3CA
AKT1
ERBB2
ESR1
PGR
KRAS
BRAF
MDM2
CCND1
CDH1
FOXA1
GATA3

Input Requirements

Option 1: Gene List

Provide a list of gene symbols (HGNC format). The system automatically constructs a PPI network using the STRING database.

Minimum requirement: At least 10 genes are required for analysis.

Format

    One gene symbol per line, OR
    Comma-separated gene symbols

Example

BRCA1
BRCA2
TP53
EGFR
MYC
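
The two accepted formats and the 10-gene minimum can be checked before submitting a list. The following is a minimal sketch, not the tool's own parser: the function names `parse_gene_list` and `check_gene_count` are hypothetical, and dropping repeated symbols is an assumption about how duplicates should be handled.

```python
import re

MIN_GENES = 10  # minimum number of genes stated in this manual


def parse_gene_list(text: str) -> list[str]:
    """Split pasted text on newlines and commas; keep unique symbols in order."""
    symbols = []
    seen = set()
    for token in re.split(r"[,\r\n]+", text):
        symbol = token.strip()
        # Assumption: exact duplicates are dropped rather than counted twice.
        if symbol and symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols


def check_gene_count(symbols: list[str]) -> None:
    """Raise ValueError when the list is shorter than MIN_GENES."""
    if len(symbols) < MIN_GENES:
        raise ValueError(
            f"Please provide at least {MIN_GENES} genes (got {len(symbols)})"
        )


# Mixed newline and comma separators, with one repeated symbol.
genes = parse_gene_list("BRCA1, BRCA2\nTP53,EGFR\nMYC\nBRCA1")
print(genes)
```

Calling `check_gene_count(genes)` on this five-gene example raises `ValueError`, which mirrors the "Please provide at least 10 genes" message described under Troubleshooting.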

Option 2: PPI Network Files

Upload pre-constructed network files in CSV format.

Note: The tool does not support .tsv files from STRING directly. Please convert to the CSV format described below.

nodes.csv

Column    Required    Description
SYMBOL    Yes         Gene symbol (HGNC format)
degree    No          Node degree (calculated if not provided)

Example nodes.csv:

SYMBOL,degree
BRCA1,15
BRCA2,12
TP53,25
EGFR,18

edges.csv

Column    Required    Description
source    Yes         Source gene symbol
target    Yes         Target gene symbol
weight    No          Edge confidence score (0-1)

Example edges.csv:

source,target,weight
BRCA1,BRCA2,0.95
BRCA1,TP53,0.87
TP53,EGFR,0.82
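
Because STRING .tsv exports are not accepted directly, a small conversion script can produce the two CSV files. This is a sketch under stated assumptions: the column names `#node1`, `node2`, and `combined_score` are assumed to match the header of your STRING export, scores are assumed to already be on a 0-1 scale (divide by 1000 first if your file uses 0-1000), and the function name `string_tsv_to_csv` is hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed STRING export column names; adjust them to your file's header.
SOURCE_COL = "#node1"
TARGET_COL = "node2"
SCORE_COL = "combined_score"


def string_tsv_to_csv(tsv_text: str) -> tuple[str, str]:
    """Return (nodes_csv, edges_csv) text built from a STRING-style TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    edges = [(row[SOURCE_COL], row[TARGET_COL], row[SCORE_COL]) for row in reader]

    # Degree = number of listed edges touching each node.
    degree = Counter()
    for source, target, _ in edges:
        degree[source] += 1
        degree[target] += 1

    edges_out = io.StringIO()
    writer = csv.writer(edges_out, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)

    nodes_out = io.StringIO()
    writer = csv.writer(nodes_out, lineterminator="\n")
    writer.writerow(["SYMBOL", "degree"])
    for symbol in sorted(degree):
        writer.writerow([symbol, degree[symbol]])

    return nodes_out.getvalue(), edges_out.getvalue()


example_tsv = (
    "#node1\tnode2\tcombined_score\n"
    "BRCA1\tBRCA2\t0.95\n"
    "BRCA1\tTP53\t0.87\n"
    "TP53\tEGFR\t0.82\n"
)
nodes_csv, edges_csv = string_tsv_to_csv(example_tsv)
print(nodes_csv)
print(edges_csv)
```

To work with real files, read the .tsv into a string and write the two returned strings to nodes.csv and edges.csv. The sketch does not merge reciprocal rows (A-B and B-A), so deduplicate those first if your export contains both directions.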

Analysis Pipeline

1. PPI Construction

If a gene list is provided, query the STRING database to build the interaction network. Edges are filtered by confidence score and nodes by degree.
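
The two filters can be illustrated as follows. This is a sketch only: the function name `filter_network` is hypothetical, and applying the filters in a single pass (edges first, then nodes) is an assumption, since this manual does not say whether the degree filter is repeated until the network is stable. The defaults 0.7 and 2 are taken from the Parameters table.

```python
from collections import Counter


def filter_network(edges, min_confidence=0.7, min_degree=2):
    """Filter (source, target, weight) edges, then drop low-degree nodes.

    Assumption: one pass only; degrees are not recomputed after node removal.
    """
    # Step 1: keep edges at or above the confidence threshold.
    kept = [edge for edge in edges if edge[2] >= min_confidence]

    # Step 2: count degree on the surviving edges.
    degree = Counter()
    for source, target, _ in kept:
        degree[source] += 1
        degree[target] += 1

    # Step 3: keep nodes meeting the minimum degree, and edges between them.
    nodes = {node for node, count in degree.items() if count >= min_degree}
    final_edges = [(s, t, w) for s, t, w in kept if s in nodes and t in nodes]
    return sorted(nodes), final_edges


edges = [
    ("BRCA1", "BRCA2", 0.95),
    ("BRCA1", "TP53", 0.87),
    ("BRCA2", "TP53", 0.75),
    ("TP53", "EGFR", 0.82),
    ("EGFR", "MYC", 0.50),  # below the confidence threshold
]
nodes, final_edges = filter_network(edges)
print(nodes)
print(final_edges)
```

In this example the EGFR-MYC edge fails the confidence filter, which leaves EGFR with degree 1, so EGFR and MYC are both removed. This is the kind of cascade that can lead to the "Network too small after filtering" message.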

2. GO Term Embedding

Retrieve Gene Ontology annotations for each protein. Generate 768-dimensional embeddings using a fine-tuned BioBERT model.

3. GeoKG Embedding

Map proteins to UniProt IDs and retrieve pre-computed knowledge graph embeddings (50 dimensions supported).

4. Feature Assembly

Concatenate GO and GeoKG embeddings. Filter out proteins not present in both embedding sets.
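
A minimal sketch of this step, assuming each embedding set is a dictionary keyed by protein identifier. The function name `assemble_features` and the GO-first ordering of the concatenated vector are assumptions, not details confirmed by this manual.

```python
def assemble_features(go_embeddings, geokg_embeddings):
    """Concatenate per-protein vectors for proteins present in both sets."""
    shared = sorted(set(go_embeddings) & set(geokg_embeddings))
    # Assumption: the GO vector comes first, followed by the GeoKG vector.
    return {
        protein: list(go_embeddings[protein]) + list(geokg_embeddings[protein])
        for protein in shared
    }


# Placeholder vectors with the dimensions named above (768 and 50).
go = {"TP53": [0.1] * 768, "BRCA1": [0.2] * 768, "MYC": [0.3] * 768}
geokg = {"TP53": [1.0] * 50, "BRCA1": [2.0] * 50, "EGFR": [3.0] * 50}

features = assemble_features(go, geokg)
print(sorted(features), len(features["TP53"]))
```

MYC and EGFR are dropped here because each appears in only one embedding set, which is the situation reported by the "Only X genes have both GO and GeoKG embeddings" message.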

5. Model Prediction

Apply the cancer-specific Graph Neural Network. Output a probability and a binary prediction for each gene.

6. Enrichment Analysis

Perform GO and KEGG pathway enrichment on predicted biomarkers (requires at least 5 predicted biomarkers).
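
This manual does not state which statistical test the enrichment step uses. As an illustration only, the sketch below uses a one-sided hypergeometric (over-representation) test, a common choice for this kind of analysis, together with the 5-biomarker minimum. The function names and the gene labels G1 to G20 are hypothetical placeholders.

```python
from math import comb

MIN_BIOMARKERS = 5  # minimum stated in this manual


def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when drawing `sample` items from `population`,
    of which `annotated` carry the term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, sample)


def term_enrichment(biomarkers, term_genes, background):
    """Over-representation p-value of one term among predicted biomarkers."""
    background = set(background)
    biomarkers = set(biomarkers) & background
    if len(biomarkers) < MIN_BIOMARKERS:
        raise ValueError(f"Need at least {MIN_BIOMARKERS} predicted biomarkers")
    annotated = set(term_genes) & background
    overlap = biomarkers & annotated
    return hypergeom_pvalue(
        len(background), len(annotated), len(biomarkers), len(overlap)
    )


# Placeholder gene labels, not real annotations.
background = [f"G{i}" for i in range(1, 21)]
term_genes = ["G1", "G2", "G3", "G4", "G5"]
biomarkers = ["G1", "G2", "G3", "G4", "G10"]
p_value = term_enrichment(biomarkers, term_genes, background)
print(p_value)
```

A real analysis would repeat this for every GO term or KEGG pathway and correct the resulting p-values for multiple testing.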

Output Interpretation

Predictions Table

Column                   Description
SYMBOL                   Gene symbol
biomarker_probability    Probability of being a biomarker (0-1)
predicted_biomarker      Binary prediction (0 = non-biomarker, 1 = biomarker)
confidence               Confidence in prediction
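
Once downloaded, the predictions can be ranked with a few lines of code. This sketch assumes the download is a CSV file whose header uses the column names listed above; the function name `top_biomarkers` is hypothetical, and the numeric values in the example are invented placeholders, not tool output.

```python
import csv
import io


def top_biomarkers(predictions_csv: str, limit: int = 10):
    """Return (symbol, probability) pairs for predicted biomarkers,
    sorted from highest to lowest probability."""
    reader = csv.DictReader(io.StringIO(predictions_csv))
    hits = [
        (row["SYMBOL"], float(row["biomarker_probability"]))
        for row in reader
        if row["predicted_biomarker"].strip() == "1"
    ]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:limit]


# Invented placeholder rows for illustration only.
example_predictions = (
    "SYMBOL,biomarker_probability,predicted_biomarker,confidence\n"
    "TP53,0.91,1,0.82\n"
    "MYC,0.32,0,0.36\n"
    "BRCA1,0.97,1,0.94\n"
    "EGFR,0.64,1,0.28\n"
)
ranked = top_biomarkers(example_predictions)
print(ranked)
```

For a real file, read its contents into a string first, for example with `open(path, newline="").read()`.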

Network Visualization

Node legend:

    Predicted biomarkers
    Non-biomarkers (neighbors of biomarkers)

Node size is proportional to network degree.

Enrichment Results

    GO Enrichment: Biological Process, Molecular Function, Cellular Component
    KEGG Pathways: Enriched biological pathways with links to KEGG database

Parameters

Parameter          Default          Range                         Description
Cancer Type        Breast Cancer    Breast, Lung, Glioblastoma    Cancer-specific model to use for prediction
GeoKG Dimension    50               50, 100, 200, 500, 1000       Knowledge graph embedding dimension
Min Confidence     0.7              0.4 - 0.9                     Minimum STRING confidence score for edges
Min Degree         2                1 - 5                         Minimum node degree (nodes below are removed)
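
The defaults and allowed ranges above can be captured in a small validated settings object, which is useful when scripting parameter choices before entering them in the form. This is a sketch: the class `AnalysisParams` and its field names are hypothetical and are not part of the tool's interface.

```python
from dataclasses import dataclass

CANCER_TYPES = ("Breast", "Lung", "Glioblastoma")
GEOKG_DIMENSIONS = (50, 100, 200, 500, 1000)


@dataclass(frozen=True)
class AnalysisParams:
    """Defaults and ranges copied from the Parameters table."""

    cancer_type: str = "Breast"
    geokg_dimension: int = 50
    min_confidence: float = 0.7
    min_degree: int = 2

    def __post_init__(self):
        if self.cancer_type not in CANCER_TYPES:
            raise ValueError(f"cancer_type must be one of {CANCER_TYPES}")
        if self.geokg_dimension not in GEOKG_DIMENSIONS:
            raise ValueError(f"geokg_dimension must be one of {GEOKG_DIMENSIONS}")
        if not 0.4 <= self.min_confidence <= 0.9:
            raise ValueError("min_confidence must be between 0.4 and 0.9")
        if not 1 <= self.min_degree <= 5:
            raise ValueError("min_degree must be between 1 and 5")


defaults = AnalysisParams()
relaxed = AnalysisParams(min_confidence=0.4, min_degree=1)
print(defaults)
print(relaxed)
```

The `relaxed` settings correspond to the lowest thresholds the table allows, which is the adjustment suggested under Troubleshooting when the network is too small after filtering.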

Troubleshooting

Q: "Please provide at least 10 genes"

A: The analysis requires a minimum of 10 genes for meaningful network construction. Add more genes to your list; larger lists are recommended for PPI construction and reliable predictions.

Q: "Network too small after filtering"

A: Too many genes were removed due to low degree or missing embeddings. Try:

    Lowering the minimum confidence threshold
    Lowering the minimum degree requirement
    Adding more genes to your list

Q: "Only X genes have both GO and GeoKG embeddings"

A: Some genes in your list don't have GO annotations or aren't in the knowledge graph. This is normal for less-characterized genes.

Q: Analysis is taking too long

A: Large gene lists (>500 genes) may take 2-5 minutes. The STRING API query is usually the slowest step.
