User Manual
Overview
GO-DMBC identifies potential cancer biomarkers using deep learning that integrates protein-protein interaction networks with Gene Ontology and knowledge graph embeddings.
Quick Start
- Go to the Analysis page
- Select your cancer type
- Enter a gene list OR upload PPI network files
- Click "Run Analysis"
- View results and download predictions
Example Data
Download example files to test the tool or use as templates for your own data.
nodes.csv
Example nodes file containing gene symbols and optional degree information.
Download nodes.csv
edges.csv
Example edges file containing source-target gene pairs with confidence scores.
Download edges.csv
Sample Gene List
You can also copy-paste this gene list directly into the analysis form:
BRCA1 BRCA2 TP53 EGFR MYC CDK4 RB1 PTEN PIK3CA AKT1 ERBB2 ESR1 PGR KRAS BRAF MDM2 CCND1 CDH1 FOXA1 GATA3
Input Requirements
Option 1: Gene List
Provide a list of gene symbols (HGNC format). The system will automatically construct a PPI network using the STRING database.
Format
- One gene symbol per line, OR
- Comma-separated gene symbols
Example
BRCA1
BRCA2
TP53
EGFR
MYC
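The two accepted formats can be normalized with a small helper before submission. This is a minimal sketch, not the tool's own parser: the function names are hypothetical, accepting spaces as separators is an assumption, and the 10-gene minimum is taken from the Troubleshooting section.

```python
import re

# Minimum list size reported by the tool's error message (see Troubleshooting).
MIN_GENES = 10


def parse_gene_list(text):
    """Split pasted text into unique gene symbols, keeping first-seen order.

    Accepts one symbol per line or comma-separated symbols. Treating spaces
    as separators too is an assumption made for this sketch.
    """
    genes = []
    seen = set()
    for token in re.split(r"[,\s]+", text):
        if token and token not in seen:
            seen.add(token)
            genes.append(token)
    return genes


def has_enough_genes(genes):
    """Return True if the list meets the documented 10-gene minimum."""
    return len(genes) >= MIN_GENES


genes = parse_gene_list("BRCA1\nBRCA2\nTP53, EGFR, MYC")
print(genes)                    # ['BRCA1', 'BRCA2', 'TP53', 'EGFR', 'MYC']
print(has_enough_genes(genes))  # False
```

Symbols are left in their original case, because some HGNC symbols contain lowercase letters.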
Option 2: PPI Network Files
Upload pre-constructed network files in CSV format.
nodes.csv
| Column | Required | Description |
|---|---|---|
| SYMBOL | Yes | Gene symbol (HGNC format) |
| degree | No | Node degree (calculated if not provided) |
Example nodes.csv:
SYMBOL,degree
BRCA1,15
BRCA2,12
TP53,25
EGFR,18
edges.csv
| Column | Required | Description |
|---|---|---|
| source | Yes | Source gene symbol |
| target | Yes | Target gene symbol |
| weight | No | Edge confidence score (0-1) |
Example edges.csv:
source,target,weight
BRCA1,BRCA2,0.95
BRCA1,TP53,0.87
TP53,EGFR,0.82
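Before uploading, both files can be checked locally against the column requirements above. The sketch below uses only Python's standard `csv` module. The helper names are hypothetical, and deriving a missing degree by counting a node's edges is an assumption about how "calculated if not provided" works.

```python
import csv
import io
from collections import Counter


def read_rows(handle, required):
    """Read a CSV file and fail early if a required column is absent."""
    reader = csv.DictReader(handle)
    missing = set(required) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    return list(reader)


def load_network(nodes_handle, edges_handle):
    """Return ({symbol: degree}, edge rows) after basic validation."""
    nodes = read_rows(nodes_handle, ["SYMBOL"])
    edges = read_rows(edges_handle, ["source", "target"])

    # The optional weight column must stay within the documented 0-1 range.
    for edge in edges:
        raw = (edge.get("weight") or "").strip()
        if raw and not 0.0 <= float(raw) <= 1.0:
            raise ValueError(f"weight out of range 0-1: {raw}")

    # Assumption: a missing degree is the number of edges touching the node.
    counted = Counter()
    for edge in edges:
        counted[edge["source"]] += 1
        counted[edge["target"]] += 1

    degree = {}
    for node in nodes:
        raw = (node.get("degree") or "").strip()
        degree[node["SYMBOL"]] = int(raw) if raw else counted[node["SYMBOL"]]
    return degree, edges


nodes_csv = io.StringIO("SYMBOL\nBRCA1\nBRCA2\nTP53\nEGFR\n")
edges_csv = io.StringIO(
    "source,target,weight\nBRCA1,BRCA2,0.95\nBRCA1,TP53,0.87\nTP53,EGFR,0.82\n"
)
degree, edges = load_network(nodes_csv, edges_csv)
print(degree)  # {'BRCA1': 2, 'BRCA2': 1, 'TP53': 2, 'EGFR': 1}
```

For files on disk, pass handles opened with `open(path, newline="")` in place of the `io.StringIO` objects.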
Analysis Pipeline
1. PPI Construction
If a gene list is provided, query the STRING database to build the interaction network. Edges are then filtered by confidence score and nodes by degree.
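The filtering step can be illustrated as follows. This is a sketch under stated assumptions, not the tool's implementation: the function name is hypothetical, the defaults come from the Parameters table (Min Confidence 0.7, Min Degree 2), and the degree filter runs as a single pass, which may differ from the tool's behavior.

```python
from collections import Counter


def filter_network(edges, min_confidence=0.7, min_degree=2):
    """Filter (source, target, weight) edges by confidence, then nodes by degree.

    Assumption: degree is computed once, after the confidence filter, and
    low-degree nodes are removed in a single pass (no repeated pruning).
    """
    kept = [(s, t, w) for s, t, w in edges if w >= min_confidence]

    degree = Counter()
    for s, t, _ in kept:
        degree[s] += 1
        degree[t] += 1

    nodes = {gene for gene, d in degree.items() if d >= min_degree}
    kept = [(s, t, w) for s, t, w in kept if s in nodes and t in nodes]
    return nodes, kept


# Illustrative edges with made-up weights.
edges = [
    ("A", "B", 0.90),
    ("A", "C", 0.80),
    ("B", "C", 0.75),
    ("C", "D", 0.95),  # D ends up with degree 1 and is dropped
    ("A", "D", 0.50),  # below the confidence threshold
]
nodes, kept = filter_network(edges)
print(sorted(nodes))  # ['A', 'B', 'C']
print(len(kept))      # 3
```

Lowering either threshold keeps more of the network, which is why the Troubleshooting section suggests doing so when the network becomes too small.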
2. GO Term Embedding
Retrieve Gene Ontology annotations for each protein. Generate 768-dimensional embeddings using a fine-tuned BioBERT model.
3. GeoKG Embedding
Map proteins to UniProt IDs and retrieve pre-computed knowledge graph embeddings (50 dimensions supported).
4. Feature Assembly
Concatenate GO and GeoKG embeddings. Filter out proteins not present in both embedding sets.
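Feature assembly amounts to an inner join followed by concatenation. The sketch below uses plain Python lists to stay dependency-free. The function name and the use of gene symbols as keys are assumptions, and the vectors are placeholder values rather than real embeddings.

```python
def assemble_features(go_embeddings, kg_embeddings):
    """Concatenate GO and GeoKG vectors for proteins present in both sets.

    Both arguments map an identifier to a list of floats. Proteins missing
    from either mapping are dropped, as described in the step above.
    """
    shared = [key for key in go_embeddings if key in kg_embeddings]
    return {key: list(go_embeddings[key]) + list(kg_embeddings[key]) for key in shared}


# Placeholder vectors with the documented sizes: 768 (GO) and 50 (GeoKG).
go = {"TP53": [0.0] * 768, "EGFR": [0.1] * 768}
kg = {"TP53": [1.0] * 50}

features = assemble_features(go, kg)
print(list(features))         # ['TP53']
print(len(features["TP53"]))  # 818
```

With a 768-dimensional GO vector and a 50-dimensional GeoKG vector, each retained protein has 818 features. EGFR is dropped here because it has no GeoKG vector, which is the situation behind the "Only X genes have both GO and GeoKG embeddings" message.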
5. Model Prediction
Apply the cancer-specific Graph Neural Network. Output a probability and a binary prediction for each gene.
6. Enrichment Analysis
Perform GO and KEGG pathway enrichment on predicted biomarkers (requires at least 5 predicted biomarkers).
Output Interpretation
Predictions Table
| Column | Description |
|---|---|
| SYMBOL | Gene symbol |
| biomarker_probability | Probability of being a biomarker (0-1) |
| predicted_biomarker | Binary prediction (0 = non-biomarker, 1 = biomarker) |
| confidence | Confidence in prediction |
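A downloaded predictions table can be ranked with a few lines of Python. This sketch assumes the download is a CSV file whose header matches the column names above. The function name is hypothetical, and the sample rows are invented for illustration, not real model output.

```python
import csv
import io


def top_biomarkers(handle, limit=10):
    """Return (symbol, probability) pairs for predicted biomarkers, best first."""
    rows = [row for row in csv.DictReader(handle) if row["predicted_biomarker"] == "1"]
    rows.sort(key=lambda row: float(row["biomarker_probability"]), reverse=True)
    return [(row["SYMBOL"], float(row["biomarker_probability"])) for row in rows[:limit]]


# Invented example rows; real values come from the tool's output.
sample = io.StringIO(
    "SYMBOL,biomarker_probability,predicted_biomarker,confidence\n"
    "TP53,0.91,1,0.82\n"
    "EGFR,0.34,0,0.32\n"
    "BRCA1,0.97,1,0.94\n"
)
print(top_biomarkers(sample))  # [('BRCA1', 0.97), ('TP53', 0.91)]
```

The number of predicted biomarkers matters downstream: enrichment analysis requires at least 5 of them.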
Network Visualization
Node size is proportional to network degree.
Enrichment Results
- GO Enrichment: Biological Process, Molecular Function, Cellular Component
- KEGG Pathways: Enriched biological pathways with links to the KEGG database
Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| Cancer Type | Breast Cancer | Breast, Lung, Glioblastoma | Cancer-specific model to use for prediction |
| GeoKG Dimension | 50 | 50, 100, 200, 500, 1000 | Knowledge graph embedding dimension |
| Min Confidence | 0.7 | 0.4 - 0.9 | Minimum STRING confidence score for edges |
| Min Degree | 2 | 1 - 5 | Minimum node degree (nodes below this degree are removed) |
Troubleshooting
Q: "Please provide at least 10 genes"
A: The analysis requires a minimum of 10 genes for meaningful network construction. Add more genes to your list. Larger lists are recommended for PPI construction and more reliable predictions.
Q: "Network too small after filtering"
A: Too many genes were removed due to low degree or missing embeddings. Try:
- Lowering the minimum confidence threshold
- Lowering the minimum degree requirement
- Adding more genes to your list
Q: "Only X genes have both GO and GeoKG embeddings"
A: Some genes in your list don't have GO annotations or aren't in the knowledge graph. This is normal for less-characterized genes.
Q: Analysis is taking too long
A: Large gene lists (>500 genes) may take 2-5 minutes. The STRING API query is usually the slowest step.