Manual - GO-DMBC

Overview

GO-DMBC identifies potential cancer biomarkers using deep learning that integrates protein-protein interaction networks with Gene Ontology and knowledge graph embeddings.

Quick Start

1. Go to the Analysis page.
2. Select your cancer type.
3. Enter a gene list OR upload PPI network files.
4. Click "Run Analysis".
5. View results and download predictions.

Example Data

Download example files to test the tool or use as templates for your own data.

nodes.csv

Example nodes file containing gene symbols and optional degree information.

Download nodes.csv

edges.csv

Example edges file containing source-target gene pairs with confidence scores.

Download edges.csv

All Example Files

Download all example files as a single ZIP archive.

Download All (ZIP)

Sample Gene List

You can also copy-paste this gene list directly into the analysis form:

BRCA1
BRCA2
TP53
EGFR
MYC
CDK4
RB1
PTEN
PIK3CA
AKT1
ERBB2
ESR1
PGR
KRAS
BRAF
MDM2
CCND1
CDH1
FOXA1
GATA3

Input Requirements

Option 1: Gene List

Provide a list of gene symbols (HGNC format). The system automatically constructs a PPI network from the STRING database.

Minimum requirement: At least 10 genes are required for analysis.

Format

One gene symbol per line, or comma-separated gene symbols.

Example

BRCA1
BRCA2
TP53
EGFR
MYC
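The two accepted formats and the 10-gene minimum can be checked locally before submitting. The following is a minimal sketch, not the tool's own parser: the function names `parse_gene_list` and `check_gene_list` are hypothetical, and removing duplicate symbols is an assumption about sensible pre-processing.

```python
import re

MIN_GENES = 10  # minimum stated in this manual


def parse_gene_list(text):
    """Split on commas or line breaks, trim whitespace, drop blanks and duplicates."""
    seen = set()
    genes = []
    for token in re.split(r"[,\r\n]+", text):
        symbol = token.strip()
        if symbol and symbol not in seen:
            seen.add(symbol)
            genes.append(symbol)  # first-seen order is preserved
    return genes


def check_gene_list(genes):
    """Raise ValueError if the list is shorter than the stated minimum."""
    if len(genes) < MIN_GENES:
        raise ValueError(
            f"Please provide at least {MIN_GENES} genes (got {len(genes)})"
        )
    return genes
```

For example, `parse_gene_list("BRCA1, BRCA2\nTP53")` yields three symbols, which `check_gene_list` would reject as too few.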

Option 2: PPI Network Files

Upload pre-constructed network files in CSV format.

Note: The tool does not support .tsv files from STRING directly. Please convert to the CSV format described below.

nodes.csv

Column	Required	Description
`SYMBOL`	Yes	Gene symbol (HGNC format)
`degree`	No	Node degree (calculated if not provided)

Example nodes.csv:

SYMBOL,degree
BRCA1,15
BRCA2,12
TP53,25
EGFR,18

edges.csv

Column	Required	Description
`source`	Yes	Source gene symbol
`target`	Yes	Target gene symbol
`weight`	No	Edge confidence score (0-1)

Example edges.csv:

source,target,weight
BRCA1,BRCA2,0.95
BRCA1,TP53,0.87
TP53,EGFR,0.82
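Since STRING .tsv exports are not accepted directly, a small conversion step is needed. The sketch below assumes the export has header columns named `#node1`, `node2`, and `combined_score`; check the header of your own file and adjust the names if they differ. The function name `string_tsv_to_csv` is hypothetical. It returns the text of nodes.csv and edges.csv in the layouts described above, computing degree from the edge list.

```python
import csv
import io
from collections import Counter


def string_tsv_to_csv(tsv_text):
    """Convert STRING-style TSV text into (nodes_csv_text, edges_csv_text)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    edges = []
    seen_pairs = set()
    degree = Counter()
    for row in reader:
        # Column names below are assumptions about the export format.
        source, target = row["#node1"], row["node2"]
        pair = tuple(sorted((source, target)))
        if pair in seen_pairs:
            continue  # skip A-B / B-A duplicates so degree is not double counted
        seen_pairs.add(pair)
        edges.append((source, target, row["combined_score"]))
        degree[source] += 1
        degree[target] += 1

    edges_out = io.StringIO()
    writer = csv.writer(edges_out, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    writer.writerows(edges)

    nodes_out = io.StringIO()
    writer = csv.writer(nodes_out, lineterminator="\n")
    writer.writerow(["SYMBOL", "degree"])
    for symbol in sorted(degree):
        writer.writerow([symbol, degree[symbol]])

    return nodes_out.getvalue(), edges_out.getvalue()
```

Write the two returned strings to files named nodes.csv and edges.csv before uploading.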

Analysis Pipeline

1. PPI Construction

If a gene list is provided, the tool queries the STRING database to build the interaction network. Edges are then filtered by confidence score and nodes by degree (see Min Confidence and Min Degree under Parameters).
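The filtering step can be illustrated as follows. This is a sketch under assumptions, not the tool's implementation: the function name `filter_network` is hypothetical, the defaults mirror the Parameters table (0.7 and 2), and the degree filter is applied in a single pass after edge filtering, which may differ from what the tool actually does.

```python
from collections import Counter


def filter_network(edges, min_confidence=0.7, min_degree=2):
    """Filter (source, target, weight) edges by weight, then nodes by degree."""
    # Step 1: keep only edges at or above the confidence threshold.
    kept = [(s, t, w) for s, t, w in edges if w >= min_confidence]

    # Step 2: compute node degree on the remaining edges.
    degree = Counter()
    for s, t, _ in kept:
        degree[s] += 1
        degree[t] += 1

    # Step 3: drop nodes below the minimum degree, and any edge touching them.
    nodes = {n for n, d in degree.items() if d >= min_degree}
    final_edges = [(s, t, w) for s, t, w in kept if s in nodes and t in nodes]
    return nodes, final_edges
```

Raising either threshold shrinks the network, which is why strict settings on a short gene list can leave too few nodes for analysis.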

2. GO Term Embedding

Retrieve Gene Ontology annotations for each protein and generate 768-dimensional embeddings using a fine-tuned BioBERT model.

3. GeoKG Embedding

Map proteins to UniProt IDs and retrieve pre-computed knowledge graph embeddings (50 dimensions supported).

4. Feature Assembly

Concatenate GO and GeoKG embeddings. Filter out proteins not present in both embedding sets.
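A minimal sketch of this step, assuming each embedding set is a dictionary keyed by protein identifier (the function name `assemble_features` and the dictionary layout are hypothetical). With the 768-dimensional GO embedding and the default 50-dimensional GeoKG embedding, plain concatenation would give 818 values per protein.

```python
def assemble_features(go_embeddings, kg_embeddings):
    """Concatenate GO and GeoKG vectors for proteins present in both sets.

    Returns (features, dropped): features maps each shared protein to its
    concatenated vector; dropped lists proteins missing from either set.
    """
    shared = sorted(go_embeddings.keys() & kg_embeddings.keys())
    features = {
        protein: list(go_embeddings[protein]) + list(kg_embeddings[protein])
        for protein in shared
    }
    all_proteins = go_embeddings.keys() | kg_embeddings.keys()
    dropped = sorted(all_proteins - set(shared))
    return features, dropped
```

The `dropped` list corresponds to the genes counted out in the "Only X genes have both GO and GeoKG embeddings" message described under Troubleshooting.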

5. Model Prediction

Apply the cancer-specific Graph Neural Network. Output a probability and a binary prediction for each gene.

6. Enrichment Analysis

Perform GO and KEGG pathway enrichment on predicted biomarkers (requires at least 5 predicted biomarkers).
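This manual does not say which statistical test is used for enrichment. A common choice for over-representation analysis is the one-sided hypergeometric test, sketched below purely as an illustration. The function names and the gene sets passed in are hypothetical; only the minimum of 5 predicted biomarkers comes from the text above.

```python
from math import comb

MIN_BIOMARKERS = 5  # minimum stated in this manual


def hypergeom_pvalue(overlap, term_size, selected, background):
    """P(X >= overlap) when drawing `selected` genes from `background`,
    of which `term_size` are annotated with the term."""
    total = comb(background, selected)
    upper = min(term_size, selected)
    tail = sum(
        comb(term_size, i) * comb(background - term_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def enrich(biomarkers, term_genes, background_genes):
    """Over-representation p-value of one term among the predicted biomarkers."""
    if len(biomarkers) < MIN_BIOMARKERS:
        raise ValueError(f"Need at least {MIN_BIOMARKERS} predicted biomarkers")
    background = set(background_genes)
    selected = set(biomarkers) & background
    term = set(term_genes) & background
    return hypergeom_pvalue(
        len(selected & term), len(term), len(selected), len(background)
    )
```

In practice the p-values for all tested terms would also need multiple-testing correction, which is omitted here.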

Output Interpretation

Predictions Table

Column	Description
`SYMBOL`	Gene symbol
`biomarker_probability`	Probability of being a biomarker (0-1)
`predicted_biomarker`	Binary prediction (0 = non-biomarker, 1 = biomarker)
`confidence`	Confidence in prediction
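Downloaded predictions can be post-processed with a few lines of code. The sketch below assumes the download is a CSV file whose header uses the column names in the table above; the function name `top_biomarkers` is hypothetical.

```python
import csv
import io


def top_biomarkers(predictions_csv, limit=10):
    """Return (SYMBOL, probability) for predicted biomarkers, highest first."""
    rows = csv.DictReader(io.StringIO(predictions_csv))
    hits = [
        (row["SYMBOL"], float(row["biomarker_probability"]))
        for row in rows
        # Accept "1" or "1.0" as the positive label.
        if int(float(row["predicted_biomarker"])) == 1
    ]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:limit]
```

Pass in the file contents as a string, for example `top_biomarkers(open(path).read())`, where `path` points at your downloaded predictions file.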

Network Visualization

The network view distinguishes two node types: predicted biomarkers, and non-biomarkers (neighbors of biomarkers).

Node size is proportional to network degree.

Enrichment Results

GO Enrichment:

KEGG Pathways:

Parameters

Parameter	Default	Range	Description
Cancer Type	Breast Cancer	Breast, Lung, Glioblastoma	Cancer-specific model to use for prediction
GeoKG Dimension	50	50, 100, 200, 500, 1000	Knowledge graph embedding dimension
Min Confidence	0.7	0.4 - 0.9	Minimum STRING confidence score for edges
Min Degree	2	1 - 5	Minimum node degree (nodes below are removed)

Troubleshooting

Q: "Please provide at least 10 genes"

A: The analysis requires a minimum of 10 genes for meaningful network construction. Add more genes to your list; larger lists are recommended for PPI construction and reliable predictions.

Q: "Network too small after filtering"

A: Too many genes were removed due to low degree or missing embeddings. Try lowering Min Degree or Min Confidence (see Parameters), or adding more genes to your list.

Q: "Only X genes have both GO and GeoKG embeddings"

A: Some genes in your list don't have GO annotations or aren't in the knowledge graph. This is normal for less-characterized genes.

Q: Analysis is taking too long

A: Large gene lists (>500 genes) may take 2-5 minutes. The STRING API query is usually the slowest step.
