User Manual
Overview
CABIgo identifies potential cancer biomarkers using deep learning that integrates protein-protein interaction networks with Gene Ontology and knowledge graph embeddings.
Quick Start
- Go to the Analysis page
- Select your cancer type
- Enter a gene list OR upload PPI network files
- Click "Run Analysis"
- View results and download predictions
Example Data
Download example files to test the tool or use as templates for your own data.
nodes.csv
Example nodes file containing gene symbols and optional degree information.
Download nodes.csv
edges.csv
Example edges file containing source-target gene pairs with confidence scores.
Download edges.csv
Sample Gene List
You can also copy-paste this gene list directly into the analysis form:
ABAT ABCA6 ABCA9 ACAA2 ACACB ACADL ACADS ACKR1 ACKR3 ACKR4 ACO1 ACSL1 ACSM5 ACSS2 ACSS3 ACVR1C ADAM12 ADAMTS3 ADAMTS5 ADAMTSL4 ADCY4 ADCYAP1R1 ADGRA2 ADGRD1 ADGRL4 ADH1B ADH1C ADHFE1 ADIPOQ ADIRF ADM ADRA1A ADRA2A ADRB1 ADRB2 AGPAT2 AGR2 AGTR1 AHNAK AHNAK2 AIFM2 AKAP12 AKAP9 AKR1C1 AKR1C3 ALCAM ALDH1A1 ALDH1L1 ALDH2 ALDH3A2 ALDH3B2 ALK AMOTL2 ANG ANGPT1 ANGPTL4 ANK2 ANK3 ANLN ANO3 ANTXR2 ANXA1 AOC3 AP1M2 AP1S2 APCDD1 APOB APOBEC3A APOBEC3B APOC1 AQP1 AQP3 ARHGEF6 ASPA ASPH ASPM ATAD2 ATE1 ATG10 ATM ATP1A2 ATP1B1 ATP8B4 ATR ATXN7 AURKA AVPR1A AZGP1 AZIN1 BABAM1 BAMBI BARD1 BCL11B BCL2A1 BCL6 BCLAF1 BCOR BGN BHMT2 BICDL1 BIK BIN1 BIRC5 BLM BMP2 BMP6 BOK BRCA1 BRCA2 BRIP1 BUB1 BUB1B BUB3 C19orf12 C4orf19 C6 CA3 CA4 CALB2 CASQ2 CAT CAV1 CAV2 CAVIN1 CAVIN2 CBX2 CBX7 CBX8 CCBE1 CCDC170 CCDC69 CCN4 CCNA2 CCNB1 CCNB2 CCNE2 CCR5 CCR7 CCT2 CD209 CD24 CD248 CD300LF CD300LG CD34 CD36 CD37 CD9 CD99L2 CDC14B CDC20 CDC42EP2 CDC45 CDC7 CDCA3 CDCA5 CDCA7 CDCA8 CDCP1 CDH1 CDH11 CDH5 CDK1 CDK12 CDKN1C CDKN2A CDKN2B CDKN2C CDKN3 CDO1 CDON CDS1 CDYL2 CEACAM6 CEBPA CENPE CENPF CENPK CENPN CENPU CEP41 CEP55 CERS6 CETP CFAP298 CFD CFH CFL2 CGN CHEK1 CHEK2 CHMP4C CIDEA CIDEC CKMT2 CKS2 CLDN3 CLDN4 CLDN5 CLDN7 CLEC7A CLGN CLIC5 CLU CNKSR2 CNR1 CNRIP1 CNTNAP2 COL10A1 COL11A1 COL1A1 COL6A6 COMP COX11 COX7A1 CPEB1 CRABP2 CREB3L4 CREB5 CRNKL1 CRYAB CTHRC1 CTPS1 CXADR CXCL10 CXCL11 CXCL12 CXCL8 CXCR4 CYBRD1 CYP26B1 CYTH2 CYTIP DBF4 DCLRE1B DCN DDR2 DEGS2 DEPDC1 DGAT2 DHX15 DIO2 DLC1 DLGAP5 DLX2 DMD DMTN DNAJC1 DPT DSP DST DTL DUSP5 E2F3 E2F5 E2F8 EBF1 EBF2 EBF3 ECM2 ECT2 EDNRB EFEMP1 EFNA1 EFNA4 EGFLAM EGFR EGLN3 EHBP1 EHD2 EIF1 ELF3 ELL EMCN ENAH ENC1 ENPP2 EP300 EPAS1 EPB41L2 EPB41L5 EPB42 EPCAM EPN3 EPSTI1 ERBB2 ERBB3 ERBB4 ERG ESPN ESR1 ESRP1 ETFB ETNK1 EXO1 EZH1 EZH2 EZR F10 F12 FA2H FABP4 FABP5 FADS3 FAM83D FANCD2 FANCI FAXDC2 FBLN2 FBLN5 FBN1 FBXO11 FCRL4 FDPS FEN1 FERMT2 FEZ1 FGF1 FGF2 FGF3 FGFR2 FGFR3 FHL1 FKBP11 FKBP4 FN1 FNDC5 FOS
Input Requirements
Option 1: Gene List
Provide a list of gene symbols (HGNC format). The system automatically constructs a PPI network from the STRING database.
Format
- One gene symbol per line, OR
- Comma-separated gene symbols
Example
BRCA1
BRCA2
TP53
EGFR
MYC
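The two accepted formats can be normalized with one small parser. The sketch below is illustrative only and is not CABIgo's own code: the function names `parse_gene_list` and `has_enough_genes` are hypothetical, and it assumes that commas and any whitespace act as separators and that duplicate symbols are dropped. The 10-gene minimum comes from the Troubleshooting section.

```python
import re

MIN_GENES = 10  # minimum gene count stated in the Troubleshooting section


def parse_gene_list(text: str) -> list[str]:
    """Split pasted text into unique gene symbols, keeping first-seen order."""
    # Assumption: commas and any whitespace (newlines, spaces) are separators.
    tokens = re.split(r"[,\s]+", text.strip())
    seen: set[str] = set()
    genes: list[str] = []
    for token in tokens:
        # Symbols are kept as typed; case is not changed (e.g. C19orf12).
        if token and token not in seen:
            seen.add(token)
            genes.append(token)
    return genes


def has_enough_genes(genes: list[str]) -> bool:
    """Return True when the list meets the documented 10-gene minimum."""
    return len(genes) >= MIN_GENES


genes = parse_gene_list("BRCA1\nBRCA2, TP53,EGFR\nMYC\nBRCA1")
print(genes)                    # ['BRCA1', 'BRCA2', 'TP53', 'EGFR', 'MYC']
print(has_enough_genes(genes))  # False: only 5 unique symbols
```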
Option 2: PPI Network Files
Upload pre-constructed network files in CSV format.
nodes.csv
| Column | Required | Description |
|---|---|---|
| SYMBOL | Yes | Gene symbol (HGNC format) |
| degree | No | Node degree (calculated if not provided) |
Example nodes.csv:
SYMBOL,degree
BRCA1,15
BRCA2,12
TP53,25
EGFR,18
edges.csv
| Column | Required | Description |
|---|---|---|
| source | Yes | Source gene symbol |
| target | Yes | Target gene symbol |
| weight | No | Edge confidence score (0-1) |
Example edges.csv:
source,target,weight
BRCA1,BRCA2,0.95
BRCA1,TP53,0.87
TP53,EGFR,0.82
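Because the degree column is optional, it can be derived from the edges file. The following is a minimal sketch of that calculation, assuming an undirected network in which each edge adds one to the degree of both endpoints. The helper names `read_edges` and `compute_degrees` are hypothetical, and the tool's own calculation may differ.

```python
import csv
import io
from collections import Counter

# The example edges.csv from this manual, inlined so the sketch is self-contained.
EDGES_CSV = """source,target,weight
BRCA1,BRCA2,0.95
BRCA1,TP53,0.87
TP53,EGFR,0.82
"""


def read_edges(handle):
    """Read (source, target, weight) tuples; weight is None when absent."""
    edges = []
    for row in csv.DictReader(handle):
        weight = float(row["weight"]) if row.get("weight") else None
        edges.append((row["source"], row["target"], weight))
    return edges


def compute_degrees(edges):
    """Count how many edges touch each node (undirected assumption)."""
    degree = Counter()
    for source, target, _weight in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


edges = read_edges(io.StringIO(EDGES_CSV))
degrees = compute_degrees(edges)
print(degrees)  # {'BRCA1': 2, 'BRCA2': 1, 'TP53': 2, 'EGFR': 1}
```

The degrees here are computed from the three example edges only, so they are smaller than the values in the example nodes.csv, which presumably reflect a larger network.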
Analysis Pipeline
1. PPI Construction
If a gene list is provided, the STRING database is queried to build the interaction network. Edges are then filtered by confidence score and nodes by degree.
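The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: it applies the confidence threshold first, then removes low-degree nodes in a single pass (the manual does not say whether pruning is repeated). The function name `filter_network` is hypothetical; the defaults 0.7 and 2 come from the Parameters table.

```python
from collections import Counter


def filter_network(edges, min_confidence=0.7, min_degree=2):
    """Drop low-confidence edges, then nodes whose degree is below min_degree."""
    # Step 1: keep edges at or above the confidence threshold.
    kept = [(s, t, w) for s, t, w in edges if w >= min_confidence]

    # Step 2: compute node degree on the surviving edges.
    degree = Counter()
    for s, t, _w in kept:
        degree[s] += 1
        degree[t] += 1

    # Step 3: keep well-connected nodes and the edges between them.
    # Assumption: single pass; degrees are not recomputed after removal.
    nodes = {n for n, d in degree.items() if d >= min_degree}
    final_edges = [(s, t, w) for s, t, w in kept if s in nodes and t in nodes]
    return nodes, final_edges


# Made-up edge weights, for illustration only.
example_edges = [
    ("BRCA1", "BRCA2", 0.95),
    ("BRCA1", "TP53", 0.87),
    ("TP53", "EGFR", 0.82),
    ("BRCA2", "TP53", 0.75),
    ("EGFR", "MYC", 0.50),
]
nodes, final_edges = filter_network(example_edges)
print(sorted(nodes))     # ['BRCA1', 'BRCA2', 'TP53']
print(len(final_edges))  # 3
```

In this example MYC is lost at the confidence step and EGFR at the degree step, which is the kind of shrinkage behind the "Network too small after filtering" message.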
2. GO Term Embedding
Retrieve Gene Ontology annotations for each protein. Generate 768-dimensional embeddings using a fine-tuned BioBERT model.
3. GeoKG Embedding
Map proteins to UniProt IDs and retrieve pre-computed knowledge graph embeddings (50 dimensions supported).
4. Feature Assembly
Concatenate GO and GeoKG embeddings. Filter out proteins not present in both embedding sets.
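A minimal sketch of this step, assuming both embedding sets are available as dictionaries keyed by the same identifier. In the real pipeline GeoKG embeddings are looked up by UniProt ID, so a symbol-to-UniProt mapping would come first. The function name `assemble_features` and the placeholder vectors are hypothetical.

```python
def assemble_features(go_embeddings, geokg_embeddings):
    """Concatenate GO and GeoKG vectors for genes present in both sets."""
    shared = sorted(set(go_embeddings) & set(geokg_embeddings))
    return {
        gene: list(go_embeddings[gene]) + list(geokg_embeddings[gene])
        for gene in shared
    }


# Placeholder vectors with the documented sizes: 768 (GO) and 50 (GeoKG).
go = {"BRCA1": [0.1] * 768, "TP53": [0.2] * 768, "MYC": [0.3] * 768}
kg = {"BRCA1": [1.0] * 50, "TP53": [2.0] * 50, "EGFR": [3.0] * 50}

features = assemble_features(go, kg)
print(sorted(features))        # ['BRCA1', 'TP53']
print(len(features["BRCA1"]))  # 818
```

MYC and EGFR are dropped because each is missing from one of the two sets, which corresponds to the "Only X genes have both GO and GeoKG embeddings" message.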
5. Model Prediction
Apply the cancer-specific Graph Neural Network. Output a probability and a binary prediction for each gene.
6. Enrichment Analysis
Perform GO and KEGG pathway enrichment on predicted biomarkers (requires at least 5 predicted biomarkers).
Output Interpretation
Predictions Table
| Column | Description |
|---|---|
| SYMBOL | Gene symbol |
| biomarker_probability | Probability of being a biomarker (0-1) |
| predicted_biomarker | Binary prediction (0 = non-biomarker, 1 = biomarker) |
| confidence | Confidence in prediction |
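A downloaded predictions file can be post-processed with a few lines of code. The sketch below assumes the download is a CSV with the column names listed above. The rows are made-up placeholders rather than real CABIgo output, the numeric form of the confidence column is an assumption, and `ranked_biomarkers` is a hypothetical helper. The threshold of 5 is the enrichment requirement from the Analysis Pipeline section.

```python
import csv
import io

MIN_FOR_ENRICHMENT = 5  # enrichment needs at least 5 predicted biomarkers

# Made-up rows for illustration; not real predictions.
SAMPLE = """SYMBOL,biomarker_probability,predicted_biomarker,confidence
GENE_A,0.91,1,0.82
GENE_B,0.12,0,0.76
GENE_C,0.67,1,0.34
"""


def ranked_biomarkers(handle):
    """Return (symbol, probability) for predicted biomarkers, highest first."""
    hits = [
        (row["SYMBOL"], float(row["biomarker_probability"]))
        for row in csv.DictReader(handle)
        if int(float(row["predicted_biomarker"])) == 1
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


hits = ranked_biomarkers(io.StringIO(SAMPLE))
enrichment_possible = len(hits) >= MIN_FOR_ENRICHMENT
print(hits)                 # [('GENE_A', 0.91), ('GENE_C', 0.67)]
print(enrichment_possible)  # False
```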
Network Visualization
Node size is proportional to network degree.
Enrichment Results
- GO Enrichment: Biological Process, Molecular Function, Cellular Component
- KEGG Pathways: Enriched biological pathways with links to KEGG database
Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| Cancer Type | Breast Cancer | Breast, Lung, Glioblastoma | Cancer-specific model to use for prediction |
| GeoKG Dimension | 50 | 50, 100, 200, 500, 1000 | Knowledge graph embedding dimension |
| Min Confidence | 0.7 | 0.4 - 0.9 | Minimum STRING confidence score for edges |
| Min Degree | 2 | 1 - 5 | Minimum node degree (nodes below are removed) |
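The defaults and ranges in this table can be captured in a small settings object, which is handy when scripting repeated runs. This is a sketch only: the class `AnalysisParams` and its field names are hypothetical and do not correspond to a documented CABIgo API, and the cancer type labels are taken from the Range column.

```python
from dataclasses import dataclass

CANCER_TYPES = ("Breast", "Lung", "Glioblastoma")
GEOKG_DIMS = (50, 100, 200, 500, 1000)


@dataclass(frozen=True)
class AnalysisParams:
    """Defaults copied from the Parameters table."""
    cancer_type: str = "Breast"
    geokg_dim: int = 50
    min_confidence: float = 0.7
    min_degree: int = 2

    def validate(self) -> list[str]:
        """Return a list of human-readable problems; empty means valid."""
        problems = []
        if self.cancer_type not in CANCER_TYPES:
            problems.append(f"cancer_type must be one of {CANCER_TYPES}")
        if self.geokg_dim not in GEOKG_DIMS:
            problems.append(f"geokg_dim must be one of {GEOKG_DIMS}")
        if not 0.4 <= self.min_confidence <= 0.9:
            problems.append("min_confidence must be between 0.4 and 0.9")
        if not 1 <= self.min_degree <= 5:
            problems.append("min_degree must be between 1 and 5")
        return problems


print(AnalysisParams().validate())  # []
print(AnalysisParams(min_confidence=0.95, min_degree=0).validate())
```

Returning a list of problems rather than raising on the first one lets a caller report every out-of-range setting at once.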
Troubleshooting
Q: "Please provide at least 10 genes"
A: The analysis requires at least 10 genes for meaningful network construction. Add more genes to your list; larger lists are recommended for PPI construction and reliable predictions.
Q: "Network too small after filtering"
A: Too many genes were removed due to low degree or missing embeddings. Try:
- Lowering the minimum confidence threshold
- Lowering the minimum degree requirement
- Adding more genes to your list
Q: "Only X genes have both GO and GeoKG embeddings"
A: Some genes in your list don't have GO annotations or aren't in the knowledge graph. This is normal for less-characterized genes.
Q: Analysis is taking too long
A: Large gene lists (>500 genes) may take 2-5 minutes. The STRING API query is usually the slowest step.