RFE on Rectal Cancer Dataset

Datasets


  • TCGA-COAD (The Cancer Genome Atlas, United States)
  • GSE199057 (Gene Expression Omnibus, a high-throughput gene expression database)
Volcano Plot
[Figure: TCGA-COAD volcano plot]

Description

Using the ChAMP package, a volcano plot can be generated to visualize the differentially methylated regions (DMRs) in the TCGA-COAD dataset. The plot shows each DMR's statistical significance (p-value) against its magnitude of change (fold change).
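As a minimal sketch of how volcano plot coordinates are typically derived, the Python snippet below maps each probe to an x value (magnitude of change) and a y value (-log10 of the p-value). It assumes Dbeta is used as the measure of change. The `dbeta_cutoff` of 0.2, the function name, and the probe values are all hypothetical and do not come from the analysis itself.

```python
import math

def volcano_points(probes, p_cutoff=0.05, dbeta_cutoff=0.2):
    """Return (name, x, y, label) per probe, with x = Dbeta and y = -log10(p).

    dbeta_cutoff is an illustrative assumption; only p < 0.05 is stated
    in the text.
    """
    points = []
    for name, dbeta, p in probes:
        y = -math.log10(p)
        if p < p_cutoff and dbeta >= dbeta_cutoff:
            label = "hyper"   # significant gain of methylation
        elif p < p_cutoff and dbeta <= -dbeta_cutoff:
            label = "hypo"    # significant loss of methylation
        else:
            label = "ns"      # not significant or change too small
        points.append((name, dbeta, y, label))
    return points

# Hypothetical probes, for illustration only.
example = [
    ("probe_a", 0.35, 1e-6),
    ("probe_b", -0.28, 1e-4),
    ("probe_c", 0.05, 0.20),
]
for row in volcano_points(example):
    print(row)
```

Points labeled "hyper" and "hypo" end up in the upper right and upper left of the plot, which is where a volcano plot draws the eye.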


Dbeta Calculation and DMR Filtering

Extensive preprocessing of GSE199057 and TCGA-COAD methylation data was performed using the ChAMP package in R. This involved several critical steps:

  • Quality Control: Removal of low-quality probes, SNP-related probes, and cross-reactive probes
  • Normalization: BMIQ normalization to adjust for probe type bias (Infinium I vs II)
  • Batch Effect Correction: ComBat algorithm was applied to remove potential batch effects
  • Differential Methylation Analysis: Calculated differentially methylated beta values (Dbeta) between tumor and normal tissues
  • Statistical Filtering: Applied significance threshold of p < 0.05
  • Annotation: DMRs were mapped to genes based on their genomic locations and proximity to TSS (Transcription Start Sites)

The resulting differentially methylated regions (DMRs) represent epigenetic alterations that may drive colorectal cancer development and progression. Positive Dbeta values indicate hypermethylation in cancer tissues, while negative values indicate hypomethylation.
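The Dbeta sign convention can be illustrated with a short Python sketch. It assumes Dbeta is the mean tumor beta value minus the mean normal beta value for each probe, and that p-values have already been computed upstream (ChAMP does this in R). The function names and all sample values are hypothetical.

```python
from statistics import mean

def delta_beta(tumor_betas, normal_betas):
    """Dbeta, assumed here to be mean tumor beta minus mean normal beta."""
    return mean(tumor_betas) - mean(normal_betas)

def filter_and_classify(betas, p_values, p_cutoff=0.05):
    """Keep probes with p < p_cutoff and label them by the sign of Dbeta."""
    kept = {}
    for probe, (tumor, normal) in betas.items():
        if p_values[probe] >= p_cutoff:
            continue  # fails the significance filter
        d = delta_beta(tumor, normal)
        if d == 0:
            continue  # no change, so neither hyper nor hypo
        kept[probe] = (d, "hyper" if d > 0 else "hypo")
    return kept

# Hypothetical beta values (tumor samples, normal samples) per probe.
betas = {
    "probe_a": ([0.80, 0.75, 0.85], [0.30, 0.35, 0.25]),
    "probe_b": ([0.10, 0.15, 0.20], [0.60, 0.65, 0.70]),
    "probe_c": ([0.50, 0.52, 0.48], [0.50, 0.49, 0.51]),
}
p_values = {"probe_a": 0.001, "probe_b": 0.01, "probe_c": 0.60}

for probe, (d, direction) in filter_and_classify(betas, p_values).items():
    print(probe, round(d, 3), direction)
```

Counting the "hyper" and "hypo" labels over all retained probes gives the kind of totals reported in the distribution charts.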

Hyper/Hypo Methylation Distribution - TCGA-COAD

  • Hyper: 5981
  • Hypo: 12564

Feature Distribution - TCGA-COAD

  • 1stExon: 1123
  • 3'UTR: 1234
  • 5'UTR: 2101
  • Body: 7133
  • TSS1500: 5164
  • TSS200: 1790

Hyper/Hypo Methylation Distribution - GSE199057

  • Hyper: 12070
  • Hypo: 11467

Feature Distribution - GSE199057

  • 1stExon: 744
  • 3'UTR: 1161
  • 5'UTR: 2245
  • Body: 11309
  • TSS1500: 6182
  • TSS200: 1671
  • ExonBnd: 225

Principal Component Analysis (PCA)

PCA of TCGA-COAD

PCA of GSE199057

Joined Data Analysis

After filtering the DMRs from both datasets, we joined the two datasets to create a combined dataset for further analysis. The joined dataset contains 88 genes, whose pairwise similarity was calculated from GO (Gene Ontology) terms. The GO terms were obtained from the Ensembl database, and the similarity was calculated using the GOSemSim package in R.
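The join and the similarity step can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used here: it treats the join as an intersection of gene symbols, and it substitutes a plain Jaccard index over GO term sets for the semantic similarity measures that GOSemSim provides. Gene names and term labels are placeholders.

```python
def join_gene_sets(genes_a, genes_b):
    """Genes present in both DMR gene lists (join assumed to be an intersection)."""
    return sorted(set(genes_a) & set(genes_b))

def jaccard(terms_a, terms_b):
    """Jaccard index of two GO term sets; a simple stand-in, not GOSemSim."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similarity_matrix(genes, annotations):
    """Pairwise gene similarity matrix in the order given by `genes`."""
    return [
        [jaccard(annotations.get(g1, ()), annotations.get(g2, ())) for g2 in genes]
        for g1 in genes
    ]

# Placeholder gene symbols and GO term labels, for illustration only.
tcga_genes = ["GENE_A", "GENE_B", "GENE_C"]
geo_genes = ["GENE_B", "GENE_C", "GENE_D"]
annotations = {
    "GENE_B": {"term_1", "term_2", "term_3"},
    "GENE_C": {"term_2", "term_3", "term_4"},
}

shared = join_gene_sets(tcga_genes, geo_genes)
print(shared)
print(similarity_matrix(shared, annotations))
```

A matrix of this shape (one row and one column per gene) is the usual input to hierarchical clustering, once it has been converted to a distance such as 1 minus similarity.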

[Figure: Hierarchical clustering using simple sum]
[Figure: Hierarchical clustering using weighted sum]
[Figure: Hierarchical clustering using consensus method]
[Figure: Comparison of different clustering methods]

Machine Learning

After comparing the clustering methods, we chose consensus clustering for the joined dataset because of its robustness in handling gene expression patterns. For biomarker identification, we applied Recursive Feature Elimination (RFE) with cross-validation to select the most predictive genes while minimizing redundancy. This process ranks genes by their power to discriminate between cancer and normal tissue samples.
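The elimination loop at the core of RFE can be sketched in dependency-free Python. The sketch assumes a logistic regression fitted by gradient descent as the ranking model and omits the cross-validation wrapper; the source does not say which estimator or library was used. In practice a library routine such as scikit-learn's `RFECV` would normally replace this. Feature names and data are synthetic.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def rfe_ranking(X, y, names):
    """Refit repeatedly, dropping the feature with the smallest |weight| each time.

    Returns feature names ordered from most to least important. Features are
    assumed to be on comparable scales so that weights are comparable.
    """
    remaining = list(range(len(names)))
    eliminated = []
    while len(remaining) > 1:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = fit_logistic(X_sub, y)
        weakest = min(range(len(remaining)), key=lambda k: abs(w[k]))
        eliminated.append(remaining.pop(weakest))
    return [names[j] for j in remaining + eliminated[::-1]]

# Synthetic data: column 0 separates the classes, column 1 agrees with the
# label in most samples, column 2 is balanced noise.
X = [
    [-1, -1, 1], [-1, -1, -1], [-1, -1, 1], [-1, -1, -1], [-1, 1, 1], [-1, 1, -1],
    [1, 1, 1], [1, 1, -1], [1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],
]
y = [0] * 6 + [1] * 6
print(rfe_ranking(X, y, ["gene_strong", "gene_weak", "gene_noise"]))
```

Refitting after every removal is what separates RFE from a one-shot univariate ranking: the weights of the remaining features are re-estimated once a competing feature is gone.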

The candidate biomarkers were validated with an ensemble voting classifier that combined predictions from three base learners (Random Forest, SVM, and Gradient Boosting) to improve classification stability. To improve generalization and limit overfitting, we used bootstrap aggregating (bagging) with out-of-bag error estimation. The pipeline produced biomarker combinations with high predictive performance across the validation datasets.
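To make hard voting and out-of-bag (OOB) estimation concrete, here is a small Python sketch. It uses a toy one-feature threshold learner in place of Random Forest, SVM, and Gradient Boosting, so it shows the mechanics only and is not the pipeline described above. With scikit-learn, `VotingClassifier` and `BaggingClassifier(oob_score=True)` cover the same ground. All names and data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean

def majority_vote(votes):
    """Hard voting: the most common label wins (ties go to the first seen)."""
    return Counter(votes).most_common(1)[0][0]

def fit_threshold(xs, ys):
    """Toy base learner: cut at the midpoint between the two class means."""
    m0 = mean([x for x, label in zip(xs, ys) if label == 0])
    m1 = mean([x for x, label in zip(xs, ys) if label == 1])
    cut = (m0 + m1) / 2
    return lambda x: int(x > cut) if m1 > m0 else int(x < cut)

def oob_error(xs, ys, n_rounds=25, seed=0):
    """Bagging with OOB estimation: each sample is scored only by learners
    whose bootstrap resample did not contain it."""
    rng = random.Random(seed)
    n = len(xs)
    oob_votes = [[] for _ in range(n)]
    for _ in range(n_rounds):
        in_bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        if {ys[i] for i in in_bag} != {0, 1}:
            continue  # skip resamples that miss a class
        model = fit_threshold([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):
            oob_votes[i].append(model(xs[i]))
    scored = [(majority_vote(v), ys[i]) for i, v in enumerate(oob_votes) if v]
    if not scored:
        return None  # no sample was ever out of bag
    return sum(pred != truth for pred, truth in scored) / len(scored)

# Hypothetical single-feature data with well-separated classes.
xs = [0.10, 0.20, 0.15, 0.25, 0.30, 0.70, 0.80, 0.75, 0.90, 0.85]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(majority_vote([1, 0, 1]))
print(oob_error(xs, ys))
```

The OOB estimate needs no separate hold-out set, because each bootstrap resample leaves out roughly a third of the samples on average.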

Top Gene Combinations with Highest Performance

The table below shows the gene combinations that achieved the highest classification performance:

Gene Set | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Accuracy | Recall | Specificity | Precision | F1 Score | AUC
Set | KRTAP24-1 | UNC5C | SHISA2 | ZNF793 | 0.992683 | 0.985366 | 1 | 1 | 0.992593 | 0.991077
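For reference, the threshold-based metrics in the table follow from the counts of a binary confusion matrix, as the Python sketch below shows. The counts are hypothetical and are not meant to reproduce the reported values, which may have been averaged over resamples. AUC is left out because it needs per-sample scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics, treating tumor as the positive class."""
    recall = tp / (tp + fn)        # sensitivity: tumors correctly detected
    specificity = tn / (tn + fp)   # normals correctly identified
    precision = tp / (tp + fp)     # predicted tumors that are real
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts: 50 tumor and 50 normal samples, two tumors missed.
for name, value in classification_metrics(tp=48, fp=0, tn=50, fn=2).items():
    print(f"{name}: {value:.6f}")
```

A specificity and precision of 1 both mean that no normal sample was called tumor, so any remaining error comes from missed tumors, which lowers recall.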