RFE on Prostate Cancer Dataset

Datasets


  • TCGA-PRAD (The Cancer Genome Atlas Program)
  • GSE269244 (Gene Expression Omnibus, GEO)
Volcano Plot
TCGA-PRAD Volcano Plot

Description

Using the ChAMP package, a volcano plot was generated to visualize the differentially methylated regions (DMRs) in the TCGA-PRAD dataset. The plot shows each DMR's statistical significance (p-value, conventionally drawn as -log10 of the p-value) against its magnitude of change (fold change).
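As a minimal sketch of how a point on such a plot is placed, the snippet below converts a change value and a p-value into plot coordinates and a label. It is not the ChAMP implementation: the function name `volcano_point`, the `delta_cutoff` parameter, and the example values are hypothetical, and only the p < 0.05 threshold comes from this page.

```python
import math


def volcano_point(delta_beta, p_value, p_cutoff=0.05, delta_cutoff=0.0):
    """Return (x, y, label) for one region on a volcano plot.

    delta_cutoff is a hypothetical extra filter on effect size; with the
    default of 0.0 only the p-value threshold decides significance.
    """
    x = delta_beta              # magnitude of change on the horizontal axis
    y = -math.log10(p_value)    # significance on the vertical axis
    if p_value < p_cutoff and abs(delta_beta) > delta_cutoff:
        label = "hyper" if delta_beta > 0 else "hypo"
    else:
        label = "not significant"
    return x, y, label


# Hypothetical (change, p-value) pairs, for illustration only.
examples = [(0.30, 0.001), (-0.25, 0.01), (0.02, 0.40)]
points = [volcano_point(d, p) for d, p in examples]
for x, y, label in points:
    print(f"x={x:+.2f}  y={y:.2f}  {label}")
```

Points far from zero on the horizontal axis and high on the vertical axis are the ones that pass both the effect-size and the significance criteria.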


Dbeta Calculation and DMR Filtering

Extensive preprocessing of GSE269244 and TCGA-PRAD methylation data was performed using the ChAMP package in R. This involved several critical steps:

  • Quality Control: Removed low-quality probes, SNP-related probes, and cross-reactive probes
  • Normalization: Applied BMIQ normalization to adjust for probe-type bias (Infinium I vs. II)
  • Batch Effect Correction: Applied the ComBat algorithm to remove potential batch effects
  • Differential Methylation Analysis: Calculated the difference in beta values (Dbeta) between tumor and normal tissues
  • Statistical Filtering: Applied a significance threshold of p < 0.05
  • Annotation: Mapped DMRs to genes based on genomic location and proximity to transcription start sites (TSS)

The resulting differentially methylated regions (DMRs) represent epigenetic alterations that may drive prostate cancer development and progression. Positive Dbeta values indicate hypermethylation in cancer tissues, while negative values indicate hypomethylation.
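The sign convention above can be illustrated with a small sketch. It assumes Dbeta is the mean tumor beta minus the mean normal beta for each probe, and it takes the p-values as given rather than computing them. The probe names, beta values, and p-values are made up for illustration, and the helper names are hypothetical.

```python
from statistics import mean


def delta_beta(tumor_betas, normal_betas):
    """Mean tumor beta minus mean normal beta for one probe (assumed definition)."""
    return mean(tumor_betas) - mean(normal_betas)


def count_hyper_hypo(probes, p_values, p_cutoff=0.05):
    """Count probes that pass the p-value filter, split by the sign of Dbeta."""
    counts = {"hyper": 0, "hypo": 0}
    for probe_id, (tumor, normal) in probes.items():
        if p_values[probe_id] >= p_cutoff:
            continue  # fails the significance threshold
        d = delta_beta(tumor, normal)
        if d > 0:
            counts["hyper"] += 1   # more methylated in tumor
        elif d < 0:
            counts["hypo"] += 1    # less methylated in tumor
    return counts


# Hypothetical beta values (tumor samples, normal samples) per probe.
probes = {
    "probe_a": ([0.80, 0.75, 0.85], [0.30, 0.35, 0.25]),
    "probe_b": ([0.20, 0.15, 0.25], [0.60, 0.65, 0.55]),
    "probe_c": ([0.50, 0.52, 0.48], [0.49, 0.51, 0.50]),
}
p_values = {"probe_a": 0.001, "probe_b": 0.004, "probe_c": 0.70}

counts = count_hyper_hypo(probes, p_values)
print(counts)
```

Counts of this kind are what the hyper/hypo distribution charts summarize for each dataset.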

Hyper/Hypo Methylation Distribution - TCGA-PRAD

  • Hyper: 9085
  • Hypo: 9567

Feature Distribution - TCGA-PRAD

  • 1stExon: 1119
  • 3'UTR: 1375
  • 5'UTR: 1953
  • Body: 7343
  • TSS1500: 5058
  • TSS200: 1804

Hyper/Hypo Methylation Distribution - GSE269244

  • Hyper: 10949
  • Hypo: 7045

Feature Distribution - GSE269244

  • 1stExon: 1029
  • 3'UTR: 1392
  • 5'UTR: 1958
  • Body: 7244
  • TSS1500: 4715
  • TSS200: 1656

Principal Component Analysis (PCA)

PCA of TCGA-PRAD

PCA of GSE269244

Joined Data Analysis

After filtering the DMRs from both datasets, we joined the two datasets to create a combined dataset for further analysis. The joined dataset contains 156 genes. Similarity between genes was calculated from their GO (Gene Ontology) terms, which were obtained from the Ensembl database, using the GOSemSim package in R.
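The page does not say how the two datasets were joined. One plausible reading, sketched below purely as an assumption, is to keep the genes annotated to DMRs in both datasets. The gene names and Dbeta values are placeholders, and `join_on_genes` is a hypothetical helper rather than part of ChAMP or GOSemSim.

```python
def join_on_genes(dmr_a, dmr_b):
    """Keep genes present in both gene-to-Dbeta tables and pair their values."""
    shared = sorted(set(dmr_a) & set(dmr_b))
    return {gene: (dmr_a[gene], dmr_b[gene]) for gene in shared}


# Placeholder gene-level Dbeta values for two datasets (not the study data).
dataset_a = {"GENE_A": 0.31, "GENE_B": -0.22, "GENE_C": 0.18}
dataset_b = {"GENE_B": -0.27, "GENE_C": 0.15, "GENE_D": 0.40}

joined = join_on_genes(dataset_a, dataset_b)
print(joined)
```

A stricter variant could also require the sign of Dbeta to agree in both datasets; whether that was done here is not stated.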

Hierarchical clustering using simple sum
Simple Sum Clustering
Hierarchical clustering using weighted sum
Weighted Sum Clustering
Hierarchical clustering using consensus method
Consensus Clustering
Comparison of different clustering methods
Hierarchical Clustering Comparison

Machine Learning

After evaluating the clustering methods, we chose consensus clustering for the joined dataset because of its robustness in handling gene expression patterns. For biomarker identification, we applied Recursive Feature Elimination (RFE) with cross-validation to select the most predictive gene signatures while minimizing redundancy. This process ranked genes by their power to discriminate between cancer and normal tissue samples.
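RFE with cross-validation can be sketched with scikit-learn's `RFECV`. The example below is an illustration under assumptions, not the pipeline used here: the data are synthetic, and the logistic regression estimator, the 5-fold split, the accuracy scoring, and the minimum of three features are all choices made for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a samples x genes matrix (not the study data).
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=4, n_redundant=2, random_state=0
)

selector = RFECV(
    estimator=LogisticRegression(max_iter=1000),  # assumed base model
    step=1,                                       # drop one feature per round
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=3,                     # assumed lower bound
)
selector.fit(X, y)

# ranking_ is 1 for every selected feature; larger ranks were eliminated earlier.
selected = [i for i, keep in enumerate(selector.support_) if keep]
print("features kept:", selector.n_features_)
print("selected column indices:", selected)
```

Each round refits the estimator, removes the feature with the smallest coefficient magnitude, and the cross-validated score decides how many features to keep.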

The candidate biomarkers were validated with an ensemble voting classifier that combined predictions from three base learners (Random Forest, SVM, and Gradient Boosting) to improve classification stability. To improve generalization and limit overfitting, we used bootstrap aggregating (bagging) with out-of-bag error estimation. The pipeline produced biomarker combinations with high predictive performance across multiple validation datasets.
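A rough sketch of this kind of ensemble in scikit-learn follows. It assumes soft voting and wraps the voting classifier in a `BaggingClassifier` to obtain an out-of-bag estimate. The data are synthetic, with three features standing in for a three-gene set, and the hyperparameters are illustrative rather than those used here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.svm import SVC

# Synthetic stand-in: 200 samples, 3 features (one per gene in a candidate set).
X, y = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Soft voting averages predicted class probabilities (an assumption;
# the page does not say whether hard or soft voting was used).
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

# Each bagged copy is trained on a bootstrap sample; samples left out of a
# bootstrap are used to score that copy, giving the out-of-bag estimate.
bagged = BaggingClassifier(voter, n_estimators=15, oob_score=True, random_state=0)
bagged.fit(X, y)
print("out-of-bag accuracy:", round(bagged.oob_score_, 3))
```

The out-of-bag score gives a held-out style estimate without setting aside a separate validation split, which is useful when sample counts are small.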

Top Gene Combinations with Highest Performance

The table below shows the gene combinations that achieved the highest classification performance:

Gene Set | Gene 1 | Gene 2  | Gene 3 | Accuracy | Recall   | Specificity | Precision | F1 Score | AUC
Set      | FBXO30 | KLHDC8A | PYCARD | 0.898529 | 0.973529 | 0.823529    | 0.846478  | 0.905467 | 0.92154
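For reference, the threshold-based metrics in the table can all be derived from a confusion matrix, as in the sketch below. The counts are hypothetical and do not reproduce the reported row. AUC is not included because it depends on ranked scores rather than on a single confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)                  # sensitivity: cancer samples found
    specificity = tn / (tn + fp)             # normal samples correctly rejected
    precision = tp / (tp + fp)               # predicted cancer that is cancer
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts with 50 cancer and 50 normal samples.
metrics = classification_metrics(tp=45, fp=10, tn=40, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")

# With equal class sizes, accuracy is the mean of recall and specificity.
balanced = (metrics["recall"] + metrics["specificity"]) / 2
assert abs(metrics["accuracy"] - balanced) < 1e-12
```

The reported row has the same property, since (0.973529 + 0.823529) / 2 = 0.898529, which would be consistent with evaluation on equal numbers of cancer and normal samples.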