Schematic diagram of this study design
In this study, to identify and investigate a risk model which can help improve the prognosis in colon cancer, we collected data on human patients with colorectal adenocarcinoma (COAD) (Fig. 1A), which is categorized as transcriptome, clinical part and hallmark gene datasets (Additional file 1: Table S1). Correlation modules are identified by weighted correlation network analysis (WGCNA). Differentially expressed genes (DEGs) (adjusted p < 0.05, | log2(fold change) |> 1) were identified among genes of survival-correlated modules. Univariant Cox analysis was performed to predict the prognosis-related module-derived DEGs. The risk model was established by applying stepwise regression to multivariant Cox model and 23 genes (from 174 candidate prognosis-related module-derived DEGs) were selected with coefficient defined for each target gene (Fig. 1B). A risk score was then calculated by summing up the multiplication of gene expression level and its coefficient. Risk assessment and prognosis analysis were performed to evaluate the risk score in predicting survival in COAD patients as an independent parameter. In addition, a comprehensive decision tree of nomogram was constructed based on the risk score and other clinicopathological parameters to improve risk stratification and assessment for individual patients (Fig. 1C). Then, further validation of the risk model was performed in another large independent cohort of 562 samples of colon cancer. Evaluation on the molecular classification, treatment response, pathway enrichment and function representation on the risk model were also applied to provide a deep insight into these risk model genes (Fig. 1D). Risk model genes showed a robust and putative functional role in indicating immune relevance. So, we tried to identify the correlation between immune activity and risk score by deconvoluting the tumor immune microenvironment from the transcriptome (Fig. 1E), and showed that the immune cells and immune activity are indeed involved in tumor patients in the risk model.
Dataset preparation and data processing
The results here are in part based from data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
A cohort of 437 samples with clinical annotations and follow-up information were included in our study. Transcriptome profiling data (HTSeq-FPKM files downloaded from The Cancer Genome Atlas TCGA, https://portal.gdc.cancer.gov) were used as the main data set for risk model establishment and evaluation. All gene expression quantification files from the cohort were downloaded as txt format and further merged together in one file for downstream analysis.
The validation dataset of GSE39582 containing 566 samples from 562 patients was downloaded from the research work of Laetitia Marisa et al. [10].
Hallmark gene sets were downloaded from Yin He et al. [11] (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6310928/), which addressed immune functions in different detailed aspects.
CMS subtyping data and Kras/Braf mutation information of the TCGA-COAD dataset are from the research work of Justin Guinney et al. [12].
Manual curation data of treatment and response information in the TCGA-COAD dataset is from the work of Enrico Moiso [13].
The dataset containing both proteome and transcriptome data is from the project CPTAC-2 prospective and is downloaded from cBioPortal database [14].
All the datasets are summarized in Additional file 1: Table S1 and can be downloaded directly from the indicated websites. Datasets or custom scripts that are used in this research can be obtained upon request.
Data visualization and statistical analysis
R software (version 3.5.1, http://www.r-pro-ject.org) was used to analyze data and plot graphs. Boxplot and point plot were generated with R package “ggplot2”, heatmap scaled by row was generated by R package “pheatmap” with a clustering distance of “euclidean”. Chord diagram was generated with R package ‘circlize’. DEG statistical analysis is performed by Wilcox test with R function “wilcox.test”. Welch’s t test (unpaired) or one-way analysis of variance was used to analyze differences between groups in variables with a normal distribution.
Gene co-expression network construction
Co-expression networks were constructed by using WGCNA (v1.69) package [15, 16] in R. First of all, TCGA-COAD samples with clinical traits were selected after removing repetitions from the same patient (Additional file 1: Fig. S1A), then unqualified samples for WGCNA analysis are filtered by the function of “goodSamplesGenes”, ending up with 288 COAD samples. Uncommon genes and those with counts less than 10 in more than 90% samples are removed from the WGCNA analysis. After sample clustering, five outliers are detected and removed in the downstream module analysis with 283 remaining samples. A soft threshold (power) of 7 is chosen for network construction by function of “blockwiseModules” according to the scale-free topology criterion which referred to the smallest value for an approximate scale free topology as WGCNA used the topological overlap measure (TOM) to represent proximity (Additional file 1: Fig. S1B). Based on the topological overlap matrix measured from a pairwise correlation-based adjacency matrix, the neighborhood similarity among genes were estimated and the gene co-expression modules, which are distinguished with different colors, were then identified by average linkage hierarchical clustering. Using the Dynamic Hybrid Tree Cut algorithm and a minimum module size of 30 genes, a total of eighteen modules were identified. The correlation analysis of different modules is based on the module eigengenes (MEs) which represent the first principal component of the expression profiles in a given module (Additional file 1: Fig. S1C). Linear regression between MEs and clinical traits is applied to identify the module-trait associations (Fig. 2A). Modules are linked to traits by function of “bicorAndPvalue”. Data visualization is performed by the built-in functions in WGCNA package. The visualization of the network in module tan and turquoise is performed by the software Cytoscape (v3.7.1) based on the connection information of topological overlap matrix from WGCNA analysis with an edge weight threshold of 0.01, ending up with all the nodes (genes) in module tan (n = 78) and turquoise (n = 638) shown in the network.
DEG identification
Significant differentially expressed genes (adjusted p < 0.05, |log2(Fold Change)|> 1) are identified by comparing tumor samples (n = 398) with normal samples (n = 39) or risk-high group (n = 167) with risk-low group (n = 167) in the TCGA cohort. p-value is calculated by unpaired Wilcoxon test, p-value is adjusted by FDR method. Log2 fold change is calculated by mean expression (FPKM) of tumor versus normal group.
Prognosis analysis
By using the function ‘coxph’ in R package ‘survival’, a Cox proportional-hazards regression model was used to evaluate the significance of each parameter to overall survival, both survival time and state are considered as response parameters in this analysis.
Risk model construction
Univariant Cox analysis was first performed to predict the prognosis-related DEGs from module MEturquoise and MEtan. Risk model was then established by applying stepwise regression (‘step’ function in R) to multivariant Cox model (‘coxph’ function in R) on 23 candidate prognosis-related genes in order to get an optimal simple model without compromising the model accuracy. The strategy used for stepwise regression is “sequential replacement”, which is a combination of forward and backward selections. It starts with no predictors, then sequentially add the most contributive predictors. After adding each new variable of candidate genes, remove any variables that no longer provide an improvement in the model fit. In this way, 23 risk genes were obtained in the risk model. By applying this method, a coefficient was predicted on each gene based on survival time and state of patients (Additional file 1: Fig. S2D). A risk value was then calculated by summing up the multiplication of gene expression level and its coefficient. Subsequently, we performed risk analysis to evaluate the risk model. Low- and high-risk group are stratified by the median risk score in the TCGA-COAD cohort (n = 334).
Survival analysis
The Kaplan–Meier method was used to draw survival curves and the log-rank test was used to evaluate differences by using the function ‘coxph’ in R package ‘survival’. A Cox proportional-hazards regression model was used to evaluate the significance of each parameter to overall survival. Time-dependent receiver operating characteristic (tROC) analysis was performed to measure the predictive power by the R package ‘survivalROC’ [17] with the parameter ‘method = “KM”’, and the areas under the curve at different time points [AUC(t)] of all the variables were compared.
Cox analysis on risk score and other clinical parameters
Univariant or Multivariant Cox proportional-hazards (Cox-PH) regression model was applied on the risk score (categorizing the patients in low-risk or high-risk) and other clinical parameters (age, gender, stage. T, M, N) using R package ‘survival’.
Nomogram analysis
A nomogram and a calibration curve were calculated and plotted using R function (cph, Survival, calibrate and nomogram) in the R package ‘rms’ [18] with the parameters ‘lp = F, maxscale = 100, fun.at = c(0.99,0.9,0.8,0.6,0.4,0.2,0.1)’ for ‘nomogram’. Nomogram is a pictorial representation of a complex mathematical formula. In the nomogram of this study, all the clinical variables, such as age, gender, stage, TNM phage and risk score were used to represent a statistical prognostic model that predicts a probability of cancer death. Basically, nomo score coming from the nomogram analysis, is a comprehensive parameter for predicting cancer risk by integrating all the clinical variables that were put in. Nomo scores were extracted by the function ‘formula_lp’ and ‘points_cal’ in R package ‘nomogramFormula’.
Gene ontology analysis
Gene ontology analysis is performed with differentially expressed genes in web application DAVID (https://david.ncifcrf.gov) [19, 20]. All gene ontologies enriched significantly (p < 0.05, fisher’s exact test) are shown in the dot plot.
Immune cells deconvolution
CIBERSORTx (http://cibersortx.stanford.edu/) was implemented to deconvolute the composition of specific immune cells in the tumor microenvironment (TME) from the transcriptome data of TCGA cohort. Specifically, FPKM matrix of COAD patients is formatted as input file and uploaded to the web application of CIBERSORTx, then cell fractions are imputed in the module of “Impute Cell Fractions” with custom mode and default parameters (permutations for significance analysis is set to 100). The signature matrix file built in CIBERSORTx (LM22: 22 immune cell types) is used as the signature reference in this analysis.
Immune activity deconvolution
Single-sample gene set enrichment (ssGSEA) analysis (R package ‘gsva’ with parameters “method = 'ssgsea', kcdf = 'Gaussian', abs.ranking = TRUE”) was utilized to identify clusters with different immune activity referring the immune-activity hallmarks from Yin He et al. [11] in four different immune aspects: immune cells, immune and cytolytic activity, antigen presentation pathway, and cytokine response. The enrichment scores of each hallmarks gene set were summed to generate the SIES (Sum of immune enrichment scores) for each sample.
‘Estimation of STromal and Immune cells in MAlignant Tumours using Expression data’ (ESTIMATE) is a method that uses gene expression signatures to infer the fraction of stromal and immune cells in tumour samples. Stromal, Immune and Tumor scores were obtained by using R package of ESTIMATE [21]. The stromal score captures the presence of stroma in tumor tissue, immune score represents the infiltration of immune cells in tumor tissue. ESTIMATE score of each patient = immune score of each patient + the corresponding stromal score. As the higher the ESTIMATE score is, the lower the content of tumor cells is in the given tumor microenvironment. The tumor score was calculated by the following equation: Tumor score = maximal value of ESTIMATE score in TCGA-COAD cohort—ESTIMATE score of each colon cancer patient.
Correlation analysis between SIES and risk score
As outliers affect the correlation analysis a lot, that is especially serious in the risk model where extreme values exist, those outliers statistically away from normal distribution are filtered from TCGA-COAD cohort before correlation analysis. Maximum and minimum are tested and filtered in a stepwise way by the function of “grubbs.test” until all the remaining risk scores fits a normal distribution. We end up with 313 samples from 334 samples in TCGA-COAD for downstream correlation in which unpaired two tailed t test are applied.
The comparison between our risk model with other prognosis models from different researches in the colon cancer
The performance comparison is performed between our risk model and other models from the researches of Huang et al. [22], Chen et al. [23], Sun et al. [24], and Liu et al. [25]. The comparison is based on the metrics of − log10(p-value) in the discovery and validation cohort respectively. p-value is calculated by the log rank test in the survival analysis. For those p values recorded already in the indicated cohorts in the corresponding researches, they were used directly without any change. For those p values not recorded in the corresponding researches, we calculated the p value in the indicated cohorts by two groups stratified according to the median value of gene expression or risk score as defined in the corresponding researches. For those models which indicated the gene expression in more than one gene, we used the gene which had the best prognosis in the discovery cohort in the comparison.