Molecular subtyping of nasopharyngeal carcinoma (NPC) and a microRNA-based prognostic model for distant metastasis

Background Nasopharyngeal carcinoma (NPC) is a highly invasive and metastatic cancer, with diverse molecular characteristics and clinical outcomes. This study aims to dissect the molecular heterogeneity of NPC, followed by the construction of a microRNA (miRNA)-based prognostic model for prediction of distant metastasis. Methods We retrieved two NPC datasets: GSE32960 and GSE70970 as training and validation cohorts, respectively. Consensus clustering was employed for cluster discovery, and support vector machine was used to build a classifier. Finally, Cox regression analysis was applied to constructing a prognostic model for predicting risk of distant metastasis. Results Three NPC subtypes (immunogenic, classical and mesenchymal) were identified that are molecularly distinct and clinically relevant, of which mesenchymal subtype (~ 36%) is associated with poor prognosis, characterized by suppressing tumor suppressor miRNAs and the activation of epithelial-mesenchymal transition. Out of the 25 most differentially expressed miRNAs in mesenchymal subtype, miR-142, miR-26a, miR-141 and let-7i have significant prognostic power (P < 0.05). Conclusions We proposed for the first time that NPC can be stratified into three subtypes. Using a panel of 4 miRNAs, we established a prognostic model that can robustly stratify NPC patients into high- and low- risk groups of distant metastasis. Electronic supplementary material The online version of this article (10.1186/s12929-018-0417-5) contains supplementary material, which is available to authorized users.


Background
Nasopharyngeal carcinoma (NPC) is one of the five major types of head and neck malignant tumor which develops in the epithelial lining of the nasopharynx [1]. NPC differs significantly from other head and neck cancers in its occurrence, causes and treatment strategies. According to the American Cancer Society, NPC is characterized by its unique geographical and racial distribution with the incidence rate of 20 to 30 cases per 100,000 each year in Southeast Asia as compared with less than 1 case per 100,000 in the United States. Several key etiological factors including genetic [2], Epstein-Barr virus (EBV) infection [3] and dietary [4] implicated as the major causes of NPC. Although NPC is highly sensitive to radiotherapy and chemotherapy, local recurrence and distant metastasis are very common, it is estimated that 15% to 60% of patients will develop local recurrence [5,6], and 30% to 40% of patients will develop distant metastasis within 4 years after primary treatment [7,8]. Thus identifying patients at high-risk of local and / or distant metastasis would be crucial for personalized treatment of NPC.
Like other malignancies, NPC is not a single disease which is mainly caused by the intra-tumoral heterogeneity, thus the genetic complexity indeed pose a significant challenge to the targeted therapies for NPC. Owing to the heterogeneous character of NPC, it is necessary to classify NPC patients into different groups which corresponding well with their molecular features as well as clinical outcomes. To this end, not only it can help us understand more about the underlying mechanisms of the tumorigenesis of NPC, but also help us to develop subtype specific therapies for NPC patients. Traditional histopathologic classification of cancer has been carried out by pathologists relies on the histologic appearance and morphological features of the tumors, which only partially reflect the heterogeneity character of cancers. However, in reality, tumors with similar morphological appearance may vary in response to therapy and have distinct clinical outcomes [9]. Recent advancements in genome wide molecular profiling have allowed researchers to classify cancers into homogeneous groups with improved diagnosis and prognosis than traditional classification of cancers [10,11].
MicroRNAs (miRNAs) are a class of highly conserved noncoding, short regulatory RNAs (19-25 nucleotides) cleaved from 70 to 100 nucleotides hairpin pre-miRNA precursors and are negative regulators of gene expression [12]. MiRNAs are involved in diverse biological functions, including development, differentiation, proliferation, apoptosis and cancers [13]. MiRNA expression signatures are informative, which have been shown to be potential new biomarkers for cancer diagnosis, prognosis and therapy prediction [14][15][16]. Various miRNA-based classifiers have been built to classify breast cancer [17], lung cancer [18], hepatocellular carcinoma [19], colorectal cancer [20], kidney cancer [21] and myeloma [22] into homogeneous groups based on the specific miRNA expression patterns in cancers.
The recently widely used of miRNA arrays has enabled the large scale profiling of miRNAs in NPC [23]. Here, we analyzed two independent datasets (GSE32960 and GSE70970) which consist of a total number of 558 NPC patients with miRNA expression profiles. We employed an unsupervised classification approach to stratify these patients into three molecular and clinical distinct subgroups (immunogenic, classical and mesenchymal). Of which mesenchymal subtype (~36%) is characterized by suppressing tumor suppressor miRNAs and activation of epithelial-mesenchymal transition (EMT). Compared with the other two subtypes, patients classified into mesenchymal subtype have a higher risk of metastasis and poorer distant metastasis-free survival (DMFS). While immunogenic subtype accounts for a small portion of the total NPC patients (~19%), they were found to have enrichment in RNA binding and immune related gene sets as well as have good clinical outcomes. Finally, classical subtype was found to be enriched in cell cycle related gene sets, and have an intermediate survival compared with other two subtypes. We also classified the 12 commonly used NPC cell lines into the three subtypes, with six classical, five mesenchymal and one immunogenic subtype, which provides a good in-vitro platform for further subtype-specific studies.
Furthermore, out of the 25 most differentially expressed miRNAs in mesenchymal subtype, miR-142, miR-26a, miR-141 and let-7i have significant prognostic power (P < 0.05), as determined by univariate Cox regression analysis. We then built a Cox regression model by using the selected 4 miRNAs. This model can be used to separate the NPC patients into high-risk and low-risk groups of distant metastasis. Thus, our study not only provided a new classification system for NPC, but also identified a panel of biomarkers which may have a great potential to be applied in the clinic for predicting the risk of DMFS.

Data curation and pre-processing
We first searched the Gene Expression Omnibus (GEO) database (www.ncbi.nlm.nih.gov/geo) for all available expression data related to NPC. We came across two relevant datasets, one is in the accession number of GSE32960 [24] which contains 312 non-distantmetastatic paraffin-embedded NPC and 18 paraffinembedded non-cancer nasopharyngitis biopsy samples. All these samples were collected between Jan 16, 2003, andFeb 25, 2006 from the Sun Yat-sen University Cancer Center (Guangzhou, China), and the clinical staging was classified according to the criteria of the American Joint Committee on Cancer Staging Manual (Seventh Edition). Patients median follow-up was 62.1 months (IQR 47.7-71.5) [24]. Another dataset is in the accession number of GSE70970 [25], which in total contains 246 NPC patients from the Princess Margaret Cancer Center (Toronto, Canada). Those 246 tumor samples were collected at two different time periods. We downloaded normalized miRNA expression and clinical data from GEO database and used the ComBat [26] to remove the batch effects in this dataset.
In total, we collected altogether 558 NPC patients for this subtyping study (Table 1). At which, the miRNA expression profiling of 86 stage II patients from GSE32960 was our training (discovery) dataset to build a classification model. This is because that the expression data of early stages patients are less noisy than the late stages (stage III and stage IV) [27]. Expression noise, or unavoidable stochastic fluctuations [28] were increased along tumorigenesis [27]. More specifically, Han et al. [27] studied the changes of expression noise in different human cancers and found that more than 53.7% genes had increased noise in patients with late stage than early stage cancers. This study showed that a noticeable loss of expression control as cancer development and progression. In order to avoid impacts from ambient noise, we had better use early stage patients' data to build the   Log-rank test model. Besides, since stage I patients are associated with good clinical outcomes than other stages, and only account for a very small portion (< 4%) of total patients in GSE32960, thus they were also excluded from the training dataset. The remaining 226 NPC patients from GSE32960 and 246 NPC patients from GSE70970 were used as two independent validation datasets. Only miRNA features common to both datasets were remained for the following analysis.

Identification of NPC subtypes
We first selected 300 most variable miRNAs by calculating the median absolute deviation (MAD) of each miRNA across 86 patients from the training dataset, the variable miRNAs were retained and row-normalized expression for the following analysis. Next, we performed consensus clustering [29] consisted of 1000 iterations of hierarchical clustering, with 0.9 subsampling ratio, and agglomerative average linkage and Pearson correlation to cluster these 86 patients. We used the gap statistic [30], which is a measure of within-cluster dispersion to assess the optimal number of clusters. Silhouette width was computed to identify the most representative samples within each cluster. Finally, we retained samples with positive silhouette width (n = 77) to build a classifier for NPC.

Cell culture
Human NPC cell line C17 was obtained through the generosity of Dr. Pierre Busson (Institut Gustave Roussy, France) and cultured in RPMI 1640 medium supplemented with 7.5% fetal bovine serum (FBS), 25 mM HEPES and 7 μM ROCK inhibitor Y-27632. C666, CNE2, HNE1, HK1, HONE1, NP69, NP460 were kindly provided by Prof. George S.W. Tsao (The University of Hong Kong, Hong Kong). C666, CNE2, HK1 and HONE1 were cultured in RPMI 1640 medium supplemented with 10% FBS. HNE1 was cultured in cultured in DMEM medium supplemented with 5% FBS and 5% newborn calf serum. NP69 was cultured in Keratinocyte-SFM medium supplemented with 0.05 mg/ml bovine pituitary extract and 5 ng/ml epidermal growth factor. NP460 was cultured in Defined Keratinocyte-SFM medium and EpiLife Medium in 1:1 ratio. HK1-LMP1, HK1-LMP1 Cis R, HONE1-EBV and HONE1-EBV Cis R were gifts from Prof. Brigette B.Y. Ma (The Chinese University of Hong Kong, Hong Kong). These cell lines were cultured as previously described [31]. All cells were maintained at 37°C and 5% CO 2 humidified atmosphere. All culture reagents were obtained from Thermo Fisher Scientific. List of the NPC cell lines involved in our study can be found at Table 2.

Generation of the NPC classifier and classification
To build the NPC classifier, we also did a feature (miRNA) selection process which involved two filtering steps to select the most representative and predictive miRNAs. First, we used the Significance Analysis of Microarrays (SAM) algorithm (R package siggenes version 1.42.0) to identify miRNAs significantly differentially to assess each miRNA's ability to separate two clusters. The retained 10 miRNA with AUC > 0.9 were trained by Support Vector Machine (SVM) to build a classifier. The expression profiles of the two validation datasets, and NPC cell line data were mean or median centered across all samples and then subjected to classification using the classifier built based on the training dataset.
miRNA target prediction and gene set enrichment analysis (GSEA) Differentially expressed miRNAs between each subtype were identified by using the R package limma [32], with absolute log2 fold change greater than 1 and Benjamini-Hochberg-adjusted p-value less than 0.05. We then obtained experimentally validated target genes of each differential miRNA based on the miRWalk 2.0 database (http://zmf.umm.uni-heidelberg.de/apps/zmf/mirwalk2/) [33]. GSEA is a widely used method to interpret expression data at the level of gene sets, that is, groups of genes that share common biological function, or regulation [34]. In this study, GSEA with annotated gene sets from KEGG, Reactome and Gene Ontology (GO) was done with Enrichr tool (http://amp.pharm.mssm.edu/ Enrichr/) [35].

Survival analysis
In the two datasets, DMFS were calculated from treatment to the date to the first distant relapse, and diseasefree survival (DFS) to the first relapse at any site or death from any cause, whichever occurred first, and overall survival (OS) to death from any cause [24,25]. Survival analysis was performed using the Kaplan-Meier method, and the differences in time to an event (death or recurrence) between curves were assessed by using the log-rank tests. Adjusted P values were obtained by Benjamini and Hochberg's method of less than 0.05 were considered to be statistically significant.

Cox regression model
In order to identify a miRNA signature associated with risk of distant metastasis (DM), we did a differential miRNA expression analysis between mesenchymal subtype and non-mesenchymal subtypes. In total, we identified 25 differentially expressed miRNAs (Table 3) (limma package [36] in R) with a cutoff of absolute log2 fold change greater than 1 and adjusted P value less than 0.05. Among the 25 miRNAs, miR-142, miR-26a, miR-141 and let-7i have significant prognostic power (P < 0.05) (

Results
Unsupervised clustering identifies three subtypes in NPC Unsupervised clustering was applied to the 86 stage II NPC patients from the GSE32960 dataset, which revealed 2 to 4 well-defined clusters (Fig. 1a). GAP statistics were calculated to determine the optimal number of clusters, and a peak was found at k = 3 (Fig. 1a). Silhouette width analysis was subsequently performed to select the most coherent samples within each cluster.

Classification in the validation and cell line datasets
In order to investigate whether these three subtypes exist in other datasets, we first performed classifications in the two validation datasets. One is an internal dataset contains the remaining 226 NPC patients from GSE32960 and another is an external dataset contains 246 NPC patients from GSE70970. Patients in these two datasets can be classified into three subtypes with a similar proportion of patients being distributed among subtypes (Table 1), which may suggest a general inter-tumor heterogeneity pattern exist in the NPC patients. Furthermore, we can also classify the 12 commonly used NPC cell lines into the three subtypes using our classification system, with six classical, five mesenchymal and one immunogenic subtype ( Table 2). The cell line classification results may provide a good in vitro platform for studying NPC biology and finally developing subtype-specific therapies for the patients.

Functional annotation of NPC subtypes
There are distinct miRNA expression patterns between subtypes as observed in the heatmaps (Fig. 1b-d).
Among the 10 miRNAs, miR-1248, miR-29b, miR-26a, let-7f and let-7d were specifically down-regulated in NPC2 (mesenchymal subtype); while miR-622 and miR-1293 were specifically down-regulated in NPC1 (classical subtype); and the majority of miRNAs (80%) were upregulated in NPC3 (immunogenic subtype) compared to the other two subtypes (Fig. 1b-d). We also compared the 10 miRNAs expression patterns between non-cancer (n = 18) and cancer groups in the training (n = 86) and validation (n = 226) datasets. Results show that there exist clearly negative correlations between non-cancer and mesenchymal groups. Specifically, miR-29b, let-7f, let-7d and miR-26a were strikingly more highlyexpressed; and miR − 622 was lowly-expressed in the non-cancer group (Additional file 1: Figure S1). The distinct miRNAs expression patterns between patient (especially in the mesenchymal subtype) and normal groups also suggested that these 10 miRNAs are cancer-specific dysregulated miRNAs, and can be investigated further to develop personalized therapies for NPC patients.
To identify the association of biological pathways with subtypes, we subsequently performed GSEA for the enriched target genes in each subtype. In total, we obtained 55 target genes in NPC1, 1, 241 target genes in NPC2, 35 target genes in NPC3 and 251 target genes in non-cancer group (Additional file 2: Table S1) by searching the miRWalk 2.0 database [33]. Gene sets significantly enriched for each subtype were displayed in Additional file 3: Table S2, and in total there were 451 gene sets that significantly enriched (adjusted P < 0.05) in at least one NPC subtype (Additional file 3: Table S2). We then used the k-means clustering method with k = 4 to cluster these 451 gene sets, and a P value heatmap was built to show the gene sets enriched in each subgroup (Fig. 1e). EMT and metastasis related gene sets were most highly enriched in NPC2, thus we named this group of patients as mesenchymal subtype. Cell cycle related gene sets were specifically enriched in NPC1, which reflect a typical characteristic of the rapidly proliferating tumor cells, therefore we name this subtype as classical. Various RNA binding and immune related gene sets were most enriched in NPC3. Although RNA binding related gene sets are specifically enriched in NPC3 (Fig. 1e), the biological functions of these gene sets are still not fully understood, so we named NPC3 as immunogenic (Fig. 1e).

Clinical characterization of NPC subtypes
Survival analysis by Kaplan-Meier for each subtype indicated that mesenchymal subtype had the worst clinical outcomes (significantly poorer DMFS) compared with the classical and immunogenic subtypes (Fig. 2c, f, i). There was no significant differences among the three subtypes for other clinical endpoints such as OS and DFS (Fig. 2 a-b, d-e and g-h), which suggested that these subtypes only have DMFS differences both in training and validation datasets. The average age of the external validation dataset is slightly older than the training dataset, and the ratio of male patients in each subtype vary from 59% to 81%. The detailed clinical information of these subtypes were summarized in the Table 1. We also investigated the association among the subtypes with other clinical factors, such as age, sex and tumor stage, which revealed no significant differences (Table 1). This analysis demonstrated that other clinical factors cannot predict DMFS, and supports the use of subtypes as a reliable prognostic factor in NPC.

Cox proportional hazards model can separate the NPC patients into high-risk and low-risk groups of distant metastasis
The Cox proportional hazards model is one of the most popular used method to analyze survival data [37]. Out of the 25 most differentially expressed miRNAs in mesenchymal subtype, miR-142, miR-26a, miR-141 and let-7i have significant prognostic power (P < 0.05), as determined by univariate Cox regression analysis (Table 3). For identification of high-risk distant metastasis, we built a multivariate Cox regression model using the selected 4 miRNAs. We calculated the risk scores based on the model for each patients in the training dataset (n = 312), a cutoff was determined by the median risk score (0.027), and patients were classified into high-risk (> 0.027) and low-risk (< 0.027) groups. Survival analysis a b c d e Fig. 1 Unsupervised classification identified three molecular distinct subtypes of nasopharyngeal carcinoma. a Unsupervised classification of the training dataset shows the optimal cluster number is three. A classifier was constructed (using 10 unique miRNAs) to categorize patients in each of the subtypes; (b-d) The training dataset (86 patients), GSE32960 set (226 patients) and GSE70970 set (246 patients) were classified into three subtypes according to the classifier, respectively. In the heatmaps, columns correspond to patients, and rows to 10 miRNAs (miR-1248, miR-29b, let-7f, let-7d, miR-26a, miR-200b, miR-370, miR-2053, miR-1293 and miR-622). Expression values are represented by different colors, red means higher expression values, and green for lower expression values. Note: IM is short for immunogenic; (e) A p-value heatmap to represent NPC subtype enriched pathways (normal group was used as control), values in the heatmap equal to -log10 (p-value) were subsequently performed to investigate if there were survival differences between these two groups, compared with patients with low-risk scores, patients with high risk scores in the training dataset had shorter DMFS (hazard ratio [HR] 3.1, 95% CI 1.8-5.4; P = 1.2e-05), and validation dataset DMFS (2.2, 1.1-4.5; P = 0.022) ( Fig. 3a-b). We also investigated if other miRNA signatures, such as Liu's 5-miRNA signature (miR-93, miR-26a, miR-142, miR-29c and miR-30e) [24], Bruce's 4-miRNA signature (miR-154, miR-449b, miR-140 and miR-34c) [25] and randomly generated 4-miRNA signature (miR-653, miR-766, miR-1302 and miR-505) were significantly associated with DMFS. Although Liu's 5-miRNA signature has significant prognostic power in the training dataset (P = 1.7e-07), it performed worse with its p-value of 0.37 in the validation dataset (Fig. 3c-d). Bruce's and randomly generated 4-miRNA signatures all received poor performances both in training and validation datasets ( Fig. 3e-f, and g-h).

Discussion
Like other cancer types, not all NPC patients will present identical clinical outcomes after treatment, some will result in relatively good treatment outcomes, whereas some are not. The major reason for such phenomenon is caused by the intra-tumoral heterogeneity. How to classify and select the right treatment strategies for NPC patients become crucial tasks. Recent genome wide molecular profiling provide an opportunity to investigate the genetic changes during the development and progression of cancers, and have been widely used in the cancer classification studies [38,39]. More and more genome wide molecular profiling studies have been carried out in NPC, and there are some gene  [17][18][19][20][21][22]. We found two big miRNA expression datasets containing a total number of 550 NPC patients [24,25], and thus employed them in our subtyping study. According to the literature search, Liu et al. [24] (GSE32960) dataset contains the largest NPC miRNA profiling data (312 tumor and 18 normal control) so far, while Bruce et al. [25] (GSE70970) dataset contains 246 NPC patients. More specifically, we used 86 stage II patients from GSE32960 as our training dataset, the remaining patients from GSE32960 and GSE70970 were used as two validation datasets. We identified three subtypes in NPC: classical, mesenchymal and immunogenic subtypes. We found that patients classified into mesenchymal subtype tend to have the worst clinical outcomes, thus we put our emphasis on this subtype.
Mesenchymal subtype-specific miRNAs, such as let-7 family, miR-29b, miR-29a, and miR-26a, are the major contributor to the poor prognosis of the mesenchymal subtype. The let-7 family of miRNAs contain several members: let-7 (−a, −b, −c, −d, −e, −f, −g, and -i), they are highly conserved across animal species [40], and are widely considered as tumor suppressor miRNAs. Let-7 miRNAs are frequently downregulated in various types of cancers, including in NPC [41][42][43]. Wong et al. [42] found that let-7 expression were downregulated in NPC cells compared with normal nasopharyngeal cells, and let-7 can inhibit cell proliferation through renal cell down-regulation of c-Myc expression. Li et al. [43] investigated miRNA expression at different stages of NPC tissue samples and found that different members from let-7 family were dysregulated from early stage to the late stage. In our study, we found that let-7 regulate much more EMT and migration related genes than other miRNAs, indicating it plays a critical role in the poor prognosis characteristic of mesenchymal subtype. Other mesenchymal subtype specific miRNAs include miR-29b, miR-29a, and miR-26a. The miR-29 family consists of three members: miR-29a, miR-29b, and miR-29c, differing only in few bases in the 3′ end nucleotides, among them miR-29b is the most highly expressed member [44]. The miR-29 family functioning as a tumor suppressor in many types of cancers, which can regulate apoptosis, cell proliferation and differentiation. MiR-29b can regulate the expression of tumor suppressor p53, and is recognized as an important regulator of EMT [45]. Reduced expression of miR-29c has been reported in several NPC studies [24,46]. In our study, we found that miR-29a and miR-29b were the mesenchymal subtype specific miRNAs, and they were significantly downregulated in mesenchymal subtype. Interestingly, we also found that miR-29a was in our Cox model associated with DMFS. MiR-26a has two precursors: miR-26a-1 and miR-26a-2, which located in chromosomes 3 and 12, respectively. MiR-26 is down-regulated in other cancers as well as in NPC [24]. Ma et al. [47] found that miR-26a can inhibit the EMT by down regulation of EZH2 expression, Liang et al. [48] found that miR-26a can regulate the biogenesis of let-7d, and Slaby et al. [49] had proved that miR-26a was associated with tumor relapse in renal cell carcinoma, which all corresponds well with the expression pattern of miR-26a in our study. We also identified miR-1248, which has not been reported to be associated with EMT in NPC.
In the era of precision oncology, molecular subtyping of NPC is important. Not only it can stratify patients into different subgroups, but also may help in triaging treatment strategies for the patients in different subgroups. In our study, we identified some targets for mesenchymal subtype, which might have implication with clinical values. In the meantime, we found that mesenchymal subtype patients have enriched for EMT and /or migration related miRNAs and pathways, thus may account for the worst clinical outcomes of the mesenchymal subtype. Compared to mesenchymal subtype, classical subtype have better clinical outcome, and the majority of patients (~42%) are classified into classical subtype. Finally, we identified four prognostic miRNAs (miR-142, miR-26a, miR-141 and let-7i) and build a Cox regression model. The model can be used to separate the NPC patients into high-and low-risk groups of distant metastasis. Among the four miRNAs, miR-142 and miR-26a have been reported by Liu et al. [24] as prognostic factors for DFS in NPC, which indicate that these two miRNAs can be used to predict the risk of both DFS and DMFS. As one member of the miR-200 family, miR-141 was reported to be dysregulated in many cancers, participating in various cellular processes including EMT, cell proliferation and migration [50]. MiR-141 expression has been proved to be negatively correlated with survival in NPC [51]. In summary, the 4-miRNA Cox model is strongly associated with NPC tumorigenesis, and has been demonstrated to be prognostic signature of DMFS in our study.
To our best knowledge, this is the first study to classify the NPC patients into three molecular and clinical distinct subtypes based on miRNA expression profiles. We also classified the 12 commonly used NPC cell lines into the three subtypes, which can provide in vitro platforms to study subtypes of NPC. The present findings warrant, larger patients datasets validation before applied into the clinic.

Conclusions
We proposed for the first time that NPC can be stratified into three subtypes. Using a panel of 4 miRNAs, we established a prognostic model that can robustly stratify NPC patients into high-and low-risk groups of distant metastasis.