Molecular profile and copy number analysis of sporadic colorectal cancer in Taiwan

Background Colorectal cancer (CRC) is a major health concern worldwide, and recently becomes the most common cancer in Asia. The case collection of this study is one of the largest sets of CRC in Asia, and serves as representative data for investigating genomic differences between ethnic populations. We took comprehensive and high-resolution approaches to compare the clinicopathologic and genomic profiles of microsatellite instability (MSI) vs. microsatellite stability (MSS) in Taiwanese sporadic CRCs. Methods 1,173 CRC tumors were collected from the Taiwan population, and sequencing-based microsatellite typing assay was used to determine MSI and MSS. Genome-wide SNP array was used to detect CN alterations in 16 MSI-H and 13 MSS CRCs and CN variations in 424 general controls. Gene expression array was used to evaluate the effects of CN alterations, and quantitative PCR methods were used to replicate the findings in independent clinical samples. Results These 1,173 CRC tumors can be classified into 75 high-frequency MSI (MSI-H) (6.4%), 96 low-frequency MSI (8.2%) and 1,002 MSS (85.4%). Of the 75 MSI-H tumors, 22 had a BRAF mutation and 51 showed MLH1 promoter hypermethylation. There were distinctive differences in the extent of CN alterations between CRC MSS and MSI-H subtypes (300 Mb vs. 42 Mb per genome, p-value < 0.001). Also, chr7, 8q, 13 and 20 gains, and 8p and 18 losses were frequently found in MSS but not in MSI-H. Nearly a quarter of CN alterations were smaller than 100 kb, which might have been missed in previous studies due to low-resolution technology. 514 expressed genes showed CN differences between subtypes, and 271 of them (52%) were differentially expressed. Conclusions Sporadic CRCs with MSI-H displayed distinguishable clinicopathologic features, which differ from those of MSS. Genomic profiling of the two types of sporadic CRCs revealed significant differences in the extent and distribution of CN alterations in the cancer genome. More than half of expressed genes showing CN differences can directly contribute to their expressional diversities, and the biological functions of the genes associated with CN changes in sporadic CRCs warrant further investigation to establish their possible clinical implications.


Background
Colorectal cancer (CRC) is one of the major leading causes of cancer deaths around the world, and is the most common cancer in Taiwan [1]. Two different genetic pathways have been described for tumorigenesis of CRC. The most frequent pathway is the chromosomal instability pathway characterized by alterations in tumor suppressor genes and oncogenes, including APC, TP53 and K-ras [2,3]. On the other hand, 10-15% of all cases of CRC show microsatellite instability (MSI), which are resulted from a germline mutation in the mismatch repair (MMR) system or somatic hypermethylation of the promoter region of the MLH1 gene [4]. Tumors with MMR deficiency exhibited frequent errors in microsatellite DNA, short segments of DNA containing tandem repeats of mono-, di-or trinucleotides [5]. The high-frequency MSI (MSI-H) CRCs have unique clinicopathologic features, such as right-sided, mucinous or poorly differentiated, and stable chromosomal status in the tumors [6].
About 80% of MSI tumors have a near-diploid karyotype and a distinct genetic alteration distinguishable from those of microsatellite stable (MSS) cancers [7][8][9][10]. Despite the advancement of our understanding of cancer genetics of CRC, genomic alterations of various subtypes of CRC have not been fully characterized. The copy number variations (CNVs) can contribute to variable levels of gene expressions [11], and thus fine-scale copy number (CN) profiling of cancer can enhance our knowledge about tumorigenesis. Among all somatic mutations, nongermline CNVs found in the cancer genomes, also known as copy number alterations/aberrations (CNAs), are frequently observed, e.g., gains of oncogenes and losses of tumor suppresser genes [12]. Furthermore, the DNA CN states of CRC cases are related to the response of drug treatments, e.g., the CNA degree of CRC is associated with response to systemic combination chemotherapy with capecitabine and irinotecan [13]. Previous cytogenetic studies have shown MSS tumors are characterized with more chromosomal and copy number aberrations than MSI tumors [14,15], and most of MSI tumors have a near-diploid karyotype and appear to follow a genetic pathway distinct from MSS tumors [9]. These studies showed that gain of chromosome 7, 8q, 13 and 20q and loss of chromosome 4q, 8p, 17p and 18q were frequent in CRC MSS tumors [16]. Both profiles of genome-wide CNA and gene expression have been used to classify MSS and MSI subtypes of CRC samples [17]. However, previous genome-wide CNA studies of CRC were limited by the resolution of comparative genomic hybridization (CGH) array technology (probe distance > 30 kb), thereby subtle CN changes harboring cancer-causing variants might be missed [13,17,18]. As genomic technology advances, high-density single-nucleotide polymorphism (SNP) array can be used to genotype a huge number of SNPs and detect CN changes on the genomic scale. In the current study, we have applied Affymetrix SNP 6.0 array (Affymetrix, CA, USA), with its median inter-probe distance of less than 700 bp, to detect CNAs in CRC cancer genome of clinical samples. As compared to other reports on the CRC cancer genome using the CGH arrays, we have achieved a much improved resolution. Molecular karyotype profiling of the two subtypes of sporadic CRCs revealed significant differences in the extent and distribution of CN alterations in the cancer genome. Combining the data of genome-wide CNAs and Illumina Human Ref-8 gene expression array (Illumina, CA, USA), CNAs might significantly contribute to the expressional levels of genes, more than half of which were differently expressed between CRC MSI-H and MSS.

Clinical patients and tumor tissues
A total of 1,543 colorectal cancer patients who underwent surgeries in Taipei Veterans General Hospital from January 2000 to December 2007 were included. The study was approved by the Institutional Review Board of the Taipei Veterans General Hospital, and written informed consent for tissue collection was obtained from all patients. Patient with preoperative chemoradiotherapy, or emergent operative procedure, or death within 30 postoperative days, or evidence of familial adenomatous polyposis were excluded from this study. Clinical information was recorded prospectively and stored in a database. This included: (i) age, sex, personal and family history, and (ii) tumor size, location, gross appearance, TNM stage, differentiation and pathological prognostic features. Tumors were meticulously dissected, with samples collected from the 4 tumor quadrants to explore intratumoral heterogeneity. The corresponding normal mucosa, at least 10 cm away from the primary tumor edge, was collected. Tissue fragments were immediately frozen in liquid nitrogen and stored at -70°C. Sections of cancerous and collateral tissues were reviewed and analyzed by a senior gastrointestinal pathologist blinded to patient outcomes. Disease stage was determined with the TNM classification of the International Union Against Cancer [19]. The pathological factors analyzed included lymphovascular invasion, invasive tumor pattern, grade of differentiation, mucin production and intratumoral lymphocyte infiltration. These pathological features were defined by the College of American Pathologists consensus statement [20].

Microsatellite Instability Analysis
High-molecular-weight genomic DNA from each tumor and from corresponding normal tissue was purified using the QIAamp Tissue kit (QIAGEN GmbH, Germany). Yield and purity were determined by electrophoresis on 0.8% agarose gel and spectrophotometric absorbance at 260 nm. According to international criteria for determination of MSI, 5 five reference microsatellite markers were used: D5S345, D2S123, BAT25, BAT26, and D17S250. Primer sequences were obtained from Gen-Bank (http://www.gdb.org). Detection of MSI was performed as previously described [20,21]. Briefly, DNA was amplified using fluorescent polymerase chain reaction (PCR). PCR products were denatured and analyzed by electrophoresis on 5% denaturing polyacrylamide gels, and results were analyzed using GeneScan Analysis software (Applied Biosystems, CA, USA). Tumor samples that exhibited allele peaks different from the corresponding normal sample(s) were classified as MSI for that particular marker. Samples with ≥ 2 MSI of 5 markers were defined as MSI-H, those with only one MSI of 5 markers were defined as low-frequency MSI (MSI-L) and others without evidence of MSI were classified as MSS. Analyses were performed twice if results were ambiguous.

BRAF mutation and MLH1 methylation analysis
To detect BRAF mutation, DNA from tumor tissue was amplified and sequenced with primers described in previous studies [22]. Briefly, the extracted DNA was selectively amplified by PCR in a DNA thermocycler. A negative control containing no DNA template was included for each PCR amplification round. The PCR products were analyzed by an automated sequencer (ABI Prism 3100 Genetic Analyzer; Applied Biosystems). Each sample was sequenced on both sense and antisense strands. Each mutation was confirmed by a second sequencing procedure on new PCR products. Methylation of the MLH1 promoter was determined using a methylation-specific PCR method. DNA was treated with sodium bisulfite, which converts unmethylated cytosine to uracil, yet leaves methylated cytosine unchanged, and subjected to amplification with methylated-and unmethylated-specific primers, respectively [23].
Flow Cytometry for DNA Ploidy 703 of 1,173 tumors were available to examine the status of DNA ploidy using flow cytometry by following the method of Dressler et al. [24]. The DNA index (DI), representing the ratio of the mean fluorescence intensity of the G 0 G 1 peak of the tumor cell population to that of the normal diploid population, was used to quantitate DNA ploidy. Specimens were considered diploid (DI = 1) if they had a single G 0 G 1 peak and aneuploid (DI ≠ 1) if they exhibited two or more discrete peaks, including abnormal G 0 G 1 peaks (each peak equivalent to the fluorescence of at least 20% of the total sample nuclei) and a corresponding G 2 M peak. Samples with coefficients of variation > 8% were excluded from further analysis [21]. Tumors with both diploid and aneuploid subpopulations were classified as having DNA aneuploidy. The mean coefficients of variation were 6.4% and 2.4% in tumor tissues and normal colon mucosa, respectively.

High-density SNP array and data analysis
A total of 500 ng of genomic DNA of 16 MSI-H and 13 MSS CRC samples was subjected to SNP genotyping using genome-wide Affymetrix Human SNP 6.0 array according to the manufacturer's instructions. Genotyping was performed by the National Genotyping Center at Academia Sinica, Taipei, Taiwan (http://ngc.sinica.edu. tw). This array contains 1.8 millions markers widely distributing in human genome. After standard Affymetrix quantile normalization, the intensity data was analyzed using Genotyping Console (GTC) software v.3.0.1 (Affymetrix) with default parameters of hidden-Markov model (HMM) to identify CN-changed regions [25]. PennCNV [26] and Partek Genome Suite (Partek Inc., MO, USA) software were additionally used to reconfirm CN alterations identified by GTC software. CNA predicted by PennCNV and Partek software with default HMM parameters are 91.6% and 89.8% concordant with those of GTC software. In consideration of CN-changed regions with at least 20 consecutive probes, we found that all these CNA identified are 100% overlapped with those defined by either PennCNV or Partek software, implying these CNAs were highly reliable for the following analysis.

Quantitative genomic PCR
CN changes of selected genes, including epidermal growth factor receptor (EGFR), deleted colon cancer (DCC) and calcium-dependent membrane-binding protein 1 (CPNE1), were verified by using quantitative genomic PCR experiments. Primer Express Software version 3.0 (Applied Biosystems) was applied to design PCR primers for the selected target genes. Quantitative genomic PCR were performed using the ABI StepOne Plus system (Applied Biosystems). PCR reactions were prepared using the Power SYBR-Green PCR reagent kit (Applied Biosystems), and 2.5 ng genomic DNA was used in each reaction. qPCR conditions were as follows: initial denaturation at 94°C for 3 minutes, followed by 40 cycles of denaturation at 94°C for 15 seconds, and combined annealing and extension at 60°C for 60 seconds. The fluorescence signal was detected in real time during the qPCR procedure. The primer pair for the long interspersed nuclear elements 1 sequence was used for normalization. The mean estimated CN was calculated from triplicate PCR reactions for each individual.

Whole-genome gene expression analysis
RNA samples of 16 MSI-H and 13 MSS tumors (identical cases as used in SNP array analysis) were prepared using Qiagen's RNAeasy kit (Qiagen), and then were assayed using the Agilent Systems Bioanalyzer (Agilent Technologies, CA, USA) to ensure that high-quality RNA was used for the gene expression array experiments. The Illumina TotalPrep RNA amplification kit (Ambion, TX, USA) was used to amplify and generate biotinylated RNA. Illumina Human Ref-8 V3 arrays were processed and scanned at medium PMT settings as recommended by the manufacturer, and were analyzed using GenomeStudio software (Illumina). After subtracting background, array data was normalized using the quantile method, and detection p-value < 0.01 was used to ensure that only expressed genes were used in the following analyses.

Statistical analysis
All results in the text and tables are given as means ± standard deviation. In clinical analyses, categorical variables were analyzed using a chi-square test with Yates' correction, and comparisons of quantitative variables between groups were performed based on Student's t-test. In genomic data analysis, CNA frequency comparisons between CRC MSS and MSI-H subtypes were carried out by using Fisher's exact test, and t-test was applied in comparing expressional levels of each transcript between CRC subtypes. SAS/STAT (SAS Institute, NC, USA) program was used to carry out all statistical analyses.

Results
A total of 1,543 CRCs were recruited in Taiwan population from 2000 to 2007 as shown in Figure 1. To focus on sporadic CRC cases for the clinicopathologic and genomic analyses, 370 (24.0%) meeting the Revised Bethesda criteria [27], defined patients having CRC familiar history, were excluded, and the remaining 1,173 patients were sporadic CRC cases. There were 785 (66.9%) males and 388 (33.1%) females in these sporadic CRC patients. Tumors were found in right-side colon in 294 patients (25.1%), left-side colon in 478 patients (40.8%), and in the rectum in 401 patients (34.2%). There were 159 patients (13.6%) with stage I cancers, 395 patients (33.7%) with stage II cancers, 407 patients (34.7%) with stage III cancers and 212 patients (18.1%) with stage IV cancers. Based on microsatellite instability analysis, among the 1,173 tumors analyzed, 75 (6.4%) were MSI-H, 96 (8.2%) were MSI-L, and 1,002 (85.4%) were MSS. Interestingly, 48 out of the 75 MSI-H tumors (64%) were located in the right colon; 67% had stage I or II disease; 60% were female and 24% were poorly or mucinous differentiated (Table 1). In contrast to the clinopathologic features of MSI-H tumors, MSS/MSI-L showed left sided predominant, less mucinous or poorly differentiation and more advanced disease.
Methylation of the MLH1 gene promoter and BRAF gene mutations were analyzed for all MSI-H tumors. Of the 75 MSI-H tumors, 22 (29.3%) had a BRAF mutation and 51 (68%) showed hypermethylation of the MLH1 gene promoter. Immunohistochemical (IHC) stains for MLH1, MSH2, MSH6 and PMS2 proteins were carried out for 70 cases with MSI-H tumors whose samples were available ( Figure 2). As shown in Figure 1, 47 (Figure 1).
Of the 703 tumors, including 51 MSI-H and 652 MSI-L/ MSS, available for the status of DNA ploidy, 231 showed DNA diploid (32.9%). We found that 70.2% of MSI-L/MSS tumors showed DNA aneuploidy, but only 27.5% of MSI-H tumors showed DNA aneuploidy. To molecularly characterize chromosomal aberrations at a high resolution (≤ 20 kb) and compare the genomic features between the MSI-H and MSS subtypes, Affymetrix SNP 6.0 array was applied to detect genome-wide CNAs in 16 MSI-H tumors with both MLH1 hypermethylation and BRAF mutation, and compared to the genomic profiles of 13 MSS CRC tumors. To identify reliable CN changes, we only included CN-changed regions covering more than 20 probes, and these CNAs were also called by PennCNV and Partek CNV calling software (algorithm-independent). As a control, the CNV profile of Taiwanese population was based on 434 general controls from Han Chinese Cell and Genome Bank that were genotyped using Affymetrix SNP 6 array [28]. This data provides useful information, at the population scale, the common variation of genomic structure in the Taiwanese study subjects. A total of 399 CNV regions were identified in this population (Dr. Y.-T. Chen, unpublished data), the average size of the CNV regions was 350 kb (covering a total of 4.66% of the human genome), and 372 (93.23%) were reported in the database of genomic variants (http://projects.tcag.ca/variation/). As shown in Figure 3, the whole-genome CNV patterns of the two CRC subtypes were grossly different. DNA CN gain in chr7, 8q, 13 and 20 and loss in chr4q, 8p and 18 were frequently found in MSS but not in MSI-H tumors. Consisting with previous studies, the chromosomal structures of CRCs with microsatellite instability were similar to those of normal controls [9] (Figure 3). There were distinctive differences in the number of CNAs between CRC MSS and MSI-H subtypes (Figure 4a The majority of CNAs (> 80%) found in CRC cases was smaller than 500 kb, and nearly a quarter of CN alterations were smaller than 100 kb, which might have been missed in the previous studies due to low-resolution technologies (Additional File 1). Therefore, CNA frequencies of some DNA segments in this study were higher than those from previous studies (14). 13,279 protein-coding genes and 557 microRNA were affected by CN changes in these CRC samples, of which 1,434 genes (10.8%) and 35 microRNAs (6.3%) were related to CNVs observed in the general Taiwanese population. To identify genes harboring the CRC subtype-common and/ or specific CN changes, the gene-based CNA frequency of MSS and MSI-H subtypes were compared as shown in Figure 5a. 1,515 of 13,279 genes (11.4%) were found to have CN frequency difference between MSS and MSI-H tumors using Fisher's exact tests (p-value < 0.05, Additional File 2), and CNA frequencies of these genes in MSS tumors were all higher than those in MSI-H tumors. The CN gain of EGFR gene, a well-known cancer gene and drug target, was commonly found in CRC MSS tumors (8 out of 13 samples, 62%) according to genome-wide CNA analysis. To replicate the findings from the SNP array analysis, we applied qPCR approach to evaluate the EGFR CN states of independent 48 CRC MSS and 48 MSI-H samples (Additional File 3). The CN gain frequency of the independent CRC MSS group was 64.6% (31 of 48) and consisted to that (62%) of the array-based CN analysis, and was higher than overall 14% of CRC MSI-H subtype (n = 64). Furthermore, although CN losses of DCC gene were commonly found in CRCs in previous studies [29], we observed that this DCC deletions were frequently found in MSS CRCs (46%) but not in MSI-H (0%). Twelve cancer-associated genes were found to show different CN frequencies between CRC subtypes as shown in Table 2 (Fisher's exact test, p-value < 0.01), but the biological functions of many identified genes with high CNA frequencies were not fully characterized.  Although there were numerous genes affected by CN gains and/or losses in CRC cancer genome, especially in MSS cases, some might not directly contribute to the levels of gene expressions. The patterns of differentiallyexpressed genes between CRC subtypes (two sample ttest with p-value < 0.05) are similar to those of CNA analysis at genome-wide scale (Figure 5b). Only 514 of    coefficient, r 2 = 0.7). CPNE1 gene regulates tumour necrosis factor-alpha receptor signaling pathway and is over-expressed in liver cancer [31,32], but is still poorly investigated in CRC tumorigenesis.

Discussion
This is a large-scale sporadic CRC study in an Asian population, and our results showed that the clinicopathologic features of MSI-H tumors were right-sided predominant, poorly or mucinous diffenentiated, less advanced disease and female predominant. Similar to previous studies with Lynch syndrome [6,22] We have applied high-density SNP array to detect copy number changes in the CRC cancer genome in the Taiwanese population, and compared the CNA frequencies between MSS and MSI-H subtypes. Previous CRC CN analyses primarily concerned with the Caucasian genetic backgrounds and these studies were hampered by the low-resolution of CGH array. Although different populations and technological resolutions were used in this study, the overall CNV pattern was globally similar to those from previous studies, indicating the mechanism of CRC tumorigenesis of different ethnic populations might be similar. Although EGFR CN gains were commonly found in MSS tumors (64%), some MSI-H tumors (14%) carried three or four gene copies. Previous studies have shown a small proportion of MSI-H tumors harbor multiple CNAs and chromosome abnormalities [17]. Consistently, we also observed some MSI-H tumors carried more than 1 Mb CNAs (Additional File 1), and 27.5% MSI-H tumors showed DNA aneuploidy. Studies showed that response predictors for CRC patients using cetuximab, EGFR monoclonal antibody, included K-ras/Braf mutation and EGFR gene CN, etc [33,34]. Further investigations are needed to clarify whether MSI tumors might be resistant to cetuximab for possible BRAF mutation or relatively low copy number of EGFR gene. Among (Table 2). Besides CNVs, other genomic variants, including SNPs and Indels, and epigenomic modifications all can regulate transcript levels, so an integrated analysis are needed to interpret the transcript diversities between CRC subtypes.
The identified CRC subtype-specific CN-altered genes should be seriously considered when investigating the mechanism of heterogeneous CRC tumorigenesis, and might be used as candidate markers in the drug therapy studies. The major discrepancy, and argument, between our results and other studies was that the proportion of MSI-H in our study was only 6.4%, lower than that of previous reports [35][36][37][38]. Selection bias and racial and/ or environmental factors might affect the MSI incidence in CRCs. Because rectal cancer is less likely to show MSI-H than colon cancer [39], a lower rate of MSI-H colorectal cancer will be reflected in population-based studies. In studies without selection [39][40][41] incidence of MSI would be similar to our results.

Additional material
Additional file 1: The size distribution of copy number variation in colorectal cancer. CNVs were called by using Affymetrix Genotyping Console program based on the intensity data of Affymetrix SNP 6.0 array, and 20-probe criterion was used to filter out false-positive predictions. The sizes of identified CN changes from MSS CRCs were majorly between 50 and 500 kb, and a quarter of these alterations were smaller than 100 kb.
Additional file 2: Genes showing copy number (CN) differences between MSS and MSI-H CRC cases. 1,515 genes were found to have CN frequency difference between MSS and MSI-H tumors using Fisher's exact tests (p-value < 0.05). Additional file 6: The combined analysis of copy number alterations (CNAs) and gene expressions. 1,515 genes showing different CNA frequencies between CRC subtypes, and 514 of them were expressed in these tumor tissues. 271 of 514 genes (52%) show differential expressions between CRC MSS and MSI-H subtypes (two sample t-test with p-value < 0.05).
Additional file 7: The positive correlation between copy number and expression in CPNE1 gene. The average CPNE1 expressional levels of MSS group was higher then that of MSI-H group (p-value = 0.008), and the gene CNs were highly correlated to expressional levels (liner regression correlation coefficient, r 2 = 0.7).