CoDP: predicting the impact of unclassified genetic variants in MSH6 by the combination of different properties of the protein
Journal of Biomedical Science volume 20, Article number: 25 (2013)
Lynch syndrome is a hereditary cancer predisposition syndrome caused by a mutation in one of the DNA mismatch repair (MMR) genes. About 24% of the mutations identified in Lynch syndrome are missense substitutions and the frequency of missense variants in MSH6 is the highest amongst these MMR genes. Because of this high frequency, genetic testing has not been used effectively for MSH6 so far. We, therefore, developed CoDP (Combination of the Different Properties), a bioinformatics tool to predict the impact of missense variants in MSH6.
We integrated the prediction results of three methods, namely MAPP, PolyPhen-2 and SIFT. Two other structural properties, namely solvent accessibility and the change in the number of heavy atoms of amino acids in the MSH6 protein, were further combined explicitly. MSH6 germline missense variants classified by their associated clinical and molecular data were used to fit the parameters for the logistic regression model and to assess the prediction. The performance of CoDP was compared with those of other conventional tools, namely MAPP, SIFT, PolyPhen-2 and PON-MMR.
A total of 294 germline missense variants were collected from the variant databases and literature. Of them, 34 variants were available for the parameter training and the prediction performance test. We integrated the prediction results of MAPP, PolyPhen-2 and SIFT, and two other structural properties, namely solvent accessibility and the change in the number of heavy atoms of amino acids in the MSH6 protein, were further combined explicitly. Variant data classified by their associated clinical and molecular data were used to fit the parameters for the logistic regression model and to assess the prediction. The values of the positive predictive value (PPV), the negative predictive value (NPV), sensitivity, specificity and accuracy of the tools were compared on the whole data set. PPV of CoDP was 93.3% (14/15), NPV was 94.7% (18/19), specificity was 94.7% (18/19), sensitivity was 93.3% (14/15) and accuracy was 94.1% (32/34). Area under the curve of CoDP was 0.954, that of MAPP for MSH6 was 0.919, of SIFT was 0.864 and of PolyPhen-2 HumVar was 0.819. The power to distinguish between pathogenic and non-pathogenic variants of these methods was tested by Wilcoxon rank sum test (p < 8.9 × 10⁻⁶ for CoDP, p < 3.3 × 10⁻⁵ for MAPP, p < 3.1 × 10⁻⁴ for SIFT and p < 1.2 × 10⁻³ for PolyPhen-2 HumVar), and CoDP was shown to outperform other conventional methods.
In this paper, we provide a human curated data set for MSH6 missense variants, and CoDP, the prediction tool, which achieved better accuracy for predicting the impact of missense variants in MSH6 than any other known tools. CoDP is available at http://cib.cf.ocha.ac.jp/CoDP/.
Lynch syndrome (MIM: #120435, #609310), also known as Hereditary Non-Polyposis Colorectal Cancer (HNPCC), is an autosomal dominant disease and the most common hereditary colorectal cancer syndrome. Lynch syndrome accounts for 1-5% of all colorectal cancer (CRC) patients [2–4] and is associated with germline mutations in one of the DNA mismatch repair (MMR) genes including MLH1, MSH2, MSH6 and PMS2 (MIM: #120436, #609309, #600678, #600259, respectively). MMR gene mutation carriers are at high risk of developing Lynch syndrome-associated cancers of the colorectum, endometrium, small bowel, stomach, ovary, ureter and hepatobiliary tract. Individuals at high risk can be identified by genetic testing, and appropriate surveillance programs can be provided to prevent cancer development.
Previous studies reported that more than 90% of the detectable mutations in Lynch syndrome were found in MLH1 and MSH2. Recent data, however, showed that MSH6 contributed to about 20% of the mutations [6, 7]. In addition, MSH6 shows the greatest frequency (~37 - 49%) of missense variants in the MMR genes, and most of them are currently “unclassified variants” (UVs) [6, 8].
MSH6 mutation carriers tend to develop CRC at an older age than MLH1 and MSH2 mutation carriers and tend to show reduced penetrance [9–12]. These tendencies suggest that the family cancer history associated with an MSH6 mutation is not necessarily dense enough to meet the Amsterdam criteria. Furthermore, colorectal tumors from MSH6 mutation carriers sometimes demonstrate low microsatellite instability (MSI-L) or microsatellite stability (MSS), or a normal staining pattern on immunohistochemistry (IHC) for MMR proteins. It is, therefore, important to analyze and integrate all the available data, including the data derived from in silico tools for the classification of UVs.
A number of methods to predict whether missense variants are pathogenic or non-pathogenic have been reported. For Lynch syndrome, SIFT, PolyPhen [15, 16] and multivariate analysis of protein polymorphisms (MAPP) have generally been used. Prediction by SIFT is based on sequence conservation, while that by PolyPhen is based on sequence conservation plus protein structural features [14–16]. These methods aim to predict the pathogenicity of variants in proteins in general, and hence they are not tuned for a specific protein. MAPP uses evolutionary variation and the scales of six physicochemical properties to evaluate the structural and functional impact of all possible variants. MAPP can be customized for a specific protein. It has been optimized for MLH1 and MSH2 (MAPP-MMR), where it outperformed SIFT and PolyPhen. This result indicates that an algorithm customized for a specific protein is superior to those applicable to proteins in general. However, the prediction accuracy of MAPP-MMR is not yet sufficient for genetic testing. Hence, improvement of the prediction method is required.
In the field of bioinformatics, especially in the development of prediction methods based on amino acid sequences, it has been pointed out that prediction accuracy can be improved by integrating many different prediction methods (e.g. ). Following this idea, the accuracy of pathogenicity prediction could be improved by integrating a number of existing methods that predict the biological effects of missense variants. In addition, none of the existing methods directly incorporates the information obtained from the MSH6 protein structure. The three-dimensional structure of the MSH6-MSH2 complex with ADP and DNA has already been solved. The structural data should contain a variety of information, some of which would be useful for the prediction. Easily obtained information related to the effect of a mutation on the structure includes the solvent accessibility of the amino acid residue and the change in residue volume. Mutations of amino acid residues at the surface of a protein are better tolerated than those in the interior, and a mutation inside the protein with a small change in residue volume is better tolerated than one with a large volume change.
We, therefore, optimized MAPP for MSH6 and then integrated SIFT, PolyPhen-2 and two properties from the protein structure, namely solvent accessibility and the volume change in amino acid residues. We combined these properties in a logistic regression model and compared the prediction performance with those of MAPP, SIFT, PolyPhen-2 and PON-MMR. The parameters were adjusted on data that we gathered from different databases and the literature and linked to one another for this study. The newly developed method achieved the best prediction accuracy, sensitivity and specificity, and can clearly distinguish pathogenic variants from non-pathogenic variants. We named the method CoDP, Combination of Different Properties on MSH6, and made it available at http://cib.cf.ocha.ac.jp/CoDP/.
The dataset of MSH6 missense variants
MSH6 missense variants and their associated clinical and molecular data were collected from the following databases: InSiGHT (http://www.insight-group.org/), MMRUV (http://www.mmrmissense.net/), UniProt (http://www.uniprot.org/), dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/), NHLBI Exome Sequencing Project (ESP) (http://evs.gs.washington.edu/EVS/), HapMap Project (http://hapmap.ncbi.nlm.nih.gov/) and 1000 Genomes (http://www.1000genomes.org/). A systematic literature search was conducted on PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) to compile unregistered MSH6 missense variants in the databases above. These data were used to assess the in silico pathogenicity prediction.
Clinical and molecular data on carriers with missense variants were also collected. The data included the age at the first diagnosis of CRC or endometrial cancer, any affected relatives with Lynch syndrome associated cancer, microsatellite instability (MSI), IHC, segregation study, allele frequency and biochemical functional assay. The biochemical functional assay included investigations of the following: MMR activity, MSH2 protein interaction, localization, ATP hydrolysis and mismatch recognition. We used the assay results from the literature as reported. These clinical and molecular data were used to assign each carrier to one of the following three categories: “likely to be Lynch syndrome (LLS)”, “unlikely to be Lynch syndrome (ULS)” and “unclassified.” An LLS carrier has a pathogenic variant, and a ULS carrier has a non-pathogenic variant. An “unclassified” carrier has a variant of unknown clinical significance, which is usually called an unclassified variant (UV). The assignment was carried out based on the criteria shown in Table 1. When a carrier fulfilled one or more of the criteria for LLS in Table 1, the carrier was classified as LLS, and when a carrier fulfilled one or more of the criteria for ULS, the carrier was classified as ULS. Where the specific criterion that a carrier fulfilled matters, a sub-numbering system is used, such as LLS-1 for a carrier fulfilling the first criterion of LLS.
Optimization of MAPP for MSH6
We optimized MAPP to predict the pathogenicity of MSH6 missense variants. MAPP requires an appropriate multiple sequence alignment of MSH6 orthologues for evaluating missense variants. MSH6 amino acid sequences were collected from GenBank (http://www.ncbi.nlm.nih.gov/genbank/) using BLAST with the default parameters and human MSH6 as the query sequence. Sequences were also obtained from the Ensembl genome database (http://www.ensemblgenomes.org/). Including both paralogous and orthologous sequences in the multiple sequence alignment used to train MAPP is known to worsen the prediction performance [14, 17]. We, therefore, selected orthologues of human MSH6 based on their domain organization and a phylogenetic tree. There was a wide range of variability in the domain structures of the MSH6 proteins, and MSH6 sequences with the same domain organization as human MSH6 are good candidates for orthologues. Vertebrate MSH6 proteins, the close homologues of human MSH6, generally have a PCNA-binding motif, a PWWP domain and a MutS domain (Figure 1). These vertebrate MSH6 sequences were aligned together with other MSH6 homologues using the T-Coffee alignment tool and a phylogenetic tree was built. This phylogenetic tree was compared with the species tree, and the proteins orthologous to human MSH6 were operationally defined as the sequences with the same domain organization that were located around human MSH6 in the tree consistently with the species tree. As a result, the vertebrate sequences were selected as an initial set and a multiple sequence alignment of them was built for MAPP prediction.
We then improved the prediction accuracy by increasing the size of the sequence set. An augmented data set was reported to improve the accuracy of the prediction . The addition of amino acid sequences to the data set was limited to the domain regions, because the inter-domain sequences were too diverse to align. Sequences of non-vertebrates were added to the initial sequence set and the prediction accuracy was tested using a receiver operating characteristic (ROC) curve and the area under the curve (AUC).
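The AUC used here can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen non-pathogenic one. The following is a minimal sketch of that pairwise definition, not the authors' implementation (the statistics in this study were computed with PASW Statistics). The function name `auc` is hypothetical, and the sketch assumes that a higher score means "more likely pathogenic"; for a tool such as SIFT, where lower scores indicate a deleterious change, the scores would have to be negated first.

```python
from itertools import product


def auc(pathogenic_scores, benign_scores):
    """Area under the ROC curve from its pairwise definition.

    Counts the fraction of (pathogenic, benign) pairs in which the
    pathogenic variant has the higher score; ties count as half a win.
    Assumes higher score = more likely pathogenic (an assumption).
    """
    if not pathogenic_scores or not benign_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p, b in product(pathogenic_scores, benign_scores):
        if p > b:
            wins += 1.0
        elif p == b:
            wins += 0.5
    return wins / (len(pathogenic_scores) * len(benign_scores))
```

With this definition, comparing alignments built from different sequence sets amounts to recomputing the scores for the classified variants and comparing the resulting AUC values.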
Structural properties to assess mutations in MSH6
Structural property for amino acid residue substitutions was obtained on the three-dimensional structure of MSH6-MSH2-DNA-ADP complex, registered as 2o8b  in Protein Data Bank . The registered structure is void of residues at 551, 652, 942, and 992, and of loops at 720–728, 1099–1104, 1123–1125, 1179–1187 and 1271–1283. These missing structures were complemented using MOE (Chemical Computing Group Inc. Montreal, Canada), molecular structure building software.
The two properties we focused on were the relative accessible surface area (accessibility) of each residue and the change in residue volume caused by the substitution. The accessible surface area was calculated using a modified method of Shrake and Rupley with a water radius of 1.4 Å. A threshold of 0.1 was used to separate residue locations into two categories: buried and surface. The relevance of accessibility to the prediction was tested based on the correlation between the accessibility and LLS/ULS. The change in volume was quantified as the difference in the number of heavy atoms in the side chains. The relevance of this value to the prediction was tested in the same way as for accessibility.
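The two structural properties can be illustrated with a short sketch. The heavy-atom counts below are the standard numbers of non-hydrogen side-chain atoms for the 20 amino acids. The function names, the use of a signed difference (mutant minus wild type), and the choice of a strict "less than 0.1" comparison for buried residues are assumptions made for illustration; the text does not specify them.

```python
import re

# Number of non-hydrogen (heavy) atoms in each side chain.
SIDE_CHAIN_HEAVY_ATOMS = {
    "G": 0, "A": 1, "S": 2, "C": 2, "V": 3, "T": 3, "P": 3,
    "I": 4, "L": 4, "D": 4, "N": 4, "M": 4, "E": 5, "Q": 5,
    "K": 5, "H": 6, "F": 7, "R": 7, "Y": 8, "W": 10,
}

BURIED_THRESHOLD = 0.1  # relative accessibility threshold given in the text


def parse_variant(variant):
    """Split a variant such as 'G566R' into (wild type, position, mutant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"not a missense variant label: {variant!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def heavy_atom_change(variant):
    """Signed change in side-chain heavy atoms (mutant minus wild type)."""
    wild_type, _, mutant = parse_variant(variant)
    return SIDE_CHAIN_HEAVY_ATOMS[mutant] - SIDE_CHAIN_HEAVY_ATOMS[wild_type]


def is_buried(relative_accessibility, threshold=BURIED_THRESHOLD):
    """True when the residue is classified as buried (assumed: strictly below)."""
    return relative_accessibility < threshold
```

For example, `heavy_atom_change("G566R")` gives 7, a large gain in side-chain size, whereas `heavy_atom_change("E1193K")` gives 0 because glutamate and lysine both have five side-chain heavy atoms.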
Combining different properties
We used the logistic regression model to integrate the properties. The logistic regression analysis gives the probability ($q$) of a categorical variable outcome based on one or more predictor variables ($X_i$). The logistic regression equation is given by: $\mathrm{logit}(q) = \ln[q/(1-q)] = Z + \sum_{i=1}^{n} b_i X_i$, where $Z$ is the constant and $b_1, b_2, \ldots, b_n$ are the partial correlation coefficients for $X_1, X_2, \ldots, X_n$. We defined the value $q$ as joint score in CoDP and this score was used for predicting the impact of UVs. The scores of MAPP for MSH6, SIFT, PolyPhen-2 and the appropriate structural properties discussed above were used as predictors $X_i$. Variant sets of LLS and ULS without the biochemical functional assay were used to optimize $b_i$. The applicability of the joint score for prediction was tested on the variants of LLS and ULS with the biochemical functional assay.
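Solving the logistic regression equation for q gives q = 1 / (1 + exp(−(Z + Σ b_i X_i))). The sketch below evaluates this expression for a list of predictor values. It is only an illustration of the formula: the function names are hypothetical, and any coefficients passed to it are placeholders, not the fitted CoDP parameters.

```python
import math


def joint_score(predictors, coefficients, intercept):
    """Probability q from logit(q) = intercept + sum(b_i * X_i).

    `predictors` are the X_i, `coefficients` the b_i and `intercept`
    the constant Z. All values supplied by the caller are placeholders;
    the fitted CoDP parameters are not reproduced here.
    """
    if len(predictors) != len(coefficients):
        raise ValueError("need exactly one coefficient per predictor")
    z = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    # Numerically stable evaluation of 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(q):
    """Inverse of the logistic function: ln(q / (1 - q))."""
    return math.log(q / (1.0 - q))
```

Because the logistic function is monotonic, a cut-off on q is equivalent to a cut-off on the linear combination Z + Σ b_i X_i; working with q simply keeps the score on a 0 to 1 scale.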
The capability of predicting the impact of UVs was tested using the variants of LLS and ULS. The prediction performance of the tools, CoDP, MAPP for MSH6, SIFT, PolyPhen-2 and PON-MMR, was compared. The comparison was carried out on prediction score distributions. The positive predictive value (PPV), the negative predictive value (NPV), sensitivity, specificity and accuracy were calculated as follows: PPV = TP / (TP + FP); NPV = TN / (FN+TN); Sensitivity = TP / (TP+FN); Specificity = TN / (FP+TN); Accuracy = (TP+TN) / (TP +TN+FP+FN), where TP is true positive, FP is false positive, TN is true negative and FN is false negative. To classify pathogenic variants, the threshold values 0.05 and 0.446 were used in SIFT  and PolyPhen-2 , respectively. The prediction performance was also compared using AUC. The box and whisker plot for each prediction was drawn to clarify the power to distinguish between LLS and ULS variants. Statistical analyses were carried out on PASW Statistics 18.0.0 software program (SPSS Inc., Chicago, IL, USA).
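The five measures defined above can be computed directly from the four confusion-matrix counts. The sketch below does so; the function name is hypothetical. As a check, the usage note applies it to the counts implied by the CoDP results reported in this paper (PPV 14/15, NPV 18/19), that is TP = 14, FP = 1, TN = 18 and FN = 1.

```python
def confusion_metrics(tp, fp, tn, fn):
    """PPV, NPV, sensitivity, specificity and accuracy from raw counts."""
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (fn + tn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Counts implied by the CoDP results on the whole data set.
codp = confusion_metrics(tp=14, fp=1, tn=18, fn=1)
codp_percent = {name: round(100 * value, 1) for name, value in codp.items()}
```

Here `codp_percent` reproduces the reported 93.3% PPV and sensitivity, 94.7% NPV and specificity, and 94.1% accuracy.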
Results and discussion
The dataset of MSH6 germline missense variants
A total of 294 germline missense variants were collected from the variant databases and literature (Additional file 1: Table S1). Pathogenicity of these variants was determined based on the molecular and clinical data, and the variants were classified into three categories, namely LLS, ULS and UV (Table 1). Out of these 294 variants data, fifteen were classified as LLS (Tables 2 and 3) and nineteen as ULS (Tables 4 and 5).
Out of fifteen LLS variants, five variants including G566R, R976H, G1139S, S1188N and E1193K showed abnormality in protein function assay (Table 2). These five variants also showed high level of MSI (MSI-H), and showed loss of MSH6 expression except for G566R variant [12, 30–38]. Hence, these five variants were LLS-1 and/or LLS-2. Out of the remaining ten LLS variants (=15-5), L449P, P591S, G670G, R772W, Y969C, G1069E and A1236P variants had MSI-H and loss of MSH6 expression like the ones in Table 2, but these variants fulfilled the clinical criteria, such as family cancer history and probands’ tumor features [39–46], and hence these seven variants were LLS-2 and/or LLS-3 (Table 3). The remaining three LLS variants (=15-5-7), namely C559Y, P623L and R1076C, were LLS-3 [31, 44, 47, 48] (Table 3).
Out of nineteen ULS variants, four variants including R128L, S144I, L396V and K728T showed normal function in protein function assay and normal staining pattern in IHC, hence fulfilled definition ULS-2 [30–32, 34, 49, 50] (Table 4). In addition, L396V was a polymorphism and also fulfilled definition ULS-1. Out of the remaining fifteen ULS variants (=19-4), K13T, G54A, S56L, R468H, S503C, R635G and I1054F variants demonstrated MSS and showed normal expression of MSH6 [34, 49, 51, 52], hence these seven variants possessed normal MMR activity and fulfilled definition ULS-3 (Table 5). The remaining eight (=19-4-7) ULS variants, namely A25V, G39E, C196F, I886V, E1163V, E1196K, E1234Q and E1304K, were polymorphisms and fulfilled definition ULS-1 (Table 5).
In total, 34 variants in Tables 2, 3, 4 and 5 were available for prediction assessment, and the remaining 260 variants, which were UVs, were the targets for predicting whether each of them was LLS or ULS. In the following analyses, we used the data in Tables 3 and 5 as a parameter training data set, and the data in Tables 2 and 4 as a prediction test data set. The 34 variants together are referred to as the whole data set. Finally, we applied the prediction to the UV data set.
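The partition of the classified variants can be written down explicitly. The sketch below lists the variants exactly as they are named in the text of this section and assembles the training, test and whole data sets. The variable names are hypothetical, and the labels 1 (LLS) and 0 (ULS) are an encoding chosen for illustration.

```python
# Variants as named in the text; tables refer to Tables 2-5 of this paper.
TABLE_2_LLS_WITH_ASSAY = ["G566R", "R976H", "G1139S", "S1188N", "E1193K"]
TABLE_3_LLS_WITHOUT_ASSAY = [
    "L449P", "P591S", "G670G", "R772W", "Y969C", "G1069E", "A1236P",
    "C559Y", "P623L", "R1076C",
]
TABLE_4_ULS_WITH_ASSAY = ["R128L", "S144I", "L396V", "K728T"]
TABLE_5_ULS_WITHOUT_ASSAY = [
    "K13T", "G54A", "S56L", "R468H", "S503C", "R635G", "I1054F",
    "A25V", "G39E", "C196F", "I886V", "E1163V", "E1196K", "E1234Q", "E1304K",
]

TOTAL_COLLECTED_VARIANTS = 294


def labelled(variants, label):
    """Pair each variant with its class label (1 = LLS, 0 = ULS)."""
    return [(variant, label) for variant in variants]


# Training set: variants without a biochemical functional assay.
training_set = labelled(TABLE_3_LLS_WITHOUT_ASSAY, 1) + labelled(TABLE_5_ULS_WITHOUT_ASSAY, 0)
# Test set: variants with a biochemical functional assay.
test_set = labelled(TABLE_2_LLS_WITH_ASSAY, 1) + labelled(TABLE_4_ULS_WITH_ASSAY, 0)
whole_set = training_set + test_set
unclassified_count = TOTAL_COLLECTED_VARIANTS - len(whole_set)
```

This gives 25 training variants, 9 test variants and 260 remaining UVs, matching the counts stated above.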
Optimization of MAPP for MSH6
The sequence data set for the multiple alignments
From GenBank and Ensembl, 126 sequences of MSH6 orthologues were selected (Additional file 2: Table S2). Of them, 34 were derived from vertebrates. Most of the vertebrate orthologues had, from the N-terminus, a PCNA-binding motif (Qxx[LI]xx[FF], amino acid 4–11 in human MSH6), a PWWP domain (amino acid 89–194) and a MutS domain (amino acid 362–1355) (Figure 1). These sequences formed the initial sequence set for a multiple sequence alignment.
We then added the amino acid sequences of the PCNA-binding motif and of the PWWP domain of 91 non-vertebrate MSH6 to the initial set, and found that the prediction performance was improved. The procedure of adding more amino acid sequences of MutS domain was, however, not straightforward. Three different sets of sequences were made from the non-vertebrate MutS domain. The first set contained the entire non-vertebrate MutS domain (91 sequences). The second set contained MutS domains derived from the sequences that were comprised of both the MutS and PWWP domains (5 sequences). The third set contained MutS domains derived from the sequences that were comprised of both the MutS domain and PCNA-binding motif (58 sequences). A multiple sequence alignment was built with initial sequences plus each of the described sequence sets, and the performance of prediction was tested on the whole data set using an ROC curve. The AUC of the first set was 0.767, that of the second set was 0.689 and that of the third set was 0.811. It turned out that the initial set plus the third set, namely sequences of both MutS domain and PCNA-binding motif, performed best and this set was used hereafter.
Normalization of the impact score
MAPP determines the pathogenicity of missense variants by an index known as impact score. The threshold of the impact score is required to determine whether the variant is pathogenic or not. The impact score basically depends on the degree of conservation of amino acid types in the alignment position . Therefore, the threshold of the impact score in different domains of MSH6 likely varies. Indeed, the optimum threshold for the initial sequence set was 8.5, that for the PCNA-binding motif was 4.1, that for the PWWP domain was 5.0 and that for the MutS domain was 4.1. The different threshold values of the different domains in the same sequence could cause confusion. We, therefore, normalized the impact scores so as to make the threshold value 1.0 throughout the sequence.
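One simple way to obtain a uniform threshold of 1.0 is to divide each raw impact score by the optimum threshold of the region in which the residue lies. The text does not state the exact transformation, so the division below is an assumption, as is the use of the initial-set threshold (8.5) for residues outside the three annotated regions. The region boundaries are the human MSH6 positions given earlier in this section; the function names are hypothetical.

```python
# Optimum raw impact-score thresholds reported in the text.
REGION_THRESHOLDS = {
    "PCNA-binding motif": 4.1,
    "PWWP domain": 5.0,
    "MutS domain": 4.1,
    "other": 8.5,  # assumption: initial-set threshold outside the regions
}

# Human MSH6 residue ranges (inclusive) given in the text.
REGION_RANGES = [
    ("PCNA-binding motif", 4, 11),
    ("PWWP domain", 89, 194),
    ("MutS domain", 362, 1355),
]


def region_of(position):
    """Name of the annotated region containing a residue position."""
    for name, start, end in REGION_RANGES:
        if start <= position <= end:
            return name
    return "other"


def normalized_impact(raw_score, position):
    """Raw impact score divided by its region threshold (assumed scheme).

    After this scaling a value above 1.0 corresponds to a raw score
    above the region-specific threshold.
    """
    return raw_score / REGION_THRESHOLDS[region_of(position)]
```

Under this assumed scheme a raw score of 4.1 at position 1193 (MutS domain) maps exactly to the normalized threshold of 1.0.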
The prediction performance of MAPP for MSH6
This type of prediction method should ideally distinguish disease-causing variants from benign variants . The distributions of the score of MAPP for MSH6 between LLS and ULS variants in the whole data set were significantly different. The average for LLS and ULS was 2.673 and 0.851, respectively (Student’s t-test: p < .001) and median for LLS and ULS was 2.099 and 0.770, respectively (Mann–Whitney U test: p < .001). The capability of this tool is, therefore, reasonably sufficient to distinguish pathogenic variants from non-pathogenic variants.
Development of CoDP
The prediction performance of SIFT and PolyPhen-2
We examined the prediction performance of both SIFT and PolyPhen-2 on the whole data set. PolyPhen-2 calculates values of both HumDiv and HumVar. HumDiv is used for diagnosis of Mendelian disease, and HumVar is used for the evaluation of rare alleles potentially involved in complex phenotypes . Both SIFT and PolyPhen-2 clearly distinguished the median for LLS variants and that for ULS variants (Mann–Whitney U test: HumVar p < .001, HumDiv p < .001, SIFT p < .001).
Correlation between the structural properties of the MSH6 protein and LLS/ULS
The correlation between the solvent accessibility of the substituted amino acid and LLS/ULS was found to be statistically significant. The averages of the solvent accessibility of the substituted amino acid residues in LLS and ULS variants were 0.141 and 0.589, respectively (Student’s t-test: p < .001), and the medians were 0.087 and 0.583, respectively (Mann–Whitney U test: p < .005). The amino acid residues substituted in LLS variants tend to have smaller accessibility than those in ULS variants. Similarly, the correlation between the change in the number of heavy atoms in the side chains of the substituted residues and LLS/ULS was also significant (Figure 2). A minor change in the number of heavy atoms in the side chains was often observed in ULS. These significant differences in the two properties indicate that they have the potential to be used as predictors of the pathogenicity of MSH6 variants. When these two properties alone were applied to the whole data set, eleven out of 15 LLS variants and 17 out of 19 ULS variants were correctly distinguished using the most appropriate threshold, which is equivalent to 82.4% accuracy (28/34). It is surprising that this simple and explicit use of protein three-dimensional structure data had a classification power comparable to that of SIFT and PolyPhen-2.
Combining different properties by logistic regression model
To further improve the prediction accuracy, we combined different prediction methods above on the logistic regression equation and the weight for each method was optimized using the training data set. The logistic regression equation for joint score q was obtained as:
The significance level is less than 1% and hence this model seems to be useful for the prediction. In the equation above, we omitted PolyPhen-2 HumDiv, because HumDiv had low accuracy, as will be explained below.
We calculated both the AUC and the cut-off value of the joint score q. The AUC was 0.954 and the cut-off value was 0.56. Based on these values, we considered that variants with a joint score q of 0.56 or less have a minor impact on the function of the MSH6 protein, and hence are likely to be non-pathogenic. Variants with a joint score q greater than 0.56 were, therefore, likely to be pathogenic. More specifically, variants with a joint score q greater than 0.65 likely have impaired function, and variants with a joint score q between 0.56 and 0.65 likely have a moderate impact on function. We applied this prediction procedure to the test data set, namely the variants with the biochemical functional assay (Tables 2 and 4), and found that the procedure predicted those variants correctly (LLS: 5/5 variants, ULS: 4/4 variants). Of the five LLS variants, four variants, namely G566R, G1139S, S1188N and E1193K, were in the category of “impaired function.”
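The interpretation of the joint score can be summarized as a small decision rule. The sketch below encodes the two cut-offs given in the text (0.56 and 0.65), with the moderate-impact band taken as 0.56 < q ≤ 0.65 as stated later for the UV predictions. The function name and the wording of the returned labels are hypothetical.

```python
PATHOGENIC_CUTOFF = 0.56   # q at or below this: likely non-pathogenic
IMPAIRED_CUTOFF = 0.65     # q above this: likely impaired function


def interpret_joint_score(q):
    """Map a CoDP joint score q (0 to 1) to the category used in the text."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("joint score must lie between 0 and 1")
    if q <= PATHOGENIC_CUTOFF:
        return "likely non-pathogenic (minor impact)"
    if q <= IMPAIRED_CUTOFF:
        return "likely pathogenic (moderate impact)"
    return "likely pathogenic (impaired function)"
```

For instance, the joint score q = 0.813 reported for E1193K falls in the "impaired function" category.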
Comparison of prediction performance
The performance of CoDP was first compared with those of other conventional tools, namely MAPP, SIFT, PolyPhen-2 and PON-MMR on the whole data set. The values of PPV, NPV, sensitivity, specificity and accuracy were compared (Table 6). PPV of CoDP was 93.3% (14/15), NPV was 94.7% (18/19), sensitivity was 93.3% (14/15), specificity was 94.7% (18/19) and accuracy was 94.1% (32/34). All these scores were better than those of the conventional methods except for PON-MMR. PON-MMR predicted eleven out of 34 LLS/ULS variants as either pathogenic or non-pathogenic variants, and remaining 23 variants as UVs. The eleven variants were predicted correctly, of which three were pathogenic variants and eight were non-pathogenic variants. However, prediction by PON-MMR did not classify 23 (= 34–11) variants as pathogenic or non-pathogenic, and hence the method cannot be used for UV curation, which we aim for in our tools. Therefore, we put PON-MMR aside in this comparison. Superiority of CoDP was also clarified by AUC. AUC of CoDP was 0.954, that of MAPP for MSH6 was 0.919, of SIFT was 0.864 and of PolyPhen-2 HumVar was 0.819. The power to distinguish between LLS and ULS of these methods was visualized by the box and whisker plot (Figure 3) and further tested by Wilcoxon rank sum test. The test ended in p < 8.9 × 10⁻⁶ for CoDP, p < 3.3 × 10⁻⁵ for MAPP, p < 3.1 × 10⁻⁴ for SIFT and p < 1.2 × 10⁻³ for PolyPhen-2 HumVar. These tests clearly demonstrated that CoDP outperformed other conventional methods.
When the performances of the tools were compared on the test data set alone, only CoDP predicted all test variants correctly. The values of PPV, NPV, sensitivity, specificity and accuracy of the tools in the test data set are shown in Table 7 (MAPP LLS: 4/5 variants, ULS: 4/4 variants; SIFT LLS: 4/5 variants, ULS: 4/4 variants; PolyPhen-2 HumVar LLS: 5/5 variants, ULS: 2/4 variants). AUC of CoDP was 1.000, that of MAPP for MSH6 was 0.800, of SIFT was 0.950 and of PolyPhen-2 HumVar was 0.900. The power to distinguish between LLS and ULS of these methods on the test data set was p < 1.5 × 10⁻² for CoDP, p < 1.9 × 10⁻¹ for MAPP, p < 6.5 × 10⁻² for SIFT and p < 1.5 × 10⁻² for PolyPhen-2 HumVar. The box and whisker plots that visualize the distribution of the scores are shown in Additional file 3: Figure S1.
The small size of the test data set may raise doubts about the superiority of CoDP. To compensate for the small test sample, we also employed a leave-one-out jackknife method to evaluate the performance of the tools. CoDP predicted 85.3% (29/34; LLS 93.3%, 14/15; ULS 78.9%, 15/19) of the variants correctly, and its performance was still better than those of SIFT and PolyPhen-2 HumVar (Table 6). Here, we did not compare the performance of CoDP with that of MAPP for MSH6, because MAPP is based on information retrieved from homologous sequences and hence it was difficult to leave the information of the target sequence out of the training set.
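The leave-one-out procedure can be sketched generically: each variant is held out in turn, the model is refitted on the remaining variants, and the held-out variant is then predicted. The code below is an illustrative sketch only. The helper names are hypothetical, and the toy one-dimensional threshold classifier merely stands in for the logistic regression model actually used in CoDP.

```python
def leave_one_out_accuracy(samples, labels, fit, predict):
    """Fraction of samples predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


def fit_midpoint_threshold(scores, labels):
    """Toy stand-in model: threshold halfway between the two class means."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2


def predict_with_threshold(threshold, score):
    """Predict class 1 (pathogenic) when the score exceeds the threshold."""
    return 1 if score > threshold else 0


# Hypothetical scores, not data from this study.
toy_scores = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_accuracy = leave_one_out_accuracy(
    toy_scores, toy_labels, fit_midpoint_threshold, predict_with_threshold
)
```

In this toy example the two overlapping scores (0.4 labelled pathogenic and 0.6 labelled non-pathogenic) are misclassified when held out, which illustrates why leave-one-out accuracy is usually lower than accuracy measured on the data used for fitting.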
Predicting UVs by CoDP
We now used CoDP to interpret 260 germline missense variants, which were classified as UVs. Of 260 UVs, 84 variants (32.3%) were predicted as pathogenic variants, and 176 variants (67.7%) as non-pathogenic variants, hence about one third of the UVs detected in MSH6 were predicted as pathogenic variants. Of these putative 84 pathogenic variants, three variants were predicted to have the moderate impact on the protein (0.56 < joint score q ≤ 0.65), and the 81 variants were predicted to have impaired function (joint score q > 0.65) (Table 8).
The higher joint scores of CoDP tend to derive from the mutations in the conserved domain, namely in the MutS domain. This tendency suggests that missense mutations in the domain should have considerable influence on protein function. The MutS domain in MSH6 forms a heterodimer with MSH2 and participates in the early recognition of mismatches and small insertion/deletion loops of DNA [54, 55]. For instance, the E1193K variant, classified as LLS, is located in the MutS domain V region (Figure 1). The MutS domain V region is the highly conserved region in MutS homologues . This variant showed remarkable impairment of function, such as the loss of heterodimerization with MSH2 and MMR activity . CoDP gave the joint score q = 0.813 to E1193K variant, indicating that the variant likely has significant damage to the structure of MSH6, which may impair the function of the protein.
In this study, we built CoDP, the new prediction tool to assess the MSH6 missense variants. The novelty of CoDP lies in the direct incorporation of protein three-dimensional structure information and the introduction of the logistic regression model for combining the different prediction methods. The former feature was found to have unexpectedly high performance in LLS/ULS classification, and the latter procedure can be interpreted as an introduction of a simple neural network model for combining outputs from different prediction schemes. These new features enabled CoDP to achieve better performance for the classification of the MSH6 variants. The better performance was also sustained by the manually curated dataset of MSH6 variants presented in Tables 2, 3, 4, 5, and 6.
To adjust the parameters, we carefully categorized MSH6 germline missense variants into LLS and ULS. In the current dataset, only 34 out of 294 variants could be categorized into LLS and ULS. This was due to the paucity of both biochemical functional assay data and clinical and molecular data linked to the MSH6 variants in the databases. Because of this paucity of data, the present CoDP is not yet clinically applicable. However, the current form of CoDP is useful for supporting risk estimation of UVs in MSH6, as SIFT and PolyPhen-2 are for other proteins. In the future, when more associated data are obtained, the parameters can be refined and the accuracy of CoDP further improved.
Abbreviations
AUC: The area under the curve
HNPCC: Hereditary Non-Polyposis Colorectal Cancer
LLS: Likely to be Lynch syndrome
MAPP: Multivariate analysis of protein polymorphisms
MSI-H: High level of microsatellite instability
MSI-L: Microsatellite instability low
NPV: The negative predictive value
PPV: The positive predictive value
ROC: A receiver-operating characteristic
ULS: Unlikely to be Lynch syndrome
This work was supported by a Grant-in-Aid for Cancer Research from the Ministry of Health, Labour and Welfare, Japan. KY was supported by the Platform for Drug Discovery, Informatics, and Structural Life Science from the Ministry of Education, Culture, Sports, Science and Technology, Japan.
The authors declare that they have no competing interests.
HT performed the majority of the work presented in this manuscript and drafted the manuscript. HT, KA and KY participated in this research. HK assisted with the research. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Table S1: MSH6 missense variant data used for parameter fitting. The file can be read with a standard TIFF viewer, such as Preview on Mac OS X. (TIFF 17 MB)
Additional file 2: Table S2: A list of amino acid sequences used for the multiple sequence alignment of MSH6. The file can be read with a standard TIFF viewer, such as Preview on Mac OS X. (TIFF 2 MB)
Additional file 3: Figure S1: Box and whisker plots for the score distribution of in silico tools evaluated on the test set. The top and the bottom of each box are the 75th and 25th percentiles, respectively, and the white line in the box is the median. The distributions of LLS and ULS are clearly separated. The file can be read with a standard TIFF viewer, such as Preview on Mac OS X. (TIFF 776 KB)
Cite this article
Terui, H., Akagi, K., Kawame, H. et al. CoDP: predicting the impact of unclassified genetic variants in MSH6 by the combination of different properties of the protein. J Biomed Sci 20, 25 (2013). https://doi.org/10.1186/1423-0127-20-25