GWAS analysis of FSS cases and controls
Samples who fulfilled the diagnostic criteria of FSS were recruited from Children Hospital, China Medical University. The FSS was diagnosed by clinicians with the following criteria, including body height less than 3rd percentile to the population with corresponding age, and with a family history of short stature. In addition, only samples with ordinal annual growth rate and coincide bone and chronologic age will be included in this study. The controls in this study were selected from Taiwan Biobank based on their body height, i.e., >75th of all samples. We obtained informed consent from all study participants and guardians. This study was performed in accordance with approved guidelines and regulations.
In sample-level quality control (QC) step, for the 827 FSS patients, we removed 30 duplicated samples, two samples with data quality center (DQC) < 0.82, and 7 samples with call rate < 97%. For the remaining 788 samples, 52 were filtered in kinship QC step and left 736 samples for association analysis. For the controls from Taiwan Biobank, after removing samples with DQC < 0.82, failed plate QC, failed sample QC, missing gender and age information and failed kinship check, resulting in 464 remained for downstream analysis.
In marker-level QC step, for the 628,132 autosomal SNPs, we excluded the SNPs with MAF < 5%, SNP call rate < 98% in either case or control groups, Hardy-Weinberg equilibrium test p-value < 0.0001 (based on controls), and with batch effect. The remaining 530,030 (84.38%) SNPs were subjected to association analysis under additive inheritance model.
Functional annotation of FSS-related SNPs
The region of FSS-associated SNPs (human genome hg19) was annotated using ANNOVAR [24]. The region of variants was categorized as either exonic, intronic, non-coding (nc) RNA intronic, the 5′ untranslated region (UTR), the 3′ UTR, intergenic, upstream, or downstream. For variants located in an exonic region, we further characterized their functional type, i.e., synonymous or non-synonymous.
Identifying SNPs in linkage disequilibrium (LD) with FSS-related variants
For the 122 FSS-associated variants identified from a GWAS of a Taiwanese population, SNPs that were in high LD to these variants were identified using the 1000 Genome [25] phase 3 database (dbSNP Build 137). SNPs with an r2 value (a measure of LD) of > 0.8 and within a 100-kilobase (kb) window of FSS-associated variants based on an East Asian (EAS) super-population were selected using the R proxysnps package.
Conspectus of the drug repositioning analysis for FSS
In this study, we proposed a bioinformatics pipeline called SNP-heuristic and expression-based functional unifying network (Shefun) algorithm embodied by two major portions: (1) an SNP-heuristic part and (2) an expression-based functional unifying network part.
The first part is centralized on SNPs. By SNP-based annotations, we could obtain functional states (non-coding/non-synonymous/synonymous), chromatin state, and cis-regulation data of each SNP. These data provided two aspects of information for the second part of the Shefun algorithm: resolution of tissue-specificity and determination of “seed” genes. For tissue specificity, based on the enrichment of FSS-associated SNPs with an active chromatin state, we resolved the tissue type(s) for a coexpression analysis. In addition, genes with cis-expression quantitative trait locus (eQTL) annotation and/or with non-synonymous variant(s) located in it could be utilized as “seed” genes for network construction.
The second part of Shefun, which mainly focuses on genes, includes several consecutive analytical modus operandi as follows: the construction of tissue-specific expression-based networks; a subnetwork enrichment analysis to establish gene-phenotype relationships; drug repurposing by inferring drug-phenotype relationships; an over-representation analysis; and primary target annotation. All of these functional analyses are unified into a network scene.
Non-synonymous, chromatin state segmentation and cis-eQTL annotations
FSS-associated SNPs (and SNPs in high LD with FSS-related SNPs) were queried in HaploReg (vers. 4.1) [26] using the 1000 Genome Phase 1 database and an Asian (ASN) population. The functional state, chromatin state segmentation (25-state), and cis-eQTL information were extracted from the output sheet of HaploReg.
SNPs with a chromatin state of 1~19 were defined as “active”; 20~25 as “inactive”, and the remaining as “not available” (n.a.). For each cell type, we calculated the number of SNPs with an active chromatin state, and calculated one-sided p values (Z = (N – mean(N))/SD(N), where N is the number of SNPs with state 1~19 in the given cell type, and SD is the standard deviation) by comparing to the mean of the number of “active SNPs” across cell types (mean no. = 84.73).
For the cis-eQTL part, given the results from chromatin state segmentation, we selected only SNPs with cis-eQTL annotation in the following tissue types: whole blood, adipose (subcutaneous) tissues, adipose (visceral omentum) tissues, breast mammary tissue, skin (sun-exposed; lower leg), cells (transformed fibroblasts), muscle (skeletal), skin (not sun-exposed; suprapubic), osteoblasts (prostaglandin E2 (PGE2)), osteoblasts (bone morphogenetic protein 2 (BMP2)), osteoblasts (Dex.) and osteoblasts (untreated). We further merged tissue types into seven categories: adipose, blood, bone, breast, fibroblast, skeletal muscle, and skin.
The SNPs were categorized based on non-coding/non-synonymous/synonymous, the active/inactive chromatin state, and cis-eQTL, and visualized them by a radar chart using the R fmsb package.
Genotype-tissue expression (GTEx) transcriptomic dataset pre-processing
GTEx expression data (five tissue types including adipose, breast, fibroblast, skeletal muscle, and skin) were downloaded from recount2 (https://jhubiostatistics.shinyapps.io/recount/) and processed using the R recount package. Samples with an RNA integrity number (RIN) of < 6.0 were filtered. Next, gene expression values were aggregated by the average, and then log2-scaled (scaled E = log2(E+ 1), where E represents the gene expression value). Then, lowly expressed genes were removed by preserving genes with a scaled expression of > 1 in 80% of the samples in at least one tissue type. Finally, we performed a principal component analysis (PCA) adjustment for latent covariates, also known as surrogate variables, using the R sva package.
Bone tissue dataset pre-processing
As GTEx did not include bone expression data, we thus downloaded a bone biopsy transcriptomic dataset (E-MEXP-1618) of postmenopausal females from ArrayExpress (https://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-1618/). The raw gene expression values were normalized using the R gcrma package.
Expression-based network construction
The expression-based network (six tissue types, excluding “whole blood”) was consociated with two levels of information: (1) messenger (m) RNA coexpression and (2) protein-protein interactions (PPIs). To do this, for each selected tissue type, FSS-related genes (“seed” genes), constituted by tissue-specific eGenes (from cis-eQTL annotation) and genes that contained non-synonymous SNPs, served as input genes for a coexpression network analysis. For each input gene, genes with the top 10/15/20/25/30 highest Pearson’s product-moment correlation coefficient were included to build a subnetwork. Then, the subnetworks were further expanded using PPI information adopted from the Human Protein Reference Database (HPRD, vers. Release9_041310) [27]. Furthermore, self-loops and redundant links were removed from each subnetwork for the sake of conciseness. Different subnetworks were fused into a bigger subnetwork if they contained at least one identical gene.
Gene set enrichment analysis (GSEA)
The “pathways” for GSEA were the merged expression-based subnetworks, and the gene-level statistics were beta-coefficients (related to “height”) acquired from Taylor et al. (human skeletal muscle biopsies) [28]. The GSEA was conducted using the R fgsea package with 99,999 permutations. The significance threshold was set to a false discovery rate (FDR) of < 0.1. The subnetworks that reached a significant threshold were defined as “height-related subnetworks”. For each height-related subnetwork, genes within it were assigned a value of + 1 if the subnetwork was positively enriched (representing a positive “gene-phenotype relationship”) and − 1 if the subnetwork was negatively enriched (representing a negative “gene-phenotype relationship”).
Ligand/drug repositioning
Ligand-target (gene) interaction data were queried from the Guide to PHARMACOLOGY website (http://www.guidetopharmacology.org/download.jsp, vers. 2019.3). Data were first filtered by the following criteria: (1) human species; (2) non-endogenous agents; (3) a clear type/action of the mechanism for each ligand-target pair; and (4) distinct target (gene symbol) information. We further removed the drug-gene pair of the actions of “binding”, “mixed”, and “neutral”. Next, we assigned a value of + 1 to the ligand-target pair of action of the mechanism of “activation”, “agonist”, “biased agonist”, “full agonist”, “partial agonist”, and “positive”; and also the type of mechanism of “activator” and “agonist”. Similarly, we assigned a value of − 1 to ligand-target pairs with an action mechanism of “antagonist”, “feedback inhibition”, “inhibition”, “inverse agonist”, “irreversible inhibition”, “negative”, “pore blocker”, “slows inactivation”, and “voltage-dependent inhibition”; and mechanism types of “antagonist”, “channel blocker”, “gating inhibitor” and “inhibitor”. Consequently, + 1 or − 1 represents a positive or negative drug-gene relationship, respectively.
For each gene in the height-related subnetworks, the drug-phenotype relationship was inferred by multiplying the assigned values of “drug-gene relationship” and “gene-phenotype relationship”. There were four possibilities to show the logic of how we inferred the drug/ligand effect, i.e., “drug-gene relationship” × “gene-phenotype relationship” = “drug-phenotype relationship”: (1) + 1 × + 1 = + 1; (2) + 1 × − 1 = − 1; (3) -1 × + 1 = − 1; and (4) -1 × − 1 = + 1. A final value of + 1 suggests that the drug may enhance or exacerbate the phenotype of interest, and a final value of − 1 suggests that the drug may alleviate, diminish, or inhibit the phenotype of interest. The repositioning analysis revolved around genes in height-related subnetworks, and drugs/ligands were selected which possibly targeted those genes with a calculated value (drug-phenotype relationship) of + 1 only, as this meant that the selected drugs/ligands possibly enhanced the phenotype of interest (i.e., height) and therefore was a potential candidate for repurposing to FSS.
Gene ontology (GO) biological process (BP) terms and Kyoto encyclopedia of genes and genomes (KEGG) pathway over-representation analysis (ORA)
Height-related subnetwork genes were subjected to a GO analysis [29] to assess their enrichment in BP terms. The enrichment test was performed using “weight01” implemented in the R topGO package. Moreover, the KEGG ORA test was performed using the R clusterProfiler package. The Benjamini-Hochberg (BH) method was applied for multiple test corrections.
Statistical and bioinformatics analysis
All in-house statistical and bioinformatics scripts for drug repositioning analysis were written in R language (https://www.r-project.org/). Gene symbols from different sources were unified using the R HGNChelper package. The conversion between gene symbols, Entrez Gene ID, and Ensembl Stable ID was performed using the R clusterProfiler package. The networks were illustrated using the R igraph package utilizing the Fruchterman-Reingold (FR) algorithm.