1 Introduction
Gene shaving (GS), which identifies subsets of genes, is an important research area in the analysis of DNA microarray gene expression data for biomedical discovery. Unlike hierarchical clustering and other methods widely used for analyzing gene expression in genome-wide association studies, GS leads to gene discovery relevant to a specific target annotation. The selected genes therefore play an important role in the analysis of gene expression data, since they are able to differentiate samples from different populations. Despite their successes, these studies are often hampered by relatively low reproducibility and nonlinearity
(Hastie et al., 2000; Ruan and Yuan, 2011; Chen and Ishwaran, 2012; Castellanos-Garzón and Romos, 2015). The incorporation of statistical machine learning methods into genomic analysis is a rather recent topic. Large-scale DNA microarray data present significant challenges for statistical data analysis, since the high dimensionality of genomic features makes the classical approaches infeasible. Kernel methods are appropriate tools for such datasets: they map data from the input space to a high-dimensional feature space using a nonlinear feature map. The main advantage of these methods is that they combine statistics and geometry in an effective way (Hofmann et al., 2008; Alam and Fukumizu, 2014; Charpiat et al., 2015). Kernel canonical correlation analysis (kernel CCA) has been studied extensively for decades (Akaho, 2001; Alam and Fukumizu, 2013, 2015).
Recently, sensitivity-based methods built on the influence function (IF) have been used to detect influential observations. A visualization method for detecting influential observations using the IF of kernel PCA has been proposed (Debruyne et al., 2010). Filzmoser et al. (2008) also developed a method for outlier identification in high dimensions. However, these methods are limited to a single data set. Owing to the properties of its eigendecomposition, kernel CCA and its variants remain widely used methods for biomedical data analysis (Alam et al., 2008, 2016, 2018). The contribution of this paper is threefold. First, we derive the IF of kernel CCA. Second, we use distribution-based methods to confirm the influential observations. Finally, the proposed method is applied to identify sets of genes in both synthesized and real DNA microarray gene expression data.
The remainder of the paper is organized as follows. In the next section, we provide a brief review of positive definite kernels, kernel CCA and the IF of kernel CCA. The utility of the proposed method is demonstrated through both simulated data and a real microarray gene expression data analysis in Section 3. In Section 4, we summarize our findings and give perspectives for future research.
2 Method
2.1 Positive definite kernel
In kernel methods, a nonlinear feature map is defined by a positive definite kernel. It is known (Aronszajn, 1950) that a positive definite kernel $k$ is associated with a Hilbert space $\mathcal{H}$, called a reproducing kernel Hilbert space (RKHS), consisting of functions on the input space $\Omega$ so that the function value is reproduced by the kernel. For any function $f \in \mathcal{H}$ and a point $x \in \Omega$, the function value is $f(x) = \langle f, k(\cdot, x)\rangle_{\mathcal{H}}$, where $\langle \cdot, \cdot\rangle_{\mathcal{H}}$ is the inner product of $\mathcal{H}$; this is called the reproducing property. Replacing $f$ with $k(\cdot, \tilde{x})$ yields $k(\tilde{x}, x) = \langle k(\cdot, \tilde{x}), k(\cdot, x)\rangle_{\mathcal{H}}$ for any $x, \tilde{x} \in \Omega$. A symmetric kernel $k(\cdot, \cdot)$ defined on a space $\Omega$ is called positive definite if, for an arbitrary number of points $x_1, \ldots, x_n \in \Omega$, the Gram matrix $(k(x_i, x_j))_{i,j}$ is positive semidefinite. To transform data for extracting nonlinear features, the mapping $\Phi : \Omega \to \mathcal{H}$ is defined as
$$\Phi(x) = k(\cdot, x),$$
which is a function of the first argument. This map is called the feature map, and the vector $\Phi(x)$ in $\mathcal{H}$ is called the feature vector. The inner product of two feature vectors is then $\langle \Phi(x), \Phi(\tilde{x})\rangle_{\mathcal{H}} = k(x, \tilde{x})$. This is known as the kernel trick: by this trick the kernel can evaluate the inner product of any two feature vectors efficiently without knowing an explicit form of the feature map (Hofmann et al., 2008; Alam and Fukumizu, 2014; Charpiat et al., 2015).
2.2 Kernel canonical correlation analysis
Kernel CCA has been proposed as a nonlinear extension of linear CCA (Akaho, 2001), and researchers have extended the standard kernel CCA with an efficient computational algorithm (Bach and Jordan, 2002). Over the last decade, kernel CCA has been used for various tasks (Alzate and Suykens, 2008; Huang et al., 2009; Richfield et al., 2017; Alam and Fukumizu, 2015). Given two random variables $X$ and $Y$ and two functions in the RKHSs, $f_X \in \mathcal{H}_X$ and $f_Y \in \mathcal{H}_Y$, the optimization problem for the transformed variables $f_X(X)$ and $f_Y(Y)$ is
$$\rho = \max_{f_X \in \mathcal{H}_X,\, f_Y \in \mathcal{H}_Y} \frac{\mathrm{Cov}[f_X(X), f_Y(Y)]}{\mathrm{Var}[f_X(X)]^{1/2}\, \mathrm{Var}[f_Y(Y)]^{1/2}}. \quad (1)$$
The optimizing functions $f_X$ and $f_Y$ are determined up to scale.
Using a finite sample, we can estimate the desired functions. Given an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from a joint distribution $F_{XY}$, by taking the inner product with elements or "parameters" in the RKHS, we have features $f_X(\cdot) = \sum_{i=1}^{n} \alpha_i k_X(\cdot, X_i)$ and $f_Y(\cdot) = \sum_{i=1}^{n} \beta_i k_Y(\cdot, Y_i)$, where $k_X$ and $k_Y$ are the associated kernel functions for $\mathcal{H}_X$ and $\mathcal{H}_Y$, respectively. The kernel Gram matrices are defined as $(K_X)_{ij} = k_X(X_i, X_j)$ and $(K_Y)_{ij} = k_Y(Y_i, Y_j)$. We need the centered kernel Gram matrices $M_X = H K_X H$ and $M_Y = H K_Y H$, where $H = I_n - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^\top$, with $I_n$ the identity matrix and $\mathbf{1}_n$ the vector with $n$ ones. Introducing a regularization coefficient $\varepsilon > 0$, the empirical estimate of Eq. (1) is then given by
$$\hat{\rho} = \max_{\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}^n} \frac{\boldsymbol{\alpha}^\top M_X M_Y \boldsymbol{\beta}}{\left[\boldsymbol{\alpha}^\top (M_X^2 + \varepsilon M_X)\, \boldsymbol{\alpha}\right]^{1/2} \left[\boldsymbol{\beta}^\top (M_Y^2 + \varepsilon M_Y)\, \boldsymbol{\beta}\right]^{1/2}},$$
where $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are the directions of $f_X$ and $f_Y$, respectively.
2.3 Influence function of the kernel canonical correlation analysis
By using the IFs of kernel PCA, linear PCA and linear CCA, we can derive the IF of kernel CCA, that is, of the kernel canonical correlation (kernel CC) and the kernel canonical variates (kernel CVs).
Theorem 2.1
Given two sets of random variables $(X, Y)$ having the distribution $F_{XY}$, and the $j$th kernel CC and kernel CVs, the influence functions of the kernel CC and kernel CVs at a point $(\tilde{x}, \tilde{y})$ are
(2) 
The above theorem is proved on the basis of previously established results: the IF of linear PCA (Tanaka, 1988, 1989), the IF of linear CCA (Romanazzi, 1992), and the IF of kernel PCA (Debruyne et al., 2010). The detailed proof is given in Alam et al. (2018).
Using the above result, we can establish some properties of kernel CCA: robustness, asymptotic consistency and its standard error. In addition, we are able to identify a set of genes based on the influence of the data.
For sample data, let $(X_i, Y_i)_{i=1}^{n}$ be a sample from the empirical joint distribution $F_n$. The EIF (the IF evaluated at the empirical distribution) of the kernel CC and kernel CVs at each point $(X_i, Y_i)$ gives an empirical influence value for every observation.
For bounded kernels, the IFs defined in Theorem 2.1 have three desirable properties: finite gross-error sensitivity, finite local-shift sensitivity, and a rejection point. For unbounded kernels, such as the linear or polynomial kernels, the IFs are not bounded.
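As a toy illustration of influence-based detection, a leave-one-out "deletion influence" behaves like a crude finite-sample analogue of the EIF. The sketch below uses the ordinary sample correlation in place of the kernel canonical correlation (a simplification of ours, since the full kernel CVs would exceed a short example); the planted outlier and all numeric settings are synthetic:

```python
import numpy as np

def deletion_influence(x, y):
    """Leave-one-out ("deletion") influence of each observation on the sample
    correlation: |n * (r_full - r_without_i)|, a crude finite-sample analogue
    of evaluating an influence function at each data point."""
    n = len(x)
    r_full = np.corrcoef(x, y)[0, 1]
    infl = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        infl[i] = abs(n * (r_full - np.corrcoef(x[keep], y[keep])[0, 1]))
    return infl

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + 0.5 * rng.normal(size=100)
x[0], y[0] = 8.0, -8.0           # plant one gross outlier
infl = deletion_influence(x, y)  # the planted point should dominate the ranking
```

Observations with a large deletion influence are exactly the ones a bounded-IF method is designed to downweight or flag.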
3 Experiments
To demonstrate the performance of the proposed method in comparison with three popular gene selection methods (T-test, SAM and LIMMA), we used both simulated and real microarray gene expression datasets. We used three R packages for the competing methods: stats, samr and limma. The AUC performance measures were computed for each method using the ROC package. All R packages are available from the Comprehensive R Archive Network (CRAN) or Bioconductor.
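The AUC reported throughout is equivalent to the Mann-Whitney statistic, so it can also be computed directly without the R ROC package. The following is a minimal numpy version of our own, not the package's implementation:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: the probability that a randomly
    chosen positive (truly DE) gene scores higher than a random negative,
    with ties counted one half."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]        # all positive/negative pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

# Perfect ranking of the two DE genes above the two non-DE genes gives AUC = 1.
print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # -> 1.0
```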
3.1 Simulation study
To investigate the performance of the proposed method in comparison with the three popular methods mentioned above for k = 2 groups, we considered gene expression profiles generated from both the normal distribution and the t-distribution. We also considered datasets for both small- and large-sample cases with different percentages of differentially expressed (DE) genes.
3.2 Simulated gene expression profiles generated from the normal distribution
The following one-way ANOVA model was used to generate simulated datasets from the normal distribution:
$$y_{ijk} = \mu_{ik} + \varepsilon_{ijk}, \quad (3)$$
where $y_{ijk}$ is the expression of the $i$th gene for the $j$th sample in the $k$th group, $\mu_{ik}$ is the mean of all expressions of the $i$th gene in the $k$th group, and $\varepsilon_{ijk}$ is the random error, which follows a normal distribution with mean zero and variance $\sigma^2$.
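A data generator following Eq. (3) can be sketched as follows. The gene and sample counts, the mean shift for DE genes and the error variance are illustrative assumptions of ours, since the paper's exact settings are not reproduced here:

```python
import numpy as np

def simulate_expression(n_genes=1000, n_per_group=10, p_deg=0.02,
                        mu=0.0, delta=2.0, sigma=1.0, seed=0):
    """Simulate a two-group expression matrix from the one-way ANOVA model
    y_ijk = mu_ik + e_ijk, with e_ijk ~ N(0, sigma^2).
    A proportion p_deg of genes receives a mean shift `delta` in group 2.
    (All numeric defaults are illustrative, not the paper's values.)"""
    rng = np.random.default_rng(seed)
    n_deg = int(round(p_deg * n_genes))
    means = np.full((n_genes, 2), mu)
    means[:n_deg, 1] += delta                 # DE genes: shifted mean in group 2
    y = np.concatenate(
        [rng.normal(means[:, [k]], sigma, size=(n_genes, n_per_group))
         for k in (0, 1)], axis=1)            # genes x (2 * n_per_group) samples
    labels = np.zeros(n_genes, int)
    labels[:n_deg] = 1                        # 1 = truly DE, 0 = not DE
    return y, labels
```

The `labels` vector provides the ground truth against which TPR, FDR, AUC and the other measures can be computed.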
To investigate the performance of the proposed method in comparison with the three popular methods mentioned earlier for $k = 2$ groups, we generated datasets using 100 simulation runs for both the small- and large-sample cases using Eq. (3). The means and the common variance of both groups were fixed in advance. Each dataset represented the gene expression profiles of the simulated genes across the samples. The proportion of DE genes (pDEG) was set to 0.02 and 0.06 for each of the datasets. We computed average values of the performance measures (TPR, TNR, FPR, FNR, MER, FDR and AUC) based on the estimated DE genes obtained by the four methods (T-test, SAM, LIMMA and Proposed) for each of the datasets. Fig. 1a and Fig. 1b show the ROC curves based on the estimated DE genes by the four methods for the small- and large-sample cases, respectively. From these figures we observe that the proposed method performed better than the other three methods in the small-sample case (see Fig. 1a), whereas in the large-sample case (see Fig. 1b) the proposed method performs almost equally to the other three methods (T-test, SAM and LIMMA). Fig. 2 shows the boxplots of AUC values based on the 100 simulated datasets estimated by each of the four methods for both the small- and large-sample cases; Fig. 2a and Fig. 2b show the boxplots of AUC values with pDEG = 0.02 and 0.06, respectively. These boxplots yield results similar to the ROC curves for every pDEG value. We also notice that the performance of all methods increases when the value of pDEG is increased to 0.06. Furthermore, we estimated the average values of the performance measures (TPR, TNR, FPR, FNR, MER, FDR and AUC) based on the estimated DE genes for pDEG = 0.02 and pDEG = 0.06 for each method. The results are summarized in Table 1, where the values outside and inside the brackets indicate the averages of the performance measures for the small- and large-sample cases, respectively. Table 1 supports the same interpretations as the ROC curves and boxplots.
Table 1. Average performance measures for the four methods; values in brackets correspond to the large-sample case. (Numerical entries not recoverable.)

With proportion of DE genes (pDEG) = 0.02:
Methods    TPR   TNR   FPR   FNR   MER   FDR   AUC
T-test
SAM
LIMMA
Proposed

With proportion of DE genes (pDEG) = 0.06:
Methods    TPR   TNR   FPR   FNR   MER   FDR   AUC
T-test
SAM
LIMMA
Proposed
3.3 Simulated gene expression profiles generated from the t-distribution
We also investigated the performance of the proposed method in comparison with the other three methods (T-test, SAM and LIMMA) in a non-normal setting; accordingly, we generated 100 simulated datasets from the t-distribution with 10 degrees of freedom. We set the means and variance as before. We estimated the performance measures (TPR, TNR, FPR, FNR, MER, FDR and AUC) based on 20 estimated DE genes by the four methods for each of the 100 datasets. The average values of the performance measures are summarized in Table 2. From this table we notice that the performance of all methods deteriorated when the datasets came from the t-distribution. We also observe that the proposed method produced a larger AUC than each of the three competitors (T-test, SAM and LIMMA). The boxplots in Fig. 3 and the ROC curves in Fig. 1(c-d) reveal results similar to Table 2. We also notice from the boxplots that the proposed method has less variability than the other three methods. From this analysis we may conclude that the proposed method improves upon the three well-known gene selection methods.

Table 2. Average performance measures for data generated from the t-distribution. (Numerical entries not recoverable.)

With proportion of DE genes (pDEG) = 0.02:
Methods    TPR   TNR   FPR   FNR   MER   FDR   AUC
T-test
SAM
LIMMA
Proposed
3.4 Application to colon cancer microarray data
The data consist of expression levels of 2000 genes obtained from a microarray study of 62 colon tissue samples collected from colon-cancer patients. Among these tissues, 40 are tumor tissues (coded 2) and 22 are normal tissues (coded 1) (Alon et al., 1999). The goal here is to characterize the underlying interactions among genetic markers for their association with colon-cancer patients and healthy persons.
To calculate the influence value of each gene, we used three methods: PCOut, linear CCA (LCCA-Out) and the proposed kernel CCA based method (KCCA-Out). Figure 4 visualizes the absolute influence values of the 2000 genes for each method. Applying an outlier detection technique to the one-dimensional influence values of each method, we obtained sets of selected genes from PCOut, LCCA-Out and KCCA-Out, respectively. To compare the selected genes, we made a Venn diagram of the genes selected by the three methods; Figure 5 presents the Venn diagram of the PCOut, LCCA-Out and KCCA-Out selections, including the genes selected by only one method. The numbers of common genes between PCOut and LCCA-Out, PCOut and KCCA-Out, and LCCA-Out and KCCA-Out are 7, 1, and 61, respectively. All methods selected 4 common genes: J00231, T57780, M94132 and M87789.
Genes do not function alone; rather, they interact with each other. When genes share a similar set of GO annotation terms, they are most likely to be involved in similar biological mechanisms. To verify this, we extracted the gene-gene networks using STRING (Szklarczyk et al., 2015). STRING imports protein association knowledge from databases of both physical interactions and curated biological pathways. In STRING, the simple interaction unit is the functional relationship between two proteins/genes that can contribute to a common biological purpose. Figure 6 shows the gene-gene network based on the protein interactions among the combined selected genes; the color saturation of the edges represents the confidence score of a functional association. Further network analysis reports the number of nodes, the number of edges, the average node degree, the clustering coefficient and the PPI enrichment p-value of this network. The network has significantly more interactions than expected, which indicates that these genes may function in a concerted effort.
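The outlier detection step on the one-dimensional influence values can be sketched with a simple robust rule (a median/MAD cut-off). This is one common univariate detector, not necessarily the exact rule used above, and the planted influence values below are synthetic:

```python
import numpy as np

def flag_influential(infl, c=3.0):
    """Flag genes whose absolute influence value is an outlier under a
    robust z-score rule: |infl - median| > c * MAD (scaled for normality).
    One simple univariate detector; the paper's exact rule may differ."""
    infl = np.asarray(infl, float)
    med = np.median(infl)
    mad = 1.4826 * np.median(np.abs(infl - med))  # consistent scale for Gaussian data
    return np.flatnonzero(np.abs(infl - med) > c * mad)

# Synthetic demo: 2000 influence values with two clearly influential "genes".
infl = np.abs(np.random.default_rng(2).normal(size=2000))
infl[[5, 17]] += 10.0
flagged = flag_influential(infl)   # the two planted genes should be flagged
```

Genes surviving such a cut-off are the candidates passed on to the GO, pathway and network analyses.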
4 Concluding remarks
Kernel-based methods provide powerful and reproducible outputs, but the interpretation of the results remains challenging. Incorporating biological knowledge (e.g., GO annotations) can provide additional evidence on the results. The performance of the proposed method was evaluated on both simulated and real data. The extensive simulation studies show the power gain of the proposed method relative to the alternative methods.
The utility of the proposed method is further demonstrated by the application to colon cancer microarray data. According to the influence values, the proposed method is able to rank the influence of each gene, and the identified genes are highly related to the disease. Using outlier detection methods, the proposed method extracts a subset of influential genes out of the 2000 genes, which are considered to have a significant impact on the patients. By conducting gene ontology, pathway and network analysis, including visualization, we find evidence that the selected genes have a significant influence on the manifestation of colon cancer and can serve as distinct features for classifying colon cancer patients versus healthy controls.
Although the Gaussian kernel has a free parameter (the bandwidth), in this study we used the median of the pairwise distances as the bandwidth, which appears to be practical. In future work, the choice of a suitable kernel deserves particular attention.
Acknowledgments
The authors wish to thank the University Grants Commission of Bangladesh for support.
References
Akaho, S. (2001). A kernel method for canonical correlation analysis. In Proceedings of the International Meeting of the Psychometric Society.
Alam, M. A., Calhoun, V. and Wang, Y.-P. (2016). Influence function of multiple kernel canonical analysis to identify outliers in imaging genetics data. In Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 210-219).
Alam, M. A. and Fukumizu, K. (2013). Higher-order regularized kernel CCA. In 12th International Conference on Machine Learning and Applications (pp. 374-377).
Alam, M. A. and Fukumizu, K. (2014). Hyperparameter selection in kernel principal component analysis. Journal of Computer Science, 10(7), 1139-1150.
Alam, M. A. and Fukumizu, K. (2015). Higher-order regularized kernel canonical correlation analysis. International Journal of Pattern Recognition and Artificial Intelligence, 29(4), 1551005.
Alam, M. A., Fukumizu, K. and Wang, Y.-P. (2018). Influence function and robust variant of kernel canonical correlation analysis. Neurocomputing, 304, 12-29.
Alam, M. A., Nasser, M. and Fukumizu, K. (2008). Sensitivity analysis in robust and kernel canonical correlation analysis. In 11th International Conference on Computer and Information Technology, Bangladesh (pp. 399-404). IEEE.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. and Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences, 96(12), 6745-6750.
Alzate, C. and Suykens, J. A. K. (2008). A regularized kernel CCA contrast function for ICA. Neural Networks, 21, 170-181.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337-404.
Bach, F. R. and Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1-48.
Castellanos-Garzón, J. and Romos, J. (2015).
Charpiat, G., Hofmann, M. and Schölkopf, B. (2015). Kernel methods in medical imaging. In Handbook of Biomedical Imaging (pp. 63-81). Berlin, Germany: Springer.
Chen, X. and Ishwaran, H. (2012). Random forests for genomic data analysis. Genomics, 99, 323-329.
Debruyne, M., Hubert, M. and Van Horebeek, J. (2010). Detecting influential observations in kernel PCA. Computational Statistics and Data Analysis, 54, 3007-3019.
Filzmoser, P., Maronna, R. and Werner, M. (2008). Outlier identification in high dimensions. Computational Statistics & Data Analysis, 52, 1694-1711.
Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L. and Brown, P. (2000). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 1(2).
Hofmann, T., Schölkopf, B. and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36, 1171-1220.
Huang, S. Y., Lee, M. and Hsiao, C. (2009). Nonlinear measures of association with kernel canonical correlation analysis and applications. Journal of Statistical Planning and Inference, 139, 2162-2174.
Richfield, O., Alam, M. A., Calhoun, V. and Wang, Y.-P. (2017). Learning schizophrenia imaging genetics data via multiple kernel canonical correlation analysis. In Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2016), Shenzhen, China.
Romanazzi, M. (1992). Influence in canonical correlation analysis. Psychometrika, 57(2), 237-259.
Ruan, L. and Yuan, M. (2011). An empirical Bayes' approach to joint analysis of multiple microarray gene expression studies. Biometrics, 67, 1617-1626.
Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J. and von Mering, C. (2015). STRING v10: protein-protein interaction networks, integrated over the tree of life. Nucleic Acids Research, 43, D447-D452.
Tanaka, Y. (1988). Sensitivity analysis in principal component analysis: influence on the subspace spanned by principal components. Communications in Statistics - Theory and Methods, 17(9), 3157-3175.
Tanaka, Y. (1989). Influence functions related to eigenvalue problems which appear in multivariate analysis. Communications in Statistics - Theory and Methods, 18(11), 3991-4010.