

Bioinformatics Advance Access originally published online on February 10, 2006
Bioinformatics 2006 22(8):919-923; doi:10.1093/bioinformatics/btl034

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Non-linear tests for identifying differentially expressed genes or genetic networks

Hao Xiong

Department of Computer Science, Texas A&M University 301 Harvey R. Bright Bldg, College Station, TX 77843-3112, USA


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: One of the recently developed statistics for identifying differentially expressed genetic networks is the Hotelling T² statistic. It is a quadratic form of the difference in linear functions of the means of gene expression between two types of tissue samples, so its power is limited.

Results: To improve the power of test statistics, a general statistical framework for the construction of non-linear tests is presented, and two specific non-linear test statistics that use non-linear transformations of means are developed. Asymptotic distributions of the non-linear test statistics under the null and alternative hypotheses are derived. It is proved that under some conditions the power of the non-linear test statistics is higher than that of the T² statistic. In addition to the theory, the non-linear test statistics are applied to two real datasets to evaluate their performance in practice. The preliminary results demonstrate that the P-values of the non-linear statistics for testing differential expression of the genetic networks are much smaller than those of the T² statistic. Furthermore, simulations show that the type I error rates of the non-linear statistics agree with the threshold used and that the statistics fit the χ² distribution.

Contact: hxiong@cs.tamu.edu

Supplementary information: Supplementary data are available on Bioinformatics online.


    1 INTRODUCTION
 
Microarray technology can simultaneously measure the expression levels of thousands or even tens of thousands of genes and produce an avalanche of data. It did not take long before scientists availed themselves of this valuable tool to study the variation of genome-wide gene expression across different tissue samples, experimental conditions or time points of a biological process (Brown and Botstein, 1999).

It is widely recognized that the conditions of the cell and cellular processes are influenced by a large number of genes interwoven into networks, rather than by a few genes (Strohman, 2002). Studying individual biological components alone is not sufficient to discover the rules underlying complex biological systems. Therefore, systems-level study of genetic networks and identification of differentially expressed genetic networks hold the key to unraveling the relationship between genotype and phenotype (Xiong et al., 2004; Khalil and Hill, 2005; Lu et al., 2005).

The ideal statistic for testing differentially expressed genetic networks should have high power while keeping the false positive rate at a specified level. One multivariate statistic for testing differentially expressed genetic networks is the Hotelling T² statistic (Anderson, 1984; Xiong et al., 2004; Lu et al., 2005). However, it is a quadratic function of the difference in the means of expression levels between two types of tissue samples (e.g. tumor and normal tissue samples), and that difference is a linear function of the expression levels. One strategy to improve the power of test statistics is to amplify the difference in the means of gene expression. A natural way to amplify such a difference is to transform the gene expression levels. It is not difficult to show that a non-singular linear transformation of gene expression leaves the value of the T² statistic unchanged. To overcome this problem, I propose to use non-linear transformations of the means of gene expression in normal tissue ($\mu_x$) and abnormal tissue ($\mu_y$), i.e. $f(\mu_x)$ and $f(\mu_y)$, expecting that statistics based on the difference $f(\mu_y) - f(\mu_x)$ will be more powerful than those based on $\mu_y - \mu_x$.
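The invariance of T² under linear transformation can be checked numerically. The following is a minimal sketch for a single gene, where T² reduces to the squared pooled two-sample t statistic. The expression values, the function names `t2_one_gene` and `linear_map`, and the constants a = 3 and b = -7 are illustrative assumptions and do not come from the paper.

```python
from statistics import mean, variance


def t2_one_gene(x, y):
    """Two-sample Hotelling T^2 for one gene, i.e. the squared pooled t statistic."""
    n, n_a = len(x), len(y)
    # Pooled variance: within-group sums of squares divided by n + n_a - 2.
    pooled = ((n - 1) * variance(x) + (n_a - 1) * variance(y)) / (n + n_a - 2)
    return (mean(y) - mean(x)) ** 2 / (pooled * (1 / n + 1 / n_a))


def linear_map(values, a=3.0, b=-7.0):
    """Apply the linear transformation a * v + b (a and b are arbitrary, a != 0)."""
    return [a * v + b for v in values]


# Made-up expression values, for illustration only.
normal = [2.1, 1.8, 2.5, 2.0, 2.3]
abnormal = [2.9, 3.1, 2.6, 3.4, 3.0]

before = t2_one_gene(normal, abnormal)
after = t2_one_gene(linear_map(normal), linear_map(abnormal))
print(before, after)  # the two values agree up to rounding error
```

The factor a² appears in both the squared mean difference and the pooled variance, so it cancels, and the shift b drops out of both. This is why only a non-linear transformation can change the statistic.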

The main purpose of this report is to develop a statistical framework for constructing non-linear statistics for testing differentially expressed genes or genetic networks, and to propose several non-linear statistics for gene expression data analysis. To do so, I first investigate the properties of non-linear transformations, then study how to construct test statistics based on them, and derive the asymptotic distributions of the non-linear test statistics under the null and alternative hypotheses. Since different non-linear tests may have different power, the selection of non-linear statistics is critical to the successful application of non-linear tests to gene expression data analysis. I compare the power of several non-linear test statistics and the Hotelling T² statistic. Finally, to evaluate their performance, the non-linear test statistics are applied to two real gene expression datasets, and simulations are run to obtain the type I error rates and the distributions of the statistics.


    2 METHODS
 
2.1 Non-linear functions of sample means of gene expressions and their distributions
For convenience of theoretical analysis, I assume that the number of tissue samples is large enough for large-sample theory to apply to gene expression data analysis, although in practice this assumption often does not hold. Under this assumption, the sample means of gene expression are asymptotically normally distributed. Let $\bar{X}$ be the vector of sample means of gene expression in normal tissue. Then $\bar{X}$ asymptotically has a normal distribution $N(\mu_x, \frac{1}{n}\Sigma_X)$, where

$$\mu_x = E[X], \qquad \Sigma_X = E\left[(X - \mu_x)(X - \mu_x)^T\right],$$

and n is the number of normal tissue samples. Therefore, the distribution of non-linear functions of sample means can be derived from the asymptotic theory of non-linear functions of normal random vectors, a classical result in statistical inference (Serfling, 1980). By this theory, under some regularity conditions on the non-linear functions f, $f(\bar{X})$ is asymptotically distributed as the multivariate normal distribution

$$N\left(f(\mu_x), \ \frac{1}{n} B \Sigma_X B^T\right),$$

where $B = \partial f(\mu_x) / \partial \mu_x^T$ is the Jacobian matrix of the vector of functions $f(x) = [f_1(x), f_2(x), \ldots]^T$. The difference in the non-linear transformations of the means of gene expression between abnormal and normal tissue samples, $f(\bar{Y}) - f(\bar{X})$, is asymptotically distributed as the multivariate normal distribution

$$N\left(f(\mu_y) - f(\mu_x), \ \frac{1}{n_A} C \Sigma_Y C^T + \frac{1}{n} B \Sigma_X B^T\right),$$

where $C = \partial f(\mu_y) / \partial \mu_y^T$ and $\Sigma_Y$ are the Jacobian and covariance matrices for abnormal tissue, $B$ and $\Sigma_X$ are defined as before, and $n_A$ and $n$ are the numbers of abnormal and normal tissue samples, respectively.
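The large-sample variance of a transformed sample mean can be illustrated by simulation. The sketch below assumes a single gene, the quadratic transformation g(x) = x², and made-up values µ = 3, σ = 1 and n = 50; none of these settings come from the paper. It compares the empirical variance of g(X̄) over repeated samples with the first-order approximation g'(µ)² σ² / n.

```python
import random
from statistics import mean, variance


def transformed_means(g, mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size n from N(mu, sigma^2) and return g(sample mean) for each."""
    rng = random.Random(seed)
    return [g(mean([rng.gauss(mu, sigma) for _ in range(n)])) for _ in range(reps)]


mu, sigma, n = 3.0, 1.0, 50        # illustrative values, not taken from the paper
g = lambda v: v * v                 # quadratic transformation
g_prime = lambda v: 2.0 * v         # its derivative (the 1 x 1 Jacobian B)

draws = transformed_means(g, mu, sigma, n, reps=4000)
empirical_var = variance(draws)
delta_var = g_prime(mu) ** 2 * sigma ** 2 / n   # (1/n) B Sigma_X B^T for one gene
print(empirical_var, delta_var)
```

The two printed values should be close for these settings. The approximation relies on a non-zero derivative at the mean, so it degrades when g'(µ) is near zero, for example for the quadratic transformation with µ close to 0.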

The principle behind the T² test statistic in gene expression data analysis is to compare the means of gene expression in abnormal and normal samples. Amplifying that difference may improve the power to identify differentially expressed genes or genetic networks. One strategy to amplify the difference is to transform the means of gene expression non-linearly so that the difference after transformation is larger. The goal is therefore to search for such non-linear transformations. With this goal in mind, I have investigated how the difference of non-linear transformations of the means can amplify the difference of the means of gene expression.

Define

$$H_i = \frac{\partial^2 f_i(x)}{\partial x \, \partial x^T},$$

the Hessian matrix of the non-linear function $f_i(x)$, and $H = [H_1, H_2, \ldots]$, an array. Then by Taylor expansion, we have (Bates and Watts, 1980)

$$f(\mu_y) - f(\mu_x) \approx \left[B + \tfrac{1}{2} H (\mu_y - \mu_x)\right](\mu_y - \mu_x), \qquad (1)$$

where $H(\mu_y - \mu_x)$ denotes the matrix whose i-th row is $(\mu_y - \mu_x)^T H_i$. From Equation (1), the difference in the non-linear functions of the means of gene expression between abnormal and normal tissues depends on the Jacobian and Hessian matrices of the non-linear functions. If the norm of the coefficient matrix of the vector $\mu_y - \mu_x$ satisfies

$$\left\| B + \tfrac{1}{2} H (\mu_y - \mu_x) \right\| > 1,$$

then

$$\left\| f(\mu_y) - f(\mu_x) \right\| > \left\| \mu_y - \mu_x \right\|,$$

which implies that, under this condition, the norm of the difference in the non-linear functions of the means of gene expression between abnormal and normal tissues is larger than that of the original difference in the means. The matrix $B + \tfrac{1}{2} H (\mu_y - \mu_x)$ characterizes the strength of non-linearity of the non-linear functions (Bates and Watts, 1980) and hence provides information for searching for non-linear functions that can be used to construct non-linear test statistics with high power.
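As a worked illustration of Equation (1), consider a single gene and the quadratic transformation g(x) = x². The numerical values below are chosen only for illustration. Because the third and higher derivatives of g vanish, the second-order expansion is exact in this case:

```latex
\begin{align*}
B &= g'(\mu_x) = 2\mu_x, \qquad H = g''(\mu_x) = 2, \\
g(\mu_y) - g(\mu_x) &= \mu_y^2 - \mu_x^2
  = \bigl[\, 2\mu_x + \tfrac{1}{2} \cdot 2\,(\mu_y - \mu_x) \bigr](\mu_y - \mu_x)
  = (\mu_x + \mu_y)(\mu_y - \mu_x), \\
\mu_x = 2,\ \mu_y = 3 &: \quad \mu_y - \mu_x = 1, \qquad g(\mu_y) - g(\mu_x) = 9 - 4 = 5 .
\end{align*}
```

Here the coefficient is µx + µy, so the difference is amplified whenever |µx + µy| > 1 and shrunk otherwise (for µx = 0.2 and µy = 0.5 the transformed difference is 0.21, smaller than 0.3). The variance of the transformed means is rescaled by the same kind of factor, so the effect on power is governed by the non-centrality parameter rather than by the size of the difference alone.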

2.2 Test statistics
The results on non-linear functions of asymptotically normal random vectors can be used to construct non-linear test statistics for testing differential expression of genes or genetic networks. The quadratic form $X^T C X$ of asymptotically normal random vectors provides a statistical framework for the construction of test statistics.

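Suppose that there are k genes in the pathway being tested. Let $x_{ij}$ be the expression of the j-th gene in the i-th normal tissue sample and $y_{ij}$ be the expression of the j-th gene in the i-th abnormal tissue sample. Define

$$\bar{X}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad \bar{Y}_j = \frac{1}{n_A} \sum_{i=1}^{n_A} y_{ij}, \qquad \bar{X} = [\bar{X}_1, \ldots, \bar{X}_k]^T, \qquad \bar{Y} = [\bar{Y}_1, \ldots, \bar{Y}_k]^T.$$

The pooled-sample variance-covariance matrix of the gene expression levels is defined as

$$S = \frac{1}{n + n_A - 2} \left[ \sum_{i=1}^{n} (x_i - \bar{X})(x_i - \bar{X})^T + \sum_{i=1}^{n_A} (y_i - \bar{Y})(y_i - \bar{Y})^T \right],$$

where $x_i = [x_{i1}, \ldots, x_{ik}]^T$ and $y_i = [y_{i1}, \ldots, y_{ik}]^T$. Let $\hat{\Lambda}$ be an estimator of the matrix $\Lambda$, where $\Lambda = (1/n_A) C \Sigma_Y C^T + (1/n) B \Sigma_X B^T$ as defined before. Under the null hypothesis, we have $B = C$ and $\Sigma_X = \Sigma_Y$. The estimator $\hat{\Lambda}$ can be obtained by substituting the pooled-sample estimate of the covariance matrices, $S$, and the pooled estimate of the Jacobian matrices, $\hat{B}$, into the equation defining the matrix $\Lambda$. For convenience, the estimator $\hat{\Lambda}$ under the null hypothesis will be denoted by $\hat{\Lambda}_0$. The non-linear statistic can be defined as

$$T_N = \left[ f(\bar{Y}) - f(\bar{X}) \right]^T \hat{\Lambda}_0^{-} \left[ f(\bar{Y}) - f(\bar{X}) \right], \qquad (2)$$

where $\hat{\Lambda}_0^{-}$ is the generalized inverse of the matrix $\hat{\Lambda}_0$. Let $r = \mathrm{rank}(\hat{\Lambda}_0)$. It can be shown (Greenwood and Nikulin, 1996) that under the null hypothesis of no differential expression of the gene or genetic network, i.e. $H_0: \mu_x = \mu_y$, the statistic $T_N$ is asymptotically distributed as a central $\chi^2_{(r)}$ distribution, and under the alternative hypothesis $H_a: \mu_x \neq \mu_y$ the statistic $T_N$ is asymptotically distributed as a non-central $\chi^2_{(r)}$ distribution with the following non-centrality parameter:

$$\lambda = \left[ f(\mu_y) - f(\mu_x) \right]^T \Lambda^{-} \left[ f(\mu_y) - f(\mu_x) \right]. \qquad (3)$$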
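For a single gene the statistic in Equation (2) is a scalar with one degree of freedom, which allows a short sketch using only the standard library. The code below is one possible reading of the construction, not the author's implementation: it assumes the Jacobian is evaluated at the pooled sample mean and the variance is the pooled sample variance, as the null hypothesis suggests. The function name `nonlinear_stat_one_gene` and the data are hypothetical.

```python
import math
from statistics import mean, variance


def nonlinear_stat_one_gene(x, y, g, g_prime):
    """Return (T_N, p-value) for one gene, given a transformation g and its derivative."""
    n, n_a = len(x), len(y)
    # Assumption: pooled estimates are used for both the mean and the variance under H0.
    pooled_mean = (sum(x) + sum(y)) / (n + n_a)
    pooled_var = ((n - 1) * variance(x) + (n_a - 1) * variance(y)) / (n + n_a - 2)
    # Scalar version of Lambda under H0: (1/n_A + 1/n) * B * S * B.
    lam0 = g_prime(pooled_mean) ** 2 * pooled_var * (1 / n_a + 1 / n)
    stat = (g(mean(y)) - g(mean(x))) ** 2 / lam0
    # Upper tail of a central chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up expression values, for illustration only.
normal = [2.1, 1.8, 2.5, 2.0, 2.3, 1.9]
abnormal = [2.9, 3.1, 2.6, 3.4]

print(nonlinear_stat_one_gene(normal, abnormal, lambda v: v * v, lambda v: 2 * v))
```

Passing the identity transformation (g(v) = v with derivative 1) recovers the squared pooled t statistic, which is a convenient sanity check. With k genes the same steps apply with matrices, and the p-value is taken from a χ² distribution with r degrees of freedom.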
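Now we consider a special vector-valued non-linear function. Let $g(x)$ be a real-valued non-linear function that has a non-zero derivative at the mean $E[x] = \mu$. Define

$$f(\mu_x) = [g(\mu_{1x}), \ldots, g(\mu_{kx})]^T, \qquad f(\mu_y) = [g(\mu_{1y}), \ldots, g(\mu_{ky})]^T,$$

where $\mu_x = [\mu_{1x}, \ldots, \mu_{kx}]^T$ and $\mu_y = [\mu_{1y}, \ldots, \mu_{ky}]^T$. Thus, we have

$$f(\mu_y) - f(\mu_x) = [g(\mu_{1y}) - g(\mu_{1x}), \ldots, g(\mu_{ky}) - g(\mu_{kx})]^T. \qquad (4)$$

Under this definition, the Jacobian matrices B and C have the following simple forms:

$$B = \mathrm{diag}\left(g'(\mu_{1x}), \ldots, g'(\mu_{kx})\right), \qquad C = \mathrm{diag}\left(g'(\mu_{1y}), \ldots, g'(\mu_{ky})\right). \qquad (5)$$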
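When the same scalar function g is applied to each gene, the Jacobian is diagonal with the derivatives of g on the diagonal. A minimal sketch of this construction follows. The quadratic pair is g(x) = x²; the Gaussian-type pair g(x) = exp(-x²/2) is a hypothetical stand-in, since the exact form used in Table 1 is not reproduced in this text. The helper names are illustrative.

```python
import math


def elementwise(g, mu):
    """Apply the scalar transformation g to every component of the mean vector."""
    return [g(m) for m in mu]


def diagonal_jacobian(g_prime, mu):
    """Jacobian of the elementwise map: g'(mu_j) on the diagonal, zeros elsewhere."""
    k = len(mu)
    return [[g_prime(mu[i]) if i == j else 0.0 for j in range(k)] for i in range(k)]


# (g, g') pairs. The Gaussian-type form is an assumption made for illustration.
quadratic = (lambda x: x * x, lambda x: 2.0 * x)
gaussian = (lambda x: math.exp(-x * x / 2), lambda x: -x * math.exp(-x * x / 2))

mu_x = [1.0, 2.0, 3.0]  # made-up means for three genes
print(elementwise(quadratic[0], mu_x))
print(diagonal_jacobian(quadratic[1], mu_x))
```

Because the Jacobian is diagonal, the covariance of the transformed means is obtained by scaling entry (i, j) of the original covariance by g'(µi) g'(µj).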
The test statistic $T_N$ in Equation (2) defines a class of non-linear tests. Various non-linear functions satisfying some regularity conditions can be used to construct the test statistics. Table 1 lists some of the non-linear functions used in this study and their corresponding derivatives.


Table 1 Some of the non-linear transformations

 
2.3 Comparison of the power of the non-linear test statistics and Hotelling's T² statistic by approximate formula
To evaluate the performance of non-linear statistics for testing differential expression of genes or genetic networks, we need to compare the power of the non-linear test statistics with that of Hotelling's T² statistic. Calculation of the power of a non-linear test statistic is based on computation of the non-centrality parameter in Equation (3). The non-centrality parameters can be approximated by Taylor expansion. In the Supplementary material, I show that under some conditions the non-centrality parameters of the non-linear statistics are larger than that of Hotelling's T² statistic. This means that under these conditions the power of the non-linear test statistics is higher than that of Hotelling's T² test statistic.


    3 RESULTS
 
3.1 Null distribution of the non-linear test statistics
In the previous sections I have shown that when the sample size is large enough for large-sample theory to apply, the distribution of the non-linear statistics under the null hypothesis of no differential expression is asymptotically a central χ² distribution. To examine the validity of this statement, I performed a series of simulation studies. Two datasets were used for the simulations: the expression profiles of seven genes in invasive lobular and ductal carcinomas of the breast, with 38 invasive ductal carcinoma (IDC) and 21 invasive lobular carcinoma (ILC) patients (Zhao et al., 2004), and the expression profiles of four genes in 72 lung neuroendocrine tumor samples and 19 normal samples from the GEO database at http://www.ncbi.nlm.nih.gov/geo/gds/gds_browse.cgi?gds=619. The samples of the two types of breast cancer, and the lung tumor and normal samples, were randomly permuted, and 100 000 simulations were run. In each simulation, the non-linear statistics were calculated. Figure 1A and B show the histograms of the statistics based on the quadratic and Gaussian transformations, respectively, applied to the breast cancer samples, with the theoretical central χ² distributions superimposed. Figure 2A and B show the corresponding histograms for the lung samples. It can be seen that the distributions of the non-linear statistics are similar to the theoretical central χ² distributions even with modest tissue sample sizes. Table 2 summarizes the type I error rates of the non-linear test statistics and the Hotelling T² statistic applied to the simulated breast cancer samples and lung samples. It shows that the estimated type I error rates of the non-linear test statistics were not appreciably different from the nominal levels α = 0.05 and α = 0.01. These results demonstrate that the tests based on quadratic and Gaussian transformations of the expression levels remain valid even for reasonably small sample sizes.
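The permutation procedure described above can be sketched for a single gene as follows. This is an illustration under stated assumptions, not the simulation code used in the paper: the data are synthetic draws from N(5, 1), the group sizes 38 and 21 mirror the breast cancer dataset, only 2000 permutations are run instead of 100 000, and the statistic uses pooled estimates with the quadratic transformation. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def quadratic_stat(x, y):
    """One-gene non-linear statistic with g(v) = v**2 and pooled estimates under H0."""
    n, n_a = len(x), len(y)
    pooled_mean = (sum(x) + sum(y)) / (n + n_a)
    pooled_var = ((n - 1) * variance(x) + (n_a - 1) * variance(y)) / (n + n_a - 2)
    lam0 = (2 * pooled_mean) ** 2 * pooled_var * (1 / n + 1 / n_a)
    return (mean(y) ** 2 - mean(x) ** 2) ** 2 / lam0


def permutation_type1_rate(values, n, cutoff, reps, seed=0):
    """Shuffle group labels `reps` times and return the fraction of statistics above cutoff."""
    rng = random.Random(seed)
    pool = list(values)
    rejections = 0
    for _ in range(reps):
        rng.shuffle(pool)
        if quadratic_stat(pool[:n], pool[n:]) > cutoff:
            rejections += 1
    return rejections / reps


alpha = 0.05
# Critical value of a central chi-square with 1 df, via the normal quantile.
cutoff = NormalDist().inv_cdf(1 - alpha / 2) ** 2

data_rng = random.Random(1)
values = [data_rng.gauss(5.0, 1.0) for _ in range(59)]  # synthetic, 38 + 21 samples
rate = permutation_type1_rate(values, n=38, cutoff=cutoff, reps=2000)
print(rate)
```

The estimated rate should lie near the nominal level for these settings. The synthetic mean is deliberately kept away from zero, because the derivative of the quadratic transformation vanishes there and the χ² approximation is then not expected to hold.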


Fig. 1 (A) Null distribution of the non-linear statistic based on quadratic transformation applied to breast cancer samples. (B) Null distribution of the non-linear statistic based on Gaussian transformation applied to breast cancer samples.

 

Fig. 2 (A) Null distribution of the non-linear statistic based on quadratic transformation applied to lung samples. (B) Null distribution of the non-linear statistic based on Gaussian transformation applied to lung samples.

 

Table 2 Estimated type I error rates of non-linear test statistics and the T² statistic (100 000 simulations)

 
3.2 Power of the non-linear test statistics and Hotelling's test statistic
In Section 2.3 I showed, using an approximation approach, that under some conditions the power of some non-linear statistics is higher than that of the Hotelling T² statistic. Here exact analytic methods are used to calculate their power. The power of two non-linear statistics based on quadratic and Gaussian functions and of the Hotelling T² statistic at the significance level α = 0.001 is shown in Figure 3. I assume that the variances $\sigma_x^2$ and $\sigma_y^2$ of the gene expression in the normal and abnormal tissue samples are equal to 5 and 1, respectively, and that the numbers of both normal and abnormal tissue samples are equal to 100. In Figure 3, for ease of presentation, I consider the expression of only one gene. Figure 3 plots the power of the test statistics as a function of the difference in gene expression between normal and abnormal tissues, and demonstrates that the non-linear test statistics in general have higher power than Hotelling's T² statistic. The difference in power between the non-linear statistics and Hotelling's T² statistic increases as the difference in gene expression levels between the abnormal and normal tissue samples increases.
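A power calculation of this kind can be sketched for one gene with the standard library alone, because a non-central χ² variable with one degree of freedom is the square of a normal variable with unit variance. The sketch below uses the variances (5 and 1), sample sizes (100 each) and significance level (0.001) stated above. The baseline mean µx = 10, the grid of differences and the function names are assumptions made for illustration, and the output is not claimed to reproduce Figure 3.

```python
from statistics import NormalDist

_STD = NormalDist()


def chi2_1_power(lam, alpha):
    """P(non-central chi-square, 1 df, non-centrality lam > central critical value)."""
    z = _STD.inv_cdf(1 - alpha / 2)   # square root of the central critical value
    d = lam ** 0.5
    # If Z ~ N(d, 1), then P(Z^2 > z^2) = P(Z > z) + P(Z < -z).
    return _STD.cdf(d - z) + _STD.cdf(-d - z)


def noncentrality_one_gene(mu_x, mu_y, var_x, var_y, n, n_a, g, g_prime):
    """Scalar version of Equation (3), with B = g'(mu_x) and C = g'(mu_y)."""
    lam_matrix = g_prime(mu_y) ** 2 * var_y / n_a + g_prime(mu_x) ** 2 * var_x / n
    return (g(mu_y) - g(mu_x)) ** 2 / lam_matrix


alpha, n, n_a = 0.001, 100, 100
var_x, var_y = 5.0, 1.0
mu_x = 10.0                                   # assumed baseline mean (not given in the text)
for delta in (0.5, 1.0, 1.5):                 # assumed grid of mean differences
    linear = noncentrality_one_gene(mu_x, mu_x + delta, var_x, var_y, n, n_a,
                                    lambda v: v, lambda v: 1.0)
    quadratic = noncentrality_one_gene(mu_x, mu_x + delta, var_x, var_y, n, n_a,
                                       lambda v: v * v, lambda v: 2.0 * v)
    print(delta, chi2_1_power(linear, alpha), chi2_1_power(quadratic, alpha))
```

The identity transformation is used here as the linear reference. With equal sample sizes its non-centrality parameter coincides with that of the pooled-variance T² statistic for one gene.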


Fig. 3 Power of the non-linear statistics based on the quadratic and Gaussian transformations and Hotelling's T² statistic with a significance level α = 0.001 as a function of difference in gene expression between the abnormal and normal tissue samples.

 
3.3 Real data examples
To further evaluate the performance of the non-linear test statistics, two real datasets were used. One dataset was the gene expression profiles of invasive lobular and ductal carcinomas of the breast (Zhao et al., 2004). There were 38 IDC and 21 ILC patients, with over 42 000 genes profiled using cDNA. Specimens were separated by the modified Scarff-Bloom-Richardson method. The non-linear statistics and Hotelling's T² statistic were applied to this dataset to test for differential expression of two pathways, the cell cycle regulation pathway and the MKKK pathway, between IDC samples and ILC samples. For better comparison, I have taken care that the number of genes here matches the number of genes used in calculating the type I error rates. The cell cycle regulation pathway includes seven genes: CDC2, CDK7, HK2, KRAS2, PLK1, PRKCA and STAT1. The MKKK pathway includes seven genes: DLK1, MAP3K14, MAP3K3, MAP3K4, MAP3K7, MAP4K3 and MST1. Table 3 lists the P-values of the T², quadratic and Gaussian non-linear statistics for testing differential expression of the cell cycle regulation and MKKK pathways. In general, the non-linear statistics show stronger significance.


Table 3 P-values of T² and non-linear statistics for testing differential expressions of the cell cycle regulation and MKKK pathways between IDC samples and ILC samples

 
The other dataset contained gene expression profiles of lung neuroendocrine tumors and normal lung tissue. There were 72 tumor samples of various types and 19 normal samples from the GEO database. Table 4 lists the P-values of Hotelling's T², quadratic and Gaussian non-linear statistics for testing differential expression of the P53 pathway and the Androgen pathway. The P53 pathway includes nine genes: BAX, BCL2, CDKN1A, HIF1A, HSPA4, IGFBP3, MDM2, TNFRSF10B and TRAF1. The Androgen pathway includes four genes: CDK2, EGFR, FOLH1 and TMEPAI. Again the number of genes here matches that used for calculating the type I error rates. The results show that the P-values of the non-linear statistics were smaller than those of Hotelling's T².


Table 4 P-values of T² and non-linear statistics for testing differential expressions of P53 pathway and Androgen pathway between the lung cancer samples and normal samples

 

    4 DISCUSSION
 
A key issue in gene expression data analysis is to identify differentially expressed genetic networks. To improve on the power of the Hotelling T² statistic for testing differential expression of genetic networks, I have developed a general statistical framework for non-linear tests and have provided basic procedures for constructing test statistics using non-linear transformations of the means of gene expression. I have presented two non-linear test statistics for testing differential expression of genetic networks between two types of tissue samples.

In this report, I have derived the asymptotic distributions of the non-linear test statistics under the null and alternative hypotheses. To reveal the relationship between the power of linear test statistics and non-linear statistics, I have approximated the non-centrality parameter of the non-linear test statistics and shown how it depends on the measure of non-linearity of the functions. I have also demonstrated that under some conditions the power of the non-linear test statistics is higher than that of the Hotelling T² statistic. The power of test statistics is a complicated issue. It depends on many parameters, such as the difference in population means of gene expression between the two types of tissue samples, the variance of gene expression, the number of tissue samples and the measure of non-linearity of the non-linear functions, so it is difficult to find statistics that are uniformly most powerful. To further evaluate their performance, the proposed non-linear test statistics were applied to two real datasets.

The differential expression of a genetic network is a property of the network as a whole, owing perhaps to differential expression of some individual genes in the network, or to other factors such as gene–gene interaction. This report shows that non-linear transformations can provide higher power and can more conclusively demonstrate differentiation of tumor and normal tissues without a high false positive rate, which is their main advantage. Because of this enhanced power, cases that might otherwise have been missed can emerge and be investigated further. Non-linear tests thus offer a better one-step measure for testing genetic networks for differentiation.

The results in this report are thus far limited. Theoretical and empirical studies should be conducted to compare and investigate the relative strengths and weaknesses of non-linear statistics and other existing statistics for identifying differentially expressed genetic networks. This report presents only two non-linear statistics; it is worthwhile to investigate other non-linear statistics and to develop general procedures for searching for optimal non-linear statistics with the highest power. Non-linear tests are powerful tools, particularly for identifying differentially expressed genetic networks. However, the theory of non-linear tests has not been fully developed, and non-linear statistics have not been applied to large datasets. Considerable theoretical work and empirical evaluation of non-linear tests for gene expression data analysis are urgently needed.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin John Bishop

Received on September 19, 2005; revised on January 23, 2006; accepted on January 31, 2006

    REFERENCES
 

    Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, 2nd edn. John Wiley & Sons, New York, NY.

    Bates, D.M. and Watts, D.G. (1980) Relative curvature measure of nonlinearity. J. R. Statist. Soc. B, 42, 1–25.

    Brown, P.O. and Botstein, D. (1999) Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21 (Suppl. 1), 33–37.

    Greenwood, P.E. and Nikulin, M.S. (1996) A Guide to Chi-Squared Testing. John Wiley & Sons, New York.

    Khalil, I.G. and Hill, C. (2005) Systems biology for cancer. Curr. Opin. Oncol., 17, 44–48.

    Lu, Y., et al. (2005) Hotelling's T² multivariate profiling for detecting differential expression in microarrays. Bioinformatics, 21, 3105–3113.

    Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York.

    Strohman, R. (2002) Maneuvering in the complex path from genotypes to phenotype. Science, 296, 701–703.

    Xiong, M.M., et al. (2004) Identification of genetic networks. Genetics, 166, 1037–1052.

    Zhao, H., et al. (2004) Different gene expression patterns in invasive lobular and ductal carcinomas of the breast. Mol. Biol. Cell, 15, 2523–2536.




This article has been cited by other articles:


D. Ucar, I. Neuhaus, P. Ross-MacDonald, C. Tilford, S. Parthasarathy, N. Siemers, and R.-R. Ji
Construction of a reference gene association network from multiple profiling data: application to data analysis
Bioinformatics, October 15, 2007; 23(20): 2716 - 2724.