Skip Navigation


Bioinformatics Advance Access originally published online on February 24, 2006
Bioinformatics 2006 22(9):1103-1110; doi:10.1093/bioinformatics/btl053
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/9/1103    most recent
btl053v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Prieto, C.
Right arrow Articles by De Las Rivas, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Prieto, C.
Right arrow Articles by De Las Rivas, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Algorithm to find gene expression profiles of deregulation and identify families of disease-altered genes

C. Prieto 1, M.J. Rivas 2, J.M. Sánchez 2, J. López-Fidalgo 2 and J. De Las Rivas 1,*

1 Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CIC USAL-CSIC) Salamanca, Spain
2 Department of Statistics, Faculty of Science (USAL) Salamanca, Spain

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 

Motivation: Alteration of gene expression often results in up- or down-regulated genes and the most common analysis strategies look for such differentially expressed genes. However, molecular disease mechanisms typically constitute abnormalities in the regulation of genes producing strong alterations in the expression levels. The search for such deregulation states in the genomic expression profiles will help to identify disease-altered genes better.

Results: We have developed an algorithm that searches for the genes which present a significant alteration in the variability of their expression profiles, by comparing an altered state with a control state. The algorithm provides groups of genes and assigns a statistical measure of significance to each group of genes selected. The method also includes a prefilter tool to select genes with a threshold of differential expression that can be set by the user ad casum. The method is evaluated using an experimental set of microarrays of human control and cancer samples from patients with acute promyelocytic leukemia.

Availability: The method is implemented in an R package called AlteredExpression available in http://bioinfow.dep.usal.es/Altered-Expression/ and will be included in the Bioconductor project.

Contact: jrivas{at}usal.es


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 
The development of genomic technology has provided the means to interrogate at once thousands of genes in biological samples. In this respect, the advance of high-density oligonucleotide microarrays has become one of the most powerful tools that allow testing the expression state of thousands of genes at a genome-wide scale. At present, a critical bottleneck for this kind of studies is the analysis of the huge amount of data that these microarrays provide (Marshall, 2004). Bioinformatic tools, based on robust computational and statistical studies, are needed to obtain correct and comparable expression profiles that characterize gene families or groups of genes involved in common biological functions.

Gene expression is a tightly regulated process, crucial for the proper functioning of a cell. In microarray data, common regulation is reflected by strong correlations between expression levels (Eisen et al., 1998), and it is usual that genes of similar function yield similar expression profiles across a diverse range of conditions (Segal et al., 2003; Stuart et al., 2003). Molecular disease mechanisms typically constitute abnormalities in the regulation of genes producing strong alterations in the expression levels (Rhodes et al., 2002). The resulting changes in expression profiles can help to identify disease-related genes (Golub et al., 1999), and can also facilitate improved diagnosis and even prognosis of disease outcome (West et al., 2001). Alteration of gene regulation often results in expression profiles with large variability. However, common analysis strategies (Tusher et al., 2001; Efron et al., 2001) look only for differentially expressed genes (overexpressed or represed), but do not explore the variability that appears in an altered state when compared with a control state.

We have developed a new algorithm that is able to analyze and compare the expression profiles of genes from control samples versus disease altered samples and finds those genes that suffer a strong alteration in variability or de-regulation. The algorithm is specially adapted and developed for high-density oligonucleotide microarrays, particularly GeneChips manufactured by Affymetrix. It also includes a method to preselect differentially expressed genes. The method provides a list of well-defined groups of de-regulated genes assigning a statistical score of significance to each group. The method is evaluated in this paper using an experimental set of 16 microarrays: 6 controls and 10 cancer samples.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 
2.1 Background correction and normalization
A key step before any analysis of gene expression between different samples is to achieve an adequate background correction and normalization of the expression signals done for all the microarrays that are going to be analyzed. Both background correction and normalization have to be done at internal level (intra-microarrays) and at comparative level (inter-microarrays). Several algorithms have been published to calculate the expression signal from raw data of Affymetrix microarrays. A statistical algorithm designed and recommended by Affymetrix called MAS5 (Liu et al., 2002), a model-based algorithm called MBEI (Li and Wong, 2001) and a multiple average algorithm called RMA (Irizarry et al., 2003a). Comparisons of the normalization methods used by these algorithms show that MAS5 does not achieve an adequate multiple chip normalization and RMA is the method that provides the best precision in signal detection, especially with the difficult genes that present low levels of expression (Irizarry et al., 2003a; Barash et al., 2004). RMA uses an efficient quantile normalization of the distribution of probe intensities for each array in a set of arrays (Bolstad et al., 2003). Following these publications and using Bioconductor and R as computational tools, we use RMA for background correction, normalization and expression calculation of a set of 16 microarrays from clinical samples. The set corresponds to samples coming from human bone marrow biopsies from 16 different individuals: 6 healthy persons and 10 patients with acute promyelocytic leukemia (APL). This is a multiple and complex set of microarrays that we used in this paper for the evaluation and validation of the algorithm presented. The complexity of the samples can be seen in Figure 1a that plots the correlation of expression values for each gene (i.e. each probeset in the microarrays) in a subset of six samples: three patients and three controls. The comparisons done are nine pairwise comparisons of three patients versus three controls (grey points); nine pairwise comparisons of three controls versus other three controls (black points) and one comparison of a control against itself (middle line). The black points reflect the high degree of biological variability between individuals in these clinical samples, but thanks to a correct normalization it is possible to see that the greater differences correspond to the comparisons of the expression values of patients versus controls. The spots in the grey region will be the ones that best define the differences between disease samples and control samples. Figure 1b and c present the box-plots and the density-plots of the distributions of expression values from the three control samples and three patient samples that were compared in Figure 1a. The numbers correspond to log2-transformed data. It can be observed in the density-plots that the distributions of expression values derived from Affymetrix microarrays do not follow a normal (i.e. Gaussian) distribution. Another observation is the small differences in the expression distributions between different samples, even between control samples and patient samples. This fact indicates a good normalization and makes it possible to proceed into the next step of the analysis to find de-regulated genes. The normalization shown for the subset of samples in Figure 1 was evenly done at once for all the 16 samples of the study.


Figure 1
View larger version (16K):
[in this window]
[in a new window]
 
Fig. 1 Representation of the expression data from several normalized microarrays from control samples (C) and altered samples (P) from patients with APL. (a) Correlation plot of the expression values of three controls versus three patients (grey spots), three controls versus three other controls (black spots), one control against itself (middle line). (b) Box plots showing the data distributions of three controls (C1, C2 and C3) and three patients (P1, P2 and P3). (c) Density plots of the added distributions of three controls (black line) and three patients (grey line). The box below presents a summary of the statistical parameters calculated for the expression data of a control sample (C3, first row) and of a patient sample (P3, second row).

 
2.2 Score function for altered expression and statistical adjustment to F distribution
The interest of the work is to find groups of genes that present a similar expression pattern in control samples but suffer a distinct alteration of their expression profiles in the query samples, i.e. the disease or altered samples. Each experimental set of microarrays, involving I genes from J samples, can be mathematically assigned to a |I| x |J|-dimensional matrix I x J = M = (eij), where eij is the intensity of the ith gene for the jth sample. In this way, each gene has an expression profile along the samples that defines its state. For a subset of genes G {subseteq} I and a subset of samples S {subseteq} J, the submatrix G x S represents the intensities of those genes for those samples. Owing to the fact that the aim is to compare only two states within the samples, the standard or control state (i.e. control samples) versus the disease or altered state (i.e. altered samples), the samples set are always divided in two subsets: Sa (altered samples) and Sc (control samples); so, |J| = |Sa| + |Sc|.

For each expression value eij, once normalized and the background corrected, we can consider an additive model that conceptually divides this expression value in a set of four different additive elements: the gene effects ({alpha}i), the sample effects (ßj), a constant value common to all the data in a set ({gamma}) and the errors ({varepsilon}ij); so that

Formula
The usual assumption done in most of the algorithms that handle expression values is that the errors are independent and its variances are equal along genes and samples. Under this model the maximum-likelihood estimators of the additive elements described above will be for global common effects ({gamma}) the global mean of all the expression values for samples and genes (Formula); for gene effects ({alpha}i) the difference between the means of the expression values for different samples in a gene i (Formula) and the global mean (Formula) and for sample effects (ßj) the difference between the means of the expression values for different genes in a sample j (Formula) and the global mean (Formula). So the maximum-likelihood estimators for the errors {varepsilon}ij = eij{alpha}ßj {gamma} will be

Formula
Using these estimators, the aim is to compare all the expression values eij of the matrix M to find the variability and correlation between genes and samples. To find such variability of expression values, a statistical parameter of variability should be calculated. An obvious parameter is the standard correlation coefficient but several authors have shown that the residual variance (RV) allows a more efficient search for high-scoring gene sets (Cheng and Church, 2000; Kostka and Spang, 2004). As these authors, we choose the RV(G, S) as the parameter to measure the variability within a certain subset of genes G along a certain subset of samples S. This parameter is defined as

Formula
From the linear model, the former summatory follows a Pearson's {chi}2 distribution with (|G| – 1) and (|S| – 1) degrees of freedom (Montgomery, 2000; Draghici, 2003). In this way, the RV is the ratio between a Pearson's {chi}2 distribution and its degrees of freedom, and it is calculated for the subset of genes G and samples S. Next step in the algorithm is to implement the ability to compare RV for the two different and independent subsets of samples, altered and control (Sa and Sc), in a given subset of genes G. For both subsets we calculate each RVc = RV(G, Sc) and RVa = RV(G, Sa), and the comparison can be done by the relative residual variance (RRV) that is the quotient or ratio between the RVc of the controls and the RVa of the altered samples:

Formula
The division of these two Pearson's {chi}2 distributions follows a Fisher–Snedecor's distribution (Fn1,n2) with (|G| – 1) · (|Sc| – 1) and (|G| – 1) · (|Sa| – 1) degrees of freedom (i.e. n1 and n2 respectively) (Montgomery, 2000). The adjustment to a probability distribution allows constructing a statistical hypotheses test for each subset of genes G, where the null hypothesis H0 will be no difference in the parameter of variability (RV) of expression between the controls (RVc) and the altered samples (RVa) for a given subset of genes G. In statistical terms, no such difference corresponds to the situation where the RVa of altered samples is less than or equal to the RVc of the controls. On the other side, the alternative hypothesis H1 will be the existence of a real significant difference between the controls and the altered samples. RV is taken as the estimator of variance:

H0: Formula, i.e. the subset of genes G of altered versus control samples does not present difference in variability
H1: Formula, i.e. the subset of genes G of altered versus control samples does present difference in variability

This is a one-tail test where the rejection region is {F ≤ k} for a critical value k. In this way, a probability score can be calculated corresponding to the value of the F statistic, and such score is taken as an indicator of the difference in variability between the control and the altered samples of a subset of genes G: F-score = P {F ≤ Fobs}. The score is calculated for each subset of genes G, and the program runs by iteration until achieving stabilization of such score for a group of genes. This allows a progressive selection of groups with increasing score from a minimal initial value that will correspond to the most distant from the null hypothesis.

2.3 Algorithm to search for de-regulation in gene expression: AlteredExpression
Following the score function defined above and the statistical parameters derived, we wrote in R the algorithm called AlteredExpression. A full scheme to show how it works is presented as a detailed flowchart in Figure 2. The algorithm is designed to run using as input a microarray expression dataset with I genes and J samples subdivided in two defined types: control samples Sc and altered samples Sa. The output provides each time an optimal group of genes identified with best RRV and minimal F-score. Once a first group of genes G1 is selected, they are taken out from the whole gene dataset (I G1) and the algorithm runs all again selecting a second group (G2). The process continues until the RRV of the groups generated do not change significantly, i.e. they arrive to a stable stage. The algorithm includes some variable parameters that are fixed in advance as input parameters and they can be adjusted depending of the type of samples analyzed. These parameters and their meaning are as follows:

  1. initialSize: indicates the number of genes included in the initial random sampling of the dataset. We explore several sizes and an initial value of 20 genes behaves well for most of the Affymetrix datasets or subsets (for a total number of genes between 2.000 and 20.000). The trials done with different initialSizes show that this parameter does not affect the resulting output groups.
  2. maxiter: indicates the maximum number of iterations that are allowed for the algorithm to work exploring the whole data matrix and find an optimal group. This parameter is just for security to avoid running without stopping, and it is fixed to a high number of 500 iterations to avoid stopping before the correct selection of a stable group.
  3. vSize: is a parameter introduced to modulate the effect of the group size. It can vary between 0 and 1, being close to 0 for higher size effect and close to 1 for lower size effect. The vSize is fixed to 0.5 by default.
  4. pctSubset: is a parameter, varying between 0 and 1, introduced to fix the percentage of in–out genes (ioGenes) that are randomly selected and changed in the genes set (G) every time the algorithm iterates in the external loop 1. The ioGenes is the whole set of genes that individually have improved the RRV each time the algorithm iterates in the internal loop 2. The pctSubset is fixed to 0.1 by default, meaning thatonly 10% of the ioGenes are changed in G each loop 1 run (Fig. 2).


Figure 2
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 2 Flowchart of the algorithm AlteredExpression showing the way it works to select a group of genes from the initial data matrix. Four parameters described in Methods are initialSize, maxiter, vSize and pctSubset. The output gives a stable group of genes, together with the RRV calculated for such group (optimalRRV) and its F-score.

 
2.4 Pre-selection of genes with differential expression
The main idea behind the method is to select the groups of genes that show a de-regulation, but also including the possibility of pre-selecting genes with a certain threshold of differential expression between control and altered state. The algorithm and score function described above are designed to look for a maximum variability in the altered state (altered samples, Sa) with a small variability in the control state (control samples, Sc). The algorithm can explore complete crude datasets from microarray experiments, but, in order to get a more correct and meaningful output it is better to avoid the mean expression of a certain gene to be the same between control and altered samples. To achieve this, the input data can be previously treated with an algorithm to look for differentially expressed genes. One of the most solid and useful algorithms for this purpose is Significant Analysis of Microarrays (SAM) (Tusher et al., 2001) that uses permutations of the samples to estimate the percentage of genes identified by chance, and it also calculates a false discovery rate (FDR) which corresponds to a Type I error calculation (i.e. number of false positives; Benjamini and Hochberg, 1995).

A more simple way to start with a minimal differential expression between control and altered samples is to de-selected or delete from the crude input data the genes that do not present a minimal change in mean expression value between control and altered samples. This is a means pre-filter ({Delta}MF) that is included at the beginning of the algorithm and its threshold of expression difference can be fixed before the running according to the characteristics of the sample. As shown below in the results, we evaluate the use of SAM and/or the means pre-filter within the algorithm.


    3 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 
3.1 Algorithm tuning with clinical experimental data
For a correct development of the algorithm AlteredExpression, an evaluation of the performance was done using different sets of microarray data from clinical samples. The main data used were the set mentioned above corresponding to human samples from 16 different individuals: 6 healthy persons (controls) and 10 patients with APL (altered). This set of data has an important biological variability, as it is usually the case for clinical samples coming from different persons. We observed that adequate inter-microarray normalization was needed before any further data analysis, because without such normalization the performance of the algorithms was critically diminished. For these reasons, in all our analyses the microarray expression datasets have been processed using RMA to calculate the signal (Irizarry et al., 2003).

Another observation learned in the trials, done to tune the algorithm efficiency, was that the search for altered genes performed by the algorithm was improved when the data introduced as input were previously filtered by differential expression. Figure 3 shows these effects presenting the variation of gene content of the first group given by the algorithm. The raw data correspond to an input of the complete set of expression values for all genes in the microarrays without any filter (22.283 probesets) (Fig. 3a and d). The filtered data correspond to the ones obtained using {Delta}MF (Fig. 3g) or using SAM with or without {Delta}MF (Fig. 3b, c, e, f, h and i). As seen in Figure 3 scaled graphs, using a {Delta}MF of expression value 1.0 (that corresponds to 8–9% of the expression range) a big number of genes, that do not show means change between control and altered samples (called ‘flat’ genes), are cleared (Fig. 3g, h and i). The use of both {Delta}MF and SAM gives the algorithm the best performance in the search for altered genes. This is a stringent strategy that pushes the method to find genes that clearly fulfil the criteria of both differential expression and alteration, in a way to avoid ad maximum false positives. The value of using {Delta}MF together with SAM is shown in Figure 3, where the graphs illustrate that only SAM, even when using a quite stringent FDR of 0.001, is not enough to filter out all ‘flat’ genes (Fig. 3e and f). These genes are all eliminated adding a {Delta}MF of value 1.0 (Fig. 3h and i). To better demonstrate that {Delta}MF, implemented in our method, improves the simple use of SAM, Figure 4 shows that using SAM with FDR 0.01 still lets some genes that are ‘flat’ genes and do not fulfil the criteria of a minimal difference in expression means that we are looking for. Such genes are taken by SAM because they have a very small standard deviation in the control samples but larger deviation in the altered samples, therefore they can be taken as significant in statistical tests involving the mean and the standard deviation as it is SAM (that is a modified t-test). In contrast, the means-filter {Delta}MF is most efficient to filter-out such genes and this allows obtaining more consistent results in the performance of the AlteredExpression algorithm.


Figure 3
View larger version (64K):
[in this window]
[in a new window]
 
Fig. 3 Graphs presenting the expression values of genes in 16 different samples: 6 control and 10 altered samples (i.e. from patients with APL) separated by a red line. The genes presented are the ones included in the first groups and each graph corresponds to different runs of the algorithm using different sets of initial genes: all 22.283 genes of the microarrays (a,d,g); 4056 genes obtained using SAM at FDR 0.01 (b,e,h); 1764 genes obtained using SAM at FDR 0.001 (c,f,i). The use of a means-filter of 1.0 is also indicated in the graphs. Each gene is represented as a color line that includes the 16 values of expression in absolute numbers (a,b,c) or 16 values of expression scaled by making value 1.0 for the expression of the first sample and the rest relative to it (d,e,f,g,h,i).

 

Figure 4
View larger version (36K):
[in this window]
[in a new window]
 
Fig. 4 Graphs presenting the expression values of genes as Figure 3. The genes are the ones included in the first groups using different conditions as initial set: (a) 4056 genes (SAM at FDR 0.01) without {Delta}MF; (b) with {Delta}MF = 0.4; (c) with {Delta}MF = 0.8. Graphs (d) and (e) present the ‘flat’ genes that are filtered out, from (a) and (b) graphs respectively, when using the {Delta}MF.

 
To give an orientation for the use of {Delta}MF we have done a numerical correlation between the relative number of genes that are filtered at a certain value of {Delta}MF and at a certain value of SAM FDR. The obtained correlation (data not shown) indicates that for the dataset used in this work a {Delta}MF = 0.8 – 1.0 was equivalent to a FDR = 0.01 (i.e. a 99% of statistical specificity).

The genes pre-filtered with {Delta}MF are in many cases low expression genes, but not always because the criteria used is a ‘difference in expression means’ independently of the absolute expression value of a gene. This point is quite important because other selective algorithms, like MAS5, apply a simple cutoff of low expression values clearing or ignoring all the genes that are below a certain signal value. A filter based on a simple cutoff follows the criteria that the noise and error accumulates at low expressions and that a major part of the genes in the transcriptome do not change in expression in a certain biological state. However, this approach is frequently inadequate for microarray data, because many critical genes involved in cellular regulation have low expression values (e.g. genes that are at the beginning of signal transduction cascades), and also because these type of genes many times show small fold changes when they are over or underexpressed.

Behaviour of the F-score in consecutive groups: finding the most significant groups of genes
As described, AlteredExpression algorithm explores the data matrix of expression values searching for groups of genes that have minimal variation in the control samples but show clear variability in the disease samples. In this way, the algorithm produces a series of consecutive stable groups starting from a minimal F-score for the first group, followed by increasing scores for the next groups found. In all cases, before each run of the algorithm, a group of 20 genes was randomly selected (initialSize = 20). The usual size of the final groups found was about 30–50 genes using the default values of the parameters described above. The behaviour of the F-scores for the groups selected after running the algorithm is shown in Figure 5, that represents the scores in log10 scale for each group (Fig. 5a), and the adjusted probability scores, using the Bonferroni method (Bonferroni, 1937; Hsu, 1996) for each group (Fig. 5b). These graphs indicate that the scores present an asymptotic tendency and the initial groups include most of the change. A 75% of the total change in score is achieved between the 1st group and the 10th, the change of first three groups being remarkable. The groups in Figure 5a and b are the ones selected starting with the total set of 1764 genes (obtained with SAM FDR 0.001 and {Delta}MF 1.0). Using the set of 4056 genes (obtained with SAM FDR 0.01 and {Delta}MF 1.0) the behaviour of the scores is similar (Fig. 5c and d) and again the initial groups include the most significant changes. The general tendency to increase the scores does not always keep the order of the most stable groups found (Fig. 5a, b and c). To avoid such fluctuations the groups are ordered by increasing the F-scores in Figure 5d, that shows better continuous growth of this statistical parameter and allows an adequate selection of the most significant groups of genes.


Figure 5
View larger version (14K):
[in this window]
[in a new window]
 
Fig. 5 Graphs presenting the change in F-scores (log10 scale) given for each group of genes generated. Groups obtained using initial set of (a) 1764 genes (SAM at FDR 0.001) and {Delta}MF = 1.0; or using (c) 4056 genes (SAM at FDR 0.01) and {Delta}MF = 1.0. Graph (b) presents the same data as (a) but with adjusted F-scores by Bonferroni method. Graph (d) presents the same data as (c) but ordering the groups by F-scores instead of ordering by appearance.

 
3.3 Stability of the algorithm in repetitive runs, biological meaning of the groups and comparison with other methods
A critical point in any algorithm that selects groups of genes is how reproducible it is, or in other words, which is the consistency, coherence and stability of the groups selected with maximal significance. To check this we use the initial set of 4056 genes (obtained with SAM FDR 0.01 and {Delta}MF 1.0) and the algorithm was run 10 times completely, obtaining 10 different sets of gene groups. Then, all the genes included in each group of each run were compared with the list of genes of all the other groups in other runs. The data obtained are represented in Figure 6 as percentage of identity of each group along 10 runs. The graph indicates that from first to fifth group the reproducibility of the runs is very high because the groups show an identity >85%. The consistency is high, ~80%, for Groups 6–9, and it starts to be lower after Group 10. This result also indicates that there is a correlation between the high coherence of the initial groups and the low F-scores that these groups present, observed in Figure 5.


Figure 6
View larger version (35K):
[in this window]
[in a new window]
 
Fig. 6 Comparison of the list of genes present in each group obtained in 10 different complete runs of the algorithm. Each group of genes is compared in percentage of identity with all the other groups obtained in the other nine runs. In this way, the 10 groups most similar are selected making 15 groups ordered by percentage of identity. Each colour bar includes 10 groups one from each run.

 
The strong consistency and the high F-scores of the initial groups selected by the algorithm are two good indicators that the genes included on these groups keep some kind of biological relationships that make them to cluster. This suggestion agrees with the basic hypothesis that an altered biological state, in cells or tissue from a disease sample, will include groups of genes with altered expression. The algorithm proposed is able to find such groups of altered expression with respect to the control samples. A further question is if such groups of altered expression include genes with similar or related biological functions.

A final analysis was done to check the biological consistency of each group by assignment of the genes included in the group to functional terms given by the gene ontology (GO) annotation (Gene Ontology Consortium, 2005). The result of this assignment for the example used (set of 4056 genes) is shown in Figure 7. The data indicate that the functional annotation of the groups obtained is quite different; for example the first group includes some genes related to functions that are not found at all in the other groups (Groups 2–5): blood vessel development, endothelial cell differentiation, myoblast differentiation, etc. This result shows consistency and functional coherence for the groups obtained with the samples used and such consistency was only dependent on the variability observed between the set of altered samples and the control samples. In any case, the detail about the way the algorithm performs showed in this paper is a clear guide to make a good use of it, knowing that for any given study on some specific samples some tuning runs will be needed to explore the characteristics of such samples.


Figure 7
View larger version (32K):
[in this window]
[in a new window]
 
Fig. 7 Assignment of the genes included in the first to fifth groups to functional terms given by the GO annotation. The assignment is done to GO category biological_process at level ≥3. The scale represents the percentage of total genes that are assigned to each GO term knowing that each gene can be assigned to more than one term.

 
A final point is to consider how similar is the algorithm presented to others previously published and publicly available. As far as we could find out there are not many algorithms that address the issue of statistically significant variability in gene expression in the comparison of two sets of samples. As we said above the core idea of the algorithm presented is to use a residual variance and this idea was previously proposed (Cheng and Church, 2000; Kostka and Spang, 2004). In the case of Kostka and Spang (2004) they wrote an algorithm (dcoex) that tries to find groups of genes with differences in the covariance structure. AlteredExpression follows a similar strategy but also incorporates information on differential expression to find disease-altered groups of genes.

There are in the literature many algorithms that search for groups of genes that show significant co-expression and co-regulation on microarray data, being probably k-means (Hartigan, 1975) and SOM (Kohonen, 1995) the most widely used to generate clusters of genes in a divisive way. These algorithms are unsupervised and they are not designed for class comparison, however such is a key point of our study where control and altered samples are well defined. Other methods perform supervised clustering of genes including the information of sample classes (Dettling and Buhlmann, 2002). This kind of methods usually use as grouping process the partial least squares procedure, to build a covariance matrix that tries to be maximized (Hartigan, 1975). These algorithms are again mostly designed to find co-regulated genes and they do not perform class comparison. The algorithms most used for class comparison are the ones that search for significant differential expression, like SAM algorithm (Tusher et al., 2001) or EBA algorithm (Efron et al., 2001). These methods do not explore the variability in gene expression that occurs between a control and an altered set of samples, because their objective is to find significant expression mean differences. For these reasons, the proposed algorithm AlteredExpression can be a good complement for a more comprehensive search of the changes that occur between two sets of samples, control versus altered ones.


    Acknowledgments
 
The authors acknowledge the funding and support provided by the Spanish Ministerio de Sanidad y Consumo, ISCIII (research grant ref. PI030920), Junta de Castilla y Leon (research grant ref. SA104/03) and Fundación BBVA (Bioinformatics Grants Program 2003).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on July 21, 2005; revised on December 30, 2005; accepted on February 8, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS AND DISCUSSION
 REFERENCES
 

    Barash, Y., et al. (2004) Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays. Bioinformatics, 20, 839–846[Abstract/Free Full Text].

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. Ser. B, 57, 289–300.

    Bolstad, B.M., et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185–193[Abstract/Free Full Text].

    Bonferroni, C.E. (1937) Teoria statistica delle classi e calcolo delle probabilita. Volume in Onore di Ricarrdo dalla Volta, , Italy Universita di Firenza, pp. 1–62.

    Cheng, Y. and Church, G.M. (2000) Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol, . 8, 93–103[Medline].

    Dettling, M. and Buhlmann, P. (2002) Supervised clustering of genes. Genome Biol, . 3, RESEARCH0069.

    Draghici, S. Data Analysis Tools for DNA Microarrays, (2003) , London, UK Chapman and Hall/CRC.

    Efron, B., et al. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc, . 96, 1151–1160[CrossRef][ISI].

    Eisen, M.B., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868[Abstract/Free Full Text].

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537[Abstract/Free Full Text].

    Harris, M.A., et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res, . 32, D258–D261[Abstract/Free Full Text].

    Hartigan, J.A. Clustering Algorithms, (1975) , New York Wiley.

    Hsu, J.C. Multiple Comparisons Theory and Methods, (1996) , London, UK Chapman and Hall.

    Irizarry, R.A., et al. (2003a) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res, . 31, e15[Abstract/Free Full Text].

    Irizarry, R.A., et al. (2003b) Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4, 249–264[Abstract].

    Kohonen, T. Self-Organizing Maps, (1995) , Berlin Springer.

    Kostka, D. and Spang, R. (2004) Finding disease specific alterations in the co-expression of genes. Bioinformatics, 20, Suppl. 1, I194–I199.

    Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA, 98, 31–36[Abstract/Free Full Text].

    Liu, W.M., et al. (2002) Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics, 18, 1593–1599[Abstract/Free Full Text].

    Marshall, E. (2004) Getting the noise out of gene arrays. Science, 306, 630–631[Abstract/Free Full Text].

    Montgomery, D.C. Design and Analysis of Experiments, (2000) 5th edn , New York, NY John Wiley and Sons Inc.

    Rhodes, D.R., et al. (2002) Meta-analysis of microarrays: interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res, . 62, 4427–4433[Abstract/Free Full Text].

    Segal, E., et al. (2003) Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19, Suppl. 1, i273–i282[Abstract].

    Stuart, J.M., et al. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science, 302, 249–255[Abstract/Free Full Text].

    Tusher, V.G., et al. Significance analysis of microarrays applied to the ionizing radiation response. [Erratum (2001), Proc. Natl Acad. Sci. USA, 98 10515.]. Proc. Natl Acad. Sci. USA, 98, 5116–5121.

    West, M., et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci. USA, 98, 11462–11467[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
J. W.K. Ho, M. Stefani, C. G. dos Remedios, and M. A. Charleston
Differential variability analysis of gene expression and its application to human diseases
Bioinformatics, July 1, 2008; 24(13): i390 - i398.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/9/1103    most recent
btl053v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Prieto, C.
Right arrow Articles by De Las Rivas, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Prieto, C.
Right arrow Articles by De Las Rivas, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?