Bioinformatics Advance Access originally published online on January 19, 2007
Bioinformatics 2007 23(5):589-596; doi:10.1093/bioinformatics/btl680
An erratum has been published for this article.
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Automatic recognition and annotation of gene expression patterns of fly embryos

Jie Zhou 1,† and Hanchuan Peng 2,*,†

1Department of Computer Science, Northern Illinois University, DeKalb, IL 60115 and 2Janelia Farm Research Campus, Howard Hughes Medical Institute, Ashburn, VA 20147, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Gene expression patterns obtained by in situ mRNA hybridization provide important information about different genes during Drosophila embryogenesis. So far, annotations of these images are done by manually assigning a subset of anatomy ontology terms to an image. This time-consuming process depends heavily on the consistency of experts.

Results: We develop a system to automatically annotate a fruitfly's embryonic tissue in which a gene has expression. We formulate the task as an image pattern recognition problem. For a new fly embryo image, our system answers two questions: (1) Which stage range does an image belong to? (2) Which annotations should be assigned to an image? We propose to identify the wavelet embryo features by multi-resolution 2D discrete wavelet transform, followed by min-redundancy max-relevance feature selection, which yields optimal distinguishing features for an annotation. We then construct a series of parallel bi-class predictors to solve the multi-objective annotation problem, since each image may correspond to multiple annotations.

Supplementary information: The complete annotation prediction results are available at: http://www.cs.niu.edu/~jzhou/papers/fruitfly and http://research.janelia.org/peng/proj/fly_embryo_annotation/. The datasets used in the experiments will be available upon request to the corresponding author.

Contact: jzhou@cs.niu.edu and pengh@janelia.hhmi.org


    1 INTRODUCTION
Analysis of in situ gene expression patterns sheds new light on the complicated relationships among genes. Recent work on automating this process has been reported for several model systems, including mouse (e.g. Carson et al., 2005) and fruitfly (e.g. Peng and Myers, 2004). For fruitfly (Drosophila melanogaster), gene expression pattern images during embryogenesis obtained by in situ mRNA hybridization provide important spatial–temporal functional information. These images contain body structures that emerge during certain developmental stages. They have been found to be very useful for the investigation of genetic regulatory elements and for automatic retrieval and clustering (Pan et al., 2006; Peng and Myers, 2004; Peng et al., 2006).

Annotations of these structures are important for the study of Drosophila embryogenesis. So far, the annotations of these fly embryo images are done by manually assigning anatomy ontology terms to the images (Tomancak et al., 2002). This time-consuming process depends heavily on the consistency of experts. With the availability of a large number of pattern images in databases such as the Berkeley Drosophila Genome Project (BDGP), an automatic and systematic annotation approach that can increase the efficiency and consistency of the analysis becomes highly desirable.

We develop a system to automatically annotate the fruitfly's embryonic tissue in which a gene has expression. We formulate the task as an image pattern recognition problem. For a new fly embryo image, our system answers two questions: (1) Which stage range does an image belong to? (2) Which annotations should be assigned to an image?

One key issue is that the embryonic tissues in which a target gene is expressed are often unknown beforehand; therefore, an arbitrary input image of a gene expression pattern may have many different ontological annotations associated with it. From the viewpoint of pattern recognition, this means that each input sample may correspond to multiple class labels. Given the set of $K$ class labels $\Phi = \{c_1, \ldots, c_K\}$, automatic annotation of a fly embryo gene expression image is multi-objective: an input image $x_i$ corresponds to a target set $a_i \subset \Phi$. This problem demands a special design for feature extraction and classification. There are three major challenges:

  1. For the multi-objective problem of embryonic image annotation, features corresponding to an unknown number of tissue structures coexist in the same image. In image data, each pixel is often regarded as a dimension, so the total dimensionality equals the number of image pixels, which is often very large. Effective feature extraction is critical for identifying the discriminating features correlated with different structures.
  2. The sample distribution of our data is heavily skewed. Typically, some ontology terms are commonly used to annotate our image samples, while many other terms are associated with a relatively small number of images. The actual percentages for different annotations vary greatly, from less than 1% to about 20%.
  3. The quality and morphology of images vary greatly. Variation in embryo morphology, expression pattern staining and image orientation increases the difficulty of effective preprocessing and registration of the images. In addition, some inconsistent or missing annotations make it a nontrivial effort to generate an objective evaluation dataset.

This paper proposes the wavelet embryo feature extraction and selection method for fly embryonic images so that various gene-expressed structures within the same image can be effectively decomposed and automatically recognized. We use the wavelet decomposition scheme to project the original pixel-based embryonic images to a new feature domain in order to reveal features at different resolutions and frequency bands. Then we apply the min-redundancy max-relevance (mRMR) feature selection algorithm to identify the optimal distinguishing features for a specific annotation. With this scheme, we then construct a series of parallel bi-class predictors to solve the multi-objective annotation problem.

We have tested our approach using in situ mRNA expression patterns of 463 fly genes. We compared our method against several others and found that the combination of wavelet embryo features and mRMR feature selection yields promising features for recognition of gene expression patterns, despite various challenges in the input image data. We have generated comprehensive prediction tables over the entire course of the embryogenesis for these 463 fly genes. Comparison of our results with the expert manual annotations indicates that our paradigm is successful. Our predictions are available on the authors’ websites.


    2 METHODS
Our method has three major portions, namely wavelet embryo feature decomposition, feature selection, and recognition of gene expression patterns.

2.1 Wavelet embryo features
With a large number of features coexisting in a single embryonic gene expression image, the task of feature extraction is to produce a representation that can best characterize this embryo image.

One option for extracting features from digital images is to use pixels or combinations of pixels as candidate features. Given the high dimensionality of embryonic images, pixel combinations have been considered because of their smaller computational load and lower information redundancy (Pan et al., 2006; Peng and Myers, 2004; Peng et al., 2006). For example, eigen-embryo analysis uses principal component analysis (PCA) to form linear combinations of pixel intensity values and extract the most prominent image features (Peng et al., 2006). While eigen-embryo features in conjunction with graph partition methods have led to very interesting results in clustering gene expression patterns in an unsupervised way, the method seems less appropriate for automatic annotation, where features with different levels of prominence need to be revealed to characterize the various structures coexisting in the same image.

In this paper, we propose using the multi-resolution wavelet representation for embryo images. Wavelet representation lies between the spatial and Fourier domains, in the sense that wavelets are localized in both space and frequency, whereas the standard Fourier transform is only localized in frequency (Daubechies, 1992; Mallat, 1989, 1999). Multi-resolution representations based on wavelet decomposition are effective for identifying and analyzing local and multi-scale features from signals or images and have been used in other pattern recognition tasks such as face recognition and image retrieval (Chien and Wu, 2002; Manjunath and Ma, 1996). With wavelet decomposition, a fly embryo image is projected to a feature space where information is decomposed into various frequency bands and different levels of resolutions so that features of different structures may be effectively separated.

We use the 2D discrete wavelet transform (DWT) to extract multi-resolution image features. The 2D DWT decomposes an image using orthonormal wavelet basis functions. Let the set of wavelet basis functions be $\{\psi_{k,n}(r_1, r_2),\ k, n \in \mathbb{Z}\}$, where $n$ is the translation factor (when $n$ increases, the wavelet shifts right), and $k$ is the dilation factor, which denotes a particular resolution level (when $k$ gets smaller, the resolution increases). Let $x(r_1, r_2)$ represent the intensity values of embryo image pixels indexed by the location vector $(r_1, r_2)$. We can obtain the wavelet transform coefficient $d_{k,n}$ as below (Mallat, 1999; Resnikoff and Wells, 1998):


$$d_{k,n} = \langle x, \psi_{k,n} \rangle = \sum_{r_1} \sum_{r_2} x(r_1, r_2)\, \psi_{k,n}(r_1, r_2)$$

In order to construct the 2D orthonormal wavelet basis function $\Psi_{k,n}(r_1, r_2)$ for multi-resolution analysis, we start from the following 1D counterpart.

In multi-resolution analysis theory, the wavelet function $\psi(r)$ has a companion, the scaling function $\varphi(r)$, which approximates the signal/image at a specific level of resolution. The sets of scaling and wavelet functions are built by translation and dilation:


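$$\varphi_{k,n}(r) = 2^{-k/2}\, \varphi(2^{-k} r - n), \qquad \psi_{k,n}(r) = 2^{-k/2}\, \psi(2^{-k} r - n)$$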

Intuitively, the scaling function indicates the trend of the signal, while the wavelet functions code the details of the images that can be added to reconstruct the original signal. Mallat (1989) linked orthogonal wavelets and scaling functions to quadrature mirror filters in signal processing theory: the scaling function $\varphi(r)$ produces a smoothed signal as a low-pass filter; the wavelet functions $\psi(r)$ catch the high-frequency details as high-pass filters.

Correspondingly, the 2D orthonormal wavelet basis function $\Psi(r_1, r_2)$ is factorized using the scaling function $\varphi(r)$ and mother wavelet function $\psi(r)$ as below:


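$$\Psi(r_1, r_2) = \begin{cases} \varphi(r_1)\,\varphi(r_2) & \text{(LL)} \\ \varphi(r_1)\,\psi(r_2) & \text{(LH)} \\ \psi(r_1)\,\varphi(r_2) & \text{(HL)} \\ \psi(r_1)\,\psi(r_2) & \text{(HH)} \end{cases} \tag{1}$$
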
Schematically, each basis function in Equation (1) represents a two-step transform process: (1) perform a 1D wavelet transform on each row (indexed by $r_1$) of the embryonic image; and then (2) perform the transform on each column (indexed by $r_2$) of the result of step one. In our experiments, the scaling function $\varphi(r)$ and wavelet function $\psi(r)$ of the Daubechies-1 wavelet (Daubechies, 1992) are used.

With the connection between wavelet basis functions and filters established by Mallat (1989), we can also view each subspace in Equation (1) as a 2D filter that transforms the original image into one of the four components: the low–low (LL), low–high (LH), high–low (HL) and high–high (HH) parts of the transform. In other words, the wavelet decomposition yields four sub-images at each resolution level. LL filtering is a ‘smoothing’ of the original image. The other three at each resolution level are the detailed sub-images. Then at level 2, we apply the same analysis to the LL subimage, where the wavelets now reveal coarser-grained details of the image, and thus achieve the multi-resolution feature decomposition.

Figure 1 shows an example of the wavelet-embryo decomposition. The LL2 quadrant in the upper left corner of (b) is a smoothing of the original image in (a). The other three parts, LH, HL, and HH, are detail images at two different resolution levels. The set of coefficients $\{d_{k,n}\}$ of all sub-images at all resolution levels is used as the feature set that characterizes a gene expression pattern. In our experiment, $k = 0, 1$ (corresponding to levels 1 and 2 in Fig. 1) and $n = 1, \ldots, |r_1| \cdot |r_2|$, where $|r_1|$ and $|r_2|$ are the width and height of the sub-image at a specific resolution level. As seen in Figure 1, the DWT subsamples by 2 in both the $r_1$ and $r_2$ directions, so the number of coefficients in each sub-image is one quarter of that of its input image. As a result, the total number of coefficients (features) is about the same as the number of pixels of the original image. If a dimension is odd at some level, the corresponding dimension after subsampling is $\lceil |r_1|/2 \rceil$ or $\lceil |r_2|/2 \rceil$. For example, for an image of 50 x 100 pixels, level 1 yields three detail sub-images of 25 x 50 and level 2 yields four sub-images (three detail sub-images and LL2) of 13 x 25, so the total number of coefficients is 25 x 50 x 3 + 13 x 25 x 4 = 5050.
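The decomposition and the coefficient count described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Haar filter pair (the Daubechies-1 wavelet), pads an odd-length row or column by repeating its last sample, and the function names (`haar_1d`, `haar_2d`, `wavelet_features`) are hypothetical. The LH/HL naming follows one common convention and is also an assumption.

```python
import math


def haar_1d(signal):
    """One level of the 1D Haar (Daubechies-1) transform.

    Returns (low, high), each of length ceil(len(signal) / 2). An odd-length
    signal is padded by repeating its last sample (assumed boundary rule).
    """
    if len(signal) % 2:
        signal = signal + [signal[-1]]
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def haar_2d(image):
    """One level of the 2D transform: rows first, then columns."""
    # Step 1: transform every row into a low-pass and a high-pass half.
    row_low, row_high = [], []
    for row in image:
        lo, hi = haar_1d(row)
        row_low.append(lo)
        row_high.append(hi)

    # Step 2: transform every column of each half-image.
    def transform_columns(half):
        lows, highs = [], []
        for col in transpose(half):
            lo, hi = haar_1d(col)
            lows.append(lo)
            highs.append(hi)
        return transpose(lows), transpose(highs)

    ll, lh = transform_columns(row_low)
    hl, hh = transform_columns(row_high)
    return ll, lh, hl, hh


def wavelet_features(image, levels=2):
    """Flatten all detail sub-images plus the final LL sub-image."""
    features = []
    current = image
    for _ in range(levels):
        ll, lh, hl, hh = haar_2d(current)
        for sub in (lh, hl, hh):
            features.extend(v for row in sub for v in row)
        current = ll  # the next level decomposes the smoothed image
    features.extend(v for row in current for v in row)
    return features


# A synthetic 50 x 100 "image" (50 rows, 100 columns) with arbitrary values.
image = [[float((r * c) % 7) for c in range(100)] for r in range(50)]
print(len(wavelet_features(image)))  # 5050
```

For a 50 x 100 input the sketch returns 5050 features, which agrees with the count worked out in the text.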


Fig. 1. 2D wavelet decomposition of a fly embryo gene expression pattern image. (a) The original in situ mRNA expression pattern image of gene CG3400 in embryonic stage 7–8. (b) The wavelet embryo features obtained by applying two-level 2D wavelet decomposition.

 
In the rest of the paper, we will use wavelet embryo features to refer to features generated from embryonic images by wavelet decomposition.

2.2 Min-redundancy max-relevance feature selection
The dimensionality of the wavelet embryo features is approximately equal to the number of pixels in an image. Using the full set of wavelet coefficients has been found to give inaccurate results in several other problems (Mallet et al., 1997; Pittner and Kamarthi, 1999). In our method, the next step is therefore to select the subset of features that best discriminates the patterns and helps annotate embryo images. We generate a compact feature set with strong discriminating strength for specific annotations using the mRMR feature selection method (Peng et al., 2005). In theory, the mRMR algorithm selects features so as to maximize the statistical dependency between the annotation distribution of the image samples and the joint distribution of the selected features. Based on information theory, this can be factorized as finding features that are mutually dissimilar (minimum redundancy) but individually most similar to the distribution of annotations (maximum relevance).

We use mutual information to measure the level of similarity between features. Let $S$ denote the feature subset that we are seeking and $\Omega$ the pool of all candidate features $\{f_i\}$ (they equal $\{d_{k,n}\}$ in the case of wavelet embryo features). The minimum redundancy condition is


$$\min_{S \subset \Omega} W, \qquad W = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i, f_j),$$

where $|S|$ is the number of features in $S$ and $I(f_i, f_j)$ is the mutual information between $f_i$ and $f_j$,


$$I(f_i, f_j) = \iint p(f_i, f_j) \log \frac{p(f_i, f_j)}{p(f_i)\, p(f_j)} \, df_i \, df_j,$$

where $p(\cdot)$ is the probability density function.
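In practice the densities have to be estimated from samples. The sketch below is one simple way to do so, assuming each feature has first been discretized into a small number of states (an assumption made here for illustration; the text does not specify the estimator). The integral then becomes a sum over observed value pairs, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), summed over observed pairs
    return sum(
        (n_xy / n) * math.log2(n_xy * n / (count_x[x] * count_y[y]))
        for (x, y), n_xy in count_xy.items()
    )


# Two identical balanced binary features share one bit of information.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
# Two independent binary features share none.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```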

We again use mutual information $I(c, f_i)$ between the target class $c \in \{c_1, \ldots, c_K\}$ and the feature $f_i$ to quantify the relevance of $f_i$ for the classification task. The maximum relevance condition is to maximize the total relevance of all features in $S$:


$$\max_{S \subset \Omega} V, \qquad V = \frac{1}{|S|} \sum_{f_i \in S} I(c, f_i).$$

To obtain mRMR features, we optimize these two conditions simultaneously, in quotient form by


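$$\max_{S \subset \Omega} \left\{ \sum_{f_i \in S} I(c, f_i) \Big/ \left[ \frac{1}{|S|} \sum_{f_i, f_j \in S} I(f_i, f_j) \right] \right\} \tag{2}$$
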
The solution of Equation (2) can be computed efficiently in $O(|S| \cdot N)$ time, with $N$ being the total number of features in $\Omega$.
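A cost linear in the number of candidate features is what an incremental (greedy) search gives: start from the single most relevant feature, then repeatedly add the candidate with the largest ratio of relevance to mean redundancy against the features already chosen. The sketch below illustrates that idea under stated assumptions: features are already discretized, mutual information is estimated by counting, and the names `mrmr_select` and `eps` (a small constant guarding against a zero denominator) are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (nxy / n) * math.log2(nxy * n / (cx[x] * cy[y]))
        for (x, y), nxy in cxy.items()
    )


def mrmr_select(features, labels, k, eps=1e-12):
    """Greedy quotient-form mRMR.

    features: dict mapping feature name -> list of discrete values
    labels:   list of class labels, one per sample
    Returns the names of up to k selected features, in selection order.
    """
    relevance = {
        name: mutual_information(values, labels)
        for name, values in features.items()
    }
    # The first pick is simply the most relevant feature.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -1.0
        for name, values in features.items():
            if name in selected:
                continue
            # Mean mutual information with the features chosen so far.
            redundancy = sum(
                mutual_information(values, features[s]) for s in selected
            ) / len(selected)
            score = relevance[name] / (redundancy + eps)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Toy data: the class is determined by two independent bits a and b;
# a_copy duplicates a and therefore adds nothing new.
labels = [0, 1, 2, 3, 0, 1, 2, 3]
features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
print(mrmr_select(features, labels, k=2))  # ['a', 'b']
```

All three toy features are equally relevant, so a relevance-only ranking could pick the duplicate; the redundancy term is what steers the second pick to `b`.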

2.3 Design of the recognition/annotation system
Our automatic annotation system has two tiers. At the first tier, an incoming image is assigned to a specific developmental stage in the course of fly embryogenesis. Following the convention of Berkeley Drosophila Genome Project (http://www.fruitfly.org), there are six stage ranges: stage 1–3, stage 4–6, stage 7–8, stage 9–10, stage 11–12 and stage 13–16.

At the second tier, image ontology annotation terms are assigned to an input image. As explained earlier, the problem of annotating images is multi-objective. We solve it by decomposing it into multiple binary subproblems, for each of which we train a classifier to predict a particular annotation term based on the one-versus-others method. In other words, for a particular annotation term, the training samples are associated with labels 1 or 0, depending on whether or not the image sample has this annotation in the training data. In the testing phase, the trained classifier is responsible for predicting whether or not an input image should be associated with this image ontology annotation term. In this way, a series of parallel classifiers determines whether an image has any of the annotations in the training set.
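The one-versus-others decomposition amounts to turning each image's set of annotation terms into one 0/1 label vector per term. A minimal sketch follows; the function name `binarize_annotations` and the example terms `term_A`, `term_B` and `term_C` are hypothetical placeholders, not terms from the ontology.

```python
def binarize_annotations(annotation_sets, vocabulary):
    """Build one binary training target per annotation term.

    annotation_sets: list of sets, the terms assigned to each image
    vocabulary:      list of all annotation terms of interest
    Returns a dict mapping term -> list of 0/1 labels, one per image.
    """
    return {
        term: [1 if term in terms else 0 for terms in annotation_sets]
        for term in vocabulary
    }


# Three images; the second carries two annotations at once.
images = [{"term_A"}, {"term_A", "term_B"}, set()]
targets = binarize_annotations(images, ["term_A", "term_B", "term_C"])
print(targets["term_A"])  # [1, 1, 0]
print(targets["term_B"])  # [0, 1, 0]
print(targets["term_C"])  # [0, 0, 0]
```

Each label vector can then be used to train its own binary classifier, independently of the others.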

The trained classifiers also calculate a probabilistic confidence score in order to give users a quantitative measure of predictions. Since classifiers for different annotations are trained independently, we are able to add annotation-specific classifiers based on user requirement and/or system complexity constraints, without altering previously trained classifiers.

For our problem, one challenge in designing classifiers is that an annotation term usually corresponds to only a small portion of images. For example, in the evaluation set of genes we collected, even the most common annotation terms are associated with only about 20% of the image samples. As a result, for the one-vs-others binary classifier, the sample numbers for class 1 (with the annotation) and class 0 (without the annotation) are very unbalanced, because most samples are labeled as class 0. Therefore, we need to choose a classifier that is able to predict accurately for the small-sample-number class instead of being biased toward the large class.

We use linear discriminant analysis (LDA) to implement the binary annotation classifier. LDA is a commonly used statistical algorithm for separating samples of two or more categories (Webb, 2002). It is reported in the literature that LDA may give better performance than quadratic discriminant analysis (QDA) when the true covariance matrices are unknown and the sample sizes are small (O’Neill, 1992; Webb, 2002). Our empirical comparison with the support vector machine (Chang and Lin, 2001; Vapnik, 1995) shows that LDA has a relatively balanced performance on both large and small classes. These observations motivated our choice of classifier. (Results are omitted due to the space limitation.)

Denote the posterior probability of an image $x$ having an annotation as $p(y = 1 \mid x)$, where $y \in \{0, 1\}$ is the class label of the image. It can be shown that, based on the assumption of multivariate normal density of LDA, we have


$$\log p(y = 1 \mid x) = g_1(x) + b,$$

where b is class-independent, and


$$g_1(x) = x^{T} \Sigma^{-1} \mu_1 - \tfrac{1}{2}\, \mu_1^{T} \Sigma^{-1} \mu_1 + \log p(y = 1),$$

where $\mu_1$ is the sample mean of class 1, $\Sigma = \frac{1}{n_1 + n_0 - 2}(n_1 \Sigma_1 + n_0 \Sigma_0)$ is the pooled within-class covariance matrix, and $n_1$ and $n_0$ are the numbers of samples for classes 1 and 0. $g_0(x)$ for class 0 can be calculated similarly. Using


$$p(y = 1 \mid x) + p(y = 0 \mid x) = 1,$$

we can then compute p(y = 1|x) and p(y = 0|x).

For embryonic image annotation, the body part structure with gene expression is considered to be present if (1) p(y = 1|x) > p(y = 0|x); and (2) p(y = 1|x) > TH, where TH is a threshold set to 0.6 in our experiment. p(y = 1|x) is the confidence score of the annotation.
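The posterior and the two-part decision rule can be illustrated with a small two-class LDA written in plain Python. This is a sketch under several assumptions that are not stated in the text: only two features are used so that the covariance matrix can be inverted by hand, the class-wise covariances are maximum-likelihood estimates, the class priors are taken as the class frequencies, and all function names are hypothetical.

```python
import math


def mean(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, mu):
    """Sum of outer products of centred 2D samples (n times the ML covariance)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fit_lda(class1, class0):
    n1, n0 = len(class1), len(class0)
    mu1, mu0 = mean(class1), mean(class0)
    s1, s0 = scatter(class1, mu1), scatter(class0, mu0)
    # Pooled within-class covariance: (n1*Sigma1 + n0*Sigma0) / (n1 + n0 - 2)
    cov = [[(s1[a][b] + s0[a][b]) / (n1 + n0 - 2) for b in range(2)]
           for a in range(2)]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    log_priors = (math.log(n1 / (n1 + n0)), math.log(n0 / (n1 + n0)))
    return {"mu": (mu1, mu0), "inv": inv, "log_prior": log_priors}


def discriminant(x, mu, inv, log_prior):
    # g(x) = x^T Sigma^-1 mu - 0.5 * mu^T Sigma^-1 mu + log prior
    w = [inv[0][0] * mu[0] + inv[0][1] * mu[1],
         inv[1][0] * mu[0] + inv[1][1] * mu[1]]
    return (x[0] * w[0] + x[1] * w[1]
            - 0.5 * (mu[0] * w[0] + mu[1] * w[1]) + log_prior)


def posterior_present(x, model):
    """p(y = 1 | x), using p(y = 1 | x) + p(y = 0 | x) = 1."""
    g1 = discriminant(x, model["mu"][0], model["inv"], model["log_prior"][0])
    g0 = discriminant(x, model["mu"][1], model["inv"], model["log_prior"][1])
    return 1.0 / (1.0 + math.exp(g0 - g1))


def is_present(p1, th=0.6):
    """Both conditions: p(y=1|x) > p(y=0|x) and p(y=1|x) > TH."""
    return p1 > 1.0 - p1 and p1 > th


class1 = [(2, 2), (3, 2), (2, 3), (3, 3)]  # images with the annotation
class0 = [(0, 0), (1, 0), (0, 1), (1, 1)]  # images without it
model = fit_lda(class1, class0)
for x in [(2.5, 2.5), (1.5, 1.5), (0.5, 0.5)]:
    p1 = posterior_present(x, model)
    print(x, round(p1, 3), is_present(p1))
```

The middle test point lies halfway between the two class means, so its posterior is 0.5 and the annotation is not reported as present.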

Because a gene may be associated with multiple annotations, we can aggregate all individual annotation confidence scores into an overall rankscore,


$$\text{rankscore} = \frac{\sum_{i=1}^{M} p_i\, f(p_i - TH)}{\alpha + \sum_{i=1}^{M} f(p_i - TH)} \tag{3}$$

where $M$ is the number of total possible annotations, $p_i$ is the confidence score for a particular annotation term, and $f(\cdot)$ is the hard-limiting function with $f(u) = 1$ if $u > 0$ and $f(u) = 0$ otherwise; the parameter $\alpha$ in the denominator is mainly for avoiding division by zero when no annotation is found to be associated with the image. When $\alpha$ is 0, the rankscore is simply the average of the individual probabilities for various annotations.
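As an illustration of how such a score behaves, the sketch below assumes one plausible reading of the description: the hard-limiting function is applied to each confidence score minus the threshold TH, the numerator sums the confidence scores of the annotations that pass, and the denominator is their count plus alpha. That exact form, the function name `rankscore` and the default value of `alpha` are assumptions.

```python
def rankscore(confidences, th=0.6, alpha=1e-6):
    """Aggregate per-annotation confidence scores into one ranking score.

    Only annotations whose confidence exceeds th contribute; alpha keeps the
    denominator positive when no annotation passes the threshold.
    """
    passed = [p for p in confidences if p - th > 0]  # f(p_i - TH) = 1
    return sum(passed) / (alpha + len(passed))


print(round(rankscore([0.9, 0.7, 0.3]), 3))  # 0.8
print(rankscore([0.2, 0.1]))                 # 0.0
```

With a very small alpha the score is close to the mean confidence of the predicted annotations, and an image with no predicted annotation scores zero, so it sorts to the bottom of a ranked list.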

The rankscore of the annotation predictions provides two uses: (1) The predicted annotation results can be ranked/sorted based on these scores. When the annotation system processes a large number of images, most confident predictions with interesting annotation results can be quickly identified; (2) Instead of having only a decision on individual annotations, users will have additional global quantitative control of the confidence level.


    3 Experiments and Discussions
In this study, we made use of the mRNA gene expression pattern data set of 463 genes generated through our earlier work on gene clustering (Peng et al., 2006), which also explains the details of the image preprocessing methods. In most cases, for each gene, there is one representative mRNA expression pattern image determined by experts for each of the six stage ranges (stages 1–3, 4–6, 7–8, 9–10, 11–12 and 13–16) over the entire course of fly embryogenesis. We addressed two questions about our computational approach, as detailed in Sections 3.1 and 3.2, respectively.

  1. How good is the performance of pattern recognition with mRMR wavelet embryo features, compared against several other possible feature extraction methods?
  2. How can the automatic annotation system be applied to real applications?

3.1 Experiment 1—Evaluation of the mRMR wavelet embryo features
We compared our scheme with two feature extraction methods:

  1. Eigen-embryo decomposition (Peng et al., 2006) that uses principal component analysis to extract prominent image features.
  2. LeNet features. LeNet, an artificial neural network that can extract local image features, is considered one of the best machines for pattern recognition (LeCun et al., 1998). We used the common structure of LeNet with four hidden layers. Outputs of the fourth hidden layer were used as the extracted features of an embryo image.

In order to evaluate our feature extraction/selection algorithm, we built 10 testing sets using real images drawn from various stage ranges. For each stage range, we built two different data sets using two annotations. In the first set, each image has only one of the two possible annotations; the two annotations never correspond to the same image sample. In other words, these two annotations are mutually exclusive (M.E.); thus, we call the first set the ‘M.E. set’. In the second data set, some images may correspond to both annotations at the same time, i.e. the annotations overlap and the problem is multi-objective (M.O.). It is thus called the ‘M.O. set’, which models the more typical situation in our problem.

In addition, as we need a scheme that works well with unbalanced sample sizes, the synthetic sets were built in such a way that one annotation class has many more images than the other: the larger class usually has two or three times as many images as the smaller class. This can be seen from the sample numbers shown in both Tables 1 and 2. Overall, with both the M.E. set and the M.O. set on each of the five stage ranges (stages 4–6, 7–8, 9–10, 11–12, 13–16), we produced 10 testing sets, listed in Tables 1 and 2. (Note: here we did not show results for stage 1–3 because the gene expression pattern for stage 1–3 is very easy to classify, as it corresponds to only two annotations, ‘maternal’ or ‘none’.)


Table 1. Recognition rates (%) of feature extraction/selection methods on synthetic data sets with mutually exclusive (M.E.) annotations

 

Table 2. Recognition rates (%) of feature extraction/selection methods on synthetic data sets with multi-objective (M.O.) annotations

 
We compared wavelet embryo features against eigen-embryo features and LeNet features. Input images have a size of 50 x 100 pixels. For eigen-embryo features, we used the top 20 principal components, which gave the best performance among the tested range of 10–50 principal components. Without feature selection, the dimensions of raw LeNet and wavelet features are 3036 and 5050, respectively. For a fair comparison, we also selected the top 20 LeNet and wavelet embryo features using mRMR. These comparison schemes are shown in Tables 1 and 2, where we report recognition rates using Leave-One-Out Cross-Validation (LOOCV) and the LDA classifier (Section 2.3). LOOCV uses each image in turn as a testing sample while the remaining images are used to train the classifier, and sums up all errors in testing.

Tables 1 and 2 list the recognition rates of these five feature extraction/selection methods on M.E. and M.O. sets, respectively. Since the data sets are unbalanced, we report both the recognition rate on the small annotation class only (RS in Tables 1 and 2) and the overall recognition rate (RA in Tables 1 and 2).
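The two reported rates can be computed in a single leave-one-out loop. The sketch below is illustrative only: it substitutes a one-dimensional nearest-class-mean rule for the LDA classifier to stay short, the data are made up, and the names `nearest_mean_predict` and `loocv_rates` are hypothetical.

```python
def nearest_mean_predict(train_x, train_y, x):
    """Stand-in classifier: assign x to the class with the closest mean."""
    means = {}
    for label in set(train_y):
        values = [v for v, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return min(means, key=lambda label: abs(x - means[label]))


def loocv_rates(xs, ys, small_label=1):
    """Return (RA, RS): overall and small-class recognition rates in percent."""
    correct_all = correct_small = 0
    for i in range(len(xs)):
        # Hold out sample i and train on all the others.
        pred = nearest_mean_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i])
        hit = pred == ys[i]
        correct_all += hit
        if ys[i] == small_label:
            correct_small += hit
    ra = 100.0 * correct_all / len(xs)
    rs = 100.0 * correct_small / ys.count(small_label)
    return ra, rs


# Unbalanced toy set: six samples in class 0, three in class 1.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1.4, 1.6, 0.7]
ys = [0, 0, 0, 0, 0, 0, 1, 1, 1]
ra, rs = loocv_rates(xs, ys)
print(round(ra, 2), round(rs, 2))  # 88.89 66.67
```

A single misclassified sample from the small class lowers RS far more than RA, which is why reporting both is informative on unbalanced sets.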

We have several major observations from Tables 1 and 2:

  1. Both result tables consistently show that, in both M.E. and M.O. situations, mRMR-selected wavelet features outperform all four other types of features. For example, in Table 1 we see that while eigen features, LeNet features and wavelet features without mRMR selection lead to ~50–80% recognition accuracy, the mRMR wavelet embryo features consistently yield close to 100% accuracy. This indicates that the combination of wavelet embryo features and mRMR feature selection is effective at obtaining the most distinguishing feature sets. We thus used mRMR wavelet embryo features in our annotation system.
  2. Feature selection is essential for the success of the proposed use of wavelet embryo features. Using all wavelet coefficients turned out to be redundant and led to results that are less accurate than those of mRMR wavelet embryo features and comparable to those of eigen-embryo features. On the other hand, applying mRMR feature selection to LeNet features did not achieve equally good results, which shows that wavelet decomposition does capture more discriminative information from embryo images.
  3. Our results confirm that the multi-objective (M.O.) recognition problem is harder than the mutually exclusive (M.E.) situation. Recognition rates in Table 2 are in general lower than those in Table 1. This shows the difficulty of building the automatic annotation system for fly embryo pattern images that are usually multi-objective.
  4. Comparing the RA and RS recognition rates of the mRMR wavelet embryo features, we can see that the two numbers are often very close to each other. This indicates that with the good features, the LDA classifier works effectively on unbalanced data sets without being biased toward the larger class significantly. Thus, we used LDA as the classifier in our automatic annotation system.

3.2 Experiment 2—Evaluation of the overall automatic annotation system
We applied the best performing features, mRMR-selected wavelet embryo features, to evaluate our two-tier automatic annotation system for gene expression images. The evaluation set consists of expression patterns of all 463 genes extracted from the BDGP gene expression database (http://www.fruitfly.org).

  • Tier 1: An input gene expression pattern image is automatically assigned to a stage-range using our algorithm.

This is a mutually exclusive six-class recognition problem, since an image belongs to one and only one stage range out of six possibilities: stages 1–3, 4–6, 7–8, 9–10, 11–12 and 13–16. For each image, we used the top 20 wavelet embryo features selected by mRMR. 10-fold cross validation was used to report the recognition rate. This commonly used method randomly separates our data set into 10 portions; each is used in turn as a test set while the remaining nine portions are used to train the classifier. The final recognition rate is computed from the total error over all portions.
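The splitting step of 10-fold cross validation can be sketched as follows. This is a generic illustration using only the Python standard library; the function name `k_fold_indices` and the fixed random seed are hypothetical choices, and the classifier training itself is left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to portions
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Every sample is used for testing exactly once across the 10 folds.
splits = list(k_fold_indices(25, k=10))
tested = sorted(j for _, test in splits for j in test)
print(len(splits), tested == list(range(25)))  # 10 True
```

Summing the test errors over all folds and dividing by the number of samples gives the overall recognition rate described above.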

As shown in the confusion matrix of predictions and the detailed prediction table in the supplementary materials, we achieved a recognition rate of 99.55% over all our pattern images using wavelet embryo features and the LDA classifier. This number indicates that our system can assign an incoming image to one of the six stage ranges very reliably.

  • Tier 2: Once an image is assigned to a stage range, our system automatically assigns annotation terms to it. This is a multi-objective problem for which we developed binary classifiers for every annotation of interest.

Top 20 mRMR-selected wavelet embryo features were used in training each LDA classifier. As a quantitative assessment of the results, LOOCV is used to produce the prediction accuracy.

The annotation system produces three pieces of information for each testing image:

  1. The decision whether the specific annotation is considered ‘present’ in this image
  2. The estimation confidence score of each annotation
  3. The rankscore of all annotations given to this image

These results are presented in Table 3. Genes are sorted so that embryo images with the most confident annotation predictions are shown at the top. Entries with a probabilistic confidence score lower than 0.6 are marked with a dash ‘–’ to indicate that our system did not predict that the respective annotation is ‘present’ in this image. The complete annotation tables for all embryogenesis stages are available on the authors’ websites as supplements. Table 3 shows partial annotation results for stages 11–12. The top 30 ranked genes with their expression pattern images are listed together with their probabilities of having any of the five most common annotations ‘present’ at the specific stage range. The supplementary materials include prediction results for a larger set of annotations, varying from 10 to 18 annotations per stage range.


Table 3. Predicted annotations for images at stage 11–12

 
It can be seen from the result table that the majority of our automatic annotations are consistent with the experts’ annotations stored in the BDGP database. It is evident that our system is a meaningful effort to address this challenging problem.

Our system can successfully recognize patterns and annotations even for images of poor quality in terms of blur and deformation, such as the patterns of genes CG12157 (row 27) and CG4608 (row 29), both of which are annotated without any error.

It is also worth noting that the BDGP database may have some missing annotations such as row 12 in Table 3 (gene bowl, CG10021) for stages 11–12. In this case, our system is still capable of predicting some gene expression image structures, which are likely to be correct if we visually compare them with other genes with similar annotations. For example, for the gene bowl, our system predicts it has two annotations PMP and AMP; when we check other genes like CG10535, CG33071, etc., with the same annotations, we can see similar patterns visible in the respective images. This suggests that the automatic annotation system may be used to fill missing annotations for existing databases besides being used for annotating newly collected pattern images.


    4 Conclusions
 
In this paper, we have proposed a practical paradigm to recognize and automatically annotate in situ gene expression patterns of fly embryos by combining the minimum redundant wavelet embryo features and a two-tier classification system. We have achieved promising results using patterns of the entire embryogenesis course of 463 fly genes.


    FOOTNOTES
 
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Associate Editor: Alvis Brazma

Received on October 3, 2006; revised on December 10, 2006; accepted on January 5, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 Experiments and...
 4 Conclusions
 REFERENCES
 

    Carson JP, et al. A digital atlas to characterize the mouse brain transcriptome. PLoS Comput. Biol., ( (2005) ) 1, : e41.[CrossRef][Medline].

    Chang C, Lin C. LIBSVM: a library for support vector machines. ( (2001) ) Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm..

    Chien JT, Wu CC. Discriminant waveletfaces and nearest feature classifiers for face recognition. IEEE Trans. Pattern Anal. Mach. Intell., ( (2002) ) 24, : 1644–1649.[CrossRef].

    Daubechies I. Ten Lectures on Wavelets, Science for Industrial and Applied Mathematics., ( (1992) ) Society for Industrial and Applied Mathematics..

    LeCun Y, Botton L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, ( (1998) ) 86, (11): 2278–2324.[CrossRef][ISI].

    Mallat SG. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell., ( (1989) ) 11, : 674–693.[CrossRef].

    Mallat SG. A Wavelet Tour of Signal Processing, ( (1999) ) 2nd edn. USA: Academic Press..

    Mallet Y, et al. Classification using adaptive wavelets for feature extraction. IEEE Trans. Pattern Anal. Mach. Intell., ( (1997) ) 19, : 1058–1066.[CrossRef].

    Manjunath BS, Ma WY. Texture features for browsing and retrieval of image data. IEEE Trans. Pattern Anal. Mach. Intell., ( (1996) ) 18, : 837–842.[CrossRef].

    Pan JY, et al. Automatic mining of fruit fly embryo images. In. Proc. ACM SIGKDD 2006, ( (2006) )..

    Peng H, Myers EW. Comparing in situ mRNA expression patterns of Drosophila embryos. In. Proc. RECOMB 2004, ( (2004) ) pp. 157–166..

    Peng H, et al. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell., ( (2005) ) 27, : 1226–1238.[CrossRef][Medline].

    Peng H, et al. Clustering gene expression patterns of fly embryos. In. Proc. ISBI 2006, ( (2006) ) pp. 1144–1147..

    Pitter S, Kamarthi SV. Feature extraction from wavelet coefficients for pattern recognition tasks. IEEE Trans. Pattern Anal. Mach. Intell., ( (1999) ) 21, : 83–88.[CrossRef].

    O’Neill TJ. Error rates of non-Bayes classification rules and the robustness of Figher's linear discriminant function. Biometrika, ( (1992) ) 79, : 177–184.[Abstract/Free Full Text].

    Resnikoff HL, Wells RO. Wavelet Analysis, The Scalable Structure of Information, ( (1998) ) New York: Springer..

    Tomancak P, et al. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, ( (2002) ) 3, ..

    Vapnik V. The Nature of Statistical Learning Theory, ( (1995) ) Berlin: Springer-Verlag..

    Webb A. Statistical Pattern Recognition, ( (2002) ) 2nd. Chichester, West Sussex England: John Wiley and Sons..

