Bioinformatics Advance Access originally published online on October 11, 2006
Bioinformatics 2006 22(24):3032-3039; doi:10.1093/bioinformatics/btl521
Maximum likelihood inference of imprinting and allele-specific expression from EST data
Computational Biology Group, Institute of Infectious Disease and Molecular Medicine, University of Cape Town, South Africa
*To whom correspondence should be addressed
| ABSTRACT |
|---|
Motivation: In a diploid organism the proportion of transcripts that are produced from the two parental alleles can differ substantially due, for example, to epigenetic modification that causes complete or partial silencing of one parental allele or to cis acting polymorphisms that affect transcriptional regulation. Counts of SNP alleles derived from EST sequences have been used to identify both novel candidates for genomic imprinting and examples of genes with allelic differences in expression.
Results: We have developed a set of statistical models in a maximum likelihood framework that can make highly efficient use of public transcript data to identify genes with unequal representation of alternative alleles in cDNA libraries. We modelled both imprinting and allele-specific expression and applied the models to a large dataset of SNPs mapped to EST sequences. Using simulations, matched closely to real data, we demonstrate significantly improved performance over existing methods that have been applied to the same data. We further validated the power of this approach to detect imprinting using a set of known imprinted genes and inferred a set of candidate imprinted genes, several of which are in close proximity to known imprinted genes. We report evidence that there are undiscovered imprinted genes in known imprinted regions. Overall, more than half of the genes for which the most data are available show some evidence of allele-specific expression.
Availability: Software is available from the authors on request.
Contact: cathal{at}science.uct.ac.za
Supplementary information: http://cbio.uct.ac.za/publication_support/ML_EST
| 1 INTRODUCTION |
|---|
Different rates of transcription from the two alleles in a diploid organism can result from epigenetic modification that causes the complete or partial suppression of one allele or from cis acting sequence polymorphisms that affect the rate of transcription. Genomic imprinting, the heritable suppression of one parental allele through epigenetic modification, plays a role in several important heritable human diseases and remains a topic of interest and debate among evolutionary biologists (Wilkins, 2005; Reik and Lewis, 2005; Constancia et al., 1998; Morison and Reeve, 1998). More recently, sequence polymorphisms that affect gene expression, both in cis and in trans, have become recognized as a major source of phenotypic variation and disease (Oleksiak et al., 2002; Morley et al., 2004; Schadt et al., 2003; Yan et al., 2002; Pastinen and Hudson, 2004). Because of their biological importance and implications for human health, several high-throughput methods have been applied to find novel candidate imprinted genes and genes with allelic differences in expression (allele-specific expression), including methods that have made use of the data available in public transcript databases (Ge et al., 2005; Lin et al., 2005; Yang et al., 2003).
Genomic imprinting is thought to affect a small minority of genes. In human and mouse, genomic imprinting has been confirmed for 41 and 71 transcriptional units (TUs), respectively (Morison et al., 2005), often occurring in genomic clusters. Although the number of human diseases and phenotypes with parent-of-origin effects is greater than can be accounted for by known imprinted genes (Morison and Reeve, 1998; Robertson, 2005), it has been suggested that only a small number of imprinted genes remain to be discovered (Morison et al., 2005). In many cases genomic imprinting affects just a subset of the tissues and developmental stages in which a gene is expressed. As a result it can be difficult to confirm that a specific gene is never imprinted and the imprinting status of several genes is disputed (Morison et al., 2005).
Luedi et al. (2005) inferred imprinted genes in mouse from genomic sequence features and put the number of imprinted mouse genes at 600. Contrary to the view expressed by Morison et al. (2005), this result strongly suggests that the vast majority of imprinted genes remain undiscovered. A wide range of methods has also been applied to detect allele-specific expression by measuring expression level differences between the two alleles of a polymorphic marker in heterozygous individuals (Knight, 2004; Pastinen and Hudson, 2004; Yan et al., 2002; Buckland, 2004). Two groups have recently used public transcript data for this purpose (Ge et al., 2005; Lin et al., 2005).
In the current work we have applied statistical models, in a maximum likelihood framework, to detect genes for which there is evidence of unequal expression of alternative alleles in a subset of the cDNA libraries in dbEST. The models are sensitive to the absence of apparent heterozygous libraries and imbalance between alternative alleles in libraries in which both alleles are observed. We have modelled two alternative causes of unequal representation of alleles, imprinting and allele-specific expression, and investigated a set of genes for which the model of genomic imprinting is favoured over the null model. Using simulation and a positive control data set we confirm the power of this method to detect imprinted genes and show that it has far greater power than existing methods that have been applied to the same data. To test whether there is evidence of undiscovered imprinted genes in known imprinted regions, we also assessed the evidence in favour of the imprinting model for genes that are located close to known imprinted genes.
| 2 MATERIALS AND METHODS |
|---|
ESTs and clone library information were downloaded from dbEST (version 146; ftp://ftp.ncbi.nih.gov/repository/dbEST/). Human chromosomal sequences, as well as flat files of pre-computed EST and SNP genomic locations based on NCBI genome assembly 35, were downloaded from the UCSC genome database [ftp://hgdownload.cse.ucsc.edu/goldenPath/hg17; (Karolchik et al., 2003)]. We discarded ESTs for which <90% of the sequence was mapped to the genome. The EST base at the position corresponding to the SNP was inferred from the EST sequences and genomic coordinates of the SNPs and ESTs. EST sequences derived from cDNA libraries annotated as pooled by tissue donor (Kelso et al., 2003) were removed from the analysis.
Known and novel transcripts based on NCBI Genome build 35 (Hubbard et al., 2005) were downloaded from Ensembl (ftp://ftp.ensembl.org/pub/current_human/data/fasta/). We aligned the transcript sequences to the genome using BLAT (Kent, 2002) with the default parameters and the -fine switch, recommended for mRNA to genome alignment. We removed all transcripts that mapped to more than one location on the genome and discarded any transcripts for which <90% of the total transcript length could be mapped. All Ensembl genes with at least one transcript that mapped multiple times to the genome were removed from our dataset in order to reduce the impact of paralogous sequences on our results. The final dataset consisted only of ESTs and SNPs corresponding to Ensembl transcripts that mapped to unique genomic locations. This dataset consisted of 286 692 different SNPs mapped to transcripts from 6716 different cDNA libraries and from 13 661 different Ensembl genes.
2.1 Simulations
Simulations were used to compare the performance of alternative methods to recover imprinted genes and genes with allele-specific expression levels. We took a sample of 100 SNPs from our dataset satisfying the power criteria (described in Results and Discussion) and designated 50 as imprinted and the remainder as non-imprinted. We constructed 1000 random replicates of this data, matching allele frequency, sequence error rate and EST coverage to the values estimated from the real data. For each simulated cDNA library the genotype for a particular SNP was assigned randomly under the assumption of Hardy–Weinberg equilibrium and estimated population allele frequencies for the SNP. We simulated imprinting with complete and with partial silencing of the imprinted allele. In the case of partial silencing the probability of observing the silenced allele in an imprinted library was sampled from a normal distribution with mean and standard deviation equal to 0.1 (truncated to the left at zero). For the allele-specific expression simulations, we again simulated replicate datasets consisting of 50 cases and 50 controls. For these cases, the highly-expressed allele was assigned randomly and the probability of observing the low-expression allele was obtained by again sampling from the normal distribution with a mean and standard deviation of 0.1 (truncated at 0). Simulations as well as statistical tests described in Yang et al. (2003) were implemented in Perl.
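The simulation of a single cDNA library can be sketched as follows. This is a minimal illustration in Python, not the authors' Perl code: the function names, the per-EST sampling scheme and the choice to silence either parental allele with probability 1/2 are assumptions made for the example.

```python
import random

def simulate_library(f_a, n_ests, err, q=None, rng=random):
    """Simulate counts (a, b) of alleles A and B for one cDNA library.

    f_a    : population frequency of allele A
    n_ests : number of ESTs covering the SNP in this library
    err    : probability that an EST shows the wrong SNP allele
    q      : if not None, the library is imprinted and q is the probability
             of sampling the silenced allele (q = 0 is complete silencing)
    """
    # Genotype under Hardy-Weinberg equilibrium: two independent allele draws.
    alleles = ["A" if rng.random() < f_a else "B" for _ in range(2)]
    if alleles[0] == alleles[1]:
        p_a = 1.0 if alleles[0] == "A" else 0.0
    elif q is None:
        p_a = 0.5
    else:
        # Imprinted heterozygote: silence one parental allele at random.
        p_a = q if rng.random() < 0.5 else 1.0 - q
    a = 0
    for _ in range(n_ests):
        true_is_a = rng.random() < p_a
        misread = rng.random() < err
        # An A is recorded for a correctly read A or a misread B.
        if true_is_a != misread:
            a += 1
    return a, n_ests - a

def sample_partial_silencing(rng=random, mean=0.1, sd=0.1):
    """Draw q from a normal distribution truncated on the left at zero."""
    while True:
        q = rng.gauss(mean, sd)
        if q >= 0.0:
            return q
```

Passing a seeded `random.Random` instance as `rng` makes replicates reproducible.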
| 3 ALGORITHM |
|---|
For a given SNP, with alleles A and B, that could be mapped to transcript sequences, the data consisted of counts, ai and bi, of the SNP alleles in the cDNA libraries i = 1 ... N, in which the transcript was found. The likelihood of the observed data, under the null model with no allele-specific expression or imprinting, can be expressed as a function of the frequency of allele A, fA, and the sequencing error rate, ε. We defined ε as the probability that the incorrect allele was read at the SNP position for a given EST and estimated this probability separately for each SNP. The error rate was estimated as half of the proportion of ESTs for which the nucleotide at the polymorphic position corresponded to neither of the SNP alleles, with pseudocounts (one per allele) added to allow for sparse data. In the case of SNPs with more than two alleles this would result in a conservative overestimate of ε. Assuming Hardy–Weinberg equilibrium the likelihood of the data is given by
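As a sketch of what such a null likelihood could look like under these definitions (an assumed form, not necessarily the authors' exact expression): each library is treated as derived from a single diploid individual, reads matching neither allele are ignored, the binomial coefficient is dropped because it is constant across models, and ε denotes the sequencing error rate. The three terms correspond to genotypes AA, AB and BB.

```latex
L_0(f_A,\varepsilon) = \prod_{i=1}^{N} \Big[
    f_A^{2}\,(1-\varepsilon)^{a_i}\,\varepsilon^{b_i}
  + 2 f_A (1-f_A)\,\big(\tfrac{1}{2}\big)^{a_i+b_i}
  + (1-f_A)^{2}\,\varepsilon^{a_i}\,(1-\varepsilon)^{b_i}
\Big]
```

In a heterozygous library the two alleles are equally likely to be read, and symmetric errors leave that probability at 1/2.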
| 3.1 Model of genomic imprinting |
|---|
We used a mixture model and two additional parameters to model genomic imprinting, which may be dependent on the context of gene expression and which may involve partial rather than complete silencing of one of the parental alleles. For a given gene the mixture weight, pI, represents the proportion of cDNA libraries in which the gene is imprinted and the parameter q represents the probability of observing the suppressed allele in an imprinted cDNA library. The likelihood of the data under the imprinting model is
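A minimal sketch of a mixture likelihood of this kind, written in Python rather than the authors' Perl. The function and helper names are hypothetical, and several modelling details are assumptions: each library is treated as a single diploid individual, either parental allele is suppressed with probability 1/2 in an imprinted heterozygote, and sequencing error is folded into q. Setting the mixture weight to zero recovers a null model.

```python
import math

def _log(x):
    """Natural log that returns -inf for non-positive input."""
    return math.log(x) if x > 0.0 else -math.inf

def _logsumexp(terms):
    """Stable log(sum(exp(t))) over a list of log-scale terms."""
    m = max(terms)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(t - m) for t in terms))

def _log_binom(a, b, p):
    """log of p**a * (1 - p)**b, with a zero count contributing nothing."""
    out = 0.0
    for k, x in ((a, p), (b, 1.0 - p)):
        if k > 0:
            if x <= 0.0:
                return -math.inf
            out += k * math.log(x)
    return out

def imprinting_loglik(counts, f_a, err, p_i=0.0, q=0.0):
    """Log-likelihood of allele counts under an imprinting mixture (sketch).

    counts : list of (a_i, b_i) allele counts, one pair per cDNA library
    f_a    : frequency of allele A
    err    : probability that an EST shows the wrong SNP allele
    p_i    : proportion of libraries in which the gene is imprinted
    q      : probability of observing the suppressed allele when imprinted
    """
    # Assumption: sequencing error also acts on reads from imprinted libraries.
    q_obs = q * (1.0 - err) + (1.0 - q) * err
    het = _log(2.0 * f_a * (1.0 - f_a))
    total = 0.0
    for a, b in counts:
        terms = [
            _log(f_a ** 2) + _log_binom(a, b, 1.0 - err),       # AA
            _log((1.0 - f_a) ** 2) + _log_binom(a, b, err),     # BB
            het + _log(1.0 - p_i) + _log_binom(a, b, 0.5),      # AB, not imprinted
            het + _log(p_i) + _log(0.5) + _log_binom(a, b, q_obs),        # AB, A suppressed
            het + _log(p_i) + _log(0.5) + _log_binom(a, b, 1.0 - q_obs),  # AB, B suppressed
        ]
        total += _logsumexp(terms)
    return total
```

Working on the log scale avoids underflow when a library contains many ESTs.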
| 3.2 Model of allele-specific expression |
|---|
We considered a cis regulatory variant which is closely linked to the observed SNP allele and which may affect expression level in a subset of expression contexts. In the absence of recombination between the SNP allele and the regulatory variant we would expect either no effect in a given cDNA library or overrepresentation of one allele, with the overrepresented allele consistent across libraries. Using the mixture weight pE to represent the proportion of cDNA libraries in which there is an allelic difference in gene expression and a new parameter to represent the probability of observing allele A in an affected cDNA library, the likelihood of the data is
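One form such a likelihood could take is sketched below. It is an assumption built from the definitions in this section, not necessarily the authors' expression. Here f_A is the frequency of allele A, ε the sequencing error rate, a_i and b_i the allele counts in library i, p_E the mixture weight, and θ a placeholder symbol chosen here for the probability of observing allele A in an affected library (taken to include any effect of sequencing error).

```latex
L_E(f_A,\varepsilon,p_E,\theta) = \prod_{i=1}^{N} \Big[
    f_A^{2}\,(1-\varepsilon)^{a_i}\,\varepsilon^{b_i}
  + (1-f_A)^{2}\,\varepsilon^{a_i}\,(1-\varepsilon)^{b_i}
  + 2 f_A (1-f_A)\Big( (1-p_E)\,\big(\tfrac{1}{2}\big)^{a_i+b_i}
  + p_E\,\theta^{a_i}\,(1-\theta)^{b_i} \Big)
\Big]
```

Because a single θ is shared by all libraries, this form rewards data in which the same allele is overrepresented throughout, and it reduces to a null model when p_E = 0.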
| 3.3 Model of loss of heterozygosity in cancer |
|---|
Both the imprinting and the allele-specific expression models can be adapted to model loss of heterozygosity associated with cancer. Loss of heterozygosity involving consistent loss of the same allele, such as would be expected for example with cancers resulting from heterozygosity for a defective tumour suppressor gene, can be modelled by adapting the allele-specific expression model. We compare the likelihood of a model in which the parameter pE is estimated separately for libraries annotated as cancer or normal by eVOC (Kelso et al., 2003) to the likelihood of a model in which a single value of pE is estimated for all libraries. Similarly the imprinting model can be adapted to model loss of heterozygosity that does not involve consistent loss of the same allele, such as would be expected for example in the case of a haploinsufficient tumour suppressor gene.
| 3.4 Optimization |
|---|
Maximum likelihood parameter estimates were obtained by optimizing the likelihood using Powell's method (Press et al., 1992), implemented in the Perl programming language. Raw data and software are available from the authors on request.
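To illustrate maximising a likelihood without gradients, the sketch below uses a cyclic coordinate search with golden-section line searches. This is a simpler relative of Powell's method, not Powell's method itself and not the authors' implementation; all names are hypothetical, and the line search assumes the objective is unimodal along each axis within the bounds.

```python
import math

def golden_section(f, lo, hi, tol=1e-6):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi = d   # minimum lies in [lo, d]
        else:
            lo = c   # minimum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2.0

def coordinate_search(neg_loglik, start, bounds, sweeps=50, tol=1e-8):
    """Derivative-free minimisation by repeated line searches along each axis."""
    x = list(start)
    best = neg_loglik(x)
    for _ in range(sweeps):
        for k, (lo, hi) in enumerate(bounds):
            def along_axis(v, k=k):
                # Vary coordinate k only, holding the others at current values.
                return neg_loglik(x[:k] + [v] + x[k + 1:])
            x[k] = golden_section(along_axis, lo, hi)
        new = neg_loglik(x)
        if best - new < tol:
            break
        best = new
    return x, new
```

A likelihood is maximised by passing its negative as `neg_loglik`, with bounds such as (0, 1) for allele frequencies and mixture weights.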
| 4 RESULTS AND DISCUSSION |
|---|
Excluding genes on the X and Y chromosomes, we found 81 SNPs out of a total of 52 027 tested, for which the imprinting model provided a better fit to the data than the null model at the 1% significance level (a complete list is provided as Supplementary Data online). The power to detect an improved fit to the data depends on the number of ESTs that map to the SNP and on the allele frequencies: if there are too few ESTs or if one allele is too rarely observed, the data do not provide a chance of identifying imprinting or allele-specific expression. We therefore restricted the analysis to SNPs that were spanned by at least 50 ESTs in the database and for which the expected number of ESTs derived from the less frequent allele was at least five. This resulted in 76 SNPs for which the fit of the imprinting model was significantly better than the null model from a total of 3969 SNPs tested (corresponding to 1898 different genes). These power criteria succeeded in identifying a much smaller set of SNPs to test while losing very few cases in which the null model could be rejected. The SNPs that favoured the imprinting model were derived from 60 different genes. The genes in which the imprinting model was favoured at the 1% significance level in at least one SNP, ranked in order of the largest difference in log likelihood between the imprinting and null models, are shown in Table 1. We have also provided a more complete version of the data corresponding to these 60 genes as Supplementary Table 2. For each candidate gene this includes a list of libraries in which transcripts of the gene were found as well as eVOC (Kelso et al., 2003) anatomical annotations for each of the cDNA libraries. The table also includes SNP IDs and the number of transcripts of each allele of the SNP observed in each of the cDNA libraries.
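The two power criteria can be expressed as a simple filter. In this sketch the expected number of ESTs from the less frequent allele is taken to be the EST count multiplied by the minor allele frequency, which is an assumption about how that expectation was computed; the function name is hypothetical.

```python
def passes_power_criteria(n_ests, f_a, min_ests=50, min_minor=5.0):
    """Return True if a SNP has enough data to be tested.

    n_ests : number of ESTs spanning the SNP
    f_a    : estimated frequency of allele A
    """
    minor_freq = min(f_a, 1.0 - f_a)
    # Both conditions must hold: enough ESTs overall and enough expected
    # ESTs carrying the less frequent allele.
    return n_ests >= min_ests and n_ests * minor_freq >= min_minor
```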
We used simulation to evaluate the accuracy and power of the statistical models we present and to compare to previously published methods of detecting imprinting and allele-specific expression using data in the cDNA libraries of dbEST. In 1000 replicate datasets, each consisting of 50 genes simulated under imprinting (with complete silencing of the imprinted allele) and 50 non-imprinted genes, we recovered on average 67% of the imprinted genes before the first false-positive discovery (Fig. 1). Yang et al. (2003) presented two alternative methods to detect novel candidate imprinted genes from dbEST. Both methods use only ESTs with trace information and high Phred scores rather than modelling sequence error. In the first method, Yang et al. (2003) compared the expected frequency of heterozygous libraries based on inferred frequencies of individual alleles to the expected number of heterozygous libraries, given the EST data mapping to a specific SNP. The difference between these quantities was summarized as a Z-statistic, which was used to calculate a P-value. Few, if any, of the P-values obtained by this approach are statistically significant and instead Yang et al. (2003) rank the logarithm of the P-values and consider the top-ranking examples as candidates for imprinting. The way in which Bayes' rule was applied in this method is questionable (the use of flat priors for the alternative genotypes underlying the data in individual cDNA libraries ignores the information about the allele frequencies contained in the collection of cDNA libraries) and it is possible to construct trivial data sets in which the expected value of their Z-statistic is not zero (e.g. a dataset that consists of just one EST in a single cDNA library). However, in simulated data, we found that the method used by Yang et al. (2003) did have some power to detect imprinting (Fig. 1a), although, for any given false positive rate, the power was much lower than that of the method presented here.
An alternative method contained in the same paper and applied in a subsequent paper (Lin et al., 2005) to the problem of identifying allele-specific expression from EST data relies on observing both alleles in heterozygous individuals. It therefore cannot be used to detect cases of complete imprinting or allele-specific expression (where one allele is never observed). In this method the statistical significance of the difference in allele frequencies within the cDNA library is evaluated using the binomial distribution. We simulated datasets with partial suppression of one allele to compare the performance of this method and the previous method to the likelihood method. For most of the false-positive rate range the likelihood method has far greater power to recover the imprinted genes than either of the previously published methods (Fig. 1b). We also found that the likelihood method had greater power to detect cases of allele-specific expression than the binomial method (Fig. 1c).
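A within-library binomial test of this general kind can be sketched as follows. The exact variant used in the earlier studies is not specified here, so the two-sided rule below (summing all outcomes no more likely than the observed one) and the function name are assumptions for illustration.

```python
from math import comb

def binomial_two_sided(a, b, p=0.5):
    """Two-sided binomial P-value for allele counts a and b in one library.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under equal expression (p = 0.5 by default).
    """
    n = a + b
    pmf = [comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[a]
    # Small relative tolerance so that ties are counted despite rounding.
    return min(1.0, sum(x for x in pmf if x <= observed * (1.0 + 1e-9)))
```

Because the test is applied to one library at a time and needs both alleles to be informative, it cannot pool weak evidence across many libraries in the way a joint likelihood can.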
To validate the power of the imprinting model to highlight candidate imprinted genes, we obtained a set of 34 known imprinted TUs from the imprinted gene catalogue [(Morison et al., 2001); http://igc.otago.ac.nz] that could be mapped to Ensembl genes. Of the imprinted genes, six had SNPs that met the power criteria described above and five of the six were among the genes for which the imprinting model was favoured over the null model at the 1% significance level (P-values from the likelihood ratio test with two additional parameters ranged from 0.003 to 8 × 10⁻¹⁰). The remaining known imprinted gene (IGF2) showed a non-significant improvement in likelihood for the imprinting model over the null model (P = 0.1), and a significant improvement in likelihood for the allele-specific expression model over the null model (P = 0.007). The enrichment for imprinted genes among the genes for which the null model was rejected in favour of the imprinting model for at least one SNP was highly statistically significant (P = 1 × 10⁻⁷ from two-sided Fisher's exact test).
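A likelihood ratio test with two additional parameters is commonly referred to a chi-square distribution with two degrees of freedom, whose survival function has a closed form. The sketch below assumes that reference distribution; the source does not state it explicitly, and the approximation can be imperfect when a mixture weight is estimated at the boundary of its range. The function name is hypothetical.

```python
import math

def lrt_pvalue_2df(loglik_null, loglik_alt):
    """P-value for a likelihood ratio test with two extra parameters.

    Uses the chi-square distribution with 2 degrees of freedom, whose
    survival function is exp(-x / 2).
    """
    statistic = 2.0 * (loglik_alt - loglik_null)
    if statistic <= 0.0:
        return 1.0
    return math.exp(-statistic / 2.0)
```

Under this assumption the P-value equals the exponential of minus the log-likelihood difference, so the 1% level corresponds to a difference of about 4.6 log-likelihood units.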
Several of the genes in Table 1, not previously identified as imprinted, are promising candidates. TFPI2 (tissue factor pathway inhibitor 2), the top candidate from the table (P = 1 × 10⁻¹³), is located within a region on chromosome 7 that contains several known imprinted genes (Fig. 2a). The TFPI2 gene has a 44 bp antisense overlap with an alternative isoform of GNGT1, which shows evidence of maternal expression (Okita et al., 2003). Another of the candidate imprinted genes in Table 1, COL1A2 (collagen, type I, alpha 2; P = 0.005), is also located in this region, 150 kb from a known imprinted gene (SGCE). COL1A2 was considered a candidate for genomic imprinting in mouse on the basis of differential expression between parthenogenote and androgenote mouse embryos but found to be biallelically expressed in normal mouse embryos (Mizuno et al., 2002). Our data suggest that this gene should be reconsidered as a possible candidate for imprinting in human and that a greater proportion of the genes in this region may be imprinted than previously thought.
Five genes from Table 1 map to 11p15, an imprinted region involved in Beckwith–Wiedemann syndrome (Morison and Reeve, 1998). Two of these are the well-known imprinted genes INS and H19. One of the candidate genes from Table 1, putative insulin-like growth factor II associated protein, C11orf43, maps to the 162 kb region between INS and H19, just 3 kb from the imprinted insulin-like growth factor 2 (IGF2; Fig. 2b). A second candidate, CTSD, located 230 kb from H19, is among a set of candidate imprinted genes in mouse (Luedi et al., 2005). Although biallelic expression of CTSD has previously been reported (Rachmilewitz et al., 1993), this may be due to limited tissues in which imprinting of this gene has been tested (Luedi et al., 2005). The integrin-linked kinase gene (ILK) from Table 1, also on chromosome 11, is <350 kb from a putatively imprinted zinc finger gene (ZNF215) in 11p15.4. Given the limited proportion of genes for which sufficient data were available to test for imprinting using this approach, the presence of three novel candidate imprinted genes suggests that the number of genes in this region that are imprinted in at least some individuals and under at least some conditions is underestimated.
The appearance of insulin-like growth factor binding protein 1 (IGFBP1) in Table 1 is also of interest. This protein is involved in regulation of prenatal growth (Jones and Clemmons, 1995) and is encoded within 5 Mb of the imprinted gene GRB10 on chromosome 7, but previous experimental results reported that IGFBP1 is not imprinted (Wakeling et al., 2000). However the experimental test of imprinting of IGFBP1 involved just one fetal tissue source (fetal liver) and the authors point to the possibility that IGFBP1 might be imprinted in other tissues (Wakeling et al., 2000). Closer examination of our data reveals that the improved fit of the imprinting model over the null model for the SNP shown in Table 1 is primarily due to data from a single placental library in which there is a substantial difference in the proportion of two alleles present for one SNP. There is also a library derived from pregnant uterus in which there is a significant difference between the proportions of the two alleles for a second SNP (rs7454; see Supplementary data). Interestingly, IGFBP3, which is located adjacent to IGFBP1 on human chromosome 7, also shows evidence of imprinting at the 5% significance level (P = 0.02).
Since imprinted genes frequently occur in genomic clusters, we plotted the sliding average of the difference in the log likelihoods between the null and imprinting models (ΔL) using windows of 10 consecutive genes satisfying our power criteria. This highlights several, though not all, of the well-known clusters of imprinted regions (e.g. imprinted regions on chromosomes 7 and 11 shown in Fig. 3). To test whether there is evidence of additional imprinted genes in known imprinted regions we compared ΔL between genes that are located close to known imprinted genes and the remaining genes, considering only genes that satisfied our power criteria. There were 32 genes in our dataset within 1 Mb of a known imprinted gene, which were not themselves known to be imprinted. The mean value of ΔL for these genes was 2.5, compared to 1.3 for the remaining genes (P = 0.003, from 100 000 random samples). Genes in close proximity to known imprinted genes are the most likely to have had their imprinting status tested experimentally (Morison et al., 2005). However, our results suggest that there are additional imprinted genes in these regions that remain to be discovered. Genes in Table 1, such as COL1A2, TFPI2, CTSD and C11orf43 (Fig. 2), are good candidates for genomic imprinting.
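The sliding average and the comparison by random sampling can be sketched as below, where each value is a per-gene difference in log likelihood between the imprinting and null models. The text says only that the P-value came from 100 000 random samples, so the resampling scheme shown (random gene sets of the same size drawn without replacement from all genes, one-sided) is an assumption, as are the function names.

```python
import random

def sliding_mean(values, window=10):
    """Mean of each run of `window` consecutive values (genes in genome order)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def resampling_pvalue(near, rest, n_samples=100_000, rng=random):
    """One-sided P-value for the mean of `near` being as large as observed.

    Draws random gene sets of the same size as `near` from all genes and
    counts how often their mean reaches the observed mean.
    """
    pooled = list(near) + list(rest)
    observed = sum(near) / len(near)
    hits = 0
    for _ in range(n_samples):
        sample = rng.sample(pooled, len(near))
        if sum(sample) / len(near) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_samples + 1)
```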
There were 108 genes for which the allele-specific expression model fitted the data better than the null model for at least one SNP at the 1% significance level (Supplementary Table 3). This included 58 of the 60 genes for which the imprinting model fitted the data significantly better than the null model as well as an additional 50 genes for which only the allele-specific model was favoured over the null model. Because the imprinting and allele-specific models are not nested it was not possible to use the likelihood ratio test to distinguish between them. For most of the genes in Table 1 the difference in log likelihood between the two models is small, suggesting that the data available in dbEST cannot distinguish imprinted genes from genes with allelic differences in expression level.
The overlap between the genes that favour the imprinting model over the null model and the genes that favour the allele-specific expression model over the null model is not surprising, because the effects modelled in both cases are similar. The only difference between these two models is whether or not the overrepresented allele is consistent across cDNA libraries: the imprinting model allows different alleles to be overrepresented in different libraries, while the allele-specific expression model favours situations in which the same allele tends to be overrepresented in different libraries. In many cases, the data do not provide enough information to distinguish between these two possibilities. For a given gene, rejection of the null model in favour of either of the alternative models, is evidence of unequal expression of the two alleles of the gene in at least a subset of the cDNA libraries in which the gene is found.
Considering the data set as a whole we find evidence that the null model of exactly equal expression of alternative alleles is unlikely to apply for most of the genes analysed. The proportion of a set of independent hypothesis tests for which the null hypothesis is true can be estimated from a vector of P-values using software packages such as QVALUE (Storey and Tibshirani, 2003). Application to the vector of P-values from the allele-specific expression model reveals that the null hypothesis is likely to be strictly true for only a minority of genes (20%). This can be visualized by comparing the distribution of P-values we get from applying the likelihood ratio test to data generated under the null hypothesis of equal expression of alternative alleles, to the P-values we obtain from the real data (Fig. 4). The P-value distribution is strongly skewed to the left for the real data and close to the expected uniform distribution for the data simulated under the null hypothesis. This result is not affected by applying a more stringent criterion for determining pooled libraries (for the less stringent method we removed all libraries annotated as pooled by donor in eVOC and for the more stringent one we removed all libraries unless specifically annotated as not pooled by donor). In the more stringent case the proportion of genes conforming to the null hypothesis estimated by QVALUE increases slightly to 24%.
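The idea behind estimating the proportion of true null hypotheses from P-values can be sketched with a single-threshold estimator. This is a simplification: QVALUE combines estimates over a range of thresholds, whereas the sketch fixes one threshold (`lam`, an assumed default of 0.5), and the function name is hypothetical.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses.

    Under the null, P-values are uniform, so a fraction (1 - lam) of the
    true nulls is expected above lam; P-values from true alternatives
    concentrate near zero and rarely exceed lam.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))
```

For example, 100 P-values of which 80 are near zero and 20 are spread evenly over the unit interval give an estimate of 0.2.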
One advantage of our method over some previous methods of inferring imprinting and allele-specific expression from dbEST (Yang et al., 2003; Lin et al., 2005) is that it has the capacity to integrate different signatures of these phenomena and to perform a combined inference based on data in many cDNA libraries. The imprinting model is sensitive to the absence of heterozygous libraries, as well as to imbalances between alleles in heterozygous libraries. The allele-specific expression model combines evidence of imbalance between alleles across many cDNA libraries and this model tends to be favoured over the imprinting model when the overexpressed allele is consistent between libraries. Including a parameter for the relatively high EST sequencing error is another advantage of the statistical models that we have applied. This parameter, which is estimated from counts of nucleotides at the polymorphic position that do not correspond to either SNP allele, prevents inference of allelic imbalance for homozygote libraries with some sequence errors. Some previous studies have used the alternative strategy of omitting ESTs with Phred scores below some cut-off (Yang et al., 2003; Lin et al., 2005); however, this restricts the data to ESTs with trace information. It also fails to resolve the problem, because base calls with high Phred scores still retain a finite probability of error and even a very small number of errors can give a very high probability of generating false-positive results. For example, a highly expressed gene for which there are many ESTs present in a single homozygous library will be interpreted as providing strong evidence of allele-specific expression if only one or two of the ESTs have a sequence error in favour of the alternative allele at the SNP position.
It is possible to compare the set of SNPs for which the allele-specific expression model is favoured over the null model in this study (Supplementary Table 3) to sets of candidates reported in other studies using alternative methods (Ge et al., 2005; Lo et al., 2003; Lin et al., 2005). However, because in each of the other studies to which we compared our results the number of SNPs reported to be associated with allele-specific expression of the corresponding gene is relatively small, the overlap between our results and previously published sets, as well as the overlaps of the previously published sets with each of the others are all relatively small [e.g. in some of the most extensive previous studies, Ge et al. (2005) report 117 SNPs with allelic differences in gene expression and Lo et al. (2003) report 326, but only three SNPs are shared between these studies]. The most meaningful comparison is between the results of Lin et al. (2005) and the results presented here because both studies make use of essentially the same data. Where Lin et al. (2005) present 35 different genes with evidence of allele-specific expression our candidate list consisted of 108 different genes (at a P-value of 0.01). Six genes are common to both candidate lists. Closer examination of the data for some of the cases where we do not find allele-specific expression reveals that sequence error is an equally plausible explanation of the result.
Allele-specific expression is likely to be much more common than genomic imprinting. Human individuals may be heterozygous for functionally significant polymorphisms in the promoter regions of as many as 40% of genes (Rockman and Wray, 2002). Consistent with this, a very high proportion of human genes show evidence of allele-specific expression in at least some individuals (Lo et al., 2003). We found that genes for which the most data were available tended to favour the allele-specific expression model over the null model. For example, for 108 genes with at least 200 ESTs the null model was rejected in favour of the allele-specific model at the 5% level in 55% of cases, for at least one SNP mapping to the gene. This apparently very high prevalence of allele-specific expression, which is consistent with at least one previous report (Lo et al., 2003), has important implications for our understanding of the biology and molecular evolution of diploid organisms.
Unequal representation of alleles in cDNA libraries could also result from loss of heterozygosity in cancer tissues. This is particularly significant because the majority of cDNA libraries in the public domain are likely to be derived from tumours rather than normal tissues (Baranova et al., 2001). Several of the genes in Table 1 are known oncogenes or tumour suppressors [e.g. KISS1 (Lee et al., 1996), CTSD (Iacobuzio-Donahue et al., 2004), SLC2A1 (Smith, 1999) and RHOB (Huang and Prendergast, 2006)]. For some of these genes genomic deletions or aberrant methylation of one allele in neoplastic tissues could explain rejection of the null model and Table 1 may contain novel examples of cancer associated genes that have not previously been annotated as such. We modified the models for allele-specific expression and imprinting to estimate parameters separately for cancer and normal tissues (Materials and Methods). Genes showing greater evidence of unequal expression of alternative alleles in cDNA libraries from cancer tissues compared to normal tissues are listed in Supplementary Tables 4 and 5. As we would predict, for the majority of cases the parameter representing the probability of unequal expression in a library was higher for cancer than for normal libraries (60 cases from a total of 91; P = 0.002). However, binary classification of cDNA libraries as cancer or normal is likely to be too coarse to identify loss of heterozygosity that may be restricted to specific cancer types and further, more detailed analysis is required to determine whether allelic expression effects that differ between cDNA libraries with different annotations can be used to identify novel disease associated genes or mutations.
| Acknowledgments |
|---|
This work was funded by the South African National Bioinformatics Network. VN is a recipient of the UCT International Student Scholarship. We are grateful to Chris Gehring and Win Hide for comments on the manuscript. Funding to pay the Open Access publication charges for this article was provided by the South African National Bioinformatics Network.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on August 18, 2006; revised on September 26, 2006; accepted on October 6, 2006
| REFERENCES |
|---|
Baranova, A.V., et al. (2001) In silico screening for tumour-specific expressed sequences in human genome. FEBS Lett., 508, 143-148.
Buckland, P.R. (2004) Allele-specific gene expression differences in humans. Hum. Mol. Genet., 13, Spec No 2, R255-R260.
Constancia, M., et al. (1998) Imprinting mechanisms. Genome Res., 8, 881-900.
Ge, B., et al. (2005) Survey of allelic expression using EST mining. Genome Res., 15, 1584-1591.
Huang, M. and Prendergast, G.C. (2006) RhoB in cancer suppression. Histol. Histopathol., 21, 213-218.
Hubbard, T., et al. (2005) Ensembl 2005. Nucleic Acids Res., 33, D447-D453.
Iacobuzio-Donahue, C., et al. (2004) Cathepsin D protein levels in colorectal tumors: divergent expression patterns suggest complex regulation and function. Int. J. Oncol., 24, 473-485.
Jones, J.I. and Clemmons, D.R. (1995) Insulin-like growth factors and their binding proteins: biological actions. Endocr. Rev., 16, 3-34.
Karolchik, D., et al. (2003) The UCSC genome browser database. Nucleic Acids Res., 31, 51-54.
Kelso, J., et al. (2003) eVOC: a controlled vocabulary for unifying gene expression data. Genome Res., 13, 1222-1230.
Kent, W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res., 12, 656-664.
Knight, J.C. (2004) Allele-specific gene expression uncovered. Trends Genet., 20, 113-116.
Lee, J.H., et al. (1996) KiSS-1, a novel human malignant melanoma metastasis-suppressor gene. J. Natl Cancer Inst., 88, 1731-1737.
Lin, W., et al. (2005) Allelic variation in gene expression identified through computational analysis of the dbEST database. Genomics, 86, 518-527.
Lo, H.S., et al. (2003) Allelic variation in gene expression is common in the human genome. Genome Res., 13, 1855-1862.
Luedi, P.P., et al. (2005) Genome-wide prediction of imprinted murine genes. Genome Res., 15, 875-884.
Mizuno, Y., et al. (2002) Asb4, Ata3, and Dcn are novel imprinted genes identified by high-throughput screening using RIKEN cDNA microarray. Biochem. Biophys. Res. Commun., 290, 1499-1505.
Morison, I.M. and Reeve, A.E. (1998) A catalogue of imprinted genes and parent-of-origin effects in humans and animals. Hum. Mol. Genet., 7, 1599-1609.
Morison, I.M., et al. (2001) The imprinted gene and parent-of-origin effect database. Nucleic Acids Res., 29, 275-276.
Morison, I.M., et al. (2005) A census of mammalian imprinting. Trends Genet., 21, 457-465.
Morley, M., et al. (2004) Genetic analysis of genome-wide variation in human gene expression. Nature, 430, 743-747.
Okita, C., et al. (2003) A new imprinted cluster on the human chromosome 7q21-q31, identified by human-mouse monochromosomal hybrids. Genomics, 81, 556-559.
Oleksiak, M.F., et al. (2002) Variation in gene expression within and among natural populations. Nat. Genet., 32, 261-266.
Pastinen, T. and Hudson, T.J. (2004) Cis-acting regulatory variation in the human genome. Science, 306, 647-650.
Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T. (1992) Numerical Recipes in C: The Art of Scientific Computing. Cambridge: Cambridge University Press.
Rachmilewitz, J., et al. (1993) Use of a novel system for defining a gene imprinting region. Biochem. Biophys. Res. Commun., 196, 659-664.
Reik, W. and Lewis, A. (2005) Co-evolution of X-chromosome inactivation and imprinting in mammals. Nat. Rev. Genet., 6, 403-410.
Robertson, K.D. (2005) DNA methylation and human disease. Nat. Rev. Genet., 6, 597-610.
Rockman, M.V. and Wray, G.A. (2002) Abundant raw material for cis-regulatory evolution in humans. Mol. Biol. Evol., 19, 1991-2004.
Schadt, E.E., et al. (2003) Genetics of gene expression surveyed in maize, mouse and man. Nature, 422, 297-302.
Smith, T.A. (1999) Facilitative glucose transporter expression in human cancer tissue. Br. J. Biomed. Sci., 56, 285-292.
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA, 100, 9440-9445.
Wakeling, E.L., et al. (2000) Biallelic expression of IGFBP1 and IGFBP3, two candidate genes for the Silver-Russell syndrome. J. Med. Genet., 37, 65-67.
Wilkins, J.F. (2005) Genomic imprinting and methylation: epigenetic canalization and conflict. Trends Genet., 21, 356-365.
Yan, H., et al. (2002) Allelic variation in human gene expression. Science, 297, 1143.
Yang, H.H., et al. (2003) Computation method to identify differential allelic gene expression and novel imprinted genes. Bioinformatics, 19, 952-955.
This article has been cited by other articles:
D. Monk, A. Wagschal, P. Arnaud, P.-S. Muller, L. Parker-Katiraee, D. Bourc'his, S. W. Scherer, R. Feil, P. Stanier, and G. E. Moore. Comparative analysis of human chromosome 7q21 and mouse proximal chromosome 6 reveals a placental-specific imprinted gene, TFPI2/Tfpi2, which requires EHMT2 and EED for allelic-silencing. Genome Res., August 1, 2008; 18(8): 1270-1281.