Bioinformatics Advance Access originally published online on June 22, 2007
Bioinformatics 2007 23(16):2155-2162; doi:10.1093/bioinformatics/btm313

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

PowerCore: a program applying the advanced M strategy with a heuristic search for establishing core sets

Kyu-Won Kim 1,2,†, Hun-Ki Chung 1,†, Gyu-Taek Cho 1, Kyung-Ho Ma 1, Dorothy Chandrabalan 3, Jae-Gyun Gwag 1, Tae-San Kim 1, Eun-Gi Cho 1 and Yong-Jin Park 1,3,*

1National Institute of Agricultural Biotechnology, 247 Seodun-dong, Suwon, 441-707, 2Qubesoft,R/No, Dongyoung Central B/D, 847-2 Geumjeong-dong, Gunpo 434-050, R.Korea and 3Bioversity International, APO Office, Serdang 43400, Malaysia

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 DESIGN CONCEPT
 3 IMPLEMENTATION
 4 VALIDATION
 5 RESULTS AND DISCUSSION
 6 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Core sets are necessary to ensure that access to useful alleles or characteristics retained in genebanks is guaranteed. We have successfully developed a computational tool named ‘PowerCore’ that aims to support the development of core sets by reducing the redundancy of useful alleles and thus enhancing their richness.

Results: The program, using a new approach completely different from any other previous methodologies, selects entries of core sets by the advanced M (maximization) strategy implemented through a modified heuristic algorithm. The developed core set has been validated to retain all characteristics for qualitative traits and all classes for quantitative ones. PowerCore effectively selected the accessions with higher diversity representing the entire coverage of variables and gave a 100% reproducible list of entries whenever repeated.

Availability: PowerCore software uses the .NET Framework Version 1.1 environment which is freely available for the MS Windows platform. The files can be downloaded from http://genebank.rda.go.kr/powercore/. The distribution of the package includes executable programs, sample data and a user manual.

Contact: yjpark@rda.go.kr


    1 INTRODUCTION
 
Useful alleles, especially those contributing to valuable agronomic traits, are often conserved in genebanks worldwide. The potential use of these large collections could be greatly enhanced by constituting subsamples, also known as core collections or core sets (Basigalup et al., 1995; Brown, 1989; Franco et al., 2006; Frankel and Brown, 1984; Upadhyaya et al., 2006). Effective deployment of useful alleles from genebanks has become possible, especially with the recent technological revolution brought about by genomic and bioinformatics tools. Allele mining exploits the deoxyribonucleic acid (DNA) sequence of one genotype to isolate useful alleles from related genotypes (Latha et al., 2004). Discovering the full diversity of available genes and their agronomic significance will allow genebanks to achieve their full potential, thus contributing to sustainable development by deployment of the right alleles in the right places at the right time (Hamilton and McNally, 2005).

Over the years, tremendous progress has been achieved using different methodologies, including stratified random sampling, and such methodologies have been successfully applied to develop core collections for various uses (Balfourier et al., 1998; Chandra et al., 2002; Hu et al., 2000; Peeters and Martinelli, 1989; Spagnoletti and Qualset, 1993). Several other strategies have also been proposed, including proportional allocation, log frequency allocation and constant allocation (Brown, 1989; Spagnoletti and Qualset, 1993; van Hintum et al., 2000). Newer approaches such as the M (maximization) strategy or nested selection methods (Bataillon et al., 1996; Marita et al., 2002; Schoen and Brown, 1993) have been used to select specific combinations of accessions that provide complete coverage and retention. Similarly, the MSTRAT program was developed and released (Gouesnard et al., 2001); based on the M strategy, it uses iterative procedures to select the subset with the highest diversity, judged by the criterion of richness and the highest sum of squares of active variables. To date, the M strategy is the most powerful approach for selecting entries with the most diverse alleles and for eliminating the redundancy of non-informative alleles, which arise from co-ancestry and certain assortative mating systems, when establishing core sets (Franco et al., 2006).

As a solution to the traveling salesman problem (TSP), the ‘heuristic algorithm’ was designed for selecting the optimal pathway to the final goal following the Karg–Thompson algorithm (Karg and Thompson, 1965) and was later improved to search not only for the best increment for each node, but also for the next-best increment (Raymond, 1969). Applications of heuristic algorithms include the BLAST program for sequence comparison (Altschul et al., 1990), GeneMark for ab initio gene finding (Besemer and Borodovsky, 2005), GenAlignRefine for multiple sequence alignment (Wang and Lefkowitz, 2005) and Bounded Sparse Dynamic Programming (BSDP) (Slater and Birney, 2005). A heuristic algorithm was also applied in developing the core set for the Arabidopsis collection using single nucleotide polymorphism (SNP) data (McKhann et al., 2004).

Here, we present a new software application named PowerCore, which can be applied for developing core sets using the advanced M strategy and possessing the power to represent all alleles or classes of their observations.


    2 DESIGN CONCEPT
 
Scales for variables expressing traits of genetic accessions vary with the characteristics of the traits and the methods used to measure them. These are the nominal, ordinal, interval and ratio scales. Values on the interval and ratio scales can be divided into appropriate intervals and then treated as categories on an ordinal scale. An ordinal scale can in turn be used as a nominal scale, so that every variable reduces to a set of nominal values, as shown in Figure 1.


 
Fig. 1. A set of nominal values of variables expressing traits of genetic accessions (a: accession; v: variable; n: nominal value).

 
When one converts several variables expressing traits of accessions into one nominal scale according to the method above, one may assume a set, $D_A^v$, whose elements are all the nominal values found in the set of the whole accessions, A, with respect to a certain variable, v (a nominal value repeated among several accessions occupies a single element of $D_A^v$). $D_A$ is the set with elements $D_A^v$ with respect to variables v1, v2, ..., vm. In other words, $D_A = \{D_A^v \mid v \in \text{all the variables of the whole accessions}\}$. In addition, if $D_A^v = D_B^v$ for all the variables, then let $D_A$ be equal to $D_B$ ($D_A = D_B$) (Fig. 1).

At this point, one may consider subsets, Asub, of the set of whole accessions, A, for which $D_{A_{sub}} = D_A$. Each such Asub exhibits all the nominal values of each variable expressed by the set A, and the one with the minimum number of elements can serve as a core collection. Thus, the problem of finding the representative accessions with the minimum number of accessions may be expressed as the problem of finding an Asub with the minimum number of elements among every Asub satisfying $D_{A_{sub}} = D_A$.

To find an Asub where $D_{A_{sub}} = D_A$ with the approach above, one may create an empty set, E, and add an appropriate accession to E repeatedly until $D_E$ and $D_A$ become equal. This process may also be described as a shortest-path problem. If the set, E, contains no element, then it is in the initial state. If $D_E$ and $D_A$ are equal to each other, then it is in the final state, or in other words, the goal. Selecting an entry and adding it to E is an expansion of a node. Thus, reaching the goal with the minimum number of elements in E using this method involves minimizing the number of nodes from the initial node to the goal. However, this search process does not consider the order of accessions. For example, suppose there are accessions a, b and c; then six different paths may exist when adding them to E. These paths all have the same significance: a -> b -> c, a -> c -> b, b -> a -> c, b -> c -> a, c -> a -> b and c -> b -> a. In other words, if one of them is expanded in a search process, it is not necessary to expand the rest.
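The goal test described above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: each accession is a dictionary mapping a variable name to its nominal value, and the names `value_sets`, `covers` and the toy data are hypothetical, not part of PowerCore.

```python
from typing import Dict, List, Set

# Hypothetical layout: one accession = {variable name: nominal value}.
Accession = Dict[str, str]


def value_sets(accessions: List[Accession]) -> Dict[str, Set[str]]:
    """Collect, per variable, the set of nominal values seen in the accessions."""
    sets: Dict[str, Set[str]] = {}
    for acc in accessions:
        for variable, value in acc.items():
            sets.setdefault(variable, set()).add(value)
    return sets


def covers(subset: List[Accession], whole: List[Accession]) -> bool:
    """True when the subset shows every nominal value that the whole set shows."""
    return value_sets(subset) == value_sets(whole)


toy = [
    {"colour": "red", "shape": "round"},
    {"colour": "red", "shape": "long"},
    {"colour": "white", "shape": "round"},
    {"colour": "white", "shape": "long"},
]
print(covers([toy[0], toy[3]], toy))  # True: both colours and both shapes appear
print(covers([toy[3], toy[0]], toy))  # True: the order of addition is irrelevant
print(covers([toy[0], toy[1]], toy))  # False: "white" is missing
```

Because the test depends only on which accessions are present, a search can treat each set of accessions as a single node regardless of the order in which its members were added.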

The problem in finding a core collection, therefore, may be expressed as searching for the shortest path with the minimum number of nodes in the search process above which may be discovered using the A*-algorithm.

If an optimal path exists from the initial node, s, to the final node via a node, n, one may define the cost of the optimal path from s to n as g*(n) and the cost of the optimal path from n to the final node as h*(n). Then, let us define the sum of g*(n) and h*(n) as f*(n) as follows:


$$f^*(n) = g^*(n) + h^*(n)$$

A graph search guided by an evaluation function, f, that estimates f* is the basis of the A*-algorithm, where f is expressed as follows:


$$f(n) = g(n) + h(n)$$

In this equation, g and h are estimates of g*(n) and h*(n), respectively. A search whose h satisfies h(n) ≤ h*(n) for all nodes, n, at all times is called the A*-algorithm; it always finds the goal if one exists, and the path it finds is the shortest path (Hart et al., 1968).

When implementing a search for a core collection using the graph search with an evaluation function, f, one may define g(n) as the number of accessions already added to E at node n, and h(n) as the number of accessions that must still be added to E before the final state, the goal, is reached. One may then construct an estimate, h^(n), satisfying h^(n) ≤ h*(n) as follows.

Let $D_A^v$ and $D_E^v$ be the sets of nominal values of a variable, v, found in A and in E, respectively. One may denote by $R$ the set of relative complements, $D_A^v \setminus D_E^v$, taken for each of the variables v1, v2, ..., vm; each element of $R$ holds the nominal values of one variable that are not yet represented in E. Then,

h^(n) = the maximum number of elements in $r$ among the elements, $r$, in $R$.

An accession cannot have more than one nominal value per variable, so the number of nodes from a node, n, to the goal must be equal to or greater than h^(n). Thus, h^(n) ≤ h*(n) for all nodes when h^(n) is defined as above, and the graph search using the evaluation function f^(n) = g(n) + h^(n) is an A*-algorithm. This search finds E satisfying $D_E = D_A$ with the minimum number of accessions if such a set, E, exists, as shown in Figure 1.
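A minimal sketch of this exact search is given below, assuming the same hypothetical data layout (one dictionary per accession). It is not the PowerCore implementation: the function names are illustrative, nodes are stored as frozen sets of accession indices so that different insertion orders collapse into one node, and the heuristic is the largest count of still-missing nominal values over the variables.

```python
import heapq
from itertools import count
from typing import Dict, FrozenSet, List, Set

Accession = Dict[str, str]  # hypothetical layout: variable name -> nominal value


def value_sets(accessions: List[Accession]) -> Dict[str, Set[str]]:
    """Per variable, the set of nominal values present in the given accessions."""
    sets: Dict[str, Set[str]] = {}
    for acc in accessions:
        for variable, value in acc.items():
            sets.setdefault(variable, set()).add(value)
    return sets


def h_hat(state: FrozenSet[int], accessions: List[Accession],
          target: Dict[str, Set[str]]) -> int:
    """Largest number of still-missing nominal values over all variables."""
    covered = value_sets([accessions[i] for i in state])
    return max((len(vals - covered.get(var, set())) for var, vals in target.items()),
               default=0)


def astar_core(accessions: List[Accession]) -> List[int]:
    """Return indices of a smallest subset that shows every nominal value."""
    target = value_sets(accessions)
    start: FrozenSet[int] = frozenset()
    tie = count()  # tie-breaker so the heap never has to compare frozensets
    frontier = [(h_hat(start, accessions, target), next(tie), start)]
    seen = {start}  # one node per set of accessions, whatever the insertion order
    while frontier:
        _, _, state = heapq.heappop(frontier)
        if h_hat(state, accessions, target) == 0:  # goal: nothing is missing
            return sorted(state)
        for i in range(len(accessions)):
            child = state | {i}
            if i in state or child in seen:
                continue
            seen.add(child)
            f = len(child) + h_hat(child, accessions, target)  # g(n) + h^(n)
            heapq.heappush(frontier, (f, next(tie), child))
    return []


toy = [
    {"colour": "red", "shape": "round"},
    {"colour": "white", "shape": "long"},
    {"colour": "black", "shape": "flat"},
    {"colour": "red", "shape": "long"},
    {"colour": "white", "shape": "round"},
]
core = astar_core(toy)
print(len(core))  # 3: three colours cannot be shown by fewer than three accessions
```

The number of candidate subsets grows exponentially with the number of accessions, which is why a sketch like this is only practical for very small inputs.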


    3 IMPLEMENTATION
 
A core collection obtained by the above search with h^(n) is guaranteed to lie on the shortest path, but many nodes are expected to be expanded during the search. Furthermore, the number of accessions in an actual analysis is extremely high, so the above search cannot be expected to deliver results in the limited time available. Another method was therefore needed, one that finds a near-optimal path in plausible time without guaranteeing the shortest path to the goal. To this end, the search method was modified.

In the search for core collection entries described in the previous section, an element of A is added to E each time a node is expanded. Thus, the goal is always reached once the depth of a node equals the number of elements in A. In other words, all nodes lead to the goal. Also, within a path, a deeper node is closer to the goal.

With this characteristic in mind, priority was given to deeper nodes: the h^(n) values of the deepest nodes were compared, and the node with the minimum value was selected and expanded.

Consider $D_A^v$, the set of nominal values found in the set A of all the accessions with respect to a variable, v. If $D_A^v = \{d_1, d_2, ..., d_k\}$, one may define another set, $C^v$, with ordered pairs (d1, t1), (d2, t2), ..., (dk, tk) as its elements, where the first element of each pair is an element of $D_A^v$ and the second element is an integer, t, denoted as

$C^v = \{(d_1, t_1), (d_2, t_2), ..., (d_k, t_k)\}$. In this set, d1, d2, ..., dk are defined as the items in $C^v$ and t1, t2, ..., tk as the ‘filled values’ of each item. Each ordered pair is a ‘diversity cell’.

In particular, $C^v$ is written as $C_0^v$ when all the filled values, t1, t2, ..., tk, are 0. That is, $C_0^v = \{(d_1, 0), (d_2, 0), ..., (d_k, 0)\}$.

Then, we denote the set with elements $C^{v_1}$, $C^{v_2}$, ..., $C^{v_m}$ with respect to all the variables, v1, v2, ..., vm, as $C$, and the set with elements $C_0^{v_1}$, $C_0^{v_2}$, ..., $C_0^{v_m}$ as $C_0$.

For an accession, a (a ∈ A), we define $C + a$ as follows and express it as ‘filling an accession, a, into $C$’.

$C + a$:

    for each v in all the variables of whole accessions

        if (v(a), t) ∈ $C^v$ then t <- t + 1

Here, we express incrementing t for (v(a), t) ∈ $C^v$ as ‘filling an item, v(a), in $C^v$’.

The search process is as follows:

(1) Create a set of diversity cells, $C$, satisfying $C = C_0$ (all filled values 0) for the set of the whole accessions, A.

(2) Create an empty set, E.

(3) Create a list, N.

    for each e (e ∈ A − E)

        N(e) <- $C$ + e (N(e) is the value of the item, e, in N)

(4) Calculate h^(n):

    create a list, H.

    for each e (e ∈ A − E)

        create a list, NUMBER.

        for each v in all the variables of the whole accessions

            NUMBER(v) <- the number of ordered pairs satisfying t = 0 among every ordered pair, (d, t) ∈ $C^v$ ∈ N(e)

            (NUMBER(v) is the value of the item, v, in NUMBER).

        H(e) <- the maximum value in NUMBER (H(e) is the value of the item, e, in H).

(5) Select the item, e, whose H(e) is the minimum,

    E <- E ∪ {e} (if several such e exist, then one e is selected randomly).

    $C$ <- $C$ + e

(6) T <- 0

    for each $C^v$ ($C^v$ ∈ $C$)

        for each (d, t) ((d, t) ∈ $C^v$)

            if t = 0 then T <- T + 1

(7) If T != 0 (unfilled diversity cells remain), then proceed to Step (3).

In this search, Step (3) expands the children of a parent node by adding an entry, e, and Step (4) evaluates each expanded node with the evaluation function, h^(n).
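The loop of Steps (1) to (7) can be sketched as follows. This is a reading of the procedure under stated assumptions rather than the PowerCore source: diversity cells are stored as nested dictionaries (variable -> nominal value -> filled value), all function names are hypothetical, and ties in Step (5) go to the first candidate in input order instead of being drawn randomly, so that the output is repeatable.

```python
from typing import Dict, List

Accession = Dict[str, str]        # hypothetical layout: variable -> nominal value
Cells = Dict[str, Dict[str, int]]  # variable -> nominal value -> filled value t


def empty_cells(accessions: List[Accession]) -> Cells:
    """Step (1): one diversity cell per observed nominal value, all filled values 0."""
    cells: Cells = {}
    for acc in accessions:
        for var, val in acc.items():
            cells.setdefault(var, {})[val] = 0
    return cells


def fill(cells: Cells, acc: Accession) -> Cells:
    """Return a copy of the cells with the accession filled in (each t raised by 1)."""
    new = {var: dict(items) for var, items in cells.items()}
    for var, val in acc.items():
        new[var][val] += 1
    return new


def h_hat(cells: Cells) -> int:
    """Step (4): the largest number of unfilled cells (t = 0) over the variables."""
    return max((sum(1 for t in items.values() if t == 0) for items in cells.values()),
               default=0)


def greedy_core(accessions: List[Accession]) -> List[int]:
    """Add one accession per round until no unfilled cell is left (Steps 3 to 7)."""
    cells = empty_cells(accessions)
    selected: List[int] = []                       # Step (2): E starts empty
    while h_hat(cells) > 0:                        # Steps (6)-(7): cells remain unfilled
        candidates = [i for i in range(len(accessions)) if i not in selected]
        # Steps (3)-(5): evaluate every child node and keep the one with minimum h^(n).
        best = min(candidates, key=lambda i: h_hat(fill(cells, accessions[i])))
        selected.append(best)
        cells = fill(cells, accessions[best])
    return selected


toy = [
    {"colour": "red", "shape": "round"},
    {"colour": "white", "shape": "long"},
    {"colour": "black", "shape": "flat"},
    {"colour": "red", "shape": "long"},
    {"colour": "white", "shape": "round"},
]
print(greedy_core(toy))  # [0, 1, 2]
```

Each round evaluates every remaining accession once, so the work grows roughly with the square of the collection size rather than exponentially.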

However, evaluating nodes with h^(n) alone produces several nodes of the same depth that share the minimum h^(n), so that a path would be selected randomly. We therefore modified and improved the method to choose among such nodes using more information instead of random selection, as follows.

One may define empty($C$) of a node as the number of diversity cells, (d, t), with a filled value t = 0 among all the cells in every $C^v$ ($C^v$ ∈ $C$). Selecting the node with the minimum empty($C$) does not guarantee the shortest path, but empty($C$) only decreases during the above search process. We modified the above search to select the node with the minimum empty($C$) value when several nodes share the minimum h^(n).

If several nodes share the minimum empty($C$) value, we select the node reached by adding to E an accession, e, whose nominal values are less abundant among the accessions already in E. Let e denote the accession added to expand a node, and v(e) the value of a variable, v, in this accession. Then $C^v(e)$ denotes the value of t that satisfies (v(e), t) ∈ $C^v$ (e ∈ A). If e has variables v1, v2, ..., vm, then its overlap may be defined as


$$\mathrm{overlap}(C, e) = \frac{1}{m}\sum_{i=1}^{m} C^{v_i}(e)$$

The values of $C^{v_i}(e)$ increase by one each time an accession with the nominal values v1(e), v2(e), ..., vm(e) is filled into $C$. Thus, overlap($C$, e) indicates how many accessions in the set, E, on average per variable, share a nominal value with e. In other words, the smaller overlap($C$, e) becomes, the more the nominal values of e differ, on average, from those of the other accessions in E. Therefore, the node with the minimum overlap($C$, e) is selected, so that the accession added is the one whose values are less abundant in E.

If several nodes share the minimum overlap($C$, e) value, the node whose accession has the higher rarity is selected, using values of rarity predefined over the whole accessions, A. Before executing the above search process, $C$ <- $C$ + a is performed for every a ∈ A, starting from $C = C_0$. Then, lists P and D are created in advance, with P(a) <- overlap($C$, a) and D(a) <- the deviation of the filled values $C^{v_1}(a)$, ..., $C^{v_m}(a)$ about P(a), for every a ∈ A (P(a) and D(a) are the values of an element, a, in lists P and D, respectively).

P(a) serves as an indicator of the rarity of an accession, a, and D(a) indicates the degree of deviation of rarity among the nominal values of a, with respect to the whole accession set, A. The node with the minimum P(a) value is selected, so that an accession with high rarity is taken.

When several nodes share the minimum value of P(a), the node with the highest D(a) value is chosen. Among accessions with the same P(a) value, this selects an accession with an exceptionally rare characteristic in a specific trait rather than an accession with evenly distributed rare characteristics in all traits: the higher the D(a) value, the higher the deviation of rarity of a's nominal values across the variables. Hence, nominal values with high rarity with respect to certain variables are concentrated in such accessions.
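The cascade of tie-breaking rules can be expressed as a single sort key, sketched below. Several details are assumptions made for illustration only: overlap is taken as the mean filled value of the candidate's nominal values among the accessions already selected, D(a) is computed as a population standard deviation because the exact deviation formula is not recoverable from the text, and every function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Accession = Dict[str, str]        # hypothetical layout: variable -> nominal value
Cells = Dict[str, Dict[str, int]]  # variable -> nominal value -> filled value t


def cells_for(selected: List[Accession], accessions: List[Accession]) -> Cells:
    """Diversity cells of the whole collection, filled with the selected accessions."""
    cells: Cells = {}
    for acc in accessions:
        for var, val in acc.items():
            cells.setdefault(var, {})[val] = 0
    for acc in selected:
        for var, val in acc.items():
            cells[var][val] += 1
    return cells


def empty(cells: Cells) -> int:
    """Total number of unfilled diversity cells (t = 0)."""
    return sum(1 for items in cells.values() for t in items.values() if t == 0)


def overlap(cells: Cells, acc: Accession) -> float:
    """Mean filled value of the cells that the accession's nominal values occupy."""
    return mean(cells[var][val] for var, val in acc.items())


def rarity(accessions: List[Accession]) -> Tuple[List[float], List[float]]:
    """P(a): overlap against the whole collection; D(a): spread (assumed pstdev)."""
    whole = cells_for(accessions, accessions)
    P = [overlap(whole, a) for a in accessions]
    D = [pstdev([whole[var][val] for var, val in a.items()]) for a in accessions]
    return P, D


def selection_key(i: int, selected: List[int], accessions: List[Accession],
                  P: List[float], D: List[float]) -> tuple:
    """Smaller is better: h^(n), empty, overlap, P(a), then the highest D(a)."""
    current = cells_for([accessions[j] for j in selected], accessions)
    after = cells_for([accessions[j] for j in selected + [i]], accessions)
    h = max(sum(1 for t in items.values() if t == 0) for items in after.values())
    return (h, empty(after), overlap(current, accessions[i]), P[i], -D[i])


toy = [
    {"colour": "red", "shape": "round"},
    {"colour": "red", "shape": "long"},
    {"colour": "white", "shape": "round"},
    {"colour": "black", "shape": "round"},
]
P, D = rarity(toy)
first = min(range(len(toy)), key=lambda i: selection_key(i, [], toy, P, D))
print(first)  # 1: all candidates tie on h^(n), empty and overlap; accession 1 has the lowest P
```

Because every level of the key is computed from the data alone, repeating the selection on the same input gives the same choice, which matches the reproducibility the text describes.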

The program's source code is written in Microsoft C# and compiled with Microsoft Visual Studio .NET 2003. The program has been tested in the Microsoft Windows XP environment on a computer with a 1.5 GHz Intel mobile processor and 1 GB of RAM.


    4 VALIDATION
 
4.1 Analysis with statistical indicators
Ten sets of 100 virtual accessions were created, each with four nominal variables and three continuous variables as materials for the analysis. Within the PowerCore program, a component divided intervals of continuous variables to nominalize them; the continuous variables in this analysis were automatically classified into different categories based on Sturges’ rule (Sturges, 1926).


$$K = 1 + \log_2 n$$

(K: number of classes, n: number of observations)
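As an illustration of this nominalization step, the sketch below applies Sturges' rule, commonly stated as K = 1 + log2(n), to split a continuous variable into equal-width classes. Rounding K up to the next integer and using equal-width intervals between the minimum and maximum are assumptions; the text does not state how PowerCore rounds or places its class boundaries, and the function names are hypothetical.

```python
import math
from typing import List


def sturges_classes(n: int) -> int:
    """Number of classes for n observations by Sturges' rule, rounded up (assumed)."""
    return math.ceil(1 + math.log2(n))


def nominalize(values: List[float]) -> List[int]:
    """Map each value to the index of its equal-width class, from 0 to K - 1."""
    k = sturges_classes(len(values))
    low, high = min(values), max(values)
    width = (high - low) / k
    if width == 0:  # every observation is identical: a single class
        return [0] * len(values)
    # The maximum would fall just past the last class, so clamp it to K - 1.
    return [min(int((x - low) / width), k - 1) for x in values]


print(sturges_classes(100))  # 8 classes for 100 accessions
classes = nominalize([float(x) for x in range(100)])
print(sorted(set(classes)))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With 100 accessions per set, as in the virtual data used here, this rounding gives eight classes per continuous variable.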

The search performed by PowerCore was heuristic. The core set generated by this search was evaluated by calculating the mean difference (MD, %), variance difference (VD, %), coincidence rate (CR, %) and variable rate (VR, %) for continuous variables and by computing a frequency distribution for each variable (Hu et al., 2000).


$$MD\,(\%) = \frac{1}{m}\sum_{j=1}^{m}\frac{|M_e - M_c|}{M_c}\times 100$$

(Me: Mean of entire collection, Mc: Mean of core collection)


$$VD\,(\%) = \frac{1}{m}\sum_{j=1}^{m}\frac{|V_e - V_c|}{V_c}\times 100$$

(Ve: Variance of entire collection, Vc: Variance of core collection)


$$CR\,(\%) = \frac{1}{m}\sum_{j=1}^{m}\frac{R_c}{R_e}\times 100$$

(Re: Range of entire collection, Rc: Range of core collection)


$$VR\,(\%) = \frac{1}{m}\sum_{j=1}^{m}\frac{CV_c}{CV_e}\times 100$$

(CVe: coefficient of variation of entire collection, CVc: coefficient of variation of core collection, m: number of traits)
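The four indicators can be computed with the standard library as sketched below. The forms used are the ones commonly quoted for these indicators and are assumptions here: MD and VD divide the absolute difference by the core-collection value, CR and VR are ratios of core to entire, and each is averaged over the m traits. Using the sample variance and standard deviation is a further assumption, as are the function name and the data layout (trait name -> list of values).

```python
from statistics import mean, stdev, variance
from typing import Dict, List


def evaluate(entire: Dict[str, List[float]], core: Dict[str, List[float]]) -> Dict[str, float]:
    """Return MD, VD, CR and VR in percent, averaged over the traits."""
    m = len(entire)
    md = vd = cr = vr = 0.0
    for trait, e in entire.items():
        c = core[trait]
        md += abs(mean(e) - mean(c)) / mean(c)              # mean difference
        vd += abs(variance(e) - variance(c)) / variance(c)  # variance difference
        cr += (max(c) - min(c)) / (max(e) - min(e))         # share of the range retained
        vr += (stdev(c) / mean(c)) / (stdev(e) / mean(e))   # ratio of coefficients of variation
    return {name: 100 * total / m
            for name, total in (("MD", md), ("VD", vd), ("CR", cr), ("VR", vr))}


entire = {"height": [1.0, 2.0, 3.0, 4.0, 5.0]}
core = {"height": [1.0, 3.0, 5.0]}
scores = evaluate(entire, core)
print(scores["MD"], scores["VD"], scores["CR"])  # 0.0 37.5 100.0
```

In this toy case the core keeps both extremes, so CR is 100%, while the thinner sample inflates the variance and gives a non-zero VD, the same pattern the validation section reports for VD.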

4.2 Comparative analysis with a non-heuristic random method to retain whole diversity cells, provided from PowerCore
The basis for generating the core collection using PowerCore is the nominalization of continuous variables. Nominalizing these variables decreases the number of accessions collected in a core collection, and it was considered necessary for performing the heuristic search, whose evaluation function is computed from the given data.

A comparative analysis was performed with the non-heuristic random search wherein no prior information was required for the generation of the core set. The procedure for the random search was as follows:

  1. A set of diversity cells, $C$, satisfying $C = C_0$ (all filled values 0) for the set of the whole accessions, A, is created.
  2. An empty set, E, is created.
  3. for each v in all the variables of the whole accessions

        for each item d in $C^v$

            if $C^v(d)$ equals 0 ($C^v(d)$ is the filled value of d)

            then an element, e, is selected randomly from A − E to fill d, and

                E <- E ∪ {e}, $C$ <- $C$ + e

This random search was performed 10 times to compute the average values of the MD, VD, CR and VR, and frequency distribution.
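A sketch of this random baseline follows, under the same assumed data layout as before (one dictionary per accession). The function name and the seeded random generator are illustrative choices, and the candidate pool for an unfilled cell is taken to be the unselected accessions that carry that nominal value.

```python
import random
from typing import Dict, List, Optional

Accession = Dict[str, str]  # hypothetical layout: variable -> nominal value


def random_core(accessions: List[Accession], seed: Optional[int] = None) -> List[int]:
    """Fill every unfilled diversity cell with a randomly drawn matching accession."""
    rng = random.Random(seed)
    cells: Dict[str, Dict[str, int]] = {}
    for acc in accessions:
        for var, val in acc.items():
            cells.setdefault(var, {})[val] = 0
    selected: List[int] = []
    for var in cells:
        for val in cells[var]:
            if cells[var][val] != 0:
                continue  # already filled by an accession chosen earlier
            pool = [i for i, acc in enumerate(accessions)
                    if i not in selected and acc.get(var) == val]
            pick = rng.choice(pool)
            selected.append(pick)
            for v, x in accessions[pick].items():  # fill all of the pick's cells
                cells[v][x] += 1
    return selected


toy = [
    {"colour": "red", "shape": "round"},
    {"colour": "white", "shape": "long"},
    {"colour": "black", "shape": "flat"},
    {"colour": "red", "shape": "long"},
    {"colour": "white", "shape": "round"},
]
sizes = [len(random_core(toy, seed=s)) for s in range(10)]
print(min(sizes), max(sizes))  # core size can differ from run to run
```

Unlike the heuristic search, the selected list depends on the random draws, so summary statistics have to be averaged over repeated runs.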

One hundred virtual accessions were created, each with four nominal variables and three continuous variables for the analysis.


    5 RESULTS AND DISCUSSION
 
5.1 Results of analysis with statistical indicators
The number of accessions, MD, VD, CR and VR values for the core collection are displayed in Table 1. PowerCore selected an average of 11 out of 100 virtual accessions thus reducing the number of accessions by 89% for the entire collection.



 
Table 1. Average values for core collections using heuristic search

 
MD expresses the difference between the means of the core set and the entire collection. The MD values in Table 1 show that the mean of the core collections selected by PowerCore is similar to the mean of the entire collection.

VD expresses the difference between the variances. The VD values in Table 1 show that the variance of the core collections selected by PowerCore differs considerably from the variance of the entire collection. The VD values also fluctuated among the different sets.

VR compares the coefficient of variation of the core collection with that of the entire collection and indicates how well the variation is represented in the core sets. The VR values in Table 1 have an average of 67.1%.

CR indicates whether the distribution range of each variable in the core set is well represented when compared with the entire collection. The results (Table 1) show an average CR value of 93.8%. For a core collection to represent the whole accessions, some researchers claim that the CR value should be ≥80% (Hu et al., 2000).

MD, VD and VR measure the statistical consistency between the core and entire collections. Core collections do not aim for statistical consistency in the mean or the variation; they seek to cover the genetic diversity of the entire collection. Thus, even well-constructed core sets would not score highly on these statistical indexes, which are based on the mean and the variation. Moreover, these indexes can only be applied to continuous variables.

Particular attention should be given to the high CR of the core collections in Table 1. Compared with the other statistical indicators used in this study, the CR values of the core sets selected by PowerCore are exceptionally high. Once the continuous variables have been classified by PowerCore, the software takes all classes into account, without omitting any class of any variable. Thus, PowerCore is able to cover the distribution range of every class. However, a CR value of 100% is not attained in Table 1. The reason is that, for continuous variables divided into classes, PowerCore requires only the minimum number of accessions from each class, so the extreme values within the outermost classes need not be selected.

In view of the above, we suggest a new indicator, ‘Coverage’, which can be used to evaluate a core set for its coverage of variables.


$$\mathrm{Coverage}\,(\%) = \frac{1}{m}\sum_{j=1}^{m}\frac{D_c}{D_e}\times 100$$

where Dc is the number of classes occupied in the core collection and De is the number of classes occupied in the entire collection for each variable, and m is the number of variables. The core sets produced by PowerCore show 100% coverage of variables without any deviation. This suggests that PowerCore maintains the diversity represented by every class.
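The Coverage indicator can be computed as sketched below, assuming it is the mean over the variables of the ratio between the classes occupied in the core and the classes occupied in the entire collection, expressed in percent. The function name and the data layout (one dictionary of nominal values per accession) are illustrative.

```python
from typing import Dict, List

Accession = Dict[str, str]  # hypothetical layout: variable -> nominal value or class


def coverage(entire: List[Accession], core: List[Accession]) -> float:
    """Mean percentage of the classes of each variable that also occur in the core."""
    variables = {var for acc in entire for var in acc}
    total = 0.0
    for var in variables:
        classes_entire = {acc[var] for acc in entire if var in acc}  # De
        classes_core = {acc[var] for acc in core if var in acc}      # Dc
        total += len(classes_core & classes_entire) / len(classes_entire)
    return 100 * total / len(variables)


toy = [
    {"colour": "red", "shape": "round"},
    {"colour": "white", "shape": "long"},
    {"colour": "black", "shape": "flat"},
    {"colour": "red", "shape": "long"},
    {"colour": "white", "shape": "round"},
]
print(coverage(toy, toy[:3]))            # 100.0: every class is present
print(round(coverage(toy, toy[:2]), 1))  # 66.7: "black" and "flat" are missing
```

Unlike MD, VD, CR and VR, this indicator applies equally to nominal variables and to nominalized continuous ones.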

5.2 Results of the comparative analysis with a non-heuristic random method, implemented within PowerCore
The heuristic search selected 10 out of 100 virtual accessions, whereas the random search selected an average of 17.1 accessions. Table 2 shows the MD, VD, CR and VR values obtained using the heuristic search of PowerCore and the random method. The frequency distribution of core collections with respect to each variable is exhibited in Figure 2. The CR value obtained using the random method was slightly higher because more accessions were selected. The heuristic search gave the same values in every run because the same accessions are selected each time, whereas the random search does not give the same results when repeated.



 
Table 2. Values of variables for core collections using the heuristic and random searches

 

 
Fig. 2. Frequency distribution of core collections with respect to each variable. (Note: NM1 and NM2 are nominal variables, and M1, M2 and M3 are continuous variables.)

 
Figure 2 shows that the core subsets generated by both the heuristic method used in PowerCore and the random method contain every interval of values found in the whole collection for each variable. Figure 2 also shows that each categorization value of each variable occurs at an extremely low frequency in these core collections compared with the entire collection. If the frequency of a categorization value in Figure 2 is read as the number of accessions with repetitive values, an extremely small or negligible frequency indicates that such accessions have largely been discarded from the core collections.

Both the heuristic and random searches greatly reduced the number of accessions, since nominalizing continuous variables when preparing to establish core collections efficiently discards unnecessary accessions. It was noted, however, that the heuristic search reduces the size of the core collection to ≤60% of that produced by the random search (10 versus an average of 17.1 accessions). These results confirm that the modified A*-algorithm of PowerCore is more effective than a random search that does not apply an evaluation function for determining the shortest search path.

5.3 Comparison of the heuristic method (PowerCore) with other conventional methods using real rice data sets
To compare the selection efficiency of PowerCore with that of the Random (R-), Proportional (P-) and MSTRAT methods, two different real rice data sets were used. The phenotype set comprises 28 quantitative and 11 qualitative traits, while the SSR (simple sequence repeat) set includes 18 loci. Each of the two independent sets contains 1000 accessions. PowerCore showed better efficiency than the other conventional methods when the same number of entries was selected for the comparison core sets (Table 3). The core sets developed by PowerCore retained all the different alleles or intervals that the two entire collections possess in both the phenotype and SSR sets of real rice data, ensuring 100% coverage in the developed core sets relative to the entire collections. Among the other conventional methods, MSTRAT had the best coverage rate (94.8% for phenotype and 88.9% for SSRs) (Table 3).



 
Table 3. Comparison of the heuristic method with other conventional methods using the two different rice real data sets of 1000 accessions, respectively for phenotype and SSRs

 
PowerCore implements the heuristic algorithm to select candidate entries by calculating the costs to reach the goal. Therefore, even if users repeat the selection of subsets using the same data, the same list of entries is generated. This is another benefit for users of PowerCore.


    6 CONCLUSION
 
PowerCore is a new approach that differs from previous methodologies: it simplifies the generation of a core set while significantly cutting down the number of core entries and maintaining 100% of the diversity of categorical variables. For continuous variables, 100% of the diversity is retained at the precision of the classification. PowerCore is applicable to various types of genomic data, including SNPs.


    ACKNOWLEDGEMENTS
 
We thank Drs V. Ramanatha Rao, Prem Mathur, Zongwen Zhang, Xavier Scheldeman and Andrew Jarvis from Bioversity International, and the group of Dr Felipe dela Cruz, University of the Philippines, Los Banos for validating this software using their national plant genetic resources collections (India, China, South America and Philippines), and their valuable comments for improving various options for different users in national genebanks. This study was supported by the National Institute of Agricultural Biotechnology (#NIAB 05-6-11-30-2), the Bio-Green 21 program (Grant code # 20050401034738) of the Rural Development Administration (RDA) and Agricultural Research and Development Promotion Center (ARPC), Republic of Korea.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received on February 28, 2007; revised on May 24, 2007; accepted on June 5, 2007

    REFERENCES
 

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.

    Balfourier F, et al. Comparison of different spatial strategies for sampling a core collection of natural populations of fodder crops. Genet. Sel. Evol (1998) 30(Suppl. 1):215–235.

    Basigalup DH, et al. Development of a core collection for perennial Medicago plant introductions. Crop Sci (1995) 35:1163–1168.

    Bataillon TM, et al. Neutral genetic markers and conservation genetics: simulated germplasm collection. Genetics (1996) 144:409–417.

    Besemer J, Borodovsky M. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res (2005) 33:W451–W454.

    Brown AHD. Core collections: a practical approach to genetic resources management. Genome (1989) 31:818–824.

    Chandra S, et al. Optimal sampling strategy and core collection size of Andean tetraploid potato based on isozyme data – a simulation study. Theor. Appl. Genet (2002) 104:1325–1334.

    Franco J, et al. Sampling strategies for conserving maize diversity when forming core subsets using genetic markers. Crop Sci (2006) 46:854–864.

    Frankel OH, Brown AHD. Plant genetic resources today: a critical appraisal. In: Crop Genetic Resources: Conservation and Evaluation. Holden JHW, Williams JT, eds. (1984) Winchester, Massachusetts, USA: Allen and Unwin. 249–257.

    Gouesnard B, et al. MSTRAT: an algorithm for building germplasm core collections by maximizing allelic or phenotypic richness. J. Hered (2001) 92:93–94.

    Hamilton RS, McNally K. Unlocking the genetic vault. In: Geneflow. Rome, Italy: International Plant Genetic Resources Institute. 29.

    Hart P, et al. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybernet (1968) 4:100–107.

    Hu J, et al. Methods of constructing core collections by stepwise clustering with three sampling strategies based on the genotypic values of crops. Theor. Appl. Genet (2000) 101:264–268.

    Karg RL, Thompson GL. A heuristic approach to solving the traveling-salesman problem. Manage. Sci (1965) 10:225–248.

    Latha R, et al. Allele mining for stress tolerance genes in Oryza species and related germplasm. Mol. Biotechnol (2004) 27:101–108.

    Marita JM, et al. Development of an algorithm identifying maximally diverse core collections. Genet. Resour. Crop Evol (2002) 47:515–526.

    McKhann HI, et al. Nested core collections maximizing genetic diversity in Arabidopsis thaliana. Plant J (2004) 38:193–202.

    Peeters JP, Martinelli JA. Hierarchical cluster analysis as a tool to manage variation in germplasm collections. Theor. Appl. Genet (1989) 78:42–48.

    Raymond TC. Heuristic algorithm for the traveling-salesman problem. IBM J. Res. Dev (1969) 13:400–407.

    Schoen DJ, Brown AHD. Conservation of allelic richness in wild crop relatives is aided by assessment of genetic markers. Proc. Natl Acad. Sci. USA (1993) 90:10623–10627.

    Slater G St C, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics (2005) 6:31.

    Spagnoletti ZPL, Qualset CO. Evaluation of five strategies for obtaining a core subset from a large genetic resource collection of durum wheat. Theor. Appl. Genet (1993) 87:295–304.

    Sturges H. The choice of a class-interval. J. Am. Stat. Assoc (1926) 21:65–66.

    Upadhyaya HD, et al. Development of a composite collection for mining germplasm possessing allelic variation for beneficial traits in chickpea. Plant Genet. Resour (2006) 4:13–19.

    van Hintum T, et al. Core collections of plant genetic resources. IPGRI Technical Bulletin No. 3. (2000) Rome, Italy: International Plant Genetic Resources Institute.

    Wang C, Lefkowitz EJ. Genomic multiple sequence alignments: refinement using a genetic algorithm. BMC Bioinformatics (2005) 6:200.





