R&D CENTER

Plant Genome Database: An integrated platform for plant genomes

Jongsun Park*, Yongsung Kim, Hong Xi
A genome, a term coined as a blend of the words gene and chromosome, is the complete nucleotide sequence of an organism. More than 182 plant genomes have been sequenced and/or published so far; however, there is no central repository for plant genome sequences. The NCBI genome database, a general sequence repository, does not contain all published plant genomes with gene models (e.g. Utricularia gibba). Ensembl and Phytozome are other plant genome repositories, containing 44 and 25 and 72 genomes, respectively, far fewer than the number of plant genomes currently available. The lack of a central repository is a critical hurdle to analyzing plant genomes in various ways. Moreover, many resequencing projects, including those for Arabidopsis thaliana (~1,001 genomes) and Oryza sativa (>3,000 genomes), do not provide the assembled genome sequences needed to understand intraspecies divergence. To overcome these problems, we developed a standardized plant genome database (http://www.plantgenome.info/) that collects all available plant genomes and uses automated pipelines to assemble genomes from raw resequencing data. The pipeline also performs gene annotation with InterProScan and identifies simple sequence repeats (SSRs). In Release 2.1, 182 plant and four red algal genomes (154 species) have been collected from diverse sources including NCBI, Phytozome, Ensembl, and independent plant databases. The total length of the 186 genomes is 243.289 Gbp (1.31 Gbp on average), and the 152 genomes that have gene models contain 5,084,655 genes and 6,502,475 ORFs in total. The largest genome is that of Pinus lambertiana (34.08 Gbp), a gymnosperm; the average gymnosperm genome length is 21.08 Gbp. The 154 species comprise four red algae, 13 chlorophytes, one charophyte, one liverwort, two mosses, six gymnosperms, and 127 angiosperms. Twenty-five angiosperm orders have sequenced genomes, including 42 genomes in Brassicales, of which 26 are from A. thaliana ecotypes; 29 genomes in Poales, of which 13 are from the genus Oryza; and 15 genomes in Rosales, of which six are from the genus Fragaria. Together with the resequencing projects, this shows that current plant genome research largely focuses on understanding variation within species and/or genera. Thirty-four of the 186 genomes (18.28%) have no gene model data, even though some of them have already been published. InterProScan detected 13,321 distinct functional domains in 83.19% (5,409,371) of the plant ORFs. In total, 14,261,443 SSRs were identified in the 186 genomes, covering 0.16% (392,317,043 bp) of the genome sequences. The Oryza brachyantha genome has the largest proportion of SSRs (4.76%) and Hordeum vulgare the smallest (0.025%). Through these analyses, the 240.98 Gbp of plant genome sequence becomes not just a collection of nucleotides but a source of new indicators for understanding the characteristics of plants in relation to taxonomy.
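
The abstract does not state which tool or thresholds the pipeline uses to identify SSRs. As an illustration only, the following minimal sketch finds perfect SSRs with a regular expression and computes the percentage of a sequence they cover. The names `find_ssrs` and `ssr_coverage_percent` and the minimum repeat counts in `MIN_REPEATS` are hypothetical choices for this example, not the database's actual settings.

```python
import re

# Assumed thresholds (not from the abstract): minimum number of tandem
# repeats required for each motif length from 1 to 6 bp.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """Return True if the motif is not itself a repeat of a shorter unit."""
    n = len(motif)
    for d in range(1, n):
        if n % d == 0 and motif[:d] * (n // d) == motif:
            return False
    return True


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    This simple scan does not resolve candidates that overlap each other.
    """
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-bp motif followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. "AA", already found as "A"
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


def ssr_coverage_percent(seq):
    """Percentage of the sequence length covered by the detected SSRs."""
    if not seq:
        return 0.0
    covered = sum(len(motif) * count for _, motif, count in find_ssrs(seq))
    return 100.0 * covered / len(seq)


if __name__ == "__main__":
    example = "GGC" + "AT" * 6 + "CCGT" + "A" * 10 + "G"
    for start, motif, count in find_ssrs(example):
        print(start, motif, count)
    print(round(ssr_coverage_percent(example), 2))
```

Applied per genome, a calculation of this kind would yield figures comparable to the SSR proportions reported above, although the exact values depend on the thresholds chosen.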