ITAG2.3 Tomato Genome Annotation Release

Contents:

  1. Introduction
  2. Files in this release
  3. Links and other resources
  4. Release statistics
  5. Post-release updates

== 1. Introduction ==

The International Tomato Annotation Group (ITAG) is pleased to
announce the ITAG2.3 release of the official Tomato genome
annotation, covering approximately 84% of the genome, with 34,727
gene models. This release file set was generated on April 29, 2011.

In this release, all of the gene models are annotated with best-guess
text descriptions of their function, and 19,662 (56.6%) have
associated Gene Ontology terms describing their function.

See section 4 for more statistics describing this release.

Please send comments or questions about these annotations to:
itag@sgn.cornell.edu

== 2. Files in this release ==

* ITAG2.3_cdna_alignments.gff3
    GFF version 3 file containing alignments of existing EST and cDNA
    sequences to the genome.
    Analyses in this file:
      ITAG_microtom_flcdnas
      ITAG_transcripts_sol
      ITAG_transcripts_tomato

* ITAG2.3_de_novo_gene_finders.gff3
    GFF version 3 file containing predictions from several de novo
    gene finders. These were integrated into the final gene models by
    EuGene.
    Analyses in this file:
      ITAG_augustus
      ITAG_geneid_tomato
      ITAG_glimmerhmm_ath
      ITAG_glimmerhmm_tomato
      ITAG_trnascanse

* ITAG2.3_gene_models.gff3
    GFF version 3 file containing the gene models in this release.
    Analyses in this file:
      ITAG_eugene

* ITAG2.3_genomic.fasta
    Fasta-format sequence file of genomic contig sequences.

* ITAG2.3_other_genomes.gff3
    GFF version 3 file containing alignments to genomes or assemblies
    other than the one used as the reference for this annotation.
    Analyses in this file:
      ITAG_itag1_ref
      ITAG_tobacco_contigs

* ITAG2.3_repeats.gff3
    GFF version 3 file containing repetitive regions, at 'normal'
    stringency, meaning the repeat masker was run *with* the -nolow
    option, so that low-complexity and simple repeats are NOT masked.
    The repeat set used for masking is available at
    ftp://ftp.sgn.cornell.edu/genomes/Solanum_lycopersicum/repeats/mipsREdat_8.8_solanaceae_TE.masked.gz
    Analyses in this file:
      ITAG_repeats

* ITAG2.3_repeats_aggressive.gff3
    GFF version 3 file containing repetitive regions, at 'aggressive'
    stringency, meaning the repeat masker was run *without* the -nolow
    option, so that low-complexity and simple repeats ARE masked.
    The repeat set used for masking is available at
    ftp://ftp.sgn.cornell.edu/genomes/Solanum_lycopersicum/repeats/mipsREdat_8.8_eudico_TEs.masked.gz
    Analyses in this file:
      ITAG_repeats

* ITAG2.3_sgn_data.gff3
    GFF version 3 file containing alignments to sequences related to
    data on SGN. Currently contains alignments to SGN unigenes, SGN
    marker sequences, and SGN locus sequences.
    Analyses in this file:
      ITAG_sgn_loci
      ITAG_sgn_markers
      ITAG_sgn_unigenes

* ITAG2.3_genomic_reagents.gff3
    GFF version 3 file containing alignments to subclones or other
    intermediate materials used in the genome.
    Analyses in this file:
      ITAG_tomato_bacs
      DBolser_Dundee_BES_SSAHA

* ITAG2.3_assembly.gff3
    GFF version 3 file containing chromosome features, as well as
    positions of scaffolds, contigs, and inter-scaffold (i.e.
    unknown-size) gaps.
    Analyses in this file:
      SL2.40_assembly

* ITAG2.3_infernal.gff3
    GFF version 3 file containing annotated small-RNA regions,
    produced by Infernal (http://infernal.janelia.org/).
    Analyses in this file:
      ITAG_infernal

* ITAG2.3_protein_reference.gff3
    GFF version 3 file containing reference features for each protein
    sequence. Useful for loading features on the protein sequences
    into databases like Chado.
    Analyses in this file:
      none

* ITAG2.3_protein_functional.gff3
    GFF version 3 file containing annotations on protein sequences,
    such as protein domains.
    Analyses in this file:
      ITAG_blastp_ath_pep
      ITAG_blastp_refseq_pep
      ITAG_blastp_rice_pep
      ITAG_blastp_swissprot
      ITAG_blastp_trembl
      ITAG_interpro

* ITAG2.3_cdna.fasta
    Fasta-format sequence file of cDNA sequences.

* ITAG2.3_cds.fasta
    Fasta-format sequence file of CDS sequences.

* ITAG2.3_proteins.fasta
    Fasta-format sequence file of protein sequences.

* ITAG2.3_dropped_features.gff3
    GFF version 3 file listing features from ITAG2 that could not be
    remapped from the SL2.31 assembly to the SL2.40 assembly.

* SL2.31_to_SL2.40_dropped_features.gff3
    GFF version 3 file listing features from ITAG2 that could not be
    remapped from the SL2.31 assembly to the SL2.40 assembly.

== 3. Links and other resources ==

Sequences and annotations can also be viewed and searched on SGN:
http://solgenomics.net/genomes/Solanum_lycopersicum/genome_data.pl

The fully annotated chromosome sequences in GFF version 3 format,
along with Fasta files of cDNA, CDS, genomic and protein sequences,
and lists of genes are available for bulk download from the SGN site
at:
http://solgenomics.net/itag/release/2.3/list_files

For those who are not familiar with the GFF3 file format, the format
specification can be found here:
http://www.sequenceontology.org/gff3.shtml

A graphical display of the Tomato sequence and annotation can be
viewed using SGN's genome browser. Browse the chromosomes, search for
names or short sequences, and view search hits on the whole genome,
in a close-up view, or at the nucleotide level:
http://solgenomics.net/gbrowse/

SGN's BLAST services have also been updated with this dataset,
available at:
http://solgenomics.net/tools/blast/

ITAG is committed to the continual improvement of the Tomato genome
annotation and actively encourages the community to contact us with
new data, corrections and suggestions.

Announcements of new releases, updates of data, tools, and other
developments from ITAG can be found on SGN:
http://solgenomics.net/

== 4. Release statistics ==

4.1 Proportion of Genome Annotated

  Estimated genome size:       930 Mbp
  Size of annotated assembly:  782 Mbp
  Est. proportion of genome:   84%

4.2 Structural Annotation

  Gene model count:  34,727
  Exon count:        160,007
  Intron count:      125,280

  Gene model length (bp)
  ---------------------------------------------------
    Min      63
    Max      244,094
    Range    244,031
    Mean     3,163.6
    StdDev   4,098.6
    Median   2,048

    Frequency Distribution:
      Bin         Frequency
      5,000       28,106
      10,000      4,994
      15,000      1,050
      20,000      328
      25,000      135
      30,000      56
      35,000      21
      40,000      20
      244,094     17

  Intergenic distance (bp)
  ---------------------------------------------------
    Min      0
    Max      1,950,919
    Range    1,950,919
    Mean     41,855.5
    StdDev   101,239.6
    Median   11,410

    Frequency Distribution:
      Bin         Frequency
      20,000      22,159
      40,000      5,222
      60,000      2,149
      80,000      1,150
      100,000     734
      120,000     492
      140,000     360
      160,000     314
      180,000     253
      200,000     213
      220,000     169
      240,000     142
      260,000     131
      280,000     134
      300,000     99
      320,000     89
      340,000     87
      360,000     85
      380,000     54
      400,000     71
      420,000     52
      440,000     53
      460,000     36
      480,000     31
      500,000     35
      1,950,919   387

  Exons per gene model
  ---------------------------------------------------
    Min      1
    Max      71
    Range    70
    Mean     4.6
    StdDev   4.6
    Median   3

    Frequency Distribution:
      Bin         Frequency
      10          30,449
      18          3,602
      27          567
      36          77
      45          23
      54          6
      62          2
      71          1

  Exon length (bp)
  ---------------------------------------------------
    Min      1
    Max      219,043
    Range    219,042
    Mean     262.4
    StdDev   841.6
    Median   148

    Frequency Distribution:
      Bin         Frequency
      5,000       159,972
      10,000      21
      15,000      8
      20,000      2
      25,000      2
      30,000      0
      219,043     2

  Intron length (bp)
  ---------------------------------------------------
    Min      38
    Max      115,324
    Range    115,286
    Mean     543.8
    StdDev   923.4
    Median   216

    Frequency Distribution:
      Bin         Frequency
      2,000       119,392
      4,000       4,566
      6,000       887
      8,000       305
      10,000      103
      12,000      10
      14,000      3
      16,000      2
      18,000      2
      20,000      2
      115,324     8

4.3 Functional Annotation

  Gene models with GO terms:                     19,662 (56.6%)
  Unique GO terms associated:                    2,108
  Genes with splice variants:                    0
  Gene models with functional description text:  34,727 (100.0%)

  Gene Ontology terms associated, per gene model
  ---------------------------------------------------
    Min      0
    Max      17
    Range    17
    Mean     1.1
    StdDev   1.3
    Median   1

    Frequency Distribution:
      Bin         Frequency
      2           29,156
      4           5,075
      6           452
      8           35
      11          4
      13          3
      17          2

== 5. Post-release updates ==

* Tue May 3 14:21:15 EDT 2011
    Corrected invalid characters in description string for
    Solyc09g082990.2.1 (GDP-D-mannose-3',5'-epimerase 2)
    Affected files:
      ITAG2.3_cdna.fasta
      ITAG2.3_cds.fasta
      ITAG2.3_gene_models.gff3
      ITAG2.3_proteins.fasta
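
The headline figures quoted in sections 1 and 4 can be cross-checked
from the raw counts in section 4. The sketch below is only an
illustration and is not part of the ITAG pipeline. The variable and
function names are made up for this example, and the intron check
assumes one transcript per gene model (section 4.3 reports 0 genes
with splice variants), so that a model with n exons has n - 1 introns.

```python
# Counts copied from section 4 (Release statistics) of this README.
GENE_MODELS = 34_727
EXONS = 160_007
INTRONS = 125_280
GENE_MODELS_WITH_GO = 19_662
GENOME_SIZE_MBP = 930      # estimated genome size
ASSEMBLY_SIZE_MBP = 782    # size of annotated assembly


def percent(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of gene models carrying at least one GO term.
go_pct = percent(GENE_MODELS_WITH_GO, GENE_MODELS)

# Share of the estimated genome covered by the annotated assembly.
genome_pct = percent(ASSEMBLY_SIZE_MBP, GENOME_SIZE_MBP)

# Mean number of exons per gene model.
mean_exons = round(EXONS / GENE_MODELS, 1)

# Assumption: one transcript per gene model, so the exon and intron
# totals should differ by exactly the gene model count.
expected_introns = EXONS - GENE_MODELS

print(f"Gene models with GO terms: {go_pct}%")
print(f"Proportion of genome annotated: {round(genome_pct)}%")
print(f"Mean exons per gene model: {mean_exons}")
print(f"Intron count consistent: {expected_introns == INTRONS}")
```

The same arithmetic can be repeated against counts taken directly
from ITAG2.3_gene_models.gff3 to confirm that a local copy of the
release matches the statistics listed here.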