This directory contains coding sequence (CDS) and protein predictions from the SGN unigene builds.

Each build has its own sub-directory, named after the corresponding species, which contains two files
in FASTA format. One contains the predicted CDS sequences and the other the predicted protein sequences.
Where a species-specific matrix has been developed, it is provided as a third file.
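
As an illustration of how the FASTA files might be read, here is a minimal Python sketch. It assumes
only the standard FASTA layout (a header line starting with ">" followed by sequence lines) and takes
the first whitespace-delimited token of the header as the record identifier. The identifiers in the
example ("SGN-U1", "SGN-U2") are hypothetical and are not taken from the actual files.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a dict of identifier -> sequence."""
    records: Dict[str, str] = {}
    current_id = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)  # sequence may span several lines
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


# Hypothetical records, for illustration only.
example = [">SGN-U1 example description", "ATGG", "CC", ">SGN-U2", "TTT"]
sequences = read_fasta(example)
```

To read a file on disk, an open file handle can be passed in place of the list, since the function
accepts any iterable of lines.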

Methods used:

We used ESTScan version 3.0 to process SGN unigene build 3.

The ESTScan model was built from 483 tomato (Lycopersicon esculentum) nuclear coding sequences
from EMBL, using the software provided with ESTScan.

After building the model, we evaluated its ability to determine the correct reading frame, identify
the coding region, and correct frameshift errors. The test sequences were of three kinds: published
tomato cDNA sequences with known start and stop, published tomato cDNA sequences with deliberately
introduced frameshift errors, and tomato ESTs whose coding regions were identified through alignment
to Arabidopsis homologs. With the tomato model we built, ESTScan found the correct frame with high
accuracy, and most of the coding regions it identified were correct. However, it may miss 3 to 5
amino acids at each end of the coding region. In particular, when the non-coding region is short
(less than 30 nucleotides), ESTScan may not be able to distinguish it from the coding region.
ESTScan corrected the frameshift errors in most of the test sequences, not always at the exact
position of the insertion or deletion but usually within 10 nucleotides upstream or downstream
of it.
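
The two checks described above (how far a predicted coding region boundary lies from the known one,
and whether a frameshift was corrected close to where it was introduced) could be scored along the
following lines. This is only a sketch and not the procedure actually used for the evaluation: the
function names, the use of 0-based nucleotide coordinates, and the example numbers are all
assumptions made for illustration. Only the 10-nucleotide window comes from the text.

```python
from typing import Iterable, Tuple


def boundary_error_in_codons(true_start: int, true_end: int,
                             pred_start: int, pred_end: int) -> Tuple[int, int]:
    """Return how many whole codons the predicted start and end are off by.

    Coordinates are nucleotide positions (assumed 0-based here).
    """
    return (abs(pred_start - true_start) // 3,
            abs(pred_end - true_end) // 3)


def frameshift_fixed_nearby(true_pos: int,
                            corrected_positions: Iterable[int],
                            window: int = 10) -> bool:
    """True if any reported correction lies within `window` nucleotides
    upstream or downstream of the introduced frameshift."""
    return any(abs(p - true_pos) <= window for p in corrected_positions)


# Hypothetical numbers: a prediction that starts 9 nt late and ends 9 nt early
# is off by 3 amino acids at each end.
start_err, end_err = boundary_error_in_codons(30, 330, 39, 321)

# Hypothetical frameshift introduced at position 100, correction reported at 108.
near = frameshift_fixed_nearby(100, [108])
```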

Chenwei Lin (Cornell University) developed the tomato model for ESTScan and ran the ESTScan analyses.

Please contact sgn-feedback@sgn.cornell.edu if you have any questions or comments.