This document gives only step-by-step instructions for the domain annotation part of SGN.  For background information and details, please refer to the domain_notes file.

==============================================================

Part One:  domain tables
This part needs to be done periodically (e.g. once a year).  Before starting, empty the tables domain, go, interpro and interpro_go (delete all of their content).

1. Download the files from ftp://ftp.ebi.ac.uk/pub/databases/interpro/
interpro2go
interpro.xml.gz
names.dat

2. Uncompress interpro.xml.gz, then parse interpro.xml with parseInterProxml.pl to get interpro.txt

3. Load the interpro table with interpro_upload.pl, using names.dat

4. Load the go table
   First process the interpro2go file with: cut -f 2 -d '>' interpro2go | sort -u -t ':' -k 3 > go
   Then load go with go_upload.pl, using the resulting go file

5. Load interpro_go with ig_upload.pl, using interpro2go

6. Load domain with domain_upload.pl, using interpro.txt
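
For reference, the shell pipeline in step 4 can be sketched in Python, which makes the handling of comment lines explicit.  This is only an illustration, not one of the pipeline scripts.  It assumes that interpro2go mapping lines look like "InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677" and that comment lines start with "!".  The function name unique_go_terms is hypothetical.

```python
def unique_go_terms(lines):
    """Return one 'GO:<name> ; GO:<id>' string per distinct GO id, sorted by id.

    Assumption: each mapping line has the form
    'InterPro:<acc> <name> > GO:<term name> ; GO:<id>'.
    """
    terms = {}
    for line in lines:
        line = line.rstrip("\n")
        # Skip comment lines and anything without a '>' separator.
        # (The cut command passes such lines through unchanged.)
        if line.startswith("!") or ">" not in line:
            continue
        go_part = line.split(">", 1)[1].strip()
        go_id = go_part.rsplit(";", 1)[-1].strip()
        # Keep the first occurrence of each GO id, like sort -u on that key.
        terms.setdefault(go_id, go_part)
    return [terms[go_id] for go_id in sorted(terms)]
```

Unlike the cut output, the strings returned here have no leading space, so check what go_upload.pl expects before substituting one for the other.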

==============================================================

Part Two: table cds
This part needs to be done every time a new unigene build is made.

1. Get unigene sequences with extract-unigene-as-fasta.pl
(To be added: check whether the unigene already has an entry in cds.)

2. Run ESTScan with the matrix file mrna.smat (generated in Dec 2003 from ~480 tomato nuclear cds) to get estscan and estscan.pep.  For details see the ESTScan documentation.
Estscan -M <matrix file> -t <peptide result file> <input file> > <output file>

3. Parse both estscan and estscan.pep with to_oneline.pl, which puts each sequence on a single line.  This looks crude compared to Bio::SeqIO, but it is the right way here, since BioPerl only keeps information up to the first space in the header line.  We need ALL the information in the header: start, end, score ...

4. Load cds with cds_upload.pl, using .estscan.oneline, .estscan.pep.oneline and the fasta file that was the input to ESTScan.
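
The one-line conversion in step 3 can be sketched as follows.  This is a minimal Python illustration of the idea, not the actual to_oneline.pl; the function name and the (header, sequence) output shape are assumptions.  The point it shows is that the whole header line is kept, not just the part before the first space.

```python
def fasta_to_oneline(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines.

    The header is the complete '>' line without the leading '>', so any
    start/end/score fields after the first space are preserved.
    """
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    # Emit the last record.
    if header is not None:
        yield header, "".join(chunks)
```

Each pair can then be written out as one line, for example header and sequence separated by a tab; the separator that to_oneline.pl really uses should be checked in the script itself.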

Right now the script loads protein_seq from the ESTScan peptide result.  However, I found some errors in the peptide result, so I highly recommend translating directly from the cds result instead.  To stay compatible with the family analysis pipeline, we need to retain some conventions (how to handle insertions at the beginning ...).  I recommend modifying cds_q in family/fix_estscanpep.pl and using it for this purpose.
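
As a rough illustration of what a direct translation involves, here is a minimal Python sketch using the standard genetic code.  It is not the cds_q routine and does NOT implement the family-pipeline conventions mentioned above.  As a simplifying assumption it uppercases the input, reads from the first base in frame 0, drops a trailing partial codon, and writes X for any codon containing an unrecognized character.  The name translate_cds is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(cds):
    """Translate a coding sequence in frame 0; '*' marks stop codons."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    protein = []
    for pos in range(0, usable, 3):
        # Codons with N or other ambiguous characters become X.
        protein.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
    return "".join(protein)
```

Comparing such a translation against the ESTScan peptide for the same entry is one way to spot the discrepancies described above.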

==============================================================

Part Three: table domain_match
This part needs to be done for each new unigene build.

Every time Part One is redone, erase all the content of the domain_match table and repeat this part.

1. Submit estscan.pep from ESTScan to P-IPRSCAN at Cornell CBSU and select raw output.  This usually takes a long time.

2. Run domain_match_upload.pl with the IPRSCAN result file.
(At present, the script looks for an entry in cds and in domain, and loads a domain_match row only when it finds entries in BOTH tables.)
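
The "BOTH tables" check in step 2 amounts to a filter like the one sketched below.  This Python sketch is not the logic of domain_match_upload.pl itself.  It assumes the raw output is tab-separated with the query sequence id in the first column and the member-database accession in the fifth; both column positions are assumptions and are exposed as parameters.  In practice the two id sets would come from the cds and domain tables.

```python
def loadable_matches(raw_lines, cds_ids, domain_ids, query_col=0, domain_col=4):
    """Return the split rows whose query is in cds AND whose domain is in domain.

    query_col and domain_col are assumed column positions in the
    tab-separated raw output; adjust them to the real file layout.
    """
    kept = []
    for line in raw_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(query_col, domain_col):
            continue  # skip blank or truncated lines
        if fields[query_col] in cds_ids and fields[domain_col] in domain_ids:
            kept.append(fields)
    return kept
```

Rows dropped by such a filter are worth logging, since they indicate either a unigene missing from cds or a domain missing from the Part One tables.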
