
By Chenwei Lin, June 28, 2006

================================================================

Run the following pipeline when new unigene builds are built. 

This documentation mainly describes the step-by-step script flow.  For details about how each script works, please refer to the script's help information. Most of the PART TWO and PART THREE scripts have perldoc.  Brief help information is shown when a script is run without any command-line options.

================================================================

PART ONE - TRIBE_MCL ANALYSIS

1.  Get the latest Arabidopsis proteome from the TAIR FTP site, Sequences/blast_datasets

2. In the group table, add a new entry.  (group_id=34811 is used as an example.)  Specify the version of the Arabidopsis proteome.

3. Add all the member unigene builds to the group_linkage table (example group_id=34811), one row per build:
insert into group_linkage (group_id, member_id, member_type) values ($group_id, $unigene_build_id, 19);

4.  Query peptide sequences of the latest build of all species (organisms) with fetch_fasta.pl.  This script needs to be modified: eliminate the CGN part and retrieve the member unigene builds from the group_linkage table.

5.  Combine the fasta file from step 4 with the Arabidopsis proteome fasta file.
 
6.  Run a self blastp with the file from step 5, using -m 8 and -F F.

7.  Run tribe_mcl on the blastp result from step 6.  Run three analyses, with I (inflation) values 1.2, 2 and 5.  The analysis is set up on lycopene only.
Command:
	mclblastline --blast-m9 --mcl-I=<1.1 to 5.0> <blastp result file>

8.  Format the tribe_mcl raw result to generate a tab-delimited file
Command:
	clmformat -icl <out file from tribe_mcl> -dump <formatted result file> -tab <map file, generated by tribe_mcl, .tab>
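
Step 7 calls for three runs that differ only in the inflation value.  As a minimal sketch (Python, not part of the pipeline scripts), the three command lines could be generated as below.  The function name and the file name all_vs_all.blastp are hypothetical; the flags are the ones shown in the step 7 command.

```python
def mclblastline_commands(blast_result, inflations=(1.2, 2, 5)):
    """Return one mclblastline argument list per inflation value."""
    return [
        # Flags copied from the step 7 command; only --mcl-I changes
        ["mclblastline", "--blast-m9", f"--mcl-I={value}", blast_result]
        for value in inflations
    ]


if __name__ == "__main__":
    # "all_vs_all.blastp" is a hypothetical blastp result file name
    for cmd in mclblastline_commands("all_vs_all.blastp"):
        print(" ".join(cmd))
```

Each argument list can be passed to subprocess.run(cmd, check=True) on the machine where tribe_mcl is installed.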


====================================================================

PART TWO - FAMILY DATA UPLOAD

1.  Run family_upload.pl
group_id is the group_id from PART ONE step 2
The tab-delimited family file is from PART ONE step 8

2. Download the Arabidopsis InterPro annotation file from the TAIR FTP site.  It is in the protein directory.

3.  Run family_annot.pl
group_id is from PART ONE step 2

I keep the family_annotation loading separate so that, if we want a new way of getting the family annotation, only this step needs to change.


====================================================================

PART THREE - FAMILY ALIGNMENT

This part runs on families within a certain size (number of members) range.
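
A minimal sketch of the size selection, assuming the tab-delimited family file from PART ONE step 8 holds one family per line with members separated by tabs.  The function name and the bounds 3 and 50 are hypothetical, and the pipeline scripts may apply the range in the database instead.

```python
def families_in_size_range(lines, min_size, max_size):
    """Yield (line_number, members) for families whose size is in range."""
    for number, line in enumerate(lines, start=1):
        # Assumption: one family per line, members separated by tabs
        members = [m for m in line.rstrip("\r\n").split("\t") if m]
        if min_size <= len(members) <= max_size:
            yield number, members


if __name__ == "__main__":
    # Made-up member names; 3 and 50 are illustrative bounds only
    example = ["a1\ta2\n", "b1\tb2\tb3\tb4\n"]
    for number, members in families_in_size_range(example, 3, 50):
        print(number, len(members))
```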

1.  Run redundant_family.pl

2.  Retrieve peptide sequences by running family_fasta.pl 
The result directory is pep_fasta

3. Make a pep_align directory and run run_tcoffee.pl in that directory.
   Move the dnd files to another directory.  (We don't actually need them.)

4. Make a cds_fasta directory and retrieve the cds sequences by running family_fasta.pl

5. Convert peptide alignment to cds alignment by running run_pep_2_nt.pl

6. Load the alignment sequences by running align_upload.pl
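
Step 5 converts each peptide alignment into a cds alignment.  The usual idea is to replace every aligned residue with its codon and every gap with three gap characters.  The sketch below illustrates that idea only; it is not the code of run_pep_2_nt.pl.  It assumes the cds is in frame and starts at the first residue, and the function name is hypothetical.

```python
def pep_to_cds_alignment(aligned_pep, cds, gap="-"):
    """Thread an unaligned cds onto one row of a peptide alignment."""
    residues = sum(1 for char in aligned_pep if char != gap)
    if len(cds) < 3 * residues:
        raise ValueError("cds is shorter than the peptide requires")
    pieces = []
    position = 0
    for char in aligned_pep:
        if char == gap:
            # One peptide gap becomes one gapped codon
            pieces.append(gap * 3)
        else:
            pieces.append(cds[position:position + 3])
            position += 3
    # Any trailing bases (for example a stop codon) are left out
    return "".join(pieces)


if __name__ == "__main__":
    print(pep_to_cds_alignment("M-K", "ATGAAATAA"))
```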

===================================================================

PART FOUR - FAMILY TREE

This part uses results from PART THREE

1. Make a cds_overlap directory.  Get overlapping sequences from PART THREE by running run_overlap_seq.pl

2.  Make a cds_nex directory.  Make nexus files for families with sufficient overlap length.  For family trees in SGN, use an overlap length of 90 for nucleotides.

3. Make a model and a tree directory.  Get the trees by running run_modeltes_paup_unroot.pl
I stored the modelblockPAUPb10.txt file in the family directory.  We need to put it in a more appropriate location and modify modeltest_paup_unroot.pl accordingly.


4. Make a nw_plain and a nw_newid directory.  Parse the tree result by running run_tre2newick.pl

5.  Load the family and family_tree tables by running family_tree_upload.pl

6.  Sort out the family tree result files by running organize_log_file.pl.  Move the tree result files to /data/shared/websites/family_trees/group_id
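
Step 2 of this part keeps only families with sufficient overlap length.  As a rough sketch of such a check, the code below counts the alignment columns in which every sequence has a base and compares the count with the threshold of 90.  Treating "overlap" as the number of gap-free columns is an assumption, not necessarily what run_overlap_seq.pl does, and the function names are hypothetical.

```python
def overlap_length(aligned_seqs, gap_chars="-."):
    """Count alignment columns where no sequence has a gap."""
    if not aligned_seqs:
        return 0
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for column in zip(*aligned_seqs)
        if not any(char in gap_chars for char in column)
    )


def has_sufficient_overlap(aligned_seqs, min_overlap=90):
    """True if the gap-free overlap reaches the threshold (90 for nucleotides)."""
    return overlap_length(aligned_seqs) >= min_overlap


if __name__ == "__main__":
    # Made-up sequences for illustration
    print(overlap_length(["ATG-CA", "A-GTCA"]))
```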
