John Binns imparted (2006-05-01 @ 08:25:56 -0700):

> there are some new ESTs in /data/local/PGN/submission/
> that i am supposed to pipeline.

Unlucky you ;)

Quick important note (revisited below): do all processing for FGN, not PGN. Always. Except when running BLASTs. (Then you do it for both.) :)

> i have absolutely no clue what to do. i have read
> fgn-pipeline-est.txt but it doesn't tell me what files
> are supposed to be where when i run stuff, what
> machine i am supposed to run these commands on, and
> what each of these commands will be doing.

Since the data you have to process is in submission/, start by running the commands in /data/shared/teri/notes/fgn-import_data.txt; those involve copying and unpacking the submitted data, then running 00_import_submitted_seq.pl to prepare it for the normal FGN pipeline.

That script gets a bit complicated because you have to look at the file names and grok what the biologists were trying to represent with them. The information you will need is the library (they probably have some dumb name for this, so rename it with our standards), the plate (e.g., 10MS1), and the well (which is the section of the plate [look at it like a grid] and is labelled with a letter followed by a number [e.g., H11]). Look at the filenames and get Beth's help to determine what they represent.

Since biologists can't agree on things like naming conventions, there are several different regular expressions to parse that data once you've determined the necessary format. If you run the script without any arguments, it will show you the variations I've seen, which have already been entered into the script. If what you have is different, you'll have to add it to the script too. Because I'm lazy, you have to update the script for each new format in a few places:

1. around line 31, for the usage string
2. around line 53, for the actual regular expression
3. around line 87, for setting the variables that come out of the regular expression

Once you run the script, it will ask you for the new library name (we never use theirs- our standard is the first letter of the genus, followed by the first two letters of the species, followed by two incrementing digits for which library this is for that species) and then it will keep you updated on what it's doing as it moves/renames/repacks the files.

To make things even more complicated, some submissions have data that uses more than one naming scheme at the same time. To process that, you'll have to run through the script once for each different format found. I think my script is exceptionally dumb, in that it cleans up the directories a bit too well after each go, so keep in mind that you may have to re-copy and unpack the data before re-running the script.

After all that is done, you get into the more mysterious (to me, anyway) process that Dan wrote. Just look at fgn-pipeline-est.txt and run the commands there in order (script 00 there just downloads new data from PSU, which is entirely separate from the stuff in submission/, but should leave the data in the same state).

As you go along, you can probably ignore the comment about conversion to ab1/scf format (I only needed to worry about this once or twice- you'll know if you need to worry about it because the pipeline will soon have no input files). When it comes time to create the new database entry, look at line 59 for an example SQL query- change it as needed (look at the other library data in the database for an example). Don't forget to add the new library to 03_new_seq_upload.pl (around line 162- but if you do forget, the script will just ignore all that data anyway, so I guess it could be worse). The ps/uf/etc you need to enter for each library is the sequencing location (look in the sequencing_location table to figure out what the two-letter acronyms mean).
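An aside, to make the filename-grokking step above concrete: here is a rough Python sketch of the kind of parsing 00_import_submitted_seq.pl does. The filename format shown (library_plate_well.seq) and every name in it are made up- the real submissions vary, which is the whole reason the script carries several regexes. The library-name helper just encodes the naming standard described above.

```python
import re

# Hypothetical filename format -- real submissions vary, which is why
# the Perl script carries several regexes.  Assumed shape here:
#   <library>_<plate>_<well>.seq   e.g. "tomlib_10MS1_H11.seq"
FILENAME_RE = re.compile(
    r"^(?P<library>\w+)_"         # submitter's library name (we rename it)
    r"(?P<plate>\d+[A-Z]+\d+)_"   # plate, e.g. 10MS1
    r"(?P<well>[A-H]\d{1,2})"     # well: grid row letter + column, e.g. H11
    r"\.seq$"
)

def parse_est_filename(name):
    """Return (library, plate, well) or None if the name doesn't match."""
    m = FILENAME_RE.match(name)
    return m.group("library", "plate", "well") if m else None

def standard_library_name(genus, species, serial):
    """Our standard: first letter of genus + first two letters of the
    species + two incrementing digits for which library this is."""
    return genus[0].upper() + species[:2].lower() + "%02d" % serial
```

For example, parse_est_filename("tomlib_10MS1_H11.seq") gives ("tomlib", "10MS1", "H11"), and standard_library_name("Solanum", "lycopersicum", 1) gives "Sly01".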
The sequencing info is in the database and denotes what types of machines and dyes were used. Frankly, this has always been a bit of a mystery to me, and I've found that if you ask the submitter it's usually a mystery to them, too. If you look at the sequencing_information table, you'll see that I've mostly just used ABI 3700 for everything- this is because Dan told me that the other ABI machines we were seeing were functionally equivalent (at least at that time) and that is what he was doing.

Adding new vectors is a bitch. Hope you don't need to do it. It's especially bad because Dan's pipeline gives no unusual error messages when a vector can't be found- you just find that by the time you get to the later scripts in the pipeline, you have no sequence left to process. (Actually, adding the vector isn't that bad- it's just appending it to a fasta file; it's FINDING the vector sequence that's a pain.)

The rest of the pipeline should just go. There will be some errors from time to time (most memorably, near the end when building the library assembly, regarding disk space)- as long as you still have data to process for the next script in the pipeline, they are mostly fine. After you are done with the pipeline, there's some manual file copying and shuffling to be done- details on that are in fgn-pipeline-est.txt.

After that, there's the unigene build, too. I wrote that pipeline (though not all of the scripts it calls), and like to think it's less bad than the est pipeline (save for the fact that the first step is step 3); though there's still some manual database editing and file shuffling/editing to be done. Details on that are in fgn-pipeline-unigene.txt.

After that, there's the blast pipeline, which I also wrote. That should go without a hitch, presuming that you can rlogin to the cluster nodes and that there is sufficient hard drive space left in /data on each node.
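Since adding a vector really is just appending a record to a fasta file, here's a minimal sketch of that append. The path and the record layout are assumptions- check which fasta file Dan's pipeline actually screens against before relying on this.

```python
def add_vector(fasta_path, vector_name, vector_seq, width=60):
    """Append one vector record to the pipeline's vector fasta file.

    fasta_path is whichever file the pipeline screens against (an
    assumption here); width is the standard fasta line-wrap length.
    """
    with open(fasta_path, "a") as fh:            # "a" creates the file if missing
        fh.write(">%s\n" % vector_name)          # fasta header line
        for i in range(0, len(vector_seq), width):
            fh.write(vector_seq[i:i + width] + "\n")  # wrapped sequence lines
```

Finding the vector sequence to pass in is, as noted above, the actual hard part.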
To get the blast pipeline going easily, edit the pgn_processing/scripts/blast_scripts/run-fgn-blasts.sh script to have the correct unigene IDs and then run it in a screen. For one mid-size build, it should take about 4-6 hours to blast against ath1 and NR.

> also, these are in PGN right now. do i run them
> through the FGN pipeline, since there is no
> pgn-pipeline-est.txt?

Yep, that's exactly what you do.

> if i do, how do they get into
> PGN?

This is a huge kludge, but you have to wait until a good hunk of FGN data is ready to go public, take PGN offline, copy the mysql database from FGN to PGN, and then update it on PGN to keep nonpublic data nonpublic. Some ideas for the SQL commands you use to do this are in /data/shared/teri/notes/pgn-import_from_fgn.txt

> what is the difference between PGN and FGN,
> anyway?

PGN is the public subset of FGN.

> right now, i have a checkout of pgn_processing and
> copies of FGN and PGN MYSQL databases on my local
> machine. what else do i need?

A lot, if you plan to build unigenes there. That's probably not worth it, though- do all the menial database tedium on your machine for copying FGN to PGN, but do all other maintenance on rubisco, on the live database. That's not as bad as it sounds, because that database doesn't change all that often, and if it gets broken it only takes a few moments to copy over a recent backup of the database. Of course, since the nightly db backup is now much more complicated than simply copying over the database files, you should probably get your own backup copy of the database first.

> do these scripts rely on being able to mount
> /data/shared? if so, since we can't mount that
> anymore, should i be running them somewhere else?

Yep, they do. All the pgn data lives in /data/shared. Rubisco can still mount /data/shared; run the scripts there. Really.
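For the FGN-to-PGN copy described above, the general shape is something like the sketch below. The database names and the library.is_public flag are pure guesses- the real SQL lives in /data/shared/teri/notes/pgn-import_from_fgn.txt. This only builds the command strings; it doesn't run anything, and you'd still take PGN offline first.

```python
def fgn_to_pgn_commands(fgn_db="fgn", pgn_db="pgn"):
    """Return, in order, the shell commands for the FGN -> PGN copy.

    Everything here is a guess at the shape: database names, and the
    idea of a library.is_public flag marking nonpublic data.  Check
    pgn-import_from_fgn.txt for the real SQL before doing this live.
    """
    return [
        # 1. clone the FGN database wholesale into PGN
        "mysqldump %s | mysql %s" % (fgn_db, pgn_db),
        # 2. then strip anything not ready to go public (hypothetical schema)
        'mysql %s -e "DELETE FROM library WHERE is_public = 0"' % pgn_db,
    ]
```

The point of the two-step shape is that it's easier to copy everything and delete the nonpublic rows than to dump a selective subset.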