John Binns imparted (2006-05-01 @ 08:25:56 -0700):

> there are some new ESTs in /data/local/PGN/submission/
> that i am supposed to pipeline.

Unlucky you ;)

Quick important note (revisited below): do all processing for FGN, not PGN. Always. Except when running BLASTs. (Then you do it for both.) :)

> i have absolutely no clue what to do. i have read
> fgn-pipeline-est.txt but it doesn't tell me what files
> are supposed to be where when i run stuff, what
> machine i am supposed to run these commands on, and
> what each of these commands will be doing.

Since the data you have to process is in submission/, start by running the commands in /data/shared/teri/notes/fgn-import_data.txt; those involve copying and unpacking the submitted data, then running 00_import_submitted_seq.pl to prepare it for the normal FGN pipeline.

That script gets a bit complicated because you have to look at the file names and grok what the biologists were trying to represent with them. The information you will need is the library (they probably have some dumb name for this, so rename it with our standards), the plate (e.g., 10MS1), and the well (which is the section of the plate [look at it like a grid] and is labelled with a letter followed by a number [e.g., H11]). Look at the filenames and get Beth's help to determine what they represent.

Since biologists can't agree on things like naming conventions, there are several different regular expressions to parse that data once you've determined the necessary format. If you run the script without any arguments, it will show you the variations I've seen, which have already been entered into the script. If what you have is different, you'll have to add it to the script too. Because I'm lazy, you have to update the script for each new format in a few places:

1. around line 31, for the usage string
2. around line 53, for the actual regular expression
3. around line 87, for setting the variables that come out of the regular expression

Once you run the script, it will ask you for the new library name (we never use theirs- our standard is the first letter of the genus, followed by the first two letters of the species, followed by two incrementing digits for which library this is for that species) and then it will keep you updated on what it's doing as it moves/renames/repacks the files.

To make things even more complicated, some submissions have data that uses more than one naming scheme at the same time. To process that, you'll have to run through the script once for each different format found. I think my script is exceptionally dumb, in that it cleans up the directories a bit too well after each go, so keep in mind that you may have to re-copy and unpack the data before re-running the script.

After all that is done, you get into the more mysterious (to me, anyway) process that Dan wrote. Just look at fgn-pipeline-est.txt and run the commands there in order (script 00 there just downloads new data from PSU, which is entirely separate from the stuff in submission/, but should leave the data in the same state).

As you go along, you can probably ignore the comment about conversion to ab1/scf format (I only needed to worry about this once or twice- you'll know if you need to worry about it because the pipeline will soon have no input files). When it comes time to create the new database entry, look at line 59 for an example SQL query- change it as needed (look at the other library data in the database for an example). Don't forget to add the new library to 03_new_seq_upload.pl (around line 162- but if you do forget, the script will just ignore all that data anyway, so I guess it could be worse). The ps/uf/etc you need to enter for each library is the sequencing location (look in the sequencing_location table to figure out what the two-letter acronyms mean).
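An aside, to make the filename-grokking step above concrete: here is a rough Python sketch of the kind of parsing 00_import_submitted_seq.pl does. The filename format shown (library_plate_well.seq) and every name in it are made up- the real submissions vary, which is the whole reason the script carries several regexes. The library-name helper just encodes the naming standard described above.

```python
import re

# Hypothetical filename format -- real submissions vary, which is why
# the Perl script carries several regexes.  Assumed shape here:
#   <library>_<plate>_<well>.seq   e.g. "tomlib_10MS1_H11.seq"
FILENAME_RE = re.compile(
    r"^(?P<library>\w+)_"         # submitter's library name (we rename it)
    r"(?P<plate>\d+[A-Z]+\d+)_"   # plate, e.g. 10MS1
    r"(?P<well>[A-H]\d{1,2})"     # well: grid row letter + column, e.g. H11
    r"\.seq$"
)

def parse_est_filename(name):
    """Return (library, plate, well) or None if the name doesn't match."""
    m = FILENAME_RE.match(name)
    return m.group("library", "plate", "well") if m else None

def standard_library_name(genus, species, serial):
    """Our standard: first letter of genus + first two letters of the
    species + two incrementing digits for which library this is."""
    return genus[0].upper() + species[:2].lower() + "%02d" % serial
```

For example, parse_est_filename("tomlib_10MS1_H11.seq") gives ("tomlib", "10MS1", "H11"), and standard_library_name("Solanum", "lycopersicum", 1) gives "Sly01".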
The sequencing info is in the database and denotes what types of machines and dyes were used. Frankly, this has always been a bit of a mystery to me, and I've found that if you ask the submitter it's usually a mystery to them, too. If you look at the sequencing_information table, you'll see that I've mostly just used ABI 3700 for everything- this is because Dan told me that the other ABI machines we were seeing were functionally equivalent (at least at that time) and that is what he was doing.

Adding new vectors is a bitch. Hope you don't need to do it. It's especially bad because Dan's pipeline gives no unusual error messages when a vector can't be found- you just find that by the time you get to the later scripts in the pipeline, you have no sequence left to process. (Actually, adding the vector isn't that bad- it's just appending it to a fasta file; it's FINDING the vector sequence that's a pain.)

The rest of the pipeline should just go. There will be some errors from time to time (most memorably, near the end when building the library assembly, regarding disk space)- as long as you still have data to process for the next script in the pipeline, they are mostly fine. After you are done with the pipeline, there's some manual file copying and shuffling to be done- details on that are in fgn-pipeline-est.txt.

After that, there's the unigene build, too. I wrote that pipeline (though not all of the scripts it calls), and like to think it's less bad than the est pipeline (save for the fact that the first step is step 3); though there's still some manual database editing and file shuffling/editing to be done. Details on that are in fgn-pipeline-unigene.txt.

After that, there's the blast pipeline, which I also wrote. That should go without a hitch, presuming that you can rlogin to the cluster nodes and that there is sufficient hard drive space left in /data on each node.
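Since adding a vector really is just appending a record to a fasta file, here's a minimal sketch of that append. The path and the record layout are assumptions- check which fasta file Dan's pipeline actually screens against before relying on this.

```python
def add_vector(fasta_path, vector_name, vector_seq, width=60):
    """Append one vector record to the pipeline's vector fasta file.

    fasta_path is whichever file the pipeline screens against (an
    assumption here); width is the standard fasta line-wrap length.
    """
    with open(fasta_path, "a") as fh:            # "a" creates the file if missing
        fh.write(">%s\n" % vector_name)          # fasta header line
        for i in range(0, len(vector_seq), width):
            fh.write(vector_seq[i:i + width] + "\n")  # wrapped sequence lines
```

Finding the vector sequence to pass in is, as noted above, the actual hard part.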
To get the blast pipeline going easily, edit the pgn_processing/scripts/blast_scripts/run-fgn-blasts.sh script to have the correct unigene IDs and then run it in a screen. For one mid-size build, it should take about 4-6 hours to blast against ath1 and NR.

> also, these are in PGN right now. do i run them
> through the FGN pipeline, since there is no
> pgn-pipeline-est.txt?

Yep, that's exactly what you do.

> if i do, how do they get into
> PGN?

This is a huge kludge, but you have to wait until a good hunk of FGN data is ready to go public, take PGN offline, copy the mysql database from FGN to PGN, and then update it on PGN to keep nonpublic data nonpublic. Some ideas for the SQL commands you use to do this are in /data/shared/teri/notes/pgn-import_from_fgn.txt

> what is the difference between PGN and FGN,
> anyway?

PGN is the public subset of FGN.

> right now, i have a checkout of pgn_processing and
> copies of FGN and PGN MYSQL databases on my local
> machine. what else do i need?

A lot, if you plan to build unigenes there. That's probably not worth it, though- do all the menial database tedium on your machine for copying FGN to PGN, but do all other maintenance on rubisco, on the live database. That's not as bad as it sounds, because that database doesn't change all that often, and if it gets broken it only takes a few moments to copy over a recent backup of the database. Of course, since the nightly db backup is now much more complicated than simply copying over the database files, you should probably get your own backup copy of the database first.

> do these scripts rely on being able to mount
> /data/shared? if so, since we can't mount that
> anymore, should i be running them somewhere else?

Yep, they do. All the pgn data lives in /data/shared. Rubisco can still mount /data/shared; run the scripts there. Really.
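For the FGN-to-PGN copy described above, the general shape is something like the sketch below. The database names and the library.is_public flag are pure guesses- the real SQL lives in /data/shared/teri/notes/pgn-import_from_fgn.txt. This only builds the command strings; it doesn't run anything, and you'd still take PGN offline first.

```python
def fgn_to_pgn_commands(fgn_db="fgn", pgn_db="pgn"):
    """Return, in order, the shell commands for the FGN -> PGN copy.

    Everything here is a guess at the shape: database names, and the
    idea of a library.is_public flag marking nonpublic data.  Check
    pgn-import_from_fgn.txt for the real SQL before doing this live.
    """
    return [
        # 1. clone the FGN database wholesale into PGN
        "mysqldump %s | mysql %s" % (fgn_db, pgn_db),
        # 2. then strip anything not ready to go public (hypothetical schema)
        'mysql %s -e "DELETE FROM library WHERE is_public = 0"' % pgn_db,
    ]
```

The point of the two-step shape is that it's easier to copy everything and delete the nonpublic rows than to dump a selective subset.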