SOL 2008 Sequencing Meeting
7pm-11pm, 10/15/08
Holiday Inn am Stadtwald, Koeln, Germany

PRESENTATIONS
===============

Roeland van Ham Chromosome 6 (NL)
---------------

progress so far has been slow. many of the chromosomes have problems with gaps for which they have not yet been able to find extension or seed BACs.

chromosome 6 estimates still 160 BACs to finish it.
3-15 small gaps (< 4 BACs), 4-5 large gaps (4-15 BACs)

The chromosome 6 project has now run out of money. We are now looking for more funding sources.

Options:
A. continue BAC walking
B. purify and shotgun sequence chr6-12
C. Sequence chr6-12 by shotgun sequencing.

We have decided to pursue option C.

We want to produce:
- a whole-genome physical map based on 10X Genome Analyzer-generated AFLP sequence tags
- a 20X genome coverage of 454 Titanium reads, using a combination of shotgun and paired-end runs
- a 30X coverage in SOLID reads
- about 3 million Sanger reads from Selected BAC Mixture (SBM data, Kazusa)

The initiative
We will assemble this data together with all currently available data: 300,000 BAC ends, 180,000 fosmid ends, 30% of euchromatic sequence

The challenge: assemble the genome
Use 66Mb of available sequence to benchmark the procedure, which was the strategy used in the Vitis vinifera genome.
1. Match all reads vs. all reads, 100% identity
2. Cluster reads and divide into repeat and low-copy clusters
3. separately assemble the low-copy clusters
4. merge assembled clusters, lowering stringency step-wise
5. use bac-end, fosm.... ....

Funding for 10X solexa BAC-based physical map is secured
15X SOLID coverage, funding secured.
10X 454 coverage, application in process, production expected 12/2008
10X 454 coverage, Italy: secured
SBM data from Kazusa: already done

Data release
Will have assembly of next-gen data with contigs anchored as much as possible.
All data will be released to the consortium. Users cannot publish the data without consent of the producers for 9 months.
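
(Editorial illustration, not part of the presentation.) Steps 1 and 2 of the assembly procedure listed above, exact all-vs-all matching followed by clustering into repeat and low-copy groups, can be sketched roughly as below. This is a toy sketch under stated assumptions, not the pipeline the group intends to run: the function names (`exact_overlap`, `cluster_reads`, `split_by_copy_number`), the minimum overlap length, and the rule that a cluster counts as "repeat" once it holds more reads than a chosen threshold are all hypothetical choices made for illustration. Later steps are not sketched, since the notes cut off at step 5.

```python
from itertools import combinations


def exact_overlap(a, b, min_len):
    """True if a suffix of `a` equals a prefix of `b` with 100% identity
    over at least `min_len` bases (hypothetical matching rule)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return True
    return False


def cluster_reads(reads, min_len):
    """Step 1 + first half of step 2: compare all reads against all reads
    and group reads connected by exact overlaps (union-find)."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(reads)), 2):
        if exact_overlap(reads[i], reads[j], min_len) or \
           exact_overlap(reads[j], reads[i], min_len):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(reads)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def split_by_copy_number(clusters, max_low_copy_size):
    """Second half of step 2: call a cluster 'repeat' when it holds more
    reads than `max_low_copy_size` (an assumed, illustrative threshold)."""
    low_copy = [c for c in clusters if len(c) <= max_low_copy_size]
    repeat = [c for c in clusters if len(c) > max_low_copy_size]
    return low_copy, repeat


if __name__ == "__main__":
    # Made-up reads: 0 and 1 overlap by "CGGG"; 2, 3 and 4 share "GCAT".
    reads = ["AAACCCGGG", "CGGGTTTAA", "TGCATGCAT", "GCATCCTAG", "GCATAGGTC"]
    low_copy, repeat = split_by_copy_number(cluster_reads(reads, 4), 2)
    print("low-copy clusters:", low_copy)
    print("repeat clusters:", repeat)
```

The all-vs-all loop is quadratic in the number of reads, so at the coverage levels discussed in the presentation a real implementation would need an index-based overlap search; the sketch only shows the shape of the clustering logic.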

Estimated time line
Production of SOLID and 454 data already started.
Production of physical map: Jan-July 2009
Assembly of all data sets: May 2009-Aug 2009
September 2009, assembly released to SOL consortium

Invitation
- other members welcome to join the 'seed consortium', provided you can contribute novel and significant expertise or data sets.
- can stick with the time line
- can agree to data release policy

Question: what kind of contig sizes do you expect?
Answer: I think we need more coverage, probably in the thousands for supercontig count.

Question: if other groups can secure funding, would the sequencing be done at multiple sites, or one site?
Answer: We can distribute the sequencing, we are already doing it at 3 different places.

Question: does this also mean that the BAC-by-BAC sequencing will stop?
Answer: We ourselves expect to continue doing it. The problem is finding the extension BACs. We plan to accumulate and fill our Titanium runs partially with BACs.

(DISCUSSION, mostly about how hard assembly will be?)

M.Bouzayen: we must be careful what promises we make publicly about this, because we don't have experience with 454.

J. Giovannoni: exactly right. I think we should keep this in a context of it being a tool to assist the BAC-by-BAC process.

D. Zamir: I applaud your thought and vision. It's clear to us that many of the sequencing centers are well-funded, but they are just out of BACs. It is not limited to a single center, it is in most centers. Although it's not a proven technology, I think that we have to take a risk on it if we want to finish the genome in a timely fashion.

(DISCUSSION - distribution of credit, of labor, of risk for this venture)

S.Knapp: outside of this group, the solanaceae group is held up as an example of cooperation that has worked very well. I think the risk here is not so much whether the 454 will produce good products, but whether at the end of the 454 sequencing, the community will still be intact.

Jim Giovannoni Chr 1 and 10 status
--------------------

I will be very brief. It is already 8pm.
Here is a summary of the libraries that are available.
Steve Stack is still doing FISH. He couldn't be here, but he's still working.
Bruce Rowe is going to pilot-sequence 24 BACs beginning very soon.
We are submitting an NSF proposal to get the funding for sequencing the US chromosomes. The initial indications are good.

G.Giuliano: our experience is that if you rely on overgos, you run two risks. markers that are apparently well-separated on the genetic map can actually overlap. also, our experience is that overgos are wrong 30% of the time: the BACs they pick up are actually on another chromosome. Our experience has also been that IL mapping is very reliable. we routinely IL-map both ends to avoid hybrid BACs.

Doil Choi Chr 2 project
------------------

latest estimate is 22Mb of euchromatin, not 26.
we have sequenced about 13.8 Mb so far (non-redundant), so based on 22Mb we are around 63% done
problems are mostly that

D. Zamir: it seems that, given the position that chromosome 2 is in, they should certainly keep in mind the fosmid library

Chuanyou Li Chr 3 project
------------------

ordered the 3 BAC libraries.
did manual editing of the available FPC data, reducing it to 3000 contigs
physical map coverage is very low, only about 50Mb of the 950Mb, and it is not focused on the euchromatin
did a pooled PCR to find markers on BACs. results were not good. for many of the markers, we got too many hits, and for many there were no hits.
next, we tried doing some FISH.
97 BACs are in our pipeline, 44 finished, 15 submitted to SGN.
problems as we see them are the sparse physical map and identification of extension and seed BACs

Gerard Bishop Chr 4
----------------

Sanger is leaving the project at the end of October, so we're moving the sequencing effort over to Imperial.
Total unique finished sequence is about 19.5Mb.
One awkward thing we've found is that there are actually quite a lot of genes in the heterochromatin, and the repeats seem to be distributed heavily throughout the BACs.
We have 81 mapped contigs.
119 BACs/44 contigs have been definitively FISH/IL mapped to chromosome 4.
58 BACs are still under confirmation but they have chr4 marker sequence.
~60 markers for which we have not identified any BACs.
~13 BACs have been sequenced to HTGS3 and placed on chr0, since they are definitely not on chr4
22 markers on SGN are missing sequences, those would help us a lot.

next steps:
- confirm all of the recalcitrant BACs as on chr4, with IL mapping.
- use 3d pools with marker sequences to identify further BACs.
- use 3d BAC pools to identify BACs to extend current contigs
- analyze output from X2 GS-FLX and X2 illumina runs on cDNA from chr4 IL and parental lines to identify SNPs and more chr4 markers.

G.Giuliano: do you think that you have so many non-chr4 BACs and repetitive BACs because you relied so heavily on the FPC map?
G.Bishop: it could be because chr4 is just different, we won't really know until we get more sequence

L.Mueller: have you estimated the gap size between any contigs?
GB: no, not yet. we haven't done any FISH for that.

Chr5 update
------------

following standard protocols: fingerprinting, purity check
validating the BAC map positions with IL mapping
we have about 23 contigs now, with 83 BACs in various stages of sequencing
we have also found lots of genes in the heterochromatin.
we have 31 BACs completed and submitted, 31 in phase 2, 13 in phase 1, 8 are in prep
12 of our clones ended up on different chromosomes

chr6 update Sander Peters
-----------

- have identified serious physical gaps, have done extensive annotation
- FISHed 113 BACs: 81 on chr6, 3 on chr6 and others, 28 on others, 1 no signal
- 47 BACs in phase 3, 96 ordered contigs
- uploaded AGP and TPF to SGN
- have done 10.7 Mb: 2.0 on short arm, 1.8 pericentromeric, 6.9 long arm.
- were able to get physical measurements for the lengths of the arms and euchromatin on chr6, using FISH with BACs
- we have also found that quite a few genes are in the heterochromatin. we think there are probably about 1400 genes in the heterochromatin of chr6.
- conclusions: we have megabase-sized gaps still. sequence islands may serve as a backbone for WGA. we need more anchored BACs.

chr7 update M. Bouzayen
------------

we made 3d DNA pools for half of the HindIII library and the entire MboI BAC library, freely available for the community, just ask. Also, macroarray filters.
until May 08, we sequenced our BACs with classical Sanger. now, we have switched to only 454 GS-FLX long paired end tag reads and multiplex identifiers.
first batch of 12 BACs sequenced: 4 BACs came out directly phase 2 in 1 contig, 7 came out in 2 to 11 contigs, and 1 problematic BAC.
results now available in GenBank and SGN.

chr8 update S.Sato
------------

did some work to add more markers to EXPEN2000 to try to get more seed clones. got 34 new seed BACs out of it.
did a shotgun of a selected BAC mixture (SBM). did 2.9 million reads of SBM (1.7 Gbp). did a second SBM mixture.
got about 580 Mb of nonredundant contigs out of the whole effort. covers about 71% of tomato gene index data.

chr9 update J.van Eck for A.Granell
-------------

completely out of new seed BACs. have not been able to find any extensions at all. 29 dead ends.
don't know yet how useful the fosmids will be.

chr12 update G. Giuliano
-------------

Goes over the workflow for IL mapping of BACs.
Has done a lot of IL mapping, but nobody else was doing it (or at least not reporting on SGN).
Volunteers to map BACs that are placed in chromosome 0.
Did some pilot 454 sequencing of BACs, got good results.
Emphasizes that they will IL map BACs that people want them to IL map.

ITAG overview L. Mueller
-------------

Gives an overview of inputs and outputs of the ITAG pipeline.
emphasizes need for AGP files, because the pipeline is based on contigs, and the contigs are made from the AGP files.
ITAG pipeline will soon be in version 1, which will be the first production run.

DISCUSSION
===============

DZ: regarding the new sequencing approach, Roeland agreed to write a short proposal that will be sent to all the sequencing centers for comment. We will develop a sequencing standards document that we can all agree to.

Regarding the fosmids, let's collect statistics on how effective they are being. Perhaps SGN can make some statistics about fosmid performance.

Another interesting proposal, from KR, is the synteny approach. Comparing grapevine and tomato might be almost gene-for-gene. It needs to be tested, but if we see that there is high conservation, then we could get markers for gaps using a synteny approach. This is a bioinformatics analysis that could be very quick.

Final thing on my list: tomorrow we have a meeting with the potato group, and my feeling is that, with the momentum we're starting to gather, this is very good news. Our objective for our meeting tomorrow with potato is, since it's so similar, to find a way that's acceptable for all to use the tomato sequences for the potato project, at least to close gaps etc.

Giovanni Giuliano: In this year we learned a lot, by making mistakes. A number of projects have already submitted BACs to GenBank. I make the request that countries that have IL information enter it into SGN.
Assign the BACs to be IL mapped by your own country, and enter your IL information, because the IL mapping information is very valuable for other projects.

RKL:

T.Mohapatra:

S. Knapp: we might consider attaching one of the Creative Commons licenses to our data, since it's understood very widely. I'm sure that one of the CC licenses is compatible with the ENCODE data release policy.

D. Zamir: now is the time for us to start designing our paper for release: what tables we'll have, what figures we'll have. Perhaps we can have data views on SGN that are dynamically updated.

G.G: I have a few proposals on how to structure the paper. A genome paper nowadays, in order to be high-profile, needs to have expression information, deep sequencing, etc. Perhaps we could get the Affy array online fast enough. One thing I've never seen in a genome paper is proteomics. If you annotate the protein products as well, that would really stand out from what you usually see.