The genome era is an incredible time to be in science, but it also presents unprecedented challenges. The first complete human genome was published just about 10 years ago, yet major questions remain, such as:
- Where are all the genes located on the genome?
- How and when are the genes regulated?
- How are those genes spliced to form different transcript variants?
- Which of the genes encode proteins, and which encode other functional RNAs?
It turns out that “sequencing a genome” was only a starting point, not an end point. There’s a lot of work to do.
The National Institutes of Health is now supporting a project called the ENCyclopedia Of DNA Elements (ENCODE), whose ultimate goal is to solve many of these riddles, providing us with enough insight into genomes and their genes that we can start using the genome era to improve health care.
The ENCODE project is large and ambitious – and today marks the kickoff of the project’s annual meeting in Washington, DC.
Highlights of the first day of the meeting include presentations from Ewan Birney and Manolis Kellis on preliminary analyses of the large data sets being generated; a presentation from Eric Green, director of the National Human Genome Research Institute, on the future of the project; and talks from a variety of researchers (including yours truly) about data generation and analysis efforts for the project. These data generation efforts include transcription mapping by RNA-seq, histone mapping, studies of heterochromatin, examination of alternative splicing, and analysis of protein expression.
While the meeting is exciting, it is clear that nobody yet has a handle on the data analysis challenges (no slight to Ewan, Manolis, or the other folks intended here – I’m just highlighting the sheer scale of the problem). Each research group is producing data at a tremendous clip, and while the project has a “Data Coordination Center” through which all the data flows and is tracked, putting it all together is going to take years.
Who will do that work? I am concerned, because this kind of work is not well suited to most university environments. It requires a significant long-term investment in computing and software infrastructure, and universities like my own are notoriously bad at long-term planning. The short funding cycles now provided by the NIH exacerbate the problem, since continuity for more than 2-3 years is almost never assured.
I think the solution will have to be dedicated institutes and/or commercial ventures created for this purpose, insulated from the publish-or-perish pressures of academic science. Developing this kind of infrastructure requires extraordinary focus and long-term thinking, and it doesn’t lend itself well to the constant publishing that academia demands.
Perhaps that’s part of the reason that Ewan Birney has been so successful at the European Bioinformatics Institute – it is not a traditional academic setting. I think we need to replicate his successes here in the US.