GenomeQuest on Cloud Mine for Next-Generation Sequencing Data



By Kevin Davies

July 29, 2009 | GenomeQuest today announced the launch of GenomeQuest 6.0Beta, a new sequence data management solution that provides a web-accessible, cloud computing environment for researchers to “align and mine” next-generation sequencing data.  

“There’s a lot of interest in the cloud,” says president/CEO Ron Ranauro. “In a sense, GenomeQuest has built the first commercial application-specific cloud for biocomputing.”

Users can access the cloud from any internet-connected client. “Sitting behind all this is a 500-CPU compute farm for processing that’s purpose-built for processing volumes of sequence data.”

“We’ve always had a platform technology, but when I got to the company in 2002, the market wasn’t really ready for another platform,” Ranauro told Bio-IT World.  “The Human Genome Project had crested by then, a lot of pharma, biotech and major academic labs had defined their platform over the preceding five years. What we did, which was a good strategy, was focus our application on … mining genetic sequence data for IP.”

The strategy netted more than 100 customers, including 16 of the 20 big pharmas and several big agricultural science customers, who recognized GenomeQuest as a powerful search engine. “The strategy has always been to create an enterprise sequence data management platform. The question we’d been facing is when would the market be ready?” The launch of the next-gen machines from 454, Illumina and Applied Biosystems in 2006-07 marked what Ranauro calls the “catalytic event for causing the enterprise and academic markets to rethink the way they’re managing sequence data.”

Easy Button

GenomeQuest 6.0Beta is the culmination of efforts to extend GenomeQuest’s core technology into a broad platform capable of managing sequence data from raw FASTA to high-level pathway information. Ranauro explains: “Sequence data is (sic) not structured data, so it doesn’t lend itself to data management strategies that are organized to handle structured data very well. From the beginning, we took a distributed computing dataflow model for managing the unstructured sequence data. That gives us the scalability.”

Using the GenomeQuest Engine to provide scalability, GenomeQuest 6.0 addresses the needs of three key constituencies -- the researcher, the bioinformatician, and the IT manager.

Researchers “don’t care so much about bioinformatics,” says Ranauro. “The early visionary market for next-gen sequencing wants to do everything, but the mainstream market wants ‘the Easy button.’ They also want some flexibility to tune parameters. They’re not interested in managing data but want common workflows.”

GenomeQuest delivers the two largest production workflows: gene expression and variant (SNP) discovery. “Any researcher can self register and upload a file, or use the sample file and start getting results very quickly.”

Bioinformaticians, on the other hand, “have to be able to access the data models and the algorithms through an open API. We’ve put a tremendous investment in exposing the application programming interface at multiple levels. Since it’s a web application, there’s a URL API used to script and access any data or workflow or database in the system. There’s a scripted command line API which most bioinformatics developers will prefer, which also has this very nice property of providing access to data, workflows, results and analytics while hiding the details of the computing and the reference data itself. A bioinformatics [specialist] can use the command line API to focus on the task at hand, and not the specifics of the IT.”

And from the perspective of the IT manager, scalability is critical. “The volumes of these next-gen machines just continues to escalate,” says Ranauro. “A system that won’t scale is going to be a difficult investment to justify.”

Web Gem

Ranauro half-jokingly says GenomeQuest is becoming a web company. Normally, offering researchers a demo requires multiple steps involving a salesman, a web demo, and registering for an account. “Now the researchers can come to the site, self register, use a sample data set or upload their own, run workflows and mine the results.” The available sample data includes donations from Illumina, Life Technologies and 454: metagenomic pathogen data (454), and variant detection workflows and gene expression data (Illumina, Life Technologies).

GenomeQuest 6.0 fits into the next-gen workflow from the generation of the raw data. Ranauro describes the pipeline: “We would pick it up from the raw FASTA files, post image processing – it’s the read and an ID... That file can be uploaded. A multigigabyte file can take half a day. If it’s an even bigger file, they can sneakernet it to us.” (GenomeQuest is currently using “fairly rudimentary” compression, but Ranauro acknowledges “there are better ways of doing it,” and is open to leveraging data-transfer services from companies such as Aspera.)

“The end user is presented with a simple web application where (s)he can select the reference genome… They can also select how much extended annotation they want. Do they want to know if the variants found are novel related to dbSNP? Are the variants falling inside coding regions..? The result file is a sequence database of the assembly which can be mined according to those properties. You might say, give me only the novel SNPs in coding regions of very high quality.

“Being able to mine and filter the results is the secret sauce of the scalable engine. Now the biologists can do this work without needing to be a programmer, through a very simple web application. That’s the contribution we’re making – allowing a broader, mainstream audience to participate fully in next-generation sequencing.”
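The kind of query Ranauro describes (“only the novel SNPs in coding regions of very high quality”) can be illustrated with a minimal sketch. This is not GenomeQuest’s API or data model: the `Variant` record, its field names, the `novel_coding_snps` function and the quality threshold of 50 are all hypothetical, chosen only to show how such a filter combines the annotation properties mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record (field names are assumptions)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    quality: float          # assumed quality score; higher is better
    in_dbsnp: bool          # True if the variant is already listed in dbSNP
    in_coding_region: bool  # True if the variant falls inside a coding region


def novel_coding_snps(variants, min_quality=50.0):
    """Keep single-base substitutions that are novel, coding and high quality."""
    return [
        v for v in variants
        if len(v.ref) == 1 and len(v.alt) == 1  # SNP: one base replaced by one base
        and not v.in_dbsnp                      # novel relative to dbSNP
        and v.in_coding_region
        and v.quality >= min_quality            # illustrative threshold only
    ]


# Made-up example records, one per filtering case.
variants = [
    Variant("chr1", 1000, "A", "G", 60.0, False, True),   # novel, coding, high quality
    Variant("chr1", 2000, "C", "T", 60.0, True, True),    # already in dbSNP
    Variant("chr2", 3000, "G", "A", 20.0, False, True),   # quality too low
    Variant("chr2", 4000, "T", "C", 75.0, False, False),  # outside coding regions
    Variant("chr3", 5000, "AT", "A", 90.0, False, True),  # deletion, not a SNP
]

hits = novel_coding_snps(variants)
for v in hits:
    print(v.chrom, v.pos, v.ref, v.alt, v.quality)
```

In the product as described, the biologist would express the same selection through the web application rather than in code; the sketch only makes the filtering logic concrete.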

Biologists can select and create custom views of the appropriate reference sequence or subsets thereof. “It’s providing data management, but data isn’t really moving around or up and down from the server to the PC. All the manipulation is happening in the cloud but the user is able to manipulate [it].”

The web architecture enables everything to be shared, including workflow, result databases, and selected views on results. “Those can be used as hypothesis drivers for the next set of experiments,” he says.

Upside and Roadmaps

While Ranauro has his sights set on mainstream users, he sees upside elsewhere. “In the fullness of time, a genome center is going to want to get onto the cloud, because they have to lower their costs, just the same as anybody else, to get to the $1000 genome. It might be that GenomeQuest’s platform provides a smoother path onto the cloud than taking all the in-house infrastructure and trying to recreate it on Amazon… We see ourselves providing the on-ramp to the cloud.”

While the GenomeQuest platform currently runs on a homegrown datacenter cloud, Ranauro says, “we’re actively looking at scaling options that might include Amazon. Hosting this on Amazon is a very real possibility, but it’s not currently on our roadmap.”

De novo assembly functionality is on the roadmap, however, for the second half of 2009. “We’ll provide the computational and alignment engine but we’ll rely on the industry for the assembly. There are important assemblers, such as 454 Newbler, today. For short reads, later this year – there we’ll rely probably on Velvet or ABySS.”

Ranauro also sees a rich environment for next-gen software companies such as CLC bio and DNAStar to add value. “Those tools have a very rich feature set. There’s always going to be a researcher that can benefit. The problem we’re solving is, having that data on the PC is having it siloed again and the industry goes back to where it was ten years ago, with silos of data.”

Ranauro says he’s actively looking for feedback from early users. That will go a long way to determining how long the ‘beta’ designation lasts, but he says early users “are loving it.” He continues: “We’re actually giving a very powerful sequence data management capability away for free. You don’t have to do next-gen sequencing to get high value from this web site!”

“This is the only product that can process the data and then mine it using an easy-to-use web-based platform,” says Ranauro. “There’s a reason why the IT industry went from client-server to web-based. It provides centralized management, local control, more of a tractable knowledge engineering environment for an enterprise. We don’t see our customers wanting to move data up and down between PCs and servers or across networks. They really want to have it stored centrally but be able to manipulate it easily. We’re really the only company offering that.

“The ability to align the data and mine the alignments using sequence analysis and annotation simultaneously in a scalable way -- no-one has that!”
