Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes and do not scale well to larger reference genome collections. To address this issue, the project developed FLINT, a metagenomics profiling pipeline that is built on top of the Apache Spark framework and is designed for fast, real-time profiling of metagenomic samples against a large collection of reference genomes. FLINT takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43,552 bacterial genomes from Ensembl. FLINT runs on Amazon's Elastic MapReduce service and is able to profile 1 million Illumina paired-end reads against over 40K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the need to store large quantities of intermediate alignments. (publisher abstract modified)
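The throughput and cost figures quoted above can be cross-checked with simple arithmetic. The sketch below is only an illustration and is not part of FLINT: the function names are hypothetical, and it assumes that "1 million reads in 67 s" and "55 million reads per hour" count reads in the same way.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# The helper names are hypothetical; they are not part of FLINT.

def reads_per_hour(reads: int, seconds: float) -> float:
    """Convert a batch result (reads mapped in a given time) to an hourly rate."""
    return reads / seconds * 3600.0


def cost_per_million_reads(hourly_cost_usd: float, rate: float) -> float:
    """Cluster cost in USD for each million reads mapped at the given hourly rate."""
    return hourly_cost_usd / (rate / 1_000_000)


# Figures taken from the abstract.
batch_rate = reads_per_hour(1_000_000, 67)                 # 1 million reads in 67 s
streaming_cost = cost_per_million_reads(8.00, 55_000_000)  # $8.00/hour at 55M reads/hour

print(f"Batch result implies about {batch_rate / 1e6:.1f} million reads per hour")
print(f"Streaming cost is about ${streaming_cost:.3f} per million reads")
```

Under these assumptions, the single-batch result implies a rate slightly below 54 million reads per hour, which is in line with the sustained streaming rate of 55 million reads per hour, and the cluster cost works out to roughly 15 cents per million reads.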
Similar Publications
- Just Science Podcast: Just Using Inadvertently Photographed Ridge Detail as Evidence
- Transient Hypoxia Drives Soil Microbial Community Dynamics and Biogeochemistry During Human Decomposition
- Introducing the NIJ Forensic Intelligence Framework: Pillars and Guiding Principles for Successful Implementation