Bacterial metagenomic profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes and do not scale well to larger reference collections. To address this issue, the current project developed FLINT, a metagenomics profiling pipeline built on top of the Apache Spark framework and designed for fast, real-time profiling of metagenomic samples against a large collection of reference genomes. FLINT takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43,552 bacterial genomes from Ensembl. FLINT runs on Amazon's Elastic MapReduce service and can profile 1 million Illumina paired-end reads against over 40K genomes on 64 machines in 67 seconds, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the need to store large quantities of intermediate alignments. (publisher abstract modified)
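The throughput and cost figures in the abstract can be related with some simple arithmetic. The sketch below uses only the numbers stated above (1 million reads in 67 seconds, 55 million reads per hour, 64 machines, $8.00 per cluster-hour); the function names are illustrative, and the derived values are back-of-the-envelope estimates rather than results reported by the authors.

```python
def reads_per_hour(reads: int, seconds: float) -> float:
    """Extrapolate a batch timing to an hourly mapping rate."""
    return reads / seconds * 3600.0


def cost_per_million_reads(hourly_cost_usd: float, rate: float) -> float:
    """Cluster cost in USD for each million reads at a given hourly rate."""
    return hourly_cost_usd / (rate / 1_000_000)


# Figures taken from the abstract.
BATCH_READS = 1_000_000
BATCH_SECONDS = 67
SUSTAINED_RATE = 55_000_000   # reads per hour
MACHINES = 64
HOURLY_COST_USD = 8.00

# Hourly rate implied by the single 1M-read batch timing.
batch_rate = reads_per_hour(BATCH_READS, BATCH_SECONDS)

# Derived estimates at the sustained streaming rate (assumes the
# $8.00 figure covers the whole 64-machine cluster, as stated).
cost_per_million = cost_per_million_reads(HOURLY_COST_USD, SUSTAINED_RATE)
per_machine_rate = SUSTAINED_RATE / MACHINES

print(f"Rate implied by batch timing: {batch_rate:,.0f} reads/hour")
print(f"Cost per million reads:       ${cost_per_million:.3f}")
print(f"Per-machine rate:             {per_machine_rate:,.0f} reads/hour")
```

The batch timing extrapolates to roughly 54 million reads per hour, which is in line with the 55 million reads per hour quoted for streaming, and the stated cluster cost works out to about 15 cents per million reads.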