Case Study: Validating GenomeNext's Churchill with the 1,000 Genomes Project
- genomenext
- Nov 22, 2014
The 1000 Genomes Project was the first project to sequence the genomes of a large number of people and to provide a comprehensive resource on human genetic variation. Its goal was to find most genetic variants with frequencies of at least 1% in the populations studied. To demonstrate the utility of the Churchill analysis pipeline for population-scale genomic analysis, 1,088 low-coverage whole-genome samples from “phase 1” of the 1000 Genomes Project (1KG) were processed from FASTQ to a single multi-sample VCF in 7 days using 400 Amazon EC2 instances (cc2.8xlarge spot instances). The total analysis cost was approximately $12,000, inclusive of data storage and processing.
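To put those run figures in perspective, here is a rough back-of-envelope sketch in Python. It uses only the numbers quoted above. The function name `run_summary` and the assumption that all 400 instances ran for the full 7 days are illustrative and do not come from the original analysis, so the instance-hour figure is an upper bound and the hourly cost is a lower bound.

```python
# Figures quoted in the post.
SAMPLES = 1_088
DAYS = 7
INSTANCES = 400
TOTAL_COST_USD = 12_000  # approximate; includes storage and processing


def run_summary(samples, days, instances, total_cost_usd):
    """Return back-of-envelope throughput and cost figures (hypothetical helper)."""
    # ASSUMPTION: every instance ran for the whole period, so this is an upper bound.
    max_instance_hours = instances * days * 24
    return {
        "cost_per_sample_usd": total_cost_usd / samples,
        "samples_per_day": samples / days,
        "max_instance_hours": max_instance_hours,
        # Lower bound on the blended hourly cost, given the upper bound above.
        "blended_cost_per_instance_hour_usd": total_cost_usd / max_instance_hours,
    }


summary = run_summary(SAMPLES, DAYS, INSTANCES, TOTAL_COST_USD)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the run works out to roughly $11 per sample and about 155 samples per day.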
The analysis pipeline identified 41.2M genetic variants versus 1KG’s 39.7M. The two call sets had 34.4M variant sites in common, of which 34.3M had the same minor allele with highly similar frequencies. The results were validated against previously identified variants (dbSNP Build 138, excluding those from the 1KG submission). SNP validation rates were similar: 52.8% (GenomeNext) and 52.4% (1KG). However, due to improvements in indel calling since the original 1KG analysis, the analysis pipeline called three-fold more indels with a higher rate of validation (19.5% vs. 12.5%). Indels unique to our analysis pipeline validated at a 7-fold higher rate than those unique to 1KG. Of the indels in the Genome in a Bottle (GIAB) consortium’s validated indel dataset [13], 81.5% were observed in the “Churchill” analysis, in contrast to 43.9% in the 1KG analysis. Our analysis pipeline called ~71% of the 99,895 novel validated indels in the GIAB NA12878 dataset (those not found in the 1KG analysis), with alternate allele frequencies as high as 100% (mean 40.2%).
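The overlap between the two call sets can be summarized with simple set arithmetic. The sketch below is illustrative only: `overlap_stats` is a hypothetical helper rather than part of Churchill or the 1KG tooling, and it works from the rounded site counts quoted above, not from the actual VCF files.

```python
def overlap_stats(n_a, n_b, n_shared):
    """Summarize the overlap of two call sets from their site counts."""
    unique_a = n_a - n_shared
    unique_b = n_b - n_shared
    union = n_a + n_b - n_shared
    return {
        "unique_a": unique_a,
        "unique_b": unique_b,
        "fraction_of_a_shared": n_shared / n_a,
        "fraction_of_b_shared": n_shared / n_b,
        "jaccard": n_shared / union,  # shared sites / all distinct sites
    }


# Rounded counts from the text: pipeline call set, 1KG call set, shared sites.
PIPELINE_SITES = 41_200_000
KG_SITES = 39_700_000
SHARED_SITES = 34_400_000
SAME_MINOR_ALLELE = 34_300_000

stats = overlap_stats(PIPELINE_SITES, KG_SITES, SHARED_SITES)
minor_allele_agreement = SAME_MINOR_ALLELE / SHARED_SITES

for name, value in stats.items():
    print(f"{name}: {value:,.3f}" if isinstance(value, float) else f"{name}: {value:,}")
print(f"minor_allele_agreement: {minor_allele_agreement:.3f}")
```

From these rounded counts, about 6.8M sites are unique to the pipeline call set and about 5.3M are unique to 1KG, and the minor allele agrees at roughly 99.7% of the shared sites.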
In summary:
- The publicly available 1,000 sequenced genomes had never before been analyzed through a single analytical pipeline.
- The genomes were originally analyzed by organizations that used different techniques, hardware, and analysis solutions. The resulting analysis was provided to the community and is currently used as the baseline (“Gold Standard”) against which genome analyses are compared in order to make medical determinations and scientific discoveries.
- We estimate that the original analysis of the 1,000 sequenced genomes cost over $100M and took years to conduct.
- When Nationwide Children’s Hospital (NCH) used Churchill to perform the analysis on AWS, they found that the existing analysis of the 1,000 genomes was extremely inaccurate, and they discovered 30,000 new “variants”.
- Our analysis pipeline completed the analysis in 7 days, making it the fastest and least expensive analysis of 1,000 genomes to date.
- This was the first time the 1,000 genomes were analyzed through a single analytical pipeline that is accurate, deterministic, and 100% reproducible.
What it means: All experiments and medical research previously based on the “Gold Standard” for genomic experiments and medical determinations are now open to question in light of the accuracy achieved by the GenomeNext analysis pipeline. Additionally, the analysis of the 1,000 sequenced genomes performed by our analysis pipeline could become the sample baseline for the world.






























