SnpFilt: A pipeline for reference-free assembly-based identification of SNPs in bacterial genomes
Affiliation:
1. School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales, 2052, Australia; 2. Centre for Infectious Diseases and Microbiology–Public Health, Institute of Clinical Pathology and Medical Research, Westmead Hospital, New South Wales, Australia; 3. Marie Bashir Institute for Infectious Diseases and Biosecurity, The University of Sydney, New South Wales, Australia
Abstract:
De novo assembly of bacterial genomes from next-generation sequencing (NGS) data allows reference-free discovery of single nucleotide polymorphisms (SNPs). However, substantial error rates in genomes assembled by this approach remain a major barrier to the reference-free analysis of genome variation in medically important bacteria. The aim of this report was to improve the quality of SNP identification in bacterial genomes that lack closely related references. We developed a bioinformatics pipeline (SnpFilt) that constructs an assembly using SPAdes and then removes unreliable regions based on the quality and coverage of re-aligned reads at neighbouring regions. The performance of the pipeline was compared against reference-based SNP calling for Illumina HiSeq, MiSeq and NextSeq reads from a range of bacterial pathogens including Salmonella, which is one of the most common causes of food-borne disease. The SnpFilt pipeline removed all false SNPs in all test NGS datasets consisting of paired-end Illumina reads. We also showed that at least 40-fold coverage is required for reliable and complete SNP calls. Analysis of bacterial isolates associated with epidemiologically confirmed outbreaks using the SnpFilt pipeline produced results consistent with previously published findings. The SnpFilt pipeline improves the quality of de novo assembly and the precision of SNP calling in bacterial genomes by removing regions of the assembly that may contain assembly errors. SnpFilt is available from https://github.com/LanLab/SnpFilt.
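The filtering idea described in the abstract can be illustrated with a minimal sketch. This is not SnpFilt's implementation: the function names, the thresholds (`min_cov`, `min_qual`) and the `flank` width are hypothetical values chosen for illustration, and the per-base coverage and quality lists are assumed to have been computed already from reads re-aligned to the assembly. Only the 40-fold minimum coverage figure comes from the abstract.

```python
from typing import List, Tuple


def fold_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Estimate average sequencing depth as total bases / genome size."""
    return n_reads * read_length / genome_size


def unreliable_regions(
    coverage: List[int],
    quality: List[float],
    min_cov: int = 10,       # hypothetical threshold, not from the paper
    min_qual: float = 20.0,  # hypothetical threshold, not from the paper
    flank: int = 2,          # hypothetical neighbourhood width
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of the assembly to mask.

    A position is masked if it, or any position within `flank` bases of it,
    has low coverage or low mean base quality among the re-aligned reads.
    """
    n = len(coverage)
    masked = [False] * n
    for i, (cov, qual) in enumerate(zip(coverage, quality)):
        if cov < min_cov or qual < min_qual:
            # Extend the mask to the neighbouring positions as well.
            for j in range(max(0, i - flank), min(n, i + flank + 1)):
                masked[j] = True

    # Collapse consecutive masked positions into intervals.
    regions: List[Tuple[int, int]] = []
    start = None
    for i, is_masked in enumerate(masked):
        if is_masked and start is None:
            start = i
        elif not is_masked and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, n))
    return regions


def filter_snps(positions: List[int], regions: List[Tuple[int, int]]) -> List[int]:
    """Keep only SNP positions that fall outside every masked interval."""
    return [p for p in positions if not any(s <= p < e for s, e in regions)]


if __name__ == "__main__":
    cov = [30] * 10
    cov[5] = 2                 # a single poorly covered base
    qual = [35.0] * 10
    regions = unreliable_regions(cov, qual)
    print(regions)
    print(filter_snps([1, 4, 9], regions))
    # The 40-fold minimum is the figure reported in the abstract.
    print(fold_coverage(1_000_000, 250, 5_000_000) >= 40)
```

In this toy example a single low-coverage base causes its flanking positions to be discarded too, which mirrors the abstract's emphasis on evidence from neighbouring regions rather than from the SNP site alone.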