Similar literature
 Found 20 similar documents; search took 31 ms
1.
Recent advances in high-throughput genome sequencing technologies have enabled the systematic study of various genomes by making whole genome sequencing affordable. Modern sequencers generate a huge number of small sequence fragments called reads, where the read length and the per-base sequencing cost depend on the technology used. To date, many hybrid genome assembly algorithms have been developed that can take reads from multiple read sources to reconstruct the original genome. However, rigorous investigation of the feasibility conditions for complete genome reconstruction and the optimal sequencing strategy for minimizing the sequencing cost has been conspicuously missing. An important aspect of hybrid sequencing and assembly is that the feasibility conditions for genome reconstruction can be satisfied by different combinations of the available read sources, opening up the possibility of optimally combining the sources to minimize the sequencing cost while ensuring accurate genome reconstruction. In this paper, we derive the conditions for whole genome reconstruction from multiple read sources at a given confidence level and also introduce the optimal strategy for combining reads from different sources to minimize the overall sequencing cost. We show that the optimal read set, which simultaneously satisfies the feasibility conditions for genome reconstruction and minimizes the sequencing cost, can be effectively predicted through constrained discrete optimization. Through extensive evaluations based on several genomes and different read sets, we verify the derived feasibility conditions and demonstrate the performance of the proposed optimal hybrid sequencing and assembly strategy.

2.
De novo assembly of bacterial genomes from next-generation sequencing (NGS) data allows a reference-free discovery of single nucleotide polymorphisms (SNPs). However, substantial rates of errors in genomes assembled by this approach remain a major barrier for the reference-free analysis of genome variations in medically important bacteria. The aim of this report was to improve the quality of SNP identification in bacterial genomes without closely related references. We developed a bioinformatics pipeline (SnpFilt) that constructs an assembly using SPAdes and then removes unreliable regions based on the quality and coverage of re-aligned reads at neighbouring regions. The performance of the pipeline was compared against reference-based SNP calling for Illumina HiSeq, MiSeq and NextSeq reads from a range of bacterial pathogens including Salmonella, which is one of the most common causes of food-borne disease. The SnpFilt pipeline removed all false SNPs in all test NGS datasets consisting of paired-end Illumina reads. We also showed that for reliable and complete SNP calls, at least 40-fold coverage is required. Analysis of bacterial isolates associated with epidemiologically confirmed outbreaks using the SnpFilt pipeline produced results consistent with previously published findings. The SnpFilt pipeline improves the quality of de novo assembly and the precision of SNP calling in bacterial genomes by removing regions of the assembly that may contain assembly errors. SnpFilt is available from https://github.com/LanLab/SnpFilt.
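
The 40-fold coverage requirement mentioned above can be checked before an assembly is attempted. The following is a minimal sketch, assuming the usual definition of mean coverage (total sequenced bases divided by genome length); the function names and the example numbers are illustrative and are not part of SnpFilt.

```python
def mean_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Mean fold coverage = total sequenced bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


def meets_threshold(num_reads: int, read_length: int, genome_length: int,
                    min_fold: float = 40.0) -> bool:
    """True if the run reaches the minimum fold coverage (40x per the abstract)."""
    return mean_coverage(num_reads, read_length, genome_length) >= min_fold


if __name__ == "__main__":
    # Hypothetical run: 1.5 million 150 bp reads on a ~4.8 Mb bacterial genome.
    print(mean_coverage(1_500_000, 150, 4_800_000))
    print(meets_threshold(1_500_000, 150, 4_800_000))
```

Mean coverage is only a first check; the pipeline described above additionally inspects local read quality and coverage, which this sketch does not model.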

3.
Metagenomic studies suggest that only a small fraction of the viruses that exist in nature have been identified and studied. Characterization of unknown viral genomes is hindered by the many genomes populating any virus sample. A new method is reported that integrates drop‐based microfluidics and computational analysis to enable the purification of any single viral species from a complex mixed virus sample and the retrieval of complete genome sequences. By using this platform, the genome sequence of a 5243 bp dsDNA virus that was spiked into wastewater was retrieved with greater than 96 % sequence coverage and more than 99.8 % sequence identity. This method holds great potential for virus discovery since it allows enrichment and sequencing of previously undescribed viruses as well as known viruses.

4.
Motivation: Sequencing-based methods to examine fundamental features of the genome, such as gene expression and chromatin structure, rely on inferences from the abundance and distribution of reads derived from Illumina sequencing. Drawing sound inferences from such experiments relies on appropriate mathematical methods to model the distribution of reads along the genome, which has been challenging due to the scale and nature of these data. Results: We propose a new framework (SRSFseq) based on square root slope functions shape analysis to analyse Illumina sequencing data. In the new approach the basic unit of information is the density of mapped reads over a region of interest located on the known reference genome. The densities are interpreted as shapes and a new shape analysis model is proposed. An equivalent of a Fisher test is used to quantify the significance of shape differences in read distribution patterns between groups of density functions in different experimental conditions. We evaluated the performance of this new framework to analyze RNA-seq data at the exon level, which enabled the detection of variation in read distributions and abundances between experimental conditions not detected by other methods. Thus, the method is a suitable supplement to the state-of-the-art count-based techniques. The variety of density representations and flexibility of mathematical design allow the model to be easily adapted to other data types or problems in which the distribution of reads is to be tested. The functional interpretation and SRSF phase-amplitude separation technique give an efficient noise reduction procedure improving the sensitivity and specificity of the method.

5.
6.
The discovery of 5-hydroxymethylcytosine (5hmC) in mammalian genomes is a landmark in epigenomics. Similar to 5-methylcytosine (5mC), 5hmC is viewed as a critical epigenetic modification. Deciphering the functions of 5hmC necessitates the location analysis of 5hmC in genomes. Here, we propose an engineered deaminase-mediated sequencing (EDM-seq) method for the quantitative detection of 5hmC in DNA at single-nucleotide resolution. This method capitalizes on the engineered human apolipoprotein B mRNA-editing catalytic polypeptide-like 3A (A3A) protein to produce differential deamination activity toward cytosine, 5mC, and 5hmC. In EDM-seq, the engineered A3A (eA3A) protein can deaminate C and 5mC but not 5hmC. The original C and 5mC in DNA are deaminated by eA3A to form U and T, both of which are read as T during sequencing, while 5hmC is resistant to deamination by eA3A and is still read as C during sequencing. Therefore, the remaining C in the sequence manifests the original 5hmC. By EDM-seq, we achieved the quantitative detection of 5hmC in genomic DNA of lung cancer tissue. The EDM-seq method is bisulfite-free and does not require DNA glycosylation or chemical treatment, offering a valuable tool for the straightforward and quantitative detection of 5hmC in DNA at single-nucleotide resolution.

In EDM-seq, the original C and 5mC in DNA are deaminated by eA3A to form U and T, both of which are read as T during sequencing, whereas 5hmC is resistant to deamination by eA3A and is still read as C during sequencing.
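
The read-out rule stated above (C and 5mC are read as T, 5hmC is read as C) suggests a simple per-site estimate. The sketch below assumes gapless reads already aligned to the reference and estimates the 5hmC level at each reference cytosine as C / (C + T); the function name and input layout are hypothetical, and no correction for incomplete deamination is attempted.

```python
def hmc_levels(reference: str, aligned_reads: list[str]) -> dict[int, float]:
    """Estimate the 5hmC level at every reference cytosine.

    Assumes each read has the same length as the reference and is aligned
    without gaps (a simplification for illustration).
    """
    levels = {}
    for pos, ref_base in enumerate(reference):
        if ref_base != "C":
            continue
        c_count = sum(1 for read in aligned_reads if read[pos] == "C")  # protected: 5hmC
        t_count = sum(1 for read in aligned_reads if read[pos] == "T")  # deaminated: C or 5mC
        if c_count + t_count > 0:
            levels[pos] = c_count / (c_count + t_count)
    return levels


if __name__ == "__main__":
    reference = "ACGTC"
    reads = ["ACGTT", "ATGTT", "ACGTT", "ATGTT"]
    print(hmc_levels(reference, reads))
```

In this toy input, half of the reads keep C at position 1 and none keep C at position 4, so the sites are estimated as half hydroxymethylated and unmodified (or merely methylated), respectively.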

7.
Cytosine methylation is one of the most important RNA epigenetic modifications. With the development of experimental technology, RNA cytosine methylation has received growing attention, and bisulfite sequencing has proved to be an effective experimental method for studying it. However, only a few tools can directly and efficiently handle RNA bisulfite sequencing data. Herein, we developed a specialized tool, BS-RNA, which can analyze cytosine methylation of RNA based on bisulfite sequencing data and supports both paired-end and single-end sequencing reads from directional bisulfite libraries. For paired-end reads, simply removing the biased positions from the 5′ end may result in “dovetailing” reads, where one or both reads seem to extend past the start of the mate read. BS-RNA can map “dovetailing” reads successfully. The annotation result of BS-RNA is exported in BED (.bed) format, including locations, sequence context types (CG/CHG/CHH, H = A, T, or C), reference sequencing depths, cytosine sequencing depths, and methylation levels of covered cytosine sites on both Watson and Crick strands. BS-RNA is an efficient, specialized and highly automated mapping and annotation tool for RNA bisulfite sequencing data. It performs better than the existing program in terms of accuracy and efficiency. BS-RNA is written in Perl, and the source code of this tool is freely available from the website: http://bs-rna.big.ac.cn.
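
The context types and methylation levels listed above can be illustrated with a short sketch. It is written in Python for illustration (BS-RNA itself is written in Perl) and assumes that the methylation level of a site is the cytosine depth divided by the total depth covering that site; both function names are hypothetical.

```python
from typing import Optional

H_BASES = "ATC"  # H = A, T, or C, as defined in the abstract


def cytosine_context(sequence: str, pos: int) -> Optional[str]:
    """Classify the cytosine at `pos` as CG, CHG, or CHH.

    Returns None when too few downstream bases are available or when an
    unexpected symbol (for example N) follows the cytosine.
    """
    sequence = sequence.upper()
    if sequence[pos] != "C":
        raise ValueError("position does not hold a cytosine")
    downstream = sequence[pos + 1:pos + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2 or downstream[0] not in H_BASES:
        return None
    if downstream[1] == "G":
        return "CHG"
    if downstream[1] in H_BASES:
        return "CHH"
    return None


def methylation_level(cytosine_depth: int, total_depth: int) -> float:
    """Fraction of reads that still show C at the site after bisulfite treatment."""
    if total_depth <= 0:
        raise ValueError("total_depth must be positive")
    return cytosine_depth / total_depth


if __name__ == "__main__":
    print(cytosine_context("ACAGT", 1), methylation_level(3, 12))
```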

8.
A number of different approaches have been described to identify proteins from tandem mass spectrometry (MS/MS) data. The most common approaches rely on the available databases to match experimental MS/MS data. These methods suffer from several drawbacks and cannot be used for the identification of proteins from unknown genomes. In this communication, we describe a new de novo sequencing software package, PEAKS, to extract amino acid sequence information without the use of databases. PEAKS uses a new model and a new algorithm to efficiently compute the best peptide sequences whose fragment ions can best interpret the peaks in the MS/MS spectrum. The output of the software gives amino acid sequences with confidence scores for the entire sequences, as well as an additional novel positional scoring scheme for portions of the sequences. The performance of PEAKS is compared with Lutefisk, a well-known de novo sequencing software package, using quadrupole-time-of-flight (Q-TOF) data obtained for several tryptic peptides from standard proteins.

9.
The recent introduction of polymerase chain reaction (PCR)-massively parallel sequencing (MPS) technologies in forensics has changed the approach to allelic short tandem repeat (STR) typing because sequencing cloned PCR fragments enables alleles with identical molecular weights to be distinguished based on their nucleotide sequences. Therefore, because PCR fidelity mainly depends on template integrity, new technical issues could arise in the interpretation of the results obtained from degraded samples. In this work, a set of DNA samples degraded in vitro was used to investigate whether PCR-MPS could generate “isometric drop-ins” (IDIs; i.e., molecular products having the same length as the original allele but with a different nucleotide sequence within the repeated units). The Precision ID GlobalFiler NGS STR panel kit was used to analyze 0.5 and 1 ng of mock samples in duplicate tests (for a total of 16 PCR-MPS analyses). As expected, several well-known PCR artifacts (such as allelic dropout and stutters above the threshold) were scored; 95 IDIs with an average occurrence of 5.9 IDIs per test (min: 1, max: 11) were scored as well. In total, IDIs represented one of the most frequent artifacts. The coverage of these IDIs reached up to 981 reads (median: 239 reads), and the ratios with the coverage of the original allele ranged from 0.069 to 7.285 (median: 0.221). In addition, approximately 5.2% of the IDIs showed coverage higher than that of the original allele. Molecular analysis of these artifacts showed that they were generated in 96.8% of cases through a single nucleotide change event, with the C > T transition being the most frequent (85.7%). Thus, in a forensic evaluation of evidence, IDIs may represent an actual issue, particularly when DNA mixtures need to be interpreted because they could mislead the operator regarding the number of contributors.
Overall, the molecular features of the IDIs described in this work, as well as the performance of duplicate tests, may be useful tools for managing this new class of artifacts otherwise not detected by capillary electrophoresis technology.
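
The definition of an isometric drop-in given above (same length as the original allele, different nucleotide sequence) translates directly into a filter over the sequences observed at a locus. The following is a minimal sketch under that definition; the function name, the input layout, and the toy sequences and read counts are hypothetical and do not come from the study.

```python
def find_isometric_dropins(allele: str, allele_reads: int,
                           observed: dict[str, int]) -> list[dict]:
    """Return observed sequences that match the allele in length but not in sequence.

    Each hit reports its read count, its coverage ratio relative to the
    original allele, and the positions where the two sequences differ.
    """
    if allele_reads <= 0:
        raise ValueError("allele_reads must be positive")
    hits = []
    for sequence, reads in observed.items():
        if sequence == allele or len(sequence) != len(allele):
            continue  # the allele itself, or a length variant such as a stutter
        changes = [(i, a, b) for i, (a, b) in enumerate(zip(allele, sequence)) if a != b]
        hits.append({
            "sequence": sequence,
            "reads": reads,
            "ratio": reads / allele_reads,
            "changes": changes,
        })
    return hits


if __name__ == "__main__":
    allele = "TCTATCTATCTA"
    observed = {
        "TCTATCTATCTA": 1000,  # original allele
        "TCTATTTATCTA": 221,   # same length, one C > T change
        "TCTATCTA": 90,        # shorter product, not isometric
    }
    for hit in find_isometric_dropins(allele, 1000, observed):
        print(hit)
```

Running the same filter on both duplicate tests and keeping only sequences seen in both is one way the duplicate-test idea mentioned above could be applied, although the abstract does not specify the exact procedure.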

10.
With the emergence of new viral infections and pandemics, there is a need to develop faster methods to unravel the virus identities in a large number of clinical samples. This report describes a virus identification method featuring high throughput, high resolution, and high sensitivity detection of viruses. Virus identification is based on liquid hybridization of virus-specific probes of different lengths to their corresponding viruses. The probes bound to target sequences are removed by a biotin–streptavidin pull-down mechanism and the supernatant is analyzed by capillary electrophoresis. The probes depleted from the sample appear as diminished peaks in the electropherograms and the remaining probes serve as calibrators to align peaks in different capillaries. The virus identities are unraveled by a signal processing and peak detection algorithm developed in-house. Nine viruses were used in the study to demonstrate how the system works to unravel the virus identity in single and double virus infections. With properly designed probes, the system is able to distinguish closely related viruses. The system takes advantage of the high resolution of capillary electrophoresis to resolve probes that differ in length. The method may facilitate virus identity screening from more candidate viruses with an automated 4-color DNA sequencer.

11.
Today, we can read human genomes and store digital data robustly in synthetic DNA. Herein, we report a strategy to intertwine these two technologies to enable the secure storage of valuable information in synthetic DNA, protected with personalized keys. We show that genetic short tandem repeats (STRs) contain sufficient entropy to generate strong encryption keys, and that only one technology, DNA sequencing, is required to simultaneously read the key and the data. Using this approach, we experimentally generated 80 bit strong keys from human DNA, and used such a key to encrypt 17 kB of digital information stored in synthetic DNA. Finally, the decrypted information was recovered perfectly from a single massively parallel sequencing run.

13.
A multiple-primer DNA sequencing approach suitable for genotyping, detection and identification of microorganisms and viruses has been developed. In this new method two or more sequencing primers, combined in a pool, are added to a DNA sample of interest. The oligonucleotide that hybridizes to the DNA sample will function as a primer during the subsequent DNA sequencing procedure. This strategy is suited for selective detection and genotyping of relevant microorganisms and samples harboring different DNA targets such as multiple variant/infected samples as well as unspecific amplification products. This method is used here in a model system for detection and typing of high-risk oncogenic human papilloma viruses (HPVs) in samples containing multiple infections/variants or unspecific amplification products. Type-specific sequencing primers were designed for four of the most oncogenic (high-risk) HPV types (HPV-16, HPV-18, HPV-33, and HPV-45). The primers were combined and added to a sample containing a mixture of one high-risk (16, 18, 33, or 45) and one or two low-risk types. The DNA samples were sequenced by the Pyrosequencing technology and the Sanger dideoxy sequencing method. Correct genotyping was achieved in all tested combinations. This multiple-sequencing primer approach also improved the sequence data quality for samples containing unspecific amplification products. The new strategy is highly suitable for diagnostic typing of relevant species/genotypes of microorganisms.

14.
Bacteria and other living organisms offer a potentially unlimited resource for the discovery of new chemical catalysts, but many interesting reaction phenotypes observed at the whole organism level remain difficult to elucidate down to the molecular level. A key challenge in the discovery process is the identification of discrete molecular players involved in complex biological transformations because multiple cryptic genetic components often work in concert to elicit an overall chemical phenotype. We now report a rapid pipeline for the discovery of new enzymes of interest from unsequenced bacterial hosts based on laboratory-scale methods for the de novo assembly of bacterial genome sequences using short reads. We have applied this approach to the biomass-degrading soil bacterium Amycolatopsis sp. 75iv2 ATCC 39116 (formerly Streptomyces setonii and S. griseus 75vi2) to discover and biochemically characterize two new heme proteins comprising the most abundant members of the extracellular oxidative system under lignin-reactive growth conditions.

15.
The detection of viruses is of interest for a number of fields including biomedicine, environmental science, and biosecurity. Of particular interest are methods that do not require expensive equipment or trained personnel, especially if the results can be read by the naked eye. A new “double imprinting” method was developed whereby a virus‐bioimprinted hydrogel is further micromolded into a diffraction grating sensor by using imprint‐lithography techniques to give a “Molecularly Imprinted Polymer Gel Laser Diffraction Sensor” (MIP‐GLaDiS). A simple laser transmission apparatus was used to measure diffraction, and the results can be read by the naked eye to detect the Apple Stem Pitting Virus (ASPV) at concentrations as low as 10 ng mL−1, thus setting the limit of detection of these hydrogels as low as that of other antigen‐binding methods such as ELISA or fluorescence‐tag systems.

16.
A new strategy is described for the determination of amino acid sequences of unknown peptides. Different from the well-known but often inefficient de novo sequencing approach, the new method is based on a two-step process. In the first step the amino acid composition of an unknown peptide is determined on the basis of accurate mass values of the peptide precursor ion and a small number of accurate fragment ion mass values, and, as in de novo sequencing, without employing protein database information or other pre-information. In the second step the sequence of the found amino acids of the peptide is determined by scoring the agreement between expected and observed fragment ion signals of the permuted sequences. It was found that the new approach is highly efficient if accurate mass values are available and that it easily outstrips common approaches of de novo sequencing that are based on lower accuracies and detailed knowledge of fragmentation behavior. Simple permutation and calculation of all possible amino acid sequences, however, is only efficient if the composition is known or if possible compositions are at least reduced to a small list. The latter requires the highest possible instrumental mass accuracy, which is currently provided only by Fourier transform ion cyclotron resonance mass spectrometry. The connection between mass accuracy and peptide composition variability is described and an example of peptide compositioning and composition-based sequencing is presented.
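
The second step described above (permuting a known composition and scoring each candidate against the observed fragment ions) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it covers only singly charged b and y ions, uses standard monoisotopic residue masses for a five-residue subset, and takes the composition as already known; the function names and the 0.005 Da tolerance are hypothetical.

```python
from itertools import permutations

# Standard monoisotopic residue masses (Da) for a small subset of amino acids.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "K": 128.09496}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(sequence: str) -> list[float]:
    """Singly charged b and y ion m/z values for every backbone cleavage."""
    masses = [RESIDUE_MASS[residue] for residue in sequence]
    ions = []
    for i in range(1, len(sequence)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion: N-terminal part
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion: C-terminal part
    return ions


def score(sequence: str, observed: list[float], tolerance: float = 0.005) -> int:
    """Number of observed peaks explained by the candidate's b or y ions."""
    expected = fragment_ions(sequence)
    return sum(1 for peak in observed
               if any(abs(peak - ion) <= tolerance for ion in expected))


def rank_sequences(composition: str, observed: list[float],
                   tolerance: float = 0.005) -> list[tuple[int, str]]:
    """Score every distinct permutation of the composition, best first."""
    candidates = {"".join(p) for p in permutations(composition)}
    scored = [(score(c, observed, tolerance), c) for c in candidates]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


if __name__ == "__main__":
    observed_peaks = fragment_ions("GASVK")  # simulated, noise-free spectrum
    for points, candidate in rank_sequences("AGKSV", observed_peaks)[:3]:
        print(points, candidate)
```

The number of permutations grows factorially with peptide length, which is why the abstract stresses that the composition must be known, or narrowed to a short list, before this step becomes practical.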

17.
Fagerquist CK, Yee E, Miller WG. The Analyst, 2007, 132(10): 1010-1023
Protein biomarkers observed in the matrix-assisted laser desorption/ionization time-of-flight mass spectra (MALDI-TOF-MS) of cell lysates of three strains of Campylobacter coli, two strains of C. lari and one strain of C. concisus have been identified by 'bottom-up' proteomic techniques. The significant findings are as follows. First, the protein biomarkers identified were: PhnA-related protein, 4-oxalocrotonate tautomerase (DmpI)-related protein, NifU-like protein, cytochrome c, DNA-binding protein HU, 10 kDa chaperonin, thioredoxin, as well as several conserved hypothetical and ribosomal proteins. Second, variations in the biomarker ion m/z in MALDI-TOF-MS spectra across species and strains are the result of variations in the amino acid sequence of the protein due to non-synonymous mutations of the biomarker gene. Third, the most common post-translational modifications (PTMs) were the removal of the N-terminal methionine and N-terminal signal peptides. However, in the case of the NifU protein (an iron-sulfur cluster transport protein), post-translational cleavage occurred from the C-terminus. Fourth, only the genomes of the C. coli strain RM2228 and C. lari strain RM2100 have been sequenced; thus, proteomic identification of the proteins of the other strains in this study relied upon sequence homology to the genomic sequence of these strains as well as the genomic sequences of other Campylobacter strains. In some cases, the determination of the full amino acid sequence of a protein biomarker from a genomically non-sequenced strain was accomplished by combining non-overlapping partial sequences from proteomic identifications of genomically-sequenced strains that were of the same species (or of a different species) to that of the non-sequenced strain. The accuracy of this composite sequence was confirmed by both MS and MS/MS.
It was necessary, in some cases, to perform de novo sequencing on 'gaps' in the composite sequence that were not homologous to any genomically-sequenced strain. In order to validate the composite sequence approach, composite sequences were further confirmed by subsequent DNA sequencing of the biomarker gene. Thus, using the composite sequence approach, it was possible to determine the full amino acid sequence of an unknown protein from a genomically non-sequenced bacterial strain without the necessity of either sequencing the biomarker gene or performing full de novo MS/MS sequencing. The sequence obtained could then be used as a strain-specific biomarker for analysis by 'top-down' proteomics techniques.

18.
The fast Fourier transform (FFT) sampling algorithm has been used with success in application to protein‐protein docking and for protein mapping, the latter docking a variety of small organic molecules for the identification of binding hot spots on the target protein. Here we explore the local rather than global usage of the FFT sampling approach in docking applications. If the global FFT‐based search yields a near‐native cluster of docked structures for a protein complex, then focused resampling of the cluster generally leads to a substantial increase in the number of conformations close to the native structure. In protein mapping, focused resampling of the selected hot spot regions generally reveals further hot spots that, while not as strong as the primary hot spots, also contribute to ligand binding. The detection of additional ligand binding regions is shown by the improved overlap between hot spots and bound ligands. © 2016 Wiley Periodicals, Inc.

19.
Viruses are normally defined as pathogens and have a bad reputation because of pandemics such as Influenza, HIV/AIDS, Ebola, and SARS. Most viruses are, however, not enemies or killers but play important roles in the origin, development and maintenance of life of all species on our planet. This is new information that we have learned through new technologies such as sequencing. Viruses are the most successful species on Earth; they are ubiquitous: in the oceans, in our environment, in animals, plants, bacteria, up in the air, perhaps even in the universe, within our body and even as part of our genomes. They influence our health, our well‐being, mental properties, and our gut microbiota, including obesity, and may help to cope with multi‐drug‐resistant bacteria. Here the phages, viruses of bacteria, raise hopes. Viruses built our immunity: viruses protect against viruses. We do not have to lay eggs – thanks to viruses! They are the drivers of evolution and adaptation to environmental changes, e.g. in plankton. The success story of viruses started about 3.5 billion years ago when life began. Newly discovered giant viruses are almost bacteria in their composition, suggesting that the borderline between dead matter and life is continuous. There are many open questions – how did life begin, is there life on exoplanets, how to find it? Are virus‐like elements, viroids, important for the origin of life? Will viruses eliminate mankind [1]?

20.
Drug repurposing, the practice of utilizing existing drugs for novel clinical indications, has tremendous potential for improving human health outcomes and increasing therapeutic development efficiency. The goal of multi-disease multitarget drug repurposing, also known as shotgun drug repurposing, is to develop platforms that assess the therapeutic potential of each existing drug for every clinical indication. Our Computational Analysis of Novel Drug Opportunities (CANDO) platform for shotgun multitarget repurposing implements several pipelines for the large-scale modeling and simulation of interactions between comprehensive libraries of drugs/compounds and protein structures. In these pipelines, each drug is described by an interaction signature that is compared to all other signatures that are subsequently sorted and ranked based on similarity. Pipelines within the platform are benchmarked based on their ability to recover known drugs for all indications in our library, and predictions are generated based on the hypothesis that (novel) drugs with similar signatures may be repurposed for the same indication(s). The drug-protein interactions used to create the drug-proteome signatures may be determined by any screening or docking method, but the primary approach used thus far has been BANDOCK, our in-house bioanalytical or similarity docking protocol. In this study, we calculated drug-proteome interaction signatures using the publicly available molecular docking method AutoDock Vina and created hybrid decision tree pipelines that combined our original bio- and chem-informatic approach with the goal of assessing and benchmarking their drug repurposing capabilities and performance.
The hybrid decision tree pipeline outperformed the two docking-based pipelines from which it was synthesized, yielding an average indication accuracy of 13.3% at the top10 cutoff (the most stringent), relative to 10.9% and 7.1% for its constituent pipelines, and a random control accuracy of 2.2%. We demonstrate that docking-based virtual screening pipelines have unique performance characteristics and that the CANDO shotgun repurposing paradigm is not dependent on a specific docking method. Our results also provide further evidence that multiple CANDO pipelines can be synthesized to enhance drug repurposing predictive capability relative to their constituent pipelines. Overall, this study indicates that pipelines consisting of varied docking-based signature generation methods can capture unique and useful signals for accurate comparison of drug-proteome interaction signatures, leading to improvements in the benchmarking and predictive performance of the CANDO shotgun drug repurposing platform.
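
The signature comparison described above (each drug's interaction signature is compared with all others, and the results are sorted by similarity) can be sketched in a few lines. The abstract does not state which similarity measure is used, so cosine similarity is an assumption made here for illustration; the function names, drug names, and scores are likewise hypothetical and are not the CANDO API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length signatures."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query: str, signatures: dict[str, list[float]],
                 top: int = 10) -> list[str]:
    """Names of the `top` drugs whose signatures are most similar to the query's."""
    query_signature = signatures[query]
    scored = [(cosine_similarity(query_signature, signature), name)
              for name, signature in signatures.items() if name != query]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top]]


if __name__ == "__main__":
    # One interaction score per protein; three proteins in this toy proteome.
    toy_signatures = {
        "drugA": [0.9, 0.1, 0.0],
        "drugB": [0.8, 0.2, 0.1],
        "drugC": [0.0, 0.1, 0.9],
        "drugD": [0.5, 0.5, 0.5],
    }
    print(rank_similar("drugA", toy_signatures))
```

Under the repurposing hypothesis stated above, the indications of the top-ranked neighbours would then be proposed as candidate indications for the query drug; the top10 cutoff corresponds to `top=10` in this sketch.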
