Genetic stability of genome-scale deoptimized RNA virus vaccine candidates under selective pressure |
Abstract |
Recoding viral genomes by numerous synonymous substitutions provided live attenuated vaccine candidates predicted to have a low risk of reversion. However, their stability under selective pressure was largely unknown. We evaluated the phenotypic reversion of representative genome-scale deoptimized human respiratory syncytial virus (RSV) vaccine candidates in the context of strong selective pressure. We found that a virus bearing a deoptimized L-polymerase ORF evolved to escape temperature sensitivity restriction by mutations in L and multiple other proteins. Additional analysis revealed that single mutations in the M2-1 ORF were able to substantially escape the restriction imposed by the deoptimized polymerase. Based on this information, we generated a stable deoptimized RSV vaccine candidate with improved attenuation and immunogenicity suitable for additional development.
Reference |
Download |
Multiplexed highly-accurate DNA sequencing of closely-related HIV-1 variants using continuous long reads from single molecule, real-time sequencing |
Abstract |
Single Molecule, Real-Time (SMRT®) Sequencing (Pacific Biosciences, Menlo Park, CA, USA) provides the longest continuous DNA sequencing reads currently available. However, the relatively high error rate in the raw read data requires novel analysis methods to deconvolute sequences derived from complex samples. Here, we present a workflow of novel computer algorithms able to reconstruct viral variant genomes present in mixtures with an accuracy of >QV50. This approach relies exclusively on Continuous Long Reads (CLR), which are the raw reads generated during SMRT Sequencing. We successfully implement this workflow for simultaneous sequencing of mixtures containing up to forty different >9 kb HIV-1 full genomes. This was achieved using a single SMRT Cell for each mixture and desktop computing power. This novel approach opens the possibility of solving complex sequencing tasks that currently lack a solution.
Reference |
Download |
Online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4787755/
Dynamic regulation of HIV-1 mRNA populations analyzed by single molecule enrichment and long read sequencing |
Abstract |
Alternative RNA splicing greatly expands the repertoire of proteins encoded by genomes. Next-generation sequencing (NGS) is attractive for studying alternative splicing because of the efficiency and low cost per base, but short reads typical of NGS only report mRNA fragments containing one or few splice junctions. Here, we used single-molecule amplification and long-read sequencing to study the HIV-1 provirus, which is only 9700 bp in length, but encodes nine major proteins via alternative splicing. Our data showed that the clinical isolate HIV-1(89.6) produces at least 109 different spliced RNAs, including a previously unappreciated ~1 kb class of messages, two of which encode new proteins. HIV-1 message populations differed between cell types, longitudinally during infection, and among T cells from different human donors. These findings open a new window on a little studied aspect of HIV-1 replication, suggest therapeutic opportunities and provide advanced tools for the study of alternative splicing.
Reference |
Download |
PDF: dynamicHIV.pdf
Online: http://nar.oxfordjournals.org/content/early/2012/08/24/nar.gks753.long
A simulation-approximation approach to sample size planning for high-dimensional classification studies. |
Abstract |
Classification studies with high-dimensional measurements and relatively small sample sizes are increasingly common. Prospective analysis of the role of sample sizes in the performance of such studies is important for study design and interpretation of results, but the complexity of typical pattern discovery methods makes this problem challenging. The approach developed here combines Monte Carlo methods and new approximations for linear discriminant analysis, assuming multivariate normal distributions. Monte Carlo methods are used to sample the distribution of which features are selected for a classifier and the mean and variance of features given that they are selected. Given selected features, the linear discriminant problem involves different distributions of training data and generalization data, for which 2 approximations are compared: one based on Taylor series approximation of the generalization error and the other on approximating the discriminant scores as normally distributed. Combining the Monte Carlo and approximation approaches to different aspects of the problem allows efficient estimation of expected generalization error without full simulations of the entire sampling and analysis process. To evaluate the method and investigate realistic study design questions, full simulations are used to ask how validation error rate depends on the strength and number of informative features, the number of noninformative features, the sample size, and the number of features allowed into the pattern. Both approximation methods perform well for most cases but only the normal discriminant score approximation performs well for cases of very many weakly informative or uninformative dimensions. The simulated cases show that many realistic study designs will typically estimate substantially suboptimal patterns and may have low probability of statistically significant validation results.
Reference |
Download |
Functional Characterization of Spliceosomal Introns and Identification of U2, U4, and U5 snRNAs in the Deep-branching Eukaryote Entamoeba histolytica |
Abstract |
Pre-mRNA splicing is essential to ensure accurate expression of many genes in eukaryotic organisms. In Entamoeba histolytica, a deep-branching eukaryote, approximately 30% of the annotated genes are predicted to contain introns; however, the accuracy of these predictions has gone untested. In this paper, we mined an EST library representing 7% of amoebic genes and found evidence supporting splicing of 60% of the testable intron predictions, the majority of which contain a GUUUGU 5' splice site and a UAG 3' splice site. Additionally, we identified several splice site misannotations, presented evidence for the existence of 30 novel introns in previously annotated genes, and identified novel genes through uncovering their spliced ESTs. Finally, we provide molecular evidence for the E. histolytica U2, U4 and U5 snRNAs. These data lay the foundation for further dissecting the role of RNA processing in E. histolytica.
Bibtex |
Download |
An automated, sheathless capillary electrophoresis-mass spectrometry platform for discovery of biomarkers in human serum. |
Abstract |
A capillary electrophoresis-mass spectrometry (CE-MS) method has been developed to perform routine, automated analysis of low-molecular-weight peptides in human serum. The method incorporates transient isotachophoresis for in-line preconcentration and a sheathless electrospray interface. To evaluate the performance of the method and demonstrate the utility of the approach, an experiment was designed in which peptides were added to sera from individuals at each of two different concentrations, artificially creating two groups of samples. The CE-MS data from the serum samples were divided into separate training and test sets. A pattern-recognition/feature-selection algorithm based on support vector machines was used to select the mass-to-charge (m/z) values from the training set data that distinguished the two groups of samples from each other. The added peptides were identified correctly as the distinguishing features, and pattern recognition based on these peptides was used to assign each sample in the independent test set to its respective group. A twofold difference in peptide concentration could be detected with statistical significance (p value < 0.0001). The accuracy of the assignment was 95%, demonstrating the utility of this technique for the discovery of patterns of biomarkers in serum. Keywords: Biomarkers / Capillary electrophoresis / Serum / Sheathless electrospray / Time of flight-mass spectrometry
Bibtex |
Download |
Small Subunit Ribosomal RNA Modeling Using Stochastic Context-Free Grammars |
Abstract |
We introduce a model based on stochastic context-free grammars (SCFGs) that can construct small subunit ribosomal RNA (SSU rRNA) multiple alignments. The method takes into account both primary sequence and secondary structure basepairing interactions. We show that this method produces multiple alignments of quality close to hand edited ones and outperforms several other methods. We also introduce a method of SCFG constraints that dramatically reduces the required computer resources needed to effectively use SCFGs on large problems such as SSU rRNA. Without such constraints, the required computer resources are infeasible for most computers. This work has applications to fields such as phylogenetic tree construction. Keywords: Ribosomal RNA, Multiple Alignment, Stochastic Context-Free Grammar, HMM, Constraints
Bibtex |
Download |
Knowledge-based Analysis of Microarray Gene Expression Data Using Support Vector Machines |
Abstract |
We introduce a new method of functionally classifying genes using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines. SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods such as hierarchical clustering methods and self organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.
Bibtex |
Download |
Online: http://www.pnas.org/cgi/content/abstract/97/1/262
Postscript tech report: genex.ps
RNA Modeling Using Stochastic Context-Free Grammars |
Abstract |
Recent developments in high-throughput biological technologies have created a wealth of biological sequence data. The immense size of these biological datasets has prompted the use of computational methods for their analysis. This work presents the theory and application of stochastic context-free grammars (SCFGs) to biological sequence analysis and specifically to the problem of RNA secondary structure modeling. SCFGs are a method of characterizing biological sequences that takes into account the statistical identity of different sequence positions including pairwise interactions between positions. It is their ability to model pairwise interacting positions that makes SCFGs a natural mathematical model of RNA secondary structure. SCFGs can automatically generate structural multiple alignments of RNA families that take into account basepairing interactions. SCFGs are presented as an extension of another probabilistic model used in biological sequence analysis, hidden Markov models. I present several SCFG algorithm developments including a SCFG constraint system that gives significant performance enhancements in both time and space and allows large SCFGs to be applied to large sequence analysis problems. I give a method using intersected SCFGs to model non-context-free structures. I also introduce a new method of sequence classification using a support vector machine framework and feature vectors generated from a SCFG.
I apply the SCFG method to an in vitro selected RNA pseudoknot that binds biotin. Even though SCFGs cannot model the RNA pseudoknot structure directly, I show that an approximation using two SCFGs can effectively perform database searches and find RNA pseudoknot structures. I then apply SCFGs to modeling small subunit ribosomal RNA, a large molecule that is important to the construction of phylogenetic trees of life. I compare the SCFG method to several other methods in constructing multiple alignments of this molecule and show that the SCFG outperforms the other methods, attaining a multiple alignment whose quality is close to hand-edited alignments. I apply SCFGs with support vector machines to a phylogenetic classification problem and show that they outperform a standard method. I describe the SCFG RNA modeling software, RNACAD, that was used in this work.
Bibtex |
Download |
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology. |
Abstract |
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
Bibtex |
Download |
Postscript: dirichlet.ps
Postscript tech report: dirichletTech.ps
RNA Pseudoknot Modeling Using Intersections of Stochastic Context Free Grammars with Applications to Database Search |
Abstract |
A model based on intersections of stochastic context free grammars is presented to allow for the modeling of RNA pseudoknot structures. The model runs relatively fast, having the same order running time as stochastic context free grammar parsers. The model is shown to be able to perform database searches and find RNA sequences which resemble RNA pseudoknots which bind biotin. The problem domain of RNA biotin binders has significance in the support of the RNA world model of early life on earth.
Bibtex |
Download |
Stochastic Context-Free Grammars for tRNA Modeling |
Abstract |
Stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of tRNA sequences. SCFGs capture the sequences' common primary and secondary structure and generalize the hidden Markov models (HMMs) used in related work on protein and DNA. Results show that after having been trained on as few as 20 tRNA sequences from only two tRNA subfamilies (mitochondrial and cytoplasmic), the model can discern general tRNA from similar-length RNA sequences of other kinds, can find secondary structure of new tRNA sequences, and can produce multiple alignments of large sets of tRNA sequences. Our results suggest potential improvements in the alignments of the D- and T-domains in some mitochondrial tRNAs that cannot be fit into the canonical secondary structure.
Bibtex |
Download |
Hidden Markov Models in Computational Biology: Applications to Protein Modeling |
Abstract |
Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the SWISS-PROT 22 database for other sequences that are members of the given protein family, or contain the given domain. The HMM produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate three-dimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appears to have a slight advantage over PROFILESEARCH in terms of lower rates of false negatives and false positives, even though the HMM is trained using only unaligned sequences, whereas PROFILESEARCH requires aligned training sequences. Our results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the α-1 subunit of L-type calcium channels which play an important role in excitation-contraction coupling. This region has been suggested to contain the functional domains that are typical or essential for all L-type calcium channels regardless of whether they couple to ryanodine receptors, conduct ions or both.
Bibtex |
Download |
Postscript: hmm.part1.ps and hmm.part2.ps