Statistical models for local occurrences of RNA-structures


In this article we develop the statistical theory needed to compute, for instance, E-values when searching long sequences for occurrences of local RNA-structures. In particular, we show how the theory can be used to estimate scoring parameters so as to optimize the discriminative performance of the algorithm. The results are implemented in the program StemSearch, which can search for stem-loop structures such as those formed by microRNA precursors. We illustrate the estimation method in practice by considering three miRNA target datasets from human, Arabidopsis, and C. elegans and by optimizing three penalty parameters in StemSearch. We show that the optimization can improve the discriminative performance considerably when a first-order Markov model is used as the null distribution. Finally, we compare the output from StemSearch with that of RNALfold and discuss some notable differences, which are primarily due to fundamental differences in the choice of parameters.

Journal of Computational Biology, 16(6), 845-858