Bioinformatics
Institute of Computer Science
University Freiburg

LocARNA PP Format

Synopsis

Format for single sequences or multiple alignments together with a probabilistic description of the (consensus) RNA secondary structure ensemble, given as probabilities of base pairs, base pair stackings, and base pairs and unpaired bases in the loop of a base pair.

Description

The LocARNA PP format combines sequence or alignment information with the corresponding (single-sequence or consensus) ensemble probabilities into a PP 2.0 record. LocARNA uses this format for input (the single sequences or alignments that are to be aligned) and for output (alignments). Records are composed of one or several sections.

This format is used by the tools of the LocARNA package for input and output of sequences and alignments together with their probabilistic ensemble descriptions. Note that, for legacy reasons, LocARNA also supports the deprecated version 1.0 of the PP format.

Example PP record

#PP 2.0

hdrA                       GGCACCACUC-GAAGGC--UAAGCCAAAGUGGUGCU
vhuD                       GUUCUCUCGG-GAACCCGUCAAGGGACCGAGAGAAC
vhuU                       AGCUCACAACCGAACCCAUUUGGGAGGUUGUGAGCU
fwdB                       AUGUUGGAGGGGAACCCGUAAGGGACCCUCCAAGAU
#A1                        ......AA..............BBB...........
#A2                        ......12..............123...........

#END

#SECTION BASEPAIRS

#BPCUT 0.2

4 33 1.0
1 36 0.6
9 28 0.98
8 29 1.0
5 32 1.0
7 30 0.9
6 31 1
2 35 0.9

#END

#SECTION INLOOP

#BPILCUT 0.5
#UILCUT  0.5

2 35: 5 32 0.89 ; 3 0.9 4 0.1
8 29: 9 28 0.98 7 30 0.9; 10 0.7 27 0.6

#END

Format description

Each record starts with the header

#PP 2.0

followed by the "alignment section". This section specifies the sequence names, the alignment strings, and, optionally, anchor constraints.

Thus, it contains lines describing alignment rows

sequence name alignment string

or alignment annotations (usually, to specify anchor constraints):

#An constraint-string

The latter lines (for n = 1, 2, ...) each specify the n-th characters of the alignment column names, so that multi-character names can be given by several lines with consecutive indices n; the characters '.' and ' ' are treated as equivalent. In the example record, the lines #A1 and #A2 together name the columns A1, A2 and B1, B2, B3. Line breaks are supported: the strings of lines that repeat the same name are concatenated. Otherwise, the order of lines is arbitrary.

Each section is terminated by the line

#END
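The rules for the alignment section can be sketched in Python as follows. This is only a minimal illustration, not the parser used by LocARNA: the function names parse_alignment_section and column_names are hypothetical, and the sketch assumes that names and strings contain no whitespace (so constraint strings must use '.' rather than ' '), that unknown '#' keywords are errors, and that a column consisting only of '.' has no name.

```python
import re


def parse_alignment_section(text):
    """Parse the header and alignment section of a PP 2.0 record (sketch)."""
    lines = iter(text.splitlines())
    header = next((line for line in lines if line.strip()), "")
    if header.split() != ["#PP", "2.0"]:
        raise ValueError("record must start with '#PP 2.0'")

    rows = {}     # sequence name -> alignment string
    anchors = {}  # n -> string of the n-th characters of the column names
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#END":
            break
        if len(fields) != 2:
            raise ValueError(f"unexpected line: {line!r}")
        name, chunk = fields
        match = re.fullmatch(r"#A(\d+)", name)
        if match:
            target, key = anchors, int(match.group(1))
        elif name.startswith("#"):
            # Assumption: other keywords are not expected in this section.
            raise ValueError(f"unknown keyword: {name}")
        else:
            target, key = rows, name
        # Repeated names continue a string after a line break.
        target[key] = target.get(key, "") + chunk
    return rows, anchors


def column_names(anchors):
    """Combine the #An strings into one name per column (None if unnamed)."""
    if not anchors:
        return []
    width = max(len(s) for s in anchors.values())
    names = []
    for col in range(width):
        chars = [anchors[n][col:col + 1] or "." for n in sorted(anchors)]
        name = "".join(chars).replace(".", "")
        names.append(name or None)
    return names
```

Applied to the alignment section of the example record, column_names would be expected to yield the names A1, A2, B1, B2 and B3 for the annotated columns and None elsewhere.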

All following sections are introduced by a section header

#SECTION section_name
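As an illustration of this record layout, the following Python sketch splits a record into its sections. It is not taken from LocARNA; split_sections is a hypothetical helper, and the label "ALIGNMENT" for the unnamed first section is an assumption made only for this example.

```python
def split_sections(text):
    """Split a PP 2.0 record into {section name: list of content lines}."""
    sections = {}
    current = None  # name of the open section, None between sections
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#PP"):
            # The header opens the alignment section, which has no name
            # of its own; "ALIGNMENT" is a label chosen for this sketch.
            current = "ALIGNMENT"
            sections[current] = []
        elif line.startswith("#SECTION"):
            parts = line.split()
            if len(parts) != 2:
                raise ValueError(f"malformed section header: {line!r}")
            current = parts[1]
            sections[current] = []
        elif line == "#END":
            current = None
        elif current is None:
            raise ValueError(f"content outside of a section: {line!r}")
        else:
            sections[current].append(line)
    return sections
```

Keyword lines such as #BPCUT are kept as part of the section content, so that a section-specific parser can interpret them.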

Base pair probabilities are specified in the section with header

#SECTION BASEPAIRS

The keyword #BPCUT specifies the probability cutoff for the base pairs contained in the section. Base pair probabilities are listed one per line

i j p_ij
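Reading such a section could look like the following Python sketch. The helper parse_basepairs is hypothetical; the sketch assumes that it receives the content lines of the BASEPAIRS section, that the cutoff is optional, and that other '#' lines can be skipped.

```python
def parse_basepairs(lines):
    """Return (cutoff, {(i, j): p_ij}) for the lines of a BASEPAIRS section."""
    cutoff = None  # stays None if the section has no #BPCUT line
    probs = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#END":
            break
        if fields[0] == "#BPCUT":
            cutoff = float(fields[1])
            continue
        if fields[0].startswith("#"):
            continue  # assumption: ignore other keyword lines
        if len(fields) != 3:
            raise ValueError(f"expected 'i j p_ij', got: {line!r}")
        i, j, p = int(fields[0]), int(fields[1]), float(fields[2])
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {line!r}")
        probs[(i, j)] = p
    return cutoff, probs
```

Positions are kept exactly as written in the file; the sketch does not assume a particular indexing convention beyond what the lines contain.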

In-loop probabilities can be specified in a section with header

#SECTION INLOOP

Here, the additional probability thresholds for base pairs in loops and for unpaired bases in loops are specified, respectively, with

#BPILCUT 0.0005
#UILCUT  0.0005

The probabilities in the loop of a base pair i,j are specified by lines

i j: { k l Pr[(k,l) in loop of (i,j)] } ; { k Pr[k in loop of (i,j)] }
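A single in-loop line can be taken apart as in the following Python sketch. The function parse_inloop_line is a hypothetical helper, and the sketch assumes that both the ':' and the ';' separators are always present, even if one of the two lists is empty.

```python
def parse_inloop_line(line):
    """Parse 'i j: {k l p} ; {k p}' into ((i, j), bp_in_loop, unpaired)."""
    head, colon, rest = line.partition(":")
    if not colon:
        raise ValueError(f"missing ':' in {line!r}")
    i, j = (int(x) for x in head.split())

    bp_part, semicolon, up_part = rest.partition(";")
    if not semicolon:
        raise ValueError(f"missing ';' in {line!r}")
    bp = bp_part.split()  # triples k l Pr[(k,l) in loop of (i,j)]
    up = up_part.split()  # pairs   k Pr[k in loop of (i,j)]
    if len(bp) % 3 or len(up) % 2:
        raise ValueError(f"incomplete entry in {line!r}")

    bp_in_loop = {
        (int(bp[n]), int(bp[n + 1])): float(bp[n + 2])
        for n in range(0, len(bp), 3)
    }
    unpaired = {int(up[n]): float(up[n + 1]) for n in range(0, len(up), 2)}
    return (i, j), bp_in_loop, unpaired
```

For the second in-loop line of the example record, this returns the closing base pair (8, 29), the two inner base pairs with their probabilities, and the unpaired positions 10 and 27.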