This software package contains the GraphProt framework as published in “GraphProt: modeling binding preferences of RNA-binding proteins”.
GraphProt contains a precompiled version of “EDeN”, the SVM package used for feature creation and classification. This binary should run on most Linux-based systems. In case it does not run on your system, please call “bash ./recompile_EDeN.sh” from the GraphProt main directory.
GraphProt uses various open-source software packages. Please make sure that the following programs are installed and accessible via the PATH environment variable (i.e. you should be able to call the programs by just issuing the command).
GraphProt will scan for these programs and notify you if something seems amiss. GraphProt contains a copy of fastapl (http://seq.cbrc.jp/fastapl/index.html.en).
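A manual check along the same lines can be sketched in shell. This is not GraphProt's own check: the function name check_programs is hypothetical, and the program names passed to it below are placeholders that should be replaced with the programs required by your installation.

```shell
# check_programs NAME...
# Report which of the given program names can be found via PATH.
# Hypothetical helper, not part of GraphProt.
check_programs() {
    missing=0
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "found:   $prog"
        else
            echo "missing: $prog"
            missing=1
        fi
    done
    # Return status is 0 only if every program was found.
    return "$missing"
}

# Example call with placeholder names; replace with the required programs.
check_programs sh ls || echo "some programs are missing"
```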
GraphProt analyses are started by calling “GraphProt.pl”. If no options are given, GraphProt.pl displays a help message summarizing all available options. By default, analyses are run in classification setting; to switch to regression setting, use the parameter -mode regression. In general, GraphProt analyses are run by issuing different actions, e.g.
GraphProt.pl --action train -fasta train_positives.fa -negfasta train_negatives.fa
GraphProt supports input sequences in fasta format. The viewpoint mechanism sets viewpoints to all nucleotides in uppercase letters; nucleotides in lowercase letters are only used for RNA structure predictions.
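To illustrate the casing convention, the following sketch marks the central part of each sequence as viewpoint (uppercase) and a fixed number of flanking nucleotides on each side as structure-only context (lowercase). The function name set_viewpoints and the flank length are assumptions made for illustration. The sketch expects unwrapped fasta input, i.e. one line per sequence.

```shell
# set_viewpoints FLANK < in.fa > out.fa
# Lowercase FLANK nucleotides at both ends of each sequence, uppercase the rest.
# Hypothetical helper; assumes unwrapped fasta (one line per sequence).
set_viewpoints() {
    awk -v flank="$1" '
        /^>/ { print; next }                 # header lines pass through unchanged
        {
            n = length($0)
            if (n <= 2 * flank) {            # too short: no viewpoint region remains
                print tolower($0); next
            }
            left  = tolower(substr($0, 1, flank))
            mid   = toupper(substr($0, flank + 1, n - 2 * flank))
            right = tolower(substr($0, n - flank + 1))
            print left mid right
        }'
}

# Example: 3 nt of context on each side of a 12 nt sequence.
printf '>site1\nACGUACGUACGU\n' | set_viewpoints 3
```

For the example input, the sequence line becomes acgUACGUAcgu.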
The GraphProt parameters abstraction, R, D, bitsize, c, epsilon, epochs and lambda are set to default values. For best results, optimized parameters should be obtained with the ls parameter optimization setting.
Input files in classification setting are specified with parameters “-fasta” (binding sites) and “-negfasta” (unbound sites). For regressions, input sequences are specified with “-fasta” and sequence scores with “-affinities”. The affinity file should contain one value per line, one for each sequence.
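Since a regression run needs one affinity value per sequence, it can help to compare the two counts before starting. The following is a minimal sketch; the function name check_affinities is hypothetical and not part of GraphProt.

```shell
# check_affinities FASTA AFFINITIES
# Succeeds if the number of fasta records equals the number of affinity values.
# Hypothetical helper, not part of GraphProt.
check_affinities() {
    n_seqs=$(grep -c '^>' "$1" || true)             # count fasta header lines
    n_vals=$(grep -c '[^[:space:]]' "$2" || true)   # count non-empty lines
    if [ "$n_seqs" -eq "$n_vals" ]; then
        echo "ok: $n_seqs sequences, $n_vals affinities"
    else
        echo "mismatch: $n_seqs sequences, $n_vals affinities" >&2
        return 1
    fi
}
```

It would be called with the two files that are later passed to “-fasta” and “-affinities”.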
Output filenames can be specified via a prefix (-prefix); if no prefix is given, the default is “GraphProt”.
Determines optimized parameters. Parameters are printed to screen and written to file “GraphProt.param”.
Runs a 10-fold crossvalidation. Crossvalidation results are written to file “GraphProt.cv_results”.
Trains a GraphProt model. The model is written to file “GraphProt.model”.
Predict binding of whole sequences, e.g. CLIP sites. Margins are written to file “GraphProt.predictions”. Each line of this file gives the margin for one sequence in the second column, in the same order as the fasta file. In classification setting the first column contains the class, in regression setting the first column contains, if specified, the affinities, otherwise 1.
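Because the predictions file lists margins in the same order as the fasta file, sequence identifiers can be attached by pairing the n-th fasta header with the n-th predictions line. The sketch below assumes whitespace-separated columns in the predictions file; the function name label_predictions is hypothetical.

```shell
# label_predictions FASTA PREDICTIONS
# Print "sequence_id<TAB>margin" by pairing the n-th fasta header
# with the second column of the n-th predictions line.
# Hypothetical helper; assumes whitespace-separated prediction columns.
label_predictions() {
    awk '
        NR == FNR { if (/^>/) ids[++n] = substr($1, 2); next }   # first file: collect IDs
        { print ids[FNR] "\t" $2 }                                # second file: ID and margin
    ' "$1" "$2"
}
```

The output can then be piped through a sort on the second column to rank sequences by margin.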
Predict binding profiles (nucleotide-wise margins) for sequences. Nucleotide-wise margins are written to file “GraphProt.profile”. This file contains three columns:
Please note that with GraphProt structure models this action currently only supports sequences of up to 150 nt.
Predict high-affinity target sites as showcased in the GraphProt paper. Selects all regions with average scores within 12 nt above a given percentile (parameter -percentile, defaults to 99). Average nucleotide-wise margins of high-affinity sites are written to file “GraphProt.has”. This file contains three columns:
Create RNA sequence and structure motifs as described in the “GraphProt” paper. Motifs are written to files “GraphProt.sequence_motif.png” and “GraphProt.structure_motif.png”.
In addition to the integrated usage via GraphProt.pl, individual tasks such as creation of RNA structure graphs or calculation of features can be accomplished using the following tools:
Usage information for these tools can be obtained by specifying the “-h” option.
RNA sequence and structure graphs are created using fasta2shrep_gspan.pl. Structure graphs are created using the following parameters. The user has to choose an appropriate RNAshapes ABSTRACTION_LEVEL.
fasta2shrep_gspan.pl --seq-graph-t --seq-graph-alph -abstr -stdout -M 3 -wins '150,' -shift '25' -fasta PTBv1.train.fa -t ABSTRACTION_LEVEL | gzip > PTBv1.train.gspan.gz
RNA sequence graphs are created using the following parameters:
fasta2shrep_gspan.pl --seq-graph-t -nostr -stdout -fasta PTBv1.train.fa | gzip > PTBv1.train.gspan.gz
For example, 10-fold crossvalidation using EDeN is done via:
EDeN/EDeN -a CROSS_VALIDATION -c 10 -i PTBv1.train.gspan.gz -t PTBv1.train.class -g DIRECTED -b BIT_SIZE -r RADIUS -d DISTANCE -e EPOCHS -l LAMBDA
and setting the appropriate parameters for BIT_SIZE, RADIUS, DISTANCE, EPOCHS and LAMBDA.
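One way to keep the parameter choices in a single place is to hold them in shell variables and assemble the command from them. In the sketch below the function name eden_cv_cmd is hypothetical, and the numeric values are placeholders for illustration only, not recommended or default settings. The function only prints the command, so it can be inspected before being run.

```shell
# Placeholder values for illustration only; they are not recommended settings.
BIT_SIZE=14
RADIUS=1
DISTANCE=4
EPOCHS=10
LAMBDA=0.001

# eden_cv_cmd GSPAN CLASS_FILE
# Print the 10-fold cross-validation command with the variables filled in.
# Hypothetical helper; it does not execute EDeN.
eden_cv_cmd() {
    echo "EDeN/EDeN -a CROSS_VALIDATION -c 10 -i $1 -t $2 -g DIRECTED" \
         "-b $BIT_SIZE -r $RADIUS -d $DISTANCE -e $EPOCHS -l $LAMBDA"
}

eden_cv_cmd PTBv1.train.gspan.gz PTBv1.train.class
```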