DKFZ
Molecular Genome Analysis
 

3 of 5

Documentation







Syntax



Commonly used regular expressions
One way to describe biological patterns is to use regular expressions, whose syntax allows ambiguities to be introduced into a pattern. These ambiguities can be classified as shown in the following examples:

No ambiguity
H characterizes exactly the character "H"
A{3} defines AAA

Content ambiguity
[AL] allows the presence of either A or L at the given position
. allows any character

Length ambiguity
A{1,3} allows A, AA or AAA
[AL]{1,3} allows A or L at each position of a stretch of one to three positions
.{1,3} allows any character at each position of a stretch of one to three positions
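
The patterns listed above are ordinary regular expressions, so they can be tried out in any regex engine. The following is a minimal sketch using Python's standard `re` module; the test strings and the helper name `check_examples` are illustrative assumptions, not part of 3of5.

```python
import re

# Each entry: (pattern from the text, strings it should match in full,
# strings it should reject). The test strings are made up for illustration.
EXAMPLES = [
    (r"H", ["H"], ["A", "HH"]),                      # no ambiguity
    (r"A{3}", ["AAA"], ["AA", "AAAA"]),              # no ambiguity
    (r"[AL]", ["A", "L"], ["V"]),                    # content ambiguity
    (r".", ["A", "X"], ["", "AA"]),                  # content ambiguity
    (r"A{1,3}", ["A", "AA", "AAA"], ["", "AAAA"]),   # length ambiguity
    (r"[AL]{1,3}", ["A", "LA", "ALL"], ["AV", "ALLA"]),
    (r".{1,3}", ["Q", "QW", "QWE"], ["", "QWER"]),
]

def check_examples(examples=EXAMPLES):
    """Return True if every pattern accepts and rejects its listed strings."""
    for pattern, accepted, rejected in examples:
        compiled = re.compile(pattern)
        if not all(compiled.fullmatch(s) for s in accepted):
            return False
        if any(compiled.fullmatch(s) for s in rejected):
            return False
    return True
```

`re.fullmatch` is used so that each test string must be covered entirely by the pattern; when scanning a protein sequence one would typically use `re.search` or `re.finditer` instead.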

In addition, 3of5 provides further conversions and extensions on top of the commonly used regular expressions, which constrain the matching of patterns in different ways:

Constraints of content
.[^LIV] allows any character in one position except L, I or V
.{1,3}[^LIV] allows a stretch of one to three positions with arbitrary content, except that L, I and V are forbidden

Constraints of position
^PATTERN matches the pattern only at the N-terminal end of the sequence
PATTERN$ matches the pattern only at the C-terminal end of the sequence
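
As a rough illustration of the content and position constraints above, the sketch below expresses them with Python's `re` module. It assumes that the appended `[^LIV]` restricts the element directly before it, so that in a plain regular expression the exclusion becomes the character class itself; this is one reading of the descriptions above, not a statement about how 3of5 works internally. The function names are hypothetical, and sequences are assumed to be single-line strings without a trailing newline.

```python
import re

# Assumed plain-regex readings of the two content constraints listed above.
ONE_POSITION_NOT_LIV = re.compile(r"[^LIV]")    # 3of5 notation: .[^LIV]
STRETCH_NOT_LIV = re.compile(r"[^LIV]{1,3}")    # 3of5 notation: .{1,3}[^LIV]

def matches_at_n_terminus(pattern, sequence):
    """True if `pattern` matches at the very start of `sequence` (^PATTERN)."""
    return re.search("^(?:" + pattern + ")", sequence) is not None

def matches_at_c_terminus(pattern, sequence):
    """True if `pattern` matches at the very end of `sequence` (PATTERN$)."""
    return re.search("(?:" + pattern + ")$", sequence) is not None
```

For example, `matches_at_c_terminus("[KR]{2}", "PAAKK")` is true because the sequence ends in two basic residues, while `STRETCH_NOT_LIV.fullmatch("AIP")` fails because of the isoleucine.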


The novel n-of-m pattern type
The n-of-m pattern type allows the definition of ambiguities in patterns whose individual elements may vary in position, order, and content. In addition, a content constraint element [^ ] may be appended.

n-of-m standard syntax
(3of5)(KR) requires at least three K and/or R residues in a stretch of 5 positions
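
The standard n-of-m rule can be pictured as a sliding window over the sequence. The following is a minimal sketch of that idea, not the 3of5 implementation; the function name `n_of_m_windows` and the test strings are assumptions made for illustration.

```python
def n_of_m_windows(sequence, n, m, chars):
    """Return start indices of all length-m windows that contain
    at least n characters from `chars`."""
    allowed = set(chars)
    hits = []
    for start in range(len(sequence) - m + 1):
        window = sequence[start:start + m]
        if sum(1 for c in window if c in allowed) >= n:
            hits.append(start)
    return hits
```

With `(3of5)(KR)`, the call `n_of_m_windows("AAKRKAA", 3, 5, "KR")` returns `[0, 1, 2]`, since each of the three 5-residue windows contains K, R and K.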

n-of-m extended syntax
(nof5)( (min3)(KR) (eq1)(P) ) requires at least three K and/or R residues in a stretch of 5 positions and demands exactly one P in the same sequence stretch.

operators:
min: minimal e.g. (min3)(KR) = requires at least 3 K and/or R residues
max: maximal e.g. (max3)(KR) = requires at most 3 K and/or R residues
eq: equal e.g. (eq3)(KR) = requires exactly 3 K and/or R residues
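
The three operators can be read as comparisons between the observed number of residues from a group and the required number. The sketch below checks a stretch against a list of such constraints; the tuple layout `(operator, count, characters)`, the optional `forbidden` argument for the [^ ] element, and both function names are assumptions of this example rather than anything defined by 3of5.

```python
import operator

# Operator names from the text mapped to comparisons of
# (observed count, required count).
OPERATORS = {"min": operator.ge, "max": operator.le, "eq": operator.eq}

def window_satisfies(window, constraints, forbidden=""):
    """Check one stretch against (operator, count, characters) constraints."""
    if any(c in forbidden for c in window):
        return False
    for op_name, required, chars in constraints:
        observed = sum(1 for c in window if c in chars)
        if not OPERATORS[op_name](observed, required):
            return False
    return True

def extended_windows(sequence, m, constraints, forbidden=""):
    """Return start indices of all length-m stretches meeting every constraint."""
    return [start for start in range(len(sequence) - m + 1)
            if window_satisfies(sequence[start:start + m], constraints, forbidden)]
```

The pattern (nof5)( (min3)(KR) (eq1)(P) ) corresponds to `constraints = [("min", 3, "KR"), ("eq", 1, "P")]` with `m = 5`; the stretch KRPKA satisfies it, whereas KRKKA (no proline) and KPPKR (two prolines) do not.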


n-of-m syntax details
The standard syntax of n-of-m consists of two bracket pairs in the form (nofm)(ABDF). The first bracket pair gives the minimal number of specified characters (n) and the total length of the pattern segment (m); the second bracket pair determines the set of allowed characters.
Thus 3of5 makes it possible to describe a multitude of pattern ambiguities. Some of the discrete patterns described by the example pattern (4of8)(DE) are listed below (x stands for an arbitrary character): DDDDxxxx, DDDxDxxx, DDxxDDxx, xxDDDDxx, DDEExxxx, DEEExxxx, EEEExxxx, ... .
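
To get a feeling for how many discrete patterns a single n-of-m expression covers, they can be enumerated by brute force. The sketch below does this under the same convention as the list above, where x is treated as a single placeholder symbol for any other character; the function name `enumerate_n_of_m` is hypothetical.

```python
from itertools import product

def enumerate_n_of_m(n, m, chars, wildcard="x"):
    """List every length-m string over `chars` plus a wildcard symbol
    that contains at least n non-wildcard characters."""
    alphabet = list(chars) + [wildcard]
    return ["".join(p) for p in product(alphabet, repeat=m)
            if sum(1 for c in p if c != wildcard) >= n]
```

Under this convention, `enumerate_n_of_m(4, 8, "DE")` yields 5984 discrete patterns, among them all seven listed above.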
This n-of-m pattern type can be combined with other common regular expressions, allowing the description of more complex patterns such as the bipartite nuclear localization site of nucleoplasmin: [KR][KR].{10}(3of5)(KR).
Furthermore, characters can be excluded by appending them in a [^ ] bracket arrangement. For instance, [KR][KR].{10}(3of5)(KR)[^P] may represent the nucleoplasmin pattern from above, but in this case no proline is allowed in the matching sequence.


In addition, the current 3of5 version provides an extended syntax for the n-of-m pattern type. This extended syntax allows groups of characters to be defined with different numerical constraints, so that sophisticated patterns can be specified.
One example is (n of 5) ( (min 3)(KR) (max 1)(P) )[^LIV]. Beyond the n-of-m constraint seen above in the nucleoplasmin pattern, the extended syntax defines further constraints: the pattern matches any stretch of 5 residues that contains at least 3 lysine and/or arginine residues and at most one proline, but no hydrophobic amino acid such as leucine, isoleucine or valine in any of its 5 positions.

The extended syntax keeps the arrangement of the standard syntax, using two pairs of brackets to specify each character or group of characters. The first pair gives the number of occurrences of the characters specified in the second pair. The following operators constrain the number of occurrences: "min" ("minimal" = "equal or more"), "max" ("maximal" = "less or equal") or "eq" ("exactly equal"), each followed by the respective number. The complete list of these double bracket pairs has to be framed by a main pair of brackets. Furthermore, an additional preceding pair of brackets contains the total length m of the pattern stretch in the form (nofm). While m acts as a true variable, n serves simply as a bridge to the standard syntax.
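
The bracket layout described above is regular enough to be read with a few regular expressions. The following is a small parsing sketch, assuming that whitespace inside the brackets is optional (both spaced and unspaced forms appear on this page) and that character sets consist of uppercase letters; the function name `parse_n_of_m` and the returned dictionary layout are hypothetical, and the actual 3of5 parser may behave differently.

```python
import re

HEADER = re.compile(r"\(\s*(\d+|n)\s*of\s*(\d+)\s*\)")
GROUP = re.compile(r"\(\s*(min|max|eq)\s*(\d+)\s*\)\s*\(\s*([A-Z]+)\s*\)")
EXCLUDE = re.compile(r"\[\^([A-Z]+)\]\s*$")
SIMPLE_SET = re.compile(r"\s*\(\s*([A-Z]+)\s*\)\s*")

def parse_n_of_m(pattern):
    """Parse a standard or extended n-of-m element into a dictionary."""
    text = pattern.strip()
    header = HEADER.match(text)
    if header is None:
        raise ValueError("pattern does not start with an (nofm) bracket pair")
    n_text, m = header.group(1), int(header.group(2))
    rest = text[header.end():]

    # Optional trailing [^...] element with excluded characters.
    forbidden = ""
    exclude = EXCLUDE.search(rest)
    if exclude:
        forbidden = exclude.group(1)
        rest = rest[:exclude.start()]

    if n_text == "n":
        # Extended syntax: a framed list of (operator count)(characters) pairs.
        constraints = [(op, int(count), chars)
                       for op, count, chars in GROUP.findall(rest)]
        if not constraints:
            raise ValueError("extended syntax needs at least one constraint")
    else:
        # Standard syntax: (3of5)(KR) is treated as a single "min" constraint.
        simple = SIMPLE_SET.fullmatch(rest)
        if simple is None:
            raise ValueError("standard syntax needs one character set")
        constraints = [("min", int(n_text), simple.group(1))]
    return {"m": m, "constraints": constraints, "forbidden": forbidden}
```

Treating the standard form as a single "min" constraint follows from the statement that n is the minimal number of specified characters.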



Examples:
Permutations of ale
pattern DOG.{1,10}(3of5)(ALE)
test sequence DOGANDALLEYCATWEASLE

Bipartite NLS, Nucleoplasmin-Type
pattern [KR][KR].{10}(3of5)(KR)
test sequence MASTVSNTSKLEKPVSLIWGCELNEQDKTFEFKVEDDEEKCEHQLALRTVCLGDKAKDEFNIVEIVTQEE GAEKSVPIATLKPSILPMATMVGIELTPPVTFRLKAGSGPLYISGQHVAMEEDYSWAEEEDEGEAEGEEE EEEEEDQESPPKAVKRPAATKKAGQAKKKKLDKEDESSEEDSPTKKGKGAGRGRKPAAKK
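
The two examples above can be approximated outside of 3of5 with the following sketch, which combines an ordinary regular expression for the leading part of the pattern with a count over the n-of-m window that follows it. The function name `find_prefix_then_n_of_m` is hypothetical, the whitespace in the nucleoplasmin test sequence is assumed to be a layout artifact and is left out, and the sketch reports every combination of start position and gap length, which may differ from what 3of5 itself reports.

```python
import re

def find_prefix_then_n_of_m(sequence, prefix, n, m, chars):
    """Return (start, matched substring) for every place where the regular
    expression `prefix` is directly followed by a length-m window holding
    at least n characters from `chars`."""
    allowed = set(chars)
    prefix_re = re.compile(prefix)
    matches = []
    for start in range(len(sequence)):
        for window_start in range(start, len(sequence) - m + 1):
            # The prefix must cover exactly sequence[start:window_start].
            if not prefix_re.fullmatch(sequence, start, window_start):
                continue
            window = sequence[window_start:window_start + m]
            if sum(1 for c in window if c in allowed) >= n:
                matches.append((start, sequence[start:window_start + m]))
    return matches

# Test sequences from the examples above.
ALE_SEQUENCE = "DOGANDALLEYCATWEASLE"
NUCLEOPLASMIN = (
    "MASTVSNTSKLEKPVSLIWGCELNEQDKTFEFKVEDDEEKCEHQLALRTVCLGDKAKDEFNIVEIVTQEE"
    "GAEKSVPIATLKPSILPMATMVGIELTPPVTFRLKAGSGPLYISGQHVAMEEDYSWAEEEDEGEAEGEEE"
    "EEEEEDQESPPKAVKRPAATKKAGQAKKKKLDKEDESSEEDSPTKKGKGAGRGRKPAAKK"
)

ale_hits = find_prefix_then_n_of_m(ALE_SEQUENCE, r"DOG.{1,10}", 3, 5, "ALE")
nls_hits = find_prefix_then_n_of_m(NUCLEOPLASMIN, r"[KR][KR].{10}", 3, 5, "KR")
```

In this sketch the nucleoplasmin pattern is found once, as KRPAATKKAGQAKKKKL, and the ale pattern is found for six different gap lengths after DOG, the shortest match being DOGANDALL.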