PatScan was developed by Ross Overbeek and Mark D'Souza and is maintained by
the Bioinformatics group at Argonne
National Laboratory.
If you would like to cite PatScan please use the following reference:
Dsouza M, Larsen N, Overbeek R.
Searching for patterns in genomic data.
Trends Genet. 1997 Dec;13(12):497-8.
    GPXGXXXXXXXXXGXXXXXNDGXXXXXXXXXXXP

This is a pattern composed of a single pattern unit. The pattern unit describes a 34-character region: eight characters must match exactly, and the others can be anything. This is actually not a bad characterization of a fairly interesting class of proteins, but one would probably like to allow some mismatches (i.e., one would like a slightly less stringent pattern).

    GPXGXXXXXXXXXGXXXXXNDGXXXXXXXXXXXP[1,0,0]

This pattern will allow a single mismatch among the eight specified positions. The qualifier [1,0,0] means up to 1 mismatch, up to 0 "deletions", and up to 0 "insertions". A "deletion" means that a character in the input pattern is skipped. For example, when matching

    GYHVLMMAWG[2,1,0]

against

    AGVGPPGGYAVCMAWGKRSTVLM

the matched subsequence would be

    GYAVCMAWG

NOTE: You cannot leave a space between the string of amino acid codes and the qualifier. Thus,

    GYHVLMMAWG [2,1,0]

is invalid.
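The mismatch, deletion, and insertion budget can be illustrated with a small Python sketch. This is not PatScan's own code: the function name `match_lengths` and the recursive search are assumptions made for illustration, and "X" is treated as a wildcard as in the protein patterns above.

```python
from functools import lru_cache

def match_lengths(pattern, text, mismatches, deletions, insertions):
    """Return the set of prefix lengths of `text` that `pattern` can match.

    A mismatch pairs a pattern character with a different text character,
    a deletion skips a pattern character, and an insertion skips a text
    character. "X" in the pattern matches any character.
    """
    @lru_cache(maxsize=None)
    def walk(i, j, mm, de, ins):
        # i indexes the pattern, j indexes the text.
        if i == len(pattern):
            return frozenset({j})
        found = set()
        if j < len(text):
            if pattern[i] == "X" or pattern[i] == text[j]:
                found |= walk(i + 1, j + 1, mm, de, ins)
            elif mm > 0:
                found |= walk(i + 1, j + 1, mm - 1, de, ins)
            if ins > 0:
                found |= walk(i, j + 1, mm, de, ins - 1)
        if de > 0:
            found |= walk(i + 1, j, mm, de - 1, ins)
        return frozenset(found)

    return set(walk(0, 0, mismatches, deletions, insertions))

text = "AGVGPPGGYAVCMAWGKRSTVLM"
start = text.find("GYAVCMAWG")
# The 10-character pattern matches a 9-character subsequence:
# two mismatches (H/A and L/C) plus one deleted M.
print(start, match_lengths("GYHVLMMAWG", text[start:], 2, 1, 0))
```

With the deletion budget set to 0 the same call returns an empty set, because aligning all ten pattern characters would need three mismatches.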
    any(TS) 1...1 GP 1...1 G 4...4 any(LFIVM) 4...4 G 5...5 NDG 10...11 P

This pattern contains 13 "pattern units". A match is successful only when each pattern unit is successfully matched.

    any(TS)

will match either a T or an S in the first position. Similarly,

    notany(TS)

would match anything but a T or an S.

The Min...Max notation is used to indicate any character string of the designated length. For example,

    10...12

matches any 10, 11 or 12 character string.
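For readers who know regular expressions, the simple pattern units above map onto them fairly directly. The following Python sketch assumes a protein pattern that uses only literal strings (with X as a wildcard), any(), notany(), and Min...Max ranges; the helper names `unit_to_regex` and `pattern_to_regex` are hypothetical and are not part of PatScan.

```python
import re

def unit_to_regex(unit):
    """Translate one simple pattern unit into a regular-expression fragment."""
    m = re.fullmatch(r"(\d+)\.\.\.(\d+)", unit)
    if m:                                   # Min...Max range
        return ".{%s,%s}" % m.groups()
    m = re.fullmatch(r"(not)?any\(([A-Za-z]+)\)", unit)
    if m:                                   # any(...) or notany(...)
        negate = "^" if m.group(1) else ""
        return "[%s%s]" % (negate, m.group(2))
    if re.fullmatch(r"[A-Za-z]+", unit):    # literal string, X = wildcard
        return unit.replace("X", ".")
    raise ValueError("unsupported pattern unit: " + unit)

def pattern_to_regex(pattern):
    """Join the units, one capture group per pattern unit."""
    return "".join("(" + unit_to_regex(u) + ")" for u in pattern.split())

regex = pattern_to_regex("any(TS) 1...1 GP 1...1 G 4...4 any(LFIVM) G")
print(regex)
print(re.fullmatch(regex, "SPGPAGQQGAIG").groups())
```

One capture group per pattern unit means the matched text can be split into segments, one per unit. Mismatch qualifiers, named units, and nucleotide ambiguity codes are deliberately left out of this sketch.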
    all
    any(TS) 1...1 GP 1...1 G 4...4 any(LFIVM) G
    sp|P02461|CA13_HUMAN:[1120,1131]: S P GP A G QQGA I G
    sp|P02461|CA13_HUMAN:[1132,1143]: S P GP A G PRGP V G
    sp|P02463|CA14_MOUSE:[1095,1106]: S P GP R G SPGN I G
    sp|P02465|CA21_BOVIN:[158,169]  : S V GP V G PAGP I G
    sp|P04258|CA13_BOVIN:[412,423]  : S P GP R G QPGV M G
    sp|P04258|CA13_BOVIN:[964,975]  : S P GP A G HQGA V G
    sp|P04258|CA13_BOVIN:[976,987]  : S P GP A G PRGP V G
    sp|P08125|CA1A_CHICK:[80,91]    : S P GP Q G PPGP L G
    sp|P13941|CA13_RAT:[304,315]    : S P GP A G PRGP V G
    . . .
    sp|Q02388|CA17_HUMAN:[2720,2731]: S A GP P G PPGS V G
    sp|P05997|CA25_HUMAN:[769,780]  : T P GP K G DRGG I G
    sp|P22138|RPA2_YEAST:[53,64]    : T E GP D G GLLN L G
    sp|P42382|CH60_EHRCH:[29,40]    : T A GP K G LTVA I G
    sp|Q01149|CA21_MOUSE:[749,760]  : T K GP K G ENGI V G
    COMPLETED REQUEST
    p1=8...9 3...8 ~p1

This pattern consists of three "pattern units" separated by spaces. The first is

    p1=8...9

which means "match 8 to 9 characters and call them p1". The second is

    3...8

which means "match 3 to 8 characters". The third is

    ~p1

which means "match the reverse complement of p1". For example, the pattern matches

    cgtaaccaa ggttaacc ttggttacg

Now for a short aside: PatScan will search only one strand unless you also ask for searches against the complementary strand. With a pattern of the sort we just used, there is no need to search the opposite strand. However, you will normally wish to search both the sequence and the opposite strand (i.e., the reverse complement of the sequence). You should usually ask for this option when you scan nucleotide sequences.
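What the hairpin pattern asks for can be sketched as a brute-force search in Python. The function `find_hairpins` and its default stem and loop lengths are assumptions chosen to mirror p1=8...9 3...8 ~p1; the sketch makes no claim about how PatScan itself searches.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def reverse_complement(seq):
    """Reverse the sequence and swap each base for its standard partner."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def find_hairpins(seq, stem=(8, 9), loop=(3, 8)):
    """Yield (p1, loop, ~p1) triples, in the spirit of p1=8...9 3...8 ~p1."""
    for start in range(len(seq)):
        for stem_len in range(stem[0], stem[1] + 1):
            p1 = seq[start:start + stem_len]
            if len(p1) < stem_len:
                break                      # ran off the end of the sequence
            if any(base not in COMPLEMENT for base in p1):
                continue                   # skip non-acgt characters
            rc = reverse_complement(p1)
            for loop_len in range(loop[0], loop[1] + 1):
                loop_end = start + stem_len + loop_len
                if seq[loop_end:loop_end + stem_len] == rc:
                    yield p1, seq[start + stem_len:loop_end], rc

first = next(find_hairpins("cgtaaccaaggttaaccttggttacg"))
print(" ".join(first))
```

Unlike a real scan, this sketch reports every starting position that works, so the same region can show up more than once.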
    r1={au,ua,gc,cg,gu,ug,ga,ag}
    p1=2...3 0...4 p2=2...5 1...5 r1~p2 4...4 ~p1

Let us first show you how to handle "non-standard rules for pairing in reverse complements". The example is shown as two lines. You may use as many lines as you like in forming a pattern, although you can only break a pattern at points where a space would be legal.

    p1=2...3   match 2 or 3 characters (call it p1)
    0...4      match 0 to 4 characters
    p2=2...5   match 2 to 5 characters (call it p2)
    1...5      match 1 to 5 characters
    r1~p2      match the reverse complement of p2, using the pairing rule r1 {au,ua,gc,cg,gu,ug,ga,ag}
    4...4      match 4 characters
    ~p1        match the reverse complement of p1, allowing only G-C, C-G, A-T, and T-A pairs

Thus, r1~p2 means "match the reverse complement of p2 using rule r1".
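A pairing rule is just a set of allowed base pairs, which the Python sketch below makes concrete. The names `parse_rule` and `pairs_as_reverse_complement` are hypothetical, and the sketch assumes the first letter of each pair comes from the named unit and the second from the candidate match (the rules shown here are symmetric, so the orientation does not affect the result).

```python
def parse_rule(text):
    """Turn '{au,ua,gc,cg}' into a set of allowed (base, partner) pairs."""
    return {(pair[0], pair[1]) for pair in text.strip("{}").split(",")}

def pairs_as_reverse_complement(left, right, rule):
    """True if `right` is the reverse complement of `left` under `rule`."""
    return len(left) == len(right) and all(
        (a, b) in rule for a, b in zip(left, reversed(right))
    )

R1 = parse_rule("{au,ua,gc,cg,gu,ug,ga,ag}")   # permissive rule from the text
R2 = parse_rule("{au,ua,gc,cg}")               # Watson-Crick pairs only (RNA)
STANDARD = parse_rule("{at,ta,cg,gc}")         # the default DNA rule

print(pairs_as_reverse_complement("gga", "ucu", R1))   # g-u wobble allowed
print(pairs_as_reverse_complement("gga", "ucu", R2))   # g-u not allowed
```

The first call succeeds because r1 accepts the g-u pair; the second fails because the stricter rule does not.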
    p1=10...10 3...8 ~p1[1,2,1]

Now let us consider the issue of tolerating mismatches and bulges. The qualifier [1,2,1] on ~p1 allows up to 1 mismatch, 2 deletions, and 1 insertion, where

a "mismatch" is a pairing other than G-C, C-G, A-T, or T-A,
a "deletion" is a character that occurs in p1, but has been deleted from the string matched by ~p1, and
an "insertion" is a character that occurs in the string matched by ~p1, but for which no corresponding character occurs in p1.

With these allowances, the pattern would match

    ACGTACGTAC GGGGGGGG GCGTTACCT

which is, you must admit, a fairly weak loop.
It is common to allow mismatches, but you will find yourself using insertions and deletions much more rarely. You should note that allowing mismatches, insertions, and deletions does force the program to try many additional possible pairings, so it does slow things down a bit.
NOTE: You cannot leave a space before the qualifier. Thus,

    p1=10...10 3...8 ~p1 [1,2,1]

is invalid.
    p1=6...6 3...8 p1

Find an exact 6-character repeat separated by 3 to 8 characters.

    p1=6...6 3...8 p1[1,0,0]

Same as above, allowing one mismatch in the repeated string.

    p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0]

Match 12 characters made up of a 3-character sequence occurring 4 times, with up to 1 mismatch in each of the 2nd, 3rd and 4th occurrences.

    p1=4...8 0...3 p2=6...8 p1 0...3 p2

This would match things like:

    ATCT G TCTTT ATCT TG TCTTT
Occasionally, one wishes to match a specific, known sequence. In such a case, you can just give the sequence (along with an optional statement of the allowable mismatches, insertions, and deletions).
    p1=6...8 GAGA ~p1

Match a hairpin with GAGA as the loop.

    RRRRYYYY

Match 4 purines followed by 4 pyrimidines.

    TATAA[1,0,0]

Match TATAA, allowing 1 mismatch.
Note that complex patterns could often match against a number of overlapping areas of a sequence: only the first would be reported (after a successful match, the matching algorithm picks up at the first character past the matched substring).
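The reporting rule just described (resume at the first character past each match) can be sketched in Python as follows. Both `scan_non_overlapping` and the toy matcher `atat_at` are hypothetical helpers written for this illustration.

```python
def scan_non_overlapping(seq, match_length_at):
    """Collect (start, end) spans, resuming after each reported match."""
    hits = []
    pos = 0
    while pos < len(seq):
        length = match_length_at(seq, pos)
        if length:                      # a non-empty match starts here
            hits.append((pos, pos + length))
            pos += length               # skip past the matched substring
        else:
            pos += 1                    # no match: try the next position
    return hits

def atat_at(seq, pos):
    """Toy matcher: length of a literal ATAT match at pos, else 0."""
    return 4 if seq.startswith("ATAT", pos) else 0

# ATAT also occurs at offset 2, but that overlapping hit is not reported.
print(scan_non_overlapping("ATATATGG", atat_at))
```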
Note that searches may take some time since they are queued (sometimes for a few hours, but results can often be obtained in just a few minutes).

    embl|M27249:[142,162]     : aaaaaaga aatca tctttttt
    embl|M35517:[343,363]     : aaaaaaga aatca tctttttt
    embl|V00101:[241,261]     : aaaaaaga aatca tctttttt
    embl|X07796:[343,363]     : aaaaaaga aatca tctttttt
    embl|X56679:[1562,1587]   : aaaaaagac ccttaggg gtctttttt
    embl|M24537:[3334,3359]   : aaaaaagcc cactagag ggctttttt
    embl|M98822:[15,40]       : aaaaaagcc cactagag ggctttttt
    embl|D29985:[5641,5666]   : aaaaaagcg cccttggg cgctttttt
    embl|X73124:[83254,83277] : aaaaaccc tttttaaa gggttttt
    embl|M12501:[611,636]     : aaaaagact tggaaaca agtcttttt
    . . .
    embl|M77837:[623,644]     : tttttaaa ggtaca tttaaaaa
    embl|M16192:[1522,1542]   : tttttata ataat tataaaaa
    embl|L08822:[930,953]     : tttttctg tgctgaaa cagaaaaa
    embl|L25604:[3323,3347]   : ttttttgaa ataaaac ttcaaaaaa
    embl|M97391:[3519,3543]   : ttttttgaa gttttgt ttcaaaaaa
    COMPLETED REQUEST
Each line gives an EMBL accession number, followed by the positions in the EMBL entry that were matched by the pattern. The returned hits are sorted on the first field matched by the pattern.
So, try out some patterns and see what you get. Note that we limit the maximum number of reported hits; you can override the maximum, but we suggest that you only do so once you know that you really want to see a truly large number of matches.

Ambiguity Codes for Nucleotides

Code | Nucleotides |
---|---|
M | {A, C} |
R | {A, G} |
W | {A, T} |
S | {C, G} |
Y | {C, T} |
K | {G, T} |
V | {A, C, G} |
H | {A, C, T} |
D | {A, G, T} |
B | {C, G, T} |
N | {A, C, G, T} |
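The table translates directly into a lookup, as in the Python sketch below. The dictionary name `AMBIGUITY` and the function `matches` are assumptions for illustration; only same-length comparison with a mismatch budget is shown, not deletions or insertions.

```python
# Each code maps to the set of nucleotides it stands for (from the table).
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

def matches(pattern, seq, max_mismatches=0):
    """True if seq fits pattern position by position, within the budget."""
    if len(pattern) != len(seq):
        return False
    mismatches = sum(
        base not in AMBIGUITY[code]
        for code, base in zip(pattern.upper(), seq.upper())
    )
    return mismatches <= max_mismatches

print(matches("RRRRYYYY", "AGGACTTC"))    # 4 purines then 4 pyrimidines
print(matches("TATAA", "TATTA", 1))       # one mismatch allowed
```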
    (X | Y)

where X and Y are pattern units. The construct (X | Y) will match successfully if either X matches or Y matches.

NOTE: The spaces before and after the vertical bar " | " are necessary.
    name=X

where "name" is one of {p1,p2,p3,...} and X is a basic pattern unit. When a named simple pattern unit successfully matches a section of a sequence, that section can be later referred to in constructs such as

    p1 p2[0,1,0] ~p3

and so forth (see below). The "name" saves the value of the matched substring.

    name=complements

where "name" is one of {r1,r2,r3,...} and "complements" is a set defining what is meant by "complement" under the named rule. For example,

    r1={au,ua,gc,cg,gu,ug,ga,ag}
    r2={au,ua,gc,cg}

shows two complementation rule pattern units defining two specialized notions of "complement".

Explicitly defined complementation rules are useful when scanning for helices in nucleotide sequences, especially when unusual constraints exist for specific positions.

Normally, one uses the standard complementation rule, i.e. the set {at,ta,cg,gc}. PatScan assumes this to be the default rule, and it does not need to be defined explicitly.
examples: AGYGGT YCXXGA TATAA[1,0,0]
example: 3...8
examples: ~p2 ~p3[1,1,0] r2~p4
examples: p1 p2[1,0,0]
example: any(IV)
example: notany(IVL)
example: {(80,0,20,0),(0,100,0,0),(20,40,40,0)} > 119
example: length(p1+p2+p3) < 12
    [Mismatches,Deletions,Insertions]

For example, a string pattern of the form

    RNYRNYRNYRNY[1,0,0]

would match 12 characters in a nucleotide string, allowing one mismatch (where R stands for a purine and Y stands for a pyrimidine; the standard ambiguity codes are used for nucleotides, but only X is allowed as an ambiguous character for proteins). A "deletion" is a character in the pattern for which no character in the matched sequence corresponds, while an "insertion" is a character in the sequence which does not correspond to any character in the string pattern unit.

    Min...Max

indicates that the unit will match any subsequence with length between Min and Max.

    ~p1

matches the reverse complement of whatever p1 represents. If a special rule of complementation is required, it precedes the ~; thus

    r1~p2

says "match the reverse complement of whatever p2 matched, where complementation is defined by complementation rule r1". You can also add a match qualifier; thus,

    ~p2[1,0,1]

would allow a single mismatch and a single character bulge in the helix.
    p1=3...3 p1 p1

would match a 9-character string made up of a 3-mer repeated three times. You can qualify the matches; for example,

    p1=3...3 p1[1,0,0] p1[1,0,0] p1[1,0,0] p1[1,0,0]

matches a 15-character string which might be thought of as 5 repetitions of a 3-mer that have experienced a few mutations.
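The repeat-with-mismatches idea can be checked with a short Python sketch. The function `is_mutated_tandem_repeat` is hypothetical and handles only the mismatch budget (the first number in the qualifier), comparing every later copy against the first 3-mer.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def is_mutated_tandem_repeat(seq, unit_len=3, copies=5, max_mismatches=1):
    """Check seq against p1=3...3 followed by four p1[1,0,0] units."""
    if len(seq) != unit_len * copies:
        return False
    p1 = seq[:unit_len]
    # Every later copy is compared with p1 itself, not with its neighbour.
    return all(
        hamming(p1, seq[i:i + unit_len]) <= max_mismatches
        for i in range(unit_len, len(seq), unit_len)
    )

print(is_mutated_tandem_repeat("ACGACGATGACGACC"))   # ATG and ACC differ by 1
print(is_mutated_tandem_repeat("ACGACGTTGACGACG"))   # TTG differs by 2
```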
         C1   C2   C3   C4   C5   C6   C7   C8
    A    16   57    0   95    0   18    0    0
    C     0   10   80    0  100   60    0   50
    G    84   29    0    0    0   20  100   50
    T     0    4   20    5    0    2    0    0

One could use the following pattern unit to search for inexact matches related to such a "weight matrix":

    {(16,0,84,0),(57,10,29,4),(0,80,0,20),(95,0,0,5),
     (0,100,0,0),(18,60,20,2),(0,0,100,0),(0,50,50,0)} > 450

This pattern unit will attempt to match exactly eight characters. For each character in the sequence, the entry in the corresponding tuple is added to an accumulated sum. If the sum is greater than 450, the match succeeds; otherwise it fails. For protein sequences, you must use 20-tuples (with the entries corresponding to the amino acids in alphabetical order). This will be used only by the most serious aficionados.
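The scoring rule can be sketched in Python as below. The names `WEIGHTS`, `score`, and `scan` are assumptions for illustration; the tuple order A, C, G, T follows the matrix shown above, and a character outside ACGT would raise an error in this sketch.

```python
# One (A, C, G, T) tuple per position, copied from the pattern unit above.
WEIGHTS = [
    (16, 0, 84, 0), (57, 10, 29, 4), (0, 80, 0, 20), (95, 0, 0, 5),
    (0, 100, 0, 0), (18, 60, 20, 2), (0, 0, 100, 0), (0, 50, 50, 0),
]
ORDER = "ACGT"

def score(window):
    """Sum the matrix entry selected by each character of the window."""
    return sum(col[ORDER.index(base)] for col, base in zip(WEIGHTS, window))

def scan(seq, threshold=450):
    """Yield (offset, window, score) for windows scoring above threshold."""
    width = len(WEIGHTS)
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width].upper()
        total = score(window)
        if total > threshold:
            yield i, window, total

print(score("GACACCGC"))
print(list(scan("TTGACACCGCTT")))
```

The window GACACCGC picks the largest entry in every column, which gives the maximum possible score of 626 for this matrix.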
    p1=5...5 p2=1...5 p3=3...7 ~p1[1,0,0] p4=3...8 ~p3 length(p2+p4) < 10

would match a pseudo-knot-like structure, setting a maximum size on the two unpaired internal subsequences.
    TTF $

matches TTF at the end of the database sequence.
    ^ TTF

matches TTF at the beginning of the database sequence.

    p1=4...4 <p1

matches all of the following:

    AGGD DGGA
    FAFL LFAF
    GSAP PASG
    SAPR RPAS

That is, it matches any four characters, followed by their reverse. This is an actual palindrome, not the biologically common meaning of "reverse complement".
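A plain reversal is easy to check in Python, as this minimal sketch shows. The function `find_reversed_repeats` is a hypothetical helper that mimics p1=4...4 followed by the reverse of p1, with no mismatches allowed.

```python
def find_reversed_repeats(seq, width=4):
    """Yield (offset, p1, reversed p1) wherever p1 is followed by its reverse."""
    for i in range(len(seq) - 2 * width + 1):
        p1 = seq[i:i + width]
        if seq[i + width:i + 2 * width] == p1[::-1]:
            yield i, p1, p1[::-1]

print(list(find_reversed_repeats("AGGDDGGA")))
print(list(find_reversed_repeats("MKFAFLLFAFQ")))
```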
    all
    any(TS) 1...1 GP 1...1 G 4...4 any(LFIVM) G
    sp|P02461|CA13_HUMAN:[1120,1131]: S P GP A G QQGA I G
    sp|P02461|CA13_HUMAN:[1132,1143]: S P GP A G PRGP V G
    . . .
    sp|P42382|CH60_EHRCH:[29,40]    : T A GP K G LTVA I G
    sp|Q01149|CA21_MOUSE:[749,760]  : T K GP K G ENGI V G
    COMPLETED REQUEST

The first line indicates that PatScan searched the entire database, and the second line returns the pattern you input. Subsequent lines list the matches to this pattern found in the database. For example:

    sp|P02461|CA13_HUMAN:[1120,1131]: S P GP A G QQGA I G

can be broken up into 5 fields:

    sp                    Swiss-Prot, the database searched
    P02461                Swiss-Prot accession number
    CA13_HUMAN            Swiss-Prot ID
    [1120,1131]           position of the matched sequence
    S P GP A G QQGA I G   matched sequence

The output is in alphabetical order based on the first character of the first pattern unit (in this case S first, then T). The matched sequence is broken into 8 segments to match the 8 pattern units of the input pattern. In the above example this breaks up as:

    any(TS)  1...1  GP  1...1  G  4...4  any(LFIVM)  G
    |        |      |   |      |  |      |           |
    S        P      GP  A      G  QQGA   I           G
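If you post-process the results, each hit line can be split into these fields with a short script. The Python sketch below is one way to do it; the regular expression and the function `parse_hit` are assumptions based only on the sample lines in this document (both the sp|... and embl|... forms), not on a published output specification.

```python
import re

# identifier, [start,end], then the space-separated matched segments
HIT = re.compile(
    r"^(?P<id>\S+?)\s*:\s*\[(?P<start>\d+),(?P<end>\d+)\]\s*:\s*(?P<segments>.*)$"
)

def parse_hit(line):
    """Split one hit line into identifier parts, position, and segments."""
    m = HIT.match(line.strip())
    if m is None:
        raise ValueError("not a hit line: " + line)
    return {
        "id": m["id"].split("|"),          # e.g. database, accession, entry ID
        "start": int(m["start"]),
        "end": int(m["end"]),
        "segments": m["segments"].split(), # one segment per pattern unit
    }

hit = parse_hit("sp|P02461|CA13_HUMAN:[1120,1131]: S P GP A G QQGA I G")
print(hit)
# The segment lengths add up to the size of the matched region.
print(sum(len(s) for s in hit["segments"]) == hit["end"] - hit["start"] + 1)
```

Header lines such as the database name, the echoed pattern, and COMPLETED REQUEST do not fit this shape and raise ValueError, so a caller would need to skip them.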
    fungi
    p1=2...2 3...4 p2=4...5 3...3 ~p2 2...3 ~p1
    embl|A02534:[1009,1028] : aa acc cagc agg gctg gg tt
    embl|A06260:[1501,1521] : aa ata taag gaa ctta tga tt
    . . .
    embl|Z50840:[2998,3019] : tt agct ctct aga agag tgt aa
    embl|Z67741:[458,478]   : tt gcg gcgc ctt gcgc cag aa
    COMPLETED REQUEST

The first line indicates that PatScan searched the fungi database, and the second line returns the pattern you input. Subsequent lines list the matches to this pattern found in the database. For example:

    embl|A02534:[1009,1028] : aa acc cagc agg gctg gg tt

can be broken up into 4 fields:

    embl                        EMBL, the database searched
    A02534                      EMBL accession number
    [1009,1028]                 position of the matched sequence
    aa acc cagc agg gctg gg tt  matched sequence

The output is in alphabetical order based on the first character of the first pattern unit (in this case AA first, with TT last). The matched sequence is broken into 7 segments to match the 7 pattern units of the input pattern. In the above example this breaks up as:

    p1=2...2  3...4  p2=4...5  3...3  ~p2   2...3  ~p1
    |         |      |         |      |     |      |
    aa        acc    cagc      agg    gctg  gg     tt

You can access the EMBL database to get complete information about the nucleotide sequence in which the matched sequence occurs, using: