
Regular Expressions
A simple and powerful way to match characters
Laurent Falquet, EPFL March, 2005
Swiss Institute of Bioinformatics
Swiss EMBnet node

Regular Expressions
What is a regular expression?
Literal (or normal characters)
Alphanumeric
abcABC0123...

Punctuation
-_ ,.;:=()/+ *%&{}[]?!$^|\<>"@#

Metacharacters
Ex: ls *.java

Flavors
awk, egrep, Emacs, grep, Perl,
POSIX, Tcl, PROSITE !

Example: PROSITE patterns are regular expressions

Pattern: <A-x-[ST](2)-x(0,1)-{V}
Perl Regexp: ^A.[ST]{2}.?[^V]
Text: The sequence must start with an alanine,
followed by any amino acid, followed by a serine
or a threonine, two times, followed by any amino
acid or nothing, followed by any amino acid
except a valine.
Only the syntax differs.

Regular Expressions (1)

In Perl: //

Start and end of line
^ start, $ end

Match any of several
[] or (|)

Match 0, 1 or more
. 1 of any
? 0 or 1
+ 1 or more
* 0 or more
{m,n} range
! negation (as in the !~ operator; inside [] use ^, e.g. [^V])

Examples
Match every instance of a SwissProt AC
m/[OPQ][0-9][A-Z0-9]{3}[0-9]/;
m/[OPQ]\d[A-Z0-9]{3}\d/;

Match every instance of a SwissProt ID
m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;

Regular Expressions (2)

Escape character or back reference
\char or \num

Shorthand
\d digit [0-9]
\s whitespace [space\f\n\r\t]
\w character [a-zA-Z0-9_]
\D\S\W complement of \d\s\w

Byte notation
\num character in octal
\xnum character in hexadecimal
\cchar control character

Match operator
m//
$var =~ m/colou?r/;
$var !~ m/colou?r/;

Substitution operator
s///
$var =~ s/colou?r/couleur/;

Translate operator
tr///
$revcomp =~ tr/ACGT/tgca/;

Modifiers //#
/i case insensitive
/g global match
Many other /s,/m,/o,/x...

Regular Expressions (3)

Grouping

External reference
$var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;

Internal reference
$var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;

Numbering
$1 to $9
$10 to more if needed...

Exercises

Create a regexp to recognize any pseudo IP address:
012.345.678.912
Create a regexp to recognize any email address:
Jean.Dupond@isb-sib.ch
Create a regexp to change any HTML tag to another
<address> -> <pre>

On sib-dea:
use visual_regexp-1.2.tcl to check your regular expressions
(requires X-windows)

Regular Expressions (4)

Solution RegExp

/(\d{1,3}\.){3}\d{1,3}/

/\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/

s/\<(\/?)address\>/\<$1pre\>/
generalized:

address = \w+

Perl In-liners

In-liners: some options

-a autosplit (only with -n or -p)


-c check syntax
-d debugger
-e pass script lines
-h help
-i direct editing of a file
-n loop without print
-p loop with print
-v version

Example:

perl -e 'print qq(hello world\n);'

In-liners: -n and -p

perl -pe 's/\r/\n/g' <file>

is equivalent to:

open(READ, "file");
while (<READ>) {
    s/\r/\n/g;
    print;
}
close(READ);

perl -i -pe 's/\r/\n/g' <file>
Warning: the -i option modifies the file directly.
perl -ne is the same without the print.

In-liners: -a (only with -n or -p)

perl -ane 'print @F, "\n";' <file>

is equivalent to:

open(READ, "file");
while (<READ>) {
    @F = split(' ');
    print @F, "\n";
}
close(READ);

Example:

hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'

In-liners: -a (only with -n or -p)

hits -b 'sw' -o pff2 prf:CARD

sw:ICEA_XENLA  1     90    prf:CARD  5  -1   18.553
sw:RIK2_MOUSE  435   513   prf:CARD  5  -11  15.058
sw:CARC_HUMAN  1     88    prf:CARD  6  -1   15.395
sw:NAL1_HUMAN  1380  1463  prf:CARD  7  -1   15.058
sw:ASC_HUMAN   113   195   prf:CARD  7  -2   15.374
sw:CAR8_HUMAN  347   430   prf:CARD  8  -1   18.343
sw:CARF_HUMAN  134   218   prf:CARD  9  -1   12.932

hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'

18.553  -1   5  prf:CARD  90    1     sw:ICEA_XENLA
15.058  -11  5  prf:CARD  513   435   sw:RIK2_MOUSE
15.395  -1   6  prf:CARD  88    1     sw:CARC_HUMAN
15.058  -1   7  prf:CARD  1463  1380  sw:NAL1_HUMAN
15.374  -2   7  prf:CARD  195   113   sw:ASC_HUMAN
18.343  -1   8  prf:CARD  430   347   sw:CAR8_HUMAN
12.932  -1   9  prf:CARD  218   134   sw:CARF_HUMAN

In-liners: examples

perl -e 'print int(rand(100)),"\n" for 1..100' | perl -e '$x{$_}=1 while <>;print sort {$a<=>$b} keys %x'

for($i=0;$i<100;$i++) {
$nb = int(rand(100));
$hash{$nb} = 1;
}
print sort {$a<=>$b} keys %hash;

In-liners: extract FASTA from SP


open(READ, "/db/proteome/ECOLI.dat"); # open file
while ($line = <READ>) { # read line by line until the end
    if ($line =~ /^ID +(\w+)/) { print ">$1\n"; } # print fasta header
    if ($line =~ /^ /) {
        $line =~ s/ //g; # remove spaces
        print $line; # print sequence line
    }
}
close(READ);

cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}'

In-liners: your turn

Create an in-liner that extracts non-redundant FASTA format sequences from a redundant database in SwissProt format

cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}' | perl -e 'while(<>) { if (/>/) { $i=$_; $x{$i}="" } $x{$i}.=$_ } print values %x'
