extends Object
implements Serializable
A compiled representation of a regular expression.
A regular expression, specified as a string, must first be compiled into an instance of this class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same pattern.
A typical invocation sequence is thus
    Pattern p = Pattern.compile("a*b");
    Matcher m = p.matcher("aaaaab");
    boolean b = m.matches();
A matches method is defined by this class as a convenience for when a regular expression is used just once. This method compiles an expression and matches an input sequence against it in a single invocation. The statement
    boolean b = Pattern.matches("a*b", "aaaaab");
is equivalent to the three statements above, though for repeated matches it is less efficient since it does not allow the compiled pattern to be reused.
Instances of this class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class are not safe for such use.
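The trade-off between a precompiled pattern and the one-shot Pattern.matches call can be sketched as follows. The class and method names (PatternReuseDemo, matchesCompiled, matchesOneShot) are illustrative assumptions, not part of java.util.regex; the expression is the same a*b used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternReuseDemo {
    // Compiled once. Pattern instances are immutable, so this constant may be shared by many threads.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Creates a fresh Matcher per call, because a Matcher holds match state and is not thread-safe.
    static boolean matchesCompiled(CharSequence input) {
        Matcher m = A_STAR_B.matcher(input);
        return m.matches();
    }

    // One-shot convenience form: compiles the expression again on every call.
    static boolean matchesOneShot(CharSequence input) {
        return Pattern.matches("a*b", input);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"aaaaab", "b", "aab!", ""}) {
            System.out.println("\"" + s + "\" -> " + matchesCompiled(s) + " / " + matchesOneShot(s));
        }
    }
}
```

Both methods give the same answers; the precompiled form is the one to prefer when the same expression is applied to many inputs.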
Summary of regular-expression constructs
Construct             Matches
Characters
x                     The character x
\\                    The backslash character
\0n                   The character with octal value 0n (0 <= n <= 7)
\0nn                  The character with octal value 0nn (0 <= n <= 7)
\0mnn                 The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh                  The character with hexadecimal value 0xhh
\uhhhh                The character with hexadecimal value 0xhhhh
\x{h...h}             The character with hexadecimal value 0xh...h (Character.MIN_CODE_POINT <= 0xh...h <= Character.MAX_CODE_POINT)
\t                    The tab character ('\u0009')
\n                    The newline (line feed) character ('\u000A')
\r                    The carriage-return character ('\u000D')
\f                    The form-feed character ('\u000C')
\a                    The alert (bell) character ('\u0007')
\e                    The escape character ('\u001B')
\cx                   The control character corresponding to x
Character classes
[abc]                 a, b, or c (simple class)
[^abc]                Any character except a, b, or c (negation)
[a-zA-Z]              a through z or A through Z, inclusive (range)
[a-d[m-p]]            a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]          d, e, or f (intersection)
[a-z&&[^bc]]          a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]         a through z, and not m through p: [a-lq-z] (subtraction)
Predefined character classes
.                     Any character (may or may not match line terminators)
\d                    A digit: [0-9]
\D                    A non-digit: [^0-9]
\s                    A whitespace character: [ \t\n\x0B\f\r]
\S                    A non-whitespace character: [^\s]
\w                    A word character: [a-zA-Z_0-9]
\W                    A non-word character: [^\w]
POSIX character classes (US-ASCII only)
\p{Lower}             A lower-case alphabetic character: [a-z]
\p{Upper}             An upper-case alphabetic character: [A-Z]
\p{ASCII}             All ASCII: [\x00-\x7F]
\p{Alpha}             An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}             A decimal digit: [0-9]
\p{Alnum}             An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}             Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}             A visible character: [\p{Alnum}\p{Punct}]
\p{Print}             A printable character: [\p{Graph}\x20]
\p{Blank}             A space or a tab: [ \t]
\p{Cntrl}             A control character: [\x00-\x1F\x7F]
\p{XDigit}            A hexadecimal digit: [0-9a-fA-F]
\p{Space}             A whitespace character: [ \t\n\x0B\f\r]
java.lang.Character classes (simple java character type)
\p{javaLowerCase}     Equivalent to java.lang.Character.isLowerCase()
\p{javaUpperCase}     Equivalent to java.lang.Character.isUpperCase()
\p{javaWhitespace}    Equivalent to java.lang.Character.isWhitespace()
\p{javaMirrored}      Equivalent to java.lang.Character.isMirrored()
Classes for Unicode scripts, blocks, categories and binary properties
\p{IsLatin}           A Latin script character (script)
\p{InGreek}           A character in the Greek block (block)
\p{Lu}                An uppercase letter (category)
\p{IsAlphabetic}      An alphabetic character (binary property)
\p{Sc}                A currency symbol
\P{InGreek}           Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]    Any letter except an uppercase letter (subtraction)
Boundary matchers
^                     The beginning of a line
$                     The end of a line
\b                    A word boundary
\B                    A non-word boundary
\A                    The beginning of the input
\G                    The end of the previous match
\Z                    The end of the input but for the final terminator, if any
\z                    The end of the input
Greedy quantifiers
X?                    X, once or not at all
X*                    X, zero or more times
X+                    X, one or more times
X{n}                  X, exactly n times
X{n,}                 X, at least n times
X{n,m}                X, at least n but not more than m times
Reluctant quantifiers
X??                   X, once or not at all
X*?                   X, zero or more times
X+?                   X, one or more times
X{n}?                 X, exactly n times
X{n,}?                X, at least n times
X{n,m}?               X, at least n but not more than m times
Possessive quantifiers
X?+                   X, once or not at all
X*+                   X, zero or more times
X++                   X, one or more times
X{n}+                 X, exactly n times
X{n,}+                X, at least n times
X{n,m}+               X, at least n but not more than m times
Logical operators
XY                    X followed by Y
X|Y                   Either X or Y
(X)                   X, as a capturing group
Back references
\n                    Whatever the nth capturing group matched
\k<name>              Whatever the named-capturing group "name" matched
Quotation
\                     Nothing, but quotes the following character
\Q                    Nothing, but quotes all characters until \E
\E                    Nothing, but ends quoting started by \Q
Special constructs (named-capturing and non-capturing)
(?<name>X)            X, as a named-capturing group
(?:X)                 X, as a non-capturing group
(?idmsuxU-idmsuxU)    Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X)    X, as a non-capturing group with the given flags i d m s u x on - off
(?=X)                 X, via zero-width positive lookahead
(?!X)                 X, via zero-width negative lookahead
(?<=X)                X, via zero-width positive lookbehind
(?<!X)                X, via zero-width negative lookbehind
(?>X)                 X, as an independent, non-capturing group
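The practical difference between the greedy, reluctant and possessive forms of the same quantifier is easiest to see on one input. The following is a minimal sketch; the class name QuantifierDemo, the helper firstMatch and the sample input "xfooxxxxxxfoo" are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the text of the first match found in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes as much as it can, then backs off until foo fits.
        System.out.println(firstMatch(".*foo", input));   // xfooxxxxxxfoo
        // Reluctant: .*? takes as little as it can and grows only when needed.
        System.out.println(firstMatch(".*?foo", input));  // xfoo
        // Possessive: .*+ takes everything and never gives any of it back, so foo cannot match.
        System.out.println(firstMatch(".*+foo", input));  // null
    }
}
```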
Backslashes, escapes, and quoting
The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language. A backslash may be used prior to a non-alphabetic character regardless of whether that character is part of an unescaped construct.
Backslashes within string literals in Java source code are interpreted as required by The Java Language Specification as either Unicode escapes (section 3.3) or other character escapes (section 3.10.6). It is therefore necessary to double backslashes in string literals that represent regular expressions to protect them from interpretation by the Java bytecode compiler. The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and leads to a compile-time error; in order to match the string (hello) the string literal "\\(hello\\)" must be used.
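The doubling rule can be checked with a small sketch. The class and method names below (EscapeDemo, matchesLiteralParens, hasWholeWordCat, containsBackspace) and the sample word "cat" are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The Java literal "\\(hello\\)" is the regular expression \(hello\), which matches (hello) literally.
    static boolean matchesLiteralParens(String input) {
        return Pattern.matches("\\(hello\\)", input);
    }

    // The Java literal "\\bcat\\b" is the regular expression \bcat\b: cat between two word boundaries.
    static boolean hasWholeWordCat(String input) {
        return Pattern.compile("\\bcat\\b").matcher(input).find();
    }

    // The Java literal "\b" is a one-character string holding a backspace, so the pattern matches that character.
    static boolean containsBackspace(String input) {
        return Pattern.compile("\b").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesLiteralParens("(hello)"));  // true
        System.out.println(hasWholeWordCat("concatenate"));   // false
        System.out.println(containsBackspace("a\bb"));        // true
    }
}
```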
Character Classes
Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&). The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes.
The precedence of character-class operators is as follows, from highest to lowest:
    1    Literal escape    \x
    2    Grouping          [...]
    3    Range             a-z
    4    Union             [a-e][i-u]
    5    Intersection      [a-z&&[aeiou]]
Note that a different set of metacharacters are in effect inside a character class than outside a character class. For instance, the regular expression . loses its special meaning inside a character class, while the expression - becomes a range forming metacharacter.
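Union, intersection and subtraction can be observed by filtering a string through a character class. This is a sketch only; CharClassDemo and its keep helper are illustrative names, and the test inputs are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Returns, in order, the characters of input that match the given character class.
    static String keep(String charClass, String input) {
        Matcher m = Pattern.compile(charClass).matcher(input);
        StringBuilder kept = new StringBuilder();
        while (m.find()) {
            kept.append(m.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String alphabet = "abcdefghijklmnopqrstuvwxyz";
        System.out.println(keep("[a-d[m-p]]", alphabet));     // union: abcdmnop
        System.out.println(keep("[a-z&&[def]]", alphabet));   // intersection: def
        System.out.println(keep("[a-z&&[^m-p]]", alphabet));  // subtraction: abcdefghijklqrstuvwxyz
        // Inside a class the dot is an ordinary character; a trailing hyphen is literal too.
        System.out.println(keep("[.-]", "a.b-c"));            // .-
    }
}
```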
Line terminators
A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:
    A newline (line feed) character ('\n'),
    A carriage-return character followed immediately by a newline character ("\r\n"),
    A standalone carriage-return character ('\r'),
    A next-line character ('\u0085'),
    A line-separator character ('\u2028'), or
    A paragraph-separator character ('\u2029').
If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters.
The regular expression . matches any character except a line terminator unless the DOTALL flag is specified.
By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence.
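The effect of the MULTILINE, DOTALL and UNIX_LINES flags on ^, $ and . can be sketched by counting matches. LineModeDemo and count are illustrative names, and the inputs are assumed sample data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineModeDemo {
    // Counts the matches of regex in input when compiled with the given flags (0 means no flags).
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\r\nthree";
        // Default: ^ and $ only anchor to the whole input, which is not a single word.
        System.out.println(count("^\\w+$", 0, lines));                   // 0
        // MULTILINE: ^ and $ also anchor next to each line terminator.
        System.out.println(count("^\\w+$", Pattern.MULTILINE, lines));   // 3
        // Default: the dot skips the newline. DOTALL: the dot matches it as well.
        System.out.println(count(".", 0, "a\nb"));                       // 2
        System.out.println(count(".", Pattern.DOTALL, "a\nb"));          // 3
        // UNIX_LINES: only \n is a line terminator, so the dot matches a lone \r.
        System.out.println(count(".", 0, "a\rb"));                       // 2
        System.out.println(count(".", Pattern.UNIX_LINES, "a\rb"));      // 3
    }
}
```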
Groups and capturing
Group number
Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
    1    ((A)(B(C)))
    2    (A)
    3    (B(C))
    4    (C)
Group zero always stands for the entire expression.
Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence may be used later in the expression, via a back reference, and may also be retrieved from the matcher once the match operation is complete.
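The numbering rule can be verified by reading every group back from the matcher. A minimal sketch follows; GroupNumberDemo and groups are hypothetical names, and "ABC" is an assumed input for the ((A)(B(C))) expression discussed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {
    // Returns groups 0..groupCount() after a full match, or an empty array if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] captured = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            captured[i] = m.group(i);
        }
        return captured;
    }

    public static void main(String[] args) {
        String[] g = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < g.length; i++) {
            // Group 0 is the whole match; groups 1 to 4 follow the order of the opening parentheses.
            System.out.println(i + ": " + g[i]);
        }
    }
}
```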
Group name
A capturing group can also be assigned a "name", making it a named-capturing group, and can then be back-referenced later by that "name". Group names are composed of the following characters. The first character must be a letter.
    The uppercase letters 'A' through 'Z' ('\u0041' through '\u005a'),
    The lowercase letters 'a' through 'z' ('\u0061' through '\u007a'),
    The digits '0' through '9' ('\u0030' through '\u0039').
A named-capturing group is still numbered as described in Group number.
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
Groups beginning with (? are either pure, non-capturing groups that do not capture text and do not count towards the group total, or named-capturing groups.
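Named groups, named back references and the retained-capture rule can be exercised together. The sketch below uses hypothetical names throughout (NamedGroupDemo, the group names year, month and ch, and the sample inputs); only the (a(b)?)+ case on "aba" comes from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {
    private static final Pattern YEAR_MONTH = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

    // Reads the same group by name and by number; a named group keeps its ordinary number.
    static String[] yearByNameAndNumber(String input) {
        Matcher m = YEAR_MONTH.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("year"), m.group(1) };
    }

    // Uses \k<ch> to refer back to whatever the named group ch captured.
    static String firstDoubledLetter(String input) {
        Matcher m = Pattern.compile("(?<ch>[a-z])\\k<ch>").matcher(input);
        return m.find() ? m.group() : null;
    }

    // On the second pass of the + loop (b)? matches nothing, so group 2 keeps its earlier value.
    static String[] retainedCapture() {
        Matcher m = Pattern.compile("(a(b)?)+").matcher("aba");
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", yearByNameAndNumber("2014-07")));  // 2014,2014
        System.out.println(firstDoubledLetter("hello"));                       // ll
        System.out.println(String.join(",", retainedCapture()));               // a,b
    }
}
```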
Unicode support
This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.
Unicode escape sequences such as \u2014 in Java source code are processed as described in section 3.3 of The Java Language Specification. Such escape sequences are also implemented directly by the regular-expression parser so that Unicode escapes can be used in expressions that are read from files or from the keyboard. Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
A Unicode character can also be represented in a regular expression by using its hex notation (hexadecimal code point value) directly as described in construct \x{...}; for example, a supplementary character U+2011F can be specified as \x{2011F}, instead of two consecutive Unicode escape sequences of the surrogate pair \uD840\uDD1F.
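The two routes by which a Unicode escape can reach the pattern, through the Java compiler or through the regular-expression parser, can be compared directly. This is a sketch under the assumption that the names UnicodeEscapeDemo, matchesDash and matchesSupplementary are free to choose; they are not library API.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // The compiler has already turned this escape into the single character U+2014.
    static final String DASH = "\u2014";
    // The supplementary character U+2011F, built from its code point (two chars in UTF-16).
    static final String SUPPLEMENTARY = new String(Character.toChars(0x2011F));

    static boolean matchesDash(String regex) {
        return Pattern.matches(regex, DASH);
    }

    static boolean matchesSupplementary(String regex) {
        return Pattern.matches(regex, SUPPLEMENTARY);
    }

    public static void main(String[] args) {
        // One-character pattern, escape resolved by the compiler.
        System.out.println(matchesDash("\u2014"));
        // Six-character pattern, escape resolved by the regular-expression parser.
        System.out.println(matchesDash("\\u2014"));
        // Code point notation versus the escaped surrogate pair.
        System.out.println(matchesSupplementary("\\x{2011F}"));
        System.out.println(matchesSupplementary("\\uD840\\uDD1F"));
    }
}
```

All four calls are expected to print true, although the pattern strings differ in length.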
Unicode scripts, blocks, categories and binary properties are written with the \p and \P constructs as in Perl. \p{prop} matches if the input has the property prop, while \P{prop} does not match if the input has that property.
Scripts, blocks, categories and binary properties can be used both inside and outside of a character class.
Scripts are specified either with the prefix Is, as in IsHiragana, or by using the script keyword (or its short form sc) as in script=Hiragana or sc=Hiragana.
The script names supported by Pattern are the valid script names accepted and defined by UnicodeScript.forName.
Blocks are specified with the prefix In, as in InMongolian, or by using the keyword block (or its short form blk) as in block=Mongolian or blk=Mongolian.
The block names supported by Pattern are the valid block names accepted and defined by UnicodeBlock.forName.
Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. As with scripts and blocks, categories can also be specified by using the keyword general_category (or its short form gc) as in general_category=Lu or gc=Lu.
The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative.
Binary properties are specified with the prefix Is, as in IsAlphabetic. The binary properties supported by Pattern are
    Alphabetic
    Ideographic
    Letter
    Lowercase
    Uppercase
    Titlecase
    Punctuation
    Control
    White_Space
    Digit
    Hex_Digit
    Noncharacter_Code_Point
    Assigned
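The different spellings for scripts, blocks, categories and binary properties can be tried against single code points. The following sketch assumes a helper named is in a class named UnicodePropertyDemo (both hypothetical); the sample code points are chosen for illustration.

```java
import java.util.regex.Pattern;

class UnicodePropertyDemo {
    // True if the single code point matches \p{property}.
    static boolean is(String property, int codePoint) {
        String input = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{" + property + "}", input);
    }

    public static void main(String[] args) {
        int hiraganaA = 0x3042;   // HIRAGANA LETTER A
        int greekAlpha = 0x03B1;  // GREEK SMALL LETTER ALPHA
        int mongolianA = 0x1820;  // MONGOLIAN LETTER A

        // Scripts: Is prefix, or the script / sc keyword.
        System.out.println(is("IsHiragana", hiraganaA) && is("sc=Hiragana", hiraganaA));
        // Blocks: In prefix, or the block / blk keyword.
        System.out.println(is("InGreek", greekAlpha) && is("blk=Mongolian", mongolianA));
        // Categories: bare name, optional Is prefix, or the general_category / gc keyword.
        System.out.println(is("Lu", 'A') && is("IsL", 'A') && is("gc=Lu", 'A'));
        // Binary property: Is prefix.
        System.out.println(is("IsAlphabetic", 'a'));
    }
}
```

Each line is expected to print true; an unknown property name would instead raise a PatternSyntaxException when the pattern is compiled.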
Predefined Character classes and POSIX character classes are in conformance with the recommendation of Annex C: Compatibility Properties of Unicode Regular Expression, when the UNICODE_CHARACTER_CLASS flag is specified.
Classes     Matches
\p{Lower}   A lowercase character: \p{IsLowercase}
\p{Upper}   An uppercase character: \p{IsUppercase}
\p{ASCII}   All ASCII: [\x00-\x7F]
\p{Alpha}   An alphabetic character: \p{IsAlphabetic}
\p{Digit}   A decimal digit character: \p{IsDigit}
\p{Alnum}   An alphanumeric character: [\p{IsAlphabetic}\p{IsDigit}]
\p{Punct}   A punctuation character: \p{IsPunctuation}
\p{Graph}   A visible character: [^\p{IsWhite_Space}\p{gc=Cc}\p{gc=Cs}\p{gc=Cn}]
\p{Print}   A printable character: [\p{Graph}\p{Blank}&&[^\p{Cntrl}]]
\p{Blank}   A space or a tab: [\p{IsWhite_Space}&&[^\p{gc=Zl}\p{gc=Zp}\x0a\x0b\x0c\x0d\x85]]
\p{Cntrl}   A control character: \p{gc=Cc}
\p{XDigit}  A hexadecimal digit: [\p{gc=Nd}\p{IsHex_Digit}]
\p{Space}   A whitespace character: \p{IsWhite_Space}
\d          A digit: \p{IsDigit}
\D          A non-digit: [^\d]
\s          A whitespace character: \p{IsWhite_Space}
\S          A non-whitespace character: [^\s]
\w          A word character: [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}]
\W          A non-word character: [^\w]
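How the UNICODE_CHARACTER_CLASS flag widens \w and \d beyond US-ASCII can be sketched as follows. UnicodeClassFlagDemo and fullMatch are illustrative names, and the sample strings are assumptions.

```java
import java.util.regex.Pattern;

class UnicodeClassFlagDemo {
    // Full match of input against regex, with or without UNICODE_CHARACTER_CLASS.
    static boolean fullMatch(String regex, String input, boolean unicodeClasses) {
        int flags = unicodeClasses ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9";     // ends in LATIN SMALL LETTER E WITH ACUTE
        String arabicDigit = "\u0661";     // ARABIC-INDIC DIGIT ONE

        System.out.println(fullMatch("\\w+", accented, false));   // false: \w is [a-zA-Z_0-9]
        System.out.println(fullMatch("\\w+", accented, true));    // true: \w follows the Unicode definition
        System.out.println(fullMatch("\\d", arabicDigit, false)); // false: \d is [0-9]
        System.out.println(fullMatch("\\d", arabicDigit, true));  // true: \d is \p{IsDigit}
        // The embedded flag (?U) switches the same behavior on from inside the expression.
        System.out.println(Pattern.matches("(?U)\\w+", accented)); // true
    }
}
```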
Categories that behave like the java.lang.Character boolean ismethodname methods (except for the deprecated ones) are available through the same \p{prop} syntax where the specified property has the name javamethodname.
Comparison to Perl 5
The Pattern engine performs traditional NFA-based matching with ordered alternation as occurs in Perl 5.
Perl constructs not supported by this class:
    Predefined character classes (Unicode character)
        \h    A horizontal whitespace
        \H    A non-horizontal whitespace
        \v    A vertical whitespace
        \V    A non-vertical whitespace
        \R    Any Unicode linebreak sequence \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
        \X    Match Unicode extended grapheme cluster
    The backreference constructs, \g{n} for the nth capturing group and \g{name} for named-capturing group.
    The named character construct, \N{name} for a Unicode character by its name.
    The conditional constructs (?(condition)X) and (?(condition)X|Y),
    The embedded code constructs (?{code}) and (??{code}),
    The embedded comment syntax (?#comment), and
    The preprocessing operations \l, \u, \L, and \U.
Constructs supported by this class but not by Perl:
    Character-class union and intersection as described above.
Notable differences from Perl:
    In Perl, \1 through \9 are always interpreted as back references; a backslash-escaped number greater than 9 is treated as a back reference if at least that many subexpressions exist, otherwise it is interpreted, if possible, as an octal escape. In this class octal escapes must always begin with a zero. In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.
    Perl uses the g flag to request a match that resumes where the last match left off. This functionality is provided implicitly by the Matcher class: Repeated invocations of the find method will resume where the last match left off, unless the matcher is reset.
    In Perl, embedded flags at the top level of an expression affect the whole expression. In this class, embedded flags always take effect at the point at which they appear, whether they are at the top level or within a group; in the latter case, flags are restored at the end of the group just as in Perl.
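The three differences listed above can each be probed with a short expression. This is a sketch only: PerlDifferencesDemo, findAll and fullMatch are hypothetical names, and the expressions and inputs are assumptions picked to make each rule visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PerlDifferencesDemo {
    // Counterpart of a Perl /g loop: each find() resumes where the previous match ended.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "a1b22c333"));    // [1, 22, 333]

        // Only one group exists, so \11 is read as back reference \1 followed by a literal 1.
        System.out.println(fullMatch("(a)\\11", "aa1"));     // true

        // (?i) takes effect from the point where it appears: a stays case-sensitive, b does not.
        System.out.println(fullMatch("a(?i)b", "aB"));       // true
        System.out.println(fullMatch("a(?i)b", "AB"));       // false

        // Inside a group the flag is restored at the closing parenthesis, so b is case-sensitive again.
        System.out.println(fullMatch("(?:(?i)a)b", "Ab"));   // true
        System.out.println(fullMatch("(?:(?i)a)b", "AB"));   // false
    }
}
```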
For a more precise description of the behavior of regular expression constructs, please see Mastering Regular Expressions, 3rd Edition, Jeffrey E. F. Friedl, O'Reilly and Associates, 2006.