Regular expression: a system of pattern masks to describe strings to search for, a sort of more generalised wildcarding. You can use regexes to scan text to extract data embedded in boilerplate, or to replace boilerplate patterns.
Regexes are notoriously difficult to proofread and debug. You must test them (with unit tests) against every pathological data string and every corner case you can think of.
Introduction
Other Regex Engines
Quoting, why you need \\\\
Recipes for Quoting
Regex Variations Table
Multiple Characters
Awkward Characters
Terminology
Pattern Flags
Named Fragments
Examples
Negative Regex
Matching vs Finding vs LookingAt
Splitting
Replacing
Tips
String
Books
Learning More
Links

Introduction
Java version 1.4 introduced the java.util.regex package. If regexes don't work, use Wassup to check the version of Java you are using. You may be inadvertently using an old one. Perl-like regex expressions are compiled into a Pattern (parsed into an internal state machine format, not byte code). You don't use a constructor to create a Pattern; you use the static method Pattern.compile( String ). Then you create a Matcher object with Pattern.matcher( String ), feeding it the String you wonder if matches the pattern. Finally, you call Matcher.matches to see if the String fits the pattern. There are many other things you can do, for example, find multiple matches in your String.
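As a minimal sketch of the compile, matcher, matches sequence described above: the class name, the sample pattern and the sample strings are illustrative assumptions, not part of the original text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexBasics {
    // Made-up pattern for illustration: lower-case letters followed by one digit.
    private static final Pattern WORD_THEN_DIGIT = Pattern.compile("[a-z]+[0-9]");

    /** True only if the whole candidate fits the pattern. */
    static boolean fits(String candidate) {
        Matcher m = WORD_THEN_DIGIT.matcher(candidate);
        return m.matches();
    }

    /** Counts every occurrence of the pattern inside text. */
    static int countOccurrences(String text) {
        Matcher m = WORD_THEN_DIGIT.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(fits("abc1"));                        // true: the whole string fits
        System.out.println(fits("abc1 def2"));                   // false: extra text
        System.out.println(countOccurrences("abc1 def2 ghi"));   // 2
    }
}
```

Matcher.matches insists on the entire String, while the find loop picks out each occurrence in turn.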

Regex cannot do tasks like look for balanced ( ) or deal with a simple precedence grammar. For that you need a parser. Regexes are also very awkward if your fields are not in some standard order. They drive you nuts analysing HTML form parameters, for example, where the parameters can come in any order. They are great when data come in some standard order, with some fields missing, with alternate forms, and with variable separators.

Regexes will drive you insane like no other kind of computer programming will. You can stare at them for hours and have no clue why they fail to match. If you change them the tiniest bit, they will refuse to work. The problem is they are black boxes. You can't watch them work to figure out why and where they are failing. Failures are often subtle reluctant/greedy issues, or a failure in a totally different part of the regex than you presumed. Escaping requires great precision since there are two escaping/quoting mechanisms interacting, one for regex and one for Java String literals.

Other Regex Engines

Daniel Savarese has written a second regex package based on Perl regexes. Look at the Apache Jakarta project. IBM Alphaworks has one. Jakarta-ORO (née OROMatcher) lets you add regex ability to your own Java programs. Funduc Search and Replace is a utility for doing global search and replace on files using regular expressions. The Quoter Amanuensis helps you compose regex expressions for Funduc Search and Replace. SlickEdit® is a text editor that supports several kinds of regular expressions for global search and replace. Forté Agent newsreader has a regular expression scheme for describing junk filters. However, it is completely unlike Java regex. It is more like Google search expressions.

Quoting, why you need \\\\

Reserved characters, aka metacharacters, are command characters that have special meaning in regexes. They must be quoted when you mean them literally, as just characters. This does not mean you must enclose them in quotation marks, but rather you must specially mark them as meant literally by preceding them with a \, e.g. \- \+ \?. If you are unsure, quote. It won't hurt to quote punctuation that does not need it. However, don't quote : in Vslick since \:… has special meaning.

Unfortunately, the regex people used the same quoting character \ as the designers of Java did for String literals. In a non-regex Java String literal, every literal \ must be doubled. In a regex, every literal \ must be doubled. So when you express a regex as a Java String literal, every literal \ must be quadrupled and written as \\\\.
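A small sketch of the two quoting layers stacking up. The class, field and method names here are hypothetical; only Pattern.compile and Matcher.replaceAll come from java.util.regex.

```java
import java.util.regex.Pattern;

final class BackslashQuoting {
    // Java literal "\\\\" holds the regex text \\ which matches one literal backslash.
    static final Pattern ONE_BACKSLASH = Pattern.compile("\\\\");
    // Java literal "\\[" holds the regex text \[ which matches a literal left bracket.
    static final Pattern LEFT_BRACKET = Pattern.compile("\\[");

    /** Replaces every backslash in path with a forward slash. */
    static String slashesToForward(String path) {
        return ONE_BACKSLASH.matcher(path).replaceAll("/");
    }

    public static void main(String[] args) {
        String windowsPath = "C:\\temp\\a.txt";   // the String itself holds C:\temp\a.txt
        System.out.println(slashesToForward(windowsPath));          // C:/temp/a.txt
        System.out.println(LEFT_BRACKET.matcher("x[0]").find());    // true
    }
}
```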

When you compose a regex String on the fly, character by character, Java String literal quoting is no longer at play. There you merely need to double each \. Be especially careful with File.separatorChar in regexes composed on the fly. If it is \, it must be doubled.
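One way this might look in practice, assuming a hypothetical helper that builds the regex text for a separator character at run time. Only a \ separator needs the extra quoting character; / and : are not regex commands.

```java
import java.io.File;
import java.util.regex.Pattern;

final class SeparatorRegex {
    /** Builds regex text that matches the separator literally (hypothetical helper). */
    static String separatorAsRegex(char separator) {
        StringBuilder sb = new StringBuilder();
        if (separator == '\\') {
            sb.append(separator);   // the extra \ that quotes the separator for the regex engine
        }
        sb.append(separator);
        return sb.toString();
    }

    /** Splits a path into its pieces on the given separator. */
    static String[] splitPath(String path, char separator) {
        return Pattern.compile(separatorAsRegex(separator)).split(path);
    }

    public static void main(String[] args) {
        System.out.println(splitPath("C:\\temp\\a.txt", '\\').length);   // 3
        System.out.println(splitPath("usr/local/bin", '/').length);      // 3
        // In real code you would pass File.separatorChar.
        System.out.println(separatorAsRegex(File.separatorChar));
    }
}
```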

Java 1.4.1+ also offers \Q … \E for quoting long passages without having to quote command characters individually. You still have to quote for String literals though.

The quoter amanuensis will let you compose your literal regex strings, then convert them to deal with both regex and Java \\ quoting.
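A minimal sketch of \Q … \E quoting inside a Java String literal. The phrase being searched for is an invented example. Pattern.quote (Java 1.5 or later) is printed only to show that it produces the same kind of wrapper.

```java
import java.util.regex.Pattern;

final class LiteralQuoting {
    // Regex text: \Q1+1=2 (really?)\E ; only the Java-literal doubling of \ remains.
    static final Pattern QUOTED = Pattern.compile("\\Q1+1=2 (really?)\\E");

    /** True if text contains the phrase exactly as written, punctuation and all. */
    static boolean containsPhrase(String text) {
        return QUOTED.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("He wrote 1+1=2 (really?) on the board"));   // true
        System.out.println(containsPhrase("He wrote 11=2 (really) on the board"));     // false
        System.out.println(Pattern.quote("1+1=2 (really?)"));   // \Q1+1=2 (really?)\E
    }
}
```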

In Java version 1.5 or later, Pattern.quote( String ) will do the same thing the quoter amanuensis does to a String to give you the equivalent regex, properly quoted to match it literally. It just mindlessly sandwiches the string in \Q … \E, whether it needs it or not.

Recipes for Quoting

Desired effect | raw string | as Java String literal | raw regex | regex as Java String literal
left bracket, acting as a regex command character | [ | "[" | [ | "["
left bracket, reserved regex command character acting as a literal [ | [ | "[" | \[ | "\\["
A literal newline character | (newline) | "\n" | \n | "\\n"
A literal carriage return character | (carriage return) | "\r" | \r | "\\r"
A literal double quote character, magic to Java, nothing special to regex | " | "\"" | " | "\""
A literal single quote character, magic to Java, nothing special to regex | ' | "\'" | ' | "\'"
A literal backslash character, magic to both Java and regex | \ | "\\" | \\ | "\\\\"

Regex Variations Table

I use three different regex engines many times a day. I have a heck of a time remembering which commands work with which one. So I composed this table. Lucky I don't need Perl too.

The columns are, in order: Java, SlickEdit, Funduc, notes.
Reserved characters in search strings
  Java:      $ ( ) * + - . ? [ \ ] ^ { | }
             Prune this string to get just the chars you want:
             !"#\$%&'\(\)\*\+,\-\./0-9:;<=>\?@A-Z\[\\\]\^_`a-z\{\|\}
  SlickEdit: * + . - ? [ \ ] { | }
             Prune this string to get just the chars you want:
             !"#$%&'()\*\+,\-\./0-9:;<=>\?@A-Z\[\\\]^_`a-z{\|}
  Funduc:    ! $ ( ) * + - ? [ \ ] ^ |
             Prune this string to get just the chars you want:
             \!"#\$%&'\(\)\*\+,\-./0-9:;<=>\?@A-Z\[\\\]\^_`a-z\{\|\}
  Notes:     Reserved metacharacters in search strings must be \-quoted if used as data chars, e.g. \+ \* \|. If in doubt, quote. It won't hurt.

Reserved characters in replace strings
  Java:      \ $
  SlickEdit: \
  Funduc:    % < > \
  Notes:     Reserved metacharacters in replace strings must be \-quoted if used as data chars, e.g. \% \\ \< \>. If in doubt, quote. It won't hurt. In Java, you can abbreviate [a-z\.] as [a-z.] since . is clearly a character, not a command, inside [].
Any character
  Java:      .
  SlickEdit: .
  Funduc:    .
  Notes:     Matches anything. In Java . sometimes matches CR and LF and sometimes not.
Zero or more
  Java:      greedy: *   reluctant: *?
  SlickEdit: *
  Funduc:    *
  Notes:     Zero or more of the preceding thing. .* matches anything. In Funduc, the * comes before the thing repeated, e.g. *[] to match anything even over multiple lines. In Java and SlickEdit, the * comes after, e.g. [a-z]*. Normally you want .*?, the reluctant form, instead of .* for wildcard matching. As a rule of thumb, if your regex is matching too long a string, try replacing a greedy quantifier with a reluctant one. The documentation misled me. It made it sound as if reluctant would only ever match a single character (pretty lame), but that is not so. It just finds the first match to your complete regex.
One or more
  Java:      greedy: +   reluctant: +?   possessive: ++
  SlickEdit: +
  Funduc:    +
  Notes:     One or more of the preceding thing. In Funduc, the + comes before the thing repeated, e.g. +[0-9\,\.\+\-] to crudely match a number. In Java and SlickEdit, the + comes after, e.g. [0-9\,\.\+\-]+.
Exactly one
  Java:      {1}
  SlickEdit: {1}
  Funduc:    default
  Notes:     Exactly one of the preceding thing, similarly for any {n}. Here is a cute trick to use this Java feature to count characters, inserting a dash between pairs of characters:
                 // insert a dash between pairs of chars
                 String cute = "AA54BG4G3G".replaceAll( "(\\w{2})(?!$)", "$1-" );
                 // cute is "AA-54-BG-4G-3G"
Zero or one
  Java:      greedy: ?   reluctant: ??
  SlickEdit: ?
  Notes:     Zero or one of the preceding thing, e.g. (abc)? will match "" or "abc".
Group
  Java:      capturing: ( … )   non-capturing: (?: … )
  SlickEdit: ( … )
  Funduc:    ( … )
  Notes:     Delimits a group of characters or patterns. The characters matching the group will show up when you call group(i). However, they won't if you make the group non-capturing.
Not character
  Java:      ^
  SlickEdit: ~
  Funduc:    ?!()
  Notes:     Not character operator, e.g. in Java, [^abc] means anything but a, b or c. In other contexts ^ means start of line. In VSlick [~abc] means the same. In Funduc it works only on expressions: xref?!(=) finds the letters xref followed by anything but =. In Java you can say [a-z&&[^m-p]] to get a through z, except m through p.
Not expression
  Java:      (?!X)
  SlickEdit: ~
  Funduc:    !
  Notes:     Not expression operator. In Java, anything but X, via zero-width negative lookahead. After the non-match, you continue where you left off, not at the end of the non-matching string. In Java, you might search for a word beginning with l but not lion like this: "(?!lion)l[a-z]+ ". (?! looks ahead, and aborts the match if it sees the undesirable pattern. In Funduc xref?!(=) finds the letters xref followed by anything but =.
Or
  Java:      |
  SlickEdit: |
  Funduc:    |
  Notes:     Infix or operator, (cat|dog) matches cat or dog. Each parenthesised alternate will get its own dedicated group(i) slot.
Any character but newline
  Java:      .
  SlickEdit: .
  Funduc:    ?
  Notes:     Any char but newline. To make dot also match newline, in Java, embed (?s) early in the string. (?s) does not match anything, it just switches mode. You can also turn the mode on with the Pattern.compile flag DOTALL. You can turn it off again with (?-s). Use plain . not [.] because inside square brackets dot just means a literal period, not any-character.
Newline
  Java:      \r\n
  SlickEdit: \n
  Funduc:    \r\n
  Notes:     newline, given for Windows.
Start of line
  Java:      ^
  SlickEdit: ^
  Funduc:    ^
  Notes:     Start of Line. In other contexts ^ means not. See notes on $.
End of line
  Java:      $
  SlickEdit: $
  Funduc:    $
  Notes:     End of Line. For Windows, matches a pair of characters \r\n. For Linux matches \n. For Mac matches \r.
             - For Java, $ means end of string not end of line, unless you turn on multiline mode by embedding (?m) first. You can turn it off again with (?-m). You can also turn it on with Pattern.compile( xxx, Pattern.MULTILINE ).
             - This is subtle, and will drive you nuts. ^ and $ do not match any actual character. In multiline (?m) mode, they match an empty string right after or before a line terminator. Most of the time use (?s) to turn on DOTALL mode and allow \s*? to swallow newlines as if they were spaces.

Start of File: ^^

End of File: $$
Range
  Java:      []
  SlickEdit: []
  Funduc:    []
  Notes:     Range operator, list of chars. [ab] means match a or b. [a-z] matches any character in the range a through z. [0-9] is a digit. [a-z] is lower case. [A-Z] is upper case. [ -~] (space dash tilde) is any printable ASCII char.
             In Funduc, you don't need parentheses around [a-z] in the search string.
             Keep strings of selection characters inside [] in alphabetical order. It will make proofreading and comparing regexes easier, e.g. [ a-z0-9\"%&'\\(\\)\\-./:;\\?=_]. The quoter amanuensis will compute the span of any string, a canonical regex expression that will hop over the string. It will create tidy complex range expressions sorted in alphabetical order.

Negated range
  Java:      [^, ]
  SlickEdit: [~, ]
  Funduc:    n/a
  Notes:     any character except a comma or space

Range with exclusion
  Java:      [a-z&&[^bc]]
  SlickEdit: n/a
  Funduc:    n/a
  Notes:     a through z, except for b and c
Sub-expression
  Java:      ()
  SlickEdit: ()
  Funduc:    ()
  Notes:     Sub-Expression. In Funduc, you don't need parentheses around *[a-z] in the search string. Further, you must not use them!

Column Operator: +n
Back reference
  Java:      $1
  SlickEdit: \1 \2 etc.
  Funduc:    %1 %2 etc., %1< (to lower case)
  Notes:     Back reference to tagged expression #1, for use in the replace string.
             E.g. in SlickEdit, to replace all occurrences of <span class="jmethod"> used before an upper case name, converting them to <span class="jclass">:
                 Search string : <span class="jmethod">([A-Z])
                 Replace string : <span class="jclass">\1
             Remember to turn exact case matching on for these to work.
             In Funduc, you don't need parentheses around [a-z] in the search string. [a-z]* in Funduc will put the first character in %1 and the rest of the match in %2, very confusing.
             Java regex has only very primitive replace ability. With Matcher.replaceAll, every match is replaced by the same replacement string, with $1 $2 etc. to bring over matched pieces from the original String. However, in Java you can also use \1 in the search string to insist on a match for some expression found earlier in the string, i.e. a repeated pattern, most commonly used to make sure single or double quotes balance.
             The IntelliJ editor uses standard Java regex, including $1 to mark a replacement parameter.
Example: replace all (x with ( x, but only if x is alphabetic or ( or "
  Java:      search: \(([a-zA-Z\(\"])   replace: \( $1
  SlickEdit: search: \(([a-zA-Z\(\"])   replace: \( \1
  Funduc:    search: \([a-zA-Z\(\"]     replace: \( %1

Single white space
  Java:      \s = [ \t\n\x0B\f\r]
  SlickEdit: [ \t\n]
  Funduc:    [ \t\r\n]

One or more white spaces
  Java:      \s+
  SlickEdit: \:b
  Funduc:    +[ \t\r\n]
  Notes:     In Java, one or more of [ \t\n\x0B\f\r]. Watch out, matches line end as well!

Zero or more white spaces
  Java:      \s*
  SlickEdit: [ \t\r\n]*
  Funduc:    *[ \t\r\n]
  Notes:     In Java, [ \t\n\x0B\f\r]*. Watch out, matches line end as well!

Single non white space
  Java:      \S
  SlickEdit: [^ \t\n]
  Funduc:    [! \t\r\n]
  Notes:     anything but white space (blank, tab)

One or more non-white spaces
  Java:      \S+
  SlickEdit: [^ \n\t]+
  Funduc:    +[! \r\n\t]

Alphabetic word
  Java:      (\p{Alpha}+)
  SlickEdit: \:w
  Funduc:    +[A-Za-z]
  Notes:     string of A-Z a-z

Number
  Java:      ([0-9\,\.\+\-]+)
  SlickEdit: ([0-9\,\.\+\-]+)
  Funduc:    +[0-9\,\.\+\-]
  Notes:     string of digits, commas, decimal points and signs

Quoted String
  Java:      \"(\\\"|([ A-Za-z\'\[\]\+\=\!\@\#\$\%\^\&\*\(\)\<\>\:\;\?\|\\]*))\"
  SlickEdit: \:q
  Funduc:    \"(\\\"(*[ A-Za-z\'\[\]\+\=\!\@\#\$\%\^\&\*\(\)\<\>\:\;\?\|\\]*))\"
  Notes:     It is easier just to quote all punctuation sometimes. It is easier to proofread. Don't quote : in Vslick since \:… has special meaning.
Predefined match strings
  Java:
    \d = digit = [0-9]
    \D = non digit = [^0-9]
    \s = single whitespace char = [ \t\n\x0B\f\r]
    \S = not whitespace = [^\s]
    \w = single alphanumeric char = [a-zA-Z_0-9]
    \W = not alphanumeric = [^\w]
    The following are all case-sensitive. You must specify \p{Lower} not \P{lower} etc. \p{Lower} overrides CASE_INSENSITIVE. Even then it will not match upper case letters.
    \p{Lower} = [a-z]
    \p{Upper} = [A-Z]
    \p{ASCII} = [\x00-\x7F]
    \p{Alpha} = [A-Za-z]
    \p{Digit} = [0-9]
    [\p{Digit}\.]+ = [0-9\.]+ decimal number
    \p{Alnum} = [A-Za-z0-9]
    \p{Punct} = [!"#\$%&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}~]
    \p{Graph} = [\p{Alnum}\p{Punct}]
    \p{Print} = [\p{Graph}\x20]
    \p{Blank} = [ \t] c.f. \s
    \p{Cntrl} = [\x00-\x1F\x7F]
    \p{XDigit} = [0-9a-fA-F]
    \p{Space} = [ \t\n\x0B\f\r] c.f. \s
    \p{Lu} = upper case letter
    \p{InGreek} = Greek letter
    \p{Sc} = a currency symbol
    [\p{L}&&[^\p{Lu}]] = any letter except an upper case letter
    (?i) = turn on case insensitive mode
    (?-i) = turn on case sensitive mode
  SlickEdit:
    \:a alphanumeric char = [A-Za-z0-9]
    \d0-\d27 ASCII codes 0…27 specified as 8-bit decimal.
    \:b blanks = ([ \t]+)
    \:c alpha char = [A-Za-z]
    \:d digit = [0-9]
    \:f filename part
    \:h hex = ([0-9A-Fa-f]+)
    \:i int = ([0-9]+)
    \:n float
    \:p path
    \:q quoted string
    \:v C language variable name = ([A-Za-z_$][A-Za-z0-9_$]*)
    \:w word = ([A-Za-z]+)
  Notes:     Predefined match strings, e.g. \:w = ([A-Za-z]+) matches a word. Those are braces in \p{Alnum}, not parentheses. It can be hard to tell in some typefaces. The strings are case sensitive, and when used in Java source code such strings must be coded as "\\p{Alnum}". \d \D \s \S \w \W \p{Lower} etc. will also work inside […].
             \p{Lower} is not quite identical to [a-z]. If you have CASE_INSENSITIVE, \p{Lower} will only match lower case letters while [a-z] will also match upper case ones.

Counted repetition, groups and replace variables
  Java:
    X{n} X{n,m}
    capturing ( … )
    non-capturing (?: … )
    greedy: +
    reluctant: +?
    possessive: ++
  Funduc:
    %%srpath%%
    %%srfile%%
    %%srfiledate%%
    %%srfiletime%%
    %%srfilesize%%
    %%srdate%%
    %%srtime%%
    %%envvar=fruit%%
  Notes:     X{n,m} means X appears n to m times. X{n} means X appears exactly n times. X{n,} means X appears at least n times.

This table only covers the most common magic characters. See the documentation for each regex package for details.

Multiple Characters

[A-Z]          A single upper-case letter
[A-Z]*         Zero or more upper-case letters
[A-Z]+         One or more upper-case letters
[A-Z][A-Z]     Exactly two upper-case letters
[A-Z]{2}       Exactly two upper-case letters (same as above)
[A-Z]{2,}      Two or more upper-case letters
[A-Z]{2,10}    Between 2 and 10 (inclusive) upper-case letters
[a-zA-Z]       A single letter, upper- or lower-case

Awkward Characters
Here is how to represent various awkward characters. The forms shown represent the combined quoting needs for Java String literals and regex Patterns.

\\\\      The literal backslash character. You must double the \ twice since \ is the quoting character in both Java String literals and regex.
\\xhh     The character with hexadecimal value 0xhh, e.g. \\xff. Only works with two hex digits!
\uhhhh    The character with hexadecimal value 0xhhhh, e.g. \u20ac. Must always have exactly four hex digits. Don't use for control characters, e.g. 0..ff, since \u expansion happens prior to compilation. In other words \u000a will start a new line in your program. Note there is only one lead \.
\\t       The tab character \u0009
\\n       The newline (line feed) character \u000a
\\r       The carriage-return character \u000d
\\x0c     The form-feed character \u000c
\\a       The alert (bell) character \u0007. \a itself is illegal in Java Strings.
\\e       The escape character \u001b
\\c       Control characters, e.g. \\cq for ctrl-q.
\\-       Literal -, not a regex range operator.
\\+       Literal +, not a regex operator.
\\*       Literal *, not a regex operator.
\\?       Literal ?, not a regex operator.
\\(       Literal (, not a regex expression bracketer.
\\)       Literal ), not a regex expression bracketer.
\\[       Literal [, not a regex expression bracketer.
\\]       Literal ], not a regex expression bracketer.
\\{       Literal {, not a regex expression bracketer.
\\}       Literal }, not a regex expression bracketer.
\\|       Literal |, not a regex operator.
\\$       Literal $, not a regex end of line.
\\^       Literal ^, not a regex operator.
\\<       Literal <, not a regex operator.
\\=       Literal =, not a regex operator.

Terminology
Pattern.CASE_INSENSITIVE is a flag you can feed to Pattern.compile to do case-insensitive searches. This is much easier than trying to do them directly in the regex strings.

Java 1.4.1+ regexes have assertions, extra conditions placed on the match. Colourful regex terminology includes:

- capturing means characters are accumulated for Matcher.group.
- greedy means find the longest possible match (consuming the most text). For example, if you applied the greedy regex .+ to abc, you would get abc.
- lookahead means it looks ahead for X.
- lookbehind means it looks behind for X.
- negative means the match fails if it finds X.
- positive means the match succeeds if it finds X.
- possessive means greedily match as much as you can and do not back off, even when doing so would allow the overall match to succeed.
- reluctant means find the shortest/first possible match. If you applied the reluctant regex .+? to abc, you would just get a.
- zero-width means it doesn't actually consume any characters, or prevent them from being used in further matching.
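The greedy, reluctant and possessive terms above can be tried out with a sketch like the following. The helper name firstMatch is an assumption made for illustration; the quantifiers themselves are standard java.util.regex syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierKinds {
    /** Returns the first match of regex in input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(".+", "abc"));     // abc  (greedy)
        System.out.println(firstMatch(".+?", "abc"));    // a    (reluctant)
        System.out.println(firstMatch(".*c", "abc"));    // abc  (greedy .* backs off one char so c can match)
        System.out.println(firstMatch(".*+c", "abc"));   // null (possessive .*+ never backs off)
    }
}
```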

The easiest way to understand these terms is to experiment with the various regex operators on simple strings. You can make yourself a test program that reads strings from the console. That way, at least you can avoid having to deal with Java \ string quoting. You only need concern yourself with regex \ quoting. You can also use the Quoter Amanuensis to first apply regex quoting, then Java string quoting, and let you paste the result into your program.

Pattern Flags

You can specify flags with Pattern.compile( String regex, int flags ), or embed the equivalent (?…) command in the regex itself. By default regexes are case-sensitive.

CASE_INSENSITIVE, embedded (?i)
    Makes case not matter when matching; s matches S. Even if you use it, \p{Lower} will not match upper case letters. [a] will match A though.
MULTILINE, embedded (?m)
    Makes ^ and $ match at embedded newlines. You might expect embedded newlines to match by default, but they don't. For Java, $ means end of string not end of line, unless you turn on multiline mode by embedding (?m) first. You can turn it off again with (?-m). You can also turn it on with Pattern.compile( xxx, Pattern.MULTILINE ).
DOTALL, embedded (?s)
    Makes . match any character, including a line terminator. By default . does not match line terminators.
UNICODE_CASE, embedded (?u)
    Used in conjunction with CASE_INSENSITIVE to use the elaborate case-folding schemes to compare Unicode upper and lower case. By default, the presumption is that all characters being matched are US-ASCII.
CANON_EQ, no embedded form
    Treats canonically accented characters done with a single char or with a pair as equivalent, e.g. å: the pair "a\u030A" is treated the same as the single character "\u00E5".
UNIX_LINES, embedded (?d)
    Only \n is recognised as a line terminator in ., ^ and $ processing.
LITERAL, embedded \Q … \E
    Treats all characters as ordinary literals rather than as commands. You don't then quote with \.
COMMENTS, embedded (?x)
    Makes whitespace ignored, and allows embedded comments starting with # that are ignored until the end of a line.
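A short sketch of the flags in action, both as Pattern.compile arguments and in embedded form. The helper name findsWithFlags and the sample strings are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class FlagDemo {
    /** True if regex, compiled with the given flags, is found somewhere in input. */
    static boolean findsWithFlags(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\nthree";
        System.out.println(findsWithFlags("shells", 0, "SEA SHELLS"));                          // false
        System.out.println(findsWithFlags("shells", Pattern.CASE_INSENSITIVE, "SEA SHELLS"));   // true
        System.out.println(findsWithFlags("^two$", 0, lines));                                  // false
        System.out.println(findsWithFlags("^two$", Pattern.MULTILINE, lines));                  // true
        System.out.println(findsWithFlags("one.two", 0, lines));                                // false
        System.out.println(findsWithFlags("one.two", Pattern.DOTALL, lines));                   // true
        // Embedded equivalents, with several flags combined in one group.
        System.out.println(findsWithFlags("(?im)^TWO$", 0, lines));                             // true
    }
}
```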

Named Fragments

Naming fragments of regexes as String constants can make your code easier to proofread.
I name the fragment Strings beginning with A_ so the Rearranger or other code tidier will put them before my regex Patterns, and will group the fragment Strings together.

- The regex patterns become much easier to proofread.
- If you have got a fragment pattern wrong, you need fix it in only one place.
- You can reuse your regex fragments from a previous program. You don't have to work them out from first principles each time.
- You can implement the fragments in ever more refined ways without having to adjust all your patterns.
- You can debug more simply. Get your fragments debugged first, then your patterns will usually work first time.
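A minimal sketch of the named-fragment style, assuming an invented date format and invented fragment names. Only the A_ naming convention comes from the text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedFragments {
    // Fragment names start with A_ so a code tidier sorts them ahead of the Patterns.
    private static final String A_DIGITS = "[0-9]+";
    private static final String A_SEP = "[-/]";
    private static final String A_YEAR = "[0-9]{4}";

    // yyyy-mm-dd or yyyy/mm/dd, with each piece captured in its own group.
    private static final Pattern DATE = Pattern.compile(
            "(" + A_YEAR + ")" + A_SEP + "(" + A_DIGITS + ")" + A_SEP + "(" + A_DIGITS + ")");

    /** Returns the year part of a date, or null if the text is not a date. */
    static String yearOf(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(yearOf("2011-03-27"));   // 2011
        System.out.println(yearOf("27 March"));     // null
    }
}
```

If A_SEP later needs to allow a dot as well, the change happens in one place and every Pattern built from it follows.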

 
Examples

The following examples use the Java String literal conventions. For use on the command line, undouble each \\.

 
  
Negative Regex

(?!X) is the exclusion or negative regex operator: anything but X, via zero-width negative lookahead. After the non-match, you continue where you left off, not at the end of the non-matching string. In Java, you might search for a word beginning with l but not lion like this: "(?!lion)l[a-z]+ ". (?! looks ahead, and aborts the match if it sees the undesirable pattern. I have not completely understood this operator. Sometimes exclusions don't work and I have no idea why. It is sometimes easier to let the regex collect too much stuff and then toss what you don't need programmatically in Java.
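A sketch of the lion example using java.util.regex. The leading \b is an addition not found in the text: it is assumed here so that the test starts at the beginning of a word. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NegativeLookahead {
    // \b (an assumption added for this sketch) pins the lookahead to the start of a word.
    private static final Pattern L_WORD_NOT_LION = Pattern.compile("\\b(?!lion)l[a-z]+");

    /** Collects words that begin with l, skipping those that begin with lion. */
    static List<String> lWords(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = L_WORD_NOT_LION.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(lWords("lion lamb llama"));   // [lamb, llama]
    }
}
```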
Matching vs Finding vs LookingAt

- Matching means the pattern must match the entire String.
- Finding means the pattern must appear somewhere in the String.
- LookingAt means the String must start with the pattern.

Matching

When you want the entire String to match your Pattern, use Matcher.matches.
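A minimal sketch of whole-String matching. The postal-code pattern is an invented example, not something the original specifies.

```java
import java.util.regex.Pattern;

final class WholeStringMatch {
    // Hypothetical pattern: letter digit letter, space, digit letter digit.
    private static final Pattern POSTAL = Pattern.compile("[A-Z][0-9][A-Z] [0-9][A-Z][0-9]");

    /** True only when the entire String is a postal code, with nothing before or after. */
    static boolean isPostalCode(String s) {
        return POSTAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPostalCode("V8T 4N9"));           // true
        System.out.println(isPostalCode("mail to V8T 4N9"));   // false
    }
}
```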

  
Finding

When you want to find fragments in your String that match the Pattern, use Matcher.find.
If you only want the first occurrence of a regex in a String, call Matcher.find once and read the result with Matcher.group.

For a case-insensitive find, compile the Pattern with Pattern.CASE_INSENSITIVE or embed (?i).

To understand how the or | operator works, and the effects of using layers of capturing (), print every Matcher.group( i ) after each find.
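The following sketch covers a first-occurrence find, a case-insensitive find, and a dump of the groups produced by | inside layered parentheses. All class and method names are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindDemo {
    /** Returns the first occurrence of regex in text, or null if there is none. */
    static String firstOccurrence(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        return m.find() ? m.group() : null;
    }

    /** Lists every group of every match, one match per line. */
    static String describeGroups(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                sb.append(i).append('=').append(m.group(i)).append(' ');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(firstOccurrence("[0-9]+", 0, "room 42, floor 7"));                     // 42
        System.out.println(firstOccurrence("shells", Pattern.CASE_INSENSITIVE, "SEA SHELLS"));   // SHELLS
        // Group 1 is the outer ( ), group 2 is (cat), group 3 is (dog).
        System.out.print(describeGroups("((cat)|(dog))s?", "cats and dog"));
    }
}
```

The alternate that did not take part in a match shows up as null in its group slot.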

 
LookingAt

When you want to see if your String starts with your Pattern, use Matcher.lookingAt.
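A minimal sketch of lookingAt, assuming an invented pattern that recognises a URL scheme at the start of a String.

```java
import java.util.regex.Pattern;

final class PrefixMatch {
    // Hypothetical pattern: lower-case scheme name followed by ://
    private static final Pattern SCHEME = Pattern.compile("[a-z]+://");

    /** True if the String begins with a scheme; the rest of the String is not examined. */
    static boolean startsWithScheme(String s) {
        return SCHEME.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsWithScheme("http://example.com"));       // true
        System.out.println(startsWithScheme("see http://example.com"));   // false, though find would succeed
    }
}
```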

  
Splitting

Regexes can be used to break phrases into individual words, using Pattern.split or String.split.
Beware, split treats leading, embedded and trailing separators differently. It ignores trailing separators unless you use split( string, -1 /* limit */ ). It inherited this oddity from Perl.
Another oddity is that when you split an empty String, you don't get a 0-length array. You get an array with a 0-length String in the [0] position.
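A sketch of the splitting behaviour described above. The class name, the word-separator pattern and the sample strings are assumptions.

```java
import java.util.regex.Pattern;

final class SplitDemo {
    // Anything that is not a letter counts as a separator between words.
    private static final Pattern NON_LETTERS = Pattern.compile("[^A-Za-z]+");

    /** Breaks a phrase into words. */
    static String[] words(String phrase) {
        return NON_LETTERS.split(phrase);
    }

    public static void main(String[] args) {
        System.out.println(words("She sells sea shells.").length);   // 4, the trailing "." is ignored
        System.out.println("a,b,,c,,".split(",").length);            // 4, trailing empty fields dropped
        System.out.println("a,b,,c,,".split(",", -1).length);        // 6, trailing empty fields kept
        System.out.println(",a".split(",").length);                  // 2, the leading empty field is kept
        System.out.println("".split(",").length);                    // 1, a single empty String
    }
}
```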

  
Replacing

To search for instances of some pattern in a big string, and replace them all with some computed modification of the match, loop with Matcher.find, Matcher.appendReplacement and Matcher.appendTail.
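A minimal sketch of a computed replacement. The task (doubling every number in a sentence) and the names are invented for illustration. StringBuffer is used because appendReplacement has accepted it since Java 1.4.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ComputedReplace {
    private static final Pattern NUMBER = Pattern.compile("[0-9]+");

    /** Replaces every number in text with twice its value. */
    static String doubleNumbers(String text) {
        Matcher m = NUMBER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = 2 * Integer.parseInt(m.group());
            // quoteReplacement guards against \ and $ in the computed text.
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(doubled)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 12 pears"));   // 6 apples and 24 pears
    }
}
```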

 
Tips

- Be careful with space, a pair of spaces and a newline. Regexes treat them as quite different. You may tend to treat them as equal by eye.
- If you are having trouble composing a String to describe what you do want, try instead to compose one that describes what you don't want, and reverse the sense of the match.
- When a regex does not work, give just as much attention to the right end as the left. It can be failing on the very last character.
- Don't try to put all your logic in one be-all-end-all DEBE (Does Everything But Eat) regex. Use several simpler ones in succession. For example, if there are four distinct patterns you are looking for, use four regexes rather than one giant complicated one.
- Always check your regex to make sure you did not use an unquoted magic character as if it were an ordinary one. It is so easy to forget that a character you rarely use is a command.
- Regex code often seems to work, but because you left out one letter from a Pattern, it will fail to catch all instances. Manually count instances and make sure all are accounted for. Rather than thinking of which characters to include, look at a Pattern that includes everything, and decide letter by letter if you want to include it. In general, include characters unless there is a specific reason to exclude them. Don't exclude them just because they are not commonly used.
- If you want to include the file separator character in your regexes, you will need code to double it when it is \, since \ has to be quoted in regexes.
- Regexes are a sledgehammer for complex pattern matching. For simple tasks you can do the job at least three times faster and more simply with String.substring, String.indexOf, String.lastIndexOf, String.startsWith, String.endsWith, or possibly StringTokenizer or StreamTokenizer. Feel free to mix regex logic with String method logic.
- Regexes are not designed for complex language analysis like parsing XML or Java source code. Use a parser instead.
- Compiling a Pattern is a non-trivial, time-consuming operation. If you are not careful, your Pattern will be recompiled on every use. For speed, use this idiom to compile the Pattern only once:
      // ensuring the Pattern is compiled only once.
      private static final Pattern p = Pattern.compile( "[a]*" );
- Regexes rarely give an error message. They just fail to match anything. Start with just the first few chars of your regex and see what that matches. Then, when that works, add a few more characters at a time rather than trying to debug the whole thing at once. You can gradually add characters to the right end of your regex, or gradually chop them off until it starts matching.
- IntelliJ and Eclipse both have a regex plugin to help you compose and debug regexes.
- Matcher.group( i ) is full of surprises. Print out everything from 0 to n to make sure you are grabbing the right fragment. If you have an | in your pattern with parallel parenthesised alternates, each option will get its own dedicated group slot.
- Keep your Pattern characters in ASCII order. It makes them easier to proofread.
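The advice about growing or chopping a regex until it starts matching can be automated with a sketch like this one. The helper and its name are hypothetical; it simply tries every prefix of the regex and remembers the longest one that still finds something.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RegexPrefixProbe {
    /** Returns the longest prefix of regex that compiles and finds a match in input, or "" if none does. */
    static String longestWorkingPrefix(String regex, String input) {
        String best = "";
        for (int len = 1; len <= regex.length(); len++) {
            String prefix = regex.substring(0, len);
            try {
                if (Pattern.compile(prefix).matcher(input).find()) {
                    best = prefix;
                }
            } catch (PatternSyntaxException e) {
                // An incomplete prefix such as "[a-" is not a valid regex; skip it.
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String regex = "[a-z]+@[a-z]+\\.com";
        String input = "mail roedy@example.org today";
        // The full regex fails; the probe shows matching breaks right after the dot.
        System.out.println(longestWorkingPrefix(regex, input));   // [a-z]+@[a-z]+\.
    }
}
```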

String

The String class borrows some convenience regex methods, such as split, matches, replaceAll and replaceFirst. Normally you would use the more efficient java.util.regex methods such as Matcher.replaceFirst and Matcher.replaceAll, where you precompile a Pattern and reuse it.

Learning More

Oracle's Javadoc on String.matches: available on the web at Oracle.com, or in the local JDK documentation (current JDK 1.6.0_24 or old JDK 1.5.0_22).

A common bug is to confuse String.replace (non-regex replace all) with String.replaceAll (regex replace all) and String.replaceFirst (regex replace of just one instance). It is probably too late now for Sun to assign the methods clearer names.
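The difference between the three methods shows up as soon as the search text contains a regex command character. A small sketch, with a class name and sample string chosen purely for illustration:

```java
final class ReplaceVariants {
    static final String DOTTED = "a.b.c";

    public static void main(String[] args) {
        // replace treats "." as a plain character and replaces every occurrence.
        System.out.println(DOTTED.replace(".", "-"));          // a-b-c
        // replaceAll treats "." as a regex, so every character matches.
        System.out.println(DOTTED.replaceAll(".", "-"));       // -----
        // replaceFirst also uses a regex, but stops after one replacement.
        System.out.println(DOTTED.replaceFirst(".", "-"));     // -.b.c
        // Quoting the dot gives the regex methods the literal meaning.
        System.out.println(DOTTED.replaceAll("\\.", "-"));     // a-b-c
    }
}
```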
Oracle's Javadoc on String.replace, String.replaceAll, String.replaceFirst and String.split: available on the web at Oracle.com, or in the local JDK documentation (current JDK 1.6.0_24 or old JDK 1.5.0_22).

SlickEdit documentation is available from Help | Contents | Search and Replace | Regular Expressions | Unix Regular Expressions.

Funduc Search and Replace documentation is available from Help | Contents | Regular Expressions | Search Operators.

tcc/TakeCommand documentation is available from Help | Contents | wildcards | advanced wildcards.

Links

Apache RegExp
Eclipse Regex tester
Expresso Regex tester: for Windows .NET style regex
finite state automaton
JavaRegex.com
JFlex
JRegex
KRegExpEditor: part of KDE utilities

Tutorial Part 1

Basic Pattern Elements: [], {}, *, +, ?, (?i)
\w, \s, \d, \W, \S, \D

    // create a Regex object
    Regex r = new Regex("shells");

    // search for a match within a string
    r.search("She sells sea shells by the sea shore.");

    System.out.println("" + r.didMatch());
    // Prints "true" -- r.didMatch() is a boolean function
    // that tells us whether the last search was successful
    // in finding a pattern.

    System.out.println(r.stringMatched());
    // Prints "shells" -- the part of the String that
    // matched during the previous search.

    System.out.println(r.left());
    // Prints "She sells sea " -- the part of the String
    // that is to the left of the matching text.

    System.out.println(r.right());
    // Prints " by the sea shore." -- the part of the
    // String that is to the right of the matching text.


However, the above bit of code does not match if we encounter the substring "SHELLS" rather than "shells".

    r.search("SHE SELLS SEA SHELLS BY THE SEA SHORE.");
    System.out.println("" + r.didMatch());
    // Prints "false"

The (?i) pattern element makes the match case-insensitive:

    r = new Regex("(?i)shells");
    r.search("SHE SELLS SEA SHELLS BY THE SEA SHORE.");
    System.out.println("" + r.didMatch());
    // Prints "true"
    System.out.println(r.stringMatched());
    // Prints "SHELLS"

The [] pattern element matches any one of the characters listed inside it:

    r = new Regex("[Ss]hells");
    r.search("SHELLS Shells shells");
    System.out.println(r.stringMatched());
    // Prints "Shells"


Regex r = new Regex("[0123456789]");
r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "3"

Only the first digit is matched, because the pattern describes a single character. To match both digits of "35" we can repeat the bracket expression.

Regex r = new Regex("[0123456789][0123456789]");
r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "35"

This pattern, however, requires exactly two digits, so it fails on a one-digit age.

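r.search("How old are you? I'm only 8.");
System.out.println(r.stringMatched());
// Prints "null" because no match occurred.

It also stops after two digits when the number is longer.

r.search("When were you born? In 1963");
System.out.println(r.stringMatched());
// Prints "19"

The {} pattern element handles both cases. Writing {1,4} after a pattern element means that the element must match at least once and at most four times.
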
Regex r = new Regex("[0123456789]{1,4}");
r.search("How old are you? I'm only 8.");
System.out.println(r.stringMatched());
// Prints "8"

r.search("How old are you? I'm 35.");
System.out.println(r.stringMatched());
// Prints "35"

r.search("When were you born? In 1963.");
System.out.println(r.stringMatched());
// Prints "1963"
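
The difference between a fixed pair of bracket expressions and a {1,4} repeat count is easier to see when every match is collected rather than only the first. The following is a small sketch using java.util.regex (an assumption, since the tutorial uses its own Regex class); QuantifierDemo and allMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    // Collects every non-overlapping match, scanning left to right.
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Exactly two digits: a four-digit year is split into two matches.
        System.out.println(allMatches("[0123456789][0123456789]", "In 1963"));
        // One to four digits: the repeat count is greedy and takes all four.
        System.out.println(allMatches("[0123456789]{1,4}", "In 1963."));
        // A lone digit still satisfies the lower bound of one.
        System.out.println(allMatches("[0123456789]{1,4}", "I'm only 8."));
    }
}
```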
It may be that we don't want to specify a maximum number of characters to match. Perhaps we just want to match one or more digits. We can do this by leaving out the second number in the {} pattern element.

r = new Regex("[0123456789]{1,}");
r.search("What's your favorite number? It's 979834743.");
System.out.println(r.stringMatched());
// Prints "979834743"

Typing [0123456789] every time is tedious, so a bracket expression may contain ranges. [0-9] means the same as [0123456789]. Likewise, [a-z] matches any lower case letter and [A-Z] matches any upper case letter. Ranges can be combined, so [A-Za-z0-9] matches any letter or digit.

Regex r = new Regex("[A-Z][a-z]{1,}");
// Matches an upper case letter, followed by one or
// more lower case letters.
r.search("What is your name? My name is Fred.");
System.out.println(r.stringMatched());
// Prints "What"

To skip over "What", we can leave W out of the upper case range.

Regex r = new Regex("[A-VX-Z][a-z]{1,}");
// Matches an upper case letter (excluding W),
// followed by one or more lower case letters.
r.search("What is your name? My name is Fred.");
System.out.println(r.stringMatched());
// Prints "My"

To skip over "My" as well, we can require at least two lower case letters.

Regex r = new Regex("[A-VX-Z][a-z]{2,}");
// Matches an upper case letter (excluding W),
// followed by two or more lower case letters.
r.search("What is your name? My name is Fred.");
System.out.println(r.stringMatched());
// Prints "Fred" -- "My" does not match because
// the pattern now requires one capital letter
// (that isn't a "W") and at least two lower case
// letters.
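
To see which words each of the three range patterns above accepts, all matches in the sentence can be listed side by side. This sketch again assumes java.util.regex rather than the tutorial's Regex class, and the names RangeDemo and capitalisedWords are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RangeDemo {
    // Lists every substring of the input that the pattern matches.
    static List<String> capitalisedWords(String regex, String input) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "What is your name? My name is Fred.";
        // Any capital followed by at least one lower case letter.
        System.out.println(capitalisedWords("[A-Z][a-z]{1,}", text));
        // The same, but the capital may not be W.
        System.out.println(capitalisedWords("[A-VX-Z][a-z]{1,}", text));
        // No W, and at least two lower case letters.
        System.out.println(capitalisedWords("[A-VX-Z][a-z]{2,}", text));
    }
}
```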
Regex r = new Regex("[A-VX-Z][^ ]{2,}");

Here [^ ] matches any character that is not a space. A ^ placed first inside the brackets negates the set, so [^0-9] matches any character that is not a digit.

You may be wondering, at this point, if it is possible to match against something like "[0-9]" as literal text and not as a digit. The answer, of course, is yes. Preceding a non-alphanumeric character with a "\\" (note, this is really only one backslash, but the Java compiler interprets two backslashes as one when they appear inside quotes) causes Regex to interpret it as literal text. (Note: Putting a backslash before an alphanumeric character often makes it a special pattern character instead of a literal).

Regex r = new Regex("\\[0-9]");
r.search("the pattern is [0-9]");
System.out.println(r.stringMatched());
// Prints "[0-9]"
r = new Regex("[0-9]");
r.search("the pattern is [0-9]");
System.out.println(r.stringMatched());
// Prints "0"
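
With java.util.regex (assumed here in place of the tutorial's Regex class) the same escaping rule applies, and Pattern.quote offers a way to treat a whole string as literal text without escaping each character by hand. EscapeDemo and first are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeDemo {
    // First match of the pattern in the input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "the pattern is [0-9]";
        // Escaped brackets are literal characters.
        System.out.println(first("\\[0-9\\]", text));
        // Pattern.quote turns the whole string into a literal pattern.
        System.out.println(first(Pattern.quote("[0-9]"), text));
        // Unescaped, the brackets form a digit range again.
        System.out.println(first("[0-9]", text));
    }
}
```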

Some common bracket expressions and repeat counts have shorthand forms.

Regex r1 = new Regex("\\w");
// the same as "[0-9A-Za-z_]"
Regex r2 = new Regex("\\w+");
// the same as "\\w{1,}"
Regex r3 = new Regex("\\w?");
// the same as "\\w{0,1}"
Regex r4 = new Regex("\\w*");
// the same as "\\w{0,}"
Regex r5 = new Regex("\\w{5}");
// the same as Regex("\\w{5,5}");
Regex r6 = new Regex("\\s");
// the same as "[ \b\t\n\r]" -- these
// are referred to as white space characters.
Regex r7 = new Regex("\\d");
// the same as "[0-9]" -- a digit
Regex r8 = new Regex("\\W");
// the same as "[^A-Za-z0-9_]" -- any character
// that \w does not match. (The \w characters are
// the letters, digits and underscore used in a
// Java variable name.)
Regex r9 = new Regex("\\D");
// the same as "[^0-9]" -- not a digit
Regex r10 = new Regex("\\S");
// the same as "[^ \b\r\t\n]"
Regex r11 = new Regex(".");
// the same as "[^\n]". In most cases, this
// serves the purpose of matching any character.
// The pattern ".*" is a popular way
// to match arbitrary regions of text.

If "." must also match a newline, add the (?s) flag to the pattern.

Regex r12 = new Regex("(?s).");
// will match any character
Regex r13 = new Regex("(?s)foo:.");
// matches on the string "foo:"
// followed by any character.
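
The effect of (?s) shows up only when the character after "foo:" is a line break. A minimal check, assuming java.util.regex (where the equivalent compile flag is Pattern.DOTALL and the default "." excludes line terminators); DotAllDemo and found are illustrative names.

```java
import java.util.regex.Pattern;

final class DotAllDemo {
    // True when the pattern matches somewhere in the input.
    static boolean found(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "foo:\nbar";
        // Default: "." refuses to match the newline after "foo:".
        System.out.println(found(Pattern.compile("foo:."), text));
        // Inline (?s) flag: "." now matches the newline too.
        System.out.println(found(Pattern.compile("(?s)foo:."), text));
        // The same behaviour requested through a compile flag.
        System.out.println(found(Pattern.compile("foo:.", Pattern.DOTALL), text));
    }
}
```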

Tutorial Part 2

Pattern Elements:
(), (?:), (?!), (?=)

Parentheses group part of a pattern so that a following pattern element, such as {2,}, applies to the whole group rather than to a single character.

Regex r = new Regex("(foo){2,}");
r.search("foo");
System.out.println("" + r.didMatch());
// Prints "false"

r.search("foofoofoo");
System.out.println(r.stringMatched());
// Prints "foofoofoo"

Parentheses also create a backreference: the text matched by the group is saved and can be retrieved by number.

Regex r = new Regex("[abc]([def])");
r.search("==> be <==");
System.out.println("" + r.didMatch());
// Prints "true"

System.out.println(r.stringMatched());
// Prints "be"

System.out.println(r.stringMatched(1));
// Prints "e"
// This is the contents of the first backreference

A pattern may contain more than one backreference.

Regex r = new Regex("([abc])([def])");
r.search("==> be <==");

System.out.println(r.stringMatched(1));
// Prints "b"
// This is the contents of the first backreference

System.out.println(r.stringMatched(2));
// Prints "e"
// This is the contents of the second backreference

Backreferences are numbered by the position of their opening parenthesis, counting from the left. This matters when groups are nested.

Regex r = new Regex("(ab(cd))ef");
r.search("==>abcdef<==");

System.out.println(r.stringMatched());
// Prints "abcdef"

System.out.println(r.stringMatched(1));
// Prints "abcd"

System.out.println(r.stringMatched(2));
// Prints "cd"
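
The numbering of nested groups can be checked by printing every group of a match in order. This is a sketch against java.util.regex, where group(0) is the whole match and group(n) corresponds to stringMatched(n) in the examples above; that correspondence, and the names GroupDemo and groups, are assumptions of this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    // Returns the whole match followed by each numbered group,
    // or an empty array when the pattern does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Group 1 opens first, so it is the outer group; group 2 is nested inside it.
        for (String g : groups("(ab(cd))ef", "==>abcdef<==")) {
            System.out.println(g);
        }
    }
}
```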

Where a repeat count is placed relative to the parentheses changes what the backreference contains.

Regex r = new Regex("(a)+b*");
r.search("==>aaaabbb<==");
System.out.println(r.stringMatched(1));
// Prints "a"
// Note that the subpattern is just the
// literal character "a" so that is what
// the backreference sees.

r = new Regex("(a+)b*");
r.search("==>aaaabbb<==");
System.out.println(r.stringMatched(1));
// Prints "aaaa"
// Now the () contains the + as well, so
// all the matching a's are returned in
// the backreference.

r = new Regex("([abc])+");
r.search("==>aaabbbc<==");
System.out.println(r.stringMatched(1));
// Prints "c"
// When you have something of the form (...)* or (...)+
// the backreference returns the last thing
// that matched.
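
The same "last iteration wins" behaviour can be observed with java.util.regex, which is assumed here in place of the tutorial's Regex class. Moving the repeat inside the parentheses captures the whole run instead. RepeatedGroupDemo and groupOne are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepeatedGroupDemo {
    // Contents of group 1 for the first match, or null if there is no match.
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Repeat outside the group: only the last repetition is kept.
        System.out.println(groupOne("([abc])+", "==>aaabbbc<=="));
        // Repeat inside the group: the whole run is kept.
        System.out.println(groupOne("([abc]+)", "==>aaabbbc<=="));
    }
}
```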
Another use of parentheses is to select one of a set of patterns. The character "|" separates the alternatives. For example:

Regex r = new Regex("(apple|banana|pear|orange)");
r.search("apple");
System.out.println("" + r.didMatch());
// Prints "true"
r.search("orange");
System.out.println("" + r.didMatch());
// Prints "true"
r.search("grape");
System.out.println("" + r.didMatch());
// Prints "false"
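
Because a search looks for the pattern anywhere in the string, an alternation also succeeds on longer words that merely contain one of the alternatives. The sketch below, which assumes java.util.regex, contrasts find() (search anywhere) with matches() (the whole string must fit). AlternationDemo, containsFruit and isFruit are invented names.

```java
import java.util.regex.Pattern;

final class AlternationDemo {
    private static final Pattern FRUIT = Pattern.compile("apple|banana|pear|orange");

    // True when any alternative occurs somewhere inside the text.
    static boolean containsFruit(String text) {
        return FRUIT.matcher(text).find();
    }

    // True only when the entire text is exactly one of the alternatives.
    static boolean isFruit(String text) {
        return FRUIT.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsFruit("grape"));
        System.out.println(containsFruit("pineapple"));
        System.out.println(isFruit("pineapple"));
    }
}
```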

Sometimes parentheses are needed only for grouping and the backreference is not wanted. The (?:...) form groups a subpattern without saving a backreference.

Regex r1 = new Regex("(?:foo){2,}");
// is the same as
Regex r2 = new Regex("(foo){2,}");
// except that r1 produces no backreference.
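
One way to confirm that the two forms differ only in their backreferences is to count the capturing groups. This sketch relies on Matcher.groupCount() from java.util.regex, an assumed stand-in for the tutorial's Regex class; NonCapturingDemo and its helpers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingDemo {
    // Number of capturing groups declared in the pattern.
    static int captureCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // First match of the pattern in the input, or null if there is none.
    static String first(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(captureCount("(foo){2,}"));   // one capturing group
        System.out.println(captureCount("(?:foo){2,}")); // none
        // Both patterns still match the same text.
        System.out.println(first("(foo){2,}", "xfoofoofoox"));
        System.out.println(first("(?:foo){2,}", "xfoofoofoox"));
    }
}
```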
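
The pattern element (?=...) is a zero-width lookahead. It requires that its contents match at the current position, but it does not include them in the matched text.
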
Regex r = new Regex("(?i)foo(?=bar)");

r.search("Foo or foobar?");
System.out.println(r.stringMatched());
// Prints "foo"
// Matches on the lower case version of
// foo because it is followed by bar --
// but since the lookahead is zero-width, "bar" is
// not part of the matched string.
r = new Regex("(?i)foo");
r.search("Foo or foobar?");
System.out.println(r.stringMatched());
// Prints "Foo"

The pattern element (?!...) is the negative form. It succeeds only when its contents do not match at the current position.

Regex r = new Regex("(?i)foo(?!bar)");
r.search("Foobar or foo?");
System.out.println(r.stringMatched());
// Prints "foo"
// Cannot match on "Foo" because it is followed
// by bar.

r = new Regex("(?i)foo");
r.search("Foobar or foo?");
System.out.println(r.stringMatched());
// Prints "Foo"
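
Looking at the start and end offsets of a match makes the zero-width nature of both lookaheads visible: the match ends right after "foo" whether or not "bar" was inspected. The following is a sketch under the assumption that java.util.regex is used; LookaheadDemo and span are made-up names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Returns {start, end} of the first match, or null if there is none.
    static int[] span(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        // Positive lookahead: skips "Foo" at offset 0, matches "foo" before "bar".
        int[] positive = span("(?i)foo(?=bar)", "Foo or foobar?");
        System.out.println(positive[0] + ".." + positive[1]);

        // Negative lookahead: skips "Foo" in "Foobar", matches the later "foo".
        int[] negative = span("(?i)foo(?!bar)", "Foobar or foo?");
        System.out.println(negative[0] + ".." + negative[1]);
    }
}
```

In both cases the span is three characters long, so the text examined by the lookahead is never consumed.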
