
Notepad++ Searching and Replacing

----------------------------------------------

Escape sequences supported in extended mode


In extended mode, and in regular expressions unless stated otherwise, some specific
escape sequences (a backslash followed by a single character and optional
material) are recognised, besides the traditional \r, \n and \t. This list only
reports sequences supported in extended mode. Please consult the whole list of
escape sequences supported in regular expressions.

\n the Line Feed control character LF (ASCII 0x0A)
\r the Carriage Return control character CR (ASCII 0x0D)
\t the TAB control character (ASCII 0x09)
\0 the NUL control character (ASCII 0x00). Not supported in regular expressions
- use \x00 instead.
\\ the backslash character (ASCII 0x5C)
\b the binary representation of a byte, made of 8 digits which are either 1's or
0's. This has a different meaning in regular expressions (beginning of a word).
\o the octal representation of a byte, made of 3 digits in the 0-7 range
\d the decimal representation of a byte, made of 3 digits in the 0-9 range
\x the hexadecimal representation of a byte, made of 2 digits in the 0-9,
A-F/a-f range.
\u In extended mode, the hexadecimal representation of a two byte character,
made of 4 digits in the 0-9, A-F/a-f range. In Unicode builds, finds a Unicode
character. In ANSI builds, finds characters requiring two bytes, like in the
Shift-JIS encoding. In regular expressions, this stands for a lowercase letter.
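
To make the table above concrete, here is a minimal Python sketch that expands these extended-mode sequences into the characters they stand for. It is not Notepad++'s own code: the function name expand_extended and the helper tables are hypothetical, and the sketch simply maps each numeric form to the character with that code.

```python
import re

# Single-character escapes from the table above.
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "0": "\0", "\\": "\\"}

# Numeric escapes: group name -> numeric base of the digits that follow.
_BASES = {"bin": 2, "oct": 8, "dec": 10, "hex": 16, "uni": 16}

_TOKEN = re.compile(
    r"\\(?:(?P<simple>[nrt0\\])"
    r"|b(?P<bin>[01]{8})"          # \b + 8 binary digits
    r"|o(?P<oct>[0-7]{3})"         # \o + 3 octal digits
    r"|d(?P<dec>[0-9]{3})"         # \d + 3 decimal digits
    r"|x(?P<hex>[0-9A-Fa-f]{2})"   # \x + 2 hexadecimal digits
    r"|u(?P<uni>[0-9A-Fa-f]{4}))"  # \u + 4 hexadecimal digits
)


def expand_extended(text):
    """Return text with extended-mode escape sequences expanded (illustrative only)."""
    def replace(match):
        if match.group("simple") is not None:
            return _SIMPLE[match.group("simple")]
        for name, base in _BASES.items():
            digits = match.group(name)
            if digits is not None:
                return chr(int(digits, base))
        return match.group(0)
    return _TOKEN.sub(replace, text)


print(expand_extended(r"\x41\d066\o103\b01000100"))
```

The last line writes the same four letters in hexadecimal, decimal, octal and binary notation.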

In a regular expression (shortened into regex throughout), the special characters
interpreted are:
Single-character matches
., \c
Matches any character. If you check the box which says ". matches newline", the dot
will indeed do that, enabling the "any" character to run over multiple lines. With
the option unchecked, . will only match characters within a line, and not the
line ending characters (\r and \n).
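
The effect of the ". matches newline" box can be tried outside Notepad++ as well. The sketch below uses Python's re module, which is a different engine; its re.DOTALL flag is assumed here to play the role of the checkbox. Note that, without the flag, Python's dot only excludes \n.

```python
import re

text = "first line\nsecond line"

# Without DOTALL the dot stops at the newline, so the match stays on line 1.
same_line = re.search(r"first.*line", text).group()

# With DOTALL the dot also matches the newline and the match runs to the end.
across_lines = re.search(r"first.*line", text, flags=re.DOTALL).group()

print(repr(same_line))
print(repr(across_lines))
```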
\X
Matches a single non-combining character followed by any number of combining
characters. This is useful if you have a Unicode encoded text with accents as
separate, combining characters.

\Г
This allows you to use a character Г that would otherwise have a special meaning.
For example, \[ would be interpreted as [ and not as the start of a character set.
Adding the backslash (this is called escaping) works the other way round, as it
makes special a character that otherwise isn't. For instance, \d stands for "a
digit", while "d" is just an ordinary letter.
Non ASCII characters
\xnn
Specify a single character with code nn. What this stands for depends on the text
encoding. For instance, \xE9 may match an é or a θ depending on the code page in an
ANSI encoded document.
\x{nnnn}
Like above, but matches a full 16-bit Unicode character. If the document is ANSI
encoded, this construct is invalid.
\0nnn
A single byte character whose code in octal is nnn.
[[.collating sequence.]]
The character the collating sequence stands for. For instance, in Spanish, "ch" is
a single letter, though it is written using two characters. That letter would be
represented as [[.ch.]]. This trick also works with symbolic names of control
characters, like [[.BEL.]] for the character of code 0x07. See also the discussion
on character ranges.

Control characters
\a The BEL control character 0x07 (alarm).
\b The BS control character 0x08 (backspace). This is only allowed inside a
character class definition. Otherwise, this means "a word boundary".
\e The ESC control character 0x1B.
\f The FF control character 0x0C (form feed).
\n The LF control character 0x0A (line feed). This is the regular end of line
under Unix systems.
\r The CR control character 0x0D (carriage return). This is part of the
DOS/Windows end of line sequence CR-LF, and was the EOL character on Mac OS 9 and
earlier. OS X and later versions use \n.
\R Any newline character.
\t The TAB control character 0x09 (tab, or hard tab, horizontal tab).
\Ccharacter The control character obtained from character by stripping all but its
6 lowest order bits. For instance, \C1, \CA and \Ca all stand for the SOH control
character 0x01.

Ranges or kinds of characters


[...] This indicates a set of characters, for example, [abc] means any of the
characters a, b or c. You can also use ranges, for example [a-z] for any lower case
character. You can use a collating sequence in character ranges, like in
[[.ch.]-[.ll.]] (these are collating sequences in Spanish).
[^...] The complement of the characters in the set. For example, [^A-Za-z]
means any character except an alphabetic character. Care should be taken with a
complement list, as regular expressions are always multi-line, and hence [^ABC]*
will match until the first A, B or C (or a, b or c if match case is off), including
any newline characters. To confine the search to a single line, include the newline
characters in the exception list, e.g. [^ABC\r\n].
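
The warning about complement lists running across lines is easy to reproduce. This is a small sketch using Python's re module rather than Notepad++'s engine; the sample text is made up for the illustration.

```python
import re

text = "xyz\nxyA"

# The complement set happily consumes the newline and continues on line 2.
crosses_lines = re.match(r"[^ABC]*", text).group()

# Adding \r and \n to the excluded characters confines the match to line 1.
single_line = re.match(r"[^ABC\r\n]*", text).group()

print(repr(crosses_lines))
print(repr(single_line))
```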
[[:name:]] The whole character class named name. Most of the time, there is a
single letter escape sequence for them - see below.

Recognised classes are:


alnum : ASCII letters and digits
alpha : ASCII letters
blank : spacing which is not a line terminator
cntrl : control characters
d , digit : decimal digits
graph : graphical character
l , lower : lowercase letters
print : printable characters
punct : punctuation characters: , " ' ? ! ; : # $ % & ( ) * + - / < > = @ [ ] \ ^ _
{ } | ~
s , space : whitespace
u , upper : uppercase letters
unicode : any character with code point above 255
w , word : word character
xdigit : hexadecimal digits
\pshort name,\p{name} Same as [[:name:]]. For instance, \pd and \p{digit} both
stand for a digit, \d.
\Pshort name,\P{name} Same as [^[:name:]] (not belonging to the class name).
Note that Unicode categories, as in \p{Sc} or \p{Currency_Symbol}, are flagged as
an invalid regex in v6.6.6. This is because supporting them would draw in a large
library, which would have other uses.
\d A digit in the 0-9 range, same as [[:digit:]].
\D Not a digit. Same as [^[:digit:]].
\l A lowercase letter. Same as [a-z] or [[:lower:]].
NOTE: this will fall back on "a word character" if the "Match case" search option
is off.
\L Not a lower case letter. See note above.
\u An uppercase letter. Same as [[:upper:]]. See note about lower case letters.
\U Not an uppercase letter. Same note applies.
\w A word character, which is a letter, digit or underscore. This appears not to
depend on what the Scintilla component considers as word characters. Same as
[[:word:]].
\W Not a word character. Same as [^[:word:]], that is, anything but an alphanumeric
character or an underscore.
\s A spacing character: space, EOLs and tabs count. Same as [[:space:]].
\S Not a space.
\h Horizontal spacing. This only matches space, tab and line feed.
\H Not horizontal whitespace.
\v Vertical whitespace. This encompasses the VT, FF and CR control
characters: 0x0B (vertical tab), 0x0C (form feed) and 0x0D (carriage return).
\V Not vertical whitespace.
[[=primary key=]]
All characters that differ from primary key by case, accent or similar alteration
only. For example [[=a=]] matches any of the characters: a, À, Á, Â, Ã, Ä, Å, A, à,
á, â, ã, ä and å.
Multiplying operators
+
This matches 1 or more instances of the previous character, as many as it can. For
example, Sa+m matches Sam, Saam, Saaam, and so on. [aeiou]+ matches consecutive
strings of vowels.
*
This matches 0 or more instances of the previous character, as many as it can. For
example, Sa*m matches Sm, Sam, Saam, and so on.
?
Zero or one of the last character. Thus Sa?m matches Sm and Sam, but not Saam.
*?
Zero or more of the previous group, but minimally: the shortest matching string,
rather than the longest string as with the "greedy" * operator. Thus, m.*?o applied
to the text margin-bottom: 0; will match margin-bo, whereas m.*o will match margin-
botto.
+?
One or more of the previous group, but minimally.
{n}
Matches n copies of the element it applies to.
{n,}
Matches n or more copies of the element it applies to.
{m,n}
Matches m to n copies of the element it applies to, as many as it can.
{n,}?,{m,n}?
Like the above, but match as few copies as they can. Compare with *? and friends.
*+,?+,++,{n,}+,{m,n}+
These so called "possessive" variants of greedy repeat marks do not backtrack. This
allows failures to be reported much earlier, which can boost performance
significantly. But they will eliminate matches that would require backtracking to
be found.
Example: matching ".*" against "abc"x will find "abc", because
" then abc"x then $ fails
" then abc" then x fails
" then abc then " succeeds.
However, matching ".*+" against "abc"x will fail, because the possessive repeat
factor prevents backtracking.
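
The greedy and minimal examples from this section can be checked with a short Python sketch. Python's re module is not the engine Notepad++ uses, but it treats *, *? and backtracking the same way for these patterns. Possessive marks such as *+ are left out on purpose, since re only accepts them from Python 3.11 on.

```python
import re

css = "margin-bottom: 0;"

# Greedy: runs to the last "o" it can reach on the line.
greedy = re.search(r"m.*o", css).group()

# Minimal: stops at the first "o".
lazy = re.search(r"m.*?o", css).group()

# Greedy .* first swallows abc"x, then backtracks until the closing quote fits.
quoted = re.search(r'".*"', '"abc"x').group()

print(greedy, lazy, quoted)
```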
Anchors
Anchors match a position in the line, rather than a particular character.
^
This matches the start of a line (except when used inside a set, see above).
$
This matches the end of a line.
\<
This matches the start of a word using Scintilla's definition of words.
\>
This matches the end of a word using Scintilla's definition of words.
\b
Matches either the start or end of a word.
\B
Not a word boundary.
\A, \`
The start of the matching string.
\z, \'
The end of the matching string.
\Z
Matches like \z with an optional sequence of newlines before it. This is equivalent
to (?=\v*\z), which departs from the traditional Perl meaning for this escape.
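
As a rough illustration of anchors matching positions rather than characters, here is a Python sketch. It assumes Python's re.MULTILINE flag to get the line-oriented meaning of ^ and $ that Notepad++ uses by default; \< and \> have no equivalent in Python's re, so only \b is shown.

```python
import re

text = "one two\nthree four"

# ^ and $ anchor at line starts and line ends when MULTILINE is set.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# \b on both sides keeps "cat" from matching inside "concat" or "cats".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

print(line_starts, line_ends, whole_words)
```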
Groups
(...)
Parentheses mark a subset of the regular expression. The string matched by the
contents of the parentheses ( ) can be re-used as a backreference or as part of a
replace operation; see Substitutions, below.
Groups may be nested.
(?<some name>...), (?'some name'...),(?(some name)...)
Names this group some name.
\gn , \g{n}
The n-th subexpression, aka parenthesised group. Using the second form has some
small benefits, like n being more than 9, or disambiguating when n might be
followed by digits. When n is negative, groups are counted backwards, so that \g-2
is the second last matched group.
\g{something},\k<something>
The string matching the subexpression named something.
\digit
Backreference: \1 matches an additional occurrence of a text matched by an earlier
part of the regex. Example: This regular expression: ([Cc][Aa][Ss][Ee]).*\1 would
match a line such as Case matches Case but not Case doesn't match cASE. A regex can
have multiple subgroups, so \2, \3, etc can be used to match others (numbers
advance left to right with the opening parenthesis of the group). So \n is a
synonym for \gn, but doesn't support the extension syntax for the latter.
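
The backreference example can be verified with a few lines of Python. The \1 syntax happens to be shared with Python's re module, which is otherwise a different engine; the \g forms described above are not available there inside patterns.

```python
import re

# \1 must repeat exactly the text that group 1 captured, case included.
pattern = re.compile(r"([Cc][Aa][Ss][Ee]).*\1")

same_case = pattern.search("Case matches Case")
other_case = pattern.search("Case doesn't match cASE")

print(same_case is not None, other_case is not None)
```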
Readability enhancements
(?:...)
A grouping construct that doesn't count as a subexpression, just grouping things
for easier reading of the regex.
(?#...)
Comments. The whole group is for humans only and will be ignored in matching text.
Using the x flag modifier (see section below) is also a good way to improve
readability in complex regular expressions.
Search modifiers
The following constructs control how matches condition other matches, or otherwise
alter the way search is performed. For those readers familiar with Perl, \G is not
supported.
\Q
Starts verbatim mode (Perl calls it "quoted"). In this mode, all characters are
treated as-is, the only exception being the \E end verbatim mode sequence.
\E
Ends verbatim mode. Thus, "\Q\*+\Ea+" matches "\*+aaaa".
(?flags-not-flags ...), (?flags-not-flags:...)
Applies flags and not-flags to search inside the parentheses. Such a construct may
have flags and may have not-flags - if it has neither, it is just a non-marking
group, which is just a readability enhancer. The following flags are known:
i : case insensitive (default: off)
m : ^ and $ match embedded newlines (default: as per ". matches newline")
s : dot matches newline (default: as per ". matches newline")
x : ignore unescaped whitespace in regex (default: off)
(?|expression using the alternation | operator)
If an alternation expression has subexpressions in some of its alternatives, you
may want the subexpression counter not to be altered by what is in the other
branches of the alternation. This construct will just do that.
For example, you get the following subexpression counter values:
# before ---------------branch-reset-----------  after
/ ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1            2         2  3        2     3     4
Without the construct, (p(q)r) would be group #3, and (t) group #5. With the
construct, they both report as group #2.
Control flow
Normally, a regular expression parses from left to right linearly. But you may need
to change this behaviour.
|
The alternation operator, which allows matching either of a number of options, like
in : one|two|three to match either of "one", "two" or "three". Matches are
attempted from left to right. Use (?:) to match an empty string in such a
construct.
(?n), (?signed-n)
Refers to subexpression #n. When a sign is present, go to the signed-n-th
expression.
(?0), (?R)
Backtrack to start of pattern.
(?&name)
Backtrack to subexpression named name.
(?assertionyes-pattern|no-pattern)
Matches yes-pattern if assertion is true, and no-pattern otherwise if provided.
Supported assertions are:
(?=assert) (positive lookahead)
(?!assert) (negative lookahead)
(?(R)) (true if inside a recursion)
(?(Rn)) (true if in a recursion to subexpression numbered n)
PCRE doesn't treat recursion expressions like Perl does:
In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure.
\K
Resets matched text at this point. For instance, matching "foo\Kbar" will not match
"bar". It will match "foobar", but will pretend that only "bar" matches. Useful when
you wish to replace only the tail of a matched subject and groups are clumsy to
formulate.
Assertions
These special groups consume no characters. Their successful matching counts, but
when they are done, matching starts over where it left off.
(?=pattern)
If pattern matches, backtrack to start of pattern. This allows using logical AND
for combining regexes.
For instance,
(?=.*[[:lower:]])(?=.*[[:upper:]]).{6,}
tries finding a lowercase letter anywhere. On success it backtracks and searches
for an uppercase letter. On yet another success, it checks whether the subject has
at least 6 characters.
"q(?=u)i" doesn't match "quit", because, as matching 'u' consumes 0 characters,
matching "i" in the pattern fails at "u" in the subject.
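
The logical AND trick can be tried with the following Python sketch. Python's re module has no [[:lower:]] or [[:upper:]] classes, so the ASCII ranges [a-z] and [A-Z] are swapped in as an assumption; the helper name acceptable is made up for the example.

```python
import re

# Two lookaheads test the same starting position; .{6,} then consumes the text.
has_both_cases = re.compile(r"(?=.*[a-z])(?=.*[A-Z]).{6,}")


def acceptable(candidate):
    """True if candidate has a lowercase letter, an uppercase letter and 6+ characters."""
    return has_both_cases.fullmatch(candidate) is not None


# The lookahead consumes nothing, so "i" is then compared against "u" and fails.
no_match = re.search(r"q(?=u)i", "quit")

print(acceptable("abcDef"), acceptable("abcdef"), acceptable("aB"), no_match)
```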
(?!pattern)
Matches if pattern didn't match.
(?<=pattern)
Asserts that pattern matches before some token.
(?<!pattern)
Asserts that pattern does not match before some token.
NOTE: pattern has to be of fixed length, so that the regex engine knows where to
test the assertion.
(?>pattern)
Match pattern independently of surrounding patterns, and don't backtrack into it.
Failure to match will cause the whole subject not to match.
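
A short Python sketch of the two lookbehind forms follows. The syntax is the same in Python's re module, which likewise insists on fixed-length lookbehind patterns; the sample text is invented for the illustration.

```python
import re

text = "cost $42, id 17"

# Positive lookbehind: digits that directly follow a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers that do not follow a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(after_dollar, not_after_dollar)
```

In both cases the dollar sign itself is never part of the match, since assertions consume no characters.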
Substitutions
\a,\e,\f,\n,\r,\t,\v
The corresponding control character, respectively BEL, ESC, FF, LF, CR, TAB and VT.

\Ccharacter, \xnn, \x{nnnn}
Like in search patterns, respectively the control character with the same low order
bits, the character with code nn and the character with code nnnn (requires
Unicode encoding).
\l
Causes next character to be output in lowercase.
\L
Causes next characters to be output in lowercase, until a \E is found.
\u
Causes next character to be output in uppercase.
\U
Causes next characters to be output in uppercase, until a \E is found.
\E
Puts an end to forced case mode initiated by \L or \U.
$&, $MATCH, ${^MATCH}
The whole matched text.
$`, $PREMATCH, ${^PREMATCH}
The text between the previous and current match, or the text before the match if
this is the first one.
$', $POSTMATCH, ${^POSTMATCH}
Everything that follows current match.
$LAST_SUBMATCH_RESULT, $^N
Returns what the last matching subexpression matched.
$+, $LAST_PAREN_MATCH
Returns what matched the last subexpression in the pattern.
$$
Returns $.
$n, ${n}, \n
Returns what matched the subexpression numbered n. Negative indices are not allowed.

$+{name}
Returns what matched subexpression named name.
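
For readers who script the same replacements outside Notepad++, here is a hedged Python sketch of roughly equivalent substitutions. Python's re.sub spells things differently: numbered groups are \1 or \g<1>, named groups are (?P<name>...) and \g<name>, and there is no \U or \L, so a replacement function is used instead for case changes. The sample strings are made up.

```python
import re

# Numbered groups in the replacement, like \1 and \2 above.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# No \U...\E in Python: a replacement function does the uppercasing.
shouted = re.sub(r"(\w+)@", lambda m: m.group(1).upper() + "@", "joe@example.org")

# Named groups, Python's counterpart of $+{name}.
named = re.sub(r"(?P<key>\w+)=(?P<value>\w+)", r"\g<value>=\g<key>", "a=1")

print(swapped, shouted, named)
```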
Zero length matches
While, in normal or extended mode, there would be no point in looking for text of
length 0, this can very normally happen with regular expressions. For instance, to
add something at the beginning of a line, you'll search for "^" and replace with
whatever is to be added.
Notepad++ would select the match, but there is no sensible way to select a stretch
zero characters long. When this happens, a tooltip very similar to function call tips
is displayed instead, with a caret pointing upwards to the empty match.
[Screenshot: tooltip marking a zero-length match found at the first column of line 5.]
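
The "search for ^ and replace" idea translates directly into a Python sketch, assuming re.MULTILINE to make ^ match at every line start. The "> " prefix is an arbitrary choice for the example.

```python
import re

text = "first\nsecond"

# Every match of ^ is empty: it is a position, not a piece of text.
positions = [m.start() for m in re.finditer(r"^", text, flags=re.MULTILINE)]

# Replacing those empty matches inserts the prefix at each line start.
quoted = re.sub(r"^", "> ", text, flags=re.MULTILINE)

print(positions)
print(quoted)
```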

Examples
These examples come from an earlier version of this page: Notepad++ RegExp Help, by
Author : Georg Dembowski

Add more examples using advanced features of PCRE

IMPORTANT
You have to check the box "regular expression" in the search & replace dialog.
When copying the strings out of here, pay close attention not to have additional
spaces in front of them! Otherwise the RegExp will not work!
Example 0
How to replace/delete full lines according to a regex pattern? Let's say you wish
to delete all the lines in a file that contain the word "unused", without leaving
blank lines in their stead. This means you need to locate the line, remove it all,
and additionally remove its terminating newline.
So, you'd want to do this:
Find: ^.*?unused.*?$\R
Replace with: nothing, not even a space
The regular expression appears to always work, and is to be read like this:
assert the start of a line
match some characters, stopping as early as required for the expression to match
the string you search in the file, "unused"
more characters, again stopping at the earliest necessary for the expression to
match
assert line ends
A newline character or sequence
Remember that .* gobbles everything to the end of line if ". matches newline" is
off, and to the end of file if the option is on!
Well, why the word "appears" above? Because this expression assumes each
line ends with an end of line sequence. This is almost always true, but may fail
for the last line in the file: it won't match and won't be deleted.
But the remedy is fairly simple: we translate in regex parlance that the newline
should match if it is there. So the correct expression actually is:
^.*?unused.*?$\R?
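
The difference between the two expressions shows up on a file whose last line has no newline. The sketch below uses Python's re module, which has no \R, so the group (?:\r\n|\n|\r) is swapped in as a stand-in; the sample text uses plain \n line endings.

```python
import re

NEWLINE = r"(?:\r\n|\n|\r)"  # stand-in for \R, which Python's re lacks
text = "keep 1\nunused a\nkeep 2\nlast unused"

# Mandatory newline: the last line has none, so it survives.
strict = re.sub(r"^.*?unused.*?$" + NEWLINE, "", text, flags=re.MULTILINE)

# Optional newline: the last line is removed as well.
lenient = re.sub(r"^.*?unused.*?$" + NEWLINE + "?", "", text, flags=re.MULTILINE)

print(repr(strict))
print(repr(lenient))
```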
Example 1
You use a MediaWiki (e.g. Wikipedia, Wikitravel) and want to make all headings one
"level higher", so a H2 becomes a H1 etc.
Search ^=(=)
Replace with \1
Click "Replace all"

You do this to find all headings 2...9 (two equal sign characters are required)
which begin at line beginning (^) and to replace the two equal sign characters by
only the last of the two, so eliminating one and having one remaining.
Search =(=)$
Replace with \1
Click "Replace all"

You do this to find all headings 2...9 (two equal sign characters are required)
which end at line ending ($) and to replace the two equal sign characters by only
the last of the two, so eliminating one and having one remaining.
== title == became = title =, you're done :-)
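
The two replacement passes can be sketched in Python as follows. The function name promote_headings is hypothetical, and re.MULTILINE is assumed so that ^ and $ work per line as they do in Notepad++.

```python
import re


def promote_headings(wikitext):
    """Drop one '=' from each end of every heading that has at least two."""
    # Pass 1: two leading equal signs become one.
    step1 = re.sub(r"^=(=)", r"\1", wikitext, flags=re.MULTILINE)
    # Pass 2: two trailing equal signs become one.
    return re.sub(r"=(=)$", r"\1", step1, flags=re.MULTILINE)


print(promote_headings("== title ==\n=== sub ===\nplain text"))
```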
Example 2
You have a document with a lot of dates, which are in German date format (dd.mm.yy)
and you'd like to transform them to sortable format (yy-mm-dd). Don't be put off by
the length of the search term – it's long, but consisting of pretty easy and short
parts.
Do the following:
Search ([^0-9.])([0123][0-9])\.([01][0-9])\.([0-9][0-9])([^0-9.])
Replace with \1\4-\3-\2\5
Click "Replace all"
You do this to fetch
the day, whose first number can only be 0, 1, 2 or 3
the month, whose first number can only be 0 or 1
but only if the separator is . and not 'any character' ( . versus \. )
but only if neither digits nor dots surround the date, as then it might be an IP
address instead of a date (with plain [^0-9] the dot after 14.13.14 would count as a
valid surrounding character, and the IP address would be rewritten)
and to write all of this in the opposite order, except for the surroundings. Pay
attention: Whatever SEARCH matches will be deleted and only replaced by the stuff
in the REPLACE field, thus it is mandatory to have the surroundings in the REPLACE
field as well!
Outcome:
31.12.97 became 97-12-31
14.08.05 became 05-08-14
the IP address 14.13.14.14 did not change
You're done :-)
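
The date conversion can be tried with Python's re module, as sketched below. The guard classes are written [^0-9.], so that the dot is excluded as well as the digits; with plain [^0-9] guards, the dot that follows 14.13.14 in the IP address would qualify as a surrounding character and the address would be partly rewritten. The function name to_sortable is made up.

```python
import re

# Guard character, day, month, year, guard character.
DATE = re.compile(r"([^0-9.])([0123][0-9])\.([01][0-9])\.([0-9][0-9])([^0-9.])")


def to_sortable(text):
    """Rewrite dd.mm.yy dates as yy-mm-dd, keeping the surrounding characters."""
    return DATE.sub(r"\1\4-\3-\2\5", text)


print(to_sortable("on 31.12.97 and 14.08.05 ok"))
print(to_sortable("the IP address 14.13.14.14 did not change"))
```

As in the search above, a date needs one character before and after it to be matched.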
Example 3
You have printed in Windows a file list using dir /b/s >filelist.txt to the file
filelist.txt and want to make local URLs out of them.
Open filelist.txt with Notepad++
Search \\
Replace with /
Click "Replace all" to change Windows path separator char \ into URL path separator
char /
Search ^(.*)$
Replace with file:///\1
Click "Replace all" to add file:/// in the beginning of all lines
Depending on your requirements, proceed to escape some characters like space to %20
etc.
C:\!\aktuell.csv became file:///C:/!/aktuell.csv
You're done :-)
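
Both steps fit in a few lines of Python, sketched here under two assumptions: the function name to_file_urls is hypothetical, and (.+) is used instead of (.*) so that an empty final line does not receive a prefix.

```python
import re


def to_file_urls(listing):
    """Turn Windows paths, one per line, into file:/// URLs (no percent-escaping)."""
    # Step 1: backslashes become forward slashes.
    forward = listing.replace("\\", "/")
    # Step 2: prefix every non-empty line.
    return re.sub(r"^(.+)$", r"file:///\1", forward, flags=re.MULTILINE)


print(to_file_urls("C:\\!\\aktuell.csv"))
```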
Example 4
Another Search Replace Example
[Data]
AS AF AFG 004 Afghanistan
EU AX ALA 248 Åland Islands
EU AL ALB 008 Albania, People's Socialist Republic of
AF DZ DZA 012 Algeria, People's Democratic Republic of
OC AS ASM 016 American Samoa
EU AD AND 020 Andorra, Principality of
AF AO AGO 024 Angola, Republic of
NA AI AIA 660 Anguilla
AN AQ ATA 010 Antarctica (the territory South of 60 deg S)
NA AG ATG 028 Antigua and Barbuda
SA AR ARG 032 Argentina, Argentine Republic
AS AM ARM 051 Armenia
NA AW ABW 533 Aruba
OC AU AUS 036 Australia, Commonwealth of
Search for: ([A-Z]+) ([A-Z]+) ([A-Z]+) ([0-9]+) (.*)
Replace with: \1,\2,\3,\4,\5
Hit "Replace All"
Final Data:
AS,AF,AFG,004,Afghanistan
EU,AX,ALA,248,Åland Islands
EU,AL,ALB,008,Albania, People's Socialist Republic of
AF,DZ,DZA,012,Algeria, People's Democratic Republic of
OC,AS,ASM,016,American Samoa
EU,AD,AND,020,Andorra, Principality of
AF,AO,AGO,024,Angola, Republic of
NA,AI,AIA,660,Anguilla
AN,AQ,ATA,010,Antarctica (the territory South of 60 deg S)
NA,AG,ATG,028,Antigua and Barbuda
SA,AR,ARG,032,Argentina, Argentine Republic
AS,AM,ARM,051,Armenia
NA,AW,ABW,533,Aruba
OC,AU,AUS,036,Australia, Commonwealth of
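
The same search and replace can be reproduced with Python's re module, shown here on the first two rows of the data only. The variable names are made up for the sketch.

```python
import re

ROW = re.compile(r"([A-Z]+) ([A-Z]+) ([A-Z]+) ([0-9]+) (.*)")
rows = "AS AF AFG 004 Afghanistan\nEU AX ALA 248 Åland Islands"

# Only the four separating spaces become commas; (.*) keeps the name intact.
converted = ROW.sub(r"\1,\2,\3,\4,\5", rows)

print(converted)
```

Spaces inside the country name are preserved because the last group takes the rest of the line as a whole.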
Example 5
How to recognize a balanced expression, in mathematics or in programming?
Let's first explicitly describe what we wish to match. An expression is balanced if
and only if all areas delineated by parentheses contain a balanced expression. Like
in: 1+f(x+g())-h(2).
This leads to define the following kinds of groups:
balanced ::= no_paren paren ... no_paren
no_paren ::= [^()]* -- a possibly empty group of characters without a single
parenthesis
paren ::= ( balanced )
Can we represent this as a regex? We cannot as-is.
The first hurdle is that there is no primitive construct to represent an
alternating sequence of tokens. A common trick then is to represent the sequence as
a repetition of the repeating pattern - here, no_paren followed by paren -, with
any odd stuff at the end added.
So we have a more manageable, although slightly more complex, representation:
balanced ::= simple* no_paren
simple ::= no_paren paren
no_paren ::= [^()]*
paren ::= ( balanced )

A second hurdle is that parentheses are not ordinary characters. That's ok, we'll
escape them as \( and \) respectively.
The third one is more interesting. How do we represent the whole of an expression
inside a nested sub-expression? This smacks of recursion. PCRE has recursion. The
simplest form of it is going back to the start of the search pattern - not the
searched text! - and doing it again. It writes as (?R). You remember seeing this
one in the main list, right?
So:
we know how to match a no_paren. It will be nicer to give it an explicit name. This
we'll do in the embellishments section below.
we just discovered how to write a paren: \((?R)\)
This gives us the following hard to read, but correct regex:
([^()]*\((?R)\))*[^()]*
Try it, it works. But it is about as hard to decrypt as a badly indented piece of
code without a comment and with unpromising, unclear identifiers. This is only one
of the reasons why old Perl earned itself the rare qualifier of "write-only
language".
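
Python's built-in re module has no (?R), so the regex above cannot be tested there directly. As a cross-check of what it should accept, here is a plain depth-counter sketch of the same notion of balance. This is a different technique, not a translation of the regex, and the function name is_balanced is made up.

```python
def is_balanced(expression):
    """True if every '(' has a matching ')' and none closes too early."""
    depth = 0
    for char in expression:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ')' with no open '(' before it
                return False
    return depth == 0


print(is_balanced("1+f(x+g())-h(2)"), is_balanced("f(x"), is_balanced(")("))
```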
Embellishments
First of all, let's add some spacing so that we can identify the components of the
regex. Spacing can be added using the x modifier flag, which is off by default.
So we can write something more legible:
(?x: ([^ ( ) ]* \( (?R) \) )* [^()]* )
Now let's add some commenting
(?x: ([^ ( ) ]* \( (?# The next group means "start matching the
beginning of the regex")(?R) \) )* [^()]* )
In Perl, we could go further by assigning names to groups. However, in PCRE this
will not work, because any named group, once matched, won't change. This is
obviously not what we want.
