#############################################################################
#
# name: main
# purpose: entry point of the script
#############################################################################
#
# @ARGV holds all script command line arguments (pos 0 is not prog-name)
# $0 holds script filename
print "hello world\n";
Data Types
#############################################################################
#
# name: main
# purpose: show the basic datatypes
#############################################################################
#
# scalar operations
$num = $num*2 + 3 - $float; # $num is 23.86
$num = 2**4 % 5; # $num is 1 - exp then modulus
$num++; # $num is 2 - inc after eval
$ms = (1<< 3)&0xff|0x03^0x01; # $ms is 0x0a
print ++($foo = '99'); # prints '100' - inc before eval
$new = $str." world"; # $new is "hello world"
# array operations
$first = $nums[0]; # $first is 1
$strings[1] = "neo"; # @strings is ("one","neo")
$mixed[2] = 37; # @mixed is ("three",3.13,37) - grows automatically
@joined = (@mixed,8); # @joined is ("three",3.13,37,8)
@sl = @nums[0,-1,1]; # @sl is (1,3,2) - array slice (specific indices)
@sl = @nums[0..2]; # @sl is (1,2,3) - array slice (span)
$len = scalar(@nums); # an array in scalar context is the list length (3)
$last_index = $#nums; # $last_index is 2 (the last index in the list)
$#nums = -1; # @nums is () - empty
# hash operations
$jims_age = $ages{"jim"}; # $jims_age is 18
$ages{"jim"}++; # key "jim" has value of 19 in %ages
$ages{"ron"} = 24; # key "ron" with value 24 added to %ages
@sl = @ages{"ted","ron"}; # @sl is (21,24) - hash slice
$stats = scalar(%ages); # string eg. "1/16" - 1 used bucket out of 16 allocated (Perl 5.26+ returns the key count instead)
# reference operations
$num_copy = $$scalarref; # dereference using {type}$reference
@mixed_copy = @$arrayref;
$value = $$hashref{"jim"};
$value = $arrayref->[0]; # or dereference using $reference->
$value = $hashref->{"jim"};
Conditionals
#############################################################################
#
# name: main
# purpose: show the basic conditionals
#############################################################################
#
# regular c style if statement, must use blocks
if (defined($value) && ($value == 1)) # defined() tests for undef
{
print "value equals 1\n";
}
Functions
#############################################################################
#
# name: main
# purpose: show function and subroutine syntax
#############################################################################
#
# return values
sub seventeen1 # return keyword indicates return value
{
return 17;
}
sub seventeen2 # if no return exists, retval is the last expression
{
17;
}
$num = seventeen1() + seventeen2() + 53;
sub retlist # all datatypes can be returned
{
return (1,2,3);
}
($one,$two,$thr) = retlist; # () are optional (even when we have args)
# arguments
sub has_args
{
@func_arguments = @_; # all arguments are members of the list @_
$first_arg = $_[0]; # returns undef if no arg given
($arg1,$arg2,$arg3) = @_; # the common perl way to handle function arguments
}
has_args($num,@l1,22,@l2); # all arguments are flattened into one list
sub takes_two_lists # to pass several lists / hashes, use references
{
($l1ref,$l2ref) = @_;
@list1 = @$l1ref;
}
takes_two_lists(\@a,\@b);
Regular Expressions
#############################################################################
#
# name: main
# purpose: show regular expression usage
#############################################################################
#
# matching
$call911 = 'Someone, call 911.'; # the string we want to match upon
$found = ($call911 =~ /call/); # $found is TRUE, matched 'call'
@res = ($call911 =~ /Some(...)/); # @res is ('one'), matched 'Someone'
$entire_res = $&; # $entire_res is 'Someone'
$brack1_res = $1; # $brack1_res is 'one', $+ for last brackets
($entire_pos,$brack1_pos) = @-; # $entire_pos is 0, $brack1_pos is 4
($entire_end,$brack1_end) = @+; # $entire_end is 7, $brack1_end is 7
# global matching (get all found)
$call911 =~ /(.o.)/g; # g is global-match; outside list context each call finds only the next match, $1 is 'Som'
@res = ($call911 =~ /(.o.)/g); # @res is ('Som','eon'), $& is 'eon'
# substituting
$greeting = "hello world"; # the string we want to replace in
$greeting =~ s/hello/goodbye/; # $greeting is 'goodbye world'
# splitting
@l = split(/\W+/,$call911); # @l is ('Someone','call','911')
@l = split(/(\W+)/,$call911); # @l is ('Someone',', ','call',' ','911','.')
# pattern syntax
$call911 =~ /c.ll/; # . is anything but \n, $& is 'call'
$call911 =~ /c.ll/s; # s is single-line, . will include \n, $& is 'call'
$call911 =~ /911\./; # \ escapes metachars {}[]()^$.|*+?\, $& is '911.'
$call911 =~ /o../; # matches earliest, $& is 'ome'
$call911 =~ /g?one/; # ? is 0 or 1 times, $& is 'one'
$call911 =~ /cal+/; # + is 1 or more times, $& is 'call', * for 0 or more
$call911 =~ /cal{2}/; # {2} is exactly 2 times, $& is 'call'
$call911 =~ /cal{0,3}/; # {0,3} is 0 to 3 times, $& is 'call', {2,} for >= 2
$call911 =~ /S.*o/; # matches are greedy, $& is 'Someo'
$call911 =~ /S.*?o/; # ? makes match non-greedy, $& is 'So'
$call911 =~ /^.o/; # ^ must match beginning of line, $& is 'So'
$call911 =~ /....$/; # $ must match end of line, $& is '911.'
$call911 =~ /9[012-9a-z]/;# one of the letters in [...], $& is '91'
$call911 =~ /.o[^m]/; # none of the letters in [^...], $& is 'eon'
$call911 =~ /\d+/; # \d is digit, $& is '911' (\d* would match the empty string at position 0)
$call911 =~ /S\w*/; # \w is word [a-zA-Z0-9_], $& is 'Someone'
$call911 =~ /..e\b/; # \b is word boundary, $& is 'one', \B for non-boundary
$call911 =~ / \D.../; # \D is non-digit, $& is ' call', \W for non-word
$call911 =~ /\s.*\s/; # \s is whitespace char [\t\n ], $& is ' call '
$call911 =~ /\x39\x31+/; # \x is hex byte, $& is '911'
$call911 =~ /Some(.*),/; # (...) extracts, $1 is 'one', $& is 'Someone,'
$call911 =~ /e(one|two)/; # | means or, $& is 'eone'
$call911 =~ /e(?:one|tw)/;# (?:...) does not extract, $& is 'eone', $1 is undef
$call911 =~ /(.)..\1/; # \1 is memory of first brackets, $& is 'omeo'
$call911 =~ /some/i; # i is case-insensitive, $& is 'Some'
$call911 =~ /^Some/m; # m is multi-line, ^ also matches at the start of each line, $& is 'Some'
$call911 =~ m!call!; # use ! instead of /, no need for \/, $& is 'call'
Special Variables
#############################################################################
#
# name: main
# purpose: show some special internal variables
#############################################################################
#
# $_ - default input
print for (1..10); # in many places, no var will cause work on $_
print $_ for (1..10); # same as above
Standard IO
#############################################################################
#
# name: main
# purpose: show some basic IO and file handling
#############################################################################
#
# cleanup
close(IN);
close(OUT);
This module provides syntax highlighting for Perl code. The design bias is roughly line-oriented
and streamed (ie, processing a file line-by-line in a single pass). Provisions may be made in the
future for tasks related to "back-tracking" (ie, re-doing a single line in the middle of a stream)
such as speeding up state copying.
Constructors
The only constructor provided is new(). When called on an existing object, new() will create a
new copy of that object. Otherwise, new() creates a new copy of the (internal) Default Object.
Note that the use of the procedural syntax modifies the Default Object and that those changes
will be reflected in any subsequent new() calls.
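The copying behaviour of new() can be tried out with a short script. This is only a sketch: it assumes Syntax::Highlight::Perl is installed (it skips itself otherwise) and that set_start_format() and get_start_format() behave as described later in this document. The format strings are arbitrary placeholders.

```perl
use strict;
use warnings;

# Assumption: Syntax::Highlight::Perl is installed; skip quietly if it is not.
my $have_module = eval { require Syntax::Highlight::Perl; 1 };

my ($original, $copy);
if ($have_module) {
    # Called on the package: starts from a copy of the Default Object.
    $original = Syntax::Highlight::Perl->new;
    $original->set_start_format(Keyword => '<b>');

    # Called on an object: starts from a copy of that object.
    $copy = $original->new;
    $copy->set_start_format(Keyword => '<em>');   # should leave $original alone

    print 'original: ', $original->get_start_format('Keyword'), "\n";
    print 'copy:     ', $copy->get_start_format('Keyword'), "\n";
} else {
    print "Syntax::Highlight::Perl not available; skipping\n";
}
```

Keeping all configuration on objects, rather than on the Default Object, avoids the carry-over effect noted above.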
Formatting
Formatting is done using the format_string() method. Call format_string() with one or
more strings to format, or it will default to using $_.
You can also retrieve the text used for formatting for an element via get_start_format() or
get_end_format(). Bulk retrieval of the names or values of defined formats is possible via
get_format_names_list() (names), get_start_format_values_list() and
get_end_format_values_list().
See "FORMAT TYPES" later in this document for information on what format elements can be
used.
You can reset all of the above states (and a few other internal ones) using reset().
In unstable (TRUE) mode, formatting is not considered to be persistent with nested formats. Or,
put another way, when unstable, the formatter can only "remember" one format at a time and
must reinstate formatting for each token. An example of unstable formatting is using ANSI color
escape sequences in a terminal.
In stable (FALSE) mode (the default), formatting is considered persistent within arbitrarily
nested formats. Even in stable mode, however, formatting is never allowed to span multiple
lines; it is always fully closed at the end of the line and reinstated at the beginning of a new line,
if necessary. This is to ensure properly balanced tags when only formatting a partial code
snippet. An example of stable formatting is HTML.
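As a rough illustration of unstable formatting, the sketch below colours a line for an ANSI terminal. It assumes Syntax::Highlight::Perl is installed (it skips itself otherwise); the particular escape codes and the sample line are arbitrary choices, not part of the module.

```perl
use strict;
use warnings;

# Assumption: Syntax::Highlight::Perl is installed; skip quietly if it is not.
my $have_module = eval { require Syntax::Highlight::Perl; 1 };

my $colored;
if ($have_module) {
    my $formatter = Syntax::Highlight::Perl->new;

    # ANSI attributes do not nest, so every token must carry its own codes.
    $formatter->unstable(1);

    $formatter->set_start_format(
        Comment_Normal => "\e[32m",   # green
        Keyword        => "\e[1m",    # bold
    );
    $formatter->set_end_format(
        Comment_Normal => "\e[0m",    # reset all attributes
        Keyword        => "\e[0m",
    );

    $colored = $formatter->format_string('my $x = 1; # a comment');
    print "$colored\n";
} else {
    print "Syntax::Highlight::Perl not available; skipping\n";
}
```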
Substitutions
Using define_substitution(), you can have the formatter substitute certain strings with
others, after the original string has been parsed (but before formatting is applied). This is useful
for escaping characters special to the output mode (eg, > and < in HTML) without them affecting
the way the code is parsed.
You can retrieve the current substitutions (as a hash-ref) via substitutions().
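A minimal sketch of HTML output with substitutions might look like the following. It assumes Syntax::Highlight::Perl is installed (it skips itself otherwise); the tags chosen for each Format Type and the sample line are illustrative only.

```perl
use strict;
use warnings;

# Assumption: Syntax::Highlight::Perl is installed; skip quietly if it is not.
my $have_module = eval { require Syntax::Highlight::Perl; 1 };

my $html;
if ($have_module) {
    my $formatter = Syntax::Highlight::Perl->new;

    # Escape characters that are special in HTML. The code is parsed first,
    # so these replacements do not change how it is tokenized.
    $formatter->define_substitution(
        '<' => '&lt;',
        '>' => '&gt;',
        '&' => '&amp;',
    );

    $formatter->set_start_format(Comment_Normal => '<i>',  Keyword => '<b>');
    $formatter->set_end_format(  Comment_Normal => '</i>', Keyword => '</b>');

    $html = $formatter->format_string('my $ok = $x < $y; # compare');
    print "<pre>$html</pre>\n";

    # The current table is available as a hash reference.
    my $table = $formatter->substitutions;
    print join(' ', sort keys %$table), "\n";
} else {
    print "Syntax::Highlight::Perl not available; skipping\n";
}
```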
FORMAT TYPES
Several of the Format Types have underscores in their name. This underscore is special, and
indicates that the Format Type can be "generalized." This means that you can assign a value to
just the first part of the Format Type name (the part before the underscore) and that value will be
applied to all Format Types with the same first part. For example, the Format Types for all types
of variables begin with "Variable_". Thus, if you assign a value to the Format Type "Variable", it
will be applied to any type of variable. Generalized Format Types take precedence over non-
generalized Format Types. So the value assigned to "Variable" would be applied to
"Variable_Scalar", even if "Variable_Scalar" had a value explicitly assigned to it.
You can also define a "short-cut" name for each Format Type that can be generalized. The short-
cut name would be the part of the Format Type name after the underscore. For example, the
short-cut for "Variable_Scalar" would be "Scalar". Short-cut names have the least precedence
and are only assigned if neither the generalized Type name, nor the full Type name have values.
Comment_Normal
A normal Perl comment. Starts with '#' and goes until the end of the line.
Comment_POD
Inline documentation. Starts with a line beginning with an equal sign ('=') followed by a
word (eg: '=pod') and continuing until a line beginning with '=cut'.
Directive
Either the "she-bang" line at the beginning of the file, or a line directive altering what the
compiler thinks the current line and file is.
Label
A loop or statement label (to be the target of a goto, next, last or redo).
Quote
Any string or character that begins or ends a String. This includes, but is not necessarily limited
to: quote-like regular expression operators (m//, s///, tr///, etc), a Here-Document
terminating line, the lone period terminating a format, and, of course, normal quotes (', ",
`, q{}, qq{}, qr{}, qx{}).
String
Any text within quotes, formats, Here-Documents, Regular Expressions, and the like.
Subroutine
The identifier used to define, identify, or call a subroutine (or method). Note that
Syntax::Highlight::Perl cannot recognize a subroutine if it is called without using
parentheses or an ampersand, or methods called using the indirect object syntax. It
formats those as barewords.
Variable_Scalar
A scalar variable.
Note that (theoretically) this format is not applied to non-scalar variables that are being
used as scalars (ie: array or hash lookups, nor references to anything other than scalars).
Syntax::Highlight::Perl figures out (or at least tries to) the actual type of the variable
being used (by looking at how you're subscripting it) and formats it accordingly. The first
character of the variable (ie, the $, @, %, or *) tells you the type of value being used, and
the color (hopefully) tells you the type of variable being used to get that value.
(See "KNOWN ISSUES" for information about when this doesn't work quite right.)
Variable_Array
An array variable.
Variable_Hash
A hash variable.
Variable_Typeglob
A typeglob. Note that typeglobs not beginning with an asterisk (*) (eg: filehandles) are
formatted as barewords. This is because, well, they are.
Whitespace
Whitespace. Not usually formatted but it can be.
Character
Keyword
Note that Perl does not make any distinction between keywords and built-in functions (at
least not in the documentation). Thus I had to make a subjective call as to what would be
considered keywords and what would be built-in functions.
Builtin_Function
The list of built-in functions can be found (and overloaded) in the variable
$Syntax::Highlight::Perl::builtin_list_re as a pre-compiled regular expression.
Builtin_Operator
A Perl built-in function, called as a list or unary operator (ie, without using parentheses).
The list of built-in functions can be found (and overloaded) in the variable
$Syntax::Highlight::Perl::builtin_list_re as a pre-compiled regular expression.
Operator
A Perl operator.
Bareword
Note that this does not apply to the package portion of a fully qualified variable name.
Number
A numeric literal.
Symbol
CodeTerm
The special tokens that signal the end of executable code and the beginning of the DATA
section. Specifically, '__END__' and '__DATA__'.
DATA
It is actually recommended that you use the OO interface, as this allows you to instantiate
multiple, concurrent-yet-separate formatters. Though I cannot think of why you would need
multiple formatters instantiated. :-)
One point to note: the new() method uses the Default Object to initialize new objects. This
means that any changes to the state of the Default Object (including Format definitions) made by
using the procedural interface will be reflected in any subsequently created objects. This can be
useful in some cases (eg, call set_format() procedurally just before creating a batch of new
objects to define default Formats for them all) but will most likely lead to trouble.
METHODS
new PACKAGE
new OBJECT
Creates a new object. If called on an existing object, creates a new copy of that object
(which is thenceforth totally separate from the original).
reset
Resets the object's internal state. This breaks out of strings and here-docs, ends PODs,
resets the line-count, and otherwise gets the object back into a "normal" state to begin
processing a new stream.
Note that this does not reset any user options (including formats and format stability).
unstable EXPR
unstable
If called with a non-zero number, puts the formatter into unstable formatting mode.
In unstable mode, it is assumed that formatting is not persistent one token to the next and
that each token must be explicitly formatted.
in_heredoc
in_string
Returns true if the next string to be formatted will be inside a multi-line string.
in_pod
Returns true if the formatter would consider the next string passed to it as being within a
POD structure. This is false immediately before any POD instigators (=pod, =head1,
=item, etc), true immediately after an instigator, throughout the POD and immediately
before the POD terminator (=cut), and false immediately after the POD terminator.
was_pod
Returns true if the last line of the string just formatted was part of a POD structure. This
includes the /^=\w+/ POD instigators and terminators.
in_data
Returns true if the next string to be formatted will be inside the DATA section (ie,
follows a __DATA__ or __END__ tag).
line_count
substitutions
Returns a reference to the substitution table used. The substitution table is a hash whose
keys are the strings to be replaced, and whose values are what to replace them with.
define_substitution HASH_REF
define_substitution LIST
Allows user to define certain characters that will be substituted before formatting is done
(but after they have been processed for meaning).
If the first parameter is a reference to a hash, the formatter will replace its own hash with
the given one, and subsequent changes to the hash outside the formatter will be reflected.
Otherwise, it will copy the arguments passed into its own hash, and any substitutions
already defined (but not in the parameter list) will be preserved. (ie, the new substitutions
will be added, without destroying what was there already.)
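The difference between the two calling forms can be modelled in a few lines of plain Perl. Sketch::Formatter below is a hypothetical stand-in written for this illustration; it is not the module's actual implementation, only a model of the behaviour described above.

```perl
use strict;
use warnings;

package Sketch::Formatter;   # hypothetical stand-in, not the real module

sub new { return bless { subs => {} }, shift }

sub define_substitution {
    my $self = shift;
    if (ref($_[0]) eq 'HASH') {
        # Hash reference: adopt the caller's hash, so later outside
        # changes are visible to the formatter.
        $self->{subs} = $_[0];
    } else {
        # List: merge into the existing table, keeping older entries.
        my %new = @_;
        $self->{subs}{$_} = $new{$_} for keys %new;
    }
    return;
}

sub substitutions { return $_[0]{subs} }

package main;

my $f = Sketch::Formatter->new;
$f->define_substitution('<' => '&lt;', '>' => '&gt;');
$f->define_substitution('&' => '&amp;');       # earlier entries survive
print join(' ', sort keys %{ $f->substitutions }), "\n";

my %mine = ('<' => '[');
$f->define_substitution(\%mine);               # replaces the whole table
$mine{'>'} = ']';                              # seen through the formatter
print join(' ', sort keys %{ $f->substitutions }), "\n";
```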
set_start_format HASH_REF
set_start_format LIST
Given either a list of keys/values, or a reference to a hash of keys/values, copy them into
the object's Formats list.
set_end_format HASH_REF
set_end_format LIST
Given either a list of keys/values, or a reference to a hash of keys/values, copy them into
the object's Formats list.
set_format LIST
get_start_format LIST
Retrieve the string that is inserted to begin a given format type (starting format string).
First: Prefer the names joined by underscore, from most general to least. For example,
given ("Variable", "Scalar"): "Variable" then "Variable_Scalar".
Second: Then try each name singly, in reverse order. For example, "Scalar" then
"Variable".
get_end_format LIST
Retrieve the string that is inserted to end a given format type (ending format string).
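The name lookup order described for get_start_format() can be sketched in plain Perl. Both candidate_names() and lookup_format() are hypothetical helpers written for this illustration, and returning an empty string when nothing matches is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: list the names to try, in order, for a format
# type given as a list of name parts.
sub candidate_names {
    my @parts = @_;
    my @names;
    # First: names joined by underscore, from most general to least.
    for my $i (0 .. $#parts) {
        push @names, join('_', @parts[0 .. $i]);
    }
    # Second: each part on its own, in reverse order.
    push @names, reverse @parts;
    return @names;
}

# Hypothetical lookup: return the first defined value among the candidates.
sub lookup_format {
    my ($formats, @parts) = @_;
    for my $name (candidate_names(@parts)) {
        return $formats->{$name} if defined $formats->{$name};
    }
    return '';   # assumption: nothing defined means no formatting
}

my %start = (Variable => '<b>', Variable_Scalar => '<i>', Scalar => '<u>');

print join(', ', candidate_names('Variable', 'Scalar')), "\n";
print lookup_format(\%start, 'Variable', 'Scalar'), "\n";
```

With this order the generalized name wins over the full name, and the short-cut name is used only when neither of the others has a value.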
get_format_names_list
get_start_format_values_list
Returns a list of the values of all the start Formats defined (in the same order as the
names returned by get_format_names_list()).
get_end_format_values_list
Returns a list of the values of all the end Formats defined (in the same order as the names
returned by get_format_names_list()).
format_string LIST
Formats one or more strings of Perl code. If no strings are specified, defaults to $_.
Returns the list of formatted strings (or the first string formatted if called in scalar
context).
Note: The end of the string is considered to be the end of a line, regardless of whether or
not there is a trailing line-break (but trailing line-breaks will not cause an extra, empty
line).
Another Note: The function actually uses $/ to determine line-breaks, unless $/ is set to
\n (newline). If $/ is \n, then it looks for the first match of m/\r?\n|\n?\r/ in the string
and uses that to determine line-breaks. This is to make it easy to handle non-unix text.
Whatever characters it ends up using as line-breaks are preserved.
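The line-break rule above can be sketched as a small stand-alone function. detect_line_break() is a hypothetical helper written for this illustration, and falling back to a newline when the text contains no line-break at all is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the line-break sequence for a chunk of text,
# following the rule described above.
sub detect_line_break {
    my ($text) = @_;
    # A non-default $/ is used as-is.
    return $/ if defined $/ && $/ ne "\n";
    # Otherwise use the first line-break-like sequence found in the text.
    return $1 if $text =~ m/(\r?\n|\n?\r)/;
    return "\n";   # assumption: default when no line-break is present
}

for my $sample ("a\r\nb\r\n", "a\rb\r", "a\nb\n") {
    my $break = detect_line_break($sample);
    my @lines = split /\Q$break\E/, $sample;
    printf "%d lines, break is %d character(s) long\n",
        scalar(@lines), length $break;
}
```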
Returns TOKEN wrapped in the start and end Formats corresponding to LIST (as would
be returned by get_start_format( LIST ) and get_end_format( LIST ),
respectively).
Barewords used as keys to a hash are formatted as strings. This is Good. They should not be,
however, if they are not the only thing within the curly braces. That can be fixed.
This version does not handle formats (see perlform(1)) very well. It treats them as Here-
Documents and ignores the rules for comment lines, as well as the fact that picture lines are not
supposed to be interpolated. Thus, your picture lines will look strange with the '@'s being
formatted as array variables (albeit, invalid ones). Ideally, it would also treat value lines as
normal Perl code and format accordingly. I think I'll get to the comment lines and non-
interpolating picture lines first. If/When I do get this fixed, I will most likely add a format type of
'Format' or something, so that they can be formatted differently, if so desired.
This version does not handle Regular Expression significant characters. It simply treats Regular
Expressions as interpolated strings.
User-defined subroutines, called without parentheses, are formatted as barewords. This is
because there is no way to tell them apart from barewords without parsing the code, which would
require us to go as far as perl does when doing the -c check (ie, executing BEGIN and END
blocks and the like). That's not going to happen.
If you are indexing (subscripting) an array or hash, the formatter tries to figure out the "real"
variable class by looking at how you index the variable. However, if you do something funky (but
legal in Perl) and put line-breaks or comments between the variable class
character ($) and your identifier, the formatter will get confused and treat your variable as a
scalar until it finds the index character. Then it will format the scalar class character ($) as a
scalar and your identifier as the "correct" class.
If you put a line-break between your variable identifier and its indexing character (see above),
which is also legal in Perl, the formatter will never find it and treat your variable as a scalar.
If you put a line-break between a bareword hash-subscript and the hash variable, or between a
bareword and its associated => operator, the bareword will not be formatted correctly (as a
string). (Noticing a pattern here?)
AUTHOR
Copyright (c) 2001 Cory Johns. This library is free software; you can redistribute and/or modify
it under the same conditions as Perl itself.
TO DO
1. Improve handling of regular expressions. Add support for regexp-special characters. Recognize
the /e option to the substitution operator (maybe).
2. Improve handling of formats. Don't treat format definitions as interpolating. Handle format-
comments. Possibly format value lines as normal Perl code.
3. Create in-memory deep-copy routine to replace eval(Data::Dumper) deep-copy.
4. Generalize state transitions (reset() and, in the future, copy_state()) to use non-hard-
coded keys and values for state variables. Probably will extrapolate them into an overloadable
hash, and use the aforementioned deep-copy to assign them.
5. Create a method to save or copy states between objects ( copy_state()). Would be useful for
using this module in an editor.
6. Add support for greater-than-one length special characters. Specifically, octal, hexadecimal, and
control character codes. For example, \644, \x1a4 or \c[.
REVISIONS