Linguistics 158: Computer-aided methods in linguistics
John B. Lowe
Department of Linguistics
University of California, Berkeley - Spring 1997
Introduction to programming in PERL
Week / class: 8 / 15 : Lec, Tu, 11 Mar 1997
Preliminaries
- Evaluation
- Homeworks, etc.
Remarks on coding and debugging
- The halting problem: "repeat forever"
- Proving programs correct
Debugging
- Compile-time errors vs. run-time errors
- Compile-time errors: syntax errors and inconsistent use of typed data elements
- Run-time errors: data dependency, exceeding capacities, invalid operations after changes of state
- Tracing execution
- Error messages and how to interpret them
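The compile-time / run-time distinction above can be sketched in PERL itself. This is a minimal illustration, not part of the lecture code: the sub names safe_divide and compile_check are hypothetical, and the sketch assumes Perl 5, where a block eval traps run-time errors and a string eval traps compile-time errors, both reporting through $@.

```perl
use strict;
use warnings;

# Hypothetical helper: trap a run-time error (division by zero)
# with a block eval; the error message arrives in $@.
sub safe_divide {
    my ($x, $y) = @_;
    my $result = eval { $x / $y };
    return defined $result ? $result : "run-time error: $@";
}

# Hypothetical helper: ask perl to compile a piece of code without
# running it. A syntax error makes the string eval return undef.
sub compile_check {
    my ($code) = @_;
    my $ok = eval "sub { $code }";
    return $ok ? "compiles" : "syntax error";
}

print safe_divide(10, 2), "\n";              # valid code, valid data
print safe_divide(1, 0);                     # valid code, bad data
print compile_check('my $n = 1;'), "\n";     # well-formed statement
print compile_check('my $n = ;'), "\n";      # malformed statement
```

The division by zero is only detectable when the data arrive, while the malformed assignment is rejected before any statement runs.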
History and background of PERL
- C shell, C, sed, awk, grep, and Unix hacking
- A system administrator's dream
Our interest in PERL
- String handling and text manipulation
- Source of "canned" code for virtually everything.
Sources of PERL info
PERL syntax and semantics
Getting started : "console I/O"
#!/usr/bin/perl
print "Enter Hexadecimal Number: "; # Ask for a number.
$answer = <STDIN>; # Input the number.
chomp($answer); # Remove the trailing newline.
print hex($answer),"\n"; # Print out new number.
Statements
The only kind of simple statement is an expression evaluated for its side effects ... [which] must terminate with a semicolon. Simple statements may optionally be followed by a single modifier, just before the terminating semicolon. Possible modifiers are:
if EXPR
unless EXPR
while EXPR
until EXPR
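The four modifiers can be seen side by side in a short sketch. The variable names @log and $n are made up for illustration; each line is a simple statement followed by exactly one modifier.

```perl
use strict;
use warnings;

my @log;
my $n = 3;

push @log, "n is odd"      if $n % 2;       # runs only when EXPR is true
push @log, "n is not zero" unless $n == 0;  # runs only when EXPR is false
push @log, "n=" . $n--     while $n > 0;    # repeats as long as EXPR is true
push @log, "n=" . $n++     until $n >= 2;   # repeats until EXPR becomes true

print join("\n", @log), "\n";
```

In each case the condition is tested before the statement runs, so the while line stops once $n reaches 0 and the until line stops once it climbs back to 2.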
Flow control in PERL
NB: when a while or until modifier follows a do-BLOCK, the block executes once before the conditional is evaluated. This is so that you can write loops like:
do {
    $_ = <STDIN>;
    ...
} until $_ eq ".\n";
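A small sketch makes the difference in evaluation order visible. The counters $plain_runs and $do_runs are hypothetical names; the condition is the constant 1, so it is already true before either loop starts.

```perl
use strict;
use warnings;

my $plain_runs = 0;
$plain_runs++ until 1;        # plain statement: condition checked first, body never runs

my $do_runs = 0;
do { $do_runs++ } until 1;    # do-BLOCK: body runs once, then the condition is checked

print "plain: $plain_runs, do-block: $do_runs\n";
```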
Let's read some more PERL!
# a simpleminded Pascal comment stripper
# (warning: assumes no { or } in strings)
line: while (<STDIN>) {
    while (s|({.*}.*){.*}|$1 |) {}
    s|{.*}| |;
    if (s|{.*| |) {
        $front = $_;
        while (<STDIN>) {
            if (/}/) { # end of comment?
                s|^|$front{|;
                redo line;
            }
        }
    }
    print;
}
Labels in PERL
line: while (<STDIN>) {
    next line if /^#/; # discard comments
    ...
}
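Labels matter most with nested loops, where next or last would otherwise apply only to the innermost loop. The following sketch uses made-up data and a hypothetical label LINE to skip a whole line as soon as one of its fields is "#".

```perl
use strict;
use warnings;

# Hypothetical input: lines of colon-separated fields.
my @lines = ("a:b:c", "d:#:f", "g:h:i");
my @kept;

LINE: foreach my $line (@lines) {
    foreach my $field (split(/:/, $line)) {
        # Without the label, next would only skip to the next field.
        next LINE if $field eq '#';
    }
    push(@kept, $line);
}

print join("\n", @kept), "\n";
```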
Variables
- Filehandles
- $ : Scalar variables
- @ : "normal" array
- % : associative arrays
- * : typeglobs (every variable of a given name)
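As a quick illustration of the three main prefixes, here is a sketch with made-up values. Note that a single element of an array or associative array is itself a scalar, so it is written with $.

```perl
use strict;
use warnings;

my $word   = "linguistics";              # $ : scalar
my @tokens = ("the", "cat", "sat");      # @ : array
my %count  = ("the" => 2, "cat" => 1);   # % : associative array

my $first   = $tokens[0];                # one array element, hence $
my $n_the   = $count{"the"};             # one associative-array element, hence $
my $n_items = scalar(@tokens);           # an array in scalar context gives its length

print "$word $first $n_the $n_items\n";
```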
Data conversion
Building arrays from strings
while (<>) {
    chop; # avoid \n on last field
    @array = split(/:/);
    ...
}
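The same idea on a single string, as a self-contained sketch. The record below is invented in the style of an /etc/passwd line, and chomp (Perl 5) is used in place of chop because it removes only a trailing newline.

```perl
use strict;
use warnings;

# Hypothetical /etc/passwd-style record; the values are made up.
my $line = "jdoe:x:1001:100:Jane Doe:/home/jdoe:/bin/sh\n";

chomp($line);                        # drop the trailing \n before splitting
my @fields = split(/:/, $line);      # one array element per field

my $count = scalar(@fields);         # number of fields
my ($login, $shell) = @fields[0, 6]; # an array slice picks several elements

print "$count fields; login=$login shell=$shell\n";
```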
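In each pair below, the two lines do (nearly) the same thing...
do 'stat.pl';
eval `cat stat.pl`;

s/\n// ; # removes the first newline in $_
chop ; # removes the last character of $_, whatever it is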
String functions
chop
eval
crypt
index
rindex
length
q
substr
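A few of these functions applied to one word, as a minimal sketch. Positions are counted from 0, and the variable names are chosen only for illustration.

```perl
use strict;
use warnings;

my $s = "bookkeeper";

my $len   = length($s);         # number of characters
my $first = index($s, "e");     # position of the first "e"
my $last  = rindex($s, "e");    # position of the last "e"
my $mid   = substr($s, 4, 4);   # 4 characters starting at position 4

print "$len $first $last $mid\n";
```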
Arrays and list functions
delete
each
grep
keys
join
pop
push
reverse
shift
sort
A couple of examples
@foo = grep(!/^#/, @bar); # weed out comments
$_ = join(':',
$login,$passwd,$uid,$gid,$gcos,$home,$shell);
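Several of the list functions can be tried together in a short sketch. All data here are made up, and the sort block with <=> assumes Perl 5.

```perl
use strict;
use warnings;

my @queue = (3, 1, 2);
push(@queue, 5);                            # append: (3, 1, 2, 5)
my $last  = pop(@queue);                    # remove from the end
my $first = shift(@queue);                  # remove from the front

my @sorted   = sort { $a <=> $b } (3, 1, 2);   # numeric sort
my @reversed = reverse(@sorted);

# Hypothetical word counts in an associative array.
my %count = (the => 2, cat => 1, sat => 1);
delete $count{sat};                         # remove one key
my @words = sort keys %count;               # keys come back unordered, so sort them

my @bar = ("# a comment", "alpha", "#beta", "gamma");
my @foo = grep(!/^#/, @bar);                # keep the lines that are not comments
my $joined = join(':', @foo);

print "$last $first @sorted | @reversed | @words | $joined\n";
```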
Pattern matching
m/PATTERN/gio
/PATTERN/gio
?PATTERN?
s/PATTERN/REPLACEMENT/gieo
Some examples:
s/\bgreen\b/mauve/g; # don't change wintergreen
($foo = $bar) =~ s/bar/foo/;
$_ = 'abc123xyz';
s/\d+/$&*2/e; # yields 'abc246xyz'
s/\d+/sprintf("%5d",$&)/e; # yields 'abc  246xyz'
s/\w/$& x 2/eg; # yields 'aabbcc  224466xxyyzz'
study
tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds
$ARGV[1] =~ y/A-Z/a-z/; # canonicalize to lower case
$cnt = tr/*/*/; # count the stars in $_
$cnt = tr/0-9//; # count the digits in $_
tr/a-zA-Z//s; # bookkeeper -> bokeper
($HOST = $host) =~ tr/a-z/A-Z/;
y/a-zA-Z/ /cs; # change non-alphas to single space
tr/\200-\377/\0-\177/; # delete 8th bit
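The tr examples above all act on $_ or modify their target in place. The sketch below, on an invented string, shows the copy-then-translate idiom and the counting idiom side by side, so the original text stays untouched.

```perl
use strict;
use warnings;

my $text = "Hello, World 2024!";

(my $lower = $text) =~ tr/A-Z/a-z/;             # copy, then lowercase the copy
my $digits = ($text =~ tr/0-9//);               # count digits; $text is unchanged
(my $squeezed = "bookkeeper") =~ tr/a-zA-Z//s;  # squeeze runs of repeated letters
(my $words = $text) =~ tr/a-zA-Z/ /cs;          # each run of non-letters becomes one space

print "$lower\n$digits\n$squeezed\n$words\n";
```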
Format strings
# a report on the /etc/passwd file
format STDOUT_TOP =
                        Passwd File
Name                Login    Office   Uid   Gid Home
------------------------------------------------------------------
.
format STDOUT =
@<<<<<<<<<<<<<<<<<< @||||||| @<<<<<<@>>>> @>>>> @<<<<<<<<<<<<<<<<<
$name,              $login,  $office,$uid,$gid, $home
.
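To see a format in action without reading /etc/passwd, here is a reduced sketch. The format name ENTRY, the three-column layout, and the records are all hypothetical, and the sketch assumes Perl 5.8 or later so that the report can be written to an in-memory filehandle and inspected before printing.

```perl
use strict;
use warnings;

our ($login, $uid, $home);    # variables named in the format (package globals)

format ENTRY =
@<<<<<<< @>>>> @<<<<<<<<<<<<<
$login,  $uid, $home
.

# Hypothetical records standing in for /etc/passwd entries.
my @records = (["root", 0, "/root"], ["jdoe", 1001, "/home/jdoe"]);

my $report = '';
open(my $out, '>', \$report) or die "cannot open in-memory file: $!";
my $old = select($out);       # make $out the currently selected output channel
$~ = "ENTRY";                 # choose the report format for that channel
foreach my $r (@records) {
    ($login, $uid, $home) = @$r;
    write;                    # fill in the picture line and emit it
}
select($old);                 # restore the previous output channel
close($out);

print $report;
```

Each @<<< field is left-justified and each @>>> field is right-justified, so the uid column lines up on its last digit.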
$_ The default input and pattern-searching space.
The following pairs are equivalent:
while (<>) {... # only equivalent in while!
while ($_ = <>) {...
/^Subject:/
$_ =~ /^Subject:/
y/a-z/A-Z/
$_ =~ y/a-z/A-Z/
chop
chop($_)
$foo{$a,$b,$c}
$foo{join($;, $a, $b, $c)}
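Two of these equivalences can be checked directly. In this sketch the header lines and the keys are invented, and $x, $y, $z are used instead of $a, $b, $c to stay clear of the variables that sort reserves.

```perl
use strict;
use warnings;

# A pattern match with no explicit target is applied to $_.
my @subjects;
foreach ("Subject: Homework 5", "From: jbl") {
    push(@subjects, $_) if /^Subject:/;     # same as: $_ =~ /^Subject:/
}

# A comma-separated subscript is joined with $; behind the scenes.
my %foo;
my ($x, $y, $z) = ("noun", "sg", "nom");    # hypothetical keys
$foo{$x, $y, $z} = "dog";
my $same = $foo{join($;, $x, $y, $z)};      # fetches the same element

print "@subjects / $same\n";
```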
Special Variables
- $ARGV
- contains the name of the current file when reading from <>.
- @ARGV
- The array ARGV contains the command line arguments
- $_
- The default input and pattern-searching space.
- $!
- If used in a numeric context, yields the current value of errno
- $#
- output format for printed numbers.
- $%
- current page number
- $&
- string matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval enclosed by the current BLOCK)
- $'
- string following whatever was matched by the last successful pattern match
- $*
- Set to 1 to do multiline matching within a string, 0 to tell perl that it can assume that strings contain a single line, for the purpose of optimizing pattern matches.
- $+
- last bracket matched by the last search pattern.
- $,
- output field separator for the print operator.
- $-
- number of lines left on the page of the currently selected output channel.
- $.
- current input line number of the last filehandle that was read.
- $/
- input record separator, newline by default.
- $0
- Contains the name of the file containing the perl script being executed.
- $:
- current set of characters after which a string may be broken
- $;
- subscript separator for multi-dimensional array emulation.
- $<
- real user ID (uid) of this process.
- $=
- current page length
- $@
- perl syntax error message from the last eval command.
- $^L
- What formats output to perform a formfeed.
- $~
- name of the current report format for the currently selected output channel. Default is name of the filehandle.
- $^
- name of the current top-of-page format
- $"
- like $, except that it applies to array values interpolated into a double-quoted string
- $^T
- time at which the script began running, in seconds since the epoch.
- $^W
- current value of the warning switch.
- $[
- index of the first element in an array, and of the first character in a substring. Default is 0, but you could set it to 1 to make perl behave more like awk (or Fortran).
- $\
- output record separator for the print operator. Ordinarily the print operator simply prints out the comma-separated fields you specify, with no trailing newline or record separator assumed.
- $`
- string preceding whatever was matched by the last successful pattern match
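A handful of these variables can be exercised in a few lines. This sketch uses invented data and assumes Perl 5.8 or later for the in-memory filehandle; local keeps each change confined to its block, so later print statements are unaffected.

```perl
use strict;
use warnings;

# The three match variables: text before, inside, and after the match.
my $s = "abc123xyz";
my ($before, $matched, $after) = ("", "", "");
if ($s =~ /\d+/) {
    ($before, $matched, $after) = ($`, $&, $');
}

# $" separates array elements interpolated into a double-quoted string.
my @list = ("a", "b", "c");
my $interpolated;
{
    local $" = "-";
    $interpolated = "@list";
}

# $, goes between print's arguments and $\ after the last one.
my $printed = '';
{
    open(my $out, '>', \$printed) or die "cannot open in-memory file: $!";
    local $, = "+";
    local $\ = "!\n";
    print $out @list;
    close($out);
}

print "$before|$matched|$after $interpolated $printed";
```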
Homework 5 : Corpus analysis (due: Mar. 13)
Homework answers