Start by completing the exercises from Lab Session 01 if you haven’t done so already.
Part 1) Regular expressions: A very very short introduction
Regular expressions (“Regex”) have been around since the early days of digital computing, and are still very much used today by academics, programmers and other technically-minded people. Most programming languages and many technical tools support regular expressions. They are simple, powerful and very versatile!
We will be covering regular expressions properly later in this course, but I’ll give you a very brief introduction for now which will supplement the lectures and just cover the basics. There is a great regex cheat sheet here which you may use as a reference going forward. Keep in mind that not all software accepts all kinds of regular expressions, so sometime you might have to try things a few different ways for it to work.
The linux command for searching a file with regular expressions is grep
. However, grep
doesn’t accept all kinds of modern regular expressions, and there are also variations between different operating systems and Linux distributions. The new regular expressions are called ‘extended’ regular expressions. I recommend that you use egrep
or grep -e
, as this version of grep
accepts extended regular expressions in addition to the basic ones.
For example, if I want to search a file for mentions of a multiple terms, I can use egrep
like this:
Searching for two different terms with an extended regular expression
Exercise 1.1: Introduction to regexps
Here’s a different example. Take a look at the regex cheat sheet to figure out what each of the characters in the expression mean.
Searching for words which begin with ‘trav’
This regular expression searches for all four-digit numbers in the file. There are multiple different expressions which accomplish the same thing. See if you can find them!
Searching for four-digit numbers
Exercise 1.2: Searching and counting with regexps
Test what happens if you use the -iwo
options to search with a wildcard, e.g. .*t
for all words ending on t. Because the wildcard .
matches everything including spaces, the result may not be what you expected. If you want to exclude spaces and punctuation, try using \w*
instead of .*
.
Exercise 1.3: Tokenization with egrep
Look at the examples in the lecture notes for Tokenization with egrep.
a) As an exercise, change the regexp to make a list of words with two letters or fewer.
b) What do you gain and what do you lose by case folding?
c) What is a word? Should we exclude numbers? What about compounds with hyphens?
Part 2) Writing shell scripts
A series of commands can also be stored as a script, to perform certain operations without having to type each individual command each time, and can also take arguments do be dynamic. A script’s name usually ends with the .sh
extension, but as noted earlier, Linux files can also have no extension.
We’ve already gone over piping things from one command to another, using the |
operator. This sends the output of one command as input to another. There are also many other operators that perform a variety of different functions. >
can be used to pipe a command’s output to a file (or to commands that usually don’t take other commands as input). When we use >
, it will overwrite any previous contents of the file, but if we use >>
instead, the content will be appended. <
can be used to redirect a file’s contents as the input to a command. See this page for more information.
A script’s first line should start with #!/usr/bin/env sh
, /usr/bin/env bash
, or similar. This tells the system that it should be executed as a shell script. Scripts in other languages can also use this to tell the shell which program should execute it; the equivalent for Python would be #!/usr/bin/env python
.
A script can be as simple as a series of commands, or it can contain flow control, loops, and such. Here’s an example using echo
and read
. echo
prints its arguments to the terminal, while read
takes an input from the user, and stores it as a variable. Variables are “containers” for data. We can change their name and the content in them as we go along, which is why they are called variables.
Example:
1
2
3
4
5
6
7
8
9
10
11
12
1: #!/usr/bin/env bash
2: #user_registration.sh
3:
4: echo -e "Hello!\nWhat is your name?"
5: read NAME
6: echo -e "It's nice to meet you, $NAME.\nHow old are you?"
7: read AGE
8: echo -e "Wow, $AGE is actually my favourite number.\nWhich city are you from?"
9: read CITY
10: echo -e "That's cool, my cousin currently works in $CITY."
11:
12: echo "$NAME;$AGE;$CITY" >>user_data.txt
This script will record a user’s name, age, and city, and then append those values (separated by a semicolon) to a file called user_data.txt. We can run it by using the command source
, like this:
1
source user_registration.sh
This will prompt us to answer the three questions from the script. If we answer them, we’ll se that a new file has by the name of user_data.txt appeared in the directory. We can run the script as many times as we want and add more data to the file. However, the content in user_data.txt is a little hard to read. Let’s use the awk
command to format it differently:
1
1: awk -F';' '{print "Name: " $1 "\nAge: " $2 "\nCity: " $3 "\n"}' <user_data.txt
If the above doesn’t make much sense, don’t worry. We’ll be working a lot more with awk
/ gawk
in the coming weeks. All you need to know for now is that the -F';' '
option tells Awk that we are using ; to separate the fields in user_data.txt. This is so that Awk knows how to separate Name, Age and City from each other and not just interpret it as a single long string. Then there’s the argument, which is a single long string with Awk code. The code tells Awk to print out “Name”, then the first field in user_data.txt, then “Age” followed by the second field, then “City” and then the third field at the end. The “\n” indicates a new line. Finally, we give user_data.txt as input, which you can see is directed into Awk with the <
operator.
Exercise 2.1: Writing your first shell scripts
Recreate the user_registration.sh shell script and the user_data.txt file by repeating the steps above. Make sure to run the script a few times, so that user_data.txt at least has a few lines for you to work with. (Yes, you can use nano user_data.txt
and just enter the content manually instead. Just make sure that you test the script at least once.)
Exercise 2.2: Extracting file contents with regex and shell scripting
Create a second script which takes a regular expression from the user and filters the records.
Tip: You can use read
to prompt the user for a regular expression to use
Exercise 2.3: Character translation
a) Try the last example in the lecture notes section about character translation, but read input from a file instead of the terminal by means of < chess.txt
.
b) Write a script that replaces all vowels with asterisks.
Exercise 2.4: Tokenization with tr
a) Why is there an empty line at the beginning of the tokens file? If it bothers you, try to remove it by adding egrep -v '^$'
to the pipeline.
b) What is a word? Should we exclude numbers? What about compounds with hyphens? Check man tr
.
c) What information do we lose by lowercasing every word? How could we distinguish meaningful capitalization (as in proper names) from accidental capitalization, i.e. at the beginning of a sentence?
d) When tokenizing, you can also select other strings than whole words. Find all consonant clusters by using the regular expression [bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ]
instead.