The solutions to the exam prep exercises are presented below. The exercises span different subjects and cover what I believe is the most important material from this semester. As always, I recommend that you attempt the exercises yourself before looking at the solutions. The exercises from Lab Session 13 are available in their entirety here.
Let me know if you have any questions about the exercises or anything else from the curriculum.
Good luck on the exam!
Disagree with my solutions, or have something to add? Leave a comment!
Exercise 1: Computing wages
Look at the file wages.txt, which contains data about the name, hourly wage and hours worked this month for 11 employees. Write an AWK or Python script which prints out the monthly salary of each employee, but only if the monthly salary is greater than 10,000.
Arne 182 0
Marie 190 0
Vishal 213 150
Grete 192 16
Finn 192 37
Kari 220 145
Arnfinn 220 150
Torstein 250 132
Trude 196 150
Geir 160 13
Hoda 196 42
AWK solution:
#!/usr/bin/env awk
# Filename: high_wages.awk
# Print the name and monthly salary (hourly wage * hours worked) when it exceeds 10,000
$2 * $3 > 10000 {print $1, $2 * $3}
$ awk -f high_wages.awk wages.txt
Vishal 31950
Kari 31900
Arnfinn 33000
Torstein 33000
Trude 29400
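The exercise also allows a Python solution. A minimal sketch of a Python alternative could look like this (the filename is my own, and it assumes wages.txt is in the working directory with whitespace-separated fields):

# Filename: high_wages.py (hypothetical Python alternative to the AWK script above)
with open('wages.txt', encoding='UTF-8') as wages:
    for line in wages:
        name, hourly_wage, hours = line.split()
        salary = int(hourly_wage) * int(hours)
        # Only print employees whose monthly salary exceeds 10,000
        if salary > 10000:
            print(name, salary)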
Exercise 2: Levenshtein
a) What’s the Levenshtein distance between the words “monstrous” and “miscellaneous”? Use a cost of 2 for substitutions and a cost of 1 for insertions and deletions.
# Filename: levenshtein-1.py
from functools import lru_cache

# The Levenshtein function from the lecture notes
@lru_cache(maxsize=4095)
def levenshtein_dist(string, target):
    if not string:
        return len(target)
    if not target:
        return len(string)
    if string[0] == target[0]:
        return levenshtein_dist(string[1:], target[1:])
    lins = levenshtein_dist(string, target[1:]) + 1
    lsub = levenshtein_dist(string[1:], target[1:]) + 2  # change this from 1 to 2 to calculate with higher sub cost
    ldel = levenshtein_dist(string[1:], target) + 1
    return min(lins, lsub, ldel)

print(levenshtein_dist("monstrous", "miscellaneous"))
$ python levenshtein-1.py
12
b) Write a Python program which calculates the Levenshtein distance between two input strings, but only with insertion and deletion operations. Is there a different result when testing with the two words from exercise a)?
# Filename: levenshtein-2.py
from functools import lru_cache

# Modified Levenshtein function with no substitution operations
@lru_cache(maxsize=4095)
def levenshtein_dist(string, target):
    if not string:
        return len(target)
    if not target:
        return len(string)
    if string[0] == target[0]:
        return levenshtein_dist(string[1:], target[1:])
    lins = levenshtein_dist(string, target[1:]) + 1
    ldel = levenshtein_dist(string[1:], target) + 1
    return min(lins, ldel)

print(levenshtein_dist("monstrous", "miscellaneous"))
$ python levenshtein-2.py
12
There is no difference. The reason is that a substitution is always equivalent to a deletion followed by an insertion, and with a substitution cost of 2 both options cost exactly the same (2 = 1 + 1). For example, turning “cat” into “cut” costs 2 either way: substitute “a” with “u”, or delete “a” and insert “u”.
Exercise 3: AWK
In general, what does this script do?
Assume that the file text.txt is a file with multiple lines of text content.
awk '{sum += length($0)} END {print sum / NR}' text.txt
The program calculates the input file’s average line length. sum is a variable which increases iteratively with the length of each line (remember: in AWK, $0 is the entire input line). At the end of the program, it prints the final sum (the sum of all line lengths) divided by NR, which is an AWK built-in variable that keeps track of the number of the input record/line currently being processed. This means that at the end of the program, NR is equal to the number of lines in the input file. The sum of all line lengths divided by the number of lines gives the average line length of the input file.
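As a quick sanity check, here is a hypothetical two-line file (not part of the exercise) with line lengths 5 and 9, which gives an average of 7:

$ printf 'hello\nwonderful\n' > text.txt
$ awk '{sum += length($0)} END {print sum / NR}' text.txt
7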
Exercise 4: Frequency lists
a) Create a word frequency list of the short story The Last Question by Isaac Asimov.
#!/usr/bin/env bash
# wordfreq.sh
# Extract the words, lowercase them and build a frequency list sorted by count
egrep -ow '\w+' "$1" |
gawk '{print tolower($0)}' |
sort | uniq -c | sort -nr > "$1-freq-list.txt"
echo "Frequency list created. Saved as '$1-freq-list.txt'"
$ source wordfreq.sh the-last-question.txt
Frequency list created. Saved as '../the-last-question.txt-freq-list.txt'
b) How many words with twelve letters or more occur more than once?
#!/usr/bin/env bash
# twelve-letter-words-cnt.sh
# Prints the number of words in the input file which are 12 letters or longer
# and occur more than once
# (sort -f folds case so that uniq -ci groups case-variants on adjacent lines)
egrep -o '\b[[:alpha:]]{12,}\b' "$1" | tr -cs '[:alnum:]' '\n' | sort -f |
uniq -ci | sort -nr | awk '$1 > 1 {print $0}' | wc -l
$ source twelve-letter-words-cnt.sh the-last-question.txt
5
There are 5 words of twelve letters or more that occur more than once.
Exercise 5: Python
a) Define a function vocabulary that takes the name of a file as its argument and returns the sorted set of tokens.
from nltk import word_tokenize
from re import sub

# Returns the sorted set of tokens in a file
def vocabulary(path_to_file):
    with open(path_to_file, encoding="UTF-8") as text:
        # Read the content of the file and lowercase it
        content = text.read().lower()
    # Tokenize the content with nltk
    tokens = word_tokenize(content)
    # Clean the tokens, remove unwanted characters, and get all unique tokens by converting from list to set
    tokens = set([sub(r"[.,'§!()-_;`:]", '', token) for token in tokens])
    # If token is not empty, add to list comprehension, then sort the list
    tokens = sorted([token for token in tokens if token != ''])
    return tokens

print(vocabulary('grunnloven-v1814.txt'))
You can get the file I used to test the program here.
b) Define a function vocabulary_out that takes the names of an input file and an output file as arguments, and writes the sorted set of tokens from the input file to the output file. It should use vocabulary.
# Tokenize an input file and write to a specified output file by using the vocabulary function
# NB! The code for 5 a) must be executed first for this code segment to work.
def vocabulary_out(path_to_input_file, path_for_output_file):
    # Get sorted list of tokens from input file
    tokens = vocabulary(path_to_input_file)
    # Create a new file and write the tokens to it
    with open(path_for_output_file, mode='w', encoding="UTF-8") as output:
        for token in tokens:
            output.write(f'{token}\n')

vocabulary_out('grunnloven-v1814.txt', 'grunnloven-v1814-vocab.txt')
The first 20 lines of the file grunnloven-v1814-vocab.txt should look something like this:
a
aabenbar
aabne
aabner
aabnes
aall
aand
aar
aarlig
aarligen
aars
aarsag
aasædesretten
addresse
adgang
adlyder
adskilles
af
afdelinger
affattede
Exercise 6: Character frequencies
What is the least frequently used letter in the file cuerpo.txt?
We can modify the shell script charfreq.sh from the lectures on shell scripting to get the least frequently used letter. Since charfreq.sh returns a frequency list of all characters sorted by descending frequency, we need to remove the -r flag from the sort command at the end of the script so that it returns a list sorted in ascending order instead. Since we only want to count alphanumeric characters (i.e. the letters and digits), we also need to run the file through a sed command at the beginning. Removing all non-alphanumeric characters can be achieved with sed 's/[^a-zA-Z0-9]//g'. Finally, we can use the head command with the flag -1 when running the script to print only the first line, which contains our answer.
#!/usr/bin/env bash
# Filename: letterfreq.sh
# Remove non-alphanumeric characters from the file
sed 's/[^a-zA-Z0-9]//g' |
# Get the character frequencies and sort ascendingly
awk '
BEGIN { FS = "" }
{
    for (field = 1; field <= NF; field++) {
        freq[tolower($field)]++
    }
}
END {
    for (char in freq) {
        print freq[char], char
    }
}' | sort -n
$ source letterfreq.sh < cuerpo.txt | head -1
3 w
The answer is the letter “w”, which has been used only 3 times.
Exercise 7: Extracting words from HTML
Create a list of all unique compound words (hyphenated or closed form) containing the word “korona” or “corona” (case-insensitive) by searching in Norwegian newspapers on the web. Some examples are “koronafast”, “pre-korona-times” and “koronaknekken”. Which five compound words are most commonly used? Collect data from both 2020 and 2021.
Perform a case-insensitive search in the corpus Aviskorpus, bokmål, 01.01.21 ---> with a regular expression of your choosing. We want to match words that contain “corona” or “korona” at the beginning, in the middle, or at the end of the compound word (with and without hyphens). However, we don’t want our expression to match the word “korona” or “corona” alone, without text on either side. It is perhaps easiest to achieve this by using the regex “OR” operator |, though there are other ways to do it as well. Another way would be to perform multiple searches (one with [ck]orona in the middle, one preceding and one following) and concatenate the results. Anyway, I solved it by searching for ([ck]orona..*|..*[ck]orona|..*[ck]orona..*). Feel free to expand the number of matches by increasing the option Antall sider per linje to 10,000+, though this might take a while to run. Save the page with results to a file by right-clicking, selecting Save Page As... and specifying the type (HTML only), filename and location. Repeat the process with the År option set to 2020 (keep searching in the same corpus as before) and concatenate the two files in the shell with the command cat filename1.html filename2.html > corona_compounds_2020-2021.html.
We can then get all the corona-related compound words we searched for by using egrep '<b>' corona_compounds_2020-2021.html | sed -E 's!.*<b>(.*)</b>.*!\1!' to retrieve the words between the bold tags (<b>, </b>) in the file. (Note the ! delimiter in the sed command, since the pattern itself contains a /.) The results from grep can be piped to sort and uniq to produce only the unique words. Finally, the results can be saved to a new file with > filename.txt.
The full (shell) solution:
$ cat cqp-avis-korona-10000_2021.htm cqp-avis-korona-10000_2020.htm > corona_compounds_2020-2021.html
$ egrep '<b>' corona_compounds_2020-2021.html | sed -E 's!.*<b>(.*)</b>.*!\1!' | sort | uniq > corona_compounds_2020-2021.txt
$ head -20 corona_compounds_2020-2021.txt
ArbeidslivKoronaLO
BarcelonaKoronavirus
betacoronavirus
BorgenKoronavirus
BorgenVaksineKoronavirus
corona-
coronaanbefalinger
coronaansvarlig
corona-åpne
coronaår
coronaåret
corona-året
Coronaåret
CORONA-ÅRET
coronaårsak
coronaavdeling
corona-avdeling
corona-avdelinger
corona-avlysningene
coronaavlyst
I haven’t tried the Python implementation of this, but the steps should be mostly the same. Let me know if you run into problems!
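If you want to attempt it in Python, a minimal (untested) sketch could look something like the following. Like the shell solution above, it assumes the saved result pages wrap each matched word in <b>...</b> tags; the filename and function name are my own:

# Filename: corona_compounds.py (hypothetical sketch; not part of the original solution)
import re

# Collect the unique words found between <b> and </b> tags in one or more saved result pages
def unique_compounds(*html_files):
    words = set()
    for path in html_files:
        with open(path, encoding='UTF-8') as page:
            words.update(re.findall(r'<b>(.*?)</b>', page.read()))
    return sorted(words)

for word in unique_compounds('cqp-avis-korona-10000_2021.htm', 'cqp-avis-korona-10000_2020.htm'):
    print(word)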
Exercise 8: Finite State Automata
a) Use JFLAP to create an FSA for the regular expression a+h?rgh+
.
b) Create a transition table for your FSA.
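Since JFLAP diagrams don’t translate well to text, here is a sketch of one possible transition table for a+h?rgh+, encoded and tested in Python (the state numbering and test words are my own; missing entries act as an implicit dead state):

# Filename: fsa-a-h-rgh.py (hypothetical; a text encoding of one possible FSA for a+h?rgh+)
# Transition table: (state, symbol) -> next state
TRANSITIONS = {
    (0, 'a'): 1,   # at least one 'a' is required
    (1, 'a'): 1,   # further 'a's are allowed
    (1, 'h'): 2,   # the optional single 'h'
    (1, 'r'): 3,
    (2, 'r'): 3,
    (3, 'g'): 4,
    (4, 'h'): 5,   # at least one final 'h' is required
    (5, 'h'): 5,   # further 'h's are allowed
}
START, ACCEPTING = 0, {5}

def accepts(string):
    state = START
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False   # no transition defined: implicit dead state
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

for word in ['argh', 'aaahrghhh', 'ahgrh', 'rgh']:
    print(word, accepts(word))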
Exercise 9: Trigrams
Write a Python, shell or AWK script which takes a text file as input and:
1) Creates a new file with all trigrams (and only the trigrams)
2) Prints out the longest trigram (as determined by the number of letters in each trigram)
#!/usr/bin/env bash
# Filename: get-trigrams.sh
# Creates a new file with all trigrams and prints out the longest trigram
# First, tokenize the file
tr '[:upper:]' '[:lower:]' < "$1" | tr -cs 'a-záéóíúñ' '\n' | egrep -v '^$' > "$1-tokens"
# Then create files with the next and next-next tokens
tail -n+2 "$1-tokens" > "$1-next"
tail -n+3 "$1-tokens" > "$1-next-next"
# Create a new file with the trigrams
paste "$1-tokens" "$1-next" "$1-next-next" > "$1-trigrams"
# Delete the last two lines; they aren't complete trigrams
sed -i '$d' "$1-trigrams"
sed -i '$d' "$1-trigrams"
# Print out the longest trigram(s) with awk
awk '
BEGIN {FS = "\t"} # Set field separator to tab
# Calculate the length of each line, store it in the variable "len"
{len = length($0)}
# If the current len is greater than the max len, clear the array of results and set a new max
len > max {delete result; max = len}
# If the current len is equal to the longest len so far, add the line to the array of results
len == max {result[NR] = $0}
# Finish by printing out the array of results
END {for (i in result) printf("%s chars: %s \n", length(result[i]), result[i])}
' "$1-trigrams"
$ source get-trigrams.sh the-last-question.txt
collecting interstellar hydrogen
$ head the-last-question.txt-trigrams
the last question
last question by
question by isaac
by isaac asimov
isaac asimov the
asimov the last
the last question
last question was
question was asked
was asked for
And that’s it! If you have solved both these and the previous lab exercises, I think you should be well-prepared for the exam. Let me know if you have any more questions. Good luck!