
# Solutions to Exercises from Lab Session 03

Here are my solutions to the exercises from lab session 03.
You should try to solve the lab exercises yourself before you peek at the answers!
You can find the exercises to Lab Session 03 here.

Disagree with my solutions, or have something to add?

## Part 1) Word frequencies and N-grams

### Exercise 1.1: Script files and word frequencies

a) Recreate this example (tokenize.sh + count word frequencies) and send the output to a file. Compare this result with a frequency list obtained otherwise, for instance, with Antconc.

Creating a frequency list for chess.txt in the shell

Frequency list for chess.txt with Antconc

When comparing the frequency list to the Antconc output, we see that they are mostly the same. The only difference, as far as I can see, is that Antconc sorts the occurrences first by frequency, then alphabetically, while bash sorts them first by frequency, then in reverse alphabetical order.

However, there might be other differences depending on how the tokenization is done in bash; for example, you may need to specify how apostrophes should be interpreted. Antconc also has a lot more functionality and lets you easily sort by different parameters. It also provides a ranking of each word in the frequency list.

b) Create a script file freq.sh to make a frequency list of a tokenized file given as argument. Try to combine it with the script for tokenization.

A script which creates a frequency list of a file
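The script itself isn't reproduced in the screenshot, but a minimal sketch of what such a freq.sh might look like, assuming the argument is a file with one token per line (as produced by tokenize.sh), is:

```bash
#!/usr/bin/env bash
# freq.sh -- print a frequency list for a tokenized file given as argument
# Usage: source freq.sh chess-tokens.txt
sort "$1" | uniq -c | sort -nr
```

Combined with tokenization: `source tokenize.sh chess.txt > chess-tokens.txt && source freq.sh chess-tokens.txt`.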

c) Make a frequency list of a larger text, then select lines matching a regexp, e.g. words with three letters or less. How do frequencies of shorter words compare to frequencies of longer words?

Frequency by word length

We can select words of three letters or fewer with the regular expression \b[[:alpha:]]{1,3}\b. If we sort the output with sort -k1 -nr, we see the words with the highest frequency. Running the same pipeline for words of different lengths shows that frequency drops off sharply with word length: three-letter words (at least the popular ones) are far more frequent than four-letter words, four-letter words are far more frequent than six-letter words, and so on.
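As a sketch, the whole pipeline for three-letter words could look like this (the file name bible-tokens.txt is just an example of a larger tokenized text):

```bash
# Select tokens of three letters or fewer, then count and sort by frequency
grep -oE '\b[[:alpha:]]{1,3}\b' bible-tokens.txt | sort | uniq -c | sort -k1 -nr | head
```

Changing {1,3} to, say, {6,6} lets you compare against six-letter words.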

### Exercise 1.2: Plotting word frequencies in R

a) Recreate the example from the lecture notes in RStudio.

An RStudio plot of frequency by index (rank) for words in Sherlock Holmes

b) Make observations about the distribution of the frequencies.
As we can see in the plot, the words that have a low index (occur at the top) in the frequency list have a very high frequency. The single most frequent word has an index of 1 and a frequency of over 700 occurences, while the second most frequent word occurs about 550 times. There is a very long tail of words that are only used once or twice.

c) The picture can be made clearer by using a logarithmic y-axis. Try plot(Frequency,log="y") instead. Then try making both axes logarithmic with log="xy".

Word frequency by index with a logarithmic x and y axis

This plot shows more clearly that there is a power-law relationship between a word’s index in the frequency list and its number of occurrences: the points fall roughly on a straight line on log-log axes. Roughly speaking, the most frequent word (index 1) is about twice as frequent as the second most frequent word, three times as frequent as the third most frequent word, and so on and so forth.

d) Compare the result to the similar graph below, which shows word frequencies in different languages:
A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias (dumps from October 2015) on a log-log scale.
Obtained from Wikipedia.org on 30.01.2021

The graph shows that the observations we made from plotting the Sherlock frequency list hold true for most languages. Zipf’s law roughly states that the frequency of a word is inversely proportional to its rank (index) in the frequency list.
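In formula form, Zipf’s law predicts that the frequency $f$ of the word with rank $r$ behaves roughly as

$f(r) \propto \dfrac{1}{r}$

so the word at rank 2 occurs about half as often as the word at rank 1, which matches what we saw in the Sherlock plot.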

### Exercise 1.3: N-grams

a) Execute these commands on a larger tokenized file. Create a frequency list of the resulting bigrams.

There’s a very large text file you can use here. You will of course need to tokenize it first :)

Creating a frequency list for bigrams in The Bible
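The commands themselves aren’t reproduced in the screenshot; a sketch of the bigram pipeline, assuming bible-tokens.txt holds one token per line (the intermediate file names are my own):

```bash
# Shift the token list by one line, then pair each token with its successor
tail -n+2 bible-tokens.txt > bible-next.txt
paste bible-tokens.txt bible-next.txt > bible-bigrams.txt
# Frequency list of the bigrams
sort bible-bigrams.txt | uniq -c | sort -nr | head
```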

b) Create a list of trigrams for the same file. Compare the number of trigrams with the number of bigrams.

Creating a list for trigrams in The Bible and comparing the number of bigrams to trigrams

Okay, so the number of bigrams is equal to the number of trigrams? That doesn’t make sense. What’s going on here…?

Comparing the number of bigrams to trigrams

Ah, when using tail instead of head we see why it counted the same number of bigrams and trigrams. For bigrams, there is a single line which isn’t actually a bigram. For trigrams, there are two lines which don’t contain trigrams. This is because paste keeps producing lines until its longest input is exhausted, padding the missing fields from the shorter files with nothing.
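You can see this padding behaviour of paste with a tiny example (the file names here are made up):

```bash
# paste keeps emitting lines until its longest input is exhausted,
# filling in nothing for the shorter input
printf 'a\nb\nc\n' > tokens.txt
printf 'b\nc\n'    > next.txt
paste tokens.txt next.txt
# the final line is "c" followed by a lone tab -- not a real bigram
```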

In conclusion, for The Bible the total number of bigrams is 994,514 and the total number of trigrams is 994,513. In general, if $X$ is the number of words in a given sentence $K$, then the number of N-grams for sentence $K$ is:

$\text{Ngrams}_K = X - (N - 1)$

For example, a sentence of $X = 5$ words contains $5 - (2 - 1) = 4$ bigrams and $5 - (3 - 1) = 3$ trigrams.

## Part 2) Ciphers and string substitutions

### Exercise 2.1: Ciphers

a) Create a script decrypt.sh which turns the ciphertext produced by Koenraad’s encrypt.sh back to normal.
The following workflow should give the original file: source encrypt.sh < chess.txt | source decrypt.sh

```bash
#!/usr/bin/env bash
# decrypt.sh
normal='qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM'
decrypted='a-zA-Z'
tr "$normal" "$decrypted"
```

b) Extend the strings to include the characters æøåÆØÅ (if your implementation of tr supports multi-byte characters), and also digits, space and punctuation. Consider a less systematic ordering of the characters in the second string.

```bash
#!/usr/bin/env bash
# encrypt-improved.sh
normal='abcdefghijklmnopqrstuvwxyzæøåABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ'
decrypted='qkQeæfAJUHGORwTYødjPÆSWiXåxZLyItnØFbpgcÅoMvKsDazVCulrEhBmN'
tr "$normal" "$decrypted"
```

### Exercise 2.2: String substitutions

Reproduce the lecture example code and extend the pattern to include numbers with a decimal point, such as 289.90, and change the replacement so that it converts the decimal point to a decimal comma.

```
$ sed -E 's/kr\.? ([0-9]+)([\.,])?([0-9]+)?/\1,\3 kroner/g'
kr. 140.50
140,50 kroner
```

Type cmd + c or ctrl + c to exit the sed program.

### Exercise 2.3: String substitutions (2)

a) Some accented vowels are missing from the example in the lecture notes. Add them.

First we need to edit the tokenize.sh script to include accented characters in Spanish. Don’t forget ñ and ü! (Though I think only a handful of Spanish words use the ü…)

```bash
#!/usr/bin/env bash
# tokenize.sh
tr '[:upper:]' '[:lower:]' <$1 | tr -cs '[a-záéóíúüñ]' '\n'
```

Next, we need to edit the sed expression to account for accented vowels. In this case, we don’t want to include the ñ. I also added another i to the second group, because there is a word (gui) where the u is part of the onset, not the nucleus.

```
$ sed -E 's/[aeiouáéóíúü]+/,&,/' mono-es-tokens.txt | sed -E 's/,([iu])([aeoáéóíúüi])/\1,\2/' | head
```

b) Replace empty onsets or codas with 0.

Here, we simply add another substitution to the expression. You can replace the leading commas at the beginning of a line with 0 using sed -E 's/^,/0,/', like this:

```
$ sed -E 's/[aeiouáéóíúü]+/,&,/' mono-es-tokens.txt | sed -E 's/,([iu])([aeoáéóíúü])/\1,\2/' | sed -E 's/^,/0,/' | head
0,a,
0,a,l
0,a,r
0,a,s
0,a,x
b,a,h
b,a,r
b,e,
b,e,l
bi,e,n
```

c) After reading the chapter on cutting columns, cut the vowel column from this comma-separated output and produce a graph of the frequencies of the vowels.

I’m just going to continue to add more code to the existing expression, even though you might want to save the output from the previous exercise to a file before you progress further. Anyway, this time we add the expression cut -d',' -f2  to the pipeline. The -d',' flag sets the delimiter to comma, while the -f2 flag tells cut to extract only the second field.

PS: We could’ve used awk instead of cut here. Remember how?
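In case you’ve forgotten: with awk you set the field separator with -F and print the second field. A sketch equivalent to the cut command above:

```bash
# awk equivalent of cut -d',' -f2
echo 'b,a,r' | awk -F',' '{print $2}'
# prints: a
```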

```
$ sed -E 's/[aeiouáéóíú]+/,&,/' mono-es-tokens.txt | sed -E 's/,([iu])([aeoáéóíú])/\1,\2/' | sed -E 's/^,/0,/' | cut -d',' -f2
```

We want to count the frequency of each vowel, so we add the familiar expression sort | uniq -c | sort -nr to the pipeline as well:

```
$ sed -E 's/[aeiouáéóíú]+/,&,/' mono-es-tokens.txt | sed -E 's/,([iu])([aeoáéóíú])/\1,\2/' | sed -E 's/^,/0,/' | cut -d',' -f2 | sort | uniq -c | sort -nr
63 a
57 e
50 o
36 i
27 u
7 é
3 í
2 á
1 y
1 ú
```

Our goal is to produce a graph of the frequencies of the vowels. In order to do that, we first need to save the output to a comma-separated values (.csv) file. A .csv file is just a text file where the columns are separated by commas, so we can use awk '{print $1","$2}' or sed -E 's/ *([0-9]+) /\1,/g' to replace the spaces in the output with commas. We’ll then save the final output to a file, here named es-vowels-freq.csv:

```
$ sed -E 's/[aeiouáéóíú]+/,&,/' mono-es-tokens.txt | sed -E 's/,([iu])([aeoáéóíúi])/\1,\2/' | sed -E 's/^,/0,/' | cut -d',' -f2 | sort | uniq -c | sort -nr | awk '{print $1","$2}' > es-vowels-freq.csv
```

Almost there! Finally, we can plot the frequencies in RStudio:

```r
freqlist <- read.csv("es-vowels-freq.csv", fileEncoding = "UTF-8", header=FALSE)
names(freqlist) <- c("Frequency", "Character index")
attach(freqlist)
barplot(freqlist$Frequency, las=2, col='darkgreen',
        xlab='Character', ylab='Frequency')
```

And the output should look something like this:

The frequency of vowels in the file mono-es.txt

Feel free to experiment further with different parameters, plot styles, labelling, and so on! For example, you can try to add a label to each bar with the character that the bar represents, i.e. the first bar is a, the second is e, and so forth.

### Exercise 3.1: Word frequencies

a) Tokenize The Bible and save it to a new file bible-tokens.txt.

If downloading the file from MittUiB doesn’t work, you can copy the plaintext of The Bible directly from The Gutenberg Project here.

To tokenize the file, we can use the tokenize.sh script from the lecture notes.

```
$ source tokenize.sh the-bible.txt > bible-tokens.txt
$ head bible-tokens.txt
the
project
gutenberg
ebook
of
the
bible
douay
rheims
old
```

b) Retrieve all words from bible-tokens.txt which contain the letter “A” (case insensitive) non-initially and non-finally. Get the frequencies for these words and sort them by highest occurrence.

The simplest way to do this is probably by using egrep:

```
$ egrep -iw '[^a][[:alnum:]]+[^a]' bible-tokens.txt | head
the
project
gutenberg
ebook
the
bible
douay
rheims
old
new
```

(Strictly speaking, this pattern only requires a word of three or more characters that neither starts nor ends with “a”; to insist on an “a” somewhere inside, the middle part would need to be something like [[:alnum:]]*a[[:alnum:]]*.)

We have a shell script freqlist.sh from earlier which we can use to get the frequencies. It looks like this:

```bash
#!/usr/bin/env bash
# freqlist.sh
egrep -ow '\w+' $FILE | gawk '{print tolower($0)}' | sort | uniq -c | sort -nr
```

So a final solution could look like this:

```
$ egrep -iw '[^a][[:alnum:]]+[^a]' bible-tokens.txt | source freqlist.sh
79636 the
16117 that
11695 shall
10664 his
10120 for
9049 they
8542 not
8472 lord
8171 him
7647 them
...
```

c) Use egrep to only retrieve words from bible-tokens.txt that are five letters long or more, then count and sort them like above.

```
$ egrep -iw '\w{5,}' bible-tokens.txt | source freqlist.sh | head
11695 shall
5252 their
4482 which
2760 israel
2735 there
2499 people
2176 things
2048 before
1929 children
1926 against
```

d) Out of all words in the text, how many percent of them are five letters or longer and used more than once?

```
$ egrep -iw '\w{5,}' bible-tokens.txt | source freq-list-improved.sh | awk '{sum+=$1} END {print sum}'
313841
$ wc -w bible-tokens.txt
994536
```

We can calculate the answer with the following formula:

$\dfrac{\text{amount of long words which occur more than once}}{\text{total number of words in the text}}$

So the answer is 313841 / 994536 ≈ 0.316, i.e. about 31.6%.
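If you don’t want to reach for a calculator, awk can do the division too:

```bash
# fraction of tokens that are five letters or longer and used more than once
awk 'BEGIN { printf "%.3f\n", 313841 / 994536 }'
# prints: 0.316
```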

e) Write a shell script that can take a regular expression (argument $1) and a file name (argument $2) from the user, and output a frequency list like in the exercises above.

```bash
#!/usr/bin/env bash
# freqlist-with-regex.sh
# Tokenize the input file
egrep -iwo '[[:alpha:]]+' < "$2" | awk '{print tolower($0)}' |
# Extract matches with egrep and the regex input from the user
egrep "$1" |
# Output a frequency list
sort | uniq -c | sort -nr
```

```
$ source freqlist-with-regex.sh '^lead.*' ../the-bible.txt
72 lead
30 leadeth
29 leader
8 leaders
7 leading
3 leadest
2 leads
```

### Exercise 3.2: N-grams

a) Find all trigrams in chess.txt containing the word “det”. Afterwards, find only trigrams containing “det” as the first word, middle word, then last word.

```
$ tail -n+2 chess-tokens.txt > chess-next.txt
$ tail -n+3 chess-tokens.txt > chess-next-next.txt
$ paste chess-tokens.txt chess-next.txt chess-next-next.txt > chess-trigrams.txt
$ egrep -w 'det' chess-trigrams.txt | head
er det allerede
det allerede remis
nrks studio det
studio det var
det var spilt
brettet ja det
ja det var
det var raskt
til nrk det
nrk det gikk
```

Finding trigrams containing det as the first word:

```
$ awk '{if ($1=="det") print $0}' chess-trigrams.txt
det allerede remis
det var spilt
det var raskt
det gikk inn
det er ganske
det spilles mange
```

As the second word:

```
$ awk '{if ($2=="det") print $0}' chess-trigrams.txt
er det allerede
studio det var
ja det var
nrk det gikk
får det er
slitsomt det spilles
```

And the third: