Start by completing the exercises from Lab Session 02 if you haven’t done so already.
The lecture notes for lecture 3 can be found here.
Part 1) Word frequencies and N-grams
Exercise 1.1: Script files and word frequencies
a) Recreate this example (tokenize.sh + count word frequencies) and send the output to a file. Compare the result with a frequency list obtained some other way, for instance with AntConc.
b) Create a script file freq.sh to make a frequency list of a tokenized file given as argument. Try to combine it with the script for tokenization.
c) Make a frequency list of a larger text, then select lines matching a regexp, e.g. words with three letters or less. How do frequencies of shorter words compare to frequencies of longer words?
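As a rough orientation for this exercise, the pipeline below sketches one common way to tokenize and count words. The function names tokenize and freq are hypothetical stand-ins, and the scripts from the lecture notes may well be written differently (for example, they may treat case or non-ASCII letters in another way).

```shell
#!/bin/sh
# Hypothetical stand-ins for the course scripts; the lecture versions may differ.

# tokenize: put every run of letters on its own line and lowercase it.
# Assumption: only letters count as word characters.
tokenize() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]'
}

# freq: count identical lines and sort by count, highest first.
freq() {
    sort | uniq -c | sort -rn
}

# Usage: keep only the entries for words of three letters or less.
printf 'The cat saw the dog and the bird\n' |
    tokenize | freq | grep -E '^ *[0-9]+ [a-z]{1,3}$'
```

The final grep anchors the pattern to the whole line, so the count column produced by uniq -c has to be matched as well as the word itself.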
The next exercise requires R and RStudio. Some of you might not have any experience with using R, so you might want to check out a short primer first.
Here’s what you need to know for now:
- Installing R and RStudio.
- Create a new R file in RStudio and store it in a folder of your choice (“File” –> “Save As”).
- Download the sherlock-freq.csv file and store it in the same folder as your RStudio file.
- Import the dataset by running this line of code:
freqlist = read.table("LING123/sherlock-freq.csv")
NB! Use the relative filepath to sherlock-freq.csv from where you stored the R file. On Windows you might have to specify the folder (for me that would be ‘LING123’, as in the example above), even though the R file and the .csv actually are in the same folder. Pay attention to the Unix-style ‘/’ in the pathname (as opposed to ‘\’).
If you are interested in a proper introduction to R, I can recommend this free tutorial. It covers the most basic stuff in a couple of hours, and has a lot of great example code you can try out for yourself.
Exercise 1.2: Plotting word frequencies in R
a) Recreate the example from the lecture notes in RStudio.
b) Make observations about the distribution of the frequencies.
c) The picture can be made clearer by using a logarithmic y axis. Try plot(Frequency,log="y") instead. Then try making both axes logarithmic with log="xy".
d) Compare the result to the similar graph below, which shows word frequencies in different languages:
A plot of the rank versus frequency for the first 10 million words in 30 Wikipedias (dumps from October 2015) on a log-log scale.
Obtained from Wikipedia.org on 30.01.2021
Exercise 1.3: N-grams
a) Execute these commands on a larger tokenized file. Create a frequency list of the resulting bigrams.
There’s a very large text file you can use here. You will of course need to tokenize it first :)
b) Create a list of trigrams for the same file. Compare the number of trigrams with the number of bigrams.
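The commands referred to above are in the lecture notes and are not reproduced here. As a hedged sketch of the general idea, bigrams can be built by pasting a token file next to a copy of itself shifted by one line. The function name bigrams and the file name tokens.txt are hypothetical.

```shell
#!/bin/sh
# bigrams FILE: FILE holds one token per line (assumption).
# Prints "w1 w2" for every pair of adjacent tokens.
bigrams() {
    # tail -n +2 drops the first token, giving the "next word" column;
    # paste joins it line by line with the original file.
    # The last line has no partner, so sed '$d' removes it.
    tail -n +2 "$1" | paste -d ' ' "$1" - | sed '$d'
}

# Usage (tokens.txt is a hypothetical, already tokenized file):
# bigrams tokens.txt | sort | uniq -c | sort -rn | head
```

Trigrams follow the same pattern with a second copy shifted by two lines.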
Part 2) Ciphers and string substitutions
The tr command lets us translate one set of characters to another, which allows us to create ciphers. tr always reads from standard input: it can process a file that is redirected into it with < (or piped in), or text typed by the user when nothing is redirected. Here’s an example:
tr 'abcdefghijklmnopqrstuvwxyz' 'qwertyuiopasdfghjklzxcvbnm'
> hello
itssg
> a very secret message
q ctkn ltektz dtllqut
Reminder: Use ctrl+c to interrupt programs running in the shell (this also applies on macOS, where cmd+c only copies).
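Since tr works on standard input, text from a file is normally fed to it with a pipe or with the < redirection. The sketch below illustrates this with ROT13, a different and much simpler cipher than the keyboard-order one above. The function name rot13 and the file name message.txt are hypothetical.

```shell
#!/bin/sh
# rot13: shift every ASCII letter 13 places in the alphabet.
# Because the alphabet has 26 letters, applying it twice restores the text.
rot13() {
    tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

# Input arrives on standard input, here through a pipe.
printf 'hello\n' | rot13          # prints: uryyb
printf 'hello\n' | rot13 | rot13  # prints: hello

# Reading from a file uses redirection (message.txt is a hypothetical file):
# rot13 < message.txt
```

ROT13 happens to be its own inverse. That is not true of ciphers in general.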
Exercise 2.1: Ciphers
a) Create a script decrypt.sh which turns the ciphertext produced by Koenraad’s encrypt.sh back to normal. The following workflow should give the original file:
source encrypt.sh < chess.txt | source decrypt.sh
b) Extend the strings to include the characters æøåÆØÅ (if your implementation of tr supports multi-byte characters), and also digits, space and punctuation. Consider a less systematic ordering of the characters in the second string.
Exercise 2.2: String substitutions
Reproduce the lecture example code with (EUR|NOK|USD) instead of kr\.?. Now you have two groups. Use \2 instead of kroner in the replacement.
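The lecture example itself is not shown on this page, so the following is only an illustration of how numbered groups behave in sed, under the assumption that amounts are written as digits followed by a space and a currency code. The helper name swap_currency is hypothetical.

```shell
#!/bin/sh
# swap_currency: hypothetical helper, not the lecture example.
# Group 1 captures the amount, group 2 the currency code;
# the replacement puts them back in the opposite order.
swap_currency() {
    sed -E 's/([0-9]+) (EUR|NOK|USD)/\2 \1/g'
}

printf 'It costs 100 USD or 950 NOK.\n' | swap_currency
# prints: It costs USD 100 or NOK 950.
```

Groups are numbered by the position of their opening parenthesis, counting from the left, so \1 and \2 refer to the first and second parenthesized part of the pattern.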
Exercise 2.3: String substitutions (2)
a) Some accented vowels are missing from the example in the lecture notes. Add them.
b) Replace empty onsets or codas with 0.
c) After reading the chapter on cutting columns, cut the vowel column from this comma-separated output and produce a graph of the frequencies of the vowels.
Part 3) Additional exercises
These exercises are slightly more challenging. See if you are able to solve them all! Let the seminar leader know if you are stuck, and you might get a hint…
Exercise 3.1: Word frequencies (extra)
a) Tokenize The Bible and save it to a new file bible-tokens.txt.
b) Retrieve all words from bible-tokens.txt which contain the letter A (case insensitive) non-initially and non-finally. Get the frequencies for these words and sort them by frequency, highest first, like this:
16117 word number one
11695 word number two
5877 word number three
...
c) Use egrep to only retrieve words from bible-tokens.txt that are five letters long or more, then count and sort them like above.
d) What percentage of all words in the text are five letters or longer and used more than once?
Tip: \(\dfrac{\text{number of long words which occur more than once}}{\text{total number of words in the text}}\)
e) Write a shell script that takes a regular expression (argument $1) and a file name (argument $2) from the user and outputs a frequency list like in the exercises above.
Exercise 3.2: N-grams
a) Find all trigrams in chess.txt containing the word det. Afterwards, find only the trigrams with det as the first word, then as the middle word, then as the last word.
b) Why can it be useful to differentiate where the word occurs in the N-grams?
Exercise 3.3: String substitutions
Using sed, create a shell script which reformats dates written as DD.MM.YYYY to YYYY-MM-DD.