Welcome to Lab Session 5!
Make sure that you’ve completed and understood the exercises from Lab Session 04 before getting started with the exercises below. Don’t hesitate to let me know if there’s anything you need help with. This includes other concepts mentioned in the lecture which aren’t explicitly a part of the exercises.
Additionally, don’t forget that you can discuss the exercises with your fellow classmates on MittUiB, the UiB Studyfellowship Discord server (channel: #ling123) or right here in ling123labs.com’s comments section.
The lecture notes for lecture 5 can be found here.
Before you get started on the exercises, I recommend that you skim through this post on some terminology and setting the shell environment for scripts.
Exercise 1: Palindromes with Awk
a) Apply the palindromes script to a large word list in order to find as many palindrome words as possible in a language. Many systems contain a dictionary file /usr/share/dict/words. Alternatively, you can download The Unix Dictionary.
b) The script selects all strict palindromes, but not phrase palindromes, such as Anne var i Ravenna. Extend the script so that the punctuation, spaces and capitalization are ignored. This could be done by making three columns:
- the original string
- the normalized string
- the reversed normalized string. If fields 2 and 3 are equal, field 1 is returned.
Tip 1: The paste command can take three arguments, like this: paste $1 $1 -. If you pass this to awk, you’ll have three fields (columns) to work with.
Tip 2: You can use the AWK function gsub to perform string substitutions.
Tip 3: The output should look something like this. Don’t worry about the formatting, as long as you get the same result.
[Image: Desired output from exercise 1b]
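If you get stuck, the three-column idea from 1b can be sketched roughly like this. It is only one possible approach, not the script from the lecture notes. The function name phrase_palindromes is made up for illustration, the third column is simply a copy of the file read from standard input (so the reversal happens inside awk), and the character classes [:punct:] and [:space:] assume an awk that supports POSIX classes.

```shell
#!/bin/sh
# Sketch only: phrase_palindromes is a hypothetical name, not the lecture script.
# Usage: phrase_palindromes FILE   (one candidate word or phrase per line)
phrase_palindromes() {
  # paste builds three tab-separated copies of each line;
  # "-" means the third column is read from standard input.
  paste "$1" "$1" - < "$1" |
  awk -F '\t' '
    # Reverse a string one character at a time.
    function reverse(s,    i, out) {
      out = ""
      for (i = length(s); i > 0; i--) out = out substr(s, i, 1)
      return out
    }
    {
      norm = tolower($2)                       # ignore capitalization
      gsub(/[[:punct:][:space:]]/, "", norm)   # drop punctuation and spaces
      rev = tolower($3)
      gsub(/[[:punct:][:space:]]/, "", rev)
      rev = reverse(rev)                       # reversed normalized string
      # Normalized and reversed strings agree: print the original (field 1).
      if (norm != "" && norm == rev) print $1
    }'
}
```

You could try it with phrase_palindromes phrases.txt, where phrases.txt is a hypothetical file with one phrase per line. Whether non-ASCII letters such as æ, ø and å are lowercased and reversed correctly depends on your awk and your locale.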
c) Create a script that finds all words in a file that are reverse anagrams but not palindromes, for instance, rail and liar.
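One way to approach 1c is to remember every word first and only then look for reversals. The sketch below assumes one word per line and uses a made-up function name, reverse_pairs. It is meant as a starting point, not as the expected solution.

```shell
#!/bin/sh
# Sketch only: reverse_pairs is a hypothetical name.
# Usage: reverse_pairs FILE   (one word per line)
reverse_pairs() {
  awk '
    # Reverse a string one character at a time.
    function reverse(s,    i, out) {
      out = ""
      for (i = length(s); i > 0; i--) out = out substr(s, i, 1)
      return out
    }
    { seen[$0] = 1; order[NR] = $0 }   # remember every word and its position
    END {
      for (n = 1; n <= NR; n++) {
        w = order[n]
        r = reverse(w)
        # Keep w only if its reversal is a different word that also occurs.
        if (r != w && (r in seen)) print w, r
      }
    }' "$1"
}
```

Note that each pair is printed twice, once from each side (rail liar and liar rail). Whether you want to filter that out is up to you.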
Exercise 2: Character to word ratio
a) When counting the number of characters, we should count the characters in words only, excluding spaces and punctuation. Modify the script in the lecture notes accordingly. One way to do this involves tokenizing the file.
b) Print more information. Give the number of words and characters.
c) Find a way to compute the ratio of words to sentences.
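For parts a) and b), tokenizing first and counting afterwards could look roughly like the sketch below. The function name word_stats is made up, and the tokenizer treats only ASCII letters as word characters, which is an assumption you will probably want to revisit for Norwegian text.

```shell
#!/bin/sh
# Sketch only: word_stats is a hypothetical name, not the lecture script.
# Usage: word_stats FILE
word_stats() {
  # Tokenize: every run of non-letters becomes one newline, so each
  # line that reaches awk holds a single word (ASCII letters only).
  tr -cs '[:alpha:]' '\n' < "$1" |
  awk '
    length($0) > 0 { words++; chars += length($0) }   # skip empty lines
    END {
      printf "words: %d\n", words
      printf "characters: %d\n", chars
      if (words > 0) printf "characters per word: %.2f\n", chars / words
    }'
}
```

Because spaces and punctuation are removed before counting, only the characters inside words contribute to the total.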
Exercise 3: Character frequencies with Awk
a) Also try source encrypt.sh < chess.txt | source charfreq.sh | head. This will give you a clue about which cipher letters represent the most frequent characters.
b) Extend the script from the lecture notes by first changing all uppercase to lowercase letters.
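The lowercasing step in b) might be sketched as follows. This is not the charfreq.sh from the lecture notes: the function name charfreq_lower is made up, it reads from standard input, and it counts letters only, which is an assumption on my part.

```shell
#!/bin/sh
# Sketch only: charfreq_lower is a hypothetical name, not the lecture script.
# Usage: charfreq_lower < FILE
charfreq_lower() {
  awk '
    {
      line = tolower($0)                    # fold uppercase into lowercase first
      for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c ~ /[[:alpha:]]/) count[c]++   # count letters only
      }
    }
    END { for (c in count) print count[c], c }' |
  # Highest count first; ties are ordered by letter.
  sort -k1,1nr -k2,2
}
```

Since it reads standard input, it can sit at the end of a pipeline in the same way as the script in a), for example charfreq_lower < chess.txt | head.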
- END OF LECTURE 5 -
If you are finished with all other lab exercises, you can get started on the exercises below.
Exercise 4: An interesting regular expressions problem
It’s okay if your expression also matches an empty string.
Exercise 5: (From next week’s slides) Extracting text from HTML
a) Consider refining the script from the lecture notes further by conflating inflections, e.g. virus and viruset. Identify challenges for various cases. What extra information would you ideally need?
b) Do a new search in the Norwegian Newspaper Corpus. Extract words and create a frequency list.