Lab Session 13

Welcome to Lab Session 13: Exam Revision

Welcome to the final lab session in LING123: Language and Computers. Today we’ll focus on exam revision, which means that it’s up to you to decide what you’re going to work on!

If there are any exercises from the previous lab sessions which you haven’t completed yet, I recommend that you work through these today. Otherwise, I’ve created a bunch of exercises across different subjects to cover what I believe is the most important stuff from this semester.

This is also your last opportunity to ask me questions related to the lab exercises or any concepts from the lectures, so don’t hold back!

Exercise 1: Computing wages

Look at the file wages.txt, which contains data about the name, hourly wage and hours worked this month for 11 employees. Write an AWK or Python script which prints out the monthly salary of each employee, but only if the monthly salary is greater than 10,000.

Arne 182 0
Marie 190 0
Vishal 213 150
Grete 192 16
Finn 192 37
Kari 220 145
Arnfinn 220 150
Torstein 250 132
Trude 196 150
Geir 160 13
Hoda 196 42

Exercise 2: Levenshtein

a) What’s the Levenshtein distance between the words “monstrous” and “miscellaneous”? Use a cost of 2 for substitutions and a cost of 1 for insertions and deletions.

b) Write a Python program which calculates the Levenshtein distance between two input strings, but only with insertion and deletion operations. Is there a different result when testing with the two words from exercise a?

Exercise 3: AWK

In general, what does this script do?
Assume that the file text.txt is a file with multiple lines of text content.

awk '{sum += length($0)} END {print sum / NR}' text.txt

Exercise 4: Frequency lists

a) Create a word frequency list of the short story The Last Question by Isaac Asimov.

b) How many words with twelve letters or more occur more than once?

Exercise 5: Python

a) Define a function vocabulary that takes the name of a file as its argument and returns the sorted set of tokens.

b) Define a function vocabulary_out that takes the names of an input file and an output file as arguments, and writes the sorted set of tokens from the input file to the output file. It should use vocabulary.

Exercise 6: Character frequencies

What is the least frequently used letter in the file cuerpo.txt?

Exercise 7: Extracting words from HTML

Create a list of all unique compound words (hyphenated or closed form) containing the word “korona” or “corona” (case-insensitive) by searching in Norwegian newspapers on the web. Some examples are “koronafast”, “pre-korona-times” and “koronaknekken”. Which five compound words are most commonly used? Collect data from both 2020 and 2021.

Exercise 8: Finite State Automata

a) Use JFLAP to create an FSA for the regular expression a+h?rgh+.

b) Create a transition table for your FSA.

Exercise 9: Trigrams

Write a Python, shell or AWK script which takes a text file as input and:
1) Creates a new file with all trigrams (and only the trigrams)
2) Prints out the longest trigram (as determined by the number of letters in each trigram)

Tip: Take a look at the Solutions to lab session 03 for a guide to creating a list of trigrams.

A concluding note from your teaching assistant

The exercises, examples and guides on this website should hopefully cover all of the programming curriculum for LING123. I hope you have learned a lot by working through the exercises and when participating in the lab sessions. As always, don’t hesitate to get in touch if there’s anything you need! I would also appreciate any feedback, either through MittUiB, email, Koenraad’s course evaluation, or maybe even by throwing me an endorsement on LinkedIn :)

Good luck with the exam and all your future endeavours!