
Lab Session 06

Welcome to Lab Session 6

Make sure that you’ve completed and understood the exercises from Lab Session 05 before getting started with the exercises below. Don’t hesitate to let me know if there’s anything you need help with. This includes other concepts mentioned in the lecture which aren’t explicitly a part of the exercises.

Additionally, don’t forget that you can discuss the exercises with your fellow classmates on MittUiB, on the UiB Studyfellowship Discord server (channel: #ling123), or right here in the comments section.

The lecture notes for lecture 6 can be found here.

Exercise 1: Extracting text from HTML

a) Consider refining the script from the lecture notes further by conflating inflections (similar to lemmatization), e.g. “virus” and “viruset”. Identify challenges for various cases. What extra information would you ideally need?

Tip: No coding required here. Look at the output from the script, and try to think about all the different edge cases for removing inflections. How would you - hypothetically - attempt to do this?

b) Perform a new search in the Norwegian Newspaper Corpus. Extract words and create a frequency list.

Tip: When you visit a web page, you can right click and select “Save Page As…” to save the HTML file to your computer.
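If you want to see the whole pipeline in one place, here is a minimal Python sketch of the same idea: strip the markup from a saved HTML file, split the text into words, and count them. It is not the script from the lecture notes. The class name `TextExtractor`, the function `word_frequencies` and the sample string are made up for illustration, and a saved corpus page will contain menus and other page text that you would need to filter out yourself.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style elements.
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_frequencies(html):
    """Return a Counter of lowercased words found in the HTML text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Runs of letters only (Unicode-aware, so æ, ø and å are kept).
    words = re.findall(r"[^\W\d_]+", text)
    return Counter(words)


# Hypothetical sample; replace with the contents of your saved HTML file.
sample = (
    "<html><body><p>Viruset sprer seg. Virus og viruset.</p>"
    "<script>var x = 1;</script></body></html>"
)

for word, count in word_frequencies(sample).most_common():
    print(count, word)
```

To run it on a saved page, read the file into a string first, for example with `open("page.html", encoding="utf-8").read()` (the file name is an assumption).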

Exercise 2: Sorting by key

Look at the example from the lecture notes. Some lines from Bergens Tidende have the date in the wrong format: for example, 7. should be 07, and so on. Improve the script to solve this problem.
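The core of the fix is normalising the day field before sorting. The Python sketch below shows only that step, under the assumption that the day has already been isolated as its own field; the function name `pad_day` and the sample values are hypothetical, and the real lines in the lecture example may need extra work to locate the field.

```python
def pad_day(field):
    """Return a day-of-month field as two digits, e.g. '7.' -> '07'."""
    # Drop a trailing period, then left-pad with zeros to width 2.
    return field.rstrip(".").zfill(2)


# Hypothetical day fields in mixed formats.
days = ["12", "7.", "03", "1."]

print(sorted(days))                # plain string sort on the raw fields
print(sorted(map(pad_day, days)))  # string sort after normalising
```

With every day written as two digits, a plain string sort puts the days in the same order as a numeric sort would.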

Exercise 3: Comparing vocabularies

a) For two files of about equal length, use comm to produce the specific vocabulary of each of the files, in other words, the vocabulary that is unique for each file. Count the lines for each vocabulary file. What would be possible causes for similar or different sizes of the specific vocabularies?

b) What could the information in text-specific vocabularies be used for? Or, conversely, what could the vocabulary that two files have in common tell us?

c) Note that comparisons with comm ignore the frequency of each word. How could we take frequencies into account when comparing two different vocabularies?
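To make the comparison concrete, here is a small Python sketch of what the three comm columns correspond to, plus one possible way to bring frequencies in. It is an illustration and not a replacement for the comm exercise: the function names and the two toy word lists are made up, and relative frequency is only one of several measures you could choose for c).

```python
from collections import Counter


def compare_vocabularies(words_a, words_b):
    """Split two word lists into text-specific and shared vocabulary."""
    vocab_a, vocab_b = set(words_a), set(words_b)
    return {
        "only_a": sorted(vocab_a - vocab_b),  # like comm -23 a b
        "only_b": sorted(vocab_b - vocab_a),  # like comm -13 a b
        "common": sorted(vocab_a & vocab_b),  # like comm -12 a b
    }


def relative_frequencies(words):
    """Map each word to its share of all tokens in the text."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy texts, already tokenised.
text_a = "virus smitte virus vaksine".split()
text_b = "vaksine dose vaksine virus".split()

result = compare_vocabularies(text_a, text_b)
print(result["only_a"], result["only_b"], result["common"])

# Compare how prominent each shared word is in the two texts.
freq_a = relative_frequencies(text_a)
freq_b = relative_frequencies(text_b)
for word in result["common"]:
    print(word, freq_a[word], freq_b[word])
```

Using relative rather than raw frequencies keeps the comparison fair when the two files are not exactly the same length.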

Exercise 4: Filtering and summing columns (with Awk)

a) Modify the program to select words not containing virus. The operator !~ means “does not match”.

b) Try a more complex regexp.

c) Try a more complex condition.

d) If you like R, do filtering and summing in R.
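If it helps to see the logic outside Awk, the Python sketch below does the same kind of filtering and summing. It assumes a frequency list with one "count word" pair per line; that format, the function name `sum_counts` and the sample data are assumptions, not taken from the lecture program. The `negate` flag plays the role of Awk's !~ operator.

```python
import re


def sum_counts(lines, pattern, negate=False):
    """Sum the counts of words that match (or do not match) a regexp."""
    regex = re.compile(pattern)
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        count, word = int(fields[0]), fields[1]
        matched = regex.search(word) is not None
        # negate=False keeps matching words; negate=True keeps the rest.
        if matched != negate:
            total += count
    return total


# Hypothetical frequency list in "count word" format.
freq_list = ["12 virus", "5 viruset", "3 koronavirus", "7 vaksine", "2 smitte"]

print(sum_counts(freq_list, r"virus"))               # words containing "virus"
print(sum_counts(freq_list, r"virus", negate=True))  # words not containing it
print(sum_counts(freq_list, r"^virus(et|ene)?$"))    # a more specific regexp
```

The last call shows how anchoring the pattern with ^ and $ excludes compounds such as "koronavirus" while still accepting a few inflected forms.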

Additional exercises

If you are finished with all other lab exercises, you can get started on the exercises below.

Exercise 5: Getting started with Python

a) Install the latest version of Python on your machine, if you don’t have it already.

b) Read the Python introduction at W3Schools.

c) Get familiar with an IDE / text editor, such as Spyder or PyCharm. You should know how to:

  • Create new files and folders
  • Configure a Python interpreter
  • Run the code and display the output in the built-in terminal

You can also look up different shortcut keys for:

  • Autocomplete line
  • Select/edit multiple lines/places at once
  • Run the code in the file

This post is licensed under CC BY 4.0 by the author.