Welcome to the LING123 Lab Sessions!
In this very first lab session, we’ll begin by looking at some basic concepts for computational linguistics and processing language data. After that, we’ll get started with shell scripting!
Complete the exercises in order. Don’t hesitate to ask me or your fellow students for help if you get stuck! You can also discuss the exercises on MittUiB, the UiB Studyfellowship Discord server (channel: #ling123) or right here in ling123labs.com’s comments section.
If you need help with the lab exercises during this course, or need to get in touch for any other reason, feel free to shoot me an email by clicking the letter icon on the navigational sidebar at the bottom left of this page. There are also many great resources on the web.
Remember: A programmer’s most useful tool is the search engine!
You can find the Lecture notes here.
Part 1: About computational linguistics and digital language data
Exercise 1.1: Computers in the study of language
- a) Which areas of language study might benefit from computational data and tools?
- b) Think of other applications for corpora besides machine translation.
Exercise 1.2: Corpora
- a) Think of linguistically relevant expressions to search for.
- b) Look at the NorGramTall blog, which uses a corpus to find out preferences in Norwegian.
- c) What are some limitations of the web as a corpus?
Exercise 1.3: Searching in corpora
Tip: You need to log in with Feide in order to use Corpuscle. At Corpuscle, click “CLARIN SPF” at the top, search for “Feide”, and log in. You can then navigate to the Corpus list and select the “Child’s Rights” corpus. Select “Metadata” from the menu on the left-hand side and click “Accept”. You are now able to query the corpus by selecting “Query” from the menu. The same menu also lets you create queries, search for collocations and create distributions.
When using AntConc, I suggest starting with a .txt file such as lofoten.txt.
- a) Look for all words starting with child by means of the regular expression child.* in an English corpus, for instance, in the corpus Child Rights in Corpuscle. Make a word list. Find collocations. Find their distribution relative to country.
- b) Try out AntConc, a tool with which you can analyze your own corpus.
Exercise 1.4: The trouble with language
- a) What happens if a ligature (ﬄ) is considered different from its component letters?
- b) Which different kinds of knowledge are necessary to understand language? How much "common sense" and knowledge about the world is necessary?
- c) Some characters may be hard to distinguish, for instance different dashes, which could be a hyphen, minus sign, etc. (‐ - ⁃ －) and different characters similar to apostrophe (' ʼ ′ ʹ). Try to find a way to examine if they are different, or if they are the same character in various fonts.
Part 2: Introduction to shell scripting
An example of shell scripting (using WSL Ubuntu). Intimidating, huh? Don’t worry, we’ll get you there in no time!
Software installations and setup
If you are using Windows, please make sure you have installed a compatible shell. I have posted an installation guide here.
Basic usage of the shell
When coding in the shell, you first enter the command, then any arguments:
[command] [arg1] [arg2]...
The command can be navigational, like cd, or it can be a program.
The command and its arguments are separated by spaces. Any argument containing a space must be escaped with a backslash (\) or be quoted with '' or "".
To start things out, we can print text to the terminal like this:
echo "Hello world!"
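To see how spaces, backslashes and quotes change the way the shell splits a line into arguments, here is a small sketch. Note that count_args is a made-up helper function for this illustration, not a standard command, and my file.txt is just a hypothetical file name; nothing is read from disk.

```shell
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() {
  echo "$#"
}

count_args my file.txt      # unquoted space: the shell sees two arguments
count_args my\ file.txt     # space escaped with a backslash: one argument
count_args 'my file.txt'    # single quotes: one argument
count_args "my file.txt"    # double quotes: one argument
```

The first call prints 2 and the other three print 1, which is why a file name containing a space has to be escaped or quoted before a command can treat it as a single name.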
The bash shell includes the command help, which can be used to get basic information about included commands, and more information about them by typing help [command]. In addition, most commands and command-line programs have a manual available by typing man [command]. More information about the command-line can be found here.
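If you are unsure whether to reach for help or man, one possible approach (a sketch, not the only way) is to ask the shell what kind of command a name refers to with type. In bash, help covers the commands built into the shell itself, while man covers separately installed programs. The commands cd and wc below are only examples.

```shell
# type reports whether a name is a shell built-in or a program on disk.
type cd    # a built-in, so in bash you would read about it with: help cd
type wc    # a program, so you would read about it with: man wc
```

Inside a manual page, press q to quit and return to the prompt.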
Exercise 2.1: Getting started with shell scripting
- a) Check your locale with the locale command. Change the locale to a different language and type date.
Exercise 2.2: Additional exercises for basic shell usage and word counting
- a) Acquaint yourself with the command line: try navigating directories, viewing files, etc. Try out various commands.
- b) Try editing a file using a terminal text editor such as Vim, Nano, or Emacs. What are the benefits of using one of these instead of an IDE like Atom or PyCharm? What are the drawbacks?
Exercise 2.3: Counting lines and words
- The wc command can take more than one file as arguments. Test something like the following which also uses a text file lofoten.txt:
wc chess.txt lofoten.txt
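If you do not have chess.txt and lofoten.txt at hand, the same behaviour can be tried on two tiny throwaway files. The file names a.txt and b.txt and their contents are made up for this sketch; they are created in a temporary directory so nothing in your own folders is touched.

```shell
# Create two tiny sample files (made up for this illustration).
tmpdir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$tmpdir/a.txt"
printf 'six seven\n' > "$tmpdir/b.txt"

# With several files, wc prints one row per file (lines, words, bytes, name)
# followed by a final row with the totals.
wc "$tmpdir/a.txt" "$tmpdir/b.txt"

# The options -l and -w restrict the output to line and word counts.
wc -l -w "$tmpdir/a.txt" "$tmpdir/b.txt"
```

The exact spacing of the columns differs between systems, so it is easier to compare the numbers than the layout.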
Exercise 2.4: Word counting on the Web
- Find an article on the web. Copy the content to a new file in a terminal text editor (vim, nano, emacs).
If Ctrl+v (Cmd for macOS) doesn't work, try
- Use a word or a pattern of your choosing, and find the number of occurrences with the wc command.
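Since wc on its own counts lines, words and bytes rather than a particular word, one possible approach is to let grep pick out the matches first and then count them with wc. The sketch below assumes a grep that supports the -o option (GNU and BSD grep both do); the sample text and the pattern cod are made up and stand in for your own article and search term.

```shell
# Made-up sample text standing in for the article you saved.
article=$(mktemp)
printf 'The cod is dried.\nCod and more cod.\n' > "$article"

# grep -o prints every match on its own line (-i ignores case);
# wc -l then counts those lines, which gives the number of occurrences.
grep -o -i 'cod' "$article" | wc -l

# For comparison: grep -c counts matching lines, not individual matches.
grep -c -i 'cod' "$article"
```

Here the first command reports 3 occurrences and the second reports 2 lines, because the second line of the sample contains the word twice.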