
Solutions to the exercises from Lab Session 01

Try to solve the lab exercises by yourself before you view the solutions!
You can find the exercises to Lab Session 01 here.


Disagree with my answers, or have something to add?
Leave a comment!

Exercise 1.1: Computers in the study of language

a) Which areas of language study might benefit from computational data and tools?

As most areas of language study benefit greatly from computational data and tools, there is no exhaustive list that can answer this question definitively. Computational linguistics has become extremely popular in the past decade thanks to impressive applications in speech recognition, machine translation, sentiment analysis and much, much more. Most of these applications rely heavily on large amounts of training data, which are retrieved and processed using computational tools. Language data is often scraped from the web or otherwise supplied by information systems. Recent advances in speech recognition technology have even made studying spoken language much easier by removing transcription work almost entirely. This has in turn vastly increased the size and variety of the natural language datasets we get to work with.

Other, more traditional domains within linguistics use many processing tools to analyse modern written language. Historical linguistics makes use of computational tools to compare modern language with older texts. Morphology and phonetics benefit greatly from methods such as tokenization, and their rules can be analysed in the context of regular languages and finite-state automata. Semantics and pragmatics can use language processing tools to filter written (or spoken) texts by context or keywords, and even to compare variations in context for the same words.

I’m sure you can find many other domains within language study which might benefit. The point is, modern language researchers need to be fluent in language processing tools, and there is great value in knowing how to utilize the vast amounts of language data that exist in digital form.

b) Think of other applications for corpora besides machine translation.
As mentioned above, corpora can be used as training data for a wide array of different natural language processing applications. They can also be used for statistical analysis in any field of language research. Lexicographers (the people who write dictionaries) have known for a long time that the best way to avoid missing things is to have a big corpus and a powerful computer. Other examples of applications are using concordance examples from corpora to teach a language or testing a hypothesis of how language is used. Corpora can also reveal instances of very rare or exceptional use cases that we wouldn’t get by looking at single texts or through introspection.

Exercise 1.2: Corpora

a) Think of linguistically relevant expressions to search for.
Many people use search engines to investigate correct grammatical use of words or phrases. For example, one might search recurring or reoccurring to decide which one is appropriate.

As language researchers, we can therefore use search engines to discover which words or phrases are difficult for people to use correctly. One way of doing this is by looking at the Google Trends tool.

One interesting example of how Google Trends can be used is to look at difficult terms used by politicians after a debate, or after an important news event such as a natural disaster or a climate change summit. Which words do people need to google? How can we explain the results? We can also compare different terms, such as different spellings of the same word, or compare different countries’ most popular terms.

Exploring the query “how to spell” with Google Trends

b) Look at the NorGramTall blog, which uses a corpus to find out preferences in Norwegian.

c) What are some limitations of using the web as a corpus?

  • The Web is dynamic, so researchers have little control over precision and recall of search results. Results may be very difficult, if not impossible, to replicate.
  • Web data may be incorrect, or may not represent the general population.
  • Reliable metadata may be lacking.
  • It is sometimes challenging to track down and verify sources.
  • Web data usually depends on third parties.
  • Using language data from the web requires technical skills, such as web scraping or programming in query languages.

Exercise 1.3: Searching in corpora

a) Look for all words starting with “child” by means of the regular expression child.* in an English corpus, for instance, in the corpus “Child Rights” in Corpuscle. Make a word list. Find collocations. Find their distribution relative to country.

Tip: You need to log in with Feide in order to use Corpuscle. At Corpuscle, click “CLARIN SPF” at the top, search for “Feide”, and log in. You can then navigate to the Corpus list and select the “Child’s Rights” corpus. Select “Metadata” from the menu on the left-hand side and click “Accept”. You are now able to query the corpus by selecting “Query” from the menu on the left. Use the menu to create queries, search for collocations and create distributions.

Word list: Creating a word list in Clarino Corpuscle

To search with the regular expression child.* in Corpuscle, enter the query child* (without the dot) in basic search, or the query "child.*" in advanced search.
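You can also try the same kind of regular expression locally with grep. A minimal sketch, using a throwaway word list invented here for illustration:

```shell
# Make a small word list and find all words starting with "child"
printf 'child\nchildren\nchildhood\nwild\n' > words.txt
grep -E '^child.*' words.txt
# child
# children
# childhood
```

Note that Corpuscle matches whole tokens, so `child.*` there behaves like the anchored `^child.*$` in grep.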

Collocations: Looking for collocations in Clarino Corpuscle

Click the “Show collocations” button to view the results. Sort the results by frequency by selecting “Frequency” from the drop-down list between sorted by and Download. Tick the “show freq.” box to view the collocation frequencies.

Distribution of collocations relative to country: The distribution of collocations relative to country

Click “Run Query” to run the query again. Show the distribution of words relative to country by selecting “word” and “country” from the drop-down lists. Leave the “group by” and “and” fields empty. Tick the “counts only” box to view only the count for each word in the result. You can also explore collocations by selecting a positive or negative offset for the word, using the drop-down lists next to the triangle (delta) on the “of” row.

b) Try out AntConc, a tool with which you can analyze your own corpus.

When using AntConc, I suggest starting with a .txt file such as lofoten.txt.

Exercise 1.4: The trouble with language

a) What happens if a ligature (ffl) is considered different from its component letters?
A ligature like “ffl” is represented by a single, unique Unicode character (U+FB04), distinct from the sequence of the three letters “f”, “f”, and “l”. The problem is that two words which visually both contain “ffl” might be represented differently under the hood. This especially has implications for search, but it can also create headaches for language processing in general (e.g. string comparisons).
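You can see the problem directly in the shell. A small sketch; the octal escape below encodes the UTF-8 bytes of U+FB04 (LATIN SMALL LIGATURE FFL):

```shell
# "raffle" written with the ffl ligature vs. with three separate letters
a=$(printf 'ra\357\254\204e')   # \357\254\204 = UTF-8 bytes of U+FB04 (ffl)
b='raffle'
if [ "$a" = "$b" ]; then echo same; else echo different; fi   # prints "different"
```

The two strings look the same when printed, yet compare as unequal, which is exactly why search over such text can silently miss matches.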

b) Which different kinds of knowledge are necessary to understand language? How much “common sense” and knowledge about the world is necessary?

Disclaimer: I will try my best to avoid entering philosophical territory here, but keep in mind that these questions have been - and still are - quite heavily debated. So it goes without saying that a lot of what follows is opinion. You should do some critical thinking on your own, and figure out for yourself what you think is the correct answer.

Most linguists believe that knowledge about grammar is necessary to speak a language, at least in any efficient sense. I will leave it up to you to decide on whether you think that this knowledge is innate or not.

Language is a medium through which humans pass on most of our knowledge. Language shapes the way we think about the world. Rhetoric sways opinions.
Any modern cognitive scientist would (probably) argue that learning language is an important part of learning about the world, and one could also take the view that human language and knowledge are so intricately related that they are difficult to separate. Each word is bound to a concept, a “thing” we all have some common understanding of. One could therefore say that common understanding is key to communication in general, because understanding a word on a purely individual level doesn’t get you very far. Arguably, understanding of a word arises when we are able to relate it to the same concept as everyone else does when they hear that word. Additionally, true masters of a language realize that different groups of speakers understand the same words differently, independent of context; an example is the word “ping”. One can therefore say that both linguistic and social knowledge are necessary to understand and use language.

Many words and phrases mean different things based on the context they are used in, and interpreting the different meanings of a word is a (somewhat) unsolved challenge in natural language processing. Even more so for languages such as Norwegian, which has far fewer words than English and to a greater extent relies on using the same words in different contexts to convey different meanings. So one might argue that knowledge about human culture and social life, humour, and different technical domains is needed to understand how the same word can be used in different contexts. After all, we have all experienced how a difference in situational awareness can cause communication difficulties (such as misinterpretation).

On the other hand, recent advances in AI have shown promising results through the use of very complex neural networks which seem to interpret words with many meanings. And chatbots “understand” more and more of what humans want from them, serving up increasingly accurate answers to requests given in natural language. So can we not say that a chatbot “understands” what we mean when we ask it “When does McDonald’s close?” or “Where can I buy a PS5?” and it gives us the correct answer we were looking for?

Some will argue that computers and computer programs don’t “know” anything. After all, software is just a bunch of ones and zeros; computers don’t “understand” anything on a “deep” level. Yet we now have chatbots which seem to do quite well in domain-specific language interpretation tasks. We have software applications that can do sentiment analysis to figure out whether a person is angry or sad based on a tweet they wrote. We have knowledge graphs which can link related concepts and explain how different things fit together. A car in computer memory is no longer “just” a string of 1s and 0s: a car is related to a truck, a car has four wheels, and you need a licence to drive one. So is this a case where the software understands what we are telling it? Or should we stick to the Chinese Room argument and say that it is only matching an input to the correct output? In the future, the answers to these questions might redefine how we look at the requirements for “understanding” natural language.

What do you think? There are many great books and articles on the subject. Ask Koenraad for a few suggestions if you are curious. I have a few, but linguistics is a divisive field and I am not a linguist, so I will refrain from doing so here… :)

c) Some characters may be hard to distinguish, for instance different dashes, which could be a hyphen, minus sign, etc. (‐ - ⁃ -) and different characters similar to apostrophe (' ʼ ′ ʹ). Try to find a way to examine if they are different, or if they are the same character in various fonts.

You can use Unicode Character Search to figure this out. Insert each character and see which code it belongs to, then compare the codes to see which ones are actually different.
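You can also check directly in the terminal by dumping the bytes of each character. A sketch assuming a UTF-8 locale; identical byte sequences mean identical characters:

```shell
# Dump the UTF-8 bytes of two dash-like characters
printf '%s' '-' | od -An -tx1   # ASCII hyphen-minus (U+002D): 2d
printf '%s' '‐' | od -An -tx1   # HYPHEN (U+2010): e2 80 90
```

Different byte sequences confirm that these really are distinct characters, not the same character rendered in different fonts.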

Exercise 2.1: Getting started with shell scripting

a) Check your locale with the locale command. Change the locale to a different language and type date.
Setting the locale in Ubuntu bash

Restart the terminal for the changes to take effect.

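For reference, a locale can also be overridden for a single command rather than the whole session. A sketch; nb_NO.UTF-8 is an example locale that must already be generated on your system:

```shell
locale                                 # show the current locale settings
LC_ALL=C date                          # one command under the minimal POSIX locale
LC_ALL=nb_NO.UTF-8 date 2>/dev/null    # Norwegian month/day names, if installed
# Persistent change on Ubuntu (then restart the terminal):
#   sudo update-locale LANG=nb_NO.UTF-8
```

Setting `LC_ALL` on the command line is handy for testing, since it overrides every other locale variable for just that one command.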

Exercise 2.2: Additional exercises for basic shell usage and word counting

a) Acquaint yourself with the command line: try navigating directories, viewing files, etc. Try out a few different commands.
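If you are unsure where to start, here is a small sketch of a session (the directory and file names are made up):

```shell
pwd                                   # where am I?
mkdir demo && cd demo                 # make and enter a directory
printf 'hello\nworld\n' > file.txt    # create a small file
ls -l                                 # list files with details
cat file.txt                          # print the file's contents
wc -l file.txt                        # count its lines
cd .. && rm -r demo                   # go back up and clean up
```

Use `man <command>` (press q to quit) to read the manual page for any of these.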

b) Try editing a file using a terminal text editor such as Vim, Nano, or Emacs. What are the benefits of using one of these instead of an IDE like Atom or Pycharm? What are the drawbacks?
No definitive answer here either, but the general idea is that we have to make a trade-off between increased functionality/a nice interface and simplicity/efficiency.

Editing files in the terminal is quick and easy for simple tasks, but there is a significant lack of functionality compared to an IDE such as VS Code. However, it is still very useful to learn how to work in a terminal editor, because you often save time and effort by quickly writing some code in the terminal instead of opening an external program. So unless you’re writing longer scripts, I recommend simply getting used to a terminal editor such as nano.

Exercise 2.3: Counting lines and words

a) The wc command can take more than one file as arguments. Test something like the following, which also uses the text file lofoten.txt:

wc chess.txt lofoten.txt

Word counting with two files
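With multiple files, wc prints one line per file plus a total. A sketch with two small dummy files (their contents are invented here; the lab's own chess.txt and lofoten.txt will of course give different numbers):

```shell
printf 'a b c\n' > chess.txt
printf 'd e\nf\n' > lofoten.txt
wc chess.txt lofoten.txt
#  1  3  6 chess.txt      <- lines, words, bytes
#  2  3  6 lofoten.txt
#  3  6 12 total
```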

Exercise 2.4: Word counting on the Web

a) Find an article on the web. Copy the content to a new file in a terminal text editor (vim, nano, emacs).
If Ctrl+C and Ctrl+V (Cmd instead of Ctrl on Mac) don’t work, try Ctrl+C and right-click instead.

  1. Find an article on the web. For example, you can use The Gutenberg Project to download free books in plain text (UTF-8).
  2. Copy the text directly into a text editor, such as nano. It might take a few moments if it is a large amount of text.
  3. Save the file. In nano, press Ctrl+X (Cmd+X on Mac), press Y if prompted to save the buffer, then enter the name of the new file and press Enter.

c) Use a pattern of your choosing, and find the number of occurrences with the wc command.

Counting words in a file
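One common recipe is to put each match on its own line with grep -o and count the lines with wc -l. A sketch with an invented sample file; note that grep -c counts matching lines, not matches:

```shell
printf 'the cat and the dog\nthe end\n' > sample.txt
grep -o 'the' sample.txt | wc -l   # number of matches: 3
grep -c 'the' sample.txt           # number of matching lines: 2
```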

This post is licensed under CC BY 4.0 by the author.