Solutions to Exercises from Lab Session 04

Here are my solutions to the exercises from lab session 04.
You should try to solve the lab exercises yourself before you peek at the answers!
The exercises for Lab Session 04 are available here.

Link to relevant lecture notes

Disagree with my solutions, or have something to add?
Leave a comment!

Solution to Exercise 1: Counting columns with Awk

Look at the example from the lecture notes. The French examples d’information and l’agriculture are contractions of two words. Translate the apostrophe to a space before counting.

We can replace every apostrophe with a space by running sed "s/'/ /g" and piping the output into the rest of the pipeline from the lecture notes. The substitution on its own looks like this:

$ sed "s/'/ /g" fraud-mwes.txt | head
cas de fraude
direction générale
type de contrôle
fiche d information
identification de la fiche-mère
transit communautaire
prélèvement agricole
communication trimestrielle
enquête administrative
principal obligé

Running the complete line of code gives us:

$ sed "s/'/ /g" fraud-mwes.txt | awk '{print NF}' | sort | uniq -c | sort -nr
2
3
4
5
6
7
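
To try the pipeline without the course file, here is a minimal sketch on three sample phrases. The function name count_words_per_line is a hypothetical helper introduced only for illustration, and the extra awk at the end just trims the padding that uniq -c adds, which differs between systems.

```shell
# Hypothetical helper: report how many input lines have each number of words,
# after turning apostrophes into spaces. Reads from standard input.
count_words_per_line() {
  sed "s/'/ /g" | awk '{ print NF }' | sort -n | uniq -c | sort -nr
}

# Sample input (three phrases, one of them with an apostrophe).
printf '%s\n' "cas de fraude" "fiche d'information" "direction générale" \
  | count_words_per_line \
  | awk '{ print $1, $2 }'   # normalise uniq's padding: "count words"
```

Each output line holds a count followed by a number of words, with the most frequent length first.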

Solution to Exercise 2: Cutting columns and numbering lines

a) Extend the script, for instance with sed, so that you put a period after each number. Experiment with different formats.

We can insert a period after each number with sed by capturing the number and the word in two groups and writing them back with a period and a tab in between. The \s, \t and \w escapes are GNU extensions and may not work in other sed implementations. Depending on how the file is formatted on your system, you might also have to match general whitespace with \s instead of a tab with \t.

$ awk '{ print $2 }' most-freq-en.txt | sort | nl | sed -E 's/(\s*[[:digit:]]+)\t(\w+)/\1.\t\2/g' | head
     1. A
     2. About
     3. After
     4. All
     5. Also
     6. An
     7. And
     8. Any
     9. As
    10. At
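
If your sed does not understand \t, \s or \w, the following sketch avoids them by anchoring on the leading number only and using POSIX character classes. The helper name add_period is hypothetical, and the sample words stand in for the real word list. The second command is another format to experiment with: it asks nl itself to write ". " as the separator.

```shell
# Hypothetical helper: put a period after the line number that nl adds.
# Uses only POSIX character classes, so no \t, \s or \w is needed.
add_period() {
  sed -E 's/^([[:space:]]*[[:digit:]]+)/\1./'
}

# Number three sample words, then add the period with sed.
printf '%s\n' A About After | nl | add_period

# Alternative format: a number width of 2 and ". " as nl's own separator.
printf '%s\n' A About After | nl -w 2 -s '. '
```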

b) Reverse the columns in the last example, so that the numbers come after the words. This can be done by adding | awk '{ print $2 " " $1 }' after the nl command.

As the exercise description says, we first insert | awk '{ print $2 " " $1 }' after the nl command. This awk code prints each word, then a space, then its number. The sed expression must now match the new column order, so the first group captures the word and the second captures the number. In the replacement we put a tab character (\t) before each group and a period after the number, so that the output looks neat with even spacing between the columns.

$ awk '{ print $2 }' most-freq-en.txt | sort | nl | awk '{ print $2 " " $1 }' |
  sed -E 's/(\w+) ([[:digit:]]+)/\t\1\t\2./g' | head
        A       1.
        About   2.
        After   3.
        All     4.
        Also    5.
        An      6.
        And     7.
        Any     8.
        As      9.
        At      10.
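
The swap and the formatting can also be done in a single awk step, which removes the need for the second sed. This is only a sketch of an alternative, not the approach from the lecture notes; swap_columns is a hypothetical helper name and the sample words stand in for the real word list.

```shell
# Hypothetical helper: print the word first, then its number followed by a
# period, with a tab before each column.
swap_columns() {
  awk '{ printf "\t%s\t%s.\n", $2, $1 }'
}

# Number three sample words, then swap and format the columns.
printf '%s\n' A About After | nl | swap_columns
```

Because awk splits on any whitespace by default, this works whether nl separates the columns with a tab or with spaces.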

Solution to Exercise 3: Reverse dictionary

Check the alphabetical order for a few different languages.

This Wikipedia article provides a detailed description of the alphabetic ordering for different languages! In particular, read the section on Language-Specific conventions.
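
On the command line, the order that sort uses depends on the locale, so you can compare orderings yourself. The sketch below assumes that the en_US.UTF-8 locale is installed (check with locale -a). If it is not, sort may fall back to another ordering. The sample words are made up for illustration.

```shell
# Byte order (C locale): every uppercase ASCII letter sorts before
# every lowercase one, so "Zebra" comes first.
printf '%s\n' apple Zebra banana | LC_ALL=C sort

# Locale-aware order. Assumes en_US.UTF-8 is installed on this system.
printf '%s\n' apple Zebra banana | LC_ALL=en_US.UTF-8 sort
```

Swapping in another installed locale name is an easy way to check the conventions for a different language.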
