Disclaimer: You should always attempt to solve the lab exercises by yourself before looking at the proposed solutions below. The exercises to Lab Session 08 are available here.
Disagree with my solutions, or have something to add?
Leave a comment!
Exercise 1: Tokenizer with RE in Python
Take a look at the example in the lecture notes.
a) What are two important differences between the results of tokenize and word_tokenize?
# tokenize.py
import re
from nltk import download
from nltk.tokenize import word_tokenize
download('punkt') # might be needed for word_tokenize to work properly.
pb = """
Once upon a time, there was a princess called Buttercup.
She had a servant called Westley.
"""
def word_tokens(text):
    words = re.sub(r'[\W_]+', ' ', text.casefold())
    return words.split()
print(f'Tokenization with regex: {word_tokens(pb)} \n')
print(f'Tokenization with nltk: {word_tokenize(pb)} \n')
$ python tokenize.py
Tokenization with regex: ['once', 'upon', 'a', 'time', 'there', 'was', 'a', 'princess', 'called', 'buttercup', 'she', 'had', 'a', 'servant', 'called', 'westley']
Tokenization with nltk: ['Once', 'upon', 'a', 'time', ',', 'there', 'was', 'a', 'princess', 'called', 'Buttercup', '.', 'She', 'had', 'a', 'servant', 'called', 'Westley', '.']
There are two important differences:
nltk preserves capitalization
nltk includes punctuation tokens (e.g. , . ; : `` '')
b) What if we want to use word_tokenize but also want casefolding?
# filename: tokenize_casefold.py
from nltk.tokenize import word_tokenize
pb = """
Once upon a time, there was a princess called Buttercup.
She had a servant called Westley.
"""
# Using word_tokenize with casefolding
tokens = [word.casefold() for word in word_tokenize(pb)]
print(tokens)
$ python tokenize_casefold.py
['once', 'upon', 'a', 'time', ',', 'there', 'was', 'a', 'princess', 'called', 'buttercup', '.', 'she', 'had', 'a', 'servant', 'called', 'westley', '.']
Exercise 2: Zippers
What happens if you try to take the next of a zip generator that is exhausted?
# Filename: zippers.py
# Create a new zipper object
z = zip(*['abcdefgh'[i:] for i in range(3)])
# Exhaust the zipper
for i in z:
    print(next(z))
# But what happens when we want to print the original?
print('z:', *z)
# Take the next again, see what happens
print(next(z))
$ python zippers.py
('b', 'c', 'd')
('d', 'e', 'f')
('f', 'g', 'h')
z:
Traceback (most recent call last):
File "C:\Users\Path-to-file\zippers.py", line 14, in <module>
print(next(z))
StopIteration
So the answer is that we get a StopIteration exception. But why is that?
Well, when we call next on a zipper object, we are actually consuming the original object while "unpacking" it. This means that if we attempt to print the contents of a zipper object which has already been unpacked (which we tried here with print('z:', *z)) we get nothing at all.
So when we additionally attempt to take the next of z again, we get the StopIteration exception because there are no more elements to be generated by the zipper object. It's empty!
Exercise 3: N-grams with Python
There are three different tasks here, a), b) and c) which I will solve in one single python script displayed below. The questions are presented below the code with a short explanation, though they are also present as comments in the code.
To show you an example of using a corpus from nltk, I decided to include the Brown Corpus in this exercise. This can come in handy for you if you should need to test your code on a longer text. Keep in mind that all of NLTK's corpora are already tokenized, so in this case - for the sake of the exercise - I had to untokenize, then re-tokenize the text first.
# filename: n-grams.py
import re
from nltk import download
download('brown') # download the Brown Corpus from nltk (we imported download above)
from nltk.corpus import brown # import the corpus you just downloaded
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
# The Brown Corpus consists of one million words of American English texts printed in 1961.
# Get the 100 first words in the corpus
brown = ' '.join(brown.words()[:100])
print(brown)
# a) Once again, there is a difference between the results,
# due to a difference between tokenize and word_tokenize. Why would you use one or the other?
# Let's create some n-grams first, to see the difference ourselves.
# Function for regex tokenization
def tokenize(text):
    words = re.sub(r'[\W_]+', ' ', text.casefold())
    return words.split()
# Function for creating n_grams from a tokenized text
def n_grams(tokens, n=3):
    return zip(*[tokens[i:] for i in range(n)])
# Re-tokenize the first 100 words from the Brown corpus
brown_tokenized = tokenize(brown)
# Create N-grams from the first 100 words of the Brown corpus
brown_ngrams = n_grams(brown_tokenized, n=4)
print("\nExercise 3 a)")
# Print the results from creating n-grams with re and zip
print("creating n-grams with re and zip: ", *brown_ngrams)
# Print the results from creating n-grams with nltk
brown_nltk_tokenized = word_tokenize(brown)
brown_nltk_ngrams = ngrams(brown_nltk_tokenized, n=4)
print("creating n-grams with nltk: ", *brown_nltk_ngrams)
# ---------------------------------------------------------------------------------------------------------------------#
# b) The resulting n-grams are in both cases generators.
# If you do not convert them to lists, you can use next.
# Create a new instance of the ngrams object which is not unpacked
brown_nltk_ngrams = ngrams(word_tokenize(brown), n=4)
# Unpacking with the asterisk operator consumes the generator object "in place".
# So does the "next" function: every call advances (and so changes) the original object.
# This also applies when we use print(next(generator_object)).
# Therefore, we can't use next if we have already exhausted the generator with the unpacking operator.
print("\nExercise 3 b)")
# print(*brown_nltk_ngrams) # <---- This needs to be removed for next to work.
print("Printing the next of the brown_nltk_ngrams object: ", next(brown_nltk_ngrams))
# ---------------------------------------------------------------------------------------------------------------------#
# c) Use " ".join() to convert the list of n-gram tuples to a list of strings
# in which the words are separated by spaces.
brown_nltk_ngrams = ngrams(word_tokenize(brown), n=4) # Create a clean instance of the ngrams generator object
# Create an empty list
n_grams_list = []
# Loop through the brown_nltk_ngrams generator object
for n_gram in brown_nltk_ngrams:
    # Convert the n-gram from a tuple with 4 elements to a single string with 4 words
    n_grams_list.append(' '.join(n_gram))
print("\nExercise 3 c)")
print("Converting the list of n-gram tuples to a list of strings :", n_grams_list)
$ python n-grams.py
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan
Exercise 3 a)
creating n-grams with re and zip: ('the', 'fulton', 'county', 'grand') ('fulton', 'county', 'grand', 'jury') ('county', 'grand', 'jury', 'said') ('grand', 'jury', 'said', 'friday') ('jury', 'said', 'friday', 'an') ('said', 'friday', 'an', 'investigation') ('friday', 'an', 'investigation', 'of') ('an', 'investigation', 'of', 'atlanta') ('investigation', 'of', 'atlanta', 's') ('of', 'atlanta', 's', 'recent') ('atlanta', 's', 'recent', 'primary') ('s', 'recent', 'primary', 'election') ('recent', 'primary', 'election', 'produced') ('primary', 'election', 'produced', 'no') ('election', 'produced', 'no', 'evidence') ('produced', 'no', 'evidence', 'that') ('no', 'evidence', 'that', 'any') ('evidence', 'that', 'any', 'irregularities') ('that', 'any', 'irregularities', 'took') ('any', 'irregularities', 'took', 'place') ('irregularities', 'took', 'place', 'the') ('took', 'place', 'the', 'jury') ('place', 'the', 'jury', 'further') ('the', 'jury', 'further', 'said') ('jury', 'further', 'said', 'in') ('further', 'said', 'in', 'term') ('said', 'in', 'term', 'end') ('in', 'term', 'end', 'presentments') ('term', 'end', 'presentments', 'that') ('end', 'presentments', 'that', 'the') ('presentments', 'that', 'the', 'city') ('that', 'the', 'city', 'executive') ('the', 'city', 'executive', 'committee') ('city', 'executive', 'committee', 'which') ('executive', 'committee', 'which', 'had') ('committee', 'which', 'had', 'over') ('which', 'had', 'over', 'all') ('had', 'over', 'all', 'charge') ('over', 'all', 'charge', 'of') ('all', 'charge', 'of', 'the') ('charge', 'of', 'the', 'election') ('of', 'the', 'election', 'deserves') ('the', 'election', 'deserves', 'the') ('election', 'deserves', 'the', 'praise') ('deserves', 'the', 'praise', 'and') ('the', 'praise', 'and', 'thanks') ('praise', 'and', 'thanks', 'of') ('and', 'thanks', 'of', 'the') ('thanks', 'of', 'the', 'city') ('of', 'the', 'city', 'of') ('the', 'city', 'of', 'atlanta') ('city', 'of', 'atlanta', 'for') ('of', 'atlanta', 'for', 
'the') ('atlanta', 'for', 'the', 'manner') ('for', 'the', 'manner', 'in') ('the', 'manner', 'in', 'which') ('manner', 'in', 'which', 'the') ('in', 'which', 'the', 'election') ('which', 'the', 'election', 'was') ('the', 'election', 'was', 'conducted') ('election', 'was', 'conducted', 'the') ('was', 'conducted', 'the', 'september') ('conducted', 'the', 'september', 'october') ('the', 'september', 'october', 'term') ('september', 'october', 'term', 'jury') ('october', 'term', 'jury', 'had') ('term', 'jury', 'had', 'been') ('jury', 'had', 'been', 'charged') ('had', 'been', 'charged', 'by') ('been', 'charged', 'by', 'fulton') ('charged', 'by', 'fulton', 'superior') ('by', 'fulton', 'superior', 'court') ('fulton', 'superior', 'court', 'judge') ('superior', 'court', 'judge', 'durwood') ('court', 'judge', 'durwood', 'pye') ('judge', 'durwood', 'pye', 'to') ('durwood', 'pye', 'to', 'investigate') ('pye', 'to', 'investigate', 'reports') ('to', 'investigate', 'reports', 'of') ('investigate', 'reports', 'of', 'possible') ('reports', 'of', 'possible', 'irregularities') ('of', 'possible', 'irregularities', 'in') ('possible', 'irregularities', 'in', 'the') ('irregularities', 'in', 'the', 'hard') ('in', 'the', 'hard', 'fought') ('the', 'hard', 'fought', 'primary') ('hard', 'fought', 'primary', 'which') ('fought', 'primary', 'which', 'was') ('primary', 'which', 'was', 'won') ('which', 'was', 'won', 'by') ('was', 'won', 'by', 'mayor') ('won', 'by', 'mayor', 'nominate') ('by', 'mayor', 'nominate', 'ivan')
creating n-grams with nltk: ('The', 'Fulton', 'County', 'Grand') ('Fulton', 'County', 'Grand', 'Jury') ('County', 'Grand', 'Jury', 'said') ('Grand', 'Jury', 'said', 'Friday') ('Jury', 'said', 'Friday', 'an') ('said', 'Friday', 'an', 'investigation') ('Friday', 'an', 'investigation', 'of') ('an', 'investigation', 'of', 'Atlanta') ('investigation', 'of', 'Atlanta', "'s") ('of', 'Atlanta', "'s", 'recent') ('Atlanta', "'s", 'recent', 'primary') ("'s", 'recent', 'primary', 'election') ('recent', 'primary', 'election', 'produced') ('primary', 'election', 'produced', '``') ('election', 'produced', '``', 'no') ('produced', '``', 'no', 'evidence') ('``', 'no', 'evidence', '``') ('no', 'evidence', '``', 'that') ('evidence', '``', 'that', 'any') ('``', 'that', 'any', 'irregularities') ('that', 'any', 'irregularities', 'took') ('any', 'irregularities', 'took', 'place') ('irregularities', 'took', 'place', '.') ('took', 'place', '.', 'The') ('place', '.', 'The', 'jury') ('.', 'The', 'jury', 'further') ('The', 'jury', 'further', 'said') ('jury', 'further', 'said', 'in') ('further', 'said', 'in', 'term-end') ('said', 'in', 'term-end', 'presentments') ('in', 'term-end', 'presentments', 'that') ('term-end', 'presentments', 'that', 'the') ('presentments', 'that', 'the', 'City') ('that', 'the', 'City', 'Executive') ('the', 'City', 'Executive', 'Committee') ('City', 'Executive', 'Committee', ',') ('Executive', 'Committee', ',', 'which') ('Committee', ',', 'which', 'had') (',', 'which', 'had', 'over-all') ('which', 'had', 'over-all', 'charge') ('had', 'over-all', 'charge', 'of') ('over-all', 'charge', 'of', 'the') ('charge', 'of', 'the', 'election') ('of', 'the', 'election', ',') ('the', 'election', ',', '``') ('election', ',', '``', 'deserves') (',', '``', 'deserves', 'the') ('``', 'deserves', 'the', 'praise') ('deserves', 'the', 'praise', 'and') ('the', 'praise', 'and', 'thanks') ('praise', 'and', 'thanks', 'of') ('and', 'thanks', 'of', 'the') ('thanks', 'of', 'the', 'City') ('of', 
'the', 'City', 'of') ('the', 'City', 'of', 'Atlanta') ('City', 'of', 'Atlanta', '``') ('of', 'Atlanta', '``', 'for') ('Atlanta', '``', 'for', 'the') ('``', 'for', 'the', 'manner') ('for', 'the', 'manner', 'in') ('the', 'manner', 'in', 'which') ('manner', 'in', 'which', 'the') ('in', 'which', 'the', 'election') ('which', 'the', 'election', 'was') ('the', 'election', 'was', 'conducted') ('election', 'was', 'conducted', '.') ('was', 'conducted', '.', 'The') ('conducted', '.', 'The', 'September-October') ('.', 'The', 'September-October', 'term') ('The', 'September-October', 'term', 'jury') ('September-October', 'term', 'jury', 'had') ('term', 'jury', 'had', 'been') ('jury', 'had', 'been', 'charged') ('had', 'been', 'charged', 'by') ('been', 'charged', 'by', 'Fulton') ('charged', 'by', 'Fulton', 'Superior') ('by', 'Fulton', 'Superior', 'Court') ('Fulton', 'Superior', 'Court', 'Judge') ('Superior', 'Court', 'Judge', 'Durwood') ('Court', 'Judge', 'Durwood', 'Pye') ('Judge', 'Durwood', 'Pye', 'to') ('Durwood', 'Pye', 'to', 'investigate') ('Pye', 'to', 'investigate', 'reports') ('to', 'investigate', 'reports', 'of') ('investigate', 'reports', 'of', 'possible') ('reports', 'of', 'possible', '``') ('of', 'possible', '``', 'irregularities') ('possible', '``', 'irregularities', '``') ('``', 'irregularities', '``', 'in') ('irregularities', '``', 'in', 'the') ('``', 'in', 'the', 'hard-fought') ('in', 'the', 'hard-fought', 'primary') ('the', 'hard-fought', 'primary', 'which') ('hard-fought', 'primary', 'which', 'was') ('primary', 'which', 'was', 'won') ('which', 'was', 'won', 'by') ('was', 'won', 'by', 'Mayor-nominate') ('won', 'by', 'Mayor-nominate', 'Ivan')
Exercise 3 b)
Printing the next of the brown_nltk_ngrams object: ('The', 'Fulton', 'County', 'Grand')
Exercise 3 c)
Converting the list of n-gram tuples to a list of strings : ['The Fulton County Grand', 'Fulton County Grand Jury', 'County Grand Jury said', 'Grand Jury said Friday', 'Jury said Friday an', 'said Friday an investigation', 'Friday an investigation of', 'an investigation of Atlanta', "investigation of Atlanta 's", "of Atlanta 's recent", "Atlanta 's recent primary", "'s recent primary election", 'recent primary election produced', 'primary election produced ``', 'election produced `` no', 'produced `` no evidence', '`` no evidence ``', 'no evidence `` that', 'evidence `` that any', '`` that any irregularities', 'that any irregularities took', 'any irregularities took place', 'irregularities took place .', 'took place . The', 'place . The jury', '. The jury further', 'The jury further said', 'jury further said in', 'further said in term-end', 'said in term-end presentments', 'in term-end presentments that', 'term-end presentments that the', 'presentments that the City', 'that the City Executive', 'the City Executive Committee', 'City Executive Committee ,', 'Executive Committee , which', 'Committee , which had', ', which had over-all', 'which had over-all charge', 'had over-all charge of', 'over-all charge of the', 'charge of the election', 'of the election ,', 'the election , ``', 'election , `` deserves', ', `` deserves the', '`` deserves the praise', 'deserves the praise and', 'the praise and thanks', 'praise and thanks of', 'and thanks of the', 'thanks of the City', 'of the City of', 'the City of Atlanta', 'City of Atlanta ``', 'of Atlanta `` for', 'Atlanta `` for the', '`` for the manner', 'for the manner in', 'the manner in which', 'manner in which the', 'in which the election', 'which the election was', 'the election was conducted', 'election was conducted .', 'was conducted . The', 'conducted . The September-October', '. 
The September-October term', 'The September-October term jury', 'September-October term jury had', 'term jury had been', 'jury had been charged', 'had been charged by', 'been charged by Fulton', 'charged by Fulton Superior', 'by Fulton Superior Court', 'Fulton Superior Court Judge', 'Superior Court Judge Durwood', 'Court Judge Durwood Pye', 'Judge Durwood Pye to', 'Durwood Pye to investigate', 'Pye to investigate reports', 'to investigate reports of', 'investigate reports of possible', 'reports of possible ``', 'of possible `` irregularities', 'possible `` irregularities ``', '`` irregularities `` in', 'irregularities `` in the', '`` in the hard-fought', 'in the hard-fought primary', 'the hard-fought primary which', 'hard-fought primary which was', 'primary which was won', 'which was won by', 'was won by Mayor-nominate', 'won by Mayor-nominate Ivan']
a) Once again, there is a difference between the results, due to a difference between tokenize and word_tokenize. Why would you use one or the other?
The nltk implementation does not lowercase the words, so if we want to preserve capitalization, nltk is better than the various regex implementations we've built ourselves. However, the nltk implementation also includes a lot of punctuation marks, quotes, etc. that we might or might not be interested in. So we have to know the quirks of each method in order to know which one to use in different situations, depending on what we want the output to look like.
b) The resulting n-grams are in both cases generators. If you do not convert them to lists, you can use next.
We can always convert a generator object to a list with the unpacking operator * inside a list display, like this: generator_to_list = [*generatorObj] (or equivalently list(generatorObj)). However, as mentioned previously, this process empties the generator object, so we can't use it anymore afterwards (and thus we are also not able to use next).
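A small self-contained sketch of this behavior, using a simple generator in place of the ngrams object:

```python
gen = (x * x for x in range(5))

squares = [*gen]        # unpack the generator into a list
print(squares)          # [0, 1, 4, 9, 16]

# The generator is now empty; a second unpacking yields nothing:
print([*gen])           # []

# ...and next() needs a default value to avoid StopIteration:
print(next(gen, None))  # None
```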
c) Look at the following: list_of_strings = [" ".join(('alpha', 'beta', 'gamma'))].
Use such a method to convert the list of n-gram tuples to a list of strings in which the words are separated by spaces.
If you look at the code for 3 c), you can see that I first had to create a clean instance of the ngrams generator object. This is because I had already unpacked the generator object earlier. I then created an empty list for the ngrams, and used a for loop to iterate over the elements generated by the ngrams generator object, appending each one to the list. While doing so, I also converted each ngram from a tuple with 4 elements to a single string of 4 words using the " ".join() method.
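The explicit for loop can also be condensed into a single list comprehension. A sketch, with a hypothetical hard-coded list of 4-gram tuples standing in for the generator:

```python
# Hypothetical sample standing in for the ngrams generator:
four_grams = [('The', 'Fulton', 'County', 'Grand'),
              ('Fulton', 'County', 'Grand', 'Jury')]

# The explicit for-loop from the script, condensed to one line:
n_grams_list = [' '.join(gram) for gram in four_grams]
print(n_grams_list)  # ['The Fulton County Grand', 'Fulton County Grand Jury']
```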
Exercise 4: NLTK Stopwords
Create a function that removes stopwords from a given text.
# filename: stopwords.py
import re
from nltk import download
# download('stopwords')
from nltk.corpus import stopwords
# print(stopwords.words('english'))
test_text = '''
“You know," said Arthur, "it's at times like this,
when I'm trapped in a Vogon airlock with a man from Betelgeuse,
and about to die of asphyxiation in deep space
that I really wish I'd listened to what my mother told me when I was young."
"Why, what did she tell you?"
"I don't know, I didn't listen.”
'''
# Tokenize the text (without including punctuation marks, with lowercasing) and remove the stopwords
def remove_stopwords(text):
    tokens = re.sub(r'[\W_]+', ' ', text.casefold()).split() # tokenization
    clean_word_list = [word for word in tokens if word not in stopwords.words('english')] # remove stopwords
    return clean_word_list
cleaned = remove_stopwords(test_text)
print(f'Tokenized text without stopwords: {cleaned}')
print(f'Unique words in text (no stopwords): {sorted(set(cleaned))}')
$ python stopwords.py
Tokenized text without stopwords: ['know', 'said', 'arthur', 'times', 'like', 'trapped', 'vogon', 'airlock', 'man', 'betelgeuse', 'die', 'asphyxiation', 'deep', 'space', 'really', 'wish', 'listened', 'mother', 'told', 'young', 'tell', 'know', 'listen']
Unique words in text (no stopwords): ['airlock', 'arthur', 'asphyxiation', 'betelgeuse', 'deep', 'die', 'know', 'like', 'listen', 'listened', 'man', 'mother', 'really', 'said', 'space', 'tell', 'times', 'told', 'trapped', 'vogon', 'wish', 'young']
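One caveat about the implementation above: stopwords.words('english') is evaluated for every token inside the list comprehension, which re-reads the stopword list each time. Building a set once makes both the lookup and the loop faster. A sketch of the same function with that change (assuming the nltk stopwords corpus is downloaded):

```python
import re
from nltk.corpus import stopwords

def remove_stopwords(text):
    # Build the stopword set once: set membership tests are O(1),
    # and we avoid re-reading the stopword list for every token.
    stops = set(stopwords.words('english'))
    tokens = re.sub(r'[\W_]+', ' ', text.casefold()).split()
    return [word for word in tokens if word not in stops]
```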
Exercise 5: Matplotlib basics
Use the matplotlib and pandas libraries to draw a line chart of Bitcoin price data from bitcoin-prices.csv:
Date,Open,High,Low,Close
Mar-09-2021,52272.97,54824.12,51981.83,54824.12
Mar-08-2021,51174.12,52314.07,49506.05,52246.52
Mar-07-2021,48918.68,51384.37,48918.68,51206.69
Mar-06-2021,48899.23,49147.22,47257.53,48912.38
Mar-05-2021,48527.03,49396.43,46542.51,48927.30
Mar-04-2021,50522.31,51735.09,47656.93,48561.17
Mar-03-2021,48415.81,52535.14,48274.32,50538.24
Mar-02-2021,49612.11,50127.51,47228.85,48378.99
Mar-01-2021,45159.50,49784.02,45115.09,49631.24
One way to solve it is like this:
# Filename: plot_Bitcoin.py
# Import libraries
import matplotlib.pyplot as plt
import pandas as pd
# Create a Pandas DataFrame
df = pd.read_csv('bitcoin-prices.csv', parse_dates=True, index_col=0)
print(f"Pandas DataFrame: \n\n", df)
# Plot the DataFrame with pandas' DataFrame.plot() method (which uses Matplotlib under the hood)
df.plot(linestyle='--', marker='o')
# Display the plot
plt.savefig("bitcoin.png")
# plt.show() <--- for displaying the plot in PyCharm, Jupyter Notebook or other IDEs that support displaying graphs
$ python plot_Bitcoin.py
Pandas DataFrame:
Open High Low Close
Date
2021-03-09 52272.97 54824.12 51981.83 54824.12
2021-03-08 51174.12 52314.07 49506.05 52246.52
2021-03-07 48918.68 51384.37 48918.68 51206.69
2021-03-06 48899.23 49147.22 47257.53 48912.38
2021-03-05 48527.03 49396.43 46542.51 48927.30
2021-03-04 50522.31 51735.09 47656.93 48561.17
2021-03-03 48415.81 52535.14 48274.32 50538.24
2021-03-02 49612.11 50127.51 47228.85 48378.99
2021-03-01 45159.50 49784.02 45115.09 49631.24
When you open the new image file (bitcoin.png
) that was generated by your Python program, you should see something like this:
A line chart of Bitcoin price data, created using Matplotlib and Pandas
Feel free to play around with the Matplotlib library. For example, you can experiment with the parameters color, linestyle, linewidth, marker and markersize to customize the graph.
You can also check out the matplotlib documentation in its entirety here!