Today, we’re going to finish up the curriculum tied to the section on Python programming in the course notes. Then we’re going to start with some XML processing using XSLT.
PS: No labs will be held during the Easter, i.e week 13. There will be one more round of labs after Easter in order to finish the programming curriculum.
Part 1: Wrapping up Python with a note on recursion
In order to understand the lecture notes on the Levenshtein function we first need to understand recursion, which is an important concept in computer programming.
Recursion is the process in which a function calls itself. That may sound confusing, but it’s not that hard to grasp once you see it in action. Let me give you an example.
If you’ve done any high school or college-level mathematics, you’ve probably heard of the factorial. A factorial \(n!\) is the product (the sum of multiplicating) all positive integers less than or equal to \(n\). For example, the factorial \(5!\) is equal to \(5 * 4 * 3 * 2 * 1 = 120\). The factorial of 1 is 1.
You may have noticed that \(5! = 5 * 4!\), and that \(4! = 4 * 3!\), and so on until you get to \(1!\) which is just 1. The formula for calculating a factorial can therefore be stated as:
\[n! = n * (n - 1)!\]Here, the factorial n!
is defined by itself, which means that this formula uses recursion. However, the formula doesn’t really make sense unless we know that \(1! = 1\), or else it would just continue into the negatives and go on forever.
We can define our own factorial function recursively in Python like this:
1
2
3
4
5
6
7
8
9
def factorial_recursive(n):
if n == 1: # base case
return 1
else: # recursive case
return n * factorial_recursive(n - 1)
print(factorial_recursive(5))
Executing the code above will output 120
, which is the correct answer for \(5!\). It’s important to understand that when we call factorial(n)
we will first have to run factorial(n - 1)
before being able to return n * factorial_recursive(n - 1)
. This applies for all numbers, except for 1, which is the base case (A.K.A the “terminating condition”). factorial_recursive(5)
will therefore have to wait for factorial_recursive(4)
to finish before outputting the answer. Each time a recursive call results in an additional function call, the first function will have to wait for the “newest” function to finish executing in order to know what to return. The structure of how older recursive function calls have to wait for the newer function calls to finish executing is often referred to the “call stack”. A stack is a data structure in programming, but it is intuitively the same as a stack of books. If you want to access the bottom book, you need to remove the books on top first. Here’s a visualization of the stack for our factorial function. Notice how each function call (except for the last one) has to wait before knowing what to return.
An example of recursion with the factorial function
Okay, so I hope that recursion makes a little more sense to you. You can always look up many more examples online, like this one or this one. Now, let’s move on to the Levenshtein function from the lecture notes:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
def levenshtein_dist(string, target):
if not string:
return len(target)
if not target:
return len(string)
if string[0] == target[0]:
return levenshtein_dist(string[1:], target[1:])
lins = levenshtein_dist(string, target[1:]) + 1
lsub = levenshtein_dist(string[1:], target[1:]) + 1
ldel = levenshtein_dist(string[1:], target) + 1
return min(lins, lsub, ldel)
print(levenshtein_dist("fine", "fined"))
print(levenshtein_dist("aloud", "allowed"))
The program above lets us know that the Levenshtein distance between the words “fine” and “fined” is 1, which makes sense because you only need to append a single ‘d’ into the last position of the word “fine” to get “fined”. On the other hand, the Levenshtein distance between “aloud” and “allowed” is three, because there are three steps needed to transform “aloud” to “allowed”:
- insert a second “l” between the “l” and the “o”:
alloud
- replace the “u” in the second to last index with a “w”:
allowd
- Insert an “e” between the “w” and the “d”:
allowed
So how does the code work? Well, it loops over the input string recursively and compares different parts of the input string to different parts of the target string to find out whether it would be easiest to insert, substitute or delete a character in order to get closer to the target string. Let’s go through the code and see what is going on. We’ll start with the base case:
1
2
3
4
5
6
def levenshtein_dist(string, target):
# Recursion terminating conditions
if not string:
return len(target)
if not target:
return len(string)
There are two terminating conditions in the base case:
if not string
says that if the input string is empty, then the Levenshtein distance between the input string and the target string is simply the length of the target string.if not target
is the same asif not string
, only for an empty target string instead. The number of insertions needed to match an empty string is always equal to the length of the non-empty string.
Then there’s the recursive case, which has two parts. The first part states that if the initial character of both strings are the same, we can move right along and check the rest of the string.
1
2
if string[0] == target[0]:
return levenshtein_dist(string[1:], target[1:])
In other words, if we run levenshtein_dist("pie", "pier")
, this part of the code will be executed exactly three times, as there are three characters in the same position in both words.
Finally, there’s the part where the function keeps track of the “cost” of the different operations to calculate the Levenshtein distance. This is more similar to the example of the recursive factorial function at the beginning, except that we’re only adding numbers together instead of multiplying.
1
2
3
4
5
lins = levenshtein_dist(string, target[1:]) + 1
lsub = levenshtein_dist(string[1:], target[1:]) + 1
ldel = levenshtein_dist(string[1:], target) + 1
return min(lins, lsub, ldel)
It is important to note that while the function always returns the minimum value of the cost for insertion, deletion and substitution, when we check multiple characters in a row we can still use different operations in succession. The point is that in the end, we’ll have a series of numbers (only 1’s in this case) we need to add together. The sum we have at the end is the final output to the terminal.
If you find this hard to understand, don’t worry. Everyone struggles with recursion the first time they learn about it. Let me know if there’s anything in particular you don’t understand! And remember, the important part about recursion is figuring out what the base case/terminating condition is. If a recursive function doesn’t have a terminating condition, well… It’ll just run forever! (or at least until the Python maximum recursion depth limit is exceeded).
Exercise 1.1: Minimum edit distance (Levenshtein)
The Levenshtein function uses the same cost of 1 for all edits. One could argue that a substitution is a combination of a deletion and an insertion, so it should be more costly. Change the program so that a substitution has a cost of 2, and test for different strings.
Exercise 1.2: Recursive functions
a) Write a Python function that recursively reverses an input string.
b) Create a function that recursively counts the number of vowels in a string.
Part 2: Introduction to XSLT
Exercise 2.1: Filtering XML with XSLT
Look at the examples in the lecture notes.
a) Try formatting words and tags in other ways.
b) Try to reconstruct the plain text.