Posts Lab Session 11
Post
Cancel

Lab Session 11

Welcome to Lab Session 11: Working with XSLT (and more regular expressions)

Today there are two exercises about XSLT. When you finish these, there is also an exercise about extracting email addresses from a text with regular expressions. Koenraad will be going through regular expressions more extensively over the next few weeks as part of the lecture notes on formal languages.



What is XSL and XSLT, and why are you learning about it?

XSL is an XML-based programming language for processing style sheets. XSL has a similar syntax to XML, and it’s basically a file, just like with CSS, that describes how to display an XML document.

XSLT is a transformation language for XML documents that builds upon XSL. XSLT is used as a general purpose XML processing language, which means that it has many other purposes other than processing XSL. For example, we can use XSLT to generate HTML web pages from XML data (often referred to as an “XML template” or “XML stylesheet”).

These tools are useful to language researchers because, as opposed to when using a grep or the Python re module, we can exploit the hierarchical structure of XML and HTML. These markup languages are two very common data formats, and you will certainly have to process these forms of data when you are working with language data from the web. So instead of having to read and sort through the entire document with one of the other processing techniques we have learned so far, we can use XSLT and XSL to match terms by XML or HTML tags (or “elements”). This will save us a lot of headache, because we won’t have to write very complicated regular expressions or other logic in order to get what we want.



Exercise 1: Filtering XML with XSLT

a) Edit the XSL file from the lecture notes to change the layout of the web page.

Tip: In some browsers, you may need to empty the cache before the changes will be visible. You can probably just search for “cache” in the settings for your browser, then select “delete cache”.

b) Edit the CSS file to change some visual aspects such as colors, fonts, etc.

Exercise 2: An XSLT application for OCR

a) Change the XSL file to enable more classes.

b) Define the new classes in the CSS file. Try to set different visual characteristics. Consider not only different colors, but also a different background-color.

Tip: You can use this website to obtain the hex codes for different colours. Hex codes (a.k.a “Web colors”) are accepted by CSS, and makes it easier to tweak the colours we want.


Exercise 3: Regular expressions and email addresses

Write a Python or shell script to extract all UiB email adresses from an input file.
UiB email adresses must follow the format local-part@domain, where local-part follows the syntax as noted in this article and domain is either uib.no or student.uib.no.
Here is some dummy text you can copy in order to test your script:

<p class="field field-name-field-uib-lead field-type-text-long field-label-hidden field-wrapper">
      <a href="mailto:Guowen.Shang@uib.no">Guowen Shang</a>, Associate Professor of Chinese language,
      is currently studying sign languages – not languages
      with signs, but on signs. “The language signs are everywhere in our environment, but the language choices on
      the signs are determined by a range of social, economic, psychological and political factors”. This inquiry is
      in accord with his academic commitment to the language policy and planning issues, especially in China.</p>

---------

Email addresses:
> myemail@maildrop.cc
> james-gordon93@testmail.ninja
> joe_mama@student.uib.no
> big_shot_prof@uib.no
> xcv921@uib.no
> intricatelongaddress@gmail.com
> not-a-valid-email:adress@uib.no
> another..not.valid@address@fakemail.com
> 900-a-valid-address(a_comment)@outlook.com

Sit et maxime debitis nihil. Exercitationem omnis a velit in necessitatibus.
Feel free to [shoot me an email](sebastian.rokholt@student.uib.no) if you need help with anything.
Id odio sit at. Et laborum aut facere distinctio. Nemo et dolorum quisquam esse quidem.



This post is licensed under CC BY 4.0 by the author.