Content from Virtual Environments


Last updated on 2025-09-18 | Edit this page

Overview

Questions

  • What is a Virtual Environment? / Why use a Virtual Environment?
  • How do I create a Virtual Environment?

Objectives

  • Create a new virtual environment using uv
  • Push our new project to a GitHub repository.

What is a Virtual Environment?


A virtual environment is an isolated workspace where you can install python packages and run python code without worrying about affecting the tools, executables, and packages installed in either the global python environment or in other projects.

Callout

What is the difference between a “package manager” and a “virtual environment”?

A package manager helps automate the process of installing, upgrading, and removing software packages. Each package is usually built on top of several other packages, and relies on the methods and objects they provide. However, as projects are upgraded and changed over time, those available methods and objects can change. A package manager solves the complex “dependency web” created by the packages you would like to install by finding versions of all required packages that are mutually compatible.

Why Would I use a Virtual Environment?


If you are only ever working on your own projects, or on scripts for a single project, it’s absolutely fine to never worry about virtual environments. But as soon as you start creating new projects or working on code written by other people, it becomes incredibly important to know that your code is running against the exact same versions of the libraries it was written for.

In the past, it was notoriously difficult to manage environments with python:

XKCD - “Python Environment”

There have been a number of attempts to create a “one size fits all” approach to virtual environments and dependency management:

  • venv
  • virtualenv
  • conda
  • pipenv
  • pyenv
  • poetry

We’re going to use uv for the purposes of this workshop. uv is another tool that promises to cover both environment and dependency management; however, there are a few key elements that set it apart:

  1. It is written in Rust, which gives it a significant speed improvement over pip and conda.
  2. It works with the pyproject.toml and uv.lock files, which allow for human and computer readable project files.
  3. It can install and manage its own python versions.
  4. It works as a drop-in replacement for pip, eliminating the need to learn new commands.

Creating a project with UV


Before the workshop, you should have had a chance to install and check that your python and uv executables were working. If you have not yet had a chance to do this, please refer to the setup page for this workshop.

We’re going to start with a totally blank project, so let’s create a directory called “textanalysis-tool”. Navigate to this directory in your command line. Let’s quickly make sure we have UV installed and working by typing uv --version. You should see something like the following (the exact version number might be different):

Checking that uv is installed by running "uv --version"
Checking that UV is installed

We can start off with a new project with UV by running the command uv init. This will automatically create a couple files for us:

Files created by running "uv init"
Files created by “uv init”

We can see that there are a few files created by this command:

  • .python-version: This file is used to optionally specify the Python version for the project.
  • main.py: This is the main Python script for the project.
  • pyproject.toml: This file is used to manage project dependencies and settings.
  • README.md: This file contains human-written information about the project.
  • .gitignore: (Depending on your version of uv) This file specifies files and directories that should be ignored by git.

If we take a look at the pyproject.toml file, we can see that it contains some basic information about our project in a fairly readable format:

TOML

[project]
name = "textanalysis-tool-{my-name}"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.13"
dependencies = []

The requires-python field may vary depending on the exact version of python you’re working with.

Make sure to change {my-name} in the name field to something unique, such as your GitHub username. This is important later when we upload our package to TestPyPI, as package names must be unique.

Creating a Virtual Environment


To create a virtual environment with UV, we can use the uv venv command. This will create a new virtual environment in a directory called .venv within our project folder.

BASH

uv venv

Before we activate our environment, let’s quickly check the location of the current python executable by starting a python interpreter and running the following commands:

Callout

Depending on your operating system, you may need to type python3 instead of python to start the interpreter.

PYTHON

import sys
sys.executable

You can type exit to leave the python interpreter.

You should see the path to the location of the python executable on your machine. Now let’s activate our environment. The exact command will depend on your operating system, but if you look above the python code to the output of the uv venv command, you should see the correct command.

BASH

source .venv/bin/activate

If this command works properly, you should see some text in parentheses before your prompt:

(textanalysis-tool) D:\Documents\Projects\textanalysis-tool>

Let’s start up the python interpreter again and check the location of our executable:

PYTHON

import sys
sys.executable

What you should now see is that the executable is located in the .venv/Scripts directory of our project (.venv/bin on Mac/Linux):

(textanalysis-tool) D:\Documents\Projects\textanalysis-tool>python
Python 3.13.7 (tags/v3.13.7:bcee1c3, Aug 14 2025, 14:15:11) [MSC v.1944 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.executable
'D:\\Documents\\Projects\\textanalysis-tool\\.venv\\Scripts\\python.exe'

Exit out of the interpreter and deactivate the virtual environment with deactivate.

Git Commit and Pushing to our Repository


Callout

Some versions of uv will automatically create a .gitignore file when you run uv init. If you don’t see one in your project folder, you can create one manually.

We want a file called .gitignore to control which files are added to our git repository. It’s generally a good idea to create this file early on, and to update it whenever you notice files or folders you want to explicitly prevent from being added to the repository.

We can create a gitignore from the command line with type nul > .gitignore (Windows), or touch .gitignore (Mac/Linux). There are several pre-written gitignores that we can optionally use, but for this project we’ll maintain our own. Open up the file and add the following lines to it:

__pycache__/
dist/
*.egg-info/
scratch/

A commonly used gitignore is the Python.gitignore maintained by GitHub. You can find it here.

Next, let’s set up a repository on GitHub to store our code. We’ll make an entirely blank repository, with the same name as our project: “textanalysis-tool”.

The Github interface for creating a new repository
Creating a new repository
Callout

We’re creating the files on our local machine first, then the remote repository. There’s no reason you can’t go the other way around, creating the remote repository then cloning it to your local machine.

First, we’ll initialize a git repository locally, making an initial commit with the files that uv generated:

BASH

git init
git add .gitignore .python-version README.md main.py pyproject.toml
git commit -m "Initial commit"

Then we’ll follow the directions for creating a new repository:

BASH

git remote add origin https://github.com/{username}/textanalysis-tool.git
git branch -M main
git push -u origin main
Callout

We are using https for our remote URL, but you can also use ssh if you have that set up.

If all goes well, we’ll see our code appear in the new repository:

The Github interface showing the initial commit
New repository with initial commit

And with that, we’re ready to start writing our tool!

Challenge

Challenge 1: Adding a Package Dependency

Now that we have our project set up, let’s add a project dependency. Later on in the workshop, we’ll be parsing HTML documents, so let’s add the beautifulsoup4 package to our project.

Try the following command

uv add beautifulsoup4

Take a look at the pyproject.toml and uv.lock files. What changed? What is the purpose of each file?

What is the difference between the command uv add beautifulsoup4 and uv pip install beautifulsoup4?

The pyproject.toml file is a human readable file that contains the list of packages that our project depends on. The uv.lock file is a machine readable file that contains the exact versions of all packages that were installed, including any dependencies of the packages we explicitly installed.

uv add will add the package to the pyproject.toml file, and install the package into our virtual environment. uv pip install will install the package into our virtual environment, but will not add it to the pyproject.toml file.

Key Points
  • Setting up a virtual environment is useful for managing project dependencies.
  • Using uv simplifies the process of creating and managing virtual environments.
  • There are several options other than uv for managing virtual environments, such as venv and conda.
  • It’s important to version control your project from the start, including a .gitignore file.

Content from Creating A Module


Last updated on 2025-09-28 | Edit this page

Overview

Questions

  • How do I create a Python module?
  • How do I import a local module into my code?

Objectives

  • Create a Python module with multiple files
  • Import functions from a local module into a script

Project Organization


In order to keep our project organized, we’ll start by creating some directories to put our code in. So that we can keep the “source” code of our project separate from other aspects, we’ll create a directory called “src”. In this directory, we’ll create a second directory with the name of our module, in this case “textanalysis_tool”. We can also delete the “main.py” file that was generated automatically by uv. Your project folder should now look like this:

textanalysis-tool/
├── src/
│   └── textanalysis_tool/
├── .gitignore
├── .python-version
├── pyproject.toml
└── README.md
Callout

Note that the interior folder has an underscore instead of a hyphen. We will import the module using the name of the interior folder, textanalysis_tool. This is important as hyphens are not a valid character in Python module names.
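We can check this directly in python: the built-in str.isidentifier method tells us whether a name is a valid Python identifier:

```python
# Hyphens are not valid in Python identifiers, which is why the importable
# folder is named textanalysis_tool rather than textanalysis-tool
print("textanalysis_tool".isidentifier())  # True
print("textanalysis-tool".isidentifier())  # False
```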

Next, we’ll create a file called __init__.py in the src/textanalysis_tool directory. This file will make Python treat the directory as a package. Next to the __init__.py file, we can create other Python files that will contain the code for our module.

Callout

The __init__.py file is a special filename in python that indicates that the directory should be treated as a package. Often these files are simply blank; however, we can also include some additional code to initialize the package or set up any necessary imports, as we will see later.

Let’s create a code file now, called say_hello.py and put a simple function in it:

PYTHON


def hello(name: str = "User"):
    return f"Hello, {name}!"

Our project folder should now look like this:

textanalysis-tool/
├── src/
│   └── textanalysis_tool/
│       ├── __init__.py
│       └── say_hello.py
├── .gitignore
├── .python-version
├── pyproject.toml
└── README.md

Previewing Our Module


It’s all well and good to write some code in here, but how can we actually use it? Let’s create a python script to test our module.

Let’s create a directory called “tests”, and start a new file called test_say_hello.py in it.

Add the following code to it:

PYTHON

from textanalysis_tool.say_hello import hello

result = hello("My Name")

if result == "Hello, My Name!":
    print("Test passed!")
else:
    print("Test failed!")
Callout

One of the really nice things about using uv is that we can replace python in our commands with uv run and it will use the environment we have created for the project to run the code. At the moment, this doesn’t make a difference, but once we start adding dependencies to our project we’ll see how useful this is.

Let’s run the script from our command line. If you’re in the root directory of the project, your command will look something like uv run tests/test_say_hello.py.

Aaaand… It doesn’t work!

ERROR

D:\Documents\Projects\textanalysis-tool>uv run tests/test_say_hello.py
Traceback (most recent call last):
  File "D:\Documents\Projects\textanalysis-tool\tests\test_say_hello.py", line 1, in <module>
    from textanalysis_tool.say_hello import hello
ModuleNotFoundError: No module named 'textanalysis_tool'

The reason for this is that we never actually told python where it can find our code!

The Python PATH


When you run a command like import pandas, what python actually does is search a series of directories in order, looking for a module or package named pandas. We can see which directories will be checked by printing the sys.path variable.

PYTHON

import sys
print(sys.path)

We are only interested in checking our current code, not in installing it as a package. However, because we have the __init__.py file in our package directory, if we add the absolute or relative path of the directory containing our package (here, src) to sys.path, python will look there for modules as well.

Let’s add a quick line to the top of our testing script to add our specific module directory to the path:

PYTHON

import sys
sys.path.insert(0, "./src")

from textanalysis_tool.say_hello import hello

result = hello("My Name")

if result == "Hello, My Name!":
    print("Test passed!")
else:
    print("Test failed!")

This time it works! And if we modify the hello function to print out something slightly different, we can see that running the script changes the output right away!

Callout

We used sys.path.insert instead of sys.path.append because we want to give our module directory the highest priority when searching for modules. This way, if there are any naming conflicts with other installed packages, our local module will be found first.

Ideally you would never be working with a package name that has a conflict elsewhere in the path directories, but just in case, this avoids some potential issues.

Dot Notation in Imports


You probably noticed that our function call mimics the file and directory structure of the project. We have the project directory (textanalysis_tool), then the filename (say_hello), and finally the function name (hello). Python treats all of these similarly when trying to locate a function or module. We can also modify our import to make the code a little neater:

PYTHON

import sys
sys.path.insert(0, "./src")

from textanalysis_tool import say_hello

result = say_hello.hello("My Name")

if result == "Hello, My Name!":
    print("Test passed!")
else:
    print("Test failed!")

or, more concisely:

PYTHON

import sys
sys.path.insert(0, "./src")

from textanalysis_tool.say_hello import hello

result = hello("My Name")

if result == "Hello, My Name!":
    print("Test passed!")
else:
    print("Test failed!")
Callout

For simplicity here, we are using the from ... import ... syntax to import only the function we need from the module. This way we don’t have to include the module name every time we call the function.

This is a common practice in python, however the absolute best practice would be to import the entire module and then call the function with the module name, as in the first example. This way we avoid potential naming conflicts with functions from other modules, as well as providing clarity about where the function is coming from when reading the code.

The __init__.py File


We can see that in order to import our function, we have to include both the name of the module and the name of the file before specifying the name of the function. Sometimes this can get tedious, especially if there is a directory in our project with lots of files and different functions in each file. This is where we can simplify things a little by adding a tiny bit of code to our __init__.py file:

PYTHON

from .say_hello import hello

We can run our testing script just the same way as before and it will still work, but we can also now leave out the .say_hello part:

PYTHON

from textanalysis_tool import hello

result = hello("My Name")

if result == "Hello, My Name!":
    print("Test passed!")
else:
    print("Test failed!")

Git Add / Commit


Before we forget, now that we have some simple code up and running, let’s add it to our git repository.

Challenge

Challenge 1: Add a Sub Module

Create a directory under src/textanalysis_tool called greetings and add a file called greet.py with a function called greet that prints out the following:

Hello, {User}!
It's nice to meet you!

How do you call this function from your testing script?

PYTHON

from textanalysis_tool.greetings.greet import greet

greet("My Name")

You could also include this line in our existing __init__.py file:

PYTHON

from .greetings.greet import greet

or add another __init__.py inside the greetings directory and include the following line:

PYTHON

from .greet import greet
Challenge

Challenge 2: Inter-Module imports

Building on the last challenge, the first line of our greeting is identical to the output from our hello function. How can we avoid code duplication by calling the hello function from within our greet function?

In our greetings.py file:

PYTHON

from textanalysis_tool import hello

def greet(name):
    return hello(name) + "\nIt's nice to meet you!"

Why are you bothering to write out the entire module path in greetings.py? Can’t you just do this:

PYTHON

from ..say_hello import hello

def greet(name):
    return hello(name) + "\nIt's nice to meet you!"

Yes, you absolutely can! The dot-notation used in python imports can use .. to refer to the parent package, or even ... to refer to a grandparent package.

The reason we’re not doing this here is for clarity, as recommended in PEP8.

Key Points
  • Python modules are simply directories with an __init__.py file in them
  • You can add the path to your module directory to sys.path to make it available for import
  • You can use dot notation in your imports to specify the module, file, and function you want to use

Content from Class Objects


Last updated on 2025-09-28 | Edit this page

Overview

Questions

  • What is a class object?
  • How can I define a class object in Python?
  • How can I use a class object in my module?

Objectives

  • Create a Class object in our module.
  • Demonstrate how to use our Class object in a sample script.

What is a Class Object?


You can think of a class object as a kind of “blueprint” for an object. It defines what properties the object can have, and what methods it can perform. Once a class is defined, you can create any number of objects based on that class, each of which is referred to as an “instance” of that class.

As an example, let’s imagine a Car. A Car has many properties and can do many things, but for our purposes, let’s limit them slightly. Our Car will have a make, model, year, and color, and it will be able to honk, and be painted.

The make, model, year, and color are all “properties” of the car. Honking a horn and being painted are both “methods” of the car. Here’s a diagram of our car object:

Car Class object example

In python we can define a class object like this:

PYTHON

class Car:
    def __init__(self, make: str, model: str, year: int, color: str = "grey", fuel: str = "gasoline"):
        self.make = make
        self.model = model
        self.year = year
        self.color = color
        self.fuel = fuel

    def honk(self) -> str:
        return "beep"

    def paint(self, new_color: str) -> None:
        self.color = new_color

    def noise(self, speed: int) -> str:
        if speed <= 10:
            return "putt putt"
        else:
            return "vrooom"
Callout

The convention in python is that all classes should be named in CamelCase (CapWords), with no underscores. The interpreter does not enforce this, but it is good practice to follow the standards of the python community.

Some of this might look familiar if you think about how we define functions in Python. There’s a def keyword, followed by the function name and parentheses. Inside the parentheses, we can define parameters, and these parameters can contain default values. We can also include type hints, for both parameters and return values. However all of this is indented one level, underneath the class keyword, which is followed by our class name.

Note that this is just our blueprint - it doesn’t refer to any specific car, just the general idea of a car. Also note the __init__ method. This is a special method which is called whenever you “instantiate” a new object. Its parameters are supplied when we first create an object and behave just like ordinary function parameters: a parameter with no default value is required in order to create the object, while a parameter with a default value is optional.

An instance of a car, in this case called “my_car” might look something like this:

Car Instance example
Callout

What exactly is “an instance”?

An instance is how we refer to a specific object that has been created from a class. The class is the “blueprint”, while the instance is the actual object that is created based on that blueprint.

In our example, my_car is an instance of the Car class. It has its own specific values for the properties defined in the class (make, model, year, color), and it can use the methods defined in the class (honk, paint).

Also note that each of the methods within the class object definition starts with a “self” argument. This is a reference to the current instance of the class, and is used to access variables that belong to the class. In our example, we store the make, model, year, color, and fuel as properties of the class. When we call the paint method, we use self.color to refer to the current instance’s color property.
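To see the blueprint in action, here is how we might instantiate the class and call its methods, using a trimmed-down copy of the Car class with purely illustrative values:

PYTHON

```python
class Car:
    def __init__(self, make: str, model: str, year: int, color: str = "grey", fuel: str = "gasoline"):
        self.make = make
        self.model = model
        self.year = year
        self.color = color
        self.fuel = fuel

    def honk(self) -> str:
        return "beep"

    def paint(self, new_color: str) -> None:
        self.color = new_color

# Create an instance: the required parameters must be supplied,
# while color and fuel fall back to their defaults
my_car = Car(make="Toyota", model="Corolla", year=2020, color="red")

print(my_car.color)   # red
print(my_car.fuel)    # gasoline
print(my_car.honk())  # beep

my_car.paint("blue")  # methods can modify the instance's properties
print(my_car.color)   # blue
```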

Callout

The __init__ method is called a “dunder” (double underscore) method in python. There are a number of other dunder methods that we can define, that will interact with various built-in functions and operators. For example, we can define a __str__ method, that will allow us to specify how our object should be represented as a string when we call str() on it. Likewise, we can define __eq__, which would tell python how to behave when we compare two objects for equality.
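As a minimal illustration (separate from our project code), here is a simplified Car class sketching both __str__ and __eq__:

PYTHON

```python
class Car:
    def __init__(self, make: str, model: str, year: int):
        self.make = make
        self.model = model
        self.year = year

    def __str__(self) -> str:
        # str(car) and print(car) will use this representation
        return f"{self.year} {self.make} {self.model}"

    def __eq__(self, other) -> bool:
        # Two cars compare equal if all their properties match
        if not isinstance(other, Car):
            return NotImplemented
        return (self.make, self.model, self.year) == (other.make, other.model, other.year)

a = Car("Toyota", "Corolla", 2020)
b = Car("Toyota", "Corolla", 2020)
print(str(a))   # 2020 Toyota Corolla
print(a == b)   # True
```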

A Class Object for Our Project


Let’s create a class object for our text analysis project. We’re going to be downloading some books from Project Gutenberg. To make things easy to begin with, we’ll limit ourselves to just the .txt files.

Since we’re going to create some useful objects and methods for working with documents, let’s define a Document class.

Discussion

Take a look at an example txt document from Project Gutenberg: Meditations, by Marcus Aurelius

What properties and methods might we want to include in our Document class?

It looks like there’s a standard metadata section in these documents, with a Title, Author, Release Date, Language, and Credits. Those will probably be useful metadata. Looking at the url for this file, it also looks like there’s an ID on Project Gutenberg.

For methods, we’ll need to be able to read the document from a file. And for some simple methods, let’s count the number of lines in a document, and another method to get the number of times a particular word appears.

Let’s start writing our class object in a new file: src/textanalysis_tool/document.py:

PYTHON


class Document:
    def __init__(self, filepath: str, title: str, author: str = "", id: int = 0):
        self.filepath = filepath
        self.title = title
        self.author = author
        self.id = id
        self.content = self.read(self.filepath)

    def read(self, filepath: str) -> str:
        with open(filepath, 'r', encoding='utf-8') as file:
            return file.read()

    def get_line_count(self) -> int:
        return len(self.content.splitlines())

    def get_word_occurrence(self, word: str) -> int:
        return self.content.lower().count(word.lower())

Our Document class is a “blueprint” for document objects. When we create an instance, we provide the class with a filepath, a title, an author, and an id. Only the filepath and the title are required, while the author and id are optional.

The __init__ method is called as soon as the object is created, and we can see that in addition to storing the parameters to their self counterparts, there is an additional property called self.content. This property is used to store the entire text content of the document. We obtain this by calling the self.read method, which reads the content from the specified file.

Callout

Principle of Least Astonishment (or, We’re All Adults Here)

Unlike other programming languages, python doesn’t have the concept of “private” or “internal” variables and methods. Instead there is a convention which says that any variable or method that is intended for internal use should be prefixed with an underscore (e.g. _content). This is however just a convention - there is nothing stopping you from accessing these variables and methods from outside the class if you really want to.
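A quick illustration of the convention, using a hypothetical minimal class:

PYTHON

```python
class Document:
    def __init__(self):
        # The leading underscore signals "intended for internal use only"
        self._content = "some internal text"

doc = Document()
# ...but nothing actually prevents access from outside the class
print(doc._content)  # some internal text
```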

There are also two methods that we’ve defined - get_line_count and get_word_occurrence. Neither of these will be called directly on the class itself, but rather on instances of the class that we create (as indicated by the use of self within the class methods). Note that these methods make use of the self.content property - this is a variable that is not defined within the method, so you may expect it to be out of scope. However the self keyword refers to the specific instance of the class itself, and so it has access to all of its properties and methods, including the self.content property.
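To see exactly what these string methods are doing, we can try them on a plain sample string. One subtlety worth noting: str.count matches substrings, so searching for "test" will also count occurrences inside longer words like "testing":

PYTHON

```python
content = "This is a test document. It contains words.\nIt is only a test document."

# splitlines() splits the text on newline characters
print(len(content.splitlines()))  # 2

# count() counts non-overlapping substring matches
print(content.lower().count("test"))  # 2

# Because count() matches substrings, "test" also matches inside "testing"
print("testing tests".lower().count("test"))  # 2
```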

Trying out Our Class Object


Let’s try out our new class object. Create a file in our “tests” directory called example_file.txt and add some text to it:

This is a test document. It contains words.
It is only a test document.

Next, let’s create another test file. Our last one was called test_say_hello.py, so let’s call this test_document.py:

PYTHON

import sys

sys.path.insert(0, "./src")

from textanalysis_tool.document import Document

total_tests = 3
passed_tests = 0
failed_tests = 0

# Check that we can create a Document object
doc = Document(filepath="tests/example_file.txt", title="Test Document")
if doc.title == "Test Document" and doc.filepath == "tests/example_file.txt":
    passed_tests += 1
else:
    failed_tests += 1

# Test the methods
if doc.get_line_count() == 2:
    passed_tests += 1
else:
    failed_tests += 1

if doc.get_word_occurrence("test") == 2:
    passed_tests += 1
else:
    failed_tests += 1

print(f"Total tests: {total_tests}")
print(f"Passed tests: {passed_tests}")
print(f"Failed tests: {failed_tests}")

Now we’ll run this file using our uv environment:

BASH

uv run tests/test_document.py

You should see the output:

Total tests: 3
Passed tests: 3
Failed tests: 0
Challenge

Challenge 1: Identify the mistake

The following code is supposed to define a Bird class that inherits from the Animal class and overrides the whoami method to provide a specialized message. However, there is a mistake in the code that prevents it from working as intended. Can you identify and fix the mistake?

PYTHON

class Animal:
    def __init__(self, name: str):
        print(f"Creating an animal named {name}")
        self.name = name

class Bird(Animal):
    def whoami() -> str:
        return f"I am a bird. My name is irrelevant."

When we try to run the code we get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 14
     11         return f"I am a bird. My name is irrelevant."
     13 animal = Bird("boo")
---> 14 animal.whoami()

TypeError: Bird.whoami() takes 0 positional arguments but 1 was given

We have forgotten to include the self parameter in the whoami method of the Bird class. The self parameter is required for instance methods in Python, as it refers to the instance of the class. Without it, the method cannot access instance properties or methods.
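With the missing self added as the first parameter, the example behaves as intended:

PYTHON

```python
class Animal:
    def __init__(self, name: str):
        print(f"Creating an animal named {name}")
        self.name = name

class Bird(Animal):
    def whoami(self) -> str:  # self added as the first parameter
        return "I am a bird. My name is irrelevant."

animal = Bird("boo")    # prints: Creating an animal named boo
print(animal.whoami())  # I am a bird. My name is irrelevant.
```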

Challenge

Challenge 2: The str Method

In our examples so far, we define an __init__ method for our class objects. This is a special kind of method called a “dunder” (double underscore) method. There are a number of other dunder methods that we can define, that will interact with various built-in functions and operators.

Try to define a __str__ method for the following class object that will create the following output when run:

PYTHON

class Animal:
    def __init__(self, name: str):
        print(f"Creating an animal named {name}")
        self.name = name

    # Your __str__ method here

animal = Animal("Moose")
print(str(animal))

output:

Creating an animal named Moose
I am an animal named Moose.

PYTHON

class Animal:
    def __init__(self, name: str):
        print(f"Creating an animal named {name}")
        self.name = name

    def __str__(self) -> str:
        return f"I am an animal named {self.name}."
Challenge

Challenge 3: Static Methods

In addition to instance methods, which operate on an instance of a class (and so have self as the first parameter), we can also define static methods. These are methods that don’t operate on an instance of the class, and so don’t have self as the first parameter. Instead, they are defined using the @staticmethod decorator.

Create a static method called is_animal that takes a single parameter, obj, and returns True if obj is an instance of the Animal class, and False otherwise.

PYTHON

class Animal:
    def __init__(self, name: str):
        print(f"Creating an animal named {name}")
        self.name = name

    # Your static method here

A decorator is a special kind of function that modifies the behavior of another function. They are defined using the @ symbol, followed by the name of the decorator function. In this case, we use the @staticmethod decorator to indicate that the following method is a static method, so this line must be placed directly above the method definition.

You can use the python built-in function isinstance to check if an object is an instance of a class. (https://docs.python.org/3/library/functions.html#isinstance)

PYTHON

class Animal:
    def __init__(self, name: str):
        print(f"Creating an animal named {name}")
        self.name = name

    @staticmethod
    def is_animal(obj: object) -> bool:
        return isinstance(obj, Animal)
Challenge

Challenge 4: Testing Out Our Class on Real Data

Let’s download a real text file from Project Gutenberg and see how our class object handles it. You can pick any file you like, or you can use the same one we looked at earlier: Meditations, by Marcus Aurelius.

Modify your test_document.py file to create a new Document object using the real text file, and then test out the get_line_count and get_word_occurrence methods on it. What do you get? What issues might there be when we start using this class on our actual data?

How can we improve our class to handle these issues?

The metadata is separated from the main text by marker lines: the content always begins after a line that says *** START OF THE PROJECT GUTENBERG EBOOK {the title of the book} *** and ends before the line *** END OF THE PROJECT GUTENBERG EBOOK {the title of the book} ***.

We need some way to extract all of the content between these two markers…

There is a python module called re that allows us to work with regular expressions. This can be used to match specific patterns in text, or for extracting specific parts of a string. You can use the following regex pattern to match the content between the start and end markers:

PYTHON

pattern = r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*(.*?)\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*"

match = re.search(pattern, raw_text, re.DOTALL)
if match:
    content = match.group(1).strip()
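To see the pattern at work outside the class, we can try it against a miniature stand-in for a Project Gutenberg file (the real files are much longer; the filler text here is made up):

```python
import re

pattern = (
    r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*"
    r"(.*?)"
    r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*"
)

raw_text = (
    "Title: Test\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK TEST ***\n"
    "The actual book text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK TEST ***\n"
    "License text\n"
)

# re.DOTALL lets .*? match across newlines; group(1) is the captured middle.
match = re.search(pattern, raw_text, re.DOTALL)
if match:
    content = match.group(1).strip()
    print(content)  # prints "The actual book text."
```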

The Project Gutenberg text files have a lot of metadata at the start and end of the file, which will affect the line count and word occurrence counts. We might want to modify our self.read method to strip out this metadata before storing the content in self.content.

One possible solution would look like this:

PYTHON


import re

class Document:

    CONTENT_PATTERN = r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*(.*?)\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*"

    def __init__(self, filepath: str, title: str = "", author: str = "", id: int = 0):
        self.filepath = filepath
        self.content = self.get_content(filepath)

        self.title = title
        self.author = author
        self.id = id

    def get_content(self, filepath: str) -> str:
        raw_text = self.read(filepath)
        match = re.search(self.CONTENT_PATTERN, raw_text, re.DOTALL)
        if match:
            return match.group(1).strip()
        raise ValueError(f"File {filepath} is not a valid Project Gutenberg Text file.")

    def read(self, file_path: str) -> str:
        with open(file_path, "r", encoding="utf-8") as file:
            return file.read()

    def get_line_count(self) -> int:
        return len(self.content.splitlines())


    def get_word_occurrence(self, word: str) -> int:
        return self.content.lower().count(word.lower())

Note that our test file now fails, because the CONTENT_PATTERN doesn’t match anything in our example_file.txt. We could modify our test file to use a real Project Gutenberg text file:

*** START OF THE PROJECT GUTENBERG EBOOK TEST ***

This is a test document. It contains words.
It is only a test document.

*** END OF THE PROJECT GUTENBERG EBOOK TEST ***

Or we could modify our class to allow for a different content pattern to be specified when creating the object:

PYTHON

...

# Check that we can create a Document object
Document.CONTENT_PATTERN = r"(.*)"
doc = Document(filepath="tests/example_file.txt", title="Test Document")
if doc.title == "Test Document" and doc.filepath == "tests/example_file.txt":
    passed_tests += 1
else:
    failed_tests += 1

...

Neat! We’ve successfully created and used a class object in our module. But certainly there’s a better way to test this, right? In the next episode, we’ll look at how to write proper unit tests for our class object with pytest.

Key Points
  • Python classes are defined using the class keyword, followed by the class name and a colon.
  • The __init__ method is a special method that is called when an instance of the class is created.
  • Class methods are defined like normal functions, but they must include self as the first parameter.

Content from Unit Testing


Last updated on 2025-09-28 | Edit this page

Overview

Questions

  • What is unit testing?
  • Why is unit testing important?
  • How do you write a unit test in Python?

Objectives

  • Explain the concept of unit testing
  • Demonstrate how to write and run unit tests in Python using pytest

Unit Testing


We have our two little test files, but you might imagine that it’s not particularly efficient to always write individual scripts to test our code. What if we had a lot of functions or classes? Or a lot of different ideas to test? What if our objects changed down the line? Our existing modules are going to be difficult to maintain, and, as you may have expected, there is already a solution for this problem.

What Makes For a Good Test?

A good test is one that is:

  • Isolated: A good test should be able to run independently of other tests, and should not rely on external resources such as databases or web services.
  • Repeatable: A good test should produce the same results every time it is run.
  • Fast: A good test should run quickly, so that it can be run frequently during development.
  • Clear: A good test should be easy to understand, so that other developers can easily see what is being tested and why. Not just in the output provided by the test, but the variables, function names, and structure of the test should be clear and descriptive.

It can be easy to get carried away with testing everything and writing tests that cover every single case that you can come up with. However having a massive test suite that is cumbersome to update and takes a long time to run is not useful. A good rule of thumb is to focus on testing the most important parts of your code, and the parts that are most likely to break. This often means focusing on edge cases and error handling, rather than trying to test every possible input.

Generally, a good test module will test the basic functionality of an object or function, as well as a few edge cases.
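As a sketch of that shape (using a hypothetical add function, not part of our project), each test has a descriptive name and checks one thing:

```python
def add(a: float, b: float) -> float:
    return a + b


# Basic functionality: the "happy path".
def test_add_two_positive_numbers():
    assert add(2, 3) == 5


# Edge case: the behaviour with negative inputs is stated explicitly.
def test_add_two_negative_numbers():
    assert add(-2, -3) == -5
```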

Discussion

What are some edge cases that you can think of for the hello function we wrote earlier?

What about the Document class?

Some edge cases for the hello function could include:

  • Passing in an empty string
  • Passing in a very long string
  • Passing in something that is not a string (e.g. a number or a list)

Some edge cases for the Document class could include:

  • Passing in a file path that does not exist
  • Passing in a file that is not a text file
  • Passing in a file that is empty
  • Passing in a word that does not exist in the document when testing get_word_occurrence

pytest


pytest is a testing framework for Python that helps to write simple and scalable test cases. It is widely used in the python community, and has in-depth and comprehensive documentation. We won’t be getting into all of the different things you can do with pytest, but we will cover the basics here.

To start off with, we need to add pytest to our environment. However, unlike our previous packages, pytest is not required for our module to work; it is only used by us while we are writing code. It is therefore a “development dependency”. We can still add it to our pyproject.toml file via uv, but we need to add a special flag to our command so that it goes in the correct place.

BASH

uv add pytest --dev

If you open up your pyproject.toml file, you should see that pytest has been added under a new section called [dependency-groups] (your version number may be different):

TOML

[dependency-groups]
dev = [
    "pytest>=8.4.2",
]

Now we can start creating our tests.

Writing a pytest Test File

Part of pytest is the concept of “test discovery”. This means that pytest will automatically find any test files that follow a certain naming convention. By default, pytest will look for files that start with test_ or end with _test.py. Inside these files, pytest will look for functions that start with test_.

Now, our files already have the correct names, so we just need to change the contents. Let’s start with test_say_hello.py. Open it up and replace the contents with the following:

PYTHON

from textanalysis_tool.say_hello import hello

def test_hello():
    assert hello("My Name") == "Hello, My Name!"

In our previous test file, we had to add the path to our module each time. Now that we are using pytest, we can use a special file called conftest.py to add this path automatically. Create a file called conftest.py in the tests directory and add the following code to it:

PYTHON

import sys

sys.path.insert(0, "./src")

Now, we just need to run the tests. We can do this with the following command:

BASH

uv run pytest
Callout

Note that we are using uv run to run pytest. This ensures that pytest runs in the correct environment, with all the dependencies we have installed.

You should see output similar to the following:

============================= test session starts ==============================
platform win32 -- Python 3.13.7, pytest-8.4.2, pluggy-1.6.0
rootdir: D:\Documents\Projects\textanalysis-tool
configfile: pyproject.toml
collected 1 item

tests\test_say_hello.py .                                          [100%]
============================== 1 passed in 0.12s ===============================

Why didn’t it run the other test file? Because even though the file is named correctly, it doesn’t contain any functions that start with test_. Let’s fix that now.

Open up test_document.py and replace the contents with the following:

PYTHON

from textanalysis_tool.document import Document

def test_create_document():
    Document.CONTENT_PATTERN = r"(.*)"
    doc = Document(filepath="tests/example_file.txt")
    assert doc.filepath == "tests/example_file.txt"


def test_document_line_count():
    Document.CONTENT_PATTERN = r"(.*)"
    doc = Document(filepath="tests/example_file.txt")
    assert doc.get_line_count() == 2


def test_document_word_occurrence():
    Document.CONTENT_PATTERN = r"(.*)"
    doc = Document(filepath="tests/example_file.txt")
    assert doc.get_word_occurrence("test") == 2
Callout

Our example file doesn’t exactly look like a Project Gutenberg text file, so we need to change the CONTENT_PATTERN to match everything. This is a class level variable, so we can change it on the class itself, rather than on the instance.
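To see what “class level” means in isolation, here is a toy example (hypothetical names, not part of our project). A class-level variable is shared by every instance, so reassigning it on the class changes the value that all instances see:

```python
class Greeter:
    prefix = "Hello"  # class-level variable, shared by all instances

    def greet(self, name: str) -> str:
        return f"{self.prefix}, {name}!"


print(Greeter().greet("Ada"))  # prints "Hello, Ada!"

Greeter.prefix = "Hi"          # change it on the class, not on an instance
print(Greeter().greet("Ada"))  # prints "Hi, Ada!"
```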

Let’s run our tests again:

BASH

uv run pytest

You should see output similar to the following:

============================= test session starts ==============================
platform win32 -- Python 3.13.7, pytest-8.4.2, pluggy-1.6.0
rootdir: D:\Documents\Projects\textanalysis-tool
configfile: pyproject.toml
collected 4 items

tests\test_document.py ...                                      [ 75%]
tests\test_say_hello.py .                                       [100%]
============================== 4 passed in 0.15s ===============================

You can see that all of the tests have passed. There is a small green dot for each test that was performed, and a summary at the end. Compare this to the test file we had before: we got rid of all of the if statements and just use the assert statement to check that the output is what we expect.

Testing Edge Cases / Exceptions

Let’s add an edge case to our tests. Open up test_say_hello.py and add a test case for an empty string:

PYTHON

from textanalysis_tool.say_hello import hello

def test_hello():
    assert hello("My Name") == "Hello, My Name!"

def test_hello_empty_string():
    assert hello("") == "Hello, !"

Run the tests again:

BASH

uv run pytest

We get passing tests, which is what we expect. But we are the ones in charge of the function: what if we decide that the user not providing a name should raise an exception? Let’s change the test to say that if the user provides an empty string, we expect a ValueError:

PYTHON

import pytest

from textanalysis_tool.say_hello import hello


def test_hello():
    assert hello("My Name") == "Hello, My Name!"


def test_hello_empty_string():
    with pytest.raises(ValueError):
        hello("")

Run the tests again:

BASH

uv run pytest

This time, we get a failing test, because the hello function DID NOT raise a ValueError. Let’s change the hello function to raise a ValueError if the name is an empty string:

PYTHON

def hello(name: str = "User"):
    if name == "":
        raise ValueError("Name cannot be empty")
    return f"Hello, {name}!"

Running the tests again, we can see that all the tests pass.

Fixtures

One of the great features of pytest is the ability to use fixtures. Fixtures are a way to provide data, state or configurations to your tests. For example, we have a line in each of our tests that creates a new Document object. We can use a fixture to create this object once, and then use it in each of our tests. That way, if we need to change the way we create the object in the future, we only need to change it in one place.

Let’s create a fixture for our Document object. Open up test_document.py and add the following import at the top:

PYTHON

import pytest

Then, add the following code below the imports:

PYTHON

@pytest.fixture
def doc():
    return Document(filepath="tests/example_file.txt")

Now, we can use this fixture in our tests. Update the test functions to accept a parameter called doc, and remove the line that creates the Document object. The updated test file should look like this:

PYTHON

import pytest

from textanalysis_tool.document import Document

@pytest.fixture
def doc():
    Document.CONTENT_PATTERN = r"(.*)"
    return Document(filepath="tests/example_file.txt")

def test_create_document(doc):
    assert doc.filepath == "tests/example_file.txt"

def test_document_line_count(doc):
    assert doc.get_line_count() == 2

def test_document_word_occurrence(doc):
    assert doc.get_word_occurrence("test") == 2
Callout

Our Documents are validated by searching for starting and ending regex markers, which our test file does not contain. We could add the markers to our test files, or we can temporarily alter the search pattern for the duration of the test. CONTENT_PATTERN is a class-level variable, so we need to modify it on the class before the instance is created.

Let’s run our tests again. Nothing changed in the output, but our code is now cleaner and easier to maintain.

Monkey Patching

Another useful feature of pytest is monkey patching. Monkey patching is a way to modify or extend the behavior of a function or class during testing. This is useful when you want to test a function that depends on an external resource, such as a database, file system or web resource. Instead of actually accessing the external resource, you can use monkey patching to replace the function that accesses the resource with a mock function that returns a predefined value.
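As a minimal sketch of the idea, separate from our project (the roll function and the fixed return value are made up for illustration), monkeypatch.setattr swaps out a dependency for the duration of one test:

```python
# Minimal monkey patching sketch using pytest's built-in monkeypatch fixture.
import random


def roll() -> int:
    return random.randint(1, 6)


def test_roll_is_predictable(monkeypatch):
    # Replace random.randint for the duration of this test only;
    # pytest undoes the patch automatically when the test finishes.
    monkeypatch.setattr(random, "randint", lambda a, b: 4)
    assert roll() == 4
```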

In our use case, we have a file called example_file.txt that we use to test our Document class. However, if we wanted to test the Document class with files that have different contents, we would need to create a whole array of different test files. Instead, we can use monkey patching to replace the open function, so that instead of actually opening a file, it returns a string that we define.

Let’s monkey patch the open function in our test_document.py file. The monkeypatch fixture is provided by pytest itself, but we also need the mock_open helper from unittest.mock (a python built-in module). Add the following import at the top of the file:

PYTHON

from unittest.mock import mock_open

Then, we can create a new fixture that monkey patches the open function. Add the following code below the doc fixture:

PYTHON

@pytest.fixture(autouse=True)
def mock_file(monkeypatch):
    mock = mock_open(read_data="This is a test document. It contains words.\nIt is only a test document.")
    monkeypatch.setattr("builtins.open", mock)
    return mock

The other difference you’ll notice is that we added the parameter autouse=True to the fixture. This means that, within this test file, this specific fixture will be automatically applied to all tests, without needing to explicitly include it as a parameter in each test function.

Go ahead and delete the example_file.txt file, and run the tests again. Your tests should still pass, even though the file doesn’t exist anymore. This is because we are using monkey patching to replace the open function with a mock function that returns the string we defined.

Edge Cases Again / Test Driven Development

Let’s go back and think about other things that might go wrong with our Document class. What if the user provides a file path that doesn’t exist? What if the user provides a file that is not a text file? Or a file that is empty of content? Rather than write these into our class object, we can first write tests that will check for the behavior we expect or want in these edge cases, see if they fail, and then update our class object to make the tests pass. This is called “Test Driven Development” (TDD), and is a common practice in software development.

Let’s add a test for a file that is empty. In this case, we would want the initialization of the object to fail with a ValueError. However for this test, we can’t use our fixtures from above, so we’ll have to code it into the test. Add the following to test_document.py:

PYTHON

def test_empty_file(monkeypatch):
    # Mock an empty file
    mock = mock_open(read_data="")
    monkeypatch.setattr("builtins.open", mock)

    with pytest.raises(ValueError):
        Document(filepath="empty_file.txt")
Callout

Because we are monkeypatching the open function, we don’t actually need to have a file called empty_file.txt in our tests directory. The open function will be replaced with our mock function that returns an empty string. We are providing a file name here to be consistent with the Document class initialization, and we are using the name to act as additional information for later developers to clarify the intent of the test.

Run the tests again:

BASH

uv run pytest

It fails, as we expect. Now, let’s update the Document class to raise a ValueError if the file is empty. Open up document.py and update the get_content method to the following:

PYTHON

...
    def get_content(self, filepath: str) -> str:
        raw_text = self.read(filepath)
        if not raw_text:
            raise ValueError(f"File {filepath} contains no content.")

        match = re.search(self.CONTENT_PATTERN, raw_text, re.DOTALL)
        if match:
            return match.group(1).strip()
        raise ValueError(f"File {filepath} is not a valid Project Gutenberg Text file.")
...
Challenge

Challenge 1: Write a simple test

Create a file called text_utilities.py in the src/textanalysis_tool directory. In this file, paste the following function:

PYTHON

def create_acronym(phrase: str) -> str:
    """Create an acronym from a phrase.

    Args:
        phrase (str): The phrase to create an acronym from.

    Returns:
        str: The acronym.
    """
    if not isinstance(phrase, str):
        raise TypeError("Phrase must be a string.")

    words = phrase.split()
    if len(words) == 0:
        raise ValueError("Phrase must contain at least one word.")

    articles = {"a", "an", "the", "and", "but", "or", "nor", "on", "at", "to", "by", "in"}

    acronym = ""
    for word in words:
        if word.lower() not in articles:
            acronym += word[0].upper()

    return acronym

Create the following test cases for this function:

  • A test that checks if the acronym for “As Soon As Possible” is “ASAP” and that the acronym for “For Your Information” is “FYI”.
  • A test that checks that the function raises a TypeError when the input is not a string.
  • A test that checks that the function raises a ValueError when the input is an empty string.

Are there any other edge cases you can think of? Write a test to prove that your edge case is not handled by this function as it is currently written.

Remember that to use pytest, you need to create a file that starts with test_ and that the test functions need to start with test_ as well.

You can use the pytest.raises context manager to check for specific exceptions. For example:

PYTHON

def test_raises_error():
    with pytest.raises(FileNotFoundError):
        read_my_file("non_existent_file.txt")

What happens if the phrase contains only articles? For example, “and the or by”?

In the tests directory, create a file called test_text_utilities.py:

PYTHON

import pytest

from textanalysis_tool.text_utilities import create_acronym


def test_create_acronym():
    assert create_acronym("As Soon As Possible") == "ASAP"
    assert create_acronym("For Your Information") == "FYI"


def test_create_acronym_invalid_type():
    with pytest.raises(TypeError):
        create_acronym(123)


def test_create_acronym_empty_string():
    with pytest.raises(ValueError):
        create_acronym("")


def test_create_acronym_no_valid_words():
    with pytest.raises(ValueError):
        create_acronym("and the or")

Run the tests with uv run pytest.

In the create_acronym function, we need to add a check after we finish iterating through the words to see if the acronym is empty. If it is, we can raise a ValueError:

PYTHON

    ...

    if not acronym:
        raise ValueError("Phrase must contain at least one non-article word.")

    return acronym
Challenge

Challenge 2: Additional Edge Case

Try adding a test for another edge case for our Document class, this time for a file that is not actually a text file, for example, a binary file or an image file. Then, update the Document class to make the test pass.

You can mock a binary file by using the mock_open function from the unittest.mock module, and using the read_data parameter to provide binary data like b'\x00\x01\x02'.

In the Document class, we need to check whether the data read from the file is binary. Because our test replaces open with mock_open, the read method hands the read_data back unchanged, so a binary file shows up as a bytes object. We can therefore check whether the value returned by read is an instance of type bytes, and raise a ValueError if it is.

PYTHON

text_data = "This is a test string."
if isinstance(text_data, bytes):
    raise ValueError("File is not a valid text file.")

binary_data = b'\x00\x01\x02'
if isinstance(binary_data, bytes):
    raise ValueError("File is not a valid text file.")

You can create a test that simulates opening a binary file by using the mock_open function from the unittest.mock module. Here’s an example of how you might write such a test:

PYTHON

def test_binary_file(monkeypatch):
    # Mock a binary file
    mock = mock_open(read_data=b'\x00\x01\x02')
    monkeypatch.setattr("builtins.open", mock)

    with pytest.raises(ValueError):
        Document(filepath="binary_file.bin")

And then, in the Document class, you can check if the data read from the file is binary data like this:

PYTHON

...
    def get_content(self, filepath: str) -> str:
        raw_text = self.read(filepath)
        if not raw_text:
            raise ValueError(f"File {filepath} contains no content.")

        if isinstance(raw_text, bytes):
            raise ValueError(f"File {self.filepath} is not a valid text file.")

        match = re.search(self.CONTENT_PATTERN, raw_text, re.DOTALL)
        if match:
            return match.group(1).strip()
        raise ValueError(f"File {filepath} is not a valid Project Gutenberg Text file.")
...
Key Points
  • We can use pytest to write and run unit tests in Python.
  • A good test is isolated, repeatable, fast, and clear.
  • We can use fixtures to provide data or state to our tests.
  • We can use monkey patching to modify the behavior of functions or classes during testing.
  • Test Driven Development (TDD) is a practice where we write tests before writing the code to make the tests pass.

Content from Extending Classes with Inheritance


Last updated on 2025-09-28 | Edit this page

Overview

Questions

  • What if we want classes that are similar, but handle slightly different cases?
  • How can we avoid duplicating code in our classes?

Objectives

  • Explain the concept of inheritance in object-oriented programming
  • Demonstrate how to create a subclass that inherits from a parent class
  • Show how to override methods and properties in a subclass

Extending Classes with Inheritance


So far you may be wondering why classes are useful. After all, all we’ve really done in essence is make a tiny module with some functions in it that are slightly more complicated than normal functions. One of the real powers of classes is the ability to limit code duplication through a concept called inheritance.

Inheritance

Inheritance is a way to create a new class that contains all of the same properties and methods as an existing class, but allows us to add additional new properties and methods, or to override existing methods. This allows us to create a new class that is a specialized version of an existing class, without having to rewrite a whole bunch of code.

Taking a look at our Car class from earlier, we might want to create new classes for specific types of engine, like a gas engine or an electric engine. Both kinds of car will have the same basic properties and methods, but they will also have some additional properties and methods that are specific to the type of engine, or properties that are set by default, like our fuel property.

But since both types of cars are still cars, they will share a lot of the same properties and methods. Rather than repeating all of the code from the Car class in both our new classes, we can use inheritance to create our new classes based on the Car class:

In python, this would look something like this:

PYTHON

class Car:
    def __init__(self, make: str, model: str, year: int, color: str = "grey", fuel: str = "gasoline"):
        self.make = make
        self.model = model
        self.year = year
        self.color = color
        self.fuel = fuel

    def honk(self) -> str:
        return "beep"

    def paint(self, new_color: str) -> None:
        self.color = new_color

    def noise(self, speed: int) -> str:
        if speed <= 10:
            return "putt putt"
        else:
            return "vrooom"

class CarGasEngine(Car):
    def __init__(self, make: str, model: str, year: int, color: str = "grey"):
        super().__init__(make=make, model=model, year=year, color=color, fuel="gasoline")

class CarElectricEngine(Car):
    def __init__(self, make: str, model: str, year: int, color: str = "grey"):
        super().__init__(make=make, model=model, year=year, color=color, fuel="electric")

    def noise(self, speed: int) -> str:
        return "hmmmmmm"
Class diagram showing inheritance from Car to CarGasEngine and CarElectricEngine
Class diagram showing the Car class as a parent class, with CarGasEngine and CarElectricEngine as child classes that inherit from Car.
Callout

Note that the noise method in the CarElectricEngine class is overridden to provide a different implementation than the one in the Car class. This is called method overriding, and it allows us to define a different behavior for a method in a subclass. When we call the noise method on an instance of CarElectricEngine, it will use the overridden method, rather than the one defined in the Car class.

However in CarGasEngine, we do not override the noise method, so it will use the one defined in the Car class.

More on overriding methods in a moment.

You can see that the CarGasEngine class is defined in a similar way to the Car class, but it inherits from the Car class by including it in parentheses after the class name. The __init__ method of the CarGasEngine class also has a call to super().__init__(). The super() function is a way to refer specifically to the parent class, in this case, the Car class. This allows us to call the __init__ method of the Car class, which sets up all of the properties that a Car has.
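Putting this to work, a short usage sketch (a condensed copy of the classes above, keeping only the noise method) shows which implementation each call resolves to:

```python
class Car:
    def __init__(self, make: str, model: str, fuel: str = "gasoline"):
        self.make = make
        self.model = model
        self.fuel = fuel

    def noise(self, speed: int) -> str:
        return "putt putt" if speed <= 10 else "vrooom"


class CarGasEngine(Car):
    def __init__(self, make: str, model: str):
        super().__init__(make, model, fuel="gasoline")


class CarElectricEngine(Car):
    def __init__(self, make: str, model: str):
        super().__init__(make, model, fuel="electric")

    def noise(self, speed: int) -> str:  # overrides Car.noise
        return "hmmmmmm"


gas = CarGasEngine("Toyota", "Corolla")
ev = CarElectricEngine("Nissan", "Leaf")

print(gas.noise(50))        # vrooom: inherited from Car
print(ev.noise(50))         # hmmmmmm: the overridden method wins
print(isinstance(ev, Car))  # True: a CarElectricEngine is still a Car
```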

Applying Inheritance to Our Document Class

For our Document class, we have a few different types of documents available from the Project Gutenberg website. We are currently using plain text files, but there are also HTML files that we can download. They will have the same information, but the data within will be structured in a slightly different way. We can use inheritance to create a pair of new classes: HTMLDocument and PlainTextDocument, that both inherit from the Document class. This will allow us to keep all of the common functionality in the Document class, but to add any additional functionality specific to each document type.

Most of what we’ve written so far is specific to reading and parsing data out of the plain text files, so almost all of that code can move into the new PlainTextDocument class. We’ll keep the gutenberg_url property and the get_line_count and get_word_occurrence methods in the base Document class.

In addition, we’ll need an __init__ in our Document class. At the moment, all it does is save the file path in the filepath property, but we might expand this in the future. We’ll also need a call to super().__init__() in our PlainTextDocument. At the moment, our classes look like this:

PYTHON

class Document:
    @property
    def gutenberg_url(self) -> str | None:
        if self.id:
            return f"https://www.gutenberg.org/cache/epub/{self.id}/pg{self.id}.txt"
        return None

    def __init__(self, filepath: str):
        self.filepath = filepath

    def get_line_count(self) -> int:
        return len(self._content.splitlines())

    def get_word_occurrence(self, word: str) -> int:
        return self._content.lower().count(word.lower())

PYTHON

import re

from textanalysis_tool.document import Document


class PlainTextDocument(Document):
    TITLE_PATTERN = r"^Title:\s*(.*?)\s*$"
    AUTHOR_PATTERN = r"^Author:\s*(.*?)\s*$"
    ID_PATTERN = r"^Release date:\s*.*?\[eBook #(\d+)\]"
    CONTENT_PATTERN = r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*(.*?)\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*"

    def __init__(self, filepath: str):
        super().__init__(filepath=filepath)

    def _extract_metadata_element(self, pattern: str, text: str) -> str | None:
        match = re.search(pattern, text, re.MULTILINE)
        return match.group(1).strip() if match else None

    def get_content(self, filepath: str) -> str:
        raw_text = self.read(filepath)

        match = re.search(self.CONTENT_PATTERN, raw_text, re.DOTALL)
        if match:
            return match.group(1).strip()
        raise ValueError(f"File {filepath} is not a valid Project Gutenberg Text file.")

    def get_metadata(self, filepath: str) -> dict:
        raw_text = self.read(filepath)

        title = self._extract_metadata_element(self.TITLE_PATTERN, raw_text)
        author = self._extract_metadata_element(self.AUTHOR_PATTERN, raw_text)
        extracted_id = self._extract_metadata_element(self.ID_PATTERN, raw_text)

        return {
            "title": title,
            "author": author,
            "id": int(extracted_id) if extracted_id else None,
        }

    def read(self, file_path: str) -> str:
        with open(file_path, "r", encoding="utf-8") as file:
            raw_text = file.read()

        if not raw_text:
            raise ValueError(f"File {self.filepath} contains no content.")

        if isinstance(raw_text, bytes):
            raise ValueError(f"File {self.filepath} is not a valid text file.")

        return raw_text

We’ll also have another class for reading HTML files. This will be similar to the PlainTextDocument class, but it will use the BeautifulSoup library to parse the HTML file and extract the content and metadata. Rather than type out the entire class now, you can either copy and paste the code below into a new file called src/textanalysis_tool/html_document.py, or you can download the file from the Workshop Resources.

Prerequisite

As we do not have BeautifulSoup in our environment yet, you will need to add it using uv:

uv add beautifulsoup4

This will install the package to your environment as well as add it to your pyproject.toml file.

PYTHON
import re

from bs4 import BeautifulSoup

from textanalysis_tool.document import Document


class HTMLDocument(Document):
    URL_PATTERN = "^https://www.gutenberg.org/files/([0-9]+)/.*"

    @property
    def gutenberg_url(self) -> str | None:
        if self.id:
            return f"https://www.gutenberg.org/cache/epub/{self.id}/pg{self.id}-h.zip"
        return None

    def __init__(self, filepath: str):
        super().__init__(filepath=filepath)

        extracted_id = re.search(self.URL_PATTERN, self.metadata.get("url", ""), re.DOTALL)
        self.id = int(extracted_id.group(1)) if extracted_id else None

    def read(self, filepath) -> BeautifulSoup:
        with open(filepath, encoding="utf-8") as file_obj:
            parsed_file = BeautifulSoup(file_obj, "html.parser")

        # Check that the file is parsable as HTML
        if not parsed_file or not parsed_file.find("h1"):
            raise ValueError("The file could not be parsed as HTML.")

        return parsed_file

    def get_content(self, filepath: str) -> str:
        parsed_file = self.read(filepath)

        # Find the first h1 tag (The book title)
        title_h1 = parsed_file.find("h1")

        # Collect all the content after the first h1
        content = []
        for element in title_h1.find_next_siblings():
            text = element.get_text(strip=True)

            # Stop early if we hit this text, which indicates the end of the book
            if "END OF THE PROJECT GUTENBERG EBOOK" in text:
                break

            if text:
                content.append(text)

        return "\n\n".join(content)

    def get_metadata(self, filepath: str) -> dict:
        parsed_file = self.read(filepath)

        title = parsed_file.find("meta", {"name": "dc.title"})["content"]
        author = parsed_file.find("meta", {"name": "dc.creator"})["content"]
        url = parsed_file.find("meta", {"name": "dcterms.source"})["content"]
        extracted_id = re.search(self.URL_PATTERN, url)
        book_id = int(extracted_id.group(1)) if extracted_id else None

        return {"title": title, "author": author, "id": book_id, "url": url}

Overriding Methods

Notice that in the HTMLDocument class, we have overridden the gutenberg_url property to return the URL for the HTML version of the book. This is an example of how we can override methods and properties in a subclass to provide specialized behavior. When we create an instance of HTMLDocument, it will use the gutenberg_url property defined in the HTMLDocument class, rather than the one defined in the Document class.

Callout

When overriding methods, it’s important to ensure that the new method has the same signature as the method being overridden. This means that the new method should have the same name, number of parameters, and return type as the method being overridden.

Additionally, the __init__ method is technically also an overridden method, since it is defined in the parent class. However, because we call the parent class’s __init__ method using super(), we are not completely replacing the parent’s behavior, but rather extending it. We can do the same with other methods whenever we want to add functionality to an existing method rather than replace it entirely.
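Extending (rather than replacing) works for any method, not just __init__. Here is a toy, self-contained sketch with hypothetical Greeter classes (not part of our project) showing the pattern:

```python
class Greeter:
    def greet(self) -> str:
        return "Hello"


class LoudGreeter(Greeter):
    def greet(self) -> str:
        # Extend, rather than replace: first get the parent's result via super(),
        # then build on it
        return super().greet() + "!!!"


print(LoudGreeter().greet())  # Hello!!!
```

LoudGreeter.greet obtains the parent’s result and then appends to it, mirroring how our subclass __init__ methods call super().__init__() before adding their own steps.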

Testing our Inherited Classes

Now let’s try out our classes. We already have the pg2680.txt file in our scratch folder; now let’s download the HTML version of the same book from Project Gutenberg. You can download it from this link. (Note that the file is zipped, as it also contains images. We won’t be using the images, but you’ll need to unzip the file to get to the HTML file.) Once you have the HTML file, place it in the scratch folder alongside the pg2680.txt file.

You can either copy and paste the code below into a new file called demo_inheritance.py, or you can download the file from the Workshop Resources.

PYTHON

import sys

sys.path.insert(0, "src")

from textanalysis_tool.document import Document
from textanalysis_tool.plain_text_document import PlainTextDocument
from textanalysis_tool.html_document import HTMLDocument

# Test the PlainTextDocument class
plain_text_doc = PlainTextDocument(filepath="scratch/pg2680.txt")
print(f"Plain Text Document Title: {plain_text_doc.title}")
print(f"Plain Text Document Author: {plain_text_doc.author}")
print(f"Plain Text Document ID: {plain_text_doc.id}")
print(f"Plain Text Document Line Count: {plain_text_doc.line_count}")
print(f"Plain Text Document 'the' Occurrences: {plain_text_doc.get_word_occurrence('the')}")
print(f"Plain Text Document Gutenberg URL: {plain_text_doc.gutenberg_url}")
print(f"Type of Plain Text Document: {type(plain_text_doc)}")
print(f"Parent Class: {type(plain_text_doc).__bases__[0]}")

print("=" * 40)

# Test the HTMLDocument class
html_doc = HTMLDocument(filepath="scratch/pg2680-images.html")
print(f"HTML Document Title: {html_doc.title}")
print(f"HTML Document Author: {html_doc.author}")
print(f"HTML Document ID: {html_doc.id}")
print(f"HTML Document Line Count: {html_doc.line_count}")
print(f"HTML Document 'the' Occurrences: {html_doc.get_word_occurrence('the')}")
print(f"HTML Document Gutenberg URL: {html_doc.gutenberg_url}")
print(f"Type of HTML Document: {type(html_doc)}")
print(f"Parent Class: {type(html_doc).__bases__[0]}")

print("=" * 40)

# We can't use the Document class directly
doc = Document(filepath="scratch/pg2680.txt")

You should get some output that looks like this:

Plain Text Document Title: Meditations
Plain Text Document Author: Emperor of Rome Marcus Aurelius
Plain Text Document ID: 2680
Plain Text Document Line Count: 6845
Plain Text Document 'the' Occurrences: 5736
Plain Text Document Gutenberg URL: https://www.gutenberg.org/cache/epub/2680/pg2680.txt
Type of Plain Text Document: <class 'textanalysis_tool.plain_text_document.PlainTextDocument'>
Parent Class: <class 'textanalysis_tool.document.Document'>
========================================
HTML Document Title: Meditations
HTML Document Author: Marcus Aurelius, Emperor of Rome, 121-180
HTML Document ID: 2680
HTML Document Line Count: 5635
HTML Document 'the' Occurrences: 6161
HTML Document Gutenberg URL: https://www.gutenberg.org/cache/epub/2680/pg2680-h.zip
Type of HTML Document: <class 'textanalysis_tool.html_document.HTMLDocument'>
Parent Class: <class 'textanalysis_tool.document.Document'>
========================================
Traceback (most recent call last):
  File "E:\Projects\Python\scratch\textanalysis-tool\scratch\demo_inheritance.py", line 34, in <module>
    doc = Document(filepath="scratch/pg2680.txt")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\Projects\Python\scratch\textanalysis-tool\src\textanalysis_tool\document.py", line 14, in __init__
    self.content = self.get_content(filepath)
                   ^^^^^^^^^^^^^^^^
AttributeError: 'Document' object has no attribute 'get_content'

Note that the end of the script results in an error: since the Document class no longer contains the get_content or get_metadata methods, it cannot be used directly. Notice, however, that the error is only raised when __init__ tries to call one of those missing methods, not when the class is defined.

Callout

This is a use case for something called an abstract base class, which is a class that is designed to be inherited from, but never instantiated directly. One way to handle this would be to add these methods to the Document class, but have them raise a NotImplementedError. This way, if someone tries to instantiate the Document class directly, they will get an error indicating that maybe this class is not meant to be used directly:

PYTHON

class Document:
    @property
    def gutenberg_url(self) -> str | None:
        if self.id:
            return f"https://www.gutenberg.org/cache/epub/{self.id}/pg{self.id}.txt"
        return None

    @property
    def line_count(self) -> int:
        return len(self.content.splitlines())

    def __init__(self, filepath: str):
        self.filepath = filepath
        self.content = self.get_content(filepath)

        metadata = self.get_metadata(filepath)
        self.title = metadata.get("title")
        self.author = metadata.get("author")
        self.id = metadata.get("id")

    def get_word_occurrence(self, word: str) -> int:
        return self.content.lower().count(word.lower())

    def get_content(self, filepath: str) -> str:
        raise NotImplementedError("This method should be implemented by subclasses.")

    def get_metadata(self, filepath: str) -> dict[str, str | None]:
        raise NotImplementedError("This method should be implemented by subclasses.")

Another way to handle this is to use the abc module from the standard library, which provides a way to define abstract base classes. This is a more formal way to define a class that is meant to be inherited from, but not instantiated directly:

PYTHON

from abc import ABC, abstractmethod


class Document(ABC):
    @property
    def gutenberg_url(self) -> str | None:
        if self.id:
            return f"https://www.gutenberg.org/cache/epub/{self.id}/pg{self.id}.txt"
        return None

    @property
    def line_count(self) -> int:
        return len(self.content.splitlines())

    def __init__(self, filepath: str):
        self.filepath = filepath
        self.content = self.get_content(filepath)

        self.metadata = self.get_metadata(filepath)
        self.title = self.metadata.get("title")
        self.author = self.metadata.get("author")
        self.id = self.metadata.get("id")

    def get_word_occurrence(self, word: str) -> int:
        return self.content.lower().count(word.lower())

    @abstractmethod
    def get_content(self, filepath: str) -> str:
        pass

    @abstractmethod
    def get_metadata(self, filepath: str) -> dict[str, str | None]:
        pass
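
With the abc version, Python refuses to instantiate the abstract class at all: you get a TypeError immediately, rather than a NotImplementedError later when a missing method happens to be called. A small self-contained sketch, using a hypothetical Shape class rather than our Document:

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self) -> float: ...


class Square(Shape):
    def __init__(self, side: float):
        self.side = side

    def area(self) -> float:
        return self.side * self.side


# A subclass that implements every abstract method works normally
print(Square(3).area())  # 9

# Instantiating the abstract class itself fails immediately
try:
    Shape()
except TypeError as err:
    print(f"TypeError: {err}")
```

The exact wording of the TypeError message varies between Python versions, but it always names the abstract methods that are missing an implementation.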


Unit Testing

One of the first effects of this change is that our Document class is no longer directly testable, since it cannot be instantiated directly. However, we can still test the PlainTextDocument and HTMLDocument classes, which will also indirectly test the Document class. You can either copy the code below into two new files called tests/test_plain_text_document.py and tests/test_html_document.py, or you can download the files from the Workshop Resources. (Also make sure to delete the existing tests/test_document.py file, since it is no longer applicable.)

tests/test_plain_text_document.py

PYTHON
import pytest
from unittest.mock import mock_open

from textanalysis_tool.plain_text_document import PlainTextDocument

TEST_DATA = """
Title: Test Document

Author: Test Author

Release date: January 1, 2001 [eBook #1234]
                Most recently updated: February 2, 2002

*** START OF THE PROJECT GUTENBERG EBOOK TEST ***
This is a test document. It contains words.
It is only a test document.
*** END OF THE PROJECT GUTENBERG EBOOK TEST ***
"""


@pytest.fixture(autouse=True)
def mock_file(monkeypatch):
    mock = mock_open(read_data=TEST_DATA)
    monkeypatch.setattr("builtins.open", mock)
    return mock


@pytest.fixture
def doc():
    return PlainTextDocument(filepath="tests/example_file.txt")


def test_create_document(doc):
    assert doc.title == "Test Document"
    assert doc.author == "Test Author"
    assert isinstance(doc.id, int) and doc.id == 1234


def test_empty_file(monkeypatch):
    # Mock an empty file
    mock = mock_open(read_data="")
    monkeypatch.setattr("builtins.open", mock)

    with pytest.raises(ValueError):
        PlainTextDocument(filepath="empty_file.txt")


def test_binary_file(monkeypatch):
    # Mock a binary file
    mock = mock_open(read_data=b"\x00\x01\x02")
    monkeypatch.setattr("builtins.open", mock)

    with pytest.raises(ValueError):
        PlainTextDocument(filepath="binary_file.bin")


def test_document_line_count(doc):
    assert doc.line_count == 2


def test_document_word_occurrence(doc):
    assert doc.get_word_occurrence("test") == 2

tests/test_html_document.py

PYTHON

import pytest
from unittest.mock import mock_open

from textanalysis_tool.html_document import HTMLDocument

TEST_DATA = """
<head>
  <meta name="dc.title" content="Test Document">
  <meta name="dcterms.source" content="https://www.gutenberg.org/files/1234/1234-h/1234-h.htm">
  <meta name="dc.creator" content="Test Author">
</head>
<body>
  <h1>Test Document</h1>
  <p>
    This is a test document. It contains words.
    It is only a test document.
  </p>
</body>
"""


@pytest.fixture(autouse=True)
def mock_file(monkeypatch):
    mock = mock_open(read_data=TEST_DATA)
    monkeypatch.setattr("builtins.open", mock)
    return mock


@pytest.fixture
def doc():
    return HTMLDocument(filepath="tests/example_file.txt")


def test_create_document(doc):
    assert doc.title == "Test Document"
    assert doc.author == "Test Author"
    assert isinstance(doc.id, int) and doc.id == 1234


def test_empty_file(monkeypatch):
    # Mock an empty file
    mock = mock_open(read_data="")
    monkeypatch.setattr("builtins.open", mock)

    with pytest.raises(ValueError):
        HTMLDocument(filepath="empty_file.html")


def test_document_line_count(doc):
    assert doc.line_count == 2


def test_document_word_occurrence(doc):
    assert doc.get_word_occurrence("test") == 2
Challenge

Challenge 1: Predict the output

What will happen when we run the following code? Why?

PYTHON

class Animal:
    def __init__(self, name: str):
        print(f"Creating an animal named {name}")
        self.name = name

    def whoami(self) -> str:
        return f"I am a {type(self)} named {self.name}"

class Dog(Animal):
    def __init__(self, name: str):
        print(f"Creating a dog named {name}")
        super().__init__(name=name)

class Cat(Animal):
    def __init__(self, name: str):
        print(f"Creating a cat named {name}")


animals = [Dog(name="Chance"), Cat(name="Sassy"), Dog(name="Shadow")]

for animal in animals:
    print(animal.whoami())

We get some of the output we expect, but we also get an error:

Creating a dog named Chance
Creating an animal named Chance
Creating a cat named Sassy
Creating a dog named Shadow
Creating an animal named Shadow
I am a <class '__main__.Dog'> named Chance

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 22
     19 animals = [Dog(name="Chance"), Cat(name="Sassy"), Dog(name="Shadow")]
     21 for animal in animals:
---> 22     print(animal.whoami())

Cell In[4], line 7, in Animal.whoami(self)
      6 def whoami(self) -> str:
----> 7     return f"I am a {type(self)} named {self.name}"

AttributeError: 'Cat' object has no attribute 'name'

We failed to call the super().__init__() method in the Cat class, so the name property was never set. When we then try to access the instance property name in the whoami method, we get an AttributeError.
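The fix is a single super().__init__() call in Cat. Here is a condensed version (whoami is simplified to use type(self).__name__ so the output is shorter than in the original):

```python
class Animal:
    def __init__(self, name: str):
        self.name = name

    def whoami(self) -> str:
        return f"I am a {type(self).__name__} named {self.name}"


class Cat(Animal):
    def __init__(self, name: str):
        print(f"Creating a cat named {name}")
        super().__init__(name=name)  # the missing call: forwards name to Animal.__init__


print(Cat(name="Sassy").whoami())  # I am a Cat named Sassy
```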

Challenge

Challenge 2: Class Methods and Properties

We’ve mostly focused on instance properties and methods so far, but classes can also have what are called “class properties” and “class methods”. These are properties and methods that are associated with the class itself, rather than with an instance of the class.

Without running it, what do you think the following code will do? Will it run without error?

PYTHON

class Animal:
    PHYLUM = "Chordata"

    def __init__(self, name: str):
        self.name = name

    def whoami(self) -> str:
        return f"I am a {type(self)} named {self.name} in the phylum {self.PHYLUM}"

class Snail(Animal):
    def __init__(self, name: str):
        super().__init__(name=name)

animal1 = Snail(name="Gary")
Animal.PHYLUM = "Mollusca"
print(animal1.whoami())

animal2 = Snail(name="Slurms MacKenzie")
print(animal2.whoami())

creature3 = Snail(name="Turbo")
creature3.CLASS = "Gastropoda"
print(creature3.whoami(), "and is in class", creature3.CLASS)

There are two things about this piece of code that are a bit tricky.

1. The PHYLUM property is a class property, so it is shared among all instances of the class. When we set Animal.PHYLUM = "Mollusca", we modify the class property for every instance, which is why animal2.whoami() also reports the phylum as “Mollusca”, even though we created a new instance of Snail.

2. We never defined a CLASS property in the Animal or Snail class, but we can still create a new attribute on an instance of a class at any time. (Generally, this is not a good idea, as it can cause confusion when you reference an attribute that doesn’t exist in any class definition, but it is technically possible.)
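The rule behind both points is Python’s attribute lookup order: an instance’s own attributes are checked first, then its class (and parent classes). A minimal sketch:

```python
class Animal:
    PHYLUM = "Chordata"


a = Animal()
b = Animal()

Animal.PHYLUM = "Mollusca"   # rebinds the class attribute: every instance sees it
print(a.PHYLUM, b.PHYLUM)    # Mollusca Mollusca

a.PHYLUM = "Arthropoda"      # creates an *instance* attribute that shadows the class one
print(a.PHYLUM, b.PHYLUM)    # Arthropoda Mollusca
print(Animal.PHYLUM)         # Mollusca (the class attribute itself is unchanged)
```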

Challenge

Challenge 3: Create a new subclass

The previous challenge is not quite correct, as canonically “Slurms MacKenzie” is not a snail, but a slug. Create a subclass of Animal called Mollusk that inherits from Animal but only sets the class property PHYLUM to “Mollusca”. Then create two subclasses of Mollusk: Snail and Slug.

You can implement any methods or properties you want in the “Snail” and “Slug” classes, but you may also just leave them empty like so:

PYTHON

class MyClass:
    pass

It is not necessary for the Snail and Slug classes to have their own __init__ methods, as they will inherit the __init__ method from the Animal class through the Mollusk class.

PYTHON

class Animal:
    PHYLUM = "Chordata"

    def __init__(self, name: str):
        self.name = name

    def whoami(self) -> str:
        return f"I am a {type(self)} named {self.name} in the phylum {self.PHYLUM}"

class Mollusk(Animal):
    PHYLUM = "Mollusca"

class Snail(Mollusk):
    pass

class Slug(Mollusk):
    pass
Key Points
  • Inheritance allows us to create a new class that is a specialized version of an existing class
  • We can override methods and properties in a subclass to provide specialized behavior

Content from Inheritance and Composition


Last updated on 2025-09-28 | Edit this page

Overview

Questions

  • How does Composition differ from Inheritance?
  • When should I use Composition over Inheritance?

Objectives

  • Explain the difference between Inheritance and Composition
  • Use Composition to build classes that contain instances of other classes

Composition


In the previous episode, we saw how to use Inheritance to create a specialized version of an existing class. But there’s another strategy we can use to build classes: Composition. Composition is a design principle where a class is composed of one or more objects from other classes, rather than inheriting from them. This allows us to create complex functionality by combining several smaller, simpler classes.

Back to the Car Example

Let’s revisit our Car example from the Class Objects episode. As a reminder, our class looked like this:

Car Class object example
Car Class object example

But what if we wanted to add more functionality to our Car class? We talked about in the previous episode how we could use Inheritance to create specialized versions of the Car class, like this:

Car Inheritance example
Car Inheritance example

The code might look something like this:

PYTHON

class Car:
    def __init__(self, make: str, model: str, year: int, color: str = "grey", fuel: str | None = None):
        self.make = make
        self.model = model
        self.year = year
        self.color = color
        self.fuel = fuel

    def honk(self) -> str:
        return "beep"

    def paint(self, new_color: str) -> None:
        self.color = new_color

    def noise(self, speed: int) -> str:
        raise NotImplementedError("Subclasses must implement this method")

class CarGasEngine(Car):
    def __init__(self, make: str, model: str, year: int, color: str = "grey"):
        super().__init__(make, model, year, color, fuel="gasoline")

    def noise(self, speed: int) -> str:
        if speed <= 10:
            return "put put put"
        elif speed > 10:
            return "vrooom"


class CarElectricEngine(Car):
    def __init__(self, make: str, model: str, year: int, color: str = "grey"):
        super().__init__(make, model, year, color, fuel="electric")

    def noise(self, speed: int) -> str:
        return "hmmmmmm"

But what happens if we start adding more kinds of engines? Or if our engines start getting more complex, with different properties and methods? Even worse, what if we want to add a new type of car component, like Wheels? Do we now need to start creating subclasses for every possible combination of car, engine, and wheels? We would end up with a lot of subclasses, some of which might have extremely similar (if not identical) functionality.

Instead, we can use a Compositional approach by making a new kind of class called Engine, and then including an instance of that class as a property of the Car class. This way, we can create different kinds of engines as separate classes, and then use them in our Car class without having to create a new subclass for each one. Here’s how that would look:

Car Composition example
Car Composition example

PYTHON

class Car:
    # "Engine" is quoted (a forward reference) because the class is defined further down
    def __init__(self, make: str, model: str, year: int, color: str = "grey", engine: "Engine | None" = None):
        self.make = make
        self.model = model
        self.year = year
        self.color = color
        self.engine = engine

    def honk(self) -> str:
        return "beep"

    def paint(self, new_color: str) -> None:
        self.color = new_color

    def noise(self, speed: int) -> str:
        if self.engine:
            return self.engine.noise(speed)
        raise ValueError("Car must have an engine")


class Engine:
    def __init__(self, fuel: str):
        self.fuel = fuel

    def noise(self, speed: int) -> str:
        raise NotImplementedError("Subclasses must implement this method")

class GasEngine(Engine):
    def __init__(self):
        super().__init__(fuel="gasoline")

    def noise(self, speed: int) -> str:
        if speed <= 10:
            return "put put put"
        elif speed > 10:
            return "vrooom"

class ElectricEngine(Engine):
    def __init__(self):
        super().__init__(fuel="electric")

    def noise(self, speed: int) -> str:
        return "hmmmmmm"

At first glance this might look even more complicated, but it has several advantages:

  • Separation of Concerns: The Engine class is responsible for engine-specific behavior, while the Car class focuses on car-specific behavior. This makes the code easier to understand and maintain.
  • Reusability: The Engine class can be reused in other contexts, such as in a Truck or Motorcycle class, without duplicating code.
  • Flexibility: We can easily add new types of engines by creating new subclasses of Engine, without having to modify the Car class or create new subclasses of Car.

Think about if we added a Tire class as well. We could have different types of tires (e.g., Road Tires, Racing Tires, Snow Tires, etc.) and then include an instance of the Tire class in the Car class. This would allow us to mix and match different types of engines and tires without having to create a new subclass for every possible combination.

Complete Car Composition example
Complete Car Composition example
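To see the flexibility concretely, here is a condensed, self-contained version of the classes above (fuel, color, and the other Car attributes are omitted to keep it short): changing a car’s behavior is just a matter of passing a different engine object.

```python
class Engine:
    def noise(self, speed: int) -> str:
        raise NotImplementedError("Subclasses must implement this method")


class GasEngine(Engine):
    def noise(self, speed: int) -> str:
        return "put put put" if speed <= 10 else "vrooom"


class ElectricEngine(Engine):
    def noise(self, speed: int) -> str:
        return "hmmmmmm"


class Car:
    def __init__(self, make: str, engine: Engine):
        self.make = make
        self.engine = engine  # composition: the car *has an* engine

    def noise(self, speed: int) -> str:
        # Delegate to whichever engine this car was composed with
        return self.engine.noise(speed)


print(Car("Toyota", GasEngine()).noise(speed=5))        # put put put
print(Car("Nissan", ElectricEngine()).noise(speed=50))  # hmmmmmm
```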

Refactoring our Document Example


Let’s take this concept and apply it to our Document example from previous episodes. We can create a new class called Reader that is responsible for reading files and providing the content to the Document class. We can have a different reader type for each file format we want to support.

To start with, let’s create a directory for our readers called readers, and then create a base class called BaseReader in a file called readers/base_reader.py:

PYTHON

from abc import ABC, abstractmethod

class BaseReader(ABC):
    @abstractmethod
    def get_content(self, filepath: str) -> str:
        pass

    @abstractmethod
    def get_metadata(self, filepath: str) -> dict:
        pass

Then we’ll create a TextReader class in a file called readers/text_reader.py. We’ll move over the logic for reading and extracting content from a Project Gutenberg text file here:

PYTHON

import re

from .base_reader import BaseReader


class TextReader(BaseReader):
    TITLE_PATTERN = r"^Title:\s*(.*?)\s*$"
    AUTHOR_PATTERN = r"^Author:\s*(.*?)\s*$"
    ID_PATTERN = r"^Release date:\s*.*?\[eBook #(\d+)\]"
    CONTENT_PATTERN = r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*(.*?)\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*"

    def read(self, filepath: str) -> str:
        with open(filepath, encoding="utf-8") as file_obj:
            return file_obj.read()

    def _extract_metadata_element(self, pattern: str, text: str) -> str | None:
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            return match.group(1).strip()
        return None

    def get_content(self, filepath: str) -> str:
        raw_text = self.read(filepath)
        match = re.search(self.CONTENT_PATTERN, raw_text, re.DOTALL)
        if match:
            return match.group(1).strip()
        raise ValueError(f"File {filepath} is not a valid Project Gutenberg Text file.")

    def get_metadata(self, filepath: str) -> dict:
        raw_text = self.read(filepath)

        title = self._extract_metadata_element(self.TITLE_PATTERN, raw_text)
        author = self._extract_metadata_element(self.AUTHOR_PATTERN, raw_text)
        extracted_id = self._extract_metadata_element(self.ID_PATTERN, raw_text)

        return {
            "title": title,
            "author": author,
            "id": int(extracted_id) if extracted_id else None,
        }

Next, we’ll do the same for the HTML code. Create a file called readers/html_reader.py:

PYTHON

import re

from bs4 import BeautifulSoup

from .base_reader import BaseReader


class HTMLReader(BaseReader):
    URL_PATTERN = "^https://www.gutenberg.org/files/([0-9]+)/.*"

    def read(self, filepath) -> BeautifulSoup:
        with open(filepath, encoding="utf-8") as file_obj:
            parsed_file = BeautifulSoup(file_obj, features="html.parser")

        if not parsed_file:
            raise ValueError("The file could not be parsed as HTML.")

        return parsed_file

    def get_content(self, filepath) -> str:
        parsed_file = self.read(filepath)

        # Find the first h1 tag (The book title)
        title_h1 = parsed_file.find("h1")

        # Collect all the content after the first h1
        content = []
        for element in title_h1.find_next_siblings():
            text = element.get_text(strip=True)

            # Stop early if we hit this text, which indicates the end of the book
            if "END OF THE PROJECT GUTENBERG EBOOK" in text:
                break

            if text:
                content.append(text)

        return "\n\n".join(content)

    def get_metadata(self, filepath: str) -> dict:
        parsed_file = self.read(filepath)

        title = parsed_file.find("meta", {"name": "dc.title"})["content"]
        author = parsed_file.find("meta", {"name": "dc.creator"})["content"]
        url = parsed_file.find("meta", {"name": "dcterms.source"})["content"]
        extracted_id = re.search(self.URL_PATTERN, url)
        book_id = int(extracted_id.group(1)) if extracted_id else None

        return {"title": title, "author": author, "id": book_id}

Finally, we can update our Document class to use these readers. We’ll add a new parameter to the Document constructor called reader, which will be an instance of a BaseReader subclass. Here’s how the updated Document class might look:

PYTHON

from textanalysis_tool.readers.base_reader import BaseReader

class Document:
    @property
    def gutenberg_url(self) -> str | None:
        if self.id:
            return f"https://www.gutenberg.org/cache/epub/{self.id}/pg{self.id}.txt"
        return None

    @property
    def line_count(self) -> int:
        return len(self.content.splitlines())

    def __init__(self, filepath: str, reader: BaseReader):
        self.filepath = filepath
        self.content = reader.get_content(filepath)

        metadata = reader.get_metadata(filepath)
        self.title = metadata.get("title")
        self.author = metadata.get("author")
        self.id = metadata.get("id")

    def get_word_occurrence(self, word: str) -> int:
        return self.content.lower().count(word.lower())

Key Points


Ok, that’s a lot of changes. So what was that all about?

  • Modularity: Our code is now made up of smaller, more focused classes. The code responsible for reading files in and parsing the contents is separate from the code that represents a document and provides analysis.
  • Extensibility: We can easily add support for new file formats by creating new reader classes that inherit from BaseReader, without having to modify the Document class.
  • Maintainability: Each class has a single responsibility, making it easier to understand and maintain.
  • Reusability: The reader classes can be reused in other contexts, such as in a different application that needs to read and parse files.
Callout

Also note that in the Document class, in the __init__ method, the typehint for the reader parameter is BaseReader. This means that any subclass of BaseReader can be passed in, allowing for flexibility in the type of reader used.

By using Inheritance and an abstract base class for the reader, we are essentially creating a promise that any subclass of BaseReader will implement the methods defined in the base class. This is how we can safely call reader.get_content() and reader.get_metadata() in the Document class - we no longer care what specific type of reader it is, as long as it adheres to the abstract base class interface.
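To see that contract in action, here is a stripped-down, self-contained sketch. UppercaseReader is a hypothetical reader invented for this example, yet Document accepts it without any changes, because it implements the BaseReader interface:

```python
from abc import ABC, abstractmethod


class BaseReader(ABC):
    @abstractmethod
    def get_content(self, filepath: str) -> str: ...

    @abstractmethod
    def get_metadata(self, filepath: str) -> dict: ...


class Document:
    def __init__(self, filepath: str, reader: BaseReader):
        # Document only relies on the BaseReader interface, never on a concrete type
        self.content = reader.get_content(filepath)
        self.title = reader.get_metadata(filepath).get("title")


# A hypothetical new reader: Document needs no modification to support it
class UppercaseReader(BaseReader):
    def get_content(self, filepath: str) -> str:
        return "SHOUTED CONTENT"

    def get_metadata(self, filepath: str) -> dict:
        return {"title": "Loud Book"}


doc = Document(filepath="ignored.txt", reader=UppercaseReader())
print(doc.title)  # Loud Book
```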

Testing the New Objects


We can now delete our two inherited classes PlainTextDocument and HTMLDocument, and update our tests to use the new TextReader and HTMLReader classes. This, however, means that we need to rewrite our tests to use the new Document constructor, which requires an instance of a BaseReader subclass.

Because of our new Compositional design, we can now test the Document without a specific reader by using a mock reader. This allows us to isolate the Document class and test its functionality without relying on the actual file reading and parsing logic:

PYTHON

import pytest

from textanalysis_tool.document import Document
from textanalysis_tool.readers.base_reader import BaseReader


class MockReader(BaseReader):
    def get_content(self, filepath: str) -> str:
        return "This is a test document. It contains words.\nIt is only a test document."

    def get_metadata(self, filepath: str) -> dict:
        return {
            "title": "Test Document",
            "author": "Test Author",
            "id": 1234,
        }


def test_create_document():
    doc = Document(filepath="dummy_path.txt", reader=MockReader())
    assert doc.title == "Test Document"
    assert doc.author == "Test Author"
    assert isinstance(doc.id, int) and doc.id == 1234


def test_line_count():
    doc = Document(filepath="dummy_path.txt", reader=MockReader())
    assert doc.line_count == 2


def test_get_word_occurrence():
    doc = Document(filepath="dummy_path.txt", reader=MockReader())
    assert doc.get_word_occurrence("test") == 2
Callout

This time we are using a MockReader class that implements the BaseReader interface. We could also use a fixture here, but this is simpler for demonstration purposes.
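For completeness, the fixture version might look like the sketch below. It assumes pytest is available, and BaseReader is stubbed in so the snippet stands alone; in the real project you would import it from textanalysis_tool.readers.base_reader instead.

```python
from abc import ABC, abstractmethod

import pytest


# Stub of BaseReader so this snippet is self-contained; in the project,
# import it from textanalysis_tool.readers.base_reader instead.
class BaseReader(ABC):
    @abstractmethod
    def get_content(self, filepath: str) -> str: ...

    @abstractmethod
    def get_metadata(self, filepath: str) -> dict: ...


class MockReader(BaseReader):
    def get_content(self, filepath: str) -> str:
        return "This is a test document.\nIt is only a test document."

    def get_metadata(self, filepath: str) -> dict:
        return {"title": "Test Document", "author": "Test Author", "id": 1234}


@pytest.fixture
def mock_reader():
    return MockReader()


# pytest injects the fixture by matching the parameter name
def test_metadata(mock_reader):
    assert mock_reader.get_metadata("dummy.txt")["title"] == "Test Document"
```

A fixture pays off once several tests need the same reader, since each test receives a fresh instance without repeating the setup.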

Challenge

Challenge 1: Writing Tests for the Readers

Now that we have our TextReader and HTMLReader classes, we need to write tests for them. Create a new directory called tests/readers, and then create two new test files: test_text_reader.py and test_html_reader.py. Write tests for the get_content and get_metadata methods of each reader based on our previous tests for the PlainTextDocument and HTMLDocument classes.

tests/readers/test_text_reader.py

PYTHON

import pytest
from unittest.mock import mock_open

from textanalysis_tool.readers.text_reader import TextReader

TEST_DATA = """
Title: Test Document

Author: Test Author

Release date: January 1, 2001 [eBook #1234]
                Most recently updated: February 2, 2002

*** START OF THE PROJECT GUTENBERG EBOOK TEST ***
This is a test document. It contains words.
It is only a test document.
*** END OF THE PROJECT GUTENBERG EBOOK TEST ***
"""


@pytest.fixture(autouse=True)
def mock_file(monkeypatch):
    mock = mock_open(read_data=TEST_DATA)
    monkeypatch.setattr("builtins.open", mock)
    return mock


def test_get_content():
    reader = TextReader()
    content = reader.get_content("dummy_path.txt")
    assert "This is a test document." in content
    assert "It is only a test document." in content


def test_get_metadata():
    reader = TextReader()
    metadata = reader.get_metadata("dummy_path.txt")
    assert metadata["title"] == "Test Document"
    assert metadata["author"] == "Test Author"
    assert metadata["id"] == 1234

tests/readers/test_html_reader.py

PYTHON

import pytest
from unittest.mock import mock_open

from textanalysis_tool.readers.html_reader import HTMLReader

TEST_DATA = """
<head>
  <meta name="dc.title" content="Test Document">
  <meta name="dcterms.source" content="https://www.gutenberg.org/files/1234/1234-h/1234-h.htm">
  <meta name="dc.creator" content="Test Author">
</head>
<body>
  <h1>Test Document</h1>
  <p>
    This is a test document. It contains words.
    It is only a test document.
  </p>
</body>
"""


@pytest.fixture(autouse=True)
def mock_file(monkeypatch):
    mock = mock_open(read_data=TEST_DATA)
    monkeypatch.setattr("builtins.open", mock)
    return mock


def test_get_content():
    reader = HTMLReader()
    content = reader.get_content("dummy_path.html")
    assert "This is a test document." in content
    assert "It is only a test document." in content


def test_get_metadata():
    reader = HTMLReader()
    metadata = reader.get_metadata("dummy_path.html")
    assert metadata["title"] == "Test Document"
    assert metadata["author"] == "Test Author"
    assert metadata["id"] == 1234
Challenge

Challenge 2: Adding a New Reader

We have one last file type we haven’t added support for yet: EPUB. Create a new reader class for EPUB files called EPUBReader in a file called readers/epub_reader.py. You can use the ebooklib package to read EPUB files. You can install it with uv:

BASH

uv add ebooklib

You can refer to the package documentation here

To get the metadata from an epub file, you can use the get_metadata method of the EpubBook class. Project Gutenberg uses the “Dublin Core” metadata standard, so the namespace is “DC”.

Here’s an example of how to get the title:

PYTHON

book.get_metadata(namespace="DC", name="title")[0][0]

The get_metadata method returns a list of tuples, where the first element is the value and the second element is the attributes, so we need to access the first element of the first tuple to get the actual title.

The other metadata fields we need are “creator” (author) and “source” (id).
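To make the list-of-tuples shape concrete, here is a small standalone illustration. The values below are hypothetical stand-ins for what ebooklib might return for our test book, not output from a real file:

```python
# Hypothetical return value of book.get_metadata(namespace="DC", name="title"):
# a list of (value, attributes) tuples.
raw_title = [("Test Document", {})]

# The actual title is the first element of the first tuple.
title = raw_title[0][0]
print(title)  # Test Document
```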

To get the content from an epub file, we can iterate over all of the “items” in the book that are a document, then use BeautifulSoup to extract the text from the HTML content.

PYTHON

for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    soup = BeautifulSoup(item.get_content(), features="html.parser")
    text = soup.get_text()
    # Do something with the text

PYTHON

import re

from bs4 import BeautifulSoup
import ebooklib
from ebooklib import epub

from textanalysis_tool.readers.base_reader import BaseReader


class EPUBReader(BaseReader):
    SOURCE_URL_PATTERN = "https://www.gutenberg.org/files/([0-9]+)/[0-9]+-h/[0-9]+-h.htm"

    def read(self, filepath: str) -> epub.EpubBook:
        book = epub.read_epub(filepath)
        if not book:
            raise ValueError("The file could not be parsed as EPUB.")
        return book

    def get_content(self, filepath):
        book = self.read(filepath)
        text = ""
        for section in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
            content = section.get_content()
            soup = BeautifulSoup(content, features="html.parser")
            text += soup.get_text()
        return text

    def get_metadata(self, filepath) -> dict:
        book = self.read(filepath)

        source_url = book.get_metadata(namespace="DC", name="source")[0][0]
        extracted_id = re.search(self.SOURCE_URL_PATTERN, source_url, re.DOTALL).group(1)

        metadata = {
            "title": book.get_metadata(namespace="DC", name="title")[0][0],
            "author": book.get_metadata(namespace="DC", name="creator")[0][0],
            "id": int(extracted_id) if extracted_id else None,
        }
        return metadata
Key Points
  • Composition allows us to build complex functionality by combining several smaller, simpler classes
  • Composition promotes separation of concerns, reusability, flexibility, and maintainability
  • By using abstract base classes, we can define interfaces that subclasses must implement, allowing for flexibility in our code design

Content from Static Code Analysis


Last updated on 2025-09-16 | Edit this page

Overview

Questions

  • What is static code analysis?
  • How can static code analysis tools help improve code quality?

Objectives

  • Implement ruff and mypy in a Python project
  • Understand how to read and fix issues reported by ruff and mypy

What is Static Code Analysis?


Static code analysis tools are programs or scripts that analyze source code without executing it in order to identify potential issues, bugs, or code smells. These tools can help developers improve code quality, maintainability, and adherence to coding standards.

We’re going to look at two static code analysis tools in this episode: ruff and mypy.

Each of these tools can be run from the command line, and they can also be integrated into your development workflow, such as in your text editor, as a pre-commit hook, or in your continuous integration (CI) pipeline.

Ruff


Ruff is a fast Python linter and code formatter that takes over the roles of several other tools, including flake8, pylint, and isort.

We can install ruff as a development dependency using uv:

BASH

uv add ruff --dev

We can then run ruff on our codebase to identify any issues:

BASH

uv run ruff check .

Did you get any output? Depending on your IDE and its settings, you might have already fixed some of the issues.

The default configuration for ruff only checks for a few types of issues. We can customize the configuration by adding a section for ruff in our pyproject.toml file:

TOML

[tool.ruff]
# Exclude specific files and directories from ruff
exclude = [
    ".venv",
    "__init__.py",
]
line-length = 100
indent-width = 4

[tool.ruff.format]
quote-style = "double"
indent-style = "space"
skip-magic-trailing-comma = false
line-ending = "auto"

[tool.ruff.lint]
# Enable specific linting rules
# - "D": Docstring-related rules (Not included for this workshop)
# - "E", "W": PEP8 style errors
# - "F": Flake8 compatibility
# - "I": Import-related rules (isort)
# - "B": Bugbear (Extended pycodestyle checks)
# - "PL": Pylint compatibility
# - "C90": McCabe complexity checks (identify code with large numbers of paths - should be refactored)
# - "N": Naming conventions for classes, functions, variables, etc.
# - "ERA": Remove commented out code
# - "RUF": Ruff-specific rules
# - "TID": Tidy Imports
# - "SIM": Simplify (flake8-simplify rules)
select = ["E", "W", "F", "I", "B", "PL", "C90", "N", "ERA", "RUF", "TID", "SIM"]

# These are personal preferences
# D203 - Don't require a space between the docstring and the class or function definition
# D212 - The summary of the docstring can go on the line below the triple quotes
ignore = ["D203", "D212"]

This configuration adds a number of additional rules to check for. Here is the output from running ruff with this configuration on our codebase:

I001 [*] Import block is un-sorted or un-formatted
 --> src\textanalysis_tool\readers\epub_reader.py:1:1
  |
1 | / import re
2 | |
3 | | from bs4 import BeautifulSoup
4 | | import ebooklib
5 | |
6 | | from textanalysis_tool.readers.base_reader import BaseReader
  | |____________________________________________________________^
  |
help: Organize imports

E501 Line too long (136 > 100)
  --> src\textanalysis_tool\readers\text_reader.py:10:101
   |
 8 | …
 9 | …)\]"
10 | …GUTENBERG EBOOK .*? \*\*\*(.*?)\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*"
   |                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
11 | …
12 | …
   |

I001 [*] Import block is un-sorted or un-formatted
 --> tests\readers\test_html_reader.py:1:1
  |
1 | / import pytest
2 | | from unittest.mock import mock_open
3 | |
4 | | from textanalysis_tool.readers.html_reader import HTMLReader
  | |____________________________________________________________^
5 |
6 |   TEST_DATA = """
  |
help: Organize imports

PLR2004 Magic value used in comparison, consider replacing `1234` with a constant variable
  --> tests\readers\test_html_reader.py:41:30
   |
39 |     assert metadata["title"] == "Test Document"
40 |     assert metadata["author"] == "Test Author"
41 |     assert metadata["id"] == 1234
   |                              ^^^^
   |

I001 [*] Import block is un-sorted or un-formatted
 --> tests\readers\test_text_reader.py:1:1
  |
1 | / import pytest
2 | | from unittest.mock import mock_open
3 | |
4 | | from textanalysis_tool.readers.text_reader import TextReader
  | |____________________________________________________________^
5 |
6 |   TEST_DATA = """
  |
help: Organize imports

PLR2004 Magic value used in comparison, consider replacing `1234` with a constant variable
  --> tests\readers\test_text_reader.py:40:30
   |
38 |     assert metadata["title"] == "Test Document"
39 |     assert metadata["author"] == "Test Author"
40 |     assert metadata["id"] == 1234
   |                              ^^^^
   |

PLR2004 Magic value used in comparison, consider replacing `1234` with a constant variable
  --> tests\test_document.py:21:50
   |
19 |     assert doc.title == "Test Document"
20 |     assert doc.author == "Test Author"
21 |     assert isinstance(doc.id, int) and doc.id == 1234
   |                                                  ^^^^
   |

PLR2004 Magic value used in comparison, consider replacing `2` with a constant variable
  --> tests\test_document.py:26:30
   |
24 | def test_line_count():
25 |     doc = Document(filepath="dummy_path.txt", reader=MockReader())
26 |     assert doc.line_count == 2
   |                              ^
   |

PLR2004 Magic value used in comparison, consider replacing `2` with a constant variable
  --> tests\test_document.py:31:47
   |
29 | def test_get_word_occurrence():
30 |     doc = Document(filepath="dummy_path.txt", reader=MockReader())
31 |     assert doc.get_word_occurrence("test") == 2
   |                                               ^
   |

Found 9 errors.
[*] 3 fixable with the `--fix` option.

Auto-fixing Issues

Notice that the end of the message says that 3 of the issues are fixable with the --fix option. We can run ruff again with this option to automatically fix them:

BASH

uv run ruff check . --fix

Ruff will automatically fix issues that have very clear solutions, such as sorting imports and fixing spacing. This command will modify your source files, so be sure to review the changes just in case, but it should never modify the logic of your code.

Ignoring Files and Rules

Some of the issues reported by ruff don’t really make sense for our project. For example, it is complaining about “magic numbers” in our tests. These are numbers that appear directly in the code without being assigned to a named constant. In tests, this is often fine, since the numbers are used in a clear context, and assigning them to a constant variable would just add unnecessary complexity.
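For reference, this is the kind of change PLR2004 is asking for. The snippet below is a contrived standalone example, not code from our project:

```python
# A "magic number" directly in an assertion, which PLR2004 flags:
assert 60 * 60 * 24 == 86400

# The same check with a named constant, which the rule prefers:
SECONDS_PER_DAY = 86400
assert 60 * 60 * 24 == SECONDS_PER_DAY
```

In application code the named constant documents intent, but in a test like ours the literal is already clear in context.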

We can tell ruff to ignore specific rules for specific files or directories. We already have an example of this in our pyproject.toml file, where we tell ruff to ignore the __init__.py files in our codebase. We can add the following to ignore the magic number rule (PLR2004) in our tests directory:

TOML

# Ignore magic number rule in tests
[tool.ruff.lint.per-file-ignores]
"tests/**" = ["PLR2004"]

Docstrings


We had removed the check for docstrings (D) from our ruff configuration, but let’s re-enable it for a moment to see what it reports and how to fix it.

TOML

select = ["D", "E", "W", "F", "I", "B", "PL", "C90", "N", "ERA", "RUF", "TID", "SIM"]

To run ruff on a single file, we can specify the file path instead of a directory:

BASH

uv run ruff check src/textanalysis_tool/document.py

It looks like we get six errors, all related to missing docstrings.

What exactly is a docstring?

A docstring is a special type of comment that is used to document a module, class, method, or function in Python. Docstrings are written using triple quotes (""") and are placed immediately after the definition of the module, class, method, or function they are documenting.

There are several different styles for writing docstrings; the most common are the Google style, the NumPy style, and the reStructuredText (Sphinx) style.

The choice of style often depends on the conventions used in a particular project or organization.

Let’s stick with the Google style for this project. Here’s an example of a docstring for one of the functions in our Document class:

PYTHON

def get_word_occurrence(self, word: str) -> int:
    """
    Count the number of occurrences of the given word in the document content.

    Args:
        word (str): The word to count occurrences of.

    Returns:
        int: The number of occurrences of the word in the document content.

    """

    return self.content.lower().count(word.lower())

We can see that the Docstring is made up of different sections:

  • A brief summary of what the function does
  • An Args section that describes the function’s parameters
  • A Returns section that describes what the function returns
Callout

Docstrings are not interpreted as code by Python, so they don’t affect the runtime behavior of your program. They are primarily for documentation purposes, and can be accessed using the built-in help() function.
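As a quick standalone illustration (not part of our project), the docstring of a function is stored on its __doc__ attribute, which is the text that help() displays:

```python
def greet(name: str) -> str:
    """Return a friendly greeting for the given name."""
    return f"Hello, {name}!"

# help(greet) renders this same text in a formatted help page:
print(greet.__doc__)  # Return a friendly greeting for the given name.
```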

Running the ruff command again shows that we have fixed one of the issues in this file.

Challenge

Challenge 1: Docstrings for the Document Class

Add a docstring for the document.py module, the Document class, and the __init__ method of the Document class. Make sure to include a brief summary, as well as Args and Returns sections where appropriate.

Refer to the Google Style Guide - Comments in Modules section for guidance.

Mypy


One of the common complaints about Python is that it is a dynamically typed language, which can lead to type-related errors that are only caught at runtime. To help mitigate this, Python supports type hints, which allow developers to specify the expected types of variables, function parameters, and return values.

We’ve been using type hints all along in this project, but as these are only used by the IDE and the user, there’s no guarantee that the types are actually correct. This is where mypy comes in.
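To see why a separate checker is useful, consider this contrived example (not from our project). The hint claims n is an int, but Python happily runs the call with a str, because hints are not enforced at runtime; mypy would flag the incompatible argument type before the code ever runs:

```python
def double(n: int) -> int:
    """Double a number."""
    return n * 2

# No runtime error: multiplying a string repeats it, so the type bug
# silently produces the wrong kind of result. mypy catches this statically.
print(double("ab"))  # abab
```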

We can install mypy as a development dependency using uv:

BASH

uv add mypy --dev

Then run it on our codebase:

BASH

uv run mypy src

Your output might look something like this:

src\textanalysis_tool\readers\html_reader.py:28: error: Item "None" of "PageElement | None" has no attribute "find_next_siblings"  [union-attr]
src\textanalysis_tool\readers\html_reader.py:40: error: Return type "str" of "get_metadata" incompatible with return type "dict[Any, Any]" in supertype "textanalysis_tool.readers.base_reader.BaseReader"  [override]
src\textanalysis_tool\readers\html_reader.py:43: error: Value of type "PageElement | None" is not indexable  [index]
src\textanalysis_tool\readers\html_reader.py:44: error: Value of type "PageElement | None" is not indexable  [index]
src\textanalysis_tool\readers\html_reader.py:45: error: Value of type "PageElement | None" is not indexable  [index]
src\textanalysis_tool\readers\html_reader.py:47: error: Item "None" of "Match[str] | None" has no attribute "group"  [union-attr]
src\textanalysis_tool\readers\html_reader.py:49: error: Incompatible return value type (got "dict[str, Any]", expected "str")  [return-value]
src\textanalysis_tool\readers\epub_reader.py:3: error: Skipping analyzing "ebooklib": module is installed, but missing library stubs or py.typed marker  [import-untyped]
src\textanalysis_tool\readers\epub_reader.py:3: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
src\textanalysis_tool\readers\epub_reader.py:31: error: Item "None" of "Match[str] | None" has no attribute "group"  [union-attr]
Found 9 errors in 2 files (checked 7 source files)

These errors are slightly more complex than the ones reported by ruff, but they are also very useful, as they often point to places where our code might not handle all of the possible cases correctly.

Fixing Mypy Errors

MyPy Errors can be a bit tricky to understand, as they often involve a more complex analysis of the code than ruff.

For example, the first error is telling us that we are trying to access the find_next_siblings method on an object that could be None. This is a potential bug, as if the object is None, this will raise an AttributeError at runtime.

Looking at the project code, the issue is in the HTMLReader class, in the get_content method:

PYTHON

    def get_content(self, filepath) -> dict:
        soup = self.read(filepath)

        # Find the first h1 tag (The book title)
        title_h1 = soup.find("h1")

        # Collect all the content after the first h1
        content = []
        for element in title_h1.find_next_siblings():
            text = element.get_text(strip=True)

            # Stop early if we hit this text, which indicates the end of the book
            if "END OF THE PROJECT GUTENBERG EBOOK" in text:
                break

            if text:
                content.append(text)

        return "\n\n".join(content)

The soup.find("h1") call will return None if no h1 tag is found in the HTML document. We should probably add a check for this case and raise a more informative error message.

PYTHON

    def get_content(self, filepath) -> dict:
        soup = self.read(filepath)

        # Find the first h1 tag (The book title)
        title_h1 = soup.find("h1")
        if title_h1 is None:
            raise ValueError(f"No <h1> tag found in the HTML document: {filepath}")

        # Collect all the content after the first h1
        content = []
        for element in title_h1.find_next_siblings():
            text = element.get_text(strip=True)

            # Stop early if we hit this text, which indicates the end of the book
            if "END OF THE PROJECT GUTENBERG EBOOK" in text:
                break

            if text:
                content.append(text)

        return "\n\n".join(content)
Callout

Note that we don’t have to change the code in the for loop, as mypy is smart enough to understand that if title_h1 is None, the ValueError will be raised and the code following it will never execute.

Challenge

Challenge 1: Fix a Mypy Error

Another error we get is related to the get_metadata method in the same class:

src\textanalysis_tool\readers\html_reader.py:45: error: Value of type "PageElement | None" is not indexable  [index]
src\textanalysis_tool\readers\html_reader.py:46: error: Value of type "PageElement | None" is not indexable  [index]
src\textanalysis_tool\readers\html_reader.py:47: error: Value of type "PageElement | None" is not indexable  [index]

Why is this error being reported? What can we do to fix it? (There’s actually two issues here! One is a more specific BeautifulSoup issue, and the other is a more general Python issue.)

Values in Python dictionaries can be accessed in two ways:

  • Using the indexing syntax: value = my_dict[key]
  • Using the get method: value = my_dict.get(key)

The indexing syntax will raise a KeyError if the key is not found in the dictionary, while the get method will return None (or a default value if provided) if the key is not found.
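A quick standalone demonstration of the difference:

```python
my_dict = {"title": "Test Document"}

print(my_dict.get("author"))             # None
print(my_dict.get("author", "Unknown"))  # Unknown

try:
    my_dict["author"]  # indexing a missing key raises
except KeyError:
    print("KeyError raised")
```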

There are several kinds of values that can be returned from a BeautifulSoup.find call, and we can’t really be certain of which type we will get back. It could be a Tag, a NavigableString, a PageElement, or None.

There is an alternative method called select_one that can be used to find elements using plain CSS selectors, and it always returns either a Tag or None, which lets us avoid some of the complexity of dealing with multiple possible types.

soup.find("meta", {"name": "dc.title"}) can be replaced with soup.select_one('meta[name="dc.title"]')

MyPy is pointing out that we are using the indexing syntax on values that might be None, and even when an element is found, the attribute we are looking up might not be present.

We can fix this by checking that each element exists before using it, and by using the get method with a fallback value instead of the indexing syntax.

PYTHON

title_element = soup.select_one('meta[name="dc.title"]')
title = title_element.get("content") if title_element else "Unknown Title"

author_element = soup.select_one('meta[name="dc.creator"]')
author = author_element.get("content") if author_element else "Unknown Author"

url_element = soup.select_one('meta[name="dcterms.source"]')
url = url_element.get("content") if url_element else "Unknown URL"
Challenge

Challenge 2: Fixing Other Mypy Errors

You should have at least one other mypy error in the HTMLReader class. Can you find it and fix it? Do you have any other mypy errors in other parts of the codebase? If so, can you fix those as well?

It will depend on your code!

Key Points
  • There are many static code analysis tools available for Python, each with its own strengths and weaknesses.
  • Ruff is a fast linter and code formatter that can replace several other tools, including flake8, pylint, and isort.
  • MyPy is a static type checker that can help catch type-related errors in Python code.

Content from Building and Deploying a Package


Last updated on 2025-09-16 | Edit this page

Overview

Questions

  • How do we build and deploy our project as a package?

Objectives

  • Deploy our project locally as a python package
  • Set up Github Actions to automatically build and deploy our package

Building Locally


Since we have been using uv all along and have our pyproject.toml set up, building our package is as simple as running:

BASH

uv build

You should see some output that looks like this:

Building source distribution...
Building wheel from source distribution...
Successfully built dist\textanalysis_tool-0.1.0.tar.gz
Successfully built dist\textanalysis_tool-0.1.0-py3-none-any.whl

And a new directory called dist/ should appear in your project folder. Inside that directory, you should see two files:

  • textanalysis_tool-0.1.0-py3-none-any.whl: This is the wheel file, which is a built package that can be installed.
  • textanalysis_tool-0.1.0.tar.gz: This is the source distribution, which contains the source code of your package.

What Exactly are these files?

A Wheel file is a built package that can be installed using pip. It contains all the necessary files for your package to run, including compiled code if applicable. Wheels are the preferred format for Python packages because they are faster to install and can include pre-compiled binaries.

Callout

Why is it called a “wheel”?

The term “wheel” is used because PyPI was originally known as “the Cheese Shop” (a reference to a Monty Python sketch), and cheese is commonly sold in wheels.

The source distribution is a tarball that contains the source code of your package. You can open it with a tool like tar or 7-Zip to see the contents. It includes all the files needed to build and install your package, including the pyproject.toml file and any other source files.

GitHub Actions


Building the package locally is great for testing, but what we actually want is to have our package built and automatically deployed from GitHub, so that anyone using our package can always get the latest version. GitHub Actions is a tool that allows us to automate tasks in our GitHub repository by configuring a workflow using a YAML file.

Setting Up the Workflow

In your project, create a new directory called .github/workflows/. Inside that directory, create a new file called build.yml. This file will contain the configuration for our GitHub Actions workflow.

Here is the build.yml file we will use:

YAML

name: Build and Deploy Package

on:
  push:
    tags:
      - '[0-9]+.[0-9]+.[0-9]+*' # Deploy when we tag with a semantic version

jobs:
  build:
    runs-on: ubuntu-latest # Use the most up-to-date Ubuntu runner
    permissions:
      contents: write  # Required to create releases

    steps:
    - uses: actions/checkout@v4 # Check out the repository

    # Create a Python environment
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    # Install uv
    - name: Install uv
      uses: astral-sh/setup-uv@v3

    # Build the package
    - name: Build Package
      run: uv build

    # Upload the built package to the repository as an artifact
    - name: Upload Artifacts
      uses: actions/upload-artifact@v4
      with:
        name: textanalysis-tool
        path: dist/

    # Create GitHub Release with artifacts
    - name: Create Release
      uses: softprops/action-gh-release@v1
      with:
        files: dist/*
        generate_release_notes: true
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

    # Publish to TestPyPI
    - name: Publish to TestPyPI
      run: |
        uv publish --publish-url https://test.pypi.org/legacy/ --username __token__ --password ${{ secrets.TEST_PYPI_API_TOKEN }}

Setting up PyPI

Because we are not ready to publish our package to the main PyPI repository, we will use TestPyPI, which is a separate instance of the Python Package Index that allows you to test package publishing and installation without affecting the real PyPI.

For this you will need an account on TestPyPI with two-factor authentication (2FA) enabled. You can create an account at https://test.pypi.org/account/register/.

Once your account is set up, go to your account settings and create an API token with the name github-actions. Copy the token to your clipboard.

Next, we need to store the token in our GitHub repository as a secret. Go to your repository on GitHub, click on “Settings”, then “Secrets and variables”, and then “Actions”. Click on “New repository secret” and name it TEST_PYPI_API_TOKEN. Paste your token into the “Value” field and click “Add secret”.

Triggering the Workflow

git add, git commit, and git push this file to your repository. You should see a new “Actions” tab in your GitHub repository. Click on it, and you should see your workflow listed there:

GitHub Actions Tab with a Workflow
GitHub Actions & Workflow

But nothing has run yet! This is because we have configured our workflow to only run when we push a tag that follows a semantic version format (0.1.0, 1.0.0, etc.). This is a common practice for deploying packages, as it allows us to control when a new version of our package is released.

Let’s create a new tag and push it to our repository. You can do this using the following commands:

BASH

git tag 0.1.0
git push --tags

Looking back at the “Actions” tab, you should see that your workflow has started running! Our project is pretty small, so it should finish quickly. Clicking on the workflow will show you the steps that were executed:

GitHub Actions Successful Build
GitHub Actions Running

You can see the “Setup Python”, “Install uv”, “Build Package”, and “Upload Artifacts” steps that we defined in our build.yml file, as well as some additional steps before and after that GitHub automatically adds. If everything went well, you should see a green check mark next to each step.

Clicking on any given job will give some additional details, including the console output.

Callout

GitHub Actions is free for public repositories. Private repositories do have limits on the amount of compute time you can use per month; as of September 2025, the limit on the free plan is 2,000 minutes per month. This is generally more than enough for most open source projects.

Finding the Built Package

So our workflow is successful, but where is our package? Our little script actually put our project in three places:

  • As an artifact in our GitHub repository for this workflow run
  • As a release in our GitHub repository
  • On TestPyPI as a package

The Artifact can be found in the “Actions” tab, by clicking on the workflow run, and then clicking on the “Artifacts” section on the right side of the page. You can download the artifact as a ZIP file, which contains the dist/ directory with our built package files. (This is identical to what we built locally earlier.)

The Release can be found by clicking on the “Releases” link on the right side of your GitHub repository page. You should see a new release with the tag 0.1.0, and the built package files attached as assets.

Finally, the package can be found on TestPyPI by going to https://test.pypi.org/project/textanalysis-tool-{my-name}/.

Installing our package from TestPyPI


Now that our package is published on TestPyPI, we can install it into any Python environment using a package manager. Because we are using test.pypi.org, we need to specify the index URL when installing the package, which we do not have to do for packages on the main PyPI repository.

Create a new virtual environment in a clean folder with uv:

BASH

uv init
uv venv
source .venv/bin/activate  # macOS/Linux; on Windows use .venv\Scripts\activate

Then install our package from TestPyPI using uv’s pip interface:

BASH

uv pip install --index-url https://test.pypi.org/simple/ textanalysis-tool-{my-name}

You should see output indicating that uv is downloading and installing our package. Once the installation is complete, you can verify that the package is installed by running:

BASH

uv pip show textanalysis-tool-{my-name}

Continuous Integration / Continuous Deployment (CI/CD)


One of the advantages of using an automated process like GitHub Actions is that we can also easily add additional steps to our workflow. In previous episodes, we have added unit tests to our project, as well as static code analysis using ruff and mypy. We can easily add these steps to our workflow to ensure that our code is always tested and checked before it is built and deployed.

Here’s the updated build.yml file with the additional steps:

YAML

name: Build and Deploy Package

on:
  push:
    tags:
      - '[0-9]+.[0-9]+.[0-9]+*' # Deploy when we tag with a semantic version

jobs:
  build:
    runs-on: ubuntu-latest # Use the most up-to-date Ubuntu runner
    permissions:
      contents: write  # Required to create releases

    steps:
    - uses: actions/checkout@v4 # Check out the repository

    # Create a Python environment
    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    # Install uv
    - name: Install uv
      uses: astral-sh/setup-uv@v3

    # Install project with dev dependencies
    - name: Install dependencies
      run: uv sync --dev

    # Static Code Analysis with ruff
    - name: Run ruff
      run: uv run ruff check .

    # Type Checking with mypy
    - name: Run mypy
      run: uv run mypy src

    # Running Tests with pytest
    - name: Run tests
      run: uv run pytest tests

    # Build the package
    - name: Build Package
      run: uv build

    # Upload the built package to the repository as an artifact
    - name: Upload Artifacts
      uses: actions/upload-artifact@v4
      with:
        name: textanalysis-tool
        path: dist/

    # Create GitHub Release with artifacts
    - name: Create Release
      uses: softprops/action-gh-release@v1
      with:
        files: dist/*
        generate_release_notes: true
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

    # Publish to TestPyPI
    - name: Publish to TestPyPI
      run: |
        uv publish --publish-url https://test.pypi.org/legacy/ --username __token__ --password ${{ secrets.TEST_PYPI_API_TOKEN }}

Update the version number in your pyproject.toml file to 0.2.0, and then create a new tag and push it to your repository:

BASH

git tag 0.2.0
git push --tags

Now, unless you already fixed every issue in the earlier episodes, you will likely see that the ruff, mypy, or pytest steps fail. This is great! It means that we have caught issues in our code before they made it into a release! Only once all the tests pass and the code checks are clean will the package be built and deployed.

Callout

Making a new tag for each release is good practice, however there are also occasions where you might want to run some of these jobs regularly when merging in code, or on pushes to certain branches. This is possible - there is an example of this in the Workshop Addendum.

Challenge

Challenge 1:

Key Points
  • We can build our package locally using uv build, which creates a dist/ directory with the built package files.
  • We can use GitHub Actions to automate the building and deployment of our package.
  • GitHub Actions workflows are defined using YAML files in the .github/workflows/ directory.
  • We can trigger our workflow to run on specific events, such as pushing a tag that follows semantic versioning.
  • We can add additional steps to our workflow, such as running tests and static code analysis, to ensure code quality before deployment.