(Talk) Clean Code
Last updated on 2026-05-22
Clean Code
What exactly do you mean by “clean code”?
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”
– Martin Fowler
“Clean Coding” is a practice that aims to make code as easy as possible to understand, maintain, and extend in the future.
Imagine that you are writing a lab notebook for yourself (or a fellow researcher) about an experiment you are conducting. You would want the notebook to be clear, concise, and easy to follow, so that the steps you took, the results you obtained, and the conclusions you drew are as clear as possible.
There are some general rules around writing clean code that apply to most programming languages. These rules are not set in stone, but they are generally accepted as good practices. They are designed to make your code easier to read, understand, and maintain.
Clean Code is Readable
The most important aspect of clean code is readability. Code is read far more often than it is written, so it is important to make sure that your code is easy to read and understand.
Certain languages are by design closer to English text, and therefore easier to read, but even in a language like Python, which is known for its readability, it is still possible to write code that is difficult to understand. Conversely, even in a language like C, which is known for its complexity, it is possible to write code that is easy to understand by using clear variable names and comments.
Here’s an example of some code that is difficult to read:
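A minimal sketch of the kind of code meant here, assuming single-letter names and no comments (the names f and x are illustrative and not taken from the original lesson):

```python
def f(x):
    if x <= 1:
        return x
    return f(x - 1) + f(x - 2)
```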
It might be pretty clear what this code is doing if you are familiar with the Fibonacci sequence, but it’s not immediately obvious and might take a few seconds, especially if you are not familiar with the language. Here’s the same code, but with better variable names, comments, and a docstring:
PYTHON
def fibonacci(n) -> int:
    """
    Calculate the nth number in the Fibonacci sequence using a recursive algorithm.

    Args:
        n (int): The index of the number to calculate.

    Returns:
        int: The nth number in the Fibonacci sequence.
    """
    # The first two numbers in the Fibonacci sequence are 0 and 1 - if we get either of these
    # values, just return them
    if n <= 1:
        return n
    else:
        # For values larger than 1, the nth number is the sum of the (n-1)th and (n-2)th numbers
        # Recursively calculate the nth number in the Fibonacci sequence by calculating the
        # (n-1)th and (n-2)th numbers and adding them together.
        return fibonacci(n-1) + fibonacci(n-2)
Clean Code is Well Commented
Comments are an important part of writing clean code, helping to explain what the code is doing, why it is doing it, and how it is doing it. Comments can be used to explain complex or non-obvious parts of the code, and to provide context for the code.
It is possible to over-comment code - it’s not necessary to comment every line of code, especially if the code is self-explanatory (which, if you are using good variable names, it often will be!).
An example of code that contains good comments:
PYTHON
import glob
from pathlib import Path

import pandas as pd

# Find all of the files in the directory that have a .csv extension
csv_file_list = glob.glob("*.csv")
for filename in csv_file_list:
    # Read in the data from the file
    data = pd.read_csv(filename)
    # Calculate the mean of the data grouped by the 'group_identifier' column, then reset the
    # index and rename the 'value' column to 'mean_value'.
    results = (
        data
        .groupby("group_identifier")
        .mean()
        .reset_index()
        .rename(columns={"value": "mean_value"})
    )
    # Create a new filename for the results and save them to a new file
    output_filename = Path(filename).stem + "_results.csv"
    results.to_csv(output_filename)
There are also a few examples of bad comments:
- It’s ok to use a comment like “# TODO: this requires further investigation” to indicate that there is something that needs to be fixed or improved, but these have a tendency to stick around in code for a long time and can become noise.
- Commented out code is another common issue. It’s better to use version control to keep track of changes to the code, rather than commenting out code that is no longer needed.
- Comments which exist just to exist. In the example above, we have the line “for filename in csv_file_list:”. Since we are using good variable names to begin with, it is clear from the code what this line is doing, and no comment is needed.
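As a rough sketch of the difference between a comment that restates the code and one that explains a decision, consider the following. The function name, the threshold, and the reasoning given in the comment are all assumptions made up for this illustration.

```python
def mean_of_valid_scores(scores, minimum_score=5):
    """Return the mean of all scores at or above minimum_score, or None if there are none."""
    # Unhelpful here would be "loop over the scores", which only restates the next line.
    # More useful is the reason behind the filter, which the code cannot show by itself:
    # scores below the minimum are treated as failed measurements (an assumption for
    # this example), so they are dropped rather than averaged in.
    valid_scores = [score for score in scores if score >= minimum_score]
    if not valid_scores:
        return None  # nothing to average; the caller decides how to handle this
    return sum(valid_scores) / len(valid_scores)
```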
Clean Code Uses Descriptive Variable Names
This could be considered a subset of readability, but it is so important that it deserves its own section. Variable names should be descriptive and meaningful. They should describe as succinctly as possible what the variable represents and what it is used for. Where possible, use full words and avoid abbreviations. If a variable name is too long, it might be a sign that the variable is doing too much and should be split into multiple variables.
Some examples of poor variable names:
- x, i, j - these are common loop variables, but they don’t tell you anything about what they are used for. There are exceptions[1] to this rule, but in general, you should avoid using single-letter variable names.
- temp - why is this value temporary? What is it being used for?
- data - what kind of data is this? What does it represent?
- list1 - what kind of list? Why is this the first list?
Some examples of good variable names:
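As a sketch of what more descriptive names can look like, the snippet below contrasts the two styles. All names and values are hypothetical and chosen only for illustration.

```python
# Poor: the names say nothing about what the values mean.
x = 12
list1 = [4.5, 7.0, 9.5]

# Better: each name states what the value represents, including its unit.
number_of_participants = 12
reaction_times_seconds = [4.5, 7.0, 9.5]
mean_reaction_time_seconds = sum(reaction_times_seconds) / len(reaction_times_seconds)
```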
Naming Style
As a subset of variable naming, there are several common conventions for formatting names:
- snake_case: Words are separated by underscores. For example, total_value, number_of_files, user_id.
- camelCase: The first letter of each word is capitalized, except for the first word. For example, totalValue, numberOfFiles, userId.
- kebab-case: Words are separated by hyphens. For example, total-value, number-of-files, user-id.
- PascalCase: The first letter of each word is capitalized. For example, TotalValue, NumberOfFiles, UserId.
- UPPERCASE: All letters are capitalized, with words separated by underscores. For example, TOTAL_VALUE, NUMBER_OF_FILES, USER_ID.
The specific convention you use is less important than being consistent.[4]
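For Python specifically, a common arrangement (following the PEP 8 style guide) is snake_case for variables and functions, PascalCase for classes, and upper case with underscores for constants. The sketch below shows the three side by side; every name in it is hypothetical.

```python
MAX_RETRIES = 3  # constant: upper case with underscores


class DataLoader:  # class: PascalCase
    def __init__(self, file_path):
        self.file_path = file_path  # attribute and argument: snake_case


def count_lines(text_block):  # function: snake_case
    line_count = len(text_block.splitlines())
    return line_count
```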
Clean Code is Maintainable
Maintainability is the ease with which code can be modified. Take this example, which is a very common intro problem in programming:
PYTHON
for i in range(1, 101):
    if i % 3 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
Now imagine that you are asked to modify this code so that it prints “Fizz” for every even number, “Foo” for every number that is divisible by 5, and “FizzBuzz” for every number that is divisible by both 7 and 5. Also, instead of checking 100 numbers, we only want to print out the first 40. These are simple changes, but they still require modifying the code in multiple places.
PYTHON
for i in range(1, 41):
    if i % 7 == 0 and i % 5 == 0:
        print("FizzBuzz")
    elif i % 2 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Foo")
    else:
        print(i)
So how might we rewrite this code to be more maintainable? One way might be to create a list of rules that we can iterate over, checking each rule in turn to see if it applies to the current number. This way, we can easily add, remove, or modify rules without having to change the core logic of the loop.
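One possible sketch of the rule-list idea is shown below. The names RULES, UPPER_LIMIT, and label_for are assumptions introduced for this example. Rules are checked in order and the first match wins, which mirrors the if/elif chain above.

```python
# Each rule pairs a test with the text to print.
# The order matters: the first rule that matches is used.
RULES = [
    (lambda number: number % 7 == 0 and number % 5 == 0, "FizzBuzz"),
    (lambda number: number % 2 == 0, "Fizz"),
    (lambda number: number % 5 == 0, "Foo"),
]
UPPER_LIMIT = 40


def label_for(number, rules=RULES):
    """Return the label of the first matching rule, or the number itself as text."""
    for rule_applies, label in rules:
        if rule_applies(number):
            return label
    return str(number)


for number in range(1, UPPER_LIMIT + 1):
    print(label_for(number))
```

With this structure, changing the behaviour means editing the RULES list or UPPER_LIMIT, while the loop and label_for stay untouched.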
Clean Code is DRY
DRY stands for “Don’t Repeat Yourself”. This is a principle that states that you should avoid writing the same code multiple times. As a general rule, if you find yourself copying and pasting code, you should pause and think about whether there is a better way to structure your code so that you can reuse the same logic in multiple places.
Though copy/paste can be a quick way to get something done, it can lead to problems down the line when it comes time to make changes. You might forget to update all the copies of the code, or you might make a mistake in one copy that you don’t make in another. Take the following example, where someone wants to run the same analysis twice, but on different datasets:
PYTHON
import pandas as pd

###
### Analyse dataset from 2024-01-01
###
dataset_df = pd.read_csv("2024_01_01_dataset.csv")
# exclude records with a missing score; by default read_csv parses placeholder
# values like "N/A" and "NA" as missing (NaN)
filtered_df = dataset_df.dropna(subset=["score"])
# exclude records that have a score less than 5
filtered_df = filtered_df[filtered_df["score"] >= 5]
# find the mean and standard deviation of the remaining scores
mean_score = filtered_df["score"].mean()
std_dev = filtered_df["score"].std()
print(f"Dataset 2024_01_01_dataset.csv: Mean score: {mean_score}, Standard deviation: {std_dev}")

###
### Analyse dataset from 2024-02-01
###
dataset_df = pd.read_csv("2024_02_01_dataset.csv")
# exclude records with a missing score; by default read_csv parses placeholder
# values like "N/A" and "NA" as missing (NaN)
filtered_df = dataset_df.dropna(subset=["score"])
# exclude records that have a score less than 5
filtered_df = filtered_df[filtered_df["score"] >= 5]
# find the mean and standard deviation of the remaining scores
mean_score = filtered_df["score"].mean()
std_dev = filtered_df["score"].std()
print(f"Dataset 2024_02_01_dataset.csv: Mean score: {mean_score}, Standard deviation: {std_dev}")
This is a pretty short example, but you can see how as new datasets are added, the code will quickly become unwieldy.
With some quick refactoring, we can make this code more DRY by extracting the common logic into a function that we can reuse for each dataset, then looping over the datasets.
PYTHON
import pandas as pd

datasets = ["2024_01_01_dataset.csv", "2024_02_01_dataset.csv"]


def analyse_dataset(file_path: str):
    dataset_df = pd.read_csv(file_path)
    # exclude records with a missing score; by default read_csv parses placeholder
    # values like "N/A" and "NA" as missing (NaN)
    filtered_df = dataset_df.dropna(subset=["score"])
    # exclude records that have a score less than 5
    filtered_df = filtered_df[filtered_df["score"] >= 5]
    # find the mean and standard deviation of the remaining scores
    mean_score = filtered_df["score"].mean()
    std_dev = filtered_df["score"].std()
    return mean_score, std_dev


for dataset in datasets:
    mean_score, std_dev = analyse_dataset(dataset)
    print(f"Dataset {dataset}: Mean score: {mean_score}, Standard deviation: {std_dev}")
This small change makes it much simpler to adapt the code when the processing changes: the update only has to be made once, in the analyse_dataset function, and it then applies to all datasets. Additionally, because the datasets list is defined at the start of the script, it is easy to add or remove datasets from the analysis without having to modify the core logic of the script.
Clean Code is Well Documented
Documentation is an important part of writing clean code. It helps other people (and your future self) understand what the code does, why it does it, and how it does it. The documentation can be as detailed as you are willing to make it, but at a minimum, you should include a docstring for each function that describes what the function does, what arguments it takes, and what it returns.
There are a variety of standards for writing docstrings, both between languages and within them. The most important thing is to select a standard and stick with it. Here are some example docstrings for a function that calculates the area of a circle:
Python:
Google Style Guide
Numpy Style Guide
Sphinx Style Guide
R:
Roxygen2 Style Guide
R
#' Calculate the area of a circle
#'
#' @param radius The radius of the circle
#' @return The area of the circle
#'
#' @examples
#' circle_area(5)
circle_area <- function(radius) {
  return(pi * radius^2)
}
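For the three Python styles listed above, the same function might be documented roughly as follows. This is a sketch: the function name suffixes (_google, _numpy, _sphinx) are hypothetical and exist only so that the three variants can sit side by side in one file.

```python
import math


def circle_area_google(radius: float) -> float:
    """Calculate the area of a circle.

    Args:
        radius (float): The radius of the circle.

    Returns:
        float: The area of the circle.
    """
    return math.pi * radius**2


def circle_area_numpy(radius: float) -> float:
    """
    Calculate the area of a circle.

    Parameters
    ----------
    radius : float
        The radius of the circle.

    Returns
    -------
    float
        The area of the circle.
    """
    return math.pi * radius**2


def circle_area_sphinx(radius: float) -> float:
    """Calculate the area of a circle.

    :param radius: The radius of the circle.
    :type radius: float
    :return: The area of the circle.
    :rtype: float
    """
    return math.pi * radius**2
```

All three functions behave identically; only the layout of the docstring differs.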
File Headers
In addition to documenting functions, it is also important to document the file itself. The exact information will depend on the project and the purpose of the file, so scripts and modules are covered separately here.
Scripts
With scripts in particular, it’s important to include a header that describes the purpose of the script, the author, and the date it was created or last modified. This information can help others understand the context of the script and how it fits into the larger project.
Python
PYTHON
"""
01_data_cleaning.py
Author: Max Mustermann <max.mustermann@email.org>
Date: 2024-01-01
This script reads in a dataset for the <PROJECT> project and applies some basic cleaning steps
to prepare the data for analysis.
These steps include:
- Removing records with missing/invalid values
- Standardizing column names
- Converting data types
- Saving the cleaned dataset to a new file
Note: This script assumes that the dataset is in CSV format.
"""
R
R
#' Data Cleaning Script
#'
#' Author: Max Mustermann <max.mustermann@email.org>
#' Date: 2024-01-01
#'
#' This script reads in a dataset for the <PROJECT> project and applies some basic cleaning steps
#' to prepare the data for analysis.
#'
#' These steps include:
#' - Removing records with missing/invalid values
#' - Standardizing column names
#' - Converting data types
#' - Saving the cleaned dataset to a new file
#'
#' --------------------------------------------------------------------------------------------
#'
#' Notes: This script assumes that the dataset is in CSV format.
#'
#' --------------------------------------------------------------------------------------------
Modules
For modules, it is important to include a header that describes the purpose of the module and its intended usage. This information can help others understand how to use the objects defined in the file and how they fit into the larger project. An example header for a module might look like this:
Python
PYTHON
"""
project_dataset.py
This object is used to load datasets for the <PROJECT> project for use in the model training
pipeline. It is based on the PyTorch Dataset class and provides a simple interface for loading
and preprocessing data.
Usage:
dataset = ProjectDataset(file_path="data.csv")
Parameters:
- file_path (str): The path to the dataset file. The file must be in CSV format.
- train (bool, optional): A flag indicating whether the dataset is for training or testing.
Default is True.
Exceptions:
- FileNotFoundError: If the specified file does not exist
- ValueError: If the file is not in CSV format
"""
R
R
#' ProjectDataset: Load datasets for the <PROJECT> model training pipeline
#'
#' This object is used to load datasets for the <PROJECT> project, providing a simple
#' interface for loading and preprocessing data.
#'
#' @usage dataset <- ProjectDataset(file_path = "data.csv")
#'
#' @param file_path A string specifying the path to the dataset file. The file must be in CSV format.
#' @param train A logical value indicating whether the dataset is for training or testing. Default is TRUE.
#'
#' @return A dataset object ready for use in model training.
#'
#' @examples
#' dataset <- ProjectDataset("data.csv")
#'
#' @throws FileNotFoundError If the specified file does not exist.
#' @throws ValueError If the file is not in CSV format.
Clean Code is Consistent
You might agree or disagree with some of the rules we’ve discussed so far, but the most important by far is consistency. It is better to have code that is consistently formatted and styled than to have code that is formatted and styled in the “best” way some of the time. Consistency makes it easier to read and understand code because you know what to expect.
Tools For Writing Clean Code
There are a number of tools available that can help you write clean code. Some of these tools are linters, which are programs that analyze your code and flag potential issues. Others are formatters, which automatically format your code according to a set of rules.
Linters
- Python: flake8
- Python: Pylint
- Python: ruff
- R: lintr
- JavaScript: ESLint
- Java: Checkstyle
- C/C++: Clang-Tidy
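As a small illustration of what a linter reports, the hypothetical snippet below runs without error but contains two issues that Pyflakes-based linters such as flake8 and ruff flag (for example via "flake8 script.py" or "ruff check script.py", where script.py is a placeholder file name).

```python
import os  # never used: reported as F401 (imported but unused)


def total_score(scores):
    unused_total = 0  # assigned but never read: reported as F841
    return sum(scores)
```

Neither issue changes the result of the function, which is why this kind of clutter tends to go unnoticed without a tool.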
Formatters
In languages that allow any value to be assigned to any variable (like Python) it is useful to include the type of the variable in the name. This is not strictly necessary in languages that require you to declare the type of the variable when you create it (like C or Java).
[4] There are language-specific conventions - for example, Python uses snake_case for variable names and PascalCase for class names. JavaScript uses camelCase for variable names and PascalCase for class names. CSS uses kebab-case for class names.