This lesson is in the early stages of development (Alpha version)

Introduction to High-Performance Computing in Python

Basic syntax

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • Where do I start?

Objectives
  • Understand basic Python syntax and data types.

The most basic use of Python is to use it as a fancy calculator. It is very easy to do basic maths in Python.

print(5 + 1)
6

Note that we don’t always have to use the print() function. When working interactively, leaving out print() gives us the same result as above.

5 + 1
6

Python can do all of the normal basic maths operations you’d expect.

5 + 3
2 - 9
4 * 6
14 / 3
8
-7
24
4.666666666666667

You can also use it for more complicated operations, like exponentiation (**):

5 ** 2
25

Python also supports floor and remainder division. Floor division (//) gives the result of division, rounded down. Remainder division (%) gives the remainder after division.

5 // 2  # floor division
5 % 2   # remainder division
2
1

Python follows the normal order of operations for maths.

4 + 1 * 6
10

However, if you want Python to change the order it does things in, you can use parentheses to specify what to do first. Note that there is no limit to the number of parentheses you can use.

(4 + 1) * 6
30

Variables

Of course, we will probably want to save our answers at some point. We can do this by assigning a variable. In Python, a variable is a name for a saved result. We can set them with the = sign.

weight_kg = 55

If we want to retrieve the information we’ve stored, we can do it by simply typing the name of the variable again.

weight_kg
55

We can perform maths on variables the same way we would normally.

print('weight in pounds:', 2.2 * weight_kg)
weight in pounds: 121.00000000000001

As the example above shows, we can print several things at once by separating them with commas. Note that in this case, the number might appear as 121.00000000000001 due to the way numbers are internally represented in Python.

We can also change a variable’s value by assigning it a new one:

weight_lb = 2.2 * weight_kg
print(weight_lb)
121.00000000000001

What happens when we change a variable? Let’s update weight_kg and see what happens to weight_lb.

print('weight_kg starting value is', weight_kg)
weight_kg = 10000
print('after updating, weight_kg ending value is', weight_kg)
print('weight in lb ending value is', weight_lb)
weight_kg starting value is 55
after updating, weight_kg ending value is 10000
weight in lb ending value is 121.00000000000001

Notice how even though we changed the value of weight_kg, weight_lb did not update. This demonstrates a very important property of programming languages: a computer will not do anything unless you specifically tell it to — nothing ever happens automatically. This is different from the behaviour of a spreadsheet, where a cell will automatically update when the cells it refers to are updated.

If we want to tell Python to update weight_lb to reflect the new value of weight_kg, we will need to perform this operation explicitly.

weight_lb = weight_kg * 2.2
print('new value for weight_lb is', weight_lb)
new value for weight_lb is 22000.0

One more thing to note: what we just did is the best way to learn Python. Don’t know how something works? Try it and find out!

Where are variables stored?

Your computer has two places where it stores information: hard disk and memory. What are they and what are they used for? Where do variables get stored?

Memory is where temporary information on your computer gets placed. It is very fast and easy to access, but has one important drawback: data here is erased when your program quits or your computer shuts down. All information you save as variables in Python will be stored in memory! When programming, we always need to save our data as a file (on our hard disk) if we want to keep it!

Your computer’s hard disk is used to store information long-term. This is where files get stored, and the information on your hard drive is more or less permanent. Hard drives can also store lots of data very cheaply — a terabyte of hard drive space is very cheap, whereas the same amount of memory costs a lot more. So if hard drive space is permanent and super-cheap, why don’t we use it to store all of our data? The biggest reason is speed — memory is typically hundreds, if not thousands of times faster to access. If we stored our variables to our hard disk, our programs would be incredibly slow!
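To make this concrete, here is a minimal sketch of saving a variable to disk and reading it back (the filename weight.txt is just an example):

```python
weight_kg = 55

# write the value to a file on disk so it survives after the program quits
with open('weight.txt', 'w') as f:
    f.write(str(weight_kg))

# read it back later: files store text, so we convert it back to an integer
with open('weight.txt') as f:
    saved = int(f.read())

print(saved)   # 55
```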

Errors

Of course, not everything will always work perfectly. We are going to run into errors. For instance, what happens if we accidentally don’t finish a command?

1 +
  File "<ipython-input-15-70475fc083df>", line 1
    1 +
       ^
SyntaxError: invalid syntax

This is an error. Errors are good! When we do something that Python doesn’t like, it will give us an error message. These error messages are called tracebacks, and often tell us exactly how to fix our stuff!

Let’s walk through this error:

SyntaxError: invalid syntax

All errors have types. This one is a SyntaxError, indicating, well… an error in our syntax. Syntax is “computer-speak” for how things are supposed to be typed. Python only understands certain commands, and typos will mess it up. If we type a command in such a way that Python can’t understand it, we need to fix our syntax (make sure we’ve typed a valid command). Takeaway message: We made an error when typing things.

File "<ipython-input-15-70475fc083df>", line 1
    1 +
       ^

Python is trying to be helpful and tell us exactly where our error occurred. The first thing it does is tell us which file had the error in it. Since we are working interactively in IPython, it gives us the semi-confusing <ipython-input-15-70475fc083df> instead of a filename.

The line 1 bit tells us that our error was on line 1 of our last command. Specifically, Python has printed out the offending line for us, and pointed an arrow (^) at the bad part. Takeaway message: The error came right after we typed the + sign.

Different types of data

Computers are not smart, and have to be explicitly told how to handle different types of data. Although a human might know that you can’t do maths on a word, our computer does not. To work around this problem, programming languages store different types of data in different ways.

For reasons that are likely obvious, we will need to store text differently than numbers. What is less obvious is that Python also has special ways of handling integers vs. decimals, Boolean values (True/False), and a special value used to indicate no data whatsoever.

Strings

We’ve already encountered three of these “data types”. The first is the string, which is used to store text. Strings are indicated with either single (') or double (") quotes.

To see what data type something is, we can use the type() command. It will print the data type of whatever is inside the parentheses.

type('this is a string')
type("this is also a string")
str
str

We can also make multiline strings with 3 of either set of quotes.

multiline = '''
    This string
    spans
    multiple
    lines
    !!!!
    '''
print(multiline)
type(multiline)
This string
spans
multiple
lines
!!!!

str

Python makes it very easy to work with basic text. For instance, we can even use the + sign to put strings together!

'some text' + 'MORE TEXT'
'repetition' * 3
'some textMORE TEXT'
'repetitionrepetitionrepetition'

Note that maths operations on strings will only work within reason. Attempting to add a string to a number doesn’t work!

'5' + 5
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-f9dbf5f0b234> in <module>()
----> 1 '5' + 5

TypeError: Can't convert 'int' object to str implicitly

This error indicates that Python doesn’t know how to combine a string and an integer: it won’t convert the number to a string (or vice versa) without our help!

Numbers

Integers are used to store any whole number, either positive or negative. Any number without a decimal place is an int, or integer.

type(5)
type(-1000)
type(6 + -33)
int
int
int

But what happens when we perform a maths operation that would result in a decimal?

type(10 / 3)
float

Any operation that would result in a decimal being created converts the number to a “float”. Floats are used to represent decimals in Python. To explicitly set a number as a float, just add a decimal point.

type(1.3)
type(22.)
float
float

Other data types

Python has two special “Boolean” values to indicate whether or not something is true or false. Unsurprisingly, these are defined as True and False.

type(True)
type(False)
bool
bool

Finally, there is a special value called None used to indicate no data. We will revisit None in more detail later, so for now, just be aware it exists.

type(None)
NoneType

Converting between data types

Data often isn’t in the format you want it to be. For instance, we got an error earlier while attempting to perform addition between a string and a number ('5' + 5). What if we really needed to do that? Fortunately, Python makes it rather easy to convert between data types. Each data type has a function used to convert other data into that type.

To convert a string to an integer, for instance, we can use the int() command:

print(int('5') + 5)
10

Likewise, we can use the following commands to convert data to other types:

  • str() - creates a string
  • int() - creates an integer
  • float() - creates a float
  • bool() - creates a Boolean

Using this information, see if you can fix the left side of these statements to equal the right side of each statement. Use only the commands shown above.

1 + '1' == '11'
'6' - 7 == -1
7.23 == 7
'5' == True
4 / 1.3 == 4

Data type conversion pitfalls

You may have noticed something weird when converting a float to an int in the last example. Is Python simply rounding floats to the nearest integer, or is it doing something else?
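One way to find out is to try a few conversions ourselves; a quick sketch:

```python
print(int(7.23))    # 7
print(int(7.99))    # 7 - not rounded up to 8!
print(int(-7.99))   # -7 - truncated towards zero, not rounded down
print(round(7.99))  # 8 - round() is the function that actually rounds
```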

Key Points

  • Errors are there to help us.


Scripts and imports

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • What is a Python program?

Objectives
  • Explain what constitutes a Python program.

  • Import Python modules.

Everything we’ve learned so far is pretty cool. But what if we want to run a set of commands more than once? How do we write a program in Python?

Python programs are simply a set of Python commands saved in a file. No compiling required! To demonstrate this, let’s write our first program! Enter the following text in a text editor and save it under any name you like (Python files are typically given the extension .py).

print('it works!!!')

We can now run this program in several ways. If we were to open up a terminal in the folder where we had saved our program, we could run the command python3 our-script-name.py to run it.

it works!!!

What’s the point of print()?

We saw earlier that there was no difference between printing something with print() and just entering a command on the command line. But is this really the case? Is there a difference after all?

Try executing the following code:

print('this involves print')
'this does not'

What gets printed if you execute this as a script? What gets printed if you execute things line by line? Using this information, what’s the point of print()?

import-ing things

IPython has a neat trick to run command line commands without exiting IPython. Any command that begins with ! gets run on your computer’s command line, and not the IPython terminal.

We can use this fact to run the command python3 our-script-name.py. I’ve called my script test.py as an example.

!python3 test.py
it works!!!

What if we wanted to pass additional information to Python? For example, what if we want Python to print whatever we type back at us? To do this, we’ll need to use a bit of extra functionality: the sys package.

Python includes a lot of extra features in the form of packages, but not all of them get loaded by default. To access a package, we need to import it.

import sys

You’ll notice that there’s no output. Only one thing has changed: we can now use the extra functionality provided by the sys package. For now, all we will use is sys.argv. sys.argv is a special variable that stores any additional arguments we provide on the command line after python3 our-script-name.py. Let’s make a new script called command-args.py to try this out.

import sys
print('we typed: ', sys.argv)

We can then execute this program with:

!python3 command-args.py word1 word2 3
we typed:  ['command-args.py', 'word1', 'word2', '3']

You’ll notice that sys.argv looks different from other data types we’ve seen so far. sys.argv is a list (more about this in the next session).

Key Points

  • To run a Python program, use python3 program_name.py.


Numpy arrays and lists

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do we store large amounts of data?

Objectives
  • Learn to use lists and Numpy arrays, and explain the difference between each.

At the end of the last lesson, we noticed that sys.argv gave us a new data structure: a list. A list is a set of objects enclosed by a set of square brackets ([]).

example = [1, 2, 4, 5]
example
[1, 2, 4, 5]

Note that a list can hold any type of item, even other lists!

example = [1, True, None, ["word", 123], "test"]
example
[1, True, None, ['word', 123], 'test']

We can get different pieces of a list via indexing. We add a set of square brackets after the list in question along with the index of the values we want. Note that in Python, all indices start from 0 — the first element is actually the 0th element (this is different from languages like R or MATLAB). The best way to think about array indices is that they are the number of offsets from the first position — the first element does not require an offset to get to.

[Image: “Arrays start at 0” comic]
Source: xkcd #163

A few examples of this in action:

# first element
example[0]
# second element
example[1]
# fetch the list inside the list
example[3]
1
True
['word', 123]

Note that we can index a range using the colon (:) operator. A colon by itself means fetch everything.

example[:]
[1, True, None, ['word', 123], 'test']

A colon on the right side of an index means everything after the specified index.

example[2:]
[None, ['word', 123], 'test']

A colon on the left side of an index means everything before, but not including, the index.

example[:2]
[1, True]

And if we use a negative index, it means get elements from the end, going backwards.

# last element
example[-1]
# everything except the last two elements
example[:-2]
'test'
[1, True, None]

Note that we can use the index multiple times to retrieve information from nested objects.

example[3][0]
'word'

If we index out of range, it is an error:

example[5]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-98429cb6526b> in <module>()
----> 1 example[5]

IndexError: list index out of range

We can also add two lists together to create a larger list.

[45, 2] + [3]
[45, 2, 3]

Lists as objects

Like other objects in Python, lists have a unique behaviour that can catch a lot of people off guard. What happens when we run the following code?

list1 = [1, 2, 3, 4]
list2 = list1
list2 += [5, 6, 7]
print('List 2 is: ', list2)
print('List 1 is: ', list1)
List 2 is:  [1, 2, 3, 4, 5, 6, 7]
List 1 is:  [1, 2, 3, 4, 5, 6, 7]

Modifying list2 actually modified list1 as well. In Python, lists are objects. Objects are not copied when we assign them to a new name (unlike in R). This is an important optimisation, as we won’t accidentally fill up all of our computer’s memory by renaming a variable a couple of times. When we ran list2 = list1, it just created a new name for list1. list1 still points at the same underlying object.

We can verify this with the id() function. id() prints an object’s unique identifier. Two objects will not have the same ID unless they are the same object.

id(list1)
id(list2)
140319556870408
140319556870408

In order to create list2 as a unique copy of list1, we have to use the .copy() method.

list1 = [1, 2, 3, 4]
list2 = list1.copy()
list2 += [5, 6, 7]
print('List 2 is: ', list2)
print('List 1 is: ', list1)
id(list2)
id(list1)
List 2 is:  [1, 2, 3, 4, 5, 6, 7]
List 1 is:  [1, 2, 3, 4]
140319554648072
140319554461896

.copy() is a method. Methods are special functions associated with an object and define what it can do. They always follow the syntax object.method(arg1, arg2) and have a predefined number of arguments, many of which have default values. We can also specify only a subset of the arguments by name, e.g. object.method(arg1, arg4=some_value).
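For instance, a list’s .sort() method takes a keyword argument reverse, which defaults to False; a quick sketch:

```python
numbers = [3, 1, 2]

# .sort() with its default arguments sorts in ascending order
numbers.sort()
print(numbers)   # [1, 2, 3]

# specifying the keyword argument 'reverse' flips the order
numbers.sort(reverse=True)
print(numbers)   # [3, 2, 1]
```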

Other frequently used methods of lists include .append():

list1.append(77)
print(list1)
[1, 2, 3, 4, 77]
# this adds a one-element list
list1.append([88])
print(list1)
[1, 2, 3, 4, 77, [88]]

And .extend() (combines two lists, instead of adding the second list as an element):

list1.extend([99, 88, 101])
print(list1)
[1, 2, 3, 4, 77, [88], 99, 88, 101]

And of course, .remove() and .clear() (both do exactly what you think they should do):

list1.remove([88])
print(list1)
list1.clear()
print(list1)
[1, 2, 3, 4, 77, 99, 88, 101]
[]

Dynamic resizing of lists

Python’s lists are an extremely optimised data structure. Unlike R’s vectors, there is no time penalty to continuously adding elements to a list. You never need to pre-allocate a list at a certain size for performance reasons.

Iterating through lists

We’ll very frequently want to iterate over lists and perform an operation with every element. We do this using a for loop.

A for loop generally looks like the following:

for variable in things_to_iterate_over:
    do_stuff_with(variable)

An example of an actually functioning for loop is shown below:

for i in range(10):
    print(i)
0
1
2
3
4
5
6
7
8
9

In this case we are iterating over the values provided by range(). range() is a special generator function we can use to provide a sequence of numbers.
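range() also accepts optional start and step arguments, for example:

```python
# range(stop) counts from 0 up to, but not including, stop
print(list(range(5)))         # [0, 1, 2, 3, 4]

# range(start, stop, step) gives finer control
print(list(range(2, 10, 2)))  # [2, 4, 6, 8]

# a negative step counts downwards
print(list(range(3, 0, -1)))  # [3, 2, 1]
```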

We can also iterate over a list, or any collection of elements:

for element in ['a', True, None]:
    print(type(element))
<class 'str'>
<class 'bool'>
<class 'NoneType'>

Vectorised operations with Numpy

Numpy is a numerical library designed to make working with numbers easier than it would otherwise be.

For example, say we had a list of a thousand numbers. There’s no way to do vector maths without iterating through all the elements!

vals = list(range(1000))

new_vals = vals.copy()
print(new_vals[:5])
for idx in range(1000):
    new_vals[idx] += 10

print(new_vals[:5])
[0, 1, 2, 3, 4]
[10, 11, 12, 13, 14]

That was a lot of work. Numpy lets us do vector maths like in R, saving us a lot of effort. The most basic function is np.array() which creates a numerical array from a list. A numpy array is a collection of numbers that can have any number of dimensions. In this case, there is only one dimension, since we created the array from a list.

import numpy as np

new_vals = np.array(vals)
new_vals += 10
new_vals[:5]
array([10, 11, 12, 13, 14])

One very nice thing about Numpy is that it’s much more performant than ordinary Python lists. A nice trick we can use with IPython to measure execution times is the %timeit magic function. Anything following the %timeit gets measured for speed. Adding %% to the timeit command instead of % means that timeit is run on the entire cell, not just a single line. Note that %%timeit must be on the first line of an IPython/Jupyter cell for it to work, whereas the %timeit command can be used anywhere.

Using Python’s lists:

%%timeit
for idx in range(1000):
    vals[idx] + 10
10000 loops, best of 3: 165 µs per loop

Using numpy:

%timeit new_vals + 10
The slowest run took 22.13 times longer than the fastest.
This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.63 µs per loop

Numpy was about 100x faster, though %timeit did mention that Numpy could be cheating a bit. Even in Numpy’s worst case scenario however, it still ran 5x faster than using Python’s basic lists.

Working with multiple dimensions

Sometimes, you’ll encounter a dataset with multiple dimensions and will need to be able to retrieve elements from it as such.

arr2d = np.arange(0, 40)  # sequence of numbers from 0 to 39
arr2d = arr2d.reshape([5, 8])  # reshape so it has 5 rows and 8 columns
arr2d
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39]])

In this case, we must index using multiple indices, separated by a comma.

To grab the first element, we would use [0, 0]

arr2d[0, 0]
0

The first index corresponds to rows, the second to columns, and any further indices to higher dimensions.

arr2d[0, :]
arr2d[:, 0]
array([0, 1, 2, 3, 4, 5, 6, 7])
array([ 0,  8, 16, 24, 32])

Practising indexing

Retrieve everything defined in the range of rows 4-5 and columns 1-4.

Key Points

  • Lists store a sequence of elements.

  • Numpy allows vector maths in Python.


Storing data with dicts

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do I store structured data?

Objectives
  • Be able to store data using Python’s dict objects.

Dictionaries (also called dicts) are another key data structure we’ll need to use to write a pipeline. In particular, dicts allow efficient key-value storage of any type of data.

To create a dict, we use syntax like the following.

example = {}
type(example)
dict

We can then store values in our dict using indexing. The index is referred to as the “key”, and the stored data is referred to as the “value”.

example['key'] = 'value'
example['key']
'value'

In addition, keys can be almost any type of value. Let’s add several more entries to demonstrate this.

example[1] = 2
example[4] = False
example['test'] = 5
example[7] = 'myvalue'

To retrieve all keys in the dictionary, we can use the .keys() method. Note how we use the list() function to turn the resulting output into a list.

list(example.keys())
['key', 1, 4, 'test', 7]

Likewise, we can retrieve all the values at once, using .values()

list(example.values())
['value', 2, False, 5, 'myvalue']
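Dicts also have an .items() method, which yields each key together with its value; a quick sketch:

```python
example = {'key': 'value', 1: 2, 4: False}

# .items() yields (key, value) pairs, which is handy in for loops
for key, value in example.items():
    print(key, '->', value)
```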

Dictionary order

Note that the order of keys and values in a dictionary should not be relied upon. We’ll create a dictionary another way to demonstrate this:

unordered = {'a': 1,
             'b': 2,
             'c': 3,
             'd': 4}
{'a': 1, 'b': 2, 'c': 3, 'd': 4}

Depending on your version of Python, the dictionary will either preserve the order in which items were added, or appear in an arbitrary order. On Python 3.6+, dictionaries preserve insertion order (and this is guaranteed by the language from Python 3.7 onwards).

Iterate through and print the dictionary’s keys in both forward and reverse order.

(To iterate through the dict in a specific order, you will need to sort the keys using the sorted() function.)

Key Points

  • Dicts provide key-value storage of information.


Functions and Conditions

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do I write functions?

Objectives
  • Be able to write our own functions and use basic functional programming constructs like map() and filter().

Of course, at some point, we are going to want to define our own functions rather than just use the ones provided by Python and its various modules.

The general syntax for defining a function is as follows:

def function(arg1):
    # do stuff with arg1
    return answer

So, an example function that adds two numbers together might look a little like this:

def adder(x, y):
    return x + y

adder(1, 2)
3

We can also add a default argument (say if we wanted y to be equal to 10 unless we otherwise specified), by using an equals sign and a default value in our function definition:

def adder(x, y=10):
    return x + y

adder(5)
15

Practice defining functions

Define a function that converts from temperatures in Fahrenheit to temperatures in Kelvin, and another function that converts back again.

The general formula for the conversion from Fahrenheit to Kelvin is:

kelvin = (fahr - 32) * 5 / 9 + 273.15

Conditional statements

We may also need to have our functions do specific things in some conditions, but not in others. This relies upon comparisons between items.

In Python, equality comparison is done using the == operator:

True == True
True == False
'words' == 'words'
True
False
True

not indicates the opposite of True or False, and != means not equal to.

not True == False
True != False
True
True

As with other programming languages, we can make the usual comparisons with the > and < operators. Adding an equals sign (>=, <=) indicates less than or equal to or greater than or equal to.

5 < 10
5 > 10
-4 >= -4
1 <= 2
True
False
True
True

These statements can be combined with the if statement to produce code that executes at various times.

number = 5
if number <= 10:
    print('number was less than 10')
number was less than 10

If the condition in the if statement is not True, the indented code does not execute:

number = 11
if number <= 10:
    print('number was less than 10')

However, we can add code to execute when the if condition is not met by adding an else statement.

number = 11
if number <= 10:
    print('number was less than 10')
else:
    print('number was greater than 10')
number was greater than 10

And if we want to check an additional statement, we can use the elif keyword (else-if):

number = 10
if number < 10:
    print('number was less than 10')
elif number == 10:
    print('number was equal to 10')
else:
    print('number was greater than 10')

One final note: to check if a value is equal to None in Python, we must use is None and is not None. The normal == operator should not be used for this.

None is None
5 is not None
True
True

Additionally, we can check if one value is in another set of values with the in operator:

5 in [4, 5, 6]
43 in [4, 5, 6]
True
False

map(), filter(), and anonymous (lambda) functions

Python has good support for functional programming, and has its own equivalents for map/reduce-style functionality. To “map” a function means to apply it to a set of elements. To “reduce” means to collapse a set of values to a single value. Finally, “filtering” means returning only the elements for which a certain condition is true.
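reduce() itself isn’t used further in this lesson, but for reference it lives in the functools module in Python 3; a minimal sketch:

```python
from functools import reduce

def add(x, y):
    return x + y

# reduce() repeatedly combines pairs of elements:
# ((((1 + 2) + 3) + 4) + 5
print(reduce(add, [1, 2, 3, 4, 5]))   # 15
```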

Let’s explore what that means with our own functions. The syntax of map/reduce/filter is identical:

map(function, thing_to_iterate_over, next_thing_to_iterate_over)

Let’s apply this to a few test cases using map. Note that when selecting which function we are going to “map” with, we pass just the function’s name, without parentheses: we don’t want to call it ourselves, map() will do that for us.

import math
values = [0, 1, 2, 3, 4, 5, 6]
map(math.sin, values)
<map object at 0x7f31c246cba8>

To retrieve the actual values, we typically need to make the resulting output a list.

list(map(math.sin, values))
[0.0,
 0.8414709848078965,
 0.9092974268256817,
 0.1411200080598672,
 -0.7568024953079282,
 -0.9589242746631385,
 -0.27941549819892586]

filter() applies a similar operation, but instead of applying a function to every piece, it only returns the elements for which the function returns True.

def less_than_3(val):
    return val < 3

list(filter(less_than_3, values))
[0, 1, 2]

That was very inconvenient. We had to define an entire function just to use it once. The solution for this is to write a one-time-use function that has no name. Such functions are called either anonymous functions or lambda functions (both mean the same thing).

To define a lambda function in Python, the general syntax is as follows:

lambda x: x + 54

In this case, lambda x: indicates we are defining a lambda function with a single argument, x. Everything following the : is our function. Whatever value this evaluates to is automatically returned. So lambda x: x + 54 equates to:

def some_func(x):
    return x + 54

Rewriting our filter statement to use a lambda function:

list(filter(lambda x: x < 3, values))
[0, 1, 2]

And a side-by-side example that demonstrates the difference between map() and filter().

list(map(lambda x: x+100, [1,2,3,4,5]))
list(filter(lambda x: x<3, [1,2,3,4,5]))
[101, 102, 103, 104, 105]   # map()
[1, 2]   # filter()

Using lambdas in practice

Add '-cheesecake' to every word in the following list using map().

['new york', 'chocolate', 'new york', 'ketchup', 'mayo']

Using filter(), remove the items which would be absolutely terrible to eat.

map/filter style functionality with Numpy arrays

Although you could use a for-loop to apply a custom function to a numpy array in a single go, there is a handy np.vectorize() function you can use to convert your functions to a vectorised numpy equivalent. Note that this is purely for convenience — this uses a for-loop internally.

import numpy as np
# create a function to perform cubes of a number
vector_cube = np.vectorize(lambda x: x ** 3)

vector_cube(np.array([1, 2, 3, 4, 5]))
array([  1,   8,  27,  64, 125])

To perform an operation similar to filter(), you can specify a conditional statement inside the [] when indexing a Numpy array.

arr = np.array([1, 2, 3, 4, 5])
arr[arr >= 3]
array([3, 4, 5])

Removing np.nan values

Remove all of the np.nan values from the following sequence using logical indexing.

np.array([np.nan, np.nan, 2, 3, 4, np.nan])

Key Points

  • map() applies a function to every object in a data structure.

  • filter() returns only the data objects for which some condition is true.


Introduction to parallel computing

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do I run code in parallel?

Objectives
  • Understand how to run parallel code with multiprocessing.

The primary goal of these lesson materials is to accelerate your workflows by executing them in a massively parallel (and reproducible!) manner. But what does this actually mean?

The basic concept of parallel computing is simple to understand: we divide our job into tasks that can be executed at the same time, so that we finish the job in a fraction of the time that it would have taken if the tasks were executed one by one. There are a lot of different ways of parallelizing things, however, so we need to cover these concepts before running our workflows in parallel.

Let’s start with an analogy: suppose that we want to paint the four walls in a room. This is our problem. We can divide our problem into 4 different tasks: painting each of the walls. In principle, our 4 tasks are independent of each other, in the sense that we don’t need to finish one to start another. However, this does not mean that the tasks can be executed simultaneously or in parallel. It all depends on the amount of resources that we have for the tasks.

Concurrent vs. parallel execution

If there is only one painter, they could work for a while on one wall, then start painting another one, then work for a little bit on the third one, and so on. The tasks are being executed concurrently, but not in parallel. Only one task is being performed at a time. If we have 2 or more painters for the job, then the tasks can be performed in parallel.

In our analogy, the painters represent CPU cores in your computer. The number of CPU cores available determines the maximum number of tasks that can be performed in parallel. The number of concurrent tasks that can be started at the same time, however, is unlimited.

Synchronous vs. asynchronous execution

Now imagine that all workers have to obtain their paint from a central dispenser located in the middle of the room. If each worker is using a different colour, then they can work asynchronously. However, if they use the same colour, and two of them run out of paint at the same time, then they have to synchronise to use the dispenser — one should wait while the other is being serviced.

In our analogy, the paint dispenser represents access to the memory in your computer. Depending on how a program is written, access to data in memory can be synchronous or asynchronous.

Distributed vs. shared memory

Finally, imagine that we have 4 paint dispensers, one for each worker. In this scenario, each worker can complete their task totally on their own. They don’t even have to be in the same room: they could be painting walls of different rooms in the house, of different houses in the city, and in different cities in the country. In many cases, however, we need a communication system in place. Suppose that worker A needs a colour that is only available in the dispenser of worker B — worker A should request the paint from worker B, and worker B should respond by sending the required colour.

Think of the memory distributed on each node/computer of a cluster as the different dispensers for your workers. A fine-grained parallel program needs lots of communication/synchronisation between tasks, in contrast with a coarse-grained one that barely communicates at all. An embarrassingly/massively parallel problem is one where all tasks can be executed completely independently of each other (no communication required).

Processes vs. threads

Our example painters have two arms, and could potentially paint with both arms at the same time. Technically, the work being done by each arm is the work of a single painter.

In this example, each painter would be a process (an individual instance of a program). The painters’ arms represent a “thread” of a program. Threads are separate points of execution within a single program, and can be executed either synchronously or asynchronously.


How does parallelization work in practice?

These concepts translate into several different types of parallel computing, each good at certain types of tasks:

Asynchronous programming

Oftentimes, certain computations involve a lot of waiting. Perhaps you sent some information to a webserver on the internet and are waiting for a response. If you needed to make lots of requests over the internet, your program would spend ages just waiting to hear back. In this scenario, it is very advantageous to fire off a bunch of requests at once and then, instead of waiting on each one, check back periodically to see whether each request has completed before processing it individually.

This is an example of asynchronous programming. One thread executes many tasks at the same time, periodically checking on each one, and only taking an action once some external task has completed. Asynchronous programming is very important when programming for the web, where lots of waiting around happens. To do this in Python, you’d typically want to use something like the asyncio module. It’s not very useful for scientific programming, because only one core/thread is typically doing any work — a normal program that doesn’t run in parallel at all would be just as fast!
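As a minimal sketch of this pattern (the fetch function here is hypothetical; asyncio.sleep() stands in for waiting on a real web server):

```python
# Minimal asyncio sketch. `fetch` is a hypothetical stand-in for a network
# request; asyncio.sleep() simulates the waiting.
import asyncio

async def fetch(i):
    await asyncio.sleep(0.1)  # pretend we are waiting on a webserver
    return i * 2

async def main():
    # All ten "requests" wait concurrently, so the total time is roughly
    # 0.1 s rather than 10 * 0.1 = 1 s.
    return await asyncio.gather(*(fetch(i) for i in range(10)))

results = asyncio.run(main())
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Note that only one thread ever runs here: the speedup comes purely from overlapping the waiting, which is why this approach helps with I/O-bound work but not with heavy computation.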

Shared memory programming

Shared memory programming means using the resources on a single computer, and having multiple threads or processes work together on a single copy of a dataset in memory. This is the most common form of parallel programming and is relatively easy to do. We will cover basic shared-memory programming in Python using the multiprocess / multiprocessing packages in this lesson.

Distributed memory programming

Shared memory programming, although very useful, has one major limitation: we can only use the number of CPU cores present on a single computer. If we want to increase speed any more, we need a better computer. Big computers cost lots and lots of money. Wouldn’t it be more efficient to just use a lot of smaller, cheap computers instead?

This is the rationale behind distributed memory programming — a task is farmed out to a large number of computers, each of which tackle an individual portion of a problem. Results are communicated back and forth between compute nodes.

This is most advantageous when a dataset is too large to fit into a computer’s memory (depending on the hardware you have access to this can be anything from several dozen gigabytes, to several terabytes). Frameworks like MPI, Hadoop, and Spark see widespread use for these types of problems (and are not covered in this lesson).

Serial farming

In many cases, we’ll need to repeat the same computation multiple times. Maybe we need to run the same set of steps on 10 different samples. There doesn’t need to be any communication at all, and each task is completely independent of the others.

In this scenario, why bother with all of these fancy parallel programming techniques? Let’s just start the same program 10 times, on 10 different datasets, on 10 different computers. The work still happens in parallel, and we didn’t need to change anything about our program to achieve this. As an extra benefit, this works the same for every program, regardless of what it does or what language it was written in.

This technique is known as serial farming, and is the primary focus of this lesson. We will learn to use Snakemake to coordinate the parallel launch of dozens, if not hundreds or thousands of independent tasks.


Parallelization in Python

Python does not thread very well. Specifically, Python has a very nasty drawback known as the Global Interpreter Lock (GIL). The GIL ensures that only one compute thread can run at a time. This makes multithreaded processing very difficult. Instead, the best way to go about things is to use multiple independent processes to perform the computations. This method sidesteps the GIL, as each individual process has its own GIL that does not block the others. This is typically done using the multiprocessing module.
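To see the GIL’s effect, here is a small sketch (with a made-up CPU-bound workload) comparing serial execution against a thread pool. Both produce identical answers, but the threaded version typically takes about as long as the serial one, because the GIL lets only one thread execute Python bytecode at a time:

```python
# Sketch: threads do not speed up CPU-bound Python code because of the GIL.
import time
from multiprocessing.pool import ThreadPool

def busy(n):
    """A purely CPU-bound task: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total

work = [100_000] * 8

t0 = time.perf_counter()
serial = [busy(n) for n in work]
serial_time = time.perf_counter() - t0

with ThreadPool(4) as pool:
    t0 = time.perf_counter()
    threaded = pool.map(busy, work)
    threaded_time = time.perf_counter() - t0

# Same results either way; serial_time and threaded_time are usually
# similar, since the GIL serialises the actual computation.
print(serial == threaded)  # True
```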

Before we start, we will need to know the number of CPU cores in our computer. To get the number of cores, we can use the psutil module. We are using psutil instead of multiprocessing because psutil counts cores instead of threads. Long story short, cores are the actual computation units; threads allow additional multitasking using the cores you have. For heavy compute jobs, you are generally interested in cores.

import psutil
# logical=True counts threads, but we are interested in cores
psutil.cpu_count(logical=False)
8

Using this number, we can create a pool of worker processes with which to parallelize our jobs:

from multiprocessing import Pool
pool = Pool(psutil.cpu_count(logical=False))

The pool object gives us a set of parallel workers we can use to parallelize our calculations. In particular, there is a map function (with identical syntax to the map() function used earlier), that runs a workflow in parallel.

Let’s try map() out with a test function that just runs sleep.

import time

def sleeping(arg):
    time.sleep(0.1)

%timeit list(map(sleeping, range(24)))
1 loop, best of 3: 2.4 s per loop

Now let’s try it in parallel:

%timeit pool.map(sleeping, range(24))

If you are using a Jupyter notebook, this will fail:

# more errors omitted
AttributeError: Can't get attribute 'sleeping' on <module '__main__'>
AttributeError: Can't get attribute 'sleeping' on <module '__main__'>

Differences between Jupyter notebooks and the Python interpreter

The last command may have succeeded if you are running in a Python or IPython shell. This is due to a difference in the way Jupyter executes user-defined functions:

1 loop, best of 3: 302 ms per loop

Jupyter notebooks define user functions under a special Python module called __main__. This does not work with multiprocessing. However, these issues are not limited to Jupyter notebooks. A similar error will occur if you use a lambda function instead:

pool.map(lambda x: time.sleep(0.1), range(24))
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-10-df8237b4b421> in <module>()
----> 1 pool.map(lambda x: time.sleep(0.1), range(24))

# more errors omitted

The multiprocessing module has a major limitation: it only accepts certain functions, and only in certain situations. For instance, class methods, lambdas, and functions defined in __main__ won’t work. This is due to the way Python “pickles” (read: serialises) data and sends it to the worker processes. Pickling simply can’t handle many types of Python objects.

Fortunately, there is a fork of the multiprocessing module called multiprocess that works just fine (pip install --user multiprocess). multiprocess uses dill instead of pickle to serialise Python objects (read: send your data and functions to the Python workers), and does not suffer the same issues. Usage is identical:

# shut down the old workers
pool.close()

from multiprocess import Pool
pool = Pool(8)
%timeit pool.map(lambda x: time.sleep(0.1), range(24))
pool.close()
1 loop, best of 3: 309 ms per loop

This is a general-purpose parallelization recipe that you can use for your Python projects.

# make sure to always use multiprocess
from multiprocess import Pool
# start your parallel workers at the beginning of your script
pool = Pool(number_of_cores)

# execute a computation(s) in parallel
result = pool.map(your_function, something_to_iterate_over)
result2 = pool.map(another_function, more_stuff_to_iterate_over)

# turn off your parallel workers at the end of your script
pool.close()

Parallel workers (with their own copy of everything) are created, data are sent to these workers, and then results are combined back together again. There is also an optional chunksize argument (for pool.map()) that lets you control how big each chunk of data is before it’s sent off to each worker. A larger chunk size means that less time is spent shuttling data to and from workers, and will be more useful if you have a large number of very fast computations to perform. When each iteration takes a very long time to run, you will want to use a smaller chunk size.
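As a sketch of how chunksize is passed (this uses ThreadPool purely so the snippet runs anywhere, including a notebook; with multiprocess.Pool the map() call is identical):

```python
# Sketch: the chunksize argument controls how many items are handed to a
# worker at once. ThreadPool is used here only so the example runs anywhere;
# the map() signature is the same for multiprocess.Pool.
from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

with ThreadPool(4) as pool:
    # Each worker receives batches of 25 items instead of one at a time,
    # reducing the overhead of shuttling data back and forth.
    results = pool.map(square, range(100), chunksize=25)

print(sum(results))  # 328350
```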

Key Points

  • Pool.map() will perform an operation in parallel.


Introduction to Snakemake

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can I make my results easier to reproduce?

Objectives
  • Understand our example problem.

Let’s imagine that we’re interested in seeing the frequency of various words in various books.

We’ve compiled our raw data (i.e. the books we want to analyse) and have prepared several Python scripts that together make up our analysis pipeline.

Let’s take a quick look at one of the books using the command

$ head books/isles.txt

By default, head displays the first 10 lines of the specified file.

A JOURNEY TO THE WESTERN ISLANDS OF SCOTLAND


INCH KEITH


I had desired to visit the Hebrides, or Western Islands of Scotland, so
long, that I scarcely remember how the wish was originally excited; and
was in the Autumn of the year 1773 induced to undertake the journey, by
finding in Mr. Boswell a companion, whose acuteness would help my

Our directory has the Python scripts and data files we will be working with:

|- books
|  |- abyss.txt
|  |- isles.txt
|  |- last.txt
|  |- LICENSE_TEXTS.md
|  |- sierra.txt
|- plotcount.py
|- wordcount.py
|- zipf_test.py

The first step is to count the frequency of each word in a book. The first argument (books/isles.txt) to wordcount.py is the file to analyse, and the last argument (isles.dat) specifies the output file to write.

$ python wordcount.py books/isles.txt isles.dat

Let’s take a quick peek at the result.

$ head -5 isles.dat

This shows us the top 5 lines in the output file:

the 3822 6.7371760973
of 2460 4.33632998414
and 1723 3.03719372466
to 1479 2.60708619778
a 1308 2.30565838181

We can see that the file consists of one row per word. Each row shows the word itself, the number of occurrences of that word, and the number of occurrences as a percentage of the total number of words in the text file.

We can do the same thing for a different book:

$ python wordcount.py books/abyss.txt abyss.dat
$ head -5 abyss.dat
the 4044 6.35449402891
and 2807 4.41074795726
of 1907 2.99654305468
a 1594 2.50471401634
to 1515 2.38057825267

Let’s visualise the results. The script plotcount.py reads in a data file and plots the 10 most frequently occurring words as a text-based bar plot:

$ python plotcount.py isles.dat ascii
the   ########################################################################
of    ##############################################
and   ################################
to    ############################
a     #########################
in    ###################
is    #################
that  ############
by    ###########
it    ###########

plotcount.py can also show the plot graphically:

$ python plotcount.py isles.dat show

Close the window to exit the plot.

plotcount.py can also create the plot as an image file (e.g. a PNG file):

$ python plotcount.py isles.dat isles.png

Finally, let’s test Zipf’s law for these books:

$ python zipf_test.py abyss.dat isles.dat
Book	First	Second	Ratio
abyss	4044	2807	1.44
isles	3822	2460	1.55

Zipf’s Law

Zipf’s Law is an empirical law formulated using mathematical statistics that refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions.

Zipf’s law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. For example, in the Brown Corpus of American English text, the word the is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word of accounts for slightly over 3.5% of words (36,411 occurrences), followed by and (28,852). Only 135 vocabulary items are needed to account for half the Corpus.

Source: Wikipedia
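We can check the “inversely proportional to rank” claim against the counts we computed earlier. Here is a quick sketch using the isles.dat counts reported above (under ideal Zipf behaviour the ratio of the top two counts would be exactly 2; real text only approximates this):

```python
# Sketch: compare the two most frequent word counts. An ideal Zipf
# distribution would give a ratio of 2.0; real books come in lower.
def top_two_ratio(counts):
    """Ratio of the largest count to the second largest."""
    top, second = sorted(counts, reverse=True)[:2]
    return top / second

# Counts for 'the', 'of', 'and' in isles.dat, from the lesson output:
isles_counts = [3822, 2460, 1723]
print(round(top_two_ratio(isles_counts), 2))  # 1.55
```

This matches the Ratio column that zipf_test.py printed for isles above.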

Together these scripts implement a common workflow:

  1. Read a data file.
  2. Perform an analysis on this data file.
  3. Write the analysis results to a new file.
  4. Plot a graph of the analysis results.
  5. Save the graph as an image, so we can put it in a paper.
  6. Make a summary table of the analyses

Running wordcount.py and plotcount.py at the shell prompt, as we have been doing, is fine for one or two files. If, however, we had 5 or 10 or 20 text files, or if the number of steps in the pipeline were to expand, this could turn into a lot of work. Plus, no one wants to sit and wait for a command to finish, even just for 30 seconds.

The most common solution to the tedium of data processing is to write a shell script that runs the whole pipeline from start to finish.

Using your text editor of choice (e.g. nano), add the following to a new file named run_pipeline.sh.

# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss
# and the summary table for the Zipf's law tests

python wordcount.py books/isles.txt isles.dat
python wordcount.py books/abyss.txt abyss.dat

python plotcount.py isles.dat isles.png
python plotcount.py abyss.dat abyss.png

# Generate summary table
python zipf_test.py abyss.dat isles.dat > results.txt

Run the script and check that the output is the same as before:

$ bash run_pipeline.sh
$ cat results.txt

This shell script solves several problems in computational reproducibility:

  1. It explicitly documents our pipeline, making communication with colleagues (and our future selves) more efficient.
  2. It allows us to type a single command, bash run_pipeline.sh, to reproduce the full analysis.
  3. It prevents us from repeating typos or mistakes. You might not get it right the first time, but once you fix something it’ll stay fixed.

Despite these benefits it has a few shortcomings.

Let’s adjust the width of the bars in our plot produced by plotcount.py.

Edit plotcount.py so that the bars are 0.8 units wide instead of 1 unit. (Hint: replace width = 1.0 with width = 0.8 in the definition of plot_word_counts.)

Now we want to recreate our figures. We could just bash run_pipeline.sh again. That would work, but it could also be a big pain if counting words takes more than a few seconds. The word counting routine hasn’t changed; we shouldn’t need to recreate those files.

Alternatively, we could manually rerun the plotting for each word-count file. (Experienced shell scripters can make this easier on themselves using a for-loop.)

$ for book in abyss isles; do python plotcount.py $book.dat $book.png; done

With this approach, however, we don’t get many of the benefits of having a shell script in the first place.

Another popular option is to comment out a subset of the lines in run_pipeline.sh:

# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss
# and the summary table

# These lines are commented out because they don't need to be rerun.
#python wordcount.py books/isles.txt isles.dat
#python wordcount.py books/abyss.txt abyss.dat

python plotcount.py isles.dat isles.png
python plotcount.py abyss.dat abyss.png

# This line is also commented out because it doesn't need to be rerun.
# python zipf_test.py abyss.dat isles.dat > results.txt

Then, we would run our modified shell script using bash run_pipeline.sh.

But commenting out these lines, and subsequently un-commenting them, can be a hassle and source of errors in complicated pipelines. What happens if we have hundreds of input files? No one wants to enter the same command a hundred times, and then edit the result.

What we really want is an executable description of our pipeline that allows software to do the tricky part for us: figuring out what tasks need to be run where and when, then perform those tasks for us.

What is Snakemake and why are we using it?

There are many different tools that researchers use to automate this type of work. Snakemake is a very popular tool, and the one we have selected for this tutorial. There are several reasons this tool was chosen.

The rest of these lessons aim to teach you how to use Snakemake by example. Our goal is to automate our example workflow, and have it do everything for us in parallel regardless of where and how it is run (and have it be reproducible!).

Key Points

  • Bash scripts are not an efficient way of storing a workflow.

  • Snakemake is one method of managing a complex computational workflow.


Snakefiles

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do I write a simple workflow?

Objectives
  • Understand the components of a Snakefile: rules, inputs, outputs, and actions.

  • Write a simple Snakefile.

  • Run Snakemake from the shell.

Create a file, called Snakefile, with no file extension and containing the following content:

# Count words.
rule count_words:
    input:    'books/isles.txt'
    output:   'isles.dat'
    shell:    'python wordcount.py books/isles.txt isles.dat'

This is a build file, which for Snakemake is called a Snakefile — a file executed by Snakemake. Note that aside from a few keyword additions like rule, it follows standard Python 3 syntax.

Let us go through each line in turn:

Our rule above describes how to build the target isles.dat using the action python wordcount.py and the dependency books/isles.txt.

Information that was implicit in our shell script - that we are generating a file called isles.dat and that creating this file requires books/isles.txt - is now made explicit by Snakemake’s syntax.

Let’s first ensure we start from scratch and delete the .dat and .png files we created earlier:

$ rm *.dat *.png

By default, Snakemake looks for a file called Snakefile, and we can run Snakemake as follows:

$ snakemake

By default, Snakemake tells us what it’s doing as it executes actions:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       count_words
        1

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

Depending on your setup, you may receive an error: Error: you need to specify the maximum number of CPU cores to be used at the same time with --cores. This can be fixed by passing an argument to the snakemake command. Try running the following:

$ snakemake --cores 1

If you see a different error, check your syntax and the filepaths of the files in your Snakefile. You can check your present working directory using the command pwd.

Remember, aside from stuff like rule and input, Snakemake follows Python syntax. Let’s see if we got what we expected:

$ head -5 isles.dat

The first 5 lines of isles.dat should look exactly like before.

Snakefiles Do Not Have to be Called Snakefile

We don’t have to call our Snakefile Snakefile. However, if we call it something else, we need to tell Snakemake where to find it. We can do this using the -s flag. For example, if our Snakefile is named MyOtherSnakefile:

$ snakemake -s MyOtherSnakefile

When we re-run our Snakefile, Snakemake now informs us that:

Nothing to be done.

This is because our target, isles.dat, has now been created, and Snakemake will not create it again. To see how this works, let’s pretend to update one of the text files. Rather than opening the file in an editor, we can use the shell touch command to update its timestamp (which would happen if we did edit the file):

$ touch books/isles.txt

If we compare the timestamps of books/isles.txt and isles.dat,

$ ls -l books/isles.txt isles.dat

then we see that isles.dat, the target, is now older than books/isles.txt, its dependency:

-rw-r--r--    1 mjj      Administ   323972 Jun 12 10:35 books/isles.txt
-rw-r--r--    1 mjj      Administ   182273 Jun 12 09:58 isles.dat

If we run Snakemake again,

$ snakemake

then it recreates isles.dat:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       count_words
        1

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

When it is asked to build a target, Snakemake checks the “last modification time” of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target. Using this approach, Snakemake knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build.
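The check itself is simple to picture. Here is an illustrative sketch of the rule (this is not Snakemake’s actual implementation, just the idea): a target needs rebuilding if it is missing, or if any dependency has a newer modification time.

```python
# Illustrative sketch of an incremental-build check (not Snakemake's
# actual code): rebuild if the target is missing or any dependency is newer.
import os
import tempfile
import time

def needs_rebuild(target, dependencies):
    """True if target is missing or any dependency is newer than it."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

# Demonstrate with two temporary files standing in for a book and its
# .dat file:
with tempfile.TemporaryDirectory() as tmp:
    dep = os.path.join(tmp, 'book.txt')
    target = os.path.join(tmp, 'counts.dat')
    open(dep, 'w').close()
    open(target, 'w').close()
    os.utime(target, (time.time() + 10,) * 2)   # target newer than dep
    before = needs_rebuild(target, [dep])       # False: up to date
    os.utime(dep, (time.time() + 20,) * 2)      # "touch" the dependency
    after = needs_rebuild(target, [dep])        # True: must rebuild

print(before, after)  # False True
```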

Snakefiles as Documentation

By explicitly recording the inputs to and outputs from steps in our analysis and the dependencies between files, Snakefiles act as a type of documentation, reducing the number of things we have to remember.

Let’s add another rule to the end of Snakefile. Note that rules cannot have the same name, so we’ll call this one count_words_abyss.

rule count_words_abyss:
    input:    'books/abyss.txt'
    output:   'abyss.dat'
    shell:    'python wordcount.py books/abyss.txt abyss.dat'

If we run Snakemake,

$ snakemake

then we get:

Nothing to be done.

Nothing happens because Snakemake attempts to build the first target it finds in the Snakefile (the default target), which is isles.dat, and that file is already up to date. We need to explicitly tell Snakemake we want to build abyss.dat:

$ snakemake abyss.dat

Now, we get:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       count_words_abyss
        1

rule count_words_abyss:
    input: books/abyss.txt
    output: abyss.dat
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

“Up to Date” Versus “Nothing to be Done”

If we ask Snakemake to build a file that already exists and is up to date, then Snakemake informs us that:

Nothing to be done

If we ask Snakemake to build a file that exists but for which there is no rule in our Snakefile, then we get a message like:

$ snakemake wordcount.py
MissingRuleException:
No rule to produce wordcount.py (if you use input functions make sure
that they don't raise unexpected exceptions).

When you see this error, double-check that you have a rule to produce that file, and that the filename has been specified correctly. Even a small difference in a filename will result in a MissingRuleException.

We may want to remove all our data files so we can explicitly recreate them all. We can introduce a new target, and associated rule, to do this. We will call it clean, as this is a common name for rules that delete auto-generated files, like our .dat files:

rule clean:
    shell: 'rm -f *.dat'

This is an example of a rule that has no inputs or outputs! We just want to remove the data files whether or not they exist. If we run Snakemake and specify this target,

$ snakemake clean

then we get:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       clean
        1

rule clean:
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

An ls of our current directory reveals that all of our troublesome output files are now gone (as planned)!

We can add a similar rule to create all the data files. We can put it at the top of our Snakefile so that it becomes the default target, which is executed when no target is given to the snakemake command:

rule dats:
    input:
        'isles.dat',
        'abyss.dat'

This is an example of a rule that has dependencies that are targets of other rules. When Snakemake runs, it will check whether the dependencies exist and, if not, whether rules are available that can create them. If such rules exist, it will invoke them first; otherwise, Snakemake will raise an error.

Dependencies

The order of rebuilding dependencies is arbitrary. You should not assume that they will be built in the order in which they are listed.

Dependencies must form a directed acyclic graph. A target cannot depend on a dependency which itself, or one of its dependencies, depends on that target.

This rule is also an example of a rule that has no actions. It is used purely to trigger the build of its dependencies, if needed.

If we run,

$ snakemake dats

then snakemake creates the data files:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       count_words
        1       count_words_abyss
        1       dats
        3

rule count_words_abyss:
    input: books/abyss.txt
    output: abyss.dat
    jobid: 1

Finished job 1.
1 of 3 steps (33%) done

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 2

Finished job 2.
2 of 3 steps (67%) done

localrule dats:
    input: isles.dat, abyss.dat
    jobid: 0

Finished job 0.
3 of 3 steps (100%) done

If we run dats again, then snakemake will see that the dependencies (isles.dat and abyss.dat) are already up to date. Given the target dats has no actions, there is nothing to be done:

$ snakemake dats
Nothing to be done

Our Snakefile now looks like this:

rule dats:
     input:
         'isles.dat',
         'abyss.dat'


# delete everything so we can re-run things
rule clean:
    shell: 'rm -f *.dat'


# count words in one of our "books"
rule count_words:
    input:    'books/isles.txt'
    output:   'isles.dat'
    shell:    'python wordcount.py books/isles.txt isles.dat'


rule count_words_abyss:
    input:    'books/abyss.txt'
    output:   'abyss.dat'
    shell:    'python wordcount.py books/abyss.txt abyss.dat'

The following figure shows a graph of the dependencies embodied within our Snakefile, involved in building the dats target:

[Figure: Dependencies represented within the Snakefile]

At this point, it becomes important to see what Snakemake is doing behind the scenes. What commands is Snakemake actually running? Snakemake has a special option (-p) that prints every command it is about to run. Additionally, we can perform a dry run with -n. A dry run executes nothing; it simply prints out the commands instead of actually running them. Very useful for debugging!

$ snakemake clean
$ snakemake -n -p isles.dat
rule count_words:
    input: wordcount.py, books/isles.txt
    output: isles.dat
    jobid: 0
    wildcards: file=isles

python wordcount.py books/isles.txt isles.dat
Job counts:
	count	jobs
	1	count_words
	1

Write Two New Rules

  1. Write a new rule for last.dat, created from books/last.txt.
  2. Update the dats rule with this target.
  3. Write a new rule for results.txt, which creates the summary table. The rule needs to:
    • Depend upon each of the three .dat files.
    • Invoke the action python zipf_test.py abyss.dat isles.dat last.dat > results.txt.
  4. Put this rule at the top of the Snakefile so that it is the default target.
  5. Update clean so that it removes results.txt.

The following figure shows the dependencies embodied within our Snakefile, involved in building the results.txt target:

[Figure: results.txt dependencies represented within the Snakefile]

Key Points

  • Snakemake follows Python syntax

  • Rules can have an input and/or outputs, and a command to be run.


Wildcards

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How can I abbreviate the rules in my pipeline?

Objectives
  • Use snakemake wildcards to simplify our rules.

  • Output files are a product not only of input files but of the scripts or code that created the output files.

After the exercise at the end of the previous episode, our Snakefile looked like this:

# generate summary table
rule zipf_test:
    input:  'abyss.dat', 'last.dat', 'isles.dat'
    output: 'results.txt'
    shell:  'python zipf_test.py abyss.dat isles.dat last.dat > results.txt'

rule dats:
     input: 'isles.dat', 'abyss.dat', 'last.dat'

# delete everything so we can re-run things
rule clean:
    shell:  'rm -f *.dat results.txt'

# count words in one of our "books"
rule count_words:
    input:  'books/isles.txt'
    output: 'isles.dat'
    shell:  'python wordcount.py books/isles.txt isles.dat'

rule count_words_abyss:
    input:  'books/abyss.txt'
    output: 'abyss.dat'
    shell:  'python wordcount.py books/abyss.txt abyss.dat'

rule count_words_last:
    input:  'books/last.txt'
    output: 'last.dat'
    shell:  'python wordcount.py books/last.txt last.dat'

Our Snakefile has a lot of duplication. For example, the names of text files and data files are repeated in many places throughout the Snakefile. Snakefiles are a form of code and, in any code, repeated code can lead to problems (e.g. we rename a data file in one part of the Snakefile but forget to rename it elsewhere).

D.R.Y. (Don’t Repeat Yourself)

In many programming languages, the bulk of the language features are there to allow the programmer to describe long-winded computational routines as short, expressive, beautiful code. Features in Python, R, or Java, such as user-defined variables and functions, are useful in part because they mean we don’t have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the “Don’t Repeat Yourself” principle, or D.R.Y.

Let us set about removing some of the repetition from our Snakefile. In our zipf_test rule, we duplicate the data file names and the name of the results file:

rule zipf_test:
    input:
            'abyss.dat',
            'last.dat',
            'isles.dat'
    output: 'results.txt'
    shell:  'python zipf_test.py abyss.dat isles.dat last.dat > results.txt'

Looking at the results file name first, we can replace it in the action with {output}:

rule zipf_test:
    input:  'abyss.dat', 'last.dat', 'isles.dat'
    output: 'results.txt'
    shell:  'python zipf_test.py abyss.dat isles.dat last.dat > {output}'

{output} is a Snakemake wildcard which is equivalent to the value we specified for the output section of the rule.

We can replace the dependencies in the action with {input}:

rule zipf_test:
    input:  'abyss.dat', 'last.dat', 'isles.dat'
    output: 'results.txt'
    shell:  'python zipf_test.py {input} > {output}'

{input} is another wildcard which means ‘all the dependencies of the current rule’. Again, when Snakemake is run it will replace this variable with the dependencies.

Let’s update our text files and re-run our rule:

$ touch books/*.txt
$ snakemake results.txt

We get:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    1	count_words
    1	count_words_abyss
    1	count_words_last
    1	zipf_test
    4

rule count_words_last:
    input: books/last.txt
    output: last.dat
    jobid: 1

Finished job 1.
1 of 4 steps (25%) done

rule count_words_abyss:
    input: books/abyss.txt
    output: abyss.dat
    jobid: 2

Finished job 2.
2 of 4 steps (50%) done

rule count_words:
    input: books/isles.txt
    output: isles.dat
    jobid: 3

Finished job 3.
3 of 4 steps (75%) done

rule zipf_test:
    input: abyss.dat, last.dat, isles.dat
    output: results.txt
    jobid: 0

Finished job 0.
4 of 4 steps (100%) done

Update Dependencies

What will happen if you now execute:

$ touch *.dat
$ snakemake results.txt
  1. nothing
  2. all files recreated
  3. only .dat files recreated
  4. only results.txt recreated

Solution

4. Only results.txt recreated.

The rules for *.dat are not executed because their corresponding .txt files haven’t been modified.

If you run:

$ touch books/*.txt
$ snakemake results.txt

you will find that the .dat files as well as results.txt are recreated.

As we saw, {input} means ‘all the dependencies of the current rule’. This works well for results.txt as its action treats all the dependencies the same — as the input for the zipf_test.py script.

Rewrite .dat rules to use wildcards

Rewrite each .dat rule to use the {input} and {output} wildcards.
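One possible rewrite of the first rule (a sketch; the other two rules follow the same pattern):

```
rule count_words:
    input:  'books/isles.txt'
    output: 'isles.dat'
    shell:  'python wordcount.py {input} {output}'
```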

Handling dependencies differently

For many rules, we may want to treat some dependencies differently. For example, our rules for .dat use their first (and only) dependency specifically as the input file to wordcount.py. If we add additional dependencies (as we will soon do) then we don’t want these being passed as input files to wordcount.py as it expects only one input file to be named when it is invoked.

Snakemake provides several solutions to this. Depending on what we want to do, it’s possible to both index and name our wildcards.

Suppose we want to add wordcount.py as a dependency of each data file. In this case, we can use {input[0]} to refer to the first dependency, and {input[1]} to refer to the second.

rule count_words:
    input:  'wordcount.py', 'books/isles.txt'
    output: 'isles.dat'
    shell:  'python {input[0]} {input[1]} {output}'

Alternatively, we can name our dependencies.

rule count_words_abyss:
    input:
        wc='wordcount.py',
        book='books/abyss.txt'
    output: 'abyss.dat'
    shell:  'python {input.wc} {input.book} {output}'

Let’s mark wordcount.py as updated, and re-run the pipeline.

$ touch wordcount.py
$ snakemake
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    1	count_words
    1	count_words_abyss
    1	zipf_test
    3

rule count_words_abyss:
    input: wordcount.py, books/abyss.txt
    output: abyss.dat
    jobid: 2

Finished job 2.
1 of 3 steps (33%) done

rule count_words:
    input: wordcount.py, books/isles.txt
    output: isles.dat
    jobid: 1

Finished job 1.
2 of 3 steps (67%) done

rule zipf_test:
    input: abyss.dat, last.dat, isles.dat
    output: results.txt
    jobid: 0

Finished job 0.
3 of 3 steps (100%) done

Notice how last.dat (which does not depend on wordcount.py) is not rebuilt. Intuitively, we should also add wordcount.py as a dependency of results.txt, since the final table should be rebuilt whenever we remake the .dat files. However, it turns out we don’t have to! Let’s see what happens to results.txt when we update wordcount.py:

$ touch wordcount.py
$ snakemake results.txt

then we get:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    1	count_words
    1	count_words_abyss
    1	zipf_test
    3

rule count_words_abyss:
    input: wordcount.py, books/abyss.txt
    output: abyss.dat
    jobid: 2

Finished job 2.
1 of 3 steps (33%) done

rule count_words:
    input: wordcount.py, books/isles.txt
    output: isles.dat
    jobid: 1

Finished job 1.
2 of 3 steps (67%) done

rule zipf_test:
    input: abyss.dat, last.dat, isles.dat
    output: results.txt
    jobid: 0

Finished job 0.
3 of 3 steps (100%) done

The whole pipeline is triggered, even the creation of the results.txt file! To understand this, note that according to the dependency figure, results.txt depends on the .dat files. The update of wordcount.py triggers an update of the *.dat files. Thus, snakemake sees that the dependencies (the .dat files) are newer than the target file (results.txt) and thus it recreates results.txt. This is an example of the power of snakemake: updating a subset of the files in the pipeline triggers rerunning the appropriate downstream steps.

Updating One Input File

What will happen if you now execute:

touch books/last.txt
snakemake results.txt
  1. only last.dat is recreated
  2. all .dat files are recreated
  3. only last.dat and results.txt are recreated
  4. all .dat and results.txt are recreated

More dependencies…

Add zipf_test.py as a dependency of results.txt. Which method do you prefer here, indexing or named input files? Yes, this will be clunky, but we’ll fix that part later! Remember that you can do a dry run with snakemake -n -p!
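One possible answer, using named inputs (a sketch, not the only solution; indexing works just as well):

```
rule zipf_test:
    input:
        zipf='zipf_test.py',
        dats=['abyss.dat', 'last.dat', 'isles.dat']
    output: 'results.txt'
    shell:  'python {input.zipf} {input.dats} > {output}'
```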

Key Points

  • Use {output} to refer to the output of the current rule.

  • Use {input} to refer to the dependencies of the current rule.

  • You can use Python indexing to retrieve individual outputs and inputs (example: {input[0]})

  • Wildcards can be named (example: {input.file1}).


Pattern Rules

Overview

Teaching: 15 min
Exercises: 0 min
Questions
  • How can I define rules to operate on similar files?

Objectives
  • Write Snakemake pattern rules.

Our Snakefile still has a ton of repeated content. The rules for each .dat file all do the same thing, for the most part. We can replace these rules with a single pattern rule which can be used to build any .dat file from a .txt file in books/:

rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: '{file}.dat'
    shell:  'python {input.wc} {input.book} {output}'

{file} is another arbitrary wildcard that we can use as a placeholder for any generic book to analyse. Note that we don’t have to use {file} as the name of our wildcard — it can be anything we want!

This rule can be interpreted as: “In order to build a file named something.dat (the output) find a file named books/something.txt (the input) and run wordcount.py input output.”

$ snakemake clean
# use the -p option to show that it is running things correctly!
$ snakemake -p dats

We should see the same output as before. Note that we can still use snakemake to build individual .dat targets as before, and that our new rule will work no matter what stem is being matched.

$ snakemake -p sierra.dat

which gives the output below:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    1	count_words
    1

rule count_words:
    input: wordcount.py, books/sierra.txt
    output: sierra.dat
    jobid: 0
    wildcards: file=sierra

python wordcount.py books/sierra.txt sierra.dat
Finished job 0.
1 of 1 steps (100%) done

Using wildcards

Arbitrary wildcards like {file} can only be used in the input: and output: fields. They cannot be used in actions.

Our Snakefile is now much shorter and cleaner:

# generate summary table
rule zipf_test:
    input:  'zipf_test.py', 'abyss.dat', 'last.dat', 'isles.dat'
    output: 'results.txt'
    shell:  'python {input[0]} {input[1]} {input[2]} {input[3]} > {output}'

rule dats:
     input:
         'isles.dat', 'abyss.dat', 'last.dat'

# delete everything so we can re-run things
rule clean:
    shell:  'rm -f *.dat results.txt'

# count words in one of our "books"
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: '{file}.dat'
    shell:  'python {input.wc} {input.book} {output}'

Key Points

  • Use any named wildcard ({some_name}) as a placeholder in targets and dependencies.


Snakefiles are Python code

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How can I automatically manage dependencies and outputs?

  • How can I use Python code to add features to my pipeline?

Objectives
  • Use variables, functions, and imports in a Snakefile.

  • Learn to use the run action to execute Python code as an action.

Despite our efforts, our pipeline still has repeated content, for instance the names of output files/dependencies. Our zipf_test rule, for instance, is extremely clunky. What happens if we want to analyse books/sierra.txt as well? We’d have to update everything!

rule zipf_test:
    input:  'zipf_test.py', 'abyss.dat', 'last.dat', 'isles.dat'
    output: 'results.txt'
    shell:  'python {input[0]} {input[1]} {input[2]} {input[3]} > {output}'

First, let’s cut down on a little bit of the clunkiness of the shell directive. One thing you’ve probably noticed is that all of our rules are using Python strings. Other data structures work too — let’s try a list:

rule zipf_test:
    input:
        zipf='zipf_test.py',
        books=['abyss.dat', 'last.dat', 'isles.dat']
    output: 'results.txt'
    shell:  'python {input.zipf} {input.books} > {output}'

(snakemake clean and snakemake -p should show that the pipeline still works!)

This illustrates a key feature of Snakemake. Snakefiles are just Python code. We can make our list into a variable to demonstrate this. Let’s create the variable DATS and use it in our zipf_test and dats rules.

DATS=['abyss.dat', 'last.dat', 'isles.dat']

# generate summary table
rule zipf_test:
    input:
        zipf='zipf_test.py',
        books=DATS
    output: 'results.txt'
    shell:  'python {input.zipf} {input.books} > {output}'

rule dats:
    input: DATS

Try re-creating both the dats and results.txt targets (run snakemake clean in between).

When are Snakefiles executed?

The last example illustrated that we can use arbitrary Python code in our Snakefile. It’s important to understand when this code gets executed. Let’s add a print instruction to the top of our Snakefile.

print('Snakefile is being executed!')

DATS=['abyss.dat', 'last.dat', 'isles.dat']

# generate summary table
rule zipf_test:
    input:
# more output below

Now let’s clean up our workspace with snakemake clean

snakemake clean
Snakefile is being executed!
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    1	clean
    1

rule clean:
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

Now let’s re-run the pipeline…

$ snakemake
Snakefile is being executed!
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    3	count_words
    1	zipf_test
    4

rule count_words:
    input: wordcount.py, books/last.txt
    output: last.dat
    jobid: 3
    wildcards: file=last

Finished job 3.
1 of 4 steps (25%) done

rule count_words:
    input: wordcount.py, books/abyss.txt
    output: abyss.dat
    jobid: 1
    wildcards: file=abyss

Finished job 1.
2 of 4 steps (50%) done

rule count_words:
    input: wordcount.py, books/isles.txt
    output: isles.dat
    jobid: 2
    wildcards: file=isles

Finished job 2.
3 of 4 steps (75%) done

rule zipf_test:
    input: zipf_test.py, abyss.dat, last.dat, isles.dat
    output: results.txt
    jobid: 0

Finished job 0.
4 of 4 steps (100%) done

Let’s do a dry-run:

$ snakemake -n
Snakefile is being executed!
Nothing to be done.

In every case, the print() statement ran before any of the actual pipeline code was run. What we can take away from this is that Snakemake executes the entire Snakefile every time we run snakemake (regardless of whether it’s a dry run!). Because of this, we need to be careful, and only put tasks that do “real work” (changing files on disk) inside rules.

Using functions in Snakefiles

In our example here, we only have 4 books. But what if we had 700 books to be processed? It would be a massive effort to update our DATS variable to add the name of every single book’s corresponding .dat filename.

Fortunately, Snakemake ships with several functions that make working with large numbers of files much easier. The two most helpful ones are glob_wildcards() and expand(). Let’s start an interactive Python session to see how they work:

$ python3
Python 3.6.1 (default, Jun 27 2017, 14:35:15)
Type "copyright", "credits" or "license" for more information.

In this example, we will import these Snakemake functions directly in our interactive Python session. It is not necessary to import these functions within your Snakefile — these functions are always imported for you.

from snakemake.io import expand, glob_wildcards

Generating file names with expand()

The first function we’ll use is expand(). As its name suggests, expand() expands a filename pattern containing one or more wildcards into a set of filenames.

>>> expand('folder/{wildcard1}_{wildcard2}.txt',
...        wildcard1=['a', 'b', 'c'],
...        wildcard2=[1, 2, 3])
['folder/a_1.txt',
 'folder/a_2.txt',
 'folder/a_3.txt',
 'folder/b_1.txt',
 'folder/b_2.txt',
 'folder/b_3.txt',
 'folder/c_1.txt',
 'folder/c_2.txt',
 'folder/c_3.txt']

In this case, expand() created every possible combination of filenames from the two wildcards. Useful! Of course, this still leaves us needing to somehow get the values for wildcard1 and wildcard2 in the first place.
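To see the idea behind expand(), here is a toy stand-in built from Python’s standard library. This is just an illustration of the combination logic, not Snakemake’s actual implementation:

```python
from itertools import product

def expand_like(template, **wildcards):
    # Toy illustration only: build one filename per combination of
    # wildcard values, like Snakemake's expand() does.
    keys = list(wildcards)
    values = [wildcards[k] for k in keys]
    return [template.format(**dict(zip(keys, combo)))
            for combo in product(*values)]

print(expand_like('folder/{wildcard1}_{wildcard2}.txt',
                  wildcard1=['a', 'b'],
                  wildcard2=[1, 2]))
```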

Get wildcard values with glob_wildcards()

To get a set of wildcards from a list of files, we can use the glob_wildcards() function. Let’s try grabbing all of the book titles in our books folder.

>>> glob_wildcards('books/{example}.txt')
Wildcards(example=['isles', 'last', 'abyss', 'sierra'])

glob_wildcards() returns a Wildcards object as output. Wildcards is a special object defined by Snakemake that provides named lists.

In this case, there is only one wildcard, {example}. We can extract the values for the file names by getting the example property from the output of glob_wildcards()

>>> glob_wildcards('books/{example}.txt').example
['isles', 'last', 'abyss', 'sierra']
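For intuition, here is a rough standard-library approximation of what glob_wildcards() does for a single wildcard at the end of a pattern. Again, this is just an illustration, not how Snakemake implements it:

```python
import glob
import os

def stems(directory, suffix='.txt'):
    # Roughly what glob_wildcards('<directory>/{example}<suffix>').example
    # returns: the wildcard values (file stems) of all matching files.
    paths = glob.glob(os.path.join(directory, '*' + suffix))
    return sorted(os.path.basename(p)[:-len(suffix)] for p in paths)

# in the lesson's directory, stems('books') would list the book names
print(stems('books'))
```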

Putting it all together

Using the expand() and glob_wildcards() functions, modify the pipeline so that it automatically detects and analyses all the files in the books/ folder.
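One possible solution, sketched below, assuming every .txt file in books/ is a book to analyse:

```
DATS = expand('{book}.dat', book=glob_wildcards('books/{book}.txt').book)

# generate summary table
rule zipf_test:
    input:
        zipf='zipf_test.py',
        books=DATS
    output: 'results.txt'
    shell:  'python {input.zipf} {input.books} > {output}'

rule dats:
    input: DATS
```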

Using Python code as actions

One very useful feature of Snakemake is the ability to execute Python code instead of just shell commands. Instead of shell:, we use run: as the action.

Add the following to our snakefile:

# at the top of the file
import os
import glob

# add this wherever
rule print_book_names:
    run:
        print('These are all the book names:')
        for book in glob.glob('books/*.txt'):
            print(book)

Upon execution of the corresponding rule, Snakemake dutifully runs our Python code in the run: block:

$ snakemake print_book_names
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    1	print_book_names
    1

rule print_book_names:
    jobid: 0

These are all the book names:
books/isles.txt
books/last.txt
books/abyss.txt
books/sierra.txt
Finished job 0.
1 of 1 steps (100%) done

Moving output locations

Alter the rules in your Snakefile so that the .dat files are created in their own dats/ folder. Note that creating this folder beforehand is unnecessary. Snakemake automatically creates any folders for you, as needed.

Creating PNGs

Add new rules and update existing rules to:

  • Create .png files from .dat files using plotcount.py.
  • Remove all auto-generated files (.dat, .png, results.txt).

Finally, many Snakefiles define a default target called all as first target, that will build what the Snakefile has been written to build (e.g. in our case, the .png files and the results.txt file). Add an all target to your Snakefile (Hint: this rule has the results.txt file and the .png files as dependencies, but no actions). With that in place, instead of running snakemake results.txt, you should now run snakemake all, or just simply snakemake.
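One possible shape for the all rule (a sketch, assuming the .png files are written to a plots/ folder and DATS holds the book names as in the exercises above):

```
rule all:
    input:
        'results.txt',
        expand('plots/{book}.png', book=DATS)
```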

Creating an Archive

Update your pipeline to:

  • Create an archive, zipf_analysis.tar.gz, to hold all our .dat files, plots, and the Zipf summary table.
  • Update all to expect zipf_analysis.tar.gz as input.
  • Remove zipf_analysis.tar.gz when snakemake clean is called.

The syntax to create an archive is shown below:

tar -czvf zipf_analysis.tar.gz file1 directory2 file3 etc

After these exercises our final workflow should look something like the following:

Final directed acyclic graph

Adding more books

We can now do a better job at testing Zipf’s rule by adding more books. The books we have used come from the Project Gutenberg website. Project Gutenberg offers thousands of free e-books to download.

Exercise instructions

  • Go to Project Gutenberg and use the search box to find another book, for example ‘The Picture of Dorian Gray’ by Oscar Wilde.
  • Download the ‘Plain Text UTF-8’ version and save it to the books folder; choose a short name for the file
  • Optionally, open the file in a text editor and remove extraneous text at the beginning and end (look for the phrase End of Project Gutenberg's [title], by [author])
  • Run snakemake and check that the correct commands are run
  • Check the results.txt file to see how this book compares to the others

Key Points

  • Snakefiles are Python code.

  • The entire Snakefile is executed whenever you run snakemake.

  • All actual work should be done by rules.


Resources and parallelism

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How do I scale a pipeline across multiple cores?

  • How do I manage access to resources while working in parallel?

Objectives
  • Modify your pipeline to run in parallel.

After the exercises at the end of our last lesson, our Snakefile looks something like this:

# our zipf analysis pipeline
DATS = glob_wildcards('books/{book}.txt').book

rule all:
    input:
        'zipf_analysis.tar.gz'

# delete everything so we can re-run things
rule clean:
    shell:
        '''
        rm -rf results dats plots
        rm -f results.txt zipf_analysis.tar.gz
        '''

# count words in one of our "books"
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    shell:  'python {input.wc} {input.book} {output}'

# create a plot for each book
rule make_plot:
    input:
        plotcount='plotcount.py',
        book='dats/{file}.dat'
    output: 'plots/{file}.png'
    shell:  'python {input.plotcount} {input.book} {output}'

# generate summary table
rule zipf_test:
    input:
        zipf='zipf_test.py',
        books=expand('dats/{book}.dat', book=DATS)
    output: 'results.txt'
    shell:  'python {input.zipf} {input.books} > {output}'

# create an archive with all of our results
rule make_archive:
    input:
        expand('plots/{book}.png', book=DATS),
        expand('dats/{book}.dat', book=DATS),
        'results.txt'
    output: 'zipf_analysis.tar.gz'
    shell: 'tar -czvf {output} {input}'

At this point, we have a complete data analysis pipeline. Very cool. But how do we make it run as efficiently as possible?

Running in parallel

Up to this point, Snakemake has printed out an interesting message whenever we run our pipeline.

Provided cores: 1
Rules claiming more threads will be scaled down.

So far, Snakemake has been run with one core. Let’s scale up our pipeline to run in parallel. The only change we need to make is to run Snakemake with the -j argument. -j indicates the number of CPU cores available and, on a cluster, the maximum number of jobs (we’ll get to that part later). Note that 4 cores is usually a safe assumption when working on a laptop.

$ snakemake clean
$ snakemake -j 4
Provided cores: 4
Rules claiming more threads will be scaled down.
# more output follows

Our pipeline ran in parallel and finished roughly 4 times as quickly! The takeaway here is that all we need to do to scale from a serial pipeline is run snakemake with the -j option.

How many CPUs does your computer have?

Now that we can have our pipeline use multiple CPUs, how do we know how many CPUs to provide to the -j option? Note that for all of these options, it’s best to use CPU cores, and not CPU threads.

Linux: You can use the lscpu command.

All platforms: Python’s psutil module can be used to fetch the number of cores in your computer. Using logical=False returns the number of true CPU cores. logical=True gives the number of CPU threads on your system.

import psutil
psutil.cpu_count(logical=False)
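If psutil is not available, Python’s standard library offers a rough substitute; note that os.cpu_count() counts logical CPUs (threads), not physical cores:

```python
import os

# Number of logical CPUs (hardware threads); on machines with
# hyper-threading/SMT this may be double the number of physical cores.
print(os.cpu_count())
```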

Managing CPUs

Each rule has a number of optional keywords aside from the usual input, output, and shell/run. The threads keyword is used to specify how many CPU cores a rule needs while executing. Though in reality CPU threads are not quite the same as CPU cores, the two terms are interchangeable when working with Snakemake.

Let’s pretend that our count_words rule is actually very CPU-intensive. We’ll say that it needs a whopping 4 CPUs per run. We can specify this with the threads keyword in our rule. We will also modify the rule to print out the number of threads it thinks it is using. Please note that just giving a Snakemake rule 4 threads does not automatically make its action run in parallel! The action also needs to be threads-capable and to explicitly use the {threads} information for this to happen. In this case wordcount.py is actually still running with 1 core; we are simply using it as a demonstration of how to go about running something with multiple cores.

rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    threads: 4
    shell:
        '''
        echo "Running {input.wc} with {threads} cores."
        python {input.wc} {input.book} {output}
        '''

Now when we run snakemake -j 4, the jobs from count_words are run one at a time so as to give each job the resources it needs. Since each job of the count_words rule requires 4 threads (as per the newly added threads directive), and because all jobs have a maximum of 4 cores available to them as per the -j 4 option, the count_words jobs are run one at a time. All of our other rules will still run in parallel, since rules default to requesting a single thread: unless otherwise specified with the threads directive, a rule uses 1 core.

Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    1	all
    4	count_words
    1	make_archive
    4	make_plot
    1	zipf_test
    11

rule count_words:
    input: wordcount.py, books/last.txt
    output: dats/last.dat
    jobid: 3
    wildcards: file=last
    threads: 4

Running wordcount.py with 4 cores.
Finished job 3.
1 of 11 steps (9%) done

# other output follows

What happens when we don’t have 4 cores available? What if we tell Snakemake to run with 2 cores instead?

$ snakemake -j 2
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
    count	jobs
    1	all
    4	count_words
    1	make_archive
    4	make_plot
    1	zipf_test
    11

rule count_words:
    input: wordcount.py, books/last.txt
    output: dats/last.dat
    jobid: 6
    wildcards: file=last
    threads: 2

Running wordcount.py with 2 cores.
Finished job 6.
1 of 11 steps (9%) done

# more output below

The key bit of output is Rules claiming more threads will be scaled down.. When Snakemake doesn’t have enough cores to run a rule (as defined by its threads directive), Snakemake will run that rule with the maximum available number of cores instead. After all, Snakemake’s job is to get our workflow done. It automatically scales our workload to match the maximum number of cores available, without us editing the Snakefile.

Chaining multiple commands

Up until now, all of our commands have fit on one line. To execute multiple bash commands, the only modification we need to make is to use a Python multiline string (begin and end with ''').

One important addition we should be aware of is the && operator. && is a bash operator that runs commands as part of a chain. If the first command fails, the remaining steps are not run. This is safer than bash’s default “hit an error and keep going” behavior. After all, if the first command failed, it’s unlikely the other steps will work.
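A quick standalone illustration of the short-circuit behaviour (unrelated to the pipeline files):

```shell
# '&&' chains commands: the second runs only if the first succeeded
true && echo "first command succeeded, so this runs"
# here the first command fails, so the echo is skipped;
# '|| true' just keeps this demo itself from reporting failure
false && echo "never printed" || true
```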

# count words in one of our "books"
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    threads: 4
    shell:
        '''
        echo "Running {input.wc} with {threads} cores on {input.book}." &&
            python {input.wc} {input.book} {output}
        '''

Managing other types of resources

Not all compute resources are CPUs. Examples might include limited amounts of RAM, number of GPUs, database locks, or perhaps we simply don’t want multiple processes writing to the same file at once. All non-CPU resources are handled using the resources keyword.

For our example, let’s pretend that creating a plot with plotcount.py requires dedicated access to a GPU (it doesn’t), and only one GPU is available. How do we indicate this to Snakemake so that it knows to give dedicated access to a GPU for rules that need it? Let’s modify the make_plot rule as an example:

# create a plot for each book
rule make_plot:
    input:
        plotcount='plotcount.py',
        book='dats/{file}.dat'
    output: 'plots/{file}.png'
    resources: gpu=1
    shell:  'python {input.plotcount} {input.book} {output}'

We can execute our pipeline using the following (using 8 cores and 1 gpu):

$ snakemake clean
$ snakemake -j 8 --resources gpu=1
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: gpu=1
# other output removed for brevity

Resources are entirely arbitrary — like wildcards, they can be named anything. Snakemake knows nothing about them aside from the fact that they have a name and a value. In this case gpu indicates simply that there is a resource called gpu used by make_plot. We provided 1 gpu to the workflow, and the gpu is considered in use as long as the rule is running. Once the make_plot rule completes, the gpu it consumed is added back to the pool of available gpus. To be extra clear: gpu in this case does not actually represent a GPU, it is an arbitrary limit used to prevent multiple tasks that use a gpu from executing at the same time.

But what happens if we run our pipeline without specifying the number of GPUs?

$ snakemake clean
$ snakemake -j 8
Provided cores: 8
Rules claiming more threads will be scaled down.
Unlimited resources: gpu

If you have specified that a rule needs a certain resource, but do not specify how many you have, Snakemake will assume that the resources in question are unlimited.

Other uses for resources

Resources do not have to correspond to actual compute resources. Perhaps one rule is particularly I/O-heavy, and it’s best if only a limited number of these jobs run at a time. Or maybe a type of rule uses a lot of network bandwidth as it downloads data. In all of these cases, resources can be used to constrain access to arbitrary compute resources so that each rule can run at its most efficient. Snakemake will run your rules in such a way as to maximise throughput given your resource constraints.

Key Points

  • Use threads to indicate the number of cores used by a rule.

  • Resources are arbitrary and can be used for anything.

  • The && operator is a useful tool when chaining bash commands.


Scaling a pipeline across a cluster

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How do I run my workflow on an HPC system?

Objectives
  • Understand the Snakemake cluster job submission workflow.

Right now we have a reasonably effective pipeline that scales nicely on our local computer. However, for the sake of this course, we’ll pretend that our workflow actually takes significant computational resources and needs to be run on a cluster.

HPC cluster architecture

Most HPC clusters are run using a scheduler. The scheduler is a piece of software that decides which compute jobs run on which compute nodes, and when. It allows many users to share a computing system as efficiently as possible. In order to use it, users typically must write the commands to be run into a shell script and then “submit” it to the scheduler.

A good analogy would be a university’s room booking system. No one gets to use a room without going through the booking system. The booking system decides which rooms people get based on their requirements (# of students, time allotted, etc.).

Normally, moving a workflow to be run by a cluster scheduler requires a lot of work. Batch scripts need to be written, and you’ll need to monitor and babysit the status of each of your jobs. This is especially difficult if one batch job depends on the output from another. Even moving from one cluster to another (especially ones using a different scheduler) requires a large investment of time and effort — all the batch scripts from before need to be rewritten.

Snakemake does all of this for you. All details of running the pipeline through the cluster scheduler are handled by Snakemake — this includes writing batch scripts, submitting, and monitoring jobs. In this scenario, the role of the scheduler is limited to ensuring each Snakemake rule is executed with the resources it needs.

We’ll explore how to port our example Snakemake pipeline. Our current Snakefile is shown below:

# our zipf analysis pipeline
DATS = glob_wildcards('books/{book}.txt').book

rule all:
    input:
        'zipf_analysis.tar.gz'

# delete everything so we can re-run things
rule clean:
    shell:
        '''
        rm -rf results dats plots
        rm -f results.txt zipf_analysis.tar.gz
        '''

# count words in one of our "books"
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    threads: 4
    shell:
        '''
        python {input.wc} {input.book} {output}
        '''

# create a plot for each book
rule make_plot:
    input:
        plotcount='plotcount.py',
        book='dats/{file}.dat'
    output: 'plots/{file}.png'
    resources: gpu=1
    shell:  'python {input.plotcount} {input.book} {output}'

# generate summary table
rule zipf_test:
    input:
        zipf='zipf_test.py',
        books=expand('dats/{book}.dat', book=DATS)
    output: 'results.txt'
    shell:  'python {input.zipf} {input.books} > {output}'

# create an archive with all of our results
rule make_archive:
    input:
        expand('plots/{book}.png', book=DATS),
        expand('dats/{book}.dat', book=DATS),
        'results.txt'
    output: 'zipf_analysis.tar.gz'
    shell: 'tar -czvf {output} {input}'

To run Snakemake on a cluster, we need to create a profile to tell snakemake how to submit jobs to our cluster. We can then submit jobs with this profile using the --profile argument followed by the name of our profile. In this configuration, Snakemake runs on the cluster head node and submits jobs. Each cluster job executes a single rule and then exits. Snakemake detects the creation of output files, and submits new jobs (rules) once their dependencies are created.

Transferring our workflow

Let’s port our workflow to Compute Canada’s Graham cluster as an example (you will probably be using a different cluster, adapt these instructions to your cluster). The first step will be to transfer our files to the cluster and log on via SSH. Snakemake has a powerful archiving utility that we can use to bundle up our workflow and transfer it.

$ snakemake clean
$ tar -czvf pipeline.tar.gz .
# transfer the pipeline via scp
$ scp pipeline.tar.gz [email protected]:
# log on to the cluster
$ ssh -X [email protected]

snakemake --archive and Conda deployment

Snakemake has a built-in method to archive all input files and scripts under version control: snakemake --archive. What’s more, it also installs any required dependencies if they can be installed using Anaconda’s conda package manager. You can use this feature for this tutorial (I’ve already added all of the files to version control for you), but if you want to use this feature in your own work, you should familiarise yourself with a version control tool like Git.

For more information on how to use this feature, see http://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html

At this point we’ve archived our entire pipeline, sent it to the cluster, and logged on. Let’s create a folder for our pipeline and unpack it there.

$ mkdir pipeline
$ mv pipeline.tar.gz pipeline
$ cd pipeline
$ tar -xvzf pipeline.tar.gz

If Snakemake and Python are not already installed on your cluster, you can install them in an Anaconda Python environment using the following commands:

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

This is an interactive installation through the command line. Review and accept the license agreement, then work through the prompts. The defaults are probably fine. Accept its offer to initialize your environment (conda init), then run the suggested command to load the conda base environment so you can use it straight away. Finally, install Snakemake from the bioconda channel:

$ conda install -y -c bioconda graphviz matplotlib numpy snakemake

Assuming you’ve transferred your files and everything is set to go, the command snakemake -n should work without errors.

Creating a cluster profile

Snakemake uses a YAML-formatted configuration file to retrieve cluster submission parameters; we will use the SLURM scheduler as an example. When we use the --profile slurm argument, Snakemake looks for a directory with the name of our profile (slurm) containing a config.yaml file such as the one below.

cluster: "sbatch --time={resources.time_min} --mem={resources.mem_mb}
          -c {resources.cpus} -o slurm/logs/{rule}_{wildcards}
          -e slurm/logs/{rule}_{wildcards}"
jobs: 25
default-resources: [cpus=1, mem_mb=1000, time_min=5]
resources: [cpus=100, mem_mb=100000]

This file has several components. cluster and the arguments that follow tell Snakemake how to submit jobs to the cluster. Here we’ve used SLURM’s sbatch command, with arguments setting time limits and resources; Snakemake wildcards fill in the requested values.

We’ve also specified where to save SLURM logs and what to call them. Note that this folder must already exist; if it doesn’t, Snakemake will hang.
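Assuming the profile is named slurm as in the config above, the profile directory and the log folder can be created up front:

```shell
# Snakemake looks for slurm/config.yaml when invoked with --profile slurm;
# slurm/logs is where the sbatch command above writes its -o and -e logs.
mkdir -p slurm/logs
# now place the config.yaml shown above inside the slurm/ directory
```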

Values for any command line argument to snakemake can be defined in our profile, although a value is required (e.g. the --use-conda argument could be included in our profile with use-conda: true).

jobs specifies the maximum number of jobs that will be submitted at one time. We also specified the default-resources that will be requested for each job, while resources defines the resource limits.

With these parameters, Snakemake will use no more than 100 CPUs and 100000 MB (100 GB) of memory at a time across all currently submitted jobs. While it does not come into play here, a generally sensible default for jobs is slightly above the maximum number of jobs you are allowed to have queued at a time.
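To make the limits concrete, here is a quick back-of-the-envelope check in Python, using a hypothetical rule that requests 4 CPUs and 8000 MB per job (the same values we give count_words below):

```python
# Profile-wide limits from config.yaml
max_cpus, max_mem_mb, max_jobs = 100, 100_000, 25

# Hypothetical per-job request (matches the count_words rule below)
job_cpus, job_mem_mb = 4, 8000

# Snakemake only runs as many jobs as fit within every limit simultaneously
concurrent = min(max_cpus // job_cpus,      # 25 jobs allowed by CPU limit
                 max_mem_mb // job_mem_mb,  # 12 jobs allowed by memory limit
                 max_jobs)                  # 25 jobs allowed by the jobs cap
print(concurrent)  # → 12
```

Here memory is the binding constraint: at most 12 such jobs run at once.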

The defaults won’t always be perfect, however — chances are some rules may need to run with non-default amounts of memory or time limits. We are using the count_words rule as an example of this. To request non-default resources for a job, we can modify the rule in our snakefile to include a resources section like this:

# count words in one of our "books"
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    threads: 4
    resources: cpus=4, mem_mb=8000, time_min=20
    shell:
        '''
        python {input.wc} {input.book} {output}
        '''

Local rule execution

Some Snakemake rules perform trivial tasks where job submission might be overkill (i.e. less than a minute’s worth of compute time). It would be a better idea to have these rules execute locally (i.e. where the snakemake command is run) instead of as a job. Let’s define all, clean, and make_archive as localrules near the top of our Snakefile.

localrules: all, clean, make_archive

Running our workflow on the cluster

OK, time for the moment we’ve all been waiting for — let’s run our workflow on the cluster with the profile we’ve created. Use this command:

$ snakemake --profile slurm

While things execute, you may wish to SSH to the cluster in another window so you can watch the pipeline’s progress with watch squeue -u $(whoami).

Notes on $PATH

As with any cluster jobs, jobs started by Snakemake need to have the commands they are running on $PATH. For some schedulers (SLURM), no modifications are necessary — variables are passed to the jobs by default. Other schedulers (SGE) need to have this enabled through a command line flag when submitting jobs (-V for SGE). If this is possible, just run the module load commands you need ahead of the job and run Snakemake as normal.

If this is not possible, you have several options:

  • You can edit your .bashrc file to modify $PATH for all jobs and sessions you start on a cluster.
  • Inserting shell.prefix('some command') in a Snakefile means that all rules run will be prefixed by some_command. You can use this to modify $PATH, e.g., shell.prefix('PATH=/extra/directory:$PATH ').
  • You can modify rules directly to run the appropriate module load commands beforehand. This is not recommended, if only because it is more work than the other options available.
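For the shell.prefix option, a minimal sketch looks like the following near the top of a Snakefile (the path /extra/directory is a placeholder for wherever your tools actually live):

```python
# Snakefile fragment: every rule's shell command is prefixed with this string,
# so /extra/directory (a placeholder path) ends up on $PATH for each job.
# Note the trailing space, which separates the prefix from each command.
shell.prefix('PATH=/extra/directory:$PATH ')
```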

Submitting a workflow with nohup

nohup some_command & runs a command in the background and lets it keep running if you log off. Try running the pipeline in cluster mode using nohup (run snakemake clean beforehand). Where does the Snakemake log go to? Why might this technique be useful? Can we also submit the snakemake --profile slurm pipeline as a job? Where does the Snakemake command run in each scenario?
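As a sketch of the mechanics, the commands below use a short echo/sleep as a stand-in for the long-running snakemake command (when you don’t redirect output yourself, nohup writes it to nohup.out):

```shell
# Start a "pipeline" in the background, immune to hangup signals;
# stdout and stderr are captured in a file instead of the terminal.
nohup sh -c 'echo pipeline started; sleep 1; echo pipeline done' > pipeline.log 2>&1 &
wait                # in a real session you would simply log off instead
cat pipeline.log    # the output survives in the file
```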

You can kill the running Snakemake process with killall snakemake. Notice that if you try to run Snakemake again, it says the directory is locked. You can unlock the directory with snakemake --unlock.

Key Points

  • Snakemake generates and submits its own batch scripts for your scheduler.

  • localrules defines rules that are executed on the Snakemake head node.

  • $PATH must be passed to Snakemake rules.

  • nohup <command> & prevents <command> from exiting when you log off.


Final notes

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • What are some tips and tricks I can use to make this easier?

Objectives
  • Understand how to make a DAG graph.

Now that we know how to write and scale a pipeline, here are some tips and tricks for making the process go more smoothly.

snakemake -n is your friend

Whenever you edit your Snakefile, run snakemake -n immediately afterwards. This will check for errors and make sure that the pipeline is able to run.

The most common source of errors is a mismatch in filenames (Snakemake doesn’t know how to produce a particular output file) — snakemake -n will catch this as long as the troublesome output files haven’t already been made.

Configuring logging

By default, Snakemake prints all output from stderr and stdout from rules. This is useful, but if a failure occurs (or we otherwise need to inspect the logs) it can be extremely difficult to determine what happened or which rule had an issue, especially when running in parallel.

The solution to this issue is to redirect the output from each rule (or set of inputs) to a dedicated log file. We can do this using the log keyword. Let’s modify our count_words rule to be slightly more verbose and redirect this output to a dedicated log file.

Two things before we start:

  • &> is a handy operator that redirects both stdout and stderr to a file.
  • &>> does the same thing as &>, but appends to the file instead of overwriting it.

# count words in one of our "books"
rule count_words:
  input:
      wc='wordcount.py',
      book='books/{file}.txt'
  output: 'dats/{file}.dat'
  threads: 4
  log: 'dats/{file}.log'
  shell:
      '''
      echo "Running {input.wc} with {threads} cores on {input.book}." &> {log}
      python {input.wc} {input.book} {output} &>> {log}
      '''
$ snakemake clean
$ snakemake -j 8
$ cat dats/abyss.log
# snakemake output omitted
Running wordcount.py with 4 cores on books/abyss.txt.

Notice how the rule’s output no longer appears in Snakemake’s console output; it is redirected to the rule’s log file instead.

Choosing a good log file location

Though you can put a log anywhere (and name it anything), it is often a good practice to put the log in the same directory where the rule’s output will be created. If you need to investigate the output for a rule and associated log files, this means that you only have to check one location!

Token files

Often, a rule does not generate a unique output, and merely modifies a file. In these cases it is often worthwhile to create a placeholder, or “token file” as output. A token file is simply an empty file that you can create with the touch command (touch some_file.txt creates an empty file called some_file.txt). An example rule using this technique is shown below:

rule token_example:
    input:  'some_file.txt'
    output: 'some_file.tkn'   # marks some_file.txt as modified
    shell:
        '''
        some_command --do-things {input} &&
            touch {output}
        '''

Directory locks

Only one instance of Snakemake can run in a directory at a time. If a Snakemake run fails without unlocking the directory (if you killed the process, for instance), you can run snakemake --unlock to unlock it.

Python as a fallback

Remember, you can use Python imports and functions anywhere in a Snakefile. If something seems a little tricky to implement - Python can do it. The os, shutil, and subprocess packages are useful tools for using Python to execute command line actions. In particular, os.system('some command') will run a command on the command-line and block until execution is complete.
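As a minimal illustration (plain Python, equally valid inside a Snakefile), here is the subprocess equivalent of os.system, which additionally lets you capture the command’s output and check its exit status:

```python
import subprocess

# Run a shell command and block until it finishes; check=True raises
# CalledProcessError if the command exits with a non-zero status.
result = subprocess.run('echo hello from the shell', shell=True,
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())  # → hello from the shell
```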

Creating a workflow diagram

Assuming graphviz is installed (conda install graphviz), you can create a diagram of your workflow with the command: snakemake --dag | dot -Tsvg > dag.svg. This creates a plot of your “directed acyclic graph” (a plot of all of the rules Snakemake thinks it needs to complete), which you can view using any image viewer. In fact this was the tool used to create all of the diagrams in this lesson:

snakemake --dag | dot -Tsvg > dag.svg
eog dag.svg     # eog is an image viewer installed on many Linux systems

Example DAG plot

Rules that have yet to be completed are indicated with solid outlines, while already completed tasks are indicated with dashed outlines. In this case, I ran snakemake clean just before creating the diagram, so no rules had been run yet.

Viewing the GUI

Snakemake has an experimental web browser GUI. I personally haven’t used it for anything, but it’s cool to know it’s there and can be used to view your workflow on the fly.

snakemake --gui

Where to go for documentation / help

The Snakemake documentation is located at snakemake.readthedocs.io

Key Points

  • Token files can be used to take the place of output files if none are created.

  • snakemake --dag | dot -Tsvg > dag.svg creates a graphic of your workflow.

  • snakemake --gui opens a browser window with your workflow.