This lesson is in the early stages of development (Alpha version)

# Basic syntax

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• Where do I start?

Objectives
• Understand basic Python syntax and data types.

The most basic use of Python is to use it as a fancy calculator. It is very easy to do basic maths in Python.

``````print(5 + 1)
``````
``````6
``````

Note that we don’t always have to use the `print()` function. Notice how leaving out `print()` gives us the same result as above.

``````5 + 1
``````
``````6
``````

Python can do all of the normal basic maths operations you’d expect.

``````5 + 3
2 - 9
4 * 6
14 / 3
``````
``````8
-7
24
4.666666666666667
``````

You can also use it for more complicated operations, like exponentiation (`**`):

``````5 ** 2
``````
``````25
``````

Python also supports floor division and remainder division. Floor division (`//`) gives the result of division, rounded down. Remainder division (`%`) gives the remainder after division.

``````5 // 2  # floor division
5 % 2   # remainder division
``````
``````2
1
``````

Python follows the normal order of operations for maths.

``````4 + 1 * 6
``````
``````10
``````

However, if you want Python to change the order it does things in, you can use parentheses to specify what to do first. Note that there is no limit to the number of parentheses you can use.

``````(4 + 1) * 6
``````
``````30
``````

## Variables

Of course, we will probably want to save our answers at some point. We can do this by assigning a variable. In Python, a variable is a name for a saved result. We can set them with the `=` sign.

``````weight_kg = 55
``````

If we want to retrieve the information we’ve stored, we can do it by simply typing the name of the variable again.

``````weight_kg
``````
``````55
``````

We can perform maths on variables the same way we would normally.

``````print('weight in pounds:', 2.2 * weight_kg)
``````
``````weight in pounds: 121.00000000000001
``````

As the example above shows, we can print several things at once by separating them with commas. Note that the number may appear as 121.00000000000001 rather than exactly 121 due to the way decimal numbers are represented internally in Python (floating-point arithmetic).

We can also store the result of a calculation in a new variable:

``````weight_lb = 2.2 * weight_kg
print(weight_lb)
``````
``````121.00000000000001
``````

What happens when we change a variable? Let’s update `weight_kg` and see what happens to `weight_lb`.

``````print('weight_kg starting value is', weight_kg)
weight_kg = 10000
print('after updating, weight_kg ending value is', weight_kg)
print('weight in lb ending value is', weight_lb)
``````
``````weight_kg starting value is 55
after updating, weight_kg ending value is 10000
weight in lb ending value is 121.00000000000001
``````

Notice how even though we changed the value of `weight_kg`, `weight_lb` did not update. This demonstrates a very important property of programming languages: a computer will not do anything unless you specifically tell it to — nothing ever happens automatically. This is different from the behaviour of a spreadsheet, where a cell will automatically update when the cells it refers to are updated.

If we want to tell Python to update `weight_lb` to reflect the new value of `weight_kg`, we will need to perform this operation explicitly.

``````weight_lb = weight_kg * 2.2
print('new value for weight_lb is', weight_lb)
``````
``````new value for weight_lb is 22000.0
``````

One more thing to note: what we just did is the best way to learn Python. Don’t know how something works? Try it and find out!

## Where are variables stored?

Your computer has two places where it stores information: hard disk and memory. What are they and what are they used for? Where do variables get stored?

Memory is where temporary information on your computer gets placed. It is very fast and easy to access, but has one important drawback: data here is erased when your program quits or your computer shuts down. All information you save as variables in Python will be stored in memory! When programming, we always need to save our data as a file (on our hard disk) if we want to keep it!

Your computer’s hard disk is used to store information long-term. This is where files get stored, and the information on your hard drive is more or less permanent. Hard drives can also store lots of data very cheaply — a terabyte of hard drive space is very cheap, whereas the same amount of memory costs a lot more. So if hard drive space is permanent and super-cheap, why don’t we use it to store all of our data? The biggest reason is speed — memory is typically hundreds, if not thousands of times faster to access. If we stored our variables to our hard disk, our programs would be incredibly slow!

## Errors

Of course, not everything will always work perfectly. We are going to run into errors. For instance, what happens if we accidentally don’t finish a command?

``````1 +
``````
``````SyntaxError: invalid syntax (<ipython-input-15-70475fc083df>, line 1)
File "<ipython-input-15-70475fc083df>", line 1
    1 +
       ^
SyntaxError: invalid syntax
``````

This is an error. Errors are good! When we do something that Python doesn’t like, it will give us an error message. These error messages are called tracebacks, and often tell us exactly how to fix our stuff!

### Let’s walk through this error:

``````SyntaxError: invalid syntax
``````

All errors have types. This one is a `SyntaxError`, indicating, well… an error in our syntax. Syntax is “computer-speak” for how things are supposed to be typed. Python only understands certain commands, and typos will mess it up. If we type a command in such a way that Python can’t understand it, we need to fix our syntax (make sure we’ve typed a valid command). Takeaway message: We made an error when typing things.

``````File "<ipython-input-15-70475fc083df>", line 1
    1 +
       ^
``````

Python is trying to be helpful and tell us exactly where our error occurred. The first thing it does is tell us which file had the error in it. Since we are typing code interactively, it gives us the semi-confusing `<ipython-input-15-70475fc083df>` instead of a filename.

The `line 1` bit tells us that our error was on line 1 of our last command. Specifically, Python has printed out the offending line for us, and pointed an arrow (`^`) at the bad part. Takeaway message: The error came right after we typed the `+` sign.

## Different types of data

Computers are not smart, and have to be explicitly told how to handle different types of data. Although a human might know that you can’t do maths on a word, our computer does not. To work around this problem, programming languages store different types of data in different ways.

For reasons that are likely obvious, we will need to store text differently than numbers. What is less obvious is that Python also has special ways of handling integers vs. decimals, Boolean values (True/False), and a special value used to indicate no data whatsoever.

### Strings

We’ve already encountered three of these “data types”. The first is strings, which are used to store text. Strings are indicated with either single (`'`) or double (`"`) quotes.

To see what data type something is, we can use the `type()` command. It will print the data type of whatever is inside the parentheses.

``````type('this is a string')
type("this is also a string")
``````
``````str
str
``````

We can also make multiline strings with three of either type of quote.

``````multiline = '''
This string
spans
multiple
lines
!!!!
'''
print(multiline)
type(multiline)
``````
``````This string
spans
multiple
lines
!!!!

str
``````

Python makes it very easy to work with basic text. For instance, we can even use the `+` sign to put strings together!

``````'some text' + 'MORE TEXT'
'repetition' * 3
``````
``````'some textMORE TEXT'
'repetitionrepetitionrepetition'
``````

Note that maths operations on strings will only work within reason. Attempting to add a string to a number doesn’t work!

``````'5' + 5
``````
``````---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-f9dbf5f0b234> in <module>()
----> 1 '5' + 5

TypeError: Can't convert 'int' object to str implicitly
``````

This error indicates that Python doesn’t know how to implicitly convert an integer to a string (without our help)!

### Numbers

Integers are used to store any whole number, either positive or negative. Any number without a decimal place is an int, or integer.

``````type(5)
type(-1000)
type(6 + -33)
``````
``````int
int
int
``````

But what happens when we perform a maths operation that would result in a decimal?

``````type(10 / 3)
``````
``````float
``````

Any operation that would result in a decimal being created converts the number to a “float”. Floats are used to represent decimals in Python. To explicitly set a number as a float, just add a decimal point.

``````type(1.3)
type(22.)
``````
``````float
float
``````

### Other data types

Python has two special “Boolean” values to indicate whether or not something is true or false. Unsurprisingly, these are defined as `True` and `False`.

``````type(True)
type(False)
``````
``````bool
bool
``````

Finally, there is a special value called `None` used to indicate no data. We will revisit `None` in more detail later, so for now, just be aware it exists.

``````type(None)
``````
``````NoneType
``````

## Converting between data types

Data often isn’t in the format you want it to be. For instance, we got an error earlier while attempting to perform addition between a string and a number (`'5' + 5`). What if we really needed to do that? Fortunately, Python makes it rather easy to convert between data types. Each data type has a function used to convert data into that type.

To convert a string to an integer, for instance, we can use the `int()` command:

``````print(int('5') + 5)
``````
``````10
``````

Likewise, we can use the following commands to convert data to other types:

• `str()` - creates a string
• `int()` - creates an integer
• `float()` - creates a float
• `bool()` - creates a Boolean
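
A few quick examples of these conversion functions in action:

``````str(5)        # number to string: gives '5'
int('42')     # string to integer: gives 42
float('3.5')  # string to float: gives 3.5
bool(0)       # zero converts to False
``````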

Using this information, see if you can fix the left side of these statements to equal the right side of each statement. Use only the commands shown above.

``````1 + '1' == '11'
'6' - 7 == -1
7.23 == 7
'5' == True
4 / 1.3 == 4
``````

## Data type conversion pitfalls

You may have noticed something weird when converting a float to an int in the last example. Is Python simply rounding floats to the nearest integer, or is it doing something else?
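
The best way to find out is to try it. A quick experiment suggests that `int()` truncates (drops the decimal part, moving towards zero) rather than rounding:

``````int(3.7)
int(-3.7)
``````
``````3
-3
``````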

## Key Points

• Errors are there to help us.

# Scripts and imports

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• What is a Python program?

Objectives
• Explain what constitutes a Python program.

• Import Python modules.

Everything we’ve learned so far is pretty cool. But what if we want to run a set of commands more than once? How do we write a program in Python?

Python programs are simply a set of Python commands saved in a file. No compiling required! To demonstrate this, let’s write our first program! Enter the following text in a text editor and save it under any name you like (Python files are typically given the extension `.py`).

``````print('it works!!!')
``````

We can now run this program in several ways. If we were to open up a terminal in the folder where we had saved our program, we could run the command `python3 our-script-name.py` to run it.

``````it works!!!
``````

## What’s the point of print()?

We saw earlier that there was no difference between printing something with `print()` and just entering a command on the command line. But is this really the case? Is there a difference after all?

Try executing the following code:

``````print('this involves print')
'this does not'
``````

What gets printed if you execute this as a script? What gets printed if you execute things line by line? Using this information, what’s the point of `print()`?

## `import`-ing things

IPython has a neat trick to run command line commands without exiting IPython. Any command that begins with `!` gets run on your computer’s command line, and not the IPython terminal.

We can use this fact to run the command `python3 our-script-name.py`. I’ve called my script `test.py` as an example.

``````!python3 test.py
``````
``````it works!!!
``````

What if we wanted to pass additional information to Python? For example, what if we want Python to print whatever we type back at us? To do this, we’ll need to use a bit of extra functionality: the `sys` package.

Python includes a lot of extra features in the form of packages, but not all of them get loaded by default. To access a package, we need to `import` it.

``````import sys
``````

You’ll notice that there’s no output. Only one thing has changed: we can now use the extra functionality provided by the `sys` package. For now, all we will use is `sys.argv`. `sys.argv` is a special variable that stores any additional arguments we provide on the command line after `python3 our-script-name.py`. Let’s make a new script called `command-args.py` to try this out.

``````import sys
print('we typed: ', sys.argv)
``````

We can then execute this program with:

``````!python3 command-args.py word1 word2 3
``````
``````we typed: ['command-args.py', 'word1', 'word2', '3']
``````

You’ll notice that `sys.argv` looks different from other data types we’ve seen so far. `sys.argv` is a list (more about this in the next session).

## Key Points

• To run a Python program, use `python3 program_name.py`.

# Numpy arrays and lists

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• How do we store large amounts of data?

Objectives
• Learn to use lists and Numpy arrays, and explain the difference between each.

At the end of the last lesson, we noticed that `sys.argv` gave us a new data structure: a list. A list is a set of objects enclosed by a set of square brackets (`[]`).

``````example = [1, 2, 4, 5]
example
``````
``````[1, 2, 4, 5]
``````

Note that a list can hold any type of item, even other lists!

``````example = [1, True, None, ["word", 123], "test"]
example
``````
``````[1, True, None, ['word', 123], 'test']
``````

We can get different pieces of a list via indexing. We add a set of square brackets after the list in question along with the index of the values we want. Note that in Python, all indices start from 0 — the first element is actually the 0th element (this is different from languages like R or MATLAB). The best way to think about array indices is that they are the number of offsets from the first position — the first element does not require an offset to get to.

A few examples of this in action:

``````# first element
example[0]
# second element
example[1]
# fetch the list inside the list
example[3]
``````
``````1
True
['word', 123]
``````

Note that we can index a range using the colon (`:`) operator. A colon by itself means fetch everything.

``````example[:]
``````
``````[1, True, None, ['word', 123], 'test']
``````

A colon on the right side of an index means everything after the specified index.

``````example[2:]
``````
``````[None, ['word', 123], 'test']
``````

A colon on the left side of an index means everything before, but not including, the index.

``````example[:2]
``````
``````[1, True]
``````

And if we use a negative index, it means get elements from the end, going backwards.

``````# last element
example[-1]
# everything except the last two elements
example[:-2]
``````
``````'test'
[1, True, None]
``````

Note that we can use the index multiple times to retrieve information from nested objects.

``````example[3][0]
``````
``````'word'
``````

If we index out of range, it is an error:

``````example[5]
``````
``````---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-98429cb6526b> in <module>()
----> 1 example[5]

IndexError: list index out of range
``````

We can also add two lists together to create a larger list.

``````[45, 2] + [3]
``````
``````[45, 2, 3]
``````

## Lists as objects

Like other objects in Python, lists have a unique behaviour that can catch a lot of people off guard. What happens when we run the following code?

``````list1 = [1, 2, 3, 4]
list2 = list1
list2 += [5, 6, 7]
print('List 2 is: ', list2)
print('List 1 is: ', list1)
``````
``````List 2 is:  [1, 2, 3, 4, 5, 6, 7]
List 1 is:  [1, 2, 3, 4, 5, 6, 7]
``````

Modifying `list2` actually modified `list1` as well. In Python, lists are objects. Objects are not copied when we assign them to a new name (unlike in R). This is an important optimisation, as we won’t accidentally fill up all of our computer’s memory by renaming a variable a couple of times. When we ran `list2 = list1`, it just created a new name for `list1`. Both names still point at the same underlying object.

We can verify this with the `id()` function. `id()` returns an object’s unique identifier. Two objects will not have the same ID unless they are the same object.

``````id(list1)
id(list2)
``````
``````140319556870408
140319556870408
``````

In order to create `list2` as an independent copy of `list1`, we have to use the `.copy()` method.

``````list1 = [1, 2, 3, 4]
list2 = list1.copy()
list2 += [5, 6, 7]
print('List 2 is: ', list2)
print('List 1 is: ', list1)
id(list2)
id(list1)
``````
``````List 2 is:  [1, 2, 3, 4, 5, 6, 7]
List 1 is:  [1, 2, 3, 4]
140319554648072
140319554461896
``````

`.copy()` is a method. Methods are special functions associated with an object that define what it can do. They always follow the syntax `object.method(arg1, arg2)` and have a predefined number of arguments, many of which have default values. We can also specify a subset of arguments by name, e.g. `object.method(arg1, arg4=some_value)`.

Other frequently used methods of lists include `.append()`:

``````list1.append(77)
list1
``````
``````[1, 2, 3, 4, 77]
``````
``````# this adds a one-element list
list1.append([88])
list1
``````
``````[1, 2, 3, 4, 77, [88]]
``````

And `.extend()` (combines two lists, instead of adding the second list as an element):

``````list1.extend([99, 88, 101])
list1
``````
``````[1, 2, 3, 4, 77, [88], 99, 88, 101]
``````

And of course, `.remove()` and `.clear()` (both do exactly what you think they should do):

``````list1.remove([88])
print(list1)
list1.clear()
print(list1)
``````
``````[1, 2, 3, 4, 77, 99, 88, 101]
[]
``````

## Dynamic resizing of lists

Python’s lists are an extremely optimised data structure. Unlike R’s vectors, there is no significant time penalty to continuously appending elements to a list. You never need to pre-allocate a list at a certain size for performance reasons.

## Iterating through lists

We’ll very frequently want to iterate over lists and perform an operation with every element. We do this using a for loop.

A for loop generally looks like the following:

``````for variable in things_to_iterate_over:
    do_stuff_with(variable)
``````

An example of an actually functioning for loop is shown below:

``````for i in range(10):
    print(i)
``````
``````0
1
2
3
4
5
6
7
8
9
``````

In this case we are iterating over the values provided by `range()`. `range()` is a function that produces a sequence of numbers on demand (the numbers are generated lazily, rather than stored as a list).
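
`range()` can also take start, stop, and step arguments; note that the stop value itself is never included:

``````list(range(5))         # 0 up to, but not including, 5
list(range(2, 10, 3))  # start at 2, count by 3, stop before 10
``````
``````[0, 1, 2, 3, 4]
[2, 5, 8]
``````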

We can also iterate over a list, or any collection of elements:

``````for element in ['a', True, None]:
    print(type(element))
``````
``````<class 'str'>
<class 'bool'>
<class 'NoneType'>
``````

## Vectorised operations with Numpy

Numpy is a numerical library designed to make working with numbers easier than it would otherwise be.

For example, say we had a list of a thousand numbers. There’s no way to do vector maths without iterating through all the elements!

``````vals = list(range(1000))

new_vals = vals.copy()
print(new_vals[:5])
for idx in range(1000):
    new_vals[idx] += 10

print(new_vals[:5])
``````
``````[0, 1, 2, 3, 4]
[10, 11, 12, 13, 14]
``````

That was a lot of work. Numpy lets us do vector maths like in R, saving us a lot of effort. The most basic function is `np.array()` which creates a numerical array from a list. A numpy array is a collection of numbers that can have any number of dimensions. In this case, there is only one dimension, since we created the array from a list.

``````import numpy as np

new_vals = np.array(vals)
new_vals += 10
new_vals[:5]
``````
``````array([10, 11, 12, 13, 14])
``````

One very nice thing about Numpy is that it’s much more performant than ordinary Python lists. A nice trick we can use with IPython to measure execution times is the `%timeit` magic function. Anything following the `%timeit` gets measured for speed. Adding `%%` to the `timeit` command instead of `%` means that `timeit` is run on the entire cell, not just a single line. Note that `%%timeit` must be on the first line of an IPython/Jupyter cell for it to work, whereas the `%timeit` command can be used anywhere.

Using Python’s lists:

``````%%timeit
for idx in range(1000):
    vals[idx] + 10
``````
``````10000 loops, best of 3: 165 µs per loop
``````

Using numpy:

``````%timeit new_vals + 10
``````
``````The slowest run took 22.13 times longer than the fastest.
This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.63 µs per loop
``````

Numpy was about 100x faster, though `%timeit` did mention that Numpy could be cheating a bit. Even in Numpy’s worst case scenario however, it still ran 5x faster than using Python’s basic lists.

## Working with multiple dimensions

Sometimes, you’ll encounter a dataset with multiple dimensions and will need to be able to retrieve elements from it as such.

``````arr2d = np.arange(0, 40)  # sequence of numbers from 0 to 39
arr2d = arr2d.reshape([5, 8])  # reshape so it has 5 rows and 8 columns
arr2d
``````
``````array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39]])
``````

In this case, we must index using multiple indices, separated by a comma.

To grab the first element, we would use `[0, 0]`

``````arr2d[0, 0]
``````
``````0
``````

The first index corresponds to rows, the second corresponds to columns, and the third to the next dimension, and so on.

``````arr2d[0, :]
arr2d[:, 0]
``````
``````array([0, 1, 2, 3, 4, 5, 6, 7])
array([ 0,  8, 16, 24, 32])
``````

## Practising indexing

Retrieve everything defined in the range of rows 4-5 and columns 1-4.

## Key Points

• Lists store a sequence of elements.

• Numpy allows vector maths in Python.

# Storing data with dicts

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• How do I store structured data?

Objectives
• Be able to store data using Python’s dict objects.

Dictionaries (also called dicts) are another key data structure we’ll need to use to write a pipeline. In particular, dicts allow efficient key-value storage of any type of data.

To create a dict, we use syntax like the following.

``````example = {}
type(example)
``````
``````dict
``````

We can then store values in our dict using indexing. The index is referred to as the “key”, and the stored data is referred to as the “value”.

``````example['key'] = 'value'
example['key']
``````
``````'value'
``````

In addition, dicts can store values of any type, and keys can be any immutable (hashable) type. Let’s add several more values to demonstrate this.

``````example[1] = 2
example[4] = False
example['test'] = 5
example[7] = 'myvalue'
``````

To retrieve all keys in the dictionary, we can use the `.keys()` method. Note how we use the `list()` function to turn the resulting output into a list.

``````list(example.keys())
``````
``````['key', 1, 4, 'test', 7]
``````

Likewise, we can retrieve all the values at once, using `.values()`

``````list(example.values())
``````
``````['value', 2, False, 5, 'myvalue']
``````

## Dictionary order

Note that the order of keys and values in a dictionary should not, in general, be relied upon. We’ll create a dictionary another way to demonstrate this:

``````unordered = {'a': 1,
             'b': 2,
             'c': 3,
             'd': 4}
unordered
``````
``````{'a': 1, 'b': 2, 'c': 3, 'd': 4}
``````

Depending on your version of Python, the keys may or may not come back in the order you defined them. On Python 3.6+, dictionaries preserve insertion order.

Iterate through and print the dictionary’s keys in both forward and reverse order.

(To iterate through the dict in a specific order, you will need to sort the keys using the `sorted()` function.)

## Key Points

• Dicts provide key-value storage of information.

# Functions and Conditions

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• How do I write functions?

Objectives
• Be able to write our own functions and use basic functional programming constructs like `map()` and `filter()`.

Of course, at some point, we are going to want to define our own functions rather than just use the ones provided by Python and its various modules.

The general syntax for defining a function is as follows:

``````def function(arg1):
    # do stuff with arg1
``````

So, an example function that adds two numbers together might look a little like this:

``````def adder(x, y):
    return x + y

adder(1, 2)
``````
``````3
``````

We can also add a default argument (say if we wanted y to be equal to 10 unless we otherwise specified), by using an equals sign and a default value in our function definition:

``````def adder(x, y=10):
    return x + y

adder(5)
``````
``````15
``````

## Practice defining functions

Define a function that converts from temperatures in Fahrenheit to temperatures in Kelvin, and another function that converts back again.

The general formula for the conversion from Fahrenheit to Kelvin is:

`kelvin = (fahr - 32) * 5 / 9 + 273.15`

## Conditional statements

We may also need to have our functions do specific things in some conditions, but not in others. This relies upon comparisons between items:

In python, comparison is done using the `==` operator:

``````True == True
True == False
'words' == 'words'
``````
``````True
False
True
``````

`not` gives the opposite of a Boolean value, and `!=` means “not equal to”.

``````not True == False
True != False
``````
``````True
True
``````

As with other programming languages, we can make the usual comparisons with the `>` and `<` operators. Adding an equals sign (`>=`, `<=`) indicates less than or equal to or greater than or equal to.

``````5 < 10
5 > 10
-4 >= -4
1 <= 2
``````
``````True
False
True
True
``````

These statements can be combined with the `if` statement to produce code that executes at various times.

``````number = 5
if number <= 10:
    print('number was less than 10')
``````
``````number was less than 10
``````

If the condition following the `if` does not evaluate to `True`, the indented code does not execute:

``````number = 11
if number <= 10:
    print('number was less than 10')
``````

However, we can add code to execute when the `if` condition is not met by adding an `else` statement.

``````number = 11
if number <= 10:
    print('number was less than 10')
else:
    print('number was greater than 10')
``````
``````number was greater than 10
``````

And if we want to check an additional statement, we can use the `elif` keyword (else-if):

``````number = 10
if number < 10:
    print('number was less than 10')
elif number == 10:
    print('number was equal to 10')
else:
    print('number was greater than 10')
``````

One final note, to check if a value is equal to `None` in Python we must use `is None` and `is not None`. Normal `==` operators will not work.

``````None is None
5 is not None
``````
``````True
True
``````

Additionally, we can check if one value is in another set of values with the `in` operator:

``````5 in [4, 5, 6]
43 in [4, 5, 6]
``````
``````True
False
``````

## map(), filter(), and anonymous (lambda) functions

Python has good support for functional programming, and has its own equivalents for map/reduce-style functionality. To “map” a function means to apply it to a set of elements. To “reduce” means to collapse a set of values to a single value. Finally, “filtering” means returning only the elements for which a certain condition is true.

Let’s explore what that means with our own functions. The syntax of map/reduce/filter is identical:

``````map(function, thing_to_iterate_over, next_thing_to_iterate_over)
``````

Let’s apply this to a few test cases using map. Note that when selecting which function we are going to “map” with, we pass only the function’s name, without parentheses: we want the function itself, not the result of calling it.

``````import math
values = [0, 1, 2, 3, 4, 5, 6]
map(math.sin, values)
``````
``````<map object at 0x7f31c246cba8>
``````

To retrieve the actual values, we typically need to make the resulting output a list.

``````list(map(math.sin, values))
``````
``````[0.0,
0.8414709848078965,
0.9092974268256817,
0.1411200080598672,
-0.7568024953079282,
-0.9589242746631385,
-0.27941549819892586]
``````

`filter()` applies a similar operation, but instead of applying a function to every piece, it only returns points where a function returns true.

``````def less_than_3(val):
    return val < 3

list(filter(less_than_3, values))
``````
``````[0, 1, 2]
``````

That was rather inconvenient: we had to define an entire function just to use it once. The solution for this is to write a one-time-use function that has no name. Such functions are called either anonymous functions or lambda functions (both mean the same thing).

To define a lambda function in python, the general syntax is as follows:

``````lambda x: x + 54
``````

In this case, `lambda x:` indicates we are defining a lambda function with a single argument, `x`. Everything following the `:` is our function. Whatever value this evaluates to is automatically returned. So `lambda x: x + 54` equates to:

``````def some_func(x):
    return x + 54
``````

Rewriting our filter statement to use a lambda function:

``````list(filter(lambda x: x < 3, values))
``````
``````[0, 1, 2]
``````

And a side-by-side example that demonstrates the difference between `map()` and `filter()`.

``````list(map(lambda x: x+100, [1,2,3,4,5]))
list(filter(lambda x: x<3, [1,2,3,4,5]))
``````
``````[101, 102, 103, 104, 105]   # map()
[1, 2]   # filter()
``````

## Using lambdas in practice

Add `'-cheesecake'` to every word in the following list using `map()`.

`['new york', 'chocolate', 'new york', 'ketchup', 'mayo']`

Using `filter()`, remove the items which would be absolutely terrible to eat.

## map/filter style functionality with Numpy arrays

Although you could use a for loop to apply a custom function to a numpy array element by element, there is a handy `np.vectorize()` function you can use to convert your functions to a vectorised numpy equivalent. Note that this is purely for convenience: it uses a for loop internally, so it does not speed anything up.

``````import numpy as np
# create a function to perform cubes of a number
vector_cube = np.vectorize(lambda x: x ** 3)

vector_cube(np.array([1, 2, 3, 4, 5]))
``````
``````array([  1,   8,  27,  64, 125])
``````

To perform a similar operation to `filter()`, you can specify a conditional statement inside the `[]` when indexing a Numpy array.

``````arr = np.array([1, 2, 3, 4, 5])
arr[arr >= 3]
``````

## Removing np.nan values

Remove all of the `np.nan` values from the following sequence using logical indexing.

`np.array([np.nan, np.nan, 2, 3, 4, np.nan])`

## Key Points

• `map()` applies a function to every object in a data structure.

• `filter()` returns only the data objects for which some condition is true.

# Introduction to parallel computing

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• How do I run code in parallel?

Objectives
• Understand how to run parallel code with `multiprocessing`.

The primary goal of these lesson materials is to accelerate your workflows by executing them in a massively parallel (and reproducible!) manner. Of course, what does this actually mean?

The basic concept of parallel computing is simple to understand: we divide our job into tasks that can be executed at the same time, so that we finish the job in a fraction of the time it would have taken if the tasks were executed one by one. There are a lot of different ways of parallelising things, however, so we need to cover these concepts before running our workflows in parallel.

Let’s start with an analogy: suppose that we want to paint the four walls in a room. This is our problem. We can divide our problem into 4 different tasks: painting each of the walls. In principle, our 4 tasks are independent from each other in the sense that we don’t need to finish one to start another. However, this does not mean that the tasks can be executed simultaneously or in parallel. It all depends on the amount of resources that we have for the tasks.

### Concurrent vs. parallel execution

If there is only one painter, they could work for a while on one wall, then start painting another one, then work for a little bit on the third one, and so on. The tasks are being executed concurrently but not in parallel. Only one task is being performed at a time. If we have 2 or more painters for the job, then the tasks can be performed in parallel.

In our analogy, the painters represent CPU cores in your computer. The number of CPU cores available determines the maximum number of tasks that can be performed in parallel. The number of concurrent tasks that can be started at the same time, however, is unlimited.

### Synchronous vs. asynchronous execution

Now imagine that all workers have to obtain their paint from a central dispenser located at the middle of the room. If each worker is using a different colour, then they can work asynchronously. However, if they use the same colour, and two of them run out of paint at the same time, then they have to synchronise to use the dispenser — one should wait while the other is being serviced.

In our analogy, the paint dispenser represents access to the memory in your computer. Depending on how a program is written, access to data in memory can be synchronous or asynchronous.

### Distributed vs. shared memory

Finally, imagine that we have 4 paint dispensers, one for each worker. In this scenario, each worker can complete their task totally on their own. They don’t even have to be in the same room; they could be painting walls of different rooms in the house, of different houses in the city, and in different cities in the country. In many cases, however, we need a communication system in place. Suppose that worker A needs a colour that is only available in the dispenser of worker B — worker A should request the paint from worker B, and worker B should respond by sending the required colour.

Think of the memory distributed on each node/computer of a cluster as the different dispensers for your workers. A fine-grained parallel program needs lots of communication/synchronisation between tasks, in contrast with a coarse-grained one that barely communicates at all. An embarrassingly/massively parallel problem is one where all tasks can be executed completely independently of each other (no communication required).

Our example painters have two arms, and could potentially paint with both arms at the same time. Technically, the work being done by each arm is the work of a single painter.

In this example, each painter would be a process (an individual instance of a program). The painters’ arms represent a “thread” of a program. Threads are separate points of execution within a single program, and can be executed either synchronously or asynchronously.

## How does parallelization work in practice?

These concepts translate into several different types of parallel computing, each good at certain types of tasks:

### Asynchronous programming

Oftentimes, certain computations involve a lot of waiting. Perhaps you sent some information to a webserver on the internet and are waiting on a response. In this case, if you needed to make lots of requests over the internet, your program would spend ages just waiting to hear back. In this scenario, it would be very advantageous to fire off a bunch of requests to the internet and then, instead of waiting on each one, check back periodically to see whether a request has completed before processing it individually.

This is an example of asynchronous programming. One thread executes many tasks at the same time, periodically checking on each one, and only taking an action once some external task has completed. Asynchronous programming is very important when programming for the web, where lots of waiting around happens. To do this in Python, you’d typically want to use something like the asyncio module. It’s not very useful for scientific programming, because only one core/thread is typically doing any work — a normal program that doesn’t run in parallel at all would be just as fast!
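As a minimal sketch of the idea (the 0.1-second sleeps here are stand-ins for real network requests):

``````
import asyncio

# pretend each "request" spends 0.1 seconds waiting on the network
async def fake_request(i):
    await asyncio.sleep(0.1)
    return i

async def main():
    # start all 10 "requests" at once and collect the results as they finish
    return await asyncio.gather(*(fake_request(i) for i in range(10)))

# the whole batch takes roughly 0.1 s, not 1 s, because the waits overlap
print(asyncio.run(main()))
``````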

### Shared memory programming

Shared memory programming means using the resources on a single computer, and having multiple threads or processes work together on a single copy of a dataset in memory. This is the most common form of parallel programming and is relatively easy to do. We will cover basic shared-memory programming in Python using the `multiprocess` / `multiprocessing` packages in this lesson.

### Distributed memory programming

Shared memory programming, although very useful, has one major limitation: we can only use the number of CPU cores present on a single computer. If we want to increase speed any more, we need a better computer. Big computers cost lots and lots of money. Wouldn’t it be more efficient to just use a lot of smaller, cheap computers instead?

This is the rationale behind distributed memory programming — a task is farmed out to a large number of computers, each of which tackle an individual portion of a problem. Results are communicated back and forth between compute nodes.

This is most advantageous when a dataset is too large to fit into a computer’s memory (depending on the hardware you have access to this can be anything from several dozen gigabytes, to several terabytes). Frameworks like MPI, Hadoop, and Spark see widespread use for these types of problems (and are not covered in this lesson).

### Serial farming

In many cases, we’ll need to repeat the same computation multiple times. Maybe we need to run the same set of steps on 10 different samples. There doesn’t need to be any communication at all, and each task is completely independent of the others.

In this scenario, why bother with all of these fancy parallel programming techniques? Let’s just start the same program 10 times on 10 different datasets on 10 different computers. The work is still happening in parallel, and we didn’t need to change anything about our program to achieve this. As an extra benefit, this works the same for every program, regardless of what it does or what language it was written in.

This technique is known as serial farming, and is the primary focus of this lesson. We will learn to use Snakemake to coordinate the parallel launch of dozens, if not hundreds or thousands of independent tasks.

## Parallelization in Python

Python does not thread very well. Specifically, Python has a very nasty drawback known as the Global Interpreter Lock (GIL). The GIL ensures that only one compute thread can run at a time. This makes multithreaded processing very difficult. Instead, the best way to go about doing things is to use multiple independent processes to perform the computations. This method skips the GIL, as each individual process has its own GIL that does not block the others. This is typically done using the `multiprocessing` module.

Before we start, we will need the number of CPU cores in our computer. To get the number of cores in our computer, we can use the `psutil` module. We are using `psutil` instead of `multiprocessing` because `psutil` counts cores instead of threads. Long story short, cores are the actual computation units, threads allow additional multitasking using the cores you have. For heavy compute jobs, you are generally interested in cores.

``````import psutil
# logical=True counts threads, but we are interested in cores
psutil.cpu_count(logical=False)
``````
``````8
``````

Using this number, we can create a pool of worker processes with which to parallelize our jobs:

``````from multiprocessing import Pool
pool = Pool(psutil.cpu_count(logical=False))
``````

The `pool` object gives us a set of parallel workers we can use to parallelize our calculations. In particular, there is a map function (with identical syntax to the `map()` function used earlier), that runs a workflow in parallel.

Let’s try `map()` out with a test function that just runs sleep.

``````import time

def sleeping(arg):
    time.sleep(0.1)

%timeit list(map(sleeping, range(24)))
``````
``````1 loop, best of 3: 2.4 s per loop
``````

Now let’s try it in parallel:

``````%timeit pool.map(sleeping, range(24))
``````

If you are using a Jupyter notebook, this will fail:

``````# more errors omitted
AttributeError: Can't get attribute 'sleeping' on <module '__main__'>
AttributeError: Can't get attribute 'sleeping' on <module '__main__'>
``````

## Differences between Jupyter notebooks and the Python interpreter

The last command may have succeeded if you are running in a Python or IPython shell. This is due to a difference in the way Jupyter executes user-defined functions:

``````1 loop, best of 3: 302 ms per loop
``````

Jupyter notebooks define user functions under a special Python module called `__main__`. This does not work with `multiprocessing`. However these issues are not limited to Jupyter notebooks — a similar error will occur if you use a lambda function instead:

``````pool.map(lambda x: time.sleep(0.1), range(24))
``````
``````---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-10-df8237b4b421> in <module>()
----> 1 pool.map(lambda x: time.sleep(0.1), range(24))

# more errors omitted
``````

The `multiprocessing` module has a major limitation: it only accepts certain functions, and in certain situations. For instance, class methods, lambdas, and functions defined in `__main__` won’t work. This is due to the way Python “pickles” (read: serialises) data and sends it to the worker processes. “Pickling” simply can’t handle a lot of different types of Python objects.
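You can see the problem directly with the standard `pickle` module; trying to pickle a lambda fails:

``````
import pickle

try:
    pickle.dumps(lambda x: x + 1)
except Exception as err:
    # pickle can't serialise lambdas, so multiprocessing can't ship them to workers
    print(type(err).__name__, err)
``````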

Fortunately, there is a fork of the `multiprocessing` module called `multiprocess` that works just fine (`pip install --user multiprocess`). `multiprocess` uses `dill` instead of `pickle` to serialise Python objects (read: send your data and functions to the Python workers), and does not suffer the same issues. Usage is identical:

``````# shut down the old workers
pool.close()

from multiprocess import Pool
pool = Pool(8)
%timeit pool.map(lambda x: time.sleep(0.1), range(24))
pool.close()
``````
``````1 loop, best of 3: 309 ms per loop
``````

This is a general purpose parallelization recipe that you can use for your Python projects.

``````# make sure to always use multiprocess
from multiprocess import Pool
pool = Pool(number_of_cores)

# execute a computation(s) in parallel
result = pool.map(your_function, something_to_iterate_over)
result2 = pool.map(another_function, more_stuff_to_iterate_over)

# turn off your parallel workers at the end of your script
pool.close()
``````

Parallel workers (with their own copy of everything) are created, data are sent to these workers, and then results are combined back together again. There is also an optional `chunksize` argument (for `pool.map()`) that lets you control how big each chunk of data is before it’s sent off to each worker. A larger chunk size means that less time is spent shuttling data to and from workers, and will be more useful if you have a large number of very fast computations to perform. When each iteration takes a very long time to run, you will want to use a smaller chunk size.
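To make the trade-off concrete, here is a rough sketch of the splitting itself (this mimics what chunking does, not `Pool`'s actual internals):

``````
def chunked(seq, chunksize):
    # split seq into successive pieces of at most chunksize items
    return [seq[i:i + chunksize] for i in range(0, len(seq), chunksize)]

tasks = list(range(24))
# chunksize=6 means only 4 hand-offs to workers instead of 24
print(chunked(tasks, 6))
``````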

## Key Points

• `Pool.map()` will perform an operation in parallel.

# Introduction to Snakemake

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• How can I make my results easier to reproduce?

Objectives
• Understand our example problem.

Let’s imagine that we’re interested in seeing the frequency of various words in various books.

We’ve compiled our raw data, i.e. the books we want to analyse, and have prepared several Python scripts that together make up our analysis pipeline.

Let’s take a quick look at one of the books using the command:

``````\$ head books/isles.txt
``````

By default, `head` displays the first 10 lines of the specified file.

``````A JOURNEY TO THE WESTERN ISLANDS OF SCOTLAND

INCH KEITH

I had desired to visit the Hebrides, or Western Islands of Scotland, so
long, that I scarcely remember how the wish was originally excited; and
was in the Autumn of the year 1773 induced to undertake the journey, by
finding in Mr. Boswell a companion, whose acuteness would help my
``````

Our directory has the Python scripts and data files we will be working with:

``````|- books
|  |- abyss.txt
|  |- isles.txt
|  |- last.txt
|  |- sierra.txt
|- plotcount.py
|- wordcount.py
|- zipf_test.py
``````

The first step is to count the frequency of each word in a book. The first argument (`books/isles.txt`) to `wordcount.py` is the file to analyse, and the last argument (`isles.dat`) specifies the output file to write.

``````\$ python wordcount.py books/isles.txt isles.dat
``````

Let’s take a quick peek at the result.

``````\$ head -5 isles.dat
``````

This shows us the top 5 lines in the output file:

``````the 3822 6.7371760973
of 2460 4.33632998414
and 1723 3.03719372466
to 1479 2.60708619778
a 1308 2.30565838181
``````

We can see that the file consists of one row per word. Each row shows the word itself, the number of occurrences of that word, and the number of occurrences as a percentage of the total number of words in the text file.
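The core of such a count can be sketched in a few lines of Python (an illustration only, not necessarily how `wordcount.py` is implemented):

``````
from collections import Counter

text = 'the quick brown fox jumps over the lazy dog the end'
counts = Counter(text.split())

# one row per word: the word, its count, and its percentage of all words
total = sum(counts.values())
for word, n in counts.most_common(3):
    print(word, n, 100 * n / total)
``````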

We can do the same thing for a different book:

``````\$ python wordcount.py books/abyss.txt abyss.dat
``````
``````the 4044 6.35449402891
and 2807 4.41074795726
of 1907 2.99654305468
a 1594 2.50471401634
to 1515 2.38057825267
``````

Let’s visualise the results. The script `plotcount.py` reads in a data file and plots the 10 most frequently occurring words as a text-based bar plot:

``````\$ python plotcount.py isles.dat ascii
``````
``````the   ########################################################################
of    ##############################################
and   ################################
to    ############################
a     #########################
in    ###################
is    #################
that  ############
by    ###########
it    ###########
``````

`plotcount.py` can also show the plot graphically:

``````\$ python plotcount.py isles.dat show
``````

Close the window to exit the plot.

`plotcount.py` can also create the plot as an image file (e.g. a PNG file):

``````\$ python plotcount.py isles.dat isles.png
``````

Finally, let’s test Zipf’s law for these books:

``````\$ python zipf_test.py abyss.dat isles.dat
``````
``````Book	First	Second	Ratio
abyss	4044	2807	1.44
isles	3822	2460	1.55
``````

## Zipf’s Law

Zipf’s Law is an empirical law formulated using mathematical statistics that refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions.

Zipf’s law was originally formulated in terms of quantitative linguistics, stating that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. For example, in the Brown Corpus of American English text, the word the is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf’s Law, the second-place word of accounts for slightly over 3.5% of words (36,411 occurrences), followed by and (28,852). Only 135 vocabulary items are needed to account for half the Corpus.

Source: Wikipedia
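In other words, Zipf's law predicts that the count of the word at rank r falls off as 1/r, so the ratio between the top two counts should be about 2 (presumably what the `Ratio` column above is meant to test against). A quick sketch of the prediction:

``````
# Zipf's law: count of the word at rank r is proportional to 1/r
def zipf_count(rank, top_count):
    return top_count / rank

# predicted counts for ranks 1-3, scaled to a top count of 4044
print([round(zipf_count(r, 4044)) for r in (1, 2, 3)])

# the predicted first/second ratio is exactly 2.0
print(zipf_count(1, 4044) / zipf_count(2, 4044))
``````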

Together these scripts implement a common workflow:

1. Read a data file.
2. Perform an analysis on this data file.
3. Write the analysis results to a new file.
4. Plot a graph of the analysis results.
5. Save the graph as an image, so we can put it in a paper.
6. Make a summary table of the analyses.

Running `wordcount.py` and `plotcount.py` at the shell prompt, as we have been doing, is fine for one or two files. If, however, we had 5 or 10 or 20 text files, or if the number of steps in the pipeline were to expand, this could turn into a lot of work. Plus, no one wants to sit and wait for a command to finish, even just for 30 seconds.

The most common solution to the tedium of data processing is to write a shell script that runs the whole pipeline from start to finish.

Using your text editor of choice (e.g. nano), add the following to a new file named `run_pipeline.sh`.

``````# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss
# and the summary table for the Zipf's law tests

python wordcount.py books/isles.txt isles.dat
python wordcount.py books/abyss.txt abyss.dat

python plotcount.py isles.dat isles.png
python plotcount.py abyss.dat abyss.png

# Generate summary table
python zipf_test.py abyss.dat isles.dat > results.txt
``````

Run the script and check that the output is the same as before:

``````\$ bash run_pipeline.sh
\$ cat results.txt
``````

This shell script solves several problems in computational reproducibility:

1. It explicitly documents our pipeline, making communication with colleagues (and our future selves) more efficient.
2. It allows us to type a single command, `bash run_pipeline.sh`, to reproduce the full analysis.
3. It prevents us from repeating typos or mistakes. You might not get it right the first time, but once you fix something it’ll stay fixed.

Despite these benefits it has a few shortcomings.

Let’s adjust the width of the bars in our plot produced by `plotcount.py`.

Edit `plotcount.py` so that the bars are 0.8 units wide instead of 1 unit. (Hint: replace `width = 1.0` with `width = 0.8` in the definition of `plot_word_counts`.)

Now we want to recreate our figures. We could just `bash run_pipeline.sh` again. That would work, but it could also be a big pain if counting words takes more than a few seconds. The word counting routine hasn’t changed; we shouldn’t need to recreate those files.

Alternatively, we could manually rerun the plotting for each word-count file. (Experienced shell scripters can make this easier on themselves using a for-loop.)

``````\$ for book in abyss isles; do python plotcount.py \$book.dat \$book.png; done
``````

With this approach, however, we don’t get many of the benefits of having a shell script in the first place.

Another popular option is to comment out a subset of the lines in `run_pipeline.sh`:

``````# USAGE: bash run_pipeline.sh
# to produce plots for isles and abyss
# and the summary table

# These lines are commented out because they don't need to be rerun.
#python wordcount.py books/isles.txt isles.dat
#python wordcount.py books/abyss.txt abyss.dat

python plotcount.py isles.dat isles.png
python plotcount.py abyss.dat abyss.png

# This line is also commented out because it doesn't need to be rerun.
# python zipf_test.py abyss.dat isles.dat > results.txt
``````

Then, we would run our modified shell script using `bash run_pipeline.sh`.

But commenting out these lines, and subsequently un-commenting them, can be a hassle and source of errors in complicated pipelines. What happens if we have hundreds of input files? No one wants to enter the same command a hundred times, and then edit the result.

What we really want is an executable description of our pipeline that allows software to do the tricky part for us: figuring out what tasks need to be run where and when, then perform those tasks for us.

## What is Snakemake and why are we using it?

There are many different tools that researchers use to automate this type of work. Snakemake is a very popular tool, and the one we have selected for this tutorial. There are several reasons this tool was chosen:

• It’s free, open-source, and installs in about 5 seconds flat via `pip`.

• Snakemake works cross-platform (Windows, MacOS, Linux) and is compatible with all HPC schedulers. More importantly, the same workflow will work and scale appropriately regardless of whether it’s on a laptop or cluster without modification.

• Snakemake uses pure Python syntax. There is no tool-specific language to learn, unlike GNU Make, NextFlow, WDL, etc. Even if students end up not liking Snakemake, you’ve still taught them how to program in Python at the end of the day.

• Anything that you can do in Python, you can do with Snakemake (since you can pretty much execute arbitrary Python code anywhere).

• Snakemake was written to be as similar to GNU Make as possible. Users already familiar with Make will find Snakemake quite easy to use.

• It’s easy. You can (hopefully!) learn Snakemake in an afternoon!

The rest of these lessons aim to teach you how to use Snakemake by example. Our goal is to automate our example workflow, and have it do everything for us in parallel regardless of where and how it is run (and have it be reproducible!).

## Key Points

• Bash scripts are not an efficient way of storing a workflow.

• Snakemake is one method of managing a complex computational workflow.

# Snakefiles

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• How do I write a simple workflow?

Objectives
• Understand the components of a Snakefile: rules, inputs, outputs, and actions.

• Write a simple Snakefile.

• Run Snakemake from the shell.

Create a file, called `Snakefile`, with no file extension and containing the following content:

``````# Count words.
rule count_words:
input:    'books/isles.txt'
output:   'isles.dat'
shell:    'python wordcount.py books/isles.txt isles.dat'
``````

This is a build file, which for Snakemake is called a Snakefile — a file executed by Snakemake. Note that aside from a few keyword additions like `rule`, it follows standard Python 3 syntax.

Let us go through each line in turn:

• `#` denotes a comment. Any text from `#` to the end of the line is ignored by Snakemake.
• `isles.dat` is a target, a file to be created, or built. In Snakemake, these are called “outputs”, for simplicity’s sake.
• `books/isles.txt` is a dependency, a file that is needed to build or update the target. Targets can have zero or more dependencies. Dependencies in Snakemake are called “inputs”.
• `python wordcount.py books/isles.txt isles.dat` is an action, a command to run to build or update the target using the dependencies. In this case the action is a set of shell commands (we can also use Python code… more on that later).
• As in Python, you can use either tabs or spaces for indentation — don’t use both!
• Together, the target, dependencies, and actions form a rule. A rule is a recipe for how to make things.

Our rule above describes how to build the target `isles.dat` using the action `python wordcount.py` and the dependency `books/isles.txt`.

Information that was implicit in our shell script - that we are generating a file called `isles.dat` and that creating this file requires `books/isles.txt` - is now made explicit by Snakemake’s syntax.

Let’s first ensure we start from scratch and delete the `.dat` and `.png` files we created earlier:

``````\$ rm *.dat *.png
``````

By default, Snakemake looks for a file called `Snakefile`, and we can run Snakemake as follows:

``````\$ snakemake
``````

By default, Snakemake tells us what it’s doing as it executes actions:

``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count   jobs
1       count_words
1

rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 0

Finished job 0.
1 of 1 steps (100%) done
``````

Depending on your setup, you may receive the error `Error: you need to specify the maximum number of CPU cores to be used at the same time with --cores.` This can be fixed by passing an argument to the `snakemake` command. Try running the following:

``````\$ snakemake --cores 1
``````

If you see a different error, check your syntax and the filepaths of the files in your Snakefile. You can check your present working directory using the command `pwd`.

Remember, aside from stuff like `rule` and `input`, Snakemake follows Python syntax. Let’s see if we got what we expected:

``````\$ head -5 isles.dat
``````

The first 5 lines of `isles.dat` should look exactly like before.

## Snakefiles Do Not Have to be Called `Snakefile`

We don’t have to call our Snakefile `Snakefile`. However, if we call it something else we need to tell Snakemake where to find it. This we can do using `-s` flag. For example, if our Snakefile is named `MyOtherSnakefile`:

``````\$ snakemake -s MyOtherSnakefile
``````

When we re-run our Snakefile, Snakemake now informs us that:

``````Nothing to be done.
``````

This is because our target, `isles.dat`, has now been created, and Snakemake will not create it again. To see how this works, let’s pretend to update one of the text files. Rather than opening the file in an editor, we can use the shell `touch` command to update its timestamp (which would happen if we did edit the file):

``````\$ touch books/isles.txt
``````

If we compare the timestamps of `books/isles.txt` and `isles.dat`,

``````\$ ls -l books/isles.txt isles.dat
``````

then we see that `isles.dat`, the target, is now older than `books/isles.txt`, its dependency:

``````-rw-r--r--    1 mjj      Administ   323972 Jun 12 10:35 books/isles.txt
-rw-r--r--    1 mjj      Administ   182273 Jun 12 09:58 isles.dat
``````

If we run Snakemake again,

``````\$ snakemake
``````

then it recreates `isles.dat`:

``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count   jobs
1       count_words
1

rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 0

Finished job 0.
1 of 1 steps (100%) done
``````

When it is asked to build a target, Snakemake checks the “last modification time” of both the target and its dependencies. If any dependency has been updated since the target, then the actions are re-run to update the target. Using this approach, Snakemake knows to only rebuild the files that, either directly or indirectly, depend on the file that changed. This is called an incremental build.
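The timestamp rule itself is simple enough to sketch in a few lines (an illustration of the logic, not Snakemake's actual implementation):

``````
import os

def needs_rebuild(target, dependencies):
    # rebuild if the target doesn't exist yet...
    if not os.path.exists(target):
        return True
    # ...or if any dependency was modified more recently than the target
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
``````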

## Snakefiles as Documentation

By explicitly recording the inputs to and outputs from steps in our analysis and the dependencies between files, Snakefiles act as a type of documentation, reducing the number of things we have to remember.

Let’s add another rule to the end of `Snakefile`. Note that rules cannot have the same name, so we’ll call this one `count_words_abyss`.

``````rule count_words_abyss:
input:    'books/abyss.txt'
output:   'abyss.dat'
shell:    'python wordcount.py books/abyss.txt abyss.dat'
``````

If we run Snakemake,

``````\$ snakemake
``````

then we get:

``````Nothing to be done.
``````

Nothing happens because Snakemake attempts to build the first target it finds in the Snakefile (the default target), which is `isles.dat`, and that file is already up-to-date. We need to explicitly tell Snakemake we want to build `abyss.dat`:

``````\$ snakemake abyss.dat
``````

Now, we get:

``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count   jobs
1       count_words_abyss
1

rule count_words_abyss:
input: books/abyss.txt
output: abyss.dat
jobid: 0

Finished job 0.
1 of 1 steps (100%) done
``````

## “Up to Date” Versus “Nothing to be Done”

If we ask Snakemake to build a file that already exists and is up to date, then Snakemake informs us that:

``````Nothing to be done
``````

If we ask Snakemake to build a file that exists but for which there is no rule in our Snakefile, then we get a message like:

``````\$ snakemake wordcount.py
``````
``````MissingRuleException:
No rule to produce wordcount.py (if you use input functions make sure
that they don't raise unexpected exceptions).
``````

When we see this error, double-check that you have a rule to produce that file, and also that the filename has been specified correctly. Even a small difference in a filename will result in a `MissingRuleException`.

We may want to remove all our data files so we can explicitly recreate them all. We can introduce a new target, and associated rule, to do this. We will call it `clean`, as this is a common name for rules that delete auto-generated files, like our `.dat` files:

``````rule clean:
shell: 'rm -f *.dat'
``````

This is an example of a rule that has no inputs or outputs! We just want to remove the data files whether or not they exist. If we run Snakemake and specify this target,

``````\$ snakemake clean
``````

then we get:

``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count   jobs
1       clean
1

rule clean:
jobid: 0

Finished job 0.
1 of 1 steps (100%) done
``````

An `ls` of our current directory reveals that all of our troublesome output files are now gone (as planned)!

We can add a similar command to create all the data files. We can put this at the top of our Snakefile so that it is the default target, which is executed by default if no target is given to the `snakemake` command:

``````rule dats:
input:
'isles.dat',
'abyss.dat'
``````

This is an example of a rule that has dependencies that are targets of other rules. When `snakemake` runs, it will check to see if the dependencies exist and, if not, will see if rules are available that will create these. If such rules exist it will invoke these first, otherwise `snakemake` will raise an error.

## Dependencies

The order of rebuilding dependencies is arbitrary. You should not assume that they will be built in the order in which they are listed.

Dependencies must form a directed acyclic graph. A target cannot depend on a dependency which itself, or one of its dependencies, depends on that target.

This rule is also an example of a rule that has no actions. It is used purely to trigger the build of its dependencies, if needed.

If we run,

``````\$ snakemake dats
``````

then snakemake creates the data files:

``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count   jobs
1       count_words
1       count_words_abyss
1       dats
3

rule count_words_abyss:
input: books/abyss.txt
output: abyss.dat
jobid: 1

Finished job 1.
1 of 3 steps (33%) done

rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 2

Finished job 2.
2 of 3 steps (67%) done

localrule dats:
input: isles.dat, abyss.dat
jobid: 0

Finished job 0.
3 of 3 steps (100%) done
``````

If we run `dats` again, then snakemake will see that the dependencies (`isles.dat` and `abyss.dat`) are already up to date. Given the target `dats` has no actions, there is `nothing to be done`:

``````\$ snakemake dats
``````
``````Nothing to be done
``````

Our Snakefile now looks like this:

``````rule dats:
input:
'isles.dat',
'abyss.dat'

# delete everything so we can re-run things
rule clean:
shell: 'rm -f *.dat'

# count words in one of our "books"
rule count_words:
input:    'books/isles.txt'
output:   'isles.dat'
shell:    'python wordcount.py books/isles.txt isles.dat'

rule count_words_abyss:
input:    'books/abyss.txt'
output:   'abyss.dat'
shell:    'python wordcount.py books/abyss.txt abyss.dat'
``````

The following figure shows a graph of the dependencies embodied within our Snakefile, involved in building the `dats` target:

At this point, it becomes important to see what Snakemake is doing behind the scenes. What commands is Snakemake actually running? Snakemake has a special option (`-p`), that prints every command it is about to run. Additionally, we can also perform a dry run with `-n`. A dry run does nothing, and simply prints out commands instead of actually executing them. Very useful for debugging!

``````\$ snakemake clean
\$ snakemake -n -p isles.dat
``````
``````rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 0

python wordcount.py books/isles.txt isles.dat
Job counts:
count	jobs
1	count_words
1
``````

## Write Two New Rules

1. Write a new rule for `last.dat`, created from `books/last.txt`.
2. Update the `dats` rule with this target.
3. Write a new rule for `results.txt`, which creates the summary table. The rule needs to:
• Depend upon each of the three `.dat` files.
• Invoke the action `python zipf_test.py abyss.dat isles.dat last.dat > results.txt`.
4. Put this rule at the top of the Snakefile so that it is the default target.
5. Update `clean` so that it removes `results.txt`.

The following figure shows the dependencies embodied within our Snakefile, involved in building the `results.txt` target:

## Key Points

• Snakemake follows Python syntax

• Rules can have an input and/or outputs, and a command to be run.

# Wildcards

## Overview

Teaching: 30 min
Exercises: 15 min
Questions
• How can I abbreviate the rules in my pipeline?

Objectives
• Use snakemake wildcards to simplify our rules.

• Output files are a product not only of input files but also of the scripts or code that created them.

After the exercise at the end of the previous episode, our Snakefile looked like this:

``````# generate summary table
rule zipf_test:
input:  'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell:  'python zipf_test.py abyss.dat isles.dat last.dat > results.txt'

rule dats:
input: 'isles.dat', 'abyss.dat', 'last.dat'

# delete everything so we can re-run things
rule clean:
shell:  'rm -f *.dat results.txt'

# count words in one of our "books"
rule count_words:
input:  'books/isles.txt'
output: 'isles.dat'
shell:  'python wordcount.py books/isles.txt isles.dat'

rule count_words_abyss:
input:  'books/abyss.txt'
output: 'abyss.dat'
shell:  'python wordcount.py books/abyss.txt abyss.dat'

rule count_words_last:
input:  'books/last.txt'
output: 'last.dat'
shell:  'python wordcount.py books/last.txt last.dat'
``````

Our Snakefile has a lot of duplication. For example, the names of text files and data files are repeated in many places throughout the Snakefile. Snakefiles are a form of code and, in any code, repeated code can lead to problems (e.g. we rename a data file in one part of the Snakefile but forget to rename it elsewhere).

## D.R.Y. (Don’t Repeat Yourself)

In many programming languages, the bulk of the language features are there to allow the programmer to describe long-winded computational routines as short, expressive, beautiful code. Features in Python or R or Java, such as user-defined variables and functions are useful in part because they mean we don’t have to write out (or think about) all of the details over and over again. This good habit of writing things out only once is known as the “Don’t Repeat Yourself” principle or D.R.Y.

Let us set about removing some of the repetition from our Snakefile. In our `zipf_test` rule we duplicate the data file names and the name of the results file:

``````rule zipf_test:
input:
'abyss.dat',
'last.dat',
'isles.dat'
output: 'results.txt'
shell:  'python zipf_test.py abyss.dat isles.dat last.dat > results.txt'
``````

Looking at the results file name first, we can replace it in the action with `{output}`:

``````rule zipf_test:
input:  'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell:  'python zipf_test.py abyss.dat isles.dat last.dat > {output}'
``````

`{output}` is a Snakemake wildcard which is equivalent to the value we specified for the `output` section of the rule.

We can replace the dependencies in the action with `{input}`:

``````rule zipf_test:
input:  'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell:  'python zipf_test.py {input} > {output}'
``````

`{input}` is another wildcard which means ‘all the dependencies of the current rule’. Again, when Snakemake is run it will replace this variable with the dependencies.

Let’s update our text files and re-run our rule:

``````\$ touch books/*.txt
\$ snakemake results.txt
``````

We get:

``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
1	count_words
1	count_words_abyss
1	count_words_last
1	zipf_test
4

rule count_words_last:
input: books/last.txt
output: last.dat
jobid: 1

Finished job 1.
1 of 4 steps (25%) done

rule count_words_abyss:
input: books/abyss.txt
output: abyss.dat
jobid: 2

Finished job 2.
2 of 4 steps (50%) done

rule count_words:
input: books/isles.txt
output: isles.dat
jobid: 3

Finished job 3.
3 of 4 steps (75%) done

rule zipf_test:
input: abyss.dat, last.dat, isles.dat
output: results.txt
jobid: 0

Finished job 0.
4 of 4 steps (100%) done
``````

## Update Dependencies

What will happen if you now execute:

``````\$ touch *.dat
\$ snakemake results.txt
``````
1. nothing
2. all files recreated
3. only `.dat` files recreated
4. only `results.txt` recreated

## Solution

`4.` Only `results.txt` recreated.

The rules for `*.dat` are not executed because their corresponding `.txt` files haven’t been modified.

If you run:

``````\$ touch books/*.txt
\$ snakemake results.txt
``````

you will find that the `.dat` files as well as `results.txt` are recreated.

As we saw, `{input}` means ‘all the dependencies of the current rule’. This works well for `results.txt` as its action treats all the dependencies the same — as the input for the `zipf_test.py` script.

## Rewrite `.dat` rules to use wildcards

Rewrite each `.dat` rule to use the `{input}` and `{output}` wildcards.

## Handling dependencies differently

For many rules, we may want to treat some dependencies differently. For example, our rules for `.dat` use their first (and only) dependency specifically as the input file to `wordcount.py`. If we add additional dependencies (as we will soon do) then we don’t want these being passed as input files to `wordcount.py` as it expects only one input file to be named when it is invoked.

Snakemake provides several solutions to this. Depending on what we want to do, it’s possible to both index and name our wildcards.

Suppose we want to add `wordcount.py` as a dependency of each data file. In this case, we can use `{input[0]}` to refer to the first dependency, and `{input[1]}` to refer to the second.

``````rule count_words:
input:  'wordcount.py', 'books/isles.txt'
output: 'isles.dat'
shell:  'python {input[0]} {input[1]} {output}'
``````

Alternatively, we can name our dependencies.

``````rule count_words_abyss:
input:
wc='wordcount.py',
book='books/abyss.txt'
output: 'abyss.dat'
shell:  'python {input.wc} {input.book} {output}'
``````

Let’s mark `wordcount.py` as updated, and re-run the pipeline.

``````\$ touch wordcount.py
\$ snakemake
``````
``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
1	count_words
1	count_words_abyss
1	zipf_test
3

rule count_words_abyss:
input: wordcount.py, books/abyss.txt
output: abyss.dat
jobid: 2

Finished job 2.
1 of 3 steps (33%) done

rule count_words:
input: wordcount.py, books/isles.txt
output: isles.dat
jobid: 1

Finished job 1.
2 of 3 steps (67%) done

rule zipf_test:
input: abyss.dat, last.dat, isles.dat
output: results.txt
jobid: 0

Finished job 0.
3 of 3 steps (100%) done
``````

Notice how `last.dat` (which does not depend on `wordcount.py`) is not rebuilt. Intuitively, we should also add `wordcount.py` as a dependency of `results.txt`, as the final table should be rebuilt when we remake the `.dat` files. However, it turns out we don’t have to! Let’s see what happens to `results.txt` when we update `wordcount.py`:

``````\$ touch wordcount.py
\$ snakemake results.txt
``````

then we get:

``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
1	count_words
1	count_words_abyss
1	zipf_test
3

rule count_words_abyss:
input: wordcount.py, books/abyss.txt
output: abyss.dat
jobid: 2

Finished job 2.
1 of 3 steps (33%) done

rule count_words:
input: wordcount.py, books/isles.txt
output: isles.dat
jobid: 1

Finished job 1.
2 of 3 steps (67%) done

rule zipf_test:
input: abyss.dat, last.dat, isles.dat
output: results.txt
jobid: 0

Finished job 0.
3 of 3 steps (100%) done
``````

The whole pipeline is triggered, even the creation of the `results.txt` file! To understand this, note that according to the dependency figure, `results.txt` depends on the `.dat` files. The update of `wordcount.py` triggers an update of the `*.dat` files. Thus, `snakemake` sees that the dependencies (the `.dat` files) are newer than the target file (`results.txt`) and thus it recreates `results.txt`. This is an example of the power of `snakemake`: updating a subset of the files in the pipeline triggers rerunning the appropriate downstream steps.

## Updating One Input File

What will happen if you now execute:

``````touch books/last.txt
snakemake results.txt
``````
1. only `last.dat` is recreated
2. all `.dat` files are recreated
3. only `last.dat` and `results.txt` are recreated
4. all `.dat` and `results.txt` are recreated

## More dependencies…

Add `zipf_test.py` as a dependency of `results.txt`. Which method do you prefer here, indexing or named input files? Yes, this will be clunky, but we’ll fix that part later! Remember that you can do a dry run with `snakemake -n -p`!

## Key Points

• Use `{output}` to refer to the output of the current rule.

• Use `{input}` to refer to the dependencies of the current rule.

• You can use Python indexing to retrieve individual outputs and inputs (example: `{input[0]}`)

• Wildcards can be named (example: `{input.file1}`).

# Pattern Rules

## Overview

Teaching: 15 min
Exercises: 0 min
Questions
• How can I define rules to operate on similar files?

Objectives
• Write Snakemake pattern rules.

Our Snakefile still has a ton of repeated content. The rules for each `.dat` file all do the same thing, for the most part. We can replace these rules with a single pattern rule which can be used to build any `.dat` file from a `.txt` file in `books/`:

``````rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: '{file}.dat'
shell:  'python {input.wc} {input.book} {output}'
``````

`{file}` is another arbitrary wildcard, that we can use as a placeholder for any generic book to analyse. Note that we don’t have to use `{file}` as the name of our wildcard — it can be anything we want!

This rule can be interpreted as: “In order to build a file named `something.dat` (the output) find a file named `books/something.txt` (the input) and run `wordcount.py input output`.”

``````\$ snakemake clean
# use the -p option to show that it is running things correctly!
\$ snakemake -p dats
``````

We should see the same output as before. Note that we can still use snakemake to build individual `.dat` targets as before, and that our new rule will work no matter what stem is being matched.

``````\$ snakemake -p sierra.dat
``````

which gives the output below:

``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
1	count_words
1

rule count_words:
input: wordcount.py, books/sierra.txt
output: sierra.dat
jobid: 0
wildcards: file=sierra

python wordcount.py books/sierra.txt sierra.dat
Finished job 0.
1 of 1 steps (100%) done
``````

## Using wildcards

Our arbitrary wildcards like `{file}` can only be used in the `input:` and `output:` fields. They cannot be used in actions.

Our Snakefile is now much shorter and cleaner:

``````# generate summary table
rule zipf_test:
input:  'zipf_test.py', 'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell:  'python {input[0]} {input[1]} {input[2]} {input[3]} > {output}'

rule dats:
input:
'isles.dat', 'abyss.dat', 'last.dat'

# delete everything so we can re-run things
rule clean:
shell:  'rm -f *.dat results.txt'

# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: '{file}.dat'
shell:  'python {input.wc} {input.book} {output}'
``````

## Key Points

• Use any named wildcard (`{some_name}`) as a placeholder in targets and dependencies.

# Snakefiles are Python code

## Overview

Teaching: 30 min
Exercises: 15 min
Questions
• How can I automatically manage dependencies and outputs?

• How can I use Python code to add features to my pipeline?

Objectives
• Use variables, functions, and imports in a Snakefile.

• Learn to use the `run` action to execute Python code as an action.

Despite our efforts, our pipeline still has repeated content, for instance the names of output files/dependencies. Our `zipf_test` rule, for instance, is extremely clunky. What happens if we want to analyse `books/sierra.txt` as well? We’d have to update everything!

``````rule zipf_test:
input:  'zipf_test.py', 'abyss.dat', 'last.dat', 'isles.dat'
output: 'results.txt'
shell:  'python {input[0]} {input[1]} {input[2]} {input[3]} > {output}'
``````

First, let’s cut down on a little bit of the clunkiness of the `shell` directive. One thing you’ve probably noticed is that all of our rules are using Python strings. Other data structures work too — let’s try a list:

``````rule zipf_test:
input:
zipf='zipf_test.py',
books=['abyss.dat', 'last.dat', 'isles.dat']
output: 'results.txt'
shell:  'python {input.zipf} {input.books} > {output}'
``````

(`snakemake clean` and `snakemake -p` should show that the pipeline still works!)

This illustrates a key feature of Snakemake. Snakefiles are just Python code. We can make our list into a variable to demonstrate this. Let’s create the variable `DATS` and use it in our `zipf_test` and `dats` rules.

``````DATS=['abyss.dat', 'last.dat', 'isles.dat']

# generate summary table
rule zipf_test:
input:
zipf='zipf_test.py',
books=DATS
output: 'results.txt'
shell:  'python {input.zipf} {input.books} > {output}'

rule dats:
input: DATS
``````

Try re-creating both the `dats` and `results.txt` targets (run `snakemake clean` in between).
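Because the Snakefile is ordinary Python, file lists like `DATS` can also be computed instead of typed out by hand. A small sketch (the book names are illustrative):

``````# derive the .dat filenames from the book stems with a list comprehension
BOOKS = ['abyss', 'last', 'isles']
DATS = [book + '.dat' for book in BOOKS]
``````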

## When are Snakefiles executed?

The last example illustrated that we can use arbitrary Python code in our Snakefile. It’s important to understand when this code gets executed. Let’s add a `print` instruction to the top of our Snakefile.

``````print('Snakefile is being executed!')

DATS=['abyss.dat', 'last.dat', 'isles.dat']

# generate summary table
rule zipf_test:
input:
# more output below
``````

Now let’s clean up our workspace with `snakemake clean`

``````snakemake clean
``````
``````Snakefile is being executed!
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
1	clean
1

rule clean:
jobid: 0

Finished job 0.
1 of 1 steps (100%) done
``````

Now let’s re-run the pipeline…

``````\$ snakemake
``````
``````Snakefile is being executed!
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
3	count_words
1	zipf_test
4

rule count_words:
input: wordcount.py, books/last.txt
output: last.dat
jobid: 3
wildcards: file=last

Finished job 3.
1 of 4 steps (25%) done

rule count_words:
input: wordcount.py, books/abyss.txt
output: abyss.dat
jobid: 1
wildcards: file=abyss

Finished job 1.
2 of 4 steps (50%) done

rule count_words:
input: wordcount.py, books/isles.txt
output: isles.dat
jobid: 2
wildcards: file=isles

Finished job 2.
3 of 4 steps (75%) done

rule zipf_test:
input: zipf_test.py, abyss.dat, last.dat, isles.dat
output: results.txt
jobid: 0

Finished job 0.
4 of 4 steps (100%) done
``````

Let’s do a dry-run:

``````\$ snakemake -n
``````
``````Snakefile is being executed!
Nothing to be done.
``````

In every case, the `print()` statement ran before any of the actual pipeline code. What we can take away from this is that Snakemake executes the entire Snakefile every time we run `snakemake` (regardless of whether it’s a dry run!). Because of this, we need to be careful, and only put tasks that do “real work” (changing files on disk) inside rules.

## Using functions in Snakefiles

In our example here, we only have 4 books. But what if we had 700 books to be processed? It would be a massive effort to update our `DATS` variable to add the name of every single book’s corresponding `.dat` filename.

Fortunately, Snakemake ships with several functions that make working with large numbers of files much easier. The two most helpful ones are `glob_wildcards()` and `expand()`. Let’s start an interactive Python session to see how they work:

``````\$ python3
``````
``````Python 3.6.1 (default, Jun 27 2017, 14:35:15)
``````

In this example, we will import these Snakemake functions directly in our interactive Python session. It is not necessary to import these functions within your Snakefile — these functions are always imported for you.

``````from snakemake.io import expand, glob_wildcards
``````

### Generating file names with expand()

The first function we’ll use is `expand()`. `expand()` does just what its name suggests: it expands one or more Snakemake wildcards into a set of filenames.

``````>>> expand('folder/{wildcard1}_{wildcard2}.txt',
...        wildcard1=['a', 'b', 'c'],
...        wildcard2=[1, 2, 3])
``````
``````['folder/a_1.txt',
'folder/a_2.txt',
'folder/a_3.txt',
'folder/b_1.txt',
'folder/b_2.txt',
'folder/b_3.txt',
'folder/c_1.txt',
'folder/c_2.txt',
'folder/c_3.txt']
``````

In this case, `expand()` created every possible combination of filenames from the two wildcards. Useful! Of course, this still leaves us needing to somehow get the values for `wildcard1` and `wildcard2` in the first place.

### Get wildcard values with glob_wildcards()

To get a set of wildcards from a list of files, we can use the `glob_wildcards()` function. Let’s try grabbing all of the book titles in our `books` folder.

``````>>> glob_wildcards('books/{example}.txt')
``````
``````Wildcards(example=['isles', 'last', 'abyss', 'sierra'])
``````

`glob_wildcards()` returns a `Wildcards` object as output. `Wildcards` is a special object defined by Snakemake that provides named lists.

In this case, there is only one wildcard, `{example}`. We can extract the values for the file names by getting the `example` property from the output of `glob_wildcards()`:

``````>>> glob_wildcards('books/{example}.txt').example
``````
``````['isles', 'last', 'abyss', 'sierra']
``````

## Putting it all together

Using the `expand()` and `glob_wildcards()` functions, modify the pipeline so that it automatically detects and analyses all the files in the `books/` folder.

## Using Python code as actions

One very useful feature of Snakemake is the ability to execute Python code instead of just shell commands. Instead of `shell:` as an action, we can use `run:` instead.

Add the following to our snakefile:

``````# at the top of the file
import os
import glob

rule print_book_names:
run:
print('These are all the book names:')
for book in glob.glob('books/*.txt'):
print(book)
``````

Upon execution of the corresponding rule, Snakemake dutifully runs our Python code in the `run:` block:

``````\$ snakemake print_book_names
``````
``````Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
1	print_book_names
1

rule print_book_names:
jobid: 0

These are all the book names:
books/isles.txt
books/last.txt
books/abyss.txt
books/sierra.txt
Finished job 0.
1 of 1 steps (100%) done
``````

## Moving output locations

Alter the rules in your Snakefile so that the `.dat` files are created in their own `dats/` folder. Note that creating this folder beforehand is unnecessary. Snakemake automatically creates any folders for you, as needed.

## Creating PNGs

Add new rules and update existing rules to:

• Create `.png` files from `.dat` files using `plotcount.py`.
• Remove all auto-generated files (`.dat`, `.png`, `results.txt`).

Finally, many Snakefiles define a default target called `all` as the first target, which builds what the Snakefile has been written to build (e.g. in our case, the `.png` files and the `results.txt` file). Add an `all` target to your Snakefile (hint: this rule has the `results.txt` file and the `.png` files as dependencies, but no actions). With that in place, instead of running `snakemake results.txt`, you should now run `snakemake all`, or simply `snakemake`.

## Creating an Archive

• Create an archive, `zipf_analysis.tar.gz`, to hold all our `.dat` files, plots, and the Zipf summary table.
• Update `all` to expect `zipf_analysis.tar.gz` as input.
• Remove `zipf_analysis.tar.gz` when `snakemake clean` is called.

The syntax to create an archive is shown below:

``````tar -czvf zipf_analysis.tar.gz file1 directory2 file3 etc
``````

After these exercises our final workflow should look something like the following:

We can now do a better job at testing Zipf’s rule by adding more books. The books we have used come from the Project Gutenberg website. Project Gutenberg offers thousands of free e-books to download.

### Exercise instructions

• Go to Project Gutenberg and use the search box to find another book, for example ‘The Picture of Dorian Gray’ by Oscar Wilde.
• Download the ‘Plain Text UTF-8’ version and save it to the `books` folder; choose a short name for the file
• Optionally, open the file in a text editor and remove extraneous text at the beginning and end (look for the phrase `End of Project Gutenberg's [title], by [author]`)
• Run `snakemake` and check that the correct commands are run
• Check the `results.txt` file to see how this book compares to the others

## Key Points

• Snakefiles are Python code.

• The entire Snakefile is executed whenever you run `snakemake`.

• All actual work should be done by rules.

# Resources and parallelism

## Overview

Teaching: 30 min
Exercises: 15 min
Questions
• How do I scale a pipeline across multiple cores?

• How do I manage access to resources while working in parallel?

Objectives
• Modify your pipeline to run in parallel.

After the exercises at the end of our last lesson, our Snakefile looks something like this:

``````# our zipf analysis pipeline
DATS = glob_wildcards('books/{book}.txt').book

rule all:
input:
'zipf_analysis.tar.gz'

# delete everything so we can re-run things
rule clean:
shell:
'''
rm -rf results dats plots
rm -f results.txt zipf_analysis.tar.gz
'''

# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
shell:  'python {input.wc} {input.book} {output}'

# create a plot for each book
rule make_plot:
input:
plotcount='plotcount.py',
book='dats/{file}.dat'
output: 'plots/{file}.png'
shell:  'python {input.plotcount} {input.book} {output}'

# generate summary table
rule zipf_test:
input:
zipf='zipf_test.py',
books=expand('dats/{book}.dat', book=DATS)
output: 'results.txt'
shell:  'python {input.zipf} {input.books} > {output}'

# create an archive with all of our results
rule make_archive:
input:
expand('plots/{book}.png', book=DATS),
expand('dats/{book}.dat', book=DATS),
'results.txt'
output: 'zipf_analysis.tar.gz'
shell: 'tar -czvf {output} {input}'
``````

At this point, we have a complete data analysis pipeline. Very cool. But how do we make it run as efficiently as possible?

## Running in parallel

Up to this point, Snakemake has printed out an interesting message whenever we run our pipeline.

``````Provided cores: 1
Rules claiming more threads will be scaled down.
``````

So far, Snakemake has been run with one core. Let’s scale up our pipeline to run in parallel. The only change we need to make is to run Snakemake with the `-j` argument. `-j` is used to indicate the number of CPU cores available; on a cluster, it indicates the maximum number of jobs (we’ll get to that part later). Note that 4 cores is usually a safe assumption when working on a laptop.

``````\$ snakemake clean
\$ snakemake -j 4
``````
``````Provided cores: 4
Rules claiming more threads will be scaled down.
# more output follows
``````

Our pipeline ran in parallel and finished roughly 4 times as quickly! The takeaway here is that all we need to do to scale from a serial pipeline is run `snakemake` with the `-j` option.

## How many CPUs does your computer have?

Now that we can have our pipeline use multiple CPUs, how do we know how many CPUs to provide to the `-j` option? Note that for all of these options, it’s best to use CPU cores, and not CPU threads.

Linux: You can use the `lscpu` command.

All platforms: Python’s `psutil` module can be used to fetch the number of cores in your computer. Using `logical=False` returns the number of true CPU cores. `logical=True` gives the number of CPU threads on your system.

``````import psutil
psutil.cpu_count(logical=False)
``````

## Managing CPUs

Each rule has a number of optional keywords aside from the usual `input`, `output`, and `shell`/`run`. The `threads` keyword is used to specify how many CPU cores a rule needs while executing. Though in reality CPU threads are not quite the same as CPU cores, the two terms are interchangeable when working with Snakemake.

Let’s pretend that our `count_words` rule is actually very CPU-intensive. We’ll say that it needs a whopping 4 CPUs per run. We can specify this with the `threads` keyword in our rule. We will also modify the rule to print out the number of threads it thinks it is using. Please note that just giving a Snakemake rule 4 threads does not automatically make its action run in parallel! The action also needs to be threads-capable and to explicitly use the `{threads}` information for this to happen. In this case `wordcount.py` is actually still running with 1 core, we are simply using it as a demonstration of how to go about running something with multiple cores.

``````rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
threads: 4
shell:
'''
echo "Running {input.wc} with {threads} cores."
python {input.wc} {input.book} {output}
'''
``````

Now when we run `snakemake -j 4`, the jobs from `count_words` are run one at a time so as to give each job the resources it needs. Since each job of the `count_words` rule requires 4 threads (as per the newly added `threads` directive), and because all jobs together have a maximum of 4 cores available to them as per the `-j 4` option, the `count_words` jobs are run one at a time. All of our other rules will still run in parallel, since rules request a single thread by default.

``````Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
1	all
4	count_words
1	make_archive
4	make_plot
1	zipf_test
11

rule count_words:
input: wordcount.py, books/last.txt
output: dats/last.dat
jobid: 3
wildcards: file=last

Running wordcount.py with 4 cores.
Finished job 3.
1 of 11 steps (9%) done

# other output follows
``````

What happens when we don’t have 4 cores available? What if we tell Snakemake to run with 2 cores instead?

``````\$ snakemake -j 2
``````
``````Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
count	jobs
1	all
4	count_words
1	make_archive
4	make_plot
1	zipf_test
11

rule count_words:
input: wordcount.py, books/last.txt
output: dats/last.dat
jobid: 6
wildcards: file=last

Running wordcount.py with 2 cores.
Finished job 6.
1 of 11 steps (9%) done

# more output below
``````

The key bit of output is `Rules claiming more threads will be scaled down.`. When Snakemake doesn’t have enough cores to run a rule (as specified by its `threads` directive), Snakemake will run that rule with the maximum available number of cores instead. After all, Snakemake’s job is to get our workflow done. It automatically scales our workload to match the maximum number of cores available, without us editing the Snakefile.

## Chaining multiple commands

Up until now, all of our commands have fit on one line. To execute multiple bash commands, the only modification we need to make is to use a Python multiline string (begin and end with `'''`).

One important addition we should be aware of is the `&&` operator. `&&` is a bash operator that runs commands as part of a chain. If the first command fails, the remaining steps are not run. This is safer than bash’s default behaviour of hitting an error and carrying on regardless. After all, if the first command failed, it’s unlikely the other steps will work.

``````# count words in one of our "books"
rule count_words:
input:
wc='wordcount.py',
book='books/{file}.txt'
output: 'dats/{file}.dat'
shell:
'''
echo "Running {input.wc} with {threads} cores on {input.book}." &&
python {input.wc} {input.book} {output}
'''
``````

## Managing other types of resources

Not all compute resources are CPUs. Examples might include limited amounts of RAM, number of GPUs, database locks, or perhaps we simply don’t want multiple processes writing to the same file at once. All non-CPU resources are handled using the `resources` keyword.

For our example, let’s pretend that creating a plot with `plotcount.py` requires dedicated access to a GPU (it doesn’t), and only one GPU is available. How do we indicate this to Snakemake so that it knows to give dedicated access to a GPU for rules that need it? Let’s modify the `make_plot` rule as an example:

``````# create a plot for each book
rule make_plot:
input:
plotcount='plotcount.py',
book='dats/{file}.dat'
output: 'plots/{file}.png'
resources: gpu=1
shell:  'python {input.plotcount} {input.book} {output}'
``````

We can execute our pipeline using the following (using 8 cores and 1 gpu):

``````\$ snakemake clean
\$ snakemake -j 8 --resources gpu=1
``````
``````Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: gpu=1
# other output removed for brevity
``````

Resources are entirely arbitrary — like wildcards, they can be named anything. Snakemake knows nothing about them aside from the fact that they have a name and a value. In this case `gpu` indicates simply that there is a resource called `gpu` used by `make_plot`. We provided 1 `gpu` to the workflow, and the `gpu` is considered in use as long as the rule is running. Once the `make_plot` rule completes, the `gpu` it consumed is added back to the pool of available `gpu`s. To be extra clear: `gpu` in this case does not actually represent a GPU, it is an arbitrary limit used to prevent multiple tasks that use a `gpu` from executing at the same time.

But what happens if we run our pipeline without specifying the number of GPUs?

``````\$ snakemake clean
\$ snakemake -j 8
``````
``````Provided cores: 8
Rules claiming more threads will be scaled down.
Unlimited resources: gpu
``````

If you have specified that a rule needs a certain resource, but do not specify how many you have, Snakemake will assume that the resources in question are unlimited.

## Other uses for `resources`

Resources do not have to correspond to actual compute resources. Perhaps one rule is particularly I/O-heavy, and it’s best if only a limited number of these jobs run at a time. Or maybe a type of rule uses a lot of network bandwidth as it downloads data. In all of these cases, `resources` can be used to constrain access to arbitrary compute resources so that each rule can run at its most efficient. Snakemake will run your rules in such a way as to maximise throughput given your resource constraints.

## Key Points

• Use `threads` to indicate the number of cores used by a rule.

• Resources are arbitrary and can be used for anything.

• The `&&` operator is a useful tool when chaining bash commands.

# Scaling a pipeline across a cluster

## Overview

Teaching: 30 min
Exercises: 15 min
Questions
• How do I run my workflow on an HPC system?

Objectives
• Understand the Snakemake cluster job submission workflow.

Right now we have a reasonably effective pipeline that scales nicely on our local computer. However, for the sake of this course, we’ll pretend that our workflow actually takes significant computational resources and needs to be run on a cluster.

## HPC cluster architecture

Most HPC clusters are run using a scheduler. The scheduler is a piece of software that decides which compute jobs run on which compute nodes, and when. It allows many users to share a computing system as efficiently as possible. In order to use it, users typically write the commands to be run into a shell script and then “submit” it to the scheduler.

A good analogy would be a university’s room booking system. No one gets to use a room without going through the booking system. The booking system decides which rooms people get based on their requirements (number of students, time allotted, etc.).

Normally, moving a workflow to be run by a cluster scheduler requires a lot of work. Batch scripts need to be written, and you’ll need to monitor and babysit the status of each of your jobs. This is especially difficult if one batch job depends on the output from another. Even moving from one cluster to another (especially ones using a different scheduler) requires a large investment of time and effort — all the batch scripts from before need to be rewritten.
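
By hand, each step of the pipeline would need its own batch script along these lines (a sketch for SLURM; the resource values and log filename are assumptions):

``````#!/usr/bin/env bash
#SBATCH --time=00:05:00       # wall-clock time limit
#SBATCH --mem=1000            # memory in MB
#SBATCH -c 1                  # number of cores
#SBATCH -o count_words_%j.out # where the job's output is written

# run one step of the pipeline manually
python wordcount.py books/abyss.txt dats/abyss.dat
``````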

Snakemake does all of this for you. All details of running the pipeline through the cluster scheduler are handled by Snakemake — this includes writing batch scripts, submitting, and monitoring jobs. In this scenario, the role of the scheduler is limited to ensuring each Snakemake rule is executed with the resources it needs.

We’ll explore how to port our example Snakemake pipeline. Our current Snakefile is shown below:

``````# our zipf analysis pipeline
DATS = glob_wildcards('books/{book}.txt').book

rule all:
    input:
        'zipf_analysis.tar.gz'

# delete everything so we can re-run things
rule clean:
    shell:
        '''
        rm -rf results dats plots
        rm -f results.txt zipf_analysis.tar.gz
        '''

# count words in one of our "books"
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    shell:
        '''
        python {input.wc} {input.book} {output}
        '''

# create a plot for each book
rule make_plot:
    input:
        plotcount='plotcount.py',
        book='dats/{file}.dat'
    output: 'plots/{file}.png'
    resources: gpu=1
    shell: 'python {input.plotcount} {input.book} {output}'

# generate summary table
rule zipf_test:
    input:
        zipf='zipf_test.py',
        books=expand('dats/{book}.dat', book=DATS)
    output: 'results.txt'
    shell: 'python {input.zipf} {input.books} > {output}'

# create an archive with all of our results
rule make_archive:
    input:
        expand('plots/{book}.png', book=DATS),
        expand('dats/{book}.dat', book=DATS),
        'results.txt'
    output: 'zipf_analysis.tar.gz'
    shell: 'tar -czvf {output} {input}'
``````

To run Snakemake on a cluster, we need to create a profile that tells Snakemake how to submit jobs to our cluster. We can then run our workflow with the `--profile` argument followed by the name of the profile. In this configuration, Snakemake runs on the cluster head node and submits jobs to the scheduler. Each cluster job executes a single rule and then exits. Snakemake detects the creation of output files, and submits new jobs (rules) once their dependencies have been created.

## Transferring our workflow

Let’s port our workflow to Compute Canada’s Graham cluster as an example (you will probably be using a different cluster; adapt these instructions accordingly). The first step will be to transfer our files to the cluster and log on via SSH. Snakemake has a powerful archiving utility that we can use to bundle up our workflow and transfer it.

``````\$ snakemake clean
\$ tar -czvf pipeline.tar.gz .
# transfer the pipeline via scp
\$ scp pipeline.tar.gz [email protected]:
# log on to the cluster
\$ ssh -X [email protected]
``````

## `snakemake --archive` and Conda deployment

Snakemake has a built-in method to archive all input files and scripts under version control: `snakemake --archive`. What’s more, it also installs any required dependencies if they can be installed using Anaconda’s `conda` package manager. You can use this feature for this tutorial (I’ve already added all of the files to version control for you), but if you want to use this feature in your own work, you should familiarise yourself with a version control tool like Git.

At this point we’ve archived our entire pipeline, sent it to the cluster, and logged on. Let’s create a folder for our pipeline and unpack it there.

``````\$ mkdir pipeline
\$ mv pipeline.tar.gz pipeline
\$ cd pipeline
\$ tar -xvzf pipeline.tar.gz
``````

If Snakemake and Python are not already installed on your cluster, you can install them in an Anaconda Python environment using the following commands:

``````\$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
\$ bash Miniconda3-latest-Linux-x86_64.sh
``````

This is an interactive installation through the command line. Review and accept the license agreement, then work through the prompts. The defaults are probably fine. Accept its offer to initialize your environment (`conda init`), then run the suggested command to load the `conda` base environment so you can use it straight away. Finally, install Snakemake from the bioconda channel:

``````\$ conda install -y -c bioconda graphviz matplotlib numpy snakemake
``````

Assuming you’ve transferred your files and everything is set to go, the command `snakemake -n` should work without errors.

## Creating a cluster profile

Snakemake uses a YAML-formatted configuration file to retrieve cluster submission parameters; we will use the SLURM scheduler as an example. When we use the `--profile slurm` argument, Snakemake looks for a directory with the name of our profile (`slurm`) containing a `config.yaml` file such as the one below.

``````cluster: "sbatch --time={resources.time_min} --mem={resources.mem_mb}
    -c {resources.cpus} -o slurm/logs/{rule}_{wildcards}
    -e slurm/logs/{rule}_{wildcards}"
jobs: 25
default-resources: [cpus=1, mem_mb=1000, time_min=5]
resources: [cpus=100, mem_mb=100000]
``````

This file has several components. `cluster` and the arguments that follow tell Snakemake how to submit jobs to the cluster. Here we’ve used SLURM’s `sbatch` command with arguments for setting time limits and resources, with Snakemake wildcards supplying the requested values.

We’ve also specified where to save SLURM logs and what to call them. Note that these folders must already exist; if they don’t, Snakemake will hang.

Values for any command line argument to Snakemake can be defined in our profile, although a value is always required in the profile (e.g. the `--use-conda` flag would be written as `use-conda: true`).
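
For instance, a profile that also enables Conda integration and keeps going after independent job failures might look like this (the `use-conda` and `keep-going` entries are illustrative additions to the profile shown above):

``````cluster: "sbatch --time={resources.time_min} --mem={resources.mem_mb} -c {resources.cpus}"
jobs: 25
use-conda: true
keep-going: true
default-resources: [cpus=1, mem_mb=1000, time_min=5]
``````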

`jobs` specifies the maximum number of jobs that will be submitted at one time. We also specified the `default-resources` that will be requested for each job, while `resources` defines the resource limits.

With these parameters, Snakemake will use no more than 100 CPUs and 100000 MB (100 GB) of memory across all currently submitted jobs. While it does not come into play here, a generally sensible default for `jobs` is slightly above the maximum number of jobs you are allowed to have submitted at a time.

The defaults won’t always be perfect, however; chances are some rules will need to run with non-default amounts of memory or time limits. We are using the `count_words` rule as an example of this. To request non-default resources for a job, we can modify the rule in our Snakefile to include a `resources` section like this:

``````# count words in one of our "books"
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    resources: cpus=4, mem_mb=8000, time_min=20
    shell:
        '''
        python {input.wc} {input.book} {output}
        '''
``````

## Local rule execution

Some Snakemake rules perform trivial tasks where job submission would be overkill (i.e. less than a minute’s worth of compute time). It is a better idea to have these rules execute locally (i.e. where the `snakemake` command is run) instead of as a job. Let’s define `all`, `clean`, and `make_archive` as localrules near the top of our `Snakefile`.

``````localrules: all, clean, make_archive
``````

## Running our workflow on the cluster

OK, time for the moment we’ve all been waiting for — let’s run our workflow on the cluster with the profile we’ve created. Use this command:

``````\$ snakemake --profile slurm
``````

While things execute, you may wish to SSH to the cluster in another window so you can watch the pipeline’s progress with `watch squeue -u \$(whoami)`.

## Notes on `\$PATH`

As with any cluster job, jobs started by Snakemake need the commands they run to be on `\$PATH`. For some schedulers (e.g. SLURM), no modifications are necessary: environment variables are passed to jobs by default. Other schedulers (e.g. SGE) need this enabled through a command line flag when submitting jobs (`-V` for SGE). If this is possible, just run the `module load` commands you need before starting Snakemake, and run it as normal.

If this is not possible, you have several options:

• You can edit your `.bashrc` file to modify `\$PATH` for all jobs and sessions you start on a cluster.
• Inserting `shell.prefix('some command')` in a Snakefile means that all rules run will be prefixed by `some_command`. You can use this to modify `\$PATH`, e.g., `shell.prefix('PATH=/extra/directory:\$PATH ')`.
• You can modify rules directly to run the appropriate `module load` commands beforehand. This is not recommended, if only because it is more work than the other options available.

## Submitting a workflow with nohup

`nohup some_command &` runs a command in the background and lets it keep running if you log off. Try running the pipeline in cluster mode using `nohup` (run `snakemake clean` beforehand). Where does the Snakemake log go to? Why might this technique be useful? Can we also submit the `snakemake --profile slurm` pipeline as a job? Where does the Snakemake command run in each scenario?

You can kill the running Snakemake process with `killall snakemake`. Notice that if you try to run Snakemake again, it says the directory is locked. You can unlock the directory with `snakemake --unlock`.

## Key Points

• Snakemake generates and submits its own batch scripts for your scheduler.

• `localrules` defines rules that are executed on the Snakemake head node.

• `\$PATH` must be passed to Snakemake rules.

• `nohup <command> &` prevents `<command>` from exiting when you log off.

# Final notes

## Overview

Teaching: 15 min
Exercises: 15 min
Questions
• What are some tips and tricks I can use to make this easier?

Objectives
• Understand how to make a DAG graph.

Now that we know how to write and scale a pipeline, here are some tips and tricks for making the process go more smoothly.

## `snakemake -n` is your friend

Whenever you edit your Snakefile, run `snakemake -n` immediately afterwards. This will check for errors and make sure that the pipeline is able to run.

The most common source of errors is a mismatch in filenames (Snakemake doesn’t know how to produce a particular output file) — `snakemake -n` will catch this as long as the troublesome output files haven’t already been made.

## Configuring logging

By default, Snakemake prints all stdout and stderr output from rules. This is useful, but if a failure occurs (or we otherwise need to inspect the logs), it can be extremely difficult to determine what happened or which rule had an issue, especially when running in parallel.

The solution to this issue is to redirect the output from each rule (for each set of inputs) to a dedicated log file. We can do this using the `log` keyword. Let’s modify our `count_words` rule to be slightly more verbose and redirect its output to a dedicated log file.

Two things before we start:

• `&>` is a handy operator in bash that redirects both `stdout` and `stderr` to a file.
• `&>>` does the same thing as `&>`, but appends to a file instead of overwriting it.
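
As a quick sanity check, the two operators can be demonstrated on their own (`redirect_demo.log` is just a scratch file for this example; both operators are bash-specific):

``````# &> overwrites the target file; &>> appends to it
echo "first line" &> redirect_demo.log
echo "second line" &>> redirect_demo.log
cat redirect_demo.log
# prints both lines, since the second echo appended rather than overwrote
``````

With that in hand, here is the modified rule: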
``````# count words in one of our "books"
rule count_words:
    input:
        wc='wordcount.py',
        book='books/{file}.txt'
    output: 'dats/{file}.dat'
    threads: 4
    log: 'dats/{file}.log'
    shell:
        '''
        echo "Running {input.wc} with {threads} cores on {input.book}." &> {log}
        python {input.wc} {input.book} {output} &>> {log}
        '''
``````
``````\$ snakemake clean
\$ snakemake -j 8
\$ cat dats/abyss.log
``````
``````# snakemake output omitted
Running wordcount.py with 4 cores on books/abyss.txt.
``````

Notice how the rule’s output no longer appears in Snakemake’s terminal output; it is redirected to a log file instead.

## Choosing a good log file location

Though you can put a log anywhere (and name it anything), it is often a good practice to put the log in the same directory where the rule’s output will be created. If you need to investigate the output for a rule and associated log files, this means that you only have to check one location!

## Token files

Often, a rule does not generate a unique output, and merely modifies a file. In these cases it is often worthwhile to create a placeholder, or “token file”, as output. A token file is simply an empty file that you can create with the `touch` command (`touch some_file.txt` creates an empty file called `some_file.txt`). An example rule using this technique is shown below:

``````rule token_example:
    input:  'some_file.txt'
    output: 'some_file.tkn'   # marks some_file.txt as modified
    shell:
        '''
        some_command --do-things {input} &&
        touch {output}
        '''
``````

## Directory locks

Only one instance of Snakemake can run in a directory at a time. If a Snakemake run fails without unlocking the directory (if you killed the process, for instance), you can run `snakemake --unlock` to unlock it.

## Python as a fallback

Remember, you can use Python imports and functions anywhere in a Snakefile. If something seems a little tricky to implement, chances are Python can do it. The `os`, `shutil`, and `subprocess` packages are useful tools for using Python to execute command line actions. In particular, `os.system('some command')` will run a command on the command line and block until execution is complete.

## Creating a workflow diagram

Assuming graphviz is installed (`conda install graphviz`), you can create a diagram of your workflow with the command `snakemake --dag | dot -Tsvg > dag.svg`. This creates a plot of your “directed acyclic graph” (a plot of all of the rules Snakemake thinks it needs to complete), which you can view using any image viewer. In fact, this was the tool used to create all of the diagrams in this lesson:

``````snakemake --dag | dot -Tsvg > dag.svg
eog dag.svg     # eog is an image viewer installed on many Linux systems
``````

Rules that have yet to be completed are indicated with solid outlines, while already completed tasks are indicated with dashed outlines. In this case, I ran `snakemake clean` just before creating the diagram, so no rules have been run yet.

## Viewing the GUI

Snakemake has an experimental web browser GUI. I personally haven’t used it for anything, but it’s cool to know it’s there and can be used to view your workflow on the fly.

`snakemake --gui`

## Where to go for documentation / help

The Snakemake documentation is located at snakemake.readthedocs.io

## Key Points

• Token files can be used to take the place of output files if none are created.

• `snakemake --dag | dot -Tsvg > dag.svg` creates a graphic of your workflow.

• `snakemake --gui` opens a browser window with your workflow.