Completing the Pipeline
Overview
Teaching: 10 min
Exercises: 30 min
Questions
How do I move generated files into a subdirectory?
How do I add new processing rules to a Snakefile?
What are some common practices for Snakemake?
How can I get my workflow to clean up generated files?
What is a default rule?
Objectives
Update existing rules so that dat files are created in a subdirectory.
Add a rule to your Snakefile that generates PNG plots of word frequencies.
Add an all rule to your Snakefile.
Make all the default rule.
Moving Output Files into a Subdirectory
Currently our workflow is generating a lot of files in the main directory. This is not so bad with small numbers of files, but it can get messy as the file count grows. One approach to this is to generate outputs into their own directories, named after the file types. For example:
.
├── books
│ ├── abyss.txt
│ ├── isles.txt
│ ├── last.txt
│ ├── LICENSE_TEXTS.md
│ └── sierra.txt
├── dats
│ ├── abyss.dat
│ ├── isles.dat
│ ├── last.dat
│ └── sierra.dat
├── Pipfile
├── plotcount.py
├── requirements.txt
├── results.txt
├── Snakefile
├── wordcount.py
├── zipf_analysis.tar.gz
└── zipf_test.py
There are many potential arrangements, so you are free to choose whatever makes sense
for your project. Snakemake is not prescriptive: it will put files wherever
you tell it to. Here we will learn how to move the dat files into a dats
directory.
Moving the dat files
Alter the rules in your Snakefile so that the dat files are created in their own dats/ folder. Note that creating this folder beforehand is unnecessary. Snakemake automatically creates any folders for you, as needed.
Hint
- Make sure your Snakefile is up to date with the end of the preceding lesson. Use the provided solution files if necessary.
- Look for all the locations that reference the dat files and update them to add the dats/ directory.
Solution
First update the DATS variable with the dats directory:
    DATS = expand('dats/{file}.dat', file=glob_wildcards('./books/{book}.txt').book)
Then update count_words so the dat files get created in the same place:
    rule count_words:
        input:
            cmd='wordcount.py',
            book='books/{file}.txt'
        output: 'dats/{file}.dat'
        shell: 'python {input.cmd} {input.book} {output}'
Finally, update the clean rule to remove the dats directory:
    rule clean:
        shell: 'rm -rf dats/ *.dat results.txt'
Note that in the clean rule there is no harm in keeping the *.dat pattern in the rm command even though no new files will be created in that location. It will help clean up if you forgot to run snakemake clean before updating the Snakefile.
See .solutions/completing_the_pipeline/Snakefile_move_dats.
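As a rough illustration of what "Snakemake creates any folders for you" means, the following plain-Python sketch makes sure the parent directory of an output file exists before anything is written to it. The helper name prepare_output is hypothetical and this is not Snakemake's own code; it only shows the idea, run inside a temporary directory so no real files are touched.

```python
from pathlib import Path
import tempfile


def prepare_output(output_path):
    """Create the parent directory of output_path if it is missing."""
    path = Path(output_path)
    # parents=True creates intermediate folders; exist_ok=True makes this
    # safe to call again when the folder is already there
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrate in a temporary directory (an assumption made for safety)
with tempfile.TemporaryDirectory() as tmp:
    target = prepare_output(Path(tmp) / "dats" / "last.dat")
    dats_dir_created = target.parent.is_dir()
```

Because Snakemake takes care of this step, the shell command in count_words can write straight to dats/{file}.dat without a separate mkdir.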
Windows Note
At the time of writing, there is an open bug in Snakemake (version 5.8.2) on Windows systems that prevents requesting specific files from the command line when those files are in a subdirectory.
For example, before moving the dat files to the dats directory, you could request that Snakemake build a specific file with a command like:
    snakemake -c 1 last.dat
After moving the location of dat files, the correct command is:
    snakemake -c 1 dats/last.dat
On Windows systems this command produces an error. However, Snakemake can still build the files correctly when processing inputs for other rules (such as the dats rule). The bug only affects files requested from the command line.
Later in this episode we will see one way around this issue when we introduce the all rule.
Generating Plots
Creating PNGs
Your challenge is to update your Snakefile so that it can create .png files from dat files using plotcount.py.
- The new rule should be called make_plot.
- All .png files should be created in a directory called plots. If you are using a Windows system, you could create the plots in the top-level directory instead in order to avoid the Windows subdirectory bug. You may need to change back to the plots directory after we introduce the all rule.
As well as a new rule you may also need to update existing rules.
Remember that when testing a pattern rule, you can’t just ask Snakemake to execute the rule by name. You need to ask Snakemake to build a specific file. So instead of snakemake count_words you need something like snakemake dats/last.dat.
Solution
Modify the clean rule and add a new pattern rule make_plot:
    # delete everything so we can re-run things
    rule clean:
        shell: 'rm -rf dats/ plots/ *.dat results.txt'

    # plot one word count dat file
    rule make_plot:
        input:
            cmd='plotcount.py',
            dat='dats/{file}.dat'
        output: 'plots/{file}.png'
        shell: 'python {input.cmd} {input.dat} {output}'
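To see why asking for a single plot also triggers count_words, it can help to trace the chain of files by hand. The sketch below is a simplified model, not Snakemake's implementation: the RULES table and the helpers to_regex and build_chain are hypothetical, each rule is reduced to one output pattern and one input pattern, and the script inputs (wordcount.py, plotcount.py) are left out.

```python
import re

# Hypothetical, simplified rule table: output pattern -> input pattern
RULES = {
    "plots/{file}.png": "dats/{file}.dat",   # make_plot
    "dats/{file}.dat": "books/{file}.txt",   # count_words
}


def to_regex(pattern):
    """Turn 'dats/{file}.dat' into a regex with a named group per wildcard."""
    parts = re.split(r"\{(\w+)\}", pattern)
    # Even positions are literal text, odd positions are wildcard names
    return "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )


def build_chain(target):
    """Return the files needed for target, ordered from source to target."""
    for output_pattern, input_pattern in RULES.items():
        match = re.fullmatch(to_regex(output_pattern), target)
        if match:
            source = input_pattern.format(**match.groupdict())
            return build_chain(source) + [target]
    # No rule produces this file, so treat it as an existing source file
    return [target]


chain = build_chain("plots/last.png")
print(chain)
```

The wildcard value is only known once a concrete file name such as plots/last.png has been requested, which is why a pattern rule cannot be run by name alone.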
Default Rules
The default rule is the rule that Snakemake runs if you don’t specify a rule
on the command line (for example, if you just run snakemake -c 1).
The default rule is simply the first rule in a Snakefile. While the default
rule can be anything you like, it is common practice to call the default rule
all, and have it run the entire workflow.
Add an all rule
Add an all rule to your Snakefile.
Note that all rules often don’t need to do any processing of their own. It is sufficient to make them depend on all the final outputs from other rules. In this case, the outputs are results.txt and all the PNG files.
Hint
It is easiest to use glob_wildcards and expand to build the list of all expected .png files.
Solution
First, we modify the existing code that builds DATS to first extract the list of book names, and then to build DATS and a new global variable PLOTS listing all plots:
    # Build the list of book names. We need to use it multiple times when building
    # the lists of files that will be built in the workflow
    BOOK_NAMES = glob_wildcards('./books/{book}.txt').book

    # The list of all dat files
    DATS = expand('dats/{file}.dat', file=BOOK_NAMES)

    # The list of all plot files
    PLOTS = expand('plots/{file}.png', file=BOOK_NAMES)
Now we can define the all rule:
    # pseudo-rule that tries to build everything.
    # Just add all the final outputs that you want built.
    rule all:
        input: 'results.txt', PLOTS
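If it is unclear what BOOK_NAMES, DATS and PLOTS end up containing, the following plain-Python sketch builds comparable lists without Snakemake. The helpers book_names and expand_pattern are hypothetical stand-ins for glob_wildcards and expand, the book files are empty placeholders created in a temporary directory, and the names are sorted here only to make the result predictable.

```python
from pathlib import Path
import tempfile


def book_names(books_dir):
    """Return the sorted base names of all .txt files in books_dir."""
    return sorted(path.stem for path in Path(books_dir).glob("*.txt"))


def expand_pattern(pattern, names):
    """Fill the {file} placeholder in pattern once per name."""
    return [pattern.format(file=name) for name in names]


# Placeholder books directory mirroring the listing earlier in this episode
with tempfile.TemporaryDirectory() as tmp:
    for name in ["abyss", "isles", "last", "sierra"]:
        (Path(tmp) / f"{name}.txt").write_text("")
    # Non-.txt files such as the licence file are ignored by the pattern
    (Path(tmp) / "LICENSE_TEXTS.md").write_text("")
    BOOK_NAMES = book_names(tmp)

DATS = expand_pattern("dats/{file}.dat", BOOK_NAMES)
PLOTS = expand_pattern("plots/{file}.png", BOOK_NAMES)
print(PLOTS)
```

Deriving both lists from one list of book names means a new book in the books folder appears in DATS and PLOTS without further edits.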
Cleaning House
It is common practice to have a clean rule that deletes all intermediate
and generated files, taking your workflow back to a blank slate.
We already have a clean rule, so now is a good time to check that it
removes all intermediate and output files. First do a snakemake -c 1 all followed
by snakemake -c 1 clean. Then check to see if any output files remain and add them
to the clean rule if required.
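Checking by eye works for a handful of files, but the check can also be scripted. The sketch below is one possible approach, assuming the generated files match the patterns used by the clean rule so far; the name leftover_files and the pattern list are hypothetical, and the demonstration runs in a temporary directory with placeholder files.

```python
from pathlib import Path
import tempfile

# Patterns for generated files, based on the clean rule at this point
GENERATED_PATTERNS = ["dats/*", "plots/*", "*.dat", "results.txt"]


def leftover_files(workdir, patterns=GENERATED_PATTERNS):
    """Return generated files that still exist under workdir."""
    root = Path(workdir)
    found = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            found.add(path.relative_to(root).as_posix())
    return sorted(found)


# Simulate a directory where the clean rule missed some files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "books").mkdir()
    (root / "books" / "last.txt").write_text("")   # source file, must stay
    (root / "dats").mkdir()
    (root / "dats" / "last.dat").write_text("")    # generated, should be gone
    (root / "results.txt").write_text("")          # generated, should be gone
    leftovers = leftover_files(root)

print(leftovers)
```

An empty list after snakemake -c 1 clean suggests the clean rule covers everything matched by these patterns; anything listed is a candidate to add to the rule.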
Creating an Archive
Let’s add a processing rule that depends on all previous stages of the workflow. In this case, we will create an archive tar file.
Windows Note
If you are using a Windows system, make sure you have followed the setup instructions regarding the use of Git Bash. This should provide the required environment for the tar command to work.
Creating an Archive
Update your pipeline to:
- Create an archive file called zipf_analysis.tar.gz
- The archive should contain all dat files, plots, and the Zipf summary table (results.txt).
- Update all to expect zipf_analysis.tar.gz as input.
- Remove the archive when snakemake -c 1 clean is called.
The syntax to create an archive is:
    tar -czvf zipf_analysis.tar.gz file1 directory2 file3 etc
Solution
First the create_archive rule:
    # create an archive with all results
    rule create_archive:
        input: 'results.txt', DATS, PLOTS
        output: 'zipf_analysis.tar.gz'
        shell: 'tar -czvf {output} {input}'
Then the update to the clean target:
    # delete everything so we can re-run things
    rule clean:
        shell: 'rm -rf dats/ plots/ *.dat results.txt zipf_analysis.tar.gz'
Then the change to all. The workflow would still be correct without this step, but since create_archive requires building everything, it is simpler to just get all to depend on create_archive. This means we have just one rule to maintain if we add new outputs later on.
    # pseudo-rule that tries to build everything.
    # Just add all the final outputs that you want built.
    rule all:
        input: 'zipf_analysis.tar.gz'
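Where the tar command is unavailable, Python's standard tarfile module can produce a similar gzip-compressed archive. The sketch below is an alternative under that assumption, not part of the lesson's solution: the helper create_archive is hypothetical, and the input files are placeholders written to a temporary directory.

```python
import tarfile
import tempfile
from pathlib import Path


def create_archive(archive_path, inputs, base_dir):
    """Write the given files (relative to base_dir) into a .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in inputs:
            # arcname keeps the relative path inside the archive
            tar.add(Path(base_dir) / name, arcname=name)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dats").mkdir()
    (root / "dats" / "last.dat").write_text("placeholder\n")
    (root / "results.txt").write_text("placeholder\n")

    archive = root / "zipf_analysis.tar.gz"
    create_archive(archive, ["results.txt", "dats/last.dat"], root)

    # Read the archive back to list what it contains
    with tarfile.open(archive, "r:gz") as tar:
        archived_names = sorted(tar.getnames())

print(archived_names)
```

In a Snakefile the shell command shown in the solution is shorter, so this route is mainly of interest when the shell tool cannot be relied on.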
After these exercises our final workflow should look something like the following:
Adding more books
We can now do a better job of testing Zipf’s law by adding more books. The books we have used come from the Project Gutenberg website. Project Gutenberg offers thousands of free ebooks to download.
Exercise instructions:
- go to Project Gutenberg and use the search box to find another book, for example ‘The Picture of Dorian Gray’ by Oscar Wilde.
- download the ‘Plain Text UTF-8’ version and save it to the books folder; choose a short name for the file
- optionally, open the file in a text editor and remove extraneous text at the beginning and end (look for the phrase End of Project Gutenberg's [title], by [author])
- run snakemake and check that the correct commands are run
- check the results.txt file to see how this book compares to the others
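The optional clean-up step can also be done with a few lines of Python rather than a text editor. The sketch below drops everything from the first line containing the end-of-book phrase quoted above. The helper strip_footer is hypothetical, the sample text is invented for the demonstration, and the exact wording of the marker may differ between downloaded files, so check the file before relying on it.

```python
def strip_footer(text, marker="End of Project Gutenberg's"):
    """Return text up to, but not including, the first line with marker."""
    lines = text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if marker in line:
            return "".join(lines[:index])
    # Marker not found: leave the text unchanged
    return text


# Invented sample standing in for a downloaded book
sample = (
    "Chapter 1\n"
    "Some text.\n"
    "End of Project Gutenberg's Example, by A. Author\n"
    "Licence text follows here.\n"
)
cleaned = strip_footer(sample)
print(cleaned)
```

Removing the licence text keeps its words out of the word counts, which is the reason the step is suggested.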
Key Points
Keeping output files in the top-level directory can get messy. One solution is to put files into subdirectories.
It is common practice to have a clean rule that deletes all intermediate and generated files, taking your workflow back to a blank slate.
A default rule is the rule that Snakemake runs if you don’t specify a rule on the command line. It is simply the first rule in a Snakefile.
Many Snakefiles define a default target called all as the first target in the file. This runs by default and typically executes the entire workflow.