Wildcards and pipes
Overview
Teaching: 45 min
Exercises: 10 min
Questions
How can I run a command on multiple files at once?
Is there an easy way of saving a command’s output?
Objectives
Redirect a command’s output to a file.
Process a file instead of keyboard input using redirection.
Construct command pipelines with two or more stages.
Explain what usually happens if a program or pipeline isn’t given any input to process.
If you didn’t get them in the last lesson, make sure to download the example files used in the next few sections:
Using a web browser: http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz
Now that we know some of the basic UNIX commands, we are going to explore some
more advanced features. The first of these features is the wildcard
*. In our
examples before, we’ve done things to files one at a time and otherwise had to
specify things explicitly. The
* character lets us speed things up and do
things across multiple files.
Ever wanted to move, delete, or just do “something” to all files of a certain
type in a directory? * lets you do that, by taking the place of zero or more
characters in a piece of text. So *.txt would be equivalent to all .txt
files in a directory, for instance. * by itself means all files (except hidden
files, whose names start with a dot). Let’s use our example data to see this
in action.
$ tar xvf bash-lesson.tar.gz
$ ls
bash-lesson.tar.gz                           SRR307026_1.fastq
dmel-all-r6.19.gtf                           SRR307026_2.fastq
dmel_unique_protein_isoforms_fb_2016_01.tsv  SRR307027_1.fastq
gene_association.fb                          SRR307027_2.fastq
SRR307023_1.fastq                            SRR307028_1.fastq
SRR307023_2.fastq                            SRR307028_2.fastq
SRR307024_1.fastq                            SRR307029_1.fastq
SRR307024_2.fastq                            SRR307029_2.fastq
SRR307025_1.fastq                            SRR307030_1.fastq
SRR307025_2.fastq                            SRR307030_2.fastq
Now we have a whole bunch of example files in our directory. For this example
we are going to learn a new command that tells us how long a file is: wc.
wc -l file tells us the length of a file in lines.
$ wc -l dmel-all-r6.19.gtf
Interesting, there are over 540000 lines in our dmel-all-r6.19.gtf file. What
if we wanted to run wc -l on every .fastq file? This is where * comes in
handy: *.fastq would match every file ending in .fastq.
$ wc -l *.fastq
  20000 SRR307023_1.fastq
  20000 SRR307023_2.fastq
  20000 SRR307024_1.fastq
  20000 SRR307024_2.fastq
  20000 SRR307025_1.fastq
  20000 SRR307025_2.fastq
  20000 SRR307026_1.fastq
  20000 SRR307026_2.fastq
  20000 SRR307027_1.fastq
  20000 SRR307027_2.fastq
  20000 SRR307028_1.fastq
  20000 SRR307028_2.fastq
  20000 SRR307029_1.fastq
  20000 SRR307029_2.fastq
  20000 SRR307030_1.fastq
  20000 SRR307030_2.fastq
 320000 total
That was easy. What if we wanted to run the same command on every file
in the directory? A nice trick to keep in mind is that * by itself matches
every file.
$ wc -l *
   53037 bash-lesson.tar.gz
  542048 dmel-all-r6.19.gtf
   22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
  106290 gene_association.fb
   20000 SRR307023_1.fastq
   20000 SRR307023_2.fastq
   20000 SRR307024_1.fastq
   20000 SRR307024_2.fastq
   20000 SRR307025_1.fastq
   20000 SRR307025_2.fastq
   20000 SRR307026_1.fastq
   20000 SRR307026_2.fastq
   20000 SRR307027_1.fastq
   20000 SRR307027_2.fastq
   20000 SRR307028_1.fastq
   20000 SRR307028_2.fastq
   20000 SRR307029_1.fastq
   20000 SRR307029_2.fastq
   20000 SRR307030_1.fastq
   20000 SRR307030_2.fastq
 1043504 total
You can even use multiple *s at a time. How would you run wc -l on every file
with “fb” in it?
wc -l *fb*
i.e. anything or nothing, then fb, then anything or nothing.
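The shell expands a wildcard pattern into matching filenames before the command runs, so echo is a safe way to preview what a pattern will match. The sketch below illustrates this in a throwaway directory; the file names are made up for the example and are not part of the lesson data.

```shell
# Work in a temporary directory so no real files are touched.
# All file names here are hypothetical.
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch notes.txt data_fb.tsv fb_list.txt sample.fb

# echo only prints its arguments, so it shows what the shell expanded.
txt_matches=$(echo *.txt)
fb_matches=$(echo *fb*)

echo "*.txt matched: $txt_matches"
echo "*fb*  matched: $fb_matches"

# Return to where we started and clean up.
cd "$OLDPWD" && rm -rf "$demo_dir"
```

Note that *fb* matches fb_list.txt (nothing before fb) and sample.fb (nothing after fb) as well as data_fb.tsv.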
Using other commands
Now let’s try cleaning up our working directory a bit. Create a folder called “fastq” and move all of our .fastq files there in one mv command.
mkdir fastq
mv *.fastq fastq/
Each of the commands we’ve used so far does only a very small amount of work. However, we can chain these small UNIX commands together to perform otherwise complicated actions!
For our first foray into piping, or redirecting output, we are going to use
the > operator to write output to a file. When using >, the output of the
command on the left of the > is written to the filename you specify on the
right of the arrow. The actual syntax looks like command > filename.
Let’s try several basic usages of >. echo simply prints back, or echoes,
whatever you type after it.
$ echo "this is a test"
$ echo "this is a test" > test.txt
$ ls
$ cat test.txt
this is a test
bash-lesson.tar.gz                           fastq
dmel-all-r6.19.gtf                           gene_association.fb
dmel_unique_protein_isoforms_fb_2016_01.tsv  test.txt
this is a test
Awesome, let’s try that with a more complicated command, like wc -l.
$ wc -l * > word_counts.txt
$ cat word_counts.txt
wc: fastq: Is a directory
   53037 bash-lesson.tar.gz
  542048 dmel-all-r6.19.gtf
   22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
       0 fastq
  106290 gene_association.fb
       1 test.txt
  723505 total
Notice how we still got some output to the console even though we “piped” the output to a file? Our expected output still went to the file, but how did the error message get skipped and not go to the file?
This phenomenon is a consequence of how UNIX systems are built. There are 3
input/output streams for every UNIX program you will run: stdin, stdout, and
stderr.
Let’s dissect these three streams of input/output in the command we just ran:
wc -l * > word_counts.txt
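stdin is the input to a program. Strictly speaking, wc did not read from
stdin in the command we just ran: the shell expanded * into every filename in
our current directory and passed those names to wc as arguments, and wc opened
each file itself. A program typically reads stdin when it is not given a file
to work on.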
stdout contains the actual, expected output. In this case, > redirected
stdout to the file word_counts.txt.
stderr typically contains error messages and other information that doesn’t quite fit into the category of “output”. If we insist on redirecting both
stdout and stderr to the same file we would use &> instead of >. (We can
redirect just stderr using 2>.)
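To make the separation of stdout and stderr concrete, here is a small sketch that sends each stream to its own file and then both to one file. The file names are hypothetical, and missing.txt deliberately does not exist so that wc produces an error. The sketch uses > file 2>&1, which is the portable spelling of what bash also accepts as &> file.

```shell
# Temporary directory with one real file; "missing.txt" is never created.
demo_dir=$(mktemp -d)
echo "one line" > "$demo_dir/present.txt"

# stdout goes to out.txt, stderr goes to err.txt.
# wc exits with an error status because of the missing file, hence "|| true".
wc -l "$demo_dir/present.txt" "$demo_dir/missing.txt" \
    > "$demo_dir/out.txt" 2> "$demo_dir/err.txt" || true

# Both streams into the same file (portable equivalent of bash's &>).
wc -l "$demo_dir/present.txt" "$demo_dir/missing.txt" \
    > "$demo_dir/both.txt" 2>&1 || true

echo "--- stdout only ---"; cat "$demo_dir/out.txt"
echo "--- stderr only ---"; cat "$demo_dir/err.txt"

# Keep a few facts in variables before cleaning up.
out_text=$(cat "$demo_dir/out.txt")
err_text=$(cat "$demo_dir/err.txt")
out_lines=$(wc -l < "$demo_dir/out.txt")
err_lines=$(wc -l < "$demo_dir/err.txt")
both_lines=$(wc -l < "$demo_dir/both.txt")
rm -rf "$demo_dir"
```

The error message about missing.txt should appear only in err.txt, while both.txt holds the counts and the error together.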
Knowing what we know now, let’s try re-running the command, and send all of the
output (including the error message) to the same word_counts.txt file as
before.
$ wc -l * &> word_counts.txt
Notice how there was no output to the console that time. Let’s check that the error message went to the file like we specified.
$ cat word_counts.txt
   53037 bash-lesson.tar.gz
  542048 dmel-all-r6.19.gtf
   22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
wc: fastq: Is a directory
       0 fastq
  106290 gene_association.fb
       1 test.txt
       7 word_counts.txt
  723512 total
Success! The wc: fastq: Is a directory error message was written to the file.
Also, note how the file was silently overwritten by directing output to the
same place as before. Sometimes this is not the behaviour we want. How do we
append (add) to a file instead of overwriting it?
Appending to a file is done the same way as redirecting output. However,
instead of >, we will use >>.
$ echo "We want to add this sentence to the end of our file" >> word_counts.txt
$ cat word_counts.txt
  22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
 471308 Drosophila_melanogaster.BDGP5.77.gtf
      0 fastq
1304914 fb_synonym_fb_2016_01.tsv
 106290 gene_association.fb
      1 test.txt
1904642 total
We want to add this sentence to the end of our file
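The difference between overwriting and appending is easiest to see side by side. The following is a minimal sketch using a scratch file created just for the demonstration (the variable name log_file is an assumption for illustration).

```shell
# Scratch file for the demonstration only.
log_file=$(mktemp)

echo "first"  >  "$log_file"   # > creates the file or overwrites it
echo "second" >  "$log_file"   # "first" is silently replaced
echo "third"  >> "$log_file"   # >> adds to the end instead

cat "$log_file"

contents=$(cat "$log_file")
rm -f "$log_file"
```

Only "second" and "third" remain in the file, because the second > discarded what the first one wrote.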
Chaining commands together
We now know how to redirect stdout and stderr to files. We can actually
take this a step further and redirect output (stdout) from one command to
serve as the input (stdin) for the next. To do this, we use the | (pipe)
operator.
grep is an extremely useful command. It finds things for us within files.
Basic usage (there are a lot of options for more clever things, see the man
page) uses the syntax grep whatToFind fileToSearch. Let’s use grep to find
all of the entries pertaining to the Act5C gene in Drosophila melanogaster.
$ grep Act5C dmel-all-r6.19.gtf
The output is nearly unintelligible since there is so much of it. Let’s send
the output of that
grep command to
head so we can just take a peek at the
first line. The
| operator lets us send output from one command to the next:
$ grep Act5C dmel-all-r6.19.gtf | head -n 1
X FlyBase gene 5900861 5905399 . + . gene_id "FBgn0000042"; gene_symbol "Act5C";
Nice work, we sent the output of grep to head. Let’s try counting the
number of entries for Act5C with wc -l. We can do the same trick to send
grep’s output to wc -l.
$ grep Act5C dmel-all-r6.19.gtf | wc -l
Note that this is just the same as redirecting output to a file, then reading the number of lines from that file.
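As a rough illustration of that equivalence, the sketch below counts matching lines twice: once with a pipe, and once by saving grep's output to a file and feeding that file to wc on stdin with <. The toy annotation file and the gene names in it are made up so that the example runs anywhere.

```shell
demo_dir=$(mktemp -d)

# A tiny, made-up stand-in for the real GTF file.
printf 'geneA exon\ngeneB exon\ngeneA CDS\ngeneC exon\n' > "$demo_dir/toy.gtf"

# One step: grep's stdout becomes wc's stdin.
piped_count=$(grep geneA "$demo_dir/toy.gtf" | wc -l)

# Two steps: save grep's output, then give the saved file to wc on stdin.
grep geneA "$demo_dir/toy.gtf" > "$demo_dir/matches.txt"
file_count=$(wc -l < "$demo_dir/matches.txt")

echo "pipe: $piped_count   temporary file: $file_count"
rm -rf "$demo_dir"
```

The pipe gives the same count without leaving an intermediate file behind.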
Writing commands using pipes
How many files are there in the “fastq” directory we made earlier? (Use the shell to do this.)
ls fastq/ | wc -l
The output of ls is one line per item when chaining commands together like this, so counting lines gives the number of files.
Reading from compressed files
Let’s compress one of our files using gzip.
$ gzip gene_association.fb
zcat acts like cat, except that it can read information from .gz
(compressed) files. Using zcat, can you write a command to take a look at the
top few lines of the gene_association.fb.gz file (without decompressing the
file itself)?
zcat gene_association.fb.gz | head
The head command without any options shows the first 10 lines of a file.
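The same pattern can be tried on a small file you create yourself. This sketch assumes GNU gzip and uses gzip -dc, which writes the decompressed text to stdout just as zcat does; numbers.txt is a made-up file name.

```shell
demo_dir=$(mktemp -d)

# Twenty numbered lines in a hypothetical file, then compress it.
seq 1 20 > "$demo_dir/numbers.txt"
gzip "$demo_dir/numbers.txt"      # replaces numbers.txt with numbers.txt.gz

# Decompress to stdout only; the .gz file on disk is left untouched.
first_three=$(gzip -dc "$demo_dir/numbers.txt.gz" | head -n 3)
default_head=$(gzip -dc "$demo_dir/numbers.txt.gz" | head | wc -l)

echo "$first_three"
echo "head with no options printed $default_head lines"

# Confirm that only the compressed file exists afterwards.
remaining=$(ls "$demo_dir")
echo "files on disk: $remaining"
rm -rf "$demo_dir"
```

Because the decompressed text only ever travels through the pipe, no uncompressed copy is written to disk.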
Key Points
The * wildcard is used as a placeholder to match any text that follows a pattern.
Redirect a command’s output to a file with >.
Commands can be chained with |.