This lesson is in the early stages of development (Alpha version)

Introduction to Using the Shell in a High-Performance Computing Context

Why Use a Cluster?

Overview

Teaching: 25 min
Exercises: 5 min
Questions
  • Why would I be interested in High Performance Computing (HPC)?

  • What can I expect to learn from this course?

Objectives
  • Be able to describe what an HPC system is.

  • Identify how an HPC system could benefit you.

Why Use These Computers?

What do you need?

Talk to your neighbor about your research. How does computing help you do your research? How could more computing help you do more or better research?

Frequently, research problems that use computing can outgrow the desktop or laptop computer where they started:

In all these cases, what is needed is access to more computers that can be used at the same time. Luckily, large-scale computing systems (shared computing resources with lots of computers) are available at many universities and labs, or through national networks. These resources usually have more central processing units (CPUs), CPUs that operate at higher speeds, more memory, more storage, and faster connections with other computer systems. They are frequently called “clusters”, “supercomputers”, or resources for “high performance computing” (HPC). In this lesson, we will usually use the terms HPC and HPC cluster.

Using a cluster often has the following advantages for researchers:

This is how a large-scale compute system like a cluster can help solve problems like those listed at the start of the lesson.

Thinking ahead

How do you think using a large-scale computing system will be different from using your laptop? Talk to your neighbor about some differences you may already know about, and some differences/difficulties you imagine you may run into.

On Command Line

Using HPC systems often involves the use of a shell through a command line interface (CLI) and either specialized software or programming techniques. The shell is a program whose special role is to run other programs rather than to do calculations or similar tasks itself. What the user types goes into the shell, which then figures out what commands to run and orders the computer to execute them. (Note that the shell is called “the shell” because it encloses the operating system in order to hide some of its complexity and make it simpler to interact with.) The most popular Unix shell is Bash, the Bourne Again SHell (so-called because it’s derived from a shell written by Stephen Bourne). Bash is the default shell on many Unix-like systems, including most Linux distributions, and in most packages that provide Unix-like tools for Windows.

Interacting with the shell is done via a command line interface (CLI) on most HPC systems. In the earliest days of computing, the only way to interact with a computer was to rewire it. From the 1950s to the 1980s, most people interacted with computers through line printers. These devices only allowed input and output of the letters, numbers, and punctuation found on a standard keyboard, so programming languages and software interfaces had to be designed around that constraint, and text-based interfaces were the way to do this. A typing-based interface is often called a command-line interface, or CLI, to distinguish it from a graphical user interface, or GUI, which most people now use. The heart of a CLI is a read-evaluate-print loop, or REPL: when the user types a command and then presses the Enter (or Return) key, the computer reads it, executes it, and prints its output. The user then types another command, and so on until the user logs off.
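The read-evaluate-print loop described above can be sketched in a few lines of shell. This is a toy illustration only, not how Bash itself is implemented; the function name mini_repl and the prompt text are made up for this example.

```shell
# mini_repl: a toy read-evaluate-print loop (hypothetical name, illustration only).
# It reads one line at a time and runs it as a command, letting the command
# print its own output, until the input ends or the user types "exit".
mini_repl() {
    while true; do
        printf 'mini> ' >&2               # prompt goes to stderr so output stays clean
        IFS= read -r line || break        # READ: stop at end of input (Ctrl-D)
        if [ "$line" = "exit" ]; then     # let the user leave the loop
            break
        fi
        eval "$line"                      # EVALUATE and PRINT: run the command
    done
}

# Feed two commands to the loop non-interactively and capture what they print.
repl_output=$(printf 'echo hello\necho world\n' | mini_repl 2>/dev/null)
echo "$repl_output"
```

Calling mini_repl with no pipe lets you type commands at the mini> prompt yourself. A real shell does much more (job control, history, quoting rules), but the loop structure is the same.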

Learning to use Bash or any other shell sometimes feels more like programming than like using a mouse. Commands are terse (often only a couple of characters long), their names are frequently cryptic, and their output is lines of text rather than something visual like a graph. However, using a command line interface can be extremely powerful, and learning how to use one will allow you to reap the benefits described above.

The rest of this lesson

The only way to use these types of resources is by learning to use the command line. This introduction to HPC systems has two parts:

The skills we learn here have other uses beyond just HPC: Bash and UNIX skills are used everywhere, be it for web development, running software, or operating servers. Bash has become so essential that Microsoft now makes it available on Windows! Knowing how to use Bash and HPC systems will help you work on a wide range of modern computing systems. With all of this in mind, let’s connect to a cluster and get started!

Key Points

  • High Performance Computing (HPC) typically involves connecting to very large computing systems elsewhere in the world.

  • These HPC systems can be used to do work that would either be impossible or much slower on smaller systems.

  • The standard method of interacting with such systems is via a command line interface such as Bash.


Connecting to the remote HPC system

Overview

Teaching: 25 min
Exercises: 10 min
Questions
  • How do I open a terminal?

  • How do I connect to a remote computer?

  • What is an SSH key?

Objectives
  • Connect to a remote HPC system.

Opening a Terminal

Connecting to an HPC system is most often done through a tool known as “SSH” (Secure SHell) and usually SSH is run through a terminal. So, to begin using an HPC system we need to begin by opening a terminal. Different operating systems have different terminals, none of which are exactly the same in terms of their features and abilities while working on the operating system. When connected to the remote system the experience between terminals will be identical as each will faithfully present the same experience of using that system.

Here is the process for opening a terminal in each operating system.

Linux

There are many different versions (aka “flavours”) of Linux and how to open a terminal window can change between flavours. Fortunately most Linux users already know how to open a terminal window since it is a common part of the workflow for Linux users. If this is something that you do not know how to do then a quick search on the Internet for “how to open a terminal window in” with your particular Linux flavour appended to the end should quickly give you the directions you need.

Mac

Macs have had a terminal built in since the first version of OS X, because OS X is built on a UNIX-like operating system that leverages many parts from BSD (Berkeley Software Distribution). The terminal can be quickly opened through the use of the Spotlight tool. Hold down the command key and press the spacebar. In the search bar that shows up type “terminal”, choose the terminal app from the list of results (it will look like a tiny, black computer screen) and you will be presented with a terminal window. Alternatively, you can find Terminal under “Utilities” in the Applications menu.

Windows

While Windows does have a command-line interface known as the “Command Prompt” that has its roots in MS-DOS (Microsoft Disk Operating System), it has historically not had an SSH tool built into it (recent versions of Windows 10 and 11 include an OpenSSH client), and so on older systems one needs to be installed. There are a variety of programs that can be used for this; we describe a few common ones here:

Git BASH

Git BASH gives you a terminal-like interface in Windows. You can use this to connect to a remote computer via SSH. It is available as a free download.

Windows Subsystem for Linux

The Windows Subsystem for Linux also allows you to connect to a remote computer via SSH. Instructions on installing it can be found here.

MobaXterm

MobaXterm is a terminal window emulator for Windows and the home edition can be downloaded for free from mobatek.net. If you follow the link you will note that there are two editions of the home version available: Portable and Installer. The portable edition puts all MobaXterm content in a folder on the desktop (or anywhere else you would like it) so that it is easy to add plug-ins or remove the software. The installer edition adds MobaXterm to your Windows installation and menu as any other program you might install. If you are not sure that you will continue to use MobaXterm in the future, the portable edition is likely the best choice for you.

Download the version that you would like to use and install it as you would any other software on your Windows installation. Once the software is installed you can run it by either opening the folder installed with the portable edition and double-clicking on the executable file named MobaXterm_Personal_11.1 (your version number may vary) or, if the installer edition was used, finding the executable through either the start menu or the Windows search option.

Once the MobaXterm window is open you should see a large button in the middle of that window with the text “Start Local Terminal”. Click this button and you will have a terminal window at your disposal.

PuTTY

It is, strictly speaking, not necessary to have a terminal running on your local computer in order to access and use a remote system, only a window into the remote system once connected. PuTTY is likely the oldest, most well-known, and widely used software solution to take this approach.

PuTTY is available for free download from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html. Download the version that is correct for your operating system and install it as you would other software on your Windows system. Once installed it will be available through the start menu or similar.

Running PuTTY will not initially produce a terminal but instead a window full of connection options. Putting the address of the remote system in the “Host Name (or IP Address)” box and either pressing enter or clicking the “Open” button should begin the connection process.

If this works you will see a terminal window open that prompts you for a username through the “login as:” prompt and then for a password. If both of these are passed correctly then you will be given access to the system and will see a message saying so within the terminal. If you need to escape the authentication process you can hold the Control (Ctrl) key and press the c key to exit and start again.

Note that you may want to paste in your password rather than typing it. In PuTTY’s default configuration, right-clicking in the terminal window (or pressing Shift+Insert) pastes content from the clipboard into the PuTTY terminal.

For those logging in with PuTTY it would likely be best to cover the terminal basics already mentioned above before moving on to navigating the remote system.

Creating an SSH key

SSH keys are an alternative method for authentication to obtain access to remote computing systems. They can also be used for authentication when transferring files or for accessing version control systems. In this section you will create a pair of SSH keys, a private key which you keep on your own computer and a public key which is placed on the remote HPC system that you will log in to.

Linux, Mac and Windows Subsystem for Linux

Once you have opened a terminal, check for existing SSH keys and their filenames, so that you do not overwrite a key you already use:

$ ls ~/.ssh/

then generate a new public-private key pair,

$ ssh-keygen -t ed25519 -a 100 -f ~/.ssh/key_Graham_ed25519

If ed25519 is not available, use the older (but strong and trusted) RSA cryptography:

$ ls ~/.ssh/
$ ssh-keygen -o -a 100 -t rsa -b 4096 -f ~/.ssh/key_Graham_rsa

The flag -b sets the number of bits in the key. The default for RSA depends on the OpenSSH version (2048 bits in older releases, 3072 in more recent ones). Ed25519 keys have a fixed length, so this flag would have no effect on them.

When prompted, enter a strong passphrase that you will remember. Cryptography is only as good as the weakest link, and this key will be used to connect to a powerful, precious computational resource.

Take a look in ~/.ssh (use ls ~/.ssh). You should see the two new files: your private key (~/.ssh/key_Graham_ed25519 or ~/.ssh/key_Graham_rsa) and the public key (~/.ssh/key_Graham_ed25519.pub or ~/.ssh/key_Graham_rsa.pub). If a key is requested by the system administrators, the public key is the one to provide.
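To practice these steps without touching your real keys, you could generate a throwaway key pair in a temporary directory. The sketch below assumes the OpenSSH ssh-keygen tool is installed; the file name key_demo_ed25519 is made up, and the empty passphrase set with -N '' is acceptable only because the key is discarded afterwards.

```shell
# Practice run: create a disposable key pair outside ~/.ssh.
# key_demo_ed25519 is a hypothetical name used only for this demo.
demo_dir=$(mktemp -d)

if command -v ssh-keygen >/dev/null 2>&1; then
    # -q silences progress output; -N '' sets an EMPTY passphrase,
    # which is fine for a throwaway key but never for a real one.
    ssh-keygen -q -t ed25519 -a 100 -N '' -C 'demo key' \
        -f "$demo_dir/key_demo_ed25519"

    # Two files should now exist: the private key and the matching .pub file.
    ls "$demo_dir"

    # Print the fingerprint of the public key (safe to share).
    ssh-keygen -l -f "$demo_dir/key_demo_ed25519.pub"
else
    echo "ssh-keygen not found; install an OpenSSH client first" >&2
fi
```

When you are done, delete the practice keys with rm -r "$demo_dir".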

Private keys are your private identity

A private key that is visible to anyone but you should be considered compromised, and must be destroyed. This includes having improper permissions on the directory it (or a copy) is stored in, traversing any network in the clear, being attached to an unencrypted email, and even being displayed (the key is ASCII text) in your terminal window.

Protect this key as if it unlocks your front door. In many ways, it does.
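One concrete part of protecting a key is restricting file permissions. The sketch below works on a stand-in directory and a dummy file rather than real keys (the names fake_ssh_dir and fake_private_key are placeholders); on a real system the same chmod commands would be applied to ~/.ssh and to the private key file.

```shell
# Demonstrate restrictive permissions on a stand-in for ~/.ssh.
# fake_ssh_dir and fake_private_key are placeholders, not real keys.
fake_ssh_dir=$(mktemp -d)
fake_private_key="$fake_ssh_dir/key_demo"
echo "not a real key" > "$fake_private_key"

chmod 700 "$fake_ssh_dir"        # only the owner may list or enter the directory
chmod 600 "$fake_private_key"    # only the owner may read or write the key file

# The first ten characters of `ls -l` output show the file type and permissions.
dir_mode=$(ls -ld "$fake_ssh_dir" | cut -c1-10)
key_mode=$(ls -l "$fake_private_key" | cut -c1-10)
echo "directory: $dir_mode"
echo "key file:  $key_mode"
```

The directory line should start with drwx------ and the key file line with -rw-------, meaning that no other user on the system can read either.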

Further information

For more information on SSH security and some of the flags set here, an excellent resource is Secure Secure Shell.

Windows

On Windows you can use

Logging onto the system

With all of this in mind, let’s connect to a remote HPC system. In this workshop, we will connect to Graham — an HPC system located at the University of Waterloo. Although it’s unlikely that every system will be exactly like Graham, it’s a very good example of what you can expect from an HPC installation. To connect to our example computer, we will use SSH (if you are using PuTTY, see above).

SSH allows us to connect to UNIX computers remotely, and use them as if they were our own. The general syntax of the connection command follows the format ssh -i ~/.ssh/key_for_remote_computer yourUsername@remote.computer.address when using SSH keys and ssh yourUsername@remote.computer.address if only password access is available. Let’s attempt to connect to the HPC system now:

ssh -i ~/.ssh/key_Graham_ed25519 yourUsername@graham.computecanada.ca

or

ssh -i ~/.ssh/key_Graham_rsa yourUsername@graham.computecanada.ca

or, if SSH keys have not been enabled,

ssh yourUsername@graham.computecanada.ca


The authenticity of host 'graham.computecanada.ca (199.241.166.2)' can't be established.
ECDSA key fingerprint is SHA256:JRj286Pkqh6aeO5zx1QUkS8un5fpcapmezusceSGhok.
ECDSA key fingerprint is MD5:99:59:db:b1:3f:18:d0:2c:49:4e:c2:74:86:ac:f7:c6.
Are you sure you want to continue connecting (yes/no)?  # type "yes"!
Warning: Permanently added the ECDSA host key for IP address '199.241.166.2' to the list of known hosts.
yourUsername@graham.computecanada.ca's password:  # no text appears as you enter your password
Last login: Wed Jun 28 16:16:20 2017 from s2.n59.queensu.ca

Welcome to the ComputeCanada/SHARCNET cluster Graham.

If you’ve connected successfully, you should see a prompt like the one below. This prompt is informative, and lets you grasp certain information at a glance. (If you don’t understand what these things are, don’t worry! We will cover things in depth as we explore the system further.)

[yourUsername@gra-login1 ~]$ 

Telling the Difference between the Local Terminal and the Remote Terminal

You may have noticed that the prompt changed when you logged into the remote system using the terminal (if you logged in using PuTTY this will not apply because it does not offer a local terminal). This change is important because it makes it clear on which system the commands you type will be run when you pass them into the terminal. This change is also a small complication that we will need to navigate throughout the workshop. Exactly what is reported before the $ in the terminal when it is connected to the local system and the remote system will typically be different for every user. We still need to indicate which system we are entering commands on though so we will adopt the following convention:

Being certain which system your terminal is connected to

If you ever need to be certain which system a terminal you are using is connected to then use the following command: $ hostname.
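As a small sketch, the two commands can be combined into a single line that reports both who you are and where you are. The variable names are made up for this example, and uname -n is used as a fallback on the assumption that a minimal system might lack the hostname command.

```shell
# Report the current user and the machine this terminal is talking to.
current_user=$(whoami)
# `uname -n` prints the same network node name if `hostname` is unavailable.
current_host=$(hostname 2>/dev/null || uname -n)
echo "Commands typed here run as $current_user on $current_host"
```

Running this in both of your terminal windows should print different host names for the local and the remote system.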

Keep two terminal windows open

It is strongly recommended that you have two terminals open, one connected to the local system and one connected to the remote system, that you can switch back and forth between. If you only use one terminal window then you will need to reconnect to the remote system using one of the methods above when you see a change from [local]$ to [yourUsername@gra-login1 ~]$ and disconnect when you see the reverse.

Key Points

  • To connect to a remote HPC system using SSH and a password, run ssh yourUsername@remote.computer.address.

  • To connect to a remote HPC system using SSH and an SSH key, run ssh -i ~/.ssh/key_for_remote_computer yourUsername@remote.computer.address.


Moving around and looking at things

Overview

Teaching: 15 min
Exercises: 5 min
Questions
  • How do I navigate and look around the system?

Objectives
  • Learn how to navigate around directories and look at their contents

  • Explain the difference between a file and a directory.

  • Translate an absolute path into a relative path and vice versa.

  • Identify the actual command, flags, and filenames in a command-line call.

  • Demonstrate the use of tab completion, and explain its advantages.

At this point in the lesson, we’ve just logged into the system. Nothing has happened yet, and we’re not going to be able to do anything until we learn a few basic commands. By the end of this lesson, you will know how to “move around” the system and look at what’s there.

Right now, all we see is something that looks like this:

[yourUsername@gra-login1 ~]$ 

The dollar sign is a prompt, which shows us that the shell is waiting for input; your shell may use a different character as a prompt and may add information before the prompt. When typing commands, either from these lessons or from other sources, do not type the prompt, only the commands that follow it.

Type the command whoami, then press the Enter key (sometimes marked Return) to send the command to the shell. The command’s output is the ID of the current user, i.e., it shows us who the shell thinks we are:

$ whoami
yourUsername

More specifically, when we type whoami the shell:

  1. finds a program called whoami,
  2. runs that program,
  3. displays that program’s output, then
  4. displays a new prompt to tell us that it’s ready for more commands.
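The first step, finding the program, can be observed directly. The sketch below uses the standard command -v builtin to ask the shell where it would find whoami, and prints the PATH variable, which holds the list of directories the shell searches. PATH is not covered further in this lesson, and the variable name whoami_path is made up for the example.

```shell
# Ask the shell where it would find the program behind a command name.
whoami_path=$(command -v whoami)
echo "whoami resolves to: $whoami_path"

# PATH is the colon-separated list of directories the shell searches, in order.
# Print one directory per line to make it easier to read.
echo "$PATH" | tr ':' '\n'
```

The exact location of whoami differs between systems, so the first line of output will vary.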

Next, let’s find out where we are by running a command called pwd (which stands for “print working directory”). (“Directory” is another word for “folder”.) At any moment, our current working directory (where we are) is the directory that the computer assumes we want to run commands in unless we explicitly specify something else. Here, the computer’s response is /home/yourUsername, which is yourUsername’s home directory. Note that the location of your home directory may differ from system to system.

$ pwd
/home/yourUsername

So, we know where we are. How do we look and see what’s in our current directory?

$ ls

ls prints the names of the files and directories in the current directory in alphabetical order, arranged neatly into columns.

Differences between remote and local system

Open a second terminal window on your local computer and run the ls command without logging in remotely. What differences do you see?

Solution

You would likely see something more like this:

Applications Documents    Library      Music        Public
Desktop      Downloads    Movies       Pictures

In addition you should also note that the preamble before the prompt ($) is different. This is very important for making sure you know what system you are issuing commands on when in the shell.

If nothing shows up when you run ls, it means that nothing’s there. Let’s make a directory for us to play with.

mkdir <new directory name> makes a new directory with that name in your current location. Notice that this command required two pieces of input: the actual name of the command (mkdir) and an argument that specifies the name of the directory you wish to create.

$ mkdir documents

Let’s use ls again. What do we see?

Our folder is there, awesome. What if we wanted to go inside it and do stuff there? We will use the cd (change directory) command to move around. Let’s cd into our new documents folder.

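$ cd documents
$ pwd
/home/yourUsername/documents

You may also see this location written as ~/documents, for example in the prompt. What is the ~ character? When using the shell, ~ is a shortcut that represents your home directory, /home/yourUsername.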

Now that we know how to use cd, we can go anywhere. That’s a lot of responsibility. What happens if we get “lost” and want to get back to where we started?

To go back to your home directory, the following three commands will work:

$ cd /home/yourUsername
$ cd ~
$ cd
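A quick way to convince yourself that the three forms are equivalent is to start somewhere else each time and record where you end up. In this sketch the HOME environment variable stands in for the explicit /home/yourUsername path, on the assumption that it is set (as it normally is in a login session); the variable names are made up.

```shell
# Start from the root directory each time, then go home in three ways.
cd /; cd "$HOME"; via_variable=$(pwd)   # explicit path (HOME holds your home directory)
cd /; cd ~;       via_tilde=$(pwd)      # the ~ shortcut
cd /; cd;         via_bare_cd=$(pwd)    # cd with no argument

echo "$via_variable"
echo "$via_tilde"
echo "$via_bare_cd"
```

All three lines of output should show the same directory.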

A quick note on the structure of a UNIX (Linux/Mac/Android/Solaris/etc.) filesystem: absolute paths (i.e. exact positions in the filesystem) always begin with a /. A / by itself is the “root” or base directory.

Let’s go there now, look around, and then return to our home directory.

$ cd /
$ ls
bin    dev   initrd  local         mnt  proc     root  scratch  tmp  work
boot   etc   lib     localscratch  nix  project  run   srv      usr
cvmfs  home  lib64   media         opt  ram      sbin  sys      var
$ cd ~

The “home” directory is the one where we generally want to keep all of our files. Other folders on a UNIX OS contain system files, and get modified and changed as you install new software or upgrade your OS.

Using HPC filesystems

On HPC systems, you have a number of places where you can store your files. These differ in both the amount of space allocated and whether or not they are backed up.

File storage locations:

  • Network filesystem - Your home directory is an example of a network filesystem. Data stored here is available throughout the HPC system and files stored here are often backed up (but check your local configuration to be sure!). Files stored here are typically slower to access, because the data is actually stored on another computer and is being transmitted and made available over the network!
  • Scratch - Some systems may offer “scratch” space. Scratch space is typically faster to use than your home directory or network filesystem, but is not usually backed up, and should not be used for long term storage.
  • Work file system - As an alternative to (or sometimes as well as) Scratch space, some HPC systems offer fast file system access as a work file system. Typically, this will have higher performance than your home directory or network file system and may not be backed up. It differs from scratch space in that files in a work file system are not automatically deleted for you: you must manage the space yourself.
  • Local scratch (job only) - Some systems may offer local scratch space while executing a job. (A job is a program which you submit to run on an HPC system, and will be covered later.) Such storage is very fast, but will be deleted at the end of your job.
  • Ramdisk (job only) - Some systems may let you store files in a “RAM disk” while running a job, where files are stored directly in the computer’s memory. This is extremely fast, but files stored here will count against your job’s memory usage and be deleted at the end of your job.

There are several other useful shortcuts you should be aware of.

Let’s try these out now:

$ cd ./documents
$ pwd
/home/yourUsername/documents
$ cd ..
$ pwd
/home/yourUsername
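The same round trip can be tried safely in a throwaway directory. This is a minimal sketch: the directory comes from mktemp, and the variable names are invented for the example.

```shell
# Build a small sandbox so the demo does not depend on your home directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"
mkdir documents
start_dir=$(pwd)

cd ./documents          # "." is the current directory, so this equals "cd documents"
inside_dir=$(pwd)
cd ..                   # ".." is the parent directory, one level up
back_dir=$(pwd)

echo "$inside_dir"
echo "$back_dir"

# Leave the sandbox before deleting it.
cd /
rm -rf "$demo_dir"
```

The first path printed ends in /documents, and the second is the directory we started from.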

Many commands also have multiple behaviours that you can invoke with command line “flags”. What is a flag? It’s generally just your command followed by a - and the name of the flag (sometimes it’s -- followed by the name of the flag). You follow the flag(s) with any additional arguments you might need.

We’re going to demonstrate a couple of these “flags” using ls.

Show hidden files with -a. Hidden files are files whose names begin with a dot (.). These files will not appear otherwise, but that doesn’t mean they aren’t there! “Hidden” files are not hidden for security purposes; they are usually just config files and other temporary files that the user doesn’t necessarily need to see all the time.

$ ls -a
.  ..  .bash_logout  .bash_profile  .bashrc  documents  .emacs  .mozilla  .ssh

Notice how both . and .. are visible as hidden files. Show files, their size in bytes, date last modified, permissions, and other things with -l.

$ ls -l
drwxr-xr-x 2 yourUsername tc001 4096 Jan 14 17:31 documents

This is a lot of information to take in at once, but we will explain this later! ls -l is extremely useful, and tells you almost everything you need to know about your files without actually looking at them.

We can also use multiple flags at the same time!

$ ls -l -a
total 36
drwx--S--- 5 yourUsername tc001 4096 Nov 28 09:58 .
drwxr-x--- 3 root         tc001 4096 Nov 28 09:40 ..
-rw-r--r-- 1 yourUsername tc001   18 Dec  6  2016 .bash_logout
-rw-r--r-- 1 yourUsername tc001  193 Dec  6  2016 .bash_profile
-rw-r--r-- 1 yourUsername tc001  231 Dec  6  2016 .bashrc
drwxr-sr-x 2 yourUsername tc001 4096 Nov 28 09:58 documents
-rw-r--r-- 1 yourUsername tc001  334 Mar  3  2017 .emacs
drwxr-xr-x 4 yourUsername tc001 4096 Aug  2  2016 .mozilla
drwx--S--- 2 yourUsername tc001 4096 Nov 28 09:58 .ssh

Flags generally precede any arguments passed to a UNIX command. ls actually takes an extra argument that specifies a directory to look into. When you use flags and arguments together, the syntax (how it’s supposed to be typed) generally looks something like this:

$ command <flags/options> <arguments>

So using ls -l -a on a different directory than the one we’re in would look something like:

$ ls -l -a ~/documents
drwxr-sr-x 2 yourUsername tc001 4096 Nov 28 09:58 .
drwx--S--- 5 yourUsername tc001 4096 Nov 28 09:58 ..
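To see the effect of flags and of a directory argument together, the sketch below builds a small sandbox with one visible and one hidden file. The file and variable names are invented for this example.

```shell
# Sandbox with one ordinary file and one hidden file (name starts with a dot).
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden"

plain_listing=$(ls "$demo_dir")        # the hidden file is left out
all_listing=$(ls -a "$demo_dir")       # includes ., .. and .hidden

# Separate and combined short flags are two spellings of the same request.
separate_flags=$(ls -l -a "$demo_dir")
combined_flags=$(ls -la "$demo_dir")

echo "ls:"
echo "$plain_listing"
echo "ls -a:"
echo "$all_listing"
if [ "$separate_flags" = "$combined_flags" ]; then
    echo "ls -l -a and ls -la give the same listing"
fi

rm -rf "$demo_dir"
```

Note that the directory is given as an argument after the flags, so the listing works no matter which directory you are currently in.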

Where to go for help?

How did we know about the -l and -a options? Is there a manual we can consult when we need help? There is a very helpful manual for most UNIX commands: man (if you’ve ever heard of a “man page” for something, this is what it is).

$ man ls
LS(1)                          User Commands                          LS(1)

NAME
     ls - list directory contents

SYNOPSIS
     ls [OPTION]... [FILE]...

DESCRIPTION
     List  information  about the FILEs (the current directory by default).
     Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

     Mandatory arguments to long options are mandatory for short options too.

To navigate through the man pages, you may use the up and down arrow keys to move line-by-line, or try the spacebar and “b” keys to skip up and down by full page. Quit the man pages by typing “q”.

Alternatively, most commands you run will have a --help option that displays additional information. For instance, with ls:

$ ls --help
Usage: ls [OPTION]... [FILE]...
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is specified.

Mandatory arguments to long options are mandatory for short options too.
  -a, --all                  do not ignore entries starting with .
  -A, --almost-all           do not list implied . and ..
      --author               with -l, print the author of each file
  -b, --escape               print C-style escapes for nongraphic characters
      --block-size=SIZE      scale sizes by SIZE before printing them; e.g.,
                               '--block-size=M' prints sizes in units of
                               1,048,576 bytes; see SIZE format below
  -B, --ignore-backups       do not list implied entries ending with ~

# further output omitted for clarity

Unsupported command-line options

If you try to use an option that is not supported, ls and other programs will print an error message similar to this:

[remote]$ ls -j
ls: invalid option -- 'j'
Try 'ls --help' for more information.

Looking at documentation

Looking at the man page for ls or using ls --help, what does the -h (--human-readable) option do?

Absolute vs Relative Paths

Starting from /Users/amanda/data/, which of the following commands could Amanda use to navigate to her home directory, which is /Users/amanda?

  1. cd .
  2. cd /
  3. cd /home/amanda
  4. cd ../..
  5. cd ~
  6. cd home
  7. cd ~/data/..
  8. cd
  9. cd ..

Solution

  1. No: . stands for the current directory.
  2. No: / stands for the root directory.
  3. No: Amanda’s home directory is /Users/amanda.
  4. No: this goes up two levels, i.e. ends in /Users.
  5. Yes: ~ stands for the user’s home directory, in this case /Users/amanda.
  6. No: this would navigate into a directory home in the current directory if it exists.
  7. Yes: unnecessarily complicated, but correct.
  8. Yes: shortcut to go back to the user’s home directory.
  9. Yes: goes up one level.

Relative Path Resolution

Using the filesystem diagram below, if pwd displays /Users/thing, what will ls -F ../backup display?

  1. ../backup: No such file or directory
  2. 2012-12-01 2013-01-08 2013-01-27
  3. 2012-12-01/ 2013-01-08/ 2013-01-27/
  4. original/ pnas_final/ pnas_sub/

File System for Challenge Questions

Solution

  1. No: there is a directory backup in /Users.
  2. No: this is the content of /Users/thing/backup, but with .. we asked for one level further up.
  3. No: see previous explanation.
  4. Yes: ../backup/ refers to /Users/backup/.

ls Reading Comprehension

Assuming a directory structure as in the above Figure (File System for Challenge Questions), if pwd displays /Users/backup, and -r tells ls to display things in reverse order, what command will display:

pnas_sub/ pnas_final/ original/
  1. ls pwd
  2. ls -r -F
  3. ls -r -F /Users/backup
  4. Either #2 or #3 above, but not #1.

Solution

  1. No: pwd is not the name of a directory.
  2. Yes: ls without directory argument lists files and directories in the current directory.
  3. Yes: uses the absolute path explicitly.
  4. Correct: see explanations above.

Exploring More ls Arguments

What does the command ls do when used with the -l and -h arguments?

Some of its output is about properties that we do not cover in this lesson (such as file permissions and ownership), but the rest should be useful nevertheless.

Solution

The -l argument makes ls use a long listing format, showing not only the file/directory names but also additional information such as the file size and the time of its last modification. The -h argument makes the file size “human readable”, i.e. display something like 5.3K instead of 5369.
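You can check this yourself with a file of known size. The sketch below creates a 5369-byte file in a temporary directory (the name sample.dat is made up) and lists it both ways. The exact rounding of the human-readable size can differ slightly between ls implementations, so no expected output is shown.

```shell
# Create a file of exactly 5369 bytes, filled with zero bytes.
demo_dir=$(mktemp -d)
head -c 5369 /dev/zero > "$demo_dir/sample.dat"

# Confirm the size in bytes independently of ls.
byte_count=$(wc -c < "$demo_dir/sample.dat" | tr -d ' ')
echo "size according to wc: $byte_count bytes"

ls -l "$demo_dir/sample.dat"       # size shown as a plain byte count
ls -l -h "$demo_dir/sample.dat"    # size shown with a unit suffix such as K

rm -rf "$demo_dir"
```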

Listing Recursively and By Time

The command ls -R lists the contents of directories recursively, i.e., lists their sub-directories, sub-sub-directories, and so on in alphabetical order at each level. The command ls -t lists things by time of last change, with most recently changed files or directories first. In what order does ls -R -t display things? Hint: ls -l uses a long listing format to view timestamps.

Solution

The directories are listed alphabetically at each level, and the files/directories in each directory are sorted by time of last change.

Key Points

  • Your current directory is referred to as the working directory.

  • To change directories, use cd.

  • To view files, use ls.

  • You can view help for a command with man command or command --help.

  • Hit tab to autocomplete whatever you’re currently typing.


Writing and reading files

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How do I create/edit text files?

  • How do I move/copy/delete files?

Objectives
  • Learn to use the nano text editor.

  • Understand how to move, create, and delete files.

Now that we know how to move around and look at things, let’s learn how to read, write, and handle files! We’ll start by moving back to our home directory and creating a scratch directory:

$ cd ~
$ mkdir hpc-test
$ cd hpc-test

Creating and Editing Text Files

When working on an HPC system, we will frequently need to create or edit text files. Text is one of the simplest computer file formats, defined as a simple sequence of text lines.

What if we want to make a file? There are a few ways of doing this, the easiest of which is simply using a text editor. For this lesson, we are going to use nano, since it’s more intuitive than many other terminal text editors.

To create or edit a file, type nano <filename> in the terminal, where <filename> is the name of the file. If the file does not already exist, it will be created. Let’s make a new file now, type whatever you want in it, and save it.

$ nano draft.txt

Nano in action

Nano defines a number of shortcut keys (prefixed by the Control or Ctrl key) to perform actions such as saving the file or exiting the editor. Here are the shortcut keys for a few common actions:

Using vim as a text editor

From time to time, you may encounter the vim text editor. Although vim isn’t the easiest or most user-friendly of text editors, you’ll be able to find it on any system and it has many more features than nano.

vim has several modes, a “command” mode (for doing big operations, like saving and quitting) and an “insert” mode. You can switch to insert mode with the i key, and command mode with Esc.

In insert mode, you can type more or less normally. In command mode there are a few commands you should be aware of:

  • :q! — quit, without saving
  • :wq — save and quit
  • dd — cut/delete a line
  • p — paste a line

Do a quick check to confirm our file was created.

$ ls
draft.txt

Reading Files

Let’s read the file we just created now. There are a few different ways of doing this, one of which is reading the entire file with cat.

$ cat draft.txt
It's not "publish or perish" any more,
it's "share and thrive".

By default, cat prints out the content of the given file. Although cat may not seem like an intuitive command with which to read files, it stands for “concatenate”. Giving it multiple file names will print out the contents of the input files in the order given on the command line. For example,

$ cat draft.txt draft.txt
It's not "publish or perish" any more,
it's "share and thrive".
It's not "publish or perish" any more,
it's "share and thrive".

Reading Multiple Text Files

Create two more files using nano, giving them different names such as chap1.txt and chap2.txt. Then use a single cat command to read and print the contents of draft.txt, chap1.txt, and chap2.txt.

Creating Directories

We’ve successfully created a file. What about a directory? We’ve actually done this before, using mkdir.

$ mkdir files
$ ls
draft.txt  files

Moving, Renaming, Copying Files

Moving — We will move draft.txt to the files directory with the mv (“move”) command. The same syntax works for both files and directories: mv <file/directory> <new-location>

$ mv draft.txt files
$ cd files
$ ls
draft.txt

Renaming — draft.txt isn’t a very descriptive name. How do we go about changing it? It turns out that mv is also used to rename files and directories. Although this may not seem intuitive at first, think of it as moving a file to be stored under a different name. The syntax is quite similar to moving files: mv oldName newName.

$ mv draft.txt newname.testfile
$ ls
newname.testfile

File extensions are arbitrary

In the last example, we changed both a file’s name and extension at the same time. On UNIX systems, file extensions (like .txt) are arbitrary. A file is a .txt file only because we say it is. Changing the name or extension of the file will never change a file’s contents, so you are free to rename things as you wish. With that in mind, however, file extensions are a useful tool for keeping track of what type of data a file contains. A .txt file typically contains text, for instance.

Copying — What if we want to copy a file, instead of simply renaming or moving it? Use the cp command (an abbreviated name for “copy”). This command has two different uses that work in the same way as mv:

Let’s try this out.

$ cp newname.testfile copy.testfile
$ ls
newname.testfile copy.testfile
$ cp newname.testfile ..
$ cd ..
$ ls
files Documents newname.testfile

Removing files

We’ve begun to clutter up our workspace with all of the directories and files we’ve been making. Let’s learn how to get rid of them. One important note before we start… when you delete a file on UNIX systems, it is gone forever. There is no “recycle bin” or “trash”. Once a file is deleted, it is gone, never to return. So be very careful when deleting files.

Files are deleted with rm file [moreFiles]. To delete the newname.testfile in our current directory:

$ ls
files Documents newname.testfile
$ rm newname.testfile
$ ls
files Documents

That was simple enough. Empty directories are deleted in a similar manner using rmdir.

$ ls
files Documents
$ rmdir Documents
$ rmdir files/
rmdir: failed to remove `files/': Directory not empty
$ ls
files

What happened? As it turns out, rmdir is unable to remove directories that have stuff in them. To delete a directory and everything inside it, we use rm -r directory (the -r option stands for ‘recursive’). Adding -f, as in rm -rf directory, forces the deletion without prompting. This is probably the scariest command on UNIX: it will delete a directory and all of its contents, no questions asked. ALWAYS double check your typing before using it… if you mistype the path (a stray space that leaves / or * as an argument, for example), it will attempt to delete everything under that path that you have permission to delete. So when deleting directories be very, very careful.

What happens when you use rm -rf accidentally

Steam is a major online sales platform for PC video games with over 125 million users. Despite this, it hasn’t always had the most stable or error-free code.

In January 2015, user kevyin on GitHub reported that Steam’s Linux client had deleted every file on his computer. It turned out that one of the Steam programmers had added the following line: rm -rf "$STEAMROOT/"*. Due to the way that Steam was set up, the variable $STEAMROOT was never initialized, meaning the statement evaluated to rm -rf /*. This coding error in the Linux client meant that Steam deleted every single file on a computer when run in certain scenarios (including connected external hard drives). Moral of the story: be very careful when using rm -rf!
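One way to guard against this class of mistake is the shell's ${VAR:?message} expansion, which stops the command with an error when the variable is unset or empty instead of silently expanding to nothing. The sketch below is an illustration only: the variable name DEMO_ROOT, the demo-cache path, and the scratch directory are all made up, and this is not a description of how Steam changed its script.

```shell
#!/bin/bash
# Hypothetical demo: DEMO_ROOT is deliberately left unset.
unset DEMO_ROOT

# A throwaway directory with one file we do not want to lose.
SANDBOX=$(mktemp -d)
touch "$SANDBOX/keep-me.txt"

# ${DEMO_ROOT:?...} aborts before rm ever runs, because the variable
# is unset. The parentheses run the command in a subshell so that the
# error ends only the subshell, not this whole script.
GUARD_STATUS=0
( rm -rf "${DEMO_ROOT:?DEMO_ROOT is not set}/demo-cache/"* ) 2>/dev/null || GUARD_STATUS=$?

# A non-zero status means the guard stopped the command.
echo "exit status of the guarded command: $GUARD_STATUS"
ls "$SANDBOX"
```

Without the :? guard, the same line would have expanded to a path starting at /, which is exactly the failure described above.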

Looking at files

Sometimes it’s not practical to read an entire file with cat: the file might be way too large, take a long time to open, or maybe we want to only look at a certain part of the file. As an example, we are going to look at a large and complex file type used in bioinformatics, a .gtf file. The GTF2 format is commonly used to describe the location of genetic features in a genome.

Let’s grab and unpack a set of demo files for use later. To do this, we’ll use wget (wget link downloads a file from a link).

$ wget http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz

Problems with wget?

wget is a stand-alone application for downloading things over HTTP/HTTPS and FTP/FTPS connections, and it does the job admirably — when it is installed.

Some operating systems instead come with cURL, which is the command-line interface to libcurl, a powerful library for programming interactions with remote resources over a wide variety of network protocols. If you have curl but not wget, then try this command instead:

$ curl -O http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz

For very large downloads, you might consider using Aria2, which has support for downloading the same file from multiple mirrors. You have to install it separately, but if you have it, try this to get it faster than your neighbors:

$ aria2c http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz

Install cURL

  • macOS: curl is pre-installed on macOS. If you must have the latest version you can brew install it, but only do so if the stock version has failed you.
  • Windows: curl comes preinstalled for the Windows 10 command line. For earlier Windows systems, you can download the executable directly; run it in place.

    curl comes preinstalled in Git for Windows and Windows Subsystem for Linux. On Cygwin, run the setup program again and select the curl package to install it.

  • Linux: curl is packaged for every major distribution. You can install it through the usual means.
    • Debian, Ubuntu, Mint: sudo apt install curl
    • CentOS, Red Hat: sudo yum install curl
    • openSUSE: sudo zypper install curl
    • Fedora: sudo dnf install curl

Install Aria2

  • macOS: aria2 is available through Homebrew: brew install aria2.
  • Windows: download the latest release and run aria2c in place. If you’re using the Windows Subsystem for Linux, follow the Linux instructions below.
  • Linux: every major distribution has an aria2 package. Install it by the usual means.
    • Debian, Ubuntu, Mint: sudo apt install aria2
    • CentOS, Red Hat: sudo yum install aria2
    • openSUSE: sudo zypper install aria2
    • Fedora: sudo dnf install aria2

You’ll commonly encounter .tar.gz archives while working in UNIX. To extract the files from a .tar.gz file, we run the command tar -xvf filename.tar.gz:

$ tar -xvf bash-lesson.tar.gz
dmel-all-r6.19.gtf
dmel_unique_protein_isoforms_fb_2016_01.tsv
gene_association.fb
SRR307023_1.fastq
SRR307023_2.fastq
SRR307024_1.fastq
SRR307024_2.fastq
SRR307025_1.fastq
SRR307025_2.fastq
SRR307026_1.fastq
SRR307026_2.fastq
SRR307027_1.fastq
SRR307027_2.fastq
SRR307028_1.fastq
SRR307028_2.fastq
SRR307029_1.fastq
SRR307029_2.fastq
SRR307030_1.fastq
SRR307030_2.fastq

Unzipping files

We just unzipped a .tar.gz file for this example. What if we run into other file formats that we need to unzip? Just use the handy reference below:

  • gunzip extracts the contents of .gz files
  • unzip extracts the contents of .zip files
  • tar -xvf extracts the contents of .tar.gz and .tar.bz2 files
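As a small sketch of the tar entry above: the -t option lists an archive's contents without extracting anything, which is a useful check before unpacking. The file and directory names here are placeholders created just for the demonstration.

```shell
#!/bin/bash
# Build a tiny archive in a throwaway directory (all names are made up).
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
mkdir demo
echo "hello" > demo/a.txt
echo "world" > demo/b.txt

# -c creates an archive, -z compresses it with gzip, -f names the archive file.
tar -czf demo.tar.gz demo

# -t lists the contents without extracting them.
LISTING=$(tar -tzf demo.tar.gz)
echo "$LISTING"

# -x extracts; -C chooses the directory to extract into.
mkdir unpacked
tar -xzf demo.tar.gz -C unpacked
cat unpacked/demo/a.txt
```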

That is a lot of files! One of these files, dmel-all-r6.19.gtf, is extremely large and contains every annotated feature in the Drosophila melanogaster genome. It’s a huge file. What happens if we run cat on it? (Press Ctrl + C to stop it.)

So, cat is a really bad option when reading big files… it scrolls through the entire file far too quickly! What are the alternatives? Try all of these out and see which ones you like best!
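The alternatives are head, tail, and less. The sketch here uses a small generated file (numbers.txt, a made-up name) so that it runs anywhere; the same commands should work on dmel-all-r6.19.gtf.

```shell
#!/bin/bash
# Generate a 1000-line practice file in a throwaway directory.
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
seq 1 1000 > numbers.txt

# head prints the first lines of a file (10 by default; -n sets the count).
FIRST=$(head -n 3 numbers.txt)
echo "$FIRST"

# tail prints the last lines of a file.
LAST=$(tail -n 2 numbers.txt)
echo "$LAST"

# less opens an interactive pager: arrow keys or Space to scroll,
# /text to search, q to quit. It is commented out here so that the
# sketch can run without waiting for keyboard input.
# less numbers.txt
```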

Out of cat, head, tail, and less, which method of reading files is your favourite? Why?

Key Points

  • Use nano to create or edit text files from a terminal.

  • Use cat file1 [file2 ...] to print the contents of one or more files to the terminal.

  • Use mv old dir to move a file or directory old to another directory dir.

  • Use mv old new to rename a file or directory old to a new name.

  • Use cp old new to copy a file under a new name or location.

  • Use cp old dir to copy a file old into a directory dir.

  • Use rm old to delete (remove) a file.

  • File extensions are entirely arbitrary on UNIX systems.


Wildcards and pipes

Overview

Teaching: 45 min
Exercises: 10 min
Questions
  • How can I run a command on multiple files at once?

  • Is there an easy way of saving a command’s output?

Objectives
  • Redirect a command’s output to a file.

  • Process a file instead of keyboard input using redirection.

  • Construct command pipelines with two or more stages.

  • Explain what usually happens if a program or pipeline isn’t given any input to process.

Required files

If you didn’t get them in the last lesson, make sure to download the example files used in the next few sections:

Using wget: wget http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz

Using a web browser: http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz

Now that we know some of the basic UNIX commands, we are going to explore some more advanced features. The first of these features is the wildcard *. In our examples before, we’ve done things to files one at a time and otherwise had to specify things explicitly. The * character lets us speed things up and do things across multiple files.

Ever wanted to move, delete, or just do “something” to all files of a certain type in a directory? * lets you do that, by taking the place of zero or more characters in a piece of text. So *.txt would be equivalent to all .txt files in a directory for instance. * by itself means all files (apart from hidden files, whose names begin with a dot). Let’s use our example data to see what I mean.

$ tar xvf bash-lesson.tar.gz
$ ls
bash-lesson.tar.gz                           SRR307026_1.fastq
dmel-all-r6.19.gtf                           SRR307026_2.fastq
dmel_unique_protein_isoforms_fb_2016_01.tsv  SRR307027_1.fastq
gene_association.fb                          SRR307027_2.fastq
SRR307023_1.fastq                            SRR307028_1.fastq
SRR307023_2.fastq                            SRR307028_2.fastq
SRR307024_1.fastq                            SRR307029_1.fastq
SRR307024_2.fastq                            SRR307029_2.fastq
SRR307025_1.fastq                            SRR307030_1.fastq
SRR307025_2.fastq                            SRR307030_2.fastq

Now we have a whole bunch of example files in our directory. For this example we are going to learn a new command that tells us how long a file is: wc. wc -l file tells us the length of a file in lines.

$ wc -l dmel-all-r6.19.gtf
542048 dmel-all-r6.19.gtf

Interesting, there are over 540000 lines in our dmel-all-r6.19.gtf file. What if we wanted to run wc -l on every .fastq file? This is where * comes in really handy! *.fastq would match every file ending in .fastq.

$ wc -l *.fastq
20000 SRR307023_1.fastq
20000 SRR307023_2.fastq
20000 SRR307024_1.fastq
20000 SRR307024_2.fastq
20000 SRR307025_1.fastq
20000 SRR307025_2.fastq
20000 SRR307026_1.fastq
20000 SRR307026_2.fastq
20000 SRR307027_1.fastq
20000 SRR307027_2.fastq
20000 SRR307028_1.fastq
20000 SRR307028_2.fastq
20000 SRR307029_1.fastq
20000 SRR307029_2.fastq
20000 SRR307030_1.fastq
20000 SRR307030_2.fastq
320000 total

That was easy. What if we wanted to do the same command, except on every file in the directory? A nice trick to keep in mind is that * by itself matches every file.

$ wc -l *
    53037 bash-lesson.tar.gz
   542048 dmel-all-r6.19.gtf
    22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
   106290 gene_association.fb
    20000 SRR307023_1.fastq
    20000 SRR307023_2.fastq
    20000 SRR307024_1.fastq
    20000 SRR307024_2.fastq
    20000 SRR307025_1.fastq
    20000 SRR307025_2.fastq
    20000 SRR307026_1.fastq
    20000 SRR307026_2.fastq
    20000 SRR307027_1.fastq
    20000 SRR307027_2.fastq
    20000 SRR307028_1.fastq
    20000 SRR307028_2.fastq
    20000 SRR307029_1.fastq
    20000 SRR307029_2.fastq
    20000 SRR307030_1.fastq
    20000 SRR307030_2.fastq
  1043504 total

Multiple wildcards

You can even use multiple *s at a time. How would you run wc -l on every file with “fb” in it?

Solution

wc -l *fb*

i.e. anything or nothing then fb then anything or nothing

Using other commands

Now let’s try cleaning up our working directory a bit. Create a folder called “fastq” and move all of our .fastq files there in one mv command.

Solution

mkdir fastq
mv *.fastq fastq/
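Because the shell expands * before the command runs, echo offers a cheap way to preview what a pattern matches before handing it to mv or rm. A minimal sketch, using made-up file names in a scratch directory:

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
touch sample_1.fastq sample_2.fastq notes.txt

# echo prints the expanded file list instead of acting on it.
MATCHES=$(echo *.fastq)
echo "$MATCHES"

# Once the list looks right, run the real command with the same pattern.
mkdir fastq
mv *.fastq fastq/

# Only the new directory and the non-matching file remain here.
REMAINING=$(ls)
echo "$REMAINING"
```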

Redirecting output

Each of the commands we’ve used so far does only a very small amount of work. However, we can chain these small UNIX commands together to perform otherwise complicated actions!

For our first foray into piping, or redirecting output, we are going to use the > operator to write output to a file. When using >, whatever is on the left of the > is written to the filename you specify on the right of the arrow. The actual syntax looks like command > filename.

Let’s try several basic usages of >. echo simply prints back, or echoes whatever you type after it.

$ echo "this is a test"
this is a test
$ echo "this is a test" > test.txt
$ ls
bash-lesson.tar.gz                           fastq
dmel-all-r6.19.gtf                           gene_association.fb
dmel_unique_protein_isoforms_fb_2016_01.tsv  test.txt
$ cat test.txt
this is a test

Awesome, let’s try that with a more complicated command, like wc -l.

$ wc -l * > word_counts.txt
wc: fastq: Is a directory
$ cat word_counts.txt
    53037 bash-lesson.tar.gz
   542048 dmel-all-r6.19.gtf
    22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
        0 fastq
   106290 gene_association.fb
        1 test.txt
   723505 total

Notice how we still got some output to the console even though we “piped” the output to a file? Our expected output still went to the file, but how did the error message get skipped and not go to the file?

This phenomenon is an artefact of how UNIX systems are built. There are 3 input/output streams for every UNIX program you will run: stdin, stdout, and stderr.

Let’s dissect these three streams of input/output in the command we just ran: wc -l * > word_counts.txt
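To see the two output streams separately, each can be sent to its own file: > (shorthand for 1>) captures stdout and 2> captures stderr. The sketch below recreates the situation in a scratch directory; the file names counts.txt and errors.txt are arbitrary choices for this illustration, and the files are named explicitly rather than with * to keep the result predictable.

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
mkdir fastq
echo "this is a test" > test.txt

# stdout (stream 1) goes to counts.txt, stderr (stream 2) to errors.txt.
# wc reports a failure because fastq is a directory, so "|| true"
# keeps the sketch going even if it is run with "set -e".
wc -l fastq test.txt > counts.txt 2> errors.txt || true

echo "--- counts.txt (stdout) ---"
cat counts.txt
echo "--- errors.txt (stderr) ---"
cat errors.txt
```

Nothing from wc reaches the terminal this time, because both of its output streams were redirected.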

Knowing what we know now, let’s try re-running the command, and send all of the output (including the error message) to the same word_counts.txt file as before.

$ wc -l * &> word_counts.txt

Notice how there was no output to the console that time. Let’s check that the error message went to the file like we specified.

$ cat word_counts.txt
    53037 bash-lesson.tar.gz
   542048 dmel-all-r6.19.gtf
    22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
wc: fastq: Is a directory
        0 fastq
   106290 gene_association.fb
        1 test.txt
        7 word_counts.txt
   723512 total

Success! The wc: fastq: Is a directory error message was written to the file. Also, note how the file was silently overwritten by directing output to the same place as before. Sometimes this is not the behaviour we want. How do we append (add) to a file instead of overwriting it?

Appending to a file is done the same way as redirecting output. However, instead of >, we will use >>.

$ echo "We want to add this sentence to the end of our file" >> word_counts.txt
$ cat word_counts.txt
    53037 bash-lesson.tar.gz
   542048 dmel-all-r6.19.gtf
    22129 dmel_unique_protein_isoforms_fb_2016_01.tsv
wc: fastq: Is a directory
        0 fastq
   106290 gene_association.fb
        1 test.txt
        7 word_counts.txt
   723512 total
We want to add this sentence to the end of our file

Chaining commands together

We now know how to redirect stdout and stderr to files. We can actually take this a step further and redirect output (stdout) from one command to serve as the input (stdin) for the next. To do this, we use the | (pipe) operator.

grep is an extremely useful command. It finds things for us within files. Basic usage (there are a lot of options for more clever things, see the man page) uses the syntax grep whatToFind fileToSearch. Let’s use grep to find all of the entries pertaining to the Act5C gene in Drosophila melanogaster.

$ grep Act5C dmel-all-r6.19.gtf

The output is nearly unintelligible since there is so much of it. Let’s send the output of that grep command to head so we can just take a peek at the first line. The | operator lets us send output from one command to the next:

$ grep Act5C dmel-all-r6.19.gtf | head -n 1
X	FlyBase	gene	5900861	5905399	.	+	.	gene_id "FBgn0000042"; gene_symbol "Act5C";

Nice work, we sent the output of grep to head. Let’s try counting the number of entries for Act5C with wc -l. We can do the same trick to send grep’s output to wc -l:

$ grep Act5C dmel-all-r6.19.gtf | wc -l
46

Note that this is just the same as redirecting output to a file, then reading the number of lines from that file.
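To make that equivalence concrete, here is a sketch that counts matching lines both ways on a small made-up file (genes.txt and its contents are invented for the example). The < operator feeds a file to a command's stdin.

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
printf 'Act5C exon\nAct42A exon\nAct5C CDS\n' > genes.txt

# Two steps: save grep's stdout in a file, then read that file on wc's stdin.
grep Act5C genes.txt > matches.txt
VIA_FILE=$(wc -l < matches.txt)

# One step: the pipe connects grep's stdout directly to wc's stdin.
VIA_PIPE=$(grep Act5C genes.txt | wc -l)

echo "via file: $VIA_FILE"
echo "via pipe: $VIA_PIPE"
```

The pipe version gives the same count without leaving an intermediate file behind.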

Writing commands using pipes

How many files are there in the “fastq” directory we made earlier? (Use the shell to do this.)

Solution

ls fastq/ | wc -l

When its output is piped to another command, ls prints one item per line, so counting lines gives the number of files.

Reading from compressed files

Let’s compress one of our files using gzip.

$ gzip gene_association.fb

zcat acts like cat, except that it can read information from .gz (compressed) files. Using zcat, can you write a command to take a look at the top few lines of the gene_association.fb.gz file (without decompressing the file itself)?

Solution

zcat gene_association.fb.gz | head

The head command without any options shows the first 10 lines of a file.

Key Points

  • The * wildcard is used as a placeholder to match any text that follows a pattern.

  • Redirect a command’s output to a file with >.

  • Commands can be chained with |


Scripts, variables, and loops

Overview

Teaching: 45 min
Exercises: 10 min
Questions
  • How do I turn a set of commands into a program?

Objectives
  • Write a shell script

  • Understand and manipulate UNIX permissions

  • Understand shell variables and how to use them

  • Write a simple “for” loop.

We now know a lot of UNIX commands! Wouldn’t it be great if we could save certain commands so that we could run them later or not have to type them out again? As it turns out, this is straightforward to do. A “shell script” is essentially a text file containing a list of UNIX commands to be executed in a sequential manner. These shell scripts can be run whenever we want, and are a great way to automate our work.

Writing a Script

So how do we write a shell script, exactly? It turns out we can do this with a text editor. Start editing a file called “demo.sh” (to recap, we can do this with nano demo.sh). The “.sh” is the standard file extension for shell scripts that most people use (you may also see “.bash” used).

Our shell script will have two parts:

Our file should now look like this:

#!/bin/bash

echo "Our script worked!"

Ready to run our program? Let’s try running it:

$ demo.sh 
bash: demo.sh: command not found...

Strangely enough, Bash can’t find our script. As it turns out, Bash will only look in certain directories for scripts to run. To run anything else, we need to tell Bash exactly where to look. To run a script that we wrote ourselves, we need to specify the path to the file, not just its name. We could do this one of two ways: either with the absolute path /home/yourUsername/demo.sh, or with the relative path ./demo.sh.

$ ./demo.sh
bash: ./demo.sh: Permission denied
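As an aside on why the bare name demo.sh was not found: the directories that Bash searches are stored, separated by colons, in the PATH variable. A short sketch for inspecting it:

```shell
#!/bin/bash
# Print each directory in the search path on its own line.
echo "$PATH" | tr ':' '\n'

# "command -v" reports where the shell finds a given command.
LS_LOCATION=$(command -v ls)
echo "ls is found at: $LS_LOCATION"
```

The current directory is normally not on this list, which is why a script there has to be started with an explicit ./ prefix.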

There’s one last thing we need to do. Before a file can be run, it needs “permission” to run. Let’s look at our file’s permissions with ls -l:

$ ls -l
-rw-rw-r-- 1 yourUsername tc001 12534006 Jan 16 18:50 bash-lesson.tar.gz
-rw-rw-r-- 1 yourUsername tc001       40 Jan 16 19:41 demo.sh
-rw-rw-r-- 1 yourUsername tc001 77426528 Jan 16 18:50 dmel-all-r6.19.gtf
-rw-r--r-- 1 yourUsername tc001   721242 Jan 25  2016 dmel_unique_protein_is...
drwxrwxr-x 2 yourUsername tc001     4096 Jan 16 19:16 fastq
-rw-r--r-- 1 yourUsername tc001  1830516 Jan 25  2016 gene_association.fb.gz
-rw-rw-r-- 1 yourUsername tc001       15 Jan 16 19:17 test.txt
-rw-rw-r-- 1 yourUsername tc001      245 Jan 16 19:24 word_counts.txt

That’s a huge amount of output: a full listing of everything in the directory. Let’s see if we can understand what each field of a given row represents, working left to right.

  1. Permissions: On the very left side, there is a string of the characters d, r, w, x, and -. The d indicates if something is a directory (there is a - in that spot if it is not a directory). The other r, w, x bits indicate permission to Read, Write, and eXecute a file. There are three fields of rwx permissions following the spot for d. If a user is missing a permission to do something, it’s indicated by a -.
    • The first set of rwx are the permissions that the owner has (in this case the owner is yourUsername).
    • The second set of rwxs are permissions that other members of the owner’s group share (in this case, the group is named tc001).
    • The third set of rwxs are the permissions that anyone else with access to this computer has for the file. Though files are typically created with read permissions for everyone, typically the permissions on your home directory prevent others from being able to access the file in the first place.
  2. References: This counts the number of references (hard links) to the item (file, folder, symbolic link or “shortcut”).
  3. Owner: This is the username of the user who owns the file. Their permissions are indicated in the first permissions field.
  4. Group: This is the user group of the user who owns the file. Members of this user group have permissions indicated in the second permissions field.
  5. Size of item: This is the number of bytes in a file, or the number of filesystem blocks occupied by the contents of a folder. (We can use the -h option here to get a human-readable file size in megabytes, gigabytes, etc.)
  6. Time last modified: This is the last time the file was modified.
  7. Filename: This is the filename.

So how do we change permissions? As I mentioned earlier, we need permission to execute our script. Changing permissions is done with chmod. To add executable permissions for all users we could use this:

$ chmod +x demo.sh
$ ls -l
-rw-rw-r-- 1 yourUsername tc001 12534006 Jan 16 18:50 bash-lesson.tar.gz
-rwxrwxr-x 1 yourUsername tc001       40 Jan 16 19:41 demo.sh
-rw-rw-r-- 1 yourUsername tc001 77426528 Jan 16 18:50 dmel-all-r6.19.gtf
-rw-r--r-- 1 yourUsername tc001   721242 Jan 25  2016 dmel_unique_protein_is...
drwxrwxr-x 2 yourUsername tc001     4096 Jan 16 19:16 fastq
-rw-r--r-- 1 yourUsername tc001  1830516 Jan 25  2016 gene_association.fb.gz
-rw-rw-r-- 1 yourUsername tc001       15 Jan 16 19:17 test.txt
-rw-rw-r-- 1 yourUsername tc001      245 Jan 16 19:24 word_counts.txt

Now that we have executable permissions for that file, we can run it.

$ ./demo.sh
Our script worked!
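chmod +x grants execute permission to every class of user. A letter in front of the + narrows the change: u for the owner, g for the group, o for others. A minimal sketch on a scratch file (private.sh is a made-up name):

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
touch private.sh

# Add execute permission for the owner (u) only.
chmod u+x private.sh

# Keep just the ten permission characters from the long listing.
PERMS=$(ls -l private.sh | cut -c1-10)
echo "$PERMS"
```

In the printed string, the x should appear only in the first rwx group, the one that belongs to the owner.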

Fantastic, we’ve written our first program! Before we go any further, let’s learn how to take notes inside our program using comments. A comment is indicated by the # character, followed by whatever we want. Comments do not get run. Let’s try out some comments in the console, then add one to our script!

# This won't show anything.

Now let’s try adding this to our script with nano. Edit your script to look something like this:

#!/bin/bash

# This is a comment... they are nice for making notes!
echo "Our script worked!"

When we run our script, the output should be unchanged from before!

Shell variables

One important concept that we’ll need to cover is shell variables. Variables are a great way of saving information under a name you can access later. In programming languages like Python and R, variables can store pretty much anything you can think of. In the shell, they usually just store text. The best way to understand how they work is to see them in action.

To set a variable, simply type in a name containing only letters, numbers, and underscores (and not starting with a number), followed by an = and whatever you want to put in the variable, with no spaces on either side of the =. Shell variable names are often uppercase by convention (but do not have to be).

$ VAR="This is our variable"

To use a variable, prefix its name with a $ sign. Note that if we want to simply check what a variable is, we should use echo (or else the shell will try to run the contents of a variable).

$ echo $VAR
This is our variable

Let’s try setting a variable in our script and then recalling its value as part of a command. We’re going to make it so our script runs wc -l on whichever file we specify with FILE.

Our script:

#!/bin/bash

# set our variable to the name of our GTF file
FILE=dmel-all-r6.19.gtf

# call wc -l on our file
wc -l $FILE

$ ./demo.sh
542048 dmel-all-r6.19.gtf

What if we wanted to do our little wc -l script on other files without having to change $FILE every time we want to use it? There is actually a special shell variable we can use in scripts that allows us to use arguments in our scripts (arguments are extra information that we can pass to our script, like the -l in wc -l).

To use the first argument to a script, use $1 (the second argument is $2, and so on). Let’s change our script to run wc -l on $1 instead of $FILE. Note that we can also pass all of the arguments using $@ (not going to use it in this lesson, but it’s something to be aware of).

Our script:

#!/bin/bash

# call wc -l on our first argument
wc -l $1

$ ./demo.sh dmel_unique_protein_isoforms_fb_2016_01.tsv
22129 dmel_unique_protein_isoforms_fb_2016_01.tsv

Nice! One thing to be aware of when using variables: they are all treated as pure text. How do we save the output of an actual command like ls -l?

A demonstration of what doesn’t work:

$ TEST=ls -l
-bash: -l: command not found

What does work (we need to surround any command with $(command)):

$ TEST=$(ls -l)
$ echo $TEST
total 90372 -rw-rw-r-- 1 jeff jeff 12534006 Jan 16 18:50 bash-lesson.tar.gz -rwxrwxr-x. 1 jeff jeff 40 Jan 16 19:41 demo.sh -rw-rw-r-- 1 jeff jeff 77426528 Jan 16 18:50 dmel-all-r6.19.gtf -rw-r--r-- 1 jeff jeff 721242 Jan 25 2016 dmel_unique_protein_isoforms_fb_2016_01.tsv drwxrwxr-x. 2 jeff jeff 4096 Jan 16 19:16 fastq -rw-r--r-- 1 jeff jeff 1830516 Jan 25 2016 gene_association.fb.gz -rw-rw-r-- 1 jeff jeff 15 Jan 16 19:17 test.txt -rw-rw-r-- 1 jeff jeff 245 Jan 16 19:24 word_counts.txt

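Note that everything got printed on the same line. The line breaks are still stored in the variable; they vanish here because the shell splits an unquoted $TEST into separate words before echo sees them. Writing echo "$TEST" keeps the original line breaks. The unquoted, word-by-word form is what lets us loop over the output of a command, as we will do below.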
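Whether the line breaks survive depends on quoting: an unquoted $TEST is split into words before echo receives it, while "$TEST" is passed through intact. This sketch uses printf to build the text rather than ls -l, so that the result is the same on every system:

```shell
#!/bin/bash
# Store two lines of text in a variable.
TEST=$(printf 'first line\nsecond line')

# Unquoted: the shell splits the value into words, so echo
# receives four separate arguments and prints them on one line.
UNQUOTED=$(echo $TEST)
echo "$UNQUOTED"

# Quoted: the value is passed as a single argument, line break included.
QUOTED=$(echo "$TEST")
echo "$QUOTED"
```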

Loops

To end our lesson on scripts, we are going to learn how to write a for-loop to execute a lot of commands at once. This will let us do the same string of commands on every file in a directory (or other stuff of that nature).

for-loops generally have the following syntax:

#!/bin/bash

for VAR in first second third
do
    echo $VAR
done

When a for-loop gets run, the loop will run once for everything following the word in. In each iteration, the variable $VAR is set to a particular value for that iteration. In this case it will be set to first during the first iteration, second on the second, and so on. During each iteration, the code between do and done is performed.

Let’s run the script we just wrote (I saved mine as loop.sh).

$ chmod +x loop.sh
$ ./loop.sh
first
second
third

What if we wanted to loop over a shell variable, such as every file in the current directory? Shell variables work perfectly in for-loops. In this example, we’ll save the result of ls and loop over each file:

#!/bin/bash

FILES=$(ls)
for VAR in $FILES
do
        echo $VAR
done

$ ./loop.sh
bash-lesson.tar.gz
demo.sh
dmel_unique_protein_isoforms_fb_2016_01.tsv
dmel-all-r6.19.gtf
fastq
gene_association.fb.gz
loop.sh
test.txt
word_counts.txt

There’s a shortcut to run on all files of a particular type, say all .gz files:

#!/bin/bash

for VAR in *.gz
do
    echo $VAR
done

$ ./loop.sh
bash-lesson.tar.gz
gene_association.fb.gz

Writing our own scripts and loops

cd to our fastq directory from earlier and write a loop to print off the name and top 4 lines of every fastq file in that directory.

Is there a way to only run the loop on fastq files ending in _1.fastq?

Solution

Create the following script in a file called head_all.sh

#!/bin/bash

for FILE in *.fastq
do
   echo $FILE
   head -n 4 $FILE
done

The “for” line could be modified to be for FILE in *_1.fastq to achieve the second aim.

Concatenating variables

Concatenating (i.e. mashing together) variables is quite easy to do. Add whatever you want to concatenate to the beginning or end of the shell variable after enclosing it in {} characters.

$ FILE=stuff.txt
$ echo ${FILE}.example
stuff.txt.example

Can you write a script that prints off the name of every file in a directory with “.processed” added to it?

Solution

Create the following script in a file called process.sh

#!/bin/bash

for FILE in *
do
   echo ${FILE}.processed
done

Note that this will also print directories appended with “.processed”. To truly only get files and not directories, we need to modify this to use the find command to give us only files in the current directory:

#!/bin/bash

for FILE in $(find . -maxdepth 1 -type f)
do
   echo ${FILE}.processed
done

but this will have the side-effect of listing hidden files too.
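An alternative sketch that stays with the * wildcard: the test [ -f name ] succeeds only for regular files, so directories can be skipped inside the loop, and because * does not match hidden files, those are left out too. The if statement is not covered in this lesson, and the file names below are made up.

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
touch data.txt .hidden
mkdir subdir

# Collect the loop's output in a variable so it can be inspected afterwards.
NAMES=$(
    for FILE in *
    do
        # [ -f ... ] is true only when FILE is a regular file.
        if [ -f "$FILE" ]
        then
            echo "${FILE}.processed"
        fi
    done
)
echo "$NAMES"
```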

Special permissions

What if we want to give different sets of users different permissions? chmod actually accepts special numeric codes instead of stuff like chmod +x. The numeric codes are as follows: read = 4, write = 2, execute = 1. For each class of user (owner, group, and everyone else) we assign a single digit that is the sum of the permissions we want to grant (so each digit is between 0 and 7).

Let’s make an example file and give everyone permission to do everything with it.

touch example
ls -l example
chmod 777 example
ls -l example

How might we give ourselves permission to do everything with a file, but allow no one else to do anything with it?

Solution

chmod 700 example

We want all permissions so: 4 (read) + 2 (write) + 1 (execute) = 7 for user (first position), no permissions, i.e. 0, for group (second position) and all (third position).
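The arithmetic can be checked against the listing that ls -l prints. In this sketch (the scratch directory is arbitrary), 640 means 4+2 for the owner, 4 for the group, and 0 for everyone else:

```shell
#!/bin/bash
WORKDIR=$(mktemp -d)
cd "$WORKDIR"
touch example

# owner: 4 (read) + 2 (write) = 6; group: 4 (read); others: 0
chmod 640 example
PERMS_640=$(ls -l example | cut -c1-10)
echo "$PERMS_640"    # prints -rw-r-----

# owner: 4 + 2 + 1 = 7; group and others: 0
chmod 700 example
PERMS_700=$(ls -l example | cut -c1-10)
echo "$PERMS_700"    # prints -rwx------
```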

Key Points

  • A shell script is just a list of bash commands in a text file.

  • To make a shell script file executable, run chmod +x script.sh.