Introduction to Python for scientific computing

(c) 2022 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with support financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from an Jupyter notebook. You can download the notebook here.



You should have already installed a Python distribution using Anaconda. Anaconda contains most of what we need to do scientific computing with Python. At the most basic level, it has Python 3.8. It contains other packages we will make heavy use of, the three most important ones being NumPy, Bokeh, and Jupyter. We will also make heavy use of and scikit-image throughout the course. Importantly, it also has JupyterLab, which we will use to do all of our assignments in Bi 1x.

In this tutorial, we will first learn some of the basics of the Python programming language at the same time exploring the properties of NumPy's very useful (and as we'll see, ubiquitous) ndarray data structure, which we will refer to as a "NumPy array." Finally, we'll load some data and use Bokeh to generate plots.

You should have already learned how to launch a Jupyter notebook for JupyterLab from the tutorial on setting up your environment.

Hello, world.

As a new programmer, your first task is to greet the world with your new set of skills! We'll start by printing "Hello, world." In a code cell in your Jupyter notebook, type the following:

Do you see "Hello, world." printed to the screen right below your command? The Python interpreter is taking your input and printing it out in the output of the cell in your Jupyter notebook.

Here we see the syntax for function calls in Python. Function arguments are enclosed in parentheses. We also learned another important piece of Python syntax. Strings are enclosed in single (or double) quotes.

Now try the following:

Notice how even though you wrote text above the print statement, only "Hello, world." is printed. The # starts a comment; anything after the #, will not be read or interpreted by Python. This is how you add notes to your program about what certain lines do. Including comments in your code is essential so that other people can read your code and understand what you are trying to accomplish.

Variable Assignment

To assign a variable in python, we use an =, just like you would expect from any math class. Arithmetic operations are also as expected: +, -, *, /.

Try the following:

In the assignment of the variable b, we used the ** operator. This means "raise to the power of." In the assignment of the variable d, we used the // operator. This does integer division, returning the floor of the result of the division as an integer.

Lists, tuples, slicing, and indexing

A list is a mutable array, meaning it can be edited. Let's explore below! Notice that a list is created using [].

Notice that indexing is done with brackets, []. Importantly, in Python, indexing starts are zero. Also notice that we can index the last element of a list with -1.

A tuple is like a list, but it's immutable, meaning that once created, it cannot be changed. Let's try the same thing as above, but with a tuple, created with (). But notice that the indexing of the tuple is still denoted with [].

What's the error? The Python interpreter is objecting because it cannot replace the 1 in my_tuple[0] with 3.14; that operation is not supported. If you try printing out my_tuple again, you will see it hasn't been changed.

A string is just a bunch of letters, or letters and numbers, strung together in an order. It can also be indexed, like we did above with the list and the tuple. Let's look at our favorite phrase.

IMPORTANT! Python interprets [0:4] as [0,4), so be careful when pulling out strings of specific length. Pulling small strings out of our larger string is called slicing, and can also be done with lists and tuples. This can be very powerful, as we can even pull out pieces at regular intervals.

Make sure you notice how we create lists c and d. We select the entries in the list from position 0 to 9, selecting every 2 or 3, respectively.

Objects, types, and methods

Python is object-oriented, and all values in a program are objects. An object consists of an identity (where it is stored in memory), a type (a definition of how the object is represented), and data (the value of the object). An object of a given type can have various methods that operate on the data of the object. How do we keep track of what type our variables are? Fortunately, Python has a function for this, called type(). Let's try it out. First, we'll create some new objects.

What is most important to notice here is that my_list is a list, and that it can contain many different objects, from numbers to strings.

The data are the numbers and values that you associate with your variable.

Finally, objects have methods that can perform operations on the data. A method is called similarly to a function. This is best seen by example.

As you can see, a list object has several methods including count() and sort(). They are called just like functions, with arguments in parentheses. The name of the method comes after the object name followed by a dot (.).

The count() method takes a single argument and returns the number of times that argument appears in the list. The sort function takes no arguments (but still requires open and closed parentheses to be called), and sorts the list in place. Note that my_list has been changed and is now sorted. We also use the method append(), which adds another element to the end of a list.

Within JupyterLab, you can see what methods and data are available by entering the object name followed by a dot (.), and then pressing tab. Try it!

Packages

It is common that scientific software packages such as Matlab and Mathematica are optimized for a flavor of scientific computing (such as matrix computation in the case of Matlab) and are rather full-featured. On the other hand, Python is a generic programming language. It was not specifically designed to do scientific computing. So, plain old Python is very limited in scientific computing capability.

However, Python is very flexible and allows use of packages. A package contains classes, functions, attributes, data types, etc., beyond what is built in to the core language. In order to use a module, you need to import it to make it available for use. So, as we begin working on data analysis and simulations in Bi 1x, we need to import modules we will use.

It is important to note that until we import a module, its capacities are not available. Keep that in mind: Plain old Python won't do much until you import a package.

Let's now import one of the major workhorses of our class, NumPy!

Notice that we used the import <module> as <abbrev> construction. This enabled us to abbreviate the name of the module so we do not have to type numpy each time. Also, notice that to access the (approximate) value of π in the numpy module, we prefaced the name of the attribute (pi) with the module name followed by a dot (np.). This is generally how you access attributes in modules.

We're already getting dangerous with Python. So dangerous, in fact, that we'll write our own function!

Writing your own function (and learning a bunch of syntax!)

We will create a function that finds the roots of the quadratic equation

(1)ax2+bx+c=0.

To do this, we will first write a function to compute the discriminant and then one to return the roots.

There is a whole bunch of syntax in there to point out.

Now, let's test our new function out!

Very nice! Now, let's try another example. This one might have a problem....

Oh no! It gave us nan, which means "not a number," as our roots. It also gave some warning that it encountered invalid (negative) arguments for the np.sqrt() function. The roots should be 1±i, where i=1. We will use this opportunity to introduce Python's control flow, starting with an if statement.

Control flow: the if statement

We will decree that our quadratic equation solver only handles real roots, so it will raise an exception if an imaginary root is encountered. So, we modify the roots() function as follows.

We have now exposed the syntax for a Python if statement. The conditional expression ends with a colon, just like the def statement. Note the indentation of blocks of code after the conditionals. (We actually did not need the else statement, because the program would just continue without the exception, but I left it there for illustrative purposes. It is actually preferred not to have the else statement.)

Let's try out our new function.

This threw the appropriate exception.

Loops

If we want to do something many times, we can use a for loop. This is again best learned by example. Let's make a function that counts the number of times a substring is present in a sequence of DNA.

Let's see how it works!

Now that we know it works, let's look at how the loop was constructed. In the statement at the beginning of the loop, we use the function range() to define an iterator. Calling range(n) creates an object akin to a list of integers from 0 to n-1. The for statement says that the variable i will successively take the values of the iterator.

Keyword arguments

Before concluding our quick trip through the very basics of Python and on to NumPy, I want to show a very handy tool in Python, keyword arguments. Before, when we defined a function, we specified its arguments as variable names separated by commas in the def statement. We can also specify keyword arguments. Here is an example from our quadratic equation solver.

The function quadratic_roots now has two keyword arguments. They are signified by the equals sign.

The function has three required arguments, a, b, c. If the keyword arguments are omitted in the function call, they take on the default values, as specified in the function definition.

In the example, the default for print_discriminant is False and the default for message_to_the_world is 'Bi 1x rules!'. Furthermore, ordering of keyword arguments does not matter when calling the function. They are called in the function similarly to the way they are defined in the function definition.

Let's try it!

Since we did not specify the keyword arguments, the defaults were used. We can specify other values.

In the above function call to roots(), notice that I put the arguments on separate lines. This is valid Python syntax and, when used appropriately, can make your code more readable.

Intro to NumPy, SciPy, and Bokeh

Here are some words to live by: If you are trying to do a task that you think might be common, it's probably part of NumPy or some other package. Look, or ask Google, first. In this case, NumPy has a function called roots() that computes the roots of a polynomial. To figure out how to use it, we can either look at the doc string, or look in the NumPy and SciPy documentation online (the documentation for np.roots() is available here). To look at the doc string, you can enter the following in a code cell:

We see that we need to pass the coefficients of the polynomial we would like the roots of using an "array_like" object. We will discuss what this means in a moment, but for now, we will just use a list to specify our coefficients and call the np.roots() function.

Some array_like data types

In the previous example, we used a list as an array_like data type. Python has several native data types. We have already mentioned ints and floats. We just were not very explicit about it. Python's native array_like data types are lists and tuples. Internally, these things are converted into NumPy arrays, which is the most often used array_like data type we will use.

The NumPy array: maybe your new best friend

Lists and tuples can be useful, but for many applications in data analysis, the np.ndarray, which we will colloquially call a "NumPy array," is most often used. They are created using the np.array function with a list or tuple as an argument. Once created, we can do all sorts of things with them. Let's play!

Let's make some arrays to see what they look like:

Sometimes you want an array of all zero values, and that can also be done with numpy.

Now let's see how we can do operations on some arrays.

We can check the data type of our matrix.

We can also change the type of the entries of our arrays, for example, from integers to floating point.

Slicing is also intuitive.

Using slices, we can reassign values to the entries in a NumPy array. For example, say we wanted the third row to be all zeros.

We can also reshape arrays.

NumPy functions

Here are some useful NumPy functions we think you might want to use! Go through these one-by-one in separate cells to see what they do.

Bokeh: our primary plotting tool

Bokeh is a tool for generating interactive plots that can be views in a browser (and therefore also Jupyter notebooks). Its syntax is intuitive, and best learned by example. First, we need to import packages and set things up so that the graphics can be displayed in the notebook.

Now that we have Bokeh ready, we can make some plots.

To export the figure for insertion into a document, you can click on the disk image in the Bokeh tools. This will give you a PNG, but for publications, you should use vector graphics. There are ways to save Bokeh plots as vector graphics, but we will not use them here.

Programming style

PEPs (Python Enhancement Proposals) document suggestions and guidelines for the Python language, its development, etc. My personal favorite is PEP8, the Style Guide for Python Code. This was largely written by Guido von Rossum, the inventor of Python and its benevolent dictator for life. It details how you should style your code. As Guido says, code is much more often read than written. I strongly urge you to follow PEP8 the best you can. It's a lot to read, so I will highlight important points here. If you follow these guidelines, your code will be much more readable than otherwise. This is particularly useful because you are working in groups and the TAs need to grade your work.

For me, an easy way to keep my code looking pretty is to use Black. If you have installed black and black cell magic, you can use Black in the Jupyter notebook. In a cell at the beginning of your notebook, execute the following code.

Then, in any code cell where you want Black to adjust the formatting, put %%black at the top of the cell and execute it. The code in the cell will not be executed; rather, it will be reformatted according to Black's style.

Using NumPy/SciPy to perform a linear regression

As a case study of how we can load in data and do analysis with it, we will consider a data set from one of the classic organisms in biology: Darwin's finches. Peter and Rosemary Grant have been working on the Galápagos island of Daphne for over forty years. During this time, they have collected lots and lots of data about physiological features of finches. A couple years ago, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., Data from: 40 years of evolution. Darwin's finches on Daphne Major Island, Princeton University Press, 2014). They made their data from the book publicly available via the Dryad Digital Repository.

We will investigate their data on the heritability of beak depth (the distance, top to bottom, of a closed beak) in the ground finch Geospiza fortis. The data set consists of the maternal beak depth, the paternal beak depth, and the mean beak depth of their offspring. You can download the data set, which I got from Dryad and tidied up into a CSV file, here. Our goal it so do a linear regression to see how parental beak depth is related to that of the offspring.

First, we will load in the data set. Before attempting to load the data set, it is important to know where it is on your machine. It should be in the same directory from which you launched Jupyter lab. Otherwise, you can enter the full path below. It is in a comma delimited text file. For this particular file, the data are preceded by comments about their content, with each comment line starting with a #. We will use the wonderful package pandas to read in the data.

This pandas DataFrame has 413 measurements with three columns. We can index the columns to get Numpy arrays containing the data we want.

We want to plot mean offspring beak depth vs. mean parental beak depth, so we need to compute the mean parental beak depth. Conveniently, we do not have to write a loop to compute this for each parent pair. NumPy arrays allow for convenient elementwise calculation in a single command.

Now we have the arrays we need to make our plot.

I chose to use the alpha keyword argument in the p.circle() function because it helps visualize overlapping data points. We see that parental and offspring beak depth are correlated, as we might expect.

Let's perform a linear regression to get the slope and intercept. To perform the linear regression, we will use scipy.optimize.curve_fit(). This function takes a set of data and returns the parameters of the model function that best describe the data. Its first input is the model function (which we have defined as linear); next is the x-data (in our case parental beak depth, parental_bd); the third input is the y-data (offspring beak depth, offspring_bd). The last piece that we will need to address is a best guess for our parameters. In this case, we have to provide two values, the slope and the intercept. curve_fit() works by finding the curve that minimizes the distance of every point from the curve (in our case, this will be minimizing the vertical distance between measured data points and the line describing the data).

More explicitly, scipy.optimize.curve_fit() will take the sum of the squares of the vertical differences of each data point to the model function and try to minimize this value by adjusting the model parameters. For nonlinear fits, you will need to provide reasonable guesses for the best-fit parameters, as there may be many local minima for minimizing the error. In the case of linear regression, we do not need to be too careful, as there will actually be one minimum that best fits the data. For more on the scipy.optimize.curve_fit() function, please read the documentation.

When we run scipy.optimize.curve_fit(), we will get two arrays out. The first array contains the optimal set of parameters for the curve fit. The second array is a 2d array with the covariance of the optimal parameters. For this course, we will ignore the second array and will only use the first array, which, for linear regressions, will give us the slope and intercept.

So, we got our values for the slope and intercept. We can plot the best fit line along with the data.

This does not look exactly like what we would draw by eye, but this is because of overlapping data points. This is, indeed, the best fit line.

Including images and converting to HTML

If you want to include images in your notebook, you can do that directly in a Markdown cell using HTML tags. For example:

<img src='my_image.png'>

You may have scans of equations you want to include, and you can use this technique to do it. However, if you are displaying images from your experiments, you should load them in using skimage and display them as shown in the next tutorial on image processing.

For all of your assignment, you will need to submit a Jupyter notebook and the notebook conversion of the notebook. To convert to HTML, click through the following menu option.

File -> Export Notebook As... -> HTML

Note that if you are displaying images using HTML tags, you must also include the images with your submission. The best way to do the submission is to zip everything together in a single ZIP file.

Conclusions

This concludes our introductory tour of Python with some NumPy, SciPy, Pandas, and Bokeh thrown in for good measure. There is still much to learn, but you will pick up more and more tricks and syntax as we go along.

For the next tutorial, we will use Python to do some image processing. That is, we will write code to extract data of interest from images. As you get more and more proficient, coding (particularly in Python, in my opinion) will be more and more empowering and FUN!

Computing environment