Introduction to Python for scientific computing

(c) 2020 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with support financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from an Jupyter notebook. You can download the notebook here.

You should have already installed a Python distribution using Anaconda. Anaconda contains most of what we need to do scientific computing with Python. At the most basic level, it has Python 3.7. It contains other packages we will make heavy use of, the three most important ones being NumPy, Bokeh, and Jupyter. We will also make heavy use of and scikit-image throughout the course. Importantly, it also has JupyterLab, which we will use to do all of our assignments in Bi 1x.

In this tutorial, we will first learn some of the basics of the Python programming language at the same time exploring the properties of NumPy's very useful (and as we'll see, ubiquitous) ndarray data structure, which we will refer to as a "NumPy array." Finally, we'll load some data and use Bokeh to generate plots.

You should have already learned how to launch a Jupyter notebook for JupyterLab from the tutorial on setting up your environment.

Hello, world.

As a new programmer, your first task is to greet the world with your new set of skills! We'll start by printing "Hello, world." In a code cell in your Jupyter notebook, type the following:

In [1]:
print('Hello, world.')
Hello, world.

Do you see "Hello, world." printed to the screen right below your command? The Python interpreter is taking your input and printing it out in the output of the cell in your Jupyter notebook.

Here we see the syntax for function calls in Python. Function arguments are enclosed in parentheses. We also learned another important piece of Python syntax. Strings are enclosed in single (or double) quotes.

Now try the following:

In [2]:
# This prints hello world
print ('Hello, world.')
Hello, world.

Notice how even though you wrote text above the print statement, only "Hello, world." is printed. The # starts a comment; anything after the #, will not be read or interpreted by Python. This is how you add notes to your program about what certain lines do. Including comments in your code is essential so that other people can read your code and understand what you are trying to accomplish.

Variable Assignment

To assign a variable in python, we use an =, just like you would expect from any math class. Arithmetic operations are also as expected: +, -, *, /.

Try the following:

In [3]:
a = 3

b = a**3

c = (b + 2*a) / 2

d = (b + 2*a) // 2

print('a is', a)
print('b is',  b)
print('c is', c)
print ('d is', d)
a is 3
b is 27
c is 16.5
d is 16

In the assignment of the variable b, we used the ** operator. This means "raise to the power of." In the assignment of the variable d, we used the // operator. This does integer division, returning the floor of the result of the division as an integer.

Lists, tuples, slicing, and indexing

A list is a mutable array, meaning it can be edited. Let's explore below! Notice that a list is created using [].

In [4]:
my_list = [1,2,3,4]
print('my_list is', my_list)

my_list[0] = 3.14
print('my_list is now', my_list)
print('the last element in my_list is', my_list[-1])
my_list is [1, 2, 3, 4]
my_list is now [3.14, 2, 3, 4]
the last element in my_list is 4

Notice that indexing is done with brackets, []. Importantly, in Python, indexing starts are zero. Also notice that we can index the last element of a list with -1.

A tuple is like a list, but it's immutable, meaning that once created, it cannot be changed. Let's try the same thing as above, but with a tuple, created with (). But notice that the indexing of the tuple is still denoted with [].

In [5]:
# Create a tuple and print it
my_tuple = (1,2,3,4)
print('my tuple is', my_tuple) 

# This will make Python scream at us because tuples are immutable.
my_tuple[0] = 3.14
my tuple is (1, 2, 3, 4)
TypeError                                 Traceback (most recent call last)
<ipython-input-5-f6a863764023> in <module>
      5 # This will make Python scream at us because tuples are immutable.
----> 6 my_tuple[0] = 3.14

TypeError: 'tuple' object does not support item assignment

What's the error? The Python interpreter is objecting because it cannot replace the 1 in my_tuple[0] with 3.14; that operation is not supported. If you try printing out my_tuple again, you will see it hasn't been changed.

A string is just a bunch of letters, or letters and numbers, strung together in an order. It can also be indexed, like we did above with the list and the tuple. Let's look at our favorite phrase.

In [6]:
my_string = 'Hello, world.'
print('The fifth letter in the phrase is', my_string[4])
print('The first four letters are', my_string[0:4])
The fifth letter in the phrase is o
The first four letters are Hell

IMPORTANT! Python interprets [0:4] as [0,4), so be careful when pulling out strings of specific length. Pulling small strings out of our larger string is called slicing, and can also be done with lists and tuples. This can be very powerful, as we can even pull out pieces at regular intervals.

In [7]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

a = my_list[0:5]

b = my_list[5:]

c = my_list[0:10:2]

d = my_list[1:10:3]
[1, 2, 3, 4, 5]
[6, 7, 8, 9, 10]
[1, 3, 5, 7, 9]
[2, 5, 8]

Make sure you notice how we create lists c and d. We select the entries in the list from position 0 to 9, selecting every 2 or 3, respectively.

Objects, types, and methods

Python is object-oriented, and all values in a program are objects. An object consists of an identity (where it is stored in memory), a type (a definition of how the object is represented), and data (the value of the object). An object of a given type can have various methods that operate on the data of the object. How do we keep track of what type our variables are? Fortunately, Python has a function for this, called type(). Let's try it out. First, we'll create some new objects.

In [8]:
a = 4
b = 4.6

my_list = [1, 3.49, 'bi1x']

print('the type of a is', type(a))
print('the type of b is', type(b))
print('the type of my_list is', type(my_list))
print('the type of my_list[0] is', type(my_list[0]))
print('the type of my_list[1] is', type(my_list[1]))
print('the type of my_list[2] is', type(my_list[2]))
the type of a is <class 'int'>
the type of b is <class 'float'>
the type of my_list is <class 'list'>
the type of my_list[0] is <class 'int'>
the type of my_list[1] is <class 'float'>
the type of my_list[2] is <class 'str'>

What is most important to notice here is that my_list is a list, and that it can contain many different objects, from numbers to strings.

The data are the numbers and values that you associate with your variable.

Finally, objects have methods that can perform operations on the data. A method is called similarly to a function. This is best seen by example.

In [9]:
my_list = [1 , 5 , 4 , 13 , 3 , 5 , 19 , 31 , 3 , 1 , 17]

print("the number of 5's in the list is", my_list.count(5))
print("the number of 4's in the list is", my_list.count(4))

# Sort the list in place

# Tack on a string to the end of the list
the number of 5's in the list is 2
the number of 4's in the list is 1
[1, 1, 3, 3, 4, 5, 5, 13, 17, 19, 31]
[1, 1, 3, 3, 4, 5, 5, 13, 17, 19, 31, 'bi1x']

As you can see, a list object has several methods including count() and sort(). They are called just like functions, with arguments in parentheses. The name of the method comes after the object name followed by a dot (.).

The count() method takes a single argument and returns the number of times that argument appears in the list. The sort function takes no arguments (but still requires open and closed parentheses to be called), and sorts the list in place. Note that my_list has been changed and is now sorted. We also use the method append(), which adds another element to the end of a list.

Within JupyterLab, you can see what methods and data are available by entering the object name followed by a dot (.), and then pressing tab. Try it!


It is common that scientific software packages such as Matlab and Mathematica are optimized for a flavor of scientific computing (such as matrix computation in the case of Matlab) and are rather full-featured. On the other hand, Python is a generic programming language. It was not specifically designed to do scientific computing. So, plain old Python is very limited in scientific computing capability.

However, Python is very flexible and allows use of packages. A package contains classes, functions, attributes, data types, etc., beyond what is built in to the core language. In order to use a module, you need to import it to make it available for use. So, as we begin working on data analysis and simulations in Bi 1x, we need to import modules we will use.

It is important to note that until we import a module, its capacities are not available. Keep that in mind: Plain old Python won't do much until you import a package.

Let's now import one of the major workhorses of our class, NumPy!

In [10]:
# Importing is done with the import statement
import numpy as np

# We now have access to some handy things, i.e. np.pi
print('circumference / diameter = ', np.pi)
print('cos(pi) = ', np.cos(np.pi))
circumference / diameter =  3.141592653589793
cos(pi) =  -1.0

Notice that we used the import <module> as <abbrev> construction. This enabled us to abbreviate the name of the module so we do not have to type numpy each time. Also, notice that to access the (approximate) value of $\pi$ in the numpy module, we prefaced the name of the attribute (pi) with the module name followed by a dot (np.). This is generally how you access attributes in modules.

We're already getting dangerous with Python. So dangerous, in fact, that we'll write our own function!

Writing your own function (and learning a bunch of syntax!)

We will create a function that finds the roots of the quadratic equation

\begin{align} ax^2 + bx + c = 0. \end{align}

To do this, we will first write a function to compute the discriminant and then one to return the roots.

In [11]:
def discriminant(a, b, c):
    Returns the discriminant of a quadratic polynomial
    a * x**2 + b * x + c = 0.    
    return b**2 - 4.0 * a * c

def roots(a, b, c):
    Returns the roots of the quadratic equation
    a * x**2 + b * x + c = 0.
    delta = discriminant(a, b, c)
    root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
    root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
    return root_1, root_2

There is a whole bunch of syntax in there to point out.

  • A function is defined with the def statement. It has the function prototype, followed by a colon.
  • Indentation in Python matters! Everything indented after the def statement is part of the function. Once the indentation goes back to the level of the def statement, you are no longer in the function.
  • The return statement is used to return the result of a function. If multiple objects are returned, they are separated by commas.
  • The text within triple quotes are doc strings. They say what the function does. These are essential for people to know what your code is doing.

Now, let's test our new function out!

In [12]:
# Python has nifty syntax for making multiple definitions on the same line
a, b, c = 3.0, -7.0, -6.0

# Call the function and print the result.
root_1, root_2 = roots(a, b, c)
print('roots:', root_1, root_2)
roots: 3.0 -0.6666666666666666

Very nice! Now, let's try another example. This one might have a problem....

In [13]:
# Specify a, b, and c that will give imaginary roots
a, b, c = 1.0, -2.0, 2.0

# Call the function and print the result
root_1, root_2 = roots(a, b, c)
print('roots:', root_1, root_2)
roots: nan nan
/Users/bois/opt/anaconda3/lib/python3.7/site-packages/ RuntimeWarning: invalid value encountered in sqrt
  from ipykernel import kernelapp as app
/Users/bois/opt/anaconda3/lib/python3.7/site-packages/ RuntimeWarning: invalid value encountered in sqrt

Oh no! It gave us nan, which means "not a number," as our roots. It also gave some warning that it encountered invalid (negative) arguments for the np.sqrt() function. The roots should be $1 \pm i$, where $i = \sqrt{-1}$. We will use this opportunity to introduce Python's control flow, starting with an if statement.

Control flow: the if statement

We will decree that our quadratic equation solver only handles real roots, so it will raise an exception if an imaginary root is encountered. So, we modify the roots() function as follows.

In [14]:
def roots(a, b, c):
    Returns the roots of the quadratic equation
    a * x**2 + b * x + c = 0.
    delta = discriminant(a, b, c)

    if delta < 0.0:
        raise RuntimeError('Imaginary roots!  We only do real roots!')
        root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
        root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
        return root_1, root_2

We have now exposed the syntax for a Python if statement. The conditional expression ends with a colon, just like the def statement. Note the indentation of blocks of code after the conditionals. (We actually did not need the else statement, because the program would just continue without the exception, but I left it there for illustrative purposes. It is actually preferred not to have the else statement.)

Let's try out our new function.

In [15]:
# Pass in parameters that will give imaginary roots (use qd.roots)
a, b, c = 1.0, -2.0, 2.0
root_1, root_2 = roots(a, b, c)
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-4ca9ec0d337f> in <module>
      1 # Pass in parameters that will give imaginary roots (use qd.roots)
      2 a, b, c = 1.0, -2.0, 2.0
----> 3 root_1, root_2 = roots(a, b, c)

<ipython-input-14-dbffeb39379d> in roots(a, b, c)
      8     if delta < 0.0:
----> 9         raise RuntimeError('Imaginary roots!  We only do real roots!')
     10     else:
     11         root_1 = (-b + np.sqrt(delta)) / (2.0 * a)

RuntimeError: Imaginary roots!  We only do real roots!

This threw the appropriate exception.


If we want to do something many times, we can use a for loop. This is again best learned by example. Let's make a function that counts the number of times a substring is present in a sequence of DNA.

In [16]:
def n_substring(seq, substr):
    Given a sequence seq , returns the number of occurrances of substring.
    # First make sure the length of substring is shorter than seq.
    if len(substr) > len(seq) :
        return 0

    # We loop through the sequence to check for matches
    num_substr = 0
    for i in range(0, len(seq)-len(substr)+1):
        if seq [i:i+len(substr)] == substr:
            num_substr += 1
    # We are done looping now , so return the number of subsequences
    return num_substr

Let's see how it works!

In [17]:
# Specify sequences
substr = 'GAT'

# Call function to get number of GAT substrings
n_gat = n_substring(seq, substr)

# Print the result
print('There are %d GATs in the sequence.' % n_gat)
There are 3 GATs in the sequence.

Now that we know it works, let's look at how the loop was constructed. In the statement at the beginning of the loop, we use the function range() to define an iterator. Calling range(n) creates an object akin to a list of integers from 0 to n-1. The for statement says that the variable i will successively take the values of the iterator.

Keyword arguments

Before concluding our quick trip through the very basics of Python and on to NumPy, I want to show a very handy tool in Python, keyword arguments. Before, when we defined a function, we specified its arguments as variable names separated by commas in the def statement. We can also specify keyword arguments. Here is an example from our quadratic equation solver.

In [18]:
def roots(a, b, c, print_discriminant=False, message_to_the_world='Bi 1x rules'):
    Returns the roots of the quadratic equation
    a * x**2 + b * x + c = 0.
    If print_discriminant is True, prints discriminant to screen
    delta = discriminant(a, b, c)
    if print_discriminant: 
        print('discriminant =', delta)
    if message_to_the_world is not None: 
        print('\n' + '*'*len(message_to_the_world))
        print('*'*len(message_to_the_world) + '\n')

    if delta < 0.0:
        raise ValueError('Imaginary roots!  We only do real roots!')
        root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
        root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
        return root_1, root_2

The function quadratic_roots now has two keyword arguments. They are signified by the equals sign.

The function has three required arguments, a, b, c. If the keyword arguments are omitted in the function call, they take on the default values, as specified in the function definition.

In the example, the default for print_discriminant is False and the default for message_to_the_world is 'Bi 1x rules!'. Furthermore, ordering of keyword arguments does not matter when calling the function. They are called in the function similarly to the way they are defined in the function definition.

Let's try it!

In [19]:
a, b, c = 3.0, -7.0, -6.0

root_1, root_2 = roots(a, b, c)

print('roots:', root_1, root_2)
Bi 1x rules

roots: 3.0 -0.6666666666666666

Since we did not specify the keyword arguments, the defaults were used. We can specify other values.

In [20]:
root_1, root_2 = roots(
    message_to_the_world="Bi 1x TAs are the best!"

print("roots:", root_1, root_2)
discriminant = 121.0

Bi 1x TAs are the best!

roots: 3.0 -0.6666666666666666

In the above function call to roots(), notice that I put the arguments on separate lines. This is valid Python syntax and, when used appropriately, can make your code more readable.

Intro to NumPy, SciPy, and Bokeh

Here are some words to live by: If you are trying to do a task that you think might be common, it's probably part of NumPy or some other package. Look, or ask Google, first. In this case, NumPy has a function called roots() that computes the roots of a polynomial. To figure out how to use it, we can either look at the doc string, or look in the NumPy and SciPy documentation online (the documentation for np.roots() is available here). To look at the doc string, you can enter the following in a code cell:

In [21]:
Signature: np.roots(p)
Return the roots of a polynomial with coefficients given in p.

The values in the rank-1 array `p` are coefficients of a polynomial.
If the length of `p` is n+1 then the polynomial is described by::

  p[0] * x**n + p[1] * x**(n-1) + ... + p[n-1]*x + p[n]

p : array_like
    Rank-1 array of polynomial coefficients.

out : ndarray
    An array containing the roots of the polynomial.

    When `p` cannot be converted to a rank-1 array.

See also
poly : Find the coefficients of a polynomial with a given sequence
       of roots.
polyval : Compute polynomial values.
polyfit : Least squares polynomial fit.
poly1d : A one-dimensional polynomial class.

The algorithm relies on computing the eigenvalues of the
companion matrix [1]_.

.. [1] R. A. Horn & C. R. Johnson, *Matrix Analysis*.  Cambridge, UK:
    Cambridge University Press, 1999, pp. 146-7.

>>> coeff = [3.2, 2, 1]
>>> np.roots(coeff)
array([-0.3125+0.46351241j, -0.3125-0.46351241j])
File:      ~/opt/anaconda3/lib/python3.7/site-packages/numpy/lib/
Type:      function

We see that we need to pass the coefficients of the polynomial we would like the roots of using an "array_like" object. We will discuss what this means in a moment, but for now, we will just use a list to specify our coefficients and call the np.roots() function.

In [22]:
# Define the coefficients in a list (using square brackets)
coeffs = [3.0, -7.0, -6.0]

# Call np.roots.  It returns an np.ndarray with the roots
roots = np.roots(coeffs)
print('Roots for (a, b, c) = (3, -7, -6):', roots)

# It even handles complex roots!
roots = np.roots([1.0, -2.0, 2.0])
print('Roots for (a, b, c) = (1, -2, 2): ', roots)
Roots for (a, b, c) = (3, -7, -6): [ 3.         -0.66666667]
Roots for (a, b, c) = (1, -2, 2):  [1.+1.j 1.-1.j]

Some array_like data types

In the previous example, we used a list as an array_like data type. Python has several native data types. We have already mentioned ints and floats. We just were not very explicit about it. Python's native array_like data types are lists and tuples. Internally, these things are converted into NumPy arrays, which is the most often used array_like data type we will use.

The NumPy array: maybe your new best friend

Lists and tuples can be useful, but for many applications in data analysis, the np.ndarray, which we will colloquially call a "NumPy array," is most often used. They are created using the np.array function with a list or tuple as an argument. Once created, we can do all sorts of things with them. Let's play!

Let's make some arrays to see what they look like:

In [23]:
array_1 = np.array([1, 2, 3, 4])
print('array 1:')
print(array_1, '\n')

array_2 = np.array([[1, 2], [1, 2]])
print(array_2, '\n')

array_3 = np.array([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
array 1:
[1 2 3 4] 

[[1 2]
 [1 2]] 

[[1 2 3]
 [1 2 3]
 [1 2 3]]

Sometimes you want an array of all zero values, and that can also be done with numpy.

In [24]:
zero_array = np.zeros((3,4))
print(zero_array, '\n')
print('the dimesions are', zero_array.shape)
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]] 

the dimesions are (3, 4)

Now let's see how we can do operations on some arrays.

In [25]:
a = np.array([1, 2, 3])
b = np.array([4.0, 5.0, 6.0])

# Arithmetic operations are done elementwise
print('a:      ', a)
print('b:      ', b)
print('a + b:  ', a + b)
print('a * b:  ', a * b)
print('1.0 + a:', 1.0 + a)
print('a**2:   ', a**2)
print('b**a:   ', b**a)
a:       [1 2 3]
b:       [4. 5. 6.]
a + b:   [5. 7. 9.]
a * b:   [ 4. 10. 18.]
1.0 + a: [2. 3. 4.]
a**2:    [1 4 9]
b**a:    [  4.  25. 216.]

We can check the data type of our matrix.

In [26]:

We can also change the type of the entries of our arrays, for example, from integers to floating point.

In [27]:
array_a = a.astype(float)

array_b = b.astype(int)
[1. 2. 3.]
[4 5 6]

Slicing is also intuitive.

In [28]:
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])

print(a, "\n")
print(a[0:3, 2:3])
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]] 

[[ 3]
 [ 7]
In [29]:
# view second column
array([ 2,  6, 10, 14])
In [30]:
# see all entries below the value of 10
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Using slices, we can reassign values to the entries in a NumPy array. For example, say we wanted the third row to be all zeros.

In [31]:
a[2, :] = np.zeros(a[2 ,:].shape)
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 0  0  0  0]
 [13 14 15 16]]

We can also reshape arrays.

In [32]:
a = a.reshape (2 ,8)
print(a, '\n')  

a = a.reshape(4,4)
[[ 1  2  3  4  5  6  7  8]
 [ 0  0  0  0 13 14 15 16]] 

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 0  0  0  0]
 [13 14 15 16]]

NumPy functions

Here are some useful NumPy functions we think you might want to use! Go through these one-by-one in separate cells to see what they do.

In [33]:
# create evenly spaced points
np.linspace(0, 1, 10)

# matrix or vector dot products, a)

# concatenate in row dimensions
np.concatenate((a, a))

# concatenate in the column dimension
np.concatenate((a, a), axis=1)

# transpose (the result of this final operation will be displayed)
array([[ 1,  5,  0, 13],
       [ 2,  6,  0, 14],
       [ 3,  7,  0, 15],
       [ 4,  8,  0, 16]])

Bokeh: our primary plotting tool

Bokeh is a tool for generating interactive plots that can be views in a browser (and therefore also Jupyter notebooks). Its syntax is intuitive, and best learned by example. First, we need to import packages and set things up so that the graphics can be displayed in the notebook.

In [34]:
# These are the necessary submodules
import bokeh.plotting

# This is necesssary to display in the notebook
Loading BokehJS ...

Now that we have Bokeh ready, we can make some plots.

In [35]:
# Make an x-variable for plotting
x = np.linspace(0, 2 * np.pi, 200)

# This is a nice function
y_1 = np.exp(np.sin(x))

# We can make another one
y_2 = np.exp(np.cos(x))

# We can make some random data to plot as well
x_rand = np.random.rand(20) * 2 * np.pi
y_rand = np.random.rand(20) * 3.0

# Set up plot
p = bokeh.plotting.figure(
    frame_width=500, frame_height=300, x_axis_label="x", y_axis_label="y"

# Populate the plot with glyphs
p.line(x, y_1, line_width=2)
p.line(x, y_2, line_width=2, color="orange"), y_rand, color="black")

# Show the plot

To export the figure for insertion into a document, you can click on the disk image in the Bokeh tools. This will give you a PNG, but for publications, you should use vector graphics. There are ways to save Bokeh plots as vector graphics, but we will not use them here.

Programming style

PEPs (Python Enhancement Proposals) document suggestions and guidelines for the Python language, its development, etc. My personal favorite is PEP8, the Style Guide for Python Code. This was largely written by Guido von Rossum, the inventor of Python and its benevolent dictator for life. It details how you should style your code. As Guido says, code is much more often read than written. I strongly urge you to follow PEP8 the best you can. It's a lot to read, so I will highlight important points here. If you follow these guidelines, your code will be much more readable than otherwise. This is particularly useful because you are working in groups and the TAs need to grade your work.

  • Limit line widths to 79 characters (the line break character for Python is \).
  • Use whitespace around low-priority operators. E.g., x = a**2 + b**2 has whitespace around the + operator. Another good example is y = m*x + b.
  • In function calls, use a space after each comma. Use no spaces before and after the equals sign when using keyword arguments. E.g., my_fun(a, b, c=True).
  • Do not use excess space when indexing. E.g., a[i], not a [ i ].
  • Function names should be lowercase, with words separated by underscores as necessary to improve readability.
  • Comment lines should appear immediately before the code they describe. Use in-line comments sparingly.

For me, an easy way to keep my code looking pretty is to use Black. If you have installed black and black cell magic, you can use Black in the Jupyter notebook. In a cell at the beginning of your notebook, execute the following code.

In [36]:
%load_ext blackcellmagic

Then, in any code cell where you want Black to adjust the formatting, put %%black at the top of the cell and execute it. The code in the cell will not be executed; rather, it will be reformatted according to Black's style.

Using NumPy/SciPy to perform a linear regression

As a case study of how we can load in data and do analysis with it, we will consider a data set from one of the classic organisms in biology: Darwin's finches. Peter and Rosemary Grant have been working on the Galápagos island of Daphne for over forty years. During this time, they have collected lots and lots of data about physiological features of finches. A couple years ago, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., Data from: 40 years of evolution. Darwin's finches on Daphne Major Island, Princeton University Press, 2014). They made their data from the book publicly available via the Dryad Digital Repository.

We will investigate their data on the heritability of beak depth (the distance, top to bottom, of a closed beak) in the ground finch Geospiza fortis. The data set consists of the maternal beak depth, the paternal beak depth, and the mean beak depth of their offspring. You can download the data set, which I got from Dryad and tidied up into a CSV file, here. Our goal it so do a linear regression to see how parental beak depth is related to that of the offspring.

First, we will load in the data set. Before attempting to load the data set, it is important to know where it is on your machine. It should be in the same directory from which you launched Jupyter lab. Otherwise, you can enter the full path below. It is in a comma delimited text file. For this particular file, the data are preceded by comments about their content, with each comment line starting with a #. We will use the wonderful package pandas to read in the data.

In [37]:
import pandas as pd

# Load the data
df = pd.read_csv('grant_and_grant_2014.csv', comment='#')

# Take a look
Average offspring beak depth (mm) Paternal beak depth (mm) Maternal beak depth (mm)
0 10.70 10.90 9.3
1 9.78 10.70 8.4
2 9.48 10.70 8.1
3 9.60 10.70 9.8
4 10.27 9.85 10.4

This pandas DataFrame has 413 measurements with three columns. We can index the columns to get Numpy arrays containing the data we want.

In [38]:
offspring_bd = df['Average offspring beak depth (mm)'].values
paternal_bd = df['Paternal beak depth (mm)'].values
maternal_bd = df['Maternal beak depth (mm)'].values

We want to plot mean offspring beak depth vs. mean parental beak depth, so we need to compute the mean parental beak depth. Conveniently, we do not have to write a loop to compute this for each parent pair. NumPy arrays allow for convenient elementwise calculation in a single command.

In [39]:
parental_bd = (maternal_bd + paternal_bd) / 2

Now we have the arrays we need to make our plot.

In [40]:
p = bokeh.plotting.figure(
    x_axis_label="parental beak depth (mm)",
    y_axis_label="offpsring beak depth (mm)",
), offspring_bd, alpha=0.3)

I chose to use the alpha keyword argument in the function because it helps visualize overlapping data points. We see that parental and offspring beak depth are correlated, as we might expect.

Let's perform a linear regression to get the slope and intercept. To perform the linear regression, we will use scipy.optimize.curve_fit(). This function takes a set of data and returns the parameters of the model function that best describe the data. Its first input is the model function (which we have defined as linear); next is the $x$-data (in our case parental beak depth, parental_bd); the third input is the $y$-data (offspring beak depth, offspring_bd). The last piece that we will need to address is a best guess for our parameters. In this case, we have to provide two values, the slope and the intercept. curve_fit() works by finding the curve that minimizes the distance of every point from the curve (in our case, this will be minimizing the vertical distance between measured data points and the line describing the data).

More explicitly, scipy.optimize.curve_fit() will take the sum of the squares of the vertical differences of each data point to the model function and try to minimize this value by adjusting the model parameters. For nonlinear fits, you will need to provide reasonable guesses for the best-fit parameters, as there may be many local minima for minimizing the error. In the case of linear regression, we do not need to be too careful, as there will actually be one minimum that best fits the data. For more on the scipy.optimize.curve_fit() function, please read the documentation.

When we run scipy.optimize.curve_fit(), we will get two arrays out. The first array contains the optimal set of parameters for the curve fit. The second array is a 2d array with the covariance of the optimal parameters. For this course, we will ignore the second array and will only use the first array, which, for linear regressions, will give us the slope and intercept.

In [41]:
import scipy.optimize

# Define linear function
def linear_fun(x, slope, intercept):
    return slope * x + intercept

# Compute the curve fit (Guess is unit slope and zero intercept)
popt, _ = scipy.optimize.curve_fit(linear_fun, parental_bd, offspring_bd, p0=[1, 0])

# Parse the results
slope, intercept = popt

# Print the results
print("slope =", slope)
print("intercept =", intercept, "mm")
slope = 0.7229051885446406
intercept = 2.448418309793462 mm

So, we got our values for the slope and intercept. We can plot the best fit line along with the data.

In [42]:
# Plot best fit line
x = np.array([7, 12])
y = linear_fun(x, slope, intercept)

p.line(x, y, line_width=2, color='orange')

This does not look exactly like what we would draw by eye, but this is because of overlapping data points. This is, indeed, the best fit line.

Including images and converting to HTML

If you want to include images in your notebook, you can do that directly in a Markdown cell using HTML tags. For example:

<img src='my_image.png'>

You may have scans of equations you want to include, and you can use this technique to do it. However, if you are displaying images from your experiments, you should load them in using skimage and display them as shown in the next tutorial on image processing.

For all of your assignment, you will need to submit a Jupyter notebook and the notebook conversion of the notebook. To convert to HTML, click through the following menu option.

File -> Export Notebook As... -> HTML

Note that if you are displaying images using HTML tags, you must also include the images with your submission. The best way to do the submission is to zip everything together in a single ZIP file.


This concludes our introductory tour of Python with some NumPy, SciPy, Pandas, and Bokeh thrown in for good measure. There is still much to learn, but you will pick up more and more tricks and syntax as we go along.

For the next tutorial, we will use Python to do some image processing. That is, we will write code to extract data of interest from images. As you get more and more proficient, coding (particularly in Python, in my opinion) will be more and more empowering and FUN!

Computing environment

In [43]:
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,jupyterlab
CPython 3.7.7
IPython 7.13.0

numpy 1.18.1
scipy 1.4.1
pandas 0.24.2
bokeh 2.0.1
jupyterlab 1.2.6