#grayscale");filter:gray;-webkit-filter:grayscale(100%);}.bk-root .bk-logo-small{width:20px;height:20px;background-image:url();}.bk-root .bk-logo-notebook{display:inline-block;vertical-align:middle;margin-right:5px;}.rendered_html .bk-root .bk-tooltip table,.rendered_html .bk-root .bk-tooltip tr,.rendered_html .bk-root .bk-tooltip th,.rendered_html .bk-root .bk-tooltip td{border:none;padding:1px;}.bk-root .bk-toolbar-hidden{visibility:hidden;opacity:0;transition:visibility 0.3s linear, opacity 0.3s linear;}.bk-root .bk-toolbar{display:flex;flex-wrap:nowrap;align-items:center;user-select:none;-ms-user-select:none;-moz-user-select:none;-webkit-user-select:none;}.bk-root .bk-toolbar .bk-logo{flex-shrink:0;}.bk-root .bk-toolbar.bk-above,.bk-root .bk-toolbar.bk-below{flex-direction:row;justify-content:flex-end;}.bk-root .bk-toolbar.bk-above .bk-logo,.bk-root .bk-toolbar.bk-below .bk-logo{order:1;margin-left:5px;margin-right:0px;}.bk-root .bk-toolbar.bk-left,.bk-root .bk-toolbar.bk-right{flex-direction:column;justify-content:flex-start;}.bk-root .bk-toolbar.bk-left .bk-logo,.bk-root .bk-toolbar.bk-right .bk-logo{order:0;margin-bottom:5px;margin-top:0px;}.bk-root .bk-toolbar-button{width:30px;height:30px;cursor:pointer;background-size:60% 60%;background-origin:border-box;background-color:transparent;background-repeat:no-repeat;background-position:center center;}.bk-root .bk-toolbar-button:hover,.bk-root .bk-tool-overflow:hover{background-color:rgba(192, 192, 192, 0.15);}.bk-root .bk-toolbar-button:focus,.bk-root .bk-tool-overflow:focus,.bk-root .bk-toolbar-button:focus-visible,.bk-root .bk-tool-overflow:focus-visible{outline:1px dotted #26aae1;outline-offset:-1px;}.bk-root .bk-toolbar-button::-moz-focus-inner,.bk-root .bk-tool-overflow::-moz-focus-inner{border:0;}.bk-root .bk-above .bk-toolbar-button{border-bottom:2px solid transparent;}.bk-root .bk-above .bk-toolbar-button.bk-active{border-bottom-color:#26aae1;}.bk-root .bk-below .bk-toolbar-button{border-top:2px solid transparent;}.bk-root .bk-below .bk-toolbar-button.bk-active{border-top-color:#26aae1;}.bk-root .bk-right .bk-toolbar-button{border-left:2px solid transparent;}.bk-root .bk-right .bk-toolbar-button.bk-active{border-left-color:#26aae1;}.bk-root .bk-left .bk-toolbar-button{border-right:2px solid transparent;}.bk-root .bk-left .bk-toolbar-button.bk-active{border-right-color:#26aae1;}.bk-root .bk-divider{content:" ";display:inline-block;background-color:lightgray;}.bk-root .bk-above .bk-divider,.bk-root .bk-below .bk-divider{height:10px;width:1px;}.bk-root .bk-left .bk-divider,.bk-root .bk-right .bk-divider{height:1px;width:10px;}.bk-root .bk-tool-overflow{color:gray;display:flex;align-items:center;}.bk-root .bk-above .bk-tool-overflow,.bk-root .bk-below .bk-tool-overflow,.bk-root .bk-horizontal .bk-tool-overflow{width:15px;height:30px;flex-direction:row;}.bk-root .bk-left .bk-tool-overflow,.bk-root .bk-right .bk-tool-overflow,.bk-root .bk-vertical .bk-tool-overflow{width:30px;height:15px;flex-direction:column;}.bk-root .bk-tool-icon-copy-to-clipboard{background-image:url("");}.bk-root .bk-tool-icon-replace-mode{background-image:url("");}.bk-root .bk-tool-icon-append-mode{background-image:url("");}.bk-root .bk-tool-icon-intersect-mode{background-image:url("");}.bk-root .bk-tool-icon-subtract-mode{background-image:url("");}.bk-root .bk-tool-icon-clear-selection{background-image:url("");}.bk-root .bk-tool-icon-box-select{background-image:url("");}.bk-root .bk-tool-icon-box-zoom{background-image:url("");}.bk-root .bk-tool-icon-zoom-in{background-image:url("");}.bk-root .bk-tool-icon-zoom-out{background-image:url("");}.bk-root .bk-tool-icon-help{background-image:url("");}.bk-root .bk-tool-icon-hover{background-image:url("");}.bk-root .bk-tool-icon-crosshair{background-image:url("");}.bk-root .bk-tool-icon-lasso-select{background-image:url("");}.bk-root .bk-tool-icon-pan{background-image:url("");}.bk-root .bk-tool-icon-xpan{background-image:url("");}.bk-root .bk-tool-icon-ypan{background-image:url("");}.bk-root .bk-tool-icon-range{background-image:url("");}.bk-root .bk-tool-icon-polygon-select{background-image:url("");}.bk-root .bk-tool-icon-redo{background-image:url("");}.bk-root .bk-tool-icon-reset{background-image:url("");}.bk-root .bk-tool-icon-save{background-image:url("");}.bk-root .bk-tool-icon-tap-select{background-image:url("");}.bk-root .bk-tool-icon-undo{background-image:url("");}.bk-root .bk-tool-icon-wheel-pan{background-image:url("");}.bk-root .bk-tool-icon-wheel-zoom{background-image:url("");}.bk-root .bk-tool-icon-box-edit{background-image:url("");}.bk-root .bk-tool-icon-freehand-draw{background-image:url("");}.bk-root .bk-tool-icon-poly-draw{background-image:url("");}.bk-root .bk-tool-icon-point-draw{background-image:url("");}.bk-root .bk-tool-icon-poly-edit{background-image:url("");}.bk-root .bk-tool-icon-line-edit{background-image:url("");}.bk-root .bk-menu-icon{width:28px;height:28px;background-size:60%;background-color:transparent;background-repeat:no-repeat;background-position:center center;}.bk-root .bk-context-menu{position:absolute;display:inline-flex;flex-wrap:nowrap;user-select:none;-ms-user-select:none;-moz-user-select:none;-webkit-user-select:none;width:auto;height:auto;z-index:100;cursor:pointer;font-size:12px;background-color:#fff;border:1px solid #ccc;border-radius:4px;box-shadow:0 6px 12px rgba(0, 0, 0, 0.175);}.bk-root .bk-context-menu.bk-horizontal{flex-direction:row;}.bk-root .bk-context-menu.bk-vertical{flex-direction:column;}.bk-root .bk-context-menu > .bk-divider{cursor:default;overflow:hidden;background-color:#e5e5e5;}.bk-root .bk-context-menu.bk-horizontal > .bk-divider{width:1px;margin:5px 0;}.bk-root .bk-context-menu.bk-vertical > .bk-divider{height:1px;margin:0 5px;}.bk-root .bk-context-menu > :not(.bk-divider){border:1px solid transparent;}.bk-root .bk-context-menu > :not(.bk-divider).bk-active{border-color:#26aae1;}.bk-root .bk-context-menu > :not(.bk-divider):hover{background-color:#f9f9f9;}.bk-root .bk-context-menu > :not(.bk-divider):focus,.bk-root .bk-context-menu > :not(.bk-divider):focus-visible{outline:1px dotted #26aae1;outline-offset:-1px;}.bk-root .bk-context-menu > :not(.bk-divider)::-moz-focus-inner{border:0;}.bk-root .bk-context-menu.bk-horizontal > :not(.bk-divider):first-child{border-top-left-radius:4px;border-bottom-left-radius:4px;}.bk-root .bk-context-menu.bk-horizontal > :not(.bk-divider):last-child{border-top-right-radius:4px;border-bottom-right-radius:4px;}.bk-root .bk-context-menu.bk-vertical > :not(.bk-divider):first-child{border-top-left-radius:4px;border-top-right-radius:4px;}.bk-root .bk-context-menu.bk-vertical > :not(.bk-divider):last-child{border-bottom-left-radius:4px;border-bottom-right-radius:4px;}.bk-root .bk-menu{position:absolute;left:0;width:100%;z-index:100;cursor:pointer;font-size:12px;background-color:#fff;border:1px solid #ccc;border-radius:4px;box-shadow:0 6px 12px rgba(0, 0, 0, 0.175);}.bk-root .bk-menu.bk-above{bottom:100%;}.bk-root .bk-menu.bk-below{top:100%;}.bk-root .bk-menu > .bk-divider{height:1px;margin:7.5px 0;overflow:hidden;background-color:#e5e5e5;}.bk-root .bk-menu > :not(.bk-divider){padding:6px 12px;}.bk-root .bk-menu > :not(.bk-divider):hover,.bk-root .bk-menu > :not(.bk-divider).bk-active{background-color:#e6e6e6;}.bk-root .bk-caret{display:inline-block;vertical-align:middle;width:0;height:0;margin:0 5px;}.bk-root .bk-caret.bk-down{border-top:4px solid;}.bk-root .bk-caret.bk-up{border-bottom:4px solid;}.bk-root .bk-caret.bk-down,.bk-root .bk-caret.bk-up{border-right:4px solid transparent;border-left:4px solid transparent;}.bk-root .bk-caret.bk-left{border-right:4px solid;}.bk-root .bk-caret.bk-right{border-left:4px solid;}.bk-root .bk-caret.bk-left,.bk-root .bk-caret.bk-right{border-top:4px solid transparent;border-bottom:4px solid transparent;}
(c) 2023 Justin Bois. This work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.
This document was prepared at Caltech with support financial support from the Donna and Benjamin M. Rosen Bioengineering Center.
This tutorial was generated from an Jupyter notebook. You can download the notebook here.
You should have already installed a Python distribution using Anaconda. Anaconda contains most of what we need to do scientific computing with Python. At the most basic level, it has Python 3.8. It contains other packages we will make heavy use of, the three most important ones being NumPy, Bokeh, and Jupyter. We will also make heavy use of and scikit-image throughout the course. Importantly, it also has JupyterLab, which we will use to do all of our assignments in Bi 1x.
In this tutorial, we will first learn some of the basics of the Python programming language at the same time exploring the properties of NumPy's very useful (and as we'll see, ubiquitous) ndarray
data structure, which we will refer to as a "NumPy array." Finally, we'll load some data and use Bokeh to generate plots.
You should have already learned how to launch a Jupyter notebook for JupyterLab from the tutorial on setting up your environment.
As a new programmer, your first task is to greet the world with your new set of skills! We'll start by printing "Hello, world.
" In a code cell in your Jupyter notebook, type the following:
print('Hello, world.')
Hello, world.
Do you see "Hello, world.
" printed to the screen right below your command? The Python interpreter is taking your input and printing it out in the output of the cell in your Jupyter notebook.
Here we see the syntax for function calls in Python. Function arguments are enclosed in parentheses. We also learned another important piece of Python syntax. Strings are enclosed in single (or double) quotes.
Now try the following:
# This prints hello world
print ('Hello, world.')
Hello, world.
Notice how even though you wrote text above the print
statement, only "Hello, world.
" is printed. The #
starts a comment; anything after the #
, will not be read or interpreted by Python. This is how you add notes to your program about what certain lines do. Including comments in your code is essential so that other people can read your code and understand what you are trying to accomplish.
To assign a variable in python, we use an =
, just like you would expect from any math class. Arithmetic operations are also as expected: +
, -
, *
, /
.
Try the following:
a = 3
b = a**3
c = (b + 2*a) / 2
d = (b + 2*a) // 2
print('a is', a)
print('b is', b)
print('c is', c)
print ('d is', d)
a is 3 b is 27 c is 16.5 d is 16
In the assignment of the variable b
, we used the **
operator. This means "raise to the power of." In the assignment of the variable d
, we used the //
operator. This does integer division, returning the floor of the result of the division as an integer.
A list is a mutable array, meaning it can be edited. Let's explore below! Notice that a list is created using []
.
my_list = [1,2,3,4]
print('my_list is', my_list)
my_list[0] = 3.14
print('my_list is now', my_list)
print('the last element in my_list is', my_list[-1])
my_list is [1, 2, 3, 4] my_list is now [3.14, 2, 3, 4] the last element in my_list is 4
Notice that indexing is done with brackets, []
. Importantly, in Python, indexing starts are zero. Also notice that we can index the last element of a list with -1
.
A tuple is like a list, but it's immutable, meaning that once created, it cannot be changed. Let's try the same thing as above, but with a tuple, created with ()
. But notice that the indexing of the tuple is still denoted with []
.
# Create a tuple and print it
my_tuple = (1,2,3,4)
print('my tuple is', my_tuple)
# This will make Python scream at us because tuples are immutable.
my_tuple[0] = 3.14
my tuple is (1, 2, 3, 4)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[5], line 6 3 print('my tuple is', my_tuple) 5 # This will make Python scream at us because tuples are immutable. ----> 6 my_tuple[0] = 3.14 TypeError: 'tuple' object does not support item assignment
What's the error? The Python interpreter is objecting because it cannot replace the 1
in my_tuple[0]
with 3.14
; that operation is not supported. If you try printing out my_tuple
again, you will see it hasn't been changed.
A string is just a bunch of letters, or letters and numbers, strung together in an order. It can also be indexed, like we did above with the list and the tuple. Let's look at our favorite phrase.
my_string = 'Hello, world.'
print('The fifth letter in the phrase is', my_string[4])
print('The first four letters are', my_string[0:4])
The fifth letter in the phrase is o The first four letters are Hell
IMPORTANT! Python interprets [0:4]
as [0,4), so be careful when pulling out strings of specific length. Pulling small strings out of our larger string is called slicing, and can also be done with lists and tuples. This can be very powerful, as we can even pull out pieces at regular intervals.
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
a = my_list[0:5]
print(a)
b = my_list[5:]
print(b)
c = my_list[0:10:2]
print(c)
d = my_list[1:10:3]
print(d)
[1, 2, 3, 4, 5] [6, 7, 8, 9, 10] [1, 3, 5, 7, 9] [2, 5, 8]
Make sure you notice how we create lists c
and d
. We select the entries in the list from position 0 to 9, selecting every 2 or 3, respectively.
Python is object-oriented, and all values in a program are objects. An object consists of an identity (where it is stored in memory), a type (a definition of how the object is represented), and data (the value of the object). An object of a given type can have various methods that operate on the data of the object. How do we keep track of what type our variables are? Fortunately, Python has a function for this, called type()
. Let's try it out. First, we'll create some new objects.
a = 4
b = 4.6
my_list = [1, 3.49, 'bi1x']
print('the type of a is', type(a))
print('the type of b is', type(b))
print('the type of my_list is', type(my_list))
print('the type of my_list[0] is', type(my_list[0]))
print('the type of my_list[1] is', type(my_list[1]))
print('the type of my_list[2] is', type(my_list[2]))
the type of a is <class 'int'> the type of b is <class 'float'> the type of my_list is <class 'list'> the type of my_list[0] is <class 'int'> the type of my_list[1] is <class 'float'> the type of my_list[2] is <class 'str'>
What is most important to notice here is that my_list
is a list, and that it can contain many different objects, from numbers to strings.
The data are the numbers and values that you associate with your variable.
Finally, objects have methods that can perform operations on the data. A method is called similarly to a function. This is best seen by example.
my_list = [1 , 5 , 4 , 13 , 3 , 5 , 19 , 31 , 3 , 1 , 17]
print("the number of 5's in the list is", my_list.count(5))
print("the number of 4's in the list is", my_list.count(4))
# Sort the list in place
my_list.sort()
print(my_list)
# Tack on a string to the end of the list
my_list.append('bi1x')
print(my_list)
the number of 5's in the list is 2 the number of 4's in the list is 1 [1, 1, 3, 3, 4, 5, 5, 13, 17, 19, 31] [1, 1, 3, 3, 4, 5, 5, 13, 17, 19, 31, 'bi1x']
As you can see, a list
object has several methods including count()
and sort()
. They are called just like functions, with arguments in parentheses. The name of the method comes after the object name followed by a dot (.
).
The count()
method takes a single argument and returns the number of times that argument appears in the list. The sort function takes no arguments (but still requires open and closed parentheses to be called), and sorts the list in place. Note that my_list
has been changed and is now sorted. We also use the method append()
, which adds another element to the end of a list.
Within JupyterLab, you can see what methods and data are available by entering the object name followed by a dot (.
), and then pressing tab. Try it!
It is common that scientific software packages such as Matlab and Mathematica are optimized for a flavor of scientific computing (such as matrix computation in the case of Matlab) and are rather full-featured. On the other hand, Python is a generic programming language. It was not specifically designed to do scientific computing. So, plain old Python is very limited in scientific computing capability.
However, Python is very flexible and allows use of packages. A package contains classes, functions, attributes, data types, etc., beyond what is built in to the core language. In order to use a module, you need to import it to make it available for use. So, as we begin working on data analysis and simulations in Bi 1x, we need to import modules we will use.
It is important to note that until we import a module, its capacities are not available. Keep that in mind: Plain old Python won't do much until you import a package.
Let's now import one of the major workhorses of our class, NumPy!
# Importing is done with the import statement
import numpy as np
# We now have access to some handy things, i.e. np.pi
print('circumference / diameter = ', np.pi)
print('cos(pi) = ', np.cos(np.pi))
circumference / diameter = 3.141592653589793 cos(pi) = -1.0
Notice that we used the import <module> as <abbrev>
construction. This enabled us to abbreviate the name of the module so we do not have to type numpy
each time. Also, notice that to access the (approximate) value of in the numpy
module, we prefaced the name of the attribute (pi
) with the module name followed by a dot (np.
). This is generally how you access attributes in modules.
We're already getting dangerous with Python. So dangerous, in fact, that we'll write our own function!
We will create a function that finds the roots of the quadratic equation
To do this, we will first write a function to compute the discriminant and then one to return the roots.
def discriminant(a, b, c):
"""
Returns the discriminant of a quadratic polynomial
a * x**2 + b * x + c = 0.
"""
return b**2 - 4.0 * a * c
def roots(a, b, c):
"""
Returns the roots of the quadratic equation
a * x**2 + b * x + c = 0.
"""
delta = discriminant(a, b, c)
root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
return root_1, root_2
There is a whole bunch of syntax in there to point out.
def
statement. It has the function prototype, followed by a colon.def
statement is part of the function. Once the indentation goes back to the level of the def
statement, you are no longer in the function.return
statement is used to return the result of a function. If multiple objects are returned, they are separated by commas.Now, let's test our new function out!
# Python has nifty syntax for making multiple definitions on the same line
a, b, c = 3.0, -7.0, -6.0
# Call the function and print the result.
root_1, root_2 = roots(a, b, c)
print('roots:', root_1, root_2)
roots: 3.0 -0.6666666666666666
Very nice! Now, let's try another example. This one might have a problem....
# Specify a, b, and c that will give imaginary roots
a, b, c = 1.0, -2.0, 2.0
# Call the function and print the result
root_1, root_2 = roots(a, b, c)
print('roots:', root_1, root_2)
roots: nan nan
/var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/ipykernel_955/3763812685.py:15: RuntimeWarning: invalid value encountered in sqrt root_1 = (-b + np.sqrt(delta)) / (2.0 * a) /var/folders/j_/c5r9ch0913v3h1w4bdwzm0lh0000gn/T/ipykernel_955/3763812685.py:16: RuntimeWarning: invalid value encountered in sqrt root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
Oh no! It gave us nan
, which means "not a number," as our roots. It also gave some warning that it encountered invalid (negative) arguments for the np.sqrt()
function. The roots should be , where . We will use this opportunity to introduce Python's control flow, starting with an if
statement.
We will decree that our quadratic equation solver only handles real roots, so it will raise an exception if an imaginary root is encountered. So, we modify the roots()
function as follows.
def roots(a, b, c):
"""
Returns the roots of the quadratic equation
a * x**2 + b * x + c = 0.
"""
delta = discriminant(a, b, c)
if delta < 0.0:
raise RuntimeError('Imaginary roots! We only do real roots!')
else:
root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
return root_1, root_2
We have now exposed the syntax for a Python if
statement. The conditional expression ends with a colon, just like the def
statement. Note the indentation of blocks of code after the conditionals. (We actually did not need the else
statement, because the program would just continue without the exception, but I left it there for illustrative purposes. It is actually preferred not to have the else
statement.)
Let's try out our new function.
# Pass in parameters that will give imaginary roots (use qd.roots)
a, b, c = 1.0, -2.0, 2.0
root_1, root_2 = roots(a, b, c)
--------------------------------------------------------------------------- RuntimeError Traceback (most recent call last) Cell In[15], line 3 1 # Pass in parameters that will give imaginary roots (use qd.roots) 2 a, b, c = 1.0, -2.0, 2.0 ----> 3 root_1, root_2 = roots(a, b, c) Cell In[14], line 9, in roots(a, b, c) 6 delta = discriminant(a, b, c) 8 if delta < 0.0: ----> 9 raise RuntimeError('Imaginary roots! We only do real roots!') 10 else: 11 root_1 = (-b + np.sqrt(delta)) / (2.0 * a) RuntimeError: Imaginary roots! We only do real roots!
This threw the appropriate exception.
If we want to do something many times, we can use a for loop. This is again best learned by example. Let's make a function that counts the number of times a substring is present in a sequence of DNA.
def n_substring(seq, substr):
"""
Given a sequence seq , returns the number of occurrances of substring.
"""
# First make sure the length of substring is shorter than seq.
if len(substr) > len(seq) :
return 0
# We loop through the sequence to check for matches
num_substr = 0
for i in range(0, len(seq)-len(substr)+1):
if seq [i:i+len(substr)] == substr:
num_substr += 1
# We are done looping now , so return the number of subsequences
return num_substr
Let's see how it works!
# Specify sequences
seq = 'ACTGTACGATCGAGCGATCGAGCGAGTCATTACGACTGAGATCC'
substr = 'GAT'
# Call function to get number of GAT substrings
n_gat = n_substring(seq, substr)
# Print the result
print('There are %d GATs in the sequence.' % n_gat)
There are 3 GATs in the sequence.
Now that we know it works, let's look at how the loop was constructed. In the statement at the beginning of the loop, we use the function range()
to define an iterator. Calling range(n)
creates an object akin to a list of integers from 0
to n-1
. The for
statement says that the variable i
will successively take the values of the iterator.
Before concluding our quick trip through the very basics of Python and on to NumPy, I want to show a very handy tool in Python, keyword arguments. Before, when we defined a function, we specified its arguments as variable names separated by commas in the def statement. We can also specify keyword arguments. Here is an example from our quadratic equation solver.
def roots(a, b, c, print_discriminant=False, message_to_the_world='Bi 1x rules'):
"""
Returns the roots of the quadratic equation
a * x**2 + b * x + c = 0.
If print_discriminant is True, prints discriminant to screen
"""
delta = discriminant(a, b, c)
if print_discriminant:
print('discriminant =', delta)
if message_to_the_world is not None:
print('\n' + '*'*len(message_to_the_world))
print(message_to_the_world)
print('*'*len(message_to_the_world) + '\n')
if delta < 0.0:
raise ValueError('Imaginary roots! We only do real roots!')
else:
root_1 = (-b + np.sqrt(delta)) / (2.0 * a)
root_2 = (-b - np.sqrt(delta)) / (2.0 * a)
return root_1, root_2
The function quadratic_roots now has two keyword arguments. They are signified by the equals sign.
The function has three required arguments, a
, b
, c
. If the keyword arguments are omitted in the function call, they take on the default values, as specified in the function definition.
In the example, the default for print_discriminant
is False
and the default for message_to_the_world
is 'Bi 1x rules!'
. Furthermore, ordering of keyword arguments does not matter when calling the function. They are called in the function similarly to the way they are defined in the function definition.
Let's try it!
a, b, c = 3.0, -7.0, -6.0
root_1, root_2 = roots(a, b, c)
print('roots:', root_1, root_2)
*********** Bi 1x rules *********** roots: 3.0 -0.6666666666666666
Since we did not specify the keyword arguments, the defaults were used. We can specify other values.
root_1, root_2 = roots(
a,
b,
c,
print_discriminant=True,
message_to_the_world="Bi 1x TAs are the best!"
)
print("roots:", root_1, root_2)
discriminant = 121.0 *********************** Bi 1x TAs are the best! *********************** roots: 3.0 -0.6666666666666666
In the above function call to roots()
, notice that I put the arguments on separate lines. This is valid Python syntax and, when used appropriately, can make your code more readable.
Here are some words to live by: If you are trying to do a task that you think might be common, it's probably part of NumPy or some other package. Look, or ask Google, first. In this case, NumPy has a function called roots()
that computes the roots of a polynomial. To figure out how to use it, we can either look at the doc string, or look in the NumPy and SciPy documentation online (the documentation for np.roots()
is available here). To look at the doc string, you can enter the following in a code cell:
np.roots?
Signature: np.roots(p) Docstring: Return the roots of a polynomial with coefficients given in p. .. note:: This forms part of the old polynomial API. Since version 1.4, the new polynomial API defined in `numpy.polynomial` is preferred. A summary of the differences can be found in the :doc:`transition guide </reference/routines.polynomials>`. The values in the rank-1 array `p` are coefficients of a polynomial. If the length of `p` is n+1 then the polynomial is described by:: p[0] * x**n + p[1] * x**(n-1) + ... + p[n-1]*x + p[n] Parameters ---------- p : array_like Rank-1 array of polynomial coefficients. Returns ------- out : ndarray An array containing the roots of the polynomial. Raises ------ ValueError When `p` cannot be converted to a rank-1 array. See also -------- poly : Find the coefficients of a polynomial with a given sequence of roots. polyval : Compute polynomial values. polyfit : Least squares polynomial fit. poly1d : A one-dimensional polynomial class. Notes ----- The algorithm relies on computing the eigenvalues of the companion matrix [1]_. References ---------- .. [1] R. A. Horn & C. R. Johnson, *Matrix Analysis*. Cambridge, UK: Cambridge University Press, 1999, pp. 146-7. Examples -------- >>> coeff = [3.2, 2, 1] >>> np.roots(coeff) array([-0.3125+0.46351241j, -0.3125-0.46351241j]) File: ~/anaconda3/lib/python3.10/site-packages/numpy/lib/polynomial.py Type: function
We see that we need to pass the coefficients of the polynomial we would like the roots of using an "array_like
" object. We will discuss what this means in a moment, but for now, we will just use a list to specify our coefficients and call the np.roots()
function.
# Define the coefficients in a list (using square brackets)
coeffs = [3.0, -7.0, -6.0]
# Call np.roots. It returns an np.ndarray with the roots
roots = np.roots(coeffs)
print('Roots for (a, b, c) = (3, -7, -6):', roots)
# It even handles complex roots!
roots = np.roots([1.0, -2.0, 2.0])
print('Roots for (a, b, c) = (1, -2, 2): ', roots)
Roots for (a, b, c) = (3, -7, -6): [ 3. -0.66666667] Roots for (a, b, c) = (1, -2, 2): [1.+1.j 1.-1.j]
array_like
data types¶In the previous example, we used a list as an array_like
data type. Python has several native data types. We have already mentioned int
s and float
s. We just were not very explicit about it. Python's native array_like
data types are lists and tuples. Internally, these things are converted into NumPy arrays, which is the most often used array_like
data type we will use.
Lists and tuples can be useful, but for many applications in data analysis, the np.ndarray
, which we will colloquially call a "NumPy array," is most often used. They are created using the np.array
function with a list or tuple as an argument. Once created, we can do all sorts of things with them. Let's play!
Let's make some arrays to see what they look like:
array_1 = np.array([1, 2, 3, 4])
print('array 1:')
print(array_1, '\n')
array_2 = np.array([[1, 2], [1, 2]])
print('array_2:')
print(array_2, '\n')
array_3 = np.array([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
print('array_3:')
print(array_3)
array 1: [1 2 3 4] array_2: [[1 2] [1 2]] array_3: [[1 2 3] [1 2 3] [1 2 3]]
Sometimes you want an array of all zero values, and that can also be done with numpy
.
zero_array = np.zeros((3,4))
print(zero_array, '\n')
print('the dimesions are', zero_array.shape)
[[0. 0. 0. 0.] [0. 0. 0. 0.] [0. 0. 0. 0.]] the dimesions are (3, 4)
Now let's see how we can do operations on some arrays.
a = np.array([1, 2, 3])
b = np.array([4.0, 5.0, 6.0])
# Arithmetic operations are done elementwise
print('a: ', a)
print('b: ', b)
print('a + b: ', a + b)
print('a * b: ', a * b)
print('1.0 + a:', 1.0 + a)
print('a**2: ', a**2)
print('b**a: ', b**a)
a: [1 2 3] b: [4. 5. 6.] a + b: [5. 7. 9.] a * b: [ 4. 10. 18.] 1.0 + a: [2. 3. 4.] a**2: [1 4 9] b**a: [ 4. 25. 216.]
We can check the data type of our matrix.
print(a.dtype)
print(b.dtype)
int64 float64
We can also change the type of the entries of our arrays, for example, from integers to floating point.
array_a = a.astype(float)
print(array_a)
array_b = b.astype(int)
print(array_b)
[1. 2. 3.] [4 5 6]
Slicing is also intuitive.
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
print(a, "\n")
print(a[0:3, 2:3])
[[ 1 2 3 4] [ 5 6 7 8] [ 9 10 11 12] [13 14 15 16]] [[ 3] [ 7] [11]]
# view second column
a[:,1]
array([ 2, 6, 10, 14])
# see all entries below the value of 10
a[a<10]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
Using slices, we can reassign values to the entries in a NumPy array. For example, say we wanted the third row to be all zeros.
a[2, :] = np.zeros(a[2 ,:].shape)
print(a)
[[ 1 2 3 4] [ 5 6 7 8] [ 0 0 0 0] [13 14 15 16]]
We can also reshape arrays.
a = a.reshape (2 ,8)
print(a, '\n')
a = a.reshape(4,4)
print(a)
[[ 1 2 3 4 5 6 7 8] [ 0 0 0 0 13 14 15 16]] [[ 1 2 3 4] [ 5 6 7 8] [ 0 0 0 0] [13 14 15 16]]
Here are some useful NumPy functions we think you might want to use! Go through these one-by-one in separate cells to see what they do.
# create evenly spaced points
np.linspace(0, 1, 10)
# matrix or vector dot products
np.dot(a, a)
# concatenate in row dimensions
np.concatenate((a, a))
# concatenate in the column dimension
np.concatenate((a, a), axis=1)
# transpose (the result of this final operation will be displayed)
np.transpose(a)
array([[ 1, 5, 0, 13], [ 2, 6, 0, 14], [ 3, 7, 0, 15], [ 4, 8, 0, 16]])
Bokeh is a tool for generating interactive plots that can be views in a browser (and therefore also Jupyter notebooks). Its syntax is intuitive, and best learned by example. First, we need to import packages and set things up so that the graphics can be displayed in the notebook.
# These are the necessary submodules
import bokeh.io
import bokeh.plotting
# This is necesssary to display in the notebook
bokeh.io.output_notebook()
Now that we have Bokeh ready, we can make some plots.
# Make an x-variable for plotting
x = np.linspace(0, 2 * np.pi, 200)
# This is a nice function
y_1 = np.exp(np.sin(x))
# We can make another one
y_2 = np.exp(np.cos(x))
# We can make some random data to plot as well
x_rand = np.random.rand(20) * 2 * np.pi
y_rand = np.random.rand(20) * 3.0
# Set up plot
p = bokeh.plotting.figure(
frame_width=500, frame_height=300, x_axis_label="x", y_axis_label="y"
)
# Populate the plot with glyphs
p.line(x, y_1, line_width=2)
p.line(x, y_2, line_width=2, color="orange")
p.circle(x_rand, y_rand, color="black")
# Show the plot
bokeh.io.show(p)
To export the figure for insertion into a document, you can click on the disk image in the Bokeh tools. This will give you a PNG, but for publications, you should use vector graphics. There are ways to save Bokeh plots as vector graphics, but we will not use them here.
PEPs (Python Enhancement Proposals) document suggestions and guidelines for the Python language, its development, etc. My personal favorite is PEP8, the Style Guide for Python Code. This was largely written by Guido von Rossum, the inventor of Python and its benevolent dictator for life. It details how you should style your code. As Guido says, code is much more often read than written. I strongly urge you to follow PEP8 the best you can. It's a lot to read, so I will highlight important points here. If you follow these guidelines, your code will be much more readable than otherwise. This is particularly useful because you are working in groups and the TAs need to grade your work.
\
).x = a**2 + b**2
has whitespace around the +
operator. Another good example is y = m*x + b
.my_fun(a, b, c=True)
.a[i]
, not a [ i ]
.For me, an easy way to keep my code looking pretty is to use Black. If you have installed black and black cell magic, you can use Black in the Jupyter notebook. In a cell at the beginning of your notebook, execute the following code.
%load_ext blackcellmagic
Then, in any code cell where you want Black to adjust the formatting, put %%black
at the top of the cell and execute it. The code in the cell will not be executed; rather, it will be reformatted according to Black's style.
As a case study of how we can load in data and do analysis with it, we will consider a data set from one of the classic organisms in biology: Darwin's finches. Peter and Rosemary Grant have been working on the Galápagos island of Daphne for over forty years. During this time, they have collected lots and lots of data about physiological features of finches. A couple years ago, they published a book with a summary of some of their major results (Grant P. R., Grant B. R., Data from: 40 years of evolution. Darwin's finches on Daphne Major Island, Princeton University Press, 2014). They made their data from the book publicly available via the Dryad Digital Repository.
We will investigate their data on the heritability of beak depth (the distance, top to bottom, of a closed beak) in the ground finch Geospiza fortis. The data set consists of the maternal beak depth, the paternal beak depth, and the mean beak depth of their offspring. You can download the data set, which I got from Dryad and tidied up into a CSV file, here. Our goal it so do a linear regression to see how parental beak depth is related to that of the offspring.
First, we will load in the data set. Before attempting to load the data set, it is important to know where it is on your machine. It should be in the same directory from which you launched Jupyter lab. Otherwise, you can enter the full path below. It is in a comma delimited text file. For this particular file, the data are preceded by comments about their content, with each comment line starting with a #
. We will use the wonderful package pandas to read in the data.
import pandas as pd
# Load the data
df = pd.read_csv('grant_and_grant_2014.csv', comment='#')
# Take a look
df.head()
Average offspring beak depth (mm) | Paternal beak depth (mm) | Maternal beak depth (mm) | |
---|---|---|---|
0 | 10.70 | 10.90 | 9.3 |
1 | 9.78 | 10.70 | 8.4 |
2 | 9.48 | 10.70 | 8.1 |
3 | 9.60 | 10.70 | 9.8 |
4 | 10.27 | 9.85 | 10.4 |
This pandas DataFrame
has 413 measurements with three columns. We can index the columns to get Numpy arrays containing the data we want.
offspring_bd = df['Average offspring beak depth (mm)'].values
paternal_bd = df['Paternal beak depth (mm)'].values
maternal_bd = df['Maternal beak depth (mm)'].values
We want to plot mean offspring beak depth vs. mean parental beak depth, so we need to compute the mean parental beak depth. Conveniently, we do not have to write a loop to compute this for each parent pair. NumPy arrays allow for convenient elementwise calculation in a single command.
parental_bd = (maternal_bd + paternal_bd) / 2
Now we have the arrays we need to make our plot.
p = bokeh.plotting.figure(
frame_width=300,
frame_height=300,
x_axis_label="parental beak depth (mm)",
y_axis_label="offpsring beak depth (mm)",
)
p.circle(parental_bd, offspring_bd, alpha=0.3)
bokeh.io.show(p)
I chose to use the alpha
keyword argument in the p.circle()
function because it helps visualize overlapping data points. We see that parental and offspring beak depth are correlated, as we might expect.
Let's perform a linear regression to get the slope and intercept. To perform the linear regression, we will use scipy.optimize.curve_fit()
. This function takes a set of data and returns the parameters of the model function that best describe the data. Its first input is the model function (which we have defined as linear); next is the -data (in our case parental beak depth, parental_bd
); the third input is the -data (offspring beak depth, offspring_bd
). The last piece that we will need to address is a best guess for our parameters. In this case, we have to provide two values, the slope and the intercept. curve_fit()
works by finding the curve that minimizes the distance of every point from the curve (in our case, this will be minimizing the vertical distance between measured data points and the line describing the data).
More explicitly, scipy.optimize.curve_fit()
will take the sum of the squares of the vertical differences of each data point to the model function and try to minimize this value by adjusting the model parameters. For nonlinear fits, you will need to provide reasonable guesses for the best-fit parameters, as there may be many local minima for minimizing the error. In the case of linear regression, we do not need to be too careful, as there will actually be one minimum that best fits the data. For more on the scipy.optimize.curve_fit()
function, please read the documentation.
When we run scipy.optimize.curve_fit()
, we will get two arrays out. The first array contains the optimal set of parameters for the curve fit. The second array is a 2d array with the covariance of the optimal parameters. For this course, we will ignore the second array and will only use the first array, which, for linear regressions, will give us the slope and intercept.
import scipy.optimize
# Define linear function
def linear_fun(x, slope, intercept):
return slope * x + intercept
# Compute the curve fit (Guess is unit slope and zero intercept)
popt, _ = scipy.optimize.curve_fit(linear_fun, parental_bd, offspring_bd, p0=[1, 0])
# Parse the results
slope, intercept = popt
# Print the results
print("slope =", slope)
print("intercept =", intercept, "mm")
slope = 0.7229051885446406 intercept = 2.448418309793462 mm
So, we got our values for the slope and intercept. We can plot the best fit line along with the data.
# Plot best fit line
x = np.array([7, 12])
y = linear_fun(x, slope, intercept)
p.line(x, y, line_width=2, color='orange')
bokeh.io.show(p)
This does not look exactly like what we would draw by eye, but this is because of overlapping data points. This is, indeed, the best fit line.
If you want to include images in your notebook, you can do that directly in a Markdown cell. The syntax is

You may have scans of equations you want to include, and you can use this technique to do it. However, if you are displaying images from your experiments, you should load them in using skimage
and display them as shown in the next tutorial on image processing.
For all of your assignment, you will need to submit a Jupyter notebook and the notebook conversion of the notebook. To convert to HTML, click through the following menu option.
File -> Export Notebook As... -> HTML
Note that if you are displaying images using HTML tags, you must also include the images with your submission. The best way to do the submission is to zip everything together in a single ZIP file.
This concludes our introductory tour of Python with some NumPy, SciPy, Pandas, and Bokeh thrown in for good measure. There is still much to learn, but you will pick up more and more tricks and syntax as we go along.
For the next tutorial, we will use Python to do some image processing. That is, we will write code to extract data of interest from images. As you get more and more proficient, coding (particularly in Python, in my opinion) will be more and more empowering and FUN!
%load_ext watermark
%watermark -v -p numpy,scipy,pandas,bokeh,jupyterlab
Python implementation: CPython Python version : 3.10.9 IPython version : 8.10.0 numpy : 1.23.5 scipy : 1.10.0 pandas : 1.5.3 bokeh : 2.4.3 jupyterlab: 3.5.3