Python - part 2
Functions, file I/O, classes, and extension modules
Material adapted from Justin Johnson’s Python tutorial, from the Python 3 Tutorial, from the Cornell Virtual Workshop tutorial on Python for High Performance, and from What is super() in Python?. A Jupyter notebook version of this material is available online in Google Colab.
Lesson plan:
- Q&A for homework 1 + lecture outline
- Form groups for presentation by responding to post @9 in Piazza.
- Extra credit: submit questions to be considered for Quiz 2 (all material from lectures 1-4, associated textbook content, and recommended videos) - by Thursday, August 20, at 11:59 PM
- Quiz 2, on Friday, August 21, 12:00 - 11:59 PM
- Homework 1 - due Friday, August 21, at 11:59 PM
- Functions
- File I/O
- Classes
- Extension modules
- Interpreted, yet fast
- Summary
Functions
Python functions are defined using the def keyword. For example:
def sign(x):
if x > 0:
return 'positive'
elif x < 0:
return 'negative'
else:
return 'zero'
for x in [-1, 0, 1]:
print (sign(x))
We will often define functions to take optional keyword arguments, like this:
def hello(name, loud=False):
if loud:
print ('HELLO, %s' % name.upper())
else:
print ('Hello, %s!' % name)
hello('Bob')
hello('Fred', loud=True)
Argument passing
Python passes arguments using a mechanism known as “Call by Object”, sometimes also called “Call by Object Reference” or “Call by Sharing”.
- If you pass immutable arguments like integers, strings or tuples to a function, the passing acts like call-by-value. The object reference is passed to the function parameters. The arguments cannot be changed within the function, because they cannot be changed at all, i.e., they are immutable.
- It’s different if we pass mutable arguments. They are also passed by object reference, but they can be changed in place within the function. If we pass a list to a function, we have to consider two cases (illustrated in the sketch below):
- Elements of a list can be changed in place, i.e., the list will be changed even in the caller’s scope.
- If a new list is assigned to the name, the old list will not be affected, i.e., the list in the caller’s scope will remain untouched.
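A minimal sketch illustrating both cases (the function names here are made up for illustration):
def change_in_place(lst):
    lst.append(99)        # mutates the caller's list

def rebind_name(lst):
    lst = [0, 0, 0]       # rebinds only the local name; the caller's list is untouched

data = [1, 2, 3]
change_in_place(data)
print(data)               # [1, 2, 3, 99]
rebind_name(data)
print(data)               # still [1, 2, 3, 99]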
First, let’s have a look at the integer variables below. The parameter inside the function remains a reference to the argument’s variable, as long as the parameter is not changed. As soon as a new value is assigned to it, Python creates a separate local variable. The caller’s variable will not be changed this way:
def ref_demo(x):
print("x=",x," id=",id(x))
x=42
print("x=",x," id=",id(x))
In the example above, we used the id() function, which takes an object as a parameter. id(obj) returns the “identity” of the object obj. This identity, the return value of the function, is an integer which is unique and constant for this object during its lifetime. Two different objects with non-overlapping lifetimes may have the same id() value.
If we call the function ref_demo(), we can check what happens to x with the id() function:
x = 9
id(x)
Output: 140709692940944
We can see that in the main scope, x has the identity 140709692940944.
ref_demo(x)
x= 9 id= 140709692940944
x= 42 id= 140709692942000
id(x)
Output: 140709692940944
In the first print statement of the ref_demo() function, the x from the main scope is used, as we can see from the identical identity. After we assign the value 42 to x, x gets a new identity 140709692942000, i.e., a separate memory location from the global x. So, when we are back in the main scope, x still has the original value 9 and the id 140709692940944.
In other words, Python initially behaves like call-by-reference, but as soon as we assign a new object to such a parameter, the effect resembles call-by-value: a new local name x is bound to the new object, while the caller’s x keeps referring to the original object and is left unchanged.
x = 9
id(x)
ref_demo(x)
id(x)
Command Line Arguments
If you use a command line interface, i.e., a text user interface (TUI), and not a graphical user interface (GUI), command line arguments are very useful. They are arguments that are added after the script name on the same command line.
It’s easy to write Python scripts using command line arguments. If you call a Python script from a shell, the arguments are placed after the script name and separated by spaces. Inside the script these arguments are accessible through the list variable sys.argv. The name of the script is included in this list as sys.argv[0]; sys.argv[1] contains the first argument, sys.argv[2] the second, and so on.
For example, if the following script is arguments.py
, it prints all arguments:
# Module sys has to be imported:
import sys
# Iteration over all arguments:
for eachArg in sys.argv:
print(eachArg)
If we were to call this script with the following arguments
python arguments.py python course for beginners
it would create the following output:
arguments.py
python
course
for
beginners
Variable Length of Parameters
Functions can also take an arbitrary number of arguments. Those who have some programming background in C or C++ know this from the varargs feature of these languages.
Some definitions: A function with an arbitrary number of arguments is usually called a variadic
function in computer science. To use another special term: A variadic function is a function of indefinite arity. The arity of a function or an operation is the number of arguments or operands that the function or operation takes. The term was derived from words like “unary”, “binary”, “ternary”, all ending in “ary”.
The asterisk *
is used in Python to define a variable number of arguments. The asterisk character has to precede a variable identifier in the parameter list.
def varpafu(*x): print(x)
varpafu()
varpafu(34,"Do you like Python?", "Of course")
The arguments passed to the function call of varpafu() are collected in a tuple, which can be accessed as a “normal” variable x within the body of the function. If the function is called without any arguments, the value of x is an empty tuple.
Sometimes, it’s necessary to use positional parameters followed by an arbitrary number of parameters in a function definition. This is possible, but the positional parameters always have to precede the arbitrary parameters. In the following example, we have a positional parameter “city” (the main location), which always has to be given, followed by an arbitrary number of other locations:
def locations(city, *other_cities): print(city, other_cities)
locations("Paris")
locations("Paris", "Strasbourg", "Lyon", "Dijon", "Bordeaux", "Marseille")
Functional style code
We can define functions using the imperative style.
def f(x): return x**3
print (f(8))
Or using the functional style.
f = lambda x: x**3
print (f(8))
The lambda operator or lambda function is a way to create small anonymous functions, i.e., functions without a name. These are throw-away functions, i.e., they are just needed where they have been created. In addition to defining standalone functions, lambda expressions are mainly used in combination with the functions filter(), map() and reduce(), as sketched below.
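Here is a brief sketch of lambda used with all three (note that in Python 3, reduce() must be imported from functools):
from functools import reduce

nums = [1, 2, 3, 4, 5, 6]
evens = list(filter(lambda x: x % 2 == 0, nums))   # [2, 4, 6]
cubes = list(map(lambda x: x**3, nums))            # [1, 8, 27, 64, 125, 216]
total = reduce(lambda acc, x: acc + x, nums)       # 21
print(evens, cubes, total)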
The following two functions have functional style coding fragments.
def thepower(n):
return lambda x: x**n
f = thepower(2)
print(f(8))
f = thepower(3)
print(f(8))
f = lambda x: print(x)
f("purple")
f("blue")
Using a functional style often makes the code more concise and, when it pushes looping into built-in functions like map(), it can also run faster than an explicit Python loop; whether it is easier to understand depends on the reader. Below is an example of cubing all the values in a list, first using the functional style, and second using the imperative style.
items = [1, 2, 3, 4, 5]
cubed = list(map(lambda x: x**3, items))
print(cubed)
items = [1, 2, 3, 4, 5]
cubed = []
for i in items:
cubed.append(i**3)
print(cubed)
In-class exercise
Write a function which calculates the arithmetic mean of a variable number of values.
File I/O
The syntax for reading and writing files in Python is similar to programming languages like C, C++, Java, Perl, and others but a lot easier to handle.
Read
To open a file for reading use the open
function. Its first parameter is the name of the file to read from and the second parameter, assigned the value "r"
, indicates that we want to read from the file:
fobj = open("myfile.txt", "r")
The "r"
is optional. An open()
command with just a file name is opened for reading per default. The open()
function returns a file object, which offers attributes and methods.
We can use the rstrip() method to strip off whitespace (newlines included) from the right side of the string line:
for line in fobj:
print(line.rstrip())
After we have finished working with a file, we have to close it again by using the file object method close()
:
fobj.close()
Write
Writing to a file is as easy as reading from one. To open a file for writing we set the second parameter to "w" instead of "r". To actually write the data into the file, we use the write() method of the file handle object.
fh = open("example.txt", "w")
fh.write("To write or not to write\nthat is the question!\n")
fh.close()
Especially if you are writing to a file, you should never forget to close the file handle again. Otherwise you risk your data ending up in an inconsistent state.
With statement
You will often find the with statement used for reading and writing files. The advantage is that the file will be automatically closed after the indented block following the with has finished executing:
with open("example.txt", "w") as fh:
fh.write("To write or not to write\nthat is the question!\n")
Our first example can also be rewritten like this with the with statement:
with open("myfile.txt") as fobj:
for line in fobj:
print(line.rstrip())
You can read more about file I/O here.
Classes
The syntax for defining classes in Python is straightforward. It consists of two parts: the header and the body. The header usually consists of just one line of code. It begins with the keyword “class” followed by a blank and an arbitrary name for the class, “Greeter” in this example. The class name is followed by a listing of other class names, which are classes from which the defined class inherits. These classes are called superclasses, base classes or sometimes parent classes. The body of a class consists of an indented block of statements.
__init__ is a method which is immediately and automatically called after an instance has been created. This name is fixed and it is not possible to choose another name. The __init__ method is used to initialize an instance. There is no explicit constructor or destructor method in Python, as they are known in C++ and Java. The __init__ method can be anywhere in a class definition, but it is usually the first method of a class, i.e., it follows right after the class header.
class Greeter:
# Constructor
def __init__(self, name):
self.name = name # Create an instance variable
# Instance method
def greet(self, loud=False):
if loud:
print ('HELLO, %s!' % self.name.upper())
else:
print ('Hello, %s' % self.name)
g = Greeter('Fred') # Construct an instance of the Greeter class
g.greet() # Call an instance method; prints "Hello, Fred"
g.greet(loud=True) # Call an instance method; prints "HELLO, FRED!"
The first parameter in the definition of a method has to be a reference to the instance that called the method. This parameter is usually called “self”. Most other object-oriented programming languages pass the reference to the object (self) as a hidden parameter to their methods.
A class can have different types of attributes (see the sketch after this list):
- Public - These attributes can be freely used inside or outside a class definition (e.g., name).
- Protected - Protected attributes should not be used outside the class definition, unless inside a subclass definition (e.g., _name).
- Private - This kind of attribute is inaccessible and invisible. It’s neither possible to read nor write those attributes, except inside the class definition itself (e.g., __name).
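The following minimal sketch (the class and attribute names are made up for illustration) shows the three naming conventions. Note that “private” attributes rely on name mangling rather than true access control:
class Account:
    def __init__(self):
        self.owner = "Ada"      # public attribute
        self._balance = 100     # protected by convention only
        self.__pin = 1234       # private: the name is mangled to _Account__pin

a = Account()
print(a.owner)                  # fine
print(a._balance)               # works, but discouraged outside the class and its subclasses
# print(a.__pin)                # would raise AttributeError
print(a._Account__pin)          # the mangled name is still reachable, so this is not true privacy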
There is no “real” destructor in Python, but something similar: the method __del__. It is called when the instance is about to be destroyed and there is no other reference to it. If a base class has a __del__() method, the derived class’s __del__() method, if any, must explicitly call it to ensure proper deletion of the base class part of the instance, as sketched below.
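A minimal sketch of this pattern (class names made up for illustration):
class Base:
    def __del__(self):
        print("Cleaning up the Base part")

class Derived(Base):
    def __del__(self):
        print("Cleaning up the Derived part")
        Base.__del__(self)      # explicitly invoke the base class's __del__

d = Derived()
del d   # once the last reference is gone, prints the Derived message, then the Base message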
Inheritance
Python supports not only inheritance but multiple inheritance as well. Generally speaking, inheritance is the mechanism of deriving new classes from existing ones. By doing this, we get a hierarchy of classes. In most class-based object-oriented languages, an object created through inheritance (a “child object”) acquires all of the properties and behaviors of the parent object, though there are exceptions in some programming languages.
Inheritance allows programmers to create classes that are built upon existing classes, and this enables a class created through inheritance to inherit the attributes and methods of the parent class. This means that inheritance supports code reusability: the methods, or more generally the software, inherited by a subclass are reused in it. The relationships of objects or classes through inheritance give rise to a directed graph.
The class from which a class inherits is called the parent or superclass. A class which inherits from a superclass is called a subclass, also called heir class or child class. Superclasses are sometimes called ancestors as well. There exists a hierarchical relationship between classes. It’s similar to relationships or categorizations that we know from real life. Think about vehicles, for example. Bikes, cars, buses and trucks are vehicles. Pick-ups, vans, sports cars, convertibles and estate cars are all cars and by being cars they are vehicles as well. We could implement a vehicle class in Python, which might have methods like accelerate and brake. Cars, Buses and Trucks and Bikes can be implemented as subclasses which will inherit these methods from vehicle.
Simple Inheritance example
Here we define a class PhysicianRobot
, which inherits from Robot
, and in this class we override the say_hi()
method.
class Robot:
def __init__(self, name):
self.name = name
def say_hi(self):
print("Hi, I am " + self.name + ".")
class PhysicianRobot(Robot):
def say_hi(self):
print("Everything will be okay! ")
print(self.name + " takes care of you!")
x = Robot("Marvin")
y = PhysicianRobot("James")
print(x, type(x))
print(y, type(y))
x.say_hi()
y.say_hi()
Single inheritance with super()
Consider the example below, where the parent class is referred to via the built-in super() function:
class Computer():
def __init__(self, computer, ram, storage):
self.computer = computer
self.ram = ram
self.storage = storage
# Class Mobile inherits Computer
class Mobile(Computer):
def __init__(self, computer, ram, storage, model):
super().__init__(computer, ram, storage)
self.model = model
Apple = Mobile('Apple', 2, 64, 'iPhone X')
print('The mobile is:', Apple.computer)
print('The RAM is:', Apple.ram)
print('The storage is:', Apple.storage)
print('The model is:', Apple.model)
In this example, Computer is a super (parent) class, while Mobile is a derived (child) class. The call to super().__init__() inside Mobile allows the child class to access the parent class’s __init__() method.
In other words, super()
allows you to build classes that easily extend the functionality of previously built classes without implementing their functionality again.
Multiple inheritance
The following example shows how the super() function is used with multiple inheritance:
class Animal:
def __init__(self, animalName):
print(animalName, 'is an animal.')
# Mammal inherits Animal
class Mammal(Animal):
def __init__(self, mammalName):
print(mammalName, 'is a mammal.')
super().__init__(mammalName)
# CannotFly inherits Mammal
class CannotFly(Mammal):
def __init__(self, mammalThatCantFly):
print(mammalThatCantFly, "cannot fly.")
super().__init__(mammalThatCantFly)
# CannotSwim inherits Mammal
class CannotSwim(Mammal):
def __init__(self, mammalThatCantSwim):
print(mammalThatCantSwim, "cannot swim.")
super().__init__(mammalThatCantSwim)
# Cat inherits CannotSwim and CannotFly
class Cat(CannotSwim, CannotFly):
def __init__(self):
print('I am a cat.')
super().__init__('Cat')
# Driver code
cat = Cat()
print('')
bat = CannotSwim('Bat')
Consider the instantiation cat = Cat() in the driver code; the following is the order of events that occur after it:
- The Cat class is called first.
- The CannotSwim parent class is called, since it appears before CannotFly in the order of inheritance; this follows Python’s Method Resolution Order (MRO), which outlines the order in which methods are inherited (this order can be inspected directly, as shown after this list).
- The CannotFly class is called.
- The Mammal class is called.
- Finally, the Animal class is called.
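You can inspect this order through the class’s __mro__ attribute (or its mro() method); a typical CPython session should print something like the following:
print(Cat.__mro__)
# (<class '__main__.Cat'>, <class '__main__.CannotSwim'>, <class '__main__.CannotFly'>,
#  <class '__main__.Mammal'>, <class '__main__.Animal'>, <class 'object'>)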
Next, consider the bat object. Bats are flying mammals, but they cannot swim, which is why it is instantiated with the CannotSwim class. The super() call in CannotSwim then invokes the Mammal class’s constructor, which in turn invokes the Animal class’s constructor.
Modules
You can install packages in terminal using pip install [package_name]
.
# Import 'os' and 'time' modules
import os, time
# Import under an alias
import numpy as np
np.dot(x, y) # Access components with pkg.fn
# Import specific submodules/functions
from numpy import linalg as la, dot as matrix_multiply
In-class exercise
What is one main difference, in terms of speed of execution, between compiled and interpreted languages?
Interpreted, yet fast
Faster Native Python
In this section we’ll focus on writing fast native Python code (see Figure 1) by making good use of the language’s built-in data structures and evaluation strategies:
Figure 1. During the program execution, your code will run native Python code, shown by the arrows on the left, and at times, call compiled libraries, shown by the arrows on the right.
- Collections and containers: the right data structure for the job
- Lazy evaluation
- Memory management
Collections and containers
Use the right data structure for the job
For example, numerical performance of membership testing for sets and lists containing the same data can be very different:
- sets ~ O(1)
- lists ~ O(N)
set1 = set(range(0,10000))
list1 = list(range(0,10000))
%timeit 9800 in set1
%timeit 9800 in list1
# hash() can be called on any object to return an integer value for an object.
# The returned value is the object’s hash code or hash value.
# While most objects are hashable, not every object is.
# In particular, mutable objects like lists may not be hashable
# because when an object is mutated its hash value may also change.
hash('123')
More details on hashing are available in the free book Data Structures and Algorithms with Python.
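For example, trying to hash a mutable object such as a list raises a TypeError, which is why lists cannot be used as set elements or dictionary keys:
print(hash((1, 2, 3)))   # tuples of hashable items are hashable
try:
    hash([1, 2, 3])      # lists are mutable and therefore unhashable
except TypeError as err:
    print(err)           # unhashable type: 'list'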
As with membership testing, computing set intersections is much faster using sets than lists:
# item lookup is much faster for sets and dictionaries than for lists (for which lookup is O(N))
set1 = set(range(0,1000)) # create a set with a bunch of numbers in it
set2 = set(range(500,2000)) # create another set with a bunch of numbers in it
isec = set1 & set2 # same as isec = set1.intersection(set2)
list1 = list(range(0,1000)) # create a list with a bunch of numbers in it
list2 = list(range(500,2000)) # create another list with a bunch of numbers in it
isec = [e1 for e1 in list1 for e2 in list2 if e1==e2] # uses list comprehensions
%timeit set1 & set2
%timeit [e1 for e1 in list1 for e2 in list2 if e1==e2]
Lazy evaluation
- a strategy implemented in some programming languages whereby certain objects are not produced until they are needed
- often used in conjunction with functions that produce collections of objects
- if you only need to iterate over the items in a collection, you don’t need to produce that entire collection
Last time we looked at list comprehensions. List comprehensions are useful and can help you write elegant code that’s easy to read and debug, but they’re not the right choice for all circumstances. They might make your code run more slowly or use more memory. If your code is less performant or harder to understand, then it’s probably better to choose an alternative.
When the size of a list becomes problematic, it’s often helpful to use a generator instead of a list comprehension. A generator doesn’t create a single, large data structure in memory, but instead returns an iterable. Your code can ask for the next value from the iterable as many times as necessary or until you’ve reached the end of your sequence, while only storing a single value at a time.
If you were to sum the first billion squares with a generator, your program would likely run for a while, but it shouldn’t cause your computer to freeze. The example below uses a generator:
sum(i * i for i in range(1000000000))
You can tell this is a generator expression because it isn’t surrounded by square brackets or curly braces. Generator expressions can optionally be surrounded by parentheses (and must be, when they are not the sole argument to a function).
This example still requires a lot of work, but it performs the operations lazily. Because of lazy evaluation, values are only calculated when they’re explicitly requested. Each time sum() requests the next value, the generator produces it, sum() adds it to the running total, and the value can then be discarded before the next one is generated. This process keeps the memory footprint small. More details are available at When to Use a List Comprehension in Python.
map() also operates lazily, meaning memory won’t be an issue if you choose to use it in this case:
Python provides a variety of mechanisms to support lazy evaluation:
- generators: like functions, but maintain internal state and yield the next value when called (see the sketch below)
- dictionary views (keys and values)
- range: for i in range(N)
- zip: pairs = zip(seq1, seq2)
- enumerate: for i, val in enumerate(seq)
- open: with open(filename, 'r') as f
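As a sketch of the first item above, here is a small generator function that produces squares on demand instead of building a whole list (the function name is made up for illustration):
def squares(n):
    for i in range(n):
        yield i * i          # execution pauses here; state is kept between calls

gen = squares(5)
print(next(gen))             # 0
print(next(gen))             # 1
print(list(gen))             # [4, 9, 16] -- consumes the remaining values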
Python/C API and extension modules
CPython is the most widely used interpreter, generally installed as “python”. It is written in C, and is accompanied by an Application Programming Interface (API) that enables communication between Python and C (and thus to basically any other language).
- Python/C API allows for compiled chunks of code to be called from Python or executed within the CPython interpreter → extension modules
- Compiled shared object library (.so, .dll, etc.) is accessed through Python/C API
- compiled code executes operations of interest
- wrapper/interface code consists of calls to Python/C API and underlying compiled code
- Can be imported into python interpreter just as pure Python source code can
- Much of the core functionality of the Python language and standard library is written in C (a quick way to see this from the interpreter is sketched below)
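A quick way to see this from the interpreter (the NumPy lines assume NumPy is installed):
import sys
print('sys' in sys.builtin_module_names)  # True: sys is compiled directly into the interpreter
print(sys.builtin_module_names[:8])       # a sample of the C modules built into CPython

import numpy as np
print(np.__file__)   # numpy's __init__.py, which in turn loads compiled extension modules (.so/.pyd)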
Hybrid Codes
It is often advantageous to blend high-level languages for control with low-level languages for performance. Overall performance depends on the granularity of computations in compiled code and the overhead required to communicate between languages, as shown in Figure 2.
Figure 2. Granularity of computations in compiled code: The program on the left runs only native Python code. The program in the middle runs compiled libraries inefficiently (for example by using for loops with collections, rather than using map). This incurs the overhead of communicating between the Python code and the C API. The program on the right runs compiled code efficiently - it passes control to the Python-C API, which performs the operation on all elements of the collection, after which it returns control to the native Python code.
There are many different tools that support the interleaving of Python and compiled extension modules.
- We’ll look at three main strategies for improved performance, as shown in Figure 3:
Figure 3. There are three strategies to improve the performance: use the appropriate data types to make native Python code run faster, use compiled libraries when available, and spread the execution across multiple CPUs/nodes.
* More compiled code
* Parallel processing
* Faster native Python
More compiled code
Use more compiled code, when available, as shown in Figure 4:
Figure 4. Using compiled code can speed up the execution of a Python program. This can be done by using third-party libraries like NumPy, SciPy, etc., or by using compiled custom code.
- Compiled third-party libraries
- Compiling custom code (see the ctypes sketch below for one lightweight option)
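One lightweight way to call compiled code without writing a full extension module is the standard-library ctypes module. The sketch below loads the C math library and calls its cos function; the library lookup assumes a Unix-like system:
import ctypes, ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))   # locate and load the C math library
libm.cos.argtypes = [ctypes.c_double]               # declare the C signature
libm.cos.restype = ctypes.c_double
print(libm.cos(0.0))                                # 1.0, computed in compiled C code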
Third-Party Libraries
Third-party libraries are used for numerical & scientific computing (they are part of the Python scientific computing ecosystem)
- Most specific functionality for scientific computing is provided by third-party libraries, which are typically a mix of Python code and compiled extension modules
- NumPy: multi-dimensional arrays, linear algebra, random numbers
- SciPy: routines for integration, optimization, root-finding, interpolation, etc.
- Pandas: Series and Dataframes for tabular data and statistics (e.g., from spreadsheets)
- Scikit-learn, TensorFlow, Caffe, PyTorch, Keras: machine learning
- Matplotlib, Seaborn, Bokeh: plotting
- NetworkX: networks
- etc.
- Bundled distributions (e.g., Anaconda) contain many of these, with tools for installing additional packages
NumPy
Numpy (“Numerical Python”) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. This is an optimized library for matrix and vector computation that makes use of C/C++ subroutines and memory-efficient data structures.
- conventional wisdom: avoid loops in Python and use array syntax
- Instead of making multiple calls to compiled functions with results returned to Python, all those compiled function calls are “bundled” together and triggered by a single function call from Python.
- a challenge: figuring out how to express complex operations solely using array syntax
- including indexing, slicing, and broadcasting
- a caveat: convenient syntax can disguise performance inefficiencies (e.g., temporary arrays)
If you are already familiar with MATLAB, you might find this tutorial useful to get started with Numpy.
NumPy provides a rich syntax for operating on arrays in a compact and efficient manner without explicit looping and indexing into the array, as shown in Figure 5. For example, one can add element-wise two arrays of the same size and shape with a single operation as:
Figure 5. The virtue of the first form of array addition is not only that it is more compact and readable, but also more numerically efficient, as the looping, indexing and addition are done in a compiled C library with a single call from the Python interpreter (as shown on the right), rather than in multiple calls from the Python interpreter (as shown in the middle).
import numpy as np
a = np.ones((3,3)) # Create a 3x3 array of all ones
b = np.zeros((3,3)) # Create a 3x3 array of all zeros
c = a + b # throws ValueError if a and b not the same shape
print (a, b, c)
This is more or less equivalent to:
assert(a.shape == b.shape) # throws AssertionError if a and b not the same shape
c = np.zeros_like(a) # prefills a zero array of the correct shape
for i in range(a.shape[0]):
for j in range(a.shape[1]):
c[i,j] = a[i,j] + b[i,j]
print (a, b, c)
One downside of this type of array syntax, as described in more detail below, involves the construction of temporary arrays in complicated expressions.
The simple array addition above only scratches the surface of the set of compact and efficient functionality available with numpy arrays. More details are available at Python for High Performance: NumPy.
To use Numpy, we first need to import the numpy
package:
import numpy as np
Arrays
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
We can initialize numpy arrays from nested Python lists, and access elements using square brackets:
a = np.array([1, 2, 3]) # Create a rank 1 array
print (type(a), a.shape, a[0], a[1], a[2])
a[0] = 5 # Change an element of the array
print (a)
b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array
print (b)
print (b.shape)
print (b[0, 0], b[0, 1], b[1, 0])
Numpy also provides many functions to create arrays:
a = np.zeros((2,2)) # Create an array of all zeros
print (a)
b = np.ones((1,2)) # Create an array of all ones
print (b)
c = np.full((2,2), 7) # Create a constant array
print (c)
d = np.eye(2) # Create a 2x2 identity matrix
print (d)
e = np.random.random((2,2)) # Create an array filled with random values
print (e)
f = np.random.randint(low=0, high=10, size=(2,2))
print (f)
Array indexing
Numpy offers several ways to index into arrays.
Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:
import numpy as np
# Create the following rank 2 array with shape (3, 4)
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
# [6 7]]
b = a[:2, 1:3]
print (b)
A slice of an array is a view into the same data, so modifying it will modify the original array.
print (a[0, 1])
b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1]
print (a[0, 1])
Similarly, you can index a numpy array with an array of indices.
c = np.array([0, 1, 4, 9, 16, 25, 36, 49])
idx = [1, 3, 5, 6]
print (c[idx])
You can also mix integer indexing with slice indexing. However, doing so will yield an array of lower rank than the original array. Note that this is quite different from the way that MATLAB handles array slicing:
# Create the following rank 2 array with shape (3, 4)
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print (a)
Here are three ways of accessing the data in the middle row of the array. Mixing integer indexing with slices yields an array of lower rank, while using only slices yields an array of the same rank as the original array:
row_r1 = a[1, :] # Rank 1 view of the second row of a
row_r2 = a[1:2, :] # Rank 2 view of the second row of a
row_r3 = a[[1], :] # Rank 2 view of the second row of a
print (row_r1, row_r1.shape)
print (row_r2, row_r2.shape)
print (row_r3, row_r3.shape)
# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print (col_r1, col_r1.shape)
print ()
print (col_r2, col_r2.shape)
Integer array indexing: When you index into numpy arrays using slicing, the resulting array view will always be a subarray of the original array. In contrast, integer array indexing allows you to construct arbitrary arrays using the data from another array. Here is an example:
a = np.array([[1,2], [3, 4], [5, 6]])
# An example of integer array indexing.
# The returned array will have shape (3,) and
# consist of the elements a[0, 0], a[1, 1], and a[2, 0].
print (a[[0, 1, 2], [0, 1, 0]])
# The above example of integer array indexing is equivalent to this:
print (np.array([a[0, 0], a[1, 1], a[2, 0]]))
# When using integer array indexing, you can reuse the same
# element from the source array:
print (a[[0, 0], [1, 1]])
# Equivalent to the previous integer array indexing example
print (np.array([a[0, 1], a[0, 1]]))
One useful trick with integer array indexing is selecting or mutating one element from each row of a matrix:
# Create a new array from which we will select elements
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print (a)
# Create an array of indices
b = np.array([0, 2, 0, 1])
# Select one element from each row of a using the indices in b
print (a[np.arange(4), b]) # Prints "[ 1 6 7 11]"
# Mutate one element from each row of a using the indices in b
a[np.arange(4), b] += 10
print (a)
Boolean array indexing: Boolean array indexing lets you pick out arbitrary elements of an array. Frequently, this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:
import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
bool_idx = (a > 2) # Find the elements of a that are bigger than 2;
# this returns a numpy array of Booleans of the same
# shape as a, where each slot of bool_idx tells
# whether that element of a is > 2.
print (bool_idx)
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print (a[bool_idx])
# We can do all of the above in a single concise statement:
print (a[a > 2])
For brevity we have left out a lot of details about numpy array indexing; if you want to know more you should read the documentation.
Datatypes
Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:
x = np.array([1, 2]) # Let numpy choose the datatype
y = np.array([1.0, 2.0]) # Let numpy choose the datatype
z = np.array([1, 2], dtype=np.int64) # Force a particular datatype
print (x.dtype, y.dtype, z.dtype)
You can read all about numpy datatypes in the documentation.
Array math
Basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
# Elementwise sum; both produce the array
# [[ 6.  8.]
#  [10. 12.]]
print (x + y)
print (np.add(x, y))
# Elementwise difference; both produce the array
# [[-4. -4.]
#  [-4. -4.]]
print (x - y)
print (np.subtract(x, y))
# Elementwise product; both produce the array
# [[ 5. 12.]
#  [21. 32.]]
print (x * y)
print (np.multiply(x, y))
# Elementwise division; both produce the array
# [[ 0.2 0.33333333]
# [ 0.42857143 0.5 ]]
print (x / y)
print (np.divide(x, y))
# Elementwise square root; produces the array
# [[ 1. 1.41421356]
# [ 1.73205081 2. ]]
print (np.sqrt(x))
Note that unlike MATLAB, * is elementwise multiplication, not matrix multiplication. We instead use the dot function to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices. dot is available both as a function in the numpy module and as an instance method of array objects:
There also exists a numpy matrix type (np.matrix) that uses * for matrix multiplication. However, np.matrix only supports 2D matrices, whereas np.array supports n-dimensional arrays. For this reason, it is better to always use np.array. (Note that since Python 3.5, the @ operator also performs matrix multiplication on arrays.) Additional discussion can be found here.
If you think python is dumb for having bulky unreadable matrix multiplication syntax, please check out the programming language Julia. Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments.
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])
# Inner product of vectors; both produce 219
print (v.dot(w))
print (np.dot(v, w))
# Matrix / vector product; both produce the rank 1 array [29 67]
print (x.dot(v))
print (np.dot(x, v))
# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
# [43 50]]
print (x.dot(y))
print (np.dot(x, y))
x = np.array([1, 4, 3, 7]) #Compute the outer product of two arrays
y = np.array([2, 3, 9, 6])
print (np.outer(x, y))
Numpy provides many useful functions for performing computations on arrays; one of the most useful is sum
:
x = np.array([[1,2],[3,4]])
print (np.sum(x)) # Compute sum of all elements; prints "10"
print (np.sum(x, axis=0)) # Compute sum of each column; prints "[4 6]"
print (np.sum(x, axis=1)) # Compute sum of each row; prints "[3 7]"
You can find the full list of mathematical functions provided by numpy in the documentation.
Apart from computing mathematical functions using arrays, we frequently need to reshape or otherwise manipulate data in arrays. The simplest example of this type of operation is transposing a matrix; to transpose a matrix, simply use the T
attribute of an array object:
print (x)
print (x.T)
v = np.array([1,2,3])
print (v)
print (v.T) # Note: taking the transpose of a rank 1 array does nothing
Broadcasting
Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array.
For example, suppose that we want to add a constant vector to each row of a matrix. We could do it like this:
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x) # Create an empty matrix with the same shape as x
# Add the vector v to each row of the matrix x with an explicit loop
for i in range(4):
y[i, :] = x[i, :] + v
print (y)
This works; however, when the matrix x is very large, computing an explicit loop in Python could be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x and vv. We could implement this approach like this:
vv = np.tile(v, (4, 1)) # Stack 4 copies of v on top of each other
print (vv) # Prints "[[1 0 1]
# [1 0 1]
# [1 0 1]
# [1 0 1]]"
y = x + vv # Add x and vv elementwise
print (y)
Numpy broadcasting allows us to perform this computation without actually creating multiple copies of v
. Consider this version, using broadcasting:
import numpy as np
# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v # Add v to each row of x using broadcasting
print (y)
The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to broadcasting; this line works as if v actually had shape (4, 3), where each row was a copy of v, and the sum was performed elementwise.
Broadcasting two arrays together follows these rules:
- If the arrays do not have the same rank, prepend the shape of the lower rank array with 1s until both shapes have the same length.
- The two arrays are said to be compatible in a dimension if they have the same size in the dimension, or if one of the arrays has size 1 in that dimension.
- The arrays can be broadcast together if they are compatible in all dimensions.
- After broadcasting, each array behaves as if it had shape equal to the elementwise maximum of shapes of the two input arrays.
- In any dimension where one array had size 1 and the other array had size greater than 1, the first array behaves as if it were copied along that dimension
If this explanation does not make sense, try reading the explanation from the documentation.
Functions that support broadcasting are known as universal functions. You can find the list of all universal functions in the documentation.
Here are some applications of broadcasting:
# Compute outer product of vectors
v = np.array([1,2,3]) # v has shape (3,)
w = np.array([4,5]) # w has shape (2,)
# To compute an outer product, we first reshape v to be a column
# vector of shape (3, 1); we can then broadcast it against w to yield
# an output of shape (3, 2), which is the outer product of v and w:
print (np.reshape(v, (3, 1)) * w)
# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]])
# x has shape (2, 3) and v has shape (3,) so they broadcast to (2, 3),
# giving the following matrix:
print (x + v)
# Add a vector to each column of a matrix
# x has shape (2, 3) and w has shape (2,).
# If we transpose x then it has shape (3, 2) and can be broadcast
# against w to yield a result of shape (3, 2); transposing this result
# yields the final result of shape (2, 3) which is the matrix x with
# the vector w added to each column. Gives the following matrix:
print ((x.T + w).T)
# Another solution is to reshape w to be a row vector of shape (2, 1);
# we can then broadcast it directly against x to produce the same
# output.
print (x + np.reshape(w, (2, 1)))
# Multiply a matrix by a constant:
# x has shape (2, 3). Numpy treats scalars as arrays of shape ();
# these can be broadcast together to shape (2, 3), producing the
# following array:
print (x * 2)
Broadcasting typically makes your code more concise and faster, so you should strive to use it where possible.
This brief overview has touched on many of the important things that you need to know about numpy, but is far from complete. Check out the numpy reference to find out much more about numpy.
Aggregated operations over arrays
import numpy as np
a = np.random.random((1000,1000))
b = np.random.random((1000,1000))
log_a = np.log(a) # an example of a "universal function" (ufunc) -- operates on scalars and arrays
c = a + b # throws ValueError if a and b not the same shape
The challenge of more complex operations
# numpy multiplicative outer product -- defined internally within numpy namespace
def outer(a, b, out=None):
    return np.multiply(a.ravel()[:, np.newaxis], b.ravel()[np.newaxis, :], out)
x = np.array([1,2,3])
y = np.array([2,3,4])
np.outer(x,y)
The caveat of array temporaries
# The problem of temporary array creation
d = a + 2*b + 3*c
# temporaries created along the way: 2*b, 3*c, (2*b) + (3*c); the final sum a + ((2*b) + (3*c)) is then bound to d
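One way to reduce these temporaries (reusing a, b, and c from the cell above) is to build the result in place with the out= argument of NumPy ufuncs and the in-place operators; this is just a sketch of the idea, and in practice readability usually matters more than eliminating every temporary:
d = np.empty_like(a)
np.multiply(b, 2, out=d)   # d = 2*b, written directly into d (no temporary)
d += a                     # d = a + 2*b, updated in place
d += 3*c                   # one temporary (3*c) remains before the in-place add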
In-class exercises
- Create a vector with values ranging from 10 to 49
- Create a 2D array with 1 on the border and 0 inside
- Multiply a 5x3 matrix by a 3x2 matrix (real matrix product)
SciPy
- SciPy = “Scientific Python”
- sits on top of NumPy, wrapping many C and Fortran numerical routines, with convenient Python interface
- routines for integration, optimization, root-finding, interpolation, fitting, etc. (a short example follows Figure 6)
- Python callbacks enable ease-of-use, but with some performance penalty, as shown in Figure 6
Figure 6. Python callbacks enable ease-of-use, but with some performance penalty (the overhead required to communicate between compiled code and Python native code).
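As a small illustration of the routines mentioned above (assuming SciPy is installed), the sketch below integrates sin(x) over [0, pi] and finds the root of cos(x) = x:
import numpy as np
from scipy import integrate, optimize

val, err = integrate.quad(np.sin, 0, np.pi)    # numerical integration; exact answer is 2
print(val)

sol = optimize.root_scalar(lambda x: np.cos(x) - x, bracket=[0, 2])
print(sol.root)                                # approximately 0.739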
Matplotlib
Matplotlib is a plotting library. In this section we give a brief introduction to the matplotlib.pyplot module, which provides a plotting system similar to that of MATLAB.
import matplotlib.pyplot as plt
By running this special IPython magic command, we will display plots inline:
%matplotlib inline
Plotting
The most important function in matplotlib
is plot, which allows you to plot 2D data. Here is a simple example:
# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)
# Plot the points using matplotlib
plt.plot(x, y)
With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend, and axis labels:
y_cos = np.cos(x)
y_sin = np.sin(x)
# Plot the points using matplotlib
plt.clf() # clear previous plot
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
Subplots
You can plot different things in the same figure using the subplot
function. Here is an example:
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)
# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')
# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')
# Show the figure.
plt.show()
You can read much more about the subplot
function in the documentation.
Histogram
mu, sigma = 100, 15
x = mu + sigma*np.random.randn(10000)
# the histogram of the data, separated into 50 equally spaced bins, with unnormalized frequency
plt.clf()
n, bins, patches = plt.hist(x, bins=50)
print (bins) # list of bounds for each bin
print (n) # frequency for each bin
plt.show()
Summary
- Functions
- File I/O
- Classes
- Extension modules
- Next time
- Prolog programming language, part 1 of 2
- Reminders
- Extra credit: submit questions to be considered for Quiz 2 (all material from lectures 1-4, associated textbook content, and recommended videos) - by Thursday, August 20, at 11:59 PM
- Quiz 2, on Friday, August 21, 12:00 - 11:59 PM
- Homework 1 - due Friday, August 21, at 11:59 PM