Advanced operations with numpy
Goals of this lecture#
This lecture will continue our discussion of the numpy package. We’ll discuss:
More practice with vector operations.
Descriptive statistics with vectors.
Working with matrices (multi-dimensional arrays).
Other useful functions.
import numpy as np
Practice with vectors#
numpy vectors make it easier to do all sorts of operations, such as arithmetic.
No more need to use for loops: you can do vector arithmetic the same way you multiply individual numbers.
numpy calculations are also typically much faster and more efficient than the equivalent Python for loop!
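As a rough illustration of the point above, here is a minimal sketch comparing a for loop with the equivalent vector expression. The names prices and quantities (and their values) are made up for this example.

```python
import numpy as np

# Hypothetical example data
prices = np.array([2.0, 8.0, 9.0, 10.0])
quantities = np.array([3, 7, 6, 10])

# Loop version: multiply one pair of elements at a time
totals_loop = []
for p, q in zip(prices, quantities):
    totals_loop.append(p * q)

# Vectorized version: one expression, applied element by element
totals_vec = prices * quantities
print(totals_vec)
```

Both versions produce the same numbers; the vectorized one is shorter and avoids the explicit loop.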
Practice problems#
Check-in#
Consider the two arrays below. How would you calculate the difference between each item in a and each item in b?
a = np.array([2, 8, 9, 10])
b = np.array([3, 7, 6, 10])
## Your code here
Solution#
a - b
array([-1, 1, 3, 0])
Check-in#
Consider the same two arrays as before. How would you:
Calculate the product of these arrays?
Calculate the sum of the elements in this new “product” array?
Note that this is also called the dot product.
### Your code here
Solution#
The sum of the element-wise products is also called the dot product.
## First, calculate the product
product = a * b
product
array([ 6, 56, 54, 100])
## Then, calculate the sum of this array
product.sum()
216
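As a side note, numpy can also compute the dot product in one step. The sketch below assumes the same a and b as above and compares the two-step version with np.dot.

```python
import numpy as np

a = np.array([2, 8, 9, 10])
b = np.array([3, 7, 6, 10])

# Two-step version: element-wise product, then sum
two_step = (a * b).sum()

# One-step version with np.dot
one_step = np.dot(a, b)
print(two_step, one_step)
```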
Descriptive statistics with numpy#
Descriptive statistics are ways to summarize and organize data.
A big advantage of numpy is that it has built-in functions to calculate various descriptive statistics:
The sum of a set of numbers.
The mean (or “average”) of a set of numbers.
The median (or “middle value”) of a set of numbers.
sum#
The sum of a set of numbers is simply the result of adding each number together.
The numpy package has a sum function built in, which makes it easier to calculate the sum of a vector.
## First, create an array with some numbers
v = np.array([5, 9, 10])
v
array([ 5, 9, 10])
## Now calculate the sum
v.sum()
24
mean#
The mean of a set of scores is the sum of those scores divided by the number of observations.
The numpy package also has a mean function built in. Two options:
array_name.mean()
np.mean(array_name)
## First, create an array with some numbers
v = np.array([5, 9, 10])
v
array([ 5, 9, 10])
## Now calculate the mean
v.mean()
8.0
## Or do it with numpy.mean
np.mean(v)
8.0
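To connect the definition to the built-in function, here is a small sketch that computes the mean "by hand" (sum divided by number of observations) and compares it with v.mean(). The variable name manual_mean is just for illustration.

```python
import numpy as np

v = np.array([5, 9, 10])

# Mean "by hand": sum of the scores divided by the number of observations
manual_mean = v.sum() / v.size
print(manual_mean)
print(v.mean())
```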
Check-in#
What would happen if we ran the following code?
v = np.array([1, 5, 9])
np.mean(v)
### Your code here
Solution#
This will also calculate the mean: (1 + 5 + 9) / 3 = 5.0.
v = np.array([1, 5, 9])
np.mean(v)
5.0
Check-in#
What would happen if we ran the following code?
v = np.array(["a", "b", "a"])
v.mean()
### Your code here
Solution#
It will throw an error.
You cannot calculate the mean of a vector of str types.
The mean can only be calculated for interval/ratio data.
median#
The median of a set of scores is the middle score when those scores are arranged from least to greatest.
The numpy package also has a median function built in.
Syntax: np.median(array_name).
Unlike mean, you cannot use array_name.median().
## First, create an array with some numbers
v = np.array([5, 9, 10, 1, 20])
v
array([ 5, 9, 10, 1, 20])
## Now calculate the median
np.median(v)
9.0
Check-in#
What would be the median of the following vector?
v = np.array([1, 2, 5, 8])
### Your code here
Solution#
If the vector has an even number of elements, the median is the mean of the middle two elements.
## First, create an array with some numbers
v = np.array([1, 2, 5, 8])
v
array([1, 2, 5, 8])
## Use np.median
np.median(v)
3.5
## Equivalent to *mean* of 2 and 5
(2 + 5) / 2
3.5
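The rule for odd and even lengths can be sketched as a small function. Note that manual_median is a hypothetical helper written for illustration; in practice you would just call np.median.

```python
import numpy as np

def manual_median(values):
    """Hypothetical helper: median via sorting, for illustration only."""
    ordered = np.sort(values)
    n = ordered.size
    middle = n // 2
    if n % 2 == 1:
        # Odd length: the single middle element
        return float(ordered[middle])
    # Even length: mean of the two middle elements
    return (ordered[middle - 1] + ordered[middle]) / 2

odd = np.array([5, 9, 10, 1, 20])
even = np.array([1, 2, 5, 8])
print(manual_median(odd), manual_median(even))
```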
Interim summary#
Descriptive statistics are a really useful way to summarize data.
Very valuable for both basic and applied research (e.g., in industry).
Examples: median salary, mean sales per fiscal quarter, mean reaction time on a psychophysics task, etc.
numpy makes this much easier.
Later, we’ll discuss how pandas (a way to represent data tables) uses these same functions.
Working with matrices#
A matrix is a rectangular array of data (i.e., a multi-dimensional array).
numpy is designed for representing and performing calculations with matrices.
A “vector” is just a one-dimensional matrix.
Many of the same operations we’ve discussed also apply to working with matrices.
Creating a matrix#
Matrices can be created just like vectors.
The key difference is that they contain nested lists.
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
array([[1, 2, 5],
[3, 4, 7]])
## This is a 2 by 3 matrix
md_array.shape
(2, 3)
Indexing into a matrix#
You can index into a matrix, just like with a vector.
A key difference is that you use multiple indices, one for each dimension.
matrix_name[D1_index, D2_index, ...]
# This just returns the first *row*
md_array[0]
array([1, 2, 5])
# This returns the second element of the first row
md_array[0, 1]
2
Check-in#
How would you return the first element of the second row of md_array?
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
### Your code here
Solution#
# Can use [1] to get second row
md_array[1]
array([3, 4, 7])
# Use [1, 0] to get first element of second row
md_array[1, 0]
3
Check-in#
A more challenging problem: how would you return the second column of md_array? How many observations should it have?
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
### Your code here
Solution#
Retrieving a column uses the [:, column_index] syntax.
## column 1
md_array[:, 0]
array([1, 3])
## column 2
md_array[:, 1]
array([2, 4])
## column 3
md_array[:, 2]
array([5, 7])
Check-in#
How would you retrieve the second and third element of the second row?
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
### Your code here
Solution#
## First, get second row with md_array[1]
md_array[1]
array([3, 4, 7])
## Then, get second/third elements with slicing
## I.e., [1:3] syntax
md_array[1, 1:3]
array([4, 7])
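The same slicing idea can be combined with the column syntax from the previous check-in. As a small sketch, this selects the second and third columns across all rows; last_two_columns is just an illustrative name.

```python
import numpy as np

md_array = np.array([[1, 2, 5], [3, 4, 7]])

# All rows, columns 1 and 2 (the stop index 3 is excluded)
last_two_columns = md_array[:, 1:3]
print(last_two_columns)
print(last_two_columns.shape)
```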
Summary statistics with matrices#
When you call np.sum (or mean, etc.), you can specify which axis to calculate that statistic from.
axis = 0: calculate sum (or mean, etc.) of each column.
axis = 1: calculate sum (or mean, etc.) of each row.
## Calculate sum of each column
md_array.sum(axis = 0)
array([ 4, 6, 12])
## Calculate sum of each row
md_array.sum(axis = 1)
array([ 8, 14])
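One way to think about axis is as the dimension that gets "collapsed". The sketch below checks this for md_array: summing with axis=0 gives the same result as adding the two rows together.

```python
import numpy as np

md_array = np.array([[1, 2, 5], [3, 4, 7]])

# axis=0 collapses the rows, leaving one value per column
column_sums = md_array.sum(axis=0)
# Same result as adding the two rows together element by element
rows_added = md_array[0] + md_array[1]

# axis=1 collapses the columns, leaving one value per row
row_sums = md_array.sum(axis=1)
print(column_sums, rows_added, row_sums)
```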
Check-in#
How would you calculate the mean of each row of the following matrix?
m = np.array([[5, 10, 2],
[20, 5, 100]])
### your code here
Solution#
m.mean(axis = 1)
array([ 5.66666667, 41.66666667])
m.mean(axis = 1).shape
(2,)
Check-in#
Suppose you have a 5x6 matrix (5 rows, 6 columns). If you calculated the mean of each column, what would the shape be of the resulting vector?
Solution#
The vector would have a shape of (6,), i.e., six observations.
There are six columns, so calculating the mean of each column would result in six observations.
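You can check this reasoning with any 5x6 matrix. The sketch below uses a matrix of zeros purely as a placeholder, since only the shape matters here.

```python
import numpy as np

# Placeholder 5x6 matrix; the values are irrelevant for checking shapes
m = np.zeros((5, 6))

column_means = m.mean(axis=0)
row_means = m.mean(axis=1)
print(column_means.shape)
print(row_means.shape)
```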
Side note: arithmetic with matrices#
You can also perform arithmetic with matrices (e.g., addition, multiplication, etc.).
However, note that matrices must have compatible dimensions.
More discussion of this in a Linear Algebra class.
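As a minimal sketch of what "compatible dimensions" means for element-wise arithmetic: two matrices of the same shape can be added, while a (2, 3) matrix and a (2, 2) matrix cannot. The arrays x, y, and z are made up for this example.

```python
import numpy as np

x = np.array([[1, 2, 5], [3, 4, 7]])        # shape (2, 3)
y = np.array([[10, 20, 30], [40, 50, 60]])  # shape (2, 3)

# Same shape: element-wise addition works
total = x + y

# Mismatched shapes: (2, 3) and (2, 2) cannot be combined element-wise
z = np.ones((2, 2))
try:
    x + z
    compatible = True
except ValueError:
    compatible = False

print(total)
print(compatible)
```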
Identifying the location of an item#
Often, you’ll need to search a vector or matrix for items that meet a certain condition.
All scores == 100.
All building_heights above a certain threshold.
All reaction_times above a certain cutoff.
You can think of this as applying a conditional statement to search a vector.
Using ==#
This will return a vector of True or False, indicating whether each index/element matches the condition.
## Scores
scores = np.array([100, 95, 100, 85])
## Which scores == 100?
scores == 100
array([ True, False, True, False])
## Select only scores == 100
scores[scores == 100]
array([100, 100])
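Because True counts as 1 and False as 0, a vector of True/False values can also be summarized directly. This sketch (with illustrative names mask, n_perfect, and proportion_perfect) counts how many scores equal 100 and what proportion that is.

```python
import numpy as np

scores = np.array([100, 95, 100, 85])
mask = scores == 100

# True counts as 1 and False as 0
n_perfect = mask.sum()
proportion_perfect = mask.mean()
print(n_perfect, proportion_perfect)
```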
Using np.where#
By default, this will return the indices in the initial array corresponding to the condition.
## Get indices
np.where(scores == 100)
(array([0, 2]),)
## Applying indices to vector
scores[np.where(scores == 100)]
array([100, 100])
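Beyond the default behavior, np.where also accepts two extra arguments: a value to use where the condition is True and a value to use where it is False. The sketch below uses made-up labels ("perfect" and "not perfect") to show this form.

```python
import numpy as np

scores = np.array([100, 95, 100, 85])

# np.where(condition, value_if_true, value_if_false)
labels = np.where(scores == 100, "perfect", "not perfect")
print(labels)
```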
Check-in#
Consider the following array of building_heights. How would you find out which buildings are taller than 50 feet?
building_heights = np.array([25, 45, 10, 60, 10, 85, 100])
### Your code here
Solution using >#
building_heights > 50
array([False, False, False, True, False, True, True])
building_heights[building_heights > 50]
array([ 60, 85, 100])
Solution using np.where#
## Get indices
np.where(building_heights > 50)
(array([3, 5, 6]),)
## Apply indices
building_heights[np.where(building_heights > 50)]
array([ 60, 85, 100])
Other useful functions#
numpy also has a host of other useful functions. For now, we’ll focus on:
Generating an array with either random numbers or ones or zeros.
Reshaping an array with reshape.
Initializing a random array#
numpy.random.rand(d1, ...) can be used to initialize an array with random numbers (drawn uniformly between 0 and 1) and dimensionality (d1, ...).
## Generates a 1-D vector with 10 elements
np.random.rand(10)
array([0.75713674, 0.66036241, 0.80925511, 0.6748789 , 0.34517568,
0.33181093, 0.31448705, 0.55857664, 0.26887019, 0.62232903])
## Generates a 2-D array with shape (2, 2)
np.random.rand(2, 2)
array([[0.56690475, 0.01237092],
[0.59620774, 0.1829824 ]])
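Because the numbers are random, your output will differ from the values shown here each time you run the cell. If you need repeatable results, one option is to set a seed first. The sketch below uses np.random.seed with an arbitrary seed value of 0.

```python
import numpy as np

# Setting the same seed before each call gives the same "random" numbers
np.random.seed(0)
first = np.random.rand(2, 2)

np.random.seed(0)
second = np.random.rand(2, 2)

print(np.array_equal(first, second))
# Values fall in the half-open interval [0, 1)
print(first.min() >= 0, first.max() < 1)
```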
Check-in#
Generate a random array with shape (3, 2), then calculate the mean of each column.
### Your code here
Solution#
r = np.random.rand(3, 2)
r
array([[0.41075628, 0.98461251],
[0.69721637, 0.78523541],
[0.08647911, 0.00251936]])
r.mean(axis = 0)
array([0.39815059, 0.59078909])
Initializing an array of ones or zeros#
This is like np.random.rand, but each element is either a 1 or 0.
Note that the shape is passed as a single tuple, e.g., (2, 2).
np.ones((2, 2))
array([[1., 1.],
[1., 1.]])
np.zeros((2, 2))
array([[0., 0.],
[0., 0.]])
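The outputs above show 1. and 0. because these arrays hold floats by default. If you want integers instead, one option is the dtype argument, sketched below with illustrative variable names.

```python
import numpy as np

float_zeros = np.zeros((2, 3))         # default dtype is float64
int_ones = np.ones((2, 3), dtype=int)  # ask for integers instead

print(float_zeros.dtype, int_ones.dtype)
print(int_ones)
```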
Using numpy.reshape#
Sometimes, a matrix or vector isn’t the right shape to perform a computation.
E.g., multiplying by another vector.
E.g., using it in a regression equation.
We can use np.reshape to reshape that array.
Example: turning a vector into a matrix#
# Create a (10, ) vector
og_array = np.ones(10)
og_array
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
# Reshape to (2, 5)
og_array.reshape((2, 5))
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
# Reshape to (5, 2)
og_array.reshape((5, 2))
array([[1., 1.],
[1., 1.],
[1., 1.],
[1., 1.],
[1., 1.]])
Dimensions must be compatible#
If you try to reshape into a shape that’s not compatible with the original size (i.e., the product of the new dimensions does not equal size), you’ll get an error.
# Try to reshape to (4, 4)
og_array.reshape((4, 4))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [60], in <cell line: 2>()
1 # Try to reshape to (4, 4)
----> 2 og_array.reshape((4, 4))
ValueError: cannot reshape array of size 10 into shape (4,4)
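To avoid this error, you can check whether a target shape holds exactly as many elements as the original array. The sketch below uses a hypothetical helper, can_reshape, written just for this example. It also shows that passing -1 for one dimension asks numpy to infer it from the array's size.

```python
import numpy as np

og_array = np.ones(10)

def can_reshape(array, new_shape):
    """Hypothetical helper: True if new_shape holds exactly array.size elements."""
    return int(np.prod(new_shape)) == array.size

print(can_reshape(og_array, (2, 5)))  # 2 * 5 = 10 elements
print(can_reshape(og_array, (4, 4)))  # 4 * 4 = 16 elements

# -1 asks numpy to infer that dimension from the array's size
inferred = og_array.reshape((2, -1))
print(inferred.shape)
```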
Conclusion#
This concludes our brief foray into numpy.
Now, you’ll be more familiar with:
Using numpy for vector arithmetic.
Basic summary statistics with numpy.
Working with multi-dimensional arrays.
Eventually, this will form the foundation for more advanced work in statistics, machine learning, and more.