Advanced operations with numpy#

Goals of this lecture#

This lecture will continue our discussion of the numpy package. We’ll discuss:

  • More practice with vector operations.

  • Descriptive statistics with vectors.

  • Working with matrices (multi-dimensional arrays).

  • Other useful functions.

import numpy as np

Practice with vectors#

  • numpy vectors make it easier to do all sorts of operations, such as arithmetic operations.

  • No more need to use for loops: we can do arithmetic on whole vectors the same way we do arithmetic on individual numbers.

    • numpy calculations are also much faster and more efficient!
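
To make the contrast with for loops concrete, here is a minimal sketch. The arrays x and y are made-up values for illustration, not data from the lecture.

```python
import numpy as np

# Hypothetical example values (not from the lecture)
x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])

# Loop version: multiply one pair of elements at a time
loop_result = []
for i in range(len(x)):
    loop_result.append(x[i] * y[i])

# Vectorized version: multiply the whole arrays at once
vector_result = x * y

print(vector_result)  # [ 10  40  90 160]
```

Both versions compute the same four products; the vectorized one is shorter and avoids the explicit loop.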

Practice problems#

Check-in#

Consider the two arrays below. How would you calculate the difference between each item in a and each item in b?

a = np.array([2, 8, 9, 10])
b = np.array([3, 7, 6, 10])
## Your code here

Solution#

a - b
array([-1,  1,  3,  0])

Check-in#

Consider the same two arrays as before. How would you:

  • Calculate the product of these arrays?

  • Calculate the sum of the elements in this new “product” array?

Note that this is also called the dot product.

### Your code here

Solution#

Multiplying two arrays element by element and then summing the result gives their dot product.

## First, calculate the product
product = a * b
product
array([  6,  56,  54, 100])
## Then, calculate the sum of this array
product.sum()
216
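
As a cross-check, numpy also provides np.dot, which should give the same value as the two-step calculation. This is a small sketch using the same a and b as the check-in.

```python
import numpy as np

a = np.array([2, 8, 9, 10])
b = np.array([3, 7, 6, 10])

# Two-step version: elementwise product, then sum
manual = (a * b).sum()

# One-step version with numpy's built-in dot product
built_in = np.dot(a, b)

print(manual, built_in)  # 216 216
```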

Descriptive statistics with numpy#

Descriptive statistics are ways to summarize and organize data.

A big advantage of numpy is that it has built-in functions to calculate various descriptive statistics:

  • The sum of a set of numbers.

  • The mean (or “average”) of a set of numbers.

  • The median (or “middle value”) of a set of numbers.

sum#

The sum of a set of numbers is simply the result of adding each number together.

The numpy package has a sum function built in, which makes it easier to calculate the sum of a vector.

## First, create an array with some numbers
v = np.array([5, 9, 10])
v
array([ 5,  9, 10])
## Now calculate the sum
v.sum()
24

mean#

The mean of a set of scores is the sum of those scores divided by the number of observations.

The numpy package also has a mean function built in. Two options:

  • array_name.mean()

  • np.mean(array_name)

## First, create an array with some numbers
v = np.array([5, 9, 10])
v
array([ 5,  9, 10])
## Now calculate the mean
v.mean()
8.0
## Or do it with numpy.mean
np.mean(v)
8.0
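
The definition of the mean can be checked by hand. The sketch below divides the sum by the number of observations and compares the result with v.mean(); the name manual_mean is just for illustration.

```python
import numpy as np

v = np.array([5, 9, 10])

# Mean by hand: sum of the scores divided by the number of observations
manual_mean = v.sum() / len(v)

print(manual_mean)              # 8.0
print(manual_mean == v.mean())  # True
```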

Check-in#

What would happen if we ran the following code?

v = np.array([1, 5, 9])
np.mean(v)
### Your code here

Solution#

This will also calculate the mean.

v = np.array([1, 5, 9])
np.mean(v)
5.0

Check-in#

What would happen if we ran the following code?

v = np.array(["a", "b", "a"])
v.mean()
### Your code here

Solution#

It will throw an error.

  • You cannot calculate the mean of a vector of str types.

  • The mean can only be calculated for interval/ratio data.

median#

The median of a set of scores is the middle score when those scores are arranged from least to greatest.

The numpy package also has a median function built in.

  • Syntax: np.median(array_name).

  • Unlike mean, you cannot call it as array_name.median().

## First, create an array with some numbers
v = np.array([5, 9, 10, 1, 20])
v
array([ 5,  9, 10,  1, 20])
## Now calculate the median
np.median(v)
9.0
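
To see where 9.0 comes from, the sketch below sorts the same vector and picks out the middle element by hand. This shortcut only applies when the vector has an odd number of elements.

```python
import numpy as np

v = np.array([5, 9, 10, 1, 20])

# Arrange the scores from least to greatest
sorted_v = np.sort(v)  # 1, 5, 9, 10, 20

# With an odd number of elements, there is a single middle element
middle = sorted_v[len(v) // 2]

print(middle)        # 9
print(np.median(v))  # 9.0
```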

Check-in#

What would be the median of the following vector?

v = np.array([1, 2, 5, 8])
### Your code here

Solution#

If the vector has an even number of elements, the median is the mean of the middle two elements.

## First, create an array with some numbers
v = np.array([1, 2, 5, 8])
v
array([1, 2, 5, 8])
## Use np.median
np.median(v)
3.5
## Equivalent to *mean* of 2 and 5
(2 + 5) / 2
3.5

Interim summary#

  • Descriptive statistics are a really useful way to summarize data.

  • Very valuable for both basic and applied research (e.g., in industry).

    • Examples: median salary, mean sales per fiscal quarter, mean reaction time on a psychophysics task, etc.

  • numpy makes this much easier.

    • Later, we’ll discuss how pandas (a way to represent data tables) uses these same functions.

Working with matrices#

A matrix is a rectangular array of data (i.e., a multi-dimensional array).

numpy is designed for representing and performing calculations with matrices.

  • A “vector” is just a one-dimensional matrix.

  • Many of the same operations we’ve discussed also apply to working with matrices.

Creating a matrix#

  • Matrices can be created just like vectors.

  • The key difference is that they contain nested lists.

md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
array([[1, 2, 5],
       [3, 4, 7]])
## This is a 2 by 3 matrix
md_array.shape
(2, 3)
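
Besides shape, two other array attributes are handy for inspecting a matrix: ndim (the number of dimensions) and size (the total number of elements). A short sketch on the same md_array:

```python
import numpy as np

md_array = np.array([[1, 2, 5], [3, 4, 7]])

print(md_array.ndim)   # 2  (rows and columns)
print(md_array.size)   # 6  (2 rows * 3 columns)
print(md_array.shape)  # (2, 3)
```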

Indexing into a matrix#

  • You can index into a matrix, just like with a vector.

  • A key difference is that you use multiple indices, for each dimension.

    • matrix_name[D1_index, D2_index, ...]

# This just returns the first *row*
md_array[0]
array([1, 2, 5])
# This returns the second element of the first row
md_array[0, 1]
2

Check-in#

How would you return the first element of the second row of md_array?

md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
### Your code here

Solution#

# Can use [1] to get second row
md_array[1]
array([3, 4, 7])
# Use [1, 0] to get first element of second row
md_array[1, 0]
3

Check-in#

A more challenging problem: how would you return the second column of md_array? How many observations should it have?

md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
### Your code here

Solution#

Retrieving a column uses the [:, column_index] syntax. Because md_array has two rows, each column has two observations.

## column 1
md_array[:, 0]
array([1, 3])
## column 2
md_array[:, 1]
array([2, 4])
## column 3
md_array[:, 2]
array([5, 7])

Check-in#

How would you retrieve the second and third element of the second row?

md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
### Your code here

Solution#

## First, get second row with md_array[1]
md_array[1]
array([3, 4, 7])
## Then, get second/third elements with slicing
## I.e., [1:3] syntax
md_array[1, 1:3]
array([4, 7])

Summary statistics with matrices#

  • When you call np.sum (or mean, etc.), you can specify which axis to calculate that statistic from.

  • axis = 0: calculate sum (or mean, etc.) of each column.

  • axis = 1: calculate sum (or mean, etc.) of each row.

## Calculate sum of each column
md_array.sum(axis = 0)
array([ 4,  6, 12])
## Calculate sum of each row
md_array.sum(axis = 1)
array([ 8, 14])
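
The axis argument works the same way for mean. The sketch below uses the same md_array and also shows what happens when axis is left out, in which case the statistic is computed over every element.

```python
import numpy as np

md_array = np.array([[1, 2, 5], [3, 4, 7]])

# axis = 0: one mean per column, so 3 values
col_means = md_array.mean(axis=0)

# axis = 1: one mean per row, so 2 values
row_means = md_array.mean(axis=1)

# No axis: a single mean over all 6 elements
overall_mean = md_array.mean()

print(col_means)        # [2. 3. 6.]
print(row_means.shape)  # (2,)
```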

Check-in#

How would you calculate the mean of each row of the following matrix?

m = np.array([[5, 10, 2],
            [20, 5, 100]])
### your code here

Solution#

m.mean(axis = 1)
array([ 5.66666667, 41.66666667])
m.mean(axis = 1).shape
(2,)

Check-in#

Suppose you have a 5x6 matrix (5 rows, 6 columns). If you calculated the mean of each column, what would the shape be of the resulting vector?

Solution#

The vector would have a shape of (6,), i.e., six observations.

  • There are six columns, so calculating the mean of each column would result in six observations.

Side note: arithmetic with matrices#

  • You can also perform arithmetic with matrices (e.g., addition, multiplication, etc.).

  • However, note that the matrices must have compatible dimensions (e.g., the same shape).
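
A minimal sketch of matrix arithmetic follows. The matrices m1, m2, and the vector bad are made-up values for illustration. Note that * multiplies element by element; it is not matrix multiplication.

```python
import numpy as np

# Hypothetical example matrices, both with shape (2, 3)
m1 = np.array([[1, 2, 5], [3, 4, 7]])
m2 = np.array([[10, 20, 30], [40, 50, 60]])

# Same shape: arithmetic is applied element by element
total = m1 + m2
product = m1 * m2

# Incompatible shapes: (2, 3) and (2,) cannot be combined
bad = np.array([1, 2])
try:
    m1 + bad
    compatible = True
except ValueError:
    compatible = False

print(total)
print(compatible)  # False
```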

Identifying the location of an item#

Often, you’ll need to search a vector or matrix for items that meet a certain condition.

  • All scores == 100.

  • All building_heights above a certain threshold.

  • All reaction_times above a certain cutoff.

You can think of this as applying a conditional statement to search a vector.

Using ==#

This will return a vector of True or False, indicating whether each index/element matches the condition.

## Scores
scores = np.array([100, 95, 100, 85])
## Which scores == 100?
scores == 100
array([ True, False,  True, False])
## Select only scores == 100
scores[scores == 100]
array([100, 100])

Using np.where#

When called with only a condition, np.where returns the indices of the elements that satisfy it, as a tuple with one index array per dimension.

## Get indices
np.where(scores == 100)
(array([0, 2]),)
## Applying indices to vector
scores[np.where(scores == 100)]
array([100, 100])
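
The tuple returned by np.where is easier to understand with a matrix, where it contains one array of row indices and one array of column indices. A sketch using the md_array from earlier (the names rows and cols are just for illustration):

```python
import numpy as np

md_array = np.array([[1, 2, 5], [3, 4, 7]])

# One index array per dimension: rows first, then columns
rows, cols = np.where(md_array > 3)

print(rows)                  # [0 1 1]
print(cols)                  # [2 1 2]
print(md_array[rows, cols])  # [5 4 7]
```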

Check-in#

Consider the following array of building_heights. How would you find out which buildings are taller than 50 feet?

building_heights = np.array([25, 45, 10, 60, 10, 85, 100])
### Your code here

Solution using a comparison (>)#

building_heights > 50
array([False, False, False,  True, False,  True,  True])
building_heights[building_heights > 50]
array([ 60,  85, 100])

Solution using np.where#

## Get indices
np.where(building_heights > 50)
(array([3, 5, 6]),)
## Apply indices
building_heights[np.where(building_heights > 50)]
array([ 60,  85, 100])

Other useful functions#

numpy also has a host of other useful functions. For now, we’ll focus on:

  • Generating an array with either random numbers or ones or zeros.

  • Reshaping an array with reshape.

Initializing a random array#

numpy.random.rand(d1, ...) can be used to initialize an array of shape (d1, ...) filled with random numbers drawn uniformly from the interval [0, 1).

## Generates a 1-D vector with 10 elements
np.random.rand(10)
array([0.75713674, 0.66036241, 0.80925511, 0.6748789 , 0.34517568,
       0.33181093, 0.31448705, 0.55857664, 0.26887019, 0.62232903])
## Generates a 2-D array (matrix) with shape (2, 2)
np.random.rand(2, 2)
array([[0.56690475, 0.01237092],
       [0.59620774, 0.1829824 ]])
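
Because the values are random, your output will differ from the numbers shown here. If you need the same numbers on every run, one option is to set a seed first with np.random.seed. The seed value 42 below is an arbitrary choice for illustration.

```python
import numpy as np

# Setting the same seed before each call reproduces the same numbers
np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)
second = np.random.rand(3)

print(np.array_equal(first, second))      # True
print(first.min() >= 0, first.max() < 1)  # True True
```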

Check-in#

Generate a random array with shape (3, 2), then calculate the mean of each column.

### Your code here

Solution#

r = np.random.rand(3, 2)
r
array([[0.41075628, 0.98461251],
       [0.69721637, 0.78523541],
       [0.08647911, 0.00251936]])
r.mean(axis = 0)
array([0.39815059, 0.59078909])

Initializing an array of ones or zeros#

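np.ones and np.zeros work like np.random.rand, but every element is 1 (for np.ones) or 0 (for np.zeros). Note that the shape is passed as a single tuple, e.g., (2, 2).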

np.ones((2, 2))
array([[1., 1.],
       [1., 1.]])
np.zeros((2, 2))
array([[0., 0.],
       [0., 0.]])
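
By default these arrays hold floats, which is why the output shows 1. and 0. rather than 1 and 0. The sketch below shows a non-square shape and the optional dtype argument; the variable names are just for illustration.

```python
import numpy as np

# A 2 by 3 matrix of ones (floats by default)
ones_matrix = np.ones((2, 3))

# A vector of four integer zeros
zeros_int = np.zeros(4, dtype=int)

print(ones_matrix.shape)  # (2, 3)
print(ones_matrix.sum())  # 6.0
print(zeros_int)          # [0 0 0 0]
```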

Using numpy.reshape#

Sometimes, a matrix or vector isn’t the right shape to perform a computation.

  • E.g., multiplying by another vector.

  • E.g., using it in a regression equation.

We can use np.reshape to reshape that array.

Example: turning a vector into a matrix#

# Create a (10, ) vector
og_array = np.ones(10)
og_array
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
# Reshape to (2, 5)
og_array.reshape((2, 5))
array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])
# Reshape to (5, 2)
og_array.reshape((5, 2))
array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

Dimensions must be compatible#

If you try to reshape into a shape that’s not compatible with the original size (i.e., the product of the new dimensions does not equal the number of elements), you’ll get an error.

# Try to reshape to (4, 4)
og_array.reshape((4, 4))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [60], in <cell line: 2>()
      1 # Try to reshape to (4, 4)
----> 2 og_array.reshape((4, 4))

ValueError: cannot reshape array of size 10 into shape (4,4)
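
One way to avoid this error is to check the sizes before reshaping, or to let numpy infer one of the dimensions by passing -1. The sketch below assumes the same og_array of ten ones; new_shape and inferred are illustrative names.

```python
import numpy as np

og_array = np.ones(10)

# A (4, 4) matrix needs 16 elements, but og_array only has 10
new_shape = (4, 4)
print(og_array.size)                        # 10
print(np.prod(new_shape) == og_array.size)  # False

# Passing -1 asks numpy to work out that dimension from the size
inferred = og_array.reshape((2, -1))
print(inferred.shape)  # (2, 5)
```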

Conclusion#

This concludes our brief foray into numpy.

Now, you’ll be more familiar with:

  • Using numpy for vector arithmetic.

  • Basic summary statistics with numpy.

  • Working with multi-dimensional arrays.

Eventually, this will form the foundation for more advanced work in statistics, machine learning, and more.