As an example, here is an implementation of the classic quicksort algorithm in
Python:

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) / 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

print quicksort([3,6,8,10,1,2,1])
# Prints "[1, 1, 2, 3, 6, 8, 10]"

Numbers: Integers and floats work as you would expect from other languages:

x = 3
print type(x)  # Prints "<type 'int'>"
print x        # Prints "3"
print x + 1    # Addition; prints "4"
print x - 1    # Subtraction; prints "2"
print x * 2    # Multiplication; prints "6"
print x ** 2   # Exponentiation; prints "9"
x += 1
print x  # Prints "4"
x *= 2
print x  # Prints "8"
y = 2.5
print type(y)  # Prints "<type 'float'>"
print y, y + 1, y * 2, y ** 2  # Prints "2.5 3.5 5.0 6.25"

Booleans:
t = True
f = False
print type(t) # Prints "<type 'bool'>"
print t and f # Logical AND; prints "False"
print t or f # Logical OR; prints "True"
print not t # Logical NOT; prints "False"
print t != f # Logical XOR; prints "True"
Strings:
hello = 'hello' # String literals can use single quotes
world = "world" # or double quotes; it does not matter.
print hello       # Prints "hello"
print len(hello) # String length; prints "5"
hw = hello + ' ' + world # String concatenation
print hw # prints "hello world"

hw12 = '%s %s %d' % (hello, world, 12) # sprintf style string formatting
print hw12 # prints "hello world 12"

String objects have a bunch of useful methods; for example:


s = "hello"
print s.capitalize() # Capitalize a string; prints "Hello"
print s.upper()
# Convert a string to uppercase; prints "HELLO"
print s.rjust(7) # Right-justify a string, padding with spaces; prints " hello"
print s.center(7) # Center a string, padding with spaces; prints " hello "
print s.replace('l', '(ell)') # Replace all instances of one substring with another;
# prints "he(ell)(ell)o"
print ' world '.strip() # Strip leading and trailing whitespace; prints "world"

xs = [3, 1, 2]   # Create a list
print xs, xs[2]  # Prints "[3, 1, 2] 2"
print xs[-1]     # Negative indices count from the end of the list; prints "2"
xs[2] = 'foo'    # Lists can contain elements of different types
print xs         # Prints "[3, 1, 'foo']"
xs.append('bar') # Add a new element to the end of the list
print xs         # Prints "[3, 1, 'foo', 'bar']"
x = xs.pop()     # Remove and return the last element of the list
print x, xs      # Prints "bar [3, 1, 'foo']"


nums = range(5)     # range is a built-in function that creates a list of integers
print nums          # Prints "[0, 1, 2, 3, 4]"
print nums[2:4]     # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print nums[2:]      # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print nums[:2]      # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print nums[:]       # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print nums[:-1]     # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9]  # Assign a new sublist to a slice
print nums          # Prints "[0, 1, 8, 9, 4]"
We will see slicing again in the context of numpy arrays.


animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print animal
# Prints "cat", "dog", "monkey", each on its own line.

If you want access to the index of each element within the body of a loop, use the built-in
enumerate function:

animals = ['cat', 'dog', 'monkey']
for idx, animal in enumerate(animals):
    print '#%d: %s' % (idx + 1, animal)
# Prints "#1: cat", "#2: dog", "#3: monkey", each on its own line


As a simple example, consider the following code that computes square numbers:

nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)
print squares  # Prints "[0, 1, 4, 9, 16]"

You can make this code simpler using a list comprehension:

nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]
print squares  # Prints "[0, 1, 4, 9, 16]"

List comprehensions can also contain conditions:

nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print even_squares  # Prints "[0, 4, 16]"

You can use it like this:

d = {'cat': 'cute', 'dog': 'furry'}  # Create a new dictionary with some data
print d['cat']       # Get an entry from a dictionary; prints "cute"
print 'cat' in d     # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet'    # Set an entry in a dictionary
print d['fish']      # Prints "wet"
# print d['monkey']  # KeyError: 'monkey' not a key of d
print d.get('monkey', 'N/A')  # Get an element with a default; prints "N/A"
print d.get('fish', 'N/A')    # Get an element with a default; prints "wet"
del d['fish']        # Remove an element from a dictionary
print d.get('fish', 'N/A')    # "fish" is no longer a key; prints "N/A"


Loops: It is easy to iterate over the keys in a dictionary:

d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print 'A %s has %d legs' % (animal, legs)
# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs"

If you want access to keys and their corresponding values, use the iteritems method:

d = {'person': 2, 'cat': 4, 'spider': 8}
for animal, legs in d.iteritems():
    print 'A %s has %d legs' % (animal, legs)
# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs"

Dictionary comprehensions: These are similar to list comprehensions, but allow you to easily
construct dictionaries. For example:

nums = [0, 1, 2, 3, 4]
even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}
print even_num_to_square  # Prints "{0: 0, 2: 4, 4: 16}"


As a simple example, consider the following:

animals = {'cat', 'dog'}
print 'cat' in animals   # Check if an element is in a set; prints "True"
print 'fish' in animals  # prints "False"
animals.add('fish')      # Add an element to a set
print 'fish' in animals  # Prints "True"
print len(animals)       # Number of elements in a set; prints "3"
animals.add('cat')       # Adding an element that is already in the set does nothing
print len(animals)       # Prints "3"
animals.remove('cat')    # Remove an element from a set
print len(animals)       # Prints "2"


As usual, everything you want to know about sets can be found in the documentation.
Loops: Iterating over a set has the same syntax as iterating over a list; however, since sets are
unordered, you cannot make assumptions about the order in which you visit the elements of the
set:

animals = {'cat', 'dog', 'fish'}
for idx, animal in enumerate(animals):
    print '#%d: %s' % (idx + 1, animal)
# Prints "#1: fish", "#2: dog", "#3: cat"

Set comprehensions: Like lists and dictionaries, we can easily construct sets using set
comprehensions:

from math import sqrt
nums = {int(sqrt(x)) for x in range(30)}
print nums  # Prints "set([0, 1, 2, 3, 4, 5])"


Here is a trivial example:

d = {(x, x + 1): x for x in range(10)}  # Create a dictionary with tuple keys
t = (5, 6)       # Create a tuple
print type(t)    # Prints "<type 'tuple'>"
print d[t]       # Prints "5"
print d[(1, 2)]  # Prints "1"


For example:

def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print sign(x)
# Prints "negative", "zero", "positive"


We will often define functions to take optional keyword arguments, like this:

def hello(name, loud=False):
    if loud:
        print 'HELLO, %s!' % name.upper()
    else:
        print 'Hello, %s!' % name

hello('Bob')              # Prints "Hello, Bob!"
hello('Fred', loud=True)  # Prints "HELLO, FRED!"


class Greeter:

    # Constructor
    def __init__(self, name):
        self.name = name  # Create an instance variable

    # Instance method
    def greet(self, loud=False):
        if loud:
            print 'HELLO, %s!' % self.name.upper()
        else:
            print 'Hello, %s' % self.name

g = Greeter('Fred')  # Construct an instance of the Greeter class
g.greet()            # Call an instance method; prints "Hello, Fred"
g.greet(loud=True)   # Call an instance method; prints "HELLO, FRED!"


We can initialize numpy arrays from nested Python lists, and access elements using square
brackets:

import numpy as np

a = np.array([1, 2, 3])  # Create a rank 1 array
print type(a)            # Prints "<type 'numpy.ndarray'>"
print a.shape            # Prints "(3,)"
print a[0], a[1], a[2]   # Prints "1 2 3"
a[0] = 5                 # Change an element of the array
print a                  # Prints "[5 2 3]"

b = np.array([[1,2,3],[4,5,6]])  # Create a rank 2 array
print b.shape                    # Prints "(2, 3)"
print b[0, 0], b[0, 1], b[1, 0]  # Prints "1 2 4"


Numpy also provides many functions to create arrays:

import numpy as np

a = np.zeros((2,2))  # Create an array of all zeros
print a              # Prints "[[ 0.  0.]
                     #          [ 0.  0.]]"

b = np.ones((1,2))   # Create an array of all ones
print b              # Prints "[[ 1.  1.]]"

c = np.full((2,2), 7)  # Create a constant array
print c                # Prints "[[ 7.  7.]
                       #          [ 7.  7.]]"

d = np.eye(2)        # Create a 2x2 identity matrix
print d              # Prints "[[ 1.  0.]
                     #          [ 0.  1.]]"

e = np.random.random((2,2))  # Create an array filled with random values
print e                      # Might print "[[ 0.91940167  0.08143941]
                             #               [ 0.68744134  0.87236687]]"

Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

import numpy as np

# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print a[0, 1]  # Prints "2"
b[0, 0] = 77   # b[0, 0] is the same piece of data as a[0, 1]
print a[0, 1]  # Prints "77"

You can also mix integer indexing with slice indexing. However, doing so will yield an array of
lower rank than the original array. Note that this is quite different from the way that MATLAB
handles array slicing:

import numpy as np

# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Two ways of accessing the data in the middle row of the array.
# Mixing integer indexing with slices yields an array of lower rank,
# while using only slices yields an array of the same rank as the
# original array:
row_r1 = a[1, :]    # Rank 1 view of the second row of a
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
print row_r1, row_r1.shape  # Prints "[5 6 7 8] (4,)"
print row_r2, row_r2.shape  # Prints "[[5 6 7 8]] (1, 4)"

# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print col_r1, col_r1.shape  # Prints "[ 2  6 10] (3,)"
print col_r2, col_r2.shape  # Prints "[[ 2]
                            #          [ 6]
                            #          [10]] (3, 1)"

Here is an example:

import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

# An example of integer array indexing.
# The returned array will have shape (3,):
print a[[0, 1, 2], [0, 1, 0]]  # Prints "[1 4 5]"

# The above example of integer array indexing is equivalent to this:
print np.array([a[0, 0], a[1, 1], a[2, 0]])  # Prints "[1 4 5]"

# When using integer array indexing, you can reuse the same
# element from the source array:
print a[[0, 0], [1, 1]]  # Prints "[2 2]"

# Equivalent to the previous integer array indexing example
print np.array([a[0, 1], a[0, 1]])  # Prints "[2 2]"

Here is an example:

import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of Booleans of the same
                    # shape as a, where each slot of bool_idx tells
                    # whether that element of a is > 2.
print bool_idx      # Prints "[[False False]
                    #          [ True  True]
                    #          [ True  True]]"

# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print a[bool_idx]  # Prints "[3 4 5 6]"

# We can do all of the above in a single concise statement:
print a[a > 2]     # Prints "[3 4 5 6]"

Here is an example:

import numpy as np

x = np.array([1, 2])  # Let numpy choose the datatype
print x.dtype         # Prints "int64"

x = np.array([1.0, 2.0])  # Let numpy choose the datatype
print x.dtype             # Prints "float64"

x = np.array([1, 2], dtype=np.int64)  # Force a particular datatype
print x.dtype                         # Prints "int64"
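
If you already have an array, you can also convert it to a different datatype after the fact. A
minimal sketch using the standard astype method (the variable names here are only for illustration):

import numpy as np

x = np.array([1, 2], dtype=np.int64)
y = x.astype(np.float64)  # Copy the array, casting each element to float64
print y.dtype             # Prints "float64"
print y                   # Prints "[ 1.  2.]"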


import numpy as np

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print x + y
print np.add(x, y)

# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print x - y
print np.subtract(x, y)

# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print x * y
print np.multiply(x, y)

# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print x / y
print np.divide(x, y)

# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print np.sqrt(x)

import numpy as np

x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print v.dot(w)
print np.dot(v, w)

# Matrix / vector product; both produce the rank 1 array [29 67]
print x.dot(v)
print np.dot(x, v)

# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print x.dot(y)
print np.dot(x, y)

import numpy as np
x = np.array([[1,2],[3,4]])
print np.sum(x) # Compute sum of all elements; prints "10"
print np.sum(x, axis=0) # Compute sum of each column; prints "[4 6]"
print np.sum(x, axis=1) # Compute sum of each row; prints "[3 7]"


import numpy as np

x = np.array([[1,2], [3,4]])
print x    # Prints "[[1 2]
           #          [3 4]]"
print x.T  # Prints "[[1 3]
           #          [2 4]]"

# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])
print v    # Prints "[1 2 3]"
print v.T  # Prints "[1 2 3]"
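
If you actually need a column vector, reshape the rank 1 array instead of transposing it. A small
sketch using the standard reshape method and np.newaxis (the variable names are only illustrative):

import numpy as np

v = np.array([1,2,3])
col = v.reshape(3, 1)  # Reshape v into a (3, 1) column vector
print col.shape        # Prints "(3, 1)"
print v[:, np.newaxis].shape  # np.newaxis does the same thing; prints "(3, 1)"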


import numpy as np

# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = np.empty_like(x)  # Create an empty matrix with the same shape as x

# Add the vector v to each row of the matrix x with an explicit loop
for i in range(4):
    y[i, :] = x[i, :] + v

# Now y is the following
# [[ 2  2  4]
#  [ 5  5  7]
#  [ 8  8 10]
#  [11 11 13]]
print y

This works; however, when the matrix x is very large, computing an explicit loop in Python could
be slow. Note that adding the vector v to each row of the matrix x is equivalent to forming a
matrix vv by stacking multiple copies of v vertically, then performing elementwise summation of x
and vv. We could implement this approach like this:

import numpy as np

# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
vv = np.tile(v, (4, 1))  # Stack 4 copies of v on top of each other
print vv                 # Prints "[[1 0 1]
                         #          [1 0 1]
                         #          [1 0 1]
                         #          [1 0 1]]"
y = x + vv  # Add x and vv elementwise
print y     # Prints "[[ 2  2  4]
            #          [ 5  5  7]
            #          [ 8  8 10]
            #          [11 11 13]]"

Numpy broadcasting allows us to perform this computation without actually creating
multiple copies of v. Consider this version, using broadcasting:

import numpy as np

# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
y = x + v  # Add v to each row of x using broadcasting
print y    # Prints "[[ 2  2  4]
           #          [ 5  5  7]
           #          [ 8  8 10]
           #          [11 11 13]]"

The line y = x + v works even though x has shape (4, 3) and v has shape (3,) due to
broadcasting; this line works as if v actually had shape (4, 3), where each row was a
copy of v, and the sum was performed elementwise.
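
If you want to check what shape two arrays will broadcast to without performing the arithmetic,
numpy's np.broadcast object can tell you. A quick sketch, reusing x and v from above:

import numpy as np

x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
b = np.broadcast(x, v)  # Object describing how x and v broadcast together
print b.shape           # Prints "(4, 3)"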


There are currently more than 60 universal functions defined in numpy on one or more types,
covering a wide variety of operations. Some of these ufuncs are called automatically on arrays
when the relevant infix notation is used (e.g., add(a, b) is called internally when a + b is written
and a or b is an ndarray). Nevertheless, you may still want to use the ufunc call in order to use the
optional output argument(s) to place the output(s) in an object (or objects) of your choice.
Recall that each ufunc operates element-by-element. Therefore, each ufunc will be described as if
acting on a set of scalar inputs to return a set of scalar outputs.
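
For instance, here is a small sketch of calling a ufunc explicitly with its optional out argument
(np.add is the ufunc behind the + operator; the array names are only illustrative):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
out = np.empty_like(a)  # Preallocate a destination array
np.add(a, b, out)       # Same result as a + b, but written into out
print out               # Prints "[ 5.  7.  9.]"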


Math operations
add(x1, x2[, out])           Add arguments element-wise.
subtract(x1, x2[, out])      Subtract arguments, element-wise.
multiply(x1, x2[, out])      Multiply arguments element-wise.
divide(x1, x2[, out])        Divide arguments element-wise.
logaddexp(x1, x2[, out])     Logarithm of the sum of exponentiations of the inputs.
logaddexp2(x1, x2[, out])    Logarithm of the sum of exponentiations of the inputs in base-2.
true_divide(x1, x2[, out])   Returns a true division of the inputs, element-wise.
floor_divide(x1, x2[, out])  Return the largest integer smaller or equal to the division of the inputs.
negative(x[, out])           Numerical negative, element-wise.
power(x1, x2[, out])         First array elements raised to powers from second array, element-wise.
remainder(x1, x2[, out])     Return element-wise remainder of division.
mod(x1, x2[, out])           Return element-wise remainder of division.
fmod(x1, x2[, out])          Return the element-wise remainder of division.


absolute(x[, out])    Calculate the absolute value element-wise.
rint(x[, out])        Round elements of the array to the nearest integer.
sign(x[, out])        Returns an element-wise indication of the sign of a number.
conj(x[, out])        Return the complex conjugate, element-wise.
exp(x[, out])         Calculate the exponential of all elements in the input array.
exp2(x[, out])        Calculate 2**p for all p in the input array.
log(x[, out])         Natural logarithm, element-wise.
log2(x[, out])        Base-2 logarithm of x.
log10(x[, out])       Return the base 10 logarithm of the input array, element-wise.
expm1(x[, out])       Calculate exp(x) - 1 for all elements in the array.
log1p(x[, out])       Return the natural logarithm of one plus the input array, element-wise.
sqrt(x[, out])        Return the positive square-root of an array, element-wise.
square(x[, out])      Return the element-wise square of the input.
reciprocal(x[, out])  Return the reciprocal of the argument, element-wise.
ones_like(a[, dtype, order, subok])  Return an array of ones with the same shape and type as a given array.
Tip
The optional output arguments can be used to help you save memory for large
calculations. If your arrays are large, complicated expressions can take longer than
absolutely necessary due to the creation and (later) destruction of temporary
calculation spaces. For example, the expression G = A * B + C is equivalent to
T1 = A * B; G = T1 + C; del T1. It will execute more quickly as G = A * B; add(G, C, G),
which is the same as G = A * B; G += C.
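
A minimal sketch of that pattern (the array names mirror the tip above and are only illustrative):

import numpy as np

A = np.random.random((1000, 1000))
B = np.random.random((1000, 1000))
C = np.random.random((1000, 1000))

# Naive version: allocates a temporary for A * B, then another for the sum
G = A * B + C

# In-place version: reuses G as the output buffer, avoiding one temporary
G = A * B
np.add(G, C, G)  # Equivalent to G += C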


Trigonometric functions
All trigonometric functions use radians when an angle is called for. The ratio of degrees to radians
is 180°/π.
sin(x[, out])           Trigonometric sine, element-wise.
cos(x[, out])           Cosine element-wise.
tan(x[, out])           Compute tangent element-wise.
arcsin(x[, out])        Inverse sine, element-wise.
arccos(x[, out])        Trigonometric inverse cosine, element-wise.
arctan(x[, out])        Trigonometric inverse tangent, element-wise.
arctan2(x1, x2[, out])  Element-wise arc tangent of x1/x2 choosing the quadrant correctly.
hypot(x1, x2[, out])    Given the legs of a right triangle, return its hypotenuse.
sinh(x[, out])          Hyperbolic sine, element-wise.
cosh(x[, out])          Hyperbolic cosine, element-wise.
tanh(x[, out])          Compute hyperbolic tangent element-wise.
arcsinh(x[, out])       Inverse hyperbolic sine element-wise.
arccosh(x[, out])       Inverse hyperbolic cosine, element-wise.
arctanh(x[, out])       Inverse hyperbolic tangent element-wise.
deg2rad(x[, out])       Convert angles from degrees to radians.
rad2deg(x[, out])       Convert angles from radians to degrees.
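
A tiny sketch of a few of these in use (the input values are chosen only for illustration):

import numpy as np

angles_deg = np.array([0.0, 90.0, 180.0])
angles_rad = np.deg2rad(angles_deg)  # Convert degrees to radians first
print np.sin(angles_rad)             # Prints approximately "[ 0.  1.  0.]"
print np.hypot(3.0, 4.0)             # Hypotenuse of a 3-4-5 right triangle; prints "5.0"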


Bit-twiddling functions
These functions all require integer arguments and they manipulate the bit-pattern of those
arguments.
bitwise_and(x1, x2[, out])  Compute the bit-wise AND of two arrays element-wise.
bitwise_or(x1, x2[, out])   Compute the bit-wise OR of two arrays element-wise.
bitwise_xor(x1, x2[, out])  Compute the bit-wise XOR of two arrays element-wise.
invert(x[, out])            Compute bit-wise inversion, or bit-wise NOT, element-wise.
left_shift(x1, x2[, out])   Shift the bits of an integer to the left.
right_shift(x1, x2[, out])  Shift the bits of an integer to the right.
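
For example, a short sketch (the infix operators &, |, ^, <<, and >> call these same ufuncs on
integer arrays):

import numpy as np

a = np.array([0b1100, 0b1010])  # [12, 10]
b = np.array([0b1010, 0b0110])  # [10, 6]
print np.bitwise_and(a, b)      # Prints "[8 2]" (0b1000, 0b0010)
print np.left_shift(a, 1)       # Same as a << 1; prints "[24 20]"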
Comparison functions
greater(x1, x2[, out])        Return the truth value of (x1 > x2) element-wise.
greater_equal(x1, x2[, out])  Return the truth value of (x1 >= x2) element-wise.
less(x1, x2[, out])           Return the truth value of (x1 < x2) element-wise.
less_equal(x1, x2[, out])     Return the truth value of (x1 <= x2) element-wise.
not_equal(x1, x2[, out])      Return (x1 != x2) element-wise.
equal(x1, x2[, out])          Return (x1 == x2) element-wise.
logical_and(x1, x2[, out])    Compute the truth value of x1 AND x2 element-wise.
logical_or(x1, x2[, out])     Compute the truth value of x1 OR x2 element-wise.
logical_xor(x1, x2[, out])    Compute the truth value of x1 XOR x2, element-wise.
logical_not(x[, out])         Compute the truth value of NOT x element-wise.
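
A quick sketch of the comparison and logical ufuncs (these back the infix operators >, ==, and so
on; the arrays are only illustrative):

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([4, 3, 2, 1])
print np.greater(a, b)              # Same as a > b; prints "[False False  True  True]"
print np.logical_and(a > 1, b > 1)  # Prints "[False  True  True False]"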


Floating functions
Recall that all of these functions work element-by-element over an array, returning an array
output. The description details only a single operation.
isreal(x)                 Returns a bool array, where True if input element is real.
iscomplex(x)              Returns a bool array, where True if input element is complex.
isfinite(x[, out])        Test element-wise for finiteness (not infinity or not Not a Number).
isinf(x[, out])           Test element-wise for positive or negative infinity.
isnan(x[, out])           Test element-wise for NaN and return result as a boolean array.
signbit(x[, out])         Returns element-wise True where signbit is set (less than zero).
copysign(x1, x2[, out])   Change the sign of x1 to that of x2, element-wise.
nextafter(x1, x2[, out])  Return the next floating-point value after x1 towards x2, element-wise.
modf(x[, out1, out2])     Return the fractional and integral parts of an array, element-wise.
ldexp(x1, x2[, out])      Returns x1 * 2**x2, element-wise.
frexp(x[, out1, out2])    Decompose the elements of x into mantissa and twos exponent.
fmod(x1, x2[, out])       Return the element-wise remainder of division.
floor(x[, out])           Return the floor of the input, element-wise.
ceil(x[, out])            Return the ceiling of the input, element-wise.
trunc(x[, out])           Return the truncated value of the input, element-wise.
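
For instance, a small sketch covering NaN testing, rounding, and splitting fractional parts (the
input values are arbitrary):

import numpy as np

x = np.array([1.7, -2.3, np.nan])
print np.isnan(x)                # Prints "[False False  True]"
print np.floor(x)                # Prints approximately "[ 1. -3. nan]"
print np.modf(np.array([2.75]))  # Prints "(array([ 0.75]), array([ 2.]))"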


Here are some applications of broadcasting:

import numpy as np

# Compute outer product of vectors
v = np.array([1,2,3])  # v has shape (3,)
w = np.array([4,5])    # w has shape (2,)
# To compute an outer product, we first reshape v to be a column
# vector of shape (3, 1); we can then broadcast it against w to yield
# an output of shape (3, 2), which is the outer product of v and w:
# [[ 4  5]
#  [ 8 10]
#  [12 15]]
print np.reshape(v, (3, 1)) * w

# Add a vector to each row of a matrix
x = np.array([[1,2,3], [4,5,6]])
# x has shape (2, 3) and v has shape (3,) so they broadcast to (2, 3),
# giving the following matrix:
# [[2 4 6]
#  [5 7 9]]
print x + v

# Add a vector to each column of a matrix
# x has shape (2, 3) and w has shape (2,).
# If we transpose x then it has shape (3, 2) and can be broadcast
# against w to yield a result of shape (3, 2); transposing this result
# yields the final result of shape (2, 3) which is the matrix x with
# the vector w added to each column. Gives the following matrix:
# [[ 5  6  7]
#  [ 9 10 11]]
print (x.T + w).T

# Another solution is to reshape w to be a column vector of shape (2, 1);
# we can then broadcast it directly against x to produce the same
# output.
print x + np.reshape(w, (2, 1))

# Multiply a matrix by a constant:
# x has shape (2, 3). Numpy treats scalars as arrays of shape ();
# these can be broadcast together to shape (2, 3), producing the
# following array:
# [[ 2  4  6]
#  [ 8 10 12]]
print x * 2

Here is a simple example:

import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()  # You must call plt.show() to make graphics appear.

With just a little bit of extra work we can easily plot multiple lines at once, and add a title, legend,
and axis labels:

import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()

Here is an example:

import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()

Here is an example:

import numpy as np
from scipy.misc import imread, imresize
import matplotlib.pyplot as plt

img = imread('assets/cat.jpg')
img_tinted = img * [1, 0.95, 0.9]

# Show the original image
plt.subplot(1, 2, 1)
plt.imshow(img)

# Show the tinted image
plt.subplot(1, 2, 2)

# A slight gotcha with imshow is that it might give strange results
# if presented with data that is not uint8. To work around this, we
# explicitly cast the image to uint8 before displaying it.
plt.imshow(np.uint8(img_tinted))
plt.show()

PANDAS

Lesson 1
Create Data - We begin by creating our own data set for analysis. This prevents the end user
reading this tutorial from having to download any files to replicate the results below. We will export
this data set to a text file so that you can get some experience pulling data from a text file.
Get Data - We will learn how to read in the text file. The data consist of baby names and the number
of babies born with each name in the year 1880.
Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I
mean we will take a look inside the contents of the text file and look for any anomalies. These can
include missing data, inconsistencies in the data, or any other data that seems out of place. If any
are found we will then have to make decisions on what to do with these records.
Analyze Data - We will simply find the most popular name in a specific year.
Present Data - Through tabular data and a graph, clearly show the end user what is the most
popular name in a specific year.
The pandas library is used for all the data analysis excluding a small piece of the data presentation
section. The matplotlib library will only be needed for the data presentation section. Importing the
libraries is the first step we will take in the lesson.

# Import all libraries needed for the tutorial

# General syntax to import specific functions in a library:
## from (library) import (specific library function)
from pandas import DataFrame, read_csv

# General syntax to import a library but no functions:
## import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd  # this is how I usually import pandas
import sys           # only needed to determine Python version number

# Enable inline plotting
%matplotlib inline

print 'Python version ' + sys.version
print 'Pandas version ' + pd.__version__


Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version 0.15.2

Create Data
The data set will consist of 5 baby names and the number of births recorded for that year (1880).

# The initial set of baby names and birth rates


names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

To merge these two lists together we will use the zip function.

zip?

BabyDataSet = zip(names,births)
BabyDataSet
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]

We are basically done creating the data set. We now will use the pandas library to export this data
set into a csv file.
df will be a DataFrame object. You can think of this object holding the contents of the BabyDataSet
in a format similar to a sql table or an excel spreadsheet. Let's take a look below at the contents
inside df.
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df

     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973
Export the dataframe to a csv file. We can name the file births1880.csv. The function to_csv will be
used to export the file. The file will be saved in the same location of the notebook unless specified
otherwise.
In [7]:
df.to_csv?

The only parameters we will use are index and header. Setting these parameters to False will prevent
the index and header names from being exported. Change the values of these parameters to get a
better understanding of their use.
In [8]:
df.to_csv('births1880.csv',index=False,header=False)

Get Data
To pull in the csv file, we will use the pandas function read_csv. Let us take a look at this function
and what inputs it takes.
In [9]:
read_csv?

Even though this function has many parameters, we will simply pass it the location of the text file.
Location = C:\Users\ENTER_USER_NAME.xy\startups\births1880.csv
Note: Depending on where you save your notebooks, you may need to modify the location above.
In [10]:
Location = r'C:\Users\david\notebooks\pandas\births1880.csv'
df = pd.read_csv(Location)

Notice the r before the string. Since backslashes are special characters in Python strings, prefixing
the string with an r makes it a raw string, so the backslashes are not treated as escape characters.
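
A tiny sketch of the difference (the path here is made up purely for illustration):

print 'C:\new'   # \n is interpreted as a newline; prints "C:" then "ew" on a new line
print r'C:\new'  # The raw string keeps the backslash; prints "C:\new"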
In [11]:
df

Out[11]:
       Bob  968
0  Jessica  155
1     Mary   77
2     John  578
3      Mel  973
This brings us to our first problem of the exercise. The read_csv function treated the first record in
the csv file as the header names. This is obviously not correct since the text file did not provide us
with header names.
To correct this we will pass the header parameter to the read_csv function and set it to None
(means null in python).
In [12]:
df = pd.read_csv(Location, header=None)
df

Out[12]:
         0    1
0      Bob  968
1  Jessica  155
2     Mary   77
3     John  578
4      Mel  973
If we wanted to give the columns specific names, we would have to pass another parameter called
names. We can also omit the header parameter.
In [13]:
df = pd.read_csv(Location, names=['Names','Births'])
df

Out[13]:
     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973
You can think of the numbers [0,1,2,3,4] as the row numbers in an Excel file. In pandas these are
part of the index of the dataframe. You can think of the index as the primary key of a sql table, with
the exception that an index is allowed to have duplicates.
[Names, Births] can be thought of as column headers similar to the ones found in an Excel
spreadsheet or sql database.
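
For instance, a quick sketch of inspecting the index and columns (this assumes the df read above;
the exact repr depends on your pandas version):

print df.index    # e.g. "Int64Index([0, 1, 2, 3, 4], dtype='int64')"
print df.columns  # e.g. "Index([u'Names', u'Births'], dtype='object')"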

Delete the csv file now that we are done using it.
In [14]:
import os
os.remove(Location)

Prepare Data
The data we have consists of baby names and the number of births in the year 1880. We already
know that we have 5 records and none of the records are missing (non-null values).
The Names column at this point is of no concern since it most likely is just composed of alpha
numeric strings (baby names). There is a chance of bad data in this column but we will not worry
about that at this point of the analysis. The Births column should just contain integers representing
the number of babies born in a specific year with a specific name. We can check that all the data is
of the data type integer. It would not make sense to have this column have a data type of float. I
would not worry about any possible outliers at this point of the analysis.
Realize that aside from the check we did on the "Names" column, briefly looking at the data inside
the dataframe should be as far as we need to go at this stage of the game. As we continue in the
data analysis life cycle we will have plenty of opportunities to find any issues with the data set.
In [15]:
# Check data type of the columns
df.dtypes

Out[15]:
Names     object
Births     int64
dtype: object

In [16]:
# Check data type of Births column
df.Births.dtype

Out[16]:
dtype('int64')

As you can see the Births column is of type int64, thus no floats (decimal numbers) or alpha numeric
characters will be present in this column.

Analyze Data
To find the most popular name or the baby name with the highest birth rate, we can do one of the
following.

Sort the dataframe and select the top row
Use the max() attribute to find the maximum value
In [17]:

# Method 1:
Sorted = df.sort(['Births'], ascending=False)

Sorted.head(1)

Out[17]:
  Names  Births
4   Mel     973
In [18]:
# Method 2:
df['Births'].max()

Out[18]:
973

Present Data
Here we can plot the Births column and label the graph to show the end user the highest point on
the graph. In conjunction with the table, the end user has a clear picture that Mel is the most popular
baby name in the data set.
plot() is a convenient attribute where pandas lets you painlessly plot the data in your dataframe. We
learned how to find the maximum value of the Births column in the previous section. Now finding the
actual baby name that goes with the 973 value looks a bit tricky, so let's go over it.
Explain the pieces:
df['Names'] - This is the entire list of baby names, the entire Names column
df['Births'] - This is the entire list of Births in the year 1880, the entire Births column
df['Births'].max() - This is the maximum value found in the Births column
[df['Births'] == df['Births'].max()] IS EQUAL TO [Find all of the records in the Births column where it is
equal to 973]
df['Names'][df['Births'] == df['Births'].max()] IS EQUAL TO Select all of the records in the Names
column WHERE [The Births column is equal to 973]
An alternative way could have been to use the Sorted dataframe:
Sorted['Names'].head(1).values
The str() function simply converts an object into a string.
In [19]:
# Create graph
df['Births'].plot()

# Maximum value in the data set
MaxValue = df['Births'].max()

# Name associated with the maximum value
MaxName = df['Names'][df['Births'] == df['Births'].max()].values

# Text to display on graph
Text = str(MaxValue) + " - " + MaxName

# Add text to graph
plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0),
             xycoords=('axes fraction', 'data'), textcoords='offset points')

print "The most popular name"
df[df['Births'] == df['Births'].max()]
# Sorted.head(1) can also be used

The most popular name

Out[19]:
  Names  Births
4   Mel     973

Lesson 2
In [1]:
# The usual preamble
import pandas as pd
# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.line_width', 5000)
pd.set_option('display.max_columns', 60)

figsize(15, 5)

We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a
subset of the 311 service requests from NYC Open Data.
In [2]:
complaints = pd.read_csv('../data/311-service-requests.csv')

2.1 What's even in it? (the summary)


When you look at a large dataframe, instead of showing you the contents of the dataframe, it'll show
you a summary. This includes all the columns, and how many non-null values there are in each
column.
In [3]:
complaints

Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 52 columns):
Unique Key                        111069 non-null values
Created Date                      111069 non-null values
Closed Date                       60270 non-null values
Agency                            111069 non-null values
Agency Name                       111069 non-null values
Complaint Type                    111069 non-null values
Descriptor                        111068 non-null values
Location Type                     79048 non-null values
Incident Zip                      98813 non-null values
Incident Address                  84441 non-null values
Street Name                       84438 non-null values
Cross Street 1                    84728 non-null values
Cross Street 2                    84005 non-null values
Intersection Street 1             19364 non-null values
Intersection Street 2             19366 non-null values
Address Type                      102247 non-null values
City                              98860 non-null values
Landmark                          95 non-null values
Facility Type                     110938 non-null values
Status                            111069 non-null values
Due Date                          39239 non-null values
Resolution Action Updated Date    96507 non-null values
Community Board                   111069 non-null values
Borough                           111069 non-null values
X Coordinate (State Plane)        98143 non-null values
Y Coordinate (State Plane)        98143 non-null values
Park Facility Name                111069 non-null values
Park Borough                      111069 non-null values
School Name                       111069 non-null values
School Number                     111052 non-null values
School Region                     110524 non-null values
School Code                       110524 non-null values
School Phone Number               111069 non-null values
School Address                    111069 non-null values
School City                       111069 non-null values
School State                      111069 non-null values
School Zip                        111069 non-null values
School Not Found                  38984 non-null values
School or Citywide Complaint      0 non-null values
Vehicle Type                      99 non-null values
Taxi Company Borough              117 non-null values
Taxi Pick Up Location             1059 non-null values
Bridge Highway Name               185 non-null values
Bridge Highway Direction          185 non-null values
Road Ramp                         184 non-null values
Bridge Highway Segment            223 non-null values
Garage Lot Name                   49 non-null values
Ferry Direction                   37 non-null values
Ferry Terminal Name               336 non-null values
Latitude                          98143 non-null values
Longitude                         98143 non-null values
Location                          98143 non-null values
dtypes: float64(5), int64(1), object(46)

2.2 Selecting columns and rows


To select a column, we index with the name of the column, like this:
In [4]:
complaints['Complaint Type']

Out[4]:
0          Noise - Street/Sidewalk
1                  Illegal Parking
2               Noise - Commercial
3                  Noise - Vehicle
4                           Rodent
5               Noise - Commercial
6                 Blocked Driveway
7               Noise - Commercial
8               Noise - Commercial
9               Noise - Commercial
10        Noise - House of Worship
11              Noise - Commercial
12                 Illegal Parking
13                 Noise - Vehicle
14                          Rodent
...
111054     Noise - Street/Sidewalk
111055          Noise - Commercial
111056       Street Sign - Missing
111057                       Noise
111058          Noise - Commercial
111059     Noise - Street/Sidewalk
111060                       Noise
111061          Noise - Commercial
111062                Water System
111063                Water System
111064     Maintenance or Facility
111065             Illegal Parking
111066     Noise - Street/Sidewalk
111067          Noise - Commercial
111068            Blocked Driveway
Name: Complaint Type, Length: 111069, dtype: object
To get the first 5 rows of a dataframe, we can use a slice: df[:5].

This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to
look at the contents and get a feel for this dataset.
In [5]:
complaints[:5]

Out[5]:
(The output is a very wide table showing the first 5 rows of the dataframe with all 52 columns;
it is too wide to reproduce legibly here.)
We can combine these to get the first 5 rows of a column:
In [6]:
complaints['Complaint Type'][:5]

Out[6]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

and it doesn't matter which direction we do it in:


In [7]:


complaints[:5]['Complaint Type']

Out[7]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

2.3 Selecting multiple columns


What if we just want to know the complaint type and the borough, but not the rest of the information?
Pandas makes it really easy to select a subset of the columns: just index with a list of the columns
you want.
In [8]:
complaints[['Complaint Type', 'Borough']]

Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 2 columns):
Complaint Type    111069 non-null values
Borough           111069 non-null values
dtypes: object(2)

That showed us a summary, and then we can look at the first 10 rows:
In [9]:
complaints[['Complaint Type', 'Borough']][:10]

Out[9]:
            Complaint Type    Borough
0  Noise - Street/Sidewalk     QUEENS
1          Illegal Parking     QUEENS
2       Noise - Commercial  MANHATTAN
3          Noise - Vehicle  MANHATTAN
4                   Rodent  MANHATTAN
5       Noise - Commercial     QUEENS
6         Blocked Driveway     QUEENS
7       Noise - Commercial     QUEENS
8       Noise - Commercial  MANHATTAN
9       Noise - Commercial   BROOKLYN


2.4 What's the most common complaint type?

This is a really easy question to answer! There's a .value_counts() method that we can use:
In [10]:
complaints['Complaint Type'].value_counts()

Out[10]:
HEATING                              14200
GENERAL CONSTRUCTION                  7471
Street Light Condition                7117
DOF Literature Request                5797
PLUMBING                              5373
PAINT - PLASTER                       5149
Blocked Driveway                      4590
NONCONST                              3998
Street Condition                      3473
Illegal Parking                       3343
Noise                                 3321
Traffic Signal Condition              3145
Dirty Conditions                      2653
Water System                          2636
Noise - Commercial                    2578
...
Opinion for the Mayor                    2
Window Guard                             2
DFTA Literature Request                  2
Legal Services Provider Complaint        2
Open Flame Permit                        1
Snow                                     1
Municipal Parking Facility               1
X-Ray Machine/Equipment                  1
Stalled Sites                            1
DHS Income Savings Requirement           1
Tunnel Condition                         1
Highway Sign - Damaged                   1
Ferry Permit                             1
Trans Fat                                1
DWD                                      1
Length: 165, dtype: int64

If we just wanted the top 10 most common complaints, we can do this:


In [11]:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]

Out[11]:
HEATING                   14200
GENERAL CONSTRUCTION       7471
Street Light Condition     7117
DOF Literature Request     5797
PLUMBING                   5373
PAINT - PLASTER            5149
Blocked Driveway           4590
NONCONST                   3998
Street Condition           3473
Illegal Parking            3343
dtype: int64

But it gets better! We can plot them!


In [12]:
complaint_counts[:10].plot(kind='bar')

Out[12]:
<matplotlib.axes.AxesSubplot at 0x7ba2290>


Lesson 3
Get Data - Our data set will consist of an Excel file containing customer counts per date. We will
learn how to read in the excel file for processing.
Prepare Data - The data is an irregular time series having duplicate dates. We will be challenged in
compressing the data and coming up with next year's forecasted customer count.
Analyze Data - We use graphs to visualize trends and spot outliers. Some built-in computational
tools will be used to calculate next year's forecasted customer count.
Present Data - The results will be plotted.
NOTE: Make sure you have looked through all previous lessons, as the knowledge learned in
previous lessons will be needed for this exercise.
In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np
import sys

%matplotlib inline

In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul 1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2

We will be creating our own test data for analysis.


In [3]:
# set seed
np.seed(111)

# Function to generate test data
def CreateDataSet(Number=1):

    Output = []

    for i in range(Number):

        # Create a weekly (mondays) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')

        # Create random data
        data = np.randint(low=25, high=1000, size=len(rng))

        # Status pool
        status = [1, 2, 3]

        # Make a random list of statuses
        random_status = [status[np.randint(low=0, high=len(status))] for i in range(len(rng))]

        # State pool
        states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']

        # Make a random list of states
        random_states = [states[np.randint(low=0, high=len(states))] for i in range(len(rng))]

        Output.extend(zip(random_states, random_status, data, rng))

    return Output

Now that we have a function to generate our test data, let's create some data and stick it into a
dataframe.

In [4]:
dataset = CreateDataSet(4)


df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 836 entries, 0 to 835
Data columns (total 4 columns):
State            836 non-null object
Status           836 non-null int64
CustomerCount    836 non-null int64
StatusDate       836 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 32.7+ KB

In [5]:
df.head()

Out[5]:
  State  Status  CustomerCount StatusDate
0    GA       1            877 2009-01-05
1    FL       1            901 2009-01-12
2    fl       3            749 2009-01-19
3    FL       3            111 2009-01-26
4    GA       1            300 2009-02-02
We are now going to save this dataframe into an Excel file, to then bring it back to a dataframe. We
simply do this to show you how to read and write to Excel files.
We do not write the index values of the dataframe to the Excel file, since they are not meant to be
part of our initial test data set.
In [6]:
# Save results to excel
df.to_excel('Lesson3.xlsx', index=False)
print 'Done'
Done

Grab Data from Excel


We will be using the read_excel function to read in data from an Excel file. The function allows you
to read in specific tabs by name or location.
In [7]:
pd.read_excel?


Note: The location of the Excel file will be in the same folder as the notebook, unless
specified otherwise.
In [8]:
# Location of file
Location = r'C:\Users\david\notebooks\pandas\Lesson3.xlsx'

# Parse a specific sheet


df = pd.read_excel(Location, 0, index_col='StatusDate')
df.dtypes

Out[8]:
State            object
Status            int64
CustomerCount     int64
dtype: object

In [9]:
df.index

Out[9]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-01-05, ..., 2012-12-31]
Length: 836, Freq: None, Timezone: None

In [10]:
df.head()

Out[10]:
            State  Status  CustomerCount
StatusDate
2009-01-05     GA       1            877
2009-01-12     FL       1            901
2009-01-19     fl       3            749
2009-01-26     FL       3            111
2009-02-02     GA       1            300

Prepare Data
This section attempts to clean up the data for analysis.
1. Make sure the state column is all in upper case
2. Only select records where the account status is equal to "1"
3. Merge (NJ and NY) to NY in the state column
4. Remove any outliers (any odd results in the data set)

Let's take a quick look at how some of the State values are upper case and some are lower case
In [11]:
df['State'].unique()

Out[11]:
array([u'GA', u'FL', u'fl', u'TX', u'NY', u'NJ'], dtype=object)

To convert all the State values to upper case we will use the upper() function and the dataframe's
apply attribute. The lambda function will simply apply the upper function to each value in the State
column.
In [12]:
# Clean State Column, convert to upper case
df['State'] = df.State.apply(lambda x: x.upper())

In [13]:
df['State'].unique()

Out[13]:
array([u'GA', u'FL', u'TX', u'NY', u'NJ'], dtype=object)

In [14]:
# Only grab where Status == 1
mask = df['Status'] == 1
df = df[mask]

To turn the NJ states to NY we simply...


[df.State == 'NJ'] - Find all records in the State column where they are equal to NJ.
df.State[df.State == 'NJ'] = 'NY' - For all records in the State column where they are equal to NJ,
replace them with NY.
In [15]:
# Convert NJ to NY
mask = df.State == 'NJ'
df['State'][mask] = 'NY'

Now we can see we have a much cleaner data set to work with.
In [16]:


df['State'].unique()

Out[16]:
array([u'GA', u'FL', u'NY', u'TX'], dtype=object)

At this point we may want to graph the data to check for any outliers or inconsistencies in the data.
We will be using the plot() method of the dataframe.
As you can see from the graph below, it is not very conclusive, which is probably a sign that we need
to perform some more data preparation.
In [17]:
df['CustomerCount'].plot(figsize=(15,5));

If we take a look at the data, we begin to realize that there are multiple values for the same State,
StatusDate, and Status combination. It is possible that this means the data you are working with is
dirty/bad/inaccurate, but we will assume otherwise. We can assume this data set is a subset of a
bigger data set and if we simply add the values in the CustomerCount column per State,
StatusDate, and Status we will get the Total Customer Count per day.
In [18]:
sortdf = df[df['State']=='NY'].sort(axis=0)
sortdf.head(10)

Out[18]:
           State  Status  CustomerCount
StatusDate
2009-01-19    NY       1            522
2009-02-23    NY       1            710
2009-03-09    NY       1            992
2009-03-16    NY       1            355
2009-03-23    NY       1            728
2009-03-30    NY       1            863
2009-04-13    NY       1            520
2009-04-20    NY       1            820
2009-04-20    NY       1            937
2009-04-27    NY       1            447
Our task is now to create a new dataframe that compresses the data so we have daily customer
counts per State and StatusDate. We can ignore the Status column since all the values in this
column are of value 1. To accomplish this we will use the dataframe's functions groupby and sum().
Note that we had to use reset_index . If we did not, we would not have been able to group by both
the State and the StatusDate since the groupby function expects only columns as inputs. The
reset_index function will bring the index StatusDate back to a column in the dataframe.
In [19]:
# Group by State and StatusDate


Daily = df.reset_index().groupby(['State','StatusDate']).sum()
Daily.head()

Out[19]:
                  Status  CustomerCount
State StatusDate
FL    2009-01-12       1            901
      2009-02-02       1            653
      2009-03-23       1            752
      2009-04-06       2           1086
      2009-06-08       1            649
The State and StatusDate columns are automatically placed in the index of the Daily dataframe.
You can think of the index as the primary key of a database table but without the constraint of
having unique values. Columns in the index as you will see allow us to easily select, plot, and
perform calculations on the data.
Below we delete the Status column since it is all equal to one and no longer necessary.
In [20]:
del Daily['Status']
Daily.head()

Out[20]:
                  CustomerCount
State StatusDate
FL    2009-01-12            901
      2009-02-02            653
      2009-03-23            752
      2009-04-06           1086
      2009-06-08            649
In [21]:
# What is the index of the dataframe
Daily.index

Out[21]:
MultiIndex(levels=[[u'FL', u'GA', u'NY', u'TX'],
                   [2009-01-05 00:00:00, 2009-01-12 00:00:00,
                    2009-01-19 00:00:00, 2009-02-02 00:00:00,
                    2009-02-23 00:00:00, 2009-03-09 00:00:00, ...]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...],
                   [1, 3, 7, 9, 17, 19, 20, 21, 23, 25, 27, 28, ...]],
           names=[u'State', u'StatusDate'])

In [22]:
# Select the State index
Daily.index.levels[0]

Out[22]:
Index([u'FL', u'GA', u'NY', u'TX'], dtype='object')

In [23]:
# Select the StatusDate index
Daily.index.levels[1]

Out[23]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2009-01-05, ..., 2012-12-10]
Length: 161, Freq: None, Timezone: None

Let's now plot the data per State.

As you can see, breaking the graph up by the State column gives us a much clearer picture of what
the data looks like. Can you spot any outliers?
In [24]:
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()
Daily.loc['NY'].plot()
Daily.loc['TX'].plot();

We can also just plot the data for a specific period, like 2012. We can now clearly see that the data for
these states is all over the place. Since the data consists of weekly customer counts, the variability of
the data seems suspect. For this tutorial we will assume bad data and proceed.
In [25]:
Daily.loc['FL']['2012':].plot()
Daily.loc['GA']['2012':].plot()
Daily.loc['NY']['2012':].plot()
Daily.loc['TX']['2012':].plot();

We will assume that per month the customer count should remain relatively steady. Any data outside
a specific range in that month will be removed from the data set. The final result should have smooth
graphs with no spikes.
StateYearMonth - Here we group by State, Year of StatusDate, and Month of StatusDate.
Daily['Outlier'] - A boolean (True or False) value letting us know if the value in the CustomerCount
column is outside the acceptable range.
We will be using the transform method instead of apply. The reason is that transform will keep the
shape (number of rows and columns) of the dataframe the same, while apply will not. Looking at the
previous graphs, we can see that they do not resemble a Gaussian distribution, which means we
cannot use summary statistics like the mean and standard deviation; we use percentiles instead.
Note that we run the risk of eliminating good data.
In [26]:

# Calculate Outliers
StateYearMonth = Daily.groupby([Daily.index.get_level_values(0),
                                Daily.index.get_level_values(1).year,
                                Daily.index.get_level_values(1).month])
Daily['Lower'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.25) - (1.5*x.quantile(q=.75)-x.quantile(q=.25)))
Daily['Upper'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.75) + (1.5*x.quantile(q=.75)-x.quantile(q=.25)))
Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper'])

# Remove Outliers
Daily = Daily[Daily['Outlier'] == False]

The dataframe named Daily will hold customer counts that have been aggregated per day. The
original data (df) has multiple records per day. We are left with a data set that is indexed by both the
state and the StatusDate. The Outlier column should be equal to False signifying that the record is
not an outlier.
In [27]:
Daily.head()

Out[27]:
                  CustomerCount  Lower   Upper Outlier
State StatusDate
FL    2009-01-12            901  450.5  1351.5   False
      2009-02-02            653  326.5   979.5   False
      2009-03-23            752  376.0  1128.0   False
      2009-04-06           1086  543.0  1629.0   False
      2009-06-08            649  324.5   973.5   False
We create a separate dataframe named ALL which groups the Daily dataframe by StatusDate. We
are essentially getting rid of the State column. The Max column represents the maximum customer
count per month. The Max column is used to smooth out the graph.
In [28]:
# Combine all markets

# Get the max customer count by Date
ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())
ALL.columns = ['CustomerCount'] # rename column
# Group by Year and Month
YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])

# What is the max customer count per Year and Month
ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())
ALL.head()

Out[28]:
            CustomerCount  Max
StatusDate
2009-01-05            877  901
2009-01-12            901  901
2009-01-19            522  901
2009-02-02            953  953
2009-02-23            710  953
As you can see from the ALL dataframe above, in the month of January 2009 the maximum
customer count was 901. If we had used apply instead, we would have gotten a dataframe with
(Year and Month) as the index and just the Max column with the value of 901.
There was also interest in gauging whether the current customer counts were reaching certain goals
the company had established. The task here is to visually show whether the current customer counts
are meeting the goals listed below. We will call the goals BHAG (Big Hairy Annual Goal).

12/31/2011 - 1,000 customers
12/31/2012 - 2,000 customers
12/31/2013 - 3,000 customers

We will be using the date_range function to create our dates.
Definition: date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None)
Docstring: Return a fixed frequency datetime index, with day (calendar) as the default frequency
By choosing the frequency to be A or annual we will be able to get the three target dates from
above.
In [29]:
date_range?
Object `date_range` not found.

In [30]:
# Create the BHAG dataframe


data = [1000,2000,3000]
idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])
BHAG

Out[30]:
BHAG
2011-12-31 1000
2012-12-31 2000
2013-12-31 3000
Combining dataframes, as we learned in a previous lesson, is made simple using the concat
function. Remember, when we choose axis = 0 we are appending row-wise.
In [31]:
# Combine the BHAG and the ALL data set
combined = pd.concat([ALL,BHAG], axis=0)
combined = combined.sort(axis=0)
combined.tail()

Out[31]:
            BHAG  CustomerCount   Max
2012-11-19   NaN            136  1115
2012-11-26   NaN           1115  1115
2012-12-10   NaN           1269  1269
2012-12-31  2000            NaN   NaN
2013-12-31  3000            NaN   NaN
In [32]:
fig, axes = plt.subplots(figsize=(12, 7))

combined['BHAG'].fillna(method='pad').plot(color='green', label='BHAG')
combined['Max'].plot(color='blue', label='All Markets')
plt.legend(loc='best');

There was also a need to forecast next year's customer count, and we can do this in a couple of
simple steps. We will first group the combined dataframe by Year and take the maximum customer
count for that year. This will give us one row per Year.
In [33]:


# Group by Year and then get the max value per year
Year = combined.groupby(lambda x: x.year).max()
Year

Out[33]:
      BHAG  CustomerCount   Max
2009   NaN           2452  2452
2010   NaN           2065  2065
2011  1000           2711  2711
2012  2000           2061  2061
2013  3000            NaN   NaN
In [34]:
# Add a column representing the percent change per year
Year['YR_PCT_Change'] = Year['Max'].pct_change(periods=1)
Year

Out[34]:
      BHAG  CustomerCount   Max  YR_PCT_Change
2009   NaN           2452  2452            NaN
2010   NaN           2065  2065      -0.157830
2011  1000           2711  2711       0.312833
2012  2000           2061  2061      -0.239764
2013  3000            NaN   NaN            NaN
To get next year's ending customer count we will assume our current growth rate remains constant.
We will then increase this year's customer count by that amount, and that will be our forecast for
next year.
In [35]:
(1 + Year.ix[2012,'YR_PCT_Change']) * Year.ix[2012,'Max']

Out[35]:
1566.8465510881595

Present Data
Create individual Graphs per State.
In [36]:
# First Graph
ALL['Max'].plot(figsize=(10, 5));plt.title('ALL Markets')


# Last four Graphs
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
fig.subplots_adjust(hspace=1.0) ## Create space between plots

Daily.loc['FL']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,0])
Daily.loc['GA']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[0,1])
Daily.loc['TX']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,0])
Daily.loc['NY']['CustomerCount']['2012':].fillna(method='pad').plot(ax=axes[1,1])

# Add titles
axes[0,0].set_title('Florida')
axes[0,1].set_title('Georgia')
axes[1,0].set_title('Texas')
axes[1,1].set_title('North East');


Lesson 4
In this lesson we're going to go back to the basics. We will be working with a small data set so that
you can easily understand what I am trying to explain. We will be adding columns, deleting columns,
and slicing the data many different ways. Enjoy!
In [1]:
# Import libraries
import pandas as pd
import sys

In [2]:
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
Python version 2.7.5 |Anaconda 2.1.0 (64-bit)| (default, Jul  1 2013, 12:37:52) [MSC v.1500 64 bit (AMD64)]
Pandas version: 0.15.2

In [3]:
# Our small data set
d = [0,1,2,3,4,5,6,7,8,9]

# Create dataframe
df = pd.DataFrame(d)
df

Out[3]:
   0
0  0
1  1
2  2
3  3
4  4
5  5
6  6
7  7
8  8
9  9
In [4]:
# Lets change the name of the column


df.columns = ['Rev']
df

Out[4]:
Rev
00
11
22
33
44
55
66
77
88
99
In [5]:
# Lets add a column
df['NewCol'] = 5
df

Out[5]:
   Rev  NewCol
0    0       5
1    1       5
2    2       5
3    3       5
4    4       5
5    5       5
6    6       5
7    7       5
8    8       5
9    9       5
In [6]:
# Lets modify our new column
df['NewCol'] = df['NewCol'] + 1
df

Out[6]:
   Rev  NewCol
0    0       6
1    1       6
2    2       6
3    3       6
4    4       6
5    5       6
6    6       6
7    7       6
8    8       6
9    9       6
In [7]:
# We can delete columns
del df['NewCol']
df

Out[7]:
   Rev
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
In [8]:
# Lets add a couple of columns
df['test'] = 3
df['col'] = df['Rev']
df

Out[8]:
   Rev  test  col
0    0     3    0
1    1     3    1
2    2     3    2
3    3     3    3
4    4     3    4
5    5     3    5
6    6     3    6
7    7     3    7
8    8     3    8
9    9     3    9
In [9]:


# If we wanted, we could change the index
i = ['a','b','c','d','e','f','g','h','i','j']
df.index = i
df

Out[9]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3
e    4     3    4
f    5     3    5
g    6     3    6
h    7     3    7
i    8     3    8
j    9     3    9
We can now start to select pieces of the dataframe using loc.
In [10]:
df.loc['a']

Out[10]:
Rev     0
test    3
col     0
Name: a, dtype: int64

In [11]:
# df.loc[inclusive:inclusive]
df.loc['a':'d']

Out[11]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3
In [12]:
# df.iloc[inclusive:exclusive]
# Note: .iloc is strictly integer position based. It is available from
# [version 0.11.0] (http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-11-0-april-22-2013)
df.iloc[0:3]

Out[12]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
We can also select using the column name.
In [13]:
df['Rev']

Out[13]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
Name: Rev, dtype: int64

In [14]:
df[['Rev', 'test']]

Out[14]:
   Rev  test
a    0     3
b    1     3
c    2     3
d    3     3
e    4     3
f    5     3
g    6     3
h    7     3
i    8     3
j    9     3
In [15]:
# df['ColumnName'][inclusive:exclusive]
df['Rev'][0:3]

Out[15]:
a    0
b    1
c    2
Name: Rev, dtype: int64

In [16]:
df['col'][5:]

Out[16]:
f    5
g    6
h    7
i    8
j    9
Name: col, dtype: int64

In [17]:
df[['col', 'test']][:3]

Out[17]:
   col  test
a    0     3
b    1     3
c    2     3
There are also some handy functions to select the top and bottom records of a dataframe.
In [18]:
# Select top N number of records (default = 5)
df.head()

Out[18]:
   Rev  test  col
a    0     3    0
b    1     3    1
c    2     3    2
d    3     3    3
e    4     3    4
In [19]:
# Select bottom N number of records (default = 5)
df.tail()

Out[19]:
   Rev  test  col
f    5     3    5
g    6     3    6
h    7     3    7
i    8     3    8
j    9     3    9


Machine learning: the problem setting


In general, a learning problem considers a set of n samples of data and then tries to predict
properties of unknown data. If each sample is more than a single number, for instance a
multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
We can separate learning problems into a few large categories:

supervised learning, in which the data comes with additional attributes that we want to
predict. This problem can be either:

classification: samples belong to two or more classes and we want to learn from already
labeled data how to predict the class of unlabeled data. An example of classification
problem would be the handwritten digit recognition example, in which the aim is to
assign each input vector to one of a finite number of discrete categories. Another way to
think of classification is as a discrete (as opposed to continuous) form of supervised
learning where one has a limited number of categories and for each of the n samples
provided, one is to try to label them with the correct category or class.
regression: if the desired output consists of one or more continuous variables, then the
task is called regression. An example of a regression problem would be the prediction of
the length of a salmon as a function of its age and weight.

unsupervised learning, in which the training data consists of a set of input vectors x
without any corresponding target values. The goal in such problems may be to discover
groups of similar examples within the data, where it is called clustering, or to determine
the distribution of data within the input space, known as density estimation, or to project
the data from a high-dimensional space down to two or three dimensions for the purpose
of visualization.

Training set and testing set


Machine learning is about learning some properties of a data set and applying them to new data.
This is why a common practice in machine learning to evaluate an algorithm is to split the data at
hand into two sets: one that we call the training set, on which we learn data properties, and one
that we call the testing set, on which we test these properties.

Loading an example dataset


scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for
classification and the boston house prices dataset for regression.
In the following, we start a Python interpreter from our shell and then load the iris and digits
datasets. Our notational convention is that $ denotes the shell prompt while >>> denotes the
Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()

>>> digits = datasets.load_digits()

A dataset is a dictionary-like object that holds all the data and some metadata about the data.
This data is stored in the .data member, which is an (n_samples, n_features) array. In the case
of a supervised problem, one or more response variables are stored in the .target member. For
instance, in the case of the digits dataset, digits.data gives access to the features that can be
used to classify the digits samples:
>>> print(digits.data)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ...,
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]

and digits.target gives the ground truth for the digits dataset, that is, the number corresponding
to each digit image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])

Shape of the data arrays


The data is always a 2D array, shape (n_samples, n_features), although the original data
may have had a different shape. In the case of the digits, each original sample is an image of
shape (8, 8) and can be accessed using:
>>> digits.images[0]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])

Learning and predicting


In the case of the digits dataset, the task is to predict, given an image, which digit it represents.
We are given samples of each of the 10 possible classes (the digits zero through nine) on which
we fit an estimator to be able to predict the classes to which unseen samples belong.
In scikit-learn, an estimator for classification is a Python object that implements the methods
fit(X, y) and predict(T).

An example of an estimator is the class sklearn.svm.SVC that implements support vector


classification. The constructor of an estimator takes as arguments the parameters of the model,
but for the time being, we will consider the estimator as a black box:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)

Choosing the parameters of the model


In this example we set the value of gamma manually. It is possible to automatically find good
values for the parameters by using tools such as grid search and cross validation.
We call our estimator instance clf, as it is a classifier. It must now be fitted to the data; that is,
it must learn from the data. This is done by passing our training set to the fit method. As a
training set, let us use all the images of our dataset apart from the last one. We select this training
set with the [:-1] Python syntax, which produces a new array that contains all but the last entry
of digits.data:
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

Now you can predict new values. In particular, we can ask the classifier what the digit of
our last image in the digits dataset is, which we have not used to train the classifier:
>>> clf.predict(digits.data[-1:])
array([8])

The corresponding image is the following:

As you can see, it is a challenging task: the images are of poor resolution. Do you agree with the
classifier?

Model persistence
It is possible to save a model in the scikit by using Python's built-in persistence model, namely
pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

In the specific case of the scikit, it may be more interesting to use joblib's replacement for pickle
(joblib.dump & joblib.load), which is more efficient on big data but can only pickle to the
disk and not to a string:

>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')

Later you can load back the pickled model (possibly in another Python process) with:
>>> clf = joblib.load('filename.pkl')

Note
joblib.dump returns a list of filenames. Each individual numpy array contained in the clf object
is serialized as a separate file on the filesystem. All files are required in the same folder when
reloading the model with joblib.load.

Conventions
scikit-learn estimators follow certain rules to make their behavior more predictable.
Type casting
Unless otherwise specified, input will be cast to float64:
>>> import numpy as np
>>> from sklearn import random_projection
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')

In this example, X is float32, which is cast to float64 by fit_transform(X).


Regression targets are cast to float64, classification targets are maintained:
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()

>>> clf = SVC()


>>> clf.fit(iris.data, iris.target)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']

Here, the first predict() returns an integer array, since iris.target (an integer array) was
used in fit. The second predict() returns a string array, since iris.target_names was used
for fitting.
Refitting and updating parameters
Hyper-parameters of an estimator can be updated after it has been constructed via the
sklearn.pipeline.Pipeline.set_params method. Calling fit() more than once will
overwrite what was learned by any previous fit():
>>> import numpy as np
>>> from sklearn.svm import SVC
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(100, 10)
>>> y = rng.binomial(1, 0.5, 100)
>>> X_test = rng.rand(5, 10)

>>> clf = SVC()


>>> clf.set_params(kernel='linear').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict(X_test)
array([1, 0, 1, 1, 0])
>>> clf.set_params(kernel='rbf').fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> clf.predict(X_test)

array([0, 0, 0, 1, 0])

Here, the default kernel rbf is first changed to linear after the estimator has been constructed
via SVC(), and changed back to rbf to refit the estimator and to make a second prediction.

Datasets
Scikit-learn deals with learning information from one or more datasets that are represented as 2D
arrays. They can be understood as a list of multi-dimensional observations. We say that the first
axis of these arrays is the samples axis, while the second is the features axis.
A simple example shipped with the scikit: iris dataset
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)

It is made of 150 observations of irises, each described by 4 features: their sepal and petal length
and width, as detailed in iris.DESCR.
When the data is not initially in the (n_samples, n_features) shape, it needs to be
preprocessed in order to be used by scikit-learn.
An example of reshaping data would be the digits dataset

The digits dataset is made of 1797 8x8 images of hand-written digits


>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> import pylab as pl
>>> pl.imshow(digits.images[-1], cmap=pl.cm.gray_r)
<matplotlib.image.AxesImage object at ...>

To use this dataset with the scikit, we transform each 8x8 image into a feature vector of length
64
>>> data = digits.images.reshape((digits.images.shape[0], -1))

Estimator objects
Fitting data: the main API implemented by scikit-learn is that of the estimator. An estimator is
any object that learns from data; it may be a classification, regression or clustering algorithm or a
transformer that extracts/filters useful features from raw data.
All estimator objects expose a fit method that takes a dataset (usually a 2-d array):
>>> estimator.fit(data)

Estimator parameters: All the parameters of an estimator can be set when it is instantiated or
by modifying the corresponding attribute:
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1

Estimated parameters: When data is fitted with an estimator, parameters are estimated from the
data at hand. All the estimated parameters are attributes of the estimator object ending by an
underscore:
>>> estimator.estimated_param_

Supervised learning: predicting an output variable from high-dimensional observations


The problem solved in supervised learning
Supervised learning consists in learning the link between two datasets: the observed data X and
an external variable y that we are trying to predict, usually called target or labels. Most
often, y is a 1D array of length n_samples.
All supervised estimators in scikit-learn implement a fit(X, y) method to fit the model and a
predict(X) method that, given unlabeled observations X, returns the predicted labels y.
Vocabulary: classification and regression
If the prediction task is to classify the observations in a set of finite labels, in other words to
name the objects observed, the task is said to be a classification task. On the other hand, if the
goal is to predict a continuous target variable, it is said to be a regression task.
When doing classification in scikit-learn, y is a vector of integers or strings.

Nearest neighbor and the curse of dimensionality


Classifying irises:

The iris dataset is a classification task consisting in identifying 3 different types of irises (Setosa,
Versicolour, and Virginica) from their petal and sepal length and width:
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris_X = iris.data
>>> iris_y = iris.target
>>> np.unique(iris_y)
array([0, 1, 2])

k-Nearest neighbors classifier


The simplest possible classifier is the nearest neighbor: given a new observation X_test, find in
the training set (i.e. the data used to train the estimator) the observation with the closest feature
vector.
Training set and testing set
While experimenting with any learning algorithm, it is important not to test the prediction of an
estimator on the data used to fit the estimator as this would not be evaluating the performance of
the estimator on new data. This is why datasets are often split into train and test data.
KNN (k nearest neighbors) classification example:

>>> # Split iris data in train and test data
>>> # A random permutation, to split the data randomly
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test = iris_X[indices[-10:]]
>>> iris_y_test = iris_y[indices[-10:]]


>>> # Create and fit a nearest-neighbor classifier


>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
>>> knn.predict(iris_X_test)
array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
>>> iris_y_test
array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])

The curse of dimensionality


For an estimator to be effective, you need the distance between neighboring points to be less than
some value d, which depends on the problem. In one dimension, this requires on average n ~ 1/d
points. In the context of the above k-NN example, if the data is described by just one feature
with values ranging from 0 to 1 and with n training observations, then new data will be no
further away than 1/n. Therefore, the nearest neighbor decision rule will be efficient as soon as
1/n is small compared to the scale of between-class feature variations.
If the number of features is p, you now require n ~ 1/d^p points. Let's say that we require 10
points in one dimension: now 10^p points are required in p dimensions to pave the [0, 1]
space. As p becomes large, the number of training points required for a good estimator grows
exponentially.
For example, if each point is just a single number (8 bytes), then an effective k-NN estimator in
a paltry p ~ 20 dimensions would require more training data than the current estimated size of the
entire internet (1000 Exabytes or so).
This is called the curse of dimensionality and is a core problem that machine learning addresses.

Linear model: from regression to sparsity


Diabetes dataset
The diabetes dataset consists of 10 physiological variables (age, sex, weight, blood pressure)
measured on 442 patients, and an indication of disease progression after one year:
>>> diabetes = datasets.load_diabetes()
>>> diabetes_X_train = diabetes.data[:-20]
>>> diabetes_X_test = diabetes.data[-20:]
>>> diabetes_y_train = diabetes.target[:-20]
>>> diabetes_y_test = diabetes.target[-20:]

The task at hand is to predict disease progression from physiological variables.


Linear regression
LinearRegression, in its simplest form, fits a linear model to the data set by adjusting a set of
parameters in order to make the sum of the squared residuals of the model as small as possible.

Linear models: y = Xβ + ε

X: data
y: target variable
β: coefficients
ε: observation noise

>>> from sklearn import linear_model
>>> regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> print(regr.coef_)
[   0.30349955 -237.63931533  510.53060544  327.73698041 -814.13170937
  492.81458798  102.84845219  184.60648906  743.51961675   76.09517222]
>>> # The mean square error
>>> np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2)
2004.56760268...
>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and Y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...


Shrinkage
If there are few data points per dimension, noise in the observations induces high variance:

>>> X = np.c_[ .5, 1].T
>>> y = [.5, 1]
>>> test = np.c_[ 0, 2].T
>>> regr = linear_model.LinearRegression()

>>> import pylab as pl
>>> pl.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1*np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     pl.plot(test, regr.predict(test))
...     pl.scatter(this_X, y, s=3)

A solution in high-dimensional statistical learning is to shrink the regression coefficients to zero:
any two randomly chosen sets of observations are likely to be uncorrelated. This is called Ridge
regression:


>>> regr = linear_model.Ridge(alpha=.1)
>>> pl.figure()
>>> np.random.seed(0)
>>> for _ in range(6):
...     this_X = .1*np.random.normal(size=(2, 1)) + X
...     regr.fit(this_X, y)
...     pl.plot(test, regr.predict(test))
...     pl.scatter(this_X, y, s=3)

This is an example of bias/variance tradeoff: the larger the ridge alpha parameter, the higher the
bias and the lower the variance.
We can choose alpha to minimize left-out error, this time using the diabetes dataset rather than
our synthetic data:
>>> alphas = np.logspace(-4, -1, 6)
>>> from __future__ import print_function
>>> print([regr.set_params(alpha=alpha
...            ).fit(diabetes_X_train, diabetes_y_train,
...            ).score(diabetes_X_test, diabetes_y_test) for alpha in alphas])
[0.5851110683883..., 0.5852073015444..., 0.5854677540698...,
 0.5855512036503..., 0.5830717085554..., 0.57058999437...]

Note
Capturing in the fitted parameters noise that prevents the model from generalizing to new data is
called overfitting. The bias introduced by the ridge regression is called a regularization.
Sparsity
Fitting only features 1 and 2:
Note
A representation of the full diabetes dataset would involve 11 dimensions (10 feature dimensions
and one for the target variable). It is hard to develop an intuition on such a representation, but it
may be useful to keep in mind that it would be a fairly empty space.
We can see that, although feature 2 has a strong coefficient on the full model, it conveys little
information on y when considered with feature 1.
To improve the conditioning of the problem (i.e. to mitigate the curse of dimensionality), it
would be interesting to select only the informative features and set non-informative ones, like
feature 2, to 0. Ridge regression will decrease their contribution, but not set them to zero. Another
penalization approach, called Lasso (least absolute shrinkage and selection operator), can set
some coefficients to zero. Such methods are called sparse methods, and sparsity can be seen as an
application of Occam's razor: prefer simpler models.
>>> regr = linear_model.Lasso()
>>> scores = [regr.set_params(alpha=alpha
...               ).fit(diabetes_X_train, diabetes_y_train
...               ).score(diabetes_X_test, diabetes_y_test)
...           for alpha in alphas]
>>> best_alpha = alphas[scores.index(max(scores))]
>>> regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
>>> print(regr.coef_)
[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982    -0.
 -187.19554705   69.38229038  508.66011217   71.84239008]

Different algorithms for the same problem

Different algorithms can be used to solve the same mathematical problem. For instance, the Lasso
object in scikit-learn solves the lasso regression problem using a coordinate descent method, which
is efficient on large datasets. However, scikit-learn also provides the LassoLars object, using the
LARS algorithm, which is very efficient for problems in which the estimated weight vector is very
sparse (i.e. problems with very few observations).
Classification

For classification, as in the labeling iris task, linear regression is not the right approach as it will
give too much weight to data far from the decision frontier. A linear approach is to fit a sigmoid
function or logistic function:

>>> logistic = linear_model.LogisticRegression(C=1e5)
>>> logistic.fit(iris_X_train, iris_y_train)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

This is known as LogisticRegression.


Multiclass classification
If you have several classes to predict, an option often used is to fit one-versus-all classifiers and
then use a voting heuristic for the final decision.
Shrinkage and sparsity with logistic regression
The C parameter controls the amount of regularization in the LogisticRegression object: a large
value for C results in less regularization. penalty="l2" gives Shrinkage (i.e. non-sparse
coefficients), while penalty="l1" gives Sparsity.
Exercise
Try classifying the digits dataset with nearest neighbors and a linear model. Leave out the last
10% and test prediction performance on these observations.
from sklearn import datasets, neighbors, linear_model
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

Solution: ../../auto_examples/exercises/digits_classification_exercise.py


Support vector machines (SVMs)


Linear SVMs
Support Vector Machines belong to the discriminant model family: they try to find a combination
of samples to build a plane maximizing the margin between the two classes. Regularization is set
by the C parameter: a small value for C means the margin is calculated using many or all of the
observations around the separating line (more regularization); a large value for C means the
margin is calculated on observations close to the separating line (less regularization).
Unregularized SVM

Regularized SVM (default)

SVMs can be used in regression SVR (Support Vector Regression), or in classification SVC
(Support Vector Classification).
>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(iris_X_train, iris_y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

Warning
Normalizing data
For many estimators, including the SVMs, having datasets with unit standard deviation for each
feature is important to get good prediction.
Using kernels
Classes are not always linearly separable in feature space. The solution is to build a decision
function that is not linear but may be polynomial instead. This is done using the kernel trick that
can be seen as creating a decision energy by positioning kernels on observations:
Linear kernel
>>> svc = svm.SVC(kernel='linear')

Polynomial kernel
>>> svc = svm.SVC(kernel='poly',
...               degree=3)
>>> # degree: polynomial degree
RBF kernel (Radial Basis Function)
>>> svc = svm.SVC(kernel='rbf')
>>> # gamma: inverse of size of
>>> # radial kernel
Interactive example
See the SVM GUI to download svm_gui.py; add data points of both classes with right and left
button, fit the model and change parameters and data.

Exercise

Try classifying classes 1 and 2 from the iris dataset with SVMs, using the first 2 features. Leave
out 10% of each class and test prediction performance on these observations.
Warning: the classes are ordered; do not leave out the last 10%, or you would be testing on only
one class.
Hint: You can use the decision_function method on a grid to get intuitions.
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]

