Persistent Homology in Topological Data Analysis

Ben Fraser
May 27, 2015

Data Analysis

Suppose we start with some point cloud data and want to extract meaningful information from it.
We may want to visualize the data to do so, by plotting it on a graph.
However, in higher dimensions, visualization becomes difficult.
A possible solution: dimensionality reduction.

Principal Component Analysis

Essentially, PCA fits an ellipsoid to the data, where each of its axes corresponds to a principal component.
The smaller axes are those along which the data has less variance.
We could discard these less important principal components to reduce the dimensionality of the data while retaining as much of the variance as possible.
The reduced data may then be easier to graph, for example to identify clusters.

Principal Component Analysis

PCA is done by computing the singular value decomposition of X (each row is a point, each column a dimension).
We then form a truncated score matrix, where L is the number of principal components we retain.
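The equations on these two slides did not survive extraction. As a sketch of what they most likely stated, here is the standard textbook formulation of PCA via the SVD, assuming X has been mean-centred. The symbols U, Σ, W and T_L are the usual notation and are assumptions here, not taken from the slides.

```latex
X = U \Sigma W^{\mathsf{T}}, \qquad
T = X W = U \Sigma, \qquad
T_L = X W_L = U_L \Sigma_L
```

Here W_L and U_L keep the first L columns of W and U, and Σ_L is the top-left L × L block of Σ, so T_L has one row per point and L columns.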

Principal Component Analysis

Example: 8-dimensional data reduced to 2 dimensions to locate clusters.

Principal Component Analysis

Example: reducing from 3 dimensions to 2 collapses a cylinder to a circle.

Principal Component Analysis

PCA is scale sensitive! The same transformation produces a poor result on data with the same shape but a different scale.

Data Analysis

One weakness of PCA is its sensitivity to the scale of the data.
Also, it provides no information about the shape of our data.
We want something insensitive to scale which can identify shape (why?).
Because data has shape, and shape has meaning - Ayasdi (Gunnar Carlsson)

Topological Data Analysis

TDA constructs higher-dimensional structure on our point cloud via simplicial complexes.
We then analyze this family of nested complexes with persistent homology.
The Betti numbers are displayed in graph form.
Essentially, we approximate the shape of the data by building a graph on it, considering cliques as higher-dimensional objects, and counting the cycles of such objects.

Algorithm

Since scale doesn't matter in this analysis, we can normalize the data.
Also, since we don't want to work with the entire data set (especially if it is very large), we want to choose a subset of the data to work with.
We would ideally like this subset to be representative of the original data (but how?).
This process is called landmarking.

Landmarking

The method used here is minMax.
Start by computing a distance matrix D.
Then choose a random point l_1 to add to the subset of landmarks L.
Then choose each subsequent i-th point to add as that which has maximum distance from the landmark it is closest to:

l_i = p such that dist(p, L) = max{ dist(x, L) : x ∈ X }
dist(x, L) = min{ dist(x, l) : l ∈ L }
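The minMax rule above can be sketched in a few lines of Python. This is a minimal illustration, not the original program: the function name `minmax_landmarks` and the `seed` parameter are hypothetical, and distances are computed on the fly with Euclidean `math.dist` rather than read from a precomputed matrix D.

```python
import math
import random


def minmax_landmarks(points, n_landmarks, seed=0):
    """Return indices of landmarks chosen by the minMax (farthest-point) rule.

    points: list of coordinate tuples; n_landmarks should not exceed len(points).
    """
    rng = random.Random(seed)
    first = rng.randrange(len(points))  # l_1 is chosen at random
    landmarks = [first]
    # dist_to_L[i] holds dist(x_i, L): the distance to the nearest landmark so far
    dist_to_L = [math.dist(p, points[first]) for p in points]
    while len(landmarks) < n_landmarks:
        # next landmark: the point whose nearest landmark is farthest away
        nxt = max(range(len(points)), key=dist_to_L.__getitem__)
        landmarks.append(nxt)
        for i, p in enumerate(points):
            dist_to_L[i] = min(dist_to_L[i], math.dist(p, points[nxt]))
    return landmarks
```

Keeping a running `dist_to_L` list avoids recomputing the inner minimum for every candidate each time a landmark is added.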

Landmarking

Landmarking is not an exact science, however: on certain types of data the method just used may result in a subset very unrepresentative of the original data, for example when outliers are present.

Algorithm

As long as outliers are ignored, however, the method used works well to pick points as spread out as possible among the data.
Next we keep only the distance matrix between the landmark points, and normalize it.
This is all the information we need from the data: the actual position of the points is irrelevant. All we need are the distances between the landmarks, on which we will construct a neighbourhood graph.
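A minimal sketch of this step follows. It assumes that "normalize" means dividing by the largest landmark distance, so that every entry lies in [0, 1]; the slides do not spell this out, and the function name is hypothetical.

```python
import math


def normalized_landmark_distances(points, landmarks):
    """Distance matrix restricted to the landmark indices, scaled into [0, 1]."""
    sub = [points[i] for i in landmarks]
    D = [[math.dist(p, q) for q in sub] for p in sub]
    d_max = max(max(row) for row in D)
    if d_max == 0:  # all landmarks coincide; nothing to scale
        return D
    return [[d / d_max for d in row] for row in D]
```

After this step the original coordinates are no longer needed; everything downstream works from the matrix alone.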

Neighbourhood Graph

Our goal is to create a nested sequence of graphs. To be precise, we add a single edge at a time between the points x, y ∈ L for which dist(x, y) is the smallest value in D, then replace that distance in D with 1 so that the same pair is not chosen again.
At each iteration of adding an edge, we keep track of r = dist(x, y), with r ∈ [0, 1]: this is our proximity parameter, and will be important when we graph the Betti numbers later.
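As a sketch, the edge-by-edge construction can be written by sorting the landmark pairs by distance. Up to ties, this visits edges in the same order as repeatedly taking the minimum of D and overwriting it. The function names are hypothetical.

```python
def edge_filtration(D):
    """Return (r, i, j) triples for all pairs i < j, in order of increasing distance r."""
    n = len(D)
    return sorted((D[i][j], i, j) for i in range(n) for j in range(i + 1, n))


def nested_graphs(D):
    """Yield (r, edge, adjacency) after each edge is added.

    The same adjacency dict is updated in place and re-yielded, so copy it
    if a snapshot of an earlier graph is needed.
    """
    adjacency = {i: set() for i in range(len(D))}
    for r, i, j in edge_filtration(D):
        adjacency[i].add(j)
        adjacency[j].add(i)
        yield r, (i, j), adjacency
```

Each yielded graph contains the previous one, which is what makes the sequence nested.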

Witness Complex
Def: A point x is a weak witness to a p-simplex (a0, a1, ..., ap) in A if |x - a| < |x - b| for all a ∈ {a0, a1, ..., ap} and all b ∈ A \ {a0, a1, ..., ap}.
Def: A point x is a strong witness to a p-simplex (a0, a1, ..., ap) in A if x is a weak witness and, additionally, |x - a0| = |x - a1| = ... = |x - ap|.
The requirement may be added that an edge is only added between two points if there exists a weak witness to that edge.
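The two definitions translate directly into predicates. The sketch below assumes Euclidean distance and represents points as tuples. The tolerance `tol` used for the equality test is an assumption added for floating-point data, and the function names are hypothetical.

```python
import math


def is_weak_witness(x, simplex, landmarks):
    """True if x is strictly closer to every vertex of simplex than to any other landmark."""
    others = [b for b in landmarks if b not in simplex]
    return all(math.dist(x, a) < math.dist(x, b) for a in simplex for b in others)


def is_strong_witness(x, simplex, landmarks, tol=1e-9):
    """True if x is a weak witness and equidistant (within tol) from all vertices of simplex."""
    d = [math.dist(x, a) for a in simplex]
    return is_weak_witness(x, simplex, landmarks) and max(d) - min(d) <= tol
```

To apply the optional requirement, an edge (a, b) would only be added if some data point x satisfies `is_weak_witness(x, [a, b], landmarks)`.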

Simplicial Complexes

Next we want to construct higher-dimensional structure on the neighbourhood graph, called a simplicial complex.
A simplex is a point, edge, triangle, tetrahedron, etc. (a k-simplex is a (k+1)-clique in the graph).
A face of a simplex is a sub-simplex of it.
A simplicial k-complex is a set S of simplices, each of dimension at most k, such that every face of any simplex in S is also in S, and the intersection of any two simplices is a face of both of them.

Simplicial Complexes

At each iteration, we add an edge: all we need to do is see if that creates any new k-simplices.
The edge itself adds a single 1-simplex to the complex.
A k-simplex is formed if the intersection of the neighbourhoods of the vertices of a (k-2)-simplex contains the two points in the added edge.
In other words, if every point in a (k-2)-simplex is joined to the two points in the edge, then together they form a k-simplex.
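One way to sketch this update: the candidate (k-2)-simplices are exactly the cliques among the common neighbours of the two endpoints of the new edge. The function name and the `max_dim` parameter are hypothetical, and this brute-force enumeration is only meant for small landmark sets.

```python
from itertools import combinations


def new_simplices(adjacency, u, v, max_dim=2):
    """Add edge (u, v) to adjacency in place and return the simplices it creates.

    The result maps dimension k to a list of new k-simplices (sorted vertex tuples).
    """
    adjacency[u].add(v)
    adjacency[v].add(u)
    common = sorted(adjacency[u] & adjacency[v])  # vertices joined to both endpoints
    created = {1: [tuple(sorted((u, v)))]}
    for k in range(2, max_dim + 1):
        created[k] = []
        # a (k-2)-simplex has k-1 vertices; it must be a clique inside `common`
        for sigma in combinations(common, k - 1):
            if all(b in adjacency[a] for a, b in combinations(sigma, 2)):
                created[k].append(tuple(sorted(sigma + (u, v))))
    return created
```

Every simplex returned contains the new edge, which is why no older simplices need to be re-examined.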

Boundary Matrices

Next we compute boundary matrices. Essentially, these store the information that (k-1)-simplices are faces of certain k-simplices.
For instance, in a simplicial complex with 100 triangles and 50 tetrahedra, the 3rd boundary matrix has 100 rows and 50 columns, with zeros everywhere except where the given triangle is a face of the given tetrahedron, where it is 1.
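A minimal sketch of building one such 0/1 matrix from scratch is shown below. It assumes simplices are stored as sorted vertex tuples and that every face of every listed k-simplex appears in `faces`; the function name is hypothetical.

```python
from itertools import combinations


def boundary_matrix(faces, simplices):
    """Rows index the (k-1)-simplices in faces, columns the k-simplices in simplices.

    Entry [i][j] is 1 when faces[i] is a face of simplices[j], else 0.
    """
    row_of = {f: i for i, f in enumerate(faces)}
    M = [[0] * len(simplices) for _ in faces]
    for j, s in enumerate(simplices):
        # dropping one vertex at a time gives each codimension-1 face
        for f in combinations(s, len(s) - 1):
            M[row_of[f]][j] = 1
    return M
```

With four triangles and the single tetrahedron they bound, the result is a 4 x 1 column of ones.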

Boundary Matrices

At each iteration, we need only add rows of zeros to the kth boundary matrix for each (k-1)-simplex that was formed, since the only k-simplices they could possibly be faces of are those new ones which were formed at this iteration.
Then add columns for each of these new k-simplices, and fill them with 0s and 1s by finding their faces (one of which is guaranteed to be one of the new (k-1)-simplices).

Betti Numbers

The kth Betti numbers are based on the connectivity of the k-dimensional simplicial complexes.
The kth Betti number is defined as the rank of the kth homology group, H_k(X) = ker(bd_k) / im(bd_{k+1}).
In lower dimensions, it can be understood as the number of k-dimensional holes:
Betti_0 = number of connected components
Betti_1 = number of holes
Betti_2 = number of voids
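By rank-nullity, the rank of H_k equals (n_k - rank bd_k) - rank bd_{k+1}, where n_k is the number of k-simplices. The sketch below computes this from 0/1 boundary matrices, assuming coefficients in the two-element field (an assumption suggested by the 0/1 entries, not stated on the slides). Function names are hypothetical.

```python
def rank_gf2(matrix):
    """Rank of a 0/1 matrix (list of rows) over the field with two elements."""
    rows = [sum(bit << j for j, bit in enumerate(row)) for row in matrix]
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        # clear that bit from every remaining row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti(n_k, boundary_k, boundary_k_plus_1):
    """Betti_k = (n_k - rank bd_k) - rank bd_{k+1}; pass [] for a zero map."""
    return n_k - rank_gf2(boundary_k) - rank_gf2(boundary_k_plus_1)


# Hollow triangle: vertices 0, 1, 2 and edges (0,1), (0,2), (1,2)
bd1 = [[1, 1, 0],
       [1, 0, 1],
       [0, 1, 1]]
hollow = (betti(3, [], bd1), betti(3, bd1, []))
# Filling in the triangle adds one 2-simplex whose faces are all three edges
bd2 = [[1], [1], [1]]
filled = (betti(3, [], bd1), betti(3, bd1, bd2))
```

For the hollow triangle this gives one component and one hole; filling the triangle removes the hole.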

Persistent Homology

Why must we compute the Betti numbers across a range of the proximity parameter r?
Because at low values of r the points may be too disconnected to show any meaningful structure, and at high values we approach a complete graph, which is also not useful.

Persistent Homology

However, the solution is not to guess an intermediate value of r whose corresponding simplicial complex best approximates the shape of the data.
Indeed, as seen in the previous example, features may briefly appear at some value of r only to disappear within a few edge-adding iterations.
So the idea is to see which features persist, as they are more likely to accurately represent the shape of the data.

Example: Circle

Choose 3200 points uniformly from the circumference of a circle.
From these, choose a landmark subset of 26 points.
Iteratively add one edge and compute the simplicial 2-complex, boundary matrices, and Betti numbers.
Plot the Betti numbers against the proximity parameter.
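A scaled-down sketch of this experiment is given below. It is not the original program: 12 evenly spaced points stand in for the 26 landmarks, coefficients are taken mod 2, ranks are recomputed from scratch at every step, and all names are hypothetical.

```python
import math


def rank_gf2(columns):
    """Rank over the two-element field of a matrix given as integer bitmask columns."""
    columns = list(columns)
    rank = 0
    while columns:
        pivot = columns.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot
        columns = [c ^ pivot if c & low else c for c in columns]
    return rank


def betti_curve(points):
    """Return (r, betti0, betti1) after each edge is added to the 2-complex."""
    n = len(points)
    pairs = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    d_max = pairs[-1][0]
    adjacency = [set() for _ in range(n)]
    edge_index = {}           # edge (i, j) with i < j -> row index in bd_2
    bd1_cols, bd2_cols = [], []
    curve = []
    for d, i, j in pairs:
        common = adjacency[i] & adjacency[j]
        adjacency[i].add(j)
        adjacency[j].add(i)
        edge_index[(i, j)] = len(edge_index)
        bd1_cols.append((1 << i) | (1 << j))
        for c in common:      # each common neighbour closes a triangle
            tri_edges = [tuple(sorted(e)) for e in ((i, j), (i, c), (j, c))]
            bd2_cols.append(sum(1 << edge_index[e] for e in tri_edges))
        r1, r2 = rank_gf2(bd1_cols), rank_gf2(bd2_cols)
        curve.append((d / d_max, n - r1, len(bd1_cols) - r1 - r2))
    return curve


circle = [(math.cos(2 * math.pi * t / 12), math.sin(2 * math.pi * t / 12))
          for t in range(12)]
curve = betti_curve(circle)
```

Plotting the second and third entries of `curve` against the first gives the Betti-number graph described above; once the 12 shortest edges are in, the complex is a single loop.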

Example: Circle

As expected, we find a single hole in the data, and it persists across a wide range of r values. The graph has 1 component.

Example: Circle

The important information is the lifetime of a feature, which can be displayed in a persistence diagram, interval graph, or barcode.
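For connected components, lifetimes are easy to compute directly. Every component is born at r = 0 and dies when an edge merges it into another. The sketch below uses a union-find structure for this; it covers only dimension 0, and the function name is hypothetical.

```python
def h0_barcode(n_points, weighted_edges):
    """Return (birth, death) intervals for connected components.

    weighted_edges: iterable of (r, i, j) triples. One interval never dies
    and is reported with death = infinity.
    """
    parent = list(range(n_points))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    bars = []
    for r, i, j in sorted(weighted_edges):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge merges two components: one of them dies
            parent[root_i] = root_j
            bars.append((0.0, r))
    # each component left at the end never dies
    survivors = len({find(x) for x in range(n_points)})
    bars.extend([(0.0, float("inf"))] * survivors)
    return bars
```

For two tight pairs of points that are far apart, the bars are two short intervals, one long interval ending where the pairs finally join, and one infinite interval.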

Example: Cylinder

Example: Sphere with 4 voids

Trial: Lake Monitoring Data

Data was collected from buoys on Lake Nipissing:
Temperature
Specific conductivity
Dissolved oxygen concentration
pH
Chlorophyll (RFU, relative fluorescence units)
Total algae (RFU)

Trial: Lake Monitoring Data

Sept. 4, 2011, 3-complex, all 6 dimensions:

Trial: Lake Monitoring Data

For higher-dimensional data, it may make more sense to construct higher-dimensional complexes.
It may also help to focus our attention on dimensions that we expect to be more strongly correlated.
The next trial constructs a 2-complex on DO concentration, pH, and algae, using a larger set of data from Sept. 4, 2011:

Trial: Lake Monitoring Data

3-complex on Sept. 2, 2011 data:

Trial: Lake Monitoring Data

Each combination of dimension of the data and dimension of the complex being built has so far failed to recognize any significant features in the shape of the data.
Combining data sets from different times of year might result in greater variation in the data, and a greater chance of patterns being found.

Summary

Construct a filtration of a simplicial complex on our data by building a sequence of neighbourhood graphs across an interval of the proximity parameter.
Plot Betti numbers against this proximity parameter.
Features which persist longer are more likely to represent the shape of the data.
Shape is important!

Acknowledgments

Mark Wachowiak (supervisor, artificial data sets)
Renata Smolikova-Wachowiak (lake monitoring data)
Gunnar Carlsson (see on the shape of data: https://www.youtube.com/watch?v=kctyag2Xi8o)
Adam Cutbill (author of original program)
Afra Zomorodian (fast construction of the Vietoris-Rips complex)
Vin de Silva (topological estimation using witness complexes)
