
Chapter 3 Norm And Distance

3.1 Norm
- Introduction
- Properties of Norm
- General Norms
- Root Mean Square (RMS) value
- Norm Of A Sum
- Norm Of Block Vectors
- Chebyshev Inequality
3.2 Distance
- Euclidean Distance
- Triangle Inequality
- Examples
- Units For Heterogeneous Vector Entries
3.3 Standard Deviation
- Intro
- Average, RMS value, standard deviation
- Examples
- Chebyshev Inequality For Standard Deviation
- Properties Of Standard Deviation
- Standardization
3.4 Angles
- Cauchy-Schwarz Inequality
- Verification of Triangle Inequality
- Angles Between Vectors
- Acute And Obtuse Angles
- Examples
- Norm of Sum Via Angles
- Correlation Coefficient
- Standard Deviation Of Sum
- Hedging Investments
- Units For Heterogeneous Vector Entries
3.5 Complexity
3.6 Exercises

Notes

Chapter 3 Norm And Distance


The norm of a vector - a measure of its magnitude, and associated concepts like
distance, angle, standard deviation, and correlation.

3.1 Norm
- Introduction

The Euclidean norm of an n-vector x, denoted ||x||, is defined as

||x|| = sqrt(x_1^2 + .. + x_n^2) = sqrt(x T x) = ||x||_2

where the subscript 2 indicates the power to which the components of x are raised
(before summing and taking the root).

e.g || (2, -1, 2) || = sqrt(4 + 1 + 4) = sqrt(9) = 3

other terms used for the euclidean norm: magnitude, length (the latter should be
avoided since 'length' is also used for the number of components in the vector)

when the number of elements in x is 1, i.e. when x is a 1-vector, the euclidean norm is
the same as the absolute value of x. The Euclidean norm is a generalization of the
absolute value, as indicated by the double lines.

the euclidean norm gives us some indication of the magnitude of a vector, just as
the absolute value does for a scalar. A vector with a 'small' norm is a small
vector, and one with a large norm is a large vector, with the precise boundary
between small and large depending on context.
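
A minimal numerical sketch (assuming NumPy is available) checking the example above,
both from the definition and with the built-in norm function:

    import numpy as np

    x = np.array([2.0, -1.0, 2.0])

    # Euclidean norm from the definition: square root of the sum of squared entries
    norm_by_hand = np.sqrt(np.sum(x**2))

    # the same value via NumPy's built-in Euclidean norm
    norm_builtin = np.linalg.norm(x)

    print(norm_by_hand, norm_builtin)  # both print 3.0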

- Properties of Norm
Where x and y are vectors of the same size, and beta is a scalar:
1. Non-negative homogeneity:
   || beta x || = | beta | || x ||
2. Triangle inequality: || x + y || <= || x || + || y ||. Another name for this
inequality is "subadditivity"
3. Non-negativity: || x || >= 0
4. Definiteness: || x || = 0 if and only if x = 0

(KEY) The last two properties together are called 'positive definiteness'.


Proving these properties is easy *except* for the triangle inequality, whose proof is
deferred for now.

- General Norms

KEY: Any real-valued function of a vector x that satisfies properties 1 through 4 above
is called a (general) norm. We use only the Euclidean norm in this book; Ex 3.5 gives
some other useful norms.

- Root Mean Square (RMS) value

Mean square value: ms(x) = || x ||^2 / n = (x_1^2 + .. + x_n^2) / n, i.e. the mean
of the squares of the components of x

*Root* mean square value: rms(x) = sqrt(mean square value) = sqrt( (x_1^2
+ ... + x_n^2) / n ) = ||x|| / sqrt(n)

The RMS value is often used to compare vectors of different dimensions.

It gives a typical value for the magnitude of the vector's components.
e.g: the norm of the ones vector 1 is sqrt(n), where n is the dimension of the vector,
but its rms value is 1
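
A quick check of this (NumPy assumed; the dimension n = 10 is arbitrary):

    import numpy as np

    n = 10
    ones = np.ones(n)

    norm = np.linalg.norm(ones)   # sqrt(n), about 3.162 for n = 10
    rms = norm / np.sqrt(n)       # 1.0, a 'typical' entry magnitude
    print(norm, rms)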

- Norm Of A Sum

A useful formula for the norm of the sum of two vectors:

|| x + y || = sqrt( || x ||^2 + 2 x T y + || y ||^2 )

Derivation

|| x + y ||^2 = x_1^2 + .. + x_n^2 + y_1^2 + .. + y_n^2 + 2 x_1 y_1 + .. + 2 x_n y_n
              = || x ||^2 + || y ||^2 + 2 (x T y)

taking square roots on both sides

|| x + y || = sqrt( || x ||^2 + || y ||^2 + 2 (x T y) )
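
A numerical sanity check of this identity (the vectors below are arbitrary, made up
for illustration; NumPy assumed):

    import numpy as np

    x = np.array([1.0, -2.0, 0.5])
    y = np.array([3.0, 1.0, -1.0])

    lhs = np.linalg.norm(x + y)
    rhs = np.sqrt(np.linalg.norm(x)**2 + 2 * (x @ y) + np.linalg.norm(y)**2)
    print(np.isclose(lhs, rhs))  # True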

- Norm Of Block Vectors

The norm ***squared*** value of a block vector is the sum of the norm squared
values of its component vectors.

Let d = (a, b, c), where a, b, c are all vectors.

then || d ||^2 = d T d = (a, b, c) T (a, b, c) = a T a + b T b + c T c = ||a||^2 + ||b||^2
+ ||c||^2

Rewriting in terms of norms:

|| (a,b,c) || = || (||a||, ||b||, ||c||) ||

In words, the norm of a stacked vector is the norm of the vector formed from the
norms of its components.
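
A small check of this property (arbitrary component vectors, NumPy assumed):

    import numpy as np

    a = np.array([1.0, 2.0])
    b = np.array([-3.0])
    c = np.array([0.5, 0.5, 4.0])

    d = np.concatenate([a, b, c])  # the stacked vector (a, b, c)

    lhs = np.linalg.norm(d)
    rhs = np.linalg.norm([np.linalg.norm(a), np.linalg.norm(b), np.linalg.norm(c)])
    print(np.isclose(lhs, rhs))  # True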

- Chebyshev Inequality

Suppose x is an n-vector, and a is a scalar > 0 (note NOT >= 0, a has to be > 0).

Let k be the number of entries of x s.t |x_i| >= a.

then || x ||^2 = x_1^2 + x_2^2 + .. + x_n^2 has to be >= k * a^2, since k of the
terms in the sum are >= a^2 and, thanks to the squaring, the other n - k terms are
non-negative. From this we conclude

k <= || x ||^2 / a^2

This is the Chebyshev inequality.

It is easier to interpret in terms of the root mean square value, so we divide both
sides by n:

k/n <= || x ||^2 / (n * a^2)

since rms(x) = ||x|| / sqrt(n), this is

k/n <= ( rms(x) / a )^2

the LHS is the fraction of entries of x whose *absolute* values are at least a
the RHS is the reciprocal of the square of a / rms(x)

"It says for example that at most 1/25 = 4% of a vector's entries can have 5 times
the RMS value".

the question asked is if we have the rms value, then what percentage of a vector's
components have a value n times the RMS value.

so for the question "what percentage of a vector's components have a value >= 5
times the RMS value?"
the RHS becomes (1/5)^2 = (1/25)

then we know that k/n <= 1/25 == 4/100

Since k/n is the ratio the components of the vector that have absolute value >= a
(here 5 times rms value), we conclude that this ratio is 4%
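
An empirical check of the bound on a random vector (illustrative only; any vector
would do, NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1000)

    rms = np.linalg.norm(x) / np.sqrt(len(x))
    a = 5 * rms

    fraction = np.mean(np.abs(x) >= a)   # fraction of entries with |x_i| >= a
    bound = (rms / a)**2                 # Chebyshev bound, here (1/5)^2 = 0.04
    print(fraction <= bound)             # True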

From wikipedia

Suppose we randomly select a journal article from a source with an average of 1000
words per article, with a standard deviation of 200 words. We can then infer that
the probability that it has between 600 and 1400 words (i.e. within k = 2 standard
deviations of the mean) must be at least 75%, because there is no more than 1/k^2
= 1/4 chance to be outside that range, by Chebyshev's inequality. But if we
additionally know that the distribution is normal, we can say there is a 75% chance
the word count is between 770 and 1230 (which is an even tighter bound).

3.2 Distance
- Euclidean Distance
We can use the norm to define the Euclidean distance between two vectors a and b as
dist(a,b) = || a - b ||
This applies to vectors of any dimension.

dist(a,b) / sqrt(n), where n is the size of the vectors, is known as the "RMS deviation"
between the two vectors.

When the distance between two vectors is 'small' (a context-dependent judgment) they
are said to be 'near'; when the distance between them is large, they are 'far'.

e.g: consider the vectors

u = (1.8, 2.0, -3.7, 4.7),  v = (0.6, 2.1, 1.9, -1.4),  w = (2.0, 1.9, -4.0, 4.6)

then || u - v || = 8.368, || u - w || = 0.387 and || v - w || = 8.533

so u and w are very close, while u and v, and v and w, are far apart.
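
These distances are easy to reproduce (NumPy assumed):

    import numpy as np

    u = np.array([1.8, 2.0, -3.7, 4.7])
    v = np.array([0.6, 2.1, 1.9, -1.4])
    w = np.array([2.0, 1.9, -4.0, 4.6])

    print(np.linalg.norm(u - v))  # ~8.368
    print(np.linalg.norm(u - w))  # ~0.387
    print(np.linalg.norm(v - w))  # ~8.533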

- Triangle Inequality

Picture: a triangle with vertices a, b, c and edge lengths || a - b ||, || a - c ||,
|| b - c ||.

'by geometric intuition', the length of one side of a triangle cannot be greater
than the sum of the other two sides
==
|| a - c || <= || a - b || + || b - c||
== rewriting the LHS a - c
|| (a-b) + (b -c) || <= || a - b || + || b - c ||
which is true by the triangle inequality.

- Examples

1. Feature Distance
   If x and y are feature vectors, i.e. vectors whose entries correspond to n
features of two objects, the distance between these two vectors, || x - y ||, is
called the feature distance, and gives a measure of how different (or similar) the
objects are in terms of feature values.

   E.g: if the feature vectors describe patients in a hospital, with entries
corresponding to age, weight, presence of chest pain, difficulty breathing, and the
results of tests, then we can use feature distance to say one patient case is
'near' another (at least in terms of the features)

2. RMS prediction error

   Let y be an n-vector representing a time series, say hourly temperature at a
given location.
   Let y-hat be another n-vector with predictions (_ for the same time period)
based on other information (_ than direct readings of temperature, e.g. historical
data, or brightness of the sun + direction and strength of wind, or whatever).
   then the vector y - y-hat is called the prediction error. *its* rms value,
rms(y - y-hat), is called the RMS prediction error. If this quantity is small, the
prediction is good.

3. Nearest neighbor.
   Let z_1 ... z_m be m n-vectors, and let x be another n-vector. Then
the nearest neighbor of x is defined as the z_i s.t || x - z_i || <= || x - z_j ||
for j = 1 .. m.
   In words, z_i is the closest (among z_1, ..., z_m, using dist(x, z_j)) to x.
(A brute-force sketch is given after these examples.)

   This idea of nearest neighbor (and its generalization to k nearest neighbors) is
used in many applications.

4. Document Dissimilarity
   Let x and y be the word histograms of two documents. Then the distance between
them, dist(x,y), reveals how similar (or not) the documents are. We might expect the
similarity to be greater for two documents of the same genre, author, or topic.
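
A minimal brute-force nearest-neighbor sketch, as described in example 3 above
(the data vectors are made up for illustration; NumPy assumed):

    import numpy as np

    def nearest_neighbor(x, zs):
        # index of the vector in zs closest to x (Euclidean distance)
        dists = [np.linalg.norm(x - z) for z in zs]
        return int(np.argmin(dists))

    zs = [np.array([2.0, 1.0]), np.array([7.0, 2.0]), np.array([5.5, 4.0])]
    x = np.array([5.0, 6.0])
    print(nearest_neighbor(x, zs))  # 2, i.e. the third vector is nearest to x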

- Units For Heterogeneous Vector Entries

Consider the *square* of the distance between two vectors

|| x - y ||^2 = (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2

(_ the squaring is to get rid of the awkwardness of reproducing the sqrt every
time)

roughly speaking, all entries of the vectors have equal status in determining
the distance between them:
the effect (on the distance) of x_1 and y_1 differing by 1 is the same as that of
x_101 and y_101 differing by 1.

this makes sense when x and y represent the same kind of quantity using the same units
(e.g. temperature time series at two different locations). Then it makes sense to
say that these vectors are 'close' when the distance between them is 'small'.
(_ I think this is badly written; the real point should be that all entries *within*
a vector represent the same quantity with the same units, as with (say) word
frequency vectors, where each entry is the number of times word_i appears in a
document.)

When different components of the *same* vector represent different quantities with
differing units, if we want each feature to have an equal effect in determining
distance, then we have to ensure that their typical values have roughly the same
magnitude, adjusting units where necessary.

e.g:

consider a vector whose components have attributes about houses - the first
entry being house area and the second being number of bedrooms.
Let's say we use "thousands of square feet" as the unit for area. Then the vector
components representing these features would both vary from about 1 to 5.

consider three houses with the following vectors

x = (1.6,2) y = (1.5, 2), z = (1.6, 4)

dist(x,y) = sqrt((1.6 - 1.5)^2 + (2 - 2)^2) = sqrt(0.01) = 0.1, which is small
(compared to the norms of x and y, which are around 2.5). This matches our
intuition that the houses are similar, having roughly the same area and the same
number of bedrooms.
(_ dist(x,z) = sqrt((1.6 - 1.6)^2 + (2 - 4)^2) = 2, which is comparable to the norms,
so these houses are dissimilar)

But if we represent the areas in square feet (vs thousands of square feet as
above) we get the vectors
x = (1600, 2), y = (1500, 2), z = (1600, 4). The norms of the vectors are now around 1600.

Now the distance between x and z is 2, which is small compared to the norms, so these
houses now look 'similar', while dist(x,y) = 100 makes x and y look 'dissimilar'.
So in these units house area strongly influences distance, with bedrooms playing a
very small role.
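
The effect of the choice of units can be checked directly (NumPy assumed, using the
house vectors above):

    import numpy as np

    # area in thousands of square feet, number of bedrooms
    x, y, z = np.array([1.6, 2]), np.array([1.5, 2]), np.array([1.6, 4])
    print(np.linalg.norm(x - y), np.linalg.norm(x - z))   # ~0.1 and 2.0

    # same houses with area in square feet: area now dominates the distance
    X, Y, Z = np.array([1600, 2]), np.array([1500, 2]), np.array([1600, 4])
    print(np.linalg.norm(X - Y), np.linalg.norm(X - Z))   # 100.0 and 2.0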

3.3 Standard Deviation


- Intro
A 'de-meaned vector' is obtained from a vector x by first calculating the mean of
all components in the vector, and then subtracting that mean from each component.
It is denoted x~.

the sum of values of x~ is zero.

in vector notation x~ = x - (avg(x) * 1)

then the standard deviation is the root mean square value of x~

If avg(x) is denoted by a, then std(x) = sqrt( ((x_1 - a)^2 + .. + (x_n - a)^2) / n )

alternate formulation: std(x) = || x - (1 T x / n) 1 || / sqrt(n)

The standard deviation shows the extent to which the vector's entries differ
from their average value.
The standard deviation of a vector is zero only when all its entries are equal.
The standard deviation of a vector is small when its entries are all nearly the
same.

- Average, RMS value, standard deviation


rms(x) ^2 = avg(x) ^2 + std(x) ^2
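
A quick numerical check of this identity (the vector is arbitrary, NumPy assumed):

    import numpy as np

    x = np.array([1.0, -2.0, 3.0, 2.0])

    avg = x.mean()
    rms = np.linalg.norm(x) / np.sqrt(len(x))
    std = np.linalg.norm(x - avg) / np.sqrt(len(x))   # rms value of the de-meaned vector

    print(np.isclose(rms**2, avg**2 + std**2))  # True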

- Examples
Mean Return And Risk

Suppose that an n-vector represents a time series of returns on an investment over
n periods, expressed as a percentage of the amount invested.

Its mean gives us the average return over the whole n periods.
The standard deviation is a measure of how variable this return is over the whole
period, i.e. how much it varies from the mean, and is interpreted as the 'risk' of
the investment.

A desirable return history vector has a high mean and low risk.

- Chebyshev Inequality For Standard Deviation


The Chebyshev inequality can be transcribed into a form that uses the standard
deviation and mean.

If k is the number of entries that satisfy |x_i - avg(x)| >= a, then (k/n) <=
(std(x) / a)^2. Note that here a is naturally measured as a multiple of the
*standard deviation* of x (vs being a multiple of the rms *of x* in the earlier
definition).

"This inequality is only interesting when a > std(x)": otherwise the RHS is >= 1,
and the statement k/n <= 1 is trivially true.


e.g: what fraction of a vector's entries can deviate from the mean value of its
components by 3 standard deviations or more?

working out

RHS = (1/3)^2 = 1/9 = 0.111.. so k/n <= 11.1% (the key is that a's magnitude is
represented as a number of std devs from the mean)

- Properties Of Standard Deviation


- adding a constant
std(x + alpha 1) = std(x)
in words, adding a constant to every element of a vector does not change
the standard deviation

- multiplying by a scalar
        for any vector x and any scalar alpha,
        std(alpha x) = | alpha | std(x)
        in words, the std of a vector formed by multiplying a vector by a scalar
alpha is equal to the product of the absolute value of alpha and the standard
deviation of the original vector x

- Standardization

for any vector x


de-meaned vector x~ = x - avg(x) 1 ; x~ has avg = 0
if we divide x~ by std(x) (which is the rms value of x~) we get a vector z
    z = (x - avg(x) 1) / std(x)
z is called the 'standardized version of x' and has mean 0 and std 1. Its
entries are called the z-scores of the corresponding entries of x.
example: z_6 = 1.4 says that x_6's value is 1.4 standard deviations above the mean
value of the entries of x

Interpretation example.
Assume that an n-vector gives the values of a specific medical test (say blood
sugar) for n patients admitted to a hospital.
If z_4 = -3.2 (i.e. patient number 4 has a test value that is 3.2 standard
deviations below the mean value), then we can say that compared to the population,
this patient has an abnormally low value.
if another patient has a z-score of 0.3, we can say that she is close to the mean
for the population.
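
A short standardization sketch (the test values are hypothetical, NumPy assumed):

    import numpy as np

    x = np.array([92.0, 105.0, 88.0, 130.0, 97.0])   # hypothetical test values

    x_demeaned = x - x.mean()
    std = np.linalg.norm(x_demeaned) / np.sqrt(len(x))
    z = x_demeaned / std                              # standardized vector of z-scores

    # mean ~0 and rms (= std) 1, as claimed above
    print(z.mean(), np.linalg.norm(z) / np.sqrt(len(z)))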

3.4 Angles

- Cauchy-Schwarz Inequality
an important inequality that connects inner products and norms

|a T b| <= || a || || b ||

written out in terms of the components, we get the 'intimidating form'

| a_1 b_1 + .. + a_n b_n | <= sqrt(a_1^2 + ... + a_n^2) * sqrt(b_1^2 + ... + b_n^2)


for proof see notes.

the condition for | a T b | = || a || || b || is || beta a - alpha b || = 0 (see the
proof for how alpha and beta are chosen), which occurs only when beta a = alpha b,
i.e. (in the case where both are non zero) each vector is a scalar multiple of the
other.

So to summarize, the Cauchy-Schwarz inequality holds with equality when one vector
is a multiple of the other, or when either or both are zero vectors; in all other
cases it holds with strict inequality.

- Verification of Triangle Inequality

We use the Cauchy Schwarz inequality to prove the triangle inequality.

|| a + b ||^2
== definition of norm of addition
||a||^2 + 2 (a T b) + || b || ^2
<= using Cauchy Schwarz Inequality
||a||^2 + 2 ||a||||b|| + ||b||^2
== using (x + y)^2 = x^2 + 2 x y + y^2 for all scalars x,y
(||a|| + ||b||)^2

So we proved that ||a + b|| ^2 <= (||a|| + ||b||)^2

Taking positive square roots we get the triangle inequality

- Angles Between Vectors

The angle theta between two non zero vectors a and b is defined as

theta = arccos( (a T b) / (||a|| ||b||) )


(_ the CS inequality shows that |a T b| and ||a|| ||b|| are equal exactly when either
vector is zero or when one is a scalar multiple of the other. For two non zero
vectors where these quantities are *not* equal, what does their ratio say about the
vectors' relationship with each other? that is what is dealt with here)

where arccos denotes the inverse cosine, whose value lies in the interval [0, pi]

iow, we define theta to be the unique number between 0 and pi that satisfies

aTb = || a || || b || cos theta

this definition can be applied to non zero vectors of any dimension
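
A minimal sketch of this definition (arbitrary example vectors, NumPy assumed):

    import numpy as np

    def angle(a, b):
        # angle between non-zero vectors a and b, in radians, in [0, pi]
        cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        # clip guards against tiny floating-point overshoot outside [-1, 1]
        return np.arccos(np.clip(cos_theta, -1.0, 1.0))

    a = np.array([1.0, 2.0, -1.0])
    b = np.array([2.0, 0.0, -3.0])
    print(np.degrees(angle(a, b)))  # roughly 55.5 degrees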

- Acute And Obtuse Angles

5 cases
case 1: angle = 90 degrees = pi/2 radians
        at this angle cos theta = 0, so a T b = 0. Such vectors are called
orthogonal, and have a special symbol: a perp b
case 2: angle = 0 degrees
        vectors are aligned. cos 0 = 1, so a T b = || a || || b ||. one vector
is a positive multiple of the other.
case 3: angle = 180 degrees.
        cos theta = -1, so a T b = - ( || a || || b || ). vectors are anti-aligned,
one is a negative multiple of the other.
case 4: 0 < theta < 90.
        cos theta is positive, so a T b is positive. vectors have a positive inner
product. the vectors make an acute angle with each other.
case 5: 90 < theta < 180
        cos theta is negative, so a T b is negative. Vectors have a negative inner
product. They make an obtuse angle with each other.

- Examples
1. Spherical Distance
    If a and b are two 3-vectors that represent points on a sphere of radius R, then
the 'spherical distance' between them, measured along the sphere, is R * angle(a,b)
(for a unit sphere, just the angle).

2. Document similarity via angles


    If a and b are n-vectors representing word counts (or word histograms) of two
documents, the angle between them can be used as a measure of their (dis)similarity.
    (a table of such angles for various documents is given in the book)

- Norm of Sum Via Angles

For n-vectors a,b we have the 'norm addition' rule.

|| a + b || ^2 = ||a|| ^2 + 2 a T b + || b || ^2

we can replace the a T b (using the angle formula) with || a || || b || cos theta
to get

|| a + b ||^2 = || a ||^2 + 2 || a || || b || cos theta + || b ||^2

comments
1. If a and b are aligned (theta = 0), we have || a + b ||^2 = (|| a || + || b ||)^2,
i.e. || a + b || = || a || + || b ||. Vector norms add.

2. If a and b are orthogonal (theta = 90 degrees), we have || a + b ||^2 = || a ||^2
+ || b ||^2, from which we get || a + b || = sqrt( || a ||^2 + || b ||^2 ), which is
the Pythagorean theorem.

- Correlation Coefficient

Suppose a~ and b~ are the *non zero* de-meaned vectors of a and b. (x~ = x -
avg(x) 1)

Then the correlation coefficient of a and b is defined in terms of the *de-meaned
vectors* as

rho = (a~ T b~) / ( || a~ || || b~ || )   ; note all terms carry ~s: the formula uses
the de-meaned vectors, not a and b directly

this is the same expression as cos( angle(a~, b~) ).

Thus the correlation coefficient of two vectors is the cosine of the angle between
their de-meaned versions.

We can express the correlation coefficient in terms of standardized vectors:

rho = (u T v) / n, where u and v are the standardized vectors of a and b
respectively (i.e. u = a~/std(a) and v = b~/std(b); we use ||u|| = ||v|| = sqrt(n))

corr-coeff(a,b) = corr-coeff(b,a), so the correlation coefficient is a symmetric
function of its two arguments.

The Cauchy-Schwarz inequality tells us that the correlation coefficient ranges
between -1 and 1: since |a~ T b~| <= || a~ || || b~ ||, we get | rho | <= 1.
Sometimes the corrcoeff is expressed as a percentage, e.g: cc = 30% == cc = 0.3.
when cc = rho = 0 we say that the vectors are uncorrelated. (A vector with all
elements the same is conventionally understood to be uncorrelated with any other
vector.)

The correlation coefficient tells us how the entries in the two vectors change
together. High correlation (say rho = 0.8) means that entries of a and b are above
their respective means at many of the same indices (_ and similarly below their
means at many of the same indices). rho = 1 arises only when the de-meaned vectors
are aligned, i.e. they are positive multiples of each other.

rho = -1 occurs when the de-meaned vectors are anti-aligned, i.e. they are negative
multiples of each other.

The correlation coefficient is often used when two vectors represent time-series
data, e.g: returns on investments over the same period, or rainfall records over the
same period.
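
A minimal sketch of the definition (the two series are made up for illustration;
b is an affine function of a, so rho = 1; NumPy assumed):

    import numpy as np

    def corr_coeff(a, b):
        # correlation coefficient of a and b, via their de-meaned vectors
        a_t = a - a.mean()
        b_t = b - b.mean()
        return (a_t @ b_t) / (np.linalg.norm(a_t) * np.linalg.norm(b_t))

    a = np.array([1.0, 2.0, 3.0, 4.0])
    b = np.array([2.0, 3.5, 5.0, 6.5])   # b = 1.5 a + 0.5
    print(corr_coeff(a, b))              # 1.0 (up to rounding)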

- Standard Deviation Of Sum (in terms of the correlation coefficient)

std(a + b) = sqrt( std^2(a) + 2 rho std(a) std(b) + std^2(b))

Start with equation 3.6 == || x + y ||^2 = ||x||^2 + 2 ||x|| ||y|| cos theta + ||
y||^2

We let a~ and b~ be the de-meaned version of a and b.

Then a~ + b~ is the de-meaned version of a + b

and std(a+b)^2 = || a~ + b~ ||^2 / n, so

n std(a+b)^2 = || a~ + b~ ||^2
             = || a~ ||^2 + 2 rho || a~ || || b~ || + || b~ ||^2
             = n std(a)^2 + 2 rho n std(a) std(b) + n std(b)^2 ; note: no tildes
needed, since || a~ || = sqrt(n) std(a)

divide throughout by n and take square roots to get the formula above

If rho = 1, the std of the sum of the vectors is the sum of the std devs of the vectors.
As rho decreases, the standard deviation of the sum decreases.
When rho = 0, std(a + b) = sqrt( std(a)^2 + std(b)^2 ).
when rho = -1, std(a + b) = | std(a) - std(b) |
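
A numerical check of the std-of-sum formula (arbitrary vectors, NumPy assumed):

    import numpy as np

    def std(x):
        # standard deviation as the rms value of the de-meaned vector
        return np.linalg.norm(x - x.mean()) / np.sqrt(len(x))

    a = np.array([1.0, 3.0, 2.0, 5.0])
    b = np.array([2.0, 1.0, 4.0, 3.0])

    a_t, b_t = a - a.mean(), b - b.mean()
    rho = (a_t @ b_t) / (np.linalg.norm(a_t) * np.linalg.norm(b_t))

    lhs = std(a + b)
    rhs = np.sqrt(std(a)**2 + 2 * rho * std(a) * std(b) + std(b)**2)
    print(np.isclose(lhs, rhs))  # True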

- Hedging Investments

Suppose a and b are two time series vectors representing returns from investments
in two assets over the same time periods.
Let these investments have the same (average) return mu and same risk (standard
deviation) sigma.

then the vector c = (a + b) / 2 represents an investment of 50% in each of the two
assets.
This blended investment has the same return as the originals, since

avg((a + b) / 2) = (avg(a) + avg(b)) / 2 = (mu + mu) / 2 = mu

The risk of this blended investment is

std((a+b)/2) = sqrt( 2 sigma^2 + 2 rho sigma^2 ) / 2 = sigma sqrt((1 + rho)/2)

From this we see that the blended investment is never riskier than the original
assets, and the risk is smaller when the correlation between the asset returns is
smaller.

When the returns are uncorrelated (rho = 0), the risk is a factor of 1/sqrt(2) = 0.707
smaller than the risk of the original assets.
if the asset returns are negatively correlated (rho near -1), the risk of the
blended asset is much smaller than that of the original assets.

Investing in assets with uncorrelated or negatively correlated returns is called
hedging, and is used to reduce risk.

- Units For Heterogeneous Vector Entries

When the entries of a vector (to be more precise, different entries of the same
vector) represent different quantities with different units, the units chosen for
each quantity affect angle, standard deviation, and correlation. The general rule
of thumb is to choose units s.t the typical values or ranges of values of the
different entries have roughly the same magnitude.

3.5 Complexity
1. Norm: Computing the norm of an n-vector requires n multiplications (to square
each entry), n - 1 additions, and 1 square root (so about 2n flops total. computing
the square root typically takes more flops/time than a product or sum, but is
still treated as taking only one flop)

2. RMS: same as above + 2 flops for computing sqrt(n), then dividing.

3. distance between two vectors: first n subtractions, then compute the norm, so
roughly 3n flops

4. angle between two vectors: about 6n flops (the inner product a T b costs about 2n
flops, each of the two norms about 2n flops, and the final division and arccos only
a few more)

5. Demeaning a vector: 2n flops, n for finding the average (n - 1 additions + 1
division) then n subtractions of the average from each entry of the vector.

6. std is the rms of a demeaned vector, so (5) + (2) = 4n flops.
        But the identity rms(x)^2 = avg(x)^2 + std(x)^2 above gives a more efficient
method: first compute the average (n flops), then the rms value (2n flops), then
std as sqrt(rms^2 - avg^2). So about 3n flops plus a handful more.

7. to calculate the nearest neighbor of a given n-vector x among the n-vectors
z_1 .. z_k:
   the simplest approach is to calculate all the distances || z_i - x || then find
the minimum (comparing two numbers is counted as one flop)
   the cost is 3n (to calculate the distance for one pair) * k (number of distances
required) + k - 1 (comparing the results)
   order of this operation is kn.

3.6 Exercises
