3.1 Norm
- Introduction
- Properties of Norm
- General Norms
- Root Mean Square (RMS) value
- Norm Of A Sum
- Norm Of Block Vectors
- Chebyshev Inequality
3.2 Distance
- Euclidean Distance
- Triangle Inequality
- Examples
- Units For Heterogeneous Vector Entries
3.3 Standard Deviation
- Intro
- Average, RMS value, standard deviation
- Examples
- Chebyshev Inequality For Standard Deviation
- Properties Of Standard Deviation
- Standardization
3.4 Angles
- Cauchy-Schwarz Inequality
- Verification of Triangle Inequality
- Angles Between Vectors
- Acute And Obtuse Angles
- Examples
- Norm of Sum Via Angles
- Correlation Coefficient
- Standard Deviation Of Sum
- Hedging Investments
- Units For Heterogeneous Vector Entries
3.5 Complexity
3.6 Exercises
Notes
3.1 Norm
- Introduction
other terms used for the Euclidean norm: magnitude, length (the latter should be
avoided, since 'length' is also used for the number of components in a vector).
The Euclidean norm gives us some indication of the magnitude of a vector, just as
the absolute value does for a scalar. A vector with a 'small' norm is a small
vector, and one with a large norm is a large vector, with the precise boundary
between small and large depending on context.
- Properties of Norm
Where x and y are vectors of the same size, and beta is a scalar:
1. Nonnegative homogeneity: || beta x || = | beta | || x ||
2. Triangle inequality: || x + y || <= || x || + || y ||. Another name for this
inequality is "subadditivity".
3. Nonnegativity: || x || >= 0
4. Definiteness: || x || = 0 iff x = 0
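These properties are easy to check numerically. A minimal Python sketch (the helper name `norm` and the test vectors are my own, not from the book):

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of numbers."""
    return math.sqrt(sum(vi ** 2 for vi in v))

x = [2.0, -1.0, 2.0]
y = [1.0, 3.0, -2.0]
beta = -3.0

# 1. Nonnegative homogeneity: ||beta x|| = |beta| ||x||
assert math.isclose(norm([beta * xi for xi in x]), abs(beta) * norm(x))
# 2. Triangle inequality: ||x + y|| <= ||x|| + ||y||
assert norm([xi + yi for xi, yi in zip(x, y)]) <= norm(x) + norm(y)
# 3. Nonnegativity
assert norm(x) >= 0
# 4. Definiteness: the zero vector is the only vector with norm 0
assert norm([0.0, 0.0, 0.0]) == 0.0
```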
- General Norms
KEY: Any real-valued function of a vector x which satisfies properties 1 through
4 above is called a general norm. We use only the Euclidean norm in this book;
Ex 3.5 gives some other useful norms.
- Root Mean Square (RMS) Value
rms(x) = sqrt(mean square value) = sqrt( (x_1^2 + ... + x_n^2) / n ) = ||x|| / sqrt(n)
- Norm Of A Sum
A useful formula for the norm of the sum of two vectors:
|| x + y || = sqrt( ||x||^2 + 2 x^T y + ||y||^2 )
Derivation: expand || x + y ||^2 = (x + y)^T (x + y) = x^T x + 2 x^T y + y^T y.
- Norm Of Block Vectors
The norm *squared* of a block vector is the sum of the norms squared of its
component vectors: || (a, b, c) ||^2 = ||a||^2 + ||b||^2 + ||c||^2.
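Both identities can be verified numerically. A small Python sketch (the vectors and helper names `norm`, `inner` are illustrative, not from the book):

```python
import math

def norm(v):
    return math.sqrt(sum(vi ** 2 for vi in v))

def inner(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0]
y = [4.0, -1.0, 2.0]

# Norm of a sum: ||x + y|| = sqrt(||x||^2 + 2 x^T y + ||y||^2)
lhs = norm([xi + yi for xi, yi in zip(x, y)])
rhs = math.sqrt(norm(x) ** 2 + 2 * inner(x, y) + norm(y) ** 2)
assert math.isclose(lhs, rhs)

# Norm of a block vector: ||(x, y)||^2 = ||x||^2 + ||y||^2
stacked = x + y  # list concatenation stacks the two blocks
assert math.isclose(norm(stacked) ** 2, norm(x) ** 2 + norm(y) ** 2)
```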
- Chebyshev Inequality
If k is the number of entries of x with |x_i| >= a, then
k a^2 <= x_1^2 + ... + x_n^2 = ||x||^2, from which k <= ||x||^2 / a^2.
This is easier to interpret in terms of the RMS value, so divide both sides by n:
k/n <= ( rms(x) / a )^2
The LHS is the fraction of entries in x whose *absolute* values are at least a.
The RHS is the square of rms(x)/a, i.e. 1 / (a / rms(x))^2.
"It says for example that at most 1/25 = 4% of a vector's entries can have 5 times
the RMS value".
the question asked is if we have the rms value, then what percentage of a vector's
components have a value n times the RMS value.
so for the question "what percentage of a vector's components have a value >= 5
times the RMS value?"
the RHS becomes (1/5)^2 = (1/25)
Since k/n is the ratio the components of the vector that have absolute value >= a
(here 5 times rms value), we conclude that this ratio is 4%
From wikipedia
Suppose we randomly select a journal article from a source with an average of 1000
words per article, with a standard deviation of 200 words. We can then infer that
the probability that it has between 600 and 1400 words (i.e. within k = 2 standard
deviations of the mean) must be at least 75%, because there is no more than 1/k^2
= 1/4 chance to be outside that range, by Chebyshev's inequality. But if we
additionally know that the distribution is normal, we can say there is a 75% chance
the word count is between 770 and 1230 (which is an even tighter bound).
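The bound k/n <= (rms(x)/a)^2 can be checked numerically. A minimal Python sketch with made-up data (the helper name `rms` is mine):

```python
import math

def rms(v):
    """RMS value: ||v|| / sqrt(n)."""
    return math.sqrt(sum(vi ** 2 for vi in v) / len(v))

# Made-up sample vector; the bound holds for any vector and any a > 0.
x = [0.2, -0.5, 3.0, 0.1, -0.4, 0.3, -2.5, 0.0, 0.6, -0.2]
a = 2.0  # threshold

k = sum(1 for xi in x if abs(xi) >= a)  # entries with |x_i| >= a
bound = (rms(x) / a) ** 2               # Chebyshev bound on k/n
assert k / len(x) <= bound
```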
3.2 Distance
- Euclidean Distance
We can use the norm to define the Euclidean distance between two vectors a and b as
dist(a,b) = || a-b ||
This is true for vectors in any dimension.
dist(a,b) / sqrt(n), where n is the size of the vectors, is known as the "RMS
deviation" between the two vectors.
When the distance between the two vectors is 'small' (context dependent value) they
are said to be 'near'. Else when distance between two vectors is large, 'far'.
(In the book's example vectors u, v, w: u and w are very close, while u and v,
and v and w, are much further apart.)
- Triangle Inequality
'by geometric intuition', the length of one side of a triangle cannot be greater
than the sum of the other two sides
==
|| a - c || <= || a - b || + || b - c||
== rewriting the LHS, a - c = (a - b) + (b - c)
|| (a-b) + (b -c) || <= || a - b || + || b - c ||
which is true by the triangle inequality.
- Examples
1. Feature Distance
If x and y are feature vectors, i.e. vectors whose n entries correspond to n
features of two objects, then the distance between them, || x - y ||, is called
the feature distance, and gives a measure of how different (or similar) the two
objects are in terms of their feature values.
E.g. if the feature vectors describe patients in a hospital, with entries
corresponding to age, weight, presence of chest pain, difficulty breathing, and
the results of tests, then we can use feature distance to say one patient case is
'near' another (at least in terms of the features).
3. Nearest neighbor.
Let z_1, ..., z_m be m n-vectors, and let x be another n-vector. The nearest
neighbor of x is defined as the z_i satisfying || x - z_i || <= || x - z_j ||
for all j = 1, ..., m.
In words: z_i is the closest to x among z_1, ..., z_m, measured by dist(x, z_j).
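A minimal Python sketch of nearest-neighbor lookup (the points and helper names are made up for illustration):

```python
import math

def dist(x, y):
    """Euclidean distance ||x - y||."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def nearest_neighbor(x, zs):
    """Return the vector in zs closest to x (ties broken by order)."""
    return min(zs, key=lambda z: dist(x, z))

# Made-up collection of 2-vectors and a query point.
zs = [[2, 1], [7, 2], [5.5, 4], [4, 8], [1, 5], [9, 6]]
print(nearest_neighbor([5, 6], zs))  # -> [5.5, 4]
```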
4. Document Dissimilarity
Let x and y be two word histograms for two documents. Then the distance between
them, dist(x, y), reveals how similar (or not) the documents are. We might expect
the distance to be smaller (similarity greater) for two documents of the same
genre, author, or topic.
(_ the squaring is to get rid of the awkwardness of reproducing the sqrt every
time)
Roughly speaking, all entries in the vectors have equal status in determining
the distance between them:
the effect (on distance) of x_1 and y_1 differing by 1 is the same as x_101 and
y_101 differing by 1.
this makes sense when x and y represent the same quantity using the same units
(e.g temperature time series at two different locations). Then it makes sense to
say that these vectors are 'close' when the distance between them is 'small'.
(_ i think this is badly written. the real point to be made should be that all
entries *within* a vector represent the same quantity with the same units as with
(say) word frequency vectors, where each entry represents the number of times
word_i is represented in a document).
When different components of the *same* vector represent different quantities with
differing units, if we want each feature to have an equal effect in determining
distance, then we have to ensure that their typical values have roughly the same
magnitude, adjusting units where necessary.
e.g:
consider a vector whose components have attributes about houses - the first
entry being house area and the second being number of bedrooms.
Say we use "thousands of square feet" as the unit for area. Then both vector
entries typically range from about 1 to 5, so both features affect the distance
comparably.
But if we represent the areas in square feet (vs thousands of square feet as
above), we get vectors like
x = (1600, 2), y = (1500, 2), z = (1600, 4), with norms around 1600.
Now dist(x, z) = 2, which is tiny compared to the norms, so x and z count as
'similar' even though they differ by two bedrooms, while dist(x, y) = 100.
So house area now strongly dominates the distance, with bedroom count playing a
very small role.
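The house example can be reproduced directly (values as in the text above; the helper name `dist` is mine):

```python
import math

def dist(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Houses as (area, bedrooms). With area in raw square feet the first
# entry dominates the distance:
x, y, z = [1600, 2], [1500, 2], [1600, 4]
assert dist(x, z) == 2.0    # differs by 2 bedrooms, yet counts as 'near'
assert dist(x, y) == 100.0  # differs by only 100 sq ft, yet 'far'

# With area in thousands of square feet, both entries have comparable
# typical magnitudes, and the bedroom count matters again:
xk, yk, zk = [1.6, 2], [1.5, 2], [1.6, 4]
assert dist(xk, yk) < dist(xk, zk)  # now x is nearer to y than to z
```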
3.3 Standard Deviation
- Intro
The standard deviation shows the extent to which a vector's entries deviate from
their average value.
The standard deviation of a vector is zero only when all its entries are equal.
The standard deviation of a vector is small when its entries are all nearly
equal to each other.
- Examples
Mean Return And Risk
Suppose an n-vector represents a time series of returns on an investment over n
periods, expressed as a percentage of the amount invested.
Its mean gives the average return over the n periods.
The standard deviation measures how variable the return is over the period, i.e.
how much each period's return deviates from the mean, and is interpreted as the
'risk' of the investment.
A desirable return history vector has a high mean and low risk.
- Chebyshev Inequality For Standard Deviation
If k is the number of entries that satisfy |x_i - avg(x)| >= a, then
k/n <= ( std(x) / a )^2.
Note that here a measures deviation from the mean, in multiples of the
*standard deviation* of x (vs being a multiple of the RMS value of x in the
earlier definition).
Working out: for a = 3 std(x),
k/n <= (1/3)^2 = 1/9 ≈ 11.11% (the key is that a's magnitude is expressed as a
number of standard deviations from the mean).
- Properties Of Standard Deviation
Multiplying by a scalar: for any vector x and any scalar alpha,
std(alpha x) = | alpha | std(x)
In words: the standard deviation of a vector formed by multiplying a vector by a
scalar alpha equals the absolute value of alpha times the standard deviation of
the original vector x.
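This property can be checked numerically. A small Python sketch (helper names `avg`, `std` and the sample vector are mine):

```python
import math

def avg(v):
    return sum(v) / len(v)

def std(v):
    """Standard deviation: RMS deviation of the entries from their average."""
    m = avg(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / len(v))

x = [3.0, 1.0, 4.0, 1.0, 5.0]
alpha = -2.5

# std(alpha x) = |alpha| std(x)
assert math.isclose(std([alpha * xi for xi in x]), abs(alpha) * std(x))
# A constant vector has zero standard deviation
assert std([7.0, 7.0, 7.0]) == 0.0
```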
- Standardization
The standardized version (z-score) of a vector x is z = (x - avg(x) 1) / std(x);
its entries have average 0 and standard deviation 1.
Interpretation example:
Assume an n-vector gives the value of a specific medical test (say blood sugar)
for n patients admitted to a hospital.
If z_4 = -3.2 (i.e. patient number 4 has a test value 3.2 standard deviations
below the mean), then we can say that compared to the population, this patient
has an abnormally low value.
If another patient has a z-score of 0.3, we can say that she is close to the
mean for the population.
3.4 Angles
- Cauchy-Schwarz Inequality
an important inequality that connects inner products and norms
| a^T b | <= || a || || b ||
So to summarize: the Cauchy-Schwarz inequality holds with equality when one
vector is a multiple of the other (including when either or both are zero
vectors); in all other cases it holds with strict inequality.
- Verification of Triangle Inequality
|| a + b ||^2
== by the formula for the norm of a sum
||a||^2 + 2 (a^T b) + ||b||^2
<= using the Cauchy-Schwarz inequality (a^T b <= ||a|| ||b||)
||a||^2 + 2 ||a|| ||b|| + ||b||^2
== using (x + y)^2 = x^2 + 2 x y + y^2 for all scalars x, y
( ||a|| + ||b|| )^2
Taking square roots gives || a + b || <= ||a|| + ||b||.
- Angles Between Vectors
The angle between two nonzero vectors a and b is defined as
theta = angle(a, b) = arccos( a^T b / ( ||a|| ||b|| ) )
where arccos denotes the inverse cosine, with values in the interval [0, pi].
In other words, we define theta to be the unique number between 0 and pi that
satisfies a^T b = ||a|| ||b|| cos(theta).
- Acute And Obtuse Angles
5 cases:
case 1: angle = 90 degrees = pi/2 radians.
    At this angle cos(theta) = 0, so a^T b = 0. Such vectors are called
    orthogonal, written with the special symbol a ⊥ b.
case 2: angle = 0 degrees.
    The vectors are aligned. cos(0) = 1, so a^T b = ||a|| ||b||. One vector
    is a positive multiple of the other.
case 3: angle = 180 degrees.
    cos(theta) = -1, so a^T b = -||a|| ||b||. The vectors are anti-aligned;
    one is a negative multiple of the other.
case 4: 0 < theta < 90 degrees.
    cos(theta) is positive, so a^T b is positive. The vectors make an
    *acute* angle with each other.
case 5: 90 < theta < 180 degrees.
    cos(theta) is negative, so a^T b is negative. The vectors make an
    obtuse angle with each other.
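The cases above can be checked with a small Python sketch (helper names `norm`, `inner`, `angle` are mine; the clamp guards against floating-point round-off pushing the cosine just outside [-1, 1]):

```python
import math

def norm(v):
    return math.sqrt(sum(vi ** 2 for vi in v))

def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def angle(a, b):
    """Angle between nonzero vectors, in radians, in [0, pi]."""
    c = inner(a, b) / (norm(a) * norm(b))
    c = max(-1.0, min(1.0, c))  # clamp for floating-point safety
    return math.acos(c)

assert math.isclose(angle([1, 0], [0, 2]), math.pi / 2)        # orthogonal
assert math.isclose(angle([1, 1], [3, 3]), 0.0, abs_tol=1e-7)  # aligned
assert math.isclose(angle([1, 0], [-2, 0]), math.pi)           # anti-aligned
assert angle([1, 0], [1, 1]) < math.pi / 2                     # acute
assert angle([1, 0], [-1, 1]) > math.pi / 2                    # obtuse
```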
- Examples
1. Spherical Distance
If a and b are two 3-vectors representing points on a sphere of radius R, then
the spherical distance between them, measured along the sphere, is R angle(a, b).
- Norm of Sum Via Angles
|| a + b ||^2 = ||a||^2 + 2 a^T b + ||b||^2 = ||a||^2 + 2 ||a|| ||b|| cos(theta) + ||b||^2
comments:
1. If a and b are aligned (theta = 0), we have || a + b ||^2 = ( ||a|| + ||b|| )^2,
i.e. || a + b || = ||a|| + ||b||: the norms add.
- Correlation Co-efficient
Suppose a~ and b~ are the *nonzero* de-meaned versions of a and b
(x~ = x - avg(x) 1). The correlation coefficient is
rho = a~^T b~ / ( ||a~|| ||b~|| )
Thus the correlation coefficient between two vectors is the cosine of the angle
between their de-meaned versions.
The Cauchy-Schwarz inequality tells us that the correlation coefficient ranges
between -1 and 1: since |a~^T b~| <= ||a~|| ||b~||, we have |rho| <= 1.
Sometimes the correlation coefficient is expressed as a percentage, e.g. cc = 30%
means cc = 0.3.
When rho = 0 we say the vectors are uncorrelated. (A vector with all entries
equal is conventionally taken to be uncorrelated with any other vector.)
The correlation coefficient tells us how the entries in the two vectors vary
together. High correlation (say rho = 0.8) means that entries of a and b are
above their respective means at many of the same indices (and similarly below
their means at many of the same indices). rho = 1 arises only when the de-meaned
vectors are aligned, i.e. positive multiples of each other.
rho = -1 occurs when the de-meaned vectors are anti-aligned, i.e. negative
multiples of each other.
The correlation coefficient is often used when two vectors represent time-series
data, e.g. returns on investments over the same period, or rainfall records over
the same period.
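A minimal Python sketch of the correlation coefficient (helper names `avg`, `norm`, `corrcoef` are mine):

```python
import math

def avg(v):
    return sum(v) / len(v)

def norm(v):
    return math.sqrt(sum(vi ** 2 for vi in v))

def corrcoef(a, b):
    """rho = a~^T b~ / (||a~|| ||b~||), with a~, b~ the de-meaned vectors."""
    at = [ai - avg(a) for ai in a]
    bt = [bi - avg(b) for bi in b]
    return sum(x * y for x, y in zip(at, bt)) / (norm(at) * norm(bt))

assert math.isclose(corrcoef([1, 2, 3], [2, 4, 6]), 1.0)      # aligned
assert math.isclose(corrcoef([1, 2, 3], [3, 2, 1]), -1.0)     # anti-aligned
assert abs(corrcoef([1, -1, 1, -1], [1, 1, -1, -1])) < 1e-12  # uncorrelated
```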
- Standard Deviation Of Sum
Start with equation (3.6) applied to the de-meaned vectors,
|| a~ + b~ ||^2 = ||a~||^2 + 2 ||a~|| ||b~|| cos(theta) + ||b~||^2,
and divide through by n to get
std(a + b) = sqrt( std(a)^2 + 2 rho std(a) std(b) + std(b)^2 )
If rho = 1, the std of the sum of the vectors is the sum of their std devs.
As rho decreases (towards -1), the standard deviation of the sum decreases.
When rho = 0, std(a + b) = sqrt( std(a)^2 + std(b)^2 ).
When rho = -1, std(a + b) = | std(a) - std(b) |.
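The formula for the standard deviation of a sum can be verified numerically; a small Python sketch with made-up series (helper names are mine):

```python
import math

def avg(v):
    return sum(v) / len(v)

def std(v):
    m = avg(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / len(v))

def corrcoef(a, b):
    at = [ai - avg(a) for ai in a]
    bt = [bi - avg(b) for bi in b]
    na = math.sqrt(sum(x ** 2 for x in at))
    nb = math.sqrt(sum(x ** 2 for x in bt))
    return sum(x * y for x, y in zip(at, bt)) / (na * nb)

a = [1.0, 3.0, 2.0, 5.0]
b = [2.0, 1.0, 4.0, 3.0]
rho = corrcoef(a, b)

# std(a + b) = sqrt( std(a)^2 + 2 rho std(a) std(b) + std(b)^2 )
s = std([ai + bi for ai, bi in zip(a, b)])
assert math.isclose(
    s, math.sqrt(std(a) ** 2 + 2 * rho * std(a) * std(b) + std(b) ** 2)
)
```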
- Hedging Investments
Suppose a and b are two time series vectors representing returns from investments
in two assets over the same time periods.
Let these investments have the same (average) return mu and the same risk
(standard deviation) sigma.
Consider the blended investment c = (a + b)/2. Its return is also mu, and its
risk is std(c) = sigma sqrt(2 + 2 rho) / 2.
When the returns are uncorrelated (rho = 0), the risk of the blend is a factor
of 1/sqrt(2) ≈ 0.707 smaller than the risk of the original assets.
If the asset returns are negatively correlated (rho near -1), the risk of the
blended asset is much smaller than that of the original assets.
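A tiny Python sketch of hedging with made-up return series (constructed so that the two series are exactly uncorrelated, with equal mean and risk; helper names are mine):

```python
import math

def avg(v):
    return sum(v) / len(v)

def std(v):
    m = avg(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / len(v))

# Two made-up uncorrelated return series with equal mean (0) and risk (1).
a = [1.0, -1.0, 1.0, -1.0]
b = [1.0, 1.0, -1.0, -1.0]
blend = [(ai + bi) / 2 for ai, bi in zip(a, b)]

assert std(a) == std(b) == 1.0
assert avg(blend) == avg(a)  # the blended return is unchanged
# The blended risk drops by a factor of 1/sqrt(2) ~ 0.707
assert math.isclose(std(blend), 1 / math.sqrt(2))
```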
- Units For Heterogeneous Vector Entries
When entries of vectors (more precisely, different entries of the same vector)
represent different quantities with different units, the units chosen for each
quantity affect the angle, standard deviation, and correlation. The general rule
of thumb is to choose units so that the typical values or ranges of values of
the different entries have roughly the same magnitude.
3.5 Complexity
1. Norm: computing the norm of an n-vector requires n multiplications (to square
each entry), n - 1 additions, and 1 square root, so about 2n flops total.
(Computing the square root typically takes more time than a single
multiplication or addition, but it is still counted as one flop.)
3.6 Exercises