We'll begin by talking about descriptive statistics. Descriptive statistics, as the
name suggests, simply describe what you have. They take the data that you have in
hand, and all we're doing is organizing it in a way that's easier to deal with,
easier to work with, easier to, frankly, remember. Descriptive statistics are
simple. Suppose I have a large class and I give everyone in the class a test, and I
want to see how well the class did on the test. Well, as I go from person to person
and each person reports to me their test score, holding on to all these test
scores, remembering all the test scores, is going to be very, very difficult. It's
like trying to hold on to a handful of marbles. If I'm smart, what I'm going to do
is get a container, a little box, to hold the marbles in. The first thing
statistics does for you is it gives you this container for your marbles. Instead of
having to hold on to all the individual scores, I can organize them in a way that's
easy to hold on to. I can put them in a container, and the container is called a
distribution. Just as
epidemiologists understand the world in terms of ratios, statisticians understand
the world in terms of distributions: understanding the world in terms of this
simple, organized form called a distribution. Once I have the data organized in
this distribution form, I can then hold on to the key ideas about the data, hold on
to the distribution, by remembering not everybody's individual score but just two
simple numbers: one, some measure of central tendency, where along the dimension,
high versus low, this distribution is; and two, some measure of variation or
spread, which is a fancy way of saying how fat or thin is the distribution that
we're looking at. With just these two numbers, two parameters, central tendency and
variation, I can hold on to a stunning amount of data by the simple summary
provided by the distribution. Now, whether you like statistics or not, you have to
admit that's much easier than trying to remember everybody's individual score.

The way I'd like you to think about this is: when you read the research literature,
at the start of every article there's an abstract, a hundred, hundred-fifty word
abstract that summarizes the essentials of what's in the article. Most times it's
reasonable to just read the abstract. Unless we're very interested, we don't want
to read all the detail of the article, and that's okay. That's what the abstract is
there for: to make sure you can go through a lot of literature relatively quickly.
Well, what statistics is doing for you here, descriptive statistics, is providing
an abstract of the data. By central tendency, by variation, we're getting an
important summary of the essentials of the information. We do lose some detail,
just as the abstract leaves out the detail of the article, but we get a hold on the
essentials and therefore have a very good, useful, utilitarian sense of the
information provided. We hold the key essentials but we lose some detail. This
becomes a very, very useful tool: deal with a lot of complicated things by the
simplified idea of putting them into a distribution. Now I flash up here a simple picture
of a normal curve, bell-shaped curve, Gaussian distribution, whatever you prefer. I
do this to remind you that a stunning amount of the world apparently turns out to
be normally distributed. I remind you a normal distribution is where the lower
frequencies are in the tails and the highest frequencies in the middle, as you see
before you. And also, these normal distributions are symmetric: if you cut it right
in the middle, along the midline, and fold the two halves over, they should match
exactly. Normal distribution: small frequencies in the tails, high frequencies in
the middle, and absolutely symmetrical. And once I know I have a normal
distribution, and once I know what the central tendency is and a measure of
variation, once again I get a hold on a stunning amount of information by those two
or three very simple numbers and ideas. So let's turn our attention to
understanding these two parameters: first central tendency, and then, with that in
place, we'll turn attention to variation.

There are three ways that we measure central tendency. Most of you are familiar
with these already: mean, median, and mode, sometimes called the 3M Corporation.
The mean is the arithmetic average: take all the scores, add them together, divide
by the number of scores, and you have the mean. That's fairly straightforward. The
median is the midpoint: take all the values of the data and rank order them; the
number in the middle, 50% above, 50% below, is the median. If you have an even
number of values, with two in the middle, well then we derive the median by simple
convention: we take the middle two numbers, add them together, divide by two, and
that becomes what we use as the median. Again, it's mere convention; there's no
deeper reason for doing it that way. The mode is simply the most common value, the
one with the highest frequency; on a graph, it's the high point on the graph. Mean,
median, mode: three ways to represent central tendency. And if you're lucky and
it's a normal distribution, they're all the same value. That is, if you know the
distribution is normal and you know the mean, you also have the median and the
mode. So say I give the class a test, and let's say I tell you the most common
score on that test was 73%, and now I ask you, what's the mean for the class?
That's easy. I just told you the mode was 73. If it's a normal distribution, very
likely for an exam, then you know the mean is also 73. Normal distribution: if you
know one measure of central tendency, you know them all.
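
As a quick illustration of the three measures just described, here is a minimal
Python sketch using the standard library's `statistics` module. The score list is
made up for illustration and does not come from the lecture; it has an even number
of values, so the median falls back on the convention of averaging the two middle
scores.

```python
import statistics

# Hypothetical test scores (not from the lecture), chosen so there is an
# even number of values and one clearly most common score.
scores = [61, 68, 73, 73, 73, 78, 80, 86]

mean = statistics.mean(scores)      # add the scores, divide by how many there are
median = statistics.median(scores)  # midpoint; averages the two middle values here
mode = statistics.mode(scores)      # most common value

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

In this made-up list the mean sits slightly above the median and mode, so the three
measures agree only approximately, which is what one would expect from real data
that is close to, but not exactly, normal.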
But not everything in the world is a normal distribution. Some distributions can be
a little off-center; some distributions can be what we call skewed. You see before
you now representations of a positively skewed distribution and a negatively skewed
distribution. Let's look at the names here. Positively skewed is on the right: the
tail, like a finger pointing, is pointing in the positive direction. Negatively
skewed is on the left: the tail, like a finger pointing, points in the negative
direction. That's the convention: positive to the right, negative to the left.
Notice the curves are defined not by the hump but by the pointing of the tail. On
the right, positively skewed, pointing to the right; on the left, negatively
skewed, pointing to the left.

Now focus on the positively skewed distribution, on the right. Within the
positively skewed distribution I draw three lines, which I label with the letters
A, B, and C. These letters represent the three kinds of central tendency: mean,
median, mode. Which is which? Well, which is the mean? The mean turns out to be C.
Which is the median? The median turns out to be B. Which is the mode? Therefore the
one remaining, A. How can I say that with such conviction? Easy, because in any
skewed distribution it's always going to be in that order. You go to the tail and
count in, and you first encounter the mean, then the median, then the mode. The
mean is most sensitive to the pull of, in this case, extreme values; the median
less so; the mode not at all. So counting in from the tail, you first encounter the
mean, then the median, then the mode. By the way, if you're confused about the
order, here's a hint: notice it's alphabetical. M-E-A, M-E-D, M-O-D: first mean,
then median, then mode. Predictable, and always going to be that way in a
positively skewed distribution.

I'm hoping you're following this general concept, and so let's take a moment to
flip our attention over to the negatively skewed distribution. In the negatively
skewed distribution, again three lines; this time I label them with the numbers 1,
2, and 3. Which is the mean, which is the median, which is the mode? Well, the mean
is 1, the median is 2, the mode is 3. And again, how can I say that with such
conviction? Because once again it's always going to be in that order. Counting in
from the tail, that's the place to start, you first encounter the mean, then the
median, then the mode. Notice the values for these are not based on the height of
the curve; we're talking about values along the horizontal number line. The height
of the curve is simply the frequency at each given value. Here, counting in from
the tail, you first encounter the mean, then the median, then the mode.

Now look: what I've also just given you is a simple index for telling whether a
distribution is positively skewed, negatively skewed, or normal, without ever
looking at a graphic presentation. How? Easy: compare the mean and the median. If
the mean is a greater number than the median, it must be a positive skew, right? If
the mean is a lower number than the median, it must be a negative skew, right? If
the median and mean are exactly the same value, then it's likely, not necessarily
but likely, a normal distribution. And this is in fact how we calculate skewness:
the calculation compares the mean to the median. The mean is larger, positive skew;
the mean is less, negative skew; the mean and median the same value, it's likely a
normal distribution.
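
The mean-versus-median comparison described above can be sketched in a few lines of
Python. This is only an illustration of the lecture's rule of thumb, not a formal
skewness statistic; the function name `skew_direction` and the three data lists are
hypothetical, invented for this example.

```python
import statistics

def skew_direction(values):
    """Classify skew by comparing mean and median (the lecture's rule of thumb)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "positive skew"
    if mean < median:
        return "negative skew"
    return "mean equals median (symmetric, possibly normal)"

# Made-up data sets for illustration only.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]       # one extreme high value pulls the mean up
left_tailed = [1, 17, 18, 18, 19, 19, 19, 20]  # one extreme low value pulls the mean down
symmetric = [1, 2, 3, 4, 5]

for name, data in (("right_tailed", right_tailed),
                   ("left_tailed", left_tailed),
                   ("symmetric", symmetric)):
    print(name, "->", skew_direction(data))
```

Note that equal mean and median only suggests symmetry; as the lecture says, it
makes a normal distribution likely, not certain.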
And that's central tendency: fairly straightforward, easy indices, and they're
intended to be, as a way to just quickly summarize the key facts of the
distribution that you're looking at. Let's now leave behind central tendency to
turn attention to the second parameter that describes our distribution, and talk
about variation. We talk here about how to represent variation, a fancy way of
representing how fat or thin is the distribution we're looking at. There are a
number of indices that have been proposed over the years to represent the variation
in a distribution. The one that's easiest to recognize and easiest to compute is
the range. The range is simply the difference between the highest and the lowest
score in the distribution. Again, it's easy to compute and it's intuitive what it
means; the range is something people usually understand without it being explained
in great detail. But I will try to suggest to you that the range is a pretty poor
measure of variation. In general we don't use it, and I will try to give some
reasons why we don't use it. First, the range focuses on only two values out of the
entire distribution. When we're using the range, we just look at the top and the
bottom value of the distribution. All the other available information is simply
ignored, and that's not right on some level.
Second, you probably know that if I give you a test today, let's say, and I give
you a similar test next week, scores are likely to shift somewhat between today and
next week, right? What you may not know is that certain scores are likely to shift,
to change over time, more than others. The scores that are most likely to shift are
the most extreme scores in the distribution, the very highest and the very lowest,
and they're likely to shift in a predictable direction: that is, closer towards the
mean. We call this phenomenon regression towards the mean. All regression towards
the mean is telling us is this: extreme values, when reassessed, are likely to
measure less extreme the next time. I'll say that again: extreme values, when
reassessed, are likely to measure less extreme the next time. Stop to think about
it for a moment and you see why the range is a pretty poor indicator. First, we're
using only two values, ignoring all the other available information. Second, in a
very fundamental sense, when we use the two most extreme values, we're using the
least stable values, or, to use vocabulary from epidemiology, the least reliable of
all the data we have. That means the range is highly changeable; it's not something
that is constant, and it's therefore not a very good indicator. Don't use the range.

Before we go on to talk about what we do use, I'll just linger here a little bit
longer with you and get you a little more familiar with this idea. Regression
towards the mean is the concept, again, that extreme values reassessed are likely
to measure less extreme the next time. Again, it's not guaranteed; it's just the
probabilities, the tendency, that they're more likely to come closer towards the
center than they are to go more towards the extremity. What are the implications of
this kind of idea? Well, let's say we have a student who took their USMLE Step 1
exam and got a very high score, let's say something really stunningly high, 280, a
really good score. And now the same student is going to take the Step 2 CK exam.
What's your best guess of what the student will do on Step 2 CK? Would you guess
they're going to do better than 280, about 280, or worse than 280? Well, if you
follow me, the clear answer is worse than 280, because extreme values reassessed
are likely to be less extreme the next time. A really high score is likely to be
lower. Understand, I did not say the student is in some sense dumber on Step 2 CK
than on Step 1. What I said is the measurement will be lower. Here's the key issue;
once again we're back to that old canard: numbers are not reality. See, it's not
the reality that changes; it's the perception of reality that changes, because of
the vagaries of measurement. Remember, every measurement is reality plus error. If
the error in my measurement in one instance takes me to the extreme, the next time
I measure, the error is likely to be simply less extreme, and that lesser extremity
is likely to give me a score closer towards the mean. Again, you're going to find
exceptions to this rule, but the way to bet, the highest probability, is that
extreme values are likely to be less extreme.

Let's put this in a clinical situation. Suppose you have a patient who comes in to
see you for a standard physical exam, and in the course of working up this patient
you discover they have the highest blood pressure you've ever seen in an ambulatory
human being. Concerned, you leap into action: you prescribe dietary change, strong
medication, an exercise regimen. And you're so concerned about what you've seen
that you want the patient back in your office two days later just to double-check
and see how things are going. Two days later the patient is back in your office,
and the blood pressure is assessed a second time. Here's the question: the blood
pressure measures substantially lower than it did on the first encounter. To what
should the physician attribute the lower blood pressure measurement on the second
visit? What do you think: drug, diet, exercise? No, none of that. It's regression
towards the mean: extreme values reassessed are likely to be less extreme the next
time. Understand, I can't say the patient's blood pressure actually changed. It
probably didn't. It's just the perception of change that's there because of the
issues regarding measurement. Well, because of this regression towards the mean
phenomenon, and because the range only focuses on two values, excluding everything
else, the range is a really poor measure of variation. In general, don't use it.
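
To make "every measurement is reality plus error" concrete, here is a small
simulation sketch in Python. Everything in it is an assumption made for
illustration: the normally distributed true values, the size of the measurement
error, the sample of 10,000, and the choice to look at the top 5% of first
measurements. None of these numbers come from the lecture.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

N = 10_000
# Assumed model: each person has a stable true value, and every measurement
# adds independent random error ("reality plus error").
true_values = [rng.gauss(100, 10) for _ in range(N)]
first = [t + rng.gauss(0, 10) for t in true_values]
second = [t + rng.gauss(0, 10) for t in true_values]

# Select the people whose FIRST measurement was most extreme (top 5%).
cutoff = sorted(first)[int(0.95 * N)]
top = [i for i in range(N) if first[i] >= cutoff]

first_mean = statistics.mean(first[i] for i in top)
second_mean = statistics.mean(second[i] for i in top)

print(f"top group, first measurement:  {first_mean:.1f}")
print(f"top group, second measurement: {second_mean:.1f}")
```

Nobody's true value changes between the two measurements in this model, yet the
group picked for its extreme first scores measures closer to the overall mean the
second time, which is the pattern the blood pressure example describes.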
What do you use instead? Well, statisticians sat down and tried to think of a
solution to this problem, and came up with what I think, anyhow, is a pretty
ingenious solution. They said, hey, let's start with what we have. Let's start with
each score, X. Let's subtract X-bar, the mean, from every score we have, written
here as X minus X-bar, which is taking each score and subtracting the mean from it.
If we have a distribution of 50 different scores, that leads to 50 different
subtractions. Then the sigma is of course the summation sign, so we're now going to
add together all these 50 resulting subtractions. Okay, now stop for a second. This
is a pretty intuitively useful index, is it not? We're going to take the mean as a
central pivot point, and we're going to find, by subtraction, how far every single
score is above or below the mean, and then sum all those together to get an index
of the spread of the scores around the mean. Very clever, is it not? Well, at least
theoretically this is clever.
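
To try this index on actual numbers, here is a minimal Python sketch that subtracts
the mean from every score and adds up the results. The function name
`sum_of_deviations` and the three score lists are made up for illustration. It is
worth running it on several different lists and watching what comes out.

```python
def sum_of_deviations(scores):
    """Sum of (score - mean) over all scores."""
    mean = sum(scores) / len(scores)
    return sum(x - mean for x in scores)

# Made-up score lists, purely for illustration.
examples = ([2, 4, 4, 4, 5, 5, 7, 9], [10, 50, 90], [73, 73, 73, 100])
for scores in examples:
    print(scores, "->", sum_of_deviations(scores))
```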
Practically, this index doesn't work. Practically, this index buys you nothing.
Why? If you do this calculation, whatever data you start with, you pick whatever
data you want, you're always going to get the exact same answer. What do you get?
Want to think about this? The answer you get is, right, zero. You're always going
to get zero for your answer, because for every positive number there will be a
negative number; the positives and negatives will cancel each other out, and so you
end up back at zero. Well, when statisticians first discovered this, they felt
really stupid. Now, I'm not a statistician, but I've known some in my time, and let
me tell you something about statisticians: like most of us, they do not like to
appear stupid. They like to look smart in all things. Whenever statisticians think
they might look stupid, there are a number of gimmicks that they use to make
themselves look intelligent. One of the most convincing is they square things. Now
they look smart. Now they don't just look like ordinary people; with a square, it
looks like something really, really important. Look: E equals MC. That doesn't
sound nearly as profound as E equals MC squared, right? It's the square that sells
it. Sure. And not only does it make you look smart, it solves the problem here.
Now, instead of summing together pluses and minuses, which cancel out, by squaring
the numbers we have nothing but positive numbers. We're now adding together only
positive numbers, getting a larger and larger number as the addition progresses.

But we have a couple of small things we need to solve yet before this index is
truly useful. Here's one problem. Suppose you have somebody doing research and they
have a sample of maybe 100 people, and they compute this index. You have someone
else doing research, they've assembled maybe 1,000 people, and they compute this
index. Well, the person with the thousand people is going to have a much larger
number here, right? They're summing together 1,000 squared differences, compared to
the person summing together only 100 squared differences. So this number is going
to rise or fall dramatically with sample size, right? Well, we can't have that.
What would be nice is to have an index that's fairly stable across sample size. So
to make this work, we'll take this and now divide by the sample size, divide by n.
Now, I've written n - 1 here
because that's the correct formula, and I'm going to take a brief moment to explain
to you what that minus one means. That minus one is called a degree of freedom.
Degree of freedom: well, the sample size is simply telling you about the amount of
information you have to work with. As the sample size increases, the information
you have to work with increases. The problem is, when you compute the mean here, as
you've done in this formula, you've already used some information, have you not?
You've used what we say is at least one degree of freedom. So the minus one is
basic honesty, confessing that we know we've already used some of our information.
Now, if the sample size is large, approaching infinity, that minus one is trivial;
literally, forget all about it. But if the sample size is relatively small, under
20, that minus one actually changes the answer. Don't agonize about it. I just
talked about the degree of freedom so you understand what that means when you hear
it. All you really need to remember here is that we're dividing by the sample size,
so we're going to have a relatively constant value even when sample size changes.

Okay, we're almost home, with one more thing to finish here. Right now we have
squared units. If those measures were in feet, I now have squared feet. Well, I
don't want a squared unit; I want to get back to the original units I started with.
What do I do to do that? Well, if I squared, how do I get back to the original
units I started with? That's easy: I take the square root of the whole thing. When
I take the square root of the whole thing, I now give you the formula for S, which
is the standard deviation. So the letter S, when you see it, represents the
standard deviation. Just as an aside, if you ever see the term "the variance":
well, the variance is simply S squared. In short, if you don't take the square root
of this quantity, you're left with the variance. The variance is used in a lot of
statistical calculations; for your purposes it's unlikely to be useful, and instead
what you should focus on is the standard deviation.

Now, why did I just build for you the formula of the standard deviation from
scratch? First, some comfort: you will not be asked to compute this value on the
Step 1 exam. How can I say that with such clarity? I can't do one of these
calculations in two minutes, and neither can you, so don't expect to have to do the
calculation. Why build the formula, then? Two reasons, really. One, so you
understand the standard deviation is an index. This is not some magical quantity
given down by the deities. This is simply taking the data and building an index in
a way that makes mathematical sense: looking at the spread of the data around that
central tendency and summing it together to get a useful index. A larger standard
deviation means, of course, a fatter distribution; a smaller standard deviation
means, of course, a thinner distribution. And second, I've given you the formula so
you can answer simple questions for me, simple questions such as: true or false, as
the sample size increases, the standard deviation tends to increase? Right? Well,
of course not.
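
The construction walked through above (subtract the mean, square, sum, divide by
n - 1, take the square root) can be sketched in Python as follows. The function
name `sample_sd`, the score list, and the simulated samples are all made up for
illustration; the simulated data assume a normal distribution with mean 100 and
standard deviation 15 purely as a convenient example. The standard library's
`statistics.stdev` is used as a cross-check because it also divides by n - 1.

```python
import math
import random
import statistics

def sample_sd(scores):
    """Standard deviation built step by step, dividing by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    squared_diffs = [(x - mean) ** 2 for x in scores]  # squaring removes the signs
    variance = sum(squared_diffs) / (n - 1)            # S squared
    return math.sqrt(variance)                         # back to the original units

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data
print("from scratch:", sample_sd(scores))
print("statistics.stdev:", statistics.stdev(scores))

# Same assumed population, very different sample sizes.
rng = random.Random(1)
small = [rng.gauss(100, 15) for _ in range(100)]
large = [rng.gauss(100, 15) for _ in range(10_000)]
print("n = 100:   ", round(sample_sd(small), 2))
print("n = 10000: ", round(sample_sd(large), 2))
```

Both simulated samples should give a standard deviation in the neighborhood of 15
even though one sample is a hundred times larger, which is the sense in which the
index is stable across sample size.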
As we just said, the standard deviation tends to be stunningly constant across
changes in sample size. In the way the mean is fairly constant even with changes in
sample size, the standard deviation is fairly constant with changes in sample size.
In fact, that's a good way to think about the standard deviation: the standard
deviation is basically the average deviation. It's the average deviation, the way
the mean is the average for central tendency. It's not the calculation of the
standard deviation that you're going to need; it's the application of the standard
deviation, which you're going to see as we turn to focus on this particular graph.

Now, this graph that you see before you, I'm hoping, is essentially no threat. I'm
hoping it's something you've seen before in your education. I'm hoping this is
going to be simply review. If that's not the case, now's the time to make it your
friend. What we're looking at here together is a set of constants. These numbers
we're presenting are always true in every normal distribution. Whatever the
computed mean or standard deviation, these percentages relative to the standard
deviation units in the curve are always going to be true, and so it's crucial that
you know these. Frankly, every physician should know these flash-card ready, not so
you sit there and have to reason it out, but so that you have the numbers there,
available; you can do mental calculations using them without being given any
additional information. Let's look at the numbers on the graph before you. First,
the numbers on top, the so-called symmetric numbers: that is, the mean is the pivot
and we're going equidistant above and below the mean. And then secondly, the more
important numbers, I believe, the numbers on the bottom, the so-called asymmetric
numbers, describing discrete areas bounded by standard deviation units in, again,
every normal distribution. Let's walk through these numbers, trying to help you lay
down a good memory trace, and then after the review we'll go
through some problems based on this particular graph. So I'll ask, and you can talk
back to me about this. In every normal distribution, what percent of the cases fall
within plus or minus one standard deviation? Well, as the graph tells you, 68%.
68%, plus or minus one standard deviation, the same amount up as down. What percent
of the cases fall within plus or minus two standard deviations in a normal
distribution? That answer is 95.5. Please note, not just 95: 95.5, a smidge over
95. In any normal distribution, what percent of the cases fall within plus or minus
three standard deviations? That turns out to be 99.7. Please notice you don't get
100%, but we're even over 99%. So for practical purposes, plus or minus three
standard deviations gives you just about everybody you'd ever care about in any
given distribution. The numbers again: plus or minus 1, 68%; plus or minus 2,
95.5%; plus or minus 3, 99.7% of the cases.

Okay, let's keep those in our heads, but now let's move on to talk about the
asymmetric numbers, the numbers at the bottom part of the chart. Here we're talking
about the area, that is, the percentage of cases, in the bounded areas described by
standard deviation units. Let's do this again by question and answer. In any normal
distribution, what percent of the cases fall between the mean and one standard
deviation above the mean? Well, that's easy: 34%. Why is it easy? Well, because
it's a symmetrical distribution: if plus or minus one standard deviation is 68,
then half of that is on each side of the mean, therefore 34%. In any normal
distribution, what percent of the cases fall between one standard deviation above
the mean and two above the mean? That's easy: 13.5%. In any normal distribution,
what percent of the cases fall between two standard deviations above the mean and
three above the mean? That's easy as well: 2.4%. And what percent of the cases in
any normal distribution fall above three standard deviations above the mean? That's
easy: 0.15%.
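
For readers who want to see where these constants come from, here is a short Python
sketch that computes the same areas from the standard normal curve using
`statistics.NormalDist` (available in Python 3.8 and later). It is a cross-check,
not part of the lecture: the figures quoted in the lecture are rounded for easy
recall, so the computed values differ slightly (for example, the band between two
and three standard deviations comes out near 2.1%).

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Symmetric figures: share of cases within +/- k standard deviations.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within +/-{k} SD: {share:.2%}")

# One-sided bands above the mean.
bands = {
    "mean to +1 SD": z.cdf(1) - z.cdf(0),
    "+1 SD to +2 SD": z.cdf(2) - z.cdf(1),
    "+2 SD to +3 SD": z.cdf(3) - z.cdf(2),
    "above +3 SD": 1 - z.cdf(3),
}
for label, share in bands.items():
    print(f"{label}: {share:.2%}")
```

Because the curve is symmetric, the four one-sided bands add up to exactly half of
all cases when computed without rounding.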
Please note, less than 1%. Now, taken all together, all those numbers in the red
box at the bottom here should sum to 100%, but they do not. Why? Basically, we've
given you easy-to-use numbers, not the actual numbers. We gave you rounded numbers,
and we lost a fraction of a percent in the rounding. None of these numbers
presented on the slide, you should be aware, are as simple as they appear; all of
these, to be absolutely pristine, go out to six, sometimes twelve different decimal
places. But you don't really need to do that for day-to-day clinical usage, and you
certainly don't need to do that in order to answer questions on the exam. So these
are utilitarian numbers that you should know for your exam, and I remind you again
to have these numbers flash-card ready, not so you take out your whiteboard and
draw this diagram and label it, but so you have these numbers in your head, you can
use them, and they're available to you to do your reasoning exercises.

What kinds of questions would you need to answer using these numbers? Let's do a
few examples together to give you an idea. First: in a normal curve, what percent
of the cases are below two standard deviations below the mean? Below two standard
deviations below the mean. Notice two things here: one is where you're starting;
the second is which direction you go. This is below two standard deviations below
the mean, so we're starting here, two standard deviations below the mean, minus 2S.
Which direction? Below that; that means this way, from that point all the way down.
What percent of the cases are there? Rounded off, it's roughly 2.5%.

There are certain means and standard deviations we expect you to know walking into
the exam. For example, we're expecting you to know the mean and the standard
deviation of the IQ distribution. What's the mean of the IQ distribution? Well,
that's easy, you've got it: it's 100. IQ is constructed with a mean of 100. What's
the standard deviation of the IQ distribution? 15. Mean 100, standard deviation of
15. So what percent of the US population has an IQ below 70? Well, the mean is 100.
I go to the mean, X-bar, and that's 100. I drop down one standard deviation, minus
1S, that's minus 15 points, that'd be 85. I drop down another standard deviation,
and I'm now at minus 2S, and that's going to be 70. What percent of the US
population has an IQ below 70? 2.5%. Why is 70 the cutoff for mental retardation?
Because it is two standard deviations below the mean. Remember: what percent of the
US population is mentally retarded? That's really easy: since 70 is the cutoff for
mental retardation, two and a half percent of the population is mentally retarded.

These numbers are constant, they're useful, they're all over medicine, all over
day-to-day life. When you get your lab values and they give you a reference range,
what is that? It's the mean and standard deviations, trying to tell you that these
are the values that are, quote, normal, that is, expected, and anything beyond that
is abnormal, that is, unexpected. Notice, lab values outside the reference range
don't mean pathology; they mean abnormal, and it's your job then to figure out why
we're getting these abnormalities. Next,
in a normal curve, what percent of the cases are above one standard deviation below
the mean? Above one standard deviation below the mean. Once again, we watch two
things here: one is where you start, the second is where you go. So in this case
I'm going to start at one standard deviation below the mean, and then which
direction? It says above that. And so from that point, one standard deviation below
the mean, all the way up to the top, that's going to be 84%. Now, I get 84%
relatively quickly, can I not? Because it's going to be 34, from one standard
deviation below the mean up to the mean, plus 50. Why plus 50? Because the mean is
also the median. So any of you that were sitting there adding up all those little
individual percentages, you stop that now. Do it the easy way: it's 34 plus 50 and
you have your 84. By the way, what percent of the cases are below the specified
point, minus 1S? Well, that's easy: it has to be 16%. The whole thing is 100; with
84 above, there must be 16 below. And please notice once again, this is purely a
logic exercise. Once you have the numbers, it's just a matter of reasoning,
thinking clearly, in order to answer the questions that are presented.

Next: one student scores at the 97.5th percentile. 97.5th percentile: where's the
student on the curve? Well, let's go back to our graph once again. You know that
the mean would be the 50th percentile; from the mean downward are 50% of the cases,
right? So as I go up one standard deviation, I now gain another 34%: 50 plus 34,
one standard deviation above the mean is the 84th percentile. As I go up a second
standard deviation, to the 84 I now add 13.5, so two standard deviations above the
mean is the 97.5th percentile. From that point downward, I've beaten 97.5% of all
students; from that point upward we have the remainder, of course 2.5%, everything
being 100. This is what we mean by percentile, by the way: it's simply what
percentage of the students did you beat. And so, just to run through this, talking
you through the concept: at minus 2S, two standard deviations below the mean, we're
at the 2.5th percentile, roughly. At minus 1S, we're at the 16th percentile. At the
mean, we're at the 50th percentile. At plus 1S, we're at the 84th percentile. At
plus 2S, we're at the 97.5th percentile.

Next, let's look at something a little different. A student took two tests.

On which of the tests did the student do better relative to his classmates? On test
A the student scored 45%; on test B the student scored 60%. The question is, which
is better? And the answer is, right now, given what's on the screen before you, you
can't tell. The scores by themselves are essentially meaningless. Scores only have
meaning when they're linked to the distribution out of which they came. With
information about the distribution, we get a sense of the true value of the scores.
So let's give you some information about the distribution. The mean for test A is
30%; the mean for test B is 40%. Which one of these two scores, 45 or 60, is
better? Again, you can't tell yet. You represent a distribution not merely by
central tendency but by central tendency combined with the second parameter, some
measure of variation, for which a standard deviation would be a usual
representation, right? Let me now give you the standard deviations, and now answer
the question: which score is better, A or B? The answer is A: 45% is the better
score. Here's why. Whichever one of these two is more standard deviations above the
mean is the better score, is the higher percentile score. The way to look at this
is simply mechanical: we go to the mean and add standard deviation units until we
hit the student's score. Whichever one lets us add more standard deviation units is
simply the better test score. So I go to 30, I add 5%, that's 35; 5% again, 40; 5%
a third time, 45%. Stop. On test A the score is three standard deviations above the
mean. Let's drop down and go to test B: I start at the mean of 40%, I add 10%,
that's 50; 10% again, that's 60. Stop. Test B is two standard deviations above the
mean. Both are good scores, but test A is simply the best. On test B the student
scored two standard deviations above the mean, the 97.5th percentile. On test A the
student scored three standard deviations above the mean, the 99.85th percentile;
there's just 0.15% of the population above that particular point. Both are
stunningly good scores, and you'd be happy with either of these, but the question
was which is better, and A is better. Please notice: if the student scored below
the mean, then the number that's the fewest standard deviations below the mean
would be the best score, because it would be, again, the highest percentile score.

And this is descriptive statistics. What we're doing with descriptive statistics is
just giving you tools to summarize and describe the data that you have before you.
You no longer have to focus on remembering all the individual values; you can
summarize them in a distribution and then represent the distribution itself. And
then we take advantage of the regularities we find in these distributions,
especially in the normal distribution, and especially the percentages related to
standard deviation units. These regularities give us useful tools that let us do
useful reasoning exercises.
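
The reasoning exercises above all follow one pattern: express a score in standard
deviation units, then read off a percentile. A minimal Python sketch of that
pattern follows. The helper names `sd_units` and `percentile` are hypothetical,
and the percentile assumes the scores are normally distributed; the input numbers
are the ones used in the lecture's test A, test B, and IQ examples. Computed
percentiles differ slightly from the lecture's rounded figures (97.5, 99.85, 2.5).

```python
from statistics import NormalDist

def sd_units(score, mean, sd):
    """How many standard deviations a score sits above (+) or below (-) the mean."""
    return (score - mean) / sd

def percentile(score, mean, sd):
    """Percent of a normal distribution falling below the score (assumes normality)."""
    return 100 * NormalDist(mean, sd).cdf(score)

# Values taken from the lecture's examples.
test_a = dict(score=45, mean=30, sd=5)
test_b = dict(score=60, mean=40, sd=10)
iq_cutoff = dict(score=70, mean=100, sd=15)

for name, case in (("test A", test_a), ("test B", test_b), ("IQ 70", iq_cutoff)):
    print(f"{name}: {sd_units(**case):+.1f} SD, "
          f"percentile {percentile(**case):.2f}")
```

Comparing the standard deviation units alone is enough to rank the two test scores;
the percentile step only adds how rare each score is.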