being measured. The model of additive errors is the most popular error model in statistics. In this
model, the additive error is assumed to be independent of the true value.
8. Aggregate Mean In ANOVA and some other techniques used for analysis of several
samples, the aggregate mean is the mean for all values in all samples combined, as
opposed to the mean values of the individual samples.
The term "aggregate mean" is also used as a synonym of the weighted mean because the latter is often used to aggregate a set of values, like examination
scores, to a single value.
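As a quick sketch of the aggregation described above (the scores and weights are made-up illustrative data, not from the text), a weighted mean of examination scores can be computed as:

```python
# Weighted mean: each score is weighted, here by the number of
# students who earned it (hypothetical data for illustration).
scores = [60, 70, 80, 90]
weights = [5, 10, 3, 2]

weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(weighted_mean)  # 71.0
```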
9. Alpha Level see Type I Error. In a test of significance, a Type I error is the error of
rejecting the null hypothesis when it is true -- of saying an effect or event is
statistically significant when it is not. The projected probability of committing a Type I
error is called the level of significance. For example, for a test comparing two
samples, a 5% level of significance (alpha = .05) means that when the null hypothesis is
true (i.e. the two samples are part of the same population), your test will
conclude "there's a significant difference between the samples" 5% of the
time.
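The 5% figure can be checked by simulation. The following sketch (assumed setup: a two-sided z-test on two samples drawn from the same normal population; sample sizes and seed are arbitrary choices) counts how often a test at alpha = .05 wrongly declares significance when the null hypothesis is true:

```python
import random
import statistics

# Repeatedly compare two samples drawn from the SAME population and
# count how often a z-test at alpha = .05 flags a "significant" difference.
random.seed(1)
n, trials, hits = 50, 2000, 0
for _ in range(trials):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    z = (statistics.mean(a) - statistics.mean(b)) / se
    if abs(z) > 1.96:  # critical value for a two-sided 5% test
        hits += 1
print(hits / trials)  # close to 0.05
```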
10. Alpha Spending Function In the interim monitoring of clinical trials, multiple looks
are taken at the accruing results. In such circumstances, akin to multiple testing, the
alpha-value at each look must be adjusted in order to preserve the overall Type I
error. Alpha spending functions (the Pocock family is one such set; the Lan-DeMets
spending function is a more flexible approach) establish these adjusted alpha-values for
each interim monitoring point, given the overall alpha. Typically, they establish
relatively high alpha-values for early looks, and lower alpha-values for later looks.
Thus, they constitute "stopping boundaries," which, when crossed, indicate that
statistical significance has been established. When graphed, a typical set of stopping
boundaries looks like an inverted V pointing to the right. (horizontal axis represents
number of events recorded, vertical axis the standardized value of the test statistic.)
11. Alternate-Form Reliability The alternate-form reliability of a survey instrument, like
a psychological test, helps to overcome the "practice effect", which is typical of
the test-retest reliability. The idea is to change the wording of the survey questions
in a functionally equivalent form, or simply to change the order of the questions in
the first and the second survey of the same respondents.
A common quantitative measure of the alternate-form reliability is the value of
the correlation coefficient between the results obtained in the two surveys -- with the initial and reworded questions.
12. Alternative Hypothesis In hypothesis testing, there are two competing hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis usually
reflects the status quo (for example, the proposed new treatment is ineffective and
the observed results are just due to chance variation). The hypothesis which
competes with the null hypothesis as an explanation for observed data is called the
alternative hypothesis.
13. Backward Elimination One of several computer-based iterative procedures for
selecting variables to use in a model. The process begins with a model containing all
the independent (predictor) variables of interest. Then, at each step the variable with
the smallest F-statistic is deleted (if the F is not higher than the chosen cutoff level).
14. Bayes Theorem Bayes theorem is a formula for revising a priori probabilities after
receiving new information. The revised probabilities are called posterior probabilities.
For example, consider the probability that you will develop a specific cancer in the
next year. An estimate of this probability based on general population data would be
an a priori probability; a revised estimate based on additional information about you
would be a posterior probability.
15. Bernoulli Distribution A random variable x has a Bernoulli distribution
with parameter 0 < p < 1 if

       { 1-p,  x = 0
P(x) = { p,    x = 1
       { 0,    x ∉ {0, 1}
Where P(A) is the probability of outcome A. The parameter p is often called the
"probability of success". For example, a single toss of a coin has a Bernoulli
distribution with p=0.5 (where 0 = "head" and 1 = "tail").
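The coin-toss example can be simulated directly. A minimal sketch (the `bernoulli` helper is illustrative, not from the text): each draw is 1 ("success") with probability p and 0 otherwise, so the long-run proportion of successes approaches p:

```python
import random

# A single Bernoulli(p) draw: 1 with probability p, else 0.
def bernoulli(p, rng):
    return 1 if rng.random() < p else 0

rng = random.Random(0)
draws = [bernoulli(0.5, rng) for _ in range(10_000)]
print(sum(draws) / len(draws))  # sample proportion, close to p = 0.5
```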
16. Bernoulli Distribution (Graphical) A random variable x has a Bernoulli distribution
with parameter 0 < p < 1 if

       { 1-p,  x = 0
P(x) = { p,    x = 1
       { 0,    x ∉ {0, 1}

Where P(A) is the probability of outcome A. The parameter p is often called the
"probability of success". For example, a single toss of a coin has a Bernoulli
distribution with p=0.5 (where 0 = "head" and 1 = "tail").
17. Beta Distribution Suppose x1, x2, ... , xn are n independent values of a random
variable uniformly distributed within the interval [0,1]. If you sort the values in
ascending order, then the k-th value will have a beta distribution with
parameters a = k, b = n-k+1. The density of beta distribution is given by
       { Γ(a+b) / (Γ(a) Γ(b)) · x^(a-1) (1-x)^(b-1),  x ∈ [0,1]
f(x) = {
       { 0,                                           x ∉ [0,1]

where Γ(·) is the gamma function.
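The order-statistic fact in this entry can be checked numerically: the mean of a Beta(a, b) distribution is a/(a+b), so the k-th smallest of n uniform draws should average k/(n+1). A simulation sketch (the choice n = 9, k = 3 is arbitrary):

```python
import random

# The k-th smallest of n Uniform[0,1] draws is Beta(k, n-k+1),
# whose mean is k / (n + 1). Check by simulation.
rng = random.Random(42)
n, k, reps = 9, 3, 20_000
vals = [sorted(rng.random() for _ in range(n))[k - 1] for _ in range(reps)]
print(sum(vals) / reps)  # close to k / (n + 1) = 0.3
```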
19. Bias Bias is a general statistical term meaning a systematic (not random) deviation
from the true value. A bias of a measurement or a sampling procedure may pose a
more serious problem for the researcher than random errors because it cannot be
reduced by merely increasing the sample size and averaging the outcomes.
20. Biased Estimator An estimator is a biased estimator if its expected value is not
equal to the value of the population parameter being estimated.
21. Binomial Distribution A random variable x has a binomial distribution if and only if
its probability distribution is given by

P{x=k} = C(n,k) p^k (1-p)^(n-k),   k = 0, 1, ..., n,

where C(n,k) = n! / (k! (n-k)!) is the binomial coefficient, 0 < p < 1, and n is the
number of trials.
28. Decile Deciles are percentiles taken in tens. The first decile is the 10th percentile;
the second decile is the 20th percentile, etc.
29. Degrees of Freedom For a set of data points in a given situation (e.g. with mean or
other parameter specified, or not), degrees of freedom is the minimal number of
values which should be specified to determine all the data points.
30. Dendrogram The dendrogram is a graphical representation of the results
of hierarchical cluster analysis. This is a tree-like plot where each step of hierarchical
clustering is represented as a fusion of two branches of the tree into a single one.
The branches represent clusters obtained at each step of hierarchical clustering.
31. Density (of Probability) A probability density function or curve is a non-negative
function f(x) describing the distribution of a continuous random variable x. The
probability that the random variable x falls within an interval [a, b] is

P{a ≤ x ≤ b} = ∫[a,b] f(x) dx,

where the integral of f over the whole real line is equal to 1.
32. Dependent and Independent Variables Statistical models normally specify how one
set of variables, called dependent variables, functionally depend on another set of
variables, called independent variables. The functional relationship does not
necessarily reflect a causal relationship - i.e. the independent variables do not
necessarily describe the cause.
33. Descriptive Statistics Descriptive statistics refers to statistical techniques used to
summarize and describe a data set, and also to the statistics (measures) used in
such summaries. Measures of central tendency (e.g. mean, median) and variation
(e.g. range, standard deviation) are the main descriptive statistics. Displays of data
such as histograms and box-plots are also considered techniques of descriptive
statistics.
34. Design of Experiments Design of experiments is concerned with optimization of the
plan of experimental studies. The goal is to improve the quality of the decision that is
made from the outcome of the study on the basis of statistical methods, and to
ensure that maximum information is obtained from scarce experimental data.
35. Detrended Correspondence Analysis Detrended correspondence analysis is an
extension of correspondence analysis (CA) aimed at addressing a deficiency
of correspondence analysis. The problem is known as the "arch effect" -- a non-monotonic relationship between two sets of scores derived by CA.
36. Dichotomous Dichotomous (outcome or variable) means "having only two possible
values", e.g. "yes/no", "male/female", "head/tail", "age > 35 / age <= 35" etc.
…in discrete time is the construction of a new series Y(t), where the values Y(t)
are the differences between consecutive values of the original series X(t). This procedure
may be applied consecutively more than once, giving rise to the "first differences",
"second differences", etc.
The first differences Y(t) of a time series X(t) are given by the expression:

Y(t) = X(t) - X(t-1).
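The differencing operation above is a one-liner in code. A minimal sketch (the example series is made up for illustration):

```python
# First and second differences of a time series.
def diff(series):
    """Return the differences between consecutive values of the series."""
    return [b - a for a, b in zip(series, series[1:])]

x = [3, 5, 9, 15, 23]   # original series
first = diff(x)         # first differences
second = diff(first)    # second differences (differencing applied twice)
print(first, second)    # [2, 4, 6, 8] [2, 2, 2]
```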
44. Dispersion (Measures of) Measures of dispersion express quantitatively the degree
of variation or dispersion of values in a population or in a sample. Along with
measures of central tendency, measures of dispersion are widely used in practice
as descriptive statistics. Some measures of dispersion are the standard deviation,
the average deviation, the range, and the interquartile range.
45. Disproportionate Stratified Random Sampling see Stratified Sampling (method
II). Stratified sampling is a method of random sampling. In stratified sampling, the
population is first divided into homogeneous groups, also called strata. Then,
elements from each stratum are selected at random in one of two
ways: (i) the number of elements drawn from each stratum depends on the stratum's
size in relation to the entire population ("proportionate" sampling); (ii) the
number of elements sampled from each stratum is not proportionate to the size of
the stratum ("disproportionate" sampling); in this case, an equal number of elements
is typically drawn from each stratum and the results are weighted according to the
stratum's size in relation to the entire population.
46. Dissimilarity Matrix The dissimilarity matrix (also called distance matrix) describes
pairwise distinction between M objects. It is a square symmetrical M×M matrix with
the (i,j)th element equal to the value of a chosen measure of distinction between the
i-th and the j-th object. The diagonal elements are either not considered or are
usually equal to zero - i.e. the distinction between an object and itself is postulated
as zero.
47. Distance Matrix A distance matrix is another name for a dissimilarity matrix. The
"distance" does not necessarily mean distance in space: it is common for the
"distance" to be a subjective measure of dissimilarity. The only property the
concept of "distance" implies is that its value is smaller for less distinct objects.
48. Divergent Validity In psychometrics, the divergent validity of a survey instrument,
like an IQ-test, indicates that the results obtained by this instrument do not correlate
too strongly with measurements of a similar but distinct trait.
49. Econometrics Econometrics is a discipline concerned with the application of
statistics and mathematics to various problems in economics and economic theory.
This term literally means "economic measurement". A central task is quantification
(measurement) of various qualitative concepts of economic theory like demand, supply, propensity to spend, etc.
50. Edge An edge is a link between two people or entities in a network. Edges can be
directed or undirected. A directed edge has a clear origin and destination: lender →
borrower, tweeter → follower. An undirected edge connects two people or entities
with a mutual relationship: Facebook friends, teams in a sports league.
51. Effect In design of experiments, the effect of a factor is an additive term of the
model, reflecting the contribution of the factor to the response. See Variables (in
design of experiments) for an explanatory example.
52. Effect Size see Sample Size Calculations. Sample size calculations typically arise
in significance testing, in the following context: how big a sample size do I need to
identify a significant difference of a certain size? The analyst must specify three
things: 1) how big a difference is being looked for (also called the effect size);
2) the alpha level of the test (how big a Type I error is tolerable), e.g. alpha = .05;
and 3) the desired power of the test -- how certain you want to be of
catching a significant difference, if there is one, e.g. power = .8.
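The three inputs combine into a rough sample-size formula. The sketch below uses a standard normal-approximation formula for comparing two means, n per group = 2((z_{1-alpha/2} + z_{power})·sigma/delta)²; the function name and the specific formula are this example's assumptions, not from the text:

```python
from statistics import NormalDist

# Approximate per-group sample size for detecting a difference delta
# between two means, assuming standard deviation sigma in each group.
def sample_size(delta, sigma, alpha=0.05, power=0.8):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_b = NormalDist().inv_cdf(power)          # 0.84 for power = .8
    return 2 * ((z_a + z_b) * sigma / delta) ** 2

print(sample_size(delta=0.5, sigma=1.0))  # about 63 per group
```

Note how the required n grows as the effect size delta shrinks: halving delta quadruples the sample size.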
53. Efficiency For an unbiased estimator, efficiency indicates how much its precision is
lower than the theoretical limit of precision provided by the Cramer-Rao inequality. A
measure of efficiency is the ratio of the theoretically minimal variance to the actual
variance of the estimator. This measure falls between 0 and 1. An estimator with
efficiency 1.0 is said to be an "efficient estimator".
54. Endogenous Variable Endogenous variables in causal modeling are the variables
with causal links (arrows) leading to them from other variables in the model. In other
words, endogenous variables have explicit causes within the model.
55. Erlang Distribution The Erlang distribution with parameters (n, m) characterizes the
distribution of time intervals until the emergence of n events in a Poisson
process with parameter m.
56. Error Error is a general concept related to deviation of the estimated quantity from
its true value: the greater the deviation, the greater the error.
57. Error Spending Function see Alpha Spending Function. In the interim
monitoring of clinical trials, multiple looks are taken at the accruing results. In such
circumstances, akin to multiple testing, the alpha-value at each look must be
adjusted in order to preserve the overall Type I error. Alpha spending functions (the
Pocock family is one such set; the Lan-DeMets spending function is a more flexible
approach) establish these adjusted alpha-values for each interim monitoring point, given
the overall alpha. Typically, they establish relatively high alpha-values for early looks,
and lower alpha-values for later looks. Thus, they constitute "stopping boundaries,"
which, when crossed, indicate that statistical significance has been established. When
graphed, a typical set of stopping boundaries looks like an inverted V pointing to the
right. (horizontal axis represents number of events recorded, vertical axis the
standardized value of the test statistic.)
58. Estimation Estimation is deriving a guess about the actual value of a
population parameter (or parameters) from a sample drawn from this population.
See also Estimator.
59. Estimator A statistic, measure, or model, applied to a sample, intended to estimate
some parameter of the population that the sample came from.
60. Event In probability theory, an event is an outcome or defined collection of
outcomes of a random experiment. Since the collection of all possible outcomes to a
random experiment is called the sample space, another definition of an event is any
subset of a sample space. For example, on the roll of a die, getting an even number
is an event. This event is a subset containing the sample points {2, 4, 6}. The sample
space is {1, 2, 3, 4, 5, 6}.
61. Exact Tests Exact tests are hypothesis tests that are guaranteed to produce Type-I
error at or below the nominal alpha level of the test when conducted on samples
drawn from a null model. For example, a test conducted at the 5% level of
significance that yields (false) "significant" results 5% of the time or less (when used
on samples drawn from a null model) is exact. See also permutation tests, which
constitute the largest class of exact tests.
62. Exogenous Variable Exogenous variables in causal modeling are the variables with
no causal links (arrows) leading to them from other variables in the model. In other
words, exogenous variables have no explicit causes within the model.
63. Expected Value The expected value of a random variable is its probability-weighted
arithmetic mean. For a discrete random variable, the expected value is the weighted
average of the possible values of the random variable, the weights being the
probabilities that those values will occur. For a continuous random variable, the
values of the probability density are used instead of probabilities, and the summation
operator is replaced by the integral.
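For the discrete case, the weighted average can be computed directly. A small sketch using a fair die (exact arithmetic via fractions is this example's choice, not from the text):

```python
from fractions import Fraction

# Expected value of a fair die roll: weighted average of the outcomes,
# the weights being the probabilities (1/6 each).
outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)
ev = sum(v * p for v in outcomes)
print(ev)  # 7/2, i.e. 3.5
```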
64. Experiment Any process of observation or measurement is called an experiment in
statistics. For example, counting the number of people visiting a restaurant in a day is
an experiment, and so is checking the number obtained on the roll of a die. Typically,
we will be interested in experiments whose outcomes differ from one another due (at
least in some degree) to random chance.
65. Explanatory Variable Explanatory variable is a synonym for independent variable.
66. Exponential Distribution The exponential distribution is a one-sided distribution
completely specified by one parameter r > 0; the density of this distribution is

       { r e^(-rx),  x ≥ 0
f(x) = {
       { 0,          x < 0
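A property worth remembering is that the mean of this distribution is 1/r. A simulation sketch using the standard library's exponential sampler (the choice r = 2 and the seed are arbitrary):

```python
import random

# Sample from an exponential distribution with rate r; its mean is 1/r.
rng = random.Random(7)
r = 2.0
draws = [rng.expovariate(r) for _ in range(50_000)]
print(sum(draws) / len(draws))  # close to 1/r = 0.5
```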
are relevant. For example, a researcher developing an IQ-test might ask his friends
and relatives to read the questions and make their judgments.
70. Factor In design of experiments, factor is an independent variable manipulated by
the experimenter.
71. Factor Analysis Exploratory research on a topic may identify many variables of
possible interest, so many that their sheer number can become a hindrance to
effective and efficient analysis.
Factor analysis is a data reduction technique that reduces the number of variables
studied to a more limited number of underlying "factors."
72. Factorial ANOVA Factorial ANOVA (factorial analysis of variance) is aimed at
assessing the relative importance of various combinations of independent variables.
Factorial ANOVA is used when there are at least two independent variables.
73. Fair Game A game of chance is said to be fair if each player's expected payoff is
zero. A game in which I roll a die and receive 12 for a 1 or 2, and lose 6 otherwise (a 3 through 6), is a fair game.
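The fairness of the die game can be verified by computing the expected payoff directly (exact fractions are this example's choice):

```python
from fractions import Fraction

# Expected payoff of the die game: win 12 on a 1 or 2 (probability 1/3),
# lose 6 on a 3 through 6 (probability 2/3).
payoff = Fraction(1, 3) * 12 + Fraction(2, 3) * (-6)
print(payoff)  # 0 -- a fair game
```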
74. False Discovery Rate A "discovery" is a hypothesis test that yields a statistically
significant result. The false discovery rate is the proportion of discoveries that are, in
reality, not significant (a Type I error). The true false discovery rate is not known,
since the true state of nature is not known (if it were, there would be no need for
statistical inference). However, one can calculate the false discovery rate under the
assumption that all null hypotheses being tested are true. Controlling this false
discovery rate is often a goal, and a parameter, of studies involving multiple tests.
75. Family-wise Type I Error In multiple comparison procedures, family-wise type I
error is the probability that, even if all samples come from the same population, you
will wrongly conclude that at least one pair of populations differ.
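To see how this probability grows with the number of comparisons, a common back-of-the-envelope formula (which assumes the tests are independent, an assumption not stated in the entry) is 1 - (1 - alpha)^k for k tests at level alpha:

```python
# Family-wise Type I error for k INDEPENDENT tests, each at level alpha.
alpha, k = 0.05, 10
fwe = 1 - (1 - alpha) ** k
print(round(fwe, 3))  # about 0.401 -- far above the per-test 0.05
```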
76. Family-wise Type I Error (Graphical) In multiple comparison procedures, family-wise type I error is the probability that, even if all samples come from the same
population, you will wrongly conclude that at least one pair of populations differ.
77. Farthest Neighbor Clustering The farthest neighbor clustering is a synonym
for complete linkage clustering.
78. Filter A filter is an algorithm for processing a time series or random process. There
are two major classes of problems solved by filters:
1. To estimate the current value of a time series (X(t), t = 1,2, ...) , which is not
directly observable, from observed values of another time series (Y(t),
t=1,2,...) , related to the time series X(t).
2. To predict the next value Y(t+1) of the observed time series Y from the
current value Y(t) and previous values Y(t-1),Y(t-2), ... .
79. Finite Mixture Models Outside social research, the term "finite mixture models"
is often used as a synonym for "latent class models" in latent class analysis.
80. Finite Sample Space If a sample space contains a finite number of elements, then
the sample space is said to be a finite sample space. The sample space for the
experiment of a toss of a coin is a finite sample space. It has only two sample points.
But the sample space for the experiment where the coin is tossed until a head
shows up is not a finite sample space -- it is theoretically possible that you could
keep tossing the coin indefinitely.
81. Gamma Distribution A random variable x is said to have a gamma-distribution with
parameters a > 0 and λ > 0 if its probability density p(x) is

       { (λ^a / Γ(a)) · x^(a-1) e^(-λx),  x > 0
p(x) = {
       { 0,                               x ≤ 0
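The mean of this distribution is a/λ, which can be checked with the standard library's gamma sampler. Note that `random.gammavariate` takes a shape and a scale parameter, where scale = 1/λ (the parameter values and seed below are arbitrary):

```python
import random

# Sample from a gamma distribution with shape a and rate l; mean is a/l.
rng = random.Random(3)
a, l = 2.0, 4.0
draws = [rng.gammavariate(a, 1 / l) for _ in range(50_000)]
print(sum(draws) / len(draws))  # close to a/l = 0.5
```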
82. Gamma Distribution (Graphical) A random variable x is said to have a gamma-distribution with parameters a > 0 and λ > 0 if its probability density p(x) is

       { (λ^a / Γ(a)) · x^(a-1) e^(-λx),  x > 0
p(x) = {
       { 0,                               x ≤ 0

where Γ(·) is the gamma function.
85. General Association Statistic The general association statistic is one of the statistics
used in the generalized Cochran-Mantel-Haenszel tests. It is applicable when both
the "treatment" and the "response" variables are measured on a nominal scale.
86. General Linear Model General (or generalized) linear models (GLM), in contrast
to linear models, allow you to describe both additive and non-additive relationships
between a dependent variable and N independent variables. The independent
variables in GLM may be continuous as well as discrete. (The dependent variable is
often named "response", independent variables - "factors" and "covariates",
depending on whether they are controlled or not).
87. General Linear Model for a Latin Square In design of experiments, a Latin square is
a three-factor experiment in which each combination of values of any pair of
factors occurs only once.
88. Generalized Cochran-Mantel-Haenszel Tests The generalized Cochran-Mantel-Haenszel tests are a family of tests aimed at detecting association between two
categorical variables observed in K strata.
The initial data are represented as a series of K R×C contingency tables, where K is
the number of strata and at least one of the variables ("group", "response") takes on
more than 2 values. Typically, in each table the rows correspond to the "treatment
group" values (e.g. "Placebo", "Low dose", "High dose") and the columns to the
"response" values (e.g. "Worsening", "No change", "Improvement").
89. Geometric Distribution A random variable x obeys the geometric distribution with
parameter p (0<p<1) if
P{x=k} = p(1-p)^k,   k = 0, 1, 2, ... .
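A quick sanity check on this pmf: the probabilities over all k sum to 1, and a finite partial sum gets arbitrarily close (p = 0.3 and the cutoff of 100 terms are arbitrary choices for illustration):

```python
# Geometric pmf P{x=k} = p * (1-p)^k, k = 0, 1, 2, ...
# The probabilities sum to 1; a long partial sum is very close to 1.
p = 0.3
pmf = [p * (1 - p) ** k for k in range(100)]
print(sum(pmf))  # very close to 1
```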
90. Geometric Distribution (Graphical) A random variable x obeys the geometric
distribution with parameter p (0 < p < 1) if

P{x=k} = p(1-p)^k,   k = 0, 1, 2, ... .
91. Geometric mean The geometric mean of n values is determined by multiplying all n
values together, then taking the nth root of the product. It is useful in taking
averages of ratios.
The geometric mean is often used for data which take on only positive values and
can vary significantly -- e.g. by orders of magnitude. An example of such
data in biomedical applications is the concentration of various substances in blood
and other body fluids.
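The definition translates directly into code. A minimal sketch (the helper name and sample values are illustrative):

```python
import math

# Geometric mean: nth root of the product of n positive values.
def geometric_mean(values):
    n = len(values)
    return math.prod(values) ** (1 / n)

# Values spanning orders of magnitude; the geometric mean is close to 10,
# while the arithmetic mean (37) is dominated by the largest value.
print(geometric_mean([1, 10, 100]))
```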
92. Harmonic Mean Harmonic mean is a measure of central location. The harmonic
mean H of n positive values x1, x2, ..., xn is defined by the formula

H = n / (1/x1 + 1/x2 + ... + 1/xn).

A typical use is the averaging of velocities over equal distances.
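As a sketch of the velocity example (the specific speeds are made up): if you travel the same distance at 30 and then at 60, the average speed is the harmonic mean, not the arithmetic mean:

```python
import statistics

# Average speed over two equal-distance legs at 30 and 60:
# the harmonic mean (40), not the arithmetic mean (45).
print(round(statistics.harmonic_mean([30, 60]), 6))  # 40.0
```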
93. Hazard Function In medical statistics, the hazard function is a relationship between
a proportion and time. The proportion (also called the hazard rate) is the proportion
of subjects who die, among those who have survived to a time "t". The term can
be applied in fields other than medical statistics, in which case it refers to the failure
of a unit being studied, rather than the death of a subject.
94. Heteroscedasticity Heteroscedasticity generally means unequal variation of data,
e.g. unequal variance.
95. Heteroscedasticity in hypothesis testing In hypothesis testing, heteroscedasticity
means a situation in which the variance is different for compared samples.