Research can be defined as the search for knowledge, or as any systematic investigation, with an open mind, to establish novel facts, usually using a scientific method. The primary purpose of applied research (as opposed to basic research) is the discovery, interpretation, and development of methods and systems for the advancement of human knowledge on a wide variety of scientific matters concerning our world and the universe.
Scientific research relies on the application of the scientific method, a harnessing of curiosity.
This research provides scientific information and theories for the explanation of the nature and
the properties of the world around us. It makes practical applications possible. Scientific research
is funded by public authorities, by charitable organizations and by private groups, including
many companies. Scientific research can be subdivided into different classifications according to
their academic and application disciplines.
Artistic research, also seen as 'practice-based research', can take form when creative works are considered both the research and the object of research itself. It is a debatable body of thought which offers an alternative to purely scientific methods in the search for knowledge and truth.
Historical research is embodied in the historical method.
The phrase "my research" is also used loosely to describe a person's entire collection of information about a particular subject.
Research methods
The goal of the research process is to produce new knowledge, which takes three main forms
(although, as previously discussed, the boundaries between them may be obscure):
• Exploratory research, which structures and identifies new problems
• Constructive research, which develops solutions to a problem
• Empirical research, which tests the feasibility of a solution using empirical evidence

[Figure: The research room at the New York Public Library, an example of secondary research in progress.]
Research can also fall into two distinct types:
• Primary research (collection of data that does not already exist)
• Secondary research (summary, collation and/or synthesis of existing research)
In social sciences and later in other disciplines, the following two research methods can be
applied, depending on the properties of the subject matter and on the objective of the research:
• Qualitative research (understanding of human behavior and the reasons that govern such
behavior)
• Quantitative research (systematic empirical investigation of quantitative properties and
phenomena and their relationships)
Research is often conducted using the hourglass model structure of research.[1] The hourglass
model starts with a broad spectrum for research, focusing in on the required information through
the methodology of the project (like the neck of the hourglass), then expands the research in the
form of discussion and results.
Generally, research is understood to follow a certain structural process. Though step order may
vary depending on the subject matter and researcher, the following steps are usually part of most
formal research, both basic and applied:
• Formation of the topic
• Hypothesis
• Conceptual definitions
• Operational definition
• Gathering of data
• Analysis of data
• Testing and revising the hypothesis
• Conclusion, iteration if necessary
A common misunderstanding is that by this method a hypothesis could be proven or tested.
Generally a hypothesis is used to make predictions that can be tested by observing the outcome
of an experiment. If the outcome is inconsistent with the hypothesis, then the hypothesis is
rejected. However, if the outcome is consistent with the hypothesis, the experiment is said to
support the hypothesis. This careful language is used because researchers recognize that
alternative hypotheses may also be consistent with the observations. In this sense, a hypothesis
can never be proven, but rather only supported by surviving rounds of scientific testing and,
eventually, becoming widely thought of as true (or better, predictive), but this is not the same as
it having been proven. A useful hypothesis allows prediction and, within the accuracy of observation at the time, the prediction will be verified. As the accuracy of observation improves
with time, the hypothesis may no longer provide an accurate prediction. In this case a new
hypothesis will arise to challenge the old, and to the extent that the new hypothesis makes more
accurate predictions than the old, the new will supplant it.
Exploratory research is a type of research conducted for a problem that has not been clearly
defined. Exploratory research helps determine the best research design, data collection method
and selection of subjects. It should draw definitive conclusions only with extreme caution. Given
its fundamental nature, exploratory research often concludes that a perceived problem does not
actually exist.
Exploratory research often relies on secondary research such as reviewing available literature
and/or data, or qualitative approaches such as informal discussions with consumers, employees,
management or competitors, and more formal approaches through in-depth interviews, focus
groups, projective methods, case studies or pilot studies. The Internet allows for research
methods that are more interactive in nature. For example, RSS feeds efficiently supply
researchers with up-to-date information; major search engine search results may be sent by email
to researchers by services such as Google Alerts; comprehensive search results are tracked over
lengthy periods of time by services such as Google Trends; and websites may be created to
attract worldwide feedback on any subject.
The results of exploratory research are not usually useful for decision-making by themselves, but
they can provide significant insight into a given situation. Although the results of qualitative
research can give some indication as to the "why", "how" and "when" something occurs, they cannot tell us "how often" or "how many".
Exploratory research is not typically generalizable to the population at large.

Descriptive research
Descriptive research, also known as statistical research, describes data and characteristics
about the population or phenomenon being studied. Descriptive research answers the questions
who, what, where, when and how.
Although the data description is factual, accurate and systematic, the research cannot describe what caused a situation. Thus, descriptive research cannot be used to establish a causal
relationship, where one variable affects another. In other words, descriptive research can be said
to have a low requirement for internal validity.
The description is used for frequencies, averages and other statistical calculations. Often the best
approach, prior to writing descriptive research, is to conduct a survey investigation. Qualitative
research often has the aim of description and researchers may follow-up with examinations of
why the observations exist and what the implications of the findings are.
In short, descriptive research deals with everything that can be counted and studied, but there are always restrictions to that: the research should have an impact on the lives of the people around it. For example, a study that identifies the most frequent disease affecting the children of a town tells readers what can be done to prevent that disease, so that more people can live healthy lives.

Constructive research is perhaps the most common computer science research method. This
type of approach demands a form of validation that doesn’t need to be quite as empirically based
as in other types of research like exploratory research.
Nevertheless, the conclusions have to be objectively argued and defined. This may involve evaluating the "construct" being developed analytically against some predefined criteria or performing some benchmark tests with the prototype.
The term "construct" is often used in this context to refer to the new contribution being developed. A construct can be a new theory, algorithm, model, piece of software, or framework.
The following points explain the accompanying figure.
The "fuzzy info from many sources" tab refers to different information sources such as training materials, processes, literature, articles, working experience, etc.
In the "solution" tab, "theoretical framework" represents a tool to be used in the problem solving.
The "practical relevance" tab refers to empirical knowledge creation that offers final benefits.
The "theoretical relevance" tab gives the new theoretical knowledge that needs scientific acceptance: the back arrow to the "theoretical body of knowledge" tab.
Steps to be followed in “practical utility” tab (a):
• set objectives and tasks
• identify process model
• select case execution
• interview case organization
• prepare simulation
• run simulation
• interpret simulation results
• give feedback
Steps to be followed in “epistemic utility” tab (b):
• constructive research
• case research
• surveys
• qualitative and quantitative methods
• theory creating
• theory testing

Empirical research

Empirical research is research that derives its data by means of direct observation or experiment; such research is used to answer a question or test a hypothesis (e.g. "Does something such as a type of medical treatment work?"). The results are based upon actual evidence as opposed to theory or conjecture, and as such they can be replicated in follow-up studies. Empirical research articles are published in peer-reviewed journals. Such research may also be conducted according to hypothetico-deductive procedures, such as those developed from the work of R. A. Fisher.

Terminology
The term empirical was originally used to refer to certain ancient Greek practitioners of medicine
who rejected adherence to the dogmatic doctrines of the day, preferring instead to rely on the
observation of phenomena as perceived in experience. Later empiricism referred to a theory of
knowledge in philosophy which adheres to the principle that knowledge arises from experience
and evidence gathered specifically using the senses. In scientific use the term empirical refers to
the gathering of data using only evidence that is observable by the senses or in some cases using
calibrated scientific instruments. What early philosophers described as empiricist and empirical
research have in common is the dependence on observable data to formulate and test theories and
come to conclusions.
Usage
The researcher attempts to describe accurately the interaction between the instrument (or the
human senses) and the entity being observed. If instrumentation is involved, the researcher is
expected to calibrate his/her instrument by applying it to known standard objects and
documenting the results before applying it to unknown objects. In other words, empirical research describes research that has not taken place before, together with its results.
In practice, the accumulation of evidence for or against any particular theory involves planned
research designs for the collection of empirical data, and academic rigor plays a large part in judging the merits of a research design. Several typologies for such designs have been
suggested, one of the most popular of which comes from Campbell and Stanley (1963). They are
responsible for popularizing the widely cited distinction among pre-experimental, experimental,
and quasi-experimental designs and are staunch advocates of the central role of randomized
experiments in educational research.
Scientific research
Accurate analysis of data using standardized statistical methods in scientific studies is critical to
determining the validity of empirical research. Statistical formulas such as regression,
uncertainty coefficient, t-test, chi square, and various types of ANOVA (analyses of variance)
are fundamental to forming logical, valid conclusions. If empirical data reach significance under
the appropriate statistical formula, the research hypothesis is supported. If not, the null
hypothesis is supported (or, more correctly, not rejected), meaning no effect of the independent
variable(s) was observed on the dependent variable(s).
It is important to understand that the outcome of empirical research using statistical hypothesis
testing is never proof. It can only support a hypothesis, reject it, or do neither. These methods
yield only probabilities.
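To make this concrete, the following sketch (not part of the original text) runs an independent-samples t-test on simulated treatment and control scores with SciPy and reports whether the null hypothesis is rejected at the conventional 0.05 level; the group means, sample sizes and threshold are illustrative assumptions.

```python
# Minimal sketch: supporting or failing to reject a null hypothesis with a t-test.
# The simulated data, effect size, and 0.05 threshold are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=50.0, scale=10.0, size=100)    # e.g. untreated group scores
treatment = rng.normal(loc=55.0, scale=10.0, size=100)  # e.g. treated group scores

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Result is significant: the data support the research hypothesis.")
else:
    print("Result is not significant: the null hypothesis is not rejected.")
```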
Among scientific researchers, empirical evidence (as distinct from empirical research) refers to
objective evidence that appears the same regardless of the observer. For example, a thermometer
will not display different temperatures for each individual who observes it. Temperature, as
measured by an accurate, well calibrated thermometer, is empirical evidence. By contrast, non-
empirical evidence is subjective, depending on the observer. Following the previous example,
observer A might truthfully report that a room is warm, while observer B might truthfully report
that the same room is cool, though both observe the same reading on the thermometer. The use
of empirical evidence negates this effect of personal (i.e., subjective) experience.
Ideally, empirical research yields empirical evidence, which can then be analyzed for statistical
significance or reported in its raw form.
Empirical research could overcome the low level of financial maths
There are immense volumes of financial data to analyze and experiment with. For example, one can load free historical data for stocks, foreign exchange and indexes and try out hypothetical models. There is a problem with derivatives and so-called certificates, since most of them are short term. One idea, however, is the following:
• Introduce fictitious options, simply to see what would have happened if one had issued week-by-week options "at the money" and what would have been "computed" by Black-Scholes. The result is disillusioning: the error ranges between 10 and more than 100 percent when comparing average Black-Scholes values with the average results of exercised options.
• One can use the spreadsheet OFFSET command to compute daily, weekly or monthly returns and derive their volatility accordingly. This shows that the annualized volatility obtained from the formula Vyear = SQRT(T) * Vday (with SQRT(T) over 250 banking days giving a factor of approximately 16) is about three times higher than the volatility measured directly over the year. This is an important result: the key volatility indexes are too high, even though they are used as a key input to the Black-Scholes formula, in the western world some 10,000 times daily, affecting trillions of dollars.
• One can compare, say, a quarterly section of historical data with all other sections of up to 100 years (Dow Jones) and find that there are periods of relevant correlation and others with no correlation, "white noise" so to speak; in a diagram this produces "stalagmites". This is a frontal attack on the so-called GARCH industry, but nobody is arguing against it.
These easy-to-use techniques were published ten years ago, but nobody is reading or using them.
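The annualization formula quoted above can be checked in a few lines. The sketch below is an illustration under assumed conditions, using simulated independent daily returns rather than downloaded market data; note that with independent returns the sqrt(250)-scaled figure and the directly measured yearly volatility roughly agree, so any large discrepancy of the kind described would only show up with real, serially dependent market data.

```python
# Sketch of the annualization check described above, using simulated prices
# (a stand-in for downloaded historical data, which is an assumption here).
import numpy as np

rng = np.random.default_rng(0)
n_years, days_per_year = 40, 250
daily_returns = rng.normal(0.0003, 0.01, size=n_years * days_per_year)  # log returns

v_day = daily_returns.std(ddof=1)
v_year_scaled = np.sqrt(days_per_year) * v_day   # Vyear = sqrt(250) * Vday, factor ~ 16

# "Directly measured" volatility: standard deviation of the yearly log returns.
yearly_returns = daily_returns.reshape(n_years, days_per_year).sum(axis=1)
v_year_direct = yearly_returns.std(ddof=1)

print(f"daily vol           : {v_day:.4f}")
print(f"annualized (sqrt T) : {v_year_scaled:.4f}")
print(f"measured per year   : {v_year_direct:.4f}")
```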
Empirical cycle

A.D. de Groot's empirical cycle:


Observation: The collecting and organisation of empirical facts.
Induction: Formulating hypotheses.
Deduction: Deducing consequences of the hypotheses as testable predictions.
Testing: Testing the hypotheses with new empirical material.
Evaluation: Evaluating the outcome of the testing.
The steps in the design process interact and often occur simultaneously. For example,
the design of a measurement instrument is influenced by the type of analysis that will
be conducted. However, the type of analysis is also influenced by the specific
characteristics of the measurement instrument.

Step 1: Define the Research Problem

Problem definition is the most critical part of the research process. Research problem
definition involves specifying the information needed by management. Unless the
problem is properly defined, the information produced by the research process is
unlikely to have any value. Coca-Cola Company researchers utilized a very sound
research design to collect information on taste preferences. Unfortunately for Coca-
Cola, taste preferences are only part of what drives the soft drink purchase decision.

Research problem definition involves four interrelated steps: (1) management problem / opportunity clarification, (2) situation analysis, (3) model development, and (4) specification of information requirements.

The basic goal of problem clarification is to ensure that the decision maker’s initial
description of the management decision is accurate and reflects the appropriate area
of concern for research. If the wrong management problem is translated into a
research problem, the probability of providing management with useful information is
low.

Situation Analysis

The situation analysis focuses on the variables that have produced the stated
management problem or opportunity. The factors that have led to the
problem/opportunity manifestations and the factors that have led to management’s
concern should be isolated.
A situation analysis of the retail trade outflow problem revealed, among other things,
that (1) the local population had grown 25 percent over the previous five years, (2)
buying power per capita appeared to be growing at the national rate of 3 percent a
year, and (3) local retail sales of nongrocery items had increased approximately 20
percent over the past five years. Thus, the local retailers' sales are clearly not keeping
pace with the potential in the area.

Step 2: Estimate the Value of the Information

A decision maker normally approaches a problem with some information. If the problem is, say, whether a new product should be introduced, enough information
will normally have been accumulated through past experience with other decisions
concerning the introduction of new products and from various other sources to allow
some preliminary judgments to be formed about the desirability of introducing the
product in question. There will rarely be sufficient confidence in these judgments
that additional information relevant to the decision would not be accepted if it were
available without cost or delay. There might be enough confidence, however, that
there would be an unwillingness to pay very much or wait very long for the added
information.

Step 3: Select the Data Collection Approach

There are three basic data collection approaches in marketing research: (1) secondary
data, (2) survey data, and (3) experimental data. Secondary data were collected for
some purpose other than helping to solve the current problem, whereas primary data
are collected expressly to help solve the problem at hand.

Step 4: Select the Measurement Technique

There are four basic measurement techniques used in marketing research: (1)
questionnaires, (2) attitude scales, (3) observation, and (4) depth interviews and
projective techniques.
Primary Measurement Techniques
I. Questionnaire – a formalized instrument for asking information directly from a
respondent concerning behavior, demographic characteristics, level of knowledge,
and/or attitudes, beliefs, and feelings.

II. Attitude Scales – a formalized instrument for eliciting self-reports of beliefs and
feelings concerning an object(s).

A. Rating Scales – require the respondent to place the object being rated at some
point along a numerically valued continuum or in one of a numerically ordered series
of categories.

B. Composite Scales – require the respondents to express a degree of belief concerning various attributes of the object such that the attitude can be inferred from the pattern of responses.

C. Perceptual Maps – derive the components or characteristics an individual uses in comparing similar objects and provide a score for each object on each characteristic.

D. Conjoint Analysis – derives the value an individual assigns to various attributes of a product.

III. Observation – the direct examination of behavior, the results of behavior, or physiological changes.

IV. Projective Techniques and Depth Interviews – designed to gather information that
respondents are either unable or unwilling to provide in response to direct
questioning.
A. Projective Techniques – allow respondents to project or express their own feelings
as a characteristic of someone or something else.

B. Depth Interviews – allow individuals to express themselves without any fear of disapproval, dispute, or advice from the interviewer.

Step 5: Select the Sample

Most marketing studies involve a sample or subgroup of the total population relevant
to the problem, rather than a census of the entire group.

Step 6: Select the Model of Analysis

It is imperative that the researcher select the analytic techniques prior to collecting
the data. Once the analytic techniques are selected, the researcher should generate
fictional responses (dummy data) to the measurement instrument. These dummy data
are then analyzed by the analytic techniques selected to ensure that the results of
this analysis will provide the information required by the problem at hand.
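A hedged sketch of this dummy-data step follows. The questionnaire fields and categories are hypothetical; the point is simply to push fictional responses through the planned analysis (here a basic cross-tabulation) and confirm the output answers the research question before real data are collected.

```python
# Sketch: generate dummy responses and run the planned analysis on them first.
# The questionnaire fields (age_group, prefers_new_product) are hypothetical.
import random
from collections import Counter

random.seed(1)
age_groups = ["18-34", "35-54", "55+"]
dummy_responses = [
    {"age_group": random.choice(age_groups),
     "prefers_new_product": random.choice([True, False])}
    for _ in range(200)
]

# Planned analysis: preference rate within each age group (a simple cross-tab).
totals = Counter(r["age_group"] for r in dummy_responses)
prefers = Counter(r["age_group"] for r in dummy_responses if r["prefers_new_product"])
for group in age_groups:
    rate = prefers[group] / totals[group]
    print(f"{group}: {rate:.0%} prefer the new product (n={totals[group]})")
```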

Step 7: Evaluate the Ethics of the Research

It is essential that marketing researchers restrict their research activities to practices that are ethically sound. Ethically sound research considers the interests of the general public, the respondents, the client and the research profession as well as those of the researcher.

Step 8: Estimate Time and Financial Requirements

The program evaluation review technique (PERT) coupled with the critical path
method (CPM) offers a useful aid for estimating the resources needed for a project
and clarifying the planning and control process. PERT involves dividing the total
research project into its smallest component activities, determining the sequence in
which these activities must be performed, and attaching a time estimate for each
activity. These activities and time estimates are presented in the form of a flow chart
that allows a visual inspection of the overall process. The time estimates allow one to
determine the critical path through the chart – that series of activities whose delay
will hold up the completion of the project.
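The critical-path calculation can be sketched briefly. The activities, dependencies and day estimates below are invented for illustration; the code finds the longest chain of dependent activities, which determines the project duration.

```python
# Sketch of a critical-path (CPM) calculation over a hypothetical research project.
# Each activity maps to (duration in days, list of prerequisite activities).
activities = {
    "define problem": (5,  []),
    "design survey":  (10, ["define problem"]),
    "select sample":  (7,  ["define problem"]),
    "collect data":   (20, ["design survey", "select sample"]),
    "analyze data":   (10, ["collect data"]),
    "write report":   (8,  ["analyze data"]),
}

finish = {}   # earliest finish time of each activity
path = {}     # predecessor on the longest (critical) path

def earliest_finish(name):
    if name in finish:
        return finish[name]
    duration, prereqs = activities[name]
    start = 0
    for p in prereqs:
        if earliest_finish(p) > start:
            start = earliest_finish(p)
            path[name] = p
    finish[name] = start + duration
    return finish[name]

last = max(activities, key=earliest_finish)
critical = [last]
while critical[-1] in path:
    critical.append(path[critical[-1]])
print("Project duration:", finish[last], "days")
print("Critical path:", " -> ".join(reversed(critical)))
```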

Step 9: Prepare the Research Proposal

The research design process provides the researcher with a blueprint, or guide, for
conducting and controlling the research project. The blueprint is written in the form
of a research proposal. A written research proposal should precede any research
project.
Primary research involves getting original data directly about the product and market. Primary research data is data that did not
exist before. It is designed to answer specific questions of interest to the business - for example:
What proportion of customers believes the level of customer service provided by the business is rated good or excellent?
What do customers think of a new version of a popular product?
To collect primary data a business must carry out field research. The main methods of field research are:
Face-to-face interviews – interviewers ask people on the street or on their doorstep a series of questions.
Telephone interviews - similar questions to face-to-face interviews, although often shorter.
Online surveys – using email or the Internet. This is an increasingly popular way of obtaining primary data and much less costly
than face-to-face or telephone interviews.
Questionnaires – sent in the post (for example a customer feedback form sent to people who have recently bought a product or
service).
Focus groups and consumer panels – a small group of people meet together with a “facilitator” who asks the panel to examine a
product and then asks in-depth questions. This method is often used when a business is planning to introduce a new product or
brand name.
In most cases it is not possible to ask all existing or potential customers the questions that the business wants answering. So
primary research makes use of surveys and sampling to obtain valid results.
The main advantages of primary research and data are that it is:
• Up to date.
• Specific to the purpose – asks the questions the business wants answers to.
• Collects data which no other business will have access to (the results are confidential).
• In the case of online surveys and telephone interviews, the data can be obtained quite quickly (think about how quickly
political opinion polls come out).
The main disadvantages of primary research are that it:
• Can be difficult to collect and/or take a long time to collect.
• Is expensive to collect.
• May provide misleading results if the sample is not large enough or not chosen with care, or if the questionnaire questions
are not worded properly.

Secondary data
Secondary data is data collected by someone other than the user. Common sources of secondary
data for social science include censuses, surveys, organizational records and data collected
through qualitative methodologies or qualitative research. Primary data, by contrast, are collected
by the investigator conducting the research.
Secondary data analysis saves time that would otherwise be spent collecting data and,
particularly in the case of quantitative data, provides larger and higher-quality databases than
would be feasible for any individual researcher to collect on their own. In addition,
analysts of social and economic change consider secondary data essential, since it is impossible
to conduct a new survey that can adequately capture past change and/or developments.


Sources of secondary data


As is the case in primary research, secondary data can be obtained from two different research
strands:
• Quantitative: Census, housing, social security as well as electoral statistics
and other related databases.
• Qualitative: Semi-structured and structured interviews, focus groups
transcripts, field notes, observation records and other personal, research-
related documents.
A clear benefit of using secondary data is that much of the background work needed has already been carried out; for example, literature reviews or case studies may already have been conducted, published texts and statistics may already have been used elsewhere, and media promotion and personal contacts may already have been utilized.
This wealth of background work means that secondary data generally have a pre-established
degree of validity and reliability which need not be re-examined by the researcher who is re-
using such data.
Furthermore, secondary data can also be helpful in the research design of subsequent primary
research and can provide a baseline with which the collected primary data results can be
compared. Therefore, it is always wise to begin any research activity with a review of the
secondary data.
Secondary analysis or re-use of qualitative data
Qualitative data re-use provides a unique opportunity to study the raw materials of the recent or
more distant past to gain insights for both methodological and theoretical purposes.
In the secondary analysis of qualitative data, the value of good documentation cannot be overstated, as it provides the necessary background and much-needed context, both of which make re-use a more worthwhile and systematic endeavour.[1] Indeed, one could go as far as to claim that qualitative secondary data analysis "can be understood, not so much as the analysis of pre-existing data; rather as involving a process of re-contextualising, and re-constructing, data".[2]
Overall challenges of secondary data analysis
There are several things to take into consideration when using pre-existing data. Secondary data
does not permit the progression from formulating a research question to designing methods to
answer that question. It is also not feasible for a secondary data analyst to engage in the habitual
process of making observations and developing concepts. These limitations hinder the ability of
the researcher to focus on the original research question.
Data quality is always a concern because its source may not be trusted. Even data from official
records may be unreliable because the data is only as good as the records themselves, in terms of
methodological validity and reliability.
Furthermore, in the case of qualitative material, primary researchers are often reluctant to share
“their less-than-polished early and intermediary materials, not wanting to expose false starts,
mistakes, etc.” [1].
So overall, there are six questions that a secondary analyst should be able to answer about the
data they wish to analyze.
1. What were the agency's or researcher's goals when collecting the data?
2. What data was collected and what is it supposed to measure?
3. When was the data collected?
4. What methods were used? Who was responsible and are they available for questions?
5. How is the data organized?
6. What information is known about the success of that data collection? How consistent is the
data with data from other sources?

Sampling (statistics)

Sampling is that part of statistical practice concerned with the selection of an unbiased or
random subset of individual observations within a population of individuals intended to yield
some knowledge about the population of concern, especially for the purposes of making
predictions based on statistical inference. Sampling is an important aspect of data collection.
Researchers rarely survey the entire population for two reasons (Adèr, Mellenbergh, & Hand,
2008): the cost is too high, and the population is dynamic in that the individuals making up the
population may change over time. The three main advantages of sampling are that the cost is
lower, data collection is faster, and since the data set is smaller it is possible to ensure
homogeneity and to improve the accuracy and quality of the data.
Each observation measures one or more properties (such as weight, location, color) of
observable bodies distinguished as independent objects or individuals. In survey sampling,
survey weights can be applied to the data to adjust for the sample design. Results from
probability theory and statistical theory are employed to guide practice. In business and medical
research, sampling is widely used for gathering information about a population.
Process
The sampling process comprises several stages:
• Defining the population of concern
• Specifying a sampling frame, a set of items or events possible to measure
• Specifying a sampling method for selecting items or events from the frame
• Determining the sample size
• Implementing the sampling plan
• Sampling and data collecting

Population definition


Successful statistical practice is based on focused problem definition. In sampling, this includes
defining the population from which our sample is drawn. A population can be defined as
including all people or items with the characteristic one wishes to understand. Because there is
very rarely enough time or money to gather information from everyone or everything in a
population, the goal becomes finding a representative sample (or subset) of that population.
Sometimes that which defines a population is obvious. For example, a manufacturer needs to
decide whether a batch of material from production is of high enough quality to be released to
the customer, or should be sentenced for scrap or rework due to poor quality. In this case, the
batch is the population.
Although the population of interest often consists of physical objects, sometimes we need to
sample over time, space, or some combination of these dimensions. For instance, an
investigation of supermarket staffing could examine checkout line length at various times, or a
study on endangered penguins might aim to understand their usage of various hunting grounds
over time. For the time dimension, the focus may be on periods or discrete occasions.
In other cases, our 'population' may be even less tangible. For example, Joseph Jagger studied the
behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased wheel.
In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the wheel
(i.e. the probability distribution of its results over infinitely many trials), while his 'sample' was
formed from observed results from that wheel. Similar considerations arise when taking repeated
measurements of some physical characteristic such as the electrical conductivity of copper.
This situation often arises when we seek knowledge about the cause system of which the
observed population is an outcome. In such cases, sampling theory may treat the observed
population as a sample from a larger 'superpopulation'. For example, a researcher might study the
success rate of a new 'quit smoking' program on a test group of 100 patients, in order to predict
the effects of the program if it were made available nationwide. Here the superpopulation is
"everybody in the country, given access to this treatment" - a group which does not yet exist,
since the program isn't yet available to all.
Note also that the population from which the sample is drawn may not be the same as the
population about which we actually want information. Often there is large but not complete
overlap between these two groups due to frame issues etc. (see below). Sometimes they may be
entirely separate - for instance, we might study rats in order to get a better understanding of
human health, or we might study records from people born in 2008 in order to make predictions
about people born in 2009.
Time spent in making the sampled population and population of concern precise is often well
spent, because it raises many issues, ambiguities and questions that would otherwise have been
overlooked at this stage.
Sampling frame
In the most straightforward case, such as the sentencing of a batch of material from production
(acceptance sampling by lots), it is possible to identify and measure every single item in the
population and to include any one of them in our sample. However, in the more general case this
is not possible. There is no way to identify all rats in the set of all rats. Where voting is not
compulsory, there is no way to identify which people will actually vote at a forthcoming election
(in advance of the election).
These imprecise populations are not amenable to sampling in any of the ways below to which we could apply statistical theory.
As a remedy, we seek a sampling frame which has the property that we can identify every single
element and include any in our sample.[1] The most straightforward type of frame is a list of
elements of the population (preferably the entire population) with appropriate contact
information. For example, in an opinion poll, possible sampling frames include:
• Electoral register
• Telephone directory
Not all frames explicitly list population elements. For example, a street map can be used as a
frame for a door-to-door survey; although it doesn't show individual houses, we can select streets
from the map and then visit all houses on those streets. (One advantage of such a frame is that it
would include people who have recently moved and are not yet on the list frames discussed
above.)
The sampling frame must be representative of the population and this is a question outside the
scope of statistical theory demanding the judgment of experts in the particular subject matter
being studied. All the above frames omit some people who will vote at the next election and
contain some people who will not; some frames will contain multiple records for the same
person. People not in the frame have no prospect of being sampled. Statistical theory tells us
about the uncertainties in extrapolating from a sample to the frame. In extrapolating from frame
to population, its role is motivational and suggestive.
To the scientist, however, representative sampling is the only justified procedure for
choosing individual objects for use as the basis of generalization, and is therefore
usually the only acceptable basis for ascertaining truth.

—Andrew A. Marino[2]

It is important to understand this difference to steer clear of confusing prescriptions found in many web pages.
In defining the frame, practical, economic, ethical, and technical issues need to be addressed.
The need to obtain timely results may prevent extending the frame far into the future.
The difficulties can be extreme when the population and frame are disjoint. This is a particular
problem in forecasting where inferences about the future are made from historical data. In fact,
in 1703, when Jacob Bernoulli proposed to Gottfried Leibniz the possibility of using historical
mortality data to predict the probability of early death of a living man, Gottfried Leibniz
recognized the problem in replying:
Nature has established patterns originating in the return of events but only for the
most part. New illnesses flood the human race, so that no matter how many
experiments you have done on corpses, you have not thereby imposed a limit on
the nature of events so that in the future they could not vary.

—Gottfried Leibniz

Kish posited four basic problems of sampling frames:


1. Missing elements: Some members of the population are not included in the
frame.
2. Foreign elements: The non-members of the population are included in the
frame.
3. Duplicate entries: A member of the population is surveyed more than once.
4. Groups or clusters: The frame lists clusters instead of individuals.
A frame may also provide additional 'auxiliary information' about its elements; when this
information is related to variables or groups of interest, it may be used to improve survey design.
For instance, an electoral register might include name and sex; this information can be used to
ensure that a sample taken from that frame covers all demographic categories of interest.
(Sometimes the auxiliary information is less explicit; for instance, a telephone number may
provide some information about location.)
Having established the frame, there are a number of ways for organizing it to improve efficiency
and effectiveness.
It's at this stage that the researcher should decide whether the sample is in fact to be the whole
population and would therefore be a census.
Probability and nonprobability sampling
A probability sampling scheme is one in which every unit in the population has a chance
(greater than zero) of being selected in the sample, and this probability can be accurately
determined. The combination of these traits makes it possible to produce unbiased estimates of
population totals, by weighting sampled units according to their probability of selection.
Example: We want to estimate the total income of adults living in a given street. We visit each
household in that street, identify all adults living there, and randomly select one adult from each
household. (For example, we can allocate each person a random number, generated from a
uniform distribution between 0 and 1, and select the person with the highest number in each
household). We then interview the selected person and find their income. People living on their
own are certain to be selected, so we simply add their income to our estimate of the total. But a
person living in a household of two adults has only a one-in-two chance of selection. To reflect
this, when we come to such a household, we would count the selected person's income twice
towards the total. (In effect, the person who is selected from that household is taken as
representing the person who isn't selected.)
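The weighting in this example translates directly into code. The household incomes below are invented; each selected adult's income is multiplied by the number of adults in the household, i.e. by the inverse of that person's selection probability.

```python
# Sketch of the household example above with invented data: one adult is selected
# at random per household and weighted by 1 / (selection probability).
import random

random.seed(7)
# Each household is a list of adult incomes.
households = [
    [32000],                 # one adult: selected with probability 1
    [45000, 28000],          # two adults: each selected with probability 1/2
    [51000, 39000, 24000],   # three adults: each selected with probability 1/3
]

estimate = 0.0
for incomes in households:
    selected = random.choice(incomes)   # select one adult at random
    weight = len(incomes)               # inverse of the selection probability
    estimate += weight * selected

true_total = sum(sum(h) for h in households)
print(f"weighted estimate of total income: {estimate:,.0f}")
print(f"true total income:                 {true_total:,.0f}")
```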
In the above example, not everybody has the same probability of selection; what makes it a
probability sample is the fact that each person's probability is known. When every element in the
population does have the same probability of selection, this is known as an 'equal probability of
selection' (EPS) design. Such designs are also referred to as 'self-weighting' because all sampled
units are given the same weight.
Probability sampling includes: Simple Random Sampling, Systematic Sampling, Stratified
Sampling, Probability Proportional to Size Sampling, and Cluster or Multistage Sampling. These
various ways of probability sampling have two things in common:
1. Every element has a known nonzero probability of being sampled, and
2. random selection is involved at some point.
Nonprobability sampling is any sampling method where some elements of the population have
no chance of selection (these are sometimes referred to as 'out of coverage'/'undercovered'), or
where the probability of selection can't be accurately determined. It involves the selection of
elements based on assumptions regarding the population of interest, which forms the criteria for
selection. Hence, because the selection of elements is nonrandom, nonprobability sampling does
not allow the estimation of sampling errors. These conditions give rise to exclusion bias, placing
limits on how much information a sample can provide about the population. Information about
the relationship between sample and population is limited, making it difficult to extrapolate from
the sample to the population.
Example: We visit every household in a given street, and interview the first person to answer the
door. In any household with more than one occupant, this is a nonprobability sample, because
some people are more likely to answer the door (e.g. an unemployed person who spends most of
their time at home is more likely to answer than an employed housemate who might be at work
when the interviewer calls) and it's not practical to calculate these probabilities.
Nonprobability Sampling includes: Accidental Sampling, Quota Sampling and Purposive
Sampling. In addition, nonresponse effects may turn any probability design into a nonprobability
design if the characteristics of nonresponse are not well understood, since nonresponse
effectively modifies each element's probability of being sampled.
Sampling methods
Within any of the types of frame identified above, a variety of sampling methods can be
employed, individually or in combination. Factors commonly influencing the choice between
these designs include:
• Nature and quality of the frame
• Availability of auxiliary information about units on the frame
• Accuracy requirements, and the need to measure accuracy
• Whether detailed analysis of the sample is expected
• Cost/operational concerns

Simple random sampling


In a simple random sample ('SRS') of a given size, all such subsets of the frame are given an
equal probability. Each element of the frame thus has an equal probability of selection: the frame
is not subdivided or partitioned. Furthermore, any given pair of elements has the same chance of
selection as any other such pair (and similarly for triples, and so on). This minimises bias and
simplifies analysis of results. In particular, the variance between individual results within the
sample is a good indicator of variance in the overall population, which makes it relatively easy to
estimate the accuracy of results.
However, SRS can be vulnerable to sampling error because the randomness of the selection may
result in a sample that doesn't reflect the makeup of the population. For instance, a simple
random sample of ten people from a given country will on average produce five men and five
women, but any given trial is likely to overrepresent one sex and underrepresent the other.
Systematic and stratified techniques, discussed below, attempt to overcome this problem by
using information about the population to choose a more representative sample.
SRS may also be cumbersome and tedious when sampling from an unusually large target
population. In some cases, investigators are interested in research questions specific to subgroups
of the population. For example, researchers might be interested in examining whether cognitive
ability as a predictor of job performance is equally applicable across racial groups. SRS cannot
accommodate the needs of researchers in this situation because it does not provide subsamples of
the population. Stratified sampling, which is discussed below, addresses this weakness of SRS.
Simple random sampling is always an EPS design, but not all EPS designs are simple random
sampling.
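A simple random sample can be drawn from a frame with the standard library, as in the sketch below; the frame of 1,000 element identifiers is hypothetical.

```python
# Sketch: draw a simple random sample (without replacement) from a frame.
# The frame here is just a hypothetical list of 1,000 element identifiers.
import random

random.seed(3)
frame = [f"element_{i:04d}" for i in range(1000)]
sample_size = 50

srs = random.sample(frame, sample_size)  # every subset of size 50 is equally likely
print(srs[:5])
```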
Systematic sampling
Systematic sampling relies on arranging the target population according to some ordering
scheme and then selecting elements at regular intervals through that ordered list. Systematic
sampling involves a random start and then proceeds with the selection of every kth element from
then onwards. In this case, k=(population size/sample size). It is important that the starting point
is not automatically the first in the list, but is instead randomly chosen from within the first to the
kth element in the list. A simple example would be to select every 10th name from the telephone
directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10').
As long as the starting point is randomized, systematic sampling is a type of probability
sampling. It is easy to implement and the stratification induced can make it efficient, if the
variable by which the list is ordered is correlated with the variable of interest. 'Every 10th'
sampling is especially useful for efficient sampling from databases.
Example: Suppose we wish to sample people from a long street that starts in a poor district
(house #1) and ends in an expensive district (house #1000). A simple random selection of
addresses from this street could easily end up with too many from the high end and too few from
the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th
street number along the street ensures that the sample is spread evenly along the length of the
street, representing all of these districts. (Note that if we always start at house #1 and end at
#991, the sample is slightly biased towards the low end; by randomly selecting the start between
#1 and #10, this bias is eliminated.)
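The 'every kth element with a random start' scheme maps directly onto code; the sketch below uses the street example's house numbers 1 to 1000 and a sample size of 100.

```python
# Sketch of systematic sampling for the street example: houses 1..1000,
# sample size 100, so k = 1000 / 100 = 10, with a random start within the first k.
import random

random.seed(5)
population = list(range(1, 1001))     # house numbers 1..1000
sample_size = 100
k = len(population) // sample_size    # sampling interval

start = random.randint(0, k - 1)      # random start within the first k elements
systematic_sample = population[start::k]
print(systematic_sample[:10])
```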
However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is
present and the period is a multiple or factor of the interval used, the sample is especially likely
to be unrepresentative of the overall population, making the scheme less accurate than simple
random sampling.
Example: Consider a street where the odd-numbered houses are all on the north (expensive) side
of the road, and the even-numbered houses are all on the south (cheap) side. Under the sampling
scheme given above, it is impossible to get a representative sample; either the houses sampled
will all be from the odd-numbered, expensive side, or they will all be from the even-numbered,
cheap side.
Another drawback of systematic sampling is that even in scenarios where it is more accurate than
SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two examples of
systematic sampling that are given above, much of the potential sampling error is due to
variation between neighbouring houses - but because this method never selects two neighbouring
houses, the sample will not give us any information on that variation.)
As described above, systematic sampling is an EPS method, because all elements have the same
probability of selection (in the example given, one in ten). It is not 'simple random sampling'
because different subsets of the same size have different selection probabilities - e.g. the set
{4,14,24,...,994} has a one-in-ten probability of selection, but the set {4,13,24,34,...} has zero
probability of selection.
Systematic sampling can also be adapted to a non-EPS approach; for an example, see discussion
of PPS samples below.
Stratified sampling
Where the population embraces a number of distinct categories, the frame can be organized by
these categories into separate "strata." Each stratum is then sampled as an independent sub-
population, out of which individual elements can be randomly selected.[3] There are several
potential benefits to stratified sampling.
First, dividing the population into distinct, independent strata can enable researchers to draw
inferences about specific subgroups that may be lost in a more generalized random sample.
Second, utilizing a stratified sampling method can lead to more efficient statistical estimates
(provided that strata are selected based upon relevance to the criterion in question, instead of
availability of the samples). Even if a stratified sampling approach does not lead to increased
statistical efficiency, such a tactic will not result in less efficiency than would simple random
sampling, provided that each stratum is proportional to the group’s size in the population.
Third, it is sometimes the case that data are more readily available for individual, pre-existing
strata within a population than for the overall population; in such cases, using a stratified
sampling approach may be more convenient than aggregating data across groups (though this
may potentially be at odds with the previously noted importance of utilizing criterion-relevant
strata).
Finally, since each stratum is treated as an independent population, different sampling
approaches can be applied to different strata, potentially enabling researchers to use the approach
best suited (or most cost-effective) for each identified subgroup within the population.
There are, however, some potential drawbacks to using stratified sampling. First, identifying
strata and implementing such an approach can increase the cost and complexity of sample
selection, as well as leading to increased complexity of population estimates. Second, when
examining multiple criteria, stratifying variables may be related to some, but not to others,
further complicating the design, and potentially reducing the utility of the strata. Finally, in some
cases (such as designs with a large number of strata, or those with a specified minimum sample
size per group), stratified sampling can potentially require a larger sample than would other
methods (although in most cases, the required sample size would be no larger than would be
required for simple random sampling).
A stratified sampling approach is most effective when three conditions are met:
1. Variability within strata is minimized
2. Variability between strata is maximized
3. The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
Advantages over other sampling methods
1. Focuses on important subpopulations and ignores irrelevant ones.
2. Allows use of different sampling techniques for different subpopulations.
3. Improves the accuracy/efficiency of estimation.
4. Permits greater balancing of statistical power of tests of differences between
strata by sampling equal numbers from strata varying widely in size.
Disadvantages
1. Requires selection of relevant stratification variables which can be difficult.
2. Is not useful when there are no homogeneous subgroups.
3. Can be expensive to implement.
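A proportional-allocation stratified sample, as described above, can be sketched as follows; the strata and their sizes are invented for illustration, with a simple random sample drawn within each stratum.

```python
# Sketch: stratified sampling with proportional allocation over invented strata.
import random

random.seed(11)
strata = {
    "urban":    [f"urban_{i}" for i in range(600)],
    "suburban": [f"suburban_{i}" for i in range(300)],
    "rural":    [f"rural_{i}" for i in range(100)],
}
total = sum(len(units) for units in strata.values())
overall_sample_size = 50

sample = []
for name, units in strata.items():
    n_h = round(overall_sample_size * len(units) / total)  # proportional allocation
    sample.extend(random.sample(units, n_h))               # SRS within each stratum

print({name: sum(s.startswith(name) for s in sample) for name in strata})
```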
Poststratification
Stratification is sometimes introduced after the sampling phase in a process called
"poststratification".[3] This approach is typically implemented due to a lack of prior knowledge of
an appropriate stratifying variable or when the experimenter lacks the necessary information to
create a stratifying variable during the sampling phase. Although the method is susceptible to the
pitfalls of post hoc approaches, it can provide several benefits in the right situation.
Implementation usually follows a simple random sample. In addition to allowing for
stratification on an ancillary variable, poststratification can be used to implement weighting,
which can improve the precision of a sample's estimates.[3]
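Poststratification weighting can be sketched in a few lines; the population shares and realized sample counts below are invented, and each group's weight is its population share divided by its sample share.

```python
# Sketch: poststratification weights = population share / sample share per stratum.
# Population shares and realized sample counts below are invented numbers.
population_share = {"male": 0.49, "female": 0.51}
sample_counts = {"male": 380, "female": 620}   # the sample over-represents females

n = sum(sample_counts.values())
weights = {
    group: population_share[group] / (count / n)
    for group, count in sample_counts.items()
}
print(weights)  # males weighted up, females weighted down
```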
Oversampling

Choice-based sampling is one of the stratified sampling strategies. In choice-based sampling,[4] the data are stratified on the target and a sample is taken from each stratum so that the rare target class will be more represented in the sample. The model is then built on this biased sample. The
effects of the input variables on the target are often estimated with more precision with the
choice-based sample even when a smaller overall sample size is taken, compared to a random
sample. The results usually must be adjusted to correct for the oversampling.
Probability proportional to size sampling
In some cases the sample designer has access to an "auxiliary variable" or "size measure",
believed to be correlated to the variable of interest, for each element in the population. These
data can be used to improve accuracy in sample design. One option is to use the auxiliary
variable as a basis for stratification, as discussed above.
Another option is probability-proportional-to-size ('PPS') sampling, in which the selection
probability for each element is set to be proportional to its size measure, up to a maximum of 1.
In a simple PPS design, these selection probabilities can then be used as the basis for Poisson
sampling. However, this has the drawback of variable sample size, and different portions of the
population may still be over- or under-represented due to chance variation in selections. To
address this problem, PPS may be combined with a systematic approach.
Example: Suppose we have six schools with populations of 150, 180, 200, 220, 260, and 490
students respectively (total 1500 students), and we want to use student population as the basis
for a PPS sample of size three. To do this, we could allocate the first school numbers 1 to 150,
the second school 151 to 330 (= 150 + 180), the third school 331 to 530, and so on to the last
school (1011 to 1500). We then generate a random start between 1 and 500 (equal to 1500/3)
and count through the school populations by multiples of 500. If our random start was 137, we
would select the schools which have been allocated numbers 137, 637, and 1137, i.e. the first,
fourth, and sixth schools.
The PPS approach can improve accuracy for a given sample size by concentrating sample on
large elements that have the greatest impact on population estimates. PPS sampling is commonly
used for surveys of businesses, where element size varies greatly and auxiliary information is
often available - for instance, a survey attempting to measure the number of guest-nights spent in
hotels might use each hotel's number of rooms as an auxiliary variable. In some cases, an older
measurement of the variable of interest can be used as an auxiliary variable when attempting to
produce more current estimates.
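The school example above can be reproduced with a short script; the school sizes, the skip of 500 and the start of 137 follow the text, and the fixed start stands in for the random start.

```python
# Sketch of the systematic PPS example above: six schools, 1500 students,
# sample of 3, skip = 1500 / 3 = 500, fixed start 137 to match the text.
import itertools

school_sizes = [150, 180, 200, 220, 260, 490]
cumulative = list(itertools.accumulate(school_sizes))  # [150, 330, 530, 750, 1010, 1500]

skip = sum(school_sizes) // 3
start = 137                                            # the random start used in the text
selection_numbers = [start + i * skip for i in range(3)]  # 137, 637, 1137

def school_for(number):
    # Return the 1-based index of the school whose range contains this number.
    for index, upper in enumerate(cumulative):
        if number <= upper:
            return index + 1

print([school_for(n) for n in selection_numbers])      # -> [1, 4, 6]
```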
Cluster sampling
Sometimes it is cheaper to 'cluster' the sample in some way e.g. by selecting respondents from
certain areas only, or certain time-periods only. (Nearly all samples are in some sense 'clustered'
in time - although this is rarely taken into account in the analysis.)
Cluster sampling is an example of 'two-stage sampling' or 'multistage sampling': in the first stage
a sample of areas is chosen; in the second stage a sample of respondents within those areas is
selected.
This can reduce travel and other administrative costs. It also means that one does not need a
sampling frame listing all elements in the target population. Instead, clusters can be chosen from
a cluster-level frame, with an element-level frame created only for the selected clusters. Cluster
sampling generally increases the variability of sample estimates above that of simple random
sampling, depending on how the clusters differ between themselves, as compared with the
within-cluster variation.
Nevertheless, one disadvantage of cluster sampling is that the precision of sample estimates depends on the actual clusters chosen. If the chosen clusters are biased in a certain way, inferences drawn about population parameters from these sample estimates will be far from accurate.
Multistage sampling Multistage sampling is a complex form of cluster sampling in which two
or more levels of units are embedded one in the other. The first stage consists of constructing the
clusters that will be used to sample from. In the second stage, a sample of primary units is
randomly selected from each cluster (rather than using all units contained in all selected
clusters). In following stages, in each of those selected clusters, additional samples of units are
selected, and so on. All ultimate units (individuals, for instance) selected at the last step of this
procedure are then surveyed.
This technique is essentially the process of taking random samples of preceding random samples. It is not as precise as a true simple random sample of the same size, but it avoids many of the practical problems of drawing such a sample, and it remains a valid probability design because it relies on multiple stages of randomization.
Multistage sampling is used frequently when a complete list of all members of the population does not exist or is impractical to construct. Moreover, by avoiding the use of all sample units in all selected clusters, multistage sampling avoids the large, and perhaps unnecessary, costs associated with traditional cluster sampling.
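A minimal sketch of a two-stage design in Python, assuming a hypothetical frame of eight clusters with made-up unit identifiers (the cluster labels, sizes, and stage sample sizes are all invented for illustration):

import random

random.seed(1)

# Hypothetical first-stage frame: 8 clusters, each holding a list of unit IDs.
clusters = {c: [f"{c}-{i}" for i in range(random.randint(20, 40))]
            for c in "ABCDEFGH"}

def two_stage_sample(clusters, n_clusters, n_units):
    # Stage 1: simple random sample of clusters from the cluster-level frame.
    chosen = random.sample(list(clusters), n_clusters)
    # Stage 2: simple random sample of units within each selected cluster only,
    # so an element-level frame is needed just for the chosen clusters.
    sample = []
    for c in chosen:
        units = clusters[c]
        sample.extend(random.sample(units, min(n_units, len(units))))
    return chosen, sample

chosen, units = two_stage_sample(clusters, n_clusters=3, n_units=5)
print(chosen, units)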
[edit] Matched random sampling
A method of assigning participants to groups in which pairs of participants are first matched on
some characteristic and then individually assigned randomly to groups.[5]
Matched random sampling typically arises in the following two contexts:
1. Two samples in which the members are clearly paired, or are matched
explicitly by the researcher. For example, IQ measurements or pairs of
identical twins.
2. Those samples in which the same attribute, or variable, is measured twice on
each subject, under different circumstances. Commonly called repeated
measures. Examples include the times of a group of athletes for 1500m
before and after a week of special training; the milk yields of cows before and
after being fed a particular diet.

[edit] Quota sampling


In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as
in stratified sampling. Then judgment is used to select the subjects or units from each segment
based on a specified proportion. For example, an interviewer may be told to sample 200 females
and 300 males between the age of 45 and 60.
It is this second step which makes the technique one of non-probability sampling. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This non-random element is its greatest weakness, and the relative merits of quota versus probability sampling have been a matter of controversy for many years.
[edit] Convenience sampling or Accidental Sampling
Convenience sampling (sometimes known as grab or opportunity sampling) is a type of
nonprobability sampling which involves the sample being drawn from that part of the population
which is close to hand. That is, a sample population selected because it is readily available and
convenient. Participants may be recruited simply because the researcher happens to meet them, or because they can be reached easily by technological means such as the internet or telephone. The researcher using such a sample cannot scientifically make generalizations about the total population from this sample because it would not be representative enough. For example, if an interviewer conducted such a survey at a shopping center early in the morning on a given day, the people available to be interviewed would be limited to those present at that time, and their views might well differ from those that would be obtained if the survey were conducted at different times of day and on several days of the week.
This type of sampling is most useful for pilot testing. Several important considerations for
researchers using convenience samples include:
1. Are there controls within the research design or experiment which can serve
to lessen the impact of a non-random, convenience sample, thereby ensuring
the results will be more representative of the population?
2. Is there good reason to believe that a particular convenience sample would or
should respond or behave differently than a random sample from the same
population?
3. Is the question being asked by the research one that can adequately be
answered using a convenience sample?
In social science research, snowball sampling is a similar technique, where existing study
subjects are used to recruit more subjects into the sample.
[edit] Line-intercept sampling
Line-intercept sampling is a method of sampling elements in a region whereby an element is
sampled if a chosen line segment, called a “transect”, intersects the element.
[edit] Panel sampling
Panel sampling is the method of first selecting a group of participants through a random
sampling method and then asking that group for the same information again several times over a
period of time. Therefore, each participant is given the same survey or interview at two or more
time points; each period of data collection is called a "wave". This sampling methodology is
often chosen for large scale or nation-wide studies in order to gauge changes in the population
with regard to any number of variables from chronic illness to job stress to weekly food
expenditures. Panel sampling can also be used to inform researchers about within-person health
changes due to age or help explain changes in continuous dependent variables such as spousal
interaction. There have been several proposed methods of analyzing panel sample data, including
MANOVA, growth curves, and structural equation modeling with lagged effects. For a more
thorough look at analytical techniques for panel data, see Johnson (1995).
[edit] Event sampling methodology
Event sampling methodology (ESM) is a newer form of sampling that allows researchers to study ongoing experiences and events that vary across and within days in their naturally occurring environment. Because of the frequent sampling of events inherent in ESM, it enables researchers to measure the typology of activity and detect the temporal and dynamic fluctuations of work experiences. The popularity of ESM as a research design has increased in recent years because it addresses a shortcoming of cross-sectional research: researchers can now detect intra-individual variation across time. In ESM, participants are asked to record their experiences and perceptions in a paper or electronic diary.
There are three types of ESM:
1. Signal contingent – random beeping notifies participants to record data. The
advantage of this type of ESM is minimization of recall bias.
2. Event contingent – records data when certain events occur
3. Interval contingent – records data according to the passing of a certain period
of time
ESM has several disadvantages. One is that it can sometimes be perceived as invasive and intrusive by participants. ESM also leads to possible self-selection bias: it may be that only certain types of individuals are willing to participate in this type of study, creating a non-random sample. Another concern is participant cooperation; participants may not actually fill out their diaries at the specified times. Furthermore, ESM may substantively change the phenomenon being studied. Reactivity or priming effects may occur, such that repeated measurement causes changes in the participants' experiences. This method of sampling data is also highly vulnerable to common method variance.[6]
Further, it is important to think about whether or not an appropriate dependent variable is being
used in an ESM design. For example, it might be logical to use ESM in order to answer research
questions which involve dependent variables with a great deal of variation throughout the day.
Thus, variables such as change in mood, change in stress level, or the immediate impact of
particular events may be best studied using ESM methodology. However, it is not likely that
utilizing ESM will yield meaningful predictions when measuring someone performing a
repetitive task throughout the day or when dependent variables are long-term in nature (coronary
heart problems).
[edit] Replacement of selected units
Sampling schemes may be without replacement ('WOR' - no element can be selected more than
once in the same sample) or with replacement ('WR' - an element may appear multiple times in
the one sample). For example, if we catch fish, measure them, and immediately return them to
the water before continuing with the sample, this is a WR design, because we might end up
catching and measuring the same fish more than once. However, if we do not return the fish to
the water (e.g. if we eat the fish), this becomes a WOR design.
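The distinction can be made concrete with Python's standard library, which offers both designs directly; the fish population below is invented purely for illustration.

import random

random.seed(0)
pond = [f"fish_{i}" for i in range(1, 11)]   # hypothetical population of 10 fish

# Without replacement (WOR): each fish can appear at most once in the sample.
wor_sample = random.sample(pond, k=4)

# With replacement (WR): the same fish may be "caught" and measured more than once.
wr_sample = random.choices(pond, k=4)

print("WOR:", wor_sample)
print("WR: ", wr_sample)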
[edit] Sample size
Formulas, tables, and power function charts are well known approaches to determine sample
size.
[edit] Formulas
Where the frame and population are identical, statistical theory yields exact recommendations on sample size.[7] However, where it is not straightforward to define a frame representative of the population, it is more important to understand the cause system of which the population are outcomes and to ensure that all sources of variation are embraced in the frame. A large number of observations is of no value if major sources of variation are neglected in the study. In other words, the aim is to take a sample group that matches the survey category and is practical to survey.
Bartlett, Kotrlik, and Higgins (2001) published a paper, "Organizational Research: Determining Appropriate Sample Size in Survey Research", in the Information Technology, Learning, and Performance Journal[8] that provides an explanation of Cochran's (1977) formulas. A
discussion and illustration of sample size formulas, including the formula for adjusting the
sample size for smaller populations, is included. A table is provided that can be used to select the
sample size for a research problem based on three alpha levels and a set error rate.
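As a rough sketch, one common statement of Cochran's (1977) formula for estimating a proportion, together with the adjustment for smaller (finite) populations discussed by Bartlett, Kotrlik, and Higgins, can be written as follows; the confidence level, margin of error, and population size below are arbitrary illustrative values.

import math

def cochran_sample_size(z, p, e, population=None):
    # Cochran's formula for a proportion: n0 = z^2 * p * (1 - p) / e^2.
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is None:
        return math.ceil(n0)
    # Finite population correction for smaller populations.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# 95% confidence (z about 1.96), maximum variability p = 0.5, 5% margin of error.
print(cochran_sample_size(1.96, 0.5, 0.05))                   # about 385
print(cochran_sample_size(1.96, 0.5, 0.05, population=2000))  # about 323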
[edit] Steps for using sample size tables
1. Postulate the effect size of interest, α, and β.
2. Check sample size table[9]
1. Select the table corresponding to the selected α
2. Locate the row corresponding to the desired power
3. Locate the column corresponding to the estimated effect size.
4. The intersection of the column and row is the minimum sample size
required.
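The same lookup can be done numerically instead of from a printed table. The sketch below uses the statsmodels package (assumed to be installed) for a two-sample t-test; the effect size, alpha, and power are chosen purely as examples.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,  # postulated effect size (Cohen's d)
                                   alpha=0.05,       # selected significance level
                                   power=0.80)       # desired power (1 - beta)
print(round(n_per_group))  # roughly 64 participants per group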

[edit] Sampling and data collection


Good data collection involves:
• Following the defined sampling process
• Keeping the data in time order
• Noting comments and other contextual events
• Recording non-responses
Most sampling books and papers written by non-statisticians focus only on the data collection aspect, which is just a small though important part of the sampling process.
[edit] Errors in sample surveys
Survey results are typically subject to some error. Total errors can be classified into sampling
errors and non-sampling errors. The term "error" here includes systematic biases as well as
random errors.
[edit] Sampling errors and biases
Sampling errors and biases are induced by the sample design. They include:
1. Selection bias: When the true selection probabilities differ from those
assumed in calculating the results.
2. Random sampling error: Random variation in the results due to the
elements in the sample being selected at random.

[edit] Non-sampling error


Non-sampling errors are caused by other problems in data collection and processing. They
include:
1. Overcoverage: Inclusion of data from outside of the population.
2. Undercoverage: Sampling frame does not include elements in the
population.
3. Measurement error: E.g. when respondents misunderstand a question, or
find it difficult to answer.
4. Processing error: Mistakes in data coding.
5. Non-response: Failure to obtain complete data from all selected individuals.
After sampling, a review should be held of the exact process followed in sampling, rather than
that intended, in order to study any effects that any divergences might have on subsequent
analysis. A particular problem is that of non-response.
Two major types of nonresponse exist: unit nonresponse (referring to lack of completion of any
part of the survey) and item nonresponse (submission or participation in survey but failing to
complete one or more components/questions of the survey).[10][11] In survey sampling, many of
the individuals identified as part of the sample may be unwilling to participate, not have the time
to participate (opportunity cost),[12] or survey administrators may not have been able to contact
them. In this case, there is a risk of differences between respondents and nonrespondents,
leading to biased estimates of population parameters. This is often addressed by improving
survey design, offering incentives, and conducting follow-up studies which make a repeated
attempt to contact the unresponsive and to characterize their similarities and differences with the
rest of the frame.[13] The effects can also be mitigated by weighting the data when population
benchmarks are available or by imputing data based on answers to other questions.
Nonresponse is particularly a problem in internet sampling. Reasons for this problem include
improperly designed surveys,[11] over-surveying (or survey fatigue),[14][15] and the fact that
potential participants hold multiple e-mail addresses, which they don't use anymore or don't
check regularly. Web-based surveys also tend to demonstrate nonresponse bias; for example,
studies have shown that females and those from a white/Caucasian background are more likely to
respond than their counterparts.[16]
[edit] Survey weights
In many situations the sample fraction may be varied by stratum and data will have to be
weighted to correctly represent the population. Thus for example, a simple random sample of
individuals in the United Kingdom might include some in remote Scottish islands who would be
inordinately expensive to sample. A cheaper method would be to use a stratified sample with
urban and rural strata. The rural stratum could be under-represented in the sample, but weighted up appropriately in the analysis to compensate.
More generally, data should usually be weighted if the sample design does not give each
individual an equal chance of being selected. For instance, when households have equal selection
probabilities but one person is interviewed from within each household, this gives people from
large households a smaller chance of being interviewed. This can be accounted for using survey
weights. Similarly, households with more than one telephone line have a greater chance of being
selected in a random digit dialing sample, and weights can adjust for this.
Weights can also serve other purposes, such as helping to correct for non-response.
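A minimal numerical sketch of such design weighting, with made-up respondents and weights taken as inversely proportional to each household's selection probability (more adults in a one-interview-per-household design raise the weight, extra telephone lines lower it):

import numpy as np

# Hypothetical respondents: measured value, adults in the household, phone lines.
values      = np.array([12.0, 7.5, 9.0, 15.0, 11.0])
adults      = np.array([1, 2, 4, 1, 3])   # one person interviewed per household
phone_lines = np.array([1, 1, 2, 1, 1])   # RDD: more lines -> higher selection chance

# Design weight is proportional to 1 / selection probability: people in large
# households are up-weighted, households with extra lines are down-weighted.
weights = adults / phone_lines

print("unweighted mean:", values.mean())
print("weighted mean:  ", np.average(values, weights=weights))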
[edit] History
Random sampling by using lots is an old idea, mentioned several times in the Bible. In 1786
Pierre Simon Laplace estimated the population of France by using a sample, along with ratio
estimator. He also computed probabilistic estimates of the error. These were not expressed as
modern confidence intervals but as the sample size that would be needed to achieve a particular
upper bound on the sampling error with probability 1000/1001. His estimates used Bayes'
theorem with a uniform prior probability and it assumed his sample was random. The theory of
small-sample statistics developed by William Sealy Gossett put the subject on a more rigorous
basis in the 20th century. However, the importance of random sampling was not universally
appreciated and in the USA the 1936 Literary Digest prediction of a Republican win in the
presidential election went badly awry, due to severe bias [1]. More than two million people
responded to the study with their names obtained through magazine subscription lists and
telephone directories. It was not appreciated that these lists were heavily biased towards Republicans, and the resulting sample, though very large, was deeply flawed.

Statistical hypothesis testing



This article is about frequentist hypothesis testing. For Bayesian hypothesis testing,
see Bayesian inference.

A statistical hypothesis test is a method of making decisions using experimental data. In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.
The phrase "test of significance" was coined by Ronald Fisher: "Critical tests of this kind may be
called tests of significance, and when such tests are available we may discover whether a second
sample is or is not significantly different from the first."[1]
Hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory
data analysis. In frequency probability, these decisions are almost always made using null-
hypothesis tests (i.e., tests that answer the question Assuming that the null hypothesis is true,
what is the probability of observing a value for the test statistic that is at least as extreme as the
value that was actually observed?)[2] One use of hypothesis testing is deciding whether
experimental results contain enough information to cast doubt on conventional wisdom.
Statistical hypothesis testing is a key technique of frequentist statistical inference, and is widely
used, but also much criticized. While controversial,[3] the Bayesian approach to
hypothesis testing is to base rejection of the hypothesis on the posterior probability.[4] Other
approaches to reaching a decision based on data are available via decision theory and optimal
decisions.
The critical region of a hypothesis test is the set of all outcomes which, if they occur, will lead
us to decide that there is a difference. That is, cause the null hypothesis to be rejected in favor of
the alternative hypothesis. The critical region is usually denoted by C.
The Testing Process
Hypothesis testing is defined by the following general procedure:
1. The first step in any hypothesis testing is to state the relevant null and alternative
hypotheses to be tested. This is important as mis-stating the hypotheses will muddy the
rest of the process.
2. The second step is to consider the statistical assumptions being made about the sample in
doing the test; for example, assumptions about the statistical independence or about the
form of the distributions of the observations. This is equally important as invalid
assumptions will mean that the results of the test are invalid.
3. Decide which test is appropriate, and state the relevant test statistic T.
4. Derive the distribution of the test statistic under the null hypothesis from the assumptions.
In standard cases this will be a well-known result. For example, the test statistic may follow a Student's t distribution or a normal distribution.
5. The distribution of the test statistic partitions the possible values of T into those for which
the null hypothesis is rejected, the so-called critical region, and those for which it is not.
6. Compute from the observations the observed value tobs of the test statistic T.
7. Decide to either fail to reject the null hypothesis or reject it in favor of the alternative.
The decision rule is to reject the null hypothesis H0 if the observed value tobs is in the
critical region, and to accept or "fail to reject" the hypothesis otherwise.
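A compact illustration of these steps, using a one-sample t-test on made-up data with SciPy (assumed available); the null hypothesis here is that the population mean equals 100.

import numpy as np
from scipy import stats

# Hypothetical observations; H0: population mean = 100, H1: it differs from 100.
observations = np.array([102.3, 98.7, 101.5, 105.2, 99.8, 103.1, 100.9, 104.4])

# Steps 4-6: under H0 the test statistic T follows a Student's t distribution;
# ttest_1samp returns the observed value t_obs and the two-sided p-value.
t_obs, p_value = stats.ttest_1samp(observations, popmean=100)

# Step 7: reject H0 if t_obs lies in the critical region, i.e. if p < alpha.
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(t_obs, p_value, decision)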
It is important to note the philosophical difference between accepting the null hypothesis and
simply failing to reject it. The "fail to reject" terminology highlights the fact that the null
hypothesis is assumed to be true from the start of the test; if there is a lack of evidence against it,
it simply continues to be assumed true. The phrase "accept the null hypothesis" may suggest it
has been proved simply because it has not been disproved, a logical fallacy known as the
argument from ignorance. Unless a test with particularly high power is used, the idea of
"accepting" the null hypothesis may be dangerous. Nonetheless the terminology is prevalent
throughout statistics, where its meaning is well understood
Definition of terms
The following definitions are mainly based on the exposition in the book by Lehmann and
Romano:[7]
Simple hypothesis
Any hypothesis which specifies the population distribution completely.
Composite hypothesis
Any hypothesis which does not specify the population distribution completely.
Statistical test
A decision function that takes its values in the set of hypotheses.
Region of acceptance
The set of values for which we fail to reject the null hypothesis.
Region of rejection / Critical region
The set of values of the test statistic for which the null hypothesis is rejected.
Power of a test (1 − β)
The test's probability of correctly rejecting the null hypothesis. The complement of the
false negative rate, β.
Size / Significance level of a test (α)
For simple hypotheses, this is the test's probability of incorrectly rejecting the null
hypothesis. The false positive rate. For composite hypotheses this is the upper bound of
the probability of rejecting the null hypothesis over all cases covered by the null
hypothesis.
Most powerful test
For a given size or significance level, the test with the greatest power.
Uniformly most powerful test (UMP)
A test with the greatest power for all values of the parameter being tested.
Consistent test
When considering the properties of a test as the sample size grows, a test is said to be
consistent if, for a fixed size of test, the power against any fixed alternative approaches 1
in the limit.[8]
Unbiased test
For a specific alternative hypothesis, a test is said to be unbiased when the probability of
rejecting the null hypothesis is not less than the significance level when the alternative is
true and is less than or equal to the significance level when the null hypothesis is true.
Conservative test
A test is conservative if, when constructed for a given nominal significance level, the true
probability of incorrectly rejecting the null hypothesis is never greater than the nominal
level.
Uniformly most powerful unbiased (UMPU)
A test which is UMP in the set of all unbiased tests.
p-value
The probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic.
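To make the size, power, and p-value definitions concrete, the following sketch evaluates a one-sided z-test of H0: μ = 0 against H1: μ = 1 with known σ = 1 and n = 25. All numbers are hypothetical and chosen only for illustration; SciPy is assumed to be available.

import math
from scipy import stats

n, sigma, mu1, alpha = 25, 1.0, 1.0, 0.05
se = sigma / math.sqrt(n)                               # standard error of the mean

critical = stats.norm.ppf(1 - alpha, loc=0, scale=se)   # boundary of the critical region
size  = 1 - stats.norm.cdf(critical, loc=0,   scale=se) # P(reject H0 | H0 true) = alpha
power = 1 - stats.norm.cdf(critical, loc=mu1, scale=se) # P(reject H0 | H1 true) = 1 - beta

x_bar = 0.42                                            # a hypothetical observed sample mean
p_value = 1 - stats.norm.cdf(x_bar, loc=0, scale=se)    # P(result at least as extreme | H0)

print(round(size, 3), round(power, 3), round(p_value, 3))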

Multivariate statistics

Multivariate statistics is a form of statistics encompassing the simultaneous observation and analysis of more than one statistical variable. The application of multivariate statistics is
multivariate analysis. Methods of bivariate statistics, for example simple linear regression and
correlation, are special cases of multivariate statistics in which two variables are involved.
Multivariate statistics concerns understanding the different aims and background of each of the
different forms of multivariate analysis, and how they relate to each other. The practical
implementation of multivariate statistics to a particular problem may involve several types of
univariate and multivariate analysis in order to understand the relationships between variables
and their relevance to the actual problem being studied.
In addition, multivariate statistics is concerned with multivariate probability distributions, in
terms of both:
• how these can be used to represent the distributions of observed data;
• how they can be used as part of statistical inference, particularly
where several different quantities are of interest to the same analysis
Multivariate analysis of variance (MANOVA) is a generalized form of univariate analysis of
variance (ANOVA). It is used in cases where there are two or more dependent variables. As well
as identifying whether changes in the independent variable(s) have significant effects on the
dependent variables, MANOVA is also used to identify interactions among the dependent
variables and among the independent variables.[1]
Where sums of squares appear in univariate analysis of variance, in multivariate analysis of
variance certain positive-definite matrices appear. The diagonal entries are the same kinds of
sums of squares that appear in univariate ANOVA. The off-diagonal entries are corresponding
sums of products. Under normality assumptions about error distributions, the counterpart of the
sum of squares due to error has a Wishart distribution.
Analogous to ANOVA, MANOVA is based on the product of model variance matrix and error
variance matrix inverse. Invariance considerations imply the MANOVA statistic should be a
measure of magnitude of the singular value decomposition of this matrix product, but there is no
unique choice owing to the multi-dimensional nature of the alternative hypothesis.
The most common statistics are:
• Samuel Stanley Wilks' lambda (Λ)
• the Pillai-M. S. Bartlett trace
• the Lawley-Hotelling trace
• Roy's greatest root (also called Roy's largest root)
Discussion continues over the merits of each, though the greatest root leads only to a bound on
significance which is not generally of practical interest. A further complication is that the
distribution of these statistics under the null hypothesis is not straightforward and can only be
approximated except in a few low-dimensional cases. The best-known approximation for Wilks'
lambda was derived by C. R. Rao.
In the case of two groups, all the statistics are equivalent and the test reduces to Hotelling's T-square.
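As a sketch of how such a test is run in practice, the statsmodels package (assumed to be installed) reports Wilks' lambda, Pillai's trace, the Hotelling-Lawley trace, and Roy's greatest root for a small invented dataset with two dependent variables and one grouping factor.

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical data: two dependent variables measured in two groups.
df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "y1": [2.1, 2.5, 1.9, 2.3, 2.8, 3.4, 3.1, 3.6, 2.9, 3.3],
    "y2": [5.0, 4.7, 5.2, 4.9, 5.1, 4.1, 4.3, 3.9, 4.4, 4.0],
})

# The mv_test() summary lists the four test statistics for each model term.
fit = MANOVA.from_formula("y1 + y2 ~ group", data=df)
print(fit.mv_test())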

Factor analysis

Factor analysis is a statistical method used to describe variability among observed variables in
terms of a potentially lower number of unobserved variables called factors. In other words, it is
possible, for example, that variations in three or four observed variables mainly reflect the
variations in a single unobserved variable, or in a reduced number of unobserved variables.
Factor analysis searches for such joint variations in response to unobserved latent variables. The
observed variables are modeled as linear combinations of the potential factors, plus "error"
terms. The information gained about the interdependencies between observed variables can be
used later to reduce the set of variables in a dataset. Factor analysis originated in psychometrics,
and is used in behavioral sciences, social sciences, marketing, product management, operations
research, and other applied sciences that deal with large quantities of data.
Factor analysis is related to principal component analysis (PCA), but the two are not identical.
Because PCA performs a variance-maximizing rotation of the variable space, it takes into
account all variability in the variables. In contrast, factor analysis estimates how much of the
variability is due to common factors ("communality"). The two methods become essentially
equivalent if the error terms in the factor analysis model (the variability not explained by
common factors, see below) can be assumed to all have the same variance.

Contents
• 1 Statistical model
○ 1.1 Definition
○ 1.2 Example
○ 1.3 Mathematical model of the same example
• 2 Practical implementation
○ 2.1 Type of factor analysis
○ 2.2 Types of factoring
○ 2.3 Terminology
○ 2.4 Criteria for determining the number of factors
○ 2.5 Rotation methods
• 3 Factor analysis in psychometrics
○ 3.1 History
○ 3.2 Applications in psychology
○ 3.3 Advantages
○ 3.4 Disadvantages
• 4 Factor analysis in marketing
○ 4.1 Information collection
○ 4.2 Analysis
○ 4.3 Advantages
○ 4.4 Disadvantages
• 5 Factor analysis in physical sciences
• 6 Factor analysis in economics
• 7 Factor analysis in microarray analysis
• 8 See also
• 9 Footnotes
• 10 Further reading
• 11 References
• 12 External links

[edit] Statistical model


[edit] Definition

Suppose we have a set of p observable random variables x1, x2, ..., xp, with means μ1, μ2, ..., μp.
Suppose that, for some unknown constants lij and k unobserved random variables Fj, where i = 1, ..., p and j = 1, ..., k, with k < p, we have

xi − μi = li1F1 + li2F2 + ... + likFk + εi.

Here, the εi are independently distributed error terms with zero mean and finite variance, which may not be the same for all i. Let Var(εi) = ψi, so that we have

Cov(ε) = Diag(ψ1, ..., ψp) = Ψ and E(ε) = 0.

In matrix terms, we have

x − μ = LF + ε.

If we have n observations, then x, L, and F have dimensions p × n, p × k, and k × n respectively. Each column of x and F denotes the values for one particular observation, and the matrix L does not vary across observations.
Also we will impose the following assumptions on F:
1. F and ε are independent.
2. E(F) = 0
3. Cov(F) = I
Any solution of the above set of equations following the constraints for F is defined as the factors, and L as the loading matrix.
Suppose Cov(x) = Σ. Then note that, from the conditions just imposed on F, we have

Cov(x − μ) = Cov(LF + ε),

or

Σ = L Cov(F) L^T + Cov(ε),

or
Σ = L L^T + Ψ.

Note that for any orthogonal matrix Q, if we set L′ = LQ and F′ = Q^T F, the criteria for being factors and factor loadings still hold. Hence a set of factors and factor loadings is unique only up to an orthogonal transformation.
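A quick numerical check of the identity Σ = L L^T + Ψ, simulating a small factor model with invented loadings and error variances (p = 4 observed variables, k = 2 factors); NumPy is assumed to be available.

import numpy as np

rng = np.random.default_rng(0)
p, k, n = 4, 2, 200_000

L = np.array([[0.9, 0.0],
              [0.8, 0.3],
              [0.1, 0.7],
              [0.0, 0.6]])            # p x k loading matrix (made-up values)
psi = np.array([0.3, 0.2, 0.4, 0.5])  # error variances, Var(eps_i) = psi_i

F = rng.standard_normal((k, n))                            # E(F) = 0, Cov(F) = I
eps = rng.standard_normal((p, n)) * np.sqrt(psi)[:, None]  # independent errors
x = L @ F + eps                                            # mu taken as 0 for simplicity

print(np.round(np.cov(x), 2))               # empirical covariance of the data
print(np.round(L @ L.T + np.diag(psi), 2))  # model-implied Sigma = L L^T + Psi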
[edit] Example
The following example is a simplification for expository purposes, and should not be taken to be
realistic. Suppose a psychologist proposes a theory that there are two kinds of intelligence,
"verbal intelligence" and "mathematical intelligence", neither of which is directly observed.
Evidence for the theory is sought in the examination scores from each of 10 different academic
fields of 1000 students. If each student is chosen randomly from a large population, then each
student's 10 scores are random variables. The psychologist's theory may say that for each of the
10 academic fields, the score averaged over the group of all students who share some common
pair of values for verbal and mathematical "intelligences" is some constant times their level of
verbal intelligence plus another constant times their level of mathematical intelligence, i.e., it is a
linear combination of those two "factors". The numbers for a particular subject, by which the two
kinds of intelligence are multiplied to obtain the expected score, are posited by the theory to be
the same for all intelligence level pairs, and are called "factor loadings" for this subject. For
example, the theory may hold that the average student's aptitude in the field of amphibiology is
{10 × the student's verbal intelligence} + {6 × the student's mathematical
intelligence}.

The numbers 10 and 6 are the factor loadings associated with amphibiology. Other academic
subjects may have different factor loadings.
Two students having identical degrees of verbal intelligence and identical degrees of
mathematical intelligence may have different aptitudes in amphibiology because individual
aptitudes differ from average aptitudes. That difference is called the "error" — a statistical term
that means the amount by which an individual differs from what is average for his or her levels
of intelligence (see errors and residuals in statistics).
The observable data that go into factor analysis would be 10 scores of each of the 1000 students,
a total of 10,000 numbers. The factor loadings and levels of the two kinds of intelligence of each
student must be inferred from the data.
[edit] Mathematical model of the same example
In the example above, for i = 1, ..., 1,000 the ith student's scores are

xk,i = μk + lk,1 vi + lk,2 mi + εk,i,   for k = 1, ..., 10,
where
• xk,i is the ith student's score for the kth subject
• μk is the mean of the students' scores for the kth subject (assumed to be
zero, for simplicity, in the example as described above, which would amount
to a simple shift of the scale used)
• vi is the ith student's "verbal intelligence",
• mi is the ith student's "mathematical intelligence",

• lk,j (for j = 1, 2) are the factor loadings for the kth subject.


• εk,i is the difference between the ith student's score in the kth subject and the
average score in the kth subject of all students whose levels of verbal and
mathematical intelligence are the same as those of the ith student,
In matrix notation, we have

X = μ1N + LF + ε   (with 1N a 1 × N row vector of ones),
where
• N is 1000 students
• X is a 10 × 1,000 matrix of observable random variables,
• μ is a 10 × 1 column vector of unobservable constants (in this case
"constants" are quantities not differing from one individual student to the
next; and "random variables" are those assigned to individual students; the
randomness arises from the random way in which the students are chosen),
• L is a 10 × 2 matrix of factor loadings (unobservable constants, ten academic
topics, each with two intelligence parameters that determine success in that
topic),
• F is a 2 × 1,000 matrix of unobservable random variables (two intelligence
parameters for each of 1000 students),
• ε is a 10 × 1,000 matrix of unobservable random variables.
Observe that doubling the scale on which "verbal intelligence" (the first component in each column of F) is measured, and simultaneously halving the factor loadings for verbal intelligence, makes no difference to the model. Thus, no generality is lost by assuming that the standard
deviation of verbal intelligence is 1. Likewise for mathematical intelligence. Moreover, for
similar reasons, no generality is lost by assuming the two factors are uncorrelated with each
other. The "errors" ε are taken to be independent of each other. The variances of the "errors"
associated with the 10 different subjects are not assumed to be equal.
Note that, since any rotation of a solution is also a solution, this makes interpreting the factors
difficult. See disadvantages below. In this particular example, if we do not know beforehand that
the two types of intelligence are uncorrelated, then we cannot interpret the two factors as the two
different types of intelligence. Even if they are uncorrelated, we cannot tell which factor
corresponds to verbal intelligence and which corresponds to mathematical intelligence without
an outside argument.
The values of the loadings L, the averages μ, and the variances of the "errors" ε must be
estimated given the observed data X and F (the assumption about the levels of the factors is fixed
for a given F).
[edit] Practical implementation
[edit] Type of factor analysis
Exploratory factor analysis (EFA) is used to uncover the underlying structure of a relatively
large set of variables. The researcher's a priori assumption is that any indicator may be associated with any factor. This is the most common form of factor analysis. There is no prior theory, and one uses factor loadings to intuit the factor structure of the data.
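For illustration, an exploratory fit can be sketched with scikit-learn's FactorAnalysis (assumed installed); the data here are simulated from invented loadings, and the estimated loadings are what a researcher would inspect to intuit the factor structure.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)

# Simulate 500 observations of 6 indicators driven by 2 latent factors.
true_loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                          [0.0, 0.8], [0.1, 0.9], [0.0, 0.7]])
factors = rng.standard_normal((500, 2))
data = factors @ true_loadings.T + 0.3 * rng.standard_normal((500, 6))

fa = FactorAnalysis(n_components=2).fit(data)
print(np.round(fa.components_.T, 2))   # estimated loadings, one row per indicator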
Confirmatory factor analysis (CFA) seeks to determine if the number of factors and the loadings
of measured (indicator) variables on them conform to what is expected on the basis of pre-
established theory. Indicator variables are selected on the basis of prior theory and factor analysis
is used to see if they load as predicted on the expected number of factors. The researcher's a priori assumption is that each factor (the number and labels of which may be specified a priori) is associated with a specified subset of indicator variables. A minimum requirement of
confirmatory factor analysis is that one hypothesizes beforehand the number of factors in the
model, but usually also the researcher will posit expectations about which variables will load on
which factors. The researcher seeks to determine, for instance, if measures created to represent a
latent variable really belong together.
[edit] Types of factoring
Principal component analysis (PCA): The most common form of factor analysis, PCA seeks a
linear combination of variables such that the maximum variance is extracted from the variables.
It then removes this variance and seeks a second linear combination which explains the
maximum proportion of the remaining variance, and so on. This is called the principal axis
method and results in orthogonal (uncorrelated) factors.
Canonical factor analysis, also called Rao's canonical factoring, is a different method of
computing the same model as PCA, which uses the principal axis method. CFA seeks factors
which have the highest canonical correlation with the observed variables. CFA is unaffected by
arbitrary rescaling of the data.
Common factor analysis, also called principal factor analysis (PFA) or principal axis factoring
(PAF), seeks the least number of factors which can account for the common variance
(correlation) of a set of variables.
Image factoring: based on the correlation matrix of predicted variables rather than actual
variables, where each variable is predicted from the others using multiple regression.
Alpha factoring: based on maximizing the reliability of factors, assuming variables are randomly
sampled from a universe of variables. All other methods assume cases to be sampled and
variables fixed.
[edit] Terminology
Factor loadings: The factor loadings, also called component loadings in PCA, are the correlation
coefficients between the variables (rows) and factors (columns). Analogous to Pearson's r, the
squared factor loading is the percent of variance in that indicator variable explained by the factor.
To get the percent of variance in all the variables accounted for by each factor, add the sum of
the squared factor loadings for that factor (column) and divide by the number of variables. (Note
the number of variables equals the sum of their variances as the variance of a standardized
variable is 1.) This is the same as dividing the factor's eigenvalue by the number of variables.
Interpreting factor loadings: By one rule of thumb in confirmatory factor analysis, loadings
should be .7 or higher to confirm that independent variables identified a priori are represented by
a particular factor, on the rationale that the .7 level corresponds to about half of the variance in
the indicator being explained by the factor. However, the .7 standard is a high one, and real-life data may well not meet this criterion; this is why some researchers, particularly for exploratory purposes, will use a lower level such as .4 for the central factor and .25 for other factors, while others call loadings above .6 "high" and those below .4 "low". In any event, factor loadings must be
interpreted in the light of theory, not by arbitrary cutoff levels.
In oblique rotation, one gets both a pattern matrix and a structure matrix. The structure matrix is
simply the factor loading matrix as in orthogonal rotation, representing the variance in a
measured variable explained by a factor on both a unique and common contributions basis. The
pattern matrix, in contrast, contains coefficients which just represent unique contributions. The
more factors, the lower the pattern coefficients as a rule since there will be more common
contributions to variance explained. For oblique rotation, the researcher looks at both the
structure and pattern coefficients when attributing a label to a factor.
Communality (h2): The sum of the squared factor loadings for all factors for a given variable
(row) is the variance in that variable accounted for by all the factors, and this is called the
communality. The communality measures the percent of variance in a given variable explained
by all the factors jointly and may be interpreted as the reliability of the indicator. Spurious solutions: If the communality exceeds 1.0, there is a spurious solution, which may reflect too small a sample or the choice of too many or too few factors.
Uniqueness of a variable: 1 − h2. That is, uniqueness is the variability of a variable minus its communality.
Eigenvalues / characteristic roots: The eigenvalue for a given factor measures the variance in all
the variables which is accounted for by that factor. The ratio of eigenvalues is the ratio of
explanatory importance of the factors with respect to the variables. If a factor has a low
eigenvalue, then it is contributing little to the explanation of variances in the variables and may
be ignored as redundant with more important factors. Eigenvalues measure the amount of
variation in the total sample accounted for by each factor.
Extraction sums of squared loadings: Initial eigenvalues and eigenvalues after extraction (listed
by SPSS as "Extraction Sums of Squared Loadings") are the same for PCA extraction, but for
other extraction methods, eigenvalues after extraction will be lower than their initial
counterparts. SPSS also prints "Rotation Sums of Squared Loadings" and even for PCA, these
eigenvalues will differ from initial and extraction eigenvalues, though their total will be the
same.
Factor scores: Also called component scores in PCA, factor scores are the scores of each case
(row) on each factor (column). To compute the factor score for a given case for a given factor,
one takes the case's standardized score on each variable, multiplies by the corresponding factor
loading of the variable for the given factor, and sums these products. Computing factor scores
allows one to look for factor outliers. Also, factor scores may be used as variables in subsequent
modeling.
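The factor-score computation just described (standardize each variable, multiply by its loading, and sum) can be written out directly; the loading matrix, the single case, and the sample means and standard deviations below are invented for illustration.

import numpy as np

# Hypothetical loadings: 4 variables (rows) on 2 factors (columns).
loadings = np.array([[0.8, 0.1],
                     [0.7, 0.2],
                     [0.1, 0.9],
                     [0.0, 0.6]])

# One case's raw scores, plus variable means and standard deviations
# estimated from the full sample (all values made up).
raw   = np.array([12.0, 30.0, 4.5, 7.0])
means = np.array([10.0, 28.0, 5.0, 6.0])
sds   = np.array([2.0, 4.0, 1.0, 1.5])

z = (raw - means) / sds          # standardized scores for the case
factor_scores = z @ loadings     # weighted sum over variables, one score per factor
print(np.round(factor_scores, 2))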
[edit] Criteria for determining the number of factors
Comprehensibility: Though not a strictly mathematical criterion, there is much to be said for
limiting the number of factors to those whose dimension of meaning is readily comprehensible.
Often this is the first two or three. Using one or more of the methods below, the researcher
determines an appropriate range of solutions to investigate. For instance, the Kaiser criterion
may suggest three factors and the scree test may suggest five, so the researcher may request 3-, 4-,
and 5-factor solutions and select the solution which generates the most comprehensible factor
structure.
Kaiser criterion: The Kaiser rule is to drop all components with eigenvalues under 1.0. The
Kaiser criterion is the default in SPSS and most computer programs but is not recommended
when used as the sole cut-off criterion for estimating the number of factors.
Scree plot: The Cattell scree test plots the components as the X axis and the corresponding
eigenvalues as the Y-axis. As one moves to the right, toward later components, the eigenvalues
drop. When the drop ceases and the curve makes an elbow toward less steep decline, Cattell's
scree test says to drop all further components after the one starting the elbow. This rule is
sometimes criticised for being amenable to researcher-controlled "fudging". That is, as picking
the "elbow" can be subjective because the curve has multiple elbows or is a smooth curve, the
researcher may be tempted to set the cut-off at the number of factors desired by his or her
research agenda.
Horn's Parallel Analysis (PA): A Monte-Carlo based simulation method that compares the
observed eigenvalues with those obtained from uncorrelated normal variables. A factor or
component is retained if the associated eigenvalue is bigger than the 95th percentile of the distribution of eigenvalues derived from the random data. PA is one of the most widely recommended rules for determining the number of components to retain, but only a few programs include this option [1].
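A bare-bones Monte Carlo version of parallel analysis can be sketched with NumPy: compare the observed correlation-matrix eigenvalues with the 95th percentile of eigenvalues from random normal data of the same shape. The dataset here is simulated, and the sizes and loadings are illustrative only.

import numpy as np

def parallel_analysis(data, n_iter=200, percentile=95, seed=0):
    # Retain components whose observed eigenvalue exceeds the chosen percentile
    # of eigenvalues obtained from uncorrelated normal data of the same shape.
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    random_eigs = np.empty((n_iter, p))
    for i in range(n_iter):
        noise = rng.standard_normal((n, p))
        random_eigs[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)
    return int(np.sum(observed > threshold)), observed, threshold

# Simulated example: 300 cases, 6 variables, 2 underlying factors.
rng = np.random.default_rng(1)
loadings = np.array([[0.8, 0.0], [0.7, 0.1], [0.6, 0.0],
                     [0.0, 0.8], [0.1, 0.7], [0.0, 0.6]])
factors = rng.standard_normal((300, 2))
data = factors @ loadings.T + 0.5 * rng.standard_normal((300, 6))

n_keep, observed, threshold = parallel_analysis(data)
print(n_keep)   # number of components retained by PA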
Variance explained criteria: Some researchers simply use the rule of keeping enough factors to account for 90% (sometimes 80%) of the variation. Where the researcher's goal emphasizes parsimony (explaining variance with as few factors as possible), the criterion could be as low as 50%.
Before dropping a factor below one's cut-off, however, the researcher should check its
correlation with the dependent variable. A very small factor can have a large correlation with the
dependent variable, in which case it should not be dropped.
[edit] Rotation methods
Rotation serves to make the output more understandable and is usually necessary to facilitate the
interpretation of factors.
Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance of the
squared loadings of a factor (column) on all the variables (rows) in a factor matrix, which has the
effect of differentiating the original variables by extracted factor. Each factor will tend to have
either large or small loadings of any particular variable. A varimax solution yields results which
make it as easy as possible to identify each variable with a single factor. This is the most
common rotation option.
Quartimax rotation is an orthogonal alternative which minimizes the number of factors needed
to explain each variable. This type of rotation often generates a general factor on which most
variables are loaded to a high or medium degree. Such a factor structure is usually not helpful to
the research purpose.
Equimax rotation is a compromise between Varimax and Quartimax criteria.
Direct oblimin rotation is the standard method when one wishes a non-orthogonal (oblique)
solution – that is, one in which the factors are allowed to be correlated. This will result in higher
eigenvalues but diminished interpretability of the factors. See below.
Promax rotation is an alternative non-orthogonal (oblique) rotation method which is
computationally faster than the direct oblimin method and therefore is sometimes used for very
large datasets.
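Recent versions of scikit-learn (0.24 and later, assumed here) expose varimax and quartimax rotation directly on FactorAnalysis, so the effect of rotation on the loading pattern can be sketched on a small simulated dataset.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                     [0.0, 0.8], [0.1, 0.9], [0.0, 0.7]])
factors = rng.standard_normal((400, 2))
data = factors @ loadings.T + 0.3 * rng.standard_normal((400, 6))

unrotated = FactorAnalysis(n_components=2).fit(data)
rotated   = FactorAnalysis(n_components=2, rotation="varimax").fit(data)

print(np.round(unrotated.components_.T, 2))  # loadings before rotation
print(np.round(rotated.components_.T, 2))    # varimax: loadings pushed toward 0 or large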
[edit] Factor analysis in psychometrics
[edit] History
Charles Spearman spearheaded the use of factor analysis in the field of psychology and is
sometimes credited with the invention of factor analysis. He discovered that school children's
scores on a wide variety of seemingly unrelated subjects were positively correlated, which led
him to postulate that a general mental ability, or g, underlies and shapes human cognitive
performance. His postulate now enjoys broad support in the field of intelligence research, where
it is known as the g theory.
Raymond Cattell expanded on Spearman’s idea of a two-factor theory of intelligence after
performing his own tests and factor analysis. He used a multi-factor theory to explain
intelligence. Cattell’s theory addressed alternate factors in intellectual development, including
motivation and psychology. Cattell also developed several mathematical methods for adjusting
psychometric graphs, such as his "scree" test and similarity coefficients. His research led to the
development of his theory of fluid and crystallized intelligence, as well as his 16 Personality
Factors theory of personality. Cattell was a strong advocate of factor analysis and psychometrics.
He believed that all theory should be derived from research, which supports the continued use of
empirical observation and objective testing to study human intelligence.
[edit] Applications in psychology
Factor analysis is used to identify "factors" that explain a variety of results on different tests. For
example, intelligence research found that people who get a high score on a test of verbal ability
are also good on other tests that require verbal abilities. Researchers explained this by using
factor analysis to isolate one factor, often called crystallized intelligence or verbal intelligence,
that represents the degree to which someone is able to solve problems involving verbal skills.
Factor analysis in psychology is most often associated with intelligence research. However, it
also has been used to find factors in a broad range of domains such as personality, attitudes,
beliefs, etc. It is linked to psychometrics, as it can assess the validity of an instrument by finding
if the instrument indeed measures the postulated factors.
[edit] Advantages
• Reduction of number of variables, by combining two or more variables into a
single factor. For example, performance at running, ball throwing, batting,
jumping and weight lifting could be combined into a single factor such as
general athletic ability. Usually, in an item by people matrix, factors are
selected by grouping related items. In the Q factor analysis technique, the
matrix is transposed and factors are created by grouping related people: For
example, liberals, libertarians, conservatives and socialists, could form
separate groups.
• Identification of groups of inter-related variables, to see how they are related
to each other. For example, Carroll used factor analysis to build his Three
Stratum Theory. He found that a factor called "broad visual perception"
relates to how good an individual is at visual tasks. He also found a "broad
auditory perception" factor, relating to auditory task capability. Furthermore,
he found a global factor, called "g" or general intelligence, that relates to
both "broad visual perception" and "broad auditory perception". This means
someone with a high "g" is likely to have both a high "visual perception"
capability and a high "auditory perception" capability, and that "g" therefore
explains a good part of why someone is good or bad in both of those
domains.

[edit] Disadvantages
• "...each orientation is equally acceptable mathematically. But different
factorial theories proved to differ as much in terms of the orientations of
factorial axes for a given solution as in terms of anything else, so that model
fitting did not prove to be useful in distinguishing among theories."
(Sternberg, 1977). This means all rotations represent different underlying
processes, but all rotations are equally valid outcomes of standard factor
analysis optimization. Therefore, it is impossible to pick the proper rotation
using factor analysis alone.
• Factor analysis can be only as good as the data allows. In psychology, where
researchers often have to rely on less valid and reliable measures such as
self-reports, this can be problematic.
• Interpreting factor analysis is based on using a “heuristic”, which is a solution
that is "convenient even if not absolutely true" (Richard B. Darlington). More
than one interpretation can be made of the same data factored the same
way, and factor analysis cannot identify causality.

[edit] Factor analysis in marketing


The basic steps are:
• Identify the salient attributes consumers use to evaluate products in this
category.
• Use quantitative marketing research techniques (such as surveys) to collect
data from a sample of potential customers concerning their ratings of all the
product attributes.
• Input the data into a statistical program and run the factor analysis
procedure. The computer will yield a set of underlying attributes (or factors).
• Use these factors to construct perceptual maps and other product positioning
devices.

[edit] Information collection


The data collection stage is usually done by marketing research professionals. Survey questions
ask the respondent to rate a product sample or descriptions of product concepts on a range of
attributes. Anywhere from five to twenty attributes are chosen. They could include things like:
ease of use, weight, accuracy, durability, colourfulness, price, or size. The attributes chosen will
vary depending on the product being studied. The same question is asked about all the products
in the study. The data for multiple products is coded and input into a statistical program such as
R, PSPP, SAS, Stata, Statistica, JMP and SYSTAT.
[edit] Analysis
The analysis will isolate the underlying factors that explain the data. Factor analysis is an
interdependence technique. The complete set of interdependent relationships are examined.
There is no specification of either dependent variables, independent variables, or causality.
Factor analysis assumes that all the rating data on different attributes can be reduced down to a
few important dimensions. This reduction is possible because the attributes are related. The
rating given to any one attribute is partially the result of the influence of other attributes. The
statistical algorithm deconstructs the rating (called a raw score) into its various components, and
reconstructs the partial scores into underlying factor scores. The degree of correlation between
the initial raw score and the final factor score is called a factor loading. There are two
approaches to factor analysis: "principal component analysis" (the total variance in the data is
considered); and "common factor analysis" (the common variance is considered).
Note that principal component analysis and common factor analysis differ in terms of their
conceptual underpinnings. The factors produced by principal component analysis are
conceptualized as being linear combinations of the variables whereas the factors produced by
common factor analysis are conceptualized as being latent variables. Computationally, the only
difference is that the diagonal of the relationships matrix is replaced with communalities (the
variance accounted for by more than one variable) in common factor analysis. This has the result
of making the factor scores indeterminate and thus differ depending on the method used to
compute them whereas those produced by principal component analysis are not dependent on the
method of computation. Although there have been heated debates over the merits of the two
methods, a number of leading statisticians have concluded that in practice there is little
difference (Velicer and Jackson, 1990) which makes sense since the computations are quite
similar despite the differing conceptual bases, especially for datasets where communalities are
high and/or there are many variables, reducing the influence of the diagonal of the relationship
matrix on the final result (Gorsuch, 1983).
The use of principal components in a semantic space can vary somewhat because the
components may only "predict" but not "map" to the vector space. This produces a statistical
principal component use where the most salient words or themes represent the preferred basis.
[edit] Advantages
• Both objective and subjective attributes can be used provided the subjective
attributes can be converted into scores
• Factor Analysis can be used to identify hidden dimensions or constructs
which may not be apparent from direct analysis
• It is easy and inexpensive to do

[edit] Disadvantages
• Usefulness depends on the researchers' ability to collect a sufficient set of
product attributes. If important attributes are missed the value of the
procedure is reduced.
• If sets of observed variables are highly similar to each other but distinct from
other items, factor analysis will assign a single factor to them. This may make
it harder to identify factors that capture more interesting relationships.
• Naming the factors may require background knowledge or theory because
multiple attributes can be highly correlated for no apparent reason.

[edit] Factor analysis in physical sciences


Factor analysis has also been widely used in physical sciences such as geochemistry, ecology,
and hydrochemistry.[2]
In groundwater quality management, it is important to relate the spatial distribution of different
chemical parameters to different possible sources, which have different chemical signatures. For
example, a sulfide mine is likely to be associated with high levels of acidity, dissolved sulfates
and transition metals. These signatures can be identified as factors through R-mode factor
analysis, and the location of possible sources can be suggested by contouring the factor scores.[3]
In geochemistry, different factors can correspond to different mineral associations, and thus to
mineralisation.[4]
[edit] Factor analysis in economics
Economists might use factor analysis to see whether productivity, profits and workforce can be
reduced down to an underlying dimension of company growth.
[edit] Factor analysis in microarray analysis

Correlation and dependence




In statistics, correlation and dependence are any of a broad class of statistical relationships
between two or more random variables or observed data values.
Familiar examples of dependent phenomena include the correlation between the physical
statures of parents and their offspring, and the correlation between the demand for a product and
its price. Correlations are useful because they can indicate a predictive relationship that can be
exploited in practice. For example, an electrical utility may produce less power on a mild day
based on the correlation between electricity demand and weather. Correlations can also suggest
possible causal, or mechanistic relationships; however, statistical dependence is not sufficient to
demonstrate the presence of such a relationship.
Formally, dependence refers to any situation in which random variables do not satisfy a
mathematical condition of probabilistic independence. In general statistical usage, correlation or
co-relation can refer to any departure of two or more random variables from independence, but
most commonly refers to a more specialized type of relationship between mean values. There are
several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The
most common of these is the Pearson correlation coefficient, which is sensitive only to a linear
relationship between two variables (which may exist even if one is a nonlinear function of the
other). Other correlation coefficients have been developed to be more robust than the Pearson
correlation, or more sensitive to nonlinear relationships.[1][2][3]
Several sets of (x, y) points, with the Pearson correlation coefficient of x and y for
each set. Note that the correlation reflects the noisiness and direction of a linear
relationship (top row), but not the slope of that relationship (middle), nor many
aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a
slope of 0 but in that case the correlation coefficient is undefined because the
variance of Y is zero.


Pearson's product-moment coefficient


Main article: Pearson product-moment correlation coefficient
The most familiar measure of dependence between two quantities is the Pearson product-moment
correlation coefficient, or "Pearson's correlation." It is obtained by dividing the covariance of the
two variables by the product of their standard deviations. Karl Pearson developed the coefficient
from a similar but slightly different idea by Francis Galton.[4]
The population correlation coefficient ρX,Y between two random variables X and Y with expected
values μX and μY and standard deviations σX and σY is defined as:

ρX,Y = corr(X, Y) = cov(X, Y) / (σX σY) = E[(X − μX)(Y − μY)] / (σX σY),

where E is the expected value operator, cov means covariance, and corr is a widely used
alternative notation for Pearson's correlation.
The Pearson correlation is defined only if both of the standard deviations are finite and both of
them are nonzero. It is a corollary of the Cauchy–Schwarz inequality that the correlation cannot
exceed 1 in absolute value. The correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
The Pearson correlation is +1 in the case of a perfect positive (increasing) linear relationship
(correlation), −1 in the case of a perfect decreasing (negative) linear relationship
(anticorrelation) [5], and some value between −1 and 1 in all other cases, indicating the degree of
linear dependence between the variables. As it approaches zero there is less of a relationship
(closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation
between the variables.
If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true
because the correlation coefficient detects only linear dependencies between two variables. For
example, suppose the random variable X is symmetrically distributed about zero, and Y = X2.
Then Y is completely determined by X, so that X and Y are perfectly dependent, but their
correlation is zero; they are uncorrelated. However, in the special case when X and Y are jointly
normal, uncorrelatedness is equivalent to independence.
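The X and Y = X² example is easy to check numerically. The following minimal sketch assumes Python with NumPy; the sample size is arbitrary.

# Y is completely determined by X, yet the Pearson correlation is about zero.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)     # symmetrically distributed about zero
y = x ** 2                       # perfectly dependent on x
print(round(np.corrcoef(x, y)[0, 1], 3))   # close to 0 despite perfect dependence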
If we have a series of n measurements of X and Y written as xi and yi, where i = 1, 2, ..., n, then
the sample correlation coefficient r can be used to estimate the population Pearson correlation ρ
between X and Y. The sample correlation coefficient is written

r = Σi (xi − x̄)(yi − ȳ) / ((n − 1) sx sy),

where x̄ and ȳ are the sample means of X and Y, and sx and sy are the sample standard deviations
of X and Y.
This can also be written as:

r = Σi (xi − x̄)(yi − ȳ) / sqrt( Σi (xi − x̄)² · Σi (yi − ȳ)² ).
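As a quick illustration of the sample formula, the sketch below computes r directly from the definition and checks it against a library routine. It assumes Python with NumPy and SciPy; the data values are invented.

# Sample Pearson correlation: by the formula above and via scipy.stats.pearsonr.
import numpy as np
from scipy import stats

x = np.array([2.1, 2.5, 3.6, 4.0, 4.4])
y = np.array([8.0, 10.1, 12.5, 14.2, 15.0])

n = len(x)
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))
r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy)         # the two values agree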

Rank correlation coefficients


Main article: Spearman's rank correlation coefficient

Main article: Kendall tau rank correlation coefficient

Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank
correlation coefficient (τ) measure the extent to which, as one variable increases, the other
variable tends to increase, without requiring that increase to be represented by a linear
relationship. If, as one variable increases, the other decreases, the rank correlation coefficients
will be negative. It is common to regard these rank correlation coefficients as alternatives to
Pearson's coefficient, used either to reduce the amount of calculation or to make the coefficient
less sensitive to non-normality in distributions. However, this view has little mathematical basis,
as rank correlation coefficients measure a different type of relationship than the Pearson product-
moment correlation coefficient, and are best seen as measures of a different type of association,
rather than as alternative measures of the population correlation coefficient.[6][7]
To illustrate the nature of rank correlation, and its difference from linear correlation, consider the
following four pairs of numbers (x, y):
(0, 1), (10, 100), (101, 500), (102, 2000).

As we go from each pair to the next pair x increases, and so does y. This relationship is perfect,
in the sense that an increase in x is always accompanied by an increase in y. This means that we
have a perfect rank correlation, and both Spearman's and Kendall's correlation coefficients are 1,
whereas in this example the Pearson product-moment correlation coefficient is 0.7544, indicating
that the points are far from lying on a straight line. In the same way if y always decreases when x
increases, the rank correlation coefficients will be −1, while the Pearson product-moment
correlation coefficient may or may not be close to -1, depending on how close the points are to a
straight line. Although in the extreme cases of perfect rank correlation the two coefficients are
both equal (being both +1 or both −1) this is not in general so, and values of the two coefficients
cannot meaningfully be compared.[6] For example, for the three pairs (1, 1) (2, 3) (3, 2)
Spearman's coefficient is 1/2, while Kendall's coefficient is 1/3.
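These statements can be reproduced with the four pairs given above. The sketch assumes Python with SciPy.

# Rank correlations are 1 for the four pairs, while Pearson's r is about 0.7544.
from scipy import stats

x = [0, 10, 101, 102]
y = [1, 100, 500, 2000]
print(stats.pearsonr(x, y)[0])     # ~0.7544
print(stats.spearmanr(x, y)[0])    # 1.0
print(stats.kendalltau(x, y)[0])   # 1.0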
Other measures of dependence among random variables
The information given by a correlation coefficient is not enough to define the dependence
structure between random variables. The correlation coefficient completely defines the
dependence structure only in very particular cases, for example when the distribution is a
multivariate normal distribution. (See diagram above.) In the case of elliptic distributions it
characterizes the (hyper-)ellipses of equal density, however, it does not completely characterize
the dependence structure (for example, a multivariate t-distribution's degrees of freedom
determine the level of tail dependence).
Distance correlation and Brownian covariance / Brownian correlation [8][9] were introduced to
address the deficiency of Pearson's correlation that it can be zero for dependent random
variables; zero distance correlation and zero Brownian correlation imply independence.
The correlation ratio can detect almost any functional dependency, and the entropy-based
mutual information and total correlation can detect even more general dependencies. The latter
are sometimes referred to as multi-moment correlation measures, in comparison to those that
consider only second-moment (pairwise or quadratic) dependence.
The polychoric correlation is another correlation applied to ordinal data that aims to estimate the
correlation between theorised latent variables.
One way to capture a more complete view of the dependence structure is to consider a copula
between the variables.
Sensitivity to the data distribution

The degree of dependence between variables X and Y should not depend on the scale on which
the variables are expressed. Therefore, most correlation measures in common use are invariant to
location and scale transformations of the marginal distributions. That is, if we are analyzing the
relationship between X and Y, most correlation measures are unaffected by transforming X to
a + bX and Y to c + dY, where a, b, c, and d are constants. This is true of most correlation
statistics as well as their population analogues. Some correlation statistics, such as the rank
correlation coefficient, are also invariant to monotone transformations of the marginal
distributions of X and/or Y.

Pearson/Spearman correlation coefficients between X and Y are shown when the


two variables' ranges are unrestricted, and when the range of X is restricted to the
interval (0,1).

Most correlation measures are sensitive to the manner in which X and Y are sampled.
Dependencies tend to be stronger if viewed over a wider range of values. Thus, if we consider
the correlation coefficient between the heights of fathers and their sons over all adult males, and
compare it to the same correlation coefficient calculated when the fathers are selected to be
between 165 cm and 170 cm in height, the correlation will be weaker in the latter case.
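The range-restriction effect can be illustrated with simulated data. The sketch assumes Python with NumPy; the underlying correlation of 0.5 and the restriction interval are arbitrary choices.

# Correlation over the full range of X versus a restricted slice of X.
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
y = 0.5 * x + np.sqrt(1 - 0.5**2) * rng.normal(size=n)    # corr(x, y) ≈ 0.5 by construction

full = np.corrcoef(x, y)[0, 1]
mask = (x > -0.25) & (x < 0.25)                           # restrict the range of X
restricted = np.corrcoef(x[mask], y[mask])[0, 1]
print(full, restricted)          # the restricted-range correlation is noticeably weaker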
Various correlation measures in use may be undefined for certain joint distributions of X and Y.
For example, the Pearson correlation coefficient is defined in terms of moments, and hence will
be undefined if the moments are undefined. Measures of dependence based on quantiles are
always defined. Sample-based statistics intended to estimate population measures of dependence
may or may not have desirable statistical properties such as being unbiased, or asymptotically
consistent, based on the structure of the population from which the data were sampled.
Correlation matrices
The correlation matrix of n random variables X1, ..., Xn is the n × n matrix whose i,j entry is
corr(Xi, Xj). If the measures of correlation used are product-moment coefficients, the correlation
matrix is the same as the covariance matrix of the standardized random variables Xi /σ (Xi) for i
= 1, ..., n. This applies to both the matrix of population correlations (in which case "σ " is the
population standard deviation), and to the matrix of sample correlations (in which case "σ "
denotes the sample standard deviation). Consequently, each is necessarily a positive-semidefinite
matrix.
The correlation matrix is symmetric because the correlation between Xi and Xj is the same as the
correlation between Xj and Xi.
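A small numerical sketch of these properties, assuming Python with NumPy; the data are synthetic.

# The sample correlation matrix is symmetric, has a unit diagonal, and is
# positive semi-definite (all eigenvalues are non-negative).
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=(500, 4))            # 500 observations of X1, ..., X4
data[:, 1] += 0.7 * data[:, 0]              # introduce some correlation

R = np.corrcoef(data, rowvar=False)         # 4 x 4 correlation matrix
print(np.allclose(R, R.T))                  # True: symmetric
print(np.linalg.eigvalsh(R) >= -1e-10)      # all True: positive semi-definite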
Common misconceptions
Correlation and causality
Main article: Correlation does not imply causation

The conventional dictum that "correlation does not imply causation" means that correlation
cannot be used to infer a causal relationship between the variables.[10] This dictum should not be
taken to mean that correlations cannot indicate the potential existence of causal relations.
However, the causes underlying the correlation, if any, may be indirect and unknown, and high
correlations also overlap with identity relations, where no causal process exists. Consequently,
establishing a correlation between two variables is not a sufficient condition to establish a causal
relationship (in either direction). For example, one may observe a correlation between an
ordinary alarm clock ringing and daybreak, though there is no causal relationship between these
phenomena.
A correlation between age and height in children is fairly causally transparent, but a correlation
between mood and health in people is less so. Does improved mood lead to improved health; or
does good health lead to good mood; or both? Or does some other factor underlie both? In other
words, a correlation can be taken as evidence for a possible causal relationship, but cannot
indicate what the causal relationship, if any, might be.
Correlation and linearity

Four sets of data with the same correlation of 0.816

The Pearson correlation coefficient indicates the strength of a linear relationship between two
variables, but its value generally does not completely characterize their relationship. In
particular, if the conditional mean of Y given X, denoted E(Y|X), is not linear in X, the correlation
coefficient will not fully determine the form of E(Y|X).
The image on the right shows scatterplots of Anscombe's quartet, a set of four different pairs of
variables created by Francis Anscombe.[11] The four y variables have the same mean (7.5),
standard deviation (4.12), correlation (0.816) and regression line (y = 3 + 0.5x). However, as can
be seen on the plots, the distribution of the variables is very different. The first one (top left)
seems to be distributed normally, and corresponds to what one would expect when considering
two variables correlated and following the assumption of normality. The second one (top right) is
not distributed normally; while an obvious relationship between the two variables can be
observed, it is not linear. In this case the Pearson correlation coefficient does not indicate that
there is an exact functional relationship: only the extent to which that relationship can be
approximated by a linear relationship. In the third case (bottom left), the linear relationship is
perfect, except for one outlier which exerts enough influence to lower the correlation coefficient
from 1 to 0.816. Finally, the fourth example (bottom right) shows another example when one
outlier is enough to produce a high correlation coefficient, even though the relationship between
the two variables is not linear.
These examples indicate that the correlation coefficient, as a summary statistic, cannot replace
the individual examination of the data. Note that the examples are sometimes said to demonstrate
that the Pearson correlation assumes that the data follow a normal distribution, but this is not
correct.[12]
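These summary statistics are easy to verify. The sketch below assumes Python with the seaborn library, whose bundled sample loader for Anscombe's quartet is used purely for convenience (the values could equally be typed in from Anscombe's 1973 paper).

# Per-panel mean of y and Pearson correlation for Anscombe's quartet.
import seaborn as sns

df = sns.load_dataset("anscombe")           # columns: dataset, x, y
for name, group in df.groupby("dataset"):
    r = group["x"].corr(group["y"])         # Pearson correlation within one panel
    print(name, round(group["y"].mean(), 2), round(r, 3))   # means ≈ 7.5, r ≈ 0.816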
If a pair (X, Y) of random variables follows a bivariate normal distribution, the conditional mean
E(X|Y) is a linear function of Y, and the conditional mean E(Y|X) is a linear function of X. The
correlation coefficient r between X and Y, along with the marginal means and variances of X and
Y, determines this linear relationship:

E(Y | X) = EY + r · (σy / σx) · (X − EX),

where EX and EY are the expected values of X and Y, respectively, and σx and σy are the standard
deviations of X and Y, respectively.
Partial correlation
Main article: Partial correlation

If a population or data-set is characterized by more than two variables, a partial correlation
coefficient measures the strength of dependence between a pair of variables that is not accounted
for by the way in which they both change in response to variations in a selected subset of the
other variables.
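A common way to compute a partial correlation is to regress each of the two variables on the controlling variable(s) and then correlate the residuals. The following minimal sketch takes that approach, assuming Python with NumPy; the data are synthetic.

# Partial correlation of x and y controlling for z, via residuals of OLS fits on z.
import numpy as np

rng = np.random.default_rng(4)
z = rng.normal(size=5000)
x = z + rng.normal(scale=0.5, size=5000)
y = z + rng.normal(scale=0.5, size=5000)     # x and y are linked mainly through z

def residuals(v, z):
    # residuals of an ordinary least squares fit of v on [1, z]
    A = np.column_stack([np.ones_like(z), z])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

print(np.corrcoef(x, y)[0, 1])                               # sizeable marginal correlation
print(np.corrcoef(residuals(x, z), residuals(y, z))[0, 1])   # partial correlation near 0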

Regression analysis
From Wikipedia, the free encyclopedia


In statistics, regression analysis includes any techniques for modeling and analyzing several
variables, when the focus is on the relationship between a dependent variable and one or more
independent variables. More specifically, regression analysis helps us understand how the typical
value of the dependent variable changes when any one of the independent variables is varied,
while the other independent variables are held fixed. Most commonly, regression analysis
estimates the conditional expectation of the dependent variable given the independent variables
— that is, the average value of the dependent variable when the independent variables are held
fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional
distribution of the dependent variable given the independent variables. In all cases, the
estimation target is a function of the independent variables called the regression function. In
regression analysis, it is also of interest to characterize the variation of the dependent variable
around the regression function, which can be described by a probability distribution.
Regression analysis is widely used for prediction and forecasting, where its use has substantial
overlap with the field of machine learning. Regression analysis is also used to understand which
among the independent variables are related to the dependent variable, and to explore the forms
of these relationships. In restricted circumstances, regression analysis can be used to infer causal
relationships between the independent and dependent variables.
A large body of techniques for carrying out regression analysis has been developed. Familiar
methods such as linear regression and ordinary least squares regression are parametric, in that the
regression function is defined in terms of a finite number of unknown parameters that are
estimated from the data. Nonparametric regression refers to techniques that allow the regression
function to lie in a specified set of functions, which may be infinite-dimensional.
The performance of regression analysis methods in practice depends on the form of the data-
generating process, and how it relates to the regression approach being used. Since the true form
of the data-generating process is not known, regression analysis depends to some extent on
making assumptions about this process. These assumptions are sometimes (but not always)
testable if a large amount of data is available. Regression models for prediction are often useful
even when the assumptions are moderately violated, although they may not perform optimally.
However, in many applications, especially with small effects or questions of causality based on
observational data, regression methods can give misleading results.

Linear regression
From Wikipedia, the free encyclopedia


Example of linear regression with one independent variable.

In statistics, linear regression is an approach to modeling the relationship between a scalar
variable y and one or more explanatory variables denoted X. In linear regression, the data are
modeled using linear functions, and the unknown model parameters are estimated from the data.
Such models are called “linear models.” Most commonly, linear regression refers to a model in
which the conditional mean of y given the value of X is an affine function of X. Less commonly,
linear regression could refer to a model in which the median, or some other quantile, of the
conditional distribution of y given X is expressed as a linear function of X. Like all forms of
regression analysis, linear regression focuses on the conditional probability distribution of y given
X, rather than on the joint probability distribution of y and X, which is the domain of multivariate
analysis.
Linear regression was the first type of regression analysis to be studied rigorously, and to be used
extensively in practical applications. This is because models which depend linearly on their
unknown parameters are easier to fit than models which are non-linearly related to their
parameters and because the statistical properties of the resulting estimators are easier to
determine.
Linear regression has many practical uses. Most applications of linear regression fall into one of
the following two broad categories:
• If the goal is prediction, or forecasting, linear regression can be used to fit a
predictive model to an observed data set of y and X values. After developing
such a model, if an additional value of X is then given without its
accompanying value of y, the fitted model can be used to make a prediction
of the value of y.
• Given a variable y and a number of variables X1, ..., Xp that may be related to
y, linear regression analysis can be applied to quantify the strength of
the relationship between y and the Xj, to assess which Xj may have no
relationship with y at all, and to identify which subsets of the Xj contain
redundant information about y, so that once one of them is known, the others
are no longer informative.
Linear regression models are often fitted using the least squares approach, but they may also be
fitted in other ways, such as by minimizing the “lack of fit” in some other norm, or by
minimizing a penalized version of the least squares loss function as in ridge regression.
Conversely, the least squares approach can be used to fit models that are not linear models. Thus,
while the terms “least squares” and linear model are closely linked, they are not synonymous.


Introduction to linear regression


Given a data set {yi, xi1, ..., xip}, i = 1, ..., n, of n statistical units, a linear regression model assumes
that the relationship between the dependent variable yi and the p-vector of regressors xi is
approximately linear. This approximate relationship is modeled through a so-called “disturbance
term” εi, an unobserved random variable that adds noise to the linear relationship between the
dependent variable and regressors. Thus the model takes the form

yi = β1xi1 + ⋯ + βpxip + εi = xi′β + εi,  i = 1, ..., n,

where ′ denotes the transpose, so that xi′β is the inner product between vectors xi and β.
Often these n equations are stacked together and written in vector form as

y = Xβ + ε,

where y = (y1, ..., yn)′ is the vector of responses, X is the n × p design matrix whose ith row is xi′,
β = (β1, ..., βp)′ is the vector of coefficients, and ε = (ε1, ..., εn)′ is the vector of disturbances.
Some remarks on terminology and general use:

• yi is called the regressand, endogenous variable, response variable,
measured variable, or dependent variable (see dependent and independent
variables.) The decision as to which variable in a data set is modeled as the
dependent variable and which are modeled as the independent variables may
be based on a presumption that the value of one of the variables is caused
by, or directly influenced by the other variables. Alternatively, there may be
an operational reason to model one of the variables in terms of the others, in
which case there need be no presumption of causality.

• xi1, xi2, ..., xip are called regressors, exogenous variables, explanatory variables,
covariates, input variables, predictor variables, or independent variables (see
dependent and independent variables, but not to be confused with
independent random variables). The matrix X is sometimes called the design
matrix.
○ Usually a constant is included as one of the regressors. For example
we can take xi1 = 1 for i = 1, ..., n. The corresponding element of β is
called the intercept. Many statistical inference procedures for linear
models require an intercept to be present, so it is often included even
if theoretical considerations suggest that its value should be zero.
○ Sometimes one of the regressors can be a non-linear function of
another regressor or of the data, as in polynomial regression and
segmented regression. The model remains linear as long as it is linear
in the parameter vector β.
○ The regressors xi may be viewed either as random variables, which we
simply observe, or they can be considered as predetermined fixed
values which we can choose. Both interpretations may be appropriate
in different cases, and they generally lead to the same estimation
procedures; however different approaches to asymptotic analysis are
used in these two situations.

• β is a p-dimensional parameter vector. Its elements are also called effects, or
regression coefficients. Statistical estimation and inference in linear
regression focuses on β.

• εi is called the error term, disturbance term, or noise. This variable captures
all other factors which influence the dependent variable yi other than the
regressors xi. The relationship between the error term and the regressors, for
example whether they are correlated, is a crucial consideration in formulating a linear
regression model, as it will determine the method to use for estimation.
Example. Consider a situation where a small ball is being tossed up in the air and then we
measure its heights of ascent hi at various moments in time ti. Physics tells us that, ignoring the
drag, the relationship can be modeled as

hi = β1ti + β2ti² + εi,

where β1 determines the initial velocity of the ball, β2 is proportional to the standard gravity, and
εi is due to measurement errors. Linear regression can be used to estimate the values of β1 and β2
from the measured data. This model is non-linear in the time variable, but it is linear in the
parameters β1 and β2; if we take regressors xi = (xi1, xi2) = (ti, ti²), the model takes on the standard
form

hi = xi′β + εi.
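A minimal numerical sketch of this example, assuming Python with NumPy; the "measurements" are simulated with β1 = 10 and β2 = −4.9.

# The model h = b1*t + b2*t**2 + error is linear in the parameters, so it can be
# fitted by ordinary least squares on the regressors (t, t**2).
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0.1, 2.0, 20)
h = 10.0 * t - 4.9 * t**2 + rng.normal(scale=0.05, size=t.size)   # simulated heights

X = np.column_stack([t, t**2])               # regressors x_i = (t_i, t_i^2)
beta, *_ = np.linalg.lstsq(X, h, rcond=None)
print(beta)                                  # approximately [10.0, -4.9]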

Assumptions
Two key assumptions are common to all estimation methods used in linear regression analysis:
• The design matrix X must have full column rank p. For this property to hold,
we must have n > p, where n is the sample size (this is a necessary but not a
sufficient condition). If this condition fails, the regressors are said to be
multicollinear and the parameter vector β will not be identifiable; at most we
will be able to narrow down its value to some linear subspace of Rp. (A quick
numerical check of this condition is sketched after this list.)
Methods for fitting linear models with multicollinearity have been developed,[1][2][3][4]
but they require additional assumptions such as “effect sparsity”: that a
large fraction of the effects are exactly zero.
A simpler statement of this is that there must be enough data available
compared to the number of parameters to be estimated. If there is too little
data, then you end up with a system of equations with no unique solution.
See partial least squares regression.
• The regressors xi are assumed to be error-free, that is they are not
contaminated with measurement errors. Although not realistic in many
settings, dropping this assumption leads to significantly more difficult errors-
in-variables models.
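The quick numerical check mentioned in the first item above can be sketched as follows, assuming Python with NumPy; the design matrices are illustrative.

# Rank and condition number of the design matrix reveal exact or near multicollinearity.
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)

X_ok = np.column_stack([np.ones(50), x1, x2])
X_bad = np.column_stack([np.ones(50), x1, 2 * x1])   # third column duplicates the second

for X in (X_ok, X_bad):
    print(np.linalg.matrix_rank(X), np.linalg.cond(X))
# X_bad has rank 2 < p = 3 and an enormous condition number, so β is not identifiable.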
Beyond these two assumptions, several other statistical properties of the data strongly influence
the performance of different estimation methods:
• Some estimation methods are based on a lack of correlation among the n
observations. Statistical independence of the observations is not needed,
although it can be exploited if it is known to hold.
• The statistical relationship between the error terms and the regressors plays
an important role in determining whether an estimation procedure has
desirable sampling properties such as being unbiased and consistent.
• The variances of the error terms may be equal across the n units (termed
homoscedasticity) or not (termed heteroscedasticity). Some linear regression
estimation methods give less precise parameter estimates and misleading
inferential quantities such as standard errors when substantial
heteroscedasticity is present.
• The arrangement, or probability distribution of the predictor variables x has a
major influence on the precision of estimates of β. Sampling and design of
experiments are highly-developed subfields of statistics that provide
guidance for collecting data in such a way to achieve a precise estimate of β.

Interpretation
A fitted linear regression model can be used to identify the relationship between a single
predictor variable xj and the response variable y when all the other predictor variables in the
model are “held fixed”. Specifically, the interpretation of βj is the expected change in y for a one-
unit change in xj when the other covariates are held fixed. This is sometimes called the unique
effect of xj on y. In contrast, the marginal effect of xj on y can be assessed using a correlation
coefficient or simple linear regression model relating xj to y.
Care must be taken when interpreting regression results, as some of the regressors may not allow
for marginal changes (such as dummy variables, or the intercept term), while others cannot be
held fixed (recall the example from the introduction: it would be impossible to “hold ti fixed” and
at the same time change the value of ti²).
It is possible that the unique effect can be nearly zero even when the marginal effect is large.
This may imply that some other covariate captures all the information in xj, so that once that
variable is in the model, there is no contribution of xj to the variation in y. Conversely, the unique
effect of xj can be large while its marginal effect is nearly zero. This would happen if the other
covariates explained a great deal of the variation of y, but they mainly explain variation in a way
that is complementary to what is captured by xj. In this case, including the other variables in the
model reduces the part of the variability of y that is unrelated to xj, thereby strengthening the
apparent relationship with xj.
The meaning of the expression “held fixed” may depend on how the values of the predictor
variables arise. If the experimenter directly sets the values of the predictor variables according to
a study design, the comparisons of interest may literally correspond to comparisons among units
whose predictor variables have been “held fixed” by the experimenter. Alternatively, the
expression “held fixed” can refer to a selection that takes place in the context of data analysis. In
this case, we “hold a variable fixed” by restricting our attention to the subsets of the data that
happen to have a common value for the given predictor variable. This is the only interpretation
of “held fixed” that can be used in an observational study.
The notion of a “unique effect” is appealing when studying a complex system where multiple
interrelated components influence the response variable. In some cases, it can literally be
interpreted as the causal effect of an intervention that is linked to the value of a predictor
variable. However, it has been argued that in many cases multiple regression analysis fails to
clarify the relationships between the predictor variables and the response variable when the
predictors are correlated with each other and are not assigned following a study design.[5]
Estimation methods
Numerous procedures have been developed for parameter estimation and inference in linear
regression. These methods differ in computational simplicity of algorithms, presence of a closed-
form solution, robustness with respect to heavy-tailed distributions, and theoretical assumptions
needed to validate desirable statistical properties such as consistency and asymptotic efficiency.
Some of the more common estimation techniques for linear regression are summarized below.
• Ordinary least squares (OLS) is the simplest and thus very common
estimator. It is conceptually simple and computationally straightforward. OLS
estimates are commonly used to analyze both experimental and
observational data.
The OLS method minimizes the sum of squared residuals, and leads to a
closed-form expression for the estimated value of the unknown parameter β:

β̂ = (X′X)⁻¹X′y.

The estimator is unbiased and consistent if the errors have finite variance
and are uncorrelated with the regressors.[6] (A numerical sketch of an OLS fit
is given after this list of estimation methods.)

It is also efficient under the assumption that the errors have finite variance
and are homoscedastic, meaning that E[εi²|xi] does not depend on i. The
condition that the errors are uncorrelated with the regressors will generally
be satisfied in an experiment, but in the case of observational data, it is
difficult to exclude the possibility of an omitted covariate z that is related to
both the observed covariates and the response variable. The existence of
such a covariate will generally lead to a correlation between the regressors
and the response variable, and hence to an inconsistent estimator of β. The
condition of homoscedasticity can fail with either experimental or
observational data. If the goal is either inference or predictive modeling, the
performance of OLS estimates can be poor if multicollinearity is present,
unless the sample size is large.
In simple linear regression, where there is only one regressor (with a
constant), the OLS coefficient estimates have a simple form that is closely
related to the correlation coefficient between the covariate and the response.

• Generalized least squares (GLS) is an extension of the OLS method that
allows efficient estimation of β when either heteroscedasticity, or
correlations, or both are present among the error terms of the model, as long
correlations, or both are present among the error terms of the model, as long
as the form of heteroscedasticity and correlation is known independently of
the data. To handle heteroscedasticity when the error terms are uncorrelated
with each other, GLS minimizes a weighted analogue to the sum of squared
residuals from OLS regression, where the weight for the ith case is inversely
proportional to var(εi). This special case of GLS is called “weighted least
squares”. The GLS solution to the estimation problem is

β̂ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹y,

where Ω is the covariance matrix of the errors. GLS can be viewed as
applying a linear transformation to the data so that the assumptions of OLS
are met for the transformed data. For GLS to be applied, the covariance
structure of the errors must be known up to a multiplicative constant.

• Iteratively reweighted least squares (IRLS) is used when heteroscedasticity,
or correlations, or both are present among the error terms
of the model, but where little is known about the covariance structure of the
errors independently of the data[7]. In the first iteration, OLS, or GLS with a
provisional covariance structure is carried out, and the residuals are obtained
from the fit. Based on the residuals, an improved estimate of the covariance
structure of the errors can usually be obtained. A subsequent GLS iteration is
then performed using this estimate of the error structure to define the
weights. The process can be iterated to convergence, but in many cases, only
one iteration is sufficient to achieve an efficient estimate of β.[8][9]
• Instrumental variables regression (IV) can be performed when the
regressors are correlated with the errors. In this case, we need the existence
of some auxiliary instrumental variables zi such that E[ziεi] = 0. If Z is the
matrix of instruments, then the estimator can be given in closed form as

β̂ = (Z′X)⁻¹Z′y.
• Optimal instruments regression is an extension of classical IV regression to
the situation where E[εi|zi] = 0.
• Least absolute deviation (LAD) regression is a robust estimation technique
in that it is less sensitive to the presence of outliers than OLS (but is less
efficient than OLS when no outliers are present). It is equivalent to maximum
likelihood estimation under a Laplace distribution model for ε[10].
• Quantile regression focuses on the conditional quantiles of y given X rather
than the conditional mean of y given X. Linear quantile regression models a
particular conditional quantile, often the conditional median, as a linear
function β′x of the predictors.
• Maximum likelihood estimation can be performed when the distribution of
the error terms is known to belong to a certain parametric family ƒθ of
probability distributions[11]. When fθ is a normal distribution with mean zero
and variance θ, the resulting estimate is identical to the OLS estimate. GLS
estimates are maximum likelihood estimates when ε follows a multivariate
normal distribution with a known covariance matrix.
• Adaptive estimation. If we assume that the error terms are independent of
the regressors, the optimal estimator is the 2-step MLE, where the
first step is used to non-parametrically estimate the distribution of the error
term.[12]
• Mixed models are widely used to analyze linear regression relationships
involving dependent data when the dependencies have a known structure.
Common applications of mixed models include analysis of data involving
repeated measurements, such as longitudinal data, or data obtained from
cluster sampling. They are generally fit as parametric models, using
maximum likelihood or Bayesian estimation. In the case where the errors are
modeled as normal random variables, there is a close connection between
mixed models and generalized least squares[13]. Fixed effects estimation is an
alternative approach to analyzing this type of data.
• Principal component regression (PCR) [3][4] is used when the number of
predictor variables is large, or when strong correlations exist among the
predictor variables. This two-stage procedure first reduces the predictor
variables using principal component analysis then uses the reduced variables
in an OLS regression fit. While it often works well in practice, there is no
general theoretical reason that the most informative linear function of the
predictor variables should lie among the dominant principal components of
the multivariate distribution of the predictor variables. Partial least
squares regression is an extension of the PCR method that does not suffer
from this deficiency.
• Total least squares (TLS) [14] is an approach to least squares estimation of the
linear regression model that treats the covariates and response variable in a
more geometrically symmetric manner than OLS. It is one approach to
handling the "errors in variables" problem, and is sometimes used when the
covariates are assumed to be error-free..
• Ridge regression[15][16][17], and other forms of penalized estimation such as the
Lasso[1], deliberately introduce bias into the estimation of β in order to reduce
the variability of the estimate. The resulting estimators generally have lower
mean squared error than the OLS estimates, particularly when
multicollinearity is present. They are generally used when the goal is to
predict the value of the response variable y for values of the predictors x that
have not yet been observed. These methods are not as commonly used when
the goal is inference, since it is difficult to account for the bias.
• Least angle regression [2] is an estimation procedure for linear regression
models that was developed to handle high-dimensional covariate vectors,
potentially with more covariates than observations.
• Other robust estimation techniques, including the α-trimmed mean approach,
and L-, M-, S-, and R-estimators have been introduced.
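The numerical sketch referred to in the OLS item above, assuming Python with NumPy; the data-generating process is synthetic.

# Closed-form OLS, (X'X)^(-1) X'y, compared with a library least squares solver.
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)       # solves the normal equations
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_ols, beta_lstsq)                        # both close to beta_true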

Extensions
• General linear model considers the situation when the response variable y is
not a scalar but a vector. Conditional linearity of E(y|x) = Bx is still assumed,
with a matrix B replacing the vector β of the classical linear regression model.
Multivariate analogues of OLS and GLS have been developed.
• Generalized linear models are a framework for modeling a response variable
y in the form g(β′x) + ε, where g is an arbitrary link function. Single index
models allow some degree of nonlinearity in the relationship between x and
y, while preserving the central role of the linear predictor β′x as in the
classical linear regression model. Under certain conditions, simply applying
OLS to data from a single-index model will consistently estimate β up to a
proportionality constant [18].
• Hierarchical linear models (or multilevel regression) organize the data into a
hierarchy of regressions, for example where A is regressed on B, and B is
regressed on C. It is often used where the data have a natural hierarchical
structure such as in educational statistics, where students are nested in
classrooms, classrooms are nested in schools, and schools are nested in
some administrative grouping such as a school district. The response variable
might be a measure of student achievement such as a test score, and
different covariates would be collected at the classroom, school, and school
district levels.
• Errors-in-variables models (or “measurement error models”) extend the
traditional linear regression model to allow the predictor variables X to be
observed with error. This error causes standard estimators of β to become
biased. Generally, the form of bias is an attenuation, meaning that the effects
are biased toward zero.
• In Dempster–Shafer theory, or a linear belief function in particular, a linear
regression model may be represented as a partially swept matrix, which can
be combined with similar matrices representing observations and other
assumed normal distributions and state equations. The combination of swept
or unswept matrices provides an alternative method for estimating linear
regression models.

Applications of linear regression


Linear regression is widely used in biological, behavioral and social sciences to describe possible
relationships between variables. It ranks as one of the most important tools used in these
disciplines.
Trend line
For trend lines as used in technical analysis, see Trend lines (technical
analysis)

A trend line represents a trend, the long-term movement in time series data after other
components have been accounted for. It tells whether a particular data set (say GDP, oil prices or
stock prices) has increased or decreased over a period of time. A trend line could simply be
drawn by eye through a set of data points, but more properly its position and slope are
calculated using statistical techniques like linear regression. Trend lines typically are straight
lines, although some variations use higher-degree polynomials depending on the degree of
curvature desired in the line.
Trend lines are sometimes used in business analytics to show changes in data over time. This has
the advantage of being simple. Trend lines are often used to argue that a particular action or
event (such as training, or an advertising campaign) caused observed changes at a point in time.
This is a simple technique, and does not require a control group, experimental design, or a
sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases
where other potential changes can affect the data.
Epidemiology
Early evidence relating tobacco smoking to mortality and morbidity came from observational
studies employing regression analysis. In order to reduce spurious correlations when analyzing
observational data, researchers usually include several variables in their regression models in
addition to the variable of primary interest. For example, suppose we have a regression model in
which cigarette smoking is the independent variable of interest, and the dependent variable is
lifespan measured in years. Researchers might include socio-economic status as an additional
independent variable, to ensure that any observed effect of smoking on lifespan is not due to some
effect of education or income. However, it is never possible to include all possible confounding
variables in an empirical analysis. For example, a hypothetical gene might increase mortality and
also cause people to smoke more. For this reason, randomized controlled trials are often able to
generate more compelling evidence of causal relationships than can be obtained using regression
analyses of observational data. When controlled experiments are not feasible, variants of
regression analysis such as instrumental variables regression may be used to attempt to estimate
causal relationships from observational data.
Finance
The capital asset pricing model uses linear regression as well as the concept of Beta for
analyzing and quantifying the systematic risk of an investment. This comes directly from the
Beta coefficient of the linear regression model that relates the return on the investment to the
return on all risky assets.
Regression may not be the appropriate way to estimate beta in finance, given that beta is supposed
to provide the volatility of an investment relative to the volatility of the market as a whole. This
would require that both of these variables be treated in the same way when estimating the slope,
whereas regression treats all variability as being in the investment-return variable, i.e. it only
considers residuals in the dependent variable.[19]

Nonlinear regression
From Wikipedia, the free encyclopedia



See Michaelis-Menten kinetics for details

In statistics, nonlinear regression is a form of regression analysis in which observational data
are modeled by a function which is a nonlinear combination of the model parameters and
depends on one or more independent variables. The data are fitted by a method of successive
approximations.


General
The data consist of error-free independent variables (explanatory variable), x, and their
associated observed dependent variables (response variable), y. Each y is modeled as a random
variable with a mean given by a nonlinear function f(x,β). Systematic error may be present but its
treatment is outside the scope of regression analysis. If the independent variables are not error-
free, this is an errors-in-variables model, also outside this scope.
For example, the Michaelis–Menten model for enzyme kinetics
can be written as

v = β1[S] / (β2 + [S]),

where β1 is the parameter Vmax, β2 is the parameter Km and [S] is the independent variable, x.
This function is nonlinear because it cannot be expressed as a linear combination of the βs.
Other examples of nonlinear functions include exponential functions, logarithmic functions,
trigonometric functions, power functions, Gaussian function, and Lorentzian curves. Some
functions, such as the exponential or logarithmic functions, can be transformed so that they are
linear. When so transformed, standard linear regression can be performed but must be applied
with caution. See Linearization, below, for more details.
In general, there is no closed-form expression for the best-fitting parameters, as there is in linear
regression. Usually numerical optimization algorithms are applied to determine the best-fitting
parameters. Again in contrast to linear regression, there may be many local minima of the
function to be optimized and even the global minimum may produce a biased estimate. In
practice, estimated values of the parameters are used, in conjunction with the optimization
algorithm, to attempt to find the global minimum of a sum of squares.
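As an illustration of fitting by successive approximations, the sketch below applies nonlinear least squares to the Michaelis–Menten model from the text. It assumes Python with SciPy; the kinetic data and the starting values are invented.

# Fit v = Vmax*[S]/(Km + [S]) to simulated data with scipy.optimize.curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def michaelis_menten(S, Vmax, Km):
    return Vmax * S / (Km + S)

rng = np.random.default_rng(8)
S = np.array([0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0])
v = michaelis_menten(S, 2.0, 0.6) + rng.normal(scale=0.02, size=S.size)

params, cov = curve_fit(michaelis_menten, S, v, p0=[1.0, 1.0])   # p0: starting guess
print(params)                     # estimates of (Vmax, Km), approximately (2.0, 0.6)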
For details concerning nonlinear data modeling see least squares and non-linear least squares.
Regression statistics
The assumption underlying this procedure is that the model can be approximated by a linear
function,

f(xi, β) ≈ f⁰ + Σj Jij βj,

where Jij = ∂f(xi, β)/∂βj are the elements of the Jacobian J. It follows from this that the least
squares estimators are given approximately by

β̂ ≈ (J′J)⁻¹J′y.
The nonlinear regression statistics are computed and used as in linear regression statistics, but
using J in place of X in the formulas. The linear approximation introduces bias into the statistics.
Therefore more caution than usual is required in interpreting statistics derived from a nonlinear
model.
Ordinary and weighted least squares
The best-fit curve is often assumed to be that which minimizes the sum of squared residuals.
This is the (ordinary) least squares (OLS) approach. However, in cases where the dependent
variable does not have constant variance a sum of weighted squared residuals may be minimized;
see weighted least squares. Each weight should ideally be equal to the reciprocal of the variance
of the observation, but weights may be recomputed on each iteration, in an iteratively weighted
least squares algorithm.
Linearization
Transformation
Some nonlinear regression problems can be moved to a linear domain by a suitable
transformation of the model formulation.
For example, consider the nonlinear regression problem (ignoring the error):

y = a e^(bx).

If we take the logarithm of both sides, it becomes

ln(y) = ln(a) + bx,

suggesting estimation of the unknown parameters by a linear regression of ln(y) on x, a
computation that does not require iterative optimization. However, use of a nonlinear
transformation requires caution. The influences of the data values will change, as will the error
structure of the model and the interpretation of any inferential results. These may not be desired
effects. On the other hand, depending on what the largest source of error is, a nonlinear
transformation may distribute your errors in a normal fashion, so the choice to perform a
nonlinear transformation must be informed by modeling considerations.
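A minimal sketch of this transformation, assuming Python with NumPy and taking y = a·e^(bx) with multiplicative noise as the (invented) data-generating model.

# Fit y = a*exp(b*x) by an ordinary linear regression of ln(y) on x.
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0, 4, 40)
y = 1.5 * np.exp(0.8 * x) * np.exp(rng.normal(scale=0.05, size=x.size))

b, log_a = np.polyfit(x, np.log(y), 1)    # slope and intercept of ln(y) on x
print(np.exp(log_a), b)                   # approximately (1.5, 0.8)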
For Michaelis–Menten kinetics, the linear Lineweaver–Burk plot

1/v = 1/Vmax + (Km/Vmax)(1/[S])

of 1/v against 1/[S] has been much used. However, since it is very sensitive to data error and is
strongly biased toward fitting the data in a particular range of the independent variable, [S], its
use is strongly discouraged.
Segmentation
Yield of mustard and soil salinity

Main article: Segmented regression

The independent or explanatory variable (say X) can be split up into classes or segments and
linear regression can be performed per segment. Segmented regression with confidence analysis
may yield the result that the dependent or response variable (say Y) behaves differently in the
various segments [1].
The figure shows that the soil salinity (X) initially exerts no influence on the crop yield (Y) of
mustard (colza), but beyond the critical or threshold value (breakpoint) the yield is affected
negatively [2].
The figure was made with the SegReg program [3].
What Does Regression Mean?
A statistical measure that attempts to determine the strength of the relationship between one dependent variable (usually denoted
by Y) and a series of other changing variables (known as independent variables).

Investopedia explains Regression


The two basic types of regression are linear regression and multiple regression. Linear regression uses one independent variable to
explain and/or predict the outcome of Y, while multiple regression uses two or more independent variables to predict the
outcome. The general form of each type of regression is:

Linear Regression: Y = a + bX + u
Multiple Regression: Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

Where:
Y= the variable that we are trying to predict
X= the variable that we are using to predict Y
a= the intercept
b= the slope
u= the regression residual.

In multiple regression the separate variables are differentiated by using subscripted numbers.

Regression takes a group of random variables, thought to be predicting Y, and tries to find a mathematical relationship between
them. This relationship is typically in the form of a straight line (linear regression) that best approximates all the individual data
points. Regression is often used to determine how much specific factors such as the price of a commodity, interest rates, particular
industries or sectors influence the price movement of an asset.

Conjoint analysis
From Wikipedia, the free encyclopedia



See also: Conjoint analysis (in marketing), Conjoint analysis (in healthcare),
IDDEA, Rule Developing Experimentation.
Conjoint analysis, also called multi-attribute compositional models or stated preference
analysis, is a statistical technique that originated in mathematical psychology. Today it is used in
many of the social sciences and applied sciences including marketing, product management, and
operations research. It is not to be confused with the theory of conjoint measurement.
Methodology
Conjoint analysis requires research participants to make a series of trade-offs. Analysis of these
trade-offs will reveal the relative importance of component attributes. To improve the predictive
ability of this analysis, research participants should be grouped into similar segments based on
objectives, values and/or other factors.
The exercise can be administered to survey respondents in a number of different ways.
Traditionally it is administered as a ranking exercise and sometimes as a rating exercise (where
the respondent awards each trade-off scenario a score indicating appeal).
In more recent years it has become common practice to present the trade-offs as a choice
exercise (where the respondent simply chooses the most preferred alternative from a selection of
competing alternatives - particularly common when simulating consumer choices) or as a
constant sum allocation exercise (particularly common in pharmaceutical market research, where
physicians indicate likely shares of prescribing, and each alternative in the trade-off is the
description of a real or hypothetical therapy).
Analysis is traditionally carried out with some form of multiple regression, but more recently the
use of hierarchical Bayesian analysis has become widespread, enabling fairly robust statistical
models of individual respondent decision behaviour to be developed.
Example
A real estate developer is interested in building a high rise apartment complex near an urban Ivy
League university. To ensure the success of the project, a market research firm is hired to
conduct focus groups with current students. Students are segmented by academic year (freshman,
upper classmen, graduate studies) and amount of financial aid received.
Study participants are given a series of index cards. Each card has 6 attributes to describe the
potential building project (proximity to campus, cost, telecommunication packages, laundry
options, floor plans, and security features offered). The estimated cost to construct the building
described on each card is equivalent.
Participants are asked to order the cards from least to most appealing. This forced ranking
exercise will indirectly reveal the participants' priorities and preferences. Multi-variate regression
analysis may be used to determine the strength of preferences across target market segments.
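A highly simplified sketch of the analysis step, assuming Python with NumPy and pandas; the attributes, levels and preference scores are invented, and real studies use far richer designs (or hierarchical Bayesian estimation, as noted above).

# Recover part-worth utilities from dummy-coded profiles by least squares.
import numpy as np
import pandas as pd

profiles = pd.DataFrame({
    "proximity": ["near", "far", "near", "far"],
    "laundry":   ["in_unit", "in_unit", "shared", "shared"],
})
preference = np.array([4, 2, 3, 1])        # higher = more preferred (reversed rank)

X = pd.get_dummies(profiles, drop_first=True).astype(float)
X.insert(0, "intercept", 1.0)
part_worths, *_ = np.linalg.lstsq(X.to_numpy(), preference, rcond=None)
print(dict(zip(X.columns, np.round(part_worths, 2))))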


Canonical analysis
From Wikipedia, the free encyclopedia



In statistics, canonical analysis (from Greek κανών, “bar, measuring rod, ruler”) belongs to the family
of regression methods for data analysis. Regression analysis quantifies a relationship between a
predictor variable and a criterion variable by the coefficient of correlation r, coefficient of
determination r², and the standard regression coefficient β. Multiple regression analysis
expresses a relationship between a set of predictor variables and a single criterion variable by the
multiple correlation R, multiple coefficient of determination R², and a set of standard partial
regression weights β1, β2, etc. Canonical variate analysis captures a relationship between a set of
predictor variables and a set of criterion variables by the canonical correlations ρ1, ρ2, ..., and by
the sets of canonical weights C and D.


Canonical analysis


Canonical analysis belongs to a group of methods which involve solving the characteristic
equation for its latent roots and vectors. It describes formal structures in hyperspace invariant
with respect to the rotation of their coordinates. In this type of solution, rotation leaves many
optimizing properties preserved, provided it takes place in certain ways and in a subspace of its
corresponding hyperspace. This rotation from the maximum intervariate correlation structure
into a different, simpler and more meaningful structure increases the interpretability of the
canonical weights C and D. In this the canonical analysis differs from Harold Hotelling’s (1936)
canonical variate analysis (also called the canonical correlation analysis), designed to obtain
maximum (canonical) correlations between the predictor and criterion canonical variates. The
difference between the canonical variate analysis and canonical analysis is analogous to the
difference between the principal components analysis and factor analysis, each with its
characteristic set of communalities, eigenvalues and eigenvectors.
Canonical analysis (simple)
Canonical analysis is a multivariate technique which is concerned with determining the
relationships between groups of variables in a data set. The data set is split into two groups, let's
call these groups X and Y, based on some common characteristics. The purpose of Canonical
analysis is then to find the relationship between X and Y, i.e. can some form of X represent Y. It
works by finding the linear combination of X variables, i.e. X1, X2 etc., and linear combination of
Y variables, i.e. Y1, Y2 etc., which are most highly correlated. This combination is known as the
"first canonical variates" which are usually denoted U1 and V1, with the pair of U1 and V1 being
called a "canonical function". The next canonical function, U2 and V2 are then restricted so that
they are uncorrelated with U1 and V1. Everything is scaled so that the variance equals 1. One can
also construct relationships which are made to agree with constraint restrictions arising from
theory or to agree with common sense/intuition. These are called maximum correlation models.
(Tofallis, 1999)
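The description above can be sketched with the CCA estimator from scikit-learn (an assumption; the original text names no software). The example below builds two synthetic variable sets, extracts the first two pairs of canonical variates, and reports the canonical correlations between each pair.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(1)

# Two synthetic variable sets X and Y that share a common latent signal.
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.5 * rng.normal(size=(200, 1)) for _ in range(3)])
Y = np.hstack([latent + 0.5 * rng.normal(size=(200, 1)) for _ in range(2)])

# Extract the first two pairs of canonical variates (U_i, V_i).
cca = CCA(n_components=2)
U, V = cca.fit_transform(X, Y)

# The canonical correlations rho_1, rho_2 are the correlations
# between the paired canonical variates.
for i in range(2):
    rho = np.corrcoef(U[:, i], V[:, i])[0, 1]
    print(f"rho_{i + 1} = {rho:.3f}")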
Cluster analysis (in marketing)


From Wikipedia, the free encyclopedia

Cluster analysis is a class of statistical techniques that can be applied to data that exhibit "natural" groupings. Cluster analysis sorts through the raw data and groups the observations into clusters. A cluster is a group of relatively homogeneous cases or observations. Objects in a cluster are similar to each other and dissimilar to objects outside the cluster, particularly objects in other clusters.
The figure below illustrates the results of a survey that studied drinkers' perceptions of spirits (alcohol). Each point represents the results from one respondent. The research indicates that there are four clusters in this market.

[Figure: Illustration of clusters]

Another example is the vacation travel market. Recent research has identified three clusters or market segments: 1) the demanders, who want exceptional service and expect to be pampered; 2) the escapists, who want to get away and just relax; and 3) the educationalists, who want to see new things, go to museums, go on a safari, or experience new cultures.
Cluster analysis, like factor analysis and multidimensional scaling, is an interdependence technique: it makes no distinction between dependent and independent variables, and the entire set of interdependent relationships is examined. It is similar to multidimensional scaling in that both examine inter-object similarity by examining the complete set of interdependent relationships. The difference is that multidimensional scaling identifies underlying dimensions, while cluster analysis identifies clusters. Cluster analysis is the obverse of factor analysis: whereas factor analysis reduces the number of variables by grouping them into a smaller set of factors, cluster analysis reduces the number of observations or cases by grouping them into a smaller set of clusters.


In marketing, cluster analysis is used for

• Segmenting the market and determining target markets
• Product positioning and new product development
• Selecting test markets (see: experimental techniques)

Basic procedure


1. Formulate the problem - select the variables to which you wish to apply the clustering technique.
2. Select a distance measure - there are various ways of computing distance (a short computational sketch follows after this list):
   ○ Squared Euclidean distance - the sum of the squared differences in value for each variable (the ordinary Euclidean distance is its square root)
   ○ Manhattan distance - the sum of the absolute differences in value for each variable
   ○ Chebyshev distance - the maximum absolute difference in value over the variables
   ○ Mahalanobis distance - this measure takes the covariance (correlation) structure of the variables into account, so it is scale invariant (it can, figuratively, compare apples to oranges)
3. Select a clustering procedure (see below).
4. Decide on the number of clusters.
5. Map and interpret the clusters and draw conclusions - illustrative techniques such as perceptual maps, icicle plots, and dendrograms are useful.
6. Assess reliability and validity - various methods:
   ○ repeat the analysis using a different distance measure
   ○ repeat the analysis using a different clustering technique
   ○ split the data randomly into two halves and analyse each part separately
   ○ repeat the analysis several times, deleting one variable each time
   ○ repeat the analysis several times, using a different order of observations each time
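The distance measures in step 2 can be computed with standard library routines. The sketch below assumes SciPy and NumPy are available (the original text names no software); the two observations a and b and the sample used to estimate the covariance matrix are made up for illustration.

import numpy as np
from scipy.spatial import distance

# Two illustrative observations measured on three variables.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 5.0])

print("Euclidean:        ", distance.euclidean(a, b))
print("Squared Euclidean:", distance.sqeuclidean(a, b))
print("Manhattan:        ", distance.cityblock(a, b))
print("Chebyshev:        ", distance.chebyshev(a, b))

# The Mahalanobis distance needs the inverse covariance matrix of the data;
# here it is estimated from a small synthetic sample.
rng = np.random.default_rng(2)
sample = rng.normal(size=(50, 3))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print("Mahalanobis:      ", distance.mahalanobis(a, b, VI))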

Clustering procedures


There are several types of clustering methods (a brief computational sketch follows after this list):
• Non-hierarchical clustering (of which k-means clustering is the best-known example)
   ○ first determine the cluster centres, then group all objects that are within a certain distance of a centre
   ○ examples:
       Sequential Threshold method - first determine a cluster centre, then group all objects that are within a predetermined threshold of the centre; one cluster is created at a time
       Parallel Threshold method - several cluster centres are determined simultaneously, then objects within a predetermined threshold of a centre are grouped
       Optimizing Partitioning method - first a non-hierarchical procedure is run, then objects are reassigned so as to optimize an overall criterion
• Hierarchical clustering
   ○ objects are organized into a hierarchical structure as part of the procedure
   ○ examples:
       Divisive clustering - start by treating all objects as part of a single large cluster, then divide that cluster into smaller and smaller clusters
       Agglomerative clustering - start by treating each object as a separate cluster, then group the clusters into bigger and bigger clusters; examples:
          Centroid methods - clusters are joined on the basis of the distance between their centroids (a centroid is the mean value of all the objects in the cluster)
          Variance methods - clusters are generated so as to minimize the within-cluster variance; for example, Ward's procedure merges the clusters that minimize the total squared Euclidean distance of objects from their cluster means
          Linkage methods - clusters are joined on the basis of the distance between their objects; examples:
             Single linkage method - clusters are joined on the basis of the minimum distance between their objects (also called the nearest-neighbour rule)
             Complete linkage method - clusters are joined on the basis of the maximum distance between their objects (also called the furthest-neighbour rule)
             Average linkage method - clusters are joined on the basis of the average distance between all pairs of objects, one member of each pair coming from each cluster
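To show how the procedures above are run in practice, here is a minimal sketch assuming SciPy and scikit-learn (neither is mentioned in the original text). It applies k-means as the non-hierarchical method and agglomerative hierarchical clustering with single, complete, average and Ward linkage; the data are synthetic and the choice of three clusters is arbitrary.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Synthetic data with three loose groups of observations in two variables.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[4, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[2, 4], scale=0.5, size=(20, 2)),
])

# Non-hierarchical (k-means) clustering with k = 3 cluster centres.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print("k-means labels:", kmeans.labels_[:10])

# Agglomerative hierarchical clustering with different linkage rules.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(data, method=method)                 # build the cluster tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
    print(f"{method:>8} linkage labels:", labels[:10])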

Guidelines for report writing
1. The language should be simple, clear and unambiguous. Short sentences should be used as far as possible.
2. The phraseology should be adapted to suit the occasion. No technical terms or business phraseology should be used which are not likely to be understood by the person(s) for whom the report is intended.
3. In writing reports, negative statements should be avoided as far as possible.
4. Reports written by an individual should be written in the first person (I), but reports submitted by a committee or sub-committee must be written in an impersonal manner, i.e., in the third person.
5. The report should preferably be written in narrative form, setting out the facts, findings and recommendations in such a logical way that they present a coherent picture.
6. The data presented in support of the recommendations should be accurate, reliable and complete. They should be properly classified, tabulated and analysed so that they give a realistic and concrete reading of the problem under consideration.
7. The conclusions and recommendations should be based on factual data (not impressions) and should be unbiased, so that the recipient(s) can depend on them when deciding on a course of action.
8. The report should be as brief as possible in keeping with the purpose for which it is needed, but clarity should not be sacrificed for the sake of conciseness. The report should be to the point, using the minimum number of words and avoiding repetition and exaggeration. If the writer sticks to these qualities, the report will automatically remain concise.
