
An Assignment on

Measures of Validity and Reliability

Course Code: PAD 474


Course Title: Social Research- II

Submitted to
Md Mohammad Samiul Islam
Assistant Professor
Department of Public Administration
SUST, Sylhet

Submitted by

Name Registration No.


Jahirul Alom 2015237034
Arun Chandro Dey 2015237052
Antara Parvine Liza 2015237061
Hafizur Rahman 2015237062
Md. Abul Khair Dipu 2015237076
Md. Yeasin Sarkar 2015237081

Date of Submission: April 7, 2019

Shahjalal University of Science and Technology, Sylhet – 3114



Introduction
Data fuels research the way fuel drives an engine: no research can be imagined without pertinent data on the particular area of knowledge under investigation. Data is therefore a pivotal component, but a corollary question must arise in any discussion of research: whether the data collected on the concerned subject are "valid and reliable". Data without validity and reliability are like the "labor of a worker who receives stones as wages, with which nothing can be done". Likewise, data that are not valid and reliable will produce a meaningless result that provides no real knowledge about the subject of the research. It is therefore needless to say that measuring the 'validity and reliability' of data is a prerequisite and a must in research.

Measurement involves assigning scores to individuals so that they represent some characteristic
of the individuals. But how do researchers know that the scores actually represent the
characteristic, especially when it is a construct like intelligence, self-esteem, depression, or
working memory capacity? The answer is that they conduct research using the measure to
confirm that the scores make sense based on their understanding of the construct being
measured. This is an extremely important point. Psychologists do not simply assume that their
measures work. Instead, they collect data to demonstrate that they work. If their research does
not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to
be fitting more loosely, and several friends have asked if you have lost weight. If at this point
your bathroom scale indicated that you had lost 10 pounds, this would make sense and you
would continue to use the scale. But if it indicated that you had gained 10 pounds, you would
rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement
method, psychologists consider two general dimensions: reliability and validity.

Validity

What is validity?

Validity is the most important consideration in any research. Validity refers to whether or not
the investigation measures what it is supposed to measure.

Validity is the extent to which a test measures what it claims to measure.

In social science research, researchers are never completely certain that they are measuring the variable for which they designed their measurement procedure.



Validity is concerned with the question "Am I measuring what I intend to measure?" The problem of validity arises because measurement in the social sciences is, with very few exceptions, indirect. Under such circumstances, researchers are never completely certain that they are measuring the variable for which they designed the measurement procedure. (Quoted from: Nachmias C. F. and Nachmias D.)

Validity in qualitative research means "appropriateness" of the tools, processes, and data: whether the research question is valid for the desired outcome, the choice of methodology is appropriate for answering the research question, the design is valid for the methodology, the sampling and data analysis are appropriate, and finally whether the results and conclusions are valid for the sample and context (Leung, L.: 2015).

Joppe (2000) provides the following explanation of what validity is in quantitative research:

Validity determines whether the research truly measures that which it was intended to measure, or how truthful the research results are. In other words, does the research instrument allow you to hit "the bull's eye" of your research object? Researchers generally determine validity by asking a series of questions, and will often look for the answers in the research of others. [Quoted from Golafshani, N. (2003): Understanding Reliability and Validity in Qualitative Research. The Qualitative Report, 8(4): pp 597-606]

For example, does voter turnout truly measure political development? Is a respondent's agreement with the statement "This world is run by a few people in power, and there is not much the little guy can do about it" a genuine indicator of the presence of the variable "alienation"? To answer such questions, the researcher must provide supporting evidence that a measuring instrument does, in fact, measure what it appears to measure.

Types of validity

Validity can be distinguished into three basic kinds, each of which is connected with a different aspect of the measurement situation: content validity, empirical validity, and construct validity. Each includes several kinds of evidence and has special value under certain conditions.



Figure: Types of Validity. Content validity (face, sampling/logical), empirical validity, and construct validity (convergent, divergent).

Content Validity: Content validity is the estimate of how much a measure represents every
single element of a construct. It is the connection between test items and the subject related task.

For example, an educational test with strong content validity will represent the subjects actually taught to students, rather than asking unrelated questions.

Content validity is often seen as a prerequisite to criterion validity, because it is a good indicator
of whether the desired trait is measured. If elements of the test are irrelevant to the main
construct, then they are measuring something else completely, creating potential bias.

In addition, criterion validity derives quantitative correlations from test scores.

Content validity is qualitative in nature, and asks whether a specific element enhances or detracts from a test or research program.

Positivity

Strong assessment of validity: Because content validity requires subject-matter experts to evaluate the test content, the assessment becomes stronger.

Easier assessment of face validity: When content validity has been measured, face validity becomes easy to assess.

Ease in academic and vocational testing: In academic and vocational testing, test items need to reflect the knowledge actually required for a given topic area (e.g., history) or job skill (e.g., accounting), so content validity offers an easier way to test the items in such studies.

Threats

Biases

Cultural relativism

Level of expertise of the panelists

How is Content Validity Measured?

Content validity is related to face validity and sampling validity

• Face Validity: Face validity requires a personal judgment, such as asking participants whether they thought that a test was well constructed and useful. Content validity arrives at the same answers, but uses an approach based on statistics, ensuring that it is regarded as a strong type of validity.

For surveys and tests, each question is given to a panel of expert analysts, and they rate it. They give their opinion about whether the question is essential, useful, or irrelevant to measuring the construct under study. Their results are statistically analyzed and the test is modified to improve its validity.
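To make the panel arithmetic concrete, here is a minimal sketch of Lawshe's content validity ratio (the approach listed in the references); the function name and panel numbers are illustrative, not part of any standard library:

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe's content validity ratio for one item:
    CVR = (n_e - N/2) / (N/2).

    Ranges from -1 (no panelist rates the item 'essential')
    to +1 (every panelist rates it 'essential')."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel: 8 of 10 experts rate a question "essential".
print(content_validity_ratio(8, 10))  # prints 0.6
```

Items with low or negative ratios would be candidates for removal or revision.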

For example, IQ tests are supposed to measure intelligence. The test would be valid if it
accurately measured intelligence.

Face Validity

Positivity of Face Validity

Quickly eliminates shoddy research: For example, let's say a researcher's work indicated that drinking milk caused colon cancer. A peer review of the paper by a distinguished biochemist revealed several shortcomings in the design of the experiment. The biochemist rejected the paper on face validity, but was well qualified to do so. In essence, face validity is weaker for a layperson and stronger for an expert in the field.

Useful and accurate answers: If the respondent knows what information we are looking for, they can use that "context" to help interpret the questions and provide more useful, accurate answers.

Quick and easy to apply: Rather than having to investigate the underlying factors that determine whether a measure is robust, as you have to do when applying content validity or construct validity, it is easy and quick to come up with measures that are face valid. Often, you simply need to think about what measures (e.g., questions in a questionnaire) would make sense to you if you were taking part in the research (i.e., if you were being asked the question).

Negativity of Face Validity

Bending and shaping: If the respondent knows what information we are looking for, they might try to "bend and shape" their answers to what they think we want: to "fake good" or "fake bad".

Subjective judgement: Face validity is subjective (i.e., based on the subjective judgement of the researcher), and only provides the appearance that a measurement procedure is valid.

Not quantifiable: One cannot tell how well the measurement procedure measures what it is trying to measure, which is possible with other forms of validity (e.g., construct validity).

Sampling/Logical Validity:

Sampling validity (sometimes called logical validity) is concerned with how well the test covers
all of the areas you want it to cover.

For example, if you're doing an assessment of the effectiveness of math tutoring, you would want to include all areas of interest, including number comprehension, algebra skills, and problem solving. An assessment would be a poor overall measure if it just tested algebra skills.

It's usually not possible to cover every single area of interest with a single measure. Therefore, you'll want to choose the most important items in your assessment. One way to do this is to consult with experts in the field to judge how well the items are representative of the content. In the above example, you could consult with mathematics educators.

A similar measure is item validity. While sampling validity is focused on the breadth of items,
item validity focuses on their depth (Lodico & Spaulding, 2006).

Alternate Definition

Not all authors define sampling validity in the same way. According to Krippendorff (2012), sampling validity refers to how well a population is accurately represented by a sample. It can be measured by the following formula:



sampling validity = 1 − (σ / √N) · √((n − N) / (n − 1))

Where:

σ = the population standard deviation,

n = the population size,

N = the sample size.

The right section of the equation is the sampling error formula (which can be called sampling
invalidity), so the entire formula (1 – sampling error) is calculating sampling validity by
subtracting the invalid portion of the test from 100%. Perhaps the most important aspect of the
equation is that the larger the sample, the larger the sampling validity.
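As a minimal sketch, assuming the finite-population form of the sampling-error term reconstructed above (the exact expression in Krippendorff, 2012, may differ), the calculation can be written as:

```python
import math

def sampling_validity(sigma, population_size, sample_size):
    """1 - sampling error, using the standard error of the mean with a
    finite-population correction (an assumed form of the term above)."""
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    sampling_error = (sigma / math.sqrt(sample_size)) * fpc
    return 1 - sampling_error

# Larger samples shrink the error term, so sampling validity rises:
print(sampling_validity(sigma=2.0, population_size=1000, sample_size=50))   # ~0.72
print(sampling_validity(sigma=2.0, population_size=1000, sample_size=500))  # ~0.94
```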

Positivity of Sampling Validity

• Sampling validity serves an important function: it necessitates familiarity with all the items of the content population.
• It follows that sampling validity is especially useful in exploratory research where investigators attempt to construct instruments and employ them for the first time. After their initial use of the instrument, they can compare its validity with other tests.

Negativity of Sampling Validity

Difficult to measure: In sampling validity the quality of every individual item has to be measured, which is a difficult task, and the assessment is prone to faults. For example, to measure a student's mark, items such as the student's handwriting, attendance, response in class, and the teacher's attitude in marking would all have to be measured.

Empirical Validity:

The validity that is based on results of observation and experimentation.

Empirical validity is concerned with the relationship between a measuring instrument and the
measurement outcomes. If a measuring instrument is valid there should be a strong relation



between the result produced by applying the instrument and the real relationships existing
among the variables measured.

Scientists assume that if a measuring instrument is valid, there should be a strong relation between the results it produces and the real relationships among the variables measured.

For example, an educator might want to know if the scores obtained by the IQ test he or she was using really reflect the intelligence of the subject. Investigators gather evidence to support the existence of a relation by using measures of correlation appropriate to the level of measurement.

Positivity of Empirical Validity

Simplicity and accuracy: This method is the simplest and most accurate, from a scientific point of view, for estimating the correlation between test scores and criterion scores in the population.

Negativity

Practical Objections: This method is impractical, as the decisions are made without the test scores. In order for it to work, the selection of applicants must either be random or all of them must be accepted, since the population in the validity study must be similar to the general population of applicants.

Ethical Objections: An incorrect decision has negative consequences for both the
individual and the decision maker. An organization that hires a worker that is likely to
fail on the job is likely to incur losses in terms of training costs and productivity. A
worker who fails at the job may incur negative psychological side effects.

Time consuming: Often requires use of the survey for many years.

Construct Validity

It implies using the construct correctly (concepts, notions, ideas): agreement between a theoretical concept and a specific measuring device and procedure.

Anastasi and Urbina (1997) described construct validity as "the extent to which the test may be said to measure a theoretical construct or trait" [Quoted from Rubio, D. M. (2003): Objectifying content validity: Conducting a content validity study in social work research. Social Work Research, Vol 27: pp 94-104]

For example, a test of intelligence must include measures of multiple intelligences rather than just logical-mathematical measurement.

Construct validity can be viewed as an overarching term to assess the validity of the
measurement procedure (e.g., a questionnaire) that you use to measure a given construct
(e.g., depression, commitment, trust, etc.). This is because it incorporates a number of
other forms of validity (i.e., content validity, convergent and divergent validity, and
criterion validity) that help in the assessment of such construct validity (Messick, 1980).
In this sense, construct validity is a process that you work through, involving a number of
procedures (i.e., tests of validity, such as content validity, convergent validity, etc.) to
assess the validity of the measurement procedure that you use in your dissertation to
measure a given construct.

For example, let's imagine that we were interested in studying the construct, post-natal
depression. In order to do this, new mothers taking part in the research were asked (a) to
complete a 10-question survey (i.e., as a form of self-assessment) to assess various
characteristics of post-natal depression, and (b) to be observed (i.e., participant
observation) by trained psychiatric nurses, who used a scale to measure these different
characteristics of post-natal depression. When assessing the construct validity of these
two measurement procedures to measure the construct, post-natal depression, we would
want to know:

• Are the elements/questions used in the 10-question survey and the participant observation scale relevant and representative of the construct, post-natal depression, which they were supposed to be measuring? In terms of relevance, are the elements/questions appropriate considering the purpose of the study and the theory from which they are drawn? Furthermore, does the measurement procedure include all the necessary elements/questions? Is there an appropriate balance of elements, or are some over- or under-represented? This reflects the desire to assess the content validity of the measurement procedure [see the article: Content validity].
• Do the 10 questions and participant observation scale only measure the construct we are interested in (i.e., post-natal depression), and not one or more additional constructs, perhaps constructs such as post-partum mood, stress or anxiety? After all, when assessing the construct validity of a measurement procedure, we should not only check that the contents (i.e., elements) are relevant and representative of the construct we are interested in, but also that the measurement procedure is not measuring something that it should not be measuring. When this happens, the results can be confounded, which threatens the internal validity and external validity of your study. This reflects the desire to assess the divergent validity of the measurement procedure.
• Since the study used two different measurement procedures, how confident can we be that both measurement procedures were measuring the same construct (i.e., post-natal depression)? If both measurement procedures were new (i.e., you created them for your dissertation), we would want to assess their convergent validity, but if one was new (e.g., the 10-question survey) and the other was well-established (e.g., the participant observation scale), we would assess their concurrent validity [see the articles: Convergent and divergent validity and Criterion validity (concurrent and predictive validity)].
• Do the scores from the two measurement procedures used make accurate predictions (i.e., both theoretically and logically) about the construct they represent (i.e., post-natal depression)? This reflects the desire to assess the predictive validity of the measurement procedure [see the article: Criterion validity (concurrent and predictive validity)].

Ultimately, for construct validity to exist, there needs to be (a) a clear link between the
construct you are interested in and the measures and interventions that are used to
operationalize it (i.e., measure it), and (b) a clear distinction between different constructs
(Cronbach and Meehl, 1955; Nunnally, 1978). This involves creating clear and precise
conceptual and operational definitions of the constructs you are interested in [see the
section on Constructs in quantitative research], as well as performing various tests of
validity.

Construct validity has two components: convergent and discriminant validity.

A. Discriminant Validity

Discriminant validity is the extent to which latent variable A discriminates from other latent
variables (e.g., B, C, D). Discriminant validity means that a latent variable is able to account for
more variance in the observed variables associated with it than a) measurement error or similar
external, unmeasured influences; or b) other constructs within the conceptual framework. If this
is not the case, then the validity of the individual indicators and of the construct is questionable
(Fornell and Larcker, 1981). In brief, discriminant validity (or divergent validity) tests that constructs that should have no relationship do not, in fact, have any relationship.

B. Convergent Validity

Convergent validity, a parameter often used in sociology, psychology, and other behavioural
sciences, refers to the degree to which two measures of constructs that theoretically should be
related, are in fact related. In brief, Convergent validity tests that constructs that are expected to
be related are, in fact, related.



Positivity of Construct Validity

Best overall assessment of validity: When construct validity is assessed, assessing content validity, face validity, and empirical validity becomes easier. A construct includes the content, and the instrument in a validity study is ultimately designed to assess the construct. Since assessing the soundness of the instrument is face validity, assessing construct validity reveals how easily both content and face validity can be assessed.

Threats to Construct Validity Measurement

1. Hypothesis Guessing: Most often, in social research on human behavior, if the people under study can guess the intention of the researcher, they may change their natural behavior. For example, if a teacher intends to study the concentration level in a class and the students can guess the teacher's purpose, the students will behave artificially.

2. Evaluation Apprehension: Humans behave differently according to their state of mind. When people are under anxiety or depression, they behave differently.

3. Researcher Expectancies and Bias: Researchers are only human and may give cues that influence the behavior of the subject. Humans give cues through body language; subconsciously smiling when the subject gives a correct answer, or frowning at an undesirable response, all have an effect. This effect can lower construct validity by clouding the effect of the actual research variable.

4. Poor Construct Definition : Construct validity is all about semantics and labeling.
Defining a construct in too broad or too narrow terms can invalidate the entire experiment. For
example, a researcher might try to use job satisfaction to define overall happiness. This is too
narrow, as somebody may love their job but have an unhappy life outside the workplace.
Equally, using general happiness to measure happiness at work is too broad. Many people enjoy
life but still hate their work.

5. Construct Confounding: This threat to construct validity occurs when other constructs mask the effects of the measured construct. For example, self-esteem is affected by self-confidence and self-worth. The effect of these constructs needs to be incorporated into the research.



6. Interaction of Different Treatments : This particular threat is where more than one
treatment influences the final outcome. For example, a researcher tests an intensive counseling
program as a way of helping smokers give up cigarettes. At the end of the study, the results show
that 64% of the subjects successfully gave up. Sadly, the researcher then finds that some of the
subjects also used nicotine patches and gum, or electronic cigarettes. The construct validity is
now too low for the results to have any meaning. Only good planning and monitoring of the
subjects can prevent this.

Besides these types, internal validity and external validity also need to be considered.

Internal validity is the extent to which a piece of evidence supports a claim about cause and
effect, within the context of a particular study. It is one of the most important properties of
scientific studies, and is an important concept in reasoning about evidence more generally.
Internal validity is determined by how well a study can rule out alternative explanations for its
findings (usually, sources of systematic error or 'bias'). It contrasts with external validity, the
extent to which results can justify conclusions about other contexts (that is, the extent to which
results can be generalized).

Positivity of Internal Validity

Confidence: High internal validity lets one choose one explanation over another with a lot of confidence, because it avoids (many possible) confounds.

Determines cause-effect relations: Internal validity is concerned with conclusions about whether (and to what degree) the independent variable, as manipulated, makes a difference in the dependent variable, as measured.

Negativity

Selection bias: during the selection step of the research study, if an unequal number of test
subjects have similar subject-related variables there is a threat to the internal validity. For
example, a researcher created two test groups, the experimental and the control groups. The
subjects in both groups are not alike with regard to the independent variable but similar in one or
more of the subject-related variables.

Repeated testing (also referred to as testing effects): repeatedly measuring the participants
may lead to bias. Participants may remember the correct answers or may be conditioned to know
that they are being tested. Repeatedly taking (the same or similar) intelligence tests usually leads



to score gains, but instead of concluding that the underlying skills have changed for good, this threat to internal validity provides a good rival hypothesis.

Instrument change (instrumentality): the instrument used during the testing process can
change the experiment. This also refers to observers being more concentrated or primed, or
having unconsciously changed the criteria they use to make judgments. This can also be an issue
with self-report measures given at different times. In this case the impact may be mitigated
through the use of retrospective pretesting. If any instrumentation changes occur, the internal
validity of the main conclusion is affected, as alternative explanations are readily available.

External validity is the validity of applying the conclusions of a scientific study outside the context of that study. In other words, it is the extent to which the results of a study can be generalized to and across other situations, people, stimuli, and times. In contrast, internal validity is the validity of conclusions drawn within the context of a particular study. Because general conclusions are almost always a goal in research, external validity is an important property of any study. Mathematical analysis of external validity concerns a determination of whether generalization across heterogeneous populations is feasible, and devising statistical and computational methods that produce valid generalizations.

Examples

If age is judged to be a major factor causing the treatment effect to vary from individual to individual, then age differences between the sampled students and the general population would lead to a biased estimate of the average treatment effect in that population. Such bias can be corrected by a simple re-weighting procedure: we take the age-specific effect in the student subpopulation and compute its average using the age distribution in the general population. This would give us an unbiased estimate of the average treatment effect in the population. If, on the other hand, the relevant factor that distinguishes the study sample from the general population is itself affected by the treatment, then a different re-weighting scheme needs to be invoked. Calling this factor Z, we again average the z-specific effect of X on Y in the experimental sample, but now we weight it by the "causal effect" of X on Z. In other words, the new weight is the proportion of units attaining level Z = z had treatment X = x been administered to the entire population. This interventional probability, often written P(Z = z | do(X = x)), can sometimes be estimated from observational studies in the general population.
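A small sketch of the age re-weighting just described, with invented numbers: the age-specific effects estimated in the study sample are averaged using the age distribution of the general population.

```python
# Post-stratification re-weighting: average the age-specific treatment
# effects from the study sample using the age distribution of the
# general population. All numbers below are hypothetical.
effect_by_age = {"18-29": 4.0, "30-49": 2.5, "50+": 1.0}     # from the study sample
population_share = {"18-29": 0.2, "30-49": 0.4, "50+": 0.4}  # from census data

average_effect = sum(effect_by_age[g] * population_share[g] for g in effect_by_age)
print(average_effect)  # 2.2, the re-weighted average treatment effect
```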

A typical example of this nature occurs when Z is a mediator between the treatment and outcome. For instance, the treatment may be a cholesterol-reducing drug, Z may be cholesterol level, and Y life expectancy. Here, Z is both affected by the treatment and a major factor in determining the outcome, Y. Suppose that subjects selected for the experimental study tend to have higher cholesterol levels than is typical in the general population. To estimate the average effect of the drug on survival in the entire population, we first compute the z-specific treatment effect in the experimental study, and then average it using P(Z = z | do(X = x)) as a weighting function. The estimate obtained will be bias-free even when Z and Y are confounded, that is, when there is an unmeasured common factor that affects both Z and Y.

Positivity of External Validity

Ensures research applicability: External validity determines whether a study's findings can be applied in other settings of the same concerned subject.

Reduces time for further research: If a study is found externally valid, then further research on the same subject will be easier to conduct by absorbing its ideas.

Negativity of External Validity

Aptitude–treatment interaction: The sample may have certain features that interact with the independent variable, limiting generalizability. For example, comparative psychotherapy studies often employ specific samples (e.g. volunteers, highly depressed, no comorbidity). If psychotherapy is found effective for these sample patients, will it also be effective for non-volunteers, or the mildly depressed, or patients with concurrent other disorders? If not, the external validity of the study would be limited.

Situation: All situational specifics (e.g. treatment conditions, time, location, lighting, noise,
treatment administration, investigator, timing, scope and extent of measurement, etc.) of a study
potentially limit generalizability.

Pre-test effects: If cause-effect relationships can only be found when pre-tests are carried out,
then this also limits the generality of the findings.

Post-test effects: If cause-effect relationships can only be found when post-tests are carried out,
then this also limits the generality of the findings.

Note that a study's external validity is limited by its internal validity. If a causal inference made
within a study is invalid, then generalizations of that inference to other contexts will also be
invalid.

Reliability



Reliability is of central concern to social scientists because the measuring instruments they employ are rarely completely valid. In many cases, evidence of validity is almost entirely lacking; instead, the researcher has to evaluate the measuring instrument with respect to other characteristics and assume its validity. A method frequently used by social scientists for evaluating an instrument is its degree of reliability.

For example, suppose that a researcher is interested in studying the attitude towards democracy of a number of newspapers. To measure this attitude he can follow a number of procedures. One of these procedures may be that he reads the editorials of all the newspapers for a specified number of days, and on the basis of his judgement rank-orders the newspapers according to the degree of pro-democratic attitude they possess. This strategy has problems of reliability inherent in it: if several evaluators read the same editorials, they may draw conclusions different from each other; a newspaper that appears pro-democratic to one evaluator may not be so to another. Thus, this procedure of measuring attitude towards democracy (reading newspaper editorials), when applied repeatedly, may produce different results.

Reliability estimates are used to evaluate (1) the stability of measures administered at different times to the same individuals or using the same standard (test-retest reliability) or (2) the equivalence of sets of items from the same test (internal consistency) or of different observers scoring a behavior or event using the same instrument (interrater reliability) (Kimberlin, C. L. and Winterstein, A. G. (2008): Validity and reliability of measurement instruments used in research. American Journal of Health-System Pharmacy, Vol 65, Issue 23: pp 2276-2284).

Joppe (2000) defines reliability as:

…The extent to which results are consistent over time and an accurate representation of the total
population under study is referred to as reliability and if the results of a study can be reproduced
under a similar methodology, then the research instrument is considered to be reliable. [Quoted from Golafshani, N. (2003): Understanding Reliability and Validity in Qualitative Research. The Qualitative Report, 8(4): pp 597-606]

Types of Reliability



Figure: Techniques of Estimating Reliability. Test-retest method, parallel forms reliability, split-half method, inter-rater reliability, and inter-item reliability.


Test-retest reliability: The reliability coefficient obtained by repeating the same measure a second time is called test-retest reliability [Graziano and Raulin, 2006]. It assesses the external consistency of a test [Allen & Yen, 1979]. If the reliability coefficient is high, for example r = 0.98, we can suggest that both instruments are relatively free of measurement errors. Coefficients above 0.7 are considered acceptable, and coefficients above 0.8 are considered very good [Sim & Wright, 2005; Madan & Kensinger, 2017].

Test-retest reliability indicates the score variation that occurs from testing session to testing session as a result of errors of measurement. It is a measure of reliability obtained by administering the same test twice, over a period ranging from a few weeks to months, to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. For example, employees of a company may be asked to complete the same questionnaire about employee job satisfaction two times, with an interval of three months, so that test results can be compared to assess the stability of scores. A correlation coefficient is calculated between the two sets of data; the higher it is, the better the test-retest reliability. The interval between the two tests should not be very long, because the status of the company may change during the second test, which affects the reliability of the research [Bland & Altman, 1986].



The test-retest reliability method is one of the simplest ways of testing the stability and reliability of an instrument over time.

For example, if a group of students takes a test, you would expect them to show very similar
results if they take the same test a few months later. This definition relies upon there being no
confounding factor during the intervening time interval.

Instruments such as IQ tests and surveys are prime candidates for test-retest methodology,
because there is little chance of people experiencing a sudden jump in IQ or suddenly changing
their opinions.

On the other hand, educational tests are often not suitable, because students will learn much
more information over the intervening period and show better results in the second test.
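As a minimal sketch with invented scores, the test-retest estimate for the employee questionnaire example above is simply the Pearson correlation between the two administrations:

```python
import numpy as np

# Hypothetical job-satisfaction scores for six employees on the same
# questionnaire, administered three months apart.
time1 = np.array([72, 65, 88, 54, 79, 61])
time2 = np.array([70, 68, 85, 57, 80, 59])

r = np.corrcoef(time1, time2)[0, 1]  # Pearson r between the two administrations
print(round(r, 3))  # a value above 0.8 would be considered very good
```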

Positivity

• An advantage of test-retest reliability is that one can determine the extent to which items or measures are replicable or consistent over time
• Provides mathematical assessment
• Statistics are simple

Negativity

• Must select measures that do not change much over time
• Practice effect
• Anonymity
• Must give survey twice

Parallel Forms Reliability:

Parallel forms reliability (also called equivalent forms reliability) uses one set of questions divided into two equivalent sets ("forms"), where both sets contain questions that measure the same construct, knowledge or skill. The two sets of questions are given to the same sample of people within a short period of time, and an estimate of reliability is calculated from the two sets.

Put simply, you're trying to find out if test A measures the same thing as test B. In other words, you want to know if test scores stay the same when you use different instruments.



Example: you want to find the reliability for a test of mathematics comprehension, so you create
a set of 100 questions that measure that construct. You randomly split the questions into two sets
of 50 (set A and set B), and administer those questions to the same group of students a week
apart.

Steps:

Step 1: Give test A to a group of 50 students on a Monday.

Step 2: Give test B to the same group of students that Friday.

Step 3: Correlate the scores from test A and test B.

In order to call the forms "parallel", the observed scores must have the same means and variances. If the tests are merely different versions (without the "sameness" of observed scores), they are called alternate forms.
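A short sketch of the three steps with invented scores; comparing means and variances checks the "parallel" requirement, and the correlation gives the reliability estimate:

```python
import numpy as np

# Hypothetical scores: ten students sit form A on Monday and form B on Friday.
form_a = np.array([41, 35, 47, 29, 38, 44, 31, 40, 36, 45])
form_b = np.array([39, 37, 46, 31, 36, 45, 30, 42, 35, 44])

# Parallel forms should show similar means and variances...
print(form_a.mean(), form_b.mean())
print(form_a.var(ddof=1), form_b.var(ddof=1))

# ...and the reliability estimate is their correlation.
print(np.corrcoef(form_a, form_b)[0, 1])
```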

Similarity to Split-Half Reliability

Parallel forms and split-half reliability are similar, but with parallel forms, the same students take
test A and then take test B. With split-half reliability, one group of students is split into two and
both groups sit the test at the same time. The two tests in parallel forms reliability are equivalent
and can be used independently of each other.

Positivity

• Parallel forms reliability can avoid some problems inherent in test-retesting.

Negativity

• One has to create a large number of questions that measure the same construct. Proving that the two test versions are equivalent (parallel) can be a challenge.

Split-Half Reliability:

In split-half reliability, a test for a single knowledge area is split into two parts, and then both parts are given to one group of students at the same time. The scores from both parts of the test are



correlated. A reliable test will have high correlation, indicating that a student would perform
equally well (or as poorly) on both halves of the test.

Split-half testing is a measure of internal consistency: how well the test components contribute to the construct that's being measured. It is most commonly used for multiple choice tests, but you can theoretically use it for any type of test, even tests with essay questions.

Steps

Administer the test to a large group of students (ideally, over about 30).

Randomly divide the test questions into two parts. For example, separate even questions from
odd questions.

Score each half of the test for each student.

Find the correlation coefficient (e.g., Pearson's correlation coefficient) for the two halves.
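A sketch of these steps on simulated data; the Spearman-Brown step-up at the end is a standard adjustment for the halved test length (an addition not described in the steps above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 0/1 answers: each student's ability drives all 100 items,
# so the two halves of the test should correlate highly.
ability = rng.normal(size=40)                        # 40 students
noise = rng.normal(size=(40, 100))                   # item-level noise
items = (ability[:, None] + noise > 0).astype(int)   # correct/incorrect

odd_half = items[:, 0::2].sum(axis=1)    # score on odd-numbered questions
even_half = items[:, 1::2].sum(axis=1)   # score on even-numbered questions

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown step-up to full length
print(round(r_half, 3), round(r_full, 3))
```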

Drawbacks

One drawback with this method is that it only works for a large set of questions (a 100-point test is recommended) which all measure the same construct or area of knowledge. For example, a personality inventory test that measures introversion, extroversion, depression and a variety of other personality traits is not a good candidate for split-half testing.

Difference with Parallel Forms

Split half-reliability is similar to parallel forms reliability, which uses one set of questions
divided into two equivalent sets. The sets are given to the same students, usually within a short
time frame, like one set of test questions on Monday and another set on Friday. With split-half
reliability, the two tests are given to one group of students who sit the test at the same time.
Another difference: the two tests in parallel forms reliability are equivalent and are independent
of each other. This is not true with split-half reliability; the two sets do not have to be equivalent
("parallel").

Inter-rater Reliability:

In statistics, inter-rater reliability (also called by various similar names, such as inter-rater
agreement, inter-rater concordance, inter-observer reliability, and so on) is the degree of



agreement among raters. It is a score of how much homogeneity, or consensus, there is in the
ratings given by various judges. In contrast, intra-rater reliability is a score of the consistency in
ratings given by the same person across multiple instances. Inter-rater and intra-rater reliability
are aspects of test validity. Assessments of them are useful in refining the tools given to human
judges, for example by determining if a particular scale is appropriate for measuring a particular
variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.

Inter-rater reliability can be considered a subset or specific instance of reliability where the
source of inconsistency is not captured by differences in test forms, test items, or administration
occasions. Instead, inconsistency is captured by the scoring process itself, where humans, or in
some instances computers, evaluate the performance, response, or behavior of the object of
measurement. Interrater reliability refers more specifically to consistency of measurement that
involves raters.

Note that rater error is another factor or facet in the measurement process. Because it is another
facet of measurement, raters can introduce additional measurement error, above and beyond error
coming from sampling of items, differences in test forms, or the passage of time between
administrations. This is made explicit within generalizability theory, discussed below.

For example

As mentioned above, tests that require judges or raters are most often some type of performance
assessment, where the focus on the assessment is on processes and products. Performance
assessments require individuals to generate a response, demonstrate a skill, or perform a task.
The key feature of a performance assessment is the requirement that an individual do something,
i.e., perform in some way, to obtain a score.

There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are: joint probability of agreement, Cohen's kappa, Scott's pi and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient, intra-class correlation, and Krippendorff's alpha.
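As one concrete instance from the list above, Cohen's kappa for two raters can be computed directly from the ratings; the data below are invented:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for the
    agreement expected by chance alone."""
    categories = sorted(set(rater_a) | set(rater_b))
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1 - expected)

# Two judges scoring ten performances as pass/fail (hypothetical ratings).
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
print(cohens_kappa(a, b))  # 0.6: observed agreement 0.8 vs. 0.5 expected by chance
```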

Positivity

Evaluating interviews: An interview taken by one observer can be compared with that of another observer who conducted an interview on the same area of knowledge. This makes it easier to see whether there was bias in the interview results, and the degree of that bias.

Useful in refining tools: It is useful for refining the tools given to human judges, for example by determining whether a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.

Negativity



Time consuming: Conducting observation of the same particular subject by multiple raters consumes more time.

Faulty comparison: The observations of different observers are not free from subjectivity. If one rater's quality of assessment is lower than the other's, the inter-rater comparison will not be reliable.

Inter-item consistency reliability: Inter-item reliability is important for measurements that
consist of more than one item. Inter-item reliability refers to the extent of consistency between
multiple items measuring the same construct. Personality questionnaires for example often
consist of multiple items that tell you something about the extraversion or confidence of
participants. These items are summed up to a total score. When researchers sum up the answers
of participants to receive a single score, they have to be certain that all items measure the same
construct (for example extraversion). To check to what extent items are in accordance with each
other, the item-total correlation can be computed for each combination of items. This is the
correlation between an item and the rest of all items combined. Each item on the measurement
instrument should correlate with the remaining items. An item-total correlation of .30 or higher
per item is considered to be sufficient.

This is a test of the consistency of respondents' answers to all the items in a measure. To the degree that items are independent measures of the same concept, they will be correlated with one another.

The most popular test of inter-item consistency reliability is Cronbach's alpha.
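A minimal sketch of Cronbach's alpha on an invented response matrix (rows = respondents, columns = items); each item's item-total correlation could be checked against the 0.30 threshold mentioned above in the same way:

```python
import numpy as np

def cronbachs_alpha(items):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum(item variances) /
    variance(total score)); rows are respondents, columns are items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 5-point ratings: five respondents x four extraversion items.
responses = np.array([[4, 5, 4, 4],
                      [2, 2, 3, 2],
                      [5, 4, 5, 5],
                      [3, 3, 2, 3],
                      [4, 4, 4, 5]])
print(round(cronbachs_alpha(responses), 3))  # close to 1: highly consistent items
```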



For example:

Figure: Inter-item reliability. A construct such as poverty is measured through several inter-related items (salary, income, expenditure), and the consistency among these items is assessed.

Relationship between reliability and Validity

Reliability and validity are the two most important properties that test scores can have. They are
often mentioned together, but they give us different kinds of information.

• Reliability tells us how consistently the test scores measure something.

• Validity tells whether the test scores are measuring the right things for a particular use of the
test.

The following figure gives an analogy that illustrates this difference. Using a test to measure a test taker's proficiency in a particular set of knowledge or skills is like shooting at a target, with each shot representing one administration of the test.

Figure: Relationship between validity and reliability



The concepts of reliability and validity are similar in some important ways and different in
others. Reliability refers to a particular source of inconsistency in the scores (or possibly more
than one). Validity refers to a particular use of the test. A test can have higher reliability in one
group of test takers than in another group; it can also have higher validity in one group of test
takers than in another group. But its validity will depend on how the scores are being used. Its
reliability will not.

Sometimes people ask, "Which is more important, reliability or validity?" There are two simple ways to answer this question:

• Simple answer #1: Validity is more important. If you are measuring the wrong thing, it doesn't matter how well you measure it.

• Simple answer #2: Reliability is more important. If the scores depend heavily on chance, you are not measuring anything.

These two answers make it clear that there is really no way to decide whether reliability or validity is more important. Test scores cannot be valid for any purpose unless they are reliable.

So it can be said that all valid data are reliable, but not all reliable data are valid.

Conclusion

In this paper, we have tried to show that the reliability and validity of instrumentation are important considerations for researchers in their investigations. To perform good research, validity and reliability tests need to be undertaken very carefully. We have highlighted the research errors that arise in measurements. In the study we have observed that a valid tool must be reliable, but a reliable tool may not necessarily be valid. We have also discussed the threats to reliability and validity that arise when a researcher tries to do good research.

References



Brown, D. J. (2000): What is construct validity? Shiken: JALT Testing & Evaluation SIG Newsletter, 4(2): pp 8-12

Golafshani, N. (2003): Understanding Reliability and Validity in Qualitative Research. The Qualitative Report, 8(4): pp 597-606

Kimberlin, C. L. and Winterstein, A. G. (2008): Validity and reliability of measurement instruments used in research. American Journal of Health-System Pharmacy, Vol 65, Issue 23: pp 2276-2284

Kluwer, W. (2015): Principles and Methods of Validity and Reliability Testing of Questionnaires Used in Social and Health Science Researches. Nigerian Postgraduate Medical Journal, Vol 22: pp 195-201

Lawshe, C. H. (1975): A Quantitative Approach to Content Validity. Personnel Psychology, Vol 28: pp 563-575

Lange, R. T. (2011): Inter-rater Reliability. Encyclopedia of Clinical Neuropsychology, p 1348

Livingston, S. A. (2018): Test Reliability—Basic Concepts (Research Memorandum No. RM-18-01). Princeton, NJ: Educational Testing Service, pp 1-38

McLeod, S. A. (2007): What is reliability? Retrieved from https://www.simplypsychology.org/reliability.html

Mohajan, H. (2017): Two Criteria for Good Measurements in Research: Validity and Reliability. Munich Personal RePEc Archive, 17(3): pp 58-82

Noble, H. and Smith, J. (2015): Issues of validity and reliability in qualitative research. Evidence-Based Nursing, 18(2): pp 34-35

Petty, R. E., Brinol, P., Loersch, C. and McCaslin, M. (2015): The need for cognition. ResearchGate, pp 318-329

Potter, W. J. and Donnerstein, L. (1999): Rethinking validity and reliability in content analysis. Journal of Applied Communication Research, 27(3): pp 258-284

Rubio, D. M. (2003): Objectifying content validity: Conducting a content validity study in social work research. Social Work Research, Vol 27: pp 94-104

Silver, H., Danner, D. and Rammstedt, B. (2018): The impact of respondent attentiveness on reliability and validity. International Journal of Social Research, pp 1-12

Weiner, J. (2007): Measurement: Reliability and Validity Measures. Johns Hopkins Bloomberg School of Public Health, pp 5-27

