
Assessing and Evaluating


Assessing and Evaluating
Definition of assessment and evaluation
Aim of student evaluation
Steps in student evaluation
The basic principles of assessment/ evaluation
Regulation of learning by the teacher
Types of evaluation
Qualities of a test
Characteristics of measurement instrument
Advantages and disadvantages of different types of tests

Assessment and evaluation are essential components of teaching and learning. Without an effective evaluation program it is impossible to know whether students have learned, whether teaching has been effective, or how best to address student learning needs.
Assessing and Evaluating
Definition of assessment:
Assessment is the process of gathering
information on student learning.

Definition of evaluation:
Evaluation is the process of analyzing,
reflecting upon, and summarizing
assessment information, and making
judgments and/or decisions based on the
information collected.
Aim of student evaluation
Incentive to learn
Feedback to student
Modification of learning activities
Selection of students
Success or failure
Feedback to teacher
Protection of society
Types of evaluation

1- Formative evaluation:
An ongoing classroom process that keeps students and educators informed of students' progress toward program learning objectives.
The main purpose of formative evaluation is to improve instruction and student learning.
2- Summative evaluation
It occurs most often at the end of a unit. The teacher uses summative evaluation to determine what has been learned over a period of time, to summarize student progress, and to report to students, parents and educators on progress relative to curriculum objectives.
3- Diagnostic evaluation
It usually occurs at the beginning of the school year or before a new unit. It identifies students who lack prerequisite knowledge, understanding or skills; diagnostic testing also identifies individual student strengths and needs. Diagnostic evaluation provides information essential to teachers in designing appropriate programs for all students.
Steps in student evaluation
Deriving the criteria from the educational objectives
Development and use of measuring instruments
Interpretation of measurement data
Formulation of judgment and taking of appropriate action
Principles of Evaluation
Evaluation should be
1. Based on clearly stated objectives
2. Comprehensive
3. Cooperative
4. Used Judiciously
5. Continuous and integral part of the teaching
learning process
Qualities of a Good Measuring Instrument
Validity: the extent to which the instrument
measures what it is intended to measure.
Reliability: the consistency with which an instrument
measures a given variable.
Objectivity: the extent to which independent and
competent examiners agree on what constitutes a
good answer for each element of a measuring
instrument.
Practicability: the overall simplicity of the use of a
test both for test constructor and for students.
Qualities of a test

Directly related to educational objectives

Realistic& practical
Concerned with important & useful matters
Comprehensive but brief
Precise& clear

Validity means the degree to which a test or measuring instrument measures what it intends to measure. The validity of a measuring instrument has to do with its soundness: what the test measures, how effective it is, and how well it serves its purpose.

For instance, to judge the validity of a performance test, it is necessary to consider what kind of performance the test is supposed to measure and how well that performance manifests itself in the test.

Validity denotes the extent to which an instrument is measuring what it is supposed to measure.
Criterion-Related Validity

A method for assessing the validity of an instrument by comparing its scores with another criterion already known to be a measure of the same trait or skill. Criterion-related validity is usually expressed as a correlation between the test in question and the criterion measure. This correlation coefficient is referred to as a validity coefficient.
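As a minimal sketch of how a validity coefficient can be computed, the Pearson correlation between test scores and an established criterion measure can be coded directly (the function name `validity_coefficient` is illustrative, not from the source):

```python
import math
import statistics

def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between a test and an established criterion
    measure, used here as a criterion-related validity coefficient."""
    mx = statistics.mean(test_scores)
    my = statistics.mean(criterion_scores)
    # covariance numerator and the two standard-deviation terms
    cov = sum((x - mx) * (y - my) for x, y in zip(test_scores, criterion_scores))
    sx = math.sqrt(sum((x - mx) ** 2 for x in test_scores))
    sy = math.sqrt(sum((y - my) ** 2 for y in criterion_scores))
    return cov / (sx * sy)
```

A coefficient near +1 indicates the test ranks candidates much as the criterion does; values near 0 indicate little criterion-related validity.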
Types of Validity
Content Validity. Content validity means the extent to which the content or topics of the test are truly representative of the course. It involves, essentially, the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured. It is very important that the behaviour domain to be tested be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions.
Content validity is established by the relevance of a test to different types of criteria, such as thorough judgment and systematic examination of relevant course syllabi and textbooks, the pooled judgment of subject matter experts, statements of behavioural objectives, and analysis of teacher-made test questions, among others. Thus content validity depends on the relevance of the individual's responses to the behaviour under consideration rather than on the apparent relevance of the item content.
Content validity
Content validity is commonly used in evaluating achievement tests. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. Bloom's Taxonomy of Educational Objectives is of great help in listing the objectives to be covered in an achievement test. Content validity is particularly appropriate for criterion-referenced measures.
Content validity asks whether the individual items of a test represent what you actually want to measure.
For instance, a teacher wishes to validate a test in Mathematics. He requests experts in Mathematics to judge whether the test items or questions measure the knowledge, skills, and values they are supposed to measure. Another way of testing validity is for the teacher to check whether the test items or questions represent the knowledge, skills and values suggested in the Mathematics course content.
Good and Scates (1972) suggested the following checklist as evidence of test or questionnaire validity:
1. Is the question on the subject? Yes ___ No___
2. Is the question perfectly clear and unambiguous? Yes ___ No___
3. Does the question get at something stable which is typical of the individual or of the situation? Yes ___ No___
4. Does the question pull? Yes ___ No___
5. Do the responses show a reasonable range of variation? Yes ___ No___
6. Is the information obtained consistent? Yes ___ No___
7. Is the item sufficiently inclusive? Yes ___ No___
8. Is there a possibility of using an external criterion to evaluate the test/questionnaire? Yes ___ No___
Concurrent validity is the degree to which the test agrees or correlates with a criterion set up as an acceptable measure. The criterion is always available at the time of testing. It is applicable to tests employed for the diagnosis of existing status rather than for the prediction of future outcomes.

In short, it is the extent to which a procedure correlates with the current behaviour of subjects.
For example, a teacher wishes to validate a Science achievement test he has constructed. He administers the test to a group of Science students. The results of the test are correlated with an acceptable Science test which has previously been proven valid. If the correlation is high, the Science test he has constructed is valid.
Predictive validity is determined by showing how well predictions made from the test are confirmed by evidence gathered at some subsequent time. The criterion measure for this type of validity becomes available only at a later time.

In short, it is the extent to which a procedure allows accurate predictions about a subject's future behavior.
For instance, the teacher wants to estimate how well a student may be able to do in graduate school courses on the basis of how well he has done on tests he took in the undergraduate courses. The criterion measures against which the test scores are validated become available only after a long interval.
Construct validity of a test is the extent to which the test measures a theoretical trait. This involves such tests as those of understanding, appreciation and interpretation of data. Examples are intelligence and mechanical aptitude tests.

It is the extent to which a test measures a theoretical construct or attribute: abstract concepts such as intelligence, self-concept, motivation, aggression and creativity that can be observed only through some type of instrument.
For example, a teacher wishes to establish the validity of an IQ measure using the Culture Fair Intelligence Test. He hypothesizes that students with high IQ also have high achievement and those with low IQ, low achievement. He therefore administers both the Culture Fair Intelligence Test and an achievement test to groups of students. If students with high IQ have high scores in the achievement test and those with low IQ have low scores, the hypothesis is confirmed.

A test's construct validity is often assessed by its convergent and discriminant validity.
Threats to validity:
1. Test-related factors
2. The criterion to which you compare your instrument may not be well enough established
3. Intervening events
4. Reliability
Reliability means the extent to which a test is dependable, self-consistent and stable. In other words, the test agrees with itself. It is concerned with the consistency of responses from moment to moment: if a person takes the same test twice, the test yields the same results. However, a reliable test may not always be valid.
Reliability is the consistency of measurements: a reliable test produces similar scores across various conditions and situations, including different evaluators and testing environments.
How do we account for an individual
who does not get exactly the same
test score every time he or she takes
the test?
1. Test-taker's temporary psychological or physical state
2. Environmental factors
3. Test form
4. Multiple raters
The statistic for expressing reliability is the reliability coefficient. It expresses the degree of consistency in the measurement of test scores and is denoted by the letter r with two identical subscripts (rxx).
For instance, Student C took a Chemistry test twice. His answer to item 5, "What is the neutral pH?", is 6.0. In the second administration of the same test and question, his answer is still 6.0; thus, his response is reliable but not valid. His answer is reliable because of the consistency of his responses (6.0), but not valid because of the lack of veracity of his answer: the correct answer is pH 7.0. Hence, a reliable test may not always be valid.
Test-retest method. The measuring instrument is administered twice to the same
group of students and the correlation coefficient is
determined. The limitations of this method are (1) when
the time interval is short, the respondents may recall their
previous responses and this tends to make the correlation
coefficient high, (2) when the time interval is long, such
factors as unlearning, forgetting, among others may occur
and may result in low correlation of the measuring
instrument, and (3) regardless of the time interval
separating the two administrations, other varying
environmental conditions such as noise, temperature,
lighting, and other factors may affect the correlation
coefficient of the measuring instrument.

Test-retest reliability suggests that subjects tend to obtain the same score when tested at different times.
The Spearman rank correlation coefficient (Spearman rho) is a statistical tool used to measure the relationship between paired ranks assigned to individual scores on two variables, X and Y. Thus, it can be used to correlate the scores in a test-retest method.
Spearman rho formula:
rs = 1 - (6ΣD²) / (N³ - N)
where
rs = Spearman rho
ΣD² = sum of the squared differences between ranks
N = total number of cases
Ex. Spearman rho computation of the first (X) and second (Y) administration of an achievement test in English (artificial data): the paired scores of 14 students are ranked (Rx, Ry), the differences between ranks are squared, and the squared differences are summed, giving ΣD² = 82.00.

rs = 1 - (6ΣD²) / (N³ - N)
   = 1 - 6(82) / ((14)³ - 14)
   = 1 - 492 / 2730
   = 1 - 0.18021978
   = 0.82 (high relationship)
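The test-retest computation above can be sketched in Python. This is a minimal illustration (the function names are mine): tied scores receive the average of their rank positions, and the simplified rs formula is exact only when there are no ties.

```python
def average_ranks(scores):
    """Rank scores (1 = highest), assigning tied values the average
    of the positions they occupy."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied scores
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """rs = 1 - 6*sum(D^2) / (N^3 - N), D = difference between paired ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    n = len(x)
    return 1 - 6 * d2 / (n ** 3 - n)
```

With the first- and second-administration scores as the two lists, this reproduces the hand computation above.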
Parallel-forms method. Parallel or equivalent forms of a test may be administered to the same group of students, and the paired observations correlated. In estimating reliability by the administration of parallel or equivalent forms of a test, criteria of parallelism are required (Ferguson and Takane, 1989). The two forms of the test must be constructed so that the content, type of item, difficulty, instructions for administration, and many other features, are similar.
Also known as equivalent forms reliability or parallel forms reliability.
Obtained by administering two equivalent tests to the same group of examinees.
Items are matched for difficulty on each form.
It is necessary that the time frame between giving the two forms be as short as possible.
Split-half method. The test in this method may be administered once, but the test items are divided into two halves. The common procedure is to divide a test into odd and even items. The two halves of the test must be similar but not identical in content, number of items, difficulty, means and standard deviations. Each student obtains two scores, one on the odd and the other on the even items of the same test. The scores obtained on the two halves are correlated; the result is a reliability coefficient for a half test, which is then stepped up to the reliability of the whole test.
Split-Half Reliability

Sometimes referred to as internal consistency. Indicates that subjects' scores on some trials consistently match their scores on other trials.

The half-test correlation is stepped up with the Spearman-Brown formula:
rwt = 2(rht) / (1 + rht)
where rwt is the reliability of the whole test and rht is the reliability of the half test.
For instance, a test is administered to students as a pilot sample to estimate the reliability coefficient from the odd and even halves:

Student  X (odd)  Y (even)  Rx    Ry    D     D²
1        23       30        9     7.5   1.5   2.25
2        25       24        7.5   9.5  -2.0   4.00
3        27       30        6     7.5  -1.5   2.25
4        35       40        5     5.0   0.0   0.00
5        48       55        3     2.5   0.5   0.25
6        21       24        10    9.5   0.5   0.25
7        25       35        7.5   6.0   1.5   2.25
8        50       51        2     4.0  -2.0   4.00
9        38       60        4     1.0   3.0   9.00
10       55       55        1.0   2.5  -1.5   2.25
Total                                        26.50

rht = .84 (Spearman rho between the two halves)

rwt = 2(.84) / (1 + .84) = .91 (very high reliability)
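The step-up is a one-line helper; a minimal sketch (the function name `spearman_brown` is mine, after the formula's usual name):

```python
def spearman_brown(r_half):
    """Spearman-Brown step-up: whole-test reliability estimated from
    the correlation between the two half tests, rwt = 2*rht / (1 + rht)."""
    return 2 * r_half / (1 + r_half)
```

With the half-test correlation of .84 above, `spearman_brown(0.84)` gives about 0.913, matching the .91 reported.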

Internal Consistency method
This method is used with psychological tests which consist of dichotomously scored items. The examinee either passes or fails an item: a rating of 1 is assigned for a pass and 0 (zero) for a failure. The reliability coefficient is obtained with the Kuder-Richardson Formula 20 (KR-20), a measure of the internal consistency or homogeneity of a measuring instrument. The formula is
rxx = [N / (N - 1)] × [(SD² - Σpiqi) / SD²]

where N = number of items, pi = proportion of examinees passing item i, qi = 1 - pi,
SD² = Σ(X - X̄)² / (n - 1), and X̄ = ΣX / n (X = examinee total scores, n = number of examinees).
Item  Responses of the 14 examinees    f   pi   qi   piqi
1     1 1 1 1 1 1 1 1 1 1 1 1 0 0    12  .86  .14  .1204
2     1 1 1 1 1 1 1 1 1 1 1 1 0 0    12  .86  .14  .1204
3     1 1 1 1 1 1 1 1 1 1 1 0 0 0    11  .79  .21  .1659
4     1 1 1 1 1 1 1 1 1 1 0 0 0 0    10  .71  .29  .2059
5     1 1 1 1 1 1 1 1 1 1 0 0 0 0    10  .71  .29  .2059
6     1 1 1 1 1 1 1 1 1 1 0 0 0 0    10  .71  .29  .2059
7     1 1 1 1 1 1 1 1 1 0 0 0 0 0     9  .64  .36  .2304
8     0 1 1 1 1 1 1 1 1 0 1 0 0 0     8  .57  .43  .2451
9     0 1 1 1 1 1 1 1 1 0 0 0 0 0     8  .57  .43  .2451
10    0 0 1 0 0 1 1 0 1 0 0 0 0 0     4  .29  .71  .2059
X     7 9 10 9 9 10 9 9 10 6 4 2 0 0          Σpiqi = 1.9509

rxx = .79 (high relationship)
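A minimal Python sketch of the KR-20 computation over such a 0/1 item-by-examinee matrix (the names are mine, not from the source; the variance uses the n - 1 denominator as in the formula above):

```python
def kr20(item_matrix):
    """Kuder-Richardson Formula 20 for dichotomously scored items.

    item_matrix[i][j] is 1 if examinee j passed item i, else 0.
    """
    n_items = len(item_matrix)
    n_examinees = len(item_matrix[0])
    # sum of pi*qi over items (pi = proportion of examinees passing item i)
    sum_pq = 0.0
    for item in item_matrix:
        p = sum(item) / n_examinees
        sum_pq += p * (1 - p)
    # each examinee's total score, and the score variance (n - 1 denominator)
    scores = [sum(item[j] for item in item_matrix) for j in range(n_examinees)]
    mean = sum(scores) / n_examinees
    var = sum((x - mean) ** 2 for x in scores) / (n_examinees - 1)
    return (n_items / (n_items - 1)) * (var - sum_pq) / var
```

Note that with very small samples the sample-variance version can stray outside the 0-1 range; large-sample estimates behave as expected.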
Interrater reliability
Involves having two raters independently observe and record specified behaviors, such as hitting, crying, yelling, and getting out of the seat, during the same time period.

Target behavior: a specific behavior the observer is looking to record.
Interpretation of Correlation of Coefficient Values
An r from + 0.00 to + 0.20 denotes negligible correlation
An r from + 0.21 to + 0.40 denotes low correlation
An r from + 0.41 to + 0.70 denotes marked or moderate correlation
An r from + 0.71 to + 0.90 denotes high correlation
An r from + 0.91 to + 0.99 denotes very high correlation
An r of + 1.00 denotes perfect correlation
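The scale above maps directly onto a small helper; a sketch (the function name is mine), applied to the magnitude of r:

```python
def interpret_r(r):
    """Verbal interpretation of a correlation coefficient,
    following the scale above (applied to |r|)."""
    a = abs(r)
    if a >= 1.0:
        return "perfect correlation"
    if a >= 0.91:
        return "very high correlation"
    if a >= 0.71:
        return "high correlation"
    if a >= 0.41:
        return "marked or moderate correlation"
    if a >= 0.21:
        return "low correlation"
    return "negligible correlation"
```

For example, the test-retest coefficient of .82 computed earlier falls in the "high correlation" band.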
Observed score: the score you get when you administer a test. It consists of two parts: the true score and the error score.

Standard error of measurement: gives the margin of error that you should expect in an individual test score because of the imperfect reliability of the test.
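The standard error of measurement is conventionally computed from the score standard deviation and the reliability coefficient as SEM = SD·√(1 - rxx); a minimal sketch (this formula is standard but not stated in the source):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - rxx): the expected margin of error in an
    individual observed score due to imperfect reliability."""
    return sd * math.sqrt(1 - reliability)
```

For example, a test with SD = 10 and rxx = .91 gives an SEM of 3.0 score points.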
Evaluating the Reliability Coefficients

The test manual should indicate why a certain type of reliability coefficient was chosen.
The manual should indicate the conditions
under which the data were obtained
The manual should indicate the important
characteristics of the group used in
gathering reliability information
Test reliability is influenced by the following factors:
1. Test length
2. Test-retest interval
3. Variability of scores
4. Guessing
5. Variation within the test situation

Usability means the degree to which the measuring instrument can be satisfactorily used by teachers, researchers, supervisors and school managers without undue expenditure of time, money, and effort. In other words, usability means practicality.
Factors that determine usability
Ease of administration
Ease of scoring
Construction of the test in objective type

Answer keys are adequately prepared

Scoring directions are fully understood

Ease of interpretation and application

Low cost
Proper mechanical makeup
Advantages and disadvantages of different
types of tests

1- Oral examinations: advantages
1. Provide direct personal contact with candidates.
2. Provide opportunity to take mitigating circumstances into account.
3. Provide flexibility in moving from the candidate's strong points to weak areas.
4. Require the candidate to formulate his own replies without cues.
5. Provide opportunity to question the candidate about how he arrived at an answer.
6. Provide opportunity for simultaneous assessment by two or more examiners.
1- Oral examinations: disadvantages
1. Lack standardization.
2. Lack objectivity and reproducibility of results.
3. Permit favoritism and possible abuse of the personal contact.
4. Suffer from undue influence of irrelevant factors.
5. Suffer from shortage of trained examiners to administer the examination.
6. Are excessively costly in terms of professional time in relation to the limited value of the information they yield.
2- Practical examinations: advantages
1. Provide opportunity to test, in a realistic setting, skills involving all the senses while the examiner observes and checks performance.
2. Provide opportunity to confront the candidate with problems he has not met before, both in the laboratory and at the bedside, to test his investigative ability as opposed to his ability to apply ready-made "recipes".
3. Provide opportunity to observe and test attitudes and responsiveness to a complex situation (videotape recording may be used).
4. Provide opportunity to test the ability to communicate under pressure, to discriminate between important and trivial issues, and to arrange data in a final form.
2- Practical examinations: disadvantages
1. Lack standardized conditions in laboratory experiments using animals, in surveys in the community, or in bedside examinations with patients of varying degrees of cooperativeness.
2. Lack objectivity and suffer from the intrusion of irrelevant factors.
3. Are of limited feasibility for large groups.
4. Entail difficulties in arranging for examiners to observe candidates demonstrating the skills to be tested.
3- Essay examinations: advantages
1. Provide the candidate with opportunity to demonstrate his knowledge and his ability to organize ideas and express them effectively.

3- Essay examinations: disadvantages
1. Limit severely the area of the student's total work that can be sampled.
2. Lack objectivity.
3. Provide little useful feedback.
4. Take a long time to score.
4- Multiple-choice questions: advantages
1. Ensure objectivity, reliability and validity; preparation of questions with colleagues provides constructive criticism.
2. Increase significantly the range and variety of facts that can be sampled in a given time.
3. Provide precise and unambiguous measurement of the higher intellectual processes.
4. Provide detailed feedback for both students and teachers.
5. Are easy and rapid to score.
4- Multiple-choice questions: disadvantages
1. Take a long time to construct in order to avoid arbitrary and ambiguous questions.
2. Also require careful preparation to avoid a preponderance of questions testing only recall.
3. Provide cues that do not exist in practice.
4. Are "costly" where the number of students is small.