Criteria of a Good Test


As a measuring instrument used to measure, compare, and obtain accurate
information, a good test must have certain characteristics. The following are
the views of several experts on the characteristics of a good test:
1. Prof. Drs. Anas Sudijono, in his book Pengantar Evaluasi Pendidikan (2005: 93),
states that a good test must have at least four characteristics: it must be
valid, reliable, objective, and practical.
2. Masrun MA and Dra. Sri Mulyani Martaniah (1974: 117) state that a good test
must have at least three things: validity, reliability, and the ability
3. Dra. Suharsimi AK states that a good test must meet four requirements:
validity, reliability, objectivity, and practicability.
4. Arikunto & Suharsimi, in the book Dasar-dasar Evaluasi Pendidikan, state
that the requirements of a good test are: validity, reliability, objectivity, practicability,
5. Miller (1991: 91) and Gronlund & Lin (1990: 47) state that three things
must be considered in determining a quality measuring instrument: validity,
reliability, and practicability.
From the expert opinions above we can see that none contradicts another; rather,
they complement one another, so it can be concluded that the criteria of a good
test include:
All good tests possess three qualities: Validity (any test that we use must be appropriate
in terms of our objectives), Reliability (dependable in the evidence it provides), and
Practicality/Usability (applicable to our particular situation). These three constitute the sine
qua non [sini kwa non], that is, an essential condition or a thing that is absolutely necessary.
Validity is treated as the most important of the three elements, but Reliability
generally affects Validity, and Validity cannot be fully appreciated without a basic
understanding of Reliability.

A. Reliability
Reliability refers to the consistency or stability of test scores, or the consistency of
scores obtained by the same persons when retested with the identical test or with an
equivalent form of the test (Anastasi: 27). This means that a test cannot measure
anything well unless it measures consistently.
For example, we would want to know whether the scores would be the same:
(1) if we tested a group on Tuesday instead of Monday;
(2) if we gave two parallel forms of the test to the same group on
Monday and on Tuesday;
(3) if we scored a particular test on Tuesday instead of Monday;
(4) if two or more competent scorers scored the test independently.

Two different types of consistency or reliability:

1. Test Reliability. It is affected by a number of factors, chief among them being the
adequacy of the sampling of tasks. Generally speaking, the more samples of
students' performance we take, the more reliable our assessment of their
knowledge and ability will be.
2. Scorer or Rater Reliability. It concerns the stability or consistency with which test
performances are evaluated.
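
Scorer reliability can be made concrete with a small calculation. The sketch below is only an illustration: the text does not name a specific statistic, so simple percent agreement and Cohen's kappa (agreement corrected for chance) are used here as assumed choices, and the two lists of grades are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of papers on which the two scorers gave the same grade."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Agreement between two scorers, corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both scorers pick the same grade
    # if each graded at random according to their own grade frequencies.
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical grades given independently by two scorers to eight essays.
scorer_1 = ["A", "B", "B", "C", "A", "C", "B", "A"]
scorer_2 = ["A", "B", "C", "C", "A", "C", "B", "B"]

print(percent_agreement(scorer_1, scorer_2))
print(round(cohen_kappa(scorer_1, scorer_2), 2))
```

A kappa close to 1 suggests the two scorers evaluate performances consistently; a value near 0 suggests their agreement is no better than chance.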

The methods used in determining Reliability are:

a. Equivalent or alternate-form method
To find the reliability by this method, two equivalent forms of the test, form A and
form B, must be constructed. Both forms are given to the same group of students. The
scores on form A and form B are then correlated to obtain the reliability coefficient
by using the reliability formulas.
b. Test-retest method
In the test-retest method, as its name suggests, the same test is administered twice.
The test instrument itself is constructed only once. If the scores from the two
administrations are highly correlated, we can assume that the test has temporal stability.
c. Split-half method
The test is administered only once, and only one set of test items is used. To find
the reliability, the items are split into two halves and the scores on the two halves
are correlated. There are two ways of splitting: odd-even grouping and
beginning-end grouping.
d. Rational equivalence
Here too, reliability is estimated from a single administration of one form of the
test, but in this case we are concerned with inter-item consistency as determined by
the proportion of persons who pass and the proportion who do not pass each item.
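
The methods above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed procedure: the score lists and the pass/fail matrix are hypothetical, the Pearson correlation is assumed as the "reliability formula" for the alternate-form and test-retest methods, the Spearman-Brown correction is assumed for the odd-even split, and Kuder-Richardson formula 20 (KR-20) is assumed as the rational equivalence estimate.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half(item_matrix):
    """Odd-even split-half reliability, stepped up with Spearman-Brown."""
    odd = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # The half-test correlation underestimates full-test reliability,
    # so it is corrected to the full length of the test.
    return 2 * r_half / (1 + r_half)


def kr20(item_matrix):
    """KR-20 for items scored 1 (pass) or 0 (fail)."""
    n = len(item_matrix)       # number of students
    k = len(item_matrix[0])    # number of items
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n
    # Assumption: population variance (divide by n) of the total scores.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_matrix) / n  # proportion passing item j
        pq += p * (1 - p)                           # times proportion failing
    return (k / (k - 1)) * (1 - pq / var_total)


# Hypothetical scores of five students on two sittings (or two forms).
first_sitting = [72, 65, 80, 58, 90]
second_sitting = [70, 68, 78, 60, 88]

# Hypothetical pass/fail results: 6 students (rows) x 6 items (columns).
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(pearson(first_sitting, second_sitting), 2))  # test-retest / alternate form
print(round(split_half(responses), 2))                   # split-half
print(round(kr20(responses), 2))                         # rational equivalence
```

The same `pearson` function serves both the alternate-form and the test-retest method; only the source of the two score lists differs.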
Besides knowing the methods used to determine reliability, it is necessary to know the
factors affecting the reliability of a test. The factors (Heston: 155) are as follows:
a) The extent of the sample of material selected for testing. The larger the sample,
the greater the probability that the test as a whole is reliable, as the test allows a
wide field to be covered.
b) The administration of the test. Is the same test administered to different groups
under different conditions or at different times?
c) Test instructions. Are the various tasks expected from the testee made clear to all
candidates in the rubrics?
d) Personal factors such as motivation and illness.
e) Scoring the test. This is influenced by the type of test: objective or subjective.

Estimating the Reliability of Speeded Tests

Speed tests are those in which the items are comparatively easy but the time limits are
so short that few or none of the candidates can complete all items. They are contrasted
with power tests, in which item difficulty generally increases gradually but where ample
time is given for all, or at least most, of the candidates to attempt every item.
Neither the split-half nor the rational equivalence technique of estimating reliability
should be used with speed tests. Test-retest or parallel forms are the methods best
adapted to the measurement of speed-test reliability.
The question of satisfactory reliability
A reliability coefficient of 1.00 would indicate that a test is perfectly reliable. A
coefficient of zero would denote a complete absence of reliability. Generally, reliability
can be increased by lengthening the test, provided always that the additional material is
similar in quality and difficulty to the original. However, if a test would have to be
lengthened impractically to reach an acceptable reliability, it would obviously be wiser
to revise the material or choose another test type.
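
The effect of lengthening a test is commonly estimated with the Spearman-Brown prophecy formula. The text does not name this formula, so the sketch below is an assumed illustration; the function names and the reliability values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made `length_factor` times as
    long with material of similar quality and difficulty."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """How many times longer the test must be to reach `target` reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))


# Doubling a test whose reliability is 0.60.
print(round(spearman_brown(0.60, 2), 2))

# Lengthening needed to reach 0.90, starting from 0.60 and from 0.30.
print(round(length_factor_needed(0.60, 0.90), 1))
print(round(length_factor_needed(0.30, 0.90), 1))
```

Under this formula a fairly reliable test needs only moderate lengthening, while a very unreliable one would have to become many times longer, which is when revising the material or changing the test type is the more sensible option.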

The standard error of measurement

Few if any forms of educational measurement are perfectly reliable. An obtained
score on any test consists of the true score plus a certain amount of test error. Using the
statistical estimate of reliability, test makers compute a further statistic known as the
standard error of measurement (SEmeas) to estimate the limits within which an
individual's obtained score on a test is likely to diverge from his true score.
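
As a rough illustration, the standard error of measurement is usually computed as the standard deviation of the test scores multiplied by the square root of one minus the reliability coefficient. The sketch below assumes that standard formula; the score, standard deviation, and reliability used are hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEmeas = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


def score_band(obtained, sd, reliability, z=1.0):
    """Range around an obtained score, `z` standard errors on each side."""
    sem = standard_error_of_measurement(sd, reliability)
    return obtained - z * sem, obtained + z * sem


# Hypothetical test: standard deviation 10, reliability 0.91, obtained score 75.
sem = standard_error_of_measurement(10, 0.91)
low, high = score_band(75, 10, 0.91)
print(round(sem, 2), round(low, 2), round(high, 2))
```

With one standard error on each side, the band is expected to contain the true score in roughly two cases out of three, assuming normally distributed errors; a less reliable test gives a wider band.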
Final remarks about reliability
It must be remembered that reliability refers purely and simply to the precision with
which the test measures. No matter how high the reliability coefficient, it is by no means
a guarantee that the test measures what the test user wants to measure. Data concerning
what the test measures must be sought from some source outside the statistics of the test itself.
B. Validity
Validity is a term used to refer to the degree to which an instrument measures what it is
supposed to measure, or the degree to which an instrument parallels the material which
has been taught and the way in which it has been taught (Haris: 42).
Different types of validity
1. Logical Validity
Content validity
Content validity refers to how well the sample of test items represents the content of
instruction. If the students are taught listening skills, then the test should be on
listening, not speaking. In other words, content validity refers to the degree of
correspondence between the test instrument and the material which has been taught,
that is, the match between what has been taught (the content) and what is tested. The
test should be able to provide us with information about the specific materials
or skills being tested, and the basis for their selection.
Construct validity
A construct is a characteristic that is deemed to exist in order to explain some type of
behavior. Construct validity is an indication of the relationship between what a theory
predicts and what test scores show. It refers to the match between the stated learning
objective and the test made. The construct here therefore does not refer to the
construction of the sentences in a test. If, for instance, the objective states that at
the end of the lesson students are expected to be able to choose the right word based on
the context, then the test should require them to choose the right word based on the
context. The test item could, for instance, be as follows:
The students ___ just gone.
a. has
b. have
c. had
Face Validity
We conclude this brief survey of some common varieties of validation with what is
probably the most frequently employed type of all, face validity. Here we mean
simply the way the test looks to the examinees, test administrators, educators, and the
like. Obviously, this is not validity in the technical sense, and face validity can never be
permitted to take the place of empirical validation or of the kind of authoritative analysis
of content referred to above.
2. Empirical Validity
Empirical validity is indicated by the degree of correlation between scores on the test
and scores on some external criterion measure. There are two general kinds of empirical
validity: concurrent validity and predictive validity.
Concurrent validity
Concurrent validity is a kind of validity based on experience, that is, on results that
already exist. Thus, to determine whether a summative test is valid or not, the daily
test scores or the previous summative test scores can be used as the criterion. This can
be carried out by comparing the test with another test (this is what is meant by
experience). A test is considered valid if it is in line with the set criterion.
Predictive validity
Predictive validity is the extent to which a test can be used to make accurate
predictions about future performance.
C. Practicality
A third characteristic of a good test is its practicality or usability. A test may be
highly reliable and valid and still not be practical and usable in our situation. Thus
in the preparation of a new test or the adoption of an existing one, we must keep in
mind the following very practical considerations:
1. Economy
Economy includes both time and money. Testing can be expensive. We must take
into account the cost per copy and whether or not the test books are reusable. Again,
we must consider whether several administrators and/or scorers will be needed, for the
more personnel who must be involved, the more costly the process becomes. In writing
or selecting a test, we should certainly pay some attention to how long the
administering and scoring of it will take.
2. Ease of administration and scoring
Another consideration in test usability is the ease with which the test can be
administered. The test administrator can perform his tasks quickly and efficiently if
full, clear directions are provided. Scoring procedures also have a significant effect on
the practicality of a given instrument. We therefore need to know the number of
examinees involved, whether the test must be scored subjectively or is objective in
nature, the answer sheet used, and whether the tests will be scored by machine or by
hand.
3. Ease of interpretation
If a standard test is being adopted, it is important that we examine and take into
account the data which the publisher provides, and whether there is an up-to-date test
manual that gives clear information about test reliability and validity and about
norms for appropriate reference groups. However, we need to have some general
guidance as to the meaning of test scores to begin with.
In short, all of the factors above influence the quality of a test.
D. Objectivity
As we all know, objective means free of personal elements. In this connection, a test
can be called objective, and said to have objectivity, if it is constructed and
administered as it is, without personal bias. The content or material of the test is
taken from the subject matter that has previously been taught and is in line with the
stated objectives (Anas Sudijono, 2005: 96). In other words, a test is said to have
objectivity if no subjective factor influences its administration, especially the
scoring system. In relation to reliability, objectivity emphasizes consistency in the
scoring system, whereas reliability emphasizes consistency in the test results.
The factors that affect objectivity are as follows:

a. Test format
An essay test gives the scorer many opportunities to award scores in his or her own
way. This shows that using an essay test allows the scorer's subjectivity to enter
into the scoring.
With an essay test, the scorer's subjectivity can enter more freely and influence the
scores awarded. Factors that can influence the subjectivity of the scoring include: the
scorer's impression of the test taker (halo effect), handwriting, language, the time at
which the scoring is done, and

