Carlo Magno
De La Salle University-Manila
Nicole Tangco
Abstract
The present study conducted a metaevaluation of the teacher performance evaluation system used in the college. To determine whether the evaluation system on teacher performance adheres to quality evaluation, the standards of feasibility, utility, propriety, and accuracy were used. The system of teacher performance evaluation in PASU includes the use of a student rating scale called the Student Instructional Report (SIR) and a rating scale used by peers called the Peer Evaluation Form (PEF). A series of guided discussions was conducted among the different stakeholders of the evaluation system in the college, such as the deans and program chairs, the teaching faculty, and students, to determine their appraisal of the evaluation system in terms of the four standards. A metaevaluation checklist was also accomplished by experts in measurement and evaluation for the Center for Learning and Performance Assessment (CLPA). The results of the guided discussions showed that most of the stakeholders were satisfied with the conduct of teacher performance assessment. When judged against the standards of the Joint Committee on Evaluation, however, the results were very low: the ratings for utility, propriety, and feasibility were fair, and the standard on accuracy was poor.
Schools need a sound way of assessing the performance of teachers. Assessing teaching performance enables one to gauge the quality of instruction represented by an institution and facilitates better learning among students. The Philippine
Accrediting Association of Schools, Colleges and Universities (PAASCU) judges a school not by
the number of hectares of property or buildings it owns but rather by the caliber of classroom
teaching and learning it can maintain (O’Donnell, 1996). Judging the quality of teacher
performance actually depends on the quality of assessing the components of teaching. When
PAASCU representatives visit schools, they place a high priority on firsthand observation of
actual faculty performance in the classroom. This implies the value of the teaching happening inside the classroom. Schools have a variety of ways of assessing teacher performance. These commonly include classroom
observation by and feedback from supervisors, assessment from peers, and students’ assessment,
all of which should be firmly anchored on the school’s mission and vision statements.
The college has adopted the learner-centered psychological principles, and any assessment technique it uses, as
mentioned in the school’s mission, “recognizes diversity by addressing various needs, interests,
and cultures. As a community of students, faculty, staff, and administrators, we strengthen our
creativity, professional competence, social responsibility, a sense of nationhood, and our faith.
We actively anticipate and respond to individual, industry, and societal needs by offering
innovative and relevant programs that foster holistic human development.” The process of evaluating teachers is highly critical since it is used to decide on matters such as hiring, rehiring, and promotion. There
should be careful calibration and continuous study of the instruments used to assess teachers.
The process of evaluation in the college has been in place since the institution was established in 1988. Since that time, different assessment techniques have been used to evaluate instructors,
professors, and professionals. The assessment of teachers is handled by the Center for Learning and Performance Assessment (CLPA), which reports the results to the different stakeholders. Currently, the instructors and professors are assessed by students using the Student
Instructional Report (SIR), the Peer Evaluation Form (PEF), and academic advising. The current
forms of these instruments have been in use in the last three years.
At present, there is a need to evaluate the process of evaluating teacher performance to determine whether it meets the Joint Committee Standards for Evaluation. The Joint Committee Standards are helpful in a metaevaluation process since they provide a set of general rules for dealing with a variety of specific evaluation problems. The processes and practices of the CLPA in assessing teaching performance need to be studied to determine whether they meet the standards of utility, feasibility, propriety, and accuracy. The metaevaluation technique involves the process of delineating, obtaining, and applying descriptive and judgmental information about the utility, feasibility, propriety, and accuracy of an evaluation in order to guide the evaluation and publicly report its strengths and weaknesses.
This study on metaevaluation addresses the issue of whether the process used by the
CLPA on evaluating teaching performance in DLS-CSB meets the standards and requirements of
a sound evaluation. Specifically, the study will provide information on the adequacy of the SIR,
peer assessment, and student advising in the following areas: (1) items and instructions for responding; (2) process of administering the instruments; (3) procedures practiced in assessment; (4) utility value for stakeholders; and (5) accuracy and validity of responses.
Generally, teacher evaluations may be summative or formative. The instruments used for
summative evaluation are typically checklist-type forms that provide little room for narrative,
and take note of observable traits and methods that serve as criteria for continued employment,
promotions, and the like (Searfoss & Enz, 1996 in Isaacs, 2003). On the other hand, formative
evaluations are geared toward professional development. In this form of evaluation, teachers and
their administrators meet to try to trace the teacher’s further development as a professional. This approach works from the premise that teaching is a profession, and as such, teachers should have a certain
level of control over their development as professionals (Glatthorn, 1997 in Isaacs, 2003). This
model allows “for the clinical model of evaluation, cooperative options that allow teachers to
work with peers, and self-directed options guided by the individual teacher” (Isaacs, 2003). The
model allows professional staff and supervisors/administrators options in the evaluation process, as appropriate to meet the needs of each member of the professional team. The three processes in the Differentiated Supervision Model are: (1) Focused Supervision, (2) Clinical Supervision, and (3) Self-Directed Supervision.
The method of collaborative evaluation was developed (Berliner, 1982; Brandt, 1996;
Wolf, 1996 in Isaacs, 2003) with the mentor/administrator-teacher collaboration at its core. It entails “administrative involvement that may include multiple observations, journal writing, or artifact collections, plus a strong mentoring program” (Isaacs, 2003). At the end of a prescribed period,
the mentor and mentee sit down to compare notes on the data gathered over the observation
period. Together, they identify strengths, weaknesses, areas for improvement, and other such
points. In this model, there are no ratings, no evaluative commentaries, and no summative write-ups.
Another approach is the multiple evaluation checklist, which uses several instruments other than administrator observations. Here, the peer evaluation, the self-evaluation, and the student evaluation are used together.
Self-evaluation also plays an important role in the evaluation process. It causes the
teacher to think about his or her methods more deeply, and causes him or her to consider the
long-term. It is also said to promote a sense of responsibility and the development of higher-order thinking.
Then there is the most commonly used evaluation, the student evaluation (Bonfadini,
1998; Lengeling, 1996; Strobbe, 1993; Williams & Ceci, 1997 in Isaacs, 2003). They are the
easiest to administer and they provide a lot of insights about rapport-building skills, teacher
communication, and effectiveness. However, student evaluations have to be viewed with caution. Williams and Ceci (1997), as cited in Isaacs (2003), found that a change in a single content-free variable was enough to cause a great increase in teacher ratings: in their study, the only variable modified was the teaching style (teachers were told to be more enthusiastic and attended a seminar on presentation methods). Another reason for caution is one of the findings of the study by Bonfadini (1998), also cited by Isaacs (2003).
He found that, upon asking students to rate their teachers according to four determinant areas, (a)
personal traits, (b) professional competence, (c) student-teacher relationships, and (d) classroom
management, the least used determinant was professional competence. The conclusion is that students may tend to look more at the packaging (content-free variables) rather than at what empirically makes a good teacher, so viewing student-based information, says Isaacs (2003), should be done with caution.
In the field of teacher evaluation, the growing use of the portfolio is slowly softening the otherwise sharp edges of the standardized instrument (Engelson, 1994; Glatthorn, 1997, in Isaacs, 2003).
National standards are also used as a method of teacher evaluation. This method is based on the
instigation of a screening board other than the standard licensure committee, something that has
no counterpart in the Philippines. The creation of the National Board for Professional Teaching
Standards (1998) was prompted by the report A Nation Prepared: Teachers for the 21st Century
generated by the 1986 Carnegie Task Force on Teaching as a Profession, which in turn was
prompted by the 1983 A Nation at Risk report (Isaacs, 2003). It is the mission of the NBPTS to:
…establish high and rigorous standards for what experienced teachers should know and
be able to do, to develop and operate a national, voluntary system of assessment and
certification for teachers, and to advance educational reforms for the purpose of improving student learning in American schools.
The National Board Certification was meant as a complement to, but not a replacement
for, state licensure exams. While the latter represents the minimum standards required to teach,
the former stands as a test for more advanced standards in teaching as a profession. Unlike the
licensure examinations, it may or may not be taken; it is voluntary. As such, some schools offer
monetary rewards for the completion of the test, as well as opportunities for better positions.
What makes a good teacher? Offhand, one might say, good communication and rapport-
building skills, a sense of empathy for the students, and, of course, knowledge of the lessons to
be taught. These, however, are not easily measurable, and even if one should manage to grasp the
key to quantifying them, the degree of importance each of those variables bears might differ from one evaluator to another.
As the paradigm of how teaching should be done shifts through the years, so should what
is considered as criteria of good teaching change. And it has, through the decades. In the 1970s,
for instance, the predominant philosophy was based on Madeline Hunter’s model, which was in keeping with the behaviorist view of learning as the mastery “of fairly low level knowledge” (Danielson and McGreal, 2000). Today, however, things are
different. A good student is not only one who can get perfect scores on tests. Rather, he is capable of complex
learning, of problem solving and applying knowledge to unfamiliar situations (Danielson and
McGreal, 2000).
This change did not happen overnight. Like most good changes, it was a gradual process,
with some small shift occurring as each decade turned the pages of time. From the behaviorist
70s came an increase in the need to help students attain more complex goals, which soon, in the
80s and 90s, highlighted the need for critical thinking, problem solving, and lifelong learning.
Danielson and McGreal (2000) described the shift of focus in teaching across history. In
the 1970s, emphasis was given to learning styles and to teacher-centered lesson elements such as checking for understanding, guided practice, and independent practice. In the 1980s, teacher effectiveness was given attention, including expectancy studies, discipline models, Hunter derivatives, effective-schools research, cooperative learning, and brain research. In the 1990s, numerous studies on critical thinking were present in the literature. Teaching emphasized content-based teaching and learning, and teaching for understanding. For the 21st century, teaching
effectiveness and critical thinking are refocused on authentic pedagogy. Authentic pedagogy emphasizes higher-order thinking, depth of knowledge, and the application of learning to real-world contexts.
While the trend in teaching is certainly a good factor to consider when thinking about
how to evaluate, it doesn’t end there. When evaluating, the National Board for Professional
Teaching Standards makes it clear that there is more than just teaching per se:
The fundamental requirements for proficient teaching are relatively clear: a broad
grounding in the liberal arts and sciences; knowledge of the subjects to be taught,
of the skills to be developed, and of the curricular arrangements and materials that support them; knowledge of general and subject-specific methods for teaching and for evaluating student learning; knowledge of students and human development…
(The National Board for Professional Teaching Standards, 1998, p.4, as quoted by
Isaacs, 2003)
If one reads through the paragraph, one will notice that it moves from the teacher’s academic grounding to knowledge of the students themselves.
Danielson & McGreal (2000) proposed a model containing four domains embodying the
components of professional practice. Domain 1, Planning and Preparation, covers everything that
happens before contact with the students: knowing the topic, knowing what one has at one’s
disposal and being able to use it (demonstrating knowledge of content and pedagogy, demonstrating knowledge of students, selecting instructional goals, demonstrating knowledge of resources, designing coherent instruction, and assessing student learning). Domain 2, The
Classroom Environment, speaks of setting the mood for learning and classroom management
(Creating an environment of respect and rapport, establishing a culture for learning, managing
classroom procedures, managing student behavior, and organizing physical space). The third
domain, Instruction, tackles the raison d’ etre of teaching, from being able to communicate
effectively to actually helping the students learn to giving feedback (Communicating clearly and
accurately, using questioning and discussion techniques, engaging students in learning, providing
feedback to students, and demonstrating flexibility and responsiveness). The last domain, Professional Responsibilities, covers the off-stage duties of the teacher: being able to maintain accurate records, to talk to parents, and to reflect on one’s practice (reflecting on teaching, maintaining accurate records, communicating with families, contributing to the school and district, growing and developing professionally, and showing professionalism).
The student evaluation of teaching or SET is probably the most widely used form of
evaluation, favored for its simplicity and economy in administration, scoring, and interpretation.
Since it was first put into modern use by pioneers like Herman Remmers, it has been applied in
at least three different ways, as identified by McKeachie (1996). These are student guidance in
choice of course, improvement of teaching, and evaluating teaching for use in personnel decisions.
Wright (2006), drawing on Olshavsky and Spreng (1995), suggests that what satisfies students shapes the ratings they give, and the literature points to several such factors.
For starters, Wright (2006) says that the “entertainment” level of classroom experience
(Costin, Greenough, & Menges, 1971 in Wright, 2006) is a big determinant. The “Dr. Fox” study
by Naftulin, Ware, and Donnelly (1973, in Wright, 2006) pointed to this. The researchers hired a
highly enthusiastic actor to give a lecture that was intentionally devoid of any educational value
—yet his teaching was rated quite highly. Definitely, most students want their classes to be more
Another thing that seems important to students is their teacher’s communication skills. In
a study by Williams and Ceci (1997, in Wright, 2006), it was found that, though course material,
the lecture format, and student performance were held constant, an improvement of
Then, there are the findings of Wright (2000) himself on factors that drive student ratings.
It is very possible for students to even favor a factor that is detrimental to their
experiencing effective learning. Wright (2006) cites Strom, Hocevar, and Zimmer (1990) who
found that students who preferred easy courses preferred teachers with high student orientation
but did so much better in the classes of teachers with low student orientation.
What Students Look At: Steiner et al.’s (2006) Study on Student Biases
Steiner et al. (2006) also did a study looking into what students consider when they
evaluate. They broke down student criteria into four variables: student perceptions, instructor
attributes, instructional formats, and course attributes. Each of these had several variables under
them, studied individually in relation to predicted SET scores. The findings are of interest in that they show that students (at least those in the data set, for the researchers did admit that their sample provided very little generalizability) consider how much they perceive they learned from the teacher, and that they tend to consider things that are well beyond the teacher’s control.
Under student perceptions, the predicted value of SET increased with every unit increase
of a student’s perception of how much they learned and decreased with every unit decrease of the
grade students expect to get (i.e. from an A to a B). In addition, though not statistically significant, the researchers noted as important how challenging a course is perceived to be. Under instructor attributes, the significant subvariable was gender. Generally, student ratings were worse for female professors than for the males.
Under instructional formats, the significant subvariables were using videos and guest
speakers, offering extra credit opportunities, the percent of course time spent lecturing and the
percent of course time spent in active learning activities. All of these, except for the percent of course time spent lecturing, were positively related to the predicted SET scores.
No significant subvariable surfaced under the course attributes variable, but the authors
noted that it seems teaching electives and graduate-level classes, as against teaching required courses, tends to draw higher ratings.
There was also an interaction effect noted, one between the instructor’s gender and
percent of course time spent lecturing. It appears that the negative effect of lecturing is more pronounced for female instructors.
The trouble with these findings is that they reveal that sometimes, students include in
their judgment things that are beyond the teachers’ control. For instance, simply being a female
instructor for the sample of Steiner et al.’s (2006) study already greatly diminishes one’s chance of being rated on one’s merits, which is unfair when compared to the male professors. This may not be the case for every sample, but it should still be impressed on the reader’s mind that any sample might carry biases that teachers cannot control, so it is important to find them out.
Studies are ambivalent about the relationship of grades, both actual and expected, to
SETs. Landrum and Dillinger (2004) say that instructor evaluation is weakly but significantly
correlated with actual grade but not with expected grade. Moore (2006) says that neither actual
nor expected grade (referred to as anticipated grade in his study) were indicators of SETs. And
finally, Steiner et al.’s (2006) study says that expected grade is strongly correlated with SETs.
It is important to be able to establish how the grading system affects SETs because one
major issue that may invalidate the SETs is that “dumbing down” course work, making it easier to earn high grades, may artificially inflate the ratings.
To settle this question, one must first determine what use evaluation results are put to, for
the utility attached to them differs, whether for administrative decisions or for professional development. If it is the former, one must determine if the evaluation generated helps in making administrative decisions. On the other
hand, if it is the improvement of teaching quality that one is after, then one must be guided by
Remmers’ finding that evaluations do help bring about improvement, but not much (McKeachie,
1996) and by the idea that discussing the evaluation results with another teacher produces
substantial improvement (McKeachie et al., 1980 in McKeachie, 1996). This latter, however is a
voluntary act, and this is where the dedication of the instructor, teacher, or professor to his or her own improvement comes in.
Aultman (2006) suggests the use of formative evaluation as she has found through
personal experience that it brings about unexpected benefits. She conducted her own formative
evaluation, separate from the official summative evaluation conducted at the end of the semester.
This she did at the third week of the semester, giving her a lot of time to reinforce her strengths and address her weaknesses.
Her “mini-evaluation” was composed of three sections. The first section allowed students
to write down any questions regarding the course content up to that point of the term. The second
part asked students to rate her in four sections: pace of the course, clarity of her lectures, the
quality of the activities the class did to enhance their understanding, and her preparedness for
class. Finally, the third section asked for student comments on how she could improve the class.
In the next meeting, she tackled the evaluations with her students. She answered all the
content-related questions about the course and the questions raised in the evaluation about the future conduct of the course.
The results were quite astonishing. The students, seeing that their input was given
importance, soon began to raise more questions in class. Sometimes, they would come to class
early or stay behind to ask questions not only about the subject matter but occasionally also
about her dissertation progress or her family. The mini-evaluation had “served as a catalyst for improved communication between my students and me. Students saw (me) as a real person as well as (a teacher).”
What Aultman (2006) did requires a certain amount of dedication and a certain comfort
level with criticism. Students may just be the learners, but they sometimes get to see things that teachers cannot, merely by virtue of their vantage point. If teachers accept that and manage
to learn from their students, they may find that SETs as a method of improving themselves are
effective.
In 1969, Michael Scriven used the term metaevaluation to describe the evaluation of any
evaluation, evaluative tool, device or measure. Seeing how so many decisions are based on
evaluation tools (which is typically their main purpose for existence in the first place—to help
people make informed decisions), it is no wonder that the need to do metaevaluative work on such tools arises.
In the teaching profession, student evaluation of teachers stands as one of the main tools
of evaluating. However, as earlier stated, while it is but fair that students be included in the
evaluative process, depending on the evaluation process and content, it may not be very fair to
teaching professionals to have their very careers at the mercy of a potentially flawed tool.
1. Determine and Arrange to Interact with the Metaevaluation’s Stakeholders. Stakeholders can refer to anyone whose interests might be affected by the evaluation under the microscope.
2. Staff the Metaevaluation with One or More Qualified Metaevaluators. Preferably, these
should be people with technical knowledge in psychometrics and people who are familiar with
the Joint Committee Personnel Evaluation Standards. It is sound to have more than one
3. Define the Metaevaluation Questions. While this might differ on a case-to-case basis,
the four main criteria ought to be present: propriety, utility, feasibility, and accuracy.
5. Develop the Memorandum of Agreement or Contract. This will serve as a guiding tool. It contains the standards and principles agreed upon in the previous step and will help both the metaevaluators and their clients understand the direction the metaevaluation will take.
8. Analyze the Findings. Put together all the qualitative and quantitative data in such a way that they can be weighed against the agreed-upon standards.
9. Judge the Evaluation’s Adherence to the Selected Standards and/or Other Criteria. This is the truly metaevaluative step. Here, one takes the analyzed data and judges the evaluation based on the standards that were agreed upon and put down in the formal contract. In another source, this step is lumped with the previous one to form a single step (Stufflebeam, 2000).
10. Prepare and Submit the Needed Reports. This entails the finalization of the data into a
coherent report.
11. As Appropriate, Help the Client and Other Stakeholders Interpret and Apply the
Findings. This is important for helping the evaluation system under scrutiny improve by ensuring
that the clients know how to use the metaevaluative data properly.
There are four standards of metaevaluation: propriety, utility, feasibility, and accuracy.
Propriety standards were set to ensure that the evaluation in question is done in an ethical
and legal manner (P1 Service Orientation, P2 Formal Written Agreements, P3 Rights of Human Subjects, P4 Human Interactions, P5 Complete and Fair Assessment, P6 Disclosure of Findings, P7 Conflict of Interest, P8 Fiscal Responsibility). They also check to see that the welfare of all those involved in or affected by the evaluation is protected.
Utility standards stand as a check for how much the evaluation in question caters to the
information needs of its users (Widmer, 2003 in Hummel, 2003). They include: (U1) Stakeholder
Identification, (U2) Evaluator Credibility, (U3) Information Scope and Selection, (U4) Values
Identification, (U5) Report Clarity, (U6) Report Timeliness and Dissemination, and (U7)
Evaluation Impact.
Feasibility standards make sure that the evaluation “is conducted in a realistic, well-
considered, diplomatic, and cost-conscious manner” (Widmer, 2003 in Hummel, 2003). They
include: (F1) Practical Procedures, (F2) Political Viability, and (F3) Cost Effectiveness.
Finally, accuracy standards make sure that the evaluation in question produces and
disseminates information that is both valid and useable (Widmer, 2003 in Hummel, 2003). They
include: (A1) Program Documentation, (A2) Context Analysis, (A3) Described Purposes and
Procedures, (A4) Defensible Information Sources, (A5) Valid Information, (A6) Reliable Information, (A7) Systematic Information, (A8) Analysis of Quantitative Information, (A9) Analysis of Qualitative Information, (A10) Justified Conclusions, (A11) Impartial Reporting, and (A12) Metaevaluation.
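For readers who want to track the substandards during scoring, the four standards and their coded components can be captured in a simple lookup structure. The Python sketch below is illustrative only: the codes and titles are taken from the lists quoted above, while the structure, function, and variable names are our own and not part of the Joint Committee materials.

# A minimal sketch of the Joint Committee standards as a lookup table.
# Codes and titles follow the lists quoted above; the data structure
# itself is illustrative, not part of the Standards.
STANDARDS = {
    "propriety": ["P1 Service Orientation", "P2 Formal Written Agreements",
                  "P3 Rights of Human Subjects", "P4 Human Interactions",
                  "P5 Complete and Fair Assessment", "P6 Disclosure of Findings",
                  "P7 Conflict of Interest", "P8 Fiscal Responsibility"],
    "utility": ["U1 Stakeholder Identification", "U2 Evaluator Credibility",
                "U3 Information Scope and Selection", "U4 Values Identification",
                "U5 Report Clarity", "U6 Report Timeliness and Dissemination",
                "U7 Evaluation Impact"],
    "feasibility": ["F1 Practical Procedures", "F2 Political Viability",
                    "F3 Cost Effectiveness"],
    "accuracy": ["A1 Program Documentation", "A2 Context Analysis",
                 "A3 Described Purposes and Procedures",
                 "A4 Defensible Information Sources", "A5 Valid Information",
                 "A6 Reliable Information", "A7 Systematic Information",
                 "A8 Analysis of Quantitative Information",
                 "A9 Analysis of Qualitative Information",
                 "A10 Justified Conclusions", "A11 Impartial Reporting",
                 "A12 Metaevaluation"],
}

def substandards(standard: str) -> list[str]:
    """Return the coded substandards for one of the four main standards."""
    return STANDARDS[standard.lower()]

print(len(substandards("accuracy")))  # prints 12: accuracy has 12 substandards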
It should be noted that the aforementioned standards were developed primarily for the evaluation of programs rather than of personnel.
The Student Instructional Report (SIR) currently used by the College of Saint Benilde
originated from the SET form used by De La Salle University. It has been revised over the years
—instructions have been changed, and certain things were omitted from the manual. The items used today are pretty much what they were in 2000, and the instructions are more or less the same as they were then.
The SIR is administered in the eighth week of every term, the week directly after the
midterms week. The evaluands of the form are teachers; the evaluators are their students; and
other stakeholders are the chairs and deans, who use the data generated by the SIR for
administrative decisions. The results are presented to the teachers after the course cards are
given. By definition then, it is a form of summative evaluation. There is currently no data showing that the results are used for formative purposes.
The Peer Evaluation Form (PEF) is used by faculty members in observing the
performance of their colleagues. The PEF is designed to determine the extent to which the CSB
faculty have been exhibiting teaching behaviors along areas such as the teacher’s classroom procedures.
The PEF is used by a peer observer if the teacher is new in the college and due for promotion. The peer observer discusses the observation and the rating given with the faculty member evaluated.
Method
Guided Discussion
The Guided Discussion is the primary method of data-gathering for all groups concerned.
As stated above, the represented groups include the teachers, the chairs and/or deans, the CLPA-PASU staff directly involved in the evaluation process, the evaluation measurement expert team, and the students.
As suggested by Galves (1988), there are five to seven (5-7) participants for every guided discussion (GD) session. The participants for the GD were chosen by the deans of the respective schools involved. The groups included the teachers, the chairs and/or deans, the CLPA-PASU staff, and a group of students.
Separate GD sessions were conducted for each of the schools of the college because they have different needs. The scope of this study is to “assess and evaluate” the current practices undertaken in the SIR and PEF system of administration, scoring, and interpretation. In the GD sessions that were conducted, the participants are co-evaluators considering that they all employ the same PEF and the same SIR items and standards of practice.
Each of the first four groups listed discusses and evaluates along the lines of
one of the four criteria set by the Joint Committee Standards for Evaluation. The Teachers group
is set to discuss and evaluate the propriety aspect; the Chairs/Deans group, the utility aspect; the
CLPA-PASU Staff group, the feasibility aspect; the team of experts, the accuracy aspect.
Before any of the GD sessions, the list of guide questions for each group was sent to the chosen participants for a pre-screening of the topics to be discussed, at least ten days before the scheduled GD session for that group. The participants were given the liberty to request that other topics be added or that listed ones be removed.
The modified guide containing the set of questions to be covered is presented to the
participants. Three researchers play specific roles as prescribed by Galves (1988): the guide asks the questions and steers the discussion; the recorder records the points raised per question and any questions the participants may care to ask (using a medium visible to the whole group); and the observer of the process is tasked to keep the discussion on track, regulate the time per topic, and prevent anyone from monopolizing the discussion. The guide initiated the discussion
by presenting the new set of questions, at which point the participants were given another
opportunity to add or subtract topics for discussion. Once the final set of questions has been
decided upon and recorded by the recorder, responses were gathered and discussed by the group.
One key feature of the GD method is that a consensus on the issues under a topic must be
reached. When all the points were raised, the group was given the chance to look over their
responses to validate or invalidate them. Whatever the group decides to keep will be kept; what it rejects will be stricken out.
The side-along evaluation done by the observer may be done at regular points throughout the discussions as decided by the group (i.e. after each topic) and/or when he or she deems it fit to interrupt (i.e. at points when the discussion goes astray, or the participants spend too much time on a single topic).
A similar procedure was followed for the Student group. The purpose of the students’
discussion is to get information on their perspectives of the evaluation process and their suggestions for improving it.
At the end of each discussion, the participants were asked to give their opinion about the
usefulness and feasibility of having this sort of discussion every year to process their questions,
comments, doubts, and suggestions. This provides data for streamlining the metaevaluative process.
The average ratings of the professors within the last three school years (AY 2003-2004, 2004-2005, and 2005-2006) were used to generate findings on how well the results could discriminate between good teaching and teaching that needs improvement. Cronbach’s alpha was used to determine the internal consistency of the SIR items.
The average of the scores for the three terms was computed for each school year, generating three average scores. These scores were compared to each other to check the consistency of the results across school years.
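As a concrete illustration of these two computations, the sketch below derives Cronbach’s alpha from a respondents-by-items score matrix and averages the term means per school year. It is a minimal sketch under the usual definition of alpha; the score matrix and the term means shown are hypothetical placeholders, not the Center’s actual data.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of respondents' totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings: 5 students answering a 4-item scale (1-5).
ratings = np.array([[4, 5, 4, 4],
                    [3, 4, 3, 3],
                    [5, 5, 4, 5],
                    [4, 4, 4, 4],
                    [2, 3, 3, 2]])
print(round(cronbach_alpha(ratings), 3))

# Per-year average of the three term means, as described above
# (hypothetical values for one school year).
term_means = {"2003-2004": [4.10, 4.05, 4.00]}
for year, means in term_means.items():
    print(year, round(sum(means) / len(means), 2))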
Metaevaluation Checklist
A checklist was used to determine whether the evaluation meets the standard of utility,
feasibility, propriety, and accuracy. Seven experts in measurement and evaluation were invited to evaluate the system used by the CLPA in assessing teacher performance on both the Student Instructional Report (SIR) and the Peer Evaluation Form (PEF). The
metaevaluators first used a 30-item checklist adopted from the Joint Committee Standards for
Evaluation. The metaevaluators were guided by information from the guided discussion session notes (as transcribed by the recorder) and other extant data.
Instrumentation
For the GD sessions, a guide list was used. The guide is composed of a set of questions
under each standard that is meant to evaluate the evaluation system (see appendix A). The
questions in the guide are pre-written. In the data-gathering method, these are still subject to change, both in the fielding of the questions prior to the GD sessions and on the day of the GD session itself.
The Metaevaluation Checklist by Stufflebeam (2000) was used to rate the SIR and PEF
as an evaluation system. It is composed of ten items for each of the subvariables under each of
the four standards (see appendix B). The task is to check the items in each list that are applicable
in the current teacher performance evaluation system done by the center. Nine to ten (9-10) checked items (0.9-1.0) generate a rating of excellent for that particular subvariable; 0.7-0.8, very good; 0.5-0.6, good; 0.3-0.4, fair; and 0.1-0.2, poor.
Data Analysis
The data obtained from the guided discussions (ginabayang talakayan) were analyzed using a qualitative approach. The important themes from the notes produced in the discussions were extracted based on the appraisal components for each area of metaevaluation
standard. For utility, appraisal themes referring to stakeholder identification (persons affected by
the evaluation should be identified), evaluator credibility (trustworthiness and competence of the
evaluator), information scope and selection (broad selection of information/data for evaluation),
values identification (description of procedures and rationale of the evaluation), report clarity
(description of the evaluation being evaluated), and report timeliness (findings and reports distributed to stakeholders in a timely manner) were extracted. For propriety, the appraisal themes extracted were on service
orientation (designed to assist and address effectively the needs of the organization), formal agreement (obligations of the formal parties are agreed to in writing), rights of human subjects
(evaluation is conducted to respect and protect the rights of human subjects), and human
interaction (respect for human dignity and worth). For feasibility, the themes extracted were on practical procedures, political viability, fiscal viability, and legal viability. The qualitative data were used as the basis for accomplishing the metaevaluation checklist for utility, feasibility, and propriety. For accuracy, the Center’s procedures, programs, policies, documentation, and reports were made available to the metaevaluators.
In the checklist, the number of items checked under each metaevaluation standard was divided by 10 and averaged across the metaevaluators who accomplished the checklist. Each component was then interpreted as to whether the system reached the typical standards of evaluation. The scores are interpreted as 0.9 to 1.0, Excellent; 0.7 to 0.8, Very Good; 0.5 to 0.6, Good; 0.3 to 0.4, Fair; and 0.1 to 0.2, Poor.
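The scoring rule just described is simple enough to express in a few lines. In the sketch below, the band boundaries follow the checklist’s own scale, while the function names and the sample tallies are hypothetical.

def checklist_score(items_checked: int) -> float:
    """Convert the number of checked items (0-10) into a 0.0-1.0 score."""
    return items_checked / 10

def interpret(score: float) -> str:
    """Map a score onto the checklist's verbal rating bands."""
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Very Good"
    if score >= 0.5:
        return "Good"
    if score >= 0.3:
        return "Fair"
    return "Poor"

# Hypothetical tallies from three metaevaluators for one substandard:
tallies = [6, 5, 7]
mean_score = sum(checklist_score(t) for t in tallies) / len(tallies)
print(round(mean_score, 2), interpret(mean_score))  # 0.6 Good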
Results
Utility
Under utility, five standards were evaluated: stakeholder identification, information scope and selection, values identification, functional reporting, and follow-up and impact. Table 1 shows the themes and clusters formed in evaluating the utility standards.
Table 1
Themes and Clusters for the Utility Standards
For the standard on stakeholder identification, the strands were clustered into four
themes: Mode of feedback, approaches to feedback, sources of feedback, and time of giving
feedback. For the deans and chairs the mode of feedback took the form of “one on one basis,
approach is informal, post conferences, meetings, and when urgent a note is given.” The
approaches in giving feedback were both developmental (suggestions to further improve the
teaching skills) and evaluative (Standing of the faculty). The sources of feedback come from the
students through the SIR, student advising, e-mail from students and parents, and peers (senior
faculty, chairs, deans). Feedback is given “if the rating is high (3.75 and above); sometimes no
feedback is given; when the results of the SIR are low; if the faculty is new to the college; and
those who have been teaching for a long time and getting low ratings.”
For values identification, the strands were clustered into three themes: Needs, action
taken, and value of the instrument. According to the participants, the needs included “Results
(that) are (not) too cumbersome for deans to read; A print out of the results should be given; the
time taken to access the results turns off some teachers from accessing them; students having
difficulty answering the SIR; students don’t see how teaching effectiveness is measured and;
create a particular form for laboratory classes in SHRIM classes.” The action taken theme
included “removing items that are valid and another computation is done and; other evaluation
criteria is done.” The instrument value theme showed that for the instrument to be valuable,
“there should be indicators for each score; there should be factors of teaching effectiveness with
clear labels; identify what the instrument measures; there needs to be a lump score on learner-
centeredness and; there are other success indicators that are not reflected in the SIR.”
For functional reporting, two clusters emerged: decisions and functions. The decisions made on the basis of the teacher evaluation include promotion, course loading, retaining PTF, deloading
a faculty, permanency, and training enhancement. The functions of the teacher evaluation are
“used for improvement the faculty; the VPA comes up with a list of faculty that will be given
teaching load based on SIR reports and; the PEF constricts what needs to be evaluated more.”
The follow-up and impact themes included both qualitative and quantitative aspects. The qualitative
aspect of the instruments included suggestions to “give headings/labels for the different parts;
come up with dimensions and subdimensions; devise a way to reach the faculty (yahoo, emails
etc.); the teachers and students should see what aspects to improve on; and there should be
narrative explanations for the figures.” The quantitative aspect of the report included “faculty
doesn’t understand the spreading index; conduct a seminar explaining the statistics; come up
with a general global score; each area should be represented with a number; and a verbal list of interpretations (should be provided).”
Two clusters were identified for information scope and selection: perception and action.
In the perception the faculty “looks at evaluation as something negative because the school uses
the results.” For the suggested actions “come up with CLPA kit explaining the PEF and SIR;
check on the credibility on the answers of students; and SIR needs to be simplified for the
SDEAS.”
Table 2
Metaevaluation Checklist Ratings for Utility
The ratings for utility using the metaevaluation checklist showed that in most of the item areas, the performance of the teacher evaluation process is good. In particular, the area on information scope and selection is very good. However, report timeliness is poor and evaluation impact is only fair.
Propriety
Under propriety, the standards evaluated were service orientation, formal evaluation guidelines, conflict of interest, confidentiality, and helpfulness. Table 3 shows the clusters and themes extracted.
Table 3
Clusters and Themes for the Propriety Standards

Service Orientation
  Results
  • Not satisfied because the results come very late
  • Prepare hard copies of the results
  • Most faculty members could not access the results
  • PEF qualitative results are not seen online
  Examiner
  • Friendly
  • Sometimes late
  • New staff have difficulty administering the form because they could not handle deaf students
  • They are not able to answer the questions of students
Formal Evaluation Guidelines
  Students
  • They get tired of answering many SIR within the day
  Observation visits
  • Make clear who will call the teacher when the SIR is finished
  • The observer can’t make other visits
  • The PEF guidelines do not give instructions on what the observer will do
  • Not practical for the observer to go through the whole process of preobservation, observation, and post observation
Conflict of Interest
  • CLPA do not give in to requests
  • Not too many queries about the SIR
  • Because the LC is adopted by the college, more value is given to the SIR
  • SIR is not fully explained to the teacher
Confidentiality
  • The information is very confidential
Helpfulness
  • The comments are read rather than the numbers
  • It’s the comments that the teachers look at
  • The numerical results need a clearer explanation
  • Comments need to be broken down into specific factors
For service orientation, the clusters formed were on the results, the examiner, and responding. According to the participants, they were “not satisfied because the results come very late.” There is a need to “prepare hard copies of the results” because “most faculty members could not access the results” and the “PEF qualitative results are not seen online.” The participants’ appraisal of the examiners includes being “friendly, sometimes late, new staff have difficulty administering the form because they could not handle deaf students, and they are not able to answer the questions of students.” Issues in responding to the SIR were also raised.
For the formal evaluation guidelines the three areas specified were the students,
frequency of meetings, and observation visits. For the students, it was mentioned that “they get
tired of answering many SIR (forms) within the day.” In terms of the frequency of meetings,
there are “no guidelines for modular classes and team teaching; and “no SIR for OJT classes and
the teacher cannot be promoted.” In the observation visits, there is a need to “make clear who will call the teacher when the SIR is finished; the observer can’t make other visits; the PEF guidelines do not give instructions what the observer will do; and it is not practical for the observer to go through the whole process of preobservation, observation, and post observation.”
No clusters were formed for the conflict of interest. The themes extracted were “CLPA do
not give in to requests; not too many queries about the SIR; because the LC is adopted by the
college, more value is given to the SIR; and SIR is not fully explained to the teacher.”
For confidentiality, the majority of the participants agree that “the information is very confidential.”
For the area on helpfulness, the themes identified were “the comments are read rather than the numbers; it’s the comments that the teachers look at; the numerical results need a clearer explanation; and comments need to be broken down into specific factors.”
Table 4
Metaevaluation Checklist Ratings for Propriety
Most of the ratings for propriety using the metaevaluation checklist were pegged at good.
A very good rating was obtained for formal agreement. A fair rating was obtained in the areas of complete and fair assessment, conflict of interest, and fiscal responsibility.
Feasibility
Under feasibility, the standards evaluated were practical procedures, political viability, fiscal viability, and legal viability. Table 5 shows the clusters and themes for the standards.
Table 5
Clusters and Themes for the Feasibility Standards

Cost Effectiveness
  Human resources
  • The (human) resources (e.g. staff) are maximized during SIR; LASU staffers are also used, and transcripts are read with the LASU staff.
  • Well-utilized
  • Some staffers have difficulties going home because of the late hours.
  • Meals provided are okay, except they’re redundant sometimes. They serve as good compensation for the late work hours.
  Material resources
  • The scanner is worth it because it encodes the responses fast and it helps meet the deadlines.
  • The SIR process is well-supported by the College.
  • Sometimes it’s hard to administer in AKIC.
  Technology
  • The programmer is new (as of September 2006); he does not yet have a feel for the data processing, so he gets lost, and he is not yet attuned to the work flow.
  • The staff are oriented with the use of the program. The program is shared. Random checking of the comments and the editing can be done directly.
  • The faculty members do not have their own PCs, so they do not get the memo on time.
  • There are few respondents with the online evaluation. We need to maximize online administration. If all classes come together for the online evaluation, the computers hang.
Legal Viability
  Standardizing the Evaluation Setting
  • There is a common script.
  • The classroom is generally conducive to answering.
  • During C-break, some classes are affected by the noise.
For practical procedures, the clusters formed were on the understandability of the
instructions, difficulty with the comments and suggestions part, and difficulty with the instrument as a whole. These clusters show that while there are standardized procedures for every run of the
SIR, there is a difficulty following them because “generally, the students do not understand the
instructions.” The comments and suggestions section (part four) of the instrument appears to be particularly problematic; here too, the instructions do not seem to be clear to the students:
“halatang hindi naiintindihan ang instructions kasi hindi kinukumpleto ang sentence.” (It is
obvious they do not understand the instructions because they do not complete the sentence.).
Other than this, some students are not sure whether “talagang sagutan or optional” (they are
required to answer or it is optional). Others don’t feel safe answering this part because they are
35
afraid their professors will get back at them for whatever they write.
For political viability, eight issue clusters were formed. These included time issues in administration, rescheduling, frequency of evaluation, name changes, identifying and anticipating teacher-related issues, anticipating student needs, and concerns about utility. The time issues in administration covered problems
regarding the first-thirty-minutes policy observed by the Center. The time allotment is generally
too short for the whole administration procedure from giving instructions to the actual answering
of the instrument (“yung thirty minutes kulang sa pagexplain at pagadminister,” that is, the thirty minutes is not enough for explaining and administering). Teachers also have issues regarding the same policy. Some refuse to be rated in the first thirty minutes, preferring to be rated in the last thirty; there are faculty members who “dictate that the last 30 minutes will be used for evaluation”. There are others who “complain about the
duration of the SIR administration”, even if the guidelines (distributed in the eighth week of the term) already specify these details.
Though discouraged by the Center, rescheduling still does happen during the evaluation
period. Usually it is because “some of the faculty members (or their students) do not show up”.
Similarly, there are times when some students do come, but their numbers do not meet the fifty
percent quota required for each section’s evaluation. Another common reason for rescheduling is schedule conflicts with other activities: “(the) Young Hoteliers’ Exposition and some tours.”
The next issue “cluster” formed is regarding the frequency of evaluation; teachers
question whether there is a need to evaluate every term. Although there is only one strand, it is
important enough to be segregated as it gives voice to one of the interest groups’ major concerns.
The next cluster forms the biggest group, the cluster that talks about identifying and
anticipating the needs of the one of the major interest groups/stakeholders of the whole
evaluation system: the teachers themselves. Their needs range from the minor (“We need to
request for the updated list of the faculty names early in the term, a list including the faculty
members who changed their surnames with ACTC.”) to the major (“Matagal nang mali yang evaluation form,” that is, the evaluation form has long been wrong), and a lot in between. Among the latter is the need to make sure that
teachers are aware of their evaluation schedules and the Center’s policies, to come up with ways
to deal with the teachers during the actual administration, and to equip them with the know-how to carry out their part in the process.
Just as teachers, the evaluatees, have needs, so do their evaluators, their students. By not
taking care of the students’ needs and/or preferences, the Center risks generating inaccurate
results. Thus, the Center should “compile the needs of students and present it (the SIR) to (the)
students in an attractive form. (CLPA should) drum up the interest of students in the evaluation.”
Last under this area are issues on utilization. There appears to be a need to make the results more useful to the stakeholders.
For the area on cost effectiveness, the clusters formed were human resources, material
resources, and technology. The human resources of the Center are “well-utilized”. Despite special cases when the staff find it difficult to go home because of the late working hours, they feel well compensated, in part because of the meals served. As to material resources,
“the SIR process is well-supported by the College” and so, everything is generally provided.
There are special cases where the evaluation setting makes administration difficult. For instance,
“sometimes it’s hard to administer in AKIC”, especially in the food labs. Finally, under the theme
of technology, the Center proved well-equipped enough to handle the pen-and-paper instrument’s
processing. However, it may be some time before the process becomes paperless; if the memos
would be delivered online, instead of personally, as is currently done, some of the faculty would
“not get the memo on time” because “the faculty members do not have their own PCs”. Then, an
attempt was made to administer the instrument online. A problem that was noted in this regard
was “kaunting respondents with online evaluation” (very few respondents are gathered with the
online evaluation). Other than that, “if all classes come together for on-line the computers hang.”
For legal viability, only one theme was developed, standardizing the evaluation setting.
“There is a common script” to keep the instructions standardized and, although “During C-break
some classes are affected with the noise (of C-break activities)”, the “classroom is generally
conducive in answering”.
Table 6
Metaevaluation Checklist Ratings for Feasibility
For the three areas of feasibility, a good rating was obtained for practical procedures and cost effectiveness.
Accuracy
The standards of accuracy were rated based on the reliability reports of the instrument from SY 2003-2004 to 2005-2006. The trend of the term means of the SIR across the three school years was also examined.
Table 7
Internal Consistency of the Items for the SIR from 2003 to 2006

Term        2003-2004   2004-2005   2005-2006
1st Term    0.873       0.875       0.881
2nd Term    0.888       0.892       0.894
3rd Term    0.892       0.885       —
Summer      0.832       0.866       —
The reliability of the SIR form has been consistently high from 2003 to 2006. The Cronbach’s alphas obtained are all at the same high level across the terms and across the three school years. This indicates that the internal consistency of the SIR measure is stable across time.
Figure 1 shows a line graph of the means in the SIR each term across three school years.
Figure 1
Line graph of the SIR means for Parts 1, 2, and 3 per term (1st to 4th) across the three school years (y-axis: mean ratings from 3.70 to 4.40).
The trend in the means shows that the SIR results increase markedly during summer terms (4th). The increase can be observed from the spikes at the 4th term in the line graph for the three parts of the SIR instrument. The means during the first, second, and third terms remained relatively stable.
Table 8
Metaevaluation Checklist Ratings for Accuracy
The ratings for accuracy using the metaevaluation checklist were generally poor in most areas. Only systematic information was rated very good, only defensible information sources was rated good, and both reliable information and impartial reporting were rated fair.
Table 9
Summary Ratings for the Four Standards
In the four standards as a whole, feasibility (25%), propriety (25%), and utility (25%) are met only fairly, and accuracy (0%) is poor for the entire teacher performance evaluation system of the center. The poor accuracy is due to zero ratings on context analysis, qualitative information, and justified conclusions. The three standards rated as fair did not even meet half of the criteria in the checklist.
Figure 2
Outcome ratings for each of the four standards (utility, propriety, accuracy, and feasibility).
Discussion
The overall findings in the metaevaluation of the teacher evaluation system at the Center
for Learning and Performance Assessment show that it falls below the standards of the Joint
Committee on Evaluation. The ratings of utility, propriety, and feasibility were fair and the standard on accuracy was poor.
In the standard of utility, the report timeliness and dissemination is poor. This is due to the
lack of timely exchanges with the full range of right-to-know audiences. In order to improve the
timely exchanges, the Center needs to conduct consistent communication with the different offices and stakeholders concerned.
For propriety, the rating is only fair because low ratings were obtained for complete and
fair assessment, conflict of interest, and fiscal responsibility. To improve complete and fair
assessment, there is a need to assess and report the strengths and weaknesses of the procedure,
use the strengths to overcome weaknesses, and estimate the effects of the evaluation’s limitations on the overall judgment of the system. In line with conflict of interest, there is a need to release evaluation procedures, data, and reports for public review. For fiscal responsibility, there is a need to keep adequate personnel records concerning job allocations and time spent on the evaluation.
For accuracy, low ratings were obtained for program documentation, context analysis, described purposes and procedures, valid information, analysis of qualitative and quantitative information, justified conclusions, and metaevaluation. For program
documentation, the only criterion met was the technical report that documents the program’s operations; all other nine criteria were not met. For context analysis, all criteria were not met. In
described purposes and procedures, only the record of the client’s purpose of evaluation and
implementation of actual evaluation procedures were met. All other eight criteria were not met.
For valid information, there is a need to focus evaluation on key ideas, employ multiple
measures to address each idea, provide detailed descriptions of the constructs assessed, report the type of information each employed procedure acquires, report and justify inferences, and report the patterns and recurrent themes found through qualitative analysis. In the analysis of qualitative and quantitative
information, there is a need to conduct exploratory analysis to assure data correctness, choose
procedures appropriate to the system of evaluating teachers, specify assumptions being met by
the evaluation, report limitations of each analytic procedure, examine outliers and verify correctness, analyze statistical interactions, and use displays to clarify the presentation and interpretation of statistical results. In the areas of justified conclusions and metaevaluation, all criteria were not met. To improve political viability, the evaluation needs to consider ways to counteract attempts to bias or misapply the findings, foster cooperation, involve stakeholders throughout the evaluation, and issue interim reports.
Given the present condition of the SIR and PEF in evaluating faculty performance based
on the qualitative data, there are still gaps that need to be addressed in line with the evaluation
system. The stakeholders are largely not yet aware of the detailed standards on conducting
evaluations among their faculty and what is verbalized in the qualitative data is only based on
their personal experience and the practices required of the evaluation system. By contrast, the
standards on evaluation would specify more details that need to be met in the evaluation. Some
areas in the evaluation are interpreted by the stakeholders as acceptable based on the themes of
the qualitative data, but more criteria need to be met across the full range of teacher evaluation practices. It is
recommended for the Center for Learning and Performance Assessment to consider the specific
areas found wanting under utility, propriety, feasibility, and especially accuracy to attain quality
References
Aultman, L.P. (2006). An unexpected benefit of formal student evaluations. College Teaching,
Benson, M., & Otten, A. (2005, September). Teachers’ rights review. The Professional, 5.
06/The%20Professional%202005-09.htm
Danielson, C., & McGreal, T.L. (2000). Teacher evaluation to enhance professional practice. Alexandria, VA: Association for Supervision and Curriculum Development.
Differentiated supervision. Bucks County Schools Intermediate Unit, Special Education News.
Frederick Winslow Taylor: Scientific Management. (n.d.). Retrieved September 6, 2006, from
http://en.wikipedia.org/wiki/Frederick_Winslow_Taylor#The_development_of_scientific_manag
ement
http://students.ed.uiuc.edu/vgonzale/eport/eol2.html
Greenberg, M.S. (2001). Student evaluations of teacher effectiveness in the workplace: Mask
Hummel, B. (2003). Metaevaluation. Retrieved from http://www.bhummel.com/Metaevaluation/index.html
Isaacs, J.S. (2003). A study of teacher evaluation methods found in select Virginia secondary
public schools using the 4x4 model of block scheduling. Unpublished doctoral dissertation,
Landrum, R.E., & Dillinger, R.J. (2004). The relationship between student performance and instructor evaluation revisited. The Journal of Classroom Interaction, 39(2), 5. Retrieved from ProQuest database.
Mallory, A.L. (2006, August 10). Teacher evaluation law criticized [Electronic version]. Knight
Ridder Business Tribune News, p.1 Retrieved September 6, 2006, from ProQuest database.
McKeachie, W.J. (1996). Student ratings of teaching. In The professional evaluation of teaching (American Council of Learned Societies Occasional Paper No. 33). Retrieved from http://www.acls.org/op33.htm#TOP
Moore, T. (2006). Teacher evaluations and grades: Additional evidence. The Journal of American Academy of Business, Cambridge.
Steiner, S., Holley, L.C., Gerdes, K., & Campbell, H.E. (2006). Evaluating teaching: Listening to students while acknowledging bias. Journal of Social Work Education, 42(2), 355. Retrieved from ProQuest database.
Wright, R.E. (2006). Student evaluations of faculty: Concerns raised in the literature, and
possible solutions. College Student Journal, 40(2), 417. Retrieved from ProQuest Psychology
Journals.
Appendix A
Guide (Gabay) based on the Joint Committee Standards for Evaluation
Propriety
1. Does PASU provide quality service in delivering the SIR? (service orientation)
2. Are the guidelines for the SIR clear? (formal evaluation guidelines)
3. Does PASU answer appropriately to the queries on SIR and PEF? (conflict of
interest)
4. Is confidentiality of information maintained? (access to personnel evaluation)
5. Does the SIR and PEF help teacher improve their teaching performance?
(interaction with evaluatees)
6. What items are applicable in your area?
Utility
1. Do you provide feedback on the rating of your faculty? (constructive orientation)
2. Do the SIR and PEF fit the needs of your school? (defined uses)
3. Is the SIR conducted professionally, and does the PEF facilitate the feedback process? (evaluator credibility)
4. Are the SIR and PEF helpful in making decisions and assigning teaching loads? (functional reporting)
5. Are the generated reports clear to faculty and chairpersons? (follow-up and impact)
Feasibility
1. Are the instructions for the SIR clear to students? (practical procedures)
2. What parts of the policies and procedures for the SIR need to be appealed and rectified? (political viability)
3. Are the resources used effectively? (fiscal viability)
4. Does the process adhere to testing standards? (legal viability)
Accuracy
1. Is the staff generally qualified to administer the evaluation? (defined roles)
2. Are the conditions of the students and faculty considered during administration of
the SIR/PEF? (work environment)
3. Is the processing of the SIR/PEF well documented? (documentation and
procedures)
4. Is the staff well-trained in the scoring, coding and data entry? (systematic data
control)
5. Are there safeguards against potential biases? (bias control)
6. Is the evaluation process periodically assessed? (monitoring and control)
Appendix B
Metaevaluation Checklist
The Metaevaluation Checklist: For Evaluating Evaluations against The Program Evaluation Standards

Each standard below is checked against its list of items; the number of items met is divided by 10 (Total ÷ 10) and interpreted as: 0.9–1.0 Excellent, 0.7–0.8 Very Good, 0.5–0.6 Good, 0.3–0.4 Fair, 0.1–0.2 Poor.

Accuracy
A1 Program Documentation
- Collect descriptions of the intended program from the client and various stakeholders
- Analyze discrepancies between the various descriptions of how the program was intended to function
- Analyze discrepancies between how the program was intended to operate and how it actually operated
- Ask the client and various stakeholders to assess the accuracy of recorded descriptions of both the intended and the actual program
- Produce a technical report that documents the program's operations
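As an illustration of the scoring rule stated above, the sketch below computes a standard's Total ÷ 10 score and its verbal rating. This is a minimal sketch only, assuming each standard carries ten yes/no checkpoints as in the full checklist; the function name and the example responses are hypothetical, not part of the instrument.

```python
# Illustrative scoring of one checklist standard: each standard is
# assumed to carry ten yes/no checkpoints, so Total / 10 yields the
# proportion met, which the checklist maps to a verbal rating.

def rate_standard(items_met):
    """Return (score, label) for a list of booleans, one per checkpoint."""
    score = sum(items_met) / 10.0
    if score >= 0.9:
        label = "Excellent"
    elif score >= 0.7:
        label = "Very Good"
    elif score >= 0.5:
        label = "Good"
    elif score >= 0.3:
        label = "Fair"
    else:
        label = "Poor"
    return score, label

# Hypothetical example: 6 of the 10 A1 checkpoints are met.
score, label = rate_standard([True] * 6 + [False] * 4)
print(score, label)  # 0.6 Good
```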
A2 Context Analysis
- Analyze how the program's context is similar to or different from contexts where the program might be adopted
- Identify and describe any critical competitors to this program that functioned at the same time and in the program's environment
- Describe how people in the program's general area perceived the program
A3 Described Purposes and Procedures
- At the evaluation's outset, record the client's purposes for the evaluation
- Monitor and describe how the evaluation's purposes stay the same or change over time
- When interpreting findings, take into account the different stakeholders' intended uses of the evaluation
- When interpreting findings, take into account the extent to which the intended procedures were effectively executed
- Describe the evaluation's purposes and procedures in the summary and full-length evaluation reports
- As feasible, engage independent evaluators to monitor and evaluate the evaluation's purposes and procedures

A4 Defensible Information Sources
- Document, justify, and report the criteria and methods used to select information sources
- Document, justify, and report the means used to obtain information from each source
A5 Valid Information
- Assess and report what type of information each employed procedure acquires
- Document how information from each procedure was scored, analyzed, and interpreted

A6 Reliable Information
- For each employed data collection device, specify the unit of analysis
- As feasible, choose measuring devices that in the past have shown acceptable levels of reliability for their intended uses
- Pilot test new instruments in order to identify and control sources of error

A7 Systematic Information
- When feasible, use multiple evaluators and check the consistency of their work
- Proofread and verify data tables generated from computer output or other means
A8 Analysis of Quantitative Information
- Choose procedures appropriate for the evaluation questions and nature of the data
- For each procedure specify how its key assumptions are being met

A9 Analysis of Qualitative Information

A10 Justified Conclusions
- Limit conclusions to the applicable time periods, contexts, purposes, and activities
- Obtain and address the results of a prerelease review of the draft evaluation report
- Report the evaluation's limitations

A11 Impartial Reporting
- Establish and follow appropriate plans for releasing findings to all right-to-know audiences
A12 Metaevaluation
- Record the full range of information needed to judge the evaluation against the stipulated standards
- Determine and record which audiences will receive the metaevaluation report
- Evaluate the instrumentation, data collection, data handling, coding, and analysis against the relevant standards
Feasibility

F1 Practical Procedures
- Minimize disruption
- Train staff

F2 Political Viability
- Foster cooperation

F3 Cost Effectiveness
- Be efficient
- Inform decisions
- Minimize disruptions
- Minimize time demands on program personnel
Propriety

P1 Service Orientation
- Help assure that the full range of rightful program beneficiaries are served

P2 Formal Agreements
Reach advance written agreements covering:
- Audiences
- Evaluation reports
- Editing
- Release of reports
- Confidentiality/anonymity of data
- Evaluation staff
- Metaevaluation
- Evaluation resources
P3 Rights of Human Subjects
- Make clear to stakeholders that the evaluation will respect and protect the rights of human subjects
- Respect diversity
- Follow protocol

P4 Human Interactions
- Minimize disruption

P5 Complete and Fair Assessment
- As appropriate, show how the program's strengths could be used to overcome its weaknesses
P6 Disclosure of Findings
- Report relevant points of view of both supporters and critics of the program

P7 Conflict of Interest
- When appropriate, release evaluation procedures, data, and reports for public review
- Contract with the funding authority rather than the funded program

P8 Fiscal Responsibility
- Maintain adequate personnel records concerning job allocations and time spent on the job
Utility

U1 Stakeholder Identification

U2 Evaluator Credibility

U3 Information Scope and Selection
- Assure that evaluator and client negotiate pertinent audiences, questions, and required information

U4 Values Identification

U5 Report Clarity

U6 Report Timeliness and Dissemination
- Employ effective media for reaching and informing the different audiences

U7 Evaluation Impact
- Show stakeholders how they might use the findings in their work
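The study reports one verbal rating per category of standards (e.g., accuracy rated poor), and the sketch below illustrates one plausible way such a rating could be derived from the per-standard scores. It assumes a simple mean, which the study does not specify, and the scores supplied for the twelve accuracy standards are hypothetical.

```python
# Hypothetical aggregation of per-standard scores into one category
# rating. A simple mean is assumed here for illustration only; the
# source does not state its aggregation rule.

SCALE = [(0.9, "Excellent"), (0.7, "Very Good"), (0.5, "Good"),
         (0.3, "Fair"), (0.0, "Poor")]

def rate_category(standard_scores):
    """Average per-standard scores (each already Total / 10) and label."""
    mean = sum(standard_scores) / len(standard_scores)
    label = next(lab for cutoff, lab in SCALE if mean >= cutoff)
    return round(mean, 2), label

# Made-up scores for the twelve accuracy standards A1..A12.
print(rate_category([0.2, 0.1, 0.3, 0.2, 0.1, 0.2,
                     0.3, 0.1, 0.2, 0.2, 0.1, 0.2]))  # (0.18, 'Poor')
```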