A. Practicality
An effective test is practical. This means that it
- Is not excessively expensive,
- Stays within appropriate time constraints,
- Is relatively easy to administer, and
- Has a scoring/evaluation procedure that is specific and time-efficient.
A test that is prohibitively expensive is impractical. A test of language proficiency that takes a student five hours to complete is impractical: it consumes more time (and money) than necessary to accomplish its objective. A test that requires individual one-on-one proctoring is impractical for a group of several hundred test-takers and only a handful of examiners. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations.
B. Reliability
A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The issue of reliability may best be addressed by considering a number of factors that can contribute to the unreliability of a test. Consider the following possibilities (adapted from Mousavi, 2002, p. 804): fluctuations in the student, in scoring, in test administration, and in the test itself.
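To make the idea of consistency concrete, here is a minimal sketch of test-retest reliability expressed as a correlation between two administrations of the same test. The scores and the pure-Python correlation function are illustrative assumptions, not from the source.

```python
# Hypothetical scores (illustrative only) for the same five students
# on two administrations of the same test.
first_try = [78, 85, 62, 90, 71]
second_try = [75, 88, 60, 92, 69]

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# A coefficient near 1.0 means the two administrations rank and score
# the students consistently, i.e., the test behaves reliably.
print(f"test-retest reliability: r = {pearson_r(first_try, second_try):.2f}")
```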
- Student-Related Reliability
The most common learner-related issue in reliability is caused by temporary illness, fatigue, a “bad day,” anxiety, and other physical or psychological factors, which may make an “observed” score deviate from one’s “true” score. Also included in this category are such factors as a test-taker’s “test-wiseness” or strategies for efficient test taking (Mousavi, 2002, p. 804).
- Rater Reliability
Human error, subjectivity, and bias may enter into the scoring process. Inter-rater unreliability occurs when two or more scorers yield inconsistent scores on the same test, possibly because of lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the placement test, the initial scoring plan for the dictations was found to be unreliable; that is, the two scorers were not applying the same standards.
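One common way to quantify how closely two scorers agree is Cohen's kappa, which corrects raw agreement for chance. The sketch below is illustrative only; the band scores and the two-rater scenario are invented, not taken from the placement-test story in the source.

```python
# Hypothetical band scores (1-4) that two raters assigned to the
# same ten dictation papers; illustrative only.
rater_a = [3, 2, 4, 3, 1, 2, 4, 3, 2, 3]
rater_b = [3, 3, 4, 2, 1, 2, 4, 3, 3, 3]

def cohens_kappa(a, b):
    """Cohen's kappa: inter-rater agreement corrected for chance."""
    n = len(a)
    categories = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: probability that both raters independently
    # choose the same category.
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

# Kappa near 1.0 suggests the raters apply the same standards;
# kappa near 0 means agreement is no better than chance.
print(f"inter-rater agreement: kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```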
- Test Administration Reliability
Unreliability may also result from the conditions in which the test is administered. I once
witnessed the administration of a test of aural comprehension in which a tape
recorder played items for comprehension, but because of street noise outside
the building, students sitting next to windows could not hear the tape
accurately. This was a clear case of unreliability caused by the conditions of
the test administration. Other sources of unreliability are found in
photocopying variations, the amount of light in different parts of the room,
variations in temperature, and even the condition of desks and chairs.
- Test Reliability
Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly. Timed tests may discriminate against students who do not perform well on a test with a time limit. We all know people (and you may be included in this category!) who “know” the course material perfectly but who are adversely affected by the presence of a clock ticking away. Poorly written test items (that are ambiguous or that have more than one correct answer) may be a further source of test unreliability.
C. Validity
By far the most complex criterion of an effective test, and arguably the most important principle, is validity, “the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment” (Gronlund, 1998, p. 226). A valid test of reading ability actually measures reading ability, not 20/20 vision, nor previous knowledge in a subject, nor some other variable of questionable relevance. To measure writing ability, one might ask students to write as many words as they can in 15 minutes, then simply count the words for the final score. Such a test would be easy to administer (practical), and the scoring quite dependable (reliable). But it would not constitute a valid test of writing ability without some consideration of comprehensibility, rhetorical discourse elements, and the organization of ideas, among other factors.
- Content-Related Evidence
If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-takers to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003). You can usually identify content-related evidence observationally if you can clearly define the achievement that you are measuring.
- Criterion-Related Evidence
A second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity, or the extent to which the “criterion” of the test has actually been reached. You will recall that in Chapter 1 it was noted that most classroom-based assessment with teacher-designed tests fits the concept of criterion-referenced assessment. In such tests, specified classroom objectives are measured, and implied predetermined levels of performance are expected to be reached (80 percent is considered a minimal passing grade).
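In criterion-referenced scoring, each test-taker is judged against the predetermined performance level rather than against other test-takers. The minimal sketch below assumes the 80 percent threshold mentioned above; the student names and scores are invented for illustration.

```python
# Criterion-referenced scoring against a fixed performance level
# (80 percent, as in the text). Names and scores are hypothetical.
PASS_THRESHOLD = 0.80

def meets_criterion(correct, total, threshold=PASS_THRESHOLD):
    """True if the test-taker reached the predetermined criterion."""
    return correct / total >= threshold

# Each student is judged against the objective, not against classmates.
results = {"Ana": (42, 50), "Ben": (39, 50), "Chi": (48, 50)}
for name, (correct, total) in results.items():
    verdict = "meets criterion" if meets_criterion(correct, total) else "below criterion"
    print(f"{name}: {correct}/{total} -> {verdict}")
```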
- Construct-Related Evidence
A third kind of evidence that can support validity, but one that does not play as large a role for classroom teachers, is construct-related validity, commonly referred to as construct validity. A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions. Constructs may or may not be directly or empirically measured; their verification often requires inferential data.
- Consequential Validity
As well as the above three widely accepted forms of evidence that may be introduced to support the validity of an assessment, two other categories may be of some interest and utility in your own quest for validating classroom tests. Messick (1989), Gronlund (1998), McNamara (2000), and Brindley (2001), among others, underscore the potential importance of the consequences of using an assessment. Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test’s interpretation and use.
- Face Validity
An important facet of consequential validity is the extent to which “students view the assessment as fair, relevant, and useful for improving learning” (Gronlund, 1998, p. 210), or what is popularly known as face validity. “Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers” (Mousavi, 2002, p. 244).
D. Authenticity
A fourth major principle of language testing is authenticity, a concept that is a little slippery to define, especially within the art and science of evaluating and designing tests. Bachman and Palmer (1996, p. 23) define authenticity as “the degree of correspondence of the characteristics of a given language test task to the features of a target language task,” and then suggest an agenda for identifying those target language tasks and for transforming them into valid test items.
E. Washback
A facet of consequential validity, discussed above, is “the effect of testing on teaching and learning” (Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback. In large-scale assessment, washback generally refers to the effects a test has on instruction in terms of how students prepare for the test. “Cram” courses and “teaching to the test” are examples of such washback. Another form of washback that occurs more in classroom assessment is the information that “washes back” to students in the form of useful diagnoses of strengths and weaknesses. Washback also includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment.
F. Applying Principles to the Evaluation of Classroom Tests
The five principles of practicality, reliability, validity, authenticity, and washback go a long way toward providing useful guidelines for both evaluating an existing assessment procedure and designing one on your own. Quizzes, tests, final exams, and standardized proficiency tests can all be scrutinized through these five lenses.
- Are the test procedures practical?
- Is the test reliable?
- Does the procedure demonstrate content validity?
- Is the procedure face-valid and “biased for best”?
- Are the test tasks as authentic as possible?
- Does the test offer beneficial washback to the learner?
Reference:
Brown, H. Douglas. 2004. Language Assessment: Principles and Classroom Practices. New York: Longman.