EDITED BY:
J. CHARLES ALDERSON
ALAN BERETTA
Introduction
During the academic year 1985-6,
an eight-month experimental course in first-year German was offered at the
University of Utah. The experiment was designed to test the feasibility of using a radical
implementation of Krashen’s theory of second language acquisition.
Steven Sternfeld directed the program, and Paul Kramer and Louise Lybbert
Nygaard taught the experimental courses. These individuals, along with several
graduate students in the Department of Linguistics, developed and administered
the tests and questionnaires used to evaluate the courses. I assisted in the
planning phase of the evaluation and worked with the two German instructors
(Kramer and Nygaard) in developing and refining the instructional methodology.
I was also involved in designing the attitude questionnaires. Kramer is
responsible for the analysis of the German performance measures, which were
scored by him and several other raters. Thus, this chapter should be considered
a report on the program evaluation efforts of a fairly large team of
individuals. When we started thinking about how to evaluate the experimental
German course, we had to consider the options, organize them, and select those
in which we were interested. We would have found it useful to have on hand a
document describing how someone else had dealt with the issues we were facing,
and we might have saved considerable time and energy in developing a plan that
worked for us. The general purpose of this chapter is to provide other
researchers with this kind of a head start. First, I will describe the program
briefly. Then, following Alderson (1987) I will discuss options we considered
when addressing six questions:
1) why evaluate,
2) when to evaluate,
3) whom to evaluate,
4) what to evaluate,
5) how to evaluate, and
6) for whom to evaluate.
Finally, I will comment briefly
on some of the findings.
Description of the program
Theory behind the program
KRASHEN’S GENERAL THEORY
Krashen (1982, 1985) hypothesizes that a language is
acquired (picked up in such a way that it can be used effortlessly and without
conscious thought) in only one way:
a) by exposure to interesting, comprehensible input
b) under non-threatening conditions.
These are Krashen’s first two hypotheses: the ‘input
hypothesis’ and the ‘affective filter hypothesis’. A
third hypothesis, the ‘production emerges hypothesis’, is
that
c) speech emerges as
the result of acquisition. Thus, according to Krashen’s theory no production
practice is needed for subconscious language acquisition, nor is any analysis
or explanation of the target language. In addition, Krashen’s theory includes
other hypotheses. For example, his full theory includes a hypothesis that
conscious ‘learning’ (through analysis, practice, and explanation) is possible and
is useful in certain ways; however, conscious learning is not thought to be
useful in developing acquired competence, and consciously learned material is
not thought to become ‘acquired’ as the result of practice.
VERSION OF THE THEORY TESTED
In the experimental program, we
decided to test as ‘pure’ a version of Krashen’s theory as possible. Therefore,
we only provided interesting, comprehensible input. We deliberately avoided
activities that would tend to produce conscious learning so that we might observe
the effects of the subconscious acquisition process. We also avoided requiring
the students to speak for three reasons: speaking might create stress, raise
the affective filter, and block acquisition; speaking (according to Krashen’s
theory) is not required for speech to emerge; and early speaking might lead to acquisition
of deviant forms through exposure to a large amount of imperfect German
(Krashen 1985:46-7).
Methodology
TEACHING
J. Marvin Brown (director of the
Thai Language Program at the A.U.A. Language Center in Bangkok, Thailand) and I
developed the general teaching methodology used in the study. This is described
in detail in a teachers’ manual called The Listening Approach: methods and
materials for applying Krashen’s Input Hypothesis (Brown and Palmer 1988).
Probably the most distinctive
characteristic of the Listening Approach is that there are two teachers. The
teachers talk and interact primarily with each other but always try to make
what they say both comprehensible and interesting to the students. The students
just look, listen, and try to understand what is happening. They try to keep
their attention on the meaning, not the language. And what is more, they are
advised not to construct sentences consciously in order to speak but, instead,
to wait until the language comes out by itself (‘emerges’ in Krashen’s terms).
In addition to receiving spoken input, the students are also exposed to written
input. The teachers talk about written material which the students see on
overhead transparencies and handouts, and the students also read interesting,
simple, contextualized material on their own.
ACTIVITIES
The activities were selected from
the large variety described in The Listening Approach and were sequenced
entirely according to their potential interest value and ease of comprehension.
That is, the teachers tried to use activities which would prove as interesting
as possible yet still be comprehensible. They made no conscious attempt to
sequence the activities according to notional, functional, or grammatical
criteria. And there was no student textbook. A typical class included
activities such as the following:
1 discussion: The weather (the
previous night’s heavy snowstorm).
2 discussion: A teacher’s cold
and possible remedies.
3 discussion: The geography of
Germany.
4 game: The teachers read from
sheets containing information about the students. The students guessed which
student was being described.
5 reading: The teachers read a
fairy tale aloud to the students from an overhead transparency and discussed
it.
STUDENTS
The students were drawn from a population of
regular university students enrolled in a first-year German course offered for
credit at the University of Utah. The class consisted of two ordinary sections
of German 101 combined into a single class. The students were not told that the
method to be used was experimental until after they had signed up for the
course. The control group consisted of students who had signed up for a
different section of the same course.
CLASSES
Classes met five days a week for
one hour a day. The entire program continued for three ten-week terms, for a
total of about 150 hours of instruction.
TEACHERS
The two teachers, Paul Kramer and
Louise Nygaard, graduate students in the University of Utah’s Department of
Foreign Languages and experienced in contemporary language teaching
methodology, taught the classes.
Why evaluate?
One of the first questions we
asked was why we were evaluating the program. We considered a number of possible
reasons. One was to find out whether our program was feasible: could the
teachers teach it and would the students put up with it? Another, as noted above,
was to find out whether the program was productive: would it produce the
results one might expect of it given the claims of the theory of language
acquisition upon which it was based? Another was to find out whether the
program was appealing. Perhaps the program could be taught, but would it be
enjoyed?
We decided to address all of
these issues, but our primary reason for evaluating the program was to find out
more about language acquisition theory. Therefore, our initial program design
and any modifications we made to it during the year had to be consistent with
the pure version of Krashen’s theory we were interested in testing, so that we
could be as clear as possible about which teaching processes were being tested
(Beretta 1986a:296). As a result, we accepted the possibility that students
would be resistant to some of the elements in the program, and we decided not
to modify the program in ways that would make conclusions about the validity of
the theory difficult to arrive at.
When to evaluate?
Our next question was when to
evaluate. Should we evaluate the program periodically during the year (formative
evaluation), or should we wait until the end (summative evaluation)? Our
decision was based upon both practical and theoretical concerns.
On the practical side, we were
busy keeping up with the instructional side of the operation (which included visiting
classes, giving one another feedback, planning lessons, etc.), and we had
little time to design and implement a systematic formative evaluation study.
And on the theoretical side, we were
strongly influenced by Krashen’s affective filter hypothesis. According to this
hypothesis, students acquire best under non-threatening conditions, and large-scale
formative evaluation might be likely to raise the filter (Krashen 1982:30-2).
As a result, we decided to
evaluate the program in two phases: a small-scale, somewhat informal formative
evaluation of students’ attitudes toward the program and a large-scale, formal,
summative evaluation of the students’ language abilities. We thought that the
formative evaluation of attitudes would allow us to keep track of attitude
changes as they occurred, which would help us adjust the input to keep it
interesting and comprehensible. The summative evaluation of students’ language
ability would allow us to determine how well the students could perform at the
end of the program.
Whom to evaluate?
I will break down this question
into two main issues. The first is whether to evaluate the experimental group
by itself or to compare the experimental group with a control group. After
discussing this question, I will turn to the issue of whom to evaluate within
the group(s): teachers, students, administrators, or some combination of the
three.
Comparative or independent group evaluation
COMPARATIVE EVALUATION
When we first started to think
about how to evaluate the program, what came to mind immediately was a methods
comparison study using experimental and control groups. Such a study would
provide us a means of comparing results of the input-based program with the
results of traditional (eclectic) instruction, along the lines of the studies
by Asher, Kusudo and de la Torre (1983); Burger (1989); Edwards et al. (1984); Hauptman,
Wesche and Ready (1988); Lafayette and Buscaglia (1985); Lightbown (1989); and
Sternfeld (1989).
While comparative studies are
interesting, designing and interpreting them is difficult. Ideally, students
should be assigned to the two groups at random, or if this is not possible,
attempts should be made to control for self-selection. Also, teacher variables
are difficult to control. The results of the Pennsylvania Study (Smith and
Baranyi 1968) indicated that the teacher variable was more important than the
method used. In addition, experimental programs are likely to be new, and one
might expect that they would generate more excitement and enthusiasm than the
traditional programs (the Hawthorne Effect). Finally, the goals of programs employing
different language teaching methods are likely to be quite different, which
would make finding criterion measures appropriate to both programs difficult
(Beretta 1986b).
On the other hand, such studies
do ask questions about alternatives, which many people find interesting. And
while they carry along with them design problems (Kramer and Palmer 1990) and
testing problems (Palmer 1990), when a number of such studies are carried out,
questions of interpretation which come up for one study may be answered by
another study conducted under different conditions. For example, if one
suspects that a negative outcome in one study might be the result of instructor
variables (rather than the method used) but then finds similar negative
outcomes in a study conducted with instructors known to be superior, one might
be able to rule out the instructors as a confounding variable and begin to draw
some general conclusions about the effect of the method itself.
INDEPENDENT EVALUATION
The other alternative would be to
evaluate the experimental program on its own merits, which would involve
stating the objectives of the program and collecting data to evaluate how well
these objectives had been met. Such a study would avoid many of the problems
described above, but it could not provide the kinds of comparative data which
many people want.
SOLUTION
We decided to conduct both
independent and comparative evaluations. To evaluate the experimental treatment
on its own merits, we specified objectives for the experimental program,
against which we could compare actual outcomes. To compare the relative
effectiveness of the experimental method with something else, we created a
control group of traditionally taught students and gave the same set of
proficiency tests to each group.
People affected by program
outcomes
Whether a program is or is not
considered successful may depend upon whom we ask. We may decide that it
succeeds if the students learn something and are happy with the learning
experience. In this case, we would obtain measures or descriptions of student
behavior. Or we might decide to evaluate the program in terms of teachers’
attitudes. Are they pleased with their teaching and with their perceptions of
what their students learned? Or we might evaluate a program’s success in terms
of administrators’ attitudes. Or, as Rodgers (1986) suggests, we might even go
outside of the educational setting altogether and evaluate the program in terms
of the attitudes of people within the community.
We decided to obtain information
from students, teachers, and administrators. The following is a discussion of
the kinds of information we obtained from each population.
What to evaluate?
In attempting to make sense of
the issue of what to evaluate, we first had to decide how to organize the
options. We found Bloom’s cognitive, behavioral, and affective domains
(Krathwohl, Bloom and Masia 1956) to be a useful framework.
Cognitive domain
The cognitive domain consists of
knowledge about the language. In input-intensive, acquisition-based programs
such as ours, there is little to evaluate within this domain. Such programs do
not present grammar explicitly (cognitively). Nor do they expect the
subconscious acquisition process to lead to cognitive control. Thus, we decided
not to evaluate cognitive changes or knowledge, even though such outcomes would
be expected for the control group.
Behavioral domain
To evaluate students’ behavioral
performance, we needed to start by determining what kinds of outcomes we could
reasonably expect of students completing the program. The most specific
statement we could find came from Krashen and Terrell (1983):
After 100-150 hours of Natural Approach Spanish, you
will be able to: ‘Get around’ in Spanish; you will be able to communicate with
a monolingual native speaker of Spanish without difficulty; read most ordinary
texts in Spanish with some use of a dictionary; [and] know enough Spanish to
continue to improve on your own.
After 100-150 hours of Natural Approach Spanish, you
will not be able to: pass for a native speaker; use Spanish as easily as you
use English; understand native speakers when they talk to each other (you will
probably not be able to eavesdrop successfully); use Spanish on the telephone
with great comfort; [or] participate easily in a conversation with several
other native speakers on unfamiliar topics.
(Krashen and Terrell 1983:74)
Krashen and Terrell also discuss
the role of grammar in defining expected outcomes (pp. 71-2). While they
indicate that beginning students may simply string together appropriate lexical
items in some logical order, they do not expect that students will continue to
use these simple stringing techniques. They believe that the language
acquisition that takes place in the classroom will result in students acquiring
grammatical competence. Thus, based upon Krashen and Terrell’s guidelines,
appropriate tests for the experimental group might assess the students’
listening, speaking, and reading skills, as well as their subconscious control
of grammar.
We did not have available a
description of behavioral objectives for the control group, so we had to rely on
our experience as teachers and administrators to inform our selection of tests.
We concluded that tests of listening, speaking, and reading skills, as well as
subconscious control of grammar (already needed to evaluate the experimental
group) would also be appropriate for the control group. In addition, we decided
that one of the objectives for the control group would be developing writing
skills.
Thus, our final selection of
tests included measures of the four skills (listening, speaking, reading, and
writing) and two elements (grammar and vocabulary).
Affective domain
In addition to evaluating
behavioral outcomes, we also wanted to evaluate the attitudes of three populations
(students, teachers, and administrators) toward the program. In deciding what
attitudes to assess, we tried to balance the advice of others, lists of
questions appearing in other research studies, and our own research interests
in preparing our questionnaires.
STUDENTS
We divided our questions to the
students into three types. The first type consisted of questions about fairly
general aspects of their attitude toward the program. The process we used in
preparing these questions included discussions with experts in program
evaluation, discussions with people involved in the program, examination of
other program evaluation studies, and examination of general issues in program
evaluation (for example, Beretta 1986a and 1986b; Richards and Rodgers 1986, Chapter
11). We eventually elicited the students’ opinions on the following topics:
1. reasons for studying German at the beginning
of the program and at present: fun, curiosity, for use in future studies, for
use in work, to satisfy a university requirement, to get a good grade, and
importance of developing proficiency in the four skills and grammar
2. current satisfaction with their instruction in
the areas of listening, speaking, reading, writing, and grammar
3. satisfaction with their instruction in the
areas of listening, speaking, reading, writing, and grammar compared with their
expectations when they began the experimental course
4. confidence that they could cope with
listening, speaking, reading, and writing German in a German-speaking country
5. interest
in being in contact with German culture and language outside of class
6. interest in electing sheltered subject matter
(subject matter courses taught entirely in German to students who are
non-native speakers of German)
7. interest in continuing to study German
8. importance of absence of pressure in class and
receiving a good grade
9. satisfaction with the cost of the program in
terms of time and money
10. appropriateness of the levels of spoken and
written German to which they were exposed in class
11. what
they would want more of, and less of, in class
12. how
much fun they had in class
13. confidence
in the method
14. whether or not they would recommend the class
to their friends
15. level of participation
16. amount
of structure (organization) they liked in a language class
17. amount
of risk they enjoy in class
18. amount
of interaction with classmates they enjoy
19. opinions
of each instructor’s control of language, ability to teach, and personality
20. satisfaction
with the way the instructors interacted
The second type
of question elicited students’ opinions about specific learning activities: Did
they like the activities and find them interesting? We asked many questions of
this type because Krashen’s theory of language acquisition states that affect
is one of the two causal factors in language acquisition. The third type of
question elicited students’ general attitudes toward whatever aspect of the
program concerned them at the time. We used this kind of question to elicit
feedback on issues we had perhaps not taken into consideration when developing
the list of specific questions.
TEACHERS
We also
assessed teachers’ attitudes on a variety of issues. In deciding what questions
to ask, we followed a procedure similar to that used in deciding what questions
to ask of the students. In addition, where possible we tried to ask the
teachers questions which were similar to those we asked the students so that
we could determine the extent to which students’ and teachers’ impressions and
attitudes differed. We eventually decided to obtain information on teachers’
attitudes on the following topics:
1. satisfaction with the students’ competence in the four skills and grammar; satisfaction
relative to that in other courses they had taught.
2. confidence in the students’ ability to cope in German in
the four skills
3. satisfaction
with the amount of German culture they were able to present in class
4. interest
in the content of what they talked about in class
5. satisfaction
with the amount of pressure placed on students
6. satisfaction
with the grading policy
7. satisfaction
with the amount of support they received from supervisors/staff, materials
supplied, training program
8. satisfaction
with the amount of class preparation time required
9. confidence
in their ability to train others in method
10. importance
of their experience in obtaining future employment
11. what
they would want more of, or less of
12. amount
of fun they had teaching the course
13. whether
they would want to teach the same class again
14. whether
they would want to study a language using the experimental method
15. what
they liked most and least about teaching the experimental class
ADMINISTRATORS
Finally, we wanted to know how
the department chairman felt about the program. In addition to questions about
perceived effectiveness of instruction, we asked the chairman a number of
questions about how the experimental program contributed to the Language
Department’s overall image and visibility. Where possible, we asked questions
parallel to those that we asked of the students and teachers. The following are
the topics we decided upon:
1 the importance of each of the
following goals for the experimental first-year German program:
a) preparing the students to use
German for fun as in travel, speaking with relatives and friends, reading for
pleasure, etc.
b) preparing the students to use
German to study in a German-speaking country
c) preparing the students to use
German in their work or future work, including missionary activities, business,
teaching, etc.
d) helping the students satisfy a
university requirement
e) helping the students satisfy their
curiosity or have an enjoyable experience
2 the importance of first-year
German students developing competence in each of the following areas:
listening, speaking, reading, writing, and grammar
3 satisfaction (from whatever
impressions he might have formed) with students’ achievement in each of the
following areas: listening, speaking, reading, writing, grammar
How to evaluate?
Here I discuss the methods used
to obtain the information given above. I first discuss the issue of method
effect. Then I provide examples of the various instruments.
Method effect
Numerous research studies
(Bachman and Palmer 1981, 1982, 1989; Brütsch 1979; Clifford 1981) indicate
that we cannot get directly at language competence without the method used
influencing the results. And while with a lot of effort the relative influence
of method can be quantified for a small number of competing methods (Bachman
and Palmer 1981, 1982, 1989), I know of no way to discover such a thing as ‘the
best’ or ‘the perfect’ method.
Recently, Bachman (1990) has
proposed a system for the componential analysis of test methods. However, since
the study described in this chapter predated Bachman’s system, our own thinking
about method was less systematic than would be the case today. Basically, we
attempted to select methods that would not be obviously disruptive. For
example, in evaluating the students’ ability to read German, we had them
summarize a reading passage in English rather than German. This did not
introduce their ability to write German as a factor contributing to the test
results. On the other hand, we knew that perhaps the students’ ability to
summarize would influence the test results. We simply considered this the lesser
of the two problems.
Practicality
Second, our decisions on how to
test were influenced by practicality considerations. Because of constraints on
time, money, and personnel, tests had to be easy to develop, administer, and
score. And the rating procedures had to be quick and uncomplicated. The
following examples or descriptions of different methods illustrate how we chose
to balance method effect considerations with practicality issues.
SPECIFIC METHODS: OVERVIEW
The specific test methods for
each skill/element are described below. Many of these tests were developed by
Steven Sternfeld and Batya Elbaum (members of the Department of Languages
faculty), with the assistance of graduate students in the Department of
Languages and the Linguistics-TESOL Masters Degree program. I will first
explain the ‘standard’ procedures used to rate most of the protocols. (I also
describe a different ‘special’ rating procedure below under the description of
the oral interview test.) Then I will describe the procedures used to obtain
the performance samples to be rated.
A pool of raters (teachers in the
German program) used the following ‘standard’ procedure to rate the students’
performance on most of the tests. They were told to decide whether a student
who could listen, speak, read, or write at the level demonstrated on the
protocol being rated would be able to perform satisfactorily in the second year
sheltered subject matter course (a content course taught in German specifically
for non-native speakers of German). A rating of ‘high’ indicated that the
student would definitely be able to perform satisfactorily. A rating of ‘mid’
indicated that the student would probably be able to perform satisfactorily.
And a rating of ‘low’ indicated that the student would clearly not be able to
perform satisfactorily. We have reason to believe, however, that the assigned
ratings were more indicative of the raters’ decisions about general language
proficiency than ability to perform in sheltered subject matter courses,
because some of the tests were measuring language abilities which would likely
have little to do with performance in such courses. Moreover, since the raters
had never taught such a course, they had no specific sheltered content course
experience from which to make the ratings.
Practical considerations also
dictated this global approach to most of the protocol rating. A more detailed
rating procedure would have required both a training program for raters and a
small group of raters with the time to participate in the training program and
the actual rating sessions. Funding and time constraints made this impossible.
Two methods were used to evaluate
the students’ German language proficiency: traditional language tests and students’
self-ratings.
LISTENING: LECTURE SUMMARY
A ten-minute videotaped lecture
was prepared by a professor who is a native speaker of German. The lecture
concerned some of the technicalities of East and West German politics. The
German used was academic with complicated syntax. It was unsimplified and was at a much higher level than the
students had been exposed to in class.
Students were allowed to take
notes during the lecture. Following it, they were instructed to summarize the
lecture as well as possible, in as much detail as possible. Protocols were
rated using the standard procedure described above.
LISTENING: STUDENTS’ SELF-RATINGS
The students were asked to write
brief descriptions of their ability to understand spoken German. We grouped
these descriptions into high, mid, and
low categories, counted the numbers of students in each category, and provided
examples of students’ comments in each category.
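The same group-count-exemplify procedure recurs below for the speaking, reading, writing, grammar, and vocabulary self-ratings. The following minimal sketch shows one way such tallies might be produced; the student identifiers, comments, and category assignments are invented for illustration and are not data from the study.

```python
from collections import Counter

# Hypothetical self-rating data: each entry is a student's free-text
# description plus the category (high/mid/low) assigned to it by a reader.
self_ratings = [
    {"student": "S01", "comment": "I understand most of what the teachers say.", "category": "high"},
    {"student": "S02", "comment": "I catch familiar topics but miss details.",   "category": "mid"},
    {"student": "S03", "comment": "I only pick out isolated words.",              "category": "low"},
]

# Count the number of students in each category.
counts = Counter(r["category"] for r in self_ratings)
for level in ("high", "mid", "low"):
    print(f"{level}: {counts.get(level, 0)} student(s)")

# Provide one example comment per category, as in the report.
for level in ("high", "mid", "low"):
    example = next((r["comment"] for r in self_ratings if r["category"] == level), None)
    if example:
        print(f"example ({level}): {example}")
```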
SPEAKING: ORAL INTERVIEW
A native speaker of German who
did not know the students interviewed each student for five to eight minutes,
the length depending upon how proficient the students were. The interviewer
went through a list of questions which included alternate questions so no two
interviews were exactly the same. Questions were of the sort one might ask
someone when trying to become acquainted. For example:
1 Where do you live?
2 Where do you work?
3 Have you studied other
languages before?
4 Why are you studying German?
In the middle of the interview,
the student was given some information to elicit by asking questions in German.
For example:
1 Where does the interviewer come
from?
2 What is the interviewer doing
in America?
3 Where does the interviewer
normally work?
4 What does the interviewer’s
family consist of?
Kramer scored these interviews
using two criteria: sentence complexity and well-formedness, and control of
inflectional morphology. He added the two scores together and then
converted these scores to ratings using a ‘special’ procedure different from
the standard one described above. Kramer compared the students’ performance on
the oral interview with their performance on the remaining tests that had been
rated using the standard procedure. He determined which students tended to
fall consistently into the high, mid, and low categories on the other tests and
used this information, together with the original rating criteria, to estimate
high, mid, and low break points for the interview scores.
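The chapter does not spell out exactly how these break points were estimated, so the following sketch is only one plausible reading of the procedure: each student’s dominant category across the standard-rated tests is found, and cut points are placed midway between the mean interview scores of adjacent groups. All names and numbers are hypothetical.

```python
from statistics import mean, multimode

# Hypothetical data: summed interview scores (complexity + morphology) and the
# standard high/mid/low ratings each student received on the other tests.
students = {
    "S01": {"interview": 18, "standard": ["high", "high", "mid", "high"]},
    "S02": {"interview": 11, "standard": ["mid", "mid", "high", "mid"]},
    "S03": {"interview": 12, "standard": ["mid", "mid", "mid", "low"]},
    "S04": {"interview": 5,  "standard": ["low", "low", "mid", "low"]},
    "S05": {"interview": 4,  "standard": ["low", "low", "low", "low"]},
}

def dominant(ratings):
    # The category a student falls into most consistently on the other tests.
    return multimode(ratings)[0]

# Collect interview scores by each student's dominant standard category.
by_category = {"high": [], "mid": [], "low": []}
for info in students.values():
    by_category[dominant(info["standard"])].append(info["interview"])

# Place break points midway between the mean scores of adjacent categories.
high_mid_break = (mean(by_category["high"]) + mean(by_category["mid"])) / 2
mid_low_break = (mean(by_category["mid"]) + mean(by_category["low"])) / 2

def rate_interview(score):
    if score >= high_mid_break:
        return "high"
    if score >= mid_low_break:
        return "mid"
    return "low"

for name, info in students.items():
    print(name, info["interview"], rate_interview(info["interview"]))
```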
SPEAKING: STUDENTS’ SELF-RATINGS
The students were asked to write
brief descriptions of their ability to speak German. We grouped these
descriptions into high, mid, and low categories, counted the numbers of
students in each category, and provided examples of students’ comments in each
category.
READING: TRANSLATION
The students were given a
humorous passage in German about a child who threw a rock through a window and
then bragged about having answered the teacher’s question (about who threw the
rock through the window) correctly. They then wrote an English translation,
which was rated using the standard procedure.
READING: SUMMARY
The students were given a written
passage about a German satirist written in academic, unsimplified German,
followed by a written summary of an interview with the satirist. The students
were then given five minutes to write a summary of the passage in English,
which was rated using the standard procedure.
READING: STUDENTS’ SELF-RATINGS
The students were asked to write
brief descriptions of their ability to read German. We grouped these
descriptions into high, mid, and low categories, counted the numbers of students
in each category, and provided examples of students’ comments in each category.
WRITING: GUIDED COMPOSITION
The students were given a 70-word
passage in German about the weather in Germany. They then wrote a passage in
German about the weather in America. These compositions were rated using the
standard procedure.
WRITING: DICTATION
A 65-word passage was dictated
to the students. The content consisted of a brief comparison of the political
differences between East and West Germany. The dictation was scored by marking
one point off for each omitted word, extra word, and pair of permuted words.
Words were considered correct if the spelling was phonetically appropriate. In
addition, a half point was deducted for each error in inflectional morphology.
The compositions were then rated using the special procedure described for the
oral interview test.
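A minimal sketch of the dictation scoring arithmetic follows, assuming the error counts have already been tallied by hand; the maximum score of 65 (one point per word in the passage) is an assumption, since the chapter does not state the scoring ceiling.

```python
# Dictation scoring sketch: one point off per omitted word, extra word, or
# pair of permuted words; half a point off per inflectional-morphology error.
# Spelling counts as correct if phonetically appropriate, so it is not
# penalized here.
def score_dictation(omitted, extra, permuted_pairs, morphology_errors, max_points=65):
    # max_points=65 is an assumption based on the 65-word passage.
    deductions = omitted + extra + permuted_pairs + 0.5 * morphology_errors
    return max(0.0, max_points - deductions)

# Example: 3 omitted words, 1 extra word, 2 permuted pairs, 4 morphology errors.
print(score_dictation(omitted=3, extra=1, permuted_pairs=2, morphology_errors=4))  # 57.0
```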
WRITING: STUDENTS’ SELF-RATINGS
The students were asked to write
brief descriptions of their ability to write German. We grouped these
descriptions into high, mid, and low categories, counted the numbers of
students in each category, and provided examples of students’ comments in each
category.
GRAMMAR: RATIONAL CLOZE
The students were given a
118-word multiple-choice cloze test with rational deletions (developed by Paul
Kramer). The passage, modified somewhat by Kramer, was taken from a German
reader. The subject matter was a humorous story about a student who went into a
restaurant and made a deal with the waiter that if the student could sing the
waiter a song that pleased the waiter, the student would get his dinner free.
The first sentence was left
intact. Twenty-five words were rationally deleted from the remainder of the
passage to test grammar, discourse competency,
and knowledge of the world. Deletions occurred at intervals ranging from four
to eleven words. Four answer choices were provided for each deletion.
Kramer scored the cloze test
using the exact word method. He then assigned ratings using the special
procedure described for the oral interview test.
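The exact-word method is straightforward to express in code. The sketch below assumes a hypothetical answer key and one student’s multiple-choice selections; only responses matching the originally deleted word earn credit.

```python
# Exact-word scoring for the rational cloze: a response counts only if it
# matches the word originally deleted from the passage.
def score_cloze(answer_key, responses):
    return sum(1 for key, resp in zip(answer_key, responses)
               if resp is not None and resp.strip().lower() == key.lower())

# Hypothetical three-item excerpt of the 25-item key and one student's choices.
answer_key = ["ging", "Kellner", "singen"]
responses = ["ging", "Ober", "singen"]
print(score_cloze(answer_key, responses))  # 2
```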
GRAMMAR: STUDENTS’ SELF-RATINGS
The students were asked to write brief descriptions of their control of German
grammar. We grouped these descriptions into high, mid, and low categories,
counted the numbers of students in each category, and provided examples of
students’ comments in each category.
VOCABULARY: UNCONTEXTUALIZED TRANSLATIONS
The students were given a list of German words to translate into English.
These were scored and then rated using the special procedure described for the
oral interview test.
VOCABULARY: CONTEXTUALIZED
TRANSLATIONS
The students were then given a
reading passage containing the same vocabulary words used in the
uncontextualized translation test. They then translated these words into
English using the additional information they could obtain from the context.
Kramer scored these translations and converted them to ratings using the
special procedure described for the oral interview.
VOCABULARY: STUDENTS’ SELF-RATINGS
The students were asked to write brief descriptions of their
control of German vocabulary. We grouped these descriptions into high, mid, and
low categories, counted the numbers of students in each category, and provided
examples of students’ comments in each category.
Affective domain
STUDENTS
Three methods were used to assess
students’ attitudes: activity rating slips, journals, and an attitude
questionnaire. First, the students were given an activity rating form to fill
out at the end of each class. On this form, they rated two attributes of each
of the day’s activities: interest and comprehensibility. Ratings were done on a
subjective scale of 0-100. The instructors used these ratings to decide how
successful the activities were (see the aggregation sketch below).
Second, the students kept
journals in which they described their language learning experience. These
journals were collected on a weekly basis, read by the instructors, and
summarized at the end of the program by a graduate student in the
Linguistics-TESOL M.A. Program.
Third, the students were given a questionnaire
at the end of the first and third terms. Most of the questions asked the students
to rate on a four-point scale their responses to questions of the sort
described above in the section What to evaluate? In addition, the students
responded to a few open-ended questions.
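The following sketch illustrates how the daily activity rating slips described above might have been aggregated for the instructors; the slips, activity names, and the 70-point threshold are invented for illustration and are not the procedure actually reported.

```python
from statistics import mean

# Hypothetical slips from one class day: each slip carries a student's
# 0-100 ratings of an activity's interest and comprehensibility.
slips = [
    {"activity": "weather discussion", "interest": 85, "comprehensibility": 90},
    {"activity": "weather discussion", "interest": 70, "comprehensibility": 80},
    {"activity": "fairy tale reading", "interest": 95, "comprehensibility": 60},
    {"activity": "fairy tale reading", "interest": 90, "comprehensibility": 55},
]

# Average the two ratings for each activity.
activities = sorted({s["activity"] for s in slips})
for act in activities:
    interest = mean(s["interest"] for s in slips if s["activity"] == act)
    comp = mean(s["comprehensibility"] for s in slips if s["activity"] == act)
    # An activity looks worth repeating if it is both interesting and
    # comprehensible (the 70-point cut-off is an arbitrary illustration).
    verdict = "repeat" if interest >= 70 and comp >= 70 else "revise"
    print(f"{act}: interest={interest:.0f}, comprehensibility={comp:.0f} -> {verdict}")
```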
INSTRUCTORS
Three methods were used to assess the
instructors’ attitudes: ongoing conversations, a questionnaire, and a paper. I
visited classes frequently and talked informally with the instructors
afterwards. In these conversations we discussed how the class had gone, which
activities seemed to work and which ones didn’t, and how the students seemed to
be reacting. On the basis of this information we decided how to modify the
instruction appropriately.
Also, the instructors wrote a paper at the end
of the program describing their impressions of the experience.
Finally, the instructors completed a 44-item
questionnaire at the end of the program. The questions covered demographic
information and the instructors’ attitudes toward various aspects of the
program (see the section What to evaluate? above). With the exception of two
open-ended response questions, questions were of the multiple-choice variety.
ADMINISTRATION
The department chairman filled out a 44-item
multiple choice attitude survey at the end of the program.
For whom to evaluate?
Possible audiences
As Rodgers (1986) has pointed out, a variety of people have an interest in language
teaching program evaluation reports. Students want to learn and have an
enjoyable experience in the process and might want to know what they could
expect of a specific program. Teachers want to know how effective instruction
proves to be and how enjoyable the teaching experience is. Administrators are
concerned with results and their effect on enrollments. Members of the
community may be concerned with the effect of the students’ language skills on
how well they function in the community. Researchers are concerned with the
results of applying language acquisition theory in the classroom. And program
evaluators may be interested in the report as a guide to future evaluation
studies.
Depending upon which audience the
program evaluator is addressing, the kind of information obtained and the way
it is presented would likely vary. For example, one would make fairly different
assumptions about the interests and background knowledge of readers of Language
Learning and members of a college curriculum review committee.
Options for reporting results
We found audience considerations
to be important when deciding upon procedures for reporting on student
performance, for we wanted readers to be able to form a concrete picture of
what students completing the program could do. Therefore, in addition to
providing quantitative data, Kramer (1989) reported both the numbers of
protocols falling into the high, mid, and low categories and provided examples
of typical protocols at each level. In addition Kramer provided a corrected
version of each protocol, as well as a translation.
I followed a similar procedure
in reporting the results of the attitude surveys (Palmer 1987). When
students provided self-ratings of their ability to use German, I grouped their
responses into high, mid, and low categories, reported the number of responses
in each category, and supplied examples of these responses. And when students
responded to open-ended questions about what they liked most and least about
the program, I quoted typical responses.
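The sketch below illustrates this reporting format for the performance protocols: counts per rating level, followed by a typical protocol with a corrected version and a translation. All protocol texts are invented, not examples from the study.

```python
# Sketch of the protocol-reporting format: for each rating level, report how
# many protocols fell into it and show one typical protocol together with a
# corrected version and an English translation.
protocols = [
    {"level": "high", "text": "Das Wetter in Amerika ist oft sehr kalt im Winter.",
     "corrected": "Das Wetter in Amerika ist im Winter oft sehr kalt.",
     "translation": "The weather in America is often very cold in winter."},
    {"level": "low", "text": "Wetter kalt viel Schnee.",
     "corrected": "Das Wetter ist kalt, und es gibt viel Schnee.",
     "translation": "The weather is cold, and there is a lot of snow."},
]

for level in ("high", "mid", "low"):
    group = [p for p in protocols if p["level"] == level]
    print(f"{level}: {len(group)} protocol(s)")
    if group:
        sample = group[0]
        print("  typical protocol: ", sample["text"])
        print("  corrected version:", sample["corrected"])
        print("  translation:      ", sample["translation"])
```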
A few comments on the findings
The purpose of the following
comments is to give the reader some indications of the sorts of conclusions we
reached about the general effectiveness of the program. (For a detailed
account, see Kramer 1989.) I will summarize the findings in three areas:
instructors’ attitudes toward the program, students’ attitudes toward the
program, and students’ performance on behavioral measures.
Practicality of the approach
The instructors felt that the
radical implementation of Krashen’s theory used was practical. They were able
to provide three terms of interesting, comprehensible input. The two-teacher
classroom was indeed feasible and generated a lot of useful interaction between
the instructors, and the students seemed to be interested in this interaction.
The drop-out rate for the course was no worse than for traditionally taught
classes.
Student attitudes toward the
program
Students’ reactions to the method
were very positive at the beginning of the program, when they appreciated the
lack of pressure to produce the language and the absence of formal testing. As
the program progressed, however, they began to worry about whether they would
eventually be able to speak, and there is some indication that they would like
to have been tested in order to know that they were indeed acquiring the
language. Their comments indicated that a number of them would have liked some
specific encouragement to speak and make mistakes in order to get used to this
experience.
One conclusion that we could draw
from this change in student attitudes is that it may be difficult to completely
satisfy both of the conditions necessary for acquisition in Krashen’s theory.
Specifically, we may not be able to provide students like the ones in our
program only with interesting, comprehensible input (and no output practice)
while maintaining a low affective filter strength in a relatively long-term
program.
This situation creates a problem
for researchers interested in evaluating Krashen’s theory. Specifically, in
order to test the theory, we would like to provide all and only what the theory
says is absolutely necessary for acquisition (comprehensible input with low
filter strength). Yet to keep the filter strength low over a long period of
time with students such as ours, we may have to provide output practice. Doing
so, however, makes the interpretation of the results difficult, since if the
students do acquire and if speech does emerge, we can no longer ascertain
whether this would have happened without the output practice. It may be the
case that to provide a good test of a radical implementation of Krashen’s
theory we will need to find a population of students who will not find the
absence of output practice over a long period of time threatening. It will be
interesting to discover whether such a population of students exists.
Student performance on behavioral
measures
THE EXPERIMENTAL GROUP CONSIDERED
BY ITSELF
Students’ abilities to produce
German ranged from slight to quite surprising (Kramer 1989). Students rated as
low were functioning as one would expect of acquirers within the early stages
of an interlanguage phase. For students rated as high, evidence of
acquisition was much more apparent, with these students at times able to
produce well-formed, contextually appropriate, complete sentences.
Also, actual levels of
acquisition may have been higher than the overall performance of the students
in this study indicates, since attitude measures became more negative over time.
This may have raised the ‘output filter’, which is said to limit the
performance of acquired competence (Krashen 1985).
COMPARISON BETWEEN EXPERIMENTAL
AND CONTROL GROUPS
A MANOVA (multivariate analysis of
variance) conducted by Kramer (1989) indicated that the control group students
(who received traditional instruction) performed significantly better overall
than the acquisition students. They also performed significantly better on four
of the seven tests (oral interview, reading translation, writing summary, and
vocabulary). The experimental students did not perform significantly better on
any of the tests (Kramer 1989).
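For readers who want to see how such a between-group comparison could be set up, here is a minimal sketch using the MANOVA class in statsmodels on an invented data frame; the two dependent measures and all scores are placeholders, not the study’s data, which comprised seven measures.

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Invented scores on two measures for a handful of students in each group;
# the real analysis used seven dependent measures.
df = pd.DataFrame({
    "group":     ["experimental"] * 4 + ["control"] * 4,
    "interview": [10, 12, 9, 11, 15, 17, 14, 16],
    "cloze":     [14, 13, 15, 12, 18, 20, 17, 19],
})

# One-way MANOVA: do the groups differ on the dependent measures considered jointly?
maov = MANOVA.from_formula("interview + cloze ~ group", data=df)
print(maov.mv_test())
```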
Discussion
Amount of data
We may have gotten somewhat out
of balance with respect to quantity versus quality of data. In some cases we
noticed that when we used several different methods to obtain the same data, we
tended to reach the same conclusions. For example, with student attitude data
we tended to reach the same conclusions from the questionnaires and journals.
Also, since our primary research interest was the issue of the validity of
Krashen’s theory of second language acquisition, we found little immediate use
for the information obtained from the department chairman.
Quality of the data
In future studies, I would
probably spend more time on test development and use fewer methods. For
example, if we were to use a scaled attitude survey again, I would research
options for wording questions and scaling responses. (See Oskarsson 1978;
Bachman and Palmer 1989.) Bachman and I found that questions about ‘difficulty’
in using language provided more useful information on the ability than did
‘can-do’ questions about what the students were able to do.
In addition, some of the data we
obtained would have been of little value even if we had specific uses for it. For example, we asked the
department chairman to rate various aspects of the program without providing
him with enough information to make his task meaningful. We should have
provided him with examples of student test performance and asked him to visit
the class instead of having him rely on what he happened to hear about the
course and the students’ performance on it.
Finally, I would carefully
pre-test all of the instruments. A number of students commented that certain
questions on the attitude survey were difficult to interpret, and the limited
number of answer choices provided did not allow them to make what they
considered valid responses. Also, some of the tests (such as the lecture
summary) may have been too difficult. Pre-testing the instruments would also
have helped us address this problem and balance the quality and quantity of
data obtained.
In general, had we spent less
time gathering the same kind of information by means of different methods and
less time gathering data not related to our basic research question, we could
have spent more time refining our primary instruments and preparing the
students for the testing experience.
Formats for reporting data, and effects on interpretability
We found it useful to provide
examples of student performance at different levels in addition to reporting
statistics. This helped audiences relate more directly to the numerical data
provided, particularly since only norm-referenced tests were used. For example,
a dean at the University of Utah was more concerned with the general level of
ability of both experimental and traditional students (as shown by examples of
what the students could do with the language) than with what he considered
minor (though perhaps significant) differences between the groups (as evidenced
by descriptive statistics from the norm-referenced tests used). An even more
productive approach would be to develop and use criterion-referenced tests of
communicative language ability (Bachman 1989; Bachman and Clark 1987; Bachman
and Savignon 1985).
Between-group comparison
Kramer’s finding that the control
group performed significantly better
than the experimental group came as quite a surprise to us. We expected
the experimental group to do better. If this single study is considered in
isolation, the finding would appear to be rather straightforward evidence
against the Input Hypothesis. What this study led to, however, was an
investigation of a number of method-comparison studies, all of which dealt with
the issue of the relative efficiency of input-based and eclectic instruction
(Kramer and Palmer 1990; Palmer 1990). As a result of this investigation, we
discovered an interesting pattern of interaction among method of instruction,
purity of instruction and age of students. Briefly, in studies with ‘impure’
experimental treatments, we found no significant differences. In studies with
‘pure’ experimental treatments, we found significant differences favoring the
experimental (input-intensive) treatment for children, but favoring traditional
(eclectic) instruction for adults.
Conclusion
Evaluating input-based language
teaching programs is challenging. On the one hand, we want to keep the
experimental treatment as pure as possible in order to increase the internal
validity of the study. On the other hand, we want to keep the students happy,
which may require a more eclectic approach. Thus, we seem to be caught in a tug
of war between interpretability and practicality-generalizability: internal and
external validity (Beretta 1986a).
Our experience was particularly
satisfying because the results we obtained were not necessarily the results we
anticipated, so we felt like we had learned something new from the process. Moreover,
struggling with the issues of research design, test design, and interpretation
helped us clarify some of the problems we faced and suggested some new
directions we might take in the future (Kramer and Palmer 1990; Palmer 1990).
The increasing interest in
language testing and program evaluation is demonstrated by the enthusiastic
participation of large numbers of colleagues in conferences such as the 1986
International Conference on Trends in Language Programme Evaluation in Bangkok,
and the 1990 RELC Regional Seminar
on Language Testing and Language Programme Evaluation in Singapore. This
involvement is likely to bring increased insight into what Alderson (1986)
calls ‘The nature of the beast’.