EDITED BY:
J. CHARLES ALDERSON
ALAN BERETTA
Introduction
During the academic year 1985-6,
an eight-month experimental course in first-year German was offered at the
University of Utah. The experiment was designed to test the feasibility of using a radical
implementation of Krashen’s theory of second language acquisition.
Steven Sternfeld directed the program, and Paul Kramer and Louise Lybbert
Nygaard taught the experimental courses. These individuals, along with several
graduate students in the Department of Linguistics, developed and administered
the tests and questionnaires used to evaluate the courses. I assisted in the
planning phase of the evaluation and worked with the two German instructors
(Kramer and Nygaard) in developing and refining the instructional methodology.
I was also involved in designing the attitude questionnaires. Kramer is
responsible for the analysis of the German performance measures, which were
scored by him and several other raters. Thus, this chapter should be considered
a report on the program evaluation efforts of a fairly large team of
individuals. When we started thinking about how to evaluate the experimental
German course, we had to consider the options, organize them, and select those
in which we were interested. We would have found it useful to have on hand a
document describing how someone else had dealt with the issues we were facing,
and we might have saved considerable time and energy in developing a plan that
worked for us. The general purpose of this chapter is to provide other
researchers with this kind of a head start. First, I will describe the program
briefly. Then, following Alderson (1987) I will discuss options we considered
when addressing six questions:
1) why evaluate,
2) when to evaluate,
3) whom to evaluate,
4) what to evaluate,
5) how to evaluate, and
6) for whom to evaluate.
Finally, I will comment briefly
on some of the findings.
Description of the program
Theory behind the program
KRASHEN’S GENERAL THEORY
Krashen (1982, 1985) hypothesizes that a language is
acquired (picked up in such a way that it can be used effortlessly and without
conscious thought) in only one way:
a) by exposure to interesting, comprehensible input
b) under non-threatening conditions.
These are Krashen’s first two hypotheses: the ‘input
hypothesis’ and the ‘affective filter hypothesis’. A
third hypothesis, the ‘production emerges hypothesis’, is
that
c) speech emerges as
the result of acquisition. Thus, according to Krashen’s theory no production
practice is needed for subconscious language acquisition, nor is any analysis
or explanation of the target language. In addition, Krashen’s theory includes
other hypotheses. For example, his full theory includes a hypothesis that
conscious ‘learning’ (through analysis, practice, and explanation) is possible and
is useful in certain ways; however, conscious learning is not thought to be
useful in developing acquired competence, and consciously learned material is
not thought to become ‘acquired’ as the result of practice.
VERSION OF THE THEORY TESTED
In the experimental program, we
decided to test as ‘pure’ a version of Krashen’s theory as possible. Therefore,
we only provided interesting, comprehensible input. We deliberately avoided
activities that would tend to produce conscious learning so that we might observe
the effects of the subconscious acquisition process. We also avoided requiring
the students to speak for three reasons: speaking might create stress, raise
the affective filter, and block acquisition; speaking (according to Krashen’s
theory) is not required for speech to emerge; and early speaking might lead to acquisition
of deviant forms through exposure to a large amount of imperfect German
(Krashen 1985:46-7).
Methodology
TEACHING
J. Marvin Brown (director of the
Thai Language Program at the A.U.A. Language Center in Bangkok, Thailand) and I
developed the general teaching methodology used in the study. This is described
in detail in a teachers’ manual called The Listening Approach: methods and
materials for applying Krashen’s Input Hypothesis (Brown and Palmer 1988).
Probably the most distinctive
characteristic of the Listening Approach is that there are two teachers. The
teachers talk and interact primarily with each other but always try to make
what they say both comprehensible and interesting to the students. The students
just look, listen, and try to understand what is happening. They try to keep
their attention on the meaning, not the language. And what is more, they are
advised not to construct sentences consciously in order to speak but, instead,
to wait until the language comes out by itself (‘emerges’ in Krashen’s terms).
In addition to receiving spoken input, the students are also exposed to written
input. The teachers talk about written material which the students see on
overhead transparencies and handouts, and the students also read interesting,
simple, contextualized material on their own.
ACTIVITIES
The activities were selected from
the large variety described in The Listening Approach and were sequenced
entirely according to their potential interest value and ease of comprehension.
That is, the teachers tried to use activities which would prove as interesting
as possible yet still be comprehensible. They made no conscious attempt to
sequence the activities according to notional, functional, or grammatical
criteria. And there was no student textbook. A typical class included
activities such as the following:
1 discussion: The weather (the
previous night’s heavy snowstorm).
2 discussion: A teacher’s cold
and possible remedies.
3 discussion: The geography of
Germany.
4 game: The teachers read from
sheets containing information about the students. The students guessed which
student was being described.
5 reading: The teachers read a
fairy tale aloud to the students from an overhead transparency and discussed
it.
STUDENTS
The students were drawn from a population of
regular university students enrolled in a first-year German course offered for
credit at the University of Utah. The class consisted of two ordinary sections
of German 101 combined into a single class. The students were not told that the
method to be used was experimental until after they had signed up for the
course. The control group consisted of students who had signed up for a
different section of the same course.
CLASSES
Classes met five days a week for
one hour a day. The entire program continued for three ten-week terms, for a
total of about 150 hours of instruction.
TEACHERS
The two teachers, Paul Kramer and
Louise Nygaard, graduate students in the University of Utah’s Department of
Foreign Languages and experienced in contemporary language teaching
methodology, taught the classes.
Why evaluate?
One of the first questions we
asked was why we were evaluating the program. We considered a number of possible
reasons. One was to find out whether our program was feasible: could the
teachers teach it and would the students put up with it? Another, as noted above,
was to find out whether the program was productive: would it produce the
results one might expect of it given the claims of the theory of language
acquisition upon which it was based? Another was to find out whether the
program was appealing. Perhaps the program could be taught, but would it be
enjoyed?
We decided to address all of
these issues, but our primary reason for evaluating the program was to find out
more about language acquisition theory. Therefore, our initial program design
and any modifications we made to it during the year had to be consistent with
the pure version of Krashen’s theory we were interested in testing, so that we
could be as clear as possible about which teaching processes were being tested
(Beretta 1986a:296). As a result, we accepted the possibility that students
would be resistant to some of the elements in the program, and we decided not
to modify the program in ways that would make conclusions about the validity of
the theory difficult to arrive at.
When to evaluate?
Our next question was when to
evaluate. Should we evaluate the program periodically during the year (formative
evaluation), or should we wait until the end (summative evaluation)? Our
decision was based upon both practical and theoretical concerns.
On the practical side, we were
busy keeping up with the instructional side of the operation (which included visiting
classes, giving one another feedback, planning lessons, etc.), and we had
little time to design and implement a systematic formative evaluation study.
And on the theoretical side, we were
strongly influenced by Krashen’s affective filter hypothesis. According to this
hypothesis, students acquire best under non-threatening conditions, and large-scale
formative evaluation might be likely to raise the filter (Krashen 1982:30-2).
As a result, we decided to
evaluate the program in two phases: a small-scale, somewhat informal formative
evaluation of students’ attitudes toward the program and a large-scale, formal,
summative evaluation of the students’ language abilities. We thought that the
formative evaluation of attitudes would allow us to keep track of attitude
changes as they occurred, which would help us adjust the input to keep it
interesting and comprehensible. The summative evaluation of students’ language
ability would allow us to determine how well the students could perform at the
end of the program.
Whom to evaluate?
I will break down this question
into two main issues. The first is whether to evaluate the experimental group
by itself or to compare the experimental group with a control group. After
discussing this question, I will turn to the issue of whom to evaluate within
the group(s): teachers, students, administrators, or some combination of the
three.
Comparative or independent group evaluation
COMPARATIVE EVALUATION
When we first started to think
about how to evaluate the program, what came to mind immediately was a methods
comparison study using experimental and control groups. Such a study would
provide us a means of comparing results of the input-based program with the
results of traditional (eclectic) instruction, along the lines of the studies
by Asher, Kusudo and de la Torre (1983); Burger (1989); Edwards et al. (1984); Hauptman,
Wesche and Ready (1988); Lafayette and Buscaglia (1985); Lightbown (1989); and
Sternfeld (1989).
While comparative studies are
interesting, designing and interpreting them is difficult. Ideally, students
should be assigned to the two groups at random, or if this is not possible,
attempts should be made to control for self-selection. Also, teacher variables
are difficult to control. The results of the Pennsylvania Study (Smith and
Baranyi 1968) indicated that the teacher variable was more important than the
method used. In addition, experimental programs are likely to be new, and one
might expect that they would generate more excitement and enthusiasm than the
traditional programs (the Hawthorne Effect). Finally, the goals of programs employing
different language teaching methods are likely to be quite different, which
would make finding criterion measures appropriate to both programs difficult
(Beretta 1986b).
On the other hand, such studies
do ask questions about alternatives, which many people find interesting. And
while they carry along with them design problems (Kramer and Palmer 1990) and
testing problems (Palmer 1990), when a number of such studies are carried out,
questions of interpretation which come up for one study may be answered by
another study conducted under different conditions. For example, if one
suspects that a negative outcome in one study might be the result of instructor
variables (rather than the method used) but then finds similar negative
outcomes in a study conducted with instructors known to be superior, one might
be able to rule out the instructors as a confounding variable and begin to draw
some general conclusions about the effect of the method itself.
INDEPENDENT EVALUATION
The other alternative would be to
evaluate the experimental program on its own merits, which would involve
stating the objectives of the program and collecting data to evaluate how well
these objectives had been met. Such a study would avoid many of the problems
described above, but it could not provide the kinds of comparative data which
many people want.
SOLUTION
We decided to conduct both
independent and comparative evaluations. To evaluate the experimental treatment
on its own merits, we specified objectives for the experimental program,
against which we could compare actual outcomes. To compare the relative
effectiveness of the experimental method with something else, we created a
control group of traditionally taught students and gave the same set of
proficiency tests to each group.
People affected by program
outcomes
Whether a program is or is not
considered successful may depend upon whom we ask. We may decide that it
succeeds if the students learn something and are happy with the learning
experience. In this case, we would obtain measures or descriptions of student
behavior. Or we might decide to evaluate the program in terms of teachers’
attitudes. Are they pleased with their teaching and with their perceptions of
what their students learned? Or we might evaluate a program’s success in terms
of administrators’ attitudes. Or, as Rodgers (1986) suggests, we might even go
outside of the educational setting altogether and evaluate the program in terms
of the attitudes of people within the community.
We decided to obtain information
from students, teachers, and administrators. The following is a discussion of
the kinds of information we obtained from each population.
What to evaluate?
In attempting to make sense of
the issue of what to evaluate, we first had to decide how to organize the
options. We found Bloom’s cognitive, behavioral, and affective domains
(Krathwohl, Bloom and Masia 1956) to be a useful framework.
Cognitive domain
The cognitive domain consists of
knowledge about the language. In input-intensive, acquisition-based programs
such as ours, there is little to evaluate within this domain. Such programs do
not present grammar explicitly (cognitively). Nor do they expect the
subconscious acquisition process to lead to cognitive control. Thus, we decided
not to evaluate cognitive changes or knowledge, even though such outcomes would
be expected for the control group.
Behavioral domain
To evaluate students’ behavioral
performance, we needed to start by determining what kinds of outcomes we could
reasonably expect of students completing the program. The most specific
statement we could find came from Krashen and Terrell (1983):
After 100-150 hours of Natural Approach Spanish, you
will be able to: ‘Get around’ in Spanish; you will be able to communicate with
a monolingual native speaker of Spanish without difficulty; read most ordinary
texts in Spanish with some use of a dictionary; [and] know enough Spanish to
continue to improve on your own.
After 100-150 hours of Natural Approach Spanish, you
will not be able to: pass for a native speaker; use Spanish as easily as you
use English; understand native speakers when they talk to each other (you will
probably not be able to eavesdrop successfully); use Spanish on the telephone
with great comfort; [or] participate easily in a conversation with several
other native speakers on unfamiliar topics.
(Krashen and Terrell 1983:74)
Krashen and Terrell also discuss
the role of grammar in defining expected outcomes (pp. 71-2). While they
indicate that beginning students may simply string together appropriate lexical
items in some logical order, they do not expect that students will continue to
use these simple stringing techniques. They believe that the language
acquisition that takes place in the classroom will result in students acquiring
grammatical competence. Thus, based upon Krashen and Terrell’s guidelines,
appropriate tests for the experimental group might assess the students’
listening, speaking, and reading skills, as well as their subconscious control
of grammar.
We did not have available a
description of behavioral objectives for the control group, so we had to rely on
our experience as teachers and administrators to inform our selection of tests.
We concluded that tests of listening, speaking, and reading skills, as well as
subconscious control of grammar (already needed to evaluate the experimental
group) would also be appropriate for the control group. In addition, we decided
that one of the objectives for the control group would be developing writing
skills.
Thus, our final selection of
tests included measures of the four skills (listening, speaking, reading, and
writing) and two elements (grammar and vocabulary).
Affective domain
In addition to evaluating
behavioral outcomes, we also wanted to evaluate the attitudes of three populations
(students, teachers, and administrators) toward the program. In deciding what
attitudes to assess, we tried to balance the advice of others, lists of
questions appearing in other research studies, and our own research interests
in preparing our questionnaires.
STUDENTS
We divided our questions to the
students into three types. The first type consisted of questions about fairly
general aspects of their attitude toward the program. The process we used in
preparing these questions included discussions with experts in program
evaluation, discussions with people involved in the program, examination of
other program evaluation studies, and examination of general issues in program
evaluation (for example, Beretta 1986a and 1986b; Richards and Rodgers 1986, Chapter
11). We eventually elicited the students’ opinions on the following topics:
1. reasons for studying German at the beginning
of the program and at present: fun, curiosity, for use in future studies, for
use in work, to satisfy a university requirement, to get a good grade, and
importance of developing proficiency in the four skills and grammar
2. current satisfaction with their instruction in
the areas of listening, speaking, reading, writing, and grammar
3. satisfaction with their instruction in the
areas of listening, speaking, reading, writing, and grammar compared with their
expectations when they began the experimental course
4. confidence that they could cope with
listening, speaking, reading, and writing German in a German-speaking country
5. interest
in being in contact with German culture and language outside of class
6. interest in electing sheltered subject matter
(subject matter courses taught entirely in German to students who are
non-native speakers of German)
7. interest in continuing to study German
8. importance of absence of pressure in class and
receiving a good grade
9. satisfaction with the cost of the program in
terms of time and money
10. appropriateness of the levels of spoken and
written German to which they were exposed in class
11. what
they would want more of, and less of, in class
12. how
much fun they had in class
13. confidence
in the method
14. whether or not they would recommend the class
to their friends
15. level of participation
16. amount
of structure (organization) they liked in a language class
17. amount
of risk they enjoy in class
18. amount
of interaction with classmates they enjoy
19. opinions
of each instructor’s control of language, ability to teach, and personality
20. satisfaction
with the way the instructors interacted
The second type
of question elicited students’ opinions about specific learning activities: Did
they like the activities and find them interesting? We asked many questions of
this type because Krashen’s theory of language acquisition states that affect
is one of the two causal factors in language acquisition. The third type of
question elicited students’ general attitudes toward whatever aspect of the
program concerned them at the time. We used this kind of question to elicit
feedback on issues we had perhaps not taken into consideration when developing
the list of specific questions.
TEACHERS
We also
assessed teachers’ attitudes on a variety of issues. In deciding what questions
to ask, we followed a procedure similar to that used in deciding what questions
to ask of the students. In addition, where possible we tried to ask the
teachers questions which were similar to those we asked the students so that
we could determine the extent to which students’ and teachers’ impressions and
attitudes differed. We eventually decided to obtain information on teachers’
attitudes on the following topics:
1. satisfaction with the students’ competence in the four skills and grammar; satisfaction
relative to that in other courses they had taught.
2. confidence in the students’ ability to cope in German in
the four skills
3. satisfaction
with the amount of German culture they were able to present in class
4. interest
in the content of what they talked about in class
5. satisfaction
with the amount of pressure placed on students
6. satisfaction
with the grading policy
7. satisfaction
with the amount of support they received from supervisors/staff, materials
supplied, training program
8. satisfaction
with the amount of class preparation time required
9. confidence
in their ability to train others in method
10. importance
of their experience in obtaining future employment
11. what
they would want more of, or less of
12. amount
of fun they had teaching the course
13. whether
they would want to teach the same class again
14. whether
they would want to study a language using the experimental method
15. what
they liked most and least about teaching the experimental class
ADMINISTRATORS
Finally, we wanted to know how
the department chairman felt about the program. In addition to questions about
perceived effectiveness of instruction, we asked the chairman a number of
questions about how the experimental program contributed to the Language
Department’s overall image and visibility. Where possible, we asked questions
parallel to those that we asked of the students and teachers. The following are
the topics we decided upon:
1 the importance of each of the
following goals for the experimental first-year German program:
a) preparing the students to use
German for fun as in travel, speaking with relatives and friends, reading for
pleasure, etc.
b) preparing the students to use
German to study in a German-speaking country
c) preparing the students to use
German in their work or future work, including missionary activities, business,
teaching, etc.
d) helping the students satisfy a
university requirement
e) helping the students satisfy their
curiosity or have an enjoyable experience
2 the importance of first-year
German students developing competence in each of the following areas:
listening, speaking, reading, writing, and grammar
3 satisfaction (from whatever
impressions he might have formed) with students’ achievement in each of the
following areas: listening, speaking, reading, writing, grammar
How to evaluate?
Here I discuss the methods used
to obtain the information given above. I first discuss the issue of method
effect. Then I provide examples of the various instruments.
Method effect
Numerous research studies
(Bachman and Palmer 1981, 1982, 1989; Brütsch 1979; Clifford 1981) indicate
that we cannot get directly at language competence without the method used
influencing the results. And while with a lot of effort the relative influence
of method can be quantified for a small number of competing methods (Bachman
and Palmer 1981, 1982, 1989), I know of no way to discover such a thing as ‘the
best’ or ‘the perfect’ method.
Recently, Bachman (1990) has
proposed a system for the componential analysis of test methods. However, since
the study described in this chapter predated Bachman’s system, our own thinking
about method was less systematic than would be the case today. Basically, we
attempted to select methods that would not be obviously disruptive. For
example, in evaluating the students’ ability to read German, we had them
summarize a reading passage in English rather than German. This did not
introduce their ability to write German as a factor contributing to the test
results. On the other hand, we knew that perhaps the students’ ability to
summarize would influence the test results. We simply considered this the lesser
of the two problems.
Practicality
Second, our decisions on how to
test were influenced by practicality considerations. Because of constraints on
time, money, and personnel, tests had to be easy to develop, administer, and
score. And the rating procedures had to be quick and uncomplicated. The
following examples or descriptions of different methods illustrate how we chose
to balance method effect considerations with practicality issues.
SPECIFIC METHODS: OVERVIEW
The specific test methods for
each skill/element are described below. Many of these tests were developed by
Steven Sternfeld and Batya Elbaum (members of the Department of Languages
faculty), with the assistance of graduate students in the Department of
Languages and the Linguistics-TESOL Masters Degree program. I will first
explain the ‘standard’ procedures used to rate most of the protocols. (I also
describe a different ‘special’ rating procedure below under the description of
the oral interview test.) Then I will describe the procedures used to obtain
the performance samples to be rated.
A pool of raters (teachers in the
German program) used the following ‘standard’ procedure to rate the students’
performance on most of the tests. They were told to decide whether a student
who could listen, speak, read, or write at the level demonstrated on the
protocol being rated would be able to perform satisfactorily in the second year
sheltered subject matter course (a content course taught in German specifically
for non-native speakers of German). A rating of ‘high’ indicated that the
student would definitely be able to perform satisfactorily. A rating of ‘mid’
indicated that the student would probably be able to perform satisfactorily.
And a rating of ‘low’ indicated that the student would clearly not be able to
perform satisfactorily. We have reason to believe, however, that the assigned
ratings were more indicative of the raters’ decisions about general language
proficiency than ability to perform in sheltered subject matter courses,
because some of the tests were measuring language abilities which would likely
have little to do with performance in such courses. Moreover, since the raters
had never taught such a course, they had no specific sheltered content course
experience from which to make the ratings.
Practical considerations also
dictated this global approach to most of the protocol rating. A more detailed
rating procedure would have required both a training program for raters and a
small group of raters with the time to participate in the training program and
the actual rating sessions. Funding and time constraints made this impossible.
Two methods were used to evaluate
the students’ German language proficiency: traditional language tests and students’
self-ratings.
LISTENING: LECTURE SUMMARY
A ten-minute videotaped lecture
was prepared by a professor who is a native speaker of German. The lecture
concerned some of the technicalities of East and West German politics. The
German used was academic with complicated syntax. It was unsimplified and was at a much higher level than the
students had been exposed to in class.
Students were allowed to take
notes during the lecture. Following it, they were instructed to summarize the
lecture as well as possible, in as much detail as possible. Protocols were
rated using the standard procedure described above.
LISTENING: STUDENTS’ SELF-RATINGS
The students were asked to write
brief descriptions of their ability to understand spoken German. We grouped
these descriptions into high, mid, and
low categories, counted the numbers of students in each category, and provided
examples of students’ comments in each category.
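The same group-count-exemplify procedure recurs below for the speaking, reading, writing, grammar, and vocabulary self-ratings. The following minimal sketch shows one way such tallies might be produced; the student identifiers, comments, and category assignments are invented for illustration and are not data from the study.

```python
from collections import Counter

# Hypothetical self-rating data: each entry is a student's free-text
# description plus the category (high/mid/low) assigned to it by a reader.
self_ratings = [
    {"student": "S01", "comment": "I understand most of what the teachers say.", "category": "high"},
    {"student": "S02", "comment": "I catch familiar topics but miss details.",   "category": "mid"},
    {"student": "S03", "comment": "I only pick out isolated words.",              "category": "low"},
]

# Count the number of students in each category.
counts = Counter(r["category"] for r in self_ratings)
for level in ("high", "mid", "low"):
    print(f"{level}: {counts.get(level, 0)} student(s)")

# Provide one example comment per category, as in the report.
for level in ("high", "mid", "low"):
    example = next((r["comment"] for r in self_ratings if r["category"] == level), None)
    if example:
        print(f"example ({level}): {example}")
```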
SPEAKING: ORAL INTERVIEW
A native speaker of German who
did not know the students interviewed each student for five to eight minutes,
the length depending upon how proficient the students were. The interviewer
went through a list of questions which included alternate questions so no two
interviews were exactly the same. Questions were of the sort one might ask
someone when trying to become acquainted. For example:
1 Where do you live?
2 Where do you work?
3 Have you studied other
languages before?
4 Why are you studying German?
In the middle of the interview,
the student was given some information to elicit by asking questions in German.
For example:
1 Where does the interviewer come
from?
2 What is the interviewer doing
in America?
3 Where does the interviewer
normally work?
4 What does the interviewer’s
family consist of?
Kramer scored these interviews
using two criteria: sentence complexity and well-formedness, and control of
inflectional morphology. He added the two scores together and then
converted these scores to ratings using a ‘special’ procedure different from
the standard one described above. Kramer compared the students’ performance on
the oral interview with their performance on the remaining tests that had been
rated using the standard procedure. He determined which students tended to
fall consistently into the high, mid, and low categories on the other tests and
used this information, together with the original rating criteria, to estimate
high, mid, and low break points for the interview scores.
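The chapter does not spell out exactly how these break points were estimated, so the following sketch is only one plausible reading of the procedure: each student’s dominant category across the standard-rated tests is found, and cut points are placed midway between the mean interview scores of adjacent groups. All names and numbers are hypothetical.

```python
from statistics import mean, multimode

# Hypothetical data: summed interview scores (complexity + morphology) and the
# standard high/mid/low ratings each student received on the other tests.
students = {
    "S01": {"interview": 18, "standard": ["high", "high", "mid", "high"]},
    "S02": {"interview": 11, "standard": ["mid", "mid", "high", "mid"]},
    "S03": {"interview": 12, "standard": ["mid", "mid", "mid", "low"]},
    "S04": {"interview": 5,  "standard": ["low", "low", "mid", "low"]},
    "S05": {"interview": 4,  "standard": ["low", "low", "low", "low"]},
}

def dominant(ratings):
    # The category a student falls into most consistently on the other tests.
    return multimode(ratings)[0]

# Collect interview scores by each student's dominant standard category.
by_category = {"high": [], "mid": [], "low": []}
for info in students.values():
    by_category[dominant(info["standard"])].append(info["interview"])

# Place break points midway between the mean scores of adjacent categories.
high_mid_break = (mean(by_category["high"]) + mean(by_category["mid"])) / 2
mid_low_break = (mean(by_category["mid"]) + mean(by_category["low"])) / 2

def rate_interview(score):
    if score >= high_mid_break:
        return "high"
    if score >= mid_low_break:
        return "mid"
    return "low"

for name, info in students.items():
    print(name, info["interview"], rate_interview(info["interview"]))
```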
SPEAKING: STUDENTS’ SELF-RATINGS
The students were asked to write
brief descriptions of their ability to speak German. We grouped these
descriptions into high, mid, and low categories, counted the numbers of
students in each category, and provided examples of students’ comments in each
category.
READING: TRANSLATION
The students were given a
humorous passage in German about a child who threw a rock through a window and
then bragged about having answered the teacher’s question (about who threw the
rock through the window) correctly. They then wrote an English translation,
which was rated using the standard procedure.
READING: SUMMARY
The students were given a written
passage about a German satirist written in academic, unsimplified German,
followed by a written summary of an interview with the satirist. The students
were then given five minutes to write a summary of the passage in English,
which was rated using the standard procedure.
READING: STUDENTS’ SELF-RATINGS
The students were asked to write
brief descriptions of their ability to read German. We grouped these
descriptions into high, mid, and low categories, counted the numbers of students
in each category, and provided examples of students’ comments in each category.
WRITING: GUIDED COMPOSITION
The students were given a 70-word
passage in German about the weather in Germany. They then wrote a passage in
German about the weather in America. These compositions were rated using the
standard procedure.
WRITING: DICTATION
A 65-word passage was dictated
to the students. The content consisted of a brief comparison of the political
differences between East and West Germany. The dictation was scored by marking
one point off for each omitted word, extra word, and pair of permuted words.
Words were considered correct if the spelling was phonetically appropriate. In
addition, a half point was deducted for each error in inflectional morphology.
The compositions were then rated using the special procedure described for the
oral interview test.
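A minimal sketch of the dictation scoring arithmetic follows, assuming the error counts have already been tallied by hand; the maximum score of 65 (one point per word in the passage) is an assumption, since the chapter does not state the scoring ceiling.

```python
# Dictation scoring sketch: one point off per omitted word, extra word, or
# pair of permuted words; half a point off per inflectional-morphology error.
# Spelling counts as correct if phonetically appropriate, so it is not
# penalized here.
def score_dictation(omitted, extra, permuted_pairs, morphology_errors, max_points=65):
    # max_points=65 is an assumption based on the 65-word passage.
    deductions = omitted + extra + permuted_pairs + 0.5 * morphology_errors
    return max(0.0, max_points - deductions)

# Example: 3 omitted words, 1 extra word, 2 permuted pairs, 4 morphology errors.
print(score_dictation(omitted=3, extra=1, permuted_pairs=2, morphology_errors=4))  # 57.0
```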
WRITING: STUDENTS’ SELF-RATINGS
The students were asked to write
brief descriptions of their ability to write German. We grouped these
descriptions into high, mid, and low categories, counted the numbers of
students in each category, and provided examples of students’ comments in each
category.
GRAMMAR: RATIONAL CLOZE
The students were given a
118-word multiple-choice cloze test with rational deletions (developed by Paul
Kramer). The passage, modified somewhat by Kramer, was taken from a German
reader. The subject matter was a humorous story about a student who went into a
restaurant and made a deal with the waiter that if the student could sing the
waiter a song that pleased the waiter, the student would get his dinner free.
The first sentence was left
intact. Twenty-five words were rationally deleted from the remainder of the
passage to test grammar, discourse competency,
and knowledge of the world. Deletions occurred at intervals ranging from four
to eleven words. Four answer choices were provided for each deletion.
Kramer scored the cloze test
using the exact word method. He then assigned ratings using the special
procedure described for the oral interview test.
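The exact-word method is straightforward to express in code. The sketch below assumes a hypothetical answer key and one student’s multiple-choice selections; only responses matching the originally deleted word earn credit.

```python
# Exact-word scoring for the rational cloze: a response counts only if it
# matches the word originally deleted from the passage.
def score_cloze(answer_key, responses):
    return sum(1 for key, resp in zip(answer_key, responses)
               if resp is not None and resp.strip().lower() == key.lower())

# Hypothetical three-item excerpt of the 25-item key and one student's choices.
answer_key = ["ging", "Kellner", "singen"]
responses = ["ging", "Ober", "singen"]
print(score_cloze(answer_key, responses))  # 2
```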
GRAMMAR: STUDENTS’ SELF-RATINGS
The students were asked to write brief descriptions of their control of German
grammar. We grouped these descriptions into high, mid, and low categories,
counted the numbers of students in each category, and provided examples of
students’ comments in each category.
VOCABULARY: UNCONTEXTUALIZED TRANSLATIONS
The students were given a list of German words to translate into English.
These were scored and then rated using the special procedure described for the
oral interview test.
VOCABULARY: CONTEXTUALIZED
TRANSLATIONS
The students were then given a
reading passage containing the same vocabulary words used in the
uncontextualized translation test. They then translated these words into
English using the additional information they could obtain from the context.
Kramer scored these translations and converted them to ratings using the
special procedure described for the oral interview.
VOCABULARY: STUDENTS’ SELF-RATINGS
The students were asked to write brief descriptions of their
control of German vocabulary. We grouped these descriptions into high, mid, and
low categories, counted the numbers of students in each category, and provided
examples of students’ comments in each category.
Affective domain
STUDENTS
Three methods were used to assess
students’ attitudes: activity rating slips, journals, and an attitude
questionnaire. First, the students were given an activity rating form to fill
out at the end of each class. On this form, they rated two attributes of each
of the day’s activities: interest and comprehensibility. Ratings were done on a
subjective scale of 0-100. The instructors used these ratings to decide how
successful the activities were (see the aggregation sketch below).
Second, the students kept
journals in which they described their language learning experience. These
journals were collected on a weekly basis, read by the instructors, and
summarized at the end of the program by a graduate student in the
Linguistics-TESOL M.A. Program.
Third, the students were given a questionnaire
at the end of the first and third terms. Most of the questions asked the students
to rate on a four-point scale their responses to questions of the sort
described above in the section What to evaluate? In addition, the students
responded to a few open-ended questions.
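The following sketch illustrates how the daily activity rating slips described above might have been aggregated for the instructors; the slips, activity names, and the 70-point threshold are invented for illustration and are not the procedure actually reported.

```python
from statistics import mean

# Hypothetical slips from one class day: each slip carries a student's
# 0-100 ratings of an activity's interest and comprehensibility.
slips = [
    {"activity": "weather discussion", "interest": 85, "comprehensibility": 90},
    {"activity": "weather discussion", "interest": 70, "comprehensibility": 80},
    {"activity": "fairy tale reading", "interest": 95, "comprehensibility": 60},
    {"activity": "fairy tale reading", "interest": 90, "comprehensibility": 55},
]

# Average the two ratings for each activity.
activities = sorted({s["activity"] for s in slips})
for act in activities:
    interest = mean(s["interest"] for s in slips if s["activity"] == act)
    comp = mean(s["comprehensibility"] for s in slips if s["activity"] == act)
    # An activity looks worth repeating if it is both interesting and
    # comprehensible (the 70-point cut-off is an arbitrary illustration).
    verdict = "repeat" if interest >= 70 and comp >= 70 else "revise"
    print(f"{act}: interest={interest:.0f}, comprehensibility={comp:.0f} -> {verdict}")
```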
INSTRUCTORS
Three methods were used to assess the
instructors’ attitudes: ongoing conversations, a questionnaire, and a paper. I
visited classes frequently and talked informally with the instructors
afterwards. In these conversations we discussed how the class had gone, which
activities seemed to work and which ones didn’t, and how the students seemed to
be reacting. On the basis of this information we decided how to modify the
instruction appropriately.
Also, the instructors wrote a paper at the end
of the program describing their impressions of the experience.
Finally, the instructors completed a 44-item
questionnaire at the end of the program. The questions covered demographic
information and the instructors’ attitudes toward various aspects of the
program (see the section What to evaluate? above). With the exception of two
open-ended response questions, questions were of the multiple-choice variety.
ADMINISTRATION
The department chairman filled out a 44-item
multiple choice attitude survey at the end of the program.
For whom to evaluate?
Possible audiences
As Rodgers (1986) has pointed out, a variety of people have an interest in language
teaching program evaluation reports. Students want to learn and have an
enjoyable experience in the process and might want to know what they could
expect of a specific program. Teachers want to know how effective instruction
proves to be and how enjoyable the teaching experience is. Administrators are
concerned with results and their effect on enrollments. Members of the
community may be concerned with the effect of the students’ language skills on
how well they function in the community. Researchers are concerned with the
results of applying language acquisition theory in the classroom. And program
evaluators may be interested in the report as a guide to future evaluation
studies.
Depending upon which audience the
program evaluator is addressing, the kind of information obtained and the way
it is presented would likely vary. For example, one would make fairly different
assumptions about the interests and background knowledge of readers of Language
Learning and members of a college curriculum review committee.
Options for reporting results
We found audience considerations
to be important when deciding upon procedures for reporting on student
performance, for we wanted readers to be able to form a concrete picture of
what students completing the program could do. Therefore, in addition to
providing quantitative data, Kramer (1989) reported both the numbers of
protocols falling into the high, mid, and low categories and provided examples
of typical protocols at each level. In addition Kramer provided a corrected
version of each protocol, as well as a translation.
I followed a similar procedure
in reporting the results of the attitude surveys (Palmer 1987). When
students provided self-ratings of their ability to use German, I grouped their
responses into high, mid, and low categories, reported the number of responses
in each category, and supplied examples of these responses. And when students
responded to open-ended questions about what they liked most and least about
the program, I quoted typical responses.
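The sketch below illustrates this reporting format for the performance protocols: counts per rating level, followed by a typical protocol with a corrected version and a translation. All protocol texts are invented, not examples from the study.

```python
# Sketch of the protocol-reporting format: for each rating level, report how
# many protocols fell into it and show one typical protocol together with a
# corrected version and an English translation.
protocols = [
    {"level": "high", "text": "Das Wetter in Amerika ist oft sehr kalt im Winter.",
     "corrected": "Das Wetter in Amerika ist im Winter oft sehr kalt.",
     "translation": "The weather in America is often very cold in winter."},
    {"level": "low", "text": "Wetter kalt viel Schnee.",
     "corrected": "Das Wetter ist kalt, und es gibt viel Schnee.",
     "translation": "The weather is cold, and there is a lot of snow."},
]

for level in ("high", "mid", "low"):
    group = [p for p in protocols if p["level"] == level]
    print(f"{level}: {len(group)} protocol(s)")
    if group:
        sample = group[0]
        print("  typical protocol: ", sample["text"])
        print("  corrected version:", sample["corrected"])
        print("  translation:      ", sample["translation"])
```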
A few comments on the findings
The purpose of the following
comments is to give the reader some indications of the sorts of conclusions we
reached about the general effectiveness of the program. (For a detailed
account, see Kramer 1989.) I will summarize the findings in three areas:
instructors’ attitudes toward the program, students’ attitudes toward the
program, and students’ performance on behavioral measures.
Practicality of the approach
The instructors felt that the
radical implementation of Krashen’s theory used was practical. They were able
to provide three terms of interesting, comprehensible input. The two-teacher
classroom was indeed feasible and generated a lot of useful interaction between
the instructors, and the students seemed to be interested in this interaction.
The drop-out rate for the course was no worse than for traditionally taught
classes.
Student attitudes toward the
program
Students’ reactions to the method
were very positive at the beginning of the program, when they appreciated the
lack of pressure to produce the language and the absence of formal testing. As
the program progressed, however, they began to worry about whether they would
eventually be able to speak, and there is some indication that they would like
to have been tested in order to know that they were indeed acquiring the
language. Their comments indicated that a number of them would have liked some
specific encouragement to speak and make mistakes in order to get used to this
experience.
One conclusion that we could draw
from this change in student attitudes is that it may be difficult to completely
satisfy both of the conditions necessary for acquisition in Krashen’s theory.
Specifically, we may not be able to provide students like the ones in our
program only with interesting, comprehensible input (and no output practice)
while maintaining a low affective filter strength in a relatively long-term
program.
This situation creates a problem
for researchers interested in evaluating Krashen’s theory. Specifically, in
order to test the theory, we would like to provide all and only what the theory
says is absolutely necessary for acquisition (comprehensible input with low
filter strength). Yet to keep the filter strength low over a long period of
time with students such as ours, we may have to provide output practice. Doing
so, however, makes the interpretation of the results difficult, since if the
students do acquire and if speech does emerge, we can no longer ascertain
whether this would have happened without the output practice. It may be the
case that to provide a good test of a radical implementation of Krashen’s
theory we will need to find a population of students who will not find the
absence of output practice over a long period of time threatening. It will be
interesting to discover whether such a population of students exists.
Student performance on behavioral
measures
THE EXPERIMENTAL GROUP CONSIDERED
BY ITSELF
Students’ abilities to produce
German ranged from slight to quite surprising (Kramer 1989). Students rated as
low were functioning as one would expect of acquirers within the early stages
of an interlanguage phase. For students rated as high, evidence of
acquisition was much more apparent, with these students at times able to
produce well-formed, contextually appropriate, complete sentences.
Also, actual levels of
acquisition may have been higher than the overall performance of the students
in this study indicates, since attitude measures became more negative over time.
This may have raised the ‘output filter’, which is said to limit the
performance of acquired competence (Krashen 1985).
COMPARISON BETWEEN EXPERIMENTAL
AND CONTROL GROUPS
A MANOVA (multivariate analysis of
variance) conducted by Kramer (1989) indicated that the control group students
(who received traditional instruction) performed significantly better overall
than the acquisition students. They also performed significantly better on four
of the seven tests (oral interview, reading translation, writing summary, and
vocabulary). The experimental students did not perform significantly better on
any of the tests (Kramer 1989).
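For readers who want to see how such a between-group comparison could be set up, here is a minimal sketch using the MANOVA class in statsmodels on an invented data frame; the two dependent measures and all scores are placeholders, not the study’s data, which comprised seven measures.

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Invented scores on two measures for a handful of students in each group;
# the real analysis used seven dependent measures.
df = pd.DataFrame({
    "group":     ["experimental"] * 4 + ["control"] * 4,
    "interview": [10, 12, 9, 11, 15, 17, 14, 16],
    "cloze":     [14, 13, 15, 12, 18, 20, 17, 19],
})

# One-way MANOVA: do the groups differ on the dependent measures considered jointly?
maov = MANOVA.from_formula("interview + cloze ~ group", data=df)
print(maov.mv_test())
```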
Discussion
Amount of data
We may have gotten somewhat out
of balance with respect to quantity versus quality of data. In some cases we
noticed that when we used several different methods to obtain the same data, we
tended to reach the same conclusions. For example, with student attitude data
we tended to reach the same conclusions from the questionnaires and journals.
Also, since our primary research interest was the issue of the validity of
Krashen’s theory of second language acquisition, we found little immediate use
for the information obtained from the department chairman.
Quality of the data
In future studies, I would
probably spend more time on test development and use fewer methods. For
example, if we were to use a scaled attitude survey again, I would research
options for wording questions and scaling responses. (See Oskarsson 1978;
Bachman and Palmer 1989.) Bachman and I found that questions about ‘difficulty’
in using language provided more useful information on the ability than did
‘can-do’ questions about what the students were able to do.
In addition, some of the data we
obtained would have been of little value even if we had specific uses for it. For example, we asked the
department chairman to rate various aspects of the program without providing
him with enough information to make his task meaningful. We should have
provided him with examples of student test performance and asked him to visit
the class instead of having him rely on what he happened to hear about the
course and the students’ performance on it.
Finally, I would carefully
pre-test all of the instruments. A number of students commented that certain
questions on the attitude survey were difficult to interpret, and the limited
number of answer choices provided did not allow them to make what they
considered valid responses. Also, some of the tests (such as the lecture
summary) may have been too difficult. Pre-testing the instruments would also
have helped us address this problem and balance the quality and quantity of
data obtained.
In general, had we spent less
time gathering the same kind of information by means of different methods and
less time gathering data not related to our basic research question, we could
have spent more time refining our primary instruments and preparing the
students for the testing experience.
Formats for reporting data, and effects on interpretability
We found it useful to provide
examples of student performance at different levels in addition to reporting
statistics. This helped audiences relate more directly to the numerical data
provided, particularly since only norm-referenced tests were used. For example,
a dean at the University of Utah was more concerned with the general level of
ability of both experimental and traditional students (as shown by examples of
what the students could do with the language) than with what he considered
minor (though perhaps significant) differences between the groups (as evidenced
by descriptive statistics from the norm-referenced tests used). An even more
productive approach would be to develop and use criterion-referenced tests of
communicative language ability (Bachman 1989; Bachman and Clark 1987; Bachman
and Savignon 1985).
Between-group comparison
Kramer’s finding that the control
group performed significantly better
than the experimental group came as quite a surprise to us. We expected
the experimental group to do better. If this single study is considered in
isolation, the finding would appear to be rather straightforward evidence
against the Input Hypothesis. What this study led to, however, was an
investigation of a number of method-comparison studies, all of which dealt with
the issue of the relative efficiency of input-based and eclectic instruction
(Kramer and Palmer 1990; Palmer 1990). As a result of this investigation, we
discovered an interesting pattern of interaction among method of instruction,
purity of instruction and age of students. Briefly, in studies with ‘impure’
experimental treatments, we found no significant differences. In studies with
‘pure’ experimental treatments, we found significant differences favoring the
experimental (input-intensive) treatment for children, but favoring traditional
(eclectic) instruction for adults.
Conclusion
Evaluating input-based language
teaching programs is challenging. On the one hand, we want to keep the
experimental treatment as pure as possible in order to increase the internal
validity of the study. On the other hand, we want to keep the students happy,
which may require a more eclectic approach. Thus, we seem to be caught in a tug
of war between interpretability and practicality-generalizability: internal and
external validity (Beretta 1986a).
Our experience was particularly
satisfying because the results we obtained were not necessarily the results we
anticipated, so we felt like we had learned something new from the process. Moreover,
struggling with the issues of research design, test design, and interpretation
helped us clarify some of the problems we faced and suggested some new
directions we might take in the future (Kramer and Palmer 1990; Palmer 1990).
The increasing interest in
language testing and program evaluation is demonstrated by the enthusiastic
participation of large numbers of colleagues in conferences such as the 1986
International Conference on Trends in Language Programme Evaluation in Bangkok,
and the 1990 RELC Regional Seminar
on Language Testing and Language Programme Evaluation in Singapore. This
involvement is likely to bring increased insight into what Alderson (1986)
calls ‘The nature of the beast’.