USING STANDARDIZED TESTS UNCONVENTIONALLY: AN ADAPTED READING ASSESSMENT.
by PING LIU, RICHARD PARKER, RAFAEL LARA

The appropriateness of standardized, selection-type reading tests has been challenged, especially for students learning English as a Second Language (ESL). This study investigated the use of a multi-step Test Item Post-Conference (TIPC) procedure with thirty ESL students. The procedure results in an "adjustment" of standardized multiple choice test scores. Acceptable interrater reliability of the TIPC procedure was obtained. Strong alternate form reliability of the TIPC scores was also obtained, compared to low alternate form reliability for the original (pre-adjustment) test scores. The procedure proved relatively efficient and the results were well received by the participating ESL teachers. Detailed advantages and disadvantages of the TIPC are discussed.

Despite widespread criticism of its use (Cohen, 1988; Cummins, 1984, 1989; Freeman & Freeman, 1992; Neill & Medina, 1989; Rothman, 1990; Scarcella, 1990; Valencia & Pearson, 1987), standardized testing is seen by some constituencies as both useful and necessary (Allerson & Grabe, 1986; Farr & Beck, 1991). In fact, the use of standardized tests is increasing (Pikulski, 1990).

Standardized testing intersects the field of ESL/Bilingual education at two points. First, ESL professionals need to prepare students for standardized testing; for many it is a truly foreign experience (Deyhle, 1987). Second, ESL professionals need to ensure that ESL students' standardized testing provides reliable and valid information on students' abilities (Crawford, 1993; Scarcella, 1990). This entails scrutinizing the reliability and validity of test scores, and the appropriateness of the testing format for culturally and linguistically distinct populations (Cohen, 1988). This second need is the focus of the present study.

Standardized test scores are end products of a complex process. Educators assume that the scores are good indicators of a common process across students.
But the processes of text understanding and test-taking may not be common to all students. These processes may differ greatly because of large cultural and linguistic differences (Crawford, 1993; Scarcella, 1990). Where these differences are great, the test score alone is an inadequate indicator of the underlying process of reading with understanding. For these students, educators need additional information to meaningfully interpret test scores (Cohen, 1988). This additional information may come from a variety of other classroom tasks and observations (e.g., portfolio assessment), or it may be derived directly from a closer scrutiny of performance on the standardized test (Langer, 1987; Farr & Beck, 1991; Valencia & Pearson, 1987). Our study takes the latter approach.

One missing element in interpreting standardized reading test scores for ESL students is the role that English language fluency may play. Reading performance cannot be isolated from language competency; particular receptive and expressive language capabilities are relied on both in reading and in responding to the test task. Although listening/speaking and reading/writing are separately observable tasks, they are also windows into a more diffuse language competency (Widdowson, 1978). For ESL students with limited English proficiency, it is especially important that reading assessment permits educators to interpret reading competency in light of English language competencies.

The most popular standardized testing format is multiple-choice selection, wherein students read and select the best response from among distractors (Hill & Parry, 1992).
Among the assumptions made by this format are: that students have read and are responding to the text, that students can read and understand all options, that students have a common cognitive schema for all options (a common cultural experience which indicates their relative relevance), and that students have common strategies or facilities for coping with this testing format. For ESL students, some or all of these assumptions are commonly not met (Cheng, 1987; Cole, 1975; Cummins, 1984; Reibeiro, 1980). When the assumptions are not met, test results are not directly interpretable (Wanat, 1977).

Although publicly challenged, standardized tests continue to enjoy widespread use for systems accountability. However, test results are also often used to make decisions for individual students and for supplemental programs within schools and districts, e.g., bilingual or ESL programs. Their use for decision making for individuals and for supplemental programs may have negative social consequences. Salganik (1985) points out that the reliance on standardized test scores of questionable meaning has actually weakened educators' professional judgments. Scores from standardized tests alone provide little understanding of students' reading competency. Similarly, Sizer (1984) claims that dependence on test scores has led educators to focus on the wrong problems, and more seriously, on the wrong solutions to the problems. Rothman (1990) suggests that standardized testing has reduced opportunities for many language minorities, including bilingual students.

Crawford (1993), Scarcella (1990) and others argue that standardized tests fail to measure ESL students' performance because of biases in content and in response formats. Timed tests and specific test formats such as multiple-choice may cause excessive anxiety in students not acculturated to such tests (Scarcella, 1990, p. 154). Deyhle (1987) suggests that test-taking skills need to be taught to many ESL students.
Unfamiliar cultural referents may also confuse students because they lack the prior experience to provide the concepts with a robust conceptual field. Consequently, they do not respond as expected to testing items with unfamiliar cultural referents (Cheng, 1987; Cole, 1975; Cummins, 1984; Reibeiro, 1980).

Cohen (1988) points out that standardized tests are not valid and reliable for the sub-groups who were not well-represented in the norm group. Even if well-represented, separate reliability and validity coefficients need to be obtained for culturally or linguistically distinct groups. Without establishing separate reliability and validity coefficients for such groups, norm-referenced test scores should not be used for individual decision-making. Doing so has reduced opportunities for many language minorities, including bilingual students (Rothman, 1990).

For some, establishing separate reliability and validity coefficients for cultural subgroups is an inadequate solution. Instead, alternative test formats and scoring procedures are needed. Such alternative assessments follow two approaches. The first involves assessment of the reading process that students undergo in place of the reading product. This approach commonly includes such methods as direct observation (Allerson & Grabe, 1986; Frager, 1984) and conferencing (Allerson & Grabe, 1986; Hosenfeld, 1984; Kucer, 1983). The second approach emphasizes integrated language tasks. Proponents of this approach assert that reading is a language task which cannot be isolated or understood apart from more global language competence (Clark, 1983; Cziko, 1982; Oller, 1976, 1979; Savignon, 1982; Brown, 1987). Reading is therefore assessed as part of a complex language task. Text recall or retell tests are typical examples of this second type (Appel & Lantolf, 1994; Connor, 1984; Gambrell, Koskinen, & Kapinus, 1991; Lee, 1986).

Reading process assessment (e.g.
via observation or conferencing) examines students' responses to reading through observing multiple tasks or through social interaction between a student and a teacher. When teacher observations are focused on a limited number of important features, they are able to reproduce results similar to those of formal reading evaluations (Frager, 1984). Conferencing is an assessment procedure involving interaction between a teacher and a student to evaluate text comprehension (Allerson & Grabe, 1986; Harris & Sipay, 1985; Johnson, 1984; Kucer, 1983). To prepare for conferences following silent reading, Kucer (1983) suggests that students mark the trouble spots with a highlighter while reading a text. In the conference, the student may discuss the highlighted portions, as well as verbally summarize the text. The conference may examine student accuracy as well as counterproductive reading strategies.

Conferencing can also occur following standardized test-taking. Valencia and Pearson (1987) attempted to validate large scale standardized multiple choice tests by interviewing test-takers. They found that some things can be learned only by talking to students individually; they argue that student assessment should include the opportunity for students to reflect upon their own performance.

Integrated language tasks attempt to avoid assessing fragmented subskills of reading. Many reading professionals have argued that the subskill assessment of reading has been counterproductive (Clark, 1983; Cziko, 1982; Oller, 1976, 1979; Savignon, 1982; Brown, 1987). In integrative tests a learner simultaneously applies multiple language abilities to reading. A typical integrative test is text recall or retell. After reading, students tell what they have read, either orally or in writing. Students rely on reading ability, metalinguistic knowledge, retention of text structure, and expressive language skills to perform the task (Connor, 1984).
The recall tests evaluate a complex, integrated language performance, and do not purport to differentiate sub-abilities.

The two alternative assessment approaches, process assessment and integrated language assessment, each have advantages and limitations. The process assessment approach permits inquiry into reading comprehension strategies rather than products. But process assessment may not provide usable summary scores nor consider reading performance in the context of broader language functioning. Integrated assessment provides information on reading within more "ecologically valid" integrated language tasks. On the other hand, the integrated assessment approach provides whole-task information only, without information of potentially diagnostic use. Both approaches show promise as substitutes for standardized reading assessment, rather than methods to improve standardized testing. Thus, their wide-scale impact may be limited, given the increasing commitment to standardized testing by most school systems.

The present study investigates a novel reading assessment procedure, the Test Item Post-Conference (TIPC), which seeks to consider both process-oriented and integrated language elements in assessment, but via querying multiple-choice responses. The TIPC relies on interpreting and expanding (rather than replacing) typical standardized testing. Through the TIPC, students take a standard multiple-choice comprehension test, and then respond to questions about their answers and their familiarity with the reading content. The TIPC yields both a raw score based on multiple choice selection and an adjusted score--a modification of the raw score from debriefing with the student.

This study was guided by three research questions:
1) Interrater Reliability: Will different examiners independently obtain similar TIPC adjusted scores from querying the same student?
2) Descriptive: How and to what extent do students' scores change from information gained in item query and interview?
3) Alternate Form Reliability: What is the alternate form reliability of the adjusted TIPC scores compared to the original scores?

Method

Context and Respondents

The respondents for this study were 30 fourth-grade (ages 9-10) Limited English Proficient (LEP) students from English as a Second Language (ESL) programs in south-central Texas. These students were selected from the population of 86 ESL fourth graders in two neighboring school districts. Eighteen participants were from one school district and twelve from another. The ESL programs included diverse students, representing Korean, Chinese, Spanish, and Indonesian languages and cultures; half of the students were Spanish-speaking immigrants from South America. All the participants for the study were relatively recent immigrants who spoke little English before entering the programs. However, the length of time they stayed in the U.S. ranged from 6 to 1.5 years, with a median of about two years. At the time of this assessment, all of the students had been attending the ESL program for about one to two years. They were able to communicate verbally in the classroom quite well, but their reading ability varied considerably based on teachers' evaluations and standardized test scores. Students with known disabilities were excluded.

The ESL program involved pulling students from their regular (mainstream) classes for 45 or 75 minutes each day to receive English language instruction in small age-based groups. Content areas were not taught in the ESL program. Normally students were enrolled in the ESL program for two to three years before being able to competently manage mainstream classrooms all day.

Instrumentation

The TIPC procedure included three data collection steps. The first involved passage reading and responding to multiple choice questions by the student.
The second involved the examiner querying students about their multiple choice responses. The third was an open-ended interview of the student to gain his/her reflections on background knowledge and test-taking strategies. The second and third steps were used to examine the objective multiple choice scores.

Passage reading and multiple choice questions. A reading passage of 160-170 words was read both silently and orally, followed by 7 or 8 multiple choice questions. The questions were of three types: literal, inferential, and evaluative. Two passages with questions were selected from the reading comprehension section of Iowa Tests of Basic Skills: Complete Battery Plus Social Studies and Science, Form G, Level 10, 1986. The reasons for selecting the passages were as follows: a) all three types of questions are included, b) the length of the passage and number of questions are similar, and c) the standardized tests were used annually in the schools.

Query Record. This form provided the questions the examiner asked when querying the student's multiple choice responses. The student was asked to verbally explain and justify the choices made. Students were asked: "Why did you choose that answer? Why do you think it is correct? What is wrong with the other choices?" The Query Record provided space to record student responses, and for adjustments in the multiple choice scoring based on students' defenses of their answers. The adjustment procedure entailed a justification and coding of the reason for each item change. There were five reasons for raising (+) or lowering (-) item scores:
(a) Background: (-) Correct answer solely by background knowledge; (+) On basis of student's cultural understanding, question can be answered differently.
(b) Elimination: (-) Correct answer solely by eliminating alternatives.
(c) Guessing: (-) Student randomly guessed correctly.
(d) Comprehension: (-) Correct answer but for wrong reasons; based on misinterpretation of text or question.
(e) Carelessness: (-, +) Student said first response was an error, and spontaneously changed answer (without prompting).

Interview Record. This form provided open-ended questions for students related to their reactions to the tests, their background knowledge of the reading content, and their reading and test-taking strategies. It also provided space to record student interview responses.

Guidelines were developed for assessors to use the instrument.

Procedure

The first author was the main assessor and rater in this study. Procedures were piloted over a six-month period with the assistance of another Bilingual/ESL doctoral student. This phase included continuous negotiation between raters during and after scoring student performance. The pilot phase produced a series of revised scoring forms and guidelines. The pilot phase ended with the first author and assistant obtaining interrater reliability in independent scoring on the reading performance of fifteen students. This reliability is reported in detail later.

The main study phase utilized test-retest measurement with all the fourth grade ESL participants. Students were assessed twice, responding to two different grade-equivalent multiple choice tests over a period of one week, for retest reliability. Students were individually taken out of class during school hours. No time limit was imposed on students reading or answering the post-reading questions. All students were administered the same two grade-level reading tests, with administration order balanced among the 30 students.

The TIPC procedure involved six steps. First, a student read a passage silently, while highlighting any difficult parts--words, phrases, or sentences. Second, the student read the passage aloud. Third, the student responded to the multiple-choice questions, with review of the text permitted.
Fourth, the student was queried about each multiple-choice response, using the Query Record. Fifth, the student was interviewed, using the Interview Record. Finally, based upon information gained in steps four and five, the examiner adjusted each multiple choice response, and recalculated an "adjusted" raw score for the multiple choice questions.

During the assessment, most students were able to understand the questions posed and responded to them based on their comprehension of the passages. However, several students did have difficulty explaining themselves verbally. They used gestures, drew pictures, or had to select from choices given by the evaluator to get their ideas across. Some other students whose oral reading was almost perfect and who were able to talk fluently failed to defend their choices due to poor reading comprehension. It seemed that these students emphasized different aspects of reading in their language learning process. For the remaining students, there seemed to be a positive relationship between different aspects of reading, i.e., a student who was strong in oral reading also demonstrated good reading comprehension skill.

Results

Interrater Reliability

The pilot phase ended with an interrater reliability check involving fifteen newly selected students. Reliability was obtained on the adjusted scores--the individual item scores (0/1) modified on the basis of the Query Record and the Interview Record. The two raters independently obtained 87% agreement, with 12% agreement obtainable by chance alone. Cohen's Kappa (Cohen, 1976), an index of agreement which corrects for chance, was .85, considered a good level of agreement. Maximum Kappa for this dataset was .93. The ratio of Kappa to the Maximum Kappa was .85/.93 = .91.

Descriptive Results of Changes from Original to Adjusted Raw Scores

The TIPC multiple-choice-based raw scores were expressed as "percent correct".
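The chance-corrected agreement reported under Interrater Reliability follows from the two percentages given there: kappa is the observed agreement minus the chance agreement, divided by one minus the chance agreement. The short Python sketch below reproduces that arithmetic from the reported values. The function name and the rounding steps are illustrative assumptions, not part of the study materials.

```python
def cohens_kappa(p_observed: float, p_chance: float) -> float:
    """Agreement beyond chance, scaled by the maximum possible agreement beyond chance."""
    if not 0.0 <= p_chance < 1.0:
        raise ValueError("p_chance must be in the range [0, 1)")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Values reported in the text: 87% observed agreement, 12% chance agreement.
kappa = cohens_kappa(0.87, 0.12)
reported_kappa = round(kappa, 2)

# Maximum Kappa of .93 is taken from the text; it is not derived here.
kappa_max = 0.93
ratio_to_max = round(reported_kappa / kappa_max, 2)

print(reported_kappa)  # 0.85
print(ratio_to_max)    # 0.91
```

Both printed values match the figures reported in the text (.85 and .91).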
They were adjusted based on an item-by-item query and a follow-up debriefing interview. Adjusted scores could be higher or lower than, or equal to the raw scores. Table 1 shows the direction and amount of change from original to adjusted raw score.
Table 1. Changes between Original and Adjusted Multiple Choice Scores (decimals are "percent correct") on the TIPC, Tests 1 and 2 (N = 30).

Changes for Adjusted Score
            Lower         Same          Higher
Test 1      77% (n=23)    17% (n=5)     7% (n=2)
Test 2      63% (n=19)    27% (n=8)     10% (n=3)

Scores
                        Avg.    Max.    Min.
Test 1      Original    49.2    87.5    12.5
            Adjusted    38.8    81.3     6.3
Test 2      Original    52.4    85.7    14.3
            Adjusted    41.9    85.7     7.2
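How an original and an adjusted "percent correct" score relate, and how the Lower/Same/Higher columns of Table 1 could be tallied, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: items are scored 0/1, a "-" adjustment removes credit for an item and a "+" adjustment grants it, and the three student records are hypothetical, not data from the study.

```python
from collections import Counter


def percent_correct(item_scores):
    """Percent of items scored correct (each item is 0 or 1)."""
    return 100.0 * sum(item_scores) / len(item_scores)


def apply_adjustments(item_scores, adjustments):
    """Return a copy of item_scores with coded adjustments applied.

    adjustments maps item index -> (sign, reason); "-" removes credit
    for the item and "+" grants it.
    """
    adjusted = list(item_scores)
    for index, (sign, _reason) in adjustments.items():
        adjusted[index] = 1 if sign == "+" else 0
    return adjusted


def direction(original_pct, adjusted_pct):
    """Classify the change the way the Table 1 columns do."""
    if adjusted_pct < original_pct:
        return "Lower"
    if adjusted_pct > original_pct:
        return "Higher"
    return "Same"


# Hypothetical records for an 8-item test: (item scores, coded adjustments).
students = {
    "A": ([1, 1, 0, 1, 0, 1, 1, 0], {0: ("-", "guessing"), 3: ("-", "elimination")}),
    "B": ([1, 0, 0, 1, 1, 0, 0, 0], {}),
    "C": ([0, 1, 1, 0, 0, 1, 0, 1], {0: ("+", "background")}),
}

direction_tally = Counter()
reason_tally = Counter()
for original_items, adjustments in students.values():
    adjusted_items = apply_adjustments(original_items, adjustments)
    change = direction(percent_correct(original_items), percent_correct(adjusted_items))
    direction_tally[change] += 1
    reason_tally.update(reason for _sign, reason in adjustments.values())

print(dict(direction_tally))
print(dict(reason_tally))
```

Keeping the reason code alongside each adjustment means the same records support both the direction-of-change summary and a tally of reasons for item score changes.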
The tests proved difficult for most subjects; average scores were around 50 percent correct, and no student got all items correct. Generally, most (73%-84%) student scores were changed due to the item-query and interview. Across both tests, more scores were lowered (63%-77%) than were raised (7%-10%). Score adjustment had the effect of lowering the lowest scores (Min.) and maintaining or slightly lowering the highest scores (Max.). For both tests, group averages (Avg.) for Adjusted scores were lower than for Original scores. These score adjustment patterns were stable: they were similar for the first and second testing.

Across the two tests, a total of 100 item scores were changed. The reasons for item score change, in descending frequency, were Comprehension (25%), Guessing (22%), Carelessness (21%), Elimination (21%), and Background (11%).

Alternate Form Reliability

Alternate form reliability was calculated for the 30 students by correlating test 1 and test 2 scores. Both original and adjusted scores were correlated. The correlation between the first and second original scores was weak and non-significant (r=.26). For adjusted scores the results were very different. Adjusted scores demonstrated a moderately strong (r=.72) and statistically significant (p < .01) relationship. The difference in size between these two reliability coefficients is large, and is the major finding of this study.

Discussion

The appropriateness of standardized, selection-type reading tests has been challenged, especially for students learning English as a Second Language (ESL). This study investigated the use of a multi-step Test Item Post-Conference (TIPC) procedure with thirty ESL students. The TIPC procedure aims to provide additional information to reading comprehension scores based on standard multiple-choice test items. Students read a passage silently, while highlighting any difficult parts, and then orally.
After answering multiple-choice questions, students were queried about each response. The students were then interviewed more generally about background knowledge of passage content and their test-taking strategy. These two follow-up procedures resulted in an "adjustment" of individual test item scores and therefore the total "percent correct" score.

The major finding of this study was the consistency of the adjusted score compared to the original score, as evidenced by its relatively strong equivalent form reliability obtained over a one-week period. The reliability coefficient of .72 (p < .01) is quite respectable for a new procedure with a small set of test items (8 in Test 1 and 7 in Test 2). The importance of this coefficient is confirmed by comparing it to the chance-level (r=.26) coefficient for the original scores.

The TIPC procedure changed an average of one or two test items for most (four-fifths) students, for an average drop of ten "percent correct" total test points (less than 1 item). The particular score changes were not the same for all students, however; they were individualistic. Of this we can be assured, due to the evidenced major change in student rankings from original to adjusted scores.

What was the anatomy of these item changes? We had anticipated that most item score changes would be increases due to different legitimate cultural backgrounds and interpretations. In fact, background/cultural reasons accounted for only 11% of the changes. The most prevalent reasons for score changes were the following, each occurring 21-25% of the time: (a) comprehension, (b) guessing, (c) elimination, and (d) carelessness. These results demonstrate that cultural/experiential differences can be used to scrutinize selection-type standardized test results for ESL students. In addition, we need to be concerned with the use and misuse of test-taking strategies.
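The two reliability coefficients compared in this discussion (r = .72 for adjusted scores, r = .26 for original scores) are correlations between each student's Test 1 and Test 2 scores. The text says only that the scores were "correlated"; the sketch below assumes Pearson's product-moment correlation, and the score lists are hypothetical values chosen for illustration, not the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical adjusted "percent correct" scores for five students
# on the two alternate forms (Test 1 and Test 2).
test1_adjusted = [37.5, 50.0, 25.0, 62.5, 12.5]
test2_adjusted = [42.9, 57.1, 28.6, 71.4, 14.3]

print(round(pearson_r(test1_adjusted, test2_adjusted), 2))
```

Running the same function on the original (pre-adjustment) scores for the same students would give the second coefficient for comparison.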
It appears that the same problems in standardized test-taking which are common among the general school-age population are evidenced more strongly among ESL students. Our subjects were more helped than hindered by elimination and guessing strategies, causing us to overestimate their reading ability based on original scores alone. Perhaps part of this success was due to the fact that we offered only four possible choices per item, so students had a 25% chance of randomly selecting correctly, even with no reading comprehension. The finding underscores the limitations of bare scores alone for communicating students' ability (Cohen, 1988; Freeman & Freeman, 1992; Salganik, 1985).

Following the study, the three classroom teachers of the participants were presented with the original and adjusted test results for their students and asked to react. All three reacted positively and felt that additional information was required beyond lone reading proficiency scores for their students. One teacher wanted the evaluation to be included in the student's record for the next year. A second teacher intended to modify her individual testing to incorporate elements of the querying technique. She felt the TIPC would be especially useful for the students who reached Fluent English Speaking level on the IPT test. The teachers actually wanted to begin using the TIPC procedure (two contingent upon receiving teacher aide help). The three participating teachers also offered their various preferred uses of the procedure: to diagnose reading problems, to diagnose test-taking problems in preparation for mandated group testing, to use at the beginning and end of the school year as a progress measure, etc.

The findings of this study suggest that a post-testing query and interview strategy can be used to increase the validity of ESL students' multiple choice reading comprehension test scores. The procedure certainly warrants further study.
In future studies, modification of three important variables is suggested. First, multiple choice tests with 5 options should be studied to offer students another common format, and one under which they are less likely to be able to capitalize on guessing. Second, a study should be attempted with the post-reading questions but without the follow-up interview. The interview was designed mainly to provide more cultural background information and feedback on test-taking strategies attempted. The cultural background information gained in the by-item queries was generally sufficient. These queries also provided a lot of direct information on test-taking strategies used, more credible information than that sought by self-report in the interview. A third desired study is one involving student response to text by silent reading only. In the present study we included silent reading with underlining and oral reading in order to fully engage the student. These procedures do provide additional information on the student-as-reader. However, the additional information was not used for this present study. By varying the test-taking procedure from that commonly encountered in standardized sessions, we limited the external validity of our procedure.

The TIPC procedure has a few acknowledged disadvantages. The main disadvantage appears to be that the follow-up query should occur very soon after the reading--exactly how soon is not known. Thus querying responses from a group of ESL students may be logistically problematic. A second disadvantage we anticipate concerns teachers' ability to adjust their own students' scores fairly. We obtained satisfactory reliability indices in training two doctoral students to make score adjustments. However, they were dispassionate outsiders.

The TIPC also possesses distinct advantages which, together with the results from this study, warrant its further study. The first advantage is that the TIPC seeks to strengthen, not replace, standardized testing.
Its results are expressed in terms of the original raw test score, plus the adjusted score. Second, the procedure permits standardized group test administration of ESL students. The extra individual time and effort and the relatively "non-standardized" aspect of the assessment are restricted to following up on the group testing. The third advantage is that adjustments in standardized scoring are not made blindly, but on the basis of student-specific evidence. Fourth, the procedure is open to scrutiny under the traditional psychometric standards of reliability and validity. Together, these advantages offer an adaptive procedure which responds to individual ESL student needs.

References

Allerson, S., & Grabe, W. (1986). Reading assessment. In F. Dubin, D. E. Eskey, & W. Grabe (Eds.), Teaching second language reading for academic purposes (pp. 161-181). Reading, MA: Addison-Wesley.

Appel, G., & Lantolf, J. P. (1994). Speaking as mediation: A study of L1 and L2 text recall tasks. The Modern Language Journal, 78(4), 437-452.

Brown, D. (1987). Principles of language learning and teaching. Englewood Cliffs, NJ: Prentice Hall Regents.

Cheng, L. (1987). Assessing Asian language performance. Rockville, MD: Aspen Publishers.

Clark, J. (1983). Language testing: Past and current status-directions for the future. The Modern Language Journal, 67, 431-443.

Cohen, S. (1988). Tests: Marked for life? New York: Scholastic.

Cole, M. (1975). Culture, cognition and IQ testing. National Elementary Principal, 54, 49-52.

Connor, U. (1984). Recall of text: Differences between first and second language readers. TESOL Quarterly, 18, 230-256.

Crawford, K. (1993). Language and literacy learning in multicultural classrooms. Needham Heights, MA: Allyn and Bacon.

Cummins, J. (1984). Bilingualism and special education: Issues in assessment and pedagogy. Clevedon, England: Multilingual Matters.

Cummins, J. (1989). Empowering language minority students. Sacramento: California Association for Bilingual Education.

Cziko, G. (1982). Improving the psychometric, criterion-referenced, and practical qualities of integrative language tests. TESOL Quarterly, 16, 367-379.

Davey, B. (1983). Think aloud -- Modeling the cognitive processes of reading comprehension. Journal of Reading, 27, 44-47.

Deyhle, D. (1987). Learning failure: Tests as gatekeeper and the culturally different child. In H. Trueba (Ed.), Success or failure? Learning and the language minority student. New York: Harper and Row.

Farr, R., & Beck, M. (1991). Evaluating language development: Formal methods of evaluation. In J. Flood, J. Jensen, D. Lapp, & J. Squire (Eds.), Handbook of research on teaching the English language arts (pp. 489-501). New York: Macmillan.

Frager, A. M. (1984). How good are content teachers' judgments of the reading abilities of secondary school students? Journal of Reading, 27, 402-406.

Freeman, Y., & Freeman, D. (1992). Whole language for second language learners. Portsmouth, NH: Heinemann.

Gambrell, L., Koskinen, P., & Kapinus, B. (1991). Retelling and the reading comprehension of proficient and less-proficient readers. Journal of Educational Research, 84, 356-362.

Glazer, S. (1992). Reading comprehension: Self-monitoring strategies to develop independent readers. New York, NY: Scholastic.

Harris, A. J., & Sipay, E. R. (1985). How to increase reading ability, 8th ed. New York: Longman.

Hieronymus, A., Hoover, H., Lindquist, & others (1986). Iowa tests of basic skills: Complete battery plus social studies and science, Form G, Level 10. Chicago, IL: The Riverside Publishing Company.

Hill, C., & Parry, K. (1992). The test at the gate: Models of literacy in reading assessment. TESOL Quarterly, 26, 433-461.

Hosenfeld, C. (1984). Case studies of ninth grade readers. In J. C. Alderson & A. H. Urquhart (Eds.), Reading in a foreign language (pp. 231-244). New York: Longman.

Johnson, P. (1984). Assessment in reading. In D. Pearson (Ed.), Handbook of reading research (pp. 147-182). New York: Longman.

Kucer, S. (1983). Personal communication. October.

Langer, J. A. (1987). The construction of meaning and the assessment of comprehension: An analysis of reader performance on standardized test items. In R. O. Freedle & R. P. Duran (Eds.), Cognitive and linguistic analyses of test performance. Norwood, NJ: Ablex.

Lee, J. F. (1986). On the use of the recall task to measure L2 reading. Studies in Second Language Acquisition, 8(2), 201-211.

Neill, D., & Medina, N. (1989). Standardized testing: Harmful to educational health. Phi Delta Kappan, 70, 688-702.

Oller, J. (1976). A program for language testing research. Language Learning, Special Issue Number 4, 141-165.

Oller, J. (1979). Language tests at school: A pragmatic approach. London: Longman Group Limited.

Pikulski, J. (1990). The role of tests in a literacy assessment program. The Reading Teacher, 43, 686-688.

Reibeiro, J. L. (1980). Testing Portuguese immigrant children: Cultural patterns and group differences in responses to the WISC-R. In D. P. Macedo (Ed.), Issues in Portuguese bilingual education. New York: Academic Press.

Rothman, R. (1990). Ford study urges new test system to "open gates of opportunity." Education Week, IX(36), 1, 12.

Salganik, L. H. (1985). Why testing reforms are so popular and how they are changing education. Phi Delta Kappan, 66, 628-634.

Savignon, S. (1982). Dictation as a measure of communicative competence in French as a second language. Language Learning, 32, 33-51.

Scarcella, R. (1990). Teaching language minority students in the multicultural classroom. Englewood Cliffs, NJ: Prentice-Hall.

Sizer, T. R. (1984). Horace's compromise. Boston, MA: Houghton Mifflin.

Valencia, S., & Pearson, P. (1987). Reading assessment: Time for a change. The Reading Teacher, 40, 726-732.

Wanat, S. (1977). Introduction. In S. Wanat (Ed.), Issues in evaluating reading: Papers in Applied Linguistics, Series I, v-xi. Arlington, VA: Center for Applied Linguistics.

Widdowson, H. (1978). Teaching language as communication. Oxford: Oxford University Press.

Woods, M., & Moe, A. (1989). Analytical reading inventory, fourth edition. Englewood Cliffs, NJ: Macmillan Publishing Company.