Reading comprehension is essential for learning in all subjects and for lifelong learning; it is also a crucial ability that allows people to communicate and interact with one another. Therefore, large-scale international assessments such as the Progress in International Reading Literacy Study and the Programme for International Student Assessment incorporate reading as an indicator of learning outcomes. This study also recognizes the essential nature of reading comprehension.
However, existing reading comprehension tests have several limitations. For example, the target populations of most tests comprise students in specific grades (e.g., elementary school students) or groups (e.g., students with special needs), and the assessments are paper-and-pencil tests with fixed items that require substantial resources for administration and scoring. Currently, no Chinese reading comprehension assessment suitable for long-term implementation in general classrooms exists. Accordingly, the purpose of this study was to develop an assessment system, namely the Diagnostic Assessment of Chinese Competence (DACC), for comprehensively evaluating students’ reading abilities in the form of a computerized adaptive test. The reliability and validity of this system were also verified.
The DACC system assesses students’ reading comprehension holistically and also measures their performance in reading subskills such as comprehension (e.g., lexical, literal, and inferential), contextual integration, and analysis and evaluation. This assessment system was designed for students from the 2nd grade to the 12th grade.
The DACC test items were drafted by school teachers, doctoral students in psychology, and professionals engaged in research on the Chinese language. All drafters were required to attend and pass training before contributing test items to the DACC system. Item topics were selected to be familiar to students, such as topics relating to daily or school life. The topics are not limited to the language arts; they also cover life experience, history, geography, and science. In the proposed system, assessment texts appear in various formats, including continuous texts, noncontinuous texts, mixed texts, multiple texts, and texts displayed in hypertext. Text styles are also varied and include narrative, expository, descriptive, and argumentative styles. This wide range of texts reflects real-world reading situations encountered by students in their lives. Most of the DACC items are testlets, with each question in a testlet corresponding to one of five dimensions: vocabulary, literal comprehension, contextual integration, inferential comprehension, and analysis and evaluation. Such a design measures student performance in each of the dimensions and yields a comprehensive analysis of students’ reading abilities upon completion of the DACC.
All test items were subjected to pilot tests to collect actual responses from students and to check whether the questions functioned as designed. The responses were also used to estimate item parameters. All DACC items were vertically equated on the basis of the nonequivalent groups with anchor test design. In the pilot tests, the characteristics of the respondents were also considered. Stratified random sampling was adopted to recruit students from both urban and rural areas to ensure that the parameter estimation results for the items apply to all students in the population. The DACC items were dichotomously scored in the pilot tests. At least 300 responses were gathered for each test item, and both classical test theory (CTT) and item response theory (IRT) were applied to analyze the responses. In the IRT-based analysis, this study used the multidimensional random coefficients multinomial logit model (MRCMLM) with marginal maximum likelihood estimation to estimate item parameters and used expected a posteriori (EAP) estimation to estimate ability parameters. In the CTT-based analysis, the pass rate and item discrimination were calculated for each item.
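The CTT-based statistics described above can be illustrated with a minimal Python sketch. It assumes that discrimination is computed as the corrected item-total (point-biserial) correlation; the abstract does not state which discrimination index was used, and the function names `pearson` and `ctt_item_stats` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable is constant
    return sxy / sqrt(sxx * syy)

def ctt_item_stats(responses):
    """responses: one row per student, each row a list of 0/1 item scores."""
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Assumption: correlate the item with the total score of the
        # remaining items (corrected item-total correlation).
        rest = [t - s for t, s in zip(totals, item)]
        stats.append({
            "item": j,
            "pass_rate": sum(item) / len(item),  # proportion answering correctly
            "discrimination": pearson(item, rest),
        })
    return stats
```

Calling `ctt_item_stats` on a matrix of dichotomous pilot responses returns one record per item with its pass rate and discrimination.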
To screen the DACC items for favorable psychometric characteristics, this study adopted two indicators. In the IRT-based analysis, the information-weighted mean square fit statistic (infit MNSQ) was used to rule out misfitting items, and items with infit MNSQ values between 0.6 and 1.4 were retained. In the CTT-based analysis, item discrimination was used as the indicator, and test items with discrimination of .3 or higher were retained. Only test items that met the requirements for both indicators were entered into the formal item bank of the DACC system, resulting in 1,019 items in the bank after data analysis. The item difficulties span a range wider than -2 to 2, which covers the ability levels of most students. The screening also demonstrated that the DACC is suitable for assessing the reading comprehension skills of students from the 2nd to 12th grades.
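The two-indicator screening rule can be written as a short filter. The sketch below is illustrative only: the thresholds come from the text, the infit bounds are assumed to be inclusive, and the record fields (`infit`, `discrimination`) and function names are hypothetical.

```python
# Thresholds taken from the text; inclusive infit bounds are an assumption.
INFIT_MIN, INFIT_MAX = 0.6, 1.4
MIN_DISCRIMINATION = 0.3

def passes_screening(infit_mnsq, discrimination):
    """An item is kept only if it satisfies BOTH indicators."""
    fits_model = INFIT_MIN <= infit_mnsq <= INFIT_MAX
    discriminates = discrimination >= MIN_DISCRIMINATION
    return fits_model and discriminates

def build_item_bank(candidates):
    """candidates: list of dicts with hypothetical keys 'infit' and 'discrimination'."""
    return [c for c in candidates
            if passes_screening(c["infit"], c["discrimination"])]
```

An item with a good fit statistic but a discrimination below .3, or the reverse, is excluded from the bank.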
To strengthen the effectiveness of the DACC system, this study constructed an assessment system based on computerized adaptive testing. For ability estimation, maximum a posteriori (MAP) estimation was used. For test item selection, Fisher information was calculated for the candidate items each time a student finished answering a set of questions. The system then randomly assigned the next question from the five items with the highest information. When the number of items answered met a previously set standard, the assessment was terminated.
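The adaptive procedure above can be sketched in Python under simplifying assumptions that are not taken from the study: a unidimensional Rasch model instead of the MRCMLM, a standard normal prior for MAP estimation, selection of single items rather than testlets, and hypothetical names (`bank`, `answer`, `max_items`) throughout.

```python
import math
import random

def p_correct(theta, b):
    """Rasch probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta: p(1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

def map_estimate(responses, prior_mean=0.0, prior_sd=1.0, iterations=25):
    """MAP ability via Newton-Raphson; responses is a list of (difficulty, 0/1)."""
    theta = prior_mean
    for _ in range(iterations):
        grad = -(theta - prior_mean) / prior_sd**2  # prior contribution
        hess = -1.0 / prior_sd**2
        for b, x in responses:
            p = p_correct(theta, b)
            grad += x - p
            hess -= p * (1.0 - p)
        theta -= grad / hess
    return theta

def select_next_item(theta, bank, administered, top_k=5, rng=random):
    """Pick at random among the top_k most informative unused items."""
    remaining = [i for i in bank if i not in administered]
    ranked = sorted(remaining,
                    key=lambda i: item_information(theta, bank[i]),
                    reverse=True)
    return rng.choice(ranked[:top_k])

def run_cat(bank, answer, max_items=20, rng=random):
    """bank: dict item_id -> difficulty; answer: callable item_id -> 0/1."""
    administered, responses = [], []
    theta = 0.0
    # Fixed-length stopping rule, as described in the text.
    while len(administered) < min(max_items, len(bank)):
        item_id = select_next_item(theta, bank, administered, rng=rng)
        administered.append(item_id)
        responses.append((bank[item_id], answer(item_id)))
        theta = map_estimate(responses)
    return theta, administered
```

Drawing at random from the five most informative items, rather than always taking the single best one, spreads usage across the bank at a small cost in measurement efficiency.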
Furthermore, this study provided a set of reference norms for students’ test results. A total of 38,099 students from 1,255 schools in Taiwan were included in the study, and the average DACC score was calculated for the students in each grade. Thus, students completing the assessment can compare their results against the norm and understand their level of performance on the test. Such a reference can provide clear and objective standards to assist DACC users in assessing the grade level of their reading abilities. Accordingly, teachers can both determine whether their students’ reading abilities meet the required level and adjust their follow-up instruction on the basis of the assessment results.
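As a rough illustration of how grade norms of this kind might be computed and used, the following Python sketch averages scores by grade and looks up the grade whose norm is closest to a given score. The nearest-norm lookup and the function names are assumptions made for illustration, not the procedure reported in the study.

```python
from collections import defaultdict
from statistics import mean

def grade_norms(records):
    """records: iterable of (grade, score) pairs; returns grade -> mean score."""
    by_grade = defaultdict(list)
    for grade, score in records:
        by_grade[grade].append(score)
    return {grade: mean(scores) for grade, scores in by_grade.items()}

def nearest_grade_level(score, norms):
    """Hypothetical lookup: the grade whose average score is closest to score."""
    return min(norms, key=lambda grade: abs(norms[grade] - score))
```

A teacher could compare a student's score with the norm for the student's own grade, or use a lookup such as `nearest_grade_level` to express the score as an approximate grade level.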
In addition to following rigorous procedures for constructing the DACC assessment system, this study examined the reliability and validity of the system. For test-retest reliability, this study evaluated the scores of 1,449 students who completed the test twice; the average correlation between their two scores was .76, indicating that the DACC system has high reliability. In the IRT analysis, the conditional reliability of the DACC system was also high. Assessing the test results of 16,479 students revealed that the average reliability of the system was above .80, indicating that the DACC system has a stable and high reliability level for students of differing reading abilities. The validity of the assessment system was examined on the basis of criterion-related validity. Assessing the scores of 2,332 ninth-grade students who took both the DACC and the Comprehensive Assessment Program for Junior High School Students (a large-scale standardized test that all graduates of junior high school in Taiwan must complete) indicated that the correlation between the scores from the two tests was moderate (.64). Moreover, construct validity assessment results demonstrated that all DACC items fit the MRCMLM.
In summary, this study adopted a series of rigorous procedures to construct the DACC assessment system; the reliability and validity of the DACC were also verified. IRT was used to estimate item parameters and thereby determine item difficulty levels and student ability levels. Additionally, an item bank and ability norms were established for the system, thus enabling the use of a computerized adaptive test that can effectively determine reading comprehension levels and support long-term tracking of growth in reading ability. Results of test-retest reliability, conditional reliability, criterion validity, and IRT validity tests indicate that the DACC system provides a stable and effective assessment of student reading ability. In future studies, the DACC system’s item bank will be expanded. A control mechanism for the item exposure rate can also be adopted to improve the system’s effectiveness. Moreover, as a comprehensive assessment tool across multiple learning stages, the DACC system can provide empirical evidence for use in solving problems related to reading comprehension and make substantial contributions to related fields of research.