National Taiwan Normal University
Bulletin of Educational Psychology (教育心理學報)
532, publication date: Dec. 2021
Extended Angoff Method in Setting Standards for Self-Report Measures With Supplementary Performance-Level Descriptors
Author: Jin-Chang Hsieh
Research Article

One vision of the 12-Year Basic Education Curriculum in Taiwan is to promote the comprehensive learning and development of all students. To ensure the quality of this curriculum reform, the Ministry of Education funded a long-term project, the Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL), to evaluate the impact of the curriculum on student performance. The TASAL is a large-scale standards-based assessment, and standard setting is one of its main tasks. A comprehensive literature review indicated that most empirical studies related to standard setting have focused on cognitive domains, and few studies have undertaken expert-oriented standard-setting processes in affective domains because of practical limitations. The present study suggests a new approach, employing supplementary performance-level descriptors (S-PLDs) in an extended Angoff method for setting standards for self-report measures. The purpose of this study was to uncover evidence of the procedural, internal, and external validity of implementing an extended Angoff procedure with S-PLDs in standard setting for English comprehension strategy use among seventh-grade students in Taiwan.

PLDs are designed to outline the knowledge, skills, and practices that indicate the level of student performance in a target domain. In the present study, the use of comprehension strategies for learning English as a foreign language was examined. S-PLDs provide comparable but unique functions within the standard-setting process. S-PLDs offer supplementary material to subject matter experts to facilitate the formation of profiles of student performance in target domains, especially when ambiguities in conventional PLDs may prevent expert consensus during the standard-setting process.

In this study, stratified two-stage cluster sampling was adopted to select representative seventh graders in Taiwan during the 2018–2019 academic year. After sampling, 7,246 students had been selected; of these, 2,732 students (1,417 boys and 1,315 girls) received an English comprehension strategy use questionnaire and an English proficiency test. Student performance on both measurement instruments was the basis for writing the PLDs and S-PLDs. The scale measuring English comprehension strategy use was a 4-point discrete visual analogue self-report measure developed through standardized procedures, comprising four dimensions: memorization (6 items), cognition (6 items), inference (8 items), and comprehension monitoring (10 items) strategies. The results of a four-dimensional confirmatory factor analysis indicated a favorable model–data fit, except for the chi-square value, which was affected by the large sample size. Moreover, the English proficiency test was a cognitive measure assessing students' listening and reading comprehension abilities through multiple-choice and constructed-response items. A total of 182 items were developed through a standardized procedure and divided into 13 blocks to assemble 26 test booklets. Each booklet, containing 28 items, was randomly assigned to a participating student; each student completed only one booklet. After data cleansing and item calibration with a multidimensional random coefficient multinomial logit model and the Test Analysis Modules (Robitzsch et al., 2020), the information-weighted fit mean-square indices for all test items ranged from 0.79 to 1.37, meeting the criterion proposed by Linacre (2005).
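The item-screening step above amounts to checking each item's information-weighted (infit) mean-square against an acceptance range. A minimal sketch in Python follows; the 0.5–1.5 acceptance bounds attributed to Linacre (2005) and all variable names are assumptions for illustration, not the study's actual code.

```python
def items_within_infit_range(infit_values, lower=0.5, upper=1.5):
    """Return the indices of items whose infit mean-square falls
    inside the acceptance range [lower, upper].

    The 0.5-1.5 default range is an assumed reading of Linacre's
    (2005) criterion; the reported values (0.79-1.37) all fall
    within it.
    """
    return [i for i, v in enumerate(infit_values) if lower <= v <= upper]

# Hypothetical infit values spanning the reported 0.79-1.37 range
infit = [0.79, 1.02, 1.37, 0.95]
print(items_within_infit_range(infit))  # all four items pass: [0, 1, 2, 3]
```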

An expert-oriented standard-setting meeting was hosted on May 20, 2020, after advance materials, such as the agenda and instructions for the standard-setting method, had been sent to all experts. Eight experts from across Taiwan were invited to join the meeting, all of whom had experience with standard-setting meetings for student performance on English proficiency tests. The experts had an average of 18.75 years of teaching experience, and seven had experience in teaching low achievers. Overall, the experts had sufficient prerequisite knowledge of and experience with standard-setting processes. On the day of the standard-setting meeting, a series of events was undertaken, including orientation, training and practice, and three rounds of the extended Angoff standard-setting method with different types of feedback provided between rounds. Feedback questionnaires were developed, and discussions among the experts between the rounds were recorded and analyzed as evidence of procedural and internal validity.
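In a typical extended Angoff round for polytomous items, each expert estimates the expected item score of a borderline (minimally proficient) respondent, and the cutoff is aggregated across experts. A minimal sketch under that assumption follows; the ratings, the 4-point scale, and the averaging rule are illustrative, not the study's reported procedure.

```python
def extended_angoff_cutoff(ratings):
    """Aggregate experts' expected item scores into one cutoff.

    ratings: one list per expert, each containing an expected item
    score (e.g. 1-4 on a 4-point scale) for a borderline respondent.
    The cutoff here is the mean over experts of each expert's summed
    ratings (an assumed aggregation rule).
    """
    per_expert_sums = [sum(expert) for expert in ratings]
    return sum(per_expert_sums) / len(per_expert_sums)

# Three hypothetical experts rating a 6-item memorization-strategy scale
ratings = [
    [2, 3, 2, 3, 2, 3],  # expert A
    [3, 3, 2, 2, 3, 3],  # expert B
    [2, 2, 3, 3, 2, 2],  # expert C
]
print(extended_angoff_cutoff(ratings))  # → 15.0
```

Between rounds, experts would see feedback (e.g. each other's sums) and revise their ratings, converging toward a shared cutoff.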

Most of the subject matter experts were satisfied with the events during the standard-setting process and agreed that they could set satisfactory cutoff scores for future use. According to the feedback questionnaires completed between rounds, the experts nearly unanimously agreed that the materials received in advance; the introductions to the PLDs, S-PLDs, and extended Angoff method; and their previous experience in setting standards for English proficiency were beneficial for judging items during the process. Additionally, the experts agreed that the S-PLDs played a key role in facilitating the formation of outlines of student performance in comprehension strategy use across different levels. All of these results indicate procedural validity.

For evidence of internal validity, classification error (the ratio of the standard error of the passing score to the measurement error) was computed to indicate the consistency of the item ratings between and within the experts during the three-round process. Between experts and across rounds, the classification error ranged from 0.08 to 0.36 for memorization strategies, 0.14 to 0.49 for cognition strategies, 0.19 to 0.61 for inference strategies, and 0.24 to 0.72 for comprehension monitoring strategies. These results indicate that the cognitive levels of the four dimensions affect the consistency of item rating: strategies with more abstract item content tended to have higher classification error. Furthermore, the lowest classification error values occurred in the second round for memorization and inference strategies and in the third round for cognition and comprehension monitoring strategies. All of these lowest values were beneath the cutoff of 0.33 proposed by Kaftandjieva (2010), except for the value of 0.37 for comprehension monitoring strategies. Regarding rating consistency within experts between rounds, the results showed no extreme classification error, and most of the values were beneath 0.33, with the exceptions of 0.35 for cognition strategies and 0.37, 0.42, and 0.61 for comprehension monitoring strategies. Therefore, most experts exhibited rating consistency between the rounds. Additionally, the results of a content analysis of the item rating discussions indicated that three reference sources might affect experts' judgments regarding the items: (1) students' actual performance, (2) the PLDs and S-PLDs, and (3) experts' personal expectations. For example, one expert might give a lower rating because, in their teaching experience, students tend to perform poorly on a particular item, whereas another expert might give a higher rating because of their personal expectations.
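The classification error index defined above (standard error of the passing score divided by the measurement error) can be sketched directly from its definition. The cutoff values, the SEM, and the computation of the standard error as the experts' cutoff standard deviation over the square root of the panel size are illustrative assumptions; only the 0.33 criterion comes from the source.

```python
import statistics


def classification_error(expert_cutoffs, sem):
    """Ratio of the standard error of the passing score to the
    measurement error (SEM) of the scale.

    The standard error of the passing score is assumed here to be the
    sample standard deviation of the experts' cutoffs divided by the
    square root of the number of experts.
    """
    n = len(expert_cutoffs)
    se_cutoff = statistics.stdev(expert_cutoffs) / n ** 0.5
    return se_cutoff / sem

# Hypothetical cutoffs from a panel of eight experts, assumed SEM of 1.5
cutoffs = [14.5, 15.0, 15.5, 14.0, 16.0, 15.0, 14.5, 15.5]
ce = classification_error(cutoffs, sem=1.5)
print(ce < 0.33)  # meets Kaftandjieva's (2010) 0.33 criterion → True
```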

To examine external validity, student performance on the English proficiency test was adopted as an external criterion. With two cutoff scores used to divide students into basic, proficient, and advanced users in each dimension, a medium effect size was obtained for memorization strategies, and large effect sizes were obtained for cognition, inference, and comprehension monitoring strategies. Furthermore, to compare the final cutoff scores obtained through the study method with those from existing methods, the study adopted the approach used in TIMSS and PIRLS for setting standards in affective domains (Martin et al., 2014, p. 308). The classification accuracy indices, which indicate the proportions of students classified identically by the two methods, were 90.25%, 81.20%, 82.52%, and 87.56% for the four dimensions. In sum, the present study obtained satisfactory evidence of the procedural, internal, and external validity of using an extended Angoff procedure with S-PLDs for setting standards for self-report measures; additional suggestions are presented herein.
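The classification accuracy index described above is simply the proportion of students assigned the same performance level by the two methods. A minimal sketch follows; the ten-student labels are invented for illustration and do not reproduce the study's data.

```python
def classification_accuracy(labels_a, labels_b):
    """Proportion of students assigned the same performance level
    by two standard-setting methods."""
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

# Hypothetical levels for ten students under the extended Angoff
# method versus a TIMSS/PIRLS-style approach
angoff = ["basic", "proficient", "advanced", "basic", "proficient",
          "proficient", "basic", "advanced", "proficient", "basic"]
timss = ["basic", "proficient", "advanced", "proficient", "proficient",
         "proficient", "basic", "advanced", "basic", "basic"]
print(classification_accuracy(angoff, timss))  # → 0.8
```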




Keywords: extended Angoff method, self-report measure, supplementary performance-level descriptors, standard setting, Taiwan Assessment of Student Achievement: Longitudinal Study



Copyright © 2024 Bulletin of Educational Psychology
Address: No. 162 Hoping E. Rd. Sec. 1, Taipei 10610, Taiwan, R.O.C.
All rights reserved.