Bulletin of Educational Psychology

522 publication date: Dec 2020

Investigating the Effects of Rater Equating Designs on Parameter Estimates in the Context of Preservice Principal Oral Performance
Author: Ming-Chuan Hsieh

Research Article

A problem in performance assessments is the degree to which rater severity and leniency can affect examinees’ scores. In particular, fairness concerns related to performance assessment systems include the exchangeability of raters. One way to address rater severity is to have every rater score every examinee’s performance (a fully crossed rating design), so that differences in rater severity affect all examinees equally. However, fully crossed rating designs are not always feasible in practice. In such cases, equating procedures can create links between raters and thereby control for differences in rater severity.

An effective equating procedure involves a strong statistical model and a systematic data collection approach. The Many-Facet Rasch model (MFRM) is a commonly used approach for adjusting for rater differences. Although the MFRM has gained popularity as an equating approach for rater severity, several key considerations related to data collection designs and model-data fit are also crucial. In particular, it is vital to ensure a sufficient level of connectivity in the rating design; that is, each rater must be linked to other assessment components, such as other raters, examinees, or tasks.
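For orientation, a common rating-scale formulation of the MFRM with three facets (examinee, rater, criterion) can be written as below. The abstract does not state which parameterization the study used, so the symbols and the choice of a shared threshold structure are assumptions made for illustration only.

```latex
\[
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
  = \theta_n - \beta_i - \alpha_j - \tau_k
\]
% P_{nijk}: probability that examinee n receives category k from rater j on criterion i
% \theta_n: ability of examinee n
% \beta_i:  difficulty of scoring criterion i
% \alpha_j: severity of rater j
% \tau_k:   threshold between categories k-1 and k
```

Because rater severity enters the model as its own term, ability estimates can be adjusted for it, but only when the rating design links raters well enough for their severities to be placed on a common scale.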

Three types of data collection design are commonly used for equating. The first type is a complete network design, in which every examinee has scores on all assessment components. This is the ideal design for a rating system. The second type is an incomplete network design. Under an incomplete network design, examinees do not have scores on all assessment components, but raters and tasks overlap partially and systematically, which produces a connected network of assessment components. The third type is a nonlinked network design, in which no systematic linkage exists among the components of the facets. Although the nonlinked design has potential problems, many important exams in Taiwan still use it.

The purpose of this study was to examine how differences in data collection designs affect parameter estimation in performance assessment. Using empirical data, this study highlighted the central role that data collection design plays in the interpretation of results when the MFRM is applied. The study had two main research objectives:

(1) To examine the impact of different data collection designs on parameter estimates for examinees’ ability, raters’ severity, and the difficulty of scoring criteria. The indices included infit, outfit, separation index, reliability, and the chi-square test.

(2) To evaluate the correlations of ability estimates between different designs and the magnitude of their impact on the ranking of examinees’ performance. The top 10, middle 10, and bottom 10 examinees in the complete network design were selected to evaluate how their rankings differed under the other designs.

This study used the MFRM and oral performance score data of preservice principals to explore the effects of the data collection designs. A total of four raters and 85 preservice principals participated. The raters scored seven criteria for each preservice principal’s oral performance: content, structure, word usage, attitude, pronunciation, intonation, and time control. Each criterion was assigned a grade of 1–3, where 1 represents the basic level, 2 represents the proficient level, and 3 represents the advanced level. The raters were trained before the actual rating was conducted. The specification of each grading level and the standards were explained, and raters were required to complete rating exercises before conducting the official rating. Anchor videos at various performance levels were also discussed with the raters so that they could better understand the standards.

Four equating designs were considered in this study. Design 1 was the complete network design; four raters rated all preservice principals in this design. Designs 2 and 3 were incomplete network designs. In these two designs, some rating scores overlapped to construct the connectivity of scoring components. In design 2, each examinee received scores from three raters, whereas in design 3, each examinee received scores from two raters. Design 4 was a nonlinked network design, in which each rater only reviewed his or her assigned class; there was no connection between raters’ scores. The MFRM, a statistical model of the Rasch family, was employed to analyze the data under each of the four equating designs. When estimating the examinee’s ability level, raters’ severity and scoring criteria were simultaneously considered in the model.
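The connectivity difference among the four designs can be sketched in a few lines of Python. The abstract does not report which raters scored which examinees in designs 2 and 3, nor how classes were formed in design 4, so the rotation scheme and the class assignment below are hypothetical; only the counts (4 raters, 85 examinees, and 4, 3, 2, or 1 raters per examinee) come from the text.

```python
# Sketch: check whether raters are linked through shared examinees.
# The rater labels and the assignment rules are illustrative assumptions.
N_EXAMINEES = 85
RATERS = ["R1", "R2", "R3", "R4"]


def rotating_design(raters_per_examinee):
    """Hypothetical rotation: examinee e is scored by consecutive raters,
    starting at position e mod 4. With 4 raters per examinee this is the
    complete network design."""
    design = {}
    for e in range(N_EXAMINEES):
        start = e % len(RATERS)
        design[e] = {
            RATERS[(start + k) % len(RATERS)]
            for k in range(raters_per_examinee)
        }
    return design


def nonlinked_design():
    """Hypothetical class assignment: each examinee is scored by one rater only."""
    return {e: {RATERS[e % len(RATERS)]} for e in range(N_EXAMINEES)}


def count_rater_components(design):
    """Count groups of raters connected through at least one shared examinee
    (union-find over raters). One group means all raters are linked."""
    parent = {r: r for r in RATERS}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path compression
            r = parent[r]
        return r

    for raters in design.values():
        ordered = sorted(raters)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])
    return len({find(r) for r in RATERS})


if __name__ == "__main__":
    designs = {
        "design 1 (complete)": rotating_design(4),
        "design 2 (three raters each)": rotating_design(3),
        "design 3 (two raters each)": rotating_design(2),
        "design 4 (nonlinked)": nonlinked_design(),
    }
    for name, design in designs.items():
        print(name, "->", count_rater_components(design), "rater group(s)")
```

Under these assumed assignments, designs 1 to 3 place all raters in a single linked group, whereas design 4 leaves every rater isolated, which is the situation in which rater severity cannot be separated from the ability of the class that rater scored.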

This study had two main findings: (1) For the incomplete and nonlinked network designs, some minor problems were related to the model fit, but overall, the infit and outfit indices were close to 1, which indicated that the use of the MFRM was feasible for analyzing the data used in this study. However, the reliability and separation indices for the nonlinked network design were low, and some chi-square tests did not reach significance; these results were quite different from those of the complete network design. (2) The weaker the linkage between assessment components, the more biased and less stable the parameter estimates. The fully connected network design provided the strongest connectivity at all levels (subjects, raters, and criteria), and this design was also the ideal scenario for data collection. However, this design is costly in terms of rating time and money; thus, it is difficult to implement in a large-scale test. By contrast, incomplete network designs are more feasible in large-scale tests because they require only that the raters’ evaluations overlap for some subjects. The correlation of ability estimates between the complete network design and the nonlinked network design was only 0.69, whereas the correlation between the complete network design and the incomplete network designs rose to 0.79–0.94. Moreover, a clear gap existed in participants’ rankings between the ideal fully connected network design and the nonlinked network design. For example, student #59 ranked 79th in the complete network design but 21st in the nonlinked network design, a ranking difference of 58. The results revealed that even if the MFRM is used for correction, large errors will still exist in the estimation of ability values and the ranking results of examinees for a nonlinked network design.
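The two comparisons reported above, a correlation of ability estimates between designs and a per-examinee ranking difference, can be illustrated with a minimal Python sketch. The ability values in the example are made-up toy numbers, not the study’s data, and the function names are hypothetical; only the ranks 79 and 21 for student #59 come from the text.

```python
# Sketch: compare ability estimates from two rating designs.
# All ability values below are toy numbers for illustration only.

def ranks_descending(estimates):
    """Rank 1 = highest ability. Ties are broken by input order (a simplification)."""
    order = sorted(range(len(estimates)), key=lambda i: -estimates[i])
    ranks = [0] * len(estimates)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_gap(rank_a, rank_b):
    """Absolute difference between an examinee's ranks under two designs."""
    return abs(rank_a - rank_b)


if __name__ == "__main__":
    # Toy ability estimates (logits) for five hypothetical examinees.
    complete = [1.2, 0.4, -0.3, -1.1, 0.9]
    nonlinked = [0.8, 0.5, 0.6, -0.9, -0.2]

    r_complete = ranks_descending(complete)
    r_nonlinked = ranks_descending(nonlinked)
    print("correlation of estimates:", round(pearson(complete, nonlinked), 2))
    print("rank gaps:", [rank_gap(a, b) for a, b in zip(r_complete, r_nonlinked)])

    # The example from the text: ranked 79th vs. 21st.
    print("student #59 gap:", rank_gap(79, 21))
```

A large rank gap for even a few examinees matters in high-stakes settings, since selection decisions usually depend on rank order rather than on the overall correlation.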

This study provided two suggestions: (1) Examination institutions should avoid using the nonlinked network rater design. Carefully constructed assessment networks based on effective data collection designs have a better chance of yielding objective and fair measurements within systems with multiple facets. Many current large-scale tests do not use any statistical model to correct for rater severity; furthermore, they use the nonlinked rater design. An examinee may therefore have the bad luck of encountering a severe grader and receive a low score as a result. Therefore, this study recommended that important examinations adopt a more complete rating plan in the future. (2) This study used empirical data. A simulation study can be considered to further examine the impact of different designs of component connectivity on parameter estimates. In addition, different experimental designs are worth discussing; for example, if examinees are nested within tasks, would this nested relationship affect the parameter estimates of ability and rater severity levels? The impact of more complex data collection designs is worthy of future research.

Keywords: Many-Facet Rasch model (MFRM), preservice principal oral performance, rater equating design, rater severity

Analysis of Relationships Among Science Contestants’ Cooperation Attitude, Knowledge Sharing, and Continuous Sharing Intention