The objectives of this research were 1) to examine which test equating methods are appropriate under varied anchor test patterns, sample sizes, and data formats, and 2) to compare the consistency of grading results before and after test equating under the specified conditions. The findings were as follows. The Kernel equating method performed marginally best under two conditions: randomized anchor items with a sample size of 700, and randomized anchor items with low-quality questions removed with a sample size of 500. Grades assigned to scores before and after equating with the Kernel method and the 2-parameter IRT method, across the varied conditions and under both 3-level and 8-level grading, were mostly inconsistent, which indicates that scores need to be equated before grading. Expanding grading to 8 levels and increasing the sample size made the inconsistency more pronounced. Removing low-quality questions before equating produced more pronounced inconsistency with the Kernel method than with the 2-parameter IRT method. In both methods, however, anchor items with a difficulty of .4-.6 produced more pronounced grading inconsistency than anchor items with randomized difficulty levels. In addition, the relationship between the grades obtained after equating with the Kernel method and with the 2-parameter IRT method was statistically significant at the .05 level under all conditions. The relationship between the two methods was strongest under 3-level grading with randomized anchor items, across all sample sizes.