12. A Test

It is all too easy to devise a test for a rating system that casts that system in a favorable light. This is particularly true of tests that take a single system as their subject; witness the published tests of the Elo System. The present test pits the Elo System against four other systems: the Berkin System, the Dual Ratio System, the Progressive System, and a linear system. Since linear (interval) systems are essentially equivalent, the representative system may be viewed as an application of the Ingo System, or perhaps of the "approximate" version of the Elo System itself.

The test is not intended as proof of the superiority of any one system and will have served its purpose if it stimulates a rethinking of rating theory. A completely "fair" test is, strictly speaking, impossible, and rating practitioners are therefore encouraged to try similar experiments.

It is often suggested that a simulation would be better served by data from actual competitions. The advantage of generated data, besides an inexhaustible supply, is that they allow for experimental controls. The drawback of historical data, on the other hand, is that they are not repeatable.

The simulation postulates a rating population of 400 players. Playing strength was defined arbitrarily by assigning the Elo ratings 2000 through 2399, one to each player. This range yields a percentage expectancy of about .91 at the extremes, that is, for the strongest player against the weakest. The predefined ratings were chosen primarily because of their familiarity, but there is also an ironical twist: the Elo System is seeking to measure playing strength that has been postulated in terms of the system itself. There is nothing wrong, it should be noted, with postulating relative strength in terms of long-term percentage scores (probabilities). The predefined ratings are simply an expedient for assigning such probabilities, admittedly an artificial one. In addition to a predefined rating, each player was assigned a calculated rating for each of the systems under test.
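The expectancy quoted for the extremes of the population can be checked with a short sketch. It assumes the familiar logistic form of the Elo expectancy with a scale of 400 points; the text does not say which form of the curve was used, and the names `expectancy` and `predefined` are illustrative only.

```python
def expectancy(rating_diff):
    """Expected score for a player rated `rating_diff` points above the opponent.

    Assumption: logistic Elo curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# Predefined ratings: one per player, 2000 through 2399.
predefined = list(range(2000, 2400))

# Strongest player against weakest player.
extreme = expectancy(predefined[-1] - predefined[0])
print(len(predefined), round(extreme, 2))  # 400 0.91
```

Under this assumption the 399-point spread gives an expectancy just under .91, in line with the figure stated above.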
The calculated ratings were initialized to the mean of the predefined ratings for the Elo System, and to .5 for the other systems. Only established rating formulas were used.

The simulation consists of a series of ten-player round-robin tournaments. The contestants for each tournament were selected at random from the field of 400 without replacement (equivalent to a partition of the shuffled field). Results were generated randomly, but with probabilities determined by the predefined ratings. This is not equivalent to deciding all results by the toss of a coin, which would be appropriate only for contestants of equal playing strength.

Random events are simulated in a computer program by a sequence of pseudorandom values following an initial seed value. A fractional value between 0 and 1, call it F, can be generated with uniform probability in a pseudorandom sequence. Each F was scored as a win or a loss by comparison with the predefined probability of a win, P: a win occurred if F was less than P, and a loss if F was greater than P. Draws were excluded from consideration for simplicity. Results were generated, in brief, just as predicted by the Elo System for the predefined individual ratings. All systems, it is important to note, are rating the same results.

Statistics were calculated every 200 events, an interval in which each player participates in five tournaments. Altogether there were 5000 events, comprising 225,000 games (45 games per ten-player round robin). Sampling weight, which has been shown to be a crucial variable, was set at No = 50 for all systems except the Progressive System. An argument can be made for a smaller value of Lo in the Berkin System, since sampling weight there is based on number of losses. However, the ratings of victorious opponents are not considered in the final rating, making the question of sampling a complicated one.

Figure 1 presents a comparison of the predictive powers of the rating systems. The error statistic depicted for each system is the standard deviation of actual scores from predicted scores.
Both actual and predicted scores are in the form of tournament scores, ranging from 0 to 9 points. The predicted scores were based on the percentage expectancies defined by each system for the average opposition ratings. For the Berkin and Progressive systems these cannot be calculated directly from the performance formulas, and the summation method was used instead. Standard deviations were taken over intervals of 200 events for all of the contestants together, with 25 intervals in all. The first two intervals are not shown in this figure for the sake of scale.

Figure 2 deals with the same data as Figure 1 and corroborates the results with an unrelated statistic. The statistic in this case is Spearman's rank-difference correlation. This type of correlation does not require that the distributions of the compared data be taken into consideration. It is used here to compare the ranking of calculated ratings, determined every 200 events, with the predefined ranking. Rankings, regarded as more basic than ratings, should naturally reflect the latter.
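The procedure described in this section can be sketched as follows. This is a minimal illustration, not the program used for the test. It assumes a logistic Elo expectancy, reads "standard deviation of actual scores from predicted scores" as a root-mean-square deviation, and uses the no-ties form of Spearman's formula. No rating system is implemented: the predefined ratings stand in for calculated ratings, so the error shown is only the chance variation that remains when predictions are ideal. All function names and the seed value are hypothetical.

```python
import math
import random

def expectancy(r_a, r_b):
    """Win probability of A against B (assumed logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def play_round_robin(players, strength, rng):
    """Play one all-play-all event without draws; return {player: wins}."""
    score = {p: 0 for p in players}
    for i, a in enumerate(players):
        for b in players[i + 1:]:
            p_win = expectancy(strength[a], strength[b])  # P
            f = rng.random()                              # F, uniform on [0, 1)
            score[a if f < p_win else b] += 1
    return score

def predicted_score(player, players, rating):
    """Games played times the expectancy against the average opposition rating."""
    opp = [rating[q] for q in players if q != player]
    return len(opp) * expectancy(rating[player], sum(opp) / len(opp))

def rms_error(actual, predicted):
    """Root-mean-square deviation of actual from predicted scores."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def spearman(x, y):
    """Rank-difference correlation, 1 - 6*sum(d^2) / (n(n^2 - 1)); assumes no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

rng = random.Random(12)                           # arbitrary seed, for repeatability
strength = {p: 2000 + p for p in range(400)}      # predefined ratings 2000..2399

# One pass through the field: partition the shuffled field into 40 events.
field = list(strength)
rng.shuffle(field)
events = [field[i:i + 10] for i in range(0, len(field), 10)]

actual, predicted = [], []
for players in events:
    score = play_round_robin(players, strength, rng)
    for p in players:
        actual.append(score[p])
        # Stand-in: predefined ratings used in place of calculated ratings.
        predicted.append(predicted_score(p, players, strength))

print(len(events), sum(actual))       # 40 events, 45 games each
print(rms_error(actual, predicted))   # error in tournament points
print(spearman([2000, 2100, 2200], [0.4, 0.5, 0.6]))  # identical rankings
```

To compare systems in the manner of Figures 1 and 2, each system's calculated ratings would replace the stand-in `strength` values in `predicted_score` and in one argument of `spearman`, with the statistics accumulated over each interval of 200 events.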