
 

Summary of judge performance based on 15 open performance classes – 2003 MFTHBA Celebration

 

We will keep the comments as brief as possible, but a few of the more important aspects of how a judge's performance is analyzed need to be explained first. Let us start with Class 108 of the 2003 Celebration as an example. It is a class where the judging was consistent and represents what we would like to see in all classes. Our criteria for good judge performance are based on the premise that if judges perform well they should be consistent among themselves with regard to which horses are placed high and which horses are placed low. We focus on the top three placings as they are generally considered to be the most important. (For a more detailed explanation of our procedures, please see the Judge Performance Tutorial.)

 

Ladies 2-year-old champion ‑ Class 108

     Rankings are:       Orderings based on all 5 judges:

   r   j   h   d   t            r   j   h   d   t

  522 474 140 140 140           1   3   2   2   2

  474 522 522 522 522           3   1   1   1   1

  140 140 474 474 474           2   2   3   3   3

  108 108 593 593 385           4   4   5   5   6

  729 385 385 108 593           7   6   6   4   5

  593 593 108 277 108           5   5   4   9   4

  851 729 851 385 277           8   7   8   6   9

  277 851 729 851 851           9   8   7   8   8

  385 277 277 729 729           6   9   9   7   7

  860 860 860 860 860          10  10  10  10  10

The above results show the placings of the horses on the left and the orderings of the horses on the right. The orderings simply replace an exhibitor's contestant number with the order in which that horse was placed in the class based on all 5 judges' scores. The first row of orderings in the right-hand table above shows that judge r chose as his first choice the horse that was placed first by the consensus of the 5 judges. Judge j placed 474 first, but that horse was placed 3rd based on the consensus of the 5 judges, and the other three judges chose the consensus second-place horse (140) as their first choice. The problem with such orderings is that a judge's own choices influence the consensus placements, and that tends to make the consensus placements more like his than would be the case if his placings were not included in the consensus to which he is compared. A better measure of how consistent a judge, say judge 1, is with his peers is to leave his (judge 1's, in this example) placements out when calculating the consensus ordering, and then compare his (judge 1's) ordering with the consensus of the other 4 judges. Below are the orderings of each judge's placements based on the consensus of his respective 4 peer judges.

        Orderings from deleting one judge at a time:

                 r     j     h     d     t

                 2     3     2     2     2

                 3     2     1     1     1

                 1     1     3     3     3

                 5     5     5     5     6

                 9     6     6     4     5

                 4     4     4     9     4

                 7     9     9     6     9

                 8     8     7     8     8

                 6     7     8     7     7

                10    10    10    10    10

Distances      2.4   2.8   1.4   1.4   1.4

The above results show that the horse judge r chose as his first-place horse (522) was the second-place horse based on the consensus of the other 4 judges. Compare that result to the table of orderings based on the consensus of all 5 judges and you will see that horse 522 was the first-place horse when we included judge r in the consensus, but 522 slipped to second when we ignore judge r's choices in determining the consensus. The results above provide a better measure of agreement, or consistency, as to how each judge's ordering compares to the consensus of his peers. The same procedure was followed for each judge, so the consensus given for judge j is based on his four peers, and so on.
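
For readers who want to reproduce these tables, the sketch below (in Python) shows one way both the all-judge orderings and the leave-one-out comparison can be computed. The tutorial does not state the exact aggregation rule for the consensus, so the sketch assumes the consensus is formed from rank sums (the horse with the lowest sum of received placings is the consensus first-place horse); under that assumption it reproduces the Class 108 orderings shown above, with ties in the rank sums broken simply by order of appearance.

    # Class 108 placings: judge -> horses in the order placed, 1st through 10th.
    placings = {
        "r": [522, 474, 140, 108, 729, 593, 851, 277, 385, 860],
        "j": [474, 522, 140, 108, 385, 593, 729, 851, 277, 860],
        "h": [140, 522, 474, 593, 385, 108, 851, 729, 277, 860],
        "d": [140, 522, 474, 593, 108, 277, 385, 851, 729, 860],
        "t": [140, 522, 474, 385, 593, 108, 277, 851, 729, 860],
    }

    def consensus_order(judge_placings):
        """Order horses by the sum of the placings they received (assumed rule)."""
        rank_sums = {}
        for order in judge_placings.values():
            for place, horse in enumerate(order, start=1):
                rank_sums[horse] = rank_sums.get(horse, 0) + place
        return sorted(rank_sums, key=lambda h: rank_sums[h])

    def ordering_vs_peers(judge, placings):
        """Re-express a judge's placings as positions in his peers' consensus."""
        peers = {j: p for j, p in placings.items() if j != judge}
        position = {h: i for i, h in enumerate(consensus_order(peers), start=1)}
        return [position[h] for h in placings[judge]]

    for judge in placings:
        print(judge, ordering_vs_peers(judge, placings))

Calling consensus_order(placings) on all 5 judges gives the "orderings based on all 5 judges," while ordering_vs_peers() drops one judge at a time, as in the second table.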

 

The distances given above are measures of how far each judge's first three choices are from being 1, 2, 3. Specifically, the most consistent a judge could be is to have his first 3 choices match the first 3 choices of the consensus of his peers. The distance for a judge whose first three choices are 1, 2, 3 is zero, which indicates complete agreement with the consensus of his peers. However, in this class no judge ordered the top three exactly as the consensus of his peers did. The last 3 judges switched the top pair, which gave them distances of 1.4, indicating near but not perfect agreement with their peers. The other two judges' first 3 choices were farther from 1, 2, 3, and their distance values reflect that. The average of the distances for all 5 judges in this class was 1.90. Average distances are informative when comparing one class to another, as the magnitude of the difference in their values measures how much more disagreement there was among the judges in one class than in another, with a larger average distance indicating more disagreement in a class.
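
The tutorial does not print the distance formula, but the values shown are consistent with the ordinary Euclidean distance between a judge's first three orderings (relative to the peer consensus) and the ideal 1, 2, 3: swapping the top pair gives sqrt(2) ≈ 1.4, and judge r's 2, 3, 1 gives sqrt(6) ≈ 2.4. A minimal sketch under that assumption:

    import math

    def top3_distance(ordering):
        """Euclidean distance of a judge's first three choices from the ideal 1, 2, 3.
        `ordering` is the judge's ordering relative to the peer consensus."""
        return math.sqrt(sum((got - ideal) ** 2
                             for got, ideal in zip(ordering[:3], (1, 2, 3))))

    # Class 108 top-three orderings from the leave-one-out table above.
    for judge, top3 in {"r": (2, 3, 1), "j": (3, 2, 1), "h": (2, 1, 3),
                        "d": (2, 1, 3), "t": (2, 1, 3)}.items():
        print(judge, round(top3_distance(top3), 1))   # 2.4 2.8 1.4 1.4 1.4

The same calculation reproduces the class averages quoted below (for example, 1.90 for Class 108).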

We also use a statistic we call theta to characterize the degree of agreement among the judges' first 3 placings. The range of theta is 0 to 1. If the judges' placements are in close agreement, theta will be around 0.3 or less, which is the situation we want to see. Values of theta greater than 0.4 indicate some lack of agreement among the judges, and values greater than 0.5 reflect situations where the judges are placing horses with very little agreement among them, which of course is not what we want to see. For Class 108 the estimate of theta is 0.27, which shows there was agreement among the judges as to which were the top three horses in the class. There were slight disagreements as to which horse was first, second, or third, but there was total agreement as to which horses were the best 3 in the class. The small value of theta reflects that the judges were quite consistent in picking the best horses in the class, and it agrees with our conclusion above based on the relatively small average distance of 1.90 for the class.

Now consider a class where things did not go well at all; it shows how both theta and the average distance reflect the magnitude of disagreement in a class. The results for Class 92 follow.

Mares, 5 years and older ‑ Class 92                          

     Rankings are:       Orderings based on all 5 judges:

   t   b   h   r   d            t   b   h   r   d

  600 697 284 600 284           1   4   3   1   3

  659 600 896 116 600          10   1   2   5   1

  483 896 600 433 697           7   2   1   6   4

  896 483 116 683 683           2   7   5   9   9

  116 433 697 483 686           5   6   4   7   8

  433 432 671 847 896           6  11  12  13   2

  432 284 354 686 282          11   3  15   8  14

  284 686 686 896 433           3   8   8   2   6

  686 659 433 671 116           8  10   6  12   5

  683 683 659 697 483           9   9  10   4   7

 

     Orderings from deleting one judge at a time:

                 t     b     h     r     d

                 1     8     6     1     6

                15     1     2     7     1

                 9     3     1     5     7

                 2     9     7    10    10

                 6     5     5     6     8

                 5    13    14    15     2

                11     4    15     8    15

                 4     7     9     2     4

                 7    10     4    12     3

                 8     6    10     4     5

Distances     14.3   7.1   5.4   5.4   6.5

The above results reflect a case where the judges were all over the board and very inconsistent as to what they were looking for when placing the horses in this class. Note that judge t placed a horse 2nd that was placed 9th by one judge, 10th by another, and not placed at all by the other two judges. Judges t and r were the only ones to pick the peer-consensus first-place horse as their first-place horse. Judge b placed the horse his peers' consensus placed 8th as his first-place horse, and judges h and d picked the consensus 6th-place horse as their first-place horse. Judge d placed the consensus 3rd-place horse 9th, and judge r placed the consensus 2nd-place horse 8th. The lack of consistency among the judges is reflected by both the value of theta and the average distance for this class.

 Theta: 0.54   Average distance of all judges: 7.73

Note that theta is greater than 0.5 and the average distance is very large for this class compared to Class 108. A performance like this indicates either that the judges have not been well trained or that they are not capable of identifying what they are looking for in the horses, because they were certainly seeing different things in the horses in this class. This class represents the kind of judging that leaves contestants scratching their heads and thinking bad thoughts about showing, judging, and the MFTHBA.

Judge performance in the World Grand Champion class was not much better than it was for Class 92.  Consider the results for Class 132 below:

World Grand Champion ‑ Class 132 ‑ 5 years & older         

     Rankings are:       Orderings based on all 5 judges:

   t   b   r   d   j            t   b   r   d   j

  663 600 600 663 663           1   2   2   1   1

  665 483 663 665 665           3   4   1   3   3

  757 896 307 284 600           6   5   7   9   2

  600 665 116 600 896           2   3   8   2   5

  483 284 483 757 307           4   9   4   6   7

  116 757 896 483 483           8   6   5   4   4

  525 663 697 896 757          11   1  10   5   6

  307 116 525 307 116           7   8  11   7   8

  896 697 665 525 697           5  10   3  11  10

  697 307 284 697 284          10   7   9  10   9

 

        Orderings from deleting one judge at a time:

                 t     b     r     d     j

                 2     2     3     2     2

                 3     4     1     3     3

                 8     6     8    10     1

                 1     3     9     1     6

                 4     9     4     8     8

                 9     7     6     4     4

                11     1    10     5     5

                 6     8    11     7     9

                 5    11     2    11    11

                10     5     7     9     7

Distances      5.2   3.7   5.5   7.1   2.4

Theta:  0.49   Average distance of all judges: 4.80


The value of theta tells the story: there was a lot of disagreement among the judges as to which were the best horses in this class. Note that only judge j picked the top 3 horses consistent with the consensus of his peers, and even then he placed the consensus first-place horse 3rd. The other 4 judges picked horses that were placed 6th, 8th, or 10th by their peers as their 3rd-place horses. Note that judge r placed the horse his peers placed 2nd as his 9th-place horse, and judge b placed the consensus 1st-place horse 7th. Shouldn't we expect judging at this level to be much more consistent than this, particularly for such an important class?

As indicated earlier, the results of 15 open classes from the 2003 Celebration were summarized. The summary follows.

Averages for 15 open classes at the 2003 MFTHBA Celebration

            #/class placed a     #/class placed an    #/class was the
            lower-third horse    upper-third horse    only one to
 Judge      in the upper third   in the lower third   place a horse

 b            .40                 .40                   .30

 d            .25                 .17                   .42

 h            .00                 .40                   .70

 j            .00                 .27                   .09

 r            .36                 .55                   .45

 t            .36                 .09                   .27

 Average Theta: .42

           Distance  Variance    # classes

   Judge‑b:  5.6       7.3           10

   Judge‑d:  4.9       8.8           12

   Judge‑h:  3.5       3.5           10

   Judge‑j:  3.1       1.8           11

   Judge‑r:  4.9      14.8           11

   Judge‑t:  4.9      23.3           11

 

The first part of the summary above gives the average number of times per class a judge placed a horse that his peers' consensus placed 1st, 2nd, or 3rd as his 8th, 9th, or 10th choice, i.e. placed a horse that is in the upper 1/3 according to his peers in the lower 1/3 of his own placings. It also shows the reverse situation, where a judge placed a lower-1/3 horse (8th, 9th, or 10th) according to the consensus of his peers as one of his top 3 horses (upper 1/3). Finally, it shows the average number of times per class that a particular judge was the only one to place a particular horse, i.e. a horse he placed in his top 10 that no other judge placed at all. The results show that judge b chose an 8th, 9th, or 10th place horse as one of his top 3 horses an average of 0.40 times per class, placed a 1st, 2nd, or 3rd place horse in 8th, 9th, or 10th place an average of 0.40 times per class, and was the only one to place a particular horse in the top 10 an average of 0.30 times per class. Several other judges displayed similar inconsistencies. Note that the average value of theta over the 15 classes was 0.42, which indicates that the judges disagreed in many of these 15 open classes. In my opinion, the results indicate that judges j and h performed well. The other four judges tended to be erratic and inconsistent with their peer judges. These results indicate a strong need for better judge screening, better training, more stringent selection criteria and, perhaps, a judge certification program as well.
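
As a rough illustration of how these tallies could be computed from the raw placings, the sketch below reuses consensus_order() from the earlier sketch. The cut-offs are an assumption for a ten-horse class (upper third = peer-consensus positions 1-3, lower third = positions 8 or worse), and horses that no peer placed at all are simply counted in the "only one to place" column rather than ranked, since the tutorial does not say how such horses enter the consensus.

    def class_counts(placings, n_placed=10):
        """Per-judge tallies for one class:
           (lower-third horses the judge put in his top three,
            upper-third horses the judge put in places 8-10,
            horses no other judge placed at all)."""
        counts = {}
        for judge, order in placings.items():
            peers = {j: p for j, p in placings.items() if j != judge}
            position = {h: i for i, h in enumerate(consensus_order(peers), start=1)}
            vs_peers = [position.get(h) for h in order]   # None if no peer placed it
            low_in_top = sum(1 for pos in vs_peers[:3]
                             if pos is not None and pos >= n_placed - 2)
            top_in_low = sum(1 for pos in vs_peers[n_placed - 3:]
                             if pos is not None and pos <= 3)
            only_mine = sum(1 for pos in vs_peers if pos is None)
            counts[judge] = (low_in_top, top_in_low, only_mine)
        return counts

    # Averaging these per-class tuples over the 15 classes would give the first table above.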

Consider the 2003 WGC class from the TWH association's annual Celebration. This organization has very stringent requirements for its Celebration judges.

 

2003 – World Grand Champion – TWH Association

 

  5 judges

 10 horses ranked

 

     Rankings are:              Orderings based on all 5 judges:

 

    b    p    s    h    w             b    p    s    h    w

     9    9    9    9    9            1    1    1    1    1

  1229 1229 1229 1229 1229            2    2    2    2    2

  2504  935 1474  935  935            4    3    6    3    3

   935 2504 2504  187 2504            3    4    4    5    4

   187  187  187 2504 1474            5    5    5    4    6

   286 1474  935 1474  187            7    6    3    6    5

  1474 2399  286 1844  286            6    9    7    8    7

  2399 1844 1844 2399 1844            9    8    8    9    8

  1844 1742 2399  286 1742            8   10    9    7   10

  1742  286 1742 1742 2399           10    7   10   10    9

 

        


         Orderings from deleting one judge at a time:

 

                    b      p      s      h      w

                    1      1      1      1      1

                    2      2      2      2      2

                    4      4      6      4      4

                    3      3      4      6      3

                    5      5      5      3      6

                    8      6      3      5      5

                    6      9      8      8      7

                    9      8      7      9      9

                    7     10      9      7     10

                   10      7     10     10      8

                                                     

Distances         1.0    1.0    3.0    1.0    1.0    

 

Theta: 0.16      Average distance of all judges: 1.40

 

Note that in this class all judges agreed as to which were the best two horses. There was disagreement only as to which horse was the 3rd-place horse, and there the differences of opinion were not extreme for any judge. Compare these results to those of the 2003 WGC class for the MFTHBA, Class 132 given above, where theta was 0.49 and the average distance was 4.80. These very different results reflect the effect that stringent requirements for judge eligibility have on subsequent judging performance. The TWH judges have to meet rigorous requirements before they are approved to judge major shows such as their Celebration.

It is a very different situation in the MFTHBA, where almost anyone can qualify to judge the Celebration by obtaining a judge's card, and where judges are selected by various methods such as popular vote, committee decision, random or restricted random selection from a list of those who have said they are willing to judge, or in some other way based on very little information about a judge's abilities and more on availability, popularity, or plain old politics. Selection of judges clearly should be based on objective criteria related to one's established judging ability, not on popularity or on a relationship to those who will be judged. The differences in the results of these two classes prove the point.

Members of the MFTHBA should expect the kind of agreement seen in the above class in their WGC classes, but that is not what they get. If judges can't be consistent in picking a WGC, where they are looking at the best horses and the best exhibitors performing with consistency, then it must be because at least some of them are either not well trained or do not have the ability to recognize the traits in MFT performance horses that are important according to the breed standard. Neither case is acceptable, and untrained or incompetent individuals should certainly not be allowed to judge the most important show of the year. Judges should never be picked by a method where a conflict of interest exists. MFTHBA judges are picked in ways that have little to do with their ability and a lot to do with who approves of them, based on unstated and unknown criteria beyond some persons' opinions. The results seen here are proof that the methods used in the past, and specifically in 2003, have very serious ramifications.