A Judging Evaluation Procedure

 

The fact that the judging of horses, dogs, ice skaters, beauty pageants, etc. is subjective is clear to anyone who has observed such events, and even clearer to those who have participated in them. When events are judged by multiple judges there is an expectation that the judges should all be looking for the same characteristics as indications of superior quality. Whether it be horses or ice skaters, we expect a judge's placings to be consistent with those of the other judges. In Olympic ice skating, when one group of judges prefers one skater and another group prefers another, the conclusion is often drawn that the judges are biased and are playing politics. The same conclusion may be drawn about horse judging, but there is also the possibility that a judge is not well trained and thus is not putting the proper emphasis on the important differences among the horses. The possibility also exists that a judge may not be capable of judging correctly, i.e., he may not have the ability to be objective and perceptive to the degree required, no matter how much training and experience he may have.

The need for objective criteria for evaluating judge performance is basically twofold. First, we need a means by which a judge's performance can be summarized, evaluated, and compared to that of the other judges, which could be very useful in selecting judges for future events; if a judge performs inconsistently, he is probably not the one to choose for a future show. Second, we need a way to determine whether a judge training program is effective. An effective training program, given to competent and capable individuals, will produce groups of judges who have very similar placings of groups of animals. If we can characterize the similarity, or lack of similarity, displayed by a group of judges trained by a particular technique, we may be able to tell how effective that training program is.

 

When Missouri fox trotters are judged at the Celebration they are placed down to 10th place if there are 10 or more entries in a class. It is my belief that this is a nearly impossible task and goes beyond the capabilities of most mortals. I believe that it is the first three places that are really important, and that if we can get groups of judges who consistently pick the same, or very similar, top threes in their classes, we can be satisfied with those judges' performances. I am trying to develop an evaluation procedure that will objectively measure the consistency of a group of judges' performance. While it is impossible to determine whether a particular judge's placings are correct, it is possible to measure how consistent a judge's placings are compared with those of the other judges with whom he is judging. The information in the Show & Celebration book is extremely useful for determining which judges were consistent and which classes were most troublesome to judge.

 


 

In order to evaluate how similar a set of rankings is, there must be some measure of distance between two sets of rankings. Statisticians use several different measures of such distances. I have chosen one of the simplest because it is more intuitive than most others. The distance is computed by taking the difference in rank between two judges' placings for each animal, squaring that difference, summing the squared values over the first 3 places only, and then taking the square root of the sum. Consider the simplified case where we have a class of animals placed 1, 2, 3, ..., 10 by the set of judges. To measure the distance that a particular judge is from that placing, we can look at which animals he ranked first, second, and third. Suppose he ranked the 3rd-ranked horse first, the 5th-ranked horse second, and the 1st-ranked horse third. His distance from the composite ranking would be computed as (1-3)² + (2-5)² + (3-1)² = 17; the square root of 17 is 4.12, which is the distance of this judge's placing from the composite placing based on all the judges' rankings. Similar values can be computed for each judge. Note that any judge who places the first place horse first, the second place horse second, and the third place horse third will have a distance of 0. Thus large distances are associated with placings that do not put the same 3 horses in the top three as the group of judges did. For example, if a judge placed the 10th-ranked animal first, the 9th-ranked horse second, and the 8th-ranked horse third, the distance would be 12.45, while if a judge placed the 3rd-ranked horse first, the 2nd-ranked horse second, and the 1st-ranked horse third, the distance would be 2.83. Given a measure of distance from a composite ranking, it is possible to determine which judge was in closest agreement with the group of judges and which judge was farthest away.
By computing the total distance among all the judges we can get a measure of how much disagreement there was among the judges for a given class or set of events.
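For readers who want to reproduce the calculation, the distance measure can be sketched in a few lines of code. This is only an illustration, assuming each placing is stored as a mapping from horse number to rank (the horse numbers below are hypothetical); it reproduces the worked example of the judge who put the composite's 3rd, 5th, and 1st horses in his top three.

```python
from math import sqrt

def top3_distance(judge_rank, composite_rank):
    """Distance between a judge's placing and a composite placing,
    summing squared rank differences over the judge's top 3 horses.
    Both arguments map horse number -> rank (1 = first place)."""
    total = 0
    for horse, jrank in judge_rank.items():
        if jrank <= 3:  # only the judge's first three places count
            diff = jrank - composite_rank[horse]
            total += diff * diff
    return sqrt(total)

# Worked example from the text: the judge's 1st, 2nd, and 3rd horses
# were ranked 3rd, 5th, and 1st by the composite, giving
# sqrt((1-3)^2 + (2-5)^2 + (3-1)^2) = sqrt(17).
judge = {101: 1, 102: 2, 103: 3}
composite = {101: 3, 102: 5, 103: 1}
print(round(top3_distance(judge, composite), 2))  # 4.12
```

The same function reproduces the other figures quoted above: a judge whose top three were the composite's 10th, 9th, and 8th horses gets 12.45, and one who reversed the composite's top three gets 2.83.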

 

Basing a judge's distance on the composite ranking of all the judges is flawed, in that the composite ranking to which he is being compared is influenced by his own ranking of the class. A better measure of how well his placings agree with those of his fellow judges is to exclude his rankings when determining the composite ranking to which he will be compared. Thus, to see how well judge 1 agrees with the other judges, we can leave his placings out and compute a composite ranking based on the rankings of just the other 4 judges. If judge 1 did something extreme, such as placing the 10th place animal first, it will not influence a composite ranking based on the other 4 judges' placings, which therefore provides a better measure of just how far judge 1 is from the composite ranking of his peers. To arrive at a composite ranking, the method used is to give 10 points to an animal ranked first, 9 to one ranked second, and so on down to 1 point for a horse ranked 10th. The points are summed over the judges' placings; the horse with the most points is ranked first, the one with the second highest total is ranked second, etc. This is similar to the way the horses are placed at the Celebration, except that there the highest and lowest score for each horse are eliminated. Here we keep the scores from the four judges whose placings are being used to determine a composite ranking, and compute the distance for each judge in similar fashion; i.e., judge 2's distance is computed by determining the composite ranking based on the placings of judges 1, 3, 4, and 5 and then computing the distance of his placings from that composite ranking.
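The point-scoring composite, with one judge left out, can be sketched as follows. This is only a sketch under stated assumptions: placings are lists of hypothetical horse numbers in placed order, ties in total points are broken arbitrarily (the text does not say how ties are resolved), and the highest/lowest score elimination used at the Celebration itself is not applied.

```python
def composite_ranking(placings, leave_out=None):
    """Composite ranking by point scoring: 10 points for a 1st place,
    9 for 2nd, ..., 1 for 10th, summed across judges.

    `placings` maps judge name -> list of horse numbers in placed
    order (index 0 = first place).  `leave_out` names a judge whose
    placings are excluded, so his own ranking cannot influence the
    composite he is compared to.  Ties are broken arbitrarily.
    """
    points = {}
    for judge, order in placings.items():
        if judge == leave_out:
            continue
        for place, horse in enumerate(order, start=1):
            points[horse] = points.get(horse, 0) + (11 - place)
    ranked = sorted(points, key=points.get, reverse=True)
    return {horse: rank for rank, horse in enumerate(ranked, start=1)}

# Tiny illustration with 3 judges and 4 hypothetical horses:
placings = {
    "j1": [101, 102, 103, 104],
    "j2": [102, 101, 103, 104],
    "j3": [101, 103, 102, 104],
}
print(composite_ranking(placings, leave_out="j1"))
# {101: 1, 102: 2, 103: 3, 104: 4}
```

Leaving judge j1 out, horse 101 totals 19 points from j2 and j3, horse 102 totals 18, and so on, which determines the composite j1 is compared against.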

 

The distances computed for each judge can be averaged over several classes and the averages compared in order to determine whether judges are ranking horses similarly or not. The average distance for each judge provides a measure of how consistent his placings are with those of his peers.  This information can then be used to identify which judges are consistent and which are not.

 


 

Other measures of inconsistency come to mind as well. How often a judge places a horse in one of the bottom three positions (8, 9, 10), or not at all, when the composite ranking places it in one of the top three, is one such measure. So is the frequency with which a judge places a horse in his top three when the composite places it 8th or lower. The frequency with which a judge is the only one of the five judges to place a particular horse may also be of interest, especially if that horse appears in the top 5 of his placings.
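The first two of these frequency measures can be counted directly from a pair of rankings. A minimal sketch, again assuming rankings are stored as horse-number-to-rank mappings, with an unplaced horse treated as ranked below 10th; the horse numbers are hypothetical:

```python
def inconsistency_flags(judge_rank, composite_rank, class_size=10):
    """Count two warning signs for one class:
    - composite top-3 horses the judge put 8th or lower (or unplaced)
    - judge top-3 horses the composite put 8th or lower (or unplaced)
    Unplaced horses are treated as ranked class_size + 1."""
    unplaced = class_size + 1
    top_buried = sum(
        1 for horse, crank in composite_rank.items()
        if crank <= 3 and judge_rank.get(horse, unplaced) >= 8
    )
    low_promoted = sum(
        1 for horse, jrank in judge_rank.items()
        if jrank <= 3 and composite_rank.get(horse, unplaced) >= 8
    )
    return top_buried, low_promoted

# The judge puts a composite 9th-place horse first and leaves the
# composite winner unplaced entirely:
composite = {201: 1, 202: 2, 203: 3, 204: 9}
judge = {204: 1, 202: 2, 203: 3}
print(inconsistency_flags(judge, composite))  # (1, 1)
```

Counts like these, averaged per class, are what the "#/class" summary tables in the examples below report.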

 

Consistency is often defined and thought of as something that is repeatable and shows little variability. A consistent golfer may have an average score in the low 70's and hardly ever have a score in the 80's. An inconsistent golfer may also have an average score in the low to mid 70's but have scores routinely ranging from the 60's to the 80's. There may not be much difference in their average scores but one shows consistency while the other shows inconsistency. The statistical measure for this characteristic is called variance. A series of numbers that are nearly alike have little variation and small variance values while a series of numbers that are very different have a lot of variation and large variance values. The variance of judges’ distances can be computed to quantify the consistency of a judge's performance, i.e. judges with smaller variance values are more consistent than those with large variance values.  Two judges may have about the same average distance for a set of classes they have placed but one judge's scores may show small variance values and the other judge may have large variance. This would indicate that the judge with the large variance agrees with the composite ranking part of the time and has some very small distance values, while for other classes he may totally disagree with the composite ranking which results in very large distances.
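The golfer analogy can be made concrete with a small computation. The numbers below are made up purely for illustration: two judges with essentially the same average distance over six classes, one steady and one erratic.

```python
from statistics import mean, pvariance

# Hypothetical distances over six classes; both averages are 3.17.
judge_a = [3.0, 3.5, 2.8, 3.2, 3.1, 3.4]   # consistently close
judge_b = [0.5, 7.0, 0.8, 6.5, 1.0, 3.2]   # hot and cold

for name, dists in [("a", judge_a), ("b", judge_b)]:
    print(name, round(mean(dists), 2), round(pvariance(dists), 2))
# a 3.17 0.06
# b 3.17 7.2
```

Despite identical averages, judge b's variance is over a hundred times judge a's, which is exactly the pattern of part-time agreement and part-time wild disagreement described above.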

 

    


 

Examples:

 

To show how the measures discussed above can be used to evaluate judging performance I have included two examples based on the results from classes at the 1999 Celebration.  The first class is one where there was extremely close agreement among the judges, Class 10 - Amateur 4 years old (owned), lady rider 18 & up.

 

Example 1:

 

Class 10 - 2000 Amateur 4 years & up (owned) Lady riders 18 & up

 

 judges: 5    # ranked:10

 

     Rankings are:       Orderings based on all 5 judges:

 

   t   c   b   r   h            t   c   b   r   h

  290 290 147 147 147           2   2   1   1   1

  147 147 290 290 259           1   1   2   2   4

  953 953 953 259 290           3   3   3   4   2

  259 158 148 680 680           4   6  12   5   5

  158 680 259 953 953           6   5   4   3   3

  680 259 102 804  18           5   4   8   7  10

  256 256 680  18 158           9   9   5  10   6

  804 804 519 102 102           7   7  11   8   8

  519 519 804 158 804          11  11   7   6   7

   18 102 256 519 256          10   8   9  11   9

 

        Orderings from deleting one judge at a time:

 

                 t     c     b     r     h

                 2     2     1     1     2

                 1     1     2     2     4

                 3     4     4     4     1

                 4     6    12     5     5

                 6     5     3     3     3

                 5     3    10     8    12

                11    11     5    12     6

                 8     7    11     9     9

                12    12     7     6     7

                 9     8     9    10     8

 

Delta Theta  -.007 -.023  .001  .007 -.058

 

 Theta hat:  .217

 


 

 

 Distances for each judge

 

            1-10   1-5   1-3

 

   Judge-t:  5.5   1.7   1.4

   Judge-c:  6.8   2.6   1.7

   Judge-b: 10.1   8.3   1.0

   Judge-r:  6.7   2.4   1.0

   Judge-h:  7.7   3.7   3.0

 

 

 Grand averages for 1 class

 

  #/class lower      #/class upper   #/class only

  in upper third     in lower third    to place

 b       .00                .00         1.00

 c       .00                .00          .00

 h       .00                .00          .00

 r       .00                .00          .00

 t       .00                .00          .00

 

 

Discussion:

 

The actual placings are shown first, followed by the orderings based on all 5 judges' placings. The orderings given second are based on leaving out each judge in turn to arrive at a composite ranking for that judge to be compared to. Thus judge t was left out, a composite ranking was determined from the placings of the other judges, and judge t's placings were then ordered relative to that composite. Judge t placed 290 first, and in the composite based on the other judges 290 was 2nd. He placed horse 147 second, and in the composite it was 1st. The orderings for each judge were then used to compute each judge's distance from his respective composite placing for all 10 places (1-10), for the first 5 places (1-5), and for the first 3 places (1-3). Note that all the distances are small, showing that there is good agreement in the placing of these horses. The results show there were no top horses placed low or low horses placed high. Judge b was the only one to place a particular horse: horse 148, which he placed 4th. The distances for the first 3 placings range from 1.0 to 3.0. These are very small values and show excellent agreement among the judges' placings for the first three places. Judge h had the largest distance because he/she placed the horse that was 4th in the composite based on the other 4 judges as his/her 2nd place horse, while the other judges' placings were closer to their respective composites because they all placed either the first or second place horse second. The results from this class are an example of what we would like to see when horses are placed by multiple judges.

 


 

Example 2: The second example is Class 95 - World Grand Champion Amateur 2 year old.

 

 Class 95 - 2000 World Grand Champion Amateur 2 year old                          

 judges: 5    # ranked:10

 

     Rankings are:       Orderings based on all 5 judges:

 

   b   c   t   h   l            b   c   t   h   l

  960 960 376 376 365           3   3   1   1   5

  503 503  29 297 297           2   2   6   4   4

  376 135 960 244 376           1  11   3  10   1

  297  37   8 365 503           4   8   9   5   2

   29 106 365   8  37           6  12   5   9   8

  252 376 252 220 252           7   1   7  15   7

   37 252 277  29  29           8   7  14   6   6

  365 660 503 503 495           5  16   2   2  13

  660 612 297 495 135          16  17   4  13  11

  244 277 244 960 244          10  14  10   3  10

 

        Orderings from deleting one judge at a time:

 

                 b     c     t     h     l

                 5     6     1     1     6

                 3     5     8     7     5

                 1    15     4    15     1

                 4    10    11     5     3

                 6    16     5    10     9

                 7     1     7    17     7

                 9     7    17     4     4

                 2    14     2     3    16

                16    17     3    14    11

                10    13     9     2    10

 

Delta Theta   .027 -.050  .001 -.068 -.020

 

 

 Distances for each judge

 

            1-10   1-5   1-3

 

   Judge-b: 10.6   4.7   4.6

   Judge-c: 21.7  18.3  13.3

   Judge-t: 16.1   9.3   6.1

   Judge-h: 21.0  14.0  13.0

   Judge-l: 11.5   7.4   6.2

 

 

 


 

  #/class lower      #/class upper   #/class only

  in upper third     in lower third    to place

 b       .00               1.00          .00

 c      1.00               2.00         2.00

 h      1.00               2.00         1.00

 l       .00               1.00          .00

 t      1.00               2.00          .00

 

Discussion:

 

The results from this class show that there was not much of a consensus among the judges. Seventeen different horses appeared in the 5 judges' top 10 placings. All of the distances tended to be large, showing that the judges were placing horses according to different criteria. Judge b had the smallest distance for 1-3 because the horses in his top 3 places were all in the top 5 of the composite based on the other judges' placings. Judges c and h have large distances for 1-3 because each placed, as his 3rd place horse, one that was placed 15th in his respective composite ranking based on the other 4 judges' placings: for judge c it was horse 135 and for judge h it was horse 244. Judges b and l placed 244 10th, while the other two judges didn't place it at all. Judge l was the only other judge to place 135, and he placed it 9th. Note that the large 1-10 values for judges c and h are due to different causes. For judge c it is because he placed the 1st ranked horse 6th and placed horses that hardly anyone else placed as his last 3 placings, in addition to placing 135 3rd. For judge h it was due to placing the 2nd and 3rd place horses at the bottom of the class, in addition to placing 244 3rd.

 

Judges c, h, and t each placed a horse that was ranked 8th or lower in one of their top 3 places. Judges b and l each placed a horse that was ranked 1, 2, or 3 in one of their lowest 3 positions, or didn't place it at all. Judges c, h, and t placed two of the top 3 horses 8th or lower, or didn't place them at all. Judge c was the only judge to place two horses, 106 and 612, which he placed 5th and 9th, respectively. Judge h was the only one to place horse 220, which he/she placed 6th.

 

This is an example of results we would rather not see, as it shows serious disparity among the judges' rankings. It would be interesting to know why there was so little agreement among the judges. In fact the cause should be determined so that such situations can be avoided, or at least minimized, in the future. The rankings of the horses that resulted from this class (an important class at that) are not much better than what might result from randomly drawing contestant numbers from a hat. We can have very little confidence that the placings reflect the relative quality of the horses in the class.

 

The results from all the amateur and open classes for the 1999 Celebration are posted on separate pages for your perusal.  Any comments that you have on these results relative to their effectiveness in quantifying the quality of the judging of these events will be appreciated.