John G Randolph

Intro

There are a few lines of criticism often leveled at spirit scores in ultimate, and at spirit more generally. The definition of spirit varies between communities and between cultures (Link 1, Link 2). Losers receive, on average, higher spirit scores by half a point to a point (Link). Spirit introduces and reinforces racial issues (Link 1, Link 2).

This blog post is not concerned with any of those complaints. It is concerned with the fact that in many cases, when we rank teams on their spirit, we may not have enough data points to tell whether that ranking is accurate. After many tournaments, rankings of team spirit are publicized, and at some tournaments, such as USAU nationals, spirit winners are celebrated and awarded medals. But most tournaments only involve 5-8 games played. How much can we trust an average of spirit scores from a sample that small?

The power of spirit scores

We'll use the spirit scores from 2022 Club nationals as our test case. Those spirit scores are available in full thanks to USAU here. Team averages range from 9.0 to 12.29, and scores from individual games range from 4 to 20. The mean score was 10.17. Here is each team's average score:

Deviation from team avg histogram — Summary of spirit scores from club nationals 2022.

It would be natural to look at these results and conclude that Mad Men, BFG, and Flipside were each the most spirited team in their division, while Sockeye, Red Flag, and Traffic were the least spirited. But it's easy for humans to be overconfident about data with few data points, so we need to ask: are the differences in spirit scores between teams statistically significant?

We'll test this with simple t-tests between team averages. This type of test makes three assumptions.

The games are a random sample. This could be false - for example, if there is a positive correlation between team finish and spirit then the lower finishing teams would play games skewed towards lower spirited teams. But I think this is a very reasonable assumption.
Observations are independent.
The sample distribution is normal. I think this is close enough to true to be a reasonable assumption. Here's a graph of the distribution of scores:

To be clear on the way we're modeling the problem, we're assuming that each team has a baseline spirit, which can also be thought of as their average spirit, and that in each game they deviate somewhat from this. There may also be random error introduced by the team reporting/observing their spirit. Under this model, each reported spirit score is an unbiased estimate of the team's true baseline spirit.
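To get a feel for what this model implies, here is a minimal simulation sketch. The baseline of 10, the per-game standard deviation of 2, and the function name are all made-up assumptions for illustration, not values taken from the USAU data. Real spirit scores are also whole numbers on a bounded scale, which this sketch ignores.

```python
import random
import statistics


def simulate_tournament_averages(baseline, game_sd, n_games, n_tournaments, seed=0):
    """Simulate one team's average spirit score at many tournaments.

    Each game score is drawn as baseline + normally distributed noise,
    which is the modeling assumption described above.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    averages = []
    for _ in range(n_tournaments):
        games = [rng.gauss(baseline, game_sd) for _ in range(n_games)]
        averages.append(statistics.mean(games))
    return averages


# Hypothetical inputs: baseline spirit 10, per-game sd 2, six games per tournament.
averages = simulate_tournament_averages(
    baseline=10.0, game_sd=2.0, n_games=6, n_tournaments=10_000
)
spread = statistics.stdev(averages)
expected = 2.0 / 6 ** 0.5  # theoretical standard error: sd / sqrt(n)

print(f"simulated spread of 6-game averages: {spread:.2f}")
print(f"theoretical standard error:          {expected:.2f}")
```

With these assumed numbers, a team's six-game average typically lands about 0.8 points away from its true baseline. That is a sizable amount compared to the gap between the highest and lowest team averages at a tournament.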

Club Nationals 2022

So let's see what happens when we apply this framework to club nationals 2022. We proceed by comparing each team's spirit to the spirit of each other team in their division. Here are the results, with differences that are significant (p <= .05) highlighted in green.

The first thing that jumps out is that there are very few statistically significant differences. We can, for example, say it is quite likely that Sockeye, the lowest spirit finisher in the Men's division, has a lower average spirit than Mad Men, the highest spirit finisher. But almost no other differences are significant. We can't even be sure that the highest spirit finisher in either other division has better baseline spirit than the lowest spirit finisher.

Club nationals 2017

It's hard to write a blog post about spirit without making mention of Florida. Surely some teams have such outlier spirit that we can find statistically significant differences between their nationals spirit scores and those of other teams, right? We'll look at an extreme example: the infamous Florida United team at Club nationals 2017. That year, Florida United finished last with spirit scores of 7, 5, 8, 6, 4, and 10, for an average of 6.7. Also in 2017, Johnny Bravo had an unusually high spirit score of 14.9. We'll run the same test on the men's division of club nationals 2017. Here are the results:

This is a little different! Florida United has a statistically significant lower spirit score than every other team, with the exception of PoNY. This is due not only to their low average but also to how consistently low their scores were: their highest score is only 3.3 higher than their average (std dev 2.1). Johnny Bravo only has a statistically significant difference from 4 other teams despite having an average score nearly as far from the mean as Florida United (4.08 vs 4.11). This is because their scores had more variance: 12, 20, 20, 11, 11, 10, 20 (std dev = 4.8).
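For readers who want to try a comparison like this themselves, here is a sketch using the two sets of 2017 scores quoted above. It assumes Welch's unequal-variance form of the t-test. The post only says "simple t-tests", so the exact variant behind the results above may differ. The helper names are hypothetical, and the p-value is approximated by numerically integrating the t distribution so that only the standard library is needed.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area under the t density on [-|t|, |t|].

    The area is approximated with Simpson's rule (steps must be even).
    """
    a, b = -abs(t), abs(t)
    if a == b:
        return 1.0
    h = (b - a) / steps
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return max(0.0, 1.0 - total * h / 3)


def welch_t_test(xs, ys):
    """Welch's t-test: returns (t statistic, degrees of freedom, two-sided p)."""
    vx = statistics.variance(xs) / len(xs)  # sample variance of the mean of xs
    vy = statistics.variance(ys) / len(ys)
    t = (statistics.mean(xs) - statistics.mean(ys)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(xs) - 1) + vy ** 2 / (len(ys) - 1))
    return t, df, two_sided_p(t, df)


# Game-by-game spirit scores from Club nationals 2017, as quoted in the post.
florida_united = [7, 5, 8, 6, 4, 10]
johnny_bravo = [12, 20, 20, 11, 11, 10, 20]

t_stat, df, p_value = welch_t_test(florida_united, johnny_bravo)
print(f"Florida United: mean {statistics.mean(florida_united):.1f}, "
      f"sample sd {statistics.stdev(florida_united):.2f}")
print(f"Johnny Bravo:   mean {statistics.mean(johnny_bravo):.1f}, "
      f"sample sd {statistics.stdev(johnny_bravo):.2f}")
print(f"t = {t_stat:.2f}, df = {df:.1f}, p = {p_value:.4f}")
```

Under these assumptions, the difference between the two teams comes out significant at the p <= .05 threshold used in this post. The wider spread in Johnny Bravo's scores inflates the standard error, which is why the same kind of gap is harder to establish against teams closer to the middle.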

Conclusion

This is not to say that Mad Men was not the most spirited team in their division, nor that Red Flag was not the least spirited team in their division. Both of those things may very well be true - but the small number of spirit scores collected at 2022 club nationals doesn't strongly support those claims (though it does supply some evidence). To make that argument, you'd need to rely on something else, whether that is personal experience or a larger set of data points.

Maybe we should do away with awarding spirit trophies to teams at tournaments with only six games. Maybe we should collect spirit scores for all games throughout a whole season. Maybe Mitch Dengler should be running t-tests during timeouts. But as it stands, the system of awarding a trophy to a somewhat randomly determined team as if they are a paragon of spirit is foolish and not mathematically sound.

The Failure of Spirit Rankings
Why spirit rankings are usually statistically meaningless

