The TrueSkill ranking system is a skill-based ranking system for Xbox Live developed at Microsoft Research. The purpose of a ranking system is to both identify and track the skills of gamers in a game (mode) in order to be able to match them into competitive matches. TrueSkill has been used to rank and match players in many different games, from Halo 3 to Forza Motorsport 7.
The classic TrueSkill ranking system only uses the final standings of all teams in a match in order to update the skill estimates (ranks) of all players in the match. The TrueSkill 2 ranking system also uses the individual scores of players in order to weight the contribution of each player to each team. As a result, TrueSkill 2 is much faster at figuring out the skill of a new player.
So, what is so special about the TrueSkill ranking system? Compared to the Elo rating system, the biggest difference is that in the TrueSkill ranking system skill is characterized by two numbers:
- The average skill of the gamer (μ in the picture).
- The degree of uncertainty in the gamer’s skill (σ in the picture).
The ranking system maintains a belief in every gamer’s skill using these two numbers. If the uncertainty is still high, the ranking system does not yet know exactly the skill of the gamer. In contrast, if the uncertainty is small, the ranking system has a strong belief that the skill of the gamer is close to the average skill.
On the side, a belief curve of the TrueSkill ranking system is drawn. For example, the green area is the belief of the TrueSkill ranking system that the gamer has a skill between level 15 and 20.
Maintaining an uncertainty allows the system to make big changes to the skill estimates early on but small changes after a series of consistent games has been played. As a result, the TrueSkill ranking system can identify the skills of individual gamers from a very small number of games. The following table gives an idea of the minimum number of games per gamer that the system needs to identify the skill level:
|Game Mode|Number of Games per Gamer|
|---|---|
|16 Players Free-For-All|3|
|8 Players Free-For-All|3|
|4 Players Free-For-All|5|
|2 Players Free-For-All|12|
|4 Teams/2 Players Per Team|10|
|4 Teams/4 Players Per Team|20|
|2 Teams/4 Players Per Team|46|
|2 Teams/8 Players Per Team|91|
The actual number of games needed per gamer can be up to three times higher depending on several factors such as the variation of the performance per game, the availability of well-matched opponents, the chance of a draw, etc. If you want to learn more about how these numbers are calculated and how the TrueSkill ranking system identifies players’ skills, please read the Detailed Description of the TrueSkill™ Ranking Algorithm or find out in the Frequently Asked Questions.
If you play a ranked game on Xbox Live, the TrueSkill ranking system will compare your individual skill (the numbers μ and σ) with the skills of all the game hosts for that game mode on Xbox Live and automatically match you with players with skill similar to your own. But how can this be done when every player’s skill is represented by two numbers? The trick is to use the (hypothetical) chance of drawing with someone else: If you are likely to draw with another player then that player is a good match for you! Sounds simple? It is!
Most games have at their root a metric for judging whether the game’s goals have been met. In the case of matches involving two or more players (“multiplayer matches”), this often includes ways of ranking the skills of match participants. This encourages competition between players, both to “win” individual matches, and to have their overall skill level recognized and acknowledged in a broader community. Players may wish to evaluate their skills relative to people they know or relative to potential opponents they have never played, so they can arrange interesting matches. We term a match “uninteresting” if the chances of winning for the participating players are very unbalanced – very few people enjoy playing a match they cannot win or cannot lose. Conversely, matches which have a relatively even chance of any participant winning are deemed “interesting” matches.
Many ranking systems have been devised over the years to enable leagues to compare the relative skills of their members. A ranking system typically comprises three elements:
- A module to track the skills of all players based on the game outcomes between players (“Update”).
- A module to arrange interesting matches for its members (“Matchmaking”).
- A module to recognize and potentially publish the skills of its members (“Leaderboards”).
In particular, the Elo ranking system has been used successfully by a variety of leagues organized around two-player games, such as the world football league, the US Chess Federation, the World Chess Federation, and a variety of others. In video games, many of these leagues have game modes with more than two players per match. Elo is not designed to work under these circumstances. In fact, no popular skill-based ranking system is available to support these games. Many one-off ranking systems have been built and are in use for these games, but none of them is general enough to be applied to such a great variety of games.
How to Represent Skills
The TrueSkill ranking system is a skill-based ranking system designed to overcome the limitations of existing ranking systems, and to ensure that interesting matches can be reliably arranged within a league. It uses a technique called Bayesian inference for ranking players.
Rather than assuming a single fixed skill for each player, the system characterizes its belief using a bell-curve belief distribution (also referred to as Gaussian) which is uniquely described by its mean μ (pronounced “mu”; the “peak point”) and standard deviation σ (pronounced “sigma”; the “spread”). An exemplary belief is shown in the figure above. Note that the area under the skill belief distribution curve within a certain range corresponds to the belief that the player’s skill lies in that range. For example, the green area in the figure is the belief that the player’s skill is between level 15 and level 20. As the system learns more about a player’s skill, σ tends to become smaller, more tightly bracketing that player’s skill. Another way of thinking about the μ and σ values is to consider them as the “expected player skill” and the “uncertainty” associated with that assessment of their skill.
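To make the "area under the curve" reading concrete, here is a minimal Python sketch that computes the belief that a skill lies in a given range. The function names and the example belief (μ = 25, σ = 25/3) are chosen for illustration only and are not part of any TrueSkill API.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian with mean mu and std dev sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def belief_in_range(mu, sigma, low, high):
    """Belief (area under the Gaussian curve) that the skill lies between low and high."""
    return normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)

# Hypothetical belief, chosen only for illustration.
p = belief_in_range(25.0, 25.0 / 3.0, 15.0, 20.0)
print(f"Belief that the skill is between level 15 and 20: {p:.3f}")
```

A smaller σ concentrates more of this area around μ, which is what "more tightly bracketing" means in practice.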
Since the TrueSkill ranking system uses a Gaussian belief distribution to characterize a player’s skill, almost all expected player skills (that is, μ values) will lie within ± 4 times the initial σ (more precisely, with probability 99.99%). Experimental data tracking roughly 650,000 players over 2.8 million games support this claim: not a single μ fell outside the range of ± 4 times the initial σ, and 99.99% of the μ values were even within ± 3 times the initial σ.
Interestingly, the TrueSkill ranking system can do all calculations using an initial uncertainty of 1, because μ and σ can then be scaled to any other range by simply multiplying them by a constant. For example, suppose all calculations are done with an initial μ of 3 and σ of 1. If one wishes to express a player’s skill as one of 50 “levels”, multiply μ and σ by 50/6 ≈ 8.3, because almost all μ values lie within ± 3 times the initial σ.
The intuition is that the greater the difference between two players’ μ values (assuming their σ values are similar), the greater the chance of the player with the higher μ value performing better in a game. This principle holds true in the TrueSkill ranking system. This does not mean that the players with the larger μ values are always expected to win, but rather that their chance of winning is higher than that of the players with the smaller μ values. The TrueSkill ranking system assumes that the performance in a single match varies around the skill of the player, and that the game outcome (the relative ranking of all players participating in a game) is determined by their performances. Thus, the skill of a player in the TrueSkill ranking system can be thought of as the average performance of the player over a large number of games. The variation of the performance around the skill is, in principle, a configurable parameter of the TrueSkill ranking system.
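The rescaling arithmetic can be sketched in a few lines of Python. The function name and its parameters are hypothetical; the numbers are the ones used in the paragraph above.

```python
def rescale_belief(mu, sigma, levels=50.0, sigma_span=6.0):
    """Scale (mu, sigma) so that a span of `sigma_span` initial sigmas covers `levels` levels."""
    factor = levels / sigma_span  # 50/6 in the example above
    return mu * factor, sigma * factor

# Internal scale from the text: initial mu = 3, initial sigma = 1.
scaled_mu, scaled_sigma = rescale_belief(3.0, 1.0)
print(round(scaled_mu, 3), round(scaled_sigma, 3))
```

Because both numbers are multiplied by the same constant, the shape of the belief is unchanged; only the units differ.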
How to Update Skills
The TrueSkill ranking system will base its update of μ and σ on the game outcome (relative ranking of all teams) only; it merely assumes that the outcome is due to some unobserved performance that varies around the skill of a player. If one is playing a point-based game and the winner beats all the other players by a factor of ten, that player’s victory will be scored no differently than if they had only won by a single point. Every match provides the system with more information about each player’s skill, usually driving σ down.
Before starting to determine the new skill beliefs of all participating players for a new game outcome, the TrueSkill ranking system assumes that the skill of each player may have changed slightly between the current and the last game played by each player. The mathematical consequence of making such an assumption is that the skill uncertainty σ will be slightly increased, the amount of which is, in principle, a configurable parameter of the TrueSkill ranking system. It is this parameter that both allows the TrueSkill system to track skill improvements of gamers over time and ensures that the skill uncertainty σ never decreases to zero (“maintaining momentum”).
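One common way to express this step, and the one assumed in the sketch below, is to add a small fixed variance to the skill variance before each game. The parameter name `tau` and the additive form are assumptions made for illustration; the text above only states that σ is slightly increased by a configurable amount.

```python
import math

def inflate_sigma(sigma, tau):
    """Skill uncertainty after allowing for a small skill change since the last game.

    Assumption: the change is modelled by adding the variance tau**2 to sigma**2.
    """
    return math.sqrt(sigma ** 2 + tau ** 2)

sigma = 0.05  # hypothetical, very confident belief
tau = 0.1     # hypothetical dynamics parameter
print(inflate_sigma(sigma, tau))
```

Under this assumption the uncertainty entering any game is at least `tau`, which is why σ can never shrink all the way to zero.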
In order to determine the new skill beliefs of all the participating players for a new game outcome, the TrueSkill ranking system uses Bayes’ Law, which says that the new skill beliefs are proportional to the probability of the observed game outcome (as a function of the player skills) multiplied by the old skill beliefs. This is done by averaging over all possible performances (weighted by their probabilities) and deriving the game outcome from the performances: The player with the highest performance is the winner; the player with the second highest performance is the first runner up, and so on. If two players’ performances are very close together, then the TrueSkill ranking system considers the outcome between these two players a draw. The larger the margin which defines a draw in a given league, the more likely a draw is to occur, according to the TrueSkill ranking system. The size of this margin is a configurable parameter of the TrueSkill ranking system and is adjusted based on the game mode. For example, a street race in Project Gotham Racing 3 can never end in a draw (thus the parameter is set to zero) whereas a Capture-the-Flag game in Perfect Dark Zero can easily end in a draw.
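The performance model with a draw margin can be illustrated with a small simulation. This is only a sketch of the generative story described above, not the inference itself; the function names, the performance spread `beta`, and all numeric values are hypothetical.

```python
import random

def two_player_outcome(perf_a, perf_b, draw_margin):
    """Derive the outcome from two performances; a draw if they differ by less than the margin."""
    diff = perf_a - perf_b
    if abs(diff) < draw_margin:
        return "draw"
    return "A wins" if diff > 0 else "B wins"

def estimate_draw_rate(skill_a, skill_b, beta, draw_margin, games=20000, seed=0):
    """Monte Carlo estimate of the draw chance when performances vary around the skills."""
    rng = random.Random(seed)
    draws = 0
    for _ in range(games):
        perf_a = rng.gauss(skill_a, beta)  # performance varies around skill
        perf_b = rng.gauss(skill_b, beta)
        if two_player_outcome(perf_a, perf_b, draw_margin) == "draw":
            draws += 1
    return draws / games

# A larger draw margin makes draws more likely; a margin of zero rules them out.
for margin in (0.0, 1.0, 2.0):
    print(margin, estimate_draw_rate(25.0, 25.0, 4.0, margin))
```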
The new skill beliefs derived by the above weighting technique are not Gaussian anymore. The TrueSkill ranking system therefore approximates each skill belief by the closest Gaussian distribution. As a result, a given player’s μ value increases for each opponent they outperformed and decreases for each opponent they lost against. The following table gives before and after values for μ and σ for each (hypothetical) participant in an 8-player match.
|Name|Outcome|Pre-Game μ|Pre-Game σ|Post-Game μ|Post-Game σ|
|---|---|---|---|---|---|
|Alice|1st|25|8.3|36.771|5.749|
|Bob|2nd|25|8.3|32.242|5.133|
|Chris|3rd|25|8.3|29.074|4.943|
|Darren|4th|25|8.3|26.322|4.874|
|Eve|5th|25|8.3|23.678|4.874|
|Fabien|6th|25|8.3|20.926|4.943|
|George|7th|25|8.3|17.758|5.133|
|Hillary|8th|25|8.3|13.229|5.749|
One can see that the σ values (the uncertainty in the skill of each player) are lower after the match, substantially more so for the players on the 4th and 5th rank (Darren and Eve). Those two players have the property that they are “bracketed” by a maximal number of players in terms of defeat: they were defeated by 3 (or 4) players and they defeated 4 (or 3) other players. In contrast, the first player (Alice) is simply known to be better than the 7 other players, which does not constrain her skill from above: she may be even better than level 36.771. This is reflected in the larger uncertainty of 5.749.
The simplest case for a TrueSkill ranking system update is a two-person match. Suppose we have players A(lice) and B(ob), with μ and σ values (μA,σA) and (μB,σB), respectively. Once the game has finished, the update algorithm determines the winner (Alice or Bob) and loser (Bob or Alice) and applies the following update equations (we disregard the possibility of a draw for the sake of simplicity here):
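As a reference sketch, the two-player update (winner and loser known, no draw observed) can be written in the following form, which is how we understand the published TrueSkill description. The symbols c, v, w, ε and β are the ones discussed in the surrounding text.

```latex
\begin{aligned}
\mu_{\text{winner}} &\leftarrow \mu_{\text{winner}} + \frac{\sigma^2_{\text{winner}}}{c}\,
    v\!\left(\frac{\mu_{\text{winner}}-\mu_{\text{loser}}}{c},\ \frac{\varepsilon}{c}\right) \\
\mu_{\text{loser}} &\leftarrow \mu_{\text{loser}} - \frac{\sigma^2_{\text{loser}}}{c}\,
    v\!\left(\frac{\mu_{\text{winner}}-\mu_{\text{loser}}}{c},\ \frac{\varepsilon}{c}\right) \\
\sigma^2_{\text{winner}} &\leftarrow \sigma^2_{\text{winner}}
    \left[1 - \frac{\sigma^2_{\text{winner}}}{c^2}\,
    w\!\left(\frac{\mu_{\text{winner}}-\mu_{\text{loser}}}{c},\ \frac{\varepsilon}{c}\right)\right] \\
\sigma^2_{\text{loser}} &\leftarrow \sigma^2_{\text{loser}}
    \left[1 - \frac{\sigma^2_{\text{loser}}}{c^2}\,
    w\!\left(\frac{\mu_{\text{winner}}-\mu_{\text{loser}}}{c},\ \frac{\varepsilon}{c}\right)\right] \\
c^2 &= 2\beta^2 + \sigma^2_{\text{winner}} + \sigma^2_{\text{loser}}
\end{aligned}
```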
In these equations, the only unknown is β², which is the variance of the performance around the skill of each player. Moreover, ε is the aforementioned draw margin, which depends on the game mode. But what do the functions v(.,.) and w(.,.) look like? Instead of giving the exact definitions, let us have a look at plots of these functions for varying values of ε/c:
There are a few observations about these update equations:
- Similarly to the Elo system, in the mean skill update equation the winner gets a multiple of v((μ_winner − μ_loser)/c, ε/c) added to the mean skill and the loser gets a multiple of v((μ_winner − μ_loser)/c, ε/c) subtracted from the mean skill. However, in contrast to Elo, the weighting factors are roughly proportional to the uncertainty of the winner/loser relative to the total sum of uncertainties (2β² is the uncertainty due to the performance variation around the skill and σ²_winner + σ²_loser is the uncertainty about their true skills). Hence, the TrueSkill ranking system’s mean skill update equation reduces to an Elo update equation only if Alice and Bob have the same uncertainty. Note that the TrueSkill ranking system’s update equation for the mean skill is thus not guaranteed to be zero sum.
- The uncertainty of both players (regardless of win/loss/draw) is going to decrease by the factor 1 − σ²_player/c² · w((μ_winner − μ_loser)/c, ε/c). Again, the player with the larger uncertainty gets the bigger decrease.
- The change in the mean skill, v((μ_winner − μ_loser)/c, ε/c), and the decrease in the uncertainty, which is governed by w((μ_winner − μ_loser)/c, ε/c), are close to zero if the game outcome was not surprising.
Win/loss: if the winner had the much bigger mean skill relative to the total uncertainty (thus μ_winner − μ_loser > ε), then a win cannot buy the winner extra mean skill points or remove any uncertainty. The opposite is true if the game outcome was surprising: if the winner had the smaller mean skill (thus μ_winner − μ_loser < ε), mean skill points proportional to μ_loser − μ_winner get added to the winner and subtracted from the loser.
Draw: if both players had similar mean skills upfront (thus |μ_winner − μ_loser| < ε), then the two players are already close enough together and no mean skill point update needs to be made; hence the uncertainty is not reduced. However, if one player was thought to be much stronger by the TrueSkill ranking system before the game (let’s say, μ_winner − μ_loser > ε), then that player’s mean skill will be decreased and the other player’s mean skill will be increased, which, in effect, brings their two mean skills closer together.
The mean skill update equations of the TrueSkill ranking system are similar to the update equations of the Elo algorithm. The key difference is that a variable K factor is used for both players, depending mainly on the ratio of the uncertainties of the two players. Hence, playing against a very certain player in the TrueSkill ranking system allows the uncertain player to move up or down in larger steps than when playing against another uncertain player.
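The two-player win/loss update can be sketched in Python as follows. The text above deliberately shows only plots of v and w; the closed forms used here, v(t, ε) = φ(t − ε)/Φ(t − ε) and w(t, ε) = v·(v + t − ε) with φ and Φ the standard normal density and distribution function, are the ones we take from the TrueSkill paper and should be read as an assumption of this sketch. All function names and the example values (μ = 25, σ = 25/3, β = 25/6) are illustrative.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def v_win(t, eps):
    """Mean correction for a win (assumed closed form, see lead-in)."""
    return normal_pdf(t - eps) / normal_cdf(t - eps)

def w_win(t, eps):
    """Variance correction for a win (assumed closed form, see lead-in)."""
    v = v_win(t, eps)
    return v * (v + t - eps)

def update_two_players(winner, loser, beta, draw_margin=0.0):
    """Return updated (mu, sigma) pairs for the winner and the loser of a two-player game."""
    (mu_w, sigma_w), (mu_l, sigma_l) = winner, loser
    c2 = 2.0 * beta ** 2 + sigma_w ** 2 + sigma_l ** 2  # total variance c^2
    c = math.sqrt(c2)
    t, eps = (mu_w - mu_l) / c, draw_margin / c
    v, w = v_win(t, eps), w_win(t, eps)
    new_winner = (mu_w + sigma_w ** 2 / c * v,
                  sigma_w * math.sqrt(1.0 - sigma_w ** 2 / c2 * w))
    new_loser = (mu_l - sigma_l ** 2 / c * v,
                 sigma_l * math.sqrt(1.0 - sigma_l ** 2 / c2 * w))
    return new_winner, new_loser

# Two new players with identical beliefs: the result is maximally informative.
new_winner, new_loser = update_two_players((25.0, 25.0 / 3.0), (25.0, 25.0 / 3.0), beta=25.0 / 6.0)
print(new_winner, new_loser)

# A strong favourite wins as expected: almost nothing changes.
print(update_two_players((35.0, 2.0), (15.0, 2.0), beta=25.0 / 6.0))
```

The sketch ignores the skill-change step between games and is not numerically hardened: for very large upsets `normal_cdf(t - eps)` can underflow, which a production implementation would need to guard against.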
But how does the TrueSkill ranking system incorporate the game outcome of a team match? In this case, the team’s skill is assumed to be the sum of the skills of the players. The algorithm determines the sum of the skills of the two teams and uses the above two equations, where (μ_winner, σ²_winner) and (μ_loser, σ²_loser) are the mean skills and skill variances of the winning and losing team, respectively.
The update equations for more than two teams cannot be written down in closed form as they require numerical integration (the above plots have been obtained by using the same numerical integration code). In this case, the TrueSkill ranking system iterates two-team update equations between all teams on neighboring ranks, that is, the 1st versus the 2nd team, the 2nd versus the 3rd team, and so on. If you want to learn more about this variant of the TrueSkill ranking algorithm, scroll down to “How to Proceed From Here”.
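A minimal sketch of the team aggregation, assuming the beliefs about the individual players are independent so that their variances simply add. The function name and the example values are hypothetical.

```python
def team_belief(players):
    """Combine per-player (mu, sigma) beliefs into a team (mean, variance).

    Assumption: player beliefs are independent, so variances add.
    """
    team_mu = sum(mu for mu, _ in players)
    team_variance = sum(sigma ** 2 for _, sigma in players)
    return team_mu, team_variance

# Hypothetical two-player team.
print(team_belief([(25.0, 3.0), (30.0, 4.0)]))
```

The resulting team mean and variance then play the roles of the winner's or loser's μ and σ² in the two-player equations.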
How to Match Players
Matchmaking is an important service provided by gaming leagues. It allows participants to find team-mates and opponents who are reasonably close to their own skill level. As a consequence, it is likely that the match will be interesting, as all participants have roughly the same chances of winning.
The TrueSkill ranking system’s skill beliefs are based upon probabilistic outcome models and thus enable players to be compared for their relative chance of drawing. The more even the skills of match participants, the more likely it is that this configuration of players will end up in a draw, and the more interesting and fun the match will be for every participant. For example, for two players A(lice) and B(ob) with skill beliefs (μA,σA) and (μB,σB), the (re-scaled) chance of drawing is given by:
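A sketch of this quantity, in the form in which it is commonly given for TrueSkill; here β² is the variance of the performance around the skill and c² is the total variance, both taken as assumptions of the sketch:

```latex
q_{\text{draw}}(\beta^2, \mu_A, \mu_B, \sigma_A, \sigma_B)
  = \sqrt{\frac{2\beta^2}{c^2}}\;
    \exp\!\left(-\frac{(\mu_A-\mu_B)^2}{2c^2}\right),
\qquad c^2 = 2\beta^2 + \sigma_A^2 + \sigma_B^2
```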
This number is always between 0 and 1 where 0 indicates the worst possible match and 1 the best possible match. Even if two players have identical μ values, uncertainty σ affects the quality of the match; if either of the σ values σA or σB is large, then the match quality criterion is significantly smaller than 1!
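The match quality criterion can be sketched in Python, assuming the commonly quoted form sqrt(2β²/c²) · exp(−(μA − μB)²/(2c²)) with c² = 2β² + σA² + σB². The function name and the example values are illustrative.

```python
import math

def draw_quality(mu_a, sigma_a, mu_b, sigma_b, beta):
    """Re-scaled chance of drawing between two players (assumed form, see lead-in)."""
    c2 = 2.0 * beta ** 2 + sigma_a ** 2 + sigma_b ** 2
    return math.sqrt(2.0 * beta ** 2 / c2) * math.exp(-((mu_a - mu_b) ** 2) / (2.0 * c2))

beta = 25.0 / 6.0  # hypothetical performance spread

# Identical means, but large uncertainty: quality is well below 1.
print(draw_quality(25.0, 25.0 / 3.0, 25.0, 25.0 / 3.0, beta))
# Identical means and small uncertainty: quality is close to 1.
print(draw_quality(25.0, 0.1, 25.0, 0.1, beta))
# Very different means: quality is close to 0.
print(draw_quality(40.0, 1.0, 10.0, 1.0, beta))
```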
How to Build a Leaderboard
Using the two parameters μ and σ, which characterize a belief in a player’s skill, the TrueSkill ranking system ranks players using the so-called conservative skill estimate = μ – k*σ. This estimate is called conservative because it is a conservative approximation of the player’s skill: it is extremely likely that the player’s actual skill is higher than the conservative estimate. The bigger the value of k, the more conservative the estimate; a common value of k is 3.
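A minimal sketch of a leaderboard built on the conservative estimate. The player beliefs below are made up for illustration; only the formula μ − k·σ comes from the text.

```python
def conservative_skill(mu, sigma, k=3.0):
    """Conservative skill estimate mu - k * sigma."""
    return mu - k * sigma

# Hypothetical beliefs: Alice has the higher mean but is far less certain.
beliefs = {"Alice": (30.0, 6.0), "Bob": (27.0, 1.0)}

leaderboard = sorted(
    ((name, conservative_skill(mu, sigma)) for name, (mu, sigma) in beliefs.items()),
    key=lambda entry: entry[1],
    reverse=True,
)
for name, estimate in leaderboard:
    print(name, estimate)
```

Bob ranks above Alice here even though his μ is lower, because the system is much more certain about his skill.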
How to Proceed From Here
If you still want to know more about the TrueSkill ranking system, you can go and check out:
- The TrueSkill paper and other publications on the publications tab of this page.
- Chapter 3 of the Model-Based Machine Learning book
- Jeff Moser’s Article about TrueSkill
- F# code for TrueSkill Through Time
- Video of an interview with Ralf Herbrich
- Video of an interview with Thore Graepel and Ralf Herbrich in their office at Cambridge
- IGN article interviewing Ralf and Thore about TrueSkill
- Slashdot stories about TrueSkill, with comments from Ralf
Frequently Asked Questions (FAQ)
Here is a list of questions that gamers have sent us. We have grouped the questions into several categories linked in the right hand column of this page. If you do not find the answer to your question, simply send an Email to trueskill.
Q: Why is the ranking system called the TrueSkill™ ranking system?
A: We decided to use this name because this is the defining feature of the ranking system: it quickly identifies a gamer’s true skill. The primary purpose of the TrueSkill system is to minimize the number of games necessary to find out a gamer’s skill in order to optimize matchmaking.
Q: How did you compute the average number of games until convergence for the TrueSkill ranking system?
A: One way to think about the TrueSkill ranking system is that it attempts to identify the correct ordering of n players. If each ordering is equally likely, a computer would need log2(n!) or roughly n*log2(n) many bits of information to do this. Now, assume that 2 players play a Head-to-Head game. Assuming the best player always wins, with no draws, the game outcome provides 1 bit of information (which of the two players was the winner). (This is a stronger assumption than TrueSkill makes, therefore it will give us a bound on what is possible with TrueSkill. This assumption is also stronger than Elo and Glicko, which means we will get a bound on those algorithms as well.) Under this strong assumption, the system needs n*log2(n) many Head-to-Head games. Since each of these games requires 2 players, the system needs 2*log2(n) games per player. Note that the particular Head-to-Head games have to be chosen such that they, in fact, do carry one bit of information. Interestingly, every match-made game where the game outcome is not predictable ahead of time ensures that the game is informative! In general, with t teams of p players in each team, one game outcome provides log2(t!) bits but it needs t*p players per game so in the most general case, the system needs t*p*log2(n)/log2(t!) many games per player.
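The idealized bound t·p·log2(n)/log2(t!) can be evaluated directly. In the sketch below, the function name and the population size n = 64 are hypothetical choices for illustration.

```python
import math

def games_per_player(teams, players_per_team, population):
    """Idealized number of games per player: t * p * log2(n) / log2(t!)."""
    bits_needed_per_player = math.log2(population)       # about log2(n!) / n
    bits_per_game = math.log2(math.factorial(teams))     # one ordering of t teams
    return teams * players_per_team * bits_needed_per_player / bits_per_game

n = 64  # hypothetical population size
for teams, players in [(2, 1), (4, 1), (8, 1), (2, 4)]:
    print(teams, players, round(games_per_player(teams, players, n), 2))
```

More teams per game raise the number of bits each outcome carries, while more players per team spread those bits over more people.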
Of course, this calculation is idealized. There are several factors that increase the number of games necessary:
- Each game is not providing 1 bit of information because the best player does not always win. A player’s performance in a particular game varies around their average skill and the bigger this variation, the more likely it is that the less skilled player wins the game. This can eventually lead to the loss of 75% of the information per game!
- Between games, the TrueSkill ranking system assumes that the skill of the players may have slightly changed. In other words, the rank of each player can have changed and there are extra bits necessary to encode the change in true skill according to learning effects.
But, there are also several factors that decrease the number of games necessary:
- Each game between two teams has three possible outcomes: win, lose, draw. Knowing which of the three outcomes has been realized after a game thus provides more than 1 bit of information. On the left hand side is a plot of the number of bits provided as a function of the chance of drawing. Obviously, if the chance of drawing is zero we have 1 bit of information. But, if draw is the only possible outcome (chance of drawing = 100%) then no information is provided resulting in 0 bits of information.
- Although the ranks of each player are unknown, there is usually not an equal chance that a player is of level 50 or level 25. In practice, the distribution of skills usually follows a bell shaped curve (Gaussian). Thus, the number of bits to recover a player’s skill rating can be less than the number of bits to recover the total ordering of players.
Overall, we observed in our experiments that the sum of these effects leads to an increase by a factor of 2 – 3 in the numbers of games necessary per gamer.
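The relationship between the chance of drawing and the information per two-team game can be sketched as the entropy of the three outcomes. The sketch assumes that, apart from draws, a win and a loss are equally likely, which is the well-matched case; the function name is hypothetical.

```python
import math

def outcome_bits(p_draw):
    """Entropy in bits of (draw, win, loss), assuming win and loss are equally likely."""
    probabilities = [p_draw, (1.0 - p_draw) / 2.0, (1.0 - p_draw) / 2.0]
    return sum(-p * math.log2(p) for p in probabilities if p > 0.0)

for p_draw in (0.0, 0.1, 1.0 / 3.0, 0.5, 1.0):
    print(round(p_draw, 3), round(outcome_bits(p_draw), 3))
```

Under this assumption the information peaks at log2(3), roughly 1.6 bits, when all three outcomes are equally likely, and falls to 0 bits when a draw is certain.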
Q: What is the difference between skill and performance?
A: The TrueSkill ranking system implicitly uses a performance model that represents your (hypothetical) score in a particular game. Skill is the average performance. The TrueSkill ranking system maintains a belief in your skill and assumes that your performance in a particular game varies around your skill.
Q: The default TrueSkill of a new player is 25, right?
A: That’s not fully correct. The TrueSkill value that is displayed in the leaderboard is the conservative estimate of a player’s skill, computed from two hidden parameters that are used to track a player’s skill: the mean skill μ and the skill uncertainty σ. The TrueSkill value is then μ-3*σ. What is correct is that a new player is assigned a mean skill of μ=25 and a skill uncertainty of σ=8.333. Thus, the TrueSkill of a new player is 25-3*8.333 = 0. Note that these two choices for μ and σ effectively mean that a new player’s skill can be anywhere from 0 to 50, representing a state of complete uncertainty about their skill.
Q: How many games do I have to win before I go up one level?
A: This depends a lot on how many games you have already played, how many games your opposition have already played and what type of games you play. It is a strength of the TrueSkill ranking system to move you up very quickly early on but to reduce the step-size in the updates after a series of consistent games. In general, the more people per team, the longer it takes to go up or down one level. But the more teams per game, the faster you can go up or down. Here is a list of game modes and number of wins necessary before you go up a level (assuming you have already played a fair number of games; otherwise you usually go up one level in one game).
|Game Mode|Number of Games per Gamer|
|---|---|
|8 Players Free-For-All|3|
|4 Players Free-For-All|4|
|2 Players Free-For-All|7|
|4 Teams/2 Players per Team|5|
|2 Teams/4 Players per Team|10|
Q: How many games do I have to lose before I go down one level?
A: These numbers exactly equal the numbers given in the last question. The TrueSkill ranking system has no preferred direction of changing the skill belief.
Q: I have been playing a lot of unranked training games and I think I am now a much more skilled player. Will the TrueSkill ranking system be able to identify my new, higher skills? If so, how many games do I have to play before the TrueSkill ranking system knows my new skill?
A: The TrueSkill ranking system assumes a small skill change between any two consecutive games in a game mode, so it is able to identify your new, higher skill. But if your skill has completely changed (you became the best player in the world from previously being the worst player in the world), then you would need to play a large number of games. We designed the system such that it would need between 50 and 100 games to track a substantial skill increase or decrease.
Q: If I understand the TrueSkill update formula correctly then the change in μ is largest for the first few games and decreases over time. Thus, my first few games are most important; if I lose these games, it will take the TrueSkill much longer to converge to my skill. Right?
A: Not exactly right. It is correct that the change in μ gets smaller and smaller with every game played, but this happens regardless of whether you win or lose those games. However, TrueSkill always takes more recent game outcomes more into account than older game outcomes. Hence, when playing against a set of players of the same skill multiple times, a late win counts more than an early win.
Q: What other ranking systems are there?
A: It is impossible to enumerate all available ranking systems here. But, in order to illustrate the wide range of systems out there, let us give a few examples:
- Elo (used by the US Chess Federation and the World Chess Federation). Also see “A Comprehensive Guide to Chess Ratings”.
- Glicko (used by the Free Internet Chess Server).
- Halo 2 Ranking System.
- Go Ranking
- Tennis rankings (used by the ATP).
- Kudos Ranking System (used in Project Gotham Racing).
There is an interesting article Collective Choice: Competitive Rating Systems by Christopher Allen covering some of the above ranking systems.
Q: I am a chess player and I have played online chess at the Free Internet Chess Server. They use a system called Glicko which uses rating deviations. What is the relation between the TrueSkill ranking system and the Glicko ranking system?
A: The Glicko system was developed by Mark E. Glickman, chairman of the US Chess Federation (USCF) ratings committee. To the best of our knowledge, Glicko was the first Bayesian ranking system. Similarly to the TrueSkill ranking system, the Glicko system uses a Gaussian belief over a player’s skill which can be represented by two numbers: The mean skill and the variation of the skill (called rating deviation in the context of Glicko). There are a few differences between the TrueSkill ranking system and Glicko:
- The Glicko system (deliberately) does not model draws but it makes an update as the average of a win and a loss (per player). In the TrueSkill ranking system, draws are modelled by assuming that the performance difference in a particular game is small. Hence, the chance of drawing only depends on the difference of the two players’ playing strengths. However, empirical findings in the game of chess show that draws are more likely between professional players than between beginners. Hence, the chance of drawing also seems to depend on the skill level.
- In the Glicko system, the uncertainty in a player’s skill grows linearly with time not played. In the TrueSkill ranking system, it grows by a constant amount between any two consecutive games. However, this could be changed in the TrueSkill ranking system.
- The Glicko system uses a different performance distribution known as the logistic distribution; the TrueSkill ranking system uses a Gaussian distribution (see picture on the right). This results in two different update algorithms for two player matches which make the actual update equations look different. However, conceptually both update algorithms perform very similarly.
So, what is the difference to the Glicko system? Glicko was developed as an extension of Elo and was thus naturally limited to two-player matches which end in either a win or a loss. Glicko cannot update skill levels of players if they compete in multi-player events or in teams. The logistic model would make it computationally expensive to deal with team and multi-player games. Moreover, chess is usually played in pre-set tournaments and thus matching the right opponents was not considered a relevant problem in Glicko. In contrast, the TrueSkill ranking system offers a way to measure the quality of a match between any set of players.
Q: I am always playing together in the same team with my friend JoeDoe. Will the TrueSkill ranking system be able to differentiate between us two in terms of skills? In other words, is the TrueSkill ranking system capable of finding that I am the more skilled player of us two?
A: If both you and your friend only play ranked team games together, then the TrueSkill ranking system cannot distinguish between you two; it always compares the teams’ skills (the sums of the players’ skills in the teams) and ‘distributes’ the gain/loss in proportion to the individual players’ uncertainties (see detailed description). But note: if your friend also plays team games with anyone other than you, then the TrueSkill ranking system will be able to identify the more skilled player of you two. Also, if both of you always only play together, you might consider forming a clan.
Q: Why does it take so many more games until convergence if I play a team game as opposed to a Free-for-All game?
A: The problem is that knowing only which of two teams wins, or whether the two teams draw, carries very little information about an individual player’s skill. This is effectively only up to 1.6 bits of ‘information’ that needs to be ‘shared’ between all players participating in the game. More specifically, consider these two scenarios:
- Alice, Bob, Chris and Darren play a 4-player-Free-for-All game and Alice wins against Bob wins against Chris wins against Darren. This game outcome provides a lot of information: it’s fair to say that probably Alice is better than Bob, Alice is better than Chris, Alice is better than Darren, Bob is better than Chris, etc.
- Alice and Bob play against Chris and Darren in a 2-Teams-2-Players-per-Team game and Alice and Bob win against Chris and Darren. Can we still say that this means that Alice is better than Chris and Alice is better than Darren? No! All we can confidently assert is that Alice and Bob together are better than Chris and Darren together. So, the team game outcome only provides knowledge about an individual’s skill in conjunction with all the other team members.
Q: How will a team killer be ranked in the TrueSkill ranking system?
A: In the TrueSkill ranking system, the team skill is the sum of the skills of all players in the team. The TrueSkill ranking system has the potential to assign a negative skill to a player; if such players are added to a team, then the skill of the team goes down (because a team killer reduces the chance to score against the other team and might even inflict negative points by suicide). Fortunately, the TrueSkill ranking system’s matchmaking procedure will eventually ensure that team killers only play each other. And this can only be a good thing.
Q: I am playing a team game and all the players in my team drop out of the game. Of course, I lose the game. Will I lose as many skill points as all the people who left me standing in the rain?
A: Unfortunately, yes. All alternative options are possible exploits for cheating:
- If the TrueSkill ranking system does not count the game at all then the losing team can always ensure not to lose points by dropping out early (entirely).
- If the TrueSkill ranking system only uses the team configurations at the end of the game then the players that dropped would not be penalized and the remaining player could be arbitrarily boosted (that is, shortly before the end of the game all but one player drop from a team; to the update equation it would now seem that a single player has won against a team of, say, 4 players, and it would apply a massive positive update).
- If the TrueSkill ranking system introduced an arbitrary lowest rank into which every player falls who drops before the end of the game, then, again, the remaining player(s) in a team could be arbitrarily boosted (they won against the losing team and all the players that dropped). This approach would penalize the players that drop, though.
But: Players who drop regularly from a team would eventually be identified by the TrueSkill ranking system as having a negative impact on the team skill and will eventually be matched with other players that have a negative team impact. So, you should not see this happening too often if you are a player of average skill.
Q: You are saying that the TrueSkill ranking system assumes that the skill of a team is the sum of the skills of its players. I think this model is not appropriate: I am usually playing much better with people from my friends list rather than with random players. Will this assumption lead to incorrect rankings?
A: The assumption that the team skill is the sum of the skills of its players is exactly that: an assumption. The TrueSkill ranking system will use the assumption to adapt the skill points of individual players such that the team outcome can be best predicted based on the additive assumption of the skills. Provided that you and your friends also play team games with other players now and then, the TrueSkill ranking system will assign you a skill belief that is somewhere between the skill when you are playing with your friends and the skill when you are playing as an individual. So, in the worst case, every other game is not with your friends: then you are ranked slightly too high when you play with random team players and slightly too low when you play with your friends. But if you mostly play with your friends only, the system will identify your skill correctly for most of your games.
Q: Why can two players in a party not be in two different teams?
A: This would open the possibility to cheat. You could, for example, arrange to play each other with your friend always forfeiting the game. This would not be enough to boost you to the top of the league, but it would increase your skill level artificially. The TrueSkill ranking system always assumes that the game outcome is a result of your skills (in the game) and not of your skills to cheat.
Q: Does the TrueSkill ranking system reward individual players in a team game?
A: The only information the TrueSkill ranking system will process is:
- Which team won?
- Who were the members of the participating teams?
The TrueSkill ranking system takes neither the underlying exact scores (flag captures, kills, time etc.) for each team into account nor which particular team member performed how well. As a consequence, the only way players can influence their skill updates is by promoting the probability that their team wins. Hence, “ball bitches”, “hill whores”, “flag fruits”, “territory twits”, and “bomb bastards” will hurt their individual TrueSkill ranks unless what they are doing helps their team. Obviously, it is difficult to update individual players’ skills from team results only. To understand the difficulty and the solution consider the following analogy: Suppose you have four objects (players), each having an unknown weight (skill). Suppose further that you have a balance scale (game) to measure weight (skill) but are always only allowed to put two objects on each side of the balance. If you always combine the same pair of objects, the only information you can get is which pair of objects is heavier. But if you recombine the players into different pairs you can find out about their individual skills. As a consequence, the TrueSkill ranking system will be able to find out about individual players’ skills from team outcomes given that players not only play in one and the same team all the time but in varying team combinations.
Q: I bought a 360 for my son for Xmas, and both of us have become seriously addicted to Halo 3 on Xbox Live, particularly Team Slayer matches. Basing the skill change only on the team performance yields pretty counterintuitive results. For example, I often play a string of Team Slayer games where I am MVP (Most Valuable Player), which means I outscore everyone. But if my team loses those games, I gain no skill. Then, I can play poorly, but if my team wins I gain skill. This lack of feedback from individual performance is frustrating and makes your skill level beholden to the performance of the rest of your team, which is usually not under your control unless you explicitly team up with friends.
A: Great that you are enjoying your 360 and Halo 3.
The question you are asking has indeed been raised by quite a few people and we had many discussions about it. However, we always return to our point of view that in a team game the only way to assess someone’s skill towards the team objective is to consider the team objective only. Any auxiliary measurements such as number of flags carried, number of kills, kill-death spread, etc., all have the problem that they can be exploited, thereby compromising the team objective and hence the spirit of the game. If flag carries matter, players will rush to the flag rather than defend their teammates or their own flag. Some may even kill the current flag carrier of their own team to get the flag. If it is number of kills, people will mindlessly enter combat to maximize that metric. If it is K-D spread they may hold back at a time when they could have saved a teammate. Whichever metric you take, there will be people trying to optimize their score under that metric and this will lead to distortions.
Another problem is, of course, that we would like to use the skill ratings for matchmaking. The current system essentially aims at a 50:50 win-loss ratio for each team. It is unclear how individual skill ratings based on individual achievements would change the calibration of such a system.
Of course, one might use a weighted combination of team and individual measurements. However, whenever individual measurements enter the equation there will be trouble, maybe less trouble if the weight is less, but that is not really good enough.
Q: If the skill of every player is represented by two numbers, how is it possible to rank players in a leaderboard?
A: The TrueSkill ranking system uses the so-called conservative skill estimate, which is the 1% quantile of the belief distribution: it is extremely likely (to be precise, with a belief of 99%) that the player’s actual skill is higher than the conservative estimate. See the detailed description.
Q: Who is the better player: Someone with a large μ and a large σ or a small μ and a small σ?
A: The answer to this question is not straightforward. For someone with a large σ the TrueSkill ranking system is still uncertain about the skill. Thus, the player with the large μ and a large σ may be better. The best way to find out is to ask the player with the large σ to play more.
Q: I am a level 30 player with a σ of 5 and my friend is a level 28 player with a σ of 2. Why does the TrueSkill ranking system claim that my friend is better? At the end of the day, my level is higher.
A: That is correct. But, you have not played enough games yet for the TrueSkill ranking system to confidently know that you are better; so conservatively speaking, your level is probably 15 = 30 – 3 * 5 whereas your friend’s conservative estimate is level 22 = 28 – 3 * 2.
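The arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch of the "mean minus three times uncertainty" rule quoted in the answer; the function and variable names are made up for illustration.

```python
def conservative_estimate(mu: float, sigma: float, k: float = 3.0) -> float:
    """Conservative skill estimate mu - k * sigma (k = 3 as in the answer above)."""
    return mu - k * sigma


you = conservative_estimate(mu=30, sigma=5)      # 30 - 3 * 5
friend = conservative_estimate(mu=28, sigma=2)   # 28 - 3 * 2

# A leaderboard sorted by the conservative estimate, best player first.
leaderboard = sorted(
    {"you": you, "friend": friend}.items(),
    key=lambda entry: entry[1],
    reverse=True,
)
print(leaderboard)
```

Sorting by the conservative estimate puts the friend first even though your mean level is higher, because your larger σ is subtracted three times.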
Q: A couple of days ago I managed to get into the top 350 (in PGR 3 online career) after winning probably 25 of 30 races and that brought me up about 120 spots. Now tonight I have had 5 races: 2 wins, 1 second, a 5th (got spun twice) and a 4th on one of the Vegas tracks. Because of this pathetic record (that is how the TrueSkill formula sees it) I have gone down 115 spots. How is it fair that 2 bad races basically dropped me down almost as many points as 25 wins out of 30 races took to gain all those places?
A: There are two reasons that can cause this problem (although the latter is probably more responsible for this “phenomenon”):
- Ranks displayed in PGR 3 are the position in the total leaderboard. That means, if you are rank 659 then there are 658 gamers with a higher skill (estimate) than you. This number can vary without a gamer actually having to play a game; for example, if some (legitimate) “Gotham star” gets to the top 100 players in the world whilst you are not even racing, then your rank goes down to “660” without you doing anything wrong. This “rank” can never be guaranteed to be “stable”.
- Roughly speaking, the change in your skill estimate depends on how “surprising” the game outcome is. If you happen to be (among) the player(s) with the highest skill in each of the games you played, then the 25 wins were not surprising and hence none of these games provided a significant increase in your skill estimate. However, if coming 5th was a rather unlikely outcome in the game where you actually did come fifth, then your skill needs to be adapted significantly. Another way of seeing the issue is that TrueSkill does take the strength of the opposition into account. One cannot simply compute the win ratio and equate this with skill; if all wins happen in the (sometimes) unavoidable unbalanced games then a win is not really testament to your (even) high(er) skill!
Q: Well there must be a bug in the system because I jumped into a 4 person race with 3 lower ranked individuals, won the race and my position in the league I was in dropped about 50 spots.
A: So, what is going on here? Between any two games of a gamer, the TrueSkill ranking system assumes that the true skill of a gamer, that is, μ, can have changed slightly either up or down; this property is what allows the ranking system to adapt to a change in the skill of a gamer. Technically, this is achieved by a small increase in the σ of each participating gamer before the game outcome is incorporated. Usually, a game outcome provides enough information to reduce this increased uncertainty. But in a badly matched game (such as the one described above) this is not the case; nothing can be learned about the winner from the game outcome (because it was already known before the game that the winner was ranked significantly higher than the other gamers he has beaten). So, conservatively speaking, the winner’s skill might have slightly decreased! Note that this can only happen if the gamer is not matched correctly so that he can “prove” to the TrueSkill ranking system that his skill has not changed.
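The effect can be illustrated with a small sketch. It assumes that the pre-game increase adds a fixed amount τ² to the variance σ², that the lopsided win then carries no information at all (so μ and the inflated σ are left untouched), and that the displayed rank follows the conservative estimate μ – 3σ. The value of τ and the player's numbers are hypothetical and not taken from this FAQ.

```python
import math


def inflate_sigma(sigma: float, tau: float) -> float:
    """Pre-game uncertainty increase: the variance grows by tau^2 (assumed additive form)."""
    return math.sqrt(sigma ** 2 + tau ** 2)


mu, sigma = 35.0, 1.0     # hypothetical, well-established strong player
tau = 25.0 / 300.0        # assumed dynamics factor, not given in this FAQ

before = mu - 3 * sigma

sigma_after = inflate_sigma(sigma, tau)
# Assumption: the badly matched win taught the system nothing,
# so neither mu nor the inflated sigma is corrected by the outcome.
after = mu - 3 * sigma_after

print(f"conservative estimate before: {before:.3f}")
print(f"conservative estimate after:  {after:.3f}")
```

The drop is tiny, but on a dense leaderboard a tiny change in the conservative estimate can still correspond to many positions.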
Q: In Dawn of War II, I won a game and went down in TrueSkill. What happened?
A: Usually your TrueSkill rises after a win – however, in Dawn of War II the displayed TrueSkill lags behind by one game. (Thanks to CheeseNought for reporting the problem.)
Q: Is it at all possible to view the TrueSkill rating of an individual Xbox Live Gamertag? Is there a website that I can go to, to see the ratings of people’s gamertags?
A: This is up to the game developer. Some games have a leaderboard function or a website where you can find your TrueSkill, but some do not.
Q: My favorite game mode is Online Career in Project Gotham Racing 3. How can the TrueSkill ranking system find players of similar skill based on the chance of drawing when it is impossible to draw with someone else in a racing game?
A: When the TrueSkill ranking system computes the match quality of other players, it computes the (hypothetical) probability of draw between you and every other player relative to the probability of drawing between two equally skilled players; this ensures that the ratio is always between 0 and 1. This number would depend on the draw margin and thus the match-quality criterion of the TrueSkill ranking system is actually computing this ratio in the limit of a draw margin of zero! This gives the match quality equation specified in the detailed description.
In other words: The TrueSkill ranking system is not taking into account the chance of drawing for a given game mode! Thus, it does not matter that your game mode has zero chance of drawing.
Q: I am playing my first ranked game in a game mode. Will I be matched more likely with another player new to the game mode or with someone else?
A: When you play your first ranked game in a game mode, the TrueSkill ranking system assigns you an initial skill level μ with a maximal variance σ² of skills; it’s your first game so the ranking system should reflect its lack of knowledge. Now, the TrueSkill ranking matchmaking criterion will prefer to match you against someone with the same mean skill level μ but a small variance σ². Thus, if available, you will be matched with another player who may be in the middle of the leaderboard but with a much smaller σ²: a player of established average skill.
Why is this better than matching you with someone else new to the game? Well, this other player may, in fact, be one of the most skilled players (who just happened not to have played the game mode yet) whereas you really are a beginner. Then, you two are (up to) 50 skill levels apart. Matching you with someone who is an established average player guarantees that your skill level gap is never bigger than 25 levels.
Q: I have been playing my first game in PGR3 online career last night. I was matched with a couple of Level 22/Contender players. That does not seem right, what’s going on here?
A: The rank that is displayed in the PGR 3 online career lobby is “the conservative skill estimate”; with a chance of 99% your skill is larger than this number. More specifically, the rank is computed by “mean skill – 3 * uncertainty” but, as far as TrueSkill is concerned, your skill is anywhere between “mean – 3 * uncertainty” and “mean + 3 * uncertainty”. So, when you are displayed as “Unranked”, your mean skill is really 25 and the uncertainty is so large that your skill can be anywhere between 0 and 50. However, in matchmaking you get matched with people based on your “mean skill”. Hence you will see large gaps in the matchmaking lobby. That does not mean you are matched badly, though. You are matched as well as it is possible given the information that TrueSkill has about you and in light of all the lobbies that are available to join when you request it.
Q: In PGR3, I am having a hard time understanding why I (novice level 12) consistently get matched with players in mid to high 20’s. Yesterday I had to race a 29, 22, and a 17. And that is just the one example. It seems that the range for matching part is a little too liberal.
A: There are several effects that can lead to your observation:
- There are not enough players around for the TrueSkill system to choose from at the moment when you are searching for a new game.
- If you have not played enough games (that is, the uncertainty that TrueSkill has in your skill is still large) then your conservative skill estimate as shown by PGR3 is exactly this: a conservative skill estimate. In other words, your displayed level 12 could be anything from, let’s say, level 12 to level 28.
- If your skill is very high or very low, there are usually far fewer players in this skill range (see answer to next question). However, this is probably not the case for level 12.
One last note: Rest assured that once there are enough active Live players around in your preferred game mode, the matchmaking will become much tighter. Also, the skill learning is not affected by a bad match; in fact, if you are matched with much stronger players there is nothing to lose with respect to your TrueSkill skill; the best thing that can happen is that you pull off a win and move up the skill leaderboard by a large amount.
Q: I am among the top 100 players in the world in my game mode. Why do I usually wait longer in the matchmaking lobby than my friend JoeDoe who is an average skill player?
A: This has an easy explanation: There are simply not enough players of your caliber available at any time! Remember that Xbox Live is a worldwide service, so there are perhaps only 1000 players that would be a perfect match for you, and they live in 24 different time zones. The only alternative is to match you with players who are much less skilled and sacrifice match quality for waiting time. And this would ruin both their and your experience on Xbox Live. You see: being a top player has its price!
For example, on the right hand side you see a plot of the distribution of the mean skill levels μ for a popular Xbox Live game. As you can see, there are very few players of skill level 40 and above and 5 and below so the chance that an arbitrary other player online at the moment is a good match is much smaller. This results in the longer waiting time.
Q: I am a player with a mean skill of 30 and a skill variance σ² of 4 but my friend is only a player with a mean skill of 10 and a skill variance σ² of 2. If we play as a party, what people will we be matched with?
A: If you play as a party, the mean skill of every party member will be the average of all the mean skills and the skill variance is the average of the skill variances of all party members. Thus, for the purpose of matchmaking only, your mean skill will be 20 and your skill variance will be 3; the same is true for your friend. Hence, together you make a team of skill 40 = 2 * 20 with a joint skill variance of 6 = 4 + 2. But, when you finish a game the update will use your actual mean skill and skill variance; thus, your mean skill will grow/shrink faster (why?) depending on the outcome of the game.
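The party numbers above can be checked with a short sketch. It only mirrors the averaging and summing described in the answer; the dictionary layout and variable names are illustrative assumptions.

```python
from statistics import mean

# (mean skill, skill variance) of each party member, taken from the question above.
party = {"you": (30.0, 4.0), "friend": (10.0, 2.0)}

# For matchmaking only, every member is treated as the party average.
match_mu = mean(mu for mu, _ in party.values())      # (30 + 10) / 2
match_var = mean(var for _, var in party.values())   # (4 + 2) / 2

# Team skill and joint variance are sums over the (averaged) members.
team_mu = match_mu * len(party)
team_var = match_var * len(party)

print(f"per member for matchmaking: mean {match_mu}, variance {match_var}")
print(f"team: skill {team_mu}, joint variance {team_var}")
```

Averaging changes how each individual looks to the matchmaker but leaves the team totals unchanged, since the sum of the averages equals the sum of the original values.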
Q: I keep getting matched with people of higher TrueSkill and losing badly, which is very frustrating. Why does this happen?
A: There are several effects that may be at work here:
- There is an inherent conflict between waiting time for a match and match quality: in a real-time system, the longer we wait during matchmaking, the higher the chance of finding a closely matched player.
- The TrueSkill matchmaking support that is currently available for games on Xbox Live is based on a host-client model: During the matchmaking process, a player decides to either host a session (“host”) or search & possibly join a session (“client”). Note that this decision is either put in the hands of gamers (such as in Call of Duty 2) or automatically done behind the scenes (such as in Halo 3). TrueSkill comes into play during the search of a session insofar as the list of returned hosts is always sorted in decreasing order of the match quality. However, no filtering is done on the match quality and no constraints are made to pick the session at the top of the list. Thus, in off-peak hours or in situations where there are not enough host sessions available, the match quality can suffer and it may happen that you are getting matched with people of much higher/lower TrueSkill.
- The match quality is effectively measuring how far players are apart in terms of their mean skill level μ – however, the TrueSkill that gets displayed during matchmaking is the conservative skill estimate μ – 3*σ. Thus, the mismatch in terms of conservative skill estimates might look a lot worse than the actual mismatch. Here is an example:
- A game between a new player and an established level 25 player: The match quality is 57.6% though the displayed skill difference is a staggering 23 levels!
- A game between a new player and an established bad player: The match quality is 5.7% though the displayed skill difference is only 1 level.
- Note also that the system can learn a lot more about the skill of a new player in the first setting than in the second (both in terms of the mean skill level μ and skill uncertainty σ).
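The two percentages in the example can be reproduced with the usual two-player form of the match quality, sketched below. The parameter values are assumptions, not figures stated in this FAQ: β = 25/6, a new player with μ = 25 and σ = 25/3, and established players with σ = 2/3 (with μ = 25 and μ = 3 respectively), chosen so that the displayed levels μ – 3σ come out as 0, 23 and 1.

```python
import math

BETA = 25.0 / 6.0  # assumed performance-variation parameter, not stated in this FAQ


def match_quality(mu1, sigma1, mu2, sigma2, beta=BETA):
    """Two-player match quality: draw probability relative to an even match, at zero draw margin."""
    c2 = 2 * beta ** 2 + sigma1 ** 2 + sigma2 ** 2
    return math.sqrt(2 * beta ** 2 / c2) * math.exp(-((mu1 - mu2) ** 2) / (2 * c2))


def displayed_level(mu, sigma):
    """Conservative skill estimate shown during matchmaking."""
    return mu - 3 * sigma


new_player = (25.0, 25.0 / 3.0)          # assumed initial belief, displayed level 0
established_average = (25.0, 2.0 / 3.0)  # hypothetical, displayed level 23
established_bad = (3.0, 2.0 / 3.0)       # hypothetical, displayed level 1

q1 = match_quality(*new_player, *established_average)
q2 = match_quality(*new_player, *established_bad)

print(f"vs established level 25 player: quality {q1:.1%}")
print(f"vs established bad player:      quality {q2:.1%}")
```

With these assumed values the two qualities round to 57.6% and 5.7%, even though the displayed gap is 23 levels in the first case and only 1 level in the second.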
Q: Can the TrueSkill ranking system cope with handicapped games?
A: No. Among other things, this is something we are working on right now. The TrueSkill ranking system assumes that two equally skilled teams have the same chance of winning.
Q: Can the TrueSkill ranking system identify cheaters?
A: No. The only thing the TrueSkill ranking system can do is to track the plausibility of game outcomes. If you happen to play a lot of games whose outcomes are not very plausible, then this could raise concerns about you. But it could also mean that you are a very adaptive player whose skill is growing faster than the TrueSkill ranking system anticipated. And the last thing you want to be called then is a cheater!
Q: I am interested to study ranking systems. Do you have any real-world data for a comparative analysis?
Q: Does Microsoft provide software for calculating the TrueSkill updates?
A: Microsoft has open-sourced the Infer.NET library which can perform TrueSkill updates, but it requires some coding. The sample code for the Model-Based Machine Learning book uses Infer.NET to calculate TrueSkill updates.
Q: When there are more than 2 teams, and all teams start with the same skill distribution, teams that draw do not get identical skill estimates. Instead, the estimates are slightly different. Is this expected?
A: Yes. This small approximation is used to reduce computation.