The Insignificance of Pub Stats, Part 2

Last time in The Insignificance of Pub Stats, Part 1 we looked at the danger of letting poorly-designed public statistics change the nature and content of matchmaking flaming.  Today we’re going to examine why (Kills+Assists)/Deaths or KDA is a weak and potentially misleading statistic.  We’ll also take a look at the ways we could strive to make better match statistics than the ones we have now.

(For the sake of reference, this is still a response to Statistical Significance: The Value of Pub Stats)

Let’s begin with a quote that reveals one of the major blind spots in the author’s position.

Understanding that outliers exist, and that the numbers must be evaluated relative to matchmaking, we should see that a player’s skill generally does correlate with their pub stats.

This is trivially true, particularly for GPM as I’ve shown in the past, but for the moment let’s focus on KDA.  Higher skilled players will tend to have higher average KDA.  They’ll also tend to win more often.  And winning teams also tend to have higher KDAs than losing teams.  So given all this, how can we be certain that a higher KDA is a cause of winning more often and not merely the result?

The laning phase in Dota is all about getting an XPM and/or GPM advantage and then leveraging that advantage into a win.  Killing creeps is the most reliable way to get a GPM advantage.  Killing enemy players (or at least forcing them to leave the lane) is a little more complicated because it can translate to either an XPM or GPM advantage, and can often be as much about wasting an important enemy’s time as it is about the experience and gold you get from the kill.  The important thing is any particular kill in a game could be the cause of an XPM advantage or it could be the result of a GPM advantage.  A standard match statistics page can’t distinguish between the two, and that’s hugely problematic for any position that wants to pretend that KDA is a reliable statistic.

To put it another way, the value of a kill varies a lot depending on the circumstances surrounding that kill.  To use some really extreme examples, creating five kills during the laning phase contributes infinitely more to winning the game than getting five kills fountain farming after super creeps come out.  In terms of sheer exp and gold rewards killing the enemy mid is way more valuable than killing the enemy support, and that’s without even factoring in the experience and gold you’re effectively denying while mid is empty.  With some statistics you can just collect enough data and then handwave the variance away through sample size, but this only works if the variance behaves similarly to a normal distribution where extreme outcomes on both ends are equally unlikely.  I see no reason to believe that average value per kill, however we end up defining it, would end up being normally distributed in public matchmaking games.
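To make the Kill Value point concrete, here is a minimal sketch in Python. The phase labels and the weights in `PHASE_WEIGHTS` are placeholders I made up for illustration, not measured values, and a real definition of kill value would need far more context than the game phase alone.

```python
# Hypothetical weights by game phase. The numbers are placeholders chosen
# only to illustrate the argument; they are not derived from any data.
PHASE_WEIGHTS = {"laning": 1.0, "midgame": 0.6, "fountain_farming": 0.0}


def raw_kills(kill_phases):
    """What a standard stats page sees: every kill counts the same."""
    return len(kill_phases)


def weighted_kills(kill_phases, weights=PHASE_WEIGHTS):
    """Sum of per-kill weights, so context changes what a kill is worth."""
    return sum(weights[phase] for phase in kill_phases)


laner = ["laning"] * 5                      # five kills during the laning phase
fountain_farmer = ["fountain_farming"] * 5  # five kills after super creeps

for name, kills in (("laner", laner), ("fountain farmer", fountain_farmer)):
    print(name, raw_kills(kills), weighted_kills(kills))
```

Both players show the same raw kill count, and only the weighted total tells them apart. The hard part is choosing the weights, which is exactly the information a standard match statistics page does not give us.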

But let’s put the Kill Value argument aside, because there’s an even bigger flaw to KDA that has the potential to make the statistic actively toxic: if we trust KDA as a skill evaluator then the single best way to ‘get better at Dota’  is to minimize your deaths at the expense of everything else.

The most harmful advice I see people give new players is to focus on not dying above all.  That’s basically saying, “Don’t try to accomplish anything because you might screw up.  Just try not to get in the way and let your teammates carry you.”  That’s not a way to get better at Dota.  It might be a way to win Dota more often without actually improving at anything, but why the hell would we ever want to encourage that attitude?

And the D in KDA basically takes that attitude and supercharges it.  Let’s switch things up.  What’s generally the best response to the opposing team 5-man pushing your tier 1 tower?  Unless you can get all 5 of your team there and think you can catch them overcommitting, the best response is to force a trade and push one of their towers.  Playing a perfect game of Dota against a sufficiently competent team is impossible.  What you seek to do instead is force them into unfavorable trades, and that means giving something up.

The Statistical Significance article makes a point that KDA is a fine metric for personal evaluation if used on a hero-by-hero basis.  Incidentally this is kinda dumb because very few players actually play a single hero enough under similar conditions (recent, same patch period, predominantly solo queue) to form a sufficient sample, but whatever.  Say you’re using KDA to evaluate your Weaver play, and it’s currently at 3 (3 Kills+Assists for every Death).  Now in every future Weaver game if you don’t at least kill three people every time you die, it’s a bad death for your statistics.  Take out their carry and support but die in the process to the tower?  That’s now a bad death.  Trade deaths with their highest level player and mid leading to your team being able to easily wipe out the rest of their team in the ensuing teamfight and take a tower?  That’s now a really bad death.  Die with buyback gold in order to take down their first melee barracks?  Now an extremely bad death.  And maybe these were all really good decisions in the context of the game, but they’re not doing any favors to your KDA.
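The arithmetic behind the Weaver example is easy to check. The sketch below uses made-up running totals (18 kills, 12 assists, 10 deaths) that happen to give a KDA of exactly 3. Treating zero deaths as one death is an assumed convention to avoid dividing by zero, not something taken from any particular stats site.

```python
def kda(kills: int, assists: int, deaths: int) -> float:
    """(Kills + Assists) / Deaths, treating zero deaths as one (an assumed convention)."""
    return (kills + assists) / max(deaths, 1)


# Hypothetical running totals for a Weaver player sitting at exactly 3.0.
kills, assists, deaths = 18, 12, 10
before = kda(kills, assists, deaths)

# One more death, in which you took out their carry and a support first.
after = kda(kills + 2, assists, deaths + 1)

print(f"before: {before:.2f}, after: {after:.2f}")
```

Two kills for one death moves the ratio from 30/10 to 32/11, so the number goes down even if the trade was the right call for the game.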

This returns us to the broader criticism of player psychology.  A lot of people will complain that their teammates feed too much, so obviously matchmaking is simply giving them trash teammates.  They’ll point to something like KDA and say “Even when I lose I have the highest KDA, which proves that my teammates were terrible.”  But Dota doesn’t work like that.  You win by finding a weakness in the enemy team and exploiting it, and the single biggest weakness in your typical pub team is when they have three to four static farming lanes and none of them are ever going to leave to help other lanes.  If you just pick evasive heroes in safe lanes and never put yourself at risk to help your teammates then you’re a hindrance to your team no matter what your KDA is.

Hopefully at this point you agree that KDA at the very least has the potential to be incredibly misleading.  The obvious follow-up question is what should we be using instead?  The more depressing answer I have to this is “Nothing at the moment.”  Statistics for evaluating individual performance are at a really bad place right now, and that’s why I focus almost entirely on broader trends.  It’s better to admit that we don’t have a good answer, than to pretend our terrible answer is good and enshrine it as common sense.

But there’s also a more hopeful answer.  Some of the big keys to making better Dota statistics are adding time and position sensitivity to our stats.  With the Dota 2 Demo File Format this is definitely possible, though certainly not easy.  Ignoring for the moment any questions of technical feasibility, what kind of statistics could we eventually create to measure individual performance?

I feel that the first step would be to measure team performances.  For example, we could measure the % of creeps converted to CS on a lane-by-lane basis.  We could further split the samples by Normal-High-Very High to see how the performances change with player skill.  We could also look at this from a Radiant vs Dire basis.  Does one suicide lane get more CS?  Does Radiant Mid outperform Dire Mid due to the mid pulls available to the Radiant?  If we want to get really fancy we could include heroes or lane makeups (solo, duo, trilane) in the analysis.  We could even potentially measure what percentage of creeps get converted to experience, which might be a better way of measuring denials in a safe lane where you’re trying to get denies through neutrals or forcing the suicide laner away from the creeps.
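As a rough sketch of the lane-by-lane conversion idea, suppose a replay parser could give us one record per lane per game. The record layout, the field names, and the sample numbers below are all hypothetical; nothing here reflects what an actual demo parser outputs.

```python
from collections import defaultdict

# Hypothetical per-lane records, one per lane per game:
# (side, lane, skill bracket, enemy lane creeps available, last hits taken)
lane_records = [
    ("Radiant", "mid", "Normal", 120, 54),
    ("Radiant", "mid", "Normal", 118, 62),
    ("Dire", "mid", "Normal", 120, 48),
    ("Dire", "mid", "Very High", 121, 90),
]


def cs_conversion(records):
    """Return {(side, lane, bracket): percent of available creeps converted to CS}."""
    available = defaultdict(int)
    last_hits = defaultdict(int)
    for side, lane, bracket, n_available, n_last_hits in records:
        key = (side, lane, bracket)
        available[key] += n_available
        last_hits[key] += n_last_hits
    # Pool counts across games before dividing, so long lanes weigh more.
    return {
        key: 100.0 * last_hits[key] / available[key]
        for key in available
        if available[key] > 0
    }


rates = cs_conversion(lane_records)
for key in sorted(rates):
    print(key, f"{rates[key]:.1f}%")
```

Grouping by a tuple key means adding hero or lane makeup later is only a matter of widening the key, at the cost of thinner samples in each group.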

We could also get a much better model to explain how much gold and experience leads in the laning phase predict wins.  This might entail estimating the average end of the laning phase in minutes, or maybe we could dynamically determine a timing for the end of a laning phase based on player movement.  Once we have a reasonably good model, we could then run individual games through the same process and see how they stack up.  Do teams containing certain heroes have a tendency to underperform their gold graph?  Doom(bringer) is an obvious case here, but there could be way more interesting ones like Nature’s Prophet.  On the other hand, do certain players overperform based on their team’s gold graph?  Comeback potential is a really difficult thing to measure, but maybe this would pay off as an approach.
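One very crude way to sketch this, assuming we already had a gold lead at the end of the laning phase for every game: bucket the leads, take the observed win rate in each bucket as the expected result, and then see how far a given player's games land from that expectation. The 1000 gold bucket width, the function names, and the sample games are all assumptions for illustration. A real model would want something smoother than buckets and much more data.

```python
from collections import defaultdict

BUCKET = 1000  # gold; the bucket width is an arbitrary choice for this sketch


def bucket_of(gold_lead: int) -> int:
    """Floor a gold lead to the lower edge of its bucket (negative leads too)."""
    return (gold_lead // BUCKET) * BUCKET


def win_rate_by_lead(games):
    """games: iterable of (gold lead at end of laning, won) from one team's view."""
    wins = defaultdict(int)
    total = defaultdict(int)
    for gold_lead, won in games:
        b = bucket_of(gold_lead)
        total[b] += 1
        wins[b] += int(won)
    return {b: wins[b] / total[b] for b in total}


def overperformance(player_games, baseline):
    """Mean of (actual result - expected win rate) over a player's games."""
    diffs = [
        int(won) - baseline[bucket_of(lead)]
        for lead, won in player_games
        if bucket_of(lead) in baseline  # skip leads the baseline never saw
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


# Made-up sample of games and one made-up player's games.
sample = [
    (-1500, False), (-1200, False), (-1800, True), (-1100, False),
    (300, True), (700, False),
    (1400, True), (1900, True), (1250, True), (1600, False),
]
baseline = win_rate_by_lead(sample)
player = [(-1300, True), (-1700, False), (1500, True)]
print(baseline, overperformance(player, baseline))
```

A positive number means the player's teams win more often than their gold graph at the end of laning would suggest. The same function works for a hero instead of a player if you feed it that hero's games.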

So once we establish some of these basic values and a generalized definition of the laning phase, we could turn towards breaking down the tinier details.  Let’s say we’ve put together a pretty big sample of games within a certain skill range with relatively uninterrupted 2v2 lanes.  If we were attempting to create statistics that evaluate support play, we could:

1. Measure the rate of creep attacks by each member of the lane.  Do lanes with a lot of autoattack pushing underperform relative to lanes with only last hits?  Do lanes with a farmer and support out-CS lanes where both members fight for last hits?

2. Measure the frequency of autoattack harass.  How much does it correlate with winning the lane?  The game?  Do certain heroes tend to harass less often or less successfully when up against a hero with a range advantage?  How much of a difference does safe vs suicide make?  How important are stout shields as a purchase?  What’s the average amount of creep damage a harasser takes from creep aggro?

3. Measure XPM rates for the lane.  Do big XPM advantages in 2v2 lanes reliably lead to wins?  You can create XPM advantages through denying, neutral pulling, killing, and harassing.  Are any of these strategies more successful than the others?  Riskier than the others?  Is any of this hero specific?

4. How much more reliably do teams that start with couriers win?  Does flying courier upgrade time play a big role in team win rates?  Does any of this change significantly by skill bracket?

5. How much does rune control contribute to winning games?  Do teams with supports who help guard/capture rune spawns outperform teams that rely solely on their mid?

6. Do lanes with nearby wards die less often than lanes with no nearby wards?  Does it depend on the ward location?  Does the success rate of early stealthers like Bounty Hunter and Broodmother drop significantly when facing lane sentry wards?  How does blocking pull camps affect lane XPM rates?

7. Can we begin to quantify the results of lane scuffles?  How much does a kill change the XPM dynamics of the lane?  Can we differentiate between a relatively forced kill through effective CC chaining and a kill that’s just a result of carelessness or bad positioning?  Can we compare the HP loss of a skirmish to get a general idea of which side came out on top, and possibly also which heroes and item builds are most capable of mitigating bad skirmishes?

8. Can we measure the effectiveness of players in pressuring lanes other than their own?  Do teams with a lot of volatility in player positioning during the lane phase outperform teams with static lanes?  Does it depend on how successful that volatility is in creating kills and HP deficits?
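Several of the questions in this list boil down to "how strongly does X go along with winning the lane?" As a minimal sketch, here is a plain Pearson correlation between a per-lane harass count and a 0/1 lane result. The sample numbers are invented, and how you would actually define "winning the lane" is left open.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented sample: harass hits landed per lane, and whether that lane was won.
harass_hits = [2, 5, 9, 12, 15, 20]
won_lane = [0, 0, 1, 0, 1, 1]

r = pearson(harass_hits, won_lane)
print(f"harass vs lane win: r = {r:.2f}")
```

The same caveat from the KDA discussion applies here: a positive correlation does not tell us whether harassing wins lanes or whether winning lanes simply get more chances to harass.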

So that’s a lot of things.  None of them are easy.  Some of them might not be practical.  Some of them might work but produce nothing of interest.  I don’t claim to have a magic shortcut to understanding this game.

But let’s suppose some of them do pan out.  We could then use the results to measure support play.  Do you harass less often than the average winning support player?  Do your suicide lane opponents get more XPM out of your lane than average?  Do you fail to buy courier/wards in a reliable and effective manner?  Do you rarely if ever try to contest runes?  Do you get fewer cross-lane kills and pressures than the average successful support at your skill level?  Are you bad at creating favorable HP trades?
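If we had those baselines, the comparison itself would be the easy part. The sketch below flags any metric where a player trails the average winning support by more than some tolerance. The metric names, the baseline numbers, and the 10% tolerance are all hypothetical, and every metric is assumed to be oriented so that higher is better (something like opponent XPM would need to be inverted first).

```python
def weak_areas(player, baseline, tolerance=0.10):
    """Return {metric: relative shortfall} where the player trails the baseline.

    Assumes every metric is oriented so that higher is better.
    """
    flagged = {}
    for metric, expected in baseline.items():
        actual = player.get(metric)
        if actual is None or expected <= 0:
            continue  # no data for this metric, or no usable baseline
        shortfall = (expected - actual) / expected
        if shortfall > tolerance:
            flagged[metric] = round(shortfall, 2)
    return flagged


# Hypothetical averages for winning supports in one skill bracket.
baseline = {
    "harass_hits_per_min": 2.0,
    "rune_contests_per_game": 3.0,
    "wards_bought_by_10min": 2.0,
    "cross_lane_kill_participations": 1.5,
}
# Hypothetical averages for the player being evaluated.
player = {
    "harass_hits_per_min": 2.1,
    "rune_contests_per_game": 1.5,
    "wards_bought_by_10min": 1.9,
    "cross_lane_kill_participations": 0.6,
}

print(weak_areas(player, baseline))
```

With these made-up numbers the player is flagged for rune contests and cross-lane participation but not for harass or wards, which is the kind of output that tells you what to work on rather than just how you rank.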

If we accomplished a statistical project of this scope not only would we be capable of measuring whether you’re a good support or not, we would more importantly be able to tell you what you could work on to be a better support.  Even if KDA were a useful stat it wouldn’t really tell us anything about why our KDA is low in the first place, so if we’re all about self-improvement why on earth should we be satisfied with KDA as a statistic?

But once again, I don’t claim that any of this will be easy.  It’s just that if we want matchmaking games to be statistically significant we should dream big.  What we have now is not adequate, and pretending that it is adequate can only distract us from bigger and better possibilities.

Continue to Part 3

5 Responses to The Insignificance of Pub Stats, Part 2

  1. TC says:

    To me, it seems the breaking out of stats is unnecessary. We already have an aggregate measure of how successful a player is within a game. That is of course W-L record. The aggregate W-L will take all of the relevant performance and put it into a measure to show how well a player performs.

    To me, the goal of breaking out statistics is about looking at specific performance areas (some that might be hidden) to look for areas of improvement. For example, CS (especially global) allows individuals to check to see how well they perform relative to others. Gives players goals that they can aim to achieve.

    But to me, these types of numbers can make poor team performance look good. One example I can think of is where my team lost two lanes due to ganks from mid, but our mid farmed up a storm and had strong individual statistics by showing up late (and cleaning up) and by free farming. We lost that game as the rest of us were never able to recover enough from our early deficit. If you were to look at purely game statistics, you might’ve said, given the available results, the individual looks like he was a star getting dragged down by a bad team. But if you asked the team members, they might’ve said, we wish this player would have focused more on team play.

    The problem I see with individual statistics is something you mentioned. Suppose your team is way ahead. You can inflate your own results by dragging the game out, farming more, playing more passive. Or like you suggested, at the end of the game you can camp the fountain. It’s why I prefer aggregation to W-L. That is after all the goal of the game.

    I like the questions you pose. The results give insight into actions that should be taken in order to improve the probability of winning.

    • phantasmal says:

      You’re right in that consistent Winning and Losing is the only surefire way to evaluate a player or a strategy. My one warning is that any player’s win/loss record can be warped by any number of factors within matchmaking and isn’t a great evaluator of their actual skill.

      As for your example, situations like that are a definite issue. That’s why I favor an approach that looks at a bunch of individual metrics and points out any of the ones where you appear to be lacking. It’s not possible to statistically say “Oh, you were farming here when you should have been ganking.” But we could say that every time you play mid with a certain character you farm really well but statistically you get far fewer rune spawns than average for your hero at your skill level and gank significantly less often as well, and this might play into why your win rates have been low. There’s likely no perfect ratio of farming/ganking/map control, but we could possibly spot people who are noticeably deficient in a certain area.

  2. TC says:

    I guess I probably should’ve said something to conclude that long post… But the main thing I was going for is that W-L is a good measure of how good a given player is. Granted there are ways to make W-L artificially higher (e.g. queuing only with team and not solo), but I would bet it’s a robust measure of how good of a player someone is.

  3. Chris says:

    I think it’s great to consider some of the more complex metrics that you have suggested, but I think there are a number of easier metrics that can be applied, particularly to the normal skill bracket.

    C/S after 5 mins (Carry, Mid)
    Denies after 5 mins (Support, Mid)
    Starting items (supports buying courier and/or wards)
    I do agree that metrics around stout shields or other items would be interesting.
    For example, average time for bottle or boots.
    Time “idle”, standing doing nothing.
    Number of TP’s purchased/used.

    Whilst it is somewhat true to say that number of deaths is a poor metric, dying in the laning phase is generally pretty bad.

    The vast majority of games are decided by around the 20 minute mark – actually let’s see some stats on that!! – number of ‘comebacks’. It would be interesting to see if we could predict game outcome by “kill score”, XP or Gold swing after certain time intervals.

    • phantasmal says:

      Incidentally, all of those (except time “idle,” at least directly) can theoretically be done by Bruno’s Demo Parser that I mention in my newest post.

      Unfortunately for me, it doesn’t appear to be XP compatible.
