Do Your Own Stats: The Player Data Dump

For all you budding data analysts out there, here is the data dump of all ~300k players in my sample.  It’s a zipped .txt file in comma-separated format that imports into both R and Excel (and possibly other programs, but I can’t guarantee it!).

The data is divided by skill level in the bracket column: n is Normal, h is High, and v is Very High.  All games should be from the 6.74 patch.  I have filtered out all games with fewer than 10 players, which include bot games and early abandons.  I’ve left off a few categories to keep the file manageable for now.  I intend for a future release to include item data once I figure out how I want to display it, and possibly also a file of aggregate match data without player entries.

If you want to give it a whirl but don’t have a lot of experience, you can download R here.

1.  Once in the program, use setwd("C:/directory"), where C:/directory is wherever you extracted the text file.

2.  Use data = read.csv("PlayerData.txt").  This makes ‘data’ a data frame containing all of the player information.

3. Use data[1:10,] to show the first 10 player entries.  As you can see, they all share the same matchID, so together they make up one complete match.

4. Let’s say you want to do some filtering.  We can create a logical vector for that.  To see all games played by Tiny, use L = data$Hero == 'Tiny'

data$Hero refers to the Hero column in our data frame.  L is now a logical vector that is TRUE for every entry where the player’s hero was Tiny.  (I just realized I should have included a hero list for the naming conventions, oh well.)

5. data[L,] now shows all player entries for Tiny.  To break it down further, we can add extra checks, like L = data$Hero == 'Tiny' & data$Bracket == 'v', which only keeps Tiny games in the Very High bracket.

6. If you want to turn the rows picked out by a logical check into their own data frame, use tiny = data[L,].  With the combined check from step 5, this gives us a data frame named ‘tiny’ that contains all Tiny players in the Very High bracket.

7. Want to see the average GPM of Very High Tinys?  Just use mean(tiny$GPM)

8. Want to get even trickier?  Let’s go back to data and make a new column that represents creep kills (CS) per minute.  To do this we use data$csm = data$CS * 60 / data$Duration (the 60 converts Duration from seconds to minutes).

9.  Now let’s find the top 1% of CS per minute in Normal.  Create a separate frame using L = data$Bracket == 'n' and norm = data[L,]

10. We can find the cutoff for the top 1% using quantile(norm$csm, 0.99).  This tells us that the 99th percentile is at 6.19 CS/min.

11. To actually see the entries, we can use L = norm$csm > 6.19 and then norm[L,].  Oh boy, look at all the Nature’s Prophet.
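
Steps 2 through 11 can be strung together into one script.  The sketch below is not run against the real file: it uses a tiny made-up data frame in place of PlayerData.txt, with invented numbers and placeholder hero names (HeroA, HeroB), and only the column names that appear in the steps above.  It also feeds the quantile result straight into the filter instead of typing 6.19 by hand.

```r
# `<-` is R's usual assignment operator; it works like the `=` used in the steps.
# Toy stand-in for read.csv("PlayerData.txt"). Every value here is invented.
data <- data.frame(
  Hero     = c("Tiny", "Tiny", "HeroA", "Tiny", "HeroB", "HeroA"),
  Bracket  = c("v", "n", "n", "v", "n", "n"),
  GPM      = c(400, 300, 500, 500, 250, 450),
  CS       = c(120, 60, 300, 200, 40, 240),
  Duration = c(2400, 1800, 2400, 3000, 2400, 1800),  # seconds
  stringsAsFactors = FALSE
)

# Steps 4-6: logical check for Very High Tiny players, then subset
L <- data$Hero == "Tiny" & data$Bracket == "v"
tiny <- data[L, ]

# Step 7: average GPM of those players
mean_gpm <- mean(tiny$GPM)

# Step 8: creep kills per minute (Duration is in seconds)
data$csm <- data$CS * 60 / data$Duration

# Step 9: Normal bracket only (taken after csm was added, so norm has it too)
norm <- data[data$Bracket == "n", ]

# Steps 10-11: 99th percentile cutoff, then the rows above it
cutoff <- quantile(norm$csm, 0.99)
top <- norm[norm$csm > cutoff, ]

print(mean_gpm)
print(top)
```

To try it on the real data, swap the data.frame(...) block for the read.csv call from step 2 and leave the rest as is.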


That’s just a basic starting point, and I’m sure more experienced R users could have way more interesting stuff in no time.
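
As one possible next step, here is a rough sketch of two quick summaries: average GPM per bracket in a single call, and a check that every match really has 10 player rows.  It assumes the match identifier column is named matchID (check names(data) on the real file), and the toy rows are invented purely so the snippet runs on its own.

```r
# Invented toy rows; the real file has ~300k. matchID is an assumed column name.
data <- data.frame(
  matchID = rep(c(101, 102), each = 10),
  Bracket = rep(c("n", "v"), each = 10),
  GPM     = c(rep(300, 10), rep(400, 10)),
  stringsAsFactors = FALSE
)

# Mean GPM for each bracket, without building a separate frame per bracket
gpm_by_bracket <- tapply(data$GPM, data$Bracket, mean)

# Player rows per match; each count should be 10 if the filter held
players_per_match <- table(data$matchID)
all_full <- all(players_per_match == 10)

print(gpm_by_bracket)
print(all_full)
```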

As always, if you want to contact me about anything regarding this just use either the comment section or the e-mail up there in the right sidebar.  Happy mining, data spelunkers.


4 Responses to Do Your Own Stats: The Player Data Dump

  1. jc says:

    Could you provide more detail on how the sample was created? Random sample?

    • phantasmal says:

      If I recall correctly, the creation of the earlier samples was grabbing groups of 250 games staggered over the course of a week at fixed hourly intervals. The Normal sample followed a similar method, but by the time I started creating it the API was in bad shape and had a date bug. The necessary workarounds were rushed and the sample stretches out for more than the week of the other samples. For this reason I prefer not to use it heavily in hero usage stats (there are definitely games in it that occurred before the KotL/Nyx/Visage release), but it’s workable if less than ideal for tests that are less sensitive to hero composition.

      Future grabs will hopefully be a bit more disciplined if the API returns in a more stable form. Also hopefully closer to a full representation of the matches over a stretch of time and less of a sample.

  2. asdf says:

    Thanks so much for releasing this.

  3. cosmoflop12 says:

    Thank you very much for creating this file. Just the other day a friend told me I sucked at Dota 2, and I proved him dead wrong by comparing my Kill/Death/Assist (K+A)/D score with the averages of the 4000 players that played as the same hero I did. Very helpful!
