
Cal Poly San Luis Obispo

Using Data to Predict the Winner of the


2018 League of Legends World Championships

Terence Tong
ENGL 149 – 09
Professor Sean Green
December 14, 2018
Abstract
The 2018 League of Legends World Championships has data collected from all of its matches. With
this data, a logistic regression model will be built to see if the result of the finals between
the 2 finalist teams can be accurately predicted. The purpose of this research is to explore
logistic regression and hopefully to inspire future statistical projects at Riot Games.

Table of Contents
Introduction to Data Science
Introduction to the Research
Setup for Logistic Regression
    Data Set
    Assumptions
Logistic Regression
Assessing the Model
    ROC and AUC Assessment
Testing the Model
Final Results
Analysis of Results
Recommendations
Closing Thoughts
Appendix

Introduction to Data Science
Data science is the intersection of 3 fields: computer science, statistics, and field knowledge [1]. Computer
science – for the sake of data science – is the ability to program. Statistics is the study of data. Field
knowledge is the ability to understand the data in context. As a trivial example from baseball, statistics like
"batting average" and "runs" are collected for every player. Having field knowledge would mean
knowing that there is a relationship between these 2 statistics. Using this knowledge, statistical
methods, and the automation of computer science, data scientists can predict how well a player will do
over the season. For this research report, instead of using baseball as the medium for the data, I will be
using the 2018 League of Legends World Championship data.

Introduction to the Research

League of Legends (LoL) is an online video game and part of the growing industry known as eSports, or
electronic sports. In every professional competition, statistics are collected for each team in every
match. This has led to the popular data science question: can the results be predicted before they
occur? I will apply this question to the 2 finalists of the 2018 LoL World Championships:
Invictus Gaming and Fnatic. In context, the research question becomes "Can the results of the
2018 LoL World Championships be predicted between the 2 finalists, Invictus Gaming and Fnatic?"

I will be building a logistic regression model by following these steps [2]:

1. Satisfying assumptions
2. Understanding the data set
3. Modeling the data
4. Testing the model

A logistic regression model is a statistical model that takes input variables and returns the probability of an
event happening. To build on the baseball example, the model would take every player's hit percentage as
an input variable and return the probability of the team winning.
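
To make the idea concrete, here is a toy sketch in R (hypothetical numbers for illustration only, not the baseball or LoL data); glm() with a binomial family fits exactly this kind of model:

# a toy logistic regression on made-up data
games <- data.frame(hit_pct = c(.21, .34, .28, .40, .25, .31),
                    win     = c(0, 1, 1, 1, 0, 0))
toy_model <- glm(win ~ hit_pct, family = binomial(link = "logit"), data = games)
# predicted probability of winning for a team hitting .30
predict(toy_model, newdata = data.frame(hit_pct = .30), type = "response")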

I will also be using R (a statistical programming language) to model the data. For each statistical figure, I
will be posting the code in the appendix at the end of the document.

Setup for Logistic Regression


Before any data analysis, I need to establish what the data is and satisfy some conditions.

Data Set
The data set I will be working with comes from the Oracle's Elixir website [3].
I filtered out categorical variables such as "gameid", "url", and any other column that is not a numerical statistic.
In addition, I will not look at the individual performance of each player, but rather the team
performance as a whole. Using my game knowledge as a top 94th percentile [4] player of LoL, I went
through the data set and decided which column variables would be highly correlated with a win.

There are abbreviations in the data that are hard to understand even for long-time players, but there is a
data set dictionary to help define these terms.

I will be splitting the raw data into 2 data sets: a train data set and a test data set. The train data set will
be used to build the model, and the test data set will be used to check the results of the model. In this case,
the test data set will be 10% of the data from all before-finals games; the train data set will include the
remaining 90%.

This split is necessary for logistic regression because there are not many other ways of determining how good
the model is. The last step will be rebuilding the model with all the games before the finals and using the
data from the finals to predict who will win [5].
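
A minimal sketch of the split (the same code appears in the Appendix; set.seed() is my addition for reproducibility, as the draft did not seed the sample):

set.seed(42)  # reproducible random sampling
test  <- notFinal_dat[sample(nrow(notFinal_dat), nrow(notFinal_dat) * .1), ]  # random 10%
train <- anti_join(notFinal_dat, test)  # the remaining 90%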

"Result" (at the bottom of the plot in Figure 1) is the binary variable that I will be predicting, or the
dependent variable. A value of 1 means that the team won, and a value of 0 means that the team lost.

Figure 1: A correlation plot between the many variables from the data set. Source: Terence Tong

In every square in Figure 1, the correlation coefficient (a value between -1 and 1) is listed. A value of 1
means that as the independent variable (a variable used to predict the dependent variable) changes, the
dependent variable changes with it in perfect proportion. The sign gives the direction of the correlation:
if the dependent variable decreases as the independent variable increases, there is a negative correlation;
if the dependent variable increases as the independent variable increases, there is a positive correlation.

So to determine whether there is a relationship between 2 variables, look at how far the value is from 0,
regardless of its sign.
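
A quick illustration with R's built-in cor() function (toy numbers, for intuition only):

cor(c(1, 2, 3, 4, 5), c(2, 4, 6, 8, 10))  #  1: perfect positive correlation
cor(c(1, 2, 3, 4, 5), c(10, 8, 6, 4, 2))  # -1: perfect negative correlation
cor(c(1, 2, 3, 4, 5), c(3, 1, 4, 1, 3))   # ~0.04: almost no linear relationship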

In this case, all correlation coefficients are positive because, from what I've seen in the data, the
higher the number, the better the team's performance.

As you can see, I was mostly correct about which variables would be correlated with a win. Other than
"wpm", there are pretty good correlations between "result" and the remaining variables, so I will now take
"wpm" out of my regression model.

The variable most strongly correlated with a win is "fbaron", or first Baron. That makes sense because Baron is
the biggest neutral objective on Summoner's Rift and provides the entire team with a powerful buff
to help siege and end the game.

If I did not have background knowledge about the game, I could have calculated the correlation value for
every column variable in the data set, but for the sake of brevity, I specifically chose variables.

Assumptions
The first assumption of logistic regression concerns independence within each observation. Each team in a
match is one observation. The variables from a team may not be completely independent of each other,
because snowballing (getting a lead and building upon it) is very prevalent in this game; but that is simply
part of the game, so I will consider this condition satisfied.

The second assumption is that the observations are independent of each other.
Because 2 observations come from the same match (1 losing team, 1 winning team), I would say that
this assumption is violated. However, in context, it would not make sense to look at only one side of the
data to build a model. I do have some variables that are matched pairs – such as "gdat15" (gold difference
at 15 minutes) – so instead of using these match-paired variables I will be using the raw values – "goldat15"
(gold at 15 minutes).

Figure 2: Modified correlation plot to account for the assumptions, with "wpm" removed.
Source: Terence Tong

Logistic Regression
Now that a proper data set has been built and the assumptions addressed, I can build the model.

Using R's regression tools, the model has the following coefficients.

Figure 3: Coefficients table for the regression model Source: Terence Tong

I will only be focusing on the “estimate” column.

The following equation is the general formula for a logistic regression model:
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
Unfortunately, I cannot use this equation yet. Each \beta_n represents a coefficient from the coefficient
table. I have 14 variables, so the formula actually becomes [6]:

p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{14} X_{14}}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{14} X_{14}}}
The table and the equation are linked by substituting \beta_0 with the value in the "(Intercept)" row of the
"Estimate" column, which is -68.7. Then, substitute \beta_1 with the "kpm" row's "Estimate" value, and
continue for every \beta_n with the respective row and column value.
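
As a hedged sketch of this substitution in R (assuming the model1 object built in the Appendix; plogis() is R's built-in logistic function e^x / (1 + e^x)):

new_game <- train[1, ]                      # one team's statistics from one match
eta <- predict(model1, newdata = new_game)  # beta_0 + beta_1*x_1 + ... (the linear part)
plogis(eta)                                 # same as predict(model1, new_game, type = "response")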

Notice that the table doesn't include all the coefficients that were in the correlation plot. This has to do
with the concept of multicollinearity.

Multicollinearity is essentially high correlation between variables that were assumed to be
independent [7]. For instance, "goldat10" and "goldat15" are the same statistic collected at different time
stamps of the game. I included both because within those 5 minutes, the gold totals could fluctuate greatly
based on the team's performance. The stepwise selection procedure decided that these variables were too
closely related to one another, so it kept the one that was better correlated with a win.
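
One quick way to see this redundancy (a sketch, assuming the rebuilt train data set from the Appendix):

# the two gold statistics should be highly correlated with each other,
# which is exactly what multicollinearity means
cor(train$goldat10, train$goldat15)  # expected to be close to 1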

If I wanted to calculate how each individual variable contributes to the probability of the event
happening, I would find the odds ratio for each variable [8]. I will not go in-depth on this because it is not
important in the scope of this research; the odds ratios are calculated in the Appendix. Each value can be
interpreted as: "for every 1 unit the variable increases, the odds of the event occurring are multiplied by
the odds ratio." For more information on the odds ratio, see the resource cited in note [16].
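
For reference, the standard algebra behind the odds ratio (general logistic regression theory, not specific to this data set): rearranging the model equation gives

\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_{14} X_{14}}

so increasing X_i by one unit multiplies the odds p/(1-p) by e^{\beta_i}, which is exactly that variable's odds ratio.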

Assessing the Model
In the equation, each \beta_n is multiplied by an x_n. Every x_n represents the corresponding value to
substitute into the equation: for x_1, I would substitute a team's "kpm" statistic when trying to predict the
winner of a certain match, and then do the same with the remaining variables.

With the model built from the train data, I input the train data back into the model. A team with a
predicted probability greater than 0.5 is classified as a win; a team with a predicted probability
less than 0.5 is classified as a loss.
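
A minimal sketch of this classification step (assuming model1 and train from the Appendix):

pred_prob  <- predict(model1, newdata = train, type = "response")  # win probabilities
pred_class <- ifelse(pred_prob > 0.5, 1, 0)   # 1 = predicted win, 0 = predicted loss
table(actual = train$result, predicted = pred_class)  # the table shown in Figure 4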

Figure 4: Resulting table of putting the "train" data set into the model

In Figure 4, each cell of the table is interpreted as follows:

• The “0-FALSE” value is the number of times that the model correctly predicted that a team
would lose.
• The “0-TRUE” value is the number of times that the model incorrectly predicted that the team
would win when they actually lost.
• The “1-FALSE” value is the number of times that the model incorrectly predicted that the team
would lose when they actually won.
• The “1-TRUE” value is the number of times that model correctly predicted that the team would
win when they actually won.

Figure 5: Logistic curve built from the “train” data. The dots are data points from the "train" data set
Blue means that the point was incorrectly classified.
Red means the point was correctly classified.
Source: Terence Tong

With a little math, I calculated that out of the 209 team observations in the train set, 198 were classified
correctly. This is a 94.7% success rate.
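
The success rate is simply the share of correct classifications in that table; as a sketch, reusing pred_class from the earlier snippet:

mean(pred_class == train$result)  # (correct losses + correct wins) / total = 198/209, about 0.947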

A 94.7% success rate seems pretty good, but how good is it?

I will build a ROC curve to assess how good the model is.

ROC stands for Receiver Operating Characteristic. The term comes from the 1940s, when the ROC was used to
measure how well a sonar signal could be distinguished from extraneous sounds [9]. Since then, the term has
been generalized from sonar signals to binary outcomes in general.

ROC and AUC Assessment

Figure 6: ROC Curve (explained in the paragraphs below) Source: Terence Tong

Before I start the analysis, an important thing to remember about the graph is that the x-axis is reversed:
at the origin of the graph, the x-value is at its maximum.

On the x-axis of the ROC curve is specificity. Specificity is the percentage of actual losses that were
correctly classified as losses (the true negative rate) at a given threshold; the lower the specificity, the
more losses are misclassified as wins (false positives). The threshold is the cutoff for calling a prediction
a win: in my model the threshold was 50%, because anything higher than 50% was classified as a win.

On the y-axis of the ROC curve is sensitivity. Sensitivity is the percentage of actual wins that were
correctly classified as wins (the true positive rate) at a given threshold.
The graph itself doesn't show the threshold at any given point, but for every point on the curve there
exists a threshold at which that point was plotted.

9
A perfect model hugs the 100%-specificity and 100%-sensitivity edges, because that would mean the model
classifies all the data perfectly. To see how the curve is traced out: when the threshold is small, nearly
everything is classified as a win, so all the actual wins are captured (sensitivity near 100%) but most of
the actual losses become false positives (specificity near 0%), putting those points at the top right of this
reversed-axis plot. As the threshold increases, the chance of a false positive decreases (specificity rises),
but more actual wins are missed as fewer points are classified as true positives (sensitivity falls), moving
the points toward the origin [10].

There is also a line through the center of the graph. This represents classification by random chance (a
coin flip) instead of using the data.

This assessment is better than just looking at the table in Figure 4, because the table only reflects the data
when the threshold is at 50%. The ROC looks at all possible values of the threshold.
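
As a hedged sketch of what "all possible thresholds" means (assuming model1 and train from the Appendix):

probs  <- predict(model1, newdata = train, type = "response")
actual <- as.numeric(as.character(train$result))  # works whether result is numeric or a factor
for (t in c(0.1, 0.5, 0.9)) {                     # three of the many possible thresholds
  pred <- as.numeric(probs > t)
  sens <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # true positive rate
  spec <- sum(pred == 0 & actual == 0) / sum(actual == 0)  # true negative rate
  cat(sprintf("threshold %.1f: sensitivity %.3f, specificity %.3f\n", t, sens, spec))
}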

To sum up the information in the ROC, there is the area under the curve metric, or AUC. The AUC is a value
between 0.5 (a coin flip) and 1 (the area under a perfect model's curve); the higher the AUC, the better the
model [11]. As Figure 6 shows in the middle of the plot, the AUC is 0.994. In simpler terms, this
model is near perfect at predicting results.

Testing the Model


Now that the model has been assessed in terms of theory, I'll test it with the "test" data set I separated
from the full data.

Running the model on the test data set, I get the following values:

Figure 7: Table of the results returned from inputting the "test" data set
Source: Terence Tong

The model incorrectly predicted only 2 data points out of the 23 data points, a 91% success rate.

The model has been relatively consistent using only the "train" data, so combining the two data
sets for the final model should not drastically change anything.

Final Results

Figure 8: Model built from both the "train" and "test" data sets Source: Terence Tong

In Figure 8, we have the model built from combining both the "test" and "train" data. It misclassifies only 13
data points out of the 232 points in the combined data set. Now I will substitute the results of the finals
matches into this model, which was created from all the previous matches in the World Championships.

Figure 9: Table of the results from the model given the finals games Source: Terence Tong

This table reads the same as the previous tables. The only misclassification, in the "0-TRUE" cell, was a
team that was predicted to win when it actually lost. The data comes from the 2 finalist teams, and there
were 3 games between them, so there are 6 team observations.

The model has an 83% success rate (5 of 6 correct) on the finals data, so I would consider this a fairly
good model.

Heuristically, I conclude that using a model built from all the data before the finals to predict the winner
of the finals is entirely possible.

Analysis of Results
A predictive model was built using the data from the past games. There are a few things that I feel could
have been done better to build an even stronger model.

Least important are the correlation values. The correlation coefficient I calculated to determine which
variables would be related to a win is mathematically designed for linear relationships [12]. If I wanted a
better correlation statistic for each variable, I could have looked at each variable's distribution and
examined the relationship between that variable and a win. This requires a lot more work for multiple
regression, but because this is preliminary research, I decided not to do it.

The research process I took is a bit unrealistic. To predict the finals, I used the data from the final
matches themselves, which defeats the purpose of building the model in the first place. In the future, I
would use expected values for each variable. How to calculate those expected values could vary
greatly: I could, for example, look at the team's average value for that variable. For "csat15," I
would take the team's average "csat15" throughout the championship as the expected value and
substitute that into the model. The expected value could be as simple as a mean or as involved as a
regression model per variable; a sketch of the simple approach follows.
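
As a hedged sketch of the team-average idea (assuming the notFinal_dat data frame from the Appendix; "expected_csat15" is a name I introduce for illustration):

# each team's pre-finals average "csat15" serves as its expected value
expected_csat15 <- notFinal_dat %>%
  group_by(team) %>%
  summarise(expected_csat15 = mean(csat15))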

Building on top of this, I could have an initial model for the match before the game happens, based solely
on expected values. Then, as the game develops, I could substitute real values into the model for an
even better prediction. This would work even better if each variable had its own regression model based on
the other variables collected.

All in all, this research process did answer the question of whether collected data can be used to
predict the future outcomes of matches.

Recommendations
The research done here is small relative to the size of the projects at Riot Games. However, that does not
negate it. This research has the potential to become the start of something larger for the company. The
underlying insights that data can provide are something that should not be passed up.

For instance, some areas that data has the power to influence are champion tuning, bug fixes, and
game flow.

Champion tuning involves tweaking numbers or skills so that a champion feels the way the game
developers intend. Each champion's individual stats and skills should be distinct from one another.
Changing the scaling, the health regeneration per second, or the cooldowns can completely change the
dynamic in lane. Any game developer would understand that, but can they quantify exactly how the dynamic
would change? With thorough enough research, it becomes possible to make changes to champion numbers
precise enough to push the champion to a perfect spot in the metagame.

Bug fixes could also benefit hugely from data science research. In a recent Reddit thread, some
League players expressed frustration with the client that has been released. Perhaps some of that is
redditors overreacting to things that don't actually matter much, but some complaints can be investigated
for solutions. If data about the state of the client were collected periodically, developers
would be able to pinpoint the causes of certain bugs.

The final thing that data can be used for is game flow. Game flow is essentially how the game is played,
covering anything from a player's matchmaking rating to how much damage is done per minute. Taking
the time to research how game variables are implicitly related to one another has the
potential to change the game in the most minimal but effective way possible. Knowing which
factors are highly correlated with a win or a loss could be the start of an entirely different way of patching
the game. Decreasing gold generation while increasing gold earned from creep score is an extremely
unlikely change, but every change has the potential to alter the game in
some way that was completely unpredictable. With data analysis, there is the possibility of capturing that
unpredictability and using it to the game developers' advantage.

Closing Thoughts
I'm just a casual player with a passion for League of Legends. This game played a huge part in
developing me into who I am and in the friendships I have today. The research done here has the ability to
evolve the game even further. It has the potential to make the game ever more a place for
creating and deepening friendships. Making the game more enjoyable for players can let it be
an outlet for more people to be themselves. I know I wouldn't be the same person if I didn't play
LoL. Having an outlet allows a person to develop, and that developed person has the potential to do
something great, or simply to be happy. That sounds like a win-win situation to me.

I’m just a player trying to give back to the game that has given me an identity.

Appendix
Setup:
library(tidyverse)
library(readxl)
library(moderndive)
library(GGally)
library(kableExtra)
library(pROC)

league_dat <- read_excel("2018-worlds-match-data-OraclesElixir-2018-10-28.xlsx")

# drop identifier columns that are not numerical statistics
drops <- c("gameid", "url", "league", "date", "split", "patchno", "game",
           "playerid")
league_dat <- league_dat[ , !(names(league_dat) %in% drops)]

# create a new data set with only the team's data
team_dat <- league_dat %>%
  filter(position == "Team")

# choosing variables that I believe are highly correlated with a team's win
imp_team_dat <- team_dat %>%
  select(result, team, side, kdr, kpm, teamdragkills, herald, ft,
         firsttothreetowers, fbaron, dmgtochampsperminute, wpm, wcpm,
         monsterkillsenemyjungle, cspm, gdat10, gdat15, xpdat10, csdat10,
         csdat15, week)

# replace NAs with 0 just so we can do math
imp_team_dat[is.na(imp_team_dat)] <- 0

# separating the data into a training set and a testing set
final_dat <- imp_team_dat %>%
  filter(week == "F")

notFinal_dat <- imp_team_dat %>%
  filter(week != "F")

test <- notFinal_dat[sample(nrow(notFinal_dat), nrow(notFinal_dat) * .1), ]

# with no "by" argument, anti_join matches on all shared columns; the draft
# listed columns by hand, including some that do not exist at this point
train <- anti_join(notFinal_dat, test)

Figure 1 [13]:
train %>%
  ggcorr(nbreaks = 8, label = T, low = "red3", high = "green3",
         label_round = 2, label_size = 2, name = "Correlation Scale",
         label_alpha = F, hjust = 0.99, layout.exp = 10) +
  ggtitle(label = "Correlation Plot") +
  theme(plot.title = element_text(hjust = 0.6))

Figure 2:
# modified to fit the assumptions: match-paired difference statistics
# replaced with raw values, and "wpm" removed
imp_team_dat <- team_dat %>%
  select(result, team, side, kpm, teamdragkills, herald, ft,
         firsttothreetowers, fbaron, dmgtochampsperminute, wcpm,
         monsterkillsenemyjungle, cspm, goldat10, goldat15, xpdat10, csat10,
         csat15, week)
imp_team_dat[is.na(imp_team_dat)] <- 0

# rebuild the splits so they reflect the modified variable set (added so the
# later code runs; the draft reused the splits built in Setup)
final_dat <- imp_team_dat %>% filter(week == "F")
notFinal_dat <- imp_team_dat %>% filter(week != "F")
test <- notFinal_dat[sample(nrow(notFinal_dat), nrow(notFinal_dat) * .1), ]
train <- anti_join(notFinal_dat, test)

train %>%
  ggcorr(nbreaks = 8, label = T, low = "red3", high = "green3",
         label_round = 2, label_size = 2, name = "Correlation Scale",
         label_alpha = F, hjust = 0.99, layout.exp = 10) +
  ggtitle(label = "Correlation Plot") +
  theme(plot.title = element_text(hjust = 0.6))

Figure 3:
# stepwise variable selection; columns 2, 3, and 19 (team, side, week) are
# dropped because they are not numerical statistics
model1 <- step(glm(factor(result) ~ ., family = binomial(link = "logit"),
                   data = train[-c(2, 3, 19)]),
               direction = "both", trace = F)

# table of coefficients
summary(model1)

Figure 4 [14]:
# predictions on the train set itself, classified at the 0.5 threshold
test.predicted.m1 <- predict(model1, newdata = train, type = "response")
table(train$result, test.predicted.m1 > 0.5)

Figure 5 [15]:
train %>%
  mutate(
    Misclassified = ifelse(train$result > .5 &
                             predict(model1, type = "response") < .5 |
                             train$result < .5 &
                             predict(model1, type = "response") > .5,
                           "Misclassified", "Correctly Classified")) %>%
  ggplot(aes(predict(model1, type = "response"), train$result)) +
  guides(alpha = F) +
  geom_point(aes(alpha = .2, color = Misclassified)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = F) +
  scale_x_continuous(name = "Predicted Probability") +
  scale_y_continuous(name = "Actual Probability") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank()) +
  ggtitle(label = "Actual vs Predicted with Logistic Curve")

Odds Ratio [16]:

exp(cbind("Odds ratio" = coef(model1),
          confint.default(model1, level = 0.95)))

Example: if "cspm" increases by 1, the odds of the team winning are expected to increase by
5.35%.

Figure 6 [17]:
train$result <- as.factor(train$result)

# plot ROC
rocobj <- plot.roc(train$result, model1$fitted.values,
                   main = "ROC Curve", percent = TRUE, print.auc = TRUE,
                   asp = NA)

# ciobj was never defined in the draft; pROC's ci.se() computes the
# confidence-interval shape that the next line plots
ciobj <- ci.se(rocobj, specificities = seq(0, 100, 5))
plot(ciobj, type = "shape", col = "#1c61b6AA")

Figure 7:
test.predicted.m2 <- predict(model1, newdata = test, type = "response")
table(test$result, test.predicted.m2 > 0.5)

Figure 8:
# final_model was never defined in the draft; rebuilding it on all
# pre-finals games, mirroring how model1 was built
final_model <- step(glm(factor(result) ~ ., family = binomial(link = "logit"),
                        data = notFinal_dat[-c(2, 3, 19)]),
                    direction = "both", trace = F)

notFinal_dat %>%
  mutate(
    Misclassified = ifelse(notFinal_dat$result > .5 &
                             predict(final_model, type = "response") < .5 |
                             notFinal_dat$result < .5 &
                             predict(final_model, type = "response") > .5,
                           "Misclassified", "Correctly Classified")) %>%
  ggplot(aes(predict(final_model, type = "response"), notFinal_dat$result)) +
  guides(alpha = F) +
  geom_point(aes(alpha = .2, color = Misclassified)) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = F) +
  scale_x_continuous(name = "Predicted Probability") +
  scale_y_continuous(name = "Actual Probability") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank()) +
  ggtitle(label = "Actual vs Predicted with Logistic Curve")

Figure 9:
test.predicted.m3 <- predict(final_model, newdata = final_dat,
                             type = "response")
table(final_dat$result, test.predicted.m3 > 0.5)

Notes
1. Conway, Drew. "The Data Science Venn Diagram." 30 Sept. 2010, drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.
2. Ellis, James. "Logistic Regression Overview." 16 Aug. 2017, rstudio-pubs-static.s3.amazonaws.com/300556_680d319a01ec47afbd1f905e3538a86b.html.
3. Sevenhuysen, Tim "Magic". "Match Data Downloads (Beta)." Oracle's Elixir, 3 Nov. 2018, oracleselixir.com/match-data/.
4. "Turrence - Summoner Stats - League of Legends." OP.GG North America, 3 Dec. 2018, na.op.gg/summoner/userName=Turrence. Percentile changes based on in-game performance and live data.
5. Ellis.
6. Lossing.
7. "Multicollinearity." Complete Dissertation, Statistics Solutions, www.statisticssolutions.com/multicollinearity/.
8. Ellis.
9. Grace-Martin, Karen. "Generalized Linear Models in R, Part 2: Understanding Model Fit in Logistic Regression Output." The Analysis Factor, 6 July 2018, www.theanalysisfactor.com/r-glm-model-fit/.
10. "Introduction to the ROC (Receiver Operating Characteristics) Plot." Classifier Evaluation with Imbalanced Datasets, 30 June 2017, classeval.wordpress.com/introduction/introduction-to-the-roc-receiver-operating-characteristics-plot/.
11. Markham, Kevin. "ROC Curves and Area Under the Curve Explained (Video)." Data School, 29 May 2018, www.dataschool.io/roc-curves-and-auc-explained/.
12. Stangroom, Jeremy. "Pearson Correlation Coefficient Calculator." www.socscistatistics.com/tests/pearson/.
13. Ellis.
14. Ellis.
15. Ellis.
16. "How to Calculate Odds Ratio and 95% Confidence Interval for Logistic Regression for the Following Data?" Cross Validated, Stack Exchange, stats.stackexchange.com/questions/304833/how-to-calculate-odds-ratio-and-95-confidence-interval-for-logistic-regression.
17. Ellis.

