
Analysis of Kobe Bryant's Shot Selection in

Zibran Chaus
June 3, 2016
Kobe Bryant was one of the most influential basketball players of our
time. Data was gathered throughout his successful career. This
data allows us to see what makes a good basketball player, or at least
what made Kobe as successful as he was. The analysis was
done in Python, which provides a variety of packages for
machine learning. In this study we look at different parameters and make
an attempt at Kaggle's competition.


To understand the importance of data analytics we must first look at the
organizations that would be using it. Today about half the teams in the
NBA are worth over one billion dollars [7]. The average player salary is
4.9 million dollars [3]. Kobe Bryant held a salary of 23.5 million dollars in 2015.
With this type of investment, team owners expect returns from their players.
In the past the most that could be said about players was whether they could
make three point shots, free throws, lay-ups, etc. These statistics were based on
simple models of how many shots were made compared to attempts. Today machine
learning allows us to go deeper. We can use a combination of sensors and data
to see what makes certain athletes perform better.


The goal of this study was to see if it was possible to predict whether or not
Kobe Bryant made a basket based on different parameters. These types of
predictions help in understanding a player and which aspects of their game
they need to work on.


Machine Learning
What is it?

Machine learning, as defined by Britannica, is a subfield of computer science
with an emphasis on pattern recognition and autonomous learning [1]. Machine
learning uses complex algorithms that learn a pattern from the
inputs in order to make a prediction for the unknown. Today this is widely used in
almost every industry to match a product with a person. Some examples include
Netflix's movie recommendations based on previous selections, Amazon's product
suggestions, and even decisions on credit approval and limits.



This study was done using Python, a high-level programming language. It provides
a variety of packages that offer a wide range of functionality. The two main
packages used in this study were Pandas and XGBoost. Pandas provides easy-to-use
data structures and data analysis tools [4]. XGBoost is a library designed for
boosted tree algorithms. XGBoost has been used by many of the previous winners
at Kaggle [5]. It is one of the fastest implementations of gradient boosting. It
can also be multi-threaded. For this study we will not be looking into its
multi-threading capabilities, only its data processing.


Gradient Boosting

Gradient boosting is a technique used for regression and classification. It
builds a prediction model from a group of weak prediction models, typically
decision trees. Gradient boosting combines gradient descent and boosting.
Gradient descent looks for a local minimum by taking steps proportional to
the negative of the gradient at the current point. Boosting is used to reduce
bias and variance. It works by running a weak learner multiple times on the
training data and then letting the learned classifiers vote. Each iteration
weights each training example by how incorrectly it was classified, learns a
hypothesis, and assigns a strength to that hypothesis. At the end, the final
classifier takes a linear combination of the votes of the different classifiers,
weighted by their strength.
It seeks to approximate

$$F(x) = \sum_{i} \gamma_i h_i(x) + \text{const}$$
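To make the additive form above concrete, here is a minimal sketch of gradient boosting on squared loss, using one-split regression stumps as the weak learners. It is an illustration only and not the implementation used in this study: the function names (`fit_stump`, `gradient_boost`), the toy sine data, and the choice of a single fixed learning rate standing in for every weight in the sum are all assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Least-squares fit of a one-split regression stump to the residuals."""
    best = None
    # Candidate thresholds exclude the largest x so the right side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Return the constant term and the list of stumps h_i."""
    const = y.mean()
    prediction = np.full_like(y, const, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is simply the residual.
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return const, stumps

def predict(const, stumps, x, learning_rate=0.1):
    """F(x) = const + sum of learning_rate * h_i(x)."""
    return const + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Toy data (made up for illustration).
x = np.linspace(0.0, 6.0, 60)
y = np.sin(x)
const, stumps = gradient_boost(x, y)
mse_start = np.mean((y - const) ** 2)
mse_end = np.mean((y - predict(const, stumps, x)) ** 2)
```

Each round fits the next weak learner to what the current model still gets wrong, which is the "step along the negative gradient" described above.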


Gradient Boosting Algorithm [2]

XGBoost Speed Comparison [5]



The data set was provided by Kaggle. It contains every shot attempted by Kobe
Bryant during his 20-year career, 30698 entries in total. Each entry has
information on each of the categories below. Of the 30698 entries, 5000
have a blank under the shot made flag. The goal was to predict whether those
blanks were shots made or missed. The predictions were then submitted to Kaggle
and a rank was assigned.
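The split between labeled and blank entries can be sketched with pandas as follows. This is a hedged illustration, not the original script: the tiny in-memory table stands in for the Kaggle file (which would normally be read with `pd.read_csv`), and the underscore column names such as `shot_made_flag` are assumed spellings of the categories listed in this section.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Kaggle data; NaN marks a blank shot made flag.
shots = pd.DataFrame({
    "shot_id": [1, 2, 3, 4, 5],
    "shot_distance": [2, 15, 24, 0, 18],
    "shot_made_flag": [1.0, 0.0, np.nan, 1.0, np.nan],
})

# Rows with a blank flag are the ones to predict; the rest are for training.
unlabeled = shots["shot_made_flag"].isna()
train = shots[~unlabeled]
test = shots[unlabeled]
```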



action type, loc x, shot type, team name, combined shot type, loc y,
shot zone area, game date, game event id, seconds remaining, shot zone basic,
game id, minutes remaining, shot distance, shot zone range, shot made flag,
team id, shot id


Here we will look at some of the different parameters provided by Kaggle
and see what type of information can be gathered about Kobe Bryant through
the use of Python packages.



From the original plot of location it seems that Kobe is a very versatile
shooter. He takes shots from almost everywhere on the court. It can be noted
that Kobe does not attempt to score from the left corner and on the
three point line. The problem with this graph is the amount of clutter and
overlapping data points.

The heat map [6] of shots gives us a better understanding of where Kobe prefers
to shoot from. The area near the rim shows the greatest
amount of activity. From this it's safe to assume Kobe takes the majority of his
shots near the rim and shoots less frequently further out.

Now graphing the shots that he made, we see that Kobe is overall a great
basketball player. He works around the court and isn't bound to specific
locations to make a worthwhile shot.

Transitioning this data to a heatmap allows for a better visual of Kobe's
shots. From here it's clear that while Kobe can make shots from all over the
court, he is more comfortable with shots made near the rim. When compared
with the heatmap of all shots, it's clear that Kobe's accuracy further out isn't
very high.

Looking at which shots Kobe misses not only gives a better look at
his performance but could also be used to develop him as a player, if he were
still playing. While the graph looks quite similar to the previous one, note
the increase in the quantity of shots near the three point line, clearly
one of the weaker points in his game.

The heatmap outlines where exactly Kobe seems to miss the most. Ignoring
the area right under the rim, the map shows Kobe has the most trouble
near the three point line. He misses more frequently the further out he is.
It is important to note that the above graphs are not normalized, so the
information under the rim can seem misrepresented. Kobe attempts the most
shots under the rim, which results in more shots both made and missed there
than in the rest of the areas. Despite this, it is still clear when looking
at all three heatmaps side by side that Kobe is better at making the basket
under the rim than at the three point line.
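One way to address the normalization issue noted above is to divide makes by attempts in each court cell, so that busy areas such as the rim are not over-represented. The sketch below is an assumption about how that could be done with NumPy, not part of the original analysis; `accuracy_grid` is a hypothetical helper and the coordinates are made up.

```python
import numpy as np

def accuracy_grid(loc_x, loc_y, made, bins=10):
    """Per-cell attempts and shooting accuracy on a bins x bins grid."""
    attempts, x_edges, y_edges = np.histogram2d(loc_x, loc_y, bins=bins)
    # Weighting each shot by its made flag (0 or 1) counts makes per cell.
    makes, _, _ = np.histogram2d(loc_x, loc_y, bins=[x_edges, y_edges], weights=made)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Cells with no attempts have undefined accuracy and are marked NaN.
        accuracy = np.where(attempts > 0, makes / attempts, np.nan)
    return attempts, accuracy

# Made-up shots: three near one corner of the grid, four near the other.
loc_x = np.array([0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0])
loc_y = np.array([0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0])
made = np.array([1, 1, 0, 1, 0, 0, 0], dtype=float)
attempts, accuracy = accuracy_grid(loc_x, loc_y, made, bins=2)
```

Plotting `accuracy` instead of raw counts would show where shots go in rather than where they are taken.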


Shot Type

To complement the heat map data, it is important to look at the type of shots
made. The type of shot correlates with location. For example, a dunk is made at
the rim while a free throw is further out. This shows which type of shot Kobe has
the greatest accuracy with. The chart shows dunks with the highest percentage,
as expected due to the shot style. Next is the hook bank shot, which is
performed near the rim. On the other hand, his jump shots have poor accuracy,
which again maps well to the assumptions made with the heatmap, since most
jump shots are taken at a distance from the hoop.

This, combined with the heatmap, would allow for a better training regimen and
game plan to provide greater success for Kobe or any other professional basketball player.
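The per-shot-type accuracy described above can be computed with a pandas group-by. The following is a sketch under assumptions: the column names `combined_shot_type` and `shot_made_flag` are assumed spellings of the categories in the data set, and the six rows are invented purely to show the mechanics.

```python
import pandas as pd

# Made-up rows; the real data has one row per shot attempt.
shots = pd.DataFrame({
    "combined_shot_type": ["Dunk", "Dunk", "Jump Shot", "Jump Shot", "Jump Shot", "Layup"],
    "shot_made_flag": [1, 1, 0, 1, 0, 1],
})

# The mean of a 0/1 flag is the fraction of shots made.
accuracy_by_type = (
    shots.groupby("combined_shot_type")["shot_made_flag"]
    .agg(attempts="count", accuracy="mean")
    .sort_values("accuracy", ascending=False)
)
```

Keeping the attempt count next to the accuracy helps avoid over-reading shot types that were tried only a handful of times.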


The data is loaded into Python and split by the shot made flag. Blanks are
the test data and the rest are the training data. Columns with information
that is considered unimportant or of little value are dropped. The remaining
information is stored in a dictionary. The dictionary is passed into the gradient
boosting API, where the different parameters are adjusted for optimal results.
Prediction probabilities are calculated using XGBoost's prediction API. The features
used for this study were action, location, opponent, time, shot type, zone, season,
range, opponent, period, distance, and playoffs. Of the 25 total categories these
seemed to be the most relevant when evaluating Kobe's playstyle. Knowing
the location, shot type, zone and action allows for a better understanding of
how Kobe moves around the court and what types of shots he prefers and excels
at. The time, season, playoffs, period, and opponent help build a mental
picture of Kobe, showing how stress affects him. By combining these
it's possible to make proper predictions of whether Kobe will make a shot or
not. The data is input into XGBoost. The model used by XGBoost is the tree
ensemble [8]. A tree ensemble is a set of classification and regression trees,
also known as CART. The data is classified into different leaves and a score is
assigned to each leaf, rather than only a decision value as in a plain decision
tree. Since a single tree is weak, XGBoost combines multiple trees and adds up
their weighted votes/decisions at the end. The important thing to
remember is that the model for a random forest and a boosted tree is exactly the
same: a tree ensemble. The difference comes in how they are trained. To
learn the trees, an objective function must be defined, containing both a training
loss and a regularization term.

Training all the trees at once can be difficult, so an additive method is
preferred. What has already been learned is kept fixed, and one new tree is
added at a time.

At each step, the tree added is the one that optimizes the objective function.
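The objective and the additive update can be written out as follows. This follows the usual presentation of boosted trees and is offered as a sketch; the symbols $l$ (training loss), $\Omega$ (regularization), $f_k$ (the $k$-th tree), $n$ (number of examples) and $K$ (number of trees) are notation assumed here rather than taken from the text above.

$$\text{obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k)$$

$$\hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

$$\text{obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{const}$$

At step $t$ only $f_t$ is free, so the trees from earlier steps contribute a constant to the regularization term.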

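Besides the training, the complexity of the tree is also important. The
tree is defined as

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^{T}, \quad q : \mathbb{R}^{d} \rightarrow \{1, 2, \ldots, T\}$$

where w is the vector of scores on the leaves, q is the function assigning each
data point to its corresponding leaf, and T is the number of leaves. With
XGBoost the complexity is defined as

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$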


[Table: Kaggle Score by Max Depth, Learning Rate, and N estimators]

The best result was with a max depth of 6, a learning rate of 0.014 and
N estimators of 1000. Tweaking these parameters further did not provide
improvement, which means either the model is at its limit or the style of tuning
can be improved.
Gradient boosting has many different parameters that can be adjusted. For
this study the max depth, learning rate and N estimators were adjusted. The
max depth represents the maximum depth of a tree. This can be used to control
overfitting. The learning rate helps make the model more robust by shrinking
the weights at each step. Lastly, N estimators is the number of trees in the
ensemble.

Parameters left static: subsample, colsample bytree, seed


The subsample represents the fraction of data to be randomly sampled for each
tree. The lower the number, the more conservative the algorithm, which
can prevent overfitting. The colsample bytree denotes the fraction of columns
to be randomly sampled for each tree. If the value is too low the result will face
underfitting. The seed is a random number seed; fixing it keeps runs
reproducible while parameters are compared.


Kaggle Scoring Method

For this competition Kaggle evaluates the results based on log loss.
$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right)$$
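For reference, the log loss formula can be implemented in a few lines of plain Python. This is a sketch, not Kaggle's scoring code; the function name `log_loss` and the clipping constant `eps` (a common safeguard against taking the log of zero) are assumptions.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that log(0) never occurs for over-confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Predicting 0.5 for every shot gives ln(2), a useful baseline.
baseline = log_loss([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
confident = log_loss([1, 0], [0.9, 0.1])
```

A model that always predicts 0.5 scores ln 2, roughly 0.693, so competition scores are best read against that baseline.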

The best score was 0.59509. As of June 2nd this score has been beaten, with
the best now at 0.57866.


The biggest challenge was learning to do machine learning in a new environment.
Previously, the work done in MATLAB had the skeleton already written and
only the details needed to be added. This time everything had to be done from
scratch. Python's packages have many different APIs that are useful but require
time to figure out which ones to use for what. The main packages used were pandas,
seaborn, and xgboost. After learning a new language, issues came in setting
up the proper parameters to get optimal results. With the slightest changes
the results would change drastically. After a certain point it became difficult
to see any improvement at all with the current system. While a more mathematical
approach is possible using XGBoost, it was unfortunately not implemented
successfully in this design.


There are many adjustments that can be made to improve the quality of the
results. The first is proper tuning of the parameters used for the
gradient boosting. An attempt was made to follow the work of Aarshay Jain
[tuning] but was unsuccessful due to time constraints. This method of tuning
would be more systematic and help provide actual optimal values for the 4
parameters. In this study the parameters were slightly increased and decreased
to see how they affected the results. Next, the features selected were based more on
what seemed important than on a complete mathematical approach. Cross
validation could be used to see exactly which features should be used and which
ignored to make a better prediction.
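A more systematic search than nudging values by hand could look like the grid search sketched below. It is an illustration under assumptions, not the tuning method cited above: `grid_search` is a hypothetical helper, and `toy_score` is a contrived stand-in for a cross-validated log loss whose minimum is placed at the reported best values only so that the example has a known answer.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Try every combination in the grid and keep the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Contrived stand-in; a real score_fn would train a model with these
    # parameters and return its cross-validated log loss.
    return (params["max_depth"] - 6) ** 2 + abs(params["learning_rate"] - 0.014)

grid = {"max_depth": [4, 6, 8], "learning_rate": [0.01, 0.014, 0.05]}
best_params, best_score = grid_search(toy_score, grid)
```

The same loop could iterate over feature subsets instead of parameter values to test which features are worth keeping.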



Through the use of machine learning and data analysis, teams can
work to improve the quality of their players and team performance. From the
basic analysis of Kobe's data it was clear that he doesn't make many shots
near the three point line, so shots from there should be avoided, while his
accuracy near the rim is much greater. It was also clear that Kobe is able to
keep his composure throughout the game, making him an overall great athlete.
These are just some of the facts that can be pulled from this data. Combining
this with machine learning, a team could potentially figure out exactly what
play has the highest probability of working against an opponent and which
aspects each player should work on.
While this script did not get rank 1, its results were quite good. The best score
received a log loss of 0.59509 while my best submission was 0.60262,
giving it a rank of 67 out of about 550 teams. It was interesting
to see it get to the point where trying to get a 0.001 improvement was almost
impossible. It gives perspective on those working with machine learning who strive
to increase their accuracy by a fraction of a percent. For this study XGBoost
was used to do gradient boosting on the dataset while adjusting 4 parameters
of the tree and using 12 of the 25 features.




[2] Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." (February