
Python

Zibran Chaus

June 3, 2016

Abstract

Kobe Bryant was one of the most influential basketball players of our time, and data was gathered throughout his successful career. This data allows us to see what makes a good basketball player, or at least what made Kobe as successful as he was. The analysis was done in Python, which provides a variety of packages for machine learning. In this study we look at different parameters and make an attempt at Kaggle's competition.

1 Introduction

To understand the importance of data analytics we must first look at the organizations that would be using it. Today about half the teams in the NBA are worth over one billion dollars [7]. The average salary of the players is 4.9 million dollars [3], and Kobe Bryant held a salary of 23.5 million dollars in 2015. With this type of investment, team owners expect returns from their players. In the past, the most that could be said about players was whether they could make three point shots, free throws, lay-ups, etc. These statistics were based on simple models of how many shots were made compared to how many were attempted. Today machine learning allows us to go deeper: we can use a combination of sensors and data to see what makes certain athletes perform better.

2 Problem

The goal of this study was to see whether it is possible to predict if Kobe Bryant made a basket based on different parameters. These types of predictions help us understand a player and which aspects of their game they need to work on.

3 Machine Learning

3.1 What is it?

Machine learning is a discipline of artificial intelligence concerned with computer software that can learn from data, with an emphasis on pattern recognition and autonomous learning [1]. Machine learning uses complex algorithms that learn a pattern from the inputs in order to make predictions for the unknown. Today this is widely used in almost every industry to match a product with a person. Some examples include Netflix's movie recommendations based on previous selections, Amazon's product suggestions, and even credit approval and limit decisions.

3.2 Python

This study was done using Python, a high-level programming language that provides a variety of packages offering a wide range of functionality. The two big machine learning packages used in this study were Pandas and XGBoost. Pandas provides easy-to-use data structures and data analysis tools [4]. XGBoost is a library designed for the boosted trees algorithm and has been used by many previous winners at Kaggle [5]. It is one of the fastest gradient boosting implementations, and it can also be multi-threaded. For this study we will not be looking into its multi-threading capabilities, only its data processing.
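To illustrate the kind of easy-to-use data structures Pandas provides, here is a minimal sketch; the column names and values are made up for the example:

```python
import pandas as pd

# A small, made-up frame in the same spirit as the shot data.
shots = pd.DataFrame({
    "shot_type": ["Jump Shot", "Dunk", "Jump Shot", "Layup"],
    "made": [0, 1, 1, 1],
})

# Accuracy per shot type: mean of the 0/1 made flag within each group.
accuracy = shots.groupby("shot_type")["made"].mean()
print(accuracy)
```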

3.3 Gradient Boosting

Gradient boosting is a technique used for regression and classification. It produces a prediction model from a group of weak prediction models called decision trees. Gradient boosting is the combination of gradient descent and boosting. Gradient descent looks to find a local minimum by taking a step proportional to the negative of the gradient at the current point. Boosting is used to reduce bias and variance. It is done by taking weak learners, running them multiple times on training data, and then letting the learned classifiers vote. Each iteration weights each training example by how incorrectly it was classified, learns a hypothesis, and assigns that hypothesis a strength. At the end, the final classifier takes a linear combination of the votes of the different classifiers, weighted by their strengths.

It seeks to approximate

$$F(x) = \sum_{i=1}^{M} \gamma_i h_i(x) + \text{const}$$

where the $h_i$ are the weak learners and the $\gamma_i$ are their weights [2].
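To make the boosting idea concrete, the toy sketch below fits depth-1 regression trees to the residuals of the running prediction, which is gradient descent on squared loss in function space. The scikit-learn stumps are an assumption used purely for illustration; scikit-learn is not one of the packages used in this study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

M, lr = 50, 0.1
pred = np.full_like(y, y.mean())   # the constant term in F(x)
for _ in range(M):
    residual = y - pred            # negative gradient of the squared loss
    h = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += lr * h.predict(X)      # gamma_i absorbed into the learning rate

print("training MSE:", np.mean((y - pred) ** 2))
```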

4 Data

4.1 General

The data set was provided by Kaggle. It contains every shot taken by Kobe Bryant during his 20-year career, 30,698 entries in total. Each entry has information on each of the categories below. Of the 30,698 entries, 5,000 have a blank under the shot made flag. The goal was to predict whether those blanks are a shot made or missed. The predictions were then submitted to Kaggle and a rank was assigned.
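A minimal sketch of this split, assuming the Kaggle file has been saved locally as data.csv (the file name is an assumption):

```python
import pandas as pd

# Assumes the Kaggle data has been saved locally as "data.csv".
df = pd.read_csv("data.csv")

# The 5,000 entries with a blank shot_made_flag are the ones to predict.
test = df[df["shot_made_flag"].isnull()]
train = df[df["shot_made_flag"].notnull()]
print(len(train), "training rows,", len(test), "rows to predict")
```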

4.2 Categories

action type, game date, game event id, game id, lat, loc x, loc y, lon, matchup, minutes remaining, opponent, period, playoffs, season, seconds remaining, shot distance, shot id, shot made flag, shot type, shot zone area, shot zone basic, shot zone range, team id, team name

5 Analysis

Here we will look at some of the different parameters provided by Kaggle and see what type of information can be gathered about Kobe Bryant through the use of Python packages.

5.1 Location

From the original plot of shot locations it seems that Kobe is a very versatile shooter: he takes shots from almost everywhere on the court. It can be noted that Kobe does not attempt to score from the left corner of the three point line. The problem with this graph, however, is the amount of clutter and overlapping data points.


The heat map [6] of shots gives us a better understanding of Kobe's preference for where he likes to shoot from. The area near the rim shows the greatest amount of activity. From this it's safe to assume Kobe takes the majority of his shots near the rim and shoots less frequently farther out.

Graphing only the shots that he made, we see that Kobe is overall a great basketball player. He works around the court and isn't bound to specific locations to make a worthwhile shot.

A heat map was also generated for the made shots. From here it's clear that while Kobe can make shots from all over the court, he is more comfortable with shots near the rim. When compared with the heat map of all shots, it's clear that Kobe's accuracy farther out isn't very high.

Looking at which shots Kobe misses allows us not only to get a better look at Kobe's performance, but could also be used to develop him as a player, if he were still playing. While the graph does look quite similar to the previous one, the increase in the quantity of shots near the three point line should be noted. This is clearly one of the weaker points in his game.

The heat map outlines where exactly Kobe seems to miss the most. Ignoring the area right under the rim, the map shows Kobe has the most trouble near the three point line; the farther out he is, the more frequently he misses.

It is important to note that the above graphs are not normalized, so the information under the rim can seem misrepresented. Kobe attempts the most shots under the rim, which results in more shots both made and missed compared to the rest of the areas. Despite this, it is still clear when looking at all three heat maps side by side that Kobe is better at making baskets under the rim than at the three point line.
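A sketch of how such shot charts can be drawn from the loc x / loc y columns; the hexbin plot is a stand-in for the heat-map style of [6], and data.csv is the same assumed file name as above:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv")
made = df[df["shot_made_flag"] == 1]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
# Raw scatter of every attempt: versatile but cluttered.
ax1.scatter(df["loc_x"], df["loc_y"], s=2, alpha=0.2)
ax1.set_title("All attempts")
# Density of made shots: highest near the rim.
ax2.hexbin(made["loc_x"], made["loc_y"], gridsize=40, cmap="Reds")
ax2.set_title("Made shots (density)")
plt.show()
```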

5.2 Shot Type

To complement the heat map data, it is important to look at the types of shots taken. The type of shot correlates with location; for example, a dunk is made at the rim while a free throw is taken farther out. This shows what type of shot Kobe has the greatest accuracy with. The chart shows dunks with the highest percentage, as expected due to the shot style. The next is the hook bank shot, which is performed near the rim. On the other side, his jump shots have poor accuracy, which again maps well to the assumptions made from the heat map, since most jump shots are taken at a distance from the hoop. This, combined with the heat map, would allow for a better training regimen and game plan to provide greater success for Kobe or any other professional basketball player.
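A sketch of how the per-shot-type accuracy chart can be reproduced with Pandas; the underscored column names correspond to the categories listed in Section 4.2, and the file name is again an assumption:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv")
train = df[df["shot_made_flag"].notnull()]

# Accuracy per action type; dunks sit near the top, jump shots much lower.
acc = (train.groupby("action_type")["shot_made_flag"]
            .agg(["mean", "count"])
            .sort_values("mean", ascending=False))
print(acc.head(10))

acc["mean"].head(10).plot.barh()
plt.xlabel("accuracy")
plt.show()
```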

6 Methodology

The data is loaded into Python and split by the shot made flag: blanks are the test data and the rest are the training data. Columns with information that is considered unimportant or of little value are dropped. The remaining information is stored in a dictionary, which is passed into the gradient boosting API, where the different parameters are adjusted for optimal results. Prediction probabilities are calculated using XGBoost's prediction API. The features used for this study were action, location, opponent, time, shot type, zone, season, range, period, distance, and playoffs. Of the 25 total categories these seemed the most relevant for evaluating Kobe's play style. Knowing the location, shot type, zone, and action allows for a better understanding of how Kobe moves around the court and which types of shots he prefers and excels at. The time, season, playoffs, period, and opponent help build a mental picture of Kobe and show how stress affects him. By combining these it is possible to make proper predictions of whether Kobe will make a shot or not.

The data is input into XGBoost. The model used by XGBoost is tree ensembles [8]. A tree ensemble is a set of classification and regression trees, also known as CART. The data is classified into different leaves and scores are assigned to the leaves; each leaf contains only a decision value. Since a single tree is weak, XGBoost combines multiple trees and uses a classifier at the end to combine all the weighted votes/decisions. The important thing to remember is that the model for a random forest and for a boosted tree is exactly the same: a tree ensemble. The difference comes in how they are trained. To learn the trees, an objective function must be defined that includes both a training loss and a regularization term.
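Following the XGBoost model documentation [8], such an objective has the form

$$\text{obj}(\theta) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l$ is the training loss, such as the log loss below, and $\Omega$ penalizes the complexity of each tree $f_k$.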

Training all the trees at once can be difficult, so an additive method is preferred: what has been learned from the existing trees is kept, and new trees are added one at a time. After each step, the tree that optimizes the objective function is added.
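Written out, this additive strategy [8] means the prediction at step $t$ is the previous prediction plus one new tree:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$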

Other than the training, the complexity of the tree is also important. The tree is defined as

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q : \mathbb{R}^d \to \{1, 2, \dots, T\}$$

where $w$ is the vector of scores on the leaves, $q$ is the function assigning each data point to a leaf, and $T$ is the number of leaves. With XGBoost the complexity is defined as

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
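Putting the methodology together, here is a minimal end-to-end sketch under the same assumptions as before: the data.csv file name, the scikit-learn-style XGBoost wrapper, and simple integer encoding of the categorical features. The study's actual script used the dictionary-based API, so this is an approximation, not the author's code; the twelve underscored column names map onto the features listed above.

```python
import pandas as pd
import xgboost as xgb

df = pd.read_csv("data.csv")

# Features kept in this study (a subset of the 25 categories).
features = ["action_type", "loc_x", "loc_y", "opponent", "seconds_remaining",
            "shot_type", "shot_zone_area", "season", "shot_zone_range",
            "period", "shot_distance", "playoffs"]

# Simple integer encoding for the categorical columns.
X = df[features].copy()
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].astype("category").cat.codes

train_mask = df["shot_made_flag"].notnull()
model = xgb.XGBClassifier(max_depth=6, learning_rate=0.014, n_estimators=1000)
model.fit(X[train_mask], df.loc[train_mask, "shot_made_flag"].astype(int))

# Probabilities for the 5,000 blank entries, ready for submission.
probs = model.predict_proba(X[~train_mask])[:, 1]
```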

7 Results

Kaggle Score   Max Depth   Learning Rate   N estimators
0.60262        6           0.014           1000
0.60354        10          0.010           1000
0.62101        6           0.014           5000
0.60266        7           0.012           1000
0.60277        6           0.010           1000
0.60338        6           0.017           1000
0.60367        7           0.014           1000

The best result was with a max depth of 6, a learning rate of 0.014, and N estimators of 1000. Tweaking these parameters further did not provide improvement, which means either the model is at its limit or the style of tuning can be improved.

Gradient boosting has many different parameters that can be adjusted. For this study the max depth, learning rate, and N estimators were adjusted. The max depth represents the maximum depth of a tree and can be used to control overfitting. The learning rate helps make the model more robust by shrinking the weights at each step. Lastly, N estimators is the number of trees being built.

Parameters left static:

subsample          0.6
colsample bytree   0.62
seed               1

The subsample represents the fraction of data to be randomly sampled for each tree. The lower the number, the more conservatively the algorithm works, which can prevent overfitting. The colsample bytree denotes the fraction of columns to be randomly sampled for each tree; if the value is too low, the result will be underfitting. The seed is a random number seed and can be used to make parameter tuning reproducible.
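A sketch of how these settings translate into an XGBoost model using its scikit-learn-style wrapper; the exact training-script details are not given in this paper, so this illustrates the parameter names rather than the author's script:

```python
import xgboost as xgb

# Best-scoring row from the table above, plus the static parameters.
model = xgb.XGBClassifier(
    max_depth=6,
    learning_rate=0.014,
    n_estimators=1000,
    subsample=0.6,
    colsample_bytree=0.62,
    random_state=1,   # the "seed" parameter above
)
# model.fit(X_train, y_train) would train it; predict_proba(X_test)[:, 1]
# then gives the shot-made probabilities for submission.
```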

7.1 Scoring

For this competition Kaggle evaluates the results based on log loss:

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right)$$

The best score was 0.59509. As of June 2nd this score has been beaten, with the best now at 0.57866.
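For reference, a minimal sketch of the metric itself; the probability clipping guards against log(0) and is an added detail, not part of Kaggle's definition:

```python
import numpy as np

def log_loss(y_true, p, eps=1e-15):
    """Binary log loss as defined above."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(log_loss([1, 0, 1], [0.9, 0.2, 0.6]))  # roughly 0.28
```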

8 Challenges

The biggest challenge was learning to do machine learning in a new environment. Previously, the work done in MATLAB had the skeleton already written and only the details needed to be added; this time everything had to be written from scratch. Python's packages have many different APIs that are useful but do require time to figure out which ones should be used for what. The main packages used were pandas, seaborn, and xgboost. After learning the new language, the next issue was setting up the proper parameters to get optimal results. With the slightest changes the results would drastically change, and after a certain point it became difficult to see any improvement at all with the current system. While a more mathematical approach to tuning XGBoost is possible, it was unfortunately unsuccessful in implementation in this design.

9 Improvements

There are many adjustments that could be made to improve the quality of the results. The first is a proper tuning of the parameters used for the gradient boosting. An attempt was made to follow the work of Aarshay Jain [tuning] but was unsuccessful due to time constraints. That method of tuning would be more systematic and would help provide actual optimal values for the 4 parameters; in this study the parameters were simply increased and decreased slightly to see how that affected the results. Next, the features were selected based more on what seemed important than through a complete mathematical approach. Cross validation could be used to see exactly which features should be used or ignored to make a better prediction.
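A sketch of the kind of systematic search meant here, using scikit-learn's grid search over the XGBoost wrapper; scikit-learn and the grid values are assumptions for illustration:

```python
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [5, 6, 7],
    "learning_rate": [0.01, 0.014, 0.02],
}
search = GridSearchCV(
    xgb.XGBClassifier(n_estimators=1000, subsample=0.6,
                      colsample_bytree=0.62, random_state=1),
    param_grid,
    scoring="neg_log_loss",  # matches Kaggle's metric
    cv=5,
)
# search.fit(X_train, y_train); search.best_params_ then gives the
# cross-validated optimum instead of manual nudging.
```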

10 Conclusion

Through the use of machine learning and data analysis, teams can work to improve the quality of their players and their team's performance. From the basic analysis of Kobe's data it was clear that he doesn't make many shots near the three point line, so those shots should be avoided, while his accuracy near the rim is much greater. It was also clear that Kobe is able to keep his composure throughout the game, making him an overall great athlete. These are just some of the facts that can be pulled from this data. Combining this with machine learning, a team could potentially figure out exactly which play has the highest probability of working against an opponent and which aspects each player should work on.

While this script did not reach rank 1, its results were quite good. The best score received a log loss of 0.59509, while my best submission was 0.60262, giving it a rank of 67 out of about 550 teams. It was interesting to reach the point where trying to get a 0.001 improvement was almost impossible; it gives perspective on those working with machine learning who strive to increase their accuracy by a fraction of a percent. For this study XGBoost was used to do gradient boosting on the dataset while adjusting 4 parameters of the tree and using 12 of the 25 features.

11 Citations

[1] http://www.britannica.com/technology/machine-learning
[2] Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine." (February 1999)
[3] http://www.forbes.com/sites/kurtbadenhausen/2015/01/23/average-mlb-salary-nearly-double-nfls-but-trails-nba-players/#3522707f269e
[4] http://pandas.pydata.org
[5] http://www.r-bloggers.com/an-introduction-to-xgboost-r-package/
[6] http://savvastjortjoglou.com/nba-shot-sharts.html
[7] http://sports.yahoo.com/blogs/nba-ball-dont-lie/forbes--valuations-are-in--13-nba-teams-are-worth-over-a-billion-190428258.html
[8] http://xgboost.readthedocs.io/en/latest/model.html
