Udit Ennam
MSCS - Data Science (Class of 2019)
Rutgers University, New Brunswick
Outline
● Introduction
● The 6 Data Mining Steps
○ Problem Definition
○ Data Preparation
○ Data Exploration
○ Modeling
○ Evaluation
○ Deployment
● Challenges/Limitations
● Tools / Technologies
● References
Introduction
● Customers are the backbone of any
business.
Definition
● What are the customer’s expectations?
○ Personalized feeling
○ Content Recommendations
● JSON columns: 4
○ Each of them in the following format:
'{"browser": "Chrome", "browserVersion": "not available in demo dataset", "browserSize":
"not available in demo dataset", "operatingSystem": "Windows", "operatingSystemVersion":
"not available in demo dataset", "isMobile": false, "mobileDeviceBranding": "not available
in demo dataset", "mobileDeviceModel": "not available in demo dataset",
"mobileInputSelector": "not available in demo dataset", "mobileDeviceInfo": "not available
in demo dataset", "mobileDeviceMarketingName": "not available in demo dataset",
"flashVersion": "not available in demo dataset", "language": "not available in demo
dataset", "screenColors": "not available in demo dataset", "screenResolution": "not
available in demo dataset", "deviceCategory": "desktop"}'
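A JSON column in this format can be expanded into ordinary flat columns before exploration. The following is only a minimal sketch, assuming pandas is available. The column name "device", the "visitId" column, the helper name flatten_json_column, and the sample values are all hypothetical, chosen for illustration.

```python
import json

import pandas as pd


def flatten_json_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Expand one JSON-string column into flat columns prefixed with its name."""
    # Parse each JSON string into a Python dict.
    parsed = df[column].apply(json.loads)
    # Turn the list of dicts into a DataFrame with one column per key.
    expanded = pd.json_normalize(parsed.tolist())
    expanded.columns = [f"{column}.{name}" for name in expanded.columns]
    # Align the new columns with the original rows before joining.
    expanded.index = df.index
    return df.drop(columns=[column]).join(expanded)


# Tiny illustrative frame; the column names and values are made up.
sample = pd.DataFrame({
    "visitId": [1, 2],
    "device": [
        '{"browser": "Chrome", "isMobile": false, "deviceCategory": "desktop"}',
        '{"browser": "Safari", "isMobile": true, "deviceCategory": "mobile"}',
    ],
})
flat = flatten_json_column(sample, "device")
```

Fields that only ever contain "not available in demo dataset" carry no information and could be dropped after flattening.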
Possible explanation: The number of visits peaks in November, but the mean revenue generated is the lowest of all the months. This could be because of ‘Black Friday’ and ‘Thanksgiving’ sales [discounts generally range from 50-80%].
Possible explanation: The number of visits on weekdays is higher than on weekends. This could mostly be because customers tend to make purchases at their workplace, as Google Analytics plays a critical role in how most companies market nowadays.
Outlier Detection
● There do not appear to be many outliers. The learned decision function also does not look
good because of the large number of zero transactions. So, we do not remove outliers from this dataset.
Columns of importance after data exploration
Type of column Column Names Data Type of Column
Nominal categorical variables were encoded; the total number of columns was 84
● Scaling
Features were standardized using the StandardScaler utility class from scikit-learn's preprocessing module
● 10-fold cross-validation was then applied to the models to guard against overfitting.
● Another way of avoiding overfitting is regularization. I used Elastic-Net regression with cross-validation
to choose the lambda (regularization strength) values, because it copes well with correlated
predictors.
● Hyper-parameter tuning was done using GridSearchCV from the scikit-learn library.
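The scaling, cross-validation, regularization, and tuning steps listed above can be combined in a single scikit-learn pipeline. This is a sketch under stated assumptions, not the project's actual code: the data is synthetic, the grid values are arbitrary, and scoring by mean absolute error is an illustrative choice. In scikit-learn the regularization strength usually written as lambda is the `alpha` parameter, and `l1_ratio` sets the mix between L1 and L2 penalties.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the real features are not shown here.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Scaling lives inside the pipeline so each CV fold is standardized
# using statistics from its own training split only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", ElasticNet(max_iter=10_000)),
])

# Illustrative grid (assumption); "model__" routes each value to the ElasticNet step.
param_grid = {
    "model__alpha": [0.01, 0.1, 1.0],
    "model__l1_ratio": [0.2, 0.5, 0.8],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=10,  # 10-fold cross-validation
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a positive MAE
```

Keeping the scaler inside the pipeline avoids leaking validation-fold statistics into training, which would make the cross-validated scores optimistic.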
5. Evaluation
Evaluation Metrics for our Model
Classification Model:
Regression Model:
● Mean absolute error (MAE): RMSE was not used because it is harder to read directly as a monetary
profit or deficit.
● Adjusted R-squared value: R2 is the ratio of explained variation to total variation, so the
closer R2 is to 1, the better the regression model. Adjusted R2
is more useful when there are multiple predictors, because it penalizes adding predictors that do not improve the fit.
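Both regression metrics can be computed in a few lines. The sketch below assumes scikit-learn for MAE and R2 and applies the standard adjustment 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. The helper name adjusted_r2 and the sample values are hypothetical.

```python
from sklearn.metrics import mean_absolute_error, r2_score


def adjusted_r2(y_true, y_pred, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    n = len(y_true)
    if n - n_features - 1 <= 0:
        raise ValueError("need more samples than predictors + 1")
    r2 = r2_score(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


# Made-up revenue values, for illustration only.
y_true = [0.0, 10.0, 20.0, 30.0, 40.0]
y_pred = [2.0, 8.0, 22.0, 28.0, 40.0]

mae = mean_absolute_error(y_true, y_pred)          # average absolute error, in revenue units
adj = adjusted_r2(y_true, y_pred, n_features=2)    # assumes a model with 2 predictors
```

MAE stays in the same units as the target, which is why it maps directly onto a monetary amount.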
Evaluation of Classification Model
Output: Top x% of users with their VisitorIds and predicted revenue, where x can be defined by the client.
Sample output:
VisitorId PredictedRevenue
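Selecting the top x% of users from a table of predictions could look like the sketch below. It assumes pandas and a frame with the VisitorId and PredictedRevenue columns named above. The helper name top_percent_users, the rounding-up rule, and the sample values are assumptions made for illustration.

```python
import math

import pandas as pd


def top_percent_users(predictions: pd.DataFrame, x: float) -> pd.DataFrame:
    """Return the top x% of visitors ranked by predicted revenue."""
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    # Round up so that a small x still returns at least one visitor.
    n_keep = max(1, math.ceil(len(predictions) * x / 100))
    return (
        predictions.sort_values("PredictedRevenue", ascending=False)
        .head(n_keep)
        .reset_index(drop=True)
    )


# Made-up predictions, for illustration only.
preds = pd.DataFrame({
    "VisitorId": ["a", "b", "c", "d", "e"],
    "PredictedRevenue": [0.0, 12.5, 3.0, 40.0, 0.0],
})
top = top_percent_users(preds, x=40)
```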
Challenges/Limitations
● The target variable has 98.75% null entries, which led to large MAE values. Evaluating the
classification models was also tricky because most of the values were zeros
● Processing the JSON columns took a lot of time; it works fine with up to 2 GB of data. The model currently
works only with the Google Analytics dataset type
Tools / Technologies
● Web Framework: Flask API
● Front-end: HTML, CSS, JavaScript, Bootstrap
● Python IDE: Jupyter
● Important libraries used: Scikit-learn, Scikit-plot [helped easily visualize Gain chart], Plotly
[for making combination charts], PyOD [has multiple outlier detection algorithms],
XGBoost, joblib [to serialize and deserialize model and columns]
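Serializing a model together with its column list using joblib might look like the sketch below. It is illustrative only: a scikit-learn LinearRegression stands in for the trained model, and the column names, file name, and dictionary layout are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression

# Hypothetical feature names and a stand-in model (assumptions).
feature_columns = ["visits", "pageviews"]
model = LinearRegression().fit([[1, 2], [2, 3], [3, 5]], [1.0, 2.0, 3.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_bundle.joblib")
    # Store the model and its expected column order in one file.
    joblib.dump({"model": model, "columns": feature_columns}, path)
    # Deserialize, as a serving process would do at startup.
    bundle = joblib.load(path)

restored_model = bundle["model"]
restored_columns = bundle["columns"]
prediction = restored_model.predict([[2, 3]])
```

Saving the column list next to the model lets the serving code arrange incoming features in the order the model was trained on.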
References
[1] Zhao Y., Li B., Li X., Liu W., Ren S. (2005) Customer Churn Prediction Using Improved One-Class
Support Vector Machine. In: Li X., Wang S., Dong Z.Y. (eds) Advanced Data Mining and Applications.
ADMA 2005. Lecture Notes in Computer Science, vol 3584. Springer, Berlin, Heidelberg
[2] Neslin, S.A., Gupta, S., Kamakura, W., Lu, J., Mason, C.: Defection Detection: Improving Predictive
Accuracy of Customer Churn Models
[3] Nath, S.V., Behara, R.S.: Customer churn analysis in the wireless industry: A data mining approach.
In: Proceedings - Annual Meeting of the Decision Sciences Institute, pp. 505–510 (2003)
Thank You