
Exercise: These exercises have been written to give you a feel for the power of
Visual Analytics. Last date of submission: 9th Oct, 2017

These exercises are divided into two parts. Part I is easy.

Part II is important from the point of view of predictive analytics.
Part II also requires an understanding of mosaic plots. The accompanying
pdf file carries an explanation. It is easy.
Code for some portions of Part II is given. Unless you want to,
do not get into understanding the code. Just make the necessary changes,
copy and paste into RStudio, draw the plots and interpret them.

Dataset: School drop-out data from schools of Andhra Pradesh. The data was
collected to predict which child was about to drop out, so that
preventive measures could be taken. File name: studentDropIndia_20161215.csv
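
A minimal sketch of loading the file, assuming it sits in the R working directory. The object name 'school' matches the name used in the mosaic example of Part II. Reading text columns as factors is an assumption made here because the grouped plots later on expect a factor target.

```r
# Assumption: the CSV has been copied into the current working directory
csv_path <- "studentDropIndia_20161215.csv"

if (file.exists(csv_path)) {
  # stringsAsFactors = TRUE turns text columns (e.g. gender) into factors
  school <- read.csv(csv_path, stringsAsFactors = TRUE)
  str(school)   # lists column names and types
} else {
  message("File not found; check the working directory with getwd()")
}
```

The output of str() is a quick way to see which columns are numeric and which are categorical before starting on the plots.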

PART-I
-------
Histograms and Density Plots
1. Plot a histogram of 'science_marks'. Use bins = 10.
2. Plot a density plot of 'science_teacher'
3. Plot a density plot of 'science_teacher', 'continue_drop' wise,
and use alpha = 0.2. We will draw more density plots in Part II
and interpret them.
Bar charts
4. Draw a stacked bar chart of continue_drop vs gender.
5. In the bar chart that you have drawn above, make only
one change: write geom_bar() as geom_bar(position = "fill").
The type of bar graph changes. Can you explain why? And interpret it?
Box plots
6. Draw boxplots of gender vs mathematics_marks. Who is
better in maths: males or females? Can you think of
possible reasons?
7. Does the guardian matter for the performance of a child?
Draw boxplots of guardian vs total marks.
(Within ggplot itself you can write the total as:
mathematics_marks + english_marks + science_marks)
Scatterplots
8. Are those who are good in mathematics also good in English?
Does this observation generally hold? Draw a scatterplot
of mathematics_marks vs english_marks.
9. Smoothen the above graph and then observe.
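
The plot types listed above can be sketched with ggplot2 as follows. This is only an illustration on made-up data: the data frame 'toy', its values and the factor levels are assumptions, and only the column names are borrowed from the school file. Replace 'toy' with your own data frame and pick the columns each exercise asks for.

```r
library(ggplot2)

set.seed(1)
n <- 200
# Made-up data; only the column names mirror the school dataset
toy <- data.frame(
  gender            = factor(sample(c("F", "M"), n, replace = TRUE)),
  continue_drop     = factor(sample(c("continue", "drop"), n, replace = TRUE)),
  mathematics_marks = runif(n),
  english_marks     = runif(n),
  science_marks     = runif(n)
)

# Histogram with 10 bins
p_hist <- ggplot(toy, aes(x = science_marks)) +
  geom_histogram(bins = 10)

# One density curve per target class, made transparent with alpha
p_dens <- ggplot(toy, aes(x = science_marks, fill = continue_drop)) +
  geom_density(alpha = 0.2)

# Stacked bar chart: one bar per gender, split by target class
p_bar <- ggplot(toy, aes(x = gender, fill = continue_drop)) +
  geom_bar()

# Boxplot of a sum of columns, computed inside aes()
p_box <- ggplot(toy, aes(x = gender,
                         y = mathematics_marks + english_marks + science_marks)) +
  geom_boxplot()

# Scatterplot with a smoothed trend line on top
p_smooth <- ggplot(toy, aes(x = mathematics_marks, y = english_marks)) +
  geom_point() +
  geom_smooth()

print(p_hist)   # print any of the objects to draw it
```

Because the toy values are random, these plots show no real pattern; the interpretation has to come from the school data.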

The data being real, if you are inquisitive, you can try to dig
into why girls are dropping out more than boys. Is it lack of
toilets? Or other reasons?

PART-II
Exercise on Feature Plotting
===========================
One important analysis often made in predictive analytics concerns
which features matter most for correctly predicting the target.
Visually this can be done as below.

We assume that your target is a categorical variable (such as 'heavy-purchaser',
'medium-purchaser' or 'occasional-purchaser'). If your target is NOT categorical
but continuous (as for example 'rent' or 'sale'), make it discrete by cutting it
into three or four parts as we did for 'insurance$age' and 'insurance$bmi'
in the class.
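
Cutting a continuous target into parts can be sketched with base R's cut(). The 'rent' values, the tertile breaks and the labels below are all assumptions for illustration; the breaks used in class for the insurance data are not reproduced here.

```r
# Hypothetical continuous target: monthly rent for 12 properties
rent <- c(5200, 7400, 8100, 9900, 12000, 13500,
          15000, 16800, 19000, 22500, 26000, 29500)

# Break at the tertiles so that the three classes have similar sizes
breaks <- quantile(rent, probs = c(0, 1/3, 2/3, 1))

rent_class <- cut(rent,
                  breaks = breaks,
                  labels = c("low", "medium", "high"),
                  include.lowest = TRUE)   # keep the smallest value in "low"

table(rent_class)   # number of properties per class
```

The result is a factor, so it can be used as the categorical target in the plots that follow.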
In the case of the school drop-out data, our target variable is 'continue_drop'.
We are often interested in predicting which child is likely to drop out
so that preventive steps can be taken. So which variable will give us
more insight as to why a child is likely to drop out? Here is the way to
go...

Plots differ depending upon whether the predictor is continuous or discrete.

ALL PLOTS ARE TO BE PASTED IN AN MS-WORD FILE. SAVE THE MS-WORD FILE AS PDF
AND UPLOAD IT IN MOODLE. EACH PLOT MUST CARRY YOUR INTERPRETATION.

A. Plots for continuous features:
-------------------------------------
Assume that in a dataset features 1 to 4, 7, 10 and 13 (7 in total) are
continuous; the code would then be as below. We assume your target variable is
called 'target' and the dataset is 'data'. Change the data/target names and
column numbers accordingly for our school data.
You first have to install the library: caret
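
One way to find the column numbers to plug into the code below is to ask R which columns are numeric. This is a sketch on a made-up data frame ('toy' and its values are assumptions). Note that a numeric type does not by itself make a column a continuous feature: an ID column is numeric too and has to be dropped by hand.

```r
# Made-up data frame; only some column names mirror the school data
toy <- data.frame(
  student_id    = 1:6,
  gender        = factor(c("F", "M", "F", "M", "F", "M")),
  science_marks = c(0.40, 0.70, 0.55, 0.90, 0.30, 0.65),
  internet      = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  english_marks = c(0.50, 0.60, 0.45, 0.80, 0.35, 0.70),
  continue_drop = factor(c("continue", "continue", "drop",
                           "continue", "drop", "continue"))
)

# Positions of all numeric columns (includes the ID column)
num_cols <- which(sapply(toy, is.numeric))
print(num_cols)

# Drop the identifier, which is numeric but not a feature
cont_cols <- setdiff(num_cols, which(names(toy) == "student_id"))
print(cont_cols)
```

The vector 'cont_cols' can then replace c(1:4, 7, 10, 13) in the calls below.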

# 1. Boxplots. Copy and paste it in R after making name and column
#    number changes as required.

library(caret)
featurePlot(x = data[, c(1:4, 7, 10, 13)],   # change 'data' to your dataset name
                                              # and also change the column numbers
            y = data$target,                  # change 'target' to your target name
                                              # (the target must be a factor)
            plot = "box",
            scales = list(y = list(relation = "free"),
                          x = list(rot = 90)),
            layout = c(3, 3),                 # 3 x 3 grid
            auto.key = list(columns = 4))

# 2. Density plots

featurePlot(x = data[, c(1:4, 7, 10, 13)],
            y = data$target,
            plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            adjust = 1.5,
            pch = "|",
            layout = c(3, 3),                 # 3 columns and 3 rows
            auto.key = list(columns = 3))

Considering the above boxplots and density plots, rank the continuous
features in order of importance.

B. Plots for categorical features:
----------------------------------
To find out the importance of categorical variables to the target variable,
use mosaic plots. The attached pdf file explains how to interpret mosaic
plots.

Example: How important is 'gender' to dropouts?
First install the library: vcd
The two-line code is:
library(vcd)
mosaic(school$continue_drop ~ school$gender, gp = shading_max)

Similarly, plot mosaic plots for 'caste', 'guardian' and 'internet'
against 'continue_drop' and interpret each. Then rank
these features in order of importance, that is, by the strength of their
relationships to the 'continue_drop' target.
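
As an aid to interpretation, the sketch below computes Pearson residuals for a two-way table in base R. To the best of my understanding, the shading in vcd mosaic plots is driven by residuals of this kind, but treat that as an assumption and rely on the attached pdf for the authoritative explanation. The counts are made up and are not taken from the school data.

```r
# Made-up counts: 50 girls (10 drop out) and 50 boys (5 drop out)
gender <- factor(rep(c("F", "M"), times = c(50, 50)))
status <- factor(c(rep(c("continue", "drop"), times = c(40, 10)),
                   rep(c("continue", "drop"), times = c(45, 5))))

tab <- table(gender, status)

# Counts expected if gender and status were independent
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residual per cell: (observed - expected) / sqrt(expected)
pearson <- (tab - expected) / sqrt(expected)
print(round(pearson, 2))

# The squared residuals add up to the chi-squared statistic
chi_sq <- sum(pearson^2)
print(chi_sq)
```

A cell with a large positive residual holds more cases than independence would predict, and a large negative residual means fewer. Features whose tables show larger residuals are more strongly related to the target.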

*************************************************************************
YOU ARE WELCOME TO EMAIL ME OR PROF DHANYA FOR SEEKING ANY CLARIFICATIONS
*************************************************************************
