
Data Handling in R

@Submitted by : Anirban Bhattacharya

@Date: 28th June 2019

[AB] : Represents answers

First, I set the working directory with setwd() to my preferred local folder. All file paths
below are relative to it.

1. Read the input data file “Bank500.csv” into R as input.data


a. Write the code to read a .csv file
[AB]: The file is read with a relative path, from the data folder under my working
directory:

input.data = read.csv("data/Bank500.csv")

Note: read.csv2() is not used because the data set is not semicolon-separated. Had it
been, read.csv2() would have been the right function.
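
The difference between the two readers can be sketched on toy data. This is only an illustration: the data frame toy and the temporary files are made up and stand in for Bank500.csv, which is not used here.

```r
# Toy data written to temporary files (hypothetical stand-in for Bank500.csv)
toy <- data.frame(age = c(30, 45), balance = c(1500.5, 320.75))

comma_file <- tempfile(fileext = ".csv")
semi_file  <- tempfile(fileext = ".csv")

write.csv(toy, comma_file, row.names = FALSE)    # "," separator, "." decimal mark
write.csv2(toy, semi_file, row.names = FALSE)    # ";" separator, "," decimal mark

from_comma <- read.csv(comma_file)    # matches the comma-separated file
from_semi  <- read.csv2(semi_file)    # matches the semicolon-separated file

print(from_comma)
print(from_semi)
```

Each reader recovers the same two columns when it is paired with the matching file format.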
_______________________________________________

2. Use the following commands to learn about the data you have read into input.data – write
code for each:
a. Number of rows :

[AB]: Three ways that I know of:


1. dim(input.data)[1]
2. nrow(input.data)
3. NROW(input.data)

500 rows
_____________________
b. Number of columns
[AB]: Three ways that I know of:
1. dim(input.data)[2]
2. ncol(input.data)
3. NCOL(input.data)

21 Columns
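
The lowercase and uppercase variants agree on a data frame but differ on a plain vector. A small sketch on made-up data (toy and ages are hypothetical names):

```r
# Toy data frame, not the real Bank500 data
toy <- data.frame(age = c(25, 38, 52),
                  job = c("admin", "services", "retired"))

dim(toy)      # rows first, then columns
nrow(toy)     # same as dim(toy)[1]
NROW(toy)     # same result on a data frame

# On a plain vector the two variants behave differently
ages <- toy$age
nrow(ages)    # NULL, because a vector has no dim attribute
NROW(ages)    # treats the vector as a one-column matrix
NCOL(ages)
```

NROW() and NCOL() are therefore the safer choice when the input may be either a vector or a data frame.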
_____________________
c. Display top 6 rows
[AB]: Three ways that I know of:
head(input.data) or
input.data[1:6,] or
input.data[c(1:6),]
_____________________
d. Display bottom 6 rows
[AB]: Three ways that I know of:
tail(input.data)

bottom6thRow = nrow(input.data)-5
print(bottom6thRow)

input.data[bottom6thRow:nrow(input.data),]
input.data[c(bottom6thRow:nrow(input.data)),]
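
Storing the start index in a variable, as above, also sidesteps an operator-precedence trap. A sketch on a made-up 20-row data frame (toy, n, last_six and wrong_index are illustrative names):

```r
# Toy data frame with 20 rows (hypothetical)
toy <- data.frame(id = 1:20)
n <- nrow(toy)

# Parentheses are required: ":" binds more tightly than "-"
last_six <- toy[(n - 5):n, , drop = FALSE]

# Without parentheses, n - 5:n means n - (5:n), a descending sequence ending at 0
wrong_index <- n - 5:n

print(last_six$id)
print(wrong_index)
```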
_____________________
e. Display first 20 rows

[AB]:
head(input.data,20) or
input.data[1:20,] or
input.data[c(1:20),]
_____________________

3. Enter code to understand structure of input.data


[AB] : str(input.data)
_________________________

4. Enter code to get a better understanding of the data with a summary


[AB] : summary(input.data)
_________________________

5. What is the datatype of variable “contact”? What are the different levels or values that the
variable contact can assume?

[AB]: class(input.data$contact) gives “factor” as the data type.


table(input.data$contact) shows the levels “Cellular”, “telephone” and “unknown” together with
their counts; levels(input.data$contact) lists the levels alone.
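
Whether the column comes back as a factor depends on the R version: since R 4.0.0, read.csv() defaults to stringsAsFactors = FALSE and returns character columns. A sketch with a made-up one-column file (the lowercase values are illustrative, not taken from Bank500.csv):

```r
# Hypothetical one-column CSV written to a temporary file
csv_file <- tempfile(fileext = ".csv")
writeLines(c("contact", "cellular", "telephone", "unknown", "cellular"), csv_file)

# Ask for factors explicitly, so the result does not depend on the R version
as_factor <- read.csv(csv_file, stringsAsFactors = TRUE)
class(as_factor$contact)
levels(as_factor$contact)

# With character columns, unique() plays the role of levels()
as_text <- read.csv(csv_file, stringsAsFactors = FALSE)
class(as_text$contact)
unique(as_text$contact)
```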
___________________________________________

6. Write code to display only the “age” variable

[AB]:
The question is ambiguous. If it asks for the distinct values only, table(input.data$age)
shows each distinct age with its count; otherwise, input.data$age displays the whole age
variable.
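
The different readings of "display only the age variable" can be compared on toy data. All names below are illustrative and the values are made up:

```r
# Toy data frame (hypothetical values)
toy <- data.frame(age = c(30, 45, 30, 52),
                  job = c("admin", "services", "admin", "retired"))

age_vector <- toy$age        # plain numeric vector
age_column <- toy["age"]     # one-column data frame, keeps the column name

distinct_ages <- sort(unique(toy$age))   # distinct values only
age_counts    <- table(toy$age)          # distinct values with their counts

print(age_vector)
print(age_column)
print(distinct_ages)
print(age_counts)
```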
___________________________________________

7. Write code to list the names of all the variables in input.data


[AB]: Two ways that I am aware of so far:

colnames(input.data)
names(input.data)

___________________________________________

8. Create a new variable “age_below_40” that is the subset of input.data containing all rows
with age <= 40. Use the subset function.
[AB]: age_below_40 = subset(input.data, age <= 40)
*Note: subset() can be noticeably slower than dplyr's filter() on large data sets.
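
Apart from speed, subset() and plain logical indexing treat missing values differently. A base-R sketch on made-up data (the NA is inserted on purpose; it is not known whether Bank500.csv contains any):

```r
# Toy data with one missing age (hypothetical)
toy <- data.frame(age = c(25, 41, NA, 40),
                  balance = c(100, 200, 300, 400))

by_subset <- subset(toy, age <= 40)          # rows where the condition is NA are dropped
by_index  <- toy[toy$age <= 40, ]            # the NA condition yields an extra all-NA row
by_which  <- toy[which(toy$age <= 40), ]     # which() drops the NA, matching subset()

print(by_subset)
print(by_index)
print(by_which)
```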
___________________________________________

9. What is the data type of age_below_40?


[AB]: class(age_below_40) results in the data type as “data.frame”
___________________________________________

10. How many rows are in age_below_40?


[AB] : nrow(age_below_40) produces 248
___________________________________________
11. Order the rows in age_below_40 in descending order by age – Write code below
[AB]: Two ways that I am aware of:
1. age_below_40[order(-age_below_40$age),]
2. age_below_40[order(age_below_40$age, decreasing = TRUE),]
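
When several customers share the highest age, the top row depends on how ties are resolved. A sketch on made-up data that adds balance as a second, explicit sort key (this tie-break is my own assumption, not part of the question):

```r
# Toy data with a tie at age 40 (hypothetical values)
toy <- data.frame(age = c(31, 40, 22, 40),
                  balance = c(500, 900, 120, 80))

# Descending by age, then descending by balance to break ties
by_age <- toy[order(-toy$age, -toy$balance), ]

print(by_age)
top_balance <- by_age$balance[1]   # balance of the customer on the top row
print(top_balance)
```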

____________________________________________

12. From the output above, what is the balance of the customer listed on the top row?
[AB]: 3571

____________________________________________
Basic Graphs in R
In this section, we will learn the following:

• Graphically understand data
• Generate different charts based on data type
• Interpret and gather insights from charts generated

13. Generate a Histogram for ‘age’ field in input.data – Write code below:
[AB]: hist(input.data$age,main = "Age Histogram",ylab = "Count",xlab = "Age Range")
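
To check a histogram for anything unusual, the bin counts can also be inspected as numbers. A sketch on made-up ages with explicit breaks (both ages and the break points are assumptions for illustration):

```r
# Toy ages (hypothetical)
ages <- c(22, 25, 31, 38, 41, 45, 47, 59)

# plot = FALSE returns the histogram object without drawing it
h <- hist(ages, breaks = seq(20, 60, by = 10), plot = FALSE)

h$breaks   # bin edges: 20, 30, 40, 50, 60
h$counts   # number of observations in each bin
```

Bins are right-closed by default, so an age of exactly 40 would fall into the (30, 40] bin.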

________________________________

14. Do you observe anything unusual from the histogram generated above? If yes, state your
observation
[AB]: No
________________________________

15. Generate a Barplot to understand the distribution of customers based on marital status –
Write code below:
[AB]: The question does not say whether the bars should be vertical or horizontal, so both
versions are given below.

1. VerticalBarplot = barplot(table(input.data$marital), col = rainbow(3))
2. horizontalBarPlot = barplot(table(input.data$marital), col = rainbow(3), horiz = TRUE)
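
Note that the value assigned from barplot() is not the plot itself but the positions of the bar midpoints, which is useful for adding labels afterwards. A sketch on made-up marital values, drawn to a null device so that it runs without a display:

```r
# Toy marital status values (hypothetical)
marital <- factor(c("married", "single", "married", "divorced", "married", "single"))
counts <- table(marital)

pdf(NULL)   # null graphics device: nothing is written to disk
mids <- barplot(counts, col = rainbow(length(counts)),
                main = "Customers by marital status", ylab = "Count")
# Use the returned midpoints to print each count above its bar
text(x = mids, y = counts, labels = counts, pos = 3, xpd = TRUE)
invisible(dev.off())

print(counts)
print(mids)
```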

________________________________

16. Generate a Scatter Plot to understand the correlation between Duration on x-axis and
Balance on y-axis. Write code below:
[AB]:

Normal plot:

plot(input.data$duration, input.data$balance, main = "Duration Vs Balance",
     xlab = "Duration", ylab = "Balance amount")
________________________________

17. Generate a Box Plot for the variable ‘balance’. Make sure to specify the title and labels for X
and Y axis. Write the code below:
[AB]: boxplot(input.data$balance, main = "Balance", ylab = "balance",xlab="Records Count")
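
The numbers behind a box plot can be read off with boxplot.stats(). A sketch on a made-up balance vector with one extreme value:

```r
# Toy balances with one extreme value (hypothetical)
balance <- c(1, 2, 3, 4, 100)

stats <- boxplot.stats(balance)

stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # points drawn individually as outliers
```

Here the upper whisker stops at 4, because 100 lies more than 1.5 times the box height above the upper hinge and is reported as an outlier instead.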

_____________________________________________________

18. Generate a Box Plot for balance versus job. Write code below:
[AB]: library(ggplot2)
ggplot(input.data, aes(x = job, y = balance)) + geom_boxplot()

_____________________________________________________

19. From the Box Plot generated above, which job category has the highest median balance?
[AB]: The “unknown” job category
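
Reading medians off a box plot can be cross-checked numerically. A sketch on made-up data (the job names and balances are illustrative, not the Bank500 values):

```r
# Toy data (hypothetical)
toy <- data.frame(
  job     = c("admin", "admin", "retired", "retired", "unknown", "unknown"),
  balance = c(100, 300, 500, 700, 2000, 4000)
)

# Median balance per job category
medians <- tapply(toy$balance, toy$job, median)
print(medians)

# Name of the category with the highest median
top_job <- names(which.max(medians))
print(top_job)
```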
_____________________________________________________
20. Using ggplot2 package, generate a point plot where x-axis is education and y-axis is balance.
Write code below:
[AB]: ggplot(input.data, aes(x = education, y = balance)) +
geom_point()
_____________________________________________________

21. Using ggplot2 package, generate a point plot where x-axis is age and y-axis is balance. Write
code below: - Is there a correlation between age and balance?
[AB]: ggplot(input.data, aes(x = age, y = balance)) + geom_point()
There is no clear correlation between age and balance. Some pattern can be read into the
plot, but it is too weak to be of practical use.
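
The visual impression can be backed up with a correlation coefficient. A sketch on made-up values (not the Bank500 data), using Pearson and the rank-based Spearman variant:

```r
# Toy data (hypothetical)
toy <- data.frame(age     = c(25, 32, 41, 48, 56),
                  balance = c(1200, 150, 3100, 400, 900))

r   <- cor(toy$age, toy$balance)                        # Pearson, the default
rho <- cor(toy$age, toy$balance, method = "spearman")   # rank-based, less sensitive to outliers

print(r)
print(rho)
```

Values close to 0 indicate little linear (Pearson) or monotonic (Spearman) association; values near 1 or -1 indicate a strong one.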
_____________________________________________________

22. Enhance the point plot above so that we can identify those who have housing loan based on
colour. Write code below:
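[AB]: ggplot(input.data, aes(x = age, y = balance, colour = loan)) + geom_point()
Note: this assumes that the loan column records the housing loan. If the data set has a
separate housing column, colour = housing should be mapped instead.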
__________________________________________________________

23. Let us try to understand the overall Monthly Average Balance. Run the code below and
answer relevant questions:

means = tapply(input.data$balance, input.data$numMonth, mean)

barplot(means, xlab = "Month", ylab = "Average Balance",
        main = "Monthly Average Balance")

a. Which month had the highest average Balance?


[AB]: December
__________________

b. Which month had the lowest average Balance?


[AB]: July
____________________

c. How many months have average balance of less than 4000?


[AB]: 11 months have an average balance below 4000
____________________
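
The three answers above were read off the bar plot; they can also be computed from the means vector directly. A sketch on made-up data with three months (the values and the 4000 threshold check are illustrative only):

```r
# Toy data (hypothetical): two customers in each of three months
toy <- data.frame(numMonth = c(1, 1, 2, 2, 3, 3),
                  balance  = c(1000, 3000, 5000, 7000, 100, 300))

means <- tapply(toy$balance, toy$numMonth, mean)
print(means)

highest_month <- names(which.max(means))   # month with the highest average balance
lowest_month  <- names(which.min(means))   # month with the lowest average balance
below_4000    <- sum(means < 4000)         # number of months with an average below 4000

print(highest_month)
print(lowest_month)
print(below_4000)
```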
