
Project - Cold Storage

Case Study
Pratik Zanke

Table of Contents

1 Project Objective........................................................................................................................................ 4
1.1 Problem 1.............................................................................................................................................4
1.2 Problem 2.............................................................................................................................................4
2 Primary Observations and Assumptions.................................................................................................... 5
3 Exploratory Data Analysis – Step by step approach ...................................................................................6
3.1 Environment Set up and Data Import................................................................................................. 6
3.1.1 Install necessary Packages and Invoke Libraries.......................................................................... 6
3.1.2 Set up working Directory .............................................................................................................6
3.1.3 Import and Read the Dataset.......................................................................................................6
3.2 Variable Identification.........................................................................................................................7
3.2.1 Variable classes/characteristics .................................................................................................. 7
3.3 Univariate Analysis..............................................................................................................................7
3.3.1 Analyses on Continuous Variables ..............................................................................................7
3.3.2 Analyses on Categorical Variables .............................................................................................11
3.4 Bi-Variate Analysis.............................................................................................................................12
3.4.1 Continuous & Continuous..........................................................................................................12
3.4.2 Categorical & Categorical...........................................................................................................12
3.4.3 Categorical & Continuous: ........................................................................................................13
3.5 Missing Value Treatment ................................................................................................................16
3.6 Outlier Treatment ...........................................................................................................................17
3.7 Variable Transformation / Feature Creation ..................................................................................17
4 Problem solving: Problem 1 ..................................................................................................................19
4.1 Mean cold storage temperature for Summer, Winter and Rainy Season....................................19
4.2 Find overall mean for the full year..................................................................................................19
4.3 Find Standard Deviation for the full year........................................................................................20
4.4 Assume Normal distribution, what is the probability of temperature having fallen below 2 C?....20
4.5 Assume Normal distribution, what is the probability of temperature having gone above 4 C.......21
4.6 What will be the penalty for the AMC Company? ..........................................................................21
5 Problem solving: Problem 2 ..................................................................................................................22
5.1 Which Hypothesis test shall be performed to check if corrective action is needed at the cold
storage plant? ........................................................................................................................................22

5.2 State the Hypothesis, perform hypothesis test and determine p-value ..............................................22
6 Conclusion/Inference...............................................................................................................................25
7 Appendix A...............................................................................................................................................26

1 Project Objective
1.1 Problem 1
Cold Storage started its operations in Jan 2016. They are in the business of
storing Pasteurized Fresh Whole or Skimmed Milk, Sweet Cream, and Flavored
Milk Drinks. To ensure that there is no change in texture, body,
appearance, or separation of fats, the optimal temperature to be maintained is
between 2 and 4 C.

In the first year of business they outsourced the plant maintenance work to a
professional company with stiff penalty clauses. It was agreed that if
it was statistically proven that the probability of the temperature going outside the 2 -
4 C range during the one-year contract was above 2.5% and less than 5%, then
the penalty would be 10% of the AMC (annual maintenance contract) fee. In case it
exceeded 5%, the penalty would be 25% of the AMC fee. The average
temperature data at date level is given in the file
“Cold_Storage_Temp_Data.csv”.

The objective of the report is to explore the Cold Storage Case Study dataset
using concepts of Statistical Methods of Decision Making and generate insights
about the data. This exploration report will consist of the following:
Importing the dataset in R
Understanding the structure of the dataset
Graphical exploration
Descriptive statistics
Insights from the dataset
Finding solutions to some problems based on the key insights drawn from
the data, as elaborated in Section 4.

1) Find mean cold storage temperature for Summer, Winter and Rainy Season
2) Find overall mean for the full year
3) Find Standard Deviation for the full year
4) Assume Normal distribution, what is the probability of
temperature having fallen below 2 C?
5) Assume Normal distribution, what is the probability of
temperature having gone above 4 C?
6) What will be the penalty for the AMC Company?

1.2 Problem 2
In Mar 2018, Cold Storage started getting complaints from their Clients that
they have been getting complaints from end consumers of the dairy products
going sour and often smelling. On getting these complaints, the supervisor pulls
out data of last 35 days’ temperatures. As a safety measure, the Supervisor
decides to be vigilant to maintain the temperature 3.9 C or below.

Assuming 3.9 C as the upper acceptable value for the mean temperature and at alpha
= 0.1, the objective is to find out if there is a need for some corrective action in
the Cold Storage Plant, or whether the problem lies on the procurement side from
which Cold Storage is getting the Dairy Products. The data of the last 35 days is
in “Cold_Storage_Mar2018.csv”.

The objective is to apply statistical methods of decision making to address the
following problems, also elaborated in Section 5:
1) Which Hypothesis test shall be performed to check if corrective action is
needed at the cold storage plant?
2) State the Hypothesis, perform hypothesis test and determine p-value
3) Final overall inference/conclusion

2 Primary Observations and Assumptions

Dataset 1
Given the nature of the data provided in the dataset, it can be seen that it
refers to the temperatures in the cold storage over the entire year of 2016.
The 365 rows of the dataset correspond to the 365 unique days of the year and
the temperatures recorded on each day.
To provide more insight into season-wise trends, the dataset has further been
broken down into 3 seasons: Summer, Rainy, and Winter.
Summer corresponds to the months of Feb to May.
Rainy corresponds to June to September.
Winter corresponds to Jan & Oct to Dec.
Also, the following data dictionary is considered for the 4 features in the
dataset:

Sl. No.   Feature Name   Feature Code   Feature Description
1         Season         Season         Seasons across the year: Summer, Rainy, Winter
2         Month          Month          All 12 months in a year, Jan to Dec
3         Date           Date           Date in each month on which the temperature was recorded
4         Temperature    Temperature    Temperature recorded on each day of the year

Dataset 2
Upon receiving complaints from customers in 2018, Cold Storage requested
data from the maintenance company. The Supervisor pulled temperature data for
the last 35 days (Feb 11 to Mar 17).

The overall characteristics of this dataset are exactly the same as those of Dataset 1,
except that all of the data pulled corresponds to the Summer season.

3 Exploratory Data Analysis – Step by step approach


The Data exploration activity undergone in this report will consist of the
following steps:

1. Environment Set up and Data Import
2. Variable Identification
3. Univariate Analysis
4. Bi-Variate Analysis
5. Missing Value Treatment
6. Outlier Treatment
7. Variable Transformation / Feature Creation

Note: Dataset 2 has been drawn up to address a specific problem (see
section 1.2), so it will not be put through rigorous Exploratory Data Analysis;
all we will do is brush over the dataset from the viewpoint of
Descriptive Analysis – covered at a very high level in Sections 3.2.1 and
3.3.1 – and descriptive statistics on Dataset 2 will not be applied beyond these
sections.

However, Dataset 1 will be put through all applicable methods of Univariate
and Bi-variate analyses.

3.1 Environment Set up and Data Import


3.1.1 Install necessary Packages and Invoke Libraries
Before starting the coding in R, we need to install the necessary
packages and invoke the associated libraries. Having all the packages in the
same place increases code readability.
Please refer to Appendix A for Source Code.
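A minimal sketch of this step, assuming the packages used later in this report (ggplot2, rpivotTable, randomcoloR); install once per machine, then load the libraries in each session:

# Install the packages (only needed once)
install.packages(c("ggplot2", "rpivotTable", "randomcoloR"))
# Invoke the associated libraries for the current session
library(ggplot2)
library(rpivotTable)
library(randomcoloR)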

3.1.2 Set up working Directory


The working directory is the location/folder on the PC where one keeps the data,
code, etc. related to the project. Setting a working directory at the start of
the R session makes importing and exporting data files and code files easier.
Please refer to Appendix A for Source Code.

3.1.3 Import and Read the Dataset


The given dataset is in .csv format. Hence, the command ‘read.csv’ is used
for importing the file.
Please refer to Appendix A for Source Code.
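A minimal sketch combining the set-up and import steps, assuming the project folder path and file names used in Appendix A (the path is machine-specific):

# Point R at the project folder and confirm it
setwd("G:/My_R/Project 1")
getwd()
# Read the two CSV files into data frames
tempdata  <- read.csv("Cold_Storage_Temp_Data.csv", header = TRUE)
tempmarch <- read.csv("Cold_Storage_Mar2018.csv", header = TRUE)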

3.2 Variable Identification
The dataset is analyzed for a basic understanding of the features and the data
contained. This is an activity by which the data is explored and organized
so that the information it contains is made clear.
Please refer to Appendix A for Source Code.

3.2.1 Variable classes/characteristics


No of rows vs. No. of columns:

Dataset 1
No. of Rows No. of Columns
365 4

Dataset 2
No. of Rows No. of Columns
35 4

Variables and their Types (both Dataset 1 & Dataset 2)


No. Feature Name Class Type
1 Season Factor Categorical
2 Month Factor Categorical
3 Date Numeric Continuous
4 Temperature Numeric Continuous
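A quick way to reproduce the figures above, assuming the tempdata and tempmarch data frames created earlier:

# Rows/columns and variable classes for both datasets
dim(tempdata)    # 365 rows, 4 columns
dim(tempmarch)   # 35 rows, 4 columns
str(tempdata)    # Season/Month as Factor, Date as int, Temperature as num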

3.3 Univariate Analysis


Univariate Analysis corresponds to the exercise of exploring each variable
one by one, and the method depends on the data type, Categorical or
Continuous.
3.3.1 Analyses on Continuous Variables
From the variable identification exercise undergone earlier, we can see there
are 2 continuous variables, Temperature and Date. Our focus will be mainly on
the Temperature variable, because the Date variable is just the day of each
month and no substantial inferences/analysis results can be drawn from it.
In the Variable Transformation/Feature Creation section, however, we have
manipulated the Date variable and converted it into a Factor in order to draw
up statistical bi-variate analyses between Date and Temperature.

We will leverage the following statistical metrics and visualization methods to
explore the Temperature variable.
Central Tendency Measure of Dispersion Visualization Method
Mean Range Histogram
Median Quartile Boxplot
Mode Inter-Quartile Range
Min Variance
Max Standard Deviation

The summary() function in R helps deduce most of the key values; there is,
however, no inbuilt function for the Mode, so a customized function has been
written for it, while the built-in IQR() function covers the inter-quartile range
– refer to Appendix A for the code.
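A compact sketch of the mode helper, along the lines of the function in Appendix A, applied to Dataset 1:

# Statistical mode: the most frequent value of a numeric vector
getmode <- function(x) {
  uniqv <- unique(x)
  uniqv[which.max(tabulate(match(x, uniqv)))]
}
getmode(tempdata$Temperature)   # 2.5
IQR(tempdata$Temperature)       # 0.8, using the built-in IQR()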

Dataset 1

Temperature – Key Findings


Figure Value
Mean 2.96274
Median 2.900
Mode 2.500
Min 1.700
Max 5.000
Range 3.300
1st Quartile 2.500
3rd Quartile 3.300
Inter-Quartile Range 0.800
Variance 0.2586628
Standard Deviation 0.508589

Dataset 2

Temperature – Key Findings


Figure Value
Mean 3.974
Median 3.900
Mode 3.900
Min 3.800
Max 4.600
Range 0.800
1st Quartile 3.900
3rd Quartile 4.100
Inter-Quartile Range 0.200
Variance 0.0254958
Standard Deviation 0.159674

Boxplots:
Dataset 2

Dataset 1:

Histograms (Dataset 1):

3.3.1.1 Continuous Variable Analysis: key observations
Dataset 1:
The overall temperatures across the year have a single outlier, as can be
observed from the Boxplot: Temperatures across the Year.
However, when the temperatures are plotted against the smaller sets of 3
seasons, the Winter season appears to have 3 outliers. The inferences that can
be made are:
o Temperature fluctuation is observably higher in the Winter season,
compared to Summer, which has no outliers, and Rainy, which has one
o When plotted against the whole year, the number of outliers is 1
Dataset 2
From the Temperature boxplot, we can observe a very heavy positive
skewness in the data distribution
A single outlier is seen, lying quite far from the rest of the values.

3.3.2 Analyses on Categorical Variables


Although traditionally various types of analyses can be performed on
categorical variables, this particular dataset has a very homogeneous mix of
months and seasons, so no substantial inferences can be made from these 2
categorical variables on their own.
However, the analysis of the Temperature variable plotted against the different
seasons/months has already been performed earlier.

3.4 Bi-Variate Analysis
Bi-variate analysis concerns itself with the relationship between two variables. From the
perspective of this dataset, we will try to figure out the overall
relationships/correlations among the different variables on hand, both categorical
and continuous.

3.4.1 Continuous & Continuous:


To find the strength of the relationship between 2 continuous variables, we will use
Correlation. Correlation varies between -1 and +1.
-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no correlation

3.4.1.1 Temperature vs. Date


Observations:
Correlation between Temperature and Date (from R) = -0.028
It is a negative correlation, but the value is too small for anything
substantial to be inferred statistically.
An extension to the analysis of Temperature vs. Date has been
covered in section 3.7.
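A one-line check of this figure, assuming the tempdata data frame from earlier:

# Pearson correlation between the daily temperature and the day of the month
cor(tempdata$Temperature, tempdata$Date)   # approximately -0.028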

3.4.2 Categorical & Categorical:


3.4.2.1 Month vs Season
Observations:
Summer corresponds to the months of Feb to May.
Rainy corresponds to June to September.
Winter corresponds to Jan & Oct to Dec.
Going by the straightforward nature of these 2 variables, nothing
substantial can be inferred statistically.

3.4.3 Categorical & Continuous:
3.4.3.1 Temperature vs Month

Boxplot: Temperature vs. Month

Observations:
Outliers are present in the Months of Sep, Oct, Jan, where Jan itself has 3.
Skewness:
o Positive: observed in months of Sep, Nov, Jun, Jul, Jan, Dec, Aug
o Negative: observed in months of Mar, May, Feb, Apr, Oct
o Normal Distribution: not observed in any month
In almost 10% of the days in January, there seems to be some anomaly,
which can be further looked into by the AMC Company maintaining the Cold
Storage facility.
Using the rpivotTable function, some observations (Table and Bar Chart
functions used):
o September has clocked the highest Mean temperature, Variance, and
Standard Deviation amongst all 12 months
o Nov has clocked the lowest Mean temperature, Variance, and Standard
Deviation amongst all 12 months

3.4.3.2 Temperature vs Season
Boxplot: Temperature vs. Season

Observations:
3 Outliers in the Winter season, 1 in Rainy, whereas no Outliers in Summer
Skewness:
o Winter: Positive Skewness can be observed
o Summer: Heavy Negative Skewness can be observed
o Rainy: No/negligible skewness can be observed, and it seems to
be normally distributed.
Using the rpivotTable function, some observations (Table and Bar Chart
functions used; a reproducible sketch follows this list):
o Comparing the Mean temperatures for each season, although Summer has
clocked the highest, there's not much difference compared to Rainy,
while Winter is lower than both
o The variance in Temperatures is higher in Rainy season by quite some
margin compared to both Winter and Summer
o For the Standard deviation in Temperature, it’s Rainy season again
trumping over both Summer and Winter
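These pivot-style summaries can also be reproduced without rpivotTable; a minimal sketch using base R's aggregate(), assuming the tempdata data frame:

# Mean, variance and standard deviation of Temperature per Season
aggregate(Temperature ~ Season, data = tempdata, FUN = mean)
aggregate(Temperature ~ Season, data = tempdata, FUN = var)
aggregate(Temperature ~ Season, data = tempdata, FUN = sd)
# The same grouping by Month gives the month-wise figures quoted in section 3.4.3.1
aggregate(Temperature ~ Month, data = tempdata, FUN = mean)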

3.5 Missing Value Treatment
Missing value treatment is an important step in Exploratory Data Analysis, as
missing data in the training data set can reduce the power/fit of a model or can
lead to a biased model because we have not analyzed the behavior and relationship
with other variables correctly. It can lead to wrong prediction or classification.

The datasets under scrutiny do not have any missing values, as we have already
observed in the data summaries, so this step is not elaborated further in this project.
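A quick verification of this claim, assuming the two data frames loaded earlier:

# Count of missing values in each dataset (both return 0)
sum(is.na(tempdata))
sum(is.na(tempmarch))
# Or checked per column
colSums(is.na(tempdata))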

3.6 Outlier Treatment
Outlier is a commonly used term among analysts and data scientists, and outliers need
close attention; otherwise they can result in wildly wrong estimations. Simply put, an outlier is
an observation that appears far away from, and diverges from, the overall pattern in a
sample.

Outliers can drastically change the results of the data analysis and statistical
modeling. There are numerous unfavorable impacts of outliers in the data set:
It increases the error variance and reduces the power of statistical tests
If the outliers are non-randomly distributed, they can decrease normality
They can bias or influence estimates that may be of substantive interest
They can also violate the basic assumptions of Regression, ANOVA and other
statistical models.
The most commonly used method to detect outliers is visualization. We use various
visualization methods, like Box-plots, Histograms, etc., as already applied earlier on
both datasets in multiple sections. However, dealing with outliers being out of
scope for this Project, no specific action has been taken on the data.
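As a supplement to the visual checks, a minimal sketch that lists the values flagged by the standard 1.5 * IQR boxplot rule, assuming the tempdata data frame:

# Values beyond the boxplot whiskers (1.5 * IQR rule)
boxplot.stats(tempdata$Temperature)$out
# Outliers within the Winter season only
boxplot.stats(tempdata$Temperature[tempdata$Season == "Winter"])$out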

3.7 Variable Transformation / Feature Creation


As we cannot draw any substantial statistical inference from the analysis of
Temperature vs. Date, let us try to create a Feature by manipulating the Date field:
Dates 1 to 10 of each month => Start of Month
Dates 11 to 20 of each month => Mid Month
Dates greater than 20 => Month End
We will create a new data column called “Datespan” to store the output based on
the Date field values, following which we will try to draw some statistical inferences of
the same vs. Temperature and see if we can find anything that is statistically viable.
Let us apply some bivariate analysis on the newly created feature, i.e. Datespan
vs. Temperature (a sketch of the feature creation follows below).
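A minimal sketch of the feature creation, along the lines of the code in Appendix A:

# Bin the day-of-month into three labelled spans
tempdata$Datespan <- cut(tempdata$Date,
                         breaks = c(-Inf, 10, 20, Inf),
                         labels = c("Start of Month", "Mid Month", "Month End"))
table(tempdata$Datespan)   # 120 / 120 / 125 days respectively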

Boxplot: Temperatures vs. Date-spans in Months across the Year

Bar charts of different attributes of Central Tendency and Dispersion

Observations:
1) Start of Month (first 10 days):
a. Comes second to Month End in terms of total Range and also has 1
outlier
b. A slightly positive-skewed characteristic can be observed here
2) Mid Month (middle 10 days):
a. The most consistent one in terms of range; however, it has a single
outlier
b. No/negligible skewness is observed, and seems to be normally
distributed
3) Month End (last 10 days):
a. Has no outliers.
b. The range is bigger than for the other two, suggesting the need for
increased supervision/attention towards the end of each month.
c. Heavy Positive skewness can be observed
4) From the bar charts, the Mean, Variance, and Standard Deviation do not
seem to be telling much in terms of Statistical relationship between the
Temperatures and time of the month

4 Problem solving: Problem 1

4.1 Mean cold storage temperature for Summer, Winter and Rainy Season

Mean Temp in Summer   Mean Temp in Rainy   Mean Temp in Winter
       3.15                  3.04                 2.7

Inference: From the dataset under analysis, it can be observed that the
highest average temperature is clocked in the Summer season
whereas the lowest is in Winter. Although one can assume this to match the
natural ambient temperatures of the different seasons, statistically we cannot draw
that conclusion due to the lack of weather data across the year.

4.2 Find overall mean for the full year

Overall Mean of Full Year: 2.96

4.3 Find Standard Deviation for the full year

Standard Deviation for the full year: 0.5086

Inference: As we have already assumed the dataset to be normally distributed,
statistically we can infer the ranges of the data based on the Standard
Deviation and Mean calculated earlier:
- 68% of the values lie in the range of 2.454 and 3.4715 (+- 1 Sigma)
- 95% of the values lie in the range of 1.946 and 3.980 (+- 2 Sigma)
- 99.7% of the values lie in the range of 1.437 and 4.488 (+- 3 Sigma)
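These ranges can be reproduced directly in R; a small sketch assuming the full-year mean and standard deviation computed from tempdata:

mu    <- mean(tempdata$Temperature)   # 2.96274
sigma <- sd(tempdata$Temperature)     # 0.508589
mu + c(-1, 1) * sigma                 # ~68% band
mu + c(-2, 2) * sigma                 # ~95% band
mu + c(-3, 3) * sigma                 # ~99.7% band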

4.4 Assume Normal distribution, what is the probability of
temperature having fallen below 2 C?

Mean = 2.96 (rounded); ideally it should be held in a variable while doing
the calculations in R.
Standard Deviation = 0.5086 (rounded); ideally it should also be held in a
variable while doing the calculations in R.
To find the probability of the temperature being less than 2, we set
lower.tail = TRUE because this is the left tail of the distribution.
Using the pnorm function in R, the result is
= 2.918417% probability

4.5 Assume Normal distribution, what is the probability of
temperature having gone above 4 C
To find the probability of the temperature being more than 4, we set
lower.tail = FALSE because this is the right tail of the distribution.
Using the pnorm function in R, the result is
= 2.070296% probability

4.6 What will be the penalty for the AMC Company?

A particular temperature reading can never be “lower than 2” and “higher
than 4” at the same time; these are therefore mutually exclusive events,
so P(A U B) = P(A) + P(B).
Therefore, P = P(Temp < 2) + P(Temp > 4) = 4.988713%

Therefore, the penalty = 10% of the AMC fee, since the probability of the temperature going
outside of the range of 2 - 4 C falls between the 2.5% and 5% boundaries
mentioned in the problem statement (a sketch of the calculation follows below).
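A minimal sketch of the penalty calculation, assuming the full-year mean and standard deviation from sections 4.2 and 4.3:

mu    <- mean(tempdata$Temperature)
sigma <- sd(tempdata$Temperature)
p_below2 <- pnorm(2, mean = mu, sd = sigma, lower.tail = TRUE)
p_above4 <- pnorm(4, mean = mu, sd = sigma, lower.tail = FALSE)
(p_below2 + p_above4) * 100   # ~4.99%, i.e. between 2.5% and 5%, so the 10% penalty applies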

5 Problem solving: Problem 2
5.1 Which Hypothesis test shall be performed to check if
corrective action is needed at the cold storage plant?
Observations:
1) Dataset 1 contains the temperatures throughout the year of 2016
2) Dataset 2 contains the temperatures from a sample of 35 days in
2018 – 02/11 to 03/17 – that was procured on the back of customer complaints
3) From this, we can infer that dataset 2 is not a sample of dataset 1.
4) Therefore, we cannot get the population mean/standard deviation from dataset 1.
5) As dataset 2 is independent, and we have not been provided with the
population data for it, we cannot deduce the population standard
deviation.
Assumptions:

1) We are assuming the population mean and standard deviation to be the same
as the sample mean (Mu) and sample standard deviation (Sigma), thus
assuming our sample estimates will be reflective of the population.
Approach:

1) Since the population standard deviation is unknown, the best statistical test
to perform would be the Student's T-statistic Test
2) However, we will go ahead and perform the Z test as well, and compare its
results with those from the T Test before drawing up the conclusion.
3) Since we are talking about potential corrective actions, we intend to be
more exhaustive and detail oriented.

5.2 State the Hypothesis, perform hypothesis test and determine p-value

Hypothesis
The supervisor has been tasked with maintaining the temperature at the cold
storage at or below 3.9 C – this gives us the Null Hypothesis.

Null Hypothesis, Ho: Mu <= 3.9

On the other hand, since there have been complaints of the product degrading
because the temperature is exceeding the upper acceptable limit of 3.9
C, we intend to test that claim. This will be our Alternative Hypothesis.

Alternative Hypothesis, Ha: Mu > 3.9

Hypothesis Tests
Student’s T-statistic Test:

Step 1: State the Hypotheses: Ho: Mu <= 3.9 & Ha: Mu > 3.9

Step 2: Population mean(assumed), Mu = 3.9


Step 3: Significance(given), alpha = 0.10
Step 4: Sample mean, Xbar = mean (Temperature) = 3.974286

Step 5: Sample standard deviation, S = sd (Temperature) = 0.159674


Step 6: Sample size, m = 35

Step 7: Degrees of freedom, df = (m – 1) = 34


Step 8: Sampling Error, se = (Xbar – Mu) = 0.07428571
Step 9: Standard Error, sde = sd/(m^0.5) = 0.02698984

Step 10: Tstat = Sampling Error/Standard Error = +2.752359


Step 11: Pvalue calculation: Since the alternative hypothesis is Mu > 3.9, this is
a right-tailed test; therefore, we use the following formula in R to
calculate the Pvalue.

Pvalue = pt (Tstat, df, lower.tail = FALSE)

Therefore, Pvalue = 0.004711198

Step 12: Result: Since Pvalue < alpha, the Null Hypothesis is rejected and the
Alternative Hypothesis is accepted, thus statistically concluding (via the T Test) that
the Temperature in the Cold Storage is greater than 3.9 C with 90% confidence
(1 – 0.1), which is what is causing the products to go sour or smell.
Step 13: We will find the actual confidence by subtracting the Pvalue from 1.
Actual Confidence = (1 - Pvalue) * 100 = 99.52888%
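The manual steps above can be cross-checked with R's built-in one-sample t-test; a minimal sketch assuming the 35-day temperature vector temp from Dataset 2 (as named in Appendix A):

# One-sample, right-tailed t-test of Ho: Mu <= 3.9 vs Ha: Mu > 3.9
t.test(temp, mu = 3.9, alternative = "greater", conf.level = 0.90)
# Reports t ~ 2.75 on 34 degrees of freedom with p-value ~ 0.0047, matching the steps above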

Z-statistic Test:

The assumptions and calculation methods are the same as already mentioned/
performed in the T Test. Summarizing the values below:

Mu = 3.9
alpha = 0.1
Xbar = 3.974286
S = 0.159674
m = 35
se = 0.07428571
sde = 0.02698984

At significance alpha = 0.1 and for a right-tailed test, the critical value of
the Z statistic is +1.28, as calculated using MS Excel.
We use a cumulative probability of 0.9 instead of 0.1 because MS Excel
works with cumulative probabilities and this is a right-tailed test.

The rejection region is Zc > +1.28; the non-rejection region is Zc <= +1.28.

Result: Since the calculated Z statistic (+2.75) is greater than the critical value
(+1.28), it falls in the rejection region; we reject the null hypothesis and accept the
alternate hypothesis, thus reinforcing the results from the T Test.
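The Excel look-up can also be reproduced in R; a small sketch assuming the summary values listed above:

# Right-tailed critical value at alpha = 0.1, the Z statistic, and its one-sided p-value
qnorm(0.90)                                          # ~ +1.2816
z_stat <- (3.974286 - 3.9) / (0.159674 / sqrt(35))   # ~ +2.75
pnorm(z_stat, lower.tail = FALSE)                    # ~ 0.003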

6 Conclusion/Inference
We have seen from dataset 1, which holds the values for the year 2016, that the
average temperature throughout the year is 2.96 C. However, as the months went by,
the working quality of the Cold Storage seems to have degraded. From the
samples taken in 2018, without even putting them through any statistical analysis, we
can see that the mean temperature is 3.97 C, roughly 1 degree higher; going by the
working principle of cold storages, that does not look good, which is why the
complaints of products going sour and smelling kept pouring in. However, we
reserved our judgement until a root cause analysis through statistical
testing had been carried out and a conclusion drawn.

To do a root cause analysis via statistical hypothesis procedures, we performed
both T-statistic and Z-statistic tests, which gave us the following results:
Having performed both the T and Z statistic tests, we have confirmed the rejection of
the Null Hypothesis of Mean Temp <= 3.9 C.
With 90% confidence, we can conclude the temperature indeed crossed the
permissible limit of 3.9 C.
With 99.53% actual confidence, we can conclude the above statement.
With only 0.47% confidence can we conclude that the temperature is equal to
or lower than 3.9 C.
Thus, statistically we can conclude that corrective
measures need to be taken to keep the Cold Storage functioning properly, and
there is no apparent problem (statistically speaking) on the
procurement side from which Cold Storage is getting the Dairy
Products.
We will submit the results to the owner of the Cold Storage, and they need to
figure out the resolution path: whether it is a lackadaisical approach to the work by the
Supervisor or some inherent problem with the machines being used. This
we cannot conclude statistically due to the lack of the necessary data.

Also, as we have seen earlier, there is almost a 5% probability of the
temperatures being outside of the permissible range of 2 - 4 C, thus attracting a
hefty fine of 10% of the AMC fee; unless immediate corrective measures are taken, it could
cross the 5% mark and attract the even heftier fine of 25% of the AMC fee.

7 Appendix A
#============================================================#
# #
# Exploratory Data Analysis - Cold Storage Case Study #
# #
#=========================================================== #
# Environment Set up and Data Import
# Setup Working Directory
setwd("G:/My_R/Project 1")

# Check if the data directory is set properly


getwd()
[1] "G:/My_R/Project 1"

# Load the dataset into a temporary data frame


tempdata = read.csv("Cold_Storage_Temp_Data.csv", header = TRUE)

# Variable Identification
#------------------------#
# Dataset 1 #
#------------------------#
# View if the temporary data frame is populated properly
View(tempdata)
# Check the summary of the dataset ##
summary(tempdata)
Season Month Date Temperature
Rainy :122 Aug : 31 Min. : 1.00 Min. :1.700
Summer:120 Dec : 31 1st Qu.: 8.00 1st Qu.:2.500
Winter:123 Jan : 31 Median :16.00 Median :2.900
Jul : 31 Mean :15.72 Mean :2.963
Mar : 31 3rd Qu.:23.00 3rd Qu.:3.300
May : 31 Max. :31.00 Max. :5.000
(Other):179
# Detailed information on each variable on the dataset
str(tempdata)
'data.frame': 365 obs. of 4 variables:
$ Season : Factor w/ 3 levels "Rainy","Summer",..: 3 3 3 3 3 3 3 3 3 3 .
..
$ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 5 5 5 5 5 5 5 5 5
...
$ Date : int 1 2 3 4 5 6 7 8 9 10 ...
$ Temperature: num 2.4 2.3 2.4 2.8 2.5 2.4 2.8 2.3 2.4 2.8 ...
## use the attach command to store the column names of the dataset in the same session ##
attach(tempdata)

#Check the variable


class(Date)
[1] "integer"
#Summary of the Temperature variable
summary(Temperature)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.700 2.500 2.900 2.963 3.300 5.000
#Variance of the Temperature variable across the year
var(Temperature)

[1] 0.2586628
#Standard Deviation of the Temperature variable across the year
sd(Temperature)
[1] 0.508589
#Customized function to get the Mode value of Temperature variable
mode(Temperature)
[1] "numeric"
getmode <- function(Temperature)
+ {
+ uniqv <- unique(Temperature)
+ uniqv[which.max(tabulate(match(Temperature, uniqv)))]
+ }
result <- getmode(Temperature)
print(result)
[1] 2.5
Range = max(Temperature) - min(Temperature)
print(Range)
[1] 3.3

#Inter-quartile Range of the Temperature variable
# IQR(x) = quantile(x, 3/4) - quantile(x, 1/4); the built-in IQR() function implements this
result = IQR(Temperature)
print(result)
[1] 0.8
# Dataset 2 #
#-----------#
# Load the dataset into a temporary data frame
tempmarch = read.csv("Cold_Storage_Mar2018.csv", header = TRUE)
# View if the temporary data frame is populated properly
View(tempmarch)
# Check the summary of the dataset ##
summary(tempmarch)
Season Month Date Temperature
Summer:35 Feb:18 Min. : 1.0 Min. :3.800
Mar:17 1st Qu.: 9.5 1st Qu.:3.900
Median :14.0 Median :3.900
Mean :14.4 Mean :3.974
3rd Qu.:19.5 3rd Qu.:4.100
Max. :28.0 Max. :4.600
# Detailed information on each variable on the dataset
str(tempmarch)
'data.frame': 35 obs. of 4 variables:
$ Season : Factor w/ 1 level "Summer": 1 1 1 1 1 1 1 1 1 1 ...
$ Month : Factor w/ 2 levels "Feb","Mar": 1 1 1 1 1 1 1 1 1 1 ...
$ Date : int 11 12 13 14 15 16 17 18 19 20 ...
$ Temperature: num 4 3.9 3.9 4 3.8 4 4.1 4 3.8 3.9 ...
## to avoid a name conflict with the Temperature header name from the previous dataset,
## rename the column as a precautionary measure ##
colnames(tempmarch)[colnames(tempmarch)=="Temperature"] <- "temp"
#Check the variable
class(Date)
[1] "integer"
#Summary of the Temperature variable
summary(temp)

Min. 1st Qu. Median Mean 3rd Qu. Max.
3.800 3.900 3.900 3.974 4.100 4.600
#Variance of the Temperature variable across the year
var(temp)
[1] 0.0254958
#Standard Deviation of the Temperature variable across the year
sd(temp)
[1] 0.159674
#Customized function to get the Mode value of Temperature variable
mode(temp)
[1] "numeric"
getmode <- function(temp)
+ {
+ uniqv <- unique(temp)
+ uniqv[which.max(tabulate(match(temp, uniqv)))]
+ }
result <- getmode(temp)
print(result)
[1] 3.9
# Range
Range = max(temp) - min(temp)
print(Range)
[1] 0.8
#Inter-quartile Range of the Temperature variable
# IQR(x) = quantile(x, 3/4) - quantile(x, 1/4); the built-in IQR() function implements this
result = IQR(temp)
print(result)
[1] 0.2

#install ggplot2 package to be used later for drawing up different plots
# install.packages("ggplot2")   # only needed once; the package was already installed in this session
library(ggplot2)
#Boxplot of Temperatures across the year
boxplot(Temperature,
+ horizontal = TRUE,
+ col = "Blue",
+ main = "Boxplot: Temperatures across the Year",
+ xlab = "Temperatures")
#Boxplot of temperatures across seasons
boxplot(Temperature~Season,
+ horizontal = TRUE,
+ col = c("Cyan", "Red", "Green"),
+ main = "Temperatures vs. Seasons",
+ xlab = "Temperatures",
+ ylab = "Seasons")
#Histogram: Temperatures across the Year
hist(Temperature, main = "Histogram: Temperatures across the Year", col = "Blue")
#Advanced Histogram for Temperatures across different seasons
ggplot(tempdata, aes(Temperature, fill = Season)) +
+ xlab("Temperature") +

+ ylab("") +
+ ggtitle("Histogram: Temperatures vs. Seasons")+
+ geom_histogram() +
+ scale_fill_manual(values =
+ c("Winter" = "Green",
+ "Summer" = "Red",
+ "Rainy" = "Cyan"))

##categorical variables
plot(Season, main = 'Seasons', xlab = "Seasons", ylab = "Frequency", col = c("Cyan", "Red", "Green"))
plot(Month,main='Months',xlab = "Months", ylab = "Frequency")
#Bivariate Analyses
#===================
## Temperature vs Month
## Install the randomcoloR package to help us generate 12 random colors, one for each month, to be used in the boxplot
# install.packages("randomcoloR")   # only needed once; the package was already installed in this session
library(randomcoloR)
##Boxplot of Temperature vs Month
plot(Month,Temperature,
+ horizontal = TRUE,
+ main='Temperature Vs Month',
+ xlab = "Temperature",
+ ylab = "Month",
+ col = randomColor(12, luminosity="light"))
## Use rpivotTable to chart out the Mean temperatures for each month through a single function
library(rpivotTable)
rpivotTable(tempdata)
## Temperature vs Season
## Boxplot of Temperature vs Season
plot(Season,Temperature,
+ horizontal = TRUE,
+ main='Temperature Vs Season',
+ xlab = "Temperature",
+ ylab = "Season")
## Temperature vs Date

cor(Temperature, tempdata$Date)
[1] -0.02814857

#Variable Transformation/Feature creation


# Create a new factor variable Datespan based on the Date
tempdata$Datespan <- cut(tempdata$Date, c(-Inf,10,20,Inf), c("Start of Month", "Mid Month", "Month End"))
#Validate the correct formation of the new field
View(tempdata)

dim(tempdata)
[1] 365   5
summary(tempdata)
     Season        Month          Date         Temperature            Datespan
 Rainy :122   Aug    : 31   Min.   : 1.00   Min.   :1.700   Start of Month:120
 Summer:120   Dec    : 31   1st Qu.: 8.00   1st Qu.:2.500   Mid Month     :120
 Winter:123   Jan    : 31   Median :16.00   Median :2.900   Month End     :125
              Jul    : 31   Mean   :15.72   Mean   :2.963
              Mar    : 31   3rd Qu.:23.00   3rd Qu.:3.300
              May    : 31   Max.   :31.00   Max.   :5.000
              (Other):179
head(tempdata)

Season Month Date Temperature Datespan


1 Winter Jan 1 2.4 Start of Month
2 Winter Jan 2 2.3 Start of Month
3 Winter Jan 3 2.4 Start of Month
4 Winter Jan 4 2.8 Start of Month
5 Winter Jan 5 2.5 Start of Month
6 Winter Jan 6 2.4 Start of Month
tail(tempdata)
Season Month Date Temperature Datespan
360 Winter Dec 26 2.7 Month End
361 Winter Dec 27 2.7 Month End
362 Winter Dec 28 2.3 Month End
363 Winter Dec 29 2.6 Month End
364 Winter Dec 30 2.3 Month End
365 Winter Dec 31 2.9 Month End
str(tempdata)
'data.frame': 365 obs. of 5 variables:
$ Season : Factor w/ 3 levels "Rainy","Summer",..: 3 3 3 3 3 3 3 3 3 3 .
..
$ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 5 5 5 5 5 5 5 5 5
...
$ Date : int 1 2 3 4 5 6 7 8 9 10 ...
$ Temperature: num 2.4 2.3 2.4 2.8 2.5 2.4 2.8 2.3 2.4 2.8 ...
$ Datespan : Factor w/ 3 levels "Start of Month",..: 1 1 1 1 1 1 1 1 1 1 .
..
#Boxplot of temperatures across seasons
boxplot(Temperature~Datespan,
+ horizontal = TRUE,
+ col = c("Cyan", "Red", "Green"),
+ main = "Temperatures vs. Datespans in Months across the Year",
+ xlab = "Temperatures",
+ ylab = "Date Spans")
##Problem 1 Question 1
#--------------------#
## Temporary data frame with data from the Winter Season ##
temp_winter = tempdata[Season %in% "Winter",]
##View the Winter season temp data table
head(temp_winter)
Season Month Date Temperature
1 Winter Jan 1 2.4
2 Winter Jan 2 2.3
3 Winter Jan 3 2.4
4 Winter Jan 4 2.8
5 Winter Jan 5 2.5
6 Winter Jan 6 2.4
tail(temp_winter)
Season Month Date Temperature

360 Winter Dec 26 2.7
361 Winter Dec 27 2.7
362 Winter Dec 28 2.3
363 Winter Dec 29 2.6
364 Winter Dec 30 2.3
365 Winter Dec 31 2.9
##Create a variable to calculate and populate the mean temperature of the Winter season
mean_winter = round(mean(temp_winter$Temperature), digits = 2)
##check the variable and the value
mean_winter
[1] 2.7

#-----------------------------------------------------#
## Temporary data frame with data from the Summer Season ##
temp_summer = tempdata[Season %in% "Summer",]
##View the Summer season temp data table
head(temp_summer)
Season Month Date Temperature
32 Summer Feb 1 3.1
33 Summer Feb 2 3.3
34 Summer Feb 3 3.3
35 Summer Feb 4 3.9
36 Summer Feb 5 3.3
37 Summer Feb 6 3.1
tail(temp_summer)
Season Month Date Temperature
146 Summer May 26 2.6
147 Summer May 27 2.8
148 Summer May 28 3.7
149 Summer May 29 3.8
150 Summer May 30 3.1
151 Summer May 31 3.3
##Create a variable to calculate and populate the mean temperature of the Summer season
mean_summer = round(mean(temp_summer$Temperature), digits = 2)
##check the variable and the value
mean_summer
[1] 3.15

#-----------------------------------------------------#
## Temporary data frame with data from the Rainy Season ##
temp_rainy = tempdata[Season %in% "Rainy",]
##View the Rainy season temp data table
head(temp_rainy)
Season Month Date Temperature
152 Rainy Jun 1 3.2
153 Rainy Jun 2 3.9
154 Rainy Jun 3 2.9
155 Rainy Jun 4 3.1
156 Rainy Jun 5 2.8
157 Rainy Jun 6 2.8
tail(temp_rainy)
Season Month Date Temperature
268 Rainy Sep 25 2.6
269 Rainy Sep 26 3.9
270 Rainy Sep 27 3.3
271 Rainy Sep 28 2.9
272 Rainy Sep 29 1.7

273 Rainy Sep 30 2.6
##Create a variable to calculate and populate the mean temperature of the Rainy season
mean_rainy = round(mean(temp_rainy$Temperature), digits = 2)
##check the variable and the value
mean_rainy
[1] 3.04
Header <- c("Mean Temp in Summer", "Mean Temp in Rainy", "Mean Temp in Winter")
Data <- c(mean_summer, mean_rainy, mean_winter)
Mean_Temp = rbind(Header, Data)

Mean_Temp
[,1] [,2] [,3]
Header "Mean Temp in Summer" "Mean Temp in Rainy" "Mean Temp in Winter"
Data "3.15" "3.04" "2.7"

##Problem 1 Question 2
#--------------------#
mean_fullyear_noround = mean(Temperature)
mean_fullyear = round(mean(Temperature), digits = 2)
mean_fullyear
[1] 2.96
Header1 = "Overall Mean of Full Year"
Mean_FY = rbind(Header1,mean_fullyear)
Mean_FY
[,1]
Header1 "Overall Mean of Full Year"
mean_fullyear "2.96"

##Problem 1 Question 3
#--------------------#
sd_fy = round(sd(Temperature, na.rm = TRUE), digits = 4)
sd_fy
[1] 0.5086
Header1 = "Standard Deviation for the full year"
SD_FY = rbind(Header1,sd_fy)
SD_FY
[,1]
Header1 "Standard Deviation for the full year"
sd_fy "0.5086"

##Problem 1 Question 4
#--------------------#
y <- pnorm(2, mean = mean_fullyear_noround , sd = sd_fy, lower.tail = TRUE)
y*100
[1] 2.918417
##Problem 1 Question 5
#--------------------#
z <- pnorm(4, mean = mean_fullyear_noround , sd = sd_fy, lower.tail = FALSE)
z*100
[1] 2.070296
##Problem 1 Question 6
#--------------------#
m = (y + z)*100

m
[1] 4.988713
##-------------------------##
## Problem 2 ##
##-------------------------##

## Setting the working directory ##


setwd("G:/My_R/Project 1")
## confirm if the working directory is set properly ##
getwd()
[1] "G:/My_R/Project 1"
## load the dataset into a temporary data frame ##
tempmarch = read.csv("Cold_Storage_Mar2018.csv", header = TRUE)
## View if the temporary data frame is populated properly ##
View(tempmarch)
## Check the summary of the dataset ##
summary(tempmarch)
Season Month Date Temperature
Summer:35 Feb:18 Min. : 1.0 Min. :3.800
Mar:17 1st Qu.: 9.5 1st Qu.:3.900
Median :14.0 Median :3.900
Mean :14.4 Mean :3.974
3rd Qu.:19.5 3rd Qu.:4.100
Max. :28.0 Max. :4.600
## to avoid a name conflict with the Temperature header name from the previous dataset,
## rename the column as a precautionary measure ##
colnames(tempmarch)[colnames(tempmarch)=="Temperature"] <- "temp"
## Reverify if the column rename worked properly ##
summary(tempmarch)
Season Month Date temp
Summer:35 Feb:18 Min. : 1.0 Min. :3.800
Mar:17 1st Qu.: 9.5 1st Qu.:3.900
Median :14.0 Median :3.900
Mean :14.4 Mean :3.974
3rd Qu.:19.5 3rd Qu.:4.100
Max. :28.0 Max. :4.600
## use the attach command to store the column names of the dataset in the same session ##
attach(tempmarch)
## Hypothesis Testing ##
## T Test being used since the population standard deviation is unknown ##
## Null Hypothesis: Ho: Mu <= 3.9 ##
## Alternate Hypothesis: Ha: Mu > 3.9 ##
## leading and trailing brackets have been used in all equations to
## save an extra step ##
##------------------------------------------##
## based on the Hypothesis stated, Mu = 3.9 ##
(Mu = 3.9)
[1] 3.9
## calculating the sample mean and populating it in the variable xbar ##
(Xbar = mean(temp))
[1] 3.974286

## calculate the standard deviation of the sample and store it in the variable sd ##
(sd = sd(temp))
[1] 0.159674
## populate the sample size into a variable m ##
(m = 35)
[1] 35
## calculate the degrees of freedom that we will need later at the time of Pvalue calculations ##
(df = m - 1)
[1] 34
## sampling error calculation: xbar - mu ##
(se = Xbar - Mu)
[1] 0.07428571
## standard error calculation: sd/sqrt(m) ##
(sde = sd/m^0.5)
[1] 0.02698984
## tstat calculations ##
(Tstat = se/sde)
[1] 2.752359

## pvalue calculations ##
## Since Ha: Mu > 3.9, this is a right tailed test ##
## therefore, we need to pass the argument lower.tail = FALSE ##
(Pvalue = pt(Tstat, df, lower.tail = FALSE))
[1] 0.004711198
## alpha(significance) = 0.1 ##
(alpha = 0.1)
[1] 0.1
## since pvalue is lesser than alpha, my null hypothesis is rejected ##
## this being an upper tailed test, we need to subtract p from 1 to get the actual confidence ##
## also, the actual confidence is (1 - p)*100 percent ##
(actual_confidence = (1 - Pvalue)*100)
[1] 99.52888

#---------------------------------------------------------------------------#
# THE END #
#---------------------------------------------------------------------------#

