Outline

1. Background
2. Motivation
3. Objective
4. Description of Data
5. Intelligence extracted from Data
a. Using Scatter Plots and Null Hypothesis
b. Graphs of Correlation
6. Use of R Programming
a. R-code
b. Module wise description
7. What you have learnt from this Project?
8. Summary
9. Innovation finds in the field of Communication

1. Background
With ever-increasing data traffic, both on the web and in inventories,
we have reached the stage of dealing with Big Data. We have abundant data
ready to be exploited, but it is of no use until we analyze it and make
some meaning out of it.

We therefore use a probability-based approach to data analysis to understand
the data, its behaviour and its underlying characteristics. The reasons why
probability-based data analysis is important are discussed below.

Why is probability-based data analysis important?

-> Probability-based data analysis comprises many statistical techniques for
analyzing facts.
-> It helps to identify which data is correct and to understand in detail how
the data is processed.
-> Through such techniques we can identify patterns in the given data.
-> We can compute statistics on the data and extract useful information that
helps us draw conclusions.
-> It supports decision making: interpreting the data yields insights that
help us make important decisions.
-> The analysis should yield a probability value from which information can be
extracted easily.
-> It can be used in business, social science and any other area where a
statistical conclusion is needed.
-> For example, population data and consumer data from various places need
probability-based data analysis. Such analysis is also useful for weather data
and for banking data.

2. Motivation
The Anderson-Darling (AD) test is one of the statistical tests applied to data
to understand its behaviour and exploit its characteristics. Its advantages
over other tests are listed below.
Advantages of AD Test
1. Determine type of Distribution:

-> The AD test can be used to determine the distribution followed by a given
data set. It can test which distribution is followed from a list of candidates
such as the Weibull, Exponential, Log-Normal and Normal distributions.
-> Knowing the type of distribution, we can describe the characteristics of
the data and comment on its behaviour.

2. Better Test-Statistic than others:

-> According to M. A. Stephens, the test statistic of the AD test is one of the
best, as it readily detects deviations and departures of data from
normality. [1]

3. Critical values are distribution specific:

-> The critical values of the AD test depend on the distribution being studied.
This makes the AD test preferable to the KS test, where the critical values are
independent of the distribution being studied.
-> Distribution-dependent critical values make the test distribution oriented
and also increase its sensitivity. [2]

4. Best Fit both for small and large samples:

-> The modified test statistic is given by A*(1+0.75/n+2.25/n^2), where A is
the AD statistic and n is the sample size. The modification takes small sample
sizes into account, so the test caters to both small and large data sets and
thus acts as the best distance test. [3]
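To see how much the small-sample modification matters, the sketch below evaluates the factor (1+0.75/n+2.25/n^2) for a few sample sizes. The helper name ad_correction and the chosen sizes are our own illustration and are not part of any package.

```r
# Small-sample correction factor applied to the AD statistic A.
# ad_correction is a hypothetical helper name used only for this sketch.
ad_correction <- function(n) 1 + 0.75 / n + 2.25 / n^2

# Illustrative sample sizes, from the minimum accepted by ad.test() upwards.
sizes <- c(8, 100, 1000, 100000)
factors <- ad_correction(sizes)

print(data.frame(n = sizes, factor = round(factors, 6)))
```

The factor approaches 1 as n grows, so the modification is only noticeable for small samples.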

3. Objective
What are we going to do with the AD test in data analysis and
communication algorithms?
-> The main objective behind the AD test is to identify the type of
distribution followed by the data and to predict its behaviour accordingly.

-> Each type of distribution has specific characteristics of its own. With
these characteristics we get to know the behaviour of the data being
studied, and analyzing the data according to its behaviour helps us reach
refined conclusions and find the pattern being followed.
-> The Anderson-Darling statistic is used in goodness-of-fit tests for the
Gompertz distribution, which in turn is used to model the life span of real
entities such as the life cycle of an electronic item or the rate at which
code fails, and is widely used for the life span of living organisms. With
some modifications, the Anderson-Darling test is also used to examine the
upper and lower tails of many distributions. [4] [5]
-> The Anderson-Darling technique is used in Cognitive Radio. Cognitive radio
is the concept in which the unused part of the spectrum is given to a
secondary user while the requirements of the primary user are still met. In
such a system the distribution of the signal can be modeled by a Gaussian
distribution, and the received signal is compared with the noise
distribution. If we have a priori information about the noise distribution,
we can use the Anderson-Darling test to check whether the received signals
are drawn from the noise distribution. This method is also called
Anderson-Darling Sensing. [6]

4. Description of Data
The collected data records the weather and atmospheric conditions in and
around the James Clerk Maxwell Building, located in Edinburgh, U.K.

Data is taken from: http://www.ed.ac.uk/schools-departments/geosciences/weatherstation/download-weather-data

It contains readings of the following parameters:

i. Atmospheric Pressure (mBar)
ii. Rainfall (mm)
iii. Wind Speed (m/s)
iv. Wind Direction (degrees)
v. Surface Temperature (Celsius)
vi. Relative Humidity (%)
vii. Solar Flux (kW/m^2)
viii. Battery (Volts)
All of the above readings are taken every minute, from January 1 to
December 31 of the year 2014. This results in a total of 525,206 records.
The file JCMB_2014.csv therefore holds minute-by-minute weather data for the
James Clerk Maxwell Building. A complete year would contain
60 minutes x 24 hours x 365 days = 525,600 entries; the file contains
525,206 entries for the year 2014.
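The record count can be checked with a short calculation. The sketch below only uses the figure quoted above (525,206 records); the variable names are our own.

```r
# Minutes in a non-leap year such as 2014.
minutes_per_year <- 60 * 24 * 365

# Record count quoted in the text for JCMB_2014.csv.
records_in_file <- 525206

# Minutes for which no reading is present in the file.
missing_minutes <- minutes_per_year - records_in_file

cat("Minutes in the year:", minutes_per_year, "\n")
cat("Records in the file:", records_in_file, "\n")
cat("Missing readings:   ", missing_minutes, "\n")
```

The difference is small compared with the size of the data set, so the file can be treated as an almost complete minute-by-minute record.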

5. Intelligence extracted from Data


a. Using Scatter Plots and other tools
b. Null hypothesis testing

What is Normal Distribution?

A normal distribution is the probability distribution of a normal random
variable X with mean $\mu$ and variance $\sigma^2$. It is a statistical
probability distribution with probability density function (PDF): [7]

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

This probability distribution is a bell-shaped, symmetric curve.

The central peak of the bell shifts as we change the value of the mean
($\mu$), and the width of the bell curve varies as we vary the value of the
variance ($\sigma^2$).

Not every bell-shaped curve represents the normal distribution. The
characteristic shape of the normal distribution does not depend on the
distribution parameters, yet symmetric, bell-shaped data is not necessarily
normal: other distributions also have a bell-shaped curve.

Therefore, in order to determine a specific distribution, one has to
perform several tests and also test the alternative models. [8]
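As a small illustration of bell-shaped curves that are not normal, the sketch below compares the standard normal density with Student's t (3 degrees of freedom) and the logistic density. The choice of these two alternatives and of the evaluation points is ours.

```r
# Evaluate three symmetric, bell-shaped densities at a few points.
x <- c(0, 2, 4)
dens <- data.frame(
  x = x,
  normal = dnorm(x),          # standard normal
  t_df3 = dt(x, df = 3),      # Student's t with 3 degrees of freedom
  logistic = dlogis(x)        # standard logistic
)
print(round(dens, 5))

# Probability of falling more than 4 units from the centre.
tails <- c(normal = 2 * pnorm(-4), t_df3 = 2 * pt(-4, df = 3))
print(tails)
```

All three curves are symmetric about zero, but the t distribution puts far more probability in the tails, which is the kind of departure a normality test is meant to detect.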
NOTE: The reason for using t.test()

The Anderson-Darling test in the nortest package can be used to determine
whether the data follows a normal distribution or not. If we want to build a
hypothesis regarding the mean, it is preferable to use t.test(), as it
directly gives the analysis based on the actual mean and the assumed
mean. The Anderson-Darling test, in contrast, is distribution specific, with
the test statistic changing for different distributions.

5.a Using Scatter Plots and Null Hypothesis

5.1 Atmospheric Pressure

1. AD Normality Test
H0 = Atmospheric Pressure follows Normal Distribution
H1 = Atmospheric Pressure does not follow Normal Distribution

Atmospheric Pressure does not follow Normal Distribution. Thus, we reject
our null hypothesis.
As we can observe from the histogram, the value of Atmospheric Pressure
ranges mainly from 1000 mBar to 1025 mBar.
Average Yearly Atmospheric Pressure of Edinburgh is 1013.25 mBar
2. Student's t-Test
H0 = The mean of Atmospheric Pressure is 1013.25 mBar
H1 = The mean of Atmospheric Pressure is not 1013.25 mBar
2.a. For 100 Samples

The null hypothesis of Atmospheric Pressure having a mean of 1013.25 mBar
is rejected at the 95% confidence level.
As all of the values are nearly the same, the series is almost constant and
its graph would be flat, so t.test() is not informative here.
2.b. For 1000 Samples

The null hypothesis of Atmospheric Pressure having a mean of 1013.25 mBar
is rejected at the 95% confidence level.
The first 1000 samples cover the first 16.67 hours of January 1, i.e. winter,
so the temperature is low.
On the assumption that atmospheric pressure rises with temperature, the
pressure is much lower than the assumed mean, which agrees with our test
result.
2.c. For 10,000 Samples

The null hypothesis of Atmospheric Pressure having a mean of 1013.25 mBar
is rejected at the 95% confidence level.
The first 10,000 samples cover the first 7 days of January, i.e. winter, so
the temperature is low.
As the values span a narrow range and the sample size has increased, the mean
is lower than that of the first 1000 samples.
On the assumption that atmospheric pressure rises with temperature, the
pressure is much lower than the assumed mean, which agrees with our test
result.
2.d. For 1,00,000 Samples

The null hypothesis of Atmospheric Pressure having a mean of 1013.25 mBar
is rejected at the 95% confidence level.
The first 100,000 samples cover about 2.3 months, i.e. still winter, so the
temperature is low.
Over these 2.3 months the temperature has gradually started to increase, but
not by much, so the atmospheric pressure is higher than in the previous case.
On the assumption that atmospheric pressure rises with temperature, the
pressure is still lower than the assumed mean, which agrees with our test
result.
2.e. For ALL Samples

The null hypothesis of Atmospheric Pressure having a mean of 1013.25 mBar
is rejected at the 95% confidence level.
The data covers a full year at a site in the UK. The UK is cold and the site
lies above sea level, so the temperature is lower than at a place at sea
level. The null hypothesis uses 1013.25 mBar, which is the general pressure
at sea level.
So the atmospheric pressure is lower than the assumed mean, which agrees
with our test result.
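The procedure used in 2.a to 2.e can be sketched as follows. This is a minimal illustration that assumes a synthetic stand-in for the pressure column (a slow drift plus noise), because the real file is not loaded here; the numbers it prints are therefore not the results of this project.

```r
set.seed(1)

# Synthetic stand-in for the atmospheric pressure column (assumption):
# a slow drift plus random noise, one value per minute.
n_total <- 20000
pressure <- 1005 + 5 * sin(seq_len(n_total) / 4000) + rnorm(n_total, sd = 2)

assumed_mean <- 1013.25          # mean stated in H0
sizes <- c(100, 1000, 10000)     # number of leading samples to test

# Run a one-sample t-test on the first n samples for each n.
results <- do.call(rbind, lapply(sizes, function(n) {
  tt <- t.test(pressure[seq_len(n)], mu = assumed_mean, conf.level = 0.95)
  data.frame(
    samples = n,
    hours = n / 60,                        # one sample per minute
    sample_mean = unname(tt$estimate),
    p_value = tt$p.value,
    reject_H0 = tt$p.value < 0.05
  )
}))
print(results)
```

The hours column makes the time span of each subset explicit: 100 samples cover about 1.67 hours, 1000 samples about 16.67 hours and 10,000 samples about 6.94 days.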

5.2 Relative Humidity

1. AD Normality Test
H0 = Relative Humidity follows Normal Distribution
H1 = Relative Humidity does not follow Normal Distribution

Relative Humidity does not follow Normal Distribution. Thus, we reject our
null hypothesis.
As we can observe from the histogram, the value of Relative Humidity
ranges mainly from 72% to 90%.
Average Yearly Relative Humidity of Edinburgh is 80.18249 %
2. Student's t-Test
H0 = The mean of Relative Humidity is 82.91667 %
H1 = The mean of Relative Humidity is not 82.91667 %
2.a. For 100 Samples

The null hypothesis of Relative Humidity having a mean of 82.91667 % is
rejected at the 95% confidence level.
The first 100 samples cover the first 1.667 hours of January 1, i.e. winter,
so the temperature is low.
On the assumption that humidity is lower in a cold atmosphere, the humidity
is much lower than the assumed mean, which agrees with our test result.

2.b. For 1000 Samples

The null hypothesis of Relative Humidity having a mean of 82.91667 % is
rejected at the 95% confidence level.
The first 1000 samples cover the first 16.67 hours of January 1, i.e. winter,
so the temperature is low.
The atmosphere is still cool, so the humidity does not change much. The
average changes but does not reach the assumed value, which agrees with the
test.

2.c. For 10,000 Samples

The null hypothesis of Relative Humidity having a mean of 82.91667 % is
rejected at the 95% confidence level.
The first 10,000 samples cover the first 7 days of January, i.e. winter, so
the temperature is low.
The atmosphere is still cool, so the humidity does not change much. The
average decreases because of the variation over the days but does not reach
the assumed value, which agrees with the test.

2.d. For 1,00,000 Samples

The null hypothesis of Relative Humidity having a mean of 82.91667 % is
rejected at the 95% confidence level.
The first 100,000 samples cover about 2.3 months, i.e. still winter, so the
temperature is low.
Over these 2.3 months the temperature has gradually started to increase, but
not by much, and the humidity changes with it compared to the previous case.
Hence the relative humidity has increased, but not up to the assumed mean.

2.e. For ALL Samples

The null hypothesis of Relative Humidity having a mean of 82.91667 % is
rejected at the 95% confidence level.
Although the relative humidity increases as we reach the monsoon season,
over the full 525,206 samples the mean ends up lower rather than higher.
Hence, instead of rising to the assumed mean, it falls short of it, as the
test confirms.

5.3 Surface Temperature

1. AD Normality Test
H0 = Surface Temperature follows Normal Distribution
H1 = Surface Temperature does not follow Normal Distribution

Surface Temperature does not follow Normal Distribution. Thus, we reject our
null hypothesis.
As we can observe from the histogram, the value of Surface Temperature
ranges mainly from 5 °C to 15 °C.
Average Yearly Surface Temperature of Edinburgh is 9.41 °C
2. Student's t-Test
H0 = The mean of Surface Temperature is 13 °C
H1 = The mean of Surface Temperature is not 13 °C

2.a. For 100 Samples

The null hypothesis of Surface Temperature having a mean of 13 °C is
rejected at the 95% confidence level.
The first 100 samples cover the first 1.667 hours of January 1, i.e. winter,
so the temperature is much lower than the assumed value.
2.b. For 1000 Samples

The null hypothesis of Surface Temperature having a mean of 13 °C is
rejected at the 95% confidence level.
The first 1000 samples cover the first 16.67 hours of January 1, i.e. winter,
so the temperature is low.
As the day passes the temperature drops even further, so the mean goes down
further.

2.c. For 10,000 Samples

The null hypothesis of Surface Temperature having a mean of 13 °C is
rejected at the 95% confidence level.
The first 10,000 samples cover the first 7 days of January, i.e. winter, so
the temperature is low.
As time passes and winter recedes, the average temperature rises, but not up
to the assumed yearly average.
2.d. For 1,00,000 Samples

The null hypothesis of Surface Temperature having a mean of 13 °C is
rejected at the 95% confidence level.
The first 100,000 samples cover about 2.3 months, i.e. still winter, so the
temperature is low.
Over these 2.3 months the temperature has gradually started to increase, but
not by much.

2.e. For ALL Samples

The null hypothesis of Surface Temperature having a mean of 13 °C is
rejected at the 95% confidence level.
Over the year we are measuring the temperature of a cold place, and the
temperature is much lower than the assumed average value.

5.4 Wind Speed

1. AD Normality Test
H0 = Wind Speed follows Normal Distribution
H1 = Wind Speed does not follow Normal Distribution

By observing the graph and checking with ad.test(), we find that Wind Speed does
not follow Normal Distribution. Thus, we reject our null hypothesis.
As we can observe from the histogram, the value of Wind Speed ranges
mainly from 1.042 m/s to 4.396 m/s.
The histogram shows a linear decrease from 1 m/s to 14 m/s.
The mean Wind Speed is 2.952 m/s, indicating that on a regular day the wind
speed is most likely to be around 3 m/s.
Wind speeds beyond 7.5 m/s are therefore less likely, as they occur during
unusual weather conditions.
Average Yearly Wind Speed in Edinburgh is 2.952 m/s

Overall Conclusion:
The null hypothesis is rejected, as both the test and the graphical observation
support the same result.

Mean wind speed remains at about 3 m/s during regular days.

2. Student's t-Test
H0 = The mean of Wind Speed is 2.83 m/s
H1 = The mean of Wind Speed is not 2.83 m/s

2.a. For 100 Samples

The null hypothesis of Wind Speed having a mean of 2.83 m/s is rejected at the
95% confidence level.
The mean speed from the data is around 6.3 m/s, whereas we are testing for
2.83 m/s. Thus, there is a large difference between the two means.

2.b. For 1000 Samples

The null hypothesis of Wind Speed having a mean of 2.83 m/s is rejected at the
95% confidence level.

2.c. For 10,000 Samples

The null hypothesis of Wind Speed having a mean of 2.83 m/s is rejected at the
95% confidence level.
The first 10,000 samples correspond to the wind speed data from the first week of
January. Roughly, the wind speed in that time is 14 km/h or 3.9 m/s. Thus we find
that 2.83 m/s deviates considerably from the recorded mean.

2.d. For 1,00,000 Samples

The null hypothesis of Wind Speed having a mean of 2.83 m/s is accepted (not
rejected) at the 95% confidence level.
The acceptance range for the mean extends from 2.819304 to 2.85, and the
assumed mean of 2.83 lies within this range, so the null hypothesis is
accepted.

2.e. For ALL Samples

The null hypothesis of Wind Speed having a mean of 2.83 m/s is rejected at the
95% confidence level.

5.5 Wind Direction

1. AD Normality Test
H0 = Wind Direction follows Normal Distribution
H1 = Wind Direction does not follow Normal Distribution

By observing the graph and checking with ad.test(), we find that Wind Direction
does not follow Normal Distribution. Thus, we reject our null hypothesis.
The distribution is concentrated around two peaks, one at around 225°-250° and
the other at around 301°-320°. Thus, wind direction does not follow a normal
distribution.
The first peak corresponds to the Southwest and some parts of the West, and the
other peak corresponds to the Northwest.
Thus, the majority of the time the wind blows from the west (North-west as well
as South-west).
This can also be validated from the fact that there is a huge open golf course
(Craig Millar Park) surrounding the west and the southern part of the observatory.
The range from 45° to 135° corresponds to North-East, East and South-East.
Thus, no wind comes from that side.
Average Yearly Wind Direction in Edinburgh is 159.6°

2. Student's t-Test
H0 = The mean of Wind Direction is 238°
H1 = The mean of Wind Direction is not 238°

2.a. For 100 Samples

The null hypothesis of Wind Direction having a mean of 238° is rejected at the
95% confidence level.

2.b. For 1000 Samples

The null hypothesis of Wind Direction having a mean of 238° is accepted (not
rejected) at the 95% confidence level.

2.c. For 10,000 Samples

The null hypothesis of Wind Direction having a mean of 238° is rejected at the
95% confidence level.

2.d. For 1,00,000 Samples

The null hypothesis of Wind Direction having a mean of 238° is rejected at the
95% confidence level.

2.e. For ALL Samples

The null hypothesis of Wind Direction having a mean of 238° is rejected at the
95% confidence level.

5.b. Graphs of Correlation

1. Surface-temperature and rainfall

SCATTER PLOT
Observation through plot:

This scatter plot shows the relationship between rainfall and surface
temperature.
Rainfall is on the y axis and surface temperature on the x axis.
The data points are widely scattered, which shows that the relationship is
weak to a great extent.
There is no linear or curvilinear relationship.
There are very few influential data points in the surface temperature range of
-9 to 25.
The correlation seen in the graph is low because the data points are so
scattered.

R CODE:
plot(data$surface.temperature..C.,data$rainfall..mm.)
WITH REGRESSION LINE
lines(lowess(data$surface.temperature..C.,data$rainfall..mm.),col="blue")

Statistical Observation:
CORRELATION

On the basis of the r value, the relationship is very weak, almost tending
to 0.
The statistics agree with the graph, in which the data points are widely
scattered.
Since the fitted line is horizontal, there is no correlation between the data.
The correlation function also returns a value close to zero.
Since the correlation coefficient is negative but close to zero, we find that
the two variables are not correlated.
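The r values discussed in this section come from cor(). The sketch below shows how a near-zero and a clearly negative coefficient look on synthetic data; the vectors x, y_unrelated and y_related are invented for illustration and are not columns of JCMB_2014.csv.

```r
set.seed(42)
n <- 500

# Synthetic example data (assumption, not the weather readings).
x <- runif(n, -9, 25)
y_unrelated <- rexp(n, rate = 5)                 # generated independently of x
y_related <- 95 - 0.8 * x + rnorm(n, sd = 5)     # decreases as x increases

# use = "complete.obs" drops rows with missing values before computing r.
r_unrelated <- cor(x, y_unrelated, use = "complete.obs")
r_related <- cor(x, y_related, use = "complete.obs")

cat("r for unrelated data:", round(r_unrelated, 3), "\n")
cat("r for related data:  ", round(r_related, 3), "\n")
```

A coefficient close to zero, as in the first case, is what we read as "not correlated" in the observations of this section.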
2. Rainfall and humidity

SCATTER PLOT
Observation through plot:

This scatter plot shows the relationship between relative humidity and rainfall.
Relative humidity is on the y axis and rainfall on the x axis.
The relative humidity values are mainly clustered in the rainfall range between
0 and 2.
The graph shows that the data points are clustered in one area only. This type
of clustering indicates no correlation or very little correlation. We can say
that the correlation is weak, but it cannot be negative.
It does not follow any linear or curvilinear relationship.
The data points in the rainfall range of 0 to 2 can be said to be somewhat
influential.

R CODE

plot(data$rainfall..mm.,data$relative.humidity....)
WITH REGRESSION LINE
lines(lowess(data$rainfall..mm.,data$relative.humidity....),col="blue")

Statistical Observation:
CORRELATION

On the basis of the r value, the relationship is positive but very weak.

The regression line is slightly curvilinear and then constant, and thus r is
close to zero.
On calculating the correlation we get a value of nearly zero, which confirms
that the two variables are not correlated.

3. Wind speed and rainfall

SCATTER PLOT
Observation through plot:

This scatter plot shows the relationship between rainfall and wind speed.
Rainfall is on the y axis and wind speed on the x axis.
The data is not clustered at any place.
The slope of the line is very small, which shows that there is little
correlation. We can say that the correlation is weak, but the relationship is
not negative.
It follows a roughly linear relationship with an almost negligible slope, which
also shows that the correlation is weak.

R CODE
plot(data$wind.speed..m.s.,data$rainfall..mm.)
WITH REGRESSION LINE
lines(lowess(data$wind.speed..m.s.,data$rainfall..mm.),col="blue")

Statistical Observation:
CORRELATION

On the basis of the r value, the relationship is positive but very weak.
Wind speed and rainfall are practically uncorrelated, as the regression line is
horizontal, yet we can see a slight positive correlation in the data.
This is due to the scattered points above the regression line.

4. Surface temperature and atmospheric pressure

SCATTER PLOT
Observation through plot:

This scatter plot shows the relationship between atmospheric pressure and
surface temperature.
Atmospheric pressure is on the y axis and surface temperature on the x axis.
The graph is mainly clustered over the surface temperature range between
approximately -9 and 25.
These data points do not have a specific pattern, so we can say that the
correlation is weak.
It does not follow any linear or curvilinear relationship.
The data points in the range of -9 to 25 can be said to be somewhat influential
data points.

R CODE
plot(data$surface.temperature..C.,data$atmospheric.pressure..mBar.)
WITH REGRESSION LINE
lines(lowess(data$surface.temperature..C.,data$atmospheric.pressure..mBar.),col="blue")

CORRELATION

On the basis of the r value, the relationship is weak.
The surface temperature and atmospheric pressure are positively correlated with
each other, as we get a regression line with a positive slope.

5. Relative humidity and atmospheric pressure

Observation through plot:

This scatter plot shows the relationship between relative humidity and
atmospheric pressure.
Relative humidity is on the y axis and atmospheric pressure on the x axis.
The graph is mainly clustered over the atmospheric pressure range between
approximately 950 and 1100.
The cluster gradually slopes downward, so the direction is downwards and the
association is negative. As the atmospheric pressure increases, the relative
humidity decreases. Thus we can say from the plot that the correlation is
negative.
The form cannot be stated clearly: the points are all clustered and do not
follow any linear or curvilinear relationship.
The data points are closer together in the right corner, which shows that they
are closely related there, i.e. the correlation is higher in that corner. They
show a higher negative correlation in that region, but overall it can be
concluded that the correlation is low.
The data points in the right corner can be said to be influential, as they lie
in the flow of the major cluster of data points.

Statistical Observation:

On the basis of the r value, the relationship is negative and weak.
The regression line is linear with a negative slope, which is consistent with
the negative correlation of the data.

6. Use of R Programming
a. Module wise description of Functions used
1. ad.test()
ad.test <- function (x)
{
    DNAME <- deparse(substitute(x))
    x <- sort(x[complete.cases(x)])                      # STEP-1
    n <- length(x)                                       # STEP-2
    if (n < 8)                                           # STEP-3
        stop("sample size must be greater than 7")
    logp1 <- pnorm((x - mean(x))/sd(x), log.p = TRUE)    # STEP-4
    logp2 <- pnorm(-(x - mean(x))/sd(x), log.p = TRUE)   # STEP-4
    h <- (2 * seq(1:n) - 1) * (logp1 + rev(logp2))       # STEP-5
    A <- -n - mean(h)                                    # STEP-5
    AA <- (1 + 0.75/n + 2.25/n^2) * A                    # STEP-6
    if (AA < 0.2) {                                      # STEP-7 begins
        pval <- 1 - exp(-13.436 + 101.14 * AA - 223.73 * AA^2)
    }
    else if (AA < 0.34) {
        pval <- 1 - exp(-8.318 + 42.796 * AA - 59.938 * AA^2)
    }
    else if (AA < 0.6) {
        pval <- exp(0.9177 - 4.279 * AA - 1.38 * AA^2)
    }
    else if (AA < 10) {
        pval <- exp(1.2937 - 5.709 * AA + 0.0186 * AA^2)
    }
    else pval <- 3.7e-24                                 # STEP-7 ends
    RVAL <- list(statistic = c(A = A), p.value = pval,
        method = "Anderson-Darling normality test",
        data.name = DNAME)
    class(RVAL) <- "htest"
    return(RVAL)
}
Step-1:
-> All NA values are removed from the data and it is sorted in ascending order.
Step-2:
-> The length of the data is calculated.
Step-3:
-> The length of the data is validated. If the length is greater than 7 we
proceed, otherwise we stop.
Step-4:
-> The CDF of the data is calculated using pnorm((x - mean(x))/sd(x)). mean(x)
returns a single value, the mean of the data vector, and sd(x) calculates the
standard deviation of x.
Step-5:
-> We combine the CDF values to get the test statistic, which also involves
taking the mean of the combined values.
Step-6:
-> We then modify the test statistic so that it gives the correct result even
for very small sample sizes.
Step-7:
-> Depending on the range in which the modified test statistic falls, we
calculate the p-value.
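Steps 1 to 6 can be tried out on their own with base R. The sketch below is our own compact restatement of those steps (the function name ad_statistic is hypothetical) applied to two synthetic samples; it is not the nortest implementation and it does not compute the p-value of Step-7.

```r
# Anderson-Darling statistic for normality, following Steps 1 to 6.
# ad_statistic is a hypothetical helper written for this illustration.
ad_statistic <- function(x) {
  x <- sort(x[!is.na(x)])                         # Step-1: drop NA, sort
  n <- length(x)                                  # Step-2: sample size
  if (n < 8) stop("sample size must be greater than 7")   # Step-3
  z <- (x - mean(x)) / sd(x)                      # standardize the data
  log_cdf <- pnorm(z, log.p = TRUE)               # Step-4: log CDF
  log_sf <- pnorm(-z, log.p = TRUE)               # Step-4: log of 1 - CDF
  h <- (2 * seq_len(n) - 1) * (log_cdf + rev(log_sf))     # Step-5
  A <- -n - mean(h)                               # Step-5: test statistic
  c(A = A, A_modified = (1 + 0.75 / n + 2.25 / n^2) * A)  # Step-6
}

set.seed(7)
normal_sample <- rnorm(500, mean = 10, sd = 2)    # synthetic, normal
skewed_sample <- rexp(500, rate = 1)              # synthetic, not normal

stat_normal <- ad_statistic(normal_sample)
stat_skewed <- ad_statistic(skewed_sample)
print(rbind(normal = stat_normal, skewed = stat_skewed))
```

The statistic is much larger for the skewed sample than for the normal one, which is what leads ad.test() to a very small p-value in Step-7 for data that is far from normal.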
2. cor(x,y)
This function takes two columns as input and calculates the
correlation coefficient r, which indicates whether the two variables are
correlated, i.e. follow one another. The r value lies between -1 and 1.
Values from -1 to 0 indicate negative correlation and values from 0 to 1
indicate positive correlation. For r = 0 there is no linear correlation
between the data.

3. plot(x,y)
Draws the scatter plot of two data columns.
4. hist(x) or plot(table(x))
Either function gives a graphical representation of the frequency of the
values in a data column. The X axis shows the data values and the Y axis
the frequency of occurrence.
5. t.test(x, mu = assumed mean value, conf.level = confidence level)
Compares the assumed mean value with the actual mean value of the data and
decides on the null hypothesis for the given confidence level.
6. lines(lowess(x, y))
Draws the smoothed regression line on the scatter plot; it is used for
interpreting the correlation of the data. The smoother span, passed to
lowess() as the argument f, controls how smooth the line is.
7. What you have learnt from this Project?

- Programming Skills
We got an opportunity to explore the R programming language.
We learned in depth about many functions inside various packages of R,
i.e. the nortest package, ad.test, t.test, cor, distfit, the MASS package and
the fitdistrplus package.
It has enabled us to determine whether data is drawn from a
specific probability distribution or not.

- Data Analysis and Basic Concept of Hypothesis
If the test statistic (A^2) exceeds the critical value, the null hypothesis is
rejected.
Alternatively, if the p-value is less than the 0.05 significance level,
the null hypothesis is rejected.
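The p-value rule can be written as a small helper. The function name decide, the synthetic sample and the two assumed means below are our own illustration.

```r
# Decision rule on the p-value at significance level alpha.
# decide is a hypothetical helper used only in this sketch.
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

set.seed(11)
sample_data <- rnorm(200, mean = 3, sd = 1)   # synthetic data (assumption)

# Test an assumed mean close to the data and one far away from it.
p_near <- t.test(sample_data, mu = 3)$p.value
p_far <- t.test(sample_data, mu = 5)$p.value

cat("mu = 3: p =", signif(p_near, 3), "->", decide(p_near), "\n")
cat("mu = 5: p =", signif(p_far, 3), "->", decide(p_far), "\n")
```

Strictly speaking a large p-value means that H0 is not rejected, which is what the report calls "accepted".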
- Interpretation of data Graphically
We learned how to correlate two different types of data, both graphically and
through the correlation coefficient.
We learned to interpret various types of plots, such as scatter plots,
histograms and normal plots, and are able to associate as well as correlate
the data graphically. Thus we learned the data analysis part.
From a graph we can read off the approximate mean, which lies around the
peak, and the standard deviation from the width of the graph, provided that
the distribution is normal.

Correlation is also an important part. We can draw many conclusions from it.
We learned about the form, strength and direction of plots.
For example, if the points lie close to a straight line, the relationship can
be interpreted as stronger.
Whether the curve is linear or has some other shape leads to different
interpretations.
There can be positive correlation, negative correlation and no correlation.
There can be combinations as well, like strong negative correlation, strong
positive correlation, weak positive correlation, weak negative correlation
and no correlation.
The main conclusions depend on the r value (correlation coefficient), which
lies between -1 and 1.
If the data points are fully clustered, this shows that there is no
correlation.

We understood the functionality of ad.test and its internal structure, such
as:
CDF calculations, the modified test statistic to accommodate
small sample sizes, and the calculation of the p-value depending on
the range of the test statistic.

8. Summary
The Anderson-Darling test provided in the nortest package is used to accept or
reject the null hypothesis that the data follow a normal distribution. The
normal (Gaussian) distribution used for comparison is defined by the mean and
standard deviation of the data. When our null hypothesis is based on means, we
also apply Student's t-test to individual data columns and arrive at
conclusions based on the differing means. The two-sample Student's t-test
familiarizes us with the interrelation of data. We have also used correlation
to draw conclusions about the interrelation of data. Overall, the
Anderson-Darling test can also be used to test for other distributions, such
as Weibull or logarithmic, if the critical values corresponding to the given
distribution and the required significance levels are known. Also, the test can
be modified to use a mean and standard deviation of our choice, after which the
p-values of the standard and the modified tests can be compared, provided we
have the modified test-statistic formula for the distribution in question.
Correlation leads us to a broader picture of data analysis, where we see
relations between various columns of data; it might help us in predicting
future trends in the given data. It can thus be used for weather prediction,
market analysis, product future analysis, etc.
We also found that the Anderson-Darling test is not sufficient for data
analysis such as correlation, or for the case of known mean and unknown
variance, yet for testing normality it is one of the most powerful tests.
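
As a small illustration of the mean-based tests mentioned in this summary, base
R's t.test() covers both the one-sample and the two-sample case. The vectors
below are made-up values standing in for two data columns; the names col_a and
col_b and the hypothesized mean of 13.5 are assumptions for this sketch.

```r
# Made-up values standing in for two columns of a data set
col_a <- c(10, 11, 12, 13, 14, 15, 16, 17)
col_b <- col_a + 5            # same spread, mean shifted by 5

# One-sample test: is the mean of col_a equal to a hypothesized value?
one_sample <- t.test(col_a, mu = 13.5)

# Two-sample test (Welch variant by default): do the two columns share a mean?
two_sample <- t.test(col_a, col_b)

# Reject the null hypothesis at the 5% level when p < 0.05
alpha <- 0.05
cat("one-sample: reject H0?", one_sample$p.value < alpha, "\n")
cat("two-sample: reject H0?", two_sample$p.value < alpha, "\n")
```

Here the one-sample null hypothesis is kept, because 13.5 is exactly the mean
of col_a, while the two-sample null hypothesis of equal means is rejected.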

9. Innovation finds in the field of Communication


SPECTRUM SENSING
Based on the research paper "Spectrum sensing in cognitive radio using
goodness of fit testing" by H. Wang et al., the goodness-of-fit test is used to
formulate a signal sensing technique as an alternative to energy-detection
based sensing. The Anderson-Darling test is used there to develop the
Anderson-Darling sensing technique. The main issue today is the cost of
spectrum bands and their optimum utilization. In this scenario it becomes
important to have a wide range of users sharing the same spectrum, to improve
its utilization and, on the other hand, to save spectrum bandwidth for newer
products and technologies. Cognitive radio is a radio technology that allows
non-licensed users to use the spectrum of licensed users while the band is
free and to hand it back to the primary user when needed. Thus, it adapts to
changing environment parameters. Also, Anderson-Darling sensing gives a higher
probability of detecting a signal than energy detection for certain sensing
parameters. Other methods, such as covariance-based detection and waveform
detection, are parametric sensing methods; they are used where the
signal-to-noise ratio is low, because energy detection gives a low probability
of correct sensing there. We need to detect the existence of a signal quickly
and efficiently for better utilization.
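
The idea of Anderson-Darling sensing can be sketched as a goodness-of-fit test
against a fully specified noise distribution: if the received samples do not
look like pure noise, a signal is declared present. The sketch below assumes
real-valued samples, zero-mean Gaussian noise with known standard deviation
noise_sd, and a decision threshold of 2.492, the commonly tabulated 5% critical
value of the statistic when the reference distribution is fully known. The
function name ad_sense and all sample values are hypothetical and are not taken
from the paper.

```r
# Hypothetical sketch of Anderson-Darling sensing.
# Noise-only hypothesis: samples come from N(0, noise_sd^2).
ad_sense <- function(y, noise_sd = 1, threshold = 2.492) {
  y <- sort(y)
  n <- length(y)

  # CDF of the assumed noise distribution at each sorted sample
  log_p <- pnorm(y, mean = 0, sd = noise_sd, log.p = TRUE)
  log_q <- pnorm(y, mean = 0, sd = noise_sd, lower.tail = FALSE, log.p = TRUE)

  # Anderson-Darling statistic against the fully specified noise CDF
  i  <- seq_len(n)
  A2 <- -n - mean((2 * i - 1) * (log_p + rev(log_q)))

  list(statistic = A2, signal_present = A2 > threshold)
}

# Deterministic stand-ins for received samples (assumed values)
noise_only  <- qnorm(ppoints(50))   # shaped exactly like N(0, 1) noise
with_signal <- noise_only + 1       # same noise plus a constant signal level

ad_sense(noise_only)$signal_present    # FALSE: consistent with noise
ad_sense(with_signal)$signal_present   # TRUE: noise-only hypothesis rejected
```

Unlike the normality test in ad.test, the noise mean and standard deviation are
not estimated from the samples here, so a shift in level or power caused by a
signal shows up directly as a poor fit.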

MOBILE CELLULAR SYSTEM

Location registration gives the local device the details of the mobile
station's location. We need frequent contact with the mobile station for
higher accuracy. One of the registration types is distance-based registration.
During random movement the mobile terminal shows a central tendency, that is,
its position is distributed around the center. The benefit of this tendency is
that the probability density function of the random variable describing the
movement of the mobile station can be approximated by a normal distribution.
Here, the Anderson-Darling test is used as a goodness-of-fit test for this
approximation, by finding the p-value for each of the multiple contacts.

10. References

a. Text References
1. http://en.wikipedia.org/wiki/Anderson%E2%80%93Darling_test#cite_note-Stephens74-1
2. http://www.isixsigma.com/dictionary/anderson-darlingnormality-test/
3. https://www.scribd.com/doc/234252923/Anderson-Darling-Test
4. http://iussp.org/sites/default/files/event_call_for_papers/lenartMissov.pdf
5. http://maths.york.ac.uk/www/sites/default/files/QilinHu.pdf
6. http://mathworld.wolfram.com/NormalDistribution.html
7. http://www.mathwave.com/articles/distribution_fitting_faq.html#q4
b. Other references
1. http://en.wikipedia.org/wiki/Predictive_analytics
2. http://www.mathwave.com/articles/goodness_of_fit.html
3. http://www.mathwave.com/articles/distribution_fitting_faq.html#q3
4. http://www.cde.ca.gov/ta/tg/hs/documents/mathstudysec2.pdf
5. http://www.westga.edu/assetsCOE/virtualresearch/scatterplots_and_correlation_notes.pdf
6. http://math.tutorvista.com/statistics/scatter-plot.html
7. Spectrum sensing in cognitive radio using goodness of fit testing, by
Haiquan Wang, Member, IEEE, En-hui Yang, Fellow, IEEE, Zhijin Zhao and
Wei Zhang, Member, IEEE
8. Information Networking Advances in Data Communications and Wireless
Networks: International Conference, ICOIN 2006, Sendai, Japan, January 16-19, 2006,
Revised Selected Papers (https://books.google.co.in)

c. Weather-related information [Edinburgh, U.K.]

i. http://www.accuweather.com/en/gb/edinburgh/eh1-3/weatherforecast/327336
ii. http://www.bbc.com/weather/2650225
iii. http://www.weatherhq.co.uk/weather-station/edinburgh-airport
iv. https://weatherspark.com/averages/28753/Edinburgh-Scotland-UnitedKingdom