
GRD Journals- Global Research and Development Journal for Engineering | Volume 4 | Issue 7 | June 2019

ISSN: 2455-5703

Implementation of Data Mining Algorithms using R
V. Neethidevan
Department of MCA
Mepco Schlenk Engineering College, India

Abstract
Data mining is an interdisciplinary field with applications in many domains, and its algorithms can be used to solve a wide
range of day-to-day problems. Because R Studio is convenient for researchers across the globe, this paper implements the most
widely used data mining algorithms for different case studies in the R programming language. Advanced sensing and computing
technologies have enabled the collection of large amounts of complex data. Data mining techniques can discover useful patterns
in such data, which in turn can be used to classify new data or for other purposes. Algorithms for processing large data sets
must be scalable, and algorithms for processing data with changing patterns must be capable of incrementally learning and
updating data patterns as new data become available. Although data mining algorithms such as decision trees support incremental
learning of data with mixed data types, users are not satisfied with the scalability of these algorithms in handling large
amounts of data. The following four algorithms were implemented in R Studio on complex data sets: 1) Clustering Algorithm
2) Classification Algorithm 3) Apriori Algorithm 4) Decision Tree Algorithm. It is concluded that R Studio produced efficient
results for implementing these algorithms.
Keywords- R, Data Mining, Clustering, Classification, Decision Tree, Apriori Algorithm, Data Sets

I. INTRODUCTION
R Studio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical
computing and graphics. R Studio is written in the C++ programming language and uses the Qt framework for its graphical user
interface, which includes rich code editing, debugging, testing, and profiling tools.

A. Clustering Algorithm
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure
follows a simple way to group a given data set into a certain number of clusters (assume k clusters) fixed a priori.
The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different
initial locations lead to different results. So, the better choice is to place them as far away from each other as possible. The
next step is to take each point belonging to the data set and associate it with the nearest center.
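
The assignment step described above can be sketched in a few lines of base R. This is a minimal illustration rather than the paper's implementation: the six points and the two starting centers are invented for the example, and a single update of the centers is added only to show where the iteration would continue.

```r
# Toy data: six 2-D points forming two visible groups (values are made up)
pts <- matrix(c(1.0, 1.1,
                1.2, 0.9,
                0.8, 1.0,
                5.0, 5.2,
                5.1, 4.9,
                4.9, 5.0), ncol = 2, byrow = TRUE)

# k = 2 initial centers, deliberately placed far away from each other
centers <- matrix(c(0, 0,
                    6, 6), ncol = 2, byrow = TRUE)

# Squared Euclidean distance from every point (row) to every center (column)
dist_sq <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(sweep(pts, 2, centers[j, ])^2)
})

# Assignment step: each point is associated with its nearest center
assignment <- max.col(-dist_sq, ties.method = "first")

# Update step: each center moves to the mean of the points assigned to it
new_centers <- t(sapply(seq_len(nrow(centers)), function(j) {
  colMeans(pts[assignment == j, , drop = FALSE])
}))

print(assignment)
print(new_centers)
```

Repeating the two steps until the assignment stops changing gives the basic k-means loop; the built-in kmeans() function used in the Appendix does this internally.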

B. Classification Algorithm
Classification is a data mining technique that analyzes a given data set and assigns each instance to a particular class
such that the classification error is as small as possible. It is used to extract models that define the important data
classes within the given data set. Classification is a two-step process.
During the first step, a model is created by applying a classification algorithm to a training data set. In the
second step, the extracted model is tested against a predefined test data set to measure the trained model's performance and
accuracy. Classification is thus the process of assigning a class label to data whose class label is unknown.
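
The two-step process can be sketched as follows. The example is illustrative only: it assumes the built-in iris data and the rpart package (a recommended package shipped with R), whereas the Appendix uses ctree from the party package; the 70/30 split and the seed are arbitrary choices.

```r
library(rpart)

set.seed(42)  # arbitrary seed so the split is reproducible

# Hold out roughly 30% of the rows as the test data set
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Step 1: create the model by applying a classification algorithm to the training data
model <- rpart(Species ~ ., data = train, method = "class")

# Step 2: test the model against the held-out data and measure its accuracy
pred <- predict(model, newdata = test, type = "class")
confusion <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

The diagonal of the confusion table counts correctly classified test instances, so the accuracy is the share of test rows on that diagonal.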

C. Apriori Algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by
identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item
sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association
rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
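
The level-wise idea can be illustrated with a small base-R sketch. It is not the arules implementation used in the Appendix: the five transactions, the 0.5 support threshold, and the helper name support() are assumptions made for this example, and only item sets of size one and two are generated.

```r
# Five made-up market baskets
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "butter", "jam"),
  c("bread", "butter", "milk")
)
min_support <- 0.5

# Support of an item set: fraction of transactions containing all of its items
support <- function(itemset) {
  mean(sapply(transactions, function(tr) all(itemset %in% tr)))
}

# Level 1: frequent individual items
items <- sort(unique(unlist(transactions)))
item_support <- sapply(items, support)
frequent_items <- items[item_support >= min_support]

# Level 2: candidate pairs are built only from frequent single items,
# because a pair cannot be frequent if one of its items is infrequent
candidates <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- sapply(candidates, support)
frequent_pairs <- candidates[pair_support >= min_support]

# Confidence of the association rule {bread} => {milk}
conf_bread_milk <- support(c("bread", "milk")) / support("bread")

print(frequent_items)
print(frequent_pairs)
print(conf_bread_milk)
```

Here "jam" is dropped at the first level, so no pair containing it is ever counted; this pruning is what keeps the search manageable on large transactional databases.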

D. Decision Tree Algorithm


Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the
branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches
used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are
called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that
lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called
regression trees.
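
The distinction between the two kinds of tree can be sketched with the rpart package. The data sets (iris and mtcars, both built into R) and the chosen predictors are assumptions made for illustration and are not the data used in the paper.

```r
library(rpart)

# Classification tree: the target (Species) takes a discrete set of values
class_tree <- rpart(Species ~ ., data = iris, method = "class")
class_pred <- predict(class_tree, newdata = iris[1, ], type = "class")

# Regression tree: the target (mpg) takes continuous values
reg_tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")
reg_pred <- predict(reg_tree, newdata = mtcars[1, ])

print(class_pred)  # a class label stored in a leaf
print(reg_pred)    # a numeric value stored in a leaf
```

In both cases the prediction is read from the leaf that an observation reaches; only the type of value stored in the leaves differs.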

All rights reserved by www.grdjournals.com 4


Implementation of Data Mining Algorithms using R
(GRDJE/ Volume 4 / Issue 7 / 002)

II. DATA SET CREATION

A. Steps to Create a Dataset


The Excel data set type supports one value per parameter. It does not support multiple selection for parameters.
To create a data set using a Microsoft Excel file from a file directory data source:
– Click the New Data Set toolbar button and select Microsoft Excel File. The New Data Set - Microsoft Excel File dialog
launches.
– Enter a name for this data set.
– Click Shared to enable the Data Source list.
– Select the data source where the Microsoft Excel File resides.
– To the right of the File Name field, click the browse icon to browse for the Microsoft Excel file in the data source directories.
Select the file.
– If the Excel file contains multiple sheets or tables, select the appropriate Sheet Name and Table Name for this data set.
– If you added parameters for this data set, click Add Parameter. Enter the Name and select the Value. The Value list is populated
by the parameter

III. IMPLEMENTATION DETAILS


The algorithms were implemented in R Studio, and the code is given in the Appendix. Different data sets were used for the
algorithm implementations.

A. Clustering


B. Classification


C. Apriori Algorithm

D. Decision Tree Algorithm


IV. CONCLUSION
The data mining algorithms were implemented efficiently in the R environment, making use of its features. Large data sets
can be processed and manipulated in the R environment, which can be widely used in the statistical analysis of data. Because
such data sets are very large, the user cannot readily notice what occurs in them by inspection. The system is able to achieve
reliability. It reduces human involvement in manipulating the data and, with it, the risk of the mistakes people make when
manipulating large data sets.

APPENDIX

A. Cluster
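library(datasets)
data(iris)
summary(iris)
set.seed(8953)
# Drop the class label so that clustering uses only the four measurements
iris1 <- iris
iris1$Species <- NULL
# k-means with k = 3; the outer parentheses print the result
(kmeans.result <- kmeans(iris1, 3))
# Compare the clusters with the true species
table(iris$Species, kmeans.result$cluster)
plot(iris1[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
# Mark the cluster centers
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)
# k-medoids (PAM), with the number of clusters estimated by pamk()
library(fpc)
pamk.result <- pamk(iris1)
pamk.result$nc
table(pamk.result$pamobject$clustering, iris$Species)
layout(matrix(c(1, 2), 1, 2))  # two plots side by side
plot(pamk.result$pamobject)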


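# Density-based clustering (DBSCAN)
library(fpc)
iris2 <- iris[-5]  # remove class tags
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# Compare the clusters with the true species (cluster 0 holds the noise points)
table(ds$cluster, iris$Species)
plot(ds, iris2[c(1, 4)])
plotcluster(iris2, ds$cluster)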

B. Classification Algorithm
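str(iris)
set.seed(1234)
# Randomly split iris into roughly 70% training and 30% test data
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train.data <- iris[ind == 1, ]
test.data <- iris[ind == 2, ]
library(party)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
# Fit a conditional inference tree on the training data
iris_ctree <- ctree(myFormula, data = train.data)
# Confusion table on the training data
table(predict(iris_ctree), train.data$Species)
# Print and plot the tree
print(iris_ctree)
plot(iris_ctree)
# Confusion table on the held-out test data
table(predict(iris_ctree, newdata = test.data), test.data$Species)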

C. Apriori Algorithm
# Apriori algorithm
setwd("E:\\Research 2018\\Course VN\\Algorithm Datasets\\Apriori")
library(arules)
# Raw view of the file; the object is replaced by the transactions object on the next line
dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE)
# Read the file as transactions (one basket per line), dropping duplicate items within a basket
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
summary(dataset)
itemFrequencyPlot(dataset, topN = 10)
rules = apriori(data = dataset, parameter = list(support = 0.04, confidence = 0.2))
# Visualising the results: the ten rules with the highest lift
inspect(sort(rules, by = 'lift')[1:10])
q()  # quit the R session

D. Decision Tree Algorithm
# Decision Tree Regression
# Importing the dataset
setwd("E:\\Research 2018\\Course VN\\Algorithm Datasets\\Decision_Tree_Regression")
datasets = read.csv('Position_Salaries.csv')
dataset = datasets[2:3]  # keep columns 2 and 3 (Level and Salary are used below)
# Splitting the dataset into the Training set and Test set
# # install.packages('caTools')
# library(caTools)
# set.seed(123)
# split = sample.split(dataset$Salary, SplitRatio = 2/3)
# training_set = subset(dataset, split == TRUE)
# test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Decision Tree Regression to the dataset
# install.packages('rpart')
# rpart: recursive partitioning is a statistical method for multivariable analysis. It creates a decision tree that
# strives to correctly classify members of the population by splitting it into sub-populations based on several
# dichotomous independent variables.
library(rpart)
# "Salary ~ ." models Salary (the dependent variable) on all other columns (the independent variables)
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))
# rpart.control: various parameters that control aspects of the rpart fit.
# minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.
# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor, data.frame(Level = 6.5))
y_pred
# Visualising the Decision Tree Regression results (higher resolution)
# install.packages('ggplot2')
library(ggplot2)
# seq: sequence generation; 0.01 is the increment of the sequence
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Decision Tree Regression)') +
  xlab('Level') +
  ylab('Salary')
# Plotting the tree
plot(regressor)
text(regressor)

