
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org


Volume 6, Issue 2, March - April 2017 ISSN 2278-6856

Enhancing Clustering Mechanism by


Implementation of EM Algorithm for Gaussian
Mixture Model
Jyoti and Dr. Rajendar Singh Chhillar
Department of Computer Science & Applications, Maharshi Dayanand University, Rohtak, Haryana, India

Abstract: Big data [1] is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data can lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk. Data mining [7] is the central step in the process called knowledge discovery in databases, namely the step in which modeling techniques are applied. Research areas such as artificial intelligence, machine learning, and soft computing have contributed to its arsenal of methods. In our opinion, fuzzy approaches can play an important role in data mining because they give comprehensible results, although this is sometimes hard to achieve with other methods.

Keywords: Data mining, web mining, web intelligence, knowledge discovery, fuzzy logic, K-means

1. INTRODUCTION
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, statistics, and database systems. The overall goal of data mining is to extract information from a data set and transform it into an understandable structure for further use.

Fig 1 Data Mining

2. MOTIVATION & PROBLEM STATEMENT
Web Intelligence [2] based Google Analytics is a service offered by Google that generates detailed statistics about a website's traffic and traffic sources and measures conversions and sales. It is the most widely used website statistics service. Basic services are free of charge and a premium version is available for a fee. Google Analytics can track visitors from all referrers, including search engines, direct visits, and referring sites. It also tracks email marketing and digital collateral such as links within PDF documents. Integrated with AdWords, users can review online campaigns by landing page quality and conversions (goals). Goals might include sales, viewing a specific page, or downloading a file. Google Analytics' approach is to show high-level, dashboard-type data for the casual user, and more in-depth data further into the report set. Google Analytics can identify poorly performing pages with techniques such as funnel visualization, showing where visitors came from, how long they stayed, and their geographical position. It also provides more advanced features, including custom visitor segmentation.

3. TOOLS & TECHNOLOGY USED
MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment and fourth-generation programming language. Developed by MathWorks,



MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, Java, Fortran, and Python.

3.1 Syntax
The MATLAB application is built around the MATLAB language, and most use of MATLAB involves typing MATLAB code into the Command Window (as an interactive mathematical shell), or executing text files containing MATLAB code, including scripts and/or functions.

1. Variables
Variables are defined using the assignment operator, =. MATLAB is a weakly typed programming language because types are implicitly converted. It is an inferred typed language because variables can be assigned without declaring their type, except if they are to be treated as symbolic objects, and their type can change. Values can come from constants, from computation involving the values of other variables, or from the output of a function. For example:

>> x = 17
x =
    17

>> x = 'hat'
x =
hat

>> y = x + 0
y =
   104    97   116

>> x = [3*4, pi/2]
x =
   12.0000    1.5708

>> y = 3*sin(x)
y =
   -1.6097    3.0000

4. PROPOSED WORK
The traditional EM algorithm is used to find the (locally) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. That is, either there are missing values among the data, or the model can be formulated more simply by assuming the existence of additional unobserved data points.

The EM algorithm proceeds from the observation that the following is a way to solve these two sets of equations numerically. One can simply pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and keep alternating between the two until the resulting values both converge to fixed points. It is not obvious that this will work at all, but it can be proven that in this particular context it does, and that the derivative of the likelihood is (arbitrarily close to) zero at that point, which in turn means that the point is either a maximum or a saddle point. In general there may be multiple maxima, and there is no guarantee that the global maximum will be found. Some likelihoods also have singularities in them, i.e. nonsensical maxima. For example, one of the solutions that may be found by EM in a mixture model involves setting one of the components to have zero variance and the mean parameter of the same component equal to one of the data points.

5. RESULT & DISCUSSION
Implementation of the EM algorithm for the Gaussian mixture model

A mixture model can be described more simply by assuming that each observed data point has a corresponding unobserved data point, or latent variable, specifying the mixture component that the data point belongs to.

EM algorithm for Gaussian mixture model:

function [label, model, llh] = emgm(X, init)
% EM algorithm for Gaussian mixture model
fprintf('EM for Gaussian mixture: running ... ');
R = initialization(X,init);
tol = 1e-6;
maxiter = 500;
llh = -inf(1,maxiter);
converged = false;
t = 1;
while ~converged && t < maxiter
    t = t+1;
    model = maximization(X,R);
    [R, llh(t)] = expectation(X,model);
    converged = llh(t)-llh(t-1) < tol*abs(llh(t));
end
[~,label(1,:)] = max(R,[],2);
llh = llh(2:t);
if converged
    fprintf('converged in %d steps.\n',t);
else
    fprintf('not converged in %d steps.\n',maxiter);
end
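The alternating scheme just described can be made concrete with a small sketch. The paper's implementation is in MATLAB; the following is an illustrative NumPy translation for a one-dimensional, two-component mixture, where the synthetic data and the starting values are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian clusters.
X = np.concatenate([rng.normal(-4, 1, 200), rng.normal(4, 1, 200)])

# Arbitrary starting values for one set of unknowns (the parameters).
mu = np.array([-1.0, 1.0])     # component means
var = np.array([1.0, 1.0])     # component variances
w = np.array([0.5, 0.5])       # mixing weights

for _ in range(100):
    # E-step: estimate the latent variables (responsibilities) from the parameters.
    dens = w * np.exp(-(X[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    R = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities.
    s = R.sum(axis=0)
    w = s / len(X)
    mu = (R * X[:, None]).sum(axis=0) / s
    var = (R * (X[:, None] - mu) ** 2).sum(axis=0) / s

print(sorted(mu))   # the recovered means settle near -4 and 4
```

As the text warns, this only finds a local maximum: poor starting values, or a component collapsing onto a single data point, can stop it short of the global optimum.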



Initialization function:

function R = initialization(X, init)
[d,n] = size(X);
if isstruct(init)   % initialize with a model
    R = expectation(X,init);
elseif length(init) == 1   % random initialization
    k = init;
    idx = randsample(n,k);
    m = X(:,idx);
    [~,label] = max(bsxfun(@minus,m'*X,sum(m.^2,1)'/2));
    while k ~= length(unique(label))
        idx = randsample(n,k);
        m = X(:,idx);
        [~,label] = max(bsxfun(@minus,m'*X,sum(m.^2,1)'/2));
    end
    R = full(sparse(1:n,label,1,n,k,n));
elseif size(init,1) == 1 && size(init,2) == n   % initialize with labels
    label = init;
    k = max(label);
    R = full(sparse(1:n,label,1,n,k,n));
elseif size(init,1) == d && size(init,2) > 1   % initialize with only centers
    k = size(init,2);
    m = init;
    [~,label] = max(bsxfun(@minus,m'*X,sum(m.^2,1)'/2));
    R = full(sparse(1:n,label,1,n,k,n));
else
    error('ERROR: init is not valid.');
end

Maximization function:

function model = maximization(X, R)
[d,n] = size(X);
k = size(R,2);
sigma0 = eye(d)*(1e-6);   % regularization factor for covariance
s = sum(R,1);
w = s/n;
mu = bsxfun(@rdivide, X*R, s);
Sigma = zeros(d,d,k);
for i = 1:k
    Xo = bsxfun(@minus,X,mu(:,i));
    Xo = bsxfun(@times,Xo,sqrt(R(:,i)'));
    Sigma(:,:,i) = (Xo*Xo'+sigma0)/s(i);
end
model.mu = mu;
model.Sigma = Sigma;
model.weight = w;

Expectation function:

function [R, llh] = expectation(X, model)
mu = model.mu;
Sigma = model.Sigma;
w = model.weight;
n = size(X,2);
k = size(mu,2);
R = zeros(n,k);
for i = 1:k
    R(:,i) = loggausspdf(X,mu(:,i),Sigma(:,:,i));   % log N(x | mu_i, Sigma_i); helper function
end
R = bsxfun(@plus,R,log(w));
T = logsumexp(R,2);   % stable log-sum-exp over components; helper function
llh = sum(T)/n;   % loglikelihood
R = bsxfun(@minus,R,T);
R = exp(R);

Fig 2 Sigma is diagonal, shared covariance = true
Fig 3 Sigma is diagonal, shared covariance = false
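The expectation function works entirely in the log domain: loggausspdf and logsumexp are helper routines not listed in the paper. The normalization they enable can be sketched in Python; this is an illustration of the standard log-sum-exp trick, not the paper's code:

```python
import numpy as np

def logsumexp(A, axis):
    # Stable log(sum(exp(A), axis)): subtract the max before exponentiating.
    m = A.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(A - m).sum(axis=axis, keepdims=True))

# Log-weighted component densities for 3 points and 2 components; values
# this negative would underflow to 0 if exponentiated directly.
logR = np.array([[-1000.0, -1001.0],
                 [-2000.0, -1999.0],
                 [-1500.0, -1500.0]])

T = logsumexp(logR, axis=1)   # per-point log-likelihood terms, as in T = logsumexp(R,2)
R = np.exp(logR - T)          # responsibilities
print(R.sum(axis=1))          # each row sums to 1
```

Exponentiating logR directly would give all-zero rows and division by zero; shifting by the per-row log-sum first keeps the responsibilities finite and correctly normalized.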

Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations. In statistical models with latent variables, this is usually not possible. Instead, the result is typically a set of interlocking equations in which the solution to the parameters requires the values of the latent variables and vice versa, but substituting one set of equations into the other produces an unsolvable equation.

Fig 4 Sigma is full, shared covariance = true
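The sigma0 term in the maximization function guards against the zero-variance singularities described in Section 4. A small Python illustration (with hypothetical numbers, not taken from the paper) of why that regularizer matters:

```python
import numpy as np

d = 2
# Degenerate case: all of a component's responsibility mass sits on one
# point, so the centered, weighted data Xo is zero and the estimated
# covariance has zero variance, i.e. the nonsensical maximum in the text.
Xo = np.zeros((d, 1))
S = Xo @ Xo.T                  # rank-deficient covariance estimate
sigma0 = np.eye(d) * 1e-6      # same form as sigma0 in maximization()

print(np.linalg.det(S))           # 0.0: singular, the Gaussian density is undefined
print(np.linalg.det(S + sigma0))  # positive: the covariance stays usable
```

Adding the small diagonal term changes the estimates negligibly for well-behaved components while keeping every covariance matrix invertible.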



Fig 5 Sigma is full, shared covariance = false

6. CONCLUSION
The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure. In order to make wise decisions both for people and for things in the IoT, data mining technologies are integrated with IoT technologies for decision-making support and system optimization. Data mining involves discovering novel, interesting, and potentially useful patterns from data and applying algorithms to the extraction of hidden information. Due to the increasing amount of data available online, the World Wide Web has become one of the most valuable resources for information retrieval and knowledge discovery. Web mining technologies are the right solution for knowledge discovery on the Web. Knowledge extracted from the Web can be used to improve the performance of Web information retrieval, question answering, and Web-based data warehousing.

REFERENCES
[1] Hellerstein, Joe (9 November 2008). "Parallel Programming in the Age of Big Data". Gigaom Blog.
[2] J. Liu, N. Zhong, Y. Y. Yao, Z. W. Ras, "The wisdom web: new challenges for web intelligence (WI)", J. Intell. Inform. Sys., 20(1): 5-9, 2003.
[3] A. Congiusta, A. Pugliese, D. Talia, and P. Trunfio, "Designing Grid services for distributed knowledge discovery", Web Intelligence and Agent Systems, 1(2): 91-104, 2003.
[4] J. A. Hendler and E. A. Feigenbaum, "Knowledge is power: the semantic web vision", in N. Zhong, et al. (eds.), Web Intelligence: Research & Development, LNAI 2198, Springer, 2001, 18-29.
[5] N. Zhong and J. Liu (eds.), Intelligent Technologies for Information Analysis, New York: Springer, 2004.
[6] Bouckaert, Remco R.; Frank, Eibe; Hall, Mark A.; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2010). "WEKA: Experiences with a Java open-source project".
[7] Journal of Machine Learning Research 11: 2533-2541. The original title, "Practical machine learning", was changed ... the term "data mining" was [added] primarily for marketing reasons.
[8] Mena, Jesús (2011). Machine Learning Forensics for Law Enforcement, Security, & Intelligence. Boca Raton, FL: CRC Press (Taylor & Francis Group). ISBN 978-1-4398-6069-4.
[9] Piatetsky-Shapiro, Gregory; Parker, Gary (2011). "Lesson: Data Mining, and Knowledge Discovery: An Introduction". Introduction to Data Mining. KD Nuggets. Retrieved 30 August 2012.
[10] Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0-471-22852-4. OCLC 50055336.
[11] "Microsoft Academic Search: Top conferences in data mining". Microsoft Academic Search.
[12] "Google Scholar: Top publications - Data Mining & Analysis". Google Scholar.
[13] Proceedings, International Conferences on Knowledge Discovery and Data Mining, ACM, New York.
[14] SIGKDD Explorations, ACM, New York.

AUTHOR
Jyoti received the B.Tech degree in Computer Science and Engineering from KIIT College of Engineering, Gurgaon, in 2013. After that she joined the M.Tech program (2014-16) in Computer Science and Engineering at Maharshi Dayanand University, Rohtak.

