
DATA MINING TECHNIQUES FOR STRUCTURED AND

SEMI-STRUCTURED DATA


A

SYNOPSIS

SUBMITTED TO THE

Dr. A.P.J. Abdul Kalam Technical University, U.P., Lucknow.

FOR THE DEGREE

OF

MASTER OF TECHNOLOGY
IN

COMPUTER SCIENCE AND ENGINEERING


By

NAME: ANURAG SINGH Roll No. 1409010501

UNDER THE GUIDANCE OF

Prof. RAJNESH KUMAR SINGH

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

IEC COLLEGE OF ENGINEERING AND TECHNOLOGY, KP-2 GREATER NOIDA

Year 2015-16
DATA MINING TECHNIQUES FOR STRUCTURED AND
SEMI-STRUCTURED DATA

Abstract:
Data mining is the application of sophisticated analysis to large amounts of data in order to discover new
knowledge in the form of patterns, trends, and associations. With the advent of the World Wide Web, the
amount of data stored and accessible electronically has grown tremendously, and the process of
knowledge discovery (data mining) from this data has become very important for the business and
scientific-research communities alike. This master's synopsis introduces Query Flocks, a general
framework over relational data that enables the declarative formulation, systematic optimization, and
efficient processing of a large class of mining queries. In Query Flocks, each mining problem is expressed
as a Datalog query with parameters and a filter condition. In the optimization phase, a query flock is
transformed into a sequence of simpler queries that can be executed efficiently. As a proof of concept,
Query Flocks have been integrated with a conventional database system, and the thesis reports on the
architectural issues and performance results. While the Query-Flock framework is well suited for
relational data, it has limited use for semistructured data, i.e., nested data with implicit and/or irregular
structure, e.g., web pages.

Introduction :
The amount of data stored and available electronically has been growing at an ever-increasing rate for
the last decade. In the business community, companies collect all sorts of information about the business
process such as financial, payroll, and customer data. The data is often among the most valuable assets of
a business. In the scientific community, a single experiment can produce terabytes of data. Subsequently,
there is growing demand for methods and tools that analyze large volumes of data.

Data mining is broadly defined as the process of finding "patterns" in large amounts of data. The
definition is necessarily vague because it has to encompass the vast array of methods, techniques, and
algorithms from various fields such as databases, machine learning, and statistics.

Motivation:
Two major technological evolutions are changing our relationship with our environment:

 Widely available and cheap computing power. Simple objects that surround us are gaining sensors,
computational power, and actuators, and are changing from static objects into adaptive and reactive
systems.
 The explosion of networks of all kinds offers new possibilities for the development and self-
organization of communities.

The new characteristics of data reflect a World in Movement:


 Time and space. The objects of analysis exist in time and space. Often they are able to move.
 Dynamic environment. These objects exist in a dynamic and unstable environment, evolving
incrementally over time.
 Information processing capability. The objects are endowed with information processing
capabilities.
 Locality. The objects never see the global picture - they know only their local spatio-temporal
environment.

Data Mining for Structured Data:


The state of the art in data mining for structured data consists of many different algorithms that operate
on limited types of data. Furthermore, most data-mining methods are at best loosely coupled with
relational DBMSs, thus not taking advantage of existing database technology. In this thesis, we propose a
framework, called query flocks, that allows the declarative formulation of a large class of data-mining
queries over relational data. We also present a method for the systematic optimization and efficient
processing of such queries, called query flock plans.
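A query flock pairs a parameterized query with a filter condition over the parameters. As a rough illustration only (not the thesis's actual Datalog machinery), the classic market-basket instance of this idea can be sketched in Python over SQLite; the table, data, and support threshold below are invented for the example.

```python
# Hypothetical sketch of the query-flock idea over relational data: a
# parameterized query ("baskets containing items $1 and $2") plus a filter
# condition (support >= minsup), evaluated here with plain SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE basket (bid INTEGER, item TEXT)")
conn.executemany("INSERT INTO basket VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
    (4, "bread"), (4, "milk"),
])

def frequent_pairs(conn, minsup):
    """Instantiate the flock's parameters ($1, $2) over all item pairs and
    apply the filter condition COUNT(baskets) >= minsup."""
    return conn.execute("""
        SELECT a.item, b.item, COUNT(*) AS support
        FROM basket a JOIN basket b ON a.bid = b.bid AND a.item < b.item
        GROUP BY a.item, b.item
        HAVING support >= ?
    """, (minsup,)).fetchall()

print(frequent_pairs(conn, 3))  # [('bread', 'milk', 3)]
```

In the real framework the optimizer would push parts of the filter condition into the query to prune candidate parameter values early, instead of materializing all pairs as this naive sketch does.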

Data Mining for Semistructured Data:


The importance of semistructured data has been recognized in the database community and is emphasized
by the flurry of research activities in the last several years. The emergence of XML and its rapid adoption
by e-commerce companies has made semistructured data equally important for the business community.
However, since the proliferation of semistructured data has been relatively recent, there is a lack of tools
and methods for the analysis of such data.
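As a small illustration of the irregular structure mentioned above, the snippet below (invented data, using Python's standard xml.etree module) shows two XML records that expose different, nested fields, so no single relational schema fits both:

```python
# Why semistructured data resists a fixed relational schema: each record may
# expose a different, possibly nested, set of fields.
import xml.etree.ElementTree as ET

doc = """
<people>
  <person><name>Ana</name><email>ana@example.org</email></person>
  <person>
    <name>Bo</name>
    <phones><phone>555-1</phone><phone>555-2</phone></phones>
  </person>
</people>
"""

root = ET.fromstring(doc)
# The "schema" must be discovered per record rather than declared up front.
schemas = [sorted(child.tag for child in person) for person in root]
print(schemas)  # [['email', 'name'], ['name', 'phones']]
```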

Literature Survey:
The rest of the thesis is organized as follows. The next chapter introduces query flocks and shows that
many data-mining problems can be expressed in the query-flock framework. The chapter that follows
addresses the problem of processing query flocks efficiently: it introduces the systematic optimization of
query flocks as query flock plans and presents an integration with a relational DBMS in a tightly coupled
fashion.

Several researchers and organizations have conducted reviews of data mining tools and surveys of data
miners. These identify some of the strengths and weaknesses of the software packages. They also provide
an overview of the behaviors, preferences, and views of data miners; some of these reports are listed
below.

Data mining techniques provide a popular and powerful tool set for generating various data-driven
classification systems.

Leonid Churilov, Adyl Bagirov, Daniel Schwartz, Kate Smith, and Michael Dally studied the combined
use of self-organizing maps and nonsmooth, nonconvex optimization techniques in order to produce a
working case of a data-driven risk classification system. The optimization approach strengthens the
validity of the self-organizing-map results. The study is applied to cancer patients, who are partitioned
into homogeneous groups to support future clinical treatment decisions.

Most approaches to clustering analysis are based mainly on statistical, neural-network, and machine-
learning techniques. Bagirov et al. propose a global optimization approach to clustering and demonstrate
how the supervised data classification problem can be solved via clustering. The objective function in
this problem is both nonsmooth and nonconvex and has a large number of local minimizers. Because of
the large number of variables and the complexity of the objective function, general-purpose global
optimization techniques, as a rule, fail to solve such problems. It is therefore very important to develop
optimization algorithms that allow the decision maker to find "deep" local minimizers of the objective
function. Such deep minimizers provide a good enough description of the data set under consideration as
far as clustering is concerned.

Some automated rule-generation methods, such as classification and regression trees, are available to
find rules describing different subsets of the data. When the data sample size is limited, such approaches
tend to find very accurate rules that apply to only a small number of patients. Schwarz et al. [16]
demonstrated that data mining techniques can play an important role in rule refinement even if the
sample size is limited. At the first stage, their methodology is used for exploring and identifying
inconsistencies in the existing rules, rather than generating a completely new set of rules.

The advantage of self-organizing maps over the K-means algorithm lies in the improved visualization
capabilities resulting from the two-dimensional map of the clusters. Kohonen developed self-organizing
maps as a way of automatically detecting strong features in large data sets. A self-organizing map finds a
mapping from the high-dimensional input space to a low-dimensional feature space, so the clusters that
form become visible in this reduced dimensionality. The software used to generate the self-organizing
maps is Viscovery SOMine (www.eudaptics.com), which provides a colorful cluster-visualization tool
and the ability to inspect the distribution of different variables across the map.

The subject of cluster analysis is the unsupervised classification of data and the discovery of
relationships within the data set without any guidance. The basic principle for identifying these hidden
relationships is that similar input patterns should be grouped together; two inputs are regarded as similar
if the distance between them is small.

This study demonstrates that data mining techniques can play an important role in rule refinement even
if the sample size is limited. Churilov, Bagirov, Schwartz, Smith, and Dally demonstrated that both self-
organizing maps and optimization-based clustering algorithms can be used to explore existing
classification rules developed by experts and to identify inconsistencies with a patient database. The
proposed optimization algorithm calculates clusters step by step, and the form of the objective function
allows the user to significantly reduce the number of instances in a data set. A rule-based classification
system is important for clinicians to feel comfortable with the decision. Decision trees can be used to
generate data-driven rules, but for small sample sizes these rules tend to describe outliers that do not
necessarily generalize to larger data sets.
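The survey above contrasts self-organizing maps with the K-means algorithm. As a reference point only (the surveyed work relies on SOMs and nonsmooth global optimization, not plain K-means), here is a minimal one-dimensional K-means sketch on invented data; the result depends on the initial centers, which is exactly the local-minimizer issue discussed above.

```python
# Minimal 1-D K-means (illustrative only): alternate assignment and update
# steps until the centers settle into a (possibly local) minimum.
def kmeans_1d(points, centers, iters=20):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.7]
centers, clusters = kmeans_1d(data, centers=[0.0, 10.0])
print(centers)  # approximately [1.0, 8.0]
```

Unlike a SOM, this gives no two-dimensional map of the clusters, which is the visualization advantage the survey attributes to self-organizing maps.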

 Hurwitz Victory Index: Report for Advanced Analytics, a market-research assessment tool that
highlights both the diverse uses for advanced-analytics technology and the vendors who make those
applications possible.
 2011 Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery.
 Rexer Analytics Data Miner Surveys (2007–2013).
 Forrester Research 2010 Predictive Analytics and Data Mining Solutions report.
 Gartner 2008 "Magic Quadrant" report.
 Robert A. Nisbet's 2006 three-part series of articles "Data Mining Tools: Which One is Best For
CRM?".
 Haughton et al.'s 2003 review of data mining software packages in The American Statistician.
 Goebel and Gruenwald's 1999 "A Survey of Data Mining and Knowledge Discovery Software Tools"
in SIGKDD Explorations.

Research papers:
 Leonid Churilov, Adyl Bagirov, Daniel Schwartz, Kate Smith, and Michael Dally, Journal of
Management Information Systems, 2005: "Data mining with combined use of optimization techniques
and self-organizing maps for improving risk grouping rules: application to prostate cancer patients."
 Anthony Danna and Oscar H. Gandy, Journal of Business Ethics, 2002: "All that glitters is not gold:
digging beneath the surface of data mining."
 A. C. Yeo, K. A. Smith, R. J. Willis, and M. Brooks, Journal of the Operational Research Society,
2002: "A mathematical programming approach to optimise insurance premium pricing within a data
mining framework."
 Shakil Ahmed, Frans Coenen, and Paul Leng, Knowledge and Information Systems, 2006: "Tree-based
partitioning of data for association rule mining."
 Timothy T. Rogers and James L. McClelland, Behavioral & Brain Sciences, 2008: "Précis of Semantic
Cognition: A Parallel Distributed Processing Approach."
 Ana Cristina Bicharra Garcia, Inhauma Ferraz, and Adriana S. Vivacqua, Artificial Intelligence for
Engineering Design, Analysis and Manufacturing, 2009: "From data to knowledge mining."
 Rachid Anane, Computers and the Humanities, 2001: "Data mining and serial documents."
 Balaji Padmanabhan and Alexander Tuzhilin, Institute for Operations Research and the Management
Sciences, 2011: "On the use of optimization for data mining: theoretical interactions and eCRM
opportunities."

Research methodology to be used:


“Data mining is the name given to an eclectic collection of statistical techniques that are already widely
used in marketing and business, are likely to appear in social science research in the near future, but are
rarely found in academic social science research at present. The list of techniques includes: partitioning or
tree models; boosted trees, forests, and boosted forests; neural networks; linear and nonlinear manifold
clustering; and partial least squares regression (aka ‘soft modeling’).

This is an exploratory class – this is the first time it will be taught at GC – and the class will be taught by
Professor Robert Haralick, a computer scientist, and Professor Paul Attewell, a sociologist. We will be
learning as we go, and the work will be hands on, so do not take this course if you seek a well-structured
highly-organized experience. But if you enjoy exploring new techniques and “learning by doing” then this
course may appeal to you. You should already have some familiarity with statistics, at least to OLS
regression and logistic regression, but this course will not be highly mathematical or technical. Course
grades will stress attendance, participation and project work. A paper will not be required.”

Problem identification/detection:
In data mining, anomaly detection (or outlier detection) is the identification of items, events or
observations which do not conform to an expected pattern or other items in a dataset. Typically the
anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical
problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and
exceptions.

In particular in the context of abuse and network intrusion detection, the interesting objects are often not
rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical
definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised
methods) will fail on such data, unless it has been aggregated appropriately. Instead, a cluster analysis
algorithm may be able to detect the micro clusters formed by these patterns.
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection
techniques detect anomalies in an unlabeled test data set, under the assumption that the majority of the
instances in the data set are normal, by looking for instances that seem to fit least with the remainder of
the data set. Supervised anomaly detection techniques require a data set that has been labeled as
"normal" and "abnormal" and involve training a classifier (the key difference from many other statistical
classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised
anomaly detection techniques construct a model representing normal behavior from a given normal
training data set, and then test the likelihood that a test instance was generated by the learnt model.
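The unsupervised category described above can be sketched very simply: with no labels, flag the instances that fit least with the rest of the sample. The z-score rule, threshold, and data below are illustrative assumptions, not a method from the literature surveyed here.

```python
# Unsupervised outlier detection on a 1-D sample: flag values whose z-score
# (distance from the mean in standard deviations) exceeds a threshold.
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]  # one injected anomaly
print(zscore_outliers(readings))  # [25.0]
```

Note this simple rule assumes rare, point-like outliers; as the text observes, it would fail on unexpected bursts of activity, where cluster-based methods fare better.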

Tools and software:

1. RapidMiner (formerly known as YALE).

2. WEKA

3. R programming

4. KNIME

5. NLTK

Research objectives:
1. Upgrade an e-science infrastructure to support collaborative, data-mining-enabled experimental
research.

2. Develop a knowledge-driven data mining assistant to support researchers in data-intensive,
knowledge-rich domains.

3. Design and implement mechanisms for meta-mining the knowledge discovery process.

4. Demonstrate e-LICO on a systems-biology approach to disease studies.

Hypothesis :

Suppose fifty different researchers, unaware of each other's work, run clinical trials to test whether
Vitamin X is efficacious in treating cancer. Forty-nine of them find no significant differences between
measurements done on patients who have taken Vitamin X and those done on patients in the control
group. The fiftieth study finds a big difference, but the difference is of a size that one would expect to
see in about one of every fifty studies even if Vitamin X has no effect at all, just due to chance (with
patients who were going to get better anyway disproportionately ending up in the Vitamin X group
instead of the control group, which can happen since the entire population of cancer patients cannot be
included in the study). When all fifty studies are pooled, one would say no effect of Vitamin X was
found, because the positive result was not more frequent than chance, i.e., it was not statistically
significant. However, it would be reasonable for the investigators running the fiftieth study to consider it
likely that they have found an effect, at least until they learn of the other forty-nine studies.

Now suppose that the one anomalous study was in Denmark. The data suggest a hypothesis that Vitamin
X is more efficacious in Denmark than elsewhere. But Denmark was by chance the one-in-fifty case in
which an extreme value of the test statistic happened; one expects such extreme cases one time in fifty
on average if no effect is present. It would therefore be unjustified to cite the data as serious evidence for
this particular hypothesis suggested by the data.

However, if another study is then done in Denmark and again finds a difference between the vitamin and
the placebo, then the first study strengthens the case provided by the second study. Or, if a second series
of studies is done on fifty countries, and Denmark stands out in the second study as well, the two series
together constitute important evidence even though neither by itself is at all impressive.
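The one-in-fifty intuition can be checked with a short calculation. Assuming, purely for illustration, a 2% per-study chance of a spurious "big difference" when Vitamin X has no effect:

```python
# Probability that at least one of fifty independent studies reports a
# spurious positive, given an illustrative 2% per-study false-positive rate.
p_false = 0.02
n_studies = 50
p_at_least_one = 1 - (1 - p_false) ** n_studies
print(round(p_at_least_one, 3))  # 0.636
```

So under these assumptions, a lone "Denmark" result is more likely than not to appear somewhere among fifty studies even when no real effect exists.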

Conclusions:
Data mining, the application of methods to analyze very large volumes of data in order to discover new
knowledge, is rapidly finding its way into mainstream computing and becoming commonplace in
environments such as finance and retail, in which large volumes of cash-register data are routinely
analyzed for users' buying patterns, the shopping habits of individual users, the efficiency of marketing
strategies for services, and other information. In this thesis, we presented data mining techniques that
contribute towards a comprehensive solution for both structured and semistructured data.

References / Bibliography:
[1] Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal
Talwar. Privacy, accuracy, and consistency too: a holistic solution to contingency table release.
In Proc. of PODS, pages 273–282, New York, NY, 2007.

[2] Michael Barbaro and Tom Zeller Jr. A face is exposed for AOL searcher No. 4417749. New
York Times, August 9, 2006.

[3] Roberto J. Bayardo Jr. and Rakesh Agrawal. Data privacy through optimal k-anonymization.
In Proc. of ICDE, pages 217–228, 2005.

[4] Elisa Bertino, Beng Chin Ooi, Yanjiang Yang, and Robert H. Deng. Privacy and ownership
preserving of outsourced medical data. In Proc. of ICDE, pages 521–532, 2005.

[5] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The
SuLQ framework. In Proc. of PODS, pages 128–138, New York, NY, June 2005.

[6] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to non-interactive
database privacy. In Proc. of STOC, pages 609–618, 2008.
[7] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification
and Regression Trees. Chapman & Hall, New York, 1984.

[8] Electronic Privacy Information Center. Total “terrorism” information awareness (TIA).
http://www.epic.org/privacy/profiling/tia/.

[9] Kamalika Chaudhuri and Claire Monteleoni. Privacy-preserving logistic regression. In
NIPS, pages 289–296, 2008.

[10] Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, and Hoeteck Wee. Toward
privacy in public databases. In Theory of Cryptography Conference, pages 363–385, 2005.

[11] Raymond Chi-Wing, Jiuyong Li, Ada Wai-Chee Fu, and Ke Wang. (α, k)-anonymity: an
enhanced k-anonymity model for privacy preserving data publishing. In KDD '06: Proceedings
of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, New York, NY, USA, 2006. ACM Press.

[12] Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification
learning algorithms. Neural Computation, 10(7):1895–1923, 1998.

[13] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In Proc. of
PODS, pages 202–210, June 2003.

[14] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning
databases, 1998.

[15] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In KDD, pages 71–80,
2000.

[16] Wenliang Du and Zhijun Zhan. Building decision tree classifier on private data. In Proc. of
CRPITS'14, pages 1–8, Darlinghurst, Australia, December 2002. Australian Computer Society,
Inc.

[17] Cynthia Dwork. Differential privacy. In ICALP (2), volume 4052 of LNCS, pages 1–12,
2006.

[18] Cynthia Dwork. Differential privacy: A survey of results. In TAMC, pages 1–19, 2008.

[19] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to
sensitivity in private data analysis. In TCC, pages 265–284, 2006.

[20] Cynthia Dwork, Moni Naor, Omer Reingold, Guy N. Rothblum, and Salil Vadhan. On the
complexity of differentially private data release: efficient algorithms and hardness results. In
STOC ’09: Proceedings of the 41st annual ACM symposium on Theory of computing, pages
381–390, New York, NY, USA, 2009. ACM.
[21] Cynthia Dwork and Kobbi Nissim. Privacy-preserving data mining on vertically partitioned
databases. In Proc. of CRYPTO, August 2004.

[22] Cynthia Dwork and Sergey Yekhanin. New efficient attacks on statistical disclosure control
mechanisms. In CRYPTO, pages 469–480, 2008.

[23] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving
data mining. In Proc. of PODS, pages 211–222, San Diego, California, USA, June 9-12 2003.

[24] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of


association rules. In Proc. of ACM SIGKDD, pages 217–228, Canada, July 2002.

[25] Dan Feldman, Amos Fiat, Haim Kaplan, and Kobbi Nissim. Private coresets. In STOC,
pages 361–370, 2009.

[26] Frank McSherry. Privacy integrated queries: an extensible platform for privacy-preserving
data analysis. In SIGMOD Conference, pages 19–30, 2009.

[27] Frank McSherry and Ilya Mironov. Differentially private recommender systems: building
privacy into the net. In KDD, pages 627–636, 2009.

[28] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In FOCS,
pages 94–103, 2007.

[29] Adam Meyerson and Ryan Williams. General k-anonymization is hard. In Technical
Report CMU-CS-03-113, March 2003.

[30] Adam Meyerson and Ryan Williams. On the complexity of optimal k-anonymity. In PODS
’04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on
Principles of database systems, pages 223–228, New York, NY, USA, 2004. ACM Press.

[31] John Mingers. An empirical comparison of selection measures for decision-tree induction.
Machine Learning, 3(4):319–342, 1989.

[32] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets.
In IEEE Symposium on Security and Privacy, pages 111–125, 2008.

[33] Arvind Narayanan and Vitaly Shmatikov. Myths and fallacies of “personally identifiable
information”. Communications of the ACM, 53(6):24–26, 2010.

[34] Chaithanya Pichuka, Raju S. Bapi, Chakravarthy Bhagvati, Arun K. Pujari, and Bulusu
Lakshmana Deekshatulu. A tighter error bound for decision tree learning using PAC learnability.
In IJCAI, pages 1011–1016, 2007.

[35] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[36] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.,
San Francisco, 1993.

[37] Aaron Roth and Tim Roughgarden. The median mechanism: Interactive and efficient
privacy with multiple queries. In Proc. of STOC, 2010.

[38] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on
Knowledge and Data Engineering, 13(6), 2001.

[39] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity
and its enforcement through generalization and suppression. In Technical Report SRI-CSL-98-
04. CS Laboratory, SRI International, 1998.

[40] Ralph E. Steuer. Multiple criteria optimization: theory, computation and application. John
Wiley & Sons, New York, 1986.
