INTRODUCTION
What is Data Mining?
1. Marketing / Retail
Data mining helps marketing companies build models based on historical data to predict who will respond to new marketing campaigns such as direct mail, online marketing campaigns, etc. Through the results, marketers can take an appropriate approach to selling profitable products to targeted customers.
Data mining brings many benefits to retail companies, much as it does to marketing. Through market basket analysis, a store can arrange its products so that customers can conveniently buy frequently purchased products together. In addition, it also helps retail companies offer discounts on particular products that will attract more customers.
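The market basket analysis mentioned above can be sketched as a simple pair co-occurrence count; the sample baskets and the `min_support` threshold below are illustrative assumptions, not data from any actual system:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support=2):
    """Count product pairs bought together in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) gives each unordered pair a canonical form
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "milk", "butter"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "butter"],
]
frequent = frequent_pairs(baskets)
```

Pairs that clear the support threshold become candidates for co-placement on shelves or for joint discounts.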
2. Finance / Banking
Data mining gives financial institutions information about loans and credit reporting. By building a model from historical customer data, banks and financial institutions can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect credit card owners.
3. Manufacturing
By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers face the challenge that even when the conditions of the manufacturing environments at different wafer production plants are similar, the quality of the wafers varies, and some wafers even have defects for unknown reasons. Data mining has been applied to determine the ranges of control parameters that lead to the production of golden wafers. Those optimal control parameters are then used to manufacture wafers with the desired quality.
4. Governments
Data mining helps government agencies by digging into and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activities.
5. Law enforcement:
Data mining can aid law enforcers in identifying criminal suspects as well as apprehending these criminals by examining trends in location, crime type, habit, and other patterns of behavior.
6. Researchers:
Data mining can assist researchers by speeding up their data analysis process, thus allowing them more time to work on other projects.
SYSTEM ANALYSIS
EXISTING SYSTEM:
In this existing system, a data unit is a piece of text that semantically represents
one concept of an entity. It corresponds to the value of a record under an attribute.
It is different from a text node which refers to a sequence of text surrounded by a
pair of HTML tags. It describes the relationships between text nodes and data units
in detail. In this paper, we perform data unit level annotation. There is a high
demand for collecting data of interest from multiple WDBs. For example, once a
book comparison shopping system collects multiple result records from different
book sites, it needs to determine whether any two SRRs refer to the same book.
DISADVANTAGES OF EXISTING SYSTEM:
If ISBNs are not available, their titles and authors could be compared. The system
also needs to list the prices offered by each site. Thus, the system needs to know
the semantics of each data unit. Unfortunately, the semantic labels of data units are
often not provided in result pages. For instance, no semantic labels for the values
of title, author, publisher, etc., are given. Having semantic labels for data units is
not only important for the above record linkage task, but also for storing collected
SRRs into a database table.
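The record linkage step described above — match on ISBN when both records have one, otherwise fall back to comparing titles and authors — might be sketched as follows; the field names and the whitespace/case normalization are assumptions for illustration:

```python
def normalize(text):
    """Collapse whitespace and case for a forgiving string comparison."""
    return " ".join(text.lower().split())

def same_book(r1, r2):
    """Decide whether two SRRs (as dicts) refer to the same book."""
    if r1.get("isbn") and r2.get("isbn"):
        return r1["isbn"] == r2["isbn"]
    # No ISBN available: compare normalized title and author instead
    return (normalize(r1["title"]) == normalize(r2["title"])
            and normalize(r1["author"]) == normalize(r2["author"]))
```

Note that this comparison is only possible once the system knows which data unit in each SRR is the ISBN, the title, and the author — which is exactly the annotation problem this work addresses.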
PROPOSED SYSTEM:
In this paper, we consider how to automatically assign labels to the data units
within the SRRs returned from WDBs. Given a set of SRRs that have been
extracted from a result page returned from a WDB, our automatic annotation
solution consists of three phases.
ADVANTAGES OF PROPOSED SYSTEM:
This paper has the following contributions:
While most existing approaches simply assign labels to each HTML text node, we thoroughly analyze the relationships between text nodes and data units. We perform data unit level annotation.
We utilize the integrated interface schema (IIS) over multiple WDBs in the same domain to enhance data unit annotation.
We construct an annotation wrapper for any given WDB. The wrapper can be applied to efficiently annotate the SRRs retrieved from the same WDB with new queries.
IMPLEMENTATION
MODULES:
Basic Annotators
Query-Based Annotator
Schema Value Annotator
Common Knowledge Annotator
Combining Annotators
MODULES DESCRIPTION:
Basic Annotators
In a returned result page containing multiple SRRs, the data units corresponding to the same concept (attribute) often share special common features, and such common features are usually associated with the data units on the result page in certain patterns. Based on this observation, we define six basic annotators to label data units, each of them considering a special type of pattern/feature. Four of these annotators (i.e., the table annotator, query-based annotator, in-text prefix/suffix annotator, and common knowledge annotator) are similar to previously proposed annotation heuristics.
Query-Based Annotator
The basic idea of this annotator is that the SRRs returned from a WDB are always related to the specified query. Specifically, the query terms entered in the search attributes on the local search interface of the WDB will most likely appear in some retrieved SRRs. For example, if the query term "machine" is submitted through the Title field on the search interface of the WDB, all three titles of the returned SRRs contain this query term. Thus, we can use the name of the search field, Title, to annotate the title values of these SRRs. In general, query terms against an attribute may be entered into a textbox or chosen from a selection list on the local search interface.
Our Query-based Annotator works as follows: Given a query with a set of query
terms submitted against an attribute A on the local search interface, first find the
group that has the largest total occurrences of these query terms and then assign
gn(A) as the label to the group.
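Assuming the data units have already been aligned into groups of strings, the procedure just described could be sketched as follows (the sample groups and the label name are illustrative assumptions):

```python
def query_based_annotate(groups, query_terms, label):
    """Label the group with the largest total occurrence count of the
    query terms entered against attribute A, i.e., assign gn(A) to it."""
    def occurrences(group):
        return sum(unit.lower().count(term.lower())
                   for unit in group for term in query_terms)
    best = max(range(len(groups)), key=lambda i: occurrences(groups[i]))
    return {best: label}

# Hypothetical aligned groups from three book SRRs: titles and authors
groups = [
    ["Machine Learning", "The Machine Stops", "Support Vector Machines"],
    ["T. Mitchell", "E. M. Forster", "N. Cristianini"],
]
annotation = query_based_annotate(groups, ["machine"], "Title")
```

Here the first group accumulates the most occurrences of the query term, so it receives the label of the search field through which the term was submitted.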
Schema Value Annotator
Many attributes on a search interface have predefined values on the interface. For
example, the attribute Publishers may have a set of predefined values (i.e.,
publishers) in its selection list. More attributes in the IIS tend to have predefined values, and these attributes are likely to have more such values than those in local interface schemas (LISs),
because when attributes from multiple interfaces are integrated, their values are
also combined. Our schema value annotator utilizes the combined value set to
perform annotation.
The schema value annotator first identifies the attribute Aj that has the highest
matching score among all attributes and then uses gn(Aj) to annotate the group Gi.
Note that multiplying the above sum by the number of nonzero similarities is to
give preference to attributes that have more matches (i.e., having nonzero
similarities) over those that have fewer matches. This has been found to be very effective in improving the retrieval effectiveness of combination systems in information retrieval.
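The scoring rule above — the sum of similarities multiplied by the number of nonzero similarities — can be sketched as below; the token-overlap (Jaccard) similarity is an illustrative stand-in for whatever similarity measure the system actually uses:

```python
def similarity(value, unit):
    """Jaccard similarity over lowercased word tokens (illustrative)."""
    a, b = set(value.lower().split()), set(unit.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def schema_value_score(group, attr_values):
    """Sum of best similarities times the number of nonzero similarities,
    favoring attributes whose predefined values match more data units."""
    sims = [max(similarity(v, u) for v in attr_values) for u in group]
    nonzero = sum(1 for s in sims if s > 0)
    return nonzero * sum(sims)

def best_attribute(group, attrs):
    """Annotate the group with the name of the highest-scoring attribute."""
    return max(attrs, key=lambda name: schema_value_score(group, attrs[name]))

# Hypothetical predefined value sets drawn from search-interface selection lists
attrs = {
    "Publisher": ["Prentice Hall", "Springer", "Wiley"],
    "Format": ["Hardcover", "Paperback"],
}
group = ["Springer", "Wiley", "Springer"]
```

The multiplication by the nonzero count means an attribute matching all three data units beats one that matches a single unit strongly, mirroring the preference described in the text.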
SYSTEM TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies, and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests, and each test type addresses a specific testing requirement.
TYPES OF TESTS
Unit testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application, and it is done after the completion of an individual unit before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at the component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
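A minimal example of such a unit test, using Python's standard `unittest` module and a hypothetical `normalize_isbn` helper (both the helper and its expected behavior are assumptions for illustration):

```python
import unittest

def normalize_isbn(raw):
    """Hypothetical unit under test: strip hyphens and spaces from an ISBN."""
    return raw.replace("-", "").replace(" ", "")

class NormalizeIsbnTest(unittest.TestCase):
    """Validates that program inputs produce valid outputs, as described above."""

    def test_strips_separators(self):
        self.assertEqual(normalize_isbn("978-0-13-468599-1"), "9780134685991")

    def test_plain_value_unchanged(self):
        self.assertEqual(normalize_isbn("9780134685991"), "9780134685991")

# Run the suite programmatically so the result can be inspected
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeIsbnTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method exercises one path of the unit with clearly defined inputs and expected results, matching the description above.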
Integration testing
Integration tests are designed to test integrated software components to determine if they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.
Functional test
Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input
Invalid Input
Functions
Output
Unit testing is usually conducted as part of a combined code and unit test
phase of the software lifecycle, although it is not uncommon for coding and unit
testing to be conducted as two distinct phases.
Test strategy and approach
Field testing will be performed manually and functional tests will be written
in detail.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
SYSTEM DESIGN
[Architecture diagram: an HTML result page passes through a preprocessing layer to produce query result data; an analysis layer applies the annotators (prefix/suffix annotator, query-based annotator, frequency-based annotator, and table annotator); a knowledge layer and a presentation layer handle annotation, result projection, and user profiles, producing annotated HTML pages in XML/OWL.]

[Data flow diagram: the Admin adds sources; the User searches by URL, author name, year, or title; each record carries URL, Author, Year, Title, Price, and Content attributes stored in the database.]
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling language in the field of object-oriented software engineering. The standard is managed, and was created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of a software system, as well as for business modeling and other non-software systems.
The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems.
The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.
GOALS:
The Primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
2. Provide extensibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks, patterns, and components.
7. Integrate best practices.
[Use case diagram: the User searches by URL, author name, year, or title and views details; the Admin adds sources and information.]
CLASS DIAGRAM:
In software engineering, a class diagram in the Unified Modeling Language
(UML) is a type of static structure diagram that describes the structure of a system
by showing the system's classes, their attributes, operations (or methods), and the
relationships among the classes. It shows which classes hold which information.
[Class diagram: the ADMIN class offers Add Source(), Add Content, and Add Information operations; the USER class has Url, Title, Author name, and Year attributes with Browse() and View Source operations; a Sink class receives actions and provides services.]
SEQUENCE DIAGRAM:
A sequence diagram in Unified Modeling Language (UML) is a kind of interaction
diagram that shows how processes operate with one another and in what order. It is
a construct of a Message Sequence Chart. Sequence diagrams are sometimes called
event diagrams, event scenarios, and timing diagrams.
[Sequence diagram: the Admin adds keywords and sources to storage; the User issues a search, and the system gets and provides the requested information.]
ACTIVITY DIAGRAM:
Activity diagrams are graphical representations of workflows of stepwise activities
and actions with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the business and
operational step-by-step workflows of components in a system. An activity
diagram shows the overall flow of control.
[Activity diagram: Begin, Sign Up, Login, then a validity check; on failure the user returns to Login; a valid User can search and get information, while the Admin can add keywords, sources, details, and information to the database; End.]
INPUT DESIGN
The input design is the link between the information system and the user. It comprises the developing specification and procedures for data preparation and those steps that are necessary to put transaction data into a usable form for processing. This can be achieved by having the computer read data from a written or printed document, or by having people key the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed in such a way that it provides security and ease of use while retaining privacy. Input design considered the following things:
What data should be given as input?
How the data should be arranged or coded?
The dialog to guide the operating personnel in providing input.
Methods for preparing input validations and steps to follow when errors occur.
OBJECTIVES
1. Input design is the process of converting a user-oriented description of the input into a computer-based system. This design is important to avoid errors in the data input process and to show the correct direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry to handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screen is designed in such a way that all data manipulations can be performed. It also provides record viewing facilities.
3. When the data is entered, it is checked for validity. Data can be entered with the help of screens. Appropriate messages are provided as and when needed so that the user is not left confused at any instant. Thus, the objective of input design is to create an input layout that is easy to follow.
OUTPUT DESIGN
A quality output is one which meets the requirements of the end user and presents the information clearly. In any system, the results of processing are communicated to the users and to other systems through outputs. In output design, it is determined how the information is to be displayed for immediate need, as well as the hard copy output. It is the most important and direct source of information to the user. Efficient and intelligent output design improves the system's relationship with the user and helps in decision-making.
1. Designing computer output should proceed in an organized, well-thought-out manner; the right output must be developed while ensuring that each output element is designed so that people will find the system easy to use and effective. When analysts design computer output, they should identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the
following objectives.
Convey information about past activities, current status, or projections of the future.
Signal important events, opportunities, problems, or warnings.
Trigger an action.
Confirm an action.
[Screenshots of the following pages:]
Index Page:
Register Page:
User:
User Page:
Search by Title:
Search by URL:
Search by Year:
Result Page:
Additional Information:
Wrong Keyword:
Admin Entry:
Admin Home:
Adding Information:
About:
End:
CONCLUSION
In this paper, we studied the data annotation problem and proposed a multi-annotator approach for automatically constructing an annotation wrapper for annotating the search result records retrieved from any given web database. This
approach consists of six basic annotators and a probabilistic method to combine the
basic annotators. Each of these annotators exploits one type of features for
annotation and our experimental results show that each of the annotators is useful
and they together are capable of generating high quality annotation. A special
feature of our method is that, when annotating the results retrieved from a web
database, it utilizes both the LIS of the web database and the IIS of multiple web
databases in the same domain. We also explained how the use of the IIS can help
alleviate the local interface schema inadequacy problem and the inconsistent label
problem.
In this paper, we also studied the automatic data alignment problem. Accurate
alignment is critical to achieving holistic and accurate annotation. Our method is a
clustering based shifting method utilizing richer yet automatically obtainable
features. This method is capable of handling a variety of relationships between
HTML text nodes and data units, including one-to-one, one-to-many, many-to-one,
and one-to-nothing. Our experimental results show that the precision and recall of
this method are both above 98 percent. There is still room for improvement in several areas. For example, we need to enhance our method to split composite text nodes when there are no explicit separators. We would also like to try different machine learning techniques and use more sample pages from each training site to obtain the feature weights, so that we can identify the best technique for the data alignment problem.
REFERENCES
[1] A. Arasu and H. Garcia-Molina, Extracting Structured Data from Web Pages,
Proc. SIGMOD Intl Conf. Management of Data, 2003.
[2] L. Arlotta, V. Crescenzi, G. Mecca, and P. Merialdo, Automatic Annotation of
Data Extracted from Large Web Sites, Proc. Sixth Intl Workshop the Web and
Databases (WebDB), 2003.
[3] P. Chan and S. Stolfo, Experiments on Multistrategy Learning by Meta-Learning, Proc. Second Intl Conf. Information and Knowledge Management
(CIKM), 1993.
[4] W. Bruce Croft, Combining Approaches for Information Retrieval, Advances
in Information Retrieval: Recent Research from the Center for Intelligent
Information Retrieval, Kluwer Academic, 2000.
[5] V. Crescenzi, G. Mecca, and P. Merialdo, RoadRUNNER: Towards Automatic
Data Extraction from Large Web Sites, Proc. Very Large Data Bases (VLDB)
Conf., 2001.
[6] S. Dill et al., SemTag and Seeker: Bootstrapping the Semantic Web via
Automated Semantic Annotation, Proc. 12th Intl Conf. World Wide Web
(WWW) Conf., 2003.
[15] H. He, W. Meng, C. Yu, and Z. Wu, Constructing Interface Schemas for
Search Interfaces of Web Databases, Proc. Web Information Systems Eng.
(WISE) Conf., 2005.
[16] J. Heflin and J. Hendler, Searching the Web with SHOE, Proc. AAAI
Workshop, 2000.
[17] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
[18] N. Kushmerick, D. Weld, and R. Doorenbos, Wrapper Induction for
Information Extraction, Proc. Intl Joint Conf. Artificial Intelligence (IJCAI),
1997.
[19] J. Lee, Analyses of Multiple Evidence Combination, Proc. 20th Ann. Intl
ACM SIGIR Conf. Research and Development in Information Retrieval, 1997.
[20] L. Liu, C. Pu, and W. Han, XWRAP: An XML-Enabled Wrapper
Construction System for Web Information Sources, Proc. IEEE 16th Intl Conf.
Data Eng. (ICDE), 2001.
[21] W. Liu, X. Meng, and W. Meng, ViDE: A Vision-Based Approach for Deep
Web Data Extraction, IEEE Trans. Knowledge and Data Eng., vol. 22, no. 3, pp.
447-460, Mar. 2010.
[22] Y. Lu, H. He, H. Zhao, W. Meng, and C. Yu, Annotating Structured Data of
the Deep Web, Proc. IEEE 23rd Intl Conf. Data Eng. (ICDE), 2007.
[31] Z. Wu et al., Towards Automatic Incorporation of Search Engines into a
Large-Scale Metasearch Engine, Proc. IEEE/WIC Intl Conf. Web Intelligence
(WI 03), 2003.
[32] O. Zamir and O. Etzioni, Web Document Clustering: A Feasibility
Demonstration, Proc. ACM 21st Intl SIGIR Conf. Research Information
Retrieval, 1998.
[33] Y. Zhai and B. Liu, Web Data Extraction Based on Partial Tree Alignment,
Proc. 14th Intl Conf. World Wide Web (WWW 05), 2005.
[34] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu, Fully Automatic
Wrapper Generation for Search Engines, Proc. Intl Conf. World Wide Web
(WWW), 2005.
[35] H. Zhao, W. Meng, and C. Yu, Mining Templates from Search Result
Records of Search Engines, Proc. ACM SIGKDD Intl Conf. Knowledge
Discovery and Data Mining, 2007.
[36] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W.-Y. Ma, Simultaneous Record
Detection and Attribute Labeling in Web Data Extraction, Proc. ACM SIGKDD
Intl Conf. Knowledge Discovery and Data Mining, 2006.
LITERATURE SURVEY
This paper describes Seeker, a platform for large-scale text analytics, and SemTag,
an application written on the platform to perform automated semantic tagging of
large corpora. We apply SemTag to a collection of approximately 264 million web
pages, and generate approximately 434 million automatically disambiguated
semantic tags, published to the web as a label bureau providing metadata regarding
the 434 million annotations. To our knowledge, this is the largest scale semantic
tagging effort to date. We describe the Seeker platform, discuss the architecture of
the SemTag application, describe a new disambiguation algorithm specialized to
support ontological disambiguation of large-scale data, evaluate the algorithm, and
present our final results with information about acquiring and making use of the
semantic tags. We argue that automated large scale semantic tagging of ambiguous
content can bootstrap and accelerate the creation of the semantic web.