AIM
The main aim of this project is to provide reliable and fast
SYNOPSIS
In this paper, we present novel algorithms for extracting templates from a large number of web documents which are generated from heterogeneous templates. We cluster the web documents based on the similarity of underlying template structures in the documents so that the template for each cluster is extracted
EXISTING SYSTEM
Due to the assumption of all documents being generated from a single common template, solutions for this problem are applicable only when all documents are guaranteed to conform to a common template. However, in real applications, it is not trivial to classify massively crawled documents into homogeneous partitions
PROPOSED SYSTEM
Our work is different from the existing content
MODULES
design for the clients, which is easy to understand and lets them interact with this project. This module is developed using the servlet package, which is part of J2EE.
Template Extraction
If a query has already been searched across the network, the server keeps only the matching URLs and, when a match is found, passes control to the corresponding templates. Here we extract multiple templates from multiple sites. Finally, we select the template that best suits the query, extract it fully, and frame the result on a common template. This process is called template extraction.
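One possible reading of this step can be sketched in Java. The sketch assumes that each page has already been reduced to a set of path strings (for example "html/body/h1") and that the common template is simply the set of paths shared by every page in a group. The class name CommonTemplateSketch, the method names, and the sample paths are hypothetical and do not come from the project itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CommonTemplateSketch {

    // Assumption: the template is the set of paths that occur in every page.
    static Set<String> commonTemplate(List<Set<String>> pages) {
        if (pages.isEmpty()) {
            return new HashSet<String>();
        }
        Set<String> template = new HashSet<String>(pages.get(0));
        for (Set<String> page : pages) {
            template.retainAll(page); // keep only paths also present in this page
        }
        return template;
    }

    // What remains of a page once the template paths are removed.
    static Set<String> contentOf(Set<String> page, Set<String> template) {
        Set<String> content = new HashSet<String>(page);
        content.removeAll(template);
        return content;
    }

    public static void main(String[] args) {
        Set<String> pageA = new HashSet<String>(
                Arrays.asList("html/body/h1", "html/body/div", "html/body/p"));
        Set<String> pageB = new HashSet<String>(
                Arrays.asList("html/body/h1", "html/body/div", "html/body/table"));
        List<Set<String>> pages = Arrays.asList(pageA, pageB);

        Set<String> template = commonTemplate(pages);
        System.out.println("Template paths: " + template);
        System.out.println("Content of page A: " + contentOf(pageA, template));
    }
}
```

Requiring a path to appear in every page is the strictest choice; a real system would likely tolerate some noise, for instance by keeping paths that appear in most pages.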
Clustering
TEXT-MDL is an agglomerative hierarchical clustering algorithm: it starts with each document as its own cluster.
When we merge clusters hierarchically, we select two clusters which maximize the reduction of the MDL cost by merging them. Given a cluster ci, if a cluster cj maximizes the reduction of the MDL cost, we call cj the nearest cluster of ci. In order to efficiently find the nearest cluster of ci.
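The merge rule described above can be illustrated with a minimal Java sketch. It does not use the actual MDL cost of TEXT-MDL. As a loudly simplified stand-in, the cost of a cluster is taken to be the size of its template (the paths found in more than half of its documents) plus the number of paths by which each document differs from that template. The class name MdlClusteringSketch, the method names, and the sample paths are hypothetical. The nearest cluster is found here by brute force over all pairs, not by any faster method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MdlClusteringSketch {

    // Simplified stand-in for the MDL cost of one cluster (an assumption):
    // template size + per-document differences from the template.
    static int clusterCost(List<Set<String>> docs) {
        Map<String, Integer> support = new HashMap<String, Integer>();
        for (Set<String> doc : docs) {
            for (String path : doc) {
                Integer n = support.get(path);
                support.put(path, n == null ? 1 : n + 1);
            }
        }
        // Template = paths occurring in more than half of the documents.
        Set<String> template = new HashSet<String>();
        for (Map.Entry<String, Integer> e : support.entrySet()) {
            if (2 * e.getValue() > docs.size()) {
                template.add(e.getKey());
            }
        }
        int cost = template.size();
        for (Set<String> doc : docs) {
            for (String path : doc) {
                if (!template.contains(path)) cost++; // path not explained by template
            }
            for (String path : template) {
                if (!doc.contains(path)) cost++;      // template path missing in doc
            }
        }
        return cost;
    }

    // Agglomerative clustering: repeatedly merge the pair (ci, cj) whose
    // merge reduces the total cost the most; stop when no merge helps.
    static List<List<Set<String>>> cluster(List<Set<String>> documents) {
        List<List<Set<String>>> clusters = new ArrayList<List<Set<String>>>();
        for (Set<String> doc : documents) {
            List<Set<String>> single = new ArrayList<Set<String>>();
            single.add(doc);
            clusters.add(single);
        }
        while (clusters.size() > 1) {
            int bestI = -1;
            int bestJ = -1;
            int bestReduction = 0;
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    List<Set<String>> merged = new ArrayList<Set<String>>(clusters.get(i));
                    merged.addAll(clusters.get(j));
                    int reduction = clusterCost(clusters.get(i))
                            + clusterCost(clusters.get(j)) - clusterCost(merged);
                    if (reduction > bestReduction) {
                        bestReduction = reduction;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            if (bestI < 0) {
                break; // no merge reduces the cost any further
            }
            List<Set<String>> absorbed = clusters.remove(bestJ); // bestJ > bestI
            clusters.get(bestI).addAll(absorbed);
        }
        return clusters;
    }

    static Set<String> paths(String... values) {
        return new HashSet<String>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        List<Set<String>> docs = new ArrayList<Set<String>>();
        docs.add(paths("html/body/h1", "html/body/div/price", "html/body/div/cart"));
        docs.add(paths("html/body/h1", "html/body/div/price", "html/body/div/cart", "html/body/div/sale"));
        docs.add(paths("html/body/ul/li", "html/body/ul/nav", "html/body/footer"));
        docs.add(paths("html/body/ul/li", "html/body/ul/nav", "html/body/footer", "html/body/aside"));
        System.out.println("Clusters found: " + cluster(docs).size());
    }
}
```

With the four sample pages in main, the two price pages end up in one cluster and the two navigation pages in another, because merging across the groups would raise the stand-in cost. The all-pairs search is quadratic per merge step, which is the reason an efficient way of finding the nearest cluster matters for large crawls.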
ARCHITECTURE DIAGRAM
[Figure: architecture diagram with the components Clients, Search, Server, and Clustering]
SOFTWARE REQUIREMENTS
Windows XP
JDK 1.6
Servlet, JSP, Apache Tomcat
HARDWARE REQUIREMENTS
Hard Disk: 20GB and above
RAM: 512MB and above
Processor: Pentium III and above