
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR)
ISSN 2249-6831, Vol. 2, Issue 2, June 2012, 76-88
© TJPRC Pvt. Ltd.

WEB INFORMATION EXTRACTION USING DATA EXTRACTION AND LABEL ASSIGNMENT


1 A. SURESH BABU, 2 A. RAKESH, 3 D. BHAVITHA, 4 M. PRANAVI

1 Assistant Professor, Dept. of CSE, JNTU University, Pulivendula, A.P., India
2,3,4 Bachelor of Technology (CSE) Students, JNTU University, Pulivendula, India

ABSTRACT
Information Extraction (IE) is the process of recovering structured data from formatted text: identifying fields (e.g., named entity recognition) and understanding the relations between fields (e.g., record association). Users obtain specific information from websites by querying, extracting and integrating data from different web pages. The unsupervised IE system used here is DeLa (Data Extraction and Label Assignment) for web databases. It is designed to solve the record-level extraction task and handles nested object extraction. To integrate data from different web sites, Information Extraction systems such as wrappers and agents have been developed to extract data objects from web pages based on their HTML-tag structures. DeLa consists of four components. The form crawler collects the labels of the web site's form elements and sends queries to the web site; the hidden web crawler HiWE is adopted for this task. The wrapper generator automatically induces regular-expression wrappers from the data contained in web pages; its phases are data-rich section extraction, C-repeated pattern discovery, and handling of optional attributes and disjunction. The data aligner fits the extracted data into a table; its phases are data extraction and attribute separation. The label assigner assigns labels to the attributes of the data objects by employing a set of heuristics. The overall goal is to send queries through the search form, automatically extract data objects from the result pages, fit the extracted data into a table, and assign meaningful labels to the attributes of the data objects (the columns of the table). The results demonstrate the feasibility of heuristic-based label assignment and show that the employed heuristics perform well.

INTRODUCTION
Although search engines help users find information of interest on the World Wide Web, the pages returned by filling in search forms are largely not indexable by most search engines today, because they are dynamically generated by querying a back-end database. The set of such web pages, referred to as the Deep Web or Hidden Web, is estimated to be around 500 times the size of the surface web. Consider, for example, a user who wants to check the configuration and price of a notebook computer before buying one on the Web.


Since such information exists only in the back-end databases of the various notebook vendors, the user has to visit the web site of each vendor, submit queries, extract the relevant information from the result pages and compare or integrate the results manually. Hence there is a need for information integration tools that can propagate user queries, extract the corresponding results from web pages and integrate the retrieved information. This paper addresses the problem of automatically extracting data objects from a given web site and assigning meaningful labels to the extracted data. For instance, from a collection of book pages, book tuples are to be extracted, where each tuple consists of the title, the set of authors, the (optional) format, the date and other attributes. While extracting plain-structured data, i.e., data objects with a fixed number of attributes and single values, poses no problem, few current approaches can extract nested-structured data, i.e., data objects with a variable number of attributes and possibly multiple values per attribute. Moreover, most current work needs human involvement and falls short of conveying to users the meaning of the attributes of the extracted data. In particular, the focus here is on web sites that provide a complex HTML search form, rather than simple keyword search, for users to query the back-end databases. Solving this problem allows both the data to be extracted from such web sites and its schema to be captured, which makes further manipulation and integration of the data easier.

IR vs. IE

IE differs from the more established technology of Information Retrieval (IR). IR does not extract information; it selects a relevant subset of documents from a larger collection based on a user query, after which the user searches the retrieved documents for the desired information. The contrast between the aims of IE and IR is this: IR retrieves relevant documents from large collections, whereas IE extracts relevant information from documents. Hence the two techniques are complementary, and used in combination they can provide much better tools for text processing. IE and IR differ not only in their aims but also in the techniques they use.

BACKGROUND
Information extraction dates back to the early days of NLP in the late 1970s. An early commercial system from the mid-1980s was JASPER, built for Reuters by the Carnegie Group with the aim of providing real-time financial news to financial traders. Beginning in 1987, IE was spurred by a series of Message Understanding Conferences (MUC), competition-based conferences that focused on the following domains:

MUC-1 (1987) and MUC-2 (1989): Naval operations messages
MUC-3 (1991) and MUC-4 (1992): Terrorism in Latin American countries
MUC-5 (1993): Joint ventures and the microelectronics domain
MUC-6 (1995): News articles on management changes
MUC-7 (1998): Satellite launch reports

Considerable support came from DARPA, the US defense agency, which wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism. A precursor of IE is the field of text understanding. Artificial Intelligence researchers have built various systems whose goal is an accurate representation of the contents of an entire text. These systems operate in small domains and are usually not very portable to new domains. Since the late 1980s the US government has sponsored the MUCs to evaluate and advance Information Extraction. The MUCs involve a set of participants from different sites, mainly a mix of academic and industrial research labs. The objective of the conferences is to obtain a quantitative evaluation of IE systems.

METHODOLOGY
DeLa reconstructs (part of) a hidden back-end web database. It does this by sending queries through HTML forms, automatically generating regular-expression wrappers to extract data objects from the result pages, and restoring the retrieved data into an annotated (labeled) table. The DeLa system consists of four components: the form crawler, the wrapper generator, the data aligner and the label assigner.

For the first component, the existing hidden web crawler HiWE is adopted to collect the labels of the web site's form elements and send queries to the web site. For the remaining three components, the following contributions are made. First, a method is developed to automatically generate regular-expression wrappers from data contained in web pages. Second, a new data structure is employed to record the data extracted by the wrappers, and an algorithm is developed to fill a data table using this structure. Third, the feasibility of heuristic-based automatic data annotation for web databases is explored, and the effectiveness of the proposed heuristics is evaluated.

Form Crawler

Given a web site with an HTML search form, the form crawler collects the labels of each element contained in the form and sends queries through the form elements to obtain the result pages containing data objects. The hidden web crawler HiWE is adopted for this task. HiWE was built on the observation that most form elements are associated with some descriptive text that helps the user understand the semantics of the element. It is equipped with a database that stores values of task-specific concepts and assigns those values as queries to the form elements if the form labels and the database labels match.


Similarly, DeLa further utilizes the descriptive labels of the form elements by matching them to the attributes of the data extracted from the query-result pages.

Figure 3.2: Interaction with forms (user form interaction vs. crawler form interaction)

Operation Model

The operation model consists of four phases.

Internal Form Representation: On receiving a form page, a crawler first builds an internal representation of the search form. Abstractly, the internal representation of a form F includes the following pieces of information: F = ({E1, E2, ..., En}, S, M), where {E1, E2, ..., En} is a set of n form elements, S is the submission information associated with the form (e.g., submission URL, internal identifiers for each form element, etc.), and M is meta-information about the form (e.g., URL of the form page, web site hosting the form, set of pages pointing to this form page, other text on the page besides the form, etc.). A form element can be any one of the standard input elements: selection lists, text boxes, text areas, checkboxes, or radio buttons. For example, Figure 3.3 shows a form with three elements (ignore label(Ei) and dom(Ei) for now). Details about the actual contents of M and the information associated with each Ei are specific to a particular crawler implementation.
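The tuple F = ({E1, ..., En}, S, M) maps naturally onto a small data structure. The following Python sketch only illustrates the abstract model; the class and field names are our assumptions, not HiWE's actual implementation, and example.com is a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class FormElement:
    kind: str         # "select", "text", "textarea", "checkbox" or "radio"
    internal_id: str  # identifier sent on form submission (part of S)
    label: str = ""   # descriptive text associated with the element
    domain: list = field(default_factory=list)  # finite domain, e.g. options of a selection list

@dataclass
class FormRepresentation:
    elements: list    # E1..En: the form elements
    submission: dict  # S: submission URL, method, element identifiers
    meta: dict        # M: form-page URL, hosting site, surrounding text

# A form with one selection list and one text box.
form = FormRepresentation(
    elements=[
        FormElement("select", "cat", "Category", ["Books", "Music"]),
        FormElement("text", "q", "Title keywords"),
    ],
    submission={"url": "http://example.com/search", "method": "GET"},
    meta={"page_url": "http://example.com/"},
)
```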


Figure 3.3: Sample labeled form

Task-specific database: A crawler is equipped, at least conceptually, with a task-specific database D. This database contains all the information necessary for the crawler to formulate search queries relevant to the particular task. For example, HiWE uses a set of labeled fuzzy sets to represent task-specific information. More complex representations are possible, depending on the kinds of information used by the matching function.

Matching function: A crawler's matching algorithm, Match, takes as input an internal form representation and the current contents of the database D, and produces as output a set of value assignments. Formally, Match(({E1, ..., En}, S, M), D) = {[E1 ← v1, ..., En ← vn]}. A value assignment [E1 ← v1, ..., En ← vn] associates value vi with form element Ei (e.g., if Ei is a text box that takes a company name as input, vi could be "National Semiconductor Corp."). The crawler uses each value assignment to fill out and submit the completed form. This process is repeated until either the set of value assignments is exhausted or some other termination condition is satisfied.

Response Analysis: The response to a form submission is received by a response analysis module that stores the page in the crawler's repository. In addition, the response analysis module attempts to distinguish between pages containing search results and pages containing error messages. This feedback can be used to tune the matching function and update the set of value assignments. Notice that the above model lends itself to a number of different implementations depending on the internal form representation, the organization of D, and the algorithm underlying Match.
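The matching step can be approximated by comparing form-element labels with the concept labels stored in D. The sketch below is a simplification under our own assumptions: plain case-insensitive containment stands in for HiWE's fuzzy label matching, and the dict-based representation is hypothetical.

```python
from itertools import product

def match(elements, task_db):
    """Produce value assignments [E1 <- v1, ..., En <- vn] for a form.
    Each element is a dict with "id", "label" and an optional "domain"
    (the options of a finite-domain element such as a selection list).
    task_db maps concept labels to candidate query values."""
    candidates = []
    for e in elements:
        values = list(e.get("domain", []))
        if not values:
            for concept, vals in task_db.items():
                if concept.lower() in e["label"].lower():
                    values = vals
                    break
        candidates.append(values or [""])   # unmatched elements stay blank
    # One assignment per combination of candidate values.
    return [dict(zip([e["id"] for e in elements], combo))
            for combo in product(*candidates)]

elements = [{"id": "company", "label": "Company Name"},
            {"id": "state", "label": "State", "domain": ["CA", "TX"]}]
print(match(elements, {"company": ["National Semiconductor Corp."]}))
# [{'company': 'National Semiconductor Corp.', 'state': 'CA'},
#  {'company': 'National Semiconductor Corp.', 'state': 'TX'}]
```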

Wrapper Generation

The pages collected by the form crawler are passed to the wrapper generator, which induces a regular-expression wrapper based on the pages' HTML-tag structures. Since the web pages are generated from pre-defined templates, the HTML tag structure enclosing data objects may appear repeatedly if a page contains more than one data object instance. Therefore, the wrapper generator first treats the web page as a token sequence composed of HTML tags and a special token text representing any text string enclosed by pairs of HTML tags, then extracts repeated HTML-tag substrings from the token sequence, and induces a regular-expression wrapper from the repeated substrings according to the hierarchical relationships among them. The techniques employed in wrapper generation are described below.

Data-rich Section Extraction

This algorithm is motivated by the observation that pages from the same web site often use the same HTML design template, so that their HTML structures are quite similar or even identical. Moreover, the same information appearing concurrently in those pages is often used for navigational purposes, advertisement or other nepotistic purposes; usually such information has little to do with the main content of the pages. The basic idea is to compare two pages from the same web site, remove their common sections, and identify the remaining ones as data-rich sections. To the best of our knowledge, there is currently no good approach for identifying the data-rich section of web pages. One approach identifies the section of each page with the highest fan-out as the data-rich section. However, this heuristic does not hold in cases where the number of data objects contained in the input pages is relatively small. For example, suppose a web page contains two <TABLE>s: one contains only 2 data objects and the other contains 3 hyperlinks, home, previous and next, used for navigation. In this case the second <TABLE>, having the highest fan-out of 3, will be wrongly identified as the data-rich section, although the first <TABLE> is the correct one.

After downloading two pages from the same web site, we need to parse them and employ a data structure representing their layout in order to run our comparison algorithm. As HTML pages are composed of tags and text enclosed by tags, an HTML page's layout can be represented by a tree-like structure referred to as the Document Object Model (DOM). We work on a simplified DOM tree for HTML pages, because not all HTML tags provide structural information for a page. Given two DOM trees representing the two downloaded pages, we face the problem of matching the similar structures of the trees. Our basic idea is to traverse the two trees in depth-first order and compare them node by node from the root to the leaves. Before explaining the DSE algorithm in more detail, we define several functions employed in the node comparison:

RightSibling(parentnode, node) returns the right sibling of node that shares parentnode with node.
TagName(node) returns the tag name of node.
Attr(node) returns the attribute of node if the node is <A> or <IMG>; otherwise it returns null.
Same(node1, node2) = 1 if TagName(node1) = TagName(node2) and Attr(node1) = Attr(node2); 0 otherwise.

The algorithm works recursively. For two nodes, if they are internal nodes and are the same, the algorithm goes down one level to match their children from the leftmost to the rightmost. If they are leaf nodes and are the same, they are removed from the trees. If the nodes are not the same, the algorithm returns to their parents and continues to compare the other children, if any. Once all of the children of two parent nodes have been compared, the parents themselves are removed only if all their children have been removed.
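A compact sketch of this recursive comparison is given below. It captures the remove-matching-leaves and remove-parent-only-if-all-children-removed logic on a hypothetical dict-based simplified DOM; the published DSE algorithm additionally walks right siblings in document order and prunes both trees, whereas this sketch prunes only the first tree, which suffices to expose its data-rich section.

```python
def same(a, b):
    # Nodes match if tag names match; for <A>/<IMG>, attributes must match too.
    if a["tag"] != b["tag"]:
        return False
    if a["tag"] in ("a", "img"):
        return a.get("attr") == b.get("attr")
    return True

def dse_prune(a, b):
    """Remove from tree `a` the sections it shares with tree `b`.
    Returns True if the whole subtree rooted at `a` was removed.
    Nodes are dicts with "tag", optional "attr" and "children"."""
    if not same(a, b):
        return False
    if not a["children"] and not b["children"]:
        return True                      # matching leaves are removed
    kept, b_children = [], list(b["children"])
    for child in a["children"]:
        removed = False
        for i, bc in enumerate(b_children):
            if dse_prune(child, bc):     # go down one level
                del b_children[i]        # each b-node matches at most once
                removed = True
                break
        if not removed:
            kept.append(child)
    a["children"] = kept
    return not kept                      # parent removed only if all children were

# Two pages sharing a navigational row; the data row survives the pruning.
tree_a = {"tag": "table", "children": [
    {"tag": "tr", "children": [{"tag": "a", "attr": "home", "children": []}]},
    {"tag": "tr", "children": [{"tag": "text", "children": []}]},
]}
tree_b = {"tag": "table", "children": [
    {"tag": "tr", "children": [{"tag": "a", "attr": "home", "children": []}]},
]}
dse_prune(tree_a, tree_b)
print(tree_a["children"])   # only the row containing the text node remains
```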


Tree Matching Example

A simple example demonstrates how the tree-matching problem is handled. Suppose that, in the figure below, tree A and tree B are two tag trees representing two pages from which we want to extract the data-rich sections. Starting from the roots, we match the structures of tree A and tree B. Because the root pair (table, table) is the same, the process goes down one level to compare their children. The first child pair (tr, tr) is the same and, thus, the process goes down one more level to (td, td), which are also the same. Next, (a1, a1) are compared. Since they are the same, (a1, a1) are removed. Then (a2, a2) and (a3, a3) are compared and removed. Next, the process goes back to (td, td) and finds that all of their children have been removed; thus (td, td) is also removed. Similarly, (tr, tr) is removed. The process then goes back to (table, table) and compares their next children. Thus, (table, map) are compared and found to be different. Then (table, tr) are compared and they are different as well. Because tr has no more right siblings to compare with table, the processing of table is finished and the process goes back again to the root. Note that the leftmost (tr, tr) have already been removed. Thus, the rightmost tr is compared with map next, and then tr with tr. Accordingly, the process goes down the two trees and eventually compares (ul, ul). To compare the children of (ul, ul), first (a5, a6) are processed and a difference is detected; the process then moves on to (a5, a5), which are removed since they are the same. After that, the process goes up to (ul, ul) and down again. At this point, only (a6, a6) are left; therefore these children and their parents (ul, ul) are removed. Note that when the process backs up to td, a4 is still there. Therefore neither td, tr nor table is removed. The algorithm only compares nodes from the same layer of the two trees. Let the height of the target tree be H and the maximum number of nodes in a single layer be M; the time complexity of this algorithm is then O(H*M*M). However, this is the worst case. In practice, many comparisons can be avoided, since nodes whose parents have different HTML tags need not be compared.

Figure 3.4: Tree Matching Example

The pruned tree A is shown above. It is worth mentioning that HTML pages vary so much that even two pages that look the same in a browser can have different tag structures. For example, <TABLE><A></A></TABLE> and <TABLE><TABLE><A></A></TABLE></TABLE> result in different subtrees, but they look the same in a browser.


Unfortunately, such variations of tag combinations can be infinite, and our tree-matching algorithm does not consider such cases. However, since the basic assumption of the DSE algorithm is that the same template generates the HTML pages of a web site, the tree structures of those pages should be the same.

C-repeated Patterns

The data objects contained in HTML pages are generated by common templates, so the structure of embedded data objects may appear repeatedly if an HTML page contains more than one data object instance. Accordingly, continuous repeated (C-repeated) patterns are discovered iteratively as wrapper candidates from the token sequences representing HTML pages. Some web sites, however, show only one data object per page. In that case, once the data-rich sections of each page have been correctly identified, multiple pages can be combined into one single token sequence that contains multiple data objects. Given an input string S, a C-repeated substring (pattern) of S is a repeated substring of S having at least one pair of adjacent occurrences. To expose the internal structure of a string, a data structure called a token suffix-tree is adopted.
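Before turning to the suffix-tree, the adjacency condition itself can be checked directly. The brute-force sketch below only illustrates the definition; DeLa finds these patterns via the token suffix-tree described next, not by quadratic scanning.

```python
def c_repeated_patterns(tokens):
    """A repeated substring is C-repeated if two of its occurrences are
    adjacent, i.e. the distance between their starting positions equals
    the pattern length."""
    n, found = len(tokens), set()
    for length in range(1, n // 2 + 1):
        for start in range(n - 2 * length + 1):
            if tokens[start:start + length] == tokens[start + length:start + 2 * length]:
                found.add(tuple(tokens[start:start + length]))
    return found

tokens = ["<A>", "text", "</A>", "<A>", "text", "</A>", "text", "</P>"]
print(c_repeated_patterns(tokens))   # {('<A>', 'text', '</A>')}
```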

Figure 3.5: Discovering C-repeated Patterns (Iteration I and Iteration II)

Each leaf is represented by a square with a number that indicates the starting token position of a suffix. Each internal node is represented by a solid circle with a number that indicates the token position where its children differ. Sibling nodes sharing the same parent are put in alphabetical order. Each edge between two internal nodes has a label, which is the substring between the two token positions of the two nodes. Each edge between an internal node and a leaf node has a label, which is the token at the position of the internal node in the suffix starting from the leaf node. For example, in Iteration I, the edge label between node a and node b is <A>text</A>, i.e., the substring starting from the first token up to, but not including, the fourth token. The concatenation of edge labels from the root to node e is <A>text</A>text</P><P>, which is the unique prefix identifying the fifth suffix string <A>text</A>text</P><P><A>text</A>text</P>. Since a token suffix-tree is a special suffix-tree, it can be constructed optimally in O(n) time. If the labels of all edges are represented by pointers into the original string, the space complexity is O(n). Path-labels of all internal nodes and their prefixes in the token suffix-tree are our candidates for discovering C-repeated patterns.


For each candidate repeated pattern, it is a C-repeated pattern if any two of its occurrences are adjacent, i.e., the distance between their two starting positions equals the pattern length. For example, <A>text</A> is a repeated pattern in Iteration I, since node b is an internal node whose path-label contains the pattern as a prefix. Its leaf nodes c, e and f indicate that the pattern appears three times in the sequence, at starting positions 2, 5 and 11. This pattern is a C-repeated pattern because two of its occurrences are adjacent (5 - 2 = 3).

The final objective is to discover nested structures from the string sequence representing an HTML page. A hierarchical pattern-tree is used for this task. The basic idea is to iterate between building token suffix-trees and discovering C-repeated patterns. For example, in Iteration I, a token suffix-tree is built first and <A>text</A> is found as a C-repeated pattern. Next, the occurrence from S2 to S4 is masked to form the new sequence of Iteration II. Then a suffix-tree is built for the new sequence and a new C-repeated pattern, <P><A>text</A>text</P>, is found. Finally, a wrapper is obtained from these two iterations, <P>(<A>text</A>)*text</P>, that correctly represents the structure of the two data objects in the given string, where * means "appears zero or more times". In this iterative discovery process, patterns form a hierarchical relationship, since the discovery of some patterns depends on the discovery of others. For example, <P><A>text</A>text</P> cannot be extracted unless <A>text</A> has been found. On the other hand, in a single iteration phase we may find several C-repeated patterns that are independent of each other. For example, in the string text<IMG>text<IMG>text<IMG>text<P>text<P>, the discovery of the C-repeated pattern text<IMG> does not depend on the discovery of text<P>, and vice versa. A pattern-tree can be used to represent both the dependence and independence between discovered C-repeated patterns and to record the entire iterative discovery process. In a pattern-tree, pattern Pi is a child of pattern Pj if Pi is discovered in the iteration phase right after the phase in which Pj is found, and Pi is a sibling of Pj if Pi is discovered in the same iteration phase as Pj. Initially, an empty pattern is set as the root of the pattern-tree. The iterative discovery process proceeds as follows. For each discovered pattern, the last occurrence in each repeated region of that pattern is retained and the other occurrences in the current sequence (either the original one or the one formed by masking some tokens in the previous phase) are masked to form a new sequence. Then a suffix-tree is built for the new sequence, new C-repeated patterns are discovered, and those patterns are inserted into the pattern-tree as children of the current pattern. This process is repeated for each newly found pattern. Once the children patterns of the current pattern have been processed, the process goes back up one level and continues with the siblings of the current pattern. When the process returns to the root, it is finished. For each node in the pattern-tree, its repeated occurrences are recorded for later processing.
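The masking step that drives the iteration can be sketched as follows, using a plain Python token list as a stand-in for the suffix-tree machinery (the function name and representation are our own):

```python
def mask_pattern(tokens, pattern):
    """Collapse each maximal run of `pattern` so that only the last
    occurrence in every repeated region survives; the shortened sequence
    is the input to the next discovery iteration."""
    p, out, i = list(pattern), [], 0
    while i < len(tokens):
        nxt = tokens[i + len(p):i + 2 * len(p)]
        if tokens[i:i + len(p)] == p and nxt == p:
            i += len(p)                # drop this occurrence; a later one survives
        else:
            out.append(tokens[i])
            i += 1
    return out

s = ["<P>", "<A>", "text", "</A>", "<A>", "text", "</A>", "text", "</P>",
     "<P>", "<A>", "text", "</A>", "text", "</P>"]
s2 = mask_pattern(s, ["<A>", "text", "</A>"])
# s2 keeps one <A>text</A> per region, so the next iteration can discover
# the enclosing C-repeated pattern <P><A>text</A>text</P>.
```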


As an example, consider the string ABACCABBAABBACCC, where each letter represents a token; the structure covering the two data objects in this string is (AB*A)*C*, where * means the substring may appear zero or more times. In such a pattern-tree, every leaf node represents a discovered nested structure given by the path from the root to that leaf. Although such a tree clearly reflects the iterative discovery process, it is time-consuming to construct, especially when handling real HTML pages with hundreds of tokens on average, since in each iteration a suffix-tree must be reconstructed (O(n) time) and traversed to discover C-repeated patterns (O(n log n) time). To lower the complexity, a heuristic is employed to filter out patterns that cross pairs of HTML tags, e.g., </TR><TR>text, and three rules are used to prune branches of the pattern-tree (see the sketch after this paragraph):

Prune Rule 1: A pattern is discarded if it contains one of its siblings.
Prune Rule 2: A pattern is discarded if it contains its parent while one of its siblings does not.
Prune Rule 3: Define the coverage of a pattern to be its string length multiplied by the number of its occurrences. In each processing iteration, if more than one child pattern remains after applying Rules 1 and 2, only the one with the largest coverage is retained and the others are discarded.

The first two rules enforce an order from inner levels to outer levels when forming a nested structure. The reason for requiring this ordering is that the discovery of outer structures depends on the discovery of inner structures; if outer-level patterns were discovered before inner-level ones, some wrong patterns could be obtained. The third rule is introduced so that each level of the pattern-tree has only one child, minimizing its size by turning the tree into a list. Note that patterns need not be chosen by coverage; they could be chosen by length, by number of occurrences, or even randomly. However, processing larger-coverage patterns first masks more tokens in the sequence and thus yields a shorter sequence for later processing. After introducing these three rules, a list-like pattern-tree is obtained for the string sequence representing a web page. The time complexity of the pattern extractor as a whole is O(n log n). The pattern with the highest nested level in the pattern-tree is chosen as the nested schema inferred from the specific web page. Note that the nested level of the extracted pattern may be less than the height of the pattern-tree, since patterns at different heights of the tree may lie at the same level of the structure. Similarly, there may be more than one pattern with the highest nested level in the pattern-tree. If lower-level patterns are enclosed in parentheses followed by the symbol *, the pattern becomes a union-free regular expression (without disjunction, i.e., union operators). However, to make sure that a lower-level pattern in the tree is nested inside a higher-level one, it must be verified that the former is a substring of the latter and that they have overlapping occurrences; otherwise, unnecessary nested levels would be introduced into the patterns.
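Prune Rule 3 above reduces to a simple comparison. A minimal sketch, assuming patterns are stored with their token lists and occurrence positions (a hypothetical layout):

```python
def coverage(pattern, occurrences):
    # Prune Rule 3: coverage = pattern length * number of occurrences.
    return len(pattern) * len(occurrences)

def apply_rule_3(children):
    """Keep only the child pattern with the largest coverage, turning the
    pattern-tree into a list. Rules 1 and 2 (the containment checks) are
    assumed to have been applied already."""
    return max(children, key=lambda p: coverage(p["tokens"], p["occurrences"]))

children = [
    {"tokens": ["text", "<IMG>"], "occurrences": [0, 2, 4]},  # coverage 6
    {"tokens": ["text", "<P>"],   "occurrences": [6]},        # coverage 2
]
print(apply_rule_3(children))  # the text<IMG> pattern is retained
```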

Data Alignment

The data aligner works in two phases. The data-extraction phase first extracts data from the web pages according to the wrapper induced by the wrapper generator and fills the extracted data into a table. The attribute-separation phase then examines the table's columns and separates attributes that are encoded into one text string into several newly added columns.

Data Extraction

Given a regular expression pattern and a token sequence representing a web page, a nondeterministic finite-state automaton can be constructed and employed to match the pattern's occurrences in the string sequences representing web pages. The algorithm's complexity is O(nm), with n being the target string length and m the size of the regular expression. Recall that when inducing the wrapper, any text string enclosed by pairs of HTML tags was replaced by the special token text; therefore, once the occurrences of the regular expression in the token sequences are obtained, the original text strings must be restored. Each occurrence of the regular expression then represents one data object from the web page.

For a given regular expression, a data-tree is employed to record one occurrence of the expression. A data-tree is defined recursively as follows:

If the regular expression is atomic, the data-tree is a single node and the occurrence of the expression is the node label.
If the regular expression is E1E2...En, the data-tree is a node with n children, where the i-th (1 <= i <= n) child is a data-tree recording the occurrence of Ei.
If the regular expression is (E1|E2), the data-tree is a node with one child that records the occurrence of either E1 or E2.
If the regular expression is (E)* and there are m occurrences of E, the data-tree is a node with m children, where the i-th (1 <= i <= m) child is a data-tree recording the i-th occurrence of E.

From this definition, internal nodes fall into two categories: star-type and cat-type. A star-type internal node corresponds to an (E)*-like sub-expression of the wrapper, and its children all record occurrences of E, i.e., multiple values of the same attribute of the data object. A cat-type internal node corresponds to an (E1|E2)-like or E1E2-like sub-expression, and its children record the occurrences of different sub-expressions, i.e., the values of different attributes of the data objects.

To construct a table of data from a data-tree, the tree is traversed in depth-first order, and the table for a node is filled by adding the tables of its children (star-type internal nodes) or multiplying the tables of its children (cat-type internal nodes). Given tables T1 and T2, with n rows in T1 and m rows in T2: the result of T1 add T2 is the union of T1 and T2, i.e., all rows of T1 and all rows of T2 are inserted; the result table has the same number of columns as T1 and T2, and n plus m rows. The result of T1 multiply T2 is the Cartesian product of T1 and T2, i.e., new rows are built by repeating each row of T1 m times and concatenating each with all rows of T2; the result table has a number of columns equal to the columns of T1 plus the columns of T2, and n multiplied by m rows.
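These two table operations are the heart of the data-tree traversal. A minimal sketch with tables represented as lists of rows (the book data is invented for illustration):

```python
def add_tables(t1, t2):
    """Star-type node: union of rows; both tables have the same number
    of columns, and the result has len(t1) + len(t2) rows."""
    return t1 + t2

def multiply_tables(t1, t2):
    """Cat-type node: Cartesian product; columns are concatenated and
    the result has len(t1) * len(t2) rows."""
    return [row1 + row2 for row1 in t1 for row2 in t2]

# A title node multiplied with a star-type author node holding two values:
title = [["Some Book Title"]]
authors = add_tables([["A. Author"]], [["B. Author"]])
print(multiply_tables(title, authors))
# [['Some Book Title', 'A. Author'], ['Some Book Title', 'B. Author']]
```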


Using this algorithm, a table of data can be constructed that is similar to a table in a relational DBMS, where multiple values of one attribute are distributed over multiple rows of the table while the single values of the other attributes are repeated.

Attribute Separation

Before attribute separation, all HTML tags contained in the table are removed, as are columns with no text string inside, so that every cell of the data table contains a text string as its content. The basic assumption of the attribute-separation phase is that if several attributes are encoded into one text string, there should be some special symbol(s) in the string acting as a separator to help users visually distinguish the attributes. The problem is thus to find the right separator for each web site. To separate the attributes of the data objects, for each column all cells are searched for characters that are neither a letter nor a digit, and their occurrences and positions within the cells are recorded. If a discovered character has the same number of occurrences in every cell of a column, it is recognized as a separator candidate for that column. The separator candidates are then validated by several heuristics. For instance, @ is not a valid separator, since it appears in text strings representing email addresses, and a . following $ is not a valid separator either, since it appears in text strings representing prices. When several separators are found to be valid for one column, the attribute strings of that column are separated from beginning to end in the order of the occurrence positions of each separator.
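The separator-candidate search can be sketched directly from this description (whitespace is skipped here for brevity, and the validation heuristics are omitted):

```python
def separator_candidates(column):
    """A character (neither letter nor digit) is a separator candidate
    for a column if it occurs the same number of times in every cell."""
    per_cell = []
    for cell in column:
        counts = {}
        for ch in cell:
            if not ch.isalnum() and not ch.isspace():
                counts[ch] = counts.get(ch, 0) + 1
        per_cell.append(counts)
    first = per_cell[0]
    return [ch for ch, n in first.items()
            if all(c.get(ch) == n for c in per_cell[1:])]

column = ["Paperback | $9.99", "Hardcover | $25.00"]
print(separator_candidates(column))  # ['|', '$', '.']; '$' and '.' would
                                     # later be rejected by the price heuristic
```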

Label Assignment

To assign labels to the columns of the data table containing the extracted data objects, i.e., to understand the meaning of the data attributes, the following four heuristics are employed.

Heuristic 1: Match form element labels to data attributes. The search form of a web site, through which users submit their queries, provides a sketch of the web site's underlying relational database. Assuming that web site designers try their best to answer user queries with the most relevant data, keyword queries submitted through one specific form element will reappear in the corresponding attribute values of the data objects. Recall that the form crawler collects all labels of the form elements and sends out queries through these elements to obtain web pages. Therefore, for each form element with its keyword queries, if the keywords mostly appear in one specific column of the data table, the label of that form element is assigned to the column.

Heuristic 2: Search for voluntary labels in table headers. The HTML specification defines tags such as <TH> and <THEAD> for page designers to voluntarily list headings for the columns of their HTML tables. Moreover, such labels are usually placed near the data objects. Therefore, the HTML code near (usually above) the contained data objects is examined for possible voluntary labels.

Heuristic 3: Search for voluntary labels encoded together with data attributes. Some web sites encode the labels of data attributes together with the attribute values. Therefore, for each column of the data table, the maximal prefix and maximal suffix shared by all cells of the column are sought; a meaningful prefix is assigned to that column and a meaningful suffix to the column next to it as labels.

Heuristic 4: Label data attributes in conventional formats. Some data have a conventional format: a date is usually written as dd-mm-yy, dd/mm/yy, etc.; an email address usually contains the symbol @; a price usually contains the symbol $. Such information is used to recognize the corresponding data attributes. Note that the form elements and the data attributes need not be perfectly matched, so the label assigner may not be able to assign meaningful labels to all of the data attributes. In DeLa, users are therefore also allowed to add labels to unassigned attributes and to modify the assigned labels.
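Heuristics 3 and 4 lend themselves to short sketches. The shared-affix computation follows the description above, while the format patterns below are illustrative simplifications rather than DeLa's actual rule set:

```python
import os
import re

def shared_affixes(column):
    """Heuristic 3: the maximal prefix and suffix shared by all cells of
    a column; a meaningful prefix or suffix is then used as a label."""
    prefix = os.path.commonprefix(column)
    suffix = os.path.commonprefix([c[::-1] for c in column])[::-1]
    return prefix, suffix

# Heuristic 4: conventional formats (illustrative patterns only).
FORMAT_LABELS = [
    (re.compile(r"^\d{2}[-/]\d{2}[-/]\d{2,4}$"), "date"),
    (re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"), "email"),
    (re.compile(r"^\$\d+(\.\d{2})?$"), "price"),
]

def label_by_format(column):
    for pattern, label in FORMAT_LABELS:
        if all(pattern.match(cell) for cell in column):
            return label
    return None

print(shared_affixes(["Price: $9.99", "Price: $25.00"]))  # ('Price: $', '')
print(label_by_format(["$9.99", "$25.00"]))               # 'price'
```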

CONCLUSIONS
DeLa performs very well in automatically inducing wrappers, extracting data objects and assigning meaningful labels to the data attributes. The experimental results also demonstrate the feasibility of heuristic-based label assignment and the effectiveness of the employed heuristics, which, we believe, sets the stage for more fully automatic data annotation of web sites. As future work, we plan to add an ontology to the system, which can aid label assignment when the data domain of the web site is known, and to build an integration component that helps users integrate the data objects extracted from different web sites. Another possible direction for future work is the information discovery problem, that is, how to automatically locate the web sites from which users want to extract data objects.

REFERENCES
[1] BrightPlanet Corp. The Deep Web: Surfacing Hidden Value. http://www.completeplanet.com/Tutorials/DeepWeb/

[2] D. Florescu, A. Y. Levy and A. O. Mendelzon. Database techniques for the world-wide web: a survey. SIGMOD Record 27(3), 1998, 59-74.

[3] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.

[4] I. Muslea, S. Minton and C. Knoblock. A hierarchical approach to wrapper induction. Proc. 3rd Intl. Conf. on Autonomous Agents, 1999, 190-197.

[5] S. Raghavan and H. Garcia-Molina. Crawling the hidden web. Proc. 27th VLDB Conf., 2001, 129-138.

[6] J. Wang and F. Lochovsky. Data-rich section extraction from HTML pages. Proc. 3rd Conf. on Web Information Systems Engineering, 2002, 313-322.
