WEBVIEW

An SQL Extension for Joining Corporate Data to Data Derived from the World
Wide Web
Charles A. Wood and Terence T. Ow
Mendoza College of Business
University of Notre Dame
Notre Dame, IN 46556-5646
cwood1@nd.edu
ow.1@nd.edu
ABSTRACT
Researchers point out that a great source of data that can be used to generate more knowledge
can be found within the World Wide Web. In this research, we extend SQL using a new
Webview construct that will allow ad hoc joins from a database to data found on the Web using
ANSI-standard SQL. We also develop a tool used to implement this language, and using this
tool, we show how the proposed Webview construct can be used to join data from Web pages and
databases together. This tool can be used to dynamically gather data from the Web for use
within corporate databases, research data sets, and knowledge management repositories.

Keywords: Agents, Data Mining, Databases, SQL, Web Data Retrieval

INTRODUCTION
Knowledge management (KM) within an organization is often considered a way to increase
competitive ability (Nonaka 1994). However, KM has lately not been well received within many
corporations. A Bain & Company report (Rigby 2001) evaluated 25 different types of tools. Of
these 25 tools, KM tools ranked 24th in satisfaction. The report also shows that KM software has
a relatively high defection rate of 13%. The primary reasons are the expense of KM tools
(Horwitch and Armacost 2002) and the difficulty of acquiring new knowledge (Davenport 1998)
and of disseminating it. Consequently, many researchers have advocated data mining of
external data sources to supplement organizational knowledge (e.g., Chung and Gray 1999).
It has been established that programs can be written to retrieve and store data from the Web
(e.g., Kauffman, March, and Wood 2000). However, developing and running these programs is
quite complicated. Large programming efforts and high maintenance costs are duplicated across
corporations to achieve similar or identical results. Also, data retrieved by such techniques is
static. Figure 1 contrasts the two approaches. In the static approach, a programmer collects data
from the Web and stores the data collected at that particular time in the corporate database. In
the dynamic approach, ad hoc queries are used inside the database to query Web information in
different formats, depending upon the user's needs. New information that becomes available on
the Web is therefore available to these ad hoc queries, rather than only the static copy that was
stored. In addition, the external information is not stored explicitly in the database, so new
information is always available when queried. However, as with traditional database views, SQL
commands can transfer this information to permanent storage.

[Figure 1 diagram. Static representation: Web Page (HTML, XML), static retrieval of web data,
Corporate Database. Dynamic representation: Web Page (HTML, XML), WebView (relational
database), Corporate Database, Integrated View, Organization User Views (relational database).]
Figure 1: Static versus Dynamic Representation of Web data

In this paper, we develop a Structured Query Language (SQL) extension that allows
corporate databases to be joined to explicit information contained on any corporate or external
Web site. Because the extension builds on existing SQL and database technology, implementation
costs are minimal, and users can seamlessly retrieve information from database/Web joins (see
Figure 1). We seek answers to the following questions:

Can we represent a Web page so that it is accessible to a corporate database through SQL
language extensions, and if so, how?

Can a tool be developed that implements these SQL language extensions, allowing
easy data manipulation of Web pages?

We undertake three tasks here. The first task is to design a new principled extension to
the SQL language, called a Webview, that allows transparent joins between database and Web
data. The second task is to show that the Webview is robust, in that it can capture Web data of
interest and that identical uses of the extension yield identical results. The third task is to
develop a tool that implements the extension as a proof of concept that the Webview extension is
practical to use.

LITERATURE REVIEW
In this literature review, we examine two different literature bases derived from Information
Systems (IS) and Computer Science (CS). These include research on Knowledge Management
and Data mining, SQL access of HTML, and multi-database systems (MDBSs).
Knowledge Management and Data Mining. Most knowledge management literature
centers on identifying sources of knowledge within a company, capturing tacit knowledge
known only by one or a few employees, and converting that knowledge to explicit
knowledge inside a knowledge repository of some sort (Nonaka 1994). Software tools that aid
knowledge management have been reported to be expensive and of questionable value (Horwitch
and Armacost 2002).
Mobasher, Cooley, and Srivastava (2000) describe how pattern matching alone is not sufficient
for data mining: useful, high-quality information must also be identified from these patterns. We
build upon their research by creating database constructs that allow ad hoc queries of patterns,
thus allowing dynamic retrieval of data patterns that are deemed useful. Chung and Gray
(1999) explain how knowledge management, data warehousing, and data mining all work in
conjunction with each other, and how the Web has added a new dimension to knowledge
management by facilitating the acquisition of new knowledge from external sources. We add to
this literature by developing a language and tool that facilitate data collection and join the
collected data to existing database information.
SQL and HTML. Structured Query Language (SQL) is the language used by most
databases, and has been advocated as a means to access specific Web data (e.g., Deutsch, et al.
1998). SQL is said to be relationally complete in that it can be used to express any query
supported by predicate (or relational) calculus (Codd 1972). By tightly coupling Web data to
SQL using SQL extensions, we get the benefit of being relationally complete (since SQL itself is
relationally complete) and are left with simpler tasks of ensuring that our SQL extension is
robust in that it is sufficient to capture all Web data, including hierarchical representations (e.g.,
XML) and relational representations (e.g., links). An SQL extension also ensures that users can
access Web data transparently so that Web access is accessible to any SQL-based tool.1 Thus
far, no single proposed tool for data mining has addressed the challenges of SQL transparency
and robustness.

1 The transparency condition requires that any SQL statements, such as SELECT, remain
unaltered when accessing the new Webview construct.

MDBS. There have been many articles that discuss SQL extensions, mainly in the area of
MDBSs that can access disjoint relational SQL databases (e.g., Krishnan, et al. 2001).
Lakshmanan, Sadri, and Subramanian (1996) advocate five required features for SQL extensions:
(1) the language must have expressive power that is independent of the schema with which the
database is structured, (2) the language must allow restructuring of one
database to conform to the schema of another, (3) the language must be easy to use yet
sufficiently expressive, (4) the language must provide full capabilities that are downward
compatible with SQL, so that existing SQL will function properly in the presence of the MDBS,
and (5) the language must be able to be efficiently implemented. We build upon Lakshmanan,
Sadri, and Subramanian's work by proposing AgentSQL, which incorporates these five
requirements into a Webview: (1) it must have expressive power that is independent of HTML,
XML, or other Web-based markup languages, (2) it must allow the restructuring of Web data to
conform to a database schema, (3) it must be shown to be sufficient to capture any Web data,
including XML or HTML, (4) it must function like existing database constructs to allow
transparency for the database developer, and (5) it must be efficiently implemented.

SQL WEBVIEW EXTENSION FOR AGENTSQL


The CREATE WEBVIEW command, used to create ad hoc queries, is displayed below. Table 1
summarizes the CREATE WEBVIEW clauses, which can be used in any order, except that the
COLUMN clause must follow the applicable ROW or NESTED ROW clause and the CREATE
WEBVIEW clause must occur first.

To test the viability of CREATE WEBVIEW, we piggy-back our engine on top of
an existing Open Database Connectivity (ODBC) database manager. The engine exposes Web data
as virtual tables, and the corresponding SQL statements are then sent to the database engine
through the ODBC manager. Thus, CREATE WEBVIEW can be tested with any database that
supports ODBC or has third-party ODBC support (e.g., Oracle, Sybase, SQL Server, Access,
etc.).2 The following is the skeleton for the Webview scheme:

CREATE WEBVIEW schemaname
    USING { (URLExpression) | (SELECT statement) }
    [VARYING var1 [FROM start] [BY increment] TO finish,
            [var2 [FROM start] [BY increment] TO finish,]
            ...]
    [AS]
    [REPLACE[S] (findhtml, replacehtml),
                (findhtml, replacehtml),
                ...]
    [KEY  (htmlbegin, htmlend)]
    [TRIM (htmlbegin, htmlend)]
    [LINK [INCLUDE {HOST | PATH | LEFT | RIGHT | BOTH}] (htmlbegin, htmlend),
         [[INCLUDE {HOST | PATH | LEFT | RIGHT | BOTH}] (htmlbegin, htmlend),]
         ...]
    ROW { (htmlbegin, htmlend) | PAGE }
    COLUMN[S] Colname  Datatype      (htmlbegin, htmlend),
              Colname  PAGE,
              Colname  ROW,
              Colname  URL,
              Colname  KEY,
              Colname  RETRIEVETIME,
              Colname  ROWNUM,
              Colname  EXISTS        (htmlexists),
              Colname2 ...
    [NESTED [ROW] (htmlbegin, htmlend)
        ...
        [NESTED [ROW] (htmlbegin, htmlend)
            ...]]
;

2 Thus far, only Access and SQL Server have been tested.

Clause           Description
CREATE WEBVIEW   Indicates the start of the Webview definition.
USING            Defines the Web pages that will be accessed, either via a string literal or a
                 SELECT statement.
LINK             Defines URLs contained in one Web page that can be used to access another
                 (identically formatted) Web page, allowing relational joins of linked database.
                 The INCLUDE sub-clause allows you to include parts of the current path into the
                 link in case the retrieved link uses a relative path.
ROW              Defines each row between each occurrence of a beginning and ending text or
                 HTML. Within each ROW, COLUMNS are defined.
COLUMN           Defines a column within a row. The column name is listed first, followed by the
                 data type and then the HTML text that precedes and follows the column value.
                 Special data types include URL, PAGE, and KEY. EXISTS returns a Boolean
                 TRUE/FALSE if text appears within a row.
NESTED           The NESTED (or NESTED ROW) clause is used to indicate that hierarchical data
                 exists that is subordinate to the preceding ROW or NESTED clause. XML fits this
                 model, as does some HTML. Hence, NESTED does not indicate multiple row
                 definitions within the same page, but rather a single row definition where rows
                 of data are arranged in a hierarchical fashion.
VARYING          Allows a loop within the urlexpression or SELECT statement of the USING clause.
REPLACE          Allows a replacement of HTML or text before processing begins, which can
                 facilitate processing.
TRIM             Removes all text outside boundaries defined by two strings.
KEY              Finds the first occurrence of a string within a page. (Can be used to find a Web
                 page identifier.)

Table 1. CREATE WEBVIEW Command Clauses
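
The delimiter-based clauses in Table 1 can be illustrated with a short Python sketch. This is not the AgentSQL implementation, which the paper does not list; it is a minimal illustration that assumes REPLACE, TRIM, ROW, COLUMN, and EXISTS all reduce to plain substring searches between a beginning and an ending string. The helper names (between, replace_all, trim, rows, column) and the sample page are hypothetical.

```python
from typing import Iterator, List, Optional, Tuple


def between(text: str, begin: str, end: str, start: int = 0) -> Optional[Tuple[str, int]]:
    """Return (value, next_position) for the first begin...end span at or after start."""
    i = text.find(begin, start)
    if i < 0:
        return None
    i += len(begin)
    j = text.find(end, i)
    if j < 0:
        return None
    return text[i:j], j + len(end)


def replace_all(page: str, pairs: List[Tuple[str, str]]) -> str:
    """REPLACE: apply each (find, replacement) pair before any other processing."""
    for find, replacement in pairs:
        page = page.replace(find, replacement)
    return page


def trim(page: str, begin: str, end: str) -> str:
    """TRIM: keep only the text between the two boundary strings."""
    hit = between(page, begin, end)
    # Assumption: when the boundaries are missing, the page is left untouched.
    return hit[0] if hit else page


def rows(page: str, begin: str, end: str) -> Iterator[str]:
    """ROW: yield the text between each successive begin/end pair."""
    pos = 0
    while True:
        hit = between(page, begin, end, pos)
        if hit is None:
            return
        value, pos = hit
        yield value


def column(row: str, begin: str, end: str) -> Optional[str]:
    """COLUMN: first value between the delimiters, or None (NULL) if absent."""
    hit = between(row, begin, end)
    return hit[0] if hit else None


# Hypothetical page, invented for this illustration.
page = (
    "<html><table>"
    "<tr><b>Widget</b><td>-</td></tr>"
    '<tr><b>Gadget</b><img src="pic.gif"><td>7</td></tr>'
    "</table></html>"
)
page = replace_all(page, [("<td>-</td>", "<td>0</td>")])
page = trim(page, "<table>", "</table>")
table = [
    {
        "name": column(r, "<b>", "</b>"),
        "bids": int(column(r, "<td>", "</td>")),
        "has_pic": "pic.gif" in r,  # EXISTS: TRUE if the text appears in the row
    }
    for r in rows(page, "<tr>", "</tr>")
]
```

Running REPLACE before the other clauses mirrors the description in Table 1: a placeholder such as "-" is normalized to "0" first, so the numeric column parses without special cases.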

The tool shown below in Figure 2 takes SQL statements, including the new CREATE
WEBVIEW extension, and passes these statements to an ODBC database engine. The
AgentSQL tool shows proof of concept of the usability of the CREATE WEBVIEW statement,
and use of this statement in combination with existing SQL syntax.

Figure 2. AgentSQL Testing Tool
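
The idea of exposing Web data as a table that ordinary SQL can join can be sketched as follows. This is an illustration under stated assumptions, not the AgentSQL engine: Python's built-in sqlite3 module stands in for the ODBC database engine, a temporary table stands in for the virtual table, and the table names, column names, and rows (inventory, web_prices) are all hypothetical.

```python
import sqlite3

# Hypothetical rows, standing in for what a Webview would extract from a Web page.
web_rows = [("1001", "Widget", 12.50), ("1002", "Gadget", 7.25)]

# Assumption: sqlite3 stands in for the ODBC database engine used by the paper's tool.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item_id TEXT PRIMARY KEY, on_hand INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)", [("1001", 40), ("1003", 5)])

# Materialize the Web rows as a temporary table so that unmodified SQL
# can reference them by name, just like any other table.
conn.execute("CREATE TEMP TABLE web_prices (item_id TEXT, title TEXT, price REAL)")
conn.executemany("INSERT INTO web_prices VALUES (?, ?, ?)", web_rows)

# A standard join between corporate data and Web-derived data.
joined = conn.execute(
    "SELECT i.item_id, w.title, w.price, i.on_hand "
    "FROM inventory AS i JOIN web_prices AS w ON w.item_id = i.item_id"
).fetchall()
```

Because the Web rows sit behind a table name, the SELECT statement itself needs no special syntax, which is the transparency property the extension aims for.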

Create WEBVIEW that captures Data Sets that Span Several Web Pages
The following code shows how we can use the CREATE WEBVIEW AgentSQL statement to
retrieve the results of an Excite search.

CREATE WEBVIEW excite
    USING ("http://srch.excite.com/d/search/p/excite/index.jhtml?s=%22OLEDB+and+ODBC%22")
    TRIM ("table width=760", "target.gif")
    ROW ("<LI>", "</LI>")
    LINK INCLUDE LEFT ("http://srch.excite.com", ">")
    COLUMN
        Link         VARCHAR ("href=\"", "\""),
        Description  MEMO    ("<BR>", "<BR>"),
        WebPage      URL,
        Host         VARCHAR ("class=size8>", "<");
SELECT * FROM excite;

The above code shows how a search string (OLEDB and ODBC) can be used to
retrieve the results shown in Figure 3. (The result could be longer with different searches. The
search was made specific to limit the time spent on the site.) We provide an example here of a
dataset spanning four Excite Web pages containing a total of 74 results.

Figure 3. Virtual Table Created from Spanning Excite Pages by the Above Code
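
How a LINK clause might let one Webview span several result pages can be sketched in Python. The paper does not spell out the exact semantics of the INCLUDE options, so this sketch rests on two labeled assumptions: INCLUDE LEFT keeps the left delimiter as part of the extracted link, and INCLUDE HOST prefixes the scheme and host of the current page. The function names (next_link, crawl), the fetch callback, and the two-page site are hypothetical.

```python
from typing import Callable, List, Optional
from urllib.parse import urlsplit


def next_link(page: str, current_url: str, begin: str, end: str,
              include: str) -> Optional[str]:
    """Extract the link to the next page, or None when no link is present."""
    i = page.find(begin)
    if i < 0:
        return None
    j = page.find(end, i + len(begin))
    if j < 0:
        return None
    link = page[i + len(begin):j]
    if include == "LEFT":
        # Assumption: INCLUDE LEFT keeps the left delimiter as part of the link.
        return begin + link
    if include == "HOST":
        # Assumption: INCLUDE HOST prefixes the current page's scheme and host.
        parts = urlsplit(current_url)
        return f"{parts.scheme}://{parts.netloc}{link}"
    return link


def crawl(fetch: Callable[[str], str], start_url: str, begin: str, end: str,
          include: str, max_pages: int = 50) -> List[str]:
    """Follow the link from page to page, collecting each page's text once."""
    pages: List[str] = []
    seen = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(pages) < max_pages:
        seen.add(url)
        page = fetch(url)
        pages.append(page)
        url = next_link(page, url, begin, end, include)
    return pages


# Hypothetical two-page site; a real fetch function would issue HTTP requests.
site = {
    "http://example.test/results?p=1":
        "<li>a</li><a href=http://example.test/results?p=2>next</a>",
    "http://example.test/results?p=2": "<li>b</li>",
}
pages = crawl(site.__getitem__, "http://example.test/results?p=1",
              "http://example.test", ">", "LEFT")
```

The seen set and the max_pages cap are safeguards added for the sketch, so that a page linking back to itself cannot loop forever.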

Create WEBVIEW that Captures Hierarchical Data Sets (e.g., XML)


In order to be sufficient to the data-collecting task, the CREATE WEBVIEW statement needs to
be able to retrieve hierarchical data from a Web page. The code below shows the XML used for
instruction in an XML and B2B class at a midwestern university.

<rentals>
    <rental custnum="12345" name="Joe Teacher">
        <movie name="Fast and Furious" due="2002-03-04"/>
        <movie name="Scoobie Doo and the Witches Ghost" due="2002-03-06"/>
    </rental>
    <rental name="Joe Student">
        <movie name="Slapshot" due="2002-03-04"/>
        <movie name="Blair Witch" due="2002-03-02"/>
    </rental>
</rentals>

The following code shows how we can use the CREATE WEBVIEW AgentSQL
statement to retrieve the results of XML similar to that shown in the code above.

CREATE WEBVIEW movie
    USING ("http://www.nd.edu/movie.xml")
    ROW ("<rental ", "</rental>")
    COLUMN
        CustNum    INT     ("custnum=\"", "\""),
        CustName   VARCHAR ("name=\"", "\"")
    NESTED ROW ("<movie", "/>")
    COLUMN
        MovieName  VARCHAR ("name=\"", "\""),
        Due        DATE    ("due=\"", "\"");

The above code shows how the hierarchical nature of XML can be captured in a
relational format by using the CREATE WEBVIEW statement with a NESTED clause. Notice
that, in the XML above, Joe Student does not have a customer number. The AgentSQL tool sets
this field to NULL.
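
The flattening that a NESTED clause performs can be sketched in Python using the rental XML and the delimiters from the CREATE WEBVIEW movie statement. This is an illustration only, not the AgentSQL implementation: it assumes plain substring matching, reads the parent columns from the text that precedes the first nested row, and uses None to stand for NULL. The helper names (spans, field, flatten) are hypothetical.

```python
from typing import Dict, Iterator, List, Optional

XML = """<rentals>
<rental custnum="12345" name="Joe Teacher">
<movie name="Fast and Furious" due="2002-03-04"/>
<movie name="Scoobie Doo and the Witches Ghost" due="2002-03-06"/>
</rental>
<rental name="Joe Student">
<movie name="Slapshot" due="2002-03-04"/>
<movie name="Blair Witch" due="2002-03-02"/>
</rental>
</rentals>"""


def spans(text: str, begin: str, end: str) -> Iterator[str]:
    """Yield the text between each successive begin/end delimiter pair."""
    pos = 0
    while True:
        i = text.find(begin, pos)
        if i < 0:
            return
        i += len(begin)
        j = text.find(end, i)
        if j < 0:
            return
        yield text[i:j]
        pos = j + len(end)


def field(text: str, begin: str, end: str) -> Optional[str]:
    """Return the first delimited value, or None (NULL) when it is absent."""
    return next(spans(text, begin, end), None)


def flatten(xml: str) -> List[Dict[str, object]]:
    """Emit one relational row per movie, repeating the parent rental's columns."""
    table: List[Dict[str, object]] = []
    for rental in spans(xml, "<rental ", "</rental>"):
        # Assumption: parent columns come from the text before the first nested row.
        head = rental.split("<movie", 1)[0]
        cust_num = field(head, 'custnum="', '"')
        parent = {
            "CustNum": int(cust_num) if cust_num is not None else None,
            "CustName": field(head, 'name="', '"'),
        }
        for movie in spans(rental, "<movie", "/>"):
            table.append({
                **parent,
                "MovieName": field(movie, 'name="', '"'),
                "Due": field(movie, 'due="', '"'),
            })
    return table


rental_rows = flatten(XML)
```

Each movie becomes one row that carries its rental's columns, so the two rentals yield four rows, and the rows for Joe Student carry None in CustNum.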

WEBVIEWS created via Joins to Database Tables


For some data retrievals, complex behavior is required to reach the proper page. The following
code and relational tables (Figure 4) show how the URLs of some pages can be numbered from 1
to 31, indicating the day they were developed, and can also contain categories that may exist in
a database. We combine a SELECT statement inside the USING clause, which retrieves a list of
categories from a database, with the iteration ability of the VARYING clause and the
recursive nature of the LINK clause, leading to a very powerful routine. The code below
retrieved four categories from a database and used them to build a dataset containing
18,086 auctions from over 439 Web pages in 5 minutes on a high-speed line.3

CREATE WEBVIEW auct
    USING (SELECT
        'http://cayman.ebay.com/aw/listings/completed/category'+CatID+'/day'+daynum+'page1.html'
        FROM category)
    VARYING daynum TO 31 FROM 1 BY 1
    REPLACE ("<td align=center width=\"6%\">-</td>", "<td align=center width=\"6%\">0</td>")
    TRIM ("<strong>Item", "completed/day")
    LINK INCLUDE HOST ("]</a> &nbsp;&nbsp;<a href=\"", "\"")
    ROW ("eBayISAPI.dll?", "</tr>")
    COLUMN
        AuctionID     VARCHAR ("ViewItem&item=", "&"),
        ItemText      VARCHAR (">", "</a>"),
        Pix           EXISTS  ("pic.gif"),
        URL           URL,
        SellingPrice  NUMBER  ("<b>$", "<"),
        Bids          NUMBER  ("<td align=center width=\"6%\">", "<");

Figure 4. Relational Mapping Created

3 WEBVIEW joins to other WEBVIEWs were also tested. Since a WEBVIEW mimics a read-only table,
these joins were successful.
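
The way a SELECT inside USING combines with a VARYING loop to produce starting URLs can be sketched as a cross product. This Python sketch is an assumption about the expansion, not the AgentSQL implementation: sqlite3 stands in for the corporate database, the four category IDs are invented, and the function name expand_urls is hypothetical. The URL pattern is copied from the auct statement, and nothing is fetched.

```python
import sqlite3
from itertools import product
from typing import List

conn = sqlite3.connect(":memory:")  # stand-in for the corporate database
conn.execute("CREATE TABLE category (CatID TEXT)")
# Hypothetical category IDs, invented for this illustration.
conn.executemany("INSERT INTO category VALUES (?)",
                 [("101",), ("202",), ("303",), ("404",)])


def expand_urls(db: sqlite3.Connection, start: int = 1, finish: int = 31,
                increment: int = 1) -> List[str]:
    """Pair every category returned by the SELECT with every VARYING value."""
    cat_ids = [row[0] for row in db.execute("SELECT CatID FROM category ORDER BY CatID")]
    days = range(start, finish + 1, increment)  # VARYING daynum FROM 1 BY 1 TO 31
    return [
        "http://cayman.ebay.com/aw/listings/completed/category"
        + cat_id + "/day" + str(day) + "page1.html"
        for cat_id, day in product(cat_ids, days)
    ]


urls = expand_urls(conn)
```

With four categories and 31 days, the expansion yields 124 starting pages; the LINK clause would then reach any further pages from each starting page.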

CONCLUSION
In this research, we introduce the Webview, an SQL language extension that can collect
external Web data and disseminate it to a corporate database based on the varied information
needs of the organization. The tool and the SQL extension allow us to manipulate the data from
Web pages and to download large amounts of data from a large number of Web pages
(see Figure 4). Because the derived data is not explicitly stored, it is not static: up-to-date
information is made available when the query is made. Also, because the data is not stored in
corporate databases in various formats, redundancy and duplication of data are avoided. Tools
developed using this extension have the potential to impact corporate competitive strategies,
supplier and client relations, and corporate research. For researchers, this language and tool
can allow the building of relatively cost-free databases of actual transaction, economic, and
market data that exist on the Web.

REFERENCES
Chung, H. M., Gray, P., Summer 1999, Special Section: Data Mining, Journal of Management
Information Systems 16 (1), 11.
Codd, E.F., 1972, Further normalization of the data base relational model. Data Base Systems.
(New York) Prentice-Hall, Englewood Cliffs. N.J., 1972, pp. 33-64.
Davenport, T. H., Prusak, L., 1998, Working Knowledge: How Organizations Manage What they
Know Harvard Business Press (Cambridge, MA).
Deutsch, A., Fernandez, M., Florescu, D., Levy, A., Suciu, D., May 17, 1999, A query language
for XML, Computer Networks 31 (11), 1155-1169.
Horwitch, M., Armacost, R., May/Jun 2002, Helping Knowledge Management Be All It Can
Be, The Journal of Business Strategy 23 (3), 26-31.
Lakshmanan, L. V. S., Sadri, F., Subramanian, S. N., 2001, SchemaSQL: An extension to SQL
for multidatabase interoperability. ACM Transactions on Database Systems 26(4), 476-519

Kauffman, R. J., March, S. T., Wood, C. A., December 2000, "Mapping Out Design Aspects for
Data-Collecting Agents," International Journal of Intelligent Systems in Accounting, Finance,
and Management, 9 (4), 217-236.
Krishnan, R., Li, X., Steier, D, Zhao, L., September 2001, On Heterogeneous Database
Retrieval: A Cognitively-guided Approach, Information Systems Research 12 (3), 286-303.
Mobasher, B., Cooley, R., Srivastava, J., August 2000, Automatic Personalization Based on
Web Usage Mining, Communications of the ACM 43 (8), 142-151.
Nonaka, I., February 1994, Dynamic Theory of Organizational Knowledge Creation,
Organization Science 5(1), 14-37.
Rigby, D., 2001, 2001: Management Tools: Annual Survey of Senior Executives, available at
http://www.bain.com/bainweb/expertise/tools/overview.asp.
