
Mini Project Documentation

Web Wise Document System


Gayatri Vidya Parishad College Of Engineering



This report on “WEB WISE DOCUMENT SYSTEM (WWDS)” is a bona fide record

of the mini-project work submitted by

S.S.V. Kaushik (Reg No:06131A0579)

P. Santosh Varma (Reg No:06131A0563)

Bh.S. Ramaraju (Reg No:06131A0577)

in their sixth semester of

Bachelor of Technology


Computer Science and Engineering

During the academic year 2009-2010

Guide Observer

Head of the Department

Candidate’s Declaration

We hereby declare that the work presented in this project titled “Web Wise Document System (WWDS)”, submitted towards completion of the mini-project in the sixth semester of B. Tech (CSE) at the Gayatri Vidya Parishad College of Engineering (GVPCOE), Visakhapatnam, is an authentic record of our original work pursued under the guidance of Prof. David Wayne Clay and Prof. Krishna Subba Rao, GVPCOE, Visakhapatnam.

We have not submitted the matter embodied in this project for the award of any other degree.


S.S.V. Kaushik

P. Santosh Varma

Bh.S. Ramaraju

Place: Visakhapatnam

Date: 21-12-2009



First and foremost, we would like to express our sincere gratitude to our mini-project guides, Prof. David Wayne Clay and Prof. Krishna Subba Rao. We were privileged to receive their sustained, enthusiastic, and involved interest. This fueled our own enthusiasm and encouraged us to step boldly into what was, for us, completely unexplored territory.



With the Internet growing day by day in both number of users and volume of content, the traditional stand-alone approach is nearing its end. There is a need for an approach that integrates the World Wide Web with stand-alone systems. Web Wise Files constitute one such approach.

The idea of Web Wise Files comes from the basic principle of a distributed operating system, in which changes made to files on one terminal should be reflected on every terminal that is part of the system. That is, the whole system is a virtual system whose parts sit at different geographical locations.

Problem Statement

“A user regularly browses the Internet to keep up with changes in the stock market, the day's weather, the score of an ongoing cricket match, the latest technological advances, and so on. In addition, he/she may listen to the latest audio, watch videos, or use the services available on the Internet.”

To do all this, the user has to spend a lot of time surfing different sections of the World Wide Web for the corresponding information. An approach that does all of this quickly, in a systematic and customizable way, and relieves users of the strain would be a boon to all users. We use Web Wise Files to solve this problem.


Web Wise Files are files whose contents change in step with the World Wide Web, yet they exist physically on the very terminal the user works on. They are user-defined files comprising data (from the WWW) of the user's choice, in the format specified by the user.


Web Wise Document (WWD)

The actual document displaying all the content blocks in user-defined format.

Web Wise Document Definition (WWDD)

It contains the information about all the content blocks involved.

Web Wise Document System (WWDS)

The System which actually contains all the web wise documents and their


Web Wise Document Template (WWDT)

This refers to the XML file containing the WWDD.


Working of a Web Wise Document

The system allows a user to create, edit, and view a local document with embedded web content. A web wise document consists of a layout definition section and a set of content block definitions. The layout definition section indicates the placement of the content blocks in the viewed document. These definitions are expressed using XML.

Web Wise Document System Layout

A document can have one of two layouts:

1. Column layout, which defines a linear sequence of content blocks

2. Row-wise layout, which defines a rectangular array of content blocks

Content Block

Content block definitions indicate the method of locating the content and any details needed to access it. These definitions are expressed using XML. Each content block definition contains:

1. content block title

2. content block type

3. content block parameters which depend on block type
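The three parts of a content block definition can be pictured as a small in-memory structure. The following is only a sketch: the class name `ContentBlock`, its field names, the parameter key `BlogURL` used as a dictionary key, and the example URL are assumptions made for illustration, not the project's actual code.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical in-memory form of one content block definition.
Public Class ContentBlock
    Public Title As String
    Public BlockType As String
    ' Type-specific parameters, keyed by parameter name (assumed representation).
    Public Params As New Dictionary(Of String, String)()
End Class

Module ContentBlockSketch
    Sub Main()
        Dim block As New ContentBlock() With {.Title = "My Blog", .BlockType = "Blog Post"}
        ' Illustrative parameter; the URL is a placeholder.
        block.Params("BlogURL") = "http://example.com/blog"

        Console.WriteLine(block.Title & " (" & block.BlockType & ")")
        For Each pair As KeyValuePair(Of String, String) In block.Params
            Console.WriteLine("  " & pair.Key & " = " & pair.Value)
        Next
    End Sub
End Module
```

Keeping the parameters in a name/value map reflects the statement that they depend on the block type: each type can carry a different set without changing the structure.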

Content Block Types

Content block types include the following :

1. Local text

2. Remote text using ftp

3. Web page content

4. Blog post

5. Web service

6. Twitter post

7. RSS Feed


The features provided by the WWDS include the following:

− Creating/editing/modifying contents of the WWDT specific to a

− Dynamically retrieving a section of local or remote text files.

− Dynamically retrieving a section or part of a web site.

− Dynamically retrieving recent posts/articles in a blog.

− Dynamically retrieving content from RSS feeds.

− Providing GUIs to access web services.

− Retrieving data from a social networking site (e.g., Twitter).

− Showing the retrieved data in the user's desired format/layout.

User's Role in a WWDS

A user of the web wise document system should be able to do the following:

1) Create/edit a WWDD.

2) Customize the layout, type, and number of content blocks to be displayed in the WWD.

3) View the WWD updated to that instant.

System's Role in a WWDS

The system, on the other hand, should be able to do the following:

− Store the WWDD created/edited by the user in an understandable format such as XML.

− Understand the WWDD and accordingly create the WWD by retrieving the appropriate content blocks and presenting them in the user's desired layout.


The system contains two main modules:

− Creator

− Viewer


The Creator has the following properties:

− The Creator is responsible for generating the XML file for the WWD (Web Wise Document).

− The Creator provides a visual interface for the user to customize the layout and content definitions of each WWD.

− The Creator expresses the layout and content definitions using XML.

− In addition, the Creator may provide options for customizing the appearance of the content sections (such as color).


The Viewer has at least the following properties:

− The Viewer is responsible for retrieving the content and showing it to the user in the appropriate layout, using the XML file generated by the Creator.

− The Viewer retrieves data dynamically when the user opens the WWD.

− The Viewer provides options for storing the WWD.

Content Block Title

Each Content Block will be assigned an id (normally generated sequentially) by the


Content Block title normally represents the following:

1. File name (for local or remote text).

2. URL (for web content, blogs, etc.).

3. Name of the service (for web service).

Content Block Parameters

Content Block parameters may include one or more of the following to uniquely

identify it in the source:

1. Page ID or Section ID

2. Id of the HTML tag in case of web content.

3. The absolute position of the content section in the entire page or document.

4. The relative position of the content section with respect to another fixed

section in the same page or document.

5. The IP address of the source (and optional log in credentials) in case of

remote text.

6. The absolute path of file in case of local or remote text.

7. Heading of the content section.

8. Date and time constraints in case of blog posts etc.

9. Log in credentials for restricted web content.

10. Any other parameters not mentioned above.

The user can define Custom Layout by defining the absolute positions and

orientations for each content block.


The requirements and specifications for this application include both software requirements (SRS) and hardware requirements.

SRS (Software Requirement Specification)

Developing the application requires an operating system and Microsoft Visual Studio with the .NET Framework installed, as we implemented it in VB.NET.

Operating Systems

Windows 2000, Windows XP, Windows Vista

Windows Server 2003 or later

Microsoft's VB .NET Package

Microsoft Visual Studio or later

(Visual Basic 6.0 or later)

Hardware Requirements

1. RAM: 256 MB

2. Processor: Pentium Class II Processors

3. Video: 800x600, 256 Colors

System Requirements

Functional Requirements

Ø These are the requirements given to the system during Requirements Phase of

the Software Development.

Ø The user should be able to choose from a variety of content sources available in

the internet.

Ø The user should be able to able to view all the content blocks simultaneously.

Ø The user should be able to modify the content blocks and their definitions at his


Ø The system should minimize the overall time spent by the user in surfing and

browsing the content blocks individually.

Ø The user should be able to view the document in the desired layout.

Ø The system should dynamically obtain the contents of the document from the

corresponding sources.

Non Functional Requirements

Ø The interface should be a GUI.

Ø The user should be able to use the GUI with minimum guidance. That is, the

interface should be self-understandable.

Ø The GUI must provide an easy way to create, edit and delete contents of the


Ø The system should consume minmum hardware resources.


Core Features of .NET

Though the application could be developed in either Java or .NET, we preferred .NET because of the following advantages:

1. Comprehensive interoperability with existing code

2. Integration among .NET programming languages

3. A common run time engine shared by all .NET-aware languages

4. A comprehensive base class library

5. No more COM (Component Object Model) plumbing

6. A truly simplified deployment model

Although .NET offers several technologies, such as ASP.NET and VB.NET, we chose VB.NET because our project is a desktop-based web application. ASP.NET is mostly preferred for pure web applications. As our project involves both desktop and web functionality, VB.NET serves it better.

Challenges Faced

During the initial stages of designing the project, we faced several challenges:

− Do we need to select a few predefined websites?

− Do we need to allow the user to select only from a set of predefined sections corresponding to each of the predefined websites?

− General web content is too variable to handle. We studied the BBC and CNN sites and found it very difficult to locate the required content from just a heading and a URL unless the targets are predefined separately beforehand.

− We can offer only well-defined web content. An example is a website where only one heading block appears and it is contained in a specific table cell. When the user selects web content, the system offers a list of only known, parsable content.

− The general content problem amounts to a regular-expression solution applied to a target that is not always regular.

− At the end of the editor design, a question arose about the Document Viewer (Content Retriever): how exactly do we identify web content? How do we uniquely identify a content block in its source page?

Ø For instance we have a (some site) and we need to

fetch a content block with "Heading" as heading. The problem is

there might be several blocks in the source page with the same



1. How exactly do we identify the correct content block?

2. Even if we identify it, how exactly do we determine its boundaries?

That is, what comes into the content block and what doesn't?

To solve the above problems, we thought of including

HTML tag name and id or even the absolute position of the content block rectangle in

the web page.

But this created more problems like:

1. How many users will know about the actual coding details of the content


2. And there can be many ways to uniquely identify the content block

3. Id, name and absolute position are only one way. So, trying to provide

all such options is not feasible.

But without knowing these details we can not fetch the required content block in a

deterministic manner. So now we are in a dilemma of whether to think of user

friendliness or code complexity.

UML Diagrams

UML diagrams are basic modeling diagrams used to describe the architecture of the software product being developed.

We are concerned with three major UML diagrams:

1. Use Case

2. Class

3. Sequence

Use Case Diagrams

[Figure: use case diagram for the Web Wise Document System. Actor: User. Use cases: Local and Remote Text, Web Page Content, Web Services, Blog Post.]


The above scenario shows the interaction of the system with two actors:

i) User

ii) XML file

The user interacts with the system to select any of the content types mentioned. For each content type the user selects, the corresponding changes are written to the XML file.

The following scenario shows another type of interaction between the user and the system. Here the user can perform a variety of functions to create/edit/view a WWDT. The system uses the WWDT to retrieve the required content blocks dynamically from the Internet.

[Figure: use case diagram for the Web Wise Document System. Actor: User. Use cases: Create a WWDT, Edit/Modify an existing WWDT, View a WWD, Retrieve contents from the Internet.]

Class Diagram

The above class diagram shows the interaction of a total of 9 classes involved in

WWDS. The classes are basically categorized into two packages based on their


i) Windows Application1 – Creator Module

ii) Windows Application2 – Retriever/Viewer Module

Windows Application1

This package comprises 8 classes that together implement the requirements/features of the creator/editor module.

The 8 classes involved are:

i) Form1 – Responsible for layout selection and for content block creation, modification, and deletion in the WWDT.

ii) Form2 – Contains functions for defining/modifying the title and type of the selected content block in the WWDT.

iii) Form3 – Contains functions that define, store, and edit the properties of Local/Remote Text, that is, the information required to retrieve a Local/Remote Text.

iv) Form4 – Contains functions that define, store, and edit the properties of Web Page Content, that is, the information required to retrieve a portion of a web page.

v) Form5 – Contains functions that define, store, and edit the properties of a Web Service, that is, the information required to select a web service and provide a GUI to it.

vi) Form6 – Contains functions that define, store, and edit the properties of an RSS Feed, that is, the information required to retrieve information from an RSS feed.

vii) Form7 – Contains functions that define, store, and edit the properties of a Twitter Post, that is, the information required to retrieve a recent Twitter post.

viii) Form8 – Contains functions that define, store, and edit the properties of a Blog Post, that is, the information required to retrieve a recent post from a blog.

Windows Application2

This package comprises of a single class that implements the functions a


i) Form1 – Contains functions that help the system read and interpret the WWDT and accordingly retrieve the contents of the WWD. It also contains functions that help organize the contents into a user-defined layout.

Sequence Diagram

The above sequence diagram encloses a typical sequence of steps that are

followed by a user for creating/editing a WWDT and later viewing it using the


Steps 3 to 14 are asynchronous in the sense that each of them can occur at any time, and any number of times, as a particular user requires.

For instance, a user may want to access 3 web services but only a single Twitter post. In such a case the other steps are not necessary, and even these 4 content blocks can be defined in any order. That is, the order does not matter for retrieving information. The order does matter, however, if the user cares about the order in which the content blocks are finally displayed, because the viewer/retriever displays them in the same order as defined by the user.


We used Microsoft Visual Basic 2008 Professional Edition with the .NET Framework 3.5, which was the latest version at the time, is well suited to web applications, and offers more options for styles and functionality than previous versions.

For a WWD,

Steps taken to collect the input:

1. We designed the editor as a 3-level form.

2. The user fills in the forms to create, edit, and delete the definitions of the content blocks that appear in the WWD.

3. Data is exchanged between the forms by using an XML document for storage and retrieval.

4. By the end of creating/editing the WWD, the entire definition is stored in this XML file (it acts as a template for this WWD).

So, the input is taken and stored in an XML file.

Steps taken to retrieve the output for the given specific input:

1. We used the XML file to retrieve data from the source (web,local text


2. We used SOAP-like technology to access the web services through the built-in .NET libraries.

3. To retrieve data from the web, we used HTML and XML parsing, together with handling of nested HTML, to solve the problem of parsing HTML code.

Through this we also found a few useful ways to locate content embedded in nested HTML.

Web Wise Document Template Format:

XML is the language used to represent the WWDT. It typically represents the

following information:

i) information regarding the layout of the document.

ii) information regarding each content block to be retrieved.

WWDT XML Structure:

The <wwd> tag represents the root of the document. It essentially comprises:

i) a <layout> tag

ii) a number of <content> tags

<layout> tag:

The <layout> tag represents the layout of the given document. It encloses the name of

the layout used.

For example,





indicates that the layout of the document is a Column Layout. That is, all the content blocks are shown column by column. Similarly, Row Wise is used to represent the Row Layout.

<content> tag:

A content tag typically contains the following tags:

i) <title> tag that contains the title of the block

ii) <type> tag that contains the type of the content to be retrieved.

iii) <params> tag that contains information regarding parameters specific to the

content block.

For example,

<content>
  <params>
    <BlogURL> </BlogURL>
  </params>
  <title>My Blog</title>
  <type>Blog Post</type>
</content>

indicates that one of the content blocks to be retrieved is a Blog Post named My Blog, with the URL given in <BlogURL>.
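As a rough illustration of how a viewer might read such a template, the sketch below loads a small WWDT with the .NET `XmlDocument` class and lists its layout and content blocks. The layout value `Column`, the placeholder blog URL, and the module and function names are assumptions for this example; the project's real template may also carry attributes that are not shown here.

```vbnet
Imports System
Imports System.Xml

Module WwdtReaderSketch
    ' Returns the text of the <layout> element, or "" if it is missing.
    Function ReadLayout(ByVal doc As XmlDocument) As String
        Dim node As XmlNode = doc.SelectSingleNode("wwd/layout")
        If node Is Nothing Then Return ""
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' Assumed template: the layout value and blog URL are illustrative only.
        Dim template As String = _
            "<wwd>" & _
            "<layout>Column</layout>" & _
            "<content>" & _
            "<title>My Blog</title>" & _
            "<type>Blog Post</type>" & _
            "<params><BlogURL>http://example.com/blog</BlogURL></params>" & _
            "</content>" & _
            "</wwd>"

        Dim doc As New XmlDocument()
        doc.LoadXml(template)

        Console.WriteLine("Layout: " & ReadLayout(doc))

        ' Walk every <content> block and print its title, type and parameters.
        For Each content As XmlNode In doc.SelectNodes("wwd/content")
            Dim title As String = content.SelectSingleNode("title").InnerText
            Dim kind As String = content.SelectSingleNode("type").InnerText
            Console.WriteLine(title & " [" & kind & "]")
            For Each param As XmlNode In content.SelectSingleNode("params").ChildNodes
                Console.WriteLine("  " & param.Name & " = " & param.InnerText)
            Next
        Next
    End Sub
End Module
```

Because each block keeps its parameters under a single `<params>` element, the reader does not need to know the parameter names of every content type in advance.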

Content Types and Parameters

There are 6 types of content included in the project. Their properties are:

1. Local and remote text:

In order to retrieve a portion of the text from a file, we require the following:

− Absolute Path of the text file.

− Starting line number of the text file.

− Number of lines to be displayed.
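The three parameters above are enough to cut a slice out of a local text file. The following minimal sketch assumes 1-based line numbering and uses a hypothetical helper named `ReadLineRange`; remote (FTP) retrieval is not covered, and the temporary file exists only to make the example self-contained.

```vbnet
Imports System
Imports System.IO

Module LocalTextSketch
    ' Returns up to lineCount lines, beginning at the 1-based line number startLine.
    Function ReadLineRange(ByVal fileName As String, ByVal startLine As Integer, _
                           ByVal lineCount As Integer) As String()
        Dim allLines As String() = File.ReadAllLines(fileName)
        Dim first As Integer = Math.Max(startLine - 1, 0)
        ' Clamp the count so a request past the end of the file is not an error.
        Dim count As Integer = Math.Min(lineCount, allLines.Length - first)
        If count <= 0 Then Return New String() {}
        Dim result(count - 1) As String
        Array.Copy(allLines, first, result, 0, count)
        Return result
    End Function

    Sub Main()
        ' Create a small sample file so the sketch can run anywhere.
        Dim fileName As String = Path.GetTempFileName()
        File.WriteAllLines(fileName, New String() {"one", "two", "three", "four"})

        ' Show 2 lines starting at line 2.
        For Each textLine As String In ReadLineRange(fileName, 2, 2)
            Console.WriteLine(textLine)
        Next

        File.Delete(fileName)
    End Sub
End Module
```

Clamping the range rather than raising an error is a design choice for this sketch: a file that has shrunk since the block was defined then simply shows fewer lines.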

2. Web page content:

As the general web content problem is too variable and difficult to solve, we chose three predefined categories. For each category we chose two websites exhibiting good web design standards, and in each website we pre-selected the portions to be retrieved. The following are the categories and the corresponding websites chosen:


− Headlines

− Weather

− Sports


Some websites include:

− URL of BBC News

− URL of NDTV news etc.

We used the classes and ids of the HTML tags to uniquely identify well-structured blocks in a web page.

3. Blog post:

In order to retrieve a recent blog article, we need the following information about the blog:

− URL of the blog

Using this information, we retrieved the RSS feed listing the articles of that blog. The first feed entry obtained corresponds to the most recent article posted in the blog.
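Picking the most recent article from a feed can be sketched as follows. This assumes an RSS 2.0 document whose items are listed newest first, which is common but not guaranteed; the inline sample feed and the function name `LatestTitle` are made up for the example, and a real blog feed would be loaded from its URL instead.

```vbnet
Imports System
Imports System.Xml

Module BlogFeedSketch
    ' Returns the title of the first <item> in an RSS 2.0 document, or "" if there is none.
    Function LatestTitle(ByVal feed As XmlDocument) As String
        Dim item As XmlNode = feed.SelectSingleNode("/rss/channel/item")
        If item Is Nothing Then Return ""
        Dim title As XmlNode = item.SelectSingleNode("title")
        If title Is Nothing Then Return ""
        Return title.InnerText
    End Function

    Sub Main()
        Dim feed As New XmlDocument()
        ' Inline sample feed; a real feed could be loaded with feed.Load(feedUrl).
        feed.LoadXml( _
            "<rss version=""2.0""><channel><title>My Blog</title>" & _
            "<item><title>Newest post</title></item>" & _
            "<item><title>Older post</title></item>" & _
            "</channel></rss>")

        Console.WriteLine(LatestTitle(feed))
    End Sub
End Module
```

Feeds in other formats, such as Atom, use different element names and would need a different path.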

4. Twitter post:

In order to retrieve the status of a Twitter user, we require the following information:

− The Twitter ID of the user whose status has to be displayed.

Here too, we used the Twitter ID to obtain the RSS feed of that user's status updates.

5. Web services:

A large number of services are available on the Internet today. We chose a few important and frequently used web services for demonstration and provided easily accessible GUIs to them. Some of them include:

− Stock Quote (gives the current stock details for a company)

− Currency converter (converts money from one currency system to another)

− Global Weather (gives the current weather information for a given location)

− Send SMS World (sends free SMS to any cell phone in India)

6. RSS feeds:

The information required to get data from an RSS feed includes:

− URL of the RSS feed

The URL of the RSS Feed corresponds to the URL of the WSDL (Web Service Description Language) corresponding to a specific feed. This WSDL contains information about the functions defined, the parameters to be passed, and the output expected for each function. Using the built-in functionality for recognizing the functions provided by a web service, given its WSDL, we implemented GUIs to the functions we felt were most necessary.

HTML & XML Parsing

There is no universal parser that can parse HTML and retrieve the required information from any HTML page. Although a few HTML parsers such as MSHTML are available, they provide only a partial solution. This is because HTML syntax is not strictly enforced, so page authors use a variety of non-standard methods when designing a web page. Often these involve tags that are highly unstructured and syntactically incomplete. Part of this non-standard nature of websites can be attributed to modern browsers, which accept and render pages containing many syntactical errors without complaint. Thus, HTML parsing is a non-deterministic problem that can be solved only through standardization.

Hence, our HTML parsing is done using our own parsing routines with the help of the MSHTML parser. But this kind of page-specific parsing is very limited in its approach and is highly susceptible to errors the moment the corresponding website designers decide to change the structure of the page.

In contrast, the universally standard structure of an XML document makes it easy to write an XML parser. There are a number of XML parsers available on the Internet, and one could write one's own given enough time. Typically there are two types of XML parsers:

i) a SAX parser

ii) a DOM parser

We chose a DOM parser because of the ease and efficiency with which one can create, edit, access, and delete any node and its corresponding information. We used the Microsoft XML DOM parser provided by the .NET package, as it suits our purpose very well.
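The create/edit/access/delete operations mentioned above map directly onto the .NET DOM API. The sketch below applies them to a tiny WWDT-like document; the helper name `AddContent`, the layout value `Column`, and the block titles are assumptions for illustration only.

```vbnet
Imports System
Imports System.Xml

Module DomEditSketch
    ' Appends a <content> element with <title> and <type> children to the root element.
    Function AddContent(ByVal doc As XmlDocument, ByVal title As String, _
                        ByVal kind As String) As XmlElement
        Dim content As XmlElement = doc.CreateElement("content")
        Dim titleNode As XmlElement = doc.CreateElement("title")
        titleNode.InnerText = title
        Dim typeNode As XmlElement = doc.CreateElement("type")
        typeNode.InnerText = kind
        content.AppendChild(titleNode)
        content.AppendChild(typeNode)
        doc.DocumentElement.AppendChild(content)
        Return content
    End Function

    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<wwd><layout>Column</layout></wwd>")

        ' Create: add two content blocks.
        Dim blog As XmlElement = AddContent(doc, "My Blog", "Blog Post")
        Dim feed As XmlElement = AddContent(doc, "News", "RSS Feed")

        ' Edit: change the title of the first block.
        blog.SelectSingleNode("title").InnerText = "Team Blog"

        ' Delete: remove the second block.
        doc.DocumentElement.RemoveChild(feed)

        ' Access: count the remaining blocks and print the document.
        Console.WriteLine(doc.SelectNodes("wwd/content").Count)
        Console.WriteLine(doc.OuterXml)
    End Sub
End Module
```

A DOM parser keeps the whole document in memory, which is what makes this kind of in-place editing straightforward; a SAX parser reads the document as a forward-only stream of events and does not offer it.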

Microsoft's Visual Studio

The interface is simple and easy to use. Even a layman can get a feel for .NET by exploring the controls in the toolbox.

Solution Explorer in Visual Studio 2008

Selection of windows forms application in Visual Studio 2008

Sample Code

For instance, a small part of the code used to retrieve the content specified in the XML file is:

Private Sub WebBrowser1_DocumentCompleted(ByVal sender As System.Object, _
        ByVal e As System.Windows.Forms.WebBrowserDocumentCompletedEventArgs)
    ' Start of Sub procedure
    TextBox1.Text = e.Url.ToString()

    ' Load the XML file that holds the input definitions
    Dim xdoc As XmlDocument = New XmlDocument()
    xdoc.Load("C:\Documents and Settings\SantosH\My Documents\Google Talk …")

    ' Find the content block whose web page matches the URL that finished loading
    Dim node As XmlNode = _
        xdoc.SelectSingleNode("wwd/content/type[@webpage='" + e.Url.ToString() + "']")
    If node Is Nothing Then Return

    Dim id As Integer = CType(node.Attributes.ItemOf("id").Value, Integer)
    Dim url As String = node.Attributes.ItemOf("webpage").Value
    ' TextBox1.Text += id.ToString()

    Dim doc As HtmlDocument = disp(id).Document
    Dim ele As HtmlElement = Nothing
    Dim ele1 As HtmlElement = Nothing

    ' Each branch handles one predefined site (the URL literals are empty in this listing)
    If (url = "") Then
        ele = doc.GetElementById("tickerHolder").NextSibling.NextSibling
    ElseIf (url = "") Then
        ele = doc.GetElementById("advert_8").Parent.NextSibling.NextSibling
    ElseIf (url = "") Then
        ele = doc.GetElementById("box315")
        ele1 = doc.GetElementById("divtopstlatest")
    ElseIf (url = "") Then
        ele = doc.GetElementById("a")
    ElseIf (url = "") Then
        ele = doc.GetElementById("portlet_878")
    End If
    If ele Is Nothing Then Return

    TextBox1.Text += ele.InnerHtml

    ' Write the extracted block to a local HTML file named after the block id
    Dim filew As StreamWriter = New StreamWriter(".\" + id.ToString() + ".html")
    If ele1 IsNot Nothing Then
        ' Sites with a second block: write both blocks
        filew.Write("<html>" + "<head><base href='" + e.Url.ToString() + "' /></head>" + _
                    "<body>" + ele.InnerHtml + "<br />" + ele1.InnerHtml + "</body></html>")
    Else
        filew.Write("<html>" + "<head><base href='" + e.Url.ToString() + "' /></head>" + _
                    "<body>" + ele.InnerHtml + "</body></html>")
    End If
    filew.Close()

    disp(id).ScriptErrorsSuppressed = True

    ' The last statement is truncated in this listing:
    ' …n2/WindowsApplication2/bin/Debug" + id.ToString() + ".html")

    ' End of Sub procedure
End Sub

Screen shots

Basic Editor window form without any input provided

Editor window form to select the layout using drop down menu

Selecting column layout and clicking “New” button

Content Block form appearing on clicking “New”

Giving a title and specifying the type of content

On selecting a type, “Properties” button will be highlighted

On clicking “Properties” button in the content block form

selection of file name and lines to be displayed

Clicking OK in the properties form

Clicking OK in the content block form creates it

On selecting a content block, “edit” and “delete” buttons will be highlighted

Clicking NEW for new content block and repeating the same procedure

Properties window form for web page content

Selecting any one of the predefined web sites for the category

After creation of the 2 content blocks

Clicking NEW for new content block on web services

Different Types of web services

After creating 3 content blocks

Same Procedure repeated for Twitter Posts

Same Procedure repeated for RSS Feeds

Entering the URL for RSS Feeds

Click OK after adding all the content blocks

Editing a content block

Deleting a content block

XML file: the inputs are parsed and stored by the system

XML file: the inputs are parsed and stored by the system

Output Forms

Output for local text, headlines and RSS feeds

Output for weather (web service), Twitter and blog posts


Rather than testing late in the traditional manner, we tested from the initial stages of the software development life cycle, at different levels at different stages of development.

In the initial stages of coding, unit testing was performed. That is, each form designed was tested for robustness, consistency, and scalability. Each bug was corrected then and there, increasing the reliability of the code.

In the later stages of development, integration testing was performed to identify the new bugs that creep in when units are integrated. These bugs were identified and rectified.

The majority of the paths in the control flow graph were followed to identify bugs, and nearly C1+C2 coverage was reached. Each bug identified led to modifications in the code of the individual units involved, after which integration testing was performed again to catch new bugs that might have crept in due to the modifications. This testing was performed rigorously and repeatedly to eliminate all major bugs.

A good amount of system testing was also done to identify the most frequent bugs, and the corresponding corrections were made. Testing was eased to some extent for features like web services, which use their own exception handling mechanisms on the servers where they are implemented.

Boundary testing was also performed to check the correctness of the code, and VB.NET's built-in exception handling is used to catch any unidentified bugs.

Future Scope

In future there can be many extensions to this application. Some of them include:

– Improving the Performance of the System by including a Cache



– Providing options for multiple WWDs to be created, saved and

retrieved in the file system.

– Including an automated scheduler to retrieve the updated content


– Developing a generalized HTML parser to retrieve content from the Web.

− Providing access to more Web Services.

− Expanding this application for dynamically changing web site


− Including all web sites that can be accessed.

− Improving the look and feel and including a variety of Visual



With this approach, the user can access a Web Wise File like any other file on disk, apart from the extra delay it takes to update itself. This approach would be faster and easier than surfing the Internet manually.

This concept of Web Wise Files could have a profound effect on cloud computing and other areas, where it can be extended to devices other than computers. As mentioned in the introduction, this concept can restore the stand-alone feel people had in the days before the Internet. The results mentioned above are universally applicable to users of almost all


