
Multi-Document Extractive Summarization for News

Chapter 1

Currently, the World Wide Web is the largest source of information. A huge amount of data is present on the Web, and more is added constantly. Users search for the information they require by using particular keywords. In this project we deal specifically with news. As the amount of news available on the Web grows, so does the need to provide high-quality summaries that allow the user to quickly locate the desired information, i.e. to get a summary of news from a variety of newspapers about the same topic, as specified by the query.

Summarization is the process of condensing a source text into a shorter version while preserving its information content. It can serve several goals, from survey analysis of a scientific field to quick indicative notes on the general topic of a text. In other words, summarization is the process of automatically creating a compressed version of a given text that provides useful information for the user. The information content of a summary depends on the user's needs. Topic-oriented summaries focus on a user's topic of interest and extract the information in the text that is related to the specified topic. Indicative summaries, which can be used to quickly decide whether a text is worth reading, are naturally easier to produce.

Query-oriented summarization (QS) tries to extract a summary for a given query. It is a common task in many text mining applications. For example, a user submits a query to a search engine and the search engine usually returns a large number of result documents. To click and view each of the returned documents is tedious and, in many cases, infeasible. One challenging issue is how to help the user digest the returned documents. Typically, the documents discuss different perspectives of the query. An ideal solution might be for the system to automatically generate a concise and informative summary for each perspective of the query. Much work has been done on document summarization.

Generally, document summarization can be classified into three categories:
1. Single document summarization (SDS)
2. Multi-document summarization (MDS)
3. Query-oriented summarization (QS)

SDS extracts a summary from a single document, while MDS extracts a summary from multiple documents. The two tasks have been intensively investigated and many methods have been proposed. The methods for document summarization can be further categorized into two groups: unsupervised and supervised. The unsupervised method is mainly based on scoring sentences in the documents by combining a set of predefined features. In the supervised method, summarization is treated as a classification or sequential labeling problem, and the task is formalized as identifying whether a sentence should be included in the summary or not. However, this method requires training examples.

Query-oriented summarization (QS) is different from the SDS and MDS tasks. In QS, the document cluster denotes the information source and the query denotes the information need. A document cluster is a subset of the entire document collection. A compelling application of document summarization is the snippets generated by Web search engines for each query result, which assist users in further exploring individual results.

The Information Retrieval (IR) community has largely viewed text documents as linear sequences of words for the purpose of summarization. Although this model has proven quite successful in efficiently answering keyword queries, it is clearly not optimal, since it ignores the inherent structure in documents. Furthermore, most summarization techniques are query-independent and follow one of two extreme approaches:
1. They simply extract relevant passages, viewing the document as an unstructured set of passages.
2. They employ Natural Language Processing techniques.

The former approach ignores the structural information of documents, while the latter is too expensive for large datasets (e.g., the Web) and is sensitive to the writing style of the documents. Here we discuss a method that adds structure, in the form of a graph, to text documents in order to allow effective query-specific summarization. That is, a document is viewed as a set of interconnected text fragments.
The main focus is on keyword queries, since keyword search is the most popular information discovery method on documents because of its power and ease of use. This technique has the following key steps. First, at the preprocessing stage, a structure is added to every document, which can then be viewed as a labeled, weighted graph called the document graph. Then, at query time, given a set of keywords, a keyword proximity search is performed on the document graphs to discover how the keywords are associated in them. For each document, its summary is the minimum spanning tree on the corresponding document graph that contains all the keywords (or their equivalents based on a thesaurus). The data from the nodes of this minimum spanning tree is collected and presented as the summary of the document.

Automatic summarization is the creation, by a computer program, of a shortened version of a text that contains the important information of the original documents.

1.1 History:
1. 1950s: First systems. Surface-level approaches: term frequency (Luhn, Rath).
2. 1960s: First entity-level approaches: syntactic analysis. Surface level: location features (Edmundson 1969).
3. 1970s: Surface level: cue phrases (Pollock and Zamora). Entity level. First discourse-level approaches: story grammars.
4. 1980s: Entity level (AI): use of scripts, logic and production rules, semantic networks (Dejong 1982, Fum et al. 1985). Hybrid (Aretoulaki 1994).
5. From the 1990s: explosion of all of these approaches.

1.2 Literature survey:

1.2.1 Aim:
Our aim is to achieve multi-document news summarization.
1. We parse the HTML document(s) and extract the text from them. As we are dealing with text only, we have chosen the nearest neighbor algorithm for clustering, because it is less complex and sufficient for text.
2. We produce an extractive summary according to the query specification.

1.2.2 Extractive and Abstractive Summarization
Extractive summarization: Produces a summary by selecting indicative sentences, passages or paragraphs from an original document according to a predefined target summarization ratio.

Abstractive summarization: Provides a fluent and concise abstract of a certain length that reflects the key concepts of the document. This requires highly sophisticated techniques, including semantic representation and inference, as well as natural language generation. In recent years, researchers have tended to focus on extractive summarization.

1.2.3 A System for Query-Specific Document Summarization
Ramakrishna Varadarajan and Vangelis Hristidis presented a method to create query-specific summaries by identifying the most query-relevant fragments and combining them using the semantic associations within the document. In particular, structure is added to the documents in the preprocessing stage, converting them to document graphs. Then, the best summaries are computed by calculating the top spanning trees on the document graphs. The paper presents and experimentally evaluates efficient algorithms that support computing summaries in interactive time. Furthermore, the quality of the summarization method is compared to current approaches using a user survey.

In that work, a structure-based technique is presented to create query-specific summaries for text documents. In particular, the document graph of a document is created to represent the hidden semantic structure of the document, and a keyword proximity search is then performed on this graph. The paper shows, with a user survey, that the approach performs better than other state-of-the-art approaches. The feasibility of the approach is also shown with a performance evaluation. In that approach the document graph was built and processing was done on text documents. We implement a somewhat similar methodology, but in addition an HTML to text parser is added, i.e. we process HTML files.
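The HTML to text parsing step mentioned above can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the project's own parser; the class name TextExtractor, the function html_to_text_lines and the choice of which tags count as block-level are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from HTML, one entry per block-level element."""

    SKIP_TAGS = {"script", "style"}  # content never shown to the reader
    # Assumed set of tags that end a line of text (not taken from the report)
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "title",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._buffer = []
        self.lines = []

    def _flush(self):
        # Join the buffered text runs and collapse runs of whitespace
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any text that was not closed by a block tag


def html_to_text_lines(html):
    """Return the visible text of an HTML string as a list of lines."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Each returned line can then serve as one text fragment for the later graph-building steps. As in the limitations listed for the project, text inside images or loaded by scripts is not recovered by a parser of this kind.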

1.2.4 An Incremental Summary Generation System
C. Ravindranath Chowdary and P. Sreenivasa Kumar presented an algorithm that finds a pair of sentences, one from the current summary and the other from the new document, that are to be swapped to improve the quality of the summary. For a given query, the quality of a summary is determined by its informativeness, coherence and completeness. A scoring function that captures these features to calculate the quality of a summary is proposed. The process of updating and improving the summary is continued iteratively until the improvement in the quality measure becomes negligible. Experimental results, both qualitative and quantitative, show that the performance of the proposed approach for incremental summary generation is quite encouraging.

The paper deals with updating an available extractive summary in the scenario where the initial documents used for summarization are not accessible. The proposed algorithm updates the available summary as and when a new document is made available to the system. In that approach extractive summarization is used, but the original document is not accessible. We are also dealing with extractive summarization, but the original document is accessible; moreover, a highlighting feature is added for the convenience of the user.

1.2.5 Automatic Text Summarization
Mohamed Abdel Fattah and Fuji Ren investigate the effect of each sentence feature on the summarization task. They then use a score function over all features to train genetic algorithm (GA) and mathematical regression (MR) models to obtain a suitable combination of feature weights. The performance of the proposed approach is measured at several compression rates on a data corpus composed of 100 English religious articles. The results of the proposed approach are promising. The paper investigates the use of genetic algorithm (GA) and mathematical regression (MR) models for the automatic text summarization task.
The results of the approach outperform the baseline results. The approaches use feature extraction criteria, which gives researchers the opportunity to use many varieties of these features depending on the language and the text type. In that approach the algorithm used for summarization was a genetic algorithm, while we are using the nearest neighbor algorithm; moreover, it is an automatic summarization strategy, while we are using an extractive summarization strategy.

1.2.6 Multi-topic based Query-oriented Summarization
Jie Tang, Limin Yao and Dewei Chen try to overcome the limitations of the existing methods and study a new setup of the problem of multi-topic based query-oriented summarization. More specifically, the paper proposes two strategies to incorporate the query information into a probabilistic model. Experimental results on two different genres of data show that the proposed approach can effectively extract a multi-topic summary from a document collection and that the summarization performance is better than that of baseline methods. The approach is quite general and can be applied to many other mining tasks, for example product opinion analysis and question answering.

The paper investigates the problem of multi-topic based query-oriented summarization. It formalizes the major tasks and proposes a probabilistic approach to solve them. Two strategies are studied for simultaneously modeling document contents and the query information. We are also dealing with query-oriented multi-document summarization, and we have specialized it for news.

1.2.7 Proposed modules:
1. HTML to text parser
2. Processing the input text file and creating the document graph
3. Adding weighted edges to document graph
4. Document Clustering
5. Create clustered document graph
6. Adding weight to nodes in clustered document graph
7. Generate closure graph and find minimal clusters
8. Result

1.2.8 Clustering:
Clustering is one of the most important unsupervised learning processes. It organizes objects into groups whose members are similar in some way, and so finds structure in a collection of unlabeled data. A cluster is a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

If a collection is well clustered, we can search only the cluster that contains the relevant documents. Searching a smaller collection should improve effectiveness and efficiency.

Nearest Neighbor Algorithm:
1. The Nearest Neighbor Algorithm is an agglomerative (bottom-up) approach.
2. It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.

Step 1: Nearest Neighbor, Level 2, k = 7 clusters.

Step 2: Nearest Neighbor, Level 3, k = 6 clusters.
Step 3: Nearest Neighbor, Level 4, k = 5 clusters.
Step 4: Nearest Neighbor, Level 5, k = 4 clusters.
Step 5: Nearest Neighbor, Level 6, k = 3 clusters.
Step 6: Nearest Neighbor, Level 7, k = 2 clusters.
Step 7: Nearest Neighbor, Level 8, k = 1 cluster.
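The merging steps listed above (from 8 nodes down to k = 1 cluster) can be sketched in a few lines. This is an illustrative Python sketch, not the project's implementation: the report does not state its similarity measure, so the word-overlap (Jaccard) function below is an assumption, as are the function names.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity between two texts (assumed measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def nearest_neighbor_clusters(nodes, k, similarity=jaccard):
    """Merge the two most similar clusters until only k clusters remain.

    nodes -- list of text fragments
    k     -- desired number of clusters
    Returns a list of clusters, each a sorted list of node indices.
    """
    clusters = [[i] for i in range(len(nodes))]  # start: one cluster per node
    while len(clusters) > max(k, 1):
        best_pair, best_sim = None, -1.0
        for (x, cx), (y, cy) in combinations(enumerate(clusters), 2):
            # Nearest neighbor (single link): the similarity of two clusters
            # is that of their closest pair of members.
            sim = max(similarity(nodes[i], nodes[j]) for i in cx for j in cy)
            if sim > best_sim:
                best_pair, best_sim = (x, y), sim
        x, y = best_pair  # x < y, so deleting y leaves index x valid
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters
```

Every pass of the loop reduces the number of clusters by one, which matches the sequence k = 7, 6, ..., 1 in the steps above. Each pass compares all pairs of clusters, which is consistent with the long processing times the report notes for large inputs.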

1.3 Advantages:
1. Because of multi-document news summarization, there is no need to go through all the newspapers.
2. As we are dealing with query-specific summarization, the user can easily get a news summary according to his/her interest.
3. The accuracy of the result depends on the initial edge threshold and cluster threshold as well as the result accuracy percentage, so the user can control the relevance.

1.4 Limitations:
1. Processing takes a long time for text files larger than 50 KB or with more than 200 paragraphs, because of heavy computational loops.
2. The text in images or in Flash content cannot be parsed.
3. The text accessed through web services cannot be parsed.

Chapter 2

System Requirement and Specification

2.1 Scope of the Project
This project creates a query-dependent summary generated by a clustering algorithm. Here we have considered the nearest neighbor clustering algorithm. Since other file formats can be converted into text files, the algorithm can be applied to them after conversion. The nodes in a text file, i.e. the contents of every new line, are clustered, and a query-dependent summary is generated.

2.2 Requirement Specifications
Requirements are the desired characteristics of the software being developed. The first activity in most projects is the identification and documentation of the requirements. Requirements cover both requirements engineering (identification, analysis and capture) and requirements management (managing change, creating and maintaining agreement with customers, traceability and metrics).

The development of large, complex systems presents many challenges to systems engineers. Foremost among these is the ability to ensure that the final system satisfies the needs of users and provides for easy maintenance and enhancement during its deployed lifetime. These systems often change and evolve throughout their life cycle, which makes it difficult to track the implemented system against the original and evolving user requirements. Requirements establish an understanding of the users' needs and also provide the final yardstick against which implementation success is measured.

Various studies have shown that roughly half of application errors can be traced to requirement errors and deficiencies. Thorough documentation and proper management of requirements are the keys to developing quality applications. When project teams define and document requirement data, including user-defined attributes, priority, status, acceptance criteria and traceability, missing, contradictory or inadequately defined requirements can be detected and corrected. The following requirements and constraints were considered during the requirement analysis phase.

For clustering, the input must be a text file. If the input is in any other file format, it is first converted into text format. Then the nodes in the text document are clustered and query-dependent summarization is done.

2.2.1 Product performance requirements
The input file must be a text file. As the size of the text document changes, the performance of the algorithms also changes. So if the file is larger, better hardware facilities are needed.

2.2.2 Hardware Requirements

Processor : Pentium IV or higher.
RAM : Minimum 256 MB.
Hard Disk : 40 GB.
Input device : Standard Keyboard and Mouse.
Output device : VGA and High Resolution Monitor.

2.2.3 Software Requirements
Software components required for building the project are:
Operating System : Windows XP or above
Technique : Microsoft Visual Studio 2010 (.NET Framework 3.5)
Internet Explorer

2.3 Functional requirements
The project includes the following modules:

HTML to text parser: Processes the input HTML files by parsing the HTML contents and extracting the text lines.

Uploading and processing: In this module a text file is uploaded and processed. Every single line is considered as a node, and the data within that node is displayed.

Building document graph: The weight between every node and every other node is calculated.

Clustering and making clustered graph: The nearest neighbor method and the agglomerative hierarchical clustering technique are used to make clusters from the document graph of the previous step, and the clustered graph is prepared.

Query firing and getting minimal cluster: Here we fire the query and find the minimal cluster. A minimal cluster is a cluster which contains part of the fired query. The summary is obtained as the result.
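The "uploading and processing" and "building document graph" modules can be sketched as follows. This is a minimal Python illustration, not the project's code. The report does not give its edge-weight formula, so the word-overlap weight used here is an assumption, and the edge_threshold parameter only stands in for the edge threshold mentioned among the advantages in Chapter 1.

```python
from itertools import combinations


def split_into_nodes(text):
    """One node per non-empty line of the input text."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def word_overlap(a, b):
    """Shared words divided by all distinct words (assumed edge weight)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def build_document_graph(nodes, edge_threshold=0.0):
    """Return {(i, j): weight} for node pairs whose weight exceeds the threshold."""
    graph = {}
    for i, j in combinations(range(len(nodes)), 2):
        weight = word_overlap(nodes[i], nodes[j])
        if weight > edge_threshold:
            graph[(i, j)] = weight
    return graph
```

For example, the three lines "Rain hits the city", "Heavy rain floods the city" and "Election results announced" give a single edge between the first two nodes with weight 0.5 (three shared words out of six distinct words). Raising the threshold above 0.5 removes that edge as well.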

2.4 Feasibility Study
Not everything imaginable is feasible; therefore it is necessary to evaluate the feasibility of the project at the earliest stage. Software feasibility has three solid dimensions:

2.4.1 Technology:
Technical feasibility is the study of functions, performance, and constraints that may affect the ability to achieve an acceptable system. This project is technically feasible to implement. The user does not require any extra hardware or any higher-end technology. The software can execute on a single client machine running Windows XP or a higher version of the operating system.

2.4.2 Finance:
Financial feasibility is the evaluation of the development cost weighed against the ultimate income or benefits derived from the developed system. The resources required for the system are easily available. The system is developed basically for study purposes, so economic feasibility is not a major issue. The software does not require any extra hardware or any additional supporting technology, so the only cost is that of development. Thus the project is financially feasible.

2.4.3 Resources:
The organization that wishes to implement this system requires only a single machine or multiple machines. No additional resources are required to implement the system. Thus the software is also resource feasible.

2.5 .NET Framework:
The .NET Framework is designed for cross-language compatibility. Cross-language compatibility means that .NET components can interact with each other irrespective of the languages they are written in. An application written in VB .NET can reference a DLL file written in C#, or a C# application can refer to a resource written in VC++, etc. This language interoperability extends to object-oriented inheritance. Cross-language compatibility is possible because of the Common Language Runtime.

2.5.1 .NET Framework Advantages:
The .NET Framework offers a number of advantages to developers. Different programming languages have different approaches for doing a task. For example, accessing data with a VB 6.0 application and with a VC++ application is totally different. When different programming languages are used to do a task, a disparity exists among the approaches developers use to perform the task. The difference in techniques comes from how different languages interact with the underlying system that applications rely on. With .NET, for example, accessing data with VB .NET and with C# .NET looks very similar apart from slight syntactical differences. Both programs need to import the System.Data namespace, both establish a connection with the database, and both run a query and display the data on a data grid.

.NET v/s Java:
Java is one of the most widely regarded programming languages. Java by itself does not provide a visual design interface and requires us to write a large amount of code to develop applications. With .NET, on the other hand, the Framework supports around 20 different programming languages, and developers can focus on business logic, leaving other aspects to the Framework. Visual Studio .NET comes with rich visual interfaces and supports drag and drop. It has been reported, from applications developed, tested and maintained to compare .NET and Java, that a particular application developed using .NET requires fewer lines of code, less time to develop and lower deployment costs. This is not to say that Java is gone or that .NET based applications are going to dominate the Internet, but in our view .NET has an extra edge, as it is packed with features that simplify application development.

2.5.2 Main features of C#:
C# was developed as a language that would combine the best features of previously existing Web and Windows programming languages. Many of the features of the C# language already existed in various languages such as C++, Java, Pascal, and Visual Basic.

Main features:
1. C# is a simple, modern, object-oriented language derived from C++ and Java.
2. It combines the high productivity of Visual Basic and the raw power of C++.
3. It is a part of Microsoft Visual Studio 7.0.
4. Visual Studio supports VB, VC++, C++, VBScript, and JScript. All of these languages provide access to the Microsoft .NET platform.
5. .NET includes a common execution engine and a rich class library.
6. Microsoft's equivalent of the JVM is the Common Language Runtime (CLR).


7. CLR accommodates more than one language, such as C#, VB.NET, JScript, ASP.NET, C++.
8. Source code is compiled to Intermediate Language code.
9. The classes and data types are common to all of the .NET languages.
10. We may develop console applications, Windows applications, and Web applications using C#.
11. In C#, Microsoft has taken care of C++ problems such as memory management, pointers etc.
12. It supports garbage collection, automatic memory management and much more.

Here is a list of some of the primary characteristics of the C# language:
Modern and Object Oriented
Simple and Flexible
Type safety
Interoperability
Scalable and Updateable

2.6 Risk Management
The software development process is inherently subject to risks, the consequences of which are manifested as financial failures (time scale overrun, budget overrun) and technical failures (failure to meet required functionality, reliability or maintainability). The objectives of risk management are to identify, analyze and prioritize risk items before they become either threats to successful operation or major sources of expensive software rework; to establish a balanced and integrated strategy for eliminating or reducing the various sources of risk; and to monitor and control the execution of that strategy.

2.7 Data Flow Diagrams
Data flow diagrams serve two purposes:
1. To provide an indication of how data are transformed as they move through the system.
2. To depict the functions that transform the data flow.

The DFD provides additional information that is used during the analysis of the information domain and serves as a basis for the modeling of function. A description of each function presented in the DFD is contained in a process specification. As information moves through software, it is modified by a series of transformations. A data flow diagram is a graphical representation of information flow and of the transforms that are applied as data moves from input to output. The basic form of a data flow diagram is also known as a data flow graph or bubble chart. The data flow diagram may be used to represent a system or software at any level of abstraction. In fact, DFDs may be partitioned into levels that represent increasing information flow and functional detail. Therefore, the DFD provides a mechanism for functional modeling as well as information flow modeling.


Figure 2.1 DFD (Level 0). Label recovered from the diagram: Summary of HTML file.

The data flow diagram (Level 1) provides more detail than the Level 0 data flow diagram. It represents the information flow and the transforms that are applied as data moves from input to output.

Figure 2.2 DFD (Level 1). Labels recovered from the diagram: Input HTML File(s); HTML to Text Converter; Uploading and processing i/p file; Building document graph; Clustering algorithm; I/P Query; Creating weighted graph; Clustering and building clustered graph; Generating Summary; Highlighting Text in the HTML Document as Result.

Chapter 3
Design and analysis
3.1 Design Overview
A specialist has to check the dataflow and trace manually where the data flows. Analysts have noted that this takes considerable time even for an experienced specialist. We accept HTML/text files only. The contents of each new line form a node, and hence a single cluster. If the file contains no line breaks, there will be only one node, and hence only one cluster. This will degrade the performance of the algorithms, as the cluster size is very big.

3.2 Software Architecture:

Figure 3.1 Architecture Diagram

Figure 3.1 shows the architecture diagram of the system. As shown in the figure, there are five main blocks: a block for converting HTML file(s) to text, a block for uploading and processing the text and making the document graph, a block for clustering and making the clustered graph, a block for making the weighted clustered document graph, and a last block for generating the summary for the fired query.

Block 1: HTML to Text conversion:
This block accepts the input files in the form of HTML and converts them into text files. After conversion, the text file passes to the next block as input.

Block 2: Processing input file and generating document graph:
This block accepts text files only. It is responsible for uploading the text file and processing it, i.e. forming a node for the contents of every new line. It is also responsible for generating the weight from each node to every other node.

Block 3: Clustering nodes and building clustered graph:
This block is responsible for choosing one of two clustering algorithms. It also accepts the threshold, so that the similarity between the clusters can be checked up to that level. It is responsible for making the clusters.

Block 4: Creating weighted clustered document graph:
This block is responsible for accepting the fired query. It checks the similarity between the contents of the query and the contents of the clusters. It then builds the weighted clustered document graph.

Block 5: Summary generation:
This block is responsible for generating the summary of the clusters we formed, as a response to the fired query. It generates the minimal clusters and, after finding the weight of the nodes for the fired query, gives the topmost summaries.
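Blocks 4 and 5 can be illustrated with a short sketch that weights each cluster against the query and returns the best matches. This is a hedged Python illustration, not the project's implementation: the report does not define its weighting formula, so the fraction-of-query-terms weight, the tie-breaking rule and the function names below are all assumptions.

```python
import re


def tokenize(text):
    """Lower-case words and numbers, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_weight(text, query_terms):
    """Fraction of the query terms that occur in the text (assumed weighting)."""
    if not query_terms:
        return 0.0
    words = set(tokenize(text))
    return sum(1 for term in query_terms if term in words) / len(query_terms)


def summarize(nodes, clusters, query, top_n=1):
    """Return the text of the top_n clusters that best match the query.

    nodes    -- list of text fragments, one per node
    clusters -- list of clusters, each a list of node indices
    """
    terms = tokenize(query)
    scored = []
    for members in clusters:
        text = " ".join(nodes[i] for i in members)
        weight = query_weight(text, terms)
        if weight > 0:  # keep clusters containing at least part of the query
            scored.append((weight, len(members), members))
    # Highest weight first; among equal weights prefer the smaller cluster
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [" ".join(nodes[i] for i in members) for _, _, members in scored[:top_n]]
```

Keeping only clusters with a non-zero weight mirrors the report's description of a minimal cluster as one that contains part of the fired query. A query that matches no cluster yields an empty result.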

3.3 Team Work Graph:
Team Members:
Mr. Athar Nawaz Khan
Mr. Nikhil Vilasrao Ubale
Miss. Shraddha B. Ahire

Fig. 3.2 Deviation of work

3.4 Software Engineering Model used i.e. Incremental Model:

3.4.1 Communication:
The software development process starts with communication between customer and developer. In this phase we followed the principles of the communication phase. We prepared before the communication, i.e. we decided the agenda of the meeting so as to concentrate on news summarization. Our leader directed the team and drew out all the requirements of the user, i.e. what they actually need and what the input and output formats of the system are.

3.4.2 Planning:
This includes complete estimation, scheduling and risk analysis. In this phase we planned the estimated release date of the software, the cost estimate, the risks in the project regarding the application, etc. Finally, we estimated the cost of the project including all software expenditure, and planned the release of the software according to the user's deadline, with the user's participation.

3.4.3 Modeling:
This includes detailed requirement analysis and project design. A flowchart shows the complete pictorial flow of the program, whereas the algorithm is the step-by-step solution of the problem. We analyzed the requirements of the user and, according to them, drew the block diagrams of the system. These describe the behavioral structure of the system using UML 2.0, i.e. class diagram, use case report, component diagram, communication diagram, activity diagram and state machine diagram.

3.4.4 Construction:
This includes the coding and testing steps.
Coding: Design details are implemented using an appropriate programming language. For coding we chose the ASP.NET platform.
Testing: Testing is carried out by analyzing the system, i.e. we first develop the prototype of the system and step by step find input and output errors such as interface errors, data structure errors, initialization errors etc. Therefore the black box testing strategy is useful here.

3.4.5 Deployment:
This includes software delivery, support and feedback from the customer. If the customer suggests corrections or demands additional features, they are added to the software.

3.5 Analysis of Work: The following table shows the schedule of work that we followed over the project period.

Table 3.1: Analysis of Work

1. Information Gathering (Period: 17/07/2011 to 28/07/2011)
   Problem Definition: Collecting detailed information about the system to be implemented.
   Literature Survey: Visiting different websites; studying the existing system and its limitations; going through journals and magazines; studying the reference books.
2. Analysis (Period: 07/08/2011 to 24/08/2011)
   Project Plan: Preparing the complete project plan.
   Requirement Analysis: Software requirements; hardware requirements.
3. Design (Period: 04/09/2011 to 25/09/2011)
   Architectural Design: Describing relationships between modules and sub-modules.
   UML documentation: Use Case Diagram, Class Diagram, Sequence Diagram, Activity Diagram, State Machine Diagram, Component Diagram.
4. GUI (Period: 22/02/2012 to 29/02/2012)
   Form Design: Showing the relationship among different menus and sub-menus.
   Output screens: Preparing detailed output screens.
   Report Submission: Submission of the report of Analysis and Design.
5. Construction of System (Period: 12/03/2012 to 19/03/2012)
   Coding: Implementation of the design details using the programming language C# .NET.
   Testing: Testing the system for expected results.
6. Deployment (Period: 25/03/2012 to 15/04/2012)
   System Deployment: Delivery of project, support, feedback.
7. Final Document Preparation and Submission (Period: 15/04/2012 to 23/04/2012)
   Modification.
   Project Submission: Preparing the final project report; submission of the final project report.

3.6 Risk Assessment: Risk always involves two characteristics:

Uncertainty: The risk may or may not occur; there are no 100% probable risks.
Loss: If the risk becomes a reality, unwanted consequences or losses will occur.

3.6.1 Risk projection: Risk projection, also called risk estimation, attempts to rate each risk in two ways:
1. The likelihood or probability that the risk is real.
2. The consequences of the problems associated with the risk, should it occur.

3.6.2 Risk Identification: Risk identification is a systematic attempt to specify threats to the project plan.
Generic risks: These are potential threats to every software project.
Product-specific risks: These risks can be identified only by those with a clear understanding of the technology and the environment that is specific to the project at hand.
Following are the risks involved:
1. Technology to be built: Risks associated with the complexity of the system to be built and the newness of the technology to be packaged by the system.
2. Development Environment: Risks associated with the availability and quality of the tools used to build this system.
3. Risk related to time.
4. Risk related to the functionality of the system.

3.7 Requirement Analysis: Requirement analysis results in the specification of the software's operational characteristics. Requirements gathering comprises the following:
1. Elaboration
2. Negotiation
3. Specification
4. Validation
A software requirement specification is produced at the culmination of the analysis task. The functions and performance allocated to software as part of system engineering are refined by establishing the following:
A complete information description
A detailed functional description
A representation of system behavior
An indication of performance requirements and design constraints
Appropriate validation criteria

3.8 UML Documentation: A UML diagram is a representation of the components or elements of a system or process model and, depending on the type of diagram, of how those elements are connected or how they interact from a particular perspective. There are two major groups of UML diagrams:
1. Structural Diagrams
2. Behavioral Diagrams
We develop only those diagrams that are sufficient to show the flow and elements of the working project. They are listed below:
Use Case Diagram
Class Diagram
Sequence Diagram
Activity Diagram
State Machine Diagram
Component Diagram

Description about the UML diagram:

3.8.1 Use case Model: A Use Case diagram captures Use Cases and the relationships between Actors and the subject (system). It describes the functional requirements of the system, the manner in which outside things (Actors) interact at the system boundary, and the response of the system.
Components of a Use Case diagram:
1. Actor
2. Use Cases
3. Association
4. Include
5. Extends
An Actor is a user of the system; user can mean a human user, a machine, or even another system or subsystem in the model. Anything that interacts with the system from the outside or at the system boundary is termed an Actor. Actors are typically associated with Use Cases. In this system we use the following actors:
System
End User

Figure 3.3: Use Case Diagram
[Figure: use case diagram with the actors End User and System. Use cases by group: Parsing (provide news to system; browse news (HTML file) from internet; extract text files from body tag), Split (split documents into nodes (paragraphs); processing text files), Weighted graph (build document graph; find similarity between nodes), Clustering (use Nearest Neighbour algorithm; document clustering), Adding weight to clustered graph (create clustered document graph; add weight to nodes in graph), Minimal clusters (enter the query; find minimal clusters; enter threshold for minimal cluster; show result).]

3.8.2 Sequence Model:

A Sequence diagram is a structured representation of behavior as a series of sequential steps over time. It is used to depict work flow, message passing and how elements in general cooperate over time to achieve a result. Each sequence element is arranged in a horizontal sequence, with messages passing back and forward between elements. An Actor element can be used to represent the user initiating the flow of events. Stereotyped elements, such as Boundary, Control and Entity, can be used to illustrate screens, controllers and database items, respectively. Each element has a dashed stem called a lifeline, where that element exists and potentially takes part in the interactions.
Components of Sequence diagrams:
1. Actor
2. Lifeline
3. Message
4. Self Message
5. End point
An Actor is a user of the system; user can mean a human user, a machine, or even another system or subsystem in the model. Anything that interacts with the system from the outside or at the system boundary is termed an Actor. Actors also represent the role of a user in Sequence Diagrams. Enterprise Architect supports a stereotyped Actor element for business modeling. A Lifeline is an individual participant in an interaction.

Figure 3.4: Sequence Diagram
[Figure: sequence diagram with the lifelines User, Parser, Splitter, Graph creation, Relation manager, Cluster algorithm, Minimal spanning tree and Display screen. Messages, in order: Provide HTML documents(), Edge threshold value(), Extract text file from body tag(), Text document(), Split into nodes(), Provide nodes(), Build document graph(), Provide graph(), Calculate weight of edges(), Display nodes(), Form clusters(), Display clusters(), Ask threshold value(), Ask for the query(), Provide query(). An "alt" fragment on the input query follows: if a query is available, Provide clusters(), Calculates weight() (the similarity between the query and each cluster, treated as a node), Form weighted cluster graph(), Show weighted cluster graph(), Calculate minimal clusters() and Display Result() (the clusters relevant to the query, i.e. those having maximum weight with the query); if not, Show clusters().]

3.8.3 Class Model:

The Class diagram captures the logical structure of the system: the Classes, including Active and Parameterized (template) Classes, and the objects that make up the model. It is a static model, describing what exists and what attributes and behavior it has, rather than how something is done. Class diagrams are most useful to illustrate relationships between Classes and Interfaces. Generalizations, Aggregations and Associations are all valuable in reflecting inheritance, composition or usage, and connections, respectively.
Components of a Class diagram:
1. Class
2. Associate
3. Compose
4. Realize
A Class is a representation of objects that reflects their structure and behavior within the system. It is a template from which actual running instances are created, although a Class can be defined either to control its own execution or as a template or parameterized Class that specifies parameters that must be defined by any binding class. A Class can have attributes (data) and methods (operations or behavior). Classes can inherit characteristics from parent Classes and delegate behavior to other Classes. Class models usually describe the logical structure of the system and are the building blocks from which components are built.

Figure 3.5: Class Diagram
[Figure: class diagram "Class Model" with the following classes.
User: attributes Password: int, User_id: int, User_name: int; operations Authentication(), Input(), Input_threshold().
Input: attributes File_name: int, File_size: int; operations Conversion(), Fire query(), Input threshold().
Splitter: attributes File_size: int, No. of stop words: int; operations Extract(), Parsing(), Split().
Relation Manager: attributes File_size: int, no. of nodes: int; operations Built graph(), Calculate cluster weight(), Calculate weight of nodes(), Create nodes(), Display nodes(), Display weight().
Clustering algorithm: attributes No. of cluster: int, Threshold: int; operations Create cluster graph(), Display cluster(), Form cluster().
Input query: attribute Keywords: int; operations Assign weight(), Compare nodes(), Fire query().
Output: attribute Result: int; operations Display cluster(), Display nodes(), Display result().
All operations return void.]

3.8.4 Activity Diagram

Activity diagrams are used to model the behaviors of a system, and the way in which these behaviors are related in an overall flow of the system. The logical paths a process follows, based on various conditions, concurrent processing, data access, interruptions and other logical path distinctions, are all used to construct a process, system or procedure.
Figure 3.6: Activity Diagram
[Figure: activity diagram running from the initial state to the final state through the following stages. Parsing state: HTML doc. to system, extract text file from body tag, parsing. Split state: processing text file, split doc. into nodes (text file in, nodes out). Clustering: find similarity between nodes, use nearest neighbour algorithm, doc. clustering, build doc. graph, create doc. cluster graph, add weights to nodes in graph. Minimal cluster formation: query by user, calculate similarity between query and each cluster, find minimal cluster. Result: enter threshold for cluster, show result.]

3.8.5 State Machine Diagram

A State Machine diagram illustrates how an element can move between states, classifying its behavior according to transition triggers and constraining guards.
Figure 3.7: State Machine Diagram
[Figure: state machine diagram starting at the initial state with the following states. Input: HTML input files, threshold value. Extract: text files. Processing text file: split document into nodes. Weighted graph of nodes: assign weight to each node. Clustering: clustering by nearest neighbour, create nodes up to threshold value, create graph, assign weight to cluster. Result: user query, comparison between query and clusters, display topmost result.]

3.8.6 Component Diagram

A Component diagram illustrates the pieces of software, embedded controllers and such that make up a system, and their organization and dependencies. A Component diagram has a higher level of abstraction than a Class diagram; usually a component is implemented by one or more Classes (or Objects) at runtime. They are building blocks, built up so that eventually a component can encompass a large portion of a system.

Figure 3.8: Component Diagram
[Figure: component diagram "Component Model" with the components Input file (threshold and input file), Parser, Splitter, Relation manager, Graph creation, Clustering algorithm, Minimal spanning tree and Result.]

Chapter 4

Implementation is the stage of the project in which the theoretical design is turned into a working system, giving users confidence that the new system will work efficiently and effectively. It involves careful planning, investigation of the current system and its constraints on implementation, design of methods to achieve the changeover, and an evaluation of the changeover methods. Apart from planning, the major tasks in preparing for implementation are the education and training of users. The more complex the system being implemented, the more involved the system analysis and design effort required just for implementation. An implementation coordination committee, based on the policies of the individual organization, has been appointed. The implementation process begins with preparing a plan for the implementation of the system. According to this plan, the activities are carried out, discussions are held regarding equipment and resources, and any additional equipment needed to implement the new system is acquired. Implementation is the most critical stage in achieving a successful new system and in giving the users confidence that the new system will work and be effective. After the system is implemented, testing can be done. This method also offers the greatest security, since the old system can take over if errors are found or if the new system is unable to handle certain types of transactions. The main functions implemented for the project are listed below under the project modules.

4.1 System implementation
The input is a text file whose contents are separated by newline characters. Each block of text delimited by newlines is treated as the content of one node, i.e. one paragraph. If the file contains no newline, the whole file becomes a single node and hence a single cluster, which can degrade the quality of the result.
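The node-splitting rule described above can be sketched in a few lines. The sketch is written in Python for brevity, whereas the project itself uses C#; the function name `split_into_nodes` is hypothetical, and the sketch assumes that blank lines act only as separators and never become nodes.

```python
def split_into_nodes(text):
    """Split raw text into paragraph nodes, one per non-empty line."""
    # Assumption: blank lines are separators only and never become nodes.
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "First news paragraph.\n\nSecond news paragraph.\n"
    print(split_into_nodes(sample))
    # A file without any newline collapses into a single node.
    print(split_into_nodes("One long paragraph without line breaks."))
```

A file without line breaks yields a one-element list, which is exactly the single-node, single-cluster case that the text warns about.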
The total workflow is divided into the following modules:

Module 1: HTML to text parser
The input HTML files are processed by parsing the HTML contents and extracting the text lines.

Module 2: Processing the input text file and creating the document graph
Functions Used: Split()
The system accepts an input text file. The file is read and stored into a string, and the string is then split at each newline. Since the split function returns a string array, the result is assigned to a string array. The array contains the paragraphs, which are then treated as nodes.
string[] nodeList = null;
nodeList = File.ReadAllLines(txtInputFile.Text);
The next stage is to find the similarity between the nodes, i.e. to find the similarity edges between nodes and their weights. Each paragraph becomes a node in the document graph. The document graph G(V, E) of a document d is defined as follows: d is split into a set of non-overlapping text fragments t(v), v ∈ V. An edge e(u, v) ∈ E is added between nodes u, v ∈ V if there is an association between t(u) and t(v) in d. Hence, we can view G as an equivalent representation of d in which the associations between text fragments of d are depicted.

Module 3: Adding Weighted Edges to Document Graph
(Note: Adding weighted edges is query-independent.)
A weighted edge is added to the document graph between two nodes if they either correspond to adjacent nodes or are semantically related, and the weight of an edge denotes the degree of the relationship. Here two nodes are considered to be related if they share common words (other than stop words), and the degree of relationship is calculated by semantic parsing. Because the edge weights are query-independent, they can be pre-computed. The following input parameter is required at the pre-computation stage to create the document graph:

4.1.1 Threshold for edge weights: Only edges whose weight is not below the threshold are created in the document graph. (The threshold is a user-configurable value that controls the formation of edges.)
Adding weighted edges is the next step after generating the document graph. For each pair of nodes u, v we compute the association degree between them, that is, the score (weight) EScore(e) of the edge e(u, v). If EScore(e) ≥ threshold, then e is added to E. The score of an edge e(u, v), where nodes u and v have text fragments t(u) and t(v) respectively, is:
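Written out, an edge score consistent with the definitions given here takes the following form. This is a reconstruction based on the formulation in the cited work of Varadarajan and Hristidis [2], and should be read as a sketch rather than a verbatim copy of the original equation:

```latex
\mathrm{EScore}(e) =
  \frac{\displaystyle\sum_{w \,\in\, t(u) \cap t(v)}
        \bigl( tf(t(u), w) + tf(t(v), w) \bigr) \cdot idf(w)}
       {size(t(u)) + size(t(v))}
```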


Here tf(d, w) is the number of occurrences of w in d, idf(w) is the inverse of the number of documents containing w, and size(d) is the size of the document (in words). That is, for every word w appearing in both text fragments we add a quantity equal to the tf/idf score of w. Stop words are ignored.

Functions Used:
RemoveCommonWords(): Common words are eliminated from the nodes, because they distort the similarity computed between two nodes and also slow the system down by increasing the number of computational loops. Examples: a, an, the, he, she, they, as, it, and, are, were, there. The two filtered nodes are passed as parameters to the Relation Manager class to find the similarity between them.
RelationManager(): The relation manager function takes two nodes as parameters and returns the semantic relation between them in the form of a weight (EScore), computed by the traditional edge weight formula specified below:

If EScore >= Threshold, the edge is added to the document graph. The graph is stored in tabular form as shown below.

Table 4.1: Nodes and node weights

First Node    Second Node    Edge Weight
1             2              0.5
1             3              0.7
.             .              .
.             .              .
30            31             0.8
30            32             0.6
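The edge-scoring step can be illustrated with a short sketch. It is written in Python for brevity (the project itself uses C#) and is not the project's RelationManager code. The names `content_words`, `escore` and `build_edges` are hypothetical, the stop-word list contains only the examples mentioned in the text, and two assumptions are made: size is counted after stop-word removal, and a word missing from the idf table gets a neutral idf of 1.0. Edges between adjacent nodes are not handled here.

```python
import re
from collections import Counter

# Illustrative stop-word list taken from the examples in the text; a real
# list would be much longer.
STOP_WORDS = {"a", "an", "the", "he", "she", "they", "as", "it",
              "and", "are", "were", "there"}


def content_words(fragment):
    """Lower-case the fragment, tokenize it and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", fragment.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def escore(frag_u, frag_v, idf):
    """Edge score: tf*idf mass of the shared words, normalized by size."""
    tf_u = Counter(content_words(frag_u))
    tf_v = Counter(content_words(frag_v))
    size = sum(tf_u.values()) + sum(tf_v.values())
    if size == 0:
        return 0.0
    shared = tf_u.keys() & tf_v.keys()
    # Assumption: words missing from the idf table get a neutral idf of 1.0.
    total = sum((tf_u[w] + tf_v[w]) * idf.get(w, 1.0) for w in shared)
    return total / size


def build_edges(nodes, idf, threshold):
    """Return (i, j, weight) for every node pair scoring >= threshold."""
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            weight = escore(nodes[i], nodes[j], idf)
            if weight >= threshold:
                edges.append((i + 1, j + 1, weight))  # 1-based node ids
    return edges
```

The list returned by `build_edges` has the same shape as Table 4.1: first node, second node, edge weight. Because the computation does not depend on the query, it can be run once per document set and cached.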

Module 4: Document Clustering
Clustering is the grouping of similar nodes (nodes whose degree of closeness is greater than or equal to the cluster threshold specified by the user) into a group. The clustering approach used is Nearest Neighbor.

Algorithm for Nearest Neighbor Clustering:
1. Set i = 1 and k = 1. Assign pattern x_1 to cluster C_1.
2. Set i = i + 1. Find the nearest neighbor of x_i among the patterns already assigned to clusters. Let d_m denote the distance from x_i to its nearest neighbor. Suppose the nearest neighbor is in cluster m.
3. If d_m is greater than or equal to t, then assign x_i to C_m, where t is the threshold specified by the user. Otherwise set k = k + 1 and assign x_i to a new cluster C_k.
4. If every pattern has been considered then stop, else go to step 2.
(Here the "distance" is the edge weight, i.e. a similarity, which is why a larger value means a closer neighbor.)

Functions Used:
FindMaxWeight(): FindMaxWeight returns, from the document graph, pairs of nodes having maximum edge weight together with their weight. E.g.

Table 4.2: Nodes and the max weight

First Node    Second Node    Max Weight
1             22             2.5
2             19             1.2
3             31             3.5
.             .              .
.             .              .
31            12             2.7

NearestNeighborCluster():
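A minimal sketch of this clustering step is given below. It is written in Python for brevity (the project itself uses C#) and is not the project's actual NearestNeighborCluster code. The function name `nearest_neighbor_clusters` is hypothetical, and the input is assumed to be a list of (node, nearest neighbor, weight) triples such as the rows of Table 4.2. A pair whose weight falls below the threshold is assumed to leave each unseen node in a cluster of its own.

```python
def nearest_neighbor_clusters(pairs, threshold):
    """Group nodes from (node, nearest_neighbor, weight) triples.

    A pair joins an existing cluster (or founds a new one) only when its
    weight reaches the threshold; otherwise unseen nodes stay on their own.
    """
    clusters = []   # list of clusters, each a list of node ids
    owner = {}      # node id -> index of its cluster in `clusters`

    def new_cluster(members):
        clusters.append(list(members))
        for node in members:
            owner[node] = len(clusters) - 1

    for u, v, weight in pairs:
        if u in owner and v in owner:
            continue                      # both nodes are already placed
        if weight >= threshold:
            if u in owner or v in owner:
                # One node is placed: pull the free node into its cluster.
                placed, free = (u, v) if u in owner else (v, u)
                clusters[owner[placed]].append(free)
                owner[free] = owner[placed]
            else:
                new_cluster([u, v])       # neither is placed: new cluster
        else:
            for node in (u, v):
                if node not in owner:
                    new_cluster([node])   # too weak: singleton clusters
    return clusters


# Rows of Table 4.2 that are visible in the text.
table_4_2 = [(1, 22, 2.5), (2, 19, 1.2), (3, 31, 3.5), (31, 12, 2.7)]
print(nearest_neighbor_clusters(table_4_2, threshold=1.0))
```

Raising the threshold above a pair's weight splits that pair into singleton clusters, so the number of clusters grows as the threshold rises in this sketch.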

The first pair of nodes in the above table is added to the first cluster because they have maximum weight. Here nodes 1 and 22 are closely related and are therefore added to the first cluster, so Cluster_1 contains the two nodes 1 and 22.
Cluster_1: 1, 22
The next node, node 2, shows maximum weight with node 19, but neither node 2 nor node 19 is in a previous cluster, so they form a new cluster, Cluster_2.
Cluster_2: 2, 19
Similarly, nodes 3 and 31 form a new cluster.
Cluster_3: 3, 31
The next pair (node 31 and node 12) contains node 31, which is already in Cluster_3, so node 12 is added to Cluster_3, which now becomes
Cluster_3: 3, 31, 12
The above procedure is repeated until the end of the node pairs.

Module 5: Creating Clustered Document Graph
After the clusters are formed, either by Nearest Neighbor or by agglomerative hierarchical clustering, the similarity edges between similar clusters are calculated. This is the same as creating the document graph and adding the similarity edges between similar nodes. Every cluster is split into its individual nodes, and this group of nodes is passed to the relation manager in order to find the weight between two sets of nodes, i.e. two clusters. [5,7,10]

Module 6: Adding Weight to Nodes in Clustered Document Graph
When a query Q arrives, the nodes in V are assigned query-dependent weights according to their relevance to Q. In particular, we assign to each node v corresponding to a text fragment t(v) a node score NScore(v) defined by the Okapi formula as given below.

$$NScore(v) = \sum_{t \in Q,\, d} \ln\frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1)\, tf}{k_1 \left( (1 - b) + b\, \frac{dl}{avdl} \right) + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

where tf is the term's frequency in the document, qtf is the term's frequency in the query, N is the total number of documents in the collection, df is the number of documents that contain the term, dl is the document length (in words), avdl is the average document length, and k1 (between 1.0 and 2.0), b (usually 0.75) and k3 (between 0 and 1000) are constants.

Functions Used:
CalculateClusterWeight(): All the values mentioned above are computed and passed as parameters to the Okapi formula. The returned node weight is stored in the table, e.g.

Table 4.3: Cluster node and weight of cluster

Cluster No    Nodes           Cluster Weight
Cluster_1     1, 22, 1, 32    2.4
Cluster_2     9, 17, 24       2.5
Cluster_3     34, 12, 10      0
Cluster_4     4, 14, 23       0
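The Okapi weighting can be sketched as follows. This is a Python illustration (the project itself uses C#), not the project's CalculateClusterWeight code. The function name `okapi_nscore`, the parameter names and the defaults k1 = 1.2 and k3 = 1000 are assumptions chosen from within the ranges quoted in the text; b = 0.75 is the value the text gives as usual. The caller is assumed to supply the document frequencies `df`, the collection size `n_docs` and a positive average length `avdl`.

```python
import math


def okapi_nscore(query_terms, node_terms, df, n_docs, avdl,
                 k1=1.2, b=0.75, k3=1000.0):
    """Okapi score of a node (list of words) against a query (list of words)."""
    dl = len(node_terms)                       # node length in words
    score = 0.0
    for term in set(query_terms):
        tf = node_terms.count(term)            # term frequency in the node
        if tf == 0:
            continue                           # term must occur in both
        qtf = query_terms.count(term)          # term frequency in the query
        term_df = df.get(term, 0)              # documents containing the term
        idf = math.log((n_docs - term_df + 0.5) / (term_df + 0.5))
        tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score


doc_freq = {"election": 1, "weather": 2}       # hypothetical counts
query = ["election"]
print(okapi_nscore(query, ["election", "result", "today"], doc_freq, 10, 3.0))
print(okapi_nscore(query, ["weather", "is", "sunny"], doc_freq, 10, 3.0))
```

A node that shares no term with the query scores exactly zero, which matches the zero-weight clusters in Table 4.3. Note that the logarithmic factor turns negative for terms that occur in more than half of the documents.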

Module 7: Generating Closure Graph and Finding Minimal Clusters
The closure graph contains the minimal clusters. Minimal clusters are the clusters which show non-zero weight with the input query. In the above example (Table 4.3), only Cluster_1 and Cluster_2 are minimal clusters. The minimal clusters are the clusters which appear in the result.

Module 8: Result
After getting the minimal clusters, the result can be displayed in two ways:
Top 1 Result Summary
Multi-Result Summary
In the top 1 result summary, the minimal cluster having the highest weight with the input query is returned; in the multi-result summary, all the minimal clusters are returned as the result. Before a cluster is displayed as a result, it is split into its nodes and the weight of every node with the input query is calculated. The nodes are displayed in decreasing order of their weight with the input query, i.e. the node having the highest weight is displayed at the top and the one with the lowest weight at the bottom.
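The selection and ordering rules of Modules 7 and 8 can be sketched as below. The sketch is in Python for brevity (the project itself uses C#); the function names are hypothetical, the cluster weights are those of Table 4.3, and the node weights in the usage line are made-up values used only to show the ordering.

```python
def minimal_clusters(cluster_weights):
    """Keep only clusters with non-zero weight against the query."""
    return {name: w for name, w in cluster_weights.items() if w > 0}


def top1_result(cluster_weights):
    """Name of the minimal cluster with the highest weight, or None."""
    minimal = minimal_clusters(cluster_weights)
    return max(minimal, key=minimal.get) if minimal else None


def order_nodes(node_weights):
    """Node ids sorted by decreasing weight against the query."""
    return sorted(node_weights, key=node_weights.get, reverse=True)


# Cluster weights from Table 4.3.
weights = {"Cluster_1": 2.4, "Cluster_2": 2.5, "Cluster_3": 0, "Cluster_4": 0}
print(minimal_clusters(weights))   # multi-result summary
print(top1_result(weights))        # top 1 result summary
# Hypothetical per-node weights for the nodes of Cluster_2.
print(order_nodes({9: 0.4, 17: 1.1, 24: 0.7}))
```

With the Table 4.3 weights, the multi-result summary keeps Cluster_1 and Cluster_2, and the top 1 summary returns Cluster_2 because its weight of 2.5 is the largest.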


Chapter 5
5.1 Testing strategies
Testing is an important phase in the Software Development Life Cycle. Testing should be planned and conducted systematically. Generic aspects of a test strategy:
1. Testing begins at the module level and works outwards.
2. Different testing techniques are used at different points of time.
3. Testing is done by the developers and, mainly for larger projects, by an independent test group.
4. Testing and debugging are two different activities, but debugging should be incorporated into any testing strategy.

5.2 Testing Techniques
5.2.1 Black box testing
Black box testing focuses on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program. Black box testing is not an alternative to white box testing techniques; rather, it is a complementary approach that is likely to uncover a different class of errors than white box methods. Black box testing attempts to find errors in the following categories:
1. Incorrect or missing functions.
2. Interface errors.
3. Errors in data structures or external database access.
4. Performance errors.
5. Initialization and termination errors.
Unlike white box testing, which is performed early in the testing process, black box testing tends to be applied during later stages of testing. Because black box testing purposely disregards control structure, attention is focused on the information domain.

5.2.2 White box testing
Using white box testing methods, the software engineer can derive test cases that:
1. Guarantee that all independent paths within a module have been exercised at least once.
2. Exercise all logical decisions on their true and false sides.
3. Execute all loops at their boundaries and within their operational bounds.
4. Exercise internal data structures to assure their validity.
The need for white box testing arises for several reasons:
1. Errors tend to creep into the work when we design and implement functions, conditions or controls that are out of the mainstream.

2. We often believe that a logical path is not likely to be executed when, in fact, it may be executed on a regular basis.
3. Typographical errors are random. When a program is translated into programming language source code, it is likely that some typing errors will occur.
We used both white box testing and black box testing. Black box testing is also called behavioral testing and focuses on the functional requirements of the software. In this testing the software is tested as a black box, without considering its internal details. The required sets of input were supplied and the desired outputs were obtained.

5.3 Test Cases:

Table 5.1: Test cases

TC1
Test Name: HTML file validation
Case Description: Validation of the size of the input text file, in which at least 2 paragraphs must be present. The web pages are stored in a particular dataset folder.
Steps Carried out: 1. Enter a correct text file, then click create nodes. 2. Enter a file in an incorrect format, then click create nodes.
Expected Results: 1. The text file is accepted, nodes are created and the data is shown in a particular node set. 2. Error message: a file in a format other than text will not be uploaded.
Actual Result: 1. The text file was accepted, nodes were created and the data was shown in a particular node set. 2. Error message: a file in a format other than text was not uploaded.

TC2
Test Name: Changing threshold for clustering
Case Description: To check how the threshold value affects the size of the clusters and the performance of the algorithm.
Steps Carried out: Selecting the threshold for clustering.
Expected Results: As the threshold value increases, the size of the clusters increases and the number of clusters decreases. Due to the big size of the clusters, looping in the nearest neighbour algorithm increases, hence its performance decreases.
Actual Result: As the threshold value increased, the size of the clusters increased and the number of clusters decreased. Due to the big size of the clusters, looping in the nearest neighbour algorithm increased, hence its performance decreased.

TC3
Case Description: To check whether the first result is the best summary.
Steps Carried out: After clustering, find the minimal cluster and get the result.
Expected Results: The cluster containing the best result for the fired query should appear at the top. In that cluster, the node containing the best result should come at the first position.
Actual Result: The cluster containing the best result for the fired query appeared at the top. In that cluster, the node containing the best result came first.

TC4
Test Name: Similarity calculation
Case Description: Checking similarity between two clusters.
Steps Carried out: Two clusters as input.
Expected Results: Clustering of similar clusters.
Actual Result: Clustering of similar clusters.

TC5
Test Name: Weight calculation
Case Description: Calculating the weight between two nodes.
Steps Carried out: Two nodes as input.
Expected Results: Weight between the two nodes.
Actual Result: Weight between the two nodes.

TC6
Test Name: Removal of common words
Case Description: Checking the effect after removal of common words.
Steps Carried out: Text file without common words.
Expected Results: After removing common words, fewer words remain, so it is easy to calculate the document graph and similar nodes. Hence the performance of the system increases.
Actual Result: After removing common words, fewer words remained, so it was easy to calculate the document graph and similar nodes. Hence the performance of the system increased.

TC7
Test Name: GUI
Case Description: Alignment of controls. Textboxes should be properly aligned. The color of all buttons should be uniform.
Steps Carried out: All textboxes should be aligned in a straight line.
Expected Results: Textboxes should be aligned. Button color should be uniform.
Actual Result: Textboxes should be aligned. Button color should be uniform.


5.4 User Interface (Screenshots)
The basic user interface consists of at least three windows. The first window is used to input the text file; the user must supply a text file only. The second interface displays the available clustering techniques and also takes the threshold for clustering as input: the user selects one of the two clustering algorithms and gives the threshold value for clustering. The third interface displays the query and takes the percentage of correlation of a cluster with the query.
5.4.1 Before uploading the HTML file(s).
Figure 5.1: Before uploading the HTML file(s).

5.4.2 Uploading HTML file(s).
Figure 5.2: Uploading HTML file(s).

5.4.3 Browsing the HTML file(s).
Figure 5.3: Browsing the HTML file(s).

5.4.4 After browsing the HTML file(s).
Figure 5.4: After browsing the HTML file(s).

5.4.5 Processing HTML file(s) and display of node relations.
Figure 5.5: Processing HTML file(s) and display of node relations.

5.4.6 Before clustering of nodes.
Figure 5.6: Before clustering of nodes.

5.4.7 Cluster formation and building the clustered graph.
Figure 5.7: Cluster formation and building the clustered graph.

5.4.8 Taking the input query and thresholds for minimal combination of clusters in %.
Figure 5.8: Taking the input query and thresholds for minimal combination of clusters in %.

5.4.9 Display of the minimal cluster as result along with link to actual web page(s).
Figure 5.9: Display of the minimal cluster as result along with link to actual web page(s).

5.4.10 Display of the actual web page(s) we are currently dealing with, highlighting the output data.
Figure 5.10: Display of the actual web page(s) we are currently dealing with, highlighting the output data.

Chapter 6
Future Scope
A sentence ordering module can be used to define an ordering among the topic sentences. Another important aspect is that our system can be tuned to generate a summary of a custom size specified by the user. Our system can also generate summaries for non-English documents if some simple resources of the language are available. In future we will use a dictionary to add the synonyms of the query words and of the keywords as extra keywords when searching for relevant information, so that the quality of the summary increases. In the news domain, the update summary is a very important and useful concept. On the same news topic, new or updated news arrives every day or every hour. A reader who has already read the previous news article will not be interested in reading the whole article again; he or she will want to know only the updated news. With the help of an update summary, the reader can read and track news very easily. We can develop a system which will produce the update summary too.


We deal specifically with generating summaries for the news domain, where a summary is a very important and useful concept. On the same news topic, new or updated news arrives every day or every hour, and one cannot go through all the newspapers and every news article; the reader will want to know the summary only. With the help of the news summary, the reader can read and track news very easily. As we provide the facility of a query-dependent news summary, the user can easily obtain a news summary according to his or her interest. We deal directly with HTML pages, so the user can retrieve online news and directly get the summary for it. In this work we present a graph-based approach for a query-dependent multi-document summarization system, along with the nearest neighbor clustering technique. It works efficiently in the case of news summarization. Because of multi-document news summarization, there is no need to go through all the newspapers. The accuracy of the result depends upon the initial edge threshold and cluster threshold as well as the result accuracy percentage, so the user can control the relevance.


