
Novel Approach for Text Summarization of Multiple Documents

Snehlata Sahu¹, Ms. Alka Jaiswal²

¹ Rungta College of Engineering and Technology, Bhilai (CG)
² Asst. Professor of CSE, Rungta College of Engineering and Technology, Bhilai (CG)
E-mail: ¹ snehlata.sahu@gmail.com, ² alka_jais@yahoo.co.in

Abstract: Multi-document summarization is a technique for generating a summary of several electronic documents. Text summarization condenses a large text into a shorter one while keeping the important information intact. This paper presents a multi-document text summarization approach based on text extraction using fuzzy logic and a genetic algorithm. Most summarizers assign high scores to the data most frequently extracted from the files. The proposed technique combines this statistical approach with a linguistic approach, i.e. fuzzy logic and a genetic algorithm, to generate better summaries. The paper focuses on implementing the features of the genetic algorithm combined with the fuzzy logic approach. Sentences are first selected and assigned weights according to specific features. Fuzzy logic is then used to separate important from unimportant sentences on the basis of their weights. Finally, the genetic algorithm is applied to the extracted sentences, which go through crossover and mutation, and the resulting sentences are included in the final summary.

Keywords: Feature Selection; Fuzzy Logic; Genetic Algorithm; Text Summarization
I. INTRODUCTION

Multi-document summarization is an automatic procedure for extracting information from multiple texts written about the same topic. The purpose of automatic summarization in technical literature is to facilitate quick, condensed and accurate identification of the topic from multiple related, domain-specific documents. The objective is to save a prospective reader time and effort in finding useful, brief information among many relevant documents in a specific area. Extraction of a summary text from multiple documents became popular in the mid-1990s and is mostly used in the domain of news articles. Multi-document summarization differs from single-document summarization in that it involves multiple sources of information. The key task of multi-document summarization is not just identifying redundancy across documents, but also ensuring that the final summary is both coherent and complete.

Ontology Knowledge Based Summarization: This proposal focuses on dynamic summary generation based on a user's input query. The approach was designed for application in a specific domain (medical), but it can be used in the general domain too. It is based on the fact that a user selects keywords to search for documents that meet specific requirements. These keywords may not match a document's main idea, so the static, author-written abstract may not be a good summary for that user and search query. Hence, the summary needs to be generated dynamically, according to the user requirements given by the search query.

Feature Appraisal Based Summarization: Another, similar approach is based on semantic analysis of the document. Its authors propose a scoring system for document extraction based on static and dynamic document features. Static features include sentence location and the named entities (NE) in each sentence. Dynamic features used for scoring include semantic similarity between.

Neural Network Based Approach: The NetSum system, developed at Microsoft Research, uses a machine-learning method based on the neural network algorithm RankNet. The system is customized for summary extraction from news articles that include three highlighted sentences. The goal is pure extraction, without any sentence compression or sentence generation. Thus, the system is designed to extract the three sentences from a single document that best match the three document highlights.

Much research has gone into producing better-quality summaries. According to recent research, artificial intelligence models have been found effective for generating better-quality summaries.
II. PROPOSED METHODOLOGY

The proposed methodology for text summarization consists of the following phases:

A. Preprocessing

The preprocessing stage consists of three major activities: tokenization, removal of stop words, and word stemming.

Tokenization: The process of extracting individual words from a document or sentence.
Removal of Stop Words: Removing words that appear frequently in a document but do not help identify its important content, such as a, an, the, etc.
Word Stemming: The last preprocessing step. Word stemming is the process of removing the prefixes and suffixes of each word.

In the processing phase, sentences are extracted from the document on the basis of the most commonly occurring stems.

B. Application of Fuzzy Logic

A document sentence is represented and scored as a set of eight features:

Content word: Content words are usually nouns; such keywords have a greater chance of being included in the summary.
Title word: Sentences containing words that appear in the title are indicative of the theme of the document and have a greater chance of being included in the summary.
Sentence location: The first and last sentences of the first and last paragraphs of a text document are usually more important and have a greater chance of being included in the summary.
Sentence length: Very long and very short sentences are usually not included in the summary.
Proper noun: A proper noun is the name of a person, place, concept, etc. Sentences containing proper nouns have a greater chance of being included in the summary.
Upper-case word: Sentences containing acronyms or proper names are included.
Adjectives: Adjectives, identified according to the rules of English grammar, are also considered important for identifying the important portions of a paragraph.
Sentence-to-sentence similarity: The similarity of each sentence to the other sentences is checked.

Fuzzy logic IF-THEN rules are then applied to the sentences, and the sentences with the maximum scores are added to the collection that is used as input to the genetic algorithm stage.

C. Application of Genetic Algorithm (GA)

Genetic Programming (GP) is an evolutionary algorithm that evolves computer programs and predicts mathematical models from experimental data. The main operators used in evolutionary algorithms are regeneration, crossover and mutation. A genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. It uses techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.

The initial population consists of the sentences extracted by the two preceding phases. The sentences to be included in the final summary are chosen on the basis of the scores obtained by applying a fitness function. The sentences with the maximum scores are eligible for crossover and mutation; they are combined on the basis of their similarities so that the resulting sentence inherits the features of both parents. The evolution process consists of successive generations. At each generation, individuals with high fitness are selected, and the chromosomes of the selected individuals are recombined and subjected to small mutations. Formally, the scheme of the GA can be represented as follows:

1. Select a random population of n solutions for the problem.
2. Evaluate the fitness function f(x) of each solution x in the population.
3. Create a new population of solutions by repeating steps 4 to 7 until the new population is complete.
4. Select two parent solutions from the population according to their fitness (the better the fitness, the bigger the chance of being selected).
5. Cross over the parents to form a new solution. If no crossover is performed, the offspring is an exact copy of the parents.
6. Mutate the offspring with some low probability; a portion of the new individuals will have some of their bits flipped. The purpose of mutation is to maintain diversity within the population and inhibit premature convergence.
7. Place the new offspring in the new population.
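The GA scheme (steps 1 to 10) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the bit-vector encoding (bit i = 1 keeps sentence i), the `fitness` function, roulette-wheel selection, single-point crossover, the fixed generation count used as the end condition, and all names and default parameters are hypothetical choices that the paper does not specify.

```python
import random


def fitness(bits, scores, max_sentences):
    """Assumed fitness: total score of the selected sentences,
    or 0 when more than max_sentences are selected."""
    if sum(bits) > max_sentences:
        return 0.0
    return sum(score for bit, score in zip(bits, scores) if bit)


def run_ga(scores, max_sentences, pop_size=20, generations=40,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(scores)

    def fit(bits):
        return fitness(bits, scores, max_sentences)

    # Step 1: random population of n-bit solutions.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=fit)

    # Steps 9-10: the end condition here is a fixed number of generations.
    for _ in range(generations):
        # Step 2: evaluate every solution; the small constant keeps the
        # selection weights valid even if all fitness values are zero.
        weights = [fit(bits) + 1e-9 for bits in population]
        new_population = []
        while len(new_population) < pop_size:  # Step 3
            # Step 4: fitness-proportional (roulette-wheel) selection.
            parent_a, parent_b = rng.choices(population, weights=weights, k=2)
            # Step 5: single-point crossover, otherwise copy a parent.
            if n > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, n - 1)
                child = parent_a[:cut] + parent_b[cut:]
            else:
                child = parent_a[:]
            # Step 6: flip each bit with a low probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            new_population.append(child)  # Step 7
        population = new_population  # Step 8
        # Remember the best solution seen so far across generations.
        best = max(population + [best], key=fit)
    return best
```

For example, `run_ga([0.9, 0.2, 0.7, 0.4], max_sentences=2)` returns a list of four bits marking the sentences kept for the summary. A fixed seed makes runs repeatable, which helps when comparing parameter settings.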

Figure 1: Text Summarization Process
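The first two phases of this process, preprocessing (Section II-A) and fuzzy scoring (Section II-B), could be sketched in Python as follows. Everything specific here is an assumption made for illustration: the tiny stop-word list, the crude suffix stripper (a stand-in for a real stemmer, and it handles suffixes only, not prefixes), the triangular membership function with its breakpoints, and the single IF-THEN rule that uses just two of the eight features (title word and sentence length). The paper does not give its actual rules or membership functions.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and", "for", "on"}
# Crude suffix list standing in for a proper stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Tokenization: extract individual lower-cased words."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(sentence):
    """Tokenize, remove stop words, then stem."""
    return [stem(w) for w in tokenize(sentence) if w not in STOP_WORDS]


def stem_counts(sentences):
    """Count stems over all sentences to find the most common ones."""
    return Counter(s for sentence in sentences for s in preprocess(sentence))


def triangular(x, low, peak, high):
    """Triangular fuzzy membership: 0 outside (low, high), 1 at peak."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


def importance(sentence, title):
    """Assumed rule: IF title overlap is high AND length is medium
    THEN the sentence is important (fuzzy AND taken as min)."""
    stems = preprocess(sentence)
    if not stems:
        return 0.0
    title_stems = set(preprocess(title))
    title_overlap = sum(1 for s in stems if s in title_stems) / len(stems)
    medium_length = triangular(len(stems), 2, 8, 20)
    return min(title_overlap, medium_length)
```

Sentences could then be ranked by `importance(sentence, title)`, with the top-scoring ones passed on as input to the genetic algorithm stage.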

III. EXPECTED OUTCOME

This paper focuses on building a summarizer based on a genetic algorithm combined with fuzzy logic. The advantage of the genetic algorithm approach is the ease with which it handles different kinds of constraints and objectives: each can be treated as a weighted component of the fitness function, which makes it easy to adapt the summarizer to the particular requirements of a very wide range of possible overall objectives. The proposed approach is expected to perform better than other text summarization methods and to generate high-quality summaries.
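One way to read "weighted components of the fitness function" is as a weighted sum. The form below is only an illustrative assumption, since the paper does not state its fitness function: S is a candidate summary, each f_k(S) is a hypothetical component score (for example, a coverage term or a length penalty), and w_k is the weight given to that component.

```latex
F(S) = \sum_{k=1}^{K} w_k \, f_k(S), \qquad w_k \ge 0
```

Under this reading, adapting the summarizer to a different objective amounts to changing the weights w_k or adding a new component f_k, while the genetic algorithm itself stays unchanged.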
Remaining steps of the GA scheme in Section II-C:

8. Use the newly generated population for a further run of the algorithm.
9. If the end condition is satisfied, stop and return the best solution in the current population.
10. Otherwise, repeat from step 2.

IV. CONCLUSION

The design of an automatic text summarizer is of great importance in a world so filled with data. It would reduce the effort people spend reading huge amounts of data by offering them a concise summary of each document. The aim is to develop a text summarizer that can handle multiple documents and generate the most relevant and precise summary. The proposed summarizer is based on fuzzy logic and a genetic algorithm. It will be tested further on a variety of document sets to evaluate its performance, and compared with the approaches used in previous summarizers.

ACKNOWLEDGMENT

The author thanks Rungta College of Engineering and Technology and Ms. Alka Jaiswal for their valuable support.
