CROSS DOMAIN TEXT CATEGORIZATION

USING PLSA
By
K.HARITHA - 316126510149
P.ANUHYA - 316126510166
P.HARI TEJA - 316126510167
P.MADHU KUMAR - 316126510169

Under the guidance of


Dr M.RAMAKRISHNA MURTHY - Associate Professor
CSE Department
CONTENTS

 ABSTRACT
 PROBLEM STATEMENT
 INTRODUCTION
 SAMPLE INPUT
 SAMPLE OUTPUT
 REFERENCES
ABSTRACT

 Text analysis is an important, emerging research area because text
resources are growing rapidly across the internet and the digital world.
Text categorization is one of the vital techniques in text data analysis.
Traditional text categorization methods do not handle learning across
different domains well, and cross-domain classification is a more challenging
problem than single-domain classification. This project implements
cross-domain text categorization using PLSA (Probabilistic Latent Semantic
Analysis).
PROBLEM STATEMENT

 The number of text documents is growing with the advent of the internet
and the development of the World Wide Web. The volume of text documents is
now too large to classify manually. In general, statistical approaches have
been applied to text classification within a single domain. These approaches
are based on word occurrence, i.e. the frequency of one or more words in a
given document. Such approaches do not work well across multiple domains, so
one of the most important challenges is learning topics in text documents
that belong to different domains.
INTRODUCTION

 Text categorization is the task of automatically sorting a set of documents
into categories. When a text document involves two or more domains, the task
is called cross-domain text categorization. The internet is a vast repository
of disparate information growing at an exponential rate. The dynamic growth
of the web generates not only a huge number of text documents but also a wide
variety of them, because documents are produced in many different domains.
Efficient and effective document retrieval and classification systems are
required to turn this massive amount of data into useful information and,
eventually, into knowledge.
METHODOLOGY

 PLSA does not need labelled information and therefore does not take
available prior knowledge of the domain into account. PLSA was derived from
the well-known latent semantic analysis (LSA) technique for text analysis. In
this model each document is treated as a mixture of several topics, and the
topics are estimated using the maximum likelihood principle. Because each
document is assumed to be generated from multiple topics, the model can
assign multiple topics to a single document.
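The maximum likelihood estimation mentioned above is commonly carried out with the expectation-maximization (EM) algorithm. The following is a minimal sketch in Python with NumPy, not the project's actual implementation. The function name `plsa`, the parameters `n_topics`, `n_iter` and `seed`, and the toy count matrix are all assumptions introduced for illustration.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA by EM on a (n_docs, n_words) count matrix.

    Returns P(z|d) with shape (n_docs, n_topics) and
    P(w|z) with shape (n_topics, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialisation, normalised so each row is a distribution.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, topics, words)
        post /= post.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate both tables from expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Hypothetical toy data: 4 documents over a 4-word vocabulary.
toy_counts = np.array([
    [4, 3, 0, 0],
    [3, 5, 0, 0],
    [0, 0, 4, 2],
    [0, 0, 3, 5],
])
p_z_d, p_w_z = plsa(toy_counts, n_topics=2)
# One simple way to categorize: pick the most probable topic per document.
labels = p_z_d.argmax(axis=1)
```

Each row of `p_z_d` is the topic mixture of one document, which is how a single document can carry several topics at once. Taking the `argmax` is only one possible way to turn the mixture into a single category.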
CHALLENGES
 Traditional statistical approaches have been applied to text classification
within a single domain. These approaches are based on word occurrences. They
require labelled data in order to construct a reliable and accurate
classification model, but labelled data are rarely available and expensive to
obtain. Another challenge in machine learning approaches is handling a
learning task for which no training data is available. The most important
problem is learning topics in text documents that belong to different
domains.
SAMPLE INPUT AND SAMPLE OUTPUT

SAMPLE INPUT
A dataset containing the list of documents to be classified. The set of
documents to be classified is represented by D.
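As an illustration of how the document set D might be prepared for a topic model, the sketch below turns a list of strings into a document-word count matrix. The helper name `build_count_matrix`, the whitespace tokenization, and the three example documents are assumptions made for this example, not part of the project's dataset.

```python
import numpy as np

def build_count_matrix(documents):
    """Turn a list of strings into a (n_docs, n_words) count matrix."""
    # Simplistic tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(documents), len(vocab)), dtype=int)
    for i, doc in enumerate(tokenized):
        for w in doc:
            counts[i, index[w]] += 1
    return counts, vocab

# Hypothetical document set D.
D = ["cricket match score", "election vote result", "cricket score score"]
counts, vocab = build_count_matrix(D)
```

Row `i` of `counts` holds the word frequencies of document `i`, and `vocab` gives the word for each column.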
SAMPLE OUTPUT
The documents in D, each assigned to a category.
REFERENCES

 M.RamaKrishna Murthy, J.V.R Murthy, Prasad Reddy PVGD, S.C.Satapathy,
“A Survey of Cross-Domain Text Categorization Techniques”.
 Elisabeth Lex, Christin Seifert, Michael Granitzer and Andreas Juffinger,
“Efficient Cross-Domain Classification of Weblogs”, International Journal of
Intelligent Computing Research (IJICR), Volume 1, Issue 1/2, March/June 2010.
