Table of Contents
Unit I: Introduction
Modes of Research Engine
Purpose of Research Engine
Technologies to be used
Hardware Requirements
Software Requirements
iBlue Class hierarchy
REVISIONS
1. Research Engine Alpha (REiBlue_1.0.1.1086): Initial design of the Research Engine (semantic search engine), using NLP for query processing and refinement. Authors: Salaikumar @ Saravanan and A. Kirubaharan.
2. Research Engine Matured Alpha (REiBlue_1.0.3.2841): Improved design using world-class standard components, implementing a knowledge engine, code search, and a public SPARQL endpoint for Linked Data. Authors: Salaikumar @ Saravanan, Kirubaharan A, Ashwanth Kumar, and Swetha S.
UNIT I
Introduction
Research Engine (code name: iBlue) is a semantic search engine for the people of the 21st century. In brief, it has the elegance of Google and the power of the Wolfram Alpha knowledge engine. It is the outcome of the efforts of four people who were not happy with the way information was available on the internet, and with the difficulty involved in getting what you want, especially when you are not sure what you want. If we want the website that contains the best cookery information, or the best-known site for movie ratings, we use Google. When we want to know how a scientific expression is derived, how exactly it is put into use, or other related science concepts, we are forced to use Wolfram Alpha (though in its alpha stage, it has a really large entity index of scientific and technological information). We need a system that combines the power of both of these technologies to provide users with what we call: "Instant Answers to all your Questions!"
Welcome to Research Engine (code name: iBlue). We hope you like using it as much as we did.
Modes of Operation
It operates in three modes.
1. Web Search - A Google-like search interface that uses a powerful clustering algorithm to categorize your results, so that you can identify what you want very easily.
2. Knowledge Engine - Works like Wolfram Alpha. All the data in the knowledge base is obtained from Wikipedia (updated till 4th April, 2010).
3. Code Search - Helps you search all open source code available on SF.net, GitHub, Google Code, and other public SVN repositories.
Technologies to be used
J2EE:
Java Platform, Enterprise Edition or Java EE is a widely used platform for server programming in the Java programming language. The Java platform (Enterprise Edition) differs from the Java Standard Edition Platform (Java SE) in that it adds libraries which provide functionality to deploy fault-tolerant, distributed, multi-tier Java software, based largely on modular components running on an application server.
Apache Hadoop:
Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.
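To make the MapReduce model concrete, here is a minimal single-process sketch of a term count, the kind of job a distributed indexer might run. It is plain Python, not Hadoop's Java API, and the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions rather than anything taken from Hadoop or iBlue.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in a single process (no distribution)."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample documents.
    docs = {"d1": "semantic search engine", "d2": "search engine index"}
    print(word_count(docs))
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the sketch only shows the data flow.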
IBM WASCE:
IBM WebSphere Application Server Community Edition (WASCE) is a free, certified Java EE 5 application server for building and managing Java applications. It is IBM's supported distribution of Apache Geronimo, which uses Tomcat as its servlet container and Axis2 for web services.
Jena API:
Jena is an open source Semantic Web framework for Java. It provides an API to extract data from and write to RDF graphs. The graphs are represented as an abstract "model". A model can be sourced with data from files, databases, URLs or a combination of these. A Model can also be queried through SPARQL and updated through SPARUL.
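The idea of an RDF graph as a "model" that can be filled and queried can be sketched without Jena itself. The following is a minimal illustration in plain Python, not Jena's Java API: the TripleModel class, its methods, and the ex: identifiers are all hypothetical. It stores (subject, predicate, object) triples and answers single-pattern queries, which is the basic building block of a SPARQL query.

```python
class TripleModel:
    """A toy RDF-style model: a set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


def sample_model():
    """Build a small model with hypothetical example data."""
    model = TripleModel()
    model.add("ex:iBlue", "rdf:type", "ex:SearchEngine")
    model.add("ex:iBlue", "ex:usesFramework", "ex:Jena")
    model.add("ex:Jena", "rdf:type", "ex:Framework")
    return model


if __name__ == "__main__":
    # Roughly: SELECT ?s ?o WHERE { ?s rdf:type ?o }
    for triple in sample_model().match(predicate="rdf:type"):
        print(triple)
```

A full SPARQL engine joins several such patterns on shared variables; this sketch covers only one pattern at a time.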
Hardware Requirements
Minimum Configuration: 1 node running a Pentium 4 processor, 1 GB RAM, 80 GB HDD, DVD optical drive, and a broadband internet connection.
Recommended Configuration: 2 to 100 nodes, each running an Intel Pentium processor, 512 MB RAM, 40 GB HDD, with high-speed network connectivity (optical fiber recommended) and an uninterrupted power supply.
Software Requirements
1. Ubuntu 10.04 or any Linux-based operating system.
2. Java 1.6, preferably from Oracle; the JAVA_HOME variable must be set to the JVM home directory.
3. The SSH package must be installed (used by Hadoop to contact other nodes on the network).
4. WebSphere Community Edition, WebSphere Application Server, or any equivalent application server must be installed.
5. IBM DB2 Express Edition or DB2 Enterprise Edition.
iBlue Class Hierarchy
The list of all classes used and implemented in iBlue can be visualized as above. Each class, with its methods and required fields, is mapped under its respective package. Some further dependencies and open source tools used in the project do not appear in this class hierarchy. You can find the complete list of components used in each module at the end of the design of each module.
UNIT II
Web Search Component
The web search module enables users of the site to perform text-based searching on the entire web.
... based on the query and semantics. This result set is then clustered by a cluster engine using the Lingo algorithm. When a user is logged in, the search results from the indexer are further improved using the user's connections and their ontology. The rest of the process, including the clustering of results and the output formats, is the same. The web component is also supported via AJAX. The WebSearch servlet also supports a REST-based search API.
The user's activity with the web search module is explained above. The same sequence is also represented in the following sequence diagram.
UserQuery: The user gives their query in the form of text or keywords to search the web.
isUserAuthenticated: Returns true if the user is logged in, otherwise false.
SearchIndex: Searches the index for patterns matching the UserQuery.
Quantization: Filters the fetched URLs based on the user's likes and dislikes, interests, activities, etc. (available only to logged-in users).
AddToWebHistory: Adds the user's search query to the WebHistory table in the system database, for later retrieval and filtering on subsequent relevant queries.
Clustering: Groups the results of the UserQuery based on the semantics of the results. It uses the Lingo algorithm to sort the search results into categories.
Result: Contains the quantized, filtered, and then clustered results for the user. It may be in any of the following forms: text/xml, text/json, text/html.
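The activities listed above can be read as a small pipeline. The sketch below is a hedged illustration in plain Python, not the WebSearch servlet itself: the in-memory INDEX, the example URLs, the user dictionary, and all function names are assumptions. In particular, clustering is reduced to grouping by a precomputed topic label, which only stands in for the Lingo algorithm, and recording history only for logged-in users is an assumption about AddToWebHistory.

```python
import json
from collections import defaultdict

# Hypothetical in-memory index: url -> (title, topic label).
INDEX = {
    "http://example.org/jaguar-cat": ("Jaguar big cat habitat", "animals"),
    "http://example.org/jaguar-car": ("Jaguar car dealership", "cars"),
    "http://example.org/jaguar-os": ("Mac OS X Jaguar release", "software"),
}
WEB_HISTORY = []  # stand-in for the WebHistory table


def search_index(query):
    """SearchIndex: return URLs whose title shares a word with the query."""
    terms = query.lower().split()
    return [url for url, (title, _) in sorted(INDEX.items())
            if any(t in title.lower().split() for t in terms)]


def quantize(urls, dislikes):
    """Quantization: drop results in topics the user dislikes."""
    return [u for u in urls if INDEX[u][1] not in dislikes]


def cluster(urls):
    """Clustering stand-in: group by topic label (NOT the Lingo algorithm)."""
    groups = defaultdict(list)
    for u in urls:
        groups[INDEX[u][1]].append(u)
    return dict(groups)


def web_search(query, user=None):
    """Run the pipeline and return the clustered result as a JSON string."""
    urls = search_index(query)
    if user is not None:  # isUserAuthenticated
        urls = quantize(urls, user["dislikes"])
        WEB_HISTORY.append((user["name"], query))  # AddToWebHistory
    return json.dumps(cluster(urls), sort_keys=True)


if __name__ == "__main__":
    print(web_search("jaguar"))
    print(web_search("jaguar", {"name": "alice", "dislikes": {"cars"}}))
```

A guest receives all three clusters, while the logged-in user with a dislike for the "cars" topic receives only two. Only the JSON output form is shown; XML and HTML rendering would be further output steps.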
3. Facebook Connect API (http://developers.facebook.com/) - Facebook's powerful APIs enable us to create social experiences to drive growth and engagement on our web site. User context is derived from Facebook's Open Graph protocol.
4. Apache Lucene (http://lucene.apache.org/) - Apache Lucene is a free/open source information retrieval software library. It is supported by the Apache Software Foundation and is released under the Apache Software License.
UNIT III
Knowledge Base Engine (KBEngine)
The Knowledge Base Engine does all the semantic processing of data from the WWW. Currently the dataset is limited to 3 million articles from Wikipedia (as of July, 2010). The data is published in Linked Data format to be compatible with Open Calais, DBpedia, OpenCyc, etc. It can also be used as an analytical engine, computational engine, inference engine, or anything you can think of. It is a very basic knowledge engine; it can be extended for use under any type of application and requirements.
Users of the component include any user (logged in or guest). No restrictions are applied, since the information has no user context related to it. The iBlue Knowledge Engine can be used to query information about any entity on the web.
Tech Buddy / SASTRA University / Tamil Nadu | iBlue Research Engine SRS v1.0.3.2841
It can also perform analysis using machine learning algorithms (at the backend); this is implemented separately and is not part of the KBEngine. Administrators can block a particular entity or an entity type (example scenarios include parental control or unmatured information).
The above activity depicts the usage of KBEngine with iBlue. The general operational activities are as follows.
GetUserQuery: Gets the user query through a REST-based or form-based medium.
TokenizeQuery: Generates the valid tokens of the query.
ASTForm: An intermediate in-memory representation of the query, ready for computation or for querying the Knowledge Store.
GetAction: Once the query is tokenized, the corresponding action is identified from the AST form of the query.
PerformAction: Once the action is identified, it is executed.
GenerateOutput: The computed value or information is then returned in the form requested by the user (JSON/XML).
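The KBEngine activities can be illustrated with a toy query of the form "<property> of <entity>". This is a minimal sketch in plain Python under stated assumptions: the KNOWLEDGE_STORE dictionary, the query grammar, the dictionary-based AST, and every function name are hypothetical and are not taken from the iBlue code base. Only JSON output is shown.

```python
import json
import re

# Hypothetical knowledge store: (entity, property) -> value.
KNOWLEDGE_STORE = {
    ("france", "capital"): "Paris",
    ("india", "capital"): "New Delhi",
}


def tokenize_query(query):
    """TokenizeQuery: lower-case the query and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", query.lower())


def ast_form(tokens):
    """ASTForm: parse '<property> of <entity>' into a small dictionary."""
    if "of" in tokens:
        i = tokens.index("of")
        prop, entity = " ".join(tokens[:i]), " ".join(tokens[i + 1:])
        if prop and entity:
            return {"action": "lookup", "property": prop, "entity": entity}
    return {"action": "unknown", "tokens": tokens}


def lookup(ast):
    """Action: fetch a property of an entity from the knowledge store."""
    return KNOWLEDGE_STORE.get((ast["entity"], ast["property"]))


def unknown(ast):
    """Fallback action for queries the toy grammar cannot parse."""
    return None


ACTIONS = {"lookup": lookup}


def get_action(ast):
    """GetAction: pick the handler named by the AST."""
    return ACTIONS.get(ast["action"], unknown)


def generate_output(query, value):
    """GenerateOutput: serialize the answer as JSON."""
    return json.dumps({"query": query, "result": value}, sort_keys=True)


def answer(query):
    ast = ast_form(tokenize_query(query))
    return generate_output(query, get_action(ast)(ast))  # PerformAction


if __name__ == "__main__":
    print(answer("Capital of France"))
```

New kinds of queries would be supported by extending ast_form and registering further handlers in ACTIONS.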
UNIT IV
Code Search
Code search is an add-on implementation of the search feature, concentrated purely on indexing and searching open source code available online from Apache, Google Code, and SF.net. The programs are classified by the programming language used, package, license under which the code is available, class type, method name, file name pattern, and also by custom keywords from the user.
The code search module is very similar to the Google Code Search feature. As said, it is an add-on implementation for the iBlue spider (the web crawler used by all the modules). It mainly concentrates on crawling and indexing open source projects in SVN repositories. Currently it crawls SF.net, the Apache top-level projects, and Google Code Eclipse Labs projects.
Users can search by programming language, product license, file name pattern, custom user query, class, and methods. Users can also contribute an SVN URL for indexing. Administrators, at the backend, can delete any SVN repository that is already indexed by the crawler. The Code Index is available as a service, so that users writing any IDE can use it to provide real-time code completion suggestions.
The above diagram represents the sequence of actions that take place in the system with respect to the module. The user makes a query to the Code Search server, which accepts the query and searches the index. The weighted list of results is then returned to the user. Meanwhile, the user's client has the PrettyPrintCode JS framework (similar to Bespin).
The above activity depicts the typical usage of the code search module. Each activity method is described as follows:
CodeQuery: The query from the user for the code, containing various parameters such as language, package, type, class, method, file (regular expression), and license.
CodeIndexer: Analyses the CodeQuery against the code index to determine any valid search patterns.
GetRequiredIndexParameter: Returns the required properties of the index object corresponding to the user's code query.
CodeSearcher: Searches the indices for valid pattern matches for the CodeQuery.
PrettyPrintOutput: Outputs the result in the preferred format (text/json, text/xml or text/html) after formatting the code.
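A query with optional parameters, matched against an index of code files, can be sketched as follows. This is an illustration in plain Python, not the iBlue code index: the CodeEntry and CodeQuery classes, their field names, and the sample index entries are all hypothetical. It covers only a subset of the parameters (language, license, class, method, and a regular expression on the file name) and omits result weighting and pretty-printing.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeEntry:
    """One indexed source file (hypothetical schema)."""
    file: str
    language: str
    license: str
    class_name: str
    methods: tuple


@dataclass
class CodeQuery:
    """Search parameters; None means 'do not filter on this field'."""
    language: Optional[str] = None
    license: Optional[str] = None
    class_name: Optional[str] = None
    method: Optional[str] = None
    file_pattern: Optional[str] = None  # regular expression on the file path


# Hypothetical sample index.
CODE_INDEX = [
    CodeEntry("src/Indexer.java", "java", "Apache-2.0", "Indexer", ("open", "addDocument")),
    CodeEntry("src/Crawler.java", "java", "GPL-3.0", "Crawler", ("fetch", "parse")),
    CodeEntry("tools/crawl.py", "python", "Apache-2.0", "Crawler", ("fetch",)),
]


def code_search(query, index=CODE_INDEX):
    """Return the files of all entries that satisfy every given parameter."""
    results = []
    for entry in index:
        if query.language and entry.language != query.language:
            continue
        if query.license and entry.license != query.license:
            continue
        if query.class_name and entry.class_name != query.class_name:
            continue
        if query.method and query.method not in entry.methods:
            continue
        if query.file_pattern and not re.search(query.file_pattern, entry.file):
            continue
        results.append(entry.file)
    return results


if __name__ == "__main__":
    print(code_search(CodeQuery(language="java", method="fetch")))
```

A production index would not scan every entry; the linear filter is used here only to keep the matching rules readable.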