Intranet Search Engine

A Project Report
submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in COMPUTER ENGINEERING
By
Mr. Rahul U. Saluankhe (20070643) Mr. Anil R. Satao (20070646) Mr. Tage Nobin (20070653)
Under the guidance of
Prof. H.A. Akarte
DEPARTMENT OF COMPUTER ENGINEERING DR. BABASAHEB AMBEDKAR TECHNOLOGICAL UNIVERSITY Lonere-402 103, Tal. Mangaon, Dist. Raigad (MS) INDIA May, 2011
Certificate
The project report entitled Intranet Search Engine submitted by Mr. Rahul U. Saluankhe (20070643), Mr. Anil R. Satao (20070646) and Mr. Tage Nobin (20070653) is approved for the partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Engineering.
( Prof. H.A. Akarte ) Guide Dept. of Computer Engineering
( Dr. Girish V. Chowdhary ) Head Dept. of Computer Engineering
External Examiner(s) 1. ______________ (Name: )
2. ______________ (Name: )
Place: Dr. Babasaheb Ambedkar Technological University, Lonere. Date: 12/05/2010
Acknowledgements
First and foremost, we would like to thank our guide, Prof. H.A. Akarte, for his guidance and support. We will forever remain grateful for the constant support and guidance he extended in making this project. Through our many discussions, he helped us to form and solidify ideas. The invaluable discussions we had with him, the penetrating questions he put to us and his constant motivation have all led to the development of the ideas presented in this project. We wish to express our sincere thanks to the Head of Department, Dr. Girish V. Chowdhary, the other teaching staff, Prof. Arvind W. Kiwelekar and Prof. Mrs. M. D. Laddha, and the departmental staff members for their support. We would also like to thank our wonderful colleagues for listening to our ideas, asking questions and providing feedback and suggestions for improving our ideas. Finally, we would like to thank all whose direct and indirect support helped us in completing the seminar in time.
Mr. Rahul U. Saluankhe (20070643) Mr. Anil R. Satao (20070646) Mr. Tage Nobin (20070653)
Abstract
Enormous amounts of information are available on intranets. This information is only useful if it can be retrieved in an accurate and timely manner. An intranet search engine has become a necessity because organizations lack an efficient way to disseminate useful information to their members. As intranets become more commonplace, the ability to search their content becomes more important. Without a search engine, much of the content that makes up an intranet is lost. With so much importance placed on this tool, it is imperative that thoughtful planning precede the implementation. Thus our project aims to help the user to search and access text information. Intranet search engines provide vital access to shared documents, open the resources of the institution, extend the value of research, and provide information across distance and time.
Contents
1 Introduction
  1.1 What is Intranet?
    1.1.1 Features
  1.2 Scope of the Project
  1.3 Requirement Specifications
    1.3.1 Functional Requirements
    1.3.2 Non-Functional Requirements
    1.3.3 Software Requirements
    1.3.4 Hardware Requirements
  1.4 The Difference between Intranet and Internet Design
2 Problem Definition
  2.1 The Need
  2.2 Objectives of Project
3 Mechanization of Search Engine
  3.1 Processes of Search Operation
    3.1.1 Gathering
    3.1.2 Indexing
    3.1.3 Crawling
    3.1.4 Searching Agent
  3.2 Analysis of Search Engine
4 Evaluation of Search Engine
  4.1 Important features of Intranet search engine
    4.1.1 Search Functionality
  4.2 Multi-level approach
    4.2.1 Data Gathering
    4.2.2 Index
    4.2.3 Search Features
    4.2.4 Operation and Maintenance
5 Deploying an Intranet Search Engine
  5.1 Use of Search Engine
  5.2 Data Source and File Types
  5.3 Processing of Query
  5.4 Designing the Interface
    5.4.1 Search Page
    5.4.2 Result Page
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
Bibliography
List of Figures
1.1 Model of Intranet
3.1 Components of Search Engine
3.2 Basic Information Retrieval Process
6.1 Search Engine User Interface
Chapter 1 Introduction
Although search over World Wide Web pages has recently received much academic and commercial attention, surprisingly little research has been done on how to search the web pages within intranets, whether large or small, which are highly diverse. Intranets contain the information associated with the internal workings of an organization. The intranet creates new challenges for information retrieval. The amount of information on the intranet is growing rapidly, as is the number of new users inexperienced in the art of intranet research. Earlier works that compared intranets and the Internet from the viewpoint of keyword search have pointed to several reasons why the search problem is quite different in these two domains. In this project, we address the problem of providing quality answers to navigational queries over the intranet.
As intranets grow, providing access to more and more documents, their value grows. The larger the collection, the harder it becomes to find that important presentation, contract, or HR form. Enterprise Information Portals provide a starting point to intranets, and a search engine helps locate information, including archives and unstructured data. Search engines need to be tuned and indexed to provide the best answers. Our approach is based on crawler identification of navigational pages, intelligent generation of term variants to associate with each page, and the construction of separate indices exclusively devoted to answering navigational queries. This chapter outlines the aims of the project and the motivation behind its implementation.
1.1 What is Intranet?
An intranet is a private computer network that uses Internet protocols and network connectivity to securely share any part of an organization's information or operational systems with its employees.
1.1.1 Features of Intranet
Sometimes the term refers only to the organization's internal website, but often it is a more extensive part of the organization's computer infrastructure, and private websites are an important component and focal point of internal communication and collaboration. An intranet is built from the same concepts and technologies used for the Internet, such as clients and servers running on the Internet Protocol Suite (TCP/IP). Any of the well-known Internet protocols may be found in an intranet, such as HTTP (web services), SMTP (email), and FTP (file transfer). Intranets differ from extranets in that the former are generally restricted to employees of the organization, while extranets may also be accessed by customers, suppliers, or other approved parties. Intranets are being used to deliver tools and applications, e.g., collaboration (to facilitate working in groups and teleconferencing) or sophisticated corporate directories, sales and customer relationship management tools, project management etc., to advance productivity. Intranets are also being used as corporate culture-change platforms. For example, large numbers of employees discussing key issues in an intranet forum application could lead to new ideas in management, productivity, quality, and other corporate issues.
Just one example of improved usability from taking advantage of managed diversity: an intranet search engine can take advantage of weighted keywords to increase precision. Weights are impossible on the open Internet, since every site about widgets will claim to have the highest possible relevance weight for the keyword "widget." On an intranet, even a light touch of information management should ensure that authors assign weights reasonably fairly and that they use, say, a controlled vocabulary correctly to classify their pages. An intranet is a network of computers that can be accessed only by an authorized set of users within an organization. Its purpose is typically to share information and computing resources among employees within an organization.
The term search engine is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways. Crawler-based search engines, such as Google, create their listings automatically. Human-powered directories, such as the Open Directory, depend on humans for their listings. In a directory, the search looks for matches only in the descriptions submitted, so changes to any of the web pages have no effect on the listing. The only exception is that a good site, with good content, might be more likely to be reviewed. There are two types of intranet search, namely desktop-based and web-based. Desktop-based search addresses the whole spectrum of electronic information that might be found in an organization, including video, images, databases etc.
Figure 1.1: A model of Intranet
1.2 Scope of project
This project can be used by the various clients who want to search for shared documents scattered all over the intranet.
1.3 Requirement Specifications
1.3.1 Functional Requirements
Query Box: There should be a query box in which the user types the name of the file being searched for.
Search Button: This button initiates the search operation over the intranet with the text typed by the user.
Result Box: The results obtained for the user's query are displayed along with the names of the systems on which the files are stored.
1.3.2 Non-Functional Requirements
Security: Only files stored on systems that we have prior permission to access are displayed, thus preventing unauthorized access.
Database: Integrity should be maintained and all the constraints should be satisfied.
Platform Independence: Written using 100 percent pure Java code.
1.3.3 Software Requirements
The following software has been used for the project.
Windows Platform: Microsoft Windows has been used as the platform for coding.
NetBeans IDE: NetBeans has been used for developing the project code in Java.
1.3.4 Hardware Requirements
PC with 2 GB hard disk and 256 MB RAM
RJ-45 LAN cables and LAN connectors
1.4 The Difference between Intranet and Internet Design
Your intranet and your public website on the open Internet are two different information spaces and should have two different user interface designs. It is tempting to try to save design resources by reusing a single design, but it is a bad idea to do so because the two types of site differ along several dimensions:
Users differ. Intranet users are your own employees who know a lot about the company, its organizational structure, and special terminology and circumstances. Your Internet site is used by customers who will know much less about your company and also care less about it.
The tasks differ. The intranet is used for everyday work inside the company, including some quite complex applications; the Internet site is mainly used to find out information about your products.
The type of information differs. The intranet will have many draft reports, project progress reports, human resource information, and other detailed information, whereas the Internet site will have marketing information and customer support information.
The amount of information differs. Typically, an intranet has between ten and a hundred times as many pages as the same company's public website. The difference is due to the extensive amount of work-in-progress that is documented on the intranet and the fact that many projects and departments never publish anything publicly even though they have many internal documents.
Bandwidth and cross-platform needs differ. Intranets often run between a hundred and a thousand times faster than most Internet users' Web access, which is stuck at low-band or mid-band, so it is feasible to use rich graphics and even multimedia and other advanced content on intranet pages. Also, it is sometimes possible to control what computers and software versions are supported on an intranet, meaning that designs need to be less cross-platform compatible (again allowing for more advanced page content).
Most basically, your intranet and your website are two different information spaces. They should look different in order to let employees know when they are on the internal net and when they have ventured out to the public site. Different looks will emphasize the sense of place and thus facilitate navigation. Also, making the two information spaces feel different will facilitate an understanding of when an employee is seeing information that can be freely shared with the outside and when the information is internal and confidential.
An intranet design should be much more task-oriented and less promotional than an Internet design. A company should only have a single intranet design, so users only have to learn it once. Therefore it is acceptable to use a much larger number of options and features on an intranet since users will not feel intimidated and overwhelmed as they would on the open Internet where people move rapidly between sites. An intranet will need a much stronger navigational system than an Internet site because it has to encompass a larger amount of information. In particular, the intranet will need a navigation system to facilitate movement between servers, whereas a public website only needs to support within-site navigation.
Chapter 2 Problem definition

Today's age is better known as the Information Age. The world runs on information. According to the Data Warehousing Institute, the amount of data available doubles every 6 months. A great deal of information resides on the private LANs or intranets of organizations, so a lot of manpower is needed to extract the right information from data scattered across an intranet. An obvious reason for poor enterprise search is that a high-performing text retrieval algorithm developed in the laboratory cannot be applied to the enterprise search problem without extensive engineering, because of the complexity of typical enterprise information spaces. As organizations develop more and more information, there is a need to sort the data and information in a systematic manner and make it available to intranet users on request, so that they can decide what is necessary and take appropriate action. Our project lends a helping hand in this regard by putting the information present on the intranet at the user's fingertips. Our aim is to provide an effective, efficient and systematic search engine that works on a local area network. In other words, our INTRANET SEARCH ENGINE is effective in terms of search, efficient in terms of time and systematic in representation.
2.1 The Need
1. The need to respect fine-grained individual access-control rights, typically at the document level; thus two users issuing the same search/navigation request may see differing sets of documents due to the differences in their privileges.
2. The need to index and search a large variety of file types (formats), such as PDF, Microsoft Word and PowerPoint files, etc.
3. The need to seamlessly and scalably combine structured (e.g. relational) as well as unstructured information in a document for search, as well as for organizational purposes (clustering, classification, etc.) and for personalization.
An effective search tool on an intranet can make an enormous difference to its usability. A good search engine ensures that users find what they're looking for, first time, regardless of the format or location of the information. This means that a wide variety of information can be effectively dispersed and made available to staff, without the need for complex navigation systems or filing conventions.
Our project aims to help the user to search and access text information. The search will be content based. As stated earlier, a large amount of information is available for the user to access on the intranet. Only the specific information required needs to be searched, sorted and presented to the user in a systematic manner, thus increasing the availability of useful information. Access will be given only to data that is shared, thus preventing unauthorized access.
2.2 Objectives of project
The objective is to implement a centrally managed intranet search engine that helps a client to search for files over the intranet. A client can execute search operations as needed. The files, if present, are displayed on the same search form. This project thus provides an enhanced capability for searching over the intranet.
Chapter 3 Mechanization of Search Engine

An intranet search engine is much the same as a Web-wide search engine. The search engine locates the documents, extracts the text, and stores it in an index file, making an entry for each word. When an end-user or employee types a word into a form and clicks the Search button, the browser sends it to the server. The search engine receives the search query, looks for matching words in the index file, gathers related document information, sorts the documents by relevance, formats the results appropriately, and sends the page back to the user. Several indexing aspects require attention from the intranet site manager. Indexing integrates content from many sources: pages on internal sites, content management systems etc.
3.1 Processes of Search Operation
Various processes and entities are involved in finding the results for the query the user has entered.
3.1.1 Gathering
The index should be kept current: as soon as new content is published, it should be indexed. Publishing or content management systems can notify the indexer of new data; otherwise, index the frequently changing areas more often. If the search engine cannot respond to queries while updating, use mirrored servers or switch search engines.
3.1.2 Indexing
In addition to HTML, XML, and text, intranet search engines deal with binary file formats such as PDF, MS Office formats (including Word, Excel, and PowerPoint), WordPerfect, and others. The index should store the entire content of every file, even very long documents. It should keep every word and the word's position in the document, for later phrase searching and match highlighting. Intranets generally include various levels of security and access controls, and the index should store this information, so it can show only the accessible content in the search results. For high-security content, it is a good idea to create a separate index file to avoid co-mingling private and public text.
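The idea of keeping every word together with its position can be illustrated with a small positional index. The following is only a minimal sketch, not the project's actual indexer: the class name PositionalIndex, its methods, and the tokenizing rule (lower-casing and splitting on anything that is not a letter or digit) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration: maps each word to the documents and word positions where it occurs.
class PositionalIndex {
    // word -> (document name -> positions of the word in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    // Split text into lower-case words; anything that is not a letter or digit separates words.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Record every word of the document together with its position.
    void addDocument(String docName, String text) {
        List<String> words = tokenize(text);
        for (int pos = 0; pos < words.size(); pos++) {
            postings.computeIfAbsent(words.get(pos), k -> new HashMap<>())
                    .computeIfAbsent(docName, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Return the documents in which the words of the phrase occur at consecutive positions.
    Set<String> searchPhrase(String phrase) {
        Set<String> hits = new TreeSet<>();
        List<String> words = tokenize(phrase);
        if (words.isEmpty()) {
            return hits;
        }
        Map<String, List<Integer>> first =
                postings.getOrDefault(words.get(0), Collections.emptyMap());
        for (Map.Entry<String, List<Integer>> entry : first.entrySet()) {
            for (int start : entry.getValue()) {
                if (matchesAt(entry.getKey(), start, words)) {
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    // Check that word i of the phrase sits at position start + i in the given document.
    private boolean matchesAt(String doc, int start, List<String> words) {
        for (int i = 1; i < words.size(); i++) {
            List<Integer> positions = postings
                    .getOrDefault(words.get(i), Collections.emptyMap())
                    .getOrDefault(doc, Collections.emptyList());
            if (!positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }
}
```

Because positions are stored, a phrase query only matches documents where the words are adjacent and in order; the same positions could also drive match highlighting. Access-control information, which the text says the index should also hold, is left out of this sketch.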
Figure 3.1: Components of Search Engine
3.1.3 Crawling
The general algorithm backtracks to the root directory and reaches new web pages by following their links. The process continues until the entire website (intranet) is indexed. In addition, our crawler is able to recognize duplicate pages and discard them.
3.1.4 Searching agent
This is the tool that resides on the client side and is triggered by the server with a keyword; it searches only the client on which it resides and returns the result to the server. Each client has a searching agent. When a new search request reaches the server, the server first searches its own index (database). If matches are found, it returns them as the response; if not, it triggers the searching agents on all clients and collects their replies.
When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching pages/files according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search. The engine looks for the words or phrases exactly as entered. Natural language queries allow the user to type a question in the same form one would ask it of a human.
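The interaction between the server index and the client-side searching agents can be sketched as follows. This is an illustrative outline under stated assumptions rather than the project's code: SearchAgent, SearchServer and their methods are hypothetical names, agents are modelled as in-process objects instead of programs on remote machines, and results are plain location strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side agent: searches only the machine it runs on.
interface SearchAgent {
    List<String> search(String keyword);
}

// Hypothetical server: answers from its own index first, then falls back to the agents.
class SearchServer {
    private final Map<String, List<String>> index = new HashMap<>();
    private final List<SearchAgent> agents = new ArrayList<>();

    // Store a known location for a keyword in the server index.
    void addToIndex(String keyword, String location) {
        index.computeIfAbsent(keyword.toLowerCase(), k -> new ArrayList<>()).add(location);
    }

    void registerAgent(SearchAgent agent) {
        agents.add(agent);
    }

    List<String> search(String keyword) {
        String key = keyword.toLowerCase();
        List<String> indexed = index.get(key);
        if (indexed != null && !indexed.isEmpty()) {
            return new ArrayList<>(indexed);      // found in the server index
        }
        List<String> results = new ArrayList<>();
        for (SearchAgent agent : agents) {        // not indexed: ask every client agent
            results.addAll(agent.search(key));
        }
        return results;
    }
}
```

In a deployment, each agent call would be a network request to a client machine, so timeouts and unreachable clients would need handling; the sketch only shows the order in which the sources are consulted.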
Figure 3.2: Basic Information Retrieval Process
3.2 Analysis of Search Engine
The search engine needs to be analyzed so that the software can be optimized. For this purpose, we need to understand its pros and cons.
3.2.1 Pros
Search engines provide access to a fairly large portion of the publicly available pages on the Internet and on intranets, which are themselves growing exponentially. Search engines are the best means devised yet for searching the Internet and intranets. Stranded in the middle of this global electronic library of information without either a card catalog or any recognizable structure, how else are you going to find what you're looking for?
3.2.2 Cons
On the down side, the sheer number of words indexed by search engines increases the likelihood that they will return hundreds of thousands of responses to simple search requests. Remember, they will return lengthy documents in which your keyword appears only once. Additionally, many of these responses will be irrelevant to your search.
Chapter 4 Evaluation of Intranet Search Engine

Any intranet search engine should be developed according to the requirements of the environment in which it will be used. Our studies suggest, however, that some generic functions are almost the same across all intranet search engine deployments.
4.1 Important Features for Intranet Search
Search functionality is divided into several parts: the search form and query options, the search engine retrieval and relevance ranking, and results display.
1. Search Functionality
When the user clicks the Search button, the browser sends a query to the search engine server, which looks for the words in the index file. Some search engines use stemming to locate singular and plural forms of words. Once it locates the matches, the search engine gets information about the associated documents, such as URLs and titles. It sorts the documents by relevance, as defined by an internal set of rules, by frequency of matched terms in the documents, phrases, and location in the document.
2. Search Results Pages
Search results are not a place to surprise users with experimental interfaces. It is best to conform to the basic conventions of Web search results, with a listing of documents showing titles and descriptions. The Internet can be used to identify useful features.
3. Search Problems and No-Matches Pages
Searches fail for various reasons:
The user forgets to type anything in the search field.
The user is searching for text that is not in the scope of the index.
The user is using a term that is not used in the index (such as sick day vs. PTO).
The user has made a spelling or typing mistake.
The user is doing a search in which all the query requirements are not met (for example, one word was matched but the other was not).
To avoid common search failures, create a page that explains these errors and helps users understand what is within the scope of the search engine. If a taxonomy or hierarchy exists, display it on the page to allow users to drill down through the categories.
4. Search Log Analysis
Search logs are a great window into the minds of intranet users. It is helpful if the search log tracks each query and its number of matches, since this makes it possible to count the 25 or 100 most popular search terms and to make sure these topics are adequately covered. It is also possible to track the most common terms that do not find matches and to address these problems.
5. The Indexer
Full-text indexing literally creates a virtual copy of the entire website. This option remains feasible because it covers only the intranet. With a full-text index, content can be examined more closely, which should yield more precise results. The first step is to initiate the creation of an index; this index will contain location information for each and every word in all of your documents. The index is created outside the files and does not affect them in any way. Indexed documents are typically specified according to directory and extension. There can either be one index for all of the files, or several separate indexes, each for a different project. The indexes are updated automatically when new documents are created or existing documents are changed. However, any change to the table structure, such as configuration data, requires a complete rebuild of the full-text index. Once there is an index, it can be used to locate, view and retrieve information. Using the indexes created, the search query can be used to locate the required information in your documents. Results are displayed almost instantly despite the relatively large size of the index, which demonstrates the speed advantage of using indexes.
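The search log analysis described in item 4 can be sketched in a few lines. This is a hedged illustration, not part of the original project: the LogEntry record format (query text plus number of matches) and the SearchLogAnalyzer class are assumptions, and queries are compared after trimming and lower-casing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical log record: the query text and how many documents matched it.
class LogEntry {
    final String query;
    final int matches;

    LogEntry(String query, int matches) {
        this.query = query;
        this.matches = matches;
    }
}

class SearchLogAnalyzer {
    // Return the n most frequent queries, most frequent first; ties are broken alphabetically.
    // With onlyFailed set, only queries that found no matches are counted.
    static List<String> topQueries(List<LogEntry> log, int n, boolean onlyFailed) {
        Map<String, Integer> counts = new HashMap<>();
        for (LogEntry entry : log) {
            if (onlyFailed && entry.matches > 0) {
                continue;
            }
            counts.merge(entry.query.trim().toLowerCase(), 1, Integer::sum);
        }
        List<String> queries = new ArrayList<>(counts.keySet());
        queries.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(queries.subList(0, Math.min(n, queries.size())));
    }
}
```

Calling topQueries(log, 25, false) would give the 25 most popular search terms, while topQueries(log, 25, true) would list the most common terms that found nothing, which are candidates for new content or synonyms.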
4.2 Multi-level Approach
We developed a multi-level approach that comprises four levels.
4.2.1 Data gathering
Most organizations have legacy data in formats other than HTML, e.g. Adobe's PDF, MS Office, FrameMaker, Lotus Notes, PostScript, and plain ASCII text. The spider should at least be able to correctly interpret and index the most frequently used or the most important of these formats. If meta-information and XML tags are likely to show up within the documents, the spider must be able to interpret such tags, and it would also be useful if RDF-formatted information could be gathered intelligently. If USENET newsgroups need to be indexed, the spider must be able to crawl through them. The same goes for client-side image maps, CGI scripts, ASP-generated pages, pages using frames, and Lotus Domino servers. Although frames are frequently used within many companies, spiders, which generally work their way round the net by picking up and following hypertext links, may not be able to correctly interpret the different syntax used for framed pages. These links could end up being ignored. Spidering Domino servers using HTTP requests requires the search engine to be able to intelligently filter out the many collapsed/expanded versions of the same page, or the index will quickly be filled with duplicates. Another, and arguably better, way would be to access Domino servers via the provided APIs.
Another situation that is likely to require access via APIs rather than crawling through HTTP is when Content Management (CM) systems are used. In CM tools, the actual content of a page is stored separately from the page layout information. Since pages are rendered dynamically only when requested by a user (via her browser), the spider may not be able to pick up the link information that is embedded in the page code. Without those links, the spider will not be able to find the information. Even if the information is found and indexed correctly, it might be difficult for the search engine to understand how to display a search result, since the information that has been indexed may belong to several dynamic pages. This is an area not yet fully explored by search engine vendors, and proposed solutions should be investigated carefully.
Intelligent robots are able to detect copies or replicas of already indexed data while crawling, and advanced search engines can index active sites, e.g. sites that update frequently, more often than sites that are more passive. If this is not supported, some manual means of determining time-to-live should be provided. There should be some means of restricting the robot from entering certain areas of the net, including any desired domain, sub-net, server, directory, or file level.
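Two of the spider capabilities mentioned above, detecting replicas of already indexed data and keeping the robot out of restricted areas, can be sketched with a small in-memory crawl. This is only an illustration under stated assumptions: Page and Spider are hypothetical names, the site is a map from URL to page instead of a live server, restricted areas are given as URL prefixes, and replicas are detected by comparing page text directly where a real spider would more likely compare digests.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory page: its text and the URLs it links to.
class Page {
    final String content;
    final List<String> links;

    Page(String content, List<String> links) {
        this.content = content;
        this.links = links;
    }
}

class Spider {
    // Breadth-first crawl from startUrl; returns the URLs that would be indexed, in visit order.
    static List<String> crawl(Map<String, Page> site, String startUrl, List<String> excludedPrefixes) {
        List<String> indexed = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Set<String> seenContent = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url) || isExcluded(url, excludedPrefixes)) {
                continue;                          // already seen, or inside a restricted area
            }
            Page page = site.get(url);
            if (page == null) {
                continue;                          // broken link
            }
            if (!seenContent.add(page.content.trim())) {
                continue;                          // replica: neither indexed nor followed
            }
            indexed.add(url);
            queue.addAll(page.links);
        }
        return indexed;
    }

    private static boolean isExcluded(String url, List<String> excludedPrefixes) {
        for (String prefix : excludedPrefixes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

The visited set also prevents the crawl from looping when pages link back to each other.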
Also, check if search depth can be set to avoid loops when indexing dynamically generated pages. Support for proxy servers and password handling can be useful, as can the ability to not only follow links but also detect directories and thus find files not linked to from other pages. The spider should be easy to set up and start. Check how the URLs from which to start are specified as well as if the users may add URLs. Finally, the Robot Exclusion Protocol provides a way for the webmaster to tell the robot not to index a certain part of a server. This should be used to avoid indexing temporary files, caches, test or backup copies, as well as classified information such as password files. 4.2.2 Index Although a good index alone does not make a good search engine, the index is an essential part of a search tool. One of the most important issues is keeping the index up-to-date, and the best way to do that is to allow real-time updates. There is a big difference between indexing the full text or just a portion. Though partial indexing saves disk space it may prevent people from finding what they are looking for. The portion of text being indexed also affects the data that is
presented as the search result. Some tools only show the first few lines while others may generate an automatic abstract or use meta-information. If the organization consists of several sub-domains, users might only want to search their specific sub-domain. Allowing the index to be divided into multiple collections might then speed up the search. It may also prove useful to be able to split the index into several collections even though they are kept at one physical location. For example, one may want separate collections for separate topics or business areas. Some tools support linguistic features such as automatic truncation or stemming of the search terms, where the latter is a more sophisticated form that usually performs better. If the organization is located in non-English speaking countries the ability to correctly handle national characters becomes important. Also, note that some products cannot handle numbers. If number searching is required, e.g. serial numbers, this limitation should be taken into consideration. Should words that occur too frequently be removed from the index? Some engines have automatically generated stop-lists, while others require the administrator to remove such words manually. Search engines are of little use if an overview of the indexed data is wanted, unless they are able to categorize the data and present that data as a table of content. Automatic categorization may also be used to focus in on the right sub-topic after having received too many documents. If information about when a particular URL is due for indexing is available, it is useful to make it accessible to the user.
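As a rough illustration of the indexing issues above (full-text indexing, stop-lists, national characters, numbers, and truncation), the following sketch builds a small inverted index. The stop-list, the sample documents and the function names are assumptions made for the example, not part of any evaluated product.

```python
import re
from collections import defaultdict

# Hypothetical stop-list; a real engine might generate it from term frequencies.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "for"}


def tokenize(text):
    """Lower-case the text and split it into letter/digit runs.

    The pattern is Unicode-aware, so national characters such as 'ö'
    and pure numbers such as serial numbers are kept as terms.
    """
    tokens = re.findall(r"[^\W_]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, term, truncate=False):
    """Exact lookup, or right-truncation (prefix match) when truncate=True."""
    term = term.lower()
    if not truncate:
        return set(index.get(term, set()))
    hits = set()
    for indexed_term, doc_ids in index.items():
        if indexed_term.startswith(term):
            hits |= doc_ids
    return hits


docs = {
    "hr/1": "Leave policy for the Göteborg office",
    "eng/7": "Serial number 4711 of the pump",
    "eng/8": "Pumping station manual",
}
index = build_index(docs)
```

With this index, lookup(index, "4711") finds the serial number, lookup(index, "göteborg") handles the national character, and lookup(index, "pump", truncate=True) also matches "pumping". The document id prefixes (hr/, eng/) hint at how one index could be split into separate collections.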
4.2.3 Search features The user query and the search result interfaces are often sadly confusing and unpredictable, and it has been argued that the text-search community would greatly benefit from a more consistent terminology. Since we do not yet have this consistency, evaluation of the search features must be done with great care. Different vendors use different names for the same feature, or the same name for different features. Though a Boolean-type search language is often offered, most users do not feel comfortable with Boolean expressions. Instead, studies have shown that the average user enters only 1.5 keywords. Due to the vocabulary problem, the user is likely to receive many irrelevant documents as a result of a one-keyword search. Natural language queries have been shown to yield more search terms and better search results, even when performed by skilled IR personnel. Apart from Boolean operators, a number of more or less sophisticated options (e.g. full text search, fuzzy search, require/exclude, case sensitivity, field search, stemming, phrase
recognition, thesaurus, or query-by-example) are usually offered. One feature to look for in particular is proximity search, which lets the user search for words that appear relatively close together in a document. Proximity search capability has been noted to have a positive influence on precision. Many organizations prefer to have a common company look on all their intranet pages. This requires customization that may include anything from changing a logo to replacing entire pages or chunks of code. Again, this is an aspect irrelevant to public search services but something an intranet search engine might benefit from. Sometimes a built-in option allows the user to choose a simple or an advanced interface. It should also be possible to customize the result page. The user could be given the opportunity to select the level of output, e.g., by specifying compact or summary. Further, search terms may be highlighted in the retrieved text, the individual word count can be shown, or the last modification date of the documents may be displayed. It can also be possible to restrict the search to a specific domain or server, or to search previously retrieved documents only. For the latter, relevance feedback is a very important way to improve results and increase user satisfaction. Ranking is usually done according to relevancy of some form. However, the true meaning of the ranking is normally hidden to the user, and only presented as a number or percentage on the result page. More sophisticated ways to communicate this important information to the user have been developed, but not many of the commercially available products have yet incorporated such features. However, the possibility to switch between relevancy and date is often supported. Dividing the results into specific categories might help the user to interpret the returned result. Finally, ensure the product comes with good and extensive online user documentation.
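Proximity search, singled out above, can be sketched with a positional index that records where each word occurs in each document. This is only an illustration under simple assumptions (whitespace-like tokenization, distance counted in words); the names build_positional_index and near are hypothetical.

```python
import re
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> document id -> list of word positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            index[term][doc_id].append(position)
    return index


def near(index, first, second, max_distance):
    """Return documents where the two terms occur within max_distance words."""
    first_postings = index.get(first.lower(), {})
    second_postings = index.get(second.lower(), {})
    hits = set()
    # Only documents containing both terms can match.
    for doc_id in first_postings.keys() & second_postings.keys():
        if any(
            abs(p - q) <= max_distance
            for p in first_postings[doc_id]
            for q in second_postings[doc_id]
        ):
            hits.add(doc_id)
    return hits


docs = {
    "d1": "The returns policy for every refrigerator is printed on the invoice",
    "d2": ("Returns are handled by the service desk on the second floor of "
           "the main building next to the refrigerator showroom"),
}
index = build_positional_index(docs)
```

Both documents contain both words, but near(index, "returns", "refrigerator", 10) keeps only d1, where the words are four positions apart. This is the sense in which proximity search tends to improve precision over a plain AND query.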
4.2.4 Operation and maintenance Hosting a search service requires considerations not necessary when using a public search engine. For example, operations and maintenance issues are of no importance to public search engine evaluations, but for an internal search service, they are of course highly interesting. Start by checking if the product is available on many platforms or if it requires the organization to invest in new and unfamiliar hardware. If the intranet consists of one server only, a spider is not needed, but as the web grows, crawling capabilities become essential. A spider allows the net to grow without forcing the webmasters to install indexing software locally. An intranet search engine often runs on a single machine and is operated and maintained by people with knowledge about servers, but not necessarily experts in spider technology. This suggests that a good intranet spider should be designed specifically for an intranet and not just be a ported
version of an Internet spider. Still, the spider and the index must be able to handle large amounts of data without letting response times degrade or the users will be upset. For example, a product that can take advantage of multi-processor hardware scales better as the intranet grows. The product should therefore have been tested to handle an intranet of the intended size. Running the spider should not interfere with how the index is operated. Both these components need to be active simultaneously. It is found that great differences exist in how straightforward the products were to install, setup, and operate. Some required an external HTTP server while others had a built-in web server. The latter were consistently less complicated to install. However, installation is probably something done once while indexing and searching is done daily. This ratio suggests that indexing and searching features should be weighed higher than installation routines. It is difficult to estimate data collection time since it depends on the network, but during the test installation, this activity should be clocked. Also, try to determine how query response times grow with the size of the index. If an index in every city, state, or country where the organization is represented is wanted, ensure the product supports this kind of distributed operation, and check whether any bandwidth-saving technique is used. Having technical support locally is an advantage if the local support also has local competence. If questions have to be sent to a lab elsewhere, the advantage is lost. An important feature is the ability to automatically detect links to pages that have been moved or removed. If dead links cannot be detected automatically, the links should at least be easy to remove, preferably by the end-user. Allowing end-users to add links is a feature that will off-load the administrator. 
Functions like email notification to an operator, should any of the main processes die, and good logging and monitoring capabilities, are features to look for. It was found that products with a graphical administrator interface were more easily and intuitively handled, though the possibility of operating the engine via command-line commands may sometimes be desired. It should also be possible to administer the product remotely via any standard browser. Documentation should be comprehensive and adequate. Finally, consider the price: is it a fixed fee, or is it correlated to the size of the intranet? In addition, what kind of support is offered, and at what cost? Sometimes installation and training are included in the price. How long the products have been available and how often they are updated are important factors that indicate the stability of the product, and it is also important to ask about future plans and directions.
Chapter 5 Deploying an Effective Intranet Search Engine

A search engine is often the first method used to find a page, and yet most users suffer frustration and failure. More still are put off by the complexity of the search engine and the confusing manner in which the results are displayed. An effective search tool on an intranet can make an enormous difference to its usability. In fact, usability expert Jakob Nielsen found that poor search was the greatest single cause of reduced usability across intranets. A good search engine ensures that users find what they are looking for the first time, regardless of the format or location of the information. This means that a wide variety of information can be effectively dispersed and made available to staff, without the need for complex navigation systems or filing conventions. Most intranets evolve over time, and implementing search functionality need not be a daunting task. A search tool can be implemented quickly, and then refined as the intranet grows and the needs of the organization change. It is important to recognize that every intranet is different, with its own objectives, requirements and environment. A good search engine must: be easy to use; assist users to find the correct information; display results in a meaningful way; and help authors to improve the site.
5.1 Use of Search Engine? Search engines are best at finding unique keywords, phrases, quotes, and information buried in the full-text of web pages. Because they index word by word, search engines are also useful in retrieving tons of documents. If you want a wide range of responses to specific queries, use a search engine. Today, the line between search engines and subject directories is blurring. Search engines no longer limit themselves to a search mechanism alone. Across the Web, they are partnering with
subject directories, or creating their own directories, and returning results gathered from a variety of other guides and services as well. Selecting a Search Engine: Before deciding on the type of search engine, we need to determine our technical requirements. Once this is complete, we can research the currently available engines and build an effective search engine that caters to our needs.
5.2 Data Sources & File Types Once we have our objectives clearly defined, we can work out what file formats and data sources our search engine will need to support. The next step is to list every file type used in creating the information that we want to share on our intranet. These usually fall into one of three categories: 1. Unstructured formats: File formats that contain primarily text-based information. These include text files, word processor files, PDFs, emails and the formats used to create most documents. There is no real structure to these file formats and few relationships exist between elements within them. 2. Semi-structured formats: File formats that contain a mixture of text-based and database information, with a basic structure. These include file types such as HTML, spreadsheets and XML. There may be relationships between elements within these files; however, they are not as rigidly defined as they are in structured formats, and there may be sections of textual information where no structure exists. 3. Structured formats: File formats where the information is contained in a well-defined structure, such as a relational database. Many enterprise systems have a structured architecture, such as ERP and CRM systems, as well as many legacy databases. An effective intranet search engine should be able to support the large number of file types that will be present in the intranet data repository.
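The step of listing file types and sorting them into the three categories could be supported by a small inventory script. The sketch below is only illustrative: the extension table is an assumption (the report names formats, not extensions), and a real repository would need a longer table and content inspection for files without extensions.

```python
from pathlib import PurePosixPath

# Assumed mapping from file extension to the three categories described above.
CATEGORY_BY_EXTENSION = {
    ".txt": "unstructured",
    ".doc": "unstructured",
    ".pdf": "unstructured",
    ".eml": "unstructured",
    ".html": "semi-structured",
    ".xls": "semi-structured",
    ".xml": "semi-structured",
    ".mdb": "structured",
}


def categorize(path):
    """Return the category for a path, or 'unknown' for unlisted extensions."""
    suffix = PurePosixPath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")


def inventory(paths):
    """Count how many files fall into each category."""
    counts = {}
    for path in paths:
        category = categorize(path)
        counts[category] = counts.get(category, 0) + 1
    return counts


summary = inventory([
    "hr/policy.PDF",
    "hr/index.html",
    "sales/q1.xls",
    "erp/orders.mdb",
    "misc/logo.png",
])
```

A large "unknown" count in such an inventory is a useful early warning that the chosen engine will need additional format filters.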
5.3 Processing of Query For most intranets, there will be a wide spectrum of users, from very basic all the way through to highly technical power users. The search function needs to cater for all of these people, with a simple yet powerful interface that provides options for advanced searching if required. There should be three steps to the search process, and a range of features work to streamline each of these steps: (1) Entering the Query, or asking the initial question, (2) Getting the Search Results, or receiving the list of found documents back from the search engine, and (3) Finding the Right Answer, or examining and refining the search results to find the information you were looking for.
Step 1: Entering the Query. When users enter a query, they should have the option to do so using a natural language approach; that is, by simply entering the question as they would ask it, for example "What is the cost of double-deck refrigerators?" There should also be the option to build queries using Boolean operators, so that users who know exactly what they want can be extremely specific with their search, for example "returns~" within 10 words of "refrigerator" but not "freezer". Building a search engine with a simple user interface that is intuitive for basic users, while also providing powerful advanced search functionality for more experienced users, is a definite aim of ours. A good search engine should enable us to group logical chunks of information together so that searches can be conducted on specific areas of interest.
Step 2: Getting the Search Results. If there is specifically defined data, such as legal documents, a high degree of precision may be required to identify and return specific information. In other situations, however, it may be better to return a wider range of documents for a given query. The accuracy we require depends on the role of the search engine and the nature of the data. If we want to make a large volume of data available on our intranet, providing a fast search engine is important. Otherwise, users will find it frustrating to wait for the search engine to bring back the search results. With smaller amounts of data this will be less of a concern; it all depends on the volume of data that we intend to make available on the intranet.
Any good search engine should use some form of intelligent relevancy determination. This is where the search engine, based on the query entered, makes a judgment about which results will be the most relevant, and ranks them accordingly. Step 3: Finding the Right Answer. The search process doesn't stop once the user receives the list of results. Users then need to refine and manipulate the results list until they find exactly what they were looking for. There are many features that can assist in this task, some of which include:
Document summary information: The display of useful document attributes such as file type, file size, date last changed, relevancy rating and the number of hits (key words found) in the document. The display of an extract of the document, say several lines above and below the first hit, is helpful for determining the context in which the document has been returned.
Re-sorting: The ability to re-sort the results list using different criteria, such as title, number of hits, relevancy, date changed, file type, or any other criterion that makes sense for the organization.
Hit-to-hit navigation: The provision of navigation buttons enabling users to go directly to the first hit in the returned document, and from there to the next or previous hit as required. This means users avoid having to read through pages and pages of a document before finding the relevant section, which makes searching much more efficient.
Hit highlighting: A familiar concept from searching the web, hit highlighting is when the key words, or hits, in a document are highlighted in a different colour. This feature is often not available in an intranet search engine, but it really should be, as combined with hit-to-hit navigation it enables users to immediately see the relevant sections of the document.
Fast preview: The ability to preview large non-HTML documents in a basic HTML format, without the need to download the whole document. This function enables users to view a few lines above and below each hit, and then to expand up or down to continue reading.
Search within: The ability to search within the current set of results, to narrow them further.
Although just some of the features available in intranet search engines, these are the main features
required to ensure that users have the best overall experience. Others that may be relevant to an organization include intelligent agents that automatically advise users when relevant content appears in the data repository, or the ability to save or export search results.
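Two of the result features listed above, the document extract around the first hit and hit highlighting, can be sketched in a few lines. This is an illustrative sketch only: the bracket markers stand in for whatever colour markup a result page would use, and the function names are hypothetical.

```python
import re


def highlight(text, terms, before="[", after="]"):
    """Wrap every whole-word occurrence of a search term in marker strings."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: before + match.group(0) + after, text)


def extract(text, terms, context=3):
    """Return a few words of context on each side of the first hit."""
    words = text.split()
    wanted = {term.lower() for term in terms}
    for i, word in enumerate(words):
        # Ignore trailing punctuation when comparing against the terms.
        if word.strip(".,;:!?").lower() in wanted:
            start = max(0, i - context)
            return " ".join(words[start:i + context + 1])
    # No hit found: fall back to the beginning of the document.
    return " ".join(words[:2 * context + 1])


page = ("Home Contents Index The travel policy covers flights, hotels "
        "and car rental for all staff.")
snippet = extract(page, ["policy"], context=2)
marked = highlight(snippet, ["policy"])
```

Because the extract is centred on the first hit, the navigation words at the top of the page ("Home Contents Index") do not end up in the result listing.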
5.4 Designing the interface Take extra time and effort when designing your search pages. They should be clear, easy, and above all, simple. Don't bother with an advanced search facility: your users won't understand it. Behind the scenes: Make your search engine quietly work for the user, to correct their mistakes, and to help them find the right page. While much of the work of deploying a search engine goes on behind the scenes, the design of the user interface greatly influences how successful the system will be. While the interface design must be consistent with the rest of your online material, we recommend the following guidelines: 5.4.1 Search Page Keep it simple: There are two key elements on a search page: a field to enter the search terms, and a search button. There is no reason to make the page any more complex than this. Provide hints: A list of tips and examples on the main search page helps users when they first use the search engine. This list should be written in plain English, and should cover the common issues and questions. No advanced searching: Normal users have enough difficulty with search engines without being confronted with a complex set of advanced search methods. Users want to quickly find a single page, and therefore we must design our interface to meet this need. Always "and": Few users understand the concept of Boolean operators. Instead, they expect that when they type in three words, they will be given only those documents that contain all three. Furthermore, typing in more words should provide fewer hits, not more.
The search engine must therefore default to and-ing the words together. In fact, eliminate support for Boolean operators altogether, unless there is a clear case that they will be of value to your users. Place the cursor: When the search page is opened, the cursor should already be in the search field (this is known as setting the focus). This allows users to simply type in their words and hit enter. It's a small point, but it took only days for our users to specifically ask us for it.
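The "and-ing" default described above amounts to intersecting the document sets of all query words. A minimal sketch, assuming a hypothetical index that maps each term to a set of document ids:

```python
def search_all_terms(index, query):
    """Return only the documents that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from a copy so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical index: term -> ids of the documents containing it.
index = {
    "travel": {1, 2, 3},
    "policy": {2, 3},
    "europe": {3},
}
```

With this index, "travel" returns three documents, "travel policy" two, and "travel policy europe" one, so each extra word narrows the result, which is the behaviour users expect.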
5.4.2 Result Page Make it attractive: A results page should encourage users, not frighten them off with tiny text, difficult layouts, and hard-to-read fonts. We expect users to spend time browsing through the list of results, so it is worth spending some extra time making the pages easy on the eye. Keep it simple: There are only three things that we need to present for each hit: title (a hyperlink to the actual page), page summary and ranking. Why, for example, would the user want to know the size of the page in kilobytes? The less we say for each hit, the easier it is for the user to scan through the list and find the page they want. Make the description meaningful: Ideally, each hit should provide a useful description of the page, obtained from the meta tags within the page. If this information is not available, we shall provide a brief extract, highlighting where the search terms are used. Care should also be taken to ensure that the extract always shows some useful text and not the standard headings that appear on every page (how many listings have you seen that start with [Home] [Contents] [Index]?).
Behind the scenes: Effort should be spent behind the scenes to improve the effectiveness of your search engine. Most engines have capabilities that, when implemented carefully, will help users to find the pages they are looking for. These features must operate transparently, so that the user is not even aware of their impact. Users should simply find the search engine both easy to use and effective. Fuzzy searching, stemming, and more: Our selected search engine provided a number of powerful searching capabilities. Fuzzy searching, or sounds-like: There were three closely related options which were essentially designed to find terms which sounded like those entered by the user. In this way, it becomes possible to handle spelling mistakes and other inconsistencies. Stemming: This feature takes the terms entered by the user, and tries other combinations of endings. For example, searching for "walks" would also find "walk", "walking" and "walked". We found this to be very effective, and it eliminated differences in singular versus plural uses of terms in our pages. There is a wide variety of other tools available in modern search engines, beyond those mentioned above. In our evaluation and study we noted that just because a feature exists, it doesn't mean it will help the users. Weightings and rankings: The order in which results are displayed by a search engine is the product of a number of complex weighting and ranking factors behind the scenes. These vary from engine to engine. They also have a big impact on how effective the search engine is. The main aim would be to understand our search engine, and configure it (if required) to meet our specific requirements. The key is to have the search engine work in a transparent and understandable way.
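The stemming and fuzzy-matching behaviour described above can be approximated with a short sketch. It does not reproduce the selected engine: the suffix list is deliberately tiny, and Python's difflib string similarity is used as a stand-in for the engine's "sounds-like" options, which were presumably phonetic. The names naive_stem and suggest are hypothetical.

```python
import difflib

# Deliberately tiny suffix list, for illustration only; real stemmers
# apply many more rules and exceptions.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common English ending, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def suggest(term, vocabulary, cutoff=0.8):
    """Return the closest indexed term for a possibly misspelled query word."""
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


vocabulary = ["refrigerator", "freezer", "returns"]
```

Here naive_stem maps "walks", "walking" and "walked" to the same stem "walk", so all three would match the same index entry, and suggest("refridgerator", vocabulary) recovers "refrigerator" from the misspelling. Both steps can run silently before the index lookup, which keeps the feature transparent to the user.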
Figure 6.1: Search Engine User Interface
Chapter 6 Conclusion and Future work

7.1 Conclusion We have discussed the concept of an intranet search engine. In this project, the mechanisms of the intranet and of the search engine were thoroughly examined. Developing a search engine for an intranet requires thorough research into the needs of the organization. In brief, we learned the following lessons as a result of this project: Spend a lot of time identifying your needs, and researching the right search engine. Choosing the wrong search engine is a costly mistake that is not easy to rectify halfway through a project. Keep the interface simple. The search page should have a field to type in and a search button. Complex interfaces and advanced searches will confuse users: by default, your search engine should simply do what the users expect. Take the time to configure the intelligence under the hood. The search engine should quietly assist the user to find the desired page (via synonyms, fuzzy searching, and so forth). Track the usage of your search engine, and use this to assess how well it is working. You should be gathering enough information to allow you to refine the engine's configuration to better meet user needs.
7.2 Future work The following modifications and upgrades could be integrated into this project to make it a better search engine. (a) Enable better query understanding: Intelligence could be built in to find the correct word and to resolve typing errors. Search engines to this day still lack the intelligence to understand the semantics, rather than the syntax, of a search query. (b) A ranking algorithm: Ranks are based on the number of occurrences of words in the content and title. Thus the results are accurate with respect to content. However, this alone is insufficient when the content searched is not purely document based, as in the case of the Internet. (c) Multimedia search engine: The current version of our intranet search engine is only capable of searching documents in text format. This version could be enhanced by supporting searches for various types of files, including images, audio and video.
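The occurrence-based ranking mentioned in item (b) can be illustrated as follows. This is a sketch under stated assumptions, not the project's actual code: the report only says that ranks depend on word occurrences in content and title, so the title_weight factor and the dictionary layout of a document are hypothetical choices.

```python
import re


def count_occurrences(text, term):
    """Count whole-word, case-insensitive occurrences of a term."""
    return re.findall(r"\w+", text.lower()).count(term.lower())


def rank(docs, query, title_weight=3):
    """Score documents by term occurrences in content and title.

    title_weight is an assumed factor that makes a title hit count
    more than a hit in the body text.
    """
    terms = query.lower().split()
    scored = []
    for doc in docs:
        score = sum(
            count_occurrences(doc["content"], term)
            + title_weight * count_occurrences(doc["title"], term)
            for term in terms
        )
        if score > 0:
            scored.append((score, doc["title"]))
    # Highest score first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


docs = [
    {"title": "Canteen menu", "content": "Staff on leave may still use the canteen."},
    {"title": "Leave policy", "content": "The leave policy explains annual leave."},
    {"title": "Parking", "content": "Parking permits are issued yearly."},
]
ranking = rank(docs, "leave")
```

For the query "leave", the policy page scores 5 (two body hits plus one weighted title hit) and the canteen page scores 1, while the parking page is not returned at all. A scheme like this uses only the text of each document, which is why link-based or usage-based signals would be needed for less document-centred content.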
Bibliography
[1] Cynthia P. Ruppel and Susan J. Harrington. Sharing Knowledge Through Intranets: A Study of Organizational Culture and Intranet Implementation, 2000.
[2] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. An Introduction to Information Retrieval, Online edition, 2009.
[3] Huaiyu Zhu, Sriram Raghavan, Shivakumar Vaithyanathan and Alexander Löser. Navigating the Intranet with High Precision, 2007.
[4] Dick Stenmark. A Method for Intranet Search Engine Evaluations, Proceedings of IRIS22, 1999.
[5] Michael Chen, Marti Hearst and Jason Hong. Cha-Cha: A System for Organizing Intranet Search Results, 2002.
