
Localhost: A browsable peer-to-peer file sharing system

Aaron Harwood and Thomas Jacobs December 17, 2005


Abstract

Peer-to-peer (P2P) file sharing is increasing in use on the Internet. This thesis proposes Localhost, a P2P file sharing system that allows users to find files in the system by browsing a hierarchical directory structure, not unlike a file system. The hierarchical directory structure is global among all the peers in the system. Localhost stores the hierarchical directory structure simply by placing semantics on the files in the system. Localhost allows users to collaboratively build the hierarchical directory structure by way of a popularity based system. The popularity based system allows every directory node in the structure to have multiple alternate versions, and lets users choose which version they prefer; the version with the highest number of viewer preferences is used as the default version. We model the popularity based system to find the range of model parameters that enable it to operate in a stable and progressive manner.

Contents
1 Introduction
  1.1 Peer-to-peer
  1.2 Key historical developments
  1.3 Methods of finding files
  1.4 Constructive collaboration
  1.5 Our contribution
  1.6 Legal and ethical issues
  1.7 Thesis organisation

2 Related work
  2.1 Hierarchical directory structure systems
  2.2 Constructive collaboration systems
  2.3 Shared file system systems

3 Localhost overview
  3.1 Interpreting files as directory nodes
  3.2 Browsing the hierarchical structure
  3.3 Downloading files from a directory node
  3.4 Editing and submitting a new version of a directory node
  3.5 Viewing different versions of each directory node

4 Enabling technologies for Localhost peer implementation
  4.1 BitTorrent
  4.2 Kademlia
  4.3 Azureus

5 Localhost peer design and implementation
  5.1 System implementation overview
  5.2 The web interface
  5.3 Directory node storage
  5.4 Directory node retrieval
  5.5 Global namespace
  5.6 Directory node display
  5.7 File retrieval

6 Results and discussion
  6.1 Comparison to systems similar to Localhost
  6.2 Theoretical behaviour of the popularity based system
  6.3 Observation of the Localhost system in use
  6.4 Further work

7 Conclusion
1 Introduction

Over the last two decades, the Internet has facilitated a number of widely accepted applications for such activities as global communication, collaboration, and data distribution. Email and the World Wide Web are now invaluable applications used by practically all Internet users. In the last five years, peer-to-peer (P2P) file sharing systems have steadily grown in usage [14]. Internet traffic data collected by CacheLogic [4] indicates that there is now a larger volume of P2P traffic on the Internet than web traffic [5], as figure 1 shows. CacheLogic collected the data by installing deep packet inspection devices at a number of Internet Service Providers (ISPs) around the world. These devices monitor packets and classify them not only based on their network port number, but by inspecting the data contents of each packet.

Figure 1: Internet traffic by percentage of data volume, categorised by application.

1.1 Peer-to-peer

The term peer-to-peer (P2P) can be defined as a network in which a significant proportion of the network's functionality is implemented by peers in a decentralised way, rather than being implemented by centralised servers [32]. A peer is a single program that is run on a number of hosts, which interconnect to form a P2P network. Decentralised typically means that the implementation of functionality is spread across all or most of the peers in the network. Centralised typically means that functionality is implemented using programs that are not peers, running on a relatively small number of hosts compared to the number of peers in the network.

There are a number of applications of P2P. File sharing is the most widely used application; the P2P category of figure 1 consists only of seven P2P file sharing systems. Other applications of P2P include Internet telephony [34], instant messaging [19], grid computing [37], and decentralised gaming [38]. Internet telephony can be implemented without use of a P2P network, but using P2P networks can have advantages. One popular P2P Internet telephony and instant messaging program is Skype [39]. Skype allows users to make voice calls and send instant messages to each other. In a non-P2P system, all of the voice and message packets are routed through a central server. In Skype, the packets are sent directly from one peer to another, or routed through other peers in the Skype network when a direct connection between the two peers is not possible. As a consequence, the network can cheaply scale to millions of users because there is no need for costly centralised infrastructure. Currently, there are around two million users simultaneously using Skype at any given time [39].

P2P file sharing systems consist of programs that are used to create and maintain P2P networks to facilitate the transmission of files between users. They allow users to download files from other users of the P2P network, and often also allow users to designate a set of files from their PC's file system to be shared. Sharing a file makes the file available to other users of the P2P network. There are two key parts of a P2P file sharing system. The first part is the file distribution system. The file distribution system provides the means to transmit files between peers. It is the protocol used to dictate how peers in the system should behave in order to download and upload files. The second part is the file finding system. The file finding system is the means for users to find the files that are available on the P2P network. P2P file sharing systems typically provide the file finding system by maintaining some form of index of the files. P2P file sharing systems differ in how and where they implement these two parts. Some maintain the file index in a centralised way, and others in a decentralised way. P2P file sharing systems implement the file distribution system in a decentralised way. This definition of P2P file sharing systems satisfies the definition of P2P given earlier: the most significant part of the network's functionality, the transfer of files, is done directly between the peers in the network, without the use of centralised servers.

1.2 Key historical developments

There are more than one hundred P2P file sharing systems listed online [40]. In this section we cover the most important ones in terms of technological development.

In 1999, a P2P file sharing system called Napster [6] was released. Napster was the first popular P2P file sharing system; at its peak it boasted a registered user base of 70 million and 1.57 million simultaneous users [16]. The Napster approach uses a decentralised file distribution system, and a centralised index for users to find files in the network. Specifically, the Napster approach uses several intercommunicating central servers to maintain a filename-based index of files from all of the peers logged into the system at any one time. Each peer logs in to the system by connecting to one of the central servers and sending it the list of filenames of the files that the peer has to share. The central servers maintain only the names of the shared files and the IP addresses of the peers storing those files, not the file contents. The file contents are transmitted directly between peers. In 2001, Napster was deemed to be illegal and was ordered to be shut down by the courts in response to a lawsuit from several major recording companies [33].

In 2001, a P2P file sharing protocol called Gnutella [28] was released. The Gnutella protocol is implemented in a variety of peers, including Limewire [17], Shareaza [31], and Morpheus [22]. The Gnutella approach is an important development in P2P technology because it was the first popular P2P system to have a decentralised index. The Gnutella protocol works by having each peer connect to a small set of remote peers. When a user wants to find a file, the user forms a query string from desired keywords, and the query string is flooded by the Gnutella peer. To flood a query string, the Gnutella peer sends the query string to each remote peer it is connected to. Each remote peer then forwards the query string to all the peers it is connected to, and those remote peers in turn forward the query string, and so on. If a peer has files that satisfy the query string, the peer sends a reply directly to the original querying peer. The graph of peers is cyclic, and precautions are taken to prevent infinite forwarding, such as including a Time To Live field on packets. The original Gnutella protocol suffered from scalability problems, because each query string generates several gigabytes of traffic [29]. The problem's impact has been reduced in a later Gnutella protocol [2] by the classification of peers into leaf peers and ultrapeers. Ultrapeers are peers that have high compute and bandwidth capacity and behave as peers do in the original Gnutella protocol. Leaf peers connect only to ultrapeers, and never connect to each other. Each ultrapeer has a number of connections to leaf peers, and maintains an index of the files available on all of the leaf peers that are connected to it. Using this arrangement, query strings need only be flooded through the high bandwidth ultrapeers, and not the leaf peers, reducing the traffic in the system.

In 2001, a P2P file sharing system called BitTorrent [3] was released. The BitTorrent P2P file sharing system was the first P2P file sharing system to de-integrate the file distribution system from the file finding system. The BitTorrent P2P file sharing system consists of the BitTorrent peers that make up the file distribution system, and a number of websites that index the files available in the BitTorrent network and allow users to find them [26]. The index is maintained on hosts that are not peers, and not at all by the peers in the system, so we can say that the BitTorrent system has a centralised index. Files in the BitTorrent network can also be found by use of other Internet applications, such as Internet Relay Chat (IRC). The BitTorrent file sharing system has a number of novel features. The first novel feature is that the indexes on the websites are maintained by a relatively small (18, in one case [26]) group of moderators, rather than being moderated by every user on the network. This has advantages, which are described in the next two subsections. The second novel feature is that the file distribution protocol dictates that files are divided up into pieces, so a BitTorrent peer can upload a piece of a downloading file to other interested BitTorrent peers as soon as it receives that full piece. This allows peers to assist the distribution of the file by providing upload bandwidth to remote peers, even before they complete the download. The third novel feature is that the file distribution protocol employs a tit-for-tat policy that rewards peers that upload to remote peers by increasing the uploading peer's download speed.

The P2P file sharing systems Soulseek [35] and DirectConnect [10] were the first popular P2P file sharing systems that allowed the user to find files by browsing each individual node's shared files. The notion of browsing is important for this thesis and is expanded upon in the next subsection.

1.3 Methods of finding files

In order for a user to download files from a file sharing system, the user needs to find files that are available on the network. There are a number of possible methods to find files that are available on the network. The two we look at are query string search and browsing. Query string search is characterised as a process in which the user describes a request by forming a query string that consists of one or more keywords, and the system presents a set of filenames that match or satisfy the query string. Usually, the set gives further details of each file, such as file size and file type. The user can then select files from the set to download.

Browsing is another method to find files. For browsing to be possible, a collection of files must be organised into a browsable structure, such as a tree or directed graph. An example of a browsable structure is a file system on a Personal Computer (PC). Typically, PC file systems are organised as tree structures. When the user is browsing for a file in a PC file system, the user starts from some point in the tree, typically the root, and can see the list of all the files and subdirectories in the current directory. The user can select a file from there, or the user can choose to enter a subdirectory to list all of the files and subdirectories in that subdirectory, and so on. Another example of a browsable structure is the World Wide Web, which is organised as a directed graph.

The majority of P2P file sharing systems use query string search. Napster, Gnutella-based systems, eMule [11], and KaZaA [13] use query string search as their only means for finding files in their networks. Several other P2P file sharing systems, such as DirectConnect and Soulseek, in addition to query string search, allow the user to browse each individual peer's shared files. However, these systems do not support a browsable namespace that is global among all peers, because they do not directly provide a way of collaboratively organising files into a single, integrated, coherent categorical or hierarchical structure. Consequently, over 25 terabytes of files are fragmented across more than 8000 individual listings, with each listing having its own way of organising its files [25].

1.4 Constructive collaboration

In a P2P file sharing system, a scheme must exist that defines how files can be added to the network. Most of the popular P2P file sharing systems, such as those mentioned in the previous subsection, have a scheme that lets users simply designate a folder in their PC's file system and have all of its contents shared. Two major problems that occur in P2P file sharing systems that use this scheme are pollution and poisoning [7]. Pollution of a P2P network refers to the accidental injection of unusable copies of files into the network by non-malicious users. Poisoning is where a large number of fake files are deliberately injected into a P2P network by malicious users or groups. Fake files are specifically created by malicious users or groups to seem like certain files, but consist of rubbish data or are unusable in some way. Both of these problems reduce the perceived availability of files to users and reduce the usefulness of the system to users, because finding usable files is more difficult. A study [13] found that a significant proportion of files on the KaZaA network are unusable, due to poisoning and pollution. A number of P2P file sharing systems employ a file rating system to attempt to combat these problems. File rating systems let users rate each file's quality; the theory is that enough users will find the fake and unusable files and rate them poorly, allowing other users to identify them before downloading them. These file rating systems have been shown to be largely ineffective [13].

The scheme used in the BitTorrent file sharing system is that any user can submit files to the index websites, and the file is checked by the moderators of the website before being added to the website's index. If the file is found to be fake or of unusable quality, it is not added to the index. Although pollution and poisoning levels are difficult to measure, sources indicate that the BitTorrent system is virtually pollution and poisoning free because of this scheme [26]. The scheme was found to be a practical one by the same paper ([26]), as the authors were surprised that a mere 18 moderators are able to effectively manage the numerous daily content injections with such a simple system. The paper went on to state the drawback of this system: "Unfortunately, this system relies on a central server and is extremely difficult to distribute." In this thesis we address this drawback by developing a new file finding system for the BitTorrent file distribution system, to form a new P2P file sharing system.

1.5 Our contribution

The major contributions of this thesis are:

- The conversion of an existing file distribution system into a file sharing system by placing semantics on the downloadable files.
- A popularity based system that aims to allow constructive collaboration in a decentralised P2P file sharing system.
- Analysis of the popularity based system via simulation.

Motivated by the problems covered in the previous subsections, namely poisoning and pollution making indexes in P2P file sharing systems less usable, the non-coherent index of the Soulseek and DirectConnect file sharing systems, and the BitTorrent file sharing system's dependence on centralised websites to index the system's files, we designed and implemented a P2P file sharing system, which we call Localhost. The Localhost system in operation consists of Localhost peers running on a number of Internet hosts. The Localhost peer contains the BitTorrent file distribution system, and creates a hierarchical index of the files in the system by imposing semantics on some of the files. The semantics classify files into regular files and directory nodes. A directory node is a file that contains references to regular files and other directory nodes. The directory nodes form a hierarchical structure in which the regular files are indexed. The hierarchical structure is used by users to find the regular files available in the system. To facilitate constructive collaboration between users in order to build a cohesive hierarchical structure, we developed a popularity based system. This system lets users choose to view any one of multiple alternate versions of each directory node. Users are initially shown the version that has the highest number of users that have chosen that version to view. New alternate versions of directory nodes can be created by any user, by adding files from their PC and/or new directory nodes to an existing version of a directory node. Lastly, we modelled and simulated the popularity based system to predict its behaviour and find the range of parameters to the model that provide acceptable behaviour.

1.6 Legal and ethical issues

P2P file sharing systems have largely been used to distribute copyrighted material, without consent of the original copyright holder. The legal and ethical issues of P2P file sharing are not the focus of this thesis. This thesis focuses on the technical issues of P2P file sharing.

1.7 Thesis organisation

The rest of this thesis is organised as follows. Section 2 covers work that is similar to this thesis. Section 3 gives an overview of what Localhost does, without covering technical details. Section 4 introduces the technologies used in implementing Localhost. Section 5 details how Localhost is implemented, describing design decisions along the way. It builds on the technologies introduced in section 4. Section 6 contains the discussion and results of this thesis. Section 7 summarises and concludes.

2 Related work

We have covered a number of P2P file sharing systems earlier, in subsection 1.3, so this section covers a number of systems that are similar in other ways to Localhost.

2.1 Hierarchical directory structure systems

The Open Directory Project (ODP) [24] is a human-edited directory structure which indexes websites. It indexes websites in a hierarchical structure, and is itself a website. The nodes in the hierarchical structure are categories, and the leaves are website links. The top level nodes are broad categories, such as Arts, Business, Computers, and News. The ODP is constructed and maintained by a global community of volunteer editors.

2.2 Constructive collaboration systems

Wikipedia [41] is a user-edited online encyclopedia. The system allows collaboration among its users to build its content. Any user can change and update the contents of any article in the encyclopedia. The system maintains a history of changes that allows any user to roll an article back to a previous version, in case of unwanted additions, such as vandalism.

2.3 Shared file system systems

Wayfinder [25] is a P2P file sharing system that provides a global namespace and automatic availability management. It allows any user to modify any portion of the namespace by modifying, adding, and deleting files and directories. Wayfinder's global namespace is constructed by the system automatically merging the local namespaces of individual nodes. Farsite [1] is a serverless distributed file system. Farsite logically functions as a centralised file server, but its physical realisation is dispersed among a network of untrusted workstations. OceanStore [15] is a global persistent data store designed to scale to billions of users. It provides a consistent, highly-available, and durable storage utility atop an infrastructure comprised of untrusted servers. Cooperative File System [9] is a global distributed Internet file system that also focuses on scalability. Ivy [23] is a distributed file system that focuses on allowing multiple concurrent writers to files.

3 Localhost overview

This section gives a top-level overview of Localhost, without covering technical details of how it achieves its behaviour. We use the abstraction Localhost Distributed System (LDS) to refer to the system that is created by Localhost peers running on a number of Internet hosts. The LDS maintains a global-namespace hierarchical directory structure of files that can be downloaded by Localhost peers. No one peer is responsible for storing the complete hierarchical directory structure; that responsibility is distributed amongst all the peers in the LDS. There is no central server or peer that has more importance than other peers in the LDS. Figure 2 shows one Localhost peer running on a host and the peer's interaction with a web browser and the rest of the LDS.

Figure 2: Overview of one Localhost peer running on a host.

The namespace of the hierarchical structure is global among all Localhost peers. Every new version of a directory node that any peer creates is viewable by all Localhost peers. Peers can create new versions of any directory node in the hierarchical structure, including the root directory node, so each directory node in the hierarchical structure may have any number of alternate versions. Each peer communicates with a web browser that is running on the same host as the peer to display directory nodes to the user. Each user can view any version of each directory node, and the last version that they view is taken as their preference for that directory node, so each user can have at most one preference for each directory node. When a user views a different version of a directory node, the peer informs the LDS of the user's preference, which the LDS stores.

When a user requests a directory node for the first time (i.e. the user has not viewed that directory node before), the peer gives the user the most popular version of that directory node. The popularity of a version is defined as the number of users whose preference is for that version. When a version has zero user preferences, it disappears.

The hierarchical directory structure is built up over time by new versions of the root directory node and its subdirectory nodes being created by peers. When a directory node includes a reference to another directory node, we call the referenced directory node a subdirectory node. Initially, the hierarchical structure begins with a single version of the root directory node that includes no files or subdirectory nodes. New versions of the root directory node are created by peers, with files and subdirectory nodes included in them. When a peer creates a new version of a directory node with a subdirectory node included in it, a single empty version of that subdirectory node is created as well. New versions of that subdirectory node can then be created by the same peer, or other peers in the system.

When a Localhost peer views a directory node, or completes the download of a file, the peer makes the file available to be downloaded from itself, which helps the distribution of the file by providing another complete copy of the file to the system. This makes the system more scalable than if every peer was forced to download a file or version of a directory node from only the peer that added that file or version of the directory node.

Each directory node and regular file in the hierarchical structure can be uniquely identified by its path. The path of the root directory node is "/". The path of all other directory nodes is formed by taking the path of the root directory node, "/", and concatenating the names of the directory nodes that are in the chain of directory nodes from the root directory node to that directory node. When concatenating the names, each directory node name has "/" appended to it, in order to separate the names in the path. For example, if the root directory node included a subdirectory node called Videos, and that Videos subdirectory node included a subdirectory node called Trailers, then the path of the Trailers subdirectory node would be "/Videos/Trailers/". The path of a regular file is formed by concatenating the name of the file onto the path of the directory node that it is inside. (A short sketch of this path construction is given after the list below.)

From a top-level view, Localhost does the following:

- Interprets certain files as directory nodes and facilitates displaying them in a web browser. The subdirectory nodes and files that are listed in a directory node are presented as links in the web browser's display.
- Allows browsing of the hierarchical structure by responding to a user clicking a subdirectory node link by downloading the most popular version of that subdirectory node, and serving it to the web browser for display. This becomes an iterative process, because the displayed subdirectory node can include subdirectory node links of its own.
- Responds to a user clicking a file link by downloading the file.
- Allows the user to create new versions of directory nodes by adding and/or removing files and/or subdirectory nodes to/from an existing version of a directory node. The new version is then submitted to the LDS, where it becomes viewable by all of the users of the LDS.
- Allows the user to select any version of a directory node to view. The LDS maintains the references to all the versions of each directory node, and counts of how many users are viewing each version of each directory node.
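The path construction above is plain string concatenation. The following Java sketch illustrates the rule; the class and method names are our own and do not appear in the Localhost source.

    // Minimal sketch of path construction: the root is "/", each directory node
    // name is followed by "/", and a regular file's name is appended to the path
    // of its containing directory node. Illustrative only.
    public class LocalhostPaths {
        static String directoryPath(String... namesFromRoot) {
            StringBuilder path = new StringBuilder("/");
            for (String name : namesFromRoot) {
                path.append(name).append("/");
            }
            return path.toString();
        }

        static String filePath(String fileName, String... directoryNamesFromRoot) {
            return directoryPath(directoryNamesFromRoot) + fileName;
        }

        public static void main(String[] args) {
            System.out.println(directoryPath("Videos", "Trailers"));            // /Videos/Trailers/
            System.out.println(filePath("trailer1.avi", "Videos", "Trailers")); // /Videos/Trailers/trailer1.avi
        }
    }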

3.1 Interpreting files as directory nodes

In order to build the hierarchical directory structure, the Localhost peer interprets certain files in the LDS as directory nodes. These files contain a listing of directory node names and/or file names. The Localhost peer serves these files to a web browser, along with formatting information and details of the six most popular versions of the directory node. This allows the web browser to render the directory page for the directory node. An example directory page is shown in figure 3(a).

The directory page has the path of the directory node as the heading of the page. The directory node names and file names in the directory node are displayed as links on the directory page. Details of the six most popular versions of the directory node are displayed on its directory page. The details include the description of that version, and the number of users whose preference is for that version. In the example in figure 3(a) only two versions are shown, because there are only two versions of the directory node. The directory page also includes an edit link and a versions link. The edit link is described later in this section. When the user clicks the versions link, the Localhost peer returns a web page that lists the details, as above, of all of the versions of the directory node.

(a) Directory page of /Videos/Trailers/.

(b) The same directory page in editing-mode-format.

Figure 3: Screenshots of a web browser displaying a directory page.

3.2 Browsing the hierarchical structure

When a user clicks a subdirectory link on a directory page for the first time, the Localhost peer finds and downloads the most popular version of that (sub)directory node from the LDS. After downloading the (sub)directory node, the peer does three things. First, it serves the (sub)directory node to the web browser to be displayed. Second, it informs the LDS that the user's preference is now for this version of the (sub)directory node. Third, it makes the (sub)directory node available to be downloaded from the peer by other peers in the LDS, to aid distribution of the (sub)directory node.

3.3 Downloading files from a directory node

When a user clicks a file link on a directory page, the Localhost peer downloads the file to a user-specified location on the user's PC. When the file has finished downloading, the peer makes the file available to be downloaded from the peer by other peers in the LDS.

3.4 Editing and submitting a new version of a directory node

Each directory page includes an edit link. When it is clicked, the Localhost peer puts the directory node into editing mode and serves the editing-mode-format of the directory page to the browser. Entering editing mode copies the currently displayed version of the directory node to create a new version of the directory node, which can be edited and then submitted to the LDS. An example of an editing-mode-format of a directory page is shown in figure 3(b). The editing-mode-format of the directory page is served even if the directory page is re-requested by the web browser. This continues to happen until editing mode is exited. The editing-mode-format of a directory page displays the word editing in the directory page's title, and provides options to edit the new version of the directory node. The first option, Add file, allows the user to add a file from their PC's file system to the new version of the directory node. The second option, Create empty folder, allows the user to create an empty subdirectory in the new version of the directory node. The third option, Add folder, allows the user to add an entire folder structure from their PC's file system to the new version of the directory node. Every file and subdirectory node link on the directory page has a Delete link next to it to allow the user to remove it from the new version of the directory node, even if they were not the user that added it. Finally, once the desired changes have been made, the user can type a description of the new version in the text box provided, and click the Submit This As New Version button. The Localhost peer submits the new version of the directory node to the LDS, which makes the new version viewable by all of the users in the LDS. If the user wants to cancel the editing without a new version being submitted, they can click the Cancel Editing link, which exits editing mode and serves the directory page without editing-mode-format.

3.5 Viewing different versions of each directory node

The LDS maintains the details of each version of each directory node. The details of a version consist of the description, a count of the number of users with a preference for that version, and a reference that allows peers to download the directory node. As described above, each directory page contains details of the six most popular versions of the directory node, and the versions link, which leads to a page with details of all of the versions. Each of the descriptions is a link which, when clicked, causes the Localhost peer to download and return that particular version of the directory node to the web browser. This is also the case for the aforementioned web page that lists all the versions of the directory node. Once the version of the directory node has downloaded, the Localhost peer informs the LDS of the user's new preference. If a user has previously specified their preferred version of a directory node, then when the (sub)directory node is requested as described in subsection 3.2, the user's preferred version is returned, rather than the most popular version.
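As a minimal sketch, the per-version details that the LDS maintains could be represented as follows. The field names are our own illustration, not identifiers from the Localhost source.

    // Hypothetical representation of the details the LDS keeps for one version
    // of a directory node, as described in this subsection.
    public class VersionDetails {
        String description;   // description entered when the version was submitted
        int preferenceCount;   // number of users whose preference is this version
        byte[] infohash;       // reference that lets peers download this version's directory node file
    }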

4 Enabling technologies for Localhost peer implementation

The work done in this thesis builds on a number of technologies, which we detail in this section. These technologies are BitTorrent, a P2P file distribution protocol; Kademlia, a Distributed Hash Table protocol; and Azureus, an implementation of BitTorrent which uses Kademlia.

4.1 BitTorrent

The BitTorrent protocol is designed and used for P2P file distribution [8]. It was proposed by Bram Cohen, who also released a peer that implements the protocol. A number of other peers have also been released that implement the BitTorrent protocol. The protocol's basic premise is to use the otherwise wasted upload bandwidth of downloaders to help distribute files. Following the BitTorrent system, a file is broken up into pieces, which are transmitted between peers. A file's piece size is usually between 32 kilobytes and 128 kilobytes, inclusive.
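As an illustration of this piece-based view of a file, the Java sketch below computes a SHA-1 hash for each consecutive piece, which is the kind of computation a torrent-creation program performs (see the torrent file description below). The 64 kilobyte piece size and the class itself are purely illustrative and are not taken from any BitTorrent implementation.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.List;

    public class PieceHasher {
        // Example piece size within the 32-128 kilobyte range mentioned above.
        static final int PIECE_SIZE = 64 * 1024;

        // Returns the SHA-1 hash of each consecutive fixed-size piece of the file.
        static List<byte[]> hashPieces(File file) throws IOException, NoSuchAlgorithmException {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            List<byte[]> hashes = new ArrayList<byte[]>();
            FileInputStream in = new FileInputStream(file);
            try {
                byte[] piece = new byte[PIECE_SIZE];
                int filled = 0;
                int n;
                while ((n = in.read(piece, filled, piece.length - filled)) != -1) {
                    filled += n;
                    if (filled == piece.length) {        // a full piece has been read
                        hashes.add(sha1.digest(piece));  // hash the piece and reset the digest
                        filled = 0;
                    }
                }
                if (filled > 0) {                        // the final piece may be shorter
                    sha1.update(piece, 0, filled);
                    hashes.add(sha1.digest());
                }
            } finally {
                in.close();
            }
            return hashes;
        }
    }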

Figure 4: An example scenario of BitTorrent protocol operation.

A user that wishes to publish a file or collection of files uses a program to create a torrent file. A torrent file contains the name(s) of the file(s), the SHA-1 hash of every piece of every file, the torrent file's infohash, and the web address of one or more trackers to be used. The infohash is the SHA-1 hash of all of the files' data, and is used to uniquely identify a torrent file. A tracker is a server that maintains a list of IP addresses of peers in the swarm. The swarm is the set of peers currently involved in transmitting pieces of the file to each other. The term torrent refers to the collection of file(s) that the torrent file was created from. The torrent file is distributed to other users by some means external to the BitTorrent peer, such as via web sites. The user publishing the file must then act as a seed for the file(s). In BitTorrent terminology, a seed is a peer that has the complete file(s). Initially, there is one seed in the swarm: the publisher of the file(s). After peers in the swarm complete the download, they become seeds for the file(s) as well. When a peer acts as a seed for the file(s), it goes through basically the same steps as for downloading the file, which are described in the following paragraph. There are only minor policy differences in its behaviour for seeding versus downloading.

A user interested in downloading a particular file in the system must first obtain the torrent file for that file. The torrent file is given to their BitTorrent peer. The BitTorrent peer then proceeds to download the file as follows. The peer first connects to the tracker to request a set of IP addresses of remote peers that are in the swarm. This is shown in figure 4 by dotted lines. The set returned from the tracker is a random subset of the full list the tracker maintains. The request for a set of remote peer IP addresses allows the tracker to add the requesting peer to its list of IP addresses of peers that are in the swarm. After the peer has a partial list of peers in the swarm, it picks a certain number of them at random, and attempts to connect to them. The certain number for most peers ranges from four to around thirty, depending on user configurable settings and peer implementation. The peer also listens on a network port, by default 6881 TCP but also user configurable, to allow remote peers that are also attempting to connect to other remote peers to connect to the peer. The peer should start to receive connections after it gives the tracker its IP address, because remote peers will receive this IP address from the tracker and start to connect to the peer, assuming there are other peers in the swarm. Each peer in the swarm aims to maintain the certain number of connections to remote peers, without consideration of which peer initiated the connection. The solid lines in figure 4 show an example interconnection between peers. When a peer connects to a remote peer, the two peers exchange a bitmap that indicates what pieces of the file each does and does not have. This allows each peer to work out what, if any, pieces the local peer has that the remote peer does not have, and what, if any, pieces the remote peer has that the local peer does not have. Pieces are transmitted from those peers that have them to those peers that do not.

The result of this behaviour in each peer is the following. The initial seed peer connects to remote peers, and transmits pieces of the file(s) to the remote peers.

These remote peers do the same again, that is, connect to remote peers and transmit the pieces they have to them, all while still receiving other pieces from the initial seed peer. As soon as a peer receives a full piece, it can transmit the piece to the remote peers it is connected to. This allows the pieces to be propagated through the swarm. Figure ?? shows an example situation of piece transfer. Note that, assuming zero peer failures, the initial seed peer only needs to transmit each piece of the file once into the swarm, and it is possible for every peer to receive a complete copy of the file, even in a swarm of thousands of peers. To contrast this with conventional web serving, a situation with thousands of requesting clients requires that the complete file is transmitted to every client. This insight demonstrates BitTorrent's scalable nature.

File distribution systems have experienced problems caused by free-riders in the past [20]. Free-riders are users who download the file, but do not upload the file. Having free-riders results in low download speed for some peers or even the inability for some peers to find any peers to download the file from. BitTorrent has a real-time tit-for-tat feedback scheme to reward local peers uploading to remote peers [8]. The scheme dictates that each peer only uploads to a subset of the remote peers that it is connected to: the remote peers that it is getting the highest current piece transfer rate from, out of all of the remote peers the peer is connected to. With most peers in the swarm following this rule, it is in each peer's best interest to upload to a remote peer that the peer is receiving pieces from [8]. If the peer does not, the peer is likely to stop receiving pieces from the remote peer, because the remote peer is not getting a high enough download rate from the peer. The peer will have to find another remote peer to download from if this happens. In addition to the certain number of remote peers each peer connects to, each peer also optimistically connects to other remote peers that it has the IP addresses of. This is so the peer has the possibility of finding a remote peer that it can download pieces from at a higher rate than other currently connected remote peers. This scheme results in unprecedented download speeds for a P2P file distribution system [12].
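The sketch below illustrates the core of the tit-for-tat rule just described: upload slots are given to the remote peers from which the local peer is currently downloading fastest. The class and field names are our own, and the sketch omits optimistic unchoking and the periodic re-evaluation that a real peer performs.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class TitForTat {
        // A hypothetical view of a connected remote peer and the rate (bytes per
        // second) at which pieces are currently being received from it.
        static class RemotePeer {
            final String address;
            final double downloadRate;
            RemotePeer(String address, double downloadRate) {
                this.address = address;
                this.downloadRate = downloadRate;
            }
        }

        // Choose the remote peers to upload to: the ones the local peer is
        // currently downloading from at the highest rate.
        static List<RemotePeer> chooseUploadTargets(List<RemotePeer> connected, int uploadSlots) {
            List<RemotePeer> sorted = new ArrayList<RemotePeer>(connected);
            sorted.sort(Comparator.comparingDouble((RemotePeer p) -> p.downloadRate).reversed());
            return sorted.subList(0, Math.min(uploadSlots, sorted.size()));
        }
    }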

4.2 Kademlia

Kademlia [21] is a Distributed Hash Table (DHT) protocol. A Distributed Hash Table based system provides services similar to those of a hash table, but distributes storage and lookups among a number of peers. There have been a number of DHT protocols developed. The first four DHT protocols, Chord [36], CAN [27], Pastry [30], and Tapestry [43], were all developed in 2001. DHT based systems support a number of operations. The two major operations are:

- put(key, value): stores the data string value under key key in the DHT.
- value = get(key): retrieves the data string value from the DHT that is stored under the key key.

Some DHTs allow multiple values to be stored under, and retrieved from, a single key. DHT based systems provide the abstraction of a hash table, which is accessed by these two operations. The work done in this thesis builds on the abstraction by using these two operations. The following describes how DHT based systems provide the abstraction.

DHT protocols distribute the key-value pairs among the peers as follows. Typically, keys are 160-bit values, and are usually calculated by hashing some data. The DHT protocol partitions the namespace of keys among the peers in the DHT. Each peer in the DHT is responsible for a certain subset of the key namespace, and so is responsible for storing a certain subset of key-value pairs. The peers in the DHT can join and leave the network freely. The DHT protocol provides an algorithm for peers to enter the DHT, so that the correct key-value pairs can be transferred to a peer when it becomes responsible for a set of keys upon entering the DHT. The DHT protocol also provides an algorithm for peers to leave the network and transfer their key-value pairs to other peers so the key-value pairs are not lost. Despite the apparent chaos of periodic random changes to the membership of the network, DHTs make provable guarantees about performance [42, 36, 21].
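The two operations described above can be captured by a small interface, which we use purely as an illustration of the abstraction that the rest of this thesis builds on; the names are our own and do not correspond to any particular DHT library's API.

    import java.util.List;

    // A minimal abstraction of the two DHT operations described above.
    public interface DistributedHashTable {
        // Store value under the (typically 160-bit) key.
        void put(byte[] key, byte[] value);

        // Retrieve the value(s) stored under key; several values may be returned
        // when multiple peers have stored under the same key.
        List<byte[]> get(byte[] key);
    }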


DHT based systems operate in a completely decentralised way. DHT protocols are able to provide the two operations described above by making the peers form a DHT overlay network. The DHT overlay network is formed by each peer maintaining a set of contacts. A contact is the peer ID and IP address of a remote peer in the DHT. Each peer has a peer ID, which is a number chosen from the namespace of keys. The set of contacts each peer maintains does not include every possible contact in the DHT. The specific DHT protocol used dictates which contacts each peer maintains. Using these contacts, DHT overlay networks such as those used in Chord and Kademlia allow each peer to locate the remote peer responsible for a certain key in O(log n) time. Once the correct peer has been located, gets and puts can be done by contacting that peer.

In a Kademlia DHT, each peer's peer ID is chosen randomly from the namespace of keys. Each peer is responsible for the set of keys that are closest to its peer ID. In Kademlia, closeness is defined by the XOR of two values, where the two values are a peer ID and a key. Kademlia divides the key namespace progressively into subtrees, by taking each bit of the key in turn, and forming a new subtree for both possible values of the bit, as shown in figure 5.

Figure 5: Example of a Kademlia DHT overlay network with 3-bit keys.

Kademlia's contact maintenance policy dictates that all peers maintain one contact in each subtree in which they themselves are not contained. This results in each node maintaining only O(log n) contacts. Figure 5 shows an example of possible contacts the peer with ID 001 can have: the peer is required to maintain exactly one contact from each dotted oval. In this example, the contacts it is maintaining are 110, 011, and 000, as shown in figure 5 by the thin lines. Although not illustrated in figure 5, every peer follows the same rule for contact maintenance. Peers find the remote peer responsible for a certain key by querying successively closer peers to the key, starting with themselves, as shown in figure 5 by the thick arrowed lines. In this example, the peer 001 is seeking the IP address of the peer 101. The first step is to query itself to find the peer from its contacts that is closest to the target peer. In this case all three contact peers 110, 011, and 000 are of equal closeness to 101, so Kademlia chooses the peer that has the most common leftmost bits with the target ID, which is 110. Then the peer 110 is queried to find the next closest peer to 101 from its set of contacts. The peer 100 is found. That peer is queried, and the IP address of the target peer is found, because that peer has the target peer as a contact. Each step halves the distance to the target peer, because an entire subtree is eliminated with each query. This gives the expected O(log n) hops to find the IP address of the target peer. Peers require the IP address of one peer already in the DHT overlay network in order to join the DHT overlay network. The preceding description of Kademlia is a simplified one. Further details on Kademlia are not included in this thesis. We refer the interested reader to [21] for more information on Kademlia.
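As an illustration of the XOR closeness metric, the sketch below computes the distance between a peer ID and a key. It uses the 3-bit values from the example of figure 5; real Kademlia keys are 160 bits. The class is illustrative only.

    import java.math.BigInteger;

    public class KademliaDistance {
        // XOR distance between a peer ID and a key, both given as equal-length
        // byte arrays (160 bits in practice, 3 bits in the example of figure 5).
        static BigInteger distance(byte[] peerId, byte[] key) {
            byte[] xor = new byte[peerId.length];
            for (int i = 0; i < peerId.length; i++) {
                xor[i] = (byte) (peerId[i] ^ key[i]);
            }
            return new BigInteger(1, xor); // interpret the XOR result as an unsigned integer
        }

        public static void main(String[] args) {
            byte[] peer = {0b001}; // peer 001 from figure 5
            byte[] key  = {0b101}; // key 101 being looked up
            System.out.println(distance(peer, key)); // 001 XOR 101 = 100, i.e. 4
        }
    }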

4.3 Azureus

Azureus is a Java implementation of the BitTorrent protocol. It is a very popular BitTorrent peer: it has been downloaded from the open source repository http://www.sourceforge.net over 30 million times. Azureus allows a number of files to be downloaded and seeded concurrently. As of version 2.3.0.0, Azureus also includes an implementation of the Kademlia protocol. All Azureus peers join the same DHT, by contacting a certain peer that is set up for the purpose and aims to always be online. Azureus uses the Kademlia DHT to implement a feature called decentralised tracking. Decentralised tracking is an optional replacement for trackers. When decentralised tracking is enabled, an Azureus peer puts a data value that consists of its IP address and BitTorrent network port number into the DHT, for each file it is downloading or seeding. The key that the value is put under is the infohash of the file, which comes from the torrent file. Each Azureus peer with decentralised tracking enabled performs a get for each of the files it is downloading or seeding, where the key is the infohash of the file. The values returned from this get allow the BitTorrent protocol section of the peer to connect to remote peers in the swarm. The put and get are analogous to, and a replacement for, the tracker communication done by the standard BitTorrent protocol.

Version 2.3.0.0 of Azureus also introduces a torrent file download feature. The torrent file download feature allows torrent files to be downloaded from remote Azureus peers to an Azureus peer using the User Datagram Protocol (UDP). To download a torrent file, the torrent file's infohash is required. Azureus uses the port 6881 (UDP) to transfer torrent files and make connections to other peers to operate the DHT protocol. The Kademlia implementation in Azureus allows each peer to store only a single value under each key. When a peer performs a put(key, value) for a key that the peer has performed a put under earlier, the earlier value is overwritten. Multiple peers can each store a different value under the same key. A single peer can store different values under different keys.
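In terms of the DHT abstraction sketched in section 4.2, decentralised tracking amounts to a put and a get keyed on the infohash. The following sketch is illustrative only; it is not Azureus code, and the value format (IP address and port encoded as a string) is an assumption made for the example.

    import java.util.List;

    public class DecentralisedTracking {
        private final DistributedHashTable dht;     // the abstraction sketched in section 4.2
        private final String myAddressAndPort;      // e.g. "192.0.2.7:6881"

        DecentralisedTracking(DistributedHashTable dht, String myAddressAndPort) {
            this.dht = dht;
            this.myAddressAndPort = myAddressAndPort;
        }

        // Announce that this peer is downloading or seeding the torrent with the
        // given infohash, so that other peers in the swarm can find it.
        void announce(byte[] infohash) {
            dht.put(infohash, myAddressAndPort.getBytes());
        }

        // Look up the addresses of other peers in the swarm for the torrent.
        List<byte[]> findSwarmPeers(byte[] infohash) {
            return dht.get(infohash);
        }
    }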

5 Localhost peer design and implementation

We developed the Localhost peer, which implements the functionality described in this section. The peer and its source code are available for download at http://p2p.cs.mu.oz.au/software/Localhost/. The peer was developed as a modification and extension of the Azureus 2.3.0.4 source code base. Azureus was chosen as the source code base to develop the Localhost peer from for a number of reasons. The first is that Azureus is the only BitTorrent peer with a torrent download feature, which is required to implement the Localhost peer. Otherwise, this feature would have had to be developed, adding to the development time of the Localhost peer. The second is Azureus' popularity. Developing the Localhost peer from a program that has been shown to be usable by a large number of people is a good starting point to make Localhost a usable system.

5.1 System implementation overview

Figure 6 gives a modular overview of the Localhost system. The Localhost peer consists of Azureus' BitTorrent and Kademlia modules, and a module that contains logic which interacts with a web browser and these two modules. The logic and interactions are described in the following subsections.

5.2 The web interface

The Localhost peer has a minimal HTTP server built into it, which is designed to serve a web browser that is running on the same host, to provide the web interface to the program. The HTTP server is shown in figure 6. The HTTP server listens and accepts connections on network port 8880 (TCP). The port 8880 was chosen arbitrarily from the set of port numbers that aren't well-known port numbers.


Figure 6: Modular overview of the Localhost system.

After the peer has been started on a host, it can be accessed by a web browser that is started on the same host and pointed to:

    http://localhost:8880/path

where path is the path of the file or directory node to be retrieved. If the peer finds that the path is a directory node, then the peer downloads the directory node from the LDS and displays the directory page as described in section 3.1. If the peer finds that the path is a file, then the peer downloads the file from the LDS to a user-specified location on the user's file system. The path is also used to give commands to the peer. The commands are given in the format:

    http://localhost:8880/path@command?argname=argument

where path is the path of the specified directory node, which is the directory node that the command is to be executed on. The command is the command name, as detailed below. The argname=argument specifies the argument to the command. Multiple arguments are separated by ampersands. The commands used in the web interface are listed in table 1. The four commands addfile, createfolder, addfolder and delete act on a new version of the directory node that is created when the directory node is placed into editing mode.
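For illustration, a browsing and editing session might issue requests of the following form. The argument names (filepath, name, description) are hypothetical, URL encoding of the argument values is ignored, and the exact names used by the Localhost peer are given in table 1 only where the thesis states them.

    http://localhost:8880/Videos/Trailers/                      display the directory node
    http://localhost:8880/Videos/Trailers/@versions             list all versions of the node
    http://localhost:8880/Videos/Trailers/@edit                 enter editing mode
    http://localhost:8880/Videos/Trailers/@addfile?filepath=/home/user/trailer1.avi
    http://localhost:8880/Videos/Trailers/@submitversion?description=Added+a+trailer
    http://localhost:8880/Videos/Trailers/@canceledit           discard the new version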

5.3 Directory node storage

The conversion of a file distribution system into a file sharing protocol by placing semantics on files, to interpret some of them as directory nodes, is a central idea in this thesis. There are a number of possible designs that can be used to implement this. This subsection compares the considered possibilities. The first design considered was a recursive torrent-files-in-torrent-file design. In this design, there is one torrent which represents the root directory of the hierarchical structure. This torrent contains only torrent files. Each of those torrent files represents either a file or a subdirectory node in the root directory node, depending on whether the torrent file represents a file or a collection of files.

versions: Returns a webpage that is dynamically generated by the program, which lists, for each version of the directory node, its description and the number of users whose preference is that version. Each version description is a link to the getversion command, with the version description and infohash supplied as the argument.

getversion: Takes the infohash of a version and its description as arguments. Returns the specified version of the directory node to the web browser, downloading it if required. The Localhost peer records the specified version as the user's preference, and informs the LDS of the user's preference.

edit: Places the specified directory into editing mode, and returns the editing-mode-format directory page for that specified directory node, as described in section 3. Makes a new version of the specified directory node for the following four commands to operate on, by copying the currently viewed version. The editing-mode-format directory page includes links to the following six commands, providing the correct arguments to the commands.

addfile: Takes a path to a file on the PC's file system as an argument. Adds the file located by the file path on the PC's file system to the specified directory node.

createfolder: Takes a name as an argument. Creates a new empty subdirectory node inside the specified directory node, giving it the name supplied.

addfolder: Takes a path of a folder on the PC's file system as an argument. Adds the folder, including all of its contents recursively by creating more subdirectory nodes, from the file system to the specified directory node.

delete: Takes either a file name or subdirectory node name as argument. Removes that file or subdirectory node from the specified directory node.

submitversion: Submits the new version of the directory node that was created and modified by the preceding five commands to the LDS.

canceledit: Exits the specified directory node from editing mode without submitting the new version of it to the LDS, losing the changes made.

Table 1: Commands used in the web interface of Localhost.

The torrent files that represent a collection of files are the subdirectory nodes. Each subdirectory node would also be a collection of torrent files, each of which is of the same nature as stated earlier, thus recursively building up the hierarchical directory structure. The major problem with this strategy is that it suffers from a direct dependency problem. Recall that a torrent file contains various hashes that are calculated from the contents of the file(s) that the torrent file represents. Changing the contents of any file alters the torrent file that represents the file, because the hashes contained within the torrent file will have changed. Changing the contents of any file therefore also alters the torrent file of the directory node that the file is contained in, because the directory node contains the torrent file that represents that file. This in turn changes the torrent file of the containing directory node's containing directory node in the same way. This process happens all the way up to the root directory node, where the root directory node's torrent file has to be changed. This situation results in the root directory node having to change every time any file or directory node in the entire hierarchy is changed, which is impractical.

To solve this problem, the direct dependence between a directory and its subdirectories needed to be removed. A solution along this line of thinking was to add a level of indirection to the design by replacing the torrent files in the torrents with web URLs that pointed to torrent files hosted on websites. This solution made the system non-decentralised, so it was not viable as a solution to keep with the decentralised aim of the system. Furthering the indirection idea, the solution in use in the final design is to include only the names of the other directory nodes and the names of files in each directory node. The DHT is used to locate the infohash of the torrent file that represents a version of a directory node, using the directory node names, as described in the following subsection. Each directory node is simply a file, stored in the Extensible Markup Language (XML) file format. The XML file contains the list of file names and other directory node names (which can be considered to be subdirectory nodes). The XML file only contains the list of the file names and subdirectory node names in the immediate directory node, i.e. the XML file does not contain the list of file names included in the subdirectory nodes that are included in itself. This list is stored in the XML file format so the Localhost peer can modify the list. The Localhost peer needs to modify the list to add and remove subdirectory node names and file names when creating new versions.
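As a hypothetical illustration of such a directory node file (the element names below are our own assumption; the actual schema used by the Localhost peer is not reproduced in this section), a version of the /Videos/Trailers/ directory node might be stored as:

    <directorynode>
      <subdirectorynode>GameTrailers</subdirectorynode>
      <file>trailer1.avi</file>
      <file>trailer2.avi</file>
    </directorynode>

Note that, consistent with the description above, the listing names only the immediate children; the contents of the GameTrailers subdirectory node are stored in that node's own XML file.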

5.4 Directory node retrieval

When a Localhost peer receives a request from a web browser to retrieve a directory page, the peer needs to return the correct version of the directory node to the web browser. Algorithm 1 details the logic used by the Localhost peer to serve a request for retrieving a directory node. Most of the logic in algorithm 1 exists to let the peer keep state information across multiple browser requests. The logic makes the peer record which version of each directory node the user prefers. The logic also makes the peer record which directory nodes are in editing mode. The PathHash used in the algorithm is a string used as a key in the DHT and is created by taking the SHA-1 hash of the path of the directory node being retrieved, for example, SHA-1("/Videos/Trailers/"). The ViewingPreference used in the algorithm is a string used as a value in the DHT that consists of the version's description and its infohash, for example, "Version with game trailers;126ff7a15f7f4f9025a12eae0ff3547c227c355e". The result of each peer storing their ViewingPreference under the key PathHash is that the infohashes of all of the versions of a directory node are stored under the key that is the SHA-1 hash of the directory node's path. To retrieve a version of a directory node, the Localhost peer SHA-1 hashes the directory node's path to get the PathHash, and performs a DHT get(PathHash) to retrieve the descriptions and infohashes of all of the versions of that directory node. The infohashes are then used to download that version, by use of Azureus' torrent file download feature and decentralised tracking feature.

The Localhost peer uses a cache to avoid re-downloading versions of directory nodes that it has downloaded previously. When the Localhost peer fetches a version of a directory node, as done in algorithm 1, the Localhost peer first checks its cache.

Algorithm 1 Logic used by the Localhost peer to serve a request to retrieve a directory node.

if a specific version is requested via the getversion command then
    Fetch that specific version of the directory node and return it to the web browser.
    Perform a DHT put(PathHash, ViewingPreference) to notify the LDS that the user is viewing that version.
    Record that version as the chosen version.
else if the directory is in editing mode then
    Return the editing-mode format of the directory page.
else if a version has been recorded as the one chosen to view then
    Fetch that version of the directory node and return it to the web browser.
else
    Perform a DHT get(PathHash) to retrieve the ViewingPreferences of the directory node.
    Tally the ViewingPreferences by combining together ones with identical infohash and description (i.e. the ones that are the same version).
    Find the version from the tally that has the highest number of viewers.
    Fetch that version of the directory node and return it to the web browser.
    Perform a DHT put(PathHash, ViewingPreference) to notify the LDS that the user is viewing that version.
    Record that version as the chosen version.
end if
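The tallying step in the final branch of Algorithm 1 amounts to counting identical (description, infohash) pairs and taking the maximum. A sketch of that step, with an illustrative class name, might look like this:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the tallying step of Algorithm 1: ViewingPreferences with the
// same description and infohash are counted together, and the version with the most
// viewers is chosen as the default.
public final class VersionTally {

    public static String mostPopularVersion(List<String> viewingPreferences) {
        Map<String, Integer> tally = new HashMap<>();
        for (String preference : viewingPreferences) {
            // Each preference is "<description>;<infohash>"; identical strings are the same version.
            tally.merge(preference, 1, Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : tally.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best; // null if no preferences were retrieved from the DHT
    }
}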

At this point in time the Localhost peer knows the infohash of the version of the directory node, from a previous DHT get(PathHash), as listed in Algorithm 1. The Localhost peer's cache is stored as folders in the file system of the PC that the Localhost peer is running on. Each version of a directory node's XML file is stored in the cache in a folder named infohash, where infohash is the infohash of the torrent file for that version of the directory node. An earlier design of the peer stored each XML file in the cache in a folder named by taking the SHA-1 hash of the directory node's path. This meant that only one version of a particular directory node could be stored at a time, which was inadequate, because users need to be able to switch between different versions of a particular directory node quickly, to review the differences and make their choice.

If the directory node XML file is found in the cache, the cached version is used. If it is not found in the cache, it is downloaded from remote Localhost peers that have a copy of it, placed into the cache, and used. To download the directory node XML file, the Localhost peer needs the torrent file for the XML file. The torrent file is downloaded from a remote Localhost peer using the torrent file transfer feature of Azureus. To download the torrent file, the Localhost peer requires the infohash of the torrent file, which it already has from the previous DHT get(PathHash). The Localhost peer performs a DHT get(infohash), where infohash is the infohash of the torrent file, to locate peers that are seeding or downloading the file; these peers will also be willing to transmit the torrent file. The torrent file is downloaded from these peers and is then used to download the directory node XML file. The directory node XML file is downloaded using the distributed tracking feature of Azureus, with the result of the previous DHT get(infohash) used to locate peers that are seeding the file.

After the torrent file and XML file have been downloaded, the Localhost peer acts as a seed for the XML file. The XML file is seeded using distributed tracking, so seeding it involves performing a DHT put(infohash, IPaddress) to add the peer's address details to the DHT, where infohash is the infohash of the torrent and IPaddress is the peer's IP address. A DHT get(infohash) is also performed, to locate remote peers that are attempting to download the file, so that the peer can connect to them. When a Localhost peer is seeding a file, Azureus' torrent file transfer feature allows the peer to transfer the file's torrent file via UDP to any remote peer that requests it.
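A sketch of the cache lookup described above, assuming the folder-per-infohash layout; the cache root location and the XML file name inside each folder are assumptions made for illustration.

import java.io.File;

// Illustrative sketch of the directory node cache: one folder per infohash,
// each holding the XML file for that version of the directory node.
public final class DirectoryNodeCache {

    private final File cacheRoot; // e.g. a "cache" folder in the peer's data directory (assumed)

    public DirectoryNodeCache(File cacheRoot) {
        this.cacheRoot = cacheRoot;
    }

    // Returns the cached XML file for the version identified by infohash, or null
    // if it is not cached and must be downloaded from remote peers.
    public File lookup(String infohash) {
        File xml = new File(new File(cacheRoot, infohash), "directory.xml"); // file name is an assumption
        return xml.isFile() ? xml : null;
    }

    // Location where a freshly downloaded version should be stored before seeding it.
    public File storageLocation(String infohash) {
        File versionFolder = new File(cacheRoot, infohash);
        versionFolder.mkdirs();
        return new File(versionFolder, "directory.xml");
    }
}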

Algorithm 1 states that a DHT get(PathHash) is performed to retrieve the ViewingPreferences of a directory node. The number of ViewingPreferences submitted for any directory node in the system grows as O(n), where n is the number of users in the system. To put a limit on the time taken by the DHT get(PathHash) operation that collects the viewing preferences, the operation is limited to run for four seconds. Four seconds was chosen as a trade-off between taking no time, which would collect no viewing preferences, and taking too long, which would make users wait too long to view directory nodes.
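One way to express the four-second collection budget, assuming the DHT lookup reports values as they are found; the Dht interface here is a hypothetical stand-in for illustration, not the DHT API built into Azureus.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Illustrative sketch: collect whatever ViewingPreferences arrive within four seconds.
public final class PreferenceLookup {

    // Hypothetical listener-style DHT lookup; values are reported as they are found.
    public interface Dht {
        void get(String key, Consumer<String> onValueFound);
    }

    public static List<String> viewingPreferences(Dht dht, String pathHash) throws InterruptedException {
        List<String> found = Collections.synchronizedList(new ArrayList<>());
        dht.get(pathHash, found::add);
        Thread.sleep(TimeUnit.SECONDS.toMillis(4)); // the four-second trade-off described above
        return new ArrayList<>(found);
    }
}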

5.5 Global namespace

When the Localhost peer is started on a user's PC, the peer launches a web browser which is pointed to http://localhost:8880/. This URL requests the peer to retrieve the directory node of path "/", which is the global root directory node of the hierarchical structure. Having this URL requested by every Localhost peer gives each user a starting point to browse from, and ensures that each peer uses the same string ("/") to hash when finding the root directory node.

5.6 Directory node display

Each directory node XML file contains a reference to an Extensible Stylesheet Language (XSL) file. The XSL file defines how the directory page is displayed in the web browser. When an XML file is returned to the web browser, the reference to the XSL file causes the web browser to request the XSL file. The Localhost peer holds an XSL file which it returns to the web browser in response to this request. This sequence is shown in figure 6 for one directory page transfer. The XSL file is used by the web browser to visually format the XML file, as shown in figure 3.

It was theorised that it would be difficult for a directory node to move on to the next best version, because the most popular version of a particular directory node is returned to any user who has not chosen to view another version. The solution implemented for this problem was to add a listing of the six most popular versions to the directory page. Six versions were listed so that the directory page is not too long while still including a reasonable number of alternatives. Users should be able to choose other versions to view more easily, because the versions are visible while the users are viewing the directory page, rather than having to go to the versions page first.
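As a sketch of the kind of stylesheet involved, the following minimal XSLT assumes the hypothetical directory/file/subdirectory element names used in the earlier XML example; it is not the actual Localhost stylesheet, and the download URL it generates is illustrative only.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/directory">
    <html>
      <body>
        <h1><xsl:value-of select="@path"/></h1>
        <!-- Subdirectory nodes become links to their own directory pages. -->
        <ul>
          <xsl:for-each select="subdirectory">
            <li><a href="{concat(/directory/@path, @name, '/')}"><xsl:value-of select="@name"/></a></li>
          </xsl:for-each>
        </ul>
        <!-- Files become links that ask the peer to download them. -->
        <ul>
          <xsl:for-each select="file">
            <li><a href="{concat('/download?infohash=', @infohash)}"><xsl:value-of select="@name"/></a></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>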

5.7 File retrieval

The files listed in the directory nodes are downloaded in the same way as the directory nodes themselves: using the BitTorrent protocol, with the decentralised tracking and torrent file transfer features of Azureus. When a file has finished downloading, the Localhost peer becomes a seed for that file, helping to distribute the file.

6 Results and discussion

We developed a website [18] to release the Localhost peer, and placed it online on the 23rd of August 2005. The website describes the peer, and allows users to download a program that installs the peer onto the user's PC. The website includes a link to the web output of a modified Localhost peer running on a host, so that users can preview how the system looks, and what files are on the system, before they download and run the peer on their own PC. The modifications to this peer notify users, when they click a file link, that they have to download and run the peer on their own PC in order to download files from the directory structure. The modifications also disallow editing and creating new versions of the directory nodes. To encourage users to use the peer, we created new versions of subdirectory nodes in the root directory node, such as Pictures, Software, and Audio, with a range of (legal) files added to them. The modified peer was set to view those versions so that at least one peer's preference was for those versions.

6.1 Comparison to systems similar to Localhost

Table 2 compares Localhost to various P2P file sharing systems. Gnutella and Localhost are the only systems listed in the table that have decentralised indexes. Three of the systems covered in the table can find files using a browsable structure: the DirectConnect and Soulseek systems allow each peer's files to be browsed individually, the standard BitTorrent system has a number of separate browsable indexes, and Localhost has a single browsable index of all of the files in the system.

Localhost has a number of advantages over some of the other P2P file sharing systems. The first is that Localhost is a completely decentralised system. Both the transfer of files and the maintenance of the index are done by the peers in the system, without the use of any centralised server, so Localhost does not have any single point of failure¹. The second is that the single browsable structure indexes all of the files in the system, rather than only a subset of them, as with the standard BitTorrent file sharing system websites.

Localhost also has a number of disadvantages compared to other P2P file sharing systems. The first is that it has limited real-world use. Since Localhost's release, the users of the Localhost system have not created enough new versions of directory nodes for us to draw conclusions about the behaviour of the popularity system. The second is that its browsing speed is relatively slow. From our observations of a running peer, the process of downloading a directory node and displaying it usually takes between 10 and 50 seconds on a 1.5 megabits/second home broadband connection. The standard BitTorrent system's browsing speed is faster than this, as the index is maintained on websites, whose pages can usually be viewed within 10 seconds of requesting them.

System | Index placement | Index type | Strengths | Weaknesses
Napster | Centralised | Query string searchable | | Pollution/poisoning level, centralised index
Gnutella-based systems | Decentralised | Query string searchable | No centralised components required | Pollution/poisoning level, download performance
Soulseek / DirectConnect | Centralised | Query string searchable, and individual peer browsable | Peers are browsable | Pollution/poisoning level, centralised index, browsable index isn't cohesive
Standard BitTorrent system | Centralised | A number of separate browsable categorical indexes | | Centralised indexes
Localhost | Decentralised | Global browsable hierarchical structure | Pollution/poisoning level (theoretical), no centralised components, collaboratively created index | Limited real world use, browsing speed

Table 2: Comparison of P2P file sharing systems


¹ There is of course the one bootstrap peer that is used by all of the peers in the system to join the DHT overlay network. Each peer requires the IP address of any one peer that is already in the DHT in order to join the DHT. There are options other than using the one single bootstrap peer, such as including a list of known remote peer IP addresses with the peer when the peer is downloaded, or word-of-mouth transfer of IP addresses of other remote peers.


6.2 Theoretical behaviour of the popularity based system

In this section we model the popularity system and run a simulation to study its behaviour. The simulation looks at the number of user preferences for each version of one directory node over time. We wish to have one dominant version of the directory node at all points in time in the simulation, and for that dominant version to be of higher quality than the other versions of the directory node. The dominant version is the version that stays the most popular version for a period of time. The simulation looks at two properties of the popularity system:

- The popularity system's ability to change the dominant version of the directory node from one version to other, higher quality, versions. If the system is able to change from a poor quality version (e.g. a version that lists fake or unusable files) to a higher quality version, it should have less of a problem with pollution and poisoning.

- The stability of the directory node. If two or more versions quickly alternate between being the most popular version of the directory node, then the directory node is not considered stable. An unstable directory node does not have a single dominant version.

The simulation is driven by time ticks, where each time tick represents one minute in the real system. The simulation simulates 10,000 minutes, and 2000 users enter the system during this time. Each user's entry time is a random time from time 1 to time 10,000. Each user stays in the system for an average of 120 minutes, and then leaves. This value was selected based on the uptime of BitTorrent peers in the standard BitTorrent file sharing system [26]. At time tick 1, the directory node has one version, and the system has one user in it.

The model considers that for every user there is a constant probability 0 < Pc < 1 that the user will create a new version of the directory node. In the model, users that create a new version of the directory node do so when they enter the system, and then choose to view that version. The model also considers that at every time tick, each user has a constant probability 0 < Pv < 1 of changing their viewing preference for the directory node. When a version has no user preferences, it can never be chosen again, and is considered dead.

Each version is given a quality, which is a real value between 0 and 1. Each version's quality value is chosen randomly using a uniform distribution, and stays constant throughout the simulation. In the model, the version that a user changes their preference to is chosen by a random selection that favours each version according to its quality. This is done by placing all the versions along a number line, with each version taking up a length equal to its quality, so that the number line's length is the sum of all the qualities. A number is chosen uniformly at random from the length of the number line, and the version on which that number falls is the version that is chosen; a sketch of this selection is given below.

The simulation measures the popularity system's ability to change the most popular version of a directory node to a better version. It does this by counting the number of time ticks from when a version that is better than the current most popular version is created, to when that version becomes the most popular. This is done for every version that is better than the current most popular version, and these counts are summed to give the measure of the system's ability to change. A lower value represents a higher ability to change quickly.
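A minimal sketch of the quality-weighted (number-line, or roulette-wheel) selection described above; the class name and use of java.util.Random are illustrative choices, not taken from the thesis.

import java.util.List;
import java.util.Random;

// Illustrative sketch of the number-line selection used in the model: each version
// occupies a segment of length equal to its quality, and a uniformly random point
// on the line picks the version whose segment it falls in.
public final class QualityWeightedChoice {

    // qualities.get(i) is the quality (0..1) of version i; returns the index of the chosen version.
    public static int choose(List<Double> qualities, Random random) {
        double total = 0.0;
        for (double q : qualities) {
            total += q;
        }
        double point = random.nextDouble() * total; // uniform point on the number line
        double position = 0.0;
        for (int i = 0; i < qualities.size(); i++) {
            position += qualities.get(i);
            if (point < position) {
                return i;
            }
        }
        return qualities.size() - 1; // guard against floating-point rounding at the end of the line
    }
}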
The simulation measures the popularity system's stability by counting how many times each version becomes the most popular version. The count is increased every time a version becomes the most popular version, even if that version has been the most popular version before. The counts from all versions are summed to give an overall value that measures the system's stability; the lower the value, the more stable the system is.

The simulation was run for values of Pc from 1/16 to 1/1024, and values of Pv from 1/16 to 1/1024. Figure 7 shows the results.

Figure 7: Results of the simulation, with parameters Pv and Pc varying. (a) The popularity system's stability. (b) The popularity system's ability to change the directory node's most popular version to be a better version.

Figure 7(a) shows that the system becomes unstable for values of Pv larger than 1/64, as indicated by the peaks in the graph. This is because, as users change their preferences more frequently, no version stays the most popular for long. This instability is increased with larger values of Pc, because there are more versions that can potentially become the most popular. Figure 7(b) shows that with a high Pc and a low Pv, that is, with users infrequently changing their viewing preferences and a high percentage of users creating new versions, the system takes longer to make the higher quality versions the most popular version. The figure shows zero time ticks taken for better versions to become the most popular for values of Pv where Pc is 1/1024, because in these cases only about 10 versions were created over the 10,000 minutes, and only a few versions, at most, were alive at any one time.

The two aims of the popularity system are that it is stable, and that it has the ability to change the most popular version of a directory node to a higher quality version. From inspection of figures 7(a) and 7(b), the range of values of Pc and Pv that achieve these two aims to a reasonable degree is Pv < 1/64 and Pc < 1/128. A reasonable degree here means fewer than 100 changes of most popular version over the 10,000 time ticks, and fewer than 20 total time ticks to bring new, higher quality versions to be the most popular. Of course, the value of Pc will have to be high enough for new versions to be created, and the value of Pv will have to be high enough to allow some versions to be viewed by users other than their creators.
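The model described above is small enough to restate as code. The following is a hedged sketch of one possible simulation loop, reusing the QualityWeightedChoice sketch shown earlier and using the parameter values given in the text (10,000 one-minute ticks, 2000 users, 120-minute average stay). The thesis does not give the implementation, so details such as the exponentially distributed stay times and the example Pc and Pv values are assumptions; the dead-version rule and the ability-to-change measure are omitted for brevity.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hedged sketch of the popularity-system simulation described in section 6.2.
public final class PopularitySimulation {

    static final int TICKS = 10_000;
    static final int USERS = 2_000;
    static final double MEAN_STAY = 120.0;

    public static void main(String[] args) {
        double pc = 1.0 / 256;   // probability a user creates a new version on entry (example value)
        double pv = 1.0 / 128;   // per-tick probability a user changes preference (example value)
        Random rng = new Random(42);

        List<Double> quality = new ArrayList<>();   // quality of each version, uniform in [0,1)
        int[] preference = new int[USERS];          // version index each user prefers, -1 if none
        int[] enter = new int[USERS];
        int[] leave = new int[USERS];
        for (int u = 0; u < USERS; u++) {
            preference[u] = -1;
            enter[u] = 1 + rng.nextInt(TICKS);
            // Assumed exponential stay time with a 120-minute mean.
            leave[u] = enter[u] + (int) Math.ceil(-MEAN_STAY * Math.log(1 - rng.nextDouble()));
        }
        quality.add(rng.nextDouble());              // the single version present at tick 1

        int popularityChanges = 0;                  // stability measure: changes of most popular version
        int lastMostPopular = 0;
        for (int t = 1; t <= TICKS; t++) {
            for (int u = 0; u < USERS; u++) {
                if (t < enter[u] || t >= leave[u]) continue;
                if (t == enter[u] && rng.nextDouble() < pc) {
                    quality.add(rng.nextDouble());  // create a new version and view it
                    preference[u] = quality.size() - 1;
                } else if (rng.nextDouble() < pv) {
                    preference[u] = QualityWeightedChoice.choose(quality, rng);
                }
            }
            int mostPopular = mostPreferred(preference, quality.size());
            if (mostPopular != lastMostPopular) {
                popularityChanges++;
                lastMostPopular = mostPopular;
            }
        }
        System.out.println("changes of most popular version: " + popularityChanges);
    }

    static int mostPreferred(int[] preference, int versions) {
        int[] counts = new int[versions];
        for (int p : preference) {
            if (p >= 0) counts[p]++;
        }
        int best = 0;
        for (int v = 1; v < versions; v++) {
            if (counts[v] > counts[best]) best = v;
        }
        return best;
    }
}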

6.3 Observation of the Localhost system in use

In this subsection we provide some observations on how the Localhost system worked after being released, and look at some of the feedback received from users.

A number of third-party websites linked to the Localhost website, and according to our web server log statistics the peer has been downloaded over 10,000 times since its release. Despite this, only a relatively small number of new versions of directory nodes were created by users. One of these was a version of the root directory node created with the description "Videos". This version includes a directory node also called Videos, and in that Videos directory node is a directory node called Trailers, which contains a number of video files of game trailers. The directory page for this is shown in figure 2. Another was a new version of the Pictures subdirectory node that included a number of image files. Finally, a number of new versions of the Software subdirectory node were created, each of which included a number of executable files. The video, image and executable files were all able to be downloaded by our peer. This shows that the Localhost P2P file sharing system does work.

Some interesting comments were fielded from users. One user commented: "For one, it seems odd to be trying to work file sharing on a structured file set, rather than using search. It would make sense that it would increase discoverability of resources, but it sounds impossible to manage with any real cohesion." Another user replied to this: "That's what was said about a large scale wiki for all to write in. Wikipedia.org is working so far though." A number of users were unable to work out how to use the peer. One user said: "It's confusing :s I thought the root folder was the program folder. Installed it yesterday, haven't figured out how to change root folder or add folders. But it looks promising."

6.4 Further work

The work done in this thesis opens up a number of opportunities for further study. A number of users commented that a Localhost-style system would be useful inside a single organisation. Work could be done to make Localhost operate in this environment, by letting the peer connect to a user-defined DHT bootstrap peer.

From observation of the system, a major factor limiting the usability of the Localhost system is the browsing speed, which is limited by how fast directory nodes can be transferred. The browsing speed is slow because the Localhost peer needs to perform a number of network operations before the directory node can even begin to be downloaded. If the most popular version is required, the first operation is the DHT get needed to retrieve the ViewingPreferences of the directory node from the DHT; as mentioned earlier, this operation is allowed to run for four seconds. The next operation is a DHT get to locate peers that will transfer the torrent file, followed by the transfer of the torrent file, and then the transfer of the directory node XML file. Occasionally, remote peers go offline without removing their entries from the DHT, so some IP addresses are unusable and the peer wastes time attempting to connect to them. All of these operations are required for a peer to download a single, relatively small (usually less than 1 kilobyte) directory node XML file. The actual download of the XML file usually takes less than a second, and it is usually displayed in the browser less than a second later. The time taken to download and display a directory node is therefore dominated by the time taken to find the directory node and torrent file, and to find peers that have a copy of it.

A number of things could be done to reduce the time of directory node transfers. The first is that a different file distribution system could be used. One of the design decisions we made was to use BitTorrent as the file distribution system; in making that decision, its relatively high download performance compared to other P2P file distribution systems was considered, but the time it takes to start a transfer was not. The second is that the directory node XML file could be transferred at the same stage as the torrent file. The third is to include a pre-tallied summary of the user preferences in the DHT, to reduce the time taken to collect the user preferences. The user preferences could be tallied by some peers, and those tallies injected into the DHT. The values of the tallies will change over time, and malicious peers could inject incorrect tallies, so some combination of the most recent and most popular tally would have to be taken when reading the pre-computed tallies. Any of these improvements, or a combination of them, would increase the browsing speed of the system.

The Localhost system could be turned into a distributed-web-style system with a relatively small amount of further work. In the current Localhost system, directory nodes are the only thing displayed in the web browser by the peer; all other types of files are downloaded to the user's file system. The peer could be made to display other types of files, such as HyperText Markup Language (HTML) files, in the web browser, and the HTML files could contain links to other HTML files and/or directory nodes. This would create a system that is similar in use to the World Wide Web, but implemented using the peers in the system rather than centralised web servers.
Further work could be done to allow the peer to provide a virtual network drive that is integrated with the operating system of the PC, so that the hierarchical structure could be accessed directly by other programs on the user's PC. Elements such as the popularity based system would have to be reconsidered for this approach.

7 Conclusion

In this thesis we identified a number of issues in P2P file sharing systems. We designed, built, and released a P2P file sharing system that aimed to solve the identified issues. The system was designed by considering a number of alternative possibilities for implementation. We demonstrated that it is possible to implement a browsable file finding system by placing semantics on files in an existing file distribution system. We introduced a popularity based system for allowing collaboration between users to create a coherent index of files in a P2P system. We modelled and simulated the popularity based system and found that the system is practical for a certain range of parameters of the model.

References
[1] Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell, Jacob R. Lorch, Marvin Theimer, and Roger P. Wattenhofer. Farsite: federated, available, and reliable storage for an incompletely trusted environment. SIGOPS Oper. Syst. Rev., 36(SI):1-14, 2002.

[2] Anurag Singla and Christopher Rohrs. Ultrapeers: another step towards Gnutella scalability, version 1.0, 2005. http://rfc-gnutella.sourceforge.net/src/Ultrapeers1.0.html.

[3] BitTorrent website, 2005. http://bittorrent.com.

[4] CacheLogic. CacheLogic website, 2005. http://www.cachelogic.com.

[5] CacheLogic. Peer-to-peer in 2005, 2005. http://www.cachelogic.com/research/p2p2005.php.

[6] Bengt Carlsson and Rune Gustavsson. The rise and fall of Napster: an evolutionary approach. In AMT '01: Proceedings of the 6th International Computer Science Conference on Active Media Technology, pages 347-354, London, UK, 2001. Springer-Verlag.

[7] Nicolas Christin, Andreas S. Weigend, and John Chuang. Content availability, pollution and poisoning in file sharing peer-to-peer networks. In EC '05: Proceedings of the 6th ACM Conference on Electronic Commerce, pages 68-77, New York, NY, USA, 2005. ACM Press.

[8] B. Cohen. Incentives build robustness in BitTorrent. In Proceedings of the Workshop on Economics of Peer-to-Peer Systems, Berkeley, CA, USA, 2003.

[9] Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Wide-area cooperative storage with CFS. In SOSP '01: Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, pages 202-215, New York, NY, USA, 2001. ACM Press.

[10] DirectConnect Wikipedia entry, file-sharing application, 2005. http://en.wikipedia.org/wiki/Direct_connect.

[11] eMule website, 2005. http://www.emule-project.net/.

[12] David Erman, Dragos Ilie, Adrian Popescu, and Arne Nilsson. Measurement and analysis of BitTorrent signaling traffic, 2004.


[13] Jian Liang, Rakesh Kumar, Yongjian Xi, and Keith W. Ross. Pollution in P2P file sharing systems, 2005.

[14] T. Karagiannis, A. Broido, N. Brownlee, K. Claffy, and M. Faloutsos. Is P2P dying or just hiding? In Globecom, Dallas, TX, USA, November 2004.

[15] John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Chris Wells, and Ben Zhao. OceanStore: an architecture for global-scale persistent storage. In ASPLOS-IX: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 190-201, New York, NY, USA, 2000. ACM Press.

[16] Jintae Lee. An end-user perspective on file-sharing systems. Commun. ACM, 46(2):49-53, 2003.

[17] LimeWire website, 2005. http://limewire.com.

[18] Localhost website, 2005. http://p2p.cs.mu.oz.au/software/Localhost/.

[19] Henrik Lundgren, Richard Gold, Erik Nordström, and Mattias Wiggberg. A distributed instant messaging architecture based on the Pastry peer-to-peer routing substrate.

[20] Sepandar D. Kamvar, Mario T. Schlosser, and Hector Garcia-Molina. Incentives for combatting freeriding on P2P networks, 2003.

[21] David Mazieres and Petar Maymounkov. Kademlia: a peer-to-peer information system based on the XOR metric, July 2002.

[22] Morpheus website, 2005. http://morpheus.com.

[23] Athicha Muthitacharoen, Robert Morris, Thomer M. Gil, and Benjie Chen. Ivy: a read/write peer-to-peer file system. SIGOPS Oper. Syst. Rev., 36(SI):31-44, 2002.

[24] The Open Directory Project, 2005. http://dmoz.org.

[25] Christopher Peery, Francisco Matias Cuenca-Acuna, Richard P. Martin, and Thu D. Nguyen. Wayfinder: navigating and sharing information in a decentralized world. In International Workshop on Databases, Information Systems and Peer-to-Peer Computing (co-located with VLDB 2004), August 2004.

[26] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips. A measurement study of the BitTorrent peer-to-peer file-sharing system, 2004.

[27] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content addressable network. Technical Report TR-00-010, Berkeley, CA, 2000.

[28] M. Ripeanu. Peer-to-peer architecture case study: Gnutella network, 2001.

[29] Jordan Ritter. Why Gnutella can't scale. No, really., 2001. http://www.darkridge.com/jpr5/doc/gnutella.html.

[30] Antony Rowstron and Peter Druschel. Pastry: scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329-350, November 2001.

[31] Shareaza website, 2005. http://shareaza.sourceforge.net.

[32] Clay Shirky. What is P2P, and what isn't, 2000. http://www.openp2p.com/pub/a/p2p/2000/11/24/shirky1-whatisp2p.html.


[33] Napster injunction, 2001. 2001 US Dist. LEXIS 2186 (N.D. Cal. Mar. 5, 2001), aff'd, 284 F.3d 1091 (9th Cir. 2002).

[34] Kundan Singh and Henning Schulzrinne. Peer-to-peer internet telephony using SIP. In NOSSDAV '05: Proceedings of the International Workshop on Network and Operating Systems Support for Digital Audio and Video, pages 63-68, New York, NY, USA, 2005. ACM Press.

[35] Soulseek website, 2005. http://slsk.org.

[36] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: a scalable peer-to-peer lookup service for internet applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149-160, 2001.

[37] Domenico Talia and Paolo Trunfio. Toward a synergy between P2P and Grids. IEEE Distributed Systems Online, 4(7), 2003.

[38] Egemen Tanin, Aaron Harwood, Hanan Samet, Sarana Nutanong, and Minh Tri Truong. A serverless 3D world. In GIS '04: Proceedings of the 12th Annual ACM International Workshop on Geographic Information Systems, pages 157-165, New York, NY, USA, 2004. ACM Press.

[39] Antti Tapio. Future of telecommunication: Internet telephony operator Skype.

[40] Wikipedia. Peer-to-peer entry, 2005. http://en.wikipedia.org/wiki/Peer-to-peer.

[41] Wikipedia website, 2005. http://wikipedia.org.

[42] Brandon Wiley. Distributed hash tables, part I. Linux J., 2003(114):7, 2003.

[43] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: an infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, UC Berkeley, April 2001.

