
This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. IEEE TRANSACTIONS ON MOBILE COMPUTING

Leveraging Social Networks for P2P Content-based File Sharing in Disconnected MANETs
Kang Chen, Haiying Shen*, Member, IEEE, and Haibo Zhang
Abstract: Current peer-to-peer (P2P) file sharing methods in mobile ad hoc networks (MANETs) can be classified into three groups: flooding-based, advertisement-based and social contact-based. The first two groups of methods can easily have high overhead and low scalability. They are mainly developed for connected MANETs, in which end-to-end connectivity among nodes is ensured. The third group of methods adapts to the opportunistic nature of disconnected MANETs but fails to consider the social interests (i.e., contents) of mobile nodes, which can be exploited to improve the file searching efficiency. In this paper, we propose a P2P content-based file sharing system, namely SPOON, for disconnected MANETs. The system uses an interest extraction algorithm to derive a node's interests from its files for content-based file searching. For efficient file searching, SPOON groups common-interest nodes that frequently meet with each other as communities. It takes advantage of node mobility by designating stable nodes, which have the most frequent contact with community members, as community coordinators for intra-community searching, and highly-mobile nodes that visit other communities frequently as community ambassadors for inter-community searching. An interest-oriented file searching scheme is proposed for high file searching efficiency. Additional strategies for file prefetching, querying-completion and loop-prevention, and node churn consideration are discussed to further enhance the file searching efficiency. We first tested our system on the GENI Orbit testbed with a real trace and then conducted event-driven experiments with two real traces and NS2 simulation with simulated disconnected and connected MANET scenarios. The test results show that our system significantly lowers transmission cost and improves file searching success rate compared to current methods.
Index Terms: MANETs, Content-based file sharing, Social networks

1 INTRODUCTION

In the past few years, personal mobile devices such as laptops, PDAs and smart phones have become increasingly popular. Indeed, the number of smart-phone users increased by 118 million across the world in 2007 [1], and is expected to reach around 300 million by 2013 [2]. The rapid growth of mobile users is leading to a promising future in which they can freely share files with each other anytime and anywhere. The number of mobile searching users (through smart phones, feature phones, tablets, etc.) is estimated to reach 901.1 million in 2013 [3]. Currently, mobile users interact with each other and share files via an infrastructure formed by geographically distributed base stations. However, users may find themselves in an area without wireless service (e.g., mountain areas and rural areas). Moreover, users may hope to reduce the cost of expensive infrastructure network data. In the P2P file sharing model, nodes share files directly with each other without a centralized server, which makes large-scale networks a blessing instead of a curse. Wired P2P file sharing systems (e.g., BitTorrent [4] and Kazaa [5]) have already become a popular and successful paradigm for file sharing among millions of users. The successful deployment of P2P file sharing systems and the aforementioned impediments to file sharing in
* Corresponding Author. Email: shenh@clemson.edu. K. Chen and H. Shen are with the Department of Electrical and Computer Engineering, Clemson University, Clemson, South Carolina, 29634. Haibo Zhang is with the Cerner Corporation.

MANETs make P2P file sharing over MANETs (P2P MANETs in short) a promising complement to the current infrastructure model to realize pervasive file sharing for mobile users. As mobile digital devices are carried by people who usually belong to certain social relationships, in this paper we focus on P2P file sharing in a disconnected MANET community consisting of mobile users with social network properties. In such a file sharing system, nodes meet and exchange requests and files in the format of text, short videos and voice clips in different interest categories. A typical scenario is a course material (e.g., course slides, review sheets, assignments) sharing system on a school campus. Such a scenario ensures for the most part that nodes sharing the same interests (e.g., math) carry corresponding files (e.g., math files) and meet regularly (e.g., attending math classes). In MANETs consisting of digital devices, nodes are constantly moving, forming disconnected MANETs with opportunistic node encountering. Such transient network connections have posed a challenge for the development of P2P MANETs. Traditional methods supporting P2P MANETs are either flooding-based [6]-[9] or advertisement-based [10]-[12]. The former methods rely on flooding for file searching. However, they lead to high broadcast overhead. In the latter methods, nodes advertise their available files, build content tables, and forward files according to these tables. But they have low search efficiency because of expired routes in the content tables caused by transient network connections. Also, advertising can lead to high overhead. Some researchers [13]-[17] further proposed to utilize cache/replication to enhance data dissemination/access efficiency in disconnected MANETs. However, nodes in

Digital Object Identifier 10.1109/TMC.2012.239

1536-1233/12/$26.00 2012 IEEE


these methods passively wait for contents that they are interested in rather than actively searching for files, which may lead to a high search delay. Recently, social networks have been exploited to facilitate content dissemination/publishing in disconnected MANETs [18]-[21]. These methods exploit the following property to improve the efficiency of message forwarding: (P1) nodes (i.e., people) usually exhibit certain movement patterns (e.g., local gathering, diverse centralities and skewed visiting preferences). However, these methods are only for the dissemination of information to subscribers. They are not specifically designed for file searching. Also, they fail to take into account other properties of social networks revealed by recent studies to facilitate content sharing: (P2) Users usually have a few file interests that they visit frequently [22] and a user's file visit pattern follows a power-law distribution [23]. (P3) Users with common interests tend to meet with each other more often than with others [24]. By leveraging these properties of social networks, we propose Social network based P2P cOntent-based file sharing in disconnected mObile ad-hoc Networks (SPOON) with four components as shown in Figure 1. (1) Based on P2, we propose an interest extraction algorithm to derive a node's interests from its files. The interests facilitate queries in content-based file sharing and other components of SPOON. (2) We refer to a collection of nodes that share common interests and meet frequently as a community. According to P3, a node has a high probability of finding files of interest in its community. If this fails, based on P1, the node can rely on nodes that frequently travel to other communities for file searching. Thus, we propose a community construction algorithm to build communities to enable efficient file retrieval. (3) According to P1, we propose a node role assignment algorithm that takes advantage of node mobility for efficient file searching.
The algorithm designates a stable node that has the tightest connections with others in its community as the community coordinator to guide intra-community searching. For each known foreign community, a node that frequently travels to it is designated as the community ambassador for inter-community searching. (4) We propose an interest-oriented file searching and retrieval scheme that utilizes an interest-oriented routing algorithm (IRA) and the above three components. Based on P3, IRA selects the forwarding node by considering the probability of meeting keywords of interests rather than nodes. The file searching scheme has two phases: intra- and inter-community searching. In the former, a node first queries nearby nodes, then relies on the coordinator to search the entire home community. If it fails, the inter-community searching uses an ambassador to send the query to a matched foreign community. A discovered file is sent back along the search path, or through IRA if the path breaks.

Fig. 1. Components of SPOON: Interest Extraction, Community Construction, Exploiting Node Stability/Mobility, and Interest Oriented Routing.

SPOON is novel in that it leverages social network properties of both node interest and movement pattern. First, it classifies common-interest and frequently-encountered nodes into social communities. Second, it considers the frequency at which a node meets different interests rather than different nodes in file searching. Third, it chooses stable nodes in a community as coordinators and highly mobile nodes that travel frequently to foreign communities as ambassadors. Such a structure ensures that a query can be forwarded to the community of the queried file quickly. SPOON also incorporates additional strategies for file prefetching, querying-completion and loop-prevention, and node churn consideration to further enhance file searching efficiency. The rest of the paper is arranged as follows. Section 2 provides an overview of related works. Section 3 presents the design of the components of SPOON. In Section 4, the performance of SPOON is evaluated in comparison with other systems. The last section presents concluding remarks and future work.

2 RELATED WORK
2.1 P2P File Sharing in MANETs

We first introduce the P2P file sharing algorithms designed for MANETs.

2.1.1 Flooding-based Methods
Among flooding-based methods, 7DS [6] is one of the first approaches to port P2P technology to mobile environments. It exploits the mobility of nodes within a geographic area to disseminate web content among neighbors. Passive Distributed Indexing (PDI) [8] is a general-purpose distributed file searching algorithm. It uses local broadcasting for content searching and sets up content indexes on nodes along the reply path to guide subsequent searching. Klemm et al. [7] proposed a special-purpose on-demand file searching and transferring algorithm based on an application layer overlay network. The algorithm transparently aggregates query results from other peers to eliminate redundant routing paths. Anna Hayes et al. [9] extended the Gnutella system to mobile environments and proposed the use of a set of keywords to represent user interests. However, these flooding-based methods produce high overhead due to broadcasting.

2.1.2 Advertisement-based Methods
Tchakarov and Vaidya [10] proposed GCLP for efficient content discovery in location-aware ad hoc networks. It disseminates contents and requests in crossed directions to ensure their encountering. P2PSI [11] combines both advertisement (push) and discovery (pull) processes. It adopts the idea of swarm intelligence by regarding shared files as food sources and routing tables as pheromone. Each file holder regularly broadcasts an


advertisement message to inform surrounding nodes about its files. The discovery process locates the desired file and also leaves pheromone to help subsequent search requests. Repantis and Kalogeraki [12] proposed a file sharing mechanism in which nodes use the Bloom filter to build content synopses of their data and adaptively disseminate them to other nodes to guide queries. Though the advertisement-based methods reduce the overhead of flooding-based methods, they still generate high overhead for advertising and cannot guarantee the success of file searching due to node mobility.

2.2 P2P File Sharing in Disconnected MANETs
Disconnected MANETs are characterized by sparse node density and intermittent node connection, which make the previously introduced methods infeasible in such networks. We therefore introduce two categories of P2P file sharing methods for disconnected MANETs.

2.2.1 Cache/Replication-based Methods
Huang et al. [13] proposed a method that considers multiple factors (e.g., node mobility, file popularity, and file server topology) in creating file replicas in file servers to realize optimal file availability in a content distribution community. Gao et al. [14] proposed cooperative caching in disruption tolerant networks. It replicates each file to network central locations, which are frequently visited by nodes in the system, to ensure efficient data access. QCR [15] uses file caching to realize effective multimedia content dissemination in opportunistic networks. In addition to node mobility and file popularity, it also considers the impatience of users when creating replicas. Lenders et al. [16] investigated wireless ad hoc podcasting, in which nodes store contents from their neighbors that are of interest to themselves or to the nodes they have met. Chen et al. [17] deduced the optimal file replication strategy in MANETs by further considering a node's ability to meet other nodes as a resource, since replicas on such nodes can meet more requesters and thus have higher availability.
Though these methods improve file availability, nodes in these methods passively wait for contents they are interested in rather than actively searching for files, which may lead to search delay.

2.2.2 Social Network-based Methods
Recently, social networks have been utilized in content publishing/dissemination algorithms [18]-[21] in opportunistic networks. MOPS [18] provides content-based sub/pub service by utilizing the long-term neighboring relationship between nodes. It groups nodes with frequent contacts and selects nodes that connect different groups as brokers, which are responsible for inter-community communication. Then contents and subscriptions are relayed through brokers to reach different communities. MOPS only considers node mobility, while SPOON considers both node interest and mobility, as described previously. Moreover, unlike MOPS, which only depends on the meeting of brokers for inter-community search, SPOON enhances the efficiency of inter-community search by (1) assigning one ambassador for each known foreign community, which helps to forward a query directly to the destination

community, and (2) utilizing stable nodes (coordinators) to receive messages from ambassadors. The work in [25] is similar to MOPS. It selects centrality nodes as brokers and builds them into an overlay, in which brokers use unicast or direct protocols (e.g., WiFi access points) for communication. Node publications are first transferred to the broker node responsible for the node's community and then propagated to all brokers to find matched subscribers. SocialCast [19] calculates a node's utility value on an interest based on the node's mobility and co-location with the nodes subscribed to the interest. It publishes contents on an interest to subscribers by forwarding the contents to nodes with the highest utilities on the interest. ContentPlace [21] defines social relationship based communities and a set of content caching policies. Specifically, each node calculates a utility value of published data it has met based on the data's destination and its connected communities, and caches the data with the highest utilities. However, the above methods mainly focus on disseminating publications to matched subscribers. Therefore, these methods cannot be applied to file searching directly.

3 THE DESIGN OF SPOON

In this section, we first present trace data analysis to verify the social network properties in a real MANET. A P2P file sharing system usually consists of 1) a method to represent contents, 2) a node management structure and 3) a file searching method based on 1) and 2). Accordingly, SPOON has three main components: 1) interest extraction, 2) structure construction including community structure and node role assignment, and 3) interest-oriented file searching and retrieval based on 1) and 2). We then present each component of SPOON.

3.1 Trace Data Analysis
In order to validate the correlation between node interests and their contact frequencies, we analyzed the trace from the Haggle project [26], which contains the encountering records among 98 mobile devices carried by scholars attending the Infocom06 conference. Some participants completed questionnaires indicating the conference tracks that they were interested in. We use $T_t$ to denote the time length of the trace, and define the total meeting time of two nodes as the sum of the time length of each encounter. By regarding a community as a group of nodes in which each node has a total meeting time larger than $T_t/4$ with at least half of all nodes in the community, we detected 8 communities from the trace. We then calculated each node's average number of shared interested tracks with other members in its own community $C_i$ ($0 \le i < 8$), and with nodes in all other communities, respectively. Finally, the average values of all nodes in each community were calculated and are shown in Table 1. From the table, we see that for each community, nodes have a higher average number of shared interested tracks with same-community nodes than with nodes from other communities. Note that we used a relatively loose community creation requirement that each node only


needs to have a high contact frequency with half of the nodes in a community. With a stricter requirement and a more sophisticated clustering method, nodes in the same community would share more interested tracks. The above trace analysis verifies the previously observed social properties and supports the basis for SPOON that nodes with common interests tend to meet frequently.

TABLE 1
Average number of shared interested tracks.
Community $C_i$    Ave. # of shared interests        Ave. # of shared interests
                   with nodes in $C_i$               with nodes not in $C_i$
1                  1.50                              0.99
2                  0.83                              0.69
3                  1.17                              0.79
4                  1                                 0.39
5                  1.93                              0.94
6                  0.33                              0.21
7                  1.1                               0.71
8                  1                                 0.33
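The comparison behind Table 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the set-based representation of a node's interested tracks, and the toy data in the usage note are assumptions, not part of the Haggle trace or of the authors' analysis code.

```python
def shared_tracks(a, b, interests):
    """Number of conference tracks that nodes a and b are both interested in."""
    return len(interests[a] & interests[b])


def community_averages(members, all_nodes, interests):
    """Return (avg shared tracks with same-community nodes,
               avg shared tracks with nodes outside the community).

    Each node's own average is computed first, then averaged over the
    community, mirroring the two-step averaging described in Section 3.1.
    """
    outsiders = [n for n in all_nodes if n not in members]
    inside, outside = [], []
    for n in members:
        peers = [m for m in members if m != n]
        if peers:
            inside.append(sum(shared_tracks(n, m, interests) for m in peers) / len(peers))
        if outsiders:
            outside.append(sum(shared_tracks(n, o, interests) for o in outsiders) / len(outsiders))

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(inside), mean(outside)
```

With hypothetical questionnaire data such as `{"a": {"p2p", "manet"}, "b": {"p2p", "manet"}, "c": {"p2p"}, "d": {"security"}}` and a community `{"a", "b"}`, the first returned value exceeds the second, which is the pattern the table reports for every detected community.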

3.2 Interest Extraction
Without loss of generality, we assume that node contents can be classified into different interest categories. It was found that users usually have a few file categories for which they frequently query files in a file sharing system. Specifically, for the majority of users, 80% of their shared files fall into only 20% of total file categories [22]. Like other file sharing systems [27], [28], we consider that a node's stored files can reflect its file interests. Thus, SPOON derives the interests of a node from its files. Table 2 lists the notations used in this section.

TABLE 2
Notations in interest extraction
Notation                              Meaning
$f_i$ and $G_u$                       the i-th file and u-th interest group in a node
$w_{it_k}$ and $\bar{w}_{ut_k}$       the weight of keyword $t_k$ in $f_i$ and in $G_u$
$f_{ui}$                              the i-th file in $G_u$
$v_i$                                 the file vector of $f_i$
$\bar{v}_u$                           the group vector of $G_u$
$v_N$                                 the node vector of node $N$

To derive its interests, a node infers keywords from each of its files using the document clustering technique [29]. Specifically, a node derives a file vector for each of its files from its metadata. For file $f_i$, we denote its file vector by $v_i = (t_1, w_{it_1}; t_2, w_{it_2}; t_3, w_{it_3}; \ldots; t_m, w_{it_m})$, in which $t_k$ and $w_{it_k}$ ($1 \le k \le m$) denote a keyword and its weight, which represents the importance of the keyword in describing the file. We adopt the method in the text retrieval literature [30] to calculate the weight of a keyword, say $t_k$, in a file, say $f_i$, with the following formula:
\[ w_{it_k} = 1 + \log(n_{t_k}), \tag{1} \]
where $n_{t_k}$ refers to the number of occurrences of keyword $t_k$. Supposing there are $m$ keywords in the file, we further normalize the weights by
\[ w_{it_k} = w_{it_k} \Big/ \sum_{q=1}^{m} w_{it_q}. \tag{2} \]
Then, in order to calculate the similarity of two file vectors, say $v_1 = (t_1, w_{1t_1}; t_2, w_{1t_2}; t_4, w_{1t_4})$ and $v_2 = (t_1, w_{2t_1}; t_3, w_{2t_3}; t_5, w_{2t_5})$, we first generate their common vector, which consists of their common keywords and the corresponding weights in their own vectors. For example, the common vector of $v_1$ and $v_2$ is $(t_1, w_{1t_1})$ from $v_1$ and $(t_1, w_{2t_1})$ from $v_2$. We then use the following formula to calculate the similarity between $v_1$ and $v_2$:
\[ sim(v_1, v_2) = \frac{\sum_{k=1}^{m} w_{1k} \, w_{2k}}{\sqrt{\sum_{k=1}^{m} w_{1k}^2} \, \sqrt{\sum_{k=1}^{m} w_{2k}^2}}, \tag{3} \]
where $m$ is the total number of common keywords and $w_{1k}$ and $w_{2k}$ represent the weights of the $k$-th common keyword of the two vectors, respectively.

After retrieving the file vector of each of its files, a node classifies its files to derive its interest groups. It creates a file similarity matrix $A = [sim(v_r, v_s)]$ ($1 \le r, s \le m$), where $m$ is the number of files the node has. Since the similarities among files are known, we use the AGNES method [31] to cluster the files into interest groups in a hierarchical manner. Each file forms an individual group initially. Then, the two most similar file groups are merged in each step. This process repeats until the similarity between any two groups is below a threshold. The similarity between two groups is calculated based on their interest vectors introduced below. Consequently, a file is classified into only one interest group and there is no overlap among groups. Each group has a number of files. Suppose there are $g$ files in interest group $G_u$, denoted by $(f_{u1}, f_{u2}, \ldots, f_{ug})$. The average weight of a keyword, say $t_k$, in the group is calculated by $\bar{w}_{ut_k} = \sum_{i=1}^{g} w^{f_{ui}}_{t_k} / g$, where $w^{f_{ui}}_{t_k}$ denotes the weight of $t_k$ in $f_{ui}$. We also pre-define a threshold for the average weight, denoted by $T_w$. We form an interest vector with keywords having weights larger than $T_w$ and use it to represent interest group $G_u$:
\[ \bar{v}_u = (t_1, \bar{w}_{ut_1}; t_2, \bar{w}_{ut_2}; t_3, \bar{w}_{ut_3}; \ldots; t_n, \bar{w}_{ut_n}), \tag{4} \]
where $n$ is the total number of keywords in $\bar{v}_u$. Thus, each node has a number of interest vectors to represent its interests. The weight of $G_u$, denoted $W(G_u)$, is the portion of the group's files in all files of the node. We then generate a node vector ($v_N$) to describe a node's interests. The keywords of $v_N$ are the union of the keywords of all interest group vectors, and the weight of each keyword is the sum of its weights in the interest groups it belongs to, normalized by the weights of these groups.

3.3 Community Construction
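Community construction operates on the interest vectors and the similarity measure defined in Section 3.2. As a point of reference, Formulas (1) to (4) and the AGNES-style grouping can be sketched in Python as below. This is a minimal sketch under stated assumptions rather than the authors' implementation: vectors are represented as keyword-to-weight dictionaries, the logarithm is taken as the natural log (the base is not specified in the text), and the names `file_vector`, `sim`, `group_vector`, `cluster_files`, `t_sim` and `t_w` are hypothetical.

```python
import math
from collections import Counter


def file_vector(keywords):
    """Formulas (1)-(2): weight = 1 + log(count), then divide by the sum of weights.
    Assumption: natural logarithm."""
    counts = Counter(keywords)
    raw = {t: 1.0 + math.log(n) for t, n in counts.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def sim(v1, v2):
    """Formula (3): similarity computed over the common keywords only."""
    common = v1.keys() & v2.keys()
    if not common:
        return 0.0
    num = sum(v1[t] * v2[t] for t in common)
    den = math.sqrt(sum(v1[t] ** 2 for t in common)) * \
          math.sqrt(sum(v2[t] ** 2 for t in common))
    return num / den


def group_vector(file_vectors, t_w):
    """Formula (4): average each keyword's weight over the group's files
    and keep only keywords whose average weight exceeds t_w."""
    g = len(file_vectors)
    totals = {}
    for v in file_vectors:
        for t, w in v.items():
            totals[t] = totals.get(t, 0.0) + w
    return {t: s / g for t, s in totals.items() if s / g > t_w}


def cluster_files(file_vectors, t_sim, t_w):
    """AGNES-style grouping: start with one group per file and repeatedly
    merge the two most similar groups until no pair reaches t_sim."""
    groups = [[v] for v in file_vectors]
    while len(groups) > 1:
        vecs = [group_vector(g, t_w) for g in groups]
        best, pair = -1.0, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                s = sim(vecs[a], vecs[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < t_sim:
            break
        a, b = pair
        groups[a] = groups[a] + groups.pop(b)  # b > a, so index a stays valid
    return groups
```

Because Formula (3) sums only over common keywords, two vectors that share exactly one keyword have similarity 1 in this sketch; the group weights $W(G_u)$ used in the matching rule of this section are what temper such coincidental matches.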

Social network theory reveals that people with the same interest tend to meet frequently [24]. By exploiting this property, SPOON classifies nodes with common interests and frequent contacts into a community to facilitate interest-based file searching, as introduced later in Section 3.5. Nodes with multiple interests belong to multiple communities. The community construction can easily be conducted in a centralized manner by collecting node interests and contact frequencies from all nodes at a central node. However, considering that the proposed system is for distributed disconnected MANETs, in which timely information collection and distribution is non-trivial, we further propose a decentralized method to ensure the adaptivity of SPOON in real environments. When two nodes, say $N_1$ and $N_2$, meet, they consider two cases for community creation: (1) they do not belong


to any communities, and (2) at least one of them is already a member of a community. In the first case, they calculate the similarity between each pair of their interest vectors using Formula (3). A pair of interest groups, say $G_i$ and $G_j$ with interest vectors $\bar{v}_i$ and $\bar{v}_j$, is called a matched interest group when $W(G_i)\,W(G_j)\,sim(\bar{v}_i, \bar{v}_j) \ge T_G$, where $T_G$ is a predefined threshold. The purpose of taking into account the weight of each interest group is to eliminate the noise of interest groups with a small number of files and achieve better interest clustering. If $N_1$ and $N_2$ have at least one pair of matched interest groups, and their contact frequency, $F(N_1, N_2)$, ranks within the top $h_1\%$ of the encountering frequencies of either node, the two nodes form a new community. The keywords in their matched interest groups and the corresponding weights constitute the community vector ($v_C$) of the community. In the second case, suppose $N_2$ is already a member of community $C$. $N_1$ calculates $W(G_i)\,sim(\bar{v}_i, v_C)$ for each of its interest groups, say $G_i$, to decide whether it should join community $C$. If the value for one interest group is larger than $T_G$, and $N_1$'s contact frequency with community $C$ ranks within the top $h_2\%$ of $N_1$'s contact frequencies with all nodes it has met, $N_1$ is granted membership in community $C$. The contact frequency with community $C$ refers to the accumulated contact frequency with nodes in $C$. It is updated upon each encounter with a node in $C$. This requirement means that $N_1$ contacts members of community $C$ frequently enough to guarantee connections. $N_1$ then copies the community vector and other community information from $N_2$. Also, when a node meets the community coordinator, it reports its files to the coordinator so that the coordinator can update its file index and the community vector. The coordinator then forwards the updated community information to community members when it meets them.
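The two admission cases above can be sketched as follows. The sketch assumes that each interest group is given as a `(weight, vector)` pair with vectors as keyword-to-weight dictionaries, that "top h%" means ranking within the highest h percent of a node's recorded contact frequencies, and that all function and parameter names are hypothetical. A compact version of Formula (3) is repeated so the block is self-contained.

```python
import math


def sim(v1, v2):
    """Formula (3), computed over the common keywords of two vectors."""
    common = v1.keys() & v2.keys()
    if not common:
        return 0.0
    num = sum(v1[t] * v2[t] for t in common)
    den = math.sqrt(sum(v1[t] ** 2 for t in common)) * \
          math.sqrt(sum(v2[t] ** 2 for t in common))
    return num / den


def in_top_percent(freqs, f, h_percent):
    """True if frequency f ranks within the top h% of the frequencies in freqs."""
    ranked = sorted(freqs, reverse=True)
    if not ranked:
        return False
    k = max(1, math.ceil(len(ranked) * h_percent / 100))
    return f >= ranked[k - 1]


def can_form_community(groups1, groups2, f12, freqs1, freqs2, t_g, h1):
    """Case 1: neither node belongs to a community yet."""
    matched = any(w1 * w2 * sim(v1, v2) >= t_g
                  for w1, v1 in groups1 for w2, v2 in groups2)
    # "either node": the contact only needs to rank highly for one of them.
    frequent = in_top_percent(freqs1, f12, h1) or in_top_percent(freqs2, f12, h1)
    return matched and frequent


def can_join_community(groups1, community_vector, f_comm, freqs1, t_g, h2):
    """Case 2: the encountered node already belongs to community C."""
    interested = any(w * sim(v, community_vector) > t_g for w, v in groups1)
    return interested and in_top_percent(freqs1, f_comm, h2)
```

Passing the contact-frequency lists explicitly keeps the decision purely local to the two meeting nodes, which matches the decentralized setting the section describes.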
With the above community construction method, nodes with common interests and frequent contacts gradually form a community. However, nodes that appear later face a more stringent community acceptance requirement: their contact frequency with the community needs to be higher than that of more nodes, and their interest vectors are compared with a longer community vector. Also, nodes in a group admit new members distributively. As a result, nodes in a group may not have very similar interests or high contact frequencies. We propose two solutions to alleviate this problem. First, we set an initial period for newly joined nodes in which they accumulate contact frequencies with others. Then, when a node starts to join communities, its meeting frequencies with others are relatively stable, which provides a more accurate measurement for determining the communities to join. Second, we use group member pruning. Existing community members can hold a second round of voting to confirm the eligibility of new community members. Specifically, if $N_2$ in community $C$ finds that a node, say $N_1$, satisfies the requirements of $C$, it awards $N_1$ a potential membership in $C$. Then, other community members in $C$ further check $N_1$'s eligibility to join $C$. That is, every time $N_1$ meets an existing member of $C$,

say $N_3$, $N_3$ checks whether they have at least one pair of matched interest groups and whether their contact frequency ranks within the top $h_1\%$ of $N_1$'s contact frequencies. If yes, $N_3$ approves the membership of $N_1$. When an existing community member of $C$ notices that $N_1$ has received grants from half of the community members, it grants $N_1$ the group membership. Another issue is that node contact frequencies and interests may change over time. Since the community construction algorithm runs continuously, a node can detect that it fails to satisfy the requirements of its current community. It then withdraws from the current community, notifies connected nodes in it, and searches for a new community to join. The values of the thresholds used in the community construction process ($T_G$, $h_1$, $h_2$) are determined by many factors such as the number of nodes, the number of interests and the types of applications. Generally, $T_G$ decides the interest tightness among nodes in each community. A larger $T_G$ leads to higher similarity between the interests of nodes in one community, but also generates more communities. Therefore, $T_G$ should be configured based on the application scenario. If the application has clear file categories (e.g., course file sharing), we can set a large $T_G$ to gather files in the same category. Otherwise, a medium $T_G$ should be set to balance the interest closeness and the number of communities. $h_1$ and $h_2$ determine the tightness of a community. The smaller $h_1$ and $h_2$ are, the tighter the community is. Therefore, we set them to 30 by default in the experiments to ensure frequent contact among community members. These values were set based on empirical experience. We leave further investigation of appropriate values as future work.

3.4 Node Role Assignment

A previous study has shown that in a social network consisting of mobile users, only a portion of nodes have high degrees [20]. In daily life, we can often find an important or popular person who coordinates members in a community. For example, the college dean coordinates different departments in the college, and the department head connects to faculty members in the department. Thus, we take advantage of different types of node mobility for file sharing. We define community coordinator and ambassador nodes from the viewpoint of a social network. A community coordinator is an important and popular node in the community. It keeps indexes of all files in its community. Each community has one ambassador for each known foreign community, which serves as the bridge to that community. The coordinator in a community maintains the $v_C$ of foreign communities and the corresponding ambassadors in order to map queries to ambassadors for inter-community searching. The number of ambassadors and coordinators can be adjusted based on the network size and workload in order to avoid overloading these nodes. Since ambassadors and coordinators take more responsibility, we can also adopt role rotation and extra incentives for fairness. Due to the page limit, we leave this as our future work.
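The two roles defined above are selected with the scoring rules given in Sections 3.4.1 and 3.4.2 (Formulas (5) and (6)). A minimal Python sketch of those rules follows. It is a centralized simplification for illustration only: in SPOON the scores are exchanged through periodic broadcasts and reports, whereas here all contact frequencies are assumed to be available in dictionaries. The names, the `frozenset` keys for node pairs, and the exclusion of the coordinator from the ambassador candidates are assumptions.

```python
def degree_centrality(node, members, contact_freq, threshold):
    """Formula (5): number of community neighbors whose contact frequency
    with `node` exceeds the threshold (link weight 1, otherwise 0)."""
    return sum(
        1
        for other in members
        if other != node
        and contact_freq.get(frozenset((node, other)), 0) > threshold
    )


def select_coordinator(members, contact_freq, threshold):
    """The member with the highest degree centrality becomes coordinator."""
    return max(
        members,
        key=lambda n: degree_centrality(n, members, contact_freq, threshold),
    )


def select_ambassadors(members, coordinator, foreign, freq_to_comm, contact_freq):
    """Formula (6): U_ik = F(N_i, C_k) * F(N_i, N_c).
    Returns ({foreign community: ambassador}, default ambassador)."""
    # Assumption: the coordinator itself is not considered as an ambassador.
    candidates = [n for n in members if n != coordinator]

    def utility(n, k):
        return (freq_to_comm.get((n, k), 0)
                * contact_freq.get(frozenset((n, coordinator)), 0))

    ambassadors = {k: max(candidates, key=lambda n: utility(n, k)) for k in foreign}
    # Default ambassador: highest overall contact frequency with all foreign communities.
    default = max(
        candidates,
        key=lambda n: sum(freq_to_comm.get((n, k), 0) for k in foreign),
    )
    return ambassadors, default
```

Multiplying the two frequencies in the utility favors nodes that meet both the foreign community and the home coordinator often, so a node that visits a foreign community frequently but rarely returns to the coordinator is not chosen.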


3.4.1 Community Coordinator Node Selection

We define a stable node that has tight contact frequency with other community members as the community coordinator. In network analysis, centrality is often used to determine the relative importance of a vertex within the network. We adopt the improved degree centrality [32], which assigns a weight to each link based on the contact frequency, for coordinator selection since it reflects the tightness of a node with other community members. In the initial phase of coordinator discovery, each node, say node $N_i$, in a community collects contact information from its neighbors in the same community and then calculates its degree centrality by

$D(p_i) = \sum_{j=1, j \neq i}^{N} w_{ij}$,   (5)

where $w_{ij}$ is the link weight between $N_i$ and $N_j$ and $N$ is the number of neighbors in the same community. In order to reflect the property that the coordinator has the most connections with all community members, $w_{ij}$ equals 1 if the contact frequency between $N_i$ and $N_j$ is larger than a threshold and 0 otherwise. Though such a method cannot ensure that the coordinator is connected to every community member, it ensures that the coordinator has the tightest overall connection to all community members. Each node periodically checks its degree centrality and broadcasts this information to all community members. If a node receives no centrality score larger than its own for three consecutive periods, it claims itself as the potential coordinator. The potential coordinator confirms its status as the coordinator when it meets the previous one. Once confirmed, it requests the community information from the old coordinator. Also, when the new coordinator meets community members, they exchange information for group vector update and ambassador selection, as well as request routing.

3.4.2 Community Ambassador Node Selection

An ambassador is used to bridge the coordinator in its home community and a foreign community. We use the product of a node's contact frequency with its coordinator and that with the foreign community for ambassador selection. Each node $i$ calculates its utility value for foreign community $k$ by

$U_{ik} = F(N_i, C_k) \times F(N_i, N_c)$,   (6)

where $C_k$ represents foreign community $k$, $N_c$ is the coordinator in its home community, and $F(\cdot)$ denotes the meeting frequency. Each node reports its utility values for the foreign communities it has met to the coordinator in its home community. Then, the community coordinator chooses one ambassador for each known foreign community. Also, the node that has the highest overall contact frequency with all foreign communities is selected as the default ambassador. If a request fails to find a matched ambassador, the default ambassador can carry the request and look for potential forwarders in foreign communities. If an ambassador loses the connection with the coordinator for a certain period of time, a new ambassador that satisfies the above requirements is selected. This arrangement facilitates interest-oriented file searching by enabling a coordinator to send file requests to matched foreign communities quickly.

In the above design, ambassadors are the key to connecting different communities efficiently. Coordinators achieve a balance between centralized and distributed searching by quickly checking whether a community can satisfy a query, which is important in disconnected MANETs. Also, though broadcast is used in coordinator selection, the cost is limited because 1) it is only among community members, and 2) we can set a long inter-broadcast period since nodes usually have stable degree centrality. To select ambassadors, each node just reports its utility values to the coordinator, which can be piggybacked on the beacon messages. Therefore, this step does not incur significant extra costs.

Fig. 2. File searching in SPOON. [Figure: a requester R in Community C1 and a data holder D in Community C2, together with the coordinators and ambassadors of the two communities.]

3.5 Interest-oriented File Searching and Retrieval

In social networks, people usually have a few file interests [22] and their file visit pattern generally follows a certain distribution [23]. Also, people with the same interest tend to contact each other frequently [24]. Thus, interests can be a good guide for file searching. Considering the relation among node movement patterns [33], individuals' common interests, and their contact frequencies, we can route file requests to file holders based on nodes' frequencies of meeting different interests. The interest-oriented file searching scheme has two steps: intra-community and inter-community searching. A node first searches for files in its home community. If the coordinator finds that the home community cannot satisfy a request, it launches the inter-community searching and forwards the request to an ambassador that will travel to the foreign community that matches the request's interest. A request is deleted when its TTL (Time To Live) expires. During the search, a node sends a message to another node using the interest-oriented routing algorithm (IRA), in which a message is always forwarded to the node that is likely to hold or to meet the queried keywords. The retrieved file is routed along the search path, or through IRA if the route expires.

3.5.1 Interest-oriented Routing Algorithm

In SPOON, every node maintains a history vector that records its frequency of encountering interest keywords. The history vector is in the form of $v_H = (t_0, w_{h0}; t_1, w_{h1}; t_2, w_{h2}; \ldots; t_n, w_{hn})$, where $w_{hi}$ is the aggregated number of times keyword $t_i$ has been encountered. $w_{hi}$ decays periodically as time passes by $w_{hi} = \delta w_{hi}$ ($\delta < 1$). When two nodes meet, they exchange their node vectors and update their history vectors. The history vector is used to evaluate the probability of meeting the queried content. The destination of a request is represented by a vector $v_{dest} = (t_0, w_0; t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$. In IRA, a node uses the fitness score $F$ to evaluate its neighbors' probabilities to be or to meet the file holder. The fitness $F$ of neighbor $i$ is measured by $F = \alpha \, sim(v_{dest}, v_i) + (1 - \alpha) \, sim(v_{dest}, v_{H_i})$, where $v_i$ and $v_{H_i}$ are the node vector and history vector of node $i$, respectively. The factor $sim(v_{dest}, v_i)$ aims to find the node sharing the most similar interests with the destination, and the factor $sim(v_{dest}, v_{H_i})$ aims to find a node that is very likely to meet the destination in its movement. $\alpha$ is used to control the weight of these two factors. In IRA, when a node receives a message, if its neighbor with the highest $F$ has a higher $F$ than itself, it forwards the message to that neighbor. This process repeats until the message arrives at the destination. Coordinators do not use IRA but send messages to their community members when meeting them, because they usually have tight connections with all community members.

3.5.2 Intra-Community File Searching and Retrieval

The query message is represented by a query vector $v_Q = (t_0, w_0; t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$. Each query is associated with a counter (count) indicating the number of hops it can travel. The count is decremented by 1 after each forwarding. Since the query is initiated by users, the term weights in $v_Q$ are constant values. In the intra-community searching, the destination that a query is sent to is represented by a combination of $v_Q$ and the node vector of the requester's community coordinator ($v_{N_C}$):

$v_{dest} = \beta v_Q + (1 - \beta) v_{N_C}$.   (7)

In the first step, the requester calculates the similarity between the query vector and the community vector of the community it belongs to. If $Sim(v_Q, v_C) < T_s$, the query is sent to the coordinator of the community directly (i.e., $\beta$ equals 0). Otherwise, $\beta$ equals 1 when the counter (count) is larger than 0 and 0 otherwise. This means that a requester first searches nearby nodes within count hops, and then resorts to its community coordinator. Specifically, the requester sends the query to the top $\kappa$ neighbors with the highest $F$. Having $\kappa > 1$ copies of a request can enhance the efficiency of file searching. We call this strategy multi-copy forwarding. In order to limit the number of copies for each request, we set $\kappa = \min\{k \mid \sum_{i=0}^{k} F_i > \theta\}$, where $F_i$ is the fitness of neighbor $i$ and $\theta$ is the minimum delivery guarantee factor. If the file is not found when count = 0, the query is forwarded to the community coordinator ($\beta = 0$). When a node receives multiple copies of a query, it only processes the first one.

When node $N_j$ receives a request, if $v_{dest} = v_Q$ and $sim(v_{dest}, v_{N_j})$ reaches the similarity threshold specified by the requester, it first tries to send the matching files back to the requester along the original path. If a forwarder on the path is not available due to node mobility, IRA is used to forward the file. Otherwise, $N_j$ uses IRA to further forward the query. If $v_{dest} = v_{N_C}$ and $N_j$ is not the coordinator $N_C$, $N_j$ uses IRA to forward the request to $N_C$. After $N_C$ receives the query, it checks its file indexes. If the indexes have files satisfying the request, the coordinator sends the request to the file holder when meeting it, which then sends the file back to the requester. Otherwise, $N_C$ initiates the inter-community file searching. Algorithm 1 shows the pseudocode of the intra-community searching algorithm.

Algorithm 1: Pseudocode of intra-community file searching for query Q conducted by node N_i.
Procedure intraSearchForQ()
  if a neighbor nb of N_i matches query Q then
    N_i.sendQueryTo(Q, nb)
  else if Q.src = N_i then
    if Sim(v_Q, v_C) < T_s then
      Q.v_dest ← v_NC
      N_i.sendThroughIRATo(Q, N_C)
    else
      Q.v_dest ← v_Q
      N_i.rankNbByFitness()
      overallF ← 0
      for each neighbor nb of node N_i do
        overallF ← overallF + F(Q, nb)
        N_i.sendQueryTo(Q, nb)
        if overallF > θ then break
  else if Q.hops < MaxHop then
    Q.v_dest ← v_Q
    N_i.rankNbByFitness()
    nb ← the neighbor with maximal fitness
    N_i.sendQueryTo(Q, nb)
  else
    Q.v_dest ← v_NC
    N_i.sendThroughIRATo(Q, N_C)

3.5.3 Inter-Community File Searching and Retrieval

In the inter-community searching algorithm, a coordinator maps a request to the foreign community that is most likely to contain the queried file. Similar to the intra-community search step, the coordinator also uses the multi-copy forwarding strategy, i.e., it sends the query to the $\kappa'$ ambassadors having the highest similarity with the query in order to enhance the efficiency of the forwarding. We limit the number of copies for each request by letting $\kappa' = \min\{k \mid \sum_{i=0}^{k} sim(v_Q, v_{C_i}) > \theta'\}$, where $v_{C_i}$ is the community vector of the foreign community served by the $i$-th ranked ambassador and $\theta'$ is the minimum delivery guarantee factor. Ambassadors then forward the request to the foreign community. Upon receiving the request, the coordinator in the foreign community checks its file index to see if its community has the file. If not, the coordinator repeats the inter-community file searching by looking up its ambassadors to check for further forwarding opportunities. If the file exists, the coordinator asks for the file from the file holder when meeting it and sends the file back to the requester's community through the corresponding ambassador. The coordinator of the requester's community then forwards the file to the requester. Figure 2 depicts the process of file searching, in which a requester (node R) in community C1 generates a file


request. Since its neighbors within count hops don't have the file, the request is then forwarded to the community coordinator $N_{C1}$. $N_{C1}$ checks the community file indexes but still can't find the file. It then asks the community ambassador $N_{A1}$ to forward the request to the foreign community matching the queried file. In the same way as $N_{C1}$, the community coordinator $N_{C2}$ finds the file and sends it back to the requester's community via ambassador $N_{A2}$. The file is first sent to $N_{C1}$, and then forwarded to the requester. Algorithm 2 shows the pseudocode of the inter-community searching algorithm.
Algorithm 2: Pseudocode of inter-community file searching for query Q conducted by node N_i.
Procedure interSearchForQ()
  if N_i is a coordinator then
    bContain ← N_i.checkContainFile(Q)
    if bContain then
      N_i.sendQueryToDes(Q)
    else
      N_i.rankAmByMatch()
      overallS ← 0
      for each ambassador N_A of N_i's community do
        N_i.sendQueryTo(Q, N_A)
        overallS ← overallS + Sim(Q.v_Q, N_A.v_C)
        if overallS > θ' then break    // θ': minimum delivery guarantee factor
  if N_i is an ambassador then
    when N_i meets another node N_j
      if N_j.homeCommunity = N_i.foreignCommunity then
        N_i.sendQueryTo(Q, N_j)
        N_j.sendThroughIRATo(Q, N_C)
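The fitness-based forwarding used throughout Section 3.5 can be sketched in a few lines. The sketch below assumes that sim() is cosine similarity over keyword-weight vectors, which this section does not specify; the weight value 0.5, the threshold values, and all names are likewise hypothetical. It shows the fitness score of Section 3.5.1 and the rule that limits the number of copies in multi-copy forwarding.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # keyword -> weight


def cosine_sim(a: Vector, b: Vector) -> float:
    """Assumed similarity measure; the paper's sim() may differ."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fitness(v_dest: Vector, node_vec: Vector, history_vec: Vector,
            alpha: float = 0.5) -> float:
    """Weighted sum of similarity to the node vector and to the history vector."""
    return (alpha * cosine_sim(v_dest, node_vec)
            + (1 - alpha) * cosine_sim(v_dest, history_vec))


def select_copies(scores: Dict[str, float], theta: float) -> List[str]:
    """Take neighbors in decreasing fitness order until the accumulated
    fitness exceeds the delivery guarantee factor theta."""
    chosen, total = [], 0.0
    for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(name)
        total += score
        if total > theta:
            break
    return chosen


# Hypothetical destination vector and neighbors: (node vector, history vector).
v_dest = {"sports": 1.0}
neighbors = {
    "n1": ({"sports": 1.0}, {"sports": 1.0}),
    "n2": ({"music": 1.0}, {"sports": 1.0}),
    "n3": ({"music": 1.0}, {"music": 1.0}),
}
scores = {n: fitness(v_dest, nv, hv) for n, (nv, hv) in neighbors.items()}
```

With these toy vectors, a threshold of 0.8 is already exceeded by the best neighbor alone, whereas a threshold of 1.2 requires a second copy. The same accumulation rule applies to the coordinator's choice of ambassadors, with the fitness replaced by the similarity between the query and each foreign community vector.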

3.6 Information Exchange among Nodes

We summarize the information exchanged among nodes in SPOON. In the community construction phase, two encountered nodes exchange their interest vectors and community vectors, if any, for community construction. In the role assignment phase, nodes broadcast their degree centrality within their communities for coordinator selection. When the coordinator is selected, the coordinator ID is also broadcast to all nodes in the community. Then, each node reports its contact frequencies with foreign communities to the coordinator for ambassador selection. Besides, when a node meets the coordinator of its community, the node also sends its updated node vector to the coordinator to update the community vector and retrieves the updated community vector from the coordinator. When an ambassador meets the coordinator of its community, it reports the community vectors of foreign communities to the coordinator. After the above information exchange, two encountered nodes exchange their node vectors and history vectors for packet routing. Each node checks its packets sequentially to decide which packets should be forwarded to the other node based on the file searching algorithm introduced in Section 3.5. Further, when the network becomes stable, the frequency of information exchange for community construction and node role assignment can be reduced to save costs.

3.7 Intelligent File Prefetching

Ambassadors in SPOON can meet nodes holding different files since they usually travel between different communities frequently. Taking advantage of this feature, an ambassador can intelligently prefetch popular files outside of its home community. Recall that a query in a local community for a file residing in a remote community is forwarded through the coordinator of the local community. Thus, each coordinator keeps track of the frequency of local queries for remote files and provides the information about popular remote files to each ambassador in its community upon encountering it. When a community ambassador finds that its foreign community neighbors have popular remote files that are frequently requested by its home community members, it stores the files in its memory. The prefetched files can directly serve potential requests in the ambassador's home community, thus reducing the file searching delay.

3.8 Querying-Completion and Loop-Prevention

Given a file query, there may exist a number of matching files in the system. A node can associate a parameter $S_{max}$ with its query to specify the number of files that it wishes to find. A challenge is to ensure that the querying process stops when $S_{max}$ matching files are discovered while multi-copy forwarding is used. To solve this problem, we let a query carry $S_{max}$ as its counter $S$ when it is generated. When a query finds a file that matches the query and has not been discovered before, it decreases its $S$ by 1. Also, if the query is replicated to another node, $S$ is evenly split between the two nodes. A query stops searching for files when its $S$ equals 0.

When a query needs to find more than one file, it is likely that IRA would forward the query to the same node repeatedly. To avoid this, SPOON incorporates two strategies. First, the query holder inserts its ID into the query before forwarding the query to the next node. Second, a node records the queries it has received within a certain period of time. The former method avoids sending a packet to nodes it has visited before, while the latter prevents sending different replicas of the same query to the same node. Specifically, when a node, say $N_i$, needs to forward a query to a newly met node $N_j$ based on IRA, $N_i$ checks whether the query's record of traversed nodes contains $N_j$. If yes, $N_i$ does not forward the query to $N_j$. Also, when a node receives a query, if the query exists in its record of received queries, the node sends the query back to the sender. These two strategies effectively avoid searching loops by simply preventing a node from forwarding the same query to nodes that have received the query before.

3.9 Node Churn Consideration

In SPOON, when a node joins the system, it first finds the communities it belongs to and learns the IDs of the community coordinators, and then reports its files and utility values to the community coordinator when encountering it. This enables the coordinator to maintain updated information about the community members. A node may leave the system voluntarily when users manually stop the SPOON application on their devices. In this case, a leaving node informs its community coordinator about its departure through IRA. If the leaving


node is an ambassador, the coordinator then chooses a new ambassador. If the leaving node is a coordinator, it uses broadcast to notify other community members to select a new coordinator.

A node may also leave the system abruptly for various reasons. Relying only on the periodic beacon messages, a node cannot tell whether a neighbor has left or is just isolated from it, which is a usual case in MANETs. To handle this problem, each node records the timestamps at which it meets other nodes, and sends them to the coordinator through IRA. The coordinator receives this information and updates the most recent timestamp of each node seen by other nodes. If the coordinator finds that a node's timestamp is more than $T_x$ seconds old, it considers this node a departed node. Similarly, normal nodes in a community also maintain and update the timestamp of the coordinator to determine whether it is still alive. A node piggybacks the coordinator departure information on the beacon messages. Then, its nearby nodes can know whether the coordinator has left. Note that a node can learn the number of community members from the coordinator. When a node finds that more than half of the community members have found that the coordinator has left, it broadcasts a coordinator re-election message to select a new coordinator using the method explained in Section 3.4.1.

4 PERFORMANCE EVALUATION

We evaluated the performance of SPOON in comparison with MOPS [18], PDI+DIS [8], [12], CacheDTN [14], PodNet [16], and Epidemic [34]. MOPS is a social network based content service system. It forms nodes with frequent contacts into a community and selects nodes with frequent contacts with other communities as brokers for inter-community communication. PDI+DIS is a combination of PDI [8] and an advertisement-based dissemination method (DIS) [12]. PDI provides a distributed search service through local broadcasting (3 hops) and builds content tables in nodes along the response path, while DIS lets each node disseminate its contents to its neighbors to create content tables. CacheDTN replicates files to network centers in decreasing order of their overall popularity. In PodNet, nodes cache files that interest them and the nodes they have met. We adopted the "Most Solicited" file solicitation strategy in PodNet. We doubled the memory on each node in CacheDTN and PodNet for replicas. In Epidemic, when two nodes meet each other, they exchange the messages the other has not seen. We conducted the following experiments.

(1) Evaluation of community construction. We first evaluated the proposed community construction algorithm introduced in Section 3.3.
(2) GENI experiments. We deployed the systems on the real-world GENI ORBIT testbed [35], [36] and tested the performance using the MIT Reality trace. The GENI ORBIT testbed contains 400 nodes with 802.11 wireless cards. Nodes can communicate with each other through the wireless interface. We used a real trace to simulate node mobility in ORBIT: two nodes can communicate with each other only during the period of time when they meet in the real trace.
(3) Event-driven experiments with real trace. We also conducted event-driven experiments with two real traces.
(4) Evaluation of enhancement strategies. We evaluated the effect of the enhancement strategies introduced in Sections 3.7, 3.8, and 3.9 through event-driven experiments.
(5) NS2 experiments with synthetic mobility. We conducted experiments on NS-2 [37] using a community based mobility model [38] to evaluate the applicability of SPOON in different types of networks. Due to the page limit, the results are shown in Appendix A.

We disabled the intelligent prefetching and the multi-copy forwarding (i.e., the number of copies per request was set to 1 in both intra- and inter-community searching) in SPOON to make the methods comparable. Also, since the comparison methods can return only one file for a query, we set $S_{max} = 1$ in SPOON. In each community, we used the node having the most contacts with other communities as the ambassador in SPOON and as the broker in MOPS. We also set the node with the most contacts with its community members as the coordinator in SPOON.

Besides the Haggle trace, we further tested with the MIT Reality trace [39], in which 94 smart phones were deployed among students and staff at MIT to record their encounters. The two traces last 0.34 million seconds (Ms) and 2.56 Ms, respectively. As in MOPS, we used 40% of the two traces to detect groups in which nodes share frequent contacts. Here, we use "group" to represent a group of nodes with frequent contacts, and use "community" to represent a group of nodes with common interests and frequent contacts. We got 7 and 8 groups for the MIT Reality trace and the Haggle trace, respectively. Then, since there is no real trace for P2P over MANETs, we collected articles from different news categories (e.g., sports, entertainment and technology) from CNN.com and mapped them to the identified communities. Each node contains 50 articles from the news category for its community. Each node extracts its interests from its stored files. The similarity threshold was set to 70% in AGNES for file classification.

In experiments with the Haggle trace and the MIT Reality trace, we set the initialization period to 0.09 Ms and 0.3 Ms, the query generation period to 0.1 Ms and 1 Ms, and the TTL of a query to 0.15 Ms and 1.2 Ms, respectively. Considering that people usually generate queries according to their interests, we set 70% of the total queries to search for files located in the local community. Each query is for an article randomly selected from the article pool. We measured the following metrics:

(1) Hit rate: the percentage of requests that are successfully delivered to the file holders.
(2) Average delay: the average delay of the successfully delivered requests.
(3) Maintenance cost: the total number of all messages except the requests, which are for routing information establishment and update, or replication creation.
(4) Total cost: the total number of messages, including maintenance messages and requests, generated in a test.

4.1 Evaluation of Community Construction

We first tested the effectiveness of the community construction method in SPOON, denoted by SPOON-CC, in comparison with Active-CC and Centralized-CC. Active-CC selects the three most active nodes to collect node contacts and interests when they meet nodes. In Centralized-CC, we purposely let a super node collect all node contacts and interests in a timely manner. Both Active-CC and Centralized-CC use AGNES to build communities with the collected information. Since there is no real trace about content sharing in P2P MANETs, we tested in an indirect way. We first conducted the group construction and content distribution as previously described, and then removed the group identity of each node. Then, we ran the three methods to create communities. After this, we matched each new community to the most similar old community and calculated the similarity value by $N_{sm}^2 / (N_p N_n)$, where $N_{sm}$ is the number of common nodes, and $N_p$ and $N_n$ denote the size of the old community and the new community, respectively. In SPOON-CC, we set $T_G$ to 1 to ensure interest closeness, and set $h_1$ and $h_2$ to 30. Active-CC and Centralized-CC used the same threshold for granting community membership as SPOON-CC. The average similarities in SPOON-CC, Active-CC, and Centralized-CC are 0.95, 0.87, and 1 with the Haggle trace and 0.91, 0.85, and 1 with the MIT Reality trace, respectively. Active-CC has inferior performance since active nodes can only collect information from nodes they have met, leading to less accurate community construction. Also, the performance of SPOON-CC is close to that of Centralized-CC, which has the best performance in theory. Such a result shows the effectiveness of SPOON-CC in this test.

Fig. 3. Average similarity values with different $h_1$ and $h_2$ for the MIT Reality and Haggle traces. (a) $h_2 = 30$ and $h_1$ = 20 to 50. (b) $h_1 = 30$ and $h_2$ = 20 to 50.

We further varied $h_1$ and $h_2$ from 20 to 50 to verify the effectiveness of SPOON-CC. Figures 3(a) and 3(b) show the average similarity values with the two traces. We see that the similarity values are above 90% with various $h_1$ and $h_2$.
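The similarity value used in this evaluation can be computed as in the sketch below. The sketch assumes that each new community is matched to the old community that maximizes the same similarity value, which the text does not state explicitly; the example communities and all names are hypothetical, and communities are assumed to be non-empty.

```python
from typing import List, Set


def community_similarity(old: Set[int], new: Set[int]) -> float:
    """N_sm^2 / (N_p * N_n): N_sm is the number of common nodes, N_p and N_n
    are the sizes of the old and the new community (both non-empty)."""
    n_sm = len(old & new)
    return n_sm ** 2 / (len(old) * len(new))


def average_similarity(old_comms: List[Set[int]], new_comms: List[Set[int]]) -> float:
    """Match each new community to its most similar old community and
    average the resulting similarity values."""
    best = [max(community_similarity(o, n) for o in old_comms) for n in new_comms]
    return sum(best) / len(best)


# Hypothetical ground-truth (old) and reconstructed (new) communities.
old_comms = [{1, 2, 3, 4}, {5, 6, 7}]
new_comms = [{1, 2, 3}, {4, 5, 6, 7}]
avg = average_similarity(old_comms, new_comms)
```

The value equals 1 only when a reconstructed community is identical to an original one, and it drops both when members are missing and when extra members are included. In the toy example, one node is assigned to the wrong community, so each new community scores 0.75.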
Such results further confirm the effectiveness of SPOON's community construction algorithm in clustering nodes with frequent contacts and similar interests in the experiments with the two real traces.

4.2 GENI Experiments

Table 3 shows the results of the GENI experiments for the six methods. From the table, we find that Epidemic generates the highest hit rate with the highest total cost and a low average delay. This results from the dissemination nature of broadcasting. SPOON produces the second highest hit rate at the second lowest total cost and a relatively high average delay. This is because SPOON utilizes both the contact and content properties of social networks to guide file querying. Therefore, it can successfully locate queried files without the need for many information exchanges and request messages, though at a relatively slow speed. SPOON outperforms MOPS in terms of hit rate, delay and cost. This is because SPOON utilizes IRA for intra-community communication and dedicated ambassadors for inter-community communication, while MOPS relies heavily on brokers. Also, MOPS only considers node contact in routing, while SPOON considers both content and contact. We elaborate on the reasons when describing the trace-driven simulation results later on.

TABLE 3
Efficiency and cost in the experiments on GENI

Method      Hit Rate   Ave. Delay (s)   Maintenance Cost   Total Cost
SPOON       0.671      152731.3         258764             275312
MOPS        0.629      163282.5         310131             320412
CacheDTN    0.5712     219021.4         283210             298123
PodNet      0.5932     183621           223218             240238
PDI+DIS     0.524      7418.9           298641             359841
Epidemic    0.8813     15621.2          669193             860475
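Since the total cost is defined as the number of maintenance messages plus requests, the number of request messages of each method can be derived from Table 3 by subtraction. The short sketch below does this under the assumption that the total cost consists of exactly these two parts; the values are copied from Table 3 and the variable names are illustrative.

```python
# (maintenance cost, total cost) per method, copied from Table 3.
table3 = {
    "SPOON": (258764, 275312),
    "MOPS": (310131, 320412),
    "CacheDTN": (283210, 298123),
    "PodNet": (223218, 240238),
    "PDI+DIS": (298641, 359841),
    "Epidemic": (669193, 860475),
}

# Request messages = total cost - maintenance cost (assumed decomposition).
request_msgs = {m: total - maint for m, (maint, total) in table3.items()}

# Methods ordered by the number of request messages, fewest first.
ranking = sorted(request_msgs, key=request_msgs.get)
```

Under this assumption, SPOON generates 16548 request messages, while the broadcast-based PDI+DIS and Epidemic generate 61200 and 191282, respectively.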


CacheDTN has a low hit rate, medium cost and high average delay. This is because, though replicas are created, queries wait for files passively on their originators, leading to a long delay. Also, the replication of files to network centers incurs a high cost. PodNet has a low hit rate for the same reason as CacheDTN. However, since the replicas on each node cater more to the interests of the node itself and the nodes it has met, PodNet has a slightly higher hit rate than CacheDTN. Moreover, PodNet has the lowest cost because its nodes only replicate files they are interested in. PDI+DIS generates the lowest hit rate at a relatively high total cost and low average delay. The low hit rate is caused by the poor mobility resilience of route tables. As a result, only a part of the queries are resolved quickly in the local broadcasting. Others passively wait for file holders or updated routes and usually cannot be resolved in time. Then, since we only count the average delay of successful queries, PDI+DIS has the lowest average delay.

TABLE 4
Memory usage in the experiments on GENI

Metric   SPOON   MOPS   CacheDTN   PodNet   PDI+DIS   Epid.
Query    36.8    44.1   48.4       45.3     13.1      1998
Table    10.4    16.9   50         50       15.6      0

We also evaluated the memory utilization of each method in terms of the average number of queries in the buffer (Query) and the average size of a content/neighbor table (Table). The results are shown in Table 4. For the average number of buffered queries, we find that PDI+DIS<SPOON<MOPS<PodNet<CacheDTN<Epidemic. In Epidemic, nodes buffer the most queries since it tries to replicate each request to all nodes in the system. CacheDTN and PodNet have a lot of queries in memory since they do not actively search for the queried files. Both SPOON and MOPS keep one copy of each query during the searching process. However, since SPOON completes file queries more quickly than MOPS (as shown in Table 3), it buffers fewer queries in memory than MOPS. PDI+DIS has the fewest queries on nodes because local broadcasting just forwards queries without buffering.

Considering that each entry in the content table has roughly the same size, we used the number of entries in a table to represent memory usage. The results in Table 4 show that Epidemic<SPOON<MOPS<PDI+DIS<CacheDTN=PodNet. Clearly, Epidemic does not need memory for a content table. SPOON stores the second fewest content synopses because most nodes only store the information of members of the same community. In MOPS, brokers store content synopses of all nodes in their communities and consume a large amount of memory. Therefore, MOPS produces a high average number of stored content synopses. PDI+DIS stores a large amount of content synopses because each node collects content synopses from all nodes it has met and from all received reply messages. CacheDTN and PodNet have the highest number of entries in the content table since we doubled the memory (i.e., 50 articles) for replicas. In summary, the results in Table 3 and Table 4 show that SPOON is superior to other methods in terms of hit rate, average delay, total cost and memory efficiency.

4.3 Event-driven Experiments with Real Trace

In these experiments, we varied the total number of queries from 5000 to 25000 to show the scalability of each method in terms of the number of queries.

4.3.1 Hit Rate

Figures 4(a) and 5(a) show the hit rate of each method in the experiments with the Haggle trace and the MIT Reality trace, respectively. We find that with both traces, Epidemic can resolve almost all requests, while PDI+DIS can only complete about 60% of requests. The hit rates of SPOON, MOPS, PodNet, and CacheDTN reach about 75%, 70%, 68%, and 67%, respectively. Epidemic has the highest hit rate because of its broadcasting nature. In SPOON, coordinators and ambassadors facilitate intra- and inter-community searching, while IRA actively forwards requests to the node with a high probability of meeting the destination. MOPS only relies on the encountering of mobile brokers for file searching. This probability is lower than that of SPOON, resulting in a lower hit rate. PodNet and CacheDTN lack active query forwarding, leading to medium hit rates. However, replicas on each node cater more to the interests of the nodes it can meet in PodNet, while CacheDTN just caches files on network centers. Therefore, PodNet has a slightly higher hit rate than CacheDTN. In PDI+DIS, many routes in the content table expire quickly due to node mobility. As a result, most successful requests are resolved through the 3-hop broadcast. Others have to passively wait for file holders or updated routes. Therefore, many requests cannot be resolved, leading to a low hit rate.

4.3.2 Average Delay

Figures 4(b) and 5(b) show the average delays of the six methods in the tests with the Haggle trace and the MIT Reality trace, respectively. The delays follow PDI+DIS<Epidemic<SPOON<MOPS<PodNet<CacheDTN. Recall that we only measure the delay of successful requests. In PDI+DIS, most successful requests are resolved in the initial 3-hop broadcasting stage. Therefore, it generates the least average delay. In Epidemic, requests are rapidly distributed to nodes at the cost of multiple copies. As a result, requests can reach their destinations quickly. MOPS exhibits a large delay because requests in it usually have to wait for a long time for brokers or same-community file holders. In contrast, SPOON always tries to find an optimal neighbor to send a request to the file holder with the interest-oriented routing algorithm. In addition, the designation of ambassadors in SPOON increases the possibility of relaying requests to foreign communities. As a result, SPOON has a lower average query delay than MOPS. PodNet and CacheDTN generate a high average delay because queries only wait for file holders passively on their originators. However, since PodNet creates replicas that are more likely to be encountered by nodes that are interested in them, it has a lower average delay than CacheDTN.

4.3.3 Cost

Figures 4(c) and 5(c) plot the maintenance costs of the six methods in the experiments with the Haggle trace and the MIT Reality trace, respectively. We see that when the total number of queries is small, the six methods all have low maintenance cost. When the total number of queries is larger than 10000, the maintenance costs generally follow: Epidemic>PDI+DIS>MOPS>CacheDTN>SPOON>PodNet. In PodNet, nodes only replicate files of interest when meeting other nodes, leading to the least maintenance cost. In SPOON, nodes exchange node vectors for the update of history vectors. Nodes also report their contents to coordinators for file indexing. In MOPS, brokers exchange the contents of all nodes from their home communities when meeting each other. Therefore, MOPS produces a slightly higher cost than SPOON. The active replication of files to network centers in CacheDTN leads to a high cost. PDI+DIS needs to build content tables through reply messages and disseminated queries, so it has a higher maintenance cost than the above four methods when the number of queries becomes large. In Epidemic, two nodes need to inform each other of the requests already on them, which causes a lot of information exchange and leads to the highest maintenance cost. We see that when the number of queries increases, the maintenance costs of SPOON, PodNet, CacheDTN, and MOPS remain stable while those of Epidemic and PDI+DIS increase quickly. This is because the maintenance costs of the former four methods are determined by the information/replication exchanges among nodes and are independent of the number of queries, while those of Epidemic and PDI+DIS are related to the total number of queries. Such results demonstrate the scalability of SPOON, MOPS, CacheDTN, and PodNet with respect to the number of queries.

Figures 4(d) and 5(d) show the total cost of each method in the experiments with the Haggle trace and the MIT Reality trace, respectively. In the two figures, the results of MOPS, CacheDTN, SPOON, and PodNet are shown to be very close. We then plot the total costs of the four methods with 95% confidence intervals in Figure 6(a) and Figure 6(b) for better demonstration. Note that we did not show the confidence intervals of other measurements because they have clear


[Figure 4: four panels plotting (a) hit rate, (b) average delay (×10^3 s), (c) maintenance cost (×10^5), and (d) total cost (×10^5) against the number of queries (×10^3) for SPOON, MOPS, Epidemic, PDI+DIS, CacheDTN, and PodNet.]

Fig. 4. Performance in the event-driven experiments with Haggle trace.


[Figure 5: four panels plotting (a) hit rate, (b) average delay (×10^4 s), (c) maintenance cost (×10^5), and (d) total cost (×10^5) against the number of queries (×10^3) for SPOON, MOPS, Epidemic, PDI+DIS, CacheDTN, and PodNet.]

Fig. 5. Performance in the event-driven experiments with MIT Reality trace.


[Figure 6: two panels plotting total cost (×10^5) against the number of queries (×10^3) for SPOON, MOPS, CacheDTN, and PodNet: (a) Haggle trace; (b) MIT Reality trace.]

Fig. 6. Total costs with confidence intervals.

difference and the page limit. We find that the total costs follow Epidemic>PDI+DIS>MOPS>CacheDTN>SPOON>PodNet, which is the same order as in Figures 4(c) and 5(c). Such a result means that the maintenance cost accounts for the majority of the total cost. With the above results, we conclude that SPOON has the highest overall file searching efficiency in terms of hit rate, delay, and cost.

4.4 Evaluation of the Enhancement Strategies

4.4.1 Multi-copy Forwarding and Prefetching

We first evaluated the effect of multi-copy forwarding and intelligent file prefetching. We let Multi-copy forwarding and Prefetching denote SPOON with the corresponding improvement strategy, and compare them with the Original SPOON. In Multi-copy forwarding, we let each query originator distribute two copies of its query. In Prefetching, we let each ambassador store the ten most popular files. We varied the number of queries from 5000 to 25000. The test results are shown in Tables 5 and 6.

We find that the multi-copy forwarding strategy with only two copies enhances the hit rate greatly in the experiments with both traces. This is because when each query has two copies in the system, its probability of encountering the node containing the queried file increases. Such a result shows the effectiveness of the multi-copy forwarding strategy. We also see from the two tables that the file prefetching strategy slightly improves the hit rate with both traces. The improvement is small because 1) we only configure two ambassadors per community, and 2) the prefetched files satisfy only a small number of queries, since most queries are for contents in the local community. The improvement in the hit rate still demonstrates the effectiveness of the file prefetching strategy, and the improvement would be greater with more ambassadors and more widely varied file popularity.

TABLE 5
Hit rate improvement with the Haggle trace.

# of packets            5000      10000     15000     20000     25000
Original                0.75137   0.75215   0.73831   0.74928   0.74731
Prefetching             0.75313   0.75633   0.74274   0.75242   0.75201
Multi-copy forwarding   0.779412  0.780135  0.778912  0.774321  0.779415

TABLE 6
Hit rate improvement with the MIT Reality trace.

# of packets            5000       10000     15000     20000     25000
Original                0.761371   0.759941  0.760135  0.756418  0.751835
Prefetching             0.7671675  0.762841  0.763762  0.759957  0.754837
Multi-copy forwarding   0.780413   0.770838  0.772843  0.773901  0.771963
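
The gains reported in Tables 5 and 6 can be summarized with simple arithmetic. The Python sketch below copies the hit rates from the two tables and computes the mean absolute gain of each strategy over the original SPOON. The dictionary layout and the helper name mean_gain are illustrative choices for this example, not part of SPOON.

```python
# Hit rates copied from Tables 5 and 6 (columns: 5000, 10000, ..., 25000).
HAGGLE = {
    "Original": [0.75137, 0.75215, 0.73831, 0.74928, 0.74731],
    "Prefetching": [0.75313, 0.75633, 0.74274, 0.75242, 0.75201],
    "Multi-copy forwarding": [0.779412, 0.780135, 0.778912, 0.774321, 0.779415],
}
MIT_REALITY = {
    "Original": [0.761371, 0.759941, 0.760135, 0.756418, 0.751835],
    "Prefetching": [0.7671675, 0.762841, 0.763762, 0.759957, 0.754837],
    "Multi-copy forwarding": [0.780413, 0.770838, 0.772843, 0.773901, 0.771963],
}


def mean_gain(table, strategy):
    """Mean absolute hit-rate gain of `strategy` over the original SPOON."""
    base = table["Original"]
    gains = [s - b for s, b in zip(table[strategy], base)]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    for name, table in (("Haggle", HAGGLE), ("MIT Reality", MIT_REALITY)):
        for strategy in ("Prefetching", "Multi-copy forwarding"):
            print(f"{name:12s} {strategy:22s} +{mean_gain(table, strategy):.4f}")
```

With these numbers, prefetching adds roughly 0.004 to the hit rate on both traces, while two-copy forwarding adds roughly 0.031 on the Haggle trace and 0.016 on the MIT Reality trace, which matches the qualitative statements above.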

4.4.2 Querying-Completion and Loop-Prevention

We name SPOON without querying-completion as SPOON-OR, SPOON with querying-completion only as SPOON-QC, and SPOON with both querying-completion and loop-prevention as SPOON-QCLP. We set the number of queries to 15000. In SPOON-OR, queries do not stop searching until their TTL expires. We also set the maximal number of files each query can retrieve, S_max, to 2. To enable a query to find two files, we purposely let each file have two copies in the system. To reduce the influence of the TTL on the evaluation of querying-completion, we enlarged the TTL to the entire length of the traces used. The test results are shown in Table 7.


We find from the table that SPOON-QCLP has a slightly higher hit rate than SPOON-OR and SPOON-QC. This is because loop-prevention avoids forwarding a query to the same file holder repeatedly, thereby utilizing forwarding opportunities more efficiently. SPOON-QC has a slightly lower hit rate than SPOON-OR because it stops querying when S_max files are fetched. We also see that the number of query forwarding operations follows SPOON-OR>SPOON-QC>SPOON-QCLP. This is because SPOON-OR does not stop querying until the TTL expires. SPOON-QC reduces the cost because it stops querying after the specified number of files are located. In SPOON-QCLP, loop-prevention avoids redundant forwarding to the same node, leading to fewer forwarding operations and more efficient file searching.

TABLE 7
Effect of query-completion strategy.

Metric                                 Trace        SPOON-OR  SPOON-QC  SPOON-QCLP
Hit rate                               Haggle       0.8741    0.8701    0.8813
                                       MIT Reality  0.9091    0.9012    0.9406
Number of query forwarding operations  Haggle       569841    530516    304953
                                       MIT Reality  721852    609863    249132
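
The interaction between querying-completion and loop-prevention can be illustrated with a minimal Python sketch. It is not SPOON's implementation: the Query class, its fields, and the helpers should_forward, on_file_found, and count_forwards are hypothetical names introduced here. The only behaviors taken from the text are that a query stops once S_max files are fetched and that it is not forwarded again to a file holder it has already reached.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """Hypothetical query state; the field names are illustrative assumptions."""
    s_max: int = 2                                       # max files to retrieve (S_max)
    found: int = 0                                       # files fetched so far
    visited_holders: set = field(default_factory=set)    # holders already reached

    def completed(self) -> bool:
        # Querying-completion: stop once S_max files have been fetched.
        return self.found >= self.s_max


def should_forward(query: Query, holder_id: str) -> bool:
    """Decide whether to forward `query` toward file holder `holder_id`."""
    if query.completed():
        return False
    # Loop-prevention: never go back to a holder that was already reached.
    return holder_id not in query.visited_holders


def on_file_found(query: Query, holder_id: str) -> None:
    """Record that `holder_id` returned a matching file."""
    query.visited_holders.add(holder_id)
    query.found += 1


def count_forwards(query: Query, candidate_holders) -> int:
    """Count forwarding operations over a sequence of candidate holders."""
    forwards = 0
    for holder in candidate_holders:
        if should_forward(query, holder):
            forwards += 1
            on_file_found(query, holder)
    return forwards
```

For a candidate sequence such as A, A, B, C with S_max = 2, this sketch forwards the query to A and B only: the second visit to A is suppressed by loop-prevention and C is skipped because the query is already complete.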

4.4.3 Node Churn Consideration

We name SPOON with and without a strategy to handle node churn as SPOON-CH and SPOON-NA, respectively. The total number of queries was set to 15000. The period of the beacon message was set to 100s and 1000s with the Haggle trace and the MIT Reality trace, respectively. In the test, N_L nodes leave the system evenly during the first 1/2 and 1/4 of the Haggle trace and the MIT Reality trace, respectively. N_L was varied from 5 to 25. We name the node that contains a file matching a query as the primary matching node for the query. To demonstrate the performance of SPOON under node churn, for each queried file, we purposely created a file that has 70% similarity with it on a non-leaving node in the same community as the primary matching node. We name this node as the secondary matching node.

The test results for voluntary and abrupt departures of normal nodes are shown in Figure 7. We see that in all cases, when node churn consideration is applied, the hit rate increases and the average delay decreases. This is because with node churn consideration, queries that fail to find their primary matching nodes (i.e., nodes that have left the system) are further forwarded to their secondary matching nodes, whereas without node churn consideration, these queries just wait on coordinators for the primary matching node, leading to a low hit rate and a high average delay. We also observe that when the number of leaving nodes increases, the hit rate decreases and the average delay increases. This is because leaving nodes can no longer relay queries or departure notification/detection messages.

We also tested the scenario in which only the coordinator nodes leave the system. Since the total number of coordinators is limited, we randomly chose only two coordinators to leave the system during the test. We name the scenarios in which coordinators depart abruptly and voluntarily as w/ churn consideration-Ab and w/ churn consideration-Vo, respectively. The test results are shown in Table 8. We find that when coordinators leave the system, the hit rate of SPOON with node churn consideration is much higher than that without node churn consideration. This is because the coordinator is critical in both intra- and inter-community file searching in SPOON. Without node churn consideration, queries that need to be forwarded to coordinators just wait for them until their TTL expires, leading to a low hit rate and a high average delay. The above results show that SPOON's strategies for node churn consideration can improve the system performance at a low cost.

TABLE 8
Effect of the detection of coordinator departures.

Trace                      Haggle    MIT Reality
w/o churn consideration    0.650643  0.641053
w/ churn consideration-Ab  0.684512  0.677841
w/ churn consideration-Vo  0.729942  0.746841
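
The fallback from a primary to a secondary matching node can be sketched as follows. This Python example is an illustration under stated assumptions, not SPOON's code: the function pick_matching_node, the candidate tuples, and the departed set are hypothetical, and file similarity is taken as a precomputed number (1.0 for the primary match and 0.7 for the secondary match, mirroring the experiment setup).

```python
from typing import Iterable, Optional, Set, Tuple


def pick_matching_node(
    candidates: Iterable[Tuple[str, float]],
    departed: Set[str],
    churn_aware: bool = True,
) -> Optional[str]:
    """Return the node to which a query should be forwarded.

    candidates: (node_id, similarity) pairs known for the queried file.
    departed:   ids of nodes known (via notification or detection) to have left.
    """
    # Rank candidates from most to least similar to the queried file.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return None
    if not churn_aware:
        # Without churn consideration the query keeps targeting the best
        # match, even if that node has already left the system.
        return ranked[0][0]
    # With churn consideration, fall back to the best node still present.
    for node_id, _similarity in ranked:
        if node_id not in departed:
            return node_id
    return None
```

For example, with candidates ("primary", 1.0) and ("secondary", 0.7) and the primary node marked as departed, the churn-aware call returns the secondary node, while the call with churn_aware=False still returns the departed primary node, which corresponds to a query that waits until its TTL expires.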

CONCLUSION

In this paper, we propose a Social network based P2P cOntent file sharing system in disconnected mObile ad hoc Networks (SPOON). SPOON considers both node interest and contact frequency for efficient file sharing. We introduce four main components of SPOON: interest extraction identifies nodes' interests; community construction builds common-interest nodes with frequent contacts into communities; the node role assignment component exploits nodes with tight connections to community members for intra-community file searching and highly mobile nodes that visit external communities frequently for inter-community file searching; and the interest-oriented file searching scheme selects forwarding nodes for queries based on interest similarities. SPOON also incorporates additional strategies for file prefetching, querying-completion and loop-prevention, and node churn consideration to further enhance file searching efficiency. The system deployment on the real-world GENI Orbit platform and the trace-driven experiments demonstrate the efficiency of SPOON. In the future, we will explore how to determine appropriate thresholds in SPOON, how they affect the file sharing efficiency, and how to adapt SPOON to larger and more disconnected networks.

ACKNOWLEDGMENTS

This research was supported in part by U.S. NSF grants CNS-1249603, OCI-1064230, CNS-1049947, CNS-1156875, CNS-0917056, CNS-1057530, CNS-1025652, CNS-0938189, CSR-2008826, and CSR-2008827, Microsoft Research Faculty Fellowship 8300751, and the U.S. Department of Energy's Oak Ridge National Laboratory including the Extreme Scale Systems Center located at ORNL and DoD 4000111689. An early version of this work was presented in the Proceedings of MASS'11 [40].

REFERENCES

[1] "The state of the smartphone market," http://www.allaboutsymbian.com/news/item/6671_The_State_of_the_Smartphone_Ma.php.
[2] "Next Generation Smartphones Players, Opportunities & Forecasts 2008-2013," Juniper Research, Tech. Rep., 2009.

[Figure 7: four panels plotting hit rate and average delay (×10^3 s) against the periodical time for a node departure (s) for Voluntary: SPOON-NA, Voluntary: SPOON-CH, Abrupt: SPOON-NA, and Abrupt: SPOON-CH: (a) hit rate with Haggle trace; (b) average delay with Haggle trace; (c) hit rate with MIT trace; (d) average delay with MIT trace.]

Fig. 7. Performance with voluntary and abrupt node departures.


[3] "A Market Overview and Introduction to GyPSii," http://corporate.gypsii.com/docs/MarketOverview.
[4] "Bittorrent," http://www.bittorrent.com/.
[5] "Kazaa," http://www.kazaa.com.
[6] M. Papadopouli and H. Schulzrinne, "A Performance Analysis of 7DS: a Peer-to-Peer Data Dissemination and Prefetching Tool for Mobile Users," Advances in wired and wireless communications, IEEE Sarnoff Symposium Digest, 2001.
[7] A. Klemm, C. Lindemann, and O. Waldhorst, "A Special-Purpose Peer-to-Peer File Sharing System for Mobile Ad Hoc Networks," in Proc. of VTC, 2003.
[8] C. Lindemann and O. P. Waldhorst, "A Distributed Search Service for Peer-to-Peer File Sharing," in Proc. of P2P, 2002.
[9] D. W. A. Hayes, "Peer-to-Peer Information Sharing in a Mobile Ad hoc Environment," in Proc. of WMCSA, 2004.
[10] J. B. Tchakarov and N. H. Vaidya, "Efficient Content Location in Wireless Ad Hoc Networks," in Proc. of MDM, 2004.
[11] C. Hoh and R. Hwang, "P2P File Sharing System over MANET based on Swarm Intelligence: A Cross-Layer Design," in Proc. of WCNC, 2007, pp. 2674-2679.
[12] T. Repantis and V. Kalogeraki, "Data Dissemination in Mobile Peer-to-Peer Networks," in Proc. of MDM, 2005.
[13] Y. Huang, Y. Gao, K. Nahrstedt, and W. He, "Optimizing file retrieval in delay-tolerant content distribution community," in Proc. of ICDCS, 2009.
[14] W. Gao, G. Cao, A. Iyengar, and M. Srivatsa, "Supporting cooperative caching in disruption tolerant networks," in Proc. of ICDCS, 2011.
[15] J. Reich and A. Chaintreau, "The age of impatience: optimal replication schemes for opportunistic networks," in Proc. of CoNEXT, 2009.
[16] V. Lenders, M. May, G. Karlsson, and C. Wacha, "Wireless ad hoc podcasting," Mobile Computing and Communications Review, 2008.
[17] K. Chen and H. Shen, "Global optimization of file availability through replication for efficient file sharing in MANETs," in Proc. of ICNP, 2011.
[18] F. Li and J. Wu, "MOPS: Providing Content-Based Service in Disruption-Tolerant Networks," in Proc. of ICDCS, 2009.
[19] P. Costa, C. Mascolo, M. Musolesi, and G. P. Picco, "Socially-aware Routing for Publish-Subscribe in Delay-Tolerant Mobile Ad Hoc Networks," IEEE JSAC, vol. 26, no. 5, pp. 748-760, 2008.
[20] E. Yoneki, P. Hui, S. Chan, and J. Crowcroft, "A Socio-aware Overlay for Publish/subscribe Communication in Delay Tolerant Networks," in Proc. of MSWiM, 2007.
[21] C. Boldrini, M. Conti, and A. Passarella, "Contentplace: Social-aware data dissemination in opportunistic networks," in Proc. of MSWiM, 2008.
[22] A. Fast, D. Jensen, and B. N. Levine, "Creating social networks to improve peer-to-peer networking," in Proc. of KDD, 2005.
[23] A. Iamnitchi, M. Ripeanu, and I. T. Foster, "Small-world file-sharing communities," in Proc. of INFOCOM, 2004.
[24] M. Mcpherson, "Birds of a feather: Homophily in social networks," Annual Review of Sociology, vol. 27, no. 1, 2001.
[25] E. Yoneki, P. Hui, S. Chan, and J. Crowcroft, "A socio-aware overlay for publish/subscribe communication in delay tolerant networks," in Proc. of MSWiM, 2007.
[26] A. Chaintreau, P. Hui, J. Scott, R. Gass, J. Crowcroft, and C. Diot, "Impact of human mobility on opportunistic forwarding algorithms," IEEE TMC, vol. 6, no. 6, pp. 606-620, 2007.
[27] V. Carchiolo, M. Malgeri, G. Mangioni, and V. Nicosia, "An adaptive overlay network inspired by social behavior," Journal of Parallel and Distributed Computing (JPDC), 2010.
[28] A. Iamnitchi, M. Ripeanu, E. Santos-Neto, and I. Foster, "The small world of file sharing," IEEE TPDS, 2011.
[29] H. Schütze and C. Silverstein, "Projections for Efficient Document Clustering," in Proc. of SIGIR, 1997, pp. 74-81.
[30] P. Bonacich, "Factoring and Weighting Approaches to Status Scores and Clique Identification," J. of Math. Sociol., 1972.
[31] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York, USA: Wiley, 1990.
[32] E. Daly and M. Haahr, "Social network analysis for routing in disconnected delay-tolerant MANETs," in Proc. of MobiHoc, 2007.
[33] W. Hsu, T. Spyropoulos, K. Psounis, and A. Helmy, "Modeling time-variant user mobility in wireless mobile networks," in Proc. of INFOCOM, 2007.
[34] A. Vahdat and D. Becker, "Epidemic routing for partially-connected ad hoc networks," Duke University, Tech. Rep., 2000.
[35] "GENI project," http://www.geni.net/.
[36] "Orbit," http://www.orbit-lab.org/.
[37] "The network simulator ns-2," http://www.isi.edu/nsnam/ns/.
[38] M. Musolesi and C. Mascolo, "Designing Mobility Models Based on Social Network Theory," ACM SIGMOBILE Comput. and Comm. Rev., 2007.
[39] N. Eagle, A. Pentland, and D. Lazer, "Inferring Social Network Structure using Mobile Phone Data," PNAS, vol. 106, no. 36, 2009.
[40] K. Chen and H. Shen, "Leveraging social networks for P2P content-based file sharing in mobile ad hoc networks," in Proc. of MASS, 2011.

Kang Chen received the BS degree in Electronics and Information Engineering from Huazhong University of Science and Technology, China in 2005, and the MS in Communication and Information Systems from the Graduate University of Chinese Academy of Sciences, China in 2008. He is currently a Ph.D. student in the Department of Electrical and Computer Engineering at Clemson University. His research interests include mobile ad hoc networks and delay tolerant networks.

Haiying Shen received the BS degree in Computer Science and Engineering from Tongji University, China in 2000, and the MS and Ph.D. degrees in Computer Engineering from Wayne State University in 2004 and 2006, respectively. She is currently an Assistant Professor in the Department of Electrical and Computer Engineering at Clemson University. Her research interests include distributed computer systems and computer networks, with an emphasis on P2P and content delivery networks, mobile computing, wireless sensor networks, and grid and cloud computing. She was the Program Co-Chair for a number of international conferences and a member of the Program Committees of many leading conferences. She is a Microsoft Faculty Fellow of 2010 and a member of the IEEE and ACM.

Haibo Zhang received the BS degree in Computer Science and Technology from Shanghai Jiao Tong University, China in 2008, and the MS in Computer Science from University of Arkansas, Fayetteville in 2010. He is currently a Software Engineer at Cerner Corporation, which makes software for hospitals and medical institutions to support meaningful use of information systems.