
Exploration of Yelp Database
Dionysios Nikolopoulos
Introduction

The Yelp application provides 5 datasets:

Users
Businesses
Reviews
Check-in
Tips
Tasks

Task 1: Construction and Analysis of a User Network based on Friendship

Task 2: Construction and Analysis of a User Network based on Reviews

Task 3: Implementation of an algorithm for cluster discovery based on the
paper "Higher-order organization of complex networks"

Task 2 - Description
To uncover user behaviors and trends, we construct a weighted network in
which a link between two users means that both of them reviewed the same
business. Links are weighted because two users can have reviewed more
than one business in common.
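
The construction described above could be sketched as follows. This is a minimal illustration, not the code used for the slides: the function name, the `(user_id, business_id)` input format and the toy data are assumptions, and the link weight is assumed to be the number of businesses two users have in common.

```python
from collections import defaultdict
from itertools import combinations


def build_coreview_network(reviews):
    """Map each user pair (u, v), u < v, to the number of businesses both reviewed."""
    reviewers = defaultdict(set)
    for user_id, business_id in reviews:
        # A set ignores repeated reviews of one business by the same user.
        reviewers[business_id].add(user_id)

    weights = defaultdict(int)
    for users in reviewers.values():
        # Every pair of reviewers of the same business gains one unit of weight.
        for u, v in combinations(sorted(users), 2):
            weights[(u, v)] += 1
    return dict(weights)


# Hypothetical toy data: (user_id, business_id) pairs.
toy_reviews = [
    ("u1", "b1"), ("u2", "b1"),
    ("u1", "b2"), ("u2", "b2"), ("u3", "b2"),
]
toy_edges = build_coreview_network(toy_reviews)
```

Here `u1` and `u2` share two businesses, so their link gets weight 2, while the links to `u3` get weight 1. Note that a business with many reviewers produces a number of links that grows quadratically with its reviewer count.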

Note that the resulting network is built from a random sample of 40000 of
the 77079 businesses in total (roughly 52% of the total number of
businesses)
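
A reproducible version of such a sampling step might look like the sketch below. The function name, the seed and the integer stand-in ids are assumptions made for illustration; the real ids would come from the Businesses dataset.

```python
import random


def sample_businesses(business_ids, k, seed=0):
    """Draw k distinct business ids uniformly at random, reproducibly."""
    rng = random.Random(seed)
    # Sorting first makes the result independent of the input ordering.
    return set(rng.sample(sorted(business_ids), k))


# Stand-in ids only; they are not taken from the Yelp data.
all_ids = range(77079)
sampled = sample_businesses(all_ids, 40000)
fraction = len(sampled) / len(all_ids)
```

Fixing the seed means the same subset, and therefore the same network, is obtained on every run.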

Subsequently, two groups of the strongest users in terms of degree are
kept for further exploration (the 1000 and the 10000 strongest users,
respectively).

These networks still contain users that are not members of the strong
groups. Both networks are therefore further subsetted so that they
contain only users that belong to the strong group (the "Exclusive"
networks).
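
One possible reading of this two-step selection is sketched below. It assumes that "degree" means the number of distinct neighbours, that the first network keeps every link touching a strong user, and that the exclusive network keeps only links between two strong users. All function names and the toy edge list are hypothetical.

```python
from collections import Counter


def degrees(edges):
    """Degree = number of distinct neighbours (edge weights are ignored here)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def strongest_users(edges, k):
    """Return the k users with the highest degree (ties broken by user id)."""
    deg = degrees(edges)
    return set(sorted(deg, key=lambda u: (-deg[u], u))[:k])


def strong_network(edges, strong):
    """Keep every link that touches at least one strong user."""
    return {e: w for e, w in edges.items() if e[0] in strong or e[1] in strong}


def exclusive_network(edges, strong):
    """Keep only links whose two endpoints are both strong users."""
    return {e: w for e, w in edges.items() if e[0] in strong and e[1] in strong}


# Hypothetical weighted edge list: {(user, user): weight}.
toy_edges = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2, ("c", "d"): 1}
top2 = strongest_users(toy_edges, 2)
non_exclusive = strong_network(toy_edges, top2)
exclusive = exclusive_network(toy_edges, top2)
```

In the toy example the two strongest users are `a` and `c`; the non-exclusive network still contains `b` and `d`, while the exclusive one keeps only the link between `a` and `c`.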

Histograms - Complete User Network

Degree distribution and review count seem to follow a power law when
plotted on a logarithmic scale

Average stars are generally high, and round values are strongly
over-represented
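
A visual impression of a power law can be complemented by estimating the exponent. The sketch below uses the standard continuous maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / x_min)). It is not taken from the slides; degrees are integers, so the continuous formula is only an approximation, and the choice of `x_min` is an assumption. The check uses synthetic data, not Yelp data.

```python
import math
import random


def power_law_exponent(values, x_min):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x**(-alpha), x >= x_min."""
    tail = [x for x in values if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


# Synthetic check: draw from a power law with alpha = 2.5 by inverse transform.
rng = random.Random(0)
true_alpha, x_min = 2.5, 1.0
sample = [
    x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
    for _ in range(20000)
]
estimate = power_law_exponent(sample, x_min)
```

An estimate close to the true exponent on synthetic data says nothing about whether real data follow a power law; a log-normal fit should be compared as well before drawing that conclusion.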

Histograms - 1000 strongest users
There are fewer observations, but they seem to follow a different
distribution from the complete network

Closer to a log-normal distribution
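
A quick numerical check of log-normality is to look at the logarithm of the values: if the data are log-normal, the logs are normal and their skewness should be near zero. The sketch below is illustrative only; the function name is an assumption and the sample is synthetic.

```python
import math
import random
import statistics


def log_moments(values):
    """Mean, standard deviation and skewness of log(values), ignoring non-positive values."""
    logs = [math.log(v) for v in values if v > 0]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs)
    skew = sum(((x - mu) / sigma) ** 3 for x in logs) / len(logs)
    return mu, sigma, skew


# Synthetic check: a log-normal sample with parameters mu = 3.0, sigma = 0.5.
rng = random.Random(1)
sample = [rng.lognormvariate(3.0, 0.5) for _ in range(20000)]
mu, sigma, skew = log_moments(sample)
```

A skewness of the logs that is clearly different from zero would argue against the log-normal reading of the histograms.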

Histograms - 1000 strongest users - Exclusive
Even fewer observations, but they seem to follow the same pattern

Histograms - 10000 strongest users
Average stars and review count seem to follow a log-normal distribution

The degree distribution is shown without the strong outliers on the left
(many users with a low degree of 1 or 2)

Histograms - 10000 strongest users - Exclusive
Similar behavior to the previous group, with different averages

Scatter plots - Average stars vs. Review count
In all networks the points seem to follow a wide normal distribution
Plotting on a logarithmic scale did not reveal any further characteristics

Scatter plots - Average stars vs. User degree
Similar behavior to the previous scatter plot

The 10000 networks, with their larger number of observations, show the
normal distribution more clearly

Scatter plots - Review count vs. User degree
Logarithmic plots are used because they seem to reveal more about the
corresponding behaviors

The exclusive networks seem to follow a slightly logarithmic trend, which
can be observed in the scatter plots

Boxplots - Reviews and degree comparison
Reviews: The 1000 strongest network shows a large deviation; the average
is slightly larger for the exclusive one.

The 10000 strongest network behaves similarly, while its average number
of reviews seems lower than in the 1000 network. The deviation is
smaller, partly because of the larger number of observations.

The complete network has a very low average, as most users rarely post a
review.

Degree: The 1000 exclusive network has a lower deviation and a lower
average degree than the 1000 network.
The same holds for the 10000 networks, but the difference in deviation is
significantly smaller.

The complete network again has the smallest average, while the deviation
remains at the same level.

Communities in 1000 strongest users - Exclusive
A built-in greedy algorithm is used to reveal possible communities in the
1000 strongest users exclusive network.

Even the greedy algorithm is very time-consuming when applied to the
other, larger networks.

The users are grouped into 3 sets, and an attempt is made to reveal
different or distinct behaviors among them
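
The slides do not say which library or objective the built-in greedy algorithm uses. Assuming it is a modularity-based routine (NetworkX's `greedy_modularity_communities` is one example of that kind), the quantity it tries to increase could be computed as in this sketch. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Weighted modularity Q = sum_c [ L_c / m - (d_c / (2 m))**2 ].

    L_c: total weight of links inside community c
    d_c: total strength (sum of incident weights) of the nodes in c
    m:   total weight of all links
    """
    m = sum(edges.values())
    label = {node: i for i, group in enumerate(communities) for node in group}
    inside = defaultdict(float)
    strength = defaultdict(float)
    for (u, v), w in edges.items():
        strength[label[u]] += w
        strength[label[v]] += w
        if label[u] == label[v]:
            inside[label[u]] += w
    return sum(inside[c] / m - (strength[c] / (2 * m)) ** 2 for c in strength)


# Toy graph: two triangles joined by a single link, all weights 1.
toy_edges = {
    ("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1,
    ("d", "e"): 1, ("d", "f"): 1, ("e", "f"): 1,
    ("c", "d"): 1,
}
q_split = modularity(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(toy_edges, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting the toy graph into its two triangles scores 5/14, while putting every node in one community scores 0. A greedy method typically merges communities step by step as long as this score improves, which is part of why it becomes slow on large networks.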

The groups overlap heavily and contain around 300 nodes each

The boxplots correspond to the network constructed for Task 2, where a
link means a review for the same business.
Group 3 has a larger deviation and larger values, and its outliers lie
higher than those of the other two groups.

Degree: Group 3 has a larger average and deviation, and its outliers lie
much higher than those of the other two groups. The behavior is similar
to that of the layered network.

Communities in 1000 strongest users - Exclusive - Scatter plots
Scatter plots do not reveal any significant trend or a large difference
between the groups

Sparse network based on 1000 strongest - Exclusive
An attempt is made to study and visualize a sparse network based on the
1000 strongest users exclusive network.

The network is constructed using the following rule:

A user is part of the network if their degree is larger than the mean
degree

A built-in greedy algorithm is used to obtain the communities
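
The filtering rule above can be sketched in a few lines. It assumes that the mean is taken over the degrees in the exclusive network itself and that only links between two retained users are kept; the function name and toy data are hypothetical.

```python
from collections import Counter


def above_mean_subnetwork(edges):
    """Keep users whose degree exceeds the mean degree, and the links among them."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    mean_degree = sum(deg.values()) / len(deg)
    kept = {u for u, d in deg.items() if d > mean_degree}
    sub = {e: w for e, w in edges.items() if e[0] in kept and e[1] in kept}
    return kept, sub


# Hypothetical toy network with unit weights.
toy_edges = {
    ("a", "b"): 1, ("a", "c"): 1, ("a", "d"): 1,
    ("b", "c"): 1, ("b", "d"): 1, ("d", "e"): 1,
}
kept, sparse = above_mean_subnetwork(toy_edges)
```

In the toy network the mean degree is 2.4, so only `a`, `b` and `d` (degree 3) remain. Because the degree distribution is heavy-tailed, a mean threshold usually removes well over half of the users.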

As the visualization shows, the community algorithm forms two groups, and
Group 1 consists of stronger users on average. Review count follows the
opposite trend.

Implementation of the algorithm
proposed in the paper

The proposed algorithm is implemented and verified, as can be seen from
the figure on the right. For the same network used in the paper, our
implementation gives identical results.
To compare it with the built-in greedy algorithm, parallelization is
required for faster simulation.

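
The slides do not show the implementation. As a rough illustration of the idea behind motif-based clustering, the sketch below builds a triangle motif adjacency (each link weighted by the number of triangles it takes part in) and evaluates the conductance of a candidate group on it. The choice of the triangle motif, the function names and the toy graph are assumptions; the spectral ordering and sweep that would normally select the group are omitted.

```python
from collections import defaultdict


def triangle_motif_adjacency(edges):
    """Weight each link (u, v) by the number of triangles that contain it."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    motif_w = {}
    for u, v in edges:
        shared = len(nbrs[u] & nbrs[v])  # common neighbours close a triangle
        if shared:
            motif_w[(u, v)] = shared
    return motif_w


def conductance(weights, group):
    """Cut weight divided by the smaller of the two side volumes."""
    cut = vol_in = vol_out = 0
    for (u, v), w in weights.items():
        for node in (u, v):
            if node in group:
                vol_in += w
            else:
                vol_out += w
        if (u in group) != (v in group):
            cut += w
    smaller = min(vol_in, vol_out)
    return cut / smaller if smaller else float("inf")


# Toy graph: two triangles joined by one bridging link.
toy_edges = {
    ("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1,
    ("c", "d"): 1,
    ("d", "e"): 1, ("d", "f"): 1, ("e", "f"): 1,
}
motif_w = triangle_motif_adjacency(toy_edges)
edge_cond = conductance(toy_edges, {"a", "b", "c"})
motif_cond = conductance(motif_w, {"a", "b", "c"})
```

The bridging link belongs to no triangle, so it disappears from the motif adjacency and the split between the two triangles has motif conductance 0, compared with 1/7 on the plain network. The per-link neighbour intersection is independent across links, which makes this step a natural candidate for parallelization.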
Implementation of the algorithm in
1000 strongest - exclusive

Group 1: 246 users

Group 2: 661 users

The algorithm was parallelized and the network was split into two groups

Group 2 has a higher average degree and review count

The review count of Group 2 has a larger variance

Scatter plots are available but did not reveal any particular trend

Conclusion
In Task 1, a high degree means a high number of friends. That analysis
was done in the course assignment.

In Task 2, a high degree means a high number of commonly reviewed
businesses.

What does that reveal about the users' behaviors and characteristics?

An effort was made to focus on the social behavior of the users of the
Yelp application.

2 types of networks were constructed, based on friendship and on
similarity in reviews

The two networks act as different layers for uncovering influential
users or groups of users

A simplified implementation of the novel algorithm proposed in the
aforementioned paper was made to reveal clusters and noticeable
organization in the networks, alongside the built-in community
algorithms

Average stars and review count for the strong users follow an
approximately log-normal distribution

The degree distribution follows a power law, similar to the full network

Basic metrics do not show any particular pattern, apart from slight
disassortativity for the strong networks

The exclusive networks also have a high average degree in the friendship
network: strong users as defined in Task 2 have a high number of friends
on average

Review count seems to have a weak direct relation with degree for the
strong users

Strong users have a large number of reviews on average, but the variance
is high

Strong users do not differ as much in degree as they do in number of
reviews
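
For reference, disassortativity is usually measured with the degree assortativity coefficient: the Pearson correlation between the degrees at the two ends of each link, where a negative value means high-degree users tend to link to low-degree users. The sketch below is an illustrative stdlib implementation for an unweighted, undirected network; it is not the code behind the slides, and the star graph is only a sanity check.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of every link."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    xs, ys = [], []
    for u, v in edges:
        # Count each undirected link in both directions to keep it symmetric.
        xs.extend((deg[u], deg[v]))
        ys.extend((deg[v], deg[u]))
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        raise ValueError("all nodes have the same degree; coefficient undefined")
    return cov / var


# Sanity check: a star is perfectly disassortative.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
r_star = degree_assortativity(star)
```

For the star graph the coefficient is -1; a value only slightly below zero would correspond to the slight disassortativity reported for the strong networks.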

Community detection tries to discover distinct parts of the network

The built-in greedy algorithm revealed 3 highly overlapping communities
that do not differ significantly in statistical attributes

2 communities were produced under the posed abstraction (degree >
average degree)

Degree and review count do not follow the same trend: Group 1 has a
higher average degree but a lower average review count

The manual algorithm did not expose distinct differences

               Av. Strength   Av. Review Count   Av. Stars
Cluster        335.90         79.25              3.67
Full Network   45.81          38.78              3.75

               Av. Degree     Av. Review Count   Av. Stars
Cluster        335.90         79.25              3.67
Full Network   45.81          38.78              3.75
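
The two tables refer to average strength and average degree. As a reminder of how these two quantities are usually defined on a weighted network, the sketch below computes both, taking degree as the number of links of a user and strength as the sum of the weights of those links. The function name and toy data are hypothetical, and the numbers are unrelated to the tables.

```python
from collections import defaultdict


def average_degree_and_strength(edges):
    """Return (average degree, average strength) over all users with at least one link."""
    degree = defaultdict(int)
    strength = defaultdict(float)
    for (u, v), w in edges.items():
        for node in (u, v):
            degree[node] += 1      # one more neighbour
            strength[node] += w    # add the weight of this link
    n = len(degree)
    return sum(degree.values()) / n, sum(strength.values()) / n


# Hypothetical weighted edge list: {(user, user): shared businesses}.
toy_edges = {("u1", "u2"): 2, ("u1", "u3"): 1, ("u2", "u3"): 1}
avg_degree, avg_strength = average_degree_and_strength(toy_edges)
```

In the toy network every user has degree 2, while the average strength is 8/3; the two averages coincide only when every link has weight 1.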
