Part 1: Yelp Dataset Profiling and Understanding

1. Profile the data by finding the total number of records for each of the tables
below:

SELECT COUNT(*)
FROM table

i. Attribute table = 10000

SELECT COUNT(*)
FROM attribute

ii. Business table = 10000

SELECT COUNT(*)
FROM business

iii. Category table = 10000

SELECT COUNT(*)
FROM category

iv. Checkin table = 10000

SELECT COUNT(*)
FROM checkin

v. elite_years table = 10000

SELECT COUNT(*)
FROM elite_years

vi. friend table = 10000

SELECT COUNT(*)
FROM friend

vii. hours table = 10000

SELECT COUNT(*)
FROM hours

viii. photo table = 10000

SELECT COUNT(*)
FROM photo

ix. review table = 10000

SELECT COUNT(*)
FROM review

x. tip table = 10000

SELECT COUNT(*)
FROM tip

xi. user table = 10000

SELECT COUNT(*)
FROM user
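
The eleven counts above repeat one query per table. As a minimal sketch, the same profiling step could be looped from Python's sqlite3 module. The in-memory database, the two tiny tables, and the helper name count_rows are made up for illustration and are not the Yelp data.

```python
import sqlite3

# Hypothetical toy database: two tiny tables stand in for the Yelp tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE tip (user_id TEXT, business_id TEXT, likes INTEGER);
    INSERT INTO business VALUES ('b1', 'Cafe'), ('b2', 'Diner');
    INSERT INTO tip VALUES ('u1', 'b1', 0), ('u1', 'b2', 1), ('u2', 'b1', 0);
""")

def count_rows(conn, tables):
    """Return {table: row count}.

    Table names cannot be bound as SQL parameters, so they are quoted
    and interpolated; pass trusted names only.
    """
    counts = {}
    for table in tables:
        (n,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        counts[table] = n
    return counts

row_counts = count_rows(conn, ["business", "tip"])
print(row_counts)
```

A count of exactly 10000 for every table suggests that each table was sampled or capped at that size, which is worth keeping in mind for the later questions.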

2. Find the total distinct records by either the foreign key or primary key for
each table. If two foreign keys are listed in the table, please specify which
foreign key.

SELECT COUNT(DISTINCT(Key))
FROM table

i. Business = 10000 (id)

SELECT COUNT(DISTINCT(id))
FROM business

ii. Hours = 1562 (business_id)

SELECT COUNT(DISTINCT(business_id))
FROM hours

iii. Category = 2643 (business_id)

SELECT COUNT(DISTINCT(business_id))
FROM category

iv. Attribute = 1115 (business_id)

SELECT COUNT(DISTINCT(business_id))
FROM attribute

v. Review = 10000 (id)

SELECT COUNT(DISTINCT(id))
FROM review

vi. Checkin = 493 (business_id)

SELECT COUNT(DISTINCT(business_id))
FROM checkin

vii. Photo = 10000 (id)

SELECT COUNT(DISTINCT(id))
FROM photo

viii. Tip = 537 (user_id)

SELECT COUNT(DISTINCT(user_id))
FROM tip

ix. User = 10000 (id)

SELECT COUNT(DISTINCT(id))
FROM user

x. Friend = 10000 (user_id)

SELECT COUNT(DISTINCT(user_id))
FROM friend

xi. Elite_years = 2780 (user_id)

SELECT COUNT(DISTINCT(user_id))
FROM elite_years

Note: Primary Keys are denoted in the ER-Diagram with a yellow key icon.
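
When the distinct count of a foreign key is lower than the row count, several rows point at the same parent record. The sketch below shows this on a made-up hours table; the rows and the 'Monday|8:00-17:00' value format are assumptions for illustration only.

```python
import sqlite3

# Hypothetical toy data: business b1 has two rows, b2 has one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hours (hours TEXT, business_id TEXT);
    INSERT INTO hours VALUES
        ('Monday|8:00-17:00', 'b1'),
        ('Tuesday|8:00-17:00', 'b1'),
        ('Monday|9:00-22:00', 'b2');
""")

# COUNT(*) counts rows; COUNT(DISTINCT col) counts unique non-NULL values.
total_rows, distinct_businesses = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT business_id) FROM hours"
).fetchone()
print(total_rows, distinct_businesses)
```

Read this way, Hours = 1562 means the 10000 hours rows cover 1562 different businesses.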

3. Are there any columns with null values in the Users table? Indicate "yes," or
"no."

Answer: no

SQL code used to arrive at answer:

SELECT COUNT(*)
FROM user
WHERE id IS NULL OR
name IS NULL OR
review_count IS NULL OR
yelping_since IS NULL OR
useful IS NULL OR
funny IS NULL OR
cool IS NULL OR
fans IS NULL OR
average_stars IS NULL OR
compliment_hot IS NULL OR
compliment_more IS NULL OR
compliment_profile IS NULL OR
compliment_cute IS NULL OR
compliment_list IS NULL OR
compliment_note IS NULL OR
compliment_plain IS NULL OR
compliment_cool IS NULL OR
compliment_funny IS NULL OR
compliment_writer IS NULL OR
compliment_photos IS NULL
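
Listing every column by hand is easy to get wrong on wide tables. One possible alternative, sketched below, reads the column names from SQLite's PRAGMA table_info and counts NULLs per column, which also shows which column is affected. The toy user table and the helper name null_counts are hypothetical.

```python
import sqlite3

# Hypothetical toy table with two NULLs planted in it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id TEXT PRIMARY KEY, name TEXT, fans INTEGER);
    INSERT INTO user VALUES ('u1', 'Amy', 3), ('u2', NULL, 0), ('u3', 'Ben', NULL);
""")

def null_counts(conn, table):
    """Return {column: number of NULL values} for a trusted table name."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    # "col IS NULL" evaluates to 0 or 1, so SUM counts the NULL rows.
    select_list = ", ".join(
        f'SUM("{col}" IS NULL) AS "{col}"' for col in columns
    )
    row = conn.execute(f'SELECT {select_list} FROM "{table}"').fetchone()
    return dict(zip(columns, row))

nulls = null_counts(conn, "user")
print(nulls)
```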

4. For each table and column listed below, display the smallest (minimum), largest
(maximum), and average (mean) value for the following fields:

i. Table: Review, Column: Stars

min: 1 max: 5 avg: 3.7082

SELECT MIN(stars), MAX(stars), AVG(stars)
FROM review

ii. Table: Business, Column: Stars


min: 1 max: 5 avg: 3.6549

SELECT MIN(stars), MAX(stars), AVG(stars)
FROM business

iii. Table: Tip, Column: Likes

min: 0 max: 2 avg: 0.0144

SELECT MIN(likes), MAX(likes), AVG(likes)
FROM tip

iv. Table: Checkin, Column: Count

min: 1 max: 53 avg: 1.9414

SELECT MIN(count), MAX(count), AVG(count)
FROM checkin

v. Table: User, Column: Review_count

min: 0 max: 2000 avg: 24.2995

SELECT MIN(review_count), MAX(review_count), AVG(review_count)
FROM user

5. List the cities with the most reviews in descending order:

SQL code used to arrive at answer:

SELECT city, SUM(review_count) AS TotalReviews
FROM business
GROUP BY city
ORDER BY TotalReviews DESC

Copy and Paste the Result Below:


+-----------------+--------------+
| city | TotalReviews |
+-----------------+--------------+
| Las Vegas | 82854 |
| Phoenix | 34503 |
| Toronto | 24113 |
| Scottsdale | 20614 |
| Charlotte | 12523 |
| Henderson | 10871 |
| Tempe | 10504 |
| Pittsburgh | 9798 |
| Montréal | 9448 |
| Chandler | 8112 |
| Mesa | 6875 |
| Gilbert | 6380 |
| Cleveland | 5593 |
| Madison | 5265 |
| Glendale | 4406 |
| Mississauga | 3814 |
| Edinburgh | 2792 |
| Peoria | 2624 |
| North Las Vegas | 2438 |
| Markham | 2352 |
| Champaign | 2029 |
| Stuttgart | 1849 |
| Surprise | 1520 |
| Lakewood | 1465 |
| Goodyear | 1155 |
+-----------------+--------------+
(Output limit exceeded, 25 of 362 total rows shown)

6. Find the distribution of star ratings to the business in the following cities:

i. Avon

SQL code used to arrive at answer:

SELECT stars AS StarRating, SUM(review_count) AS Count
FROM business
WHERE city = 'Avon'
GROUP BY stars

Copy and Paste the Resulting Table Below (2 columns – star rating and count):

+------------+-------+
| StarRating | Count |
+------------+-------+
| 1.5 | 10 |
| 2.5 | 6 |
| 3.5 | 88 |
| 4.0 | 21 |
| 4.5 | 31 |
| 5.0 | 3 |
+------------+-------+

ii. Beachwood

SQL code used to arrive at answer:

SELECT stars AS StarRating, SUM(review_count) AS Count
FROM business
WHERE city = 'Beachwood'
GROUP BY stars

Copy and Paste the Resulting Table Below (2 columns – star rating and count):

+------------+-------+
| StarRating | Count |
+------------+-------+
| 2.0 | 8 |
| 2.5 | 3 |
| 3.0 | 11 |
| 3.5 | 6 |
| 4.0 | 69 |
| 4.5 | 17 |
| 5.0 | 23 |
+------------+-------+

7. Find the top 3 users based on their total number of reviews:

SQL code used to arrive at answer:

SELECT name, id, review_count
FROM user
ORDER BY review_count DESC
LIMIT 3

Copy and Paste the Result Below:


+--------+------------------------+--------------+
| name | id | review_count |
+--------+------------------------+--------------+
| Gerald | -G7Zkl1wIWBBmD0KRy_sCw | 2000 |
| Sara | -3s52C4zL_DHRK0ULG6qtg | 1629 |
| Yuri | -8lbUNlXVSoXqaRRiHiSNg | 1339 |
+--------+------------------------+--------------+

8. Does posting more reviews correlate with more fans?

Please explain your findings and interpretation of the results:

Yes, there seems to be a positive correlation between the number of reviews and the
number of fans. However, the number of years a user has been on Yelp also appears
to go along with a higher number of fans.

Code used:
SELECT name,
id,
review_count,
fans,
-- Difference in calendar years between today and the sign-up date
(strftime('%Y', 'now') - strftime('%Y', yelping_since)) AS Years
FROM user
ORDER BY fans DESC
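
SQLite has no dedicated date type, so a Years column like the one in this query comes from arithmetic on text values. The sketch below compares three ways of computing it on two fixed, made-up dates: subtracting DATE() strings (SQLite converts each string to its leading number, the year), subtracting strftime('%Y', ...) values, and dividing a julianday() difference by 365.25 for elapsed time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up dates so the result does not depend on when the code runs.
params = {"today": "2020-06-15", "since": "2012-11-30"}

text_diff, year_diff, elapsed = conn.execute(
    """
    SELECT DATE(:today) - DATE(:since),
           strftime('%Y', :today) - strftime('%Y', :since),
           (julianday(:today) - julianday(:since)) / 365.25
    """,
    params,
).fetchone()

# The first two are calendar-year differences; the third is elapsed years.
print(text_diff, year_diff, round(elapsed, 2))
```

The first two expressions ignore months and days, so they can overstate the elapsed time by almost a year. That is acceptable for a rough grouping but not for precise tenure.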

Data retrieved for interpretation:


+-----------+------------------------+--------------+------+-------+
| name | id | review_count | fans | Years |
+-----------+------------------------+--------------+------+-------+
| Amy | -9I98YbNQnLdAmcYfb324Q | 609 | 503 | 13 |
| Mimi | -8EnCioUmDygAbsYZmTeRQ | 968 | 497 | 9 |
| Harald | --2vR0DIsmQ6WfcSzKWigw | 1153 | 311 | 8 |
| Gerald | -G7Zkl1wIWBBmD0KRy_sCw | 2000 | 253 | 8 |
| Christine | -0IiMAZI2SsQ7VmyzJjokQ | 930 | 173 | 11 |
| Lisa | -g3XIcCb2b-BD0QBCcq2Sw | 813 | 159 | 11 |
| Cat | -9bbDysuiWeo2VShFJJtcw | 377 | 133 | 11 |
| William | -FZBTkAZEXoP7CYvRV2ZwQ | 1215 | 126 | 5 |
| Fran | -9da1xk7zgnnfO1uTVYGkA | 862 | 124 | 8 |
| Lissa | -lh59ko3dxChBSZ9U7LfUw | 834 | 120 | 13 |
| Mark | -B-QEUESGWHPE_889WJaeg | 861 | 115 | 11 |
| Tiffany | -DmqnhW4Omr3YhmnigaqHg | 408 | 111 | 12 |
| bernice | -cv9PPT7IHux7XUc9dOpkg | 255 | 105 | 13 |
| Roanna | -DFCC64NXgqrxlO8aLU5rg | 1039 | 104 | 14 |
| Angela | -IgKkE8JvYNWeGu8ze4P8Q | 694 | 101 | 10 |
| .Hon | -K2Tcgh2EKX6e6HqqIrBIQ | 1246 | 101 | 14 |
| Ben | -4viTt9UC44lWCFJwleMNQ | 307 | 96 | 13 |
| Linda | -3i9bhfvrM3F1wsC9XIB8g | 584 | 89 | 15 |
| Christina | -kLVfaJytOJY2-QdQoCcNQ | 842 | 85 | 8 |
| Jessica | -ePh4Prox7ZXnEBNGKyUEA | 220 | 84 | 11 |
| Greg | -4BEUkLvHQntN6qPfKJP2w | 408 | 81 | 12 |
| Nieves | -C-l8EHSLXtZZVfUAUhsPA | 178 | 80 | 7 |
| Sui | -dw8f7FLaUmWR7bfJ_Yf0w | 754 | 78 | 11 |
| Yuri | -8lbUNlXVSoXqaRRiHiSNg | 1339 | 76 | 12 |
| Nicole | -0zEEaDFIjABtPQni0XlHA | 161 | 73 | 11 |
+-----------+------------------------+--------------+------+-------+
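
The table above is sorted by fans and shows only the top rows, so it supports the interpretation only loosely. A correlation coefficient over all users would be a firmer basis. SQLite has no built-in correlation function, so the sketch below computes Pearson's r in Python. The four user rows and the helper name pearson are made up for illustration; the value printed says nothing about the real dataset.

```python
import math
import sqlite3

# Hypothetical toy data: (review_count, fans) for four users.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id TEXT PRIMARY KEY, review_count INTEGER, fans INTEGER);
    INSERT INTO user VALUES
        ('u1', 10, 1), ('u2', 20, 3), ('u3', 30, 2), ('u4', 40, 5);
""")

def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

pairs = conn.execute("SELECT review_count, fans FROM user").fetchall()
r = pearson(pairs)
print(round(r, 3))
```

Values near 1 indicate a strong positive linear relationship, values near 0 indicate none. Running the same function on the full user table would put a number on the answer above.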

9. Are there more reviews with the word "love" or with the word "hate" in them?

Answer:

'Love' has been used more frequently (1780 time) than 'hate' (232 times).

SQL code used to arrive at answer:

SELECT text
FROM review
WHERE text LIKE '%love%'

= 1780 results

SELECT text
FROM review
WHERE text LIKE '%hate%'

= 232 results
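
Two properties of LIKE affect these counts: in SQLite it is case-insensitive for ASCII letters by default, and '%love%' matches any text containing those letters, including words such as "gloves". The sketch below shows both effects on four made-up reviews and uses COUNT(*) so the database returns the number directly. The helper name count_like is hypothetical.

```python
import sqlite3

# Hypothetical toy reviews chosen to trigger substring matches.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id TEXT PRIMARY KEY, text TEXT);
    INSERT INTO review VALUES
        ('r1', 'I LOVE this place'),
        ('r2', 'Gloves were on sale'),
        ('r3', 'I hate waiting'),
        ('r4', 'Whatever, it was fine');
""")

def count_like(conn, pattern):
    """Count reviews whose text matches the given LIKE pattern."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM review WHERE text LIKE ?", (pattern,)
    ).fetchone()
    return n

love_count = count_like(conn, "%love%")  # matches r1 and r2 ("Gloves")
hate_count = count_like(conn, "%hate%")  # matches r3 and r4 ("Whatever")
print(love_count, hate_count)
```

Both totals in the answer are therefore upper bounds on reviews that use the actual words, although the gap between 1780 and 232 is large enough that the conclusion is unlikely to change.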

10. Find the top 10 users with the most fans:

SQL code used to arrive at answer:

SELECT name, id, fans
FROM user
ORDER BY fans DESC
LIMIT 10

Copy and Paste the Result Below:


+-----------+------------------------+------+
| name | id | fans |
+-----------+------------------------+------+
| Amy | -9I98YbNQnLdAmcYfb324Q | 503 |
| Mimi | -8EnCioUmDygAbsYZmTeRQ | 497 |
| Harald | --2vR0DIsmQ6WfcSzKWigw | 311 |
| Gerald | -G7Zkl1wIWBBmD0KRy_sCw | 253 |
| Christine | -0IiMAZI2SsQ7VmyzJjokQ | 173 |
| Lisa | -g3XIcCb2b-BD0QBCcq2Sw | 159 |
| Cat | -9bbDysuiWeo2VShFJJtcw | 133 |
| William | -FZBTkAZEXoP7CYvRV2ZwQ | 126 |
| Fran | -9da1xk7zgnnfO1uTVYGkA | 124 |
| Lissa | -lh59ko3dxChBSZ9U7LfUw | 120 |
+-----------+------------------------+------+

Part 2: Inferences and Analysis

1. Pick one city and category of your choice and group the businesses in that city
or category by their overall star rating. Compare the businesses with 2-3 stars to
the businesses with 4-5 stars and answer the following questions. Include your
code.

City chosen: Las Vegas


Category chosen: Shopping

i. Do the two groups you chose to analyze have a different distribution of hours?

Yes. The 'Between 2-3' group has longer opening hours than the 'Between 4-5' group.
The store in the former group opens at 8:00 AM and closes at 10:00 PM, whereas the
two stores in the latter group also open at 8:00 AM but close by 5:00 PM at the
latest.

ii. Do the two groups you chose to analyze have a different number of reviews?

Yes. The store in the 'Between 2-3' group has 6 reviews. The 'Between 4-5' group has
one store with 32 reviews and another with 4. Thus there seems to be no correlation
between rating and review count.

iii. Are you able to infer anything from the location data provided between these
two groups? Explain.

All three stores are at different locations with different postal codes.

SQL code used for analysis:

SELECT b.city, b.name, c.category, b.stars, h.hours, b.review_count, b.address,
b.postal_code,
CASE
WHEN b.stars BETWEEN 2 AND 3 THEN 'Between 2-3'
WHEN b.stars BETWEEN 4 AND 5 THEN 'Between 4-5'
END AS Groups
FROM business b
INNER JOIN category c
ON b.id = c.business_id
INNER JOIN hours h
ON c.business_id = h.business_id
WHERE b.city = 'Las Vegas'
AND c.category = 'Shopping'
AND (b.stars BETWEEN 2 AND 3 OR b.stars BETWEEN 4 AND 5)
GROUP BY b.stars
ORDER BY Groups ASC
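
Two details of this query shape the result. Joining hours can return several rows per business, and GROUP BY b.stars keeps one row per star rating, so two businesses with the same rating collapse into one. The sketch below illustrates both on made-up data. The table contents and the assumption that hours holds one row per business per day are hypothetical.

```python
import sqlite3

# Hypothetical toy data: two shops with the same rating.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (id TEXT PRIMARY KEY, name TEXT, stars REAL);
    CREATE TABLE hours (hours TEXT, business_id TEXT);
    INSERT INTO business VALUES ('b1', 'Shop A', 4.0), ('b2', 'Shop B', 4.0);
    INSERT INTO hours VALUES
        ('Monday|8:00-17:00', 'b1'),
        ('Tuesday|8:00-17:00', 'b1'),
        ('Monday|10:00-16:00', 'b2');
""")

JOIN = "FROM business b INNER JOIN hours h ON b.id = h.business_id"

# The join yields one row per (business, hours) pair.
(joined_rows,) = conn.execute(f"SELECT COUNT(*) {JOIN}").fetchone()

# Grouping by rating merges both shops into a single output row.
by_stars = conn.execute(
    f"SELECT b.stars, COUNT(DISTINCT b.id) {JOIN} GROUP BY b.stars"
).fetchall()

# Grouping by business id keeps one row per shop.
by_business = conn.execute(
    f"SELECT b.name, COUNT(*) {JOIN} GROUP BY b.id ORDER BY b.name"
).fetchall()

print(joined_rows, by_stars, by_business)
```

If every business in each rating group should appear, grouping by b.id (or dropping the GROUP BY and filtering on a single day) would be one way to get there.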

2. Group business based on the ones that are open and the ones that are closed.
What differences can you find between the ones that are still open and the ones
that are closed? List at least two differences and the SQL code you used to arrive
at your answer.

i. Difference 1:

The average review count is higher for open businesses (about 31.76) than for closed
ones (about 23.20).

ii. Difference 2:

The average star rating is slightly higher for open businesses (about 3.68) than for
closed ones (about 3.52).

SQL code used for analysis:

SELECT is_open,
CASE
WHEN is_open = 0 THEN 'Closed'
WHEN is_open = 1 THEN 'Open'
END as Status,
AVG(review_count), AVG(stars)
FROM business
GROUP BY is_open

+---------+--------+-------------------+---------------+
| is_open | Status | AVG(review_count) | AVG(stars) |
+---------+--------+-------------------+---------------+
| 0 | Closed | 23.1980263158 | 3.52039473684 |
| 1 | Open | 31.7570754717 | 3.67900943396 |
+---------+--------+-------------------+---------------+

3. For this last part of your analysis, you are going to choose the type of
analysis you want to conduct on the Yelp dataset and are going to prepare the data
for analysis.

Ideas for analysis include: Parsing out keywords and business attributes for
sentiment analysis, clustering businesses to find commonalities or anomalies
between them, predicting the overall star rating for a business, predicting the
number of fans a user will have, and so on. These are just a few examples to get
you started, so feel free to be creative and come up with your own problem you want
to solve. Provide answers, in-line, to all of the following:

i. Indicate the type of analysis you chose to do:

How the usefulness of a review correlates with its length.

ii. Write 1-2 brief paragraphs on the type of data you will need for your analysis
and why you chose that data:

For this analysis I needed groups of data in various text length categories.
Further, I needed the average useful ratings of the text reviews.

iii. Output of your finished dataset:

As the length of the text in a review increases, the average usefulness of the
review increases.

+-------------------------------+----------------+
| Text_Length | AVG(useful) |
+-------------------------------+----------------+
| None | 0.0 |
| Between 0 - 200 characters | 0.339789706696 |
| Between 201 - 500 characters | 0.579194981704 |
| Between 501 - 1000 characters | 1.2033271719 |
| More than 1001 characters | 2.40542168675 |
+-------------------------------+----------------+

iv. Provide the SQL code you used to create your final dataset:

SELECT CASE
WHEN length(text) BETWEEN 0 AND 200 THEN 'Between 0 - 200 characters'
WHEN length(text) BETWEEN 201 AND 500 THEN 'Between 201 - 500 characters'
WHEN length(text) BETWEEN 501 AND 1000 THEN 'Between 501 - 1000 characters'
-- Note: a length of exactly 1001 matches no branch, so such reviews are
-- grouped under NULL (shown as 'None' in the output).
WHEN length(text) > 1001 THEN 'More than 1001 characters'
END AS 'Text_Length',
AVG(useful)
FROM review
GROUP BY Text_Length
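
The bucket boundaries leave one gap: a review of exactly 1001 characters is above the 501 to 1000 range but does not satisfy > 1001, so it falls into the NULL group. The sketch below reproduces this on three made-up reviews and shows that a threshold of > 1000 closes the gap. The shortened labels and the helper name bucket_counts are hypothetical.

```python
import sqlite3

# Hypothetical toy reviews with lengths 1000, 1001 and 1002.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [("r1", "x" * 1000), ("r2", "x" * 1001), ("r3", "x" * 1002)],
)

BUCKET_SQL = """
    SELECT CASE
        WHEN length(text) BETWEEN 0 AND 200 THEN '0 - 200'
        WHEN length(text) BETWEEN 201 AND 500 THEN '201 - 500'
        WHEN length(text) BETWEEN 501 AND 1000 THEN '501 - 1000'
        WHEN length(text) > ? THEN 'over 1000'
    END AS text_length,
    COUNT(*)
    FROM review
    GROUP BY text_length
"""

def bucket_counts(conn, threshold):
    """Return {bucket label: row count}; unmatched rows appear under None."""
    return dict(conn.execute(BUCKET_SQL, (threshold,)).fetchall())

with_gap = bucket_counts(conn, 1001)   # length 1001 matches no branch
without_gap = bucket_counts(conn, 1000)
print(with_gap)
print(without_gap)
```

An ELSE branch in the CASE expression would be another way to make sure no row is left unlabeled.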
