Hybrid Filtering. Computational Journalism Week 6

Frontiers of
Computational Journalism
Columbia Journalism School
Week 6: Hybrid Filtering
October 16, 2015
Filtering Comments
Thousands of comments, what are the good ones?
Comment voting
Problem: putting comments with most votes at top doesn't work. Why?
Reddit Comment Ranking (old)
Up votes minus down votes,
plus time decay
Reddit Comment Ranking (new)

N=16
v = 11
p = 11/16 = 0.6875
Hypothetically, suppose all users voted on the comment, and v
out of N up-voted. Then we could sort by proportion p = v/N of
upvotes.
Reddit Comment Ranking

n=3
v = 1
p = 1/3 = 0.333
Actually, only n users out of N vote, giving an observed
approximate proportion p = v/n
Reddit Comment Ranking

[Figure: comments labeled p = 0.75, p = 0.1875, p = 0.333, p = 0.6875]
Limited sampling can rank comments incorrectly when we don't have
enough data.
Random error in sampling

If we observe a proportion p of upvotes from n random users, what is
the distribution of the true proportion?
Distribution of p when p=0.5
Confidence interval
Given an observed p, the interval that the true p has a
specified probability of lying inside.
Rank comments by lower bound of confidence interval
Analytic solution for confidence interval, known as the Wilson score interval
p = observed proportion of upvotes
n = how many people voted
z = how certain do we want to be before we assume that p is
close to true p
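The lower bound can be sketched in a few lines. This is a minimal illustration of the standard Wilson score interval, not Reddit's actual code; the function name `wilson_lower_bound` and the default z = 1.96 (roughly 95% confidence) are assumptions made for the example.

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote proportion.

    upvotes: observed number of upvotes
    n:       number of people who voted
    z:       normal quantile for the desired confidence (1.96 is about 95%)
    """
    if n == 0:
        return 0.0  # no votes yet, so no evidence the comment is good
    p = upvotes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

# A comment with 3 of 4 upvotes has a higher raw proportion than one with
# 11 of 16, but far less evidence behind it, so it gets a lower bound.
few_votes = wilson_lower_bound(3, 4)
many_votes = wilson_lower_bound(11, 16)
```

Sorting by this value instead of by raw p keeps comments with only a handful of votes from jumping to the top.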
User-item matrix
Stores rating of each user for each item. Could also
be binary variable that says whether user clicked,
liked, starred, shared, purchased...
User-item matrix
No content analysis. We know nothing about what is in each
item.
Typically very sparse: a user hasn't watched even 1% of all
movies.
Filtering problem is guessing unknown entry in matrix. High
guessed values are things user would want to see.
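A small sketch of a sparse user-item matrix, assuming a dict-of-dicts layout where only observed ratings are stored. The user names, movie titles, and helper functions are hypothetical, chosen only to show which entries a filter would have to guess.

```python
# Hypothetical ratings: only the entries we have observed are stored.
ratings = {
    "ann": {"Alien": 5, "Brazil": 3},
    "bob": {"Alien": 4},
    "cat": {"Brazil": 2, "Casablanca": 5},
}

def all_items(ratings):
    """Every item that at least one user has rated."""
    return sorted({item for row in ratings.values() for item in row})

def density(ratings):
    """Fraction of the full user-by-item matrix that is actually filled in."""
    observed = sum(len(row) for row in ratings.values())
    return observed / (len(ratings) * len(all_items(ratings)))

def unknown_entries(ratings):
    """The (user, item) pairs a filtering algorithm would need to guess."""
    items = all_items(ratings)
    return [(u, i) for u in sorted(ratings) for i in items if i not in ratings[u]]

missing = unknown_entries(ratings)
```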
Filtering process
How to guess unknown rating?

Basic idea: suggest similar items.
Similar items are rated in a similar way by many different
users.
Remember, rating could be a click, a like, a purchase.
o Users who bought A also bought B...
o Users who clicked A also clicked B...
o Users who shared A also shared B...
Similar items
Item similarity
Cosine similarity!
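As an illustration, item similarity on binary click data can be sketched like this. Each item becomes a vector with one entry per user, and a missing click is treated as 0, which is a simplifying assumption; the click data and function names are made up for the example.

```python
import math

# Hypothetical click log: which items each user clicked.
clicks = {
    "u1": {"A", "B"},
    "u2": {"A", "B"},
    "u3": {"A"},
    "u4": {"C"},
}

def item_vector(clicks, item):
    """One entry per user (in sorted order): 1 if they clicked the item, else 0."""
    return [1 if item in clicks[user] else 0 for user in sorted(clicks)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item nobody touched is similar to nothing
    return dot / (norm_a * norm_b)

sim_ab = cosine_similarity(item_vector(clicks, "A"), item_vector(clicks, "B"))
sim_ac = cosine_similarity(item_vector(clicks, "A"), item_vector(clicks, "C"))
```

Items A and B share most of their users, so they come out similar; A and C share none.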
Other distance measures
adjusted cosine similarity
Subtracts average rating for each user, to compensate for general
enthusiasm ("most movies suck" vs. "most movies are great")
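A sketch of adjusted cosine similarity, assuming the common formulation that sums over users who rated both items and subtracts each user's own mean rating first. The two users below are hypothetical: one rates everything high, the other rates everything low, but they agree on the relative order.

```python
import math

# Hypothetical ratings from an enthusiastic user and a harsh one.
ratings = {
    "fan":    {"A": 5, "B": 5, "C": 3},
    "critic": {"A": 2, "B": 2, "C": 1},
}

def adjusted_cosine(ratings, item_a, item_b):
    """Cosine similarity of two items after removing each user's mean rating."""
    num = sum_a = sum_b = 0.0
    for row in ratings.values():
        if item_a in row and item_b in row:      # only users who rated both
            mean = sum(row.values()) / len(row)  # this user's general enthusiasm
            da, db = row[item_a] - mean, row[item_b] - mean
            num += da * db
            sum_a += da * da
            sum_b += db * db
    if sum_a == 0 or sum_b == 0:
        return 0.0
    return num / math.sqrt(sum_a * sum_b)

sim_ab = adjusted_cosine(ratings, "A", "B")  # both users rank A and B the same
sim_ac = adjusted_cosine(ratings, "A", "C")  # both users prefer A over C
```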
Generating a recommendation
Weighted average of item ratings by their similarity.
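One way to write that weighted average, as a sketch: the predicted rating for a target item is the user's existing ratings weighted by how similar each rated item is to the target. The function name and the similarity values are assumptions for illustration.

```python
def predict_rating(user_ratings, similarities):
    """Similarity-weighted average of the ratings this user has already given.

    user_ratings: {item: rating} for one user
    similarities: {item: similarity of that item to the target item}
    """
    numerator = 0.0
    denominator = 0.0
    for item, rating in user_ratings.items():
        sim = similarities.get(item, 0.0)
        numerator += sim * rating
        denominator += abs(sim)
    if denominator == 0:
        return None  # nothing similar was rated, so there is no basis for a guess
    return numerator / denominator

# The user liked A (very similar to the target) and disliked B (barely similar),
# so the guess lands close to the rating for A.
guess = predict_rating({"A": 4, "B": 2}, {"A": 0.9, "B": 0.1})
```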
Matrix factorization recommender
Matrix factorization recommender
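A minimal sketch of a matrix factorization recommender, assuming plain stochastic gradient descent on squared error with L2 regularization. The toy ratings, the number of latent topics k, and all hyperparameters are made-up values for illustration, not taken from the lecture.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def factorize(ratings, n_users, n_items, k=2, epochs=2000, lr=0.01, reg=0.02, seed=0):
    """Learn a k-dimensional vector per user (U) and per item (V) so that
    dot(U[u], V[i]) approximates each known rating."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(U[u], V[i])
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

def rmse(ratings, U, V):
    """Root mean squared error on the known ratings."""
    total = sum((r - dot(U[u], V[i])) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5

# Known entries of a 3-user by 3-item matrix as (user, item, rating).
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
U, V = factorize(ratings, n_users=3, n_items=3)

# Guess for an unknown entry: user 0 has not rated item 2.
guess = dot(U[0], V[2])
```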
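Matrix factorization plate model
[Figure: plate diagram. Each of the j items has a topic vector v (topics for item), drawn with a parameter for variation in item topics. Each of the i users has a topic vector u (topics for user), drawn with a parameter for variation in user topics. The user rating of an item depends on u and v.]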
Combining collaborative filtering and topic modeling
Content modeling - LDA
[Figure: LDA plate diagram. Nodes: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter. Plates: N words in doc, D docs, K topics.]
Collaborative Topic Modeling
[Figure: plate diagram combining both models. Content side: topic concentration, topics in doc (content), topic for word, word in doc, K topics. Collaborative side: weight of user selections, topics in doc (collaborative), variation in per-user topics, topics for user, user rating of doc.]
[Figure: panels labeled "content only" and "content + social"]
Different Filtering Systems

Content:
Newsblaster analyzes the topics in the documents.
No concept of users.
Social:
What I see on Twitter determined by who I follow.
Reddit comments filtered by votes as input.
Amazon "people who bought X also bought Y"
No content analysis.
Hybrid:
Recommend based on both content and user behavior.
Item Content: text analysis, topic modeling, clustering...
My Data: what I've read/liked, who I follow
Other Users' Data: social network structure, other users' likes
How to evaluate/optimize?
Netflix: try to predict the rating that the user gives a
movie after watching it.
Amazon: sell more stuff.
Google web search: human raters, plus an A/B test for every
change
Does the user understand how the filter works?
Can they configure it as desired?
Can they correctly predict what they will and won't
see?
Can it be gamed? Spam, "user-generated
censorship," etc.
"During the 2012 election, The ~2000 members of an anti-Ron
Paul subreddit discovered that
anything they posted, anywhere
on reddit, was being rapidly,
repeatedly downvoted. They
created a diagnostic subreddit
and began posting otherwise
meaningless text to verify this
otherwise odd behavior."
Filter design problem

Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define
r(S,U,{P},{B}) in [0...1]
relevance of story S to user U
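To make the signature concrete, here is a toy hybrid relevance function. Everything in it is an assumption for illustration: U is reduced to a set of interest words, S to a set of story words plus an id, {B} to other users' likes, {P} is left out, and the blend weight `alpha` is arbitrary. It is a sketch of the shape of r, not a recommended design.

```python
def relevance(story_id, story_words, user_interests, other_users_likes, alpha=0.5):
    """Toy r(S, U, {B}) in [0, 1]: a blend of a content score and a social score."""
    story, interests = set(story_words), set(user_interests)

    # Content part: word overlap (Jaccard) between the story and the user's interests.
    union = story | interests
    content = len(story & interests) / len(union) if union else 0.0

    # Social part: fraction of other users who liked this story.
    if other_users_likes:
        liked = sum(story_id in likes for likes in other_users_likes.values())
        social = liked / len(other_users_likes)
    else:
        social = 0.0

    return alpha * content + (1 - alpha) * social

score = relevance(
    "s1",
    story_words={"election", "reddit"},
    user_interests={"election", "music"},
    other_users_likes={"a": {"s1"}, "b": set()},
)
```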
Filter design problem, restated

When should a user see a story?
Aspects to this question:
normative
  personal: what I want
  societal: emergent group effects
UI
  how do I tell the computer what I want?
technical
  constrained by algorithmic possibility
economic
  cheap enough to deploy widely
Does it improve the user's life?
