
Frontiers of Computational Journalism
Columbia Journalism School
Week 6: Hybrid Filtering
October 16, 2015

Filtering Comments

Thousands of comments, what are the good ones?

Comment voting

Problem: putting the comments with the most votes at the top doesn't work. Why?

Reddit Comment Ranking (old)

Upvotes minus downvotes, plus time decay

Reddit Comment Ranking (new)


N=16
v = 11
p = 11/16 = 0.6875

Hypothetically, suppose all N users voted on the comment, and v of them upvoted it. Then we could sort by the proportion p = v/N of upvotes.

Reddit Comment Ranking


n = 3
v = 1
p̂ = 1/3 ≈ 0.333

Actually, only n of the N users vote, giving an observed, approximate proportion p̂ = v/n.

Reddit Comment Ranking


p = 0.75
p = 0.1875

p = 0.333
p = 0.6875

Limited sampling can rank comments wrongly when we don't have enough data.

Random error in sampling


If we observe a proportion p̂ of upvotes from n random users, what is the distribution of the true proportion p?

Distribution of p̂ when p = 0.5
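
As a rough illustration of this distribution: if each of n voters upvotes independently with true probability p, the number of upvotes is binomial, and the observed proportion takes the values 0/n, 1/n, ..., n/n. The function name `phat_distribution` and the sample sizes below are assumptions made for this sketch, not part of the original slides.

```python
from math import comb

def phat_distribution(n, p):
    """Return [(observed proportion, probability)] for n independent voters.

    Assumes each voter upvotes with the same true probability p,
    so the upvote count v follows a Binomial(n, p) distribution.
    """
    return [(v / n, comb(n, v) * p**v * (1 - p)**(n - v)) for v in range(n + 1)]

# With few voters the observed proportion is spread widely around p = 0.5;
# with more voters it concentrates near the true value.
for n in (4, 16, 64):
    far_off = sum(prob for phat, prob in phat_distribution(n, 0.5)
                  if abs(phat - 0.5) >= 0.25)
    print(f"n={n:3d}  P(|observed - 0.5| >= 0.25) = {far_off:.3f}")
```

The loop prints how often the observed proportion lands at least 0.25 away from the true value, which shrinks as n grows.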

Confidence interval
Given the observed proportion p̂, an interval that the true p lies inside with a chosen probability.

Rank comments by lower bound of confidence interval
Analytic solution for the confidence interval, known as the Wilson score interval

p̂ = observed proportion of upvotes
n = how many people voted
z = how certain we want to be before we assume that p̂ is close to the true p (e.g. z = 1.96 for 95% confidence)
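
A minimal sketch of ranking by the lower bound of the Wilson score interval, using the symbols defined above. The function name `wilson_lower_bound`, the default z = 1.96, and the example vote counts are assumptions for illustration; this is not Reddit's actual code.

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for a proportion.

    upvotes: number of upvotes observed
    n:       number of people who voted
    z:       normal quantile for the desired confidence (1.96 for ~95%)
    """
    if n == 0:
        return 0.0  # no votes: nothing is known, so rank at the bottom
    p_hat = upvotes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z2 / (4 * n)) / n)
    return (centre - margin) / (1 + z2 / n)

# Hypothetical comments as (upvotes, total votes).
comments = {"a": (3, 4), "b": (11, 16), "c": (1, 3)}
ranked = sorted(comments, key=lambda c: wilson_lower_bound(*comments[c]), reverse=True)
print(ranked)
```

In this example the comment with 11 of 16 upvotes ranks above the one with 3 of 4, even though its raw proportion is lower, because the larger sample gives a tighter interval.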

User-item matrix

Stores the rating of each user for each item. Could also be a binary variable that says whether the user clicked, liked, starred, shared, purchased...

User-item matrix
No content analysis. We know nothing about what is in each
item.
Typically very sparse: a user hasn't watched even 1% of all movies.
Filtering problem is guessing unknown entry in matrix. High
guessed values are things user would want to see.
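
One way to picture a sparse user-item matrix is a dictionary that stores only the known ratings. The users, items, and ratings below are made up for this sketch, and the helper names are hypothetical.

```python
# Hypothetical ratings: user -> {item: rating}. A missing key is an unknown entry.
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "C": 1},
    "cat": {"B": 5},
}

def unknown_entries(ratings):
    """List the (user, item) cells the filter would have to guess."""
    items = sorted({item for user_ratings in ratings.values() for item in user_ratings})
    return [(user, item)
            for user in sorted(ratings)
            for item in items
            if item not in ratings[user]]

def density(ratings):
    """Fraction of the user-item matrix that is actually filled in."""
    items = {item for user_ratings in ratings.values() for item in user_ratings}
    known = sum(len(user_ratings) for user_ratings in ratings.values())
    return known / (len(ratings) * len(items))

print(unknown_entries(ratings))
print(round(density(ratings), 3))
```

Real systems have far lower density than this toy example, which is why only the known entries are stored.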

Filtering process

How to guess unknown rating?


Basic idea: suggest similar items.
Similar items are rated in a similar way by many different
users.
Remember, rating could be a click, a like, a purchase.
o Users who bought A also bought B...
o Users who clicked A also clicked B...
o Users who shared A also shared B...

Similar items

Item similarity
Cosine similarity!
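
A small sketch of cosine similarity between two items, where each item is represented by the ratings it received from users. Treating a missing rating as zero is an assumption of this sketch, and the function name is hypothetical.

```python
import math

def cosine_similarity(item_a, item_b):
    """Cosine similarity of two items given as {user: rating} dicts.

    Users who did not rate an item contribute 0 for that item.
    """
    shared_users = set(item_a) & set(item_b)
    dot = sum(item_a[u] * item_b[u] for u in shared_users)
    norm_a = math.sqrt(sum(r * r for r in item_a.values()))
    norm_b = math.sqrt(sum(r * r for r in item_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical click data: 1 means the user clicked the item.
clicked_a = {"u1": 1, "u2": 1, "u3": 1}
clicked_b = {"u1": 1, "u2": 1}
clicked_c = {"u4": 1}
print(cosine_similarity(clicked_a, clicked_b))  # many shared users: high
print(cosine_similarity(clicked_a, clicked_c))  # no shared users: 0.0
```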

Other distance measures

Adjusted cosine similarity

Subtracts the average rating for each user, to compensate for general enthusiasm ("most movies suck" vs. "most movies are great").
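
The following sketch shows one common form of adjusted cosine similarity: each rating is centered on that user's own mean before the cosine is taken, using only users who rated both items. The data and the function name are assumptions for illustration.

```python
import math

def adjusted_cosine(ratings, item_i, item_j):
    """Adjusted cosine similarity between two items.

    ratings: {user: {item: rating}}. Only users who rated both items count,
    and each rating is centered on that user's mean rating.
    """
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            d_i = user_ratings[item_i] - mean
            d_j = user_ratings[item_j] - mean
            num += d_i * d_j
            den_i += d_i * d_i
            den_j += d_j * d_j
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / math.sqrt(den_i * den_j)

# Hypothetical users: an enthusiast who rates everything high
# and a harsh critic who rates everything low.
ratings = {
    "enthusiast": {"A": 5, "B": 5, "C": 4},
    "critic": {"A": 2, "B": 2, "C": 1},
}
print(adjusted_cosine(ratings, "A", "B"))  # both above each user's mean
print(adjusted_cosine(ratings, "A", "C"))  # C is below each user's mean
```

After centering, both users agree that A and B are better than C, even though their raw scores differ a lot.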

Generating a recommendation

Weighted average of item ratings by their similarity.
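
A minimal sketch of that weighted average: the guess for an unrated item is the user's known ratings, weighted by how similar each rated item is to the target. Normalizing by the sum of absolute similarities is an assumption of this sketch, as are the names and numbers.

```python
def predict_rating(user_ratings, similarity_to_target):
    """Guess a user's rating for a target item.

    user_ratings:         {item: rating} for items the user has rated
    similarity_to_target: {item: similarity between that item and the target}
    Returns None when no rated item has a known similarity.
    """
    rated = [item for item in user_ratings if item in similarity_to_target]
    weight_sum = sum(abs(similarity_to_target[item]) for item in rated)
    if weight_sum == 0:
        return None
    weighted = sum(similarity_to_target[item] * user_ratings[item] for item in rated)
    return weighted / weight_sum

# Hypothetical example: the target is very similar to A and barely similar to B.
guess = predict_rating({"A": 5, "B": 1}, {"A": 0.9, "B": 0.1})
print(guess)  # close to the rating of A, the more similar item
```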

Matrix factorization recommender

Matrix factorization recommender

Matrix factorization plate model

[Figure: plate diagram. Item side: variation in item topics, topics for item (v), repeated over j items. User side: variation in user topics, topics for user (u), repeated over i users. Both feed the user rating of item.]
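
As a rough sketch of the idea behind a matrix factorization recommender: give every user a short vector u and every item a short vector v, and fit them so that the dot product of u and v approximates the known ratings. The stochastic gradient descent training loop, the hyperparameters, and the toy data below are assumptions for illustration; the slides do not specify a fitting method.

```python
import random

def factorize(observed, n_users, n_items, k=2, epochs=500, lr=0.01, reg=0.02, seed=0):
    """Fit user and item factor vectors to observed (user, item, rating) triples."""
    rng = random.Random(seed)
    users = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    items = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, j, r in observed:
            err = r - predict(users, items, u, j)
            for f in range(k):
                uf, vf = users[u][f], items[j][f]
                # Gradient step on squared error with L2 regularization.
                users[u][f] += lr * (err * vf - reg * uf)
                items[j][f] += lr * (err * uf - reg * vf)
    return users, items

def predict(users, items, u, j):
    """Predicted rating: dot product of the user and item vectors."""
    return sum(a * b for a, b in zip(users[u], items[j]))

def rmse(observed, users, items):
    """Root mean squared error on the observed ratings."""
    total = sum((r - predict(users, items, u, j)) ** 2 for u, j, r in observed)
    return (total / len(observed)) ** 0.5

# Hypothetical ratings for 4 users and 3 items; unlisted pairs are unknown.
observed = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
            (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
users, items = factorize(observed, n_users=4, n_items=3)
print(rmse(observed, users, items))
print(predict(users, items, 0, 2))  # a guess for an unknown entry
```

The last line fills in a cell that was never observed, which is the filtering problem stated earlier.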

Combining collaborative filtering and topic modeling

Content modeling - LDA

[Figure: LDA plate diagram. Nodes: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter. Plates: N words in doc, D docs, K topics.]

Collaborative Topic Modeling

[Figure: plate diagram. Content part: topic concentration, topics in doc, topic for word, word in doc, K topics. Collaborative part: weight of user selections, topics in doc, variation in per-user topics, topics for user, user rating of doc.]

[Figure: panels labeled "content only" and "content + social".]

Different Filtering Systems


Content:
Newsblaster analyzes the topics in the documents.
No concept of users.
Social:
What I see on Twitter determined by who I follow.
Reddit comments filtered by votes as input.
Amazon "people who bought X also bought Y"
No content analysis.
Hybrid:
Recommend based on both content and user behavior.

Item Content

My Data

Other Users' Data

what I've read/liked


Text analysis,
topic modeling,
clustering...

who I follow

social network
structure,
other users' likes

How to evaluate/optimize?
Netflix: try to predict the rating that the user gives a
movie after watching it.
Amazon: sell more stuff.
Google web search: human raters, A/B test every change

How to evaluate/optimize?
Does the user understand how the filter works?
Can they configure it as desired?
Can they correctly predict what they will and won't
see?

How to evaluate/optimize?
Can it be gamed? Spam, "user-generated
censorship," etc.

"During the 2012 election, the ~2000 members of an anti-Ron Paul subreddit discovered that anything they posted, anywhere on reddit, was being rapidly, repeatedly downvoted. They created a diagnostic subreddit and began posting otherwise meaningless text to verify this otherwise odd behavior."

Filter design problem


Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define

r(S,U,{P},{B}) in [0...1]

relevance of story S to user U

Filter design problem, restated


When should a user see a story?
Aspects to this question:
normative
  personal: what I want
  societal: emergent group effects
UI
  how do I tell the computer what I want?
technical
  constrained by algorithmic possibility
economic
  cheap enough to deploy widely

How to evaluate/optimize?
Does it improve the user's life?
