
CROWDSOURCING IN HCI

BY AKKAMAHADEVI HANNI
USER STUDIES

 Obtaining input from users is important in HCI
 Common methods include:
   surveys
   rapid prototyping
   usability tests
   cognitive walkthroughs
   performance measures
   quantitative ratings
ONLINE SOLUTIONS

 Online user surveys
 Remote usability testing
 Online experiments
 But difficulties remain:
   Rely on the practitioner to recruit participants
   Limited pool of participants
WHAT IS CROWDSOURCING?

 Make tasks available for anyone online to complete
 Quickly access a large user pool, collect data, and compensate users
 Coordinate the crowd (over the internet) to complete micro-tasks that solve problems computers cannot solve accurately on their own
AMAZONʼS MECHANICAL TURK

 A marketplace for “human intelligence tasks” (HITs)
 Typically short, objective tasks
   Tag an image
   Find a webpage
   Evaluate relevance of search results
 Workers complete tasks for a few pennies each
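To make this concrete, here is a minimal sketch (not from the slides) of how a requester might post such a micro-task programmatically through Amazon's boto3 Mechanical Turk client. The survey URL, reward, and timing values are illustrative assumptions, and the sandbox endpoint is used so nothing is posted to the live marketplace.

```python
# Minimal sketch of posting a short, objective task ("HIT") to Mechanical Turk.
# AWS credentials are assumed to be configured in the environment.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint for testing; remove this line to post to the live marketplace.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# The task itself lives on an external page (hypothetical URL) shown in a frame.
external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/rating-form</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title="Rate the quality of a Wikipedia article",
    Description="Answer a few short questions about one article (about 2 minutes).",
    Keywords="survey, rating, wikipedia",
    Reward="0.05",                      # USD per assignment: a few pennies each
    MaxAssignments=15,                  # number of workers per article
    LifetimeInSeconds=2 * 24 * 3600,    # keep the HIT available for two days
    AssignmentDurationInSeconds=600,    # each worker gets 10 minutes
    Question=external_question,
)
print("HITId:", hit["HIT"]["HITId"])
```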
USING MECHANICAL TURK FOR USER STUDIES

                      Traditional user studies        Mechanical Turk
Task complexity       Complex, long                   Simple, short
Task subjectivity     Subjective, opinions            Objective, verifiable
User information      Targeted demographics,          Unknown demographics,
                      high interactivity              limited interactivity

 Can Mechanical Turk be put to good use for user studies?


OTHER EXAMPLES
 Wikipedia: collaborative knowledge
 Project Tiramisu: real-time prediction for transit systems
 Project reCAPTCHA: digitizing newspapers
 App testing: test applications


HOW HCI RESEARCHERS CAN LEVERAGE CROWDS
 Conducting online surveys: provides a wonderful recruiting tool for surveys and questionnaires
 Conducting experiments: provides a cheap and quick way to recruit participants for user studies or experiments
 Training machine learning algorithms: the ML algorithm can “learn” the structural patterns that map content across different designs
 Analyzing text or images: in the ESP Game, online participants “labeled” images as a secondary effect of playing a game
 Gathering subjective judgments: used to conduct an experiment on the design process (judging serial versus parallel design)
CONSIDERATIONS FOR CROWDSOURCING?

• Are the tasks well suited for crowdsourcing?


• If it is a user study, what are the tradeoffs between having participants
perform the task online versus in a laboratory?
• How much should crowd workers earn for the task?
• How can researchers ensure good results from crowdsourcing?
WHEN IS CROWDSOURCING APPROPRIATE?

• Consider task complexity, task subjectivity, and the information that can be collected through crowdsourcing.
• List the questions you hope to answer.
• Identify the data needed to answer those questions.
• Finally, decide whether crowdsourcing can be used reliably for the target demographic.
WHAT ARE THE TRADEOFFS OF CROWDSOURCING?

• A trade-off of performing unsupervised tasks online:
• In a laboratory or field experiment, supervision gives subjects extra motivation to provide quality results; unsupervised online workers may feel free to cheat.
• The unavailability of qualitative observations:
• There is little way of gathering observational data on the steps the user took while submitting a response.
• Low cost:
• In traditional settings, pilot experiments are usually run on only a handful of participants because of the time and cost involved, which drastically reduces the opportunity to identify and correct potential pitfalls; crowdsourcing's low cost removes much of that barrier.
OTHER CONSIDERATIONS
• Who Are the Crowd Workers?
• Based on demographics
• How Much Should Crowd-workers Be Paid?
• Determine payment amounts based on the target demographics
• Budget control
• How to Ensure Quality Work?
• The easiest way to increase work quality is to prevent workers with bad reputations from participating (see the sketch after this list).
• A researcher may design the most straightforward task but still get a significant number of fraudulent
responses.
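A sketch of that "reputation" lever, not taken from the slides: Mechanical Turk lets a requester attach qualification requirements so that only workers above a chosen approval rate can see and accept the HIT. The 95% threshold below is illustrative; the qualification ID is MTurk's built-in approval-rate qualification.

```python
# Only workers whose lifetime approval rate is at least 95% may discover,
# preview, and accept the HIT. Threshold is an illustrative choice.
approval_rate_requirement = {
    "QualificationTypeId": "000000000000000000L0",   # Worker_PercentAssignmentsApproved
    "Comparator": "GreaterThanOrEqualTo",
    "IntegerValues": [95],
    "ActionsGuarded": "DiscoverPreviewAndAccept",    # hide the HIT from everyone else
}

# Passed to boto3's create_hit alongside the usual Title/Reward/Question arguments:
#   mturk.create_hit(..., QualificationRequirements=[approval_rate_requirement])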
CHALLENGES

• COST: the crowd is not free → reduce monetary cost
• QUALITY: the crowd may return incorrect answers → improve quality
• LATENCY: the crowd is not real-time → reduce turnaround time
TASK

 Assess quality of Wikipedia articles


 Started with ratings from expert Wikipedians
 14 articles (e.g., “Germany”, “Noam Chomsky”)
 7-point scale
 Can we get matching ratings with Mechanical Turk?
EXPERIMENT 1

 Rate articles on 7-point scales:


 Well written
 Factually accurate
 Overall quality
 Free-text input:
   What improvements does the article need?
 Paid $0.05 each
EXPERIMENT 1: GOOD NEWS

 58 users made 210 ratings (15 per article)


 – $10.50 total
 Fast results
 44% within a day, 100% within two days
 Many completed within minutes
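As a quick sanity check on these numbers (and on what workers effectively earned), the arithmetic below uses only figures reported in the slides: the $0.05 reward, 15 ratings per article over 14 articles, and the ~1'30" median time on task reported later in the case-study table.

```python
# Worked arithmetic for Experiment 1 using the figures reported in the slides.
articles = 14
ratings_per_article = 15
reward = 0.05                 # USD per rating
median_seconds = 90           # median time on task was about 1'30"

total_ratings = articles * ratings_per_article     # 210
total_cost = total_ratings * reward                # 10.50
hourly_rate = reward / median_seconds * 3600       # ~2.00 USD per hour

print(f"{total_ratings} ratings, ${total_cost:.2f} total, ~${hourly_rate:.2f}/hour")
```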
EXPERIMENT 1: BAD NEWS

 Correlation between Turkers and Wikipedians only marginally significant (r = .50, p = .07)
 Worse, 59% potentially invalid responses

                        Experiment 1
 Invalid comments       49%
 < 1 min responses      31%

 Nearly 75% of these came from only 8 users
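A minimal sketch (not from the paper) of the kind of screening that produces figures like the 49% and 31% above: flag responses that finished implausibly fast or whose free-text comment is too short to be a good-faith answer. The field names and thresholds are hypothetical.

```python
# Flag responses that look like gaming: completed in under a minute, or with a
# free-text comment too short to be a genuine improvement suggestion.
def flag_suspect(responses, min_seconds=60, min_comment_words=3):
    suspect = []
    for r in responses:
        too_fast = r["seconds_on_task"] < min_seconds
        empty_comment = len(r["comment"].split()) < min_comment_words
        if too_fast or empty_comment:
            suspect.append(r["worker_id"])
    return suspect

responses = [
    {"worker_id": "W1", "seconds_on_task": 42, "comment": "good"},
    {"worker_id": "W2", "seconds_on_task": 190,
     "comment": "Needs more references and clearer section structure."},
]
print(flag_suspect(responses))  # ['W1']
```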


NOT A GOOD START

 Summary of Experiment 1:
 Only marginal correlation with experts.
 Heavy gaming of the system by a minority
 Possible Response:
 Can make sure these gamers are not rewarded
 Ban them from doing your HITs in the future
 Create a reputation system [Delores Lab]
 Can we change how we collect user input?
DESIGN CHANGES

 Use verifiable questions to signal monitoring
   “How many sections does the article have?”
   “How many images does the article have?”
   “How many references does the article have?”
 Make malicious answers as high cost as good-faith answers
   “Provide 4-6 keywords that would give someone a good summary of the contents of the article”
 Put verifiable tasks before subjective responses
   First do objective tasks and summarization
   Only then evaluate subjective quality
   Ecological validity?
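One appealing property of the verifiable questions above is that they can double as an automatic screen. The sketch below is an assumption about how that could work, not the paper's implementation: compare a worker's reported counts against counts extracted from the article itself and discard answers that are far off. Field names and tolerance are illustrative.

```python
# Check a worker's answers to the verifiable questions against ground truth
# extracted from the article (e.g., by parsing its wikitext).
def passes_verification(answer, ground_truth, tolerance=1):
    checks = ["sections", "images", "references"]
    return all(abs(answer[k] - ground_truth[k]) <= tolerance for k in checks)

ground_truth = {"sections": 12, "images": 9, "references": 48}
answer = {"sections": 11, "images": 9, "references": 48, "quality_rating": 5}
print(passes_verification(answer, ground_truth))  # True: keep the subjective rating
```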
DESIGN CHANGES

 Use verifiable questions to signal monitoring


 Make malicious answers as high cost as good-
faith answers
 Make verifiable answers useful for completing task
 Used tasks similar to how Wikipedians
described evaluating quality (organization,
presentation, references)
DESIGN CHANGES

 Use verifiable questions to signal monitoring


 Make malicious answers as high cost as good-
faith answers
 Make verifiable answers useful for completing task
 Put verifiable tasks before subjective
responses
 First do objective tasks and summarization
 Only then evaluate subjective quality
 Ecological validity?
CASE STUDY: ASSESS WIKIPEDIA QUALITY
Use the crowd to assess the quality of Wikipedia articles
 Exp 1: rate directly
 Exp 2: rate with verification questions

                   Experiment 1    Experiment 2
 Responses         210             277
 Invalid           49%             3%
 < 1 min           31%             7%
 Median time       1'30"           4'06"

 124 users provided 277 ratings (~20 per article)
 Significant positive correlation with Wikipedians (r = .66, p = .01)
 Smaller proportion of malicious responses
 Increased time on task

Kittur, Aniket, Ed H. Chi, and Bongwon Suh. "Crowdsourcing user studies with Mechanical Turk." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2008.
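For completeness, a sketch of the headline analysis: correlate the mean crowd rating per article with the expert (Wikipedian) rating. The rating values below are invented placeholders covering only three articles; the actual study reports r = .66, p = .01 over its 14 articles.

```python
# Correlate per-article mean crowd ratings with expert ratings (7-point scales).
from statistics import mean
from scipy.stats import pearsonr

turker_ratings = {                      # article -> crowd ratings (placeholder data)
    "Germany": [6, 5, 6, 7, 5],
    "Noam Chomsky": [4, 5, 5, 4, 6],
    "Example article": [3, 2, 4, 3, 3],
}
expert_rating = {"Germany": 6, "Noam Chomsky": 5, "Example article": 3}

articles = list(expert_rating)
crowd_means = [mean(turker_ratings[a]) for a in articles]
experts = [expert_rating[a] for a in articles]

r, p = pearsonr(crowd_means, experts)
print(f"r = {r:.2f}, p = {p:.3f}")
```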
CASE STUDY: RECRUITMENT

• Personal decisions concerning privacy

• Traditional :
➢ 2 weeks, 100 - 200 participants
➢ Students and staff

• MTurk:
➢ 2 days, 350 responses, $0.25 per participant
➢ 95% white-collar workers

Dow, Steven, et al. "Shepherding the crowd yields better work." Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work. ACM, 2012.
LIMITATIONS OF MECHANICAL TURK

 No control of usersʼ environment


 Potential for different browsers,
physical distractions
 General problem with online experimentation
 Not designed for user studies
 Difficult to do between-subjects design
 Involves some programming
 Users
 Uncertainty about user demographics,
expertise
QUICK SUMMARY

 Mechanical Turk offers the practitioner a way to access a


large user pool and quickly collect data at low cost
 Good results require careful task design
 Use verifiable questions to signal monitoring
 Make malicious answers as high cost as good-faith
answers
 Make verifiable answers useful for completing task
 Put verifiable tasks before subjective responses
