
CROWDSOURCING IN HCI

BY AKKAMAHADEVI HANNI
USER STUDIES

 Obtaining input from users is important in HCI
 Common methods include:
   surveys
   rapid prototyping
   usability tests
   cognitive walkthroughs
   performance measures
   quantitative ratings
ONLINE SOLUTIONS

 Online user surveys
 Remote usability testing
 Online experiments
 But difficulties remain:
   Rely on the practitioner to recruit participants
   Limited pool of participants
WHAT IS CROWDSOURCING?

 Make tasks available for anyone online to complete
 Quickly access a large user pool, collect data, and compensate users
 Coordinate the crowd (over the internet) to complete micro-tasks that solve problems computers cannot solve accurately on their own
AMAZONʼS MECHANICAL TURK

 A marketplace for “human intelligence tasks” (HITs)
 Typically short, objective tasks
   Tag an image
   Find a webpage
   Evaluate relevance of search results
 Workers complete tasks for a few pennies each
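To make this concrete, here is a minimal sketch (not from the slides) of how a requester might post such a micro-task programmatically through Amazon's boto3 Mechanical Turk client. The survey URL, reward, and timing values are illustrative assumptions, and the sandbox endpoint is used so nothing is posted to the live marketplace.

```python
# Minimal sketch of posting a short, objective task ("HIT") to Mechanical Turk.
# AWS credentials are assumed to be configured in the environment.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    # Sandbox endpoint for testing; remove this line to post to the live marketplace.
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# The task itself lives on an external page (hypothetical URL) shown in a frame.
external_question = """
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/rating-form</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

hit = mturk.create_hit(
    Title="Rate the quality of a Wikipedia article",
    Description="Answer a few short questions about one article (about 2 minutes).",
    Keywords="survey, rating, wikipedia",
    Reward="0.05",                      # USD per assignment: a few pennies each
    MaxAssignments=15,                  # number of workers per article
    LifetimeInSeconds=2 * 24 * 3600,    # keep the HIT available for two days
    AssignmentDurationInSeconds=600,    # each worker gets 10 minutes
    Question=external_question,
)
print("HITId:", hit["HIT"]["HITId"])
```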
USING MECHANICAL TURK FOR USER STUDIES

                      Traditional user studies        Mechanical Turk
Task complexity       Complex, long                   Simple, short
Task subjectivity     Subjective, opinions            Objective, verifiable
User information      Targeted demographics,          Unknown demographics,
                      high interactivity              limited interactivity

 Can Mechanical Turk be put to good use for user studies?


OTHER EXAMPLES
 Wikipedia: collaborative knowledge
 Project Tiramisu: real-time prediction for transit systems
 Project reCAPTCHA: digitizing newspapers
 App testing: test applications


HOW HCI RESEARCHERS CAN LEVERAGE CROWDS
 Conducting online surveys: provides a wonderful recruiting tool for surveys and questionnaires
 Conducting experiments: provides a cheap and quick way to recruit participants for user studies or experiments
 Training machine learning algorithms: the ML algorithm can “learn” the structural patterns that map content across different designs
 Analyzing text or images: in the ESP Game, online participants “labeled” images as a secondary effect of playing a game
 Gathering subjective judgments: used to conduct an experiment on the design process (judging serial versus parallel design)
CONSIDERATIONS FOR CROWDSOURCING?

• Are the tasks well suited for crowdsourcing?


• If it is a user study, what are the tradeoffs between having participants
perform the task online versus in a laboratory?
• How much should crowd workers earn for the task?
• How can researchers ensure good results from crowdsourcing?
WHEN IS CROWDSOURCING APPROPRIATE?

• Consider task complexity, task subjectivity, and the information that can be collected through crowdsourcing.
• List the questions you hope to answer.
• Identify the data needed to answer those questions.
• Finally, decide whether crowdsourcing can be used reliably for the target demographic.
WHAT ARE THE TRADEOFFS OF CROWDSOURCING?

• A trade-off of performing unsupervised tasks online:
• In a laboratory or field experiment, supervision gives subjects extra motivation to provide quality results; unsupervised online workers may feel free to cheat.
• The unavailability of qualitative observations:
• There is little way of gathering observational data on the steps the user took while submitting a response.
• Low cost:
• In traditional settings, pilot experiments are usually run on only a handful of participants because of the time and cost involved, which drastically reduces the opportunity to identify and correct potential pitfalls; crowdsourcing's low cost removes much of that barrier.
OTHER CONSIDERATIONS
• Who Are the Crowd Workers?
• Based on demographics
• How Much Should Crowd-workers Be Paid?
• Determine payment amounts based on the target demographics
• Budget control
• How to Ensure Quality Work?
• The easiest way to increase work quality is to prevent workers with bad reputations from participating (see the sketch after this list).
• A researcher may design the most straightforward task but still get a significant number of fraudulent
responses.
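A sketch of that "reputation" lever, not taken from the slides: Mechanical Turk lets a requester attach qualification requirements so that only workers above a chosen approval rate can see and accept the HIT. The 95% threshold below is illustrative; the qualification ID is MTurk's built-in approval-rate qualification.

```python
# Only workers whose lifetime approval rate is at least 95% may discover,
# preview, and accept the HIT. Threshold is an illustrative choice.
approval_rate_requirement = {
    "QualificationTypeId": "000000000000000000L0",   # Worker_PercentAssignmentsApproved
    "Comparator": "GreaterThanOrEqualTo",
    "IntegerValues": [95],
    "ActionsGuarded": "DiscoverPreviewAndAccept",    # hide the HIT from everyone else
}

# Passed to boto3's create_hit alongside the usual Title/Reward/Question arguments:
#   mturk.create_hit(..., QualificationRequirements=[approval_rate_requirement])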
CHALLENGES

• COST: the crowd is not free → reduce monetary cost
• QUALITY: the crowd may return incorrect answers → improve quality
• LATENCY: the crowd is not real-time → reduce turnaround time
TASK

 Assess quality of Wikipedia articles


 Started with ratings from expert Wikipedians
 14 articles (e.g., “Germany”, “Noam Chomsky”)
 7-point scale
 Can we get matching ratings with Mechanical Turk?
EXPERIMENT 1

 Rate articles on 7-point scales:


 Well written
 Factually accurate
 Overall quality
 Free-text input:
   What improvements does the article need?
 Paid $0.05 each
EXPERIMENT 1: GOOD NEWS

 58 users made 210 ratings (15 per article)


 – $10.50 total
 Fast results
 44% within a day, 100% within two days
 Many completed within minutes
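As a quick sanity check on these numbers (and on what workers effectively earned), the arithmetic below uses only figures reported in the slides: the $0.05 reward, 15 ratings per article over 14 articles, and the ~1'30" median time on task reported later in the case-study table.

```python
# Worked arithmetic for Experiment 1 using the figures reported in the slides.
articles = 14
ratings_per_article = 15
reward = 0.05                 # USD per rating
median_seconds = 90           # median time on task was about 1'30"

total_ratings = articles * ratings_per_article     # 210
total_cost = total_ratings * reward                # 10.50
hourly_rate = reward / median_seconds * 3600       # ~2.00 USD per hour

print(f"{total_ratings} ratings, ${total_cost:.2f} total, ~${hourly_rate:.2f}/hour")
```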
EXPERIMENT 1: BAD NEWS

 Correlation between Turkers and Wikipedians only marginally significant (r = .50, p = .07)
 Worse, 59% potentially invalid responses

                        Experiment 1
 Invalid comments       49%
 < 1 min responses      31%

 Nearly 75% of these came from only 8 users
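A minimal sketch (not from the paper) of the kind of screening that produces figures like the 49% and 31% above: flag responses that finished implausibly fast or whose free-text comment is too short to be a good-faith answer. The field names and thresholds are hypothetical.

```python
# Flag responses that look like gaming: completed in under a minute, or with a
# free-text comment too short to be a genuine improvement suggestion.
def flag_suspect(responses, min_seconds=60, min_comment_words=3):
    suspect = []
    for r in responses:
        too_fast = r["seconds_on_task"] < min_seconds
        empty_comment = len(r["comment"].split()) < min_comment_words
        if too_fast or empty_comment:
            suspect.append(r["worker_id"])
    return suspect

responses = [
    {"worker_id": "W1", "seconds_on_task": 42, "comment": "good"},
    {"worker_id": "W2", "seconds_on_task": 190,
     "comment": "Needs more references and clearer section structure."},
]
print(flag_suspect(responses))  # ['W1']
```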


NOT A GOOD START

 Summary of Experiment 1:
 Only marginal correlation with experts.
 Heavy gaming of the system by a minority
 Possible Response:
 Can make sure these gamers are not rewarded
 Ban them from doing your HITs in the future
 Create a reputation system [Delores Lab]
 Can we change how we collect user input?
DESIGN CHANGES

 Use verifiable questions to signal monitoring
   “How many sections does the article have?”
   “How many images does the article have?”
   “How many references does the article have?”
 Make malicious answers as high cost as good-faith answers
   “Provide 4-6 keywords that would give someone a good summary of the contents of the article”
 Put verifiable tasks before subjective responses
   First do objective tasks and summarization
   Only then evaluate subjective quality
   Ecological validity?
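One appealing property of the verifiable questions above is that they can double as an automatic screen. The sketch below is an assumption about how that could work, not the paper's implementation: compare a worker's reported counts against counts extracted from the article itself and discard answers that are far off. Field names and tolerance are illustrative.

```python
# Check a worker's answers to the verifiable questions against ground truth
# extracted from the article (e.g., by parsing its wikitext).
def passes_verification(answer, ground_truth, tolerance=1):
    checks = ["sections", "images", "references"]
    return all(abs(answer[k] - ground_truth[k]) <= tolerance for k in checks)

ground_truth = {"sections": 12, "images": 9, "references": 48}
answer = {"sections": 11, "images": 9, "references": 48, "quality_rating": 5}
print(passes_verification(answer, ground_truth))  # True: keep the subjective rating
```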
DESIGN CHANGES

 Use verifiable questions to signal monitoring


 Make malicious answers as high cost as good-
faith answers
 Make verifiable answers useful for completing task
 Used tasks similar to how Wikipedians
described evaluating quality (organization,
presentation, references)
DESIGN CHANGES

 Use verifiable questions to signal monitoring


 Make malicious answers as high cost as good-
faith answers
 Make verifiable answers useful for completing task
 Put verifiable tasks before subjective
responses
 First do objective tasks and summarization
 Only then evaluate subjective quality
 Ecological validity?
CASE STUDY: ASSESS WIKIPEDIA QUALITY
Use the crowd to assess the quality of Wikipedia articles
 Exp 1: rate directly
 Exp 2: rate with verification questions

                   Experiment 1    Experiment 2
 Responses         210             277
 Invalid           49%             3%
 < 1 min           31%             7%
 Median time       1'30"           4'06"

 124 users provided 277 ratings (~20 per article)
 Significant positive correlation with Wikipedians (r = .66, p = .01)
 Smaller proportion of malicious responses
 Increased time on task

Kittur, Aniket, Ed H. Chi, and Bongwon Suh. "Crowdsourcing user studies with Mechanical Turk." Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2008.
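For completeness, a sketch of the headline analysis: correlate the mean crowd rating per article with the expert (Wikipedian) rating. The rating values below are invented placeholders covering only three articles; the actual study reports r = .66, p = .01 over its 14 articles.

```python
# Correlate per-article mean crowd ratings with expert ratings (7-point scales).
from statistics import mean
from scipy.stats import pearsonr

turker_ratings = {                      # article -> crowd ratings (placeholder data)
    "Germany": [6, 5, 6, 7, 5],
    "Noam Chomsky": [4, 5, 5, 4, 6],
    "Example article": [3, 2, 4, 3, 3],
}
expert_rating = {"Germany": 6, "Noam Chomsky": 5, "Example article": 3}

articles = list(expert_rating)
crowd_means = [mean(turker_ratings[a]) for a in articles]
experts = [expert_rating[a] for a in articles]

r, p = pearsonr(crowd_means, experts)
print(f"r = {r:.2f}, p = {p:.3f}")
```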
CASE STUDY: RECRUITMENT

• Personal decisions concerning privacy

• Traditional :
➢ 2 weeks, 100 - 200 participants
➢ Students and staff

• MTurk:
➢ 2 days, 350 responses, $0.25 per participant
➢ 95% white-collar workers

Dow, Steven, et al. "Shepherding the crowd yields better work." Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work. ACM, 2012.
LIMITATIONS OF MECHANICAL TURK

 No control of usersʼ environment


 Potential for different browsers,
physical distractions
 General problem with online experimentation
 Not designed for user studies
 Difficult to do between-subjects design
 Involves some programming
 Users
 Uncertainty about user demographics,
expertise
QUICK SUMMARY

 Mechanical Turk offers the practitioner a way to access a


large user pool and quickly collect data at low cost
 Good results require careful task design
 Use verifiable questions to signal monitoring
 Make malicious answers as high cost as good-faith
answers
 Make verifiable answers useful for completing task
 Put verifiable tasks before subjective responses
