Matthias C. Sala, Kurt Partridge, Linda Jacobson, and James “Bo” Begole
1 Introduction
2 Related Work
Much attention has recently been directed toward location-based services,
including location-based advertisement delivery. For example, Aalto et al. [1]
use Bluetooth beacons to determine when to push SMS advertising messages.
Location-targeted advertising is related to our goal because location may
determine a person’s activity. Consider the different activities that happen
at work, at school, at home, in a restaurant, or at church.
However, location is only one of many indicators of user intention [20], and
there are reasons to believe that targeting activity could be more effective.
Some locations are correlated with a certain set of activities, but
location-targeted advertising can only target the set as a whole, not the
immediately occurring activity. Also, some activities can be done anywhere,
and recent improvements in communication and mobile technologies have further
decoupled activity from location.

¹ The authors are not affiliated with Mechanical Turk or Amazon.com.
A few other systems have suggested extending the types of context beyond
location to other sensed data [3,22]. However, we are not aware of any that have
specifically examined the relationship between advertising and a higher-level
model of activity.
Various research groups have investigated approaches for automatically
inferring activity by parsing calendar entries [17], collecting data from in-
frastructure sensors (e.g., cameras and microphones) [15,21,24], and using
wearable sensors [12,14]. But to our knowledge, no activity-sensing research has
investigated advertising applications.
We did not inform participants of the mechanism used to find the
advertisements.
The study was performed in two phases. The first phase compared conditions
1 and 2a. The second phase compared conditions 1 and 2b. Each phase had a
different group of participants.
4.1 Participants
We recruited 19 participants from our in-house research staff: 17 male and 2
female, aged 21 to 55. Previous experience with mobile devices was not
required.

To assess their exposure to advertising, we administered a pre-experiment
questionnaire. It showed that, on average, participants watch 0.8 hours of TV,
listen to almost one hour of radio, and spend 2.7 hours online each day. No
participant categorically opposed advertising, and many said that they “like
good ads.”
As an incentive, we gave participants a small gift (T-shirt, tote bag, etc.)
and allowed them general use (web browser, email, etc.) of the device on which
the study software ran.
Of the 19 participants, 13 participated in Phase 1, and 6 participated in
Phase 2. We would have liked to include more participants in Phase 2, but were
not able to do so because of resource constraints.
It is worth noting that our recruiting procedures may have affected the results.
Because our recruitment notice indicated that the study involved exposure to
advertisements, people who highly disliked advertising might have chosen not to
participate.
answer each time. To facilitate quick activity-list searching, and to avoid biasing
responses that favored the top entries, we presented the list in alphabetical order,
starting at a random point, and wrapping around at the end of the alphabet back
to the beginning.
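
A minimal sketch of this list rotation (in Python, with invented names; the
paper does not describe its implementation):

    import random

    def rotated_activity_list(activities):
        """Alphabetize the activity list, then start it at a random entry
        and wrap around, so no entry is systematically favored by always
        appearing at the top."""
        ordered = sorted(activities, key=str.lower)
        start = random.randrange(len(ordered))      # random starting point
        return ordered[start:] + ordered[:start]    # wrap to the beginning

    # Each survey alert would present a freshly rotated list:
    print(rotated_activity_list(["eating", "working", "shopping", "exercising"]))
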
Question a asked the participant’s location. While our primary purpose was
to determine whether activity could lead to more relevant advertising, we also
evaluated how location affected self-reported activity, and whether location
affected the relevance and usefulness of the advertisement.
Question b asked the participant to identify their activity at the time of the
alert. Questions c and d were variations on this question. Question c asked
what activity the participant had planned to perform at the current time. This
question was designed to collect the kind of information found in a typical
calendar entry; we wanted to investigate whether targeting the planned
activity was more effective than targeting the current one. Question d
measured what the person would have preferred to be doing at the time of the
survey alert. Here again, we speculated that advertising might be more
effective if it could target something other than the user’s exact activity,
such as what the user desired to be doing.
These first four questions allowed considerable freedom in the participant’s
response. As a result, participants interpreted and responded to the questions in
different ways. For example, in response to the query about current activity, some
participants answered in the abstract (such as “working”) while others provided
specific detail (e.g., “reading Da Vinci code”). This variety made it difficult to
taxonomize the activities. However, it also made the descriptions more likely to
reflect accurately how individuals conceptualize their activities. We felt that it
was important to collect unbiased data, even if it was more difficult to analyze.
Following these questions, the advertisement was shown. The advertisement
was fetched from online sources, using one of the methods described in
Section 4.5.
Questions e–i measured the effectiveness of the displayed advertisement.
Question e caused the advertisement to be emailed to the participant, and was
therefore an implicit interest indicator [5].
Questions f and g were the primary metrics. “Relevance” measured how well the
participant judged the advertisement to match their current activity, that is,
how well it appeared to be activity-targeted. “Usefulness” was the
participant’s assessment of how helpful the advertisement was. Although the
ideal advertisement would be both relevant and useful, an advertisement could
vary in either dimension.
Finally, questions h and i investigated the timing of the advertisement. In
pilot studies we noticed that some participants assigned a potentially useful
advertisement a low score if it was not welcome at the immediate time of the
survey. These two questions helped us identify such situations.
Initially, with one-third probability each, the system chose a random word
from Ogden’s Basic English Word List of the 850 most common English words, the
current activity description, or the planned activity description. For the
activity descriptions, stop-word removal and word stemming were performed to
increase the chance of an advertisement being associated with the words. The
terms were passed to a search engine. If advertisements were found in the
search results, one was selected at random. If none was found, the system fell
back to random selection; if random selection also failed, a new random word
was selected. To maintain the correct proportion of the experimental
conditions, the system tracked the number of randomly selected versus
activity-targeted advertisements and adjusted the probabilities for each
condition.
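
A minimal sketch of this selection procedure (Python; the search interface,
word lists, and helper names are invented, and the real system’s stemmer,
stop-word list, and proportion-tracking logic are not specified in the text):

    import random

    STOP_WORDS = {"the", "a", "an", "to", "of", "my", "on", "in", "at", "up"}
    BASIC_ENGLISH = ["food", "drink", "work", "play", "house", "road",
                     "music", "paper", "sleep", "walk"]  # excerpt of Ogden's 850

    def stem(word):
        # Crude stand-in for a real stemmer (e.g., Porter's algorithm).
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def extract_terms(description):
        """Stop-word removal and stemming, as described above."""
        return [stem(w) for w in description.lower().split()
                if w not in STOP_WORDS]

    def select_advertisement(current_activity, planned_activity, search):
        """`search(terms)` stands in for the search-engine query; it returns
        a (possibly empty) list of candidate advertisements."""
        condition = random.choice(["random", "current", "planned"])  # 1/3 each
        if condition == "current":
            ads = search(extract_terms(current_activity))
        elif condition == "planned":
            ads = search(extract_terms(planned_activity))
        else:
            ads = []
        if ads:
            return random.choice(ads)  # one matching advertisement at random
        # Fallback: draw random basic words until a search succeeds.
        # (The real system also tracked the random-versus-targeted ratio and
        # adjusted the one-third probabilities; that bookkeeping is omitted.)
        while True:
            ads = search([random.choice(BASIC_ENGLISH)])
            if ads:
                return random.choice(ads)
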
During Phase 1, we observed the following causes of poor matches using simple
keyword matching.
All these aspects are difficult for an automated system to infer, but trivial for
a human to understand. This observation led to the approach used in Phase 2.
Activity                              Answers
drinking wine                         dark chocolate; eat sharp cheese
eating lunch                          watching television news; chewing gum;
                                      brush my teeth
catching up on some school materials  using computer; drinking Pepsi-Cola;
                                      watching movie; eating sports bar;
                                      book bag
In pilot tests we observed that some of the Mechanical Turk workers’ responses
did not lead to useful advertisements, for the reasons explored in Section 5.6.
Mechanical Turk supports oversight by allowing the HIT (Human Intelligence
Task) designer to pay workers only for appropriate responses. However, we felt
that this mechanism was inappropriate for our task, because a quick response
was necessary, and because having an expert approve all Mechanical Turk
responses would not scale well.
Instead we designed an approach that used a second round of Mechanical
Turk tasks to perform oversight. Our implementation worked as follows. Five
workers generated suggestions using the HIT shown in Fig. 2. Our system then
created a second HIT for ranking the suggestions, shown in Fig. 3. Ten workers
completed this HIT for 7 cents each. The suggestions were ranked by counting
the number of times they were selected. The activity suggestion with the largest
number of votes was chosen as the term to use to find the advertisement. If there
were several equally ranked suggestions, one was chosen randomly.
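
The final step is a plurality vote with a random tie-break. A minimal sketch
(Python; the suggestion strings are illustrative, not actual worker
responses):

    import random
    from collections import Counter

    def winning_suggestion(selections):
        """Given the suggestion each ranking worker selected, return the one
        with the most votes, breaking ties at random."""
        tally = Counter(selections)
        top = max(tally.values())
        tied = [s for s, count in tally.items() if count == top]
        return random.choice(tied)

    # Ten ranking workers each selected one of the generated suggestions:
    selections = ["drink coffee", "buy a snack", "drink coffee", "read a book",
                  "drink coffee", "buy a snack", "drink coffee", "read a book",
                  "buy a snack", "drink coffee"]
    print(winning_suggestion(selections))  # "drink coffee" (5 votes)
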
5 Results
Participants filled out a total of 310 surveys. On average, each filled out 16
(standard deviation 9.4), giving a response rate of about 35%. This rate is
low compared to similar studies [7]. We found several reasons for this during
the post-study interviews: participants who carried the device during weekdays
had a low response rate at work; a few forgot to recharge the device
overnight; and others struggled with instability in the operating system and
other installed programs.
Fig. 5 shows the distributions of ratings. The left plots show keyword
targeting; the right plots show Mechanical Turk targeting. The upper plots
show relevance; the lower plots show usefulness. Within each of the four
panes, the upper plot shows the random baseline and the lower plot shows the
results with targeting. Each light-colored point represents the averaged score
of one user, and the points are jittered in the vertical direction for better
visibility. The dark point indicates the median value, and the gray box marks
the middle 50% of the values (the interquartile range).
One-tailed t-tests comparing the means of these distributions show a
statistically significant improvement only in the relevance score, between the
random and keyword conditions (p < 0.05).
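
For reference, such a comparison can be run as follows (Python with SciPy; the
scores are invented, and because the text does not say whether the analysis
was paired or unpaired, this sketch uses an independent-samples test):

    import numpy as np
    from scipy import stats

    # Per-user mean relevance scores (illustrative numbers, not study data).
    random_scores   = np.array([1.2, 0.8, 2.1, 1.5, 0.9, 1.7, 1.1])
    targeted_scores = np.array([2.4, 1.9, 3.0, 2.2, 1.4, 2.8, 2.0])

    # One-tailed test that targeted scores exceed random ones.
    # (`alternative="greater"` requires SciPy >= 1.6; with older versions,
    # halve the two-tailed p-value when the t statistic is positive.)
    t, p = stats.ttest_ind(targeted_scores, random_scores,
                           alternative="greater")
    print(f"t = {t:.2f}, one-tailed p = {p:.3f}")
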
Although no statistically significant differences were found for the
Mechanical Turk results, there were occasions in which Mechanical Turk
performed well where a pure keyword approach would fail. For example, in
response to the activity “reading newspaper,” the Mechanical Turk response
suggested “drink good coffee,” and the system then selected an advertisement
for an online coffee retailer. The participant gave this advertisement a 6 for
relevance. From examples like this, we believe that an effect might be
observed for the Mechanical Turk condition with a larger data set.
Fig. 5. Distributions of relevance (top) and usefulness (bottom) ratings on a
0–10 scale for the keyword (left) and Mechanical Turk (right) conditions, each
comparing random and targeted advertisements. Relevance: keyword p = 0.046
(significant), Mechanical Turk p = 0.23. Usefulness: keyword p = 0.35,
Mechanical Turk p = 0.44.
5.2 Activity
Since participants entered activities using unrestricted text, many different
entries represented the same or similar activities. To compare the effects of
targeting in activities with different qualities, we categorized the activities into
groups according to our observations of the data. See Table 1 for example
responses for each category.
Fig. 6 shows the frequency distribution and ratings distribution for each
activity class. Frequency distributions are computed separately for current
activity, expected activity, preferred activity, and Mechanical Turk-targeted
previous activity. We treated Mechanical-Turk-targeted activities separately
because the advertisement corresponded to an entry from the previous survey,
not the current survey. Most activities have roughly the same distribution,
with the exception of “Constructive Mental,” which included work, and was
not as often chosen as the preferred activity; “Communication,” which was
less frequently expected or preferred; and “Eating,” which was a frequent
preferred activity. Both “Media Consumption” and “Eating” show greater
improvement over random than the other activities; we believe this happens
because these activities are highly consumptive, so it is easier to directly
match advertisements to needs. “Shopping” does not show such an improvement,
but the sample size for shopping was small, and several of those
advertisements targeted vendors rather than consumers.
Table 1. Example responses for each activity category

Category               Example
Constructive Mental    “experiment analysis”
Communication          “in a meeting”
Media Consumption      “browsing internet”
Eating                 “eating lunch”
Transporting           “going to neighbor”
Manipulating Objects   “laundry”
Live Observation       “attending seminar at work”
Fixing                 “trying out new software”
Game Playing           “playing a game”
Thinking and Planning  “thinking about my life”
Shopping               “picking up prescriptions”
Basic Needs            “shaving”
Other                  “wait for meeting”
Exercise               “gardening”
Mobile Entertainment   “sightseeing”
5.3 Location

[Figure: breakdown of responses by location, with categories Office, Home, In
Transition, Restaurant, Other Home, Shop, and Other.]
5.4 Timing
On average, participants indicated that “no time” was appropriate for 60% of
all advertisements (because those advertisements did not reach sufficiently
high usefulness scores). Of the remaining advertisements, participants
reported that 25% would be appropriate at “any time,” with the remaining 15%
equally distributed over the times given in Table 1. Figs. 5–7, when
restricted to groups of advertisements with different timing ratings, look
roughly the same. Here too, with more data, an effect may emerge.
Fig. 6. Breakdown by activity. The leftmost chart shows the frequency of each activity
category occurring, sorted by the current activity. The center chart shows the average
relevance score for random and targeted advertisements for each activity, and the
right chart shows the same measurements for usefulness. Because of small numbers
of observations, measurements toward the lower end of these two charts should be
treated as increasingly less reliable.
having two devices to carry. Text entry with the soft keyboard was generally
adequate, although participants frequently reused their previous entries.
Several participants expressed confusion over the seemingly random alphabetic
ordering of the activity list, which we had not explained in advance.
Many participants found the periodic alerts intrusive, and commented that the
alert frequency missed some activities and oversampled others. One participant
commented, “Normally, I work longer than only one hour on the same task.”
Participants suggested using other mechanisms for triggering surveys, such as
location-based alerting, time-sensitive alerting, and participant-initiated activity
reporting.
The advertisements themselves generally seemed to displease the participants.
All agreed that most advertisements were irrelevant, although some admitted
that they did mail advertisements back to themselves for personal follow-up or
to forward to others. Participants especially disliked generic advertisements
when a similar advertisement was shown more than once.
The survey questions were generally considered straightforward, although some
participants seemed slightly confused about the distinction between “relevance”
and “usefulness.” The net effect is that our quantitative estimates of
usefulness are probably lower than they would have been had participants been
clear about the distinction. Unfortunately, the qualitative focus-group
discussion does not allow us to quantify the difference.
Most participants reported that advertisement exposure was more frequent
than they would have preferred. In exploring alternatives, there was disagreement
as to whether advertisements should appear for different durations (“one-second
advertisements that do not take too much of my time”) or whether they should
be bunched together (“I like the idea of browsing through coupon booklets”).
Participants offered several suggestions for improving the system:
1. Advertising would be more effective if displayed during idle moments rather
than during active tasks.
2. Advertisements might be targeted not to the current activity, but to
a complementary activity. For example, if a person is unemployed, an
advertisement for a job board would be useful.
3. A system could maintain a history of advertisements, which would support
later review of advertisements.
4. A system might proactively determine what items a participant is likely to
later spend time searching for on the web, and display the search results
along with advertising.
5. A system might learn a person’s ratings over time and use this information
to better target advertising.
Finally, in some cases participants were surprised by PEST’s apparent ability
to find obscure but relevant advertisements. However, upon further
investigation we discovered that these advertisements were in fact random, and
that the participants had attributed more intelligence to the system than it
actually possessed.
Our Mechanical Turk oversight mechanism did help improve response quality, but
it did not completely screen out poor responses, and over time its performance
degraded slightly. We observed the following categories of problems.
Despite these issues, we were able to get many appropriate responses from
Mechanical Turk. Overall, we found Mechanical Turk to be a useful complement
to computation, and would use it in other projects.
Several modifications to our basic architecture might improve the effectiveness
of the Mechanical Turk component. Mechanical Turk workers in the filtering
stage might select among a list of advertisements rather than a list of keywords.
Amazon.com’s “Qualifications” test might be applied to improve the average
worker’s skill level. And finally, the tasks might be redesigned to make them
more engaging or game-like [23].
6 Discussion
We were disappointed that usefulness scores were low in both phases. Efforts
to improve usefulness through better targeting could combine other knowledge,
such as the user’s preference profile, behavior history, context, and
similarity to other users. Furthermore, an advertisement might be more
effective if it targeted not a single activity but a sequence of predicted
future activities. Finally, a system might target advertising to a group of
users rather than a single user.
Usefulness might also be improved by working on the presentation of
the advertisement. As mentioned in Section 3, this work adopted a simple
presentation so we could focus on targeting. The focus group discussions,
however, made clear the importance of presentation, especially timing. To be
effective, an advertisement must be shown when a consumer is receptive to it,
which is not always when the activity is taking place. But if the user is
bored or not especially engaged with their activity, then they are more likely
to be receptive.
7 Conclusions
best possible response. As we experienced, what works well initially may lose
effectiveness as loopholes are exploited. While direct human oversight could
be used in cases where the difficult task is generation and the easy task is
validation of the generated response, in large-scale systems this is not feasible.
The Mechanical Turk-based oversight system we developed shows promise, but
is not foolproof. More robust oversight mechanisms need to be developed.
Finally, our exploration into activity-based advertising exposes a need for
another line of research in activity recognition. Much of the activity-recognition
research to date has been bottom-up and technology-driven with the goal of
discovering which activities are detectable by sensors. The results have been
techniques to detect motor-level human actions such as walking, running,
biking, climbing stairs, social engagement, task engagement, object use and
other actions. Above this level, researchers have discovered applications that
can combine detected actions to infer higher-level activities such as activities
of daily living and interruptibility. While successful, this bottom-up approach
does not necessarily cover the full range of activities at human cognitive levels.
Detecting the user’s cognitive state is important for the kinds of context-aware
applications that aim to match information to what is on the user’s mind, such as
advertising and information retrieval. We explored a top-down approach, asking
users to describe their activity in their own terms, and then generating a set
of labels for activities in natural terms. This approach will allow research into
activity-recognition techniques to detect the kinds of activities thought of by
users themselves.
References
1. Lauri Aalto, Nicklas Göthlin, Jani Korhonen, and Timo Ojala. Bluetooth and
WAP push based location-aware mobile advertising system. In MobiSys, 2004.
2. Jeff Barr and Luis Felipe Cabrera. AI gets a brain. Queue, 4(4), 2006.
3. Rebecca Bulander, Michael Decker, Gunther Schiefer, and Bernhard Kölmel.
Comparison of different approaches for mobile advertising. In Second IEEE
International Workshop on Mobile Commerce and Services, 2005.
4. Scott Carter and Jennifer Mankoff. Momento: Early stage prototyping and
evaluation for mobile applications. Technical report, University of California,
Berkeley, April 2005. CSD-05-1380.
5. Mark Claypool, Phong Le, Makoto Wased, and David Brown. Implicit interest
indicators. In IUI, 2001.
6. Sunny Consolvo and Miriam Walker. Using the experience sampling method to
evaluate ubicomp applications. IEEE Pervasive Computing, 02(2), 2003.
7. Jon Froehlich, Mike Y. Chen, Ian E. Smith, and Fred Potter. Voting with your
feet: An investigative study of the relationship between place visit behavior and
preference. In Ubicomp, 2006.
8. John Harms and Douglas Kellner. Towards a critical theory of advertising.
http://www.uta.edu/huma/illuminations/kell6.htm.
9. James M. Hudson, Jim Christensen, Wendy A. Kellogg, and Thomas Erickson. “I’d
be overwhelmed, but it’s just one more thing to do”: Availability and interruption
in research management. In CHI, 2002.
10. Cellfire Inc. http://www.cellfire.com/.
11. Stephen S. Intille, John Rondoni, Charles Kukla, Isabel Ancona, and Ling Bao. A
context-aware experience sampling tool. In CHI, 2003.
12. Jonathan Lester, Tanzeem Choudhury, and Gaetano Borriello. A practical
approach to recognizing physical activities. In Pervasive Computing. Springer-
Verlag, 2006.
13. Freeset Human Locator. http://www.freeset.ca/locator/.
14. Paul Lukowicz, Jamie A. Ward, Holger Junker, Mathias Stäger, Gerhard Tröster,
Amin Atrash, and Thad Starner. Recognizing workshop activity using body worn
microphones and accelerometers. In Pervasive Computing. Springer-Verlag, 2004.
15. Anant Madabhushi and J. K. Aggarwal. A Bayesian approach to human activity
recognition. In Second IEEE Workshop on Visual Surveillance, 1999.
16. Joe Mandese. Online ad spend predicted to reach $20 billion. Online Media Daily,
June 2006.
17. Nuria Oliver, Eric Horvitz, and Ashutosh Garg. Layered representations for
recognizing office activity. In Fourth IEEE International Conference on Multimodal
Interaction, 2002.
18. P.O.P. ShelfAds. http://www.popbroadcasting.com/main/index.html.
19. Reactrix. http://reactrix.com/.
20. A. Schmidt, M. Beigl, and H-W. Gellersen. There is more to context than location:
Environment sensing technologies for adaptive mobile user interfaces. In Workshop
on Interactive Applications of Mobile Computing (IMC), 1998.
21. Emmanuel Munguia Tapia, Stephen S. Intille, and Kent Larson. Activity
recognition in the home using simple and ubiquitous sensors. In Pervasive
Computing. Springer-Verlag, 2004.
22. Wessel van Binsbergen. Situation based services: The semantic extension of LBS.
In 3rd Twente Student Conference on IT, 2005.
23. Luis von Ahn. Games with a purpose. IEEE Computer Magazine, June 2006.
24. D.H. Wilson and C. Atkeson. Simultaneous tracking and activity recognition
(STAR) using many anonymous, binary sensors. In Pervasive Computing. Springer-
Verlag, 2005.