Abstract: In today's advancing ubiquitous computing age, with its ever-increasing amount of information from various applications and services available for consumption, the management of people's attention has become very important. In particular, the high volume of notifications on mobile devices has become a major cause of interruption of users. There has been much research aimed at detecting the opportune moment to present such information to users in a way that lowers the cognitive load or frustration. However, evaluation of such systems in a real-world production environment with real users and notifications, and evaluation of users' engagement with the presented notification beyond simple responsiveness, have not been adequately studied. To the best of our knowledge, this study is the first to investigate user interruptibility and engagement using a real-world large-scale mobile application and real-world notifications consisting of actual news content. We equipped the Yahoo! JAPAN Android app, one of the most popular applications on the national market, with our mobile-sensing and machine-learning-based interruptibility estimation logic. We conducted a large-scale in-the-wild user study with more than 680,000 users for three weeks. The results show that in most cases delaying the notification delivery until an interruptible moment is detected is beneficial to users and results in a significant reduction of user response time (49.7%) compared to delivering the notifications immediately. We also observed a higher number of notifications opened in our system as well as constant improvement in user engagement levels throughout the entire study period.

I. INTRODUCTION

While the capacity of our attention as humans is constant, the amount of information available for consumption has been growing by several orders of magnitude. Concomitant with advances in computing and multitasking operating systems, more devices, and more applications and services, increasing volumes of notifications that proactively convey information to users are resulting in a greater number of interruptions.

Versatile applications and services in the cloud are being developed and utilized in this ubiquitous computing age. This software and these services generate enormous amounts of various types of information for users, such as big data analysis, schedule reminders, messages from social media friends, the weather forecast, breaking news, and status updates from devices. Such information is delivered to users through devices such as smartphones and other mobile devices, wearable watches, and even through ambient devices embedded in a user's environment. For better timeliness and speediness, the provision of such information has progressively become more proactive, and it is often delivered through push notification systems.

In this information-overload world, the constant and limited capacity of human attention has become a new bottleneck [1] in computing. Push notifications that pop up in the background of a user's attention at random times cause interruptions and divided attention. There have been several reports on the negative effects caused by divided attention in terms of productivity, emotion, and mental state [2], [3], [4], [5].

Researchers have been investigating user interruptibility in various ubiquitous and pervasive computing situations using different techniques with the objective of ensuring that interruptive notifications do not unnecessarily steal users' precious attention resources. A breakpoint [6], the boundary between two adjacent units of user activities, is known as a timing that can lower the impact on users' cognitive load. We previously investigated the real-time detection of users' breakpoints in their device interactions and physical activities using mobile sensing and machine learning techniques on smartphones [7] and wearable watch devices [8].

However, we found that three significant issues remain to be studied: (1) real-world evaluation of breakpoint-based adaptive notification with an actual product application and associated notifications, (2) software architecture design of such interruptibility estimation for real-world deployment on both the client and server sides, and (3) comprehensive evaluation of user behavior in terms of not only interruptibility but also users' further engagement levels with the notification content.

Fig. 1. Screenshots of Yahoo! JAPAN Android Application [9] (front screen (left), weather radar (center), notification (right))

In this paper, we present the results and findings from a large-scale study conducted on smartphone users' interruptibility and user engagement with a popular real-world smartphone application. We designed and implemented our real-time breakpoint detection and notification scheduling mechanisms inside the Yahoo! JAPAN Android application [9] (shown in Figure 1), one of the most popular applications in the
national application market. Considering several real-world requirements related to simplicity, scalability, and efficiency, our mechanism particularly focuses on the users' physical-activity breakpoints [8], relying on activity recognition APIs on the smartphone platform. Using mobile machine learning techniques, the detection mechanism embedded in the app detects the users' breakpoints in real time and shows incoming notifications at such timings.

Our large-scale user study, conducted over three weeks with a total of 687,840 users, revealed the efficiency of our proposition. We found that, in most cases, notifications delivered at delayed breakpoint timings made the users' overall click timing earlier. While the notification delivery delay due to additional breakpoint detection (as opposed to the conventional "deliver immediately" style approach) is trivial, once the notification is delivered, a significantly reduced user response time (49.7%) was observed in our approach. We also observed a higher number of notifications opened in our system as well as constant improvement in user engagement levels throughout the entire study period.

The contribution of this paper is three-fold. First, we present the design and implementation of our interruptibility detection mechanism on a large-scale real-world smartphone platform. Second, we discuss our large-scale in-the-wild user study on user interruptibility and engagement conducted using an actual product and associated notifications in a real-world situation. Finally, we evaluate our work in terms of not only interruptibility but also further user engagement with the presented notification contents. The remainder of this paper is organized as follows. Section II explains the interruption overload problem. Section III discusses related work. Section IV clarifies our research goals. Section V specifies requirements for our solution. Section VI presents the system design and architecture of our system, AtteliaY. Section VII describes our initial model training study. Section VIII reports on our large-scale in-the-wild user study conducted with 687,840 users for three weeks, in terms of our experimental design, methodology, results, and analysis. Section IX discusses further research opportunities arising from the user study. Section X concludes this paper.

II. INTERRUPTION OVERLOAD

Our current computing life suffers from "interruption overload" caused by large numbers of notifications presented in inappropriate ways. Interruption overload is one class of a broader "information overload" problem discussed in the literature [10], [11], [12]. More studies have recently been conducted in the context of interruptions and multitasking [13], [14], [15], [16], [17].

The main source of interruption overload is notifications from computer system entities such as local operating systems, messaging services connected to other users, and various applications. The notification in computer systems was originally designed to provide newly available information to users in a more timely and speedy manner (than polling by the user). Since typical notification systems deliver notifications immediately to users as soon as they are available, the users end up facing numerous interruptive notifications from the background of their current tasks at random timings, regardless of their timing preference. When a notification is perceived and recognized by a user, some amount of his/her attention with limited capacity [18], [19] is allocated to the information carried by the notification. This situation is called divided attention [20].

Past studies have revealed several types of negative influence of interruptive notifications, such as on productivity [2], [3], [4], [5], [21], [22], emotional and social attribution [21], and psycho-physiological states [3]. The need for computing systems that can adapt their behavior to human users' attentional resources has been gradually recognized, with a growing body of literature, particularly on sensing users' attentional states.

III. RELATED WORK

There are two main targets for sensing a human's current attentional state: the user's current cognitive load and interruptibility.

A. Sensing Users' Cognitive Load

In cognitive psychology, the concept of cognitive load is defined as the total amount of mental effort allocated to working memory. Several different approaches for measuring this load have been proposed, including (a) subjective rating-based methods, (b) task performance-based methods, and (c) physiological response-based methods.

Several studies on the subjective rating-based approach have shown that the measurement of cognitive load through post hoc self-reporting is a relatively reliable methodology for mental effort assessment [23]. The most widely used tool for assessing a user's cognitive load is the NASA Task Load Index (NASA-TLX) [24]. Although use of this method is widespread, the post hoc nature of the approach makes it difficult to apply to versatile ubiquitous computing systems where an assessment needs to be completed in real time.

The measurement of a user's task performance is used to objectively assess the user's cognitive load during task execution. The user's performance regarding their primary and focal task is used in the primary task measurements, whereas secondary task measurements exploit the performance of a secondary task that is executed (often on request) simultaneously with the primary task [23]. In this methodology, the variation in reaction performance indicates variations in cognitive load. However, this methodology may not be feasible in ubiquitous computing situations where a user multitasks with frequent switching between multiple tasks, making it difficult to measure the response performance of the user's various types of tasks using uniform measurement criteria.

The psycho-physiological response-based method includes several different techniques, such as tracking of eye movement and pupil size [25], [26], [27], [28], readings from electrocardiograms (ECG), galvanic skin response (GSR) [26], [29], [30], electroencephalogram (EEG) [29], [28], heart rate (HR), and HR variability (HRV) [31], [32], [28]. Haapalainen et al. [33] found that, in desktop computing, the combined use of an electrocardiogram and heat flux is the most accurate at classifying low and high levels of cognitive load. Although this approach looks promising in terms of detecting users' cognitive load in real time, the burden placed on the target users is not trivial.
B. Sensing Users' Interruptibility

Rather than sensing a user's cognitive load to represent that user's internal mental status relatively directly, several researchers have proposed detecting the user's "interruptibility" from the viewpoint of the source of possible interruptions. This class of research can be categorized into two main groups: (a) estimation of interruptibility at a certain timing period based on a user's context, and (b) detecting the user's breakpoint [6]. The breakpoint is the boundary between two adjacent actions of a person, and was found to be the timing when interrupting the user results in relatively lower frustration and cognitive overhead [21], [34], [35].

Following early research in the desktop computing domain [36], [37], [38], [39], [40], more studies have recently been conducted in the mobile field. Ho et al. used wireless on-body accelerometers to trigger interruptions when users change activities [41] and found that interruptions at these transition times reduce user annoyance. The most recent studies have been on widespread mobile and smartphone environments. Fischer et al. identified breakpoints after phone calls and text messages [42]. They found that users tend to be more responsive to notifications after these activities than at other random times. Hofte et al. used an experience sampling methodology to collect information on location, transit status, company, and activities in order to build a model of interruptibility [43], particularly for phone calls. Pejovic et al. expanded the use of context to detect interruptible moments on smartphones, including user activity, location, time of day, and emotional states [44]. Recent studies have even detected user boredom [45] as yet another opportune moment for notifications and engagement level [46] as a further indicator of the user's response to received information.

At the system level, the current situation of fragmented interruptive notification delivery over mobile networks is also known to be inefficient in power consumption. Acer et al. [47] showed that delaying notification delivery can yield power savings in mobile devices.

Our previous works, Attelia I [7] and II [8], have followed the same research trend in interruptibility on smartphones and wearable watches (multi-device environment). Towards the realization of opportune moment detection and on-the-fly adaptation in notification scheduling, we particularly emphasized four design principles: (1) feasibility on users' real mobile and wearable devices, (2) supporting real-time detection, (3) applicability to diverse types of applications, and (4) affinity to all-day use. Attelia realizes real-time breakpoint detection on smartphones and wearable watches without any external sensors or modification to existing operating systems or applications.

C. Further Research Challenges

Although the studies cited above show that researchers are actively working on interruptibility and notification scheduling on smartphones and wearable devices, there remain significant research challenges that need to be addressed:

1) To the best of our knowledge, no study has investigated and evaluated user interruptibility with real (product-level) applications and the real notifications issued from such applications. This is primarily owing to the current situation of major smartphone platforms (e.g., Android and iOS) not having opened their APIs to control notifications. Thus, past studies mainly used a custom sample application and/or related custom-made notifications prepared in an ad hoc manner for their research user study. Although some studies [48], [49] focused on real-world smartphone notifications, their main contributions pertained to analysis of the current situations.
2) The system design for such interruptibility detection and notification adaptation in real situations (with real-world applications used by real users) has not been adequately studied.
3) User engagement with information content presented via notifications, beyond the user's initial responses to the notifications (such as response time or click rate), has not been adequately evaluated.

IV. RESEARCH GOALS AND APPROACHES

Considering the research issues outlined above, this study aims to investigate smartphone users' interruptibility against notifications and further engagement with the notified content in systematically estimated opportune timings. In particular, this research features such investigation in a real-world environment with a real application, real users, and real notifications. To achieve these goals, we took the following approaches after careful discussion.

A1. Embedding interruptibility estimation logic into a market-leading smartphone application: We added our interruptibility estimation logic into the Yahoo! JAPAN Android application [9]. Yahoo! JAPAN has been popular in the Japanese market since its launch in 1996, with a search engine market share of 32% [50] (its share of Yahoo!'s worldwide market share is 3.4%). The Android app, shown in Fig. 1, has an installed base of more than 10 million users, making it one of the most popular smartphone applications. The app is a portal-like application with several different features including Web search, news reader, weather map, and links to a variety of Yahoo! JAPAN services. To the best of our knowledge, we are the first to conduct interruptibility research with a real-world application that is utilized by such a large number of users.

A2. Using real-world notifications: Along with the application, we use real notifications issued from Yahoo! services on the app to evaluate our interruptibility estimation approach. Whereas most previous studies used an artificial notification or ESM [51] as notifications, utilization of real notifications from real information sources enables us to understand how users behave in real situations.

A3. Investigating users' engagement: Finally, we quantitatively measure user engagement levels for the content presented through each notification in addition to several immediate response criteria, such as response time and click rate. Because users load Web content from Yahoo! JAPAN servers when they click notifications, we decided to track users' browsing behavior from the server side by measuring engagement-related criteria such as session length and revisit rate. This evaluation facilitates our understanding of how users behave and engage with the presented content beyond immediate "click to open" behavior.
V. REQUIREMENTS FOR SYSTEM DESIGN AND DEVELOPMENT

Despite the goals and approaches above, conducting such research in a real production environment differs in several ways from doing so in a research-purpose environment. Here we present the requirements for the system design and development that emerged from our discussion at the beginning of this project. Since we aim to port the original research software Attelia into the Yahoo! JAPAN commercial production system, we faced several real-world requirements related to acceptable behavior and user experience of the product application, as well as Yahoo!'s business-oriented decisions and restrictions.

R1. Users' additional burden needs to be minimized: When placing the interruptibility estimation as additional logic in an existing application, depending on the sensor and API types the logic uses, the users of the app will experience additional burden, such as explicit confirmation to give additional permissions (e.g., accessibility, location information) to the application. Such burden should be minimized in order to retain the existing user base of the application.

R2. Power overhead needs to be minimized: As power is always a precious resource in mobile devices, and because mobile users are very conscious about an application's power usage, our system design needs to be energy-aware and minimize power overhead.

R3. Cross-platform generalizability needs to be considered: Although our first-step experimental system can be implemented on a single platform for research purposes, the fundamental system design needs to be aware of cross-platform generalizability over both major platforms, iOS and Android.

R4. Collection of sensitive data needs to conform to the corporate policy and process: Additional collection of sensitive and/or privacy-related sensor data, such as fine-grained location information, needs to be proposed, carefully discussed, and approved in the corporate-wide business process for assuring end users' privacy protection. This means that the process can take time and our system design may need to start with a minimum set of data collection for the time-bound research period.

R5. System design and development needs to conform to the existing product management: The Yahoo! JAPAN Android application includes many commercial-level product features of many Yahoo! JAPAN services. Its product planning, design, and development processes are managed under business-oriented governance. Thus, the design and development of our system naturally needs to fit such existing processes. This also means that the opportunity for application updates to the market (i.e., Google Play) is considerably limited compared to a single-purpose research prototype application, which often contains only bare-bones features and can be pushed even nightly.

Fig. 2. Two Types of Breakpoints - Device Interaction and Physical Activity

A. Detection of Physical Activity Breakpoints

For our concrete interruptibility estimation design to embed inside the app, we decided to use and extend our previous work on real-time mobile detection of breakpoints developed in Attelia [7], [8]. We previously placed breakpoints into two classes, namely physical activity breakpoints and device interaction breakpoints, as shown in Figure 2. In the figure, a user is sitting down and doing work on her tablet. After a while, she decides to take a coffee break. She stands up, walks to the kitchen, pours a coffee, walks back, sits down on the couch, and enjoys her coffee while watching a video on her smartphone. In essence, in our daily lives with smartphones and wearable devices, there is a significant amount of time when we simply carry or wear them but do not actively use (manipulate) them, in contrast to the certain periods we actually do use (manipulate) them. By detecting two different types of breakpoints, our previous work detected interruptible moments in users' ubiquitous computing life comprehensively.

Meanwhile, in the present study, we pay special attention to the utilization of physical activity breakpoints. This is done for several reasons. Collecting UI interaction data on smartphones requires the user's explicit permission for the accessibility API on the smartphone platform (against R1). Furthermore, gaining access to such sensitive user information takes a long time in the product management process, and it is not even clear whether such data collection would be approved (against R4). On the other hand, detection of physical activity breakpoints has several advantages. Physical activity breakpoints alone can cover a user's all-day computing life as long as the user is carrying the device (even during the user's active device use period). Moreover, the activity recognition API used for physical activity breakpoint detection has recently been implemented in both of the major mobile platforms [52], [53] (compatible with R3), and those APIs are considered optimized in terms of efficiency, accuracy, and power consumption (compatible with R2).
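The idea of holding a notification until a physical activity breakpoint can be sketched as follows. This is a deliberately simplified illustration, not the authors' mechanism: it treats any change in the recognized activity label as a breakpoint, whereas the system described here relies on a trained model. The class name, method names, and event format are hypothetical; the one-hour timeout matches the value reported in the evaluation, and the labels are those listed in Table VI.

```python
# One-hour delivery timeout, as reported in the evaluation section.
TIMEOUT_SEC = 3600


class BreakpointScheduler:
    """Hold notifications until the activity label changes or a timeout expires.

    Simplifying assumption: every label change counts as a breakpoint.
    """

    def __init__(self, timeout=TIMEOUT_SEC):
        self.timeout = timeout
        self.last_activity = None
        self.pending = []  # list of (arrival_ts, payload)

    def enqueue(self, ts, payload):
        """Register notification content that arrived at the client at time ts."""
        self.pending.append((ts, payload))

    def on_activity(self, ts, activity):
        """Feed one activity-recognition result.

        Returns a list of (payload, delay_sec, reason) for notifications to
        post now. Timeouts are only checked when an activity update arrives.
        """
        changed = self.last_activity is not None and activity != self.last_activity
        self.last_activity = activity

        delivered, still_waiting = [], []
        for arrival, payload in self.pending:
            if changed:
                delivered.append((payload, ts - arrival, "breakpoint"))
            elif ts - arrival >= self.timeout:
                delivered.append((payload, ts - arrival, "timeout"))
            else:
                still_waiting.append((arrival, payload))
        self.pending = still_waiting
        return delivered


scheduler = BreakpointScheduler()
scheduler.on_activity(0, "STILL")
scheduler.enqueue(5, "recommended news")
no_change = scheduler.on_activity(10, "STILL")    # same label: keep waiting
posted = scheduler.on_activity(20, "TILTING")     # label change: post now
```

In this toy run the notification is posted 15 seconds after arrival, when the label changes from STILL to TILTING. A production design would also need a timer so that the timeout fires even when no activity updates arrive.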
VIII. EVALUATION

On the basis of these promising results from our initial study, we conducted a large-scale in-the-wild user study in the production environment with about 680,000 users for three weeks to better understand how our breakpoint-based notification scheduling works in a real user environment. Our evaluation criteria are as follows.

1. Investigating users' immediate response to the breakpoint-scheduled notifications: We want to see how effectively the breakpoint detection works from several different points of view, such as relationships between activity change and the conclusive detected breakpoints, and actual delay of the notification by waiting for a breakpoint.

2. Investigating users' response to the breakpoint-scheduled notifications: We want to observe how users react to the notifications scheduled at detected breakpoint timings.

3. Investigating users' (long-term) engagement with the presented contents and services: We want to investigate how the users' engagement level with the source Web service of the notifications (Yahoo! JAPAN) will be influenced in the longer term, beyond the users' immediate reaction in a short period of time.

A. Participants

We selected 687,840 users (approximately 10% of the total user base of the Yahoo! Android application) as participants in this study. We used an existing A/B test infrastructure inside our application where a specific functional component of the app can be enabled (or disabled) for a specific subclass of users based on the device ID. Using this system, we randomly selected 5% of the whole user base each for the experimental group and the control group.

Table V shows the demographics of the users. We split these users evenly into two groups: (a) the experimental group (users for whom our interruptibility detection and notification scheduling are enabled), and (b) the control group (users for whom our logic is not used) to compare the results and validate the effectiveness of our system.

TABLE V. USER DEMOGRAPHICS

Number of Users            687,840
Gender           Male      60.7%
                 Female    39.3%
Age Group        0-19       3.6%
(Median: 44.0)   20-29      7.9%
(Stdev.: 12.3)   30-39     22.3%
                 40-49     35.2%
                 50-59     21.0%
                 60-69      8.4%
                 70-79      1.4%
                 80-        0.3%

The user study was conducted for three weeks (21 days) in September 2016. To ensure the stability of the production application, the new version (including our implementation) was released to the production environment with a graduated deployment scheme on the app store. After three days, the new version was made available for all users.

The mechanism we utilized for this user study is the one detailed in Section VI. Our logic was enabled only for the experimental group users and not enabled for the control group users. Note that, except for the delivery timing difference of the specific "Recommendation" type notifications described in Section VI-B, the users in both groups experienced the same notification content and delivery timings.

All users (of the experimental group) used the same model for breakpoint prediction. At the beginning of the study, the model trained in our initial study was installed in each client's device. Once the study began, a new model was trained every night at our Hadoop cluster from all clients' log data and then was downloaded to each client as a daily update.

C. Results and Analysis

Through the nightly model update training over 21 days, our prediction model was gradually adapted to the users' real usage. After 21 days (i.e., 21 iterations of the model update), the latest model showed an average accuracy of 91.6% under the same 10-fold cross-validation methodology that we used in Section VII.

1) Breakpoint and Activity Change: Table VI shows a breakdown of detected breakpoints with true annotation (i.e., breakpoints with a notification (presented on the basis of the breakpoint detection trigger) that was clicked by the user within 10 seconds) into activity change pairs.

TABLE VI. BREAKDOWN OF TRUE-LABELED DETECTED BREAKPOINTS INTO ACTIVITY CHANGES (%)

From \ To     IN_VEHICLE  ON_BICYCLE  ON_FOOT  STILL  UNKNOWN  TILTING
IN_VEHICLE         -         0.01       0.22    6.05    0.99     2.71
ON_BICYCLE        0.03        -         0.01    0.32    0.01     0.09
ON_FOOT           0.48       0.00        -      2.68    0.28     1.29
STILL             8.16       0.06       1.75     -      7.42    43.28
UNKNOWN           0.84       0.00       0.19    3.51     -       1.50
TILTING           2.63       0.04       0.62   13.55    1.27      -

Very interestingly, the STILL to TILTING activity change showed the highest value. Again, as Google's API document [53] mentions, timings such as when a device is picked up from a desk or a user who is sitting stands up are considered to have been opportune moments for the users receiving the notification. Moreover, activity changes to STILL showed high numbers, such as TILTING to STILL (13.55%) and IN_VEHICLE to STILL (6.05%). This matches our previous hypothesis [8] that people would have breakpoints
when changing from a high energy state to a low energy state. On the other hand, we see that STILL to IN_VEHICLE resulted in the third highest number, 8.16%. What types of real-world situations are caught by this change is a possible topic for future research.

2) Notification Delay Due To Breakpoint: Figure 5 shows a cumulative distribution function (CDF) of the delay from when notification content arrives at the client to when the core estimation logic detects a breakpoint and the actual notification is posted. The graph on the right side shows the overview. We configured the timeout to 1 hour (3,600 seconds), so the value gets very close to 1 at 3,600 seconds. (We also observed a very rare situation where a notification was further delayed due to an implementation issue.) When looking at the left side graph (zooming from 0 to 100 seconds), we see that more than 70% of notifications were posted within approximately 10 seconds. The overall average delay is 236.8 seconds (approximately 4 minutes). The timeout occurrence rate is 1.11%.

[...] control group is 3,258.1 seconds (standard deviation: 1,920.6). Comparing the two groups, we see that user response time was reduced by 49.7%, with statistical significance. Combined with the fact that more than 90% of notifications were delivered (Figure 5), these results mean that, in most cases, the users' clicks of notifications occurred earlier in our breakpoint-based notification scheduling. We conclude that, in most cases, delay in notification delivery due to the breakpoint detection does not hurt and is even beneficial because the users can click earlier.

Figures 7 and 8 show CDFs of the two user groups on the response time for each of the four notifications issued every day. As explained previously, the "Recommendation" class news content becomes available four times a day: 8AM, noon, 6PM, and 9PM. When we plot each of them on the graph (blue: 8AM, red: noon, green: 6PM, grey: 9PM), we confirm that, for all of them, breakpoint-scheduled notifications resulted in clearly shorter response times. The average response times are, in the experimental group, 1,490.0 sec (8AM), 1,527.8 sec (noon), 1,824.2 sec (6PM), and 1,645.0 sec (9PM), and in the control group, 3,306.9 sec (8AM), 2,995.7 sec (noon), 3,413.6 sec (6PM), and 3,400.6 sec (9PM).
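To make the per-slot comparison concrete, the relative reduction in average response time can be computed directly from the figures quoted above. The snippet below is only a worked arithmetic check; the function and variable names are illustrative, and it assumes the reduction is defined as (control - experimental) / control, which the text does not state explicitly.

```python
# Average response times in seconds, copied from the text above.
experimental = {"8AM": 1490.0, "noon": 1527.8, "6PM": 1824.2, "9PM": 1645.0}
control = {"8AM": 3306.9, "noon": 2995.7, "6PM": 3413.6, "9PM": 3400.6}


def relative_reduction(control_sec, experimental_sec):
    """Fraction by which the experimental mean is lower than the control mean."""
    return (control_sec - experimental_sec) / control_sec


reductions = {
    slot: relative_reduction(control[slot], experimental[slot])
    for slot in control
}

for slot, value in reductions.items():
    print(f"{slot}: {value:.1%}")
```

Under that definition, each of the four delivery slots shows a reduction of roughly 47% to 55%, which brackets the overall 49.7% figure reported for the two groups.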