Driver Drowsiness Detection Systems Potential of Smart Wearable Devices To Improve Vehicle Safety
Submitted at
Institute for Pervasive
Computing
Doctoral Thesis
to obtain the academic degree of
Doktor der technischen Wissenschaften
in the Doctoral Program
Engineering Sciences
JOHANNES KEPLER
UNIVERSITY LINZ
Altenbergerstraße 69
4040 Linz, Österreich
www.jku.at
DVR 0093696
Statutory declaration
I hereby declare that the thesis submitted is my own unaided work, that I
have not used other than the sources indicated, and that all direct and indirect
sources are acknowledged as references. This printed thesis is identical with
the electronic version submitted.
........................................................ ........................................................
Place, Date Thomas Kundinger
Abstract
Kurzfassung
Acknowledgments
This thesis and the corresponding research were completed with the Johannes
Kepler University Linz, Austria, in cooperation with the AUDI AG Ingolstadt,
Germany, and the Technische Hochschule Ingolstadt, Germany.
First and foremost, I would like to thank Prof. Priv.-Doz. Dr. techn. Andreas
Riener for the supervision of my research, his constant support, and valuable
inputs.
Additionally, I would like to thank Univ.-Prof. Dr. Florian Alt of the Univer-
sität der Bundeswehr München, Germany for taking over the role as the second
evaluator. I would also like to thank Univ.-Prof. Mag. Dr. Gabriele Anderst-
Kotsis for her participation in the examination committee and Univ.-Prof. Dr.
Armin Biere for chairing the defense.
Finally, I would like to thank my family and friends for their constant support
and for reminding me that there is life besides the doctoral thesis.
Contents
6 Discussion
6.1 Preconditions for the Adaptation of Driver Drowsiness Detection Systems (RQ1)
6.2 Driver Drowsiness Detection with Vital Data from Smart Wearables (RQ2)
6.3 Acceptance of Drowsiness Detection Systems based on Smart Wearables (RQ3)
6.4 Further Deployment Scenarios for Drowsiness Detection Systems based on Smart Wearables
7 Conclusion
7.1 Summary
7.1.1 Recommendations for the Design and Development of Drowsiness Detection Systems
7.1.1.1 Preconditions for the Adaptation of Driver Drowsiness Detection Systems
7.1.1.2 Model Development for Driver Drowsiness Detection Systems using Vital Data from Smart Wearables
7.1.1.3 Acceptance of Drowsiness Detection Systems based on Smart Wearables
7.2 Limitations and Future Work
7.2.1 Study Settings
7.2.2 Model Development with Data from Wearable Devices
7.3 Contributions
List of Figures
List of Tables
3.1 Pre-Questionnaire
3.2 Epworth Sleepiness Scale (ESS)
3.3 Post-Questionnaire
3.4 Trust Scale
3.5 Study 1: Results of Pre-Questionnaire
3.6 Study 1: Results of ESS
3.7 Study 1: KSS Level for Drowsiness Warning
3.8 Study 2: Results of Pre-Questionnaire
3.9 Study 2: Results of ESS
3.10 Study 2: KSS Level for Drowsiness Warning
3.11 Study 2: Usage of Wearable Devices
3.12 Study 2: Descriptive Statistics for Trust Scale
3.13 Study 2: Results from Correlation Analysis with Spearman
List of Acronyms
A Accuracy
Af After
AAA American Automobile Association
ACC Adaptive Cruise Control
ADAS Advanced Driving Assistance Systems
AIC Akaike Information Criterion
ANOVA Analysis Of Variance
ANS Autonomic Nervous System
ANT Adaptive Network Topology
ApEn Approximate Entropy
API Application Programming Interface
ATT Attitude
BIC Bayesian Information Criterion
BLE Bluetooth Low Energy
BN Bayesian Network
BVP Blood Volume Pulse
CFSS Correlation-Based Feature Subset Selection
CI Confidence Interval
CS Compound Symmetry
CV Cross-Validation
D During
DDAW Driver Drowsiness and Attention Warning
DS Decision Stump
DT Decision Tree
ECG Electrocardiography
EDA Electrodermal Activity
EEG Electroencephalography
EMG Electromyography
EOG Electrooculography
ESS Epworth Sleepiness Scale
EU European Union
EuroNCAP European New Car Assessment Program
F F-Measure
FFT Fast Fourier Transform
FN False Negative
FP False Positive
SD Standard Deviation
SDK Software Development Kit
SDLP Standard Deviation of Lane Position
SMOTE Synthetic Minority Oversampling Technique
SR Self-Rating
SSS Stanford Sleepiness Scale
SVM Support Vector Machine
SVR Support Vector Regression
SWM Steering Wheel Movement
TAM Technology Acceptance Model
TN True Negative
TOR Take-Over Request
TP True Positive
UDT User-Dependent Test
UEQ User Experience Questionnaire
UIT User-Independent Test
UX User Experience
VLF Very Low-Frequency
WFCM Weighted Fuzzy C-Mean
1 Introduction
1.1 Motivation
Figure 1.1: Changing role of the driver across the SAE levels of driving automation
Until then, and looking at the different levels of automation in more detail,
the risk factor drowsiness and its reliable detection will continue to play a
crucial role. Across these levels, the driver's role changes from the sole
operator in manual driving (SAE level 0) to the fallback level (SAE levels 1-3)
and finally to the passenger (SAE levels 4-5) of an entirely automated system.
Therefore, in terms of driver drowsiness, the lower levels of automation, namely
SAE level 1 (driving assistance), level 2 (partial automation), and level 3
(conditional automation), require special attention. In level 1, the driver is
supported by ADAS that take over either the steering task or
acceleration/deceleration in certain situations, but never both at the same
time. In level 2, the driver must continuously monitor the system to intercede
and take over control in an adequate time when asked. In level 3, the driver is
excluded from all mon-
Many approaches and methods based on different measures have been proposed
to detect drowsiness in an automotive environment. These measures can be
categorized mainly into four groups: vehicle-based, behavioral, physiological,
and subjective measures [10, 25]. Different car manufacturers provide
driver assistance systems to counteract the potential risk of drowsiness, e.g.,
with a rest recommendation [26, 27, 28, 29]. These commercial systems currently
focus mainly on vehicle-based measures, i.e., the analysis of parameters that
are related to the driving behavior and indicate drowsiness, such as lane
position or steering angle [10]. However, with increased driving automation
levels, these parameters will be more difficult to evaluate since the automated
system controls the car. Therefore, alternatives are necessary. These
alternatives must not only guarantee reliable drowsiness detection during
different stages of automated driving; existing vehicles must also be
retrofittable with as little effort as possible, since from July 2024 a system
for drowsiness detection is legally required within the EU for all cars to be
registered [19].
A promising alternative to vehicle-based measures seems to be methods that
evaluate physiological data to identify driver drowsiness. This kind of data can
differentiate between wakefulness and sleep and allows the driver to be warned
at an early stage [10, 30]. Typical Electroencephalography (EEG) or
Electrocardiography (ECG) measurements in laboratories require complex measuring
devices, including electrodes on the head or upper body, to obtain sufficient
data quality. However, due to their intrusiveness, measurements of this
type are not accepted inside a vehicle. Depending on the driving environment,
disruptive factors such as vibrations from the roadbed can lead to reduced data
quality. Therefore, new non- or less intrusive strategies for recording physio-
logical signals inside a vehicle are required.
To reach these goals, this work builds on data collected in drowsiness studies.
Different modeling approaches and detection algorithms based on supervised
machine learning are investigated using physiological data from wrist-worn
wearable devices. In this context, it is further examined how well drowsiness
detection systems based on smart wearables are accepted and which preconditions
can be considered to adapt drowsiness detection systems and improve
their performance. With the obtained findings, a contribution is made to the
improvement of drowsiness detection systems and thus vehicle safety on the
way to the full automation of the driving task.
1.2 Outline
2 Theoretical Background and State-of-the-Art of Drowsiness Detection
Footnote 2: In the following, the terms drowsiness and sleepiness are used synonymously.
within seconds with periods of missing awareness of the reality that can some-
times lead to a micro-sleep event [35].
In contrast, fatigue can be defined as a subjective state of weariness, often
with muscle aches or discomfort, emotional irritability and a disinclination to
continue activities [35]. Others defined it as a reduced inclination for
activity due to excessive extension in time or intensity of that activity [36]
or as a subjectively experienced disinclination to continue performing the task
at hand [37]. The longer a mental or physical task or activity lasts without any
rest in between, the worse fatigue gets. Whereas drowsiness can be decreased by
sleep, fatigue can be relieved by rest. Further, fatigue does not cause a lack
of awareness, as drowsiness does. Drivers who have been driving for an extended
time can be fatigued without being drowsy. In many cases, drowsiness and fatigue
occur simultaneously, which might be why these constructs are often used as
synonyms [35].
From the definitions presented, it can be seen that drowsiness/sleepiness and
fatigue are distinguishable. However, many people in research, industry and
other areas dealing with road safety topics lack understanding and knowledge
of the precise definitions and demarcation of these three terms when applying
them in their daily work. From a safety perspective, drowsiness is the more
dangerous and relevant state because of the lack of awareness it causes.
Therefore, early and reliable detection while driving needs to be ensured.
2.2 Driver Drowsiness Detection Methods - Advantages and Limitations
[1, 38, 39]. Self-ratings (SR) involve questionnaires asked at regular intervals
or in certain situations under specific conditions. Some of the commonly used
tests for self-assessment include the Epworth Sleepiness Scale (ESS) [40], the
Multiple Sleep Latency Test (MSLT) [41], the Maintenance of Wakefulness
Test (MWT) [42], the Stanford Sleepiness Scale (SSS) [43], and the Pittsburgh
Sleep Quality Index (PSQI) [44]. The most commonly used self-rating scale,
which is also applied in this work, is the Karolinska Sleepiness Scale (KSS)
(see Table 2.1), a 9-point Likert scale. Several previous validation studies
found that this subjective scale can be related to different objective measures,
supporting its suitability as a valid indicator of sleepiness [45, 46, 47, 48].
Table 2.1: Karolinska Sleepiness Scale (KSS).
Level  Description
1      extremely alert
2      very alert
3      alert
4      rather alert
5      neither alert nor sleepy
6      some signs of sleepiness
7      sleepy; no effort to keep awake
8      sleepy; some effort to keep awake
9      very sleepy; sleep fighting
For observer ratings (OR) of drowsiness, experts or trained raters observe the
driver either in real-time [50] or by watching videos recorded during an
experiment [51]. The driver's state is assessed based on sleep-induced
indicators and behavioral changes in the facial region, such as the eyelid
position, blink frequency, and facial muscle activity [52, 51, 38, 39, 53]. The
most commonly used observer rating scale was published by Wierwille and
Ellsworth [52]. In this Ph.D. thesis, the drowsiness scale published by Weinbeer
et al. was applied [50]. This scale categorizes drowsiness into six levels with
drowsiness indicators per level as a reference for the observers (see Table
2.2). It is based on the scale by Wierwille and Ellsworth [52] and modified with
findings of Wiegand et al. [54] and Karrer-Gauß [55].
[56]. This technique has already been investigated for a long time [33, 57]. For
this purpose, in most cases, the driver is monitored using cameras mounted
inside the car and directed towards the driver's face. Due to advancements
in camera technology combined with novel approaches in computer vision
and image processing, the evaluation of behavioral measures in the context
of camera-based drowsiness detection has been receiving more and more at-
tention in recent years [58]. These methods evaluate mainly three parameters:
eye movements (eye blinking, eye closure activity) via eye-tracking, facial ex-
pressions (yawning, jaw drop, brow rise, lip stretch), and head position (head
scaling/nodding/rotation) [25]. One of the most commonly applied and important
behavioral parameters is the percentage of eyelid closure (PERCLOS)
[59, 60, 61, 62]. Drowsiness is assessed by calculating the proportion of time
within a defined time interval during which the eyes are 80% to 100% closed
(see Equation 2.1 [4]).
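The PERCLOS idea described above can be illustrated with a minimal sketch. It assumes that an eye tracker delivers one eyelid closure value per video frame, scaled from 0 (fully open) to 1 (fully closed); this input format, the function name `perclos`, and the sample window are hypothetical and not taken from the thesis.

```python
def perclos(eye_closure, closed_threshold=0.8):
    """Return the share of samples in which the eye is at least 80% closed.

    eye_closure: per-frame eyelid closure values in [0, 1], where 0 means
    fully open and 1 means fully closed (assumed input format).
    """
    if not eye_closure:
        raise ValueError("eye_closure must contain at least one sample")
    closed_frames = sum(1 for c in eye_closure if c >= closed_threshold)
    return closed_frames / len(eye_closure)


# Hypothetical window of 10 frames; 3 frames have a closure of 0.8 or more.
window = [0.1, 0.2, 0.85, 0.9, 1.0, 0.3, 0.1, 0.0, 0.2, 0.4]
print(perclos(window))
```

With equally spaced frames, the share of frames equals the share of time, so the result can be read as the proportion of the interval with closed eyes.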
drowsy driving [77]. The two most commonly used measures are steering wheel
movement (SWM) and standard deviation of lane position (SDLP) [25] that
were applied in several previous works [78, 79, 80, 81, 82, 83]. SWM evaluates
unnatural steering behavior induced by drowsiness using a steering angle sensor
or accelerometer to determine the driver's drowsiness. A drowsy driver makes
fewer micro-corrections than an alert driver [84, 85]. An SDLP system uses a
camera mounted on the car to determine the car's relative position in the
driving lane, i.e., the deviation from the lane's center-line. A drowsy driver
might cross the current driving lane abruptly, causing crashes. If the car is
found to be crossing the lane or approaching the sides of the lane, the driver
is alerted [79].
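As a rough illustration of the SDLP measure, the following sketch computes the standard deviation of lateral offsets from the lane center-line and flags a vehicle that approaches the lane edge. The offsets in meters, the half lane width of 1.75 m, the margin of 0.3 m, and both function names are assumptions made for this example; the sample standard deviation is used, although an implementation could equally choose the population variant.

```python
import statistics


def sdlp(lateral_offsets_m):
    """Sample standard deviation of the lateral offset from the center-line."""
    if len(lateral_offsets_m) < 2:
        raise ValueError("at least two lane position samples are required")
    return statistics.stdev(lateral_offsets_m)


def near_lane_edge(offset_m, half_lane_width_m=1.75, margin_m=0.3):
    """True if the car is within margin_m of the lane edge or beyond it."""
    return abs(offset_m) >= half_lane_width_m - margin_m


# Hypothetical lateral offsets (meters) for a steady and a weaving drive.
steady = [0.05, -0.02, 0.03, -0.04, 0.01]
weaving = [0.6, -0.5, 0.9, -0.8, 1.5]
print(sdlp(steady) < sdlp(weaving))
print(near_lane_edge(weaving[-1]))
```

A larger SDLP indicates more weaving within the lane; the thresholds at which a warning is issued would have to be calibrated and are not specified here.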
Since this Ph.D. thesis focuses on using physiological data from consumer-grade
wearable devices for drowsiness detection, drowsiness detection methods applying
physiological measures are described in more detail. In this context, the
research gap that this Ph.D. thesis addresses is highlighted again.
wakefulness, drowsiness, and sleep, the waves presented in Table 2.3 can be
considered.
Table 2.3: EEG waves with frequency band and measure [25, 90].
The delta band is in the frequency range from 0.5 to 4 Hz and provides
information on sleep, whereas the theta band with a frequency range of 4 to 8 Hz
reflects drowsiness. The alpha band ranges from 7.5 to 13 Hz and reflects
relaxation, i.e., the onset of sleep and the early stages of drowsiness. The beta
band is associated with wakefulness and alertness and lies in the range from
13 to 30 Hz [25, 90]. For drowsiness detection, EEG was applied in numerous
studies [90, 91, 92, 93, 94, 95].
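To make the band definitions above concrete, the following sketch sums the spectral power of a signal per band using a plain discrete Fourier transform from the Python standard library. The band limits are those given in the text; the function names, the synthetic sine signals, and the sampling rate of 64 Hz are assumptions for illustration and do not represent real EEG data or the processing used in the cited studies.

```python
import cmath
import math

# Band limits in Hz as given in the text. Theta and alpha overlap between
# 7.5 and 8 Hz, so a frequency bin in that range is counted in both bands.
BANDS_HZ = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (7.5, 13.0),
    "beta": (13.0, 30.0),
}


def band_powers(signal, sampling_rate_hz):
    """Sum the DFT power of each frequency bin into the matching bands."""
    n = len(signal)
    if n < 2:
        raise ValueError("signal must contain at least two samples")
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the constant offset
    powers = {name: 0.0 for name in BANDS_HZ}
    for k in range(1, n // 2 + 1):
        freq_hz = k * sampling_rate_hz / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        power = abs(coeff) ** 2 / n
        for name, (low, high) in BANDS_HZ.items():
            if low <= freq_hz < high:
                powers[name] += power
    return powers


def dominant_band(signal, sampling_rate_hz):
    """Name of the band that carries the most power."""
    powers = band_powers(signal, sampling_rate_hz)
    return max(powers, key=powers.get)


def sine(freq_hz, sampling_rate_hz=64, seconds=2):
    """Synthetic test signal: a pure sine wave."""
    n = sampling_rate_hz * seconds
    return [math.sin(2 * math.pi * freq_hz * i / sampling_rate_hz) for i in range(n)]


print(dominant_band(sine(6), 64))
```

The direct DFT is quadratic in the number of samples and is only meant to keep the sketch dependency-free; a practical pipeline would use an FFT and windowing.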
Systems based on skin signals use electromyography (EMG) to measure and
record changes in the electric potential of the skin caused by muscle cells
[96, 97, 98, 99]. Drowsiness detection with EMG assumes an increase in
amplitude and a decrease in mean frequency [100]. Also among the skin signals,
electrodermal activity (EDA), or galvanic skin response (GSR), measured through
skin conductance and resistance, is applied for drowsiness detection. It
reflects changes in sweat on the human skin that can be related to the current
physical state of a person [101, 102, 103, 104].
Eye-based signals can be gathered using electrooculography (EOG) by attaching
electrodes to the right and left side of the eye to measure its movements
[105, 106, 107, 108, 109]. This signal provides information about the blinking
pattern on the one hand and about eye movements on the other.
Specifically, the potential difference between the cornea and retina is measured
[110, 111]. Slower eye-rolling movements represent a transition between
wakefulness and sleep, whereas the saccade speed is an indicator of vigilance
[112]. Vigilance is characterized by faster eye movements, which are replaced by
slower rolling movements during the process of getting sleepy. Reduced and rare
eye blinks indicate drowsiness [113].
When heart signals are used to identify driver drowsiness, cardiac activity is
measured and analyzed using ECG [114, 115, 116]. One parameter applied in driver
drowsiness detection that can be easily derived from the ECG signal is the
heart rate. This parameter varies significantly between wakefulness and sleep
[10, 30]. Another physiological parameter that is particularly often used in
driver drowsiness detection is heart rate variability (HRV) [117, 118, 119]. The
changes in the length of the RR intervals, which is the time elapsed between
two successive R waves, i.e., two heartbeats of the QRS complex on the ECG,
Measure        Parameter           Advantage                            Limitation
subjective     questionnaire       takes personal feeling into account  not real-time
physiological  EEG, ECG, EOG, EMG  reliable                             intrusive
measures.
In research and on the market, several non- and less intrusive approaches for
measuring physiological signals inside a vehicle have been proposed; they are
discussed in the following section.
In research, experiments were conducted to measure heart rate or ECG for HRV
analysis via integrated sensors on the steering wheel [130, 131, 132]. However,
their usage is limited in the context of automated driving since one or even both
hands have to touch the steering wheel for a longer time. Further, additional
sensors need to be integrated into the steering wheel. In another approach, the
driver's breathing rate was captured through real-time image recognition.
Results show that the kind of clothes influences and reduces system performance
[133]. As part of a research project funded by the EU, bio-sensors were built
into car seat fabrics and seatbelts to measure heart rate and respiration [134].
For this, however, each driver's seat would have to be equipped with the
required sensors, or they would have to be retrofitted, which would be
associated with considerable costs. In another project, EEG, EOG, EMG, and EDA
were recorded for micro-sleep detection with a device worn behind both ears.
Limiting factors are noise artifacts due to sweating and hydration. Further,
ensuring optimal contact of the device with the skin requires wet electrodes
with a specific gel, which cannot be ensured for all customers [135].
Apart from research, also the market developed systems for driver drowsiness
Further, several wearable devices for driver drowsiness detection were devel-
oped. A bracelet that measures heart rate and EDA was presented by Steer
[139]. Depending on the detected level of drowsiness, the bracelet either vi-
brates or produces a moderate electric shock in higher levels of drowsiness.
StopSleep developed a double ring for measuring EDA. When initial signs of
drowsiness are detected, the driver is warned with a vibration. In higher levels
of drowsiness, an auditory signal is added [140]. Vigiton, a driver drowsiness
detection system proposed by Neurocom, collects physiological information by
measuring GSR with wristband and ring and provides visual and auditory
warnings [141].
The presented solutions based on wearable devices seem to be the most promising
since they can be easily integrated into the vehicle and, depending on the
wearing position, are less or even non-intrusive. However, because the area of
application of these wearable devices is very limited and their usage is not
compulsory, they achieved neither full market penetration nor widespread use,
and the majority of these devices were no longer pursued or further developed.
This work aims to apply physiological data from consumer-grade and widely
available wrist-worn wearable devices, such as smartwatches and fitness
trackers, for driver drowsiness detection. Several works on this topic can be
found in the literature and are discussed in detail in the upcoming section.
Based on that, the research gap is highlighted, and the research approach and
research questions of this Ph.D. thesis are presented.
Lee et al. utilized the built-in motion sensors of a smartwatch for driver
drowsiness detection by evaluating the driving behavior [142]. Twenty subjects
participated in a simulator study with an average duration of 60 minutes. Time,
spectral, and frequency domain features were extracted and mapped to the
subject's drowsiness self-ratings. A support vector machine (SVM) classifier
reached an accuracy of 98.15%. Lee et al. followed a similar approach, where
accelerometer and gyroscope data were collected during a 2-hour simulator
drive with five participants (see Figure 2.3(a)). An SVM classifier resulted in
an accuracy of 98.80% [143]. Leng et al. developed a wristband connected
to a PPG and GSR sensor on a finger. From data of 20 subjects, five features
were extracted, including HRV and respiratory rate, and labeled with
self-ratings [129]. An SVM classifier resulted in an accuracy of 98.30%. In the
work of Choi et al., a wrist-worn wearable device with sensors for PPG for
HRV analysis, GSR, temperature, acceleration, and gyroscope was developed
(see Figure 2.3(b)). Twenty-eight people participated in their simulator study,
which consisted of four parts (normal, stress, drowsiness, fatigue) with a total
driving time of 3 hours and 20 minutes. Labels were gathered by analyzing
signs of sleepiness in their facial expressions. With an SVM classifier, an
accuracy of 98.43% was reached [103]. Lee et al. conducted a simulator
study with a duration of one to a maximum of two hours and six participants.
They combined data from a PPG sensor of a Polar smartwatch with ECG data
measured with a chest belt (see Figure 2.3(c)). Labels were assigned by
evaluating videos of the driver's face and driving behavior. Their
classification based on recurrence plots resulted in an accuracy of 70% [114].
The heart rate measurement of a smartwatch was fused with PERCLOS in the work of
Li et al. [144]. A study in a simulated environment with a duration of 50 minutes
and 10 participants was conducted. An accuracy of 83% was obtained with an
SVM classifier. Lee et al. evaluated steering wheel movements with accelerometer
and gyroscope data from a smartwatch on one wrist and combined them with
physiological data from a PPG sensor placed on a sports wristband on the
other wrist. From the data collected during a 3-hour simulator drive with 12
participants, time, phase space, and spectral-domain features were calculated
and classified with a Weighted Fuzzy C-Mean (WFCM) model. Their system
reached a detection accuracy of 96.50% [145]. Gielen and Aerts collected the
nose temperature from a sensor, the wrist temperature from a smart wearable, and
the heart rate from a chest strap from 19 participants in a simulator study with
a driving duration between 90 and 150 minutes [146]. Classification
with a decision tree model resulted in accuracies of 68.40% (temperature nose),
88.90% (temperature wrist), and 70.60% (heart rate). When combining all
parameters, an accuracy of 89.50% was reached. In the work of Misbhauddin et
al., a wearable-based drowsiness detection system consisting of an Empatica
E4 wristband and mobile application was proposed. For real-time identification
of drowsiness, HRV and GSR data from the wristband were streamed to
and processed in the mobile application. For training the system, the users are
required to wear the wristband when not driving and give feedback four times
a day regarding their current drowsiness state through the mobile application.
If both values of GSR and HRV are below a certain threshold, a warning is
issued during driving. An accuracy of 80% was reached after testing the proposed
system in a simulator study with 10 participants [102]. In the system of
Bi et al., unsafe hand motions (hands off the steering wheel) caused by
drowsiness or distraction are detected with motion data from two smartwatches
(see Figure 2.3(d)). Their data set consisted of 75 real-world driving trips from
six participants. Their own adaptive training algorithm reached over
97% precision and recall [147]. Malathi et al. developed a wrist-worn EDA-based
wearable device for drowsiness detection. Results show intrapersonal
differences in the collected EDA signal between active, drowsy, and asleep
states [101].
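The works above report accuracy, precision, and recall. As a reminder of how these figures relate to the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), a minimal sketch follows. The function name and the example counts are hypothetical and are not results from any of the cited studies.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and F-measure from a confusion matrix."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts with "drowsy" as the positive class.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can be misleading when drowsy samples are rare, which is one reason to read it together with precision and recall.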
2.3 Problem Statement and Research Approach
Figure 2.3: Selected drowsiness detection systems using wrist-worn wearable de-
To achieve these goals and investigate the presented issues, the following hy-
pothesis (H) and research questions (RQ) being investigated in this work are
proposed.
ness. Causes and factors influencing this behavior can differ significantly from
individual to individual, and it poses a challenge to find uniform attributes
across individuals [148]. Therefore, to answer this research question, it was
examined to what extent different preconditions and human and external factors,
such as age, time of the day, driving mode, driving time, and trust, influence
the drowsiness state, and how this information can be applied during the
development process and for the parameterization of driver drowsiness detection
systems. This is intended to give researchers and car manufacturers pointers
and framework conditions for developing this kind of system.
To answer this research question and thus to be able to assess the potential of
consumer-grade wrist-worn wearable devices for the detection of driver
drowsiness in an automotive environment, different investigations were carried
out that provide novel insights researchers can build upon. In these experiments,
wearable devices from different manufacturers were compared on the one hand
with one another, and on the other hand, with a medical-grade device. Various
physiological parameters measured by the wearable device were applied to detect
drowsiness. Depending on the physiological parameter, different features were
used and evaluated concerning quality and impact on detection performance.
Several supervised machine learning models were compared and evaluated
using various performance measures in user-dependent and user-independent
tests. In these tests, different ground truths for drowsiness and different
numbers of drowsiness levels were applied. Since the ground truth is a decisive
factor in developing drowsiness detection models based on supervised machine
learning and many different approaches can be found in the literature, this
topic was examined in more detail.
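One common way to realize user-dependent and user-independent tests is sketched below. It is an assumption about the general technique, not a description of the exact protocol used in this thesis: the user-independent test is modeled as leave-one-subject-out evaluation, and the user-dependent test as a chronological split within one subject's own data. The sample layout (dictionaries with `subject`, `features`, and `label` keys), the function names, and the split fraction are hypothetical.

```python
def user_independent_splits(samples):
    """Leave-one-subject-out: test on one subject, train on all others."""
    subjects = sorted({s["subject"] for s in samples})
    for held_out in subjects:
        train = [s for s in samples if s["subject"] != held_out]
        test = [s for s in samples if s["subject"] == held_out]
        yield held_out, train, test


def user_dependent_split(samples, subject, train_fraction=0.75):
    """Train and test on the same subject, split in recording order."""
    own = [s for s in samples if s["subject"] == subject]
    cut = int(len(own) * train_fraction)
    return own[:cut], own[cut:]


# Hypothetical data set: 3 subjects with 4 labeled samples each.
data = [
    {"subject": sid, "features": [sid, i], "label": i % 2}
    for sid in (1, 2, 3)
    for i in range(4)
]

for held_out, train, test in user_independent_splits(data):
    print(held_out, len(train), len(test))

train_own, test_own = user_dependent_split(data, subject=1)
print(len(train_own), len(test_own))
```

In the user-independent case, no sample of the held-out subject appears in the training set, which approximates how a model would behave for a driver it has never seen.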
This work builds on data collected in drowsiness studies to examine the issues
addressed. In total, three user studies, namely two baseline studies for
database creation and an evaluation study, were conducted in the course of the
Ph.D. project.
In general, the investigation of driver drowsiness is associated with increased
risk and cannot easily be carried out under realistic conditions. Often
participants with sleep deprivation are recruited to induce drowsiness more
quickly, which would be even riskier in a study setting in real traffic.
Therefore, researchers' first choice for conducting a drowsiness study is a
driving simulator, which brings many benefits compared to a field study [149].
Complex study settings, e.g., with take-over scenarios in an SAE level 2 or
level 3 automated system, can be realized more easily and quickly, optimally
adapted to the study setting requirements. Study conditions are standardized,
controllable, and, above all, easily reproducible. However, it is difficult to
transfer the knowledge gained in the simulator one to one to real-world driving
and to figure out how the physical state and the drowsiness development over
time would have been affected. Therefore, it is essential to examine driver
drowsiness in simulated and especially realistic environments and scenarios.
In the following, previous drowsiness studies from related work in simulated and
realistic environments are presented from which implications for the developed
study setting in this work are derived.
2.4 Previous Drowsiness Studies in Simulated and Realistic Environments
partially automated drives were highly demanding for the driver and illustrated
the increasing risk of drowsiness [150].
Körber et al. performed a driving simulator study with 20 subjects (age:
M=23.30, SD=2.64). In a 42.5-minute partially automated drive, the driver's
only task was to monitor the system. The results showed that drowsiness oc-
curs when not being engaged in an active task while driving and that the duty
of monitoring leads to a decrease of vigilance [151].
Similar results were determined by Miller et al., where 48 participants (age:
M=20.85, SD=1.32) performed a 40-minute partially automated drive in a
driving simulator [152].
Vogelpohl et al. compared 60 minutes of conditionally automated driving with
manual driving using sleep-deprived participants; 60 participants (age:
M=41.30, SD=21.10) took part. The study was conducted between 8 pm and midnight.
Observers found earlier signs of drowsiness in the automated driving condition
[153].
Increasing KSS levels and changes in PERCLOS [72] were observed by Jarosch
et al. in the context of a simulator study where 56 participants (age: M=30.10,
SD=9.00) drove 30 minutes conditionally automated [154].
In another experiment by Jarosch et al., 73 participants (age: M=31.36,
SD=9.86) drove 50 minutes with conditional automation. Due to the monitor-
ing task, increased sleepiness levels were determined, the take-over performance
was impaired, and a higher number of accidents occurred compared to being
engaged in a quiz task while driving [155].
In another simulator experiment conducted by Omae et al., eight of 30 partic-
ipants fell asleep after 60 minutes of automated driving while monitoring the
system [156].
Feldhütter et al. conducted a simulator study with 13 participants. During a
60-minute trip of automated driving, participants had to monitor the driving
task. Three participants fell asleep after 20 minutes of driving. Two of them
closed their eyes for longer than five minutes, and one experienced a micro-sleep
[157].
The results of the presented studies show that automated driving affects the
driver's physiological state and leads to an increase of drowsiness, even after
a short driving duration. The age of the driver was not in the focus of the
presented investigations. The studies were mainly based on younger participants
or covered larger age intervals. However, older people are considered a
potential target group who could benefit from the technology of automated
driving [158]. With increasing age, the number of health problems rises. Vision,
hearing, and the ability to concentrate diminish, resulting in potential hazards
in road traffic. Thus, partially automated driving, as one of the enablers of
full automation, can benefit older people. It is especially relevant for older
people who are limited in their ability to drive but still want to be mobile.
Therefore, the specific needs and requirements of older people need to be
investigated and should be considered when developing and parameterizing systems
for the detection of drowsiness.
2 Theoretical Background and State-of-the-Art of Drowsiness Detection
Not many user studies or publications can be found where, in particular, driver
drowsiness was examined in a real-world automated driving context.
Weinbeer et al. investigated the suitability of a right-hand-drive vehicle
(RHDV) as a test method to explore the utility of several methods for han-
dling driver drowsiness during highly automated driving. The drowsiness of
31 participants (age: M=30.61, SD=8.16) was assessed by two investigators
in the back of the car during a 120-minute motorway drive as part of a user
study. The participants were sitting on the car's left side, where an additional
steering wheel was mounted. A person on the right seat controlled the vehicle.
Between the driver and passenger seat, a curtain was placed. Depending on the
current drowsiness level of the participants, TORs were triggered. For these,
no significant influence of drowsiness on take-over reaction times could
be determined. The results further indicate that an RHDV setting combined with
highly automated driving on a highway allows high drowsiness levels to be
reached under safe conditions [50].
In the user study of Berghöfer et al. with 34 participants (age: M=54, SD=14),
a Wizard-of-Oz setting was utilized to simulate level-3 automated driving [20]
and to investigate possible influences of behavior and characteristics on the
take-over reaction time. After the driver turned on the level-3 system, the
driving wizard on the passenger seat assumed responsibility for the test
vehicle's longitudinal and lateral control. For this purpose, a second pair of
pedals and special control units on the seat and armrest were installed.
Observers rated sleepiness, but no significant influence on take-over time was
determined [159].
The RRADS (Real Road Autonomous Driving Simulator) platform used for
simulating autonomous driving on real roads was presented in the work of Bal-
todano et al. [160]. A partition between the driver and passenger seat was
used to separate the participant from the driving wizard.
The studies show that researchers' choice to simulate automated driving and
ensure the necessary safety while driving was a Wizard-of-Oz approach. How-
ever, in these cases, it is not possible to speak of real drowsy driving since the
vehicle itself is being driven by another person when the auto-pilot is switched
on. Alternative solutions are required for conducting user studies in which
automated and manual driving can be experienced in a realistic environment under
conditions that are reproducible and safe for the participants.
In the following chapter, the baseline studies for database creation and results
from the subjective evaluation are presented.
3 Baseline Studies and Subjective Evaluation
The database of this work was created in the context of two user studies. An
identical study setting was applied for both studies. The only, but significant,
difference is the study environment. To determine differences regarding
driver drowsiness, baseline study 1 was carried out in a simulator-based
environment and baseline study 2 in a realistic environment.
In the upcoming sections, the developed study setting is described first. This
is followed by presenting the results from the subjective evaluation of both
studies for answering RQ1 (What preconditions can be considered to adapt and
personalize driver drowsiness detection systems and to model different groups
of users?). Since drivers of future automated vehicles are mainly typical and
average consumers and not experts in this domain, different expectations,
previous knowledge, and behavior must be anticipated and investigated during
the development process of intelligent driver-vehicle interaction systems. This
knowledge can be applied to adapt and personalize intelligent user interfaces
for driver drowsiness detection and to model different user groups.
The main focus in the subjective evaluation and for answering RQ1 will lie on
the following hypotheses:
A study setting was developed for this work based on the presented studies from
related work and the derived implications. This study setting was applied
in both simulator and real-world driving to determine differences between the
two environments and to collect data for model development from both.
3.1.1 Participants
Several subjective and objective measures were collected, which will be ex-
plained in the following sections.
3.1 Study Setting
Pre-Questionnaire
The first part of the pre-questionnaire contained some basic demographic
questions and queried details about the participants' sleeping behavior and
health, as presented in Table 3.1.
In the second part of the pre-questionnaire, the items of the Epworth Sleepiness
Scale (ESS) [40] had to be answered for assessing the participants' daytime
sleepiness (see Table 3.2). The ESS queries how likely it is to doze off or fall
asleep in the mentioned situations, in contrast to feeling just tired, by
applying the following scale: 0 (would never doze), 1 (slight chance of dozing),
2 (moderate chance of dozing), 3 (high chance of dozing).
Drowsiness Self-Ratings
Since this work investigates the potential of using physiological data from smart
wearables in connection with supervised machine learning for driver drowsiness
detection, a ground truth for drowsiness, i.e., labels for the physiological data
recorded with smart wearable devices, is needed. Different types of labels were
applied in this work. One of those was determined via drowsiness self-ratings,
as done in many previous works [90, 81, 162, 129, 146]. For
this purpose, the most frequently used scale, the Karolinska Sleepiness Scale
(KSS), a nine-point Likert scale (1 = extremely alert; 2 = very alert; 3 = alert;
4 = rather alert; 5 = neither alert nor sleepy; 6 = some signs of sleepiness;
7 = sleepy, but no effort to keep awake; 8 = sleepy, some effort to keep awake;
9 = very sleepy, fighting sleep) was applied (see Table 2.1 in Section 2.2.1)
[49]. This scale was displayed in an Android application on a Google Pixel
C tablet computer placed next to the steering wheel in the center console of
the car. To minimize the impact of the self-rating requests on the drowsiness
development, the application was programmed as follows: After the start of
the drive, the participant was prompted by the tablet every five minutes by
slowly increasing the screen brightness without any auditory cues. After the
self-rating was given, i.e., the current drowsiness level was selected and
confirmed, the screen slowly dimmed again. KSS levels 1-4 were colored in shades
of green, levels 5 and 6 in yellow and orange, and levels 7-9 in shades of red
(see Figure 3.1). The ratings with the corresponding timestamps were stored
in the local memory of the tablet.
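The color coding and the stored rating records described above can be illustrated with a short Python sketch. It is not the Android application used in the study; the function names `kss_color_band` and `make_rating_record`, the record layout, and the assignment of yellow to level 5 and orange to level 6 (read as "respectively") are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Assumption: "levels 5 and 6 in yellow and orange" is read as 5 -> yellow, 6 -> orange.
def kss_color_band(level):
    """Map a KSS level (1-9) to the color band described in the text."""
    if not 1 <= level <= 9:
        raise ValueError("KSS level must be between 1 and 9")
    if level <= 4:
        return "green"
    if level == 5:
        return "yellow"
    if level == 6:
        return "orange"
    return "red"


def make_rating_record(level, when=None):
    """Build one stored self-rating: KSS level, color band, and timestamp.

    The dictionary layout is hypothetical; the text only states that ratings
    were stored together with their timestamps.
    """
    when = when or datetime.now(timezone.utc)
    return {
        "kss": level,
        "band": kss_color_band(level),
        "timestamp": when.isoformat(),
    }


if __name__ == "__main__":
    record = make_rating_record(6, datetime(2021, 1, 1, 9, 5, tzinfo=timezone.utc))
    print(record)
```

Keeping the band lookup separate from the record makes it easy to change the color scheme without touching stored data.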
In addition to the self-ratings during the drive, the participants were asked to
draw drowsiness curves after their test drive based on the user experience (UX)
curve method [163]. For this purpose, a sheet of paper with a two-dimensional
coordinate system was handed out. It showed the duration of the simulation in
five-minute intervals on the x-axis and the KSS levels on the y-axis. The first
reason was to investigate how the subjects rate their drowsiness right after the
drive and whether these ratings differ from the ones given during the drive. The
other and more practical reason was to have potential backup ratings in case of
problems with the tablet's self-ratings while driving.
Trust Questionnaire
For study 2, a questionnaire for assessing trust in automation was added. Apart
from driving in a drowsy state, overtrust in particular is a critical challenge
for the safe use of automated vehicle technology [164]. Recent incidents with
automated vehicles, e.g., as mentioned in the introduction, with Tesla Autopilot
[22] or the Uber self-driving taxi [23], are (at least partly) connected to
overtrust, as drivers failed to monitor and intervene properly. Drivers that
trust the automation more may show greater willingness to fall asleep and vice
versa. It will be evaluated how trust levels change after a single session of real
system exposure and whether a correlation between subjective trust and drowsiness
exists. Drowsiness can be an ideal candidate for evaluating such behavior
since giving way to drowsiness (which can be reformulated as the willingness to
fall asleep) poses a high risk. Investigating if and how drowsiness is affected
by users' trust levels might reveal additional safety risks and offer an option
to measure user trust behaviorally in an unobtrusive way.
Lee and See define trust as the "attitude that an agent will help achieve an
individual's goals in a situation characterized by uncertainty and vulnerability"
[165], but also state that trust is an attitude underlain by beliefs that leads
to intentions and thus to resulting reliance behavior [165]. Further, to reach
appropriate levels, trust should match the true capabilities of an agent [165].
Overtrust is thus a situation where subjective trust exceeds a system's
capabilities, which can ultimately lead to misuse of technology [166]. Various
situations can lead to overtrust, including purely performance-based (poor
calibration, resolution, or specificity [165]), but also pre-existing
(dispositional, situational, or learned trust [167]) explanations. Trust research
often emphasizes the performance-based component, e.g., recent publications in
the automated driving domain suggest making system capabilities and performance
transparent to the user [168, 169, 170, 171]. Future consumer-oriented users of
automated vehicles may add an important aspect that has not been addressed yet.
In contrast to professional operators, drivers may self-negotiate their trust
levels to justify the engagement in non-driving related tasks (NDRT), which is an
often mentioned advantage of automated vehicles [172].
Therefore, to investigate if and how trust in automation affects drowsiness,
the trust scale by Jian et al. [173] (see Table 3.4) was applied, which provides
sub-scales for both trust and distrust. Subjective trust was assessed before and
after the SAE level 2 automated drive in the realistic environment.
Physiological Data
Concerning objective measurements, physiological data for model development
with supervised machine learning were recorded using four wrist-worn wearable
devices. The devices used were the Garmin Forerunner 235 [174], Garmin
Vivosmart 3 [175], Polar A370 [176], and Empatica E4 [177]. The first three
are standard consumer-grade fitness trackers with optical heart rate sensors
using PPG. In contrast, the Empatica E4 wristband represents a more advanced,
medical-grade wrist-worn wearable device often used in research applications.
It offers the acquisition of several physiological signals such as blood
volume pulse and inter-beat intervals (IBIs) for HRV analysis via PPG and
the measurement of electrodermal activity, acceleration, and skin temperature.
The IBI sequence is derived from the PPG/BVP (blood volume pulse) signal, which
is sampled at 64 Hz. McCarthy et al. checked the validity of
the Empatica E4 wristband against clinical-standard equipment in recognizing
heartbeat anomalies and found a comparable data quality of 85% between
these devices [178]. Two wearables were worn on each wrist during the study
(see Figure 3.2). The assignment of the devices to the wrists was randomized so
that each watch was worn equally often on the left and right wrist across all
participants.
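To make the step from inter-beat intervals to HRV analysis more concrete, the following Python sketch computes two common time-domain HRV measures (SDNN and RMSSD) and the mean heart rate from an IBI sequence in milliseconds. This is only an illustration under assumptions: this section does not state which HRV features were used, and the function names and the example values are hypothetical.

```python
import math


def sdnn(ibis_ms):
    """Sample standard deviation of the inter-beat intervals (ms)."""
    n = len(ibis_ms)
    if n < 2:
        raise ValueError("at least two IBIs are required")
    mean = sum(ibis_ms) / n
    return math.sqrt(sum((x - mean) ** 2 for x in ibis_ms) / (n - 1))


def rmssd(ibis_ms):
    """Root mean square of successive IBI differences (ms)."""
    diffs = [b - a for a, b in zip(ibis_ms, ibis_ms[1:])]
    if not diffs:
        raise ValueError("at least two IBIs are required")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_heart_rate_bpm(ibis_ms):
    """Mean heart rate in beats per minute derived from the mean IBI."""
    if not ibis_ms:
        raise ValueError("IBI sequence is empty")
    return 60000.0 / (sum(ibis_ms) / len(ibis_ms))


if __name__ == "__main__":
    example_ibis = [800, 810, 790, 800]  # made-up values for illustration only
    print(sdnn(example_ibis), rmssd(example_ibis), mean_heart_rate_bpm(example_ibis))
```

Note that at a sampling frequency of 64 Hz, consecutive PPG samples are 15.625 ms apart, which limits the timing resolution of detected beats unless the signal is interpolated.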
Apart from the wrist-worn wearable devices, a 3-channel ECG measurement device,
the Faros from Bittium [179], served as a reference measurement device.
ECG data were recorded with the maximum possible sampling frequency of
1000 Hz. The device was attached to the subject's upper body with five adhesive
electrodes. Before attaching the electrodes, the relevant body sites were shaved
with a disposable razor, if necessary, and then cleaned with alcohol swabs.
Figure 3.2: Wearables on participant's wrists: Garmin Forerunner 235 (1), Garmin
The complete procedure took about two and a half hours for each participant
(see Figure 3.3). The study was carried out at three different times of the day
(9 am, 1:30 pm, and 5:30 pm) to compare the influence of the time of day on the
development of drowsiness. First, the participants received an introduction and
instructions from the experimenter. After the pre-questionnaire had been filled
in, the physiological measuring instruments were attached to the body. This was
followed by the central part of the study, which consisted of 90 minutes of
driving. For investigating the influence of driving mode on the development of
drowsiness, the 90 minutes were split into two successive 45-minute sessions: a
manual (SAE level 0) and a partially automated (SAE level 2) one [20]. In
partially automated driving, the driver is obliged to monitor the driving
process all the time and be ready to take over when requested [20]. In the case
of this study, no take-over situations
were triggered, since this would have had an alerting effect on the driver
and negatively affected the progressive development of drowsiness. Instead,
SAE level 2 and the duty of monitoring were used as a pretext to keep
the driver's concentration and focus on the driving environment and thus achieve
a quicker increase of drowsiness. To accustom the participants to the driving
situation and the drowsiness self-ratings via the tablet, 10 minutes of manual
practice driving were carried out. During driving, the participants were asked
not to use a mobile phone, eat, drink, or chew gum. They were
instructed not to close their eyes for a long time or to fall asleep. Moreover,
they were instructed to avoid talking to the experimenter and not to perform any
secondary activities. The order of the two drives was alternated between
participants: half of the participants started with the manual and the other
half with the partially automated drive to counteract possible influences on the
respective drive's measurement results. As stated in the previous section, the
participants had to rate their drowsiness with the KSS displayed on the tablet
every five minutes during the drives. In a short break between the two drives
and after the second drive, participants were asked to draw the drowsiness
curves. The study ended with the completion of the post-questionnaire.
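The timing of the self-rating prompts and the balanced order of the two drives can be sketched as follows. This is a minimal Python illustration, not the procedure actually used for assignment: the text only states that half of the participants started with each mode, so the simple alternation by participant index and the names `prompt_times` and `counterbalanced_orders` are assumptions.

```python
def prompt_times(drive_minutes=45, interval_minutes=5):
    """Minutes after the start of a drive at which a KSS rating is requested."""
    return list(range(interval_minutes, drive_minutes + 1, interval_minutes))


def counterbalanced_orders(participant_ids):
    """Alternate the starting mode so that half start manual, half automated.

    Assumption: alternation by list position; the study does not describe
    how participants were assigned to the two orders.
    """
    orders = {}
    for index, pid in enumerate(participant_ids):
        if index % 2 == 0:
            orders[pid] = ("manual", "automated")
        else:
            orders[pid] = ("automated", "manual")
    return orders


if __name__ == "__main__":
    print(prompt_times())
    orders = counterbalanced_orders(["P%02d" % i for i in range(1, 31)])
    print(sum(1 for order in orders.values() if order[0] == "manual"))
```

With a 45-minute drive and a five-minute interval, this yields nine measurement points per drive, which matches the nine measuring time points used in the later analysis.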
3.2 Study 1: Driving Simulator
Figure 3.5: Left: Hexapod driving simulator at THI; Right: simulator setup with
KSS on tablet.
3.2.1 Results
In the following, results from the subjective evaluation are presented and dis-
cussed based on the postulated hypotheses.
3.2.1.1 Questionnaires
Pre-Questionnaire
An almost equally distributed number of participants was available at each
time of day: nine in the morning, 11 in the afternoon, and 10 in the evening.
Six women and nine men (age: M=22.87 years, SD=1.81 years), most of them
students from Technische Hochschule Ingolstadt (THI), were selected for the
young age group. For the old age group, eight female and seven male
participants (age: M=67.60 years, SD=1.88 years) were recruited via an
advertisement in a local newspaper. Of all 30 subjects, six older participants
were undergoing medical treatment at the time of the study. On average, the
participants slept around seven hours the night before the study (young: M=7.07
hours, SD=0.80 hours; old: M=6.90 hours, SD=1.27 hours), which is at a similar
level to their average sleep duration per night in general (young: M=7.08
hours, SD=0.74 hours; old: M=7.07 hours, SD=1.40 hours). In addition, the
participants evaluated their perceived sleep quality in the night before
the study. One younger and three older participants answered with "very good",
nine younger and nine older with "good", and five younger and three older with
"medium". A summary of the results from the pre-questionnaire can be found in
Table 3.5.
Apart from the demographic details, the participants were asked to answer the
ESS [40]. For the interpretation of the results, the score range up to the
maximum achievable score of 24 points was grouped into five categories (0-4)
[181], as presented in Table 3.6.
It can be seen that all 15 subjects of the older age group are located in the
"lower normal" and "higher normal" daytime sleepiness categories, representing the
ESS category (score range)                               Young  Old  Total
0 (0-5 points): lower normal daytime sleepiness              4   10     14
1 (6-10 points): higher normal daytime sleepiness            6    5     11
2 (11-12 points): mild excessive daytime sleepiness          2    0      2
3 (13-15 points): moderate excessive daytime sleepiness      2    0      2
4 (16-24 points): severe excessive daytime sleepiness        1    0      1
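The grouping of ESS scores into the five categories of Table 3.6 can be expressed as a small lookup. The following Python sketch is illustrative only: the function names are hypothetical, and the check for eight items rests on the assumption that the ESS consists of eight situations rated 0-3, which is consistent with the maximum score of 24 stated in the text.

```python
# Score ranges and labels as listed in Table 3.6.
ESS_CATEGORIES = [
    (0, 5, 0, "lower normal daytime sleepiness"),
    (6, 10, 1, "higher normal daytime sleepiness"),
    (11, 12, 2, "mild excessive daytime sleepiness"),
    (13, 15, 3, "moderate excessive daytime sleepiness"),
    (16, 24, 4, "severe excessive daytime sleepiness"),
]


def ess_total(item_scores):
    """Sum the item ratings (each 0-3); eight items are assumed."""
    if len(item_scores) != 8:
        raise ValueError("the ESS is assumed to have eight items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be rated 0, 1, 2, or 3")
    return sum(item_scores)


def ess_category(total):
    """Return (category index, label) for a total ESS score of 0-24."""
    for low, high, index, label in ESS_CATEGORIES:
        if low <= total <= high:
            return index, label
    raise ValueError("ESS total must be between 0 and 24")


if __name__ == "__main__":
    total = ess_total([1, 2, 0, 1, 3, 0, 1, 1])
    print(total, ess_category(total))
```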
Post-Questionnaire
In the first question, participants were asked about the most suitable level of
the KSS to receive an initial drowsiness warning in the vehicle. The results are
presented in Table 3.7.
KSS level              Young  Old  Total
1 (extremely alert)        0    0      0
2 (very alert)             0    0      0
3 (alert)                  0    0      0
4 (rather alert)           1    6      7
In both age groups, the majority (young: 10; old: 6) voted for level 6 (some
signs of sleepiness). It is noteworthy that six older participants would prefer
The main focus in the subjective evaluation is on the analysis of the drowsiness
self-ratings. All data sets of the KSS ratings from both age groups, collected
during the drive via the tablet and afterward by drawing drowsiness curves, were
available for the evaluation. Four data sets of KSS ratings were generated per
participant, two from the manual and two from the partially automated drive,
i.e., 36 measuring points per participant. In total, 1080 KSS ratings were
included in the analysis.
To evaluate the effects of driving mode, driving time, and age on the
development of drowsiness, a three-factorial repeated-measures analysis of
variance (ANOVA) with one within-subject factor (measuring time points) and
two between-subject factors (manual/automated, young/old) was conducted.
This was done separately for the ratings during (D) and after (Af) the drive.
The ratings given during and after the drive are at a similar level for both
age groups and driving modes, with slightly higher ratings during the drive
(see Figure 3.6). The obtained drowsiness ratings steadily increased over the
entire period of the drive. Except for one case, identical effects could be
determined with the ANOVA.
Figure 3.6: Average KSS ratings with 95% confidence interval (CI) of all partici-
Results show that driving time (D: F(8,21)=6.43; p<.001; ηp²=.710; Af:
F(8,21)=5.84; p=.001; ηp²=.690) and driving mode (D: F(1,28)=4.46; p=.044;
ηp²=.137; Af: F(1,28)=7.98; p=.009; ηp²=.222) had a significant effect on the
development of drowsiness. Therefore, a significant difference exists in the KSS
ratings over the nine measuring time points, with higher drowsiness levels at
the end of the ride and in automated driving. No significant effect of age group
was found (D: F(1,28)=3.96; n.s.; Af: F(1,28)=5.41; n.s.). Further, no
significant interaction effects on drowsiness were found between driving mode and
age group (D: F(1,28)=1.50; n.s.; Af: F(1,28)=1.74; n.s.), driving time and age
group (D: F(8,21)=1.24; n.s.; Af: F(8,21)=1.05; n.s.), as well as driving mode,
driving time, and age group (D: F(8,21)=.56; n.s.; Af: F(8,21)=2.03; n.s.).
Whereas no significant interaction effect on drowsiness could be identified
between driving mode and driving time for the ratings during the drive
(D: F(8,21)=1.32; n.s.), one was found for the ratings afterward (Af:
F(8,21)=3.27; p=.014; ηp²=.555).
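The reported effect sizes can be cross-checked against the F values and degrees of freedom, since partial eta squared follows from the standard relation ηp² = (F · df_effect) / (F · df_effect + df_error). The short Python sketch below applies this relation to the values reported above; the function name is an assumption, and the check is a plausibility test of the reported numbers, not a reproduction of the SPSS analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


if __name__ == "__main__":
    # (label, F, df_effect, df_error) as reported in the text for study 1
    reported = [
        ("driving time, during", 6.43, 8, 21),
        ("driving time, after", 5.84, 8, 21),
        ("driving mode, during", 4.46, 1, 28),
        ("driving mode, after", 7.98, 1, 28),
        ("mode x time, after", 3.27, 8, 21),
    ]
    for label, f_value, df1, df2 in reported:
        print(label, round(partial_eta_squared(f_value, df1, df2), 3))
```

The recomputed values agree with the reported effect sizes to three decimals.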
Apart from the presented results of the ANOVA, the characteristics of the
obtained KSS ratings are presented in the form of different charts. The average
KSS ratings with 95% confidence interval (CI) are plotted over time comparing
the two driving modes (see Figure 3.7) and age groups (see Figure 3.8). Further,
the average ratings of all participants for both driving modes at the three
different times of the day (see Figure 3.9) on which the study was carried out
are considered. For all these charts, the ratings collected during the drive were
used.
Figure 3.7: Average KSS ratings with 95% CI during drive of all participants for
Considering the ratings given during the ride for automated and manual driving
separately (see Figure 3.7), it is noticeable that the ratings were relatively sim-
ilar up to the 3rd measurement point after 15 minutes, but from there started
to diverge and higher KSS ratings were assigned in automated driving. At
minute 40, the average difference increased to more than one KSS level (1.05).
Regarding the development of drowsiness in both age groups separately for
both driving modes (see Figure 3.8), a similar trend is apparent as presented
in Figure 3.7. The divergence of the two curves already started at the second
measurement point after ten minutes. It increased over time to an average
Figure 3.8: Average KSS ratings with 95% CI during drive of young and old age
Figure 3.9: Average KSS ratings with 95% CI during drive at different times of the
difference between young and old participants of up to 1.5 KSS levels with
higher levels for the young age group. Considering the KSS ratings at different
times of the day (see Figure 3.9), it becomes clear that, in general, the lowest
ratings were given in the evening, followed by the ratings in the afternoon. In
the morning, the highest KSS levels were reached.
In addition to the presented charts, Figure 3.10 compares age groups and driv-
ing modes in terms of the absolute numbers of participants that reached a
certain drowsiness level. It can be seen that for the young age group, all KSS
levels were covered, both for manual and automated driving. A decreasing
trend for both driving modes is recognizable throughout all KSS levels. For
the older age group, all levels were covered only in automated driving. In
manual driving, none of the older participants reached KSS levels 8 and 9. In
general, for manual driving in the old age group, the assessed drowsiness levels
were low and mainly ranged from KSS levels 1 to 5.
Figure 3.10: Comparison of age groups and driving modes in terms of number of
3.3 Study 2: Test Track
Figure 3.11: Left: test vehicle setup with driving robot on steering wheel (1) and
In trial runs before the actual study, the route along the test track was driven,
recorded, and stored for the automated ride. Due to safety restrictions on the
one hand and high lateral accelerations within the curves that would negatively
influence the drowsiness development on the other hand, the maximum speed was
limited to
about 25 km/h. The previously recorded route and speeds could be reproduced
during the partially automated ride by means of high-precision GPS (Global
Positioning System) positioning. As in the simulated environment, the test track
was designed as a loop to make the drive as monotonous as possible, so that the
drowsiness development of the test persons was accelerated, or at least not
impaired, and possible alerting effects were reduced. The design took into
account a safety corridor with a width of approximately five meters between the
test area boundary and the test track to bring the vehicle to a standstill in
time in the event of emergency braking (see Figure 3.12).
Figure 3.12: Graphic representation of test area with distance information in meters
The test vehicle used was an Audi A4 Avant (initial registration 2018). To
enable SAE level 2 driving, the car was equipped with a driving robot [182],
which could completely take over the lateral and longitudinal guidance of the
vehicle. The driving robot, which consisted of two parts, was mounted on the
pedals and the steering wheel (see left part of Figure 3.11). The driver's usual
sitting position was not restricted by it since, in the course of the study, the
vehicle also had to be driven manually. The temperature in the vehicle was set
to 23 °C. During the ride, the radio was off, and no other music was played. In
order to make the drive as safe as possible for the subjects, three safety
precautions were deployed. First, a second pair of pedals was attached on the
passenger side, where the experimenter sat during the study. Second, an
emergency stop was installed in the vehicle center console with which either the
test person or the experimenter could immediately bring the vehicle to a
standstill. Third, the ride was monitored by a person outside the test track,
who could stop the car via remote control.
With the provisions made in the test car and on the test track, it was possible to
carry out the study under conditions that were reproducible and safe for the
subjects.
3.3.1 Results
In the following, the results from the subjective evaluation are presented and
discussed based on the postulated hypotheses.
3.3.1.1 Questionnaires
Pre-Questionnaire
As in study 1, 15 participants (seven female and eight male) were selected in the
age range 20-25 years (age: M=23.73 years, SD=1.49 years), and 15 (five female
and ten male) in the range 65-70 years (age: M=67.27 years, SD=1.83 years). The
same number of participants from each age group was invited for the different
times of the day. As part of the pre-questionnaire, subjects were asked if they
had ever had a micro-sleep during a drive. Three younger and three older
participants answered this question with "yes". In addition, none of
the younger but seven of the older volunteers were undergoing medical treatment
at the time of the study. On average, the younger participants slept 7.35
(SD=0.91) and the older ones 7.8 (SD=1.25) hours in the night before the study.
The perceived sleep quality in the night before the study was rated with the
choices "very good", "good", "medium", or "bad". Four young and five older
subjects answered with "very good", six young and seven older with "good", four
young and three older with "medium", and only one young participant with "bad".
A summary of the pre-questionnaire results can be found in Table 3.8.
Furthermore, they were asked to respond to the questions from the ESS [40]. For
better illustration of the outcomes, the score range up to the highest reachable
score of 24 was separated into five groups [181], as displayed in Table 3.9. As
for study 1, all 15 subjects of the older age group are located in the "lower
normal" and "higher normal" daytime sleepiness categories, representing the
non-critical range of the ESS
ESS category (score range)                               Young  Old  Total
0 (0-5 points): lower normal daytime sleepiness              2   11     13
1 (6-10 points): higher normal daytime sleepiness            9    4     13
2 (11-12 points): mild excessive daytime sleepiness          3    0      3
3 (13-15 points): moderate excessive daytime sleepiness      1    0      1
4 (16-24 points): severe excessive daytime sleepiness        0    0      0
Post-Questionnaire
In the post-questionnaire, participants were asked if they got sleepier in
automated or manual driving. Except for one younger subject, all
participants stated that they had to struggle more with drowsiness
during the partially automated ride. Concerning the self-ratings, the subjects
were asked how confident they felt in giving them. Of the three possible
options "not confident", "medium", and "confident", seven young participants
chose "medium" and eight "confident". The older participants answered similarly:
six chose "medium" and nine "confident". Participants were further
asked at which level they would like to receive a first drowsiness warning. The
results are presented in Table 3.10. In general, a wide range of levels
was selected. The majority of younger subjects wish to get a warning
from KSS level 6, four would even be satisfied with level 7, and three with
level 8. Seven older participants chose level 6, but only two chose level 7,
and none level 8 or 9. Levels 4 and 5 were selected by two and three older
participants, respectively. One older participant even selected level 3.
KSS level              Young  Old  Total
1 (extremely alert)        0    0      0
2 (very alert)             0    0      0
3 (alert)                  0    1      1
4 (rather alert)           2    2      4
Table 3.11: Results from post-study questionnaire regarding usage of wearable de-
vices.
Trust Questionnaire
In study 2, it was also investigated how trust/distrust in the automated
system influences the drowsiness state and if a correlation exists. The results
are presented in the following paragraphs. This section is based on the following
own publication: [3]
To assess subjective trust, participants had to complete the trust scale by Jian
et al. [173], which provides sub-scales for both trust and distrust, before and
after the partially automated drive to assess the effect of initial system
exposure on their trust levels. In terms of drowsiness, the KSS ratings from
during the drive were utilized. Statistical analysis was conducted using IBM
SPSS V.24, and effects are reported as significant at p<.05. For the
trust scale, scale values for both concepts, trust and distrust, were calculated,
since all scales showed acceptable reliability (Cronbach's α ≥ 0.846 for all
scales, see Table 3.12). Since not all data were normally distributed, the
Wilcoxon signed-rank test was applied to evaluate within-subjects effects.
Participants rated the sub-scale distrust lower after than before the trip with
the automated vehicle; however, the difference is not statistically significant
(p=.130). Ratings for trust, on the other hand, increased significantly (p=.002)
after the ride. Regarding between-subjects effects for gender or the different
age groups, no significant differences could be found. Male drivers rated trust
in the vehicle after the ride (M=4.58, SD=0.86) higher than female drivers
(M=3.56, SD=0.15), but the difference was not statistically significant (p=.068).
To quantify the effect of the 45-minute monitoring task on drowsiness, a linear
regression was performed on the nine subsequent KSS ratings of each participant,
and the slope of the regression line was calculated. This allowed the expression
of an increase of drowsiness in a single number while omitting interpersonal
differences emerging from the ordinal nature of the scale (e.g., an increase from
level 1 to 4 yields the same slope as an increase from 4 to 7). Statistical
evaluation using Mann-Whitney U tests revealed no significant differences between
the two age groups or genders. Considering a potential correlation between
trust and drowsiness, a significant positive correlation between the drowsiness
increase (KSS-slope) and trust ratings after the drive (r=.408; n=30; p=.013)
was found.
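The slope computation described above can be illustrated with a minimal sketch.
It assumes an ordinary least-squares fit of the nine ratings against their
measurement index (0 to 8); the function name kss_slope and the example ratings
are hypothetical and not taken from the study data.

```python
def kss_slope(ratings):
    """Least-squares slope of KSS ratings over their measurement index."""
    n = len(ratings)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ratings) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ratings))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical ratings (one every five minutes) for two participants who
# differ only in their starting level.
low_start = [1, 1, 2, 2, 3, 3, 3, 4, 4]
high_start = [r + 3 for r in low_start]

print(kss_slope(low_start), kss_slope(high_start))
```

Both hypothetical participants receive the same slope, which is the property
the text relies on: the slope captures the increase in drowsiness and ignores
the individual baseline level.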
                    M      SD     C. α
Distrust (before)   1.85   1.20   0.887
Distrust (after)    1.56   1.11   0.846
Trust (before)      3.70   0.94   0.864
Trust (after)       4.18   1.11   0.951
KSS-slope           0.49   0.05   -
Table 3.12: Descriptive statistics: values for mean (M), standard deviation (SD),
and Cronbach's alpha (C. α) for trust/distrust items (before/after ride)
and KSS-slope.
3.3 Study 2: Test Track
The statistical evaluation was conducted with IBM SPSS V.24. For evaluating
the drowsiness self-ratings, a total of 1072 ratings, 540 from the manual and
532 from the partially automated drive, were used as a data basis. For each
participant, four data sets were available: two data sets each for the manual
and the automated ride, with ratings from during and after the drive. Due to
technical problems with the driving robot, two subjects had to stop the
partially automated drive after 35 minutes. For these cases, the last two
ratings were missing.
A three-factorial ANOVA for repeated measures with one within-subject factor
(measuring time points) and two between-subject factors (manual/automated,
young/old) was applied separately for the ratings during and after the drive to
evaluate the effects of driving mode, age, and driving time on the development
of drowsiness.
Figure 3.13: Average KSS ratings with 95% CI of all participants during and after
the drive.
Concerning the ratings during (D) and after (Af) the ride for both age groups
and driving modes, whose values are very similar for all measurement points
(see Figure 3.13), the same effects were obtained with the ANOVA. For this
reason, results are presented together. The ANOVA results show a significant
effect of driving time on the development of drowsiness (D: F(8,18)=8.36;
p<.001; ηp²=.788; Af: F(8,18)=6.94; p<.001; ηp²=.755). Therefore, a significant
difference exists in the KSS ratings over the nine measuring time points,
with higher drowsiness levels at the end of the ride. Furthermore, results
show a significant effect of driving mode on the development of drowsiness (D:
F(1,25)=48.84; p<.001; ηp²=.661; Af: F(1,25)=45.59; p<.001; ηp²=.646). Over
the nine measuring time points, a significant difference between manual and
automated driving was evident in the KSS ratings, with higher levels in
automated driving. Moreover, a significant effect of age group was found (D:
F(1,25)=5.11; p=.033; ηp²=.170; Af: F(1,25)=5.39; p=.029; ηp²=.177). Thus,
a significant difference exists between the younger and older subjects in the
KSS ratings over the nine measuring time points, with higher drowsiness levels
for the younger subjects. No significant interaction effects were observed for
driving time and age group (D: F(8,18)=1.482; n.s.; Af: F(8,18)=1.12; n.s.),
driving mode and age group (D: F(1,25)=.05; n.s.; Af: F(1,25)=1.37; n.s.),
driving time and driving mode (D: F(8,18)=.65; n.s.; Af: F(8,18)=2.35; n.s.)
as well as driving time, driving mode and age group (D: F(8,18)=.55; n.s.; Af:
F(8,18)=.74; n.s.).
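As a plausibility check, the reported partial eta squared values can be
recomputed from the F statistics and their degrees of freedom using the
standard identity ηp² = (F · df_effect) / (F · df_effect + df_error). The
following sketch is only an illustration of that identity; the function name
and the list layout are assumptions and not part of the original SPSS analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (label, F, df_effect, df_error, reported partial eta squared) as given in the text
reported = [
    ("driving time, during", 8.36, 8, 18, 0.788),
    ("driving time, after", 6.94, 8, 18, 0.755),
    ("driving mode, during", 48.84, 1, 25, 0.661),
    ("driving mode, after", 45.59, 1, 25, 0.646),
    ("age group, during", 5.11, 1, 25, 0.170),
    ("age group, after", 5.39, 1, 25, 0.177),
]

for label, f_value, df1, df2, eta in reported:
    computed = partial_eta_squared(f_value, df1, df2)
    print(f"{label}: computed {computed:.3f}, reported {eta:.3f}")
```

All six recomputed values agree with the reported effect sizes to three
decimal places.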
Apart from the presented results of the ANOVA, the results and calculated
effects are also visible in several charts. The average KSS ratings
with 95% CI for the considered cases are plotted over time for the manual and
partially automated ride. In addition to the diagrams that compare manual
vs. automated (see Figure 3.14) and young vs. old (see Figure 3.15), the
average ratings of all subjects for both driving modes at the three different
times of the day (see Figure 3.16) were considered. For all these charts, the
ratings collected during the drive were used.
Figure 3.14: Average KSS ratings with 95% CI during drive of all participants for
Figure 3.15: Average KSS ratings with 95% CI during drive of young and older age
Figure 3.16: Average KSS ratings with 95% CI during drive at different times of
When comparing manual and partially automated driving (see Figure 3.14)
regardless of age group, it becomes clear that higher self-ratings were given
during automated driving. This reflects the significant difference in KSS ratings
between the two driving modes over the measuring time points. After just
five minutes of driving, the average difference is almost one KSS level, which
increases to a maximum difference of 1.73 levels by minute 35. The significant
difference in the development of KSS ratings over time in terms of age for both
driving modes is shown in Figure 3.15. The younger subjects gave significantly
higher ratings as time progressed. After ten minutes, an average difference of
1.20 KSS levels is apparent. The difference reaches its maximum after
25 minutes with 1.60 KSS levels. Towards the end of the rides, the two curves,
and thus the KSS levels, converge slightly.
Considering the KSS ratings at different times of the day (see Figure 3.16),
it becomes clear that the lowest ratings were given in the evening. In the
comparison of morning and afternoon, the two curves are not separable over
time; however, higher KSS levels were reached in the afternoon, especially at
the end of the ride.
To compare the differences between the two age groups even more clearly and
with respect to the two driving modes, Figure 3.17 shows how many participants
reached a certain KSS level for both age groups and driving modes. In manual
driving, six of the younger participants reached KSS level 7 and two reached
level 8, whereas for the older subjects the maximum was level 6, chosen by three
subjects. In automated driving, only two older subjects reached level 8 and
none reached level 9. Further, older subjects generally chose lower levels, but
the differences between the two age groups are more pronounced in manual
driving.
Figure 3.17: Comparison of age groups and driving modes in terms of number of
Since heart rate was found to change during drowsiness [10, 30], in study 2,
correlations between drowsiness self-ratings and heart rate data from the
wearable devices were calculated with Spearman's ρ. The reason for choosing
Spearman's ρ is the ordinal and discrete form of the drowsiness self-ratings
and the absence of a bivariate normal distribution in the data. The
participants were equipped with four wearable devices (Empatica E4, Garmin
Forerunner 235, Garmin Vivosmart 3, and Polar A370), two on each wrist. The
focus is on the last three, as these are standard fitness trackers available on
the consumer electronics market. In the following, they are referenced as
follows: Wearable1 (Garmin Forerunner 235), Wearable2 (Garmin Vivosmart 3),
Wearable3 (Polar A370). The self-ratings are available in five-minute intervals,
but the heart rate was measured every second. This was adjusted in a
preprocessing step. For this purpose, the mean value of the heart rate over the
entire five-minute interval was calculated and matched with the corresponding
drowsiness level.
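The preprocessing and correlation steps described above can be sketched as
follows. This is a minimal illustration and not the SPSS procedure used in the
study: it assumes one heart rate sample per second, computes Spearman's ρ as
the Pearson correlation of average ranks (which handles the ties that are
typical for KSS ratings), and uses synthetic data. All function names are
hypothetical.

```python
import math


def mean_per_interval(samples, interval_s=300):
    """Average per-second samples over consecutive intervals of interval_s seconds."""
    means = []
    for start in range(0, len(samples), interval_s):
        chunk = samples[start:start + interval_s]
        means.append(sum(chunk) / len(chunk))
    return means


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank-transformed data.

    Undefined (division by zero) if one of the inputs is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Synthetic example: 15 minutes of per-second heart rate and three KSS ratings.
heart_rate = [70] * 300 + [68] * 300 + [66] * 300
kss = [3, 5, 6]
rho = spearman_rho(mean_per_interval(heart_rate), kss)
print(rho)
```

In this synthetic example the heart rate falls while the rating rises, so ρ is
negative, matching the sign of the correlations reported for the wearables.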
Table 3.13 presents the results of the correlation analysis. Across the
different data sets, a weak linear correlation with the drowsiness self-ratings
was found.
Table 3.13: Results from correlation analysis with Spearman tested separately for
The lowest correlations were achieved with Wearable3. In contrast, the highest
correlations were achieved with the data of the older subjects, with Wearable3
(ρ = -0.351). Significant correlations could be obtained with all three wearable
devices.
A closer look at the change of heart rate throughout the drives (see Figures
3.18 and 3.19) and in comparison to the self-ratings (see Figures 3.14 and 3.15)
makes the differences and reasons apparent. Since the highest correlations on
average across all data sets were found for Wearable1, only data of this
device was considered in these evaluations. Whereas drowsiness shows a constant
increase over time for automated and manual driving as well as for young and
old participants, the heart rate remains, except for smaller fluctuations, at a
relatively constant value throughout the drive. Regarding the heart rate
itself, there are noticeable differences between automated and manual driving
(see Figure 3.18) as well as between the age groups (see Figure 3.19). In
manual driving, the average heart rate for all subjects is 5.79 beats per
minute higher than for automated driving, possibly due to the reduced activity
and workload in automated driving. Furthermore, the average heart rate
for young subjects is 2.40 beats per minute higher than for the older ones.
Figure 3.18: Average heart rate (Wearable1) with 95% CI (dashed lines) of all par-
Figure 3.19: Average heart rate (Wearable1) with 95% CI (dashed lines) of younger
for studies in realistic environments. The higher KSS levels for the young age
group may also be related again to the ESS questionnaire results. On average
and similar to the simulator study, the younger subjects achieved higher
ESS scores. Nine younger subjects, compared to four older subjects, fall into
the level of higher normal, and three younger subjects even fall into the area
of mild excessive daytime sleepiness. In the group of older participants, the
remaining 11 subjects are represented in the lowest level. In general, it can
be noted that despite the relatively short driving time in manual and partially
automated driving, high levels of drowsiness, with higher levels in automated
driving, could be achieved with the chosen study setting on a real test area.
Thus, even in a production car, the duty of monitoring and the non-engagement
in secondary activities during partially automated driving affect drowsiness
within a short time. This issue was also confirmed by the post-questionnaire,
where the majority stated that they had to fight drowsiness more during
automated driving. Another possible reason is the connection between drowsiness
and trust in the automated system. Results showed a significant correlation
between increased drowsiness and trust ratings after the ride with the
automated vehicle. It was found that the subjects rated the distrust items
lower after driving than before driving with the automated vehicle. In return,
the trust ratings increased after the trip. Already after a short initial
system exposure, trust in the automated vehicle was present. Drivers who trust
the automated vehicle more show stronger signs of drowsiness, which may
negatively impact the monitoring behavior. This result is important as the
attested safety risk of drowsy driving could become even more critical with
automated vehicles, which (at SAE level 2) must be permanently monitored by the
driver. On the other hand, provided this assumption also holds for
physiological measurements, drowsiness measures could be used as an unobtrusive
behavioral measure of trust in automation. Increased signs of drowsiness could
thus be interpreted to mean that drivers of automated vehicles accept falling
asleep due to high trust in automation. The behavior of drowsy drivers might
help to infer trust in an unobtrusive way. Research on trust in automation and
drowsiness will be necessary to prevent misuse and successfully implement
automated driving technology.
Regarding H4, a weak linear correlation with the drowsiness self-ratings was
found. It should be noted that the analysis was performed only at five-minute
intervals, and only the mean was considered in terms of heart rate. Therefore,
smaller intervals and other heart rate features could be calculated and
correlated with drowsiness, possibly resulting in higher correlations.
Moreover, the reference for drowsiness in the form of self-ratings could be
called into question because the subjects could have misinterpreted their
current state of drowsiness or given a rating that did not reflect the truth.
A more objective variant, e.g., in the form of observer ratings for shorter
time intervals, could represent a more meaningful ground truth for drowsiness.
With longer driving times, possibly more pronounced changes in the heart rate
signal could have been detected and, as a result, higher correlations with
drowsiness. Related studies show that driving with automation tends to result
in a decrease in heart rate in comparison to manual driving [183, 184, 185].
However, not all studies show consistent results. Regarding time-on-task
effects, subjects could also have become accustomed to the experiment.
Concerning the differences in heart rate between the two age groups, results
from literature show that a decrease in the maximum heart rate comes with
increasing age [186, 187, 188], which could also be confirmed with the data
from consumer-grade wearable devices in this work. Further, in another work, it
was shown that predicting drowsiness for older people with models that were
trained with data of young people is not reliable and sufficient. In contrast,
models trained with data from young people could predict drowsiness for young
people with higher accuracies [4]. Therefore, with the knowledge gained,
intelligent driver-vehicle interfaces intended to warn the driver in the event
of an onset of drowsiness can be adapted and personalized. For example,
individual models for different driving modes and age groups can be developed.
Regarding the study itself, it has to be noted that the test site on which the
study was carried out is limited in size, resulting in extremely monotonous
and safe driving conditions. In a real-world scenario, drowsiness might have
occurred later, because the test site lacks the potential dangers posed by
other road users. Furthermore, it should be noted that certain precautions were
taken to induce drowsiness more quickly in the subjects, e.g., no caffeinated
drinks five hours before the study, the monotonous route, the duty of
monitoring, and the low speed limit. However, in this environment, under
controlled, safe, and above all reproducible conditions, it was possible to
investigate the risk factor of driver drowsiness in manual and partially
automated driving in a more realistic environment and a production car.
Nevertheless, the presented effects should also be examined and validated in
other age groups and with a larger number of participants.
The presented study setting shows that approaching a more realistic sce-
nario under reproducible and above all safe conditions, e.g., when dealing
with safety-critical issues such as drowsiness, is possible if appropriate
preparations and precautions are taken.
4 Model Development: Driver
Drowsiness Detection using
Wrist-Worn Wearable Devices
The previous chapter presented the baseline studies and examined possible
preconditions for the adaptation and personalization of driver drowsiness
detection systems and modeling of different user groups. It was discussed how
the knowledge gained could be incorporated into the development process of
intelligent driver-vehicle interaction concepts for driver drowsiness detection.
This chapter addresses RQ2 (Can driver drowsiness be derived from
vital parameters measured with wrist-worn smart wearables?). The
applicability of wrist-worn wearable devices for driver drowsiness detection in
an automotive environment will be examined. The potential and feasibility of
using physiological data from a wrist-worn wearable device, readily available
in the consumer electronics market, as a single data input for a machine
learning classifier to detect driver drowsiness are explored and evaluated. In
further steps and based on the results, the knowledge gained and information
provided can then be applied to develop multimodal systems with a sensor
fusion approach and merge the data of a wrist-worn wearable device with other
non-intrusive in-vehicle sensors, e.g., a driver monitoring camera. For now,
and within the scope of this thesis, the goal is to investigate which detection
performance can be achieved purely with the wearable device's data.
4.1 Wrist-Worn Wearable vs. Medical-Grade Device
Figure 4.2: RR intervals in QRS complex of ECG signal exemplary taken from a
study participant.
RQ2.3: How do the results differ in the case of user-dependent vs.
user-independent tests?
4.1.1 Method
In order to obtain a reliable and valid ground truth of drowsiness for supervised
machine learning, a two-stage process with a combination of observer ratings
and image processing was applied.
Observer Ratings
Observer ratings of the driver's facial expressions and behaviors were collected
after the study by evaluating the video data recorded while driving. The
45-minute partially automated ride was split into nine intervals, each five
minutes in length. From each 5-minute interval, the fourth minute was extracted
to be rated by the observers. Sandberg et al. found that most driver drowsiness
indicators need to be observed over intervals of 60 seconds or longer to yield
reasonable signs of a driver's drowsiness state [190]. The extracted one-minute
segments were randomized per participant to eliminate the single segments'
time dependency. Otherwise, video segments at the end of a participant's drive
would probably be rated higher than those at the beginning. To increase the
reliability of the results, two trained individuals rated all videos
separately. Following that, the segments with inconsistent ratings were
evaluated and discussed by both raters, and a joint rating was set. The
obtained rating represents the entire 5-min interval, assuming that the
drowsiness state does not change abruptly but rather slowly. The six-level
drowsiness scale by Weinbeer et al. (1: not drowsy; 2: slightly drowsy;
3: moderately drowsy; 4: drowsy; 5: very drowsy; 6: extremely drowsy) was
applied for collecting the observer ratings (see Table 2.2 in Section 2.2.1).
Taking into account the 30 subjects and the 45-minute partially automated
drive, a total of 270 minutes, i.e., 270 ratings, would have been available for
evaluation. However, for some subjects, problems with the video recording
occurred, or the face was only partially visible in the video, e.g., due to an
unusual seating position. These segments were removed so that in the end, 244
minutes were evaluated. Both raters made the same decisions in 191 of 244 cases,
which corresponds to 78.28%. Inter-rater reliability in the form
of Cohen's Kappa resulted in a value of 0.69, which represents substantial
agreement following the classification of Landis and Koch [191].
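To illustrate how the two agreement figures relate, the following minimal
sketch computes the raw percentage agreement and Cohen's Kappa, which corrects
the observed agreement for the agreement expected by chance. The two rating
lists are hypothetical and do not reproduce the study data; only the 191 of
244 agreement count is taken from the text.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who rated the same items.

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of both raters' marginal proportions per level
    expected = sum(freq_a[level] * freq_b[level] for level in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Raw agreement reported in the text: 191 identical ratings out of 244.
print(round(191 / 244 * 100, 2))

# Hypothetical ratings of eight video segments on the six-level scale.
rater_1 = [1, 1, 2, 2, 3, 3, 4, 4]
rater_2 = [1, 1, 2, 3, 3, 3, 4, 5]
print(cohens_kappa(rater_1, rater_2))
```

Because kappa subtracts chance agreement, it is lower than the raw percentage
agreement, which is also the case for the reported values (0.69 vs. 78.28%).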
the respective eyelid closure duration to assign the appropriate level of
drowsiness. With this additional step, the observer ratings could be
cross-checked and enhanced. In total, 201 micro-sleep events were detected for
14 out of 30 subjects. All events detected were manually double-checked based
on the corresponding frame numbers in the video file. As shown in Table 4.1,
the events were split to be assigned directly to drowsiness levels 4 to 6 on
the scale used. As the scale does not consider eyelid closure times between 3
and 4 seconds (level 5: 2–3 seconds and level 6: 4 seconds or more), these
micro-sleep events were added to level 5.
Table 4.1: Categorization and allocation of the 201 detected micro-sleep events to
After obtaining observer ratings and micro-sleep events, both measures were
combined. For each subject, the ratings for the 5-min intervals with
micro-sleep events were adjusted if necessary. The drowsiness level was changed
for 23 out of the 244 received ratings. The drowsiness level was corrected
20 times upwards and once downwards. Two new ratings could be gained,
giving a total of 246 ratings (see Table 4.2).
Observer rating   Micro-sleep level   Occurrences
2                 4                   3
2                 5                   4
2                 6                   1
3                 4                   2
3                 5                   5
4                 5                   3
5                 6                   1
6                 4                   1
n.a.              6                   2
events with number of occurrences for each case; n.a.: no observer rating
available.
Class                     Observer ratings   Observer ratings + Micro-sleep events
non-drowsy (levels 1–3)   212                196
drowsy (levels 4–6)       32                 50
Figure 4.3 presents the distribution over time of the number of ratings for
both the non-drowsy and drowsy class after integrating micro-sleep events.
The number of drowsy ratings increased almost linearly across all subjects up
to a driving time of 30 min. In contrast, the number of non-drowsy ratings
decreased in the same time interval. From minutes 30 to 40, the opposite
is recognizable. From this, it can be deduced that drowsiness increased on
average across all subjects up to minute 30 and decreased from minutes 30 to
40. Towards the end, the level of drowsiness rose again.
Figure 4.3: Localization of number of ratings in time for both the non-drowsy and
drowsy class.
Before presenting the HRV feature extraction method in this work, related ap-
proaches from previous work will be presented.
Vicente et al. investigated two drowsiness detection methods with various
window sizes. First, they extracted features from the time and frequency domain
utilizing windows with a length of three minutes. Second, every minute was
assessed and labeled either non-drowsy or drowsy. They utilized direct
discriminant analysis for classification and the Wilks' lambda minimization
criterion for reducing the number of features. With that approach, they
achieved a sensitivity of 0.59 and a specificity of 0.98 [117]. Zhao et al.
investigated the detection of drowsiness with approximate entropy (ApEn) and
power spectral density (PSD) of an ECG signal. An auto-regressive method was
applied to compute PSD. They found that during drowsiness, the LF PSD
diminishes, and the HF PSD and the ApEn of the ECG increase [192]. Jung et al.
explored the utilization of conductive fabric electrodes on the steering wheel
to record the ECG of the driver and analyze it in the time and frequency
domain. The PSD was calculated with a fast Fourier transform (FFT). For driver
state classification, an ANS balance graph built with LF, HF, LF/HF ratio, and
RMSSD (root mean square of successive differences) was utilized [193]. A
comparable technique (FFT for PSD calculation) was introduced by Nambiar et
al., who trained a neural network for classification and achieved an accuracy
of around 99.99% [194]. Lenis et al. used the Welch method to extract frequency
domain measures from ECG and presumed that during a microsleep, pulse rate
decreased and HRV increased [195].
Based on the presented related work, it can be assumed that an analysis of HRV
with suitable methods and algorithms allows high detection rates concerning
drowsiness.
For the present experiment, the different sampling rates of the two devices were
not adjusted. The aim is to compare the usage of data from a consumer and a
medical-grade device. Since three channels were recorded with the ECG
measuring device, but only one and the same channel was used for all
participants for further analysis, the three channels of all subjects were
visually inspected in terms of data quality and possible artifacts, e.g.,
undetected RR peaks in the ECG pattern, with the Kubios HRV analysis software
[196]. Finally, the RR peaks of ECG channel 1 were applied in further analysis
in raw format. The Empatica E4 wristband uses an algorithm to record the data
and process the PPG/BVP signal. This algorithm filters and removes false peaks
due to noise (e.g., motion artifacts) [127]. For this reason, the raw data of
the wristband was used for the following analyses and was not further filtered
or preprocessed. IBIs from wristband and ECG were processed in the time,
frequency, and non-linear domain for HRV analysis. HRV features were extracted
from 5-min windows of the signal with reference to the Task Force of the
European Society of Cardiology and the North American Society of Pacing and
Electrophysiology that
After extraction, features were labeled for supervised machine learning (see
Figure 4.4). In total, data from 27 out of 30 subjects were available for feature
extraction from the wristband. To provide comparability, the missing three
subjects were also not considered in the case of the ECG. Overall, the numbers
of non-drowsy and drowsy instances are 14627 and 3987 for the wristband and
24149 and 5845 for ECG. It becomes clear that ECG contains more instances
due to a more accurate, higher-resolution measurement than the wristband.
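The extraction and labeling of window-based HRV features can be sketched as
follows for a few of the time-domain features named in this chapter (meanRR,
meanHR, minRR, maxRR, RMSSD, NN50). The sketch makes several assumptions that
are not confirmed by the text: IBIs are given in milliseconds, the 5-min window
is advanced in hypothetical 60-second steps, meanHR is derived as 60000 divided
by meanRR, and each window inherits the rating of the five-minute interval in
which it ends. All function names are illustrative.

```python
import math


def time_domain_features(ibis_ms):
    """Time-domain HRV features of one window (needs at least two IBIs)."""
    diffs = [b - a for a, b in zip(ibis_ms, ibis_ms[1:])]
    mean_rr = sum(ibis_ms) / len(ibis_ms)
    return {
        "meanRR": mean_rr,
        "meanHR": 60000.0 / mean_rr,  # assumed convention: beats per minute
        "minRR": min(ibis_ms),
        "maxRR": max(ibis_ms),
        "RMSSD": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "NN50": sum(abs(d) > 50 for d in diffs),
    }


def sliding_windows(ibis_ms, window_s=300.0, step_s=60.0):
    """Yield (start_s, ibis) for windows of window_s seconds advanced by step_s."""
    beat_times, t = [], 0.0
    for ibi in ibis_ms:
        t += ibi / 1000.0
        beat_times.append(t)
    start = 0.0
    while start + window_s <= beat_times[-1]:
        end = start + window_s
        yield start, [ibi for ibi, ts in zip(ibis_ms, beat_times) if start < ts <= end]
        start += step_s


def label_for_window(end_s, interval_levels, interval_s=300.0):
    """Binary label from the rating of the interval in which the window ends (assumption)."""
    index = min(int((end_s - 1e-9) // interval_s), len(interval_levels) - 1)
    return "drowsy" if interval_levels[index] >= 4 else "non-drowsy"


# Synthetic example: ten minutes of perfectly regular beats, two rated intervals.
ibis = [1000] * 600
levels = [2, 5]
for start, window in sliding_windows(ibis):
    features = time_domain_features(window)
    print(start, features["meanHR"], label_for_window(start + 300.0, levels))
```

With a different step size or labeling rule, the number of instances per class
changes accordingly, so the instance counts reported above cannot be reproduced
from this sketch.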
Figure 4.4: Sliding window approach for feature extraction and labeling exemplary
The ability to generalize to new users is crucial for the establishment of systems
for driver activity recognition. Thereby, the problem of inter-driver variance
has to be taken into account because physiological signals between persons, in
our case drivers of an automated vehicle, can differ to a great extent [197].
We apply a user-independent test (UIT) to deal with this issue. In the UIT, a
leave-one-subject-out cross-validation (LOSOCV) is performed. The data set
of each subject is treated as testing data once. Since data from 27 participants
were collected, in each LOSOCV iteration, 26 participants are used for training
and the remaining one for testing. The prediction results are then averaged
over all subjects (see Figure 4.5). In comparison to the UIT, a user-dependent
test (UDT) is performed additionally in the form of 10-fold stratified
cross-validation (CV) to obtain the overall classification accuracy and
decrease the effect of inter-driver variance. Stratified cross-validation was
utilized because each fold reflects the class distribution of the original data
set. Given the present class imbalance, this ensures the same proportion of
drowsy and non-drowsy samples in each cross-validation run. Moreover, it
reduces both bias and variance compared to regular k-fold cross-validation,
where the data set is only randomly divided into k folds [198].
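The difference between the two test schemes can be made concrete with a small
sketch of the index splits. It is a simplified illustration under the
assumption that every instance carries a subject identifier and a class label;
the function names and the synthetic data are hypothetical, and the actual
experiments may have relied on the built-in routines of the machine learning
toolkit that was used.

```python
import random
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), once per subject (UIT)."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds that preserve the class proportions (UDT)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled instances of each class round-robin over the folds
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Synthetic data set: 27 subjects with ten instances each.
subjects = [s for s in range(27) for _ in range(10)]
splits = list(leave_one_subject_out(subjects))
print(len(splits), len(splits[0][1]), len(splits[0][2]))

# Synthetic imbalanced labels: 80% non-drowsy, 20% drowsy.
labels = ["non-drowsy"] * 80 + ["drowsy"] * 20
folds = stratified_folds(labels, k=10)
print([sum(labels[i] == "drowsy" for i in fold) for fold in folds])
```

In the stratified folds, instances of the same subject end up in both training
and testing data, which is why the UDT is user-dependent, whereas the LOSOCV
split never tests on a subject that was seen during training.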
Performance Measures
Concerning performance measures, accuracy as one of the traditional measures
might be suitable, but not when dealing with unbalanced data, since its focus
is more on the majority classes than on the minority ones [201]. Thus, the
F-measure is used additionally. For both UDT and UIT, accuracy and F-measure
were calculated and averaged across all iterations (see Figure 4.5). Concerning
the presented binary classification problem, it is, of course, important to
correctly detect when the driver is in a drowsy state. However, from the
customer's point of view, it is also crucial to correctly detect when the
driver is in a non-drowsy state so as not to irritate the driver with
unnecessary drowsiness warnings. Given a standard confusion matrix with the
values for True Positive (TP), True Negative (TN), False Positive (FP), and
False Negative (FN), the formula for the F-measure (see Equation (4.1)) does
not take the True Negative (TN) values into account.
\[ F\text{-measure} = \frac{2\,TP}{2\,TP + FP + FN} \tag{4.1} \]
In the presented case, the correctly classified instances of the negative class,
representing drowsy, would not be considered. For this reason, at each
cross-validation iteration, the value for F-measure is calculated per class.
For the non-drowsy, i.e., positive class, this is further referenced as F1, and
for the drowsy, i.e., negative class, as F2. An average value is then presented
for both F1 and F2 across all subjects. Therefore, F2 is the crucial measure for
the detection of drowsiness. Its value reflects how reliably drowsy instances
were classified as drowsy.
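The per-class evaluation can be sketched by applying Equation (4.1) twice, once
with non-drowsy and once with drowsy treated as the class of interest. The
sketch follows the naming used in this section (F1 for non-drowsy, F2 for
drowsy), which should not be confused with the F1 and F2 scores of the F-beta
family. The example predictions are hypothetical.

```python
def f_measure(tp, fp, fn):
    """F-measure as in Equation (4.1); defined as 0.0 if the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def per_class_f_measure(y_true, y_pred, classes=("non-drowsy", "drowsy")):
    """F-measure per class, each class in turn treated as the class of interest."""
    scores = {}
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores[cls] = f_measure(tp, fp, fn)
    return scores


# Hypothetical imbalanced test fold and a classifier that always predicts non-drowsy.
y_true = ["non-drowsy"] * 8 + ["drowsy"] * 2
y_pred = ["non-drowsy"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_f_measure(y_true, y_pred)
print(accuracy, scores["non-drowsy"], scores["drowsy"])
```

The example shows why accuracy alone is misleading for unbalanced data: the
trivial classifier reaches 80% accuracy and a high F1 for the non-drowsy class,
while F2 for the drowsy class is zero.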
4.1.2 Results
The feature selection procedure using CFSS was performed on the training
data. In the next step, the features in the testing data were adjusted
accordingly. This was done before the actual training and testing of the machine
learning model. Table 4.4 presents all selected feature subsets in CFSS for UDT
and UIT for both devices. The total number of available subsets equals the
UDT
8× meanRR, meanHR
Wristband 1× meanRR, meanHR, RMSSD
1× meanRR, meanHR, RMSSD, NN50
UIT
6× minRR, minHR
4× maxRR, minRR
Table 4.4: Selected Feature Subsets in CFSS for User-Dependent Test (UDT) (10
When looking closer at the selected feature subsets for the UDT, it can be seen
that they consist only of time-domain features. However, the subsets of both
devices do not contain any identical features. In the case of the wristband, a
total of five different features are selected. Each subset includes meanRR and
meanHR. In the case of ECG, maxRR and minRR appear in eight subsets.
Furthermore, maxHR and minHR were selected. The low number of subsets
can be related to the fact that all subjects' data were randomized and evenly
divided into ten folds in a stratified way to counter inter-driver variance
between the subjects.
In comparison to UDT, 11 different feature subsets were selected with data from
the wristband for UIT, which is almost four times as many as for UDT. In
addition to time-domain features, features from the frequency and non-linear
domains were selected in the feature subsets. It becomes clear that inter-driver
variance influences the choice of features even if only a single person is
removed from the data set. This can also be recognized for ECG. Twice as many
feature subsets exist, but, as for the UDT, they contain only features from the
time domain, namely min/max values of the RR and HR signals.
In general, for both the UDT and UIT, the majority of selected feature subsets
for the wristband and ECG mainly consist of time-domain features. In the
present case, the features' importance can be ranked in descending order as
follows: time-domain, frequency-domain, non-linear domain.
Table 4.5 shows the results of UDT and UIT, for both devices and all models
tested, with the respective values for accuracy and F-measure. Focusing
on the UDT accuracy, it can be seen that ECG data produced better results
on all tested classifiers except NB. These differences are, in some cases, more
pronounced, as for BN, SVM, DS, and DT, but for KNN, RF, and RT, the difference
is only a few percentage points. For the wristband, the highest accuracy
of 92.13% was achieved with KNN, followed by 91.58% with RF and 90.02% with RT.
For ECG, the classifiers RF and RT performed best with an accuracy of 97.37%,
followed by BN with 96.85% and DT with 91.18%. In terms of F-measure, it is
noticeable that the values for F1 and F2 diverge more for the wristband. Looking
at RF and RT, the values for F1 are 0.94 and 0.93, and the values for F2 are
0.82 and 0.79. For ECG, these are 0.98 and 0.94 each. In general and concerning
F1, the values of the wristband are slightly lower but comparable to ECG.
Except for DS and MLP, which have higher values for F2 than for ECG, the
differences between F1 and F2 are larger. For ECG, the values for F1 and
F2 are very high and at a similar level for BN, KNN, RF, RT, and DT. This
speaks for a low number of false positives and false negatives and an equally
satisfying classification in both classes. The assessment of drowsy instances as
really drowsy has been shown to be more difficult for the models when working
with data from the wristband, which is reflected in lower values for F2 and
a higher number of false negatives than with ECG.
With regard to UIT and except for SVM (65.64% accuracy), overall lower
classification accuracies were achieved when compared to UDT. NB (66.74%),
SVM (65.64%), and MLP (25.84%) yielded better results with data
from the wristband. All other models achieved higher results with HRV from
ECG data. However, the differences between the two devices are not as high
as in the UDT case. With wristband data, DS achieved the highest accuracy with
73.39% and was the only model above 70%. Regarding ECG data, the classifiers
NB (41.84%), SVM (40.01%), and MLP (25.84%) did not reach the threshold of 50%.
In terms of accuracy, DS scored the best with ECG at 78.94%.
Table 4.5: Classification results with performance measures for UDT/UIT, Wrist-
it also becomes clear that the performance depends very much on the individual
subject, reflecting the strong influence of the inter-driver variance. Depending
on the model and the type of data, strong performance fluctuations exist within
this small extract from the data set.
                   Participant 3          Participant 4
Model  Device      A      F1    F2        A      F1    F2
BN     Wristband   59.90  0.75  0.00      20.45  0.32  0.04
BN     ECG         51.73  0.59  0.41      36.57  0.54  0.00

                   Participant 21         Participant 27
Model  Device      A      F1    F2        A      F1    F2
BN     Wristband   52.30  0.65  0.27      18.02  0.17  0.20
BN     ECG         77.06  0.87  0.00      85.49  0.92  0.51

Table 4.6: Exemplary classification results of UIT for selected classifiers and sub-
In the following, the presented RQs will be answered, and implications for
further research in the development of novel techniques for driver drowsiness
detection derived.
RQ2.1: Is it possible to reliably detect driver drowsiness by using
physiological data (HRV) from a wrist-worn wearable device as a
4.2 Wrist-Worn Wearable vs. Wrist-Worn Wearable
promising models could then increase performance further. Since the focus in
this work was on a specific type of feature selection (CFSS) and class balancing,
other methods should be considered and compared.
Due to the class imbalance in the data set with a higher number of non-
drowsy samples, in the UIT, especially drowsy instances were more chal-
lenging to classify.
In the UIT, the wristband achieved 73.39% accuracy with a DS. The
maximum for ECG was 78.94%.
For both UDT and UIT, the obtained results with data from wristband
and ECG are comparable.
4.2.1 Method
Figure 4.6: Average KSS ratings with 95% CI of all participants during SAE level
The drowsiness ratings rose continuously throughout the drive, from an
average KSS level of 3.37 after five minutes to 5.61 after 40 minutes. The
ratings decreased marginally in the last five minutes. To give a better
impression of the drowsiness self-ratings given during partially automated
driving, the distribution of ratings given by all 30 subjects across all nine
levels of the KSS is plotted in Figure 4.7(a). It can be seen that the ratings
are unevenly distributed, from a minimum of one rating at level 1 to a maximum
of 58 ratings at level 3. Level 4 was chosen 47 times. Level 2 with 28, level 5
with 37, level 6 with 33, and level 7 with 32 ratings are on a similar level.
KSS levels 8 and 9 received 14 and 15 ratings, respectively. Additionally, the
nine KSS levels were grouped into three and two levels with a view to the later
2- and 3-level classification of drowsiness in supervised machine learning. The
categorization of the KSS levels was based on the results in the work of Ingre
et al. [79]. In that work, a simulator study was conducted to investigate
sleepiness and accident risk and to assign subject-level relative risks to the
various levels of sleepiness recorded with the KSS. The KSS was recorded every
five minutes, and crashes, accidents, and incidents were logged. The outcomes
demonstrated that sleepiness is strongly related to the risk of an accident.
For an average participant, the risk of an accident was 28.2 times higher at
KSS level 8 and 185 times higher at KSS level 9 than at KSS level 5. The
grouping of KSS levels was derived from the predicted probability of an
incident during the drive. Based on this, KSS levels 1-4 represent the state
awake (group 1), levels 5 and 6 the transition state (group 2), and levels 7-9
the state drowsy (group 3) (see Figure 4.7(b)). For the 2-level case, group 1
represents the non-drowsy and group 2 the drowsy class (see Figure 4.7(c)).
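The 3-level grouping described above can be sketched in a few lines of Python. This is only an illustration: the function name and the group labels are chosen here for readability, and the per-level counts are copied from the text, with the count of 33 assigned to level 6 on the assumption that it is the one level not otherwise named.

```python
from collections import Counter


def kss_to_three_level(kss: int) -> str:
    """Map a KSS rating (1-9) to the three groups described in the text."""
    if not 1 <= kss <= 9:
        raise ValueError("KSS ratings range from 1 to 9")
    if kss <= 4:
        return "awake"       # group 1: KSS levels 1-4
    if kss <= 6:
        return "transition"  # group 2: KSS levels 5 and 6
    return "drowsy"          # group 3: KSS levels 7-9


# Number of self-ratings per KSS level as reported in the text
# (assumption: the count of 33 belongs to level 6).
ratings_per_level = {1: 1, 2: 28, 3: 58, 4: 47, 5: 37, 6: 33, 7: 32, 8: 14, 9: 15}

# Aggregate the per-level counts into the three groups.
group_counts = Counter()
for level, count in ratings_per_level.items():
    group_counts[kss_to_three_level(level)] += count

print(dict(group_counts))
```

Summing the counts per group makes the class imbalance visible that motivates the later use of the F-measure alongside accuracy.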
Figure 4.7: Distribution of self-ratings across KSS levels (a) and grouped levels for
Concerning the recording of heart rate data, Wearable1 and Wearable3 provided
a heart rate value every second, whereas Wearable2 provided one, on average,
every three seconds across all 30 subjects. Figure 4.8 illustrates the average
heart rate of all subjects over the 45-minute SAE level 2 automated drive. It
can be seen that after a few minutes, the data of all three devices show a
similarly sized drop in heart rate. The heart rate of all three devices is
characterized by many fluctuations, some of which are strong, with a minimal
increase over the driving duration. It can also be determined that the values
of Wearable2 and Wearable3 are at a very similar level and only drift apart
towards the end. In contrast, Wearable1 has a similar course per se, but one
that is a few beats lower than Wearable2 and Wearable3. The curves approach
each other slightly towards the end of the drive. Data from the same 28 out of
the 30 participants were available for each wearable for feature extraction.
Again, a sliding window with a length of five minutes and a 2-second increment
was applied for calculating the feature vectors. The following six features
were extracted (abbreviations in brackets) from the received heart rate data:
maximum heart rate (maxHR), minimum heart rate (minHR), average heart rate
(meanHR), standard deviation of heart rate (stdHR), range of heart rate
(rangeHR), and median of heart rate (medianHR). Feature vectors were then
labeled with the corresponding KSS rating for preparing the data sets, where a
rating represents the entire 5-minute interval.
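As a rough illustration of the windowing step, the following sketch computes the six heart rate features over a sliding window. It assumes evenly sampled data (one value per second, as for Wearable1 and Wearable3) and uses the population standard deviation; neither the function name nor these choices are taken from the thesis implementation.

```python
from statistics import mean, median, pstdev


def extract_features(hr, window_s=300, step_s=2, sample_period_s=1):
    """Compute one feature vector per sliding window.

    hr: heart rate values in beats per minute, sampled every
        sample_period_s seconds (assumed to be evenly spaced).
    """
    win = window_s // sample_period_s          # samples per window
    step = max(1, step_s // sample_period_s)   # samples per increment
    features = []
    for start in range(0, len(hr) - win + 1, step):
        w = hr[start:start + win]
        features.append({
            "maxHR": max(w),
            "minHR": min(w),
            "meanHR": mean(w),
            "stdHR": pstdev(w),   # assumption: population standard deviation
            "rangeHR": max(w) - min(w),
            "medianHR": median(w),
        })
    return features
```

With a 45-minute recording at one value per second, this yields one feature vector every two seconds once the first five minutes of data are available. A device with irregular sampling, such as Wearable2, would first need resampling or a time-based window.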
Figure 4.8: Comparison of average heart rate in beats per minute of all participants
Performance Measures
Accuracy (further referenced as A), i.e., the number of correctly classified
instances in percent, and F-measure (further referenced as F), i.e., the
harmonic mean of precision and recall (sensitivity), will be used as performance
measures; the latter one primarily due to the present class imbalance. Besides
the reliable detection of drowsiness and from the customer's point of view, it
is also crucial to recognize the non-drowsy state. The driver does not want
to be irritated by false drowsiness warnings. Therefore, high values for both
recall and precision are needed.
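The reason accuracy alone is not sufficient under class imbalance can be shown with a small example. The function below is a minimal sketch of the standard definitions for one class of interest; the function name and the toy labels are illustrative and not part of the thesis.

```python
def binary_metrics(y_true, y_pred, positive="drowsy"):
    """Return accuracy (in percent), precision, recall, and F-measure
    for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = 100.0 * sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure: harmonic mean of precision and recall.
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return accuracy, precision, recall, f


# Toy example: 8 non-drowsy and 2 drowsy instances; the model always
# predicts "non-drowsy".
y_true = ["non-drowsy"] * 8 + ["drowsy"] * 2
y_pred = ["non-drowsy"] * 10
accuracy, precision, recall, f = binary_metrics(y_true, y_pred)
```

In this toy case, the accuracy is 80% although not a single drowsy instance is detected, and the F-measure for the drowsy class is 0.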
ing models from different categories were evaluated and compared in terms of
performance for the presented classification problem. Again, models were not
fine-tuned or optimized; instead, the parameter values preset in the Weka
library [189] were used. From the group of tree classifiers, Random Tree (RT)
and Random Forest (RF) (100 trees) were selected. Decision Table (DT) (search
algorithm: Best first, evaluation measure: root-mean-square error (RMSE)) and
Partial Decision List (PART) were applied from the group of rule-based
classifiers. K-Nearest Neighbor (KNN) (no distance weighting, number of
neighbors: 1, search algorithm: brute-force, distance function: Euclidean) from
the category of lazy learners, and Bayesian Network (BN) (estimator: simple
estimator, search algorithm: K2) and Naive Bayes (NB) classifier from the group
of Bayesian classifiers were tested. An SVM classifier (kernel: polynomial, C: 1)
represented a function-based classifier. A Multilayer Perceptron (MLP) (batch
size: 100, hidden layers: (number of features + number of classes)/2, learning
rate: 0.3, momentum: 0.2) served as neural network.
4.2.2 Results
2-level Classification
3-level Classification
Table 4.7: Selected feature subsets for each wearable device in 10-fold CV for two
Table 4.8 shows the 2-level and 3-level classification results using 10-fold
stratified CV with corresponding performance measures (A and F) for all three
wrist-worn wearables used.
Looking at the results for the 2-level classification of driver drowsiness, they
can be divided into two or even three groups in terms of performance. With
the data of all three wearables, KNN, RF, RT, PART, and DT achieved very
high accuracies and F-measures, particularly with Wearable3: KNN (A:
99.42%, F: 0.99), RF (A: 99.34%, F: 0.99), RT (A: 99.13%, F: 0.99), PART (A:
98.53%, F: 0.99), DT (A: 96.03%, F: 0.96). Overall, slightly poorer results
were achieved with BN, MLP, and SVM. The lowest classification performance
was reached with NB, which is even below the threshold of 50% for Wearable1
(A: 40.60%, F: 0.42) and Wearable2 (A: 36.77%, F: 0.37). With regard to the
devices used, it becomes clear that Wearable2 achieved the lowest accuracies
and F-measures across all classifiers tested.
When considering the 3-level classification results, similar results were
obtained as in the 2-level classification case. The best classification results
were again achieved with KNN, RF, RT, PART, and DT. The data from Wearable1
performed best with KNN (A: 98.59%, F: 0.99) and PART (A: 97.68%, F: 0.98),
the data from Wearable3 with RF (A: 98.56%, F: 0.99), RT (A: 98.01%, F:
0.98), and DT (A: 93.83%, F: 0.94). With regard to the existing strong class
imbalance, the high values for F-measure (0.99) are particularly noteworthy;
they correspond to high precision and recall and thus to a uniformly successful
classification in all classes. This is followed by BN (A: 74.06%, F: 0.75), MLP
(A: 53.22%, F: 0.53), SVM (A: 52.69%, F: 0.53), and NB (A: 49.36%, F: 0.50) in
descending order, for which the best results were achieved with data from
Wearable3. As in the 2-level case, NB is the only classifier below 50% accuracy
for all three devices used and, in particular, with Wearable2 (A: 25.95%, F:
0.23). The differences in the best-performing classifiers within the tested
devices are small, but Wearable2 shows the lowest values for accuracy and
F-measure across all classifiers.
In general, the best results were achieved with a model from the group of lazy
learners (KNN), followed by the tree (RF, RT) and rule-based classifiers (DT,
PART). Neural networks (MLP) and Bayesian classifiers (BN, NB) performed
less successfully.
In the following, the presented RQs will be answered, the results discussed, and
implications for further research derived.
RQ2.4: Is it possible to reliably detect driver drowsiness by using
exclusively physiological data (heart rate) from a wrist-worn wearable
device in combination with a common machine learning classifier?
In general, this RQ can be answered with yes. High accuracies (>99%) and
F-measures (0.99) were achieved in both 2- and 3-level classification. However,
as can be seen from the results, successful detection strongly depends on the
classifier type. In the present case, classifiers from the group of lazy
learners (KNN), tree (RF, RT), and rule-based (DT, PART) classifiers were more
suitable than Bayesian (BN, NB) or function-based classifiers (SVM) as well as
neural networks (MLP). Other reasons may be the type of feature extraction,
feature selection, and the ground truth applied. Another important point to
address is the way the performance of the models has been tested. In this work,
a 10-fold stratified CV was used. Cross-validation is a commonly used
statistical method to compare and select models for a given predictive modeling
problem, to estimate the skill of a model on unseen data, and to get an
impression of its prediction capability. However, before the data set is
divided into ten folds, it is shuffled. Therefore, a subject's data could be in
both the training and test data sets. For this reason, in a future step, the
best-performing models have to be tested with entirely new data in order to
assess to what extent they are capable of generalizing to new data. With regard
to the results, it should also be noted that the data set for training and
testing the models contained only 30 subjects from two specific age groups.
Thus, the results may differ if a more extensive data set with subjects from
other age ranges were available.
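One common way to rule out that a subject's data appears in both training and test sets is to split by subject instead of by instance. The following sketch shows a leave-one-subject-out split in plain Python; it illustrates the general idea under the assumption that each feature vector carries a subject identifier, and it is not the evaluation procedure used in this experiment.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices).

    subject_ids: one subject identifier per feature vector. All feature
    vectors of the held-out subject go to the test set, so no subject
    contributes to both training and testing within a split.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical example: five feature vectors from three subjects.
subject_ids = [1, 1, 2, 2, 3]
splits = list(leave_one_subject_out(subject_ids))
```

Because the overlapping sliding windows of one subject are highly similar, a shuffled instance-level split tends to give more optimistic estimates than such a subject-wise split.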
RQ2.5: How do results of different devices differ? Generally speaking,
the three devices used show similar tendencies in all tested classifiers, from
which a similar measurement accuracy can be derived. Therefore, for later
in-vehicle usage, different wearables could be considered. It is noteworthy,
however, that Wearable2 achieved, in some cases, poorer results compared
to the other two devices. Wearable2 only provides a heart rate value every 3
seconds on average across all subjects, which could be attributed to a different
analysis of the PPG signal or a possibly higher susceptibility to external
influences (vibrations, heavy arm movements). The other two wearables deliver a
heart rate value every second. The reduced number of data points may have
negatively affected the expressiveness of some of the extracted features and
thus the classification accuracy. It can be deduced that better results could
be achieved for the present case by sampling the heart rate at a higher
frequency. However, this finding needs to be investigated with other wrist-worn
wearables and larger data sets.
RQ2.6: How do results for a different number of drowsiness levels
differ? To answer this research question, it can be said that the differences
between the results of 2-level and 3-level classification are relatively small,
with slightly better performances in the 2-level case. However, it is
noteworthy that in the case of 3-level classification, the differences between
the devices are lower than in the 2-level case. This speaks for the use of
wrist-worn wearables in both cases. However, it depends on the individual use
case whether drowsiness should be classified as binary (non-drowsy and drowsy)
or whether the transition state from non-drowsy to drowsy, which includes the
onset of drowsiness, is to be detected as well.
One limiting factor of this experiment might be the choice of ground truth
for drowsiness. As in the previous experiment, the applied models were used
with the default parameters preset in the Weka machine learning library, and
no hyper-parameters were tuned in the current development stage of the
detection models. Further, other types of feature selection and class balancing
should be considered.
High values for accuracy (>99%) and F-measure (0.99) were achieved.
In the two experiments presented, both observer ratings and self-ratings were
used as the ground truth for drowsiness. Several other approaches can be found
in the literature, and no standard exists. Therefore, a further investigation,
presented in the following section, was carried out on this topic. This
investigation is not directly related to RQ2 (Can driver drowsiness be derived
from vital parameters measured with wrist-worn smart wearables?). However,
since it is a fundamental and essential topic for obtaining a high detection
accuracy when dealing with supervised machine learning, it belongs to the same
context. It will be included in the overarching discussion at the end of this
Ph.D. thesis.
4.3 Ground Truth for Drowsiness: A Complexity Analysis
Systems for driver drowsiness detection are increasingly being proposed, whose
implementations are based on machine learning, mainly supervised machine
learning in the form of a classification problem, as in the previous two
experiments. These algorithms are data-driven and require a sufficient amount
of labeled training data. The quality of the labels determines the quality of
the machine learning algorithm and thus of the drowsiness detection system.
However, various approaches exist for collecting a ground truth for drowsiness.
As described above, in the first experiment, a combination of observer ratings
and image processing served as ground truth. In the second experiment,
self-ratings were applied. The acquisition is, in most cases, tailored to the
respective study. So far, and in terms of the comparability of different works
with each other, a uniform process, general guidelines, or recommendations are
missing. Looking at previous works, the majority apply self-ratings, observer
ratings, or hybrid solutions. In this experiment, observer ratings, which are
mostly assessed for time intervals of one-minute length, will be analyzed in
terms of the necessary complexity and whether a reduction in rating frequency
affects the quality of the ratings. The findings of this experiment can be
applied in optimizing the collection of ground truth for drowsiness. Moreover,
they can serve as a starting point for further research in this area with the
ultimate goal of standardizing the collection of a valid ground truth for
driver drowsiness. In the next section, an overview of related work is given,
and the research questions for this experiment are derived.
In this section, an overview of different ground truth types for driver
drowsiness from related work is given, including self-ratings, observer
ratings, and hybrid approaches. This is followed by the research goals of this
experiment.
In the work of Mehreen et al., a hybrid approach for detecting drowsiness was
applied. EEG was combined with data from the accelerometer and gyroscope.
KSS ratings served as ground truth and were polled every minute during
12-minute drowsy and fresh drives [90] by filling in a questionnaire.
Friedrichs et al. evaluated 90 hours of real driving data collected by over 900
drivers. During these trips, the KSS was queried verbally by a co-driver every
15 minutes. For drowsiness classification, the 9-point Likert scale was grouped
as follows: awake (KSS ≤ 6), questionable (6 < KSS < 8), and drowsy (8 ≤ KSS).
Facial features from recorded video data formed the data basis and were
classified using an artificial neural network (ANN) [81]. Fu et al. presented a
Hidden Markov Model (HMM) for evaluating ECG, EMG, and respiratory data. During
a 3.5-hour drive in real traffic, the drivers had to report their drowsiness
about every 15 minutes by means of the KSS after each driving session. Specific
levels of this scale were categorized into three levels: alert (1-3), mild
fatigue (3-5), fatigue (5-7) [162]. Leng et al. used data from PPG and GSR
sensors to develop an SVM model for drowsiness detection. Again, KSS ratings,
recorded during the drive based on the number of minutes and verified by the
participants by watching recorded videos, were applied as ground truth. The
9-level scale was divided into five levels for classification: level 1 (1-2),
level 2 (3-4), level 3 (5-6), level 4 (7-8), level 5 (9) [129]. Gielen and
Aerts conducted a driving simulator study with 26 subjects with a maximum
driving time of 150 minutes. For the detection of drowsiness, heart rate and
the temperature of the nose and wrist were classified utilizing a binary
decision tree. SSS ratings served as labels, which had to be submitted every
five minutes while driving. A person outside the simulator signaled the moment
for the rating with a hand gesture. From a score of five upward, the
participant was labeled as drowsy, below as non-drowsy [146].
In the work of Li et al., the steering wheel angle time series were evaluated
for detecting driver drowsiness. Data from six participants, recorded during a
90-minute monotonous real-world drive, served as a data basis for a binary
decision classifier. For the development of their drowsiness detection model,
three experts rated 1-minute video segments as either awake or drowsy. If
the raters achieved no consensus on a sequence, the considered sample was
skipped [203]. Jacobé de Naurois et al. conducted a driving simulator study
with 21 participants and a driving time between 100 and 110 minutes. Several
different kinds of measures (behavioral, vehicle-based, and physiological) were
recorded that served as input for an artificial neural network (ANN). Two
raters analyzed each minute of the recorded videos independently based on
the methodology published by Wierwille and Ellsworth [52]. This rating scale
ranges from level 0 (alert) to level 4 (extremely drowsy) in steps of 0.5.
The average of both ratings was applied as ground truth for drowsiness.
The threshold between not drowsy and drowsy was set at 1.5 [204]. A
brain-machine interface was developed in the work of Li et al. [91], which fuses
EEG with head movements to detect drowsiness with a binary SVM classifier. Six
Wang et al. analyzed EEG features for drowsiness detection. For gathering a
ground truth for drowsiness, the subjects' self-assessments were combined with
observer ratings. After four 10-min monotonous simulated drives, 15 subjects
rated their current state of drowsiness by filling in the SSS. For observer
ratings, the facial expressions for the same periods were evaluated using a
three-level scale: clear-minded (0), tired (1), and fatigue (2). To establish a
connection between the seven-stage SSS and the three-stage video evaluation,
the following technique was applied to filter inconsistent ratings:
clear-minded corresponds to SSS levels 1 and 2, tired to levels 3 and 4, and
fatigue to levels 5 to 7. With an SVM classifier, a drowsiness detection model
was built [92]. In the work of McDonald et al., four kinds of measures were
applied for receiving a ground truth for drowsiness and creating the evaluation
data set for their proposed drowsiness detection algorithm. Two observers
evaluated a 1-minute time window before the occurrence of a drowsiness-related
lane departure. For the ratings, a 5-step scale ranging from not drowsy to
extremely drowsy and based on Wierwille and Ellsworth [52] was applied. If
ratings were greater than two, they were assigned to the class drowsy. To be
classified as awake by a Random Forest model, three more tests had to result in
awake: the Psychomotor Vigilance Test (PVT) [43] and self-assessments using the
SSS and a retrospective sleepiness scale [83]. Lee et al. conducted a simulator
study with six participants and combined data from a PPG sensor with ECG.
Labels were assigned by evaluating 1-minute segments of videos of the driver's
face and driving behavior. Each minute was labeled as either drowsy or awake.
Their binary classification in the form of recurrence plots resulted in an
accuracy of 70% [114]. Twenty subjects participated in a one-hour driving study
in the work of Li et al. [206]. Data from an EEG headset served as input for an
SVM classifier. Labels for alert and drowsy data were acquired with a
combination of PERCLOS and the number of adjustments (NOA) while steering [143].
Values of PERCLOS ≥ 12% and NOA ≤ 9 correspond to true drowsy, and
PERCLOS < 8% and NOA > 26 to true alert. Lee et al. evaluated steering
wheel movements with accelerometer and gyroscope data from a smartwatch
on one wrist and combined them with physiological data from a PPG sensor placed
on a sports wristband on the other wrist. From the data collected during a
3-hour simulator drive with 12 participants, time, phase space, and
spectral-domain features were calculated and classified with a WFCM model.
Self-ratings combined with observer ratings, both collected every two minutes
based on the KSS, served as ground truth. KSS levels 1-5 represent wakefulness,
and 6-9 drowsiness [145].
From the presented related work, it can be seen that a wide variety of different
approaches were applied for gathering a ground truth for driver drowsiness.
With regard to self-ratings, rating frequencies ranging from one to 15 minutes
can be found, applying scales such as the KSS or SSS. For the rating itself,
different types of requests (e.g., sound, visual hint), feedback (e.g., verbal,
entering the rating on a tablet PC, pointing with a finger at a printed scale),
and scales with a different number of drowsiness levels were applied. Although
these kinds of ratings are often applied as ground truth for drowsiness, their
use is also discouraged. One reason is their subjectivity; another reason is
that, depending on the type and frequency of rating requests, they have an
alerting effect on the driver, which can interfere with the development of
drowsiness. In contrast to self-ratings, observer ratings are a promising
alternative and are increasingly being applied as ground truth. Since external
observers collect them either in real time or offline, they are not intrusive
and thus do not negatively affect the drivers and their drowsiness state during
driving. In general, they provide a more objective ground truth for drowsiness.
As for self-ratings, different scales and a different number of drowsiness
levels are used for assessing them. However, there is more consensus on the
frequency of the ratings in previous works, mostly one minute.
of drowsiness, usually two (e.g., not drowsy, drowsy) or three (e.g., not drowsy,
transition state with the onset of drowsiness, drowsy) levels are considered (see
Figure 4.9).
Figure 4.9: Label granularity for drowsiness based on the drowsiness scale by Wein-
RQ2.7: Does the quality of the ground truth for drowsiness decrease
with decreasing rating frequency?
4.3.2 Method
For this experiment, the recorded videos of the driver's face/upper body from
both manual and SAE level 2 driving from study 1 were evaluated offline by
external raters. In the following, this procedure will be described in detail.
Figure 4.10: Extracted sample image of a recorded video file from study 1.
working on the night shift are not allowed to rate. Each rater may rate for a
maximum of one hour at a time and is then obliged to take a break of at least
one hour. Each rater may rate for a maximum of four hours per day. Raters who
feel, at their own discretion, that they can no longer continue rating at the
usual quality should take a break. Furthermore, training material was provided
for the raters to ensure a high quality of the video ratings. The aim was to
calibrate the raters among themselves and each individual rater with him- or
herself. The training proceeded as follows: As an introduction, the raters were
sensitized to the following points: Subjective drowsiness judgments are not
allowed. The assessments have to be based exclusively on the observed
indicators on the drowsiness scale. Outliers in the sense of extreme forms of
individual indicators should not be taken into account in forming the overall
rating. Instead, the average of an indicator over the 1-minute video sequence
has to be formed. For the training, a set of training videos was provided for
each level of drowsiness on the provided scale. The evaluation of these sample
video sequences had to be done individually by each rater during training.
Deviations in the ratings were then discussed with the aim of standardization.
Moreover, all other indicators were discussed regarding their occurrence.
Depending on the progress in the video evaluations, the training was repeated
at regular intervals. Again, the drowsiness scale published by Weinbeer et al.
was applied for the observer ratings [50].
4.3.3 Results
Considering the 30 participants and the 45-minute SAE level 2 and manual
drives, a total of 60 videos, corresponding to 2700 minutes or 45 hours,
would have been available for further analysis. However, for some participants,
problems with the video recording occurred, or the face was only partially or
not at all visible, e.g., due to an unusual seating position. These segments
were removed so that 2465 minutes (around 41 hours) were usable for video
rating. Since the 1-minute video sequences were randomized for the rating,
the ratings obtained were first rearranged chronologically. In the following,
no distinctions, comparisons, or separate evaluations were conducted regarding
driving mode (manual, partially automated) or age (young, old). All available
ratings were considered as a single data set. In the event of a difference of
two or more levels of drowsiness, the sequence had to be discussed with a third
rater and an agreement reached. This was not the case for the existing ratings.
Therefore, for further analysis, the 2465 ratings of both raters were averaged.
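The rule for combining the two raters can be sketched as follows. The function and variable names are illustrative, and marking disputed segments with `None` is an assumption made here for the sketch; the text only states that such segments had to be discussed with a third rater.

```python
def combine_ratings(rater1, rater2, review_threshold=2):
    """Average two raters' per-minute ratings.

    Segments whose ratings differ by `review_threshold` or more levels
    are not averaged but flagged for discussion with a third rater.
    Returns (combined_ratings, indices_needing_review).
    """
    if len(rater1) != len(rater2):
        raise ValueError("both raters must rate the same segments")
    combined, needs_review = [], []
    for i, (a, b) in enumerate(zip(rater1, rater2)):
        if abs(a - b) >= review_threshold:
            needs_review.append(i)
            combined.append(None)  # assumption: left open until agreement
        else:
            combined.append((a + b) / 2)
    return combined, needs_review


# Fictitious ratings for three 1-minute segments.
combined, needs_review = combine_ratings([1, 2, 3], [1, 3, 5])
```

Averaging two integer ratings that differ by one level yields half-level values such as 2.5, so the combined ground truth is finer-grained than the six-level scale used by each rater.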
In Figure 4.11, the video ratings of all subjects of both manual and automated
driving are plotted separately (blue lines) and on average (black line), including
a trend line (dashed black line). The line charts for the individual subjects are
not intended to show a detailed course of their drowsiness state over time. They
should instead give an impression that a high inter-driver variance exists and
that some participants reached higher levels of drowsiness at an earlier stage
of the drive than others. On average, across all subjects and both driving
modes, a constant, almost linear increase in drowsiness with minor fluctuations
becomes apparent.
Figure 4.11: Video ratings for all participants from manual and automated driving
(blue lines) and averaged (black line) with trend line (black dashed
line).
Drowsiness level   Rater 1       Rater 2
4                  98 (3.98%)    98 (3.98%)
5                  41 (1.66%)    44 (1.78%)
6                  68 (2.76%)    66 (2.68%)

Table 4.9: Distribution of the 2465 ratings of the two raters on the different drowsi-
A closer look at the ratings (see Table 4.9) shows that, apart from levels 1
and 2, both raters gave an almost identical number of ratings in the other
drowsiness levels. It becomes apparent that the majority of ratings cover levels
1 and 2 (79%), and the remaining 21% is spread over levels 3-6. From that,
it can be deduced that high drowsiness levels were reached only occasionally
among the participants. Among the 2465 ratings, the two raters gave different
ratings in 313 (13%) cases. The distribution of inconsistent ratings between
the two raters for adjacent drowsiness levels and their percentage of the
average number of ratings in these two levels per rater is presented in Table
4.10. Given the absolute values, a noticeably higher number of differing
ratings in the lower levels of drowsiness can be determined. Between levels 1
and 2, 199 differing ratings were given, and between levels 2 and 3, 82. In
comparison, only 24 differing ratings exist between levels 3 and 4, and only
four each between levels 4 and 5 and between levels 5 and 6. In the first
place, this could be attributed to the generally higher number of ratings in
these levels. However, the differences are less pronounced when considering the
percentage of the average number of ratings of a rater in the two respective
levels.
Drowsiness   Inconsistent   Percentage of the      Percentage of the
levels       ratings        number of ratings      number of ratings
                            of rater 1             of rater 2
2 / 3        82             6.07%                  5.91%
3 / 4        24             5.91%                  5.96%
4 / 5        4              2.88%                  2.82%
5 / 6        4              3.67%                  3.64%

Table 4.10: Distribution of inconsistent ratings within the two raters for adjacent
the average exemplary for a rating interval of five minutes. The num-
bers in the boxes represent fictitious ratings.
In Table 4.11, the results from the correlation analysis with six drowsiness
levels are presented. In the Minute-by-minute evaluation, the correlations are
generally strong (≥ 0.77), and only minor differences exist between the results
for different time intervals. The maximum correlation coefficient of 0.81 was
reached at intervals of two and five minutes. When the average is used in
calculating the correlations, stronger effects are noticeable, except for three
minutes (0.71). The most noticeable difference (0.14) is at an interval of five
minutes. All correlations obtained are statistically significant.
Assuming three levels of drowsiness (see Table 4.12), the evaluations resulted
in strong eects (≥ 0.84). The dierences within the Minute-by-minute and
Average evaluations are not as pronounced apart from a time interval of ve
minutes. The strongest correlation with a value of 0.95 was achieved in the case
of Average at a 5-minute rating interval. Specically, in this evaluation, the
variations in the correlations of the dierent time intervals are more extensive
than in the case of Minute-by-minute. There, the obtained correlations are
at an almost constant level.
In the case of a binary split of drowsiness (see Table 4.13), the results of the
Minute-by-minute evaluation can generally be associated with strong corre-
lations, with a maximum value of 0.80 with a rating interval length of two
minutes. However, they decrease steadily with increasing rating interval size
and reach a minimum of 0.70 at an interval of ve minutes. In the case of Aver-
age, contradictory results were achieved in the correlation analysis. Again, the
maximum value of 0.94 can be found at a rating interval of two minutes. For a
5-minute rating interval, the analysis resulted in a ρ-value of 0.90, which is at
a similar level. Further, when comparing rating intervals of two and four min-
utes for both evaluation cases, more substantial dierences in the correlation
coecients become apparent.
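The two evaluation variants can be illustrated with a short Python sketch. It
is not the evaluation code used in this work: it assumes that the rating given
at the end of each interval stands in for the less frequent rating, and that
this rating is compared either with every 1-minute rating inside the interval
(Minute-by-minute) or with their mean (Average). The function names and this
pairing scheme are hypothetical. Spearman's ρ is computed as the Pearson
correlation of tie-averaged ranks.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if one of the inputs is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def interval_pairs(minute_ratings, interval, use_average):
    """Pair each interval's final rating with the 1-minute ratings inside it.

    Assumption: the last 1-minute rating of an interval represents the
    rating that would have been given at the lower rating frequency.
    """
    xs, ys = [], []
    for start in range(0, len(minute_ratings) - interval + 1, interval):
        block = minute_ratings[start:start + interval]
        interval_rating = block[-1]
        if use_average:
            xs.append(interval_rating)
            ys.append(mean(block))
        else:
            for rating in block:
                xs.append(interval_rating)
                ys.append(rating)
    return xs, ys
```

Calling `spearman_rho(*interval_pairs(ratings, 5, use_average=True))` on a
list of 1-minute ratings would then yield one coefficient per interval length,
analogous to one cell of the tables below.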
Time interval (in minutes)   2           3           4           5
Minute by minute             ρ = 0.81    ρ = 0.77    ρ = 0.80    ρ = 0.81
                             p < 0.001   p < 0.001   p < 0.001   p < 0.001
Table 4.11: Results from correlation analysis with Spearman for different time in-
Time interval (in minutes)   2           3           4           5
Minute by minute             ρ = 0.88    ρ = 0.89    ρ = 0.84    ρ = 0.84
                             p < 0.001   p < 0.001   p < 0.001   p < 0.001
Table 4.12: Results from correlation analysis with Spearman for different time in-
Time interval (in minutes)   2           3           4           5
Minute by minute             ρ = 0.80    ρ = 0.78    ρ = 0.76    ρ = 0.70
                             p < 0.001   p < 0.001   p < 0.001   p < 0.001
Table 4.13: Results from correlation analysis with Spearman for different time in-
In the following, the results are discussed, and the presented RQs are answered.
Moreover, implications for further research are derived.
RQ2.7: Does the quality of the ground truth for drowsiness decrease
with decreasing rating frequency? When looking at the results on a
drowsiness scale with six levels (see Table 4.11), the minute-by-minute
comparison resulted in ρ-values ≥ 0.77 with a maximum of 0.81 at rating
intervals of two and five minutes, which corresponds to a strong correlation.
In the case of the Average evaluation, this effect is more evident, with a
maximum ρ-value of 0.95 at an interval length of five minutes. This almost
perfect correlation lies above the ρ-values of the 3-minute (0.71) and
4-minute (0.90) rating intervals. Thus, for the present case, a ground truth
of almost comparable quality can be determined if the rating interval is
increased, i.e., the rating frequency is decreased.
RQ2.8: If the number of drowsiness levels to be predicted is low,
are lower rating frequencies sufficient? By reducing the drowsiness levels
from six to three levels (see Table 4.12), the strongest effect (0.95) was
again found in the Average evaluation with a rating interval of five minutes.
For this reason, with a reduction to three levels, the quality of the ground
truth is still at an almost identical level with a rating interval of five
minutes compared to one minute. If drowsiness consists of two levels (see
Table 4.13), the highest correlation coefficient, a ρ-value of 0.94, was
reached with a rating interval of two minutes. The correlation coefficients
for rating intervals with a length of three and four minutes are slightly
lower, with values of 0.86 and 0.80. A similar correlation coefficient of
0.90 can be found at a 5-minute time interval. In summary, when the number of
drowsiness levels is reduced, the influence of the rating frequency on the
quality of the ground truth is low (RQ2.8). In particular, it should be noted
that this effect also holds at larger rating intervals, in the present case
of up to five minutes.
In general, it can be deduced that drowsiness represents a rather slowly
changing state. Therefore, longer rating intervals are sufficient and might
even give a better impression of the driver's drowsiness state. A higher
rating frequency also increases the likelihood of assigning incorrect
ratings, which leads to incorrect labels in the training data. However, it
should be noted that the evaluations in this work are based on data from
45-minute drives in the driving simulator, and that the database
predominantly contains lower drowsiness levels, which could be a reason for
higher correlation coefficients due to few changes in the participants'
drowsiness state. Therefore, longer drives, e.g., over three or four hours
and possibly with sleep-deprived participants, resulting in a less steady
course of drowsiness, should be considered for data collection. This is
intended to result in a database with a more evenly distributed number of
ratings across the levels and more frequent changes in the driver's
drowsiness state throughout the drive.
The knowledge gained can be used in future studies in this research area,
the collection, and standardization of a reliable and valid ground truth of
drowsiness, and the process improvement in developing reliable drowsi-
ness detection systems.
5 Evaluation: Performance and
Acceptance of a Driver
Drowsiness Detection System
based on Smart Wearables
The previous chapter contained different experiments in which the feasibility
and potential of using wrist-worn wearable devices inside the vehicle for
driver drowsiness detection were investigated. Promising results were
achieved using their physiological data as a single data source combined with
supervised machine learning to detect driver drowsiness.
In this chapter, based on the knowledge gained from these experiments, a
prototype for a driver drowsiness detection system based on a wrist-worn
smart wearable device is proposed and evaluated in a user study in terms of
detection performance. Further, even if the in-vehicle usage of smart
wearables for driver drowsiness detection is feasible and high detection
accuracies were obtained, the driver is to a certain extent forced to use a
smart wearable that is connected to the vehicle and continuously streams
vital and health data. Therefore, in addition to the performance evaluation,
this chapter investigates whether the in-vehicle usage of systems based on
wearable devices for safety-critical tasks, in the present case driver
drowsiness detection, is accepted and how this acceptance can be further
enhanced.
H2.1: There is a high user experience after first-time use of the proposed
drowsiness detection system.
5.1 Concept
The concept of the driver drowsiness detection system is shown in Figure 5.1. A
smart wearable is worn on the driver's wrist and continuously streams
real-time physiological data to an application on a mobile device. In the
backend of the application, the received data are processed: relevant
features are extracted and fed to an already trained machine learning
classifier. The classifier's output, i.e., the predicted drowsiness level, is
then presented in the user interface on the frontend of the application.
Furthermore, a user manual is included in the application, which supports the
user in attaching the device to the wrist/forearm and establishing the
connection with it.
This concept aimed to develop a portable and handy system for driver
drowsiness detection based on a wearable device. Since the system only
consists of a wearable and a tablet PC at the current development stage, it
can be integrated into any vehicle. The long-term vision is to replace the
tablet with the human-machine interface (HMI) integrated into the vehicle.
Then, only the smart wearable is required in terms of hardware and
external/additional sensors. The current trend in the automotive industry
shows that well-known manufacturers will replace their own operating systems
with ones used initially on mobile devices, e.g., with Android Automotive
[208].
5.2 Implementation
In the work of Lin et al., an Android application was developed that receives
and processes EEG signals from a wireless and wearable headset. A support
vector regression (SVR) model embedded in the app was used to detect the
drowsiness state. On the graphical user interface (GUI), the level of
drowsiness and the different EEG channels were displayed and updated every
two seconds [209]. Li and Chung proposed a drowsiness detection system based
on EEG signals and gyroscope data. The EEG signals were streamed from a
wearable EEG headset, and an SVM model was applied for determining the
driver's current drowsiness state. On the app (see Figure 5.2(a)), the raw
EEG data and features, 3-axis gyroscope data and features, as well as the
driving status were displayed [91]. Li and Chung proposed another
Android-based driver drowsiness detection system using HRV and PERCLOS as
input data and an SVM model for classification. The app displays the PERCLOS
measure, raw PPG data, and the detected drowsiness state (see Figure 5.2(b)).
Further, if the driver is classified as drowsy, nearby coffee shops are
suggested [210]. In the work of Jabbar et al., an Android-based system is
presented in which facial images from the mobile device's camera were
evaluated by a trained deep learning algorithm embedded in an application
(see architecture in Figure 5.2(d)). If the driver is drowsy, the application
signals the driver with auditory and visual notifications [64]. In
the work of Misbhauddin et al., a wearable-based drowsiness detection system
be further enhanced.
In the following sections, the implementation of the drowsiness detection will
be described in detail.
The Polar OH1+ optical sensor provides real-time physiological data (see right
part of Figure 5.6). This commercial smart wearable device can be easily worn
on the wrist or forearm with a textile strap. It offers real-time streaming
of heart rate data via BLE or Adaptive Network Topology (ANT+) for more than
12 hours without charging [211]. In the work of Hettiarachchi, the heart rate
data of the Polar strap was compared with a medical-grade ECG measurement
device and resulted in very high correlations (99%) [212].
In terms of real-time behavior, five test runs for evaluating the battery
consumption of the Polar OH1+ device were carried out before the study, in
which the heart rate data was continuously streamed from the wristband to the
tablet via BLE for six hours. The consumption of the individual test runs
(blue scattered lines) is almost identical (see Figure 5.3). On average (blue
line), the battery consumption over the entire test duration is almost linear
and reaches around 50% after six hours. Particularly with regard to longer
journeys with the car, the proposed drowsiness detection system could thus be
used without having to recharge the wearable device after a short time.
Figure 5.3: Average battery consumption (blue line) of Polar OH1+ for five test
runs (blue scattered lines) during real-time data transfer to app over a
duration of 6 hours.
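As a rough plausibility check of the battery test, the reported values can be
extrapolated linearly. The following Python sketch assumes that consumption
remains linear beyond the six measured hours, which the test runs do not
cover; the function name is hypothetical.

```python
def estimate_runtime_hours(consumed_percent, elapsed_hours):
    """Linearly extrapolate the time until 100% of the battery is consumed.

    Assumption: the consumption rate observed so far stays constant.
    """
    if consumed_percent <= 0:
        raise ValueError("consumed_percent must be positive")
    return 100.0 * elapsed_hours / consumed_percent


# Around 50% consumption after six hours of continuous BLE streaming
print(estimate_runtime_hours(50.0, 6.0))  # 12.0
```

Under this assumption, the measured consumption corresponds to roughly twelve
hours of streaming, which is of the same order as the runtime stated for the
device [211].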
The heart rate data was transmitted in real-time via BLE to a Google Pixel C
Tablet with an Android operating system (8.0.1 Oreo) that served as a plat-
5.2.3.1 Backend
In the backend of the application, the Polar BLE SDK was integrated to handle
the connection to and data transfer from the wearable [213]. Next, the
received heart rate data is pre-processed. Unexpected and longer gaps between
two successive data points can occur due to motion artifacts, e.g., caused by
strong hand movements of the driver [212]. Hence, a data pre-processing
algorithm is implemented to ensure a sufficient amount of data and prepare it
for feature extraction. As in the experiments in the previous chapter,
features are extracted with a sliding window of five minutes and a 2-second
increment. Therefore, at least five minutes of data are needed for
calculating the first set of features. From that time, features are
calculated every two seconds.
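The sliding-window feature extraction can be sketched as follows. This is an
illustrative Python sketch, not the Java code of the application: the input
format (a time-sorted list of `(timestamp_s, heart_rate)` tuples), the
function names, and the use of the population standard deviation are
assumptions. The six features are those used as model input in this work
(mean, standard deviation, maximum, minimum, range, and median of heart
rate).

```python
from statistics import mean, median, pstdev


def heart_rate_features(values):
    """Six time-domain features of the heart rate values in one window."""
    highest, lowest = max(values), min(values)
    return {
        "mean": mean(values),
        "std": pstdev(values),  # assumption: population standard deviation
        "max": highest,
        "min": lowest,
        "range": highest - lowest,
        "median": median(values),
    }


def sliding_feature_vectors(samples, window_s=300, step_s=2):
    """Compute features over a 5-minute window moved in 2-second steps.

    samples: list of (timestamp_s, heart_rate), sorted by timestamp.
    Returns a list of (window_end_s, features) tuples.
    """
    if not samples:
        return []
    first_ts, last_ts = samples[0][0], samples[-1][0]
    vectors = []
    window_end = first_ts + window_s
    while window_end <= last_ts:
        values = [hr for ts, hr in samples
                  if window_end - window_s < ts <= window_end]
        if values:
            vectors.append((window_end, heart_rate_features(values)))
        window_end += step_s
    return vectors
```

The first feature vector only becomes available once five minutes of data
have been received; every further vector follows two seconds after the
previous one.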
The flow chart of the data pre-processing algorithm is presented in Figure 5.4.
After receiving new data, the difference (diff_1) between the timestamp of the
last received heart rate value and the first value of the recording time is
calculated. If diff_1 is less than the length of the sliding window, i.e.,
five minutes, the system continues to wait for new input. The next step is to
check whether the system is still in the initialization state (init_state),
i.e., whether features are being calculated for the first time after starting
the application. If this is the case, the sliding window's content is checked
as to whether a sufficient amount of data is available and the maximum
permitted time difference between the inputs has not been exceeded. In the
present case, this maximum permitted time difference was set to 330 seconds,
i.e., within five minutes (300 seconds), a maximum of 30 seconds of missing
data is allowed. If it is exceeded, the system continues to wait for new
input. If not, the data is transferred to the feature extraction algorithm.
After the features have been calculated for the first time, the system is no
longer in init_state. From this point in time, the difference (diff_2)
between the timestamp of the last received input and the last timestamp of
the 5-minute window previously forwarded to the feature extraction algorithm
is calculated. If diff_2 is higher than or equal to the defined increment of
the sliding window, i.e., two seconds, it is checked whether sufficient data
is available. If this is not the case, the system continues to wait for new
input; otherwise, the data is forwarded to the feature extraction algorithm.
In total, six features from the time-domain are required as an input for the
machine learning model, as proposed in Section 4.2: mean, standard deviation,
maximum, minimum, range, and median of heart rate. As a next step, the
obtained feature vector is fed to a Random Forest (100 trees) [214] machine
learning model for binary classification of drowsiness (non-drowsy vs.
drowsy). This model was selected since it provided overall the most promising
results (>95% accuracy in 10-fold CV) in the experiments of Sections 4.1 and
4.2. The training of the model was conducted in the WEKA Machine Learning GUI
[189] with data collected with the Polar A370 device in the SAE level 2 drive
of study 1 (see Section 3.2). This data set was chosen for two reasons:
first, Polar also manufactured the device used in the prototype; second, the
prototype is likewise evaluated in an SAE level 2 drive in a driving
simulator. As ground truth for drowsiness, the same approach as in Section
4.1 was applied, i.e., a combination of observer ratings and detected
micro-sleep events provided the labels for the extracted feature vectors in
the training data set. To use the trained machine learning model in the app,
the WEKA Machine Learning workbench was integrated as an external Java
library (weka.jar). With this library, classifiers trained in and exported
from the WEKA GUI can easily be integrated and executed in Java code.
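The gating logic of the pre-processing algorithm described in this section can
be sketched as follows. The Python sketch is a simplified illustration and not
the application's Java implementation: the class and method names are
hypothetical, and the criterion for "sufficient data" is an assumption, namely
that samples arrive roughly once per second and that a 5-minute window may
lack at most 30 seconds of data.

```python
WINDOW_S = 300       # sliding-window length (five minutes)
STEP_S = 2           # sliding-window increment
MAX_MISSING_S = 30   # tolerated amount of missing data per window


class WindowGate:
    """Decides when a 5-minute window is forwarded to feature extraction."""

    def __init__(self):
        self.samples = []           # (timestamp_s, heart_rate) tuples
        self.init_state = True      # True until features were computed once
        self.last_forwarded = None  # end timestamp of last forwarded window

    def push(self, timestamp_s, heart_rate):
        """Store a sample; return a window to forward, or None to wait."""
        self.samples.append((timestamp_s, heart_rate))

        # diff_1: time since the first value of the recording
        diff_1 = timestamp_s - self.samples[0][0]
        if diff_1 < WINDOW_S:
            return None

        if not self.init_state:
            # diff_2: time since the previously forwarded window
            diff_2 = timestamp_s - self.last_forwarded
            if diff_2 < STEP_S:
                return None

        window = [s for s in self.samples if s[0] > timestamp_s - WINDOW_S]
        # Assumption: about one sample per second, so fewer than 270
        # samples means more than 30 seconds of data are missing.
        if len(window) < WINDOW_S - MAX_MISSING_S:
            return None

        self.init_state = False
        self.last_forwarded = timestamp_s
        return window
```

With gap-free input at one sample per second, the gate forwards the first
window after five minutes and then a new window every two seconds; a window
with too many missing samples is held back until enough data has arrived.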
5.2.3.2 Frontend
A total of six different screens were implemented on the frontend, which
serves as an interface for the interaction with the user (see Figure 5.5).
After starting the app, a welcome screen is displayed (see Figure 5.5(a)).
Initially, the app ensures internally that Bluetooth on the tablet is
switched on. If this is not the case, the user receives a notification to
switch it on. Further, the option is provided to choose a different language
via the globe symbol in the upper right corner. Currently, English, German,
and Kannada are offered, but this can be extended to other languages. As part
of the study, the participants had to enter their ID.
Figure 5.5: Screenshots of developed Android application: (a) Welcome screen; (b)
Screen with instructions for smart wearable usage; (c) Screen while
being classified as non-drowsy; (d) Screen while being classified as
drowsy; (e) Screen after finishing drive; (f) Screen with KSS scale for
drowsiness self-ratings.
After pressing the Continue button, a user manual follows with pictures of
the wearable and of how to put it on the wrist or forearm (see Figure
5.5(b)), switch it on, and connect it with the application on the tablet.
Once the smart wearable is connected, the user is able to press the Continue
button and switch to the next screen, where the current level of drowsiness
is presented. This level is also stored with a timestamp on the tablet.
Figures 5.5(c) and 5.5(d) show the screens for being classified as non-drowsy
or drowsy, respectively. When the drive is finished, screen 5.5(e) appears,
showing the current battery status and indicating that the wearable should be
recharged before the next drive. However, within the scope of the user study
carried out in this work, the screens 5.5(c) and 5.5(d), i.e., the drowsiness
state predicted by the machine learning model, are not presented to the
participants while driving. These two screens are shown to the participants
after the drive in the course of evaluating user experience and technology
acceptance of the whole system. Instead, during the simulator drive, the
subjects had to assess their drowsiness via self-ratings every five minutes.
For this purpose, and in the same way as in user studies 1 and 2, the KSS was
displayed to the participants (see Figure 5.5(f)). The intention was to
compare the output of the machine learning model with the subjects'
self-ratings after the study in order to evaluate the performance of the
machine learning model.
5.3 User Study
As in study 1, the study was carried out in the high-fidelity hexapod driving
simulator at Technische Hochschule Ingolstadt (THI). Except for two
differences, the same study setup and driving simulation were applied (see
description in Section 3.2). First, driving in the simulation was only
partially automated, i.e., SAE level 2 [20], since the differences in the
development of drowsiness between manual and partially automated driving had
already been examined. Second, to potentially obtain higher levels of
drowsiness, the driving duration was extended from 45 to 60 minutes.
5.3.2 Participants
The participants were selected from two age groups since the influence of age
on user experience and technology acceptance will be examined. To be
consistent with the first two studies, the same age groups were chosen.
Therefore, 15 participants between 20-25 years and 15 between 65-70 years
were selected for the study. The participants received €25 in cash for
participating in the study and had to meet the same conditions for
participation as in studies 1 and 2: no sleep disorder, subjectively rated
good health, valid driving license, no limitations in their driving ability,
and no consumption of caffeinated drinks five hours before the study.
Various data were recorded in the course of the study that will be described in
the following sections.
5.3.3.1 Pre-Questionnaire
Since the performance of the driver state detection system will be evaluated,
i.e., how capable the integrated machine learning model is of detecting the
non-drowsy and drowsy state of the driver, a ground truth for drowsiness is
needed. In this work, the drowsiness state of the participants was determined
in two different ways: self-ratings and observer ratings.
Self-ratings
In the same way as in studies 1 and 2, the subjects had to assess their
drowsiness via self-ratings during the drive using the KSS. The screen
containing the KSS scale was shown to the participants every five minutes on
the tablet in the vehicle's center console with the same modality as in the
previous studies (see left part of Figure 5.6). The subjects also had to
disclose the development of their drowsiness after the trip using drowsiness
curves, based on the UX curve method [163], as in studies 1 and 2. This type
of self-assessment is used to compare how the participants assess themselves
directly after the drive compared to during the drive. Moreover, it provided
backup ratings in the event of problems with the self-ratings on the tablet.
Observer ratings
In addition to the subjects' drowsiness self-assessments, their state was
assessed by external raters. With a camera installed in the driving simulator
(see left part of Figure 5.6), the subjects could be observed on a screen
outside the simulator. The observer ratings were collected in real-time
during the study every minute. In the work of Sandberg et al., it was found
that for the detection of reasonable signs of drowsiness, most indicators can
be observed for intervals of 60 seconds or longer [190]. To increase the
reliability of the ratings, two trained individuals worked together and set a
rating for every minute after a short discussion. As in the previous studies
and experiments, the drowsiness scale by Weinbeer et al. was applied for the
observer ratings [50].
Figure 5.6: Left: Simulator setup with camera for video ratings (1) and tablet PC
5.3.3.3 Post-Questionnaire
Subjects had to fill in a post-questionnaire after the ride in the simulator,
which consisted of three parts.
Further Questions
Apart from TAM and UEQ, the post-questionnaire contained the following
questions (see Table 5.3):
Finally, to receive qualitative feedback, the participants were also asked
individually in unstructured interviews about their overall experience while
using the system for the first time and about possible suggestions for
improving the proposed drowsiness detection system. In addition to the
interviewer and the participant, another person was present to take notes on
the answers.
An overview of the study procedure, which took a maximum of two hours for
each subject, is presented in Figure 5.8. The experiments were carried out at
9:00 a.m., 1:30 p.m., and 5:00 p.m. The same number of participants from each
age group was invited for the different points in time. Upon arrival, the
subject was given background information about the study, and the
pre-questionnaire was filled in before entering the simulator. The subjects
were then shown the location of the Polar OH1+ in the vehicle. In the next
step, they were advised to start the developed app on the tablet in the
center console of the car and follow the instructions. They were observed by
the experimenter, who documented any problems while dealing with the app and
wearable. The experimenter also assisted them and ensured that all necessary
steps and settings were carried out correctly before the drive in the
simulator was started. A 5-minute partially automated test drive was
initially carried out to accustom the participants to the rating request and
to the simulator situation. Especially for most older participants,
participating in a simulator study was a completely new experience and could,
therefore, have led to an increase in heart rate due to nervousness and
excitement. This could have distorted the heart rate data at the beginning of
the central part of the experiment. A decrease in heart rate could then
potentially have been attributed to getting used to the simulator situation.
The central part consisted of a 60-minute partially automated drive. A total
distance of around 100 km was covered. During the drive, the subjects had to
assess their drowsiness every five minutes via self-ratings on the tablet.
Outside the simulator, two external raters gave an observer rating every
minute. During the entire drive, the participants were not allowed to use a
mobile phone, drink, eat, or chew gum. Moreover, they were to talk to the
experimenter only in an emergency and not perform any other activities. After
finishing the drive, subjects were asked to draw drowsiness curves and to
fill in the post-questionnaire.
5.4 Results
5.4.1 Pre-Questionnaire
In summary, 15 subjects between 20-25 years, eight women and seven men (age:
M=22.50 years, SD=1.88 years), and 15 subjects between 65-70 years, eight
women and seven men (age: M=66.60 years, SD=1.84 years), were selected for
the study. The majority of the younger subjects were students of Technische
Hochschule Ingolstadt (THI). The older ones were recruited through a
newspaper announcement. Further, participants had to state whether they had
already experienced a micro-sleep while driving, which was answered with yes
by nine younger and six older subjects. Furthermore, it was found that none
of the younger but nine of the older participants were currently undergoing
medical treatment. To stay more realistic, and since not only the detection
of the drowsy state but also of the non-drowsy state is of high importance,
the participants were not sleep-deprived. Younger participants slept on
average 6.87 hours and the older ones 6.40 hours before the study, in both
age groups the majority with medium quality. The younger participants slept
on average 7.07 hours, and the older participants 6.73 hours per night. A
summary of the results from the pre-questionnaire can be found in Table 5.4.
Additionally, for determining their daytime sleepiness, the items of the
Epworth Sleepiness Scale (ESS) were queried [40] and again divided into five
groups [181], as shown in Table 5.5. It can be seen that the older subjects
generally have lower daytime sleepiness since all 15 subjects fall into the
two lowest categories. The younger subjects cover all five categories.
Category (ESS score): description                          Younger   Older   Total
0 (0-5 points): lower normal daytime sleepiness            3         6       9
1 (6-10 points): higher normal daytime sleepiness          7         9       16
2 (11-12 points): mild excessive daytime sleepiness        1         0       1
3 (13-15 points): moderate excessive daytime sleepiness    3         0       3
4 (16-24 points): severe excessive daytime sleepiness      1         0       1
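The grouping of ESS scores into the five categories listed above can be
expressed as a simple lookup. The Python sketch below is illustrative; the
function name is hypothetical, and the score range of 0-24 follows from the
category boundaries given in the table.

```python
ESS_CATEGORIES = [
    (5, 0, "lower normal daytime sleepiness"),
    (10, 1, "higher normal daytime sleepiness"),
    (12, 2, "mild excessive daytime sleepiness"),
    (15, 3, "moderate excessive daytime sleepiness"),
    (24, 4, "severe excessive daytime sleepiness"),
]


def ess_category(score):
    """Map an ESS total score (0-24) to (category index, description)."""
    if not 0 <= score <= 24:
        raise ValueError("ESS score must be between 0 and 24")
    for upper_bound, index, label in ESS_CATEGORIES:
        if score <= upper_bound:
            return index, label


print(ess_category(11))  # (2, 'mild excessive daytime sleepiness')
```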
In order to validate the performance of the machine learning model, its output
was compared with the obtained self- and observer ratings. In the following,
the self-ratings given during the drive are used since the ones determined
via the drowsiness curves are at an almost identical level (see Figure 5.9).
The difference averaged 0.24 KSS levels over the entire duration of the
60-minute drive.
Figure 5.9: Average KSS ratings with 95% CI of all participants during and after
the drive.
A total number of 360 KSS and 1800 Weinbeer ratings are available for all 30
subjects. In Figures 5.10(a) and 5.10(b), the average ratings over the
60-minute drive are shown for both types of ratings, separately for younger
and older subjects. It can be seen that the younger participants reached
higher levels of drowsiness on average than the older ones, which is
especially apparent from the course of the self-ratings. Whereas the increase
in drowsiness in the first 20 minutes of the trip is relatively strong among
the younger participants and then tends to remain at an almost constant
level, a relatively moderate but constant increase over the entire duration
was recorded for the older participants. The differences between the two age
groups are evident in the range of 20-35 minutes. With regard to the observer
ratings, the differences are not that substantial. Towards the end of the
drive, the two curves even converge and are at an almost identical level.
Figures 5.10(c) and 5.10(d) show the distribution of the number of ratings
across the levels of the different scales. A higher number of ratings in the
lower drowsiness levels can be determined for both scales. Level 3 on the KSS
(103 ratings) and level 2 on the Weinbeer scale (770 ratings) were selected
particularly frequently. Since drowsiness is classified as binary
(non-drowsy, drowsy) by the trained machine learning model, but the KSS
consists of nine and the Weinbeer scale of six levels, the scales had to be
adjusted first. Based on the work of Ingre et al. [79], the levels of the KSS
were grouped as follows: levels 1-6 (non-drowsy), levels 7-9 (drowsy). The
observer rating scale was adapted in the same way as in Section 4.1: levels
1-3 (non-drowsy), levels 4-6 (drowsy). The class imbalance becomes
particularly clear after grouping both scales into the two levels. In the
case of the grouped KSS ratings, 286 ratings are available for the non-drowsy
and 74 for the drowsy class (see Figure 5.10(e)); with the grouped Weinbeer
ratings, 1538 are available for non-drowsy and 262 for drowsy (see Figure
5.10(f)).
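The grouping of both rating scales into two classes, and the resulting class
imbalance, can be reproduced with a few lines of Python. The function names in
this sketch are hypothetical; the thresholds and rating counts are those
stated above.

```python
def binarize_kss(level):
    """KSS levels 1-6 -> non-drowsy, levels 7-9 -> drowsy."""
    if not 1 <= level <= 9:
        raise ValueError("KSS level must be between 1 and 9")
    return "drowsy" if level >= 7 else "non-drowsy"


def binarize_weinbeer(level):
    """Weinbeer levels 1-3 -> non-drowsy, levels 4-6 -> drowsy."""
    if not 1 <= level <= 6:
        raise ValueError("Weinbeer level must be between 1 and 6")
    return "drowsy" if level >= 4 else "non-drowsy"


def drowsy_share(non_drowsy_count, drowsy_count):
    """Fraction of ratings that fall into the drowsy class."""
    return drowsy_count / (non_drowsy_count + drowsy_count)


print(round(drowsy_share(286, 74), 3))    # 0.206 (grouped KSS ratings)
print(round(drowsy_share(1538, 262), 3))  # 0.146 (grouped Weinbeer ratings)
```

Only about a fifth of the self-ratings and about a seventh of the observer
ratings therefore belong to the drowsy class.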
Figure 5.10: Evaluation of KSS and Weinbeer ratings: (a) Average KSS ratings for
younger and older participants with 95% CI; (b) Average Weinbeer
ratings for younger and older participants with 95% CI; (c) Absolute
number of KSS ratings given per level; (d) Absolute number of Weinbeer
ratings given per level; (e) Absolute number of ratings for grouped
KSS levels; (f) Absolute number of ratings for grouped Weinbeer levels.
Further and as performed for studies 1 and 2, these trends were statistically
evaluated using IBM SPSS v.25, separately for KSS and Weinbeer ratings.
Since Weinbeer ratings are available for 1-minute, but KSS ratings for 5-minute
intervals, the mean of Weinbeer ratings over the corresponding 5-minute inter-
vals was considered for evaluation.
KSS ratings : A Linear Mixed Model (LMM) for repeated measures with two
SR (5 min)), self-ratings over single minutes in the 5-minute interval (further
referenced as SR (1 min)), as well as observer ratings over 1-minute intervals
(further referenced as OR (1 min)).
The classification results are presented in Table 5.6. Since the machine
learning model outputs a drowsiness level every two seconds, these levels
were in a first step averaged over the considered time interval, i.e.,
5-minute and 1-minute intervals. The corresponding output for non-drowsy is
1, and for drowsy 2. An average drowsiness level of 1.5 or higher in the
considered time interval was rounded up to 2 (drowsy), and a value below this
threshold was rounded down to 1 (non-drowsy).
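The aggregation of the 2-second model outputs into one label per interval, and
the subsequent accuracy calculation, can be sketched in Python as follows. The
function names are hypothetical, and the sketch assumes that the predictions
of one interval are already collected in a list.

```python
def aggregate_interval(predictions):
    """Reduce model outputs (1 = non-drowsy, 2 = drowsy) to one label.

    The outputs of one interval are averaged; an average of 1.5 or
    higher counts as drowsy (2), anything below as non-drowsy (1).
    """
    average = sum(predictions) / len(predictions)
    return 2 if average >= 1.5 else 1


def accuracy_percent(predicted, reference):
    """Percentage of intervals in which prediction and rating agree."""
    if len(predicted) != len(reference):
        raise ValueError("both sequences must have the same length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * hits / len(reference)


# A 5-minute interval contains 150 outputs at one output every 2 seconds
print(aggregate_interval([1] * 75 + [2] * 75))           # 2
print(accuracy_percent([1, 2, 1, 1], [1, 2, 2, 1]))      # 75.0
```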
Due to a lack of recorded heart rate data, participants 5, 9, 11, 25, and 30
had to be removed from the evaluation data set. Looking at the classification
accuracy of the individual participants, regardless of the drowsiness
reference, the accuracies vary widely, ranging from 50% (participant 8) up to
100% (participants 4, 10, 12, 17, 20, 21, 23, 28, and 29). Concerning
self-ratings, the accuracy for the 5-minute and 1-minute intervals is almost
identical, except for participant 2. However, when comparing self- and
observer ratings, larger differences exist, but no clear trend can be found.
For example, with self-ratings as ground truth, the state of participant 3
was classified with an accuracy of 58.33% (5 min) and 55.36% (1 min), whereas
with observer ratings an accuracy of 91.07% was achieved. For participant 8,
50% was achieved with each self-rating reference and 83.93% with observer
ratings. The opposite holds for participant 22: around 66% with self-ratings
versus 42.86% with observer ratings. Across all subjects, the differences in
accuracy between the three reference types are marginal, with a maximum of
82.72% for SR (5 min). Focusing on the two age groups, the classification
accuracy is around 16% higher among the older subjects.
Table 5.6: Machine learning classification accuracy (in percent) calculated with
reference to self-ratings with KSS for both 1- and 5-minute intervals and
1-min observer ratings with Weinbeer scale for individual subjects and
on average in each age group. For participants marked n.a., not sufficient
data was available for evaluation.
So far, and in terms of performance evaluation, the focus was purely on
accuracy as one of the most traditional measures. However, when dealing with
class imbalance, as in the present case, other performance measures should be
considered. Accuracy itself focuses more on the majority than on the minority
classes [201]. Thus, the F-measure (further referenced as F) is additionally
used; it represents the harmonic mean of precision and recall, with a highest
possible value of 1 (perfect precision and recall). Concerning the presented
binary classification problem, it is important to correctly detect when the
driver is in a drowsy state. From the customer's point of view, however, it
is also crucial to correctly detect when the driver is non-drowsy, so as not
to irritate the driver with unnecessary drowsiness warnings. Given a standard
confusion matrix for binary classification with values for true positive
(TP), true negative (TN), false positive (FP), and false negative (FN), the
formula for F does not take the TN values into account.
For this reason, the F-measure is calculated twice, once with each class
considered as the positive class. Tables 5.7, 5.8, and 5.9 show the confusion
matrices for the three cases described. When calculating F for the displayed
confusion matrices (non-drowsy as positive class), the following results are
obtained: 0.90 for SR (5 min), 0.89 for SR (1 min), and 0.90 for OR (1 min).
If drowsy is considered the positive class, the results are 0.26 for SR
(5 min), 0.29 for SR (1 min), and 0.25 for OR (1 min), which are noticeably
lower in all three cases.
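To make the per-class calculation concrete, the following sketch computes F once with each class as the positive class, using F = 2TP / (2TP + FP + FN), which is equivalent to the harmonic mean of precision and recall. The counts in the example are hypothetical and are not taken from Tables 5.7 to 5.9; the function and argument names are likewise illustrative.

```python
def f_measure(tp, fp, fn):
    """F = 2TP / (2TP + FP + FN); true negatives do not enter the formula."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f_per_class(nd_as_nd, nd_as_d, d_as_nd, d_as_d):
    """Compute F with each class treated as positive once.

    Arguments are counts named '<actual>_as_<predicted>', where nd stands for
    non-drowsy and d for drowsy.
    """
    f_non_drowsy = f_measure(tp=nd_as_nd, fp=d_as_nd, fn=nd_as_d)
    f_drowsy = f_measure(tp=d_as_d, fp=nd_as_d, fn=d_as_nd)
    return f_non_drowsy, f_drowsy


# Hypothetical, imbalanced counts for illustration only.
f_nd, f_d = f_per_class(nd_as_nd=90, nd_as_d=10, d_as_nd=10, d_as_d=5)
```

With such imbalanced counts, the score of the majority class stays high while the score of the minority class drops, which is the pattern reported above.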
[Table 5.7: Confusion matrix for SR (5 min), predicted vs. actual class
(non-drowsy, drowsy); cell values not recoverable]
[Table 5.8: Confusion matrix for SR (1 min), predicted vs. actual class
(non-drowsy, drowsy); cell values not recoverable]

                        Actual non-drowsy   Actual drowsy
Predicted non-drowsy    TP: 1108            FP: 151

Table 5.9: Confusion matrix for evaluation of observer ratings (OR) (1 min).
The machine learning model delivers a new output every two seconds. However,
it is not practical to update and re-inform the driver of his/her state at
such short intervals. In this section, three possible and common methods for
post-processing the machine learning output are compared with respect to how
each method influences the detection accuracy. This comparison is intended as
an outlook on possible future work. For this reason, the methods are compared
and evaluated using participant 6 as an example and observer ratings as
ground truth for drowsiness.
Mean: To evaluate the performance of the machine learning model in the
previous section, the mean was calculated over the considered time interval,
e.g., over 1-minute intervals in the case of observer ratings, and rounded to
the nearest integer. This could be applied as a simple method to display and
update the driver's state every minute.
Ratio: This method calculates the ratio of the number of non-drowsy or
drowsy outputs to the total number of outputs in the considered time
interval. The driver's condition is then decided by applying a threshold to
this ratio. For example, if the threshold is set to 0.7 and at least 70% of
the outputs in a 1-minute interval are non-drowsy, the driver would be
classified as non-drowsy. However, supposing the output is non-drowsy over
the first 45 seconds and drowsy over the last 15 seconds of the 1-minute
interval, the result would still be non-drowsy. In reality, the driver would
probably be moving towards a drowsy state. Hence, this technique is not very
suitable, especially when longer time intervals, such as five or ten minutes,
are considered.
Moving average: To overcome the limitations of the ratio method, a moving
average, e.g., with a 2-minute sliding window and a 1-minute increment, could
be calculated. The obtained average is rounded to the nearest integer. With
this method, the state is still determined every minute, but it now also
depends on the previous minute.
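The three post-processing methods can be sketched side by side as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in Table 5.10: the function names are hypothetical, 30 outputs per minute are assumed (one every two seconds), and the first minute of the moving average uses only the data available so far.

```python
def mean_label(window, threshold=1.5):
    """Mean method: average the outputs (1 or 2) and round at 1.5."""
    return 2 if sum(window) / len(window) >= threshold else 1


def ratio_label(window, non_drowsy_ratio=0.7):
    """Ratio method: non-drowsy if the share of 1-outputs reaches the threshold."""
    share_non_drowsy = window.count(1) / len(window)
    return 1 if share_non_drowsy >= non_drowsy_ratio else 2


def moving_average_labels(outputs, samples_per_minute=30, window_minutes=2):
    """Moving average: one label per minute from a window that reaches back
    window_minutes, advanced in 1-minute increments."""
    labels = []
    for minute in range(1, len(outputs) // samples_per_minute + 1):
        end = minute * samples_per_minute
        start = max(0, end - window_minutes * samples_per_minute)
        labels.append(mean_label(outputs[start:end]))
    return labels


# Roughly 45 s non-drowsy followed by 15 s drowsy within one minute:
# the ratio method still reports non-drowsy, as discussed in the text.
one_minute = [1] * 23 + [2] * 7
ratio_result = ratio_label(one_minute)

# Three minutes: non-drowsy, drowsy, drowsy.
three_minutes = [1] * 30 + [2] * 60
smoothed = moving_average_labels(three_minutes)
```

In the three-minute example, the second minute is already labeled drowsy because the window mean reaches 1.5, and the label then stays stable for the third minute.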
Table 5.10 shows how the performance of the machine learning model changed
Table 5.10: Performance (accuracy, F-measure for both non-drowsy and drowsy
TAM results were collected on a 7-point (1-7) Likert scale. Before evaluating
TAM, the data were first checked for internal consistency using Cronbach's α.
Internal reliability of all multi-item scales could be confirmed (α > 0.6).
The results are shown in Table 5.11. For a more detailed illustration, the
results of the individual TAM subscales for the two age groups are presented
as box plots in Figure 5.11. In general, the median and mean values are
higher for the older participants in all TAM subscales, except for PEOU,
where the median is equal in both groups.
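As a reminder of what the reliability check computes, the following sketch implements the standard formula α = k/(k-1) · (1 - Σ item variances / variance of the sum score) for one multi-item scale. The ratings in the example are hypothetical and do not come from the study; the analysis in this thesis was carried out in SPSS.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one scale.

    responses: list of rows, one row per participant, one column per item.
    """
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]
    total_variance = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings of four participants on a three-item subscale (1-7).
example = [[5, 6, 5], [6, 7, 6], [4, 4, 5], [7, 7, 6]]
alpha = cronbach_alpha(example)  # compare against the 0.6 threshold used here
```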
Furthermore, it was checked whether the obtained differences between the two
age groups are significant. Since the collected TAM data did not follow a
normal distribution and the applied scale is ordinal, a non-parametric test,
the Mann-Whitney U test, was applied per TAM variable. The results are listed
in the right column of Table 5.11. Assuming a significance level of α = 0.05
and a group size of 15 subjects per age group, the critical U value is 64. If
the value of U resulting from the Mann-Whitney U test is below this critical
U value, a significant difference in the ratings of the age groups exists. To
check the certainty of these outcomes, a two-tailed test is performed. If the
p-value is less than 0.05, the difference between the two groups is
considered significant. Significant results, i.e., significant differences
between the age groups, are indicated in Table 5.11 with a *. Except for PEOU
(U: 107.00, z = -0.21, p = 0.83), the two-tailed Mann-Whitney U test was
significant for PU (U: 45.00,
z = -2.78, p = 0.01), ATT (U: 41.50, z = -2.92, p = 0.00) and INT (U: 40.00,
z = -2.99, p = 0.00).
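For readers who want to retrace the test logic, the following is a minimal pure-Python sketch of a two-tailed Mann-Whitney U test with midranks for ties and a normal approximation. It is an assumption-laden illustration: no continuity correction is applied, so the z and p values may differ slightly from the SPSS output reported above, and the function name is hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U, z, two-tailed p) for two independent samples."""
    pooled = sorted((value, group) for group, sample in enumerate((x, y))
                    for value in sample)
    n1, n2, n = len(x), len(y), len(x) + len(y)
    rank_sum_x, tie_term, i = 0.0, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2          # average of ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties       # tie correction for the variance
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u_x, n1 * n2 - u_x)
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / sqrt(var_u)
    p = erfc(abs(z) / sqrt(2))             # two-tailed p from the normal approx.
    return u, z, p


# Hypothetical toy samples; real use would pass the 15 ratings per age group
# and compare U against the critical value of 64.
u, z, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
```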
Figure 5.11: Summary of results of TAM subscales for younger and older
participants.
As for the TAM, the UEQ subscales were checked for consistency by calculating
Cronbach's α (see Table 5.12). Consistency could be confirmed for the
subscales Attractiveness, Perspicuity, Stimulation, and Novelty, but not for
Efficiency and Dependability, for which an alpha value lower than 0.6 was
obtained. Especially with a small sample size, as in the present case, the
reason is not necessarily an inconsistent scale. Instead, participants may
have misinterpreted certain items of the subscales. In such cases, the mean
rating per question within a subscale can be used to identify the
misinterpreted items. By applying this technique, it was found that several
participants misinterpreted specific scale items, so that their ratings for
these items were inconsistent with their ratings for other items of the same
subscale. In the case of one older participant, four such inconsistencies
could be identified. Hence, the ratings of this participant were discarded
for further evaluation and analysis. Moreover, Table 5.12 lists the median
and mean values for each subscale, and Figure 5.12 presents the UEQ results
in the form of box plots. In general, high values for both median and mean
(≥ 5) were obtained in all subscales. For the mean, the older participants
achieved higher values in all subscales; for the median, this is the case in
the categories Attractiveness, Stimulation, and Novelty. An identical median
value was determined in the remaining categories, i.e., Perspicuity,
Efficiency, and Dependability.
Furthermore, the Mann-Whitney U test was used to check whether the
differences between the two age groups in the individual UEQ subscales are
significant. The critical U value is again 64, and a two-tailed test was
performed. Significant results are indicated in Table 5.12 with a *.
Significant results were obtained for Attractiveness (U: 24.50, z = -3.49,
p = 0.00), Stimulation (U: 49.50, z = -2.40, p = 0.02), and Novelty
(U: 27.00, z = -3.38, p = 0.00). For Perspicuity (U: 96.50, z = -0.35,
p = 0.73), Efficiency (U: 81.00, z = -1.03, p = 0.30), and Dependability
(U: 83.00, z = -0.94, p = 0.35), the results are not significant.
Figure 5.12: Summary of results of UEQ subscales for younger and older
participants.
In addition to evaluating the six subscales of the UEQ and comparing the two
age groups, the UEQ data were evaluated with respect to the three dimensions
of the UEQ, i.e., Attractiveness, Pragmatic Quality, and Hedonic Quality (see
Figure 5.13). The median is 6 in all three dimensions. The mean of
Attractiveness is 6.21, with a minimum value of 3 and a maximum of 7.
Since, as in the present work, UEQ measurements are only available for a
single system, it is not easy to judge whether the product meets the quality
goals. For this reason, the results obtained were compared with a UEQ
benchmark data set, which contains data from over 20,000 people from around
450 surveys on various products (business software, development tools,
webshops or services, social networks, mobile applications, household
appliances). In all scales, the product should be located in the Good
category [218]. The results are presented in Figure 5.14. For the presented
drowsiness detection system, this was achieved for all scales.
Figure 5.14: UEQ benchmark graph for proposed drowsiness detection system.
In addition to TAM and UEQ, the subjects were also asked how confident
they felt when submitting their self-ratings. The majority in both age groups,
ten younger and 11 older participants, stated that they felt confident; four
reported medium confidence, and one younger subject did not feel confident.
Furthermore, they were asked from which level of drowsiness, with reference
to the KSS, a warning would be appropriate. The results are shown in Table
5.13. None of the subjects considered levels 1-3 (extremely alert, very
alert, alert) or level 9 (very sleepy; fighting sleep) to be helpful for a
warning. One younger and five older subjects chose level 4 (rather alert),
two from each age group chose level 5 (neither sleepy nor alert), three
younger and seven older subjects chose level 6 (some signs of sleepiness),
and five younger and one older subject chose level 7 (sleepy; no effort to
keep awake). Four of the younger subjects would be satisfied with a warning
from level 8 (sleepy; some effort to keep awake). It can be seen that KSS
level 7, which represents the onset of drowsiness, is the relevant level for
a first warning, and level 9 is already perceived as too late.
KSS level              Younger   Older   Total
1 (extremely alert)    0         0       0
2 (very alert)         0         0       0
3 (alert)              0         0       0
4 (rather alert)       1         5       6
The participants were also asked whether they already had experience with
smartwatches or fitness trackers. Six of the younger subjects were familiar
with their usage; the remaining six had little or no experience. Only two of
the older participants had come into contact with smartwatches or fitness
trackers so far. They were further asked whether they own a smartwatch or
fitness tracker, intend to buy one, or would wear one to ensure safety while
driving. The results are listed in Table 5.14. Only three younger
participants and one older participant possess a wearable. One younger and
four older participants intend to buy one. In order to ensure safety while
driving, 13 younger and 14 older subjects would wear a wearable.
Participants were further asked whether they could imagine positioning the
wearable on other body parts during driving, such as the waist, arm, neck,
finger, or ankle. However, all younger and 12 older participants voted for
the wrist. Two of the three remaining older subjects chose the finger and one
the waist.
5.5 Discussion and Limitations
In the following, the presented hypotheses are discussed based on the
obtained results, and implications for further research in this area are
derived.
H1: A high detection accuracy can be obtained using the proposed
drowsiness detection system. The detection accuracy of the machine learning
model that was integrated into the Android app was computed, considering
both self-ratings and observer ratings separately as ground truths. From the
results, it can be seen that inter-driver variance has a strong influence on
model performance. Therefore, at the current development stage, a single
model cannot be applied to reliably detect drowsiness for all participants.
On average, across all participants, the obtained accuracies for the three
presented cases (SR
can be mentioned. Regarding the purpose of the presented system, the
detection of drowsiness while driving, other reasons could also be
considered. With increasing age, several different health problems, such as
diminished hearing or vision, can occur that could lead to dangerous
situations while driving. For this reason, automated driving is particularly
relevant for older people who want to stay mobile but are limited in their
ability to drive. They are probably more willing to use this kind of driver
assistance system on the way to full automation and therefore show a higher
technology acceptance than younger people. Another noteworthy finding from
the post-questionnaire is that only a few participants own a smart wearable
or want to buy one. Automobile manufacturers have to take this into account
when developing systems of this type in the future. One option, which several
participants also mentioned after the study, would be to sell the wearable
with the vehicle.
Regarding limitations, specific precautions were taken in the user study to
induce drowsiness faster. These included no caffeinated drinks five hours
before the study, a monotonous driving route with little traffic and low
speeds, no communication with the experimenter, and a warm temperature inside
the simulator. Besides, the study was conducted in a driving simulator.
Therefore, the onset of drowsiness would probably occur later under realistic
conditions since, in a simulator-based environment, drivers are not exposed
to the dangers of real traffic. Moreover, in real-world driving, vibrations
from the roadbed could lead to degraded or different quality of the collected
physiological data and thus to a differing machine learning output.
Furthermore, drives with a duration longer than 60 minutes should be
considered. Finally, two specific age groups were selected. The presented
issues should also be investigated for other age groups and, in general, with
a higher number of subjects.
5.6 Main Findings
Both age groups rated the system easy to use and informative.
6 Discussion
In the following sections, the postulated research questions and obtained
answers will be discussed.
In the evaluation, the main focus was on the subjective drowsiness assessment
via driver self-ratings to capture the drowsiness state directly from the driver's
perspective. The drowsiness self-ratings were collected during and after the
manual and partially automated drive. In both studies, results showed only a
marginal difference in the ratings during and after the drive. Further, the
same statistical effects could be determined for both ratings. This suggests
that it is sufficient to let the test persons assess their drowsiness only
immediately after driving and not during it. Even if the rating request is as
unobtrusive as possible while driving, it certainly has a very brief alerting
effect on the driver, which can interfere with the development of drowsiness.
Focusing on the presented hypotheses, the ones for driving mode and driving
time could be accepted in both studies. Thus, a significant difference in the
development of drowsiness between manual and automated driving was evident,
with higher levels in automated driving. Further, drowsiness was
significantly higher at the end of the drive. Despite the short driving time
of 45 minutes, high levels of drowsiness, with higher levels in automated
driving, could be achieved with the chosen study setting. This is
particularly noteworthy for study 2. Even in a production car, the duty of
monitoring during partially automated driving affects drowsiness within a
short time. The post-questionnaire also confirmed this in both studies, where
the majority of participants stated that they had to fight drowsiness more
during automated driving. Therefore, the results obtained reaffirm the need
for reliable and early detection of driver drowsiness, especially in the
lower levels of automated driving, where the driver still forms the fallback
level [?]. Since a constant involvement in the driving task no longer exists,
the level of drowsiness rises faster with partially automated driving than
with manual driving. This effect has already been presented in previous,
mainly simulator-based studies [150, 151, 152] and could be confirmed in this
work in a production car in a more realistic scenario within a short driving
time. If, for example, ADAS that enable driving at SAE level 2 or 3 are
activated while driving, higher drowsiness levels are likely. Therefore, the
current status of these systems should be used as input for the adaptation of
the drowsiness detection system, e.g., for adjusting its sensitivity by
issuing warnings earlier.
6.1 Preconditions for the Adaptation of Driver Drowsiness Detection Systems (RQ1)
reproducible and safe conditions for a drowsy driver, as performed in this Ph.D.
thesis, can serve as a starting point for bridging the gap between simulator and
the real world.
In both studies, higher levels of drowsiness were found in the younger
participants. The older subjects achieved lower ESS scores in studies 1 and
2, which indicates a lower level of daytime sleepiness and could be a reason
for the overall lower levels of drowsiness. Moreover, in conversation with
the participants after the studies, the impression arose that participating
in a user study, dealing with a future topic such as automated driving, and
experiencing new technology were more exciting and fascinating for the older
than for the younger participants. Further, in both studies, the older
subjects were more cautious and wanted a drowsiness warning at an earlier
time than the younger participants. This effect could be incorporated into
the design of a drowsiness detection system. Depending on age, the time for a
warning could be adjusted, with earlier warnings for older drivers.
In manual driving, the average heart rate for all subjects is several beats
per minute higher than for automated driving, possibly due to the reduced
activity and workload in the latter. Furthermore, the average heart rate for
young subjects is higher than for older ones. Previous work also found that
it is challenging to apply detection models that have been trained with data
from a particular age group to another age group [4]. Results from the
literature show that the maximum heart rate decreases with increasing age
[186, 187, 188]; this could be confirmed in the context of this Ph.D. thesis
in terms of driver drowsiness with physiological data from low-resolution
consumer-grade wearable devices. The differences in the heart rate between
manual and automated driving and between the two age groups thus reflect the
corresponding significant differences in the self-ratings. If physiological
data from wearable devices are applied for detection, different detection
models could be integrated depending on the driving mode and the driver's
age.
Overall, the results indicate that various preconditions can and should be
considered to adapt and personalize driver drowsiness detection systems and
to model different groups of users (RQ1). With the knowledge gained, the
performance of intelligent driver-vehicle interfaces, which are intended to
warn the driver at the onset of drowsiness, can be increased to ensure safe
driving and to avoid drowsiness-related crashes in the best possible way.
6.2 Driver Drowsiness Detection with Vital Data from Smart Wearables (RQ2)
In the first experiment (see Section 4.1), the goal was to investigate the
potential of using HRV as the single data input for a supervised machine
learning model for driver drowsiness detection. Through HRV analysis, the
activity of the ANS can be assessed [117], which helps to gain detailed
insights into the current drowsiness state of a person [122]. For this
reason, HRV has been applied particularly often for driver drowsiness
detection [117, 118, 119]. However, the recording was often very intrusive,
e.g., by attaching adhesive electrodes to the driver's upper body for an ECG
measurement. For this experiment, HRV analysis was performed with
physiological data from a consumer-grade wrist-worn wearable device. The
standard physiological measure of fitness trackers and smartwatches on the
market is still the heart rate. However, newer devices equipped with more
advanced sensors can measure the RR intervals needed to carry out HRV
analysis, e.g., for stress detection. Therefore, it can be assumed that the
measurement of HRV will become standard shortly. To further examine
feasibility and accuracy, the results were compared with reference data from
an intrusive, medical-grade ECG device, for which electrodes had to be
attached to the upper body.
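To illustrate what HRV analysis on RR intervals involves, the following sketch derives a few common time-domain measures. It is an assumption for illustration only: SDNN and RMSSD are standard HRV measures, but the feature set actually used in the experiment is not restated here, and the function name and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def hrv_time_domain(rr_ms):
    """Compute basic time-domain HRV measures from RR intervals in milliseconds."""
    if len(rr_ms) < 3:
        raise ValueError("at least three RR intervals are required")
    # Differences between successive RR intervals.
    successive_diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    mean_rr = mean(rr_ms)
    return {
        "mean_rr_ms": mean_rr,
        "mean_hr_bpm": 60000 / mean_rr,
        "sdnn_ms": stdev(rr_ms),  # overall variability of the RR intervals
        "rmssd_ms": sqrt(mean([d * d for d in successive_diffs])),  # beat-to-beat
    }


# Hypothetical RR intervals of a short recording window.
features = hrv_time_domain([800, 810, 790, 800])
```

Features of this kind, computed per time window, could then serve as inputs to a supervised classifier.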
The automated drive in study 1 served as the database in this experiment.
In user-dependent (UDT) and user-independent (UIT) tests, the two devices
and different machine learning models were compared. Results showed that the
results obtained with the data of the smart wearable device are, albeit not
for all models, at a similar level to those of the more intrusive
medical-grade device in the in-vehicle setting. Regarding the choice of the
supervised machine learning model, accuracies >90% could be achieved in the
UDT with KNN, RT, and RF by using exclusively physiological data (HRV) of the
wristband. In the current development stage, the models were trained with
data from fewer than 30 persons, and no hyper-parameters were optimized.
Thus, for the present application, the more complex ECG measurement would not
have to be applied. Instead, the much less intrusive sensor of the wrist-worn
smart wearable would suffice to conduct HRV analysis and use the calculated
features for driver drowsiness detection.
If the results of UDT and UIT are compared with one another, the UDT results
are significantly higher. This reflects the considerable influence of
inter-driver variance. Due to the lower number of drowsy instances, the
classification of this class turned out to be particularly critical in the
UIT for both devices. With a considerably more extensive and balanced data
set, this could probably have been better accounted for, since oversampling
of the data set during training could not eliminate this problem either.
However, from the customer's point of view, it is also essential to detect
the non-drowsy state. The driver does not want to be irritated by false
drowsiness warnings and would probably switch off the drowsiness detection
system. Creating a meaningful database is one of the major challenges that
need to be addressed in future research and industry before a robust
commercial warning system can be developed with a generalized model and a
high ability to adapt to new and unseen data. Since people behave very
differently in certain physical states, the system could also transition from
a user-independent model to a user-dependent one and adapt to the user to
increase detection performance. To this end, the drivers' feedback about
their drowsiness state could be requested while driving and applied for
offline training of the drowsiness detection model, e.g., in the cloud. With
over-the-air (OTA) updates, which more and more car manufacturers use, model
performance could be steadily increased. Instead of such a request, a further
possibility would be to have the driver either confirm or correct the
detected drowsiness at defined time intervals.
Due to the increasing demand for smart wearables, more and more manufacturers
are entering this market. Therefore, in the second experiment (see Section
4.2), different wearables were compared with one another. Heart rate was used
as the physiological parameter for the detection of drowsiness. Previous
studies show that heart rate varies significantly between wakefulness and
sleep [10, 30]. Heart rate is the standard physiological parameter measured
by the majority of consumer-grade wearable devices. The question arises
whether it is also possible to infer driver drowsiness from it.
The automated drive in study 2 served as the database. Since the focus in
this case is on comparing different wearables and machine learning models for
different levels of drowsiness, the investigations were carried out purely as
UDTs. High accuracies were achieved in both 2- and 3-level classifications of
drowsiness. However, as can be seen from the results, successful detection
strongly depends on the classifier type. As in the previous experiment, KNN,
RF, and RT again resulted in the highest accuracies. In general, the three
devices used show very similar results for all tested classifiers. Therefore,
different wearable devices could be considered for later in-vehicle usage.
However, it is noteworthy that one device, in some cases, achieved marginally
poorer results than the other two devices. This device provides a heart rate
value about every three seconds across all subjects, whereas the other
devices deliver a value every second. This could probably be attributed to a
different analysis of the PPG signal or a higher sensitivity towards external
influences (vibrations, strong movements). The reduced number of data points
may have negatively affected the expressiveness of some of the extracted
features and thus the classification accuracy. It can be deduced that better
results could be achieved for the present case by sampling the heart rate
with a higher frequency. However, this finding needs to be investigated with
other wrist-worn wearables and larger data sets. The differences between the
results of the 2-level and 3-level classification are relatively small, with
slightly better performance in the 2-level case. Moreover, in the 3-level
classification, the differences between the devices are smaller than in the
2-level case. Depending on the concrete use case, this speaks for the use of
wrist-worn wearables in both cases. In general, promising results could be
achieved by using solely heart rate from consumer-grade wearables as input
for drowsiness detection. However, in this experiment, the detection
performance of the tested classifiers was evaluated in the form of a
user-dependent test applying 10-fold stratified cross-validation. As in the
previous experiment, the best performing models have to be tested in
user-independent tests (UIT) to be able to assess to what extent they are
capable of generalizing to new data.
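The difference between the two evaluation schemes can be sketched in terms of how the data are split. The following is an illustrative assumption rather than the procedure used in the experiments: the user-independent test is shown here as a leave-one-subject-out split, the stratified folds are built without shuffling, and all names are hypothetical.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """User-independent splits: each subject is held out once for testing."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for held_out, test_idx in by_subject.items():
        train_idx = [i for subject, idxs in by_subject.items()
                     if subject != held_out for i in idxs]
        yield held_out, train_idx, test_idx


def stratified_folds(labels, k=10):
    """User-dependent style folds: each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal samples round-robin
    return folds


# Hypothetical sample-to-subject assignment and drowsiness labels.
splits = list(leave_one_subject_out(["a", "a", "b", "c", "c"]))
folds = stratified_folds([1, 1, 1, 1, 2, 2], k=2)
```

In the stratified folds, samples of the same driver can appear in both training and test data, whereas the subject-wise split never mixes them, which is why the latter says more about generalization to unseen drivers.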
Another aspect that was investigated in the context of the model development
for driver drowsiness detection, but is not the main focus for answering this
research question, is the ground truth for drowsiness. In the first
experiment, observer ratings combined with detected micro-sleep events served
as ground truth for drowsiness. In the second one, self-ratings were applied.
In both experiments, the ratings represented intervals of five minutes in
length. The question arises whether it is sufficient to use drowsiness
ratings that apply to 5-minute intervals, whether ratings at much shorter
intervals are needed, or whether a different ground truth for drowsiness
should be considered altogether. In previous research on driver drowsiness
detection, the acquisition of ground truth is, in most cases, tailored to the
respective study. So far, a uniform process and drowsiness scale, general
guidelines, or recommendations that would make different works comparable
with each other are still missing. In the course of this Ph.D. thesis,
results from a complexity analysis (see Section 4.3) show that with a
decreased rating frequency, a ground truth of almost comparable quality can
be determined, independent of the number of drowsiness levels. It can be
deduced that drowsiness represents a rather slowly changing state, and
therefore longer rating intervals are sufficient. Deriving a standard process
for collecting a reliable ground truth for drowsiness from the obtained
results alone is a complex undertaking. However, the obtained results can
serve as recommendations for further research on this topic.
The proposed methodology was implemented to show and discuss the feasibility
of using solely physiological data from a wrist-worn wearable device for
driver drowsiness detection in combination with a supervised machine learning
classifier. In general, the results indicate that drowsiness can be derived
by applying vital data (HRV, heart rate) from wrist-worn smart wearable
devices (RQ2), and its detection in an automotive context is feasible. Open
challenges and issues were discussed and highlighted and can serve as a
starting point for further research in this area.
The results of the conducted simulator study showed that, in general, a high
level of user experience and technology acceptance could be ascertained after
using the system for the first time, with higher ratings among the older
subjects. Nearly all participants mentioned that they found the idea of the
system exciting, useful, and easy to understand. Thus, both of these
hypotheses could be accepted. Both age groups found the system easy to
understand and use. With regard to significant differences due to age, the
postulated hypotheses could only be partially accepted, as the significance
did not apply to all subscales of the questionnaires used. The older study
participants found the product more innovative, more appealing, and more
interesting than the younger participants did. Therefore, to increase
acceptance among younger subjects, special attention must be paid to the user
interface design to make the system more appealing. Moreover, the older
participants found the system more useful and showed a more positive attitude
towards the system and a higher intention to use it. Older subjects are a
potential target group for automated driving. Despite possible health
problems, they want to stay mobile, which possibly leads them to ascribe
higher usability to the system and motivates them to use it. A non-intrusive
system, consisting of an easy-to-use mobile application and a wrist-worn
wearable, does not restrict drivers while driving in a partially automated
vehicle and can also give them a feeling of security and confidence.
Especially the older participants appreciated that the wearable device could
be worn like a watch and thus in a familiar way. During the interviews, the
participants mentioned several other aspects that would enhance acceptance:
The wearable should be sold directly with the vehicle as a complete package.
If the vehicle is retrofitted with this feature, the wearable could be
purchased at that time. Further, it should also be possible to charge the
device in the vehicle.
6.3 Acceptance of Drowsiness Detection Systems based on Smart Wearables (RQ3)
The user manual for operating the wearable and generally using the system
was of great importance, especially for the older participants, who might have
less affinity for new technologies than younger people. Even though smart
wearables are becoming more and more established in society, the results of
the post-questionnaires of studies 2 and 3 showed that, in general and also
among the young subjects, only a few had used a wearable before or intended to
buy one in the near future. Therefore, it will be decisive how automobile
manufacturers draw attention to themselves in this regard to enhance the
acceptance of potential customers for using this type of system in the vehicle.
Different aspects and questions should be considered in this context, depending
on the specific business case: Will the wearable be sold directly with the
vehicle? Can the device be used privately and outside the vehicle for other
activities, or is it only intended and designed for in-vehicle usage? Is it
only used for drowsiness detection inside the vehicle? Which other states can
be detected by using physiological data from wearable devices inside the
vehicle? Which wearables can generally be used in the vehicle for that purpose?
For which manufacturers' wearable devices is an interface in the vehicle
available? Suppose the wearable is not sold with the car, and a more general
solution is preferred: How can the market for wearable devices be covered in
the best possible way, i.e., should interfaces only be available for the
manufacturers of the currently most popular wearables?
Overall, the discussed results show that it is feasible to detect drowsiness purely
with physiological data from consumer-grade wrist-worn wearables combined
with supervised machine learning. The focus should not only lie on the pure
development of the models but also on preconditions, such as external or human
factors, which can be used to adapt the systems and increase their performance.
Furthermore, it is essential to pay attention to how these systems are designed
and advertised to make them equally attractive across the different age groups.
It became evident that there are still many aspects and open challenges that
need to be further researched, related both to the topics examined in this
doctoral thesis and to research on driver drowsiness detection in general.
These include, e.g., more realistic driving studies, the collection of a
balanced and sufficiently large database for model development, and the
establishment of a uniform ground truth for drowsiness. The knowledge gained in
this Ph.D. thesis can serve as a starting point for further research.
The requirements that international institutions place on automobile
manufacturers for driver state monitoring have changed and tightened, also in
light of the ongoing progress in automated driving. Safety on the roads needs
to be increased by further reducing fatal accidents caused by risk factors such
as driver drowsiness. The findings of this work make a core contribution to the
improvement of driver drowsiness detection systems and traffic safety on the
way to the complete automation of the driving task. However, beyond (automated)
driving, the automotive industry, and car manufacturers, other areas are also
conceivable in which drowsiness represents a major risk factor and intelligent
detection systems based on wrist-worn wearable devices could be applied.
Areas in which systems for recognizing drowsiness and fatigue have been
investigated in recent years are aviation [219, 220] and maritime operations
[221, 222]. Existing aircraft and ships could be retrofitted like vehicles to
warn pilots and ship captains when the system detects drowsiness.
Another application area is Industry 4.0, in which the focus is on the
following goals and design principles: interconnection, information
transparency, technical assistance, and decentralized decisions [223]. With a
focus on wearable devices and drowsiness detection, interconnection and
technical assistance are particularly relevant. Interconnection aims at people,
devices, and sensors being interconnected and able to communicate, e.g., via
the Internet of Things (IoT). As part of technical assistance, humans shall be
supported in difficult or dangerous tasks. By using and networking the work
6.4 Further Deployment Scenarios for Drowsiness Detection Systems based on
Smart Wearables
7 Conclusion
As the automation of the driving task progresses, new challenges arise for car
manufacturers. Notably, the interaction of the car with the driver, who forms
the fallback for the automated system in the lower levels of automation, is
one of the essential aspects. How can it be ensured that the driver can take
over complete control from the vehicle in an acceptable time frame after a TOR?
To guarantee this, reliable driver state monitoring and detection systems are
moving more and more to the fore. In addition, the framework conditions in the
area of driver state monitoring for automobile manufacturers have changed and
tightened. To express the importance of these systems, international
institutions have integrated them into their programs. Based on the General
Safety Regulations of the European Union (EU), a system for driver drowsiness
detection will be legally binding for new vehicle types from July 2022 and for
all vehicles to be registered from July 2024 [19]. Further, in the 2025 Roadmap
of the European New Car Assessment Program (EuroNCAP), driver monitoring is
listed in the category of primary safety [16]. To overcome limitations of
existing drowsiness detection systems, and encouraged by the advancements in
smart wearable devices in consumer electronics in recent years, this work
investigated their suitability in an automotive environment, particularly in
the field of driver drowsiness detection for automated driving.
The final chapter concludes the Ph.D. thesis with a summary and presents
recommendations for the design and development of drowsiness detection systems
in the automotive industry and in general. It highlights the main contributions
and addresses limitations and possible future work.
7.1 Summary
In introducing and motivating the topic, the need for systems to identify the
driver's state, especially the risk factor drowsiness, was emphasized. Since ac-
cidents caused by drowsiness still occur very often, international institutions
are now also stipulating that automobile manufacturers integrate systems of
this type into vehicles in the future. Both the future legal obligation and the
ongoing driving automation require systems for reliable driver state monitor-
ing.
Based on previous studies from related work and to examine RQ1 and RQ2, a
study setting was presented and carried out in the simulator and on a test
track. The results indicate that various preconditions can be considered to
adapt, personalize, and increase the performance of driver drowsiness detection
systems (RQ1). Different configurations of drowsiness detection systems, e.g.,
for different times of the day, driving modes, and age groups, can be developed
to ensure safe driving and avoid crashes caused by driver drowsiness in the
best possible way. Experiments followed in which the development and potential
of drowsiness detection models with physiological data from wearable devices
were investigated (RQ2). In general, promising results indicate that drowsiness
can be derived from vital data recorded by wrist-worn smart wearable devices
and that its detection in an automotive context is feasible. Open challenges
and issues were discussed and highlighted and can serve as a starting point for
further research in this area. Building on the previous results, a portable
prototype for real-time driver drowsiness detection based on a smart wearable
was presented and evaluated in a third user study. The results showed a high
acceptance of the idea of this kind of system (RQ3). However, for automobile
manufacturers, it will be important in the future to draw attention to
themselves in this regard so that potential customers will accept and use
systems of this type in the vehicle.
Finally, based on the postulated hypothesis and three research questions, the
results were discussed, and aspects were presented that need to be tackled in
future research.
Based on the results obtained in this work, recommendations for the design
and development of drowsiness detection systems are summarized.
- Usage of time on task, i.e., driving time, as input for the drowsiness
  detection system, especially when ADAS are activated.
- Usage of the time of day as input for the adaptation of the drowsiness
  detection system, e.g., by issuing earlier warnings at times of the day
  when humans are more prone to becoming drowsy.
- Usage of different detection models for different age groups and driving
  modes.
- Drowsiness detection by applying heart rate and HRV data from wrist-worn
  wearable devices is feasible.
- The focus should not only lie on the detection of drowsiness but also on
  the detection of alertness/wakefulness. Too many false alarms can, in the
  worst case, lead to the deactivation of the system by the user.
- The wearable device should deliver the heart rate at regular and small
  time intervals. Larger or irregular intervals can result in lower detection
  accuracies.
- Special attention must be paid to the design of the user interface of the
  drowsiness detection system in order to make it appealing for all age
  groups.
- A user manual for operating the wearable and generally using the system
  is of great importance, especially for older people, who might have less
  affinity for new technologies than younger people.
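The recommendation on regular and small heart rate intervals can be checked on
recorded data before any model is trained. The following is a minimal sketch,
assuming sample timestamps in seconds; the function name and both limits
(`max_gap_s`, `max_jitter_s`) are hypothetical placeholders and not values
derived in this thesis.

```python
from statistics import mean, pstdev


def sampling_report(timestamps_s, max_gap_s=5.0, max_jitter_s=1.0):
    """Summarize how regularly heart-rate samples arrive.

    timestamps_s: sample times in seconds, strictly ascending.
    max_gap_s, max_jitter_s: hypothetical limits chosen for illustration.
    """
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples")
    gaps = [b - a for a, b in zip(timestamps_s, timestamps_s[1:])]
    if any(g <= 0 for g in gaps):
        raise ValueError("timestamps must be strictly ascending")
    report = {
        "mean_gap_s": mean(gaps),
        "largest_gap_s": max(gaps),
        "jitter_s": pstdev(gaps),  # spread of the gaps around their mean
    }
    # A recording counts as usable only if it is both dense and regular.
    report["usable"] = (report["largest_gap_s"] <= max_gap_s
                        and report["jitter_s"] <= max_jitter_s)
    return report


regular = sampling_report([0, 1, 2, 3, 4, 5])
irregular = sampling_report([0, 1, 2, 12, 13, 30])
```

A device whose recordings fail such a check would be a candidate for
exclusion or resampling before feature extraction.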
7.2 Limitations and Future Work
Certain precautions were taken to induce drowsiness in all three studies. These
included no communication with the experimenter, no food, no caffeinated
drinks five hours before the study, a monotonous driving route with a low
speed limit, and a warm temperature inside the simulator and test vehicle. In
real traffic, drowsiness would generally occur later than in both scenarios,
since the possible dangers of real traffic can be neglected in a
simulator-based environment and on a test track. Therefore, the results
obtained may not have absolute validity and cannot be fully transferred to
real-world scenarios. Simulators lack realism due to potentially inadequate
movements or fields of view or a possibly poor graphical representation of the
simulation environment. Especially in the simulator, the hurdle to fall asleep
is lower for the test persons since they are not confronted with severe and
realistic consequences. Although a more realistic follow-up study was conducted
on a test track in the course of this thesis, a similar study should also be
carried out in real traffic. Moreover, drives of longer duration, possibly
sleep-deprived participants, and night-time drives should be considered to
collect a data set that is more evenly distributed in terms of drowsy and not
drowsy samples.
Regarding participants, the focus in these studies was on the comparison of two
age groups selected based on their recommended average sleep requirements.
The validity of the observed effects needs to be strengthened with a higher
number of subjects. Further, participants from other age groups should be
considered. Another limiting factor is the selection process of participants.
Whereas the older participants were recruited via an advertisement in a local
newspaper, the younger participants were mainly students from the Technische
Hochschule Ingolstadt (THI), who possibly have a different view on technology
or similar aspects. Therefore, this should be considered when interpreting the
obtained effects, and future studies should be conducted with participants who
better represent the general population.
Due to inter-driver variance and an unevenly balanced training data set with
only 30 participants from two age groups, the machine learning models could
not predict drowsiness for all users with the same accuracy and reliability in
the user-independent tests. To develop generalized solutions, a balanced and
sufficiently large database needs to be collected in future research. Further,
different and more objective types of ground truth for drowsiness should be
considered. In this work, heart rate and HRV were applied for driver drowsiness
detection. Other physiological parameters that can be measured with wearable
devices should be applied in the future. The different experiments showed
that it is feasible to apply physiological data from smart wearables to detect
driver drowsiness. However, since only a small number of different devices were
used, the results obtained may not be transferable to all devices available on
the market. Therefore, more devices from different manufacturers need to be
considered and compared in future work.
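A user-independent test, as referred to above, keeps all samples of one
participant out of training so that a model is always evaluated on an unseen
driver. A common way to organize this is a leave-one-subject-out split. The
sketch below is a generic illustration of that idea, not the evaluation code
used in this thesis; the function name and the toy samples are assumptions.

```python
from collections import defaultdict


def leave_one_subject_out(samples):
    """Yield (held_out_id, train, test) with one participant held out per fold.

    samples: list of (participant_id, features, label) tuples.
    """
    by_subject = defaultdict(list)
    for sample in samples:
        by_subject[sample[0]].append(sample)
    for held_out in sorted(by_subject):
        test = by_subject[held_out]
        # Training data comes from every participant except the held-out one.
        train = [s for pid in sorted(by_subject) if pid != held_out
                 for s in by_subject[pid]]
        yield held_out, train, test


# Hypothetical toy data: (participant, [mean heart rate, RMSSD], label).
data = [
    ("p1", [62.0, 48.0], "drowsy"),
    ("p1", [70.0, 35.0], "alert"),
    ("p2", [58.0, 55.0], "drowsy"),
    ("p3", [75.0, 30.0], "alert"),
]
folds = list(leave_one_subject_out(data))
```

Reporting the score of every fold separately, rather than only the average,
makes inter-driver variance visible.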
Further, the applied models were used with the default parameters preset in
the Weka machine learning library; no hyperparameters were tuned. The aim was
to identify which standard machine learning models are suitable for the
proposed classification problem. In future research, fine-tuning the
hyperparameters of the most promising classifiers could increase performance.
Since the focus in this work was on a specific type of feature extraction
(sliding window), feature selection (CFSS), and class balancing (SMOTE), other
methods should be considered and compared. Moreover, rule-based, unsupervised,
or deep learning approaches should be considered in future research.
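To make the sliding-window feature extraction concrete, the following minimal
sketch computes three standard measures (mean heart rate, SDNN, and RMSSD) per
window of RR intervals. It is an illustration only: the window is defined by a
number of beats rather than a duration, the RR values are hypothetical, and the
feature set, window length, and overlap used in this thesis may differ.

```python
from math import sqrt
from statistics import mean, stdev


def hrv_features(rr_ms):
    """Mean heart rate, SDNN and RMSSD for one window of RR intervals (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return {
        "mean_hr_bpm": 60000.0 / mean(rr_ms),
        "sdnn_ms": stdev(rr_ms),                       # overall variability
        "rmssd_ms": sqrt(mean(d * d for d in diffs)),  # beat-to-beat variability
    }


def sliding_windows(values, size, step):
    """Yield consecutive windows of `size` items, advancing by `step` items."""
    if size < 2 or step < 1:
        raise ValueError("size must be >= 2 and step >= 1")
    for start in range(0, len(values) - size + 1, step):
        yield values[start:start + size]


# Hypothetical RR intervals in milliseconds.
rr = [800, 810, 790, 805, 795, 820, 780, 800]
features = [hrv_features(w) for w in sliding_windows(rr, size=4, step=2)]
```

Each dictionary in `features` corresponds to one row of the feature table that
would later be passed to feature selection, class balancing, and a classifier.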
7.3 Contributions
In the following, the core findings and individual contributions to the field
of driver drowsiness detection are briefly summarized.
For bridging the gap between simulator studies and experiments in real
traffic, a study setting for drowsiness was presented that approaches a
more realistic scenario with safe and reproducible conditions that minimize
risk and danger for the participants [6].
A Publications and Contribution
Statement
The following publications were published in the context of this doctoral the-
sis:
[2] Kundinger, T., Riener, A., Sofra, N. & Weigl, K. (2018). Drowsi-
ness Detection and Warning in Manual and Automated Driving: Re-
sults from Subjective Evaluation. In Proceedings of the 10th International
Conference on Automotive User Interfaces and Interactive Vehicular Applica-
tions (pp. 229-236).
My contribution: I developed the initial idea and the study design for this
work in joint discussions with Andreas Riener. I implemented the driving
scenario and the application on the tablet. Further, I conducted the whole
experiment and carried out all evaluations and analyses except the statistical
analysis of drowsiness self-ratings. I authored most parts of the paper.
My contribution: This work is based on the user study presented in [6]. The
initial idea came up in joint discussions with Philipp Wintersberger. I con-
ducted the whole experiment and pre-processed the data for statistical evalu-
ation. I authored most parts of the paper.
My contribution: This work is based on the user study presented in [2]. All
evaluations and analyses carried out in this work were performed by me. I
authored most parts of the paper.
[6] Kundinger, T., Riener, A., Sofra, N. & Weigl, K. (2020). Driver
Drowsiness in Automated and Manual Driving: Insights from a Test
Track Study. In Proceedings of the 25th International Conference on Intelli-
gent User Interfaces (pp. 369-379).
My contribution: I developed the initial idea and the study design for this
work in joint discussions with Andreas Riener and implemented the application
on the tablet. Further, I conducted the whole experiment and carried out
all evaluations and analyses except the statistical analysis of drowsiness self-
ratings. I authored most parts of the paper.
My contribution: I developed the initial idea and the study design for this
work in joint discussions with Ramyashree Bhat and Andreas Riener and imple-
mented the driving scenario. Since this study was conducted in Ramyashree's
master's thesis, the majority of evaluations was performed by her after dis-
cussing them with me. I authored most of the work.
B German Versions of Study
Questionnaires and Scales
1 | Extrem wach
2 | Sehr wach
3 | Wach
4 | Einigermaßen wach
5 | Weder wach noch müde
6 | Erste Anzeichen von Müdigkeit
7 | Müde, aber keine Probleme wach zu bleiben
8 | Müde, erste Probleme wach zu bleiben
9 | Sehr müde, mit dem Schlaf kämpfend
B.4 Trust Scale
unerfreulich | erfreulich
unverständlich | verständlich
kreativ | phantasielos
leicht zu lernen | schwer zu lernen
wertvoll | minderwertig
langweilig | spannend
uninteressant | interessant
unberechenbar | voraussagbar
schnell | langsam
originell | konventionell
behindernd | unterstützend
gut | schlecht
kompliziert | einfach
abstoßend | anziehend
herkömmlich | neuartig
unangenehm | angenehm
sicher | unsicher
aktivierend | einschläfernd
erwartungskonform | nicht erwartungskonform
ineffizient | effizient
übersichtlich | verwirrend
unpragmatisch | pragmatisch
aufgeräumt | überladen
attraktiv | unattraktiv
sympathisch | unsympathisch
konservativ | innovativ
C German Version of Developed
Android Application
[Figure: panels (a) to (f), German version of the developed Android application]
Bibliography
[13] USA Today, AAA: Drowsy driving plays larger role in accidents
than federal statistics suggest, 2019, https://eu.usatoday.com/story/
news/2018/02/07/aaa-drowsy-driving-plays-larger-role-accidents-than-
federal-statistics-suggest/313226002/ (retrieved May 16, 2021).
[14] World Health Organisation, Global Status Report on Road Safety 2018:
Summary, World Health Organization, Tech. Rep. 1, 2018. [Online].
Available: https://www.who.int/publications/i/item/9789241565684
[17] B. C. Tefft, Acute Sleep Deprivation and Risk of Motor Vehicle Crash
Involvement, Tech. Rep., 2016, https://aaafoundation.org/wp-content/
uploads/2017/12/AcuteSleepDeprivationCrashRisk.pdf (retrieved May
16, 2021).
[22] Jordan Golson, Tesla driver killed in crash with autopilot active, nhtsa
investigating, 2016, https://www.theverge.com/2016/6/30/12072408/
tesla-autopilot-car-crash-death-autonomous-model-s (retrieved May 16,
2021).
[31] IDC Corporate USA, Worldwide wearables market to top 300 million
units in 2019 and nearly 500 million units in 2023, 2019, https://
www.idc.com/getdoc.jsp?containerId=prUS45737919 (retrieved May 16,
2021).
driving, Eur. Transp. Res. Rev., vol. 7, no. 4, p. 38, November 2015.
[Online]. Available: https://doi.org/10.1007/s12544-015-0188-y
face and eye tracking, Int J. Adv. Comput. Sci. Appl., vol. 10, 01 2019.
[Online]. Available: https://doi.org/10.14569/IJACSA.2019.0100775
[68] SmartEye, Driver Monitoring System | Interior Sensing for vehicle in-
tegration, 2019, https://smarteye.se/automotive-solutions/ (retrieved
May 16, 2021).
Factors, vol. 41, no. 1, pp. 118-128, 1999, PMID: 10354808. [Online].
Available: https://doi.org/10.1518/001872099779577336
[85] R. Feng, G. Zhang, and B. Cheng, An on-board system for detecting
driver drowsiness based on multi-sensor data fusion using dempster-
shafer theory, in 2009 International Conference on Networking, Sensing
and Control, 2009, pp. 897-902. [Online]. Available: https://doi.org/
10.1109/ICNSC.2009.4919399
[122] I. Lee, P. Lau, E. C.-P. Chua, J. J. Gooley, W.-Q. Tan, S.-C. Yeo,
K. Puvanendran, and I. H. Mien, Heart Rate Variability Can Be Used
to Estimate Sleepiness-related Decrements in Psychomotor Vigilance
during Total Sleep Deprivation, Sleep, vol. 35, no. 3, pp. 325-334,
March 2012. [Online]. Available: https://doi.org/10.5665/sleep.1688
[125] Watch Ranker, How do smartwatches & fitness trackers measure your
heart rate (hr)? 2020, https://watchranker.com/how-do-smartwatches-
fitness-trackers-measure-heart-rate/.
[126] Tom's Guide, Who has the most accurate heart rate monitor? 2018,
https://www.tomsguide.com/us/heart-rate-monitor,review-2885.html.
[134] Final Report Summary - HARKEN (Heart and respiration in-car embed-
ded nonintrusive sensors) | Report Summary | HARKEN | FP7 | CORDIS
| European Commission, 2014, https://cordis.europa.eu/project/rcn/
103870/reporting/en (retrieved May 16, 2021).
[139] Creative Mode, STEER: Wearable device that will not let you
fall asleep, 2019, https://www.kickstarter.com/projects/creativemode/
steer-you-will-never-fall-asleep-while-driving?lang=en (retrieved May 16,
2021).
[143] B.-l. Lee, B.-g. Lee, G. Li, and W.-Y. Chung, Wearable Driver Drowsi-
ness Detection System Based on Smartwatch, in Korea Institute of Signal
Processing and Systems, vol. 15, 2014, pp. 134-146.
[144] Q. Li, J. Wu, S.-D. Kim, and C.-G. Kim, Hybrid driver fatigue detection
system based on data fusion with wearable sensor devices, 2014.
[146] J. Gielen and J.-M. Aerts, Feature extraction and evaluation for driver
drowsiness detection based on thermoregulation, Applied Sciences,
vol. 9, no. 17, 2019. [Online]. Available: https://doi.org/10.3390/
app9173555
New York, NY, USA: ACM, 2015, pp. 281-288. [Online]. Available:
https://doi.org/10.1145/2799250.2799288
[161] Sleep Health Foundation, Sleep needs across the lifespan, 2015,
http://www.sleephealthfoundation.org.au/files/pdfs/Sleep-Needs-
Across-Lifespan.pdf (retrieved May 16, 2021).
[169] J. Koo, J. Kwac, W. Ju, M. Steinert, L. Leifer, and C. Nass, Why did my
car just do that? Explaining semi-autonomous driving actions to improve
[172] B. Pfleging, M. Rang, and N. Broy, Investigating user needs for non-
driving-related activities during automated driving, in Proceedings of
the 15th International Conference on Mobile and Ubiquitous Multimedia,
ser. MUM'16. New York, NY, USA: Association for Computing
Machinery, 2016, pp. 91-99. [Online]. Available: https://doi.org/
10.1145/3012709.3012735
[179] Bittium Corporation, Bittium faros waterproof ecg devices, 2019, https:
//www.bittium.com/medical/bittium-faros (retrieved May 16, 2021).
[181] The Epworth Sleepiness Scale, About the ess, 2020, http://
epworthsleepinessscale.com/about-the-ess/.
[182] Stähle GmbH, Automated driving system sfphybrid for cars, 2020,
https://www.staehle-robots.com/english-1/products/proving-ground-
driving-systems/ (retrieved May 16, 2021).
America, vol. 110, no. 44, pp. 18011-18016, oct 2013. [Online].
Available: http://www.ncbi.nlm.nih.gov/pubmed/24128759
[193] S.-J. Jung, H.-S. Shin, and W.-Y. Chung, Driver fatigue and drowsiness
monitoring system with embedded electrocardiogram sensor on steering
wheel, IET Intell. Transp. Syst., vol. 8, no. 1, pp. 43-50, 2014. [Online].
Available: https://doi.org/10.1049/iet-its.2012.0032
Available: https://www.degruyter.com/view/j/cdbme.2016.2.issue-1/
cdbme-2016-0063/cdbme-2016-0063.xml
[205] Q. Li, J. Wu, S.-D. Kim, and C.-G. Kim, Hybrid driver fatigue detection
system based on data fusion with wearable sensor devices, 2015.
[211] Polar, Polar oh1 - optical heart rate sensor, 2020, https:
//www.polar.com/us-en/products/accessories/oh1-optical-heart-rate-
sensor.
[214] L. Breiman, Random forests, Machine Learning, vol. 45, no. 1, pp. 5-32,
oct 2001. [Online]. Available: https://doi.org/10.1023/A:1010933404324