Estimation

Chi Zhang
School of Computer Science and Technology
China University of Mining and Technology
Xuzhou, 221116, China
Email: chizhang@cumt.edu.cn

Rui Yao
School of Computer Science and Technology
China University of Mining and Technology
Xuzhou, 221116, China
Email: ruiyao@cumt.edu.cn

Jinpeng Cai
College of Engineering & Computer Science
Australian National University
Canberra, 2601, Australia
Email: u5715685@anu.edu.au

arXiv:1707.00548v1 [cs.CV] 3 Jul 2017
Abstract—Vision based text entry systems aim to help disabled people achieve text communication using eye movement. Most previous methods have employed an existing eye tracker to predict gaze direction and design an input method based upon that. However, these methods can result in eye tracking quality becoming easily affected by various factors and lengthy amounts of time for calibration. Our paper presents a novel, efficient gaze based text input method, which has the advantages of low cost and robustness. Users can type in words by looking at an on-screen keyboard and blinking. Rather than estimating gaze angles directly to track eyes, we introduce a method that divides the human gaze into nine directions. This method can effectively improve the accuracy of making a selection by gaze and blinks. We build a Convolutional Neural Network (CNN) model for 9-direction gaze estimation. On the basis of the 9-direction gaze, we use a nine-key T9 input method which is widely used in candy bar phones. Bar phones were very popular in the world decades ago and have cultivated strong user habits and language models. To train a robust gaze estimator, we created a large-scale dataset with images of eyes sourced from 25 people. According to the results from our experiments, our CNN model is able to accurately estimate different people's gaze under various lighting conditions and by different devices. In consideration of disabled people's needs, we removed the complex calibration process. The input method can run in screen mode and portable off-screen mode. Moreover, the datasets used in our experiments are made available to the community to allow further experimentation.

I. INTRODUCTION

Vision based human-computer interaction (HCI) is a hot topic in the field of computer vision. We design a low-cost gaze controlled text entry method to help disabled people, who may be unable to carry out the basic activities of daily living but can move their eyes, achieve text communication. A typical video-based eye typing system uses a camera to capture eye movement, from which users can type in words by looking at an on-screen keyboard. This process can be divided into two parts: the first part is an eye tracking system and the other part is a text entry system. The eye tracking system aims to detect eyes from images and then estimate the gaze direction. Before using the system, users must perform a calibration to map estimated directions to corresponding positions on the screen. The text input system makes selections and inputs text based on where people are looking on the screen.

In previous works, researchers usually designed systems on the basis of existing eye trackers, so they only needed to design the text input system. However, the following issues widely exist in such systems. First, the quality of the eye tracker constrains the design of the text input system. A high accuracy eye tracking system can estimate human gaze with less error, so it can precisely select targets even if objects on the screen are densely placed. A high accuracy tracking system demands high precision equipment, with low precision equipment unable to meet the input system's accuracy requirement. Accurate eye tracking requires high-resolution human eye images, which demands a high quality camera and optimal lighting conditions. This increases the cost and limits the range of applicable environments. Furthermore, if the device is changed, the camera probably needs to be recalibrated too, so most previous systems are device-specific. Currently, there are still many challenges for eye tracking that need to be overcome: low-resolution images, poor lighting conditions and physiological responses (such as people looking down having a natural squint and tired people showing droopy eyelids). Fig. 1 illustrates examples of eye images that are hard to use for estimating gaze. Secondly, the accuracy of human gaze angle changes is 0.5° [1], and even when people stare in a fixed direction, there still exists a jittering movement of the eyes, and the body also moves subconsciously, both of which can make it difficult to execute accurate and stable eye tracking. Finally, an eye tracker requires a lot of time to calibrate before use. During the calibration process, people are not allowed to move their body, and slight movements may result in an error in the final estimation. As usage time increases, the calibrated system can become skewed and users may need to calibrate the system again. It is easy to foresee how these inconveniences could become an issue for disabled people attempting to use these systems in their daily life.

In order to solve the problems above and consider the needs of disabled people in the application of these systems, instead of employing an existing eye tracker, we attempt to solve the eye tracking problem and the input method problem together. We separate human gaze into nine directions: left-up, up, right-up, left, middle, right, left-down, down and right-down. The

Fig. 1: Eye images that are difficult to use for estimating gaze.
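The nine directions form a natural 3×3 grid, which is what lets them drive a nine-key layout. As an illustrative sketch (the row-major assignment of digits 1–9 below is our assumption, not stated in the paper), the direction-to-key mapping can be written as:

```python
# The nine gaze directions laid out as the 3x3 grid they naturally form.
# Mapping them to keypad digits 1-9 in row-major order is an assumption
# made here for illustration.
DIRECTIONS = [
    ['left-up',   'up',     'right-up'],
    ['left',      'middle', 'right'],
    ['left-down', 'down',   'right-down'],
]

def direction_to_key(name):
    """Return the keypad digit (1-9, row-major) for a direction label."""
    for r, row in enumerate(DIRECTIONS):
        for c, d in enumerate(row):
            if d == name:
                return 3 * r + c + 1
    raise ValueError(f'unknown direction: {name}')

print(direction_to_key('middle'))      # 5
print(direction_to_key('right-down'))  # 9
```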
Fig. 7: The architecture of our CNN model. The CNN consists of 4 blocks. The first three blocks have the same operations, which include convolution with 64 3×3 filters, batch normalization, ReLU and max pooling. The fourth block is a fully-connected layer with ReLU as the activation function. The input is an RGB image with the size of 32×128 pixels. The output is the scores of 10 categories.
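The caption above translates directly into code. The following PyTorch sketch follows the described four-block structure; details the caption does not give (convolution padding, pooling stride, the width of the fully-connected hidden layer) are assumptions:

```python
import torch
import torch.nn as nn

class GazeNet(nn.Module):
    """Sketch of the described 4-block CNN: three blocks of
    (3x3 conv with 64 filters, batch norm, ReLU, max pooling),
    then a fully-connected block. Padding=1, pool stride 2 and the
    hidden width of 256 are assumptions, not from the paper."""
    def __init__(self, num_classes=10):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(3):
            layers += [
                nn.Conv2d(in_ch, 64, kernel_size=3, padding=1),
                nn.BatchNorm2d(64),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ]
            in_ch = 64
        self.features = nn.Sequential(*layers)
        # Three 2x poolings shrink a 32x128 input to 4x16 feature maps.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 16, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),  # scores of the 10 categories
        )

    def forward(self, x):
        return self.classifier(self.features(x))

scores = GazeNet()(torch.zeros(1, 3, 32, 128))
print(scores.shape)  # torch.Size([1, 10])
```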
layer as the real eyes. Then we input text according to the states of the eyes. To make the input system work properly, we need to handle falsely estimated states, and filter out the estimated states caused by saccades and involuntary blinks.

A. Signal analysis

Noise mainly comes from natural blinks, saccades and false estimation. Fig. 8 (a) shows the estimated states of the eyes in each frame of a video sequence (eye state on the vertical axis, frame number from 0 to 500 on the horizontal axis). As is shown, the signal is stable
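One simple way to suppress such per-frame noise (an illustrative sketch of our own, not the paper's exact filtering rule) is to accept a state only after it persists for a minimum number of consecutive frames, which removes single-frame misestimates, saccade transients and brief involuntary blinks:

```python
def filter_states(frame_states, min_run=3):
    """Debounce a sequence of per-frame eye-state estimates.

    A new state is accepted only after it has persisted for `min_run`
    consecutive frames; until then the previously accepted state is kept
    (None before any state is accepted). The threshold of 3 frames is an
    assumption for illustration.
    """
    filtered = []
    current = None            # last accepted state
    run_state, run_len = None, 0
    for s in frame_states:
        if s == run_state:
            run_len += 1
        else:
            run_state, run_len = s, 1
        if run_len >= min_run:
            current = run_state
        filtered.append(current)
    return filtered

# A one-frame glitch (7) inside a stable run of state 4 is ignored.
states = [4, 4, 4, 7, 4, 4, 0, 0, 0, 0]
print(filter_states(states))  # [None, None, 4, 4, 4, 4, 4, 4, 0, 0]
```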
Fig. 11: Images of both eyes where the left and right eyes have different estimated results. The estimated directions of the left and right eyes are indicated by arrows on the images.
training time before the second session where they trained half an hour for each mode. There is a one-day interval between the second and third session. They were required to keep our

Fig. 12: Mean text input speed of 5 testers in three sessions (screen mode vs. off-screen mode, sessions 1–3, speed in letters per minute).

C. Discussion

Testers stated that it was easier to choose letters in screen mode as they were able to find the positions of desired letters on screen easily. On the other hand, they had to spend time recalling the letters' positions when they used the off-screen mode. When they forgot the positions of letters, they had to try directions one by one and rely on the sound feedback to select the correct letter. After they remembered the letters' positions, the speed of input increased greatly. Incorrect text input was entirely due to users moving their eyes before blinking to choose the correct letter. We can also see that by session three users were able to accurately type in words letter by letter with a very low
error rate. When people use the T9 input method on bar phones, they do not need to type each letter of a word, because on bar phones language models are used to help users type in words efficiently. Users only need to choose the letter groups. For example, to type in "hello", they only need to select buttons 4, 3, 5, 5 and 6 successively, where the letters h, e, l, l and o are located respectively, and the language model can predict the desired word. So only one selection is needed to type in a letter on average. As shown in our experiments, our input method can reach up to 20 letters per minute, which means we make 40 selections per minute. So if a language model is introduced into our input method, text entry speed can double and theoretically reach up to 40 letters per minute. Also, there are many language models based on the 9-key T9 text input keyboard on bar phones, so our vision based text input method is expected to support efficient text input in more languages.
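The "hello" example above can be sketched in a few lines. The keypad letter grouping is the standard one; the tiny word list and lookup table are an illustrative stand-in for a real bar-phone language model:

```python
# Standard phone keypad letter groups (keys 2-9 carry letters).
T9_KEYS = {
    '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
    '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}
LETTER_TO_KEY = {c: k for k, letters in T9_KEYS.items() for c in letters}

def word_to_keys(word):
    """Map a word to its T9 key sequence, e.g. 'hello' -> '43556'."""
    return ''.join(LETTER_TO_KEY[c] for c in word.lower())

def build_t9_index(vocabulary):
    """Group dictionary words by key sequence so one selection per letter
    is enough; a language model would then rank the candidate words."""
    index = {}
    for w in vocabulary:
        index.setdefault(word_to_keys(w), []).append(w)
    return index

index = build_t9_index(['hello', 'good', 'home', 'gone', 'world'])
print(word_to_keys('hello'))  # '43556'
print(index['4663'])          # ['good', 'home', 'gone'] (a classic T9 collision)
```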
VI. CONCLUSION

We have designed a vision based text entry method with which users can input text by eye movements and blinks. We designed the system by solving both the eye tracking problem and the text input problem. We selected the common bar phone 9-key T9 keyboard layout as the interface. A large scale dataset was made, collected by using several capturing devices and covering different lighting conditions, locations and times. We used images of both eyes to avoid synchronization issues with single eye images, which increased the estimation accuracy. For application purposes, we define nine gaze directions and train a CNN model for gaze estimation which can precisely estimate the gaze of both known and unknown people. The data augmentation methods we use provide robustness and increase the estimation accuracy. Our text input method utilizes low cost devices, simplifies the calibration steps and filters noise caused by saccades, natural blinks and momentary false estimation during the operation process. Our input method is able to run in two modes: off-screen mode and screen mode. In off-screen mode, users can move arbitrarily and the device is portable. The text input speed of screen mode can reach 20 letters per minute and off-screen mode can reach 18 letters per minute with a low error rate. Our future work is to introduce a language model, add word prediction and completion functions to our current system, and train a head pose invariant gaze estimator.

REFERENCES

[1] P. Majaranta and K.-J. Raiha, "Twenty years of eye typing: systems and design issues," in Proceedings of the 2002 Symposium on Eye Tracking Research & Applications. ACM, 2002, pp. 15–22.
[2] G. Dagnelie, Visual Prosthetics: Physiology, Bioengineering, Rehabilitation, 2011.
[3] K. K. C. Dohse, "Effects of field of view and stereo graphics on memory in immersive command and control." Iowa State University, 2007.
[4] D. W. Hansen and Q. Ji, "In the eye of the beholder: A survey of models for eyes and gaze," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 3, pp. 478–500, 2010.
[5] K. Krafka, A. Khosla, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik, and A. Torralba, "Eye tracking for everyone," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2176–2184.
[6] M. Silfverberg, I. S. MacKenzie, and P. Korhonen, "Predicting text entry speed on mobile phones," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2000, pp. 9–16.
[7] P. Majaranta and A. Bulling, "Eye tracking and eye-based human-computer interaction," in Advances in Physiological Computing. Springer, 2014, pp. 39–65.
[8] C. H. Morimoto, A. Amir, and M. Flickner, "Detecting eye position and gaze from a single camera and 2 light sources," in Pattern Recognition, 2002. Proceedings. 16th International Conference on, vol. 4. IEEE, 2002, pp. 314–317.
[9] J. Chen and Q. Ji, "3d gaze estimation with a single camera without ir illumination," in Pattern Recognition, 2008. ICPR 2008. 19th International Conference on. IEEE, 2008, pp. 1–4.
[10] X. Zhang, Y. Sugano, M. Fritz, and A. Bulling, "Appearance-based gaze estimation in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4511–4520.
[11] H. O. Istance, C. Spinner, and P. A. Howarth, "Providing motor impaired users with access to standard graphical user interface (gui) software via eye-based interaction," in Proceedings of the 1st European Conference on Disability, Virtual Reality and Associated Technologies (ECDVRAT96), 1996.
[12] P. Majaranta, U.-K. Ahola, and O. Spakov, "Fast gaze typing with an adjustable dwell time," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2009, pp. 357–360.
[13] J. Gips and P. Olivieri, "Eagleeyes: An eye control system for persons with disabilities," in The Eleventh International Conference on Technology and Persons with Disabilities, 1996, pp. 1–15.
[14] T. E. Hutchinson, K. P. White, W. N. Martin, K. C. Reichert, and L. A. Frey, "Human-computer interaction using eye-gaze input," IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 6, pp. 1527–1534, 1989.
[15] D. W. Hansen, J. P. Hansen, M. Nielsen, A. S. Johansen, and M. B. Stegmann, "Eye typing using markov and active appearance models," in Applications of Computer Vision, 2002 (WACV 2002). Proceedings. Sixth IEEE Workshop on. IEEE, 2002, pp. 132–136.
[16] K. Grauman, M. Betke, J. Lombardi, J. Gips, and G. R. Bradski, "Communication via eye blinks and eyebrow raises: Video-based human-computer interfaces," Universal Access in the Information Society, vol. 2, no. 4, pp. 359–373, 2003.
[17] A. Krolak and P. Strumio, "Eye-blink detection system for human-computer interaction," Universal Access in the Information Society, vol. 11, no. 4, pp. 409–419, 2012.
[18] P. Majaranta and K.-J. Raiha, "Text entry by gaze: Utilizing eye-tracking," Text Entry Systems: Mobility, Accessibility, Universality, pp. 175–187, 2007.
[19] M. Fejtova, J. Fejt, and L. Lhotska, "Controlling a pc by eye movements: The memrec project," Computers Helping People with Special Needs, pp. 623–623, 2004.
[20] J. O. Wobbrock, J. Rubinstein, M. Sawyer, and A. T. Duchowski, "Not typing but writing: Eye-based text entry using letter-like gestures," in Proceedings of the Conference on Communications by Gaze Interaction (COGAIN), 2007, pp. 61–64.
[21] O. Tuisku, P. Majaranta, P. Isokoski, and K.-J. Raiha, "Now dasher! dash away!: longitudinal study of fast text entry by eye gaze," in Proceedings of the 2008 Symposium on Eye Tracking Research & Applications. ACM, 2008, pp. 19–26.
[22] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1. IEEE, 2001, pp. I–I.
[23] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[24] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1. IEEE, 2005, pp. 886–893.