As one of the primary suppliers of AI and machine learning consulting on the local market, our squad partnered with Beehiveor R&D Labs for a joint research project on a reading assistant system (RAS). We were inspired by fitness apps and wrist trackers, which use accelerometers, heart rate sensors, and GPS data, among other things, to help people exercise better. Similarly, if we can track gaze movement, why not use this data to facilitate visual activities? Reading was chosen as the primary research objective since it is one of the activities humans perform daily.
There are certain problems people encounter while reading, and everyone resolves them in a unique way: rereading difficult parts of a text, googling unknown words, writing down details to remember. What if this could be automated? For instance, a system could track human reading, evaluate speed, distinguish reading gaze movements from other activities, and annotate hard-to-read passages. These are features that can be extremely useful for people who have to tackle huge amounts of text every day.
Before we get into the research process and findings, let’s figure out what a Reading Assistant System (RAS) is and why it is so important. A RAS is an AI- and gaze-tracking-based system for various reading analysis purposes. It can be used in many settings, and its modern applications are well documented in education, medicine, HR, marketing, and other areas.
The system will allow the user to:
- track and store all reading materials and related reading pattern metadata;
- search and filter content by reading pattern parameters (e.g. the time and rate of fixation make it possible to highlight only the areas of interest inside the text);
- get various post-processing analytical annotations;
- get real-time assistance options during reading (e.g. automated translation).
Here’s how the technology will behave in practice. The RAS program automatically tracks the movement of your gaze and matches it with the text on the screen. This makes it possible to process reading patterns in real time and store all the metadata for further analysis.
Thanks to neural networks trained with deep learning, a RAS can identify with up to 96% accuracy whether a person is reading at a given moment.
The initial purpose of our research was threefold:
- To test the high-accuracy hypothesis
- To create a real-time model
- To further research gaze patterns
Our team of 3 people carried out research activities over 2 months in the scope of DataRoot University activities. The primary task was to create a program for human visual activity analysis by gaze movement that would incorporate:
- reading/non-reading classification
- regression/sweeps/saccades detection
- gaze to text mapping and point of interest detection
We mainly used open-source Python libraries:
- DS basics: NumPy, SciPy, Matplotlib, Seaborn, Plotly, Pandas (data preparation, manipulation, visualization); statsmodels (time series research);
- ML/DL: scikit-learn (clustering algorithms, time series research), TensorFlow, Keras (CNN, LSTM);
- Computer vision: OpenCV, imutils, PIL, Tesseract (text recognition, eye movement to text mapping)
The algorithms that we used:
- Time series analysis and feature extraction + MLP
- Panorama stitching
- OCR with Tesseract
For tracking human activity we used a GazePoint eye tracker. The tool allowed us to receive gaze coordinates with a 1–1.5° angular error after calibration. During each session, the GazePoint Analysis app recorded face and screen video along with tabular data about gaze movement.
The whole dataset consists of two parts: 51 reading and 85 non-reading time series. Each participant performed the following actions in the scope of our research:
- read a 2-minute text;
- searched for specific information and objects in pictures;
- watched a 3-minute video.
The resulting time series consists of several columns:
- FPOGX, FPOGY – screen coordinates, relative to screen resolution, algorithm A
- BPOGX, BPOGY – screen coordinates, relative to screen resolution, algorithm B
- FPOGID – ID of fixation
- FPOGD – fixation duration
- BPOGV, FPOGV – validity flags for the gaze data
- BKID – blink ID
Dataset preprocessing and feature selection
In the course of our research, we found that tracked coordinates are never ideal. Blinking, head movements, variable lighting – all of these factors interrupt or corrupt the data flow.
We figured additional steps could alleviate the situation, and we picked smoothing the gaze movement as the primary option, even though the value of this approach has limits. Smoothing eliminates one important feature – microsaccades. A saccade is a quick, simultaneous movement of both eyes between two fixation points. A microsaccade is a movement within a single fixation that reveals how users fix their gaze. Though smoothing is not ideal where saccades are concerned, it does help with approximate word detection.
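As an illustration of the smoothing step, here is a minimal sketch using SciPy’s Savitzky–Golay filter on a synthetic FPOGX trace. The filter choice, window length, and polynomial order are our assumptions; the source does not specify which smoothing method was used:

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic 60 Hz gaze trace: a slow horizontal drift plus jitter
# (the jitter stands in for microsaccades and sensor noise).
rng = np.random.default_rng(0)
t = np.arange(300) / 60.0
fpogx = 0.1 + 0.003 * np.arange(300) + rng.normal(0, 0.01, 300)

# Window of 11 samples (~180 ms at 60 Hz), cubic polynomial.
smoothed = savgol_filter(fpogx, window_length=11, polyorder=3)

print(smoothed.shape)  # same length as the input
```

Note the trade-off mentioned above: the jitter removed here includes the microsaccade signal, so smoothing should be applied only where word-level position matters more than fixation microstructure.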
Here’s what filtering meant for our research purposes:
- Filtering by BPOGV, FPOGV
- Filtering only screen gaze movements
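A minimal pandas sketch of these two filters, assuming the validity flags use 1 for valid samples and the coordinates are normalized so that on-screen points fall in [0, 1] (the toy rows below are made up for illustration):

```python
import pandas as pd

# Toy sample of GazePoint rows; FPOGV/BPOGV are validity flags (1 = valid),
# FPOGX/FPOGY are screen coordinates relative to screen resolution.
df = pd.DataFrame({
    "FPOGX": [0.42, 1.30, 0.55, 0.60],
    "FPOGY": [0.50, 0.40, -0.10, 0.65],
    "FPOGV": [1, 1, 1, 1],
    "BPOGV": [1, 0, 1, 1],
})

# Keep rows that both algorithms mark valid...
valid = df[(df["FPOGV"] == 1) & (df["BPOGV"] == 1)]
# ...and whose coordinates actually fall on the screen.
on_screen = valid[valid["FPOGX"].between(0, 1) & valid["FPOGY"].between(0, 1)]

print(len(on_screen))  # 2 rows survive both filters
```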
To make the dataset easy to manipulate and to train/test models, we selected a window 100 observations wide (the average time for reading a single plain line on A4 paper). Splitting the whole dataset this way, with a 90% overlap between windows, produced 24,568 reading and 14,288 non-reading time series of 100 observations each.
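The windowing step can be sketched as follows. With a 100-observation window and 90% overlap, consecutive windows start 10 samples apart:

```python
import numpy as np

def sliding_windows(series, width=100, overlap=0.9):
    """Split a 1-D series into fixed-width windows with the given overlap."""
    stride = max(1, round(width * (1 - overlap)))  # 90% overlap -> stride of 10
    starts = range(0, len(series) - width + 1, stride)
    return np.stack([series[s:s + width] for s in starts])

signal = np.arange(1000)
windows = sliding_windows(signal)
print(windows.shape)  # (91, 100)
```

In practice this is applied per recording session, so that no window spans the boundary between two different activities.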
Our squad used three main techniques for time series classification, described in detail below.
Time series research and manual feature extraction with MLP. We created three feature groups:
Linear trend detection for FPOGX
- MSE after linear approximation for FPOGX. In other words, trying to fit a line to the x-axis trajectory.
- The weight of the linear part after linear approximation (the weight w in the equation x = w*t + b).
- Dickey-Fuller Test on stationarity to detect linear trends.
- Max abs residual after seasonal decomposition (FPOGX, FPOGY)
- The standard deviation of the FPOGX and BPOGX time series
The feature extraction process took us one week, with a multi-layer perceptron (one hidden layer) providing 85% accuracy on the test subset. Improving this technique further would require additional manual feature extraction. In our opinion, the selected features are not informative enough and do not describe the data well.
Long short-term memory. The dX, dY features were used. The basic model consists of a 2-layer LSTM (one many-to-many and one many-to-one layer) followed by a single dense layer:
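A minimal Keras sketch of such a model, assuming 100-step windows of (dX, dY). The layer widths are our assumptions; the source fixes only the layer types:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input: windows of 100 timesteps with 2 channels (dX, dY).
model = keras.Sequential([
    keras.Input(shape=(100, 2)),
    layers.LSTM(32, return_sequences=True),   # many-to-many
    layers.LSTM(32),                          # many-to-one
    layers.Dense(1, activation="sigmoid"),    # reading / non-reading
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```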
This approach gave us even less accurate results, at around 63%. Adding extra LSTM layers did not help either. We figured the possible reason could be the small dimension of the input data and the inability of the LSTM to accumulate global information about the time series.
Convolutional neural network. The dX, dY features were used. After some tuning we found an optimal architecture with 113,006 trainable parameters:
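The source does not publish the tuned architecture, so the following 1-D CNN is only an illustrative Keras sketch of the approach; the filter counts and kernel sizes are assumptions and will not reproduce the 113,006-parameter count exactly:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 1-D convolutions slide along the time axis of the (dX, dY) windows,
# picking up local movement patterns regardless of where they occur.
model = keras.Sequential([
    keras.Input(shape=(100, 2)),
    layers.Conv1D(32, 5, activation="relu"),
    layers.MaxPooling1D(2),
    layers.Conv1D(64, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```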
This model produced a 96% accuracy rate on the test subset and was subsequently chosen as a base model for future research.
Reading patterns clustering
Our main task here was to classify every fixation group (observations grouped by FPOGID) as one of the three main patterns: saccade, sweep, or regression.
A major obstacle we encountered at this stage was dataset labeling, since data recorded at 60 Hz is inherently hard to label. This turned out to be a problem for clustering as well. Some of the minor issues we tackled were the high similarity between regressions and sweeps, as well as fixations during scrolling turning out to be outliers. To exclude fixations during scrolling we used the reading classification algorithm.
The entire dataset was reduced from individual points to saccades only. To obtain saccade data we grouped points by fixation ID (FPOGID) and took only the last observation from every group. All of the identified saccades were then split into min/max groups with a naive algorithm along the horizontal axis, and the minimal saccades were divided into sweep and regression groups. We used K-means clustering on three basic saccade features to achieve the required results:
- Horizontal axis sweep projection
- Horizontal axis sweep angle
- Difference from the previous saccade
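A sketch of the clustering step on synthetic saccade features. The feature values below are illustrative placeholders, not measured data, and the three groups are deliberately well separated so the clusters are recoverable:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Feature columns: [horizontal projection, angle, diff from previous saccade].
# Forward saccades move slightly right; sweeps jump far left to a new line;
# regressions move moderately left within the same line.
saccades    = np.column_stack([rng.normal( 0.05, 0.01, 50),
                               rng.normal( 0.0,  0.1,  50),
                               rng.normal( 0.0,  0.02, 50)])
sweeps      = np.column_stack([rng.normal(-0.80, 0.05, 20),
                               rng.normal( 3.0,  0.1,  20),
                               rng.normal(-0.8,  0.05, 20)])
regressions = np.column_stack([rng.normal(-0.20, 0.05, 20),
                               rng.normal( 3.0,  0.1,  20),
                               rng.normal(-0.2,  0.05, 20)])
X = np.vstack([saccades, sweeps, regressions])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(len(set(labels.tolist())))  # 3 clusters
```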
Gaze-to-text mapping, text recognition from screenshots, and point-of-interest detection
The OpenCV and Tesseract Python libraries were used heavily in solving this problem. Additionally, we created a single text document from the video.
We’ll walk you through the major stages:
- Identified static frames in the video (using a threshold on the deviation between shots)
- Extracted sheets from the frame
- Determined the point of interest
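The first stage – finding static frames by thresholding the deviation between consecutive shots – can be sketched in plain NumPy (in the project we used OpenCV frame grabs; the threshold value below is an assumption):

```python
import numpy as np

def static_segments(frames, threshold=2.0):
    """Indices of frames whose mean absolute difference from the previous
    frame is below the threshold, i.e. the screen content is static."""
    static = []
    for i in range(1, len(frames)):
        diff = np.mean(np.abs(frames[i].astype(float) -
                              frames[i - 1].astype(float)))
        if diff < threshold:
            static.append(i)
    return static

# Toy "video": three identical frames, then a scroll (shifted content).
frame = np.tile(np.arange(64, dtype=np.uint8), (64, 1))
video = [frame, frame, frame, np.roll(frame, 8, axis=1)]
print(static_segments(video))  # [1, 2]
```

Only the static frames are passed on to sheet extraction and OCR, which keeps Tesseract from re-reading the same text while the user scrolls.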
Results and challenges
The outcome of our project is a machine learning model that can predict with 97% accuracy whether the user was reading during a 1.6-second recording window. Some of the major findings of the research include:
- an algorithm that relies on the predictions of the previous ML model to count gaze movements (regressions, sweeps, and saccades) and calculate relative reading speed;
- an algorithm that gives information about a reader’s point of interest. It provides a weight coefficient for a given word that may represent importance for the reader.
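One simple way to derive such a weight coefficient is to accumulate fixation duration per word. The word boxes, coordinates, and fixation values below are purely hypothetical, and the project's actual weighting scheme may differ:

```python
import numpy as np

# Hypothetical word boxes on screen: (word, x_min, x_max, y_min, y_max),
# in the same normalized coordinates as the fixations.
words = [
    ("neural",  0.10, 0.20, 0.48, 0.52),
    ("network", 0.22, 0.34, 0.48, 0.52),
]

# Fixations as (FPOGX, FPOGY, FPOGD): position plus duration in seconds.
fixations = [(0.15, 0.50, 0.30), (0.25, 0.50, 0.10), (0.16, 0.50, 0.20)]

def interest_weights(words, fixations):
    """Total fixation duration per word, normalized to sum to 1."""
    totals = np.zeros(len(words))
    for fx, fy, dur in fixations:
        for i, (_, x0, x1, y0, y1) in enumerate(words):
            if x0 <= fx <= x1 and y0 <= fy <= y1:
                totals[i] += dur
    return totals / totals.sum()

weights = interest_weights(words, fixations)
print(weights)  # "neural" accumulates 0.5 s of the 0.6 s total
```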
Scientists who study human behavior through gaze tracking can unlock a previously closed domain in health tech and business. As our algorithm demonstrates, real-time predictions based on gaze tracking can significantly improve the usability, speed, and accuracy of any RAS, and possibly go beyond that, toward scientific applications nobody thought possible before.