
Machine Learning Techniques in Reading Tracking (Research Series)

As one of the primary providers of AI and machine learning consulting on the local market, our squad partnered with Beehiveor R&D Labs for a joint research project on a reading assistant system (RAS). We were inspired by fitness apps and wrist trackers, which use accelerometers, heart rate sensors and GPS data, among other things, to help people perform physical activities better. Similarly, if we can track gaze movement, why not use this data to facilitate visual activities? Reading was chosen as the primary research objective, since it’s one of the activities humans perform daily.

There are certain problems people encounter while reading, and everyone resolves them in a unique way: rereading difficult parts of a text, googling unknown words, writing down details to remember. What if this could be automated? For instance, a system could track human reading, evaluate speed, distinguish reading gaze movements from other activities, and annotate hard-to-read passages. These features can be extremely useful for people who have to tackle huge amounts of text every day.

The definition

Before we get into the research process and findings, let’s figure out what a Reading Assistant System (RAS) is and why it is so important. A RAS is an AI- and gaze-tracking-based system for various reading analysis purposes. It can be used in many settings, and its modern applications are well documented in education, medicine, HR, marketing and other areas.

The system will allow the user to:

  • track and store all reading materials and related reading pattern metadata;
  • search and filter content by reading pattern parameters (e.g. fixation time and rate make it possible to highlight only the areas of interest inside the text);
  • get various post-processing analytical annotations;
  • get real-time assistance options during reading (e.g. automated translation).

 

Here’s how the technology behaves in practice. The RAS program automatically tracks the movement of your gaze and matches it with the text on the screen. This makes it possible to process reading patterns in real time and store all the metadata for further analysis.

 

Thanks to neural networks trained with deep learning, a RAS can identify with up to 96% accuracy whether a person is reading at any given moment.

 

The hypothesis

The initial purpose of our research was threefold:

  • To test the high-accuracy hypothesis
  • To create a real-time model
  • To further research gaze patterns

 

Our team of 3 people carried out research activities over 2 months within the scope of DataRoot University. The primary task was to create a program for analyzing human visual activity by gaze movement that would incorporate:

  • reading/non-reading classification
  • regression/sweep/saccade detection
  • gaze-to-text mapping and point-of-interest detection

The execution

We mainly used open-source Python libraries:

  • DS basics: NumPy, SciPy, Matplotlib, Seaborn, Plotly, Pandas (data preparation, manipulation, visualization); statsmodels (time series research);
  • ML/DL: scikit-learn (clustering algorithms, time series research), TensorFlow, Keras (CNN, LSTM);
  • Computer vision: OpenCV, imutils, PIL, Tesseract (text recognition, eye-movement-to-text mapping)

 

The algorithms that we used:

  • Time series analysis and feature extraction + MLP
  • CNN
  • LSTM
  • K-means
  • Panorama stitching
  • OCR with Tesseract

 

Dataset gathering

For tracking human activity we used a GazePoint eye tracker. The tool allowed us to receive gaze coordinates with a 1–1.5° angular error after calibration. During each session, the GazePoint Analysis app recorded face and screen video along with tabular data about gaze movement.

The whole dataset consists of two parts: 51 reading and 85 non-reading time series. Each participant performed the following actions within the scope of our research:

  1. read a 2-minute text;
  2. found specific information and objects in pictures;
  3. watched a 3-minute video.

Gaze tracking and eye movement tracking

Gaze tracking: finding specific information and objects in pictures

The resulting time series consists of several columns:

  • FPOGX, FPOGY – screen coordinates, relative to screen resolution, algorithm A
  • BPOGX, BPOGY – screen coordinates, relative to screen resolution, algorithm B
  • FPOGID – fixation ID
  • FPOGD – fixation duration
  • BPOGV, FPOGV – data validity flags
  • BKID – blink ID

Eye-tracking time series in tabular form
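For reference, here is a minimal sketch of loading such an export with Pandas (the file name is hypothetical; GazePoint Analysis can export recordings as CSV):

    import pandas as pd

    # Hypothetical file name for a single recording session.
    df = pd.read_csv("session_01_gaze.csv")

    # Keep only the columns described above.
    cols = ["FPOGX", "FPOGY", "BPOGX", "BPOGY",
            "FPOGID", "FPOGD", "BPOGV", "FPOGV", "BKID"]
    df = df[cols]
    print(df.head())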

 

Dataset preprocessing and feature selection

In the course of our research, we found that tracked coordinates are never ideal. Blinking, head movements, variable lighting – all of these factors interrupt or spoil the data flow.

We figured additional steps could alleviate the situation, and we picked smoothing gaze movement as the primary option, even though the value of this approach has limits. Smoothing eliminates one important feature – microsaccades. A saccade is a quick, simultaneous movement of both eyes between two fixation points. A microsaccade is a movement within one fixation, which reveals how users fix their gaze. Though smoothing is not the ideal choice when it comes to saccades, it does help with approximate word detection.
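For illustration, here is one possible smoothing step using a Savitzky–Golay filter from SciPy; the filter choice, window length and polynomial order are illustrative assumptions, and, as noted above, wider windows erase microsaccades entirely:

    import numpy as np
    from scipy.signal import savgol_filter

    # Synthetic stand-in for ten seconds of FPOGX samples at 60 Hz.
    rng = np.random.default_rng(0)
    x = 0.5 + np.cumsum(rng.normal(0, 0.005, 600))

    # Smooth the horizontal gaze trajectory.
    x_smooth = savgol_filter(x, window_length=11, polyorder=2)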

 

Here’s what filtering meant for our research purposes:

  • Filtering by the BPOGV and FPOGV validity flags
  • Keeping only on-screen gaze movements
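A sketch of what that filtering could look like in Pandas, assuming validity flags equal to 1 for valid samples and coordinates normalized to the screen (i.e. in [0, 1]):

    import pandas as pd

    df = pd.read_csv("session_01_gaze.csv")  # hypothetical file name

    # Drop samples the tracker itself flags as invalid.
    valid = df[(df["BPOGV"] == 1) & (df["FPOGV"] == 1)]

    # Keep only gaze points that land on the screen: coordinates are
    # relative to screen resolution, so on-screen samples fall in [0, 1].
    on_screen = valid[valid["FPOGX"].between(0, 1) &
                      valid["FPOGY"].between(0, 1)]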

Gaze movement along the X-axis of the screen

Reading/non-reading distribution

In order to easily manipulate the dataset and train/test models, we selected a window 100 observations wide (the average time needed to read a single plain line on A4 paper). Considering a 90% overlap, this split the whole dataset into 24,568 reading and 14,288 non-reading time series of 100 observations each.
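A minimal sketch of this windowing step; with 100-observation windows and a 90% overlap, the window advances by 10 samples at a time:

    import numpy as np

    def sliding_windows(series, width=100, overlap=0.9):
        """Split a 1-D series into fixed-width windows with the given overlap."""
        step = max(1, round(width * (1 - overlap)))  # 90% overlap -> step of 10
        return np.array([series[i:i + width]
                         for i in range(0, len(series) - width + 1, step)])

    windows = sliding_windows(np.arange(1000))
    print(windows.shape)  # (91, 100)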

 

Reading/non-reading classification

Our squad used three main techniques for time series classification, described in detail below.

 

Time series research and manual feature extraction with MLP. We created three feature groups (a short computation sketch follows these lists):

Linear trend detection for FPOGX:

  • MSE after linear approximation for FPOGX. In other words, trying to fit a line to the x-axis trajectory.
  • The weight of the linear term after linear approximation (the weight w in the equation x = w*t + b), i.e. the slope of the fitted line.
  • The Dickey–Fuller test for stationarity to detect linear trends.


Seasonality:

  • Maximum absolute residual after seasonal decomposition (FPOGX, FPOGY)

Residual after seasonal decomposition

General features:

  • The standard deviation of the FPOGX and BPOGX time series
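Here is a condensed sketch of how such features can be computed for one window with NumPy and statsmodels; the decomposition period and exact formulas are illustrative assumptions, not necessarily the ones we used:

    import numpy as np
    from statsmodels.tsa.stattools import adfuller
    from statsmodels.tsa.seasonal import seasonal_decompose

    def window_features(x):
        """x: 1-D array of FPOGX values for one 100-observation window."""
        t = np.arange(len(x))

        # Linear approximation x ~ w*t + b: slope w and residual MSE.
        w, b = np.polyfit(t, x, deg=1)
        mse = float(np.mean((x - (w * t + b)) ** 2))

        # Augmented Dickey–Fuller test p-value as a stationarity indicator.
        adf_p = adfuller(x)[1]

        # Max absolute residual after seasonal decomposition; the period
        # (samples per "line" of reading) is an illustrative assumption.
        resid = seasonal_decompose(x, period=25, model="additive").resid
        max_resid = float(np.nanmax(np.abs(resid)))

        return [mse, w, adf_p, max_resid, float(np.std(x))]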

 

The feature extraction process took us one week, with a multi-layer perceptron (with one hidden layer) providing 85% accuracy on the test subset. Improving this technique further would require additional manual feature extraction. In our opinion, the selected features are not informative enough and do not describe the data well.

 

Long short-term memory. The dX, dY features were used. The basic model consists of a 2-layer LSTM (one many-to-many and one many-to-one layer) with one dense layer:

2-layer LSTM for eye-tracking
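A Keras sketch of this kind of model; the unit counts are illustrative assumptions, while the actual structure is the one shown in the figure:

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    # Input: 100 time steps of (dX, dY).
    model = Sequential([
        LSTM(32, return_sequences=True, input_shape=(100, 2)),  # many-to-many
        LSTM(32),                                               # many-to-one
        Dense(1, activation="sigmoid"),  # reading vs. non-reading
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])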

This approach gave us noticeably less accurate results, at around 63%. Adding extra LSTM layers did not help either. We figured the possible reason could be the small dimension of the input data and the inability of an LSTM to accumulate global information about the time series.

 

Convolutional neural network. The dX, dY features were used. After some tuning, we found an optimal architecture with 113,006 trainable parameters:

Convolutional neural network structure for gaze tracking

This model produced a 96% accuracy rate on the test subset and was subsequently chosen as a base model for future research.
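For context, a hypothetical 1-D CNN in the same spirit; the layer sizes below are illustrative and do not reproduce the exact 113,006-parameter architecture from the figure:

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Conv1D, Dense, Flatten, MaxPooling1D

    # Input: 100 time steps of (dX, dY); filter counts are illustrative.
    model = Sequential([
        Conv1D(32, kernel_size=5, activation="relu", input_shape=(100, 2)),
        MaxPooling1D(pool_size=2),
        Conv1D(64, kernel_size=5, activation="relu"),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(64, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])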

 

Reading patterns clustering

Our main task here was to classify every fixation group (observations grouped by FPOGID) as one of the three main patterns: saccade, sweep, or regression.

3 main reading patterns: saccade, return sweep, regression

A major obstacle we encountered at this stage was dataset labeling, since data recorded at 60 Hz is inherently hard to label. This turned out to be a problem for clustering as well. Among the minor issues we tackled were the high similarity between regressions and sweeps, as well as fixations during scrolling turning out to be outliers. To exclude fixations during scrolling, we used the reading classification algorithm.

 

The entire dataset was filtered from raw points down to saccades only. To obtain saccade data, we grouped points by fixation ID (FPOGID) and took only the last observation from every group. As a result, all of the identified saccades were split into min/max values with a naive algorithm along the horizontal axis, and the minimal saccades were divided into sweep and regression groups. We used K-means clustering on three basic saccade features to achieve the required results (a sketch follows the list):

  1. Horizontal axis sweep projection
  2. Horizontal axis sweep angle
  3. Difference from the previous saccade
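A minimal scikit-learn sketch of this clustering step, with synthetic stand-in values in place of the real saccade features:

    import numpy as np
    from sklearn.cluster import KMeans

    # One row per minimal saccade: [horizontal projection, horizontal
    # angle, difference from the previous saccade]. Synthetic stand-in.
    rng = np.random.default_rng(0)
    features = rng.random((200, 3))

    # Two clusters among the minimal saccades: sweeps vs. regressions.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)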

 

The Plotly visualization

 

Gaze-to-text mapping, text recognition from screenshots and point-of-interest determination

The OpenCV and Tesseract Python libraries were used heavily in solving this problem. Additionally, we created a single text document from the video.

 

We’ll walk you through some of the major stages as you scroll down:

  • Identified static frames in the video (using a threshold deviation between shots)
  • Extracted sheets from the frame

Text extraction from the video

  • Combined the video sheets into a single ‘panorama’ image

Text extraction and composition from different video time frames

  • Determined the point of interest

Determined the point of gaze
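A compressed sketch of the static-frame detection and OCR stages with OpenCV and pytesseract; the file name and deviation threshold are illustrative assumptions:

    import cv2
    import pytesseract

    cap = cv2.VideoCapture("screen_recording.avi")  # hypothetical file name
    prev, static_frames = None, []

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # A frame counts as static when it barely deviates from the
        # previous one; the threshold value is an illustrative choice.
        if prev is not None and cv2.absdiff(gray, prev).mean() < 1.0:
            static_frames.append(frame)
        prev = gray
    cap.release()

    # OCR one extracted sheet into text.
    if static_frames:
        text = pytesseract.image_to_string(static_frames[0])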

Results and challenges

The outcome of our project is a machine learning model that can predict with 97% accuracy whether the user was reading the text, given 1.6 seconds of recording. Some of the major findings in the course of the research include:

  • an algorithm that relies on the predictions of the previous ML model to count gaze movements (regressions, sweeps, and saccades) and calculate relative reading speed;
  • an algorithm that gives information about a reader’s point of interest: it provides a weight coefficient for a given word that may represent its importance to the reader (a rough sketch follows).
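As an illustration of that second point, a word’s weight could be accumulated from fixation time over its bounding box; the data structures and normalization below are simplifications, not our production algorithm:

    from collections import defaultdict

    def word_weights(fixations, word_boxes):
        """fixations: iterable of (x, y, duration) tuples;
        word_boxes: {word: (x0, y0, x1, y1)} bounding boxes from OCR.
        Returns a weight per word, normalized to sum to 1."""
        totals = defaultdict(float)
        for x, y, dur in fixations:
            for word, (x0, y0, x1, y1) in word_boxes.items():
                if x0 <= x <= x1 and y0 <= y <= y1:
                    totals[word] += dur
        grand = sum(totals.values()) or 1.0
        return {word: t / grand for word, t in totals.items()}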

 

Scientists who research human behavior such as gaze movement can unlock a previously closed domain in health tech and business. As substantiated by our algorithm, you can make real-time predictions based on gaze tracking technology that significantly improve the usage rate and accuracy of any RAS, and possibly go beyond that with scientific applications nobody thought possible before.