Summary: Real-Time Forecasting of Keyboard and Mouse Actions (arxiv.org)
3,088 words · PDF document
One Line
This paper explores the use of RNNs and computer vision to predict keyboard and mouse actions in real time, achieving 34.63% accuracy.
Key Points
- User input can be represented as user actions: application-independent descriptions of the input taken that can be reenacted later.
- Simple keystrokes form user actions on their own, while modifier keys only gain meaning in combination with other keys; every key-modifier combination is a distinct action.
- Computer vision is used to identify interactive areas on the screen and extract image patches for reenactment.
- Recurrent neural networks (RNNs) are trained to predict the next user action based on previous actions.
- The system achieves an accuracy of 34.63% in predicting the next user action from a set of almost 500 possible actions.
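The key-modifier encoding described in the key points can be sketched in a few lines. This is a minimal sketch in Python, assuming a fixed canonical modifier ordering; names such as `encode_action` and `action_vocabulary` are illustrative, not from the paper:

```python
from itertools import combinations

# Hedged sketch: the paper treats every key-modifier combination as a
# distinct user action. The canonical ordering below is an assumption.
MODIFIERS = ("ctrl", "alt", "shift")

def encode_action(key: str, modifiers: frozenset = frozenset()) -> str:
    """Build a canonical action ID such as 'ctrl+shift+s'."""
    held = [m for m in MODIFIERS if m in modifiers]  # fixed order
    return "+".join(held + [key])

def action_vocabulary(keys):
    """Enumerate the distinct actions for a key set (the paper's
    vocabulary contains almost 500 actions)."""
    vocab = []
    for key in keys:
        for r in range(len(MODIFIERS) + 1):
            for combo in combinations(MODIFIERS, r):
                vocab.append(encode_action(key, frozenset(combo)))
    return vocab

# A plain keystroke is its own action; modifiers only gain meaning
# when combined with another key.
assert encode_action("a") == "a"
assert encode_action("s", frozenset({"shift", "ctrl"})) == "ctrl+shift+s"
```

With three modifiers, each key yields 2³ = 8 distinct actions, which is how a modest key set grows into a vocabulary of hundreds of actions.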
Summaries
19 word summary
This paper examines using RNNs and computer vision to predict keyboard and mouse actions in real-time, achieving 34.63% accuracy.
70 word summary
This paper explores using recurrent neural networks (RNNs) and computer vision for real-time prediction of keyboard and mouse actions. User actions are defined as reenactable representations of user input. Computer vision identifies interactive areas clicked on the screen and extracts image patches. RNNs are trained on user activity to predict the next action, achieving 34.63% accuracy. This has potential for improving workflows, automating repetitive input, and aiding visually impaired individuals.
145 word summary
This paper discusses the use of recurrent neural networks (RNNs) and computer vision to predict keyboard and mouse actions in real time. The authors define a "user action" as a representation of user input that can be reenacted later. They use computer vision to identify the interactive area clicked on the screen and extract an image patch of that area. These patches are stored in a database that can be searched and extended on the fly. The authors train RNNs on a user's activity to predict the next action from a set of almost 500 possible actions, achieving an accuracy of 34.63% with minimal training. The predictions can improve computer workflows, automate repetitive input, and make frequently used buttons accessible to visually impaired individuals. The study demonstrates the feasibility and value of real-time prediction of keyboard and mouse actions and provides a reference for future research in this field.
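The searchable, extendable patch database could look roughly like the following. This is a hedged sketch that matches flat grayscale patches by mean absolute pixel difference; the paper does not specify its matching metric, and all names are illustrative:

```python
# Hedged sketch of the patch database: patches are stored as flat
# grayscale pixel lists and matched by mean absolute difference.
# The metric, threshold, and class name are assumptions.
class PatchDatabase:
    def __init__(self, threshold: float = 10.0):
        self.patches = []           # list of (action_id, pixels)
        self.threshold = threshold  # max mean difference to count as a match

    def _distance(self, a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    def lookup(self, pixels):
        """Return the action_id of the closest stored patch.
        Unseen patches extend the database on the fly."""
        best_id, best_dist = None, float("inf")
        for action_id, stored in self.patches:
            if len(stored) != len(pixels):
                continue  # patches of different sizes never match
            d = self._distance(stored, pixels)
            if d < best_dist:
                best_id, best_dist = action_id, d
        if best_dist <= self.threshold:
            return best_id
        # No match: register the patch as a new interactive area.
        new_id = len(self.patches)
        self.patches.append((new_id, list(pixels)))
        return new_id
```

Repeated clicks on the same button then map to the same action ID even when the screenshot pixels differ slightly, which is what makes the actions reenactable.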
287 word summary
This paper explores the use of recurrent neural networks (RNNs) and computer vision to predict keyboard and mouse actions in real time. The goal is to learn a user's repetitive input patterns and use that knowledge to assist the user in various ways. The authors define a "user action" as a representation of user input that is independent of the application and can be reenacted at a later time. Simple keystrokes are user actions on their own, but modifier keys only have meaning in combination with other keys, so every key-modifier combination is treated as a distinct user action.

Computer vision identifies which interactive area on the screen was clicked and extracts an image patch of that area. These image patches are stored in a database that can be searched and extended on the fly. The authors train RNNs, specifically LSTM and GRU models, on roughly a week of a user's activity to predict the next action from a set of almost 500 possible actions, achieving an accuracy of 34.63% with minimal training.

The predictions can be leveraged to improve and speed up computer workflows, for example by automatically completing repetitive input or making frequently used buttons accessible to visually impaired individuals. The authors also demonstrate how the predictions can attract the cursor to the buttons the user is most likely to click. The system has limitations, however, such as buttons that change size or shape and buttons with notification badges displayed on top of them. Overall, this study demonstrates the feasibility and value of predicting keyboard and mouse actions in real time and provides a reference for future research in this domain.
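The cursor-attraction idea can be illustrated as a confidence-weighted pull toward the predicted button. The blend rule and names below are assumptions for illustration, not the paper's implementation:

```python
# Hedged sketch of cursor attraction: nudge the pointer toward the
# predicted button, scaled by the model's confidence, so that
# low-probability predictions barely perturb the cursor.
def attract_cursor(cursor, target, probability, strength=0.5):
    """Move the cursor a fraction of the way toward `target`.

    `probability` is the predicted click probability for the button at
    `target`; `strength` caps how aggressive the pull can ever be.
    """
    pull = strength * probability
    x = cursor[0] + pull * (target[0] - cursor[0])
    y = cursor[1] + pull * (target[1] - cursor[1])
    return (x, y)
```

For example, with full confidence and `strength=0.5`, a cursor at the origin moves halfway toward the target, while a zero-probability prediction leaves it untouched.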