Summary: SeaPearl, A Constraint Programming Solver with Reinforcement Learning (arxiv.org)
One Line
SeaPearl is a hybrid solver that combines constraint programming with deep reinforcement learning to solve combinatorial optimization problems, with promising results and the ability for users to redefine various components for prototyping new research ideas.
Key Points
- SeaPearl is a solver that combines constraint programming and reinforcement learning to solve combinatorial optimization problems.
- SeaPearl uses reinforcement learning algorithms to find a policy that maximizes the total return.
- SeaPearl is fully implemented in the Julia language and is available on GitHub.
- SeaPearl can handle any CP model and is trained using randomly generated instances.
- SeaPearl has shown promising results on several benchmark problems.
Summaries
323-word summary
SeaPearl is a hybrid solver that combines constraint programming with deep reinforcement learning to solve combinatorial optimization problems, pairing learned predictions with a complete search procedure to keep predictions accurate while limiting the inference cost of deep models. The approach draws on probability distributions, optimization algorithms, and graph convolutional neural networks. The solver is implemented entirely in the Julia language rather than on top of an existing library such as Gecode, and its experiments focus on graph coloring and the traveling salesman problem with time windows. It has shown promising results on these benchmark problems.
Future work proposed for SeaPearl includes having two specialized agents, one to find good solutions and one to prove optimality, and finding ways to prioritize these tasks, as well as extending the learning to variable selection and selecting an appropriate neural network architecture.
Experiments evaluate the ability of SeaPearl to learn good heuristics for value-selection, generating instances randomly with a custom generator. Comparisons against greedy heuristics on two NP-hard problems are proposed.
SeaPearl allows the user to redefine types without requiring changes to the source code, and many other components, such as the reward or the state representation, can be redefined by the end-user for prototyping new research ideas.

The solver is implemented in the Julia language and includes a learning component that uses reinforcement learning to guide the search process. SeaPearl aims to leverage historical data related to specific problems in order to solve future instances more quickly. Hybridizing constraint programming and machine learning is challenging, but SeaPearl demonstrates that reinforcement learning can successfully drive the search process.

SeaPearl uses a deep architecture consisting of 4 graph attention layers and 5 fully-connected layers, which can be easily defined using the library Flux. The solver can be used to solve the graph coloring problem, where the goal is to assign labels to each node such that adjacent nodes have a different label. SeaPearl is fully implemented in the Julia language and is available on GitHub.
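To make the architecture description concrete, the sketch below shows a Flux stack with the same overall shape (4 graph layers followed by 5 fully-connected layers). It is not SeaPearl's actual code: the graph attention layers are replaced by plain Dense layers as stand-ins, and every layer width is an arbitrary assumption.

```julia
using Flux

# Minimal sketch of the "4 graph layers + 5 fully-connected layers" shape
# described above. NOT SeaPearl's real architecture: graph attention layers
# are replaced by Dense layers as placeholders, and all widths are guesses.
embedding_dim = 64

model = Chain(
    # Stand-ins for the 4 graph attention layers
    Dense(embedding_dim, embedding_dim, relu),
    Dense(embedding_dim, embedding_dim, relu),
    Dense(embedding_dim, embedding_dim, relu),
    Dense(embedding_dim, embedding_dim, relu),
    # The 5 fully-connected layers
    Dense(embedding_dim, 32, relu),
    Dense(32, 32, relu),
    Dense(32, 16, relu),
    Dense(16, 16, relu),
    Dense(16, 1),
)

# A single 64-dimensional embedding is mapped to one scalar score.
score = model(rand(Float32, embedding_dim))
```

In the real solver, the graph layers would operate on the graph representation of the CP model rather than on a single feature vector.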
816-word summary
SeaPearl is a new solver that combines constraint programming and reinforcement learning to efficiently solve combinatorial optimization problems. The solver is implemented in the Julia language and includes a learning component that uses reinforcement learning to guide the search process. SeaPearl aims to leverage historical data related to specific problems in order to solve future instances more quickly. Hybridizing constraint programming and machine learning is challenging, but SeaPearl demonstrates that reinforcement learning can be successfully used to drive the search process.

The goal of combinatorial optimization is to find an optimal solution among a finite set of possibilities. SeaPearl uses reinforcement learning algorithms to find a policy that maximizes the total return, which is the accumulated sum of rewards during an episode. The agent interacts with the environment by taking actions and observing rewards, and learns which sequences of actions lead to the highest reward. SeaPearl is fully implemented in the Julia language, avoiding the overhead of Python calls from a C++ solver, and is available on GitHub.

The solver is minimalist, with a focus on extensibility and flexibility, and consists of a constraint programming solver and a reinforcement learning model. It uses graph neural networks to embed learning in constraint programming and can handle any CP model. The solving process builds a tripartite graph representation of the problem, which is fed into a graph neural network to compute a latent vector and make branching decisions. The solver is trained on randomly generated instances, and the learning is conducted on instances randomly selected from the training set.

SeaPearl uses a deep architecture consisting of 4 graph attention layers and 5 fully-connected layers, which can be easily defined using the library Flux. The solver can be used to solve hard problems such as the graph coloring problem, where the goal is to assign labels to each node such that adjacent nodes have a different label, and the traveling salesman problem with time windows, in both cases learning heuristics for value-selection.

The training routines can be defined by the user, including the value-selection to be trained, the instance generator, the number of episodes, the search strategy, and the variable heuristic. Once trained, the heuristic can be used to solve new instances. SeaPearl allows the user to redefine types without requiring changes to the source code. The implementation illustrates only a small subset of the functionalities of the solver: many other components, such as the reward or the state representation, can be redefined by the end-user for prototyping new research ideas.
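To situate the learned value-selection heuristic, here is a small, self-contained Julia sketch of a generic backtracking search with a pluggable value ordering. None of the names come from SeaPearl; the point is only to show the decision that a trained RL agent takes over at every branching step.

```julia
# Generic backtracking search with a pluggable value-selection heuristic.
# Purely illustrative: nothing here is SeaPearl's API. The `value_order`
# argument is the decision point a trained RL agent would control.
function backtrack(domains::Dict{Symbol,Vector{Int}},
                   constraints::Vector{<:Function},
                   value_order::Function,
                   assignment::Dict{Symbol,Int} = Dict{Symbol,Int}())
    # Every variable assigned: a feasible solution has been found.
    length(assignment) == length(domains) && return assignment

    # Variable selection (kept trivial here: first unassigned variable).
    var = first(v for v in keys(domains) if !haskey(assignment, v))

    # Value selection: the heuristic decides in which order values are tried.
    for val in value_order(var, domains[var], assignment)
        assignment[var] = val
        if all(c(assignment) for c in constraints)        # prune violations
            result = backtrack(domains, constraints, value_order, assignment)
            result !== nothing && return result
        end
        delete!(assignment, var)                           # backtrack
    end
    return nothing
end

# A simple greedy "min-value" ordering; a learned heuristic would instead
# rank the values with a neural-network score.
min_value_order(var, domain, assignment) = sort(domain)

# Tiny 2-coloring example: the two variables must take different values.
domains = Dict(:x => [1, 2], :y => [1, 2])
not_equal(a) = !(haskey(a, :x) && haskey(a, :y)) || a[:x] != a[:y]
println(backtrack(domains, [not_equal], min_value_order))
```

Swapping `min_value_order` for a function that ranks values with a neural-network score is, conceptually, what the learned heuristic does.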
The experiments evaluate the ability of SeaPearl to learn good heuristics for value-selection. Instances for training the models have been generated randomly with a custom generator. Training is done until convergence, limited to 13 hours on AWS' EC2 with 1 vCPU of Intel Xeon capped to 3.0GHz, and memory consumption is capped to 32 GB. The evaluation is done on other instances (still randomly generated in the same manner) on the same machine. Comparisons against greedy heuristics on two NP-hard problems are proposed.
SeaPearl is a Constraint Programming (CP) solver that uses Reinforcement Learning (RL) and a graph attention network to find optimal solutions for the Travelling Salesman Problem with Time Windows (TSPTW). Instances are generated using a generator from a previous study, and a graph representing the current TSPTW instance is used instead of the default tripartite graph. The CP model is based on a dynamic programming formulation, and the neural architecture follows the same design choices as in the previous study. The goal is to minimize the sum of travel distances while respecting the time windows. Performance profiles and training curves for the DQN agent are presented for instances with 20 and 30 nodes. Results show that the learned heuristic can roughly match the performance of the baseline heuristics and is able to reproduce the behavior of the min-value heuristic.
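For reference, the quantity being optimized can be written out as a generic TSPTW formulation (a standard textbook form, not necessarily the exact dynamic-programming model used in the paper): with a visiting order $\pi$ over the $n$ cities, travel distances $d_{ij}$, visit times $t_i$, and time windows $[a_i, b_i]$,

$$\min_{\pi} \; \sum_{i=1}^{n-1} d_{\pi(i),\pi(i+1)} \quad \text{subject to} \quad a_{\pi(i)} \le t_{\pi(i)} \le b_{\pi(i)} \quad \forall i \in \{1, \dots, n\}.$$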
SeaPearl is a constraint programming solver that uses reinforcement learning to improve efficiency. The paper proposes future work on having two specialized agents, one to find good solutions and one to prove optimality, and finding ways to prioritize these tasks. The paper also suggests extending the learning to variable selection and selecting an appropriate neural network architecture. The tool aims to facilitate future research in the hybridization of constraint programming and deep reinforcement learning.

SeaPearl combines constraint programming with deep reinforcement learning to solve combinatorial optimization problems, reducing inference costs and enabling accurate predictions. However, it faces challenges when dealing with real-world instances and with problems that have a large action space. SeaPearl draws on probability distributions, optimization algorithms, and graph convolutional neural networks. It is open-source and implemented entirely in Julia, with experiments focused on graph coloring and the traveling salesman problem with time windows. The solver has shown promising results on these benchmark problems.
1754-word summary
SeaPearl is a constraint programming solver guided by reinforcement learning. It uses machine learning and graph theory to solve combinatorial optimization problems, with experiments on graph coloring and the traveling salesman problem with time windows. The solver is implemented in the Julia language and released in open-source. Because the search is guided by reinforcement learning, the solver can learn from experience and improve its performance over time. SeaPearl has been tested on several benchmark problems and has shown promising results.

The solver relies on several resources, including probability distributions, optimization algorithms, and graph convolutional neural networks, and it draws on open-source software for integer and constraint programming. In developing SeaPearl, the authors cite a range of studies on reinforcement learning, pruning, value networks, and other topics in computer science and optimization. The approach combines techniques from constraint programming, artificial intelligence, and operations research, and is based on machine learning for combinatorial optimization using RL. The solver is open-source and can help the community in the development of new hybrid approaches for tackling optimization challenges, but many open challenges must still be addressed before machine learning methods can be used efficiently inside a solving process.

As a hybridization of constraint programming and deep reinforcement learning, SeaPearl proposes a flexible, easy-to-use, and open-source research framework for solving combinatorial optimization problems. Combining machine learning approaches with a search procedure enables accurate prediction and reduces the heavy inference costs of deep models in low-resource settings. However, developing such hybrid approaches remains a challenge, as many issues must be tackled. One way to tackle real-world instances is to slightly modify the available instances by introducing small perturbations. Another difficulty is dealing with problems that have a large action space, which makes the learning more difficult and reduces the generalization to large instances; a possible direction is to reduce the size of the action space using a dichotomy selection (for example, branching on whether a value lies in the lower or upper half of a domain rather than enumerating every value).

The paper proposes future work on having two specialized agents, one to find good solutions and one to prove optimality, together with ways to prioritize these tasks. It also suggests extending the learning to variable selection and selecting an appropriate neural network architecture. The tool aims to facilitate future research in the hybridization of constraint programming and deep reinforcement learning. The paper presents results showing that the learned heuristic outperforms the heuristic baseline by a factor of three in terms of the number of nodes visited. For the Travelling Salesman Problem with Time Windows (TSPTW), SeaPearl uses reinforcement learning (RL) and a graph attention network to find optimal solutions.
Instances are generated using a generator from a previous study, and a graph representing the current TSPTW instance is used instead of the default tripartite graph. The CP model is based on a dynamic programming formulation, and the neural architecture follows the same design choices as in the previous study. The goal is to minimize the sum of travel distances while respecting the time windows. Performance profiles and training curves for the DQN agent are presented for instances with 20 and 30 nodes. Results show that the learned heuristic can roughly match the performance of the baseline heuristics and is able to reproduce the behavior of the min-value heuristic. For the graph coloring experiments, which are based on a standard CP formulation of the problem, comparisons use the smallest-domain-first rule as the variable ordering.

SeaPearl is thus used to learn value-selection heuristics for hard problems such as graph coloring and the traveling salesman problem with time windows. The goal of the experiments is to evaluate the ability of SeaPearl to learn good heuristics for value-selection. Instances for training the models have been generated randomly with a custom generator. Training is done until convergence, limited to 13 hours on AWS EC2 with 1 vCPU of an Intel Xeon capped at 3.0 GHz, and memory consumption is capped at 32 GB. The evaluation is done on other instances, still randomly generated in the same manner, on the same machine. Comparisons against greedy heuristics on two NP-hard problems are proposed. The implementation, the models, and the results are released in open-source with the solver.
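For reference, the standard CP formulation of graph coloring mentioned above can be written as follows (a generic textbook form), where $x_v$ is the color assigned to node $v$, $V$ the node set, and $E$ the edge set:

$$\min \; \max_{v \in V} x_v \quad \text{subject to} \quad x_u \neq x_v \;\; \forall (u, v) \in E, \qquad x_v \in \{1, \dots, |V|\} \;\; \forall v \in V.$$

Minimizing the largest color index used is equivalent to minimizing the number of labels, since colors can always be renumbered to be contiguous.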
The training routines can be defined by the user, including the value-selection to be trained, the instance generator, the number of episodes, the search strategy, and the variable heuristic. Once trained, the heuristic can be used to solve new instances. The last code snippet shows how the training routines can be defined. SeaPearl allows the user to redefine types without requiring changes to the source code of SeaPearl; this has been made possible thanks to the multiple-dispatch functionality of Julia and makes it easier for users to prototype new research ideas. The implementation illustrates only a small subset of the functionalities of the solver: many other components, such as the reward or the state representation, can be redefined by the end-user for prototyping new research ideas.

SeaPearl is built using the Julia programming language, which is efficient and rich in both mathematical programming and machine learning libraries. The solver can be used to solve the graph coloring problem, where the goal is to assign labels to each node such that adjacent nodes have a different label. SeaPearl uses a deep architecture consisting of 4 graph attention layers and 5 fully-connected layers, which can be easily defined using the library Flux. Once the objective and constraints are added to the model, the solving process can be run. A neural network produces a score for each possible value, and a fully-connected neural network is used to select the most promising assignment.

The solving process involves a tripartite graph representation of the problem, which is fed into a graph neural network to compute a latent vector and make branching decisions. The solver can handle any CP model and uses problem-dependent features, and the state representation is designed to be generic and handle any triplet of inputs. The solver is trained using randomly generated instances: it integrates a training algorithm that returns the vector of weights used to parametrize the neural network, and the learning is conducted on instances randomly selected from the training set. At the beginning of each new episode, an instance of the problem is used to train the model.

The reward signal consists of two terms, one dedicated to finding a feasible solution and another to finding the best feasible solution. The transition function updates the current state according to the action that has been selected, and an action corresponds to a value that can be assigned to a variable of the CP model. The state contains information about the instance being solved and the current state of the solving process, and different representations are proposed in the case studies; the state is described as a triplet of the number of backtracks, statistics of the solving process, and the associated CP model. The reinforcement learning environment is designed to represent the behavior of combinatorial problems. SeaPearl aims to improve the CP solving process using knowledge from previously solved problems, and the solver is kept minimalist, with a focus on extensibility and flexibility.
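The multiple-dispatch mechanism mentioned above can be illustrated with a small self-contained example. All names here (AbstractReward, reward_signal, and so on) are hypothetical and only demonstrate the pattern, not SeaPearl's actual API: the user defines a new type in their own code, and the library's generic functions automatically pick up the specialized behavior.

```julia
# Toy illustration of extending behavior via Julia multiple dispatch.
# All names here are hypothetical; they mimic the pattern by which an
# end-user could plug a custom reward into a solver without editing it.
abstract type AbstractReward end

struct DefaultReward  <: AbstractReward end   # "library" type
struct MyCustomReward <: AbstractReward end   # "user" type, defined elsewhere

# Generic fallback provided by the "library".
reward_signal(::AbstractReward, found_solution::Bool) =
    found_solution ? 1.0 : -0.01

# User-side specialization: dispatch on the user-defined type makes the
# "library" call this method automatically, with no source-code change.
reward_signal(::MyCustomReward, found_solution::Bool) =
    found_solution ? 10.0 : -1.0

println(reward_signal(DefaultReward(), true))    # 1.0
println(reward_signal(MyCustomReward(), false))  # -1.0
```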
SeaPearl's architecture consists of a constraint programming solver and a reinforcement learning model, and it uses graph neural networks to embed learning in constraint programming. For combinatorial optimization problems on graphs, it learns a p-dimensional representation for each node using a Graph Neural Network (GNN) that aggregates information from neighboring nodes. Reinforcement learning algorithms are used to find a policy that maximizes the total return, which is the accumulated sum of rewards during an episode: the agent interacts with the environment by taking actions and observing rewards, and learns which sequences of actions lead to the highest reward.

The RL component is fully integrated inside the solver, and the reinforcement learning environment allows CP backtracking inside an episode. SeaPearl is built upon the proof of concept proposed by Cappart et al. and proposes an architecture able to solve general CP models, whereas previous solvers were restricted to dynamic programming models. SeaPearl is fully implemented in the Julia language, avoiding the overhead of Python calls from a C++ solver. Experiments on two toy problems, namely graph coloring and the traveling salesman problem with time windows, are proposed, and the code is available on GitHub. The philosophy behind SeaPearl is to ease and speed up the development process for any researcher wishing to design learning-based approaches to improve constraint programming solvers.

The goal of combinatorial optimization is to find an optimal solution among a finite set of possibilities. Various approaches have been developed to solve these problems, including SAT solvers and mixed-integer programming, but these approaches have limitations, and machine learning-based heuristics have shown promise in improving their performance.
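Two of the quantities mentioned above can be made explicit in generic textbook form (not necessarily the exact variants used in the paper): a typical message-passing update for the $p$-dimensional node embedding $h_v$, and the return $G$ that the RL agent maximizes,

$$h_v^{(k+1)} = \sigma\!\left(W_1\, h_v^{(k)} + W_2 \sum_{u \in \mathcal{N}(v)} h_u^{(k)}\right), \qquad G = \sum_{t=0}^{T} \gamma^{t}\, r_t,$$

where $\mathcal{N}(v)$ is the set of neighbors of node $v$, $W_1$ and $W_2$ are learned weight matrices, $r_t$ is the reward at step $t$, and setting the discount factor $\gamma = 1$ recovers the undiscounted accumulated sum of rewards described above.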
SeaPearl aims to leverage historical data related to specific problems in order to solve future instances more quickly. The solver is implemented in Julia and released as an open-source framework, and it includes a learning component that uses reinforcement learning to guide the search process. SeaPearl also includes support for modeling the learning component and for learning branching decisions using machine learning routines.
The performance of SeaPearl was evaluated on two problems, and although it is not yet competitive with industrial solvers, it shows promise as a flexible and efficient research solver. The hybridization of constraint programming and machine learning is challenging to build, but SeaPearl demonstrates that reinforcement learning can be successfully used to drive the search process. The team behind it includes Felix Chalumeau, Ilan Coulon, Quentin Cappart, and Louis-Martin Rousseau.