Bandits as an AUTOnomous Machine (BAUTOM)

Constraint programming has developed a wide range of effective, general-purpose methods to solve many large-scale, real-world problems that can be expressed and solved as constraint satisfaction problems (CSPs). Fields that benefit from CP technology are transportation, network management, telecommunications, bioinformatics and finance [1]. As a result, organisations and industries worldwide already exploit CP technology to solve difficult problems in design and configuration, planning and scheduling, temporal and spatial reasoning, etc. But yet constraint programming suffers from a main bottleneck that limits its usability: the need for a CP-expert to tune the solver efficiently such that the solver becomes more efficient for the problem in question. At present, a constraint expert must select among various solving techniques and refine them to build an effective program. This proposal aims at alleviating the need of an expert through the use of machine learning. Reinforcement learning will be used to change and adapt the possible configurations of the solver, dynamically during search, Until now solvers are used as black boxes (except for the experts), where a user could not interfere or understand. Through the use of ML several explanations of solver’s behaviour-operation will be extracted and thus explained. Automating solver’s configuration has received a great attention as it can exempt the user from choosing the suitable configuration to solve her problem. There exist several different approaches that make use of machine learning (ML) methods in adaptive constraint solving (e.g., [2, 3]). There are two main approaches that have been studied. In the first case, a specific strategy (e.g. a search algorithm or a specific solver) is selected automatically among an array of available strategies, either for a whole class of problems or for a specific instance. Such methods, called portfolios, have mainly been proposed for SAT (e.g., SATzilla [4]) and to a lesser extent for CSPs (e.g., CPHydra [5]). In the second case, using ML, a new strategy can be synthesised (e.g., a combination of search algorithm and heuristics) [3]. Such attempts have mainly focused on learning strategies for combining heuristics, resulting for example in new, hybrid variable ordering heuristics. In [6], Xu et al. proposed the use of reinforcement learning for the dynamic selection of a variable ordering heuristic at each point of the search for CSPs. Another recent work uses ML to decide prior to search whether lazy learning will be switched on or off [7]. A reinforcement learning technique, called multi-armed bandits (MAB), has been initially proposed to model resource allocation problems, where given a fixed budget, the aim is to allocate resources among tasks/projects, whose outputs/rewards are unknown or partially known at the time of allocation [8]. The problem requires balancing between maximising the total reward based on the acquired knowledge and trying new actions to further increase knowledge, called exploitation vs. exploration tradeoff in reinforcement learning. Taking sequential decisions in CP search, while maximising solver’s performance can be represented as a MAB problem. Recently, multi-armed bandits have been used for learning the good variable and value ordering heuristics during CSP solving [9,10]. This research project aims at proposing techniques to improve and automate constraint solving by providing an autonomous framework that will direct the search through the use of bandits. More precisely, the solver will learn during search which branching method to use or with which criterion to branch, becoming more robust than choosing a fixed configuration all along search. At the same time, the learning mechanism herself will give us a feedback of the internal operations and behaviour of the solver, that will help us explain and understand the black box solver mechanisms.


[1] M. Wallace. Practical applications of constraint programming. Constraints, 1:139–168, 1996.

[2] S. Minton. Automatically Configuring Constraint Satisfaction Programs: A Case Study. Constraints, 1(1/2):7–43, 1996.

[3] S. Epstein, S. Petrovic. Learning to Solve Constraint Problems. In ICAPS’07 Workshop on Planning and Learning, 2007

[4] L. Xu, F. Hutter, H. H. Hoos, K. Leyton-Brown. Satzilla: Portfolio-based algorithm selection for SAT. J. Artif. Intell. Res. (JAIR), 32:565–606, 2008.

[5] E. O’Mahony, E. Hebrard, A. Holland, C. Nugent, B. OSullivan. Using case-based reasoning in an algorithm portfolio for constraint solving. In Proc. of AICS’08, 2008.

[6] Y. Xu, D. Stern, H. Samulowitz. Learning Adaptation to solve Constraint Satisfaction Problems. In Proc.of LION’09, 2009.

[7]I.P. Gent, C. Jefferson, L. Kotthoff, I. Miguel, N.C.A. Moore, P. Nightingale, K.E. Petrie. Learning when to use lazy learning in constraint solving. In Proc. of ECAI’10, pages 873–878, 2010.

[8] J. C. Gittins. Multi-Armed Bandit Allocation Indices, volume 10. John Wiley and Sons, 1989.

[9] M. Loth, M. Sebag, Y. Hamadi, M. Schoenauer. Bandit-based search for constraint programming. In Proc. of CP’13, pages 464–480, 2013.

[10] A. Balafrej and C. Bessiere and A. Paparrizou, Multi-Armed Bandits for Adaptive Constraint Propagation. In proc of IJCAI’15, pages 290- 296, 2015.

[11] W. Xia and R. H. C. Yap, Learning Robust Search Strategies Using a Bandit-Based Approach. In Proc. of AAAI’18, pages 6657-6665, 2018.