Who are you ?
Data Mining logo

Data Mining & Constraints

Introduction

This web page is dedicated to our contributions to declarative Data mining, and more generally to the cross-fertlization between Data mining (DM), Constraint programming (CP), Propositionnal Satisfiability (SAT) and Logic. This research issue is first introduced by Luc De Raedt, Tias Guns and Siegfried Nijssen from the University of Leuven (Belgium) in 2008 (see. CP4IM). Our research has been supported by the French ANR Agency under the DAG project "Declarative Approaches for enumerating intersting patterns" (2009 Défis program).

Objectives

  • Cross-fertilization between artificial intelligence (CP, SAT), Logic, combinatorial algorithmics and databases
  • Definition of high level declarative languages for describing interesting pattern enumeration problems

Results

Our results include contributions on the following issues:

DM → SAT/CP

  • Boolean satisfiability for Sequence mining
  • Top-k Frequent Closed Itemset Mining Using Top-k SAT Problem
  • Graded modal logic GS5 and itemset support satisfiability
  • Mining-based Compression Approach of Propositional Formulae
  • A Mining-Based Compression Approach for Constraint Satisfaction Problems
  • Mining to Compress Table Constraints

SAT/CP → DM

  • Symmetries in Itemset Mining

AI → DM

  • Preferences in Itemset Mining

Others

  • CNF Encoding of Cardinality Constraints.

Publications

Sequence Mining

We introduced a satisfiability-based approach for enumerating all frequent, closed and maximal patterns with wildcards in a given sequence. Different encodings have been proposed. A generalisation is also proposed for the sequence of itemsets where the empty set play the same role as a wildcard symbol. This declarative framework allows us to exploit the efficiency of modern SAT solvers and particularly their clause learning component.

Saïd Jabbour and Lakhdar Sais and Yakoub Salhi.
Boolean Satisfiability for sequence mining. In Twenty-Second ACM International Conference on Information an Knowledge Management CIKM'2013, pages 649 - 657, October 27th - November 1st, San Francisco, 2013 (pdf).
Emmanuel Coquery, Saïd Jabbour, Lakhdar Saïs and Yakoub Salhi.
A SAT-Based Approach for Discovering Frequent, Closed and Maximal Patterns in a Sequence in proceedings of the 20th European Conference on Artificial Intelligence - ECAI'2012, pages 258-263, 2012 (pdf).
Emmanuel Coquery, Said Jabbour, Lakhdar Sais.
A Constraint Programming Approach for Enumerating Motifs in a Sequence. International Workshop on Declarative Pattern Mining (Workshop WDPM@ICDM 2011) collocated with the 2011 IEEE ICDM International conference,pages 1091-1097, Vancouver, Canada, december 2011 (pdf).

ItemSet Mining

A new problem, called Top-k SAT, that consists in enumerating the $top$-$k$ models of a propositional formula is presented. A $top$-$k$ model is defined as a model with less than $k-1$ June models preferred to it w.r.t. a preference relation. We have shown that Top-k SAT generalizes the two well known problems: the partial Max-SAT problem and the problem of computing minimal models. Moreover, we propose a general algorithm for Top-k SAT. Then, we give the first application of our declarative framework in data mining, namely, the problem of enumerating $top$-$k$ frequent closed itemsets of length at least $min$. Finally, to show the nice declarative aspects of our framework, we encode several other variants of this data mining problem as Top-k SAT problems. We also show for the first time how symmetries can be defined, detected and eliminated in the context of itemset mining. Finally, we show how to reduce Graded modal logic satisfiability (GS5) to propositional satisfiability (SAT). We consider a satisfiability problem related to the frequent itemset mining problem: SUPPSAT(n) (where n is a strictly positive integer). We show how SUPPSATn can be encoded in GS5 satisfiability and consequently in SAT.

Saïd Jabbour and Lakhdar Sais and Yakoub Salhi.
Decomposition Based SAT Encodings for Itemset Mining Problems , In proceedings of PAKDD, 2015.
The Top-k Frequent Closed Itemset Mining Using Top-k SAT Problem, In proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases ECML/PKDD, 2013, pages 403-418, Prague, LNAI 8188, Springer, September 23-27, 2013 (pdf, slides).
Saïd Jabbour, Lakhdar Sais, Yakoub Salhi and Karim Tabia.
Symmetries in Itemset Mining, In proceedings of the 20th European Conference on Artificial Intelligence - ECAI'2012, pages 432-437 (pdf).
Said Jabbour, Mehdi Khiari, Lakhdar Sais, Yakoub Salhi and Tabia Karim.
Symétries et Extraction de Motifs Ensemblistes, in EGC, 2014.
Symmetries and Patterns Extraction Set. In proceedings of the 14th Conference Mining and Knowledge Management - EGC'2014. pages 407-418, January 28-31, 2014, Rennes (pdf, video)
Said Jabbour, Mehdi Khiari, Lakhdar Sais, Yakoub Salhi and Tabia Karim.
Symmetry-Based Pruning in Itemset Mining. In proceedings of the 25th International Conference on Tools with Artificial Intelligence - ICTAI'2013. November 4-6, Washington DC, USA, November 4-6, 2013 (pdf).
Yakoub Salhi and Said Jabbour and Lakhdar Sais.
Graded modal logic GS5 and itemset support satisfiability. In Information Search, Integration and Personalization, Communications in Computer and Information Science, Springer, Volume 146, pages 131-140, Springer, 2013 (pdf).
Said Jabbour and Stephanie Roussel and Lakhdar Sais and Yakoub Salhi
Mining to Compress Table Constraints . In 27th {IEEE} International Conference on Tools with Artificial Intelligence,ICTAI 2015, Vietri sul Mare, Italy, November 9-11, 2015, pages 405--412.

Mining based Compression of Constraints

We propose a first application of data mining techniques to propositional satisfiability. Our proposed mining based compression approach aims to discover and to exploit hidden structural knowledge for reducing the size of propositional formulae in conjunctive normal form (CNF). It combines both frequent itemset mining techniques and Tseitin’s encoding for a compact representation of CNF formulae. The experimental evaluation of our approach shows interesting reductions of the sizes of many application instances taken from the last SAT competitions. We also extend this approach to compress CSP and more precisely table constraints.

Saïd Jabbour and Lakhdar Sais and Yakoub Salhi and Takeaki Uno.
Mining-based compression approach of propositional formulae. In Twenty-Second ACM International Conference on Information an Knowledge Management CIKM'2013 , pages 289 - 298, October 27th - November 1st, San Francisco, 2013 (pdf, slides).
Saïd Jabbour and Lakhdar Sais and Yakoub Salhi.
A Mining-Based Compression Approach for Constraint Satisfaction Problems, CoRR abs/1305.3321, 2013 (pdf).

SAT encoding of cardinality constraints

We propose a new encoding of the cardinality constraint $\sum_{i=1}^{n} x_i \geq b$. It makes an original use of the general formulation of the Pigeon-Hole principle to derive a formula in conjunctive normal form (CNF). Our Pigeon-Hole based CNF encoding can be seen as a nice and simple way to express the semantic of the cardinality constraint, that can be defined as how to put $b$ pigeons into $n$ holes. To derive an efficient CNF encoding that ensures constraint propagation, we exploit the set of symmetries of the Pigeon-Hole based formulation to derive an efficient CNF encoding of the cardinality constraint. More interestingly, the final CNF formula contains is $b\times (n-b)$ variables and clauses and belongs to the well-known Reverse-Horn tractable CNF formula, which can be decided by unit propagation. Our proposed Pigeon-Hole based encoding is theoretically compared with the currently well-known CNF encoding of the cardinality constraint.

Said Jabbour and Lakhdar Sais and Yakoub Salhi.
A pigeon-Hole based encoding of cardinality constraints. Technical communication at 29th International conference on Logic Programming ICLP'2013, Theory and Practice of Logic Programming (TPLP), Volume 13, Numbers 4-5-Online-Supplement, July 2013 (pdf,slides).

Talks

L. Nourine, J-M Petit, L. Sais.
Declarative Approaches for Enumerating Interesting Patterns . Invited Talk, Journées Francophone de Programmation par Contraintes - JFPC/IAF 2012, Toulouse, May 22-24, 2012.
L. Nourine, J.M. Petit, L. Sais
Pattern Discovery: Enumeration, Constraint Programming/SAT and Data Bases, Tutorial in Journées Bases de Données Avancées - BDA 2011, Rabat, Marocco, October 24-27 2011 (slides).
L. Sais
Approches déclaratives pour la fouille de données, Séminaire du Lirmm, Montpellier, 21 février 2014 (slides)

Download

Sym4IM
Implements a framework for breaking symmetries in itemset mining problems. A symmetry allows to reduce the transaction database. The tool transform the transaction database into a graph and a symmetry miner is called to detect symmetries. Here we use Saucy, one of the most efficient tools.
IM(Top-K)
This is an algorithm for enumerating the Top-k models of a CNF propositional formula. A Top-k model is defined as a model with less than k models preferred to it with respect to a preference relation. This algorithm is implemented on the top of the Minisat2.2 solver.
Sat4Sequence
SATMiner is a library for searching interesting patterns using constraint programming/SAT backend. It is currently used as a basis for experiments on SAT/pseudo boolean based pattern mining on sequences attributes sets in relational data (RLT language) The current version is coded in Java and uses SAT4J and a modified version of MiniSat as backends. It also uses BoolVar/PB for translation of some pseudo-boolean constraints to SAT (related page).
Mine2Compress
This is a tool that use itemset mining to compress CNF boolean formulae using tseitin's encoding.

Datasets

Symmetries in Itemsets Mining: Download the datasets here .