Table Of Contents

As a PDF

Note

Information about the tutorial on redescription mining at the ACM SIGKDD International Conference on Knowledge discovery and data mining (KDD 2018), taking place in London, UK, on Sunday August 19, 2018.

Abstract

As data grow increasingly diverse and heterogeneous, it is ever more important to be able to exploit different points of views on the same phenomena in order to extract more coherent and relevant information. Multi-view and multi-modal data analysis techniques are therefore becoming more important in data analysis, gaining more interest from researchers and practitioners alike.

Redescription mining, introduced in KDD‘04 by Ramakrishnan et al., is a relatively novel multi-view data analysis technique that has seen important methodological developments and applications in the recent years. Redescription mining has many interesting open methodological questions, and we expect it to gain even more interest in the KDD research community in the coming years. Yet, the existing methods are mature enough that they can be—and have been—applied to many real-world data analysis problems.

The purpose of this tutorial is to give an overview of redescription mining, allowing method developers to start researching novel and improved algorithms for redescription mining, and for practitioners to understand when and how to use redescription mining in their data analysis problems.

The tutorial consists of three parts. The first part covers the theoretical foundations of redescription mining and its connections to other data analysis methods, such as association rule mining, classification, and subgroup discovery. The second part covers the different algorithmic ideas that have been proposed for redescription mining, building on these foundations. The third part shows four case studies, based on published research, of redescription mining being applied to different fields, ranging from medical sciences to ecology, political sciences, and engineering. This third part also discusses some variations of redescription mining and conclude with challenges and future directions.

Overview

Redescription mining, introduced in 2004 by Ramakrishnan et al. [Ram+04], aims at finding distinct common characterizations of the same objects. Like other multi-view analysis techniques, redescription mining is well suited to extract more coherent and relevant information, by exploiting different points of views on the same phenomena. This is an especially attractive feature as data grow increasingly diverse and heterogeneous. Recent advances in redescription mining algorithms and novel applications thereof—optimizing circuit designs [Goe+10], analyzing political opinions [GM16] and relating clinical and biological characteristics of Alzeimer’s disease patients [Mih+17c]—have increased the interest in this relatively nascent subfield of data analysis. Hence, redescription mining sits in the sweet spot where the methods and tools are mature enough to be used by general data analysts and scientists, and not only by topic experts, but there are still many open methodological questions that are interesting for method developers.

Recently, we published an introductory book on redescription mining [GM17]. This book and the tutorial are complementary means to help foster the research of redescription mining techniques and widen their use. We cover the main aspects of the redescription mining task, from definition, to algorithms and to applications and problem variants. The tutorial follows a similar structure as the book, allowing attendants to easily refer to the corresponding book section to obtain further information and references regarding the aspects they are most interested in.

First, we give a conceptual overview of redescription mining, including the intuition behind the concept and relations with other data analysis problems.

Next, we present existing algorithms, explaining how they build on standard data mining techniques.

In the third part of the tutorial, we present a relational variant of redescription mining, before covering four practical applications of redescription mining. These case studies illustrate the wide variety of potential applications of redescription mining, and highlight the strengths and weaknesses of the current methods in the different applications. Finally, we discuss open problems and future research directions (see the roadmap for more details).

Audience

We expect the material to be interesting to a wide audience with diverse backgrounds, and to be accessible to anyone possessing basic knowledge of core data mining and machine learning techniques.

In this tutorial, you will learn what redescription mining is and what it is not. We approach redescription mining through its two main forebearers: association rule mining and classification, and cover the main related areas (subgroup discovery, subspace clustering, and multi-view data mining). You will learn about the state-of-the-art algorithms for finding redescriptions as well as tailored visualization techniques for interpreting them. The discussion is illustrated with numerous examples throughout, and we pay special attention to four practical applications of redescription mining in biology, ecology, political sciences, and circuit design, respectively.

From the three parts of this tutorial, the data analysts will gain an overview of the redescription mining framework, a necessary understanding of the state-of-the-art methods, a good grasp of the type of problems redescription mining can be applied to, and a perspective on directions in which redescription mining can be extended.

The method developers, on the other hand, will learn the context and problem definition, the stepping stones for building their own tools, the practical requirements of redescription mining algorithms through concrete examples, and issues on which further investigation is required. More generally, the attendees of the tutorial will hear about redescription mining and related methods and will hence be able to use these techniques in their own research.

Tutors’ bios

Pauli Miettinen is a professor of data science at the University of Eastern Finland. Previously he was a senior researcher and head of the area Data Mining at the Databases and Information Systems department of the Max-Planck Institute for Informatics, Germany. He is also an Adjunct Professor (docent) of computer science at the University of Helsinki, Finland, where he previously worked in Prof. Heikki Mannila’s group, and received his PhD in 2009. His main research interest is in Algorithmic Data Analysis. In particular, he has been working on matrix decompositions over non-standard algebras and their applications to data mining and on redescription mining. His research has resulted in numerous publications in top data mining venues, three best paper prices (PKDD‘06 Best paper; PKDD‘08 & ECML PKDD‘16 Best student paper), and an honorary mention at 2010 ACM SIGKDD Doctoral dissertation awards.

Pauli was the presenter of Decomposing Binary Matrices: Where Linear Algebra Meets Combinatorial Data Mining, a tutorial at ECML PKDD‘12 and has given a number of invited talks and lectures.

http://people.mpi-inf.mpg.de/~pmiettin/

Esther Galbrun is a postdoctoral researcher at the Department of Computer Science, Aalto University, Finland, and a junior research scientist at Inria, France (on leave). Earlier, she has been a postdoctoral researcher and part-time lecturer at the Computer Science department of Boston University, MA, USA. She obtained her PhD in 2014 from the Computer Science department at the University of Helsinki, Finland, on the topic of redescription mining.

https://members.loria.fr/EGalbrun/

Roadmap

  1. What is redescription mining?
    1. An introductory example: Bioclimatic niches
    2. Definitions and problem formalization [Ram+04][PR05][ZR05], [Gal13]
    3. Related data mining problems [AS94][MTV94][Zak00][NLW09][KZ09][BS04]
  2. Algorithms for redescription mining.
    1. Combining frequent itemsets into redescriptions [GNDR13], [GMM08]
    2. Mining redescription with decision trees [Ram+04][Kum07], [ZGM15]
    3. Building redescriptions greedily [GMM08][GM12]
  3. Applications, variants and extensions of redescription mining.
    1. Relational redescription mining [GHVdB05][Ong+05][Gal+13], [GK14]
    2. Biology: Alzeimer’s disease patient study [Mih+17a][Mih+17b][Mih+17c]
    3. Ecology: Dental traits analysis [For+02][Liu+12][Zli+16] [Gal+18]
    4. Political sciences: Analyzing poll data [GM16]
    5. Circuit Design: Sequential Equivalence Checking [Goe+10][ZZR06]
    6. Challenges, future directions and discussion

Resources

References

[AS94]Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of 20th International Conference on Very Large Data Bases (VLDB‘94), 487–499. 1994. URL: http://www.vldb.org/conf/1994/P487.PDF.
[BS04]Steffen Bickel and Tobias Scheffer. Multi-View Clustering. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM‘04), 19–26. 2004. doi:10.1109/ICDM.2004.10095.
[For+02]Mikael Fortelius, Jussi Eronen, Jukka Jernvall, Liping Liu, Diana Pushkina, Juhani Rinne, Alexey Tesakov, Inesa Vislobokova, Zhaoqun Zhang, and Liping Zhou. Fossil mammals resolve regional patterns of Eurasian climate change over 20 million years. Evolutionary Ecology Research, 4(7):1005–1016, 2002.
[Gal+13]Luis Antonio Galárraga, Christina Teflioudi, Katja Hose, and Fabian M Suchanek. AMIE: association rule mining under incomplete evidence in ontological knowledge bases. In Proceedings of the 22nd International World Wide Web Conference (WWW‘13), 413–422. 2013. URL: http://dl.acm.org/citation.cfm?id=2488425.
[Gal13]Esther Galbrun. Methods for Redescription Mining. PhD thesis, Department of Computer Science, University of Helsinki, 2013. URL: http://urn.fi/URN:ISBN:978-952-10-9431-6.
[GK14]Esther Galbrun and Angelika Kimmig. Finding relational redescriptions. Machine Learning, 96(3):225–248, 2014. doi:10.1007/s10994-013-5402-3.
[GM12]Esther Galbrun and Pauli Miettinen. From black and white to full color: extending redescription mining outside the Boolean world. Statistical Analysis and Data Mining, 5(4):284–303, 2012. doi:10.1002/sam.11145.
[GM16]Esther Galbrun and Pauli Miettinen. Analysing Political Opinions Using Redescription Mining. In IEEE International Conference on Data Mining Workshops, 422–427. 2016. URL: https://people.mpi-inf.mpg.de/~pmiettin/papers/galbrun16analysing.pdf.
[GM17]Esther Galbrun and Pauli Miettinen. Redescription Mining. Springer Briefs in Computer Science. Springer, 2017. doi:10.1007/978-3-319-72889-6.
[Gal+18]Esther Galbrun, Hui Tang, Mikael Fortelius, and Indrė Žliobaitė. Computational biomes: The ecometrics of large mammal teeth. Palaeontologia Electronica, 2018. doi:10.26879/786.
[GMM08]Arianna Gallo, Pauli Miettinen, and Heikki Mannila. Finding Subgroups having Several Descriptions: Algorithms for Redescription Mining. In Proceedings of the 8th SIAM International Conference on Data Mining (SDM‘08), 334–345. 2008. doi:10.1137/1.9781611972788.30.
[Goe+10]Neha Goel, Michael S Hsiao, Naren Ramakrishnan, and Mohammed J Zaki. Mining Complex Boolean Expressions for Sequential Equivalence Checking. In Proceedings of the 19th IEEE Asian Test Symposium (ATS‘10), 442–447. 2010. doi:10.1109/ATS.2010.81.
[GHVdB05]Bart Goethals, Eveline Hoekx, and Jan Van den Bussche. Mining tree queries in a graph. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD‘05), 61–69. 2005. doi:10.1145/1081870.1081881.
[GNDR13]Tias Guns, Siegfried Nijssen, and Luc De Raedt. k-Pattern Set Mining under Constraints. IEEE Transactions on Knowledge and Data Engineering, 25(2):402–418, 2013. doi:10.1109/TKDE.2011.204.
[KZ09]Peer Kröger and Arthur Zimek. Subspace Clustering Techniques. In L. Liu and M. T. Özsu, editors, Encyclopedia of Database Systems, pages 2873–2875. 2009. doi:10.1007/978-0-387-39940-9_607.
[Kum07]Deept Kumar. Redescription mining: Algorithms and applications in bioinformatics. PhD thesis, Department of Computer Science, Virginia Polytechnic Institute and State University, 2007. URL: https://theses.lib.vt.edu/theses/available/etd-05032007-223232/unrestricted/deept_redescs.pdf.
[Liu+12]Liping Liu, Kai Puolamäki, Jussi T. Eronen, Majid Mirzaie Ataabadi, Elina Hernesniemi, and Mikael Fortelius. Dental functional traits of mammals resolve productivity in terrestrial ecosystems past and present. Proceedings of the Royal Society of London B: Biological Sciences, 279:2793–2799, 2012. URL: http://dx.doi.org/10.1098/rspb.2012.0211.
[MTV94]Heikki Mannila, Hannu Toivonen, and A Inkeri Verkamo. Efficient Algorithms for Discovering Association Rules. In Usama M Fayyad and Ramasamy Uthurusamy, editors, Papers from the 1994 AAAI Workshop on Knowledge Discovery in Databases, 181–192. AAAI Press, 1994. URL: https://www.aaai.org/Papers/Workshops/1994/WS-94-03/WS94-03-016.pdf.
[Mih+17a]Matej Mihelčić, Sašo Džeroski, Nada Lavrač, and Tomislav Šmuc. A framework for redescription set construction. Expert Syst. Appl., 68:196–215, 2017. doi:10.1016/j.eswa.2016.10.012.
[Mih+17b]Matej Mihelčić, Sašo Džeroski, Nada Lavrač, and Tomislav Šmuc. Redescription mining augmented with random forest of multi-target predictive clustering trees. J. of Intell. Inf. Syst., pages 1–34, 2017. doi:10.1007/s10844-017-0448-5.
[Mih+17c]Matej Mihelčić, Goran Šimić, Mirjana Babić-Leko, Nada Lavrač, Sašo Džeroski, and Tomislav Šmuc. Using redescription mining to relate clinical and biological characteristics of cognitively impaired and alzheimer’s disease patients. PLOS ONE, 12(10):1–35, 10 2017. doi:10.1371/journal.pone.0187364.
[NLW09]Petra Kralj Novak, Nada Lavrač, and Geoffrey I Webb. Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining. Journal of the Machine Learning Research, 10:377–403, 2009. doi:10.1145/1577069.1577083.
[Ong+05]Irene M Ong, Inês de Castro Dutra, David Page, and Vítor Santos Costa. Mode Directed Path Finding. In Proceedings of the 16th European Conference on Machine Learning (ECML‘05), volume 3720, 673–681. 2005. doi:10.1007/11564096_68.
[PR05]Laxmi Parida and Naren Ramakrishnan. Redescription Mining: Structure Theory and Algorithms. In Proceedings of the 20th National Conference on Artificial Intelligence and the 7th Innovative Applications of Artificial Intelligence Conference (AAAI‘05), 837–844. 2005. URL: http://www.aaai.org/Library/AAAI/2005/aaai05-132.php.
[Ram+04]Naren Ramakrishnan, Deept Kumar, Bud Mishra, Malcolm Potts, and Richard F Helm. Turning CARTwheels: An Alternating Algorithm for Mining Redescriptions. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD‘04), 266–275. 2004. doi:10.1145/1014052.1014083.
[Zak00]Mohammed J Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390, 2000. doi:10.1109/69.846291.
[ZR05]Mohammed J Zaki and Naren Ramakrishnan. Reasoning About Sets Using Redescription Mining. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD‘05), 364–373. 2005. doi:10.1145/1081870.1081912.
[ZZR06]Lizhuang Zhao, Mohammed J Zaki, and Naren Ramakrishnan. BLOSOM: A framework for mining arbitrary boolean expressions. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD‘06), 827–832. 2006. doi:10.1145/1150402.1150511.
[ZGM15]Tetiana Zinchenko, Esther Galbrun, and Pauli Miettinen. Mining predictive redescriptions with trees. In IEEE International Conference on Data Mining Workshops, 1672–1675. 2015. doi:10.1109/ICDMW.2015.123.
[Zli+16]Indrė Žliobaitė, Janne Rinne, Anikó Tóth, Michael Mechenich, Liping Liu, Anna K. Behrensmeyer, and Mikael Fortelius. Herbivore teeth predict climatic limits in Kenyan ecosystems. Proceedings of the National Academy of Sciences, 113(45):12751–12756, 2016. URL: http://dx.doi.org/10.1073/pnas.1609409113.




Imprint-Dataprotection