In data science, it is increasingly the case that the main challenge is finding, curating, and understanding the data that is available to solve a problem at hand. Furthermore, modern-day data is challenging in that it lacks many forms of semantics ("meaning of data"). Metadata may be incomplete or unreliable, data sources are unknown, and data documentation rarely exists. To address these challenges, the objective of my research is to recover data semantics throughout data discovery, versioning, integration, and quality. In this talk, I will discuss current data science challenges and highlight two specific aspects of my research that assist with such challenges. In particular, I will present ALITE, the first scalable integration solution for tables that may have been discovered in data lakes (repositories of big data). ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. I will also introduce Explain-Da-V, a solution that explains dataset versions by generating data transformations that resolve data changes.
Roee Shraga is a Postdoctoral fellow at the Khoury College of Computer Science at Northeastern University in Boston. He received his PhD degree in 2020 from the Technion in the area of Data Science. Roee has published more than a dozen papers in leading journals and conferences on the topics of data integration, human-in-the-loop, machine learning, process mining, and information retrieval. He is a recipient of the Council for Higher Education [VATAT] scholarship for outstanding data science postdocs. He is also a recipient of several PhD fellowships including the Leonard and Diane Sherman Interdisciplinary Fellowship (2017), the Daniel Excellence Scholarship (2019), and the Miriam and Aaron Gutwirth Memorial Fellowship (2020).