Automated Synthesis of Data Extraction and Transformation Programs

May 19, 2017 at 3:00pm CSE 691

Abstract

As data becomes ever more abundant, end-users increasingly need to perform data extraction and transformation tasks. Although many of these tedious tasks can be automated with programs, most end-users lack the programming expertise to do so and end up spending valuable time performing them manually. The field of program synthesis aims to overcome this problem by automatically generating programs from informal specifications, such as input-output examples or natural language.

In this talk, I will discuss two of my recent research projects for automating important classes of data transformation and extraction tasks. In the first part of the talk, I will describe a novel algorithm for synthesizing hierarchical data transformations from input-output examples. A key novelty of our approach is that it reduces the synthesis of tree transformations to the simpler problem of synthesizing transformations over the paths of the tree. I will also describe a new and effective algorithm for learning path transformations that combines logical SMT-based reasoning with machine learning techniques based on decision trees.

In the second part of the talk, I will address the problem of automating data extraction tasks specified in natural language. Specifically, I will focus on data retrieval from relational databases and describe a novel approach for learning SQL queries from English descriptions. The method I will describe is fully automatic and database-agnostic (i.e., it does not require customization for each database). Our method combines semantic parsing techniques from the NLP community with novel ideas from programming languages involving probabilistic type inhabitation and automated sketch repair.
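As a rough intuition for completing a query sketch against a schema, the following toy Python example fills the holes of a fixed sketch, `SELECT ?col FROM ?table WHERE ?filter = value`, with schema names that type-check and ranks the completions by string similarity to words from the English description. Everything here is an assumption made for illustration: the schema, the function names, and the scoring are hypothetical, and the example does not reproduce the semantic parsing, probabilistic type inhabitation, or sketch repair described in the talk.

```python
from difflib import SequenceMatcher

# Hypothetical schema: table name -> {column name: column type}.
SCHEMA = {
    "papers": {"title": "text", "year": "int", "venue": "text"},
    "authors": {"name": "text", "affiliation": "text"},
}

def similarity(a, b):
    """String similarity in [0, 1]; a crude stand-in for a learned score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def complete_sketch(table_hint, column_hint, filter_hint, filter_value):
    """Enumerate well-typed completions of
    SELECT ?col FROM ?table WHERE ?filter = value, best first."""
    value_type = "int" if isinstance(filter_value, int) else "text"
    candidates = []
    for table, columns in SCHEMA.items():
        for col in columns:
            for fcol, ftype in columns.items():
                if ftype != value_type:
                    continue  # discard ill-typed completions
                score = (similarity(table_hint, table)
                         * similarity(column_hint, col)
                         * similarity(filter_hint, fcol))
                # repr() is used only to render the literal for display;
                # it is not a safe way to build real SQL.
                query = f"SELECT {col} FROM {table} WHERE {fcol} = {filter_value!r}"
                candidates.append((score, query))
    return sorted(candidates, reverse=True)

# Hints as they might be extracted from "titles of papers from year 2017".
ranked = complete_sketch("paper", "titles", "year", 2017)
print(ranked[0][1])
```

Because the hints are matched against whatever schema is supplied, nothing in the sketch is tied to a particular database, which is the sense of "database-agnostic" this toy example tries to convey.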

Bio

Navid Yaghmazadeh is a Ph.D. student in the UToPiA research group at the University of Texas at Austin, working with Prof. Isil Dillig. His current research focuses on automatic program generation from informal specifications. Prior to his work on program synthesis, he was a member of the Advanced Systems group at UT Austin working on a novel distributed database framework.