Part I Selecting a Meaning Representation
7.2 Dataset
Since no prior work has considered the dialog-snippet-to-SQL task, we introduce a new dataset for it: the Flex-to-SQL dataset, version 0.1. It combines data from the Flex Dialog dataset with the Advising SQL dataset.
The Flex Dialog dataset (Jiang et al., 2018 forthcoming) consists of dialogs between two university students role-playing an undergraduate student and academic advisor. The “student” received a made-up student profile, and the “advisor” received a set of courses recommended for the student, along with information about each course. The participants then interacted through a chat interface, holding a conversation with the goal of helping the “student” select courses for the following semester.
The Advising SQL dataset contains SQL queries corresponding to questions an undergraduate computer science student might ask an advisor.
7.2.1 Preliminary Annotation
To develop the Flex-to-SQL dataset, researchers spent one day annotating Flex dialogs. After a training session on shared dialogs to ensure annotators understood the task, one annotator reviewed each dialog. Annotators highlighted each student utterance that represented a query that might be answered by a database. For each highlighted utterance, the annotator selected a label from a searchable list. Labels comprised either a question and the tag “CLOSEST” or “EXACTLY”, or just the choice “OTHER.”
Questions that could be used for the labels initially came from the Advising SQL dataset, with one representative question for each SQL query. Annotators also contributed to a shared list of questions that occurred often in the dialogs, and the searchable list was updated to include items on the list approximately once an hour.
Annotators labeled each highlighted utterance with the question from the searchable list that was as semantically similar to the utterance as possible. If the selected question was an exact paraphrase of the highlighted utterance (but for entity names), the “EXACTLY” tag was used; if it was not an exact match, but rather the closest match available in the list, the “CLOSEST” tag was used. If no question from the list was close in meaning to the highlighted utterance, annotators applied the “OTHER” label.
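The label structure described above can be sketched as a small record type. This is only an illustration: the class name `Label` and its fields are hypothetical and are not part of the annotation tool.

```python
from dataclasses import dataclass
from typing import Optional

# The three tags named in the text.
TAGS = {"EXACTLY", "CLOSEST", "OTHER"}


@dataclass(frozen=True)
class Label:
    """Hypothetical record for one annotated utterance."""
    tag: str
    question: Optional[str] = None  # list question; absent for "OTHER"

    def __post_init__(self):
        if self.tag not in TAGS:
            raise ValueError(f"unknown tag: {self.tag}")
        # "OTHER" stands alone; "EXACTLY" and "CLOSEST" must name a question.
        if (self.tag == "OTHER") != (self.question is None):
            raise ValueError("OTHER takes no question; other tags require one")


exact = Label("EXACTLY", "Who teaches department0 number0 next semester?")
other = Label("OTHER")
```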
To keep the questions as broad as possible, named entities in the labels were replaced with variables. The label “Who teaches department0 number0 next semester?” thus applied equally to “Who teaches EECS 280 next semester?” and “Who teaches EECS 203 next semester?”
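As a minimal sketch of this variable replacement, the function below rewrites course mentions with indexed variables. The regular expression is an assumption made for illustration (a department code of two to five capital letters followed by a three-digit number); the labels in the dataset were written by hand, not produced by this code.

```python
import re

# Assumed surface form of a course mention, e.g. "EECS 280".
COURSE = re.compile(r"\b([A-Z]{2,5}) (\d{3})\b")


def anonymize(question):
    """Replace each course mention with indexed variables.

    Returns the rewritten question and a dict mapping variables to values.
    Each mention gets a fresh index, even if the same course appears twice.
    """
    bindings = {}

    def substitute(match):
        index = len(bindings) // 2  # two variables are bound per mention
        dept_var, num_var = f"department{index}", f"number{index}"
        bindings[dept_var] = match.group(1)
        bindings[num_var] = match.group(2)
        return f"{dept_var} {num_var}"

    return COURSE.sub(substitute, question), bindings


label_280, bindings_280 = anonymize("Who teaches EECS 280 next semester?")
label_203, bindings_203 = anonymize("Who teaches EECS 203 next semester?")
```

Both questions map to the same label text and differ only in their bindings.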
This annotation scheme had multiple goals. First, it enabled annotators who might not be familiar with SQL to quickly indicate whether an utterance was semantically equivalent to a SQL query already in the Advising SQL dataset. Second, for questions that did not
Agree          The label is exactly right.
Closest        The label is close, but needs to be part of the “closest” review.
Unclear        The question is unclear and shouldn’t generate SQL.
Data           The question asks for data that isn’t in the database and shouldn’t generate SQL.
Not a query    Annotator mistake; this was not a query at all.
Other          Explain in comments.
Table 7.1: Labels used in the second-level review of questions tagged “EXACTLY.”
7.2.2 Subsequent Annotation of “Exact” Utterances
Two native English speakers familiar with the database schema performed a second level of review for every utterance tagged “EXACTLY.” They were shown the dialog up to and including the tagged utterance (the “utterance in context”). They labeled each utterance with one of the options in Table 7.1. A comment field was available for all labels and required for any question labeled “Other.”
After the second-level review, 408 utterances in context had been identified as semantically equivalent to a question in the Advising dataset. The remainder were identified as exactly matching a question for which there was no SQL query yet, removed as either unclear or requesting data that was not available from the database, or transferred to the “CLOSEST” bucket for further review.
For the 408 utterances that exactly matched a question in the Advising dataset, we were able to create a dataset largely automatically. Each English question input was the entire utterance in context. That is, the “question” consists of the entire conversation up to and including the labeled utterance. Each query was the SQL query from Advising corresponding to the exact-match question. However, this still left the identification of variables in the utterance to be done manually.
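The pairing described above might be sketched as follows. The dialog turns, the function names, and the SQL string are all invented for illustration; in particular, the table and column names are placeholders and do not reflect the Advising schema.

```python
def utterance_in_context(dialog, index):
    """Join every turn up to and including the labeled one."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in dialog[: index + 1])


def make_example(dialog, index, label_question, sql_by_question):
    """Pair the utterance in context with the SQL for its exact-match label."""
    return {
        "question": utterance_in_context(dialog, index),
        "query": sql_by_question[label_question],
    }


# Invented dialog; the labeled utterance is the third turn (index 2).
dialog = [
    ("student", "Hi, I need help picking courses."),
    ("advisor", "Sure. What are you interested in?"),
    ("student", "Who's the EECS 280 instructor next semester?"),
    ("advisor", "Let me check."),
]

# Placeholder query with invented table and column names.
sql_by_question = {
    "Who teaches department0 number0 next semester?":
        "SELECT name FROM toy_instructor "
        "WHERE department = 'department0' AND number = 'number0'",
}

example = make_example(
    dialog, 2, "Who teaches department0 number0 next semester?", sql_by_question
)
```

Turns after the labeled utterance are excluded, so the input ends with the student's question.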
As noted above, label questions included variables; for instance, “Who’s the EECS 280 instructor next semester?” would be labeled “Who teaches department0 number0 next semester? EXACTLY.” To run the query against the database and get correct results, we must replace “department0” with “EECS” and “number0” with “280” in the SQL. For these utterances, we manually identified the variables.
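Once the variable values are known, the substitution into the SQL is mechanical. The sketch below assumes the variables appear in the query as whole tokens; the query text uses invented table and column names, not the Advising schema, and `fill_variables` is a hypothetical helper.

```python
import re


def fill_variables(sql, bindings):
    """Replace each variable token in the SQL with its bound value."""
    if not bindings:
        return sql
    # Word boundaries keep "number1" from matching inside "number10".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in bindings) + r")\b"
    )
    return pattern.sub(lambda match: bindings[match.group(1)], sql)


# Placeholder query; table and column names are invented.
template = (
    "SELECT name FROM toy_instructor "
    "WHERE department = 'department0' AND number = 'number0'"
)
filled = fill_variables(template, {"department0": "EECS", "number0": "280"})
```

Plain string substitution is enough for a sketch with trusted, hand-identified values; a production system would more likely pass the values as query parameters.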
The resulting dataset is Flex-to-SQL v.0.1. It contains 41 distinct SQL queries corresponding to the 408 utterances. We created a question-based split and a query-based split, though for the pilot work we report only experiments on the question-based split. Since the dataset is relatively small, we use cross-validation. For the query-based split, we use leave-one-query-out cross-validation. For the question-based split, we randomly assign each question to one of ten buckets.
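The two splits can be sketched as below, under stated assumptions: each example is a dict with a "query" field, bucket assignment is an independent uniform draw per question, and the seed and toy data are invented. This is not the code used to produce the released splits.

```python
import random
from collections import defaultdict


def question_split(examples, n_buckets=10, seed=0):
    """Randomly assign each example to one of n_buckets buckets."""
    rng = random.Random(seed)  # the seed is an arbitrary choice
    buckets = [[] for _ in range(n_buckets)]
    for example in examples:
        buckets[rng.randrange(n_buckets)].append(example)
    return buckets


def leave_one_query_out(examples):
    """Yield (train, held_out) pairs, holding out one distinct query at a time."""
    by_query = defaultdict(list)
    for example in examples:
        by_query[example["query"]].append(example)
    for query, held_out in by_query.items():
        train = [ex for ex in examples if ex["query"] != query]
        yield train, held_out


# Toy data: 30 questions over 3 distinct queries.
toy = [{"question": f"q{i}", "query": f"sql{i % 3}"} for i in range(30)]
buckets = question_split(toy)
folds = list(leave_one_query_out(toy))
```

In the question-based split the same SQL query can appear in both training and test buckets; in the query-based split, no held-out query is ever seen during training.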