gaussalgo
T5-LM-Large-text2sql-spider
This model generates structured SQL queries from natural-language prompts.

In the Text2SQL task, the model learns to generate a SQL query from a question posed in natural language. However, in some cases the generated query refers to unknown columns or otherwise ignores the schema of the specific database. That is where our approach comes in: during training, we incorporated the database schema into the input question to specify which columns and relations are available for generating an applicable SQL query. Exposing the database schema together with the prompt allows the model to learn the mapping from the schema to the expected output, and so to generalize better to schemas that were not present in the training data.

We fine-tuned this model from the t5-large-LM-adapt checkpoint.

The model was fine-tuned on the training splits of the Spider and Spider-Syn datasets. Instead of using only the questions, we added the database schema to each question, as we wanted the model to generate a query over a given database.

When evaluating the output, we query the SQLite database and get:

The standardized database schema the model was trained on:

Here is how to use this model to answer a question on a given context using 🤗 Transformers in PyTorch:

Evaluation

Evaluation was done on the dev splits of the Spider and Spider-Syn datasets. The databases in the dev splits have no intersection with the databases of the train splits, which ensures that the model was not exposed to the evaluated databases during training. Evaluation compared the results of querying the database with the generated query against the results of querying it with the reference query. Both the Spider and Spider-Syn dev splits have 1032 samples.
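The execution-based comparison described above can be illustrated with a small sketch. It is not the evaluation script behind the reported numbers: the helper names (`run_query`, `execution_match`, `execution_accuracy`) are hypothetical, the toy `singer` table is made up for illustration, and ignoring row order when comparing results is an assumption of this sketch.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return all rows for `sql`, or None if SQLite rejects the query."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, reference_sql):
    """True when both queries run and return the same multiset of rows."""
    predicted = run_query(conn, predicted_sql)
    reference = run_query(conn, reference_sql)
    if predicted is None or reference is None:
        return False
    # Assumption of this sketch: row order is ignored, duplicates are kept.
    return Counter(predicted) == Counter(reference)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, reference) pairs whose results match."""
    matches = sum(execution_match(conn, pred, ref) for pred, ref in pairs)
    return matches / len(pairs)


# Toy in-memory database standing in for one Spider SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (
        singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT, age INTEGER
    );
    INSERT INTO singer VALUES
        (1, 'Ana', 'France', 30), (2, 'Ben', 'France', 40), (3, 'Cy', 'Spain', 25);
    """
)

pairs = [
    # Same result as the reference, so it counts as correct.
    ("SELECT avg(age) FROM singer WHERE country = 'France'",
     "SELECT AVG(age) FROM singer WHERE country = 'France'"),
    # Refers to a column that is not in the schema, so execution fails.
    ("SELECT name FROM singer WHERE nationality = 'France'",
     "SELECT name FROM singer WHERE country = 'France'"),
]
accuracy = execution_accuracy(conn, pairs)
```

The second pair shows the failure mode that motivates putting the schema in the input: a query that names an unknown column cannot be executed and is therefore scored as wrong.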
Spider dev accuracy: 49.2%
Spider-Syn dev accuracy: 39.5%

The model was trained using the Adaptor library 0.2.1 on the training splits of the Spider and Spider-Syn datasets, with the following parameters:

The training is fairly easy to reproduce, but we do not wish to publish modified copies of the Spider datasets that it depends on. If you'd like to investigate further in this direction, feel free to get in touch through a new PR or via email to stefanik(at)gaussalgo.com.
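Because the model expects the database schema alongside the question, it may help to see how such an input could be assembled from a SQLite database. The sketch below is illustrative only: the serialization format, the `Question:` and `Schema:` labels, and the helper names `describe_schema` and `build_prompt` are assumptions, not the standardized schema format the model was trained on.

```python
import sqlite3


def describe_schema(conn):
    """Serialize each user table as: "table" "col" type , "col" type"""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    parts = []
    for (table,) in tables:
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        cols = " , ".join(
            f'"{name}" {ctype.lower()}' for _, name, ctype, *_ in columns
        )
        parts.append(f'"{table}" {cols}')
    # Hypothetical separator between tables; an assumption of this sketch.
    return " [SEP] ".join(parts)


def build_prompt(question, schema):
    """Join the question and the serialized schema into one input string."""
    return f"Question: {question} Schema: {schema}"


# Toy database used only to demonstrate the helpers.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
)
prompt = build_prompt("How many singers are there?", describe_schema(conn))
```

A string built this way would then be tokenized and passed to the model, provided its layout is adjusted to match the format used in training.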