Summary of Combining Text-to-SQL with Semantic Search for Retrieval Augmented Generation | by Jerry Liu | LlamaIndex Blog | May, 2023


One Line

The SQLAutoVectorQueryEngine combines SQL and vector stores to handle complex natural language queries over structured and unstructured data, providing a versatile tool with improved performance and potential for new capabilities.

Slides

Slide Presentation (12 slides)


Combining Text-to-SQL with Semantic Search for Retrieval Augmented Generation

Source: medium.com - html - 2,862 words - view

The SQLAutoVectorQueryEngine

• Combines Text-to-SQL with Semantic Search

• Handles complex natural language queries over structured and unstructured data

• Provides improved performance and potential for new capabilities

Data Lakes in Enterprises

• Encompass both structured and unstructured data

• Structured data stored in SQL databases with predefined schemas and relationships

• Unstructured data lacks a predefined structure and includes text documents, audio recordings, videos, etc.

Text-to-SQL Over Structured Data

• SQL is an expressive language for operating over tabular data

• LLMs can convert natural language to SQL for analytics use cases

• Provides aggregations, joins, sorting, and more

Semantic Search over Unstructured Data

• Retrieval-augmented generation systems perform retrieval and synthesis

• Vector databases store unstructured documents with embeddings

• LLMs fetch relevant documents by embedding similarity and synthesize responses

Combining Text-to-SQL and Semantic Search

• Leverage knowledge in both structured tables and vector databases

• Provides the best answer to queries by combining analytics capabilities and semantic understanding

Example Use Case

• Access to a collection of articles about different cities in a vector database

• Structured table containing statistics for each city

• Query: "Tell me about the arts and culture of the city with the highest population"

SQLAutoVectorQueryEngine Flow

• Selector prompt chooses SQL or vector database query

• SQL query retrieves city with highest population

• Query transformation converts original question into a more detailed question given SQL results

• Vector store query performs retrieval and LLM response synthesis

General Comments about the Approach

• Auto-retrieval module simulates join between SQL and vector databases

• No explicit mapping needed between SQL items and vector metadata

• Implications for query capabilities and relationships between data sources

Results of Experiments

• Example queries leveraging both structured and unstructured data

• Successful across a broad range of queries

• Demonstrates the power of combining LLMs with structured and unstructured data

Conclusion

• Stacks around LLMs + unstructured data and LLMs + structured data have been separate

• Combining LLMs on top of both structured and unstructured data unlocks new retrieval/query capabilities

• Try out the SQLAutoVectorQueryEngine for yourself and provide feedback

Key Takeaways

• SQLAutoVectorQueryEngine combines Text-to-SQL with Semantic Search for complex queries

• Leverages both structured tables and vector databases for the best answer

• Unlock the power of LLMs by combining them with structured and unstructured data

Key Points

The article discusses a query engine called SQLAutoVectorQueryEngine that combines Text-to-SQL with Semantic Search to handle complex natural language queries over a combination of structured and unstructured data.
The engine leverages the expressivity of SQL over structured data and joins it with unstructured context from a vector database.
Text-to-SQL queries are well-suited for analytics use cases, while Semantic Search is suited for queries where the answer can be obtained from unstructured text data.
By combining these two systems, it is possible to leverage both structured tables and vector databases to provide the best answer to a query.
The SQLAutoVectorQueryEngine is able to query, join, sequence, and combine structured data from a SQL database and unstructured data from a vector database to synthesize the final answer.

Summaries

49 word summary

The SQLAutoVectorQueryEngine combines SQL and vector stores to handle complex natural language queries over structured and unstructured data. It allows the combination of structured analytics and semantic search, addressing limitations of existing tools. The engine performs well across various queries and has potential for new retrieval and query capabilities.

187 word summary

The author introduces SQLAutoVectorQueryEngine, a powerful query engine that combines a SQL database and a vector store to handle complex natural language queries over structured and unstructured data. Data lakes in enterprises contain both types of data, and Large Language Models (LLMs) can extract insights from them. Existing tools for handling structured and unstructured data are Text-to-SQL and Semantic Search with a Vector Database. However, SQL is more suitable for analytics use cases, while vector stores are less suited for queries involving aggregations or joins. The SQLAutoVectorQueryEngine addresses this by allowing the combination of structured analytics and semantic search. It has an auto retriever that infers query parameters and executes queries against the vector database, and a selector prompt to choose between the SQL database and the vector database. The engine simulates a join between the two databases to provide accurate responses. Experiments show that the engine works well across various queries. The author is excited about the potential of combining LLMs with structured and unstructured data for new retrieval and query capabilities. The article also includes information about the author's role and recommendations for related articles.

377 word summary

In this article, the author introduces SQLAutoVectorQueryEngine, a powerful query engine that can handle complex natural language queries over structured and unstructured data. The engine combines a SQL database and a vector store to fulfill the queries.

Data lakes in enterprises typically contain both structured and unstructured data. Structured data is stored in SQL databases, while unstructured data includes text documents, audio recordings, videos, etc. Large Language Models (LLMs) can extract insights from both types of data.

There are existing tools for handling structured data (Text-to-SQL) and unstructured data (Semantic Search with a Vector Database). Text-to-SQL converts natural language into SQL statements, while Semantic Search retrieves relevant documents based on embedding similarity and uses LLMs to synthesize a response.

SQL is highly expressive for operating over tabular data, making it suitable for analytics use cases. On the other hand, existing vector stores are less suited for queries involving aggregations or joins.

The author introduces the SQLAutoVectorQueryEngine to address the need for combining knowledge from both structured tables and vector databases/document stores. This query engine allows for the combination of structured analytics and semantic search.

Experiments show that the SQLAutoVectorQueryEngine works well across a broad range of queries. Example queries about arts and culture in Tokyo and the history of Berlin yield accurate responses.

The author concludes by expressing excitement about combining LLMs with structured and unstructured data to unlock new retrieval and query capabilities. Readers are invited to try out the SQLAutoVectorQueryEngine and provide feedback.

The excerpt also includes information about the author's role as an editor for LlamaIndex Blog and their involvement in building the data framework for LLMs. It mentions LlamaIndex raising $8.5M in a seed round led by Greylock Partners and provides links to other articles written by the author on LlamaIndex Blog.

Overall, the article discusses the SQLAutoVectorQueryEngine, the importance of combining structured and unstructured data, and the potential of LLMs in query capabilities. It also provides information about the author's role and includes recommendations for related articles.

831 word summary

In this article, the author introduces a powerful query engine called SQLAutoVectorQueryEngine, which can handle complex natural language queries over a combination of structured and unstructured data. This engine leverages both a SQL database and a vector store to fulfill the queries.

The author explains that data lakes in enterprises typically contain both structured and unstructured data. Structured data is stored in SQL databases, while unstructured data lacks a predefined structure and includes text documents, audio recordings, videos, etc. Large Language Models (LLMs) have the ability to extract insights from both types of data.

The author mentions two existing stacks for handling these types of data: Text-to-SQL for structured data and Semantic Search with a Vector Database for unstructured data. Text-to-SQL converts natural language into SQL statements that can be executed against the database, while Semantic Search retrieves relevant documents based on embedding similarity and synthesizes a response using LLMs.

In the structured setting, SQL is highly expressive for operating over tabular data, allowing for aggregations, joins, sorting, etc. Text-to-SQL queries are well-suited for analytics use cases where the answer can be found by executing a SQL statement.

In the unstructured setting, retrieval-augmented generation systems first perform retrieval by looking up the most relevant documents to the query based on embedding similarity. Existing vector stores do not offer a SQL-like interface and are less suited for queries involving aggregations or joins.

For some queries, it may be necessary to combine knowledge from both structured tables and vector databases/document stores to provide the best answer. The author gives an example use case where information from a structured table containing statistics for each city is combined with a collection of articles about different cities stored in a vector database.

To address this need, the author introduces the SQLAutoVectorQueryEngine, which can query, join, sequence, and combine structured and unstructured data to synthesize the final answer. This query engine allows for the combination of structured analytics and semantic search.

Overall, the author highlights the importance of leveraging both structured and unstructured data for more comprehensive and accurate query results, and introduces a query engine that facilitates this combination.

The auto retriever in the SQLAutoVectorQueryEngine infers query parameters and executes a query against the vector database. A selector prompt determines whether to query the SQL database or the vector database. If querying the SQL database, a query transformation is run to create a more detailed question based on the SQL query results. The new query is then run through the vector store query engine for retrieval and response synthesis. The original question, SQL query, SQL response, vector store query, and vector store response are combined to synthesize the final answer. This approach simulates a join between the SQL database and vector database. Experiments show that this method works well across a broad range of queries. The experiment setup includes a SQL table called city_stats and a Pinecone index for storing Wikipedia articles about cities. The SQLAutoVectorQueryEngine is used to process example queries and provide responses. The first example query asks about the arts and culture of the city with the highest population, resulting in a response about Tokyo. The second example query asks about the history of Berlin, resulting in a response detailing Berlin's history.

The excerpt is a query made to a query engine, asking for the country corresponding to each city. The final response states that Toronto is in Canada, Tokyo is in Japan, and Berlin is in Germany. The query can be answered by querying the SQL database without needing additional information from the vector database. The query transform step correctly identifies that there is no follow-up question, indicating that the original question has been answered.

The author concludes by stating that the stacks around large language models (LLMs) combined with unstructured data and with structured data have been largely separate. However, they are excited about the potential of combining LLMs with both types of data to unlock new retrieval and query capabilities. The author invites readers to try out the SQLAutoVectorQueryEngine and provide feedback.

The full notebook walkthrough for this topic can be found in a guide associated with the article. The article also mentions the author's role as an editor for LlamaIndex Blog and their involvement in building the data framework for LLMs. It mentions LlamaIndex raising $8.5M in a seed round led by Greylock Partners and provides links to other articles written by the author on LlamaIndex Blog.

The excerpt also includes recommendations for other articles on Medium, such as training your own language model using privateGPT, getting started with LangChain for building LLM-powered applications, and project ideas using large language models for portfolios. There are also recommendations for articles on autonomous agents with LLMs, PandasAI, and more.

Overall, the excerpt discusses a query made to a query engine, the response received, and the potential of combining LLMs with structured and unstructured data. It also provides information about the author's role and includes recommendations for related articles on Medium.

Raw indexed text (18,562 chars / 2,862 words / 629 lines)

Combining Text-to-SQL with Semantic Search for Retrieval Augmented Generation | by Jerry Liu | LlamaIndex Blog | May, 2023 | Medium


Combining Text-to-SQL with Semantic Search for Retrieval Augmented Generation

Jerry Liu

Published in LlamaIndex Blog

10 min read

May


Summary

In this article, we showcase a powerful new query engine (SQLAutoVectorQueryEngine) in LlamaIndex that can leverage both a SQL database and a vector store to fulfill complex natural language queries over a combination of structured and unstructured data. This query engine can leverage the expressivity of SQL over structured data, and join it with unstructured context from a vector database. We showcase this query engine on a few examples and show that it can handle queries that make use of both structured and unstructured data, or either one alone.

Check out the full guide here: https://gpt-index.readthedocs.io/en/latest/examples/query_engine/SQLAutoVectorQueryEngine.html

Context

Data lakes in enterprises typically encompass both structured and unstructured data. Structured data is typically stored in a tabular format in SQL databases, organized into tables with predefined schemas and relationships between entities. On the other hand, unstructured data found in data lakes lacks a predefined structure and does not fit neatly into traditional databases. This type of data includes text documents, but also other multimodal formats such as audio recordings, videos, and more.

Large Language Models (LLMs) have the ability to extract insights from both structured and unstructured data. Some initial tooling and stacks have emerged for tackling both types of data:

• Text-to-SQL (Structured data): Given a collection of tabular schemas, we convert natural language into a SQL statement which can then be executed against the database.

• Semantic Search with a Vector Database (Unstructured data): Store unstructured documents along with their embeddings in a vector database (e.g. Pinecone, Chroma, Milvus, Weaviate, etc.). During query-time, fetch the relevant documents by embedding similarity, and then put them into the LLM input prompt to synthesize a response.

Each of these stacks solves particular use cases.

Text-to-SQL Over Structured Data

In the structured setting, SQL is an extremely expressive language for operating over tabular data. In the case of analytics, you can get aggregations, join information across multiple tables, sort by timestamp, and much more. Using the LLM to convert natural language to SQL can be thought of as a program synthesis "cheat code": just let the LLM compile to the right SQL query, and let the SQL engine on the database handle the rest!

Use Case: Text-to-SQL queries are well-suited for analytics use cases where the answer can be found by executing a SQL statement. They are not suited for cases where you'd need more detail than what is found in a structured table, or if you'd need more sophisticated ways of determining relevance to the query beyond simple constructs like WHERE conditions.

Example queries suited for Text-to-SQL:

• "What is the average population of cities in North America?"

• "What are the largest cities and populations in each respective continent?"
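
To make the first example query concrete, here is a minimal sketch of the kind of SQL a text-to-SQL step might emit for it, run against an in-memory SQLite table. The `cities` table, its `continent` column, and every row value are hypothetical placeholders invented for illustration; the article does not define this schema, and the SQL string is hand-written rather than generated by an LLM.

```python
import sqlite3

# Hypothetical schema and placeholder rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        ("City A", "North America", 1_000_000),
        ("City B", "North America", 3_000_000),
        ("City C", "Europe", 2_000_000),
    ],
)

# Natural language: "What is the average population of cities in North America?"
# A text-to-SQL step would be expected to produce something along these lines.
generated_sql = "SELECT AVG(population) FROM cities WHERE continent = 'North America'"

# The database engine does the aggregation; the LLM only has to write the query.
average_population = conn.execute(generated_sql).fetchone()[0]
print(average_population)
```

The point of the sketch is the division of labor: the aggregation is handled entirely by the SQL engine, which is why this class of question suits Text-to-SQL.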

Semantic Search over Unstructured Data

In the unstructured setting, the behavior for retrieval-augmented generation systems is to first perform retrieval and then synthesis. During retrieval, we first look up the most relevant documents to the query by embedding similarity. Some vector stores support additional metadata filters for retrieval. We can choose to manually specify the set of required filters, or have the LLM infer what the query string and metadata filters should be (see our auto-retrieval modules in LlamaIndex or LangChain's self-query module).

Use Case: Retrieval Augmented Generation is well suited for queries where the answer can be obtained within some sections of unstructured text data. Most existing vector stores (e.g. Pinecone, Chroma) do not offer a SQL-like interface; hence they are less suited for queries that involve aggregations, joins, sums, etc.

Example queries suited for Retrieval Augmented Generation:

• "Tell me about the historical museums in Berlin"

• "What does Jordan ask from Nick on behalf of Gatsby?"
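
The retrieval half of this pattern can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how any particular vector store works internally: the three-dimensional "embeddings" and the document texts are toy values made up for the example, and a real system would obtain embeddings from an embedding model and pass the retrieved text to an LLM for synthesis.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy documents with made-up 3-dimensional embeddings (illustrative only).
documents = [
    {"text": "Berlin has many historical museums.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Tokyo hosts several large festivals.", "embedding": [0.1, 0.9, 0.0]},
    {"text": "Toronto has a large financial district.", "embedding": [0.0, 0.1, 0.9]},
]

def retrieve(query_embedding, docs, top_k=1):
    """Return the top_k documents ranked by similarity to the query embedding."""
    ranked = sorted(
        docs,
        key=lambda d: cosine_similarity(query_embedding, d["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]

# Stand-in for the embedded query "Tell me about the historical museums in Berlin".
query_embedding = [0.8, 0.2, 0.0]
top_docs = retrieve(query_embedding, documents, top_k=1)

# In a full system, this text would be placed into the LLM prompt for synthesis.
print(top_docs[0]["text"])
```

Note that nothing here can compute an average or a join across documents, which is the limitation the use-case paragraph describes.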

Combining These Two Systems

For some queries, we may want to make use of knowledge in both structured tables as well as vector databases/document stores in order to give the best answer to the query. Ideally this can give us the best of both worlds: the analytics capabilities over structured data, and semantic understanding over unstructured data.

Here's an example use case:

• You have access to a collection of articles about different cities, stored in a vector database.

• You also have access to a structured table containing statistics for each city.

Given this data collection, let's take an example query: "Tell me about the arts and culture of the city with the highest population."

The "proper" way to answer this question is roughly as follows:

• Query the structured table for the city with the highest population.

SELECT city, population FROM city_stats ORDER BY population DESC LIMIT 1

• Convert the original question into a more detailed question: "Tell me about the arts and culture of Tokyo."

• Ask the new question over your vector database.

• Use the original question + intermediate queries/responses to SQL db and vector db to synthesize the answer.
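
The first two steps of this sequence can be sketched with Python's built-in `sqlite3` module. This is a hand-rolled illustration, not LlamaIndex code: the Tokyo population (13.96 million) and the three countries come from the article's results, while the Toronto and Berlin population figures are illustrative placeholders, and the string substitution merely stands in for the LLM-driven query transformation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_stats (city TEXT, population INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?)",
    [
        ("Toronto", 2_930_000, "Canada"),   # population is an illustrative placeholder
        ("Tokyo", 13_960_000, "Japan"),     # 13.96 million, as quoted in the article's result
        ("Berlin", 3_645_000, "Germany"),   # population is an illustrative placeholder
    ],
)

# Step 1: query the structured table for the city with the highest population.
city, population = conn.execute(
    "SELECT city, population FROM city_stats ORDER BY population DESC LIMIT 1"
).fetchone()

# Step 2: rewrite the original question using the SQL result.
# A simple string substitution stands in for the LLM query transformation.
original_question = (
    "Tell me about the arts and culture of the city with the highest population"
)
detailed_question = original_question.replace(
    "the city with the highest population", city
)
print(detailed_question)
```

Steps 3 and 4 (asking the detailed question over the vector database and synthesizing the final answer) require an embedding model and an LLM, so they are left out of this sketch.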

Let's think about some of the high-level implications of such a sequence:

• Instead of doing embedding search (and optionally metadata filters) to retrieve relevant context, we want to somehow have a SQL query as a first retrieval step.

• We want to make sure that we can somehow join the results from the SQL query with the context stored in the vector database. There is no existing language to join information between a SQL and vector database. We will have to implement this behavior ourselves.

• Neither data source can answer this question on its own. The structured table only contains population information. The vector database contains city information but no easy way to query for the city with the maximum population.

A Query Engine to Combine Structured Analytics and Semantic Search

We have created a brand-new query engine (SQLAutoVectorQueryEngine) that can query, join, sequence, and combine both structured data from your SQL database and unstructured data from your vector database in order to synthesize the final answer.

The SQLAutoVectorQueryEngine is initialized by passing in a SQL query engine (GPTNLStructStoreQueryEngine) as well as a query engine that uses our vector store auto-retriever module (VectorIndexAutoRetriever). Both the SQL query engine and vector query engine are wrapped as "Tool" objects containing a name and description field.

Reminder: the VectorIndexAutoRetriever takes in a natural language query as input. Given some knowledge of the metadata schema of the vector database, the auto retriever first infers the other necessary query parameters to pass in (e.g. top-k value, and metadata filters), and executes a query against the vector database with all the query parameters.

Diagram of the flow for SQLAutoVectorQueryEngine

During query-time, we run the following steps:

• A selector prompt (similarly used in our RouterQueryEngine, see guide) first chooses whether we should query the SQL database or the vector database. If it chooses to use the vector query engine, then the rest of the function execution is the same as querying the RetrieverQueryEngine with VectorIndexAutoRetriever.

• If it chooses to query the SQL database, it will execute a text-to-SQL query operation against the database, and (optionally) synthesize a natural language output.

• A query transformation is run, to convert the original question into a more detailed question given the results from the SQL query. For instance, if the original question is "Tell me about the arts and culture of the city with the highest population.", and the SQL query returns Tokyo as the city with the highest population, then the new query is "Tell me about the arts and culture of Tokyo." The one exception is if the SQL query itself is enough to answer the original question; if it is, then function execution returns with the SQL query as the response.

• The new query is then run through the vector store query engine, which performs retrieval from the vector store and then LLM response synthesis. We enforce using a VectorIndexAutoRetriever module. This allows us to automatically infer the right query parameters (query string, top k, metadata filters), given the result of the SQL query. For instance, with the example above, we may infer the query to be something like query_str="arts and culture" and filters={"title": "Tokyo"}.

• The original question, SQL query, SQL response, vector store query, and vector store response are combined into a prompt to synthesize the final answer.
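
The control flow of these steps can be sketched as a single function. This is an assumption-laden illustration and not the LlamaIndex implementation: `run_sql_auto_vector_flow` and all of its parameters are hypothetical names, each callable stands in for an LLM-backed component, and the synthesis step is simplified to take four inputs rather than the five listed above.

```python
from typing import Callable, Optional

def run_sql_auto_vector_flow(
    question: str,
    select_source: Callable[[str], str],
    query_sql: Callable[[str], str],
    transform_query: Callable[[str, str], Optional[str]],
    query_vector: Callable[[str], str],
    synthesize: Callable[[str, str, str, str], str],
) -> str:
    """Hypothetical sketch of the routing logic described in the steps above."""
    # Selector: choose between the SQL database and the vector database.
    if select_source(question) == "vector":
        return query_vector(question)

    # Text-to-SQL query against the database.
    sql_response = query_sql(question)

    # Query transformation; None means the SQL result already answers the question.
    new_question = transform_query(question, sql_response)
    if new_question is None:
        return sql_response

    # Vector store query with the more detailed question.
    vector_response = query_vector(new_question)

    # Combine the intermediate results into the final answer.
    return synthesize(question, sql_response, new_question, vector_response)

# Hard-coded stand-ins for the LLM-backed components (illustrative only).
answer = run_sql_auto_vector_flow(
    "Tell me about the arts and culture of the city with the highest population",
    select_source=lambda q: "sql",
    query_sql=lambda q: "Tokyo",
    transform_query=lambda q, sql: f"Tell me about the arts and culture of {sql}",
    query_vector=lambda q: f"[vector answer for: {q}]",
    synthesize=lambda q, sql, new_q, vec: f"{sql}: {vec}",
)
print(answer)
```

Passing the components in as callables keeps the routing logic testable without any model calls; the two early returns correspond to the "vector only" and "SQL only" paths described in the steps.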

Taking a step back, here are some general comments about this approach:

• Using our auto-retrieval module is our way of simulating a join between the SQL database and vector database. We effectively use the results from our SQL query to determine the parameters to query the vector database with.

• This also implies that there doesn't need to be an explicit mapping between the items in the SQL database and the metadata in the vector database, since we can rely on the LLM being able to come up with the right query for different items. It would be interesting to model explicit relationships between structured tables and document store metadata though; that way we don't need to spend an extra LLM call in the auto-retrieval step inferring the right metadata filters.
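
As a rough illustration of what such an explicit relationship could look like, the sketch below maps SQL column names to vector-store metadata keys and builds filters directly from a SQL row, with no LLM call. The article does not implement this; `COLUMN_TO_METADATA_KEY` and `filters_from_sql_row` are hypothetical names introduced only for this example.

```python
# Hypothetical explicit mapping from SQL columns to vector-store metadata keys.
COLUMN_TO_METADATA_KEY = {"city": "title"}

def filters_from_sql_row(row: dict) -> dict:
    """Build metadata filters from a SQL result row using the explicit mapping.

    Columns without a mapping (e.g. population) are ignored.
    """
    return {
        COLUMN_TO_METADATA_KEY[column]: value
        for column, value in row.items()
        if column in COLUMN_TO_METADATA_KEY
    }

filters = filters_from_sql_row({"city": "Tokyo", "population": 13_960_000})
print(filters)
```

The trade-off is the one the text hints at: an explicit mapping saves an LLM call but has to be maintained by hand whenever the table schema or the metadata schema changes.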

Experiments

So how well does this work? It works surprisingly well across a broad range of queries, from queries that can leverage both structured data and unstructured data to queries that are specific to a structured data collection or unstructured data collection.

Setup

Our experiment setup is very simple. We have a SQL table called city_stats which contains the city, population, and country of three different cities: Toronto, Tokyo, and Berlin.

We also use a Pinecone index to store Wikipedia articles corresponding to the three cities. Each article is chunked up and stored as a separate "Node" object; each chunk also contains a title metadata attribute containing the city name.
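
The chunk-plus-metadata layout described here can be sketched without any vector store. This is a simplified stand-in, not the LlamaIndex `Node` class or the Pinecone API: the fixed-size character chunking, the dictionary layout, and the helper names `chunk_article` and `filter_by_title` are all assumptions made for illustration, and the article texts are short placeholders rather than real Wikipedia content.

```python
def chunk_article(city: str, text: str, chunk_size: int = 40) -> list[dict]:
    """Split text into fixed-size character chunks, tagging each with the city name."""
    return [
        {"text": text[start:start + chunk_size], "metadata": {"title": city}}
        for start in range(0, len(text), chunk_size)
    ]

def filter_by_title(nodes: list[dict], title: str) -> list[dict]:
    """Keep only the chunks whose title metadata matches, like a metadata filter."""
    return [node for node in nodes if node["metadata"]["title"] == title]

# Placeholder article texts standing in for the Wikipedia articles.
berlin_text = "Berlin is the capital of Germany. " * 3
tokyo_text = "Tokyo is the capital of Japan."

nodes = chunk_article("Berlin", berlin_text) + chunk_article("Tokyo", tokyo_text)
tokyo_nodes = filter_by_title(nodes, "Tokyo")
print(len(nodes), len(tokyo_nodes))
```

Because every chunk carries the city name in its `title` metadata, a filter such as `{"title": "Tokyo"}` can narrow retrieval to one city before any similarity ranking happens.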

We then derive the VectorIndexAutoRetriever and RetrieverQueryEngine from the Pinecone vector index.

from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo
from llama_index.query_engine.retriever_query_engine import RetrieverQueryEngine

vector_store_info = VectorStoreInfo(
    content_info='articles about different cities',
    metadata_info=[
        MetadataInfo(
            name='city',
            type='str',
            description='The name of the city'),
    ]
)
vector_auto_retriever = VectorIndexAutoRetriever(vector_index, vector_store_info=vector_store_info)

retriever_query_engine = RetrieverQueryEngine.from_args(
    vector_auto_retriever, service_context=service_context
)

You can also get the SQL query engine as follows:

sql_query_engine = sql_index.as_query_engine()

Both the SQL query engine and vector query engine can be wrapped as QueryEngineTool objects.

sql_tool = QueryEngineTool.from_defaults(
    query_engine=sql_query_engine,
    description=(
        'Useful for translating a natural language query into a SQL query over a table containing: '
        'city_stats, containing the population/country of each city'
    )
)
# Wrap the vector query engine defined above (retriever_query_engine).
vector_tool = QueryEngineTool.from_defaults(
    query_engine=retriever_query_engine,
    description='Useful for answering semantic questions about different cities',
)

Finally, we can define our SQLAutoVectorQueryEngine:

query_engine = SQLAutoVectorQueryEngine(
    sql_tool,
    vector_tool,
    service_context=service_context
)

Results

We run some example queries.

Query

query_engine.query('Tell me about the arts and culture of the city with the highest population')

Intermediate steps:

Final Response:

Tokyo is the city with the highest population, with 13.96 million people. It is a vibrant city with a rich culture and a wide variety of art forms. From traditional Japanese art such as calligraphy and woodblock prints to modern art galleries and museums, Tokyo has something for everyone. There are also many festivals and events throughout the year that celebrate the city's culture and art. Some popular festivals and events in Tokyo include the Sannō Festival at Hie Shrine, the Sanja Festival at Asakusa Shrine, and the biennial Kanda Festivals. These events often feature parades with elaborately decorated floats and thousands of people. Additionally, an enormous fireworks display over the Sumida River takes place annually on the last Saturday of July, attracting over a million viewers. During spring, when cherry blossoms bloom, many residents gather in Ueno Park, Inokashira Park, and the Shinjuku Gyoen National Garden for picnics under the blossoms.

This query runs through the full flow of the SQLAutoVectorQueryEngine. It first queries the SQL database for the city with the highest population ("Tokyo"), and then queries the vector database with the new query. The results are combined into a final response.

Query

query_engine.query("Tell me about the history of Berlin")

Intermediate Steps:

Final Response:

Berlin's history dates back to the early 13th century when it was founded as a small settlement. In 1618, the Margraviate of Brandenburg entered into a personal union with the Duchy of Prussia, and in 1701, they formed the Kingdom of Prussia with Berlin as its capital. The city grew and merged with neighboring cities, becoming a center of the Enlightenment under the rule of Frederick the Great in the 18th century.

The Industrial Revolution in the 19th century transformed Berlin, expanding its economy, population, and infrastructure. In 1871, it became the capital of the newly founded German Empire. The early 20th century saw Berlin as a hub for the German Expressionist movement and a major world capital known for its contributions to science, technology, arts, and other fields.

In 1933, Adolf Hitler and the Nazi Party came to power, leading to a decline in Berlin's Jewish community and the city's involvement in World War II. After the war, Berlin was divided into East and West Berlin, with the former under Soviet control and the latter under the control of the United States, United Kingdom, and France. The Berlin Wall was built in 1961, physically and ideologically dividing the city until its fall in 1989. Following the reunification of Germany in 1990, Berlin once again became the capital of a unified Germany and has since continued to grow and develop as a major global city.

This query only requires the vector database and not the SQL database. The initial selector correctly identifies that we should just query the vector database and return the result.

Query

query_engine.query('Can you give me the country corresponding to each city?')

Intermediate Steps:

Final Response:

Toronto is in Canada, Tokyo is in Japan, and Berlin is in Germany.

This query can be answered by just querying the SQL database; it does not need additional information from the vector database. The query transform step correctly identifies "None" as the follow-up question, indicating that the original question has been answered.

Conclusion

So far, the stacks around LLMs + unstructured data and LLMs + structured data have largely been separate. We're excited about how combining LLMs on top of both structured and unstructured data can unlock new retrieval/query capabilities in novel and interesting ways!

We'd love for you to try out the SQLAutoVectorQueryEngine and let us know what you think.

The full notebook walkthrough can be found in this guide (associated notebook).


Written by Jerry Liu

Editor for LlamaIndex Blog

Creator of LlamaIndex