Chat with your data using Gemini, Streamlit & CloudRun

There is something I have to confess: I am not riding the AI-hype-train as much as many others do. Don’t get me wrong – AI as it is now has the potential to drastically improve productivity at a variety of tasks, especially coding. Nevertheless, I do not believe that it will destroy all knowledge work, nor do I believe that we will all lose our jobs in the coming months. Conversely, managers are eager to investigate the potential of AI to generate the aforementioned productivity increase. While coding is the obvious case, distribution of basic analysis can be another one. So today, I am going to demonstrate how to create a Gemini- and BigQuery-based chatbot, that allows anyone to ask questions directly to your data warehouse.

Let’s imagine this case: You have a table containing all of your ecommerce orders. Next to the common columns you can find in most tables, like order_id, order_timestamp, item_name, revenue, it contains a huge number of columns for acquisition, device, and product details. Creating a report or dashboard, that allows users to filter as well as analyze all these different combinations would be a pain – either in terms of complexity or in terms of quantity.
So instead, we are going to create a chat-like interface, that allows your users to ask questions and generate charts based on just that specific table.

This project is based on Googles Conversational API quickstart guide.

My repository: GitHub

Prequisites – what you’ll need

  • Google Cloud Project with added billing
  • Git
  • gcloud installed on your computer
  • GCP service account
  • Python 3.11 or higher (only to test it locally)

Create the application

To start, you have to enable the required APIs in your Google Cloud Project. To do so, run the following command using gcloud CLI. It sets the correct project-id and sets up what service account you use for API-enablement, deployment, and so on.

gcloud config set project YOUR-PROJECT-ID 
gcloud auth activate-service-account YOUR-SERVICE-ACCOUNT-EMAIL-ADRESS --key-file=PATH-TO-SERVICE-ACCOUNT-FILE
gcloud services enable geminidataanalytics.googleapis.com bigquery.googleapis.com cloudaicompanion.googleapis.com people.googleapis.com aiplatform.googleapis.com --project=YOUR-PROJECT-ID

The next thing we need to do is create an OAuth Client in GCP, assuming you don’t want the application to be publicly available. The OAuth Client allows you to set up access for specific Google accounts. While comfortable for teams that are working a lot with Google products already, there could be better access management options for you (e.g. IP-range based access for companies with a static IP-range). However, for the purpose of this post, we will stick to OAuth just as Google did in their guide.

Create consent screen

  • Go to the Google Cloud console and create an Oauth consent screen.
  • Configure “App name” to your choice.
  • Set “User support email” & “Audience” to your choice.
  • Configure “Contact Information”.
  • Click “Create”.

Create OAuth client

  • Go to “APIs & Services” > “Credentials”
  • Click “Create credentials” > “OAuth client ID”
  • Choose “Web application” as the “Application type”
  • Set the application name of your choice
  • Add “http://localhost:8501” to “Authorized JavaScript origins”
  • Add “http://localhost:8501 to “Authorized redirect URIs”
  • Click “Create” and safe the Client ID plus the Client Secret for later.

Prepare the project

Next up is cloning the repository to your local machine:

git clone https://github.com/ramonseradj/conversational-analytics-quickstart.git
cd conversational-analytics-quickstart

Now you need to create a .env file on the root level of your project clone:

PROJECT_ID=YOUR_PROJECT_ID
GOOGLE_CLIENT_ID=YOUR_CLIENT_ID_FROM_PREVIOUS
GOOGLE_CLIENT_SECRET=YOUR_CLIENT_SECRET_FROM_PREVIOUS
REDIRECT_URI=http://localhost:8501

This is the version to run the application locally. Since we are planning to deloy it to CloudRun, the redirect URI has to be changed to the default-URL that will be provisioned for your application once it is deployed. This is how it should look:

PROJECT_ID=YOUR_PROJECT_ID
GOOGLE_CLIENT_ID=YOUR_CLIENT_ID_FROM_PREVIOUS
GOOGLE_CLIENT_SECRET=YOUR_CLIENT_SECRET_FROM_PREVIOUS
REDIRECT_URI=https://SERVICE_NAME-PROJECT_NUMBER.REGION.run.app

Run locally

To run the application locally, all you need to do is install the required packages (I highly recommend using a virtual environment):

pip install -r requirements.txt

And now start the Streamlit app using the following command. To cancel Streamlit just press CTRL and C in your terminal.

streamlit run app.py

Deploy to CloudRun

Before we start the deployment, there are a few updates we need to make beforehand. Firstly, we need to update the .env file like we did before. The Redirect-URI has to be set to the CloudRun Service-URI. Otherwise, your app will always try to redirect you to localhost after authentication.

In addition to that, the CloudRun service URI has to be added to your OAuth-Client for both “Authorized JavaScript origins” and “Authorized redirect URIs”.

Since gcloud doesn’t listen to .gitignore instructions, a .gcloudignore file is necessary. It is already part of the repository, but you should check for possible additions like your venv or service account file.

Last but not least, we want to deploy our project as a dockerized application, so we need to add a Dockerfile. To do so, just create a new file called “Dockerfile” in your project at the root level and add the following contents:

# Use the official Python image as a base
FROM python:3.11-slim

# Set the working directory
WORKDIR /

# Copy the requirements file and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Define the port that the container will expose
EXPOSE 8501

# Command to run the Streamlit app
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

With all of the above being done, you can deploy your application using this command:

gcloud run deploy --source . --port 8501 --allow-unauthenticated

After deployment, the application is publicly available (authentication is handled by the OAuth Client) and listens to port 8501.
You can test it by browsing to GCP’s CloudRun interface, clicking on your application, and then visiting the default URL.

Create an agent

Now you open your AI-based chat-application and create your first agent to “talk” to your orders table. To do so, click “Agents” on the left and scroll down to “Create a data agent”.

Add your table’s project-, dataset- and table-id as well as an agent name and description. Agent instructions are a bit more complicated – they are basically used to create context for the agent to understand your data better. It is a YAML-based description for the contents of both table and columns: what do they mean, how to run calculations on them, and so on. Please check out Google’s documentation on how to create agent instructions for BigQuery. This is their example for agent instructions:

- system_instruction: str # A description of the expected behavior of the agent. For example: You are a sales agent.
- tables: # A list of tables to describe for the agent.
    - table: # Details about a single table that is relevant for the agent.
        - name: str # The name of the table.
        - description: str # A description of the table.
        - synonyms: list[str] # Alternative terms for referring to the table.
        - tags: list[str] # Keywords or tags that are associated with the table.
        - fields: # Details about columns (fields) within the table.
            - field: # Details about a single column within the current table.
                - name: str # The name of the column.
                - description: str # A description of the column.
                - synonyms: list[str] # Alternative terms for referring to the column.
                - tags: list[str] # Keywords or tags that are associated with the column.
                - sample_values: list[str] # Sample values that are present within the column.
                - aggregations: list[str] # Commonly used or default aggregations for the column.
        - measures: # A list of calculated metrics (measures) for the table.
            - measure: # Details about a single measure within the table.
                - name: str # The name of the measure.
                - description: str # A description of the measure.
                - exp: str # The expression that is used to construct the measure.
                - synonyms: list[str] # Alternative terms for referring to the measure.
        - golden_queries: # A list of important or popular ("golden") queries for the table.
            - golden_query: # Details about a single golden query.
                - natural_language_query: str # The natural language query.
                - sql_query: str # The SQL query that corresponds to the natural language query.
        - golden_action_plans: # A list of suggested multi-step plans for answering specific queries.
            - golden_action_plan: # Details about a single action plan.
                - natural_language_query: str # The natural language query.
                - action_plan: # A list of the steps for this action plan.
                    - step: str # A single step within the action plan.
    - relationships: # A list of join relationships between tables.
        - relationship: # Details about a single join relationship.
            - name: str # The name of this join relationship.
            - description: str # A description of the relationship.
            - relationship_type: str # The join relationship type: one-to-one, one-to-many, many-to-one, or many-to-many.
            - join_type: str # The join type: inner, outer, left, right, or full.
            - left_table: str # The name of the left table in the join.
            - right_table: str # The name of the right table in the join.
            - relationship_columns: # A list of columns that are used for the join.
                - left_column: str # The join column from the left table.
                - right_column: str # The join column from the right table.
- glossaries: # A list of definitions for glossary business terms, jargon, and abbreviations.
    - glossary: # The definition for a single glossary item.
        - term: str # The term, phrase, or abbreviation to define.
        - description: str # A description or definition of the term.
        - synonyms: list[str] # Alternative terms for the glossary entry.
- additional_descriptions: # A list of any other general instructions or content.
    - text: str # Any additional general instructions or context not covered elsewhere.

If you have maintained your BigQuery metadata correctly (I for sure did not :D), it’ll be much easier for both you and the agent.

Now you click “Create agent” navigate on the top left to “Chat” and start chatting with your agent. You can use it to answer direct questions or even create rendered visualizations like bar-charts.

Customizing the Interface

To impress your boss with a sophisticated MVP, you should consider applying your corporate identity to the application. You can add a logo, set a theme color as well as styles for individual elements or even add custom CSS. You can check out the Streamlit documentation on styling to get an overview of what is possible: https://docs.streamlit.io/develop/concepts/configuration/theming.

Last words

I really like the idea of using AI in combination with YAML-based instructions. It feels way more reliable to have a specific agent with detailed instructions for just one table.
Nevertheless you should keep in mind that AI has an error margin; thus you should test thoroughly and consider tracking both usage in general and user satisfaction with the answers given to optimize constantly.

Also keep in mind that you shouldn’t allow access to raw data or tables with huge amounts of data. Otherwise this could cost you huge amounts of money, because at the end of the day, every question always translates to a query.