Semantic caching with PSQL, Python, OpenAI

You’re an app developer now

Let’s say you’re working on an app. The app helps you search for e-mail messages that relate to a certain topic. This works well enough.
As a nifty extra, you added a small pixelated image of the search topic, generated on demand by some ML model.
For instance, when searching for a dog, you’d get a dog image.

Your users like this extra so much (people love AI now) that it’s now a core feature. A new problem arises: generating these images is the most expensive part of your service (DALL·E costs about 4 cents an image, which is a lot when your app is freemium).

Caching

An obvious solution is caching - if Alice asked for an image of a dog, and Bob asked for one later as well, we could serve Bob the same image we generated for Alice. Storage is very cheap these days (storing and serving the same photo in S3 would be approximately 100,000 times cheaper than generating it).
Matching on text equality is easy. We could store the images in a pattern like s3://CACHE_BUCKET/dog.jpg, then look for this jpg the next time someone asks for a dog.

Example code could look like this:

def _set_blob(key: str, blob: bytes) -> None:
  # Implementation specific
  ...

def _get_blob(key: str) -> bytes | None:
  # Implementation specific; returns None on a cache miss
  ...

def _generate_content_expensive(key: str) -> bytes:
  # Implementation specific
  ...

def get(key: str) -> bytes:
  blob = _get_blob(key)
  if blob:
    return blob

  blob = _generate_content_expensive(key)
  _set_blob(key, blob)
  return blob
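
If the blobs live in S3, the implementation-specific helpers could be as small as the sketch below (assuming boto3 and a hypothetical bucket named CACHE_BUCKET):

import boto3
from botocore.exceptions import ClientError

_s3 = boto3.client("s3")
_BUCKET = "CACHE_BUCKET"  # hypothetical bucket name

def _set_blob(key: str, blob: bytes) -> None:
    # Store the generated image under a deterministic object name.
    _s3.put_object(Bucket=_BUCKET, Key=f"{key}.jpg", Body=blob)

def _get_blob(key: str) -> bytes | None:
    # Return the cached image, or None on a cache miss.
    try:
        obj = _s3.get_object(Bucket=_BUCKET, Key=f"{key}.jpg")
    except ClientError as e:
        if e.response["Error"]["Code"] == "NoSuchKey":
            return None
        raise
    return obj["Body"].read()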

Better caching

We might be able to save even more on image generation if we’re ok with serving similar results.
For instance, if Alice searches for “house” and Bob asks for “villa”, the resulting images might be so similar that it’d be a shame to waste 4 cents on creating a new one:

[Pixel-art images generated for “house”, “villa”, and “spaceship” (the spaceship probably not as similar).]

The non-trivial part in this solution is understanding that “house” and “villa” are pretty similar (and that “house” and “spaceship” are not). Luckily, this is another task that Machine Learning can help us with.

Embedding

Part of how LLMs (Large Language Models) work is assigning a numerical meaning to phrases, which is called “embedding”. This means each phrase can be translated into a vector (array) of floating-point numbers between -1 and 1, each number representing some property of the phrase that might not make sense to humans, but does make sense to the LLM’s brain.
The more “similar” the vectors of phrases A and B, the more similar those phrases are, in the LLM’s opinion.
Similarity for these vectors is usually measured by cosine distance, a function that takes two vectors and produces a number between 0 and 2, where 0 is “pointing in the same direction” and 2 is “couldn’t be further apart”.
The size of the vector and the specific meaning of each cell in it are specific to the embedding model being used. Using OpenAI’s embeddings (my go-to for PoCs) will result in 1536 or 3072 cells (depending on which model you use).
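
To make the distance concrete, here is a toy example with made-up 3-dimensional vectors (real embeddings have 1536 or 3072 cells, but the math is the same):

import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # 1 - cosine similarity: 0 means "same direction", 2 means "opposite directions"
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

house = np.array([0.8, 0.1, 0.2])       # made-up toy embeddings
villa = np.array([0.7, 0.2, 0.2])
spaceship = np.array([-0.1, 0.9, 0.3])

print(cosine_distance(house, villa))      # small - similar phrases
print(cosine_distance(house, spaceship))  # larger - dissimilar phrases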

The basic algorithm of our semantic cache now looks like:

  1. For a given phrase, extract the embedding vector
  2. Check whether we already have a cached object with a “close enough” vector
  3. If so, serve the closest cached object we have
  4. Otherwise, fetch the object from upstream (goodbye 4 cents), and store the result and its vector in the cache for next time

We still have to define what “close enough” is, but that’s more of a product question, and can also be adjusted based on our whims in real time.

PostgreSQL

The idea of using only our storage’s API doesn’t work with vector search, because instead of asking for a specific filename, we have to enumerate all keys and do some math over them, which is not a quick operation with blob storage solutions. Instead, we need some sort of a database.

Since my app already uses PostgreSQL, I found pgvector to be a great solution.
Availability-wise, it’s a PostgreSQL extension that’s supported by big cloud SQL providers like AWS RDS and Google Cloud SQL, and also available as a Docker image.
Functionality-wise, it adds a new data type (vector) and some operators (<=> for cosine distance), which makes our query pretty readable.

Our cache table can be as simple as a single vector column, but I’d add another one:

name       type
embedding  vector
key        short text

The actual image can still be kept in the original format s3://CACHE_BUCKET/key.jpg, where the key will be retrieved from the DB. The key can also be used by the developers for debugging (“what is the phrase matching this vector?”).
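
For reference, a one-time setup of this table could look something like the sketch below (assuming psycopg2, the items table name used in the code in the next section, and 3072-dimensional vectors to match text-embedding-3-large; connection details are elided just like in the code below):

import psycopg2

def _create_cache_table() -> None:
    # One-time setup: enable pgvector and create the cache table.
    conn = psycopg2.connect(...)  # same connection details as the rest of the app
    cursor = conn.cursor()
    cursor.execute('CREATE EXTENSION IF NOT EXISTS vector')
    cursor.execute(
        'CREATE TABLE IF NOT EXISTS items ('
        ' key text PRIMARY KEY,'    # the phrase; doubles as the blob name in S3
        ' embedding vector(3072)'   # one cell per embedding dimension
        ')'
    )
    conn.commit()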

And now in Python

pgvector-python is a convenient bridge from Python’s psycopg2 (or psycopg3 if you’re brave) to pgvector.
Our usage of OpenAI’s embedding API is so basic that we can stick with plain HTTP calls via requests. A basic API looks like this:

import os
import requests
import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np # vectors are numpy arrays in pgvector-python

def _set_blob(key: str, blob: bytes) -> None:
  # Implementation specific
  ...

def _get_blob(key: str) -> bytes:
  # Implementation specific
  ...

def _generate_content_expensive(key: str) -> bytes:
  # Implementation specific
  ...

def _sql_cursor():
    conn = psycopg2.connect(...)
    register_vector(conn)
    cursor = conn.cursor()
    return cursor


def _get_embedding(key: str) -> np.ndarray:
    resp = requests.post(
        "https://api.openai.com/v1/embeddings",
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
        json={
            "input": key,
            "model": "text-embedding-3-large", # can be "text-embedding-3-small" too
        }
    )
    resp.raise_for_status()
    j = resp.json()
    ret = j['data'][0]['embedding']
    return np.array(ret)

def _cache_set(key: str, embedding: np.ndarray) -> None:
    cursor = _sql_cursor()
    cursor.execute('INSERT INTO items (key, embedding) VALUES (%s, %s)', (key, embedding))
    cursor.connection.commit()

def _cache_get(embedding: np.ndarray, distance: float) -> str|None:
    cursor = _sql_cursor()
    cursor.execute(
        'SELECT key FROM items WHERE (embedding <=> %(embedding)s) <= %(distance)s ORDER BY embedding <=> %(embedding)s LIMIT 1',
        {
            'embedding': embedding,
            'distance': distance,
        }
    )
    row = cursor.fetchone()
    if row is None:
        return None
    else:
        return row[0]

def get(key: str, distance: float) -> bytes:
    embedding = _get_embedding(key)
    cache_entry = _cache_get(embedding, distance)
    if cache_entry:
        return _get_blob(cache_entry)

    blob = _generate_content_expensive(key)
    _set_blob(key, blob)
    _cache_set(key, embedding)
    return blob

Calling the get method implements the algorithm we described above, and works as a nice PoC.
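
For example (the 0.3 threshold here is just a made-up starting point for “close enough”):

# Alice's request pays the 4 cents and seeds the cache...
alice_image = get("house", distance=0.3)

# ...Bob's request may be served from Alice's entry, if "villa" lands close enough.
bob_image = get("villa", distance=0.3)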

Things that can be improved

  1. Scaling and indexing: I believe that the current solution requires a linear scan of the cache table, which won’t scale very well. I know the index story is a bit complex for vectors, and it’s going to be a fun adventure exploring it (a first step is sketched after this list).
  2. Adding a “perfect match” race: We could probably try hitting the blob cache directly for the exact key, in case it’s there. If Bob asked for “dog” and Alice already seeded the cache with “dog”, trying to vector-search “dog” is a waste when we can just check whether “dog” exists in the cache.
  3. Caching the embeddings as well: If we already translated “villa” to an embedding in the past, no need to embed it again. We can search our cache table by key to see whether we already have it there.
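
On point 1, pgvector does ship approximate-nearest-neighbor indexes (IVFFlat and HNSW), so a first step could look like the sketch below, reusing the _sql_cursor helper from above. One caveat: at the time of writing, these indexes only cover vectors of up to 2,000 dimensions, so 3072-cell embeddings would need pgvector’s halfvec type or a smaller embedding model.

def _create_ann_index() -> None:
    # HNSW index over cosine distance, so lookups no longer scan the whole table.
    cursor = _sql_cursor()
    cursor.execute('CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)')
    cursor.connection.commit()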

Images from

HuggingFace kohbanye/pixel-art-style