Embeddings: What they are and why they matter

embeddings 是什么意思icon-default.png?t=N7T8https://simonwillison.net/2023/Oct/23/embeddings/推荐原因:GPT 模型的基础是一种叫做 embeddings 的技术,用来将文本转换成向量,从而可以计算出文本之间的相似度。这篇文章详细地介绍了embeddings及应用

Embeddings are a really neat trick that often come wrapped in a pile of intimidating jargon.

If you can make it through that jargon, they unlock powerful and exciting techniques that can be applied to all sorts of interesting problems.

I gave a talk about embeddings at PyBay 2023. This article represents an improved version of that talk, which should stand alone even without watching the video.

If you’re not yet familiar with embeddings I hope to give you everything you need to get started applying them to real-world problems.

In this article:

  • The 38 minute video version
  • What are embeddings?
  • Related content using embeddings
  • Exploring how these things work with Word2Vec
  • Calculating embeddings using my LLM tool
  • Vibes-based search
  • Embeddings for code using Symbex
  • Embedding text and images together using CLIP
  • Faucet Finder: finding faucets with CLIP
  • Clustering embeddings
  • Visualize in 2D with Principal Component Analysis
  • Scoring sentences using average locations
  • Answering questions with Retrieval-Augmented Generation
  • Q&A
  • Further reading
The 38 minute video version #

Here’s a video of the talk that I gave at PyBay:

The audio quality of the official video wasn’t great due to an issue with the microphone, but I ran that audio through Adobe’s Enhance Speech tool and uploaded my own video with the enhanced audio to YouTube.

What are embeddings? #

Embeddings are a technology that’s adjacent to the wider field of Large Language Models—the technology behind ChatGPT and Bard and Claude.

On the left, a blog entry titled Storing and serving related documents with oepnai-to-sqlite and embeddings. On the right, a JSON array of floating point numbers, with the caption Fixed zise: 300, 1000, 1536...

Embeddings are based around one trick: take a piece of content—in this case a blog entry—and turn that piece of content into an array of floating point numbers.

The key thing about that array is that it will always be the same length, no matter how long the content is. The length is defined by the embedding model you are using—an array might be 300, or 1,000, or 1,536 numbers long.

The best way to think about this array of numbers is to imagine it as co-ordinates in a very weird multi-dimensional space.

It’s hard to visualize 1,536 dimensional space, so here’s a 3D visualization of the same idea:

a 3D chart showing a location in many-multi-dimensional space. 400 randomly placed red dots are scattered around the chart.

Why place content in this space? Because we can learn interesting things about that content based on its location—in particular, based on what else is nearby.

The location within the space represents the semantic meaning of the content, according to the embedding model’s weird, mostly incomprehensible understanding of the world. It might capture colors, shapes, concepts or all sorts of other characteristics of the content that has been embedded.

Nobody fully understands what those individual numbers mean, but we know that their locations can be used to find out useful things about the content.

One of the first problems I solved with embeddings was to build a “related content” feature for my TIL blog. I wanted to be able to show a list of related articles at the bottom of each page.

I did this using embeddings—in this case, I used the OpenAI text-embedding-ada-002 model, which is available via their API.

I currently have 472 articles on my site. I calculated the 1,536 dimensional embedding vector (array of floating point numbers) for each of those articles, and stored those vectors in my site’s SQLite database.

Now, if I want to find related articles for a given article, I can calculate the cosine similarity between the embedding vector for that article and every other article in the database, then return the 10 closest matches by distance.

There’s an example at the bottom of this page. The top five related articles for Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg are:

  • Geopoly in SQLite—2023-01-04
  • Viewing GeoPackage data with SpatiaLite and Datasette—2022-12-11
  • Using SQL with GDAL—2023-03-09
  • KNN queries with SpatiaLite—2021-05-16
  • GUnion to combine geometries in SpatiaLite—2022-04-12

That’s a pretty good list!

Here’s the Python function I’m using to calculate those cosine similarity distances:

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot_product / (magnitude_a * magnitude_b)

My TIL site runs on my Datasette Python framework, which supports building sites on top of a SQLite database. I wrote more about how that works in the Baked Data architectural pattern.

You can browse the SQLite table that stores the calculated embeddings at tils/embeddings.

Screenshot of the embeddings table in Datasette, it has 472, rows each of which consists of a text ID and a binary 6.144 bytes embedding

Those are binary values. We can run this SQL query to view them as hexadecimal:

select id, hex(embedding) from embeddings

Running that SQL query in Datasette returns text IDs and long hexadecimal strings for each embedding

That’s still not very readable though. We can use the llm_embed_decode() custom SQL function to turn them into a JSON array:

select id, llm_embed_decode(embedding) from embeddings limit 10

Try that here. It shows that each article is accompanied by that array of 1,536 floating point numbers.

Now the SQL query returns a JSON array of floating point numbers for each ID

We can use another custom SQL function, llm_embed_cosine(vector1, vector2), to calculate those cosine distances and find the most similar content.

That SQL function is defined here in my datasette-llm-embed plugin.

Here’s a query returning the five most similar articles to my SQLite TG article:

select
  id,
  llm_embed_cosine(
    embedding,
    (
      select
        embedding
      from
        embeddings
      where
        id = 'sqlite_sqlite-tg.md'
    )
  ) as score
from
  embeddings
order by
  score desc
limit 5

Executing that query returns the following results:

idscore
sqlite_sqlite-tg.md1.0
sqlite_geopoly.md0.8817322855676049
spatialite_viewing-geopackage-data-with-spatialite-and-datasette.md0.8813094978399854
gis_gdal-sql.md0.8799581261326747
spatialite_knn.md0.8692992294266506

As expected, the similarity between the article and itself is 1.0. The other articles are all related to geospatial SQL queries in SQLite.

This query takes around 400ms to execute. To speed things up, I pre-calculate the top 10 similarities for every article and store them in a separate table called tils/similarities.

The similarities table has 4,922 rows each with an id, other_id and score column.

I wrote a Python function to look up related documents from that table and called it from the template that’s used to render the article page.

My Storing and serving related documents with openai-to-sqlite and embeddings TIL explains how this all works in detail, including how GitHub Actions are used to fetch new embeddings as part of the build script that deploys the site.

I used the OpenAI embeddings API for this project. It’s extremely inexpensive—for my TIL website I embedded around 402,500 tokens, which at $0.0001 / 1,000 tokens comes to $0.04—just 4 cents!

It’s really easy to use: you POST it some text along with your API key, it gives you back that JSON array of floating point numbers.

Screenshot of curl against api.openai.com/v1/embeddings sending a Bearer token header and a JSON body specifying input text and the text-embedding-ada-002 model. The API responds with a JSON list of numbers.

But... it’s a proprietary model. A few months ago OpenAI shut down some of their older embeddings models, which is a problem if you’ve stored large numbers of embeddings from those models since you’ll need to recalculate them against a supported model if you want to be able to embed anything else new.

Screenshot of the OpenAI First-generation text embedding models list, showing the shutdown date of 4th April 2024 for 7 legacy models.

To OpenAI’s credit, they did promise to “cover the financial cost of users re-embedding content with these new models.”—but it’s still a reason to be cautious about relying on proprietary models.

The good news is that there are extremely powerful openly licensed models which you can run on your own hardware, avoiding any risk of them being shut down. We’ll talk about that more in a moment.

Exploring how these things work with Word2Vec #

Google Research put out an influential paper 10 years ago describing an early embedding model they created called Word2Vec.

That paper is Efficient Estimation of Word Representations in Vector Space, dated 16th January 2013. It’s a paper that helped kick off widespread interest in embeddings.

Word2Vec is a model that takes single words and turns them into a list of 300 numbers. That list of numbers captures something about the meaning of the associated word.

This is best illustrated by a demo.

turbomaze.github.io/word2vecjson is an interactive tool put together by Anthony Liu with a 10,000 word subset of the Word2Vec corpus. You can view this JavaScript file to see the JSON for those 10,000 words and their associated 300-long arrays of numbers.

Screenshot of the Word to Vec JS Demo showing the results for france and the algebra results for germany + paris - france

Search for a word to find similar words based on cosine distance to their Word2Vec representation. For example, the word “france” returns the following related results:

wordsimilarity
france1
french0.7000748343471224
belgium0.6933180492111168
paris0.6334910653433325
germany0.627075617939471
italy0.6135215284228007
spain0.6064218103692152

That’s a mixture of french things and European geography.

A really interesting thing you can do here is perform arithmetic on these vectors.

Take the vector for “germany”, add “paris” and subtract “france”. The resulting vector is closest to “berlin”!

Something about this model has captured the idea of nationalities and geography to the point that you can use arithmetic to explore additional facts about the world.

Word2Vec was trained on 1.6 billion words of content. The embedding models we use today are trained on much larger datasets and capture a much richer understanding of the underlying relationships.

Calculating embeddings using my LLM tool #

I’ve been building a command-line utility and Python library called LLM.

You can read more about LLM here:

  • llm, ttok and strip-tags—CLI tools for working with ChatGPT and other LLMs
  • The LLM CLI tool now supports self-hosted language models via plugins
  • LLM now provides tools for working with embeddings
  • Build an image search engine with llm-clip, chat with models with llm chat

LLM is a tool for working with Large Language Models. You can install it like this:

pip install llm

Or via Homebrew:

brew install llm

You can use it as a command-line tool for interacting with LLMs, or as a Python library.

Out of the box it can work with the OpenAI API. Set an API key and you can run commands like this:

llm 'ten fun names for a pet pelican'

Where it gets really fun is when you start installing plugins. There are plugins that add entirely new language models to it, including models that run directly on your own machine.

A few months ago I extended LLM to support plugins that can run embedding models as well.

Here’s how to run the catchily titled all-MiniLM-L6-v2 model using LLM:

Slide showing the commands listed below

First, we install llm and then use that to install the llm-sentence-transformers plugin—a wrapper around the SentenceTransformers library.

pip install llm
llm install llm-sentence-transformers

Next we need to register the all-MiniLM-L6-v2 model. This will download the model from Hugging Face to your computer:

llm sentence-transformers register all-MiniLM-L6-v2

We can test that out by embedding a single sentence like this:

llm embed -m sentence-transformers/all-MiniLM-L6-v2 \
  -c 'Hello world'

This outputs a JSON array that starts like this:

[-0.03447725251317024, 0.031023245304822922, 0.006734962109476328, 0.026108916848897934, -0.03936201333999634, ...

Embeddings like this on their own aren’t very interesting—we need to store and compare them to start getting useful results.

LLM can store embeddings in a “collection”—a SQLite table. The embed-multi command can be used to embed multiple pieces of content at once and store them in a collection.

That’s what this next command does:

llm embed-multi readmes \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --files ~/ '**/README.md' --store

Here we are populating a collection called “readmes”.

The --files option takes two arguments: a directory to search and a glob pattern to match against filenames. In this case I’m searching my home directory recursively for any file named README.md.

The --store option causes LLM to store the raw text in the SQLite table in addition to the embedding vector.

This command took around 30 minutes to run on my computer, but it worked! I now have a collection called readmes with 16,796 rows—one for each README.md file it found in my home directory.

Now that we have a collection of embeddings, we can run searches against it using the llm similar command:

A terminal running llm similar and piping the results through jq

llm similar readmes -c 'sqlite backup tools' | jq .id

We are asking for items in the readmes collection that are similar to the embedding vector for the phrase “sqlite backup tools”.

This command outputs JSON by default, which includes the full text of the README files since we stored them using --store earlier.

Piping the results through jq .id causes the command to output just the IDs of the matching rows.

The top matching results are:

"sqlite-diffable/README.md"
"sqlite-dump/README.md"
"ftstri/salite/ext/repair/README.md"
"simonw/README.md"
"sqlite-generate/README.md"
"sqlite-history/README.md"
"dbf-to-sqlite/README.md"
"ftstri/sqlite/ext/README.md"
"sqlite-utils/README.md"
"ftstri/sqlite/README.md'

These are good results! Each of these READMEs either describes a tool for working with SQLite backups or a project that relates to backups ins ome way.

What’s interesting about this is that it’s not guaranteed that the term “backups” appeared directly in the text of those READMEs. The content is semantically similar to that phrase, but might not be an exact textual match.

We can call this semantic search. I like to think of it as vibes-based search.

The vibes of those READMEs relate to our search term, according to this weird multi-dimensional space representation of the meaning of words.

This is absurdly useful. If you’ve ever built a search engine for a website, you know that exact matches don’t always help people find what they are looking for.

We can use this kind of semantic search to build better search engines for a whole bunch of different kinds of content.

Embeddings for code using Symbex #

Another tool I’ve been building is called Symbex. It’s a tool for exploring the symbols in a Python codebase.

I originally built it to help quickly find Python functions and classes and pipe them into LLMs to help explain and rewrite them.

Then I realized that I could use it to calculate embeddings for all of the functions in a codebase, and use those embeddings to build a code search engine.

I added a feature that could output JSON or CSV representing the symbols it found, using the same output format that llm embed-multi can use as an input.

Here’s how I built a collection of all of the functions in my Datasette project, using a newly released model called gte-tiny—just a 60MB file!

llm sentence-transformers register TaylorAI/gte-tiny

cd datasette/datasette

symbex '*' '*:*' --nl | \
  llm embed-multi functions - \
  --model sentence-transformers/TaylorAI/gte-tiny \
  --format nl \
  --store

symbex '*' '*:*' --nl finds all functions (*) and class methods (the *:* pattern) in the current directory and outputs them as newline-delimited JSON.

The llm embed-multi ... --format nl command expects newline-delimited JSON as input, so we can pipe the output of symbex directly into it.

This defaults to storing the embeddings in the default LLM SQLite database. You can add --database /tmp/data.db to specify an alternative location.

And now... I can run vibes-based semantic search against my codebase!

I could use the llm similar command for this, but I also have the ability to run these searches using Datasette itself.

Here’s a SQL query for that, using the datasette-llm-embed plugin from earlier:

with input as (
  select
    llm_embed(
      'sentence-transformers/TaylorAI/gte-tiny',
      :input
    ) as e
)
select
  id,
  content
from
  embeddings,
  input
where
  collection_id = (
    select id from collections where name = 'functions'
  )
order by
  llm_embed_cosine(embedding, input.e) desc
limit 5

The :input parameter is automatically turned into a form field by Datasette.

When I run this, I get back functions that relate to the concept of listing plugins:

Running that query in Datasette with an input of list plugins returns the plugins() function from the cli.py file on line 175

The key idea here is to use SQLite as an integration point—a substrate for combining together multiple tools.

I can run separate tools that extract functions from a codebase, run them through an embedding model, write those embeddings to SQLite and then run queries against the results.

Anything that can be piped into a tool can now be embedded and processed by the other components of this ecosystem.

Embedding text and images together using CLIP #

My current favorite embedding model is CLIP.

CLIP is a fascinating model released by OpenAI—back in January 2021, when they were still doing most things in the open—that can embed both text and images.

Crucially, it embeds them both into the same vector space.

If you embed the string “dog”, you’ll get a location in 512 dimensional space (depending on your CLIP configuration).

If you embed a photograph of a dog, you’ll get a location in that same space... and it will be close in terms of distance to the location of the string “dog”!

This means we can search for related images using text, and search for related text using images.

I built an interactive demo to help explain how this works. The demo is an Observable notebook that runs the CLIP model directly in the browser.

It’s a pretty heavy page—it has to load 158MB of resources (64.6MB for the CLIP text model and 87.6MB for the image model)—but once loaded you can use it to embed an image, then embed a string of text and calculate the distance between the two.

I can give it this photo I took of a beach:

A bright blue sky over a beach, with sandy cliffs and the Pacific ocean in the frame

Then type in different text strings to calculate a similarity score, here displayed as a percentage:

Animation showing different similarity scores for different text strings

textscore
beach26.946%
city19.839%
sunshine24.146%
sunshine beach26.741%
california25.686%
california beach27.427%

It’s pretty amazing that we can do all of this in JavaScript running in the browser!

There’s an obvious catch: it’s not actually that useful to be able to take an arbitrary photo and say “how similar is this to the term ’city’?”.

The trick is to build additional interfaces on top of this. Once again, we have the ability to build vibes-based search engines.

Here’s a great example of one of those.

Faucet Finder: finding faucets with CLIP #

Drew Breunig used LLM and my llm-clip plugin to build a search engine for faucet taps.

He was renovating his bathroom, and he needed to buy new faucets. So he scraped 20,000 photographs of faucets from a faucet supply company and ran CLIP against all of them.

He used the result to build Faucet Finder—a custom tool (deployed using Datasette) for finding faucets that look similar to other faucets.

The Faucet Finder homepage - six faucets, each with a Find Similar button.

Among other things, this means you can find an expensive faucet you like and then look for visually similar cheaper options!

Drew wrote more about his project in Finding Bathroom Faucets with Embeddings.

Drew’s demo uses pre-calculated embeddings to display similar results without having to run the CLIP model on the server.

Inspired by this, I spent some time figuring out how to deploy a server-side CLIP model hosted by my own Fly.io account.

Drew’s Datasette instance includes this table of embedding vectors, exposed via the Datasette API.

I deployed my own instance with this API for embedding text strings, then built an Observable notebook demo that hits both APIs and combines the results.

observablehq.com/@simonw/search-for-faucets-with-clip-api

Now I can search for things like “gold purple” and get back vibes-based faucet results:

Observable notebook: Search for Faucets with CLIP. The search term gold purple produces 8 alarmingly tasteless faucets in those combined colors.

Being able to spin up this kind of ultra-specific search engine in a few hours is exactly the kind of trick that excites me about having embeddings as a tool in my toolbox.

Clustering embeddings #

Related content and semantic / vibes-based search are the two most comon applications of embeddings, but there are a bunch of other neat things you can do with them too.

One of those is clustering.

I built a plugin for this called llm-cluster which implements this using sklearn.cluster from scikit-learn.

To demonstrate that, I used my paginate-json tool and the GitHub issues API to collect the titles of all of the issues in my simonw/llm repository into a collection called llm-issues:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
  --store

Now I can create 10 clusters of issues like this:

llm install llm-cluster

llm cluster llm-issues 10

Clusters are output as a JSON array, with output that looks something like this (truncated):

[
  {
    "id": "2",
    "items": [
      {
        "id": "1650662628",
        "content": "Initial design"
      },
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      }
    ]
  },
  {
    "id": "4",
    "items": [
      {
        "id": "1650760699",
        "content": "llm web command - launches a web server"
      },
      {
        "id": "1759659476",
        "content": "`llm models` command"
      },
      {
        "id": "1784156919",
        "content": "`llm.get_model(alias)` helper"
      }
    ]
  },
  {
    "id": "7",
    "items": [
      {
        "id": "1650765575",
        "content": "--code mode for outputting code"
      },
      {
        "id": "1659086298",
        "content": "Accept PROMPT from --stdin"
      },
      {
        "id": "1714651657",
        "content": "Accept input from standard in"
      }
    ]
  }
]

These do appear to be related, but we can do better. The llm cluster command has a --summary option which causes it to pass the resulting cluster text through a LLM and use it to generate a descriptive name for each cluster:

llm cluster llm-issues 10 --summary

This gives back names like “Log Management and Interactive Prompt Tracking” and “Continuing Conversation Mechanism and Management”. See the README for more details.

Visualize in 2D with Principal Component Analysis #

The problem with massively multi-dimensional space is that it’s really hard to visualize.

We can use a technique called Principal Component Analysis to reduce the dimensionality of the data to a more manageable size—and it turns out lower dimensions continue to capture useful semantic meaning about the content.

Matt Webb used the OpenAI embedding model to generate embeddings for descriptions of every episode of the BBC’s In Our Time podcast. He used these to find related episodes, but also ran PCA against them to create an interactive 2D visualization.

Animated screenshot of a cloud of points in 2D space. At one side hovering over them shows things like The War of 1812 and The Battle of Trafalgar - at the other side we get Quantum Gravity and Higgs Boson and Carbon

Reducing 1,536 dimensions to just two still produces a meaningful way of exploring the data! Episodes about historic wars show up near each other, elsewhere there’s a cluster of episodes about modern scientific discoveries.

Matt wrote more about this in Browse the BBC In Our Time archive by Dewey decimal code.

Scoring sentences using average locations #

Another trick with embeddings is to use them for classification.

First calculate the average location for a group of embeddings that you have classified in a certain way, then compare embeddings of new content to those locations to assign it to a category.

Amelia Wattenberger demonstrated a beautiful example of this in Getting creative with embeddings.

She wanted to help people improve their writing by encouraging a mixture of concrete and abstract sentences. But how do you tell if a sentence of text is concrete or abstract?

Her trick was to generate samples of the two types of sentence, calculate their average locations and then score new sentences based on how close they are to either end of that newly defined spectrum.

A document. Different sentences are displayed in different shades of green and purple, with a key on the right hand side showing that green means concreete and purple means abstract, with a gradient between them.

This score can even be converted into a color loosely representing how abstract or concrete a given sentence is!

This is a really neat demonstration of the kind of creative interfaces you can start to build on top of this technology.

Answering questions with Retrieval-Augmented Generation #

I’ll finish with the idea that first got me excited about embeddings.

Everyone who tries out ChatGPT ends up asking the same question: how could I use a version of this to answer questions based on my own private notes, or the internal documents owned by my company?

People assume that the answer is to train a custom model on top of that content, likely at great expense.

It turns out that’s not actually necessary. You can use an off the shelf Large Language Model model (a hosted one or one that runs locally) and a trick called Retrieval Augmented Generation, or RAG.

The key idea is this: a user asks a question. You search your private documents for content that appears relevant to the question, then paste excerpts of that content into the LLM (respecting its size limit, usually between 3,000 and 6,000 words) along with the original question.

The LLM can then answer the question based on the additional content you provided.

This cheap trick is astonishingly effective. It’s trivial to get a basic version of this working—the challenge is in getting it to work as well as possible given the infinite set of questions a user might ask.

The key problem in RAG is figuring out the best possible excerpts of content to include in the prompt to the LLM.

“Vibes-based” semantic search powered by embedding is exactly the kind of thing you need to gather potentially relevant content to help answer a user’s question.

I built a version of this against content from my blog, described in detail in Embedding paragraphs from my blog with E5-large-v2.

I used a model called E5-large-v2 for this. It’s a model trained with this exact use-case in mind.

A challenge with finding content relevant to questions is that the user’s question—“What is shot-scraper?”—isn’t guaranteed to be considered semantically similar to content that answers that question. Questions and assertions have a different grammar from each other.

E5-large-v2 addresses this by supporting two types of content. You can embed phrases (factual sentences) and queries (questions) in the same space, similar to how CLIP supports both images and text.

I embedded 19,000 paragraphs of text from my blog as phrases, and now I can embed a question as a query and use that to find the paragraphs most likely to answer that question.

The result is RAG implemented as a one-liner Bash script:

llm similar blog-paragraphs -c "query: $1" \
  | jq '.content | sub("passage: "; "")' -r \
  | llm -m mlc-chat-Llama-2-7b-chat-hf-q4f16_1 \
  "$1" -s 'You answer questions as a single paragraph'

This example uses Llama 2 Chat 7B running on my laptop (with the llm-mlc plugin), so I’m able to answer questions entirely offline using code running on my laptop.

Running this:

./blog-answer.sh 'What is shot-scraper?'

Outputs this:

Shot-scraper is a Python utility that wraps Playwright, providing both a command line interface and a YAML-driven configuration flow for automating the process of taking screenshots of web pages and scraping data from them using JavaScript. It can be used to take one-off screenshots or take multiple screenshots in a repeatable way by defining them in a YAML file. Additionally, it can be used to execute JavaScript on a page and return the resulting value.

That’s a really good description of my shot-scraper tool. I checked and none of that output is an exact match to content I had previously published on my blog.

Q&A #

My talk ended with a Q&A session. Here are the summarized questions and answers.

  • How does LangChain fit into this?

    LangChain is a popular framework for implementing features on top of LLMs. It covers a lot of ground—my only problem with LangChain is that you have to invest a lot of work in understanding how it works and what it can do for you. Retrieval Augmented Generation is one of the key features of LangChain, so a lot of the things I’ve shown you today could be built on top of LangChain if you invest the effort.

    My philosophy around this stuff differs from LangChain in that I’m focusing on building a suite of small tools that can work together, as opposed to a single framework that solves everything in one go.

  • Have you tried distance functions other than cosine similarity?

    I have not. Cosine similarity is the default function that everyone else seems to be using and I’ve not spent any time yet exploring other options.

    I actually got ChatGPT to write all of my different versions of cosine similarity, across both Python and JavaScript!

    A fascinating thing about RAG is that it has so many different knobs that you can tweak. You can try different distance functions, different embedding models, different prompting strategies and different LLMs. There’s a lot of scope for experimentation here.

  • What do you need to adjust if you have 1 billion objects?

    The demos I’ve shown today have all been on the small side—up to around 20,000 embeddings. This is small enough that you can run brute force cosine similarity functions against everything and get back results in a reasonable amount of time.

    If you’re dealing with more data there are a growing number of options that can help.

    Lots of startups are launching new “vector databases”—which are effectively databases that are custom built to answer nearest-neighbour queries against vectors as quickly as possible.

    I’m not convinced you need an entirely new database for this: I’m more excited about adding custom indexes to existing databases. For example, SQLite has sqlite-vss and PostgreSQL has pgvector.

    I’ve also done some successful experiments with Facebook’s FAISS library, including building a Datasette plugin that uses it called datasette-faiss.

  • What improvements to embedding models are you excited to see?

    I’m really excited about multi-modal models. CLIP is a great example, but I’ve also been experimenting with Facebook’s ImageBind, which “learns a joint embedding across six different modalities—images, text, audio, depth, thermal, and IMU data.” It looks like we can go a lot further than just images and text!

    I also like the trend of these models getting smaller. I demonstrated a new model, gtr-tiny, earlier which is just 60MB. Being able to run these things on constrained devices, or in the browser, is really exciting to me.

Further reading #

If you want to dive more into the low-level details of how embeddings work, I suggest the following:

  • What are embeddings? by Vicki Boykis
  • Text Embeddings Visually Explained by Meor Amer for Cohere
  • The Tensorflow Embedding Projector—an interactive tool for exploring embedding spaces
  • Learn to Love Working with Vector Embeddings is a collection of tutorials from vector database vendor Pinecone

Posted 23rd October 2023 at 1:36 pm · Follow me on Mastodon or Twitter or subscribe to my newsletter

More recent articles

  • What I should have said about the term Artificial Intelligence - 9th January 2024
  • Weeknotes: Page caching and custom templates for Datasette Cloud - 7th January 2024
  • It's OK to call it Artificial Intelligence - 7th January 2024
  • Tom Scott, and the formidable power of escalating streaks - 2nd January 2024
  • Stuff we figured out about AI in 2023 - 31st December 2023
  • Last weeknotes of 2023 - 31st December 2023
  • Recommendations to help mitigate prompt injection: limit the blast radius - 20th December 2023
  • Many options for running Mistral models in your terminal using LLM - 18th December 2023
  • The AI trust crisis - 14th December 2023
  • Weeknotes: datasette-enrichments, datasette-comments, sqlite-chronicle - 8th December 2023

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mfbz.cn/a/320727.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

HTML--CSS--边框、列表、表格样式

边框样式 属性: border-width 边框宽度 border-style 边框外观 border-color 边框颜色 需要同时设定三个属性 border-width 边框宽度 取值为像素值 border-style 边框样式 none 无样式 dashed 虚线 solid 实线 border-color 边框颜色 如示例: 为div设…

NLP论文阅读记录 - 2021 | WOS 使用深度强化学习及其他技术进行自动文本摘要

文章目录 前言0、论文摘要一、Introduction1.1目标问题1.2相关的尝试1.3本文贡献 二.相关工作2.1. Seq2seq 模型2.2.强化学习和序列生成2.3.自动文本摘要 三.本文方法四 实验效果4.1数据集4.2 对比模型4.3实施细节4.4评估指标4.5 实验结果4.6 细粒度分析 五 总结思考 前言 Auto…

Python Matplotlib 动画教程:提高可视化吸引力的强大工具【第24篇—python:Matplotlib】

文章目录 🍖 方法一:使用pause()函数🚀 方法二:使用FuncAnimation()函数🥋 线性图动画:🎻 Python中的条形图追赶动画🌌 Python中的散点图动画:🛹 条形图追赶的…

<软考高项备考>《论文专题 - 65 质量管理(4) 》

4 过程3-管理质量 4.1 问题 4W1H过程做什么为了评估绩效,确保项目输出完整、正确且满足客户期望,而监督和记录质量管理活动执行结果的过程作用:①核实项目可交付成果和工作已经达到主要干系人的质量要求,可供最终验收;②确定项目…

乐意购项目前端开发 #2

一、Axios的安装和简单封装 安装Axios npm install axios在utils目录下创建 http.js 文件, 内容如下 import axios from axios// 创建axios实例 const http axios.create({baseURL: http://localhost:9999,//后端服务器地址timeout: 5000 })// axios请求拦截器 http.interc…

【用法总结】无障碍AccessibilityService

一、背景 本文仅用于做学习总结,转换成自己的理解,方便需要时快速查阅,深入研究可以去官网了解更多:官网链接点这里 之前对接AI语音功能时,发现有些按钮(或文本)在我没有主动注册唤醒词场景…

【JAVA】在 Queue 中 poll()和 remove()有什么区别

🍎个人博客:个人主页 🏆个人专栏:JAVA ⛳️ 功不唐捐,玉汝于成 目录 前言 正文 poll() 方法: remove() 方法: 区别总结: 结语 我的其他博客 前言 在Java的Queue接口中&…

计算机找不到msvcp120.dll如何解决?总结五个可靠的教程

在计算机使用过程中,遇到“找不到msvcp120.dll”这一问题常常令人困扰。msvcp120.dll作为Windows系统中至关重要的动态链接库文件,对于许多应用程序的正常运行起着不可或缺的作用。那么,究竟是什么原因导致找不到msvcp120.dll呢?又…

Multi-View-Information-Bottleneck

encoder p θ ( z 1 ∣ v 1 ) _θ(z_1|v_1) θ​(z1​∣v1​),D S K L _{SKL} SKL​ represents the symmetrized KL divergence. I ˆ ξ ( z 1 ; z 2 ) \^I_ξ(z_1; z_2) Iˆξ​(z1​;z2​) refers to the sample-based parametric mutual information estimatio…

vim升级和配置

vim升级和配置 1、背景2、环境说明3、操作3.1 升级VIM3.2 配置VIM3.2.1、编辑vimrc文件3.2.2、安装插件 1、背景 日常工作跟linux系统打交道比较多,目前主要用到的是Cenots7和Ubuntu18这两个版本的linux系统,其中Centos7主要是服务器端,Ubun…

如何开启文件共享及其他设备如何获取

1.场景分析 日常生活中,常常会遇到多台电脑共同办公文件却不能共享的问题,频繁的用移动硬盘、U盘等拷贝很是繁琐,鉴于此,可以在同一内网环境下设置共享文件夹,减少不必要的文件拷贝工作,提升工作效率。废话…

Maven《一》-- 一文带你快速了解Maven

目录 🐶1.1 为什么使用Maven 1. Mavan是一个依赖管理工具 ①jar包的规模 ②jar包的来源问题 ③jar包的导入问题 ④jar包之间的依赖 2. Mavan是一个构建工具 ①你没有注意过的构建 ②脱离IDE环境仍需构建 3. 结论 🐶1.2 什么是Maven &#x…

C语言数据结构(1)复杂度(大o阶)

欢迎来到博主的专栏——C语言与数据结构 博主ID——代码小豪 文章目录 如何判断代码的好坏时间复杂度什么是时间复杂度如何计算时间复杂度 空间复杂度 如何判断代码的好坏 实现相同作用的不同代码,如何分辨这些代码的优劣之处呢? 有人说了&#xff0c…

stm32学习笔记:USART串口通信

1、串口通信协议(简介软硬件规则) 全双工:打电话。半双工:对讲机。单工:广播 时钟:I2C和SPI有单独的时钟线,所以它们是同步的,接收方可以在时钟信号的指引下进行采样。串口、CAN和…

行业内参~移动广告行业大盘趋势-2023年12月

前言 2024年,移动广告的钱越来越难赚了。市场竞争激烈到前所未有的程度,小型企业和独立开发者在巨头的阴影下苦苦挣扎。随着广告成本的上升和点击率的下降,许多原本依赖广告收入的创业者和自由职业者开始感受到前所未有的压力。 &#x1f3…

IPv6组播--SSM Mapping

概念 SSM(Source-Specific Multicast)称为指定源组播,要求路由器能了解成员主机加入组播组时所指定的组播源。 如果成员主机上运行MLDv2,可以在MLDv2报告报文中直接指定组播源地址。但是某些情况下,成员主机只能运行MLDv1,为了使其也能够使用SSM服务,组播路由器上需要提…

swing快速入门(四十)JList、JComboBox实现列表框

注释很详细,直接上代码 上一篇 新增内容 🧧1.列表的属性设置与选项监听器 🧧2.下拉框的属性设置与选项监听器 🧧3.Box中组件填充情况不符合预期的处理方法 🧧4.LIst向Vector的转化方法 源码: package swing…

操作系统期末考复盘

简答题4题*5 20分计算题2题*5 10分综合应用2题*10 20分程序填空1题10 10分 1、简答题(8抽4) 1、在计算机系统上配置OS的目标是什么?作用主要表现在哪个方面? 在计算机系统上配置OS,主要目标是实现:方便性、…

操作系统复习 七、八章

操作系统复习 七、八章 文章目录 操作系统复习 七、八章第七章 内存管理内存管理的基本要求和原理覆盖与交换连续分配管理方式非连续分配管理方式基本分段存储管理方式段页式管理方式补充 第八章 虚拟内存虚拟内存的基本概念请求分页管理方式易混知识点页面置换算法页面分配策略…

memory泄露分析方法(java篇)

#memory泄露主要分为java和native 2种,本文主要介绍java# 测试每天从monkey中筛选出内存超标的app,提单流转到我 首先,辨别内存泄露类型(java,还是native) 从采到的dumpsys_meminfo_pid看java heap&…
最新文章