
Similarity search in vector databases: a comprehensive guide


Similarity search in vector databases has emerged as a pivotal technique for retrieving information efficiently by comparing complex data points in high-dimensional spaces. The ability to find similar items efficiently is crucial for applications ranging from RAG pipelines and search engines to recommendation systems. In this blog post, I’ll dive into what similarity search means, explore the different algorithms used, and explain each one’s strengths and weaknesses along with recommendations for when to use it. I’ll provide examples using pg_vector, a PostgreSQL extension for handling vector-based data. For an introduction to vector databases and embeddings, see my article Vector databases: what, why, and how.

The article illustrates the algorithms with examples in PostgreSQL. The article mentioned above explains how to set up the pg_vector extension, which is a prerequisite for the SQL statements used here. The following table and data are used.
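
As a minimal sketch (assuming the pg_vector binaries are already installed on the database server), enabling the extension in a database is a single statement:

CREATE EXTENSION IF NOT EXISTS vector;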

CREATE TABLE IF NOT EXISTS test_embedding (
     id SERIAL PRIMARY KEY
   , embedding vector  -- dimension left open; declaring e.g. vector(2) enforces a fixed dimension
   , created_at timestamptz DEFAULT now()
);

INSERT INTO test_embedding (embedding) VALUES (ARRAY[1, 1]);
INSERT INTO test_embedding (embedding) VALUES (ARRAY[2, 2]);
INSERT INTO test_embedding (embedding) VALUES (ARRAY[2, 1]);
INSERT INTO test_embedding (embedding) VALUES (ARRAY[5, 5]);
INSERT INTO test_embedding (embedding) VALUES (ARRAY[5, 0]);

The data points are five two-dimensional vectors, easy to picture in a plane. Each vector will be compared with every other using the following distance and similarity measures:

  • Euclidean Distance (L2)
  • Cosine Similarity
  • Manhattan Distance (L1)
  • Dot Product (Inner Product)
  • Hamming Distance
  • Jaccard Similarity

The results of the different algorithms, computed with PostgreSQL, are listed in the results section further below. The algorithms and SQL queries are explained in the next sections.

Similarity search algorithms

The following paragraphs explain the different algorithms, each with an SQL example for the PostgreSQL pg_vector extension. The SQL statements intentionally compare each vector with every other using a Cartesian cross join.
pg_vector provides operators for Euclidean Distance (<->), Cosine Distance (<=>), Manhattan Distance (<+>), and Dot Product (<#>). For the two other algorithms (Hamming Distance and Jaccard Similarity), stored functions have been created. These functions only demonstrate the principle and are not tuned for speed or for use in production systems.
At the end of the article, the algorithms are compared.

Search algorithm: Euclidean Distance (L2)

Euclidean distance measures how far two points are from each other in a straight line. It’s like using a ruler to measure the distance between two dots on paper.

Mathematical Formula: d(p, q) = sqrt((q1 – p1)^2 + (q2 – p2)^2 + … + (qn – pn)^2)
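
For example, for the sample vectors (1, 1) and (2, 2): d = sqrt((2 – 1)^2 + (2 – 1)^2) = sqrt(2) ≈ 1.41421, which is exactly the value the query below returns for this pair.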

SELECT a.id, b.id
     , a.embedding as embedding_a
     , b.embedding as embedding_b
     , round((a.embedding <-> b.embedding)::numeric, 5) AS l2_euclidean_distance
FROM test_embedding a
   , test_embedding b
ORDER BY a.embedding, b.embedding;

Search algorithm: Cosine Similarity

Cosine similarity measures how similar two items are, regardless of their size. Imagine comparing the angle between two arrows pointing in different directions; the closer the angle is to zero, the more similar the items are.

Mathematical Formula: cosine_similarity(p, q) = (p1*q1 + p2*q2 + … + pn*qn) / (sqrt(p1^2 + p2^2 + … + pn^2) * sqrt(q1^2 + q2^2 + … + qn^2))
The corresponding cosine distance is cosine_distance(p, q) = 1 – cosine_similarity(p, q). Note that pg_vector’s <=> operator returns the cosine distance, which is why the query below is labeled accordingly.
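
For example, for the sample vectors (1, 1) and (2, 2), which point in the same direction: cosine_similarity = (1*2 + 1*2) / (sqrt(2) * sqrt(8)) = 4 / 4 = 1, so cosine_distance = 0; the two vectors count as perfectly similar despite their different lengths.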

SELECT a.id, b.id
     , a.embedding as embedding_a
     , b.embedding as embedding_b
     , round((a.embedding <=> b.embedding)::numeric, 5) AS cosine_distance
FROM test_embedding a
   , test_embedding b
ORDER BY a.embedding, b.embedding;

Search algorithm: Manhattan Distance (L1)

Manhattan distance measures the total distance traveled between two points if you can only move along grid lines (like streets in a city). It’s like walking in a city where you can only move north-south or east-west.

Mathematical Formula: manhattan_distance(p, q) = |q1 – p1| + |q2 – p2| + … + |qn – pn|
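
For example, for the sample vectors (1, 1) and (5, 0): manhattan_distance = |5 – 1| + |0 – 1| = 4 + 1 = 5.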

SELECT a.id, b.id
     , a.embedding as embedding_a
     , b.embedding as embedding_b
     , round((a.embedding <+> b.embedding)::numeric, 5) AS l1_manhattan_distance
FROM test_embedding a
   , test_embedding b
ORDER BY a.embedding, b.embedding;

Search algorithm: Dot Product

The dot product measures how much two vectors point in the same direction. Imagine you have two arrows; the dot product tells you how much they overlap if they were placed on top of each other. Note that pg_vector’s <#> operator returns the negative inner product (so that it sorts like a distance), which is why the query below negates the result.

Mathematical Formula: dot_product(p, q) = p1*q1 + p2*q2 + … + pn*qn
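
For example, for the sample vectors (2, 1) and (5, 5): dot_product = 2*5 + 1*5 = 15. The <#> operator returns -15 for this pair, and the query below negates it back to 15.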

SELECT a.id, b.id
     , a.embedding as embedding_a
     , b.embedding as embedding_b
     , - round((a.embedding <#> b.embedding)::numeric, 5) AS dot_product
FROM test_embedding a
     , test_embedding b
ORDER BY a.embedding, b.embedding;

Search algorithm: Hamming Distance

Hamming distance measures the number of positions at which two strings of equal length differ. It’s like comparing two words to see how many letters are different.

Mathematical Formula: hamming_distance(p, q) = count(p_i != q_i for i from 1 to n)
(Here, count(p_i != q_i for i from 1 to n) counts the number of positions where the corresponding elements of p and q are different.)
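
For example, the sample vectors (1, 1) and (2, 2) differ in both positions, so hamming_distance = 2; (1, 1) and (2, 1) differ only in the first position, giving 1.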

CREATE OR REPLACE FUNCTION hamming_distance(vec1 vector, vec2 vector)
RETURNS int AS $$
DECLARE
    distance int := 0;
    i int;
    vec1_elements float8[];
    vec2_elements float8[];
BEGIN
    -- Convert vectors to arrays
    vec1_elements := regexp_split_to_array(trim(both '[]' from vec1::text), ',')::float8[];
    vec2_elements := regexp_split_to_array(trim(both '[]' from vec2::text), ',')::float8[];

    -- Ensure the vectors are of the same dimension
    IF array_length(vec1_elements, 1) != array_length(vec2_elements, 1) THEN
        RAISE EXCEPTION 'Vectors must have the same number of dimensions';
    END IF;

    -- Calculate the Hamming distance
    FOR i IN 1..array_length(vec1_elements, 1) LOOP
        IF vec1_elements[i] != vec2_elements[i] THEN
            distance := distance + 1;
        END IF;
    END LOOP;

    RETURN distance;
END;
$$ LANGUAGE plpgsql IMMUTABLE;

SELECT a.id, b.id
     , a.embedding AS embedding_a
     , b.embedding AS embedding_b
     , hamming_distance(a.embedding, b.embedding) AS hamming_distance
FROM test_embedding a
   , test_embedding b
ORDER BY a.embedding, b.embedding;

Search algorithm: Jaccard Similarity

Jaccard similarity measures how similar two sets are by comparing their common elements. It’s like comparing two groups of friends and seeing how many friends they have in common.

Mathematical Formula: jaccard_similarity(A, B) = |A ∩ B| / |A ∪ B|
(Here, A and B are sets, |A ∩ B| is the number of elements in the intersection of A and B, and |A ∪ B| is the number of elements in the union of A and B.)
Jaccard Distance is also used, with the formula jaccard_distance(A, B) = 1 – (|A ∩ B| / |A ∪ B|).
The stored function below interprets each vector as a binary set: a dimension counts as a set member if its value equals exactly 1.
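
For example, for the sample vectors (1, 1) and (2, 1): the first vector has both dimensions set to 1, the second only its second dimension. The intersection contains 1 element and the union 2 elements, so jaccard_similarity = 1 / 2 = 0.5.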

CREATE OR REPLACE FUNCTION jaccard_similarity(vec1 vector, vec2 vector)
RETURNS numeric AS $$
DECLARE
    intersection_count int := 0;
    union_count int := 0;
    i int;
    vec1_elements float8[];
    vec2_elements float8[];
BEGIN
    -- Convert vectors to arrays
    vec1_elements := regexp_split_to_array(trim(both '[]' from vec1::text), ',')::float8[];
    vec2_elements := regexp_split_to_array(trim(both '[]' from vec2::text), ',')::float8[];

    -- Ensure the vectors are of the same dimension
    IF array_length(vec1_elements, 1) != array_length(vec2_elements, 1) THEN
        RAISE EXCEPTION 'Vectors must have the same number of dimensions';
    END IF;

    -- Calculate the intersection and union
    FOR i IN 1..array_length(vec1_elements, 1) LOOP
        IF vec1_elements[i] = 1 AND vec2_elements[i] = 1 THEN
            intersection_count := intersection_count + 1;
        END IF;
        IF vec1_elements[i] = 1 OR vec2_elements[i] = 1 THEN
            union_count := union_count + 1;
        END IF;
    END LOOP;

    IF union_count = 0 THEN
        RETURN 0;
    END IF;

    RETURN (intersection_count::numeric / union_count::numeric);
END;
$$ LANGUAGE plpgsql IMMUTABLE;


SELECT a.id, b.id
     , a.embedding AS embedding_a
     , b.embedding AS embedding_b
     , round(jaccard_similarity(a.embedding, b.embedding), 5) AS jaccard_similarity
FROM test_embedding a
   , test_embedding b
ORDER BY a.embedding, b.embedding;
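
As an aside, and as an assumption to verify against your installed version: newer pg_vector releases (0.7.0 and later) ship support for the bit type with dedicated Hamming (<~>) and Jaccard (<%>) distance operators, which are preferable to the demonstration functions above when the data really is binary:

SELECT '101'::bit(3) <~> '111'::bit(3) AS hamming_distance
     , '101'::bit(3) <%> '111'::bit(3) AS jaccard_distance;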

Results of similarity searches with PostgreSQL

The results of the algorithms explained above can all be computed with pg_vector and the SQL statements from this article.
The following SQL statement produces a single result set containing 25 rows (5×5, as each vector is compared with every other, including itself) and one column per algorithm.

SELECT a.id AS id_a, b.id AS id_b
     , a.embedding AS embedding_a
     , b.embedding AS embedding_b
     , round((a.embedding <-> b.embedding)::numeric, 5) AS euclidean_distance
     , round((a.embedding <=> b.embedding)::numeric, 5) AS cosine_distance
     , round((a.embedding <+> b.embedding)::numeric, 5) AS l1_manhattan_distance
     , - round((a.embedding <#> b.embedding)::numeric, 5) AS dot_product
     , hamming_distance(a.embedding, b.embedding) AS hamming_distance
     , round(jaccard_similarity(a.embedding, b.embedding), 5) AS jaccard_similarity
FROM test_embedding a
   , test_embedding b;

Comparison of search algorithms in vector databases

The following table lists strengths, weaknesses, and recommendations for the above explained algorithms.

Euclidean Distance
Strengths:
  • Simple and intuitive to understand.
  • Effective for low-dimensional spaces.
Weaknesses:
  • Performance degrades in high-dimensional spaces due to the curse of dimensionality.
  • Sensitive to scale; requires normalization of data.
Recommendation: Use Euclidean distance when working with low-dimensional data or when simplicity and ease of understanding are important. Ensure data is normalized.

Cosine Similarity
Strengths:
  • Effective for text data and high-dimensional spaces.
  • Insensitive to scale; only considers the orientation of vectors.
Weaknesses:
  • Can be less effective for very sparse data.
  • Ignores magnitude differences.
Recommendation: Use cosine similarity for text data, document comparison, and other high-dimensional data where scale invariance is important. Normalization is optional, but common.

Manhattan Distance
Strengths:
  • Simple and intuitive to understand.
  • Effective for grid-like data structures.
Weaknesses:
  • Can be less effective for high-dimensional data.
  • Sensitive to scale; requires normalization of data.
Recommendation: Use Manhattan distance for grid-based systems, pathfinding in grid-like structures, and when simplicity and ease of understanding are important. Ensure data is normalized.

Dot Product
Strengths:
  • Simple and computationally efficient.
  • Effective for comparing the overall similarity in direction of two vectors.
Weaknesses:
  • Sensitive to the magnitude of vectors.
  • Not suitable for comparing vectors of varying lengths or magnitudes; requires normalization.
Recommendation: Use the dot product for comparing the direction of vectors, especially when magnitude is relevant, such as in certain machine learning and data analysis tasks.

Hamming Distance
Strengths:
  • Simple and effective for binary strings or fixed-length data.
  • Fast and computationally inexpensive.
Weaknesses:
  • Only applicable to data of equal length.
  • Not suitable for continuous data.
Recommendation: Use Hamming distance for error detection and correction in binary strings or for comparing fixed-length data.

Jaccard Similarity
Strengths:
  • Simple and intuitive for set-based data.
  • Effective for comparing binary attributes.
Weaknesses:
  • Less effective for continuous or high-dimensional data.
  • Can be computationally expensive for large sets.
Recommendation: Use Jaccard similarity for comparing sets, binary data, or categorical attributes where common elements are significant.

Note that some algorithms require normalization of the data. Normalization means scaling the data so that it falls within a specific range, usually between 0 and 1. This is particularly important when different features of the data have different units or magnitudes. Normalization helps eliminate the influence of varying scales and ensures that all features are treated equally.

A common method for normalization is min-max scaling, where the values of a feature are scaled to a range between 0 and 1:
normalized_value = (value – min_value) / (max_value – min_value)

Another method is z-score normalization, where the values are scaled based on their mean and standard deviation:
normalized_value = (value – mean) / standard_deviation
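
As an illustrative sketch (the inline sample values stand in for a real column), both normalization methods can be computed directly in PostgreSQL with window functions:

SELECT value
     , (value - MIN(value) OVER ()) / NULLIF(MAX(value) OVER () - MIN(value) OVER (), 0) AS min_max_normalized
     , (value - AVG(value) OVER ()) / NULLIF(STDDEV(value) OVER (), 0) AS z_score_normalized
FROM (VALUES (1.0), (2.0), (5.0)) AS t(value);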

Summary

Similarity search in vector databases is a powerful tool that enables finding items that are semantically similar rather than exactly the same. Each algorithm has its strengths and weaknesses, and knowing when to use each one will help you make the most of your similarity search applications.

Performance may suffer when the algorithms are applied to huge datasets, because each query performs an exact, exhaustive comparison. Vector indexes help improve performance, but reduced accuracy can be the trade-off.
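
As a sketch (assuming a pg_vector version with HNSW support and an embedding column declared with a fixed dimension, e.g. vector(2), which vector indexes require), an approximate-nearest-neighbor index for Euclidean distance looks like this:

-- HNSW index for approximate nearest-neighbor search with the <-> (L2) operator
CREATE INDEX ON test_embedding USING hnsw (embedding vector_l2_ops);

-- Alternative: IVFFlat index; 'lists' tunes the speed/accuracy trade-off
CREATE INDEX ON test_embedding USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);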

