Quantcast
Channel: Data Management & Data Architecture
Viewing all articles
Browse latest Browse all 22

Oracle AI Vector – Semantic Search

$
0
0

With the advent of Large Language Models (LLM), vector databases are becoming increasingly popular. Vector databases and similar approaches have existed for a long time such as geodata have long been established. Oracle offers since Oracle Database 23ai vector functionalities and further enriches the concept of a converged database. This article looks at similarity search and indexing for Oracle AI Vector.
For an itroduction into vector databases and basic concepts see my article https://buckenhofer.com/2024/05/vector-database-what-why-and-how/.

Oracle AI Vector – installation and first steps

The easiest way to try out Oracle AI Vector is via a cloud instance Oracle 23 AI or via a local Docker container

docker run -p 1521:1521 --name db23ai container-registry.oracle.com/database/free:latest
docker exec -it db23ai ./setPassword.sh password123
docker exec -it db23ai sqlplus sys/password123@freepdb1 as sysdba

After the installation, a new database user should be created with CONNECT and RESOURCE permissions. If an HNSW index is required, the “Vector Pool” storage area within the SGA has to be enlarged. This is done with the VECTOR_MEMORY_SIZE parameter.

And now connect with the newly created user.

docker exec -it db23ai sqlplus newuser/password123@freepdb1

Now a table can be created with a 4-dimensional vector data type. The vector is 4-dimensional because 4 features are used: danceability, speechiness, loudness, tempo.

CREATE TABLE IF NOT EXISTS music (
   id INTEGER PRIMARY KEY,
   artistname VARCHAR2(250),
   songname VARCHAR2(250),
   songvector VECTOR(4)
);

Data is then inserted. There is no embedding model in the example required because the data is already numeric. It may be necessary to normalize data into a range from e.g. 0 to 1. The need for normalization depends on the search algorithm used. We will use the cosine distance later, for which no normalization is required.

The test data contains columns id, artist name, song name and four song characteristics (danceability, speechiness, loudness, tempo) as a 4-dimensional vector. Such data can be obtained via an API from MusicBrainz, Spotify and others. I use these test data for the example because it is well known, easy to understand and easy to transfer to similar use cases.

INSERT INTO music (id, artistname, songname, songvector)
VALUES
   (1, 'Back In Black', 'AC/DC', '[0.31,0.047,-5.678,188.386]'),
   (2, 'Hells Bells', 'AC/DC', '[0.389,0.0475,-4.768,106.767]'),
   (3, 'Highway to Hell', 'AC/DC', '[0.574,0.133,-4.793,115.728]'),
   (4, 'Shoot to Thrill', 'AC/DC', '[0.457,0.0747,-5.303,141.038]'),
   (5, 'Thunderstruck', 'AC/DC', '[0.502,0.0364,-5.175,133.52]'),
   (6, 'T.N.T.', 'AC/DC', '[0.754,0.302,-5.218,126.366]'),
   (7, 'You Shook Me All Night Long', 'AC/DC', '[0.532,0.0574,-5.509,127.361]'),
   (8, 'Detroit Rock City', 'KISS', '[0.546,0.0691,-6.728,90.591]'),
   (9, 'Heaven''s On Fire', 'KISS', '[0.679,0.0635,-6.053,124.324]'),
   (10, 'I Was Made For Lovin'' You', 'KISS', '[0.768,0.038,-6.215,128.373]'),
   (11, 'Lick It Up', 'KISS', '[0.681,0.058,-6.051,120.276]'),
   (12, 'Love Gun', 'KISS', '[0.409,0.0368,-6.236,147.211]'),
   (13, 'Rock And Roll All Nite', 'KISS', '[0.654,0.0998,-5.906,144.769]'),
   (14, 'Strutter', 'KISS', '[0.511,0.059,-5.618,131.34]'),
   (15, 'Another One Bites The Dust - Remastered 2011', 'Queen', '[0.932,0.161,-6.472,109.975]'),
   (16, 'Bohemian Rhapsody - Remastered 2011', 'Queen', '[0.391,0.0539,-9.961,143.879]'),
   (17, 'Crazy Little Thing Called Love - Remastered 2011', 'Queen', '[0.599,0.0423,-6.887,76.961]'),
   (18, 'Don''t Stop Me Now - Remastered 2011', 'Queen', '[0.563,0.16,-5.277,156.271]'),
   (19, 'Killer Queen - Remastered 2011', 'Queen', '[0.535,0.0554,-6.328,117.188]'),
   (20, 'Under Pressure', 'Queen', '[0.678,0.0444,-9.316,113.46]'),
   (21, 'Under Pressure - Remastered 2011', 'Queen', '[0.671,0.0478,-7.813,113.809]'),
   (22, 'We Will Rock You - Remastered 2011', 'Queen', '[0.693,0.119,-7.316,81.308]'),
   (23, 'Proud Mary', 'Tina Turner', '[0.397,0.0685,-10.972,171.067]'),
   (24, 'The Best', 'Tina Turner', '[0.662,0.0296,-9.54,103.867]'),
   (25, 'The Best - Edit', 'Tina Turner', '[0.683,0.0279,-6.864,103.789]'),
   (26, 'We Don''t Need Another Hero (Thunderdome)', 'Tina Turner', '[0.69,0.0471,-5.866,98.851]'),
   (27, 'What''s Love Got to Do with It - 2015 Remaster', 'Tina Turner', '[0.855,0.0576,-8.902,97.792]');

The following screenshot shows the create table command and the inserts.

Oracle AI Vector – Similarity search

One of the key functions of vector databases is their ability to carry out efficient similarity searches. This involves finding data points that are similar to a given query point . Distance metrics such as Euclidean or cosine-like distance are used. Oracle AI vector offers such functionalities as show in the following query.

SELECT
   artistName,
   songName,
   songVector <=> (SELECT songVector
                         FROM music
                         WHERE id = 5) AS similarity
FROM music
ORDER BY similarity
FETCH FIRST 5 ROWS ONLY;

The query above searches for the most similar songs in terms of the four dimensions danceability, speechiness, loudness, tempo compared to the song with ID = 5.
The output is shown in the “semantic search” diagram. The song itself is, as expected, the most similar and four recommendations are listed. The operator <=> uses the cosine distance to find the most similar songs. The Euclidean distance has a different operator <->. There are alternative possibilities to write queries by using VECTOR_DISTANCE().
The diagram shows the query output.

Oracle AI Vector – Indexes

A search without an index is an exact search that can take a long time for large data sets, as each vector in the table must be compared with the query vector. To improve the efficiency of similarity searches, Oarcle AI Vector uses specialized indexes. These indices, such as HNSW (Hierarchical Navigable Small World) or IVF (Inverted File), optimize the search for nearest neighbors in high-dimensional spaces. In the case of Oracle, HNSW is a pure in-memory index.

A search based on an index is no longer exact, but approximate. An exact search for the top 5 recommendations returns five rows. An approximate search can return five, four or fewer rows.

The exact search compares the query vector with every vector in the table, resulting in a time complexity of O(n * d) with n as the number of records and d as the number of dimensions. HNSW has a time complexity of O(log(n)). IVF has a time complexity of O(((n / k) + k) * d) with k as the number of clusters. Simply put, HNSW regularly provides better query performance compared to IVF. On the other hand, the creation of the HNSW index takes longer than the creation of an IVF index.
Oracle AI Vector supports the indexing methods mentioned above. Below are examples for IVF and HNSW to create an index on the column songvector with the cosine algorithm.

CREATE VECTOR INDEX music_ivf ON music (songVector)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 90 PARAMETERS (type IVF, neighbor partitions 10);

CREATE VECTOR INDEX music_hnsw ON music (songVector)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 90;

Summary

The article provides a brief guide to using Oracle AI Vector. It explains the creation of vector data types, the similarity search and the use of specialized indexes to optimize search performance. The example with music data can easily be transferred to other data.

The article also discusses the advantages of vector indices such as HNSW and IVF, which increase the efficiency of similarity searches. Optimizing the index to find the right balance between performance and recall is critical. It is important to understand how indexes are created and the impact of configuration options. Vector databases and indexes must provide tools to help developers optimize the application. Oracle AI Vector Search, for example, provides an index accuracy report that is a good step in the right direction.

The source code is available in my github repository.


Viewing all articles
Browse latest Browse all 22

Trending Articles