Scribbly

pgvector (https://github.com/pgvector/pgvector), an open-source extension for PostgreSQL, adds a vector type along with distance functions and operators over vectors. These can be used to implement semantic search.

This post summarizes how to search by cosine distance with pgvector.

References

https://supabase.com/docs/guides/functions/examples/semantic-search

https://supabase.com/docs/guides/database/extensions/pgvector

Creating the schema

-- 1) Install the extension (the "vector" extension in Supabase)
create extension if not exists vector;

-- 2) Table with a vector column (the number in vector(...) is the embedding dimension)
create table if not exists documents (
  id serial primary key,
  content text not null,
  metadata jsonb,
  embedding vector(768) -- match the model's output dimension (embeddinggemma produces 768)
);

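id : serial is an auto-incrementing 32-bit integer in Postgres. Being much smaller than a 128-bit uuid, it keeps the primary-key index compact, which can help performance.
content : the text to embed
embedding : the vector for content, stored in pgvector's vector type. The dimension must be specified exactly.
metadata : jsonb, so arbitrary values can be stored. Put whatever document search needs here as key-value pairs: chunk number, source, the source's id, and so on.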
```
{
    "source": "world_bank_2023.pdf",
    "page_number": 15,
    "chunk_index": 0,
    "word_count": 512,
    "created_at": "2025-09-13"
}
```
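
Because metadata is jsonb, it can also be used to filter rows in plain SQL. A minimal sketch, assuming the rows being queried carry the keys from the example object above:

```sql
-- Filter chunks by values stored in the jsonb metadata column.
-- ->> returns text, so numeric keys need a cast before comparison.
select id, content
from documents
where metadata->>'source' = 'world_bank_2023.pdf'
  and (metadata->>'page_number')::int = 15;
```

The containment form metadata @> '{"source": "world_bank_2023.pdf"}' expresses the first condition as well, and it can use a GIN index on metadata if one is added.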

HNSW indexing

create index concurrently on documents using hnsw (embedding vector_cosine_ops);

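The index uses Hierarchical Navigable Small World (HNSW), an approximate nearest neighbor (ANN) algorithm.

When a new vector is inserted, HNSW finds its neighboring vectors and links the new vector to them.

When a query arrives, the search starts from vectors near the query and returns the top-k instead of scanning every row, which greatly reduces search time.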
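
pgvector exposes a few HNSW tuning parameters. The sketch below writes out the build-time defaults and raises the query-time candidate list. The index name documents_embedding_hnsw_idx and the value 100 are illustrative assumptions, not recommendations. Use it in place of the create index statement above, not in addition to it, or the table ends up with two equivalent indexes.

```sql
-- HNSW index with pgvector's default build parameters written out.
-- m: maximum number of connections per layer
-- ef_construction: size of the candidate list used while building the graph
create index if not exists documents_embedding_hnsw_idx
  on documents using hnsw (embedding vector_cosine_ops)
  with (m = 16, ef_construction = 64);

-- Query-time candidate list size (default 40). Higher values improve recall
-- at the cost of speed. SET applies to the current session.
set hnsw.ef_search = 100;
```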

Cosine similarity search

-- Cosine similarity matching function
create or replace function match_documents(
  query_embedding vector(768),
  match_threshold float default 0.0,  -- similarity threshold (-1..1)
  match_count int default 10          -- number of rows to return
)
returns table(
  id bigint,
  content text,
  metadata jsonb,
  similarity float
)
language sql stable
as $$
  select
    d.id,
    d.content,
    d.metadata,
    1 - (d.embedding <=> query_embedding) as similarity  -- cosine similarity
  from documents d
  where 1 - (d.embedding <=> query_embedding) >= match_threshold
  order by d.embedding <=> query_embedding asc           -- order by distance so the index can be used
  limit least(match_count, 200);
$$;
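
A call might look like the following. In practice query_embedding comes from the embedding model. Here, as a stand-in, the embedding of an existing row (id = 1, assumed to exist) is reused as the query, and the threshold and count are arbitrary example values.

```sql
-- Find up to 5 chunks whose cosine similarity to row 1 is at least 0.5.
select id, content, similarity
from match_documents(
  (select embedding from documents where id = 1),  -- stand-in for a real query embedding
  match_threshold => 0.5,
  match_count     => 5
);
```

Row 1 itself comes back first with a similarity of about 1, since its distance to itself is 0.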

pgvector's distance operators include <-> (L2 distance), <#> (negative inner product), and <=> (cosine distance).

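L2 distance: when absolute differences matter (the straight-line distance between two points)

Comparing image pixel values
Measuring exact distances in numeric data

Negative inner product: when computation speed matters (the dot product of two vectors, negated)

Fast search over large datasets
Recommendation systems

Cosine distance: when direction matters (angle and orientation)

Text similarity (regardless of document length)
Comparing user preference patterns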

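If the embedding vectors of two posts point in the same direction, their content is similar. For that reason this post uses cosine similarity (1 - cosine distance).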
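
A small worked example makes the difference concrete. The two literal vectors below are made up for illustration; they point in the same direction but differ in length, so the L2 distance is non-zero while the cosine distance is about 0.

```sql
-- [2,4,6] is [1,2,3] scaled by 2: same direction, different length.
select
  '[1,2,3]'::vector <-> '[2,4,6]'::vector        as l2_distance,     -- sqrt(1 + 4 + 9) = sqrt(14), about 3.74
  ('[1,2,3]'::vector <#> '[2,4,6]'::vector) * -1 as inner_product,   -- 2 + 8 + 18 = 28 (the operator itself returns -28)
  '[1,2,3]'::vector <=> '[2,4,6]'::vector        as cosine_distance; -- 1 - 28 / (sqrt(14) * sqrt(56)), about 0
```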

JOIN

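If the original posts are stored in their own table, it is better to join that table and return the post alongside each match.

Suppose the Posts table is chunked and stored in Documents, giving a 1:N relationship. The schema could then be defined as below. Note that if the documents table from the first section already exists, create table if not exists skips this definition, so drop or migrate the old table first.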

create table if not exists posts (
  id          uuid primary key,
  title       text        not null,
  content     text        not null
);

create table if not exists documents (
  id           bigserial primary key,
  post_id      uuid      not null references posts(id) on delete cascade,
  chunk_index  integer     not null,
  content      text        not null, -- chunked text
  metadata     jsonb       not null default '{}',
  embedding    vector(768) not null,
  created_at   timestamptz not null default now(),

  -- chunk_index is unique within a post
  constraint documents_post_chunk_unique unique (post_id, chunk_index),
  -- chunk_index must be 0 or greater
  constraint documents_chunk_index_nonneg check (chunk_index >= 0)
);
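
To make the 1:N relationship concrete, the sketch below inserts one post and its first chunk in a single statement. The title, text, and constant-valued embedding are placeholders (a real embedding comes from the model), and gen_random_uuid() is assumed to be available, as it is built in from PostgreSQL 13.

```sql
-- Insert a post, then a chunk that references it through post_id.
with new_post as (
  insert into posts (id, title, content)
  values (gen_random_uuid(), 'Example title', 'Full text of the post')
  returning id
)
insert into documents (post_id, chunk_index, content, embedding)
select
  id,
  0,                                               -- first chunk of this post
  'First chunk of the post',
  array_fill(0.01::real, array[768])::vector(768)  -- placeholder 768-dimensional vector
from new_post;
```

metadata and created_at fall back to their column defaults, and deleting the post later removes its chunks through on delete cascade.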

To join the two tables above and return the result, write the function as follows.

create or replace function match_documents_with_posts(
  query_embedding vector(768),
  match_threshold float default 0.0,   -- similarity threshold (-1..1)
  match_count     int   default 10,    -- number of rows to return
  filter_post_id  uuid  default null   -- restrict the search to one post (optional)
)
returns table(
  document_id   bigint,
  post_id       uuid,
  post_title    text,
  chunk_index   int,
  chunk_content text,
  similarity    float,
  metadata      jsonb
)
language sql
stable
as $$
  select
    d.id                                   as document_id,
    d.post_id                              as post_id,
    p.title                                as post_title,
    d.chunk_index                          as chunk_index,
    d.content                              as chunk_content,
    1 - (d.embedding <=> query_embedding)  as similarity,   -- cosine similarity
    d.metadata                             as metadata
  from documents d
  join posts p on p.id = d.post_id
  where (filter_post_id is null or d.post_id = filter_post_id)
    and (1 - (d.embedding <=> query_embedding)) >= match_threshold
  order by d.embedding <=> query_embedding asc              -- order by distance so the index is used
  limit least(match_count, 200);
$$;

The function returns rows of the following shape.

RETURNS TABLE (
  document_id   bigint,
  post_id       uuid,
  post_title    text,
  chunk_index   integer,
  chunk_content text,
  similarity    double precision, -- in Postgres, float without a precision is the same type (float8)
  metadata      jsonb
)
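
A usage sketch for the joined function, restricted to a single post. Both subqueries are stand-ins: the first reuses a stored embedding as the query, and the second picks an arbitrary existing post_id. The count of 3 is an example value.

```sql
-- Top 3 chunks within one post, with the post title attached.
select post_title, chunk_index, similarity
from match_documents_with_posts(
  (select embedding from documents order by id limit 1),  -- stand-in for a real query embedding
  match_count    => 3,
  filter_post_id => (select post_id from documents order by id limit 1)
);
```

Leaving filter_post_id out (or passing null) searches across all posts.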

Oversampling

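To return only the single best-matching chunk from each Post, oversample the candidates first. Otherwise several chunks from the same Post can fill the top K and crowd other Posts out of the result.

The function first fetches match_count times oversample candidates (5 times by default), then keeps only the chunk with the lowest cosine distance for each Post, and finally returns the top K of those.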

create or replace function match_posts_distinct_on(
  query_embedding vector(768),
  match_threshold float default 0.0,  -- similarity threshold (-1..1)
  match_count     int   default 10,   -- final K
  oversample      int   default 5     -- over-fetch factor for first-stage candidates, to allow for deduplication
)
returns table(
  document_id   bigint,
  post_id       uuid,
  post_title    text,
  chunk_index   integer,
  chunk_content text,
  similarity    double precision,
  metadata      jsonb
)
language sql
stable
as $$
  -- 1) Fetch first-stage candidates
  with candidates as (
    select
      d.id                                    as document_id,
      d.post_id                               as post_id,
      p.title                                 as post_title,
      d.chunk_index                           as chunk_index,
      d.content                               as chunk_content,
      d.metadata                              as metadata,
      d.embedding <=> query_embedding         as dist -- cosine "distance"
    from documents d
    join posts p on p.id = d.post_id
    where (1 - (d.embedding <=> query_embedding)) >= match_threshold
    order by d.embedding <=> query_embedding asc
    limit least(match_count * oversample, 1000) -- over-fetch by the oversample factor (5x by default), capped at 1000
  ),
  -- 2) Deduplicate: keep only one candidate per post_id
  dedup as (
    select distinct on (post_id)
      document_id, post_id, post_title, chunk_index, chunk_content, metadata, dist
    from candidates
    order by post_id, dist asc, document_id asc
  )
  -- 3) Return the top K of the deduplicated list, ranked by cosine similarity
  select
    document_id,
    post_id,
    post_title,
    chunk_index,
    chunk_content,
    1 - dist as similarity,
    metadata
  from dedup
  order by dist asc
  limit match_count;
$$;
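
A usage sketch, with the query embedding again taken from an existing row as a stand-in and the counts chosen only for illustration. With an HNSW index, an index scan returns at most hnsw.ef_search rows by default (40 unless changed), so the setting is raised here to cover the first-stage candidate count.

```sql
-- Let the index scan return enough first-stage candidates: 5 * 8 = 40.
set hnsw.ef_search = 40;

-- One best chunk per post, at most 5 posts.
select post_title, chunk_index, similarity
from match_posts_distinct_on(
  (select embedding from documents order by id limit 1),  -- stand-in for a real query embedding
  match_count => 5,
  oversample  => 8
);
```

If the candidates contain fewer than match_count distinct posts, fewer rows come back. Raising oversample (together with hnsw.ef_search) widens the first stage.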
