On the definition of RAG:

RAG is an AI Framework that integrates large language models (LLMs) with external knowledge retrieval to enhance accuracy and transparency.
Pre-trained language models generate text based on patterns in their training data.
RAG supplements their capabilities by retrieving relevant facts from constantly updated knowledge bases

Several key phrases stand out in this definition of RAG: an AI framework, integration of external knowledge, and support for continuously updated knowledge bases.

The Original/New Connect blocks in the diagram correspond to our external data (the blogs, notes, emails, e-books and so on that we write day to day), and the Vector database is where that external knowledge is stored. There are many vector databases on the market: MongoDB, Elasticsearch, and PostgreSQL, all familiar to DBAs, support vector storage. The LLM in the diagram is the large model we deployed earlier. Finally, the Framework sits at the center of the architecture diagram and plays the core role of tying the whole RAG architecture together.
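Before diving into the tooling, here is a minimal sketch of the retrieve-then-generate flow that RAG boils down to; the names are purely illustrative, and the concrete pieces are built step by step below:

def rag_answer(question, retrieve, llm):
    context = retrieve(question)                          # look up relevant facts in the external knowledge base
    prompt = "{}。問題是:{}".format(context, question)    # stuff the facts into the prompt
    return llm(prompt)                                    # the model answers grounded in the retrieved context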

For the framework we choose LangChain. The definition of LangChain:

LangChain is a framework for developing applications powered by large language models (LLMs).
LangChain simplifies every stage of the LLM application lifecycle:

In short, it is a development framework for LLM applications, covering development, continuous optimization, and deployment (publishing APIs and so on).

LangChain's support for pgvector is documented here: https://python.langchain.com/v0.1/docs/integrations/vectorstores/pgvector/

Let's run a demo following the official example:

1) Install the LangChain-related packages

pip3 install langchain_core

pip3 install langchain_postgres

pip3 install psycopg-c

pip3 install langchain-community

pip3 install sentence-transformers
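The loader script in step 3 also imports HuggingFaceEmbeddings from the langchain_huggingface package, which is not pulled in by the packages above; if you follow that import, install it as well:

pip3 install langchain-huggingface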

2) Prepare a basic test dataset:

docs = [
    Document(
        # Spain won Euro 2024 (facts are kept in Chinese to match the text2vec-base-chinese embedding model)
        page_content="2024年歐洲杯的冠軍是西班牙隊",
        metadata={"id": 1, "catalog": "sports", "topic": "CCTV-足球體育新聞"},
    ),
    Document(
        # the Boston Celtics won the 2023-2024 NBA championship
        page_content="2023-2024年NBA的總冠軍是波士頓凱爾特人隊",
        metadata={"id": 2, "catalog": "sports", "topic": "CNN-籃球體育新聞"},
    ),
    Document(
        # PostgreSQL 17 will be released in September 2024 with many new features
        page_content="2024年9月份postgres會發布version17版本,含有大量新的功能",
        metadata={"id": 3, "catalog": "tech", "topic": "開源數據庫社區"},
    ),
    Document(
        # Oracle released Oracle 23ai, with multi-model and vector database support
        page_content="2024年ORACLE發布了跨時代意義的數據庫版本ORACLE 23AI,支持多模數據庫,支持向量數據庫",
        metadata={"id": 4, "catalog": "tech", "topic": "甲骨文頻道"},
    ),
]

3) Load the data with a test program:

from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings  # requires: pip3 install langchain-huggingface
from langchain_postgres.vectorstores import PGVector

# See the docker command in the official LangChain docs to launch a postgres instance with pgvector enabled.
connection = "postgresql+psycopg://app_vector:app_vector@xx.xx.xxx.xxx:5432/postgres"  # uses psycopg3!

collection_name = "t_news"

embeddings = HuggingFaceEmbeddings(model_name='D:\\AI\\text2vec-base-chinese')

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

docs = [
    ...  # the documents prepared in step 2
]

print(vectorstore)

vectorstore.add_documents(docs, ids=[doc.metadata["id"] for doc in docs])

4) Test similarity search with the query "2024年歐洲杯冠軍,請介紹一下?" (who won Euro 2024? please tell me about it):

vectorstore.similarity_search("2024年歐洲杯冠軍,請介紹一下?", k=1)

We can see that the embedding model returns the correct answer.
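A quick way to inspect what comes back: similarity_search returns a list of Document objects. The metadata filter on the second call is an optional illustration of filtering on the catalog field we stored above, following the langchain_postgres metadata-filter syntax:

results = vectorstore.similarity_search("2024年歐洲杯冠軍,請介紹一下?", k=1)
print(results[0].page_content)  # expected: the Euro 2024 / Spain document
print(results[0].metadata)      # the metadata dict stored in the cmetadata column

# optional: restrict the search to documents whose metadata matches
vectorstore.similarity_search("2024年歐洲杯冠軍,請介紹一下?", k=1, filter={"catalog": {"$eq": "sports"}})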

5) Integrate the LLM API call:

First we simply call the LLM API directly, with no retrieval, and it is clearly hallucinating:

"The European Championship was won by Portugal; they beat France in the tournament held in the Netherlands in 2021."

Portugal actually won the Euro back in 2016, although the opponent in that final was indeed France.

import requests

def LLM(text):
    url = "http://127.0.0.1:8868/llm_query/{}".format(text)  # host and port where the FastAPI app is running
    response = requests.get(url)
    print(response.json())
    return response.json()

LLM("2024年歐洲杯冠軍,請介紹一下?")

Now we go through RAG-augmented retrieval:

import requests

def LLM(text):
    url = "http://127.0.0.1:8868/llm_query/{}".format(text)  # host and port where the FastAPI app is running
    response = requests.get(url)
    # print(response.json())
    return response.json()

def embedding(text):
    # retrieve the most similar document from the vector store built in step 3
    return vectorstore.similarity_search(text, k=1)[0].page_content

def RAG(text):
    msg = embedding(text)
    print(msg)
    # prepend the retrieved context to the question before calling the LLM
    return LLM("""{},問題是:{}""".format(msg, text))

if __name__ == "__main__":
    print(RAG("2024年歐洲杯冠軍,請介紹一下這個國家?例如這個國家人口,面積,氣候"))

This time the LLM gives us a comparatively reasonable answer.
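As an aside, the string concatenation inside RAG() can also be expressed with a prompt template from langchain_core; this is just a sketch, not part of the original demo:

from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template("根據以下資料回答問題。資料:{context}\n問題是:{question}")

def RAG(text):
    context = vectorstore.similarity_search(text, k=1)[0].page_content
    return LLM(prompt.format(context=context, question=text))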

Finally, let's look at how the LangChain/pgvector integration shows up in the database. The framework automatically creates two tables, langchain_pg_collection and langchain_pg_embedding:

postgres=> \dt
                   List of relations
   Schema   |          Name           | Type  |   Owner
------------+-------------------------+-------+------------
 app_vector | langchain_pg_collection | table | app_vector
 app_vector | langchain_pg_embedding  | table | app_vector
(5 rows)

postgres=> \d+ langchain_pg_collection
                                Table "app_vector.langchain_pg_collection"
  Column   |       Type        | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
-----------+-------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 uuid      | uuid              |           | not null |         | plain    |             |              |
 name      | character varying |           | not null |         | extended |             |              |
 cmetadata | json              |           |          |         | extended |             |              |
Indexes:
    "langchain_pg_collection_pkey" PRIMARY KEY, btree (uuid)
    "langchain_pg_collection_name_key" UNIQUE CONSTRAINT, btree (name)
Referenced by:
    TABLE "langchain_pg_embedding" CONSTRAINT "langchain_pg_embedding_collection_id_fkey" FOREIGN KEY (collection_id) REFERENCES langchain_pg_collection(uuid) ON DELETE CASCADE
Access method: heap

postgres=> \d+ langchain_pg_embedding
                                 Table "app_vector.langchain_pg_embedding"
    Column     |       Type        | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
---------------+-------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 id            | character varying |           | not null |         | extended |             |              |
 collection_id | uuid              |           |          |         | plain    |             |              |
 embedding     | vector            |           |          |         | external |             |              |
 document      | character varying |           |          |         | extended |             |              |
 cmetadata     | jsonb             |           |          |         | extended |             |              |
Indexes:
    "langchain_pg_embedding_pkey" PRIMARY KEY, btree (id)
    "ix_cmetadata_gin" gin (cmetadata jsonb_path_ops)
    "ix_langchain_pg_embedding_id" UNIQUE, btree (id)
Foreign-key constraints:
    "langchain_pg_embedding_collection_id_fkey" FOREIGN KEY (collection_id) REFERENCES langchain_pg_collection(uuid) ON DELETE CASCADE
Access method: heap

langchain_pg_collection is the parent table and records the name of each collection:

postgres=> select * from langchain_pg_collection;
                 uuid                 |  name  | cmetadata
--------------------------------------+--------+-----------
 17e8df97-5db8-442f-8f49-ea6e71231802 | t_news | null
(1 row)

langchain_pg_embedding is the child table and stores the embeddings themselves:

postgres=> select count(1) from langchain_pg_embedding;
 count
-------
     4
(1 row)

postgres=> select * from langchain_pg_embedding;
 id |            collection_id             | embedding
----+--------------------------------------+----------------------------------------------------------------
 1  | 17e8df97-5db8-442f-8f49-ea6e71231802 | [-1.3334022,0.9337577,-0.3636402,-0.053306933,0.0846217, ...]
(output truncated; the embedding column holds the full vector for each document)

Note: by default the vector column of the table LangChain generates has no index, so we can create an HNSW index manually:

postgres=> CREATE INDEX ON langchain_pg_embedding USING hnsw (embedding vector_cosine_ops);

ERROR: column does not have dimensions

This index error occurs because the vector length was not specified when the vector store was initialized; since the constructor call did not pass an embedding length, the generated table's vector column has no dimension constraint.

The generated table, langchain_pg_embedding:

postgres=> \d+ langchain_pg_embedding
                                 Table "app_vector.langchain_pg_embedding"
    Column     |       Type        | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
---------------+-------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 id            | character varying |           | not null |         | extended |             |              |
 collection_id | uuid              |           |          |         | plain    |             |              |
 embedding     | vector            |           |          |         | external |             |              |
 document      | character varying |           |          |         | extended |             |              |
 cmetadata     | jsonb             |           |          |         | extended |             |              |

Looking at the LangChain source code, the constructor does accept the vector length via the embedding_length parameter:

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name="t_news_2",
    embedding_length=768,
    connection=connection,
    use_jsonb=True,
)
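If you would rather not hard-code 768, the length can be read off the embedding model itself; a small sketch:

# probe the embedding model for its output dimension instead of hard-coding it
dim = len(embeddings.embed_query("dimension probe"))
print(dim)  # 768 for text2vec-base-chinese
# then pass embedding_length=dim to PGVector(...) above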

We rerun the program to regenerate the tables; vectorstore.drop_tables() drops the existing tables first:

vectorstore.drop_tables()

vectorstore.add_documents(docs, ids=[doc.metadata["id"] for doc in docs])

Verify the vector length again:

This time the index is created successfully:

postgres=> CREATE INDEX ON langchain_pg_embedding USING hnsw (embedding vector_cosine_ops);

CREATE INDEX

Check the execution plan of the SQL that LangChain generates automatically: we can see it hits the index we created earlier (Index Scan using langchain_pg_embedding_embedding_idx on langchain_pg_embedding):

explain analyze
SELECT langchain_pg_embedding.id AS langchain_pg_embedding_id,
       langchain_pg_embedding.collection_id AS langchain_pg_embedding_collection_id,
       langchain_pg_embedding.embedding AS langchain_pg_embedding_embedding,
       langchain_pg_embedding.document AS langchain_pg_embedding_document,
       langchain_pg_embedding.cmetadata AS langchain_pg_embedding_cmetadata,
       langchain_pg_embedding.embedding <=> '[-0.77009904,1.1517035,...]' AS distance
FROM langchain_pg_embedding
JOIN langchain_pg_collection ON langchain_pg_embedding.collection_id = langchain_pg_collection.uuid
WHERE langchain_pg_embedding.collection_id = 'a9112e1a-ec73-4742-9d88-806c09c525b4'
ORDER BY distance ASC
LIMIT 1;

 Limit  (cost=12.18..24.26 rows=1 width=152) (actual time=0.164..0.165 rows=1 loops=1)
   ->  Nested Loop  (cost=12.18..24.26 rows=1 width=152) (actual time=0.164..0.164 rows=1 loops=1)
         ->  Index Scan using langchain_pg_embedding_embedding_idx on langchain_pg_embedding  (cost=12.03..16.08 rows=1 width=144) (actual time=0.148..0.149 rows=1 loops=1)
               Order By: (embedding <=> '[-0.77009904,1.1517035,-0.14216383,-0.7595568,...]'::vector)
               Filter: (collection_id = 'a9112e1a-ec73-4742-9d88-806c09c525b4'::uuid)
         ->  Index Only Scan using langchain_pg_collection_pkey on langchain_pg_collection  (cost=0.15..8.17 rows=1 width=16) (actual time=0.005..0.005 rows=1 loops=1)
               Index Cond: (uuid = 'a9112e1a-ec73-4742-9d88-806c09c525b4'::uuid)
               Heap Fetches: 1
 Planning Time: 0.116 ms
 Execution Time: 0.622 ms
(10 rows)

Finally, a summary:

1. LangChain is an AI development framework that ties together LLM API calls and local embedding/vector storage, and it lets us implement RAG quickly.
2. When integrating LangChain with pgvector, remember to specify the embedding length when initializing the PGVector object; otherwise the automatically created table's vector column has no dimension, and index creation fails with: ERROR: column does not have dimensions

This article is reposted from the WeChat public account @PostgreSQL知識庫.
