For more advanced RAG optimization techniques, see the survey paper "A Survey on Retrieval-Augmented Text Generation for Large Language Models" [1].

II. Advanced RAG: Pre-Retrieval

Pre-retrieval is the stage that defines a) how the data is indexed and b) what operations are applied to the user query before it is used for retrieval. Below, I discuss several pre-retrieval optimization strategies, covering data indexing and query enhancement, with example Python code.

2.1 Data Indexing Optimization

Before anything else, we must store the data so that it can be queried later; this is called indexing. It involves choosing an appropriate chunk size, making effective use of metadata, and selecting an embedding model.

2.1.1. Sliding Window for Text Chunking

A simple way to index text is to split it into n parts, convert each part into an embedding vector, and store the vectors in a vector database. The sliding window approach creates overlapping chunks so that no contextual information is lost at chunk boundaries. The following example uses the nltk library to split text by sentence.

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # Ensure the punkt tokenizer is downloaded

def sliding_window(text, window_size=3):
    """
    Generate text chunks using a sliding window approach.

    Args:
        text (str): The input text to chunk.
        window_size (int): The number of sentences per chunk.

    Returns:
        list of str: A list of text chunks.
    """
    sentences = sent_tokenize(text)
    return [' '.join(sentences[i:i + window_size]) for i in range(len(sentences) - window_size + 1)]

# Example usage
text = "This is the first sentence. Here comes the second sentence. And here is the third one. Finally, the fourth sentence."
chunks = sliding_window(text, window_size=3)
for chunk in chunks:
    print(chunk)
    print("-----")
    # here, you can convert the chunk to an embedding vector
    # and save it to a vector database

2.1.2. Metadata Utilization

Metadata can include information such as the document creation date, the author, or relevant tags. This information can be used to filter or prioritize documents during retrieval, enhancing the search process.

The following example uses the faiss library to create a vector index, insert vectors into it, and search them while filtering on metadata (tags).

import numpy as np
import faiss

documents = [
    "Document 1 content here",
    "Content of the second document",
    "The third one has different content",
]
metadata = [
    {"date": "20230101", "tag": "news"},
    {"date": "20230102", "tag": "update"},
    {"date": "20230103", "tag": "report"},
]

# Dummy function to generate embeddings
def generate_embeddings(texts):
    """Generate dummy embeddings for the sake of example."""
    return np.random.rand(len(texts), 128).astype('float32')  # 128-dimensional embeddings

# Generate embeddings for documents
doc_embeddings = generate_embeddings(documents)

# Create a FAISS index for the embeddings (using FlatL2 for simplicity)
index = faiss.IndexFlatL2(128)  # 128 is the dimensionality of the vectors
index.add(doc_embeddings)  # Add embeddings to the index

# Example search function that uses metadata
def search(query_embedding, metadata_key, metadata_value):
    """Search the index for documents that match metadata criteria."""
    k = 2  # Number of nearest neighbors to find
    distances, indices = index.search(np.array([query_embedding]), k)  # Perform the search
    results = []
    for idx in indices[0]:
        if metadata[idx][metadata_key] == metadata_value:
            results.append((documents[idx], metadata[idx]))
    return results

# Generate a query embedding (in a real scenario, this would come from a similar process)
query_embedding = generate_embeddings(["Query content here"])[0]

# Search for documents tagged with 'update'
matching_documents = search(query_embedding, 'tag', 'update')
print(matching_documents)

2.2 Query Enhancement

In some cases, users cannot articulate their question clearly. We can then enhance the query by rewriting it entirely or by expanding it.

We can use the LLM itself for this: send the question to the LLM and ask it to rephrase it more clearly. A prompt such as the following helps.

Given the prompt: '{prompt}', generate 3 questions that are better articulated.

Once we have the new queries, we convert them into embedding vectors and use them to search the vector database.
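As a minimal sketch of this step (assuming the OpenAI Python SDK with OPENAI_API_KEY set in the environment; the model name and helper function below are illustrative, and any chat-completion LLM can be substituted), the query can be rewritten like this:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(prompt, n_questions=3):
    """Ask an LLM to rephrase a vague user prompt into better-articulated questions."""
    instruction = (
        f"Given the prompt: '{prompt}', generate {n_questions} questions "
        "that are better articulated. Return one question per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name; use whichever chat model you have access to
        messages=[{"role": "user", "content": instruction}],
    )
    # Split the model's reply into individual questions, dropping empty lines.
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

# Example usage: each rewritten question is then embedded and searched in the vector database.
rewritten_queries = rewrite_query("python index speed?")
print(rewritten_queries)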

III. Advanced RAG: Retrieval Techniques

Retrieval is the step in which the query is used to search the previously indexed database. Below, I discuss several retrieval strategies.

3.1 Hybrid Search Model

So far we have discussed searching a vector database in which embedding vectors are stored. Let us take this a step further and combine it with traditional keyword-based search. This approach lets the retrieval system handle a wide range of query types, from queries that need exact keyword matches to more complex queries that require contextual understanding.

Let's build a hybrid search model. We will use Elasticsearch as the traditional keyword search engine and faiss as the vector database for semantic search.

3.1.1. Creating the Elasticsearch Index

Assume that all documents live in a documents list and that their embedding vectors have already been computed and stored alongside them. The following code block connects to Elasticsearch 8.13.4 and creates an index for the sample documents.

ES_NODES = "http://localhost:9200"

documents = [
    {"id": 1, "text": "How to start with Python programming.", "vector": [0.1, 0.2, 0.3]},
    {"id": 2, "text": "Advanced Python programming tips.", "vector": [0.1, 0.3, 0.4]},
    # More documents...
]

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts=ES_NODES)
for doc in documents:
    es.index(index="documents", id=doc['id'], document={"text": doc['text']})

3.1.2. Creating the Faiss Index

In this part, we use faiss as the vector database and index the vectors.

import numpy as np
import faiss

dimension = 3  # Assuming 3D vectors for simplicity
faiss_index = faiss.IndexFlatL2(dimension)
# Faiss expects float32 vectors, so cast explicitly before adding them to the index
vectors = np.array([doc['vector'] for doc in documents]).astype('float32')
faiss_index.add(vectors)

3.1.3. Hybrid Search

The following code combines Elasticsearch keyword search with faiss vector-based semantic matching into a single hybrid search.

def hybrid_search(query_text, query_vector, alpha=0.5):
    # Perform a keyword search using Elasticsearch on the "documents" index, matching the provided query_text.
    response = es.search(index="documents", query={"match": {"text": query_text}})
    # Extract the document IDs and their corresponding scores from the Elasticsearch response.
    keyword_results = {hit['_id']: hit['_score'] for hit in response['hits']['hits']}

    # Prepare the query vector for vector search: reshape and cast to float32 for compatibility with Faiss.
    query_vector = np.array(query_vector).reshape(1, -1).astype('float32')
    # Perform a vector search with Faiss, retrieving indices of the top 5 closest documents.
    _, indices = faiss_index.search(query_vector, 5)
    # Create a dictionary of vector results with scores inversely proportional to their rank (higher rank, higher score).
    # Faiss pads missing results with -1 when fewer documents exist than requested, so those entries are skipped.
    vector_results = {str(documents[idx]['id']): 1 / (rank + 1)
                      for rank, idx in enumerate(indices[0]) if idx != -1}

    # Initialize a dictionary to hold combined scores from keyword and vector search results.
    combined_scores = {}
    # Iterate over the union of document IDs from both keyword and vector results.
    for doc_id in set(keyword_results.keys()).union(vector_results.keys()):
        # Calculate the combined score for each document using the alpha parameter to balance both search results.
        combined_scores[doc_id] = alpha * keyword_results.get(doc_id, 0) + (1 - alpha) * vector_results.get(doc_id, 0)

    # Return the dictionary containing combined scores for all relevant documents.
    return combined_scores

# Example usage
query_text = "Python programming"
query_vector = [0.1, 0.25, 0.35]
# Execute the hybrid search function with the specified query text and vector.
results = hybrid_search(query_text, query_vector)
# Print the results of the hybrid search to see the combined scores of documents.
print(results)

The hybrid_search function first performs a keyword search with Elasticsearch. It then runs a vector search with Faiss, which returns the indices of the five closest documents; these indices are used to assign each document a reciprocal-rank score (the closest document gets the highest score).

Once we have the results from both Elasticsearch and Faiss, we combine the two sets of scores. The final score of each document is a weighted average controlled by the parameter alpha; with alpha = 0.5, both results receive equal weight.
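A quick worked example of that formula: if a document receives an Elasticsearch keyword score of 3.2 and a vector rank score of 0.5 (it was the second-closest vector match, so 1/(1+1) = 0.5), then with alpha = 0.5 its combined score is 0.5 * 3.2 + (1 - 0.5) * 0.5 = 1.85. Note that the two score scales are not normalized in this sketch, so in practice you may want to rescale them before mixing.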

The complete code is available at [2].

3.2 Fine-Tuning the Embedding Model

Fine-tuning the embedding model is an effective way to improve the performance of a retrieval-augmented generation system. Fine-tuning a pre-trained model helps it capture the nuances of a specific domain or dataset, which can significantly improve the relevance and accuracy of the retrieved documents.

The process breaks down into three steps, covered in the following subsections: preparing the fine-tuning data, running the fine-tuning, and using the fine-tuned model to generate embeddings.

3.2.1 Preparing the Fine-Tuning Data

The following code block is the first step of fine-tuning. It initializes the pipeline for fine-tuning a pre-trained masked language model: it loads the model and tokenizer and places them on the available device (GPU or CPU).

After initialization, it processes a sample dataset with tokenization and dynamic token masking. This prepares the model for self-supervised learning, in which it predicts masked tokens and thereby strengthens its semantic understanding of the input data.

from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling
from datasets import Dataset
import torch

# Define the model name using a pre-trained model from the Sentence Transformers library
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Load the tokenizer for the specified model from Hugging Face's transformers library
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the model for masked language modeling based on the specified model
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Determine if a GPU is available and set the device accordingly; use CPU if GPU is not available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the appropriate device (GPU or CPU)
model.to(device)

# Define a generator function to create a dataset; this should be replaced with actual data loading logic
def dataset_generator():
    # Example dataset composed of individual sentences; replace with your actual dataset sentences
    dataset = ["sentence1", "sentence2", "sentence3"]
    # Yield each sentence as a dictionary with the key 'text'
    for sentence in dataset:
        yield {"text": sentence}

# Create a dataset object using Hugging Face's Dataset class from the generator function
dataset = Dataset.from_generator(dataset_generator)

# Define a function to tokenize the text data
def tokenize_function(example):
    # Tokenize the input text and truncate it to the maximum length the model can handle
    return tokenizer(example["text"], truncation=True)

# Apply the tokenization function to all items in the dataset, batch processing them for efficiency
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Initialize a data collator for masked language modeling which randomly masks tokens
# This is used for training the model in a self-supervised manner
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

3.2.2 Fine-Tuning the Model

Once the data is ready, we can start the fine-tuning phase, in which we take the model's existing weights and begin updating them.

The following code block sets up and runs training with Hugging Face's Trainer API. It first defines the training arguments (number of epochs, batch size, learning rate, and so on). The Trainer object then uses these settings together with the pre-loaded model, the tokenized dataset, and the data collator for masked language modeling (all created in the previous step). After training finishes, the updated model and its tokenizer are saved for use in the next step.

from transformers import Trainer, TrainingArguments

# Define training arguments to configure the training session
training_args = TrainingArguments(
    output_dir="output",             # Directory where the outputs (like checkpoints) will be saved
    num_train_epochs=3,              # Total number of training epochs to perform
    per_device_train_batch_size=16,  # Batch size per device during training
    learning_rate=2e-5,              # Learning rate for the optimizer
)

# Initialize the Trainer, which handles the training loop and evaluation
trainer = Trainer(
    model=model,                       # The model to be trained, already loaded and configured
    args=training_args,                # The training arguments defining the training setup
    train_dataset=tokenized_datasets,  # The dataset to train on, already tokenized and prepared
    data_collator=data_collator,       # The data collator that handles input formatting and masking
)

# Start the training process
trainer.train()

# Define the paths where the fine-tuned model and tokenizer will be saved
model_path = "./model"
tokenizer_path = "./tokenizer"

# Save the fine-tuned model to the specified path
model.save_pretrained(model_path)

# Save the tokenizer used in training to the specified path
tokenizer.save_pretrained(tokenizer_path)

3.2.3 Using the Fine-Tuned Model

Now it is time to use the saved model and tokenizer to generate embedding vectors. The following code block serves this purpose.

It loads the model and tokenizer from the saved paths onto the GPU or CPU, then tokenizes the sentences (in the context of this article, the queries). The model processes these inputs without updating its parameters, i.e. in inference mode, using with torch.no_grad(). We are not using the model to predict the next token; instead, we extract embedding vectors from the model's hidden states. As a final step, the embeddings are moved back to the CPU.

# Load the tokenizer and model from saved paths, ensuring the model is allocated to the appropriate device (GPU or CPU)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
model = AutoModelForMaskedLM.from_pretrained(model_path).to(device)

# Define a function to tokenize input sentences, configuring padding and truncation to handle variable sentence lengths
def tokenize_function_embedding(example):
    return tokenizer(example["text"], padding=True, truncation=True)

# List of example sentences to generate embeddings for
sentences = ["This is the first sentence.", "This is the second sentence."]

# Create a Dataset object directly from these sentences
dataset_embedding = Dataset.from_dict({"text": sentences})

# Apply the tokenization function to the dataset, preparing it for embedding generation
tokenized_dataset_embedding = dataset_embedding.map(tokenize_function_embedding, batched=True, batch_size=None)

# Extract 'input_ids' and 'attention_mask' needed for the model to understand which parts of the input are padding and which are actual content
input_ids = tokenized_dataset_embedding["input_ids"]
attention_mask = tokenized_dataset_embedding["attention_mask"]

# Convert these lists into tensors and ensure they are on the correct device (GPU or CPU) for processing
input_ids = torch.tensor(input_ids).to(device)
attention_mask = torch.tensor(attention_mask).to(device)

# Generate embeddings using the model without updating gradients to save computational resources
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, output_hidden_states=True)
    # Extract the last layer's hidden states as embeddings, specifically the first token
    # (typically used in BERT-type models for representing sentence embeddings)
    embeddings = outputs.hidden_states[-1][:, 0, :]

# Move the embeddings from the GPU back to CPU for easy manipulation or saving
embeddings = embeddings.cpu()

# Print each sentence with its corresponding embedding vector
for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}\n")

IV. Advanced RAG: Post-Retrieval Processing

After the relevant information has been retrieved, it still needs to be fed to the LLM in the right form and order. The next two subsections explain how summarization and re-ranking can improve RAG quality.
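As a minimal sketch of that hand-off (the helper name and prompt wording here are illustrative, not from the original), the re-ranked chunks can simply be concatenated, in order, into the context section of the final prompt:

def build_prompt(question, ranked_chunks, max_chunks=5):
    """Assemble the final LLM prompt from the question and the top-ranked retrieved chunks."""
    # Keep only the highest-ranked chunks and preserve their order.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(ranked_chunks[:max_chunks]))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Example usage with chunks that have already been re-ranked (and optionally summarized).
prompt = build_prompt(
    "What is hybrid search?",
    ["Hybrid search combines keyword and vector retrieval.",
     "Re-ranking orders documents by relevance."],
)
print(prompt)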

4.1 Summarizing the Response

This step may be necessary if large chunks of text were stored in the database during indexing. If the stored texts are already short, it may not be needed.

The following code block uses the transformers library to summarize text with a pre-trained BART model. The summarize_text function takes a text and uses the model to produce a concise summary within the defined maximum and minimum length parameters.

from transformers import pipeline

def summarize_text(text, max_length=130):
    # Load a pre-trained summarization model from Hugging Face's model hub.
    # 'facebook/bart-large-cnn' is chosen for its proficiency in generating concise summaries.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    # The summarizer uses the BART model to condense the input text into a summary.
    # 'max_length' specifies the maximum length of the summary output.
    # 'min_length' sets the minimum length to ensure the summary is not too terse.
    # 'do_sample' is set to False to use a deterministic approach for summary generation.
    summary = summarizer(text, max_length=max_length, min_length=30, do_sample=False)
    # The output from the summarizer is a list of dictionaries.
    # We extract the summary text from the first dictionary in the list.
    return summary[0]['summary_text']

# Example text to be summarized.
# This text discusses the importance of summarization in retrieval-augmented generation systems.
long_text = "Summarization is a vital step in the workflow of retrieval-augmented generation systems. It ensures the output is not only accurate but also concise and digestible. These techniques are essential, especially in domains where the accuracy and precision of information are crucial."

# Call the summarize_text function to compress the example text.
summarized_text = summarize_text(long_text)

# Print the summarized text to see the output of the summarization model.
print("Summarized Text:", summarized_text)

The complete code is available at [3].

4.2 Re-Ranking and Filtering

During retrieval you already obtain a "score" for each document, which is in fact the similarity between its vector and the query vector. This information can be used to re-rank the documents and to filter the results against a given threshold. The following code blocks show how.

4.2.1. Basic Re-Ranking and Filtering

The code block below defines a list of documents, each represented by a dictionary containing an ID, text, and a relevance score. It then implements two functions: re_rank_documents sorts the documents by relevance score in descending order, and filter_documents then drops any document whose relevance score falls below the threshold of 0.75.

# Define a list of documents. Each document is represented as a dictionary with an ID, text, and a relevance score.
documents = [
    {"id": 1, "text": "Advanced RAG systems use sophisticated techniques for text summarization.", "relevance_score": 0.82},
    {"id": 2, "text": "Basic RAG systems primarily focus on retrieval and basic processing.", "relevance_score": 0.55},
    {"id": 3, "text": "Re-ranking improves the quality of responses by ordering documents by relevance.", "relevance_score": 0.89},
]

# Define a function to re-rank documents based on their relevance scores.
def re_rank_documents(docs):
    # Use the sorted function to order the documents by 'relevance_score'.
    # The key for sorting is specified using a lambda function, which extracts the relevance score from each document.
    # 'reverse=True' sorts the list in descending order, placing documents with higher relevance scores first.
    return sorted(docs, key=lambda x: x['relevance_score'], reverse=True)

# Re-rank the documents using the defined function and print the result.
ranked_documents = re_rank_documents(documents)
print("Re-ranked Documents:", ranked_documents)

# Define a function to filter documents based on a relevance score threshold.
def filter_documents(docs, relevance_threshold=0.75):
    # Use a list comprehension to create a new list that includes only those documents whose 'relevance_score'
    # is greater than or equal to the 'relevance_threshold'.
    return [doc for doc in docs if doc['relevance_score'] >= relevance_threshold]

# Filter the re-ranked documents using the defined function with a threshold of 0.75 and print the result.
filtered_documents = filter_documents(ranked_documents)
print("Filtered Documents:", filtered_documents)

4.2.2. Advanced Re-Ranking with a Machine Learning Model

For a more sophisticated approach, a machine learning model can re-rank the documents. The challenge here is: how do we know which documents are relevant, so that we can train a model to rank them?

We assume there is a system that logs the interactions between users and the system, recording whether a document was relevant for a given query. Once we have such a dataset, we can use the query embedding and the document embedding to predict a relevance score.

# assuming the data is stored in the following format in a database:
# query_text | response_text | user_clicked

query_embeddings = get_embedding_vector(database.query_text)
response_embeddings = get_embedding_vector(database.response_text)

# create the dataset
X = concat(query_embeddings, response_embeddings)
y = database.user_clicked

model = model.train(X, y)
model.predict_proba(...)

The pseudocode above outlines how machine learning can be used to re-rank documents by relevance, specifically by predicting how likely a user is to find a document relevant based on past interactions. A concrete sketch of this flow follows below.
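The following is a minimal, self-contained sketch of that flow, assuming a logistic-regression ranker from scikit-learn and random vectors standing in for real query/response embeddings and click labels; all names here are illustrative, not from the original.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, dim = 200, 128

# Stand-ins for embeddings of logged (query, response) pairs and the observed click labels.
query_embeddings = rng.random((n_pairs, dim), dtype=np.float32)
response_embeddings = rng.random((n_pairs, dim), dtype=np.float32)
user_clicked = rng.integers(0, 2, size=n_pairs)

# Create the dataset: concatenate query and response embeddings into one feature vector per pair.
X = np.concatenate([query_embeddings, response_embeddings], axis=1)
y = user_clicked

# Train a simple relevance model on the click data.
ranker = LogisticRegression(max_iter=1000)
ranker.fit(X, y)

# At query time, score each candidate document by its predicted click probability
# and sort the candidates by that score to re-rank them.
new_query = rng.random((1, dim), dtype=np.float32)
candidate_docs = rng.random((5, dim), dtype=np.float32)
features = np.concatenate([np.repeat(new_query, len(candidate_docs), axis=0), candidate_docs], axis=1)
relevance_scores = ranker.predict_proba(features)[:, 1]
ranking = np.argsort(-relevance_scores)
print("Candidate order after re-ranking:", ranking)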

V. Conclusion

A simple retrieval-augmented generation (RAG) system may already solve your problem, but adding enhancements will improve your results and help your system produce more precise answers. In this article we discussed several such enhancements: data indexing optimization, query enhancement, hybrid search, fine-tuning of the embedding model, response summarization, and re-ranking and filtering.

By integrating these enhancements you have the opportunity to improve performance significantly. Keep exploring and applying these methods, experiment, and see which works best for your needs.

References:

[1] https://arxiv.org/pdf/2404.10981

[2] https://github.com/ndemir/machine-learning-projects/tree/main/hybrid-search

[3] https://github.com/ndemir/machine-learning-projects/tree/main/fine-tuning-embedding-model

This article is reproduced from the WeChat public account @ArronAI.
