欧美成人三级在线观看,老子影院午夜伦不卡,h在线免费观看

(a)利用多模態LLM（如GPT-4V[1]）來識別表格并從每個PDF頁面中提取信息

輸入：圖像格式的PDF頁面

輸出：JSON或其他格式的表。如果多模態LLM無法提取表格數據，則應總結圖像并返回摘要。

（b）利用專業的表格檢測模型（如Table Transformer[2]），來辨別表格結構

輸入：PDF頁面作為圖像

輸出：表格作為圖像

（c）使用開源框架（如unstructured[3]或者目標檢測模型[4]。這些框架可以對整個文檔進行全面的解析，并從解析的結果中提取與表相關的內容

輸入：PDF或圖像格式的文檔

輸出：純文本或HTML格式的表，從整個文檔的解析結果中獲得

（d）使用Nougat[5]、Donut[6]等端到端模型來解析整個文檔并提取與表相關的內容。這種方法不需要OCR模型

輸入：PDF或圖像格式的文檔

輸出：LaTeX或JSON格式的表，從整個文檔的解析結果中獲得

值得一提的是，無論使用何種方法提取表格信息，都應包括表格標題。這是因為在大多數情況下，表格標題是文檔或論文作者對表格的簡短描述，可以在很大程度上概括整個表格。

在上述四種方法中，方法（d）允許容易地檢索表標題，這將在下面的實驗中進一步解釋。

1.2 索引結構

根據索引的結構，大致可以分為以下幾類：

（e）僅使用圖像格式的索引表；

（f）僅使用純文本或JSON格式的索引表；

（g）僅使用LaTeX格式的索引表；

（h）僅為表的摘要編制索引；

（i）“從小到大”或“文檔摘要索引”結構，如圖2所示：

小塊的內容可以是來自表的每一行的信息或表的摘要。
大塊的內容可以是圖像格式、純文本格式或LaTeX格式的表。

如上所述，表格摘要通常使用LLM生成：

輸入：圖像格式、文本格式或LaTeX格式的表格

輸出：表格摘要

1.3 不需要表解析、索引或RAG的算法

下面介紹一些不需要表解析的算法：

（j）將相關圖像（PDF頁面）和用戶查詢發送到VQA模型（如DAN等）或多模態LLM，并返回答案；

要索引的內容：圖像格式的文檔

發送到VQA模型或多模態LLM的內容：查詢+圖像形式的相應頁面

（k）將相關文本格式的PDF頁面和用戶的查詢發送到LLM，然后返回答案；

要索引的內容：文本格式的文檔

發送到LLM的內容：查詢+文本格式的相應頁面

（l）將相關圖像（PDF頁面）、文本塊和用戶的查詢發送到多模態LLM（如GPT-4V等），并直接返回答案；

要索引的內容：圖像格式的文檔和文本格式的文檔塊

發送到多模態LLM的內容：查詢+文檔的相應圖像形式+相應的文本塊

以下是一些不需要索引的方法，如圖3和圖4所示：

（m）首先，應用類別（a）到（d）中的一種方法，將文檔中的所有表格解析為圖像形式，然后直接將所有表格圖像和用戶的查詢發送到多模態LLM（如GPT-4V等）并返回答案。

要索引的內容：無

發送到多模態LLM的內容：查詢+所有解析的表（圖像格式）

（n）使用（m）提取的圖像格式的表格，然后使用OCR模型識別表格中的所有文本，然后直接將表格中的全部文本和用戶的查詢發送到LLM并直接返回答案。

要索引的內容：無

發送到LLM的內容：用戶查詢+所有表內容（文本格式）

值得注意的是，有些方法不依賴于RAG過程：

? ? ? ?第一種方法不使用LLM，在特定的數據集上進行訓練，并使模型（如類似BERT的transformer）能夠更好地支持表理解任務，如TAPAS[7]。

第二種方法使用LLM，采用預訓練、微調方法或提示，使LLM能夠執行表理解任務，如GPT4Table[8]。

二、現有的開源解決方案

? ? ? ?上一節總結并對RAG中表格關鍵技術進行了分類。在提出本文實現的解決方案之前，讓我們探索一些開源解決方案。

LlamaIndex支持四種方法[9]，前三種都是使用多模態模型：

檢索相關圖像（PDF頁面）并將其發送到GPT-4V以響應查詢；
將每個PDF頁面視為一個圖像，讓GPT-4V對每個頁面進行圖像推理，為圖像推理構建文本矢量存儲索引，根據圖像推理矢量存儲查詢答案；
使用Table Transformer從檢索到的圖像中裁剪表信息，然后將這些裁剪的圖像發送到GPT-4V以進行查詢響應；
對裁剪的表圖像應用OCR，并將數據發送到GPT4/GPT-3.5以回答查詢。

根據本文提出方法進行分類總結：

第一種方法類似于本文中的類別（j），不需要表格解析。然而，結果表明，即使答案在圖像中，也無法產生正確的答案；
第二種方法涉及表格解析，對應于類別（a）。索引內容是基于GPT-4V返回的結果的表內容或摘要，其可以對應于類別（f）或（h）。這種方法的缺點是GPT-4V識別表格并從圖像中提取其內容的能力是不穩定的，特別是當圖像包括表格、文本和其他圖像的混合時，這在PDF格式中很常見；
第三種方法，類似于類別（m），不需要索引；
第四種方法類似于類別（n），也不需要索引。其結果表明，由于無法從圖像中提取表格信息，因此會產生不正確的答案。

通過試驗發現，第三種方法的綜合效果最好。然而，根據我的測試，第三種方法很難檢測表格，更不用說正確地將表標題與表合并了。

Langchain還提出了一些解決方案，半結構化RAG[10]的關鍵技術包括：

表格解析使用unstructured，即類別（c）；
索引方法是文檔摘要索引，對應類別（i），小塊內容：表摘要，大塊內容：原始表內容（文本格式）。

如圖5所示：

半結構化和多模態RAG[11]提出了三種解決方案，其架構如圖6所示。

可選方案1：類似于本文的類別（l）。它包括使用多模態嵌入（如CLIP）來嵌入圖像和文本，使用相似性搜索進行檢索，并將原始圖像和塊傳遞給多模式LLM進行答案合成。

? ? ? ?可選方案2：利用多模態LLM，如GPT-4V、LLaVA或FUYU-8b，從圖像中生成文本摘要。然后，嵌入和檢索文本，并將文本塊傳遞給LLM進行答案合成。

表解析使用非結構化，即類別（d）；
索引結構為文檔摘要索引（catogery（i）），小塊內容：表摘要，大塊內容：文本格式的表。

可選方案3：使用多模態LLM（如GPT-4V、LLaVA或FUYU-8b）從圖像中生成文本摘要，然后嵌入并檢索參考原始圖像的圖像摘要（catogery（i）），然后將原始圖像和文本塊傳遞給多模態LLM。

三、建議的解決方案

本文對關鍵技術和現有解決方案進行了總結、分類和討論。基于此，我們提出了以下解決方案，如圖7所示。為了簡單起見，省略了一些RAG模塊，如重新排序和查詢重寫。

表格解析：使用Nougat(catogery(d))。根據我的測試，它的表格檢測比unstructured的更有效(catogery(c))。此外，Nougat可以很好地提取表格標題，非常方便與表格關聯；
文檔摘要索引結構(catogery(i))：小塊的內容包括表格摘要，大塊的內容包括LaTeX格式的相應表格和文本格式的表格標題。我們使用多向量檢索器[12]來實現它；
表格摘要獲取方法：將表格和表格標題發送給LLM進行摘要。

這種方法的優點是，它可以有效地解析表，同時綜合考慮表摘要和表之間的關系，而且還不需要多模型LLM的需求，從而節省了成本。

3.1 Nougat的原理

Nougat[13]是在Donut[14]的基礎上開發的。它通過網絡隱式識別文本，不需要任何與OCR相關的輸入或模塊，如圖8所示：

Nougat不僅可以解析表格數據，而且還可以解析公式，也可以方便地關聯表標題，如圖9所示：

Nougat是一個缺乏中間結果的端到端模型，它可能嚴重依賴于其訓練數據。

根據格式化訓練數據的代碼[15]，對于表格，緊接在\end｛table｝后面的行是caption_parts，這似乎與所提供的訓練數據的格式一致：

def format_element( element: Element, keep_refs: bool = False, latex_env: bool = False) -> List[str]: """ Formats a given Element into a list of formatted strings.
 Args: element (Element): The element to be formatted. keep_refs (bool, optional): Whether to keep references in the formatting. Default is False. latex_env (bool, optional): Whether to use LaTeX environment formatting. Default is False.
 Returns: List[str]: A list of formatted strings representing the formatted element. """ ... ... if isinstance(element, Table): parts = [ "[TABLE%s]\n\\begin{table}\n" % (str(uuid4())[:5] if element.id is None else ":" + str(element.id)) ] parts.extend(format_children(element, keep_refs, latex_env)) caption_parts = format_element(element.caption, keep_refs, latex_env) remove_trailing_whitespace(caption_parts) parts.append("\\end{table}\n") if len(caption_parts) > 0: parts.extend(caption_parts + ["\n"]) parts.append("[ENDTABLE]\n\n") return parts ... ...

3.2 Nougat的優點和缺點

優勢：

Nougat可以將以前解析工具（如公式和表格）具有挑戰性的部分準確解析為LaTeX源代碼；
Nougat的解析結果是一個類似markdown的半結構化文檔；
輕松獲取表格標題，并方便地將其與表格關聯。

缺點：

Nougat的解析速度較慢，這可能對大規模部署構成挑戰；
由于Nougat是在科學論文上訓練的，它擅長于類似結構的文件。對于非拉丁文本文檔，其性能會下降；
Nougat模型一次只在一篇科學論文的一頁上訓練，缺乏對其他頁面的了解。這可能會導致解析的內容出現一些不一致。因此，如果識別效果不好，可以考慮將PDF分割成單獨的頁面，并逐一解析；
在兩列論文中分析表格不如在單列論文中有效。

3.3 代碼實現

首先，安裝相關的Python包

pip install langchainpip install chromadbpip install nougat-ocr

安裝完成后，我們可以檢查Python包的版本：

langchain 0.1.12langchain-community 0.0.28langchain-core 0.1.31langchain-openai 0.0.8langchain-text-splitters 0.0.1
chroma-hnswlib 0.7.3chromadb 0.4.24
nougat-ocr 0.1.17

設置環境并導入：

import osos.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"
import subprocessimport uuid
from langchain_core.output_parsers import StrOutputParserfrom langchain_core.prompts import ChatPromptTemplatefrom langchain_openai import ChatOpenAIfrom langchain.retrievers.multi_vector import MultiVectorRetrieverfrom langchain.storage import InMemoryStorefrom langchain_community.vectorstores import Chromafrom langchain_core.documents import Documentfrom langchain_openai import OpenAIEmbeddingsfrom langchain_core.runnables import RunnablePassthrough

下載論文《Attention Is All You Need》[16]到YOUR_PDF_PATH，使用Nougat解析PDF文件，從解析結果中獲得latex格式的表格和文本格式的表格標題。第一次執行將下載必要的模型文件。

def june_run_nougat(file_path, output_dir): # Run Nougat and store results as Mathpix Markdown cmd = ["nougat", file_path, "-o", output_dir, "-m", "0.1.0-base", "--no-skipping"] res = subprocess.run(cmd) if res.returncode != 0: print("Error when running nougat.") return res.returncode else: print("Operation Completed!") return 0
def june_get_tables_from_mmd(mmd_path): f = open(mmd_path) lines = f.readlines() res = [] tmp = [] flag = "" for line in lines: if line == "\\begin{table}\n": flag = "BEGINTABLE" elif line == "\\end{table}\n": flag = "ENDTABLE" if flag == "BEGINTABLE": tmp.append(line) elif flag == "ENDTABLE": tmp.append(line) flag = "CAPTION" elif flag == "CAPTION": tmp.append(line) flag = "MARKDOWN" print('-' * 100) print(''.join(tmp)) res.append(''.join(tmp)) tmp = []
 return res
file_path = "YOUR_PDF_PATH"output_dir = "YOUR_OUTPUT_DIR_PATH"
if june_run_nougat(file_path, output_dir) == 1: import sys sys.exit(1)
mmd_path = output_dir + '/' + os.path.splitext(file_path)[0].split('/')[-1] + ".mmd" tables = june_get_tables_from_mmd(mmd_path)

函數june_get_tables_from_mmd用于從圖10所示的mmd文件中提取從t\begin{table}到\end{table}的所有內容，包括\end{table}后面的行。

值得注意的是，沒有發現任何官方文件規定表格標題必須放在表格下方，或者表格應以\ begin｛table｝開頭，以\ end｛table}結尾。因此，june_get_tables_from_md是啟發式的。

以下是解析PDF中表格的結果：

Operation Completed!----------------------------------------------------------------------------------------------------\begin{table}\begin{tabular}{l c c c} \hline \hline Layer Type & Complexity per Layer & Sequential Operations & Maximum Path Length \\ \hline Self-Attention & \(O(n^{2}\cdot d)\) & \(O(1)\) & \(O(1)\) \\ Recurrent & \(O(n\cdot d^{2})\) & \(O(n)\) & \(O(n)\) \\ Convolutional & \(O(k\cdot n\cdot d^{2})\) & \(O(1)\) & \(O(log_{k}(n))\) \\ Self-Attention (restricted) & \(O(r\cdot n\cdot d)\) & \(O(1)\) & \(O(n/r)\) \\ \hline \hline \end{tabular}\end{table}Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations for different layer types. \(n\) is the sequence length, \(d\) is the representation dimension, \(k\) is the kernel size of convolutions and \(r\) the size of the neighborhood in restricted self-attention.
----------------------------------------------------------------------------------------------------\begin{table}\begin{tabular}{l c c c c} \hline \hline \multirow{2}{*}{Model} & \multicolumn{2}{c}{BLEU} & \multicolumn{2}{c}{Training Cost (FLOPs)} \\ \cline{2-5} & EN-DE & EN-FR & EN-DE & EN-FR \\ \hline ByteNet [18] & 23.75 & & & \\ Deep-Att + PosUnk [39] & & 39.2 & & \(1.0\cdot 10^{20}\) \\ GNMT + RL [38] & 24.6 & 39.92 & \(2.3\cdot 10^{19}\) & \(1.4\cdot 10^{20}\) \\ ConvS2S [9] & 25.16 & 40.46 & \(9.6\cdot 10^{18}\) & \(1.5\cdot 10^{20}\) \\ MoE [32] & 26.03 & 40.56 & \(2.0\cdot 10^{19}\) & \(1.2\cdot 10^{20}\) \\ \hline Deep-Att + PosUnk Ensemble [39] & & 40.4 & & \(8.0\cdot 10^{20}\) \\ GNMT + RL Ensemble [38] & 26.30 & 41.16 & \(1.8\cdot 10^{20}\) & \(1.1\cdot 10^{21}\) \\ ConvS2S Ensemble [9] & 26.36 & **41.29** & \(7.7\cdot 10^{19}\) & \(1.2\cdot 10^{21}\) \\ \hline Transformer (base model) & 27.3 & 38.1 & & \(\mathbf{3.3\cdot 10^{18}}\) \\ Transformer (big) & **28.4** & **41.8** & & \(2.3\cdot 10^{19}\) \\ \hline \hline \end{tabular}\end{table}Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.
----------------------------------------------------------------------------------------------------\begin{table}\begin{tabular}{c|c c c c c c c c|c c c c} \hline \hline & \(N\) & \(d_{\text{model}}\) & \(d_{\text{ff}}\) & \(h\) & \(d_{k}\) & \(d_{v}\) & \(P_{drop}\) & \(\epsilon_{ls}\) & train steps & PPL & BLEU & params \\ \hline base & 6 & 512 & 2048 & 8 & 64 & 64 & 0.1 & 0.1 & 100K & 4.92 & 25.8 & 65 \\ \hline \multirow{4}{*}{(A)} & \multicolumn{1}{c}{} & & 1 & 512 & 512 & & & & 5.29 & 24.9 & \\ & & & & 4 & 128 & 128 & & & & 5.00 & 25.5 & \\ & & & & 16 & 32 & 32 & & & & 4.91 & 25.8 & \\ & & & & 32 & 16 & 16 & & & & 5.01 & 25.4 & \\ \hline (B) & \multicolumn{1}{c}{} & & \multicolumn{1}{c}{} & & 16 & & & & & 5.16 & 25.1 & 58 \\ & & & & & 32 & & & & & 5.01 & 25.4 & 60 \\ \hline \multirow{4}{*}{(C)} & 2 & \multicolumn{1}{c}{} & & & & & & & & 6.11 & 23.7 & 36 \\ & 4 & & & & & & & & 5.19 & 25.3 & 50 \\ & 8 & & & & & & & & 4.88 & 25.5 & 80 \\ & & 256 & & 32 & 32 & & & & 5.75 & 24.5 & 28 \\ & 1024 & & 128 & 128 & & & & 4.66 & 26.0 & 168 \\ & & 1024 & & & & & & 5.12 & 25.4 & 53 \\ & & 4096 & & & & & & 4.75 & 26.2 & 90 \\ \hline \multirow{4}{*}{(D)} & \multicolumn{1}{c}{} & & & & & 0.0 & & 5.77 & 24.6 & \\ & & & & & & 0.2 & & 4.95 & 25.5 & \\ & & & & & & & 0.0 & 4.67 & 25.3 & \\ & & & & & & & 0.2 & 5.47 & 25.7 & \\ \hline (E) & \multicolumn{1}{c}{} & \multicolumn{1}{c}{} & & \multicolumn{1}{c}{} & & & & & 4.92 & 25.7 & \\ \hline big & 6 & 1024 & 4096 & 16 & & 0.3 & 300K & **4.33** & **26.4** & 213 \\ \hline \hline \end{tabular}\end{table}Table 3: Variations on the Transformer architecture. Unlisted values are identical to those of the base model. All metrics are on the English-to-German translation development set, newstest2013. Listed perplexities are per-wordpiece, according to our byte-pair encoding, and should not be compared to per-word perplexities.
----------------------------------------------------------------------------------------------------\begin{table}\begin{tabular}{c|c|c} \hline**Parser** & **Training** & **WSJ 23 F1** \\ \hline Vinyals \& Kaiser et al. (2014) [37] & WSJ only, discriminative & 88.3 \\ Petrov et al. (2006) [29] & WSJ only, discriminative & 90.4 \\ Zhu et al. (2013) [40] & WSJ only, discriminative & 90.4 \\ Dyer et al. (2016) [8] & WSJ only, discriminative & 91.7 \\ \hline Transformer (4 layers) & WSJ only, discriminative & 91.3 \\ \hline Zhu et al. (2013) [40] & semi-supervised & 91.3 \\ Huang \& Harper (2009) [14] & semi-supervised & 91.3 \\ McClosky et al. (2006) [26] & semi-supervised & 92.1 \\ Vinyals \& Kaiser el al. (2014) [37] & semi-supervised & 92.1 \\ \hline Transformer (4 layers) & semi-supervised & 92.7 \\ \hline Luong et al. (2015) [23] & multi-task & 93.0 \\ Dyer et al. (2016) [8] & generative & 93.3 \\ \hline \end{tabular}\end{table}Table 4: The Transformer generalizes well to English constituency parsing (Results are on Section 23 of WSJ)* [5] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. _CoRR_, abs/1406.1078, 2014.

然后使用LLM對表格進行總結：

# Promptprompt_text = """You are an assistant tasked with summarizing tables and text. \ Give a concise summary of the table or text. The table is formatted in LaTeX, and its caption is in plain text format: {element} """prompt = ChatPromptTemplate.from_template(prompt_text)
# Summary chainmodel = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()# Get table summariestable_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})print(table_summaries)

以下是在Attention Is All You Need[16]中找到的四個表的摘要，如圖11所示：

使用Multi-Vector Retriever[12]構建文檔摘要索引結構。

# The vectorstore to use to index the child chunksvectorstore = Chroma(collection_name = "summaries", embedding_function = OpenAIEmbeddings())
# The storage layer for the parent documentsstore = InMemoryStore()id_key = "doc_id"
# The retriever (empty to start)retriever = MultiVectorRetriever( vectorstore = vectorstore, docstore = store, id_key = id_key, search_kwargs={"k": 1} # Solving Number of requested results 4 is greater than number of elements in index..., updating n_results = 1)
# Add tablestable_ids = [str(uuid.uuid4()) for _ in tables]summary_tables = [ Document(page_content = s, metadata = {id_key: table_ids[i]}) for i, s in enumerate(table_summaries)]retriever.vectorstore.add_documents(summary_tables)retriever.docstore.mset(list(zip(table_ids, tables)))

一切就緒，構建一個簡單的RAG管道，并執行查詢：

# Prompt templatetemplate = """Answer the question based only on the following context, which can include text and tables, there is a table in LaTeX format and a table caption in plain text format:{context}Question: {question}"""prompt = ChatPromptTemplate.from_template(template)
# LLMmodel = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")

# Simple RAG pipelinechain = ( {"context": retriever, "question": RunnablePassthrough()} | prompt | model | StrOutputParser())

print(chain.invoke("when layer type is Self-Attention, what is the Complexity per Layer?")) # Query about table 1
print(chain.invoke("Which parser performs worst for BLEU EN-DE")) # Query about table 2
print(chain.invoke("Which parser performs best for WSJ 23 F1")) # Query about table 4

執行結果如下，表明幾個問題得到了準確的回答，如圖12所示：

整體代碼如下：

import osos.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"
import subprocessimport uuid
from langchain_core.output_parsers import StrOutputParserfrom langchain_core.prompts import ChatPromptTemplatefrom langchain_openai import ChatOpenAIfrom langchain.retrievers.multi_vector import MultiVectorRetrieverfrom langchain.storage import InMemoryStorefrom langchain_community.vectorstores import Chromafrom langchain_core.documents import Documentfrom langchain_openai import OpenAIEmbeddingsfrom langchain_core.runnables import RunnablePassthrough

def june_run_nougat(file_path, output_dir): # Run Nougat and store results as Mathpix Markdown cmd = ["nougat", file_path, "-o", output_dir, "-m", "0.1.0-base", "--no-skipping"] res = subprocess.run(cmd) if res.returncode != 0: print("Error when running nougat.") return res.returncode else: print("Operation Completed!") return 0
def june_get_tables_from_mmd(mmd_path): f = open(mmd_path) lines = f.readlines() res = [] tmp = [] flag = "" for line in lines: if line == "\\begin{table}\n": flag = "BEGINTABLE" elif line == "\\end{table}\n": flag = "ENDTABLE" if flag == "BEGINTABLE": tmp.append(line) elif flag == "ENDTABLE": tmp.append(line) flag = "CAPTION" elif flag == "CAPTION": tmp.append(line) flag = "MARKDOWN" print('-' * 100) print(''.join(tmp)) res.append(''.join(tmp)) tmp = []
 return res
file_path = "YOUR_PDF_PATH"output_dir = "YOUR_OUTPUT_DIR_PATH"
if june_run_nougat(file_path, output_dir) == 1: import sys sys.exit(1)
mmd_path = output_dir + '/' + os.path.splitext(file_path)[0].split('/')[-1] + ".mmd" tables = june_get_tables_from_mmd(mmd_path)

# Promptprompt_text = """You are an assistant tasked with summarizing tables and text. \ Give a concise summary of the table or text. The table is formatted in LaTeX, and its caption is in plain text format: {element} """prompt = ChatPromptTemplate.from_template(prompt_text)
# Summary chainmodel = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()# Get table summariestable_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})print(table_summaries)
# The vectorstore to use to index the child chunksvectorstore = Chroma(collection_name = "summaries", embedding_function = OpenAIEmbeddings())
# The storage layer for the parent documentsstore = InMemoryStore()id_key = "doc_id"
# The retriever (empty to start)retriever = MultiVectorRetriever( vectorstore = vectorstore, docstore = store, id_key = id_key, search_kwargs={"k": 1} # Solving Number of requested results 4 is greater than number of elements in index..., updating n_results = 1)
# Add tablestable_ids = [str(uuid.uuid4()) for _ in tables]summary_tables = [ Document(page_content = s, metadata = {id_key: table_ids[i]}) for i, s in enumerate(table_summaries)]retriever.vectorstore.add_documents(summary_tables)retriever.docstore.mset(list(zip(table_ids, tables)))

# Prompt templatetemplate = """Answer the question based only on the following context, which can include text and tables, there is a table in LaTeX format and a table caption in plain text format:{context}Question: {question}"""prompt = ChatPromptTemplate.from_template(template)
# LLMmodel = ChatOpenAI(temperature = 0, model = "gpt-3.5-turbo")
# Simple RAG pipelinechain = ( {"context": retriever, "question": RunnablePassthrough()} | prompt | model | StrOutputParser())
print(chain.invoke("when layer type is Self-Attention, what is the Complexity per Layer?")) # Query about table 1
print(chain.invoke("Which parser performs worst for BLEU EN-DE")) # Query about table 2
print(chain.invoke("Which parser performs best for WSJ 23 F1")) # Query about table 4