
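Based on the Mistral 7B prompt template above, we build the keyword-extraction prompt from two parts: an Example Prompt and a Keyword Prompt. The Example Prompt is a worked example of keyword extraction, and the Keyword Prompt is the prompt that asks the LLM to output keywords. An example is shown below: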
example_prompt = """
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>"""
The Keyword Prompt uses KeyBERT's [DOCUMENT] tag to mark the position where the document will be inserted:
keyword_prompt = """
[INST]
I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
The complete keyword-extraction prompt is obtained by concatenating the Example Prompt and the Keyword Prompt:
>>> prompt = example_prompt + keyword_prompt
>>> print(prompt)
"""
<s>[INST]
I have the following document:
- The website mentions that it only takes a couple of days to deliver but I still have not received mine.

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST] meat, beef, eat, eating, emissions, steak, food, health, processed, chicken</s>
[INST]
I have the following document:
- [DOCUMENT]

Please give me the keywords that are present in this document and separate them with commas.
Make sure you to only return the keywords and say nothing else. For example, don't say:
"Here are the keywords present in the document"
[/INST]
"""
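To make the role of the [DOCUMENT] tag concrete, here is a minimal sketch of the substitution step in plain Python. The helper `fill_prompt` and the shortened template are hypothetical and exist only for illustration; they are not KeyBERT's own code, which performs an equivalent replacement internally for every document.

```python
# Illustrative only: a shortened template with the same [DOCUMENT] tag as above.
PROMPT_TEMPLATE = (
    "[INST]\n"
    "I have the following document:\n"
    "- [DOCUMENT]\n"
    "\n"
    "Please give me the keywords that are present in this document "
    "and separate them with commas.\n"
    "[/INST]"
)


def fill_prompt(template: str, document: str) -> str:
    """Hypothetical helper: insert one document at the [DOCUMENT] tag."""
    if "[DOCUMENT]" not in template:
        raise ValueError("template has no [DOCUMENT] tag")
    return template.replace("[DOCUMENT]", document)


filled = fill_prompt(PROMPT_TEMPLATE, "I received my package!")
print(filled)
```

Each document gets its own filled prompt, so the Example Prompt part stays fixed while only the text at the tag changes.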
from keybert.llm import TextGeneration
from keybert import KeyLLM

# Load it in KeyLLM
# `generator` is assumed to be an existing transformers text-generation pipeline
llm = TextGeneration(generator, prompt=prompt)
kw_model = KeyLLM(llm)

documents = [
    "The website mentions that it only takes a couple of days to deliver but I still have not received mine.",
    "I received my package!",
    "Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license.",
]

keywords = kw_model.extract_keywords(documents)
The output is:
[['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['package', 'received'], ['LLM', 'API', 'accessibility', 'release', 'license', 'research', 'community', 'model', 'weights', 'Meta']]
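You can freely adjust the prompt to specify the type of keywords to extract, how long they should be, and, if the LLM is multilingual, the language in which the keywords are returned.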
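As a sketch of such a customization, the snippet below builds a Keyword Prompt variant that caps the number of keywords and asks for a specific output language. The function `build_keyword_prompt` and its parameters `max_keywords` and `language` are hypothetical names introduced here; only the [DOCUMENT] tag and the [INST] wrapper come from the prompts above, and how well a given model follows these instructions is not guaranteed.

```python
def build_keyword_prompt(max_keywords: int = 5, language: str = "English") -> str:
    """Hypothetical helper that builds a Mistral-style keyword prompt."""
    return (
        "[INST]\n"
        "I have the following document:\n"
        "- [DOCUMENT]\n"
        "\n"
        f"Please give me at most {max_keywords} keywords that are present "
        f"in this document, written in {language} and separated by commas.\n"
        "Make sure to only return the keywords and say nothing else.\n"
        "[/INST]"
    )


german_prompt = build_keyword_prompt(max_keywords=3, language="German")
print(german_prompt)
```

The result can be appended to the Example Prompt in place of `keyword_prompt` and passed to `TextGeneration(generator, prompt=...)` as before.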
To switch to another LLM, such as ChatGPT, see: https://maartengr.github.io/KeyBERT/guides/llms.html
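Running the LLM separately on thousands of documents is not the most efficient approach. Instead, we can cluster the documents first and then extract keywords. It works as follows. First, we embed all documents, converting them into numerical representations. Second, we find which documents are most similar to one another; we assume that highly similar documents share the same keywords, so keywords do not need to be extracted for every document. Third, we extract keywords from only one document per cluster and assign them to all documents in that cluster.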
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
embeddings = model.encode(documents, convert_to_tensor=True)

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(
    documents,
    embeddings=embeddings,
    threshold=.5
)
Raising threshold to around .95 identifies nearly identical documents, whereas setting it to around .5 identifies documents about the same topic.
The extracted keywords are:
>>> keywords
[['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['LLaMA', 'model', 'weights', 'release', 'noncommercial', 'license', 'research', 'community', 'powerful', 'LLMs', 'APIs']]
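In this example, the first two documents were clustered together and received the same keywords. Instead of passing all three documents to the LLM, we passed only two. With thousands of documents, this can speed things up considerably.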
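The cluster-then-extract idea can be sketched without any LLM at all. The code below is a simplified illustration, not KeyLLM's actual clustering: it groups documents greedily by cosine similarity, calls a keyword extractor once per group, and copies the result to every member. The toy 2-D embeddings and the stub `fake_extract` are made up for the demonstration and stand in for real sentence embeddings and a real LLM call.

```python
import numpy as np


def cluster_by_similarity(embeddings, threshold):
    """Greedy grouping: each unassigned document seeds a cluster and pulls in
    every other unassigned document with cosine similarity >= threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    unassigned = set(range(len(embeddings)))
    clusters = []
    for i in range(len(embeddings)):
        if i not in unassigned:
            continue
        members = [i] + [
            j for j in sorted(unassigned) if j != i and sims[i, j] >= threshold
        ]
        unassigned -= set(members)
        clusters.append(members)
    return clusters


def extract_with_clusters(documents, embeddings, extract_fn, threshold=0.5):
    """Call extract_fn once per cluster and share the keywords with all members."""
    keywords = [None] * len(documents)
    calls = 0
    for members in cluster_by_similarity(embeddings, threshold):
        shared = extract_fn(documents[members[0]])  # one call per cluster
        calls += 1
        for idx in members:
            keywords[idx] = list(shared)
    return keywords, calls


def fake_extract(document):
    """Stub standing in for the LLM: returns the first three lowercased words."""
    return document.lower().split()[:3]


docs = [
    "The website mentions that it only takes a couple of days to deliver.",
    "I received my package!",
    "Whereas the most powerful LLMs have been accessible only through APIs.",
]
# Toy embeddings: the first two vectors point in nearly the same direction.
toy_embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])

keywords, calls = extract_with_clusters(docs, toy_embeddings, fake_extract, threshold=0.5)
print(calls, keywords)
```

With these toy vectors the first two documents fall into one cluster, so the extractor runs twice for three documents, mirroring the behavior described above.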
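In the previous example, we passed the document embeddings to KeyLLM manually, which essentially amounts to zero-shot keyword extraction. We can extend this example with KeyBERT. Because KeyBERT both generates keywords and embeds the documents, it not only simplifies the pipeline but also suggests candidate keywords to the LLM. These suggestions can help the LLM decide which keywords to use. In addition, everything available in KeyBERT can then be used together with KeyLLM.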
Extracting keywords with KeyBERT and KeyLLM together takes only three lines of code:
from keybert import KeyLLM, KeyBERT

# Load the LLM and the embedding model in KeyBERT
kw_model = KeyBERT(llm=llm, model='BAAI/bge-small-en-v1.5')

# Extract keywords
keywords = kw_model.extract_keywords(documents, threshold=0.5)
The output is:
>>> keywords
[['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['deliver', 'days', 'website', 'mention', 'couple', 'still', 'receive', 'mine'], ['LLaMA', 'model', 'weights', 'release', 'license', 'research', 'community', 'powerful', 'LLMs', 'APIs', 'accessibility']]
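To illustrate how suggested keywords might reach the LLM, the sketch below fills a prompt that has two placeholders: [DOCUMENT] for the text and [CANDIDATES] for the suggestions. The [CANDIDATES] tag, the template wording, and the helper `fill_candidate_prompt` are assumptions made for this illustration; check the KeyBERT documentation for the placeholder the library actually supports.

```python
# Assumed template: the [CANDIDATES] tag is an illustration, not confirmed API.
CANDIDATE_TEMPLATE = (
    "[INST]\n"
    "I have the following document:\n"
    "- [DOCUMENT]\n"
    "\n"
    "With the following candidate keywords:\n"
    "[CANDIDATES]\n"
    "\n"
    "Based on the information above, improve the candidate keywords "
    "and separate them with commas.\n"
    "[/INST]"
)


def fill_candidate_prompt(template, document, candidates):
    """Hypothetical helper: insert the document and a comma-separated
    list of suggested keywords into the template."""
    prompt = template.replace("[DOCUMENT]", document)
    return prompt.replace("[CANDIDATES]", ", ".join(candidates))


suggested = ["package", "received"]  # e.g. keywords proposed by KeyBERT
candidate_prompt = fill_candidate_prompt(
    CANDIDATE_TEMPLATE, "I received my package!", suggested
)
print(candidate_prompt)
```

The design point is that the LLM refines a short list instead of starting from nothing, which is the role the KeyBERT suggestions play in the combined pipeline.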
[1] https://towardsdatascience.com/introducing-keyllm-keyword-extraction-with-llms-39924b504813
[2] https://maartengr.github.io/KeyBERT/guides/keyllm.html
This article is reposted from the WeChat official account @ArronAI.