
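The workflow for structured generation with Kor is as follows:

1. Define a schema, including its structure, descriptions, and examples.
2. Kor uses a specific prompt template to assemble the user-provided schema and the raw text to be processed into a prompt.
3. The prompt is sent to the LLM, which relies on its general in-context learning ability to generate content that conforms to the schema as closely as it can.
4. Kor parses the LLM output and returns a structured result that conforms to the schema. When the LLM output does not conform to the schema, nothing may be returned.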

Kor is responsible for steps 2 and 4. In other words, Kor is a thin wrapper around the LLM.
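The four steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Kor's actual implementation: `PROMPT_TEMPLATE`, `build_prompt`, `parse_output`, `extract`, and `fake_llm` are hypothetical names, and the LLM is replaced by a stub so the sketch runs offline.

```python
import json
import re
from typing import Callable, Optional

# Hypothetical, heavily simplified prompt template (not Kor's real one).
PROMPT_TEMPLATE = (
    "Extract information matching this schema:\n{schema}\n"
    "Wrap the JSON in <json> tags.\n\nInput: {text}\nOutput:"
)


def build_prompt(schema: str, text: str) -> str:
    """Step 2: assemble the schema and the raw text into one prompt."""
    return PROMPT_TEMPLATE.format(schema=schema, text=text)


def parse_output(raw: str) -> Optional[dict]:
    """Step 4: pull the JSON out of the <json> tags; None if that fails."""
    match = re.search(r"<json>(.*?)</json>", raw, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def extract(llm: Callable[[str], str], schema: str, text: str) -> Optional[dict]:
    prompt = build_prompt(schema, text)  # step 2
    raw = llm(prompt)                    # step 3: any text-to-text API
    return parse_output(raw)             # step 4


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return '<json>{"translate_result": {"chinese": "你好"}}</json>'


result = extract(fake_llm, "translate_result: { chinese: string }", "hello")
print(result)
```

Because the wrapper only touches the prompt and the returned text, any callable that maps a string to a string can be plugged in as `llm`. It also means the wrapper can only report a failure when the model ignores the requested format; it cannot prevent one.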

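Kor's advantage is ease of use. Kor does not need to intervene in the decoding process; any text-to-text LLM API is enough, so it works with both closed-source and open-source models.

But Kor's drawback is just as clear: it cannot guarantee that the extracted result satisfies the schema. The reason is:

In essence, Kor only "assembles" the prompt for you; whether the output conforms to the schema still depends on the model's own instruction-following ability.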

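Two exercises

Having covered how Kor works, let's walk through two exercises.

In these exercises I use the free glm4-9b-chat API provided by SiliconFlow^[2].

The code used in this article is collected in the following git repository; stars are welcome: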

https://github.com/duanyu/structured_generation_with_llm

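Exercise 1: Translation

Example 1: Chinese translator

Result: given any input text, return {"translate_result": {"chinese": <translation>}}

Structured output generally takes just two steps:

1. Define the schema (the structure you want the LLM to output, together with descriptions and examples);
2. Use a structured output tool (such as Kor, covered in this article) to obtain a result that conforms to the schema.

Kor supports two ways of defining a schema: the Kor schema and a Pydantic model. This example uses the Kor schema.

Note: Kor itself is not covered in depth here; for details, see the documentation: https://eyurtsev.github.io/kor/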

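# Kor schema: the output format we want
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text

schema = Object(
    id="translate_result",
    description="任意文本的翻译结果。",
    attributes=[
        Text(
            id="chinese",
            description="中文翻译结果",
            examples=[],  # Kor supports few-shot examples, but this case is simple enough not to need them
            many=False,
        ),
    ],
    many=False,
)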

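# Run the chain and print the result (`llm` is assumed to be an already configured LangChain LLM; its setup is not shown here)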

chain = create_extraction_chain(llm, schema, encoder_or_encoder_class='json')

text = "We've trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT's code output. We found that when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time. We are beginning the work to integrate CriticGPT-like models into our RLHF labeling pipeline, providing our trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools."

print(chain.run(text)['data'])

{'translate_result': {'chinese': '我们训练了一个基于GPT-4的模型，称为CriticGPT，用于捕捉ChatGPT代码输出的错误。我们发现，当人们从CriticGPT那里获得帮助来审查ChatGPT代码时，他们比没有帮助的人高出60%的效率。我们正在开始将类似CriticGPT的模型集成到我们的RLHF标记流程中，为我们的训练师提供明确的AI辅助。这是朝着能够评估来自高级AI系统的输出迈出的一步，这些输出在没有更好的工具的情况下很难被人类评估。'}}

Example 1 runs successfully :)

Let's print Kor's prompt to see what it looks like.



print(chain.prompt.format_prompt(text="[user input]").to_string())

Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.



```TypeScript



translate_result: { // 任意文本的翻译结果。
 chinese: string // 中文翻译结果

}

```



Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: [user input]

Output:
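The type block in the prompt above is derived from the schema: each `id` becomes a field name and each `description` becomes a trailing comment. A minimal sketch of such a renderer follows. `TextField`, `ObjectSchema`, and `render_type_block` are hypothetical names for illustration, not Kor's internal classes, and English descriptions are used to keep the sketch readable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextField:
    """Hypothetical stand-in for a text attribute of the schema."""
    id: str
    description: str


@dataclass
class ObjectSchema:
    """Hypothetical stand-in for the top-level object of the schema."""
    id: str
    description: str
    attributes: List[TextField] = field(default_factory=list)


def render_type_block(schema: ObjectSchema) -> str:
    """Render the schema as a TypeScript-style type with comments."""
    lines = [f"{schema.id}: {{ // {schema.description}"]
    for attr in schema.attributes:
        lines.append(f" {attr.id}: string // {attr.description}")
    lines.append("}")
    return "\n".join(lines)


schema_sketch = ObjectSchema(
    id="translate_result",
    description="Translation of arbitrary text.",
    attributes=[TextField(id="chinese", description="Chinese translation")],
)
print(render_type_block(schema_sketch))
```

The descriptions are the only place where the task itself ("translate into Chinese") is communicated to the model, which is why wording them carefully matters.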

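Exercise 2: Review parsing

Example 2: Review parsing

Expected result: given a user review, obtain the aspect being reviewed (taste, price, etc.), the sentiment polarity (positive, negative, neutral), the sentiment word (tasty, expensive, etc.), and the reference snippet.

The first step of structured output is defining the schema. We can set it up like this:

[
    {
        'aspect': <aspect being reviewed>,
        'sentiment': <sentiment polarity>,
        'sentiment_word': <sentiment word>,
        'reference': <reference snippet>,
    }
]

In this example we define the schema with a Pydantic model. A Pydantic model also supports few-shot examples, and its additional benefit is that the output can be validated.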

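# Pydantic model for review parsing
import enum
from typing import Optional

from pydantic import BaseModel, Field


class Sentiment(enum.Enum):
    positive = "positive"
    negative = "negative"
    neural = "neural"  # sic: probably meant "neutral"; kept as-is so the outputs shown below still match


class Dianpin(BaseModel):
    aspect: str = Field(
        description="评价属性"
    )
    sentiment_word: str = Field(
        description='对评价属性的评价词，从原文中抽取',
    )
    sentiment: Optional[Sentiment] = Field(
        # Caution: "\n" in this string is a newline escape, which is why the printed prompt
        # further down breaks this description across lines; "positive/negative/neural" avoids that.
        description='对评价属性的情感，positive\negative\neural中的一个',
    )
    reference: str = Field(
        description='评价的原文'
    )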
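To make the validation benefit concrete, here is a stdlib-only stand-in for the kind of check a Pydantic model performs automatically. It is not Pydantic itself: `SentimentSketch`, `ReviewItem`, and `validate_item` are hypothetical names, and the sketch spells the third label "neutral".

```python
import enum
from dataclasses import dataclass
from typing import Optional


class SentimentSketch(enum.Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"


@dataclass
class ReviewItem:
    aspect: str
    sentiment_word: str
    sentiment: Optional[SentimentSketch]
    reference: str


def validate_item(item: dict) -> ReviewItem:
    """Build a ReviewItem, raising KeyError or ValueError if `item` does not fit."""
    raw_sentiment = item.get("sentiment")
    # Enum lookup by value raises ValueError for labels outside the allowed set.
    sentiment = None if raw_sentiment is None else SentimentSketch(raw_sentiment)
    return ReviewItem(
        aspect=item["aspect"],                  # KeyError if the field is missing
        sentiment_word=item["sentiment_word"],
        sentiment=sentiment,
        reference=item["reference"],
    )


ok = validate_item(
    {"aspect": "taste", "sentiment_word": "great", "sentiment": "positive", "reference": "tastes great"}
)
print(ok.sentiment)
```

With validation in place, a label the model made up or a missing field surfaces as an explicit error instead of slipping silently into downstream data.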

# Run Kor
from pprint import pprint

from kor import from_pydantic
from kor.extraction import create_extraction_chain

schema, validator = from_pydantic(
    Dianpin,
    description='对评价的解析结果',
    examples=[],
    many=True,  # return a list of aspects
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)

pprint(chain.run("整体来说，环境可以，味道的话也还不错，但价格有一点小贵。"))

{'data': {},
 'errors': [ParseError('The LLM has returned structured data which does not match the expected schema. Providing additional examples may help improve the parse.')],
 'raw': '\n'
        '<json>\n'
        '[\n'
        '  {\n'
        '    "aspect": "环境",\n'
        '    "sentiment_word": "可以",\n'
        '    "sentiment": "positive"\n'
        '  },\n'
        '  {\n'
        '    "aspect": "味道",\n'
        '    "sentiment_word": "还不错",\n'
        '    "sentiment": "positive"\n'
        '  },\n'
        '  {\n'
        '    "aspect": "价格",\n'
        '    "sentiment_word": "小贵",\n'
        '    "sentiment": "negative"\n'
        '  }\n'
        ']\n'
        '</json>',
 'validated_data': {}}

Note that the data field is empty here: the LLM's response does not match the expected schema, and Kor suggests adding examples.
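Comparing the raw output with the schema suggests two mismatches: the model returned a bare list instead of an object keyed by `dianpin`, and every item lacks `reference`. Which of these triggered Kor's ParseError is not stated in the output, so the check below is only a diagnostic sketch with hypothetical names (`REQUIRED_FIELDS`, `find_mismatches`), run on a shortened copy of the raw string.

```python
import json
import re
from typing import List

REQUIRED_FIELDS = {"aspect", "sentiment_word", "sentiment", "reference"}


def find_mismatches(raw: str, root_key: str = "dianpin") -> List[str]:
    """List the ways `raw` differs from the shape {root_key: [item, ...]}."""
    match = re.search(r"<json>(.*?)</json>", raw, re.DOTALL)
    if match is None:
        return ["no <json>...</json> block found"]
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    problems = []
    if isinstance(data, dict) and isinstance(data.get(root_key), list):
        items = data[root_key]
    else:
        problems.append(f"top level is not an object with a list under '{root_key}'")
        items = data if isinstance(data, list) else []

    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {index} is not an object")
            continue
        missing = sorted(REQUIRED_FIELDS - set(item))
        if missing:
            problems.append(f"item {index} is missing {missing}")
    return problems


# Shortened copy of the raw output from the failed run (first item only).
raw_first_run = (
    '\n<json>\n[\n'
    '  {"aspect": "环境", "sentiment_word": "可以", "sentiment": "positive"}\n'
    ']\n</json>'
)
for problem in find_mismatches(raw_first_run):
    print(problem)
```

A check like this is handy when deciding what a few-shot example needs to demonstrate: here, both the wrapping key and the `reference` field.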



So we add one simple example.

# Run Kor again, this time with one few-shot example
schema, validator = from_pydantic(
    Dianpin,
    description='对评价的解析结果',
    examples=[
        ('味道真不错，下次还来！', [{"aspect": "味道", "sentiment_word": "真不错", "sentiment": "positive", "reference": "味道真不错"}])
    ],
    many=True,  # return a list of aspects
)
chain = create_extraction_chain(
    llm, schema, encoder_or_encoder_class="json", validator=validator
)

pprint(chain.run("整体来说，环境可以，味道的话也还不错，但价格有一点小贵。"))

{'data': {'dianpin': [{'aspect': '环境',
                       'reference': '整体来说，环境可以',
                       'sentiment': 'positive',
                       'sentiment_word': '可以'},
                      {'aspect': '味道',
                       'reference': '味道的话也还不错',
                       'sentiment': 'positive',
                       'sentiment_word': '还不错'},
                      {'aspect': '价格',
                       'reference': '但价格有一点小贵',
                       'sentiment': 'negative',
                       'sentiment_word': '小贵'}]},
 'errors': [],
 'raw': '\n'
        '<json>\n'
        '{\n'
        '  "dianpin": [\n'
        '    {\n'
        '      "aspect": "环境",\n'
        '      "sentiment_word": "可以",\n'
        '      "sentiment": "positive",\n'
        '      "reference": "整体来说，环境可以"\n'
        '    },\n'
        '    {\n'
        '      "aspect": "味道",\n'
        '      "sentiment_word": "还不错",\n'
        '      "sentiment": "positive",\n'
        '      "reference": "味道的话也还不错"\n'
        '    },\n'
        '    {\n'
        '      "aspect": "价格",\n'
        '      "sentiment_word": "小贵",\n'
        '      "sentiment": "negative",\n'
        '      "reference": "但价格有一点小贵"\n'
        '    }\n'
        '  ]\n'
        '}\n'
        '</json>',
 'validated_data': [Dianpin(aspect='环境', sentiment_word='可以', sentiment=<Sentiment.positive: 'positive'>, reference='整体来说，环境可以'),
                    Dianpin(aspect='味道', sentiment_word='还不错', sentiment=<Sentiment.positive: 'positive'>, reference='味道的话也还不错'),
                    Dianpin(aspect='价格', sentiment_word='小贵', sentiment=<Sentiment.negative: 'negative'>, reference='但价格有一点小贵')]}

With the example added, Example 2 runs successfully.

Let's print Kor's prompt again to see what it looks like and how the few-shot examples are used.



print(chain.prompt.format_prompt(text="[user input]").to_string())



Your goal is to extract structured information from the user's input that matches the form described below. When extracting information please make sure it matches the type information exactly. Do not add any attributes that do not appear in the schema shown below.



```TypeScript



dianpin: Array<{ // 对评价的解析结果
 aspect: string // 评价属性
 sentiment_word: string // 对评价属性的评价词，从原文中抽取
 sentiment: "positive" | "negative" | "neural" // 对评价属性的情感，positive
egative
eural中的一个
 reference: string // 评价的原文

}>

```



Please output the extracted information in JSON format. Do not output anything except for the extracted information. Do not add any clarifying information. Do not add any fields that are not in the schema. If the text contains attributes that do not appear in the schema, please ignore them. All output must be in JSON format and follow the schema specified above. Wrap the JSON in <json> tags.



Input: 味道真不错，下次还来！

Output: <json>{"dianpin": [{"aspect": "味道", "sentiment_word": "真不错", "sentiment": "positive", "reference": "味道真不错"}]}</json>

Input: [user input]

Output:
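The tail of the prompt above shows how a few-shot example is used: each (text, output) pair is rendered as an Input/Output turn in the same `<json>` format the model is asked to produce, followed by the real input. A minimal sketch of that rendering, with a hypothetical helper `render_examples` (not Kor's own function):

```python
import json
from typing import Any, List, Tuple


def render_examples(examples: List[Tuple[str, Any]], root_key: str) -> str:
    """Render (text, output) pairs as Input/Output turns wrapped in <json> tags."""
    lines = []
    for text, output in examples:
        # Nest the output under the schema id and keep non-ASCII text readable.
        payload = json.dumps({root_key: output}, ensure_ascii=False)
        lines.append(f"Input: {text}")
        lines.append(f"Output: <json>{payload}</json>")
    return "\n".join(lines)


examples = [
    (
        "味道真不错，下次还来！",
        [{"aspect": "味道", "sentiment_word": "真不错", "sentiment": "positive", "reference": "味道真不错"}],
    ),
]
prompt_tail = render_examples(examples, "dianpin") + "\nInput: [user input]\nOutput:"
print(prompt_tail)
```

Because the example output already carries the `dianpin` wrapper and the `reference` field, the model sees the exact target shape once before answering, which matches the improvement observed in the second run.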

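Summary

As the first installment of this series on structured generation, this article introduced Kor. Kor is prompt-based and acts as a wrapper around the LLM. Its design makes it convenient for data processing (raw data -> schema), but its biggest limitation is that it cannot guarantee a stable structure for the extracted content, a problem that guided decoding techniques address.

This article is reposted from the WeChat public account @漫谈NLP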