基于大模型的分块

基于大模型的分块

Pasted image 20250506215709.png

在使用大语言模型(LLM)进行文本处理时,可以直接利用 LLM 的能力来创建分块(chunks)。这种方法能够生成语义孤立且有意义的分块,对语义的准确性具有很高的保障。这是因为 LLM 能够理解上下文和意义,而不是像简单启发式方法那样仅仅基于规则进行分割。

然而,这种方法也存在一定的挑战。首先,它对计算资源的要求较高,是目前计算成本最高的分块技术之一。其次,由于 LLM 的上下文窗口通常是有限的,因此在处理长文本时需要特别注意上下文窗口的限制。


Dense X Retrieval: 检索粒度的选择

在论文 Dense X Retrieval: What Retrieval Granularity Should We Use? 中,作者提出了一种新的检索单位,称为 propositionProposition 被定义为文本中的原子表达(atomic expressions),即不能进一步分解的单个语义元素。这些元素可以用于构成更大的语义单位,能够以简明扼要的方式表达文本中的独特事实或特定概念。

Proposition 的特点

Proposition 的获取方式是通过构建提示词(prompts),并与 LLM 交互生成结果。目前,工具如 LlamaIndexLangChain 都实现了相关算法。


LlamaIndex 的实现方案

LlamaIndex 的实现思路是基于论文中提供的提示词,通过与 LLM 交互生成 proposition。以下是一个具体的代码示例:

PROPOSITIONS_PROMPT = PromptTemplate(
"""
Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of context.

1. Split compound sentence into simple sentences. Maintain the original phrasing from the input whenever possible.

2. For any named entity that is accompanied by additional descriptive information, separate this information into its own distinct proposition.

3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the entities they refer to.

4. Present the results as a list of strings, formatted in JSON.

Input: Title: ¯Eostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in 1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were frequently seen in gardens in spring, and thus may have served as a convenient explanation for the origin of the colored eggs hidden there for children. Alternatively, there is a European tradition that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and both occur on grassland and are first seen in the spring. In the nineteenth century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe. German immigrants then exported the custom to Britain and America where it evolved into the Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in 1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about the possible explanation for the connection between hares and the tradition during Easter", "