25. Application of LLM in IR (WIP)#

25.1. Query Understanding & Optimization#

A Survey of Query Optimization in Large Language Models

Query Rewriting for Retrieval-Augmented Large Language Models

BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering

25.1.1. Query Categorization#

25.1.1.1. Natural Language Queries#

  • Fact-based: Questions that seek specific facts or information.

  • Multi-hop: Questions that require reasoning over multiple pieces of information or relationships.

  • Numerical: Questions that involve numerical values or calculations.

  • Tabular: Questions that relate to data presented in tables.

  • Temporal: Questions that involve time-related information or events.

  • Multi-constraint: Questions that involve multiple constraints or conditions.

Examples of each question type:

  • Fact-based: What is the capital of France? Who invented the telephone? What is the population of Tokyo?

  • Multi-hop: Who is the CEO of the company that manufactures the iPhone? What is the name of the river that flows through London? What is the highest mountain in the country where the Taj Mahal is located?

  • Numerical: What is the average temperature in July in London? What is the distance between New York and Los Angeles? What is the GDP of Japan?

  • Tabular: Which country has the highest population density? What is the average number of mobile phones sold by all the showrooms in 2007? How many restaurants are of the American cuisine type?

  • Temporal: When did World War II end? Who was the president of the United States in 1963? What was the price of gold in 2008?

  • Multi-constraint: Find a hotel in Paris that has a swimming pool and is within walking distance of the Eiffel Tower. What are some laptops that have a 15-inch screen, 16 GB of RAM, and cost less than $1,000? Is it possible to constrain the flat faces of, say, 6 countersunk screws to a single face?

25.1.2. Query Optimization Overview#

Query Expansion - aims to capture a wider range of relevant information and potentially uncover connections that may not have been apparent in the original query. This process involves analyzing the initial query, identifying key concepts, and incorporating related terms, synonyms, or associated ideas to form a new query that supports a more comprehensive search.

Query Decomposition - aims to break down complex, multi-hop queries into simpler, more manageable subqueries or tasks. This approach involves dissecting a query that requires facts from multiple sources or reasoning steps into smaller, more direct queries that can be answered individually.

Query Disambiguation - aims to identify and eliminate ambiguity in complex queries, ensuring they are unequivocal. This involves pinpointing elements of the query that could be interpreted in multiple ways and refining the query to ensure a single, precise interpretation.

Query Abstraction - aims to provide a broader perspective on the underlying information need, potentially leading to more diverse and comprehensive results. This involves identifying and distilling the fundamental intent and core conceptual elements of the query, then creating a higher-level representation that captures the essential meaning while removing specific details.
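To make the four operations concrete, here is a minimal prompting sketch; the `llm_complete` helper and the prompt wordings are hypothetical illustrations, not taken from any of the surveyed papers.

```python
# Hypothetical prompting sketch for the four query-optimization operations.
def llm_complete(prompt: str) -> str:
    """Placeholder for any LLM completion call (e.g., a hosted or local model client)."""
    raise NotImplementedError

def expand(query: str) -> str:
    return llm_complete(
        f"Rewrite the search query, adding related terms and synonyms.\nQuery: {query}\nExpanded query:"
    )

def decompose(query: str) -> list[str]:
    out = llm_complete(
        f"Break the question into simpler sub-questions, one per line.\nQuestion: {query}\nSub-questions:"
    )
    return [line.strip() for line in out.splitlines() if line.strip()]

def disambiguate(query: str) -> str:
    return llm_complete(
        f"Rewrite the question so it has a single, unambiguous interpretation.\nQuestion: {query}\nRewritten question:"
    )

def abstract(query: str) -> str:
    return llm_complete(
        f"State the broader, higher-level question behind this query, omitting specific details.\nQuery: {query}\nAbstracted question:"
    )
```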

25.1.3. REFEED (Retrieval Feedback)#

When using LLMs to optimize the query and retrieval processes, there have been efforts to let the LLM generate relevant context for the query from its internal knowledge [WYW23, YIW+22]. This generated context can then be used to further refine the query or the retrieval process. However, this approach has several fundamental drawbacks for knowledge-intensive tasks.

  • LLMs have a tendency to hallucinate, generating content not grounded in world knowledge, which leads to untrustworthy outputs and a diminished capacity to provide accurate information.

  • The internal knowledge of an LLM may be incomplete or out of date, depending on the coverage and reliability of the sources in its pre-training corpus. Moreover, due to capacity limitations, LLMs cannot memorize all world knowledge, particularly the long tail of facts in their training corpus [KDR+23].

The key steps in REFEED [Fig. 25.1], sketched in code after the list, are:

  • Given a query, the language model samples multiple initial outputs [Fig. 25.2].

  • A retrieval module retrieves expanded relevant information, using the original query combined with the generated outputs as the new query.

  • A language model reader produces the final output based on the expanded retrieved information.
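A minimal sketch of this retrieval-feedback loop is shown below; `llm_generate`, `retrieve`, and `llm_read` are hypothetical placeholders for the answer generator, the retriever, and the reader, not APIs defined in [YZL+23].

```python
# Minimal sketch of the retrieval-feedback pipeline (generate -> retrieve -> read).
def llm_generate(query: str) -> str:
    raise NotImplementedError  # sample one answer from the LLM's internal knowledge

def retrieve(query: str, top_k: int) -> list[str]:
    raise NotImplementedError  # e.g., BM25 or a dense retriever over the corpus

def llm_read(query: str, docs: list[str]) -> str:
    raise NotImplementedError  # answer the query grounded in the retrieved documents

def refeed(query: str, n_samples: int = 4, top_k: int = 10) -> str:
    # Step 1: sample several initial answers from the LLM.
    answers = [llm_generate(query) for _ in range(n_samples)]
    # Step 2: retrieve documents with the original query expanded by the generated answers.
    docs = retrieve(query + " " + " ".join(answers), top_k=top_k)
    # Step 3: the reader produces the final output from the expanded retrieved information.
    return llm_read(query, docs)
```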

The potential benefits of REFEED are:

  • By directly generating the expected answer rather than merely paraphrasing the query, the lack of lexical or semantic overlap between the question and the documents can be reduced.

  • As more relevant documents are retrieved from the corpus using the expected answers, both the recall and the precision of the retrieved documents can be improved.

../../_images/GEFEED_demo.png

Fig. 25.1 REFEED operates by initially prompting a large language model to generate an answer in response to a given query, followed by the retrieval of documents from extensive document collections. Subsequently, the pipeline refines the initial answer by incorporating the information gleaned from the retrieved documents. Image from [YZL+23].#

../../_images/GEFEED_workflow.png

Fig. 25.2 The language model is prompted to sample multiple answers, allowing for a more comprehensive retrieval feedback based on different answers. Image from [YZL+23].#

25.2. Query-Doc Ranking#

Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents

A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models

Zero-Shot Listwise Document Reranking with a Large Language Model

25.2.1. Ranking via Query Likelihood#

../../_images/ranking_by_query_likelihood.png

Fig. 25.3 Illustration of using query likelihood to approximate the query doc relevance. Image from [SLJ+22].#

Given a passage \(d_i\) and a query \(q\), the relevance score is approximated by the likelihood of \(q\) conditioned on \(d_i\) and an instruction prompt \(\mathbb{P}\).

Specifically, we can estimate \(\log p\left(q \mid d_i ; \mathbb{P}\right)\) using an LLM to compute the average conditional log-likelihood of the query tokens:

\[ \log p\left(q \mid d_i ; \mathbb{P}\right)=\frac{1}{|q|} \sum_t \log p\left(q_t \mid q_{<t}, d_i ; \mathbb{P}\right) \]

where \(|q|\) denotes the number of query tokens. The instruction prompt, Please write a question based on this passage, is appended to the passage tokens as shown in Fig. 25.3.
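As a concrete illustration, the sketch below scores passages by the average query-token log-likelihood under an off-the-shelf causal LM; the model choice and exact prompt wording are assumptions for illustration, not the setup of [SLJ+22].

```python
# Minimal query-likelihood scoring sketch with a causal LM (model and prompt are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def query_log_likelihood(query: str, passage: str) -> float:
    """Average log p(q_t | q_<t, d_i; P) over the query tokens."""
    prompt = f"Passage: {passage}\nPlease write a question based on this passage.\nQuestion:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    query_ids = tokenizer(" " + query, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, query_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    n_prompt = prompt_ids.shape[1]
    token_logps = [
        log_probs[0, n_prompt + t - 1, query_ids[0, t]]  # logits at pos-1 predict the token at pos
        for t in range(query_ids.shape[1])
    ]
    return torch.stack(token_logps).mean().item()

# Rank candidate passages by the likelihood they assign to the query.
query = "who invented the telephone"
passages = [
    "Alexander Graham Bell was credited with inventing the first practical telephone.",
    "Tokyo is the capital and most populous city of Japan.",
]
ranked = sorted(passages, key=lambda d: query_log_likelihood(query, d), reverse=True)
```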

[DZD+23] further extends the above query-likelihood approach by including demonstrations. Specifically, let \(z_1,...,z_k\) be positive query-document pairs used as demonstrations; the modified query likelihood then becomes

\[ \log p\left(q \mid d_i ; z_1,...,z_k ;\mathbb{P}\right)=\frac{1}{|q|} \sum_t \log p\left(q_t \mid q_{<t}, d_i ; z_1,...,z_k ;\mathbb{P}\right). \]

The authors found that

  • Selecting demonstrations based on semantic similarity does not necessarily provide the best value;

  • Instead, one can use difficulty-based selection to find challenging demonstrations to include in the prompt. Difficulty is estimated using the demonstration query likelihood (DQL):

\[ \operatorname{DQL}(z) \propto \frac{1}{\left|q^{(z)}\right|} \log P\left(q^{(z)} \mid d^{(z)}\right) \]

then select the demonstrations with the lowest DQL. Intuitively, this finds hard examples that would correspond to large gradients had we trained the model directly instead of prompting it.
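For instance, difficulty-based selection can be sketched as follows; `select_demonstrations` and its `score_fn` argument are hypothetical helpers (the query-likelihood sketch above can serve as the DQL estimator).

```python
# Hypothetical difficulty-based demonstration selection: keep the k pairs with
# the lowest demonstration query likelihood (DQL), i.e., the hardest ones.
from typing import Callable, List, Tuple

def select_demonstrations(
    pool: List[Tuple[str, str]],            # candidate (query, passage) positive pairs
    score_fn: Callable[[str, str], float],  # DQL estimator, e.g. query_log_likelihood above
    k: int = 4,
) -> List[Tuple[str, str]]:
    return sorted(pool, key=lambda qd: score_fn(qd[0], qd[1]))[:k]  # lowest DQL first
```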

25.2.2. Ranking via Relevance Label Likelihood#

../../_images/relevance_label_likelihood_demo.png

Fig. 25.4 Illustration of using relevance likelihood to approximate the query doc relevance. One can use different rating class and scale to improve the results. Image from [ZQH+23].#

Authors in [ZQH+23] proposed that one can use an LLM as a zero-shot text ranker by prompting it to rate a given query-document pair. Example rating scales or classes include

  • Binary class: {Yes, No}

  • Fine-grained relevance labels: {Not Relevant, Somewhat Relevant, Highly Relevant, Perfectly Relevant}

  • Rating scale: {0, 1, 2, 3, 4}

We can obtain the log-likelihood of the LLM generating each relevance label:

\[ s_{i, k}=\operatorname{LLM}\left(l_k \mid q, d_i\right) \]

where \(l_k\) is the relevance class label.

Once we obtain the log-likelihood of each relevance label, we can derive the ranking scores.

Expected relevance values: First, we need to assign a series of relevance values \(\left[y_0, ..., y_k\right]\) to all the relevance labels \(\left[l_0, ..., l_k\right]\), where \(y_k \in \mathbb{R}\). Then we can calculate the expected relevance value by:

\[\begin{split} \begin{aligned} f\left(q, d_i\right) & =\sum p_{i, k} \cdot y_k \\ \text { where } p_{i, k} & =\frac{\exp \left(s_{i, k}\right)}{\sum_{k^{\prime}} \exp \left(s_{i, k^{\prime}}\right)} \end{aligned} \end{split}\]

Peak relevance likelihood: We can further simplify the ranking score by using only the log-likelihood of the peak relevance label (e.g., Perfectly Relevant in this example). More formally, let \(l_{k^*}\) denote the relevance label with the highest relevance. We then rank the documents by:

\[ f\left(q, d_i\right)=s_{i, k^*}. \]
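Both scoring strategies are straightforward to compute once the label log-likelihoods \(s_{i,k}\) are available. The sketch below assumes they have already been obtained from the LLM; the label set, values, and example numbers are illustrative assumptions.

```python
# Turning relevance-label log-likelihoods s_ik into ranking scores.
import math

LABELS = ["Not Relevant", "Somewhat Relevant", "Highly Relevant", "Perfectly Relevant"]
VALUES = [0.0, 1.0, 2.0, 3.0]  # relevance value y_k assigned to each label l_k

def expected_relevance(label_logps: list[float]) -> float:
    """f(q, d_i) = sum_k p_ik * y_k, with p_ik the softmax over the label log-likelihoods."""
    m = max(label_logps)
    exps = [math.exp(s - m) for s in label_logps]  # numerically stable softmax
    z = sum(exps)
    return sum((e / z) * y for e, y in zip(exps, VALUES))

def peak_relevance(label_logps: list[float]) -> float:
    """f(q, d_i) = s_{i,k*}, the log-likelihood of the highest relevance label only."""
    return label_logps[-1]  # 'Perfectly Relevant' is the last label

# Example: two documents with hypothetical label log-likelihoods.
doc_scores = {"d1": [-3.2, -1.1, -0.9, -2.5], "d2": [-0.4, -1.8, -2.9, -4.0]}
ranking = sorted(doc_scores, key=lambda d: expected_relevance(doc_scores[d]), reverse=True)
```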

The key findings are:

  • Using more fine-grained relevance labels generally improves zero-shot ranking performance.

  • It is hypothesized that the inclusion of fine-grained relevance labels in the prompt may guide LLMs to better differentiate documents, especially those ranked at the top.

25.2.3. Pairwise and Groupwise Text Ranking#

../../_images/pairwise_ranking_demo.png

Fig. 25.5 An illustration of pairwise ranking prompting. The scores in scoring mode represent the log-likelihood of the model generating the target text given the prompt. Image from [QJH+23].#

Using LLMs for pointwise ranking via query or relevance-label likelihood has shown good performance, but the success is mainly limited to large-scale models. The hypothesis is that pointwise ranking requires the LLM to output calibrated predictions, which can be challenging for less competent, small-scale models. The authors of [QJH+23] proposed leveraging pairwise ranking to reduce the difficulty of the task.

As shown in Fig. 25.5, the pairwise ranking paradigm takes one query and a pair of documents as input and outputs the comparison result, which potentially resolves the calibration issue.
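As an illustration, pairwise comparisons can serve as a sort comparator; the prompt text and the `llm_choice` helper below are hypothetical and do not reproduce the exact prompt of [QJH+23].

```python
# Hypothetical pairwise ranking via an LLM comparator; llm_choice(prompt) is a
# placeholder that returns the model's answer, assumed to be "A" or "B".
import functools
from typing import List

def llm_choice(prompt: str) -> str:
    raise NotImplementedError  # wrap any LLM call here

def compare(query: str, doc_a: str, doc_b: str) -> int:
    prompt = (
        f"Query: {query}\n"
        f"Passage A: {doc_a}\nPassage B: {doc_b}\n"
        "Which passage is more relevant to the query? Answer A or B."
    )
    # Negative means doc_a should be ranked before doc_b.
    return -1 if llm_choice(prompt).strip().upper().startswith("A") else 1

def pairwise_rank(query: str, docs: List[str]) -> List[str]:
    # Sorting with the LLM comparator uses O(n log n) pairwise comparisons.
    return sorted(docs, key=functools.cmp_to_key(lambda a, b: compare(query, a, b)))
```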

One can further consider groupwise ranking by prompting the LLM to order \(k\) candidate passages by relevance, as shown in Fig. 25.6.

../../_images/groupwise_ranking_demo.png

Fig. 25.6 Illustration of groupwise ranking for \(k\) passages. Image from [SYM+23].#

With the local order established by either pairwise or groupwise ranking, one can obtain a globally ordered rank list using a sliding-window approach [Fig. 25.7], sketched below.

../../_images/slidingwindow_ranklist_generation.png

Fig. 25.7 Illustration of re-ranking 8 passages using sliding windows with a window size of 4 and a step size of 2. The blue color represents the first two windows, while the yellow color represents the last window. The sliding windows are applied in back-to-first order. Image from [SYM+23].#
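A minimal sketch of the sliding-window procedure follows; `rank_window` stands in for any local ordering step (e.g., a pairwise or groupwise LLM prompt) and is a hypothetical placeholder, not an API from [SYM+23].

```python
# Sliding-window re-ranking, applied back-to-first as in Fig. 25.7; rank_window
# is a hypothetical callable returning its window of passages in decreasing relevance.
from typing import Callable, List

def sliding_window_rerank(
    query: str,
    passages: List[str],
    rank_window: Callable[[str, List[str]], List[str]],
    window: int = 4,
    step: int = 2,
) -> List[str]:
    ranked = list(passages)
    end = len(ranked)
    while True:
        start = max(0, end - window)
        ranked[start:end] = rank_window(query, ranked[start:end])  # local pairwise/groupwise ordering
        if start == 0:
            break
        end -= step  # slide the window toward the front of the list
    return ranked
```

With 8 passages, a window of 4, and a step of 2, this visits the windows covering positions 5-8, 3-6, and 1-4, matching Fig. 25.7.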

25.2.4. Ranker Distillation#

Pairwise distillation: Suppose we have a query \(q\) and \(M\) candidate documents \(\left(d_1, \ldots, d_M\right)\) to rank. Let the LLM-based ranking of the \(M\) documents be \(R=\left(r_1, \ldots, r_M\right)\) (e.g., \(r_i=3\) means \(d_i\) ranks third among the candidates).

Let \(s_i=f_\theta\left(q, d_i\right)\) be the student model's relevance prediction score for the pair \(\left(q, d_i\right)\).

We can optimize the student model with a pairwise RankNet loss:

\[ \mathcal{L}_{\text {RankNet }}=\sum_{i=1}^M \sum_{j=1}^M \mathbb{1}_{r_i<r_j} \log \left(1+\exp \left(s_j-s_i\right)\right) \]
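A minimal PyTorch sketch of this distillation loss is given below; the tensor shapes and example numbers are illustrative assumptions.

```python
# Pairwise RankNet distillation loss: push the student score of a better-ranked
# document above that of a worse-ranked one.
import torch

def ranknet_loss(scores: torch.Tensor, ranks: torch.Tensor) -> torch.Tensor:
    """scores: (M,) student relevance scores s_i; ranks: (M,) LLM ranking positions r_i."""
    s_i = scores.unsqueeze(1)  # (M, 1)
    s_j = scores.unsqueeze(0)  # (1, M)
    # Pair (i, j) contributes when d_i is ranked better (smaller rank) than d_j.
    better = (ranks.unsqueeze(1) < ranks.unsqueeze(0)).float()
    pairwise = torch.log1p(torch.exp(s_j - s_i))
    return (better * pairwise).sum()

# Example: 4 candidates, student scores vs. LLM ranking (1 = best).
scores = torch.tensor([0.2, 1.5, -0.3, 0.9], requires_grad=True)
ranks = torch.tensor([3, 1, 4, 2])
loss = ranknet_loss(scores, ranks)
loss.backward()
```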

25.3. Application in RAG#

Small Models, Big Insights: Leveraging Slim Proxy Models to Decide When and What to Retrieve for LLMs

25.4. Generative SERP#

GenSERP: Large Language Models for Whole Page Presentation

25.5. Collections#

Awesome Information Retrieval in the Age of Large Language Model

25.6. Bibliography#

[DZD+23]

Andrew Drozdov, Honglei Zhuang, Zhuyun Dai, Zhen Qin, Razieh Rahimi, Xuanhui Wang, Dana Alon, Mohit Iyyer, Andrew McCallum, Donald Metzler, and others. PaRaDe: passage ranking using demonstrations with LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2023, 14242–14252. 2023.

[KDR+23]

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, 15696–15707. PMLR, 2023.

[QJH+23] (1,2)

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, and others. Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint arXiv:2306.17563, 2023.

[SLJ+22]

Devendra Singh Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, and Luke Zettlemoyer. Improving passage retrieval with zero-shot question generation. arXiv preprint arXiv:2204.07496, 2022.

[SYM+23] (1,2)

Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? Investigating large language models as re-ranking agents. arXiv preprint arXiv:2304.09542, 2023.

[WYW23]

Liang Wang, Nan Yang, and Furu Wei. Query2doc: query expansion with large language models. arXiv preprint arXiv:2303.07678, 2023.

[YIW+22]

Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. Generate rather than retrieve: large language models are strong context generators. arXiv preprint arXiv:2209.10063, 2022.

[YZL+23] (1,2)

Wenhao Yu, Zhihan Zhang, Zhenwen Liang, Meng Jiang, and Ashish Sabharwal. Improving language models via plug-and-play retrieval feedback. arXiv preprint arXiv:2305.14002, 2023.

[ZQH+23] (1,2)

Honglei Zhuang, Zhen Qin, Kai Hui, Junru Wu, Le Yan, Xuanhui Wang, and Michael Bendersky. Beyond yes and no: improving zero-shot llm rankers via scoring fine-grained relevance labels. arXiv preprint arXiv:2310.14122, 2023.