22. Information Retrieval and Text Ranking#

22.1. Overview of Information Retrieval#

22.1.1. Ad-hoc Retrieval#

Ad-hoc search and retrieval is a classic information retrieval (IR) task consisting of two steps: first, the user specifies his or her information need through a query; second, the information retrieval system fetches documents from a large corpus that are likely to be relevant to the query. Key elements in an ad-hoc retrieval system include

  • Query, the textual description of information need.

  • Corpus, a large collection of textual documents to be retrieved.

  • Relevance, whether a retrieved document meets the user’s information need.

Ad-hoc retrieval has a long history of research and product development. Successful products include the Google search engine [Fig. 22.2] and Microsoft Bing [Fig. 22.3].
One core component of ad-hoc retrieval is text ranking. The documents returned from a retrieval system or a search engine are typically in the form of an ordered list of texts. These texts (web pages, academic papers, news, tweets, etc.) are ordered by their relevance to the user’s query, i.e., the user’s information need.

A major characteristic of ad-hoc retrieval is the heterogeneity of the query and the documents [Fig. 22.1]. A user’s query often comes with potentially unclear intent and is usually very short, ranging from a few words to a few sentences. On the other hand, documents are typically from a different set of authors with varying writing styles and have longer text length, ranging from multiple sentences to many paragraphs. Such heterogeneity poses significant challenges for vocabulary match and semantic match for ad-hoc retrieval tasks.

../../_images/query_length_MS_MARCO.png

Fig. 22.1 Query length and document length distributions in an ad-hoc retrieval example using the MS MARCO dataset.#

There have been decades of research and engineering effort on developing ad-hoc retrieval models and systems. Traditional IR systems primarily rely on identifying exact term matches between a query and a document and computing a final relevance score under various weighting schemes. This exact-matching approach has achieved tremendous success due to its scalability and computational efficiency - fetching a handful of relevant documents from billions of candidates. Unfortunately, exact matching often suffers from the vocabulary mismatch problem, where sentences with similar meanings but different terms are considered unmatched. Recent developments in deep neural networks [HHG+13], particularly Transformer-based pre-trained large language models, have made great progress in semantic matching, or inexact matching, by incorporating recent successes in natural language understanding and generation. Combining exact matching with semantic matching now empowers many IR and search products.

../../_images/ad_hoc_retrieval_demo_google.png

Fig. 22.2 Google search engine.#

../../_images/ad_hoc_retrieval_demo_bing.png

Fig. 22.3 Microsoft Bing search engine.#

22.1.2. Open-domain Question Answering#

Another application closely related to IR is open-domain question answering (OpenQA), which has found widespread adoption in products like search engines, intelligent assistants, and automatic customer service. OpenQA is the task of answering factoid questions that humans might ask, using a large collection of documents (e.g., Wikipedia, web pages, or collected documents) as the information source. An OpenQA example is

Q: What is the capital of China?

A: Beijing.

In contrast to ad-hoc retrieval, instead of simply returning a list of relevant documents, the goal of OpenQA is to identify (or extract) a span of text that directly answers the user’s question. Specifically, for factoid question answering, the OpenQA system primarily focuses on questions that can be answered with short phrases or named entities such as dates, locations, organizations, etc.

A typical modern OpenQA system adopts a two-stage pipeline [CFWB17] [Fig. 22.4]: (1) a document retriever selects a small set of relevant passages that probably contain the answer from a large-scale collection; (2) a document reader extracts the answer from the relevant documents returned by the document retriever. Similar to ad-hoc search, relevant documents are required not only to be topically related to the question but also to correctly address it, which requires semantic understanding beyond exact term matching features. One widely adopted strategy for scaling OpenQA to a large corpus is to use an efficient document (or paragraph) retrieval technique to obtain a few relevant documents, and then use an accurate (yet expensive) reader model to read the retrieved documents and find the answer.

Nowadays many web search engines like Google and Bing have been evolving towards higher intelligence by incorporating OpenQA techniques into their search functionalities.

../../_images/open-domain_QA.png

Fig. 22.4 A typical open-domain architecture where a retriever retrieves passages from information source relevant to the question.#

Compared with ad-hoc retrieval, OpenQA shows reduced heterogeneity between the question and the answer passage/sentence, yet it adds the challenges of precisely understanding the question and locating passages that might contain answers. On one hand, the question is usually in natural language, which is longer than a keyword query and less ambiguous in intent. On the other hand, the answer passages/sentences are usually much shorter text spans than documents, leading to more concentrated topics/semantics. Retrieval and reader models need to capture the patterns expected in the answer passage/sentence based on the intent of the question, such as the matching of context words, the existence of the expected answer type, and so on.

22.1.3. Modern IR Systems#

A traditional IR system, or concretely a search engine, operates through several key steps.

The first step is crawling. A web search engine discovers and collects web pages by crawling from site to site; vertical search engines, such as e-commerce search engines, collect their product information from product descriptions and other product metadata. The second step is indexing, which creates an inverted index that maps keywords to document ids. The last step is searching. Searching is a process that accepts a text query as input, looks up relevant documents from the inverted index, ranks the documents, and returns a list of results ordered by their relevance to the query.
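To make the indexing and searching steps concrete, below is a minimal Python sketch (not a production design) that builds an inverted index mapping terms to document ids and looks up candidate documents for a query. The toy documents, whitespace tokenization, and function names are illustrative only, and no ranking is applied.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it (the indexing step)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain at least one query term (the searching step)."""
    candidates = set()
    for term in query.lower().split():
        candidates |= index.get(term, set())
    return candidates

docs = {
    1: "Italian restaurant near downtown",
    2: "Best coffee shop in Seattle",
    3: "Seattle Italian food guide",
}
index = build_inverted_index(docs)
print(search(index, "italian restaurant"))  # {1, 3}
```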

../../_images/traditional_IR_engine.png

Fig. 22.5 Key steps in a traditional IR system.#

The rapid progress of deep learning [GBCB16] and its profound impact on natural language processing have also reshaped IR systems and brought IR into a deep learning age. Deep neural networks (e.g., Transformers [DCLT18]) have demonstrated semantic understanding capabilities far beyond traditional IR models, yet they suffer from high computational cost and latency. This motivates the development of multi-stage retrieval and ranking IR systems [Fig. 22.6] that better balance the trade-off between effectiveness (i.e., quality and accuracy of final results) and efficiency (i.e., computational cost and latency).

In this multi-stage pipeline, early-stage models consist of simpler but more efficient models that reduce the candidate documents from billions to thousands; later-stage models consist of more complex models that perform accurate ranking for the handful of documents coming from earlier stages.

../../_images/retrieve_ranking_arch.png

Fig. 22.6 The multi-stage architecture of modern information retrieval systems.#

In a modern search engine, traditional IR models, which are based on term matching, serve as good candidates for early-stage models due to their efficiency. The core idea of the traditional approach is to count repetitions of query terms in the document; larger counts indicate higher relevance. Different transformation and weighting schemes for those counts lead to a variety of possible TF-IDF ranking features.

Later-stage models are primarily deep learning models. Deep learning models in IR not only provide powerful representations of textual data that capture word and document semantics, allowing a machine to better understand queries and documents, but also open doors to multi-modal (e.g., image, video) and multilingual search, ultimately paving the way for intelligent search engines that deliver rich content to users.

Remark 22.1 (Why we need a semantic understanding model)

For web-scale search engines like Google or Bing, a very small set of popular pages can typically answer a good proportion of queries [MNCC16]. The vast majority of queries contain common terms. It is possible to use term matching between keywords in the URL or title and query terms for text ranking; it is also possible to simply memorize the user clicks between common queries and their ideal URLs. For example, the query CNN is always matched to the CNN landing page. These simple methods clearly do not require a semantic understanding of the query and the document content.

However, for new or tail queries as well as new or tail documents, a semantic understanding of queries and documents is crucial. In these cases, there is a lack of click evidence between the queries and the documents, and therefore a model that captures the semantic-level relationship between the query and the document content is necessary for text ranking.

22.1.4. Challenges And Opportunities In IR Systems#

22.1.4.1. Query Understanding And Rewriting#

A user’s query does not always give a crystal-clear description of the user’s information need. Rather, it often comes with potential misspellings and unclear intent, and is usually very short, ranging from a few words to a few sentences. There are several challenges to understanding the user’s query.

Users might use vastly different query representations even though they have the same search intent. For example, suppose users would like to get information about the distance between the Sun and the Earth. Commonly used queries could be

  • how far earth sun

  • distance from sun to earth

  • distance of earth from sun

  • how far earth is from the sun

We can see that some of them are just keywords rather than full sentences, and some of them might not have completely correct grammar.

There are also more challenging scenarios where queries are poorly worded and far from describing the searcher’s actual information needs. Typically, we employ a query rewriting component to expand the search and increase recall, i.e., to retrieve a larger set of results, with the hope that relevant results will not be missed. Such a query rewriting component has multiple sub-components, which are summarized below.

Spell checker Spell checking queries is an important and necessary feature of modern search. Spell checking enhances the user experience by fixing basic spelling mistakes like itlian restaurat to italian restaurant.

Query expansion Query expansion improves search result retrieval by adding or substituting terms in the user’s query. Essentially, these additional terms aim to minimize the mismatch between the searcher’s query and available documents. For example, for the query italian restaurant, we can expand restaurant to food or cuisine to search all potential candidates.

Query relaxation The reverse of query expansion is query relaxation, which expands the search scope when the user’s query is too restrictive. For example, a search for good italian restaurant can be relaxed to italian restaurant.

Query intent understanding This subcomponent aims to figure out the main intent behind the query, e.g., the query coffee shop most likely has a local intent (an interest in nearby places) and the query earthquake may have a news intent. Later on, this intent will help in selecting and ranking the best documents for the query.

Given a rewritten query, it is also important to correctly weigh specific terms in the query so that we can narrow down the search scope. Consider the query NBA news: a relevant document is expected to be about both NBA and news, but with more focus on NBA. There are traditional rule-based approaches to determining term importance, as well as recent data-driven approaches that determine term importance based on sophisticated natural language and context understanding.

To improve relevance ranking, it is often necessary to incorporate additional context information (e.g., time, location, etc.) into the user’s query. For example, when a user types in the query coffee shop, retrieving coffee shops by ascending distance to the user’s location can generally improve relevance ranking. Still, there are challenges in deciding for which types of queries we need to incorporate the context information.

../../_images/query_word_cloud.png

Fig. 22.7 Word cloud visualization for common query words using MS MARCO data.#

22.1.4.2. Exact Match And Semantic Match#

Traditional IR systems retrieve documents mainly by matching keywords in documents with those in search queries. While in many cases exact term matching naturally ensures semantic matching, there are cases where exact term matching is insufficient.

The first reason is the polysemy of words. That is, a word can mean different things depending on context; the meaning of book is different in text book and book a hotel room. Short queries particularly suffer from polysemy because they are often devoid of context.

The second reason is that a concept is often expressed using different vocabularies and language styles in documents and queries. As a result, a purely exact-matching model has difficulty retrieving documents that contain none of the query terms but turn out to be relevant.

Modern neural IR models enable semantic retrieval by learning latent representations of text from data, allowing documents to be retrieved based on semantic similarity.

Table 22.1 Retrieval results based on exact matching methods and semantic matching methods.#

| Field | Content |
|---|---|
| Query | “Weather Related Fatalities” |
| Information Need | A relevant document will report a type of weather event which has directly caused at least one fatality in some location. |
| Lexical Document | “.. Oklahoma and South Carolina each recorded three fatalities. There were two each in Arizona, Kentucky, Missouri, Utah and Virginia. Recording a single lightning death for the year were Washington, D.C.; Kansas, Montana, North Dakota, ..” |
| Semantic Document | “.. Closed roads and icy highways took their toll as at least one motorist was killed in a 17-vehicle pileup in Idaho, a tour bus crashed on an icy stretch of Sierra Nevada interstate and 100-car string of accidents occurred near Seattle …” |

An IR system relying solely on semantic retrieval is vulnerable to queries containing rare words. Rare words appear infrequently, or never, in the training data, and the learned representations associated with rare-word queries can be poor due to the data-driven nature of the learning. On the other hand, exact matching approaches are robust to rare words and can precisely retrieve documents containing rare terms.

Another drawback of semantic retrieval is a higher false positive rate: it may retrieve documents that are only loosely related to the query.

Nowadays, much effort has been directed toward building strong and intelligent IR systems by combining exact match and semantic match approaches in different ways. Examples include joint optimization of hybrid exact-match and semantic-match systems, enhancing exact match via semantics-based query and document expansion, etc.

22.1.4.3. Robustness To Document Variations#

In response to users’ queries and questions, IR systems need to search a large collection of text documents, typically at the billion-level scale, to retrieve relevant ones. These documents consist mostly of unstructured natural language text, as opposed to structured data like tables or forms.

Documents can vary in length, ranging from sentences (e.g., searching for similar questions in a community question answering application like Quora) to lengthy web pages. A long document might be like a short document, covering a similar scope but with more words, which is known as the Verbosity hypothesis. Alternatively, a long document might consist of a number of unrelated short documents concatenated together, which is known as the Scope hypothesis. The wide variation of document forms leads to different modeling strategies. For example, following the Verbosity hypothesis, a long document is represented by a single feature vector. Following the Scope hypothesis, one can break a long document into several semantically distinct parts and represent each of them as a separate feature vector. We can consider each part as the unit of retrieval, or rank the long document by aggregating evidence across its constituent parts.
For full-text scientific articles, we might choose to consider only article titles and abstracts, ignoring most of the numerical results and analysis.

There are also challenges in breaking a long document into semantically distinct parts and encoding each part into a meaningful representation. Recent neural network methods extract semantic parts by clustering tokens in the hidden space and represent documents by multi-vector representations [HSLW19, LETC21, TSJ+21].

22.1.4.4. Computational Efficiency#

IR products such as search engines often serve a huge pool of users and need to handle a tremendous volume of search requests during peak time (e.g., when there is a breaking news event). The computational efficiency of IR models directly affects user-perceived latency, and thus the user experience. A long-standing challenge is to achieve high accuracy/relevance in the fetched documents while maintaining low latency. While traditional IR methods based on exact term matching have excellent computational efficiency and scalability, they suffer from low accuracy due to vocabulary and semantic mismatch problems. Recent progress in deep learning and natural language processing is highlighted by complex Transformer-based models [DCLT18] that achieve accuracy gains over traditional IR by a large margin yet suffer from high latency. Numerous ongoing studies (e.g., [GDC21, MC19, MDC17]) aim to combine the benefits of the two sides via hybrid modeling methodologies.

To alleviate the computational bottleneck of deep learning based dense retrieval, state-of-the-art search engines also adopt a multi-stage retrieval pipeline: an efficient first-stage retriever uses the query to fetch a set of documents from the entire document collection, and subsequently one or more powerful re-rankers refine the results.

22.2. Text Ranking Evaluation Metrics#

Consider a large corpus containing \(N\) documents \(D=\{d_1,...,d_N\}\). Given a query \(q\), suppose the retriever and its subsequent re-ranker (if there is one) ultimately produce an ordered list of \(k\) relevant documents \(R_q = \{d_{i_1},...,d_{i_k}\}\), where documents \(d_{i_1},...,d_{i_k}\) are ranked by some relevance measure with respect to the query.

In the following, we discuss several commonly used metrics to evaluate text ranking quality and IR systems.

22.2.1. Precision And Recall#

Precision and recall are metrics that concern the fraction of relevant documents retrieved for a query \(q\); they are not concerned with the ranking order. Specifically, precision computes the fraction of relevant documents with respect to the total number of documents in the retrieved set \(R_{q}\), where \(R_{q}\) has \(K\) documents; recall computes the fraction of relevant documents with respect to the total number of relevant documents in the corpus \(D\). More formally, we have

\[\begin{split}\begin{align*} \operatorname{Precision}_{q}@K &=\frac{\sum_{d \in R_{q}} \operatorname{rel}_{q}(d)}{\left|R_{q}\right|} = \frac{1}{K} \sum_{d \in R_{q}} \operatorname{rel}_{q}(d)\\ \operatorname{Recall}_{q}@K &=\frac{\sum_{d \in R_{q}} \operatorname{rel}_{q}(d)}{\sum_{d \in D} \operatorname{rel}_{q}(d)} \end{align*}\end{split}\]

where \(\operatorname{rel}_{q}(d)\) is a binary indicator function indicating if document \(d\) is relevant to \(q\). Note that the denominator \(\sum_{d \in D} \operatorname{rel}_{q}(d)\) is the total number of relevant documents in \(D\), which is a constant.

Typically, we only retrieve a fixed number \(K\) of documents, i.e., \(|R_q| = K\), where \(K\) typically ranges from 100 to 1000. The precision and recall at a given \(K\) can be averaged over the set of queries \(Q\), that is

\[\begin{split}\begin{align*} \operatorname{Precision}@{K} &=\frac{1}{|Q|}\sum_{q\in Q} \operatorname{Precision}_{q}@K \\ \operatorname{Recall}@{K} &=\frac{1}{|Q|}\sum_{q\in Q} \operatorname{Recall}_{q}@K \end{align*}\end{split}\]

Because recall has a fixed denominator for a given query, Recall@K never decreases as \(K\) increases.
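As a concrete illustration, the following Python sketch computes Precision@K and Recall@K for a single query from a ranked result list and a ground-truth relevant set; the document ids are made up for illustration.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Compute Precision@K and Recall@K for a single query.

    retrieved: ranked list of document ids returned for the query
    relevant:  set of all relevant document ids in the corpus
    """
    top_k = retrieved[:k]
    num_rel_retrieved = sum(1 for d in top_k if d in relevant)
    precision = num_rel_retrieved / k
    recall = num_rel_retrieved / len(relevant)
    return precision, recall

retrieved = ["d3", "d7", "d1", "d9", "d4"]   # ranked results for the query
relevant = {"d1", "d3", "d5"}                # ground-truth relevant documents
print(precision_recall_at_k(retrieved, relevant, k=5))  # (0.4, 0.666...)
```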

22.2.2. Normalized Discounted Cumulative Gain (NDCG)#

When we are evaluating retrieval and ranking results based on their relevance to the query, we normally evaluate the ranking result in the following way

  • The ranking result is good if documents with high relevance appear in the top positions of the search result list.

  • We expect documents with different degrees of relevance to contribute to the final ranking in proportion to their relevance.

Cumulative Gain (CG) is the sum of the graded relevance scores of all documents in the search result list. CG only considers the relevance of the documents in the result list and does not consider their positions. For a result list truncated at rank position \(k\), CG is defined as:

\[CG@k = \sum_{i=1}^k s_i\]

where \(s_i\) is the relevance score, or custom-defined gain, of the document at position \(i\) in the result list. The relevance score of a document is typically provided by human annotators.

Discounted cumulative gain (DCG) is the discounted version of CG. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.

The traditional formula of DCG accumulated at a particular rank position \(k\) is defined as

\[ DCG@k=\sum_{i=1}^{k} \frac{s_{i}}{\log_{2}(i+1)}=s_{1}+\sum_{i=2}^{k} \frac{s_{i}}{\log _{2}(i+1)}. \]

An alternative formulation of \(DCG\) places stronger emphasis on more relevant documents:

\[ {DCG}@k=\sum_{i=1}^{k} \frac{2^{rel_q(d_i)}-1}{\log _{2}(i+1)}. \]

The ideal DCG, IDCG, is computed the same way but with all candidate documents in the corpus sorted by their relevance, so that it produces the maximum possible DCG@k. The normalized DCG, NDCG, is then given by

\[ NDCG@k=\frac{DCG@k}{IDCG@k}. \]

Example 22.1

Consider 5 candidate documents with respect to a query. Let their ground truth relevance scores be

\[s_1=10, s_2=0,s_3=0,s_4=1,s_5=5,\]

which corresponds to the perfect rank \(s_1, s_5, s_4, s_2, s_3\). Let the predicted scores be \(y_1=0.05, y_2=1.1, y_3=1, y_4=0.5, y_5=0.0\), which corresponds to the rank \(s_2, s_3, s_4, s_1, s_5\).

For \(k=1,2\), we have

\[DCG@k = 0, NDCG@k=0.\]

For \(k=3\), we have

\[DCG@k = \frac{s_4}{\log_2 (3+1)} = 0.5, IDCG@k = 10.0 + \frac{5.0}{\log_2 3} + \frac{1.0}{\log_2 4} = 13.65, NDCG@k=0.0366.\]

For \(k=4\), we have

\[DCG@k = \frac{s_4}{\log_2 4} + \frac{s_1}{\log_2 5} = 4.807, IDCG@k = IDCG@3 + 0.0 = 13.65, NDCG@k=0.352.\]
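The following Python sketch reproduces the example above, using the traditional DCG formula with the \(\log_2(i+1)\) discount; the helper function names are ours.

```python
import math

def dcg_at_k(scores, k):
    """DCG@k with the log2(i+1) discount (positions are 1-indexed)."""
    return sum(s / math.log2(i + 1) for i, s in enumerate(scores[:k], start=1))

def ndcg_at_k(scores_in_predicted_order, all_scores, k):
    """NDCG@k = DCG@k of the predicted ranking divided by the ideal DCG@k."""
    idcg = dcg_at_k(sorted(all_scores, reverse=True), k)
    return dcg_at_k(scores_in_predicted_order, k) / idcg if idcg > 0 else 0.0

relevance = [10, 0, 0, 1, 5]          # ground-truth scores s_1 .. s_5
predicted = [0.05, 1.1, 1, 0.5, 0.0]  # model scores y_1 .. y_5

# order documents by descending predicted score: d2, d3, d4, d1, d5
order = sorted(range(len(predicted)), key=lambda i: -predicted[i])
ranked_relevance = [relevance[i] for i in order]

print(round(ndcg_at_k(ranked_relevance, relevance, 3), 4))  # 0.0366
print(round(ndcg_at_k(ranked_relevance, relevance, 4), 3))  # 0.352
```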

22.2.3. Online Metrics#

When a text ranking model is deployed to serve user requests, we can also measure the model performance by tracking several online metrics.

Click-through rate and dwell time When a user types a query and starts a search session, we can measure the success of the search session based on the user’s reactions. On a per-query level, we can define success via the click-through rate. The click-through rate (CTR) measures the ratio of clicks to impressions,

\[\operatorname{CTR} = \frac{\text{Number of clicks}}{\text{Number of impressions}},\]

where an impression means a page displayed on the search engine result page, and a click means that the user clicks that page.

One problem with the click-through rate is that we cannot simply treat a click as a success of document retrieval and ranking. For example, a click might be immediately followed by a click back as the user quickly realizes the clicked document is not what they are looking for. We can alleviate this issue by discarding clicks with a short dwell time.

Time to success Click-through rate only considers the search session of a single query. In real applications, a user’s search experience might span multiple query sessions until they find what they need. For example, a user might initially search action movies, fail to find the ideal results, and refine the initial query to a more specific one: action movies by Jackie Chan. Ideally, we can measure the time spent by the user in identifying the desired page as a metric.

22.3. Traditional Sparse IR Fundamentals#

22.3.1. Exact Match Framework#

Most traditional approaches to ad-hoc retrieval simply count repetitions of the query terms in the document text and assign proper weights to matched terms to calculate a final matching score. This framework, known as exact term matching, despite its simplicity, serves as the foundation for many IR systems. A variety of traditional IR methods fall into this framework; they differ mostly in their weighting (e.g., tf-idf) and term normalization (e.g., dogs to dog) schemes.

In exact term matching, we represent a query and a document by the sets of their constituent terms, that is, \(q = \{t^{q}_1,...,t^q_M\}\) and \(d = \{t^{d}_1,...,t^d_N\}\). The matching score between \(q\) and \(d\) with respect to a vocabulary \(V\) is given by:

\[ S(q, d)= \sum_{t\in V} f(t)\cdot\mathbb{1}(t\in q\cap d) = \sum_{t \in q \cap d} f(t) \]

where \(f\) is some function of a term and its associated statistics, the three most important of which are

  • Term frequency (how many times a term occurs in a document);

  • Document frequency (the number of documents that contain at least one instance of the term);

  • Document length (the length of the document that the term occurs in).

The exact term matching framework estimates document relevance based only on the counts of query terms in the document. The positions of these occurrences and their relationships with other terms in the document are ignored. Because methods such as BM25 rely on exact matches between query and document words, the information available to the ranking model is limited, which may lead to problems such as vocabulary mismatch.

22.3.2. TF-IDF Vector Space Model#

In the vector space model, we represent each query or document by a vector in a high-dimensional space. The vector has dimensionality equal to the vocabulary size, and each vector component corresponds to a term in the vocabulary of the collection. This query vector representation stands in contrast to the term-set representation of the previous section, which included only the terms appearing in the query. Given a query vector and a set of document vectors, one for each document in the collection, we rank the documents by computing a similarity measure between the query vector and each document vector.

The most commonly used similarity scoring function between a document vector \(\vec{d}\) and a query vector \(\vec{q}\) is the cosine similarity \(\operatorname{Sim}(\vec{d}, \vec{q})\), computed as

\[ \operatorname{Sim}(\vec{d}, \vec{q})=\frac{\vec{d}}{|\vec{d}|} \cdot \frac{\vec{q}}{|\vec{q}|}. \]

The component value associated with term \(t\) is typically the product of term frequency \(tf(t)\) and inverse document frequency \(idf(t)\). In addition, cosine similarity has a length normalization component that implicitly handles issues related to document length.

Over the years, a number of variants of both the TF and the IDF functions have been proposed and evaluated. A basic version of \(tf(t)\) is given by

\[\begin{split} tf(t,d)= \begin{cases}\log \left(f_{t, d}\right)+1 & \text { if } f_{t, d}>0 \\ 0 & \text { otherwise. }\end{cases} \end{split}\]

where \(f_{t, d}\) is the actual term frequency count of \(t\) in document \(d\). The basic intuition is that a term appearing many times in a document should be assigned a higher weight for that document, but its weight should not necessarily increase linearly with the actual term frequency \(f_{t, d}\); hence the logarithm is used to model this saturation effect. Although two occurrences of a term should be given more weight than one occurrence, they shouldn’t necessarily be given twice the weight.

A common \(idf(t)\) function is given by

\[ idf(t)=\log \left(N / N_{t}\right) \]

where \(N_t\) is the number of documents in the corpus that contain the term \(t\) and \(N\) is the total number of documents. The basic intuition behind the \(idf\) function is that a term appearing in many documents should be assigned a lower weight than a term appearing in few documents.
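Putting the TF and IDF components together, here is a minimal Python sketch of the vector space model: it builds vocabulary-sized tf-idf vectors for a toy corpus and ranks documents by cosine similarity against a query vector. The toy corpus and helper names are illustrative only.

```python
import math
from collections import Counter

def tf(count):
    """Log-saturated term frequency: log(f) + 1 if f > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0

def idf(term, docs):
    """idf(t) = log(N / N_t), where N_t is the number of documents containing t."""
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t) if n_t else 0.0

def tfidf_vector(tokens, vocab, docs):
    counts = Counter(tokens)
    return [tf(counts[t]) * idf(t, docs) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]
vocab = sorted({t for d in docs for t in d})
query = "cat on mat".split()

doc_vecs = [tfidf_vector(d, vocab, docs) for d in docs]
q_vec = tfidf_vector(query, vocab, docs)
print([cosine(q_vec, dv) for dv in doc_vecs])  # the first document scores highest
```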

22.3.3. BM25#

One of the most widely adopted exact matching methods is BM25 (short for Okapi BM25) [CMS10, RZ09, YFL18]. BM25 combines overlapping terms, term frequency (TF), inverse document frequency (IDF), and document length into the following formula:

\[ BM25(q, d)=\sum_{t_{q} \in q\cap d} i d f\left(t_{q}\right) \cdot \frac{t f\left(t_{q}, d\right) \cdot\left(k_{1}+1\right)}{t f\left(t_{q}, d\right)+k_{1} \cdot\left(1-b+b \cdot \frac{|d|}{a v g d l}\right)} \]

where \(tf(t_q, d)\) is the query term’s frequency in the document \(d\), \(|d|\) is the length (in words) of document \(d\), \(avgdl\) is the average document length in the collection \(D\), and \(k_{1}\) and \(b\) are parameters that are usually tuned on a validation dataset. In practice, \(k_{1}\) is often set to a default value in the range \([1.2, 2.0]\) and \(b\) to \(0.75\). The \(idf(t)\) is computed as

\[ idf(t)=\log \frac{N-N_t+0.5}{N_t+0.5}, \]

which is a smoothed variant of \(\log \frac{\text{Number of documents}}{\text{Number of documents with term } t}\).

At first sight, BM25 looks quite like a traditional \(tf\times idf\) weight - a product of two components, one based on \(tf\) and one on \(idf\). Intuitively, a document \(d\) has a higher BM25 score if

  • Many query terms also frequently occur in the document;

  • These frequently occurring matched terms have large idf values (i.e., they are not common terms in the collection).

However, there is one significant difference: the \(tf\) component in BM25 saturates, limiting the impact of highly frequent terms, and it is further discounted when the document is longer than average.

BM25 is not concerned with word semantics, that is, whether a word is a noun or a verb, or the meaning of each word. It is only sensitive to word frequency (i.e., which words are common and which are rare) and document length. If a query contains both common words and rare words, this method puts more weight on the rare words and returns documents containing more of the query’s rare words. Besides, a term saturation mechanism decreases the matching signal when a matched word appears too frequently in the document, and a document-length normalization mechanism discounts term weights when a document is longer than the average document in the collection.

More specifically, two parameters in BM25, \(k_1\) and \(b\), are designed to perform term frequency saturation and document-length normalization, respectively.

  • The constant \(k_{1}\) determines how the \(tf\) component of the term weight changes as the frequency increases. If \(k_{1}=0\), the term frequency component would be ignored and only term presence or absence would matter. If \(k_{1}\) is large, the term weight component would increase nearly linearly with the frequency.

  • The constant \(b\) regulates the impact of the length normalization, where \(b=0\) corresponds to no length normalization, and \(b=1\) is full normalization.

Remark 22.2 (Weighting scheme for long queries)

If the query is long, then we might also use similar weighting for query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries.

\[ BM25(q, d)=\sum_{t_q \in q\cap d} idf(t_q) \cdot \frac{\left(k_{1}+1\right) tf(t_q,d) }{k_{1}\left((1-b)+b \cdot |d|/avgdl\right)+tf(t_q,d)} \cdot \frac{\left(k_{3}+1\right) tf(t_q,q)}{k_{3}+tf(t_q,q)} \]

with \(tf(t_q, q)\) being the frequency of term \(t_q\) in the query \(q\), and \(k_{3}\) being another positive tuning parameter that calibrates the term frequency scaling of the query.

22.3.4. BM25 Implementation#

For an efficient implementation of BM25, we can pre-compute the document-side term statistics and store them, which is known as the indexing process. This includes:

  • Tokenize each document into tokens.

  • Compute the number of documents containing each token and the token frequency within each document.

  • Compute the idf of each token using the document frequencies.

  • Compute the BM25 score \(BM25(t_i,d)\) for each token in each document.

During query processing, we tokenize the query \(q\) into tokens \(t_i\) and compute

\[BM25(q, d) = \sum_{t_i \in q\cap d} BM25(t_i,d).\]
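Below is a minimal Python sketch of this two-step process (indexing, then query-time scoring), following the BM25 and idf formulas above with \(k_1\) and \(b\) fixed to typical defaults; whitespace tokenization and the toy corpus are simplifications.

```python
import math
from collections import Counter

def build_bm25_index(docs, k1=1.5, b=0.75):
    """Pre-compute document-side BM25 statistics (the indexing step)."""
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(t for d in tokenized for t in set(d))
    idf = {t: math.log((N - n + 0.5) / (n + 0.5)) for t, n in df.items()}

    index = []  # one dict per document: token -> BM25(token, doc)
    for d in tokenized:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avgdl)
        index.append({t: idf[t] * f * (k1 + 1) / (f + norm) for t, f in tf.items()})
    return index

def bm25_score(index, doc_id, query):
    """Query-time step: sum the pre-computed per-token scores for matched tokens."""
    return sum(index[doc_id].get(t, 0.0) for t in query.lower().split())

docs = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats are pets"]
index = build_bm25_index(docs)
print([bm25_score(index, i, "cat on mat") for i in range(len(docs))])
# the first document matches the most query terms and scores highest
```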

22.4. Semantic Dense Models#

22.4.1. Motivation#

For ad-hoc search, traditional exact-term matching models (e.g., BM25) are playing critical roles in both traditional IR systems [Fig. 22.5] and modern multi-stage pipelines [Fig. 22.6]. Unfortunately, exact-term matching inherently suffers from the vocabulary mismatch problem due to the fact that a concept is often expressed using different vocabularies and language styles in documents and queries.

Early latent semantic models such as latent semantic analysis (LSA) illustrated the idea of identifying semantically relevant documents for a query when lexical matching is insufficient. However, their effectiveness in addressing the language discrepancy between documents and search queries is limited by their weak modeling capacity (i.e., simple, linear models). Also, these model parameters are typically learned via unsupervised learning, i.e., by grouping different terms that occur in a similar context into the same semantic cluster.

The introduction of deep neural networks for semantic modeling and retrieval was pioneered in [HHG+13]. Recent deep learning models utilize neural networks with large learning capacity and user-interaction data for supervised learning, which has led to significant performance gains over LSA. Similarly, in the field of OpenQA [KOuguzM+20], whose first stage is to retrieve relevant passages that might contain the answer, semantic retrieval has also demonstrated performance gains over traditional retrieval methods.

22.4.2. Two Architecture Paradigms#

The current neural architecture paradigms for IR can be categorized into two classes: representation-based and interaction-based [Fig. 22.8].

In the representation-based architecture, a query and a document are encoded independently into two embedding vectors, then their relevance is estimated based on a single similarity score between the two embedding vectors.

Here we would like to make a critical distinction between symmetric and asymmetric encoding:

  • For symmetric encoding, the query and the entries in the corpus are typically of similar length and have the same amount of content, and they are encoded using the same network. Symmetric encoding is used for symmetric semantic search. An example would be searching for similar questions: the query could be How to learn Python online? and an entry that satisfies the search could be How to learn Python on the web?.

  • For asymmetric encoding, we usually have a short query (like a question or some keywords) and we would like to find a longer paragraph answering the query; the query and the documents are encoded using two different networks. An example would be information retrieval, where the entry is typically a paragraph or a web page.

In the interaction-based architecture, instead of directly encoding \(q\) and \(d\) into individual embeddings, term-level interaction features across the query and the document are first constructed. Then a deep neural network is used to extract high-level matching features from the interactions and produce a final relevance score.

../../_images/Two_paradigms.png

Fig. 22.8 Two common architectural paradigms in semantic retrieval learning: representation-based learning (left) and interaction-based learning (right).#

These two architectures have different strengths in modeling relevance and in model serving. For example, a representation-based architecture makes it possible to pre-compute and cache document representations offline, greatly reducing the online computational load per query. However, the pre-computed, query-independent document representations often miss term-level matching features that are critical for constructing high-quality retrieval results. On the other hand, interaction-based architectures are good at capturing fine-grained matching features between the query and the document.

Since interaction-based models can model interactions between word pairs in queries and documents, they are effective for re-ranking, but they are cost-prohibitive for first-stage retrieval because the expensive document-query interactions must be computed online for all ranked documents.

Representation-based models enable low-latency, full-collection retrieval with a dense index. By representing queries and documents with dense vectors, retrieval is reduced to a nearest neighbor search, or a maximum inner product search (MIPS) problem if similarity is measured by an inner product.

In recent years, there has been increasing effort on accelerating maximum inner product and nearest neighbor search, which has led to high-quality implementations of nearest neighbor search libraries such as hnsw [MY18], FAISS [JDJegou19], and ScaNN [GSL+20].
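As a small illustration of MIPS-based dense retrieval, the sketch below indexes random document embeddings with FAISS’s exact inner-product index and retrieves the top-10 documents for a query embedding. In practice the embeddings would come from trained encoders, and an approximate index (e.g., an HNSW-based one) would be used at scale; the sketch assumes the faiss package is installed.

```python
import numpy as np
import faiss  # assumes the faiss library is available

d = 128                                                  # embedding dimensionality
doc_emb = np.random.rand(10000, d).astype("float32")     # stand-in for pre-computed document vectors
query_emb = np.random.rand(1, d).astype("float32")       # stand-in for the encoded query

index = faiss.IndexFlatIP(d)          # exact maximum inner product search
index.add(doc_emb)                    # index all document embeddings offline
scores, ids = index.search(query_emb, 10)  # top-10 documents for the query
print(ids[0], scores[0])
```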

22.4.3. Classic Representation-based Learning#

22.4.3.1. DSSM#

The Deep Structured Semantic Model (DSSM) [HHG+13] improves on previous latent semantic models in two aspects: 1) DSSM uses supervised learning based on labeled data, while latent semantic models use unsupervised learning; 2) DSSM utilizes deep neural networks to capture more semantic meaning.

The high-level architecture of DSSM is illustrated in Fig. 22.9. First, we represent the query and the document (only its title) each by a sparse term vector. Second, we apply a non-linear projection to map the query and the document sparse vectors to two low-dimensional embedding vectors in a common semantic space. Finally, the relevance of each document given the query is calculated as the cosine similarity between their embedding vectors in that semantic space.

../../_images/dssm.png

Fig. 22.9 The architecture of DSSM. Two MLP encoders with shared parameters are used to encode a query and a document into dense vectors. Query and document are both represented by term vectors. The final relevance score is computed via dot product between the query vector and the document vector.#

To represent word features in the query and the documents, DSSM adopts a word-level sparse term vector representation with a letter 3-gram vocabulary, whose size is approximately \(30k \approx 30^3\) (30 being the approximate number of alphabet letters). This is also known as the letter trigram word hashing technique. In other words, both the query and the documents are represented by sparse vectors with dimensionality of about \(30k\).

The usage of letter 3-gram vocabulary has multiple benefits compared to the full vocabulary:

  • Avoid the out-of-vocabulary (OOV) problem while keeping a finite-size vocabulary (and thus a fixed term vector dimensionality).

  • The use of letter n-grams can capture morphological variations of words.

One problem with this method is collision, i.e., two different words could have the same letter n-gram vector representation because this is a bag-of-n-grams representation that does not take order into account. However, the collision probability is rather low.

Table 22.2 Word hashing token size and collision numbers as a function of the vocabulary size and the type of letter ngrams.#

| Word Size | Letter-Bigram Token Size | Letter-Bigram Collision | Letter-Trigram Token Size | Letter-Trigram Collision |
|---|---|---|---|---|
| 40k | 1107 | 18 | 10306 | 2 |
| 500k | 1607 | 1192 | 30621 | 22 |
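A minimal Python sketch of the letter-trigram word hashing described above is given below; it assumes “#” word-boundary markers and simple whitespace tokenization, which are simplifications of the full setup.

```python
from collections import Counter

def letter_trigrams(word):
    """Break a word into letter 3-grams after adding boundary markers."""
    marked = f"#{word.lower()}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

def term_vector(text, vocab):
    """Bag-of-letter-trigrams vector of a query or document title."""
    counts = Counter(g for w in text.split() for g in letter_trigrams(w))
    return [counts[g] for g in vocab]

print(letter_trigrams("good"))   # ['#go', 'goo', 'ood', 'od#']
```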

Training. The neural network model is trained on click-through data to map a query and its relevant document to vectors that are close to each other, and a query and an irrelevant document to vectors that are far apart. The click-through logs consist of a list of queries and their clicked documents. It is assumed that a query is relevant, at least partially, to the documents that are clicked on for that query.

The semantic relevance score between a query \(q\) and a document \(d\) is given by:

\[ R(q, d)=\operatorname{Cosine}\left(E_{q}, E_{d}\right) \]

where \(E_{q}\) and \(E_{d}\) are the embedding vectors of the query and the document, respectively. The conditional probability of a document being relevant to a given query is then defined through a softmax function

\[ P(d \mid q)=\frac{\exp (\gamma R(q, d))}{\sum_{d^{\prime} \in D} \exp \left(\gamma R\left(q, d^{\prime}\right)\right)} \]

where \(\gamma\) is a smoothing factor treated as a hyperparameter, and \(D\) denotes the set of candidate documents to be ranked. While \(D\) should ideally contain all possible documents in the corpus, in practice, for each query \(q\), \(D\) is approximated by the clicked document \(d^{+}\) and four randomly selected un-clicked documents.

In training, the model parameters are estimated to maximize the likelihood of the clicked documents given the queries across the training set. Equivalently, we need to minimize the following loss function

\[ L=-\sum_{(q, d^+)}\log P\left(d^{+} \mid q\right). \]
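A minimal PyTorch sketch of this training objective is shown below; it assumes the query and document embeddings have already been produced by the encoders and uses one clicked document plus four sampled negatives per query, as described above.

```python
import torch
import torch.nn.functional as F

def dssm_loss(query_emb, pos_doc_emb, neg_doc_embs, gamma=10.0):
    """Softmax cross-entropy over scaled cosine similarities (one positive, several negatives).

    query_emb:    (B, d) query embeddings from the query encoder
    pos_doc_emb:  (B, d) embeddings of clicked documents
    neg_doc_embs: (B, n_neg, d) embeddings of sampled un-clicked documents
    """
    pos_sim = F.cosine_similarity(query_emb, pos_doc_emb, dim=-1)                 # (B,)
    neg_sim = F.cosine_similarity(query_emb.unsqueeze(1), neg_doc_embs, dim=-1)   # (B, n_neg)
    logits = gamma * torch.cat([pos_sim.unsqueeze(1), neg_sim], dim=1)            # (B, 1 + n_neg)
    labels = torch.zeros(logits.size(0), dtype=torch.long)                        # positive sits at index 0
    return F.cross_entropy(logits, labels)   # equals -log P(d+ | q) averaged over the batch

# toy usage with random embeddings
B, d, n_neg = 8, 128, 4
loss = dssm_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, n_neg, d))
print(loss.item())
```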

Evaluation. DSSM is compared with baseline traditional IR models such as TF-IDF, BM25, and LSA. Specifically, the best performing DNN-based semantic model, L-WH DNN, uses three hidden layers, including the layer with letter-trigram-based word hashing and an output layer, and is discriminatively trained on query-title pairs. The results are summarized below.

| Models | NDCG@1 | NDCG@3 | NDCG@10 |
|---|---|---|---|
| TF-IDF | 0.319 | 0.382 | 0.462 |
| BM25 | 0.308 | 0.373 | 0.455 |
| LSA | 0.298 | 0.372 | 0.455 |
| L-WH DNN | 0.362 | 0.425 | 0.498 |

22.4.3.2. CNN-DSSM#

DSSM treats a query or a document as a bag of words, so the fine-grained contextual structures embedded in the word order are lost. CNN-DSSM [SHG+14] [Fig. 22.10] directly represents local contextual features at the word n-gram level; i.e., it projects each raw word n-gram to a low-dimensional feature vector where semantically similar word n-grams are projected to vectors that are close to each other in this feature space.

Moreover, instead of simply summing all local word n-gram features evenly, CNN-DSSM performs a max-pooling operation to select the highest neuron activation value across all word n-gram features at each dimension. This amounts to extracting the sentence-level salient semantic concepts.

Meanwhile, for any sequence of words, this operation produces a fixed-length sentence-level feature vector with the same dimensionality as that of the local word n-gram features.

Given the letter-trigram based word representation, we represent a word n-gram by concatenating the letter-trigram vectors of its words; i.e., for the \(t\)-th word n-gram at the word-n-gram layer, we have:

\[ l_{t}=\left[f_{t-d}^{T}, \ldots, f_{t}^{T}, \ldots, f_{t+d}^{T}\right]^{T}, \quad t=1, \ldots, T \]

where \(f_{t}\) is the letter-trigram representation of the \(t\)-th word, and \(n=2 d+1\) is the size of the contextual window. In the experiments of [SHG+14], there are about \(30K\) unique letter-trigrams observed in the training set after the data are lower-cased and punctuation is removed. Therefore, the letter-trigram layer has a dimensionality of \(n \times 30 K\).

../../_images/CNN_DSSM.png

Fig. 22.10 The architecture of CNN-DSSM. Each term together with its left and right contextual words are encoded together into term vectors.#

22.4.4. Mono-BERT And Duo-BERT#

22.4.4.1. Why Transformers?#

BERT (Bidirectional Encoder Representations from Transformers) [DCLT18] and its Transformer variants [LWLQ21] represent the state-of-the-art modeling strategies for a broad range of natural language processing tasks. The application of BERT to information retrieval and ranking was pioneered by [NC19, NYCL19]. The fundamental characteristic of the BERT architecture is self-attention. By pretraining BERT on large-scale text data, the BERT encoder can produce contextualized embeddings that better capture the semantics of different linguistic units. By adding a prediction head to the BERT backbone, such encoders can be fine-tuned for retrieval-related tasks. In this section, we go over the application of different BERT-based models to neural information retrieval and ranking tasks.

22.4.4.2. Mono-BERT (Cross-Encoder) For Point-wise Ranking#

../../_images/mono_bert_arch.png

Fig. 22.11 The architecture of Mono-BERT for document relevance ranking. The input is the concatenation of the query token sequence and the candidate document token sequence. Once the input sequence is passed through the model, we use the [CLS] embedding as input to a single layer neural network to obtain a posterior probability \(p_{i}\) of the candidate \(d_{i}\) being relevant to query \(q\).#

The first application of BERT to document retrieval uses BERT as a cross-encoder, where the query token sequence and the document token sequence are concatenated via the [SEP] token and encoded together. This architecture [Fig. 22.11], called mono-BERT, was first proposed by [NC19, NYCL19].

To meet the token sequence length constraint of a BERT encoder (e.g., 512), we might need to truncate the query (e.g., to at most 64 tokens) and the candidate document token sequence such that the total concatenated token sequence has a maximum length of 512 tokens.

Once the input sequence is passed through the model, we use the [CLS] embedding as input to a single layer neural network to obtain a posterior probability \(p_{i}\) of the candidate \(d_{i}\) being relevant to query \(q\). The posterior probability can be used to rank documents.

The training data can be represented by a collection of triplets \((q, J_P^q, J_N^q), q\in Q\), where \(Q\) is the set of queries, \(J_{P}^q\) is the set of indexes of the relevant candidates associated with query \(q\), and \(J_{N}^q\) is the set of indexes of the non-relevant candidates.

The encoder can be fine-tuned using the cross-entropy loss

\[ L_{\text {mono-BERT}}=-\sum_{q\in Q}\left( \sum_{j \in J_{P}^q} \log \left(s_{j}\right)+\sum_{j \in J_{N}^q} \log \left(1-s_{j}\right) \right), \]

where \(s_{j}\) is the predicted relevance probability of candidate \(d_{j}\).

During training, each batch can consist of a query and its candidate documents (including both positives and negatives) produced by earlier retrieval stages.
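The sketch below illustrates the mono-BERT scoring pattern with the Hugging Face transformers library: the query and each candidate are concatenated into a single input, and a classification head over the [CLS] representation produces a relevance probability. The checkpoint name is a placeholder, and the freshly initialized head is untrained, so the scores are meaningless until the model is fine-tuned with the loss above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Any BERT-style checkpoint with a 2-class head works here; the name below is illustrative.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

query = "what is the capital of China"
docs = ["Beijing is the capital of the People's Republic of China.",
        "Python is a popular programming language."]

# Build [CLS] query [SEP] document [SEP] inputs for every candidate document.
inputs = tokenizer([query] * len(docs), docs, truncation=True,
                   max_length=512, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits             # (num_docs, 2)
scores = torch.softmax(logits, dim=-1)[:, 1]    # P(relevant) used to rank the candidates
print(scores)
```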

22.4.4.3. Duo-BERT For Pairwise Ranking#

Mono-BERT can be characterized as a pointwise ranking approach. Within the learning-to-rank framework, [NC19, NYCL19] also proposed duo-BERT, a pairwise ranking approach. In this pairwise approach, the duo-BERT ranker estimates the probability \(p_{i, j}\) that candidate \(d_{i}\) is more relevant than \(d_{j}\) with respect to query \(q\).

The duo-BERT architecture [Fig. 22.12] takes the concatenation of the query \(q\), the candidate document \(d_{i}\), and the candidate document \(d_{j}\) as the input. We also need to truncate the query, candidate \(d_{i}\), and candidate \(d_{j}\) to proper lengths (e.g., 62, 223, and 223 tokens, respectively) so that the entire sequence has at most 512 tokens.

Once the input sequence is passed through the model, we use the [CLS] embedding as input to a single layer neural network to obtain a posterior probability \(p_{i,j}\). This posterior probability can be used to rank documents \(i\) and \(j\) with respect to each other. If there are \(k\) candidates for query \(q\), there will be \(k(k-1)\) passes to compute all the pairwise probabilities.

The model can be fine-tuned with the following loss per query:

\[\begin{split}\begin{align*} L_{\text {duo }}=&-\sum_{i \in J_{P}, j \in J_{N}} \log \left(p_{i, j}\right) \\ &-\sum_{i \in J_{N}, j \in J_{P}} \log \left(1-p_{i, j}\right) \end{align*}\end{split}\]
../../_images/duo_bert_arch.png

Fig. 22.12 The duo-BERT architecture takes the concatenation of the query and two candidate documents as the input. Once the input sequence is passed through the model, we use the [CLS] embedding as input to a single layer neural network to obtain a posterior probability that the first document is more relevant than the second document.#

At inference time, the \(k(k-1)\) pairwise probabilities are aggregated to produce the final document relevance ranking for the query. The authors of [NYCL19] investigate five different aggregation methods (SUM, BINARY, MIN, MAX, and SAMPLE) to produce the final ranking score:

\[\begin{split}\begin{align*} \operatorname{SUM}: s_{i} &=\sum_{j \in J_{i}} p_{i, j} \\ \operatorname{BINARY}: s_{i} &=\sum_{j \in J_{i}} \mathbb{1}_{p_{i, j} > 0.5} \\ \operatorname{MIN}: s_{i} &=\min _{j \in J_{i}} p_{i, j} \\ \operatorname{MAX}: s_{i} &=\max _{j \in J_{i}} p_{i, j} \\ \operatorname{SAMPLE}: s_{i}&=\sum_{j \in J_{i}(m)} p_{i, j} \end{align*}\end{split}\]

where \(J_i = \{\, j : 1 \leq j \leq k,\ j\neq i \,\}\) and \(J_i(m)\) denotes \(m\) elements randomly sampled from \(J_i\).

The SUM method measures the pairwise agreement that candidate \(d_{i}\) is more relevant than the rest of the candidates \(\left\{d_{j}\right\}_{j \neq i}\). The BINARY method resembles a majority vote. The MIN (MAX) method measures the relevance of \(d_{i}\) only against its strongest (weakest) competitor. The SAMPLE method aims to decrease the high inference cost of pairwise computations via sampling. Comparison studies on the MS MARCO dataset suggest that SUM and BINARY give the best results. A sketch of these aggregation methods is given below.
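A small NumPy sketch of these aggregation methods, assuming the pairwise probabilities \(p_{i,j}\) have already been computed by the duo-BERT ranker:

```python
import numpy as np

def aggregate(p, method="SUM", m=3, rng=None):
    """Aggregate pairwise probabilities p[i, j] = P(d_i more relevant than d_j) into s_i."""
    k = p.shape[0]
    others = [[j for j in range(k) if j != i] for i in range(k)]
    if method == "SUM":
        return np.array([p[i, others[i]].sum() for i in range(k)])
    if method == "BINARY":
        return np.array([(p[i, others[i]] > 0.5).sum() for i in range(k)])
    if method == "MIN":
        return np.array([p[i, others[i]].min() for i in range(k)])
    if method == "MAX":
        return np.array([p[i, others[i]].max() for i in range(k)])
    if method == "SAMPLE":
        rng = rng or np.random.default_rng(0)
        return np.array([p[i, rng.choice(others[i], size=m, replace=False)].sum()
                         for i in range(k)])
    raise ValueError(method)

p = np.random.rand(5, 5)                 # toy pairwise probabilities for k = 5 candidates
ranking = np.argsort(-aggregate(p, "SUM"))
print(ranking)                           # candidate indices from most to least relevant
```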

22.4.4.4. Multistage Retrieval And Ranking Pipeline#

../../_images/multistage_retrieval_ranking_bert.png

Fig. 22.13 Illustration of a three-stage retrieval-ranking architecture using BM25, monoBERT and duoBERT. Image from [NYCL19].#

With BERT variants of different ranking capabilities, we can construct a multi-stage ranking architecture to select a handful of the most relevant documents from a large collection of candidates given a query. Consider a typical architecture comprising a number of stages from \(H_0\) to \(H_N\). \(H_0\) is an exact-term matching stage using an inverted index. The \(H_0\) stage takes billions of documents as input and outputs thousands of candidates \(R_0\). For stages from \(H_1\) to \(H_N\), each stage \(H_{n}\) receives a ranked list \(R_{n-1}\) of candidates from the previous stage and outputs a candidate list \(R_n\). Typically \(|R_n| \ll |R_{n-1}|\) to enable efficient retrieval.

An example three-stage retrieval-ranking system is shown in Fig. 22.13. In the first stage \(H_{0}\), given a query \(q\), the top candidate documents \(R_{0}\) are retrieved using BM25. In the second stage \(H_{1}\), monoBERT produces a relevance score \(s_{i}\) for each pair of query \(q\) and candidate \(d_{i} \in R_{0}\). The top candidates with respect to these relevance scores are passed to the last stage \(H_{2}\), in which duoBERT computes a relevance score \(p_{i, j}\) for each triple \(\left(q, d_{i}, d_{j}\right)\). The final list of candidates \(R_{2}\) is formed by re-ranking the candidates according to the aggregated pairwise scores.

Evaluation. Different multi-stage architecture configurations are evaluated on the MS MARCO dataset, leading to the following observations:

  • Using a single stage of BM25 yields the worst performance.

  • Adding monoBERT significantly improves the performance over the single-stage BM25 architecture.

  • Adding the third component, duoBERT, yields only a diminishing gain.

Further, the authors found that the technique of Target Corpus Pre-training (TCP) gives an additional performance gain. Specifically, the BERT backbone undergoes a two-phase pre-training. In the first phase, the model is pre-trained with the original setup, that is, Wikipedia (2.5B words) and the Toronto Book corpus (0.8B words) for one million iterations. In the second phase, the model is further pre-trained on the MS MARCO corpus. The results are summarized below.

| Method | Dev | Eval |
|---|---|---|
| Anserini (BM25) | 18.7 | 19.0 |
| + monoBERT | 37.2 | 36.5 |
| + monoBERT + duoBERT\(_{\text{MAX}}\) | 32.6 | - |
| + monoBERT + duoBERT\(_{\text{MIN}}\) | 37.9 | - |
| + monoBERT + duoBERT\(_{\text{SUM}}\) | 38.2 | 37.0 |
| + monoBERT + duoBERT\(_{\text{BINARY}}\) | 38.3 | - |
| + monoBERT + duoBERT\(_{\text{SUM}}\) + TCP | 39.0 | 37.9 |

22.4.5. Advanced Architectures#

22.4.5.1. DC-BERT#

One way to improve the computational efficiency of cross-encoders is to employ bi-encoders for separate partial encoding, followed by an additional shallow module for cross encoding. One example is the architecture shown in Fig. 22.14, called DC-BERT and proposed in [NZG+20]. The overall architecture of DC-BERT consists of a dual-BERT component for decoupled encoding, a Transformer component for question-document interactions, and a binary classifier component for document relevance scoring.

The document encoder can be run offline to pre-encode all documents and cache their term representations. During online inference, we only need to run the BERT query encoder. The obtained contextual term representations are then fed into the upper Transformer interaction layers.

../../_images/DC_bert.png

Fig. 22.14 The overall architecture of DC-BERT consists of a dual-BERT component for decoupled encoding, a Transformer component for question-document interactions, and a classifier component for document relevance scoring.#

Dual-BERT component. DC-BERT contains two pre-trained BERT models to independently encode the question and each retrieved document. During training, the parameters of both BERT models are fine-tuned to optimize the learning objective.

Transformer component. The dual-BERT components produce contextualized embeddings for both the query token sequence and the document token sequence. We then add global position embeddings \(\mathbf{E}_{P_{i}} \in \mathbb{R}^{d}\) and segment embeddings to re-encode the position and segment information (i.e., query vs. document). Both the global position and segment embeddings are initialized from pre-trained BERT and fine-tuned. The number of Transformer layers \(K\) is configurable to trade off model capacity against efficiency. The Transformer layers are initialized with the last \(K\) layers of pre-trained BERT and are fine-tuned during training.

Classifier component. The two [CLS] token outputs from the Transformer layers are fed into a linear binary classifier to predict whether the retrieved document is relevant to the query. Following previous work (Das et al., 2019; Htut et al., 2018; Lin et al., 2018), DC-BERT employs paragraph-level distant supervision to gather labels for training the classifier, where a paragraph containing the exact ground-truth answer span is labeled as a positive example. The binary classifier is parameterized as an MLP layer on top of the Transformer layers:

\[ p\left(q_{i}, d_{j}\right)=\sigma\left(\operatorname{Linear}\left(\left[o_{[C L S]} ; o_{[C L S]}^{\prime}\right]\right)\right) \]

where \(\left(q_{i}, d_{j}\right)\) is a pair of question and retrieved document, and \(o_{[C L S]}\) and \(o_{[C L S]}^{\prime}\) are the Transformer output encodings of the [CLS] tokens of the question and the document, respectively. The MLP parameters are updated by minimizing the cross-entropy loss.
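A minimal PyTorch sketch of this classifier component, assuming a hidden size of 768 and that the two [CLS] outputs have already been produced by the interaction Transformer:

```python
import torch
import torch.nn as nn

class RelevanceClassifier(nn.Module):
    """Binary classifier over the concatenated [CLS] outputs of question and document."""
    def __init__(self, hidden_size=768):
        super().__init__()
        self.linear = nn.Linear(2 * hidden_size, 1)

    def forward(self, o_cls_question, o_cls_doc):
        # o_cls_question, o_cls_doc: (batch, hidden_size) outputs of the two [CLS] tokens
        logits = self.linear(torch.cat([o_cls_question, o_cls_doc], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)          # p(q_i, d_j)

clf = RelevanceClassifier()
p = clf(torch.randn(4, 768), torch.randn(4, 768))
loss = nn.functional.binary_cross_entropy(p, torch.ones(4))  # toy distant-supervision labels
print(p.shape, loss.item())
```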

In the following comparison, DC-BERT uses one Transformer layer for question-document interactions; Quantized BERT is an 8-bit integer model; DistilBERT is a compact BERT model with 2 Transformer layers.

We first compare retriever speed. DC-BERT achieves over 10x speedup over the BERT-base retriever, which demonstrates the efficiency of the method. Quantized BERT has the same model architecture as BERT-base, leading to minimal speedup. DistilBERT achieves about 6x speedup with only 2 Transformer layers, compared to the 12 Transformer layers of BERT-base.

With a 10x speedup, DC-BERT still achieves retrieval performance similar to BERT-base on both datasets. Despite little speedup, Quantized BERT also ranks documents well. DistilBERT performs significantly worse than BERT-base, which shows the limitation of the distilled BERT model.

| Model | SQuAD P@10 | SQuAD Speedup | Natural Questions P@10 | Natural Questions Speedup |
|---|---|---|---|---|
| BERT-base | 71.5 | 1.0x | 65.0 | 1.0x |
| Quantized BERT | 68.0 | 1.1x | 64.3 | 1.1x |
| DistilBERT | 56.4 | 5.7x | 60.6 | 5.7x |
| DC-BERT | 70.1 | 10.3x | 63.5 | 10.3x |

To further investigate the impact of the DC-BERT architecture design, we compare the performance of DC-BERT and its variants, including 1) DC-BERT-Linear, which uses linear layers instead of Transformers for interaction; and 2) DC-BERT-LSTM, which uses LSTM and bi-linear layers for interactions following previous work (Min et al., 2018). The results are reported in the table below. Due to the simplistic architecture of its interaction layers, DC-BERT-Linear achieves the best speedup but suffers a significant performance drop, while DC-BERT-LSTM achieves slightly worse performance and speedup than DC-BERT.

| Retriever Model | Retriever P@10 | Retriever Speedup |
|---|---|---|
| DC-BERT-Linear | 57.3 | 43.6x |
| DC-BERT-LSTM | 61.5 | 8.2x |
| DC-BERT | 63.5 | 10.3x |

22.4.5.2. ColBERT#

ColBERT [KZ20] is another example architecture that consists of an early separate encoding phase and a late interaction phase, as shown in Fig. 22.15. ColBERT employs a single BERT model for both the query and document encoders but distinguishes input sequences corresponding to queries and documents by prepending a special token [Q] to queries and a token [D] to documents.

../../_images/Col_bert.png

Fig. 22.15 The architecture of ColBERT, which consists of an early separate encoding phase and a late interaction phase.#

The query encoder takes query tokens as input. If a query is shorter than a pre-defined length \(N_q\), it is padded with BERT’s special [mask] tokens up to length \(N_q\); otherwise, only the first \(N_q\) tokens are kept. The mask token padding is found to serve as a form of query augmentation and brings performance gains. In addition, a [Q] token is placed right after BERT’s sequence start token [CLS]. The query encoder then computes contextualized representations for the query tokens.

The document encoder has a very similar architecture. A [D] token is placed right after BERT’s sequence start token [CLS]. Note that after passing through the encoder, embeddings corresponding to punctuation symbols are filtered out.

Given BERT’s representation of each token, an additional linear layer with no activation is used to reduce the dimensionality. The reduced dimension \(m\) is set to be much smaller than BERT’s fixed hidden dimension.

Finally, given \(q= q_{1} \ldots q_{l}\) and \(d=d_{1} \ldots d_{n}\), an additional CNN layer is used to allow each embedding vector to interact with its neighbors, yielding the bags of embeddings \(E_{q}\) and \(E_{d}\) as follows.

\[\begin{split}\begin{align*} &E_{q}:=\operatorname{Normalize}\left(\operatorname{CNN}\left(\operatorname{BERT}\left([Q] q_{0} q_{1} \ldots q_{l} \# \# \ldots \#\right)\right)\right) \\ &E_{d}:=\operatorname{Filter}\left(\operatorname{Normalize}\left(\operatorname{CNN}\left(\operatorname{BERT}\left([D] d_{0} d_{1} \ldots d_{n} \right)\right)\right)\right) \end{align*}\end{split}\]

Here # refers to the [mask] tokens and \(\operatorname{Normalize}\) denotes \(L_2\) length normalization.

In the late interaction phase, every query embedding interacts with all document embeddings via a MaxSimilarity operator, which computes maximum similarity (e.g., cosine similarity), and the scalar outputs of these operators are summed across query terms.

Formally, the final similarity score between the \(q\) and \(d\) is given by

\[ S_{q, d} =\sum_{i \in I_q} \max _{j \in I_d} E_{q_{i}} \cdot E_{d_{j}}^{T}, \]

where \(I_q = \{1,...,l\}\), \(I_d = \{1, ..., n\}\) are the index sets for query token embeddings and document token embeddings, respectively. ColBERT is differentiable end-to-end and we can fine-tune the BERT encoders and train from scratch the additional parameters (i.e., the linear layer and the \([Q]\) and \([D]\) markers’ embeddings). Notice that the final aggregation interaction mechanism has no trainable parameters.
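The late-interaction scoring itself is only a few lines of tensor algebra. Below is a minimal PyTorch sketch of the MaxSim aggregation, under the assumption that query and document token embeddings are already L2-normalized; shapes and function names are illustrative.

```python
import torch

def colbert_score(E_q: torch.Tensor, E_d: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    E_q: [num_query_tokens, dim]  L2-normalized query token embeddings
    E_d: [num_doc_tokens, dim]    L2-normalized document token embeddings
    Returns a scalar score: sum over query tokens of the max cosine
    similarity against all document tokens.
    """
    sim = E_q @ E_d.T                   # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=1).values.sum()  # max over doc tokens, sum over query tokens

# Example with random (normalized) embeddings:
E_q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)
E_d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)
print(colbert_score(E_q, E_d))
```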

The retrieval performance of ColBERT is evaluated on the MS MARCO dataset. Compared with traditional exact term matching retrieval methods, ColBERT has higher latency but significantly better MRR.

| Method | MRR@10 (Dev) | MRR@10 (Local Eval) | Latency (ms) | Recall@50 |
|---|---|---|---|---|
| BM25 (official) | 16.7 | - | - | - |
| BM25 (Anserini) | 18.7 | 19.5 | 62 | 59.2 |
| doc2query | 21.5 | 22.8 | 85 | 64.4 |
| DeepCT | 24.3 | - | 62 (est.) | 69 |
| docTTTTTquery | 27.7 | 28.4 | 87 | 75.6 |
| ColBERT L2 (re-rank) | 34.8 | 36.4 | - | 75.3 |
| ColBERT L2 (end-to-end) | 36.0 | 36.7 | 458 | 82.9 |

Similarly, we can evaluate ColBERT’s re-ranking performance against strong baselines such as BERT cross-encoders [NC19, NYCL19]. ColBERT demonstrates significant latency reduction at little cost in re-ranking performance.

| Method | MRR@10 (Dev) | MRR@10 (Eval) | Re-ranking Latency (ms) |
|---|---|---|---|
| BM25 (official) | 16.7 | 16.5 | - |
| KNRM | 19.8 | 19.8 | 3 |
| Duet | 24.3 | 24.5 | 22 |
| fastText+ConvKNRM | 29.0 | 27.7 | 28 |
| BERT base | 34.7 | - | 10,700 |
| BERT large | 36.5 | 35.9 | 32,900 |
| ColBERT (over BERT base) | 34.9 | 34.9 | 61 |

22.4.5.3. Multi-Attribute and Multi-task Modeling#

We can extend the cross-encoder to take into account multiple attributes from the query side and the document side, as well as to generate multiple predictive outputs for different tasks [Fig. 22.16].

For example, query side attributes could include

  • Query text

  • Query’s language (produced by a cheaper language detection model)

  • User’s location and region

  • Other query side signals (e.g., key concept groups in the query, document signals from historical queries)

Document side attributes could include

  • Organic contents with semantic markers (e.g., [T] for Title)

  • Other derived signals from documents (e.g., pseudo queries, historical queries, etc.)

Other high-level signals suitable for late-stage fusion include

  • Document freshness attributes (for intents seeking the latest news)

  • Document spamness attributes

After feature fusion (e.g., via concatenation), we can attach a separate MLP head for each task, as sketched in the code example after the figure below.

../../_images/multitask_multiattribute_arch.png

Fig. 22.16 A representative cross-encoder extended to take into account multiple attributes from the query side and the document side. There are multiple outputs for multi-tasking.#
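Below is a minimal sketch of such an extended scoring head, assuming the fused features are the concatenation of a shared text-encoder output with embedded query-side and document-side attributes; all module names, dimensions, and the two task heads (relevance and spam) are illustrative assumptions, not the exact architecture in Fig. 22.16.

```python
import torch
import torch.nn as nn

class MultiAttributeMultiTaskHead(nn.Module):
    """Illustrative multi-attribute, multi-task scoring head."""

    def __init__(self, text_dim=768, attr_dim=32, num_langs=100, num_regions=300):
        super().__init__()
        # Cheap categorical attributes are embedded and fused with text features.
        self.lang_emb = nn.Embedding(num_langs, attr_dim)
        self.region_emb = nn.Embedding(num_regions, attr_dim)
        fused_dim = text_dim + 2 * attr_dim + 2  # +2 for freshness/spamness scalars
        self.shared = nn.Sequential(nn.Linear(fused_dim, 256), nn.ReLU())
        # Separate MLP heads, one per task.
        self.relevance_head = nn.Linear(256, 1)
        self.spam_head = nn.Linear(256, 1)

    def forward(self, text_feat, lang_id, region_id, freshness, spamness):
        # text_feat: [batch, text_dim] cross-encoder output over query + doc text
        fused = torch.cat(
            [text_feat, self.lang_emb(lang_id), self.region_emb(region_id),
             freshness.unsqueeze(-1), spamness.unsqueeze(-1)], dim=-1)
        h = self.shared(fused)
        return self.relevance_head(h).squeeze(-1), self.spam_head(h).squeeze(-1)
```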

22.5. Ranker Training#

22.5.1. Overview#

Unlike in classification or regression, the main goal of a ranker [AWB+19] is not to assign a label or a value to individual items, but to produce an ordering of a list of items in such a way that the utility of the entire list is maximized.

In other words, in ranking we are more concerned with the relative ordering of items than with predicting the numerical value or label of an individual item.

Pointwise ranking transforms the ranking problem into a regression problem. Given a query, ranking amounts to

  • Predicting the relevance score between each candidate document and the query;

  • Ordering the document list based on these relevance scores.

Pairwise ranking [Bur10], instead of predicting the absolute relevance, learns to predict the relative order of documents. This method is particularly useful in scenarios where the absolute relevance scores are less important than the relative ordering of items.

Listwise ranking [CQL+07] considers the entire list of items simultaneously when training a ranking model. Unlike pairwise or pointwise methods, listwise ranking directly optimizes ranking metrics such as NDCG or MAP, which better aligns the training objective with the evaluation criteria used in information retrieval tasks. This approach can capture more complex relationships between items in a list and often leads to better performance in real-world ranking scenarios, though it may be computationally more expensive than other ranking methods.

22.5.2. Training Data#

In a typical model learning setting, we construct training data from user search logs, which contain queries issued by users and the documents they clicked after issuing the query. The basic assumption is that a query and a document are relevant if the user clicked the document.

Model learning in information retrieval typically falls into the category of contrastive learning. The query and the clicked document form a positive example; the query and irrelevant documents form negative examples. For retrieval problems, it is often the case that positive examples are available explicitly, while negative examples must be selected from an extremely large pool. The strategy for selecting negative examples plays an important role in determining the quality of the encoders. In the simplest case, we randomly select unclicked documents as irrelevant documents, i.e., negative examples. We defer the discussion of advanced negative example selection strategies to Section 22.6.

When there is a shortage of annotation data or click behavior data, we can also leverage weakly supervised data for training [DZS+17, HG19, NSN18, RSL+21]. In the weakly supervised data, labels or signals are obtained from an unsupervised ranking model, such as BM25. For example, given a query, relevance scores for all documents can be computed efficiently using BM25. Documents with highest scores can be used as positive examples and documents with lower scores can be used as negatives or hard negatives.

22.5.3. Model Training Objective Functions#

22.5.3.1. Pointwise Regression Objective#

The idea of the pointwise regression objective is to model the numerical relevance score for a given query-document pair. At inference time, the relevance scores between a set of candidate documents and a given query can be predicted and ranked.

During training, we are given a set of query-document pairs \(\left(q_{i}, d_{i, j}\right)\) with corresponding relevance scores \(y_{i, j} \in [0, 1]\) and model predictions \(f(q_i,d_{i,j})\). A pointwise regression objective optimizes the model to predict the relevance scores by minimizing

\[L = \sum_{i} \sum_{j} \left(f(q_i,d_{i,j}) - y_{i,j}\right)^2. \]

Using a regression objective offers the flexibility to model different levels of relevance between queries and documents. However, such flexibility also comes with a requirement that the target relevance scores be accurate on an absolute scale. While human-annotated data might provide absolute relevance scores, human annotation is expensive and small in scale. On the other hand, absolute relevance scores approximated from click data can be noisy and less suitable for a regression objective. To make training robust to label noise, one can consider pairwise ranking objectives instead; this is particularly important in weak supervision scenarios with noisy labels. Using a ranking objective alleviates this issue by forcing the model to learn a preference function rather than reproduce absolute scores.

22.5.3.2. Pointwise Ranking Objective#

The idea of pointwise ranking objective is to simplify a ranking problem to a binary classification problem. Specifically, given a set of query-document pairs \(\left(q_{i}, d_{i, j}\right)\) and their corresponding relevance label \(y_{i, j} \in \{0, 1\}\), where 0 denotes irrelevant and 1 denotes relevant. A pointwise learning objective tries to optimize a model to predict the relevance label.

A commonly used pointwise loss functions is the binary Cross Entropy loss:

\[ L=-\sum_{i} \sum_{j} y_{i, j} \log \left(p\left(q_{i}, d_{i, j}\right)\right)+\left(1-y_{i, j}\right) \log \left(1-p\left(q_{i}, d_{i, j}\right)\right) \]

where \(p\left(q_{i}, d_{i, j}\right)\) is the predicted probability of document \(d_{i,j}\) being relevant to query \(q_i\).
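As a concrete sketch, the pointwise binary cross-entropy objective can be written in a few lines; here the relevance probabilities are assumed to come from any scoring model (bi- or cross-encoder) whose raw scores are passed through a sigmoid.

```python
import torch
import torch.nn.functional as F

def pointwise_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pointwise ranking as binary classification.

    logits: [num_pairs] raw relevance scores s(q_i, d_ij) from the model
    labels: [num_pairs] binary relevance labels y_ij in {0, 1}
    """
    return F.binary_cross_entropy_with_logits(logits, labels.float())

# Example: three query-document pairs, two relevant and one irrelevant.
logits = torch.tensor([2.1, -0.5, 1.3])
labels = torch.tensor([1, 0, 1])
print(pointwise_bce_loss(logits, labels))
```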

The advantages of pointwise ranking objectives are two-fold. First, pointwise ranking objectives are computed on each query-document pair \(\left(q_{i}, d_{i, j}\right)\) separately, which makes them simple and easy to scale. Second, the outputs of neural models learned with pointwise loss functions often have real meaning and value in practice. For instance, in sponsored search, the predicted relevance probability can be used in ad bidding, which in some application scenarios is more important than creating a good result list.

In general, however, pointwise ranking objectives are considered less effective in ranking tasks. Because pointwise loss functions consider no document preference or order information, they do not guarantee the best ranking list even when the model loss reaches its global minimum. Therefore, better ranking paradigms directly optimize document ranking based on pairwise and even listwise loss functions.

22.5.3.3. Pairwise Ranking via Triplet Loss#

Pointwise ranking losses optimize the model to directly predict the absolute relevance score between a query and a document. From an embedding optimization perspective, they train the neural query/document encoders to produce similar embedding vectors for a query and its relevant document, and dissimilar embedding vectors for a query and its irrelevant documents.

On the other hand, pairwise ranking objectives focus on optimizing the relative preferences between documents rather than predicting their relevance labels. In contrast to pointwise methods, where the final ranking loss is the sum of losses on individual documents, pairwise loss functions are computed over combinations of document pairs.

One of the most common pairwise ranking loss functions is the triplet loss. Let \(\mathcal{D}=\left\{\left\langle q_{i}, d_{i}^{+}, d_{i}^{-}\right\rangle\right\}_{i=1}^{m}\) be the training data organized into \(m\) triplets. Each triplet contains one query \(q_{i}\), one relevant document \(d_{i}^{+}\), and one irrelevant (negative) document \(d_{i}^{-}\). Negative documents are typically randomly sampled from a large corpus or are strategically constructed [Section 22.6]. Visualization of the learning process in the embedding space is shown in Fig. 22.17. Triplet loss helps guide the encoder networks to pull a query and its relevant document closer and push the query and irrelevant documents apart.

The loss function is given by

\[L = \sum_{\left\langle q_{i}, d_{i}^{+}, d_{i}^{-}\right\rangle}\max\left(0,\; m - \left(\operatorname{Sim}(q_i, d_i^+) - \operatorname{Sim}(q_i, d^-_i)\right)\right)\]

where \(\operatorname{Sim}(q, d)\) is the similarity score produced by the network between the query and the document and \(m\) is a hyper-parameter adjusting the margin. Clearly, if we would like to make \(L\) small, we need to make \(\operatorname{Sim}(q_i, d_i^+) - \operatorname{Sim}(q_i, d^-_i) > m\). Commonly used \(\operatorname{Sim}\) functions include dot product or Cosine similarity (i.e., length-normalized dot product), which are related to distance calculation in the Euclidean space and hyperspherical surface.
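A minimal sketch of this triplet (hinge) loss over a batch of (query, positive, negative) embeddings is shown below; cosine similarity and the margin value are illustrative choices.

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(q_emb, pos_emb, neg_emb, margin: float = 0.2):
    """Hinge-style triplet loss: encourage Sim(q, d+) - Sim(q, d-) > margin.

    q_emb, pos_emb, neg_emb: [batch, dim] embeddings from the bi-encoder.
    """
    sim_pos = F.cosine_similarity(q_emb, pos_emb, dim=-1)  # Sim(q_i, d_i^+)
    sim_neg = F.cosine_similarity(q_emb, neg_emb, dim=-1)  # Sim(q_i, d_i^-)
    return torch.clamp(margin - (sim_pos - sim_neg), min=0).mean()

# Example with random embeddings for a batch of 8 triplets.
q, dp, dn = (torch.randn(8, 128) for _ in range(3))
print(triplet_margin_loss(q, dp, dn))
```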

../../_images/triplet.png

Fig. 22.17 The illustration of the learning process (in the embedding space) using triplet loss.#

Triplet loss can also operate in the angular space, where the similarity is defined as

\[ \operatorname{sim}(q, d)=1-\arccos \left(\frac{\psi_{\beta}(q) \cdot \psi_{\alpha}(d)}{\left\|\psi_{\beta}(q)\right\|\left\|\psi_{\alpha}(d)\right\|}\right) / \pi \]

where \(\psi_{\beta}\) and \(\psi_{\alpha}\) denote the query and document encoders, respectively. The training objective is to score the positive example \(d^{+}\) at least a margin \(\mu\) higher than the negative one \(d^{-}\), using the triplet margin objective:

\[ l\left(q, d^{+}, d^{-}\right):=\max \left(0,\; \mu-\left(\operatorname{sim}\left(q, d^{+}\right)-\operatorname{sim}\left(q, d^{-}\right)\right)\right) \]

22.5.3.4. N-pair Loss#

Triplet loss optimizes the neural encoders by encouraging the positive pair \((q_i, d^+_i)\) to be more similar than the negative pair \((q_i, d^-_i)\). One improvement is to encourage \(q_i\) to be more similar to \(d^+_i\) than to \(n\) negative examples \( d_{i, 1}^{-}, \cdots, d_{i, n}^{-}\), instead of just one negative example. This is known as the N-pair loss [Soh16], and it is typically more robust than triplet loss.

Let \(\mathcal{D}=\left\{\left\langle q_{i}, d_{i}^{+}, D_i^-\right\rangle\right\}_{i=1}^{m}\), where \(D_i^- = \{d_{i, 1}^{-}, \cdots, d_{i, n}^{-}\}\) are a set of negative examples (i.e., irrelevant document) with respect to query \(q_i\), be the training data that consists of \(m\) examples. Each example contains one query \(q_{i}\) and one relevant document \(d_{i}^{+}\), along with \(n\) irrelevant (negative) documents \(d_{i, j}^{-}\). The \(n\) negative documents are typically randomly sampled from a large corpus or are strategically constructed [Section 22.6].

Visualization of the learning process in the embedding space is shown in Fig. 22.18. Like triplet loss, N-pair loss helps guide the encoder networks to pull a query and its relevant document closer and push the query and irrelevant documents apart. Besides that, when multiple negatives are involved in the N-pair loss, their mutual repulsion appears to help the model learn more uniformly distributed embeddings [WI20].

The loss function is given by

\[L =-\sum_{\left\langle q_{i}, d_{i}^{+}, D_{i}^{-}\right\rangle}\log \frac{\exp(\operatorname{Sim}\left(e_{q_{i}}, e_{d_{i}^{+}}\right))}{\exp(\operatorname{Sim}\left(e_{q_{i}}, e_{d_{i}^{+}}\right))+\sum_{d^-_i\in D^-} \exp(\operatorname{Sim}\left(e_{q_{i}}, e_{d_{i}^{-}}\right))}\]

where \(\operatorname{Sim}(e_q, e_d)\) is the similarity score function taking query embedding \(e_q\) and document embedding \(e_d\) as the input.
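The N-pair objective is essentially a softmax cross-entropy over the similarity scores of one positive and \(n\) negatives per query. A minimal sketch, assuming dot-product similarity and pre-computed embeddings:

```python
import torch
import torch.nn.functional as F

def n_pair_loss(q_emb, pos_emb, neg_emb):
    """N-pair (softmax contrastive) loss.

    q_emb:   [batch, dim]        query embeddings e_{q_i}
    pos_emb: [batch, dim]        positive document embeddings e_{d_i^+}
    neg_emb: [batch, n_neg, dim] negative document embeddings e_{d_{i,j}^-}
    """
    sim_pos = (q_emb * pos_emb).sum(-1, keepdim=True)        # [batch, 1]
    sim_neg = torch.einsum("bd,bnd->bn", q_emb, neg_emb)      # [batch, n_neg]
    logits = torch.cat([sim_pos, sim_neg], dim=1)             # positive at index 0
    targets = torch.zeros(q_emb.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)

# Example: batch of 4 queries, each with 1 positive and 7 negatives.
q, dp, dn = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 7, 128)
print(n_pair_loss(q, dp, dn))
```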

../../_images/N_pair_loss.png

Fig. 22.18 The illustration of the learning process (in the embedding space) using N-pair loss.#

22.5.3.5. N-pair Dual Loss#

../../_images/N_pair_loss_dual.png

Fig. 22.19 The illustration of the learning process (in the embedding space) using N-pair dual loss.#

The N-pair loss uses the query as the anchor to adjust the distribution of document vectors in the embedding space. Authors in [LLXL21] proposed that the document can also serve as the anchor to adjust the distribution of query vectors in the embedding space. This leads to a loss function consisting of two parts

\[\begin{split}\begin{align*} L &= L_{prime} + L_{dual} \\ L_{prime} &=-\sum_{\left\langle q_{i}, d_{i}^{+}, D_{i}^{-}\right\rangle}\log \frac{\exp(\operatorname{Sim}\left(e_{q_{i}}, e_{d_{i}^{+}}\right))}{\exp(\operatorname{Sim}\left(e_{q_{i}}, e_{d_{i}^{+}}\right))+\sum_{d^-_i\in D^-} \exp(\operatorname{Sim}\left(e_{q_{i}}, e_{d_{i}^{-}}\right))} \\ L_{dual} &=-\sum_{\left\langle d_{i}, q_{i}^{+}, Q_{i}^{-}\right\rangle}\log \frac{\exp(\operatorname{Sim}\left(e_{d_{i}}, e_{q_{i}^{+}}\right))}{\exp(\operatorname{Sim}\left(e_{d_{i}}, e_{q_{i}^{+}}\right))+\sum_{q^-_i\in Q^-} \exp(\operatorname{Sim}\left(e_{d_{i}}, e_{q_{i}^{-}}\right))} \end{align*}\end{split}\]

where \(\operatorname{Sim}(e_q, e_d)\) is a symmetric similarity score function for the query and the document embedding vectors, \(L_{prime}\) is the N-pair loss, and \(L_{dual}\) is the N-pair dual loss.

To compute the dual loss, we need to prepare training data \(\mathcal{D}_{dual}=\left\{\left\langle d_{i}, q_{i}^{+}, Q_i^-\right\rangle\right\}_{i=1}^{m}\), where \(Q_i^- = \{q_{i, 1}^{-}, \cdots, q_{i, n}^{-}\}\) is a set of negative query examples (i.e., irrelevant queries) with respect to document \(d_i\). Each example contains one document \(d_{i}\) and one relevant query \(q_{i}^{+}\), along with \(n\) irrelevant (negative) queries \(q_{i, j}^{-}\).

22.5.3.6. Doc-Doc N-pair Loss#

../../_images/N_pair_doc_doc.png

Fig. 22.20 The illustration of the learning process (in the embedding space) using Doc-Doc N-pair loss.#

Besides using the above prime and dual losses to capture robust query-document relationships, we can also improve the robustness of document representations by considering doc-doc relations. Particularly,

  • When there are multiple positive documents associated with the same query, we use a loss function that encourages their embeddings to stay close.

  • For positive and negative documents associated with the same query, we use a loss function that encourages their embeddings to stay far apart.

The loss function is given by

\[L =-\sum_{\left\langle q_{i}, d_{i}^{+}, d_{i'}^{+} \in D_{i}^{+}, D_{i}^{-}\right\rangle}\log \frac{\exp(\operatorname{Sim}\left(e_{d_{i}^{+}}, e_{d_{i'}^{+}}\right))}{\exp(\operatorname{Sim}\left(e_{d_{i}^{+}}, e_{d_{i'}^{+}}\right))+\sum_{d^-\in D_i^-} \exp(\operatorname{Sim}\left(e_{d_{i}^{+}}, e_{d^{-}}\right))}\]

where \(\operatorname{Sim}(e_{d_1}, e_{d_2})\) is the similarity score function taking document embeddings \(e_{d_1}\) and \(e_{d_2}\) as the input.

22.6. Training Data Sampling Strategies#

22.6.1. Principles#

From the ranking perspective, both retrieval and re-ranking require producing an order over the input samples. For example, given a query \(q\) and a set of candidate documents \((d_1,...,d_N)\), we need the model to produce an ordered list \(d_2 \succ d_3 \succ \cdots \succ d_k\) according to relevance to the query.

To train a model to produce the expected results during inference, we should ensure that the training data distribution matches the inference-time data distribution. In particular, at inference time, the candidate document distribution and ranking granularity differ vastly between retrieval tasks and re-ranking tasks [Fig. 22.21]. Specifically,

  • For the retrieval task, we typically need to identify top \(k (k=1000-10000)\) relevant documents from the entire document corpus. This is achieved by ranking all documents in the corpus with respect to the relevance of the query.

  • For the re-ranking task, we need to identify the top \(k (k=10)\) most relevant documents from the relevant documents generated by the retrieval task.

Clearly, the features most useful in the retrieval task (i.e., distinguishing relevant from irrelevant) are often not the same as the features most useful in the re-ranking task (i.e., distinguishing the most relevant from the less relevant). Therefore, the training samples for retrieval and re-ranking need to be constructed differently.

../../_images/retrieval_reranking_task.png

Fig. 22.21 Retrieval tasks and re-ranking tasks face different candidate document distributions and ranking granularities.#

Constructing the proper training data distribution is more challenging for the retrieval stage than for the re-ranking stage. In the re-ranking stage, the data in the training and inference phases are both documents from previous retrieval stages. In the retrieval stage, we need to construct training examples in a mini-batch fashion such that each batch approximates the inference-phase distribution as closely as possible.

This section mainly focuses on constructing training examples for retrieval model training in an efficient and effective way. Since negative examples (i.e., irrelevant documents) significantly outnumber positive examples, constructing training examples largely boils down to constructing proper negative examples.

22.6.2. Negative Sampling Methods I: Heuristic Methods#

22.6.2.1. Random Negatives and In-batch Negatives#

Random negative sampling is the most basic negative sampling algorithm. The algorithm uniformly samples documents from the corpus and treats them as negatives. Clearly, random negatives can produce negatives that are too easy for the model, for example, a negative document that is topically different from the query. These easy negatives lower the learning efficiency; that is, each batch produces limited information gain to update the model. Still, random negatives are widely used because of their simplicity.

In practice, random negatives are implemented as in-batch negatives. In the contrastive learning framework with N-pair loss [Section 22.5.3.4], we construct a mini-batch of query-doc examples like \(\{(q_1, d_1^+, d_{1,1}^-, \ldots, d_{1,M}^-), \ldots, (q_N, d_N^+, d_{N,1}^-, \ldots, d_{N,M}^-)\}\). Naively implementing the N-pair loss would increase the computational cost of constructing sufficient negative documents for each query. In-batch negatives [KOuguzM+20] is a trick that reuses the positive documents associated with other queries as extra negatives [Fig. 22.22]. The critical assumption here is that queries in a mini-batch are semantically very different, so that positive documents from other queries can be confidently used as negatives. The assumption largely holds: since each mini-batch is randomly sampled from the set of all training queries, in-batch negative documents are usually true negatives, although they might not be hard negatives.

../../_images/in_batch_negatives.png

Fig. 22.22 The illustration of using in-batch negatives in contrastive learning.#

Specifically, assume that we have \(N\) queries in a mini-batch and each one is associated with a relevant positive document. By using the positive documents of other queries, each query gains an additional \(N - 1\) negatives.

Formally, we can define our batch-wise loss function as follows:

\[ \mathcal{L}:=\sum_{1 \leq i \leq N}\left(\sum_{1 \leq j \leq N} l\left(q_{i}, d_{i}^{+}, d_{j}^{-}\right)+\sum_{1 \leq k \leq N, k \neq i} l\left(q_{i}, d_{i}^{+}, d_{k}^{+}\right)\right) \]

where \(l\left(q_{i}, d_{i}^{+}, d_{j}^{-}\right)\) is the loss function for a triplet.
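The in-batch negative trick can be implemented with a single matrix multiplication between the batch of query embeddings and the batch of positive document embeddings, treating off-diagonal entries as negatives. A minimal sketch, assuming dot-product similarity:

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """Contrastive loss with in-batch negatives.

    q_emb: [N, dim] query embeddings
    d_emb: [N, dim] positive document embeddings; d_emb[j] (j != i) serves
           as a negative for query i.
    """
    scores = q_emb @ d_emb.T                 # [N, N] similarity matrix
    targets = torch.arange(q_emb.size(0))    # the diagonal holds the positives
    return F.cross_entropy(scores, targets)

# Example: a mini-batch of 16 (query, relevant document) embedding pairs.
q, d = torch.randn(16, 128), torch.randn(16, 128)
print(in_batch_negative_loss(q, d))
```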

In-batch negatives offer an efficient implementation of random negatives. Another way to mitigate the inefficient learning issue is to simply use a large batch size (>4,000) [Section 22.6.3.3, Large-Scale Negatives].

22.6.2.2. Popularity-based Negative Sampling#

Popularity-based negative sampling uses document popularity as the sampling weight to sample negative documents. The popularity of a document can be defined as some combination of clicks, dwell time, quality, etc. Compared to random negative sampling, this algorithm replaces the uniform distribution with a popularity-based sampling distribution, which can be pre-computed offline.

The major rationale for using popularity-based negative examples is to improve representation learning. A popular negative document represents a harder negative than an unpopular one, since it tends to have a higher chance of being more relevant, that is, lying closer to the query in the embedding space. If the model is trained to distinguish these harder cases, the learned representations will likely improve overall.

Popularity-based negative sampling is also used in word2vec training [MSC+13]. For example, the probability to sample a word \(w_i\) is given by:

\[ P\left(w_i\right)=\frac{f\left(w_i\right)^{3 / 4}}{\sum_{j=0}^n\left(f\left(w_j\right)^{3 / 4}\right)}, \]

where \(f(w)\) is the frequency of word \(w\). This equation, compared to linear popularity, has the tendency to increase the probability for less frequent words and decrease the probability for more frequent words.
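The sketch below computes such a sub-linear (3/4-power) sampling distribution from raw popularity counts and draws negatives from it; the popularity counts are illustrative.

```python
import numpy as np

def popularity_sampling_probs(counts: np.ndarray, power: float = 0.75) -> np.ndarray:
    """Sub-linear popularity-based sampling distribution (word2vec-style)."""
    weights = counts.astype(np.float64) ** power
    return weights / weights.sum()

# Example: popularity counts (e.g., clicks) for 5 candidate negative documents.
counts = np.array([1000, 200, 50, 10, 1])
probs = popularity_sampling_probs(counts)
negatives = np.random.choice(len(counts), size=3, replace=False, p=probs)
print(probs, negatives)
```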

22.6.2.3. Topic-aware Negative Sampling#

In-batch random negatives often consist of topically different documents, leaving little information gain for training. To improve the information gain from a single random batch, we can constrain the queries and their relevant documents in a batch to be drawn from a similar topic [HofstatterLY+21].

The procedures are

  • Cluster queries using query embeddings produced by a basic query encoder.

  • Sample queries and their relevant documents from a randomly picked cluster. The relevant document of one query forms a negative for the other queries in the batch.

Since queries are topically similar, the formed in-batch negatives are harder examples than randomly formed in-batch negatives, therefore delivering more information gain per batch.

Note that here we group queries into clusters by their embedding similarity, which allows grouping queries without lexical overlap. We can also consider lexical similarity between queries as additional signals to predict query similarity.

22.6.3. Negative Sampling Methods II: Model-based Methods#

22.6.3.1. Static Hard Negative Examples#

A deep model improves the encoded representations of queries and documents via contrastive learning, in which the model learns to distinguish positive examples from negative examples. A simple random sampling strategy tends to produce a large quantity of easy negative examples, since easy negative examples make up the majority of negative examples. Here, by easy negative examples, we mean documents that can be easily judged to be irrelevant to the query; for example, the document and the query are on completely different topics.

Model learning from easy negative examples can quickly plateau since easy negatives produce vanishing gradients for updating the model. An improvement strategy is to supply additional hard negatives alongside randomly sampled negatives. In the simplest case, hard negatives can be selected based on a traditional BM25 model [KOuguzM+20, NC19] or another efficient retriever: hard negatives are documents that have a high relevance score to the query but are not actually relevant.

As illustrated in Fig. 22.23, a model trained only with easy negatives can fail to distinguish fairly relevant documents from irrelevant examples; on the other hand, a model trained with some hard negatives can learn better representations:

  • Positive document embeddings are more aligned [WI20]; that is, they lie closer to each other.

  • Fairly relevant and irrelevant documents are more separated in the embedding space, yielding a better decision boundary between relevant and irrelevant documents.

../../_images/Impact_hard_negative_on_retrieval.png

Fig. 22.23 Illustration of the importance of hard negative examples, which help learn better representations to distinguish irrelevant from fairly relevant documents.#

In generating these negative examples, the negative-generation model (e.g., BM25) and the model under training are decoupled; that is, the negative-generation model is not updated during training and the hard examples are static. Despite this simplicity, static hard negative examples introduce two shortcomings:

  • Distribution mismatch: the negatives generated by the static model may quickly become less hard since the target model is constantly evolving.

  • Higher false-negative risk: the generated negatives have a higher chance of being false negatives to the target model because the negative-generation model and the target model are two different models.

22.6.3.2. Dynamic Hard Negative Examples#

Dynamic hard negative mining is a scheme proposed in ANCE [XXL+20]. The core idea is to use the target model at a previous checkpoint as the negative-generation model [Fig. 22.24]. However, this negative mining approach is computationally demanding since the corpus index needs to be updated at every checkpoint.

../../_images/ANCE_negative_sampling_demo.png

Fig. 22.24 Dynamic hard negative sampling from the ANCE asynchronous training framework. Negatives are drawn from an index produced using the model at the previous checkpoint. Image from [XXL+20].#

22.6.3.3. Large-Scale Negatives#

In-batch negatives offer an efficient way to construct many negatives during training. During multi-GPU training (exemplified by RocketQA [QDL+20]), in-batch negatives can be generalized to cross-batch negatives.

Specifically,

  • We first compute the document embeddings within each single GPU, and then share these document embeddings among all the GPUs.

  • Besides in-batch negatives, all document representations from other GPUs are used as additional negatives for each query.

../../_images/cross_batch_negatives.png

Fig. 22.25 Comparison of in-batch negatives and cross-batch negatives during multi-GPU training. Image from [QDL+20].#

Even in the single-GPU training setting, we can leverage a queue to construct large-scale negatives (as in MoCo [CFGH20]). The key idea is that instead of discarding embeddings from previous batches, we can store them in a rolling queue and use them as additional negatives for the current batch.

22.6.3.4. Hard Positives#

Retrieval model training data is usually dominated by easy positives, that is, query-document pairs in which all query terms are exactly matched in the relevant document. In a hybrid retrieval system, where the goal of dense retrieval is to complement sparse retrieval (which relies on exact term matching), it is beneficial to enrich training samples with hard positives, that is, query-document pairs in which not all query terms, particularly important query terms, are exactly matched. With easy and hard positives, we can design curriculum learning to help the model improve its semantic retrieval ability.

22.6.4. Label Denoising#

22.6.4.1. False Negatives#

Hard negative examples produced from static or dynamic negative mining methods are effective for improving the encoder’s performance. However, when selecting hard negatives with a less powerful model (e.g., BM25), we also run the risk of introducing more false negatives (i.e., negative examples that are actually positive) than a random sampling approach. Authors in [QDL+20] proposed to utilize a well-trained, complex model (e.g., a cross-encoder) to determine whether an initially retrieved hard negative is a false negative. Such models are more powerful for capturing semantic similarity between queries and documents. Although they are less ideal for deployment and inference purposes due to high computational cost, they are suitable for filtering. From the initially retrieved hard-negative documents, we can filter out documents that are actually relevant to the query. The resulting documents can be used as denoised hard negatives.

22.6.4.2. False Positives#

Because of the noise in the labeling process (e.g., based on click data), it is also possible that a positively labeled document turns out to be irrelevant. To reduce false positive examples, one can develop a more robust labeling process and merge labels from multiple sources of signals.

22.7. Knowledge Distillation#

22.7.1. Introduction#

Knowledge distillation aims to transfer knowledge from a well-trained, high-performing yet cumbersome teacher model to a lightweight student model without significant performance loss. Knowledge distillation has been a widely adopted method to achieve efficient neural network architectures, thus reducing overall inference costs, including memory requirements as well as inference latency. Typically, the teacher model can be an ensemble of separately trained models or a single very large model trained with a very strong regularizer such as dropout. The student model uses the distilled knowledge from the teacher network as additional learning cues. The resulting student model is computationally inexpensive and achieves better accuracy than training it directly from scratch.

As such, for retrieval and ranking systems, knowledge distillation is a desirable approach to develop efficient models that meet high requirements on both accuracy and latency.

For example, one can distill knowledge from a more powerful cross-encoder (e.g., the BERT cross-encoder in Section 22.4.4.2) to a computationally efficient bi-encoder. Empirically, this two-step procedure can be more effective than directly training a bi-encoder from scratch.

In this section, we first review the principle of knowledge distillation. Then we go over a couple examples to demonstrate the application of knowledge distillation in developing retrieval and ranking models.

22.7.2. Knowledge Distillation Training Framework#

In the classic knowledge distillation framework [HVD15, TLL+19] [Fig. 22.26], the fundamental principle is that the teacher model produces a soft label \(q\) for each input feature \(x\). The soft label \(q\) can be viewed as a softened probability vector distributed over the class labels of interest. Soft targets contain valuable information on the rich similarity structure over the data. Using MNIST classification as an example, a reasonable soft target tells us that a 2 looks more like a 3 than a 9. These soft targets can be viewed as a strategy to mitigate the over-confidence issue and reduce gradient variance when we train neural networks with one-hot hard labels. A similar mechanism is leveraged in label smoothing to improve model generalization.

This allows the smaller student model to be trained on much less data than the original cumbersome model, and with a much higher learning rate.

Specifically, the logits \(z\) from the teacher model are used to generate soft labels via

\[ q_{i}^T=\frac{\exp \left(z_{i} / T\right)}{\sum_{j} \exp \left(z_{j} / T\right)}, \]

where \(T\) is the temperature parameter controlling the softness of the probability vector, and the sum is over the entire label space. When \(T=1\), this is the standard Softmax function. As \(T\) grows, \(q\) becomes softer and approaches the uniform distribution as \(T\to\infty\). On the other hand, as \(T\to 0\), \(q\) approaches a one-hot hard label.

../../_images/teacher_student_distillation_scheme.png

Fig. 22.26 The classic teacher-student knowledge distillation framework.#

The loss function for the student network training is a weighted sum of the hard label based cross entropy loss and soft label based KL divergence. The rationale of KL (Kullback-Leibler) divergence is to use the softened probability vector from the teacher model to guide the learning of the student network. Minimizing the KL divergence constrains the student model’s probabilistic outputs to match soft targets of the teacher model.

The loss function is formally given by

\[L =(1-\alpha) L_{C E}\left(p, y\right)+\alpha T^{2} {L}_{K L}\left(p^T, q^T\right)\]

where \(L_{CE}\) is the regular cross entropy loss between predicted probability vector \(p\) and the one-hot label vector

\[{L}_{CE}\left(p, y\right) =-\sum_{j}{y}_{j} \log {p}_{j};\]

\(L_{KL}\) is the KL divergence loss between the softened predictions at temperature \(T\) from the student and the teacher networks, respectively:

\[{L}_{KL}\left({p}^T, q^{T}\right) = -\sum_{j} {q}_{j}^{T} \log \frac{p_{j}^T}{{q}_{j}^{T}}.\]

Note that the same high temperature is used to produce distributions from the student model.

Note that \(L_{KL}(p^T, q^T) = L_{CE}(p^T, q^T) - H(q^T)\), where \(H(q^T)\) is the entropy of the probability vector \(q^T\) and remains constant during training. As a result, we often reduce the total loss to

\[L =(1-\alpha) L_{C E}\left(p, y\right)+\alpha T^{2} {L}_{CE}\left(p^T, q^T\right).\]

Finally, the multiplier \(T^2\) is used to re-scale the gradient of KL loss and \(\alpha\) is a scalar controlling the weight contribution to each loss.
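A minimal sketch of this combined distillation objective, assuming teacher and student logits over the same label space:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of hard-label cross entropy and soft-label KL divergence."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # 'batchmean' matches the mathematical definition of the KL divergence.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    return (1 - alpha) * hard_loss + alpha * (T ** 2) * soft_loss

# Example: 8 samples, 2 classes (relevant / irrelevant).
student = torch.randn(8, 2)
teacher = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(distillation_loss(student, teacher, labels))
```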

Besides using the softened probability vector and KL divergence loss to guide the student learning process, we can also use an MSE loss between the logits of the teacher and the student networks. Specifically,

\[L_{MSE} = \left\|z^{(T)} - z^{(S)}\right\|^2\]

where \(z^{(T)}\) and \(z^{(S)}\) are logits from the teacher and the student network, respectively.

Remark 22.3 (connections between MSE loss and KL loss)

In [HVD15], given a single sample input feature \(x\), the gradient of \({L}_{KL}\) with respect to \(z_{k}^{(S)}\) is as follows:

\[ \frac{\partial {L}_{KL}}{\partial {z}_{k}^{(S)}}=T\left(p_{k}^{T}-{q}_{k}^{T}\right). \]

When \(T\) goes to \(\infty\), using the approximation \(\exp \left({z}_{k}/ T\right) \approx 1+{z}_{k} / T\), the gradient is simplified to:

\[ \frac{\partial {L}_{KL}}{\partial {z}_{k}^{(S)}} \approx T\left(\frac{1+z_{k}^{(S)} / T}{K+\sum_{j} {z}_{j}^{(S)} / T}-\frac{1+{z}_{k}^{(T)} / T}{K+\sum_{j} {z}_{j}^{(T)} / T}\right) \]

where \(K\) is the number of classes.

Here, assuming zero-mean teacher and student logits, i.e., \(\sum_{j} {z}_{j}^{(T)}=0\) and \(\sum_{j} {z}_{j}^{(S)}=0\), we have \(\frac{\partial {L}_{K L}}{\partial {z}_{k}^{(S)}} \approx \frac{1}{K}\left({z}_{k}^{(S)}-{z}_{k}^{(T)}\right)\). This indicates that minimizing \({L}_{KL}\) is equivalent to minimizing the mean squared error \({L}_{MSE}\), under a sufficiently large temperature \(T\) and the zero-mean logit assumption for both the teacher and the student.

22.7.3. Example Distillation Strategies#

22.7.3.1. Bi-encoder Teacher Distillation#

Authors in [LJZ20, VTGS20] pioneered the strategy of distilling a powerful BERT cross-encoder into a BERT bi-encoder to combine the benefits of the two model architectures: the accuracy of the cross-encoder and the efficiency of the bi-encoder.

Knowledge distillation follows the classic soft-label framework. Bi-encoder student training can use a pointwise ranking loss, which is equivalent to a binary relevance classification problem given a query and a candidate document. More formally, given training examples \((q_i, d_i)\) with labels \(y_i\in \{0, 1\}\), the BERT cross-encoder acts as the teacher model and produces soft targets over the irrelevance and relevance labels.

Although a cross-encoder teacher can offer accurate soft labels, it does not directly extend to the in-batch negatives technique and the N-pair loss [Section 22.5.3.4] when training the student model. The reason is that query and document embeddings cannot be computed separately by a cross-encoder. Implementing in-batch negatives with a cross-encoder teacher would require exhaustive computation over all combinations of queries and documents in a batch, which amounts to \(|B|^2\) (\(|B|\) is the batch size) query-document pairs.

Authors in [LYL21] proposed to leverage a bi-encoder variant such as ColBERT [Section 22.4.5.2] as the teacher model, which makes it feasible to perform exhaustive comparisons between queries and passages since they are passed through the encoder independently [Fig. 22.27].

../../_images/in_batch_distillation.png

Fig. 22.27 Compared to a cross-encoder teacher, a bi-encoder teacher computes query and document embeddings independently, which enables the application of the in-batch negative trick. Image from [LYL21].#

22.7.3.2. Cross-Encoder Embedding Similarity Distillation#

While a bi-encoder teacher offers efficiency in producing on-the-fly teacher scores, it sacrifices the interaction modeling ability of cross-encoders. On the other hand, directly using a cross-encoder to produce binary classification logits as the distillation target does not fully leverage other useful information in the teacher model.

To mitigate this, one can

  • Have a specialized cross-encoder teacher produce query and document embeddings

  • Enforce closeness between student and teacher on query/doc embedding vectors (e.g., via cosine distance)

  • Enforce closeness between student and teacher on query-doc embedding similarity scores (e.g., via MSE)

../../_images/cross_encoder_distillation.png

Fig. 22.28 Illustration of leveraging rich information from a cross-encoder teacher for knowledge distillation.#

22.7.3.3. Ensemble Teacher Distillation#

As we have seen in previous sections, large Transformer-based models such as BERT cross-encoders or bi-encoders are popular choices of teacher models when we perform knowledge distillation. These fine-tuned BERT models often show high performance variance across different runs. From an ensemble learning perspective, using an ensemble of models as the teacher could not only achieve better distillation results but also reduce the performance variance.

The critical challenge arising from distilling an ensemble teacher model vs a single teacher model is how to reconcile soft target labels generated by different teacher models.

Authors in [ZQH+21] propose the following method to fuse scores and labels. Formally, consider query \(q_{i}\), its \(j\)-th candidate document \(d_{i j}\), and \(K\) teacher models. Let the ranking score predicted by the \(k\)-th teacher be \(\hat{s}_{i j}^{(k)}\).

The simplest aggregated teacher label is to directly use the mean score, namely

\[ s_{i j}=\frac{1}{K}\sum_{k=1}^{K}\hat{s}_{i j}^{(k)}. \]

The simple average scheme works poorly when teacher models produce outputs on very different scales. A more robust way to fuse scores is to use reciprocal ranks, given by

\[ s_{i j}=\frac{1}{K} \sum_{k=1}^{K}\frac{1}{C+\hat{r}_{i j}^{(k)}} \]

where \(\hat{r}_{i j}^{(k)}\) is the predicted rank of the \(j\)-th candidate document by the \(k\)-th teacher, and \(C\) is a constant hyperparameter.

With the fused score, softened probability vector can be obtained by taking Softmax with temperature as the scaling factor.
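A minimal sketch of this reciprocal-rank fusion over per-teacher score lists is shown below; the constant \(C\) and the example scores are illustrative.

```python
import numpy as np

def reciprocal_rank_fusion(teacher_scores: np.ndarray, C: float = 60.0) -> np.ndarray:
    """Fuse scores from K teachers for one query's candidate list.

    teacher_scores: [K, num_candidates] raw scores from each teacher.
    Returns fused scores s_ij averaged over teachers using 1 / (C + rank).
    """
    # Rank 1 = highest score for each teacher.
    ranks = np.argsort(np.argsort(-teacher_scores, axis=1), axis=1) + 1
    return (1.0 / (C + ranks)).mean(axis=0)

# Example: 3 teachers scoring 4 candidate documents on different scales.
scores = np.array([[0.9, 0.1, 0.5, 0.3],
                   [12.0, 3.0, 8.0, 7.0],
                   [0.2, 0.05, 0.4, 0.1]])
print(reciprocal_rank_fusion(scores))
```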

22.8. Multi-vector Representations#

22.8.1. Introduction#

In classic representation-based learning for semantic retrieval [Section 22.4.3], we use two encoders (i.e., bi-encoders) to separately encode a query and a candidate document into two dense vectors in the embedding space, and then a score function, such as cosine similarity, produces the final relevance score. In this paradigm, there is a single global, static representation for each query and each document. Specifically, the document’s embedding remains the same regardless of the document length, the content structure of the document (e.g., multiple topics), and the variation of queries that are relevant to the document. It is very common that a document with hundreds of tokens contains several distinct subtopics, so some important semantic information might easily be missed or diluted when compressing the document into a single dense vector. As such, this simple bi-encoder structure may cause serious information loss when used to encode documents. [^2]

On the other hand, cross-encoders based on BERT variants [Section 22.4.4] utilize multiple self-attention layers not only to extract contextualized features from queries and documents but also to capture the interactions between them. Cross-encoders only produce intermediate representations that take a pair of query and document as the joint input. While BERT-based cross-encoders brought significant performance gains, they are computationally prohibitive and impractical for online inference.

In this section, we focus on different strategies [HSLW19, LETC21, TSJ+21] to encode documents with multi-vector representations, which enrich the single-vector representation produced by a bi-encoder. With additional computational overhead, these strategies can substantially improve encoding quality while retaining the fast retrieval strengths of bi-encoders.

22.8.2. Token-level Multi-vector Representation#

To enrich the representations of documents produced by a bi-encoder, some researchers extend the original bi-encoder with more delicate structures such as late interaction.

ColBERT [Section 22.4.5.2] can be viewed as a token-level multi-vector representation encoder for both queries and documents. Token-level representations for documents can be pre-computed offline. During online inference, late interactions between the query’s multi-vector representation and the document’s multi-vector representation are used to improve the robustness of dense retrieval, as compared to inner products of single-vector representations.

Formally, given \(q= q_{1} \ldots q_{l}\) and \(d=d_{1} \ldots d_{n}\) with token-level embeddings \(\{E_{q_1},\ldots, E_{q_l}\}\) and \(\{E_{d_1},\ldots,E_{d_n}\}\), the final similarity score between \(q\) and \(d\) is given by

\[ S_{q, d} =\sum_{i \in I_q} \max _{j \in I_d} E_{q_{i}} \cdot E_{d_{j}}^{T}, \]

where \(I_q = \{1,...,l\}\), \(I_d = \{1, ..., n\}\) are the index sets for query token embeddings and document token embeddings, respectively.

While this method has shown significant improvements over bi-encoder methods, its main disadvantage is high storage requirements. For example, ColBERT requires storing all the WordPiece token vectors of each text in the corpus.

22.8.3. Semantic Clusters As Pseudo Query Embeddings#

The primary limitation of the bi-encoder is information loss when we condense a document into a query-agnostic dense vector representation. Authors in [TSJ+21] proposed representing a document by its salient semantic fragments [Fig. 22.29]. These semantic fragments can be modeled by clusters of token embedding vectors in the embedding space. By running a clustering algorithm (e.g., k-means) on token embeddings, the resulting centroids can be used as a document’s multi-vector representation. Another interpretation is that these centroids can be viewed as multiple potential queries corresponding to the input document; as such, we can call them pseudo query embeddings.

../../_images/pseudo_query_embeddings.png

Fig. 22.29 Deriving semantic clusters by clustering token-level embedding vectors. Semantic cluster centroids are used as the multi-vector document representation, or as pseudo query embeddings. The final relevance score between a query and a document is computed using attentive pooling over the centroids followed by a dot product.#

There are a couple of steps to calculate the final relevance score between a query and a document. First, we encode the query \(q\) into a dense vector \(e_q\) (via the [CLS] token embedding) and the document \(d\) into multiple vectors \(c_1,...,c_K\) via token-level encoding and clustering. Second, the query-conditioned document representation \(e_d\) is obtained by using \(e_q\) to attend over the centroids. Finally, we compute the similarity score via the dot product between \(e_q\) and \(e_d\).

In summary, we have

\[\begin{split}\begin{align*} a_{j}&=\operatorname{Softmax}\left(e_{q} \cdot c_{j}\right) \\ e_{d}&=\sum_{j=1}^{k} a_{j} c_{j} \\ y &=e_{q} \cdot e_{d} \end{align*}\end{split}\]

In practice, we can save the centroid embeddings in memory and retrieve them using the real queries.
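A minimal sketch of this attentive pooling scoring step, assuming the per-document centroids have been pre-computed offline (e.g., by k-means over token embeddings):

```python
import torch

def pseudo_query_score(e_q: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Score a query against a document represented by K centroid vectors.

    e_q:       [dim]     query embedding (e.g., the [CLS] output)
    centroids: [K, dim]  semantic cluster centroids of the document
    """
    a = torch.softmax(centroids @ e_q, dim=0)    # attention weights a_j
    e_d = (a.unsqueeze(-1) * centroids).sum(0)   # query-conditioned doc embedding
    return e_q @ e_d                             # final relevance score y

# Example: 768-dim query embedding against 4 document centroids.
e_q, centroids = torch.randn(768), torch.randn(4, 768)
print(pseudo_query_score(e_q, centroids))
```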

22.9. Query and Document Expansion#

22.9.1. Overview#

Query expansion and document expansion techniques provide two potential solutions to the inherent vocabulary mismatch problem in traditional IR systems. The core idea is to add extra relevant terms to queries and documents respectively to aid relevance matching.

Consider an ad-hoc search example of automobile sales per year in the US.

  • Document expansion can be implemented by appending car to documents that contain the term automobile but not car. Then an exact-match retriever can fetch documents containing either car or automobile.

  • Query expansion can be accomplished by retrieving results from both the query automobile sales per year in the US and the query car sales per year in the US.

There are many different approaches to coming up with suitable terms to expand queries and documents in order to make relevance matching easier. These approaches range from traditional rule-based methods, such as synonym expansion, to recent learning-based approaches that mine user logs. For example, augmented terms for a document can come from relevant queries in user click-through logs.

../../_images/query_expansion_arch.png

Fig. 22.30 Query expansion module in an IR system.#

Both query and document expansion can fit into typical IR architectures through a de-coupled module. A query expansion module takes an input query and outputs a (richer) expanded query [Fig. 22.30]; such modules are also known as query rewriters or expanders. The module might remove terms deemed unnecessary in the user’s query, for example stop words, and add extra terms that help the engine retrieve documents with high recall.

../../_images/doc_expansion_arch.png

Fig. 22.31 Document expansion module in an IR system.#

Similarly, document expansion naturally fits into the retrieval and ranking pipeline [Fig. 22.31]. The index is simply built upon the expanded corpus to provide a richer set of candidate documents for retrieval. The extra computation for document expansion can all be carried out offline. Therefore, it offers a level of effectiveness similar to query expansion but at lower latency cost (for example, using less computationally intensive rerankers).

Query expansion and document expansion have different pros and cons. Main advantages of query expansions include

  • Compared to document expansion, query expansion techniques can be quickly implemented and experimented with without modifying the entire index. On the other hand, experimenting with document expansion techniques can be costly and time-consuming, since the entire index is affected.

  • Query expansion techniques are generally more flexible. For example, it is easy to switch different features on or off at query time (for example, selectively applying expansion only to certain intents or certain query types). The flexibility of query expansion also allows us to insert an expansion module at different stages of the retrieval-ranking pipeline.

On the other hand, one unique advantage for document expansion is that: documents are typically much longer than queries, and thus offer more context for a model to choose appropriate expansion terms. Neural based natural language generation models, like Transformers [VSP+17], can benefit from richer contexts and generate cohesive natural language terms to expand original documents.

22.9.2. Document Expansion Via Query Prediction#

Authors in [NYLC19] proposed Doc2query, a document expansion strategy based on a seq-to-seq natural language generation model to enrich each document.

For each document, the model predicts a set of queries that are likely to be relevant to the document. Given a dataset of (query, relevant document) pairs, a transformer model is trained to take the document as input and produce the target query[^3].

../../_images/Doc2query_arch.png

Fig. 22.32 Given a document, the Doc2query model predicts a query, which is appended to the document. Expansion is applied to all documents in the corpus, which are then indexed and searched as before. Image from [NYLC19].#

Once the model is trained, we can use the model to predict top-\(k\) queries using beam search and append them to each document in the corpus.

Examples of query predictions on MS MARCO compared to real user queries.

  • Input Document: July is the hottest month in Washington DC with an average temperature of 80F and the coldest is January at 38F with the most daily sunshine hours at 9 in July. The wettest month is May with an average of \(100~\mathrm{mm}\) of rain.

  • Predicted Query: weather in washington dc

  • Target Query: what is the temperature in washington

Another example:

  • Input Document: sex chromosome - (genetics) a chromosome that determines the sex of an individual; mammals normally have two sex chromosomes chromosome - a threadlike strand of DNA in the cell nucleus that carries the genes in a linear order; humans have 22 chromosome pairs plus two sex chromosomes.

  • Predicted Query: what is the relationship between genes and chromosomes

  • Target Query: which chromosome controls sex characteristics

This document expansion technique has demonstrated its effectiveness on the MS MARCO dataset when it is combined with BM25. The query prediction improves the performance from two aspects:

  • Predicted queries tend to copy some words from the input document (e.g., Washington DC, chromosome), which is sort of equivalent to performing term re-weighting (i.e., increasing the importance of key terms).

  • Predicted queries might also contain words not present in the input document (e.g., weather), which can be characterized as expansion by synonyms and other related terms.

A widely used relevance feedback algorithm for vector space models was developed by Rocchio [Roc71]. Let \(\left\{\vec{d}_{1}, \vec{d}_{2}, \cdots, \vec{d}_{k}\right\}\) be the vectors of the top \(k\) documents retrieved by a retrieval model (e.g., SNRM) in response to the query vector \(\vec{q}\). The updated query vector is computed as:

\[ \vec{q}^{*}=\vec{q}+\alpha \frac{1}{k} \sum_{i=1}^{k} \vec{d}_{i} \]

where \(\alpha\) controls the weight of the feedback vector. In practice, we only keep the top \(t\) (e.g., \(t=10\)–\(20\)) terms with the highest values in the updated query vector \(\vec{q}^{*}\).
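A minimal sketch of this Rocchio-style pseudo-relevance feedback update, with the query and documents represented as sparse term-weight dictionaries; the weight alpha, the toy term weights, and the top-t truncation are illustrative.

```python
from collections import Counter

def rocchio_expand(query_vec: dict, feedback_docs: list, alpha: float = 0.5, top_t: int = 10) -> dict:
    """Expand a query vector with the centroid of top-k feedback documents.

    query_vec:     {term: weight} for the original query
    feedback_docs: list of {term: weight} for the top-k retrieved documents
    """
    centroid = Counter()
    for doc in feedback_docs:
        centroid.update(doc)           # accumulate term weights over feedback docs
    k = max(len(feedback_docs), 1)
    expanded = Counter(query_vec)
    for term, weight in centroid.items():
        expanded[term] += alpha * weight / k
    # Keep only the top-t highest-weighted terms of the updated query.
    return dict(sorted(expanded.items(), key=lambda x: -x[1])[:top_t])

# Example usage with toy term weights.
q = {"automobile": 1.0, "sales": 1.0}
docs = [{"car": 0.8, "sales": 0.5}, {"car": 0.6, "vehicle": 0.4}]
print(rocchio_expand(q, docs))
```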

A continued study in this line shows that replacing the transformer with the more powerful seq-to-seq transformer model T5 [RSR+19] brings further performance gains.

22.10. Contextualized Term Importance#

22.10.1. Context-aware Term Importance: Deep-CT#

In ad-hoc search, queries are mainly short and keyword based without complex grammatical structures. To be able to fetch most relevant results, it is important to take into account term importance. For example, given the query bitcoin news, a relevant document is expected to be about bitcoin and news, where the term bitcoin is more important than news in the sense that a document describing other aspects of bitcoin would be more relevant than a document describing news of other things.

In the traditional IR framework, term importance is calculated using inverse document frequency: a term is less important if it is a common term appearing in a large number of documents. These frequency-based term weights have been a huge success in traditional IR systems due to their simplicity and scalability. The problematic aspect is that tf-idf determines term importance solely based on word counts rather than semantics. High-frequency words do not necessarily play a central role in the meaning of the text, especially for short texts where the word frequency distribution is quite flat. Consider the following two passages returned for the query stomach [DC19]; one is relevant and one is not:

  • Relevant: In some cases, an upset stomach is the result of an allergic reaction to a certain type of food. It also may be caused by an irritation. Sometimes this happens from consuming too much alcohol or caffeine. Eating too many fatty foods or too much food in general may also cause an upset stomach.

  • Less relevant: All parts of the body (muscles, brain, heart, and liver) need energy to work. This energy comes from the food we eat. Our bodies digest the food we eat by mixing it with fluids (acids and enzymes) in the stomach. When the stomach digests food, the carbohydrate (sugars and starches) in the food breaks down into another type of sugar, called glucose.

In both passages the word stomach appears twice, but the second passage is actually off-topic. This example also suggests that the importance of a term depends on its context, which reveals the role the word plays in the text.

The authors of [DC19] proposed DeepCT, which uses contextualized word representations from BERT to estimate term importance and improve the traditional IR approach. Specifically, given a word in a specific text, its contextualized embedding (e.g., from BERT) is used as a feature vector that characterizes the word's syntactic and semantic role in the text. DeepCT then estimates the word's importance score via a weighted summation:

\[ \hat{y}_{t, c}= \vec{w}^{\top} T_{t, c}+b \]

where \(T_{t, c} \in \mathbb{R}^D\) is token \(t\)'s contextualized embedding in the text \(c\), and \(\vec{w}\in \mathbb{R}^D\) and \(b\) are the weights and bias.

The model parameters of DeepCT are the weights and bias; they are estimated with a supervised per-token regression task using the mean squared error loss

\[ L_{MSE}=\sum_{c} \sum_{t}\left(y_{t, c}-\hat{y}_{t, c}\right)^{2}. \]
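The following is a minimal sketch of such a per-token regression model, assuming a HuggingFace BERT encoder and a single linear scoring head; the names and training details are illustrative, not the authors' exact implementation.

```python
# DeepCT-style term importance sketch: contextual token embeddings are mapped
# to scalar importance scores by a linear head and trained with per-token MSE.
import torch
import torch.nn as nn
from transformers import AutoModel

class DeepCTScorer(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.regressor = nn.Linear(self.encoder.config.hidden_size, 1)  # w and b

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.regressor(hidden).squeeze(-1)  # (batch, seq_len) importance scores

def deepct_loss(pred, target, attention_mask):
    # Per-token MSE against ground-truth weights y_{t,c}, masking out padding tokens.
    mask = attention_mask.float()
    return (((pred - target) ** 2) * mask).sum() / mask.sum()
```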

The ground-truth term weight \(y_{t, c}\) for every word in the query or the document is estimated in the following manner.

  • The importance of a term in a document \(d\) is estimated from the occurrence of the term in relevant queries [Fig. 22.33]. More formally, it is given by \(QTR(t, d)=\frac{\left|Q_{d, t}\right|}{\left|Q_{d}\right|}\), where \(Q_{d}\) is the set of queries that are relevant to document \(d\) and \(Q_{d, t}\) is the subset of \(Q_{d}\) that contains term \(t\). The intuition is that words that appear in relevant queries are more important than other words in the document.

  • The importance of a term in a query \(q\) is estimated from the occurrence of the term in relevant documents [Fig. 22.33]. More formally, it is given by \(TR(t, q)=\frac{\left|D_{q, t}\right|}{\left|D_{q}\right|}\), where \(D_{q}\) is the set of documents that are relevant to the query \(q\) and \(D_{q, t}\) is the subset of \(D_{q}\) that contains term \(t\). The intuition is that a query term is more important if it is mentioned by more relevant documents. A small computation sketch of these ground-truth weights is given after Fig. 22.33.

docs/img/chapter_application_IR/ApplicationIRSearch/termImportance/deepCT/deepCT_term_importance_demo

Fig. 22.33 Illustration of calculating context-aware term importance for a query (left) and a document (right). Term importance in a query is estimated from relevant documents of the query; term importance in a document is estimated from relevant queries of the document.#
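As referenced above, here is a minimal sketch of computing the document-side ground-truth weight \(QTR(t, d)\); the whitespace tokenization and the input format (document terms plus a list of relevant query strings) are simplifying assumptions.

```python
# Ground-truth document-side term weight QTR(t, d): the fraction of queries
# relevant to document d that contain term t. The query-side weight TR(t, q)
# is computed analogously over the documents relevant to q.
def qtr_weights(doc_terms, relevant_queries):
    """doc_terms: iterable of terms in document d; relevant_queries: list of query strings."""
    n = len(relevant_queries)
    weights = {}
    for t in set(doc_terms):
        count = sum(1 for q in relevant_queries if t.lower() in q.lower().split())
        weights[t] = count / n if n > 0 else 0.0
    return weights
```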

22.12. Benchmark Datasets#

22.12.1. MS MARCO#

MS MARCO (Microsoft MAchine Reading COmprehension) [NRS+16] is a large-scale dataset widely used to train and evaluate models for document retrieval and ranking, as well as related tasks such as key phrase extraction and question answering. The dataset is sampled from Bing search engine user logs: for each query, passages retrieved by Bing are annotated with human relevance labels. There are more than 530,000 questions in the “train” partition, and evaluation is usually performed on around 6,800 questions in the “dev” and “eval” partitions. The ground-truth labels for the “eval” partition are not published. The original dataset contains more than 8.8 million passages.

There are two tasks, passage ranking and document ranking, each with two subtasks: full ranking and re-ranking.

Each task uses a large human-generated set of training labels, and the two tasks have different sets of test queries. Both tasks use a similar form of training data, usually with one positive training document/passage per training query. In the case of passage ranking, there is a direct human label indicating that the passage can be used to answer the query, whereas for the document ranking task the same passage-level labels are transferred to document-level labels. Participants can also use external corpora for large-scale language model pretraining, or adapt algorithms built for one task (e.g., passage ranking) to the other task (e.g., document ranking). This allows participants to study a variety of transfer learning strategies.

22.12.1.1. Document Ranking Task#

The first task focuses on document ranking. We have two subtasks related to this: Full ranking and top-100 re-ranking.

In the full ranking (retrieval) subtask, you are expected to rank documents based on their relevance to the query, where documents can be retrieved from the full document collection provided. You can submit up to 100 documents for this task. It models a scenario where you are building an end-to-end retrieval system.

In the re-ranking subtask, we provide you with an initial ranking of 100 documents from a simple IR system, and you are expected to re-rank the documents in terms of their relevance to the question. This is a very common real-world scenario, since many end-to-end systems are implemented as retrieval followed by top-k re-ranking. The re-ranking subtask allows participants to focus on re-ranking only, without needing to implement an end-to-end system. It also makes those re-ranking runs more comparable, because they all start from the same set of 100 candidates.

22.12.1.2. Passage Ranking Task#

Similar to the document ranking task, the passage ranking task also has full ranking and re-ranking subtasks.

In the full ranking (retrieval) subtask, given a question, you are expected to rank passages from the full collection in terms of their likelihood of containing an answer to the question. You can submit up to 1,000 passages for this end-to-end retrieval task.

In the top-1000 re-ranking subtask, we provide you with an initial ranking of 1,000 passages, and you are expected to re-rank these passages based on their likelihood of containing an answer to the question. This subtask makes different re-ranking methods directly comparable, since they all start from the same set of 1,000 candidates, with the same rationale as described for the document re-ranking subtask.

One caveat of the MS MARCO collection is that it contains only binary relevance annotations, with fewer than two positive examples per query on average and no explicit annotations for non-relevant passages. For the re-ranking task, negative examples are generated from the top candidates of a traditional retrieval system. This approach works reasonably well, but it is possible to accidentally pick relevant passages as negative examples.
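A minimal sketch of this negative-sampling strategy is shown below; the first-stage searcher interface (`bm25_search`, returning scored document ids) and the sampling depth are assumptions for illustration.

```python
# Sample negatives for re-ranking training: take top-ranked candidates from a
# first-stage retriever (e.g., BM25) that are not annotated as relevant.
# Caveat: unjudged candidates may still be relevant ("false negatives").
import random

def sample_negatives(query, positive_ids, bm25_search, n_neg=4, depth=1000):
    candidates = [docid for docid, _ in bm25_search(query, k=depth) if docid not in positive_ids]
    return random.sample(candidates, min(n_neg, len(candidates)))
```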

22.12.2. TREC#

22.12.2.1. TREC-deep Learning Track#

The Deep Learning Track at the Text REtrieval Conference (TREC) [CMY+20] provides another large-scale benchmark for evaluating retrieval and ranking models through two tasks: document retrieval and passage retrieval.

Both tasks use a large human-generated set of training labels from the MS MARCO dataset. The document retrieval task has a corpus of 3.2 million documents with 367 thousand training queries and a test set of 43 queries. The passage retrieval task has a corpus of 8.8 million passages with 503 thousand training queries and a test set of 43 queries.

22.12.2.2. TREC-CAR#

TREC-CAR (Complex Answer Retrieval) [DVRC17] is a dataset where the input query is the concatenation of a Wikipedia article title with the title of one of its sections. The ground-truth documents are the paragraphs within that section. The corpus consists of all English Wikipedia paragraphs except the abstracts. The released dataset has five predefined folds, and we use the first four as a training set (approx. 3M queries), and the remaining as a validation set (approx. 700k queries). The test set has approx. 2,250 queries.

22.12.3. Natural Questions (NQ)#

Natural Questions (NQ) [KPR+19] is a large dataset for open-domain QA. The original dataset contains more than 300,000 questions collected from Google search logs. In [KOuguzM+20], around 62,000 factoid questions are selected, and all Wikipedia articles are processed as the collection of passages, resulting in more than 21 million passages in the corpus.

22.13. Note On Bibliography And Software#

22.13.1. Bibliography#

For excellent reviews of neural information retrieval, see [GFP+20, LNY21, MC+18].

For traditional information retrieval, see [BCC16, CMS10, RZ09, SchutzeMR08].

[AWB+19]

Qingyao Ai, Xuanhui Wang, Sebastian Bruch, Nadav Golbandi, Michael Bendersky, and Marc Najork. Learning groupwise multivariate scoring functions using deep neural networks. In Proceedings of the 2019 ACM SIGIR international conference on theory of information retrieval, 85–92. 2019.

[Bur10]

Christopher JC Burges. From ranknet to lambdarank to lambdamart: an overview. Learning, 11(23-581):81, 2010.

[BCC16]

Stefan Büttcher, Charles LA Clarke, and Gordon V Cormack. Information retrieval: implementing and evaluating search engines. MIT Press, 2016.

[CQL+07]

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, 129–136. 2007.

[CFWB17]

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017.

[CFGH20]

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. 2020. URL: https://arxiv.org/abs/2003.04297, arXiv:2003.04297.

[CMY+20]

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M Voorhees. Overview of the trec 2019 deep learning track. arXiv preprint arXiv:2003.07820, 2020.

[CMS10] (1,2)

W Bruce Croft, Donald Metzler, and Trevor Strohman. Search engines: Information retrieval in practice. Volume 520. Addison-Wesley Reading, 2010.

[DC19] (1,2)

Zhuyun Dai and Jamie Callan. Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687, 2019.

[DZS+17]

Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W Bruce Croft. Neural ranking models with weak supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 65–74. 2017.

[DCLT18] (1,2,3)

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[DVRC17]

Laura Dietz, Manisha Verma, Filip Radlinski, and Nick Craswell. Trec complex answer retrieval overview. In TREC. 2017.

[FBF77]

Jerome H Friedman, Jon Louis Bentley, and Raphael Ari Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209–226, 1977.

[GDC21]

Luyu Gao, Zhuyun Dai, and Jamie Callan. Coil: revisit exact lexical match in information retrieval with contextualized inverted list. arXiv preprint arXiv:2104.07186, 2021.

[GBCB16]

Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning. Volume 1. MIT press Cambridge, 2016.

[GFP+20]

Jiafeng Guo, Yixing Fan, Liang Pang, Liu Yang, Qingyao Ai, Hamed Zamani, Chen Wu, W Bruce Croft, and Xueqi Cheng. A deep look into neural ranking models for information retrieval. Information Processing & Management, 57(6):102067, 2020.

[GSL+20]

Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning, 3887–3896. PMLR, 2020.

[HG19]

Dany Haddad and Joydeep Ghosh. Learning more from less: towards strengthening weak supervision for ad-hoc retrieval. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 857–860. 2019.

[HVD15] (1,2)

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[HofstatterLY+21]

Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. Efficiently teaching an effective dense retriever with balanced topic aware sampling. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 113–122. 2021.

[HHG+13] (1,2,3)

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, 2333–2338. 2013.

[HSLW19] (1,2)

Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969, 2019.

[JDJegou19]

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data, 7(3):535–547, 2019.

[KOuguzM+20] (1,2,3,4)

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.

[KZ20]

Omar Khattab and Matei Zaharia. Colbert: efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 39–48. 2020.

[KPR+19]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, and others. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.

[LLXL21]

Yizhi Li, Zhenghao Liu, Chenyan Xiong, and Zhiyuan Liu. More robust dense retrieval with contrastive dual learning. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 287–296. 2021.

[LNY21]

Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. Pretrained transformers for text ranking: bert and beyond. Synthesis Lectures on Human Language Technologies, 14(4):1–325, 2021.

[LYL21] (1,2)

Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), 163–173. 2021.

[LWLQ21]

Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. arXiv preprint arXiv:2106.04554, 2021.

[LJZ20]

Wenhao Lu, Jian Jiao, and Ruofei Zhang. Twinbert: distilling knowledge to twin-structured bert models for efficient retrieval. arXiv preprint arXiv:2002.06275, 2020.

[LETC21] (1,2)

Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics, 9:329–345, 2021.

[MY18]

Yu A Malkov and Dmitry A Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4):824–836, 2018.

[MSC+13]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119. 2013.

[MC19]

Bhaskar Mitra and Nick Craswell. An updated duet model for passage re-ranking. arXiv preprint arXiv:1903.07666, 2019.

[MC+18]

Bhaskar Mitra, Nick Craswell, and others. An introduction to neural information retrieval. Now Foundations and Trends Boston, MA, 2018.

[MDC17]

Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, 1291–1299. 2017.

[MNCC16]

Bhaskar Mitra, Eric Nalisnick, Nick Craswell, and Rich Caruana. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016.

[NRS+16]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. Ms marco: a human generated machine reading comprehension dataset. In CoCo@ NIPS. 2016.

[NZG+20]

Ping Nie, Yuyu Zhang, Xiubo Geng, Arun Ramamurthy, Le Song, and Daxin Jiang. Dc-bert: decoupling question and document for efficient contextual encoding. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, 1829–1832. 2020.

[NSN18]

Yifan Nie, Alessandro Sordoni, and Jian-Yun Nie. Multi-level abstraction convolutional model with weak supervision for information retrieval. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 985–988. 2018.

[NC19] (1,2,3,4,5)

Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with bert. arXiv preprint arXiv:1901.04085, 2019.

[NYCL19] (1,2,3,4,5,6)

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. Multi-stage document ranking with bert. arXiv preprint arXiv:1910.14424, 2019.

[NYLC19] (1,2)

Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. Document expansion by query prediction. arXiv preprint arXiv:1904.08375, 2019.

[QDL+20] (1,2,3)

Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. Rocketqa: an optimized training approach to dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2010.08191, 2020.

[RSR+19]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[RSL+21]

Ori Ram, Gal Shachaf, Omer Levy, Jonathan Berant, and Amir Globerson. Learning to retrieve passages without supervision. arXiv preprint arXiv:2112.07708, 2021.

[RZ09] (1,2)

Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009.

[Roc71]

Joseph Rocchio. Relevance feedback in information retrieval. The Smart retrieval system-experiments in automatic document processing, pages 313–323, 1971.

[SchutzeMR08]

Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. Introduction to information retrieval. Volume 39. Cambridge University Press Cambridge, 2008.

[SHG+14]

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management, 101–110. 2014.

[Soh16]

Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems, 2016.

[TSJ+21] (1,2,3)

Hongyin Tang, Xingwu Sun, Beihong Jin, Jingang Wang, Fuzheng Zhang, and Wei Wu. Improving document representations by generating pseudo query embeddings for dense retrieval. arXiv preprint arXiv:2105.03599, 2021.

[TLL+19]

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136, 2019.

[VTGS20]

Amir Vakili Tahami, Kamyar Ghajar, and Azadeh Shakery. Distilling knowledge for fast retrieval-based chat-bots. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2081–2084. 2020.

[VSP+17]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, 5998–6008. 2017.

[WI20] (1,2)

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, 9929–9939. PMLR, 2020.

[XXL+20] (1,2)

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808, 2020.

[YFL18]

Peilin Yang, Hui Fang, and Jimmy Lin. Anserini: reproducible ranking baselines using lucene. Journal of Data and Information Quality (JDIQ), 10(4):1–20, 2018.

[ZQH+21]

Honglei Zhuang, Zhen Qin, Shuguang Han, Xuanhui Wang, Michael Bendersky, and Marc Najork. Ensemble distillation for bert-based ranking models. In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 131–136. 2021.

22.13.2. Software#

Faiss [JDJegou19] is an open-source library, developed at Facebook AI Research, for efficient similarity search and clustering of dense vectors.
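A minimal usage sketch follows, assuming passage and query embeddings have already been produced by a dense encoder; the random vectors and dimensionality here are placeholders.

```python
# Build a flat inner-product Faiss index over passage embeddings and retrieve
# the top-k nearest passages for a query embedding (exact, non-approximate search).
import numpy as np
import faiss

d = 768                                                    # embedding dimension (placeholder)
passage_emb = np.random.rand(10000, d).astype("float32")   # placeholder passage embeddings
query_emb = np.random.rand(1, d).astype("float32")         # placeholder query embedding

index = faiss.IndexFlatIP(d)          # exact inner-product search
index.add(passage_emb)                # add all passage vectors to the index
scores, ids = index.search(query_emb, 10)   # top-10 passage ids and scores
```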