Publications

Puxuan Yu, Antonio Mallia, Matthias Petri Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies. In Proceedings of the 46th European Conference on Information Retrieval (ECIR). 2024

Abstract: We explore leveraging corpus-specific vocabularies that improve both efficiency and effectiveness of learned sparse retrieval systems. We find that pre-training the underlying BERT model on the target corpus, specifically targeting different vocabulary sizes incorporated into the document expansion process, improves retrieval quality by up to 12% while in some scenarios decreasing latency by up to 50%. Our experiments show that adopting corpus-specific vocabulary and increasing vocabulary size decreases average postings list length which in turn reduces latency. Ablation studies show interesting interactions between custom vocabularies, document expansion techniques, and sparsification objectives of sparse models. Both effectiveness and efficiency improvements transfer to different retrieval approaches such as uniCOIL and SPLADE and offer a simple yet effective approach to providing new efficiency-effectiveness trade-offs for learned sparse retrieval systems.

PDF Cite
Xueguang Ma, Hengxin Fun, Xusen Yin, Antonio Mallia, Jimmy Lin. Enhancing Sparse Retrieval via Unsupervised Learning. In Proceedings of the 1st International ACM SIGIR Conference on Information Retrieval in the Asia Pacific (SIGIR-AP) 2023

Abstract: Recent work has shown that neural retrieval models excel in text ranking tasks in a supervised setting when given large amounts of manually labeled training data. However, it remains an open question how one might train effective unsupervised retrieval models that are unequivocally better than lexical-matching baselines such as BM25. While some progress has been made in unsupervised dense retrieval models within a bi-encoder architecture, unsupervised sparse retrieval models remain unexplored. In this work, we propose BM26, to our knowledge the first such model. BM26 is trained in an unsupervised manner without the need for any human relevance judgments but keeps the same retrieval paradigm as BM25. Evaluations across multiple modern test collections, including MS MARCO, NaturalQuestions, TriviaQA, and BEIR, show that BM26 alone outperforms Contriever, the current state-of-the-art unsupervised dense retriever, in general, and performs on par with BM25. We further demonstrate two promising avenues to enhance existing lexical retrieval systems using such unsupervised lexical retrieval models: 1) Enhancing the BM25 system by hybrid with BM26 based on simple vector concatenation (that we dub "BM51"), demonstrating a significant and consistent improvement over BM25 alone, while remains unsupervised system. 2) Enhancing the supervised lexical retrieval model with improved initialization using BM26, as seen with the SPLADE model, generating significant improvements in both in-domain and zero-shot retrieval effectiveness.

PDF Cite
Gabriel Tolosa, Antonio Mallia. Many are Better than One: Algorithm Selection for Faster Top-K Retrieval. In Information Processing & Management(IPM) 2023

Abstract: Large-scale search engines have become a fundamental tool to efficiently access information on the Web. Typically, users expect answers in sub-second time frames, which demands highly efficient algorithms to traverse the data structures to return the top-k results. Despite different top-k algorithms that avoid processing all postings for all query terms, finding one algorithm that performs the fastest on any query is not always possible. The fastest average algorithm does not necessarily perform the best on all queries when evaluated on a per-query basis. To overcome this challenge, we propose to combine different state-of-the-art disjunctive top-k query processing algorithms to minimize the execution time by selecting the most promising one for each query. We model the selection step as a classification problem in a machine-learning setup. We conduct extensive experimentation and compare the results against state-of-the-art baselines using standard document collections and query sets. On ClueWeb12, our proposal shows a speed-up of up to 1.20x for non-blocked index organizations and 1.19x for block-based ones. Moreover, tail latencies are reduced showing proportional improvements on average, but a resulting dramatic decrease in latency variance. Given these findings, the proposed approach can be easily applied to existing search infrastructures to speed up query processing and reduce resource consumption, positively impacting providers’ operative costs.

PDF Cite
Joel Mackenzie, Antonio Mallia, Alistair Moffat, Matthias Petri. Accelerating Learned Sparse Indexes Via Term Impact Decomposition. In Findings of the Association for Computational Linguistics: EMNLP 2022

Abstract: Novel inverted index-based learned sparse ranking models provide more effective, but less efficient, retrieval performance compared to traditional ranking models like BM25. In this paper, we introduce a technique we call postings clipping to improve the query efficiency of learned representations. Our technique amplifies the benefit of dynamic pruning query processing techniques by accounting for changes in term importance distributions of learned ranking models. The new clipping mechanism accelerates top-k retrieval by up to 9.6X without any loss in effectiveness.

PDF Cite
Antonio Mallia, Joel Mackenzie, Torsten Suel, Nicola Tonellotto. Faster Learned Sparse Retrieval with Guided Traversal. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2022

Abstract: Neural information retrieval architectures based on transformers such as BERT are able to significantly improve system effectiveness over traditional sparse models such as BM25. Though highly effective, these neural approaches are very expensive to run, making them difficult to deploy under strict latency constraints. To address this limitation, recent studies have proposed new families of learned sparse models that try to match the effectiveness of learned dense models, while leveraging the traditional inverted index data structure for efficiency. Current learned sparse models learn the weights of terms in documents and, sometimes, queries; however, they exploit different vocabulary structures, document expansion techniques, and query expansion strategies, which can make them slower than traditional sparse models such as BM25. In this work, we propose a novel indexing and query processing technique that exploits a traditional sparse model's "guidance" to efficiently traverse the index, allowing the more effective learned model to execute fewer scoring operations. Our experiments show that our guided processing heuristic is able to boost the efficiency of the underlying learned sparse model by a factor of four without any measurable loss of effectiveness.

PDF Cite
Michal Siedlaczek, Antonio Mallia, Torsten Suel. Using Conjunctions for Faster Disjunctive Top-k Queries. In Proceedings of the 15th ACM International Conference on Web Search and Data Mining (WSDM). 2022

Abstract: While current search engines use highly complex ranking functions with hundreds of features, they often perform an initial candidate generation step that uses a very simple ranking function to identify a limited set of promising candidates. A common approach is to use a disjunctive top-k query for this step. There are many methods for disjunctive top-k computation, but they tend to be slow for the required values of k, which are in the hundreds to thousands. We propose a new approach to safe disjunctive top-k computation that, somewhat counterintuitively, uses precomputed conjunctions of inverted lists to speed up disjunctive queries. The approach is based on a generalization of the well-known MaxScore algorithm, and utilizes recent improvements in threshold estimation techniques as well as new ideas to obtain significant improvements in performance. Our algorithms are implemented as an extension of the PISA framework for search-engine query processing, and available as open-source to support replication and follow-up work.

PDF Cite
Jimmy Lin, Xueguang Ma, Joel Mackenzie, Antonio Mallia. On the Separation of Logical and Physical Ranking Models for Text Retrieval Applications. In Proceedings of the 2nd International Conference on Design of Experimental Search and Information REtrieval Systems (DESIRES). 2021

Abstract: Text retrieval using bags of words is typically formulated as inner products between vector representations of queries and documents, realized in query evaluation algorithms that traverse postings in an inverted index. Viewed in database terms, this captures a tight coupling between the ‘logical’ aspects of ranking (i.e., term weighting) and the ‘physical’ aspects of ranking (query evaluation). We argue that explicitly decoupling these two aspects offers a framework for thinking about the relationship between sparse retrieval techniques and the rapidly growing literature on dense retrieval techniques.

PDF Cite
Antonio Mallia, Omar Khattab, Torsten Suel, Nicola Tonellotto. Learning Passage Impacts for Inverted Indexes. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2021

Abstract: Neural information retrieval systems typically use a cascading pipeline, in which a first-stage model retrieves a candidate set of documents and one or more subsequent stages re-rank this set using contextualized language models such as BERT. In this paper, we propose DeepImpact, a new document term-weighting scheme suitable for efficient retrieval using a standard inverted index. Compared to existing methods, DeepImpact improves impact-score modeling and tackles the vocabulary-mismatch problem. In particular, DeepImpact leverages DocT5Query to enrich the document collection and, using a contextualized language model, directly estimates the semantic importance of tokens in a document, producing a single-value representation for each token in each document. Our experiments show that DeepImpact significantly outperforms prior first-stage retrieval approaches by up to 17% on effectiveness metrics w.r.t. DocT5Query, and, when deployed in a re-ranking scenario, can reach the same effectiveness of state-of-the-art approaches with up to 5.1x speedup in efficiency.

PDF Cite
Antonio Mallia, Michal Siedlaczek, Torsten Suel. Fast Disjunctive Candidate Generation Using Live Block Filtering. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM). 2021

Abstract: A lot of research has focused on the efficiency of search enginequery processing, and in particular on disjunctive top-k queries that return the highest scoring k results that contain at least one of the query terms. Disjunctive top-k queries over simple ranking functions are commonly used to retrieve an initial set of candidate results that are then reranked by more complex, often machine-learned rankers. Many optimized top-k algorithms have been proposed, including MaxScore, WAND, BMW, and JASS. While the fastest methods achieve impressive results on top-10 and top-100 queries, they tend to become much slower for the larger k commonly used for candidate generation. In this paper, we focus on disjunctive top-k queries for larger k. We propose new algorithms that achieve much faster query processing for values of k up to thousands or tens of thousands. Our algorithms build on top of the live-block filtering approach of Dimopoulos et al, and exploit the SIMD capabilities of modern CPUs. We also perform a detailed experimental comparison of our methods with the fastest known approaches, and release a full model implementation of our methods and of the underlying live-block mechanism, which will allows others to design and experiment with additional methods under the live-block approach.

PDF Cite
Luke Gallagher, Antonio Mallia, J. Shane Culpepper, Torsten Suel, B. Barla Cambazoglu. Feature Extraction for Large-Scale Text Collections. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM). 2020

Abstract: Feature engineering is a fundamental but poorly documented component in learning-to-rank (LTR) search engines. Such features are commonly used to construct learning models for web and product search engines, recommender systems and question-answering tasks. In each of these domains, there is a growing interest in the creation of open-access test collections that promote reproducible research. However, there are still few open source software packages capable of extracting high-quality machine learning features from large text collections. Instead, most feature-based LTR research relies on "canned" test collections, which often do not expose critical details about the underlying collection or implementation details of the features extracted. Both of these are crucial to collection creation and deployment of a search engine into production. So in this regard, the experiments are rarely reproducible with new features or collections, or helpful for companies wishing to deploy LTR systems. In this paper, we introduce Fxt, an open-source framework to perform efficient and scalable feature extraction. Fxt can easily be integrated into complex, high-performance software applications to help solve a wide variety of textual machine learning problems. To demonstrate the software's utility, we build and document a reproducible feature extraction pipeline and show how to recreate several common LTR experiments using the ClueWeb09B collection. Researchers and practitioners can benefit from Fxt to extend their machine learning pipelines for various text-based retrieval tasks, and learn how some static document features and query-specific features are implemented.

PDF Cite
Antonio Mallia, Michal Siedlaczek, Mengyang Sun, Torsten Suel. A Comparison of Top-k Threshold Estimation Techniques for Disjunctive Query Processing. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM). 2020

Abstract: In the top-k threshold estimation problem, given a query q, the goal is to estimate the score of the result at rank k. A good estimate of this score can result in significant performance improvements for several query processing scenarios, including selective search, index tiering, and widely used disjunctive query processing algorithms such as MaxScore, WAND, and BMW. Several approaches have been proposed, including parametric approaches, methods using random sampling, and a recent approach based on machine learning. However, previous work fails to perform any experimental comparison between these approaches. In this paper, we address this issue by reimplementing four major approaches and comparing them in terms of estimation error, running time, likelihood of an overestimate, and end-to-end performance when applied to common classes of disjunctive top-k query processing algorithms.

PDF Cite
Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michal Siedlaczek, Andrew Trotman, Arjen de Vries. Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2020

Abstract: There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems. To balance these two goals, we examine two approaches to providing interoperability between the inverted indexes of several systems. The first takes advantage of internal abstractions around index structures and building wrappers that allow one system to directly read the indexes of another. The second involves sharing indexes across systems via a data exchange specification that we have developed, called the Common Index File Format (CIFF). We demonstrate the first approach with the Java systems Anserini and Terrier, and the second approach with Anserini, JASSv2, OldDog, PISA, and Terrier. Together, these systems provide a wide range of implementations and features, with different research goals. Overall, we recommend CIFF as a low-effort approach to support independent innovation while enabling the types of fair evaluations that are critical for driving the field forward.

PDF Cite
Antonio Mallia, Michal Siedlaczek, Torsten Suel and Mohamed Zahran. GPU-Accelerated Decoding of Integer Lists. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM). 2019

Abstract: An inverted index is the basic data structure used in most current large-scale information retrieval systems. It can be modeled as a collection of sorted sequences of integers. Many compression techniques for inverted indexes have been studied in the past, with some of them reaching tremendous decompression speeds through the use of SIMD instructions available on modern CPUs. While there has been some work on query processing algorithms for Graphics Processing Units (GPUs), little of it has focused on how to efficiently access compressed index structures, and we see some potential for significant improvements in decompression speed. In this paper, we describe and implement two encoding schemes for index decompression on GPU architectures. Their format and decoding algorithm is adapted from existing CPU-based compression methods to exploit the execution model and memory hierarchy offered by GPUs. We show that our solutions, GPU-BP and GPU-VByte, achieve significant speedups over their already carefully optimized CPU counterparts.

PDF Cite
Antonio Mallia. Efficient top-k document retrieval. In Proceedings of the 9th PhD Symposium on Future Directions in Information Access (FDIA). 2019

Abstract: Over the past few decades, the IR community has been making a continuous effort to improve the efficiency of search in large collections of documents. Query processing is still one of the main bottlenecks in large-scale search systems. The top-k document retrieval problem, which can be defined as reporting the k most relevant documents from a collection for a given query, can be extremely expensive, as it involves scoring large amounts of documents. In this work, we investigate the top-k document retrieval problem from several angles with the aim of improving the efficiency of this task in large-scale search systems. Finally, we briefly describe our initial findings and conclude by proposing future directions to follow.

PDF Cite
Antonio Mallia, Michal Siedlaczek, Joel Mackenzie and Torsten Suel. PISA: Performant Indexes and Search for Academia. In Proceedings of the Open-Source IR Replicability Challenge (OSIRRC) co-located with SIGIR. 2019

Abstract: Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for the PISA system.

PDF Cite
Antonio Mallia, Michal Siedlaczek and Torsten Suel. An Experimental Study of Index Compression and DAAT Query Processing Methods. In Proceedings of the 41st European Conference on Information Retrieval (ECIR). 2019

Abstract: In the last two decades, the IR community has seen numerous advances in top-k query processing and inverted index compression techniques. While newly proposed techniques are typically proposed against a few baselines, these evaluations are often very limited, and we feel that there is no clear overall picture on the best choices of algorithms and compression methods. In this paper, we attempt to address this issue by evaluating a number of state-of-the-art index compression methods and safe disjunctive DAAT query processing algorithms. Our goal is to understand how much index compression performance impacts overall query processing speeds, how the choice of query processing algorithm depends on the compression method used, and how performance is impacted by document reordering techniques and the number of results returned, keeping in mind that current search engines typically use sets of hundreds or thousands of candidates for further reranking.

PDF Cite
Joel Mackenzie, Antonio Mallia, Matthias Petri, J. Shane Culpepper and Torsten Suel. Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study. In Proceedings of the 41st European Conference on Information Retrieval (ECIR). 2019

Abstract: Document reordering is an important but often overlooked preprocessing stage in index construction. Reordering document identifiers in graphs and inverted indexes has been shown to reduce storage costs and improve processing efficiency in the resulting indexes. However, surprisingly few document reordering algorithms are publicly available despite their importance. A new reordering algorithm derived from recursive graph bisection was recently proposed by Dhulipala et al., and shown to be highly effective and efficient when compared against other state-of-the-art reordering strategies. In this work, we present a reproducibility study of this new algorithm. We describe both the implementation challenges faced, as well as the performance characteristics of our clean-room reimplementation. We show that we are able to successfully reproduce the core results of the original paper, and show that the algorithm generalizes to other collections and indexing frameworks. Furthermore, we make our implementation publicly available to help promote further research in this space.

PDF Cite
Antonio Mallia, Elia Porciani. Faster BlockMax WAND with Longer Skipping. In Proceedings of the 41st European Conference on Information Retrieval (ECIR). 2019

Abstract: One of the major problems for modern search engines is to keep up with the tremendous growth in the size of the web and the number of queries submitted by users. The amount of data being generated today can only be processed and managed with specialized technologies. BlockMaxWAND and the more recent Variable BlockMaxWAND represent the most advanced query processing algorithms that make use of dynamic pruning techniques, which allow them to retrieve the top k most relevant documents for a given query without any effectiveness degradation of its ranking. In this paper, we describe a new technique for the BlockMaxWAND family of query processing algorithm, which improves block skipping in order to increase its efficiency. We show that our optimization is able to improve query processing speed on short queries by up to 37% with negligible additional space overhead.

PDF Cite
Melanie Tosik, Antonio Mallia, Kedar Gangopadhyay. Debunking Fake News One Feature at a Time. CoRR abs/1808.02831. 2018

Abstract: Identifying the stance of a news article body with respect to a certain headline is the first step to automated fake news detection. In this paper, we introduce a 2-stage ensemble model to solve the stance detection task. By using only hand-crafted features as input to a gradient boosting classifier, we are able to achieve a score of 9161.5 out of 11651.25 (78.63%) on the official Fake News Challenge (Stage 1) dataset. We identify the most useful features for detecting fake news and discuss how sampling techniques can be used to improve recall accuracy on a highly imbalanced dataset.

PDF Cite
Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano Venturini. Faster BlockMax WAND with Variable-sized Blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2017

Abstract: Query processing is one of the main bottlenecks in large-scale search engines. Retrieving the top k most relevant documents for a given query can be extremely expensive, as it involves scoring large amounts of documents. Several dynamic pruning techniques have been introduced in the literature to tackle this problem, such as BlockMaxWAND, which splits the inverted index into constant- sized blocks and stores the maximum document-term scores per block; this information can be used during query execution to safely skip low-score documents, producing many-fold speedups over exhaustive methods. We introduce a refinement for BlockMaxWAND that uses variable- sized blocks, rather than constant-sized. We set up the problem of deciding the block partitioning as an optimization problem which maximizes how accurately the block upper bounds represent the underlying scores, and describe an efficient algorithm to find an approximate solution, with provable approximation guarantees. rough an extensive experimental analysis we show that our method significantly outperforms the state of the art roughly by a factor 2×. We also introduce a compressed data structure to represent the additional block information, providing a compression ratio of roughly 50%, while incurring only a small speed degradation, no more than 10% with respect to its uncompressed counterpart.

PDF Cite

Other activities