Djoerd Hiemstra – Research, Teaching and More

Team OpenWebSearch at CLEF 2024

LongEval

by Daria Alexander, Maik Fröbe, Gijs Hendriksen, Ferdinand Schlatt, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Arjen de Vries

We describe the OpenWebSearch group’s participation in the CLEF 2024 LongEval IR track. Our submitted runs explore how historical data can be transferred into future retrieval systems. To this end, we incorporate relevance information from past click logs into the query reformulation process via keyqueries and into the indexing process via a reverted index. We then incorporate both into learning-to-rank pipelines to ensure that retrieval is also possible for previously unseen queries. Our evaluation shows that keyqueries substantially outperform other approaches for queries with historical click data available.
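As a rough illustration of the reverted-index idea mentioned above, the following minimal sketch assumes that a reverted index maps each document to the past queries for which it was clicked. The click-log format, the function names, and the way query terms are turned into expansion terms are all assumptions made for illustration; this is not the team's actual pipeline.

```python
from collections import Counter, defaultdict


def build_reverted_index(click_log):
    """Map each document id to a Counter of the past queries it was clicked for.

    click_log is assumed to be an iterable of (query, doc_id) pairs.
    """
    reverted = defaultdict(Counter)
    for query, doc_id in click_log:
        reverted[doc_id][query] += 1
    return reverted


def expansion_terms(reverted, doc_id, top_k=2):
    """Return the terms of the top_k most frequently clicked queries for a document."""
    terms = []
    for query, _count in reverted.get(doc_id, Counter()).most_common(top_k):
        terms.extend(query.split())
    return terms


# Hypothetical toy click log: (query, clicked document id)
click_log = [
    ("cheap flights paris", "d1"),
    ("cheap flights paris", "d1"),
    ("paris hotels", "d1"),
    ("paris hotels", "d2"),
]
reverted = build_reverted_index(click_log)
d1_terms = expansion_terms(reverted, "d1", top_k=1)
```

In this toy setup, the terms returned for a document could be appended to its text before indexing, so that the document becomes easier to find for queries resembling those that led to clicks in the past.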

To be presented at CLEF 2024: Conference and Labs of the Evaluation Forum on 9-12 September in Grenoble, France.

[download pdf]

Announcing IRRJ

Today at SIGIR 2024, the Information Retrieval Research Journal (IRRJ) will be informally announced:

Open Access,
No article processing charges,
Papers in all areas of Information Retrieval (IR),
First issue planned end of 2024,
Submissions open in September,
To enlarge IR with researchers from low-income countries!

Editorial board:

Djoerd Hiemstra (Radboud University, the Netherlands)
Vanessa Murdock (Amazon, USA)
Johanne Trippas (RMIT, Australia)
Makoto Kato (University of Tsukuba, Japan)
Ismail Sengor Altingovde (Middle East Technical University, Turkiye)
Monica Paramita (University of Sheffield, UK)
Negin Rahimi (University of Massachusetts, Amherst, USA)
Ben He (University of Chinese Academy of Sciences, China)
Shangsong Liang (Mohamed bin Zayed University of Artificial Intelligence, UAE)
Haiming Liu (University of Southampton, UK)
Debarshi Kumar Sanyal (Indian Association for the Cultivation of Science, India)
Daniela Godoy (National Council for Scientific and Technological Research, Argentina)
Barbara Poblete (DCC University, Chile)

Advisory board:

Paul Kantor (Emeritus, Rutgers University, USA)
Stephen Robertson (formerly Microsoft Research, UK)

More information follows soon!

Nirmal Roy defends PhD thesis on the effects of interfaces on search

Exploring the effects of interactive interfaces on user search behaviour

by Nirmal Roy

Interactive information retrieval (IIR) is a user-centered approach to information seeking and retrieval. In this paradigm, the search process is not confined to a single query and a static set of results. Instead, it emphasises the active involvement of users in refining their information needs, iteratively modifying queries, and exploring retrieved content. IIR research studies how to facilitate a more tailored and practical search experience that adapts to the evolving requirements and preferences of users. In this thesis, we focus on four distinct yet interrelated areas in the domain of IIR to better understand the interaction between the user and the information retrieval system.

Tom Rust graduates on Learned Sparse Retrieval

by Tom Rust

Machine learning algorithms are achieving better results each day and are gaining popularity. The top-performing models are usually deep learning models. These models can absorb vast amounts of training data, improving prediction results. Unfortunately, these models consume a large amount of energy, which not everyone is aware of. In information retrieval, large language models are used to provide extra context to queries and documents. Since information retrieval systems typically have large datasets, a suitable deep learning model must be chosen to find a balance between accuracy and energy usage. Learned sparse retrieval models are an example of these deep learning models. These models work by expanding each document to create the optimal document representation that allows the document to be found correctly. This step is done before creating the inverted index, allowing for conventional ranking methods such as BM25. In this research, we compare different learned sparse retrieval models in terms of accuracy, speed, size and energy usage. We also compare them with a full-text index. We see that on MS MARCO, the learned sparse retrievers outperform the full-text index on all popular evaluation benchmarks. However, the learned sparse retrievers can consume up to 100 times more energy whilst creating the index, and the resulting index has a higher query latency and uses more disk space. For WT10g, we see that the full-text index gives us the highest accuracy whilst also being more energy efficient, using less disk space and having a lower query latency.
We conclude that learned sparse retrieval has the potential to improve accuracy on certain datasets, but a trade-off is necessary between the improved accuracy and the cost of increased storage, latency, and energy consumption.
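
The expand-then-index idea described above can be sketched in a few lines. This is only a toy illustration: the fixed lookup table `EXPANSIONS` is a hypothetical stand-in for a learned expansion model (real learned sparse retrievers use neural models and typically also learn term weights), and the BM25 parameters `k1` and `b` are common defaults, not values taken from the thesis.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for a learned expansion model: a fixed lookup table.
EXPANSIONS = {"car": ["automobile", "vehicle"], "phone": ["mobile"]}


def expand(tokens):
    """Append expansion terms to a document's tokens before indexing."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(EXPANSIONS.get(token, []))
    return expanded


def build_index(docs):
    """Build an inverted index (term -> {doc_id: term frequency}) over expanded documents."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = expand(text.lower().split())
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25(query, index, lengths, k1=1.2, b=0.75):
    """Rank documents for a query with BM25 over the expanded index."""
    n = len(lengths)
    avgdl = sum(lengths.values()) / n
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        idf = math.log(1 + (n - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {"d1": "red car for sale", "d2": "new phone for sale"}
index, lengths = build_index(docs)
results = bm25("automobile", index, lengths)
```

Here the query "automobile" matches document d1 only because of the expansion step, since the word does not occur in the original text. The expansion happens once at indexing time, which is where the extra energy and disk space reported above would be spent.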

Proceedings of WOWS 2024

The Proceedings of the first Workshop on Open Web Search (WOWS), which took place on 28 March 2024 in Glasgow, UK, are now published in the CEUR Workshop Series as Volume 3689.

WOWS 2024 had two calls for contributions. The first call targeted scientific contributions on cooperative search engine development. This includes cooperative crawling of the web and cooperative deployment and evaluation of search engines. We specifically highlighted the potential of enabling public and commercial organizations to use an indexed web crawl as a resource to create innovative search engines tailored to specific user groups, instead of relying on one search engine provider. The second call aimed at gaining practical experience with joint, cooperative evaluation of search engine prototypes and their components using the Information Retrieval Experiment Platform TIREx. The workshop featured a keynote by Negar Arabzadeh from the University of Waterloo, 8 paper presentations (5 full papers and 3 short papers accepted out of 13 submissions), and a breakout session with participant discussions. WOWS received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070014. We would like to thank the Program Committee members for helpful reviews and suggestions to improve the contributions to the workshop. Special thanks go to Christine Plote, Managing Director of the Open Search Foundation, for the WOWS 2024 website.

https://ceur-ws.org/Vol-3689/

[download pdf]

Search engines: going forward together and sustainably

Zoekmachines: Samen en duurzaam vooruit

The video of my inaugural lecture is out (in Dutch; English subtitles to be added).

Semere Bitew defends PhD thesis on Language Models for Education

Language Model Adaptation with Applications in AI for Education

by Semere Kiros Bitew

The overall theme of my dissertation is adapting language models, mainly for applications in AI in education, to automatically create educational content. It addresses the challenges in formulating test and exercise questions in educational settings, which traditionally require significant training, experience, time, and resources. This is particularly critical in high-stakes environments like certifications and tests, where questions cannot be reused. In particular, the primary research is focused on two educational tasks: distractor generation and gap-filling exercise generation. The distractor generation task refers to generating plausible but incorrect answers in multiple-choice questions, while gap-filling exercise generation refers to inducing well-chosen gaps to generate grammar exercises from existing texts. These tasks, although extensively researched, present unexplored avenues that recent advancements in language models can address. As a secondary objective, I explore the adaptation of coreference resolution to new languages. Coreference resolution is a key NLP task that involves clustering mentions in a text that refer to the same real-world entities, a process vital for understanding and generating coherent language.

Crawling and Indexing the Web for Public Use

by Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Benno Stein

Only a few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search fully depend on the terms and conditions of these few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index. The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index along with associated data products as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index – for example, by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

To be presented at the European Conference on Information Retrieval (ECIR 2024) in Glasgow on 24-28 March.

[download pdf]

Weighted AUReC

Handling Skew in Shard Map Quality Estimation for Selective Search

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

In selective search, a document collection is partitioned into a collection of topical index shards. To efficiently estimate the topical coherence (or quality) of a shard map, the AUReC (Area Under Recall Curve) measure was introduced. AUReC assumes that shards are of similar sizes, an assumption that is violated in practice, even for unsupervised approaches. The problem might be amplified if supervised labelling approaches with skewed class distributions are used. To estimate the quality of such unbalanced shard maps, we introduce a weighted adaptation of the AUReC measure, and empirically evaluate its effectiveness using the ClueWeb09B and Gov2 datasets. We show that it closely matches the evaluations of the original AUReC when shards are similar in size, but better captures the differences in performance when shard sizes are skewed.
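
To make the role of shard sizes concrete, here is a small sketch of an area-under-recall-curve computation for a single query. The abstract does not spell out the weighting, so the size-weighted variant below is an assumption made for illustration and not necessarily the definition used in the paper: shards are visited in order of how many of the query's top documents they hold, and each shard's step along the x-axis is proportional to its size instead of being 1/n. All function names are hypothetical.

```python
def recall_curve_area(top_docs_per_shard, widths):
    """Trapezoidal area under the recall curve for one query.

    top_docs_per_shard[i]: number of the query's top documents found in shard i.
    widths[i]: x-axis width assigned to shard i; the widths should sum to 1.
    """
    total = sum(top_docs_per_shard)
    if total == 0:
        raise ValueError("the query has no top documents in any shard")
    # Visit shards in descending order of the number of top documents they hold.
    order = sorted(
        range(len(top_docs_per_shard)),
        key=lambda i: top_docs_per_shard[i],
        reverse=True,
    )
    area, recall = 0.0, 0.0
    for i in order:
        new_recall = recall + top_docs_per_shard[i] / total
        area += widths[i] * (recall + new_recall) / 2
        recall = new_recall
    return area


def aurec(top_docs_per_shard):
    """Unweighted variant: every shard gets the same x-axis width."""
    n = len(top_docs_per_shard)
    return recall_curve_area(top_docs_per_shard, [1 / n] * n)


def size_weighted_aurec(top_docs_per_shard, shard_sizes):
    """Assumed weighting: a shard's x-axis width is its share of the collection."""
    total_size = sum(shard_sizes)
    return recall_curve_area(
        top_docs_per_shard, [size / total_size for size in shard_sizes]
    )


counts = [8, 2, 0, 0]  # hypothetical: top documents per shard for one query
balanced = size_weighted_aurec(counts, [250, 250, 250, 250])
skewed = size_weighted_aurec(counts, [700, 100, 100, 100])
```

With equally sized shards, the two variants coincide in this sketch. When the shard holding most of the top documents is also by far the largest, the size-weighted value drops, reflecting that finding those documents required searching most of the collection.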

To be presented at the European Conference on Information Retrieval (ECIR) in Glasgow on 24-28 March.

[download pdf]

Inaugural lecture on 1 March

Invitation

On 1 March 2024 at 15:45, I will give my inaugural lecture: “Zoekmachines: Samen en duurzaam vooruit” (in Dutch). Everyone is invited. Please register at: https://www.ru.nl/rede/hiemstra

In the lecture, I will share an ancient wisdom about working together; I will discuss my plan to teach students of all backgrounds their shared history; and I will reveal my dream to provide unrestricted access to all human information by working together. The lecture will contain cars, iPhone chargers, the Space Shuttle, and references to exciting recent research.

[download pdf]