Tutorials

The following tutorials will be given on February 2, 2015.

  • Dynamic Information Retrieval Modeling by Hui Yang, Marc Sloan, and Jun Wang (morning)
  • Scalability and Efficiency Challenges in Large-Scale Web Search Engines by B. Barla Cambazoglu and Ricardo Baeza-Yates (morning)
  • Offline Evaluation and Optimization for Interactive Systems: A Practical Guide by Lihong Li (morning)
  • Real-Time Bidding: A New Frontier of Computational Advertising Research by Jun Wang and Shuai Yuan (afternoon)
  • Learning about health and medicine from Internet data by Elad Yom-Tov, Ingemar Johansson Cox and Vasileios Lampos (afternoon)
  • Distributed Graph Algorithmics: Theory and Practice by Silvio Lattanzi and Vahab Mirrokni (afternoon)

Dynamic Information Retrieval Modeling

Hui Yang, Marc Sloan, and Jun Wang

In Dynamic Information Retrieval modeling we model dynamic systems that change or adapt over time or over a sequence of events, using a range of techniques from artificial intelligence and reinforcement learning. Many of the open problems in current IR research, such as session search and computational advertising, can be described as dynamic systems. State-of-the-art research provides solutions to these problems that respond to a changing environment, learn from past interactions, and predict future utility. Advances in IR interfaces, personalization, and ad display demand models that can react to users in real time and in an intelligent, contextual way.

The objective of this tutorial is to provide a comprehensive and up-to-date introduction to Dynamic Information Retrieval Modeling, the statistical modeling of IR systems that can adapt to change. It is a natural follow-up to previous statistical IR modeling tutorials, with a fresh look at state-of-the-art dynamic retrieval models and their applications, including session search and online advertising. The tutorial covers techniques ranging from classic relevance feedback to the latest applications of partially observable Markov decision processes (POMDPs), and presents to fellow researchers and practitioners a handful of useful algorithms and tools for solving and evaluating IR problems that incorporate dynamics.
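As a rough flavor of the kind of model covered here, the sketch below casts session search as a small reinforcement-learning problem. It is purely illustrative and not the tutorial's own formulation: the state labels, action set, and reward are our assumptions. The state is a coarse summary of the session so far, the actions are candidate retrieval strategies, the reward is observed click feedback, and a tabular Q-learning update adapts the strategy choice as the session unfolds.

    import random
    from collections import defaultdict

    # Hypothetical illustration: session search as a small tabular MDP.
    # States summarize the session (e.g. "new_topic", "reformulation"),
    # actions are retrieval strategies, rewards are observed clicks.

    ACTIONS = ["fresh_ranking", "relevance_feedback", "diversify"]

    class SessionSearchAgent:
        def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
            self.q = defaultdict(float)   # Q[(state, action)] -> estimated value
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def choose(self, state):
            # Epsilon-greedy choice of retrieval strategy for this query.
            if random.random() < self.epsilon:
                return random.choice(ACTIONS)
            return max(ACTIONS, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state):
            # Standard Q-learning update from one observed user interaction.
            best_next = max(self.q[(next_state, a)] for a in ACTIONS)
            target = reward + self.gamma * best_next
            self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

    # Usage: after each query in a session, observe clicks and update.
    agent = SessionSearchAgent()
    state = "new_topic"
    action = agent.choose(state)
    reward = 1.0  # e.g. the user clicked a returned document
    agent.update(state, action, reward, next_state="reformulation")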

Scalability and Efficiency Challenges in Large-Scale Web Search Engines

B. Barla Cambazoglu and Ricardo Baeza-Yates

This tutorial aims to provide a fairly comprehensive overview of the scalability and efficiency challenges in large-scale web search engines. In particular, it gives an in-depth architectural overview of a web search engine, focusing mainly on the web crawling, indexing, and query processing components. The scalability and efficiency issues encountered in these components are presented at four different granularities: a single computer, a cluster of computers, a single data center, and a multi-center search engine. The tutorial also points out open research problems and provides recommendations to researchers who are new to the field.
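For readers new to the area, the fragment below is a minimal, purely illustrative sketch of the indexing and query-processing components mentioned above: it builds an in-memory inverted index and answers a conjunctive query by intersecting posting lists. Real engines compress, shard, and distribute these structures across clusters and data centers; the function names and data layout here are our own, not material from the tutorial.

    from collections import defaultdict

    # Minimal illustrative sketch: an in-memory inverted index and
    # conjunctive (AND) query processing over it.

    def build_index(docs):
        """docs: dict of doc_id -> text. Returns term -> sorted list of doc_ids."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return {term: sorted(ids) for term, ids in index.items()}

    def conjunctive_query(index, query):
        """Intersect posting lists, shortest first, for an AND query."""
        postings = [index.get(t, []) for t in query.lower().split()]
        if not postings:
            return []
        postings.sort(key=len)  # start with the rarest term
        result = set(postings[0])
        for plist in postings[1:]:
            result &= set(plist)
        return sorted(result)

    docs = {1: "web search engines crawl the web",
            2: "query processing in search engines",
            3: "distributed crawling at scale"}
    index = build_index(docs)
    print(conjunctive_query(index, "search engines"))  # -> [1, 2]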

Offline Evaluation and Optimization for Interactive Systems: A Practical Guide

Lihong Li

Evaluating and optimizing an interactive system (such as a search engine, recommender system, or advertising system) from historical data against a predefined online metric is challenging, especially when that metric is computed from user feedback such as clicks and payments. The key challenge is counterfactual in nature: we only observe a user’s feedback for the action taken by the system, but do not know how that user would have reacted had the system chosen a different action. The standard approach to evaluating such metrics of a user-interacting system is online A/B testing (a.k.a. randomized controlled experiments), which can be expensive for several reasons. Offline evaluation therefore becomes critical, with the aim of evaluating the same metrics without running too many costly experiments on live users. In recent years, substantial advances have been made on this problem, resulting in reliable solutions that have proven effective in important real-world problems and that have been adopted by industry leaders. This tutorial reviews the basic theory as well as representative techniques, and illustrates how to apply them in practice using several case studies done at Microsoft and Yahoo!.
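One representative technique in this family is the inverse propensity score (IPS) estimator: if the logging system chose its actions with known probabilities, each observed reward can be reweighted to give an unbiased offline estimate of how a new policy would have performed. The sketch below is a minimal illustration under that assumption; the data layout and function names are ours, not the tutorial's.

    # Minimal sketch of inverse-propensity-score (IPS) offline evaluation.
    # Each logged record contains: the context, the action the logging system
    # took, the probability with which it took it, and the observed reward.

    def ips_estimate(logged_data, target_policy):
        """Estimate the average reward the target policy would obtain.

        logged_data: iterable of (context, action, logging_prob, reward)
        target_policy: function(context, action) -> probability the new
                       policy would choose `action` in `context`
        """
        total, n = 0.0, 0
        for context, action, logging_prob, reward in logged_data:
            weight = target_policy(context, action) / logging_prob
            total += weight * reward
            n += 1
        return total / n if n else 0.0

    # Usage with toy data: the logging policy picked between two ads
    # uniformly at random; the target policy always shows ad "a".
    log = [({"user": 1}, "a", 0.5, 1.0),
           ({"user": 2}, "b", 0.5, 0.0),
           ({"user": 3}, "a", 0.5, 0.0)]
    always_a = lambda context, action: 1.0 if action == "a" else 0.0
    print(ips_estimate(log, always_a))  # offline estimate of ad "a"'s click rate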

Real-Time Bidding: A New Frontier of Computational Advertising Research

Jun Wang and Shuai Yuan

In display and mobile advertising, the most significant development in recent years is Real-Time Bidding (RTB), which allows ad impressions to be sold and bought in real time, one impression at a time. RTB has fundamentally changed the landscape of digital marketing by scaling the buying process across a large number of available inventories. The demand for automation, integration, and optimisation in RTB brings new research opportunities in the data mining and machine learning fields. However, despite its rapid growth and huge potential, many aspects of RTB remain unknown to the research community. In this tutorial, together with invited distinguished speakers from the online advertising industry, we aim to bring insightful knowledge from real-world systems, bridge the gaps, and provide an overview of the fundamental infrastructure, algorithms, and technical and research challenges of this new frontier of computational advertising. We will also introduce researchers to publicly available datasets, tools, and platforms so that they can get hands-on quickly.
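To make the mechanics concrete, the sketch below is our own simplification rather than part of the tutorial material: on each impression, a bidder predicts a click-through rate, converts it into a bid via a simple linear bidding rule, and the exchange runs a second-price auction among the submitted bids. The constants and names are hypothetical.

    # Illustrative sketch of the per-impression RTB decision and auction.
    # The linear bidding rule and the constants here are simplifications.

    def linear_bid(predicted_ctr, base_bid=2.0, avg_ctr=0.001):
        """Bid proportionally to how much better this impression looks
        than an average one (bid and base_bid in, e.g., CPM dollars)."""
        return base_bid * predicted_ctr / avg_ctr

    def second_price_auction(bids):
        """bids: dict of bidder -> bid. The winner pays the second-highest bid."""
        if not bids:
            return None, 0.0
        ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
        winner, _ = ranked[0]
        price = ranked[1][1] if len(ranked) > 1 else ranked[0][1]
        return winner, price

    # One impression arrives: each demand-side platform bids in real time.
    bids = {"dsp_a": linear_bid(predicted_ctr=0.0030),
            "dsp_b": linear_bid(predicted_ctr=0.0012),
            "dsp_c": 4.5}
    winner, price = second_price_auction(bids)
    print(winner, round(price, 2))  # dsp_a wins and pays the second price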

Learning about health and medicine from Internet data

Elad Yom-Tov, Ingemar Johansson Cox and Vasileios Lampos

Surveys show that around 80% of US Internet users consult the Internet when they require medical information. People seek this information using both traditional search engines and social media. The information created during the search process offers an unprecedented opportunity for applications that monitor and improve the quality of life of people with a variety of medical conditions. In recent years, research in this area has addressed public-health questions such as the effect of media on the development of anorexia, developed tools for measuring influenza rates and assessing drug safety, and examined the effects of health information on individual wellbeing.

This tutorial will show how Internet data can facilitate medical research, providing an overview of the state of the art in this area. During the tutorial we will discuss the information that can be gleaned from a variety of Internet data sources, including social media, search engines, and specialized medical websites. We will provide an overview of analysis methods used in the recent literature, and show how results can be evaluated using publicly available health information and online experimentation. Finally, we will discuss ethical and privacy issues and possible technological solutions. This tutorial is intended for researchers working with user-generated content who are interested in applying their knowledge to improve health and medicine.

Distributed Graph Algorithmics: Theory and Practice

Silvio Lattanzi and Vahab Mirrokni

As a fundamental tool for modeling and analyzing social and information networks, large-scale graph mining is an important component of any tool set for big data analysis. Processing graphs with hundreds of billions of edges is only possible via distributed algorithms developed under distributed graph mining frameworks such as MapReduce, Pregel, Giraph, and the like. For these distributed algorithms to work well in practice, we need to take into account several metrics, such as the number of rounds of computation and the communication complexity of each round. For example, given the popularity and ease of use of the MapReduce framework, developing practical algorithms with good theoretical guarantees for basic graph problems is a problem of great importance.

In this tutorial, we first discuss how to design and implement algorithms based on the traditional MapReduce architecture. In this regard, we discuss various basic graph-theoretic problems such as computing connected components, maximum matching, minimum spanning trees, triangle counting, and overlapping or balanced clustering. We describe a computation model for MapReduce and the sampling, filtering, local random walk, and core-set techniques used to develop efficient algorithms in this framework. At the end, we explore the possibility of employing other distributed graph processing frameworks. In particular, we study the effect of augmenting MapReduce with a distributed hash table (DHT) service, and discuss the use of a new graph processing framework called ASYMP, which is based on asynchronous message passing. We will show that using ASYMP one can improve CPU usage and achieve significantly better running times.
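As a flavor of the round-based computation model discussed here, the sketch below (an illustration of the general idea, not the algorithms presented in the tutorial) computes connected components by iterated label propagation: in each round every vertex adopts the smallest component label among itself and its neighbours, which mirrors a map step (emit labels along edges) followed by a reduce step (take the minimum per vertex). The number of synchronous rounds is exactly the metric a MapReduce implementation tries to keep small; this naive version needs a number of rounds proportional to the graph diameter, which is what the more advanced techniques above improve on.

    # Illustrative round-based label propagation for connected components.
    # Each round corresponds to one MapReduce iteration: "map" emits the
    # current label across every edge, "reduce" keeps the minimum per vertex.

    def connected_components(vertices, edges):
        labels = {v: v for v in vertices}   # start: every vertex is its own component
        changed = True
        rounds = 0
        while changed:
            changed = False
            rounds += 1
            # Map phase: propagate labels along edges (both directions).
            messages = {v: [labels[v]] for v in vertices}
            for u, v in edges:
                messages[u].append(labels[v])
                messages[v].append(labels[u])
            # Reduce phase: every vertex keeps the smallest label it saw.
            for v in vertices:
                new_label = min(messages[v])
                if new_label != labels[v]:
                    labels[v] = new_label
                    changed = True
        return labels, rounds

    vertices = [1, 2, 3, 4, 5]
    edges = [(1, 2), (2, 3), (4, 5)]
    labels, rounds = connected_components(vertices, edges)
    print(labels)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
    print(rounds)  # number of synchronous rounds used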
