Full Day Tutorial
Practice of Efficient Data Collection via Crowdsourcing: Aggregation, Incremental Relabelling, and Pricing
Alexey Drutsa, Yandex, Russia
Valentina Fedorova, Yandex, Russia
Dmitry Ustalov, Yandex, Russia
Olga Megorskaya, Yandex, Russia
Evfrosiniya Zerminova, Yandex, Russia
Daria Baidakova, Yandex, Russia
Topic: In this tutorial, we share a portion of the unique practical industrial experience in efficient data labeling via crowdsourcing, presented by leading researchers and engineers from Yandex. The majority of ML projects require training data, and often this data can only be obtained through human labelling. Moreover, as more applications of AI appear, more nontrivial tasks for collecting human-labelled data arise. Producing such data at large scale requires constructing a technological pipeline, which includes solving issues related to quality control and smart distribution of tasks among workers.
We will give an introduction to data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. This will be followed by a practical session, where participants will choose one of several real label collection tasks, experiment with selecting settings for the labelling process, and launch their label collection project at Yandex.Toloka, one of the largest crowdsourcing marketplaces. The projects will be run on real crowds within the tutorial session. Finally, participants will receive feedback about their projects and practical advice on making them more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect labelled data with good quality and do it efficiently.
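As a rough illustration of the aggregation step named in the tutorial title, the sketch below combines overlapping worker answers by simple majority vote. It is not the method used at Yandex.Toloka; the function name `majority_vote`, the `(task_id, worker_id, label)` input format, and the sample data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(labels):
    """Aggregate crowd labels by majority vote.

    labels: iterable of (task_id, worker_id, label) triples (hypothetical format).
    Returns {task_id: (winning_label, share_of_votes_for_winner)}.
    Ties are resolved in favour of the label seen first for that task.
    """
    votes = defaultdict(Counter)
    for task_id, _worker_id, label in labels:
        votes[task_id][label] += 1

    result = {}
    for task_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        total = sum(counter.values())
        result[task_id] = (label, count / total)
    return result


# Hypothetical answers from three workers on two image tasks.
raw = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog"),
]
aggregated = majority_vote(raw)
```

The vote share returned with each label can serve as a crude confidence signal: tasks with low agreement are natural candidates for requesting additional labels, which is one simple way to think about incremental relabelling.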
Jen-Tzung Chien, National Chiao Tung University, Taiwan
Topic: This tutorial addresses the advances in deep Bayesian data mining for natural language, with ubiquitous applications ranging from speech recognition to document summarization, text classification, text segmentation, information extraction, image caption generation, sentence generation, dialogue control, sentiment classification, recommendation systems, question answering and machine translation, to name a few. Traditionally, "deep learning" is taken to be a learning process where the inference or optimization is based on a real-valued deterministic model. The "semantic structure" in words, sentences, entities, actions and documents drawn from a large vocabulary may not be well expressed or correctly optimized in mathematical logic or computer programs. The "distribution function" in a discrete or continuous latent variable model for natural language may not be properly decomposed or estimated. This tutorial addresses the fundamentals of statistical models and neural networks, and focuses on a series of advanced Bayesian models and deep models including hierarchical Dirichlet process, Chinese restaurant process, hierarchical Pitman-Yor process, Indian buffet process, recurrent neural network, sequence-to-sequence model, variational auto-encoder (VAE), generative adversarial network, attention mechanism, memory-augmented neural network, skip neural network, temporal difference VAE, stochastic neural network, stochastic temporal convolutional network, predictive state neural network, and policy neural network. Enhancing the prior/posterior representation is addressed. We present how these models are connected and why they work for a variety of applications on symbolic and complex patterns in sequence data. Variational inference and sampling methods are formulated to tackle the optimization for complicated models. Word and sentence embeddings, clustering and co-clustering are merged with linguistic and semantic constraints.
A series of case studies is presented to tackle different issues in deep Bayesian mining and understanding. Finally, we point out a number of directions and outlooks for future studies.
Somit Gupta, Microsoft, USA
Xiaolin Shi, Snap, USA
Pavel Dmitriev, Microsoft, USA
Xin Fu, Facebook, USA
Topic: A/B testing is the gold standard for estimating the causal relationship between a change in a product and its impact on key outcome measures. It is widely used in the industry to test changes ranging from simple copy or UI changes to more complex changes such as using machine learning models to personalize the user experience. The key aspect of A/B testing is the evaluation of experiment results. Designing the right set of metrics (correct outcome measures, data quality indicators, guardrails that prevent harm to the business, and a comprehensive set of supporting metrics to understand the “why” behind the key movements) is the #1 challenge practitioners face when trying to scale their experimentation program. On the technical side, improving the sensitivity of experiment metrics is a hard problem and an active research area, with large practical implications as more and more small and medium-sized businesses try to adopt A/B testing and suffer from insufficient power. In this tutorial we will discuss challenges, best practices, and pitfalls in evaluating experiment results, focusing on lessons learned and practical guidelines as well as open research questions.
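To make the remark about insufficient power concrete, here is a minimal sketch of the textbook sample-size formula for a two-sided, two-sample z-test with equal group sizes and a known, shared standard deviation. It is a standard approximation rather than anything specific to the tutorial; the function name `sample_size_per_group` and the example numbers are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect a mean shift `delta`.

    Assumes a two-sided two-sample z-test, equal group sizes, and a common
    known standard deviation `sigma` for the metric (simplifying assumptions).
    """
    if sigma <= 0 or delta == 0:
        raise ValueError("sigma must be positive and delta non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)


# Hypothetical metric with standard deviation 1.0 and a target effect of 0.1.
n_small_effect = sample_size_per_group(sigma=1.0, delta=0.1)
# Halving the detectable effect roughly quadruples the required traffic.
n_smaller_effect = sample_size_per_group(sigma=1.0, delta=0.05)
```

Because the required sample grows with the square of `sigma / delta`, a product with limited traffic can only detect large effects unless metric variance is reduced, which is why metric sensitivity matters so much for smaller businesses.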
Boris Kovalerchuk, Central Washington University, USA
Topic: Intelligible machine learning and knowledge discovery are important for modeling individual and social behavior, user activity, link prediction, community detection, crowd-generated data, and others. Interpretable methods also play a significant role in web search and mining activities, where they enhance clustering, classification, data summarization, knowledge acquisition, opinion and sentiment mining, web traffic analysis, and web recommender systems. The success of deep learning in prediction accuracy, together with its failure to explain the produced models without special interpretation efforts, has motivated a surge of efforts to make Machine Learning (ML) models more intelligible and understandable. The prominence of visual methods in producing appealing explanations of ML models has motivated the growth of deep visualization and visual knowledge discovery. This tutorial covers the state-of-the-art research, development, and applications in the area of Intelligible Knowledge Discovery and Machine Learning boosted by Visual Means. The topic is interdisciplinary, bridging the efforts of research and applied communities in Data Mining, Machine Learning, Visual Analytics, Information Visualization, and HCI. This is a novel and fast-growing area with significant applications and potential.
Xiang Wang, National University of Singapore, Singapore
Xiangnan He, University of Science and Technology of China, China
Tat-Seng Chua, National University of Singapore, Singapore
Topic: Recommendation methods construct predictive models to estimate the likelihood of a user-item interaction. Previous models largely follow a general supervised learning paradigm, treating each interaction as a separate data instance and performing prediction based on the “information isolated island”. Such methods, however, overlook the relations among data instances, which may result in suboptimal performance, especially in sparse scenarios. Moreover, models built only on separate data instances can hardly reveal the reasons behind a recommendation, which makes the recommendation process opaque.
In this tutorial, we revisit the recommendation problem from the perspective of graph learning. Common data sources for recommendation can be organized into graphs, such as user-item interactions (bipartite graphs), social networks, and item knowledge graphs (heterogeneous graphs), among others. Such a graph-based organization connects the isolated data instances, making it possible to exploit high-order connectivities that encode meaningful patterns for collaborative filtering, content-based filtering, social influence modeling and knowledge-aware reasoning. Together with the recent success of graph neural networks (GNNs), graph-based models have exhibited the potential to be the technologies for next-generation recommendation systems. The tutorial provides a review of graph-based learning methods for recommendation, with a special focus on recent developments in GNNs and knowledge graph-enhanced recommendation. By introducing this emerging and promising area, we expect the audience to gain a deep understanding of and accurate insight into the field, and we hope to stimulate more ideas and discussions and promote the development of these technologies.
Zhenhui Li, Pennsylvania State University, USA
Huaxiu Yao, Pennsylvania State University, USA
Fenglong Ma, Pennsylvania State University, USA
Topic: In the era of big data, it is easy for us to collect a huge amount of image and text data. However, we frequently face real-world problems with only small (labeled) data in some domains, such as healthcare and urban computing. The challenge is how to make machine learning algorithms still work well with small data. To address this challenge, in this tutorial we will cover the state-of-the-art machine learning techniques for handling the small data issue. In particular, we focus on the following three aspects: (1) providing a comprehensive review of recent advances in exploring the power of knowledge transfer, especially focusing on meta-learning; (2) introducing the cutting-edge techniques of incorporating human/expert knowledge into machine learning models; and (3) identifying the open challenges to data augmentation techniques, such as generative adversarial networks.
Ruoying Wang, LinkedIn, USA
Kexin Nie, LinkedIn, USA
Tie Wang, LinkedIn, USA
Yang Yang, LinkedIn, USA
Bo Long, LinkedIn, USA
Topic: Anomaly detection is important in various applications ranging from intrusion detection and fraud detection to medical diagnosis and large-scale sensor data from the Internet of Things. The goal of anomaly detection is to identify rare abnormal data patterns that deviate from the majority of the data. Anomaly patterns are difficult to detect due to high-dimensional data structures (e.g. images and text) and temporal patterns. In addition, several new applications require detecting anomalies in large-scale data. It becomes increasingly challenging to apply traditional models, which often fail to identify anomalies in these cases. As we will show in this tutorial, deep learning models have successfully improved the performance of anomaly detection in the face of these challenges.
In this tutorial, we summarize the cutting-edge deep learning techniques used in various applications to detect anomalies. We first introduce the anomaly detection task, and then give an overview of the traditional techniques used to detect anomalies, such as statistical models, clustering, and one-class classification. We will talk about the challenges and the opportunities for more advanced algorithms.
Then we focus on introducing the state-of-the-art deep anomaly detection algorithms. Among deep anomaly detection techniques, we cover two fundamental tasks: 1) learning normal representations from complex data, where RNN, LSTM, Auto-Encoder, GAN and their variations are widely adopted for sequential data such as text, audio and time series, while CNN plays a major role for non-sequential data such as images, network data, and sensor data; 2) detecting anomalies, where we summarize the techniques used to effectively detect anomalies based on reconstruction errors, reconstruction probabilities and one-class NN. Semi-supervised learning techniques and transfer learning, which are used to compensate for sparse anomaly labels, are also presented. In terms of architecture, we introduce the architectures of deep learning anomaly detection models, including hybrid models and spatial-temporal networks.
Next, we evaluate deep learning methodologies on several publicly available data sets. In addition, we illustrate the end-to-end anomaly detection product at LinkedIn by sharing our experience with multivariate time series deep anomaly detection, multi-step horizon forecasting and pattern-based deep anomaly detection. Finally, we highlight several important future trends.
Targeted Audience: This tutorial is suitable for academic and industrial researchers, graduate students, and practitioners. After the tutorial, we expect the audience to have learnt the key concepts and principles of applying the state-of-the-art deep learning models for anomaly detection and gained real-world experiences through illustrative examples.
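For readers new to the area, the sketch below shows one example of the traditional statistical techniques mentioned above: flagging points by their modified z-score, computed from the median and the median absolute deviation (MAD). It is a generic baseline, not the LinkedIn system or any deep model from the tutorial; the function name `mad_anomalies`, the sample readings, and the commonly cited cut-off of 3.5 are assumptions for illustration.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation. Unlike the mean and standard deviation, the
    median and MAD are not dragged towards the outliers being searched for.
    """
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    return [
        i for i, x in enumerate(values)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]


# Hypothetical sensor readings with one obvious spike at the end.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
flagged = mad_anomalies(readings)
```

A univariate rule like this has no notion of temporal context or high-dimensional structure, which is the gap the learned representations and reconstruction-based scores described in the abstract are meant to fill.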
Adversarial Machine Learning in Recommender Systems (AML-RecSys)
Felice Antonio Merra, Polytechnic University of Bari, Italy
Tommaso Di Noia, Polytechnic University of Bari, Italy
Yashar Deldjoo, Polytechnic University of Bari, Italy
Topic: Recommender systems (RS) have emerged as a paradigm of information push to alleviate the information overload issue and enhance the user experience. A variety of recommendation models have been proposed in the last two decades for different recommendation tasks in various domains, and they all share an underlying assumption: user-item interaction can serve as proper ground truth for model training and evaluation.
Recent advances in adversarial learning show that even state-of-the-art recommendation approaches, such as collaborative filtering matrix factorization (CF-MF) models or those based on deep neural architectures (deep-CF), can be vulnerable to adversarial perturbations applied to the input data, which puts the robustness of recommendation models in jeopardy. For instance, it has been shown that when small fluctuations are added to the input data or parameters of model-based CF, the model itself can amplify the fluctuation, leading to a significant change in the final prediction value. The sources of such adversarial perturbations can be noisy or unrealistic user feedback, or perturbations introduced artificially for a malicious purpose.
Therefore, more recently the community of recommender systems and information retrieval has been moving in the direction of obtaining a richer understanding of the role adversarial training can have for recommender systems in order to improve, for instance, the robustness of RS and increase their overall performance.
In the context of this tutorial, we introduce Adversarial Machine Learning (AML) and its successful application in RS with a comprehensive overview of existing works in the literature. We will present state-of-the-art approaches for AML applied to the recommender systems field providing both academic and industrial participants with a rich understanding of existing works by looking at goals, domains and technical characteristics.
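The effect of small perturbations on a matrix factorization score can be sketched in a few lines. The example assumes a plain dot-product model and a perturbation of the user embedding bounded in the max norm by `eps`, in the spirit of gradient-sign attacks; it is not a specific method from the tutorial, and the names `worst_case_user_perturbation`, `user`, `item` and the numbers are hypothetical.

```python
def dot(u, v):
    """Predicted preference score of a dot-product matrix factorization model."""
    return sum(a * b for a, b in zip(u, v))


def sign(x):
    return (x > 0) - (x < 0)


def worst_case_user_perturbation(item_vec, eps):
    """Perturbation of the user vector, with every entry in [-eps, eps],
    that lowers the score the most.

    The gradient of dot(p, q) with respect to p is q, so stepping each
    coordinate by eps against the sign of q is the worst case under this bound.
    """
    return [-eps * sign(q) for q in item_vec]


# Hypothetical learned embeddings for one user and one item.
user = [0.5, -0.2, 0.1]
item = [1.0, -2.0, 0.5]

clean_score = dot(user, item)
delta = worst_case_user_perturbation(item, eps=0.1)
attacked_score = dot([p + d for p, d in zip(user, delta)], item)
score_drop = clean_score - attacked_score
```

In this linear setting the drop equals `eps` times the sum of absolute values of the item embedding, so it grows with the embedding dimension and magnitude even when each coordinate moves only slightly. Adversarial training, roughly speaking, adds such worst-case perturbations during training so the learned embeddings become less sensitive to them.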
Web-Scale Knowledge Collection
Colin Lockard, University of Washington, USA
Prashant Shiralkar, Amazon, USA
Xin Luna Dong, Amazon, USA
Hannaneh Hajishirzi, University of Washington, USA
Topic: How do we surface the large amount of information present in HTML documents on the Web, from news articles to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications including knowledge base construction, question answering, recommendation, and more. In this tutorial, we present approaches for information extraction (IE) from Web data that can be differentiated along two key dimensions: 1) the diversity in data modality that is leveraged, e.g. text, visual, XML/HTML, and 2) the thrust to develop scalable approaches with zero to limited human supervision.
The World Wide Web contains vast quantities of textual information in several forms: unstructured text, template-based semi-structured webpages (which present data in key-value pairs and lists), and tables. Methods for extracting information from these sources and converting it to a structured form have been a target of research from the natural language processing (NLP), data mining, and database communities. While these researchers have largely separated extraction from web data into different problems based on the modality of the data, they have faced similar problems such as learning with limited labeled data, defining (or avoiding defining) ontologies, making use of prior knowledge, and scaling solutions to deal with the size of the Web.
This tutorial takes a holistic view toward information extraction, exploring the commonalities in the challenges and solutions developed to address these different forms of text. We will explore approaches targeted at unstructured text that largely rely on learning syntactic or semantic textual patterns, approaches targeted at semi-structured documents that learn to identify structural patterns in the template, and approaches targeting web tables that rely heavily on entity linking and type information. Finally, we will look at recent research that takes a more inclusive approach toward textual extraction by combining the different signals from textual, layout, and visual clues into a single model, made possible by deep learning methods.
Questions about the WSDM tutorials should be directed to: firstname.lastname@example.org