WSDM Cup 2020

WSDM Cup is a competition-style event co-located with the leading WSDM conference. This year, we have three exciting competition tasks from Microsoft Research, 4Paradigm and Sichuan Airlines, each of which makes available one or more industrial-scale datasets, enabling research on new problems. These challenges will be conducted through publicly accessible data challenge platforms and will have clear objectives to maximize, which determine the rankings. Top participants will receive cash prizes of thousands of dollars and be invited to present their work at WSDM Cup in February 2020.

Event Schedule

9:00 - 9:10 AM Opening and Prizes
9:10 - 9:45 AM Invited Talk 1: Enhancing research evaluations with GOTO data science (Dr. Kuansan Wang, Microsoft)
9:45 - 10:00 AM An Adaptive Early Stopping Strategy for Query-based Passage Re-ranking (Team Dlutycx: Chengxuan Ying, Chen Huo)
10:00 - 10:30 AM Coffee Break
10:30 - 10:45 AM An Empirical Two-Stage Framework on Citation Intent Recognition (Team SimpleBaseline: Kuei-Chen Huang, Chi-yu Yang)
10:45 - 11:00 AM Recall and re-rank: An Empirical Ensemble Framework for Citation Intent Recognition (Team Xiong: Gaoxiong Cao, Ziming Wu, Xiaohao Xu, Yinxiang Xu, Yongqiang Liu)
11:00 - 11:15 AM An Effective Approach for Citation Intent Recognition Based on Bert and LightGBM (Team Ferryman: Weilong Chen, Shuaipeng Liu, Wei Bao, Huixing Jiang)
11:15 - 11:45 AM Invited Talk 2: Towards Automated Time Series Forecasting (Dr. Zhen Xu, 4Paradigm)
11:45 - 12:15 PM An Overview of AutoSeries Results (Team DenisVorotyntsev: Denis Vorotyntsev)
12:15 - 12:25 PM Closing

[Invited Talk 1] Enhancing research evaluations with GOTO data science

Abstract: For more than two centuries, research output has been growing at an exponential rate. The world first saw more than 1 million research articles published in a single year in 1974; today we publish as many papers in a month. The volume, velocity, and variety of scientific reports have exceeded human capacity to process them properly, creating an unprecedented cognitive challenge in understanding their significance, recognizing their impact and planning for future investments. The research communities have recognized the crisis and proclaimed a Declaration on Research Assessment (DORA), which has drawn a groundswell of support, and the Computing Research Association (CRA) has further developed the “GOTO” principle, which calls for Good and Open data with Transparent and Objective methodologies for research assessments. This talk will build on the evidence and rationale described in the July 2019 issue of CACM and illustrate how publicly available datasets on scholarly communication and best practices in modern data mining, such as those reported in Track 1 of WSDM Cup 2020, can potentially be used to attack the cognitive overload challenge and revolutionize research evaluations.

Bio: Kuansan Wang is Managing Director and Principal Researcher at Microsoft Research Outreach in Redmond, WA. He joined Microsoft Research in 1998, first as a researcher in the Speech Technology Group working on multimodal dialog systems, then as an architect who designed and shipped various speech products, including Voice Command on mobile, which eventually became Cortana, and Microsoft Speech Server, which still powers call centers at Microsoft and its partners. In 2007, he rejoined Microsoft Research to work on large-scale natural language understanding and web search technologies, and he is currently responsible for running the largest machine reading efforts that use AI agents to dynamically acquire knowledge from the web and make it available to the general public. Kuansan received his BS from National Taiwan University and his MS and PhD from the University of Maryland, College Park, all in Electrical Engineering. In addition to the 120+ scholarly papers and 40+ patents he has published, his work has been adopted into 10 international standards from W3C, Ecma and ISO.

[Invited Talk 2] Towards Automated Time Series Forecasting

Abstract: For time series forecasting, ML methods show strong predictive performance. However, in practice, many engineers do a lot of ad hoc feature engineering, and it is very difficult to switch between different datasets without human effort. To address this problem, Automated Machine Learning (AutoML) has been proposed to explore automatic pipelines that train an effective ML model for a given task requirement. Since its proposal, AutoML has been explored in various applications, and a series of AutoML competitions, e.g., the Auto-ML Track at KDD Cup, Automated Natural Language Processing (AutoNLP) and Automated Computer Vision (AutoCV), have been organized by 4Paradigm, Inc. and ChaLearn (sponsored by Google, Microsoft). These competitions have drawn a lot of attention from both academic researchers and industrial practitioners. In this talk, we will explain the design of the challenge, what is special about time series and AutoSeries, and what results and lessons we have drawn from it.

Bio: Zhen Xu received an engineering degree from Ecole Polytechnique, Paris. He is now a machine learning engineer at 4Paradigm, Beijing, where he is in charge of challenge organization.

See here for submitted reports from WSDM Cup 2020.


Task 1

Microsoft Research - Citation Intent Recognition

For centuries, a key to the remarkable technological progress in our society has been the unassailable integrity exhibited by scientists in conducting scholarly communications. New discoveries and theories are openly distributed and discussed in published articles, and impactful contributions are often recognized by the research community at large in the form of citations. However, with the competition for research funding or promotions getting ever fiercer, unscrupulous behaviors aimed at “gaming the system” rather than advancing the frontiers of our knowledge have become regrettably prevalent. In a practice known as “coercive citation”, journal editors have been seen to force authors to cite marginally relevant articles in particular journals to boost those journals' impact factors, and paper reviewers have done the same solely to increase their own citation counts or h-index. Such conduct is an affront to the highest integrity demanded of any scientist or technologist and, left unchecked, can undermine public trust and hamper future developments in science and technology. This contest is the first in a series that explores the extent to which web search and data mining technologies can be employed to distinguish superfluous citations from genuine recognition. In this contest, however, we are focusing on a necessary first step in which the citation intents of the authors are recognized: the contestant is asked to develop a system that can recognize the citation intent of a given passage in a scholarly article and retrieve relevant citation targets from a given database.
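To make the retrieval part of the task concrete, here is a minimal baseline sketch in Python. It ranks candidate papers against a citation passage by bag-of-words cosine similarity. This is only an illustration under stated assumptions: the function names, the paper IDs and the example texts are all hypothetical, and this is not the official baseline or any participating team's method.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(passage: str, candidates: Dict[str, str], top_k: int = 3) -> List[str]:
    """Return the IDs of the top_k candidates most similar to the passage."""
    query = Counter(tokenize(passage))
    scored = [
        (cosine(query, Counter(tokenize(text))), paper_id)
        for paper_id, text in candidates.items()
    ]
    # Highest score first; ties are broken by paper ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [paper_id for _, paper_id in scored[:top_k]]


# Hypothetical toy database of candidate abstracts.
candidates = {
    "p1": "graph neural networks for node classification",
    "p2": "early stopping for passage re-ranking with BERT",
    "p3": "flight delay prediction",
}
passage = "we re-rank passages with BERT as in [CITATION]"
top = rank_candidates(passage, candidates, top_k=2)
```

In this toy example only "p2" shares any tokens with the passage, so it is ranked first. A competitive system would likely replace the raw term counts with learned representations and add a re-ranking stage.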

Access the competition details here! Good luck!

Task 2

4Paradigm - Automated Time Series Regression

Machine Learning has achieved remarkable success in time series-related tasks, e.g., classification, regression and clustering. For time series regression, ML methods show strong predictive performance. However, in practice, it is very difficult to switch between different datasets without human effort. To address this problem, Automated Machine Learning (AutoML) has been proposed to explore automatic pipelines that train an effective ML model for a given task requirement. Since its proposal, AutoML has been explored in various applications, and a series of AutoML competitions, e.g., the Auto-ML Track at KDD Cup, Automated Natural Language Processing (AutoNLP) and Automated Computer Vision (AutoCV), have been organized by 4Paradigm, Inc. and ChaLearn (sponsored by Google, Microsoft). These competitions have drawn a lot of attention from both academic researchers and industrial practitioners. In this challenge, we further propose the Automated Time Series Regression (AutoSeries) competition, which aims at automated solutions for time series regression tasks. This challenge is restricted to multivariate regression problems, which come from different time series domains, including air quality, sales, work presence, city traffic, etc. Submitted solutions are expected to flexibly handle multiple types of datasets, automatically extract useful features, discover temporal correlations, and be generic enough to apply to unseen datasets.
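As a rough illustration of what automatic feature extraction from a time series can look like, the following Python sketch turns a single series into rows of lag features plus a rolling mean. It is a simplified, hypothetical example: the function name, the feature names and the sample readings are assumptions, and it does not reflect the AutoSeries data format or any team's solution.

```python
from typing import Dict, List, Sequence


def make_lag_features(
    series: Sequence[float], lags: Sequence[int]
) -> List[Dict[str, float]]:
    """Build one feature row per time step that has all requested lags available."""
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        # Value of the series k steps before the current time step.
        row = {f"lag_{k}": series[t - k] for k in lags}
        # Mean of the previous max_lag values (the target itself is excluded).
        window = series[t - max_lag:t]
        row["rolling_mean"] = sum(window) / len(window)
        row["target"] = series[t]
        rows.append(row)
    return rows


# Hypothetical hourly air-quality readings.
readings = [10.0, 12.0, 11.0, 13.0, 15.0]
rows = make_lag_features(readings, lags=[1, 2])
```

Each resulting row can be passed to any standard regression model. Because the features only look backwards in time, no information from the target leaks into its own inputs.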

Access the competition details here! Good luck!

Task 3

Sichuan Airlines - Flight Delay Discovery and Optimization

Sichuan Airlines Co., Ltd. was established on August 29, 2002. Its headquarters is located in Chengdu, with nine branches in Chongqing, Beijing, Yunnan, etc., and three operational bases in Shenzhen, Nanning and Mianyang. Growing from its original seven routes to more than 200 today, Sichuan Airlines has built a well-operating network that integrates main routes, secondary routes, international routes, regional routes and branch routes, contributing to the formation of a regional comprehensive transportation hub. On the one hand, both the number of passengers and the number of routes served by Sichuan Airlines have shown a gradual upward trend. On the other hand, Sichuan Airlines is also facing an increasing number of challenges in scheduling flights, such as severe weather and aircraft faults, and these issues can cause large-scale delays of subsequent flights at the affected airports. When such delays are likely, the dispatcher needs to adjust the flight schedules in time to keep flight scheduling reasonable and orderly. However, Sichuan Airlines currently adjusts flight scheduling and updates flight information in the system with manual assistance, which means that in extreme weather, identifying and adjusting flights by hand is not only time-consuming and laborious but also subject to many restrictions. The purpose of this project is to implement a model that automatically identifies the subsequent flights that are likely to be delayed and recommends an optimization scheme.

Access the competition details here! Good luck!

Contact

WSDM Cup Chairs

Questions about the WSDM Cup 2020 should be directed to: wsdm-2020-cup-chairs@googlegroups.com.