The 9th ACM International Conference on Web Search and Data Mining

San Francisco, California, USA. February 22-25, 2016.

Keynotes

Tuesday 9:00 AM

Jeff Dean

Google Senior Fellow
Google Research

Title: Large-Scale Deep Learning For Building Intelligent Computer Systems

Abstract: For the past five years, the Google Brain team has focused on conducting research in difficult problems in artificial intelligence, on building large-scale computer systems for machine learning research, and, in collaboration with many teams at Google, on applying our research and systems to dozens of Google products. Our group has recently open-sourced the TensorFlow system (tensorflow.org), a system designed to easily express machine learning ideas, and to quickly train, evaluate and deploy machine learning systems. In this talk, I'll highlight some of the design decisions we made in building TensorFlow, discuss research results produced within our group, and describe ways in which these ideas have been applied to a variety of problems in Google's products, usually in close collaboration with other teams.
This talk describes joint work with many people at Google.

Bio: Jeff joined Google in 1999 and is currently a Google Senior Fellow in Google's Research Group, where he leads the Google Brain team, Google's deep learning research team in Mountain View, working on systems for speech recognition, computer vision, language understanding, and various predictive tasks. He has co-designed/implemented five generations of Google's crawling, indexing, and query serving systems, and co-designed/implemented major pieces of Google's initial advertising and AdSense for Content systems. He is also a co-designer and co-implementor of Google's distributed computing infrastructure, including the MapReduce, BigTable and Spanner systems, protocol buffers, LevelDB, the recently open-sourced TensorFlow system for machine learning, systems infrastructure for statistical machine translation, and a variety of internal and external libraries and developer tools. He received a Ph.D. in Computer Science from the University of Washington in 1996. He is a Fellow of the ACM and the AAAS, a member of the U.S. National Academy of Engineering, and a recipient of the Mark Weiser Award and the ACM-Infosys Foundation Award in the Computing Sciences.

Slides

Wednesday 9:00 AM

Yiling Chen

Gordon McKay Professor of Computer Science
Harvard University

Title: Why Incentive Alignment is Relevant for Data Science

Abstract: We are blessed with unprecedented abilities to connect with people all over the world: buying and selling products, sharing information and experiences, asking and answering questions, collaborating on projects, borrowing and lending money, and exchanging excess resources. These activities result in rich data that scientists can use to understand human social behavior, generate accurate predictions, and make policy recommendations. Data science and machine learning traditionally take such data as given, often treating them as independent samples drawn from some underlying true distribution. However, such data are possessed or generated by (potentially strategic) people in the context of specific interaction rules. Hence, what data become available depends on the interaction rules. For example, people with sensitive medical conditions may not reveal their medical data in a survey but could be willing to share them when compensated; crowd workers may not put in a good-faith effort in completing a task if they know that the requester cannot verify the quality of their contributions.
In this talk, I argue that a holistic view that jointly considers data acquisition and inference and learning is important for data science and machine learning. I will first discuss a set of behavioral experiments showing how incentive schemes can affect elicited contributions in crowdsourcing. Then, I will present a project on designing mechanisms for online procurement of data held by strategic agents for machine learning tasks. In particular, agents have a private cost of furnishing their data which can be arbitrarily correlated with the data itself. The challenge is to use past data to actively price future data to obtain learning guarantees.

Bio: Yiling Chen is the Gordon McKay Professor of Computer Science at Harvard University. She received her Ph.D. in Information Sciences and Technology from the Pennsylvania State University. Prior to working at Harvard, she spent two years at Yahoo! Research in New York City. Her current research focuses on topics in the intersection of computer science and economics. Her awards include an ACM EC Outstanding Paper Award, an AAMAS Best Paper Award, and an NSF CAREER Award, and she was selected by IEEE Intelligent Systems as one of "AI's 10 to Watch" in 2011.

Slides

Thursday 9:00 AM

Foster Provost

Professor of Data Science and Andre Meyer Faculty Fellow
New York University

Title: The Predictive Power of Massive Data about our Fine-Grained Behavior

Abstract: What really is it about “big data” that makes it different from traditional data? In this talk I illustrate one important aspect: massive ultra-fine-grained data on individuals' behaviors holds remarkable predictive power. I examine several applications to marketing-related tasks, showing how machine learning methods can extract the predictive power and how the value of the data “asset” seems different from the value of traditional data used for predictive modeling.
I then dig deeper into explaining the predictions made from massive numbers of fine-grained behaviors by applying a counterfactual framework for explaining model behavior based on treating the individual behaviors as evidence that is combined by the model. This analysis shows that the fine-grained behavior data incorporate various sorts of information that we traditionally have sought to capture by other means. For example, for marketing modeling the behavior data effectively incorporate demographics, psychographics, category interest, and purchase intent.
Finally, I discuss the flip side of the coin: the remarkable predictive power based on fine-grained information on individuals raises new privacy concerns. In particular, I discuss privacy concerns based on inferences drawn about us (in contrast to privacy concerns stemming from violations of data confidentiality). The evidence counterfactual approach used to explain the predictions also can be used to provide online consumers with transparency into the reasons why inferences are drawn about them. In addition, it offers the possibility to design novel solutions such as a privacy-friendly “cloaking device” to inhibit inferences from being drawn based on particular behaviors.
This talk draws on work from several papers.

Bio: Foster Provost is Professor of Data Science and Andre Meyer Faculty Fellow at New York University. He is coauthor of the best-selling data science book, Data Science for Business. His research focuses on modeling behavior data, modeling (social) network data, crowd-sourcing for data science, aligning data science with application goals, and privacy-friendly methods. His research has won many awards, including the INFORMS Design Science Award and best paper awards at KDD across three decades. He cofounded several companies based on his research, including Dstillery and Integral Ad Science. Foster previously was Editor-in-Chief of the journal Machine Learning. His latest music album, Mean Reversion, is scheduled to be released in 2015.

Invited Speakers: Practice & Experience Track

Wednesday 4:45 PM

Lars Backstrom

Director of Engineering on News Feed
Facebook

Title: Serving a Billion Personalized News Feeds

Abstract: Feed ranking’s goal is to provide people with over a billion personalized experiences. We strive to provide the most compelling content to each person, personalized to them so that they are most likely to see the content that is most interesting to them.
Similar to a newspaper, putting the right stories above the fold has always been critical to engaging customers and interesting them in the rest of the paper. In feed ranking, we face a similar challenge, but on a grander scale. Each time a person visits, we need to find the best piece of content out of all the available stories and put it at the top of feed where people are most likely to see it. To accomplish this, we do large-scale machine learning to model each person, figure out which friends, pages and topics they care about and pick the stories each particular person is interested in. In addition to the large-scale machine learning problems we work on, another primary area of research is understanding the value we are creating for people and making sure that our objective function is in alignment with what people want.

Bio: Lars Backstrom graduated from Cornell University as an undergrad in 2004 and with a PhD in 2009. Lars joined Facebook in the fall of 2009. At Facebook, he first worked on the People You May Know system, trying to find people whom users are friends with but not yet connected to on Facebook. This combined machine learning with network science, as Facebook worked to use the graph structure to find missing edges and connect them. Since late 2010, he has been working on News Feed, trying to understand the social graph to find people's closest friends and connect everyone with their friends and family by showing the most relevant content produced by those people in News Feed.

Tuesday 4:00 PM

Yoelle Maarek

Vice-President of Research
Yahoo

Title: Is Mail The Next Frontier In Search And Data Mining?

Abstract: The nature of Web mail traffic has significantly evolved in the last two decades, and consequently the behavior of Web mail users has also changed. For instance, a recent study conducted by Yahoo Labs showed that today 90% of Web mail traffic is machine-generated. This partly explains why email traffic continues to grow even if a significant amount of personal communications has moved towards social media. Most users today are receiving in their inbox important invoices, receipts, and travel itineraries, together with non-malicious junk mail such as hotel newsletters or shopping promotions that they could safely ignore. This is one of the reasons that a majority of messages remain unread, and many are deleted without being read. In that sense, Web mail has become quite similar to traditional snail mail. In spite of this drastic change in nature, many mail features remain unchanged. While 70% of mail users do not define even a single folder, folders are still predominant in the left rail of many Web mail clients. Mail search results are still mostly ranked by date, which makes retrieving older messages extremely challenging. This is even more painful to users, as unlike in Web search, they will know when a relevant previously read message has not been returned.
In this talk, I present the results of multiple large-scale studies that have been conducted at Yahoo Labs in the last few years. I highlight the inherent challenges associated with such studies, especially around privacy concerns. I will discuss the new nature of consumer Web mail, which is dominated by machine-generated messages of highly heterogeneous forms and value. I will show how the change has not been fully recognized yet by most email clients. As an example, why should there still be a reply option associated with a message coming from a "do-not-reply@" address? I will introduce some approaches for large-scale mail mining specifically tailored to machine-generated email. I will conclude by discussing possible applications and research directions.

Bio: Yoelle Maarek is VP Research EMEA at Yahoo. Her teams in Israel and UK conduct research around mail, search and native ads, directly impacting Yahoo products. Prior to this, Yoelle was the Director of Google Haifa Engineering Center, which she opened in 2006 and grew to close to 40 team members. There, she led the team that launched “Suggest”, Google’s query completion feature on google.com and YouTube. From 1989 to 2006, Yoelle was with IBM Research, first in the US, and then in Israel, where she held a number of technical and management positions, eventually leading the search and collaboration department and becoming a Distinguished Engineer. She received her PhD in Computer Science from Technion, in Haifa, Israel. In parallel, she spent a year in the Computer Science Department of Columbia University in New York, as a visiting PhD student. She graduated from the “Ecole Nationale des Ponts et Chaussees” in Paris, France, and received her “DEA” (graduate degree) in Computer Science from Paris VI University, both in 1985. Yoelle’s research interests include Information Retrieval, Web search, Web mining and Web applications. She has published more than 70 articles in these fields. She has been involved in various senior roles in most of the recent SIGIR, WWW and WSDM conferences. Yoelle currently serves as Vice Chair of the SIGIR Executive committee and was recently elected to IW3C2, the organization that manages the WWW conference series. Yoelle is a member of the Board of Governors of the Technion, chairing its Student Affairs Committee, as well as a member of the Technion Management Council. She was inducted as an ACM Fellow in 2013.

Thursday 3:30 PM

Mor Naaman

Associate Professor, Jacobs Institute at Cornell Tech
Co-Founder and Chief Scientist at Seen.co

Title: The Past and Future of Systems for Current Events

Abstract: An overwhelming amount of content from real-world events is shared by individuals through social media services. This shared media represents an important part of our society, culture and history. At the same time, this social media event content is still difficult to consume and understand, fragmented across services, and hard to find. We have worked since 2008, in both research and startup settings, to tackle these (and other) challenges in making social media information about events accessible and usable. I will discuss our early research, show how it led to the startup company I co-founded, comment on what the startup (which recently pivoted away from events) did well and where it failed, and highlight open challenges and directions for future work and research in this area.

Bio: Mor Naaman is an associate professor of Information Science at the Jacobs Technion-Cornell Institute at Cornell Tech, where he is the founder of the Connective Media hub, and leads a research group focused on social technologies. His research applies multidisciplinary methods to 1) gain a better understanding of people and their use of social tech; 2) extract insights about people, technology and society from social media and other sources of social data; and 3) develop new social technologies as well as novel tools to make social data more accessible and usable in various settings. Previously, Mor was on the faculty at Rutgers SC&I, led a research team at Yahoo! Research Berkeley, received a Ph.D. in Computer Science from Stanford University, and played professional basketball for Hapoel Tel Aviv. He is a recipient of an NSF Early Faculty CAREER Award, research awards and grants from numerous corporations including AOL and Google, and multiple best paper awards. Find out more about Mor at http://mornaaman.com.

Wednesday 4:00 PM

Jie Tang

Associate Professor, Department of Computer Science and Technology
Tsinghua University

Title: AMiner: Toward Understanding Big Scholar Data

Abstract: In this talk, I present a novel academic search and mining system, AMiner, the second generation of the ArnetMiner system. Different from traditional academic search systems that focus on document (paper) search, AMiner aims to provide a systematic modeling approach to gain a deep understanding of the large and heterogeneous networks formed by authors, papers they have published, and venues in which they were published. The system extracts researchers’ profiles automatically from the Web and integrates them with published papers after name disambiguation. It has collected a large scholar dataset, with more than 130,000,000 researcher profiles and 100,000,000 papers from multiple publication databases. We have also developed an approach named COSNET to connect AMiner with several professional social networks, such as LinkedIn and VideoLectures, which significantly enriches the scholar metadata. Based on our integrated big scholar data, we devised a unified topic modeling approach to modeling the different entities (authors, papers, venues) simultaneously and providing a topic-level expertise search by leveraging the modeling results. In addition, AMiner offers a set of researcher-centered functions, including social influence analysis, influence visualization, collaboration recommendation, relationship mining, similarity analysis, and community evolution. The system has been in operation since 2006 and has attracted more than 7,000,000 independent IP accesses from over 200 countries/regions.

Bio: Jie Tang is an associate professor with the Department of Computer Science and Technology, Tsinghua University. His interests include social network analysis, data mining, and machine learning. He has published more than 100 journal/conference papers and holds 10 patents. He has served as PC Co-Chair of WSDM’15, ASONAM’15, ADMA’11, SocInfo’12, KDD-CUP Co-Chair of KDD’15, Poster Co-Chair of KDD’14, Workshop Co-Chair of KDD’13, Local Chair of KDD’12, Publication Co-Chair of KDD’11, and as a PC member of more than 50 international conferences. He is the principal investigator of National High-tech R&D Program (863), NSFC project, Chinese Young Faculty Research Funding, National 985 funding, and international collaborative projects with Minnesota University, IBM, Google, Nokia, Sogou, etc. He leads the project Arnetminer.org for academic social network analysis and mining, which has attracted millions of independent IP accesses from 220 countries/regions in the world. He was honored with the Newton Advanced Scholarship Award, CCF Young Scientist Award, NSFC Excellent Young Scholar, and IBM Innovation Faculty Award.

Slides