Deep Learning Technology Center
14820 NE 36th St,
Redmond, WA 98052
Curriculum Vitae (Last updated: November 15, 2016)
I am a researcher in the Deep Learning Technology Center at Microsoft Research. I am easily distracted by (cover-letter-speak: my interests are) new or elegant ideas in Machine Learning, more broadly, Artificial Intelligence, with applications that grapple with ambiguity (e.g. Natural Language Processing) and/or adapt to an ever-changing worldview.
Following the parable of the Rabbit's Ph.D.,
I was advised by Prof. Thorsten Joachims during my PhD at Cornell University in the Department of Computer Science.
He was flanked by Prof. Johannes Gehrke,
Prof. Éva Tardos and
Prof. Robert Kleinberg in my committee.
My graduate minor was in Applied Mathematics.
My thesis studies machine learning for interactive systems.
I developed principles and algorithms that can re-use the interaction logs of these systems to inform
the design of (train and evaluate) future systems --
technically speaking, Counterfactual Evaluation and Learning from Logged User Feedback.
Slides from my thesis defense talk are available here.
I spent the 2015-16 academic year visiting the Information and Language Processing Systems group at the University of Amsterdam, interned with the Machine Learning group at Microsoft Research NYC during the summer of 2015, Computer Human Interactive Learning group (now called Machine Teaching Group) at Microsoft Research Redmond during the summer of 2013, Search Labs at Microsoft Research during the summer of 2012, and worked as a strategist with Tower Research Capital for 14 months from June 2010 - July 2011.
I received a Bachelor of Technology degree in Computer Science and Engineering from Indian Institute of Technology, Bombay in 2010 and a Master of Science degree in Computer Science from Cornell University in 2014. I interned at Microsoft India Development Centre as a Software Development Engineer-Tester during the summer of 2009.
Before that, I played games and sports all day every day.
POEM: Policy Optimization for Exponential Models.
POEM performs counterfactual risk minimization on logged bandit feedback data.
BLBF-DisplayAds: Test-Bed for Large-scale Validation of Counterfactual Learning Methods.
In collaboration with Criteo (a leader in the display advertising space), we provide a large-scale dataset and a standardized test-bed to systematically investigate off-policy learning algorithms.
SIGIR-Demo: Demonstration of Counterfactual Evaluation and Learning (SIGIR 2016 Tutorial).
This demo implements a news recommendation scenario and provides simple implementations of several off-policy estimators (for evaluation) and learning algorithms (for optimization) for benchmarking.
MNAR: Collaborative Filtering on User Ratings.
This implements a collaborative filtering algorithm for rating prediction, using principles from causal inference.
KESS: Keyphrase Extraction through Submodular Summarization.
[Coming soon:] KESS extracts phrases that describe novel and influential ideas from a time-stamped corpus of texts.
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. [abstract, WWW Companion 2015] [paper, ICML2015] [extended, JMLR2015].
Adith Swaminathan and Thorsten Joachims.
The logs of an interactive learning system contain valuable information about the system's performance and user experience. However they are biased (actions preferred by the system are over-represented) and incomplete (user feedback for actions not taken by the system is unavailable). So they cannot be used to train and evaluate better systems through supervised learning. We developed a new learning principle and an efficient algorithm that re-uses such logs to solve structured prediction problems. The key insight was to reason about the variance introduced when fixing the biased logs, and to use an efficient conservative approximation of this variance so that we can solve the problem at scale (software available online).
The Self-Normalized Estimator for Counterfactual Learning. NIPS 2015.
Adith Swaminathan and Thorsten Joachims.
We identified an interesting, new kind of overfitting in the batch learning from logged bandit feedback scenario which we called propensity overfitting. The conventional counterfactual risk estimator ("Inverse Propensity Scoring" --- IPS) is very vulnerable to this overfitting when optimized over rich hypothesis spaces. We identified one property, equivariance, that guards against this problem and motivated the use of the self-normalized risk estimator for this problem.
Off-Policy Evaluation for Slate Recommendation. ICML 2016 Personalization Workshop.
Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, Imed Zitouni.
Counterfactual techniques based on the standard "Inverse Propensity Scoring" (IPS) trick fail catastrophically when the number of actions that a system can take is large --- e.g., when recommending slates in an information retrieval application with combinatorially many rankings of items. We developed an estimator for evaluating slate recommenders that exploits combinatorial structure when constructing a slate and dominates the performance of IPS-based methods.
Recommendations as Treatments: Debiasing Learning and Evaluation. ICML 2016.
Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, Thorsten Joachims
The Netflix Prize was a challenge to predict how a population of users would rate an inventory of items, using user-volunteered item-ratings. However, there are severe biases --- “Missing Not At Random” (MNAR) --- in this volunteered data. We developed the first principled discriminative approach to this MNAR problem, showed that it works reliably on two user-volunteered ratings datasets, and provided an implementation online.
Unbiased Ranking Evaluation on a Budget. WWW Companion 2015.
Tobias Schnabel, Adith Swaminathan and Thorsten Joachims.
Evaluating structured prediction systems is hard and costly. We re-wrote several popular evaluation metrics for these systems as Monte Carlo estimates and used ideas from importance sampling to derive a simple, efficient, and unbiased strategy that also generates re-usable judgements.
Unbiased Comparative Evaluation of Ranking Functions. ICTIR 2016.
(Best Presentation Award)
Tobias Schnabel, Adith Swaminathan, Peter Frazier and Thorsten Joachims.
The Monte Carlo estimation view of ranking evaluation we developed above unified several prior approaches to building re-usable test collections. We developed new estimators and principled document sampling strategies that substantially improve statistical efficiency in several comparative evaluation scenarios (e.g. when compiling a leaderboard of competing rankers).
Unbiased Learning-to-Rank with Biased Feedback. WSDM 2017.
(Best Paper Award)
Thorsten Joachims, Adith Swaminathan, Tobias Schnabel.
Can we avoid relying on randomization when collecting user feedback for training learning to rank (L2R) algorithms? We can, by piggybacking on the randomness in user actions! User feedback is typically biased, so we re-purposed user click models as propensity estimators to obtain "de-biased" user feedback. These propensity-weighted click-logs were then used to train standard L2R algorithms, and showed very strong results on the arXiv search engine.
Temporal corpus summarization using submodular word coverage. CIKM 2012.
Ruben Sipos, Adith Swaminathan, Pannaga Shivaswamy and Thorsten Joachims.
Evolving corpora can be summarized through timelines indicating the influential elements. We created interesting timelines of articles in the ACL anthology and NIPS conference proceedings over a span of 15 years using landmark documents, influential authors and topic key-phrases. We achieved this by formulating a sub-modular objective extending the coverage problem over words (traditional summarization) to incorporate coverage across time.
Beyond myopic inference in Big Data pipelines. KDD 2013.
Karthik Raman, Adith Swaminathan, Johannes Gehrke and Thorsten Joachims.
Pipelines constructed using modular components have very brittle performance: errors in earlier components cascade through the pipeline causing catastrophic failures in the eventual output. We explored simple ideas from Probabilistic Graphical Models to make the inference procedure more robust, while still using off-the-shelf components to compose pipelines.
Mining videos from the web for electronic textbooks. ICFCA 2014.
Rakesh Agrawal, Maria Christoforaki, Sreenivas Gollapudi, Anitha Kannan, Krishnaram Kenthapadi and Adith Swaminathan.
There is a wealth of educational content available online, but the primary teaching aid for the vast majority of students (e.g. in India and China) continues to be the textbook. Recognizing how textbook sections are organized and mining them for concepts using Formal Concept Analysis, we automatically retrieved appropriate videos from Youtube Edu. Using transcripts made available by educators to remain accessible to deaf students, we side-stepped the challenge of automatic audio transcription or content-based video retrieval.
This timeline collects some of the fun research projects I worked on, arranged chronologically. The icons represents Talks, references Unpublished research projects, and references Publications. The colors indicate:
Hex is a two-player board game
invented by Piet Hein and John Nash. The first player to move has an advantage (a winning strategy exists, but not known for 9x9 boards and larger).
Hex has a large branching factor in the size of the board, which make search-based AI for Hex challenging.
Advised by Prof. Amitabha Sanyal, Nikhil Hooda and I coded a functional AI for Hex in Scheme using micro- and macro-reductions in an abstract connection game, Y. Using differential calculus to infer approximate Probability of Ownership for unoccupied hexes, NASH consistently beat Hexy (2000 Olympiad Gold, AAAI 2000 Outstanding Paper) and Queenbee (2000 Olympiad Silver) [both use alpha-beta search based on virtual connections] on 7x7 boards playing as Player 1, and was competitive with both programs on 9x9 boards. Note, Olympiads are typically played on 11x11 boards and use the pie rule to mitigate Player 1 advantage, which wasn't simulated. The current state-of-the-art is MoHex, which uses Monte Carlo Tree Search. Read more about AI research for Hex here.
Advised by Prof. G. Sivakumar, Karthik Raman, Nikhil Hooda and I tackled the classic online K-server problem. We derived the competitive ratio for several online algorithms, and constructed request sequences that showed these ratio bounds were tight. We also created a variant of the Harmonic algorithm that was O(2k log k)-competitive and efficient (the original Harmonic algorithm has O(2k k)-competitive ratio). The state-of-the-art is the Work Function Algorithm which is (2k-1) competitive (but very inefficient). We couldn't construct a request sequence to show this bound is tight. The K-server conjecture (there exists a deterministic k-competitive algorithm) is open.
Samhita Kasula, Omkar Wagh, Karthik Raman and I were intrigued that we readily anthropomorphize pets but "intelligent entities" (say, Google) are still firmly considered tools. We studied affective computing, and presented work on simulating human-like emotions (e.g. incorporating prosody in Natural Language Processing -- a separate project that Nikhil Hooda, Prachur Goel and I worked on) to have more emotionally intelligent human-computer interactions. Read more about affective computing here.
Surreal numbers are a simple, constructive number system that generalize the reals, infinitesimals and infinite numbers, invented by John Conway. Following Conway's insight connecting surreal numbers to two player games, I demonstrated how surreals can be used to reason about Hex -- however, this representation was much more inefficient than NASH's approximations. This talk won Gold at the Exposition General Championship at IIT Bombay. Read more about surreals and games here.
Advised by Prof. Soumen Chakrabarti, I built a reviewer-paper matching tool that integrated seamlessly with Microsoft CMT, ConfMaster, etc. The tool worked by scraping DBLP for recent publications of reviewers to build a text-profile, collecting citations from submissions, reviewer bids and topic/track/category matches from CMT etc. Optimal weighting of these different affinity cues was learned using transductive regression and using Max-Margin Markov Networks. The assignment was achieved using a min-cost max-flow problem. Modeling asymmetric costs (dissatisfied reviewers who don't get the papers they want vs. irritated reviewers who get papers they don't want) was crucial. Check out the Toronto paper matching system for an alternative approach.
Remember Age of Empires 2? The end-game summary screen showed how your empire fared across the ages -- influential novel events (e.g. wars) were marked out as milestones on the timeline. Evolving text corpora can be summarized in a similar way. At any point in time, novelty can be measured relative to the word content in the past and influence can be measured by looking at the future development of the corpus. By formulating a sub-modular objective incorporating novely and influence, we found key-phrases and milestone papers. These turn out to complement citations (e.g. survey papers are highly cited but not novel). Try our prototype here. Also check out Dafna Shahaf's work on a similar idea -- “information maps”.
Can we make interactive textbooks programmatically, by using online educational content (e.g. Khan Academy, YoutubeEDU)? The Vidya project at MSR SVC attempted this, by retrieving appropriate videos from the web as one reads through a textbook section. Classic retrieval techniques (e.g. Bag-of-Words and topic models) don't work well, so we developed a pipeline to associate sections with distinctive phrases. Using transcripts of videos, we side-stepped the challenge of automatic audio transcription. Although this worked better than conventional approaches, the bigger problem of determining what is appropriate -- over and above what is relevant -- remains open.
We connected the workflow of Big Data pipelines to inference in probabilistic graphical models. This allows us to detect and recover from prediction errors in several workflows, using simple message passing algorithms (beam search) for inference in graphical models. Moreover, it allowed us to trade-off computation cost of inference with accuracy using heuristically adapted beam sizes.
There is ample evidence that unlabeled inputs in addition to labeled data can improve accuracy in univariate classification (refer semi-supervised learning). For structured output prediction, perhaps “unpaired” outputs will also help (since they tell us which structures are a priori more likely). We derived a simple objective that approximated the marginal distribution of these unpaired outputs to learn a discriminative prediction rule, but the quality of this approximation was terrible for high-dimensional problems with limited data. This hurt the eventual performance of this approach. If you have a problem that has unpaired outputs, perhaps you are better off deriving empirical constraints over expected outputs and using expectation regularization or using a generative model.
SPICMACAY Cornell (website I made)