What I learned at SIGIR 2019
A few weeks ago, I attended SIGIR 2019, a conference on Research and Development in Information Retrieval, which took place in Paris at La Villette over five full days! It was a great opportunity for our R&D team to get more familiar with state-of-the-art methods in Information Retrieval.
Let’s start with the why and the what! Information Retrieval (IR) is the task of providing users with the information they are looking for. The most popular applications are certainly search engines (your Google queries) and recommender systems (your next compulsive buy on Amazon).
IR is widely used at Sidetrade and underpins many of our products. For instance, Smart Explorer provides our customers with a ranking of EU companies to facilitate prospecting, and the Growth product optimizes up- and cross-selling by leveraging a recommender system. In addition to predictive analysis, providing business insights is highly valuable for our customers. For instance, if ranking a user’s clients by churn risk is interesting, then identifying actionable insights for reducing that risk is gold! That’s why papers connected with the Fairness, Explainability, Unbiased Ranking, and Collaborative Filtering research domains caught my attention at La Villette.
My feelings about the conference are a bit mixed. The accepted papers were definitely high-quality research, often with strong empirical results, which is really valuable in the IR community. But some oral presentations (if not too many…) followed the paper script exactly, walking through method, results, related work and so on, and were not designed to generate discussion (which is really frustrating when you know that thousands of people took transcontinental flights to be in the room). Some did not even get a question from the audience… Let’s forget about that and focus on five papers that really caught my attention during the conference!
“Noise Contrastive Estimation for One-Class Collaborative Filtering.” [1]
Wu et al. investigate how we select items that probably don’t interest a user in order to get negative examples for training a recommender system. They argue that a user-item pair can be wrongly flagged as negative because of a lack of interactions, not because the user is not interested. The model is then badly biased and becomes overly cautious about recommending unpopular items. This problem occurs when interactions between users and items are very sparse, which is typically the case for the cross-sell problem: a highly popular product line can overshadow the opportunity to sell a very well-suited but less popular product.
Rather than modelling the problem as a positive-versus-negative classification, they suggest learning by comparing observed user-item interactions against a random model, an approach known as Noise Contrastive Estimation (NCE). It is worth noting that NCE has been key to the success of learning high-quality word embeddings [2]. From a practical perspective, NCE-based recommendation offers a significant improvement together with an impressive gain in computation time, making it a serious competitor to popular models for practitioners.
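To make the idea of “observed interactions versus a random model” a bit more concrete, here is a minimal sketch of a generic NCE loss for a single observed user-item pair. It is not the model of Wu et al.: the dot-product score, the function names and the choice of noise distribution are all my own assumptions for illustration.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nce_loss(user_vec, item_vecs, pos_item, noise_probs, k, rng):
    """Generic NCE loss for one observed (user, item) pair.

    Assumptions (illustrative, not taken from the paper):
    - the model score is a plain dot product of embeddings,
    - noise_probs is a strictly positive noise distribution over items,
      for instance proportional to item popularity.
    """
    items = list(range(len(item_vecs)))
    noise_items = rng.choices(items, weights=noise_probs, k=k)

    def logit(item):
        # Log-odds that the pair comes from the data rather than the noise.
        return dot(user_vec, item_vecs[item]) - math.log(k * noise_probs[item])

    # The observed pair should be classified as "data" ...
    loss = -log_sigmoid(logit(pos_item))
    # ... and each sampled pair as "noise".
    for item in noise_items:
        loss -= log_sigmoid(-logit(item))
    return loss
```

The point of the `log(k * noise_probs[item])` correction is that a popular item sampled as noise is penalized less than a rare one, so an unobserved interaction is not treated as a hard negative.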
“Relational Collaborative Filtering: Modeling Multiple Item Relations for Recommendation” [3]
Xin et al. leverage a simple but very effective idea for improving recommendations. Rather than using only collaborative similarity (I’m more likely to be interested in items that have interested users with the same interest profile as me), they suggest also leveraging item similarity (for instance, movies with a shared director have some similarity). By embedding these relations with an attention-based neural network, they significantly outperform state-of-the-art methods based on collaborative similarity only.
This approach may have a broad impact on industrial applications. Following the up- and cross-sell example at Sidetrade, embedding the metadata of a product (product line, product hierarchy, …) becomes a prime candidate for improvements!
“Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval.” [4]
This paper really caught my attention since it shares some connections with Dirty Data, a research project we are involved in at Sidetrade. Table2Vec adapts the popular word2vec framework [2] to learn high-quality embeddings of the entities in a table. The main idea is to leverage the fact that two entities in a given column which share similar names should be embedded close together in the feature space. For instance, ‘Electricité De France’ and ‘EDF’ share some patterns and should not be embedded in remote regions of the space.
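One way to picture how word2vec can be applied to tables is to treat each column as a “sentence” and generate skip-gram (target, context) pairs from it. The sketch below does only that pair generation; it is my own simplified reading, not the Table2Vec implementation, and the function name, window size and toy table are assumptions.

```python
def column_skipgram_pairs(table, window=2):
    """Build (target, context) pairs by reading each column as a sentence.

    table is a list of rows, each row being a list of cell values.
    Only cells of the same column that lie within `window` rows of each
    other are paired, which mimics the skip-gram context of word2vec.
    """
    pairs = []
    n_cols = len(table[0])
    for c in range(n_cols):
        column = [row[c] for row in table]
        for i, target in enumerate(column):
            start = max(0, i - window)
            stop = min(len(column), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((target, column[j]))
    return pairs


# Toy table with illustrative values: company name, city.
toy_table = [
    ["EDF", "Paris"],
    ["Engie", "Lyon"],
    ["Total", "Nice"],
]
pairs = column_skipgram_pairs(toy_table, window=1)
```

Feeding such pairs to a skip-gram model would pull entities that appear in similar column contexts closer together, while cells from different columns never form a pair.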
“Domain Adaptation for Enterprise Email Search.” [5]
Tran et al. from Google show how their Gmail search engine for companies can be improved by leveraging the specific statistical patterns of each company. The so-called Domain Invariant Representation model consists of learning a representation of emails that is company-invariant. The promise of this approach is to learn generic patterns: knowledge learned from company A can then be used to improve the model of company B, and so on. Domain Adaptation is a research topic studied extensively at Sidetrade [6,7] since it has huge applications for making models more robust in a production environment.
I was very interested to see how Google chose to implement this framework for their Gmail search engine! In their context, using Domain Adaptation improved performance slightly, but the results are a bit underwhelming… It would be interesting to also use the testing protocol of [7], which consists of training the model on one set of companies and testing it on companies unseen during training. Maybe the results would be more impressive?
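Training on some companies and testing on companies never seen during training boils down to splitting the data by company rather than by record. Here is a minimal sketch of such a split; the record layout (a dict with a "company" key), the function name and the default test fraction are assumptions for illustration.

```python
import random


def split_by_company(records, test_fraction=0.25, seed=0):
    """Split records so that no company appears in both train and test."""
    companies = sorted({r["company"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(companies)
    # Hold out at least one company entirely.
    n_test = max(1, round(len(companies) * test_fraction))
    test_companies = set(companies[:n_test])
    train = [r for r in records if r["company"] not in test_companies]
    test = [r for r in records if r["company"] in test_companies]
    return train, test


# Hypothetical records: two emails for each of four companies.
records = [
    {"company": name, "email_id": i}
    for name in ["A", "B", "C", "D"]
    for i in range(2)
]
train, test = split_by_company(records)
```

With a random split over records, emails from the same company leak into both sets, and the score says little about how the model behaves on a brand-new customer.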
“Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors” [8]
The kind of paper I love! A huge experimental setup which brings back to life the trend of using benchmarks and some custom performance measures for comparing models. The question Urbano et al. attempt to answer is “how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise”, and then which statistical test is best suited to quantify it. They compute 500 million p-values for a range of widely adopted tests, IR systems and dataset sizes… in order to give sound recommendations to practitioners! In a few words: be careful with the bootstrap-shift test if the sample size is not large, while the t-test and the permutation test behave well even in a low-data regime.
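For readers who have never run a permutation test on two IR systems, here is a minimal sketch of a paired (sign-flip) version on per-topic scores. It is a generic textbook variant, not the code used in the paper; the function name, the number of permutations and the scores are assumptions for illustration.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test on the mean per-topic difference.

    Under the null hypothesis the two systems are interchangeable, so the
    sign of each per-topic difference can be flipped at random.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical per-topic effectiveness scores for two systems on 20 topics.
system_a = [0.30 + 0.01 * i for i in range(20)]
system_b = [score - 0.10 for score in system_a]
p_value = paired_permutation_test(system_a, system_b)
```

The test makes no distributional assumption on the scores; it only requires that both systems were evaluated on the same topics.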
In a nutshell, attending SIGIR 2019 was definitely a great experience and gave me a better view of what is being done in the IR community. Some papers were really outstanding and highly relevant to our business cases. Furthermore, I had the chance to attend the insightful tutorial on Fairness in IR by Michael D. Ekstrand and Fernando Diaz. As I had been kindly warned, I got answers, but left with even more new questions 😊 …
[1] Wu et al. “Noise Contrastive Estimation for One-Class Collaborative Filtering.” (2019).
[2] Mikolov et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.
[3] Xin et al. “Relational Collaborative Filtering: Modeling Multiple Item Relations for Recommendation.” arXiv preprint arXiv:1904.12796 (2019).
[4] Li Deng, Shuo Zhang, and Krisztian Balog. “Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval.” Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2019.
[5] Tran et al. “Domain Adaptation for Enterprise Email Search.” arXiv preprint arXiv:1906.07897 (2019).
[6] Bouvier et al. “Hidden Covariate Shift: A Minimal Assumption For Domain Adaptation.” arXiv preprint arXiv:1907.12299 (2019).
[7] Bouvier et al. “Learning Invariant Representations for Sentiment Analysis: The Missing Material is Datasets.” arXiv preprint arXiv:1907.12305 (2019).
[8] Urbano et al. “Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors.” http://arxiv.org/abs/1905.11096 (2019).