Theodore Vasiloudis
This is the personal website of Theodore Vasiloudis. Here I will post thoughts and experiences on large-scale machine learning, including algorithms, systems and related stuff I find interesting.
http://tvas.me/
Fri, 16 Jun 2017 02:19:34 +0000
Jekyll v3.4.3

Highlights from KDD 2016

<p><img src="/assets/san_fransisco.jpg" alt="San Francisco" />
<em>Photo credit: <a href="https://www.flickr.com/photos/davidyuweb/15370470163/">David Yu</a></em></p>
<p>This August while interning at <a href="https://twitter.com/LifeAtPandora">Pandora</a> I had the opportunity to attend the <a href="http://www.kdd.org/kdd2016/">22nd ACM SIGKDD Conference on Knowledge
Discovery and Data Mining (KDD)</a>, held in San Francisco. My manager
<a href="https://twitter.com/ocelma">Oscar Celma</a> was cool enough to let me attend during my internship, and my research institute,
<a href="https://www.sics.se/">SICS</a> was cool enough to cover my conference fee, even though I was not presenting.</p>
<p>Like I did
with my post on <a href="/conferences/2015/11/23/ICDM-2015-Highlights.html">last year’s ICDM</a>, I’ll be providing summaries for
papers I found interesting from the sessions I attended, and provide links to the full text articles, organized by day
and session. One great thing KDD did this year was to ask the authors to provide two-minute YouTube videos describing
their papers, so for most of the linked papers you will find the video as well, providing a brief, accessible explanation.</p>
<p>This will be a long post, so feel free to skip ahead to the sections that are most interesting to you.</p>
<ul id="markdown-toc">
<li><a href="#pre-conference-and-tutorial-days" id="markdown-toc-pre-conference-and-tutorial-days">Pre-conference and tutorial days</a></li>
<li><a href="#sunday-workshops-and-opening" id="markdown-toc-sunday-workshops-and-opening">Sunday: Workshops and Opening</a> <ul>
<li><a href="#mining-and-learning-with-graphs" id="markdown-toc-mining-and-learning-with-graphs">Mining and learning with graphs</a> <ul>
<li><a href="#keynotes" id="markdown-toc-keynotes">Keynotes</a></li>
<li><a href="#papers" id="markdown-toc-papers">Papers</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#monday-day-1-of-the-main-conference" id="markdown-toc-monday-day-1-of-the-main-conference">Monday: Day 1 of the main conference</a> <ul>
<li><a href="#graphs-and-rich-data-best-paper-award" id="markdown-toc-graphs-and-rich-data-best-paper-award">Graphs and Rich Data (Best paper award)</a></li>
<li><a href="#large-scale-data-mining" id="markdown-toc-large-scale-data-mining">Large-scale Data Mining</a></li>
<li><a href="#streams-and-temporal-evolution-i-best-student-paper-award" id="markdown-toc-streams-and-temporal-evolution-i-best-student-paper-award">Streams and temporal evolution I (Best student paper award)</a></li>
<li><a href="#streams-and-temporal-evolution-ii-theos-coolest-idea-of-kdd-award" id="markdown-toc-streams-and-temporal-evolution-ii-theos-coolest-idea-of-kdd-award">Streams and temporal evolution II (Theo’s coolest idea of KDD award)</a></li>
</ul>
</li>
<li><a href="#tuesday-day-2-of-the-main-conference" id="markdown-toc-tuesday-day-2-of-the-main-conference">Tuesday: Day 2 of the main conference</a> <ul>
<li><a href="#deep-learning-and-embedding" id="markdown-toc-deep-learning-and-embedding">Deep learning and embedding</a></li>
<li><a href="#recommender-systems" id="markdown-toc-recommender-systems">Recommender Systems</a></li>
<li><a href="#turing-lecture-whitfield-diffie" id="markdown-toc-turing-lecture-whitfield-diffie">Turing Lecture: Whitfield Diffie</a></li>
</ul>
</li>
<li><a href="#wednesday-last-day-of-the-conference" id="markdown-toc-wednesday-last-day-of-the-conference">Wednesday: Last day of the conference</a> <ul>
<li><a href="#supervised-learning" id="markdown-toc-supervised-learning">Supervised learning</a></li>
<li><a href="#optimization" id="markdown-toc-optimization">Optimization</a></li>
</ul>
</li>
<li><a href="#closing-thoughts" id="markdown-toc-closing-thoughts">Closing thoughts</a></li>
</ul>
<h2 id="pre-conference-and-tutorial-days">Pre-conference and tutorial days</h2>
<p>The Broadening Participation in Data Mining Workshop <a href="http://www.dataminingshop.com/web/">(BPDM)</a> was held on Friday and
Saturday. This
workshop aims to broaden the participation of minority and underrepresented groups in Data Mining, by providing guidance,
networking and other opportunities. I think it’s a great initiative and I hope to see other venues take up something
similar for ML, in the vein of <a href="http://wimlworkshop.org/">WiML</a>.</p>
<p>Saturday was tutorials day at KDD, and there were a lot to choose from. I spent most of my time in the <a href="http://www.francois-petitjean.com/Research/KDD-2016-Tutorial/">Scalable Learning of
Graphical Models</a> and the <a href="https://sites.google.com/site/iotminingtutorial/">IoT Big Data Stream
Mining</a> tutorials. The deep learning tutorial from Ruslan Salakhutdinov
had to be cancelled (if anyone knows why let me know). The slides and video from his <a href="http://www.cs.toronto.edu/~rsalakhu/kdd.html">2014 tutorial at KDD</a> are available
however, and I can definitely recommend it as a good introduction to the field.</p>
<h2 id="sunday-workshops-and-opening">Sunday: Workshops and Opening</h2>
<p>Sunday was the workshop day and again it made me wish I could split myself to cover many in parallel, with topics like
large-scale sports analytics, learning from time series, deep learning for data mining, and stream mining. In the end,
I chose to spend most of my day in the workshop on <a href="http://www.mlgworkshop.org/2016/">“Mining and Learning from Graphs”</a>, which was closest
to my interests, and probably the best of the day.</p>
<h3 id="mining-and-learning-with-graphs">Mining and learning with graphs</h3>
<h4 id="keynotes">Keynotes</h4>
<p>The reason I think this was the best workshop of the day has a lot to do with the great keynote lineup as well the
quality and variety of the work presented. Lars Backstrom, the director of engineering at Facebook, had the first keynote
where he talked about the challenges in creating a personalized newsfeed for over a billion users. He talked about how
both business decisions and probabilities calculated by many models (trees, deep learning, logistic regression) affect
the scoring of items in the News Feed that end up determining the ranking of the items users see.</p>
<p>He also mentioned
some work I was not previously familiar with, co-authored with Jon Kleinberg, on <a href="https://dl.acm.org/citation.cfm?id=2531642">discovering strong ties
in a social network</a>, like romantic relationships. For this they developed a new measure of tie strength, <em>dispersion</em>,
which measures “the extent to which two people’s mutual friends are not themselves well-connected”. Using this method
they were able to identify the spouse of male users correctly with .667 precision, which is impressive considering they
are only using the graph structure as information. The dispersion metric itself is an interesting concept and can also be used
for News Feed ranking.</p>
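<p>To make the idea concrete, here is a minimal sketch of a <em>simplified</em> dispersion measure (not Backstrom and Kleinberg’s exact recursive definition): count the pairs of mutual friends of a tie that are not themselves directly connected. The graph and the <code>dispersion</code> helper are illustrative, not from the paper.</p>

```python
from collections import defaultdict
from itertools import combinations

def dispersion(adj, u, v):
    """Simplified dispersion of the tie (u, v): the number of pairs of
    mutual friends of u and v that are not directly connected."""
    common = (adj[u] & adj[v]) - {u, v}
    return sum(1 for s, t in combinations(sorted(common), 2) if t not in adj[s])

# Toy graph: 0 and 1 share friends 2, 3 and 4, but only 2 and 3 know each other.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3)]
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

print(dispersion(adj, 0, 1))  # 2 -- the unconnected pairs (2, 4) and (3, 4)
```

A high value means the mutual friends of the pair come from different social circles, which is exactly the signature of a romantic tie rather than, say, a shared workplace.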
<p>The rest of the keynotes were full of great ideas as well. <a href="https://www.cs.cmu.edu/~lakoglu/index.html">Leman Akoglu</a>, who
recently moved back to CMU after Stony Brook, gave a talk on <a href="https://arxiv.org/abs/1601.06711">detecting anomalous neighborhoods on attributed networks</a>.
<a href="http://www.sandia.gov/~tgkolda/">Tamara Kolda</a> talked about how to correctly model networks, and presented their <a href="https://arxiv.org/abs/1302.6636">BTER generative graph model</a> which is able to generate graphs that closely follow properties of
real-world graphs such as degree distribution and the triangle distribution, and a more recent extension of the model to
<a href="https://arxiv.org/abs/1607.08673">bi-partite graphs with community structure</a>.</p>
<p><a href="http://web.cs.ucla.edu/~yzsun/">Yizhou Sun</a> (now at UCLA) presented a <a href="http://web.cs.ucla.edu/~yzsun/papers/ijcai16_anomaly.pdf">probabilistic model for event likelihood</a>.
The key idea here is
that one can model an event, say a user purchasing an item, as an <a href="http://www.analytictech.com/networks/egonet.htm">ego-network</a>
(networks where we focus on one node, the “ego” node). The ego node would be the event, linked to heterogeneous entities,
like the item, date, and user. The entities are then embedded into a latent space by using their co-occurrence with other
events, and the embeddings can then be used for tasks like anomaly detection and content-based recommendation.</p>
<p><a href="https://www.cs.purdue.edu/homes/neville/">Jennifer Neville</a> presented methods for modelling distributions of networks,
which essentially allows one to <a href="http://www.kdd.org/kdd2016/subtopic/view/sampling-of-attributed-networks-from-hierarchical-generative-models">generate network samples</a>
of attributed hierarchical networks, which can then be used for inference and evaluation.
Finally, <a href="https://users.soe.ucsc.edu/~vishy/">S.V.N. Vishwanathan</a> had a disclaimer that his talk was not exactly on the
topic of graphs, but rather how to exploit the computational graph to achieve better parallelism in distributed machine
learning. He presented some recent work on <a href="https://arxiv.org/abs/1605.09499">distributed stochastic variational inference</a>
that only updates a small part of the model for each data point (compared to classic stochastic VI), to achieve both data and model
parallelism while maintaining high accuracy.</p>
<h4 id="papers">Papers</h4>
<p>A number of cool ideas were presented through the papers at the workshop:</p>
<ul>
<li>Cohen et al.
presented a new algorithm on <a href="http://www.mlgworkshop.org/2016/paper/MLG2016_paper_35.pdf">distance-based influence in networks</a>, where a scalable
influence maximization algorithm was presented which can be used with any decay function.</li>
<li>Qian et al. presented
a fun idea: <a href="http://www.mlgworkshop.org/2016/paper/MLG2016_paper_18.pdf">blinking graphs</a>. A blinking graph is one
where each edge and node exists with a probability equal to its weight. This is used to define a proximity measure
between nodes that produces more intuitive outputs and is shown to be useful in tasks like link
prediction.</li>
<li>Giselle Zeno used the work presented earlier by J. Neville that allows for the generation of attributed
networks, to create different samples from a network distribution and <a href="http://www.mlgworkshop.org/2016/paper/MLG2016_paper_27.pdf">systematically study</a> how graph characteristics
affect the performance of collective classification algorithms.</li>
<li>Rossi et al. presented <a href="http://www.mlgworkshop.org/2016/paper/MLG2016_paper_33.pdf">Relational Similarity Machines</a>,
a model for relational learning that can handle large graphs and is flexible in terms of learning tasks, constraints and
domains.</li>
</ul>
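<p>The blinking-graph idea lends itself to a simple Monte Carlo treatment: sample concrete graphs by switching each edge on with its probability, and estimate connectivity between two nodes. This is my own toy illustration of the probabilistic-graph setting, not the proximity measure from the paper.</p>

```python
import random

def blink_connectivity(nodes, edge_probs, s, t, trials=4000, seed=7):
    """Monte Carlo estimate of P(s is connected to t) when every edge
    independently 'blinks' on with probability equal to its weight."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        adj = {n: [] for n in nodes}
        for (a, b), p in edge_probs.items():
            if rng.random() < p:        # the edge blinks on in this sample
                adj[a].append(b)
                adj[b].append(a)
        seen, stack = {s}, [s]          # DFS from s in the sampled graph
        while stack:
            for m in adj[stack.pop()]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        hits += t in seen
    return hits / trials

# Exact answer here: 1 - (1 - 0.1) * (1 - 0.9 * 0.9) = 0.829
p = blink_connectivity([0, 1, 2], {(0, 1): 0.9, (1, 2): 0.9, (0, 2): 0.1}, 0, 2)
```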
<p>I would definitely encourage you to take a look at the <a href="http://www.mlgworkshop.org/2016/">workshop website</a> and check
out some more of the papers. Overall, great work from all the organisers, with a great intro from <a href="https://twitter.com/seanjtaylor">Sean Taylor</a>
from Facebook, and a diverse and engaging set of keynote speakers. I’ll be looking to submit here next year!</p>
<h2 id="monday-day-1-of-the-main-conference">Monday: Day 1 of the main conference</h2>
<h4 id="graphs-and-rich-data-best-paper-award">Graphs and Rich Data (Best paper award)</h4>
<p>I started Day 1 of the conference by attending the Graphs and Rich Data session. The first paper presented was the best
paper award winner, <a href="http://www.kdd.org/kdd2016/subtopic/view/fraudar-bounding-graph-fraud-in-the-face-of-camouflage">FRAUDAR: Bounding Graph Fraud in the Face of Camouflage</a>
from Christos Faloutsos’ lab at CMU. In the paper Hooi et al. describe a method for detecting fraud, in the form of
reviews on Amazon or followers on Twitter, in the presence of camouflage: when fraudulent users have taken over legitimate
user accounts. In the paper they propose a number of metrics to measure the suspiciousness of subsets of nodes in a bipartite
graph (e.g. users and products) and show how to compute them in linear time. They illustrate the effectiveness of the approach
by using a Twitter graph with ~42M users and ~1.5B edges and showing that their algorithm is able to detect a group of
fraudulent users (manually evaluated). I would have loved to see a comparison with other algorithms and a more
quantitative evaluation on real-world data, but obtaining that would be hard
without a good ground-truth dataset, and I don’t know if any exist for graph-based fraud detection.</p>
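<p>FRAUDAR’s camouflage-resistant suspiciousness metrics are more involved, but the linear-time machinery underneath is classic greedy peeling for dense subgraphs. A rough sketch of that scheme, with suspiciousness simplified to plain average degree (my simplification, not the paper’s metric):</p>

```python
def densest_subgraph(nodes, edges):
    """Greedy peeling: repeatedly drop the minimum-degree node and remember
    the intermediate node set with the highest average degree."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining, m = set(nodes), len(edges)
    best, best_density = set(remaining), 2 * m / len(remaining)
    while len(remaining) > 1:
        v = min(remaining, key=lambda n: len(adj[n]))  # cheapest node to drop
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        adj[v].clear()
        remaining.discard(v)
        density = 2 * m / len(remaining)
        if density > best_density:
            best, best_density = set(remaining), density
    return best, best_density

# A 4-clique with a pendant node 4: peeling recovers the dense core.
clique = [(a, b) for a in range(4) for b in range(a + 1, 4)]
core, density = densest_subgraph(range(5), clique + [(0, 4)])
print(core)  # {0, 1, 2, 3}
```

In the fraud setting the dense core corresponds to a suspicious block of users and products, and the paper’s contribution is making this robust when fraudsters pad their neighborhoods with legitimate-looking edges.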
<h4 id="large-scale-data-mining">Large-scale Data Mining</h4>
<p>I then moved on to the Large Scale Data Mining session, just in time to catch Daniel Ting deliver a smooth presentation
of his work on <a href="http://www.kdd.org/kdd2016/subtopic/view/towards-optimal-cardinality-estimation-of-unions-and-intersections-with-ske">cardinality estimation of unions and intersections with sketches</a>.
The cardinality of unions and intersections can be used for a number of applications, from calculating the Jaccard
similarity between two sets, to estimating the number of users accessing a particular website grouped by location or time,
and can be used for fundamental problems like estimating the size of a join. Daniel here proposed two new estimators
based on pseudo-likelihood and re-weighted estimators. The re-weighted estimators are perhaps the most interesting as they
can be generalized more easily (the work focuses on the MinCount sketch) and are easier to implement. I particularly like
the main idea behind them: taking a weighted average of several estimators after identifying the least correlated ones.
It is a rare thing to see a single author paper nowadays and Daniel hit it out of the park in terms
of quality and rigour with this one.</p>
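<p>For readers unfamiliar with the setting, here is the baseline a paper like this improves on: a K-minimum-values sketch, where keeping the K smallest hash values of a set lets you estimate its cardinality, merge sketches to get union estimates, and derive intersections by inclusion-exclusion. This is the textbook sketch, not Ting’s re-weighted estimators; all names are mine.</p>

```python
import hashlib

K = 256  # sketch size; relative error is roughly 1 / sqrt(K - 2)

def h(x):
    """Hash an item to a float in [0, 1)."""
    d = hashlib.sha1(str(x).encode()).digest()
    return int.from_bytes(d[:8], "big") / 2 ** 64

def kmv(items):
    """K-minimum-values sketch: remember only the K smallest hash values."""
    return sorted({h(x) for x in items})[:K]

def merge(s1, s2):
    """Sketch of the union of two sets, built from their sketches alone."""
    return sorted(set(s1) | set(s2))[:K]

def estimate(sketch):
    if len(sketch) < K:
        return float(len(sketch))       # fewer than K distinct items seen
    return (K - 1) / sketch[-1]

a = kmv(range(0, 3000))
b = kmv(range(1500, 4500))
union = estimate(merge(a, b))               # true union size: 4500
inter = estimate(a) + estimate(b) - union   # inclusion-exclusion: ~1500
```

The weakness of plain inclusion-exclusion is that errors compound, which is exactly where better-designed estimators like the re-weighted ones pay off.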
<!---
Two other great papers from the session were [efficient anomaly detection in streaming graphs](http://www.kdd.org/kdd2016/subtopic/view/fast-memory-efficient-anomaly-detection-in-streaming-heterogeneous-graphs)
from Emaad Manzoor, and the
[XGBoost paper](http://www.kdd.org/kdd2016/subtopic/view/xgboost-a-scalable-tree-boosting-system) from Tianqi Chen. Emaad presented StreamSpot, an anomaly detection approach for streaming heterogeneous
graphs. He uses a string representation (shingles) for local substructure of graphs, and then uses a variation of SimHash named
StreamHash to compute similarities between the shingles. The algorithm is then initialized with benign clusters and the
anomalies then are detected for each cluster based on their deviation from the cluster's graph and medoid. My impression
is that the initialization process requiring a benign dataset limits the applicability of the algorithm somewhat, since
one can never be sure a dataset does not contain any anomalies, unless it is completely hand-labeled. Still the idea is
novel and I liked the translation of graphs to shingles along with the StreamHash algorithm.
--->
<p>In the same session Tianqi Chen presented <a href="http://www.kdd.org/kdd2016/subtopic/view/xgboost-a-scalable-tree-boosting-system">XGBoost</a>.
I assume <a href="https://xgboost.readthedocs.io">XGBoost</a> needs no introduction to most, it’s a gradient boosted tree algorithm
that has become wildly popular and has been used in the winning solution for 17 out of 29 Kaggle challenges during 2015.
Part of the appeal of XGBoost lies in its scalable nature and Tianqi has gone to great lengths
to ensure the algorithm is fast, easy to use and will run from anywhere
(C++, Python, R) and on anything (local and distributed). JVM-based solutions were also added recently, so it is now
possible to run XGBoost on top of <a href="http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html">Apache Flink</a>
or Spark for example. I hope to find the time this year to work on the Flink integration so that it becomes a great platform
to run on and boosts our efforts (pun intended) for <a href="https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/ml/index.html">ML on Flink</a>.</p>
<h4 id="streams-and-temporal-evolution-i-best-student-paper-award">Streams and temporal evolution I (Best student paper award)</h4>
<p>Lorenzo De Stefani from Brown presented <a href="http://www.kdd.org/kdd2016/subtopic/view/triest-counting-local-and-global-triangles-in-fully-dynamic-streams-with-fi">TRIÈST</a>,
a new algorithm for counting local and global triangles in streaming graphs,
that supports additions and deletions of edges, with a fixed memory budget.
Counting triangles is a classic problem in network theory, as it can help with many tasks like spam detection, link
prediction etc. In many real world graphs, like a social network, edges are constantly being added and removed, so
maintaining an accurate count of the triangles in real-time is a challenging problem, especially in graphs with millions
of nodes and billions of edges.</p>
<p>What De Stefani et al. have done
here is present a one-pass algorithm based on reservoir sampling that provides unbiased estimates of the local and global
triangle counts with very little variance, and only requires the user to specify the amount of memory they want to use (an easy parameter to set).
Compared to previous approaches, TRIÈST does not require the user to set an edge sampling probability (a parameter that is
very hard to set without prior knowledge about the stream), and provides full
utilization of the available memory early on (vs. the end of the stream).
I find the use of reservoir sampling a great “oh why didn’t I think of that” idea here, and the value of the paper comes
from the rigorous analysis of the algorithm, and the extensive experimentation the authors have performed.
A very worthy recipient of the best student paper award.</p>
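<p>A heavily simplified version of the idea can be sketched in a few lines: reservoir-sample edges, count the triangles in the sample, and scale up by the inverse probability that all three edges of a triangle survived. The real TRIÈST maintains its counters incrementally and handles edge deletions; this insertion-only sketch is mine.</p>

```python
import random
from itertools import combinations

def estimate_triangles(edge_stream, M, seed=1):
    """Reservoir-sample up to M edges from the stream, count triangles among
    the sampled edges, and scale by the inverse inclusion probability
    M(M-1)(M-2) / (t(t-1)(t-2)) of any 3 specific edges."""
    rng = random.Random(seed)
    sample, t = [], 0
    for edge in edge_stream:
        t += 1
        if len(sample) < M:
            sample.append(edge)
        elif rng.random() < M / t:              # classic reservoir step
            sample[rng.randrange(M)] = edge
    adj = {}
    for a, b in sample:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    tri = sum(1 for u in adj
              for v, w in combinations(sorted(adj[u]), 2)
              if w in adj[v]) // 3              # each triangle seen at 3 vertices
    if t <= M:
        return tri                              # sample held the whole stream
    return tri * (t * (t - 1) * (t - 2)) / (M * (M - 1) * (M - 2))

# Complete graph on 10 nodes: 45 edges, exactly 120 triangles.
k10 = list(combinations(range(10), 2))
print(estimate_triangles(k10, M=45))   # exact when the sample fits everything
print(estimate_triangles(k10, M=30))   # an unbiased estimate
```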
<h4 id="streams-and-temporal-evolution-ii-theos-coolest-idea-of-kdd-award">Streams and temporal evolution II (Theo’s coolest idea of KDD award)</h4>
<p>Perhaps the most novel idea I saw at KDD came from the paper on <a href="http://www.kdd.org/kdd2016/subtopic/view/continuous-experience-aware-language-model">Continuous Experience-aware Language modelling</a>
by Mukherjee et al. from <a href="https://www.mpi-inf.mpg.de/home/">MPI</a>. The idea here is to try to model the experience of the user in reviewing items from a particular domain,
based on the evolution of their language model. Think of a beer reviewing site. Your first few reviews might contain sentences
like <em>“I like this beer”</em> or <em>“Great taste!”</em>. But as you gain more experience in tasting beer, the way you describe it
becomes more nuanced; you might write something like <em>“Fascinating malt and hoppiness, the aftertaste left something to be desired
however”</em>. So as you evolve as a beer drinker, so does the language you use to describe it.</p>
<p>Previous work in the field has
tried to model this evolution of experience on a discrete scale; the user’s experience remains either static or suddenly
jumps a level. In this work the authors have used a model used in financial analysis called <a href="https://en.wikipedia.org/wiki/Geometric_Brownian_motion">Geometric Brownian Motion</a>
to instead model the evolution of the user’s experience as a continuous-time stochastic process. The user’s language
model is also continuous, using a dynamic variant of LDA that employs variational methods like Kalman filtering for inference.
Using this model they are able to more accurately recommend items to users (albeit using RMSE as a metric, which
<a href="https://www.researchgate.net/profile/Paolo_Cremonesi/publication/221141030_Performance_of_recommender_algorithms_on_top-N_recommendation_tasks/links/55ef4ac808ae0af8ee1b1bd0.pdf">was shown to be problematic</a>)
and do some explorative analysis, like show the evolution of term usage with experience, or the top words used for experienced
and inexperienced users. Overall I really liked this idea of tracking the language model of users over time, and I
believe that continuous-time models can have beneficial effects in many other domains.</p>
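<p>For intuition, the underlying process is easy to simulate: geometric Brownian motion multiplies the current value by a log-normal factor at each step, so it drifts smoothly and never goes negative, which makes it a plausible model for gradually accumulating experience. This is just a toy simulation of the process, not the paper’s inference machinery.</p>

```python
import math
import random

def gbm_path(x0, mu, sigma, dt, steps, seed=0):
    """Simulate geometric Brownian motion with the exact update
    X(t+dt) = X(t) * exp((mu - sigma^2 / 2) * dt + sigma * sqrt(dt) * Z)."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        z = rng.gauss(0.0, 1.0)
        path.append(path[-1] * math.exp((mu - 0.5 * sigma ** 2) * dt
                                        + sigma * math.sqrt(dt) * z))
    return path

# "Experience" drifting slowly upwards over 50 reviews, always positive.
experience = gbm_path(1.0, mu=0.05, sigma=0.1, dt=1.0, steps=50)
```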
<h2 id="tuesday-day-2-of-the-main-conference">Tuesday: Day 2 of the main conference</h2>
<h4 id="deep-learning-and-embedding">Deep learning and embedding</h4>
<p>The paper describing the system that <a href="https://www.google.com/inbox/">Inbox by Gmail</a> uses for its <a href="http://www.kdd.org/kdd2016/subtopic/view/smart-reply-automated-response-suggestion-for-email">Smart Reply automated
response system</a>
was presented on Tuesday and it drew a lot of attention as expected. In case you are not familiar with the system, it
provides users of the Inbox app with short, automated replies. So the app writes the
emails for you instead of you having to type them out (yay machine learning!). This is obviously a highly challenging
task for many reasons. How can you tell if an email is a good candidate for a concise response? How does one generate
a response that is relevant to the incoming email? How does one provide enough variance in the responses generated?
And as is always the case at Google, how does one do this at scale?</p>
<p>Kannan et al. describe a system that uses a feed-forward
neural net to determine if an email is a good candidate to show automated responses for, an LSTM network for the actual
response text generation (sequence-to-sequence learning), a semi-supervised graph learning system to generate the set of
responses, and a simple strategy to ensure that the responses shown to the user are diverse in terms of intent. Although
the paper does not delve very deeply into each topic as they have to cover a complicated end-to-end learning system, it’s
still a great read as it provides insights into the scalability issues with deploying such models to millions of users,
as well as the challenge of optimizing for multiple objectives (accuracy, diversity, scalability) in a complex system.</p>
<h4 id="recommender-systems">Recommender Systems</h4>
<p>In this session chaired by <a href="https://twitter.com/xamat/">Xavier Amatriain</a>, <a href="http://www-users.cs.umn.edu/~christa/">Konstantina Christakopoulou</a> presented her paper
on <a href="http://www.kdd.org/kdd2016/subtopic/view/towards-conversational-recommender-systems">conversational recommender systems</a>.
The scenario here is common: You are at a new city, and would like to go out for dinner. If you had a local friend,
you’d have a small conversation: “Do you like Indian? What about Chinese? What’s your price-range?” and based on your
responses your knowledgeable friend would recommend a restaurant that they think you’d like. The challenges in creating
an automated system that does this are many: How does one find which dimensions are important (cuisine, price)?
Which questions should the system pose in order to arrive at a good recommendation as soon as possible?</p>
<p>Konstantina
addresses this problem as an online learning problem, where the system learns the preferences of the user online,
as well as the questions that allow it to provide good recommendations quickly. This is done by utilizing a bandit-based
approach that adapts the latent recommendation space to the user according to their interactions, and a number of
question selection strategies are tested, where it is shown that using a bandit-like approach to balance exploration
and exploitation in the latent question space is highly beneficial. I’m a fan of this work because it directly addresses cold-start
problems in recommender systems with an intuitive, human-centered approach, combining the knowledge we already have
about users and items from classic CF systems with online learning and contextual information.</p>
<h4 id="turing-lecture-whitfield-diffie">Turing Lecture: Whitfield Diffie</h4>
<p>Since this post is already too long I will not be covering the keynotes, however I could not skip mentioning
Whitfield Diffie’s Turing lecture, which was one of the highlights of the conference. Whitfield took us on a journey through the
history of cryptography, starting with the <a href="https://en.wikipedia.org/wiki/Caesar_cipher">Caesar cipher</a> all the way to
<a href="https://en.wikipedia.org/wiki/Homomorphic_encryption">Homomorphic encryption</a>, with many interesting tidbits and
historical anecdotes along the way.</p>
<p>I particularly liked his story on one of the things that motivated him to find a solution for the public key cryptography
problem that he is most famous for. Diffie explained that one of his friends had told him that at the NSA the phone lines
are secure, so Diffie thought that you could pick up a phone, dial, and have your communication be safe from eavesdropping
without having negotiated an encryption key beforehand. Diffie assumed that they had somehow solved the problem of key distribution,
which motivated him to work even harder on the problem. The reality was that the NSA was simply using shielded private lines
for their communication, but in Diffie’s own words, <strong><em>“Misunderstanding is the seed of invention”</em></strong>.</p>
<p>Another problem was presented by Diffie’s mentor <a href="https://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)">John McCarthy</a>
at a conference in Bordeaux where he talked about “buying and selling through home terminals”, what we call e-commerce today.
This problem led Diffie to think about digital signatures and proof of correctness, and the key idea of having a problem which you cannot
solve, but you can tell whether a solution provided is correct, which eventually led to public key cryptography.
I cannot provide a good enough summary of the talk here, but I would wholeheartedly recommend watching the whole thing, as
<a href="https://www.youtube.com/watch?v=CIZh0CHXGC4">it’s up on Youtube</a> and most definitely worth your time.</p>
<h2 id="wednesday-last-day-of-the-conference">Wednesday: Last day of the conference</h2>
<h4 id="supervised-learning">Supervised learning</h4>
<p>The highlight of the supervised learning session was the work from Marco Tulio Ribeiro et al. on <a href="http://www.kdd.org/kdd2016/subtopic/view/why-should-i-trust-you-explaining-the-predictions-of-any-classifier">explaining the predictions of
any classifier</a>.
The problem they are trying to solve is interpretability: Despite the wide adoption of machine learning, many of the more
complicated models, such as deep learning or random forests, are used as black boxes: explaining why they gave us a
particular answer is very hard. This makes it difficult to trust the system and deploy it in a setting where it would aid
critical decision-making, like whether or not to administer a specific treatment to a patient.</p>
<p>The proposed system, <a href="https://github.com/marcotcr/lime">LIME</a>, which stands for Local Interpretable Model-agnostic Explanations, can explain the outputs of
any classifier. The way they achieve that is by fitting a simple, interpretable model, like linear regression, on generated
samples, weighted by their distance to the prediction point. What this essentially does is to approximate the complex
decision boundary locally using an interpretable model, from which we can then explain why a decision was made. For text
this could be the words in a document that led us to classify it as spam or not, and in images it could
be the superpixels that caused the image to be classified as containing a dog or a cat. The system introduces a lot of
overhead of course: the authors report a 10-minute runtime to explain one output from InceptionNet on a laptop, but there is a lot of room
for improvement there. Interpretability is one of the main challenges for ML in the coming years and it’s always welcome
to see new exciting work on the subject.</p>
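<p>The core of the local surrogate fitting can be sketched with plain NumPy: perturb the input by switching features off, query the black box, weight the perturbations by proximity to the original point, and fit a weighted linear model. This follows the spirit of LIME but is not the <code>lime</code> library’s API; all names here are my own.</p>

```python
import numpy as np

def explain_locally(predict_fn, x, n_samples=500, width=0.75, seed=0):
    """LIME-style sketch: fit a weighted linear surrogate around x whose
    coefficients serve as per-feature explanations."""
    rng = np.random.default_rng(seed)
    d = len(x)
    masks = rng.integers(0, 2, size=(n_samples, d))
    masks[0] = 1                                  # keep the original point
    y = predict_fn(masks * x)                     # switched-off features -> 0
    dist = 1.0 - masks.mean(axis=1)               # fraction of features dropped
    weights = np.exp(-(dist ** 2) / width ** 2)   # exponential proximity kernel
    sw = np.sqrt(weights)[:, None]
    A = np.hstack([masks, np.ones((n_samples, 1))]) * sw   # design + intercept
    coef, *_ = np.linalg.lstsq(A, y * sw[:, 0], rcond=None)
    return coef[:d]                               # per-feature importance

# Toy black box that only looks at feature 0.
black_box = lambda X: 3.0 * X[:, 0]
importances = explain_locally(black_box, np.array([1.0, 1.0, 1.0]))
print(importances.round(2))  # roughly [3. 0. 0.]
```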
<h4 id="optimization">Optimization</h4>
<p>One of the best papers of the conference was presented in one of the final sessions by Steffen Rendle, of factorization
machines fame, who is now at Google. He and his colleagues provide a solution for a problem of scale: How to train a
generalized linear model in a few hours for a trillion examples. For this they proposed <a href="http://www.kdd.org/kdd2016/subtopic/view/robust-large-scale-machine-learning-in-the-cloud">Scalable Coordinate Descent (SCD)</a>,
whose convergence behavior does not change regardless of how much it is scaled out or the computing environment.
They also described a distributed learning system designed for the cloud which takes into consideration the challenges
present in a cloud environment, like shared machines (VMs) that are pre-emptible (i.e. you could be kicked out after a
grace period), machine failures etc.</p>
<p>The problem with <a href="https://en.wikipedia.org/wiki/Coordinate_descent">coordinate descent</a> is that it’s a highly sequential
algorithm, and not a lot of work is done at each step, which makes parallelizing or distributing it challenging.
The key idea for the SCD algorithm is to make use of the structure and sparsity
present in the data. The data are partitioned into “pure” blocks, where each example has at most one non-zero entry per block, and
updates are performed per block. The enforced independence in features is what enables the parallelism for the algorithm.
On the systems side they use a number of tricks to keep the synchronization barriers short. Syncing the workers is challenging
in the presence of stragglers (slower machines), which they overcome by using dynamic load-balancing, caching,
and pre-fetching. Using this system and algorithm they are able to achieve near-linear scale out and speed up, going
from 20 billion examples to 1 trillion.</p>
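<p>The “pure block” property is easy to illustrate on a toy sparse dataset: greedily assign features to blocks so that no two features in a block share an example, which is what makes the per-block coordinate updates independent. This greedy toy (<code>pure_blocks</code> is my own name) only illustrates the property; the paper’s actual partitioning scheme is more sophisticated.</p>

```python
def pure_blocks(rows, n_features):
    """Greedily partition features into 'pure' blocks: within one block, no
    example has more than one non-zero feature, so all coordinate updates
    in a block touch disjoint examples and can run in parallel."""
    feature_rows = [set() for _ in range(n_features)]
    for i, feats in enumerate(rows):          # rows: non-zero features per example
        for f in feats:
            feature_rows[f].add(i)
    blocks, block_rows = [], []
    for f in range(n_features):
        for b, touched in enumerate(block_rows):
            if not (feature_rows[f] & touched):   # no example collision
                blocks[b].append(f)
                touched |= feature_rows[f]
                break
        else:                                 # conflicts everywhere: open new block
            blocks.append([f])
            block_rows.append(set(feature_rows[f]))
    return blocks

# One-hot encoded data: features 0-2 encode one variable, 3-4 another.
rows = [{0, 3}, {1, 3}, {0, 4}, {2, 4}]
print(pure_blocks(rows, 5))  # [[0, 1, 2], [3, 4]] -- one block per variable
```

One-hot encoded categorical data is the best case here: each encoded variable forms a pure block on its own, which is part of why SCD suits large-scale generalized linear models.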
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>Overall the conference was well organized and a pleasure to attend. The venue was great, even though many of the sessions
had to be held in a different hotel across the street. The choice of having some of the keynotes over lunch, however, was criticised
by most attendees, as it was almost impossible to hear the speakers, and I’m sure it was not a good experience for them
either. The conference had a very heavy company presence as well, which I actually found welcome, as I had the opportunity
to talk to people from many interesting companies who are doing great research work like Microsoft, Facebook, Amazon etc.</p>
<p>If I have one gripe with the conference, it is the insistence <em>“per KDD tradition”</em> on not performing double-blind or open reviews, even though
<a href="https://hub.wiley.com/community/exchanges/discover/blog/2016/06/27/what-are-the-current-attitudes-toward-peer-review-publishing-research-consortium-survey-results">the research community</a>
is moving in that direction (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4191873/">original paper</a>),
so kudos to ICLR for open reviews and to ICDM for triple-blind reviews.</p>
<p>In closing, KDD was a great conference and I’m glad I was given the opportunity to attend. I met a bunch of great new people and
reconnected with old friends, had interesting discussions with many companies, and the research presented filled me with
new ideas to take home and expand.</p>
<p>Looking forward to next year!</p>
Thu, 01 Sep 2016 00:00:00 +0000
http://tvas.me/conferences/2016/09/01/KDD-2016-Highlights.html
Tags: data-mining, research, conference, conferences

Highlights from ICDM 2015

<p><img src="/assets/ac.jpg" alt="Atlantic City" /></p>
<p>This past week I had the opportunity to attend the <a href="http://icdm2015.stonybrook.edu/">15th IEEE International Conference on
Data Mining</a>, held in Atlantic City, NJ, November 14-17, 2015.
This was the first scientific conference I attended, and we had the chance to present our
work on <a href="/assets/concepts-icdm.pdf">scalable graph similarity calculation</a>. In this post I will try to point out some
of the more interesting work from the conference (based on some of the sessions I attended)
and summarize the keynotes. I’ve included links to the full-text papers whenever I could
find them.</p>
<ul id="markdown-toc">
<li><a href="#highlights-from-the-sessions-i-attended" id="markdown-toc-highlights-from-the-sessions-i-attended">Highlights from the sessions I attended:</a> <ul>
<li><a href="#day-1" id="markdown-toc-day-1">Day 1</a> <ul>
<li><a href="#applications-1" id="markdown-toc-applications-1">Applications 1</a></li>
<li><a href="#mining-social-networks-1" id="markdown-toc-mining-social-networks-1">Mining Social Networks 1</a></li>
<li><a href="#big-data-2" id="markdown-toc-big-data-2">Big Data 2</a></li>
</ul>
</li>
<li><a href="#day-2" id="markdown-toc-day-2">Day 2</a> <ul>
<li><a href="#network-mining-1" id="markdown-toc-network-mining-1">Network Mining 1</a></li>
</ul>
</li>
<li><a href="#day-3" id="markdown-toc-day-3">Day 3</a> <ul>
<li><a href="#graph-mining" id="markdown-toc-graph-mining">Graph Mining</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#keynotes" id="markdown-toc-keynotes">Keynotes</a> <ul>
<li><a href="#robert-f-engle" id="markdown-toc-robert-f-engle">Robert F. Engle</a></li>
<li><a href="#michael-i-jordan" id="markdown-toc-michael-i-jordan">Michael I. Jordan</a></li>
<li><a href="#lada-adamic" id="markdown-toc-lada-adamic">Lada Adamic</a></li>
</ul>
</li>
<li><a href="#venueorganisation" id="markdown-toc-venueorganisation">Venue/Organisation</a></li>
</ul>
<h2 id="highlights-from-the-sessions-i-attended">Highlights from the sessions I attended:</h2>
<h3 id="day-1">Day 1</h3>
<h4 id="applications-1">Applications 1</h4>
<p>The first session I attended was named “Applications 1” and it included
a number of applications (surprise!) in a diverse set of domains. The session started
with some very solid work on <a href="http://arxiv.org/abs/1406.0516">“Modeling Adoption and Usage of Competing Products”</a>,
where the authors create a model that can provide insight into the factors that drive
product adoption and frequency of use, which they evaluate at a large scale by looking
into the use of URL shorteners on Twitter.
In <em>“Mining Indecisiveness in Customer Behaviors”</em> the authors investigated how they could
reduce indecisiveness in users interacting with an online retail platform by making use
of information about competing products. The end goal is of course to increase conversions,
but it would be interesting to see how such a system could be implemented in a way that
is fair to all retailers/brands.</p>
<p>Two short papers I should point out were <a href="http://medianetlab.ee.ucla.edu/papers/Yannick_ICDM.pdf">“Personalized Grade Prediction: A Data Mining
Approach”</a> and
<a href="http://www.cc.gatech.edu/~iperros3/publications/icdm15.pdf">“Sparse Hierarchical Tucker and its Application to Healthcare”</a>.
The first
paper deals with personalized early grade prediction for students using only assignment/homework data,
which could allow course instructors to identify students who might have
trouble in a course early on. Importantly, it uses only the students’ data from the specific
course, thereby avoiding potential privacy pitfalls. The second work proposes a new tensor
factorization method that is 18x more accurate and 7.5x faster than the current
state-of-the-art. While the application presented here is limited to healthcare, I hope
it can prove a starting point for a more generalized approach: tensor factorization
problems surface in a wide variety of domains, so solving their scalability problems
could have an effect on a wide range of fields.</p>
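<p>The early-warning idea behind the grade-prediction paper can be illustrated with a toy sketch (my own, not the paper’s actual method): fit a one-variable least-squares line from early assignment averages to final grades, then use it to flag at-risk students. All names and numbers below are hypothetical.</p>

```python
# Toy sketch (not the paper's method): predict a final grade from the
# early assignment average with one-variable least squares.
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical data: early assignment averages vs. final grades.
early = [55, 60, 70, 80, 90]
final = [50, 62, 68, 82, 88]
slope, intercept = fit_line(early, final)
predicted = slope * 75 + intercept  # prediction for a student averaging 75
```

<p>An instructor could then flag any student whose <code>predicted</code> grade falls below the passing threshold, weeks before the final exam.</p>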
<h4 id="mining-social-networks-1">Mining Social Networks 1</h4>
<p>The next session I attended was “Mining Social Networks 1”, where, among others, the best student paper,
<a href="http://arxiv.org/abs/1505.07193">“From Micro to Macro: Uncovering and Predicting Information Cascading Process with
Behavioral Dynamics”</a>, was presented.
Cascade prediction has applications
in areas like viral marketing and epidemic prevention, so it’s a problem of great interest
to industry as well as society. The work presented here takes a data-driven approach
to create a “Networked Weibull Regression” model and uses it to predict cascades
as they occur: micro behavioral dynamics are modelled and then aggregated to predict
the macro cascading process.</p>
<p>They evaluate their method on a dataset from Weibo, one of the largest Twitter-style
services in China, and show that their method handily beats the current state of the art.
It’s a well-written work that deserves the praise it got; however, I would definitely be interested
in seeing it applied and evaluated on a different publicly available dataset (although those are
hard to come by in this domain), as well as an extension of the method that predicts cascades as they
happen in real-time (shameless plug: Use <a href="https://flink.apache.org">Apache Flink</a> for your real-time processing needs!).</p>
<h4 id="big-data-2">Big Data 2</h4>
<p>The last session I attended on Sunday was “Big Data 2”. The two regular papers from that
session were perhaps application-specific but nonetheless provided some valuable insights.
The first, “Accelerating Exact Similarity Search on CPU-GPU Systems”, dealt with the exact
kNN problem and how it can be efficiently accelerated on GPU-equipped systems. Although
approximate kNN methods like LSH currently seem to be the standard in industry, the
authors mentioned that the techniques presented could be used in that context as well,
so this is definitely something to look forward to. The second regular paper, <a href="http://arxiv.org/abs/1508.07678">“Online Model
Evaluation in a Large-Scale Computational Advertising Platform”</a>
provided a rare look into how a large advertising platform like Turn evaluates its bid prediction models online,
something that a previous related paper from Google,
<a href="https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf">“Ad Click Prediction: a View from the Trenches”</a>,
was missing.</p>
<h3 id="day-2">Day 2</h3>
<h4 id="network-mining-1">Network Mining 1</h4>
<p>An interesting idea presented in “Network Mining 1” was <a href="http://arxiv.org/abs/1509.02533">“Absorbing random-walk centrality”</a>,
where the authors presented a way to identify <em>teams</em> of central nodes in a graph. An application
for this measure could be for example: given a subgraph of Twitter that we know contains
a number of accounts about politics, find the important nodes that represent a diverse set
of political views. The authors show that this is an NP-hard problem, and the greedy algorithm
presented has a complexity of O(n^3), where n is the number of nodes, which makes it
inapplicable to large graphs. However, the more computationally efficient Personalized PageRank
can be used as a heuristic.</p>
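<p>As a sketch of what such a heuristic might look like (an assumption on my part, not the authors’ algorithm), here is personalized PageRank via power iteration, with the restart distribution concentrated on a set of seed nodes:</p>

```python
# Personalized PageRank by power iteration: instead of restarting
# uniformly, the walk restarts at the seed nodes, so scores measure
# importance relative to the seeds.
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """adj: {node: [out-neighbors]}; seeds: set of restart nodes."""
    nodes = list(adj)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {v: (1 - alpha) * restart[v] for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                share = alpha * rank[v] / len(out)
                for u in out:
                    nxt[u] += share
            else:  # dangling node: redistribute via the restart vector
                for u in nodes:
                    nxt[u] += alpha * rank[v] * restart[u]
        rank = nxt
    return rank

# Hypothetical toy graph seeded at node "a".
adj = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"], "d": ["c"]}
scores = personalized_pagerank(adj, seeds={"a"})
```

<p>With a seed set covering accounts of known political views, nodes ranked highly from each seed would approximate the “central team” the exact O(n^3) algorithm computes.</p>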
<h3 id="day-3">Day 3</h3>
<h4 id="graph-mining">Graph Mining</h4>
<p>We presented our work, <a href="/assets/concepts-icdm.pdf">“Knowing an Object by the Company It Keeps: A Domain-Agnostic Scheme for Similarity Discovery”</a>,
in the “Graph Mining” session. Our main contribution is a method that allows us to
transform a <em>correlation graph</em> into a <em>similarity graph</em>, where connected items should be <em>exchangeable</em> in some sense.</p>
<p>As an example, think of a correlation graph where we have words as nodes and edges between words are created by taking the conditional probability of a word appearing
within <em>n</em> words of another one. This can be easily extracted from a text corpus and pairs like (<em>Rooney, goal</em>)
could have a high correlation score. What we want to do with our algorithm is discover <em>similarities</em>
between items that go beyond simple correlation and show characteristics such as exchangeability.
For example, the pair (<em>Rooney</em>, <em>Ronaldo</em>) could be a good one in this sense, as you could replace
Rooney with Ronaldo in a sentence and it should still make sense. The approach we presented is domain-agnostic,
and as such is not limited to text; we applied our algorithm on graphs of music artists and <a href="https://en.wikipedia.org/wiki/Genetic_code">codons</a>
as well. I will soon write up a more extensive summary of our work, including code and examples.
For now enjoy this <a href="/assets/concepts-visualization.pdf">nice visualization</a>
of word relations and clusters that can be created using our method.
<em>Note:</em> better to download and view in a PDF viewer which has <em>lots</em> of zoom.</p>
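<p>To make the correlation-graph idea concrete, here is a toy sketch (a hypothetical illustration, not our actual pipeline) that builds edge weights from conditional co-occurrence probabilities over a sliding window:</p>

```python
# Toy correlation graph: the weight of edge (u, v) is the conditional
# probability of seeing word v within `window` words of word u.
from collections import Counter, defaultdict

def correlation_graph(tokens, window=2):
    word_count = Counter(tokens)
    pair_count = defaultdict(int)
    for i, u in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pair_count[(u, tokens[j])] += 1
    # P(v near u) = co-occurrences of (u, v) / occurrences of u
    return {(u, v): c / word_count[u] for (u, v), c in pair_count.items()}

corpus = "rooney scores a goal rooney celebrates the goal".split()
graph = correlation_graph(corpus)
```

<p>Our similarity-discovery step would then operate on a graph like this one, looking for pairs of nodes whose <em>neighborhoods</em> resemble each other rather than pairs that merely co-occur.</p>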
<p>The most impressive work from that session, for me, was <a href="http://arxiv.org/abs/1506.04322">“Efficient Graphlet Counting for Large Networks”</a>.
<a href="https://en.wikipedia.org/wiki/Graphlets">Graphlets</a> are small, connected, induced (i.e. the edges
in the graphlet correspond to those in the large graph) subgraphs of a large network, and can be used
for things like graph comparison and classification. The method presented here uses already proven
combinatorial arguments to reduce the number of graphlets one has to count for every edge, and
obtains the remaining counts in constant time. In a large study of over 300 networks the algorithm
is shown to be on average 460 times faster than the current state-of-the-art, allowing the largest
graphlet computations to date. I am always happy when I see established results used in a clever
way to solve new problems, especially when the results are so impressive.</p>
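<p>The flavor of these combinatorial arguments can be seen in the simplest case: the triangles incident to an edge (u, v) are exactly the common neighbors of u and v, so one set intersection per edge suffices, and counts for larger graphlets can be derived from such quantities. A minimal sketch (mine, far simpler than the paper’s full method):</p>

```python
# Triangles on edge (u, v) = |N(u) ∩ N(v)|: one set intersection per
# edge, the kind of combinatorial identity the paper builds on.
def triangles_per_edge(edges):
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return {(u, v): len(neighbors[u] & neighbors[v]) for u, v in edges}

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
tri = triangles_per_edge(edges)
```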
<h2 id="keynotes">Keynotes</h2>
<h4 id="robert-f-engle">Robert F. Engle</h4>
<p>ICDM featured 3 keynotes this year. The first one was given by Robert F. Engle, winner of the
Nobel Memorial Prize in Economic Sciences in 2003. He presented a summary of some of his seminal
work on <a href="https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">ARCH</a>,
and presented some more recent work on financial volatility measurement through the <a href="http://vlab.stern.nyu.edu/">V-lab</a>
project. As a result this keynote was quite math-heavy, and judging from the proportion of people around
me looking at their laptops, I think many in the audience did not find it that interesting or relevant
to their work.</p>
<h4 id="michael-i-jordan">Michael I. Jordan</h4>
<p>The second keynote, and the most interesting for me, was given by M.I. Jordan, with the title “On
Computational Thinking, Inferential Thinking and ‘Big Data’”, a talk he has delivered in a couple
of other venues before, so (some of) the <a href="http://www.stat.harvard.edu/NRC2014/MichaelJordan.pdf">slides are available</a>.
His keynote revolved around some of what he identified as central demands for learning and inference
and the tradeoffs between them; namely error bounds (“inferential quality”),
scaling/runtime/communication, and privacy. He identified the lack of an interface
between statistical theory and computational theory, which currently have an “oil and water”
relationship. In statistics, more data points are great as they reduce uncertainty, but they can cause
problems in terms of computation, as we usually measure complexity in the number of data points. The approach
he suggested is to “treat computation, communication, and privacy as constraints on statistical
risk”.</p>
<p>In terms of privacy he mentioned how our inference problem basically has 3 components: the
population P, which we try to approximate with our sample S, which we then modify
according to our privacy concerns to get our final dataset Q, which we can query.
In dealing with privacy issues he mentioned <a href="https://en.wikipedia.org/wiki/Differential_privacy">differential privacy</a>
as a good way to quantify the privacy loss for a query. This should allow us, given some privacy
concerns, to estimate the amount of data we need to achieve the same level of risk in our queries.</p>
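<p>As a concrete example of the idea (mine, not from the talk), the classic Laplace mechanism makes a count query epsilon-differentially private by adding noise scaled to the query’s sensitivity:</p>

```python
# Laplace mechanism: a count query changes by at most 1 when one person
# is added or removed (sensitivity 1), so Laplace(1/epsilon) noise
# yields epsilon-differential privacy.
import random

def private_count(true_count, epsilon, rng=random):
    scale = 1.0 / epsilon  # sensitivity of a count query is 1
    # Laplace(0, scale) sampled as the difference of two exponentials
    return true_count + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

rng = random.Random(0)
noisy = private_count(1000, epsilon=0.5, rng=rng)
```

<p>Smaller epsilon means stronger privacy but noisier answers, which is exactly the privacy-versus-inferential-quality tradeoff Jordan described.</p>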
<p>For the tradeoff between inferential quality and communication, common in distributed
learning settings, he proposed treating communication as a channel with a constrained bitrate.
The proposed solution involves minimax risk with B-bounded
communication, which allows for optimal estimation under a communication constraint (see
<a href="http://www.cs.berkeley.edu/~yuczhang/files/nips13_communication.pdf">here</a>
for the NIPS paper on the subject).</p>
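<p>A toy illustration of estimation under such a bitrate constraint (my simplification, not the paper’s estimator): each machine quantizes its local mean to B bits over a known value range before communicating it, and the server averages the quantized messages.</p>

```python
# Each machine sends its local mean quantized to `bits` bits over a
# known range [lo, hi]; the server averages the B-bit messages.
def quantize(x, lo, hi, bits):
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return lo + round((x - lo) / step) * step

def distributed_mean(machine_data, lo=0.0, hi=10.0, bits=8):
    msgs = [quantize(sum(d) / len(d), lo, hi, bits) for d in machine_data]
    return sum(msgs) / len(msgs)

# Hypothetical data partitioned across three machines.
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 4.0, 6.0]]
est = distributed_mean(data)
```

<p>Fewer bits per message lowers the communication cost but coarsens the estimate, which is the bitrate-versus-risk tradeoff the minimax analysis formalizes.</p>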
<p>The last part of his talk was new (i.e. not in the slides linked above) and concerned the tradeoff between
inference quality and computation resources. This part focused on efficient distributed bootstrap
processes, with the thesis being that such processes can be used to generate multiple realizations
of samples from the population that allow for the efficient estimation of parameters. The problem
with a frequentist approach in this case is that the communication cost of each resampling can be
prohibitively high for large datasets, e.g. ~632GB for a 1TB dataset
(see <a href="http://www.stat.washington.edu/courses/stat527/s13/readings/EfronTibshirani_JASA_1997.pdf">here</a> for why).
The proposed solution here is the <a href="http://web.cs.ucla.edu/~ameet/blb_icml2012_final.pdf">“Bag of Little Bootstraps”</a>,
in which one bootstraps many small subsets of the data and performs multiple computations on these
small samples. The results from these computations are then averaged to obtain an estimate of the
parameters of the population.
This means that in a distributed setting we would use only small subsets of the data to perform
our computation; in the 1TB example above, the resample size could for example be 4GB instead of
the 632GB required by the bootstrap.
Another interesting point was that obtaining a confidence interval for a parameter, instead of
the usual point estimate, can not only be more useful, but can also be computed more
efficiently.</p>
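<p>The procedure above can be sketched in a few lines (my simplification of the linked paper): draw s small subsets of size b ≈ n^0.6, resample each back up to nominal size n, and average the per-subset estimates.</p>

```python
# Bag of Little Bootstraps sketch: bootstrap resamples of nominal size
# n are drawn only from small subsets of size b = n^0.6, so each worker
# ever touches only b distinct points.
import random

def blb_mean(data, s=5, r=20, seed=0):
    rng = random.Random(seed)
    n = len(data)
    b = int(n ** 0.6)  # small subset size
    estimates = []
    for _ in range(s):
        subset = rng.sample(data, b)
        # r resamples of nominal size n, drawn only from the small subset
        boots = [sum(rng.choices(subset, k=n)) / n for _ in range(r)]
        estimates.append(sum(boots) / r)
    return sum(estimates) / s

data = [i % 10 for i in range(1000)]  # true mean is 4.5
est = blb_mean(data)
```

<p>In a distributed setting, each machine would hold one small subset, so only the b-point subsets (4GB in the 1TB example) ever need to move, rather than full ~632GB resamples.</p>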
<p>In closing, Jordan noted that there are many remaining conceptual and mathematical challenges in the
problem of ‘Big Data’ and facing these will require a “rapprochement between computer science and
statistics” which would reshape both disciplines and might take decades to complete.</p>
<h4 id="lada-adamic">Lada Adamic</h4>
<p>Unfortunately I had to skip Lada Adamic’s keynote, so I would really appreciate it if someone could
share a summary that I can add here.</p>
<h2 id="venueorganisation">Venue/Organisation</h2>
<p>The conference organization was mostly smooth and organizers and volunteers deserve a lot of credit for
the way that everything worked out. Sessions generally began and ended on time, the workshops and
tutorials were well organized and useful, and I particularly enjoyed the PhD forum.
One thing that I found unusual was that even though the proceedings were handed out in
digital form (kudos for that), attendees had to choose between the conference and the workshop
proceedings. My guess is this was due to licensing costs, but it would have been nice to have
access to both.
<p>The conference this year took place at Bally’s casino/hotel in Atlantic City.
It was hard to avoid the grumbling from many of the participants about the choice of venue, especially
when one puts it next to last year’s venue in <a href="/assets/shenzen.jpg">Shenzhen</a> or next year’s in
<a href="/assets/barcelona.jpg">Barcelona</a>.</p>
<p>Truth be told, the venue was underwhelming, but I guess it was mostly the choice of Atlantic City
that had people irked; there was very little to do and see in the city unless you wanted to gamble.
Still, I was fortunate to meet a lot of cool people at the conference, so I’m looking forward to
attending next year’s edition in Barcelona!</p>
<p>There was a lot of other great work at the conference as well, but these were the presentations
I found most memorable.
That’s all for now; if I’ve made a terrible mistake when describing your work, shoot me an <a href="mailto:tvas@sics.se">email</a>
and I’ll fix it ASAP.</p>
Mon, 23 Nov 2015 00:00:00 +0000
http://tvas.me/conferences/2015/11/23/ICDM-2015-Highlights.html