Theodore VasiloudisThis is the personal website for Theodore Vasiloudis. Here I will post thoughts and experiences on large-scale machine learning, including algorithms, systems and related stuff I find interesting.
http://tvas.me/
Wed, 11 Mar 2020 05:13:08 +0000Wed, 11 Mar 2020 05:13:08 +0000Jekyll v3.8.5Block-distributed Gradient Boosted Trees<p>This is the second post in my plan to write an accessible explanation of
every paper I’ve included in my thesis. We are jumping forward in time, from
<a href="/research/2019/07/02/Finding-graph-similarities.html">our first paper in 2015</a> where we
talked about uncovering concepts and similarities in graphs, to the latest one
on scalable gradient boosted tree learning. This detour is to
celebrate the fact that we won the <a href="https://twitter.com/thvasilo/status/1153762417521889282" target="_blank">best short paper award at SIGIR
2019</a>
for this work! The paper is open-access so I encourage <a href="https://doi.org/10.1145/3331184.3331331" target="_blank">reading it
for more details</a>.</p>
<p>Here we focus on gradient boosted trees (GBTs) and try to overcome
the issues that come up when training on high-dimensional data in a distributed setting.
We’ll take this opportunity to illustrate how distributed training of
GBTs works, explain the specific issues with the current state of the art,
and demonstrate the benefits and limitations of our proposed solutions.
Some of the illustrations and explanations here were taken from <a href="http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-250038" target="_blank">my
dissertation</a>
and <a href="https://docs.google.com/presentation/d/1YSMNwze4lcS94Vpd9cNJ7O27K3PfeIskWRRlkHSW0nU/edit?usp=sharing" target="_blank">defense</a>.</p>
<h2 id="gradient-boosted-trees">Gradient boosted trees</h2>
<p>Gradient boosted trees are one of the most successful algorithms in machine
learning, used across industry and academia. They largely owe their success to
their excellent accuracy, <a href="https://arxiv.org/abs/1707.05023" target="_blank">solid theoretical foundations</a>, and highly scalable implementations like <a href="https://xgboost.ai/" target="_blank">XGBoost</a>
and <a href="https://github.com/microsoft/LightGBM" target="_blank">LightGBM</a>.</p>
<p>GBTs are an ensemble of decision trees. In the most common version of the
algorithm, a new tree is added at each iteration of the algorithm.
A common illustration of the process is the following:</p>
<center><img src="/assets/gbt-example.png" alt="Gradient boosting" width="600" /></center>
<p>At each iteration, a new tree is added, which
is trained on labels that result from the errors made by the existing
ensemble of trees. In “base” boosting those would be the errors made
by the ensemble, i.e. the errors of the previous iteration become
the targets for the next one. In gradient boosting we train on the <em>gradients
of the errors</em>, as determined by the loss function we are using.
It’s this flexibility of choosing our loss function that makes
GBTs attractive: they can be used for classification, regression,
ranking, or survival analysis by simply changing the loss function
we use (and its gradients).</p>
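<p>As a minimal sketch of how the choice of loss function determines what the next tree is trained on, the snippet below computes per-point gradients for a squared-error loss and a logistic loss. The function names and the toy labels and predictions are assumptions made for illustration; they are not taken from the paper or from any library.</p>

```python
import math

def squared_error_gradient(y_true, y_pred):
    # Derivative of 0.5 * (y_pred - y_true)^2 with respect to y_pred.
    return y_pred - y_true

def logistic_gradient(y_true, raw_score):
    # Derivative of the log loss with respect to the raw score, y_true in {0, 1}.
    prob = 1.0 / (1.0 + math.exp(-raw_score))
    return prob - y_true

labels = [1.0, 0.0, 1.0, 0.0]        # hypothetical targets
predictions = [0.5, 0.5, 2.0, -1.0]  # hypothetical outputs of the current ensemble

# Only the gradient function changes between tasks; the tree-growing
# procedure that consumes these values stays the same.
regression_gradients = [squared_error_gradient(y, p) for y, p in zip(labels, predictions)]
classification_gradients = [logistic_gradient(y, p) for y, p in zip(labels, predictions)]
```

<p>Swapping one gradient function for the other is the only change needed to move from regression to binary classification in this sketch.</p>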
<p>The algorithm for growing <em>a single tree</em> at a high level is the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Until a stopping criterion is met, do the following:
1. Make predictions using the current ensemble.
2. Calculate the gradient of each data point and build the gradient histograms for each leaf.
3. Use the gradient histograms to find the optimal split for each leaf.
4. Grow the tree by splitting the leaves.
</code></pre></div></div>
<h2 id="toy-dataset">Toy Dataset</h2>
<p>Throughout our explanation we’ll use the following toy dataset
to illustrate the process:</p>
<p><img src="/assets/gbt-toy-data.png" alt="Toy data" /></p>
<p>Here we have a dataset with 4 data points and 3 features, where
we have also listed a gradient value for each data point.</p>
<h3 id="row-and-block-distributed-data">Row and block-distributed data</h3>
<p>If you think of our dataset as a table
where rows are data points and columns are features, we can distribute
our data by row, column, or both. The last distribution strategy is commonly referred to as <em>block</em> distribution.
The row and block data distribution strategies are illustrated below:</p>
<center><img src="/assets/gbt-row-distributed.png" alt="Row-distributed data" width="600" /></center>
<p><em>Row distributed data.</em></p>
<p><img src="/assets/gbt-block-distributed.png" alt="Block-distributed data" />
<em>Block-distributed data.</em></p>
<p>Current state of the art implementations
provide distributed GBT training using only <em>row distribution</em>, i.e.
data are only partitioned along the data point dimension and a subset of the data end up on each
machine in the cluster. Using block-distribution would allow us
to parallelize learning along both dimensions, increasing the
scalability of the algorithm and reducing the training cost. However, this creates a couple of additional challenges in the training process that we will expand upon now.</p>
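<p>To make the two strategies concrete, here is a small sketch that partitions a table by rows and by blocks. The 4x3 table values, the function names, and the choice of two row groups and two feature groups are all assumptions for illustration.</p>

```python
# Hypothetical 4x3 table: rows are data points, columns are features.
data = [
    [13, 0, 488],
    [7, 1, 500],
    [3.5, 0, 506],
    [1.6, 2, 479],
]

def row_partition(table, num_row_parts):
    """Split the table into contiguous groups of rows."""
    rows_per_part = -(-len(table) // num_row_parts)  # ceiling division
    return [table[i:i + rows_per_part] for i in range(0, len(table), rows_per_part)]

def block_partition(table, num_row_parts, feature_groups):
    """Split by rows first, then split every row group by feature columns."""
    blocks = []
    for row_group in row_partition(table, num_row_parts):
        for columns in feature_groups:
            blocks.append([[row[c] for c in columns] for row in row_group])
    return blocks

row_blocks = row_partition(data, 2)                 # 2 workers, all features each
blocks = block_partition(data, 2, [[0, 1], [2]])    # 4 workers, 2x2 layout
```

<p>With row distribution each worker sees every feature of its rows, while with block distribution no single worker sees a complete data point.</p>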
<h2 id="prediction-in-the-block-distributed-setting">Prediction in the block-distributed setting</h2>
<p>Prediction in the single-machine and row-distributed scenarios is
straightforward: for every data point we have its full set of features,
so we can drop each data point down every
tree in the ensemble and determine the exit leaf, i.e. the leaf
that the data point falls into.</p>
<p>The same is not the case for block-distributed data. Take for example
the following tree and let’s look at data points 1 and 2:</p>
<p><img src="/assets/gbt-block-dist-pred.png" alt="Block-distributed prediction" /></p>
<p>In order to determine which leaves they will fall into, we need to check the
values of both Feature 1 and Feature 3. However, these lie on different
workers, which means that some sort of communication is needed to determine
the exit leaf. What we want to avoid is the need to communicate the data
points themselves, as shuffling the entire dataset over the network
would incur prohibitive network costs.</p>
<p>In our work we solve this by utilizing <a href="http://pages.di.unipi.it/rossano/wp-content/uploads/sites/7/2015/11/sigir15.pdf" target="_blank">QuickScorer</a>, an algorithm
originally devised to speed up inference (prediction) in GBTs
which won the best paper award at SIGIR 2015.</p>
<h3 id="quickscorer">QuickScorer</h3>
<p>QuickScorer was devised to take advantage of modern computer architectures,
creating a cache-friendly algorithm that uses fast, bitwise operations
to determine the exit leaves in a decision tree.</p>
<p>The algorithm starts by assigning a bitvector to every internal node
in the tree.
Every bit in the bitvector corresponds to a leaf in the tree, so
for a tree with 8 leaves, every internal node is assigned a bitvector
with 8 bits, the leftmost bit corresponding to the leftmost leaf.
Every zero in the bitvector indicates that the corresponding leaf
would become inaccessible if the node evaluated to false.
To determine the exit leaf for a particular data point, we take the bitvectors
of all nodes that evaluate to false for that data point, and perform a bit-wise <code class="language-plaintext highlighter-rouge">AND</code> operation between
them. The leftmost bit set to 1 indicates the exit leaf for the data point.</p>
<p>For a concrete example let’s take the data from above, focusing on data point 2, and determine which internal nodes evaluate to false:</p>
<center><img src="/assets/gbt-quickscorer-example.png" alt="QuickScorer Example" width="600" /></center>
<p>The root is given the bitvector <code class="language-plaintext highlighter-rouge">0011</code>, because if it evaluates to false
(assuming that we move to the right when a condition
evaluates to false) the two leftmost leaves become inaccessible.
For data point 2, both the root condition and the condition
on the right child evaluate to false. So we take the bitwise
<code class="language-plaintext highlighter-rouge">AND</code> between their bitvectors, which for data point 2 is
<code class="language-plaintext highlighter-rouge">0011 AND 1101</code>, resulting in <code class="language-plaintext highlighter-rouge">0001</code>. This means that the exit
leaf for data point 2 is the rightmost leaf in the tree.</p>
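<p>The bitvector logic for a single tree can be sketched in a few lines of Python. This is only an illustration of the AND-and-find-leftmost-bit step described above, not the cache-optimized QuickScorer implementation; the function name and the integer encoding of bitvectors are assumptions.</p>

```python
def exit_leaf(false_node_bitvectors, num_leaves):
    """Return the 0-based index, counted from the left, of the exit leaf."""
    result = (1 << num_leaves) - 1  # start with every leaf reachable
    for bitvector in false_node_bitvectors:
        result &= bitvector
    # The leftmost set bit of the bitvector is the highest set bit of the integer.
    return num_leaves - result.bit_length()

# Data point 2 from the example: the root (0011) and its right child (1101)
# both evaluate to false.
leaf = exit_leaf([0b0011, 0b1101], num_leaves=4)
```

<p>Here <code class="language-plaintext highlighter-rouge">leaf</code> is 3, i.e. the rightmost of the four leaves. If no node evaluates to false, the function returns 0, the leftmost leaf.</p>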
<h3 id="block-distributed-quickscorer">Block-distributed QuickScorer</h3>
<p>The main idea of our paper is that this evaluation of false
nodes can be done independently and in parallel at each worker
in a block-distributed setting. Once each worker has performed its
local aggregation of bitvectors, it can send the result to one server,
which performs the overall aggregation and determines the exit leaf for
every data point. The terms <em>server</em> and <em>worker</em> are taken
from the parameter server architecture we are using, described
in more detail later. Briefly, worker machines store data and perform computations,
while servers aggregate updates and update the model.</p>
<center><img src="/assets/gbt-block-dist-quickscorer.png" alt="Block-distributed QuickScorer" /></center>
<p><em>Example of block-distributed prediction for data point 2.</em></p>
<p>Workers that share the same rows in the dataset will send their
local bitvectors to the same server:</p>
<center><img src="/assets/gbt-bd-pred-pattern.png" alt="Block-distributed prediction communication pattern." /></center>
<p>Because the <code class="language-plaintext highlighter-rouge">AND</code> operation is commutative and associative, the
order of the aggregations and where each partial aggregation happens
does not matter. The result is provably the same as if we had
done the bitwise <code class="language-plaintext highlighter-rouge">AND</code> locally on one machine.</p>
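<p>A small sketch of why the distributed aggregation is safe: each worker ANDs the bitvectors of the false nodes it can evaluate, and the server ANDs the partial results. Which worker holds which feature is an assumption made here for illustration, as are the function and variable names.</p>

```python
from functools import reduce

ALL_LEAVES = 0b1111  # 4-leaf tree with every leaf still reachable

def local_aggregate(bitvectors):
    """AND together the bitvectors of the false nodes a worker can evaluate."""
    return reduce(lambda a, b: a & b, bitvectors, ALL_LEAVES)

# Hypothetical assignment for data point 2: worker A holds the feature tested
# at the root, worker B holds the feature tested at the right child.
worker_a = local_aggregate([0b0011])
worker_b = local_aggregate([0b1101])

# The server combines the partial results; the order does not matter.
server_result = worker_a & worker_b
single_machine_result = local_aggregate([0b0011, 0b1101])
```

<p>Each worker sends a single small integer per data point and tree, regardless of how many features it holds.</p>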
<p>Importantly, because we are only communicating lightweight bitvectors
instead of the data themselves, the communication cost of prediction will
be low. This solves the first problem of block-distributed learning.
The second issue is how to calculate the gradient histograms and communicate
them efficiently.</p>
<h3 id="calculating-gradient-histograms">Calculating Gradient histograms</h3>
<p>The most computationally intensive part of GBT learning is step 2 from the
algorithm above: calculating the gradient histograms for each leaf.
Gradient histograms aggregate the gradient values of the data
points in a leaf, and we use them
to find the best <em>split candidate</em> for each leaf in the tree.
A split candidate is a candidate for an internal node
of the tree, which takes the form <em>feature_value < threshold</em>. In
“traditional” decision trees we try to find the feature and threshold combination
that allows us to best separate the data in some information theoretical sense,
such as the <a href="https://en.wikipedia.org/wiki/Decision_tree_learning#Metrics" target="_blank">Gini impurity</a>.
In gradient boosted trees we try to find the feature and threshold combination that provides us with the <em>best reduction in the overall loss of the tree</em>,
which we refer to as the <em>gain</em> of a particular split candidate.</p>
<h4 id="gradient-histogram-creation">Gradient histogram creation</h4>
<p>It’s best to look at this process through a concrete example.</p>
<p>The exhaustive way to find the optimal split point would
be to take every unique feature
value from the subset of data in a particular leaf and calculate the potential gain
if we split on that value, for every feature.</p>
<p><img src="/assets/gbt-feature-naive.png" alt="Feature values unique" /></p>
<p>This will quickly
become computationally infeasible if we have continuous
features with many (potentially billions of) unique values. What implementations like XGBoost do
instead is to aggregate the gradients of specific ranges of values
into buckets, to create what is called a <em>gradient histogram</em> for
each feature.</p>
<p>For the data shown above, for Feature 1 we might aggregate the gradient values
for the ranges of Feature 1 in [0, 5), [5, 12), [12, 20). This process is called
feature quantization, and usually efficient quantile sketches are employed,
such as the <a href="http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf" target="_blank">Greenwald-Khanna
sketch</a>.
In our work we use the state-of-the-art <a href="https://arxiv.org/abs/1603.05346" target="_blank">KLL
sketch</a>.
The number of bins we choose creates a tradeoff: more bins usually mean more accuracy,
but also increase the computational cost, as we need to evaluate more split points, and,
as we will see later, increase the communication cost for distributed implementations.
If we quantize all
the features above, we can end up with the following bins:</p>
<p><img src="/assets/gbt-feature-sketches.png" alt="Feature sketches" /></p>
<p>The next step in the process is to take the feature bins we have created using
the sketches, and aggregate the gradient values that correspond to each bin.
In other words, for each feature we place each data point’s gradient value in the bin that
its feature value falls into:</p>
<p><img src="/assets/gbt-feature-grads.png" alt="Feature sketches with gradients" /></p>
<p>Finally we aggregate the gradient values that belong in the same
bucket to get the <em>gradient histogram</em> for each feature:</p>
<p><img src="/assets/gbt-gradient-histograms.png" alt="Gradient histograms" /></p>
<p>The colors help us discern the source of each data point’s
contribution to a particular bucket. For example, for
Feature 1, data points 3 and 4 both have values in the
[0, 5) range, so their gradient values end up in the
first bucket.</p>
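<p>The bucketing step for one feature can be sketched as follows. The bucket edges match the ranges used above for Feature 1, but the feature values and gradient values are hypothetical stand-ins for the ones in the figures, chosen so that data points 3 and 4 land in the first bucket.</p>

```python
from bisect import bisect_right

bin_edges = [0, 5, 12, 20]           # buckets [0, 5), [5, 12), [12, 20)
feature_values = [13, 7, 3.5, 1.6]   # hypothetical Feature 1 values
gradients = [0.5, -0.25, 1.0, 0.75]  # hypothetical gradient values

def gradient_histogram(values, grads, edges):
    """Sum the gradients of the data points that fall into each bucket.

    Assumes every value lies inside [edges[0], edges[-1]).
    """
    histogram = [0.0] * (len(edges) - 1)
    for value, grad in zip(values, grads):
        bucket = bisect_right(edges, value) - 1
        histogram[bucket] += grad
    return histogram

feature_1_histogram = gradient_histogram(feature_values, gradients, bin_edges)
```

<p>The first bucket ends up holding the sum of the gradients of data points 3 and 4, and the other two buckets hold one gradient each.</p>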
<h3 id="split-evaluation">Split evaluation</h3>
<p>Once we have determined the gradient histograms, we can proceed to evaluate
the split candidates for each leaf. For that, we take the candidate
split points for each feature, partition the data on the proposed split into
the two new candidate leaves, and
evaluate the potential gain in accuracy (i.e. loss reduction).
The gain can be calculated by the following equation:</p>
<script type="math/tex; mode=display">\mathcal{G}_{\text{split}} = \frac{1}{2} \left[\frac{\left(\sum_{i \in I_{L}} g_{i} \right)^{2}} {\sum_{i \in I_{L}} h_{i}+ \lambda} + \frac{\left(\sum_{i \in I_{R}} g_{i} \right)^{2}}{\sum_{i \in I_{R}} h_{i} + \lambda} - \frac{\left(\sum_{i \in I}g_{i}\right)^{2}}{\sum_{i \in I}h_{i}+\lambda}\right]-\gamma</script>
<p>In the above, $g_i$ are the first-order gradient values, $h_i$ the second-order
gradients (Hessians), and $\lambda, \gamma$ are regularization terms.
$I_{L}$ and $I_{R}$ are the subsets that end up in the two resulting children (left and right)
when we split the data points $I$ of a leaf according to the split candidate.
In our simplified example we’re only using the first-order gradients,
and after applying a simplified version of the above equation we
get the following gain values for each potential split point
using the gradient histograms of the figure above:</p>
<center><img src="/assets/gbt-gain.png" alt="Gain calculation" width="500" /></center>
<p>This means that if we split the data points at the <em>Feature 1 < 5</em> split
point, we would get the best gain in accuracy.</p>
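<p>The gain equation translates directly into code once the gradient and Hessian sums for the left and right children are available from the histograms. The sketch below follows the full equation rather than the simplified version used for the figure; the function name, default regularization values, and example sums are assumptions.</p>

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of a split, given summed gradients (g) and Hessians (h) per child."""
    def score(g, h):
        return g * g / (h + lam)

    # The parent's sums are the sums over both children.
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# Hypothetical sums: the split separates positive from negative gradients.
good_split = split_gain(g_left=2.0, h_left=1.0, g_right=-2.0, h_right=1.0)

# A split that sends every data point to the left child gains nothing.
useless_split = split_gain(g_left=2.0, h_left=1.0, g_right=0.0, h_right=0.0)
```

<p>To evaluate all candidates for a feature, one would sweep over the histogram buckets, keeping running sums for the left child and subtracting them from the leaf totals to get the right child.</p>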
<h3 id="row-distributed-gradient-histogram-aggregation">Row-distributed gradient histogram aggregation</h3>
<p>The process we just described assumes that we have access to the complete
dataset to create the histograms, which means that all our data should be stored on one machine<sup id="fnref:outofcore"><a href="#fn:outofcore" class="footnote">1</a></sup>. What
happens, however, when our data is so massive that we need multiple machines
to store it? In addition, even if we had infinite storage space on one machine, we want
to be able to train our models as fast as possible, which these days
means that we want to take advantage of parallel computation. Scaling up on a single machine can be expensive
and lacks fault tolerance, so we often employ clusters of commodity computers
to quickly and reliably train models on massive data.</p>
<p>When data are row-distributed we have two challenges to solve.
The first is creating the feature quantiles that give us an estimate
of the empirical cumulative distribution function for each
feature. The second is using those quantiles to create the so-called gradient
histograms that allow us to determine the best way to grow our
tree. Both problems arise from the fact that each worker
only has access to a subset of the complete dataset, which
makes communication for both of these steps necessary.</p>
<h4 id="mergeable-quantile-sketches">Mergeable quantile sketches</h4>
<p>Determining the quantile sketches is relatively straightforward:
some quantile sketches have the property that they are <em>mergeable</em>:
given a stream of data, applying the sketch to parts of the stream
and then merging those partial sketches gives a summary with the same accuracy guarantees
as applying the sketch to the complete stream. This means that
we can create partial sketches at each worker, and then merge
those sketches to get the complete feature quantiles. This requires
communicating the partial sketches over the network. More
on the potential issues with that will come later.</p>
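<p>The mergeability property can be illustrated with a deliberately simple summary. The sketch below uses an exact summary (a sorted list of values) as a stand-in for an approximate sketch such as KLL, so it shows only the build-locally-then-merge pattern, not the space savings of a real sketch. All names and values are assumptions.</p>

```python
import heapq

def summarize(values):
    """Exact stand-in for a quantile sketch: keep all values, sorted."""
    return sorted(values)

def merge(summary_a, summary_b):
    """Merge two partial summaries into one, preserving sorted order."""
    return list(heapq.merge(summary_a, summary_b))

def quantile(summary, q):
    """Return the value at quantile q (0 <= q <= 1) of a summary."""
    index = min(int(q * len(summary)), len(summary) - 1)
    return summary[index]

# Each worker summarizes only the rows it holds (hypothetical values).
worker_1 = summarize([13, 7])
worker_2 = summarize([3.5, 1.6])
merged = merge(worker_1, worker_2)
```

<p>The merged summary is identical to the one built from all four values at once, which is the behaviour a mergeable sketch approximates within its error bounds.</p>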
<h4 id="communicating-partial-gradient-histograms">Communicating partial gradient histograms</h4>
<p>Similarly to how we can create partial quantile sketches, we can
also create partial gradient histograms that then need to be merged
to create the overall histogram for each leaf. Once we have the buckets
from the previous step we can create the local histograms at each
worker:</p>
<center><img src="/assets/gbt-grad-hist-w1.png" alt="Worker 1 Gradient histograms" width="700" /></center>
<center><img src="/assets/gbt-grad-hist-w2.png" alt="Worker 2 Gradient histograms" width="700" /></center>
<p>These then need to be communicated between the workers so that
all workers end up with the same copy of the merged gradient histograms.
See all those zeros that have now appeared in our histograms? This is
the cause of the issues in the current state of the art.</p>
<h4 id="issue-with-the-state-of-the-art-dense-communication">Issue with the state of the art: Dense Communication</h4>
<p>The problem with the above approach is that all current implementations utilize
some sort of
<a href="https://en.wikipedia.org/wiki/Reduce_(parallel_pattern)" target="_blank">all-reduce</a>
operation to sync the sketches and histograms between the workers. In short, an all-reduce
operation applies an aggregation function element-wise across the vectors held by the workers, and then makes
the aggregated result available to each worker in the cluster. In the following example
we have 4 vectors which we want to aggregate element-wise, using addition as our
aggregation function.</p>
<p><img src="/assets/gbt-all-reduce.png" alt="All-reduce example" />
<em>All-reduce aggregation of vectors. The final result is propagated back from the root to every node in the tree.</em></p>
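<p>The semantics of a dense all-reduce can be simulated on a single machine. The sketch below is not how MPI or any collective library implements the operation (those use tree or ring communication patterns); it only shows the input and output contract, with hypothetical vectors.</p>

```python
def all_reduce_sum(worker_vectors):
    """Element-wise sum of equally sized vectors, returned to every worker."""
    length = len(worker_vectors[0])
    if any(len(vector) != length for vector in worker_vectors):
        raise ValueError("dense all-reduce requires equally sized vectors")
    total = [sum(column) for column in zip(*worker_vectors)]
    # Every worker receives its own copy of the aggregated vector.
    return [list(total) for _ in worker_vectors]

# Four hypothetical workers; most entries are zero but are still transmitted.
vectors = [
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 0, 4],
    [5, 0, 1, 0],
]
results = all_reduce_sum(vectors)
```

<p>Note that the size check at the top is exactly the constraint discussed next: every participant has to contribute a vector of the same, known length.</p>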
<p>All<sup id="fnref:kylix"><a href="#fn:kylix" class="footnote">2</a></sup> current all-reduce implementations (like
<a href="https://en.wikipedia.org/wiki/Message_Passing_Interface" target="_blank">MPI</a>) use <em>dense communication</em> to
perform this operation. This means that <strong>the number of elements to be aggregated needs to be
known in advance, and each element must have the same byte size</strong>.</p>
<p>Why is this a problem? In the simple vector aggregation example given above,
almost half the elements are zeros. Because of this system limitation, every vector
being communicated will need to have the same byte size, regardless of the
number of zeros in it. This causes a lot of redundant communication. Similarly to the
vectors in this example, the gradient histograms are also sparse, as we can see with the example
for Workers 1 and 2 further up. As we increase the number of buckets and the number of
features, which can be in the millions or billions, the number of zeros being
communicated will increase, creating massive unnecessary overhead. It is this problem in
particular that we attack with our approach.</p>
<h3 id="block-distributed-gradient-histogram-aggregation">Block-distributed gradient histogram aggregation</h3>
<p>Our idea to deal with the problem is to communicate sparse representations
of the histograms. <a href="https://en.wikipedia.org/wiki/Sparse_matrix#Storing_a_sparse_matrix" target="_blank">Sparse representations</a> can be thought of as maps
that compress the size of vectors and matrices with a high ratio of zeros
in their values. That requires a communication system that can deal with
different workers communicating objects with different byte sizes. As we
mentioned above, systems like MPI do not allow us to communicate objects
of arbitrary size.</p>
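<p>A rough way to see the potential saving is to compare the serialized size of a mostly-zero histogram in dense form against a list of (index, value) pairs. The encoding below (8-byte floats, 4-byte indices) and the histogram contents are assumptions for illustration, not the wire format used in the paper.</p>

```python
import struct

def dense_bytes(vector):
    """Size of the vector when every element, zeros included, is an 8-byte float."""
    return len(struct.pack(f"<{len(vector)}d", *vector))

def sparse_bytes(vector):
    """Size when only non-zero entries are sent as (4-byte index, 8-byte float)."""
    nonzero = [(i, v) for i, v in enumerate(vector) if v != 0.0]
    return sum(len(struct.pack("<Id", i, v)) for i, v in nonzero)

# Hypothetical histogram with 1000 buckets, only two of them populated.
histogram = [0.0] * 1000
histogram[3] = 1.75
histogram[512] = -0.25

dense_size = dense_bytes(histogram)
sparse_size = sparse_bytes(histogram)
```

<p>The sparse size depends on the data, so it differs from worker to worker, which is why the communication system has to accept messages of varying length.</p>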
<p>Another communication system with a more flexible
programming paradigm is the <a href="https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf" target="_blank">Parameter Server</a> (PS). In this system,
every machine in the cluster assumes a particular role: <em>Servers</em> are
responsible for storing and updating the parameters of the model.
<em>Workers</em> store the training data and are responsible for the computation
of the updates to the model, which they then forward to the servers.
Importantly, the parameter server allows point-to-point communication
between servers and workers, allowing each worker to send over objects
of arbitrary size, making it a perfect candidate for sparse communication.
<a href="https://doi.org/10.1145/3183713.3196892" target="_blank">DimBoost</a> (paywalled) was the first paper to use the PS for GBT training and
served as an inspiration for our work. However, DimBoost still
uses dense communication.</p>
<center><img src="/assets/parameter-server.png" alt="Parameter Server" width="500" /></center>
<p>Source: <a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/40565.pdf" target="_blank">Google DistBelief</a></p>
<p>In our work we represent the gradient histograms as sparse <a href="https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format)" target="_blank">CSR matrices</a>, each
row corresponding to a feature, and each column to a bucket in the histogram.
Each worker populates its own sparse matrix, and then sends it over to one
server. Each server is responsible for creating the gradient
histograms of a specific range of features, so the
workers that are responsible for the same sets of features will send their
histograms to the same server.</p>
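<p>For readers unfamiliar with the CSR layout, the following sketch converts a small dense histogram matrix (one row per feature, one column per bucket) into the three CSR arrays. It is a plain-Python illustration with hypothetical values, not the C++ data structure used in our implementation.</p>

```python
def to_csr(dense_rows):
    """Convert a dense row-major matrix into CSR (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for column, value in enumerate(row):
            if value != 0:
                data.append(value)      # non-zero gradient sums
                indices.append(column)  # bucket index of each non-zero
        indptr.append(len(data))        # where the next feature's entries start
    return data, indices, indptr

# Hypothetical histograms for 3 features with 3 buckets each.
histograms = [
    [1.75, -0.25, 0.5],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
]
data, indices, indptr = to_csr(histograms)
```

<p>A feature with an empty histogram costs only one extra entry in <code class="language-plaintext highlighter-rouge">indptr</code>, which is what makes the format attractive when most buckets are zero.</p>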
<p><img src="/assets/gbt-block-hist-aggregation.png" alt="Block-distributed histogram aggregation" /></p>
<p>In the example above, Server 1 is responsible for features 1 and 2,
and Server 2 for Feature 3. The workers that share the same columns
of data will send their partial histograms to the same server.
Each server can then aggregate
the partial histograms, and calculate the best local split
candidate from its local view of the data. It takes one
final communication step to determine the best overall
split candidate.</p>
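<p>The routing and aggregation pattern can be sketched with dictionaries standing in for the sparse matrices. The 2x2 block layout, the feature-to-server mapping, and the histogram values below are assumptions chosen to mirror the example, with keys of the form (feature, bucket).</p>

```python
from collections import defaultdict

FEATURE_TO_SERVER = {1: "server_1", 2: "server_1", 3: "server_2"}

def merge_sparse(histograms):
    """Sum sparse histograms; buckets absent from a message count as zero."""
    merged = defaultdict(float)
    for histogram in histograms:
        for key, grad_sum in histogram.items():
            merged[key] += grad_sum
    return dict(merged)

# Four hypothetical workers in a 2x2 block layout.
worker_histograms = [
    {(1, 0): 1.0, (2, 1): 0.5},     # rows 1-2, features 1-2
    {(1, 0): 0.75, (1, 2): -0.25},  # rows 3-4, features 1-2
    {(3, 1): 0.5},                  # rows 1-2, feature 3
    {(3, 1): 0.25, (3, 0): 1.0},    # rows 3-4, feature 3
]

inbox = defaultdict(list)
for histogram in worker_histograms:
    # In this layout all features in a block belong to the same server.
    first_feature = next(iter(histogram))[0]
    inbox[FEATURE_TO_SERVER[first_feature]].append(histogram)

server_histograms = {server: merge_sparse(msgs) for server, msgs in inbox.items()}
```

<p>Each server now holds the complete histograms for its own features and can evaluate split candidates on them before the final step that compares the servers' best candidates.</p>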
<h2 id="results">Results</h2>
<p>So how much difference in terms of communication can this strategy make? To
test our hypothesis, we implemented both versions of the algorithm in C++,
basing our code on the XGBoost codebase, which makes use of the
<a href="https://github.com/dmlc/rabit" target="_blank">Rabit</a> collective
communication framework. For our parameter server implementation we use
<a href="https://github.com/dmlc/ps-lite" target="_blank">ps-lite</a>.</p>
<p>To test the performance of each method under various levels of sparsity we used
4 datasets for binary classification, taken from the <a href="https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html" target="_blank">libsvm</a> and <a href="https://www.openml.org/" target="_blank">OpenML</a>
repositories. URL and avazu are two highly sparse datasets with 3.2M and 1M
features respectively. RCV1 is less sparse with 47K features, and Bosch
is a dense dataset with 968 features.</p>
<p>We use a local cluster of 12 machines, using 9 workers and 3 servers for the
block-distributed experiments, and all 12 machines as workers for the row-distributed
ones. We measure the communication cost of each approach by the size of the
histograms created by each method, measured in MiB. We also measure their
real-world performance as the time spent in communication and computation
during histogram aggregation.</p>
<h3 id="byte-size-of-the-gradient-histograms">Byte size of the gradient histograms</h3>
<p><img src="/assets/gbt-hist-size.png" alt="Gradient histogram size" /></p>
<p>In the figure above we can see that our block-distributed approach creates
histograms that are several orders of magnitude smaller than the dense
row-distributed method. This confirms our hypothesis that for sparse datasets
there are massive amounts of redundant communication in current implementations.</p>
<h3 id="communication-computation-time">Communication-Computation time</h3>
<p><img src="/assets/gbt-comp-comm-cost.png" alt="Communication-Computation cost" /></p>
<p>The figure above shows us the time spent communicating data and
performing computation for each approach. For the sparse datasets
URL and avazu,
we can see a large decrease in the time required to communicate the gradients
and only a modest increase in computation time, which leads to an overall
improved runtime.
However, for the denser datasets the difference in communication
time is small, while the computation time increases significantly.
This is caused by the overhead introduced from building the sparse
data structures, which, unlike the dense arrays used by the row-distributed
algorithm, are not contiguous in memory and require constant indirect
memory accesses. This, in addition to the overhead introduced by the
parameter server, leads to an increased overall time to compute the gradient
histograms for the dense datasets.</p>
<h3 id="quantile-sketch-size">Quantile sketch size</h3>
<p>Another aspect where dense communication can significantly increase
the cost is quantile sketches. As we mentioned above, in order
to determine the ranges for the buckets in the gradient histograms,
we need an estimate of the cumulative distribution function for each
feature. This is done in the distributed setting by creating
a quantile sketch at each worker, and then merging those to get
an overall quantile sketch.</p>
<p>The issue with dense communication of quantile sketches is that these are
probabilistic sketches, and as such their actual size cannot be known in
advance. What systems like XGBoost have to do instead is to allocate for each
sketch the <em>maximum possible size</em> it can occupy in memory, and communicate
that. For such efficient quantile sketches the maximum theoretical size can be multiple orders of
magnitude larger than the sketch’s actual size. Using our approach we
are able to just communicate the necessary bytes, and not the theoretical
maximum for each sketch, leading to massive savings:</p>
<p><img src="/assets/gbt-sketch-size.png" alt="Quantile sketch size" /></p>
<p>As shown in the original <a href="https://www.kdd.org/kdd2016/subtopic/view/xgboost-a-scalable-tree-boosting-system" target="_blank">XGBoost paper</a> (Figure 3), being able to communicate the sketches
at every iteration instead of only at the start of learning (local vs. global sketches)
leads to similar accuracy with fewer bins per histogram, enabling
even more savings in communication.
<!-- The XGBoost paper demonstrates
equivalent accuracy using six times fewer buckets, which directly
translates to a six-fold decrease in the communication cost for histogram
aggregation (although communication becomes more frequent).
--></p>
<h2 id="conclusions">Conclusions</h2>
<p>In this work we demonstrated the value of sparse communication, and provided
solutions for the problems that arise with block-distributed learning.
Using a more flexible communication paradigm we are able to get massive
savings in the amount of data sent over the network, leading to improved
training times for sparse data.</p>
<p>This work opens up avenues for plenty of improvements. First, while
we have created a proof-of-concept system to evaluate the row vs. block
distribution in isolation, the real test will come from integrating these ideas into an existing GBT implementation like XGBoost and evaluating its performance
on a wide range of datasets against other state-of-the-art systems like
LightGBM and CatBoost.</p>
<p>In terms of the algorithm itself, one easy improvement is to use
the RapidScorer algorithm, which uses
run-length encoding to compress the bitvectors for large trees, in place of QuickScorer.
Such a method can bring further communication savings for prediction.</p>
<p>If there’s one takeaway for users and especially developers of GBT
learning systems, it is that current communication patterns are highly
inefficient, and massive savings can be had by taking advantage of
the inherent sparsity in the data and intermediate parts of the model
like the gradient histograms. This, in addition to the new scale-out
dimension that block-distribution enables, can make distributed GBT
training even cheaper and more efficient.</p>
<div class="footnotes">
<ol>
<li id="fn:outofcore">
<p>Generally the assumption is that the data should also be able
to fit in the main memory of the machine, however techniques like
<a href="https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html" target="_blank">out-of-core learning</a> allow us to overcome that requirement. <a href="#fnref:outofcore" class="reversefootnote">↩</a></p>
</li>
<li id="fn:kylix">
<p>There’s been some research towards sparse all-reduce systems,
like <a href="https://people.eecs.berkeley.edu/~jfc/papers/14/Kylix.pdf" target="_blank">Kylix</a>. <a href="#fnref:kylix" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Mon, 26 Aug 2019 00:00:00 +0000
http://tvas.me/articles/2019/08/26/Block-Distributed-Gradient-Boosted-Trees.html
http://tvas.me/articles/2019/08/26/Block-Distributed-Gradient-Boosted-Trees.htmldecision treesgradient boostingresearchdistributed systemsarticlesUncovering similarities and concepts at scale<script type="math/tex; mode=display">\let\und_\newcommand{\Rho}{\mathrm{P}} \newcommand{\rn}[1]{\rho\und{#1}} \newcommand{\rns}[1]{|\rn{#1}|\und1} \newcommand{\mrn}[1]{\tau\und{#1}} \newcommand{\drns}[1]{|\check{\rho}\und{#1}|\und1} \newcommand{\krns}[1]{|\hat{\rho}\und{#1}|\und1} \newcommand{\rv}{\Rho} \newcommand{\sy}[1]{\sigma\und{#1}} \newcommand{\asy}[1]{\tilde{\sigma}\und{#1}} \newcommand{\nm}[1]{L\und1(#1)} \newcommand{\dnm}[2]{|\rn{#1}-\rn{#2}|\und1} \newcommand{\anm}[1]{\tilde{L}\und1(#1)}</script>
<p><img src="/assets/billion-word-graph.png" alt="Word graph" /></p>
<p>I defended <a href="http://urn.kb.se/resolve?urn=urn%3Anbn%3Ase%3Akth%3Adiva-250038" target="_blank">my thesis</a> recently and finally have some time to look back, from a distance, at the work I’ve done over the past
few years. Over the next few weeks
I’ll be going over each of the papers included in my dissertation,
to present them in a more accessible format.</p>
<p>This first post is about a scalable way to determine similarities between
objects and group them into coherent groups. We’ll give examples of
how we’re able to combine deep learning with graph processing to uncover
“visual concepts” along with a high-level explanation of the algorithm.
The code for this work is <a href="https://github.com/sics-dna/concepts">available on GitHub</a>.</p>
<h2 id="introduction">Introduction</h2>
<p>Finding similarities is one of the fundamental problems in machine learning. We use similarities between users and items
to make recommendations, we use similarities between websites to do web searches, we use similarities between proteins to study disease etc.</p>
<p>So a natural question that comes up is: how can we efficiently calculate similarities between objects? There have been
many approaches for this purpose proposed in different domains, like <a href="https://en.wikipedia.org/wiki/Word_embedding" target="_blank">word embeddings</a>, <a href="https://en.wikipedia.org/wiki/Collaborative_filtering" target="_blank">collaborative filtering</a>,
and <a href="https://en.wikipedia.org/wiki/Similarity_learning" target="_blank">similarity learning</a>.
Once we have calculated similarities between objects, how can we then discover groups of objects that belong together?</p>
<p>In this post I will provide an overview of our work on <a href="/assets/concepts-icdm.pdf" target="_blank">scalable similarity calculation and concept discovery</a>
presented at <a href="/conferences/2015/11/23/ICDM-2015-Highlights.html">ICDM 2015</a>
and extended in our <a href="/assets/concepts-kais.pdf" target="_blank">KAIS paper</a>, in which we model objects as nodes in a graph
where the edges represent similarities between objects. I will show how one can create correlation graphs from data, and how one
can transform that graph to extract similarities between objects, and finally how we can use the similarity graph to create
interesting groups of similar items which we call <em>concepts</em>, with illustrative examples.</p>
<p>This post will be a bit long so you can skip ahead to the section that interests you most:</p>
<ul id="markdown-toc">
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a> <ul>
<li><a href="#problem-description" id="markdown-toc-problem-description">Problem description</a></li>
<li><a href="#our-approach" id="markdown-toc-our-approach">Our Approach</a></li>
<li><a href="#graph-based-vs-vector-space-similarity" id="markdown-toc-graph-based-vs-vector-space-similarity">Graph-based vs. vector-space similarity</a></li>
</ul>
</li>
<li><a href="#creating-the-correlation-graph" id="markdown-toc-creating-the-correlation-graph">Creating the correlation graph</a></li>
<li><a href="#transforming-the-correlation-graph-into-a-similarity-graph" id="markdown-toc-transforming-the-correlation-graph-into-a-similarity-graph">Transforming the correlation graph into a similarity graph</a> <ul>
<li><a href="#math" id="markdown-toc-math">Math</a></li>
</ul>
</li>
<li><a href="#uncovering-concepts" id="markdown-toc-uncovering-concepts">Uncovering concepts</a></li>
<li><a href="#examples" id="markdown-toc-examples">Examples</a> <ul>
<li><a href="#text" id="markdown-toc-text">Text</a> <ul>
<li><a href="#people" id="markdown-toc-people">People</a></li>
<li><a href="#nationalities--groups" id="markdown-toc-nationalities--groups">Nationalities & Groups</a></li>
</ul>
</li>
<li><a href="#visual-concepts" id="markdown-toc-visual-concepts">Visual Concepts</a> <ul>
<li><a href="#forming-the-correlation-graph" id="markdown-toc-forming-the-correlation-graph">Forming the correlation graph.</a></li>
<li><a href="#example-concepts" id="markdown-toc-example-concepts">Example concepts</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a> <ul>
<li><a href="#discussion---links-to-posts-on-social-media" id="markdown-toc-discussion---links-to-posts-on-social-media">Discussion - Links to posts on social media</a></li>
</ul>
</li>
</ul>
<h3 id="problem-description">Problem description</h3>
<p>Our problem description is this: Given a dataset that we know encodes some relations between our objects, how can we
calculate similarities between objects that interest us? And how can we do that in a scalable manner, if we assume we have potentially billions of
records to learn from?</p>
<p>Our algorithm should have a few characteristics that make it attractive:</p>
<ul>
<li><strong>Accurate:</strong> Clearly we want our resulting similarity scores to make some kind of sense in our domain. If we are calculating
similarities between music artists, we would like our score to assign high similarity between the Wu-tang Clan and Nas,
but low similarities between Wu-tang and Tchaikovsky.</li>
<li><strong>Domain Agnostic:</strong> We would like our algorithm to be applicable in various domains, rather than being specialized
in one. For example while collaborative filtering works well in determining similarities between users and items,
given a user-item interaction matrix, it’s not intuitive to apply it in order to obtain similarities between proteins.</li>
<li><strong>Unsupervised:</strong> We want our algorithm to be able to discover similarities based on relations that are
present in the source data in an unsupervised manner. Relying on human-curated databases like <a href="https://wordnet.princeton.edu/" target="_blank">WordNet</a>
brings with it many problems like slow adaptation of new terms, limited coverage and most importantly, high cost.</li>
<li><strong>Scalable:</strong> Relying on data to uncover similarities means that we should be able to use very large datasets
that could contain latent information about the relationships between our objects. If our algorithm scales
unfavorably with the amount of input data, we would have to rely on subsampling, potentially losing useful information
about the interactions between the objects.</li>
</ul>
<h3 id="our-approach">Our Approach</h3>
<p>In order to achieve these desiderata we decided to take a two step approach: We first process our dataset to create a
compact representation of it we call the <em>correlation graph</em>. This graph
can include some useful relationships, but can also include spurious correlations. Take for example the words “Messi”, who is a famous footballer,
and “ball”. These will often appear within a short distance in text, meaning
they are correlated. <strong>This however does not mean that Messi is similar to a
ball!</strong>
We want to have a method that allows us to discover deeper semantic relationships
between objects and not just correlations.</p>
<p>For that purpose, we further transform the correlation graph into a <em>similarity graph</em>.
The similarity graph should capture semantic relationships, and we focus on
exchangeability of an object. If we take the word “Messi” in a sentence and replace
it with Ronaldo, the sentence should still make sense. Ideally we would like
our algorithm to be able to group Messi and Ronaldo in the same semantic
group.</p>
<p>We call these semantic groups “concepts” and our approach to uncovering
them is to apply a community detection technique on top of the similarity
graph.</p>
<p>The focus of our work and of this post will be on the transformation between correlation and similarity graph;
of course choosing how to create the correlation graph and perform community detection are of great importance,
so we will show a couple of examples we used that should be applicable in various settings.</p>
<h3 id="graph-based-vs-vector-space-similarity">Graph-based vs. vector-space similarity</h3>
<p>You might have noticed that we have mentioned two approaches to calculating similarities that are quite different:
Graph-based similarity and vector-space similarity which is used in, for
example, word embeddings.</p>
<p>The overall idea in vector-based similarity is to embed objects in some vector space and then use distance measures like
Euclidean or cosine distance to measure their similarities. This approach has proven to work well in a number of fields;
apart from the ones we have mentioned above <a href="http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/" target="_blank">RNN word embeddings</a>
have been very successful in the NLP field. Concept discovery is done in this case using traditional clustering
techniques like kMeans clustering.</p>
<p>There are however some problems with this kind of approach. Since these are usually iterative methods they tend to be
computationally costly, as they require multiple passes over the complete dataset to converge. They also tend to have
parameters that significantly affect their performance that can be hard to tune without the use of expensive cross-validation,
such as the number of factors to use in a matrix factorization technique,
or the dimensionality of the embeddings.</p>
<p>By using a graph representation instead, the connections between objects arise from the data themselves, and this allows
us to discover higher-order structures and relationships between objects in an efficient manner, all while having compact representation of the data.
Importantly, we only require a single pass over the data to create the
correlation graph, and provide a scalable algorithm for the similarity
transformation.</p>
<h2 id="creating-the-correlation-graph">Creating the correlation graph</h2>
<p>The correlation graph would usually be created from data, unless you already have access to a model of
correlations between objects, like the <a href="http://www.biomedcentral.com/1471-2105/6/134" target="_blank">codon substitution matrix</a>
we used as an example in our paper.</p>
<p>The creation of the correlation graph is a very important step in the whole process, as the <a href="https://en.wikipedia.org/wiki/Garbage_in,_garbage_out" target="_blank">“garbage in, garbage out”</a>
principle holds very true in machine learning. If we want our similarity graph to be meaningful, the correlation graph
should encode some information about the relationships between objects. Fortunately it’s relatively straightforward to create a correlation
graph for two of the most important types of data on the web: text and user interaction data.</p>
<p>For these types of data one can simply model objects as nodes and some measure of co-occurrence between objects as the
edges between them. For example one could create one node in the graph for each word in the dataset, and create
edges between words weighted by their conditional probability, e.g. how probable is it that we will see the word “dog”
given that we have observed the word “cat” within a sentence? Or in the case of user data, what is the conditional probability of a user
interacting with item A given that he has interacted with item B?</p>
<p>If we extract a (compactly represented) co-occurrence matrix, we are then able to create many different correlation
graphs, by choosing a different correlation measure.
For text we obtained the best results using <a href="https://en.wikipedia.org/wiki/Pointwise_mutual_information" target="_blank">pointwise mutual information</a>
but one could also use a multitude of other measures like the <a href="https://en.wikipedia.org/wiki/Jaccard_index" target="_blank">Jaccard Index</a>
or the <a href="https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient" target="_blank">Sorensen-Dice coefficient</a>
among others.</p>
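<p>As a rough illustration of this step, here is a minimal sketch that builds a PMI-weighted correlation graph from a toy corpus. The function name <code>pmi_graph</code>, the toy sentences, and the choice of "appearing in the same sentence" as the co-occurrence criterion are all assumptions made for this example, not the implementation used in the paper.</p>

```python
import math
from collections import Counter
from itertools import combinations


def pmi_graph(sentences):
    """Return {(word_a, word_b): PMI} for words that co-occur in a sentence.

    Assumption for this sketch: two words co-occur if they appear in the
    same sentence, and probabilities are estimated per sentence.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence))  # count each word once per sentence
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    n = len(sentences)
    edges = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        edges[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return edges


# Hypothetical toy corpus, already tokenized.
sentences = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["cat", "fish"],
    ["car", "road"],
]
graph = pmi_graph(sentences)
```

<p>Since the pair counts are kept separately from the weighting, swapping PMI for another measure such as the Jaccard index would only change the last loop.</p>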
<p>One can imagine using similar approaches for other types of data, so the above is not limited to text and user data. In the examples below we will show
how one can use a combination of supervised deep learning and our method
to uncover visual concepts.</p>
<h2 id="transforming-the-correlation-graph-into-a-similarity-graph">Transforming the correlation graph into a similarity graph</h2>
<p>Creating the correlation graph should generally be a straightforward process. But, as mentioned above, the correlation graph will not
necessarily model the kinds of relations we want to capture.
We are aiming for semantic similarity, not simple correlation.</p>
<p>In order to discover similarities then, we follow the popular adage by J.R. Firth:</p>
<blockquote>
<p>“You shall know a word by the company it keeps.”</p>
</blockquote>
<p>Our goal is to extend this definition and talk about any object, not just
words. The way we do this is by looking at the neighborhood (or <em>context</em>) of the items and calculating a similarity score
based on the similarity of their <em>contexts</em>.</p>
<p>The central element of our approach is the correlation-to-similarity graph
transformation. <strong>The main idea is to assign higher similarity to nodes whose
contexts are themselves similar</strong>. Nodes that have strong correlations
to the same nodes, and few other correlations, will be assigned higher
similarity themselves. In the example correlation graph below, we are
interested in calculating the similarity between nodes 4 and 7. The weight
of the correlation is indicated by the width of the edges. Nodes 4 and 7 have
strong correlations to the same set of nodes (2,3,5,6) and few weak correlations
to other nodes. Since their “contexts” are similar, we posit that the nodes
themselves should be similar. The exact calculation is given below, as well
as in the paper, but hopefully this provides enough intuition.</p>
<p><img src="/assets/correlation-example.png" alt="Correlation graph example" />
<em>An example correlation graph. Nodes 4 and 7 have similar contexts,
so should be similar themselves.</em></p>
<p>In order to achieve scalability we take into consideration
only two-hop neighbors in our calculation. While this may seem
limiting at first, it allows us to uncover deep relationships
as we will see in our examples. Based on this approach we can take
any <em>correlation</em> graph and transform it into a <em>similarity</em> graph
as shown here:</p>
<p><img src="/assets/similarity-transformation.png" alt="Similarity Transformation" />
<em>An example of a graph transformation, from a correlation graph into
a similarity graph.</em></p>
<h3 id="math">Math</h3>
<p>Time for some math and definitions taken from our paper then. If you’re not
interested in these feel free to <a href="#examples">skip to the examples</a>. <strong>The TLDR is that we
sum the correlations of every pair of two-hop neighbors, calculate how much they have
in common vs. not in common, and determine their similarity as a result of that.</strong></p>
<p>Let $C = \{i\}\und{i=1}^n$ be a set of objects, where each object has a correlation, $\rn{i,j}$, to
every other object.
The context of an object $i$ is the vector of its relations to all
other objects, $\rn{i} = (\rn{i,j})\und{j=1}^n$.</p>
<p>The way we define the similarity
$\sy{i,j}$, is by subtracting the relative $L\und1$-norm of the difference between $\rn{i}$ and $\rn{j}$ from
1; that way we transform a difference measure to a similarity measure:</p>
<script type="math/tex; mode=display">\begin{equation}\label{eq:sim}
\sy{i,j} = 1 - \frac{\dnm{i}{j}}{\rns{i} + \rns{j}},
\end{equation}</script>
<p>where</p>
<script type="math/tex; mode=display">\begin{equation}\label{eq:totrel}
\rns{i} = \sum\und{k \in C} | \rn{i,k}|
\end{equation}</script>
<p>and</p>
<script type="math/tex; mode=display">\begin{equation}\label{}
\dnm{i}{j} = \sum\und{k \in C} | \rn{i,k} - \rn{j,k} |,
\end{equation}</script>
<p>denoted $\nm{i,j}$ for short.</p>
<p>In order to make the computation scalable, we only calculate similarities between items that are two-hop
neighbors in the graph, that is, in order for two items to have a similarity score in our approach they must
share at least one common neighbor. Writing $n\und{i}$ for the set of neighbors of $i$, we can then compute $\nm{i,j}$ as:</p>
<script type="math/tex; mode=display">\nm{i,j} = \rns{i} + \rns{j} + \Lambda\und{i,j},</script>
<p>where</p>
<script type="math/tex; mode=display">\begin{equation}
\Lambda\und{i,j} = \sum\und{k \in n\und{i} \cap n\und{j}} (|\rn{i, k} - \rn{j, k}| - |\rn{i, k}| - |\rn{j, k}|)
\label{eq:l1common}
\end{equation}</script>
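<p>To make the formulas above concrete, here is a small sketch that computes the similarity in two ways: directly from the relative L1 difference, and through the common-neighbor correction term. The dictionary representation of a context, the function names, and the edge weights are assumptions made for illustration; both contexts are assumed to be non-empty so the denominator is not zero.</p>

```python
def l1_norm(context):
    """Sum of absolute correlations of one node (its total relation mass)."""
    return sum(abs(w) for w in context.values())


def similarity_direct(ctx_i, ctx_j):
    """1 minus the relative L1 difference, summing over every neighbor."""
    keys = set(ctx_i).union(ctx_j)
    l1_diff = sum(abs(ctx_i.get(k, 0.0) - ctx_j.get(k, 0.0)) for k in keys)
    return 1.0 - l1_diff / (l1_norm(ctx_i) + l1_norm(ctx_j))


def similarity_common(ctx_i, ctx_j):
    """Same score, but only iterating over the common neighbors (Lambda term)."""
    common = set(ctx_i).intersection(ctx_j)
    lam = sum(
        abs(ctx_i[k] - ctx_j[k]) - abs(ctx_i[k]) - abs(ctx_j[k]) for k in common
    )
    total = l1_norm(ctx_i) + l1_norm(ctx_j)
    return 1.0 - (total + lam) / total


# Hypothetical weights for nodes 4 and 7: strong shared neighbors (2, 3, 5, 6)
# and one weak unshared neighbor each.
ctx_4 = {2: 0.9, 3: 0.8, 5: 0.7, 6: 0.9, 1: 0.1}
ctx_7 = {2: 0.8, 3: 0.9, 5: 0.7, 6: 0.8, 8: 0.1}

sim_direct = similarity_direct(ctx_4, ctx_7)
sim_common = similarity_common(ctx_4, ctx_7)
```

<p>Both functions return the same score, but the second one only touches the intersection of the two neighborhoods, provided the per-node norms are computed once and reused.</p>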
<p>In the paper we provide more details about how to make the approach scalable,
namely applying a maximum in-degree for nodes and a weight threshold for the edges,
motivated by the observation that most items in the graph are unrelated,
so we should avoid including in the computation small, potentially spurious
correlations.</p>
<h2 id="uncovering-concepts">Uncovering concepts</h2>
<p>We mentioned earlier that the notion of a concept in a similarity graph lies in the graph structure, i.e.
communities or clusters are encoded in the way that nodes are connected with each other. The equivalent
of clustering in a vector-space is
commonly referred to as <em>community detection</em> in the graph literature.</p>
<p>Community detection is a heavily researched topic, and I encourage you to take a look at this
<a href="http://arxiv.org/abs/0906.0612">excellent survey</a> by Santo Fortunato for an overview of the field.
In the context of this work we wanted a community detection algorithm that was scalable and
ideally allowed for <em>overlapping</em> communities. Overlapping community detection aims at assigning
nodes to one or more communities, as most objects in the real world belong to more than one community.
A person for example might belong to a group of friends, a (potentially overlapping) group of colleagues,
a gym club etc. Uncovering such communities is computationally challenging, but some very interesting algorithms
have recently been proposed, like the one from <a href="http://i.stanford.edu/~crucis/pubs/paper-nmfagm.pdf">Yang and Leskovec</a>,
and you can take a look <a href="http://arxiv.org/abs/1110.5813" target="_blank">here</a> for an overview of the area.</p>
<p>In our work we used a variant of the <a href="http://arxiv.org/abs/1109.5720" target="_blank">SLPA</a> algorithm. SLPA is a community detection
algorithm based on <a href="https://en.wikipedia.org/wiki/Label_Propagation_Algorithm" target="_blank">label propagation</a> that can scale to graphs with millions
of nodes and billions of edges. It is an iterative algorithm where each
node maintains a <em>memory</em> of community labels that are exchanged over
the edges of the network. Nodes will sample from their incoming labels
and maintain the most frequent ones in their memory. As we move from iteration
to iteration, the memory of each node will gradually converge to a small
subset of labels, which are used to label each node with overlapping
communities.</p>
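<p>The speaker-listener mechanics described above can be sketched in a few lines. This is a simplified, single-machine version of plain SLPA on an unweighted graph, not the variant we used; the function name <code>slpa</code>, the parameter defaults, and the toy graph are assumptions for illustration.</p>

```python
import random
from collections import Counter


def slpa(adjacency, iterations=20, threshold=0.2, seed=0):
    """Simplified SLPA: returns {node: set of community labels}.

    adjacency maps each node to a list of its neighbors. Labels are node
    ids, so they must be sortable (used only to break ties reproducibly).
    """
    rng = random.Random(seed)
    # Every node starts with a memory that contains only its own label.
    memory = {node: [node] for node in adjacency}
    nodes = list(adjacency)

    for _ in range(iterations):
        rng.shuffle(nodes)
        for listener in nodes:
            neighbors = adjacency[listener]
            if not neighbors:
                continue
            # Each neighbor "speaks" one label, drawn from its memory with
            # probability proportional to how often it appears there.
            received = [rng.choice(memory[speaker]) for speaker in neighbors]
            counts = Counter(received)
            top = max(counts.values())
            winners = sorted(label for label, c in counts.items() if c == top)
            # The listener remembers the most popular incoming label.
            memory[listener].append(rng.choice(winners))

    # Post-processing: keep labels that make up enough of each memory.
    communities = {}
    for node, labels in memory.items():
        counts = Counter(labels)
        communities[node] = {
            label for label, c in counts.items() if c / len(labels) >= threshold
        }
    return communities


# Hypothetical toy graph: two disconnected triangles.
adjacency = {
    0: [1, 2], 1: [0, 2], 2: [0, 1],
    3: [4, 5], 4: [3, 5], 5: [3, 4],
}
communities = slpa(adjacency)
```

<p>A node that keeps more than one label after thresholding belongs to several communities at once, which is what makes the result overlapping.</p>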
<p><img src="/assets/community-detection-example.png" alt="Community Detection Example" />
<em>An example of community detection.</em></p>
<h2 id="examples">Examples</h2>
<p>Now that we’ve seen how the method works let’s take a look at the
kinds of output that are possible using this approach. We’ll look
at examples from the text and image domains, where in the latter
we combine the power of supervised deep networks with our algorithm
to uncover visual concepts.</p>
<h3 id="text">Text</h3>
<p>The first example comes from the <a href="https://arxiv.org/abs/1312.3005" target="_blank">Billion word corpus</a> which is a standardized
dataset originating from crawling news sources. As such, the concepts
uncovered relate to the words that commonly appear in news sources.</p>
<p>Here we have used the probability of co-occurrence within a window of 2
(bigrams) between two words to create the correlation graph, then applied
our transformation to get the similarity graph from which we are then able
to uncover concepts.</p>
<p>The full graph is shown at the top of this page and the zoomable PDF file
is <a href="/assets/concepts-visualization.pdf" target="_blank">also available</a>. Here, we’ll zoom into a couple of interesting regions
of the graph to demonstrate the kinds of concepts we are able to discover.</p>
<h4 id="people">People</h4>
<p>In the first example we can see concepts of names being grouped together.
On the left we have names of political figures like Blair, Clinton, and
Obama. On the right we have names of athletes being grouped together, like
Favre, Williams, and Armstrong.</p>
<p><img src="/assets/names.png" alt="Names Concept" />
<em>Two uncovered concepts of names: politicians on the left, and athletes on the right.</em></p>
<h4 id="nationalities--groups">Nationalities & Groups</h4>
<p>In this second example we can see one group of nationalities uncovered,
which in turn connects (as we move to nationalities that commonly appear in the news like Palestinian, Kurdish, and Tibetan) to groupings of people, including religions
and organizations that are likely to appear in the news.</p>
<p><img src="/assets/nationalities.png" alt="Nationalities Concept" />
<em>These concepts group together nationalities on the left and other groups on the right.</em></p>
<h3 id="visual-concepts">Visual Concepts</h3>
<p>For the next example we combine the power of supervised neural networks with
our unsupervised learning algorithm to uncover concepts from raw images.
Deep neural nets can be trained on a large labeled dataset like ImageNet
to recognize thousands of objects in an image. We can then use the trained
network on millions of unlabeled images to generate approximate labels.
Using those labels, we can create a correlation graph and apply our algorithm
to uncover what we call “visual concepts”.</p>
<p>In this example we use the <a href="https://storage.googleapis.com/openimages/web/index.html" target="_blank">OpenImages dataset</a> released by Google that has annotations
for approximately 9 million unlabeled images from the Flickr image
hosting service. These images were annotated using a collection of
neural network models with 19,995 classes in total, with each image
being annotated with 8.4 labels on average.</p>
<h4 id="forming-the-correlation-graph">Forming the correlation graph.</h4>
<p>To create the correlation graph we create a clique (fully connected graph)
for all the labels that appear together in a single image. As objects
appear together in the real world these cliques are then connected to
other cliques, forming the full correlation graph.</p>
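<p>A minimal sketch of this clique construction is shown below. The label lists are made up for the example, and weighting each edge by the raw number of images in which two labels co-occur is an assumption; any of the correlation measures discussed earlier could be applied on top of these counts.</p>

```python
from collections import Counter
from itertools import combinations


def label_cooccurrence(image_labels):
    """Count, for every pair of labels, the images in which both appear.

    Connecting all labels of one image pairwise forms a clique; cliques
    from different images are joined through the labels they share.
    """
    edge_counts = Counter()
    for labels in image_labels:
        for a, b in combinations(sorted(set(labels)), 2):
            edge_counts[(a, b)] += 1
    return edge_counts


# Hypothetical annotations for two images.
images = [
    ["cowboy hat", "guitar"],
    ["cowboy hat", "person", "human face"],
]
edges = label_cooccurrence(images)
```

<p>In this toy input <em>guitar</em> and <em>person</em> never share an image, so they have no direct edge, but both are connected to <em>cowboy hat</em> and are therefore two-hop neighbors.</p>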
<p>We give an example in the following figure: Here we have two annotated
images of people in cowboy hats. In the top image, the neural network
has missed the <em>person</em> label, and the <em>human face</em> label, because it is
obstructed by the hat. When we create the correlation graph however,
the <em>guitar</em> label is connected to the <em>person</em> as a second degree
connection, so given enough data, we can deduce that <em>guitar</em> appears
often in the context of <em>person</em>.</p>
<p><img src="/assets/openimages-correlation.png" alt="OpenImages Correlation Graph" />
<em>Creating the correlation graph from the OpenImages annotations.</em></p>
<h4 id="example-concepts">Example concepts</h4>
<p>After creating the correlation graph as described above, we
can apply our transformation and create the visual concepts,
a couple of examples of which we give here. <strong>You can view the
full visual concept graph <a href="/assets/openimages_communities.pdf" target="_blank">here</a></strong>.</p>
<p>As expected from images taken from the Internet, we have concepts of cats and
dogs, with various breeds being grouped together, and another
concept with species of birds (in orange):</p>
<p><img src="/assets/cats-dogs.png" alt="Cats and dogs visual concept" />
<em>Animal concepts uncovered from real-world images.</em></p>
<p>In another example concept we have various sports being grouped
together, with a contact sports concept forming on the right.</p>
<p><img src="/assets/openimages-sports.png" alt="Sports visual concept" />
<em>Sports concepts uncovered from real-world images.</em></p>
<p>We can see then the potential of combining the two methods: It
provides a new lens into the world, extracted from real world images.
This can provide us with insights that are not present in text
corpora.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we described how we can calculate the similarities
between objects in any domain, provided that we have access to
a set of approximate correlations between them. Using the generated
similarity graph we can then group objects in coherent clusters
which we call “concepts”, allowing us to discover structure and
knowledge from large-scale unlabeled data.</p>
<p>In the paper we provide more details showing a quantitative evaluation
vs. word embedding methods and demonstrate the scalability of the
approach by training on the Google N-grams dataset (24.5 billion
records) in a matter of minutes.</p>
<h3 id="discussion---links-to-posts-on-social-media">Discussion - Links to posts on social media</h3>
<p>As mentioned before, <a href="https://github.com/sics-dna/concepts">the code is available on Github</a>.</p>
<p><a href="https://news.ycombinator.com/item?id=20339284">HackerNews</a></p>
Tue, 02 Jul 2019 00:00:00 +0000
http://tvas.me/research/2019/07/02/Finding-graph-similarities.html
http://tvas.me/research/2019/07/02/Finding-graph-similarities.htmldata-miningresearchgraphsunsupervisedlarge-scaleresearchA Brief History of Information Theory<blockquote>
<p>The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning.</p>
<p>-Claude Shannon (1948)</p>
</blockquote>
<p>As part of a course I’m taking we were tasked with writing the history
of a scientific discovery we find important. I knew immediately what
I wanted to write about: Claude Shannon and the development of information
theory. It’s a story that involves some of the greatest minds in
science, the Second World War, and the genius of one person.</p>
<p>So with the opportunity of Shannon’s birthday (April 30th) I decided to post
the essay here. Most of the material was pulled from James Gleick’s
excellent “The Information: A History, A Theory, A Flood”, which I recommend
to anyone even remotely connected to computer science.</p>
<h2 id="introduction">Introduction</h2>
<p>In the same year that Bell Labs announced the invention of an electronic semiconductor that could do everything a vacuum tube could more efficiently, Claude Shannon published in The Bell System Technical Journal his seminal work, “A Mathematical Theory of Communication”. The information age had just begun.</p>
<p>It’s hard to summarize the importance of Shannon’s work, as it laid the groundwork for what we now call information theory, with implications for practically every field in science today, from engineering, to biology, medicine, and the social sciences. It underlies everything about “the quantification, storage, and communication of information”. In it, Shannon proposed a unit for measuring information, which he termed “binary digits” or <em>bits</em> for short. He derived a formula for channel capacity that determines the absolute speed limit of any communication channel. From that limit he showed that it is possible to devise error correcting codes in a noisy channel that will overcome any level of noise. The contributions now underpin the work in compression and coding for error correction used in everything from mobile communication and media compression, to quantum computation.</p>
<p>Apart from its scientific importance, the story of the development of information theory is fascinating in itself, as it involves some of the greatest thinkers in computer science, and provides a look into the birth of “great ideas”.</p>
<h2 id="earlier-attempts">Earlier Attempts</h2>
<p>To understand how Shannon arrived at his theory, one has to look at the scientific context and earlier attempts to quantify information. In the 1830s, the advent of the telegraph led to the creation of coding schemes, like Morse code, that were used to communicate messages over large distances using electrical signals. In those early attempts, before any formal notion of coding theory was developed, Morse code demonstrated the principles of lossless compression, by using shorter messages for more common letters (one dot for “e”) than for less common ones (“j” is one dot followed by three dashes).</p>
<p>The earliest attempts to quantify information content were made in the 1920s by two Swedes working at Bell Labs, John B. Johnson and Harry Nyquist, whose work was later extended by Ralph Hartley. Johnson observed that the thermal noise in circuits followed the principles of Brownian motion, i.e. a stochastic process, which Nyquist later expanded upon. In Nyquist’s 1924 paper, “Certain Factors Affecting Telegraph Speed”, he derives a formula for the “speed of transmission of intelligence” that connects the speed of “intelligence” transmission to the bandwidth of the channel. It was Hartley who in his 1928 paper “Transmission of Information” first used the word “information” to describe the “stuff” of communication and provided a stricter definition for what had until that point been a vague term. Information could be words, sound or anything else. A transmission was defined as a sequence of n symbols from an alphabet with S symbols. In this sense information was a quantity that determined the ability of the receiver of a transmission to determine if a particular sequence was intended by the sender, regardless of the content of the message.</p>
<p>The amount of information is proportional to the length of the sequence, but depends on the number of possible symbols as well: a symbol from a two-symbol alphabet (0-1) carries less information than a letter of the English alphabet, which carries less information than a Chinese ideogram. The amount of information was defined by Hartley as:</p>
<script type="math/tex; mode=display">H = n\log{S}</script>
<p>where n is the length of the message and S the size of the alphabet. In a binary system S is two. The relationship between information and alphabet size is logarithmic: doubling the information carried by each symbol requires squaring the alphabet size.</p>
<p>Hartley assumed, however, that each symbol was equally probable; his measure had no way to account for messages of unequal probability.</p>
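<p>As a quick worked example of Hartley's formula, the sketch below evaluates it for a few alphabets. The function name and the <code>base</code> parameter are assumptions for illustration; with base 2 the result is in bits, and with base 10 it is in the decimal unit that Hartley's name was later attached to.</p>

```python
import math


def hartley_information(n, alphabet_size, base=2):
    """Hartley's measure H = n * log(S) for n symbols from S equally likely ones."""
    return n * math.log(alphabet_size, base)


# Ten symbols from a binary alphabet carry about 10 bits.
binary_message = hartley_information(10, 2)

# Ten letters from a 26-letter alphabet carry about 47 bits.
english_message = hartley_information(10, 26)

# Going from a 2-symbol to a 4-symbol alphabet doubles the information
# carried by a single symbol.
one_binary_symbol = hartley_information(1, 2)
one_quaternary_symbol = hartley_information(1, 4)
```

<p>Note that the formula treats every symbol as equally likely, which is exactly the limitation described above.</p>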
<h2 id="wartime">Wartime</h2>
<p>The development of information theory by Shannon was a result of many years of previous research and interactions with some of the brightest minds of his era. The foundations were set in the period of the Second World War, where Shannon met Alan Turing as part of their work on cryptanalysis.
Turing had been working on deciphering the messages produced by the German “Enigma” machines. These were machines that the Germans used to communicate encrypted messages. The basis of (symmetric) cryptography is the substitution of symbols in the alphabet with different ones, according to a <em>key</em> that sender and receiver share, and must remain secret from adversaries.</p>
<p>At its essence, as Shannon noted, “a secrecy system is almost identical with a noisy communication system”. The job of the cryptanalyst is to take the seemingly random stream of symbols that is the result of encryption, and try to detect <em>patterns</em> that correspond to the original language.</p>
<p><img src="/assets/shannon-crypto.jpg" alt="A general secrecy system [Shannon49b]" /></p>
<p><em>Schematic of a general secrecy system [Shannon49b]</em></p>
<p>In Shannon’s view, patterns equaled redundancy: If we are fairly certain that one symbol will appear after another, the second symbol can be considered redundant. For example, after the letter <em>t</em> the most likely letter to appear is <em>h</em>, making the information content of <em>h</em> after a <em>t</em> lower than that of all the other letters in the alphabet.
The process of deciphering a message then became an exercise in pattern matching and probability: As long as the cipher retained patterns that exhibited statistical regularity, it could be cracked. Shannon completed his report on “A Mathematical Theory of Cryptography” in 1945, but it would not be declassified until 1949. This work established the scientific principles of cryptography. He showed that a perfect cipher must use keys that are truly random, as long as the original message, and never re-used. The term “information theory” appears for the first time in this text.</p>
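<p>The conditions for a perfect cipher listed above are those of what is now called a one-time pad. The sketch below illustrates them with a byte-wise XOR, which is an assumption chosen for simplicity rather than the construction analysed in Shannon's report; the function name and the message are also made up for the example.</p>

```python
import secrets


def one_time_pad(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the matching key byte.

    Applying the function twice with the same key returns the original data.
    """
    if len(key) != len(data):
        raise ValueError("key must be exactly as long as the message")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"ATTACK AT DAWN"
# A fresh random key, as long as the message, to be used for this message only.
key = secrets.token_bytes(len(message))

ciphertext = one_time_pad(message, key)
recovered = one_time_pad(ciphertext, key)
```

<p>Because every key byte is independent and uniformly random, the ciphertext carries none of the statistical patterns of the original language, which is what leaves the cryptanalyst with nothing to exploit.</p>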
<p>Turing had been working in a similar vein at Bletchley Park, and had defined his own measure of information, the ban (now also called the hartley), which measured information using base-10 logarithms (instead of base two, as the bit does). During 1943 the two men met daily at Bell Labs where Turing was visiting to work on the X system, used to encrypt the communications between Franklin D. Roosevelt and Winston Churchill.
While they were not able to talk about their cryptanalysis work, they exchanged ideas on “thinking machines” and the “halting problem” that Turing had resolved before the war.
The basis of both men’s work in cryptanalysis had been the statistical nature of communication: its patterns, the resulting redundancy, and how those could be exploited to decipher messages. As Shannon himself noted, communication theory and cryptography were “so close together that you couldn’t separate them”, and he promised to develop these results “in a forthcoming memorandum on the transmission of information”.</p>
<h2 id="a-mathematical-theory-of-communication">A Mathematical Theory of Communication</h2>
<p>The memorandum was published in 1948, but did not see widespread adoption until the publication of the book with Warren Weaver, who provided an introduction for a more general scientific audience. The book included a small but significant change in the title, to “The Mathematical Theory of Communication”.</p>
<p>Until Shannon, information was not a strictly defined technical term, but was associated with its everyday, overloaded meaning. Shannon wanted to strip “meaning” from the definition, along with any semantic connotations, e.g. the language content of a message. He wrote that “the semantic aspects of communication are irrelevant to the engineering aspects”.</p>
<p>In his explanation of information Weaver notes that: “information is a measure of one’s freedom of choice when one selects a message”. Meaning does not enter here as the message could very well be “pure nonsense”. Information is closely associated with uncertainty: it can be thought of as a measure of surprise. Following up the word “White” with “House” is not surprising, and as such, “House” carries little information and is to some degree redundant. The same can be true for many letters of the English alphabet, given their context. For example “<em>U cn prbly rd ths</em>”. Shannon determined that English has a redundancy of about 50 percent.</p>
<p>Redundancy implies that certain sequences of symbols will be more likely than others, and that communication can be modeled as a <em>stochastic process</em>. The generation of messages is governed by a set of probabilities that depend on the state of the system and its history. In a simplified model, communication of language could be modeled as a Markov process, where the next symbol to appear depends solely on a fixed number of symbols that preceded it.</p>
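<p>To make the Markov idea concrete, here is a minimal Python sketch of a character-level model. The corpus string, the function names and the choice of a two-character context are all illustrative assumptions, not anything taken from Shannon’s paper: the model simply records which characters followed each two-character context, and generates text by sampling from those observations.</p>

```python
import random
from collections import defaultdict


def build_model(text, order=2):
    """Map each run of `order` characters to the characters seen right after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context].append(text[i + order])
    return model


def generate(model, seed, length, rng):
    """Extend `seed` one character at a time, sampling from observed followers."""
    order = len(seed)
    out = seed
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:  # context never seen in the corpus: stop early
            break
        out += rng.choice(followers)
    return out


# Toy corpus (hypothetical); a real experiment would use a large body of text.
corpus = "the theory of the thing is that the other theory then thinned"
model = build_model(corpus, order=2)
sample = generate(model, seed="th", length=40, rng=random.Random(0))
print(sample)
```

<p>Because repeated followers are stored as repeated list entries, sampling uniformly from the list reproduces the observed conditional frequencies. A longer context (a larger <code>order</code>) makes the output look more like the source text, which is the redundancy of the language showing through.</p>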
<p>Perhaps the most important contribution of Shannon’s work is the introduction of entropy. According to Weaver, in the context of communication “information is exactly that which is known in thermodynamics as <em>entropy</em>”. Entropy was originally introduced by Clausius in 1865, and later used by Boltzmann and Gibbs in their work on statistical mechanics. The entropy of a thermodynamic system is the logarithm of the number of states with significant probability of being occupied, multiplied by Boltzmann’s constant.</p>
<p>Entropy then measures the uncertainty in a system, and that is in essence its information content. Shannon had originally planned to call this “<em>uncertainty</em>” but was dissuaded by Von Neumann<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>:</p>
<blockquote>
<p>I thought of calling it “information”, but the word was overly used, so I decided to call it “uncertainty”. […] Von Neumann told me, “You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”</p>
</blockquote>
<p>The diagram used by Shannon for his communication model bears many similarities with the one used for his cryptography paper, which allows us to see a clear path in the development of the ideas:</p>
<p><img src="/assets/shannon_comm_channel.jpg" alt="A general communication system in [Shannon49a]" /></p>
<p><em>A general communication system in [Shannon49a]</em></p>
<p>Entropy connects information to the amount of choice available when constructing messages: it measures the uncertainty involved in the “<em>selection of an event or how uncertain we are of the outcome</em>”. When each of the $S$ symbols is equally likely to appear, one can use Hartley’s formula for a message of $n$ symbols:</p>
<script type="math/tex; mode=display">H = n \log{S}.</script>
<p>For the case where the probabilities of the symbols are given by
$p_1, \dots, p_S$, the expression that Shannon defined was:</p>
<script type="math/tex; mode=display">H = -\sum_{i=1}^{S} p_i \log_2 p_i</script>
<p>Shannon defined a unit of measure for this information as “binary digits, or <em>bits</em>”, which he credited to John Tukey. A bit represents the unit of information that is present in the flipping of a coin, i.e. an event with two possible outcomes of equal probability.</p>
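<p>As a small worked example, the following Python sketch evaluates Shannon’s expression for a few distributions. The function name and the example probabilities are illustrative choices. With equal probabilities the result reduces to $\log_2 S$ bits per symbol, so a fair coin carries exactly one bit.</p>

```python
import math


def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)), skipping zero-probability symbols."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


fair_coin = entropy_bits([0.5, 0.5])      # two equally likely outcomes
biased_coin = entropy_bits([0.9, 0.1])    # a mostly predictable coin
four_equal = entropy_bits([0.25] * 4)     # four equally likely symbols

print(fair_coin, biased_coin, four_equal)
```

<p>The biased coin comes out at well under one bit per flip: the more predictable the outcome, the less information each observation carries.</p>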
<p>From the concept of redundancy Shannon developed ways to communicate natural language more efficiently, by making use of the probabilities of different symbols. He further defined the channel capacity of a noisy channel (Shannon limit) and the possibility of perfect communication in a noisy channel through the noisy-channel coding theorem. Removing redundancy could increase the rate of transfer, which underpins the field of compression, while adding redundancy can enable correct communication in the presence of errors, the basis of coding theory.</p>
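<p>The simplest way to see how added redundancy protects a message is a repetition code: send every bit three times and let the receiver take a majority vote. The Python sketch below is only an illustration with made-up function names and parameters. Repetition codes are far less efficient than the codes the noisy-channel coding theorem shows to be possible, but they make the trade-off between rate and reliability visible.</p>

```python
import random


def encode(bits, r=3):
    """Repeat every bit r times (adds redundancy, lowers the rate to 1/r)."""
    return [b for b in bits for _ in range(r)]


def noisy_channel(bits, flip_prob, rng):
    """Flip each transmitted bit independently with probability flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]


def decode(bits, r=3):
    """Majority vote over each block of r received bits."""
    return [int(sum(bits[i:i + r]) > r // 2) for i in range(0, len(bits), r)]


message = [1, 0, 1, 1, 0, 0, 1, 0]
received = noisy_channel(encode(message), flip_prob=0.05, rng=random.Random(1))
recovered = decode(received)
print(message, recovered)
```

<p>With <code>r=3</code> the decoder corrects any single flipped bit within a block, at the cost of sending three times as many symbols.</p>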
<p>The fundamental connection made was that information and probabilities were intrinsically connected: an event carries information related to the probability of observing it, as defined by Shannon’s entropy.</p>
<p>As a more practical hardware example than the coin flip, Shannon noted that:</p>
<blockquote>
<p>A device with two stable positions, such as a flip-flop circuit, can store one bit of information. N such devices can store N bits, since the total number of possible states is $2^N$, and $\log_2 2^N = N$</p>
</blockquote>
<p>At around the same time, in the same building in which Shannon had developed his theory of information, the transistor had just been created.</p>
<h2 id="the-discovery-process">The discovery process</h2>
<p>In our description of the development of information theory by Shannon, we have neglected some important earlier work that better puts Shannon in context as one of the pillars of computer science. After graduating in 1936 with two bachelor’s degrees, in electrical engineering and mathematics, from the University of Michigan, Shannon joined MIT for his graduate studies. There, he worked on Vannevar Bush’s mechanical computer, the differential analyzer [Bush36]. He spent his time analyzing the machine’s circuits, and designing switching circuits based on Boole’s algebra. In his 1937 master’s thesis, “A Symbolic Analysis of Relay and Switching Circuits”, he proved that such circuits could be used to solve all problems that Boolean algebra could solve, which would later become the foundation of digital circuit design. The thesis has been called “possibly the most important, and the most noted, master’s thesis of the century” [Gardner87].</p>
<p>On Bush’s advice, Shannon switched from electrical engineering to mathematics for his PhD studies; Bush also suggested applying symbolic algebra to the problem of Mendelian genetics. Mendelian genetics was a branch of genetics that initially met with resistance in the 1900s. Even after R.A. Fisher’s seminal book “The Genetical Theory of Natural Selection” [Fisher30], Mendelian genetics was not clearly understood, particularly its basic components, the genes. Shannon’s PhD thesis, “An Algebra for Theoretical Genetics”, includes in the introduction the following passage:</p>
<blockquote>
<p>Although all parts of the Mendelian genetics theory have not been incontestably established, still it is possible for our purposes, to act as though they were, since the results obtained are known to be the same <em>as if</em> the simple representation which we give were true. Hereafter we shall speak therefore as though the genes actually exist and as though our simple representation of hereditary phenomena were really true, since so far as we are concerned, this might just as well be so.</p>
</blockquote>
<p>It is important to take a moment and appreciate the workings of Shannon’s approach here, and the liberty afforded to him to dig deep into a single idea, be it from the field of mathematics or a different area. Before his PhD dissertation, Shannon had published no articles; his only work of note was his unpublished (though seminal) master’s thesis. Although his PhD thesis was never published (it sits at 45 citations according to Google Scholar, compared to 105,460 for his information theory papers), he was given the chance to continue his research at the Institute for Advanced Study in Princeton, at that time home to giants such as Einstein and Kurt Gödel. There, he had the chance to discuss his ideas with mathematicians such as John Von Neumann, inventor of, among many other things, the computer architecture that underpins all modern computers.</p>
<p>It is safe to say that Shannon’s interaction with many of the greatest thinkers of his era, Turing, Von Neumann, Einstein, Gödel, helped shape him as a scientist and enabled the development of one of the “great ideas” of the previous century.</p>
<p>Another important consideration is that Shannon did his greatest work while employed at the labs of a private company, Bell Labs, first on wartime work assigned by the government and later on his own research. It is remarkable to think that a private company would fund mathematicians to perform basic research, but that is exactly what Bell Labs was doing during that era. The lab, now owned by Nokia, counts eight Nobel Prizes among its accomplishments [Bell18]. One may wonder if such a lab exists today. Private initiatives in machine learning like OpenAI may come close. Other efforts, like Facebook’s FAIR labs and Google’s DeepMind, are doing cutting-edge research, but always with a product focus: the results of the research are expected to be, in some way, useful for business purposes.</p>
<p>The drive of Shannon is also an interesting topic. As an outside observer, it’s perhaps easy to look back at Shannon’s cryptanalysis work and draw a line between that and information theory. However, Shannon’s personal process does not seem to reflect this. He is quoted in [Gleick11]:</p>
<blockquote>
<p>My mind wanders around, and I conceive of different things day and night. Like a science-fiction writer, I’m thinking, “What if it were like this?”</p>
</blockquote>
<p>As is often the case, then, curiosity mixed with a brilliant mind and unrelenting rigor led to the establishment of information theory. Indeed the needs were there: the inclusion of a noise source in Shannon’s theory reflects his engineer self, rooted in the practicalities of communicating over imperfect channels, which is what his company, Bell (AT&T) required. But the theoretical foundation and mathematical rigor are what elevate this theory beyond simple applied science. It’s the culmination of practical needs combined with the curiosity of a brilliant mind.</p>
<h2 id="ongoing-impact">Ongoing Impact</h2>
<p>Ever since the publication of the book by Shannon and Weaver, information theory has been applied to pretty much every area of science, to the point of Shannon calling out other researchers in his article “The Bandwagon” [Shannon56]. The Bandwagon is an interesting article in its own right, and a rare one: the founder of a discipline chastising his pupils, as it were, for taking his theory, doing away with rigor, and running away with it. It is the kind of writing and scientific leadership that is rare in the current climate, especially in areas full of hype such as machine learning and “AI”.</p>
<p>Nevertheless, information theory would prove critical in a variety of fields. These include neuroscience [Dimitrov11], biology [Adami04], economics [Maasoumi93], machine learning [MacKay03], cognitive science [Dretske81], linguistics and natural language processing [Harris91], and of course communication [Gallager68] and compression [Johnson03].</p>
<p>New applications are constantly being developed as well. We note one example that is relevant to our research where coding theory is being used to speed up distributed machine learning [Lee18].</p>
<h2 id="conclusions">Conclusions</h2>
<p>The assignment description asks “How the invention changed our view of the world?”. To this we would answer that before Shannon, information was a term with no clear meaning. Gleick uses the phrase “the ‘stuff’ of communication”, and in the past “intelligence” was used to convey the content of a message. One of the most important contributions of Shannon might actually be doing away with “meaning” and focusing on what remains: ones and zeros, the presence of structure or not, a quantification of uncertainty that has led to all the scientific advancement we see around us today.</p>
<p>With the ongoing development of machine learning, which in many ways has its roots set in information theory [MacKay03], and the promise of quantum computing [Nielsen10], the role of information theory will remain central in computer science and science in general. It is remarkable that one person can contribute so much to the development of science, but that is exactly what Shannon did.</p>
<h2 id="references">References</h2>
<p>[Adami04] Adami, C. (2004). Information theory in molecular biology. Physics of Life Reviews, 1(1), 3-22.</p>
<p>[Bell18] https://www.bell-labs.com/about/recognition/, Retrieved 2018-04-26</p>
<p>[Bush36] Bush, Vannevar (1936). Instrumental Analysis. Bulletin of the American Mathematical Society. 42 (10): 649–69. doi:10.1090/S0002-9904-1936-06390-1</p>
<p>[Dimitrov11] Dimitrov, A. G., Lazar, A. A., & Victor, J. D. (2011). Information theory in neuroscience. Journal of Computational Neuroscience, 30(1), 1-5.</p>
<p>[Dretske81] Dretske, F. (1981). Knowledge and the Flow of Information.</p>
<p>[Fisher30] Fisher, R. A. (1930). The Genetical Theory of Natural Selection. The Clarendon Press.</p>
<p>[Gallager68] Gallager, R. G. (1968). Information theory and reliable communication (Vol. 2). New York: Wiley.</p>
<p>[Gardner87] Gardner, Howard (1987). The Mind’s New Science: A History of the Cognitive Revolution. Basic Books. p. 144. ISBN 0-465-04635-5.</p>
<p>[Gleick11] Gleick, James (2011). The Information: A History, A Theory, A Flood. Pantheon Books.</p>
<p>[Harris91] Harris, Z. (1991). Theory of language and information: a mathematical approach.</p>
<p>[Johnson03] Johnson Jr, P. D., Harris, G. A., & Hankerson, D. C. (2003). Introduction to information theory and data compression. CRC press.</p>
<p>[Lee18] Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., & Ramchandran, K. (2018). Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3), 1514-1529.</p>
<p>[MacKay03] MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.</p>
<p>[MacKay52] MacKay, D. M., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bulletin of Mathematical Biophysics, 14, 127–135.</p>
<p>[Nielsen10] Nielsen, M. A., & Chuang, I. L. (2010). Quantum computation and quantum information. Cambridge university press.</p>
<p>[Maasoumi93] Maasoumi, E. (1993). A compendium to information theory in economics and econometrics. Econometric Reviews, 12(2), 137-181.</p>
<p>[Sloane98] Sloane, Neil (1998). Bibliography of Claude Elwood Shannon, http://neilsloane.com/doc/shannonbib.html, Retrieved 2018-04-26</p>
<p>[Shannon38] Shannon, Claude E. (1938). A Symbolic Analysis of Relay and Switching Circuits. Unpublished MS Thesis, Massachusetts Institute of Technology.</p>
<p>[Shannon40] Shannon, Claude E. (1940). An Algebra for Theoretical Genetics. Ph.D. Thesis, Massachusetts Institute of Technology.</p>
<p>[Shannon48] Shannon, Claude E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, Vol. 27 (July and October 1948)</p>
<p>[Shannon49a] Shannon, Claude E. (with Warren Weaver) The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL, 1949. The section by Shannon is essentially identical to the previous item.</p>
<p>[Shannon49b] Shannon, Claude E. (1949). Communication Theory of Secrecy Systems. Bell System Technical Journal, Vol. 28, pp. 656-715. “The material in this paper appeared originally in a confidential report ‘A Mathematical Theory of Cryptography’, dated Sept. 1, 1945, which has now been declassified.”</p>
<p>[Shannon56] Shannon, Claude E. (1956). The Bandwagon. IRE Transactions on Information Theory, 2(1), 3.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The validity of this story is challenged by [Gleick11] <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Mon, 30 Apr 2018 00:00:00 +0000
http://tvas.me/articles/2018/04/30/Information-Theory-History.html
information theory, research, history, articles
Highlights from KDD 2016
<p><img src="/assets/san_fransisco.jpg" alt="San Francisco" />
<em>Photo credit: <a href="https://www.flickr.com/photos/davidyuweb/15370470163/">David Yu</a></em></p>
<p>This August while interning at <a href="https://twitter.com/LifeAtPandora">Pandora</a> I had the opportunity to attend the <a href="http://www.kdd.org/kdd2016/">22nd ACM SIGKDD Conference on Knowledge
Discovery and Data Mining (KDD)</a>, held in San Francisco. My manager
<a href="https://twitter.com/ocelma">Oscar Celma</a> was cool enough to let me attend during my internship, and my research institute,
<a href="https://www.sics.se/">SICS</a> was cool enough to cover my conference fee, even though I was not presenting.</p>
<p>Like I did
with my post on <a href="/conferences/2015/11/23/ICDM-2015-Highlights.html">last year’s ICDM</a>, I’ll be providing summaries for
papers I found interesting from the sessions I attended, and provide links to the full text articles, organized by day
and session. One great thing that KDD did this year was to ask the authors to provide 2 minute Youtube videos describing
their papers, so for most of the linked papers you will find the video as well, providing a brief, accessible explanation.</p>
<p>This will be a long post so feel free to skip ahead to the sections that are most interesting to you.</p>
<ul id="markdown-toc">
<li><a href="#pre-conference-and-tutorial-days" id="markdown-toc-pre-conference-and-tutorial-days">Pre-conference and tutorial days</a></li>
<li><a href="#sunday-workshops-and-opening" id="markdown-toc-sunday-workshops-and-opening">Sunday: Workshops and Opening</a> <ul>
<li><a href="#mining-and-learning-with-graphs" id="markdown-toc-mining-and-learning-with-graphs">Mining and learning with graphs</a> <ul>
<li><a href="#keynotes" id="markdown-toc-keynotes">Keynotes</a></li>
<li><a href="#papers" id="markdown-toc-papers">Papers</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#monday-day-1-of-the-main-conference" id="markdown-toc-monday-day-1-of-the-main-conference">Monday: Day 1 of the main conference</a> <ul>
<li><a href="#graphs-and-rich-data-best-paper-award" id="markdown-toc-graphs-and-rich-data-best-paper-award">Graphs and Rich Data (Best paper award)</a></li>
<li><a href="#large-scale-data-mining" id="markdown-toc-large-scale-data-mining">Large-scale Data Mining</a></li>
<li><a href="#streams-and-temporal-evolution-i-best-student-paper-award" id="markdown-toc-streams-and-temporal-evolution-i-best-student-paper-award">Streams and temporal evolution I (Best student paper award)</a></li>
<li><a href="#streams-and-temporal-evolution-ii-theos-coolest-idea-of-kdd-award" id="markdown-toc-streams-and-temporal-evolution-ii-theos-coolest-idea-of-kdd-award">Streams and temporal evolution II (Theo’s coolest idea of KDD award)</a></li>
</ul>
</li>
<li><a href="#tuesday-day-2-of-the-main-conference" id="markdown-toc-tuesday-day-2-of-the-main-conference">Tuesday: Day 2 of the main conference</a> <ul>
<li><a href="#deep-learning-and-embedding" id="markdown-toc-deep-learning-and-embedding">Deep learning and embedding</a></li>
<li><a href="#recommender-systems" id="markdown-toc-recommender-systems">Recommender Systems</a></li>
<li><a href="#turing-lecture-whitfield-diffie" id="markdown-toc-turing-lecture-whitfield-diffie">Turing Lecture: Whitfield Diffie</a></li>
</ul>
</li>
<li><a href="#wednesday-last-day-of-the-conference" id="markdown-toc-wednesday-last-day-of-the-conference">Wednesday: Last day of the conference</a> <ul>
<li><a href="#supervised-learning" id="markdown-toc-supervised-learning">Supervised learning</a></li>
<li><a href="#optimization" id="markdown-toc-optimization">Optimization</a></li>
</ul>
</li>
<li><a href="#closing-thoughts" id="markdown-toc-closing-thoughts">Closing thoughts</a></li>
</ul>
<h2 id="pre-conference-and-tutorial-days">Pre-conference and tutorial days</h2>
<p>The Broadening Participation in Data Mining Workshop <a href="http://www.dataminingshop.com/web/">(BPDM)</a> was held on Friday and
Saturday. This
workshop aims to broaden the participation of minority and underrepresented groups in Data Mining, by providing guidance,
networking and other opportunities. I think it’s a great initiative and I hope to see other venues take up something
similar in ML in the vein of <a href="http://wimlworkshop.org/">WiML</a>.</p>
<p>Saturday was tutorials day at KDD, and there were a lot to choose from. I spent most of my time in the <a href="http://www.francois-petitjean.com/Research/KDD-2016-Tutorial/">Scalable Learning of
Graphical Models</a> and the <a href="https://sites.google.com/site/iotminingtutorial/">IoT Big Data Stream
Mining</a> tutorials. The deep learning tutorial from Ruslan Salakhutdinov
had to be cancelled (if anyone knows why let me know). The slides and video from his <a href="http://www.cs.toronto.edu/~rsalakhu/kdd.html">2014 tutorial at KDD</a> are available
however, and I can definitely recommend it as a good introduction to the field.</p>
<h2 id="sunday-workshops-and-opening">Sunday: Workshops and Opening</h2>
<p>Sunday was the workshop day and again it made me wish I could split myself to cover many in parallel, with topics like
large-scale sports analytics, learning from time series, deep learning from data mining, and stream mining. In the end,
I chose to spend most of my day in the workshop on <a href="http://www.mlgworkshop.org/2016/">“Mining and Learning with Graphs”</a>, which was closest
to my interests, and probably the best of the day.</p>
<h3 id="mining-and-learning-with-graphs">Mining and learning with graphs</h3>
<h4 id="keynotes">Keynotes</h4>
<p>The reason I think this was the best workshop of the day has a lot to do with the great keynote lineup as well as the
quality and variety of the work presented. Lars Backstrom, the director of engineering at Facebook, had the first keynote
where he talked about the challenges in creating a personalized newsfeed for over a billion users. He talked about how
both business decisions and probabilities calculated by many models (trees, deep learning, logistic regression) affect
the scoring of items in the News Feed that end up determining the ranking of the items users see.</p>
<p>He also mentioned
some work I was not previously familiar with, co-authored with Jon Kleinberg, on <a href="https://dl.acm.org/citation.cfm?id=2531642">discovering strong ties
in a social network</a>, like romantic relationships. For this they developed a new measure of tie strength, <em>dispersion</em>,
which measures “the extent to which two people’s mutual friends are not themselves well-connected”. Using this method
they were able to identify the spouse of male users correctly with 0.667 precision, which is impressive considering they
are only using the graph structure as information. The dispersion metric itself is an interesting concept and can also be used
for News Feed ranking.</p>
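<p>To give a feel for the idea, here is a deliberately simplified Python sketch: for a pair of users it counts the pairs of mutual friends that are not directly connected to each other. This is my own reduced version based only on the one-line description above, and the paper’s actual definition of dispersion is more involved. The toy graph and all names are hypothetical.</p>

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected graph as a dict mapping each node to its set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def simple_dispersion(adj, u, v):
    """Count pairs of mutual friends of u and v that are not linked to each other.

    Simplified stand-in for the dispersion measure, not the paper's definition.
    """
    mutual = adj[u] & adj[v]
    return sum(1 for s, t in combinations(sorted(mutual), 2) if t not in adj[s])


# Toy network: u and v share friends a, b, c; only a and b know each other.
edges = [("u", "v"), ("u", "a"), ("u", "b"), ("u", "c"),
         ("v", "a"), ("v", "b"), ("v", "c"), ("a", "b")]
adj = build_adjacency(edges)
score = simple_dispersion(adj, "u", "v")
print(score)
```

<p>A high score means the two people share friends drawn from otherwise separate circles, which is the intuition behind using it as a signal for strong ties.</p>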
<p>The rest of the keynotes were full of great ideas as well. <a href="https://www.cs.cmu.edu/~lakoglu/index.html">Leman Akoglu</a>, who
recently moved back to CMU after Stony Brook, gave a talk on <a href="https://arxiv.org/abs/1601.06711">detecting anomalous neighborhoods on attributed networks</a>.
<a href="http://www.sandia.gov/~tgkolda/">Tamara Kolda</a> talked about how to correctly model networks, and presented their <a href="https://arxiv.org/abs/1302.6636">BTER generative graph model</a> which is able to generate graphs that closely follow properties of
real-world graphs such as degree distribution and the triangle distribution, and a more recent extension of the model to
<a href="https://arxiv.org/abs/1607.08673">bi-partite graphs with community structure</a>.</p>
<p><a href="http://web.cs.ucla.edu/~yzsun/">Yizhou Sun</a> (now at UCLA) presented a <a href="http://web.cs.ucla.edu/~yzsun/papers/ijcai16_anomaly.pdf">probabilistic model for event likelihood</a>.
The key idea here is
that one can model an event, say a user purchasing an item, as an <a href="http://www.analytictech.com/networks/egonet.htm">ego-network</a>
(networks where we focus on one node, the “ego” node). The ego node would be the event, linked to heterogeneous entities,
like the item, date, and user. The entities are then embedded into a latent space by using their co-occurrence with other
events, and the embeddings can then be used for tasks like anomaly detection and content-based recommendation.</p>
<p><a href="https://www.cs.purdue.edu/homes/neville/">Jennifer Neville</a> presented methods for modelling distributions of networks,
which essentially allows one to <a href="http://www.kdd.org/kdd2016/subtopic/view/sampling-of-attributed-networks-from-hierarchical-generative-models">generate network samples</a>
of attributed hierarchical networks, which can then be used for inference and evaluation.
Finally, <a href="https://users.soe.ucsc.edu/~vishy/">S.V.N. Vishwanathan</a> had a disclaimer that his talk was not exactly on the
topic of graphs, but rather how to exploit the computational graph to achieve better parallelism in distributed machine
learning. He presented some recent work on <a href="https://arxiv.org/abs/1605.09499">distributed stochastic variational inference</a>
that only updates a small part of the model for each data point (compared to classic stochastic VI), to achieve both data and model
parallelism while maintaining high accuracy.</p>
<h4 id="papers">Papers</h4>
<p>A number of cool ideas were presented through the papers at the workshop:</p>
<ul>
<li>Cohen et al.
presented a new algorithm for <a href="http://www.mlgworkshop.org/2016/paper/MLG2016_paper_35.pdf">distance-based influence in networks</a>: a scalable
influence maximization algorithm that can be used with any decay function.</li>
<li>Qian et al. presented
a fun idea: <a href="http://www.mlgworkshop.org/2016/paper/MLG2016_paper_18.pdf">blinking graphs</a>. A graph that blinks is
one
where each edge and node exists with a probability equal to its weight. This is then used to provide a proximity measure
between nodes, that turns out to provide outputs that are more intuitive and are shown to be useful in tasks like link
prediction.</li>
<li>Giselle Zeno used the work presented earlier by J. Neville that allows for the generation of attributed
networks, to create different samples from a network distribution and <a href="http://www.mlgworkshop.org/2016/paper/MLG2016_paper_27.pdf">systematically study</a> how graph characteristics
affect the performance of collective classification algorithms.</li>
<li>Rossi et al. presented <a href="http://www.mlgworkshop.org/2016/paper/MLG2016_paper_33.pdf">Relational Similarity Machines</a>,
a model for relational learning that can handle large graphs and is flexible in terms of learning tasks, constraints and
domains.</li>
</ul>
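<p>The blinking graph idea from the list above lends itself to a quick simulation. The Python sketch below is one plausible reading, not necessarily the authors’ measure: it only lets edges blink (nodes always exist), samples many realisations of the graph, and reports how often two nodes end up connected. The edge weights, function names and number of trials are assumptions made for illustration.</p>

```python
import random


def sample_graph(weighted_edges, rng):
    """Keep each edge independently with probability equal to its weight."""
    adj = {}
    for a, b, w in weighted_edges:
        if rng.random() < w:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def reachable(adj, source, target):
    """Depth-first search for a path from source to target."""
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def blink_proximity(weighted_edges, source, target, trials=2000, seed=0):
    """Fraction of sampled graphs in which source and target are connected."""
    rng = random.Random(seed)
    hits = sum(
        reachable(sample_graph(weighted_edges, rng), source, target)
        for _ in range(trials)
    )
    return hits / trials


chain = [("a", "b", 0.5), ("b", "c", 0.5)]
proximity = blink_proximity(chain, "a", "c")
print(proximity)
```

<p>For the two-edge chain both edges must be present at once, so the estimate should land near 0.25. Adding a second, independent route between the nodes would raise it, which matches the intuition that well-connected nodes are “closer”.</p>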
<p>I would definitely encourage you to take a look at the <a href="http://www.mlgworkshop.org/2016/">workshop website</a> and check
out some more of the papers. Overall, great work from all the organisers, with a great intro from <a href="https://twitter.com/seanjtaylor">Sean Taylor</a>
from Facebook, and a diverse and engaging set of keynote speakers. I’ll be looking to submit here next year!</p>
<h2 id="monday-day-1-of-the-main-conference">Monday: Day 1 of the main conference</h2>
<h4 id="graphs-and-rich-data-best-paper-award">Graphs and Rich Data (Best paper award)</h4>
<p>I started Day 1 of the conference by attending the Graphs and Rich Data session. The first paper presented was the best
paper award winner, <a href="http://www.kdd.org/kdd2016/subtopic/view/fraudar-bounding-graph-fraud-in-the-face-of-camouflage">FRAUDAR: Bounding Graph Fraud in the Face of Camouflage</a>
from Christos Faloutsos’ lab at CMU. In the paper Hooi et al. describe a method for detecting fraud, in the form of
reviews on Amazon or followers on Twitter, in the presence of camouflage: when fraudulent users have taken over legitimate
user accounts. In the paper they propose a number of metrics to measure the suspiciousness of subsets of nodes in a bipartite
graph (e.g. users and products) and show how to compute them in linear time. They illustrate the effectiveness of the approach
by using a Twitter graph with ~42M users and ~1.5B edges and showing that their algorithm is able to detect a group of
fraudulent users (manually evaluated). I would have loved to see a quantitative comparison of accuracy against other
algorithms on the real-world data, but obtaining that would be hard
without a good ground-truth dataset, and I don’t know if any exist for graph-based fraud detection.</p>
<h4 id="large-scale-data-mining">Large-scale Data Mining</h4>
<p>I then moved on to the Large Scale Data Mining session, just in time to catch Daniel Ting deliver a smooth presentation
of his work on <a href="http://www.kdd.org/kdd2016/subtopic/view/towards-optimal-cardinality-estimation-of-unions-and-intersections-with-ske">cardinality estimation of unions and intersections with sketches</a>.
The cardinality of unions and intersections can be used for a number of applications, from calculating the Jaccard
similarity between two sets, to estimating the number of users accessing a particular website grouped by location or time,
and can be used for fundamental problems like estimating the size of a join. Here Daniel proposed two new kinds of estimators:
pseudo-likelihood based and re-weighted. The re-weighted estimators are perhaps the most interesting as they
can be generalized more easily (the work focuses on the MinCount sketch) and are easier to implement. I particularly like
the main idea behind them: Taking the weighted average of the several estimators after finding the most uncorrelated ones.
It is a rare thing to see a single author paper nowadays and Daniel hit it out of the park in terms
of quality and rigour with this one.</p>
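<p>For readers who have not met sketches before, here is a minimal Python sketch of the general idea, using a k-minimum-values style summary. It is a textbook-style baseline and not the estimators proposed in the paper; the hash choice, the value of <code>k</code> and the function names are my own assumptions. Each set is summarised by the <code>k</code> smallest hash values of its elements, and two summaries can be merged to obtain the summary of the union without revisiting the data.</p>

```python
import hashlib


def unit_hash(item):
    """Hash an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(items, k):
    """Keep only the k smallest distinct hash values of the set."""
    return sorted({unit_hash(x) for x in items})[:k]


def merge(sketch_a, sketch_b, k):
    """Sketch of the union: the k smallest values across both sketches."""
    return sorted(set(sketch_a) | set(sketch_b))[:k]


def estimate_cardinality(sketch, k):
    """Estimate the number of distinct items from the k-th smallest hash value."""
    if len(sketch) < k:  # fewer than k distinct items: the count is exact
        return float(len(sketch))
    return (k - 1) / sketch[k - 1]


k = 256
left = kmv_sketch(range(0, 6000), k)
right = kmv_sketch(range(4000, 10000), k)
union_estimate = estimate_cardinality(merge(left, right, k), k)
print(union_estimate)
```

<p>The true union here has 10,000 elements, and the estimate is only approximate, with an error that shrinks as <code>k</code> grows. A naive intersection estimate would follow from inclusion-exclusion on three such numbers, and getting better accuracy than that kind of baseline is the sort of problem the paper addresses.</p>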
<!---
Two other great papers from the session were [efficient anomaly detection in streaming graphs](http://www.kdd.org/kdd2016/subtopic/view/fast-memory-efficient-anomaly-detection-in-streaming-heterogeneous-graphs)
from Emaad Manzoor, and the
[XGBoost paper](http://www.kdd.org/kdd2016/subtopic/view/xgboost-a-scalable-tree-boosting-system) from Tianqi Chen. Emaad presented StreamSpot, an anomaly detection approach for streaming heterogeneous
graphs. He uses a string representation (shingles) for local substructure of graphs, and then uses a variation of SimHash named
StreamHash to compute similarities between the shingles. The algorithm is then initialized with benign clusters and the
anomalies then are detected for each cluster based on their deviation from the cluster's graph and medoid. My impression
is that the initialization process requiring a benign dataset limits the applicability of the algorithm somewhat, since
one can never be sure a dataset does not contain any anomalies, unless it is completely hand-labeled. Still the idea is
novel and I liked the translation of graphs to shingles along with the StreamHash algorithm.
--->
<p>In the same session Tianqi Chen presented <a href="http://www.kdd.org/kdd2016/subtopic/view/xgboost-a-scalable-tree-boosting-system">XGBoost</a>.
I assume <a href="https://xgboost.readthedocs.io">XGBoost</a> needs no introduction to most: it’s a gradient boosted tree algorithm
that has become wildly popular and has been used in the winning solution for 17 out of 29 Kaggle challenges during 2015.
Part of the appeal of XGBoost lies in its scalable nature and Tianqi has gone to great lengths
to ensure the algorithm is fast, easy to use and will run from anywhere
(C++, Python, R) and on anything (local and distributed). JVM-based solutions were also added recently, so it is possible
now to run XGBoost on top of <a href="http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html">Apache Flink</a>
or Spark for example. I hope to find the time this year to work on the Flink integration so that it becomes a great platform
to run on and boosts our efforts (pun intended) for <a href="https://ci.apache.org/projects/flink/flink-docs-master/dev/libs/ml/index.html">ML on Flink</a>.</p>
<h4 id="streams-and-temporal-evolution-i-best-student-paper-award">Streams and temporal evolution I (Best student paper award)</h4>
<p>Lorenzo De Stefani from Brown presented <a href="http://www.kdd.org/kdd2016/subtopic/view/triest-counting-local-and-global-triangles-in-fully-dynamic-streams-with-fi">TRIÈST</a>,
a new algorithm for counting local and global triangles in streaming graphs,
that supports additions and deletions of edges, with a fixed memory budget.
Counting triangles is a classic problem in network theory, as it can help with many tasks like spam detection, link
prediction etc. In many real world graphs, like a social network, edges are constantly being added and removed, so
maintaining an accurate count of the triangles in real-time is a challenging problem, especially in graphs with millions
of nodes and billions of edges.</p>
<p>What De Stefani et al. have done
here is present a one-pass algorithm based on reservoir sampling that provides unbiased estimates of the local and global
triangle counts with very little variance, that only requires the user to specify the amount of memory they want to use (an easy parameter to set).
Compared to previous approaches, TRIÈST does not require the user to set an edge sampling probability (a parameter that is
very hard to set without prior knowledge about the stream), and provides full
utilization of the available memory early on (vs. the end of the stream).
I find the use of reservoir sampling a great “oh why didn’t I think of that” idea here, and the value of the paper comes
from the rigorous analysis of the algorithm, and the extensive experimentation the authors have performed.
A very worthy recipient of the best student paper award.</p>
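<p>The reservoir sampling idea can be sketched roughly as follows. This is a simplified, insertion-only illustration that tracks only the global count, assuming the stream contains no duplicate edges; it is modelled on the basic variant of the approach and leaves out local counts and edge deletions. The class name and parameters are hypothetical, and the only setting the user provides is the memory budget.</p>

```python
import random
from collections import defaultdict


class TriangleEstimator:
    """Insertion-only global triangle estimate over a fixed-size edge reservoir."""

    def __init__(self, memory, seed=None):
        if memory < 3:
            raise ValueError("memory must hold at least 3 edges")
        self.memory = memory
        self.rng = random.Random(seed)
        self.sample = []                    # edges currently in the reservoir
        self.neighbours = defaultdict(set)  # adjacency of the sampled subgraph
        self.seen = 0                       # number of edges observed so far
        self.triangles = 0                  # triangles inside the sampled subgraph

    def _count_shared(self, u, v):
        return len(self.neighbours[u] & self.neighbours[v])

    def _insert(self, u, v):
        # Every shared neighbour closes one new triangle in the sample.
        self.triangles += self._count_shared(u, v)
        self.neighbours[u].add(v)
        self.neighbours[v].add(u)
        self.sample.append((u, v))

    def _evict_random(self):
        index = self.rng.randrange(len(self.sample))
        u, v = self.sample[index]
        self.sample[index] = self.sample[-1]
        self.sample.pop()
        self.neighbours[u].discard(v)
        self.neighbours[v].discard(u)
        # Triangles that used the evicted edge leave the sample with it.
        self.triangles -= self._count_shared(u, v)

    def add_edge(self, u, v):
        self.seen += 1
        if self.seen <= self.memory:
            self._insert(u, v)
        elif self.rng.random() < self.memory / self.seen:
            # Standard reservoir step: keep the new edge with probability M / t.
            self._evict_random()
            self._insert(u, v)

    def estimate(self):
        t, m = self.seen, self.memory
        # Scale by the inverse probability that all 3 edges of a triangle are sampled.
        scale = max(1.0, (t * (t - 1) * (t - 2)) / (m * (m - 1) * (m - 2)))
        return scale * self.triangles


estimator = TriangleEstimator(memory=200, seed=7)
for u in range(40):
    for v in range(u + 1, 40):
        estimator.add_edge(u, v)
print(estimator.estimate())
```

<p>Note that the reservoir is full as soon as the memory budget's worth of edges has arrived, which is the "full utilization of memory early on" property mentioned above.</p>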
<h4 id="streams-and-temporal-evolution-ii-theos-coolest-idea-of-kdd-award">Streams and temporal evolution II (Theo’s coolest idea of KDD award)</h4>
<p>Perhaps the most novel idea I saw at KDD came from the paper on <a href="http://www.kdd.org/kdd2016/subtopic/view/continuous-experience-aware-language-model">Continuous Experience-aware Language modelling</a>
by Mukherjee et al. from <a href="https://www.mpi-inf.mpg.de/home/">MPI</a>. The idea here is to try to model the experience of the user in reviewing items from a particular domain,
based on the evolution of their language model. Think of a beer reviewing site. Your first few reviews might contain sentences
like <em>“I like this beer”</em> or <em>“Great taste!”</em>. But as you gain more experience in tasting beer, the way you describe it
becomes more nuanced; you might write something like <em>“Fascinating malt and hoppiness, the aftertaste left something to be desired
however”</em>. So as you evolve as a beer drinker, so does the language you use to describe it.</p>
<p>Previous work in the field has
tried to model this evolution of experience on a discrete scale; the user’s experience remains either static or suddenly
jumps a level. In this work the authors have used a model used in financial analysis called <a href="https://en.wikipedia.org/wiki/Geometric_Brownian_motion">Geometric Brownian Motion</a>
to instead model the evolution of the user’s experience as a continuous-time stochastic process. The user’s language
model is also continuous, using a dynamic variant of LDA that employs variational methods like Kalman filtering for inference.
Using this model they are able to more accurately recommend items to users, (albeit using RMSE as a metric which
<a href="https://www.researchgate.net/profile/Paolo_Cremonesi/publication/221141030_Performance_of_recommender_algorithms_on_top-N_recommendation_tasks/links/55ef4ac808ae0af8ee1b1bd0.pdf">was shown to be problematic</a>)
and do some explorative analysis, like show the evolution of term usage with experience, or the top words used for experienced
and inexperienced users. Overall I really liked this idea of tracking the language model of users over time, and I
believe that continuous-time models can have beneficial effects in many other domains.</p>
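<p>For readers unfamiliar with Geometric Brownian Motion, a minimal simulation sketch follows. It only illustrates the stochastic process itself, using the standard exact discretisation; it is not the authors' experience model, and the function and parameter names are assumptions chosen for readability.</p>

```python
import math
import random


def simulate_gbm(start, drift, volatility, horizon, steps, seed=None):
    """Simulate one path of dS = drift * S dt + volatility * S dW."""
    rng = random.Random(seed)
    dt = horizon / steps
    path = [start]
    for _ in range(steps):
        shock = rng.gauss(0.0, 1.0)
        # Exact update for GBM: the log of the process takes Gaussian steps.
        log_step = (drift - 0.5 * volatility ** 2) * dt + volatility * math.sqrt(dt) * shock
        path.append(path[-1] * math.exp(log_step))
    return path


# A latent "experience" value that starts at 1.0 and evolves smoothly,
# with no discrete level jumps; it always stays positive.
experience = simulate_gbm(start=1.0, drift=0.3, volatility=0.2, horizon=5.0, steps=50, seed=42)
print(min(experience), max(experience))
```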
<h2 id="tuesday-day-2-of-the-main-conference">Tuesday: Day 2 of the main conference</h2>
<h4 id="deep-learning-and-embedding">Deep learning and embedding</h4>
<p>The paper describing the system that <a href="https://www.google.com/inbox/">Inbox by Gmail</a> uses for its <a href="http://www.kdd.org/kdd2016/subtopic/view/smart-reply-automated-response-suggestion-for-email">Smart Reply automated
response system</a>
was presented on Tuesday and it drew a lot of attention as expected. In case you are not familiar with the system, it
provides users of the Inbox app with short, automated replies. So the app writes the
emails for you instead of you having to type them out (yay machine learning!). This is obviously a highly challenging
task for many reasons. How can you tell if an email is a good candidate for a concise response? How does one generate
a response that is relevant to the incoming email? How does one provide enough variance in the responses generated?
And as is always the case at Google, how does one do this at scale?</p>
<p>Kannan et al. describe a system that uses a feed-forward
neural net to determine if an email is a good candidate to show automated responses for, an LSTM network for the actual
response text generation (sequence-to-sequence learning), a semi-supervised graph learning system to generate the set of
responses, and a simple strategy to ensure that the responses shown to the user are diverse in terms of intent. Although
the paper does not delve very deeply into each topic as they have to cover a complicated end-to-end learning system, it’s
still a great read as it provides insights into the scalability issues with deploying such models to millions of users,
as well as the challenge of optimizing for multiple objectives (accuracy, diversity, scalability) in a complex system.</p>
<h4 id="recommender-systems">Recommender Systems</h4>
<p>In this session chaired by <a href="https://twitter.com/xamat/">Xavier Amatriain</a>, <a href="http://www-users.cs.umn.edu/~christa/">Konstantina Christakopoulou</a> presented her paper
on <a href="http://www.kdd.org/kdd2016/subtopic/view/towards-conversational-recommender-systems">conversational recommender systems</a>.
The scenario here is common: You are in a new city, and would like to go out for dinner. If you had a local friend,
you’d have a small conversation: “Do you like Indian? What about Chinese? What’s your price-range?” and based on your
responses your knowledgeable friend would recommend a restaurant that they think you’d like. The challenges in creating
an automated system that does this are many: How does one find which dimensions are important (cuisine, price)?
Which questions should the system pose in order to arrive at a good recommendation as soon as possible?</p>
<p>Konstantina
addresses this problem as an online learning problem, where the system learns the preferences of the user online,
as well as the questions that allow it to provide good recommendations quickly. This is done by utilizing a bandit-based
approach that adapts the latent recommendation space to the user according to their interactions, and a number of
question selection strategies are tested, where it is shown that using a bandit-like approach to balance exploration
and exploitation in the latent question space is highly beneficial. I’m a fan of this work because I think it directly addresses cold-start
problems in recommender systems with an intuitive and human-centered approach, which includes knowledge we already have
about users and items through classic CF systems, with online learning and incorporating context.</p>
<h4 id="turing-lecture-whitfield-diffie">Turing Lecture: Whitfield Diffie</h4>
<p>Since this post is already too long I will not be covering the keynotes, however I could not skip mentioning
Whitfield Diffie’s Turing lecture, which was one of the highlights of the conference. Whitfield took us on a journey through the
history of cryptography, starting with the <a href="https://en.wikipedia.org/wiki/Caesar_cipher">Caesar cipher</a> all the way to
<a href="https://en.wikipedia.org/wiki/Homomorphic_encryption">Homomorphic encryption</a>, with many interesting tidbits and
historical anecdotes along the way.</p>
<p>I particularly liked his story on one of the things that motivated him to find a solution for the public key cryptography
problem that he is most famous for. Diffie explained that one of his friends had told him that at the NSA the phone lines
are secure, so Diffie thought that without having an encryption key negotiated before-hand you could pick up a phone and dial and your
communication would be safe from eavesdropping. Diffie assumed that they had somehow solved the problem of key distribution,
which motivated him to work even harder on the problem. The reality was that NSA was simply using shielded private lines
for their communication, but in Diffie’s own words, <strong><em>“Misunderstanding is the seed of invention”</em></strong>.</p>
<p>Another problem was presented by Diffie’s mentor <a href="https://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)">John McCarthy</a>
at a conference in Bordeaux where he talked about “buying and selling through home terminals”, what we call e-commerce today.
This problem led Diffie to think about digital signatures and proof of correctness, and the key idea of having a problem which you cannot
solve, but you can tell whether a solution provided is correct, which eventually led to public key cryptography.
I cannot provide a good enough summary of the talk here, but I would wholeheartedly recommend watching the whole thing, as
<a href="https://www.youtube.com/watch?v=CIZh0CHXGC4">it’s up on Youtube</a> and most definitely worth your time.</p>
<h2 id="wednesday-last-day-of-the-conference">Wednesday: Last day of the conference</h2>
<h4 id="supervised-learning">Supervised learning</h4>
<p>The highlight of the supervised learning session was the work from Marco Tulio Ribeiro et al. on <a href="http://www.kdd.org/kdd2016/subtopic/view/why-should-i-trust-you-explaining-the-predictions-of-any-classifier">explaining the predictions of
any classifier</a>.
The problem they are trying to solve is interpretability: Despite the wide adoption of machine learning, many of the more
complicated models, such as deep learning or random forests, are used as black boxes; explaining why they gave us a
particular answer is very hard. This makes it difficult to trust the system, and deploy it in a setting where it would aid
critical decision-making, like whether or not to administer a specific treatment to a patient.</p>
<p>The proposed system, <a href="https://github.com/marcotcr/lime">LIME</a>, which stands for Local Interpretable Model-agnostic Explanations, can explain the outputs of
any classifier. The way they achieve that is by fitting a simple, interpretable model like a linear regression, on generated
samples, weighted by their distance to the prediction point. What this essentially does is to approximate the complex
decision boundary locally using an interpretable model, from which we can then explain why a decision was made. For text
this could be the words that were present in a document that led us to classify it as spam or not, and in images it could
be superpixels that caused the classification of the image as containing a dog or cat. The system introduces a lot of
overhead of course, the authors report 10 minutes runtime to explain one output from InceptionNet on a laptop, but there is a lot of room
for improvement there. Interpretability is one of the main challenges for ML in the coming years and it’s always welcome
to see new exciting work on the subject.</p>
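<p>The local approximation step can be sketched in a few lines. This is a simplified illustration of the idea for numeric features, not the LIME library's API: it perturbs the point with Gaussian noise, weights samples with an exponential kernel on their distance, and fits a weighted linear model via the normal equations. All function names, the noise scale, and the kernel width are assumptions.</p>

```python
import math
import random


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs with Gaussian elimination and partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def explain_locally(predict, point, num_samples=500, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a distance-weighted linear surrogate to `predict` around `point`."""
    rng = random.Random(seed)
    size = len(point) + 1  # intercept plus one weight per feature
    xtwx = [[0.0] * size for _ in range(size)]
    xtwy = [0.0] * size
    for _ in range(num_samples):
        sample = [p + rng.gauss(0.0, scale) for p in point]
        squared_distance = sum((s - p) ** 2 for s, p in zip(sample, point))
        weight = math.exp(-squared_distance / kernel_width ** 2)
        row = [1.0] + sample
        target = predict(sample)
        for i in range(size):
            xtwy[i] += weight * row[i] * target
            for j in range(size):
                xtwx[i][j] += weight * row[i] * row[j]
    coefficients = solve_linear_system(xtwx, xtwy)
    return coefficients[0], coefficients[1:]


def black_box(x):
    # Stand-in for a complex model; any callable returning a number works.
    return math.tanh(2.0 * x[0]) + x[1] ** 2


intercept, weights = explain_locally(black_box, [0.0, 1.0])
print(weights)  # local importance of each feature near the chosen point
```

<p>The signs and magnitudes of the fitted weights are what one would read as the explanation: they describe how the black box behaves near this particular point, not globally.</p>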
<h4 id="optimization">Optimization</h4>
<p>One of the best papers of the conference was presented in one of the final sessions by Steffen Rendle, of factorization
machines fame, who is now at Google. He and his colleagues provide a solution for a problem of scale: How to train a
generalized linear model in a few hours for a trillion examples. For this they proposed <a href="http://www.kdd.org/kdd2016/subtopic/view/robust-large-scale-machine-learning-in-the-cloud">Scalable Coordinate Descent (SCD)</a>,
whose convergence behavior does not change regardless of how much it is scaled out or the computing environment.
They also described a distributed learning system designed for the cloud which takes into consideration the challenges
present in a cloud environment, like shared machines (VMs) that are pre-emptible (i.e. you could be kicked out after a
grace period), machine failures etc.</p>
<p>The problem with <a href="https://en.wikipedia.org/wiki/Coordinate_descent">coordinate descent</a> is that it’s a highly sequential
algorithm, and not a lot of work has to be done at each step, which makes parallelizing or distributing it challenging.
The key idea for the SCD algorithm is to make use of the structure and sparsity
present in the data. The features are partitioned into “pure” blocks, where each example has at most one non-zero entry within a block, and
updates are performed per block. The enforced independence in features is what enables the parallelism for the algorithm.
On the systems side they use a number of tricks to deal with having short barriers. Syncing the workers is challenging
in the presence of stragglers (machines that are slower) which they overcome by using dynamic load-balancing, caching,
and pre-fetching. Using this system and algorithm they are able to achieve near-linear scale out and speed up, going
from 20 billion examples to 1 trillion.</p>
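<p>Why purity helps can be illustrated with a toy least-squares example. The sketch below assumes a "pure" block means that every example has at most one non-zero feature inside the block; under that assumption, all coordinate updates in a block can be computed from the same residual, since no two features in the block touch the same example. It is a single-machine illustration with hypothetical function names, not the distributed system from the paper.</p>

```python
def is_pure_block(rows, block):
    """True if every example has at most one non-zero feature in `block`."""
    return all(sum(1 for j in block if row[j] != 0) <= 1 for row in rows)


def block_coordinate_step(rows, targets, weights, block):
    """Update all features of one block using a single shared residual."""
    residual = [
        y - sum(w * x for w, x in zip(weights, row))
        for row, y in zip(rows, targets)
    ]
    updated = list(weights)
    for j in block:  # these iterations are independent for a pure block
        norm = sum(row[j] ** 2 for row in rows)
        if norm > 0:
            gradient = sum(row[j] * r for row, r in zip(rows, residual))
            updated[j] += gradient / norm
    return updated


def fit(rows, targets, blocks, sweeps=300):
    """Cycle over the blocks, minimising squared error one block at a time."""
    weights = [0.0] * len(rows[0])
    for _ in range(sweeps):
        for block in blocks:
            weights = block_coordinate_step(rows, targets, weights, block)
    return weights


# Features 0 and 1 never co-occur in an example, so they form a pure block.
rows = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
]
targets = [2.5, -0.5, 3.0, -1.0]
blocks = [[0, 1], [2]]
assert all(is_pure_block(rows, block) for block in blocks)
print(fit(rows, targets, blocks))
```

<p>One-hot encoded categorical variables are a natural source of such blocks, since each example activates exactly one value per variable.</p>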
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>Overall the conference was well organized and a pleasure to attend. The venue was great, even though many of the sessions
had to be done in a different hotel across the street. The choice of having some of the keynotes over lunch however was criticised
by most attendees, as it was almost impossible to hear the speakers, and I’m sure it was not a good experience for them
either. The conference had a very heavy company presence as well, which I actually found welcome, as I had the opportunity
to talk to people from many interesting companies who are doing great research work like Microsoft, Facebook, Amazon etc.</p>
<p>If I have one gripe with the conference, it is the insistence <em>“per KDD tradition”</em> on not performing double-blind or open reviews, even though
<a href="https://hub.wiley.com/community/exchanges/discover/blog/2016/06/27/what-are-the-current-attitudes-toward-peer-review-publishing-research-consortium-survey-results">the research community</a>
is moving towards that (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4191873/">original paper</a>),
so kudos to ICLR for open and ICDM for triple-blind reviews.</p>
<p>In closing, KDD was a great conference and I’m glad I was given the opportunity to attend. I met a bunch of great new people and
reconnected with old friends, had interesting discussions with many companies, and the research presented filled me with
new ideas to take home and expand.</p>
<p>Looking forward to next year!</p>
Thu, 01 Sep 2016 00:00:00 +0000
http://tvas.me/conferences/2016/09/01/KDD-2016-Highlights.html
http://tvas.me/conferences/2016/09/01/KDD-2016-Highlights.html
data-mining research conference conferences
Highlights from ICDM 2015
<p><img src="/assets/ac.jpg" alt="Atlantic City" /></p>
<p>This past week I had the opportunity to attend the <a href="http://icdm2015.stonybrook.edu/">15th IEEE International Conference on
Data Mining</a>, held in Atlantic City, NJ, November 14-17, 2015.
This was the first scientific conference I attended and we had a chance to present our
work on <a href="/assets/concepts-icdm.pdf">scalable graph similarity calculation</a>. In this post I will try to point out some
of the more interesting work from the conference (based on some of the sessions I attended)
and summarize the keynotes. I’ve included links to the full-text papers whenever I could
find them.</p>
<ul id="markdown-toc">
<li><a href="#highlights-from-the-sessions-i-attended" id="markdown-toc-highlights-from-the-sessions-i-attended">Highlights from the sessions I attended:</a> <ul>
<li><a href="#day-1" id="markdown-toc-day-1">Day 1</a> <ul>
<li><a href="#applications-1" id="markdown-toc-applications-1">Applications 1</a></li>
<li><a href="#mining-social-networks-1" id="markdown-toc-mining-social-networks-1">Mining Social Networks 1</a></li>
<li><a href="#big-data-2" id="markdown-toc-big-data-2">Big Data 2</a></li>
</ul>
</li>
<li><a href="#day-2" id="markdown-toc-day-2">Day 2</a> <ul>
<li><a href="#network-mining-1" id="markdown-toc-network-mining-1">Network Mining 1</a></li>
</ul>
</li>
<li><a href="#day-3" id="markdown-toc-day-3">Day 3</a> <ul>
<li><a href="#graph-mining" id="markdown-toc-graph-mining">Graph Mining</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#keynotes" id="markdown-toc-keynotes">Keynotes</a> <ul>
<li><a href="#robert-f-engle" id="markdown-toc-robert-f-engle">Robert F. Engle</a></li>
<li><a href="#michael-i-jordan" id="markdown-toc-michael-i-jordan">Michael I. Jordan</a></li>
<li><a href="#lada-adamic" id="markdown-toc-lada-adamic">Lada Adamic</a></li>
</ul>
</li>
<li><a href="#venueorganisation" id="markdown-toc-venueorganisation">Venue/Organisation</a></li>
</ul>
<h2 id="highlights-from-the-sessions-i-attended">Highlights from the sessions I attended:</h2>
<h3 id="day-1">Day 1</h3>
<h4 id="applications-1">Applications 1</h4>
<p>The first session I attended was named “Applications 1” and it included
a number of applications (surprise!) on a diverse set of domains. The session started
with some very solid work on <a href="http://arxiv.org/abs/1406.0516">“Modeling Adoption and Usage of Competing Products”</a>,
where the authors create a model that can provide insight into the factors that drive
product adoption and frequency of use, which they evaluate at a large scale by looking
into the use of URL shorteners on Twitter.
In <em>“Mining Indecisiveness in Customer Behaviors”</em> the authors investigated how they could
reduce indecisiveness in users interacting with an online retail platform, by making use
of information about competing products. The end goal is to increase conversion of course,
but it would be interesting to see how such a system could be implemented in a way that
is fair to all retailers/brands.</p>
<p>Two short papers I should point out were <a href="http://medianetlab.ee.ucla.edu/papers/Yannick_ICDM.pdf">“Personalized Grade Prediction: A Data Mining
Approach”</a> and
<a href="http://www.cc.gatech.edu/~iperros3/publications/icdm15.pdf">“Sparse Hierarchical Tucker and its Application to Healthcare”</a>.
The first
paper deals with personalized early grade prediction for students using only assignment/homework data,
that could allow course instructors to identify students who might have
trouble in a course early on, most importantly using only their data from the specific
course, thereby avoiding any potential privacy pitfalls. In the second work, a new tensor
factorization method is proposed, that is 18x more accurate and 7.5x faster than the current
state-of-the-art. While the application presented here is limited to healthcare, I hope
that it can prove a starting point for a more generalized approach, as tensor factorization
problems can surface in a wide variety of domains so solving their scalability problems
could have an effect on a wide range of fields.</p>
<h4 id="mining-social-networks-1">Mining Social Networks 1</h4>
<p>The next session I attended was “Mining Social Networks 1”, where the best student paper,
<a href="http://arxiv.org/abs/1505.07193">“From Micro to Macro: Uncovering and Predicting Information Cascading Process with
Behavioral Dynamics”</a> was presented among others.
Cascade prediction has applications
in areas like viral marketing and epidemic prevention, so it’s a problem of great interest
in the industry as well as society. The work presented here utilized a data-driven approach
to create a “Networked Weibull Regression” model, and use it for predicting cascades
as they occur: micro-level behavioral dynamics are modelled first and then aggregated to predict
the macro cascading processes.</p>
<p>They evaluate their method on a dataset from Weibo, one of the largest Twitter-style
services in China, and show that their method handily beats the current state of the art.
It’s a well written work that deserves the praise it got, however I would definitely be interested
in seeing it applied and evaluated on a different publicly available dataset, (although they are
hard to come by in this domain) and an extension of the method that predicts the cascades as they
happen in real-time (shameless plug: Use <a href="https://flink.apache.org">Apache Flink</a> for your real-time processing needs!).</p>
<h4 id="big-data-2">Big Data 2</h4>
<p>The last session I attended on Sunday was “Big Data 2”. The two regular papers from that
session were perhaps application specific but nonetheless provided some valuable insights.
The first, “Accelerating Exact Similarity Search on CPU-GPU Systems” dealt with the exact
kNN problem, and how it can be efficiently accelerated on GPU-equipped systems. Although
approximate kNN methods like LSH seem to be the standard in industry currently, the
authors mentioned that the techniques presented could be used in that context as well,
so this is definitely something to look forward to. The second regular paper <a href="http://arxiv.org/abs/1508.07678">“Online Model
Evaluation in a Large-Scale Computational Advertising Platform”</a>
provided a rare look into how a large advertising platform like Turn evaluates its bid prediction models online,
something that a previous related paper from Google,
<a href="https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf">“Ad Click Prediction: a View from the Trenches”</a>,
was missing.</p>
<h3 id="day-2">Day 2</h3>
<h4 id="network-mining-1">Network Mining 1</h4>
<p>An interesting idea presented in “Network Mining 1” was <a href="http://arxiv.org/abs/1509.02533">“Absorbing random-walk centrality”</a>,
where the authors presented a way to identify <em>teams</em> of central nodes in a graph. An application
for this measure could be for example: given a subgraph of Twitter that we know contains
a number of accounts about politics, find the important nodes that represent a diverse set
of political views. The authors show that this is an NP-hard problem, and the greedy algorithm
presented has a complexity of O(n^3), where n is the number of nodes, which makes it
inapplicable for large graphs. Personalized PageRank could be used as a heuristic, however, which
is more computationally efficient.</p>
<h3 id="day-3">Day 3</h3>
<h4 id="graph-mining">Graph Mining</h4>
<p>We presented our work, <a href="/assets/concepts-icdm.pdf">“Knowing an Object by the Company It Keeps: A Domain-Agnostic Scheme for Similarity Discovery”</a>,
in the “Graph Mining” session. Our main contribution is a method that allows us to
transform a <em>correlation graph</em> to a <em>similarity graph</em>, where connected items should be <em>exchangeable</em> in some sense.</p>
<p>As an example, think of a correlation graph where we have words as nodes and edges between words are created by taking the conditional probability of a word appearing
within <em>n</em> words of another one. This can be easily extracted from a text corpus and pairs like (<em>Rooney, goal</em>)
could have a high correlation score. What we want to do with our algorithm is to discover <em>similarities</em>
between items that go beyond simple correlation, and show characteristics such as exchangeability.
For example a pair (<em>Rooney</em>, <em>Ronaldo</em>) could be a good pair in this sense, as you could replace
Rooney with Ronaldo in a sentence and it should still make sense. The approach we presented is domain
agnostic, and as such is not limited to just text; we applied our algorithm on graphs of music artists and <a href="https://en.wikipedia.org/wiki/Genetic_code">codons</a>
as well. I will soon write up a more extensive summary of our work, including code and examples.
For now enjoy this <a href="/assets/concepts-visualization.pdf">nice visualization</a>
of word relations and clusters that can be created using our method.
<em>Note:</em> better to download and view in a PDF viewer which has <em>lots</em> of zoom.</p>
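<p>Building such a word correlation graph from text can be sketched as below. This is one plausible reading of "conditional probability of a word appearing within <em>n</em> words of another one", namely the fraction of occurrences of a word that have the other word inside the window; the exact definition used in the paper may differ, and the function name is made up for illustration.</p>

```python
from collections import Counter, defaultdict


def cooccurrence_probabilities(tokens, window):
    """Estimate P(other appears within `window` words | word occurs)."""
    occurrences = Counter(tokens)
    near = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Count each neighbouring word once per occurrence of `word`.
        neighbours = set(tokens[lo:i]) | set(tokens[i + 1:hi])
        for other in neighbours:
            if other != word:
                near[word][other] += 1
    return {
        word: {other: count / occurrences[word] for other, count in counts.items()}
        for word, counts in near.items()
    }


tokens = "rooney scored a goal rooney missed a goal".split()
graph = cooccurrence_probabilities(tokens, window=3)
print(graph["rooney"]["goal"])    # directed edge weight rooney -> goal
print(graph["rooney"]["missed"])
```

<p>Each entry <code>graph[a][b]</code> would be the weight of a directed edge from <code>a</code> to <code>b</code> in the correlation graph; turning that into a similarity graph is the separate step that the method described above addresses.</p>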
<p>Some impressive work for me from that session was <a href="http://arxiv.org/abs/1506.04322">“Efficient Graphlet Counting for Large Networks”</a>.
<a href="https://en.wikipedia.org/wiki/Graphlets">Graphlets</a> are small, connected, induced (i.e. the edges
in the graphlet correspond to those in the large graph) subgraphs of a large network, and can be used
for things like graph comparison and classification. The method presented here uses already proven
combinatorial arguments to reduce the number of graphlets one has to count for every edge, and
obtains the remaining counts in constant time. In a large study of over 300 networks the algorithm
is shown to be on average 460 times faster than the current state-of-the-art, allowing the largest
graphlet computations to date. I am always happy when I see established results used in a clever
way to solve new problems, especially when the results are so impressive.</p>
<h2 id="keynotes">Keynotes</h2>
<h4 id="robert-f-engle">Robert F. Engle</h4>
<p>ICDM featured 3 keynotes this year. The first one was given by Robert F. Engle, winner of the
Nobel Memorial Prize in Economic Sciences in 2003. He presented a summary of some of his seminal
work on <a href="https://en.wikipedia.org/wiki/Autoregressive_conditional_heteroskedasticity">ARCH</a>,
and presented some more recent work on financial volatility measurement through the <a href="http://vlab.stern.nyu.edu/">V-lab</a>
project. This keynote was quite math-heavy as a result and I think many people in the audience did
not find it that interesting/relevant to their work, estimated from the proportion of people looking at their
laptops around me.</p>
<h4 id="michael-i-jordan">Michael I. Jordan</h4>
<p>The second keynote, and the most interesting for me, was given by M.I. Jordan, with the title “On
Computational Thinking, Inferential Thinking and ‘Big Data’”, a talk he has delivered in a couple
of other venues before, so (some of) the <a href="http://www.stat.harvard.edu/NRC2014/MichaelJordan.pdf">slides are available</a>.
His keynote revolved around some of what he identified as central demands for learning and inference
and the tradeoffs between them; namely error bounds (“inferential quality”),
scaling/runtime/communication, and privacy. He identified the problem of the lack of an interface
between statistical theory and computational theory which currently have an “oil/water”
relationship. In statistics, more data points are great as they reduce uncertainty, but can be a cause
of problems in terms of computation as we usually measure complexity in the order of data points. The approach
suggested is to “treat computation, communication, and privacy as constraints on statistical
risk”.</p>
<p>In terms of privacy he mentioned how our inference problem basically has 3 components: the
population P, which we try to approximate with our sample S, which we then modify
according to our privacy concerns to get our final dataset Q, which we can query.
In dealing with privacy issues he mentioned <a href="https://en.wikipedia.org/wiki/Differential_privacy">differential privacy</a>
as a good way to quantify the privacy loss for a query. This should allow us, given some privacy
concerns, to estimate the amount of data we need to achieve the same level of risk in our queries.</p>
<p>For the tradeoff between inferential quality and communication, common in distributed
learning settings, he proposed the use of a channel with certain communication constraints, as a
way to impose bitrate constraints. The proposed solution involves minimax risk with B-bounded
communication, which allows for optimal estimation under a communication constraint (see
<a href="http://www.cs.berkeley.edu/~yuczhang/files/nips13_communication.pdf">here</a>
for the NIPS paper on the subject).</p>
<p>The last part of his talk was new (i.e. it is not in the linked slides) and concerned the tradeoff between
inference quality and computation resources. This part focused on efficient distributed bootstrap
processes, with the thesis being that such processes can be used to generate multiple realizations
of samples from the population that allow for the efficient estimation of parameters. The problem
with a frequentist approach in this case is that the communication cost of each resampling can be
prohibitively high for large datasets, e.g. ~632GB for a 1TB dataset
(see <a href="http://www.stat.washington.edu/courses/stat527/s13/readings/EfronTibshirani_JASA_1997.pdf">here</a> why).
The proposed solution here is the <a href="http://web.cs.ucla.edu/~ameet/blb_icml2012_final.pdf">“Bag of Little Bootstraps”</a>,
in which one bootstraps many small subsets of the data and performs multiple computations on these
small samples. The results from these computations are then averaged to obtain an estimate of the
parameters of the population.
This means that in a distributed setting we would use only small subsets of the data to perform
our computation; in the 1TB example above, the resample size could for example be 4GB instead of
the 632GB required by the bootstrap.
Another interesting point made was that obtaining a confidence interval on a parameter instead of
a point estimate, as is usually done now, can not only be more useful, but could be done more
efficiently as well.</p>
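<p>The figure of roughly 632GB for a 1TB dataset follows from a short calculation: when drawing <em>n</em> points with replacement from <em>n</em> points, each point is missed with probability (1 - 1/n)^n, which tends to 1/e, so a resample contains about 63.2% distinct points. A quick check, with a hypothetical helper name:</p>

```python
import math


def expected_distinct_fraction(n):
    """Expected share of distinct points in a bootstrap resample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


for n in (10, 1_000, 1_000_000):
    print(n, expected_distinct_fraction(n))

# Limit for large n: 1 - 1/e
print(1.0 - math.exp(-1.0))
```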
<p>In closing Jordan noted that there are many remaining conceptual and mathematical challenges in the
problem of ‘Big Data’ and facing these will require a “rapprochement between computer science and
statistics” which would reshape both disciplines and might take decades to complete.</p>
<h4 id="lada-adamic">Lada Adamic</h4>
<p>Unfortunately I had to skip Lada Adamic’s keynote, so I would really appreciate if someone has a
summary that I can add here.</p>
<h2 id="venueorganisation">Venue/Organisation</h2>
<p>The conference organization was mostly smooth and organizers and volunteers deserve a lot of credit for
the way that everything worked out. Sessions generally began and ended on time, the workshops and
tutorials were well organized and useful, and I particularly enjoyed the PhD forum.
One thing that I found unusual was the fact that even though the proceedings were handed out in
digital form (kudos for that) attendees had to choose between the conference or the workshop
proceedings. My guess is this happened because of licensing costs, but it would have been nice to have
access to both.</p>
<p>The conference this year took place at Bally’s casino/hotel in Atlantic City.
It was hard to avoid the grumbling from many of the participants for the choice of venue, especially
when one puts it next to last year’s venue in <a href="/assets/shenzen.jpg">Shenzhen</a> or next year in
<a href="/assets/barcelona.jpg">Barcelona</a>.</p>
<p>Truth be told, the venue was underwhelming, but I guess it was mostly the choice of Atlantic City
that had people irked; there was very little to do and see in the city unless you wanted to gamble.
Still, I was fortunate to meet a lot of cool people at the conference, so I’m looking forward to
attending next year’s edition in Barcelona!</p>
<p>There was a lot of other great work at the conference as well, but these were the presentations
I found most memorable.
So that’s all for now, if I’ve made a terrible mistake when describing your work, shoot me an <a href="mailto:tvas@sics.se">email</a>
and I’ll fix it ASAP.</p>
Mon, 23 Nov 2015 00:00:00 +0000
http://tvas.me/conferences/2015/11/23/ICDM-2015-Highlights.html
http://tvas.me/conferences/2015/11/23/ICDM-2015-Highlights.html
data-mining research conferences