{"id":487934,"date":"2018-05-30T10:53:33","date_gmt":"2018-05-30T17:53:33","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=487934"},"modified":"2018-06-07T13:51:15","modified_gmt":"2018-06-07T20:51:15","slug":"optimizing-barnes-hut-t-sne","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/optimizing-barnes-hut-t-sne\/","title":{"rendered":"Optimizing Barnes-Hut t-SNE"},"content":{"rendered":"<div id=\"attachment_488000\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-488000\" class=\"wp-image-488000 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/bhtsne.gif\" alt=\"\" width=\"640\" height=\"360\" \/><p id=\"caption-attachment-488000\" class=\"wp-caption-text\">Barnes-Hut optimization highlighted as points converge to their t-SNE positions<\/p><\/div>\n<p>Ten years ago, while writing a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/tavianator\/carom\">physics engine<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, I learned about the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Barnes%E2%80%93Hut_simulation\">Barnes-Hut algorithm<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> for the gravitational <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/N-body_problem\">n-body problem<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. 
Normally, computing the Newtonian gravitational forces between n bodies requires \\(O(n^2)\\) evaluations of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Newton%27s_law_of_universal_gravitation\">Newton&#8217;s law of universal gravitation<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, as every body exerts a force on every other body in the system. Barnes-Hut is an approximation that brings that complexity down to \\(O(n \\log n)\\), by treating clusters of faraway objects as single particles. An <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Octree\">octree<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is used to subdivide the space, and when a certain threshold is met, an internal node may be treated as a point particle at its children&#8217;s center of mass, rather than recursing into it.<\/p>\n<p>I didn&#8217;t use the Barnes-Hut algorithm at the time. But years later I was surprised to see it being used in the seemingly unrelated context of deep learning. <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/T-distributed_stochastic_neighbor_embedding\">t-SNE<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is a dimensionality reduction technique often used for visualizing high-dimensional data like word embeddings. It attempts to map points that are close in the high-dimensional space to points that are still close in a low-dimensional space, preserving clusters in the high-dimensional representation. 
It turns out that t-SNE can be accelerated by applying the same Barnes-Hut technique as gravitational simulation, since they both involve an inverse square law computed between pairs of points. In this post, I&#8217;m going to take the Barnes-Hut t-SNE <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/T-distributed_stochastic_neighbor_embedding\">implementation<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> by the original inventor of t-SNE, Laurens van der Maaten (with Geoffrey Hinton), and show how its performance can further be increased while keeping the algorithm the same.<\/p>\n<p>The first step is to check out and build the code:<\/p>\n<blockquote>\n<pre>$ git clone git@github.com:lvdmaaten\/bhtsne.git\r\nCloning into 'bhtsne'...\r\nremote: Counting objects: 205, done.\r\nremote: Total 205 (delta 0), reused 0 (delta 0), pack-reused 205\r\nReceiving objects: 100% (205\/205), 83.59 KiB | 2.70 MiB\/s, done.\r\nResolving deltas: 100% (110\/110), done.\r\n$ cd bhtsne\r\n$ g++ -O2 -g -Wall -Wno-sign-compare sptree.cpp tsne.cpp tsne_main.cpp -o bh_tsne\r\n<\/pre>\n<\/blockquote>\n<p>To get an idea of its current performance, let\u2019s test it on some real-world data, the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/nlp.stanford.edu\/projects\/glove\/\">GloVe<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> word vectors:<\/p>\n<blockquote>\n<pre>$ curl -LO http:\/\/nlp.stanford.edu\/data\/glove.6B.zip\r\n\u00a0 % Total\u00a0\u00a0\u00a0 % Received % Xferd\u00a0 Average Speed\u00a0\u00a0 Time\u00a0\u00a0\u00a0 Time\u00a0\u00a0\u00a0\u00a0 Time\u00a0 
Current\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Dload\u00a0 Upload\u00a0\u00a0 Total\u00a0\u00a0 Spent\u00a0\u00a0\u00a0 Left\u00a0 Speed\r\n\u00a0 0\u00a0\u00a0 308\u00a0\u00a0\u00a0 0\u00a0\u00a0\u00a0\u00a0 0\u00a0\u00a0\u00a0 0\u00a0\u00a0\u00a0\u00a0 0\u00a0\u00a0\u00a0\u00a0\u00a0 0\u00a0\u00a0\u00a0\u00a0\u00a0 0 --:--:-- --:--:-- --:--:--\u00a0\u00a0\u00a0\u00a0 0\r\n100\u00a0 822M\u00a0 100\u00a0 822M\u00a0\u00a0\u00a0 0\u00a0\u00a0\u00a0\u00a0 0\u00a0 5661k\u00a0\u00a0\u00a0\u00a0\u00a0 0\u00a0 0:02:28\u00a0 0:02:28 --:--:-- 3072k\r\n$ unzip glove.6B.zip\r\nArchive:\u00a0 glove.6B.zip\r\n\u00a0 inflating: glove.6B.50d.txt\r\n\u00a0 inflating: glove.6B.100d.txt\r\n\u00a0 inflating: glove.6B.200d.txt\r\n\u00a0 inflating: glove.6B.300d.txt\r\n<\/pre>\n<\/blockquote>\n<p>We have to strip out the first column, which is the words themselves:<\/p>\n<blockquote>\n<pre>$ cut -d&#39; &#39; -f2- glove.6B.50d.txt &gt;glove.6B.50d.cut.txt\r\n<\/pre>\n<\/blockquote>\n<p>And now we can use the Python wrapper to run the tool:<\/p>\n<blockquote>\n<pre>$ .\/bhtsne.py -v -r1 -i glove.6B.50d.cut.txt -o glove.6B.50d.tsne.txt\r\nRead the 400000 x 50 data matrix successfully!\r\nUsing random seed: 1\r\nUsing no_dims = 2, perplexity = 50.000000, and theta = 0.500000\r\nComputing input similarities...\r\nBuilding tree...\r\n - point 0 of 400000\r\n...\r\n - point 390000 of 400000\r\nInput similarities computed in 21409.84 seconds (sparsity = 0.000682)!\r\nLearning embedding...\r\nIteration 50: error is 131.843420 (50 iterations in 729.03 seconds)\r\n...\r\nIteration 999: error is 6.373127 (50 iterations in 654.72 seconds)\r\nFitting performed in 15039.99 seconds.\r\nWrote the 400000 x 2 data matrix successfully!\r\n.\/bhtsne.py -v -r1 -i glove.6B.50d.cut.txt -o glove.6B.50d.tsne.txt\u00a0 36697.76s user 228.13s system 99% cpu 
10:17:09.79 total\r\n<\/pre>\n<\/blockquote>\n<p>Wow, that took a long time! Let&#8217;s see if we can speed it up a bit.<\/p>\n<h3>Profiling<\/h3>\n<p>First, because premature optimization is the root of all evil, we should profile to see where the bottlenecks are. We don&#8217;t have all day, so let&#8217;s make a smaller input file for benchmarking, with just the first 1,000 points:<\/p>\n<blockquote>\n<pre>$ head -n1000 glove.6B.50d.cut.txt >glove.6B.50d.small.txt<\/pre>\n<\/blockquote>\n<p>And use the Linux <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/perf.wiki.kernel.org\/index.php\/Main_Page\">perf<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> tool to sample the behavior of the program:<\/p>\n<blockquote>\n<pre>$ perf record --call-graph=dwarf .\/bh_tsne<\/pre>\n<\/blockquote>\n<p>&#8216;perf report&#8217; shows where it&#8217;s spending most of its time:<\/p>\n<blockquote>\n<pre>-\u00a0\u00a0 94.58%\u00a0\u00a0\u00a0\u00a0 0.00%\u00a0 bh_tsne\u00a0 bh_tsne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 [.] 
main\r\n\u00a0\u00a0 - main\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 - 94.57% TSNE::run\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 - 87.76% TSNE::computeGradient\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 - 49.31% SPTree::computeNonEdgeForces\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 - 48.92% SPTree::computeNonEdgeForces\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 47.42% SPTree::computeNonEdgeForces\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 21.25% SPTree::computeEdgeForces\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 10.90% SPTree::SPTree\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 6.11% SPTree::~SPTree\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 4.02% TSNE::evaluateError\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 2.10% TSNE::computeGaussianPerplexity\r\n<\/pre>\n<\/blockquote>\n<p>So it looks like SPTree::computeNonEdgeForces(), which happens to be the octree traversal, is the main candidate for optimization. 
Diving into perf report&#8217;s annotated assembly view (by pressing A), we can see a couple of hotspots.<\/p>\n<h3>Avoiding square roots<\/h3>\n<p>Let&#8217;s start with this one:<\/p>\n<blockquote>\n<pre>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 if(is_leaf || max_width \/ sqrt(D) < theta) {\r\n\u00a0 5.24 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 sqrtsd %xmm2,%xmm4\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0 \u2193 ja\u00a0\u00a0\u00a0\u00a0 2115 &lt;SPTree::computeNonEdgeForces\r\n 10.37 \u250218e:\u00a0\u00a0 divsd\u00a0 %xmm4,%xmm1\r\n\u00a0 0.45 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 comisd %xmm1,%xmm3\r\n\u00a0 1.48 \u2502\u00a0\u00a0\u00a0\u00a0 \u2193 ja\u00a0\u00a0\u00a0\u00a0 20b0 &lt;SPTree::computeNonEdgeForces(unsigned int, double, 1e0\r\n<\/pre>\n<\/blockquote>\n<p>The criterion \\(\\frac{w_{\\max}}{\\sqrt{D}} &lt; \\theta \\) determines whether the current node is a good enough approximation of its children, or if we have to recurse into it and handle its children separately.<\/p>\n<p>Square roots and division are both very expensive operations, but because \\(x^2\\) is an increasing function on positive numbers, we can rewrite the inequality in the more efficient form \\(w_{\\max}^2 &lt; \\theta^2 D\\). 
With <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Maluuba\/bhtsne\/commit\/568dc3623fa4125297f166eda5d5ee329f68901b\">this patch<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, bhtsne is about 5% faster on this dataset.<\/p>\n<h3>Hoisting invariants<\/h3>\n<p>Now let&#8217;s move on to this hotspot:<\/p>\n<blockquote>\n<pre>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 double cur_width;\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 for(unsigned int d = 0; d &lt; dimension; d++) {\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0cur_width = boundary-&gt;getWidth(d);\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 max_width = (max_width &gt; cur_width) ? max_width : cur_width;\r\n 10.98 \u2502 cc:\u00a0\u00a0 maxsd\u00a0 (%r9),%xmm1\r\n\r\n<\/pre>\n<\/blockquote>\n<p>This loop is computing the maximum width of the node along any axis. But doesn&#8217;t that remain constant for the life of the node? We could compute that once for every node and store it in a field to save some work. <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Maluuba\/bhtsne\/commit\/60b41c26d84d3c2d3543d60e6ca6dac02c33b941\">Doing that<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> makes it about another 5% faster.<\/p>\n<h3>Shrinking the footprint<\/h3>\n<p>Why exactly did the last patch help? &#8216;maxsd&#8217; is not a slow operation (computing the maximum of two doubles) and it wasn&#8217;t a long loop (dimension == 2 in this case). 
In fact, it was slow for the same reason almost all code is slow these days: memory locality. The getWidth(d) call ultimately looks up the width in an array somewhere else in memory. As CPUs have gotten faster over the years, RAM has failed to keep pace. Levels of smaller and faster on-CPU caches have mitigated the issue, but the smaller your memory footprint is, and the fewer pointers you have to chase, the more of your working set will fit in lower levels of cache and the faster your program will be.<\/p>\n<p>So, if we can reduce the memory footprint of the tree, we&#8217;ll probably be able to keep more of it in the cache, resulting in faster code. After <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Maluuba\/bhtsne\/commit\/b1aa46a252a1315b8241edf74ec101f03ac020f7\">reworking<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> the implementation to use standard C++ containers instead of manual memory management,<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Maluuba\/bhtsne\/commit\/d59e6d472bf3c207cf7dd055f8f72551e0f7085e\"> this patch<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> introduces a smaller node structure. 
These changes make it almost 40% faster yet on our benchmark.<\/p>\n<p>We can shrink the footprint a bit more by noticing that all the nodes at the same level as each other have the same sizes, just different positions, so they can share their width member.<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Maluuba\/bhtsne\/commit\/6b18cc61502a6439d5edd840041a01a4078626d7\"> This patch<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> gains us about another 5%.<\/p>\n<p>Finally, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Maluuba\/bhtsne\/commit\/31110a9028e1491a67cebca0fb1d715a57cbac09\">by eliminating<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> the scratch space array buff and just recomputing the values as necessary, we get almost another 10% faster. 
Altogether, we&#8217;re now over 70% faster on this benchmark.<\/p>\n<h3>Scaling up<\/h3>\n<p>Now let&#8217;s check how this scales to larger datasets.<\/p>\n<div id=\"attachment_487973\" style=\"width: 650px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-487973\" class=\"size-full wp-image-487973\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/tavian_scalingup_figure1.png\" alt=\"Tavian Barnes, barns-hut\" width=\"640\" height=\"682\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/tavian_scalingup_figure1.png 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/tavian_scalingup_figure1-282x300.png 282w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><p id=\"caption-attachment-487973\" class=\"wp-caption-text\">Figure 1 \u2013 <span style=\"margin: 0px; color: black; font-family: 'Calibri',sans-serif; font-size: 12pt;\">Run time before and after optimizations<\/span>.<\/p><\/div>\n<p>Figure 1 shows that with 10,000 points, we&#8217;re over 100% faster, going from 3m09s to just 1m32s. But at 100,000 points, the benefit falls back down to 75%, going from 1h14m to 42m. 
The output hints that the bottleneck has started shifting to a different part of the program:<\/p>\n<blockquote>\n<pre>$ .\/bhtsne.py -v -r1 -i glove.6B.50d.large.txt -o glove.6B.50d.tsne.txt\r\n...\r\nInput similarities computed in 1388.20 seconds (sparsity = 0.002531)!\r\n...\r\nFitting performed in 1081.47 seconds.\r\n...\r\n.\/bhtsne.py -v -r1 -i glove.6B.50d.large.txt -o glove.6B.50d.tsne.txt 2529.04s user 1.70s system 99% cpu 42:19.12 total\r\n<\/pre>\n<\/blockquote>\n<p>&#8216;perf&#8217; confirms that with more points, we&#8217;re spending more time in a different part of the program:<\/p>\n<blockquote>\n<pre>-\u00a0\u00a0 95.04%\u00a0\u00a0\u00a0\u00a0 0.26%\u00a0 bh_tsne\u00a0 bh_tsne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 [.] TSNE::run\r\n\u00a0\u00a0 - 94.78% TSNE::run\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 - 77.80% TSNE::computeGradient\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 45.46% SPTree::computeNonEdgeForces\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 25.83% SPTree::computeEdgeForces\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 4.71% SPTree::SPTree\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 1.61% SPTree::~SPTree\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 - 11.67% TSNE::computeGaussianPerplexity\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 - 10.62% VpTree<DataPoint, &(euclidean_distance(DataPoint const&, DataPoint const&))>::search (inlined)\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + 10.23% VpTree<DataPoint, &(euclidean_distance(DataPoint const&, DataPoint const&))>::search (inlined)\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 0.79% expf32x\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 + 4.56% TSNE::evaluateError\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 0.69% TSNE::symmetrizeMatrix\r\n<\/pre>\n<\/blockquote>\n<p>VpTree::search() is starting to show up in the profile, and its share of execution time only goes up as the number 
of points grows. This corresponds to the &#8220;Computing input similarities&#8230;&#8221; part of the program, which involves finding the nearest neighbors of every point in the dataset. A careful look at the profile shows that almost all that time is spent computing high-dimensional Euclidean distances:<\/p>\n<blockquote>\n<pre>+\u00a0\u00a0\u00a0 9.59%\u00a0\u00a0\u00a0\u00a0 0.00%\u00a0 bh_tsne\u00a0 bh_tsne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 [.] VpTree&lt;DataPoint, &amp;(euclidean_distance(DataPoint const&amp;, DataPoint const&amp;))&gt;::search (inlined)\r\n+\u00a0\u00a0\u00a0 9.02%\u00a0\u00a0\u00a0\u00a0 9.00%\u00a0 bh_tsne\u00a0 bh_tsne\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 [.] euclidean_distance\r\n<\/pre>\n<\/blockquote>\n<p>Unfortunately, we can&#8217;t do the same thing we did the last time we saw a Euclidean distance computation. The code is using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Vantage-point_tree\">vantage-point trees<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to find nearest neighbors, which fundamentally rely on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Metric_(mathematics)\">metric<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> properties of the distance function. 
Eliminating the square root makes the new &#8220;metric&#8221; violate the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Triangle_inequality\">triangle inequality<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and would cause the VP tree to give incorrect results. Furthermore, the square root isn&#8217;t even the slowest part &#8212; most of the time is just spent iterating over the coordinates:<\/p>\n<blockquote>\n<pre>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 diff = (x1[d] - x2[d]);\r\n\u00a0 6.08 \u250223:\u00a0\u00a0 movsd \u00a0(%rcx,%rax,8),%xmm1\r\n 21.08 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0 subsd\u00a0 (%rsi,%rax,8),%xmm1\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 dd += diff * diff;\r\n\u00a0 0.15 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0 lea\u00a0\u00a0\u00a0 0x1(%rax),%rdx\r\n 10.72 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0 mulsd\u00a0 %xmm1,%xmm1\r\n 27.05 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0 addsd\u00a0 %xmm1,%xmm0\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 for(int d = 0; d < t1._D; d++) {\r\n 18.73 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0 cmp\u00a0\u00a0\u00a0 %rax,%rdi\r\n\u00a0 0.00 \u2502\u00a0\u00a0\u00a0 \u2191 jne\u00a0\u00a0\u00a0 2e90 &lt;euclidean_distance(DataPoint const&, 20\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 }\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 return sqrt(dd);\r\n\u00a0 0.46 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0 ucomisd %xmm0,%xmm2\r\n 11.74 \u2502\u00a0\u00a0\u00a0\u00a0\u00a0 sqrtsd %xmm0,%xmm1\r\n<\/pre>\n<\/blockquote>\n<p>High dimensional nearest neighbors is a hard problem at the best of times, because of the <a 
class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Curse_of_dimensionality\">curse of dimensionality<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. The best choice here might be to use an approximate nearest neighbors algorithm, or at least an all-nearest-neighbors algorithm that avoids wasted work. But I&#8217;ll save that for a future post.<\/p>\n<h3>Conclusion<\/h3>\n<p>So how did we do on the original, 400,000-point dataset?<\/p>\n<blockquote>\n<pre>$ .\/bhtsne.py -v -r1 -i glove.6B.50d.cut.txt -o glove.6B.50d.tsne.txt\r\nRead the 400000 x 50 data matrix successfully!\r\nUsing random seed: 1\r\nUsing no_dims = 2, perplexity = 50.000000, and theta = 0.500000\r\nComputing input similarities...\r\nBuilding tree...\r\n- point 0 of 400000 \r\n...\r\n- point 390000 of 400000\r\nInput similarities computed in 21370.59 seconds (sparsity = 0.000682)!\r\nLearning embedding...\r\nIteration 50: error is 131.843420 (50 iterations in 273.23 seconds) \r\n...\r\nIteration 999: error is 6.373127 (50 iterations in 239.38 seconds)\r\nFitting performed in 5394.53 seconds.\r\nWrote the 400000 x 2 data matrix successfully!\r\n.\/bhtsne.py -v -r1 -i glove.6B.50d.cut.txt -o glove.6B.50d.tsne.txt\u00a0 26989.04s user 59.58s system 99% cpu 7:32:10.40 total\r\n<\/pre>\n<\/blockquote>\n<p>With all the above optimizations, we&#8217;ve saved 2 hours and 45 minutes on this computation, making it 36% faster. The second stage of the computation, where the optimizations apply, is over 178% faster, starting at over four hours and ending up under one and a half. 
These optimizations have been submitted upstream as a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/lvdmaaten\/bhtsne\/pull\/72\">pull request<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and I hope to make more improvements in the future!<\/p>\n<h3>References<\/h3>\n<p>L.J.P. van der Maaten. <strong>Accelerating t-SNE using Tree-Based Algorithms<\/strong>. Journal of Machine Learning Research 15(Oct):3221-3245, 2014.<br \/>\nL.J.P. van der Maaten and G.E. Hinton. <strong>Visualizing High-Dimensional Data Using t-SNE<\/strong>. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.<br \/>\nJ. Barnes and P. Hut. <strong>A hierarchical O(N log N) force-calculation algorithm<\/strong>. Nature 324 (4): 446\u2013449, 1986.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Ten years ago, while writing a physics engine, I learned about the Barnes-Hut algorithm for the gravitational n-body problem. Normally, computing the Newtonian gravitational forces between n bodies requires evaluations of Newton&#8217;s law of universal gravitation, as every body exerts a force on every other body in the system. 
Barnes-Hut is an approximation that brings [&hellip;]<\/p>\n","protected":false},"author":37074,"featured_media":488936,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Tavian Barnes","user_id":"37059"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[194488],"tags":[],"research-area":[13560],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-487934","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-program-languages-and-software-engineering","msr-research-area-programming-languages-software-engineering","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"480\" height=\"280\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/BHTSNE_Prog_MCR_Carousel_05_2018_480x280_Static.jpg\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/BHTSNE_Prog_MCR_Carousel_05_2018_480x280_Static.jpg 480w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2018\/05\/BHTSNE_Prog_MCR_Carousel_05_2018_480x280_Static-300x175.jpg 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/>","byline":"Tavian Barnes","formattedDate":"May 30, 2018","formattedExcerpt":"Ten 
years ago, while writing a physics engine, I learned about the Barnes-Hut algorithm for the gravitational n-body problem. Normally, computing the Newtonian gravitational forces between n bodies requires evaluations of Newton&#039;s law of universal gravitation, as every body exerts a force on every other&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/487934","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37074"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=487934"}],"version-history":[{"count":37,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/487934\/revisions"}],"predecessor-version":[{"id":488981,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/487934\/revisions\/488981"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/488936"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=487934"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=487934"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=487934"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=487934"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp
-json\/wp\/v2\/msr-region?post=487934"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=487934"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=487934"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=487934"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=487934"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=487934"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=487934"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}