<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ragha's Blog]]></title><description><![CDATA[Musings on tech, machine learning]]></description><link>https://raghakot.github.io</link><image><url>images/cover.jpg</url><title>Ragha&apos;s Blog</title><link>https://raghakot.github.io</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 20 Jan 2018 18:26:21 GMT</lastBuildDate><atom:link href="https://raghakot.github.io/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Towards Differentiable Neural Architecture]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Ever since AlexNet was introduced in 2012, neural net research landscape fundamentally changed. With deep computational graphs, the possibilities are endless. We quickly iterated through a dizzying number of architectures such as AlexNet, VGGNet, NIN, Inception, ResNets, FractalNet, Xception, DenseNets and so on. With deep learning, architecture engineering is the new feature engineering and it clearly plays an important role in advancing the state of the art. Instead of making incremental improvements, is there a way to <em>learn</em> the connectivity pattern itself?</p>
</div>
<div class="paragraph">
<p>First step is to make this more concrete. Ideally, we want to learn connectivity amongst individual neurons; instead, lets simplify the problem by constraining ourselves to known layer Lego blocks (by layer Lego blocks, I mean general purpose computational layers such as <em>Convolutional Layer</em>, <em>LSTM</em>, <em>Pooling</em> etc.). Given a bunch of Layer Legos \(L = \{L_{1}, L_{2}, \cdots, L_{n}\}\), our task is to learn how these should be assembled. First thing that comes to mind is evolutionary search, but that is computationally prohibitive. A quicker and more elegant alternative would be to learn this by computing the gradient of <em>architecture</em> with respect to loss. The challenge is to make <em>architecture</em> itself differentiable.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/diff_neural/legos.jpg" alt="legos" width="400">
</div>
<div class="title">Figure 1. Deep learning practitioners</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_differentiable_architecture">Differentiable Architecture</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Architecture can mean a lot of things. For now, let us constrain ourselves to connectivity patterns. Given some \(n\) layers, what is the optimal connectivity amongst them? The naive way of approaching this problem would be to try out all \(n(n-1)\) possible connection configurations<sup class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</sup>. There is a resnet, densenet or some other future net hidden in there.</p>
</div>
<div class="paragraph">
<p>Bruteforce search is not such a bad option. Cifar10/100 networks are quite short (relatively speaking). If you have access to huge compute, it might take a month to try out all possible connectivities. Hopefully we can extrapolate the recurring blocks and apply it to imagenet and get a new SOTA results (wishful thinking). While practical, this is a rather boring way to solve the problem. What if we can learn the connections via gradient descent? For this to work, we need a way to compute \(\frac{\partial (loss)}{\partial (connection)}\).</p>
</div>
<div class="paragraph">
<p>Connectivity is inherently binary. This is problematic for applying gradient descent because of the discontinuity. Notice the similarities with attention? We can employ reinforcement learning like hard attention or try soft attention strategy. In order to make connectivity <em>continuous</em>, we can make it weighted, similar to <em>forget gate</em> in LSTMs<sup class="footnote">[<a id="_footnoteref_2" class="footnote" href="#_footnote_2" title="View footnote.">2</a>]</sup>. To bias the connectivity mostly towards 0 or 1, we can apply use sigmoid weighted connectivity. To recap, the strategy is to start out with a fully connected net where evry layer is connected to every other layer via sigmoid weighted param. The weights can be learned via backprop.</p>
</div>
<div class="paragraph">
<p>Cool. There is one slight technicality. Backward connections introduce cycles in the graph<sup class="footnote">[<a id="_footnoteref_3" class="footnote" href="#_footnote_3" title="View footnote.">3</a>]</sup>. Excluding backward connections, every layer can have incoming inputs from all previous layers, which leaves us with \(\frac{n(n-1)}{2}\) possible connections. This is exactly what <a href="https://arxiv.org/pdf/1608.06993v3.pdf">DenseNets</a> do!, except that we arrived at a similar architecture with a different motivation.</p>
</div>
<div class="paragraph">
<p>Unlike DenseNets, we want the connections to be gated so that they can be pruned if needed. There are a couple of possibilities for how \(L_{i}\) is connected to \(L_{j}\).</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p><strong>Shared/Unique feature weights</strong>: We can either associate weights \(W = [w_{1}, w_{2}, \cdots, w_{f}]\) for \(L_{i}\) with \(f\) feature maps or use a single weight for all feature maps of an incoming input. Lets call this <em>shared</em>/<em>unique</em> feature weight scheme.</p>
</li>
<li>
<p><strong>Input aggregation</strong>: For layer \(L_{j}\), we need a way to aggregate weight feature maps from previous layers. Some possibilities include:</p>
<div class="ulist">
<ul>
<li>
<p><strong>DenseNet style concat</strong>: i.e, simply concatenate previous layer weighted feature maps. This has the limitation that concatenation is not viable when the size of feature maps differ. For this we can either try to pad smaller feature maps
with 0&#8217;s to make them all the same size as the largest feature map or resize all feature maps to the smallest feature map by pooling/interpolation/some other form of down sampling. Both approaches have their own pros and cons. The first is computationally expensive (since convnets have to slide over larger feature maps), while the second one is lossy. I have a feeling that lossy would end up being the winner.</p>
</li>
<li>
<p><strong>ResNet style concat</strong>: i.e., squish weighted input feature maps via <em>max</em> or <em>sum</em>. <em>avg</em>, <em>prod</em>, <em>min</em> dont seem like good ideas for obvious reasons. I like <em>max</em> better than resnet style <em>sum</em>, since it is equivalent to focusing on input patterns that matched with a filter (high output value) across various feature maps. A high value at some \((row, col)\) of the filter might indicate the presence of some abstract shape detected by previous convolutional layers.</p>
</li>
<li>
<p><strong>Squeeze net style concat</strong>: Lastly, we can also try squeeze net strategy of squishing \(n\) concatenated input feature maps (after reshaping to same size via padding or down sampling) down to \(m\) feature maps where \(m \le n\).</p>
</li>
</ul>
</div>
</li>
</ol>
</div>
<div class="paragraph">
<p>Phew! That completes our setup. The idea is super simple. Start out with an fully connected network where every layer gets inputs from all the previous layers in a weighted fashion. Before weighing, we use \(sigmoid(W)\) to bias the values mostly towards 0 or 1. When trained end-end via backprop, the network should, in-principle, learn to prune unwanted connections.</p>
</div>
<div class="paragraph">
<p>We can also try doing couple of other things.</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Take the converged connectivity pattern and try it exclusively by removing connections with small weights. If \(w = 0.002\) or something like that, we know that the network is trying really hard to get rid of that connection. So might as well see if the performance is better when we start out by removing it.</p>
</li>
<li>
<p>Examine the evolution of weights during training. Do they always converge in the same manner? If they don&#8217;t, that would be concerning.</p>
</li>
<li>
<p>Try different connection weight initialization schemes. \(w = 1\) would mean everything starts out fully connected. Alternatively we can set \(w\) from a Gaussian distribution centered around 0.5. We can try other funky things like setting initial feed forward stack \(L_{i-1}\) &#8594; \(L_{i}\) weight to 1 and rest to 0.5.</p>
</li>
</ol>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_experimental_setup">Experimental Setup</h2>
<div class="sectionbody">
<div class="ulist">
<ul>
<li>
<p>Layers = [Conv, Conv, MaxPool, Dropout] \(\times\) 2-3 blocks of variable feature sizes.</p>
</li>
<li>
<p>cifar10 dataset augmented with 10% random shifts along image rows/cols along with a 50% chance of horizontal flip.</p>
</li>
<li>
<p><code>random_seed = 1337</code> for reproducibility.</p>
</li>
<li>
<p>Training for 200 epochs, batch size of 32 using categorical cross-entropy loss with Adam optimizer.</p>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_implementation">Implementation</h2>
<div class="sectionbody">
<div class="paragraph">
<p>A quick summary of these ideas are translated into concrete implementation. A complete implementation can be found on my <a href="https://github.com/raghakot/deep-learning-experiments/tree/master/exp3">github</a>.</p>
</div>
<div class="ulist">
<ul>
<li>
<p><strong>Creating fully connected net</strong>. i.e, a layer is connected to all prev layer outputs. This is not as hard as it appears.</p>
</li>
</ul>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-python" data-lang="python">def make_fully_connected(x, layers):
    inputs = [x]
    for layer in layers:
        x = layer(x)
        inputs.append(x)
        # This is the part where we resize inputs to be of same shape and merge them in resnet or densenet style
    return x</code></pre>
</div>
</div>
<div class="ulist">
<ul>
<li>
<p><strong>Merging</strong>. i.e., resizing prev layer outputs to be of the same shape and concatenating them in densenet or resnet style. We also want to weigh merged outputs so that those weights can be learned during backprop. The easiest way to do this in keras is to create a custom layer<sup class="footnote">[<a id="_footnoteref_4" class="footnote" href="#_footnote_4" title="View footnote.">4</a>]</sup>.</p>
</li>
</ul>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-python" data-lang="python">import numpy as np
import tensorflow as tf

from keras import backend as K
from keras.layers import Lambda, Layer

class Connection(Layer):
    """Takes a list of inputs, resizes them to the same shape, and outputs a weighted merge.
    """
    def __init__(self, init_value=0.5, merge_mode='concat',
                 resize_interpolation=tf.image.ResizeMethod.BILINEAR,
                 shared_weights=True):
        """Creates a connection that encapsulates weighted connection of input feature maps.

        :param init_value: The init value for connection weight
        :param merge_mode: Defines how feature maps from inputs are aggregated.
        :param resize_interpolation: The downscaling interpolation to use. Instance of `tf.image.ResizeMethod`.
            Note that ResizeMethod.AREA will fail as its gradient is not yet implemented.
        :param shared_weights: True to share the same weight for all feature maps from inputs[i].
        False creates a separate weight per feature map.
        """
        self.init_value = init_value
        self.merge_mode = merge_mode
        self.resize_interpolation = resize_interpolation
        self.shared_weights = shared_weights
        super(Connection, self).__init__()

    def _ensure_same_size(self, inputs):
        """Ensures that all inputs match last input size.
        """
        # Find min (row, col) value and resize all inputs to that value.
        rows = min([K.int_shape(x)[1] for x in inputs])
        cols = min([K.int_shape(x)[2] for x in inputs])
        return [tf.image.resize_images(x, [rows, cols], self.resize_interpolation) for x in inputs]

    def _merge(self, inputs):
        """Define other merge ops like [+, X, Avg] here.
        """
        if self.merge_mode == 'concat':
            # inputs are already stacked.
            return inputs
        else:
            raise RuntimeError('mode {} is invalid'.format(self.merge_mode))

    def build(self, input_shape):
        """ Create trainable weights for this connection
        """
        # Number of trainable weights = sum of all filters in `input_shape`
        nb_filters = sum([s[3] for s in input_shape])

        # Create shared weights for all filters within an input layer.
        if self.shared_weights:
            weights = []
            for shape in input_shape:
                # Create shared weight, make nb_filter number of clones.
                shared_w = K.variable(self.init_value)
                for _ in range(shape[3]):
                    weights.append(shared_w)
            self.W = K.concatenate(weights, axis=-1)
        else:
            self.W = K.variable(np.ones(shape=nb_filters) * self.init_value)

        self._trainable_weights.append(self.W)
        super(Connection, self).build(input_shape)

    def call(self, layer_inputs, mask=None):
        # Resize all inputs to same size.
        resized_inputs = self._ensure_same_size(layer_inputs)

        # Compute sigmoid weighted inputs
        stacked = K.concatenate(resized_inputs, axis=-1)
        weighted = stacked * K.sigmoid(self.W)

        # Merge according to provided merge strategy.
        merged = self._merge(weighted)

        # Cache this for use in `get_output_shape_for`
        self._out_shape = K.int_shape(merged)
        return merged

    def get_output_shape_for(self, input_shape):
        return self._out_shape</code></pre>
</div>
</div>
<div class="paragraph">
<p>Lets look at this step by step.</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p><code>_ensure_same_size</code> computes smallest \((rows, cols)\) amongst all inputs and uses it to resize all inputs to be the same shape using the provided resize interpolation scheme.</p>
</li>
<li>
<p>We have to define trainable weights in <code>build</code> per keras custom layer <a href="https://keras.io/layers/writing-your-own-keras-layers/">docs</a>. We need as many weights as sum of filters across all inputs to the <code>Connection</code> layer. Weight sharing across all filters of an incoming layer can be achieved by concatenating same weight variable for all filters.</p>
</li>
<li>
<p><code>call</code> computes sigmoid weighted inputs (I tested without sigmoid, and as expected, sigmoid weighing which mostly "allows or disallows inputs" worked a lot better), merged with defined merge strategy. We can tweak <code>init_value</code> and <code>merge_mode</code> to try various init strategies for weights and different merge strategies.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>The fully connected net using layers defined below, followed by sequential <code>Dense</code> layers using the above code is shown in fig.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-python" data-lang="python">layers = [
	Convolution2D(32, 3, 3, border_mode='same', activation='relu', bias=False),
	Convolution2D(32, 3, 3, bias=False, activation='relu'),
	MaxPooling2D(pool_size=(2, 2)),
	Dropout(0.25),

	Convolution2D(64, 3, 3, bias=False, activation='relu', border_mode='same'),
	Convolution2D(64, 3, 3, bias=False, activation='relu'),
	MaxPooling2D(pool_size=(2, 2)),
	Dropout(0.25)
]</code></pre>
</div>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/diff_neural/model.png" alt="model">
</div>
<div class="title">Figure 2. Fully connected network from <code>layers</code> followed by sequential <code>Dense</code> layers (open in new tab or download to zoom in).</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_discussion">Discussion</h2>
<div class="sectionbody">
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
Experimentation is still a work in progress. Check back for updates.
</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Without \(sigmoid\) weighing which mostly "allows or disallows inputs", the convergence was horribly slow. All subsequent results described here used \(sigmoid\) connection weights.</p>
</div>
<div class="sect2">
<h3 id="_experiment1_densenet_style_merge">Experiment1: DenseNet Style merge</h3>
<div class="paragraph">
<p>In these set of experiments, activation maps from prev layers are <em>concatenated</em>.</p>
</div>
<div class="sect3">
<h4 id="_insights_from_initial_exploration">Insights from initial exploration</h4>
<div class="ulist">
<ul>
<li>
<p>Connection weight initialization scheme (init to 0, 1, 0.5) has no effect on convergence.</p>
</li>
<li>
<p>Down sampling interpolation scheme (inter_area, inter_nn, inter_bilinear, inter_bicubic) doesn&#8217;t affect the convergence significantly<sup class="footnote">[<a id="_footnoteref_5" class="footnote" href="#_footnote_5" title="View footnote.">5</a>]</sup>.</p>
</li>
</ul>
</div>
</div>
<div class="sect3">
<h4 id="_evolution_of_connection_weights_shared_feature_map_weights">Evolution of connection weights (shared feature map weights)</h4>
<div class="paragraph">
<p>It is definitely interesting to track how connection weights between layers evolved with training epochs. Fig 3 shows the connection weight evolution for connection_o through connection_7 (connection 0 weight 0 corresponds to input&#8594;conv2, connection 0 weight 1 corresponds to conv1&#8594;input2, and so on. Refer to fig 2 to get a complete picture).</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/diff_neural/connection_evolution.png" alt="connection_evolution">
</div>
<div class="title">Figure 3. Evolution of various connection weights during training</div>
</div>
<div class="paragraph">
<p>TODO: Discussion.</p>
</div>
</div>
<div class="sect3">
<h4 id="_evolution_of_connection_weights_unique_feature_map_weights">Evolution of connection weights (unique feature map weights)</h4>
<div class="paragraph">
<p>WIP..</p>
</div>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_reproducibility">Reproducibility</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The code to reproduce all the experiments is available on <a href="https://github.com/raghakot/deep-learning-experiments/tree/master/exp3">Github</a>. Feel free to reuse or improve.</p>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>
</div>
</div>
<div id="footnotes">
<hr>
<div class="footnote" id="_footnote_1">
<a href="#_footnoteref_1">1</a>. Don&#8217;t give such answers in interviews, lol
</div>
<div class="footnote" id="_footnote_2">
<a href="#_footnoteref_2">2</a>. An excellent overview of LSTMs can be found on <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/" class="bare">http://colah.github.io/posts/2015-08-Understanding-LSTMs/</a>
</div>
<div class="footnote" id="_footnote_3">
<a href="#_footnoteref_3">3</a>. There are ways to avoid the issue by unrolling the recurrent loops to a fixed number of time steps but lets put that off for now in the interest of simplicity
</div>
<div class="footnote" id="_footnote_4">
<a href="#_footnoteref_4">4</a>. <a href="https://keras.io/layers/core/#lambda">Lambda layer</a> can be used, but that doesn&#8217;t allow for trainable weights. This is not an issue if tensorflow optimizer was directly used.
</div>
<div class="footnote" id="_footnote_5">
<a href="#_footnoteref_5">5</a>. inter_bilinear, inter_bicubic work slightly better initially but they all converge to the same final value
</div>
</div>]]></description><link>https://raghakot.github.io/2017/01/14/Towards-Differentiable-Neural-Architecture.html</link><guid isPermaLink="true">https://raghakot.github.io/2017/01/14/Towards-Differentiable-Neural-Architecture.html</guid><category><![CDATA[deep learning]]></category><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Sat, 14 Jan 2017 00:00:00 GMT</pubDate></item><item><title><![CDATA[Baking rotational invariance into a neuron]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>This is a continuation from the first <a href="https://raghakot.github.io/2017/01/03/Exploring-alternative-neural-computational-models.html">post</a> on exploring alternative neural computational models. To recap quickly, a neuron is fundamentally acting as a pattern matching unit by computing the dot product \(W \cdot X\), which is equivalent to unnormalized cosine similarity between vectors \(W\) and \(X\).</p>
</div>
<div class="paragraph">
<p>This model of pattern matching is very naive, in a sense that it is not invariant to translations or rotations. Consider the following weight vector template (conv filter) corresponding to vertical edge pattern <code>|</code></p>
</div>
<div class="paragraph">
<p>\(W = \begin{bmatrix}
0.2 &amp; 1.0 &amp; 0.2\\
1.0 &amp; 1.0 &amp; 1.0\\
0.2 &amp; 1.0 &amp; 0.2
\end{bmatrix} \)</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/vertical_line_filter.png" alt="vertical_line_filter" width="300">
</div>
<div class="title">Figure 1. Visual representation of \(W\) template</div>
</div>
<div class="paragraph">
<p>When a neuron tries to match this pattern in a \(3 \times 3\) input image patch, it will only output a high value if a vertical edge is found in the center of the image patch. i.e., dot product is not invariant to shifts. My first thought was to use <a href="https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient">Pearson coefficient</a> as it provides shift invariance. However, this not a problem in practice as the filter is slid across all possible \(3 \times 3\) patches within the input image (see fig2).</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/conv_illustration.png" alt="conv_illustration" width="400">
</div>
<div class="title">Figure 2. Sliding filter over input image to generate the output activation area (image borrowed from <a href="http://intellabs.github.io/RiverTrail/tutorial/">river trail documentation</a>)</div>
</div>
<div class="paragraph">
<p>In a way, the sliding window operation is quite clever as it communicates information about <em>parts of the image</em> matching filter template without increasing the number of <em>trainable</em> parameters.  As a concrete example, consider a conv-net with filters as shown in fig3<sup class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</sup>. The first layer filters (green) detect eyes, nose etc., second layer filters (yellow) detect face, leg etc., by aggregating features from first layer and so on<sup class="footnote">[<a id="_footnoteref_2" class="footnote" href="#_footnote_2" title="View footnote.">2</a>]</sup>.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/conv_net1.png" alt="conv_illustration" width="600">
</div>
<div class="title">Figure 3. Hypothetical conv-net filters to illustrate how a <em>human</em> might be detected. (image borrowed from quora <a href="https://www.quora.com/How-is-a-convolutional-neural-network-able-to-learn-invariant-features">post</a>)</div>
</div>
<div class="paragraph">
<p>Now suppose the face (represented by two red and one magenta point) is moved to another corner as shown in fig4. The same activations occur, leading to the same end result.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/conv_net2.png" alt="conv_illustration" width="600">
</div>
<div class="title">Figure 4. Illustration of activations when an input object is shifted. (image borrowed from quora <a href="https://www.quora.com/How-is-a-convolutional-neural-network-able-to-learn-invariant-features">post</a>)</div>
</div>
<div class="paragraph">
<p>So far so good. How about rotational invariance? We know that a lot of learned conv filters are identical, but rotated by some non-random factor. Instead of learning rotational invariance in this manner, we will try to build it directly into the computational model. The motivation is simple - the neuron should output high value even if some rotated version of pattern is present in the input. Hopefully this eliminates redundant filters and improves test time accuracy.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/conv_filters.jpg" alt="conv_filters" width="600">
</div>
<div class="title">Figure 5. A few conv filters of a vggnet (image borrowed from keras <a href="https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html">blog post</a>). Notice how some filters are identical but rotated.</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_endowing_rotational_invariance">Endowing rotational invariance</h2>
<div class="sectionbody">
<div class="paragraph">
<p>How can we build rotational invariance into dot product similarity computation? After searching for a bit on the Internet, I couldn&#8217;t find any obvious similarity metrics that do this. Sure there is <a href="https://en.wikipedia.org/wiki/Scale-invariant_feature_transform">SIFT</a> and <a href="https://en.wikipedia.org/wiki/Speeded_up_robust_features">SURF</a>, but they are complex computations and cannot be expressed in terms of dot products alone.</p>
</div>
<div class="paragraph">
<p>Instead of trying to come up with a metric, I decided to try the brute force approach of matching the input patch with all possible rotations of the filter. If we take <code>max</code> of all those outputs, then, in principle, we are choosing the output resulting from the rotated filter that best <code>aligns</code> with the pattern in input patch. This strategy is partly inspired by how conv nets gets around translational invariance problem by simply sliding the filter over all possible locations. Unlike sliding operation, which is an <em>architectural</em> property rather than the computational property of the neuron, the brute force rotations of filters can be confined within the abstraction of neuron, making it a part of the computational model.</p>
</div>
<div class="paragraph">
<p>Lets look at this idea in more concrete terms. A neuron receives input \(X\) and has associated weights \(W\). Let \(r\) denote some arbitrary discrete rotation \(\theta\). Instead of directly computing \(W \cdot X\), we will compute \(\max \{ rW \cdot X, r^{2}W \cdot X, \cdots, r^{k}W \cdot X \}\), where \(k = \left \lceil \frac{360}{\theta} \right \rceil\) is the number of brute force rotations of \(\theta\) step.</p>
</div>
<div class="paragraph">
<p>This idea has a number of practical challenges.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>How do we decide \(\theta\)? If we set \(\theta = 90^{\circ}\), we can produce 4 rotated filters.</p>
</li>
<li>
<p>For \(\theta \neq 90^{\circ}\) we need interpolation, which might change filter shape.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>To avoid interpolation, we can try to shift the image by rotating elements clockwise from outermost layer to innermost layer. For a \(3 \times 3\) image, there are 8 possible rotations. Unfortunately, this only works for filters that are symmetric along the center of the matrix. As an example, consider all 8 rolled rotations of the <code>L</code> shaped filter (fig 6). As the image is not symmetric along the center, we get skewed representations for \(45\circ\) rotations. The second image in fig 6 still looks like an <code>L</code>, except that the tail end of <code>L</code> is squished upwards to maintain \(3 \times 3\) filter shape.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/rotations_L.png" alt="rotations_of_filter" width="600">
</div>
<div class="title">Figure 6. All 8 shifts of the <code>L</code> shaped filter. Each row contains a clockwise rolled elements from the previous.</div>
</div>
<div class="paragraph">
<p>Despite the setback, all \(90^{\circ}\) rotations are accurate and \(45^{\circ}\) rotations are only slightly skewed. Skew might be a blessing in disguise as it might endow the neuron to be <em>deform</em> invariant as well (wishful thinking). In either case, the deformations are not too bad for \(3 \times 3\) filters, and it is at-least worth experimenting with vggnet which uses all \(3 \times 3\) conv filters.</p>
</div>
<div class="paragraph">
<p>Alternatively, instead of trying all possible alignments, we could try to find \(\theta\) that <em>maximizes</em> the dot product.</p>
</div>
<div class="paragraph">
<p>Let R denote \(2 \times 2\) rotation matrix.</p>
</div>
<div class="paragraph">
<p>\(R = \begin{bmatrix}
cos(\theta) &amp; sin(\theta)\\
-sin(\theta) &amp; cos(\theta)
\end{bmatrix}\)</p>
</div>
<div class="paragraph">
<p>To accomplish optimal alignment, we need \(\theta\) such that:</p>
</div>
<div class="paragraph">
<p>\(\arg\max_\theta R_{\theta}W \cdot X\)</p>
</div>
<div class="paragraph">
<p>This can be solved by differentiating with respect to \(\theta\) and equating the derivative to 0.</p>
</div>
<div class="paragraph">
<p>I finally decided to go with the brute force approach to quickly test out the idea and maybe further develop the math for optimal alignment if it showed promise. The brute-force computations discussed so far apply to an individual neuron. For Convolutional layer, the output is generated as follows:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Generate all 8 rotated filters from \(W\), \(W_{1}, \cdots, W_{8}\).</p>
</li>
<li>
<p>Compute output activation volumes<sup class="footnote">[<a id="_footnoteref_3" class="footnote" href="#_footnote_3" title="View footnote.">3</a>]</sup> \(O_{1}, \cdots, O_{8}\) by convolving input with \(W_{1}, \cdots, W_{8}\).</p>
</li>
<li>
<p>Across volumes, select \(x, y\) value corresponding to filter \((row, col)\) that has the max value. This corresponding to selecting the best <em>aligned</em> filter with input patch across the 8 rotations.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>These steps can be implemented as follows. Note that <code>W.shape = (rows, cols, nb_input_filters, nb_output_filters)</code>. You also need bleeding edge tensorflow from <em>master</em>; as of Nov 2016, <code>gather_nd</code> does not have a gradient implementation.</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-python" data-lang="python">import tensorflow as tf


# The clockwise shift-1 rotation permutation.
permutation = [[1, 0], [0, 0], [0, 1], [2, 0], [1, 1], [0, 2], [2, 1], [2, 2], [1, 2]]


def shift_rotate(w, shift=1):
    shape = w.get_shape()
    for i in range(shift):
        w = tf.reshape(tf.gather_nd(w, permutation), shape)
    return w


def conv2d(x, W, **kwargs):
    # Determine all 7 rotations of w.
    w = W
    w_rot = [w]
    for i in range(7):
        w = shift_rotate(w)
        w_rot.append(w)

    # Convolve with all 8 rotations and stack.
    outputs = tf.stack([tf.nn.conv2d(x, w_i, **kwargs) for w_i in w_rot])

    # Max filter activation across rotations.
    output = tf.reduce_max(outputs, 0)
    return output</code></pre>
</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_experimental_setup">Experimental Setup</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The setup is as follows.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Mini vggnet (1,250,858 parameters) comprising of \(3 \times 3\) convolutions with ReLU activation.</p>
</li>
<li>
<p>cifar10 dataset augmented with 10% random shifts along image rows/cols along with a 50% chance of horizontal flip.</p>
</li>
<li>
<p>batch size = 32, epochs = 200 with early stopping.</p>
</li>
<li>
<p>Adam optimizer with initial learning rate of \(10^{-3}\), and reduced by a factor of \(10^{-1}\) when convergence stalls.</p>
</li>
<li>
<p><code>random_seed = 1337</code> for reproducibility.</p>
</li>
</ul>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural1/model.png" alt="test_model" width="300">
</div>
<div class="title">Figure 7. Test model</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_discussion">Discussion</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Disappointingly, rotational invariant model resulted in lower val accuracy than the baseline. I had three running hypothesis as to why this might happen.</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Implementation issue. I ruled this out by testing the <code>conv2d(&#8230;&#8203;)</code> implementation by feeding it numpy arrays and verifying the output.</p>
</li>
<li>
<p>Perhaps, rotational invariance is not a good idea for the earlier filters. As an example, consider a latter layer that uses <em>vertical</em> and <em>horizontal</em> edge features for the preceeding layer to detect a <em>square</em> like shape. Having rotational invariance means that it is not possible to distinguish horizontal from vertical edges, making square detection a lot more challenging.</p>
</li>
<li>
<p>Maybe the skew from \(\theta = \{45^{\circ}, 135^{\circ}, 225^{\circ}, 315^{\circ}\}\) was an issue after all.</p>
</li>
</ol>
</div>
<div class="paragraph">
<p>To test 2, 3, I decided to run 6 more experiments by using:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Use new conv logic only on last 4, 3, 2, 1 layer(s). If hypothesis 2 is true, the convergence should get better.</p>
</li>
<li>
<p>Use \(90^{\circ}\) rotations (4 filters) and \(45^{\circ}\) rotations (8 filters) to test if <em>skew</em> introduced by the 8 filter version was really an issue.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>The results of these experiments are summarized in fig 8.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/convergence.png" alt="convergence" width="800">
</div>
<div class="title">Figure 8. Convergence graphs for #{8/4 rotations}_rot_{last 1, 2, 3, 4 conv with new conv layer} and the baseline model. Models in legend are ordered in descending order of accuracy.</div>
</div>
<div class="paragraph">
<p>It seems evident that using rotationally invariant convolutional layers everywhere is not a good idea (<em>4_rot_4</em> and <em>8_rot_4</em> have the lowest values). In line with hypothesis 2, the accuracy improved as we get rid of earlier rotationally invariant conv layers, with 4_rot_1 and 8_rot_1 on par/slightly better than the baseline model. The skew (according to hypothesis 3) does not seem to the issue as <em>8_rot_[4, 3, 2, 1]</em> closely followed <em>4_rot_[4, 3, 2, 1]</em> accuracies.</p>
</div>
<div class="paragraph">
<p>Since <em>4_rot_1</em> and <em>8_rot_4</em> converged to similar <code>val_acc</code> as the baseline, it is worth comparing how prediction probability of the correct class changes as we rotate the input image from \(\theta = \{0, 360\}\) by a step of \(1^{\circ}\). Fig 9. shows this comparison across all models for the sake of completeness. It is awesome to note that all models outperform <em>baseline</em> model in terms of rotational invariance.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/rotations_all.png" alt="rotations_all" width="800">
</div>
<div class="title">Figure 9. Prediction probability of the correct output class as a test input image is rotated from \(\theta = \{0, 360\}\) by step of \(1^{\circ}\) across models. Legend shows the percentage correct classifications across rotations (ascending order).</div>
</div>
<div class="paragraph">
<p>Instead of checking random images, lets compute these values for 1000 random test images and use a box plot to see <em>mean, std, min, max</em> of percentage correct classifications across all rotations. Interestingly, <em>8_rot_4</em>, <em>4_rot_4</em> show the best rotational invariance (higher mean, large <em>std</em> in positive direction). It is also somewhat mysterious why <em>8_rot_2</em>, <em>4_rot_2</em> is slightly worse than <code>baseline</code> compared to 1, 3, 4.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/rotations_box_all.png" alt="rotations_box_all" width="800">
</div>
<div class="title">Figure 10. Box plot of percentage correct classifications across all rotations for 1000 random test images (in ascending order of <em>mean</em>).</div>
</div>
<div class="paragraph">
<p>Can we get the benefits of 8/4 rot models without the slower convergence? What if we applied new conv model (8/4 rotations) to the <em>baseline</em> model weights? Fig 11. shows the results for a single image. It is very curious to note that <em>8_rot_4</em> and <em>4_rot_4</em> models yield zero accuracy. This seems to indicate that weights from the usual conv layer won&#8217;t translate well into rotated conv model (the errors across various layers add up). The error rates seem to recover as we remove rotational conv layers from the early layers, but there is a lot of noise since this is for a single test image.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/rot_baseline_all.png" alt="rot_baseline_all" width="800">
</div>
<div class="title">Figure 11. Prediction probability of the correct output class as a test input image is rotated from \(\theta = \{0, 360\}\) by step of \(1^{\circ}\) across models with <em>baseline</em> trained weights. Legend shows the percentage correct classifications across rotations (ascending order).</div>
</div>
<div class="paragraph">
<p>To get a clearer picture, lets compute the box plot by considering 1000 random test images as described earlier. Fig. 12 makes it pretty clear that we cannot just add rotation operations to a <em>regular</em> model after training. On the flip side, it provides evidence that rotational conv affects the weights.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural2/rot_baseline_box_all.png" alt="rot_baseline_box_all" width="800">
</div>
<div class="title">Figure 12. Box plot of percentage correct classifications across all rotations for 1000 random test images (in ascending order of <em>mean</em>). Note that all models use trained <em>baseline</em> weights.</div>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_conclusions">Conclusions</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The code to reproduce all the experiments is available on <a href="https://github.com/raghakot/deep-learning-experiments/tree/master/exp2">Github</a>. Feel free to reuse or improve.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Even though convergence is slower, 8 rotations seem to offer better invariance compared to 4 rotations. It would be interesting to see whether all models converge to the same accuracy eventually.</p>
</li>
<li>
<p>Filter skew does not seem to be an issue.</p>
</li>
<li>
<p>All models seem to exhibit the same seasonal pattern of accuracy with respect to rotation (see fig. 9). What&#8217;s with the weird dip at \(100^{\circ}\)?</p>
</li>
</ul>
</div>
<div class="paragraph">
<p><strong>EDITs</strong></p>
</div>
<div class="ulist">
<ul>
<li>
<p>It might be worth experimenting by concatenating activations from rotated filters instead of <em>max</em> operation in light of recent success with DenseNets.</p>
</li>
<li>
<p>Try out squeeze net style feature compression following concatenation of rotated filters.</p>
</li>
</ul>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>
</div>
</div>
<div id="footnotes">
<hr>
<div class="footnote" id="_footnote_1">
<a href="#_footnoteref_1">1</a>. First layer filters generally detect basic lines and textures, this is just a simplified view to illustrate the point
</div>
<div class="footnote" id="_footnote_2">
<a href="#_footnoteref_2">2</a>. In reality, convolution filters may detect objects that have no meaningful interpretation
</div>
<div class="footnote" id="_footnote_3">
<a href="#_footnoteref_3">3</a>. If you are not familiar with the idea of activation volumes, I would recommend referring to <a href="http://cs231n.github.io/convolutional-networks/">CS231n Convolutional Neural Networks for Visual Recognition</a> for an outstanding explanation.
</div>
</div>]]></description><link>https://raghakot.github.io/2017/01/09/Baking-rotational-invariance-into-a-neuron.html</link><guid isPermaLink="true">https://raghakot.github.io/2017/01/09/Baking-rotational-invariance-into-a-neuron.html</guid><category><![CDATA[deep learning]]></category><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Mon, 09 Jan 2017 00:00:00 GMT</pubDate></item><item><title><![CDATA[Exploring alternative neural computational models]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Despite different variations of neural networks architectures, the computational model of a neuron hasn&#8217;t changed since inception. Every neuron has \(n\) incoming inputs \(X = (x_{1}, x_{2}, &#8230;&#8203;, x_{n})\) with weights \(W = (w_{1}, w_{2}, &#8230;&#8203;, w_{n})\). A neuron then computes \(W.X = \sum_{i=0}^{n}w_{i}.x_{i}\), forwards it through some nonlinearity like sigmoid or ReLU to generate the output<sup class="footnote">[<a id="_footnoteref_1" class="footnote" href="#_footnote_1" title="View footnote.">1</a>]</sup>.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural1/neuron_model.jpeg" alt="neuron_model" width="400">
</div>
<div class="title">Figure 1. Computation model of a neuron</div>
</div>
<div class="paragraph">
<p>As far as i am aware, all architectures build on this fundamental computational model. Let us examine what the computation is conceptually doing in a step by step fashion.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Each input component \(x_{i}\) is being magnified or diminished when multiplied with a scaling factor \(w_{i}\). This can be interpreted as enhancing or diminishing the contribution of some pattern component \(x_{i}\) to the output decision. Think of these weights as <em>knobs</em> that range from \([-\infty, +\infty]\). The task of learning is to <em>tune</em> these knobs to the correct value.</p>
</li>
<li>
<p>The computation \(W.X\) is then enhancing/diminishing various input components of vector \(X\). In terms of linear algebra, the dot product of \(W\) and \(X\) can be interpreted as computing similarity. Consider \(X = (x_{1}, x_{2})\) and \(W = (w_{1}, w_{2})\). If you imagine plotting these vectors on a 2D graph paper, \(\frac{W.X}{|W||X|}\) is the cosine of angle between these two vectors. When these two vectors align (point in the same direction), the value of \(W.X\) is maximized indicating high degree of similarity. From this point of view, neuron is fundamentally a pattern matching unit which outputs a large value when it matches a specific pattern (encoded by weights) in the input<sup class="footnote">[<a id="_footnoteref_2" class="footnote" href="#_footnote_2" title="View footnote.">2</a>]</sup>.</p>
</li>
<li>
<p>In a complex neural network, every neuron is matching the input against some template encoded via weights and forwarding that result to the subsequent layers. In case of image recognition, the first layer neurons might match whether the input contains vertical edge, horizontal edge and so on. The circle detection neuron might use various edge matching computation results from previous layers to detect the shape patterns.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>So far so good. <em>What is the purpose of activation function?</em> To understand the motivation, consider the geometric perspective of the computation through the lens of linear algebra.</p>
</div>
<div class="paragraph">
<p>\( W.X = \begin{bmatrix} w_{1} \\ w_{2} \\ \vdots \\ w_{n} \end{bmatrix} \begin{bmatrix} x_{1} &amp; x_{2} &amp; \cdots &amp; x_{n} \end{bmatrix} \)</p>
</div>
<div class="paragraph">
<p>Geometrically, this can be interpreted as a linear transformation of \(X\). Additionally, there is a bias term to shift the origin. So, the computation \(W.X + b\) can be interpreted as an affine transformation. Without an non-linear activation function, multiple linear transformations within each layer \(W_{L}\) can be condensed into a single layer with \(W = \prod_{L}W_{L}\). A single layered linear neural network is pretty limited in its representational power.</p>
</div>
<div class="paragraph">
<p>Lets put all this together to interpret what a multi-layered neural network might be doing<sup class="footnote">[<a id="_footnoteref_3" class="footnote" href="#_footnote_3" title="View footnote.">3</a>]</sup>:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Suppose we input a \(100 \times 100\) image to a vggnet. The input tensor \(X\) can be thought of as a vector with 10000 basis. Each basis is representing some positional component in an image.</p>
</li>
<li>
<p>We know that the very first conv layer learns gabor filter like edge detection weight templates. Lets assume that the first convolutional layer filter generates \(100 \times 100\) output (assuming appropriate padding). This can be seen as <em>transforming</em> input from positional basis vector space to edge-like basis vector space. Instead of measuring the pixel intensity, the new vector space measures the similarity of the positional input patch with respect to a edge-detection templates (\(3 \times 3\) conv filter weights). Extrapolating on this notion, every layer <em>transforms</em> the data into a more meaningful basis vector space, eventually leading to basis vectors comprising of output categories.</p>
</li>
</ul>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_motivation">Motivation</h2>
<div class="sectionbody">
<div class="paragraph">
<p>In the current computational model, the output of a neuron is different even though \(X\) and \(W\) perfectly align with each other but differ in magnitudes. As each \(w_{i}\) has range \([-\infty, \infty]\), backprop is essentially searching for this weight vector spanning all of the weight vector space.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural1/vector_similarity.png" alt="vector_similarity" width="300">
</div>
<div class="title">Figure 2. Similarity between vectors X and W</div>
</div>
<div class="paragraph">
<p>If we view a neuron as fundamental template matching unit, then various magnitude (length) of \(X\) and \(W\) should yield the same output. If we constrain \(W\) to be a unit vector (\(|W| = 1\)), then backprop would only need to search along points on a hypersphere of unit radius. Although the hypersphere contains infinitely many points, with sufficient discretization, the search space intuitively <em>feels</em> a lot smaller than the entire weight vector space.</p>
</div>
<div class="paragraph">
<p>Considering that weight sharing (a form on constraint) in conv-nets led to huge improvements, constraining seems like a reasonable approach. Weight regularization - a form of constraint - penalizes large \(|W|\) and vaguely fits the idea of encouraging weight updates to explore points along the hypersphere.</p>
</div>
<div class="paragraph">
<p>I think that in general, constraining as a research direction is largely unexplored. For example, it might be interesting to try and constrain conv filters to be as <em>distinct</em> and <em>diverse</em> and possible. I will follow up on these investigations in a future blog.</p>
</div>
<div class="paragraph">
<p>Getting back to the topic at hand, there are several options for constraining \(|W| = 1\).</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p>Penalize when \(|W|\) deviates from 1 as a regularization loss.</p>
</li>
<li>
<p>Represent \(W\) in parametric form. Each \(w_{i}\) can be calculated using n-dim <a href="https://en.wikipedia.org/wiki/N-sphere#Spherical_coordinates">spherical coordinates</a>.</p>
</li>
<li>
<p>Instead of computing \(W.X\), compute \(\frac{W}{|W|} \times X\). Normalizing \(W\) ensures that the output does not change due to scaling. Consequently, the gradient of output with respect to \(W\) can only exist if \(W\) changes along the unit hypersphere.</p>
</li>
</ol>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_experimental_setup">Experimental Setup</h2>
<div class="sectionbody">
<div class="paragraph">
<p>I explored option 3 as it was the simplest to implement. The setup is as follows.</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Mini vggnet (1,250,858 parameters) comprising of \(3 \times 3\) convolutions with ReLU activation.</p>
</li>
<li>
<p>cifar10 dataset augmented with 10% random shifts along image rows/cols along with a 50% chance of horizontal flip.</p>
</li>
<li>
<p>batch size = 32, epochs = 200 with early stopping.</p>
</li>
<li>
<p>Adam optimizer with initial learning rate of \(10^{-3}\), and reduced by a factor of \(10^{-1}\) when convergence stalls.</p>
</li>
<li>
<p><code>random_seed = 1337</code> for reproducibility.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>\(W_{norm}\) is calculated as:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-python" data-lang="python"># 1e-8 is used to prevent division by 0
W_norm = W / (tf.sqrt(tf.reduce_sum(tf.square(W), axis=[0, 1, 2], keep_dims=True)) + 1e-8)</code></pre>
</div>
</div>
<div class="paragraph">
<p><strong>Experiments</strong></p>
</div>
<div class="paragraph">
<p>Weight normalization can be applied to both <code>Convolutional</code> and <code>Dense</code> layers. I tried all four combinations.</p>
</div>
<div class="olist arabic">
<ol class="arabic">
<li>
<p><strong>baseline</strong>: conv, dense</p>
</li>
<li>
<p><strong>norm_conv</strong>: conv_norm, dense</p>
</li>
<li>
<p><strong>norm_dense</strong>: conv dense_norm</p>
</li>
<li>
<p><strong>norm_conv_dense</strong>: conv_norm, dense_norm</p>
</li>
</ol>
</div>
<div class="paragraph text-center">
<p>Test model
image::alt_neural1/model.png[test_model, 300]</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_discussion">Discussion</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Convergence graphs for all four variations are shown in fig. As hypothesized, convergence improves with internal weight normalization. More experimentation is needed to see if this holds across various models and datasets. Intriguingly, the benefits are lost when both 'Conv` and <code>Dense</code> layers are normalized. It is as if they 'cancel out'. I am not sure why this happens.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/alt_neural1/convergence.png" alt="convergence_graphs" width="800">
</div>
<div class="title">Figure 3. Convergence graphs on validation set for all four variants.</div>
</div>
<div class="paragraph">
<p>Additionally, the use of <code>ReLU</code> which effectively attenuates negative values. This limits the neuron to only communicate information when the angle between \(X\) and \(W\) lies between \([-\frac{\pi}{2}, \frac{\pi}{2}]\). Perhaps it is useful if a neuron could also communicate the <em>lack of</em> similarity or the <em>inverse</em> of weight template information. For example, the lack of a specific stripe pattern might increase the networks confidence that the output is more likely to be one cat species over another.</p>
</div>
<div class="paragraph">
<p>One way to remedy this problem might be to use an activation function that allows negative values. A quick experiment with <code>ELU</code> activation, however, did not result in significant improvement.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_reproducibility">Reproducibility</h2>
<div class="sectionbody">
<div class="paragraph">
<p>The code to reproduce all the experiments in this blog is available on <a href="https://github.com/raghakot/deep-learning-experiments/tree/master/exp1">Github</a>. Feel free to reuse or improve.</p>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>
</div>
</div>
<div id="footnotes">
<hr>
<div class="footnote" id="_footnote_1">
<a href="#_footnoteref_1">1</a>. Technically, bias is involved, but i am excluding it to keep the discussion focused.
</div>
<div class="footnote" id="_footnote_2">
<a href="#_footnoteref_2">2</a>. The correct weight vectors are learned using backpropogation.
</div>
<div class="footnote" id="_footnote_3">
<a href="#_footnoteref_3">3</a>. This is my own interpretation and might as well be incorrect.
</div>
</div>]]></description><link>https://raghakot.github.io/2017/01/03/Exploring-alternative-neural-computational-models.html</link><guid isPermaLink="true">https://raghakot.github.io/2017/01/03/Exploring-alternative-neural-computational-models.html</guid><category><![CDATA[deep learning]]></category><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Tue, 03 Jan 2017 00:00:00 GMT</pubDate></item><item><title><![CDATA[Ultrasound nerve segmentation challenge on Kaggle]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>A while ago, kaggle hosted the <a href="https://www.kaggle.com/c/ultrasound-nerve-segmentation">ultrasound nerve segmentation challenge</a>, which requires partipants to predict the nerve area (brachial plexus) in a given Ultrasound image. The task is to predict the segmentation mask for the the brachial plexus. Even my own neural network (brain) finds it difficult to spot patterns in these images. Can you spot the pattern?</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/ultrasound/example.jpg" alt="sample">
</div>
<div class="title">Figure 1. Sample ultrasound and segmentation mask</div>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/ultrasound/contour.gif" alt="contour">
</div>
<div class="title">Figure 2. Various ultrasound images overlayed with segmentation contours.</div>
</div>
<div class="paragraph">
<p>What makes this challenge particularly interesting is that ~60% of images do not contain the brachial plexus, i.e., segmentations mask is empty. Since the competition uses the mean of <a href="https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient">dice coefficient</a> across all samples as the evaluation metric, even a single pixel in segmentation mask for non-nerve images mask kills the score. Ironically, you can achieve a score of ~0.6 just by predicting zero masks; it drops to ~[0.35, 0.45] when you try to learn it using a standard <a href="https://arxiv.org/pdf/1505.04597.pdf">U-Net</a>. This, combined with nosiy ultrasound images and a lack of obvious nerve patterns, intrigued me enough to participate in the challenge.</p>
</div>
<div class="quoteblock">
<blockquote>
<div class="paragraph">
<p>Unlike most kaggle posts, this article is more about the thought-process involved in coming up with a solution. I find it distasteful when folks just hand out a giant blob of code on github or just post their final complex solution that somehow gives good results.</p>
</div>
</blockquote>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_step1_quick_and_dirty_baseline">Step1: Quick and dirty baseline</h2>
<div class="sectionbody">
<div class="paragraph">
<p>My methodology is always to first start with a quick and simple baseline model and setup and end-end working code. In case of this challenge, this means:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Data related I/O - Loading, train/test splits, optional preprocessing blocks</p>
</li>
<li>
<p>Baseline model: I stated with a U-Net as it seemed like the state of the art model for these kinds of data. I also retrofitted it with batch normalization, ELU activation, and dropout.</p>
</li>
<li>
<p>Clean code: Try to organize code into classes. Having function blocks everywhere makes the code messy and error prone.</p>
</li>
<li>
<p>Submission code: Code to generate submission file.</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>As mentioned in the overview, this tends to give mean dice score of ~[0.35, 0.45], which is pretty abysmal.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_step2_data_exploration">Step2: Data exploration</h2>
<div class="sectionbody">
<div class="paragraph">
<p>Now that an end-end working code is setup. We can work on optimizing individual blocks. The most obvious strategy is to get a better understanding of dataset at hand. The idea is to leverage insights from the dataset to improve/build the conv-net architecture and perhaps use dataset specific idiosynchronicities to provide better training signal.</p>
</div>
<div class="paragraph">
<p>Initial manual exploration revealed contradictory examples. i.e., two very similar images have different labels, one with a valid segmentation mask while the other doesn&#8217;t.</p>
</div>
<div class="imageblock text-center">
<div class="content">
<img src="https://raghakot.github.io/images/ultrasound/contradictory_samples.png" alt="contour" width="600">
</div>
<div class="title">Figure 3. Contradictory samples</div>
</div>
<div class="paragraph">
<p>Lets try to find all such near duplicates. High level idea is as follows:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>Divide image into blocks, say (20, 20).</p>
</li>
<li>
<p>Compute histogram over various blocks. The concatenated set of histograms represents the image <em>hash</em>.</p>
</li>
<li>
<p>Compute pairwise cosine similarity between image hashes and find ones within some reasonable threshold.</p>
</li>
</ul>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-python" data-lang="python">import skimage.util
import scipy.spatial.distance as spdist

def compute_img_hist(img):
    # Divide the image in blocks and compute per-block histogram
    blocks = skimage.util.view_as_blocks(img, block_shape=(20, 20))
    img_hists = [np.histogram(block, bins=np.linspace(0, 1, 10))[0] for block in blocks]
    return np.concatenate(img_hists)

# Compute hashes over all images
hashes = np.array(map(compute_img_hist, imgs))

# pairwise distance matrix
dist = spdist.squareform(spdist.pdist(hashes, metric='cosine'))

# find duplicates. +np.eye because we want to exclude image matching with itself.
# 0.008 is determined after some trial and errors.
close_pairs = dists + np.eye(dists.shape[0]) &lt; 0.008</code></pre>
</div>
</div>
<div class="paragraph">
<p>The number of samples with inconsistent labeling is a whooping ~40% of the training set!!! Such a large number of inconsistent samples is bound to put an upper limit on the accuracy of the model. I think part of the problem lies with how the dataset was setup. Typically, doctors use ultrasound video frames to localize the nerve. Finding the nerve with a single frame is extremely challenging, even for human experts. I think it would be more interesting if they included videos instead.</p>
</div>
<div class="paragraph">
<p>The training set is small as is, just 5635 samples. After discards, we are only left with 3398 samples. This certainly doesn&#8217;t look good for training a deep model as they require lots of training data. On top of that, we cannot use transfer learning either as there are no pretrained-nets on this sort of data.</p>
</div>
<div class="paragraph">
<p>Despite all this, the cleaned raining data manages to improve baseline model score to ~0.55.</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_step3_real_work_begins">Step3: Real work begins&#8230;&#8203;</h2>
<div class="sectionbody">
<div class="admonitionblock note">
<table>
<tr>
<td class="icon">
<i class="fa icon-note" title="Note"></i>
</td>
<td class="content">
This article is still a work in progress. Check back for updates.
</td>
</tr>
</table>
</div>
<div class="paragraph">
<p>Since the beginning, it was clear that i had to do something about the weird score evaluation. Even a single pixel predicted on the segmentation output for non-nerve images was going to hurt the score. A pretty straight forward strategy was to add a <code>Dense</code> layer towards the end of encoder an predict whether the image contains the mask or not. Lets call this value <code>has_mask</code></p>
</div>
<div class="paragraph">
<p>During test/submission time, I could simply zero out all the segmentation mask values if <code>has_mask = False</code>.</p>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>
</div>
</div>]]></description><link>https://raghakot.github.io/2016/12/26/Ultrasound-nerve-segmentation-challenge-on-Kaggle.html</link><guid isPermaLink="true">https://raghakot.github.io/2016/12/26/Ultrasound-nerve-segmentation-challenge-on-Kaggle.html</guid><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Mon, 26 Dec 2016 00:00:00 GMT</pubDate></item><item><title><![CDATA[Facebook hacker cup: Power Overwhelming]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>You are to inflict maximum damage to the zerg army. There are two types of units - Warrior and Shield. Warriors do damage every second, while a shield protects your entire army for one second. Your army is instantly overrun after the shield generators expire. Given \(G\) cost to build a shield, \(W\) cost to build a warrior and total money \(M\), how many shields would you build?</p>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_solution">Solution</h3>
<div class="paragraph">
<p>Let \(X\) and \(Y\) be the optimal number of generators and number of warriors to be built, respectively. Let&#8217;s start with a simple concrete example. Suppose shields and warriors both cost 1 unit and you have total money of 5 units. Also assume 1 unit of damage per warrior. What is the optimum value of \(X\) and \(Y\)? It would be optimum if you can inflict maximum damage. With 5 units of money, you can buy shields/warriors in the following combinations.</p>
</div>
<table class="tableblock frame-all grid-all spread">
<caption class="title">Table 1. Combinations of \(X, Y\)</caption>
<colgroup>
<col style="width: 33.3333%;">
<col style="width: 33.3333%;">
<col style="width: 33.3334%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">X</th>
<th class="tableblock halign-left valign-top">Y</th>
<th class="tableblock halign-left valign-top">Damage</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">1</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">4</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">4</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">2</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">3</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">6</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">3</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">2</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">6</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">4</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">1</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">4</p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>In this case, the optimal choice of \(X, Y\) seem to be \((2, 3)\) and \((3, 4)\) respectively, both of which maximize the product \(XY\). In the general case, the cost to buy \(X\) generators is \(XG\), cost to buy \(Y\) warriors is \(YW\). Since we are limited by \(M\) amount of money, \(X, Y\) must satisfy \(XG + YW \le M\). This can be represented as a line and the inequality encapsulates a region for candidate \((X, Y)\) values.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="https://raghakot.github.io/images/post2/fig1.png" alt="fig1.png">
</div>
<div class="title">Figure 1. Candidate solution region</div>
</div>
<div class="paragraph">
<p>We want \(X, Y \mid \arg\max{XY} \). Geometrically, \(XY\) represents the area of the rectangle shaded in blue.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="https://raghakot.github.io/images/post2/fig2.png" alt="fig2.png">
</div>
<div class="title">Figure 2. The area to be maximized</div>
</div>
<div class="paragraph">
<p>We need to find integer values of \(X, Y\) that maximize this area.</p>
</div>
<div class="paragraph">
<p>How do we go about doing that?
We know that the line will intersect X and Y axis at \(\frac{M}{G}, \frac{M}{W}\).
We also know that the optimal rectangle touches the line. If it doesn&#8217;t, the area can be trivially increased by increasing either/both \(X, Y\) values until it touches the line.</p>
</div>
<div class="paragraph">
<p>The boring calculus way for figuring this out is as follows:</p>
</div>
<div class="paragraph">
<p>\(Area(A) = XY \\
A = \frac{M - YW}{G}Y \\
\\
\frac{\partial A}{\partial Y} = \frac{Y(M - WY)}{G} \\
\arg\max{A} \scriptstyle \implies \frac{Y(M - WY)}{G} = 0 \\
Y = \frac{M}{2W} \)</p>
</div>
<div class="paragraph">
<p>This gives corresponding \(X = \frac{M}{2G}\)</p>
</div>
<div class="paragraph">
<p>A more interesting way to arrive at the same conclusion would be to think as follows:</p>
</div>
<div class="ulist">
<ul>
<li>
<p>The shaded triangle increases its area as long as X and Y increase.</p>
</li>
<li>
<p>Start with \((0, 0)\) and increase both \(X, Y\) by 1. This gives us a small rectangle with bottom left \((0, 0)\) and top right \((1, 1)\).</p>
</li>
<li>
<p>If the straight line was X + Y = 1, it would be a right angled isosceles triangle. The \(X, Y\) value would always increase in the direction of line \(Y = X\), which would cut the line in middle at \((\frac{X}{2}, \frac{Y}{2})\)</p>
</li>
<li>
<p>Generalizing this, the value of \(X. Y\) will increase as long as we go on the direction of line that joins \((0, 0)\) to midpoint of the line \(XG + YW = M\). The mid point of this line is, unsurprisingly, \((\frac{M}{2G}, \frac{M}{2W})\)</p>
</li>
</ul>
</div>
<div class="paragraph">
<p>There are other ways to arrive at this. Perhaps I should write a post that exclusively deals with this problem.</p>
</div>
<div class="paragraph">
<p>Coming back to our original question, the optimal number of shields must be \(\frac{M}{2G}\). If, \(\frac{M}{2G}\) is not an integer, we go towards the origin on the line that joins \((0, 0)\) and \((\frac{M}{2G}, \frac{M}{2W})\). However, I was able to get away by taking \(\lfloor\frac{M}{2G} \rfloor\).</p>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>
</div>]]></description><link>https://raghakot.github.io/2011/01/15/Facebook-hacker-cup-Power-Overwhelming.html</link><guid isPermaLink="true">https://raghakot.github.io/2011/01/15/Facebook-hacker-cup-Power-Overwhelming.html</guid><category><![CDATA[competitions]]></category><category><![CDATA[ migrated]]></category><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Sat, 15 Jan 2011 00:00:00 GMT</pubDate></item><item><title><![CDATA[Gotcha with fitness function design]]></title><description><![CDATA[<div class="paragraph">
<p>Typically, a genetic algorithm works by the notion of maximizing the fitness. Consider a function \(y = x\), which is to be minimized in the interval \([-5, 5]\). One approach is use \(\frac{1}{x}\) as the fitness function. Intuitively, by maximizing \(\frac{1}{x}\), we are minimizing \(y = x\). However, a plot of \(\frac{1}{x}\) reveals some serious flaws.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="https://raghakot.github.io/images/post3/fig1.png" alt="fig1.png">
</div>
<div class="title">Figure 1. Plot of \(y = \frac{1}{x}\)</div>
</div>
<div class="paragraph">
<p>If we move from the right, the maximum occurs at \(x = 0\) instead of \(x = -5\). Why? because \(\frac{1}{x}\) is not differentiable at \(x = 0\). Always make sure that the fitness/loss function is differentiable!
In this case, it is better to use \(y = -x\) as the fitness function.</p>
</div>
<div class="paragraph">
<p>In general, if we are seeking to minimize \(y = f(x)\), where \(f(x)\) is differentiable, then it is safer to use \(y = -f(x)\) as the fitness function. Probably very obvious, but I got burned by this.</p>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>]]></description><link>https://raghakot.github.io/2011/01/11/Gotcha-with-fitness-function-design.html</link><guid isPermaLink="true">https://raghakot.github.io/2011/01/11/Gotcha-with-fitness-function-design.html</guid><category><![CDATA[machine learning]]></category><category><![CDATA[ migrated]]></category><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Tue, 11 Jan 2011 00:00:00 GMT</pubDate></item><item><title><![CDATA[Facebook hacker cup: Geometric approach to double squares problem]]></title><description><![CDATA[<div id="preamble">
<div class="sectionbody">
<div class="paragraph">
<p>Given an integer \(N\), we need to find the number of integral pairs \((x, y) \mid x^2 + y^2 = N\), \(N \in [0, 2147483647]\), with a maximum of 100 such numbers in a file.</p>
</div>
</div>
</div>
<div class="sect2">
<h3 id="_solution">Solution</h3>
<div class="paragraph">
<p>Geometrically, we know that \(x^2 + y^2 = r\) represents a circle centered at \((0,0)\) with a radius \(r\).  So, the problem can be interpreted visually as finding all possible solutions where \(x, y\) are positive, i.e., lie in the first quadrant.</p>
</div>
<div class="paragraph">
<p>Testing all possible integral pairs of \((x, y)\) upto \(N\) is equivalent to searching within the grayed geometric square as illustrated.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="https://raghakot.github.io/images/post1/fig1.png" alt="fig1.png">
</div>
<div class="title">Figure 1. Geometric view of double squares problem</div>
</div>
<div class="paragraph">
<p>Algorithmically, this approach takes \(O(n^2)\) time. Consider the sample input as shown below (this was the file I received)</p>
</div>
<div class="listingblock">
<div class="content">
<pre>20
1105
65
2147483646
1257431873
25
1000582589
1148284322
5525
0
1475149141
858320077
1022907856
1041493518
3
1215306625
372654318
160225
5928325
2147483643
1538292481</pre>
</div>
</div>
<div class="paragraph">
<p>On my machine, brute force approach takes approx 14 minutes (!!!) to solve this input. Clearly, we are exploring a lot of unwanted points. One obvious way to improve this is to explore points bounded by the circle as as indicated by the shaded blue square.</p>
</div>
<div class="imageblock">
<div class="content">
<img src="https://raghakot.github.io/images/post1/fig2.png" alt="fig2.png">
</div>
<div class="title">Figure 2. Improving on naive search</div>
</div>
<div class="paragraph">
<p>This makes intuitive sense since \(x, y\) cannot exceed \(\sqrt{n}\). This reduces the complexity from \(O(n^2)\) to \(O(n)\) reducing execution time from an earlier 14 minutes to 80.5 secs.</p>
</div>
<div class="paragraph">
<p>Can we do any better? Why explore the entire blue square? It is sufficient if we examine all points on the arc of the circle in the first quadrant. i.e, increment \(x\) from \([0, \sqrt{n}]\), compute corresponding \(y = \sqrt{n - x^2}\) and check if it is an integer. \(y\) is an integer if \(\lfloor y \rfloor == y\).</p>
</div>
<div class="paragraph">
<p>This looks pretty good. Is there anything else that can be improved upon? Well, if you notice the figure carefully, we don&#8217;t have to iterate \(x\) from \([0, \sqrt{n}]\). All points on one side of line \(y = x\) are mirror images of points on the other side. i.e., if we find a point, day \((3, 4)\) then we will find a mirror \((4, 3)\) on the other side. Since the problem doesn&#8217;t differentiate between these two solutions, it is sufficient if we iterate \(x\) from \([0, x']\), where \(x'\) is given by \(\sqrt{n} \cos{45} = \sqrt{n/2}\).</p>
</div>
<div class="imageblock">
<div class="content">
<img src="https://raghakot.github.io/images/post1/fig3.png" alt="fig3.png">
</div>
<div class="title">Figure 3. Line \(y = x\) acts as a mirror</div>
</div>
<div class="paragraph">
<p>With this, we can complete the algorithm in java as follows:</p>
</div>
<div class="listingblock">
<div class="content">
<pre class="highlight"><code class="language-java" data-lang="java">int getNumSumSquares(final int n)
{
	if(n == 0) {
		return 1;
    }

	final int iterations = (int) Math.ceil(Math.sqrt((n * 1.0)/2)) + 1;
	double y;
	int count = 0;
	for(int x = 0; x &lt;= iterations; x++)
	{
		y = Math.sqrt(n - (x*x));
		// check if y is an integer
		if (Math.floor(y) == y) {
			count++;
        }
	}
	return count;
}</code></pre>
</div>
</div>
<div class="paragraph">
<p>This algorithm works in \(O(\sqrt{n/2}) \sim O(\sqrt{n})\). This only takes 0.047 secs to execute!</p>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>
</div>]]></description><link>https://raghakot.github.io/2011/01/11/Facebook-hacker-cup-Geometric-approach-to-double-squares-problem.html</link><guid isPermaLink="true">https://raghakot.github.io/2011/01/11/Facebook-hacker-cup-Geometric-approach-to-double-squares-problem.html</guid><category><![CDATA[competitions]]></category><category><![CDATA[ migrated]]></category><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Tue, 11 Jan 2011 00:00:00 GMT</pubDate></item><item><title><![CDATA[Kleiber's Law]]></title><description><![CDATA[<div class="paragraph">
<p>Last week, I happened to read about Kleiber&#8217;s law while browsing through literature on natural evolution. The implications are really fascinating. It establishes a relationship between mass and metabolism as:</p>
</div>
<div class="paragraph">
<p>\(metabolism = mass^{\frac{3}{4}}\)</p>
</div>
<div class="paragraph">
<p>Metabolism is ultimately linked to the number of heartbeats since the heart regulates the supply of oxygen. Therefore, \(\mid heartbeat \mid \propto mass\). Curiously, the number of heartbeats per lifetime tends to be constant. This means that bigger/heavier animals tend to have slower metabolism with lower heart rate compared to, say, flies, with faster metabolism.</p>
</div>
<div class="paragraph">
<p>Ironically, if we have fixed number of heartbeats, wouldn&#8217;t running/exercising make us die faster? I suppose, the long term benefits outweigh shot term loss.</p>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>]]></description><link>https://raghakot.github.io/2010/11/17/Kleibers-Law.html</link><guid isPermaLink="true">https://raghakot.github.io/2010/11/17/Kleibers-Law.html</guid><category><![CDATA[light bulb]]></category><category><![CDATA[ migrated]]></category><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Wed, 17 Nov 2010 00:00:00 GMT</pubDate></item><item><title><![CDATA[Predicting hand position on the keyboard by observing random text]]></title><description><![CDATA[<div class="paragraph">
<p>Today, I happened to type some random text and noticed interesting. This is a random sample of what i typed:</p>
</div>
<div class="listingblock">
<div class="content">
<pre>gh
jgh
jg
hjg
hjg
hjg
hjg
hjg
hkg
hjg
hg
jg
hjg
hjg
hg
hjg
hg
hjg
h
gh
gh
ghj
ghj
ghj
g
hjg
khg
hg
hg
hg
hjg
hjg
hjg
hjg
h
ghj</pre>
</div>
</div>
<div class="paragraph">
<p>Notice how 'h' repeats a lot. It so happens that my middle finger was on 'h'. Perhaps, frequency is somehow linked to the length of finger? See table below. I used the keys G, H, J, K. My index finger was on G, middle on H, ring finger on J and little finger on K.</p>
</div>
<table class="tableblock frame-all grid-all spread">
<caption class="title">Table 1. Frequency statictics</caption>
<colgroup>
<col style="width: 33.3333%;">
<col style="width: 33.3333%;">
<col style="width: 33.3334%;">
</colgroup>
<thead>
<tr>
<th class="tableblock halign-left valign-top">Character ordered by frequency</th>
<th class="tableblock halign-left valign-top">Actual finger on character</th>
<th class="tableblock halign-left valign-top">Finger ordered by length</th>
</tr>
</thead>
<tbody>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">H</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">Middle</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">Middle</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">J</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">Ring</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">Ring</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">G</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">Index</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">Index</p></td>
</tr>
<tr>
<td class="tableblock halign-left valign-top"><p class="tableblock">K</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">Little</p></td>
<td class="tableblock halign-left valign-top"><p class="tableblock">Little</p></td>
</tr>
</tbody>
</table>
<div class="paragraph">
<p>You can try this on your own. Place your fingers on the keyboard (horizontally, any other orientation complicates the situation as relative length changes). Probability theory suggests that the chance of occurrence of G, H, J or K is 1/4. But I think that in this case, probability is somehow weighted, proportional to the length of the finger.</p>
</div>
<div class="paragraph">
<p>This should allow us to predict hand position on a keyboard based on a bunch of randomly pressed keystrokes.</p>
</div>
<link rel="stylesheet" type="text/css" href="../../../extras/inlineDisqussions.css" />

<script type="text/javascript">
  (function defer() {
    if (window.jQuery) {
      jQuery(document).ready(function() {
          disqus_shortname = 'raghakot-github-io';
          jQuery("p, img").inlineDisqussions();
      });
    } else {
      setTimeout(function() { defer() }, 50);
    }
  })();
</script>]]></description><link>https://raghakot.github.io/2010/09/24/Predicting-hand-position-on-the-keyboard-by-observing-random-text.html</link><guid isPermaLink="true">https://raghakot.github.io/2010/09/24/Predicting-hand-position-on-the-keyboard-by-observing-random-text.html</guid><category><![CDATA[light bulb]]></category><category><![CDATA[ migrated]]></category><dc:creator><![CDATA[Raghavendra Kotikalapudi]]></dc:creator><pubDate>Fri, 24 Sep 2010 00:00:00 GMT</pubDate></item></channel></rss>