Ever since AlexNet was introduced in 2012, neural net research landscape fundamentally changed. With deep computational graphs, the possibilities are endless. We quickly iterated through a dizzying number of architectures such as AlexNet, VGGNet, NIN, Inception, ResNets, FractalNet, Xception, DenseNets and so on. With deep learning, architecture engineering is the new feature engineering and it clearly plays an important role in advancing the state of the art. Instead of making incremental improvements, is there a way to learn the connectivity pattern itself?
First step is to make this more concrete. Ideally, we want to learn connectivity amongst individual neurons; instead, lets simplify the problem by constraining ourselves to known layer Lego blocks (by layer Lego blocks, I mean general purpose computational layers such as Convolutional Layer, LSTM, Pooling etc.). Given a bunch of Layer Legos \(L = \{L_{1}, L_{2}, \cdots, L_{n}\}\), our task is to learn how these should be assembled. First thing that comes to mind is evolutionary search, but that is computationally prohibitive. A quicker and more elegant alternative would be to learn this by computing the gradient of architecture with respect to loss. The challenge is to make architecture itself differentiable.
Architecture can mean a lot of things. For now, let us constrain ourselves to connectivity patterns. Given some \(n\) layers, what is the optimal connectivity amongst them? The naive way of approaching this problem would be to try out all \(n(n-1)\) possible connection configurations^{[1]}. There is a resnet, densenet or some other future net hidden in there.
Bruteforce search is not such a bad option. Cifar10/100 networks are quite short (relatively speaking). If you have access to huge compute, it might take a month to try out all possible connectivities. Hopefully we can extrapolate the recurring blocks and apply it to imagenet and get a new SOTA results (wishful thinking). While practical, this is a rather boring way to solve the problem. What if we can learn the connections via gradient descent? For this to work, we need a way to compute \(\frac{\partial (loss)}{\partial (connection)}\).
Connectivity is inherently binary. This is problematic for applying gradient descent because of the discontinuity. Notice the similarities with attention? We can employ reinforcement learning like hard attention or try soft attention strategy. In order to make connectivity continuous, we can make it weighted, similar to forget gate in LSTMs^{[2]}. To bias the connectivity mostly towards 0 or 1, we can apply use sigmoid weighted connectivity. To recap, the strategy is to start out with a fully connected net where evry layer is connected to every other layer via sigmoid weighted param. The weights can be learned via backprop.
Cool. There is one slight technicality. Backward connections introduce cycles in the graph^{[3]}. Excluding backward connections, every layer can have incoming inputs from all previous layers, which leaves us with \(\frac{n(n-1)}{2}\) possible connections. This is exactly what DenseNets do!, except that we arrived at a similar architecture with a different motivation.
Unlike DenseNets, we want the connections to be gated so that they can be pruned if needed. There are a couple of possibilities for how \(L_{i}\) is connected to \(L_{j}\).
Shared/Unique feature weights: We can either associate weights \(W = [w_{1}, w_{2}, \cdots, w_{f}]\) for \(L_{i}\) with \(f\) feature maps or use a single weight for all feature maps of an incoming input. Lets call this shared/unique feature weight scheme.
Input aggregation: For layer \(L_{j}\), we need a way to aggregate weight feature maps from previous layers. Some possibilities include:
DenseNet style concat: i.e, simply concatenate previous layer weighted feature maps. This has the limitation that concatenation is not viable when the size of feature maps differ. For this we can either try to pad smaller feature maps with 0’s to make them all the same size as the largest feature map or resize all feature maps to the smallest feature map by pooling/interpolation/some other form of down sampling. Both approaches have their own pros and cons. The first is computationally expensive (since convnets have to slide over larger feature maps), while the second one is lossy. I have a feeling that lossy would end up being the winner.
ResNet style concat: i.e., squish weighted input feature maps via max or sum. avg, prod, min dont seem like good ideas for obvious reasons. I like max better than resnet style sum, since it is equivalent to focusing on input patterns that matched with a filter (high output value) across various feature maps. A high value at some \((row, col)\) of the filter might indicate the presence of some abstract shape detected by previous convolutional layers.
Squeeze net style concat: Lastly, we can also try squeeze net strategy of squishing \(n\) concatenated input feature maps (after reshaping to same size via padding or down sampling) down to \(m\) feature maps where \(m \le n\).
Phew! That completes our setup. The idea is super simple. Start out with an fully connected network where every layer gets inputs from all the previous layers in a weighted fashion. Before weighing, we use \(sigmoid(W)\) to bias the values mostly towards 0 or 1. When trained end-end via backprop, the network should, in-principle, learn to prune unwanted connections.
We can also try doing couple of other things.
Take the converged connectivity pattern and try it exclusively by removing connections with small weights. If \(w = 0.002\) or something like that, we know that the network is trying really hard to get rid of that connection. So might as well see if the performance is better when we start out by removing it.
Examine the evolution of weights during training. Do they always converge in the same manner? If they don’t, that would be concerning.
Try different connection weight initialization schemes. \(w = 1\) would mean everything starts out fully connected. Alternatively we can set \(w\) from a Gaussian distribution centered around 0.5. We can try other funky things like setting initial feed forward stack \(L_{i-1}\) → \(L_{i}\) weight to 1 and rest to 0.5.
Layers = [Conv, Conv, MaxPool, Dropout] \(\times\) 2-3 blocks of variable feature sizes.
cifar10 dataset augmented with 10% random shifts along image rows/cols along with a 50% chance of horizontal flip.
random_seed = 1337
for reproducibility.
Training for 200 epochs, batch size of 32 using categorical cross-entropy loss with Adam optimizer.
A quick summary of these ideas are translated into concrete implementation. A complete implementation can be found on my github.
Creating fully connected net. i.e, a layer is connected to all prev layer outputs. This is not as hard as it appears.
def make_fully_connected(x, layers):
inputs = [x]
for layer in layers:
x = layer(x)
inputs.append(x)
# This is the part where we resize inputs to be of same shape and merge them in resnet or densenet style
return x
Merging. i.e., resizing prev layer outputs to be of the same shape and concatenating them in densenet or resnet style. We also want to weigh merged outputs so that those weights can be learned during backprop. The easiest way to do this in keras is to create a custom layer^{[4]}.
import numpy as np
import tensorflow as tf
from keras import backend as K
from keras.layers import Lambda, Layer
class Connection(Layer):
"""Takes a list of inputs, resizes them to the same shape, and outputs a weighted merge.
"""
def __init__(self, init_value=0.5, merge_mode='concat',
resize_interpolation=tf.image.ResizeMethod.BILINEAR,
shared_weights=True):
"""Creates a connection that encapsulates weighted connection of input feature maps.
:param init_value: The init value for connection weight
:param merge_mode: Defines how feature maps from inputs are aggregated.
:param resize_interpolation: The downscaling interpolation to use. Instance of `tf.image.ResizeMethod`.
Note that ResizeMethod.AREA will fail as its gradient is not yet implemented.
:param shared_weights: True to share the same weight for all feature maps from inputs[i].
False creates a separate weight per feature map.
"""
self.init_value = init_value
self.merge_mode = merge_mode
self.resize_interpolation = resize_interpolation
self.shared_weights = shared_weights
super(Connection, self).__init__()
def _ensure_same_size(self, inputs):
"""Ensures that all inputs match last input size.
"""
# Find min (row, col) value and resize all inputs to that value.
rows = min([K.int_shape(x)[1] for x in inputs])
cols = min([K.int_shape(x)[2] for x in inputs])
return [tf.image.resize_images(x, [rows, cols], self.resize_interpolation) for x in inputs]
def _merge(self, inputs):
"""Define other merge ops like [+, X, Avg] here.
"""
if self.merge_mode == 'concat':
# inputs are already stacked.
return inputs
else:
raise RuntimeError('mode {} is invalid'.format(self.merge_mode))
def build(self, input_shape):
""" Create trainable weights for this connection
"""
# Number of trainable weights = sum of all filters in `input_shape`
nb_filters = sum([s[3] for s in input_shape])
# Create shared weights for all filters within an input layer.
if self.shared_weights:
weights = []
for shape in input_shape:
# Create shared weight, make nb_filter number of clones.
shared_w = K.variable(self.init_value)
for _ in range(shape[3]):
weights.append(shared_w)
self.W = K.concatenate(weights, axis=-1)
else:
self.W = K.variable(np.ones(shape=nb_filters) * self.init_value)
self._trainable_weights.append(self.W)
super(Connection, self).build(input_shape)
def call(self, layer_inputs, mask=None):
# Resize all inputs to same size.
resized_inputs = self._ensure_same_size(layer_inputs)
# Compute sigmoid weighted inputs
stacked = K.concatenate(resized_inputs, axis=-1)
weighted = stacked * K.sigmoid(self.W)
# Merge according to provided merge strategy.
merged = self._merge(weighted)
# Cache this for use in `get_output_shape_for`
self._out_shape = K.int_shape(merged)
return merged
def get_output_shape_for(self, input_shape):
return self._out_shape
Lets look at this step by step.
_ensure_same_size
computes smallest \((rows, cols)\) amongst all inputs and uses it to resize all inputs to be the same shape using the provided resize interpolation scheme.
We have to define trainable weights in build
per keras custom layer docs. We need as many weights as sum of filters across all inputs to the Connection
layer. Weight sharing across all filters of an incoming layer can be achieved by concatenating same weight variable for all filters.
call
computes sigmoid weighted inputs (I tested without sigmoid, and as expected, sigmoid weighing which mostly "allows or disallows inputs" worked a lot better), merged with defined merge strategy. We can tweak init_value
and merge_mode
to try various init strategies for weights and different merge strategies.
The fully connected net using layers defined below, followed by sequential Dense
layers using the above code is shown in fig.
layers = [
Convolution2D(32, 3, 3, border_mode='same', activation='relu', bias=False),
Convolution2D(32, 3, 3, bias=False, activation='relu'),
MaxPooling2D(pool_size=(2, 2)),
Dropout(0.25),
Convolution2D(64, 3, 3, bias=False, activation='relu', border_mode='same'),
Convolution2D(64, 3, 3, bias=False, activation='relu'),
MaxPooling2D(pool_size=(2, 2)),
Dropout(0.25)
]
layers
followed by sequential Dense
layers (open in new tab or download to zoom in).Experimentation is still a work in progress. Check back for updates. |
Without \(sigmoid\) weighing which mostly "allows or disallows inputs", the convergence was horribly slow. All subsequent results described here used \(sigmoid\) connection weights.
In these set of experiments, activation maps from prev layers are concatenated.
Connection weight initialization scheme (init to 0, 1, 0.5) has no effect on convergence.
Down sampling interpolation scheme (inter_area, inter_nn, inter_bilinear, inter_bicubic) doesn’t affect the convergence significantly^{[5]}.
It is definitely interesting to track how connection weights between layers evolved with training epochs. Fig 3 shows the connection weight evolution for connection_o through connection_7 (connection 0 weight 0 corresponds to input→conv2, connection 0 weight 1 corresponds to conv1→input2, and so on. Refer to fig 2 to get a complete picture).
TODO: Discussion.
WIP..
The code to reproduce all the experiments is available on Github. Feel free to reuse or improve.
This is a continuation from the first post on exploring alternative neural computational models. To recap quickly, a neuron is fundamentally acting as a pattern matching unit by computing the dot product \(W \cdot X\), which is equivalent to unnormalized cosine similarity between vectors \(W\) and \(X\).
This model of pattern matching is very naive, in a sense that it is not invariant to translations or rotations. Consider the following weight vector template (conv filter) corresponding to vertical edge pattern |
\(W = \begin{bmatrix} 0.2 & 1.0 & 0.2\\ 1.0 & 1.0 & 1.0\\ 0.2 & 1.0 & 0.2 \end{bmatrix} \)
When a neuron tries to match this pattern in a \(3 \times 3\) input image patch, it will only output a high value if a vertical edge is found in the center of the image patch. i.e., dot product is not invariant to shifts. My first thought was to use Pearson coefficient as it provides shift invariance. However, this not a problem in practice as the filter is slid across all possible \(3 \times 3\) patches within the input image (see fig2).
In a way, the sliding window operation is quite clever as it communicates information about parts of the image matching filter template without increasing the number of trainable parameters. As a concrete example, consider a conv-net with filters as shown in fig3^{[1]}. The first layer filters (green) detect eyes, nose etc., second layer filters (yellow) detect face, leg etc., by aggregating features from first layer and so on^{[2]}.
Now suppose the face (represented by two red and one magenta point) is moved to another corner as shown in fig4. The same activations occur, leading to the same end result.
So far so good. How about rotational invariance? We know that a lot of learned conv filters are identical, but rotated by some non-random factor. Instead of learning rotational invariance in this manner, we will try to build it directly into the computational model. The motivation is simple - the neuron should output high value even if some rotated version of pattern is present in the input. Hopefully this eliminates redundant filters and improves test time accuracy.
How can we build rotational invariance into dot product similarity computation? After searching for a bit on the Internet, I couldn’t find any obvious similarity metrics that do this. Sure there is SIFT and SURF, but they are complex computations and cannot be expressed in terms of dot products alone.
Instead of trying to come up with a metric, I decided to try the brute force approach of matching the input patch with all possible rotations of the filter. If we take max
of all those outputs, then, in principle, we are choosing the output resulting from the rotated filter that best aligns
with the pattern in input patch. This strategy is partly inspired by how conv nets gets around translational invariance problem by simply sliding the filter over all possible locations. Unlike sliding operation, which is an architectural property rather than the computational property of the neuron, the brute force rotations of filters can be confined within the abstraction of neuron, making it a part of the computational model.
Lets look at this idea in more concrete terms. A neuron receives input \(X\) and has associated weights \(W\). Let \(r\) denote some arbitrary discrete rotation \(\theta\). Instead of directly computing \(W \cdot X\), we will compute \(\max \{ rW \cdot X, r^{2}W \cdot X, \cdots, r^{k}W \cdot X \}\), where \(k = \left \lceil \frac{360}{\theta} \right \rceil\) is the number of brute force rotations of \(\theta\) step.
This idea has a number of practical challenges.
How do we decide \(\theta\)? If we set \(\theta = 90^{\circ}\), we can produce 4 rotated filters.
For \(\theta \neq 90^{\circ}\) we need interpolation, which might change filter shape.
To avoid interpolation, we can try to shift the image by rotating elements clockwise from outermost layer to innermost layer. For a \(3 \times 3\) image, there are 8 possible rotations. Unfortunately, this only works for filters that are symmetric along the center of the matrix. As an example, consider all 8 rolled rotations of the L
shaped filter (fig 6). As the image is not symmetric along the center, we get skewed representations for \(45\circ\) rotations. The second image in fig 6 still looks like an L
, except that the tail end of L
is squished upwards to maintain \(3 \times 3\) filter shape.
L
shaped filter. Each row contains a clockwise rolled elements from the previous.Despite the setback, all \(90^{\circ}\) rotations are accurate and \(45^{\circ}\) rotations are only slightly skewed. Skew might be a blessing in disguise as it might endow the neuron to be deform invariant as well (wishful thinking). In either case, the deformations are not too bad for \(3 \times 3\) filters, and it is at-least worth experimenting with vggnet which uses all \(3 \times 3\) conv filters.
Alternatively, instead of trying all possible alignments, we could try to find \(\theta\) that maximizes the dot product.
Let R denote \(2 \times 2\) rotation matrix.
\(R = \begin{bmatrix} cos(\theta) & sin(\theta)\\ -sin(\theta) & cos(\theta) \end{bmatrix}\)
To accomplish optimal alignment, we need \(\theta\) such that:
\(\arg\max_\theta R_{\theta}W \cdot X\)
This can be solved by differentiating with respect to \(\theta\) and equating the derivative to 0.
I finally decided to go with the brute force approach to quickly test out the idea and maybe further develop the math for optimal alignment if it showed promise. The brute-force computations discussed so far apply to an individual neuron. For Convolutional layer, the output is generated as follows:
Generate all 8 rotated filters from \(W\), \(W_{1}, \cdots, W_{8}\).
Compute output activation volumes^{[3]} \(O_{1}, \cdots, O_{8}\) by convolving input with \(W_{1}, \cdots, W_{8}\).
Across volumes, select \(x, y\) value corresponding to filter \((row, col)\) that has the max value. This corresponding to selecting the best aligned filter with input patch across the 8 rotations.
These steps can be implemented as follows. Note that W.shape = (rows, cols, nb_input_filters, nb_output_filters)
. You also need bleeding edge tensorflow from master; as of Nov 2016, gather_nd
does not have a gradient implementation.
import tensorflow as tf
# The clockwise shift-1 rotation permutation.
permutation = [[1, 0], [0, 0], [0, 1], [2, 0], [1, 1], [0, 2], [2, 1], [2, 2], [1, 2]]
def shift_rotate(w, shift=1):
shape = w.get_shape()
for i in range(shift):
w = tf.reshape(tf.gather_nd(w, permutation), shape)
return w
def conv2d(x, W, **kwargs):
# Determine all 7 rotations of w.
w = W
w_rot = [w]
for i in range(7):
w = shift_rotate(w)
w_rot.append(w)
# Convolve with all 8 rotations and stack.
outputs = tf.stack([tf.nn.conv2d(x, w_i, **kwargs) for w_i in w_rot])
# Max filter activation across rotations.
output = tf.reduce_max(outputs, 0)
return output
The setup is as follows.
Mini vggnet (1,250,858 parameters) comprising of \(3 \times 3\) convolutions with ReLU activation.
cifar10 dataset augmented with 10% random shifts along image rows/cols along with a 50% chance of horizontal flip.
batch size = 32, epochs = 200 with early stopping.
Adam optimizer with initial learning rate of \(10^{-3}\), and reduced by a factor of \(10^{-1}\) when convergence stalls.
random_seed = 1337
for reproducibility.
Disappointingly, rotational invariant model resulted in lower val accuracy than the baseline. I had three running hypothesis as to why this might happen.
Implementation issue. I ruled this out by testing the conv2d(…)
implementation by feeding it numpy arrays and verifying the output.
Perhaps, rotational invariance is not a good idea for the earlier filters. As an example, consider a latter layer that uses vertical and horizontal edge features for the preceeding layer to detect a square like shape. Having rotational invariance means that it is not possible to distinguish horizontal from vertical edges, making square detection a lot more challenging.
Maybe the skew from \(\theta = \{45^{\circ}, 135^{\circ}, 225^{\circ}, 315^{\circ}\}\) was an issue after all.
To test 2, 3, I decided to run 6 more experiments by using:
Use new conv logic only on last 4, 3, 2, 1 layer(s). If hypothesis 2 is true, the convergence should get better.
Use \(90^{\circ}\) rotations (4 filters) and \(45^{\circ}\) rotations (8 filters) to test if skew introduced by the 8 filter version was really an issue.
The results of these experiments are summarized in fig 8.
It seems evident that using rotationally invariant convolutional layers everywhere is not a good idea (4_rot_4 and 8_rot_4 have the lowest values). In line with hypothesis 2, the accuracy improved as we get rid of earlier rotationally invariant conv layers, with 4_rot_1 and 8_rot_1 on par/slightly better than the baseline model. The skew (according to hypothesis 3) does not seem to the issue as 8_rot_[4, 3, 2, 1] closely followed 4_rot_[4, 3, 2, 1] accuracies.
Since 4_rot_1 and 8_rot_4 converged to similar val_acc
as the baseline, it is worth comparing how prediction probability of the correct class changes as we rotate the input image from \(\theta = \{0, 360\}\) by a step of \(1^{\circ}\). Fig 9. shows this comparison across all models for the sake of completeness. It is awesome to note that all models outperform baseline model in terms of rotational invariance.
Instead of checking random images, lets compute these values for 1000 random test images and use a box plot to see mean, std, min, max of percentage correct classifications across all rotations. Interestingly, 8_rot_4, 4_rot_4 show the best rotational invariance (higher mean, large std in positive direction). It is also somewhat mysterious why 8_rot_2, 4_rot_2 is slightly worse than baseline
compared to 1, 3, 4.
Can we get the benefits of 8/4 rot models without the slower convergence? What if we applied new conv model (8/4 rotations) to the baseline model weights? Fig 11. shows the results for a single image. It is very curious to note that 8_rot_4 and 4_rot_4 models yield zero accuracy. This seems to indicate that weights from the usual conv layer won’t translate well into rotated conv model (the errors across various layers add up). The error rates seem to recover as we remove rotational conv layers from the early layers, but there is a lot of noise since this is for a single test image.
To get a clearer picture, lets compute the box plot by considering 1000 random test images as described earlier. Fig. 12 makes it pretty clear that we cannot just add rotation operations to a regular model after training. On the flip side, it provides evidence that rotational conv affects the weights.
The code to reproduce all the experiments is available on Github. Feel free to reuse or improve.
Even though convergence is slower, 8 rotations seem to offer better invariance compared to 4 rotations. It would be interesting to see whether all models converge to the same accuracy eventually.
Filter skew does not seem to be an issue.
All models seem to exhibit the same seasonal pattern of accuracy with respect to rotation (see fig. 9). What’s with the weird dip at \(100^{\circ}\)?
EDITs
It might be worth experimenting by concatenating activations from rotated filters instead of max operation in light of recent success with DenseNets.
Try out squeeze net style feature compression following concatenation of rotated filters.
Despite different variations of neural networks architectures, the computational model of a neuron hasn’t changed since inception. Every neuron has \(n\) incoming inputs \(X = (x_{1}, x_{2}, …, x_{n})\) with weights \(W = (w_{1}, w_{2}, …, w_{n})\). A neuron then computes \(W.X = \sum_{i=0}^{n}w_{i}.x_{i}\), forwards it through some nonlinearity like sigmoid or ReLU to generate the output^{[1]}.
As far as i am aware, all architectures build on this fundamental computational model. Let us examine what the computation is conceptually doing in a step by step fashion.
Each input component \(x_{i}\) is being magnified or diminished when multiplied with a scaling factor \(w_{i}\). This can be interpreted as enhancing or diminishing the contribution of some pattern component \(x_{i}\) to the output decision. Think of these weights as knobs that range from \([-\infty, +\infty]\). The task of learning is to tune these knobs to the correct value.
The computation \(W.X\) is then enhancing/diminishing various input components of vector \(X\). In terms of linear algebra, the dot product of \(W\) and \(X\) can be interpreted as computing similarity. Consider \(X = (x_{1}, x_{2})\) and \(W = (w_{1}, w_{2})\). If you imagine plotting these vectors on a 2D graph paper, \(\frac{W.X}{|W||X|}\) is the cosine of angle between these two vectors. When these two vectors align (point in the same direction), the value of \(W.X\) is maximized indicating high degree of similarity. From this point of view, neuron is fundamentally a pattern matching unit which outputs a large value when it matches a specific pattern (encoded by weights) in the input^{[2]}.
In a complex neural network, every neuron is matching the input against some template encoded via weights and forwarding that result to the subsequent layers. In case of image recognition, the first layer neurons might match whether the input contains vertical edge, horizontal edge and so on. The circle detection neuron might use various edge matching computation results from previous layers to detect the shape patterns.
So far so good. What is the purpose of activation function? To understand the motivation, consider the geometric perspective of the computation through the lens of linear algebra.
\( W.X = \begin{bmatrix} w_{1} \\ w_{2} \\ \vdots \\ w_{n} \end{bmatrix} \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{n} \end{bmatrix} \)
Geometrically, this can be interpreted as a linear transformation of \(X\). Additionally, there is a bias term to shift the origin. So, the computation \(W.X + b\) can be interpreted as an affine transformation. Without an non-linear activation function, multiple linear transformations within each layer \(W_{L}\) can be condensed into a single layer with \(W = \prod_{L}W_{L}\). A single layered linear neural network is pretty limited in its representational power.
Lets put all this together to interpret what a multi-layered neural network might be doing^{[3]}:
Suppose we input a \(100 \times 100\) image to a vggnet. The input tensor \(X\) can be thought of as a vector with 10000 basis. Each basis is representing some positional component in an image.
We know that the very first conv layer learns gabor filter like edge detection weight templates. Lets assume that the first convolutional layer filter generates \(100 \times 100\) output (assuming appropriate padding). This can be seen as transforming input from positional basis vector space to edge-like basis vector space. Instead of measuring the pixel intensity, the new vector space measures the similarity of the positional input patch with respect to a edge-detection templates (\(3 \times 3\) conv filter weights). Extrapolating on this notion, every layer transforms the data into a more meaningful basis vector space, eventually leading to basis vectors comprising of output categories.
In the current computational model, the output of a neuron is different even though \(X\) and \(W\) perfectly align with each other but differ in magnitudes. As each \(w_{i}\) has range \([-\infty, \infty]\), backprop is essentially searching for this weight vector spanning all of the weight vector space.
If we view a neuron as fundamental template matching unit, then various magnitude (length) of \(X\) and \(W\) should yield the same output. If we constrain \(W\) to be a unit vector (\(|W| = 1\)), then backprop would only need to search along points on a hypersphere of unit radius. Although the hypersphere contains infinitely many points, with sufficient discretization, the search space intuitively feels a lot smaller than the entire weight vector space.
Considering that weight sharing (a form on constraint) in conv-nets led to huge improvements, constraining seems like a reasonable approach. Weight regularization - a form of constraint - penalizes large \(|W|\) and vaguely fits the idea of encouraging weight updates to explore points along the hypersphere.
I think that in general, constraining as a research direction is largely unexplored. For example, it might be interesting to try and constrain conv filters to be as distinct and diverse and possible. I will follow up on these investigations in a future blog.
Getting back to the topic at hand, there are several options for constraining \(|W| = 1\).
Penalize when \(|W|\) deviates from 1 as a regularization loss.
Represent \(W\) in parametric form. Each \(w_{i}\) can be calculated using n-dim spherical coordinates.
Instead of computing \(W.X\), compute \(\frac{W}{|W|} \times X\). Normalizing \(W\) ensures that the output does not change due to scaling. Consequently, the gradient of output with respect to \(W\) can only exist if \(W\) changes along the unit hypersphere.
I explored option 3 as it was the simplest to implement. The setup is as follows.
Mini vggnet (1,250,858 parameters) comprising of \(3 \times 3\) convolutions with ReLU activation.
cifar10 dataset augmented with 10% random shifts along image rows/cols along with a 50% chance of horizontal flip.
batch size = 32, epochs = 200 with early stopping.
Adam optimizer with initial learning rate of \(10^{-3}\), and reduced by a factor of \(10^{-1}\) when convergence stalls.
random_seed = 1337
for reproducibility.
\(W_{norm}\) is calculated as:
# 1e-8 is used to prevent division by 0
W_norm = W / (tf.sqrt(tf.reduce_sum(tf.square(W), axis=[0, 1, 2], keep_dims=True)) + 1e-8)
Experiments
Weight normalization can be applied to both Convolutional
and Dense
layers. I tried all four combinations.
baseline: conv, dense
norm_conv: conv_norm, dense
norm_dense: conv dense_norm
norm_conv_dense: conv_norm, dense_norm
Test model image::alt_neural1/model.png[test_model, 300]
Convergence graphs for all four variations are shown in fig. As hypothesized, convergence improves with internal weight normalization. More experimentation is needed to see if this holds across various models and datasets. Intriguingly, the benefits are lost when both 'Conv` and Dense
layers are normalized. It is as if they 'cancel out'. I am not sure why this happens.
Additionally, the use of ReLU
which effectively attenuates negative values. This limits the neuron to only communicate information when the angle between \(X\) and \(W\) lies between \([-\frac{\pi}{2}, \frac{\pi}{2}]\). Perhaps it is useful if a neuron could also communicate the lack of similarity or the inverse of weight template information. For example, the lack of a specific stripe pattern might increase the networks confidence that the output is more likely to be one cat species over another.
One way to remedy this problem might be to use an activation function that allows negative values. A quick experiment with ELU
activation, however, did not result in significant improvement.
The code to reproduce all the experiments in this blog is available on Github. Feel free to reuse or improve.
A while ago, kaggle hosted the ultrasound nerve segmentation challenge, which requires partipants to predict the nerve area (brachial plexus) in a given Ultrasound image. The task is to predict the segmentation mask for the the brachial plexus. Even my own neural network (brain) finds it difficult to spot patterns in these images. Can you spot the pattern?
What makes this challenge particularly interesting is that ~60% of images do not contain the brachial plexus, i.e., segmentations mask is empty. Since the competition uses the mean of dice coefficient across all samples as the evaluation metric, even a single pixel in segmentation mask for non-nerve images mask kills the score. Ironically, you can achieve a score of ~0.6 just by predicting zero masks; it drops to ~[0.35, 0.45] when you try to learn it using a standard U-Net. This, combined with nosiy ultrasound images and a lack of obvious nerve patterns, intrigued me enough to participate in the challenge.
Unlike most kaggle posts, this article is more about the thought-process involved in coming up with a solution. I find it distasteful when folks just hand out a giant blob of code on github or just post their final complex solution that somehow gives good results.
My methodology is always to first start with a quick and simple baseline model and setup and end-end working code. In case of this challenge, this means:
Data related I/O - Loading, train/test splits, optional preprocessing blocks
Baseline model: I stated with a U-Net as it seemed like the state of the art model for these kinds of data. I also retrofitted it with batch normalization, ELU activation, and dropout.
Clean code: Try to organize code into classes. Having function blocks everywhere makes the code messy and error prone.
Submission code: Code to generate submission file.
As mentioned in the overview, this tends to give mean dice score of ~[0.35, 0.45], which is pretty abysmal.
Now that an end-end working code is setup. We can work on optimizing individual blocks. The most obvious strategy is to get a better understanding of dataset at hand. The idea is to leverage insights from the dataset to improve/build the conv-net architecture and perhaps use dataset specific idiosynchronicities to provide better training signal.
Initial manual exploration revealed contradictory examples. i.e., two very similar images have different labels, one with a valid segmentation mask while the other doesn’t.
Lets try to find all such near duplicates. High level idea is as follows:
Divide image into blocks, say (20, 20).
Compute histogram over various blocks. The concatenated set of histograms represents the image hash.
Compute pairwise cosine similarity between image hashes and find ones within some reasonable threshold.
import skimage.util
import scipy.spatial.distance as spdist
def compute_img_hist(img):
# Divide the image in blocks and compute per-block histogram
blocks = skimage.util.view_as_blocks(img, block_shape=(20, 20))
img_hists = [np.histogram(block, bins=np.linspace(0, 1, 10))[0] for block in blocks]
return np.concatenate(img_hists)
# Compute hashes over all images
hashes = np.array(map(compute_img_hist, imgs))
# pairwise distance matrix
dist = spdist.squareform(spdist.pdist(hashes, metric='cosine'))
# find duplicates. +np.eye because we want to exclude image matching with itself.
# 0.008 is determined after some trial and errors.
close_pairs = dists + np.eye(dists.shape[0]) < 0.008
The number of samples with inconsistent labeling is a whooping ~40% of the training set!!! Such a large number of inconsistent samples is bound to put an upper limit on the accuracy of the model. I think part of the problem lies with how the dataset was setup. Typically, doctors use ultrasound video frames to localize the nerve. Finding the nerve with a single frame is extremely challenging, even for human experts. I think it would be more interesting if they included videos instead.
The training set is small as is, just 5635 samples. After discards, we are only left with 3398 samples. This certainly doesn’t look good for training a deep model as they require lots of training data. On top of that, we cannot use transfer learning either as there are no pretrained-nets on this sort of data.
Despite all this, the cleaned raining data manages to improve baseline model score to ~0.55.
This article is still a work in progress. Check back for updates. |
Since the beginning, it was clear that i had to do something about the weird score evaluation. Even a single pixel predicted on the segmentation output for non-nerve images was going to hurt the score. A pretty straight forward strategy was to add a Dense
layer towards the end of encoder an predict whether the image contains the mask or not. Lets call this value has_mask
During test/submission time, I could simply zero out all the segmentation mask values if has_mask = False
.
You are to inflict maximum damage to the zerg army. There are two types of units - Warrior and Shield. Warriors do damage every second, while a shield protects your entire army for one second. Your army is instantly overrun after the shield generators expire. Given \(G\) cost to build a shield, \(W\) cost to build a warrior and total money \(M\), how many shields would you build?
Let \(X\) and \(Y\) be the optimal number of generators and number of warriors to be built, respectively. Let’s start with a simple concrete example. Suppose shields and warriors both cost 1 unit and you have total money of 5 units. Also assume 1 unit of damage per warrior. What is the optimum value of \(X\) and \(Y\)? It would be optimum if you can inflict maximum damage. With 5 units of money, you can buy shields/warriors in the following combinations.
X | Y | Damage |
---|---|---|
1 |
4 |
4 |
2 |
3 |
6 |
3 |
2 |
6 |
4 |
1 |
4 |
In this case, the optimal choice of \(X, Y\) seem to be \((2, 3)\) and \((3, 4)\) respectively, both of which maximize the product \(XY\). In the general case, the cost to buy \(X\) generators is \(XG\), cost to buy \(Y\) warriors is \(YW\). Since we are limited by \(M\) amount of money, \(X, Y\) must satisfy \(XG + YW \le M\). This can be represented as a line and the inequality encapsulates a region for candidate \((X, Y)\) values.
We want \(X, Y \mid \arg\max{XY} \). Geometrically, \(XY\) represents the area of the rectangle shaded in blue.
We need to find integer values of \(X, Y\) that maximize this area.
How do we go about doing that? We know that the line will intersect X and Y axis at \(\frac{M}{G}, \frac{M}{W}\). We also know that the optimal rectangle touches the line. If it doesn’t, the area can be trivially increased by increasing either/both \(X, Y\) values until it touches the line.
The boring calculus way for figuring this out is as follows:
\(Area(A) = XY \\ A = \frac{M - YW}{G}Y \\ \\ \frac{\partial A}{\partial Y} = \frac{Y(M - WY)}{G} \\ \arg\max{A} \scriptstyle \implies \frac{Y(M - WY)}{G} = 0 \\ Y = \frac{M}{2W} \)
This gives corresponding \(X = \frac{M}{2G}\)
A more interesting way to arrive at the same conclusion would be to think as follows:
The shaded triangle increases its area as long as X and Y increase.
Start with \((0, 0)\) and increase both \(X, Y\) by 1. This gives us a small rectangle with bottom left \((0, 0)\) and top right \((1, 1)\).
If the straight line was X + Y = 1, it would be a right angled isosceles triangle. The \(X, Y\) value would always increase in the direction of line \(Y = X\), which would cut the line in middle at \((\frac{X}{2}, \frac{Y}{2})\)
Generalizing this, the value of \(X. Y\) will increase as long as we go on the direction of line that joins \((0, 0)\) to midpoint of the line \(XG + YW = M\). The mid point of this line is, unsurprisingly, \((\frac{M}{2G}, \frac{M}{2W})\)
There are other ways to arrive at this. Perhaps I should write a post that exclusively deals with this problem.
Coming back to our original question, the optimal number of shields must be \(\frac{M}{2G}\). If, \(\frac{M}{2G}\) is not an integer, we go towards the origin on the line that joins \((0, 0)\) and \((\frac{M}{2G}, \frac{M}{2W})\). However, I was able to get away by taking \(\lfloor\frac{M}{2G} \rfloor\).
Typically, a genetic algorithm works by the notion of maximizing the fitness. Consider a function \(y = x\), which is to be minimized in the interval \([-5, 5]\). One approach is use \(\frac{1}{x}\) as the fitness function. Intuitively, by maximizing \(\frac{1}{x}\), we are minimizing \(y = x\). However, a plot of \(\frac{1}{x}\) reveals some serious flaws.
If we move from the right, the maximum occurs at \(x = 0\) instead of \(x = -5\). Why? because \(\frac{1}{x}\) is not differentiable at \(x = 0\). Always make sure that the fitness/loss function is differentiable! In this case, it is better to use \(y = -x\) as the fitness function.
In general, if we are seeking to minimize \(y = f(x)\), where \(f(x)\) is differentiable, then it is safer to use \(y = -f(x)\) as the fitness function. Probably very obvious, but I got burned by this.
Given an integer \(N\), we need to find the number of integral pairs \((x, y) \mid x^2 + y^2 = N\), \(N \in [0, 2147483647]\), with a maximum of 100 such numbers in a file.
Geometrically, we know that \(x^2 + y^2 = r\) represents a circle centered at \((0,0)\) with a radius \(r\). So, the problem can be interpreted visually as finding all possible solutions where \(x, y\) are positive, i.e., lie in the first quadrant.
Testing all possible integral pairs of \((x, y)\) upto \(N\) is equivalent to searching within the grayed geometric square as illustrated.
Algorithmically, this approach takes \(O(n^2)\) time. Consider the sample input as shown below (this was the file I received)
20 1105 65 2147483646 1257431873 25 1000582589 1148284322 5525 0 1475149141 858320077 1022907856 1041493518 3 1215306625 372654318 160225 5928325 2147483643 1538292481
On my machine, brute force approach takes approx 14 minutes (!!!) to solve this input. Clearly, we are exploring a lot of unwanted points. One obvious way to improve this is to explore points bounded by the circle as as indicated by the shaded blue square.
This makes intuitive sense since \(x, y\) cannot exceed \(\sqrt{n}\). This reduces the complexity from \(O(n^2)\) to \(O(n)\) reducing execution time from an earlier 14 minutes to 80.5 secs.
Can we do any better? Why explore the entire blue square? It is sufficient if we examine all points on the arc of the circle in the first quadrant. i.e, increment \(x\) from \([0, \sqrt{n}]\), compute corresponding \(y = \sqrt{n - x^2}\) and check if it is an integer. \(y\) is an integer if \(\lfloor y \rfloor == y\).
This looks pretty good. Is there anything else that can be improved upon? Well, if you notice the figure carefully, we don’t have to iterate \(x\) from \([0, \sqrt{n}]\). All points on one side of line \(y = x\) are mirror images of points on the other side. i.e., if we find a point, day \((3, 4)\) then we will find a mirror \((4, 3)\) on the other side. Since the problem doesn’t differentiate between these two solutions, it is sufficient if we iterate \(x\) from \([0, x']\), where \(x'\) is given by \(\sqrt{n} \cos{45} = \sqrt{n/2}\).
With this, we can complete the algorithm in java as follows:
int getNumSumSquares(final int n)
{
if(n == 0) {
return 1;
}
final int iterations = (int) Math.ceil(Math.sqrt((n * 1.0)/2)) + 1;
double y;
int count = 0;
for(int x = 0; x <= iterations; x++)
{
y = Math.sqrt(n - (x*x));
// check if y is an integer
if (Math.floor(y) == y) {
count++;
}
}
return count;
}
This algorithm works in \(O(\sqrt{n/2}) \sim O(\sqrt{n})\). This only takes 0.047 secs to execute!
Last week, I happened to read about Kleiber’s law while browsing through literature on natural evolution. The implications are really fascinating. It establishes a relationship between mass and metabolism as:
\(metabolism = mass^{\frac{3}{4}}\)
Metabolism is ultimately linked to the number of heartbeats since the heart regulates the supply of oxygen. Therefore, \(\mid heartbeat \mid \propto mass\). Curiously, the number of heartbeats per lifetime tends to be constant. This means that bigger/heavier animals tend to have slower metabolism with lower heart rate compared to, say, flies, with faster metabolism.
Ironically, if we have fixed number of heartbeats, wouldn’t running/exercising make us die faster? I suppose, the long term benefits outweigh shot term loss.
Today, I happened to type some random text and noticed interesting. This is a random sample of what i typed:
gh jgh jg hjg hjg hjg hjg hjg hkg hjg hg jg hjg hjg hg hjg hg hjg h gh gh ghj ghj ghj g hjg khg hg hg hg hjg hjg hjg hjg h ghj
Notice how 'h' repeats a lot. It so happens that my middle finger was on 'h'. Perhaps, frequency is somehow linked to the length of finger? See table below. I used the keys G, H, J, K. My index finger was on G, middle on H, ring finger on J and little finger on K.
Character ordered by frequency | Actual finger on character | Finger ordered by length |
---|---|---|
H |
Middle |
Middle |
J |
Ring |
Ring |
G |
Index |
Index |
K |
Little |
Little |
You can try this on your own. Place your fingers on the keyboard (horizontally, any other orientation complicates the situation as relative length changes). Probability theory suggests that the chance of occurrence of G, H, J or K is 1/4. But I think that in this case, probability is somehow weighted, proportional to the length of the finger.
This should allow us to predict hand position on a keyboard based on a bunch of randomly pressed keystrokes.