[R] (Replicating) Modern Neural Networks Generalize on Small Data Sets

I stumbled across the NIPS paper by Olson, Wyner and Berk on decomposing deep neural networks into decorrelated sub-networks, which explains how neural networks can generalise well (as de facto ensembles) even on small data sets.

Since I am trying to get more familiar with TensorFlow and PyTorch, I tried to implement the decomposition strategy presented in the paper. However, I am running into issues obtaining convergence results similar to the ones reported there. I am new to this subreddit and hope that someone can point out problems in my replication effort.

The authors use the following network architecture, which they subsequently decompose into sub-networks via linear programming:

– 10 hidden layers with 100 units each (ELU activation and He initialization)

– binary classification task

– Adam optimizer with a learning rate of 0.001 and 200 training epochs
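To get a feeling for how overparameterised this setup is, here is a minimal sketch that counts the trainable parameters of such a fully connected network. The 2-dimensional input and the single output logit are taken from my own code further down, not from the list above, and `dense_param_count` is just a hypothetical helper name.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input and output included.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Assumed shape: 2 inputs, 10 hidden layers of 100 units, 1 output logit.
layer_sizes = [2] + [100] * 10 + [1]
n_params = dense_param_count(layer_sizes)
print(n_params)  # 91301
```

So there are roughly 91k parameters for the 400 training points generated below, which is the regime the paper is about.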

I am using the following code to simulate data similar to the synthetic data set used in the paper:

import numpy as np

np.random.seed(0)

def classify(x):
    # Points inside the circle are always labelled 1.
    # The paper mentions a radius of 0.3, the images look like 0.6 though.
    if np.linalg.norm(x) < 0.6:
        return 1
    # Outside the circle: label 1 with probability 0.15, otherwise 0.
    return 1 if np.random.random() < 0.15 else 0

# 20 x 20 grid on [-1, 1]^2, i.e. 400 training points
x_ = np.linspace(-1, 1, 20)
X_train = np.asarray([(x_[i], x_[j]) for i in range(20) for j in range(20)], dtype=np.float32)
y_train = np.apply_along_axis(classify, 1, X_train).reshape(400, 1)
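For anyone who wants to reproduce the data without the per-row Python function, the same construction can be sketched in vectorised form. This is an assumption-laden variant rather than the paper's generator: `make_grid_data` is a hypothetical name, the radius of 0.6 is my guess from the figures, and it uses `np.random.default_rng`, so the noisy labels outside the circle will not match the `np.random.seed(0)` version bit for bit.

```python
import numpy as np


def make_grid_data(n_per_axis=20, radius=0.6, noise=0.15, seed=0):
    """Grid on [-1, 1]^2 with a noisy 'circle' labelling (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    axis = np.linspace(-1.0, 1.0, n_per_axis)
    # indexing="ij" keeps the first coordinate varying slowest, matching
    # the nested list comprehension over (i, j).
    xx, yy = np.meshgrid(axis, axis, indexing="ij")
    X = np.stack([xx.ravel(), yy.ravel()], axis=1).astype(np.float32)
    inside = np.linalg.norm(X, axis=1) < radius
    # Outside the circle a point is labelled 1 with probability `noise`.
    noisy_one = rng.random(len(X)) < noise
    y = np.where(inside | noisy_one, 1.0, 0.0).astype(np.float32).reshape(-1, 1)
    return X, y, inside


X_vec, y_vec, inside = make_grid_data()
print(X_vec.shape, y_vec.shape)  # (400, 2) (400, 1)
```

Returning the `inside` mask makes it easy to check how many of the positive labels are signal and how many are noise.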

Then I have set up the network architecture as follows:

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.layers, tf.contrib)

M = 100  # units per hidden layer, as in the architecture listed above
L = 10   # number of hidden layers, as in the architecture listed above

X_input = tf.placeholder(tf.float32, [None, 2], name='input_data')
y = tf.placeholder(tf.float32, [None, 1])

# With its default arguments, variance_scaling_initializer() is He initialization.
X = tf.layers.dense(X_input, units=M, kernel_initializer=tf.contrib.layers.variance_scaling_initializer(), activation=tf.nn.elu, name='dense_0')
for i in range(1, L):
    X = tf.layers.dense(X, units=M, kernel_initializer=tf.contrib.layers.variance_scaling_initializer(), activation=tf.nn.elu, name='dense_' + str(int(i)))

# Single output logit for the binary classification task
logits = tf.layers.dense(X, units=1, activation=None, name='logits')
predicted = tf.nn.sigmoid(logits, name='predicted')

is_correct = tf.equal(tf.round(predicted), y, name='is_prediction_correct')
accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
l = tf.reduce_mean(cross_entropy)

As far as I understand, this should match the implementation details of the paper. However, while the paper limits training to 200 epochs and even mentions that

“In practice, we found that networks without dropout achieved 100% training accuracy after a couple dozen epochs of training.”

my implementation only converges after 500+ epochs at best.
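One thing the snippets above do not show is the training loop, and the number of epochs only translates into a number of Adam updates once a batch size is fixed. The following sketch makes that relationship explicit. The batch size of 32 is purely an assumption for illustration (I have not stated a batch size here), and `iterate_minibatches` and `updates_for` are hypothetical helper names.

```python
import math

import numpy as np


def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled mini-batches that cover the data set exactly once."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]


def updates_for(n_samples, batch_size, epochs):
    """Number of optimizer steps taken over the given number of epochs."""
    return math.ceil(n_samples / batch_size) * epochs


rng = np.random.default_rng(0)
X_demo = np.zeros((400, 2), dtype=np.float32)  # stand-in for X_train
y_demo = np.zeros((400, 1), dtype=np.float32)  # stand-in for y_train
batches = list(iterate_minibatches(X_demo, y_demo, 32, rng))

print(len(batches))               # 13 updates per epoch with batch size 32
print(updates_for(400, 400, 200)) # 200 updates with full-batch training
print(updates_for(400, 32, 200))  # 2600 updates with batch size 32
```

With full-batch training, 200 epochs are only 200 gradient steps, whereas a mini-batch setup takes an order of magnitude more, so a mismatch here could by itself account for a large difference in the number of epochs needed.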

Does anyone here have ideas about what I could be doing wrong?

submitted by /u/biopsi
