[R] (Replicating) Modern Neural Networks Generalize on Small Data Sets
I stumbled across the NIPS paper by Olson, Wyner and Berk, which decomposes deep neural networks into decorrelated sub-networks to explain why neural networks (acting as de facto ensembles) generalise well even on small data sets.
Since I am trying to get more acquainted with TensorFlow and PyTorch, I tried to implement the decomposition strategy they present. However, I am running into issues obtaining convergence results similar to those in the paper. I am new to this subreddit and hope someone can point out problems in my replication effort.
The authors use the following network architecture, which they subsequently decompose into sub-networks via linear programming:
– 10 hidden layers with 100 units each (ELU activation, He initialization)
– binary classification task
– Adam optimizer with a learning rate of 0.001 and 200 training epochs
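To get a feel for the size of this network relative to the data, here is a small sketch that counts its trainable parameters. The input size of 2 and the single output logit are assumptions taken from my own setup in this post, not from the paper, and `count_parameters` is just a helper name I made up.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 2 inputs (assumed), 10 hidden layers of 100 units, 1 output logit (assumed)
layer_sizes = [2] + [100] * 10 + [1]
n_params = count_parameters(layer_sizes)
print(n_params)  # 91301
```

So the network has roughly 91k parameters, which is far more than the 400 points on a 20 x 20 training grid.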
I am using the following code to simulate data similar to the synthetic data set used in the paper:
    import numpy as np

    def classify(x):
        if np.linalg.norm(x) < 0.6:  # the paper mentions a radius of 0.3, the images look like 0.6 though
            return 1 if np.random.random() < 0.15 else 0
        # the branch for points outside the circle is not shown in this snippet

    x_ = np.linspace(-1, 1, 20)
    X_train = np.asarray([(x_[i], x_[j]) for i in range(20) for j in range(20)], dtype=np.float32)
    y_train = np.apply_along_axis(classify, 1, X_train).reshape(400, 1)
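Because I am unsure about the radius (0.3 in the text versus what looks like 0.6 in the figures), here is a quick check of how many of the 400 grid points fall inside the circle for each choice. This is a plain-Python sketch that rebuilds the same 20 x 20 grid; `grid_points_inside` is a hypothetical helper, not something from the paper.

```python
import math

def grid_points_inside(radius, n=20):
    """Count points of an n x n grid on [-1, 1]^2 whose Euclidean norm is < radius."""
    # same spacing as np.linspace(-1, 1, n)
    coords = [-1 + 2 * i / (n - 1) for i in range(n)]
    return sum(1 for a in coords for b in coords if math.hypot(a, b) < radius)

for radius in (0.3, 0.6):
    inside = grid_points_inside(radius)
    print(radius, inside, inside / 400)
```

With a radius of 0.3 only 24 of the 400 points (6%) lie inside the circle, compared with 96 points (24%) for a radius of 0.6, so the choice changes the class balance considerably.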
Then I have set up the network architecture as follows:
    import tensorflow as tf  # TensorFlow 1.x API

    L, M = 10, 100  # number of hidden layers, units per hidden layer

    X_input = tf.placeholder(tf.float32, [None, 2], name='input_data')
    y = tf.placeholder(tf.float32, [None, 1])
    # variance_scaling_initializer() with its default arguments corresponds to He initialization
    X = tf.layers.dense(X_input, units=M, kernel_initializer=tf.contrib.layers.variance_scaling_initializer(), activation=tf.nn.elu, name='dense_0')
    for i in range(1, L):
        X = tf.layers.dense(X, units=M, kernel_initializer=tf.contrib.layers.variance_scaling_initializer(), activation=tf.nn.elu, name='dense_' + str(i))
    logits = tf.layers.dense(X, units=1, activation=None, name='logits')
    predicted = tf.nn.sigmoid(logits, name='predicted')
    is_correct = tf.equal(tf.round(predicted), y, name='is_prediction_correct')
    accuracy = tf.reduce_mean(tf.cast(is_correct, tf.float32))
    cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(cross_entropy)
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
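For anyone who wants to sanity-check the loss outside of TensorFlow: the documentation of `tf.nn.sigmoid_cross_entropy_with_logits` gives the numerically stable form `max(x, 0) - x * z + log(1 + exp(-abs(x)))` for logit `x` and label `z`. Below is a minimal plain-Python sketch of that formula next to the textbook definition; both function names are my own.

```python
import math

def stable_sigmoid_cross_entropy(logit, label):
    """Stable form: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def naive_sigmoid_cross_entropy(logit, label):
    """Textbook form; breaks down for logits of large magnitude."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

for logit, label in [(0.0, 1.0), (2.0, 0.0), (-3.0, 1.0)]:
    print(logit, label,
          stable_sigmoid_cross_entropy(logit, label),
          naive_sigmoid_cross_entropy(logit, label))
```

The two agree for moderate logits, while only the stable form stays finite for very large ones.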
To my understanding, this should match the implementation details of the paper. However, while the paper limits training to 200 epochs and even mentions that
“In practice, we found that networks without dropout achieved 100% training accuracy after a couple dozen epochs of training.”
my implementation only converges after 500+ epochs at best.
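One detail the snippets above do not pin down is the batch size, which determines how many Adam updates an epoch actually contains. A rough sketch of that arithmetic for the 400 training points follows; the batch sizes are purely hypothetical examples, and I do not know what the paper used.

```python
import math

def adam_updates(n_samples, batch_size, epochs):
    """Total optimizer steps when every epoch is one full pass over the data."""
    return math.ceil(n_samples / batch_size) * epochs

# hypothetical batch sizes; 400 corresponds to full-batch training
for batch_size in (400, 100, 32):
    print(batch_size, adam_updates(400, batch_size, 200))
```

Full-batch training gives only 200 updates in 200 epochs, whereas a batch size of 32 gives 2600, so "epochs to convergence" is hard to compare without knowing this setting.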
Does anyone here have ideas about what I could be doing wrong?