[D] Training on realizations, testing on posteriors?
I have a supervised, discriminative problem, and the model is a feedforward NN. I'm trying to learn p(Y = y | Z), where Z ~ p(Z | X). At training time I only get a single sample Z ~ p(Z | X), but at test time I have the full posterior p(Z | X). Concretely, Z is a discrete variable over 100 states, represented as a 100-dimensional vector: at training time it's a one-hot encoding, but at test time it's a full distribution over those states. I could, of course, just take the argmax at test time, but that throws away the rest of the posterior. Alternatively, I could inject noise into the one-hot encoding at train time so the network sees soft inputs. Is there any literature on this problem?
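To make the question concrete, here is a minimal numpy sketch of three test-time strategies for a net trained on one-hot samples. Everything here is hypothetical illustration (random toy weights, made-up names like `predict_proba`), not a claim about the right answer: option 1 is the argmax collapse, option 2 feeds the posterior vector directly (it equals E[Z] when Z is one-hot, though the net never saw such inputs in training), and option 3 marginalizes exactly, p(y|x) = Σ_z p(y|z) p(z|x), which is feasible here since Z has only 100 states.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained feedforward net: maps a 100-dim latent
# encoding to logits over C classes. (W is random here; in the real
# setup it would be learned from one-hot samples z ~ p(z|x).)
D, C = 100, 5
W = rng.normal(size=(D, C))

def predict_proba(z):
    """Softmax over logits for a single latent encoding z of shape (D,)."""
    logits = z @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Test-time input: a full posterior p(Z|X) over the 100 latent states.
posterior = rng.dirichlet(np.ones(D))

# Option 1: argmax -- collapse the posterior to its mode (discards the
# rest of the distribution).
p_argmax = predict_proba(np.eye(D)[posterior.argmax()])

# Option 2: expected input -- feed the posterior vector itself, which is
# E[Z] for one-hot Z. Cheap (one forward pass), but evaluates the net
# off the one-hot manifold it was trained on.
p_expected = predict_proba(posterior)

# Option 3: exact marginalization -- average the net's *output* over the
# posterior: p(y|x) = sum_z p(y|z) p(z|x). Costs one forward pass per
# latent state (100 here), so it is tractable for small discrete Z.
p_marginal = sum(posterior[i] * predict_proba(np.eye(D)[i]) for i in range(D))
```

Options 2 and 3 differ exactly where the train/test mismatch bites: 3 only ever queries the net at one-hot inputs it was trained on, while 2 relies on the net generalizing to soft inputs, which is where train-time noise injection on the one-hot encoding would help.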