[P] StyleGAN Music Video

We made a music video using NVIDIA’s StyleGAN. You can check it out here: https://youtu.be/bCJXnRFGoSE

Methodology

We first produced a mel-scaled spectrogram of the piece of music. We tuned the parameters so that each time step of the spectrogram corresponds to 16.7 ms, the duration of one frame at 60 fps. The frequency dimension of the spectrogram is scaled to match StyleGAN’s input dimension.
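
A minimal sketch of the frame-rate arithmetic, not our actual script: the 44.1 kHz sample rate, the FFT size, the use of librosa, and matching the input dimension by setting n_mels to 512 are all assumptions made for illustration. The helper names are hypothetical.

```python
import numpy as np

def hop_length_for_fps(sample_rate, fps):
    """Samples per spectrogram step so that one step lasts one video frame."""
    # Assumption: sample_rate / fps is (close to) an integer; otherwise the
    # rounding makes audio and video drift apart slowly over the song.
    return round(sample_rate / fps)

def mel_frames(path, sample_rate=44100, fps=60, n_mels=512):
    """Return an array of shape (n_video_frames, n_mels) for an audio file."""
    import librosa  # imported lazily so the arithmetic above runs without it

    y, sr = librosa.load(path, sr=sample_rate)
    hop = hop_length_for_fps(sr, fps)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=2048, hop_length=hop, n_mels=n_mels
    )
    # Convert power to decibels and put time on the first axis.
    return librosa.power_to_db(mel, ref=np.max).T

print(hop_length_for_fps(44100, 60))
```

At 44.1 kHz this gives a hop of 735 samples, i.e. 735 / 44100 s, which is about 16.7 ms per step.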

Then we explored the input space of a StyleGAN pre-trained on faces, looking for interesting output images. To do so, we computed the gradient of the mean squared error between StyleGAN’s output image and a real image we had chosen, with respect to a random input. With steps of gradient descent we then searched for inputs that produce outputs similar to our real image. We wanted “non-realistic, creepy faces”, which we got by using extreme hyper-parameters in this exploration phase, by swapping the colors of the output, and by carefully choosing the custom target image. For each generated image we also saved the 512-dimensional input vector that led to it.
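
The search can be illustrated on a toy problem. The sketch below is not StyleGAN: it assumes a tiny stand-in generator, tanh(W z), so the gradient of the mean squared error can be written by hand in NumPy. With the real network the same loop would rely on the framework's automatic differentiation. All names, sizes, and hyper-parameters here are hypothetical.

```python
import numpy as np

def toy_generator(z, W):
    """Stand-in for a differentiable generator: latent z -> flat 'image'."""
    return np.tanh(W @ z)

def mse_and_grad(z, W, target):
    """Mean squared error to the target and its gradient with respect to z."""
    out = toy_generator(z, W)
    diff = out - target
    loss = np.mean(diff ** 2)
    # Chain rule: dL/dout = 2*diff/N, dout/dpre = 1 - out^2, dpre/dz = W.
    grad = W.T @ (2.0 * diff / diff.size * (1.0 - out ** 2))
    return loss, grad

def search_latent(W, target, z0, lr=0.5, steps=200):
    """Plain gradient descent on the input, starting from a random z0."""
    z = z0.copy()
    losses = []
    for _ in range(steps):
        loss, grad = mse_and_grad(z, W, target)
        losses.append(loss)
        z -= lr * grad
    return z, losses

rng = np.random.default_rng(0)
latent_dim, image_dim = 8, 64
W = rng.normal(size=(image_dim, latent_dim)) / np.sqrt(latent_dim)
target = toy_generator(rng.normal(size=latent_dim), W)  # a reachable target
z_found, losses = search_latent(W, target, rng.normal(size=latent_dim))
print(losses[0], losses[-1])
```

The saved z_found plays the role of the input vector we stored for each image.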

Finally, we made a large spreadsheet in which each row is a beat of the song (175 bpm for most parts). We assigned generated images we liked to different parts of the song (usually at intervals of 4 beats). We turned this spreadsheet into a large input array with the same dimensions as the mel-scaled spectrogram, by linearly interpolating between the input vectors of the pre-chosen generated images at the intervals dictated by the spreadsheet. We then added this input matrix to the spectrogram with some weights and fed the result to the pre-trained StyleGAN. The outputs are the frames of the video.
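
As a rough sketch of this step, assuming a constant 175 bpm, 60 fps, and a spectrogram already shaped (frames, input dimension): the function names and the mixing weights below are hypothetical, and the tiny 4-dimensional vectors stand in for the 512-dimensional ones.

```python
import numpy as np

def frames_per_beat(bpm=175.0, fps=60.0):
    """Video frames per beat: 60 s/min * fps / bpm."""
    return 60.0 * fps / bpm

def interpolate_keyframes(key_beats, key_latents, n_frames, bpm=175.0, fps=60.0):
    """Linearly interpolate saved input vectors between their assigned beats."""
    key_frames = np.asarray(key_beats, dtype=float) * frames_per_beat(bpm, fps)
    key_latents = np.asarray(key_latents, dtype=float)
    t = np.arange(n_frames, dtype=float)
    # np.interp is 1-D, so interpolate each input dimension separately.
    columns = [np.interp(t, key_frames, key_latents[:, d])
               for d in range(key_latents.shape[1])]
    return np.stack(columns, axis=1)  # shape (n_frames, latent_dim)

def mix_inputs(latent_track, spectrogram, w_latent=1.0, w_audio=0.5):
    """Weighted sum of the interpolated inputs and the spectrogram."""
    return w_latent * latent_track + w_audio * spectrogram

# Two keyframes 4 beats apart, in a toy 4-dimensional input space.
track = interpolate_keyframes([0, 4], [np.zeros(4), np.ones(4)], n_frames=90)
inputs = mix_inputs(track, np.zeros_like(track))
print(track.shape, frames_per_beat())
```

At 175 bpm and 60 fps a beat lasts roughly 20.6 frames, so keyframes spaced 4 beats apart are about 82 frames apart.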

(For the first few seconds of the song we also used some real footage, which we morphed with generated faces.)

Discussion

Throughout the project we felt that there must be a better way to do targeted searches of the input space. For StyleGAN there is some interpretability to each dimension of the input, but we found it hard to make use of this, especially when the target image was not strictly a face (a skull, for example). What other ways are there to answer the question “what inputs of this (differentiable) black box lead to a desired output?”

submitted by /u/kinezodin