TiVA

Demos

We recommend using earphones, raising the volume, and zooming in on the videos when viewing the demos.

Section 1: Comparison with baseline models.

1. Sample 1: Eagle screaming.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
2. Sample 2: Rowboat, canoe, kayak rowing.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
3. Sample 3: Playing tabla.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
4. Sample 4: Cat growling.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
5. Sample 5: Cat caterwauling.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
6. Sample 6: Barn swallow calling.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
7. Sample 7: Playing djembe.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
8. Sample 8: Alligators, crocodiles hissing.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
9. Sample 9: Baby crying.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)
10. Sample 10: Dog baying.
Ground Truth SpecVQGAN IM2WAV
DiffSound-V Diff-Foley TiVA (Ours)

Section 2: Generated audio for videos from Sora.

Sora is a text-conditional diffusion model that can generate high-fidelity videos. We take videos generated by Sora as input and generate the corresponding audio for them.

Video 1 Video 2 Video 3