We recommend using earphones to listen to the demo videos, turning up the volume, and zooming in on the videos.
Section 1: Comparison with baseline models.
1. Sample 1: Eagle screaming.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
2. Sample 2: Rowboat, canoe, kayak rowing.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
3. Sample 3: Playing tabla.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
4. Sample 4: Cat growling.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
5. Sample 5: Cat caterwauling.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
6. Sample 6: Barn swallow calling.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
7. Sample 7: Playing djembe.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
8. Sample 8: Alligators, crocodiles hissing.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
9. Sample 9: Baby crying.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
10. Sample 10: Dog baying.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
Section 2: Generated audio for videos from Sora.
Sora is a text-conditional diffusion model that can generate high-fidelity videos. We use videos generated by Sora as input and generate the corresponding audio for them.