We recommend using earphones to listen to the demo videos, turning up the volume, and zooming in on the videos.
Section 1: Comparison with baseline models.
1. Sample 1: Eagle screaming.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
2. Sample 2: Rowboat, canoe, kayak rowing.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
3. Sample 3: Playing tabla.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
4. Sample 4: Cat growling.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
5. Sample 5: Cat caterwauling.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
6. Sample 6: Barn swallow calling.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
7. Sample 7: Playing djembe.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
8. Sample 8: Alligators, crocodiles hissing.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
9. Sample 9: Baby crying.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
10. Sample 10: Dog baying.
Ground Truth
SpecVQGAN
IM2WAV
DiffSound-V
Diff-Foley
TiVA (Ours)
Section 2: Generated audio for videos from Sora.
Sora is a text-conditional diffusion model that can generate high-fidelity videos. We use videos generated by Sora as input and generate the corresponding audio for them.