pe-av-small-16-frame
164
3
license:apache-2.0
by
facebook
Audio Model
OTHER
New
164 downloads
Early-stage
Edge AI:
Mobile
Laptop
Server
Unknown
Mobile
Laptop
Server
Quick Summary
AI model with specialized capabilities.
Code Examples
`perception_models` Usagepythonpytorch
import torch
from core.audio_visual_encoder import PEAudioVisual, PEAudioVisualTransform
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load model and transform
model = PEAudioVisual.from_config("pe-av-large", pretrained=True).to(device)
transform = PEAudioVisualTransform.from_config("pe-av-large")
video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]
# Process inputs and get embeddings
inputs = transform(videos=video_files, text=descriptions, audio=audio_files).to(device)
with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
outputs = model(**inputs)
# Access different embeddings
audio_embeds = outputs.audio_embeds # Audio-only embeddings
visual_embeds = outputs.visual_embeds # Video-only embeddings
audio_visual_embeds = outputs.audio_visual_embeds # Joint audio-visual embeddings
audio_text_embeds = outputs.audio_text_embeds # Text embeddings aligned to audio
visual_text_embeds = outputs.visual_text_embeds # Text embeddings aligned to video
audio_visual_text_embeds = outputs.audio_visual_text_embeds # Text embeddings aligned to audio-visual
audio_plus_text_embeds = outputs.audio_plus_text_embeds # Joint audio and text embedding
visual_plus_text_embeds = outputs.visual_plus_text_embeds # Joint video and text embedding
# Compute the dot product to get their similarities
audio_visual_similarity = audio_embeds @ visual_embeds.T
# When computing similarity against text embeddings, use the
# appropriate text embedding based on the other modality
audio_text_similarity = audio_embeds @ audio_text_embeds.T
video_text_similarity = visual_embeds @ visual_text_embeds.Tpython
inputs = transform(videos=video_files, text=descriptions).to(device)
with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
outputs = model(**inputs)
audio_embeds = outputs.audio_embeds # None
visual_embeds = outputs.visual_embeds # available
audio_visual_embeds = outputs.audio_visual_embeds # None
audio_visual_text_embeds = outputs.audio_visual_text_embeds # None
audio_text_embeds = outputs.audio_text_embeds # None
visual_text_embeds = outputs.visual_text_embeds # available
audio_plus_text_embeds = outputs.audio_plus_text_embeds # None
visual_plus_text_embeds = outputs.visual_plus_text_embeds # Availableoptionally re-use pre-computed PE featurespythontransformers
from transformers import PeAudioVideoModel, PeAudioVideoProcessor
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PeAudioVideoModel.from_pretrained("facebook/pe-av-large")
processor = PeAudioVideoProcessor.from_pretrained("facebook/pe-av-large")
model = model.to(device)
video_files = ["video1.mp4", "video2.mp4"]
descriptions = ["description1", "description2"]
audio_files = ["audio1.wav", "audio2.wav"]
# Process inputs and get embeddings
inputs = processor(
videos=video_files, text=descriptions, audio=audio_files, return_tensors="pt", padding=True
)
with torch.inference_mode(), torch.autocast(device.type, dtype=torch.bfloat16):
outputs = model(**inputs.to(device), return_loss=True)
audio_embeds = outputs.audio_embeds # Audio-only embeddings
video_embeds = outputs.video_embeds # Video-only embeddings
audio_video_embeds = outputs.audio_video_embeds # Joint audio-video embeddings
text_audio_video_embeds = outputs.audio_video_text_embeds # Text embeddings aligned to audio-video
text_audio_embeds = outputs.text_audio_embeds # Text embeddings aligned to audio
text_video_embeds = outputs.text_video_embeds # Text embeddings aligned to video
audio_plus_text_embeds = outputs.audio_plus_text_embeds # Joint audio and text embedding
video_plus_text_embeds = outputs.video_plus_text_embeds # Joint video and text embedding
# For classification, you can use the logits_* fields of the output
audio_text_preds = outputs.logits_audio_text.sigmoid()
# The overall loss is also available in the output (requires passing return_loss=True)
loss = outputs.lossDeploy This Model
Production-ready deployment in minutes
Together.ai
Instant API access to this model
Production-ready inference API. Start free, scale to millions.
Try Free APIReplicate
One-click model deployment
Run models in the cloud with simple API. No DevOps required.
Deploy NowDisclosure: We may earn a commission from these partners. This helps keep LLMYourWay free.