UNC Chapel Hill’s Textless Vision-Language Transformer: Comparable Performance to Text-Based Approaches but 28x Faster

In the new paper TVLT: Textless Vision-Language Transformer, researchers from UNC Chapel Hill present the Textless Vision-Language Transformer (TVLT) for vision-and-language representation learning. TVLT uses only raw visual and audio inputs and performs comparably to its text-based counterparts but requires only 1/3 the parameters and achieves 28x faster inference speeds.

Transformer architectures have achieved impressive performance in vision-language (VL) representation learning when trained on text-annotated images or videos. It remains challenging, however, for transformers to learn VL representations without relying on text, i.e. using only low-level visual and acoustic inputs.

TVLT’s main architecture is a transformer comprising a 12-layer encoder and an 8-layer decoder. It takes as input a list of embeddings obtained directly from perception-level video and audio signals, and it includes no text-specific modules such as automatic speech recognition (ASR) or tokenization.
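The encoder/decoder layer counts above come from the paper; a minimal configuration sketch might look like the following, where the hidden size and head count are illustrative guesses, not values stated in the article:

```python
from dataclasses import dataclass

@dataclass
class TVLTConfig:
    encoder_layers: int = 12   # stated in the article
    decoder_layers: int = 8    # stated in the article
    hidden_dim: int = 768      # assumption, typical transformer-base size
    num_heads: int = 12        # assumption, not from the article

config = TVLTConfig()
```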

The input embeddings are a combination of 1) modality embeddings, 2) temporal/spatial embeddings for video, 3) temporal/frequency embeddings for audio, and 4) vision/audio patch embeddings.
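The combination above can be sketched as an element-wise sum of per-token embeddings, as is standard for transformer inputs. All dimensions and patch counts below are illustrative assumptions, not values from the paper:

```python
import numpy as np

D = 8            # embedding dimension (illustrative)
N_VIDEO = 4      # number of video patches (illustrative)
N_AUDIO = 3      # number of audio patches (illustrative)

rng = np.random.default_rng(0)

def embed_tokens(patch_emb, modality_emb, pos_emb):
    # Each token = patch embedding + modality embedding + positional embedding,
    # summed element-wise (the modality vector broadcasts over all patches).
    return patch_emb + modality_emb + pos_emb

video_patches = rng.normal(size=(N_VIDEO, D))
audio_patches = rng.normal(size=(N_AUDIO, D))

video_modality = rng.normal(size=(1, D))        # one learned vector per modality
audio_modality = rng.normal(size=(1, D))

video_temporal_spatial = rng.normal(size=(N_VIDEO, D))  # temporal/spatial for video
audio_temporal_freq = rng.normal(size=(N_AUDIO, D))     # temporal/frequency for audio

video_tokens = embed_tokens(video_patches, video_modality, video_temporal_spatial)
audio_tokens = embed_tokens(audio_patches, audio_modality, audio_temporal_freq)

# The encoder then consumes the concatenated token list from both modalities.
encoder_input = np.concatenate([video_tokens, audio_tokens], axis=0)
```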

TVLT is pretrained with two objectives: vision-audio matching (VAM) and masked autoencoding (MAE). VAM learns global cross-modal representations: a linear layer with sigmoid activation is applied to the encoder output to produce a matching probability, and a binary cross-entropy loss is then computed against the match label.
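The VAM head described above (linear layer, sigmoid, binary cross-entropy) can be sketched in a few lines; the function names and the pooled-representation input are illustrative assumptions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vam_loss(pooled, weight, bias, label):
    """Apply a linear layer + sigmoid to a pooled encoder representation,
    then compute binary cross-entropy against the match label (1 = the
    video and audio belong together, 0 = mismatched pair)."""
    logit = sum(p * w for p, w in zip(pooled, weight)) + bias
    prob = sigmoid(logit)
    eps = 1e-12  # numerical safety for log
    loss = -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))
    return loss, prob
```

For a logit of zero the matching probability is 0.5, and the loss for a matched pair reduces to log 2, which is a quick sanity check on the implementation.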

MAE improves unimodal representations by masking random patches of the visual frames and the audio spectrogram and reconstructing the missing inputs. A novel design choice slices the audio and video parts of the encoder output and feeds them to the decoder separately rather than jointly, which reduces compute costs and boosts finetuning performance.
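The masking and per-modality slicing described above can be sketched as follows. The patch counts, embedding size, and mask ratios are illustrative assumptions (the paper masks the two modalities at its own ratios, not necessarily these):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(tokens, mask_ratio):
    """Randomly hide a fraction of patch tokens; return the visible
    tokens and the indices that were kept."""
    n = tokens.shape[0]
    n_keep = n - int(n * mask_ratio)
    keep = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep], keep

# Illustrative patch counts, dims, and mask ratios (not the paper's values).
video = rng.normal(size=(16, 8))
audio = rng.normal(size=(10, 8))
video_visible, _ = mask_patches(video, mask_ratio=0.75)  # heavier video masking
audio_visible, _ = mask_patches(audio, mask_ratio=0.25)  # lighter audio masking

# After the visible tokens of both modalities are encoded together, the
# encoder output is sliced per modality, and each slice goes to the
# decoder on its own rather than jointly:
def split_for_decoder(encoded, n_video_visible):
    return encoded[:n_video_visible], encoded[n_video_visible:]

# Stand-in for the encoder output (same token order as the input list).
encoded = np.concatenate([video_visible, audio_visible], axis=0)
video_slice, audio_slice = split_for_decoder(encoded, video_visible.shape[0])
```

Decoding the slices independently shortens the sequences the decoder attends over, which is where the compute saving comes from.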

In their empirical study, the team compared TVLT with text-based counterparts on audio-to-video retrieval, video-based multimodal sentiment analysis, and visual question-answering benchmarks.

In the experiments, TVLT achieved performance competitive with state-of-the-art audio-based vision-and-language models on visual question answering, image retrieval, video retrieval and multimodal sentiment analysis. Moreover, it required only 1/3 of the parameters, and its inference speed was 28x faster than the text-based methods.

Overall, this paper showcases the powerful performance of TVLT and advances the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals without the need for traditional but computationally expensive text modelling.

The code and checkpoints are available on the project’s GitHub. The paper TVLT: Textless Vision-Language Transformer is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


