All Work

Understanding the visual knowledge of language models
MIT News
Looking for a specific action in a video? This AI-based method can find it for you
MIT News
Computer vision system marries image recognition and generation
MIT News
Language-Guided Audio-Visual Source Separation via Trimodal Consistency
Bias Mimicking: A Simple Sampling Approach for Bias Mitigation
Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval
MaskSketch: Unpaired Structure-guided Masked Image Generation
ConStruct-VL: Data-Free Continual Structured VL Concepts Learning
Understanding and Improving Visual Prompting: A Label-Mapping Perspective
Video Test-Time Adaptation for Action Recognition
Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models
SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer
FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer
Masked Motion Encoding for Self-Supervised Video Representation Learning
EC^2: Emergent Communication for Embodied Control
Learning Situation Hyper-Graphs for Video Question Answering
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners
3D Concept Learning and Reasoning from Multi-View Images
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
Teaching Structured Vision & Language Concepts to Vision & Language Models
CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning
More Language, Less Labeling with Kate Saenko
This Week in Machine Learning & AI (TWIML) podcast
A safer, lower-cost alternative to real data for pretraining computer vision models
IBM Research blog
Hallucinating to better text translation
MIT News
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
Non-Adversarial Video Synthesis with Learned Priors
Camera On-boarding for Person Re-identification using Hypothesis Transfer Learning
Semi-Supervised Action Recognition with Temporal Contrastive Learning
Fashion IQ: A New Dataset towards Retrieving Images by Natural Language Feedback
GAN Compression: Efficient Architectures for Interactive Conditional GANs
Separating Skills and Concepts for Novel Visual Question Answering
Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules
The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models
Fine-grained Angular Contrastive Learning with Coarse Labels
Black-box Explanation of Object Detectors via Saliency Maps
Anycost GANs for Interactive Image Synthesis and Editing
Relationship Matters: Relation Guided Knowledge Transfer for Incremental Learning of Object Detectors
Identifying Interpretable Action Concepts in Deep Networks
Jointly Optimize Data Augmentation and Network Training: Adversarial Data Augmentation in Human Pose Estimation