2024

  • Visual AutoRegressive Modeling (VAR)
    • title and link: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
    • information: NeurIPS 2024 best paper ByteDance (Liwei Wang)
    • problem and position: autoregressive-style image generation that beats diffusion models
    • method overview: redefine the autoregressive order as coarse-to-fine next-scale prediction (sketch below)
    • teaser: VAR_teaser
    • results: beat diffusion, scaling law VAR_result
    • method details:
      • VQGAN with modified multi-scale quantization layer to tokenize image into multi-scale token maps
      • decoder-only Transformer like GPT-2 for next-scale prediction VAR_method
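
A highly simplified, hypothetical sketch of the coarse-to-fine next-scale prediction loop at inference time. The real VAR predicts residual-quantized multi-scale token maps with a block-causal GPT-2-style Transformer and decodes them with the modified VQGAN; the module sizes, scale schedule, and upsampling trick below are illustrative assumptions only.

```python
# Hypothetical next-scale prediction sketch (not the released VAR implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

SCALES = [1, 2, 3, 4, 6, 8]        # token-map side lengths (shortened for illustration)
VOCAB, DIM = 4096, 512

class TinyVAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.start = nn.Parameter(torch.zeros(1, 1, 1, DIM))      # 1x1 "start" map
        layer = nn.TransformerEncoderLayer(DIM, 8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    @torch.no_grad()
    def generate(self, batch: int = 1):
        prev_map = self.start.expand(batch, -1, -1, -1)           # [B, 1, 1, D]
        prefix = prev_map.flatten(1, 2)                           # all coarser tokens so far
        token_maps = []
        for side in SCALES:
            # Queries for the next scale: previous map upsampled to the new resolution.
            queries = F.interpolate(prev_map.permute(0, 3, 1, 2), size=(side, side),
                                    mode="bilinear").permute(0, 2, 3, 1).flatten(1, 2)
            h = self.backbone(torch.cat([prefix, queries], dim=1))
            logits = self.head(h[:, -queries.shape[1]:, :])       # one distribution per position
            tokens = torch.distributions.Categorical(logits=logits).sample()
            token_maps.append(tokens.view(batch, side, side))     # all tokens of a scale in parallel
            prev_map = self.embed(tokens).view(batch, side, side, DIM)
            prefix = torch.cat([prefix, prev_map.flatten(1, 2)], dim=1)
        return token_maps                                         # -> multi-scale VQGAN decoder

maps = TinyVAR().generate()
print([tuple(m.shape) for m in maps])
```
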
  • Open-source VLA (OpenVLA)
    • title and link: OpenVLA: An Open-Source Vision-Language-Action Model
    • information: CoRL 2024 outstanding paper finalist Stanford (Chelsea Finn, Percy Liang, Sergey Levine, Dorsa Sadigh, Russ Tedrake)
    • problem and position: a generalist, open-source vision-language-action (VLA) model
    • method overview: finetune a pretrained VLM on a large robot manipulation dataset
    • results: beats RT-X in both zero-shot and finetuning settings OpenVLA_result
    • method details:
      • finetune the Prismatic-7B VLM: a 600M-parameter pretrained SigLIP + DINOv2 vision encoder, a small MLP projector, and a 7B Llama 2 LLM
      • train on Open X-Embodiment dataset
      • 7B parameters, trained with 64 A100s for 14 days, inference 6Hz
      • open source OpenVLA_method
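
OpenVLA emits robot actions as LLM tokens by discretizing each continuous action dimension into 256 bins (the RT-2 recipe). The sketch below illustrates that round-trip; the normalization bounds and the token-id offset are illustrative assumptions, not the released tokenizer layout.

```python
# Hypothetical sketch of RT-2-style action discretization as used by OpenVLA.
import numpy as np

N_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0          # per-dimension bounds after normalization (assumed)
TOKEN_OFFSET = 32_000 - N_BINS               # e.g. reuse the least-used vocab ids (assumed)

def actions_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a 7-DoF continuous action to 7 discrete LLM token ids."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    bins = np.round((clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (N_BINS - 1))
    return bins.astype(np.int64) + TOKEN_OFFSET

def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Invert the mapping when decoding the VLM's generated tokens."""
    bins = tokens - TOKEN_OFFSET
    return bins / (N_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

a = np.array([0.1, -0.3, 0.0, 0.5, -1.0, 0.2, 1.0])   # e.g. Δxyz, Δrpy, gripper
print(tokens_to_actions(actions_to_tokens(a)))         # recovers a up to quantization error
```
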
  • From Speaker to Dubber (speaker2dubber)
  • Harmonic Mobile Manipulation (HarmonicMM)
    • title and link: Harmonic Mobile Manipulation
    • information: IROS 2024 best mobile manipulation paper AI2
    • problem and position: end-to-end joint learning of navigation and manipulation
    • method overview: takes RGB images as input and outputs base + arm actions, trained with DD-PPO in ProcTHOR (policy-head sketch below) HarmonicMM_method
    • results: HarmonicMM_result
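
A minimal, hypothetical sketch of a joint base + arm policy head of the kind HarmonicMM trains end-to-end with DD-PPO: RGB in, base and arm actions plus a value estimate out. The encoder and the action parameterization are assumptions, not the paper's architecture.

```python
# Hypothetical joint mobile-manipulation policy head (stand-in modules throughout).
import torch
import torch.nn as nn

class JointMobileManipPolicy(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # Visual encoder stand-in: RGB frame -> feature vector.
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
                                     nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(64, feat_dim), nn.ReLU())
        self.base_head = nn.Linear(feat_dim, 3)   # e.g. forward / turn velocities + stop (assumed)
        self.arm_head = nn.Linear(feat_dim, 4)    # e.g. end-effector deltas + gripper (assumed)
        self.value_head = nn.Linear(feat_dim, 1)  # critic for PPO-style training

    def forward(self, rgb: torch.Tensor):
        h = self.encoder(rgb)
        return self.base_head(h), self.arm_head(h), self.value_head(h)

policy = JointMobileManipPolicy()
base, arm, value = policy(torch.rand(2, 3, 224, 224))
print(base.shape, arm.shape, value.shape)
```
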
  • Minimalist Vision (MinimalistVision)
    • title and link: Minimalist Vision with Freeform Pixels
    • information: ECCV 2024 best paper Columbia
    • problem and position: solve a lightweight vision task with the smallest possible number of pixels
    • method overview: freeform pixels as the first layer of the network during training, deployed as learned optical masks in front of photodetectors (sketch below) MinimalistVision_method
    • teaser: MinimalistVision_teaser
    • results: 8 pixels suffice for monitoring indoor spaces, measuring room lighting, and estimating traffic flow
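
A hypothetical sketch of the freeform-pixel idea: each "pixel" is a learnable mask over the scene whose masked light is integrated by a single photodetector, and a tiny head maps the handful of readings to the task output. Image size, pixel count, and the head are illustrative assumptions; after training, the learned masks would be fabricated as optical masks.

```python
# Hypothetical freeform-pixel camera: learnable masks + photodetector readings + tiny head.
import torch
import torch.nn as nn

class FreeformCamera(nn.Module):
    def __init__(self, n_pixels: int = 8, h: int = 128, w: int = 128, n_out: int = 1):
        super().__init__()
        # One learnable transmission mask per freeform pixel (kept in [0, 1] via sigmoid).
        self.mask_logits = nn.Parameter(torch.randn(n_pixels, h, w))
        self.head = nn.Sequential(nn.Linear(n_pixels, 32), nn.ReLU(), nn.Linear(32, n_out))

    def forward(self, scene: torch.Tensor) -> torch.Tensor:
        masks = torch.sigmoid(self.mask_logits)                 # [P, H, W]
        # Photodetector reading = masked light integrated over the scene.
        readings = torch.einsum("bhw,phw->bp", scene, masks)    # [B, P]
        return self.head(readings)

cam = FreeformCamera()
print(cam(torch.rand(4, 128, 128)).shape)   # e.g. one scalar task output per image
```
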
  • GENerative Interactive Environments (Genie)
    • title and link: Genie: Generative Interactive Environments
    • information: ICML 2024 best paper DeepMind
    • problem and position: the first world model trained from action-unlabeled videos
    • method overview: spatiotemporal video tokenizer, latent action model, dynamics model
    • teaser: Genie_teaser1 Genie_teaser2
    • results: experiments on 2D platformer games and robotics (RT1) data Genie_result1 Genie_result2
    • method details: Genie_method1 Genie_method2
      • video tokenizer is VQ-VAE-based to encode and decode video Genie_method3
      • latent action model is VQ-VAE-based, treating the latent variable as the action with $|A| = 8$; operating directly on pixels works better Genie_method4
      • dynamics model is a decoder-only MaskGIT over tokens, with the action added rather than concatenated for better performance (sketch below) Genie_method5
      • first train the video tokenizer, then co-train the latent action model and dynamics model; at inference, discard the latent action model and keep only its action codebook
      • all components use spatiotemporal Transformer for memory efficiency
      • 11B parameters, most of them in the 10.1B dynamics model, trained with 256 TPUv5p
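
A hypothetical sketch of Genie-style inference: the user picks one of the $|A| = 8$ discrete latent actions, its embedding is added to the token embeddings (additive beats concatenation per the paper), and a MaskGIT-style dynamics model fills in the next frame's tokens over a few refinement steps. All module sizes and the decoding schedule are illustrative; the real model uses spatiotemporal attention.

```python
# Hypothetical MaskGIT dynamics step with additive action conditioning.
import torch
import torch.nn as nn

VOCAB, DIM, N_ACTIONS, TOKENS_PER_FRAME = 1024, 256, 8, 16 * 16

class TinyDynamics(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(VOCAB + 1, DIM)     # +1 for the [MASK] token
        self.act_embed = nn.Embedding(N_ACTIONS, DIM)
        layer = nn.TransformerEncoderLayer(DIM, 8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)
        self.mask_id = VOCAB

    @torch.no_grad()
    def next_frame(self, prev_tokens, action_id, steps: int = 4):
        b = prev_tokens.shape[0]
        nxt = torch.full((b, TOKENS_PER_FRAME), self.mask_id)
        for s in range(steps):                            # MaskGIT iterative decoding
            x = self.tok_embed(torch.cat([prev_tokens, nxt], dim=1))
            x = x + self.act_embed(action_id)[:, None, :] # additive action conditioning
            logits = self.head(self.backbone(x))[:, -TOKENS_PER_FRAME:]
            conf, pred = logits.softmax(-1).max(-1)
            keep = int(TOKENS_PER_FRAME * (s + 1) / steps)
            top = conf.topk(keep, dim=-1).indices
            nxt.scatter_(1, top, pred.gather(1, top))     # commit the most confident tokens
        return nxt

dyn = TinyDynamics()
prev = torch.randint(0, VOCAB, (1, TOKENS_PER_FRAME))
print(dyn.next_frame(prev, torch.tensor([3])).shape)      # next-frame tokens -> video decoder
```
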
  • Video Poet (VideoPoet)
    • title and link: VideoPoet: A Large Language Model for Zero-Shot Video Generation
    • information: ICML 2024 best paper Google
    • problem and position: LLM-based video generation
    • method overview: modality-specific tokenizers, decoder-only autoregressive Transformer, super-resolution Transformer
    • teaser: VideoPoet_teaser
    • results: VideoPoet_result1 VideoPoet_result2
    • method details:
      • decoder-only Transformer backbone for autoregressive generation
      • modality-specific tokenizers to map into unified space
        • text: pretrained frozen T5 XL encoder
        • image and video: MAGVIT-v2 tokenizer
        • audio: pretrained SoundStream tokenizer VideoPoet_method1
      • super-resolution Transformer with window attention VideoPoet_method2
      • pretrain on a mixture of tasks, then finetune on text-to-video (sequence-packing sketch below)
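
A hypothetical sketch of packing modality tokens into one decoder-only sequence, in the spirit of VideoPoet: visual (MAGVIT-v2) and audio (SoundStream) tokens share one discrete vocabulary via disjoint id offsets, while text conditions separately through frozen T5 XL embeddings and so gets no ids here. The special-token names and offsets are illustrative assumptions, not the paper's actual vocabulary layout.

```python
# Hypothetical unified-sequence layout for a decoder-only multimodal LM.
from typing import List

SPECIAL = {"<bos>": 0, "<bov>": 1, "<eov>": 2, "<boa>": 3, "<eoa>": 4}
VIDEO_OFFSET = 10                   # MAGVIT-v2 ids start here (illustrative)
AUDIO_OFFSET = 10 + 262_144         # SoundStream ids start after the visual range (illustrative)

def pack_discrete_sequence(video_tokens: List[int], audio_tokens: List[int]) -> List[int]:
    """Lay out video and audio tokens with modality-boundary specials; the decoder-only
    Transformer predicts this sequence autoregressively, conditioned on T5 text embeddings."""
    seq = [SPECIAL["<bos>"]]
    seq += [SPECIAL["<bov>"]] + [t + VIDEO_OFFSET for t in video_tokens] + [SPECIAL["<eov>"]]
    seq += [SPECIAL["<boa>"]] + [t + AUDIO_OFFSET for t in audio_tokens] + [SPECIAL["<eoa>"]]
    return seq

print(pack_discrete_sequence([1, 2, 3], [4, 5]))
```
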
  • Stable Diffusion 3 (SD3)
    • title and link: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
    • information: ICML 2024 best paper Stability
    • problem and position: improve rectified flow Transformers for text2image
    • method overview: rectified flow with more frequent sampling of middle timesteps during training, plus a multimodal DiT (MM-DiT)
    • results: SD3_result1 SD3_result2 SD3_result3
    • method details:
      • rectified flow forward process, but with middle timesteps sampled more frequently during training (sketch below)
      • MM-DiT builds upon DiT SD3_method1 SD3_method2
      • 8B parameters
      • scaling law SD3_method3
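
A minimal sketch of a rectified-flow training step; one way to bias training toward middle timesteps, studied in the SD3 paper, is logit-normal timestep sampling, used below. The tiny velocity network stands in for MM-DiT and its size is an assumption.

```python
# Hypothetical rectified-flow training step with logit-normal timestep sampling.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4 + 1, 64), nn.SiLU(), nn.Linear(64, 4))  # predicts velocity

def rectified_flow_loss(x0: torch.Tensor) -> torch.Tensor:
    noise = torch.randn_like(x0)
    # Logit-normal sampling: u ~ N(0, 1), t = sigmoid(u) puts most mass near t = 0.5.
    t = torch.sigmoid(torch.randn(x0.shape[0], 1))
    x_t = (1 - t) * x0 + t * noise                  # straight-line (rectified flow) path
    target_velocity = noise - x0                    # d x_t / d t along the path
    pred = net(torch.cat([x_t, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

loss = rectified_flow_loss(torch.randn(8, 4))       # 4-dim toy "latents"
loss.backward()
print(float(loss))
```
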
  • Universal Manipulation Interface (UMI)
    • title and link: Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
    • information: RSS 2024 outstanding systems paper Stanford (Shuran Song)
    • problem and position: portable, low-cost, in-the-wild robot data collection by humans
    • method overview: soft-gripper with camera and IMU
    • results: UMI_result
    • method details:
      • teleoperation is costly; learning from human videos has a large embodiment gap
      • hand-held soft gripper, but with a fisheye camera and side mirrors, GoPro IMU pose tracking, and markers for gripper-width detection
      • filter out kinematically unreachable data via forward kinematics
      • raw fisheye images without undistortion as input
      • reflecting the crops in the mirrors works better for policy learning UMI_method1
      • inputs: synchronized RGB images, relative gripper pose and width; outputs: delayed relative gripper pose and width (relative-pose sketch below)
      • Diffusion Policy imitation learning UMI_method2
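
A hypothetical sketch of the relative-pose representation: absolute gripper poses from the GoPro tracking are re-expressed in the frame of the current observation, so the learned policy does not depend on where the episode happened. The 4x4 homogeneous transforms are standard; the history/horizon lengths are illustrative assumptions.

```python
# Hypothetical relative gripper-pose observations/actions for a UMI-style policy.
import numpy as np

def relative_poses(poses_world: np.ndarray, ref_index: int) -> np.ndarray:
    """poses_world: [T, 4, 4] gripper poses in the world frame.
    Returns each pose expressed in the frame of poses_world[ref_index]."""
    ref_inv = np.linalg.inv(poses_world[ref_index])
    return ref_inv @ poses_world                     # broadcasts over the T axis

T = 24                                               # obs history + action horizon (assumed)
poses = np.tile(np.eye(4), (T, 1, 1))
poses[:, 0, 3] = np.linspace(0.0, 0.3, T)            # fake 30 cm translation along x
rel = relative_poses(poses, ref_index=7)             # last observed step as reference
obs_poses, action_poses = rel[:8], rel[8:]           # past -> policy inputs, future -> outputs
print(obs_poses.shape, action_poses.shape, rel[7, 0, 3])   # reference pose maps to identity
```
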
  • Fast-Slow LLM Anomaly Detection (AESOP)
    • title and link: Real-Time Anomaly Detection and Reactive Planning with Large Language Models
    • information: RSS 2024 outstanding paper Stanford
    • problem and position: robotics anomaly detection
    • method overview: fast stage thresholds an anomaly score from LLM embedding similarity against a nominal dataset; slow stage uses LLM reasoning to choose between continuing and a predefined recovery strategy; MPC plans with both nominal and recovery objectives (fast-stage sketch below) AESOP_method
    • results: no standard benchmark; self-comparison only
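
A hypothetical sketch of the fast stage: embed the scene description, compare it against a cached bank of nominal-experience embeddings, and flag an anomaly (escalating to the slow-stage LLM) when the best match falls below a calibrated threshold. `embed()`, the bank contents, and the threshold value are stand-ins, not the paper's models or data.

```python
# Hypothetical fast-stage anomaly check via embedding similarity.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding function (replace with a real LLM embedding model)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

NOMINAL_BANK = np.stack([embed(s) for s in [
    "pedestrian crossing at the crosswalk",
    "car waiting at a red light",
]])
THRESHOLD = 0.8   # calibrated offline on nominal data (illustrative value)

def fast_stage_is_anomaly(observation: str) -> bool:
    sims = NOMINAL_BANK @ embed(observation)         # cosine similarity (unit vectors)
    return float(sims.max()) < THRESHOLD             # low similarity -> escalate to slow stage

if fast_stage_is_anomaly("a mattress is lying in the middle of the lane"):
    print("anomaly: run slow-stage LLM reasoning to pick a recovery strategy")
```
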
  • Generalized Winding Numbers (GWN)
  • Generative Image Dynamics (GID)
    • title and link: Generative Image Dynamics
    • information: CVPR 2024 best paper Google
    • problem and position: model image-space motion to turn a single image into a looping video or an interactive demo
    • method overview: a diffusion model predicts Fourier-domain spectral volumes, which are inverse-Fourier-transformed into motion fields that warp the image into future frames
    • teaser: GID_teaser
    • results: GID_result1 GID_result2
    • method details:
      • motion texture as 2D displacement maps $\{F_t \mid I_t(\mathbf{p} + F_t(\mathbf{p})) = I_0(\mathbf{p}),\ t = 1, \ldots, T\}$
      • directly predicting the motion texture scales with $T$
      • Fast Fourier Transform to the frequency domain, $S(\mathbf{p}) = \mathrm{FFT}(F(\mathbf{p}))$; keeping only the $K = 16$ lowest frequencies is enough (sketch below)
      • latent diffusion model predicts a $4K$-channel 2D motion spectrum map, with $4$ Fourier coefficients per frequency GID_method1
      • naive normalization for stable training concentrates the signal at low frequencies, so normalize per frequency instead GID_method2 GID_method3
      • directly outputting all $4K$ channels yields over-smoothed results, so first train conditioned on a frequency embedding to predict a single frequency's $4$-channel coefficients, then freeze, insert attention layers to coordinate different frequencies, and finetune
      • rendering: multi-scale ResNet-34 features are soft-warped by $F_t = \mathrm{FFT}^{-1}(\hat{S})$ with weights $W = \frac{1}{T} \sum_t \|F_t\|_2$, then decoded GID_method4
      • train from scratch for 6 days with 16 A100s
      • collect 3015 videos as >150k image-motion pairs
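
A minimal sketch of the spectral-volume idea: per-pixel displacement trajectories are moved to the frequency domain and only the $K = 16$ lowest frequencies are kept, which is what the diffusion model predicts; at render time the kept coefficients are zero-padded and inverted back to motion fields. This sketch uses `rfft`/`irfft` for real-valued signals; shapes are illustrative.

```python
# Hypothetical spectral-volume round trip for a motion texture F_t(p).
import numpy as np

T, H, W, K = 150, 32, 32, 16
motion = np.random.randn(T, H, W, 2)                  # F_t(p): x/y displacement per frame

spectrum = np.fft.rfft(motion, axis=0)[:K]            # S(p): keep only the K lowest frequencies
# 4K real channels per pixel: real and imaginary parts of the x and y coefficients.
spectral_volume = np.concatenate([spectrum.real, spectrum.imag], axis=-1)  # [K, H, W, 4]
print(spectral_volume.shape)

# At render time, invert: zero-pad the kept coefficients and inverse-FFT back to T frames.
padded = np.zeros((T // 2 + 1, H, W, 2), dtype=complex)
padded[:K] = spectrum
recon = np.fft.irfft(padded, n=T, axis=0)             # low-passed F_t used to warp I_0
print(recon.shape)
```
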
  • Rich Automatic Human Feedback (RAHF)
    • title and link: Rich Human Feedback for Text-to-Image Generation
    • information: CVPR 2024 best paper Google
    • problem and position: fine-grained human feedback on text-image alignment for text2image generation
    • method overview: a fine-grained annotated text-image alignment dataset and a multimodal Transformer that predicts the human feedback (sketch below)
    • results: the predicted feedback can improve text-to-image generation models (via finetuning and inpainting)
    • method details:
      • RichHF-18K dataset from Pick-a-Pic
      • marking problematic image regions, marking misaligned text words, and annotating rating scores RAHF_method1
      • multimodal transformer with multiple heads RAHF_method2
      • finetune Muse on the self-generated images with high RAHF-predicted score
      • Muse inpainting with RAHF-predicted implausibility heatmap
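
A hypothetical sketch of a multi-head feedback predictor in the spirit of RAHF: shared image/text features feed separate heads for per-region heatmaps, per-word labels, and scalar scores. The backbone, patching, and head dimensions are stand-ins, not the paper's architecture.

```python
# Hypothetical multi-head rich-feedback model.
import torch
import torch.nn as nn

class RichFeedbackModel(nn.Module):
    def __init__(self, dim: int = 256, vocab: int = 32_000):
        super().__init__()
        self.img_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # ViT-style patches
        self.txt_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.heatmap_head = nn.Linear(dim, 2)     # e.g. implausibility + misalignment per patch
        self.word_head = nn.Linear(dim, 2)        # aligned / misaligned per text token
        self.score_head = nn.Linear(dim, 4)       # fine-grained scalar ratings (assumed count)

    def forward(self, image, text_ids):
        patches = self.img_proj(image).flatten(2).transpose(1, 2)      # [B, P, D]
        words = self.txt_embed(text_ids)                               # [B, L, D]
        h = self.fusion(torch.cat([patches, words], dim=1))
        n_p = patches.shape[1]
        return (self.heatmap_head(h[:, :n_p]),       # per-patch heatmap logits
                self.word_head(h[:, n_p:]),          # per-word labels
                self.score_head(h.mean(dim=1)))      # pooled scores

m = RichFeedbackModel()
maps, words, scores = m(torch.rand(2, 3, 224, 224), torch.randint(0, 32_000, (2, 16)))
print(maps.shape, words.shape, scores.shape)
```
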
  • NavigatiOn with goal MAsked Diffusion (NoMaD)
    • title and link: NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration
    • information: ICRA 2024 best paper UCBerkeley (Sergey Levine)
    • problem and position: single network for goal-directed navigation and goal-agnostic exploration
    • method overview: Transformer encoder over observed images with optional goal masking, plus a diffusion policy for future actions (goal-masking sketch below)
    • results: NoMaD_result1 NoMaD_result2
    • method details:
      • ViNT as the encoder backbone for goal-conditioned navigation
      • ViKiNG’s topological graph for goal-free exploration
      • 50% probability goal masking during training
      • 1D conditional UNet as the diffusion policy
      • train on a combination of the GNM and SACSoN datasets NoMaD_method
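
A hypothetical sketch of goal masking: with probability 0.5 during training the goal token is dropped from the Transformer context (here simply zeroed, a simplification), so a single network serves both goal-conditioned navigation (mask = 0) and goal-free exploration (mask = 1). The encoders are stand-ins for the ViNT backbone, and the pooled output would condition the 1D UNet diffusion policy.

```python
# Hypothetical goal-masked context encoder for a NoMaD-style policy.
import torch
import torch.nn as nn

class GoalMaskedEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.obs_encoder = nn.Linear(3 * 96 * 96, dim)    # per-frame encoder stand-in
        self.goal_encoder = nn.Linear(3 * 96 * 96, dim)
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obs_frames, goal_frame, goal_mask):
        obs = self.obs_encoder(obs_frames.flatten(2))             # [B, N, D]
        goal = self.goal_encoder(goal_frame.flatten(1))[:, None]  # [B, 1, D]
        goal = goal * (1.0 - goal_mask.view(-1, 1, 1))            # zero the goal token when masked
        ctx = self.transformer(torch.cat([obs, goal], dim=1))
        return ctx.mean(dim=1)      # conditioning vector for the diffusion policy (1D UNet)

enc = GoalMaskedEncoder()
obs = torch.rand(2, 4, 3, 96, 96)                 # 4 past frames
goal = torch.rand(2, 3, 96, 96)
mask = torch.bernoulli(torch.full((2,), 0.5))     # 50% goal masking during training
print(enc(obs, goal, mask).shape)
```
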
  • Robotics Transformer X (RT-X)
  • Universal Simulator (UniSim)
    • title and link: Learning Interactive Real-World Simulators
    • information: ICLR 2024 outstanding paper UCBerkeley (Pieter Abbeel)
    • problem and position: action-conditioned video prediction enables robot learning
    • method overview: accepts language, motor actions, and camera motions as actions; an action-conditioned video diffusion model predicts future frames
    • teaser: UniSim_teaser
    • results: used for high-level VLM policy and low-level RL policy training
    • method details:
      • different video datasets cover different information UniSim_method
      • actions include text (via T5 language embeddings), motor actions, and camera motions
      • video 3D UNet diffusion model predicts next frames conditioned on observed frames and actions autoregressively
      • action conditioning via classifier-free guidance (sketch below)
      • 5.6B parameters
      • experiments: a PaLM-E image-goal-conditioned VLM policy, and a PaLI VLA policy with a learned reward function for block rearrangement, trained on 10k generated videos
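
A minimal sketch of classifier-free guidance for action conditioning: the denoiser is trained with the action embedding randomly dropped, and at sampling time the conditional and unconditional predictions are blended. The tiny `Denoiser` stands in for the video 3D UNet; sizes and the guidance scale are illustrative assumptions.

```python
# Hypothetical classifier-free guidance step for an action-conditioned denoiser.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    def __init__(self, dim: int = 64, act_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + act_dim, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_noisy, action_emb):
        return self.net(torch.cat([x_noisy, action_emb], dim=-1))

def guided_denoise(denoiser, x_noisy, action_emb, scale: float = 3.0):
    uncond = denoiser(x_noisy, torch.zeros_like(action_emb))   # null / dropped action
    cond = denoiser(x_noisy, action_emb)
    return uncond + scale * (cond - uncond)                    # classifier-free guidance blend

d = Denoiser()
x = torch.randn(2, 64)          # stand-in for noisy video latents
a = torch.randn(2, 16)          # embedded text / motor / camera actions
print(guided_denoise(d, x, a).shape)
```
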