2024

  • Visual AutoRegressive Modeling (VAR)
    • title and link: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
    • information: NeurIPS 2024 best paper ByteDance (Liwei Wang)
    • problem and position: autoregressive-style image generation that beats diffusion models
    • method overview: redefine the autoregressive order as coarse-to-fine next-scale prediction (sketch below)
    • teaser: VAR_teaser
    • results: beat diffusion, scaling law VAR_result
    • method details:
      • VQGAN with modified multi-scale quantization layer to tokenize image into multi-scale token maps
      • decoder-only Transformer like GPT-2 for next-scale prediction VAR_method
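
A highly simplified, hypothetical sketch of the coarse-to-fine next-scale prediction loop at inference time. The real VAR predicts residual-quantized multi-scale token maps with a block-causal GPT-2-style Transformer and decodes them with the modified VQGAN; the module sizes, scale schedule, and upsampling trick below are illustrative assumptions only.

```python
# Hypothetical next-scale prediction sketch (not the released VAR implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

SCALES = [1, 2, 3, 4, 6, 8]        # token-map side lengths (shortened for illustration)
VOCAB, DIM = 4096, 512

class TinyVAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.start = nn.Parameter(torch.zeros(1, 1, 1, DIM))      # 1x1 "start" map
        layer = nn.TransformerEncoderLayer(DIM, 8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    @torch.no_grad()
    def generate(self, batch: int = 1):
        prev_map = self.start.expand(batch, -1, -1, -1)           # [B, 1, 1, D]
        prefix = prev_map.flatten(1, 2)                           # all coarser tokens so far
        token_maps = []
        for side in SCALES:
            # Queries for the next scale: previous map upsampled to the new resolution.
            queries = F.interpolate(prev_map.permute(0, 3, 1, 2), size=(side, side),
                                    mode="bilinear").permute(0, 2, 3, 1).flatten(1, 2)
            h = self.backbone(torch.cat([prefix, queries], dim=1))
            logits = self.head(h[:, -queries.shape[1]:, :])       # one distribution per position
            tokens = torch.distributions.Categorical(logits=logits).sample()
            token_maps.append(tokens.view(batch, side, side))     # all tokens of a scale in parallel
            prev_map = self.embed(tokens).view(batch, side, side, DIM)
            prefix = torch.cat([prefix, prev_map.flatten(1, 2)], dim=1)
        return token_maps                                         # -> multi-scale VQGAN decoder

maps = TinyVAR().generate()
print([tuple(m.shape) for m in maps])
```
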
  • Open-source VLA (OpenVLA)
    • title and link: OpenVLA: An Open-Source Vision-Language-Action Model
    • information: CoRL 2024 outstanding paper finalist Stanford (Chelsea Finn, Percy Liang, Sergey Levine, Dorsa Sadigh, Russ Tedrake)
    • problem and position: a generalist, open-source vision-language-action (VLA) model
    • method overview: finetune a pretrained VLM on a large robot manipulation dataset
    • results: beats RT-X in both zero-shot and finetuning settings OpenVLA_result
    • method details:
      • finetune the Prismatic-7B VLM: a 600M-parameter pretrained SigLIP + DINOv2 vision encoder, a small MLP projector, and a 7B Llama 2 LLM
      • train on Open X-Embodiment dataset
      • 7B parameters, trained with 64 A100s for 14 days, inference 6Hz
      • open source OpenVLA_method
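
OpenVLA emits robot actions as LLM tokens by discretizing each continuous action dimension into 256 bins (the RT-2 recipe). The sketch below illustrates that round-trip; the normalization bounds and the token-id offset are illustrative assumptions, not the released tokenizer layout.

```python
# Hypothetical sketch of RT-2-style action discretization as used by OpenVLA.
import numpy as np

N_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0          # per-dimension bounds after normalization (assumed)
TOKEN_OFFSET = 32_000 - N_BINS               # e.g. reuse the least-used vocab ids (assumed)

def actions_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a 7-DoF continuous action to 7 discrete LLM token ids."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    bins = np.round((clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) * (N_BINS - 1))
    return bins.astype(np.int64) + TOKEN_OFFSET

def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Invert the mapping when decoding the VLM's generated tokens."""
    bins = tokens - TOKEN_OFFSET
    return bins / (N_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

a = np.array([0.1, -0.3, 0.0, 0.5, -1.0, 0.2, 1.0])   # e.g. Δxyz, Δrpy, gripper
print(tokens_to_actions(actions_to_tokens(a)))         # recovers a up to quantization error
```
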
  • From Speaker to Dubber (speaker2dubber)
  • Harmonic Mobile Manipulation (HarmonicMM)
    • title and link: Harmonic Mobile Manipulation
    • information: IROS 2024 best mobile manipulation paper AI2
    • problem and position: end-to-end joint learning of navigation and manipulation
    • method overview: takes RGB images as input and outputs base + arm actions, trained with DD-PPO in ProcTHOR (policy-head sketch below) HarmonicMM_method
    • results: HarmonicMM_result
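
A minimal, hypothetical sketch of a joint base + arm policy head of the kind HarmonicMM trains end-to-end with DD-PPO: RGB in, base and arm actions plus a value estimate out. The encoder and the action parameterization are assumptions, not the paper's architecture.

```python
# Hypothetical joint mobile-manipulation policy head (stand-in modules throughout).
import torch
import torch.nn as nn

class JointMobileManipPolicy(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # Visual encoder stand-in: RGB frame -> feature vector.
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 8, 4), nn.ReLU(),
                                     nn.Conv2d(32, 64, 4, 2), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                     nn.Linear(64, feat_dim), nn.ReLU())
        self.base_head = nn.Linear(feat_dim, 3)   # e.g. forward / turn velocities + stop (assumed)
        self.arm_head = nn.Linear(feat_dim, 4)    # e.g. end-effector deltas + gripper (assumed)
        self.value_head = nn.Linear(feat_dim, 1)  # critic for PPO-style training

    def forward(self, rgb: torch.Tensor):
        h = self.encoder(rgb)
        return self.base_head(h), self.arm_head(h), self.value_head(h)

policy = JointMobileManipPolicy()
base, arm, value = policy(torch.rand(2, 3, 224, 224))
print(base.shape, arm.shape, value.shape)
```
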
  • Minimalist Vision (MinimalistVision)
    • title and link: Minimalist Vision with Freeform Pixels
    • information: ECCV 2024 best paper Columbia
    • problem and position: solve a lightweight vision task with the smallest possible number of pixels
    • method overview: freeform pixels as the first layer of the network during training, deployed as learned optical masks in front of photodetectors (sketch below) MinimalistVision_method
    • teaser: MinimalistVision_teaser
    • results: 8 pixels suffice for monitoring indoor spaces, measuring room lighting, and estimating traffic flow
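
A hypothetical sketch of the freeform-pixel idea: each "pixel" is a learnable mask over the scene whose masked light is integrated by a single photodetector, and a tiny head maps the handful of readings to the task output. Image size, pixel count, and the head are illustrative assumptions; after training, the learned masks would be fabricated as optical masks.

```python
# Hypothetical freeform-pixel camera: learnable masks + photodetector readings + tiny head.
import torch
import torch.nn as nn

class FreeformCamera(nn.Module):
    def __init__(self, n_pixels: int = 8, h: int = 128, w: int = 128, n_out: int = 1):
        super().__init__()
        # One learnable transmission mask per freeform pixel (kept in [0, 1] via sigmoid).
        self.mask_logits = nn.Parameter(torch.randn(n_pixels, h, w))
        self.head = nn.Sequential(nn.Linear(n_pixels, 32), nn.ReLU(), nn.Linear(32, n_out))

    def forward(self, scene: torch.Tensor) -> torch.Tensor:
        masks = torch.sigmoid(self.mask_logits)                 # [P, H, W]
        # Photodetector reading = masked light integrated over the scene.
        readings = torch.einsum("bhw,phw->bp", scene, masks)    # [B, P]
        return self.head(readings)

cam = FreeformCamera()
print(cam(torch.rand(4, 128, 128)).shape)   # e.g. one scalar task output per image
```
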
  • GENerative Interactive Environments (Genie)
    • title and link: Genie: Generative Interactive Environments
    • information: ICML 2024 best paper DeepMind
    • problem and position: the first world model trained from action-unlabeled videos
    • method overview: spatiotemporal video tokenizer, latent action model, dynamics model
    • teaser: Genie_teaser1 Genie_teaser2
    • results: experiments on 2D platformer games and robotics (RT1) data Genie_result1 Genie_result2
    • method details: Genie_method1 Genie_method2
      • video tokenizer is VQ-VAE-based to encode and decode video Genie_method3
      • latent action model is VQ-VAE-based, treating the latent variable as the action with $|A| = 8$; operating directly on pixels works better Genie_method4
      • dynamics model is a decoder-only MaskGIT over tokens, with the action added rather than concatenated for better performance (sketch below) Genie_method5
      • first train the video tokenizer, then co-train the latent action model and dynamics model; at inference, discard the latent action model and keep only its action codebook
      • all components use spatiotemporal Transformer for memory efficiency
      • 11B parameters, most of them in the 10.1B dynamics model, trained with 256 TPUv5p
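
A hypothetical sketch of Genie-style inference: the user picks one of the $|A| = 8$ discrete latent actions, its embedding is added to the token embeddings (additive beats concatenation per the paper), and a MaskGIT-style dynamics model fills in the next frame's tokens over a few refinement steps. All module sizes and the decoding schedule are illustrative; the real model uses spatiotemporal attention.

```python
# Hypothetical MaskGIT dynamics step with additive action conditioning.
import torch
import torch.nn as nn

VOCAB, DIM, N_ACTIONS, TOKENS_PER_FRAME = 1024, 256, 8, 16 * 16

class TinyDynamics(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(VOCAB + 1, DIM)     # +1 for the [MASK] token
        self.act_embed = nn.Embedding(N_ACTIONS, DIM)
        layer = nn.TransformerEncoderLayer(DIM, 8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)
        self.mask_id = VOCAB

    @torch.no_grad()
    def next_frame(self, prev_tokens, action_id, steps: int = 4):
        b = prev_tokens.shape[0]
        nxt = torch.full((b, TOKENS_PER_FRAME), self.mask_id)
        for s in range(steps):                            # MaskGIT iterative decoding
            x = self.tok_embed(torch.cat([prev_tokens, nxt], dim=1))
            x = x + self.act_embed(action_id)[:, None, :] # additive action conditioning
            logits = self.head(self.backbone(x))[:, -TOKENS_PER_FRAME:]
            conf, pred = logits.softmax(-1).max(-1)
            keep = int(TOKENS_PER_FRAME * (s + 1) / steps)
            top = conf.topk(keep, dim=-1).indices
            nxt.scatter_(1, top, pred.gather(1, top))     # commit the most confident tokens
        return nxt

dyn = TinyDynamics()
prev = torch.randint(0, VOCAB, (1, TOKENS_PER_FRAME))
print(dyn.next_frame(prev, torch.tensor([3])).shape)      # next-frame tokens -> video decoder
```
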
  • Video Poet (VideoPoet)
    • title and link: VideoPoet: A Large Language Model for Zero-Shot Video Generation
    • information: ICML 2024 best paper Google
    • problem and position: LLM-based video generation
    • method overview: modality-specific tokenizers, decoder-only autoregressive Transformer, super-resolution Transformer
    • teaser: VideoPoet_teaser
    • results: VideoPoet_result1 VideoPoet_result2
    • method details:
      • decoder-only Transformer backbone for autoregressive generation
      • modality-specific tokenizers to map into unified space
        • text: pretrained frozen T5 XL encoder
        • image and video: MAGVIT-v2 tokenizer
        • audio: pretrained SoundStream tokenizer VideoPoet_method1
      • super-resolution Transformer with window attention VideoPoet_method2
      • pretrain on a mixture of tasks, then finetune on text-to-video (sequence-packing sketch below)
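
A hypothetical sketch of packing modality tokens into one decoder-only sequence, in the spirit of VideoPoet: visual (MAGVIT-v2) and audio (SoundStream) tokens share one discrete vocabulary via disjoint id offsets, while text conditions separately through frozen T5 XL embeddings and so gets no ids here. The special-token names and offsets are illustrative assumptions, not the paper's actual vocabulary layout.

```python
# Hypothetical unified-sequence layout for a decoder-only multimodal LM.
from typing import List

SPECIAL = {"<bos>": 0, "<bov>": 1, "<eov>": 2, "<boa>": 3, "<eoa>": 4}
VIDEO_OFFSET = 10                   # MAGVIT-v2 ids start here (illustrative)
AUDIO_OFFSET = 10 + 262_144         # SoundStream ids start after the visual range (illustrative)

def pack_discrete_sequence(video_tokens: List[int], audio_tokens: List[int]) -> List[int]:
    """Lay out video and audio tokens with modality-boundary specials; the decoder-only
    Transformer predicts this sequence autoregressively, conditioned on T5 text embeddings."""
    seq = [SPECIAL["<bos>"]]
    seq += [SPECIAL["<bov>"]] + [t + VIDEO_OFFSET for t in video_tokens] + [SPECIAL["<eov>"]]
    seq += [SPECIAL["<boa>"]] + [t + AUDIO_OFFSET for t in audio_tokens] + [SPECIAL["<eoa>"]]
    return seq

print(pack_discrete_sequence([1, 2, 3], [4, 5]))
```
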
  • Stable Diffusion 3 (SD3)
    • title and link: Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
    • information: ICML 2024 best paper Stability
    • problem and position: improve rectified flow Transformers for text2image
    • method overview: rectified flow with more frequent sampling of middle timesteps during training, plus a multimodal DiT (MM-DiT)
    • results: SD3_result1 SD3_result2 SD3_result3
    • method details:
      • rectified flow forward process, but with middle timesteps sampled more frequently during training (sketch below)
      • MM-DiT builds upon DiT SD3_method1 SD3_method2
      • 8B parameters
      • scaling law SD3_method3
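
A minimal sketch of a rectified-flow training step; one way to bias training toward middle timesteps, studied in the SD3 paper, is logit-normal timestep sampling, used below. The tiny velocity network stands in for MM-DiT and its size is an assumption.

```python
# Hypothetical rectified-flow training step with logit-normal timestep sampling.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4 + 1, 64), nn.SiLU(), nn.Linear(64, 4))  # predicts velocity

def rectified_flow_loss(x0: torch.Tensor) -> torch.Tensor:
    noise = torch.randn_like(x0)
    # Logit-normal sampling: u ~ N(0, 1), t = sigmoid(u) puts most mass near t = 0.5.
    t = torch.sigmoid(torch.randn(x0.shape[0], 1))
    x_t = (1 - t) * x0 + t * noise                  # straight-line (rectified flow) path
    target_velocity = noise - x0                    # d x_t / d t along the path
    pred = net(torch.cat([x_t, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean()

loss = rectified_flow_loss(torch.randn(8, 4))       # 4-dim toy "latents"
loss.backward()
print(float(loss))
```
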
  • Universal Manipulation Interface (UMI)
    • title and link: Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots
    • information: RSS 2024 outstanding systems paper Stanford (Shuran Song)
    • problem and position: portable, low-cost, in-the-wild robot data collection by humans
    • method overview: soft-gripper with camera and IMU
    • results: UMI_result
    • method details:
      • teleoperation is costly; learning from human videos has a large embodiment gap
      • hand-held soft gripper, but with a fisheye camera and side mirrors, GoPro IMU pose tracking, and markers for gripper-width detection
      • filter out kinematically unreachable data via forward kinematics
      • raw fisheye images without undistortion as input
      • reflecting the crops in the mirrors works better for policy learning UMI_method1
      • inputs: synchronized RGB images, relative gripper pose and width; outputs: delayed relative gripper pose and width (relative-pose sketch below)
      • Diffusion Policy imitation learning UMI_method2
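
A hypothetical sketch of the relative-pose representation: absolute gripper poses from the GoPro tracking are re-expressed in the frame of the current observation, so the learned policy does not depend on where the episode happened. The 4x4 homogeneous transforms are standard; the history/horizon lengths are illustrative assumptions.

```python
# Hypothetical relative gripper-pose observations/actions for a UMI-style policy.
import numpy as np

def relative_poses(poses_world: np.ndarray, ref_index: int) -> np.ndarray:
    """poses_world: [T, 4, 4] gripper poses in the world frame.
    Returns each pose expressed in the frame of poses_world[ref_index]."""
    ref_inv = np.linalg.inv(poses_world[ref_index])
    return ref_inv @ poses_world                     # broadcasts over the T axis

T = 24                                               # obs history + action horizon (assumed)
poses = np.tile(np.eye(4), (T, 1, 1))
poses[:, 0, 3] = np.linspace(0.0, 0.3, T)            # fake 30 cm translation along x
rel = relative_poses(poses, ref_index=7)             # last observed step as reference
obs_poses, action_poses = rel[:8], rel[8:]           # past -> policy inputs, future -> outputs
print(obs_poses.shape, action_poses.shape, rel[7, 0, 3])   # reference pose maps to identity
```
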
  • Fast-Slow LLM Anomaly Detection (AESOP)
    • title and link: Real-Time Anomaly Detection and Reactive Planning with Large Language Models
    • information: RSS 2024 outstanding paper Stanford
    • problem and position: robotics anomaly detection
    • method overview: fast stage thresholds an anomaly score from LLM embedding similarity against a nominal dataset; slow stage uses LLM reasoning to choose between continuing and a predefined recovery strategy; MPC plans with both nominal and recovery objectives (fast-stage sketch below) AESOP_method
    • results: no standard benchmark; self-comparison only
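
A hypothetical sketch of the fast stage: embed the scene description, compare it against a cached bank of nominal-experience embeddings, and flag an anomaly (escalating to the slow-stage LLM) when the best match falls below a calibrated threshold. `embed()`, the bank contents, and the threshold value are stand-ins, not the paper's models or data.

```python
# Hypothetical fast-stage anomaly check via embedding similarity.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding function (replace with a real LLM embedding model)."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=256)
    return v / np.linalg.norm(v)

NOMINAL_BANK = np.stack([embed(s) for s in [
    "pedestrian crossing at the crosswalk",
    "car waiting at a red light",
]])
THRESHOLD = 0.8   # calibrated offline on nominal data (illustrative value)

def fast_stage_is_anomaly(observation: str) -> bool:
    sims = NOMINAL_BANK @ embed(observation)         # cosine similarity (unit vectors)
    return float(sims.max()) < THRESHOLD             # low similarity -> escalate to slow stage

if fast_stage_is_anomaly("a mattress is lying in the middle of the lane"):
    print("anomaly: run slow-stage LLM reasoning to pick a recovery strategy")
```
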
  • Generalized Winding Numbers (GWN)
  • Generative Image Dynamics (GID)
    • title and link: Generative Image Dynamics
    • information: CVPR 2024 best paper Google
    • problem and position: model image-space motion to turn a single image into a looping video or an interactive demo
    • method overview: a diffusion model predicts Fourier-domain spectral volumes, which are inverse-Fourier-transformed into motion fields that warp the image into future frames
    • teaser: GID_teaser
    • results: GID_result1 GID_result2
    • method details:
      • motion texture as 2D displacement maps $\{F_t \mid I_t(\mathbf{p} + F_t(\mathbf{p})) = I_0(\mathbf{p}),\ t = 1, \ldots, T\}$
      • directly predicting the motion texture scales with $T$
      • Fast Fourier Transform to the frequency domain, $S(\mathbf{p}) = \mathrm{FFT}(F(\mathbf{p}))$; keeping only the $K = 16$ lowest frequencies is enough (sketch below)
      • latent diffusion model predicts a $4K$-channel 2D motion spectrum map, with $4$ Fourier coefficients per frequency GID_method1
      • naive normalization for stable training concentrates the signal at low frequencies, so normalize per frequency instead GID_method2 GID_method3
      • directly outputting all $4K$ channels yields over-smoothed results, so first train conditioned on a frequency embedding to predict a single frequency's $4$-channel coefficients, then freeze, insert attention layers to coordinate different frequencies, and finetune
      • rendering: multi-scale ResNet-34 features are soft-warped by $F_t = \mathrm{FFT}^{-1}(\hat{S})$ with weights $W = \frac{1}{T} \sum_t \|F_t\|_2$, then decoded GID_method4
      • train from scratch for 6 days with 16 A100s
      • collect 3015 videos as >150k image-motion pairs
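
A minimal sketch of the spectral-volume idea: per-pixel displacement trajectories are moved to the frequency domain and only the $K = 16$ lowest frequencies are kept, which is what the diffusion model predicts; at render time the kept coefficients are zero-padded and inverted back to motion fields. This sketch uses `rfft`/`irfft` for real-valued signals; shapes are illustrative.

```python
# Hypothetical spectral-volume round trip for a motion texture F_t(p).
import numpy as np

T, H, W, K = 150, 32, 32, 16
motion = np.random.randn(T, H, W, 2)                  # F_t(p): x/y displacement per frame

spectrum = np.fft.rfft(motion, axis=0)[:K]            # S(p): keep only the K lowest frequencies
# 4K real channels per pixel: real and imaginary parts of the x and y coefficients.
spectral_volume = np.concatenate([spectrum.real, spectrum.imag], axis=-1)  # [K, H, W, 4]
print(spectral_volume.shape)

# At render time, invert: zero-pad the kept coefficients and inverse-FFT back to T frames.
padded = np.zeros((T // 2 + 1, H, W, 2), dtype=complex)
padded[:K] = spectrum
recon = np.fft.irfft(padded, n=T, axis=0)             # low-passed F_t used to warp I_0
print(recon.shape)
```
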
  • Rich Automatic Human Feedback (RAHF)
    • title and link: Rich Human Feedback for Text-to-Image Generation
    • information: CVPR 2024 best paper Google
    • problem and position: fine-grained human feedback on text-image alignment for text2image generation
    • method overview: a fine-grained annotated text-image alignment dataset and a multimodal Transformer that predicts the human feedback (sketch below)
    • results: the predicted feedback can improve text-to-image generation models (via finetuning and inpainting)
    • method details:
      • RichHF-18K dataset from Pick-a-Pic
      • marking problematic image regions, marking misaligned text words, and annotating rating scores RAHF_method1
      • multimodal transformer with multiple heads RAHF_method2
      • finetune Muse on the self-generated images with high RAHF-predicted score
      • Muse inpainting with RAHF-predicted implausibility heatmap
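
A hypothetical sketch of a multi-head feedback predictor in the spirit of RAHF: shared image/text features feed separate heads for per-region heatmaps, per-word labels, and scalar scores. The backbone, patching, and head dimensions are stand-ins, not the paper's architecture.

```python
# Hypothetical multi-head rich-feedback model.
import torch
import torch.nn as nn

class RichFeedbackModel(nn.Module):
    def __init__(self, dim: int = 256, vocab: int = 32_000):
        super().__init__()
        self.img_proj = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # ViT-style patches
        self.txt_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.heatmap_head = nn.Linear(dim, 2)     # e.g. implausibility + misalignment per patch
        self.word_head = nn.Linear(dim, 2)        # aligned / misaligned per text token
        self.score_head = nn.Linear(dim, 4)       # fine-grained scalar ratings (assumed count)

    def forward(self, image, text_ids):
        patches = self.img_proj(image).flatten(2).transpose(1, 2)      # [B, P, D]
        words = self.txt_embed(text_ids)                               # [B, L, D]
        h = self.fusion(torch.cat([patches, words], dim=1))
        n_p = patches.shape[1]
        return (self.heatmap_head(h[:, :n_p]),       # per-patch heatmap logits
                self.word_head(h[:, n_p:]),          # per-word labels
                self.score_head(h.mean(dim=1)))      # pooled scores

m = RichFeedbackModel()
maps, words, scores = m(torch.rand(2, 3, 224, 224), torch.randint(0, 32_000, (2, 16)))
print(maps.shape, words.shape, scores.shape)
```
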
  • NavigatiOn with goal MAsked Diffusion (NoMaD)
    • title and link: NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration
    • information: ICRA 2024 best paper UCBerkeley (Sergey Levine)
    • problem and position: single network for goal-directed navigation and goal-agnostic exploration
    • method overview: Transformer encoder over observed images with optional goal masking, plus a diffusion policy for future actions (goal-masking sketch below)
    • results: NoMaD_result1 NoMaD_result2
    • method details:
      • ViNT as the encoder backbone for goal-conditioned navigation
      • ViKiNG’s topological graph for goal-free exploration
      • 50% probability goal masking during training
      • 1D conditional UNet as the diffusion policy
      • train on a combination of the GNM and SACSoN datasets NoMaD_method
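
A hypothetical sketch of goal masking: with probability 0.5 during training the goal token is dropped from the Transformer context (here simply zeroed, a simplification), so a single network serves both goal-conditioned navigation (mask = 0) and goal-free exploration (mask = 1). The encoders are stand-ins for the ViNT backbone, and the pooled output would condition the 1D UNet diffusion policy.

```python
# Hypothetical goal-masked context encoder for a NoMaD-style policy.
import torch
import torch.nn as nn

class GoalMaskedEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.obs_encoder = nn.Linear(3 * 96 * 96, dim)    # per-frame encoder stand-in
        self.goal_encoder = nn.Linear(3 * 96 * 96, dim)
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obs_frames, goal_frame, goal_mask):
        obs = self.obs_encoder(obs_frames.flatten(2))             # [B, N, D]
        goal = self.goal_encoder(goal_frame.flatten(1))[:, None]  # [B, 1, D]
        goal = goal * (1.0 - goal_mask.view(-1, 1, 1))            # zero the goal token when masked
        ctx = self.transformer(torch.cat([obs, goal], dim=1))
        return ctx.mean(dim=1)      # conditioning vector for the diffusion policy (1D UNet)

enc = GoalMaskedEncoder()
obs = torch.rand(2, 4, 3, 96, 96)                 # 4 past frames
goal = torch.rand(2, 3, 96, 96)
mask = torch.bernoulli(torch.full((2,), 0.5))     # 50% goal masking during training
print(enc(obs, goal, mask).shape)
```
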
  • Robotics Transformer X (RT-X)
  • Universal Simulator (UniSim)
    • title and link: Learning Interactive Real-World Simulators
    • information: ICLR 2024 outstanding paper UCBerkeley (Pieter Abbeel)
    • problem and position: action-conditioned video prediction enables robot learning
    • method overview: accepts language, motor actions, and camera motions as actions; an action-conditioned video diffusion model predicts future frames
    • teaser: UniSim_teaser
    • results: used for high-level VLM policy and low-level RL policy training
    • method details:
      • different video datasets cover different information UniSim_method
      • actions include text (via T5 language embeddings), motor actions, and camera motions
      • video 3D UNet diffusion model predicts next frames conditioned on observed frames and actions autoregressively
      • action conditioning via classifier-free guidance (sketch below)
      • 5.6B parameters
      • experiments: a PaLM-E image-goal-conditioned VLM policy, and a PaLI VLA policy with a learned reward function for block rearrangement, trained on 10k generated videos
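
A minimal sketch of classifier-free guidance for action conditioning: the denoiser is trained with the action embedding randomly dropped, and at sampling time the conditional and unconditional predictions are blended. The tiny `Denoiser` stands in for the video 3D UNet; sizes and the guidance scale are illustrative assumptions.

```python
# Hypothetical classifier-free guidance step for an action-conditioned denoiser.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    def __init__(self, dim: int = 64, act_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + act_dim, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_noisy, action_emb):
        return self.net(torch.cat([x_noisy, action_emb], dim=-1))

def guided_denoise(denoiser, x_noisy, action_emb, scale: float = 3.0):
    uncond = denoiser(x_noisy, torch.zeros_like(action_emb))   # null / dropped action
    cond = denoiser(x_noisy, action_emb)
    return uncond + scale * (cond - uncond)                    # classifier-free guidance blend

d = Denoiser()
x = torch.randn(2, 64)          # stand-in for noisy video latents
a = torch.randn(2, 16)          # embedded text / motor / camera actions
print(guided_denoise(d, x, a).shape)
```
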