2025

  • Peer Review (PeerReview)
  • Frequency-space Action Sequence Tokenization (FAST)
    • title and link: FAST: Efficient Action Tokenization for Vision-Language-Action Models
    • information: RSS 2025 outstanding paper finalist Physical Intelligence (Chelsea Finn, Sergey Levine)
    • problem and position: continuous-to-discrete action tokenization for LLM-like VLA
    • method overview: transform action chunks into frequency space and compress them into tokens
    • results: matches diffusion VLAs while being more efficient to train FAST_result1 FAST_result2
    • method details:
      • naive tokenization uses a per-dimension, per-timestep binning scheme, but for action chunks it can collapse to the trivial solution of simply copying the most recent action token, and it also produces many tokens
      • use a DCT to transform the action chunk into frequency space, then BPE to compress the coefficients into tokens (see the sketch after this list)
      • since BPE must be trained per dataset, train a universal action tokenizer across many datasets on 1-second action chunks FAST_method
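      • code sketch: a minimal illustration of the DCT-then-compress idea, assuming normalized action chunks and an invented quantization scale; the released FAST tokenizer's normalization, quantization, and BPE vocabulary all differ
        ```python
        # Hypothetical sketch of frequency-space action tokenization (not FAST's code).
        import numpy as np
        from scipy.fft import dct

        def tokenize_action_chunk(chunk: np.ndarray, scale: float = 10.0) -> list[int]:
            """chunk: (T, D) array of normalized actions -> integer symbol stream."""
            # DCT along time concentrates most energy in low-frequency coefficients.
            coeffs = dct(chunk, axis=0, norm="ortho")              # (T, D)
            # Rounding sends most high-frequency coefficients to zero.
            quantized = np.round(coeffs * scale).astype(np.int64)
            # Flatten low frequencies first; FAST further compresses this
            # sparse stream with a BPE tokenizer trained across datasets.
            return quantized.flatten().tolist()

        tokens = tokenize_action_chunk(np.random.randn(50, 7))     # 1 s at 50 Hz, 7 DoF
        ```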
  • Reactive Diffusion Policy (RDP)
    • title and link: Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation
    • information: RSS 2025 outstanding student paper finalist SJTU (Huazhe Xu, Cewu Lu)
    • problem and position: hierarchical visual-tactile imitation learning
    • method overview: teleoperation with tactile/force feedback shown in AR, slow vision latent diffusion policy and fast tactile/force reactive decoder
    • teaser: RDP_teaser
    • results: RDP_result1 RDP_result2 RDP_result3
    • method details:
      • use a Quest 3 to teleoperate, showing tactile/force feedback in AR
      • encode the tactile marker deformation field into a compact vector by PCA; encode force directly as the 6D vector
      • the fast policy is a VAE that encodes real action chunks into latent actions; its decoder conditions on high-frequency tactile/force feedback and runs in < 1 ms
      • the slow policy is a Diffusion Policy over the latent actions, ~100 ms
      • insight behind the slow-fast separation: the fast policy's asymmetric design (encode without, decode with tactile/force feedback) delegates reactive behaviors to the fast decoder, while the latent actions capture only the general tendency of the motion (see the sketch after this list) RDP_method
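      • code sketch: a hedged sketch of the asymmetric fast policy; all sizes (including the tactile/force feedback dimension) are invented, and RDP's actual architecture differs
        ```python
        import torch
        import torch.nn as nn

        class FastPolicy(nn.Module):
            """VAE over action chunks; only the decoder sees tactile/force feedback."""
            def __init__(self, act_dim=7, chunk=16, latent=64, feedback_dim=21):
                super().__init__()
                self.enc = nn.Sequential(nn.Linear(act_dim * chunk, 256), nn.ReLU(),
                                         nn.Linear(256, 2 * latent))   # -> mu, logvar
                self.dec = nn.Sequential(nn.Linear(latent + feedback_dim, 256), nn.ReLU(),
                                         nn.Linear(256, act_dim * chunk))

            def encode(self, actions):              # actions: (B, chunk, act_dim)
                mu, logvar = self.enc(actions.flatten(1)).chunk(2, dim=-1)
                return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

            def decode(self, z, feedback):          # feedback: PCA tactile + 6D force
                return self.dec(torch.cat([z, feedback], dim=-1))

        # The slow Diffusion Policy (~100 ms) outputs z; decode() re-runs at high
        # rate (< 1 ms) as fresh tactile/force feedback arrives, giving reactivity.
        ```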
  • HumanOid STanding-up (HoST)
    • title and link: Learning Humanoid Standing-up Control across Diverse Postures
    • information: RSS 2025 outstanding systems paper finalist Shanghai AI Lab (Jiangmiao Pang)
    • problem and position: humanoid standing-up control from diverse postures
    • method overview: train PPO in simulation with several techniques
    • teaser: HoST_teaser
    • results: HoST_result
    • method details:
      • use PPO to train Unitree G1 in Isaac Gym
        • observation as robot’s proprioceptive states
        • action as delta joint angles
      • use a PD controller to track actions via torque-based actuation
      • divide the task into 3 stages (righting, rising, standing) according to height, activate different reward functions at each stage; rewards fall into 4 groups, balanced with multiple critics (see the sketch after this list) HoST_method
      • apply a pulling force early in training and progressively decrease its magnitude, to ease the RL exploration challenge
      • scale down actions to constrain violent motions, and use an L2C2 loss for motion-smoothness regularization
      • design 4 terrains to diversify the starting postures
      • use domain randomization for sim2real transfer
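      • code sketch: a hedged sketch of the stage gating and multi-critic balancing; the height thresholds and the normalization scheme are illustrative, not HoST's actual numbers
        ```python
        import torch

        def stage(base_height: torch.Tensor) -> torch.Tensor:
            """0 = righting, 1 = rising, 2 = standing (thresholds invented)."""
            return (base_height > 0.45).long() + (base_height > 0.65).long()

        def multi_critic_advantage(adv_per_group: list[torch.Tensor]) -> torch.Tensor:
            """One critic per reward group; normalizing each group's advantage
            before summing keeps any single group from dominating the PPO update."""
            return sum((a - a.mean()) / (a.std() + 1e-8) for a in adv_per_group)
        ```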
  • MuJoCo Playground (MuJoCo-Playground)
    • title and link: MuJoCo Playground
    • information: RSS 2025 outstanding demo paper UC Berkeley (Pieter Abbeel)
    • problem and position: open-source infrastructure for robotics GPU RL
    • method overview: builds on MuJoCo XLA (MJX) for GPU simulation, integrates Madrona batch rendering for GPU rendering, and supports the DeepMind Control Suite plus locomotion and manipulation environments with diverse robots (see the MJX sketch below)
    • teaser: MuJoCo-Playground_teaser
    • results: the whole pipeline from setup to training takes only a few minutes on a single GPU
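    • code sketch: a minimal sketch of the underlying MJX layer (batched GPU simulation with jax.vmap); the Playground's own environment API wraps this with tasks, rendering, and training configs
      ```python
      import jax
      import mujoco
      from mujoco import mjx

      model = mujoco.MjModel.from_xml_string(
          "<mujoco><worldbody><body><joint type='free'/>"
          "<geom size='0.1'/></body></worldbody></mujoco>")
      mjx_model = mjx.put_model(model)           # device-resident model
      mjx_data = mjx.make_data(mjx_model)

      # Batch 4096 environments by vmapping over randomized initial states,
      # then jit-compile the vmapped step so simulation runs on the GPU.
      rngs = jax.random.split(jax.random.PRNGKey(0), 4096)
      batch = jax.vmap(lambda r: mjx_data.replace(
          qpos=mjx_data.qpos + 0.01 * jax.random.normal(r, (model.nq,))))(rngs)
      step = jax.jit(jax.vmap(mjx.step, in_axes=(None, 0)))
      batch = step(mjx_model, batch)
      ```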
  • Visual Geometry Grounded Transformer (VGGT)
    • title and link: VGGT: Visual Geometry Grounded Transformer
    • information: CVPR 2025 best paper Oxford
    • problem and position: feed-forward 3D reconstruction
    • method overview: Transformer input a set of images, output camera parameters, depth maps, point maps and point tracks
    • results: runs in under 1 second on 1 to hundreds of images, and still outperforms methods that require post-processing VGGT_result1 VGGT_result2 VGGT_result3 VGGT_result4 VGGT_result5
    • method details:
      • input a set of images, encode each with DINOv2, fuse with a self-attention Transformer
      • predict each camera's translation, quaternion, and field of view with an MLP on camera tokens
      • predict each depth map with DPT
      • predict each point map, in the frame of the first image, with DPT
      • predict dense tracking features with DPT, consumed by CoTracker2 for point tracks
      • these predictions are redundant, but bring performance gains (see the sketch after this list) VGGT_method
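      • code sketch: a rough sketch of the parallel prediction heads, with invented dimensions and 1x1 convs standing in for the DPT heads
        ```python
        import torch
        import torch.nn as nn

        class VGGTStyleHeads(nn.Module):
            """Parallel predictions from shared fused features (illustrative only)."""
            def __init__(self, dim: int = 1024):
                super().__init__()
                self.camera = nn.Linear(dim, 3 + 4 + 1)  # translation, quaternion, fov
                self.depth = nn.Conv2d(dim, 1, kernel_size=1)   # stand-in for DPT
                self.points = nn.Conv2d(dim, 3, kernel_size=1)  # first-frame coords

            def forward(self, cam_token, feat_map):
                # cam_token: (B, dim); feat_map: (B, dim, H, W) per image
                return (self.camera(cam_token),
                        self.depth(feat_map), self.points(feat_map))
        ```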
  • Component-Aligned Scene reconsTruction (CAST)
    • title and link: CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image
    • information: SIGGRAPH 2025 best paper ShanghaiTech (Jiayuan Gu, Jingyi Yu)
    • problem and position: 3D scene reconstruction from a single image where objects are posed precisely and interact naturally
    • method overview: first use foundation models to parse the image, then diffusion-based generative models produce object meshes and align their poses, and finally inter-object relations define cost functions for optimizing object poses
    • results: CAST_result
    • method details: CAST_method1
      • first use foundation models to parse image
        • Florence-2 to detect objects with bounding boxes and descriptions
        • GPT-4V to filter out spurious detections and isolate meaningful objects
        • GroundedSAMv2 to get segmentation masks
        • MoGe to predict point maps
      • object mesh generative model builds upon 3DShape2VecSet, pre-train on Objaverse
        • VAE architecture, encode image and point cloud, diffusion on latent code, then decode as SDF CAST_method2
        • address occlusion by training with random masking
      • pose alignment diffusion-based generative model inputs scene point cloud and object latent code, outputs corresponding aligned-with-mesh point cloud CAST_method3
        • use the Umeyama algorithm to recover the transformation matrix (see the sketch after this list)
      • iteratively alternate between the two models, since initially no aligned-with-mesh point cloud is available for the object mesh model
      • further optimize object poses, since the estimates are imperfect and do not account for inter-object relations
        • minimize custom costs depending on inter-object relations
          • contact-relation cost: encourage no penetration and at least one contact point
          • support-relation cost: encourage the supported object to stay close to its supporting object
        • use GPT-4V to infer the inter-object relations
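      • code sketch: the Umeyama step is standard; a self-contained implementation of the classic algorithm (not CAST's code) that recovers scale, rotation, and translation from corresponding point sets
        ```python
        import numpy as np

        def umeyama(src: np.ndarray, dst: np.ndarray):
            """Minimize ||s * R @ src_i + t - dst_i||^2 over similarity transforms.
            src, dst: (N, 3) corresponding points."""
            mu_s, mu_d = src.mean(0), dst.mean(0)
            xs, xd = src - mu_s, dst - mu_d
            cov = xd.T @ xs / len(src)
            U, D, Vt = np.linalg.svd(cov)
            S = np.eye(3)
            if np.linalg.det(U) * np.linalg.det(Vt) < 0:
                S[2, 2] = -1                  # correct for reflections
            R = U @ S @ Vt
            s = np.trace(np.diag(D) @ S) / xs.var(axis=0).sum()
            t = mu_d - s * R @ mu_s
            return s, R, t
        ```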
  • TokenVerse (TokenVerse)
    • title and link: TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
    • information: SIGGRAPH 2025 best paper DeepMind
    • problem and position: generate new combinations of multiple concepts drawn from multiple images
    • method overview: learn a modulation direction for each concept token by training an additional MLP to predict it, then add the direction to that token's modulation
    • teaser: TokenVerse_teaser
    • results: TokenVerse_result1 TokenVerse_result2
    • method details:
      • the intuitive approach adds an attribute-aware direction to the modulation of all tokens, but the effect is not localized TokenVerse_method1 TokenVerse_method2
      • so instead add the direction only to the modulation of the text token to be affected (see the sketch after this list)
      • train Concept-Mod, a small MLP that predicts the direction for each token in the concept image-text pair TokenVerse_method3
      • the training loss is the original diffusion loss, with two tricks added
        • for 50% of training steps, add a concept-isolation loss that discourages the learned directions from influencing concepts absent from the concept image, implemented as an L2 loss between an auxiliary image generated with the added directions and the original generation TokenVerse_method4
        • an additional MLP in each diffusion block predicts a per-token, per-block direction, added to the per-token Concept-Mod direction TokenVerse_method5
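      • code sketch: a conceptual sketch of per-token modulation offsets, assuming a DiT-style per-token modulation vector; names and sizes are invented
        ```python
        import torch
        import torch.nn as nn

        class ConceptMod(nn.Module):
            """Small MLP predicting a modulation direction per token embedding."""
            def __init__(self, dim: int = 3072, hidden: int = 512):
                super().__init__()
                self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(),
                                         nn.Linear(hidden, dim))

            def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
                return self.mlp(token_emb)

        def personalized_modulation(base_mod, directions, target_mask):
            """base_mod, directions: (B, T, dim); target_mask: (B, T) is 1 only
            for the concept token(s) to affect, keeping the edit localized."""
            return base_mod + directions * target_mask.unsqueeze(-1)
        ```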
  • Robot Data Management (Robo-DM)
    • title and link: Robo-DM: Data Management For Large Robot Datasets
    • information: ICRA 2025 best robot learning paper UC Berkeley
    • problem and position: robot dataset management
    • method overview: serialize trajectories as self-contained EBML byte packets, and use mmap-backed decoding caches for efficient random access (see the sketch below) Robo-DM_method
    • results: compression saves space by 70x (lossy) and 3.5x (lossless) compared to RLDS; decoding is 50x faster than LeRobot
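    • code sketch: a generic illustration of why mmap-backed caches make random access cheap (not Robo-DM's API; the file name and byte range are hypothetical)
      ```python
      import mmap

      with open("episode_000.ebml", "rb") as f:
          with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
              # Only the pages actually touched are read from disk, so seeking
              # into one cached packet of a large file avoids a full-file read.
              frame_bytes = mm[4096:8192]    # hypothetical packet byte range
      ```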
  • Human-level Robot Table Tennis (HRTT)
    • title and link: Achieving Human Level Competitive Robot Table Tennis
    • information: ICRA 2025 best robot learning paper finalist DeepMind
    • problem and position: first amateur human-level table tennis robot
    • method overview: high-level skill selector and low-level skill executors, train in simulation and iteratively transfer to real-world
    • results: the robot won 13/29 matches against humans of different skill levels
    • method details:
      • in the real world, use high-speed cameras and a perception stack to estimate ball position, and motion capture to track the opponent's paddle
      • train in MuJoCo simulation with high-fidelity system identification and domain randomization, then iteratively retrain on real-world data collected during evaluation, over 7 cycles
      • state-based policy instead of vision-based, one episode as one hit
      • the high-level controller selects among multiple low-level controllers to execute HRTT_method
      • low-level controllers act at 50 Hz, producing joint velocities
        • train in simulation with Blackbox Gradient Sensing (BGS)
        • 1D CNN, input 8 timesteps of ball 3D position and velocity and robot joint positions, output 8 timesteps of robot joint velocities
        • first train 2 generalist base policies for forehand and backhand styles, then finetune from them to specialist policies, leading to 17 skills
        • maintain skill descriptors holding performance metadata
      • the high-level controller acts within 20 ms of the opponent hitting the ball
        • train to predict forehand or backhand style, same input as low-level
        • train to classify topspin or underspin, input ball and paddle state
        • 5 heuristic strategies to shortlist some low-level controllers
        • estimate and update online preferences for each low-level skill (see the sketch after this list)
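      • code sketch: a toy Thompson-sampling stand-in for the online preference update; the paper's statistics and shortlist heuristics are richer than this
        ```python
        import random
        from collections import defaultdict

        stats = defaultdict(lambda: {"wins": 1, "tries": 2})   # optimistic prior

        def choose_skill(shortlist: list[str]) -> str:
            """Sample each shortlisted skill's success rate; pick the best draw."""
            def draw(s: str) -> float:
                w, n = stats[s]["wins"], stats[s]["tries"]
                return random.betavariate(w, n - w + 1)
            return max(shortlist, key=draw)

        def update(skill: str, success: bool) -> None:
            stats[skill]["tries"] += 1
            stats[skill]["wins"] += int(success)
        ```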
  • PolyTouch (PolyTouch)
    • title and link: PolyTouch: A Robust Multi-Modal Tactile Sensor for Contact-rich Manipulation Using Tactile-Diffusion Policies
    • information: ICRA 2025 best field and service robotics paper MIT (Edward Adelson)
    • problem and position: multi-modal robot finger with tactile, acoustic and peripheral vision
    • method overview: design a new tactile robot finger and tactile diffusion policy
    • teaser: PolyTouch_teaser
    • results: PolyTouch_result
    • method details: PolyTouch_method1
      • camera-based tactile sensing with a curved mirror
      • contact microphone-based acoustic sensing
      • peripheral vision sensing with the same camera PolyTouch_method2
      • diffusion policy with a separate encoder per modality, their projections concatenated (see the sketch after this list)
      • interesting finding: training robot policies with more modalities requires more data
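      • code sketch: a hedged sketch of per-modality encoders with concatenated projections; encoder output dimensions are invented
        ```python
        import torch
        import torch.nn as nn

        class MultiModalCondition(nn.Module):
            """Project each modality's features and concatenate them into one
            conditioning vector for the diffusion policy."""
            def __init__(self, dims=None, out: int = 256):
                super().__init__()
                dims = dims or {"vision": 512, "tactile": 512, "audio": 128}
                self.proj = nn.ModuleDict(
                    {k: nn.Linear(d, out) for k, d in dims.items()})

            def forward(self, feats: dict) -> torch.Tensor:
                return torch.cat([self.proj[k](feats[k]) for k in self.proj], dim=-1)
        ```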
  • Human-Agent Joint Learning (HAJL)
    • title and link: Human-Agent Joint Learning for Efficient Robot Manipulation Skill Acquisition
    • information: ICRA 2025 best human-robot interaction paper SJTU (Cewu Lu)
    • problem and position: human-robot shared control during teleoperation to reduce effort
    • method overview: human operators first teleoperate to gather an initial dataset for training a diffusion policy; then human and policy share control by diffusing and denoising the human action, with the policy's control ratio gradually increased; data is collected while the policy is finetuned (see the sketch below) HAJL_method
    • results: HAJL_result
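    • code sketch: a conceptual sketch of diffuse-then-denoise shared control; the interface (`policy_denoise`) and schedule are invented, and HAJL's actual mechanism differs in detail
      ```python
      import torch

      def blend_control(human_action: torch.Tensor, policy_denoise, alpha_bar: float):
          """Noise the human action to an intermediate diffusion step, then let
          the policy denoise it; smaller alpha_bar gives the policy more control."""
          noisy = ((alpha_bar ** 0.5) * human_action
                   + ((1.0 - alpha_bar) ** 0.5) * torch.randn_like(human_action))
          return policy_denoise(noisy)   # assumed callable for the reverse process
      ```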
  • Unsupervised Affordance Distillation (UAD)
    • title and link: UAD: Unsupervised Affordance Distillation for Generalization in Robotic Manipulation
    • information: ICRA 2025 best robot perception paper finalist Stanford (Jiajun Wu, Li Fei-Fei)
    • problem and position: use VLM to annotate affordance dataset without manual annotation
    • method overview: a VLM annotates <instruction, affordance> pairs to build a dataset, which trains an affordance model; the predicted affordance then conditions imitation learning
    • results: UAD_result1 UAD_result2 UAD_result3
    • method details: UAD_method
      • render object images and point clouds
      • extract 2D DINOv2 features and fuse them into the 3D point cloud, reduce dimensionality with PCA, then cluster into regions
      • query GPT-4o to propose a set of task instructions given the image and regions, then associate the instruction and region
      • the affordance model is a DINOv2 backbone conditioned on the language embedding via FiLM (see the sketch after this list)
      • imitation learning policy uses affordance map as input
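      • code sketch: FiLM conditioning itself is standard; a minimal sketch (UAD's actual head and dimensions differ)
        ```python
        import torch
        import torch.nn as nn

        class FiLM(nn.Module):
            """Scale-and-shift visual features with parameters from language."""
            def __init__(self, lang_dim: int = 512, feat_dim: int = 768):
                super().__init__()
                self.to_gamma_beta = nn.Linear(lang_dim, 2 * feat_dim)

            def forward(self, feats, lang_emb):      # feats: (B, N, feat_dim)
                gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
                return gamma.unsqueeze(1) * feats + beta.unsqueeze(1)
        ```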
  • $\mathcal{D(R,O)}$ Grasp (DRO-Grasp)
    • title and link: $\mathcal{D(R,O)}$ Grasp: A Unified Representation of Robot and Object Interaction for Cross-Embodiment Dexterous Grasping
    • information: ICRA 2025 best robot manipulation and locomotion paper NUS (Lin Shao)
    • problem and position: cross-embodiment and efficient dexterous grasp
    • method overview: model predicts $\mathcal{D(R,O)}$, then compute joint values from it
    • teaser: DRO-Grasp_teaser
    • results: DRO-Grasp_result
    • method details:
      • $\mathcal{D(R,O)}$ represents the distances between robot hand and object point clouds, which implicitly encode the grasp pose
      • point cloud feature extraction by DGCNN and then cross-attention Transformer
      • a CVAE concatenates a latent configuration variable to the features
      • a kernel function to compute $\mathcal{D(R,O)}$ DRO-Grasp_method1
      • pretrain the robot point cloud encoder to be configuration-invariant via point-level contrastive learning between open-hand and closed-hand point clouds
      • generate the grasp-pose point cloud from $\mathcal{D(R,O)}$ by least-squares optimization over the distances (see the sketch after this list)
      • generate 6D link poses by registering the grasp and canonical point clouds
      • generate joint values by IK-style optimization DRO-Grasp_method2
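      • code sketch: recovering one robot point from its predicted distances to object points via linear least squares (standard multilateration; the paper's exact formulation may differ)
        ```python
        import numpy as np

        def locate_point(obj_pts: np.ndarray, dists: np.ndarray) -> np.ndarray:
            """obj_pts: (N, 3) object points; dists: (N,) predicted distances.
            Linearize ||x - p_i||^2 = d_i^2 by subtracting the first equation."""
            p0, d0 = obj_pts[0], dists[0]
            A = 2 * (obj_pts[1:] - p0)
            b = (d0**2 - dists[1:]**2
                 + np.sum(obj_pts[1:]**2, axis=1) - np.sum(p0**2))
            x, *_ = np.linalg.lstsq(A, b, rcond=None)
            return x
        ```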
  • Segment Anything Model 2 (SAM2)
    • title and link: SAM 2: Segment Anything in Images and Videos
    • information: ICLR 2025 outstanding paper honorable mention FAIR (Ross Girshick)
    • problem and position: foundation model on promptable segmentation in images and videos
    • method overview: model behaves like SAM but adds memory attention to support video, public SA-V dataset
    • teaser: SAM2_teaser
    • results: SAM2_result1 SAM2_result2
    • method details:
      • the image encoder is an MAE-pretrained Hiera that extracts features
      • memory attention is a Transformer with self-attention over current-frame features and cross-attention to the memory bank
      • the prompt encoder is the same as SAM's
      • the mask decoder is similar to SAM's, with an additional head predicting whether the object of interest is present in the current frame
      • the memory encoder fuses the output mask and image features with a CNN
      • the memory bank holds two FIFO queues: $N$ recent frames and $M$ prompted frames (see the sketch after this list)
      • train by simulating interactive prompting
      • build a data engine and collect SA-V dataset
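      • code sketch: a tiny illustration of the two-queue memory bank; the real bank stores spatial memory features and object pointers, not opaque items
        ```python
        from collections import deque

        class MemoryBank:
            def __init__(self, n_recent: int, m_prompted: int):
                self.recent = deque(maxlen=n_recent)      # FIFO: recent frames
                self.prompted = deque(maxlen=m_prompted)  # FIFO: prompted frames

            def add(self, mem, was_prompted: bool) -> None:
                (self.prompted if was_prompted else self.recent).append(mem)

            def context(self) -> list:
                # Memory attention cross-attends to all stored memories.
                return list(self.prompted) + list(self.recent)
        ```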