2023

  • Emergent Abilities as a Mirage (EAM)
    • title and link: Are Emergent Abilities of Large Language Models a Mirage?
    • information: NeurIPS 2023 outstanding paper Stanford
    • problem and position: analysis of the emergent abilities of LLMs
    • method overview: simple theoretical derivation using neural scaling laws and empirical experiments using different metrics
    • results: seemingly emergent abilities are caused by the researcher choosing a metric that nonlinearly or discontinuously deforms smooth per-token error rates, rather than by fundamental changes in model behavior with scale EAM_result
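    • code sketch: a minimal illustration (mine, with a hypothetical power-law per-token accuracy) of how a nonlinear metric manufactures apparent emergence

      ```python
      # Per-token accuracy improves smoothly with scale, but exact match over
      # L tokens (a nonlinear metric) looks like a sudden emergent jump.
      import numpy as np

      params = np.logspace(7, 11, 50)                   # hypothetical model sizes
      per_token_acc = 1 - 0.5 * (params / 1e7) ** -0.3  # smooth power-law improvement

      L = 20                                  # answer length for exact match
      exact_match = per_token_acc ** L        # all L tokens must be correct

      for p, a, em in zip(params[::12], per_token_acc[::12], exact_match[::12]):
          print(f"params={p:.1e}  per-token={a:.3f}  exact-match={em:.3f}")
      # per-token accuracy rises smoothly while exact match stays near 0 and
      # then shoots up "suddenly" -- an artifact of the metric, not the model
      ```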
  • Combinatorial-dependence Audio-queried TransformeR (CATR)
    • title and link: CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation
    • information: ACM MM 2023 best paper ZJU
    • problem and position: address two limitations of existing Audio-Visual Video Segmentation (AVVS) methods
    • method overview: combinatorial-dependence fusion with decoupled audio-visual transformer encoding and block-wise encoded gate, and object-aware audio-queried decoding
    • teaser: CATR_teaser
    • results: CATR_result1 CATR_result2 CATR_result3
    • method details:
      • Decoupled Audio-Visual Transformer (DAVT) encoding
        • visual and audio features separately: $F_v^l \in R^{T\times H\times W\times C}$ and $F_a^l \in R^{T\times d}$
        • first visual extraction by ResNet-50 or PVT-v2 and atrous spatial pyramid pooling, first audio extraction by VGGish pretrained on AudioSet
        • then spatial fusion: concatenate audio and visual features and apply self-attention on the mixed feature of shape $(H \times W + 1) \times D$
        • then temporal A-V and V-A fusion by multi-head attention, alternating which modality supplies the query and which supplies the key/value; the two branches are merged by addition and only the visual features are used for decoding (cross-attention sketched after this list) CATR_method1 CATR_method2
        • but the spatial audio-visual fusion already mixes audio and visual features, so why are the additional temporal fusions needed?
        • a blockwise-encoded gate is inserted to propagate weighted features across different blocks CATR_method3
      • Audio-Queried Decoding
        • $N$ learnable queries $Q \in R^{N\times T\times C}$, embedded with audio features
        • visual features by FPN as $F_{seg} \in R^{T \times H/8 \times W/8 \times C}$
        • dynamic kernel generates segmentation $M \in R^{T \times N \times H/8 \times W/8}$ CATR_method4
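    • code sketch: a rough rendition (my reading, not the official code) of the temporal A-V and V-A fusion: two multi-head attention blocks swap which modality supplies the query vs. the key/value, then merge by addition

      ```python
      import torch
      import torch.nn as nn

      T, C = 8, 256  # frames, channels (illustrative sizes)
      attn_av = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
      attn_va = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)

      visual = torch.randn(1, T, C)  # per-frame visual features (spatially pooled here)
      audio = torch.randn(1, T, C)   # per-frame audio features projected to C dims

      # A-V: audio queries attend over visual keys/values; V-A: roles swapped.
      av, _ = attn_av(query=audio, key=visual, value=visual)
      va, _ = attn_va(query=visual, key=audio, value=audio)

      fused = visual + av + va  # merged by addition; only this visual stream is decoded
      ```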
  • Feature Fields for Robotic Manipulation (F3RM)
    • title and link: Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
    • information: CoRL 2023 best paper MIT
    • problem and position: few-shot, language-guided, open-ended grasping
    • method overview: a NeRF distills 2D CLIP features into 3D, then these features are matched against grasp poses from demonstrations or language queries
    • teaser: F3RM_teaser
    • results: F3RM_result1F3RM_result2
    • method details:
      • Distilled Feature Fields modify NeRF with an additional feature output $f_{vis}$, supervised by 2D features from patch-level MaskCLIP; each task instance is trained individually
      • demonstrations include the target grasp pose $T$; the pose is represented as the average feature $Z_M$ of points sampled around the grasp pose, one $Z_M$ per task category
      • at inference on a new task instance, the translation is the discrete voxel whose feature $f_{\alpha}(v)$ has the highest cosine similarity with the task category's $Z_M$, and the rotation is found by optimizing over sampled rotations to maximize the cosine similarity of the pose feature $z_T$ with $Z_M$ (voxel selection sketched below) F3RM_method1
      • for language guidance, the new task instance can contain novel but similar objects beyond the task category
      • at inference with language, first retrieve the demonstrations $z_i$ most similar to the query text embedding $q$; the translation is then the voxel whose feature $f_{\alpha}(v)$ is most cosine-similar to $q$, and the rotation is optimized over sampled rotations to maximize the cosine similarity of $z_T$ with both $Z_M$ and $z_i$ F3RM_method2
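    • code sketch: the translation selection step as I understand it (shapes and names are mine): pick the voxel whose distilled feature is most cosine-similar to the task embedding $Z_M$ (or the text embedding $q$)

      ```python
      import torch
      import torch.nn.functional as F

      V, C = 100_000, 512
      voxel_feats = torch.randn(V, C)  # f_alpha(v): distilled CLIP feature per voxel
      voxel_xyz = torch.rand(V, 3)     # voxel centers
      Z_M = torch.randn(C)             # average demonstration feature for the task

      sim = F.cosine_similarity(voxel_feats, Z_M.unsqueeze(0), dim=-1)  # (V,)
      translation = voxel_xyz[sim.argmax()]  # grasp translation; the rotation is
      # then refined by maximizing similarity over sampled rotations
      ```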
  • RoboCook (RoboCook)
    • title and link: RoboCook: Long-Horizon Elasto-Plastic Object Manipulation with Diverse Tools
    • information: CoRL 2023 best systems paper Stanford (Huazhe Xu, Jiajun Wu)
    • problem and position: a robot uses tools to manipulate elasto-plastic objects, e.g., making dumplings and alphabet-letter cookies
    • method overview: a bit straightforward: a GNN serves as the point cloud dynamics model, PointNet++ predicts tool selection and policy actions, all wrapped in closed-loop control
    • teaser: RoboCook_teaser
    • results: no standard benchmark; self-comparison only
    • method details:
      • perception
        • merge RGBD point clouds from 4 cameras and crop the region of interest
        • color segmentation to extract dough points
        • surface reconstruction to a watertight mesh
        • use reconstructed mesh’s SDF to sample dough inner points
        • use tool’s gt mesh to sample tool surface points
      • dynamics model
        • GNN predicts future states of the points given the current states and the action (a bare-bones sketch follows this list)
        • each vertex is a point with its position and attribute RoboCook_method1
      • closed-loop control
        • PointNet++ based tool classification
        • PointNet++ based policy network, dataset synthesized by GNN dynamics model
        • closed-loop control with visual feedback RoboCook_method2
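    • code sketch: a bare-bones message-passing step (my simplification of the GNN dynamics): edges carry relative positions, and nodes predict per-point displacements

      ```python
      import torch
      import torch.nn as nn

      class DynamicsGNN(nn.Module):
          def __init__(self, dim=64):
              super().__init__()
              self.embed = nn.Linear(4, dim)  # position (3) + dough/tool attribute (1)
              self.edge_mlp = nn.Sequential(
                  nn.Linear(3 + 2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
              self.node_mlp = nn.Sequential(
                  nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 3))

          def forward(self, pos, attr, edges):
              h = self.embed(torch.cat([pos, attr], dim=-1))
              src, dst = edges  # (E,) index tensors
              msg = self.edge_mlp(torch.cat([pos[dst] - pos[src], h[src], h[dst]], dim=-1))
              agg = torch.zeros_like(h).index_add_(0, dst, msg)  # sum messages per node
              return pos + self.node_mlp(torch.cat([h, agg], dim=-1))  # next positions
      ```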
  • Power Line Inspection with Drones (PLIDrones)
    • title and link: Autonomous Power Line Inspection with Drones via Perception-Aware MPC
    • information: IROS 2023 best paper
    • problem and position: autonomous power line inspection with drones
    • method overview: MPC with two perception objectives, one for power line tracking (lines detected by YOLO) and one for obstacle avoidance (obstacles detected via U-V disparity maps)
    • teaser: PLIDrones_teaser
    • results: PLIDrones_result
  • ControlNet (ControlNet)
    • title and link: Adding Conditional Control to Text-to-Image Diffusion Models
    • information: ICCV 2023 best paper Stanford
    • problem and position: add spatial conditioning controls to pretrained text2image diffusion models
    • method overview: finetune an additional branch that is connected through zero-initialized convolutions (sketched at the end of this entry)
    • teaser: ControlNet_teaser
    • results: ControlNet_result1 ControlNet_result2 ControlNet_result3
    • method details: ControlNet_method1 ControlNet_method2
      • experiment on Stable Diffusion
      • since Stable Diffusion works in a $64 \times 64$ latent space, the condition $c$ is also converted to $64 \times 64$ by a tiny network ControlNet_method3
      • during finetuning, the text prompt is randomly dropped 50% of the time to strengthen the model's ability to read semantics from the conditioning image
      • finding: sudden convergence phenomenon ControlNet_method4
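    • code sketch: the zero-convolution wiring in miniature (not the full architecture): a trainable copy of a block is attached to the frozen block through 1x1 convolutions initialized to zero, so finetuning starts as an identity

      ```python
      import copy
      import torch
      import torch.nn as nn

      def zero_conv(channels):
          conv = nn.Conv2d(channels, channels, kernel_size=1)
          nn.init.zeros_(conv.weight)
          nn.init.zeros_(conv.bias)
          return conv

      class ControlledBlock(nn.Module):
          def __init__(self, pretrained_block, channels):
              super().__init__()
              self.trainable = copy.deepcopy(pretrained_block)       # trainable copy
              self.frozen = pretrained_block.requires_grad_(False)   # locked weights
              self.zin, self.zout = zero_conv(channels), zero_conv(channels)

          def forward(self, x, cond):
              # zin/zout output zeros at init, so the model starts exactly as the
              # frozen pretrained block and gradually learns to use the condition
              return self.frozen(x) + self.zout(self.trainable(x + self.zin(cond)))
      ```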
  • Watermark for Large Language Models (LLMWatermark)
    • title and link: A Watermark for Large Language Models
    • information: ICML 2023 outstanding paper
    • problem and position: watermark for LLM generated texts
    • method overview: partition the vocabulary into green and red lists while generating text and detect by computing statistics on green-list violations
    • teaser: LLMWatermark_teaser
    • method details:
      • watermark properties
        • detectable without knowledge of the model parameters or access to the LLM API
        • do not need re-training
        • detection only needs a contiguous portion of the generated text
        • cannot be removed unless a significant portion of the generated tokens is modified
      • Algorithm 1 (hard watermark): natural text violates the green-list rule about half the time while watermarked text never does; degrades quality on low-entropy text LLMWatermark_method1
      • Algorithm 2 (soft watermark): add a bias $\delta$ to the green-list logits; detect by computing a z-statistic (sketched below) LLMWatermark_method2
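    • code sketch: the soft watermark in miniature (hashing simplified to Python's hash): seed an RNG with the previous token, mark a $\gamma$-fraction of the vocabulary green, bias green logits by $\delta$, and detect with $z = (|s|_G - \gamma T)/\sqrt{T\gamma(1-\gamma)}$

      ```python
      import math
      import numpy as np

      VOCAB, GAMMA, DELTA = 50_000, 0.5, 2.0

      def green_list(prev_token):
          rng = np.random.default_rng(hash(prev_token) % (2**32))
          return set(rng.permutation(VOCAB)[: int(GAMMA * VOCAB)])

      def watermarked_logits(logits, prev_token):
          out = logits.copy()
          out[list(green_list(prev_token))] += DELTA  # softly favor green tokens
          return out

      def z_statistic(tokens):
          hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
          T = len(tokens) - 1
          return (hits - GAMMA * T) / math.sqrt(T * GAMMA * (1 - GAMMA))
      # a large z (e.g. > 4) means the text is almost surely watermarked
      ```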
  • D-Adaptation (D-Adaptation)
    • title and link: Learning-Rate-Free Learning by D-Adaptation
    • information: ICML 2023 outstanding paper FAIR
    • problem and position: the first method that automatically sets the learning rate during training while retaining the optimal convergence rate
    • method overview: estimate a lower bound on $D$ and apply AdaGrad-like optimization
    • results: D-Adaptation_result1D-Adaptation_result2
    • method details:
      • optimal learning rate needs $D$ and $G$
        • the dependence on $G$ can be handled AdaGrad-style D-Adaptation_method1D-Adaptation_method2
      • construct a lower bound $d_k$ on $D=\|x_0 - x_*\|$ D-Adaptation_method3 D-Adaptation_method4 D-Adaptation_method5 D-Adaptation_method6
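    • code sketch: a toy paraphrase of the bound (step weights simplified by me): convexity gives $D \geq \sum_k \lambda_k \langle g_k, x_0 - x_k \rangle / \|s_{k+1}\|$ with $s_{k+1} = \sum_k \lambda_k g_k$, so the estimate $d$ only grows toward $D$

      ```python
      import numpy as np

      def d_adapted_da(grad, x0, steps=1000, d0=1e-6, G=1.0):
          x, s, num, d = x0.copy(), np.zeros_like(x0), 0.0, d0
          for _ in range(steps):
              g = grad(x)
              lam = d / (G * np.sqrt(steps))   # simplified AdaGrad-like weight
              s += lam * g
              num += lam * g @ (x0 - x)
              if np.linalg.norm(s) > 0:
                  d = max(d, num / np.linalg.norm(s))  # raise the estimate of D
              x = x0 - s                       # dual-averaging iterate
          return x, d

      x, d = d_adapted_da(lambda x: 2 * (x - 3.0), x0=np.zeros(1))
      # x heads toward x* = 3; d stays a valid lower bound on D = ||x0 - x*|| = 3
      ```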
  • Time Optimal Ergodic Search (TOES)
    • title and link: Time Optimal Ergodic Search
    • information: RSS 2023 best paper
    • problem and position: ergodic search with minimum time needed
    • method overview: minimize trajectory time subject to an ergodicity inequality constraint; I cannot follow the details
  • FurnitureBench (FurnitureBench)
    • title and link: FurnitureBench: Reproducible Real-World Benchmark for Long-Horizon Complex Manipulation
    • information: RSS 2023 best system paper KAIST
    • problem and position: real-world furniture assembly benchmark
    • method overview: common setup, reproducible furniture models, easy-to-use robot control software, large teleoperation demonstration dataset, simulated environment
    • teaser: FurnitureBench_teaser
    • results: BC and RL perform poorly FurnitureBench_result
    • method details:
      • furniture assembly is long-horizon and complex, unlike pick-and-place FurnitureBench_result
      • common system setup and 8 reproducible 3D printed furniture models FurnitureBench_method2FurnitureBench_method3
      • observation includes RGB images by front camera and wrist camera and proprioceptive robot states
      • easy-to-use robot control software by python and docker
      • demonstration dataset of 5100 VR-teleoperated trajectories (219.6 hours overall) FurnitureBench_method4
      • FurnitureSim: built upon IsaacGym and Factory FurnitureBench_method5
      • evaluate BC with ResNet-18 encoder and IQL with R3M features
  • 3D Gaussian Splatting (3DGaussianSplatting)
    • title and link: 3D Gaussian Splatting for Real-Time Radiance Field Rendering
    • information: SIGGRAPH 2023 best paper
    • problem and position: the first real-time rendering of radiance fields while retaining high visual quality
    • method overview: 3D Gaussian as scene representation and optimization interleaved with adaptive density control and tile-based rasterizer
    • results: 3DGaussianSplatting_result13DGaussianSplatting_result2
    • method details:
      • 3D Gaussians with mean point $\mu$ and covariance matrix $\Sigma$, initialized with the sparse point cloud produced by SfM, and can be projected to 2D
      • optimize position $\mu$, covariance $\Sigma$ (parameterized by a quaternion and per-axis scales; sketched below), opacity $\alpha$, and spherical harmonic (SH) coefficients
      • Gaussians can fail in under-reconstruction and over-reconstruction cases, both detected by large positional gradients and addressed by adaptive density control that either clones or splits Gaussians
      • rendering like rasterization, Gaussian as primitive 3DGaussianSplatting_method13DGaussianSplatting_method2
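    • code sketch: the covariance parameterization (following the paper as I understand it): $\Sigma = R S S^T R^T$ built from a unit quaternion and per-axis scales stays positive semi-definite during optimization

      ```python
      import numpy as np

      def quat_to_rot(q):
          w, x, y, z = q / np.linalg.norm(q)
          return np.array([
              [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
              [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
              [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
          ])

      def covariance(quat, log_scale):
          R = quat_to_rot(quat)
          S = np.diag(np.exp(log_scale))  # exp keeps the scales positive
          return R @ S @ S.T @ R.T

      Sigma = covariance(np.array([1.0, 0, 0, 0]), np.log([0.05, 0.05, 0.01]))
      print(np.linalg.eigvalsh(Sigma))    # eigenvalues = squared scales >= 0
      ```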
  • Regularize Winding-Number (RWN)
    • title and link: Globally Consistent Normal Orientation for Point Clouds by Regularizing the Winding-Number Field
    • information: SIGGRAPH 2023 best paper
    • problem and position: estimate normals of point cloud with globally consistent orientation
    • method overview: use winding number requirements for surface to optimize normals
    • teaser: RWN_teaser
    • results: RWN_result1RWN_result2
    • method details:
      • winding number: 1 as interior and 0 as exterior
        • with random normals, the winding number field tends to be 0 everywhere (field computation sketched at the end of this entry) RWN_method1
      • 3 requirements lead to optimization objective function
        • the winding number should be either 0 or 1 at any query point: $f_{01}(n)$ RWN_method2RWN_method3RWN_method4
        • occurrences of 0 and 1 should be balanced among the Voronoi vertices: $f_B(n)$ RWN_method5
        • normals align with the outside Voronoi poles: $f_A(n)$ RWN_method6 RWN_method7
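    • code sketch: the point-cloud winding number field in its direct form (my minimal version): each oriented point contributes a solid-angle term, giving $\approx 1$ inside and $\approx 0$ outside a well-oriented cloud

      ```python
      import numpy as np

      def winding_number(query, points, normals, areas):
          d = points - query                   # (N, 3) offsets to each sample
          r = np.linalg.norm(d, axis=1)
          return np.sum(areas * np.einsum("ij,ij->i", d, normals) / (4 * np.pi * r**3))

      # sanity check on a unit sphere: interior query -> ~1, exterior -> ~0
      pts = np.random.randn(2000, 3)
      pts /= np.linalg.norm(pts, axis=1, keepdims=True)
      areas = np.full(2000, 4 * np.pi / 2000)
      print(winding_number(np.zeros(3), pts, pts, areas))            # ~1
      print(winding_number(np.array([3.0, 0, 0]), pts, pts, areas))  # ~0
      ```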
  • Unified Autonomous Driving (UniAD)
    • title and link: Planning-oriented Autonomous Driving
    • information: CVPR 2023 best paper Shanghai AI Lab
    • problem and position: the first to unify full-stack driving tasks into one end-to-end network, ultimately in service of planning
    • method overview: end-to-end BEVFormer-TrackFormer-MapFormer-MotionFormer-OccFormer-Planner framework connected by queries
    • teaser: UniAD_teaser1UniAD_teaser2
    • results: UniAD_result
    • method details:
      • BEVFormer: feature extractor backbone by BEVFormer
      • TrackFormer: detect and track agents by DETR and MOTR
      • MapFormer: panoptic segmentation by Panoptic SegFormer
      • MotionFormer: forecast per-agent future top-k trajectories
      • OccFormer: predict multi-step future occupancy
      • Planner: plan action
      • train the perception parts first, then train end-to-end; only planning performance is the final concern (module chaining sketched below) UniAD_method
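    • code sketch: a schematic of the query chaining (my abstraction, not the real interfaces): each Former consumes the shared BEV feature plus the previous module's queries and emits its own

      ```python
      def uniad_forward(images, bevformer, trackformer, mapformer,
                        motionformer, occformer, planner):
          bev = bevformer(images)                 # shared BEV feature map
          track_q = trackformer(bev)              # agent detection/tracking queries
          map_q = mapformer(bev)                  # map panoptic segmentation queries
          motion_q = motionformer(bev, track_q, map_q)  # per-agent top-k trajectories
          occupancy = occformer(bev, motion_q)    # multi-step future occupancy
          return planner(bev, motion_q, occupancy)  # final planned trajectory
      ```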
  • Visual Programming (VisProg)
    • title and link: Visual Programming: Compositional visual reasoning without training
    • information: CVPR 2023 best paper
    • problem and position: neuro-symbolic approach for compositional visual tasks given natural language instructions
    • method overview: LLM generates high-level programs according to instructions and invokes subroutines to execute on the images
    • method details:
      • in-context learning of LLM: prompt GPT-3 with pairs of natural language instructions and desired high-level programs, avoid task-specific training, generate high-level python-like modular program
      • subroutines are predefined and invoked step-by-step VisProg_method
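    • code sketch: a toy interpreter loop (module names below are illustrative, not the paper's exact registry): each generated line names a module, its arguments, and the variable to bind

      ```python
      def execute_program(program, modules, state):
          # state maps variable names (e.g. IMAGE) to values; eval is unsafe
          # in general and only acceptable for this toy illustration
          for line in program.strip().splitlines():
              out_var, call = line.split("=", 1)
              name, args = call.strip().split("(", 1)
              kwargs = eval(f"dict({args.strip().rstrip(')')})", {}, dict(state))
              state[out_var.strip()] = modules[name](**kwargs)
          return state

      # e.g. a generated program might look like:
      # BOX = LOC(image=IMAGE, object='dog')
      # IMG2 = CROP(image=IMAGE, box=BOX)
      # ANSWER = VQA(image=IMG2, question='what breed is it?')
      ```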
  • Distributed Data-Driven Predictive Control (DistributedDDPC)
  • Code as Policies (CaP)
  • Visual Token Matching (VTM)
    • title and link: Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching
    • information: ICLR 2023 outstanding paper
    • problem and position: the first unified few-shot learner for arbitrary dense prediction tasks
    • method overview: use query-support image patch similarity as weights to combine support label embeddings
    • results: VTM_result1VTM_result2
    • method details:
      • episodic training on several tasks and test on unseen tasks
      • unseen tasks have different output structures: decompose $R^{H\times W\times 3} \rightarrow R^{H\times W\times C_T}$ into $C_T$ independent $R^{H\times W\times 3} \rightarrow R^{H\times W\times 1}$ and learn by shared model
      • the query label is predicted as a weighted combination of support labels in a shared embedding space (attention sketched after this list) VTM_method1
      • image encoder as ViT, initialized by pretrained BEiT
      • label encoder as ViT but from scratch
      • similarity function is implemented as scaled dot-product attention
      • label decoder as Dense Prediction Transformer but from scratch VTM_method2
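    • code sketch: the matching step condensed (shapes are mine): query label tokens are predicted by attending from query image tokens to support image tokens, aggregating the support label tokens as values

      ```python
      import torch

      N, M, C = 196, 196, 384        # query tokens, support tokens, channels
      q_img = torch.randn(1, N, C)   # query image patch embeddings
      s_img = torch.randn(1, M, C)   # support image patch embeddings
      s_lab = torch.randn(1, M, C)   # support *label* patch embeddings

      attn = torch.softmax(q_img @ s_img.transpose(1, 2) / C**0.5, dim=-1)  # (1, N, M)
      q_lab = attn @ s_lab           # predicted query label tokens -> label decoder
      ```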
  • Dream Fusion (DreamFusion)
    • title and link: DreamFusion: Text-to-3D using 2D Diffusion
    • information: ICLR 2023 outstanding paper Google (Ajay Jain, Jonathan T. Barron, Ben Mildenhall)
    • problem and position: transfer pretrained 2D text2image diffusion models to 3D object synthesis, without any 3D training data
    • method overview: propose a loss for updating NeRF weights to approximate text2image diffusion model generated image distribution
    • teaser: DreamFusion_teaser
    • method details:
      • Score Distillation Sampling (SDS): use KL divergence loss to update NeRF parameters $\theta$ so that generating $x = g(\theta)$ approximates sampling from the distribution modeled by diffusion model DreamFusion_method1
      • diffusion model as Imagen with pretrained weights and frozen
      • NeRF as mip-NeRF 360 modified for additional shading, initialized with random weights and rendered under random camera and light
      • diffusion model can be thought as a teacher model to NeRF DreamFusion_method2
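    • code sketch: one SDS update at pseudocode level (render, diffusion_eps, add_noise, w are stand-ins, not real APIs): the U-Net's noise residual, computed without tracking its gradient, is pushed back through the differentiable rendering

      ```python
      import torch

      def sds_step(render, diffusion_eps, add_noise, w, text_emb, T=1000):
          x = render()                          # NeRF render, random camera and light
          t = torch.randint(1, T, (1,))
          eps = torch.randn_like(x)
          x_t = add_noise(x, eps, t)            # forward diffusion q(x_t | x)
          with torch.no_grad():                 # key trick: no U-Net backprop
              eps_hat = diffusion_eps(x_t, t, text_emb)
          grad = w(t) * (eps_hat - eps)         # SDS gradient w.r.t. the image
          x.backward(gradient=grad)             # chain rule into NeRF parameters
      ```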