Data-Driven Stochastic Motion Evaluation and Optimization with Image by
Spatially-Aligned Temporal Encoding
(ICRA 2023)

1Toyota Technological Institute,

Sound will be played, so pay attention to volume.

Abstract

This paper proposes a probabilistic motion prediction method for long motions. The motion is predicted so that it accomplishes a task from the initial state observed in the given image. While our method evaluates the task achievability by the Energy-Based Model (EBM), previous EBMs are not designed for evaluating the consistency between different domains (i.e., image and motion in our method). Our method seamlessly integrates the image and motion data into the image feature domain by spatially-aligned temporal encoding so that features are extracted along the motion trajectory projected onto the image. Furthermore, this paper also proposes a data-driven motion optimization method, Deep Motion Optimizer (DMO), that works with EBM for motion prediction. Different from previous gradient-based optimizers, our self-supervised DMO alleviates the difficulty of hyper-parameter tuning to avoid local minima. The effectiveness of the proposed method is demonstrated with a variety of experiments with similar SOTA methods.

Model Architecture

cars peace

Novelty of our model: To evaluate the consistency (task achievability) between the image and the motion, we propose Spatially-Aligned Temporal Encoding to seamlessly integrate the image and motion data into the image feature domain. Typically, all of the image and motion feature vectors are fed into the Transformer to fuse these features. However, it contains many redundant features, such as those of the background, which are less useful for evaluating consistency properly. Our method solves this problem by extracting only meaningful features from the image feature map. Based on the idea that the most meaningful features are located along the motion trajectory, we spatially align the motion trajectory to the image plane and extract the features along it. We then concatenate the extracted image and motion features to associate them. Since our motion features contain the aligned position and time step information, this encoding can be regarded as spatially-aligned temporal encoding.

Results

BibTeX

@article{oba2023data,
      author={Oba, Takeru and Ukita, Norimichi},
      title={Data-Driven Stochastic Motion Evaluation and Optimization with Image by Spatially-Aligned Temporal Encoding},
      journal={arXiv preprint arXiv:2302.05041},
      year={2023}
    }