OASIS: Observation-Action Space Alignment via \(SE(3)\) Trajectory Prediction for Robotic Manipulation

Xinzhe Chen*, Sihua Ren*, Liqi Huang, Haowen Sun, Mingyang Li, Xingyu Chen, Zeyang Liu, Xuguang Lan

National Key Laboratory of Human-Machine Hybrid Augmented Intelligence
Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University

*Equal contribution. Corresponding author.

Comparison of existing visuomotor policies and OASIS

Abstract

Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry.

OASIS aligns the intermediate representation with the action space through \(SE(3)\) end-effector trajectory prediction. It couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with a camera-frame \(SE(3)\) trajectory predictor. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion.

Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization.

Method

OASIS method overview
OASIS uses a 3D-aware feature encoder, predicts a camera-frame \(SE(3)\) end-effector trajectory, and decodes executable 6-DoF action chunks from the pose-supervised hidden states.
1. 3D-Aware Feature Encoder

Fuses Qwen2.5-based vision-language features with frozen Depth Anything 3 metric-depth features.

2. \(SE(3)\) Trajectory Predictor

Predicts a camera-frame \(SE(3)\) end-effector trajectory that provides pose-supervised hidden states for action decoding.

3. Action Decoder

Cross-attends to the pose-supervised trajectory states and current robot state to produce action chunks.
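
To make the camera-frame trajectory representation concrete, here is a minimal sketch of how end-effector poses recorded in a robot base frame could be re-expressed in the camera frame as 4x4 homogeneous \(SE(3)\) matrices. The helper names (`make_pose`, `invert_pose`, `to_camera_frame`), the rotation restricted to the z-axis, and the example extrinsics are all illustrative assumptions; this is not the OASIS implementation, whose predictor is a learned network.

```python
import math

def make_pose(yaw, translation):
    """Build a 4x4 SE(3) matrix from a rotation about z (radians) and a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_pose(t):
    """Invert a rigid transform: [R p; 0 1]^-1 = [R^T, -R^T p; 0 1]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def to_camera_frame(base_T_cam, base_T_ee_traj):
    """Map base-frame end-effector poses into the camera frame."""
    cam_T_base = invert_pose(base_T_cam)
    return [matmul(cam_T_base, pose) for pose in base_T_ee_traj]

# Hypothetical extrinsics: camera placed at (1.0, 0.0, 0.5) in the base frame,
# rotated 90 degrees about z.
base_T_cam = make_pose(math.pi / 2, (1.0, 0.0, 0.5))

# Hypothetical trajectory: the end effector slides along the base-frame y-axis.
trajectory = [make_pose(0.0, (1.0, 0.1 * k, 0.5)) for k in range(4)]

cam_traj = to_camera_frame(base_T_cam, trajectory)
for pose in cam_traj:
    print([round(pose[i][3], 3) for i in range(3)])
```

With these made-up extrinsics, motion along the base-frame y-axis appears as motion along the camera-frame x-axis. The rigid-body structure (orthonormal rotation block, bottom row `[0, 0, 0, 1]`) is preserved by the change of frame.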

Results

Simulation Benchmarks

OASIS is trained without large-scale robotic pretraining and evaluated on LIBERO and CALVIN ABC\(\to\)D.

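LIBERO (success rate, %)

| Method | Intermediate | Pretrain | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|---|
| SpatialVLA | Spatial features | Yes | 88.2 | 89.9 | 78.6 | 55.5 | 78.1 |
| WorldVLA | Future visual states | No | 85.6 | 89.0 | 82.6 | 59.0 | 79.1 |
| ThinkAct | 2D-supervised features | Yes | 88.3 | 91.4 | 87.1 | 70.9 | 84.4 |
| \(\pi_0\) | Multimodal features | Yes | 96.8 | 98.8 | 95.8 | 85.2 | 94.1 |
| QDepth-VLA | Spatial features | Yes | 97.6 | 96.6 | 95.2 | 90.0 | 94.9 |
| UniVLA | Spatial features | Yes | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| Unified-VLA | Future visual states | Yes | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 |
| OASIS | \(SE(3)\)-supervised features | No | 99.0 | 98.8 | 97.4 | 95.2 | 97.6 |
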
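CALVIN ABC\(\to\)D (success rate in % for completing 1 to 5 consecutive tasks; Avg. Len. is the average number of consecutively completed tasks)

| Method | Intermediate | Pretrain | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|---|
| SuSIE | Future visual states | Yes | 87.0 | 69.0 | 49.0 | 38.0 | 26.0 | 2.69 |
| 3D Diffuser Actor\(^{\dagger}\) | 3D feature | No | 93.8 | 80.3 | 66.2 | 53.3 | 41.2 | 3.35 |
| ReconVLA | Spatial features | Yes | 95.6 | 87.6 | 76.9 | 69.3 | 64.1 | 3.95 |
| Seer-Large | Future visual states | Yes | 96.3 | 91.6 | 86.1 | 80.3 | 74.0 | 4.28 |
| VPP | Future visual states | Yes | 96.5 | 90.9 | 86.6 | 82.0 | 76.9 | 4.33 |
| Unified-VLA | Future visual states | Yes | 98.9 | 94.8 | 89.0 | 82.8 | 75.1 | 4.41 |
| DreamVLA | Future visual states | Yes | 98.2 | 94.6 | 89.5 | 83.4 | 78.1 | 4.44 |
| OASIS | \(SE(3)\)-supervised features | No | 98.1 | 94.9 | 91.7 | 88.9 | 83.3 | 4.57 |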
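
The final average column of the CALVIN table can be recovered from the per-step success rates. On this benchmark each rollout chains five instructions, so the expected number of consecutively completed tasks is the sum of the five success rates divided by 100. The short check below uses the OASIS row from the table; the function name `average_length` is an illustrative choice.

```python
def average_length(success_rates_percent):
    """Expected number of consecutively completed tasks.

    success_rates_percent[k - 1] is the percentage of rollouts that complete
    at least k tasks in a row, so E[length] = sum over k of P(length >= k).
    """
    return sum(success_rates_percent) / 100.0

# OASIS row of the CALVIN table (columns 1 through 5).
oasis_row = [98.1, 94.9, 91.7, 88.9, 83.3]
oasis_avg = round(average_length(oasis_row), 2)
print(oasis_avg)  # 4.57
```
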
\(SE(3)\) trajectory prediction and robot execution visualization
Predicted translation waypoints and rotation axes closely track the executed end-effector path.
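
As a rough illustration of what such a visualization compares, the sketch below reads a translation waypoint and the three rotation axes out of a 4x4 \(SE(3)\) pose, then measures how far a predicted pose is from an executed one. The function names, the example poses, and the choice of Euclidean distance plus geodesic rotation angle as error measures are assumptions for illustration; the source does not state which tracking metrics, if any, were used.

```python
import math

def pose_z(yaw, translation):
    """Hypothetical helper: SE(3) pose with a rotation about z only."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def waypoint(pose):
    """Translation part of the pose: the last column of the top three rows."""
    return [pose[i][3] for i in range(3)]

def rotation_axes(pose):
    """Columns of the rotation block: the x, y, z axes of the end-effector frame."""
    return [[pose[i][j] for i in range(3)] for j in range(3)]

def translation_error(pred, executed):
    """Euclidean distance between the two waypoints."""
    return math.dist(waypoint(pred), waypoint(executed))

def rotation_error(pred, executed):
    """Geodesic angle in radians: arccos((trace(R_pred^T R_exec) - 1) / 2)."""
    trace = sum(pred[k][i] * executed[k][i] for i in range(3) for k in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.acos(cos_angle)

# Made-up poses: the prediction is off by 5 cm and 0.1 rad.
predicted = pose_z(0.1, (0.30, 0.00, 0.20))
executed = pose_z(0.0, (0.30, 0.04, 0.23))

pos_err = translation_error(predicted, executed)
rot_err = rotation_error(predicted, executed)
print(round(pos_err, 3), round(rot_err, 3))  # 0.05 0.1
```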

Real-World Performance

OASIS is evaluated on Franka Research 3 and Kinova Gen3 platforms across Goal, Spatial, Long, and OOD settings.

Real-world robot platforms and tasks

Demos

Simulation

LIBERO Long

CALVIN ABC\(\to\)D

Real-World Tasks

Goal

Place the banana into the red bowl
Place the banana into the yellow bowl
Place the carrot into the red bowl
Place the carrot into the yellow bowl
Place the orange into the red bowl
Place the orange into the yellow bowl

Spatial

Stack blocks
Build towers
Place pots on the wooden bracket
Put the orange can on shelves
Hang the cup
Place the cup

Long Horizon

Open the drawer and place the banana into the red bowl

Out-of-Distribution Real-World Settings

Unseen Backgrounds

Banana
Carrot
Orange

Human Interference and Camera Shift

Human interference: banana
Human interference: orange
New camera perspective: carrot

BibTeX

@misc{chen2026oasisobservationactionspacealignment,
      title={OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation}, 
      author={Xinzhe Chen and Sihua Ren and Liqi Huang and Haowen Sun and Mingyang Li and Xingyu Chen and Zeyang Liu and Xuguang Lan},
      year={2026},
      eprint={2605.25829},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.25829}, 
}