A hierarchical vision-language navigation framework that separates high-level planning from low-level execution for natural language-guided robot navigation in complex indoor environments.
Conventional robotic navigation systems often fail because they rely on rigid goal representations and closed-world assumptions, break down in brittle ways, and adapt poorly to dynamic changes during navigation. This project addresses those limitations with a hierarchical Vision-Language-Action (VLA) framework that decouples high-level planning from low-level execution. The high-level planner uses a Vision-Language Model (VLM) to generate discrete navigation commands from three RGB camera feeds. The low-level executor is a reinforcement learning controller that carries out those commands on the robot platform with depth-based safety checks.

We recorded 200 episodes using live FantasyVLN (Qwen2.5-VL) across HM3D minival scenes. Each episode begins from a random spawn with a goal instruction and captures, at every step, a 64x64 depth observation, the current VLM command, and the agent's position and heading. The low-level controller is a CNN-based PPO policy that conditions on depth input and high-level command context and learns collision-avoidance behavior. Training ran for 1400 PPO updates, totaling about 2.9 million environment steps, on a single NVIDIA RTX 3090 Ti (24GB VRAM) via RunPod, with each run taking about 11.5 hours of GPU compute time.

Ongoing work focuses on fine-tuning Qwen2.5-VL 7B on an object-goal navigation dataset generated on Matterport3D and HM3D using Habitat Simulator, and validating deployment on RunPod RTX 4090 32GB instances.
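To make the per-step data and the planner/executor split more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the project's code. The `Command` values, the `StepRecord` field names, the units, and the `stop_distance` threshold are all hypothetical. The safety gate is a simple hand-written rule standing in for the depth-based safety check; it is not the learned PPO policy described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple

DEPTH_SIZE = 64  # depth observations are 64x64, as stated in the text


class Command(Enum):
    # Hypothetical discrete command set; the actual commands are not listed.
    FORWARD = "forward"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    STOP = "stop"


@dataclass
class StepRecord:
    """One recorded step: depth map, current VLM command, and agent pose."""
    depth: List[List[float]]               # 64x64 depth map (metres assumed)
    command: Command                       # command issued by the high-level planner
    position: Tuple[float, float, float]   # agent position (x, y, z assumed)
    heading: float                         # agent heading (radians assumed)

    def __post_init__(self) -> None:
        rows_ok = len(self.depth) == DEPTH_SIZE
        if not rows_ok or any(len(row) != DEPTH_SIZE for row in self.depth):
            raise ValueError("depth must be a 64x64 grid")


def min_center_depth(depth: List[List[float]], half_width: int = 8) -> float:
    """Return the smallest depth value inside a central square window."""
    mid = len(depth) // 2
    rows = depth[mid - half_width: mid + half_width]
    return min(min(row[mid - half_width: mid + half_width]) for row in rows)


def safe_command(command: Command, depth: List[List[float]],
                 stop_distance: float = 0.3) -> Command:
    """Replace FORWARD with STOP when something is closer than stop_distance.

    Turning and stopping commands pass through unchanged. The threshold is an
    assumed value chosen only for illustration.
    """
    if command is Command.FORWARD and min_center_depth(depth) < stop_distance:
        return Command.STOP
    return command


if __name__ == "__main__":
    clear = [[2.0] * DEPTH_SIZE for _ in range(DEPTH_SIZE)]
    record = StepRecord(clear, Command.FORWARD, (0.0, 0.0, 0.0), 0.0)
    print(safe_command(record.command, record.depth))
```

Keeping the gate separate from the planner's output means the high-level command stays a pure suggestion, and the low-level side has the final say on whether it is safe to move.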

Phase 1
Phase 2

