Juze Zhang

I am currently a postdoc at Stanford University working with Prof. Ehsan Adeli, affiliated with the Stanford Vision and Learning Lab (SVL) and the Stanford Translational AI Lab (STAI). Before joining Stanford, I worked on digital humans at ShanghaiTech University, where I was fortunate to work closely with Jingyi Yu, Jingya Wang, and Lan Xu. I received my Ph.D. from the University of Chinese Academy of Sciences.

I consider myself a multidisciplinary researcher who enjoys building across the full stack — hardware, systems, algorithms, and frontier multimodal frameworks. On the systems side, I maintained a multi-view dome with over 100 cameras at ShanghaiTech, built the DailyCap Studio from scratch as the first student on the project, and assembled the HOI-X Studio (12 Z-CAM + 8 OptiTrack + RealSense). I also led a small team to design wireless IMUs from scratch, embedding them into objects, clothing, and even flexible PCBs worn on the face (see Projects).

On the algorithm side, my current research centers on multimodal alignment — in particular world/video models, speech LLMs, and the hardware stack for robotics. At Stanford, I built ViBES, an early exploration of the mixture-of-modality-experts (MoME) design — self-attention with modality-specific parameters and fractional embeddings on top of a speech LLM — which became a 3D conversational agent in early 2025. Encouraged by the success of this mixture-of-transformers (MoT) recipe, I then initiated OpenWAM, applying the same design to video generation in place of the speech LLM. I view this line of work as essentially sequential-transformer-to-sequential-transformer modeling.

Email / Scholar / Twitter / GitHub

Research

Selected papers are listed below (* denotes equal contribution). For the full list, please see my Google Scholar.

	OpenWAM: An Open Framework for Generalist World-Action Modeling. OpenWAM Team, Stanford University. In submission (project page / arXiv / blog / code)
	Listen, Speak, and Move: A Full-Body Embodied Agent for Interactive Humanoids. Juze Zhang, Heng Yu, David D. Yuan, Ali Sartaz Khan, Nihar Mudigonda, Yue Zhao, Ehsan Adeli. In submission (project page / arXiv / code)
	ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body. Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi Kowshika Lakshmikanth, Ehsan Adeli. CVPR, 2026 (project page / arXiv / code / dataset)
	Recreating Video Arenas via Automated Preference Scoring. Yue Zhao, Aniket Gupta, Juze Zhang, Tiange Xiang, Ryan Rong, Xuan Su, Li Fei-Fei, Ehsan Adeli. In submission (project page / arXiv / code)
	Think Before Watching: Planning to Select Frames for Video Understanding. Heng Yu, Yue Zhao, Mohammad H. Abbasi, Juze Zhang, Ehsan Adeli. In submission (project page / arXiv / code)
	SocialGen: Modeling Multi-Human Social Interaction with Language Models. Heng Yu, Juze Zhang, Changan Chen, Tiange Xiang, Yusu Fang, Juan Carlos Niebles, Ehsan Adeli. 3DV, 2026 (project page / arXiv)
	Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE). Gustavo Chau Loo Kung, Mohammad Abbasi, Camila Blank, Juze Zhang, Alan Q. Wang, Sophie Ostmeier, Akshay Chaudhari, Kilian Pohl, Ehsan Adeli. CVPR, 2026 (arXiv / code)
	The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion. Changan Chen, Juze Zhang, Shrinidhi Kowshika Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, Ehsan Adeli. CVPR, 2025 (project page / arXiv)
	SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control. Jingyan Zhang, Han Liang, Ruichi Zhang, Bin Li, Juze Zhang, Xin Chen, Jingya Wang, Lan Xu, Jingyi Yu. arXiv, 2026 (project page / arXiv)
	MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control. Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Jingya Wang. arXiv, 2026 (project page / arXiv)
	InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs. Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Lan Xu, Jingyi Yu, Jingya Wang. arXiv, 2025 (project page / arXiv / code)
	HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment. Juze Zhang, Jingyan Zhang, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, Jingya Wang. CVPR, 2024, Highlight (project page / arXiv / code / dataset)
	I’M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions. Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, Lan Xu. CVPR, 2024 (project page / arXiv / code / dataset)
	BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics. Wenqian Zhang, Molin Huang, Yuxuan Zhou, Juze Zhang, Jingyi Yu, Jingya Wang, Lan Xu. CVPR, 2024 (project page / arXiv / code / dataset)
	NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions. Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, Jingya Wang. CVPR, 2023 (project page / arXiv / code / dataset)
	IKOL: Inverse kinematics optimization layer for 3D human pose and shape estimation via Gauss-Newton differentiation. Juze Zhang, Ye Shi, Yuexin Ma, Lan Xu, Jingyi Yu, Jingya Wang. AAAI, 2023, oral (project page / arXiv / video / code)

Projects

Beyond papers, I love building the full stack with my hands — capture studios, sensors, and the systems that power the datasets above.

	DailyCap Studio: Built from Scratch. A multi-view capture studio (42 Z-CAM + 16 OptiTrack) at ShanghaiTech that I built from scratch as the first student on the project, from frame design and camera layout to synchronization, calibration, and the full capture pipeline.
	HOI-X Studio. A hybrid interaction-capture rig combining side-view sensing (12 Z-CAM E2 + 8 OptiTrack) with egocentric sensing (Pupil Core eye tracker + RealSense D455) for fine-grained human-object interaction capture.
	NeuralDome Studio: 76 Z-CAM + 16 Vicon. I maintained and operated the ShanghaiTech multi-view dome with over 100 cameras, powering large-scale human-object interaction datasets including NeuralDome (CVPR 2023) and HOI-M3 (CVPR 2024).
	Franka FR3 Robot Arm Platform. I built a Franka FR3 manipulation platform from scratch in the lab at Stanford: fabricating the mobile stand from raw aluminum extrusions, assembling and calibrating the arm, and setting up the gripper, wrist-mounted camera, and control stack.
	Wireless IMU Sensors Designed from Scratch. I led a small team to design custom wireless IMUs (ESP32/Nordic + 9-axis sensing over Bluetooth/Wi-Fi), embedded into objects and clothing, including a flexible-PCB version thin enough to be worn directly on the face for facial capture.