Juze Zhang
I am currently a postdoc at Stanford University working with Prof. Ehsan Adeli, affiliated with the Stanford Vision and Learning Lab (SVL) and the Stanford Translational AI Lab (STAI). Before joining Stanford, I worked on digital humans at ShanghaiTech University, where I was fortunate to work closely with Jingyi Yu, Jingya Wang, and Lan Xu. I received my Ph.D. from the University of Chinese Academy of Sciences.
I consider myself a multidisciplinary researcher who enjoys building across the full stack: hardware, systems, algorithms, and frontier multimodal frameworks. On the systems side, I maintained a multi-view dome with over 100 cameras at ShanghaiTech, built the DailyCap Studio from scratch as the first student on the project, and assembled the HOI-X Studio (12 Z-CAM + 8 OptiTrack + RealSense). I also led a small team to design wireless IMUs from scratch and embed them into objects and clothing, including a flexible-PCB version thin enough to be worn on the face (see Projects).
On the algorithm side, my current research centers on multimodal alignment — in particular world/video models, speech LLMs, and the hardware stack for robotics. At Stanford, I built ViBES, an early exploration of the mixture-of-modality-experts (MoME) design — self-attention with modality-specific parameters and fractional embeddings on top of a speech LLM — which became a 3D conversational agent in early 2025. Encouraged by the success of this mixture-of-transformers (MoT) recipe, I then initiated OpenWAM, applying the same design to video generation in place of the speech LLM. I view this line of work as essentially sequential-transformer-to-sequential-transformer modeling.
Email / Scholar / Twitter / GitHub

Research
Selected papers are listed below (* denotes equal contribution). For the full list, please see my Google Scholar.
OpenWAM: An Open Framework for Generalist World-Action Modeling
OpenWAM Team, Stanford University
In submission
project page / arXiv / blog / code

Listen, Speak, and Move: A Full-Body Embodied Agent for Interactive Humanoids
Juze Zhang*, Heng Yu*, David D. Yuan, Ali Sartaz Khan, Nihar Mudigonda, Yue Zhao, Ehsan Adeli
In submission
project page / arXiv / code

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body
Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi Kowshika Lakshmikanth, Ehsan Adeli
CVPR, 2026
project page / arXiv / code / dataset

Recreating Video Arenas via Automated Preference Scoring
Yue Zhao, Aniket Gupta, Juze Zhang, Tiange Xiang, Ryan Rong, Xuan Su, Li Fei-Fei, Ehsan Adeli
In submission
project page / arXiv / code

Think Before Watching: Planning to Select Frames for Video Understanding
Heng Yu, Yue Zhao, Mohammad H. Abbasi, Juze Zhang, Ehsan Adeli
In submission
project page / arXiv / code

SocialGen: Modeling Multi-Human Social Interaction with Language Models
Heng Yu*, Juze Zhang*, Changan Chen, Tiange Xiang, Yusu Fang, Juan Carlos Niebles, Ehsan Adeli
3DV, 2026
project page / arXiv

Diffusion MRI Transformer with a Diffusion Space Rotary Positional Embedding (D-RoPE)
Gustavo Chau Loo Kung, Mohammad Abbasi, Camila Blank, Juze Zhang, Alan Q. Wang, Sophie Ostmeier, Akshay Chaudhari, Kilian Pohl, Ehsan Adeli
CVPR, 2026
arXiv / code

The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Changan Chen*, Juze Zhang*, Shrinidhi Kowshika Lakshmikanth*, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, Ehsan Adeli
CVPR, 2025
project page / arXiv

SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-Based Humanoid Control
Jingyan Zhang, Han Liang, Ruichi Zhang, Bin Li, Juze Zhang, Xin Chen, Jingya Wang, Lan Xu, Jingyi Yu
arXiv, 2026
project page / arXiv

MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control
Bin Li*, Ruichi Zhang*, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Jingya Wang
arXiv, 2026
project page / arXiv

InterAgent: Physics-based Multi-agent Command Execution via Diffusion on Interaction Graphs
Bin Li*, Ruichi Zhang*, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Lan Xu, Jingyi Yu, Jingya Wang
arXiv, 2025
project page / arXiv / code

HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment
Juze Zhang*, Jingyan Zhang*, Zining Song, Zhanhe Shi, Chengfeng Zhao, Ye Shi, Jingyi Yu, Lan Xu, Jingya Wang
CVPR, 2024, Highlight
project page / arXiv / code / dataset

I’M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions
Chengfeng Zhao, Juze Zhang, Jiashen Du, Ziwei Shan, Junye Wang, Jingyi Yu, Jingya Wang, Lan Xu
CVPR, 2024
project page / arXiv / code / dataset

BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics
Wenqian Zhang, Molin Huang, Yuxuan Zhou, Juze Zhang, Jingyi Yu, Jingya Wang, Lan Xu
CVPR, 2024
project page / arXiv / code / dataset

NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions
Juze Zhang, Haimin Luo, Hongdi Yang, Xinru Xu, Qianyang Wu, Ye Shi, Jingyi Yu, Lan Xu, Jingya Wang
CVPR, 2023
project page / arXiv / code / dataset

IKOL: Inverse kinematics optimization layer for 3D human pose and shape estimation via Gauss-Newton differentiation
Juze Zhang, Ye Shi, Yuexin Ma, Lan Xu, Jingyi Yu, Jingya Wang
AAAI, 2023, Oral
project page / arXiv / video / code

Projects
Beyond papers, I love building the full stack with my hands — capture studios, sensors, and the systems that power the datasets above.
DailyCap Studio: Built from Scratch
A multi-view capture studio (42 Z-CAM + 16 OptiTrack) at ShanghaiTech that I built from scratch as the first student on the project — from frame design and camera layout to synchronization, calibration, and the full capture pipeline.
HOI-X Studio
A hybrid interaction-capture rig combining side-view sensing (12 Z-CAM E2 + 8 OptiTrack) with egocentric sensing (Pupil Core eye tracker + RealSense D455) for fine-grained human-object interaction capture.
NeuralDome Studio: 76 Z-CAM + 16 Vicon
I maintained and operated the ShanghaiTech multi-view dome with over 100 cameras, powering large-scale human-object interaction datasets including NeuralDome (CVPR 2023) and HOI-M3 (CVPR 2024).
Franka FR3 Robot Arm Platform
I built a Franka FR3 manipulation platform from scratch in the lab at Stanford — fabricating the mobile stand from raw aluminum extrusions, assembling and calibrating the arm, and setting up the gripper, wrist-mounted camera, and control stack.
Wireless IMU Sensors Designed from Scratch
I led a small team to design custom wireless IMUs (ESP32/Nordic + 9-axis sensing over Bluetooth/Wi-Fi), embedded into objects and clothing — including a flexible-PCB version thin enough to be worn directly on the face for facial capture.