Inside Rho-Alpha, Microsoft’s new robotics model
The new architecture upgrades Vision-Language-Action models with tactile data to bridge the gap between semantic reasoning and low-level motor control.
While large language models (LLMs) have mastered the art of processing text and images, they remain largely confined to the digital realm. Moving from generating code to folding laundry requires a fundamental shift in how AI perceives the world. Microsoft is attempting to bridge this gap with Rho-alpha (ρα), a new robotics foundation model designed to bring adaptivity to physical tasks.
Rho-alpha falls under the category of Vision-Language-Action (VLA) models. These systems ingest visual data and natural language commands to output robot arm actions. However, standard VLAs often struggle with precision tasks where vision is obstructed or insufficient, such as manipulating a slippery object or inserting a plug behind a desk. Rho-alpha addresses this by integrating tactile sensing directly into its decision-making process, a capability Microsoft refers to as “VLA+.”
The architecture of VLA+
The core innovation of Rho-alpha lies in how it processes sensory data. Most multimodal models attempt to tokenize every input, converting images and text into discrete units that a transformer can process. Tactile feedback, however, is a high-frequency, continuous signal of force and resistance that does not map cleanly onto discrete tokens.
To handle this, Microsoft engineered a split architecture. The model uses a standard vision-language model (VLM) backbone, derived from Microsoft’s Phi family, to handle high-level reasoning and semantic understanding. The actual motor control, however, is managed by a specialized module called the “action expert,” which is attached to the VLM. Tactile data bypasses the VLM component entirely and is never tokenized; instead, it is fused with image, text, and proprioception embeddings directly inside the action expert.
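Microsoft has not released Rho-alpha’s code, but the split described above can be sketched in a few lines of PyTorch. Everything below (module names, dimensions, and the concatenation-based fusion) is an illustrative assumption rather than the actual implementation.

```python
# Illustrative sketch only: module names, dimensions, and the fusion strategy
# are assumptions, not Microsoft's published implementation.
import torch
import torch.nn as nn


class ActionExpert(nn.Module):
    """Small, fast head that fuses VLM features with continuous signals."""

    def __init__(self, vlm_dim=1024, tactile_dim=12, proprio_dim=14,
                 hidden_dim=256, action_dim=14):
        super().__init__()
        # Tactile and proprioception stay continuous: a linear projection,
        # not a tokenizer, maps them into the fusion space.
        self.tactile_proj = nn.Linear(tactile_dim, hidden_dim)
        self.proprio_proj = nn.Linear(proprio_dim, hidden_dim)
        self.vlm_proj = nn.Linear(vlm_dim, hidden_dim)
        self.policy = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),  # e.g. joint velocity targets
        )

    def forward(self, vlm_features, tactile, proprio):
        fused = torch.cat([
            self.vlm_proj(vlm_features),
            self.tactile_proj(tactile),
            self.proprio_proj(proprio),
        ], dim=-1)
        return self.policy(fused)


# The VLM backbone (a stand-in here) sees images and text only; tactile data
# never passes through it.
vlm_backbone = nn.Linear(768, 1024)           # placeholder for a Phi-style VLM
action_expert = ActionExpert()

image_text_emb = torch.randn(1, 768)          # tokenized vision+language input
tactile = torch.randn(1, 12)                  # fingertip force readings
proprio = torch.randn(1, 14)                  # joint positions/velocities

vlm_features = vlm_backbone(image_text_emb)   # slow, semantic path
action = action_expert(vlm_features, tactile, proprio)  # fast, reactive path
```

The key property is that tactile and proprioceptive readings reach the policy through a lightweight projection rather than the transformer backbone, which is what keeps the reactive path fast.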
In comments to TechTalks, Andrey Kolobov, Principal Research Manager at Microsoft Research, explained that this architecture allows the system to bypass the slower reasoning components when immediate physical reaction is needed.
“The model treats tactile as a continuous data source, providing information on the currently applied forces at the gripper fingertips,” Kolobov said.
This bypass mechanism is critical for latency. Feeding high-frequency force data through a massive transformer would introduce delays that make real-time control impossible. By fusing tactile data in the smaller, faster action expert, the robot can react to physical resistance instantly while still leveraging the VLM for broader context.
“We view the purpose of physical sensing modalities as helping our model be more reactive and adaptive,” Kolobov added. “Accordingly, we feed these modalities into the action expert, which is a small fraction of the overall architecture, bypassing the VLM.”
The long-term goal, Kolobov said, is to have the action expert, or a part of it, operate on proprioception and physical sensing modalities at a significantly higher frequency than on visual and language data.
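One way to picture that dual-rate design is a control loop that refreshes the expensive VLM features only a few times per second while querying the action expert at every control tick. The sketch below is purely hypothetical: the frequencies and the helper callables (read_camera, read_tactile, and so on) are placeholders, not part of any published API.

```python
# Hypothetical dual-rate control loop: frequencies and helper callables are
# placeholders, not a real robot API.
import time

VLM_HZ = 2        # slow semantic updates
CONTROL_HZ = 100  # fast reactive updates


def control_loop(read_camera, read_instruction, read_tactile, read_proprio,
                 send_action, vlm_backbone, action_expert):
    vlm_features = None
    last_vlm_update = 0.0
    while True:
        now = time.monotonic()
        # Refresh the expensive VLM features only a few times per second.
        if vlm_features is None or now - last_vlm_update > 1.0 / VLM_HZ:
            vlm_features = vlm_backbone(read_camera(), read_instruction())
            last_vlm_update = now
        # The action expert reuses the cached VLM features and reacts to
        # fresh tactile and proprioceptive readings on every control tick.
        action = action_expert(vlm_features, read_tactile(), read_proprio())
        send_action(action)
        time.sleep(1.0 / CONTROL_HZ)
```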
Establishing priors in simulation
Training a model to interact with the physical world presents a data scarcity challenge. Unlike text, which can be scraped from the web in petabytes, robot interaction data is expensive and slow to collect. Microsoft addresses this by training Rho-alpha in a simulated environment using Nvidia Isaac Sim.
A persistent problem in robotics is the mismatch between simulated environments and the real world, a hurdle known as the “sim-to-real gap.” Microsoft’s approach, however, sidesteps the need to bridge this gap perfectly. The goal of the simulation is not to create a 1:1 replica of the physical world, but to teach the model general concepts of physics and force.
“We actually don’t rely on the sim-to-real gap being small and do only conventional data augmentation,” Kolobov said. “The purpose of using simulated data during training is to give a rough prior idea of what tactile and force feedback looks like and how it can be useful.”
By learning these “priors” in simulation, the model enters the real world already understanding that a spike in force readings usually means it has hit an obstacle. This allows it to fine-tune its policy with significantly less real-world data.
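As a toy illustration of such a prior, the snippet below labels contact events in a simulated fingertip-force trace by looking for sudden spikes. The threshold, the synthetic data, and the labeling function are all assumptions made for the sake of the example; no Isaac Sim API is shown.

```python
# Illustrative only: a toy "force spike means contact" prior extracted from
# simulated rollouts. The force readings here are synthetic stand-ins.
import numpy as np


def label_contact_events(force_trace, spike_threshold=5.0):
    """Label timesteps where fingertip force jumps sharply (likely contact)."""
    deltas = np.abs(np.diff(force_trace, prepend=force_trace[0]))
    return deltas > spike_threshold


# Synthetic trace: free motion, then the gripper hits an obstacle at t=50.
force = np.concatenate([np.random.normal(0.5, 0.1, 50),
                        np.random.normal(8.0, 0.5, 50)])
labels = label_contact_events(force)
print(f"first contact detected at timestep {labels.argmax()}")
```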
Online learning and forgetting
Once deployed, Rho-alpha continues to learn through human interaction. If the robot fails a task, a human operator can intervene via teleoperation (using devices like a 3D mouse) to correct the movement. The model ingests this feedback to update its policy.
However, this online learning capability introduces the risk of “catastrophic forgetting,” where learning a new task causes the model to lose proficiency in previous ones.
“As the model learns from feedback on a given task, its performance on tasks not being exercised at the moment may degrade, unless care is taken to combat this,” Kolobov noted.
To mitigate this, the system can aggregate data and perform updates at regular intervals, effectively “reminding” the model of past experiences to maintain a balanced skill set.
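One common way to implement this kind of data aggregation is a per-task replay buffer that mixes fresh teleoperation corrections with samples from earlier tasks before each periodic update. The sketch below is a generic version of that idea; the buffer structure and mixing ratio are assumptions, not details Microsoft has disclosed.

```python
# Generic replay-buffer sketch for mitigating forgetting: mix new corrections
# with samples from previously learned tasks before each periodic update.
import random
from collections import defaultdict


class ReplayBuffer:
    def __init__(self):
        self.by_task = defaultdict(list)

    def add(self, task_name, transition):
        self.by_task[task_name].append(transition)

    def sample_mixed(self, new_task, n_new, n_replay):
        """Combine new corrections with samples from older tasks."""
        batch = random.sample(self.by_task[new_task],
                              min(n_new, len(self.by_task[new_task])))
        old_tasks = [t for t in self.by_task if t != new_task]
        for task in old_tasks:
            k = min(n_replay // max(len(old_tasks), 1),
                    len(self.by_task[task]))
            batch.extend(random.sample(self.by_task[task], k))
        random.shuffle(batch)
        return batch
```

Fine-tuning on a batch drawn this way exposes the policy to old and new experience together, which is the “reminding” effect described above.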
Bimanual manipulation and future applications
Currently, Rho-alpha is optimized for bimanual (two-armed) manipulation. While many tasks can theoretically be performed with a single arm, the coordination of two end-effectors significantly improves efficiency in industrial settings.
“In many scenarios beyond pick-and-place, from folding laundry to packaging food to assembly, performing tasks with two end-effectors rather than one increases execution speed and robustness – and hence throughput,” Kolobov explained.
The model does have hardware limitations in its current state. It supports manipulation only, meaning it cannot control the mobile base of a robot or the body of a humanoid. Furthermore, the training data is heavily biased toward two-finger grippers, so using complex multi-fingered hands or suction cups would require additional post-training data.
Despite these constraints, the architecture offers a glimpse into the future of physical AI. By separating high-level semantic reasoning from low-level, high-frequency motor control, Microsoft is building a system that can think like an LLM but act with the reflexes required for the real world.