I taught a two-armed robot to fold laundry — winning the simulation round of the LeHome Challenge @ ICRA 2026, and taking 2nd place on a real robot.
The task of the LeHome Challenge (ICRA 2026) was to train a robot to fold four types of clothes — long-sleeve top, short-sleeve top, long pants, shorts — using cheap SO-ARM101 hardware, first in simulation (round 1), then on a real robot (round 2). The setup is bimanual: two 6-DOF arms, three RGB cameras, a 12-dimensional joint action space. Success is scored differently in the two rounds: in simulation it is binary and automatic — specific garment keypoints must end up far from or close to each other — while in the real round a jury decides, and partial success counts.
Garment folding has been solved before, by several groups. What made this particular competition hard was the combination of constraints: low-precision, cheap hardware; the specifics of simulating highly deformable objects; four garment types with the category hidden at evaluation time; and, in the sim-to-real round, no access at all to the actual evaluation robot. The point was not a one-off demo — it was a policy that folds consistently and maximizes success rate over many trials.

The policy is a π0.5-based vision-language-action model, but instead of using the raw original policy I started from the version my team developed for our winning solution in BEHAVIOR-1K, then added extra tweaks. It is a single policy for all four garment types, trained jointly.
On top of the backbone I added, among other things:
Most of these were never properly ablated. This was a competition, not a controlled study, so I share the details for reference without claiming which were actually critical.
The organizers provided a very clean, scripted original dataset. Behavior cloning on top of it works, but the resulting policy isn't robust enough. To improve it, I use a combination of two related RL methods — AWR and RECAP. The resulting policy stays on the same action manifold, but completes the tasks with much higher robustness.
AWR weights the training data toward high-advantage frames — applied through sampling rather than the loss, so good frames are simply loaded more often.
RECAP feeds the advantage in as a conditioning input, in effect telling the model to “predict good actions only” — which also unlocks classifier-free guidance (CFG) at inference.
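A minimal sketch of both ideas, for illustration only and not the competition code. It assumes per-frame advantages are already computed, uses exponential AWR weights with a clip, and applies the standard classifier-free guidance formula to two model outputs (advantage-conditioned and unconditioned). The names `awr_sampling_probs`, `sample_frames`, `cfg_velocity` and the defaults `beta` and `max_weight` are hypothetical.

```python
import numpy as np


def awr_sampling_probs(advantages, beta=1.0, max_weight=20.0):
    """Map per-frame advantages to sampling probabilities (AWR-style).

    beta (temperature) and max_weight (clip) are assumed values.
    """
    adv = np.asarray(advantages, dtype=np.float64)
    # exp(A / beta), with the exponent capped so no weight exceeds max_weight.
    weights = np.exp(np.minimum(adv / beta, np.log(max_weight)))
    return weights / weights.sum()


def sample_frames(advantages, batch_size, rng, beta=1.0, max_weight=20.0):
    """Draw frame indices so that high-advantage frames are loaded more often."""
    probs = awr_sampling_probs(advantages, beta, max_weight)
    return rng.choice(len(probs), size=batch_size, replace=True, p=probs)


def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Standard classifier-free guidance on two predictions.

    A scale of 1 returns the conditioned prediction unchanged; larger
    scales extrapolate further away from the unconditioned one.
    """
    v_cond = np.asarray(v_cond, dtype=np.float64)
    v_uncond = np.asarray(v_uncond, dtype=np.float64)
    return v_uncond + guidance_scale * (v_cond - v_uncond)
```

Because the weighting lives in the sampler, the training loss itself stays untouched; only the data loader needs to know about advantages.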

To run this I built an asynchronous reinforcement-learning loop, with every machine coordinating only through the Hugging Face Hub:
There are no synchronization barriers: the trainer trains on whatever data has arrived, the workers collect with whatever checkpoint is newest. Scaling up data collection is just turning on another machine.
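The barrier-free pattern can be sketched with a local folder standing in for the shared repository. This is an assumption made purely for illustration: `FolderHub`, `worker_step`, and `trainer_step` are hypothetical names, no real Hugging Face Hub calls are shown, and the "training" and "rollout" payloads are placeholders.

```python
import json
import tempfile
from pathlib import Path


class FolderHub:
    """Local-folder stand-in for a shared remote repo (illustration only)."""

    def __init__(self, root):
        self.ckpt_dir = Path(root) / "checkpoints"
        self.data_dir = Path(root) / "rollouts"
        self.ckpt_dir.mkdir(parents=True, exist_ok=True)
        self.data_dir.mkdir(parents=True, exist_ok=True)

    def push_checkpoint(self, step, weights):
        (self.ckpt_dir / f"{step:08d}.json").write_text(json.dumps(weights))

    def latest_checkpoint(self):
        files = sorted(self.ckpt_dir.glob("*.json"))
        if not files:
            return None
        return int(files[-1].stem), json.loads(files[-1].read_text())

    def push_rollout(self, worker_id, rollout):
        index = len(list(self.data_dir.glob("*.json")))
        path = self.data_dir / f"{index:08d}_{worker_id}.json"
        path.write_text(json.dumps(rollout))

    def all_rollouts(self):
        paths = sorted(self.data_dir.glob("*.json"))
        return [json.loads(p.read_text()) for p in paths]


def worker_step(hub, worker_id):
    """Collect one (placeholder) rollout with whatever checkpoint is newest."""
    latest = hub.latest_checkpoint()
    policy_step = latest[0] if latest else -1  # -1 means "no checkpoint yet"
    hub.push_rollout(worker_id, {"policy_step": policy_step, "success": True})


def trainer_step(hub, step):
    """Train (placeholder) on whatever data has arrived, then publish."""
    data = hub.all_rollouts()
    hub.push_checkpoint(step, {"trained_on": len(data)})
    return len(data)
```

Neither function ever waits for the other: each one reads the current state of the shared store, does its work, and writes back. Adding a worker is just another process calling `worker_step` in a loop.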
Binary success is far too sparse for efficient RL: early actions get almost no signal. So I densify it with a multi-layer reward and advantage computation that combines several signals:
Everything is aggregated with GAE into a per-frame advantage. The result is over-engineered and could probably be simplified — but I think these are the right building blocks.
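For reference, a minimal sketch of the textbook GAE recursion over one episode. It assumes the dense reward has already been collapsed into one scalar per frame and that value estimates are available; the discount `gamma` and the mixing factor `lam` are assumed defaults, not the values used in the competition.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single episode.

    rewards: one scalar per frame, length T.
    values:  value estimates, length T + 1; the last entry is the
             bootstrap value (0.0 if the episode terminated).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    if values.shape[0] != rewards.shape[0] + 1:
        raise ValueError("values must have one more entry than rewards")

    advantages = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each frame accumulates discounted future TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With a single terminal reward and `lam = 1`, early frames still receive a discounted share of the final outcome, which is exactly the credit-assignment problem the dense reward is meant to ease.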
Every rollout is recorded with a debug overlay printing the model's own live predictions on top of the three camera feeds: success probability (S), advantage (A), reward (R), completion (C), and time-to-completion (T). Those are exactly the signals the training loop uses to decide which frames to learn from — and they are predicted and saved at collection time, on-policy, so they are never re-estimated by a later, drifted model.
After the policy is trained, there is a lot of room for inference-time optimization. My policy supports the following inference-time hyperparameters:
A full grid search over all of these would be far too slow, so I tune them online, during rollout collection, with a per-parameter Thompson-sampling bandit. Simply put: for each rollout I sample a random combination of hyperparameter values, and the more often a value leads to success, the more likely it is to be picked again in future rollouts. The posteriors decay every iteration so the bandit tracks the moving policy rather than its whole history, and once they settle, the best configuration per garment type is frozen for the final run. Beyond free hyperparameter tuning, this also adds free exploration, since every parameter slightly changes the dynamics of the rollouts.
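A minimal sketch of one such bandit, for a single hyperparameter. It assumes a Beta-Bernoulli model over binary rollout success and decays the evidence toward a uniform Beta(1, 1) prior on every update; the class name, the decay rate, and the candidate values in the usage note are all hypothetical.

```python
import numpy as np


class DecayingThompsonBandit:
    """Thompson sampling over the candidate values of ONE hyperparameter."""

    def __init__(self, values, decay=0.98, seed=0):
        self.values = list(values)
        self.decay = decay  # assumed forgetting rate, applied on each update
        self.alpha = np.ones(len(self.values))  # 1 + decayed success count
        self.beta = np.ones(len(self.values))   # 1 + decayed failure count
        self.rng = np.random.default_rng(seed)

    def pick(self):
        """Sample a success rate per value and return the index of the best."""
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, index, success):
        """Forget a little of the old evidence, then record the new outcome."""
        self.alpha = 1.0 + self.decay * (self.alpha - 1.0)
        self.beta = 1.0 + self.decay * (self.beta - 1.0)
        if success:
            self.alpha[index] += 1.0
        else:
            self.beta[index] += 1.0
```

In use, one bandit would be kept per hyperparameter (for example one over hypothetical CFG scales `[1, 2, 4, 8]` and another over best-of-N sizes `[1, 2, 3]`); each rollout calls `pick()` on every bandit and then reports the same success flag back to all of them.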
The values the bandit converged to work as a cheap substitute for an ablation study, and let me draw a few conclusions: CFG consistently converged to high values, far higher than my original expectation of ~2; best-of-N chunk selection matters, but N = 2–3 is enough; constant replanning every ~5 steps works better than executing the full 30-step chunk; and a noise temperature slightly below 1 for the flow-matching noise is beneficial (presumably it removes outliers while keeping enough action diversity).
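How these inference-time knobs fit together can be sketched as a receding-horizon loop. This is an illustration under stated assumptions: `sample_chunk`, `score_chunk`, and `env_step` are hypothetical callables standing in for the policy, a chunk scorer, and the environment, and the temperature of 0.9 is an assumed example of "slightly below 1". The 30-step chunk and the 12-dimensional action come from the setup described above.

```python
import numpy as np


def select_chunk(sample_chunk, score_chunk, obs, rng, n_candidates=3,
                 noise_temperature=0.9, horizon=30, action_dim=12):
    """Best-of-N: sample N action chunks and keep the highest-scoring one."""
    best_chunk, best_score = None, -np.inf
    for _ in range(n_candidates):
        # Scale the initial noise; a temperature < 1 trims outlier samples.
        noise = noise_temperature * rng.standard_normal((horizon, action_dim))
        chunk = sample_chunk(obs, noise)
        score = score_chunk(obs, chunk)
        if score > best_score:
            best_chunk, best_score = chunk, score
    return best_chunk


def run_episode(env_step, sample_chunk, score_chunk, obs, rng,
                total_steps=20, replan_every=5, **select_kwargs):
    """Execute only the first few actions of each chunk, then replan."""
    steps = 0
    while steps < total_steps:
        chunk = select_chunk(sample_chunk, score_chunk, obs, rng,
                             **select_kwargs)
        for action in chunk[:replan_every]:
            obs = env_step(obs, action)
            steps += 1
            if steps >= total_steps:
                break
    return obs
```

Replanning every 5 steps means the policy is queried six times more often than when executing full 30-step chunks, and best-of-N multiplies that again, so small N keeps the cost manageable.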

I won the first (simulation) round solo, finishing 1st out of 62 teams. My final result is a 79.63% success rate, 6.1 points ahead of second place, with the top score on three of the four garment types. The final score is computed over 80 garments (both seen and unseen), 10 episodes each.
| # | Team | Long top | Short top | Long pant | Short pant | Overall |
|---|---|---|---|---|---|---|
| 1 | ilya (this work) | 74.5% | 70.0% | 80.5% | 93.5% | 79.63% |
| 2 | Shubham @ Vorwerk | 73.0% | 62.5% | 71.5% | 87.0% | 73.50% |
| 3 | Dum-E | 76.5% | 62.0% | 75.5% | 79.5% | 73.38% |
| 4 | SCUT-Unlimited | 65.5% | 66.0% | 70.0% | 91.0% | 73.13% |
| 5 | GraspYesAI | 73.5% | 61.0% | 69.0% | 79.0% | 70.63% |
Top 5 of 62. Full ranking on the competition website.
In most cases the policy is doing the right things. But sometimes it isn't dexterous enough, sometimes it overfits to simulation artifacts and makes mistakes, and sometimes it almost completes the task but gets stuck in a near-success state.
The final round was about solving the same problem, but on real robots. It was held in person as part of the ICRA 2026 conference in Vienna.


In practice we had very little time to prepare — less than two weeks — and between flying to Austria and everything else, I had only about one week to actually work on the solution.
Another complication: participants didn't really have access to the actual evaluation robot before the competition, so in practice it was sim → my robot → their robot, with an extra generalization step baked in.
There are two ways to close a domain gap: make the environments more similar, or make the training data diverse enough that the gap falls inside it. I leaned on both:
My script takes a frame from a full video in the organizers' provided dataset and moves my (or the simulated) robot along the same trajectory. Overlaying the video feeds lets you calibrate the cameras and make sure the robot positions are aligned.
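The overlay step itself can be as simple as an alpha blend of a reference frame and a live frame. A minimal sketch, assuming both frames are `uint8` images of identical shape; the function name and the 50/50 default are hypothetical.

```python
import numpy as np


def overlay_frames(reference, live, alpha=0.5):
    """Blend a reference frame with a live frame for visual alignment.

    alpha is the weight of the reference frame (assumed default: 0.5).
    """
    if reference.shape != live.shape:
        raise ValueError("frames must have the same shape")
    blended = (alpha * reference.astype(np.float32)
               + (1.0 - alpha) * live.astype(np.float32))
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)
```

When the cameras and robot pose match, the two images coincide and the blend looks sharp; any misalignment shows up as ghosting around the arms and the table edges.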

My main strategy was to collect extra DAgger data using my own robot — teleoperating it to correct the policy wherever it went wrong, then folding those corrections back into the fine-tune.
Because my own setup always differed substantially from the actual evaluation setup, I kept the weight of this self-collected data at only 30%, while the organizer-provided dataset stayed at 60%.
Human-in-the-loop interventions turned out to be a sample-efficient way to make the policy more robust to its own mistakes. I only had time for 2–3 DAgger loop iterations, but I could clearly see the progress the policy made with every step — I believe this approach could lead to much better results with just a bit more iteration.
Note: the clips can look a bit shaky and jittery because the “idle” frames — where the model is computing its next prediction — are skipped here.
I placed 2nd in the real-robot final, with 865 of 1080 points, judged on-site at ICRA in Vienna. The score is the organizers' combined number: it rewards both how often the robot succeeds and how cleanly it folds, and folding a garment it had never seen before was worth 50% extra.
| # | Team | Score |
|---|---|---|
| 1 | sZs | 895 |
| 2 | ilya (this work) | 865 |
| 3 | Dum-E | 762.5 |
| 4 | SCUT-Unlimited | 635 |
| 5 | sisigakgak | 570 |
| 6 | Shubham @ Vorwerk | 470 |
Teams that completed scored real-robot runs. Maximum possible total: 1080 points.
Some official rollouts captured during the actual competition final. Apologies for the rough footage — I was busy with the competition itself and didn't make nicer videos, and the camera feeds weren't recorded.
I joined the LeHome Challenge to test and experiment with RL approaches for fine-tuning VLA policies; as a bonus exercise, I took on the sim-to-real round as well. I did it all under competition pressure, so many of my decisions may be sub-optimal and not tested enough — there is definitely a lot of room for improvement.
The biggest low-hanging fruit would be to combine the approaches I used independently for the two rounds: full RL in sim, and plain behavior cloning with human corrections on the real robot. These halves are complementary, and a single pipeline — a real-side value function driving advantages and best-of-N, with DAgger interventions weighted by that same signal — would combine round one's cleanliness with round two's recovery. I'm fairly sure that with slight refinement and less time pressure my recipe could reach well above 90% on this task.
Even so, I believe the overall recipe I used is the right way to improve VLA-like policies with RL and human-in-the-loop methods, and many of the ideas here can be reused independently. I share my code, a detailed tech report, and the final policy weights below.
I'm also working on longer videos where I explain everything in much more detail — stay tuned.