Learning to Fold — LeHome Challenge 2026

01 /

The challenge

The task of the LeHome Challenge (ICRA 2026) was to train a robot to fold four types of clothes — long-sleeve top, short-sleeve top, long pants, shorts — using cheap SO-ARM101 hardware, first in simulation (round 1), then on a real robot (round 2). The setup is bimanual: two 6-DOF arms, three RGB cameras, a 12-dimensional joint action space. Success is scored differently in the two rounds: in simulation it is binary and automatic — specific garment keypoints must end up far from or close to each other — while in the real round a jury decides, and partial success counts.
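As a rough illustration of what a binary keypoint check can look like, here is a minimal Python sketch. The keypoint names, the pair lists, and the thresholds are all hypothetical; the challenge's actual success criteria are not reproduced here.

```python
import math


def fold_succeeded(keypoints, close_pairs, far_pairs, close_max, far_min):
    """Binary success: every 'close' pair is within close_max and
    every 'far' pair is at least far_min apart.

    All names and thresholds are illustrative assumptions.
    """
    close_ok = all(
        math.dist(keypoints[a], keypoints[b]) <= close_max for a, b in close_pairs
    )
    far_ok = all(
        math.dist(keypoints[a], keypoints[b]) >= far_min for a, b in far_pairs
    )
    return close_ok and far_ok


# Hypothetical 2D keypoints (metres) for a folded top.
example_keypoints = {
    "left_cuff": (0.00, 0.00),
    "right_cuff": (0.03, 0.00),
    "collar": (0.00, 0.30),
}
folded = fold_succeeded(
    example_keypoints,
    close_pairs=[("left_cuff", "right_cuff")],
    far_pairs=[("left_cuff", "collar")],
    close_max=0.05,
    far_min=0.20,
)
```

A check of this shape has no notion of partial credit, which is why early actions receive almost no signal from it.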

Garment folding has been solved before, by several groups. What made this particular competition hard was the combination of constraints: low-precision, cheap hardware; the specifics of simulating highly deformable objects; four garment types with the category hidden at evaluation time; and, in the sim-to-real round, no access at all to the actual evaluation robot. The point was not a one-off demo — it was a policy that folds consistently and maximizes success rate over many trials.

The four garment types in simulation — The four garment types from the overhead camera, in the organizers' simulation dataset: long-sleeve top, short-sleeve top, long pants, shorts.

02 /

Model architecture

The policy is a π_0.5-based vision-language-action model. Rather than using the stock policy, I started from the version my team developed for our winning BEHAVIOR-1K solution and added further tweaks. It is a single policy for all four garment types, trained jointly.

On top of the backbone I added, among other things:

–auxiliary prediction heads that make the policy partly its own value function: a single query token feeds cheap linear heads that, from the same forward pass, predict success probability, task completion, garment type, and — 30 frames ahead — the keypoint distances that define success plus a Q-function residual;
–a garment-type input token, with the type inferred at the start of each episode since it is hidden at evaluation;
–advantage conditioning (RECAP-style) plus multi-signal AdaRMS conditioning, so the advantage and garment signals reach every layer of the action expert;
–exclusive self-attention across the VLM and the action expert;
–keypoint-distance and future-prediction heads as a very cheap world-model substitute — predicting the few numbers that matter instead of whole future frames.

Most of these were never properly ablated. This was a competition, not a controlled study, so I share the details for reference without claiming which were actually critical.

The policy architecture and its prediction heads — One model picks the next action chunk and, from the same forward pass, predicts its own success, progress, garment type, and a few task-relevant future quantities.
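To make the "cheap linear heads on one query token" idea concrete, here is a minimal pure-Python sketch. The parameter layout, the head names, and the choice of sigmoid and softmax outputs are assumptions for illustration; the Q-function residual head is left out, and nothing here is taken from the actual model code.

```python
import math

GARMENTS = ["long-sleeve top", "short-sleeve top", "long pants", "shorts"]


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def auxiliary_heads(query_embedding, params):
    """Read several predictions off the same embedding.

    `params[name]` is a hypothetical (weights, bias) pair per head.
    """
    success_logit = linear(query_embedding, *params["success"])[0]
    completion_logit = linear(query_embedding, *params["completion"])[0]
    garment_probs = softmax(linear(query_embedding, *params["garment"]))
    future_distances = linear(query_embedding, *params["future_distances"])
    return {
        "success_prob": sigmoid(success_logit),
        "completion": sigmoid(completion_logit),
        "garment": GARMENTS[garment_probs.index(max(garment_probs))],
        "future_distances": future_distances,
    }
```

Each head adds only one small matrix multiply on top of the forward pass that already produced the embedding, which is what keeps them cheap.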

03 /

A reinforcement learning loop

The organizers provided a very clean, scripted original dataset. Behavior cloning on top of it works, but the resulting policy isn't robust enough. To improve it, I use a combination of two related RL methods — AWR and RECAP. The resulting policy stays on the same action manifold, but completes the tasks with much higher robustness.

AWR weights the training data toward high-advantage frames — applied through sampling rather than the loss, so good frames are simply loaded more often.
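A minimal sketch of advantage-weighted sampling, assuming the usual exponential AWR weight exp(A / beta) with a cap. The temperature `beta`, the cap `max_weight`, and the function names are illustrative choices, not values from this project.

```python
import math
import random


def awr_sampling_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponential advantage weights, capped so one frame cannot dominate."""
    log_cap = math.log(max_weight)
    return [
        max_weight if a / beta >= log_cap else math.exp(a / beta)
        for a in advantages
    ]


def sample_frame_indices(advantages, batch_size, rng, beta=1.0):
    """Draw frame indices with probability proportional to the AWR weight.

    The loss itself stays unweighted: good frames are simply loaded more often.
    """
    weights = awr_sampling_weights(advantages, beta)
    return rng.choices(range(len(advantages)), weights=weights, k=batch_size)
```

Moving the weight into the sampler rather than the loss means every sampled frame contributes a full-size gradient, instead of many frames contributing near-zero ones.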

RECAP feeds the advantage in as a conditioning input, in effect telling the model to “predict good actions only” — which also unlocks classifier-free guidance (CFG) at inference.
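For reference, the standard classifier-free guidance combination can be written in a few lines. This sketch assumes the model is queried twice, once without and once with the "good action" conditioning, and that the two outputs are flat lists of numbers; how exactly the policy here applies guidance inside its flow-matching sampler is not specified in the text.

```python
def cfg_combine(unconditional, conditional, guidance_scale):
    """Standard CFG: extrapolate from the unconditional prediction
    toward (and past) the conditional one.

    guidance_scale = 0 gives the unconditional output,
    guidance_scale = 1 gives the conditional output,
    guidance_scale > 1 amplifies the effect of the conditioning.
    """
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(unconditional, conditional)
    ]
```

With a scale of 8, a component where the two predictions differ by 0.1 ends up shifted by 0.8 from the unconditional value.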

Toy illustration of AWR and RECAP shifting the action distribution — Toy picture: AWR shifts mass toward the good mode, RECAP conditioning selects the positive-advantage slice. Doing both moves the policy even further toward the high-advantage actions.

The asynchronous RL loop

To run this I built an asynchronous reinforcement-learning loop, with every machine coordinating only through the Hugging Face Hub:

–One training machine (1× H200) trains continuously on all available data, recomputing advantages across every rollout dataset before each iteration and shipping a fresh checkpoint roughly every 40 minutes.
–One or more rollout machines pull the latest checkpoint and collect as many new rollouts as possible, using several strategies — random, curriculum, success-replay, hard-mining.
–When needed, a human operator collects more data by hand — either fresh teleop rollouts or automatically-detected hard cases, for efficient DAgger-style correction.

There are no synchronization barriers: the trainer trains on whatever data has arrived, the workers collect with whatever checkpoint is newest. Scaling up data collection is just turning on another machine.
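The barrier-free coordination pattern can be sketched with an in-memory stand-in for the shared hub. `InMemoryHub`, its methods, and the record fields are hypothetical names for illustration only; the real loop goes through the Hugging Face Hub, and those calls are deliberately left out.

```python
class InMemoryHub:
    """Hypothetical stand-in for a shared checkpoint and dataset store."""

    def __init__(self):
        self.checkpoints = []
        self.rollouts = []

    def push_checkpoint(self, checkpoint):
        self.checkpoints.append(checkpoint)

    def latest_checkpoint(self):
        return self.checkpoints[-1] if self.checkpoints else None

    def push_rollouts(self, episodes):
        self.rollouts.extend(episodes)


def trainer_iteration(hub, version):
    """Train on whatever data has arrived, then publish a new checkpoint."""
    data = list(hub.rollouts)  # snapshot; advantages would be recomputed here
    hub.push_checkpoint({"version": version, "episodes_seen": len(data)})


def worker_iteration(hub, worker_id, n_episodes):
    """Collect with whatever checkpoint is newest; never wait for the trainer."""
    checkpoint = hub.latest_checkpoint()
    if checkpoint is None:
        return 0
    hub.push_rollouts(
        {"worker": worker_id, "policy_version": checkpoint["version"]}
        for _ in range(n_episodes)
    )
    return n_episodes


# One possible interleaving of a trainer and two rollout machines.
hub = InMemoryHub()
trainer_iteration(hub, version=0)
worker_iteration(hub, "rollout-a", 3)
worker_iteration(hub, "rollout-b", 2)
trainer_iteration(hub, version=1)
worker_iteration(hub, "rollout-a", 3)
```

Because each episode records the policy version that produced it, the trainer can later tell fresh rollouts from stale ones without any direct communication between machines.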

The asynchronous training and rollout loop through the Hugging Face Hub — A trainer, any number of rollout machines, and a human correction station — all coordinating through the Hub, with nothing blocking on anything else.

04 /

Reward & advantage

Binary success is far too sparse for efficient RL: early actions get almost no signal. So I densify it with a multi-layer reward and advantage computation that combines several signals:

–objective task progress from the challenge's own keypoint checkpoints (with all reward withdrawn on failure, so the episode return stays binary);
–success-probability predictions, used as a value baseline;
–completion-percentage predictions, a progress signal that stays stable as the policy evolves;
–relative success across garments, which stale rollouts fall back on as their own predictions go out of date.

Everything is aggregated with GAE into a per-frame advantage. The result is over-engineered and could probably be simplified — but I think these are the right building blocks.
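For readers unfamiliar with GAE, here is a minimal sketch of the standard recursion, together with a helper that withdraws progress reward on failed episodes. The discount `gamma`, the mixing factor `lam`, and the way rewards and values are laid out are assumptions; the multi-signal combination described above is not reproduced.

```python
def withdraw_on_failure(progress_rewards, succeeded):
    """Keep dense progress rewards only for successful episodes."""
    if succeeded:
        return list(progress_rewards)
    return [0.0] * len(progress_rewards)


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one episode.

    `values` has len(rewards) + 1 entries; the last one is the bootstrap
    value after the final frame (0.0 for a terminal state).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted sum of future errors.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam = 0` each frame's advantage is just its one-step TD error; with `lam = 1` it becomes the discounted return minus the value baseline, so `lam` trades bias against variance.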

Rollout debug overlay with live predictions

Example of a rollout with reward overlay

Every rollout is recorded with a debug overlay printing the model's own live predictions on top of the three camera feeds: success probability (S), advantage (A), reward (R), completion (C), and time-to-completion (T). Those are exactly the signals the training loop uses to decide which frames to learn from — and they are predicted and saved at collection time, on-policy, so they are never re-estimated by a later, drifted model.

05 /

Inference-time optimization

After the policy is trained, there is a lot of room for inference-time optimization. My policy supports the following inference-time hyperparameters:

–Execution length — how many of the 30 predicted actions to run before re-planning against a fresh observation.
–Playback speed — a time-stretch of the executed actions, so the arms move faster or slower.
–Inpainting overlap — how long the tail of the previous chunk softly anchors the next one before it is freed to self-correct.
–Guidance scale — how hard classifier-free guidance amplifies the advantage conditioning. It converged surprisingly high (7–9).
–Noise temperature — scales the initial flow-matching noise to concentrate candidates nearer the distribution mode.
–Best-of-N — draw N candidate chunks from the same prefix and execute the one the Q-head scores highest.
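Three of these knobs (noise temperature, best-of-N, and execution length) can be illustrated together. In this sketch `sample_chunk` and `q_score` are hypothetical callables standing in for the policy's sampler and its Q-head, and a chunk is a plain list of per-step action lists.

```python
import random


def initial_noise(rng, horizon, action_dim, temperature):
    """Gaussian starting noise for the sampler, scaled by the temperature.

    A temperature below 1 keeps candidates closer to the distribution mode.
    """
    return [
        [temperature * rng.gauss(0.0, 1.0) for _ in range(action_dim)]
        for _ in range(horizon)
    ]


def best_of_n(candidates, q_score):
    """Return the candidate chunk with the highest score."""
    return max(candidates, key=q_score)


def plan_step(sample_chunk, q_score, n_candidates, execution_length):
    """Draw N chunks, keep the best, and run only its first few actions
    before re-planning against a fresh observation."""
    candidates = [sample_chunk() for _ in range(n_candidates)]
    best = best_of_n(candidates, q_score)
    return best[:execution_length]
```

With a 30-step chunk and `execution_length=5`, five actions are executed and the remaining 25 are discarded, apart from whatever tail the inpainting overlap uses to anchor the next chunk.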

Tuning with Thompson sampling

A full grid search over all of these would be far too slow, so I tune them online, during rollout collection, with a per-parameter Thompson-sampling bandit. Simply put: for each rollout I pick a random combination of hyperparameters, and the higher the success of any parameter value, the higher its probability of being picked again in future rollouts. The posteriors decay every iteration so the bandit tracks the moving policy rather than its whole history, and once they settle the best configuration per garment type is frozen for the final run. Beyond free hyperparameter tuning, this also adds free exploration — every parameter slightly changes the dynamics of the rollouts.
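A per-parameter Thompson-sampling bandit with decaying posteriors can be sketched as follows. It assumes a Beta-Bernoulli model on binary rollout success with a uniform prior; the class names, the decay factor, the decay schedule, and the simulated success rates are all made up for illustration.

```python
import random


class DecayingBetaArm:
    """One candidate value of one hyperparameter."""

    def __init__(self):
        self.successes = 0.0
        self.failures = 0.0

    def sample(self, rng):
        # Beta(1 + successes, 1 + failures): uniform prior plus observed counts.
        return rng.betavariate(1.0 + self.successes, 1.0 + self.failures)

    def update(self, success):
        if success:
            self.successes += 1.0
        else:
            self.failures += 1.0

    def decay(self, factor):
        # Shrink old evidence so the posterior tracks the current policy.
        self.successes *= factor
        self.failures *= factor


class ParameterBandit:
    """Thompson sampling over the candidate values of a single parameter."""

    def __init__(self, values):
        self.arms = {value: DecayingBetaArm() for value in values}

    def pick(self, rng):
        return max(self.arms, key=lambda value: self.arms[value].sample(rng))

    def update(self, value, success):
        self.arms[value].update(success)

    def decay(self, factor=0.9):
        for arm in self.arms.values():
            arm.decay(factor)


# Simulated tuning of an execution length with made-up success rates.
rng = random.Random(0)
true_success = {5: 0.8, 30: 0.3}
bandit = ParameterBandit(true_success)
picks = {value: 0 for value in true_success}
for rollout in range(400):
    value = bandit.pick(rng)
    picks[value] += 1
    bandit.update(value, rng.random() < true_success[value])
    if rollout % 50 == 49:  # stand-in for "one training iteration"
        bandit.decay()
```

Running one such bandit per parameter, rather than one over all combinations, keeps the number of arms small at the cost of ignoring interactions between parameters.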

The converged values work as a cheap substitute for an ablation study and support a few conclusions: the guidance scale consistently converged to high values, far above my original expectation of ~2; best-of-N chunk selection matters, but N = 2–3 is enough; replanning every ~5 steps works better than executing the full 30-step chunk; and a noise temperature slightly below 1 for the flow-matching noise is beneficial (presumably it removes outliers while keeping enough action diversity).

Thompson-sampling posteriors for two inference parameters — Converged arm posteriors for short pants. Each curve is one arm's probability of beating the per-type baseline — a curve shifted right is a value the bandit currently prefers.

06 /

Online-round results

I won the first (simulation) round solo, finishing 1st out of 62 teams. My final success rate was 79.63%, about 6.1 points ahead of second place, with the top score on three of the four garment types. The final score is computed over 80 garments (both seen and unseen), with 10 episodes each.

#	Team	Long top	Short top	Long pant	Short pant	Overall
1	ilya (this work)	74.5%	70.0%	80.5%	93.5%	79.63%
2	Shubham @ Vorwerk	73.0%	62.5%	71.5%	87.0%	73.50%
3	Dum-E	76.5%	62.0%	75.5%	79.5%	73.38%
4	SCUT-Unlimited	65.5%	66.0%	70.0%	91.0%	73.13%
5	GraspYesAI	73.5%	61.0%	69.0%	79.0%	70.63%

Top 5 of 62. Full ranking on the competition website.
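As a quick sanity check on the table, the Overall column is consistent with an unweighted mean of the four per-type success rates. The snippet below only re-does that arithmetic on the published numbers; treating the overall score as a plain mean is an inference from the table, not a statement of the official scoring rule.

```python
# Per-type success rates (%) copied from the table above.
per_type = {
    "ilya (this work)": [74.5, 70.0, 80.5, 93.5],
    "Shubham @ Vorwerk": [73.0, 62.5, 71.5, 87.0],
    "Dum-E": [76.5, 62.0, 75.5, 79.5],
    "SCUT-Unlimited": [65.5, 66.0, 70.0, 91.0],
    "GraspYesAI": [73.5, 61.0, 69.0, 79.0],
}


def overall(rates):
    """Unweighted mean of the per-garment success rates."""
    return sum(rates) / len(rates)


overall_scores = {team: overall(rates) for team, rates in per_type.items()}
lead_over_second = (
    overall_scores["ilya (this work)"] - overall_scores["Shubham @ Vorwerk"]
)
```

The means come out as 79.625, 73.5, 73.375, 73.125, and 70.625, which match the table after rounding to two decimals, and the lead over second place is 6.125 points.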

Success examples

Failures

In most cases the policy is doing the right things. But sometimes it isn't dexterous enough, sometimes it overfits to simulation artifacts and makes mistakes, and sometimes it almost completes the task but gets stuck in a near-success state.

07 /

Sim to real

The final round was about solving the same problem, but on real robots. It was held in person at the ICRA 2026 conference in Vienna.

The four real garment types — The physical garments the real policy folds — kids' clothes, to fit the reach of the small arms.

The three training data sources across three cameras — The three training sources differ markedly — organizer BC, my teleop/DAgger, and sim replays — which is exactly the diversity the fine-tune needs to span.

In practice we had very little time to prepare — less than two weeks — and between flying to Austria and everything else, I had only about one week to actually work on the solution.

Another complication: participants had no access to the actual evaluation robot before the competition, so the real chain was sim → my robot → their robot, with an extra generalization step baked in.

There are two ways to close a domain gap: make the environments more similar, or make the training data diverse enough that the gap falls inside it. I leaned on both:

–start the fine-tune from a late-but-not-latest sim checkpoint — the very latest ones were the most overfit to the simulator;
–strip the model down to what transfers — the action head plus the garment-type and completion heads — and drop all the sim-only privileged machinery and RL-related logic (advantage conditioning, guidance, best-of-N, keypoint heads);
–fine-tune on a three-bucket mix: organizer real data (60%), my own teleop + DAgger (30%), and heavily-augmented sim replays (10%);
–use a camera-overlay alignment tool to match my camera setup to the one the organizers actually used;
–very heavy, per-camera augmentation — color, gain, gamma, blur, sensor noise, independent crop/rotate/zoom, cutout, camera dropout, and state noise to force the policy to trust pixels over miscalibrated proprioception;
–motion-intensity alignment — resample each source's speed so “how far to move in one step” is consistent (the sim moved fast, my early DAgger was slow);
–deliberate rig randomization: I moved cameras, re-calibrated the arms, and changed lighting over the week so no exact geometry was load-bearing.
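Two items from the list above, the three-bucket mix and motion-intensity alignment, are easy to sketch. The bucket names, the linear-interpolation resampling, and the `speed` convention are assumptions made for this example; only the 60/30/10 proportions come from the text.

```python
import random

# Sampling proportions from the text; the bucket names are illustrative.
DATA_MIX = {"organizer_real": 0.6, "teleop_dagger": 0.3, "sim_replay": 0.1}


def sample_source(rng, mix=DATA_MIX):
    """Pick which data bucket the next training example is drawn from."""
    names = list(mix)
    return rng.choices(names, weights=[mix[name] for name in names], k=1)[0]


def resample_trajectory(joints, speed):
    """Time-resample a joint trajectory by linear interpolation.

    speed > 1 yields fewer frames (more motion per step), speed < 1 yields
    more frames (less motion per step). Needs at least two input frames.
    """
    last = len(joints) - 1
    n_out = max(2, round(last / speed) + 1)
    resampled = []
    for i in range(n_out):
        t = i * last / (n_out - 1)
        lo = int(t)
        hi = min(lo + 1, last)
        frac = t - lo
        resampled.append(
            [a + frac * (b - a) for a, b in zip(joints[lo], joints[hi])]
        )
    return resampled
```

Picking one `speed` per source so that the typical per-step joint displacement matches across buckets is one way to make "how far to move in one step" consistent.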

Camera-overlay alignment

My script takes frames from a video in the organizers' dataset and moves my (or the simulated) robot along the same trajectory. Overlaying the live camera feeds on the recorded frames lets you calibrate the cameras and check that the robot positions are aligned.
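The overlay itself is just an alpha blend of two images. The sketch below works on small grayscale frames stored as nested lists so it stays dependency-free; the function names and the mean-absolute-difference score are illustrative additions, not part of the actual tool.

```python
def blend_frames(live, reference, alpha=0.5):
    """Per-pixel alpha blend of a live frame over a reference frame.

    Both frames are equally sized grids of grayscale values (0-255).
    """
    return [
        [round(alpha * l + (1.0 - alpha) * r) for l, r in zip(live_row, ref_row)]
        for live_row, ref_row in zip(live, reference)
    ]


def mean_abs_difference(live, reference):
    """Crude misalignment score: 0 means the frames are identical."""
    diffs = [
        abs(l - r)
        for live_row, ref_row in zip(live, reference)
        for l, r in zip(live_row, ref_row)
    ]
    return sum(diffs) / len(diffs)
```

In the blended image, a well-aligned camera shows a single sharp arm, while a misaligned one shows a ghosted double outline.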

Live robot cameras overlaid on the dataset frame — The three live cameras overlaid on the matching organizer-BC frame after driving the arms to the recorded joint state.

DAgger on my own robot

My main strategy was to collect extra DAgger data using my own robot — teleoperating it to correct the policy wherever it went wrong, then folding those corrections back into the fine-tune.

Because the gap between my rig and the actual evaluation setup remained large, I kept the weight of this self-collected data at only 30%, while the organizer-provided dataset stayed at 60%.

Human-in-the-loop interventions turned out to be a sample-efficient way to make the policy more robust to its own mistakes. I only had time for 2–3 DAgger loop iterations, but I could clearly see the progress the policy made with every step — I believe this approach could lead to much better results with just a bit more iteration.
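One small piece of a DAgger-style pipeline is pulling the human-corrected stretches out of a mixed episode so they can be added to the fine-tuning data. The sketch below assumes a hypothetical log format in which every frame records who was in control; the field names are invented for the example.

```python
def correction_segments(frames):
    """Split an episode into contiguous runs of human-controlled frames.

    Each frame is a dict with a hypothetical "controller" field that is
    either "policy" or "human".
    """
    segments = []
    current = []
    for frame in frames:
        if frame["controller"] == "human":
            current.append(frame)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# Toy episode: the operator takes over twice.
episode = [
    {"t": 0, "controller": "policy"},
    {"t": 1, "controller": "human"},
    {"t": 2, "controller": "human"},
    {"t": 3, "controller": "policy"},
    {"t": 4, "controller": "human"},
]
segments = correction_segments(episode)
```

Each segment starts in a state the policy reached on its own, which is what makes these corrections more informative than fresh demonstrations from a clean start.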

Note: the clips can look a bit shaky and jittery because the “idle” frames — where the model is computing its next prediction — are skipped here.

Autonomous successful rollouts (using my robot)

Final standing

I placed 2nd in the real-robot final, with 865 of 1080 points, judged on-site at ICRA in Vienna. The score is the organizers' combined number: it rewards both how often the robot succeeds and how cleanly it folds, and folding a garment it had never seen before was worth 50% extra.

#	Team	Score
1	sZs	895
2	ilya (this work)	865
3	Dum-E	762.5
4	SCUT-Unlimited	635
5	sisigakgak	570
6	Shubham @ Vorwerk	470

Teams that completed scored real-robot runs. Maximum possible total: 1080 points.

Official rollouts from the final

Some official rollouts captured during the actual competition final. Apologies for the rough footage — I was busy with the competition itself and didn't make nicer videos, and the camera feeds weren't recorded.

08 /

Takeaways

I joined the LeHome Challenge to test and experiment with RL approaches for fine-tuning VLA policies; as a bonus exercise, I took on the sim-to-real round as well. I did it all under competition pressure, so many of my decisions may be sub-optimal and not tested enough — there is definitely a lot of room for improvement.

The biggest low-hanging fruit would be to combine the approaches I used independently for the two rounds: full RL in sim, and plain behavior cloning with human corrections on the real robot. These halves are complementary, and a single pipeline — a real-side value function driving advantages and best-of-N, with DAgger interventions weighted by that same signal — would combine round one's cleanliness with round two's recovery. I'm fairly sure that with slight refinement and less time pressure my recipe could reach well above 90% on this task.

Even so, I believe the overall recipe I used is the right way to improve VLA-like policies with RL and human-in-the-loop methods, and many of the ideas here can be reused independently. I share my code, a detailed tech report, and the final policy weights below.

I'm also working on longer videos where I explain everything in much more detail — stay tuned.

Prizewinning solution of the LeHome Challenge.