Robotics has a data problem. Macrodata Labs wants to solve it

The AI industry has spent the past several years learning a critical lesson: better data often matters as much as better models. While advances in large language models have been powered by increasingly sophisticated datasets and data pipelines, robotics has yet to undergo the same transformation. Robotics teams are working with vast quantities of video, sensor data, and demonstrations, but much of the infrastructure needed to process, annotate, and improve that data remains immature.

Macrodata Labs believes that closing that gap could become one of the most important challenges in robotics AI. The company recently emerged from stealth, launching Refiner, an open-source framework and cloud platform for processing robotics datasets.

The company raised $4 million in pre-seed funding in June this year to build infrastructure for the robotics data loop. The round was led by Air Street Capital, with participation from Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, Thomas Wolf, and business angels from some of the world's leading AI labs and technology companies.

Refiner, the company's first product, is designed to help teams turn raw physical-world data into better training datasets.

I spoke to the CEO and co-founder, Guilherme Penedo, to find out more.

From building LLM datasets to building robotics infrastructure

Macrodata Labs was founded by Guilherme Penedo and Hynek Kydlíček, who formed the core team behind several of Hugging Face's largest open LLM dataset efforts. They created widely used datasets such as FineWeb, FineWeb2, FinePDFs, and FineTranslations, which have been used by teams at NVIDIA, Google, AI2, and Z.ai, and contributed to large-scale training projects such as Open-R1 and SmolLM.

Penedo was part of the team behind Falcon, one of the strongest open-source models at the time of its release. After that, he joined Hugging Face, where he focused on building large-scale datasets for training AI models.

"That's where I met my co-founder, Hynek Kydlíček. We worked together on projects such as FineWeb, which processes large portions of the internet and turns the data into high-quality training datasets. FineWeb became one of the most widely used open datasets for language model training, and we later expanded that work into other areas, including PDFs and multilingual datasets."

The common theme throughout their work was figuring out how to take massive amounts of raw data and transform it into something that can produce significantly better AI models. While building large-scale datasets at Hugging Face, the founders saw that progress was not only about model architectures or compute, but also about the infrastructure needed to collect, transform, inspect, and iterate on training data at scale.

After seeing how better data infrastructure helped unlock progress in LLMs, the founders believe robotics is approaching a similar inflection point.

Why better data, not better models, could unlock robotics

While advances in LLMs and vision-language models (VLMs) are making robots increasingly capable, the data layer underpinning robotics remains underdeveloped. Physical-world data is larger, messier, more fragmented, and far more difficult to transform into useful training datasets than text. Penedo explained:

"In language models, we learned how difficult it is to transform raw data into datasets that consistently produce high-quality results. We realised that robotics is facing many of the same challenges, but on an even larger scale."

According to Penedo, the key difference is that many data-processing tasks in language models can be handled with relatively simple rules, whereas robotics requires far more interpretation.

"You might have hundreds of hours of video showing humans performing tasks, but before that data becomes useful for training robots, you need to understand what is happening in the scene," he said.

"For example, if someone is washing dishes, you need to identify individual subtasks: picking up a plate, applying soap, rinsing, and so on. You may also need to estimate hand positions, infer actions, and map human movements to robotic equivalents."

The challenge extends beyond understanding actions. Robotics datasets combine video, sensor streams, trajectories, and other multimodal inputs, creating large, complex datasets that are difficult to store, process, and standardise. Different robotics companies often use their own data formats and workflows, while many questions about what data should be collected and how it should be annotated remain unresolved.

"We believe robotics is the next major frontier for AI," said Penedo.

"The progress we've seen in large language models and vision-language models is now enabling a new generation of robotic systems. At the same time, robotics is increasingly benefiting from the same scaling principles that transformed language models: better data leads to better models."

For robotics teams, that means a significant amount of work is required to label, annotate, filter, and enrich data before it becomes useful for training.

"These constraints make data work in robotics especially important," Penedo said.

"Teams need scalable, reliable tooling so they can process demonstrations, test new annotations, and iterate on datasets without rebuilding their data stack every time they change embodiment, sensors, data format, or labeling method."

Penedo cautions that the industry is still very early, with many companies investing heavily in collecting more data, improving model architectures, or building better hardware.

"Those things are important, but comparatively little attention has been paid to improving the quality of existing data. Many teams still rely on manual processes for annotation and data preparation, even though modern AI systems can automate much of that work. The data you collect today will likely remain valuable across multiple generations of models and architectures. That's why we think infrastructure for data processing is one of the most important pieces of the stack."

Refiner: infrastructure for the robotics data loop

Robotics companies are often hardware-first organisations, but Macrodata Labs believes that the software layer, and specifically the data layer, is what will ultimately determine how capable these systems become. Refiner is an open-source framework for processing robotics datasets. It enables robotics teams to ingest data, process demonstrations, and run workflows such as hand tracking, subtask annotation, and reward model scoring. The framework supports a wide range of robotics data formats and can process multimodal robot episodes, including trajectories, camera streams, sensor data, and annotations, within a single pipeline.

Designed to work directly with cloud storage, it allows teams to work with large datasets without first downloading them locally. Penedo explained:

"Users don't need to download terabytes of data locally before they can start working. Refiner can stream data directly from cloud storage, process it efficiently, and run workflows across distributed infrastructure."

Refiner also supports GPU-based processing, which is increasingly important as robotics data pipelines rely on AI models for tasks such as annotation, understanding, and evaluation. The broader goal is to make robotics data infrastructure more accessible and scalable while giving teams the flexibility to work across different robots, sensors, and workflows.

Through the hosted Macrodata Labs platform, users can scale the same pipeline from local Python execution to managed cloud compute without rewriting their workflows. The platform handles orchestration, scheduling, CPU and GPU workers, data traceability, failure recovery, and observability, while customers pay only for the compute resources they use. Right now, the company is focused on robotics companies that train models and build robotic systems.

Over time, Penedo predicts the market will expand:

"As robotics models become more capable and accessible, we expect more organisations to buy robots off the shelf and fine-tune them for specific tasks. At that point, we can help those customers understand what data they need to collect and how to adapt models to their environments. But today our primary customers are the teams building the underlying robotics systems."

Building a robotics startup in stealth

I was curious what it was like to build a company in stealth. Penedo admits that there were challenges.

"When you're operating in stealth, people can't easily look you up online or validate what you're doing. That means introductions and personal networks become much more important because potential customers and partners don't have much public information to work with. That said, we never intended to remain in stealth for long.

"The goal was simply to give ourselves a few months to build the first version of the product, validate the core ideas, and begin working with early users before going public."

Why Europe can lead the next wave of robotics

Macrodata Labs is technically structured as a US company, largely for fundraising reasons, but it is based in France and would love to see Europe become a major force in robotics. Europe is frequently cast as trailing the US in AI, but Penedo believes robotics is one area where Europe remains highly competitive.

"You see strong clusters around Zurich, driven by ETH Zürich and the companies emerging from that ecosystem. Munich is another major centre. More broadly, Europe remains highly industrialised and has a large manufacturing base, which creates real demand for innovation in robotics. That gives Europe an opportunity to play a significant role in this next wave of AI."

Macrodata Labs' immediate focus is helping users adopt Refiner and gathering feedback from the robotics community, while investing heavily in research. "We want to go beyond making robotics data processing more efficient and explore how better data pipelines can actually improve model performance. That means testing new approaches, training models, running experiments on real robotic systems, and continually measuring whether our methods produce better outcomes," said Penedo.