Hugging Face has launched Storage Buckets, a feature for managing the intermediate files generated during machine learning (ML) workflows. Buckets provide flexible, S3-like object storage that users can browse, script against, and manage.
What are Storage Buckets?
Storage Buckets are non-versioned containers for ML artifacts such as checkpoints, optimizer states, and logs. These files change frequently and do not need version control, so buckets treat them as mutable. Each bucket resides under a user or organization namespace, can be public or private, and is addressed by a unique handle, such as hf://buckets/username/my-training-bucket.
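To make the handle format concrete, here is a minimal sketch that splits a handle into its namespace and bucket name. It is plain string handling and does not use any Hugging Face library. The names `BucketHandle` and `parse_bucket_handle` are hypothetical, and the optional trailing path inside the bucket is an assumption that the example handle above does not confirm.

```python
from typing import NamedTuple

# Prefix taken from the example handle in the text.
BUCKET_PREFIX = "hf://buckets/"


class BucketHandle(NamedTuple):
    namespace: str  # user or organization
    name: str       # bucket name
    path: str       # assumed: optional path inside the bucket, "" if absent


def parse_bucket_handle(handle: str) -> BucketHandle:
    """Split a handle such as hf://buckets/<namespace>/<name> into parts."""
    if not handle.startswith(BUCKET_PREFIX):
        raise ValueError(f"not a bucket handle: {handle!r}")
    rest = handle[len(BUCKET_PREFIX):]
    # Split into at most three pieces: namespace, name, remaining path.
    parts = rest.split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"handle needs a namespace and a bucket name: {handle!r}")
    path = parts[2] if len(parts) == 3 else ""
    return BucketHandle(parts[0], parts[1], path)


handle = parse_bucket_handle("hf://buckets/username/my-training-bucket")
print(handle.namespace, handle.name)
```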
The Role of Xet in Buckets
Storage Buckets are built on Hugging Face’s Xet backend, a chunk-based storage system. Rather than treating each file as a single unit, Xet breaks content into smaller chunks and deduplicates them. When similar datasets or checkpoints are uploaded, only the new or changed chunks are stored, which reduces bandwidth usage and speeds up transfers. This is particularly beneficial for ML workloads that generate many related artifacts.
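The idea of chunk-level deduplication can be illustrated with a toy content-addressed store. This is a sketch of the general technique, not Xet's implementation: the `ChunkStore` class is hypothetical, the 4-byte chunk size is chosen only to keep the demo readable, and fixed-size chunks are used purely for simplicity, which is an assumption and not a description of how Xet picks chunk boundaries.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # tiny, illustrative value; real systems use far larger chunks


class ChunkStore:
    """Toy store that keeps each distinct chunk once, keyed by its SHA-256."""

    def __init__(self, chunk_size: int = CHUNK_SIZE) -> None:
        self.chunk_size = chunk_size
        self.chunks: Dict[str, bytes] = {}

    def upload(self, data: bytes) -> Tuple[List[str], int]:
        """Store data and return (manifest of chunk hashes, bytes newly stored)."""
        manifest: List[str] = []
        new_bytes = 0
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only chunks the store has never seen cost storage or transfer.
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            manifest.append(digest)
        return manifest, new_bytes

    def download(self, manifest: List[str]) -> bytes:
        """Rebuild a file by concatenating its chunks in manifest order."""
        return b"".join(self.chunks[d] for d in manifest)


store = ChunkStore()
checkpoint_v1 = b"AAAABBBBCCCCDDDD"
checkpoint_v2 = b"AAAABBBBXXXXDDDD"  # differs from v1 in one chunk only

_, stored_v1 = store.upload(checkpoint_v1)
manifest_v2, stored_v2 = store.upload(checkpoint_v2)
print(stored_v1, stored_v2)  # 16 4
```

The second upload stores only the one changed chunk, which is the effect described above for related checkpoints.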
Pre-Warming and Global Storage
Storage Buckets are designed to operate globally, and they also offer a pre-warming feature that lets users bring data closer to their compute resources. This matters for distributed training and large-scale pipelines, where having frequently accessed data available near the compute reduces latency. Hugging Face is collaborating with AWS and GCP to optimize this feature for various cloud environments.
Getting Started with Storage Buckets
Setting up a Storage Bucket takes under two minutes with the Hugging Face CLI. Users can create a bucket, sync their data, and manage their files with a few commands. The same functionality is available through the huggingface_hub Python library, so it can be integrated into existing workflows.
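The exact CLI commands and huggingface_hub calls are not reproduced here. As a rough illustration of what a one-way sync has to decide, the sketch below compares a local listing with a remote one and returns which files to upload and, optionally, which to delete. The function `plan_sync`, the path-to-hash dictionaries, and the `delete_extraneous` flag are all hypothetical; whether the real sync removes remote files that are missing locally is an assumption left off by default.

```python
from typing import Dict, List, Tuple


def plan_sync(
    local: Dict[str, str],
    remote: Dict[str, str],
    delete_extraneous: bool = False,
) -> Tuple[List[str], List[str]]:
    """Plan a one-way sync from local to remote.

    Both arguments map a relative file path to a content hash.
    Returns (paths to upload, paths to delete on the remote side).
    """
    # Upload anything that is missing remotely or whose content differs.
    to_upload = sorted(p for p, h in local.items() if remote.get(p) != h)
    # Optionally remove remote files that no longer exist locally.
    to_delete = sorted(p for p in remote if p not in local) if delete_extraneous else []
    return to_upload, to_delete


local_files = {"ckpt/step-100.pt": "h1", "logs/train.log": "h2"}
remote_files = {"ckpt/step-100.pt": "h1", "logs/train.log": "old", "tmp/scratch.bin": "h3"}

uploads, deletes = plan_sync(local_files, remote_files, delete_extraneous=True)
print(uploads, deletes)
```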
Storage Buckets give mutable ML artifacts a dedicated home and support moving them to versioned repositories once they are finalized.
This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.








