# Hub ## Docs - [Using Sentence Transformers at Hugging Face](https://huggingface.co/docs/hub/sentence-transformers.md) - [Giskard on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-giskard.md) - [Embed the Dataset Viewer in a webpage](https://huggingface.co/docs/hub/datasets-viewer-embed.md) - [Using OpenCLIP at Hugging Face](https://huggingface.co/docs/hub/open_clip.md) - [Tabby on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-tabby.md) - [ChatUI on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-chatui.md) - [Pricing and Billing](https://huggingface.co/docs/hub/jobs-pricing.md) - [Displaying carbon emissions for your model](https://huggingface.co/docs/hub/model-cards-co2.md) - [Spaces ZeroGPU: Dynamic GPU Allocation for Spaces](https://huggingface.co/docs/hub/spaces-zerogpu.md) - [Webhooks Automation](https://huggingface.co/docs/hub/jobs-webhooks.md) - [Using mlx-image at Hugging Face](https://huggingface.co/docs/hub/mlx-image.md) - [Panel on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-panel.md) - [Access Patterns](https://huggingface.co/docs/hub/storage-buckets-access.md) - [Dash on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-dash.md) - [Audit Logs](https://huggingface.co/docs/hub/audit-logs.md) - [Downloading models](https://huggingface.co/docs/hub/models-downloading.md) - [The Model Hub](https://huggingface.co/docs/hub/models-the-hub.md) - [Spaces Dev Mode: Seamless development in Spaces](https://huggingface.co/docs/hub/spaces-dev-mode.md) - [SQL Console: Query Hugging Face datasets in your browser](https://huggingface.co/docs/hub/datasets-viewer-sql-console.md) - [Single Sign-On (SSO)](https://huggingface.co/docs/hub/enterprise-sso.md) - [Managing Spaces with Github Actions](https://huggingface.co/docs/hub/spaces-github-actions.md) - [Billing](https://huggingface.co/docs/hub/billing.md) - [More ways to create Spaces](https://huggingface.co/docs/hub/spaces-more-ways-to-create.md) - [Hugging Face 
Hub documentation](https://huggingface.co/docs/hub/index.md) - [Advanced Security](https://huggingface.co/docs/hub/enterprise-advanced-security.md) - [Using Unity Sentis Models from Hugging Face](https://huggingface.co/docs/hub/unity-sentis.md) - [FiftyOne](https://huggingface.co/docs/hub/datasets-fiftyone.md) - [Using timm at Hugging Face](https://huggingface.co/docs/hub/timm.md) - [Git over SSH](https://huggingface.co/docs/hub/security-git-ssh.md) - [Using `Transformers.js` at Hugging Face](https://huggingface.co/docs/hub/transformers-js.md) - [Spaces](https://huggingface.co/docs/hub/spaces.md) - [Using OpenCV in Spaces](https://huggingface.co/docs/hub/spaces-using-opencv.md) - [Storage Buckets: Security & Compliance](https://huggingface.co/docs/hub/storage-buckets-security.md) - [Ingesting Datasets](https://huggingface.co/docs/hub/datasets-ingesting.md) - [How to Add a Space to ArXiv](https://huggingface.co/docs/hub/spaces-add-to-arxiv.md) - [Dask](https://huggingface.co/docs/hub/datasets-dask.md) - [Hub API Endpoints](https://huggingface.co/docs/hub/api.md) - [Argilla on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-argilla.md) - [Using 🤗 Datasets](https://huggingface.co/docs/hub/datasets-usage.md) - [Model Cards](https://huggingface.co/docs/hub/model-cards.md) - [Evidence on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-evidence.md) - [Datasets Download Stats](https://huggingface.co/docs/hub/datasets-download-stats.md) - [User Provisioning (SCIM)](https://huggingface.co/docs/hub/enterprise-scim.md) - [Using PaddleNLP at Hugging Face](https://huggingface.co/docs/hub/paddlenlp.md) - [How to configure SCIM with Okta](https://huggingface.co/docs/hub/security-sso-okta-scim.md) - [Distilabel](https://huggingface.co/docs/hub/datasets-distilabel.md) - [Gating Group Collections](https://huggingface.co/docs/hub/enterprise-gating-group-collections.md) - [THE LANDSCAPE OF ML DOCUMENTATION 
TOOLS](https://huggingface.co/docs/hub/model-card-landscape-analysis.md) - [Livebook on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-livebook.md) - [Spaces Custom Domain](https://huggingface.co/docs/hub/spaces-custom-domain.md) - [Storage Regions on the Hub](https://huggingface.co/docs/hub/storage-regions.md) - [Basic SSO](https://huggingface.co/docs/hub/security-sso-basic.md) - [SSO Configuration Guides](https://huggingface.co/docs/hub/security-sso-configuration-guides.md) - [Storage limits](https://huggingface.co/docs/hub/storage-limits.md) - [Model Card Guidebook](https://huggingface.co/docs/hub/model-card-guidebook.md) - [Pandas](https://huggingface.co/docs/hub/datasets-pandas.md) - [GGUF usage with LM Studio](https://huggingface.co/docs/hub/lmstudio.md) - [Hugging Face Dataset Upload Decision Guide](https://huggingface.co/docs/hub/datasets-upload-guide-llm.md) - [How to configure SAML SSO with Google Workspace](https://huggingface.co/docs/hub/security-sso-google-saml.md) - [Third-party scanner: JFrog](https://huggingface.co/docs/hub/security-jfrog.md) - [Managing Spaces with CircleCI Workflows](https://huggingface.co/docs/hub/spaces-circleci.md) - [Perform SQL operations](https://huggingface.co/docs/hub/datasets-duckdb-sql.md) - [Static HTML Spaces](https://huggingface.co/docs/hub/spaces-sdks-static.md) - [Using spaCy at Hugging Face](https://huggingface.co/docs/hub/spacy.md) - [Webhook guide: build a Discussion bot based on BLOOM](https://huggingface.co/docs/hub/webhooks-guide-discussion-bot.md) - [Academia Hub](https://huggingface.co/docs/hub/academia-hub.md) - [GGUF usage with llama.cpp](https://huggingface.co/docs/hub/gguf-llamacpp.md) - [Evaluation Results](https://huggingface.co/docs/hub/eval-results.md) - [Network Security](https://huggingface.co/docs/hub/enterprise-network-security.md) - [Advanced Topics](https://huggingface.co/docs/hub/spaces-advanced.md) - [GitHub Actions](https://huggingface.co/docs/hub/repositories-github-actions.md) 
- [How to configure OIDC SSO with Okta](https://huggingface.co/docs/hub/security-sso-okta-oidc.md) - [Digital Object Identifier (DOI)](https://huggingface.co/docs/hub/doi.md) - [How to get a user's plan and status in Spaces](https://huggingface.co/docs/hub/spaces-get-user-plan.md) - [Programmatic User Access Control Management](https://huggingface.co/docs/hub/programmatic-user-access-control.md) - [Datasets Overview](https://huggingface.co/docs/hub/datasets-overview.md) - [Widgets](https://huggingface.co/docs/hub/models-widgets.md) - [Search](https://huggingface.co/docs/hub/search.md) - [Data Studio](https://huggingface.co/docs/hub/data-studio.md) - [Data files Configuration](https://huggingface.co/docs/hub/datasets-data-files-configuration.md) - [Transforming your dataset](https://huggingface.co/docs/hub/datasets-polars-operations.md) - [Shiny on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-shiny.md) - [Using SetFit with Hugging Face](https://huggingface.co/docs/hub/setfit.md) - [Using 🤗 `transformers` at Hugging Face](https://huggingface.co/docs/hub/transformers.md) - [Team & Enterprise plans](https://huggingface.co/docs/hub/enterprise.md) - [Langfuse on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-langfuse.md) - [Using SpeechBrain at Hugging Face](https://huggingface.co/docs/hub/speechbrain.md) - [ZenML on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-zenml.md) - [Using SpanMarker at Hugging Face](https://huggingface.co/docs/hub/span_marker.md) - [Downloading datasets](https://huggingface.co/docs/hub/datasets-downloading.md) - [Using MLX at Hugging Face](https://huggingface.co/docs/hub/mlx.md) - [Third-party scanner: Protect AI](https://huggingface.co/docs/hub/security-protectai.md) - [Getting Started with Repositories](https://huggingface.co/docs/hub/repositories-getting-started.md) - [Embedding Atlas](https://huggingface.co/docs/hub/datasets-embedding-atlas.md) - [Using PEFT at Hugging 
Face](https://huggingface.co/docs/hub/peft.md) - [Using Keras at Hugging Face](https://huggingface.co/docs/hub/keras.md) - [Models](https://huggingface.co/docs/hub/models.md) - [Gated models](https://huggingface.co/docs/hub/models-gated.md) - [Run with Docker](https://huggingface.co/docs/hub/spaces-run-with-docker.md) - [Integrate your library with the Hub](https://huggingface.co/docs/hub/models-adding-libraries.md) - [The HF PRO subscription 🔥](https://huggingface.co/docs/hub/pro.md) - [Using BERTopic at Hugging Face](https://huggingface.co/docs/hub/bertopic.md) - [Models Download Stats](https://huggingface.co/docs/hub/models-download-stats.md) - [Docker Spaces Examples](https://huggingface.co/docs/hub/spaces-sdks-docker-examples.md) - [Advanced Compute Options](https://huggingface.co/docs/hub/advanced-compute-options.md) - [Managed SSO](https://huggingface.co/docs/hub/enterprise-advanced-sso.md) - [Webhooks](https://huggingface.co/docs/hub/webhooks.md) - [marimo on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-marimo.md) - [How to configure SAML SSO with Okta](https://huggingface.co/docs/hub/security-sso-okta-saml.md) - [DuckDB](https://huggingface.co/docs/hub/datasets-duckdb.md) - [Using Stanza at Hugging Face](https://huggingface.co/docs/hub/stanza.md) - [Notifications](https://huggingface.co/docs/hub/notifications.md) - [Gradio Spaces](https://huggingface.co/docs/hub/spaces-sdks-gradio.md) - [Spaces as API endpoints](https://huggingface.co/docs/hub/spaces-api-endpoints.md) - [Licenses](https://huggingface.co/docs/hub/repositories-licenses.md) - [Hugging Face CLI for AI Agents](https://huggingface.co/docs/hub/agents-cli.md) - [Webhook guide: Setup an automatic metadata quality review for models and datasets](https://huggingface.co/docs/hub/webhooks-guide-metadata-review.md) - [Use AI Models Locally](https://huggingface.co/docs/hub/local-apps.md) - [Adding a Sign-In with HF button to your Space](https://huggingface.co/docs/hub/spaces-oauth.md) - 
[Local Agents with llama.cpp](https://huggingface.co/docs/hub/agents-local.md) - [Using 🧨 `diffusers` at Hugging Face](https://huggingface.co/docs/hub/diffusers.md) - [Resource groups](https://huggingface.co/docs/hub/enterprise-resource-groups.md) - [User access tokens](https://huggingface.co/docs/hub/security-tokens.md) - [Bucket Integrations](https://huggingface.co/docs/hub/storage-buckets-integrations.md) - [Tasks](https://huggingface.co/docs/hub/models-tasks.md) - [Access control in organizations](https://huggingface.co/docs/hub/organizations-security.md) - [Optimizations](https://huggingface.co/docs/hub/datasets-polars-optimizations.md) - [Featured Spaces](https://huggingface.co/docs/hub/spaces-featured.md) - [PyArrow](https://huggingface.co/docs/hub/datasets-pyarrow.md) - [Building with the SDK](https://huggingface.co/docs/hub/agents-sdk.md) - [Video Dataset](https://huggingface.co/docs/hub/datasets-video.md) - [Libraries](https://huggingface.co/docs/hub/models-libraries.md) - [🟧 Label Studio on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-label-studio.md) - [Repository Settings](https://huggingface.co/docs/hub/repositories-settings.md) - [Agents](https://huggingface.co/docs/hub/agents.md) - [Using GPU Spaces](https://huggingface.co/docs/hub/spaces-gpus.md) - [Managing organizations](https://huggingface.co/docs/hub/organizations-managing.md) - [Using Stable-Baselines3 at Hugging Face](https://huggingface.co/docs/hub/stable-baselines3.md) - [Audio Dataset](https://huggingface.co/docs/hub/datasets-audio.md) - [How to configure OIDC SSO with Microsoft Entra ID (Azure AD)](https://huggingface.co/docs/hub/security-sso-azure-oidc.md) - [File formats](https://huggingface.co/docs/hub/datasets-polars-file-formats.md) - [Gated datasets](https://huggingface.co/docs/hub/datasets-gated.md) - [Accessing Benchmark Leaderboard Data](https://huggingface.co/docs/hub/leaderboard-data-guide.md) - [Data 
Designer](https://huggingface.co/docs/hub/datasets-data-designer.md) - [TF-Keras (legacy)](https://huggingface.co/docs/hub/tf-keras.md) - [Pickle Scanning](https://huggingface.co/docs/hub/security-pickle.md) - [Hugging Face MCP Server](https://huggingface.co/docs/hub/agents-mcp.md) - [Configure the Dataset Viewer](https://huggingface.co/docs/hub/datasets-viewer-configure.md) - [Hub Local Cache](https://huggingface.co/docs/hub/local-cache.md) - [Custom Python Spaces](https://huggingface.co/docs/hub/spaces-sdks-python.md) - [Model(s) Release Checklist](https://huggingface.co/docs/hub/model-release-checklist.md) - [Using _Adapters_ at Hugging Face](https://huggingface.co/docs/hub/adapters.md) - [Examples & Tutorials](https://huggingface.co/docs/hub/jobs-examples.md) - [File names and splits](https://huggingface.co/docs/hub/datasets-file-names-and-splits.md) - [Your First Docker Space: Text Generation with T5](https://huggingface.co/docs/hub/spaces-sdks-docker-first-demo.md) - [Malware Scanning](https://huggingface.co/docs/hub/security-malware.md) - [Hub Rate limits](https://huggingface.co/docs/hub/rate-limits.md) - [Using RL-Baselines3-Zoo at Hugging Face](https://huggingface.co/docs/hub/rl-baselines3-zoo.md) - [Handling Spaces Dependencies in Gradio Spaces](https://huggingface.co/docs/hub/spaces-dependencies.md) - [Using ML-Agents at Hugging Face](https://huggingface.co/docs/hub/ml-agents.md) - [Streaming datasets](https://huggingface.co/docs/hub/datasets-streaming.md) - [Organization cards](https://huggingface.co/docs/hub/organizations-cards.md) - [fenic](https://huggingface.co/docs/hub/datasets-fenic.md) - [Argilla](https://huggingface.co/docs/hub/datasets-argilla.md) - [Widget Examples](https://huggingface.co/docs/hub/models-widgets-examples.md) - [Spaces Settings](https://huggingface.co/docs/hub/spaces-settings.md) - [Spaces Configuration Reference](https://huggingface.co/docs/hub/spaces-config-reference.md) - 
[WebDataset](https://huggingface.co/docs/hub/datasets-webdataset.md) - [Authentication for private and gated datasets](https://huggingface.co/docs/hub/datasets-duckdb-auth.md) - [Datasets](https://huggingface.co/docs/hub/datasets.md) - [Storage Buckets](https://huggingface.co/docs/hub/storage-buckets.md) - [Polars](https://huggingface.co/docs/hub/datasets-polars.md) - [Schedule Jobs](https://huggingface.co/docs/hub/jobs-schedule.md) - [Authentication](https://huggingface.co/docs/hub/datasets-polars-auth.md) - [Datasets](https://huggingface.co/docs/hub/enterprise-datasets.md) - [Combine datasets and export](https://huggingface.co/docs/hub/datasets-duckdb-combine-and-export.md) - [Publisher Analytics](https://huggingface.co/docs/hub/publisher-analytics.md) - [Using fastai at Hugging Face](https://huggingface.co/docs/hub/fastai.md) - [Models Frequently Asked Questions](https://huggingface.co/docs/hub/models-faq.md) - [Organizations, Security, and the Hub API](https://huggingface.co/docs/hub/other.md) - [Using Flair at Hugging Face](https://huggingface.co/docs/hub/flair.md) - [How to configure OIDC SSO with Google Workspace](https://huggingface.co/docs/hub/security-sso-google-oidc.md) - [Editing Datasets in Data Studio](https://huggingface.co/docs/hub/datasets-cell-editing.md) - [Disk usage on Spaces](https://huggingface.co/docs/hub/spaces-storage.md) - [Reference](https://huggingface.co/docs/hub/jobs-reference.md) - [Jobs](https://huggingface.co/docs/hub/jobs.md) - [Using AllenNLP at Hugging Face](https://huggingface.co/docs/hub/allennlp.md) - [Libraries](https://huggingface.co/docs/hub/datasets-libraries.md) - [Model Card components](https://huggingface.co/docs/hub/model-cards-components.md) - [Next Steps](https://huggingface.co/docs/hub/repositories-next-steps.md) - [Agents](https://huggingface.co/docs/hub/agents-overview.md) - [Moderation](https://huggingface.co/docs/hub/moderation.md) - [Inference Providers](https://huggingface.co/docs/hub/models-inference.md) - 
[Spaces as MCP servers](https://huggingface.co/docs/hub/spaces-mcp-servers.md) - [Tokens Management](https://huggingface.co/docs/hub/enterprise-tokens-management.md) - [Agent Libraries](https://huggingface.co/docs/hub/agents-libraries.md) - [Paper Pages](https://huggingface.co/docs/hub/paper-pages.md) - [Use Ollama with any GGUF Model on Hugging Face Hub](https://huggingface.co/docs/hub/ollama.md) - [Two-Factor Authentication (2FA)](https://huggingface.co/docs/hub/security-2fa.md) - [Spaces as Agent Tools](https://huggingface.co/docs/hub/spaces-agents.md) - [How to configure SCIM with Microsoft Entra ID (Azure AD)](https://huggingface.co/docs/hub/security-sso-entra-id-scim.md) - [Single Sign-On (SSO)](https://huggingface.co/docs/hub/security-sso.md) - [How to configure SAML SSO with Microsoft Entra ID (Azure AD)](https://huggingface.co/docs/hub/security-sso-azure-saml.md) - [Advanced Access Control in Organizations with Resource Groups](https://huggingface.co/docs/hub/security-resource-groups.md) - [Organizations](https://huggingface.co/docs/hub/organizations.md) - [Aim on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-aim.md) - [Lance](https://huggingface.co/docs/hub/datasets-lance.md) - [Using TensorBoard](https://huggingface.co/docs/hub/tensorboard.md) - [User Management](https://huggingface.co/docs/hub/security-sso-user-management.md) - [Collections](https://huggingface.co/docs/hub/collections.md) - [Spaces Overview](https://huggingface.co/docs/hub/spaces-overview.md) - [Configuration](https://huggingface.co/docs/hub/jobs-configuration.md) - [Repositories](https://huggingface.co/docs/hub/repositories.md) - [Secrets Scanning](https://huggingface.co/docs/hub/security-secrets.md) - [Advanced Topics](https://huggingface.co/docs/hub/models-advanced.md) - [Embed your Space in another website](https://huggingface.co/docs/hub/spaces-embed.md) - [Spark](https://huggingface.co/docs/hub/datasets-spark.md) - [Using ESPnet at Hugging 
Face](https://huggingface.co/docs/hub/espnet.md) - [Daft](https://huggingface.co/docs/hub/datasets-daft.md) - [Dataset Cards](https://huggingface.co/docs/hub/datasets-cards.md) - [Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker.md) - [Streamlit Spaces](https://huggingface.co/docs/hub/spaces-sdks-streamlit.md) - [Annotated Model Card Template](https://huggingface.co/docs/hub/model-card-annotated.md) - [How to handle URL parameters in Spaces](https://huggingface.co/docs/hub/spaces-handle-url-parameters.md) - [Signing commits with GPG](https://huggingface.co/docs/hub/security-gpg.md) - [User Studies](https://huggingface.co/docs/hub/model-cards-user-studies.md) - [Skills](https://huggingface.co/docs/hub/agents-skills.md) - [GGUF usage with GPT4All](https://huggingface.co/docs/hub/gguf-gpt4all.md) - [GGUF](https://huggingface.co/docs/hub/gguf.md) - [Jupyter Notebooks on the Hugging Face Hub](https://huggingface.co/docs/hub/notebooks.md) - [Pull requests and Discussions](https://huggingface.co/docs/hub/repositories-pull-requests-discussions.md) - [JupyterLab on Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker-jupyter.md) - [Perform vector similarity search](https://huggingface.co/docs/hub/datasets-duckdb-vector-similarity-search.md) - [Blog Articles for Organizations](https://huggingface.co/docs/hub/enterprise-blog-articles.md) - [Uploading models](https://huggingface.co/docs/hub/models-uploading.md) - [Appendix](https://huggingface.co/docs/hub/model-card-appendix.md) - [Quickstart](https://huggingface.co/docs/hub/jobs-quickstart.md) - [Image Dataset](https://huggingface.co/docs/hub/datasets-image.md) - [Uploading datasets](https://huggingface.co/docs/hub/datasets-adding.md) - [DDUF](https://huggingface.co/docs/hub/dduf.md) - [Using sample-factory at Hugging Face](https://huggingface.co/docs/hub/sample-factory.md) - [Using Asteroid at Hugging Face](https://huggingface.co/docs/hub/asteroid.md) - 
[Security](https://huggingface.co/docs/hub/security.md) - [Popular Images](https://huggingface.co/docs/hub/jobs-popular-images.md) - [Webhook guide: Setup an automatic system to re-train a model when a dataset changes](https://huggingface.co/docs/hub/webhooks-guide-auto-retrain.md) - [Sign in with Hugging Face](https://huggingface.co/docs/hub/oauth.md) - [Using Spaces for Organization Cards](https://huggingface.co/docs/hub/spaces-organization-cards.md) - [Query datasets](https://huggingface.co/docs/hub/datasets-duckdb-select.md) - [Manual Configuration](https://huggingface.co/docs/hub/datasets-manual-configuration.md) - [Spaces Changelog](https://huggingface.co/docs/hub/spaces-changelog.md) - [Jobs Overview](https://huggingface.co/docs/hub/jobs-overview.md) - [Cookie limitations in Spaces](https://huggingface.co/docs/hub/spaces-cookie-limitations.md) - [Editing datasets](https://huggingface.co/docs/hub/datasets-editing.md) - [Manage Jobs](https://huggingface.co/docs/hub/jobs-manage.md) - [Using Xet Storage](https://huggingface.co/docs/hub/xet/using-xet-storage.md) - [Xet: our Storage Backend](https://huggingface.co/docs/hub/xet/index.md) - [Xet History & Overview](https://huggingface.co/docs/hub/xet/overview.md) - [Deduplication](https://huggingface.co/docs/hub/xet/deduplication.md) - [Backward Compatibility with LFS](https://huggingface.co/docs/hub/xet/legacy-git-lfs.md) - [Security Model](https://huggingface.co/docs/hub/xet/security.md) ### Using Sentence Transformers at Hugging Face https://huggingface.co/docs/hub/sentence-transformers.md # Using Sentence Transformers at Hugging Face `sentence-transformers` is a library that provides easy methods to compute embeddings (dense vector representations) for sentences, paragraphs and images. Texts are embedded in a vector space such that similar text is close, which enables applications such as semantic search, clustering, and retrieval. 
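The idea that "similar text is close" comes down to comparing vectors with a similarity measure such as cosine similarity. Here is a minimal, standard-library-only sketch of that comparison; the 4-dimensional vectors below are made up for illustration (a real `sentence-transformers` model produces vectors with hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two related texts and one unrelated one.
query = [0.9, 0.1, 0.0, 0.2]      # e.g. a question about London
related = [0.8, 0.2, 0.1, 0.1]    # e.g. a passage about London
unrelated = [0.0, 0.1, 0.9, 0.0]  # e.g. a passage about something else

print(cosine_similarity(query, related))    # close to 1: similar texts are close
print(cosine_similarity(query, unrelated))  # close to 0: dissimilar texts are far
```

In semantic search, the query embedding is compared against every passage embedding in this way and the highest-scoring passages are returned.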
## Exploring sentence-transformers in the Hub You can find over 500 `sentence-transformers` models by filtering at the left of the [models page](https://huggingface.co/models?library=sentence-transformers&sort=downloads). Most of these models support different tasks, such as [`feature-extraction`](https://huggingface.co/models?library=sentence-transformers&pipeline_tag=feature-extraction&sort=downloads) to generate the embedding, and [`sentence-similarity`](https://huggingface.co/models?library=sentence-transformers&pipeline_tag=sentence-similarity&sort=downloads) to determine how similar a given sentence is to others. You can also find an overview of the official pre-trained models in [the official docs](https://www.sbert.net/docs/pretrained_models.html). All models on the Hub come with the following features: 1. An automatically generated model card with a description, example code snippets, architecture overview, and more. 2. Metadata tags that help with discoverability and contain information such as the license. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference Providers widget that lets you make inference requests. ## Using existing models The pre-trained models on the Hub can be loaded with a single line of code: ```py from sentence_transformers import SentenceTransformer model = SentenceTransformer('model_name') ``` Here is an example that encodes sentences and then computes the distance between them for doing semantic search. 
```py from sentence_transformers import SentenceTransformer, util model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') query_embedding = model.encode('How big is London') passage_embedding = model.encode(['London has 9,787,426 inhabitants at the 2011 census', 'London is known for its financial district']) print("Similarity:", util.dot_score(query_embedding, passage_embedding)) ``` If you want to see how to load a specific model, you can click `Use in sentence-transformers` and you will be given a working snippet that loads it! ## Sharing your models You can share your Sentence Transformers models by using the `save_to_hub` method on a trained model. ```py from sentence_transformers import SentenceTransformer # Load or train a model, for example: model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') model.save_to_hub("my_new_model") ``` This command creates a repository with an automatically generated model card, an inference widget, example code snippets, and more! [Here](https://huggingface.co/osanseviero/my_new_model) is an example. ## Additional resources * Sentence Transformers [library](https://github.com/UKPLab/sentence-transformers). * Sentence Transformers [docs](https://www.sbert.net/). * Integration with Hub [announcement](https://huggingface.co/blog/sentence-transformers-in-the-hub). ### Giskard on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-giskard.md # Giskard on Spaces **Giskard** is an AI model quality testing toolkit for LLMs, tabular models, and NLP models. It consists of an open-source Python library for scanning and testing AI models and an AI Model Quality Testing app, which can now be deployed using Hugging Face's Docker Spaces. 
Extending the features of the open-source library, the AI Model Quality Testing app enables you to: - Debug tests to diagnose your issues - Create domain-specific tests thanks to automatic model insights - Compare models to decide which model to promote - Collect business feedback on your model results - Share your results with your colleagues for alignment - Store all your QA objects (tests, data slices, evaluation criteria, etc.) in one place to work more efficiently Visit [Giskard's documentation](https://docs.giskard.ai/) and [Quickstart Guides](https://docs.giskard.ai/en/latest/getting_started/quickstart/index.html) to learn how to use the full range of tools provided by Giskard. In the next sections, you'll learn to deploy your own Giskard AI Model Quality Testing app and use it right from Hugging Face Spaces. This Giskard app is a **self-contained application completely hosted on Spaces using Docker**. ## Deploy Giskard on Spaces You can deploy Giskard on Spaces with just a few clicks: > [!WARNING] > IMPORTANT NOTE ABOUT DATA PERSISTENCE: > You can use the Giskard Space as is for initial exploration and experimentation. For **longer use in > small-scale projects, attach a [Storage Bucket](https://huggingface.co/docs/hub/storage-buckets)**. This prevents data loss during Space restarts, which > occur every 24 hours. You need to define the **Owner** (your personal account or an organization), a **Space name**, and the **Visibility**. If you don't want to publicly share your models and quality tests, set your Space to **Private**. Once you have created the Space, you'll see the `Building` status. Once it becomes `Running`, your Space is ready to go. If the status doesn't change on screen, refresh the page. ## Request a free license Once your Giskard Space is up and running, you'll need to request a free license to start using the app. You will then automatically receive an email with the license file. 
## Create a new Giskard project Once inside the app, start by creating a new project from the welcome screen. ## Generate a Hugging Face Giskard Space Token and Giskard API key The Giskard API key is used to establish communication between the environment where your AI models are running and the Giskard app on Hugging Face Spaces. If you've set the **Visibility** of your Space to **Private**, you will need to provide a Hugging Face user access token to generate the Hugging Face Giskard Space Token and establish a communication for access to your private Space. To do so, follow the instructions displayed in the settings page of the Giskard app. ## Start the ML worker Giskard executes your model using a worker that runs the model directly in your Python environment, with all the dependencies required by your model. You can either execute the ML worker: - From your local notebook within the kernel that contains all the dependencies of your model - From Google Colab within the kernel that contains all the dependencies of your model - Or from your terminal within the Python environment that contains all the dependencies of your model Simply run the following command within the Python environment that contains all the dependencies of your model: ```bash giskard worker start -d -k GISKARD-API-KEY -u https://XXX.hf.space --hf-token GISKARD-SPACE-TOKEN ``` ## Upload your test suite, models and datasets In order to start building quality tests for a project, you will need to upload model and dataset objects, and either create or upload a test suite from the Giskard Python library. > [!TIP] > For more information on how to create test suites from Giskard's Python library's automated model scanning tool, head > over to Giskard's [Quickstart Guides](https://docs.giskard.ai/en/latest/getting_started/quickstart/index.html). These actions will all require a connection between your Python environment and the Giskard Space. 
Achieve this by initializing a Giskard Client: simply copy the “Create a Giskard Client” snippet from the settings page of the Giskard app and run it within your Python environment. It will look something like this: ```python from giskard import GiskardClient url = "https://user_name-space_name.hf.space" api_key = "gsk-xxx" hf_token = "xxx" # Create a Giskard client to communicate with the Giskard app client = GiskardClient(url, api_key, hf_token) ``` If you run into issues, head over to Giskard's [upload object documentation page](https://docs.giskard.ai/en/latest/giskard_hub/upload/index.html). ## Feedback and support If you have suggestions or need specific support, please join [Giskard's Discord community](https://discord.com/invite/ABvfpbu69R) or reach out on [Giskard's GitHub repository](https://github.com/Giskard-AI/giskard). ### Embed the Dataset Viewer in a webpage https://huggingface.co/docs/hub/datasets-viewer-embed.md # Embed the Dataset Viewer in a webpage You can embed the Dataset Viewer in your own webpage using an iframe. The URL to use is `https://huggingface.co/datasets/<namespace>/<dataset-name>/embed/viewer`, where `<namespace>` is the owner of the dataset (user or organization) and `<dataset-name>` is the name of the dataset. You can also pass other parameters like the subset, split, filter, search or selected row. For example, the following iframe embeds the Dataset Viewer for the `glue` dataset from the `nyu-mll` organization: ```html <iframe src="https://huggingface.co/datasets/nyu-mll/glue/embed/viewer" frameborder="0" width="100%" height="560px"></iframe> ``` You can also get the embed code directly from the Dataset Viewer interface. Click on the `Embed` button in the top right corner of the Dataset Viewer: It will open a modal with the iframe code that you can copy and paste into your webpage: ## Parameters All the parameters of the dataset viewer page can also be passed to the embedded viewer (filter, search, specific split, etc.) by adding them to the iframe URL. 
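Assembling such a URL by hand is error-prone, so the pattern can be captured in a small helper. The following is a hypothetical convenience function (not part of any Hugging Face library) using only the Python standard library; the `search` query-parameter name mirrors the viewer page URL and is an assumption here:

```python
from urllib.parse import urlencode

def embed_viewer_url(namespace, dataset, subset=None, split=None, **params):
    """Build a Dataset Viewer embed URL for use as an iframe `src`.

    Extra keyword arguments (e.g. search=...) are appended as
    query-string parameters.
    """
    url = f"https://huggingface.co/datasets/{namespace}/{dataset}/embed/viewer"
    if subset:
        url += f"/{subset}"
        if split:
            url += f"/{split}"
    if params:
        url += "?" + urlencode(params)
    return url

print(embed_viewer_url("nyu-mll", "glue"))
# https://huggingface.co/datasets/nyu-mll/glue/embed/viewer
print(embed_viewer_url("nyu-mll", "glue", subset="rte", split="test", search="mangrove"))
# https://huggingface.co/datasets/nyu-mll/glue/embed/viewer/rte/test?search=mangrove
```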
For example, to show the results of the search on `mangrove` in the `test` split of the `rte` subset of the `nyu-mll/glue` dataset, you can use the following URL: ```html <iframe src="https://huggingface.co/datasets/nyu-mll/glue/embed/viewer/rte/test?search=mangrove" frameborder="0" width="100%" height="560px"></iframe> ``` You can get this code directly from the Dataset Viewer interface by performing the search, clicking on the `⋮` button then `Embed`: It will open a modal with the iframe code that you can copy and paste into your webpage: ## Examples The embedded dataset viewer is used in multiple Machine Learning tools and platforms to display datasets. Here are a few examples. Open a [pull request](https://github.com/huggingface/hub-docs/blob/main/docs/hub/datasets-viewer-embed.md) if you want to appear in this section! ### Tool: ZenML [`htahir1`](https://huggingface.co/htahir1) shares a [blog post](https://www.zenml.io/blog/embedding-huggingface-datasets-visualizations-with-zenml) showing how you can use the [ZenML](https://huggingface.co/zenml) integration with the Datasets Viewer to visualize a Hugging Face dataset within a ZenML pipeline. ### Tool: Metaflow + Outerbounds [`eddie-OB`](https://huggingface.co/eddie-OB) shows in a [demo video](https://www.linkedin.com/posts/eddie-mattia_the-team-at-hugging-facerecently-released-activity-7219416449084272641-swIu) how to include the dataset viewer in Metaflow cards on [Outerbounds](https://huggingface.co/outerbounds). ### Tool: AutoTrain [`abhishek`](https://huggingface.co/abhishek) showcases how the dataset viewer is integrated into [AutoTrain](https://huggingface.co/autotrain) in a [demo video](https://x.com/abhi1thakur/status/1813892464144798171). ### Datasets: Alpaca-style datasets gallery [`davanstrien`](https://huggingface.co/davanstrien) showcases the [collection of Alpaca-style datasets](https://huggingface.co/collections/librarian-bots/alpaca-style-datasets-66964d3e490f463859002588) in a [space](https://huggingface.co/spaces/davanstrien/collection_dataset_viewer). 
### Datasets: Docmatix

[`andito`](https://huggingface.co/andito) uses the embedded viewer in the [blog post](https://huggingface.co/blog/docmatix) announcing the release of [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix), a huge dataset for Document Visual Question Answering (DocVQA).

### App: Masader - Arabic NLP data catalogue

[`Zaid`](https://huggingface.co/Zaid) [showcases](https://x.com/zaidalyafeai/status/1815365207775932576) the dataset viewer in [Masader - the Arabic NLP data catalogue](https://arbml.github.io/masader//).

### Using OpenCLIP at Hugging Face

https://huggingface.co/docs/hub/open_clip.md

# Using OpenCLIP at Hugging Face

[OpenCLIP](https://github.com/mlfoundations/open_clip) is an open-source implementation of OpenAI's CLIP.

## Exploring OpenCLIP on the Hub

You can find OpenCLIP models by filtering at the left of the [models page](https://huggingface.co/models?library=open_clip&sort=trending). OpenCLIP models hosted on the Hub have a model card with useful information about the models. Thanks to the OpenCLIP Hugging Face Hub integration, you can load OpenCLIP models with a few lines of code. You can also deploy these models using [Inference Endpoints](https://huggingface.co/inference-endpoints).

## Installation

To get started, you can follow the [OpenCLIP installation guide](https://github.com/mlfoundations/open_clip#usage).
You can also use the following one-line install through pip:

```
$ pip install open_clip_torch
```

## Using existing models

All OpenCLIP models can easily be loaded from the Hub:

```py
import open_clip

model, preprocess = open_clip.create_model_from_pretrained('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-g-14-laion2B-s12B-b42K')
```

Once loaded, you can encode the image and text to do [zero-shot image classification](https://huggingface.co/tasks/zero-shot-image-classification):

```py
import torch
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)

text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

It outputs the probability of each possible class:

```text
Label probs: tensor([[0.0020, 0.0034, 0.9946]])
```

If you want to load a specific OpenCLIP model, you can click `Use in OpenCLIP` in the model card and you will be given a working snippet!

## Additional resources

* OpenCLIP [repository](https://github.com/mlfoundations/open_clip)
* OpenCLIP [docs](https://github.com/mlfoundations/open_clip/tree/main/docs)
* OpenCLIP [models in the Hub](https://huggingface.co/models?library=open_clip&sort=trending)

### Tabby on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-tabby.md

# Tabby on Spaces

[Tabby](https://tabby.tabbyml.com) is an open-source, self-hosted AI coding assistant. With Tabby, every team can set up its own LLM-powered code completion server with ease.
In this guide, you will learn how to deploy your own Tabby instance and use it for development directly from the Hugging Face website.

## Your first Tabby Space

In this section, you will learn how to deploy a Tabby Space and use it for yourself or your organization.

### Deploy Tabby on Spaces

You can deploy Tabby on Spaces with just a few clicks:

[![Deploy on HF Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/deploy-to-spaces-lg.svg)](https://huggingface.co/spaces/TabbyML/tabby-template-space?duplicate=true)

You need to define the Owner (your personal account or an organization), a Space name, and the Visibility. To secure the API endpoint, we're configuring the visibility as Private.

![Duplicate Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/duplicate-space.png)

You'll see the *Building* status. Once it becomes *Running*, your Space is ready to go. If you don't see the Tabby Swagger UI, try refreshing the page.

![Swagger UI](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/swagger-ui.png)

> [!TIP]
> If you want to customize the title, emojis, and colors of your space, go to "Files and Versions" and edit the metadata of your README.md file.

### Your Tabby Space URL

Once Tabby is up and running, for a Space link such as https://huggingface.co/spaces/TabbyML/tabby, the direct URL will be https://tabbyml-tabby.hf.space. This URL provides access to a stable Tabby instance in full-screen mode and serves as the API endpoint for IDE/Editor Extensions to talk with.

### Connect VSCode Extension to Space backend

1. Install the [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=TabbyML.vscode-tabby).
2. Open the file located at `~/.tabby-client/agent/config.toml`. Uncomment both the `[server]` section and the `[server.requestHeaders]` section.
   * Set the endpoint to the Direct URL you found in the previous step, which should look something like `https://UserName-SpaceName.hf.space`.
   * As the Space is set to **Private**, it is essential to configure the authorization header for accessing the endpoint. You can obtain a token from the [Access Tokens](https://huggingface.co/settings/tokens) page.

   ![Agent Config](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/agent-config.png)

3. You'll notice a ✓ icon indicating a successful connection.

   ![Tabby Connected](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/tabby-connected.png)

4. You've completed the setup; now enjoy tabbing!

   ![Code Completion](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/tabby/code-completion.png)

You can also utilize Tabby extensions in other IDEs, such as [JetBrains](https://plugins.jetbrains.com/plugin/22379-tabby).

## Feedback and support

If you have improvement suggestions or need specific support, please join the [Tabby Slack community](https://join.slack.com/t/tabbycommunity/shared_invite/zt-1xeiddizp-bciR2RtFTaJ37RBxr8VxpA) or reach out on [Tabby's GitHub repository](https://github.com/TabbyML/tabby).

### ChatUI on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-chatui.md

# ChatUI on Spaces

**HuggingChat** is an open-source interface enabling everyone to try open-source large language models such as Falcon, StarCoder, and BLOOM. Thanks to an official Docker template called ChatUI, you can deploy your own HuggingChat based on a model of your choice with a few clicks using Hugging Face's infrastructure.

## Deploy your own Chat UI

To get started, simply head [here](https://huggingface.co/new-space?template=huggingchat/chat-ui-template). In the backend of this application, [text-generation-inference](https://github.com/huggingface/text-generation-inference) is used for optimized model inference.
Since these models can't run on CPUs, you can select the GPU depending on your choice of model. You should provide a MongoDB endpoint where your chats will be written. If you leave this section blank, your logs will be persisted to a database inside the Space. Note that Hugging Face does not have access to your chats.

You can configure the name and the theme of the Space by providing the application name and application color parameters. Below this, you can select the Hugging Face Hub ID of the model you wish to serve. You can also change the generation hyperparameters in the dictionary below in JSON format.

_Note_: If you'd like to deploy a model with gated access or a model in a private repository, you can simply provide `HF_TOKEN` in repository secrets. You need to set its value to an access token you can get from [here](https://huggingface.co/settings/tokens).

Once the creation is complete, you will see `Building` on your Space. Once built, you can try your own HuggingChat! Start chatting!

## Read more

- [HF Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker)
- [chat-ui GitHub Repository](https://github.com/huggingface/chat-ui)
- [text-generation-inference GitHub repository](https://github.com/huggingface/text-generation-inference)

### Pricing and Billing

https://huggingface.co/docs/hub/jobs-pricing.md

# Pricing and Billing

Hugging Face Jobs let you run compute tasks on Hugging Face infrastructure without managing it yourself. Simply define a command, a Docker image, and a hardware flavor among various CPU and GPU options.

> [!TIP]
> Jobs are available to any user or organization with [pre-paid credits](https://huggingface.co/pricing).

Billing on Jobs is based on hardware usage and is computed by the minute: you get charged for every minute the Job runs on the requested hardware. During a Job's lifecycle, it is only billed when the Job is Starting or Running. This means that there is no cost during build.
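Because billing is per minute at the hardware's hourly rate, estimating a Job's cost is simple arithmetic. A minimal sketch, using hourly prices from the tables in this section:

```python
def job_cost(hourly_price_usd: float, runtime_minutes: float) -> float:
    """Estimate Job cost: billed per minute at the hardware's hourly rate."""
    return hourly_price_usd / 60 * runtime_minutes

# e.g. 90 minutes on an Nvidia T4 - small ($0.40/hour)
print(round(job_cost(0.40, 90), 2))  # 0.6
```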
If a running Job starts to fail, it will be automatically suspended and the billing will stop.

## Pricing

Jobs are billed per minute based on the hardware used. Below are the available hardware options and their pricing.

### CPU

| **Hardware**    | **CPU**  | **Memory** | **Hourly Price** |
|-----------------|----------|------------|------------------|
| CPU Basic       | 2 vCPU   | 16 GB      | $0.01            |
| CPU Upgrade     | 8 vCPU   | 32 GB      | $0.03            |
| CPU XL          | 16 vCPU  | 124 GB     | $1.00            |
| CPU Performance | 32 vCPU  | 256 GB     | $1.90            |

### GPU

| **Hardware**           | **CPU**  | **Memory** | **GPU Memory** | **Hourly Price** |
|------------------------|----------|------------|----------------|------------------|
| Nvidia T4 - small      | 4 vCPU   | 15 GB      | 16 GB          | $0.40            |
| Nvidia T4 - medium     | 8 vCPU   | 30 GB      | 16 GB          | $0.60            |
| 1x Nvidia L4           | 8 vCPU   | 30 GB      | 24 GB          | $0.80            |
| 4x Nvidia L4           | 48 vCPU  | 186 GB     | 96 GB          | $3.80            |
| 1x Nvidia L40S         | 8 vCPU   | 62 GB      | 48 GB          | $1.80            |
| 4x Nvidia L40S         | 48 vCPU  | 382 GB     | 192 GB         | $8.30            |
| 8x Nvidia L40S         | 192 vCPU | 1534 GB    | 384 GB         | $23.50           |
| Nvidia A10G - small    | 4 vCPU   | 15 GB      | 24 GB          | $1.00            |
| Nvidia A10G - large    | 12 vCPU  | 46 GB      | 24 GB          | $1.50            |
| 2x Nvidia A10G - large | 24 vCPU  | 92 GB      | 48 GB          | $3.00            |
| 4x Nvidia A10G - large | 48 vCPU  | 184 GB     | 96 GB          | $5.00            |
| Nvidia A100 - large    | 12 vCPU  | 142 GB     | 80 GB          | $2.50            |
| 4x Nvidia A100 - large | 48 vCPU  | 568 GB     | 320 GB         | $10.00           |
| 8x Nvidia A100 - large | 96 vCPU  | 1136 GB    | 640 GB         | $20.00           |
| Nvidia H200            | 23 vCPU  | 256 GB     | 141 GB         | $5.00            |
| 2x Nvidia H200         | 46 vCPU  | 512 GB     | 282 GB         | $10.00           |
| 4x Nvidia H200         | 92 vCPU  | 1024 GB    | 564 GB         | $20.00           |
| 8x Nvidia H200         | 184 vCPU | 2048 GB    | 1128 GB        | $40.00           |

You can also retrieve available hardware and pricing programmatically via the API at `GET /api/jobs/hardware` or via the CLI:

```bash
hf jobs hardware
```

## Manage billing

### Bill to your organization

Billing is done to the user's namespace by default,
but you can bill to your organization instead by specifying the right `namespace`:

```bash
hf jobs run --namespace my-org-name ...
```

In this case the Job runs under the organization account, and you can see it in your organization's Jobs page (organization page > settings > Jobs).

### View current compute usage

You can look at your current billing information for Jobs in your [Billing](https://huggingface.co/settings/billing) page, under the "Compute Usage" section.

Additional information about billing can be found in the [dedicated Hub documentation](https://huggingface.co/docs/hub/en/billing).

### Recommendations

#### Set timeout limits

Set a `timeout` when creating the Job to ensure it can't run beyond a certain duration. A Job run that reaches the `timeout` duration is automatically stopped, and so is its billing. Here is how to set a timeout with the CLI:

```bash
hf jobs run --timeout 3h ...
```

Note that the default timeout is set to **30 minutes**. You must therefore specify a longer timeout if your Job requires more time to run.

#### Cancel irrelevant Jobs

If a running Job is no longer relevant, you can cancel it prematurely to stop its billing, either via the Job page or the CLI:

```bash
hf jobs cancel <job-id>
```

### Displaying carbon emissions for your model

https://huggingface.co/docs/hub/model-cards-co2.md

# Displaying carbon emissions for your model

## Why is it beneficial to calculate the carbon emissions of my model?

Training ML models is often energy-intensive and can produce a substantial carbon footprint, as described by [Strubell et al.](https://arxiv.org/abs/1906.02243). It's therefore important to *track* and *report* the emissions of models to get a better idea of the environmental impacts of our field.

## What information should I include about the carbon footprint of my model?

If you can, you should include information about:

- where the model was trained (in terms of location)
- the hardware used -- e.g.
GPU, TPU, or CPU, and how many
- training type: pre-training or fine-tuning
- the estimated carbon footprint of the model, calculated in real-time with the [Code Carbon](https://github.com/mlco2/codecarbon) package or after training using the [ML CO2 Calculator](https://mlco2.github.io/impact/)

## Carbon footprint metadata

You can add the carbon footprint data to the model card metadata (in the README.md file). The structure of the metadata should be:

```yaml
---
co2_eq_emissions:
  emissions: number (in grams of CO2)
  source: "source of the information, either directly from AutoTrain, code carbon or from a scientific article documenting the model"
  training_type: "pre-training or fine-tuning"
  geographical_location: "as granular as possible, for instance Quebec, Canada or Brooklyn, NY, USA. To check your compute's electricity grid, you can check out https://app.electricitymap.org."
  hardware_used: "how much compute and what kind, e.g. 8 v100 GPUs"
---
```

## How is the carbon footprint of my model calculated? 🌎

Considering the computing hardware, location, usage, and training time, you can estimate how much CO2 the model produced. The math is pretty simple! ➕

First, you take the *carbon intensity* of the electric grid used for the training -- this is how much CO2 is produced per kWh of electricity used. The carbon intensity depends on the location of the hardware and the [energy mix](https://electricitymap.org/) used at that location -- whether it's renewable energy like solar 🌞, wind 🌬️ and hydro 💧, or non-renewable energy like coal ⚫ and natural gas 💨. The more renewable energy gets used for training, the less carbon-intensive it is!

Then, you take the power consumption of the GPUs during training using the `pynvml` library. Finally, you multiply the power consumption and carbon intensity by the training time of the model, and you have an estimate of the CO2 emission.
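The multiplication above can be written out as a few lines of Python. This is only a back-of-the-envelope sketch with illustrative numbers, not the exact methodology of Code Carbon or the ML CO2 Calculator:

```python
def co2_grams(power_watts: float, hours: float, grid_intensity_g_per_kwh: float) -> float:
    """CO2 estimate: energy used (kWh) times the grid's carbon intensity (g CO2eq per kWh)."""
    energy_kwh = power_watts * hours / 1000
    return energy_kwh * grid_intensity_g_per_kwh

# e.g. a 300 W GPU running for 24 h on a 400 gCO2/kWh grid
print(co2_grams(300, 24, 400))  # 2880.0 grams, i.e. ~2.9 kg CO2eq
```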
Keep in mind that this isn't an exact number because other factors come into play -- like the energy used for data center heating and cooling -- which will increase carbon emissions. But this will give you a good idea of the scale of CO2 emissions that your model is producing!

To add **Carbon Emissions** metadata to your models:

1. If you are using **AutoTrain**, this is tracked for you 🔥
2. Otherwise, use a tracker like Code Carbon in your training code, then specify

```yaml
co2_eq_emissions:
  emissions: 1.2345
```

in your model card metadata, where `1.2345` is the emissions value in **grams**.

To learn more about the carbon footprint of Transformers, check out the [video](https://www.youtube.com/watch?v=ftWlj4FBHTg), part of the Hugging Face Course!

### Spaces ZeroGPU: Dynamic GPU Allocation for Spaces

https://huggingface.co/docs/hub/spaces-zerogpu.md

# Spaces ZeroGPU: Dynamic GPU Allocation for Spaces

ZeroGPU is a shared infrastructure that optimizes GPU usage for AI models and demos on Hugging Face Spaces. It dynamically allocates and releases NVIDIA H200 GPUs as needed, offering:

1. **Free GPU Access**: Enables cost-effective GPU usage for Spaces.
2. **Multi-GPU Support**: Allows Spaces to leverage multiple GPUs concurrently on a single application.

Unlike traditional single-GPU allocations, ZeroGPU's efficient system lowers barriers for developers, researchers, and organizations to deploy AI models by maximizing resource utilization and power efficiency.

## Using and hosting ZeroGPU Spaces

- **Using existing ZeroGPU Spaces**
  - ZeroGPU Spaces are available to use for free to all users. (Visit [the curated list](https://huggingface.co/spaces/enzostvs/zero-gpu-spaces)).
  - [PRO users](https://huggingface.co/subscribe/pro) get 8x more daily usage quota, highest priority in GPU queues, and can go beyond their daily quota using pre-paid credits when using any ZeroGPU Spaces.
- **Hosting your own ZeroGPU Spaces**
  - Personal accounts: [Subscribe to PRO](https://huggingface.co/settings/billing/subscription) to access ZeroGPU in the hardware options when creating a new Gradio SDK Space.
  - Organizations: [Subscribe to a Team or Enterprise plan](https://huggingface.co/enterprise) to enable ZeroGPU Spaces for all organization members.

## Technical Specifications

ZeroGPU supports two GPU sizes:

| GPU size            | Backing hardware | VRAM  | Quota cost |
|---------------------|------------------|-------|------------|
| `large` *(default)* | Half NVIDIA H200 | 70GB  | 1×         |
| `xlarge`            | Full NVIDIA H200 | 141GB | 2×         |

> [!NOTE]
> See [GPU size selection](#gpu-size-selection) to learn how to use sizes

## Compatibility

ZeroGPU Spaces are designed to be compatible with most PyTorch-based GPU Spaces. While compatibility is enhanced for high-level Hugging Face libraries like `transformers` and `diffusers`, users should be aware that:

- Currently, ZeroGPU Spaces are exclusively compatible with the **Gradio SDK**.
- ZeroGPU Spaces may have limited compatibility compared to standard GPU Spaces.
- Unexpected issues may arise in some scenarios.

### Supported Versions

- **Gradio**: 4+
- **PyTorch**: Almost all versions from **2.1.0** to **latest** are supported. See the full list:
  - 2.1.0
  - 2.1.1
  - 2.1.2
  - 2.2.0
  - 2.2.2
  - 2.4.0
  - 2.5.1
  - 2.6.0
  - 2.7.1
  - 2.8.0
  - 2.9.1
- **Python**:
  - 3.12.12
  - 3.10.13

## Getting started with ZeroGPU

To utilize ZeroGPU in your Space, follow these steps:

1. Make sure the ZeroGPU hardware is selected in your Space settings.
2. Import the `spaces` module.
3. Decorate GPU-dependent functions with `@spaces.GPU`.

This decoration process allows the Space to request a GPU when the function is called and release it upon completion.

> [!NOTE]
> The `@spaces.GPU` decorator is designed to be effect-free in non-ZeroGPU environments, ensuring compatibility across different setups.
### Example Usage

```python
import gradio as gr
import spaces
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(...)
pipe.to('cuda')

@spaces.GPU
def generate(prompt):
    return pipe(prompt).images

gr.Interface(
    fn=generate,
    inputs=gr.Text(),
    outputs=gr.Gallery(),
).launch()
```

### Model loading

Even though a real GPU is only available inside `@spaces.GPU` functions, models must be placed on `cuda` at the root module level (as shown in the example above). Lazy-loading or moving models to CUDA inside `@spaces.GPU` is discouraged, as it is significantly less efficient (CUDA transfers are optimized for placements done during startup).

> [!NOTE]
> Loading models on `cuda` at module level works because a PyTorch CUDA emulation mode is enabled outside `@spaces.GPU` functions, allowing CUDA operations without a real GPU. Inside `@spaces.GPU`, real CUDA is used.

## GPU size selection

The default size used by `@spaces.GPU` is `large` (half H200). You can explicitly request a full H200 by specifying `size="xlarge"`:

```python
@spaces.GPU(size="xlarge")
def generate(prompt):
    return pipe(prompt).images
```

> [!NOTE]
> - `xlarge` consumes **2×** more daily quota than `large` (e.g. a 45s **effective** task duration consumes 90s of quota)
> - `xlarge` usually means higher queuing probability and longer wait times
> - Only use `xlarge` when your workload truly benefits from the additional compute or memory

## Duration Management

For functions expected to exceed the default 60 seconds of GPU runtime, you can specify a custom duration:

```python
@spaces.GPU(duration=120)
def generate(prompt):
    return pipe(prompt).images
```

This sets the maximum function runtime to 120 seconds. Specifying shorter durations for quicker functions will improve queue priority for Space visitors.

### Dynamic duration

`@spaces.GPU` also supports dynamic durations.
Instead of directly passing a duration, simply pass a callable that takes the same inputs as your decorated function and returns a duration value:

```python
def get_duration(prompt, steps):
    step_duration = 3.75
    return steps * step_duration

@spaces.GPU(duration=get_duration)
def generate(prompt, steps):
    return pipe(prompt, num_inference_steps=steps).images
```

## Compilation

ZeroGPU does not support `torch.compile`, but you can use PyTorch **ahead-of-time** compilation (requires torch `2.8+`). Check out this [blog post](https://huggingface.co/blog/zerogpu-aoti) for a complete guide on ahead-of-time compilation on ZeroGPU.

## Usage Tiers

GPU usage is subject to **daily** quotas, per account tier:

| Account type                   | Included daily GPU quota | Queue priority |
|--------------------------------|--------------------------|----------------|
| Unauthenticated                | 2 minutes                | Low            |
| Free account                   | 3.5 minutes              | Medium         |
| PRO account                    | 25 minutes (extensible)  | Highest        |
| Team organization member       | 25 minutes (extensible)  | Highest        |
| Enterprise organization member | 45 minutes (extensible)  | Highest        |

Included daily quota resets exactly 24 hours after your first GPU usage.

> [!NOTE]
> Remaining quota directly impacts priority in ZeroGPU queues.

### Extending quota with credits

PRO, Team, and Enterprise users can continue using ZeroGPU Spaces beyond their included daily quota by consuming pre-paid credits at the rate of **$1 per 10 minutes** of GPU time. Once your daily quota is exhausted, any additional GPU usage is automatically billed against your credit balance. You can add credits from your [billing settings](https://huggingface.co/settings/billing).

## Hosting Limitations

- **Personal accounts ([PRO subscribers](https://huggingface.co/subscribe/pro))**: Maximum of 10 ZeroGPU Spaces.
- **Organization accounts ([Team & Enterprise](https://huggingface.co/enterprise))**: Maximum of 50 ZeroGPU Spaces.
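The credit rate above is easy to turn into a quick estimate. A minimal sketch, assuming (this part is an assumption, not stated in the docs) that the 2× `xlarge` quota multiplier also applies to credit-billed usage:

```python
def credit_cost_usd(gpu_seconds: float, size: str = "large") -> float:
    """Estimate credits consumed beyond the daily quota: $1 per 10 minutes of GPU time.
    The 2x multiplier for `xlarge` is an assumption carried over from quota accounting."""
    multiplier = 2 if size == "xlarge" else 1
    minutes = gpu_seconds * multiplier / 60
    return minutes / 10 * 1.0

# e.g. 5 extra minutes on the default `large` size
print(credit_cost_usd(300))  # 0.5
```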
By leveraging ZeroGPU, developers can create more efficient and scalable Spaces, maximizing GPU utilization while minimizing costs.

## Recommendations

If your demo uses a large model, we recommend using optimizations like ahead-of-time compilation and flash-attention 3. You can learn how to leverage these with ZeroGPU in [this post](https://huggingface.co/blog/zerogpu-aoti). These optimizations will help you maximize the advantages of your ZeroGPU hours and provide a better user experience.

## Feedback

You can share your feedback on Spaces ZeroGPU directly on the HF Hub: https://huggingface.co/spaces/zero-gpu-explorers/README/discussions

### Webhooks Automation

https://huggingface.co/docs/hub/jobs-webhooks.md

# Webhooks Automation

Webhooks allow you to listen for new changes on specific repositories or to all repositories belonging to a particular set of users/organizations (not just your repos, but any repo) on Hugging Face.

Use `create_webhook` in the `huggingface_hub` Python client to create a webhook that triggers a Job when a change happens in a Hugging Face repository:

```python
from huggingface_hub import create_webhook

# Example: Creating a webhook that triggers a Job
webhook = create_webhook(
    job_id=job_id,
    watched=[{"type": "user", "name": "your-username"}, {"type": "org", "name": "your-org-name"}],
    domains=["repo", "discussion"],
    secret="your-secret"
)
```

The webhook triggers the Job with the following environment variables:

- `WEBHOOK_PAYLOAD`: the full webhook payload as a JSON string
- `WEBHOOK_REPO_ID`: the repository name (e.g., `user/repo-name`)
- `WEBHOOK_REPO_TYPE`: the repository type (`model`, `dataset`, or `space`)
- `WEBHOOK_SECRET`: the webhook secret, if one was configured

The webhook payload contains multiple fields; here are a few useful ones:

```
- event:
  - action: one of "create", "delete", "move", "update"
  - scope: string
- repo:
  - owner: string
  - headSha: string
  - name: string
  - type: one of "dataset", "model", "space"
```

You can find
more information on webhooks in the [`huggingface_hub` Webhooks documentation](https://huggingface.co/docs/huggingface_hub/en/guides/webhooks).

### Using mlx-image at Hugging Face

https://huggingface.co/docs/hub/mlx-image.md

# Using mlx-image at Hugging Face

[`mlx-image`](https://github.com/riccardomusmeci/mlx-image) is an image models library developed by [Riccardo Musmeci](https://github.com/riccardomusmeci) built on Apple [MLX](https://github.com/ml-explore/mlx). It tries to replicate the great [timm](https://github.com/huggingface/pytorch-image-models), but for MLX models.

## Exploring mlx-image on the Hub

You can find `mlx-image` models by filtering using the `mlx-image` library name, like in [this query](https://huggingface.co/models?library=mlx-image&sort=trending). There's also an open [mlx-vision](https://huggingface.co/mlx-vision) community for contributors converting and publishing weights in MLX format.

## Installation

```bash
pip install mlx-image
```

## Models

Model weights are available on the [`mlx-vision`](https://huggingface.co/mlx-vision) community on HuggingFace.

To load a model with pre-trained weights:

```python
from mlxim.model import create_model

# loading weights from HuggingFace (https://huggingface.co/mlx-vision/resnet18-mlxim)
model = create_model("resnet18") # pretrained weights loaded from HF

# loading weights from local file
model = create_model("resnet18", weights="path/to/resnet18/model.safetensors")
```

To list all available models:

```python
from mlxim.model import list_models
list_models()
```

## ImageNet-1K Results

Go to [results-imagenet-1k.csv](https://github.com/riccardomusmeci/mlx-image/blob/main/results/results-imagenet-1k.csv) to check every model converted to `mlx-image` and its performance on ImageNet-1K with different settings.

> **TL;DR** performance is comparable to the original models from PyTorch implementations.
## Similarity to PyTorch and other familiar tools

`mlx-image` tries to be as close as possible to PyTorch:

- `DataLoader` -> you can define your own `collate_fn` and also use `num_workers` to speed up data loading
- `Dataset` -> `mlx-image` already supports `LabelFolderDataset` (the good and old PyTorch `ImageFolder`) and `FolderDataset` (a generic folder with images in it)
- `ModelCheckpoint` -> keeps track of the best model and saves it to disk (similar to PyTorchLightning). It also suggests early stopping

## Training

Training is similar to PyTorch. Here's an example of how to train a model:

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlxim.model import create_model
from mlxim.data import LabelFolderDataset, DataLoader

train_dataset = LabelFolderDataset(
    root_dir="path/to/train",
    class_map={0: "class_0", 1: "class_1", 2: ["class_2", "class_3"]}
)
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)
model = create_model("resnet18") # pretrained weights loaded from HF
optimizer = optim.Adam(learning_rate=1e-3)

def train_step(model, inputs, targets):
    logits = model(inputs)
    loss = mx.mean(nn.losses.cross_entropy(logits, targets))
    return loss

model.train()
for epoch in range(10):
    for batch in train_loader:
        x, target = batch
        train_step_fn = nn.value_and_grad(model, train_step)
        loss, grads = train_step_fn(x, target)
        optimizer.update(model, grads)
        mx.eval(model.state, optimizer.state)
```

## Additional Resources

* [mlx-image repository](https://github.com/riccardomusmeci/mlx-image)
* [mlx-vision community](https://huggingface.co/mlx-vision)

## Contact

If you have any questions, please email `riccardomusmeci92@gmail.com`.

### Panel on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-panel.md

# Panel on Spaces

[Panel](https://panel.holoviz.org/) is an open-source Python library that lets you easily build powerful tools, dashboards and complex applications entirely in Python.
It has a batteries-included philosophy, putting the PyData ecosystem, powerful data tables and much more at your fingertips. High-level reactive APIs and lower-level callback-based APIs ensure you can quickly build exploratory applications, but you aren't limited if you build complex, multi-page apps with rich interactivity. Panel is a member of the [HoloViz](https://holoviz.org/) ecosystem, your gateway into a connected ecosystem of data exploration tools. Visit [Panel documentation](https://panel.holoviz.org/) to learn more about making powerful applications.

## 🚀 Deploy Panel on Spaces

You can deploy Panel on Spaces with just a few clicks:

There are a few key parameters you need to define: the Owner (either your personal account or an organization), a Space name, and Visibility. In case you intend to execute computationally intensive deep learning models, consider upgrading to a GPU to boost performance.

Once you have created the Space, it will start out in "Building" status, which will change to "Running" once your Space is ready to go.

## ⚡️ What will you see?

When your Space is built and ready, you will see this image classification Panel app which will let you fetch a random image and run the OpenAI CLIP classifier model on it. Check out our [blog post](https://blog.holoviz.org/building_an_interactive_ml_dashboard_in_panel.html) for a walkthrough of this app.

## 🛠️ How to customize and make your own app?

The Space template will populate a few files to get your app started. Three files are important:

### 1. app.py

This file defines your Panel application code. You can start by modifying the existing application or replace it entirely to build your own application. To learn more about writing your own Panel app, refer to the [Panel documentation](https://panel.holoviz.org/).

### 2.
Dockerfile

The Dockerfile contains a sequence of commands that Docker will execute to construct and launch an image as a container that your Panel app will run in. Typically, to serve a Panel app, we use the command `panel serve app.py`. In this specific file, we divide the command into a list of strings. Furthermore, we must define the address and port, because Hugging Face expects to serve your application on port 7860. Additionally, we need to specify the `allow-websocket-origin` flag to enable the connection to the server's websocket.

### 3. requirements.txt

This file defines the required packages for our Panel app. When using Spaces, dependencies listed in the requirements.txt file will be automatically installed. You are free to modify this file by removing unnecessary packages or adding additional ones required for your application.

## 🌍 Join Our Community

The Panel community is vibrant and supportive, with experienced developers and data scientists eager to help and share their knowledge. Join us and connect with us:

- [Discord](https://discord.gg/aRFhC3Dz9w)
- [Discourse](https://discourse.holoviz.org/)
- [Twitter](https://twitter.com/Panel_Org)
- [LinkedIn](https://www.linkedin.com/company/panel-org)
- [Github](https://github.com/holoviz/panel)

### Access Patterns

https://huggingface.co/docs/hub/storage-buckets-access.md

# Access Patterns

Beyond the [CLI and Python SDK](./storage-buckets#managing-files), there are several ways to access bucket data from your existing tools and workflows.
## Choosing an Access Method

| Method | Best for | Details |
|--------|----------|---------|
| **hf-mount** | Mount as local filesystem - any tool works | [See below](#mount-as-a-local-filesystem) |
| **Volume mounts** | HF Jobs & Spaces (same idea, managed for you) | [See below](#volume-mounts-in-jobs-and-spaces) |
| **hf:// paths** (fsspec) | Python data tools (pandas, DuckDB) | [See below](#python-data-tools) |
| **CLI sync** | Batch transfers, backups | [Sync docs](./storage-buckets#syncing-directories) |

Access through the S3 API is not currently supported, but is on the roadmap.

## Mount as a Local Filesystem

[hf-mount](https://github.com/huggingface/hf-mount) lets you mount buckets (and repos) as local filesystems via NFS (recommended) or FUSE. Files are fetched lazily: only the bytes your code reads hit the network.

Install:

```bash
curl -fsSL https://raw.githubusercontent.com/huggingface/hf-mount/main/install.sh | sh
```

Mount a bucket:

```bash
hf-mount start bucket username/my-bucket /mnt/data
```

Once mounted, any tool that reads or writes files works with your bucket: pandas, DuckDB, vLLM, training scripts, shell commands, etc.

> [!TIP]
> Buckets are mounted read-write; repos are read-only.

See the [hf-mount repository](https://github.com/huggingface/hf-mount) for full documentation including backend options, caching, and write modes.

## Volume Mounts in Jobs and Spaces

Volume mounts in [Jobs](./jobs) and [Spaces](./spaces) are the same idea as `hf-mount`, managed for you by the platform with no extra setup needed. Buckets are mounted read-write by default.

```bash
hf jobs run -v hf://buckets/username/my-bucket:/data python:3.12 python script.py
```

For the full volume mount syntax and Python API, see the [Jobs configuration docs](./jobs-configuration#volumes) and the [Spaces volume mount guide](/docs/huggingface_hub/guides/manage-spaces#mount-volumes-in-your-space).
## Python Data Tools

The [`HfFileSystem`](/docs/huggingface_hub/guides/hf_file_system) provides [fsspec](https://filesystem-spec.readthedocs.io)-compatible access to buckets using `hf://buckets/` paths. Any Python library that supports fsspec can read and write bucket data directly.

**pandas:**

```python
import pandas as pd

df = pd.read_parquet("hf://buckets/username/my-bucket/data.parquet")
df.to_parquet("hf://buckets/username/my-bucket/output.parquet")
```

**DuckDB** (Python client):

```python
import duckdb
from huggingface_hub import HfFileSystem

duckdb.register_filesystem(HfFileSystem())
duckdb.sql("SELECT * FROM 'hf://buckets/username/my-bucket/data.parquet' LIMIT 10")
```

For more on `hf://` paths and supported operations, see the [`HfFileSystem` guide](/docs/huggingface_hub/guides/hf_file_system) and the [Buckets Python guide](/docs/huggingface_hub/guides/buckets).

### Dash on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-dash.md

# Dash on Spaces

With Dash Open Source, you can create data apps on your laptop in pure Python, no JavaScript required. Get familiar with Dash by building a [sample app](https://dash.plotly.com/tutorial) with open source. Scale up with [Dash Enterprise](https://plotly.com/dash/) when your Dash app is ready for department or company-wide consumption. Or, launch your initiative with Dash Enterprise from the start to unlock developer productivity gains and hands-on acceleration from Plotly's team.

## Deploy Dash on Spaces

To get started with Dash on Spaces, click the button below. This will start building your Space using Plotly's Dash Docker template. If successful, you should see an application similar to the [Dash template app](https://huggingface.co/spaces/dash/dash-app-template).

## Customizing your Dash app

If you have never built with Dash before, we recommend getting started with our [Dash in 20 minutes tutorial](https://dash.plotly.com/tutorial).
When you create a Dash Space, you'll get a few key files to help you get started:

### 1. app.py

This is the main app file that defines the core logic of your project. Dash apps are often structured as modules, and you can optionally separate your layout, callbacks, and data into other files, like `layout.py`, etc. Inside of `app.py` you will see:

1. `from dash import Dash, html`

   We import the `Dash` object to define our app, and the `html` library, which gives us building blocks to assemble our project.

2. `app = Dash()`

   Here, we define our app. Layout, server, and callbacks are _bound_ to the `app` object.

3. `server = app.server`

   Here, we define our server variable, which is used to run the app in production.

4. `app.layout = `

   The starter app layout is defined as a list of Dash components, an individual Dash component, or a function that returns either. The `app.layout` is your initial layout that will be updated as a single-page application by callbacks and other logic in your project.

5. `if __name__ == '__main__': app.run(debug=True)`

   If you are running your project locally with `python app.py`, `app.run(...)` will execute and start up a development server to work on your project, with features including hot reloading, the callback graph, and more. In production, we recommend `gunicorn`, which is a production-grade server. Debug features will not be enabled when running your project with `gunicorn`, so this line will never be reached.

### 2. Dockerfile

The Dockerfile for a Dash app is minimal since Dash has few system dependencies. The key requirements are:

- It installs the dependencies listed in `requirements.txt` (using `uv`)
- It creates a non-root user for security
- It runs the app with `gunicorn` using `gunicorn app:server --workers 4`

You may need to modify this file if your application requires additional system dependencies, permissions, or other CLI flags.

### 3. requirements.txt

The Space will automatically install dependencies listed in the `requirements.txt` file. At minimum, you must include `dash` and `gunicorn` in this file. Add any other packages your app requires. The Dash Space template provides a basic setup that you can extend based on your needs.

## Additional Resources and Support

- [Dash documentation](https://dash.plotly.com)
- [Dash GitHub repository](https://github.com/plotly/dash)
- [Dash Community Forums](https://community.plotly.com)
- [Dash Enterprise](https://plotly.com/dash)
- [Dash template Space](https://huggingface.co/spaces/plotly/dash-app-template)

## Troubleshooting

If you encounter issues:

1. Make sure your app runs locally using `python app.py`
2. Check that all required packages are listed in `requirements.txt`
3. Verify the port configuration matches (7860 is the default for Spaces)
4. Check Space logs for any Python errors

For more help, visit the [Plotly Community Forums](https://community.plotly.com) or [open an issue](https://github.com/plotly/dash/issues).

### Audit Logs

https://huggingface.co/docs/hub/audit-logs.md

# Audit Logs

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Audit Logs enable organization admins to easily review actions taken by members, including organization membership, repository settings and billing changes.

## Accessing Audit Logs

Audit Logs are accessible through your organization settings. Each log entry includes:

- Who performed the action
- What type of action was taken
- A description of the change
- Location and anonymized IP address
- Date and time of the action

You can also download the complete audit log as a JSON file for further analysis.

## What Events Are Tracked?

Each action has an **event name** in `scope.action` format (e.g. `repo.create`, `collection.delete`). This is the `type` field in each log entry and in the exported JSON; use it when searching or filtering logs.
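As an illustration, filtering an exported audit log by the `type` field can be done with a few lines of Python. The entry shape below is simplified; field names other than `type` (such as `user`) are assumptions for the sake of the example:

```python
import json

# A simplified excerpt of an exported audit log (illustrative entries only).
export = json.loads("""
[
  {"type": "repo.create", "user": "alice"},
  {"type": "org.add_user", "user": "bob"},
  {"type": "repo.delete", "user": "alice"}
]
""")

def filter_events(entries, scope=None, action=None):
    """Filter audit-log entries by the `scope.action` event name."""
    selected = []
    for entry in entries:
        event_scope, _, event_action = entry["type"].partition(".")
        if scope and event_scope != scope:
            continue
        if action and event_action != action:
            continue
        selected.append(entry)
    return selected

repo_events = filter_events(export, scope="repo")
print([e["type"] for e in repo_events])  # → ['repo.create', 'repo.delete']
```

The same `scope.action` split works for nested event names like `org.token_approval.enabled`, since `partition` only splits on the first dot.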
### Organization Management & Security

- **Core organization changes** - Creation, deletion, restoration, renaming, and profile/settings updates.
  - **Events:** `org.create`, `org.delete`, `org.restore`, `org.rename`, `org.update_settings`
- **Security management**
  - Organization API token rotation.
    - **Event:** `org.rotate_token`
  - Token approval system - Enabling or disabling the policy, authorization requests, approvals, denials, and revocations.
    - **Events:** `org.token_approval.enabled`, `org.token_approval.disabled`, `org.token_approval.authorization_request`, `org.token_approval.authorization_request.authorized`, `org.token_approval.authorization_request.revoked`, `org.token_approval.authorization_request.denied`
  - SSO - Logins and joins via SSO.
    - **Events:** `org.sso_login`, `org.sso_join`
- **Join settings** - Domain-based access and automatic join configuration.
  - **Event:** `org.update_join_settings`

### Membership and Access Control

- **Member lifecycle** - Adding and removing members, role changes, and members leaving the organization.
  - **Events:** `org.add_user`, `org.remove_user`, `org.change_role`, `org.leave`
- **Invitations** - Sending invites, invitation links by email, and users accepting invites.
  - **Events:** `org.invite_user`, `org.invite.accept`, `org.invite.email`
- **Automatic joins** - Joins via verified email domain or "request access".
  - **Events:** `org.join.from_domain`, `org.join.automatic`

### Content and Resource Management

- **Repository administration** - Creation, deletion, moving, disabling/re-enabling, duplication settings, DOI removal, resource group assignment, and general repo settings (visibility, gating, discussions, etc.). Also LFS file deletion.
  - **Events:** `repo.create`, `repo.delete`, `repo.move`, `repo.disable`, `repo.removeDisable`, `repo.duplication`, `repo.delete_doi`, `repo.update_resource_group`, `repo.update_settings`, `repo.delete_lfs_file`
- **Collections** - Creation and deletion of collections.
  - **Events:** `collection.create`, `collection.delete`
- **Repository security** - Secrets and variables (individual and bulk add/update/remove).
  - **Events (secrets):** `repo.add_secret`, `repo.update_secret`, `repo.remove_secret`, `repo.add_secrets`, `repo.remove_secrets`
  - **Events (variables):** `repo.add_variable`, `repo.update_variable`, `repo.remove_variable`, `repo.add_variables`, `repo.remove_variables`
- **Spaces configuration** - Storage tier changes, hardware (flavor) updates, and sleep time adjustments.
  - **Events:** `spaces.add_storage`, `spaces.remove_storage`, `spaces.update_hardware`, `spaces.update_sleep_time`

### Resource Groups

- **Resource group administration** - Creation, deletion, and settings changes.
  - **Events:** `resource_group.create`, `resource_group.delete`, `resource_group.settings`
- **Resource group members** - Adding and removing users, and role changes.
  - **Events:** `resource_group.add_users`, `resource_group.remove_users`, `resource_group.change_role`

### Jobs and Scheduled Jobs

- **Jobs** - Job creation (e.g. on a Space) and cancellation.
  - **Events:** `jobs.create`, `jobs.cancel`
- **Scheduled jobs** - Creating, deleting, resuming, suspending, and triggering runs.
  - **Events:** `scheduled_job.create`, `scheduled_job.delete`, `scheduled_job.resume`, `scheduled_job.suspend`, `scheduled_job.run`

### Billing and Cloud Integration

- **Payment and customers** - Payment method updates, attachment, and removal; customer account creation.
  - **Events:** `billing.update_payment_method`, `billing.create_customer`, `billing.remove_payment_method`
- **Cloud marketplaces** - AWS and GCP marketplace linking/unlinking and marketplace approval.
  - **Events:** `billing.aws_add`, `billing.aws_remove`, `billing.gcp_add`, `billing.gcp_remove`, `billing.marketplace_approve`
- **Subscriptions** - Starting, renewing, cancelling, reactivating, and updating subscriptions (including plan and contract details).
  - **Events:** `billing.start_subscription`, `billing.renew_subscription`, `billing.cancel_subscription`, `billing.un_cancel_subscription`, `billing.update_subscription`, `billing.update_subscription_plan`, `billing.update_subscription_contract_details`

## Event reference

The list above covers every event type shown in the audit log UI and export. Event names follow the `scope.action` pattern; scopes include `org`, `repo`, `collection`, `spaces`, `resource_group`, `jobs`, `scheduled_job`, and `billing`. The export action itself is recorded as `org.audit_log.export`, but that event is not included in the default audit log view.

### Downloading models

https://huggingface.co/docs/hub/models-downloading.md

# Downloading models

## Integrated libraries

If a model on the Hub is tied to a [supported library](./models-libraries), loading the model can be done in just a few lines. For information on accessing the model, you can click on the "Use in _Library_" button on the model page to see how to do so. For example, `distilbert/distilgpt2` shows how to do so with 🤗 Transformers below.

## Using the Hugging Face Client Library

You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. For example, to download the `HuggingFaceH4/zephyr-7b-beta` model from the command line, run

```bash
hf download HuggingFaceH4/zephyr-7b-beta
```

See the [CLI download documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli#download-an-entire-repository) for more information.

You can also integrate this into your own library. For example, you can quickly load a Scikit-learn model with a few lines.
```py
from huggingface_hub import hf_hub_download
import joblib

REPO_ID = "YOUR_REPO_ID"
FILENAME = "sklearn_model.joblib"

model = joblib.load(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
)
```

## Using Git

Since all models on the Model Hub are Xet-backed Git repositories, you can clone the models locally by [installing git-xet](./xet/using-xet-storage#git-xet) and running:

```bash
git xet install
git lfs install
git clone git@hf.co:
# example: git clone git@hf.co:bigscience/bloom
```

If you have write access to the particular model repo, you'll also have the ability to commit and push revisions to the model. Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos.

## Faster downloads

`hf_xet` is a Rust-based package leveraging the [Xet storage backend](https://huggingface.co/docs/hub/en/xet/index) to optimize file transfers with chunk-based deduplication.

By default, `hf_xet` uses **adaptive concurrency**: it automatically tunes the number of parallel transfer streams based on real-time network conditions, starting conservatively (1 stream) and scaling up to 64 concurrent streams as bandwidth permits. For most machines, including data center environments, the default settings will already saturate the available network bandwidth.

For advanced users on machines with high bandwidth **and at least 64 GB of RAM**, `HF_XET_HIGH_PERFORMANCE=1` raises concurrency bounds and significantly increases memory buffer sizes, which can help when downloading many large files in parallel.

```bash
HF_XET_HIGH_PERFORMANCE=1 hf download ...
```

## Using hf-mount

For large models, you can mount a repo as a local filesystem with [hf-mount](https://github.com/huggingface/hf-mount) instead of downloading the full repo. Files are fetched lazily: only the bytes your code reads hit the network.
```bash
curl -fsSL https://raw.githubusercontent.com/huggingface/hf-mount/main/install.sh | sh
hf-mount start repo openai-community/gpt2 /tmp/gpt2
```

Repos are mounted read-only. See [Mount as a Local Filesystem](./storage-buckets-access#mount-as-a-local-filesystem) for full setup details, backend options, and caching.

### The Model Hub

https://huggingface.co/docs/hub/models-the-hub.md

# The Model Hub

## What is the Model Hub?

The Model Hub is where the members of the Hugging Face community can host all of their model checkpoints for simple storage, discovery, and sharing. Download pre-trained models with the [`huggingface_hub` client library](https://huggingface.co/docs/huggingface_hub/index), with 🤗 [`Transformers`](https://huggingface.co/docs/transformers/index) for fine-tuning and other usages, or with any of the over [15 integrated libraries](./models-libraries). You can even leverage [Inference Providers](/docs/inference-providers/) or [Inference Endpoints](https://huggingface.co/docs/inference-endpoints) to use models in production settings.

You can refer to the following video for a guide on navigating the Model Hub. To learn how to upload models to the Hub, you can refer to the [Repositories Getting Started Guide](./repositories-getting-started).

### Spaces Dev Mode: Seamless development in Spaces

https://huggingface.co/docs/hub/spaces-dev-mode.md

# Spaces Dev Mode: Seamless development in Spaces

> [!WARNING]
> This feature is still in Beta stage.

> [!WARNING]
> The Spaces Dev Mode is part of PRO or Team & Enterprise plans.

## Spaces Dev Mode

Spaces Dev Mode is a feature that eases the debugging of your application and makes iterating on Spaces faster. Whenever you commit changes to your Space repo, the underlying Docker image gets rebuilt, and then a new virtual machine is provisioned to host the new container. Dev Mode allows you to update your Space much quicker by overriding the Docker image.
The Dev Mode Docker image starts your application as a sub-process, allowing you to restart it without stopping the Space container itself. It also starts a VS Code server and an SSH server in the background for you to connect to the Space. The ability to connect to the running Space unlocks several use cases:

- You can make changes to the app code without the Space rebuilding every time
- You can debug a running application and monitor resources live

Overall it makes developing and experimenting with Spaces much faster by skipping the Docker image rebuild phase.

## Interface

Once Dev Mode is enabled on your Space, you should see a modal like the following.

The application does not restart automatically when you change the code. For your changes to appear in the Space, you need to use the `Refresh` button, which will restart the app.

If you're using the Gradio SDK, or if your application is Python-based, note that requirements are not installed automatically. You will need to manually run `pip install` from VS Code or SSH.

### SSH connection and VS Code

Dev Mode allows you to connect to your Space's Docker container using SSH. Instructions to connect are listed in the Dev Mode controls modal.

You will need to add your machine's SSH public key to [your user account](https://huggingface.co/settings/keys) to be able to connect to the Space using SSH. Check out the [Git over SSH](./security-git-ssh#add-a-ssh-key-to-your-account) documentation for more detailed instructions.

You can also use a local install of VS Code to connect to the Space container. To do so, you will need to install the [SSH Remote](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh) extension.

### Persisting changes

The changes you make when Dev Mode is enabled are not persisted to the Space repo automatically. By default, they will be discarded when Dev Mode is disabled or when the Space goes to sleep.
If you wish to persist changes made while Dev Mode is enabled, you need to use `git` from inside the Space container (using VS Code or SSH). For example:

```shell
# Add changes and commit them
git add .
git commit -m "Persist changes from Dev Mode"

# Push the commit to persist them in the repo
git push
```

The modal will display a warning if you have uncommitted or unpushed changes in the Space:

## Enabling Dev Mode

You can enable Dev Mode on your Space from the web interface or via the API.

### Via the API

You can toggle Dev Mode programmatically:

```
POST https://huggingface.co/api/spaces/{namespace}/{repo}/dev-mode
Content-Type: application/json
Authorization: Bearer {token}

{
  "enabled": true
}
```

### Via the web interface

You can also create a Space with the dev mode enabled:

## Limitations

Dev Mode is currently not available for static Spaces. Docker Spaces also have some additional requirements.

### Docker Spaces

Dev Mode is supported for Docker Spaces. However, your Space needs to comply with the following rules for Dev Mode to work properly.

1. The following packages must be installed:
   - `bash` (required to establish SSH connections)
   - `curl`, `wget` and `procps` (required by the VS Code server process)
   - `git` and `git-lfs` to be able to commit and push changes from your Dev Mode environment
2. Your application code must be located in the `/app` folder for the Dev Mode daemon to be able to detect changes.
3. The `/app` folder must be owned by the user with uid `1000` to allow you to make changes to the code.
4. The Dockerfile must contain a `CMD` instruction for startup. Check out [Docker's documentation](https://docs.docker.com/reference/dockerfile/#cmd) about the `CMD` instruction for more details.

Dev Mode works well when the base image is Debian-based (e.g., Ubuntu). More exotic Linux distros (e.g., Alpine) are not tested, and Dev Mode is not guaranteed to work on them.
### Example of compatible Dockerfiles

This is an example of a Dockerfile compatible with Spaces Dev Mode. It installs the required packages with `apt-get`, along with a couple more for developer convenience (namely: `htop`, `vim` and `nano`). It then starts a NodeJS application from `/app`.

```Dockerfile
FROM node:19-slim

RUN apt-get update && \
    apt-get install -y \
    bash \
    git git-lfs \
    wget curl procps \
    htop vim nano && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY --link ./ /app

RUN npm i

RUN chown 1000 /app
USER 1000

CMD ["node", "index.js"]
```

There are several examples of Dev Mode compatible Docker Spaces in this organization. Feel free to duplicate them in your namespace!

Example Python app (FastAPI HTTP server): https://huggingface.co/spaces/dev-mode-explorers/dev-mode-python

Example JavaScript app (Express.js HTTP server): https://huggingface.co/spaces/dev-mode-explorers/dev-mode-javascript

## Feedback

You can share your feedback on Spaces Dev Mode directly on the HF Hub: https://huggingface.co/spaces/dev-mode-explorers/README/discussions

### SQL Console: Query Hugging Face datasets in your browser

https://huggingface.co/docs/hub/datasets-viewer-sql-console.md

# SQL Console: Query Hugging Face datasets in your browser

You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the Data Studio. To learn more about the SQL Console, see the SQL Console blog post.
Through the SQL Console, you can:

- Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_check out [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_)
- Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_)
- Download the results of the query to a Parquet or CSV file
- Embed the results of the query in your own webpage using an iframe
- Query datasets with natural language

> [!TIP]
> You can also use DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the DuckDB Datasets documentation for more information.

The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.

## Examples

### Filtering

The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length of at least 10, you can use the following query, which filters by the length of the reasoning:

```sql
SELECT *
FROM train
WHERE LENGTH(reasoning_chains) > 10;
```

### Histogram

Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values. For example, to plot a histogram of the `Rating` column in the [Lichess/chess-puzzles](https://huggingface.co/datasets/Lichess/chess-puzzles) dataset, you can use the following query. Learn more about the `histogram` function and parameters here.

```sql
from histogram(train, Rating)
```

### Regex Matching

One of the most powerful features of DuckDB is the deep support for regular expressions.
You can use the `regexp` functions to match patterns in your data. Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the [GeneralReasoning/GeneralThought-195K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-195K) dataset for instructions that contain markdown code blocks. Learn more about the DuckDB regex functions here.

```sql
SELECT *
FROM train
WHERE regexp_matches(model_answer, '```')
LIMIT 10;
```

### Saved Queries and Embeds API

You can create, update, and delete SQL Console embeds programmatically. Embeds are saved queries that can be shared via link or embedded in other pages.

**Create an embed:**

```
POST /api/datasets/{namespace}/{repo}/sql-console/embed
Content-Type: application/json
Authorization: Bearer {token}

{
  "sql": "SELECT * FROM train LIMIT 10",
  "title": "Sample rows",
  "private": false,
  "views": [{"key": "default/train", "displayName": "Train", "viewName": "train"}]
}
```

**Update an embed:**

```
PATCH /api/datasets/{namespace}/{repo}/sql-console/embed/{embed_id}
Content-Type: application/json
Authorization: Bearer {token}

{
  "sql": "SELECT * FROM train LIMIT 20",
  "title": "Updated title",
  "private": true
}
```

**Delete an embed:**

```
DELETE /api/datasets/{namespace}/{repo}/sql-console/embed/{embed_id}
Authorization: Bearer {token}
```

### Leakage Detection

Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set. Learn more about leakage detection here.
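Under the hood, a leakage check is just set arithmetic over rows. As a minimal plain-Python sketch of the same idea (illustrative rows, assuming both splits fit in memory as hashable tuples):

```python
def overlap_stats(train, test):
    """Compute split overlap: shared rows and their share of all unique rows."""
    train_rows, test_rows = set(train), set(test)
    overlap = len(train_rows & test_rows)
    total_unique = len(train_rows | test_rows)
    pct = (overlap * 100.0 / total_unique) if total_unique else 0.0
    return overlap, total_unique, pct

# Illustrative splits: two rows appear in both train and test.
train = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
test = [("q2", "a2"), ("q3", "a3"), ("q4", "a4")]

print(overlap_stats(train, test))  # → (2, 4, 50.0)
```

The SQL query below expresses the same computation over the `train` and `test` splits inside the SQL Console.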
```sql
WITH overlapping_rows AS (
    SELECT COALESCE(
        (SELECT COUNT(*) AS overlap_count
         FROM train
         INTERSECT
         SELECT COUNT(*) AS overlap_count
         FROM test),
        0
    ) AS overlap_count
),
total_unique_rows AS (
    SELECT COUNT(*) AS total_count
    FROM (
        SELECT * FROM train
        UNION
        SELECT * FROM test
    ) combined
)
SELECT
    overlap_count,
    total_count,
    CASE
        WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
        ELSE 0
    END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```

### Single Sign-On (SSO)

https://huggingface.co/docs/hub/enterprise-sso.md

# Single Sign-On (SSO)

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Hugging Face offers two distinct SSO models, each designed for different organizational needs. Understanding the differences between these two approaches is key to choosing the right setup for your team.

## At a glance

| | **Basic SSO** | **Managed SSO** |
| --- | --- | --- |
| **Plan** | Team & Enterprise | Enterprise Plus |
| **Scope** | Organization resources only | Entire Hugging Face platform |
| **Replaces the Hugging Face login** | No - users keep their existing Hugging Face credentials | Yes - your IdP becomes the only login method |
| **User accounts** | Users keep their personal Hugging Face account | Accounts are owned and managed by the organization |
| **Personal content** | Users can create content in their personal namespace | Users can only create content within the organization |
| **Multi-org membership** | Users can belong to multiple organizations | Users are restricted to their managing organization |
| **User provisioning** | Manual (SSO join link), or invitation-based [SCIM](./enterprise-scim) on Enterprise | Full lifecycle ([SCIM](./enterprise-scim)) |
| **Setup** | Self-service from organization settings | Requires setup with the Hugging Face team |
| **External collaborators** | Yes | Yes |
| **Protocols** | SAML 2.0 and OIDC | SAML 2.0 and OIDC |
| **Role mapping** | Yes | Yes |
| **Resource group mapping** | Yes | Yes |

## Basic SSO

Basic SSO adds an access-control layer on top of the standard Hugging Face login. It does **not** replace the Hugging Face login: members keep their existing credentials and are prompted to complete SSO only when accessing your organization's resources.

This is well suited for teams that want to **secure access to their organizational resources while preserving the flexibility of individual Hugging Face accounts**. Setup is self-service from your organization's settings.

[Getting started with Basic SSO →](./security-sso-basic)

## Managed SSO

Managed SSO **replaces the Hugging Face login entirely**. Your Identity Provider becomes the sole authentication method across the entire Hugging Face platform. The organization controls the full user lifecycle, from account creation to deactivation.

This is designed for companies that require **complete control over identity, access, and data governance**. Managed accounts have [specific restrictions](./enterprise-advanced-sso#restrictions-on-managed-accounts) (no personal content, organization-bound collaboration). Setup requires coordination with the Hugging Face team.

[Getting started with Managed SSO →](./enterprise-advanced-sso)

## User Provisioning (SCIM)

Both SSO models support [SCIM](./enterprise-scim) (System for Cross-domain Identity Management) to automate user provisioning from your Identity Provider. The two models use SCIM differently, consistent with their respective philosophies:

- **Basic SSO** (Enterprise plan): SCIM automates the **invitation** of existing Hugging Face users to your organization. Users must accept the invitation to join.
- **Managed SSO** (Enterprise Plus plan): SCIM manages the **entire user lifecycle**: account creation, profile updates, and deactivation.

Learn more in the [User Provisioning (SCIM) guide](./enterprise-scim).

## Which model should you choose?
**Choose Basic SSO** if your team needs to secure access to organizational resources while allowing members to maintain their own Hugging Face accounts and participate in the broader community.

**Choose Managed SSO** if your enterprise requires centralized control over all user accounts, automated provisioning and deprovisioning, and strict data governance policies that prevent any content from being created outside the organization.

Both models support SAML 2.0 and OIDC protocols and can be integrated with popular identity providers such as Okta, Microsoft Entra ID (Azure AD), and Google Workspace.

## Further reading

- [User Management](./security-sso-user-management) - Role mapping, resource group mapping, session timeout, and more
- [Configuration Guides](./security-sso-configuration-guides) - Step-by-step setup instructions for Okta, Microsoft Entra ID, and Google Workspace

### Managing Spaces with Github Actions

https://huggingface.co/docs/hub/spaces-github-actions.md

# Managing Spaces with Github Actions

You can keep your Space in sync with your GitHub repository using the official [`huggingface/hub-sync`](https://github.com/marketplace/actions/sync-github-to-hugging-face-hub) GitHub Action. `hub-sync` also works for Models and Datasets. See [GitHub Actions](./repositories-github-actions) for general usage.

## Setup

1. Create a [GitHub secret](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-an-environment) called `HF_TOKEN` with a Hugging Face [access token](https://huggingface.co/settings/tokens).
2. Add a workflow file (e.g.
`.github/workflows/sync-to-hub.yml`) to your repository:

```yaml
name: Sync to Hugging Face Hub
on:
  push:
    branches: [main]
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: huggingface/hub-sync@v0.1.0
        with:
          github_repo_id: ${{ github.repository }}
          huggingface_repo_id: username/my-space
          hf_token: ${{ secrets.HF_TOKEN }}
```

You can configure the Space SDK with `space_sdk` (defaults to `gradio`). See [all parameters](./repositories-github-actions#parameters).

## How it works

The action mirrors your files to the Hub using the `hf` CLI (`hf repo create` + `hf upload`). It is not a git-to-git sync: it uploads the file contents and automatically excludes the `.github/` and `.git/` directories. Files removed from your GitHub repository will also be removed from the Hub.

For more complex workflows (e.g. build steps, custom logic), you can install and use the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) directly in your workflow instead.

## File size considerations

For files larger than 10MB, Spaces requires [Git-LFS](./repositories-getting-started#terminal). Make sure large files in your GitHub repository are tracked with LFS before syncing.

## Alternative: manual git push

If you prefer a direct git-to-git sync instead of file mirroring, you can push to your Space's git remote directly:

```yaml
name: Sync to Hugging Face hub
on:
  push:
    branches: [main]

  workflow_dispatch:

jobs:
  sync-to-hub:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
          lfs: true
      - name: Push to hub
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
        run: git push https://HF_USERNAME:$HF_TOKEN@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main
```

Replace `HF_USERNAME` with your username and `SPACE_NAME` with your Space name.
### Billing

https://huggingface.co/docs/hub/billing.md

# Billing

At Hugging Face, we build a collaboration platform for the ML community (i.e., the Hub) and monetize by providing advanced features and simple access to compute for AI. Any feedback or support request related to billing is welcome at billing@huggingface.co.

## Team and Enterprise subscriptions

We offer advanced security and compliance features for organizations through our Team or Enterprise plans, which include [Single Sign-On](./enterprise-sso), [Advanced Access Control](./enterprise-resource-groups) for repositories, control over your data location, higher [storage capacity](./storage-limits) for public and private repositories, and more.

Team and Enterprise plans are billed like a typical subscription. They renew automatically, but you can choose to cancel at any time in the organization's billing settings. You can pay for a Team subscription with a credit card or your AWS account, or upgrade to Enterprise via an annual contract. Upon renewal, the number of seats in your subscription will be updated to match the number of members of your organization. Private repository storage above the [included storage](./storage-limits) will be billed along with your subscription renewal.
## PRO subscription

The PRO subscription unlocks essential features for serious users, including:

- Higher [storage capacity](./storage-limits) for public and private repositories
- Higher bandwidth and API [rate limits](./rate-limits)
- Included credits for [Inference Providers](/docs/inference-providers/)
- Higher tier for ZeroGPU Spaces usage, and pay-as-you-go quota extension
- Ability to create ZeroGPU Spaces and use Dev Mode
- Ability to publish Social Posts and Community Blogs
- Leverage the [Data Studio](./data-studio) on private datasets
- Run and schedule serverless [CPU/GPU Jobs](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)

View the full list of benefits at https://huggingface.co/pro, then subscribe at https://huggingface.co/subscribe/pro.

Like the Team & Enterprise subscriptions, PRO is billed as a typical subscription that renews automatically. You can cancel the subscription at any time in your billing settings: https://huggingface.co/settings/billing

You can only pay for the PRO subscription with a credit card. The subscription is billed separately from any pay-as-you-go compute usage. Private repository storage above the [included storage](./storage-limits) will be billed along with your subscription renewal.

Note: PRO benefits are also included in the [Enterprise subscription](https://huggingface.co/enterprise).

## Pay-as-you-go private storage

Above the included 1TB (or 1TB per seat) of private storage in PRO, Team, and Enterprise, additional private storage is billed in 1TB increments, at a base price of **$18/TB/month**. Overage is charged to your payment method in Pay-as-you-go mode. Additional discounts are available for large-scale volumes through our account executives. See the full pricing tiers at [huggingface.co/pricing](https://huggingface.co/pricing#storage).
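The overage arithmetic above can be sketched as a small helper. This is illustrative only: the function name and rounding behavior are assumptions, not Hugging Face's actual billing logic.

```python
import math

PRICE_PER_TB = 18  # USD per TB per month (base price, before volume discounts)

def monthly_storage_overage(private_tb_used: float, seats: int = 1) -> int:
    """Estimate the monthly overage charge, billed in whole-TB increments
    above the included 1TB per seat (a PRO account counts as one seat)."""
    included_tb = seats  # 1TB included per seat
    overage_tb = max(0, math.ceil(private_tb_used - included_tb))
    return overage_tb * PRICE_PER_TB

# For example, 3.2TB of private storage on a single-seat plan is
# 2.2TB over the included 1TB, billed as 3 increments of $18 = $54/month
```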
## Compute Services on the Hub

We also directly provide compute services with [Spaces](./spaces), [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) and [Inference Providers](https://huggingface.co/docs/inference-providers/index).

While most of our compute services have a comprehensive free tier, users and organizations can pay to access more powerful hardware accelerators. The billing for our compute services is usage-based, meaning you only pay for what you use. You can monitor your usage at any time from your billing dashboard, located in your user's or organization's settings menu.

Compute services usage is billed separately from PRO and Team / Enterprise subscriptions (and potential private storage). Invoices for compute services are issued at the beginning of each month.

## Available payment methods

Hugging Face uses [Stripe](https://stripe.com) to securely process your payment information. Credit cards are the only payment method supported for Hugging Face compute services. You can add a credit card to your account from your billing settings.

### Billing thresholds & Invoicing

When using credit cards as a payment method, you'll be billed for Hugging Face compute usage each time the accrued usage goes above a billing threshold for your user or organization. On the 1st of every month, Hugging Face issues an invoice for usage accrued during the prior month. Any usage that has yet to be charged will be charged at that time.

For example, if your billing threshold is set at $100.00 and you incur $254.00 of usage during a given month, your credit card will be charged a total of three times during the month:

- Once for usage between $0 and $100: $100
- Once for usage between $100 and $200: $100
- Once at the end of the month for the remaining $54: $54

Note: this will be detailed in your monthly invoice. You can view invoices and receipts for the last 3 months in your billing dashboard.
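The charging pattern in the example above can be sketched as a short helper. This is illustrative only; the function name and rounding are assumptions, not the actual billing code.

```python
def charge_schedule(monthly_usage: float, threshold: float = 100.0) -> list[float]:
    """List the charges triggered during one month: one charge each time
    accrued usage crosses the threshold, plus an end-of-month charge
    for any remaining balance."""
    charges = [threshold] * int(monthly_usage // threshold)
    remainder = round(monthly_usage - sum(charges), 2)
    if remainder > 0:
        charges.append(remainder)
    return charges

# The documented example: $254 of usage with a $100 threshold
print(charge_schedule(254.0))  # [100.0, 100.0, 54.0]
```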
## Cloud providers partnerships

We partner with cloud providers like [AWS](https://huggingface.co/blog/aws-partnership), [Azure](https://huggingface.co/blog/hugging-face-endpoints-on-azure), and [Google Cloud](https://huggingface.co/blog/llama31-on-vertex-ai) to make it easy for customers to use Hugging Face directly in their cloud of choice. These solutions and usage are billed directly by the cloud provider. Ultimately, we want people to have great options for using Hugging Face wherever they build ML-powered products.

You also have the option to link your Hugging Face organization to your AWS account via [AWS Marketplace](https://aws.amazon.com/marketplace/pp/prodview-n6vsyhdjkfng2). Hugging Face compute service usage will then be included in your AWS bill. Read more in our [blog post](https://huggingface.co/blog/aws-marketplace).

## Support FAQ

**Q. Why do I need to add credits? What can I use them for?**

A. Credits let you use HF pay-as-you-go services:

- Jobs: run any workload on GPUs
- Inference Providers: call 250k+ models via API
- Inference Endpoints: dedicated deployments
- GPU Spaces: host on custom hardware
- ZeroGPU: extra quota beyond daily allowance
- Private Storage: extra storage for private repos

**Q. What happens if I run out of credits?**

A. We recommend enabling automatic recharge to avoid service disruptions after credits are exhausted.

**Q. I'm having issues adding my card. What's up?**

A. Please ensure the card supports 3D-Secure authentication and is properly configured for recurring online payments. We do not yet support credit cards issued in India while we work on making our system compliant with the latest RBI directives.
Until we add support for Indian credit cards, you can:

* Link an organization account to an AWS account in order to access pay-as-you-go features (Endpoints, Spaces, AutoTrain): [Hugging Face Platform on the AWS Marketplace: Pay with your AWS Account](https://huggingface.co/blog/aws-marketplace)
* Use a credit card issued in another country

**Q. How can I add my tax ID or update the billing details?**

A. Email billing@huggingface.co and we can help!

**Q. I was just billed for the PRO/Team subscription a few days ago. Why did you charge me again?**

A. All subscriptions renew on the 1st of each month. If you sign up for Team or PRO mid-month, we prorate your first month's charge.

**Q. I need copies of my past invoices, where can I find these?**

A. View and download all invoices here: https://huggingface.co/settings/billing/invoices. Invoices are also emailed.

**Q. I need to update my credit card in my account. What to do?**

A. Head to https://huggingface.co/settings/billing/payment and update your payment method at any time.

**Subscriptions**

**Q. I need to pause my PRO subscription for a bit, where can I do this?**

A. You can cancel your subscription at any time here: https://huggingface.co/settings/billing/subscription. Drop us a line at billing@huggingface.co with your feedback.

**Q. My org has a Team or Enterprise subscription and I need to update the number of seats. How can I do this?**

A. The number of seats is automatically adjusted at subscription renewal to reflect any increase in the number of organization members during the previous period. There's no need to update the subscribed number of seats during the month or year, as it's a flat-fee subscription.

### More ways to create Spaces

https://huggingface.co/docs/hub/spaces-more-ways-to-create.md

# More ways to create Spaces

## Duplicating a Space

You can duplicate a Space by clicking the three dots at the top right and selecting **Duplicate this Space**.
Learn more about it [here](./spaces-overview#duplicating-a-space).

## Creating a Space from a model

You can create a Gradio demo directly from most model pages, using the "Deploy -> Spaces" button.

As another example of how to create a Space from a set of models, the [Model Comparator Space Builder](https://huggingface.co/spaces/farukozderim/Model-Comparator-Space-Builder) from [@farukozderim](https://huggingface.co/farukozderim) can be used to create a Space directly from any model hosted on the Hub.

### Hugging Face Hub documentation

https://huggingface.co/docs/hub/index.md

# Hugging Face Hub documentation

The Hugging Face Hub is a platform with over 2M models, 500k datasets, and 1M demo apps (Spaces), all open source and publicly available, in an online platform where people can easily collaborate and build ML together. The Hub works as a central place where anyone can explore, experiment, collaborate, and build technology with Machine Learning. Are you ready to join the path towards open source Machine Learning?
## What's the Hugging Face Hub?

We are helping the community work together towards the goal of advancing Machine Learning 🔥. The Hugging Face Hub is a platform with over 2M models, 500k datasets, and 1M demos in which people can easily collaborate in their ML workflows. The Hub works as a central place where anyone can share, explore, discover, and experiment with open-source Machine Learning.
No single company, including the Tech Titans, will be able to "solve AI" by themselves; the only way we'll achieve this is by sharing knowledge and resources in a community-centric approach. We are building the largest open-source collection of models, datasets, and demos on the Hugging Face Hub to democratize and advance ML for everyone 🚀.

We encourage you to read the [Code of Conduct](https://huggingface.co/code-of-conduct) and the [Content Guidelines](https://huggingface.co/content-guidelines) to familiarize yourself with the values that we expect our community members to uphold 🤗.

## What can you find on the Hub?

The Hugging Face Hub hosts Git-based repositories, which are version-controlled folders that can contain all your files. For non-versioned, mutable object storage, the Hub also offers [Storage Buckets](./storage-buckets). On it, you'll be able to upload and discover...

- Models: _hosting the latest state-of-the-art models for LLM, text, vision, and audio tasks_
- Datasets: _featuring a wide variety of data for different domains and modalities_
- Spaces: _interactive apps for demonstrating ML models directly in your browser_

The Hub offers **versioning, commit history, diffs, branches, and over a dozen library integrations**! All repositories build on [Xet](./xet/index), a new technology to efficiently store Large Files inside Git, intelligently splitting files into unique chunks and accelerating uploads and downloads. You can learn more about the features that all repositories share in the [**Repositories documentation**](./repositories).

## Models

You can discover and use tens of thousands of open-source ML models shared by the community. To promote responsible model usage and development, model repos are equipped with [Model Cards](./model-cards) to inform users of each model's limitations and biases.
Additional [metadata](./model-cards#model-card-metadata) about the model, such as its tasks, languages, and evaluation results, can be included, and training metrics charts are displayed if the repository contains [TensorBoard traces](./tensorboard). It's also easy to add an [**inference widget**](./models-widgets) to your model, allowing anyone to play with the model directly in the browser! For programmatic access, a serverless API is provided by [**Inference Providers**](./models-inference).

To upload models to the Hub, or download models and integrate them into your work, explore the [**Models documentation**](./models). You can also choose from [**over a dozen libraries**](./models-libraries) such as 🤗 Transformers, Asteroid, and ESPnet that support the Hub.

## Datasets

The Hub is home to over 500k public datasets in more than 8k languages that can be used for a broad range of tasks across NLP, Computer Vision, and Audio. The Hub makes it simple to find, download, and upload datasets. Datasets are accompanied by extensive documentation in the form of [**Dataset Cards**](./datasets-cards) and [**Data Studio**](./datasets-viewer) to let you explore the data directly in your browser.

While many datasets are public, [**organizations**](./organizations) and individuals can create private datasets to comply with licensing or privacy requirements. You can learn more in the [**Datasets documentation**](./datasets-overview).

The [🤗 `datasets`](https://huggingface.co/docs/datasets/index) library allows you to programmatically interact with the datasets, so you can easily use datasets from the Hub in your projects. With a single line of code, you can access the datasets; even if they are so large they don't fit in your computer, you can use streaming to efficiently access the data.

## Spaces

[Spaces](https://huggingface.co/spaces) is a simple way to host ML demo apps on the Hub.
They allow you to build your ML portfolio, showcase your projects at conferences or to stakeholders, and work collaboratively with other people in the ML ecosystem.

We currently support two awesome Python SDKs (**[Gradio](https://gradio.app/)** and **[Streamlit](./spaces-sdks-streamlit)**) that let you build cool apps in a matter of minutes. Users can also create static Spaces, which are simple HTML/CSS/JavaScript pages, or deploy any Docker-based application.

If you need GPU power for your demos, try [**ZeroGPU**](./spaces-zerogpu): it dynamically provides NVIDIA H200 GPUs, in real-time, only when needed.

After you've explored a few Spaces (take a look at our [Space of the Week!](https://huggingface.co/spaces)), dive into the [**Spaces documentation**](./spaces-overview) to learn all about how you can create your own Space. You'll also be able to upgrade your Space to run on a GPU or other accelerated hardware. ⚡️

## Storage Buckets

[Storage Buckets](./storage-buckets) provide S3-like object storage on Hugging Face, powered by the Xet storage backend. Unlike repositories (which are git-based and track file history), buckets are remote object storage containers designed for large-scale files with content-addressable deduplication. They are designed for use cases where you need simple, fast, mutable storage such as storing training checkpoints, logs, intermediate artifacts, or any large collection of files that doesn't need version control.

## Organizations

Companies, universities and non-profits are an essential part of the Hugging Face community! The Hub offers [**Organizations**](./organizations), which can be used to group accounts and manage datasets, models, and Spaces. Educators can also create collaborative organizations for students using [Hugging Face for Classrooms](https://huggingface.co/classrooms).
An organization's repositories will be featured on the organization's page, and every member of the organization will have the ability to contribute to the repository.

In addition to conveniently grouping all of an organization's work, the Hub allows admins to set roles to [**control access to repositories**](./organizations-security), and manage their organization's [payment method and billing info](https://huggingface.co/pricing). Machine Learning is more fun when collaborating! 🔥

[Explore existing organizations](https://huggingface.co/organizations), create a new organization [here](https://huggingface.co/organizations/new), and then visit the [**Organizations documentation**](./organizations) to learn more.

## Security

The Hugging Face Hub supports security and access control features to give you the peace of mind that your code, models, and data are safe. Visit the [**Security**](./security) section in these docs to learn about:

- User Access Tokens
- Access Control for Organizations
- Signing commits with GPG
- Malware scanning

### Advanced Security

https://huggingface.co/docs/hub/enterprise-advanced-security.md

# Advanced Security

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Team & Enterprise organizations can improve their security with advanced security controls for both members and repositories.

## Members Security

Configure additional security settings to protect your organization:

- **Two-Factor Authentication (2FA)**: Require all organization members to enable 2FA for enhanced account security.
- **User Approval**: For organizations with a verified domain name, require admin approval for new users with matching email addresses. This adds a verified badge to your organization page.
- **Hide members list**: When enabled, the list of members will not be visible on the organization page. Note that users can potentially find organization membership information through other means, so do not rely on this for critical use cases.
## Repository Visibility Controls

Manage the default visibility of repositories in your organization:

- **Public by default**: New repositories are created with public visibility.
- **Private by default**: New repositories are created with private visibility. Note that changing this setting will not affect existing repositories.
- **Private only**: Enforce private visibility for all new repositories, with only organization admins able to change visibility settings.

These settings help organizations maintain control over their content while enabling collaboration when needed.

### Using Unity Sentis Models from Hugging Face

https://huggingface.co/docs/hub/unity-sentis.md

# Using Unity Sentis Models from Hugging Face

[Unity 3D](https://unity.com/) is one of the most popular game engines in the world. [Unity Sentis](https://unity.com/products/sentis) is the inference engine that runs on Unity 2023 or above. It is an API that allows you to easily integrate and run neural network models in your game or application, making use of hardware acceleration. Because Unity can export to many different form factors, including PC, mobile, and consoles, this is an easy way to run neural network models on many different types of hardware.

## Exploring Sentis Models in the Hub

You will find `unity-sentis` models by filtering at the left of the [models page](https://huggingface.co/models?library=unity-sentis). All the Sentis models in the Hub come with code and instructions to easily get you started using the model in Unity. All Sentis models under the `unity` namespace (for example, [unity/sentis-yolotinyv7](https://huggingface.co/unity/sentis-yolotinyv7)) have been validated to work, so you can be sure they will run in Unity.

To get more details about using Sentis, you can read its [documentation](https://docs.unity3d.com/Packages/com.unity.sentis@latest).
To get help from others using Sentis, you can ask in its [discussion forum](https://discussions.unity.com/c/ai-beta/sentis).

## Types of files

Each repository will contain several types of files:

* ``sentis`` files: These are the main model files that contain the neural networks that run on Unity.
* ``ONNX`` files: This is an alternative format you can include in addition to, or instead of, the Sentis files. It can be useful for visualization with third-party tools such as [Netron](https://github.com/lutzroeder/netron).
* ``cs`` files: These are C# files that contain the code to run the model on Unity.
* ``info.json``: This file contains information about the files in the repository.
* Data files: These are other files that are needed to run the model. They could include vocabulary files, lists of class names, etc. Some typical files will have extensions ``json`` or ``txt``.
* ``README.md``: This is the model card. It contains instructions on how to use the model and other relevant information.

## Running the model

Always refer to the instructions on the model card. It is expected that you have some knowledge of Unity and some basic knowledge of C#.

1. Open Unity 2023 or above and create a new scene.
2. Install the ``com.unity.sentis`` package from the [package manager](https://docs.unity3d.com/Manual/upm-ui-quick.html).
3. Download your model files (``*.sentis``) and data files and put them in the StreamingAssets folder, which is a subfolder inside the Assets folder. (If this folder does not exist, you can create it.)
4. Place your C# file on an object in the scene, such as the Main Camera.
5. Refer to the model card to see if there are any other objects you need to create in the scene.

In most cases, we only provide the basic implementation to get you up and running. It is up to you to find creative uses. For example, you may want to combine two or more models to do interesting things.
## Sharing your own Sentis models

We encourage you to share your own Sentis models on Hugging Face. These may be models you trained yourself or models you have converted to the [Sentis format](https://docs.unity3d.com/Packages/com.unity.sentis@1.3/manual/serialize-a-model.html) and have tested to run in Unity.

Please provide the models in the Sentis format for each repository you upload. This provides an extra check that they will run in Unity and is also the preferred format for large models. You can also include the original ONNX versions of the model files.

Provide a C# file with a minimal implementation. For example, an image processing model should have code that shows how to prepare the image for the input and construct the image from the output. Alternatively, you can link to some external sample code. This will make it easy for others to download and use the model in Unity. Provide any data files needed to run the model. For example, vocabulary files.

Finally, please provide an ``info.json`` file, which lists your project's files. This helps with counting downloads. Some examples of the contents of ``info.json`` are:

```json
{
  "code": ["mycode.cs"],
  "models": ["model1.sentis", "model2.sentis"],
  "data": ["vocab.txt"]
}
```

Or if your code sample is external:

```json
{
  "sampleURL": ["http://sampleunityproject"],
  "models": ["model1.sentis", "model2.sentis"]
}
```

## Additional Information

We also have some full [sample projects](https://github.com/Unity-Technologies/sentis-samples) to help you get started using Sentis.

### FiftyOne

https://huggingface.co/docs/hub/datasets-fiftyone.md

# FiftyOne

FiftyOne is an open-source toolkit for curating, visualizing, and managing unstructured visual data. The library streamlines data-centric workflows, from finding low-confidence predictions to identifying poor-quality samples and uncovering hidden patterns in your data.
The library supports all sorts of visual data, from images and videos to PDFs, point clouds, and meshes. FiftyOne accommodates object detections, keypoints, polylines, and custom schemas. FiftyOne is integrated with the Hugging Face Hub so that you can load and share FiftyOne datasets directly from the Hub.

🚀 Try the FiftyOne 🤝 Hugging Face Integration in [Colab](https://colab.research.google.com/drive/1l0kzfbJ2wtUw1EGS1tq1PJYoWenMlihp?usp=sharing)!

## Prerequisites

First [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```

Make sure you have `fiftyone>=0.24.0` installed:

```bash
pip install -U fiftyone
```

## Loading Visual Datasets from the Hub

With `load_from_hub()` from FiftyOne's Hugging Face utils, you can load:

- Any FiftyOne dataset uploaded to the hub
- Most image-based datasets stored in Parquet files (which is the standard for datasets uploaded to the hub via the `datasets` library)

### Loading FiftyOne datasets from the Hub

Any dataset pushed to the hub in one of FiftyOne's [supported common formats](https://docs.voxel51.com/user_guide/dataset_creation/datasets.html#supported-import-formats) should have all of the necessary configuration info in its dataset repo on the hub, so you can load the dataset by specifying its `repo_id`.
As an example, to load the [VisDrone detection dataset](https://huggingface.co/datasets/Voxel51/VisDrone2019-DET):

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

## load from the hub
dataset = load_from_hub("Voxel51/VisDrone2019-DET")

## visualize in app
session = fo.launch_app(dataset)
```

![FiftyOne VisDrone dataset](https://cdn-uploads.huggingface.co/production/uploads/63127e2495407887cb79c5ea/0eKxe_GSsBjt8wMjT9qaI.jpeg)

You can [customize the download process](https://docs.voxel51.com/integrations/huggingface.html#configuring-the-download-process), including the number of samples to download, the name of the created dataset object, or whether or not it is persisted to disk.

You can list all the available FiftyOne datasets on the Hub using:

```python
from huggingface_hub import HfApi

api = HfApi()
api.list_datasets(tags="fiftyone")
```

### Loading Parquet Datasets from the Hub with FiftyOne

You can also use the `load_from_hub()` function to load datasets from Parquet files. Type conversions are handled for you, and images are downloaded from URLs if necessary.
With this functionality, [you can load](https://docs.voxel51.com/integrations/huggingface.html#basic-examples) any of the following:

- [FiftyOne-Compatible Image Classification Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-classification-datasets-665dfd51020d8b66a56c9b6f), like [Food101](https://huggingface.co/datasets/food101) and [ImageNet-Sketch](https://huggingface.co/datasets/imagenet_sketch)
- [FiftyOne-Compatible Object Detection Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-object-detection-datasets-665e0279c94ae552c7159a2b), like [CPPE-5](https://huggingface.co/datasets/cppe-5) and [WIDER FACE](https://huggingface.co/datasets/wider_face)
- [FiftyOne-Compatible Segmentation Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-segmentation-datasets-665e15b6ddb96a4d7226a380), like [SceneParse150](https://huggingface.co/datasets/scene_parse_150) and [Sidewalk Semantic](https://huggingface.co/datasets/segments/sidewalk-semantic)
- [FiftyOne-Compatible Image Captioning Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-image-captioning-datasets-665e16e29350244c06084505), like [COYO-700M](https://huggingface.co/datasets/kakaobrain/coyo-700m) and [New Yorker Caption Contest](https://huggingface.co/datasets/jmhessel/newyorker_caption_contest)
- [FiftyOne-Compatible Visual Question-Answering Datasets](https://huggingface.co/collections/Voxel51/fiftyone-compatible-vqa-datasets-665e16424ecc8a718156248a), like [TextVQA](https://huggingface.co/datasets/textvqa) and [ScienceQA](https://huggingface.co/datasets/derek-thomas/ScienceQA)

As an example, we can load the first 1,000 samples from the [WikiArt dataset](https://huggingface.co/datasets/huggan/wikiart) into FiftyOne with:

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub(
    "huggan/wikiart",  ## repo_id
    format="parquet",  ## for Parquet format
    classification_fields=["artist", "style", "genre"],  ## columns to treat as classification labels
    max_samples=1000,  # number of samples to load
    name="wikiart",  # name of the dataset in FiftyOne
)
```

![WikiArt Dataset](https://cdn-uploads.huggingface.co/production/uploads/63127e2495407887cb79c5ea/PCqCvTlNTG5SLtcK5fwuQ.jpeg)

## Pushing FiftyOne Datasets to the Hub

You can push a dataset to the hub with:

```python
import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone.utils.huggingface import push_to_hub

## load example dataset
dataset = foz.load_zoo_dataset("quickstart")

## push to hub
push_to_hub(dataset, "my-hf-dataset")
```

When you call `push_to_hub()`, the dataset will be uploaded to the repo with the specified repo name under your username, and the repo will be created if necessary. A [Dataset Card](./datasets-cards) will automatically be generated and populated with instructions for loading the dataset from the hub. You can upload a thumbnail image/gif to appear on the Dataset Card with the `preview_path` argument.
Here's an example using many of these arguments, which would upload the first three samples of FiftyOne's [Quickstart Video](https://docs.voxel51.com/user_guide/dataset_zoo/datasets.html#quickstart-video) dataset to the private repo `username/my-quickstart-video-dataset` with tags, an MIT license, a description, and a preview image:

```python
dataset = foz.load_zoo_dataset("quickstart-video", max_samples=3)

push_to_hub(
    dataset,
    "my-quickstart-video-dataset",
    tags=["video", "tracking"],
    license="mit",
    description="A dataset of video samples for tracking tasks",
    private=True,
    preview_path=""
)
```

## 📚 Resources

- [🚀 Code-Along Colab Notebook](https://colab.research.google.com/drive/1l0kzfbJ2wtUw1EGS1tq1PJYoWenMlihp?usp=sharing)
- [🗺️ User Guide for FiftyOne Datasets](https://docs.voxel51.com/user_guide/using_datasets.html#)
- [🤗 FiftyOne 🤝 Hub Integration Docs](https://docs.voxel51.com/integrations/huggingface.html#huggingface-hub)
- [🤗 FiftyOne 🤝 Transformers Integration Docs](https://docs.voxel51.com/integrations/huggingface.html#transformers-library)
- [🧩 FiftyOne Hugging Face Hub Plugin](https://github.com/voxel51/fiftyone-huggingface-plugins)

### Using timm at Hugging Face

https://huggingface.co/docs/hub/timm.md

# Using timm at Hugging Face

`timm`, also known as [pytorch-image-models](https://github.com/rwightman/pytorch-image-models), is an open-source collection of state-of-the-art PyTorch image models, pretrained weights, and utility scripts for training, inference, and validation.

This documentation focuses on `timm` functionality in the Hugging Face Hub instead of the `timm` library itself. For detailed information about the `timm` library, visit [its documentation](https://huggingface.co/docs/timm).

You can find a number of `timm` models on the Hub using the filters on the left of the [models page](https://huggingface.co/models?library=timm&sort=downloads).

All models on the Hub come with several useful features:

1. An automatically generated model card, which model authors can complete with [information about their model](./model-cards).
2. Metadata tags that help users discover the relevant `timm` models.
3. An [interactive widget](./models-widgets) you can use to play with the model directly in the browser.
4. [Inference Providers](./models-inference) support that allows users to make inference requests.

## Using existing models from the Hub

Any `timm` model from the Hugging Face Hub can be loaded with a single line of code as long as you have `timm` installed! Once you've selected a model from the Hub, pass the model's ID prefixed with `hf-hub:` to `timm`'s `create_model` method to download and instantiate the model.

```py
import timm

# Loading https://huggingface.co/timm/eca_nfnet_l0
model = timm.create_model("hf-hub:timm/eca_nfnet_l0", pretrained=True)
```

If you want to see how to load a specific model, you can click **Use in timm** and you will be given a working snippet to load it!

### Inference

The snippet below shows how you can perform inference on a `timm` model loaded from the Hub:

```py
import timm
import torch
from PIL import Image
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

# Load from Hub 🔥
model = timm.create_model(
    'hf-hub:nateraw/resnet50-oxford-iiit-pet',
    pretrained=True
)

# Set model to eval mode for inference
model.eval()

# Create Transform
transform = create_transform(**resolve_data_config(model.pretrained_cfg, model=model))

# Get the labels from the model config
labels = model.pretrained_cfg['label_names']
top_k = min(len(labels), 5)

# Use your own image file here...
image = Image.open('boxer.jpg').convert('RGB')

# Process PIL image with transforms and add a batch dimension
x = transform(image).unsqueeze(0)

# Pass inputs to model forward function to get outputs
out = model(x)

# Apply softmax to get predicted probabilities for each class
probabilities = torch.nn.functional.softmax(out[0], dim=0)

# Grab the values and indices of the top k predicted classes
values, indices = torch.topk(probabilities, top_k)

# Prepare a nice dict of top k predictions
predictions = [
    {"label": labels[i], "score": v.item()} for i, v in zip(indices, values)
]
print(predictions)
```

This should leave you with a list of predictions, like this:

```py
[
    {'label': 'american_pit_bull_terrier', 'score': 0.9999998807907104},
    {'label': 'staffordshire_bull_terrier', 'score': 1.0000000149011612e-07},
    {'label': 'miniature_pinscher', 'score': 1.0000000149011612e-07},
    {'label': 'chihuahua', 'score': 1.0000000149011612e-07},
    {'label': 'beagle', 'score': 1.0000000149011612e-07}
]
```

## Sharing your models

You can share your `timm` models directly to the Hugging Face Hub. This will publish a new version of your model to the Hugging Face Hub, creating a model repo for you if it doesn't already exist.

Before pushing a model, make sure that you've logged in to Hugging Face:

```sh
python -m pip install huggingface_hub
hf auth login
```

Alternatively, if you prefer working from a Jupyter or Colaboratory notebook, once you've installed `huggingface_hub` you can log in with:

```py
from huggingface_hub import notebook_login
notebook_login()
```

Then, push your model using the `push_to_hf_hub` method:

```py
import timm

# Build or load a model, e.g. timm's pretrained resnet18
model = timm.create_model('resnet18', pretrained=True, num_classes=4)

###########################
# [Fine tune your model...]
###########################

# Push it to the 🤗 Hub
timm.models.hub.push_to_hf_hub(
    model,
    'resnet18-random-classifier',
    model_config={'labels': ['a', 'b', 'c', 'd']}
)

# Load your model from the Hub
model_reloaded = timm.create_model(
    'hf-hub:<your-username>/resnet18-random-classifier',
    pretrained=True
)
```

## Inference Widget and API

All `timm` models on the Hub are automatically equipped with an [inference widget](./models-widgets), pictured below for [nateraw/timm-resnet50-beans](https://huggingface.co/nateraw/timm-resnet50-beans). Additionally, `timm` models are available through the [Inference Providers](./models-inference), which you can access through HTTP with cURL, Python's `requests` library, or your preferred method for making network requests.

```sh
curl https://api-inference.huggingface.co/models/nateraw/timm-resnet50-beans \
    -X POST \
    --data-binary '@beans.jpeg' \
    -H "Authorization: Bearer ${HF_API_TOKEN}"
# [{"label":"angular_leaf_spot","score":0.9845947027206421},{"label":"bean_rust","score":0.01368315052241087},{"label":"healthy","score":0.001722085871733725}]
```

## Additional resources

* timm (pytorch-image-models) [GitHub Repo](https://github.com/rwightman/pytorch-image-models).
* timm [documentation](https://huggingface.co/docs/timm).
* Additional documentation at [timmdocs](https://timm.fast.ai) by [Aman Arora](https://github.com/amaarora).
* [Getting Started with PyTorch Image Models (timm): A Practitioner's Guide](https://towardsdatascience.com/getting-started-with-pytorch-image-models-timm-a-practitioners-guide-4e77b4bf9055) by [Chris Hughes](https://github.com/Chris-hughes10).

### Git over SSH

https://huggingface.co/docs/hub/security-git-ssh.md

# Git over SSH

You can access and write data in repositories on huggingface.co using SSH (Secure Shell Protocol). When you connect via SSH, you authenticate using a private key file on your local machine.
Some actions, such as pushing changes, or cloning private repositories, will require you to upload your SSH public key to your account on huggingface.co.

You can use a pre-existing SSH key, or generate a new one specifically for huggingface.co.

## Checking for existing SSH keys

If you have an existing SSH key, you can use that key to authenticate Git operations over SSH.

SSH keys are usually located under `~/.ssh` on Mac & Linux, and under `C:\Users\<username>\.ssh` on Windows. List files under that directory and look for files of the form:

- id_rsa.pub
- id_ecdsa.pub
- id_ed25519.pub

Those files contain your SSH public key.

If you don't have such a file under `~/.ssh`, you will have to [generate a new key](#generating-a-new-ssh-keypair). Otherwise, you can [add your existing SSH public key(s) to your huggingface.co account](#add-a-ssh-key-to-your-account).

## Generating a new SSH keypair

If you don't have any SSH keys on your machine, you can use `ssh-keygen` to generate a new SSH key pair (public + private keys):

```
$ ssh-keygen -t ed25519 -C "your.email@example.co"
```

We recommend entering a passphrase when you are prompted to. A passphrase is an extra layer of security: it is a password that will be prompted for whenever you use your SSH key.

Once your new key is generated, add it to your SSH agent with `ssh-add`:

```
$ ssh-add ~/.ssh/id_ed25519
```

If you chose a different location than the default to store your SSH key, you will have to replace `~/.ssh/id_ed25519` with the file location you used.

## Add a SSH key to your account

To access private repositories with SSH, or to push changes via SSH, you will need to add your SSH public key to your huggingface.co account. You can manage your SSH keys [in your user settings](https://huggingface.co/settings/keys).

To add an SSH key to your account, click on the "Add SSH key" button. Then, enter a name for this key (for example, "Personal computer"), and copy and paste the content of your **public** SSH key in the area below.
The public key is located in the `~/.ssh/id_XXXX.pub` file you found or generated in the previous steps.

Click on "Add key", and voilà! You have added an SSH key to your huggingface.co account.

## Testing your SSH authentication

Once you have added your SSH key to your huggingface.co account, you can test that the connection works as expected. In a terminal, run:

```
$ ssh -T git@hf.co
```

If you see a message with your username, congrats! Everything went well, and you are ready to use Git over SSH.

Otherwise, if the message states something like the following, make sure your SSH key is actually used by your SSH agent.

```
Hi anonymous, welcome to Hugging Face.
```

## HuggingFace's SSH key fingerprints

Public key fingerprints can be used to validate a connection to a remote server. These are HuggingFace's public key fingerprints:

> SHA256:aBG5R7IomF4BSsx/h6tNAUVLhEkkaNGB8Sluyh/Q/qY (ECDSA)
> SHA256:skgQjK2+RuzvdmHr24IIAJ6uLWQs0TGtEUt3FtzqirQ (DSA - deprecated)
> SHA256:dVjzGIdV7d6cwKIeZiCoRMa2gMvSKfGZAvHf4gMiMao (ED25519)
> SHA256:uqjYymysBGCXXiMVebB8L8RIuWbPSKGBxQQNhcT5a3Q (RSA)

You can add the following SSH key entries to your `~/.ssh/known_hosts` file to avoid manually verifying HuggingFace hosts:

```
hf.co ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDtPB+snz63eZvTrbMY2Qt39a6HYile89JOum55z3lhIqAqUHxLtXFd+q+ED8izQvyORFPSmFIaPw05rtXo37bm+ixL6wDmvWrHN74oUUWmtrv2MNCLHE5VDb3+Q6MJjjDVIoK5QZIuTStlq0cUbGGxQk7vFZZ2VXdTPqgPjw4hMV7MGp3RFY/+Wy8rIMRv+kRCIwSAOeuaLPT7FzL0zUMDwj/VRjlzC08+srTQHqfoh0RguZiXZQneZKmM75AFhoMbP5x4AW2bVoZam864DSGiEwL8R2jMiyXxL3OuicZteZqll0qfRlNopKnzoxS29eBbXTr++ILqYz1QFqaruUgqSi3MIC9sDYEqh2Q8UxP5+Hh97AnlgWDZC0IhojVmEPNAc7Y2d+ctQl4Bt91Ik4hVf9bU+tqMXgaTrTMXeTURSXRxJEm2zfKQVkqn3vS/zGVnkDS+2b2qlVtrgbGdU/we8Fux5uOAn/dq5GygW/DUlHFw412GtKYDFdWjt3nJCY8=
hf.co ssh-dss AAAAB3NzaC1kc3MAAACBAORXmoE8fn/UTweWy7tCYXZxigmODg71CIvs/haZQN6GYqg0scv8OFgeIQvBmIYMnKNJ7eoo5ZK+fk1yPv8aa9+8jfKXNJmMnObQVyObxFVzB51x8yvtHSSrL4J3z9EAGX9l9b+Fr2+VmVFZ7a90j2kYC+8WzQ9HaCYOlrALzz2VAAAAFQC0RGD5dE5Du2vKoyGsTaG/mO2E5QAAAIAHXRCMYdZij+BYGC9cYn5Oa6ZGW9rmGk98p1Xc4oW+O9E/kvu4pCimS9zZordLAwHHWwOUH6BBtPfdxZamYsBgO8KsXOWugqyXeFcFkEm3c1HK/ysllZ5kM36wI9CUWLedc2vj5JC+xb5CUzhVlGp+Xjn59rGSFiYzIGQC6pVkHgAAAIBve2DugKh3x8qq56sdOH4pVlEDe997ovEg3TUxPPIDMSCROSxSR85fa0aMpxqTndFMNPM81U/+ye4qQC/mr0dpFLBzGuum4u2dEpjQ7B2UyJL9qhs1Ubby5hJ8Z3bmHfOK9/hV8nhyN8gf5uGdrJw6yL0IXCOPr/VDWSUbFrsdeQ==
hf.co ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBL0wtM52yIjm8gRecBy2wRyEMqr8ulG0uewT/IQOGz5K0ZPTIy6GIGHsTi8UXBiEzEIznV3asIz2sS7SiQ311tU=
hf.co ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINJjhgtT9FOQrsVSarIoPVI1jFMh3VSHdKfdqp/O776s
```

### Using `Transformers.js` at Hugging Face

https://huggingface.co/docs/hub/transformers-js.md

# Using `Transformers.js` at Hugging Face

Transformers.js is a JavaScript library for running 🤗 Transformers directly in your browser, with no need for a server! It is designed to be functionally equivalent to the original [Python library](https://github.com/huggingface/transformers), meaning you can run the same pretrained models using a very similar API.

## Exploring `transformers.js` in the Hub

You can find `transformers.js` models by filtering by library in the [models page](https://huggingface.co/models?library=transformers.js).

## Quick tour

It's super simple to translate from existing code! Just like the Python library, we support the `pipeline` API. Pipelines group together a pretrained model with preprocessing of inputs and postprocessing of outputs, making it the easiest way to run models with the library.
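Conceptually, a pipeline is just three stages chained behind one callable: preprocess, model, postprocess. A toy Python sketch of the pattern (purely illustrative; the components below are hypothetical stand-ins, not the real Transformers implementation):

```python
class ToyPipeline:
    """Bundles preprocess -> model -> postprocess behind a single call."""

    def __init__(self, preprocess, model, postprocess):
        self.preprocess = preprocess
        self.model = model
        self.postprocess = postprocess

    def __call__(self, raw_input):
        features = self.preprocess(raw_input)   # raw input -> model features
        scores = self.model(features)           # features -> raw scores
        return self.postprocess(scores)         # raw scores -> readable output

# Hypothetical stand-in components for a fake "sentiment" task:
pipe = ToyPipeline(
    preprocess=lambda text: text.lower().split(),
    model=lambda tokens: sum(t in {"love", "great"} for t in tokens)
                         - sum(t in {"hate", "bad"} for t in tokens),
    postprocess=lambda s: {"label": "POSITIVE" if s >= 0 else "NEGATIVE", "score": s},
)

print(pipe("I love transformers!"))  # {'label': 'POSITIVE', 'score': 1}
```

The real `pipeline` factories in both libraries return an object following this same shape, which is why the Python and JavaScript snippets below look nearly identical.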
Python (original):

```python
from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
pipe = pipeline('sentiment-analysis')
out = pipe('I love transformers!')
# [{'label': 'POSITIVE', 'score': 0.999806941}]
```

JavaScript (ours):

```javascript
import { pipeline } from '@huggingface/transformers';

// Allocate a pipeline for sentiment-analysis
let pipe = await pipeline('sentiment-analysis');
let out = await pipe('I love transformers!');
// [{'label': 'POSITIVE', 'score': 0.999817686}]
```

You can also use a different model by specifying the model id or path as the second argument to the `pipeline` function. For example:

```javascript
// Use a different model for sentiment-analysis
let pipe = await pipeline('sentiment-analysis', 'nlptown/bert-base-multilingual-uncased-sentiment');
```

Refer to the [documentation](https://huggingface.co/docs/transformers.js) for the full list of supported tasks and models.

## Installation

To install via [NPM](https://www.npmjs.com/package/@huggingface/transformers), run:

```bash
npm i @huggingface/transformers
```

For more information, including how to use it in vanilla JS (without any bundler) via a CDN or static hosting, refer to the [README](https://github.com/huggingface/transformers.js/blob/main/README.md#installation).

## Additional resources

* Transformers.js [repository](https://github.com/huggingface/transformers.js)
* Transformers.js [docs](https://huggingface.co/docs/transformers.js)
* Transformers.js [demo](https://huggingface.github.io/transformers.js/)

### Spaces

https://huggingface.co/docs/hub/spaces.md

# Spaces

[Hugging Face Spaces](https://huggingface.co/spaces) offer a simple way to host ML demo apps directly on your profile or your organization's profile. This allows you to create your ML portfolio, showcase your projects at conferences or to stakeholders, and work collaboratively with other people in the ML ecosystem.
We have built-in support for an awesome SDK that lets you build cool apps in Python in a matter of minutes: **[Gradio](https://gradio.app/)**, but you can also unlock the whole power of Docker and host an arbitrary Dockerfile. Finally, you can create static Spaces using JavaScript and HTML.

You'll also be able to upgrade your Space to run [on a GPU or other accelerated hardware](./spaces-gpus). ⚡️

## Contents

- [Spaces Overview](./spaces-overview)
- [Handling Spaces Dependencies](./spaces-dependencies)
- [Spaces Settings](./spaces-settings)
- [Using OpenCV in Spaces](./spaces-using-opencv)
- [Using Spaces for Organization Cards](./spaces-organization-cards)
- [More ways to create Spaces](./spaces-more-ways-to-create)
- [Managing Spaces with Github Actions](./spaces-github-actions)
- [How to Add a Space to ArXiv](./spaces-add-to-arxiv)
- [Spaces Dev Mode](./spaces-dev-mode)
- [Spaces GPU Upgrades](./spaces-gpus)
- [Spaces Disk Usage & Storage](./spaces-storage)
- [Gradio Spaces](./spaces-sdks-gradio)
- [Docker Spaces](./spaces-sdks-docker)
- [Static HTML Spaces](./spaces-sdks-static)
- [Custom Python Spaces](./spaces-sdks-python)
- [Embed your Space](./spaces-embed)
- [Run your Space with Docker](./spaces-run-with-docker)
- [Reference](./spaces-config-reference)
- [Changelog](./spaces-changelog)

## Contact

Feel free to ask questions on the [forum](https://discuss.huggingface.co/c/spaces/24) if you need help with making a Space, or if you run into any other issues on the Hub.

If you're interested in infra challenges, custom demos, advanced GPUs, or something else, please reach out to us by sending an email to **website at huggingface.co**. You can also tag us [on Twitter](https://twitter.com/huggingface)!
🤗

### Using OpenCV in Spaces

https://huggingface.co/docs/hub/spaces-using-opencv.md

# Using OpenCV in Spaces

In order to use OpenCV in your Gradio or Python Spaces, you'll need to make the Space install both the Python and Debian dependencies. This means adding `python3-opencv` to the `packages.txt` file, and adding `opencv-python` to the `requirements.txt` file. If those files don't exist, you'll need to create them.

To see an example, [see this Gradio project](https://huggingface.co/spaces/templates/gradio_opencv/tree/main).

### Storage Buckets: Security & Compliance

https://huggingface.co/docs/hub/storage-buckets-security.md

# Storage Buckets: Security & Compliance

Storage Buckets are built on the same infrastructure that powers the Hugging Face Hub, with enterprise-grade security and compliance built in.

## Encryption

All data stored in buckets is encrypted at rest using **AES-256** encryption. Data in transit is protected via **TLS**.

## Access Control

Buckets use the Hub's standard access control mechanisms:

- **SSO**: Authenticate through your organization's identity provider via [Single Sign-On](./security-sso)
- **RBAC**: Fine-grained permissions through [Resource Groups](./security-resource-groups) let you control who can read, write, or admin each bucket
- **Tokens**: Programmatic access is managed through [User Access Tokens](./security-tokens) with scoped permissions

## Audit Logs

All bucket operations (uploads, downloads, deletions, and permission changes) are recorded in your organization's [Audit Logs](./audit-logs), giving you a full trail of who accessed what and when.

## Data Residency

Bucket data is stored in **US and EU regions**. You can choose where your data lives when creating a bucket, and [pre-warming](./storage-buckets#pre-warming-and-cdn) lets you cache data closer to your compute in specific cloud regions.
## Compliance

Hugging Face maintains the following certifications and compliance standards:

- **SOC 2 Type 2** certified, with active monitoring and patching of security vulnerabilities
- **GDPR** compliant, with data processing agreements available through [Enterprise Plans](https://huggingface.co/pricing)

For more details on Hugging Face's overall security posture, see the [Security](./security) page. For questions, contact [security@huggingface.co](mailto:security@huggingface.co).

### Ingesting Datasets

https://huggingface.co/docs/hub/datasets-ingesting.md

# Ingesting Datasets

Data generally lives in databases or cloud storage in forms that are not suited for AI workflows. Ingesting data to the [Hub](https://huggingface.co/datasets) is a good way to publish it as AI-ready datasets, enabling easy and efficient data loading, processing, model training, and evaluation.

## Using `huggingface_hub`

The simplest way to ingest data is to simply upload the data files with `huggingface_hub`. The `huggingface_hub` Python library provides a rich feature set that allows you to manage repositories, including creating repos and uploading datasets to the Hub. Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more.

This is relevant if your data is static/frozen and if you can easily obtain a local dump of the data in a format supported by the Hub (e.g., Parquet or JSON Lines) with a usable structure (e.g., well-defined fields for training and evaluation).

## Using `dlt`

[dlt](http://github.com/dlt-hub/dlt) is an open-source Python library for data movement (ETL), and is useful for developers (and their agents) building data pipelines. It can ingest data from diverse source types:

* Cloud storage or files
* REST APIs
* SQL databases
* Python generators

Examples of source types:

* `filesystem` (includes s3, gs, az, abfs, etc.)
* `sql_database`, `mongodb`, `google_sheets`
* `notion`, `hubspot`, `rest_api`

Find your source type from the [list of sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources) and create your `dlt` project:

```
dlt init filesystem
```

You can then create a configuration file `.dlt/secrets.toml` in the root of your dlt project to define the Hub as a filesystem destination for your datasets, based on the `hf://` protocol:

```toml
[destination.filesystem]
bucket_url = "hf://datasets/<namespace>"

[destination.filesystem.credentials]
hf_token = "hf_..."  # Your Hugging Face Access Token
```

The `<namespace>` should be your user name or the name of your organization/team where you want to ingest your dataset.

Then each dlt dataset creates or updates a Hugging Face dataset repository. The repository name is `<namespace>/<dataset_name>`, where `<namespace>` is the same one you used in the `bucket_url` (your organization or team), and `<dataset_name>` is the pipeline's `dataset_name`. Here is an example pipeline:

```python
import dlt
import requests

@dlt.resource
def my_data():
    # One of the functions auto-generated by `dlt init` that you can customize,
    # or you can define your own python generator function.
    # Here is an example from the `chess` source type:
    for player in ['magnuscarlsen', 'rpragchess']:
        response = requests.get(f'https://api.chess.com/pub/player/{player}')
        response.raise_for_status()
        yield response.json()

# Requires bucket_url = "hf://datasets/<namespace>" in .dlt/secrets.toml
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="filesystem",
    dataset_name="dataset_name",
)
pipeline.run(my_data())
```

Customize the `dlt` resource to load the data you want and parse the fields you want to publish in your dataset, e.g. the text you need for training and evaluation.
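For instance, a resource can trim each raw record down to just the fields you want in the published dataset. A minimal sketch (plain Python; in a real pipeline you would decorate the generator with `@dlt.resource`, and the field names here are hypothetical):

```python
def training_records(raw_records):
    # Keep only the fields needed for training and evaluation,
    # dropping internal identifiers and other noise.
    for record in raw_records:
        yield {
            "text": record["body"],
            "label": record.get("category", "unknown"),
        }

raw = [
    {"body": "hello world", "category": "greeting", "internal_id": 42},
    {"body": "goodbye", "other_noise": True},
]
print(list(training_records(raw)))
# [{'text': 'hello world', 'label': 'greeting'}, {'text': 'goodbye', 'label': 'unknown'}]
```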
## Using other libraries

Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](./datasets-pandas), [Polars](./datasets-polars), [Dask](./datasets-dask), [DuckDB](./datasets-duckdb), [Spark](./datasets-spark), or [Daft](./datasets-daft) can ingest data from various places to the Hub. See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.

## Ingest raw data

If you are ingesting raw data that needs further curation before being published as AI-ready datasets, or if you need an S3-like experience, consider ingesting it to [Hugging Face Storage Buckets](./storage-buckets).

## Scheduled ingestion

There are some limitations when updating the same file on the Hub thousands of times. For instance, you might want to ingest generations of a running LLM inference server, live agent traces, or logs of a running model training. In such cases, uploading the data as a dataset on the Hub makes sense, but it can be hard to do properly. The main reason is that you don't want to version every update of your data, because it'll make the git repository unusable. Three options are available:

* **Use a Storage Bucket instead of a Dataset repository:** [Storage Buckets](/docs/hub/storage-buckets) offer an S3-like experience that allows updating files very frequently, since they are not based on git. Storage Buckets are especially useful for data that is not ready to be published as a dataset, e.g. data that is still evolving or that needs more curation.
* **Use a `CommitScheduler`:** The `CommitScheduler` in `huggingface_hub` offers near real-time ingestion while keeping the git history of a Dataset repository manageable. It can be configured to make git commits at intervals defined in minutes.
* **Use Hugging Face Jobs to schedule ingestion scripts:** Hugging Face Jobs provides a way to run and schedule Python scripts on Hugging Face infrastructure. Schedule ingestion scripts to run at intervals defined using the Cron syntax.
### High frequency using Storage Buckets

Contrary to Dataset repositories, which are based on git, you can update files in Storage Buckets at a very high rate, offering quasi real-time ingestion. Use `batch_bucket_files()` in `huggingface_hub` to update files in a bucket:

```python
import os
from huggingface_hub import batch_bucket_files

def update_bucket(local_files):
    destinations = [os.path.basename(local_file) for local_file in local_files]
    batch_bucket_files(
        bucket_id="username/bucket_name",
        add=[(local_file, dst) for local_file, dst in zip(local_files, destinations)],
    )
```

Alternatively, you can append to files in a Bucket and `flush()` on every new item:

```python
import json
from huggingface_hub import hffs

with hffs.open("buckets/username/bucket_name/texts.jsonl", "a") as f:
    for text in live_texts_stream:
        f.write(json.dumps({"text": text}) + "\n")
        f.flush()
```

The `HfFileSystem` is based on `fsspec`, which has a default blocksize of 5MiB, which means flushing actually uploads the data once a full chunk of 5MiB of new data has been appended. If you want to upload more often, lower `blocksize` in `hffs.open()` (e.g. `hffs.open(..., blocksize=100 * 2 ** 10)` for 100 kiB) or use `f.flush(force=True)`.

Hugging Face storage is based on Xet, which enables efficient I/O when appending to files: uploads are deduplicated and only new data is uploaded. Find more information on doing dynamic data ingestion in buckets in the [buckets documentation on uploads](/docs/hub/storage-buckets#uploading-files) and in the [dataset editing documentation](./datasets-editing#only-upload-the-new-data).

### Near real-time using a `CommitScheduler`

The idea is to run a background job that regularly pushes a local folder to the Hub. You want to save data to the Hub (potentially millions of entries), but you don't need to save each user's input in real time. Instead, you can save the data locally in a JSON file and upload it every 10 minutes.
For example:

```python
import json
from huggingface_hub import CommitScheduler

folder_path = "path/to/files/to/ingest"
every = 10  # push every 10 minutes

with CommitScheduler(repo_id="username/dataset_name", repo_type="dataset", folder_path=folder_path, every=every) as scheduler:
    # Write to the folder; the scheduler pushes it to the Hub every 10 minutes.
    # For example:
    with open(folder_path + "/texts.jsonl", "a") as f:
        f.write(json.dumps({"text": text}) + "\n")
    ...
```

Check out how to ingest dynamic data without having to reupload everything every time in the documentation on [dataset editing](./datasets-editing#only-upload-the-new-data). Find more information on scheduled uploads in the [huggingface_hub documentation](/docs/huggingface_hub/guides/upload#scheduled-uploads).

### Cron-based using Hugging Face Jobs

Schedule Python scripts to ingest data on a schedule. For example, to run a script `ingest.py` every 5 minutes:

```bash
hf jobs scheduled uv run "*/5 * * * *" ingest.py
```

Declare the script dependencies [in the header of the script](https://docs.astral.sh/uv/guides/scripts/#declaring-script-dependencies) or use `--with`. For example, to run a `dlt` pipeline every day at midnight:

```bash
hf jobs scheduled uv run --with "dlt[hf]" "0 0 * * *" pipeline.py
```

You can check the logs of every run using `hf jobs logs` or directly in the Jobs page on your account on Hugging Face. Find more information about Hugging Face Jobs in the [Jobs documentation](/docs/hub/jobs-overview).

### How to Add a Space to ArXiv

https://huggingface.co/docs/hub/spaces-add-to-arxiv.md

# How to Add a Space to ArXiv

Demos on Hugging Face Spaces allow a wide audience to try out state-of-the-art machine learning research without writing any code. [Hugging Face and ArXiv have collaborated](https://huggingface.co/blog/arxiv) to embed these demos directly alongside papers on ArXiv! Thanks to this integration, users can now find the most popular demos for a paper on its arXiv abstract page.
For example, if you want to try out demos of the LayoutLM document classification model, you can go to [the LayoutLM paper's arXiv page](https://arxiv.org/abs/1912.13318), and navigate to the demo tab. You will see open-source demos built by the machine learning community for this model, which you can try out immediately in your browser:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/layout-lm-space-arxiv.gif)

We'll cover two different ways to add your Space to ArXiv and have it show up in the Demos tab.

**Prerequisites**

* There's an existing paper on ArXiv that you'd like to create a demo for
* You have built (or can build) a demo for the model on Spaces

**Method 1 (Recommended): Linking from the Space README**

The simplest way to add a Space to an ArXiv paper is to include the link to the paper in the Space README file (`README.md`). It's good practice to include a full citation as well. You can see an example of a link and a citation on this [Echocardiogram Segmentation Space README](https://huggingface.co/spaces/abidlabs/echocardiogram-arxiv/blob/main/README.md).

And that's it! Your Space should appear in the Demo tab next to the paper on ArXiv in a few minutes 🤗

**Method 2: Linking a Related Model**

An alternative approach can be used to link Spaces to papers by linking an intermediate model to the Space. This requires that the paper is **associated with a model** that is on the Hugging Face Hub (or can be uploaded there).

1. First, upload the model associated with the ArXiv paper onto the Hugging Face Hub if it is not already there. ([Detailed instructions are here](./models-uploading))

2. When writing the model card (`README.md`) for the model, include a link to the ArXiv paper. It's good practice to include a full citation as well.
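Such a link and citation might look like the following in the model card (a sketch; adapt the BibTeX entry to your own paper):

```
This model was introduced in
[LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/abs/1912.13318).

@article{xu2019layoutlm,
  title={LayoutLM: Pre-training of Text and Layout for Document Image Understanding},
  author={Xu, Yiheng and Li, Minghao and Cui, Lei and Huang, Shaohan and Wei, Furu and Zhou, Ming},
  journal={arXiv preprint arXiv:1912.13318},
  year={2019}
}
```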
You can see an example of a link and a citation on the [LayoutLM model card](https://huggingface.co/microsoft/layoutlm-base-uncased).

*Note*: you can verify this step has been carried out successfully by seeing if an ArXiv button appears above the model card. In the case of LayoutLM, the button says: "arxiv:1912.13318" and links to the LayoutLM paper on ArXiv.

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/arxiv-button.png)

3. Then, create a demo on Spaces that loads this model. Somewhere within the code, the model name must be included in order for Hugging Face to detect that a Space is associated with it. For example, the [docformer_for_document_classification](https://huggingface.co/spaces/iakarshu/docformer_for_document_classification) Space loads LayoutLM [like this](https://huggingface.co/spaces/iakarshu/docformer_for_document_classification/blob/main/modeling.py#L484) and includes the string `"microsoft/layoutlm-base-uncased"`:

```py
from transformers import LayoutLMForTokenClassification
layoutlm_dummy = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=1)
```

*Note*: Here's an [overview on building demos on Hugging Face Spaces](./spaces-overview) and here are more specific instructions for [Gradio](./spaces-sdks-gradio) and [Streamlit](./spaces-sdks-streamlit).

4. As soon as your Space is built, Hugging Face will detect that it is associated with the model. A "Linked Models" button should appear in the top right corner of the Space, as shown here:

![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/linked-models.png)

*Note*: You can also add linked models manually by explicitly updating them in the [README metadata for the Space, as described here](https://huggingface.co/docs/hub/spaces-config-reference).
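For example, the Space's `README.md` front matter can declare linked models via the `models` field (a sketch; the surrounding fields are illustrative, see the Spaces configuration reference for the full list):

```yaml
---
title: My LayoutLM Demo
sdk: gradio
app_file: app.py
models:
  - microsoft/layoutlm-base-uncased
---
```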
Your Space should appear in the Demo tab next to the paper on ArXiv in a few minutes 🤗

### Dask

https://huggingface.co/docs/hub/datasets-dask.md

# Dask

[Dask](https://www.dask.org/?utm_source=hf-docs) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem. In particular, we can use [Dask DataFrame](https://docs.dask.org/en/stable/dataframe.html?utm_source=hf-docs) to scale up pandas workflows. Dask DataFrame parallelizes pandas to handle large tabular data. It closely mirrors the pandas API, making it simple to transition from testing on a single dataset to processing the full dataset. Dask is particularly effective with Parquet, the default format on Hugging Face Datasets, as it supports rich data types, efficient columnar filtering, and compression.

A good practical use case for Dask is running data processing or model inference on a dataset in a distributed manner. See, for example, [Coiled's](https://www.coiled.io/?utm_source=hf-docs) excellent blog post on [Scaling AI-Based Data Processing with Hugging Face + Dask](https://huggingface.co/blog/dask-scaling).

## Read and Write

Since Dask uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.

First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
hf auth login
```

Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:

```python
from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Dask.
Dask DataFrame supports distributed writing to Parquet on Hugging Face, which uses commits to track dataset changes:

```python
import dask.dataframe as dd

df.to_parquet("hf://datasets/username/my_dataset")

# or write in separate directories if the dataset has train/validation/test splits
df_train.to_parquet("hf://datasets/username/my_dataset/train")
df_valid.to_parquet("hf://datasets/username/my_dataset/validation")
df_test.to_parquet("hf://datasets/username/my_dataset/test")
```

Since this creates one commit per file, it is recommended to squash the history after the upload:

```python
from huggingface_hub import HfApi

HfApi().super_squash_history(repo_id="username/my_dataset", repo_type="dataset")
```

This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format. You can reload it later:

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/username/my_dataset")

# or read from separate directories if the dataset has train/validation/test splits
df_train = dd.read_parquet("hf://datasets/username/my_dataset/train")
df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation")
df_test = dd.read_parquet("hf://datasets/username/my_dataset/test")
```

For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).
## Process data

To process a dataset in parallel using Dask, you can first define your data processing function for a pandas DataFrame or Series, and then use the Dask `map_partitions` function to apply this function to all the partitions of a dataset in parallel:

```python
import pandas as pd

def dummy_count_words(texts):
    return pd.Series([len(text.split(" ")) for text in texts])
```

or a similar function using pandas string methods (faster):

```python
def dummy_count_words(texts):
    return texts.str.count(" ")
```

In pandas you can use this function on a text column:

```python
# pandas API
df["num_words"] = dummy_count_words(df.text)
```

And in Dask you can run this function on every partition:

```python
# Dask API: run the function on every partition
df["num_words"] = df.text.map_partitions(dummy_count_words, meta=int)
```

Note that you also need to provide `meta`, which is the type of the pandas Series or DataFrame returned by your function. This is needed because Dask DataFrame uses a lazy API: since Dask only runs the data processing once `.compute()` is called, it needs the `meta` argument to know the type of the new column in the meantime.

## Predicate and Projection Pushdown

When reading Parquet data from Hugging Face, Dask automatically leverages the metadata in Parquet files to skip entire files or row groups if they are not needed. For example, if you apply a filter (predicate) on a Hugging Face dataset in Parquet format, or if you select a subset of the columns (projection), Dask will read the metadata of the Parquet files to discard the parts that are not needed without downloading them.

This is possible thanks to a [reimplementation of the Dask DataFrame API](https://docs.coiled.io/blog/dask-dataframe-is-fast.html?utm_source=hf-docs) to support query optimization, which makes Dask faster and more robust. For example, this subset of FineWeb-Edu contains many Parquet files.
If you filter the dataset to keep only the text from recent CC dumps, Dask will skip most of the files and only download the data that match the filter:

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/HuggingFaceFW/fineweb-edu/sample/10BT/*.parquet")

# Dask will skip the files or row groups that don't
# match the query without downloading them.
df = df[df.dump >= "CC-MAIN-2023"]
```

Dask will also read only the columns required for your computation and skip the rest. For example, if you drop a column late in your code, Dask will not bother to load it earlier in the pipeline if it's not needed. This is useful when you want to manipulate a subset of the columns or for analytics:

```python
# Dask will download the 'dump' and 'token_count' columns needed
# for the filtering and computation and skip the other columns.
df.token_count.mean().compute()
```

## Client

Most Dask features are optimized for a cluster or a local `Client` to launch the parallel computations:

```python
import dask.dataframe as dd
from distributed import Client

if __name__ == "__main__":  # needed for creating new processes
    client = Client()
    df = dd.read_parquet(...)
    ...
```

For local usage, the `Client` uses a Dask `LocalCluster` with multiprocessing by default. You can manually configure the multiprocessing of `LocalCluster` with:

```python
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, threads_per_worker=8)
client = Client(cluster)
```

Note that if you use the default threaded scheduler locally without a `Client`, a DataFrame can become slower after certain operations (more details [here](https://github.com/dask/dask-expr/issues/1181)).

Find more information on setting up a local or cloud cluster in the [Deploying Dask documentation](https://docs.dask.org/en/latest/deploying.html).
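To make the predicate pushdown above concrete, here is a stdlib-only sketch of how min/max statistics let a reader skip row groups. The `RowGroup` class and its fields are hypothetical; real readers (including the one Dask uses) get these statistics from the Parquet file footer:

```python
from dataclasses import dataclass

@dataclass
class RowGroup:
    # Hypothetical min/max statistics, as stored in a Parquet footer
    min_dump: str
    max_dump: str

def groups_to_read(groups, threshold):
    """Keep only row groups that may contain rows with dump >= threshold."""
    return [g for g in groups if g.max_dump >= threshold]

groups = [
    RowGroup("CC-MAIN-2019", "CC-MAIN-2020"),
    RowGroup("CC-MAIN-2022", "CC-MAIN-2024"),
]
# Only the second group can match `dump >= "CC-MAIN-2023"`,
# so only that group would be downloaded.
print(len(groups_to_read(groups, "CC-MAIN-2023")))  # 1
```

Because the `dump` values sort lexicographically by year, comparing the threshold against each group's maximum is enough to prove a group cannot match, without reading its rows.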
### Hub API Endpoints

https://huggingface.co/docs/hub/api.md

# Hub API Endpoints

We have open endpoints that you can use to retrieve information from the Hub as well as perform certain actions such as creating model, dataset or Space repos. We offer a wrapper Python client, [`huggingface_hub`](https://github.com/huggingface/huggingface_hub), and a JS client, [`huggingface.js`](https://github.com/huggingface/huggingface.js), that allow easy access to these endpoints. We also provide [webhooks](./webhooks) to receive real-time incremental info about repos. Enjoy!

> [!NOTE]
> We've moved the Hub API Endpoints documentation to our [OpenAPI Playground](https://huggingface.co/spaces/huggingface/openapi), which provides a comprehensive reference that's always up-to-date. You can also access the OpenAPI specification directly at [https://huggingface.co/.well-known/openapi.json](https://huggingface.co/.well-known/openapi.json), or in a Markdown version if you want to send it to your Agent: [https://huggingface.co/.well-known/openapi.md](https://huggingface.co/.well-known/openapi.md).

> [!NOTE]
> All API calls are subject to the HF-wide [Rate limits](./rate-limits). Upgrade your account if you need elevated, large-scale access.

### Argilla on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-argilla.md

# Argilla on Spaces

Argilla is a free and open-source tool to build and iterate on data for AI. It can be deployed on the Hub with a few clicks and Hugging Face OAuth enabled. This enables other HF users to join your Argilla server to annotate datasets, perfect for running community annotation initiatives!

With Argilla you can:

- Configure datasets for collecting human feedback with a growing number of question types (Label, NER, Ranking, Rating, free text, etc.)
- Use model outputs/predictions to evaluate them or to speed up the annotation process.
- UI users can explore, find, and label the most interesting/critical subsets using Argilla's search and semantic similarity features.
- Pull and push datasets from the Hugging Face Hub for versioning and model training.

The best place to get started with Argilla on Spaces is [this guide](http://docs.argilla.io/latest/getting_started/quickstart/).

### Using 🤗 Datasets

https://huggingface.co/docs/hub/datasets-usage.md

# Using 🤗 Datasets

Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the [**Use this dataset** button](https://huggingface.co/datasets/nyu-mll/glue?library=datasets) to copy the code to load a dataset.

First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
hf auth login
```

And then you can load a dataset from the Hugging Face Hub using:

```python
from datasets import load_dataset

dataset = load_dataset("username/my_dataset")

# or load the separate splits if the dataset has train/validation/test splits
train_dataset = load_dataset("username/my_dataset", split="train")
valid_dataset = load_dataset("username/my_dataset", split="validation")
test_dataset = load_dataset("username/my_dataset", split="test")
```

You can also upload datasets to the Hugging Face Hub:

```python
my_new_dataset.push_to_hub("username/my_new_dataset")
```

This creates a dataset repository `username/my_new_dataset` containing your Dataset in Parquet format, that you can reload later.

For more information about using 🤗 Datasets, check out the [tutorials](/docs/datasets/tutorial) and [how-to guides](/docs/datasets/how_to) available in the 🤗 Datasets documentation.

### Model Cards

https://huggingface.co/docs/hub/model-cards.md

# Model Cards

## What are Model Cards?

Model cards are files that accompany the models and provide handy information. Under the hood, model cards are simple Markdown files with additional metadata.
Model cards are essential for discoverability, reproducibility, and sharing! You can find a model card as the `README.md` file in any model repo.

The model card should describe:

- the model
- its intended uses & potential limitations, including biases and ethical considerations as detailed in [Mitchell, 2018](https://arxiv.org/abs/1810.03993)
- the training params and experimental info (you can embed or link to an experiment tracking platform for reference)
- which datasets were used to train your model
- the model's evaluation results

The model card template is available [here](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md).

How to fill out each section of the model card is described in [the Annotated Model Card](https://huggingface.co/docs/hub/model-card-annotated).

Model Cards on the Hub have two key parts, with overlapping information:

- [Metadata](#model-card-metadata)
- [Text descriptions](#model-card-text)

## Model card metadata

A model repo will render its `README.md` as a model card. The model card is a [Markdown](https://en.wikipedia.org/wiki/Markdown) file, with a [YAML](https://en.wikipedia.org/wiki/YAML) section at the top that contains metadata about the model.

The metadata you add to the model card supports discovery and easier use of your model. For example:

* Allowing users to filter models at https://huggingface.co/models.
* Displaying the model's license.
* Adding datasets to the metadata will add a message reading `Datasets used to train:` to your model page and link the relevant datasets, if they're available on the Hub.

Dataset and language identifiers are those listed on the [Datasets](https://huggingface.co/datasets) and [Languages](https://huggingface.co/languages) pages.
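The `README.md` layout described above, a YAML metadata section followed by the Markdown body, can be sketched with a minimal stdlib-only splitter. In practice, the `huggingface_hub` library's model card tooling handles this for you; this hypothetical helper only illustrates the file structure:

```python
def split_model_card(readme: str):
    """Split a model card into (yaml_metadata, markdown_body).

    Assumes the card may start with a YAML section delimited by `---` lines.
    """
    lines = readme.splitlines()
    if lines and lines[0].strip() == "---" and "---" in lines[1:]:
        end = lines[1:].index("---") + 1  # index of the closing '---'
        return "\n".join(lines[1:end]), "\n".join(lines[end + 1:])
    return "", readme  # no metadata section

card = """---
license: mit
datasets:
- stanfordnlp/imdb
---
# My model

Some description.
"""
meta, body = split_model_card(card)
print(meta.splitlines()[0])  # license: mit
```

A card without a leading `---` block is returned unchanged, with an empty metadata string.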
### Adding metadata to your model card

There are a few different ways to add metadata to your model card, including:

- Using the metadata UI
- Directly editing the YAML section of the `README.md` file
- Via the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub) Python library; see the [docs](https://huggingface.co/docs/huggingface_hub/guides/model-cards#update-metadata) for more details.

Many libraries with [Hub integration](./models-libraries) will automatically add metadata to the model card when you upload a model.

#### Using the metadata UI

You can add metadata to your model card using the metadata UI. To access the metadata UI, go to the model page and click on the `Edit model card` button in the top right corner of the model card. This will open an editor showing the model card `README.md` file, as well as a UI for editing the metadata.

This UI will allow you to add key metadata to your model card, and many of the fields will autocomplete based on the information you provide. Using the UI is the easiest way to add metadata to your model card, but it doesn't support all of the metadata fields. If you want to add metadata that isn't supported by the UI, you can edit the YAML section of the `README.md` file directly.

#### Editing the YAML section of the `README.md` file

You can also directly edit the YAML section of the `README.md` file. If the model card doesn't already have a YAML section, you can add one by adding three `---` at the top of the file, then include all of the relevant metadata, and close the section with another group of `---` like the example below:

```yaml
---
language:
- "List of ISO 639-1 code for your language"
- lang1
- lang2
thumbnail: "url to a thumbnail used in social sharing"
tags:
- tag1
- tag2
license: "any valid license identifier"
datasets:
- dataset1
- dataset2
base_model: "base model Hub identifier"
---
```

You can find the detailed model card metadata specification here.
### Specifying a library

You can specify the supported libraries in the model card metadata section. Find more about our supported libraries [here](./models-libraries). The library will be specified in the following order of priority:

1. Specifying `library_name` in the model card (recommended if your model is not a `transformers` model). This information can be added via the metadata UI or directly in the model card YAML section:

```yaml
library_name: flair
```

2. Having a tag with the name of a library that is supported:

```yaml
tags:
- flair
```

If it's not specified, the Hub will try to automatically detect the library type. However, this approach is discouraged, and repo creators should use the explicit `library_name` as much as possible.

1. By looking into the presence of files such as `*.nemo` or `*.mlmodel`, the Hub can determine if a model is from NeMo or CoreML.
2. In the past, if nothing was detected and there was a `config.json` file, it was assumed the library was `transformers`. For model repos created after August 2024, this is not the case anymore, so you need to set `library_name: transformers` explicitly.

### Specifying a base model

If your model is a fine-tune, an adapter, or a quantized version of a base model, you can specify the base model in the model card metadata section. This information can also be used to indicate if your model is a merge of multiple existing models. Hence, the `base_model` field can either be a single model ID, or a list of one or more base models (specified by their Hub identifiers).

```yaml
base_model: HuggingFaceH4/zephyr-7b-beta
```

This metadata will be used to display the base model on the model page.
Users can also use this information to filter models by base model or find models that are derived from a specific base model, whether it is a fine-tuned model, an adapter (LoRA, PEFT, etc.), a quantized version of another model, or a merge of two or more models.

In the merge case, you specify a list of two or more base models:

```yaml
base_model:
- Endevor/InfinityRP-v1-7B
- l3utterfly/mistral-7b-v0.1-layla-v4
```

The Hub will infer the type of relationship from the current model to the base model (`"adapter"`, `"merge"`, `"quantized"`, `"finetune"`), but you can also set it explicitly if needed: `base_model_relation: quantized`, for instance.

### Specifying a new version

If a new version of your model is available in the Hub, you can specify it in a `new_version` field. For example, on `l3utterfly/mistral-7b-v0.1-layla-v3`:

```yaml
new_version: l3utterfly/mistral-7b-v0.1-layla-v4
```

This metadata will be used to display a link to the latest version of a model on the model page. If the model linked in `new_version` also has a `new_version` field, the very latest version will always be displayed.

### Specifying a dataset

You can specify the datasets used to train your model in the model card metadata section. The datasets will be displayed on the model page and users will be able to filter models by dataset. You should use the Hub dataset identifier, which is the same as the dataset's repo name:

```yaml
datasets:
- stanfordnlp/imdb
- HuggingFaceFW/fineweb
```

### Specifying a bucket

You can specify the [storage buckets](./storage-buckets) linked to your model in the model card metadata section. The buckets will be shown as tags on the model page and the linked bucket pages will show the model in return. You should use the Hub bucket identifier, which is the same as the bucket's repo name:

```yaml
buckets:
- my-org/my-bucket
- my-org/another-bucket
```

### Specifying a task (`pipeline_tag`)

You can specify the `pipeline_tag` in the model card metadata.
The `pipeline_tag` indicates the type of task the model is intended for. This tag will be displayed on the model page and users can filter models on the Hub by task.

This tag is also used to determine which [widget](./models-widgets#enabling-a-widget) to use for the model and which APIs to use under the hood.

For `transformers` models, the pipeline tag is automatically inferred from the model's `config.json` file, but you can override it in the model card metadata if required. Editing this field in the metadata UI will ensure that the pipeline tag is valid. Some other libraries with Hub integration will also automatically add the pipeline tag to the model card metadata.

### Specifying a license

You can specify the license in the model card metadata section. The license will be displayed on the model page and users will be able to filter models by license. Using the metadata UI, you will see a dropdown of the most common licenses.

If required, you can also specify a custom license by adding `other` as the license value and specifying the name and a link to the license in the metadata:

```yaml
# Example from https://huggingface.co/coqui/XTTS-v1
---
license: other
license_name: coqui-public-model-license
license_link: https://coqui.ai/cpml
---
```

If the license is not available via a URL, you can link to a LICENSE file stored in the model repo.

### Evaluation Results

You can specify your **model's evaluation results** in a structured way in the model card metadata. Results are parsed by the Hub and displayed in a widget on the model page. Here is an example of how it looks for the [bigcode/starcoder](https://huggingface.co/bigcode/starcoder) model:

The initial metadata spec was based on Papers with Code's [model-index specification](https://github.com/paperswithcode/model-index). This allowed us to directly index the results into Papers with Code's leaderboards when appropriate. You can also link the source from which the eval results have been computed.
> [!TIP]
> NEW: We have a new, simpler metadata format for eval results. Check it out in [the dedicated doc page](./eval-results).

Here is a partial example of a model-index describing [01-ai/Yi-34B](https://huggingface.co/01-ai/Yi-34B)'s score on the ARC benchmark. The result came from the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), which is defined as the `source`:

```yaml
---
model-index:
- name: Yi-34B
  results:
  - task:
      type: text-generation
    dataset:
      name: ai2_arc
      type: ai2_arc
    metrics:
    - name: AI2 Reasoning Challenge (25-Shot)
      type: AI2 Reasoning Challenge (25-Shot)
      value: 64.59
    source:
      name: Open LLM Leaderboard
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
---
```

For more details on how to format this data, check out the [Model Card specifications](https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1).

### CO2 Emissions

The model card is also a great place to show information about the CO2 impact of your model. Visit our [guide on tracking and reporting CO2 emissions](./model-cards-co2) to learn more.

### Linking a Paper

If the model card includes a link to a Paper page (either on HF or an arXiv abstract/PDF), the Hugging Face Hub will extract the arXiv ID and include it in the model tags with the format `arxiv:<PAPER ID>`. Clicking on the tag will let you:

* Visit the Paper page
* Filter for other models on the Hub that cite the same paper.

Read more about Paper pages [here](./paper-pages).

## Model Card text

Details on how to fill out the human-readable portion of the model card (so that it may be printed out, cut+pasted, etc.) are available in the [Annotated Model Card](./model-card-annotated).

## FAQ

### How are model tags determined?

Each model page lists all the model's tags in the page header, below the model name.
These are primarily computed from the model card metadata, although some are added automatically, as described in [Enabling a Widget](./models-widgets#enabling-a-widget).

### Can I add custom tags to my model?

Yes, you can add custom tags to your model by adding them to the `tags` field in the model card metadata. The metadata UI will suggest some popular tags, but you can add any tag you want. For example, you could indicate that your model is focused on finance by adding a `finance` tag.

### How can I indicate that my model is not suitable for all audiences?

You can add a `not-for-all-audiences` tag to your model card metadata. When this tag is present, a message will be displayed on the model page indicating that the model is not for all audiences. Users can click through this message to view the model card.

### How can I display different images for dark and light mode?

You can display different versions of an image optimized for each theme. This is particularly useful for logos, diagrams, or screenshots that need different color schemes to maintain visibility and aesthetics across light and dark modes. To use this feature, you'll need to provide both versions of your image.
**For images uploaded via the markdown editor**

When you upload an image directly from the markdown editor (using drag-and-drop), append the URI fragment `#hf-light-mode-only` or `#hf-dark-mode-only` to the end of the image URL to specify which theme it should display in:

```markdown
Image only displays when viewing in light mode
![Logo](https://cdn-uploads.huggingface.co/production/uploads/logo-light.png#hf-light-mode-only)

Image only displays when viewing in dark mode
![Logo](https://cdn-uploads.huggingface.co/production/uploads/logo-dark.png#hf-dark-mode-only)
```

**For already hosted images**

If you want to reference images that are already hosted without re-uploading them, use HTML `<img>` tags with Tailwind CSS classes to specify which theme each image should display in:

```html
<!-- Image only displays when viewing in dark mode -->
<img src="https://example.com/logo-dark.png" class="hidden dark:block" />

<!-- Image only displays when viewing in light mode -->
<img src="https://example.com/logo-light.png" class="dark:hidden" />
```

### Can I write LaTeX in my model card?

Yes! The Hub uses the [KaTeX](https://katex.org/) math typesetting library to render math formulas server-side before parsing the Markdown.

You have to use the following delimiters:

- `$$ ... $$` for display mode
- `\\(...\\)` for inline mode (no space between the slashes and the parenthesis).

Then you'll be able to write:

$$ \LaTeX $$

$$ \mathrm{MSE} = \left(\frac{1}{n}\right)\sum_{i=1}^{n}(y_{i} - x_{i})^{2} $$

$$ E=mc^2 $$

### Evidence on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-evidence.md

# Evidence on Spaces

**Evidence** is an open-source framework designed for building data-driven applications, reports, and dashboards using SQL and Markdown. With Evidence, you can quickly create decision-support tools, reports, and interactive dashboards without relying on traditional drag-and-drop business intelligence (BI) platforms.

Evidence enables you to:

- Write reports and dashboards directly in Markdown with SQL-backed components.
- Integrate data from multiple sources, including SQL databases and APIs.
- Use templated pages to automatically generate multiple pages based on a single template.
- Deploy reports seamlessly to various hosting solutions.

Visit [Evidence's documentation](https://docs.evidence.dev/) for guides, examples, and best practices for using Evidence to create data products.

## Deploy Evidence on Spaces

You can deploy Evidence on Hugging Face Spaces with just a few clicks. Once created, the Space will display a `Building` status. Refresh the page if the status doesn't automatically update to `Running`. Your Evidence app will automatically be deployed on Hugging Face Spaces.

## Editing your Evidence app from the CLI

To edit your app, clone the Space and edit the files locally.

```bash
git clone https://huggingface.co/spaces/your-username/your-space-name
cd your-space-name
npm install
npm run sources
npm run dev
```

You can then modify `pages/index.md` to change the content of your app.

## Editing your Evidence app from VS Code

The easiest way to develop with Evidence is using the [VS Code extension](https://marketplace.visualstudio.com/items?itemName=Evidence.evidence-vscode):

1. Install the extension from the VS Code Marketplace
2. Open the Command Palette (Ctrl/Cmd + Shift + P) and enter `Evidence: Copy Existing Project`
3. Paste the URL of the Hugging Face Spaces Evidence app you'd like to copy (e.g. `https://huggingface.co/spaces/your-username/your-space-name`) and press Enter
4. Select the folder you'd like to clone the project to and press Enter
5. Press `Start Evidence` in the bottom status bar

Check out the docs for [alternative install methods](https://docs.evidence.dev/getting-started/install-evidence), GitHub Codespaces, and running alongside dbt.
## Learning More

- [Docs](https://docs.evidence.dev/)
- [GitHub](https://github.com/evidence-dev/evidence)
- [Slack Community](https://slack.evidence.dev/)
- [Evidence Home Page](https://www.evidence.dev)

### Datasets Download Stats

https://huggingface.co/docs/hub/datasets-download-stats.md

# Datasets Download Stats

## How are downloads counted for datasets?

Counting the number of downloads for datasets is not a trivial task, as a single dataset repository might contain multiple files, from multiple subsets and splits (e.g. train/validation/test), and sometimes with many files in a single split. To solve this issue and avoid counting one person's download multiple times, we treat all files downloaded by a user (based on their IP address) within a 5-minute window in a given repository as a single dataset download. This counting happens automatically on our servers when files are downloaded (through GET or HEAD requests), with no need to collect any user information or make additional calls.

## Before September 2024

The Hub used to provide download stats only for the datasets loadable via the `datasets` library. To determine the number of downloads, the Hub previously counted every time `load_dataset` was called in Python, excluding Hugging Face's CI tooling on GitHub. No information was sent from the user, and no additional calls were made for this. The count was done server-side as we served files for downloads.

This meant that:

* The download count was the same regardless of whether the data is directly stored on the Hub repo or if the repository has a [script](/docs/datasets/dataset_script) to load the data from an external source.
* If a user manually downloaded the data using tools like `wget` or the Hub's user interface (UI), those downloads were not included in the download count.
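The 5-minute deduplication window described above can be sketched as follows. This is a simplified, hypothetical model of the counting logic; the real counting happens server-side on the Hub:

```python
WINDOW = 5 * 60  # seconds

def count_downloads(events, window=WINDOW):
    """Count dataset downloads from (timestamp, ip, repo) events sorted by time.

    All files fetched by the same IP from the same repo within `window`
    seconds of the last counted request are treated as one download.
    """
    last_counted = {}  # (ip, repo) -> timestamp of the last counted download
    downloads = 0
    for ts, ip, repo in events:
        key = (ip, repo)
        if key not in last_counted or ts - last_counted[key] >= window:
            downloads += 1
            last_counted[key] = ts
    return downloads

events = [
    (0, "203.0.113.7", "user/dataset"),    # counted
    (120, "203.0.113.7", "user/dataset"),  # same 5-minute window: not counted
    (400, "203.0.113.7", "user/dataset"),  # new window: counted
]
print(count_downloads(events))  # 2
```

Three file requests from one IP thus collapse into two downloads, because the first two fall inside the same window.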
### User Provisioning (SCIM)

https://huggingface.co/docs/hub/enterprise-scim.md

# User Provisioning (SCIM)

> [!WARNING]
> This feature is part of the Enterprise and Enterprise Plus plans.

SCIM (System for Cross-domain Identity Management) is a standard for automating user provisioning. It allows you to connect your Identity Provider (IdP) to Hugging Face to manage your organization's members.

SCIM works differently depending on your SSO model. For a detailed comparison, see the [SSO overview](./enterprise-sso#user-provisioning-scim).

## Basic SSO: invitation-based provisioning

With [Basic SSO](./security-sso-basic) (Enterprise plan), SCIM automates the **invitation** of existing Hugging Face users to your organization.

- Users **must already have a Hugging Face account** before they can be provisioned via SCIM
- When your IdP provisions a user, Hugging Face sends them an **invitation email** to join the organization
- The user must **accept the invitation** to become a member; provisioning does not grant immediate access
- SCIM **cannot modify** user profile information (name, email, username); the user retains full control of their Hugging Face account
- When a user is deprovisioned in your IdP, their invitation is deactivated and their access to the organization is revoked

## Managed SSO: full lifecycle provisioning

With [Managed SSO](./enterprise-advanced-sso) (Enterprise Plus plan), SCIM manages the **entire user lifecycle** on Hugging Face.
- SCIM **creates a new Hugging Face account** when a user is provisioned; no pre-existing account is needed
- The user is **immediately added** to the organization as a member, with no invitation step
- SCIM **can update** user profile information (name, email, username) as changes occur in your IdP
- When a user is deprovisioned in your IdP, their Hugging Face account is deactivated and their access is revoked

## How to enable SCIM

To enable SCIM, go to your organization's settings, navigate to the **SSO** tab, and then select the **SCIM** sub-tab. You will find the **SCIM Tenant URL** and a button to generate a **SCIM token**. You will need both of these to configure your IdP. The SCIM token is a secret and should be stored securely in your IdP's configuration.

Once SCIM is enabled in your IdP, provisioned users will appear in the **Users Management** tab and provisioned groups will appear in the **SCIM** tab in your organization's settings.

## Group provisioning

In addition to user provisioning, SCIM supports **group provisioning**. Groups pushed from your IdP are stored as SCIM groups on Hugging Face and can be linked to [Resource Groups](./enterprise-resource-groups) from the **SCIM** tab in your organization's settings.

### Linking a SCIM group to a Resource Group

To link a SCIM group, go to your organization's **SSO → SCIM** tab. Provisioned groups are listed in a table. In the **Resource Groups** column, each group shows either a **Link resource groups** button (if no links exist yet) or the number of currently linked resource groups (e.g. "2 resource groups"). Clicking either opens a modal where you can add one or more Resource Groups, each with its own role assignment. You can also change or remove existing links from the same modal.

Before linking, make sure the following conditions are met:

- The Resource Group must have **no existing members**. Linking to a non-empty Resource Group is not allowed.
- The Resource Group must **not have auto-join enabled**. Auto-join (which automatically adds every new org member to the RG) is mutually exclusive with SCIM management. Disable auto-join on the RG before linking.

A SCIM group can be linked to multiple Resource Groups, each with its own role.

### What happens after linking

Once a SCIM group is linked to a Resource Group:

- **Backfill**: Any members already in the SCIM group are immediately added to the Resource Group at the configured role.
- **Ongoing sync**: Membership changes in your IdP are automatically reflected:
  - When a user is **added** to the group in your IdP, they are added to all linked Resource Groups.
  - When a user is **removed** from the group in your IdP, they are removed from all linked Resource Groups, except those the user is linked to through other SCIM groups. For those, the user's role will be updated to the "highest" role granted by the other SCIM groups.
  - When a SCIM group is **deleted** in your IdP, all its members are removed from the linked Resource Groups, except for users who belong to those Resource Groups through other SCIM groups. For each of those Resource Groups, users' roles are updated to the "highest" role granted by the other SCIM groups.
- **Role changes**: If you update the role on a link, all current group members' roles in that Resource Group are updated immediately.

### SCIM-managed Resource Groups

A Resource Group linked to a SCIM group is considered **SCIM-managed**. The IdP is the sole source of truth for its membership. As a result:

- Manual membership changes via the Hub UI or API are **blocked**: any attempt to add, remove, or change a member's role on a SCIM-managed Resource Group will return a `403` error.
- Auto-join **cannot be enabled** on a SCIM-managed Resource Group. To re-enable auto-join, first remove the SCIM link.

Group provisioning works the same way for both Basic SSO and Managed SSO.
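The "highest role" resolution described above can be sketched as a small helper. The role names and their ordering here are assumptions for illustration, not the Hub's actual implementation:

```python
# Assumed ordering from lowest to highest privilege (illustrative only).
ROLE_RANK = {"read": 0, "contributor": 1, "write": 2, "admin": 3}

def effective_role(roles_from_remaining_groups):
    """Resolve a user's role in a Resource Group from the roles granted by
    all SCIM groups that still contain the user; None means full removal."""
    if not roles_from_remaining_groups:
        return None
    return max(roles_from_remaining_groups, key=ROLE_RANK.__getitem__)

# User removed from one SCIM group but still granted roles by two others:
print(effective_role(["read", "write"]))  # write
print(effective_role([]))                 # None: removed from the Resource Group
```

Taking the maximum over the remaining grants mirrors the sync behavior: removal from one SCIM group only downgrades a user as far as their other SCIM groups allow.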
## Supported user attributes

The Hugging Face SCIM endpoint supports the following user attributes:

| Attribute | Description | Basic SSO | Managed SSO |
| --- | --- | --- | --- |
| `userName` | Hugging Face username | Read-only | Read/Write |
| `name.givenName` | First name | Read-only | Read/Write |
| `name.familyName` | Last name | Read-only | Read/Write |
| `emails[type eq "work"].value` | Email address | Read-only | Read/Write |
| `externalId` | IdP-assigned identifier | Read/Write | Read/Write |
| `active` | Whether the user is an active member | Read/Write | Read/Write |

With Basic SSO, only `active` and `externalId` can be modified via SCIM; all other attributes are controlled by the user on their Hugging Face account.

For group provisioning, the supported attributes are `displayName`, `members`, and `externalId`.

## Deprovisioning

Deprovisioning behavior depends on how the user is removed and which SSO model you use.

**Setting `active` to `false`** (soft deprovision):

- The user loses access to the organization
- With Basic SSO: the invitation is deactivated
- With Managed SSO: the user is removed from the organization but their account and content are preserved; this is **reversible** by setting `active` back to `true`

**Deleting the user via SCIM** (hard deprovision):

- With Basic SSO: the user is removed from the organization and all its resource groups. Their Hugging Face account and personal content are **not affected**; they simply lose membership in your organization.
- With Managed SSO: the user's Hugging Face account is **permanently deleted**, along with all content they created. This action is **irreversible**.

## Supported Identity Providers

We support SCIM with any IdP that implements the SCIM 2.0 protocol.
We have specific guides for some of the most popular providers:

- [How to configure SCIM with Microsoft Entra ID](./security-sso-entra-id-scim)
- [How to configure SCIM with Okta](./security-sso-okta-scim)

### Using PaddleNLP at Hugging Face

https://huggingface.co/docs/hub/paddlenlp.md

# Using PaddleNLP at Hugging Face

Leveraging the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) framework, [`PaddleNLP`](https://github.com/PaddlePaddle/PaddleNLP) is an easy-to-use and powerful NLP library with an awesome pre-trained model zoo, supporting a wide range of NLP tasks from research to industrial applications.

## Exploring PaddleNLP in the Hub

You can find `PaddleNLP` models by filtering at the left of the [models page](https://huggingface.co/models?library=paddlenlp&sort=downloads).

All models on the Hub come with the following features:

1. An automatically generated model card with a brief description and metadata tags that help with discoverability.
2. An interactive widget you can use to play with the model directly in the browser.
3. An Inference Providers widget that lets you make inference requests.
4. Easy deployment of your model as a Gradio app on Spaces.

## Installation

To get started, you can follow the [PaddlePaddle Quick Start](https://www.paddlepaddle.org.cn/en/install) to install the PaddlePaddle framework for your favorite OS, package manager, and compute platform.

`paddlenlp` offers a quick one-line install through pip:

```
pip install -U paddlenlp
```

## Using existing models

Similar to `transformers` models, the `paddlenlp` library provides a simple one-liner to load models from the Hugging Face Hub by setting `from_hf_hub=True`! Depending on how you want to use them, you can use the high-level API via the `Taskflow` class, or you can use `AutoModel` and `AutoTokenizer` for more control.
```py
# Taskflow provides a simple end-to-end capability and a more optimized experience for inference
from paddlenlp import Taskflow

taskflow = Taskflow("fill-mask", task_path="PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)

# If you want more control, you will need to define the tokenizer and model.
from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)
model = AutoModelForMaskedLM.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)
```

If you want to see how to load a specific model, you can click `Use in paddlenlp` and you will be given a working snippet to load it!

## Sharing your models

You can share your `PaddleNLP` models by using the `save_to_hf_hub` method, available on all `Model` and `Tokenizer` classes.

```py
from paddlenlp.transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)
model = AutoModelForMaskedLM.from_pretrained("PaddlePaddle/ernie-1.0-base-zh", from_hf_hub=True)

tokenizer.save_to_hf_hub(repo_id="<my_org_name>/<my_repo_name>")
model.save_to_hf_hub(repo_id="<my_org_name>/<my_repo_name>")
```

## Additional resources

- PaddlePaddle installation [guide](https://www.paddlepaddle.org.cn/en/install).
- PaddleNLP [GitHub Repo](https://github.com/PaddlePaddle/PaddleNLP).
- [PaddlePaddle on the Hugging Face Hub](https://huggingface.co/PaddlePaddle)

### How to configure SCIM with Okta

https://huggingface.co/docs/hub/security-sso-okta-scim.md

# How to configure SCIM with Okta

This guide explains how to set up SCIM user and group provisioning between Okta and your Hugging Face organization.

> [!WARNING]
> This feature is part of the Enterprise and Enterprise Plus plans.

## Step 1: Get SCIM configuration from Hugging Face

1. Navigate to your organization's settings page on Hugging Face.
2. Go to the **SSO** tab, then click on the **SCIM** sub-tab.
3. Copy the **SCIM Tenant URL**. You will need this for the Okta configuration.
4. Click **Generate an access token**. A new SCIM token will be generated. Copy this token immediately and store it securely, as you will not be able to see it again.

## Step 2: Enter Admin Credentials

1. In Okta, go to **Applications** and select your Hugging Face app.
2. Go to the **General** tab and click **Edit** on App Settings.
3. For the Provisioning option, select **SCIM** and click **Save**.
4. Go to the **Provisioning** tab and click **Edit**.
5. Enter the **SCIM Tenant URL** as the SCIM connector base URL.
6. Enter **userName** as the unique identifier field for users.
7. Select all necessary actions for Supported provisioning actions.
8. Select **HTTP Header** for Authentication Mode.
9. Enter the **Access Token** you generated as the Authorization Bearer Token.
10. Click **Test Connector Configuration** to verify the connection.
11. Save your changes.

## Step 3: Configure Provisioning

1. In the **Provisioning** tab, click **To App** in the side nav.
2. Click **Edit** and enable all the features you need, e.g. Create, Update, and Delete Users.
3. Click **Save** at the bottom.

## Step 4: Configure Attribute Mappings

1. While still in the **Provisioning** tab, scroll down to the Attribute Mappings section.
2. The default attribute mappings often require adjustments for robust provisioning. We recommend using the following configuration. You can delete attributes that are not listed here:

## Step 5: Assign Users or Groups

1. Visit the **Assignments** tab and click **Assign**.
2. Click **Assign to People** or **Assign to Groups**.
3. After finding the user or group that needs to be assigned, click **Assign** next to their name.
4. In the mapping modal, the Username needs to be edited to comply with the following rules.

> [!WARNING]
> Only regular characters and `-` are accepted in the Username.
> `--` (double dash) is forbidden.
> `-` cannot start or end the name.
> Digit-only names are not accepted.
> Minimum length is 2 and maximum length is 42.
> Username has to be unique within your org.

5. Scroll down and click **Save and Go Back**.
6. Click **Done**.
7. Confirm that users or groups are created, updated, or deactivated in your Hugging Face organization as expected.

## Step 6: Push Okta Groups to Hugging Face via SCIM

Before you can link groups to Hugging Face Resource Groups, you need to push your Okta groups to Hugging Face using the **Push Groups** tab. This is separate from assigning users to the app in Step 5.

> [!WARNING]
> Okta does not support using the same group for app assignment (Step 5) and Group Push. Use a dedicated group for pushing, and keep your push groups separate from your assignment groups.

1. In the Okta Admin Console, go to **Applications** and select your Hugging Face app.
2. Click the **Push Groups** tab.
3. Click **+ Push Groups** and select **Find groups by name**.
4. Search for the Okta group you want to push and select it from the results.
5. Choose how to handle the group in Hugging Face:
   - **Create Group**: Creates a new SCIM group in your Hugging Face organization.
   - **Link Group**: Links to an existing group already in your Hugging Face organization.
6. Click **Save**. To push additional groups, click **Save & Add Another** and repeat.

Once pushed, the group will appear under **SCIM Groups** in your Hugging Face organization settings (SSO → SCIM tab). Any membership changes you make to the group in Okta will automatically sync to Hugging Face.

## Step 7: Link SCIM Groups to Hugging Face Resource Groups

Once your groups are provisioned from Okta, you can link them to Hugging Face Resource Groups to manage permissions at scale. This allows all members of a SCIM group to automatically receive specific roles (like read or write) for a collection of resources.

> [!NOTE]
> Before linking, make sure the Resource Group you want to link is **empty** (has no existing members) and does **not** have auto-join enabled.
> Both conditions are required; linking will fail otherwise.

1. In your Hugging Face organization settings, navigate to the **SSO** -> **SCIM** tab. You will see a list of your provisioned groups under **SCIM Groups**.
2. Locate the group you wish to configure and click **Link resource groups** in its row.
3. A dialog will appear. Click **Link a Resource Group**.
4. From the dropdown menus, select the **Resource Group** you want to link and the **Role Assignment** you want to grant to the members of the SCIM group.
5. Click **Link to SCIM group** and save the mapping.

Once linked, the Resource Group becomes **SCIM-managed**: any members already in the SCIM group are immediately added to the Resource Group (backfill), and all future membership changes in Okta are automatically reflected. Manual membership edits on the Resource Group via the Hub UI or API will be blocked.

### Distilabel

https://huggingface.co/docs/hub/datasets-distilabel.md

# Distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable, and scalable pipelines based on verified research papers.

Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative or large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback.

## What do people build with distilabel?

The Argilla community uses distilabel to create amazing [datasets](https://huggingface.co/datasets?other=distilabel) and [models](https://huggingface.co/models?other=distilabel).
- The [1M OpenHermesPreference](https://huggingface.co/datasets/argilla/OpenHermesPreferences) is a dataset of ~1 million AI preferences generated from the [teknium/OpenHermes-2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) dataset. It is a great example of how you can use distilabel to scale up dataset development.
- The [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) was used to fine-tune the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B). This dataset was built by combining human curation in Argilla with AI feedback from distilabel, leading to an improved version of the Intel Orca dataset; models fine-tuned on it outperform models fine-tuned on the original dataset.
- The [haiku DPO data](https://github.com/davanstrien/haiku-dpo) is an example of how anyone can create a synthetic dataset for a specific task, which after curation and evaluation can be used for fine-tuning custom LLMs.

## Prerequisites

First, [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login):

```bash
hf auth login
```

Make sure you have `distilabel` installed:

```bash
pip install -U distilabel[vllm]
```

## Distilabel pipelines

Distilabel pipelines can be built with any number of interconnected steps or tasks. The output of one step or task is fed as input to another. A series of steps can be chained together to build complex data processing and generation pipelines with LLMs. The input of each step is a batch of data, containing a list of dictionaries, where each dictionary represents a row of the dataset and the keys are the column names. To feed data from and to the Hugging Face Hub, we've defined a `Distiset` class as an abstraction of a `datasets.DatasetDict`.

## Distiset as dataset object

A Pipeline in distilabel returns a special type of Hugging Face `datasets.DatasetDict` called a `Distiset`.
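The batch-of-dictionaries data model that flows between pipeline steps can be illustrated with plain Python. This toy sketch is not the distilabel API (real steps subclass distilabel's step and task classes); it only shows the shape of the data each step receives and returns:

```python
# Toy illustration of how data flows between distilabel pipeline steps:
# each step receives a batch (a list of dicts, one per dataset row)
# and returns a transformed batch for the next step.

def load_step():
    # A loading step emits the initial batch of rows.
    return [{"instruction": "Write a haiku"}, {"instruction": "Summarize a paper"}]

def generate_step(batch):
    # A real task would call an LLM here; we fake a "generation" column.
    return [{**row, "generation": f"<output for: {row['instruction']}>"} for row in batch]

def keep_columns_step(batch, columns):
    # Keep only the requested columns in every row.
    return [{k: row[k] for k in columns} for row in batch]

batch = keep_columns_step(generate_step(load_step()), columns=["instruction", "generation"])
print(batch[0])
```

Chaining steps with `>>` in a real pipeline wires up exactly this hand-off of batches from one step to the next.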
The Pipeline can output multiple subsets in the Distiset, which is a dictionary-like object with one entry per subset. A Distiset can then be pushed seamlessly to the Hugging Face Hub, with all the subsets in the same repository.

## Load data from the Hub to a Distiset

To showcase an example of loading data from the Hub, we will reproduce the [Prometheus 2 paper](https://arxiv.org/pdf/2405.01535) and use the PrometheusEval task implemented in distilabel. Prometheus 2 and the PrometheusEval task cover direct assessment and pairwise ranking, i.e. assessing the quality of a single isolated response for a given instruction (with or without a reference answer), and assessing the quality of one response against another for a given instruction (with or without a reference answer), respectively. We will use these tasks on a dataset loaded from the Hub, created by the Hugging Face H4 team and named [HuggingFaceH4/instruction-dataset](https://huggingface.co/datasets/HuggingFaceH4/instruction-dataset).
```python
from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromHub
from distilabel.steps.tasks import PrometheusEval

if __name__ == "__main__":
    with Pipeline(name="prometheus") as pipeline:
        load_dataset = LoadDataFromHub(
            name="load_dataset",
            repo_id="HuggingFaceH4/instruction-dataset",
            split="test",
            output_mappings={"prompt": "instruction", "completion": "generation"},
        )

        task = PrometheusEval(
            name="task",
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="absolute",
            rubric="factual-validity",
            reference=False,
            num_generations=1,
            group_generations=False,
        )

        keep_columns = KeepColumns(
            name="keep_columns",
            columns=["instruction", "generation", "feedback", "result", "model_name"],
        )

        load_dataset >> task >> keep_columns
```

Then we need to call `pipeline.run` with the runtime parameters so that the pipeline can be launched and the data can be stored in the `Distiset` object.
```python
distiset = pipeline.run(
    parameters={
        task.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 1024,
                    "temperature": 0.7,
                },
            },
        },
    },
)
```

## Push a distilabel Distiset to the Hub

Push the `Distiset` to a Hugging Face repository, where each one of the subsets will correspond to a different configuration:

```python
import os

distiset.push_to_hub(
    "my-org/my-dataset",
    commit_message="Initial commit",
    private=False,
    token=os.getenv("HF_TOKEN"),
)
```

## 📚 Resources

- [🚀 Distilabel Docs](https://distilabel.argilla.io/latest/)
- [🚀 Distilabel Docs - distiset](https://distilabel.argilla.io/latest/sections/how_to_guides/advanced/distiset/)
- [🚀 Distilabel Docs - prometheus](https://distilabel.argilla.io/1.2.0/sections/pipeline_samples/papers/prometheus/)
- [🆕 Introducing distilabel](https://argilla.io/blog/introducing-distilabel-1/)

### Gating Group Collections

https://huggingface.co/docs/hub/enterprise-gating-group-collections.md

# Gating Group Collections

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Gating Group Collections allow organizations to grant (or reject) access to all the models and datasets in a collection at once, rather than per repo. Users will only have to go through **a single access request**.

To enable Gating Group in a collection:

- the collection owner must be an organization
- the organization must be subscribed to a Team or Enterprise plan
- all models and datasets in the collection must be owned by the same organization as the collection
- each model or dataset in the collection may only belong to one Gating Group Collection (but they can still be included in non-gating, i.e. _regular_, collections).

> [!TIP]
> Gating only applies to models and datasets; any other resource that is part of the collection (such as a Space or a Paper) won't be affected.

## Manage gating group as an organization admin

To enable access requests, go to the collection page and click on **Gating group** in the bottom-right corner.
By default, gating group is disabled: click on **Configure Access Requests** to open the settings.

By default, access to the repos in the collection is automatically granted to users when they request it. This is referred to as **automatic approval**. In this mode, any user can access your repos once they've agreed to share their contact information with you.

If you want to manually approve which users can access repos in your collection, you must set it to **Manual Review**. When this is the case, you will notice a new option, **Notifications frequency**, which lets you configure when to get notified about new users requesting access. It can be set to once a day or real-time.

By default, emails are sent to the first 5 admins of the organization. You can also set a different email address in the **Notifications email** field.

### Review access requests

Once access requests are enabled, you have full control of who can access repos in your gating group collection, whether the approval mode is manual or automatic. You can review and manage requests either from the UI or via the API.

**Approving a request for a repo in a gating group collection will automatically approve access to all repos (models and datasets) in that collection.**

#### From the UI

You can review who has access to all the repos in your Gating Group Collection from the settings page of any of the repos in the collection, by clicking on the **Review access requests** button.

This will open a modal with 3 lists of users:

- **pending**: the list of users waiting for approval to access your repository. This list is empty unless you've selected **Manual Review**. You can either **Accept** or **Reject** each request. If the request is rejected, the user cannot access your repository and cannot request access again.
- **accepted**: the complete list of users with access to your repository. You can choose to **Reject** access at any time for any user, whether the approval mode is manual or automatic.
  You can also **Cancel** the approval, which will move the user to the **pending** list.
- **rejected**: the list of users you've manually rejected. Those users cannot access your repositories. If they go to your repository, they will see a message: _Your request to access this repo has been rejected by the repo's authors_.

#### Via the API

You can programmatically manage access requests in a Gating Group Collection through the API of any of its models or datasets. Visit our [gated models](https://huggingface.co/docs/hub/models-gated#via-the-api) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#via-the-api) documentation to learn more.

#### Download access report

You can download access reports for the Gating Group Collection through the settings page of any of its models or datasets. Visit our [gated models](https://huggingface.co/docs/hub/models-gated#download-access-report) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#download-access-report) documentation to learn more.

#### Customize requested information

Organizations can customize the gating parameters as well as the user information that is collected per gated repo. Please visit our [gated models](https://huggingface.co/docs/hub/models-gated#customize-requested-information) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#customize-requested-information) documentation for more details.

> [!WARNING]
> There is currently no way to customize the gate parameters and requested information in a centralized way. If you want to collect the same data no matter which of the collection's repositories a user requests access through, you need to add the same gate parameters in the metadata of all the models and datasets of the collection, and keep them synced.
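As a sketch of the API route, pending requests on any one repo of the collection can be listed and accepted programmatically. The helper below assumes the `list_pending_access_requests` and `accept_access_request` methods of `huggingface_hub.HfApi` (check your installed `huggingface_hub` version; treat the method names as an assumption):

```python
# Sketch: bulk-approve pending access requests on one repo of the collection.
# Approving a request on one repo of a Gating Group Collection approves the
# whole collection, so a single repo_id is enough.
# `api` is expected to behave like huggingface_hub.HfApi (an assumption here).

def approve_all_pending(api, repo_id: str) -> int:
    """Accept every pending access request on `repo_id`; return how many were accepted."""
    pending = api.list_pending_access_requests(repo_id)
    for request in pending:
        api.accept_access_request(repo_id, request.username)
    return len(pending)

# In practice (requires a token with write access to the org's repos):
#   from huggingface_hub import HfApi
#   approve_all_pending(HfApi(token="hf_..."), "my-org/my-gated-model")
```

The repo id `my-org/my-gated-model` is a placeholder; use any model or dataset in your Gating Group Collection.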
## Access gated repos in a Gating Group Collection as a user

A Gating Group Collection shows a specific icon next to its name.

To get access to the models and datasets in a Gating Group Collection, a single access request on the page of any of those repositories is needed. Once your request is approved, you will be able to access all the other repositories in the collection, including future ones.

Visit our [gated models](https://huggingface.co/docs/hub/models-gated#access-gated-models-as-a-user) or [gated datasets](https://huggingface.co/docs/hub/datasets-gated#access-gated-datasets-as-a-user) documentation to learn more about requesting access to a repository.

### THE LANDSCAPE OF ML DOCUMENTATION TOOLS

https://huggingface.co/docs/hub/model-card-landscape-analysis.md

# THE LANDSCAPE OF ML DOCUMENTATION TOOLS

The development of the model cards framework in 2018 was inspired by the major documentation framework efforts of Data Statements for Natural Language Processing ([Bender & Friedman, 2018](https://aclanthology.org/Q18-1041/)) and Datasheets for Datasets ([Gebru et al., 2018](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf)).

Since model cards were proposed, a number of other tools have been proposed for documenting and evaluating various aspects of the machine learning development cycle. These tools, including model cards and related documentation efforts proposed prior to model cards, can be contextualised with regard to their focus (e.g., on which part of the ML system lifecycle does the tool focus?) and their intended audiences (e.g., who is the tool designed for?). In Figures 1-2 below, we summarise several prominent documentation tools along these dimensions, provide contextual descriptions of each tool, and link to examples.
We broadly classify the documentation tools as belonging to the following groups:

* **Data-focused**, including documentation tools focused on datasets used in the machine learning system lifecycle
* **Models-and-methods-focused**, including documentation tools focused on machine learning models and methods; and
* **Systems-focused**, including documentation tools focused on ML systems, including models, methods, datasets, APIs, and non-AI/ML components that interact with each other as part of an ML system

These groupings are not mutually exclusive; they do include overlapping aspects of the ML system lifecycle. For example, **system cards** focus on documenting ML systems that may include multiple models and datasets, and thus might include content that overlaps with data-focused or model-focused documentation tools.

The tools described are a non-exhaustive list of documentation tools for the ML system lifecycle. In general, we included tools that were:

* Focused on documentation of some (or multiple) aspects of the ML system lifecycle
* Included the release of a template intended for repeated use, adoption, and adaptation

## Summary of ML Documentation Tools

### Figure 1

| **Stage of ML System Lifecycle** | **Tool** | **Brief Description** | **Examples** |
|:---:|---|---|---|
| DATA | ***Datasheets*** [(Gebru et al., 2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) | "We recommend that every dataset be accompanied with a datasheet documenting its motivation, creation, composition, intended uses, distribution, maintenance, and other information." | See, for example, [Ivy Lee's repo](https://github.com/ivylee/model-cards-and-datasheets) with examples |
| DATA | ***Data Statements*** [(Bender & Friedman, 2018)(Bender et al., 2021)](https://techpolicylab.uw.edu/wp-content/uploads/2021/11/Data_Statements_Guide_V2.pdf) | "A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software." | See [Data Statements for NLP Workshop](https://techpolicylab.uw.edu/events/event/data-statements-for-nlp/) |
| DATA | ***Dataset Nutrition Labels*** [(Holland et al., 2018)](https://huggingface.co/papers/1805.03677) | "The Dataset Nutrition Label...is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset 'ingredients' before AI model development." | See [The Data Nutrition Label](https://datanutrition.org/labels/) |
| DATA | ***Data Cards for NLP*** [(McMillan-Major et al., 2021)](https://huggingface.co/papers/2108.07374) | "We present two case studies of creating documentation templates and guides in natural language processing (NLP): the Hugging Face (HF) dataset hub[^1] and the benchmark for Generation and its Evaluation and Metrics (GEM). We use the term data card to refer to documentation for datasets in both cases." | See [(McMillan-Major et al., 2021)](https://huggingface.co/papers/2108.07374) |
| DATA | ***Dataset Development Lifecycle Documentation Framework*** [(Hutchinson et al., 2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918) | "We introduce a rigorous framework for dataset development transparency that supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle." | See [(Hutchinson et al., 2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918), Appendix A for templates |
| DATA | ***Data Cards*** [(Pushkarna et al., 2021)](https://huggingface.co/papers/2204.01075) | "Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders across a dataset's lifecycle for responsible AI development. These summaries provide explanations of processes and rationales that shape the data and consequently the models." | See the [Data Cards Playbook github](https://github.com/PAIR-code/datacardsplaybook/) |
| DATA | ***CrowdWorkSheets*** [(Díaz et al., 2022)](https://huggingface.co/papers/2206.08931) | "We introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset release and maintenance." | See [(Díaz et al., 2022)](https://huggingface.co/papers/2206.08931) |
| MODELS AND METHODS | ***Model Cards*** [Mitchell et al. (2018)](https://huggingface.co/papers/1810.03993) | "Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions...that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information." | See https://huggingface.co/models, the [Model Card Guidebook](https://huggingface.co/docs/hub/model-card-guidebook), and [Model Card Examples](https://huggingface.co/docs/hub/model-card-appendix#model-card-examples) |
| MODELS AND METHODS | ***Value Cards*** [Shen et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971) | "We present Value Cards, a deliberation-driven toolkit for bringing computer science students and practitioners the awareness of the social impacts of machine learning-based decision making systems... Value Cards encourages the investigations and debates towards different ML performance metrics and their potential trade-offs." | See [Shen et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971), Section 3.3 |
| MODELS AND METHODS | ***Method Cards*** [Adkins et al. (2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724) | "We propose method cards to guide ML engineers through the process of model development...The information comprises both prescriptive and descriptive elements, putting the main focus on ensuring that ML engineers are able to use these methods properly." | See [Adkins et al. (2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724), Appendix A |
| MODELS AND METHODS | ***Consumer Labels for ML Models*** [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) | "We propose to issue consumer labels for trained and published ML models. These labels primarily target machine learning lay persons, such as the operators of an ML system, the executors of decisions, and the decision subjects themselves" | See [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) |
| SYSTEMS | ***Factsheets*** [Arnold et al. (2019)](https://huggingface.co/papers/1808.07261) | "A FactSheet will contain sections on all relevant attributes of an AI service, such as intended use, performance, safety, and security. Performance will include appropriate accuracy or risk measures along with timing information." | See [IBM's AI Factsheets 360](https://aifs360.res.ibm.com) and [Hind et al., (2020)](https://dl.acm.org/doi/abs/10.1145/3334480.3383051) |
| SYSTEMS | ***System Cards*** [Procope et al. (2022)](https://ai.facebook.com/research/publications/system-level-transparency-of-machine-learning) | "System Cards aims to increase the transparency of ML systems by providing stakeholders with an overview of different components of an ML system, how these components interact, and how different pieces of data and protected information are used by the system." | See [Meta's Instagram Feed Ranking System Card](https://ai.facebook.com/tools/system-cards/instagram-feed-ranking/) |
| SYSTEMS | ***Reward Reports for RL*** [Gilbert et al. (2022)](https://huggingface.co/papers/2204.10817) | "We sketch a framework for documenting deployed learning systems, which we call Reward Reports...We outline Reward Reports as living documents that track updates to design choices and assumptions behind what a particular automated system is optimizing for. They are intended to track dynamic phenomena arising from system deployment, rather than merely static properties of models or data." | See https://rewardreports.github.io |
| SYSTEMS | ***Robustness Gym*** [Goel et al. (2021)](https://huggingface.co/papers/2101.04840) | "We identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies 4 standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks." | See https://github.com/robustness-gym/robustness-gym |
| SYSTEMS | ***ABOUT ML*** [Raji and Yang, (2019)](https://huggingface.co/papers/1912.06166) | "ABOUT ML (Annotation and Benchmarking on Understanding and Transparency of Machine Learning Lifecycles) is a multi-year, multi-stakeholder initiative led by PAI. This initiative aims to bring together a diverse range of perspectives to develop, test, and implement machine learning system documentation practices at scale." | See [ABOUT ML's resources library](https://partnershiponai.org/about-ml-resources-library/) |

### DATA-FOCUSED DOCUMENTATION TOOLS

Several proposed documentation tools focus on datasets used in the ML system lifecycle, including to train, develop, validate, finetune, and evaluate machine learning models as part of continuous cycles. These tools generally focus on the many aspects of the data lifecycle (perhaps for a particular dataset, group of datasets, or more broadly), including how the data was assembled, collected, annotated, and how it should be used.

* Extending the concept of datasheets in the electronics industry, [Gebru et al. (2018)](https://www.fatml.org/media/documents/datasheets_for_datasets.pdf) propose datasheets for datasets to document details related to a dataset's creation, potential uses, and associated concerns.
* [Bender and Friedman (2018)](https://aclanthology.org/Q18-1041/) propose data statements for natural language processing.
[Bender, Friedman and McMillan-Major (2021)](https://techpolicylab.uw.edu/wp-content/uploads/2021/11/Data_Statements_Guide_V2.pdf) update the original data statements framework and provide resources, including a guide for writing data statements and translating between the first version of the schema and the newer version[^2].
* [Holland et al. (2018)](https://huggingface.co/papers/1805.03677) propose data nutrition labels, akin to nutrition facts for foodstuffs and nutrition labels for privacy disclosures, as a tool for analyzing and making decisions about datasets. The Data Nutrition Label team released an updated design of and interface for the label in 2020 ([Chmielinski et al., 2020](https://huggingface.co/papers/2201.03954)).
* [McMillan-Major et al. (2021)](https://huggingface.co/papers/2108.07374) describe the development process and resulting templates for **data cards for NLP**, in the form of data cards on the Hugging Face Hub[^3] and data cards for datasets that are part of the NLP benchmark for Generation and its Evaluation Metrics (GEM) environment[^4].
* [Hutchinson et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445918) describe the need for comprehensive dataset documentation and, drawing on software development practices, provide templates for documenting several aspects of the dataset development lifecycle (for the purposes of Tables 1 and 2, we refer to their framework as the **Dataset Development Lifecycle Documentation Framework**).
* [Pushkarna et al. (2021)](https://huggingface.co/papers/2204.01075) propose data cards as part of the **data card playbook**, a human-centered documentation tool focused on datasets used in industry and research.

### MODEL-AND-METHOD-FOCUSED DOCUMENTATION TOOLS

Another set of documentation tools can be thought of as focusing on machine learning models and machine learning methods. These include:

* [Mitchell et al.
(2018)](https://huggingface.co/papers/1810.03993) propose **model cards** for model reporting, to accompany trained ML models and document considerations related to evaluation, use, and other concerns.
* [Shen et al. (2021)](https://dl.acm.org/doi/abs/10.1145/3442188.3445971) propose **value cards** for teaching students and practitioners about values related to ML models.
* [Seifert et al. (2019)](https://ris.utwente.nl/ws/portalfiles/portal/158031484/Seifert2019_cogmi_consumer_labels_preprint.pdf) propose **consumer labels for ML models** to help non-experts using or affected by the model understand key issues related to the model.
* [Adkins et al. (2022)](https://dl.acm.org/doi/pdf/10.1145/3491101.3519724) analyse aspects of descriptive documentation tools – which they consider to include **model cards** and data sheets – and argue for increased prescriptive tools for ML engineers. They propose method cards, focused on ML methods, and designed primarily with technical stakeholders like model developers and reviewers in mind.
* They envision the relationship between model cards and method cards, in part, by stating: "The sections and prompts we propose… [in the method card template] focus on ML methods that are sufficient to produce a proper ML model with defined input, output, and task. Examples for these are object detection methods such as Single-shot Detectors and language modelling methods such as Generative Pre-trained Transformers (GPT). *It is possible to create Model Cards for the models created using these methods*."
* They also state: "While Model Cards and FactSheets put main focus on documenting existing models, Method Cards focus more on the underlying methodical and algorithmic choices that need to be considered when creating and training these models.
*As a rough analogy, if Model Cards and FactSheets provide nutritional information about cooked meals, Method Cards provide the recipes*."

### SYSTEM-FOCUSED DOCUMENTATION TOOLS

Rather than focusing on particular models, datasets, or methods, system-focused documentation tools look at how models interact with each other, with datasets and methods, and with other ML components to form ML systems.

* [Procope et al. (2022)](https://ai.facebook.com/research/publications/system-level-transparency-of-machine-learning) propose system cards to document and explain AI systems – potentially including multiple ML models, AI tools, and non-AI technologies – that work together to accomplish tasks.
* [Arnold et al. (2019)](https://huggingface.co/papers/1808.07261) extend the idea of declarations of conformity for consumer products to AI services, proposing FactSheets to document aspects of "AI services", which are typically accessed through APIs and may be composed of multiple different ML models. [Hind et al. (2020)](https://dl.acm.org/doi/abs/10.1145/3334480.3383051) share reflections on building FactSheets.
* [Gilbert et al. (2022)](https://huggingface.co/papers/2204.10817) propose **Reward Reports for Reinforcement Learning** systems, recognizing the dynamic nature of ML systems and the need for documentation efforts to incorporate considerations of post-deployment performance, especially for reinforcement learning systems.
* [Goel et al. (2021)](https://huggingface.co/papers/2101.04840) develop **Robustness Gym**, an evaluation toolkit for testing several aspects of deep neural networks in real-world systems, allowing for comparison across evaluation paradigms.
* Through the [ABOUT ML project](https://partnershiponai.org/workstream/about-ml/) ([Raji and Yang, 2019](https://huggingface.co/papers/1912.06166)), the Partnership on AI is coordinating efforts across groups of stakeholders in the machine learning community to develop comprehensive, scalable documentation tools for ML systems.

## THE EVOLUTION OF MODEL CARDS

Since the proposal for model cards by Mitchell et al. in 2018, model cards have been adopted and adapted by various organisations, including by major technology companies and startups developing and hosting machine learning models[^5], researchers describing new techniques[^6], and government stakeholders evaluating models for various projects[^7]. Model cards also appear as part of AI Ethics educational toolkits, and numerous organisations and developers have created implementations for automating or semi-automating the creation of model cards. Appendix A provides a set of examples of model cards for various types of ML models created by different organisations (including model cards for large language models), model card generation tools, and model card educational tools.

### MODEL CARDS ON THE HUGGING FACE HUB

Since 2018, new platforms and mediums for hosting and sharing model cards have also emerged. For example, particularly relevant to this project, Hugging Face hosts model cards on the Hugging Face Hub as README files in the repositories associated with ML models. As a result, model cards figure as a prominent form of documentation for users of models on the Hugging Face Hub. As part of our analysis of model cards, we developed and proposed model cards for several dozen ML models on the Hugging Face Hub, using the Hub's Pull Request (PR) and Discussion features to gather feedback on model cards, verify information included in model cards, and publish model cards for models on the Hugging Face Hub.
At the time of writing of this guidebook, all of Hugging Face's models on the Hugging Face Hub have an associated model card on the Hub[^8]. The high number of models uploaded to the Hugging Face Hub (101,041 models at the time of writing) enabled us to explore the content of model cards on the Hub. We began by analysing the model cards of language models in order to identify patterns (e.g. repeated sections and subsections), with the aim of answering initial questions such as: 1) How many of these models have model cards? 2) What percentage of downloads had an associated model card? Our analysis of all the models on the Hub showed that most downloads come from the top 200 models. Continuing our focus on large language models – ordered by most downloaded and, to begin with, restricted to models with model cards – we noted the most recurring sections within their respective model cards. While some headings differ between model cards, we grouped the sections of each model card by theme and mapped them to the most frequently recurring section headings (found mostly in the top 200 most downloaded models, and with the guidance of the BLOOM model card).

> [!TIP]
> [Check out the User Studies](./model-cards-user-studies)

> [!TIP]
> [See Appendix](./model-card-appendix)

[^1]: For each tool, descriptions are excerpted from the linked paper listed in the second column.
[^2]: See https://techpolicylab.uw.edu/data-statements/.
[^3]: See https://techpolicylab.uw.edu/data-statements/.
[^4]: See https://techpolicylab.uw.edu/data-statements/.
[^5]: See, e.g., the Hugging Face Hub and Google Cloud's Model Cards: https://modelcards.withgoogle.com/about.
[^6]: See Appendix A.
[^7]: See the GSA / US Census Bureau Collaboration on Model Card Generator.
[^8]: By "Hugging Face models", we mean models shared by Hugging Face, not another organisation, on the Hub. Formally, these are models without a '/' in their model ID.
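The criterion in footnote 8 can be checked mechanically; here is a minimal sketch (the helper name is ours, and the two model IDs are only illustrative):

```python
def is_hugging_face_model(model_id: str) -> bool:
    """Models shared by Hugging Face itself have no namespace prefix,
    i.e. no '/' in their model ID."""
    return "/" not in model_id

print(is_hugging_face_model("bert-base-uncased"))  # no namespace -> True
print(is_hugging_face_model("bigscience/bloom"))   # org namespace -> False
```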
---

**Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook

### Livebook on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-livebook.md

# Livebook on Spaces

**Livebook** is an open-source tool for writing interactive code notebooks in [Elixir](https://elixir-lang.org/). It's part of a growing collection of Elixir tools for [numerical computing](https://github.com/elixir-nx/nx), [data science](https://github.com/elixir-nx/explorer), and [Machine Learning](https://github.com/elixir-nx/bumblebee).

Some of Livebook's most exciting features are:

- **Reproducible workflows**: Livebook runs your code in a predictable order, all the way down to package management
- **Smart cells**: perform complex tasks, such as data manipulation and running machine learning models, with a few clicks using Livebook's extensible notebook cells
- **Elixir powered**: use the power of the Elixir programming language to write concurrent and distributed notebooks that scale beyond your machine

To learn more about it, watch this [15-minute video](https://www.youtube.com/watch?v=EhSNXWkji6o), visit [Livebook's website](https://livebook.dev/), or follow its [Twitter](https://twitter.com/livebookdev) and [blog](https://news.livebook.dev/) to keep up with new features and updates.

## Your first Livebook Space

You can get Livebook up and running in a Space with just a few clicks. Click the button below to start creating a new Space using Livebook's Docker template:

Then:

1. Give your Space a name
2. Set the password of your Livebook
3. Set its visibility to public
4. Create your Space

![Creating a Livebook Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-new-space.png)

This will start building your Space using Livebook's Docker image.
The visibility of the Space must be set to public for the Smart cells feature in Livebook to function properly. However, your Livebook instance will be protected by Livebook authentication.

> [!TIP]
> Smart cell is a type of Livebook cell that provides a UI component for accomplishing a specific task. The code for the task is generated automatically based on the user's interactions with the UI, allowing for faster completion of high-level tasks without writing code from scratch.

Once the app build is finished, go to the "App" tab in your Space and log in to your Livebook using the password you previously set:

![Livebook authentication](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-authentication.png)

That's it! Now you can start using Livebook inside your Space. If this is your first time using Livebook, you can learn how to use it with its interactive notebooks within Livebook itself:

![Livebook's learn notebooks](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-learn-section.png)

## Livebook integration with Hugging Face Models

Livebook has an [official integration with Hugging Face models](https://livebook.dev/integrations/hugging-face). With this feature, you can run various Machine Learning models within Livebook with just a few clicks. Here's a quick video showing how to do that:

## How to update Livebook's version

To update Livebook to its latest version, go to the Settings page of your Space and click on "Factory reboot this Space":

![Factory reboot a Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/spaces-livebook-factory-reboot.png)

## Caveats

The following caveats apply to running Livebook inside a Space:

- The Space's visibility setting must be public. Otherwise, Smart cells won't work. That said, your Livebook instance will still be behind Livebook authentication since you've set the `LIVEBOOK_PASSWORD` secret.
- Livebook global configurations will be lost once the Space restarts. Consider using the [desktop app](https://livebook.dev/#install) if you need to persist configuration across deployments.

## Feedback and support

If you have improvement suggestions or need specific support, please join the [Livebook community on GitHub](https://github.com/livebook-dev/livebook/discussions).

### Spaces Custom Domain

https://huggingface.co/docs/hub/spaces-custom-domain.md

# Spaces Custom Domain

> [!WARNING]
> This feature is part of PRO or Team & Enterprise plans.

## Getting started with a Custom Domain

Spaces Custom Domain allows you to host your Space on a custom domain of your choosing: `yourdomain.example.com`. The custom domain must be a valid DNS name.

> [!NOTE]
> Custom domains require your Space to have **public** or **protected** visibility. They are not supported on private Spaces.

### Setting up your domain

You can submit a custom domain in the settings of your Space, under "Custom Domain". You'll need to add a CNAME record pointing your domain to `hf.space`:

### Verifying your domain

After submission, the request will move to "pending" status:

Once the DNS is properly configured, you'll see a "ready" status confirming the custom domain is active for your Space. If you've completed all the steps but aren't seeing a "ready" status, you can enter your domain [here](https://toolbox.googleapps.com/apps/dig/#CNAME/) to verify it points to `hf.space`. If it doesn't, please check with your domain host to ensure the CNAME record was added correctly.

## Removing a Custom Domain

Simply remove a custom domain by using the delete button to the right of "Custom Domain" in the settings of your Space. You can delete while the custom domain is in pending or ready state.

### Storage Regions on the Hub

https://huggingface.co/docs/hub/storage-regions.md

# Storage Regions on the Hub

> [!WARNING]
> This feature is part of the Team & Enterprise plans.
Regions allow you to specify where your organization's models, datasets, and Spaces are stored. For non-Team or Enterprise users, repositories are always stored in the US.

This offers two key benefits:

- Regulatory and legal compliance
- Performance (faster download/upload speeds and lower latency)

Currently available regions:

- US 🇺🇸
- EU 🇪🇺
- Coming soon: Asia-Pacific 🌏

## Getting started with Storage Regions

Organizations subscribed to a Team or Enterprise plan can access the Regions settings page to manage their repositories' storage locations. This page displays:

- An audit of your organization's repository locations
- Options to select where new repositories will be stored

> [!TIP]
> Some advanced compute options for Spaces, such as ZeroGPU, may not be available in all regions.

## Repository Tag

Any repository (model or dataset) stored in a non-default location displays its Region as a tag, allowing organization members to quickly identify repository locations.

## Regulatory and legal compliance

Regulated industries often require data storage in specific regions. For EU companies, you can use the Hub for ML development in a GDPR-compliant manner, with datasets, models, and inference endpoints stored in EU data centers.

## Performance

Storing models and datasets closer to your team and infrastructure significantly improves performance for both uploads and downloads. This impact is substantial given the typically large size of model weights and dataset files. For example, European users storing repositories in the EU region can expect approximately 4-5x faster upload and download speeds compared to US storage.

## Spaces

Both a Space's storage and runtime use the chosen region. Available hardware configurations vary by region, and some features may not be available in all regions. Contact your HF account team for specific requests.
### Basic SSO

https://huggingface.co/docs/hub/security-sso-basic.md

# Basic SSO

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Basic SSO adds an access-control layer on top of the standard Hugging Face login. It allows you to enforce authentication through your Identity Provider (IdP) when members access resources under your organization's namespace, such as private models, datasets, and Spaces. For a comparison with Managed SSO, see the [SSO overview](./enterprise-sso).

## How it works

> [!NOTE]
> **Basic SSO does not replace the Hugging Face login.** Your members will still need to sign in to Hugging Face with their own credentials (email/password, Google, or GitHub) before being prompted to complete SSO authentication to access your organization's resources. This is by design: Basic SSO secures access to your organization without taking over the user's Hugging Face identity.

When Single Sign-On is enabled, organization members authenticate through your Identity Provider (IdP). You pick whether SSO is **enforced** or **optional**:

- **Enforced** (default): Members have to complete SSO authentication before accessing anything under the organization's namespace.
- **Optional**: Members are prompted via a banner at the top of the page to set up SSO, but can skip it and still access the organization. This is handy when you're migrating many users and want to give them time to sort out their accounts before SSO is enforced for good.

Public content remains accessible to everyone, including non-members.

**We use email addresses to identify SSO users. As a user, make sure that your organizational email address (e.g. your company email) has been added to [your user account](https://huggingface.co/settings/account).**

When users log in, they will be prompted to complete the Single Sign-On authentication flow with a banner similar to the following:

Single Sign-On only applies to your organization.
Members may belong to other organizations on Hugging Face.

## Getting started

Basic SSO can be configured directly from your organization's settings. Hugging Face Hub can work with any OIDC-compliant or SAML Identity Provider, including Okta, OneLogin, and Microsoft Entra ID (Azure AD). See our [Configuration Guides](./security-sso-configuration-guides) for step-by-step setup instructions.

## User provisioning

Once SSO is enabled on your organization, a direct join link can be copied and shared with new members. This SSO join link is available in both the **SSO** and **Members** settings tabs. Since organizations with SSO enabled cannot use classic invite links, the SSO join link is the primary method for inviting teammates to your organization. Simply click the copy button to copy the link to your clipboard and share it with the members you want to invite. When recipients click the shared link, they will be able to authenticate via SSO and directly join your organization.

Organizations on the Enterprise plan can also use [SCIM](./enterprise-scim) to automate invitation-based provisioning from your Identity Provider. See the [SCIM guide](./enterprise-scim) for more details.

## SSO features

Basic SSO supports [role mapping, resource group mapping, session timeout, matching email domains, and external collaborators](./security-sso-user-management). These features are configurable from your organization's settings.

### SSO Configuration Guides

https://huggingface.co/docs/hub/security-sso-configuration-guides.md

# SSO Configuration Guides

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

These guides help you configure SAML 2.0 and OpenID Connect (OIDC) with your Identity Provider for [Basic SSO](./security-sso-basic). Hugging Face Hub can work with any SAML or OIDC-compliant Identity Provider.

> [!NOTE]
> If you are looking to set up [Managed SSO](./enterprise-advanced-sso), the configuration is done in collaboration with the Hugging Face team.
> Please contact us to get started.

## Okta

- [How to configure OIDC with Okta](./security-sso-okta-oidc)
- [How to configure SAML with Okta](./security-sso-okta-saml)
- [How to configure SCIM with Okta](./security-sso-okta-scim)

## Microsoft Entra ID (Azure AD)

- [How to configure SAML with Entra ID](./security-sso-azure-saml)
- [How to configure OIDC with Entra ID](./security-sso-azure-oidc)
- [How to configure SCIM with Entra ID](./security-sso-entra-id-scim)

## Google Workspace

- [How to configure SAML with Google Workspace](./security-sso-google-saml)
- [How to configure OIDC with Google Workspace](./security-sso-google-oidc)

### Storage limits

https://huggingface.co/docs/hub/storage-limits.md

# Storage limits

At Hugging Face we aim to provide the AI community with significant volumes of **free storage space for public repositories**, with options to buy more storage if necessary. We also bill for storage space for **private repositories**, above a free tier (see table below).

> [!TIP]
> Storage limits and policies apply to all types of repositories (models, datasets, buckets, …) on the Hub.

We [optimize our infrastructure](https://huggingface.co/blog/xethub-joins-hf) continuously to [scale our storage](https://x.com/julien_c/status/1821540661973160339) for the coming years of growth in AI and machine learning. We do have mitigations in place to prevent abuse of free public storage, and in general we ask users and organizations to make sure any uploaded large model or dataset is **as useful to the community as possible** (as represented by numbers of likes or downloads, for instance). Upgrade to a paid Organization or User (PRO) account to unlock higher limits.
## Storage plans

| Type of account          | Public storage | Private storage |
| ------------------------ | -------------- | --------------- |
| Free user or org         | Best-effort\*  | 100GB           |
| PRO                      | Up to 10TB included\* + [add-on](#public-storage-add-on) ✅ grants available for impactful work† | 1TB + pay-as-you-go |
| Team Organizations       | 12TB base + 1TB per seat + [add-on](#public-storage-add-on) ✅ | 1TB per seat + pay-as-you-go |
| Enterprise Organizations | 200TB base + 1TB per seat + [add-on](#public-storage-add-on) 🏆 Up to 1,000TB for large contracts | 1TB per seat + pay-as-you-go |

💡 [Team or Enterprise Organizations](https://huggingface.co/enterprise) include 1TB of private storage per seat in the subscription: for example, if your organization has 40 members, then you have 40TB of included private storage.

\* We aim to continue providing the AI community with generous free storage space for public repositories. Beyond the first few gigabytes, please use this resource responsibly by uploading content that offers genuine value to other users. If you need substantial storage space, you will need to upgrade to [PRO, Team or Enterprise](https://huggingface.co/pricing).

† In some cases, additional storage grants are available for high-impact open-source work where a paid plan genuinely cannot cover the need. Contact us with evidence of community impact (likes, downloads, citations).

### Public Storage add-on

Users on a paid plan (PRO, Team, or Enterprise) can subscribe to a **Public Storage add-on** for additional public storage on top of their plan's base limit.
| Storage add-on | Price      | Per TB       |
| -------------- | ---------- | ------------ |
| 1 TB           | $12/month  | $12/TB/month |
| 5 TB           | $60/month  | $12/TB/month |
| 10 TB          | $120/month | $12/TB/month |
| 20 TB          | $240/month | $12/TB/month |
| 50 TB          | $500/month | $10/TB/month |

You can subscribe or change your tier from the **Billing** settings page of your account or organization. Upgrades take effect immediately; downgrades are scheduled to take effect at the start of the next month. If you need more storage, you can [contact us](https://huggingface.co/contact/sales) to take advantage of [custom large-scale pricing](https://huggingface.co/pricing#storage).

### Private storage Pay-as-you-go

Above the included 1TB (or 1TB per seat) of private storage in [PRO](https://huggingface.co/subscribe/pro) and [Team or Enterprise Organizations](https://huggingface.co/enterprise), additional private storage is charged to your payment method in pay-as-you-go mode, at a base price of $18/TB/month. Additional discounts are available for large-scale volumes through our account executives:

| Volume | Price (private repos) |
| ------ | --------------------- |
| Base   | $18/TB/mo             |
| 50TB+  | $16/TB/mo             |
| 200TB+ | $14/TB/mo             |
| 500TB+ | $12/TB/mo             |

See our [billing doc](./billing) for more details, or view the latest pricing at [huggingface.co/pricing](https://huggingface.co/pricing#storage).

## Repository limitations and recommendations

> [!NOTE]
> This section does not apply to [Storage Buckets](./storage-buckets)

In addition to storage limits at the account (user or organization) level, there are some limitations to be aware of when dealing with large amounts of data in a specific Git-backed repository. Given the time it takes to stream the data, having an upload or push fail at the end of the process, or encountering a degraded experience on hf.co or when working locally, can be very frustrating.
In the following section, we describe our recommendations on how to best structure your large repos.

### Recommendations

We gathered a list of tips and recommendations for structuring your repo. If you are looking for more practical tips, check out [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#tips-and-tricks-for-large-uploads) on how to upload large amounts of data using the Python library.

| Characteristic | Recommended | Tips |
| -------------- | ----------- | ---- |
| Repo size      | -           | upgrade your [storage plan](#storage-plans) or contact us for large repos (TBs of data) |
| Files per repo |             |      |

> [!NOTE]
> Deleting a PR ref is irreversible and will prevent anyone from fetching or checking out those commits locally.

### Super-squash your repository using the API

The super-squash operation compresses your entire Git history into a single commit. Consider using super-squash when you need to reclaim storage from old LFS versions you're not using. This operation is only available through the [Hub Python Library](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.super_squash_history) or the API.

⚠️ **Important**: This is a destructive operation that cannot be undone; commit history will be permanently lost and **LFS file history will be removed**.

The effects of the squash operation on your storage quota are not immediate; they will be reflected within 36 hours.
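With the Hub Python Library, the super-squash operation described above is a single call. A minimal sketch (the repo ID is a hypothetical placeholder; authentication via `hf auth login` or a token is assumed):

```python
from huggingface_hub import HfApi

def squash_repo(repo_id: str, branch: str = "main") -> None:
    """Irreversibly squash a repo's history into one commit, leaving old
    LFS versions unreferenced so their storage can be reclaimed (quota
    effects may take up to 36 hours to appear)."""
    api = HfApi()  # assumes you are already authenticated
    api.super_squash_history(repo_id=repo_id, repo_type="model", branch=branch)

# squash_repo("username/my-model")  # hypothetical repo you administer
```

Because the operation cannot be undone, it is safest to run it only after confirming you no longer need any historical LFS revisions.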
### Advanced: Track LFS file references

When you find an LFS file in your repository's "List LFS files" view but don't know where it came from, you can trace its history using its SHA-256 OID with the `git log` command:

```bash
git log --all -p -S <SHA-256-OID>
```

For example:

```bash
git log --all -p -S 68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3

commit 5af368743e3f1d81c2a846f7c8d4a028ad9fb021
Date:   Sun Apr 28 02:01:18 2024 +0200

    Update LayerNorm tensor names to weight and bias

diff --git a/model.safetensors b/model.safetensors
index a090ee7..e79c80e 100644
--- a/model.safetensors
+++ b/model.safetensors
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3
+oid sha256:0bb7a1683251b832d6f4644e523b325adcf485b7193379f5515e6083b5ed174b
 size 440449768

commit 0a6aa9128b6194f4f3c4db429b6cb4891cdb421b (origin/pr/28)
Date:   Wed Nov 16 15:15:39 2022 +0000

    Adding `safetensors` variant of this model (#15)

    - Adding `safetensors` variant of this model (18c87780b5e54825a2454d5855a354ad46c5b87e)

    Co-authored-by: Nicolas Patry

diff --git a/model.safetensors b/model.safetensors
new file mode 100644
index 0000000..a090ee7
--- /dev/null
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3
+size 440449768

commit 18c87780b5e54825a2454d5855a354ad46c5b87e (origin/pr/15)
Date:   Thu Nov 10 09:35:55 2022 +0000

    Adding `safetensors` variant of this model

diff --git a/model.safetensors b/model.safetensors
new file mode 100644
index 0000000..a090ee7
--- /dev/null
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:68d45e234eb4a928074dfd868cead0219ab85354cc53d20e772753c6bb9169d3
+size 440449768
```

### Model Card Guidebook

https://huggingface.co/docs/hub/model-card-guidebook.md

# Model Card Guidebook

Model cards are an important documentation and transparency
framework for machine learning models. We believe that model cards have the potential to serve as *boundary objects*, a single artefact that is accessible to users with different backgrounds and goals when interacting with model cards – including developers, students, policymakers, ethicists, those impacted by machine learning models, and other stakeholders. We recognize that developing a single artefact to serve such multifaceted purposes is difficult and requires careful consideration of potential users and use cases.

Our goal as part of the Hugging Face science team over the last several months has been to help operationalize model cards towards that vision, taking these challenges into account, both at Hugging Face and in the broader ML community. To work towards that goal, it is important to recognize the thoughtful, dedicated efforts that have helped model cards grow into what they are today, from the adoption of model cards as a standard practice at many large organisations to the development of sophisticated tools for hosting and generating model cards.

Since model cards were proposed by Mitchell et al. (2018), the landscape of machine learning documentation has expanded and evolved. A plethora of documentation tools and templates for data, models, and ML systems have been proposed and developed – reflecting the incredible work of hundreds of researchers, impacted community members, advocates, and other stakeholders. Important discussions about the relationship between ML documentation and theories of change in responsible AI have continued, and at times diverged. We also recognize the challenges facing model cards, which in some ways mirror the challenges facing machine learning documentation and responsible AI efforts more generally, and we see opportunities ahead to help shape both model cards and the ecosystems in which they function positively in the months and years ahead.
Our work presents a view of where we think model cards stand right now and where they could go in the future, at Hugging Face and beyond. This work is a "snapshot" of the current state of model cards, informed by a landscape analysis of the many ways ML documentation artefacts have been instantiated. It represents one perspective among many on both the current state and more aspirational visions of model cards. In this blog post, we summarise our work, including a discussion of the broader, growing landscape of ML documentation tools, the diverse audiences for and opinions about model cards, and potential new templates for model card content. We also explore and develop model cards for machine learning models in the context of the Hugging Face Hub, using the Hub's features to collaboratively create, discuss, and disseminate model cards for ML models.

With the launch of this Guidebook, we introduce several new resources and connect together previous work on Model Cards:

1) An updated Model Card template, released in the `huggingface_hub` library [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md), drawing together Model Card work in academia and throughout the industry.
2) An [Annotated Model Card Template](./model-card-annotated), which details how to fill the card out.
3) A [Model Card Creator Tool](https://huggingface.co/spaces/huggingface/Model_Cards_Writing_Tool), to ease card creation without needing to program, and to help teams share the work of different sections.
4) A [User Study](./model-cards-user-studies) on Model Card usage at Hugging Face.
5) A [Landscape Analysis and Literature Review](./model-card-landscape-analysis) of the state of the art in model documentation.

We also include an [Appendix](./model-card-appendix) with further details from this work.

---

**Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret.
Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Pandas https://huggingface.co/docs/hub/datasets-pandas.md # Pandas [Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit. Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub. ## Load a DataFrame You can load data from local files or from remote storage like Hugging Face Datasets. Pandas supports many formats including CSV, JSON and Parquet: ```python >>> import pandas as pd >>> df = pd.read_csv("path/to/data.csv") ``` To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet`: ```python >>> import pandas as pd >>> df = pd.read_parquet("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet") >>> df text label 0 I rented I AM CURIOUS-YELLOW from my video sto... 0 1 "I Am Curious: Yellow" is a risible and preten... 0 2 If only to avoid making this type of film in t... 0 3 This film was probably inspired by Godard's Ma... 0 4 Oh, brother...after hearing about this ridicul... 0 ... ... ... 24995 A hit at the time but now better categorised a... 1 24996 I love this movie like no other. Another time ... 1 24997 This film and it's sequel Barry Mckenzie holds... 1 24998 'The Adventures Of Barry McKenzie' started lif... 1 24999 The story centers around Barry McKenzie who mu... 
1 ``` For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system). > [!TIP] > The same `hf://` paths also work with [Storage Buckets](./storage-buckets): > ```python > >>> df = pd.read_parquet("hf://buckets/username/my-bucket/data.parquet") > >>> df.to_parquet("hf://buckets/username/my-bucket/output.parquet") > ``` ## Save a DataFrame You can save a pandas DataFrame using `to_csv/to_json/to_parquet` to a local file or to Hugging Face directly. To save the DataFrame on Hugging Face, you first need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using: ``` hf auth login ``` Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using: ```python from huggingface_hub import HfApi HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") ``` Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in Pandas: ```python import pandas as pd df.to_parquet("hf://datasets/username/my_dataset/imdb.parquet") # or write in separate files if the dataset has train/validation/test splits df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet") df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet") df_test.to_parquet("hf://datasets/username/my_dataset/test.parquet") ``` Note that Parquet files on Hugging Face are optimized to improve storage efficiency, accelerate downloads and uploads, and enable efficient dataset streaming and editing: * [Parquet Content Defined Chunking](https://huggingface.co/blog/parquet-cdc) optimizes Parquet for [Xet](https://huggingface.co/docs/hub/en/xet/index), Hugging Face's storage backend.
It accelerates uploads and downloads thanks to chunk-based deduplication and allows efficient file editing * Page index accelerates filters when streaming and enables efficient random access, e.g. in the [Dataset Viewer](https://huggingface.co/docs/dataset-viewer) Pandas requires extra arguments to write optimized Parquet files: ```python import pandas as pd df.to_parquet( "hf://datasets/username/my_dataset/imdb.parquet", # Optimize for Xet use_content_defined_chunking=True, write_page_index=True, ) ``` * `use_content_defined_chunking=True` to enable Parquet Content Defined Chunking, for [deduplication](https://huggingface.co/blog/parquet-cdc) and [editing](./datasets-editing) (it requires `pyarrow>=21.0`) * `write_page_index=True` to include a page index in the Parquet metadata, for [streaming and random access](./datasets-streaming) > [!TIP] > Content defined chunking (CDC) makes the Parquet writer chunk the data pages so that duplicate data is chunked and compressed identically. > Without CDC, pages are chunked arbitrarily, so duplicate data cannot be detected once compressed. > Thanks to CDC, Parquet uploads and downloads from Hugging Face are faster, since duplicate data is uploaded or downloaded only once. Find more information about Xet [here](https://huggingface.co/join/xet). ## Leverage Xet deduplication for Parquet Optimized Parquet files are written with Content Defined Chunking, which enables deduplication. This accelerates uploads since chunks of data that already exist on Hugging Face don't need to be uploaded again, and this saves a lot of I/O. For example, this code uploads the content of `df` and then for `edited_df` the upload is faster since it only uploads the chunks that changed: ```python import pandas as pd df.to_parquet( "hf://datasets/username/my_dataset/imdb.parquet", # Optimize for Xet use_content_defined_chunking=True, write_page_index=True, ) edited_df = ... # e.g.
with added/modified/removed rows or columns edited_df.to_parquet( "hf://datasets/username/my_dataset/imdb.parquet", # Optimize for Xet use_content_defined_chunking=True, write_page_index=True, ) ``` Chunks are ~64kB and Parquet saves data column by column, so in practice this is what happens when editing an Optimized Parquet file: * add a new column -> only the chunks of the new column are uploaded * add/edit/delete a row -> one chunk per column is uploaded And in addition to this, the chunks of the Parquet footer containing metadata are also uploaded. ## Use Images You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this: ``` Example 1: Example 2: folder/ folder/ ├── metadata.csv ├── metadata.csv ├── img000.png └── images ├── img001.png ├── img000.png ... ... └── imgNNN.png └── imgNNN.png ``` You can iterate over the image paths like this: ```python import pandas as pd folder_path = "path/to/folder/" df = pd.read_csv(folder_path + "metadata.csv") for image_path in (folder_path + df["file_name"]): ... ``` Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.csv` or `.jsonl` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face. ```python from huggingface_hub import HfApi api = HfApi() api.upload_folder( folder_path=folder_path, repo_id="username/my_image_dataset", repo_type="dataset", ) ``` ### Image methods and Parquet Using [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods), you enable `PIL.Image` methods on an image column.
It also enables saving the dataset as a single Parquet file containing both the images and the metadata: ```python import pandas as pd from pandas_image_methods import PILMethods pd.api.extensions.register_series_accessor("pil")(PILMethods) df["image"] = (folder_path + df["file_name"]).pil.open() df.to_parquet("data.parquet") ``` All the `PIL.Image` methods are available, e.g. ```python df["image"] = df["image"].pil.rotate(90) ``` ## Use Audios You can load a folder with a metadata file containing a field for the names or paths to the audios, structured like this: ``` Example 1: Example 2: folder/ folder/ ├── metadata.csv ├── metadata.csv ├── rec000.wav └── audios ├── rec001.wav ├── rec000.wav ... ... └── recNNN.wav └── recNNN.wav ``` You can iterate over the audio paths like this: ```python import pandas as pd folder_path = "path/to/folder/" df = pd.read_csv(folder_path + "metadata.csv") for audio_path in (folder_path + df["file_name"]): ... ``` Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-audio#additional-columns) (a `metadata.csv` or `.jsonl` file with a `file_name` field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and audio. ```python from huggingface_hub import HfApi api = HfApi() api.upload_folder( folder_path=folder_path, repo_id="username/my_audio_dataset", repo_type="dataset", ) ``` ### Audio methods and Parquet Using [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods), you enable `soundfile` methods on an audio column. It also enables saving the dataset as a single Parquet file containing both the audios and the metadata: ```python import pandas as pd from pandas_audio_methods import SFMethods pd.api.extensions.register_series_accessor("sf")(SFMethods) df["audio"] = (folder_path + df["file_name"]).sf.open() df.to_parquet("data.parquet") ``` This makes it easy to use with `librosa` e.g.
for resampling: ```python df["audio"] = [librosa.load(audio, sr=16_000) for audio in df["audio"]] df["audio"] = df["audio"].sf.write() ``` ## Use Transformers You can use `transformers` pipelines on pandas DataFrames to classify, generate text, images, etc. This section shows a few examples with `tqdm` for progress bars. > [!TIP] > Pipelines don't accept a `tqdm` object as input but you can use a Python generator instead, in the form `x for x in tqdm(...)` ### Text Classification ```python from transformers import pipeline from tqdm import tqdm pipe = pipeline("text-classification", model="clapAI/modernBERT-base-multilingual-sentiment") # Compute labels df["label"] = [y["label"] for y in pipe(x for x in tqdm(df["text"]))] # Compute labels and scores df[["label", "score"]] = [(y["label"], y["score"]) for y in pipe(x for x in tqdm(df["text"]))] ``` ### Text Generation ```python from transformers import pipeline from tqdm import tqdm pipe = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct") # Generate chat response prompt = "What is the main topic of this sentence? REPLY IN LESS THAN 3 WORDS. Sentence: '{}'" df["output"] = [y["generated_text"][1]["content"] for y in pipe([{"role": "user", "content": prompt.format(x)}] for x in tqdm(df["text"]))] ``` ### GGUF usage with LM Studio https://huggingface.co/docs/hub/lmstudio.md # GGUF usage with LM Studio ![cover](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/gguf-lmstudio-coverimage.png) [LM Studio](https://lmstudio.ai) is a desktop application for experimenting & developing with local AI models directly on your computer. LM Studio is built on llama.cpp and works on Mac (Apple Silicon), Windows, and Linux! ## Getting models from Hugging Face into LM Studio First, enable LM Studio under your [Local Apps Settings](https://huggingface.co/settings/local-apps) in Hugging Face.
### Option 1: Use the 'Use this model' button right from Hugging Face For any GGUF or MLX LLM, click the "Use this model" dropdown and select LM Studio. This will run the model directly in LM Studio if you already have it, or show you a download option if you don't. To try LM Studio with a trending model, find them here: [https://huggingface.co/models?library=gguf&sort=trending](https://huggingface.co/models?library=gguf&sort=trending) ### Option 2: Use LM Studio's In-App Downloader Open the LM Studio app and search for any model by pressing ⌘ + Shift + M on Mac, or Ctrl + Shift + M on PC (M stands for Models). You can even paste entire Hugging Face URLs into the search bar! For each model, you can expand the dropdown to view multiple quantization options. LM Studio highlights the recommended choice for your hardware and indicates which options are supported. ### Option 3: Use lms, LM Studio's CLI If you prefer a terminal-based workflow, use lms, LM Studio's CLI. #### **Search for models from the terminal:** Search with keyword ```bash lms get qwen ``` Filter search by MLX or GGUF results ```bash lms get qwen --mlx # or --gguf ``` #### **Download any model from Hugging Face:** Use a full Hugging Face URL ```bash lms get https://huggingface.co/lmstudio-community/Ministral-3-8B-Reasoning-2512-GGUF ``` #### **Choose a model quantization** You can choose a model quantization level that balances performance, memory usage, and accuracy. This is done with the @ qualifier, for example: ```bash lms get https://huggingface.co/lmstudio-community/Ministral-3-8B-Reasoning-2512-GGUF@Q6_K ``` ## You downloaded the model – Now what? You've downloaded a model following one of the above options, now let's get started in LM Studio! ### Getting started with the LM Studio Application In the LM Studio application, head to the model loader to view a list of downloaded models and select one to load.
You may customize the model load parameters, though LM Studio will by default select the load parameters that optimize model performance on your hardware. Once the model has completed loading (as indicated by the progress bar), you may start chatting away using our app's chat interface! ### Or, use LM Studio's CLI to interact with your models See a list of commands [here](https://lmstudio.ai/docs/cli). Note that you need to run LM Studio ***at least once*** before you can use `lms`. ## **Keeping up with the latest models** Follow the [LM Studio Community](https://huggingface.co/lmstudio-community) page on Hugging Face to stay updated on the latest & greatest local LLMs as soon as they come out. ### Hugging Face Dataset Upload Decision Guide https://huggingface.co/docs/hub/datasets-upload-guide-llm.md # Hugging Face Dataset Upload Decision Guide > [!TIP] > This guide is primarily designed for LLMs to help users upload datasets to the Hugging Face Hub in the most compatible format. Users can also reference this guide to understand the upload process and best practices. > Decision guide for uploading datasets to Hugging Face Hub. Optimized for Dataset Viewer compatibility and integration with the Hugging Face ecosystem. ## Overview Your goal is to help a user upload a dataset to the Hugging Face Hub. Ideally, the dataset should be compatible with the Dataset Viewer (and thus the `load_dataset` function) to ensure easy access and usability.
You should aim to meet the following criteria: | **Criteria** | Description | Priority | | --- | --- | --- | | **Respect repository limits** | Ensure the dataset adheres to Hugging Face's storage limits for file sizes, repository sizes, and file counts. See the Critical Constraints section below for specific limits. | Required | | **Use hub-compatible formats** | Use Parquet format when possible (best compression, rich typing, large dataset support); for smaller datasets, CSV or JSON also work. | Required | ## Quick Reference by Data Type | Data type | Recommended approach | Upload method | | --- | --- | --- | | **Large file counts (>10k)** | Use upload_large_folder to avoid Git limitations | `api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset")` | | **Streaming large media** | WebDataset format for efficient streaming | Create .tar shards, then `upload_large_folder()` | | **Scientific data (HDF5, NetCDF)** | Convert to Parquet with Array features | See [Scientific Data](#scientific-data) section | | **Custom/proprietary formats** | Document thoroughly if conversion impossible | `upload_large_folder()` with comprehensive README | ## Upload Workflow 0. ✓ **Gather dataset information** (if needed): - What type of data? (images, text, audio, CSV, etc.) - How is it organized? (folder structure, single file, multiple files) - What's the approximate size? - What format are the files in? - Any special requirements? (e.g., streaming, private access) - Check for existing README or documentation files that describe the dataset 1. ✓ **Authenticate**: - CLI: `hf auth login` - Or use token: `HfApi(token="hf_...")` or set `HF_TOKEN` environment variable 2.
✓ **Identify your data type**: Check the [Quick Reference](#quick-reference-by-data-type) table above 3. ✓ **Choose upload method**: - **Small datasets and Dataset objects**: `push_to_hub()` - **Large datasets (>100GB or >10k files)**: `upload_large_folder()` - **Custom formats**: Convert to hub-compatible format if possible, otherwise document thoroughly 4. ✓ **Test locally** (if using built-in loader): ```python # Validate your dataset loads correctly before uploading dataset = load_dataset("loader_name", data_dir="./your_data") print(dataset) ``` 5. ✓ **Upload to Hub**: ```python # Basic upload dataset.push_to_hub("username/dataset-name") # With options for large datasets dataset.push_to_hub( "username/dataset-name", max_shard_size="5GB", # Control memory usage private=True # For private datasets ) ``` 6. ✓ **Verify your upload**: - Check Dataset Viewer: `https://huggingface.co/datasets/username/dataset-name` - Test loading: `load_dataset("username/dataset-name")` - If viewer shows errors, check the [Troubleshooting](#common-issues--solutions) section ## Common Conversion Patterns When built-in loaders don't match your data structure, use the datasets library as a compatibility layer. Convert your data to a Dataset object, then use `push_to_hub()` for maximum flexibility and Dataset Viewer compatibility.
### From DataFrames If you already have your data working in pandas, polars, or other dataframe libraries, you can convert directly: ```python # From pandas DataFrame import pandas as pd from datasets import Dataset df = pd.read_csv("your_data.csv") dataset = Dataset.from_pandas(df) dataset.push_to_hub("username/dataset-name") # From polars DataFrame (direct method) import polars as pl from datasets import Dataset df = pl.read_csv("your_data.csv") dataset = Dataset.from_polars(df) # Direct conversion dataset.push_to_hub("username/dataset-name") # From PyArrow Table (useful for scientific data) import pyarrow as pa from datasets import Dataset # If you have a PyArrow table table = pa.table({'data': [1, 2, 3], 'labels': ['a', 'b', 'c']}) dataset = Dataset(table) dataset.push_to_hub("username/dataset-name") # For Spark/Dask dataframes, see https://huggingface.co/docs/hub/datasets-libraries ``` ## Custom Format Conversion When built-in loaders don't match your data format, convert to Dataset objects following these principles: ### Design Principles **1. Prefer wide/flat structures over joins** - Denormalize relational data into single rows for better usability - Include all relevant information in each example - Lean towards bigger but more usable data - Hugging Face's infrastructure uses advanced deduplication (XetHub) and Parquet optimizations to handle redundancy efficiently **2. 
Use configs for logical dataset variations** - Beyond train/test/val splits, use configs for different subsets or views of your data - Each config can have different features or data organization - Example: language-specific configs, task-specific views, or data modalities ### Conversion Methods **Small datasets (fits in memory) - use `Dataset.from_dict()`**: ```python # Parse your custom format into a dictionary data_dict = { "text": ["example1", "example2"], "label": ["positive", "negative"], "score": [0.9, 0.2] } # Create dataset with appropriate features from datasets import Dataset, Features, Value, ClassLabel features = Features({ 'text': Value('string'), 'label': ClassLabel(names=['negative', 'positive']), 'score': Value('float32') }) dataset = Dataset.from_dict(data_dict, features=features) dataset.push_to_hub("username/dataset") ``` **Large datasets (memory-efficient) - use `Dataset.from_generator()`**: ```python def data_generator(): # Parse your custom format progressively for item in parse_large_file("data.custom"): yield { "text": item["content"], "label": item["category"], "embedding": item["vector"] } # Specify features for Dataset Viewer compatibility from datasets import Features, Value, ClassLabel, List features = Features({ 'text': Value('string'), 'label': ClassLabel(names=['cat1', 'cat2', 'cat3']), 'embedding': List(feature=Value('float32'), length=768) }) dataset = Dataset.from_generator(data_generator, features=features) dataset.push_to_hub("username/dataset", max_shard_size="1GB") ``` **Tip**: For large datasets, test with a subset first by adding a limit to your generator or using `.select(range(100))` after creation. 
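The subset-testing tip above can be sketched as follows; this is an illustrative example in which `parse_large_file` is a hypothetical stand-in for your own parser, and the cap is applied with `itertools.islice` so the full file is never consumed:

```python
from itertools import islice

def parse_large_file(path):
    # Hypothetical stand-in for your real parser: yields one record at a time.
    for i in range(1_000_000):
        yield {"content": f"example {i}", "category": "cat1"}

def data_generator(limit=None):
    # limit=None yields everything; an integer caps the number of examples.
    for item in islice(parse_large_file("data.custom"), limit):
        yield {"text": item["content"], "label": item["category"]}

# Inspect a small sample before the full run, e.g. before calling
# Dataset.from_generator(lambda: data_generator(limit=100))
sample = list(data_generator(limit=100))
print(len(sample), sample[0]["text"])
```

Once the sample looks right, drop the limit and build the full dataset with `Dataset.from_generator`.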
### Using Configs for Dataset Variations ```python # Push different configurations of your dataset dataset_en = Dataset.from_dict(english_data, features=features) dataset_en.push_to_hub("username/multilingual-dataset", config_name="english") dataset_fr = Dataset.from_dict(french_data, features=features) dataset_fr.push_to_hub("username/multilingual-dataset", config_name="french") # Users can then load specific configs dataset = load_dataset("username/multilingual-dataset", "english") ``` ### Multi-modal Examples **Text + Audio (speech recognition)**: ```python def speech_generator(): for audio_file in Path("audio/").glob("*.wav"): transcript_file = audio_file.with_suffix(".txt") yield { "audio": str(audio_file), "text": transcript_file.read_text().strip(), "speaker_id": audio_file.stem.split("_")[0] } features = Features({ 'audio': Audio(sampling_rate=16000), 'text': Value('string'), 'speaker_id': Value('string') }) dataset = Dataset.from_generator(speech_generator, features=features) dataset.push_to_hub("username/speech-dataset") ``` **Multiple images per example**: ```python # Before/after images, medical imaging, etc. data = { "image_before": ["img1_before.jpg", "img2_before.jpg"], "image_after": ["img1_after.jpg", "img2_after.jpg"], "treatment": ["method_A", "method_B"] } features = Features({ 'image_before': Image(), 'image_after': Image(), 'treatment': ClassLabel(names=['method_A', 'method_B']) }) dataset = Dataset.from_dict(data, features=features) dataset.push_to_hub("username/before-after-images") ``` **Note**: For text + images, consider using ImageFolder with metadata.csv which handles this automatically. ## Essential Features Features define the schema and data types for your dataset columns. 
Specifying correct features ensures: - Proper data handling and type conversion - Dataset Viewer functionality (e.g., image/audio previews) - Efficient storage and loading - Clear documentation of your data structure For complete feature documentation, see: [Dataset Features](https://huggingface.co/docs/datasets/about_dataset_features) ### Feature Types Overview **Basic Types**: - `Value`: Scalar values - `string`, `int64`, `float32`, `bool`, `binary`, and other numeric types - `ClassLabel`: Categorical data with named classes - `Sequence`: Lists of any feature type - `LargeList`: For very large lists **Media Types** (enable Dataset Viewer previews): - `Image()`: Handles various image formats, returns PIL Image objects - `Audio(sampling_rate=16000)`: Audio with array data and optional sampling rate - `Video()`: Video files - `Pdf()`: PDF documents with text extraction **Array Types** (for tensors/scientific data): - `Array2D`, `Array3D`, `Array4D`, `Array5D`: Fixed or variable-length arrays - Example: `Array2D(shape=(224, 224), dtype='float32')` - First dimension can be `None` for variable length **Translation Types**: - `Translation`: For translation pairs with fixed languages - `TranslationVariableLanguages`: For translations with varying language pairs **Note**: New feature types are added regularly. Check the documentation for the latest additions. 
## Upload Methods **Dataset objects (use push_to_hub)**: Use when you've loaded/converted data using the datasets library ```python dataset.push_to_hub("username/dataset", max_shard_size="5GB") ``` **Pre-existing files (use upload_large_folder)**: Use when you have hub-compatible files (e.g., Parquet files) already prepared and organized ```python from huggingface_hub import HfApi api = HfApi() api.upload_large_folder(folder_path="./data", repo_id="username/dataset", repo_type="dataset", num_workers=16) ``` **Important**: Before using `upload_large_folder`, verify the files meet repository limits: - Check folder structure if you have file access: ensure no folder contains >10k files - Ask the user to confirm: "Are your files in a hub-compatible format (Parquet/CSV/JSON) and organized appropriately?" - For non-standard formats, consider converting to Dataset objects first to ensure compatibility ## Validation **Consider small reformatting**: If data is close to a built-in loader format, suggest minor changes: - Rename columns (e.g., 'filename' → 'file_name' for ImageFolder) - Reorganize folders (e.g., move images into class subfolders) - Rename files to match expected patterns (e.g., 'data.csv' → 'train.csv') **Pre-upload**: - Test locally: `load_dataset("imagefolder", data_dir="./data")` - Verify features work correctly: ```python # Test first example print(dataset[0]) # For images: verify they load if 'image' in dataset.features: dataset[0]['image'] # Should return PIL Image # Check dataset size before upload print(f"Size: {len(dataset)} examples") ``` - Check metadata.csv has 'file_name' column - Verify relative paths, no leading slashes - Ensure no folder >10k files **Post-upload**: - Check viewer: `https://huggingface.co/datasets/username/dataset` - Test loading: `load_dataset("username/dataset")` - Verify features preserved: `print(dataset.features)` ## Common Issues → Solutions | Issue | Solution | | -------------------------- |
------------------------------------ | | "Repository not found" | Run `hf auth login` | | Memory errors | Use `max_shard_size="500MB"` | | Dataset viewer not working | Wait 5-10min, check README.md config | | Timeout errors | Use `multi_commits=True` | | Files >50GB | Split into smaller files | | "File not found" | Use relative paths in metadata | ## Dataset Viewer Configuration **Note**: This section is primarily for datasets uploaded directly to the Hub (via UI or `upload_large_folder`). Datasets uploaded with `push_to_hub()` typically configure the viewer automatically. ### When automatic detection works The Dataset Viewer automatically detects standard structures: - Files named: `train.csv`, `test.json`, `validation.parquet` - Directories named: `train/`, `test/`, `validation/` - Split names with delimiters: `test-data.csv` ✓ (not `testdata.csv` ✗) ### Manual configuration For custom structures, add YAML to your README.md: ```yaml --- configs: - config_name: default # Required even for single config! data_files: - split: train path: "data/train/*.parquet" - split: test path: "data/test/*.parquet" --- ``` Multiple configurations example: ```yaml --- configs: - config_name: english data_files: "en/*.parquet" - config_name: french data_files: "fr/*.parquet" --- ``` ### Common viewer issues - **No viewer after upload**: Wait 5-10 minutes for processing - **"Config names error"**: Add `config_name` field (required!)
- **Files not detected**: Check naming patterns (needs delimiters) - **Viewer disabled**: Remove `viewer: false` from README YAML ## Quick Templates ```python # ImageFolder with metadata dataset = load_dataset("imagefolder", data_dir="./images") dataset.push_to_hub("username/dataset") # Memory-efficient upload dataset.push_to_hub("username/dataset", max_shard_size="500MB") # Multiple CSV files dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'}) dataset.push_to_hub("username/dataset") ``` ## Documentation **Core docs**: [Adding datasets](https://huggingface.co/docs/hub/datasets-adding) | [Dataset viewer](https://huggingface.co/docs/hub/datasets-viewer) | [Storage limits](https://huggingface.co/docs/hub/storage-limits) | [Upload guide](https://huggingface.co/docs/datasets/upload_dataset) ## Dataset Cards Remind users to add a dataset card (README.md) with: - Dataset description and usage - License information - Citation details See [Dataset Cards guide](https://huggingface.co/docs/hub/datasets-cards) for details. 
--- ## Appendix: Special Cases ### WebDataset Structure For streaming large media datasets: - Create 1-5GB tar shards - Consistent internal structure - Upload with `upload_large_folder` ### Scientific Data - HDF5/NetCDF → Convert to Parquet with Array features - Time series → Array2D(shape=(None, n)) - Complex metadata → Store as JSON strings ### Community Resources For very specialized or bespoke formats: - Search the Hub for similar datasets: `https://huggingface.co/datasets` - Ask for advice on the [Hugging Face Forums](https://discuss.huggingface.co/c/datasets/10) - Join the [Hugging Face Discord](https://hf.co/join/discord) for real-time help - Many domain-specific formats already have examples on the Hub ### How to configure SAML SSO with Google Workspace https://huggingface.co/docs/hub/security-sso-google-saml.md # How to configure SAML SSO with Google Workspace In this guide, we will use Google Workspace as the SSO provider, with the Security Assertion Markup Language (SAML) protocol as our preferred identity protocol. We currently support SP-initiated and IdP-initiated authentication. For user provisioning, see [SCIM](./enterprise-scim). > [!WARNING] > This feature is part of the Team & Enterprise plans. ## Step 1: Create SAML App in Google Workspace - In your Google Workspace admin console, navigate to `Admin` > `Apps` > `Web and mobile apps`. - Click `Add app` and then `Add custom SAML app`. - You must provide a name for your application in the "App name" field. - Click `Continue`. ## Step 2: Configure Hugging Face with Google's IdP Details - The next screen in the Google setup contains the SSO information for your application. - In your Hugging Face organization settings, go to the `SSO` tab and select the `SAML` protocol. - Copy the **SSO URL** from Google into the **Sign-on URL** field on Hugging Face. - Copy the **Certificate** from Google into the corresponding field on Hugging Face.
The public certificate must have the following format: ``` -----BEGIN CERTIFICATE----- {certificate} -----END CERTIFICATE----- ``` - In the Google Workspace setup, click `Continue`. ## Step 3: Configure Google with Hugging Face's SP Details - In the "Service provider details" screen, you'll need the `Assertion Consumer Service URL` and `SP Entity ID` from your Hugging Face SSO settings. Copy them into the corresponding `ACS URL` and `Entity ID` fields in Google. - Ensure the following are set: - Check the **Signed response** box. - Name ID format: `EMAIL` - Name ID: `Basic Information > Primary email` - Click `Continue`. ## Step 4: Attribute Mapping - On the "Attribute mapping" screen, click `Add mapping` and configure the attributes you want to send. This step is optional and depends on whether you want to use [Role Mapping](./security-sso-user-management#role-mapping) or [Resource Group Mapping](./security-sso-user-management#resource-group-mapping) on Hugging Face. - Click `Finish`. ## Step 5: Test and Enable SSO > [!WARNING] > Before testing, ensure you have granted access to the application for the appropriate users in the Google Workspace admin console under the app's "User access" settings. The admin performing the test must have access. It may take a few minutes for user access changes to apply on Google Workspace. - Now, in your Hugging Face SSO settings, click on **"Update and Test SAML configuration"**. - You should be redirected to your Google login prompt. Once logged in, you'll be redirected to your organization's settings page. - A green check mark near the SAML selector will confirm that the test was successful. - Once the test is successful, you can enable SSO for your organization by clicking the "Enable" button. - Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section. 
### Third-party scanner: JFrog

https://huggingface.co/docs/hub/security-jfrog.md

# Third-party scanner: JFrog

[JFrog](https://jfrog.com/)'s security scanner detects malicious behavior in machine learning models.

![JFrog report for the danger.dat file contained in mcpotato/42-eicar-street](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/jfrog-report.png)
*Example of a report for [danger.dat](https://huggingface.co/mcpotato/42-eicar-street/blob/main/danger.dat)*

We [partnered with JFrog](https://hf.co/blog/jfrog) to provide scanning and make the Hub safer. Model files are scanned by the JFrog scanner, and we expose the scanning results on the Hub interface.

JFrog's scanner is built with the goal of reducing false positives: what we currently observe is that code contained within model weights is not always malicious. When code is detected in a file, JFrog's scanner parses and analyzes it to check for potential malicious usage.

Here is an example repository you can check out to see the feature in action: [mcpotato/42-eicar-street](https://huggingface.co/mcpotato/42-eicar-street).

## Model security refresher

To share models, we serialize the data structures we use to interact with them, in order to facilitate storage and transport. Some serialization formats are vulnerable to nasty exploits, such as arbitrary code execution (looking at you, pickle), making model sharing potentially dangerous.

As Hugging Face has become a popular platform for model sharing, we'd like to protect the community from this, which is why we have developed tools like [picklescan](https://github.com/mmaitre314/picklescan) and why we integrate third-party scanners.

Pickle is not the only exploitable format out there, [see for reference](https://github.com/Azure/counterfit/wiki/Abusing-ML-model-file-formats-to-create-malware-on-AI-systems:-A-proof-of-concept) how one can exploit Keras Lambda layers to achieve arbitrary code execution.
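To see concretely why pickle is singled out above: unpickling can invoke an arbitrary callable chosen by whoever produced the file. The sketch below is deliberately benign (it calls `eval` on harmless arithmetic), but the same hook is what malware abuses:

```python
import pickle

class NotActuallyData:
    """A benign stand-in for a malicious pickle payload."""
    def __reduce__(self):
        # On unpickling, pickle calls the returned callable with these args.
        # eval("6 * 7") is harmless; an attacker could return os.system instead.
        return (eval, ("6 * 7",))

payload = pickle.dumps(NotActuallyData())
# Instead of restoring the object, loads() executes the smuggled call:
result = pickle.loads(payload)  # → 42
```

This is why loading a pickle file is equivalent to running code from its author, and why scanners inspect model files for exactly this pattern.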
### Managing Spaces with CircleCI Workflows

https://huggingface.co/docs/hub/spaces-circleci.md

# Managing Spaces with CircleCI Workflows

You can keep your app in sync with your GitHub repository with a **CircleCI workflow**. [CircleCI](https://circleci.com) is a continuous integration and continuous delivery (CI/CD) platform that helps automate the software development process. A [CircleCI workflow](https://circleci.com/docs/workflows/) is a set of automated tasks, defined in a configuration file and orchestrated by CircleCI, to streamline the process of building, testing, and deploying software applications.

*Note: For files larger than 10MB, Spaces requires Git-LFS. If you don't want to use Git-LFS, you may need to review your files and check your history. Use a tool like [BFG Repo-Cleaner](https://rtyley.github.io/bfg-repo-cleaner/) to remove any large files from your history. BFG Repo-Cleaner will keep a local copy of your repository as a backup.*

First, set up your GitHub repository and Spaces app together. Add your Spaces app as an additional remote to your existing Git repository:

```bash
git remote add space https://huggingface.co/spaces/HF_USERNAME/SPACE_NAME
```

Then force push to sync everything for the first time:

```bash
git push --force space main
```

Next, set up a [CircleCI workflow](https://circleci.com/docs/workflows/) to push your `main` git branch to Spaces. In the example below:

* Replace `HF_USERNAME` with your username and `SPACE_NAME` with your Space name.
* [Create a context in CircleCI](https://circleci.com/docs/contexts/) and add an environment variable to it called *HF_PERSONAL_TOKEN* (you can give it any name; just use that name in place of HF_PERSONAL_TOKEN below), with your Hugging Face API token as its value. You can find your Hugging Face API token under **API Tokens** on [your Hugging Face profile](https://huggingface.co/settings/tokens).
```yaml
version: 2.1

workflows:
  main:
    jobs:
      - sync-to-huggingface:
          context:
            - HuggingFace
          filters:
            branches:
              only:
                - main

jobs:
  sync-to-huggingface:
    docker:
      - image: alpine
    resource_class: small
    steps:
      - run:
          name: install git
          command: apk update && apk add openssh-client git
      - checkout
      - run:
          name: push to Huggingface hub
          command: |
            git config user.email ""
            git config user.name ""
            git push -f https://HF_USERNAME:${HF_PERSONAL_TOKEN}@huggingface.co/spaces/HF_USERNAME/SPACE_NAME main
```

### Perform SQL operations

https://huggingface.co/docs/hub/datasets-duckdb-sql.md

# Perform SQL operations

Performing SQL operations with DuckDB opens up a world of possibilities for querying datasets efficiently. Let's dive into some examples showcasing the power of DuckDB functions.

For our demonstration, we'll explore a fascinating dataset. The [MMLU](https://huggingface.co/datasets/cais/mmlu) dataset is a multitask test containing multiple-choice questions spanning various knowledge domains.

To preview the dataset, let's select a sample of 3 rows:

```bash
FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;

┌──────────────────────┬──────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐
│       question       │       subject        │                                               choices                                                │ answer │
│       varchar        │       varchar        │                                              varchar[]                                               │ int64  │
├──────────────────────┼──────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│ The model of light…  │ conceptual_physics   │ [wave model, particle model, Both of these, Neither of these]                                        │      1 │
│ A person who is lo…  │ professional_psych…  │ [his/her life scripts., his/her own feelings, attitudes, and beliefs., the emotional reactions and…  │      1 │
│ The thermic effect…  │ nutrition            │ [is substantially higher for carbohydrate than for protein, is accompanied by a slight decrease in…  │      2 │
└──────────────────────┴──────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘
```

This command retrieves a random sample of 3 rows from the dataset for us to examine.

Let's start by examining the schema of our dataset.
The following table outlines the structure of our dataset:

```bash
DESCRIBE FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' USING SAMPLE 3;

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ question    │ VARCHAR     │ YES     │         │         │         │
│ subject     │ VARCHAR     │ YES     │         │         │         │
│ choices     │ VARCHAR[]   │ YES     │         │         │         │
│ answer      │ BIGINT      │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
```

Next, let's check whether there are any duplicated records in our dataset (a row is a duplicate as soon as it appears more than once):

```bash
SELECT *, COUNT(*) AS counts FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' GROUP BY ALL HAVING counts > 1;

┌──────────┬─────────┬───────────┬────────┬────────┐
│ question │ subject │  choices  │ answer │ counts │
│ varchar  │ varchar │ varchar[] │ int64  │ int64  │
├──────────┴─────────┴───────────┴────────┴────────┤
│                      0 rows                      │
└──────────────────────────────────────────────────┘
```

Fortunately, our dataset doesn't contain any duplicate records.
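The same duplicate-detection logic can be sketched in plain Python once rows are loaded locally. This is only an illustration of what the grouped count does; the sample rows are made up and this does not use DuckDB itself:

```python
from collections import Counter

# Made-up rows shaped like (question, subject, choices, answer)
rows = [
    ("What is 2 + 2?", "math", ("3", "4"), 1),
    ("What is 2 + 2?", "math", ("3", "4"), 1),  # deliberate duplicate
    ("Capital of France?", "geo", ("Paris", "Rome"), 0),
]

# Count identical rows; any row appearing more than once is a duplicate
counts = Counter(rows)
duplicates = {row: n for row, n in counts.items() if n > 1}
```

Here `duplicates` plays the role of the query result: it is empty exactly when no full row repeats.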
Let's see the proportion of questions based on the subject in a bar representation:

```bash
SELECT subject, COUNT(*) AS counts, BAR(COUNT(*), 0, (SELECT COUNT(*) FROM 'hf://datasets/cais/mmlu/all/test-*.parquet')) AS percentage FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' GROUP BY subject ORDER BY counts DESC;

┌────────────────────────────┬────────┬────────────┐
│          subject           │ counts │ percentage │
│          varchar           │ int64  │  varchar   │
├────────────────────────────┼────────┼────────────┤
│ professional_law           │   1534 │ ████████▋  │
│ moral_scenarios            │    895 │ █████      │
│ miscellaneous              │    783 │ ████▍      │
│ professional_psychology    │    612 │ ███▍       │
│ high_school_psychology     │    545 │ ███        │
│ high_school_macroeconomics │    390 │ ██▏        │
│ elementary_mathematics     │    378 │ ██▏        │
│ moral_disputes             │    346 │ █▉         │
├────────────────────────────┴────────┴────────────┤
│ 57 rows (8 shown)                      3 columns │
└──────────────────────────────────────────────────┘
```

Now, let's prepare a subset of the dataset containing questions related to **nutrition** and create a mapping of questions to correct answers. Notice that we have the column **choices**, from which we can get the correct answer using the **answer** column as an index.

```bash
SELECT * FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' WHERE subject = 'nutrition' LIMIT 3;

┌──────────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐
│       question       │  subject  │                                               choices                                                │ answer │
│       varchar        │  varchar  │                                              varchar[]                                               │ int64  │
├──────────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│ Which foods tend t…  │ nutrition │ [Meat, Confectionary, Fruits and vegetables, Potatoes]                                               │      2 │
│ In which one of th…  │ nutrition │ [If the incidence rate of the disease falls., If survival time with the disease increases., If rec…  │      1 │
│ Which of the follo…  │ nutrition │ [The flavonoid class comprises flavonoids and isoflavonoids., The digestibility and bioavailabilit…  │      0 │
└──────────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘
```

```bash
SELECT question, choices[answer] AS correct_answer FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' WHERE subject = 'nutrition' LIMIT 3;

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬──────────────────────────────────────────────┐
│                                                              question                                                               │                correct_answer                │
│                                                              varchar                                                                │                   varchar                    │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────┤
│ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?\n                                           │ Confectionary                                │
│ In which one of the following circumstances will the prevalence of a disease in the population increase, all else being constant?\n │ If the incidence rate of the disease falls.  │
│ Which of the following statements is correct?\n                                                                                     │                                              │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────┘
```

To ensure data cleanliness, let's remove any newline characters at the end of the questions and filter out any empty answers:

```bash
SELECT regexp_replace(question, '\n', '') AS question, choices[answer] AS correct_answer FROM 'hf://datasets/cais/mmlu/all/test-*.parquet' WHERE subject = 'nutrition' AND LENGTH(correct_answer) > 0 LIMIT 3;

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────┐
│                                                     question                                                      │                correct_answer               │
│                                                     varchar                                                       │                   varchar                   │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────┤
│ Which foods tend to be consumed in lower quantities in Wales and Scotland (as of 2020)?                           │ Confectionary                               │
│ In which one of the following circumstances will the prevalence of a disease in the population increase, all el…  │ If the incidence rate of the disease falls. │
│ Which vitamin is a major lipid-soluble antioxidant in cell membranes?                                             │ Vitamin D                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────┘
```

Finally, let's highlight some of the DuckDB functions used in this section:

- `DESCRIBE`, returns the table schema.
- `USING SAMPLE`, randomly selects a subset of rows from a dataset.
- `BAR`, draws a band whose width is proportional to (x - min) and equal to width characters when x = max. Width defaults to 80.
- `string[begin:end]`, extracts a string using slice conventions. Missing begin or end arguments are interpreted as the beginning or end of the list respectively. Negative values are accepted.
- `regexp_replace`, if the string contains the regexp pattern, replaces the matching part with replacement.
- `LENGTH`, gets the number of characters in the string.

> [!TIP]
> There are plenty of useful functions available in DuckDB's [SQL functions overview](https://duckdb.org/docs/sql/functions/overview). The best part is that you can use them directly on Hugging Face datasets.

### Static HTML Spaces

https://huggingface.co/docs/hub/spaces-sdks-static.md

# Static HTML Spaces

Spaces also accommodate custom HTML for your app instead of using Streamlit or Gradio. Set `sdk: static` inside the `YAML` block at the top of your Spaces **README.md** file. Then you can place your HTML code within an **index.html** file.

Here are some examples of Spaces using custom HTML:

* [Smarter NPC](https://huggingface.co/spaces/mishig/smarter_npc): Display a PlayCanvas project with an iframe in Spaces.
* [Huggingfab](https://huggingface.co/spaces/pierreant-p/huggingfab): Display a Sketchfab model in Spaces.
* [Diffuse the rest](https://huggingface.co/spaces/huggingface-projects/diffuse-the-rest): Draw and diffuse the rest.

## Adding a build step before serving

Static Spaces support adding a custom build step before serving your static assets. This is useful for frontend frameworks like React, Svelte, and Vue that require a build process before serving the application. The build command runs automatically when your Space is updated.

Add `app_build_command` and `app_file` inside the `YAML` block at the top of your Spaces **README.md** file. For example:

- `app_build_command: npm run build`
- `app_file: dist/index.html`

Example Spaces:

- [Svelte App](https://huggingface.co/spaces/julien-c/vite-svelte)
- [React App](https://huggingface.co/spaces/coyotte508/static-vite)

Under the hood, it will [launch a build](https://huggingface.co/spaces/huggingface/space-build), storing the generated files in a special `refs/convert/build` ref.
## Space variables

Custom [environment variables](./spaces-overview#managing-secrets) can be passed to your Space. OAuth information such as the client ID and scope are also available as environment variables, if you have [enabled OAuth](./spaces-oauth) for your Space.

To use these variables in JavaScript, you can use the `window.huggingface.variables` object. For example, to access the `OAUTH_CLIENT_ID` variable, you can use `window.huggingface.variables.OAUTH_CLIENT_ID`.

Here is an example of a Space that uses custom environment variables with OAuth enabled, and displays the variables in the HTML:

* [Static Variables](https://huggingface.co/spaces/huggingfacejs/static-variables)

### Using spaCy at Hugging Face

https://huggingface.co/docs/hub/spacy.md

# Using spaCy at Hugging Face

`spaCy` is a popular library for advanced Natural Language Processing used widely across industry. `spaCy` makes it easy to use and train pipelines for tasks like named entity recognition, text classification, part of speech tagging and more, and lets you build powerful applications to process and analyze large volumes of text.

## Exploring spaCy models in the Hub

The official models from `spaCy` 3.3 are in the `spaCy` [Organization Page](https://huggingface.co/spacy). Anyone in the community can also share their `spaCy` models, which you can find by filtering at the left of the [models page](https://huggingface.co/models?library=spacy).

All models on the Hub come with useful features:

1. An automatically generated model card with label scheme, metrics, components, and more.
2. An evaluation section at the top right where you can look at the metrics.
3. Metadata tags that help with discoverability and contain information such as license and language.
4. An interactive widget you can use to play with the model directly in the browser.
5. An Inference Providers widget that lets you make inference requests.
## Using existing models

All `spaCy` models from the Hub can be directly installed using pip.

```bash
pip install "en_core_web_sm @ https://huggingface.co/spacy/en_core_web_sm/resolve/main/en_core_web_sm-any-py3-none-any.whl"
```

To find the link of interest, you can go to a repository with a `spaCy` model. When you open the repository, you can click `Use in spaCy` and you will be given a working snippet that you can use to install and load the model!

Once installed, you can load the model as any spaCy pipeline.

```python
# Using spacy.load().
import spacy
nlp = spacy.load("en_core_web_sm")

# Importing as module.
import en_core_web_sm
nlp = en_core_web_sm.load()
```

## Sharing your models

### Using the spaCy CLI (recommended)

The `spacy-huggingface-hub` library extends spaCy's native CLI so people can easily push their packaged models to the Hub. You can install spacy-huggingface-hub from pip:

```bash
pip install spacy-huggingface-hub
```

You can then check that the command has been registered successfully:

```bash
python -m spacy huggingface-hub --help
```

To push with the CLI, you can use the `huggingface-hub push` command as seen below.

```bash
python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo] [--verbose]
```

| Argument             | Type         | Description                                                                                                                   |
| -------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------- |
| `whl_path`           | str / `Path` | The path to the `.whl` file packaged with [`spacy package`](https://spacy.io/api/cli#package).                                 |
| `--org`, `-o`        | str          | Optional name of the organization to which the pipeline should be uploaded.                                                    |
| `--msg`, `-m`        | str          | Commit message to use for update. Defaults to `"Update spaCy pipeline"`.                                                       |
| `--local-repo`, `-l` | str / `Path` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory.  |
| `--verbose`, `-V`    | bool         | Output additional info for debugging, e.g. the full generated hub metadata.                                                    |

You can then upload any pipeline packaged with [`spacy package`](https://spacy.io/api/cli#package). Make sure to set `--build wheel` to output a binary .whl file. The uploader will read all metadata from the pipeline package, including the auto-generated pretty `README.md` and the model details available in the `meta.json`.

```bash
hf auth login
python -m spacy package ./en_ner_fashion ./output --build wheel
cd ./output/en_ner_fashion-0.0.0/dist
python -m spacy huggingface-hub push en_ner_fashion-0.0.0-py3-none-any.whl
```

In just a minute, you can get your packaged model on the Hub, try it out directly in the browser, and share it with the rest of the community. All the required metadata will be uploaded for you, and you even get a cool model card.

The command will output two things:

* Where to find your repo in the Hub! For example, https://huggingface.co/spacy/en_core_web_sm
* And how to install the pipeline directly from the Hub!

### From a Python script

You can use the `push` function from Python. It returns a dictionary containing the `"url"` and `"whl_url"` of the published model and the wheel file, which you can later install with `pip install`.

```py
from spacy_huggingface_hub import push

result = push("./en_ner_fashion-0.0.0-py3-none-any.whl")
print(result["url"])
```

## Additional resources

* spacy-huggingface-hub [library](https://github.com/explosion/spacy-huggingface-hub).
* Launch [blog post](https://huggingface.co/blog/spacy)
* spaCy v3.1 [Announcement](https://explosion.ai/blog/spacy-v3-1#huggingface-hub)
* spaCy [documentation](https://spacy.io/universe/project/spacy-huggingface-hub/)

### Webhook guide: build a Discussion bot based on BLOOM

https://huggingface.co/docs/hub/webhooks-guide-discussion-bot.md

# Webhook guide: build a Discussion bot based on BLOOM

Here's a short guide on how to use Hugging Face Webhooks to build a bot that replies to Discussion comments on the Hub with a response generated by BLOOM, a multilingual language model, using Inference Providers.

## Create your Webhook in your user profile

First, let's create a Webhook from your [settings](https://huggingface.co/settings/webhooks).

- Input a few target repositories that your Webhook will listen to.
- You can put a dummy Webhook URL for now, but defining your webhook will let you look at the events that will be sent to it (and you can replay them, which will be useful for debugging).
- Input a secret to make your Webhook more secure.
- Subscribe to Community (PR & discussions) events, as we are building a Discussion bot.

Your Webhook will look like this:

![webhook-creation](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/webhook-creation.png)

## Create a new `Bot` user profile

In this guide, we create a separate user account to host a Space and to post comments:

![discussion-bot-profile](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/discussion-bot-profile.png)

> [!TIP]
> When creating a bot that will interact with other users on the Hub, we ask that you clearly label the account as a "Bot" (see profile screenshot).

## Create a Space that will react to your Webhook

The third step is actually to listen to the Webhook events. An easy way is to use a Space for this.
We use the user account we created, but you could do it from your main user account if you wanted to. The Space's code is [here](https://huggingface.co/spaces/discussion-bot/webhook/tree/main). We used NodeJS and TypeScript to implement it, but any language or framework would work equally well. Read more about Docker Spaces [here](https://huggingface.co/docs/hub/spaces-sdks-docker).

**The main `server.ts` file is [here](https://huggingface.co/spaces/discussion-bot/webhook/blob/main/server.ts)**

Let's walk through what happens in this file:

```ts
app.post("/", async (req, res) => {
	if (req.header("X-Webhook-Secret") !== process.env.WEBHOOK_SECRET) {
		console.error("incorrect secret");
		return res.status(400).json({ error: "incorrect secret" });
	}
	...
```

Here, we listen to POST requests made to `/`, and then we check that the `X-Webhook-Secret` header is equal to the secret we had previously defined (you also need to set the `WEBHOOK_SECRET` secret in your Space's settings to be able to verify it).

```ts
const event = req.body.event;
if (
	event.action === "create" &&
	event.scope === "discussion.comment" &&
	req.body.comment.content.includes(BOT_USERNAME)
) {
	...
```

The event's payload is encoded as JSON. Here, we specify that we will run our Webhook only when:

- the event concerns a discussion comment
- the event is a creation, i.e. a new comment has been posted
- the comment's content contains `@discussion-bot`, i.e. our bot was just mentioned in a comment.
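Since any language works here, the secret check and the event filter can also be condensed into a small Python sketch using only the standard library. This is an illustration, not the guide's actual Space; `should_reply` is our own name:

```python
import hmac
import os

WEBHOOK_SECRET = os.environ.get("WEBHOOK_SECRET", "dummy-secret")
BOT_USERNAME = "@discussion-bot"

def should_reply(headers: dict, body: dict) -> bool:
    """True when the request is authentic and a new comment mentions the bot."""
    secret = headers.get("X-Webhook-Secret", "")
    # compare_digest gives a constant-time comparison for the shared secret
    if not hmac.compare_digest(secret, WEBHOOK_SECRET):
        return False
    event = body.get("event", {})
    return (
        event.get("action") == "create"
        and event.get("scope") == "discussion.comment"
        and BOT_USERNAME in body.get("comment", {}).get("content", "")
    )
```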
In that case, we will continue to the next step:

```ts
const INFERENCE_URL =
	"https://api-inference.huggingface.co/models/bigscience/bloom";
const PROMPT = `Pretend that you are a bot that replies to discussions about machine learning, and reply to the following comment:\n`;

const response = await fetch(INFERENCE_URL, {
	method: "POST",
	body: JSON.stringify({ inputs: PROMPT + req.body.comment.content }),
});
if (response.ok) {
	const output = await response.json();
	const continuationText = output[0].generated_text.replace(
		PROMPT + req.body.comment.content,
		""
	);
	...
```

This is the coolest part: we call Inference Providers for the BLOOM model, prompting it with `PROMPT`, and we get the continuation text, i.e., the part generated by the model.

Finally, we will post it as a reply in the same discussion thread:

```ts
const commentUrl = req.body.discussion.url.api + "/comment";

const commentApiResponse = await fetch(commentUrl, {
	method: "POST",
	headers: {
		Authorization: `Bearer ${process.env.HF_TOKEN}`,
		"Content-Type": "application/json",
	},
	body: JSON.stringify({ comment: continuationText }),
});

const apiOutput = await commentApiResponse.json();
```

## Configure your Webhook to send events to your Space

Last but not least, you'll need to configure your Webhook to send POST requests to your Space.

Let's first grab our Space's "direct URL" from the contextual menu. Click on "Embed this Space" and copy the "Direct URL".
![embed this Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/embed-space.png) ![direct URL](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/direct-url.png) Update your webhook to send requests to that URL: ![webhook settings](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/webhook-creation.png) ## Result ![discussion-result](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/001-discussion-bot/discussion-result.png) ### Academia Hub https://huggingface.co/docs/hub/academia-hub.md # Academia Hub > [!TIP] > Ask your university's IT or Procurement Team to get in touch from a university-affiliated email address to initiate the subscription process. ## Accelerate your university's AI research, publication pipeline, and collaboration at scale The Hugging Face Hub is where leading researchers and developers across academia and industry collaborate on AI throughout the whole research lifecycle. Academia Hub brings that proven ecosystem to your university, giving your researchers everything they need to work securely, reproducibly, and at scale: compute, storage, collaboration, and governance, all managed through your institution. With Academia Hub, you get **university-level seat management and accounts for researchers, professors, and/or students.** ## Why Academia Hub: Built for the complete research lifecycle Academia Hub scales with your research from early prototypes to large-scale published models while ensuring security, reproducibility, and seamless collaboration across your entire institution. 1. **Store & version** datasets, models, and results with 1TB private storage per seat. 2. **Collaborate & review** with co-authors and lab members in shared workspaces. 3. 
**Prototype** ideas using interactive Spaces and hosted notebooks.
4. **Train & scale** with managed GPUs and tracked runs.
5. **Publish & share** using model cards, DOIs, and dataset releases.
6. **Preserve** your work for reproducibility and future research.

All backed by enterprise-grade infrastructure, institutional governance, and storage that grows with your needs.

## Key features of Academia Hub

***For researchers and students***

- **Storage**: 1 TB private storage per seat (e.g., 400 seats = 400 TB) powered by Xet, purpose-built for versioning large AI models and datasets; expanded public storage; Dataset Viewer for private datasets.
- **Hosting & demos**: Spaces Hosting for scalable AI demos and applications powered by ZeroGPU (5× priority quota); Dev Mode with SSH/VS Code access for development.
- **Compute**: Priority GPU access (H100/H200) for training; managed runs with experiment tracking.
- **Collaboration**: Team workspaces with version control, peer review, and shared governance.
- **Publishing**: Share research artifacts such as models and datasets with the global AI community through citable releases with model cards, dataset cards, and DOIs.

***For administrators***

- **Pricing**: $10/seat/month (volume-based pricing); $2/seat/month compute credits included (top-ups available).
- **Admin & security**: University-level seat management for researchers, professors, and students; centralized administration; SSO with university domain.

***For your research community***

- **Community & resources**: Connect with peers; curated models/datasets/projects for academia.

## How to get started

Researchers and students: Contact us to express interest in Academia Hub and help us connect with your university's IT or Procurement Team.

IT or Procurement staff: Get in touch directly to set up your institution's Academia Hub subscription or find out more about how your institution can benefit from Academia Hub.
### GGUF usage with llama.cpp

https://huggingface.co/docs/hub/gguf-llamacpp.md

# GGUF usage with llama.cpp

> [!TIP]
> You can now deploy any llama.cpp compatible GGUF on Hugging Face Endpoints, read more about it [here](https://huggingface.co/docs/inference-endpoints/en/others/llamacpp_container)

Llama.cpp lets you download and run inference on a GGUF file simply by providing the Hugging Face repo path and the file name. llama.cpp downloads the model checkpoint and caches it automatically. The cache location is defined by the `LLAMA_CACHE` environment variable; read more about it [here](https://github.com/ggerganov/llama.cpp/pull/7826).

You can install llama.cpp through brew (works on Mac and Linux), or you can build it from source. There are also pre-built binaries and Docker images that you can [check in the official documentation](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage).

### Option 1: Install with brew/winget

```bash
brew install llama.cpp
```

or, on Windows, via winget:

```bash
winget install llama.cpp
```

### Option 2: Build from source

Step 1: Clone llama.cpp from GitHub.

```bash
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it. You can also add hardware-specific flags (e.g. `-DGGML_CUDA=ON` for Nvidia GPUs).

```bash
cd llama.cpp
cmake -B build # optionally, add -DGGML_CUDA=ON to activate CUDA
cmake --build build --config Release
```

Note: for other hardware support (e.g. AMD ROCm, Intel SYCL), please refer to [llama.cpp's build guide](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md)

Once installed, you can use `llama-cli` or `llama-server` as follows:

```bash
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

Note: You can explicitly add `-no-cnv` to run the CLI in raw completion mode (non-chat mode).
Additionally, you can invoke an OpenAI-spec chat completions endpoint directly using the llama.cpp server:

```bash
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
```

After running the server, you can simply query the endpoint as below:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about Python exceptions"
      }
    ]
  }'
```

Replace the `-hf` value with any valid Hugging Face Hub repo name - off you go! 🦙

### Evaluation Results

https://huggingface.co/docs/hub/eval-results.md

# Evaluation Results

> [!WARNING]
> This is a work in progress feature.

The Hub provides a decentralized system for tracking model evaluation results. Benchmark datasets host leaderboards, and model repos store evaluation scores that automatically appear on both the model page and the benchmark's leaderboard.

## Benchmark Datasets

Dataset repos can be defined as **Benchmarks** (e.g., [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro), [HLE](https://huggingface.co/datasets/cais/hle), [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa)). These display a "Benchmark" tag, automatically aggregate evaluation results from model repos across the Hub, and display a leaderboard of top models.

![Benchmark Dataset](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/benchmark-preview.png)

## Model Evaluation Results

Evaluation scores are stored in model repos as YAML files in the `.eval_results/` folder.
These results:

- Appear on the model page with links to the benchmark leaderboard
- Are aggregated into the benchmark dataset's leaderboards
- Can be submitted via PRs and marked as "community-provided"

![Model Evaluation Results](https://huggingface.co/huggingface/documentation-images/resolve/main/evaluation-results/eval-results-previw.png)

### Adding Evaluation Results

To add evaluation results to a model, you can submit a PR to the model repo with a YAML file in the `.eval_results/` folder. Create a YAML file in `.eval_results/*.yaml` in your model repo:

```yaml
- dataset:
    id: cais/hle # Required. Hub dataset ID (must be a Benchmark)
    task_id: default # Required. ID of the Task, as defined in the dataset's eval.yaml
    revision: # Optional. Dataset revision hash
  value: 20.90 # Required. Metric value
  verifyToken: # Optional. Cryptographic proof of auditable evaluation
  date: "2025-01-15" # Optional. ISO-8601 date or datetime of when the eval was run (defaults to git commit time)
  source: # Optional. Attribution for this result, for instance a repo containing output traces or a Paper
    url: https://huggingface.co/spaces/SaylorTwift/smollm3-mmlu-pro # Required if source provided
    name: Eval traces # Optional. Display name
    user: SaylorTwift # Optional. HF username/org
  notes: "no-tools" # Optional. Details about the evaluation setup (e.g., "tools", "no-tools", etc.)
```

Or, with only the required attributes:

```yaml
- dataset:
    id: Idavidrein/gpqa
    task_id: gpqa_diamond
  value: 0.412
```

Results display badges based on their metadata in the YAML file:

| Badge | Condition |
|-------|-----------|
| verified | A `verifyToken` is valid (evaluation ran in HF Jobs with inspect-ai) |
| community | Result submitted via open PR (not merged to main) |
| leaderboard | Links to the benchmark dataset |
| source | Links to evaluation logs or external source |

For more details on how to format this data, check out the [Eval Results](https://github.com/huggingface/hub-docs/blob/main/eval_results.yaml) specifications.

### Community Contributions

Anyone can submit evaluation results to any model via Pull Request:

1. Go to the model page, click on the "Community" tab, and open a Pull Request.
2. Add a `.eval_results/*.yaml` file with your results.
3. The PR will show as "community-provided" on the model page while open.

For help evaluating a model, see the [Evaluating models with Inspect](https://huggingface.co/docs/inference-providers/guides/evaluation-inspect-ai) guide.

> [!TIP]
> Community scores are visible while the PR is open. If a score is disputed, the model author can close the PR to remove it. The goal is to surface existing evaluation data transparently while building toward a fully reproducible standard via verified scores.

## Registering a Benchmark

To register your dataset as a benchmark:

1. Create a dataset repo containing your evaluation data
2. Add an `eval.yaml` file to the repo root with your benchmark configuration, conforming to the specification defined below
3. The file is validated at push time
4. (**Beta**) Get in touch so we can add it to the allow-list.
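Before opening a PR, the required/optional split above can be sanity-checked with a short script. This is a hypothetical validator, not an official tool; it expects the already-parsed YAML as a Python list:

```python
REQUIRED_DATASET_KEYS = {"id", "task_id"}

def validate_eval_results(results: list) -> list[str]:
    """Return a list of problems found in parsed .eval_results/*.yaml data.

    Mirrors the spec above: each entry needs dataset.id, dataset.task_id, and
    a numeric value; source, if present, requires a url. Everything else is optional.
    """
    problems = []
    for i, entry in enumerate(results):
        dataset = entry.get("dataset") or {}
        for key in sorted(REQUIRED_DATASET_KEYS - dataset.keys()):
            problems.append(f"entry {i}: missing dataset.{key}")
        if not isinstance(entry.get("value"), (int, float)):
            problems.append(f"entry {i}: missing or non-numeric value")
        source = entry.get("source")
        if source is not None and "url" not in source:
            problems.append(f"entry {i}: source provided without required url")
    return problems

# The minimal GPQA entry shown above passes:
ok = [{"dataset": {"id": "Idavidrein/gpqa", "task_id": "gpqa_diamond"}, "value": 0.412}]
assert validate_eval_results(ok) == []
```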
Examples can be found in these benchmarks: [GPQA](https://huggingface.co/datasets/Idavidrein/gpqa/blob/main/eval.yaml), [MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro/blob/main/eval.yaml), [HLE](https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml), [GSM8K](https://huggingface.co/datasets/openai/gsm8k/blob/main/eval.yaml).

## Eval.yaml specification

The `eval.yaml` file should contain the following fields:

- `name` — Human-readable display name for the benchmark (e.g. `"Humanity's Last Exam"`).
- `description` — Short description of what the benchmark measures.
- `evaluation_framework` — Canonical evaluation framework identifier for this benchmark. This is an enumeration maintained by the Hugging Face team; add your own to the list [here](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/eval.ts). Exactly one framework is supported per benchmark.
- `tasks[]` — List of tasks (sub-leaderboards) defined by this benchmark (see below).

Required fields in each `tasks[]` item:

- `id` — Unique identifier for the task (e.g. `"gpqa_diamond"`). A single benchmark can define several tasks, each producing its own leaderboard. Feel free to choose a leaderboard identifier for each task.

Optional fields in each `tasks[]` item:

- `config` — Configuration of the Hugging Face dataset to evaluate (e.g. `"default"`). Defaults to the dataset's default config.
- `split` — Split of the Hugging Face dataset to evaluate (e.g. `"test"`). Defaults to `"test"`.

When setting `evaluation_framework: inspect-ai`, you must also set the following fields:

- `field_spec` — Specification of the input and output fields. Consists of `input`, `target`, `choices` and optional `input_image` subfields. See the [docs](https://inspect.aisi.org.uk/tasks.html#hugging-face) for more details.
- `solvers` — Solvers used to go from input to output using the AI model.
This can range from a simple system prompt to self-critique loops. See the [docs](https://inspect.aisi.org.uk/solvers.html) for more details.
- `scorers` — Scorers used. Scorers determine whether solvers were successful in finding the right output for the target defined in the dataset, and in what measure. See the [docs](https://inspect.aisi.org.uk/scorers.html) for more details.

Minimal example (required fields only):

```yaml
name: MathArena AIME 2026
description: The American Invitational Mathematics Exam (AIME).
evaluation_framework: math-arena
tasks:
  - id: MathArena/aime_2026
```

Extended example:

```yaml
name: MathArena AIME 2026
description: The American Invitational Mathematics Exam (AIME).
evaluation_framework: "math-arena"
tasks:
  - id: MathArena/aime_2026
    config: default
    split: test
```

Extended example (`"inspect-ai"`-specific):

```yaml
name: Humanity's Last Exam
description: >
  Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge,
  designed to be the final closed-ended academic benchmark of its kind with broad subject
  coverage. Humanity's Last Exam consists of 2,500 questions across dozens of subjects,
  including mathematics, humanities, and the natural sciences. HLE is developed globally by
  subject-matter experts and consists of multiple-choice and short-answer questions suitable
  for automated grading.
evaluation_framework: "inspect-ai"
tasks:
  - id: hle
    config: default
    split: test
    field_spec:
      input: question
      input_image: image
      target: answer
    solvers:
      - name: system_message
        args:
          template: |
            Your response should be in the following format:
            Explanation: {your explanation for your answer choice}
            Answer: {your chosen answer}
            Confidence: {your confidence score between 0% and 100% for your answer}
      - name: generate
    scorers:
      - name: model_graded_fact
        args:
          model: openai/o3-mini
```

### Network Security

https://huggingface.co/docs/hub/enterprise-network-security.md

# Network Security

> [!WARNING]
> This feature is part of the Enterprise Plus plan.

## Define your organization IP Ranges

You can list the IP addresses of your organization's outbound traffic to apply for higher rate limits and/or to enforce authenticated access to Hugging Face from your corporate network.

The outbound IP address ranges are defined in CIDR format. For example, `52.219.168.0/24` or `2600:1f69:7400::/40`. You can set multiple ranges, one per line.

Once organization admins populate the "Organization IP Ranges" in the Network Security settings, a manual verification, carried out jointly by Hugging Face Solution Engineers and the organization's admins, is required for the "Require login for users in your IP ranges" setting to become available.

After the "Organization IP Ranges" have been manually verified, and the organization admins have enabled both "Restrict organization access to your IP ranges only" and "Require login for users in your IP ranges", the following flow applies:

- When a user arrives on the platform, their IP address is checked.
- If the IP falls within the organization's defined ranges, the user must authenticate (via the organization's SSO if enabled).
- Once authenticated, the Content Access Policy determines which resources the user can access.
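The IP check in the flow above amounts to a CIDR membership test. A minimal sketch with Python's stdlib `ipaddress` module, using the example ranges from this page:

```python
import ipaddress

# Example ranges from this page; real deployments would load the org's configured list.
ORG_RANGES = [ipaddress.ip_network(r) for r in ("52.219.168.0/24", "2600:1f69:7400::/40")]

def in_org_ranges(ip: str, ranges=ORG_RANGES) -> bool:
    """True if ip falls inside any of the organization's CIDR ranges.

    Membership across mismatched IP versions is simply False in the stdlib,
    so v4 and v6 ranges can be mixed freely in the same list.
    """
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)
```

Only requests for which this test is true would be forced through the login and SSO flow described above.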
## Higher Hub Rate Limits

Most actions on the Hub are rate-limited; for example, users can only create a certain number of repositories per day. Enterprise Plus automatically gives your users the highest rate limits possible for every action.

Additionally, once your IP ranges are set, enabling the "Higher Hub Rate Limits" option allows your organization to benefit from the highest HTTP rate limits on the Hub API, unlocking large volumes of model or dataset downloads.

For more information about rate limits, see the [Hub Rate limits](./rate-limits) documentation.

## Restrict organization access to your IP ranges only

This option restricts access to your organization's resources to requests coming from your defined IP ranges. No one can access your organization's resources from outside your IP ranges. The rules also apply to access tokens. When enabled, this option unlocks additional nested security settings below.

### Require login for users in your IP ranges

When this option is enabled, anyone visiting Hugging Face from your corporate network must be logged in and belong to your organization (a new manual verification is required when IP ranges have changed). If enabled, you can optionally define a content access policy. All public pages will show the following message if access is unauthenticated:

### Content Access Policy

Define a fine-grained Content Access Policy by blocking certain sections of the Hugging Face Hub. For example, you can block your organization's members from accessing Spaces by adding `/spaces/*` to the blocked URLs. When users of your organization navigate to a page that matches the URL pattern, they'll be presented with the following page:

To define Blocked URLs, enter URL patterns, without the domain name, one per line. The Allowed URLs field enables you to define exceptions to the blocking rules.
For example, you can allow a specific URL within a blocked pattern, e.g. `/spaces/meta-llama/*`.

### Advanced Topics

https://huggingface.co/docs/hub/spaces-advanced.md

# Advanced Topics

## Contents

- [Using OpenCV in Spaces](./spaces-using-opencv)
- [More ways to create Spaces](./spaces-more-ways-to-create)
- [Managing Spaces with Github Actions](./spaces-github-actions)
- [Managing Spaces with CircleCI Workflows](./spaces-circleci)
- [Custom Python Spaces](./spaces-sdks-python)
- [How to Add a Space to ArXiv](./spaces-add-to-arxiv)
- [Cookie limitations in Spaces](./spaces-cookie-limitations)
- [How to handle URL parameters in Spaces](./spaces-handle-url-parameters)
- [How to get user status and plan in Spaces](./spaces-get-user-plan)

### GitHub Actions

https://huggingface.co/docs/hub/repositories-github-actions.md

# GitHub Actions

You can use [GitHub Actions](https://docs.github.com/en/actions) to automatically sync your GitHub repository to the Hugging Face Hub. The official [`huggingface/hub-sync`](https://github.com/marketplace/actions/sync-github-to-hugging-face-hub) action supports syncing **Models**, **Datasets**, and **Spaces**.

## Setup

1. Create a Hugging Face [access token](https://huggingface.co/settings/tokens) with **write** permission to the target repo. For better security, use a [fine-grained token](https://huggingface.co/settings/tokens) scoped to only the repository you're syncing to.
2. Add the token as a [GitHub secret](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-an-environment) called `HF_TOKEN` in your repository settings.
3. Add a workflow file (e.g. `.github/workflows/sync-to-hub.yml`) to your repository.
## Basic usage

```yaml
name: Sync to Hugging Face Hub
on:
  push:
    branches: [main]
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: huggingface/hub-sync@v0.1.0
        with:
          github_repo_id: ${{ github.repository }}
          huggingface_repo_id: username/repo-name
          hf_token: ${{ secrets.HF_TOKEN }}
```

By default, this syncs to a **Space**. To sync a model or dataset, set the `repo_type` parameter:

```yaml
- uses: huggingface/hub-sync@v0.1.0
  with:
    github_repo_id: ${{ github.repository }}
    huggingface_repo_id: username/my-dataset
    hf_token: ${{ secrets.HF_TOKEN }}
    repo_type: dataset
```

## Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| `github_repo_id` | Yes | — | GitHub repository (use `${{ github.repository }}`) |
| `huggingface_repo_id` | Yes | — | Target repo on the Hub (`username/repo-name`) |
| `hf_token` | Yes | — | Hugging Face access token |
| `repo_type` | No | `space` | `space`, `model`, or `dataset` |
| `space_sdk` | No | `gradio` | `gradio`, `streamlit`, `docker`, or `static` |
| `private` | No | `false` | Whether to create the repo as private |
| `subdirectory` | No | `.` | Sync a specific subdirectory (useful for monorepos) |

The action mirrors your files to the Hub using the `hf` CLI; it is not a git-to-git sync. It automatically excludes `.github/` and `.git/` directories and mirrors deletions (files removed from GitHub will be removed from the Hub).

For more complex workflows (e.g. build steps, custom upload logic), you can install and use the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) directly in your workflow instead.

For Spaces-specific guidance (file size limits, LFS handling), see [Managing Spaces with GitHub Actions](./spaces-github-actions).
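The mirroring behavior described above (exclude `.github/` and `.git/`, mirror deletions) can be sketched in a few lines of Python. This is a hypothetical `plan_sync` helper illustrating the semantics, not the action's real implementation:

```python
EXCLUDED_PREFIXES = (".git/", ".github/")

def plan_sync(local_files: set[str], hub_files: set[str]) -> tuple[set[str], set[str]]:
    """Compute (uploads, deletions) for a mirror-style sync.

    Paths under .git/ and .github/ are never uploaded, matching the exclusions
    described above; Hub files absent locally are deleted (deletion mirroring).
    """
    syncable = {f for f in local_files if not f.startswith(EXCLUDED_PREFIXES)}
    uploads = syncable  # a real sync would also skip files whose content is unchanged
    deletions = hub_files - syncable
    return uploads, deletions
```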
### How to configure OIDC SSO with Okta

https://huggingface.co/docs/hub/security-sso-okta-oidc.md

# How to configure OIDC SSO with Okta

In this guide, we will use Okta as the SSO provider, with the OpenID Connect (OIDC) protocol as our identity protocol.

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

## Step 1: Create a new application in your Identity Provider

Open a new tab/window in your browser and sign in to your Okta account. Navigate to "Admin/Applications" and click the "Create App Integration" button.

Then choose an "OIDC - OpenID Connect" application, select the application type "Web Application", and click "Create".

## Step 2: Configure your application in Okta

Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the OIDC protocol.

Copy the "Redirection URI" from the organization's settings on Hugging Face, and paste it in the "Sign-in redirect URI" field on Okta. The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/oidc/consume`.

You can leave the optional Sign-out redirect URIs blank. Save your new application.

## Step 3: Finalize configuration on Hugging Face

In your Okta application, under "General", find the following fields:

- Client ID
- Client secret
- Issuer URL

You will need these to finalize the SSO setup on Hugging Face.

The Okta Issuer URL is generally a URL like `https://tenantId.okta.com`; you can refer to their [guide](https://support.okta.com/help/s/article/What-is-theIssuerlocated-under-the-OpenID-Connect-ID-Token-app-settings-used-for?language=en_US) for more details.

In the SSO section of your organization's settings on Hugging Face, copy-paste these values from Okta:

- Client ID
- Client Secret

You can now click on "Update and Test OIDC configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt.
Once logged in, you'll be redirected to your organization's settings page. A green check mark near the OIDC selector will attest that the test was successful.

## Step 4: Enable SSO in your organization

Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button.

Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section.

### Digital Object Identifier (DOI)

https://huggingface.co/docs/hub/doi.md

# Digital Object Identifier (DOI)

The Hugging Face Hub lets you generate a DOI for your models or datasets. DOIs (Digital Object Identifiers) are strings uniquely identifying a digital object, anything from articles to figures, including datasets and models. DOIs are tied to object metadata, including the object's URL, version, creation date, description, etc. They are a commonly accepted reference to digital resources across research and academic communities; they are analogous to a book's ISBN.

## How to generate a DOI?

To do this, go to the settings of your model or dataset. In the DOI section, a button called "Generate DOI" should appear:

To generate the DOI for this model or dataset, you need to click on this button and acknowledge that some features on the Hub will be restricted and that some of your information (your full name) will be transferred to our partner DataCite. When generating a DOI, you can optionally personalize the author name list, allowing you to credit all contributors to your model or dataset.

After you agree to those terms, your model or dataset will get a DOI assigned, and a new tag should appear in your model or dataset header, allowing you to cite it.

## Can I regenerate a new DOI if my model or dataset changes?

If there's a new version of a model or dataset, a new DOI can easily be assigned, and the previous version of the DOI gets outdated.
This makes it easy to refer to a specific version of an object, even if it has changed. You just need to click on "Generate new DOI" and tadaam! 🎉 A new DOI is assigned for the current revision of your model or dataset.

## Why is there a 'locked by DOI' message on delete, rename and change visibility actions on my model or dataset?

DOIs make it easier to find information about a model or dataset and to share it with the world via a permanent link that will never expire or change. As such, datasets/models with DOIs are intended to persist perpetually and may only be deleted, renamed, or have their visibility changed by filing a request with our support (website at huggingface.co).

## Further Reading

- [Introducing DOI: the Digital Object Identifier to Datasets and Models](https://huggingface.co/blog/introducing-doi)

### How to get a user's plan and status in Spaces

https://huggingface.co/docs/hub/spaces-get-user-plan.md

# How to get a user's plan and status in Spaces

From inside a Space's iframe, you can check whether a user is logged in on the main site, and whether they have a PRO subscription or one of their orgs has a paid subscription.

```js
window.addEventListener("message", (event) => {
  if (event.data.type === "USER_PLAN") {
    console.log("plan", event.data.plan);
  }
});

window.parent.postMessage({ type: "USER_PLAN_REQUEST" }, "https://huggingface.co");
```

`event.data.plan` will be of type:

```ts
{
  user: "anonymous",
  org: undefined
} | {
  user: "pro" | "free",
  org: undefined | "team" | "enterprise" | "plus" | "academia"
}
```

You will get both the user's status (logged out = `"anonymous"`) and their plan.
## Examples

- https://huggingface.co/spaces/huggingfacejs/plan

### Programmatic User Access Control Management

https://huggingface.co/docs/hub/programmatic-user-access-control.md

# Programmatic User Access Control Management

This guide describes how to manage organization member roles and resource group membership via the Hub API: changing a member's organization role and resource group assignments, listing resource groups, adding users to groups, and batch workflows.

**Table of contents:**

- [Change member role via API](#change-member-role-via-api) — Set a member's org role and resource group assignments (one member per request).
- [Resource Groups API](#resource-groups-api) — List resource groups and add users to them.
- [Configure auto-join via API](#configure-auto-join-via-api) — Enable or disable auto-join on a Resource Group.

---

## Change member role via API

You can change a member's **organization role** (No Access / Read / Contributor / Write / Admin) and, optionally, their roles in **resource groups** using the Hub API. The API updates **one member per request**. To change roles for multiple members, call the API in a loop (examples below).

**OpenAPI reference:** `PUT /api/organizations/{name}/members/{username}/role`

### Prerequisites

- Your organization must have a **subscription plan** (e.g. Team or Enterprise). The endpoint returns 402 otherwise.
- You must be authenticated as an organization member with **Write** (or Admin) permission on the organization.
- The target user must already be a **member** of the organization.

### Base URL and authentication

- **Base URL:** `https://huggingface.co`
- **Authentication:** Send your token in the request header:

```http
Authorization: Bearer <HF_TOKEN>
```

Create a fine-grained token with the "Write access to organizations settings / member management" permission scoped to your org at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
### Change member role endpoint

**Request**

```http
PUT /api/organizations/{org_name}/members/{username}/role
Authorization: Bearer <HF_TOKEN>
Content-Type: application/json

{
  "role": "read",
  "resourceGroups": []
}
```

- **Path parameters**
  - `org_name`: Organization slug (e.g. `my-org`).
  - `username`: Hugging Face **username** of the member whose role you are changing.
- **Body**
  - `role` (required): The member's **organization-level** role. One of: `"no_access"`, `"read"`, `"contributor"`, `"write"`, or `"admin"`.
  - `resourceGroups` (optional): Array of resource group assignments for this user. Each item:
    - `id`: Resource group ID (24-character hex string; get IDs from the [resource groups list API](#list-resource-groups)).
    - `role`: Role in that resource group: `"read"`, `"contributor"`, `"write"`, or `"admin"`.
  - If you omit `resourceGroups` or pass `[]`, the user is removed from all resource groups. To only change the org role and leave resource groups unchanged, pass their current resource group memberships (the body always sets both the org role and the resource group list).

**Example (curl) – set org role to "read", no resource groups (removes any the user was previously in)**

```bash
curl -s -X PUT \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"role":"read","resourceGroups":[]}' \
  "https://huggingface.co/api/organizations/my-org/members/member1/role"
```

**Example (curl) – set org role and resource group roles (overrides any current groups)**

```bash
curl -s -X PUT \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"role":"write","resourceGroups":[{"id":"507f1f77bcf86cd799439011","role":"read"}]}' \
  "https://huggingface.co/api/organizations/my-org/members/member2/role"
```

**Success response:** Status `200 OK`; body: `{ "success": true }`.

**Typical errors**

- `400` — Invalid body (e.g. invalid role or resource group `id`).
- `402` — Organization does not have a subscription plan.
- `403` — Not allowed (e.g. you lack Write on the org, or a resource group is not in the org).
- `404` — Organization or user not found.

### Updating multiple members

The API changes **one member per request**. There is no bulk endpoint. To update many members, call the endpoint once per username (e.g. from a list or CSV).

**Example: Bash – loop over usernames, same role for all**

```bash
ORG_NAME="my-org"
ROLE="read"

for username in member1 member2 member3 member4; do
  echo "Setting $username to $ROLE ..."
  curl -s -w "\n%{http_code}" -X PUT \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"role\":\"$ROLE\",\"resourceGroups\":[]}" \
    "https://huggingface.co/api/organizations/$ORG_NAME/members/$username/role"
  echo ""
done
```

**Example: Python – loop over usernames**

```python
import os
import requests

BASE_URL = "https://huggingface.co"
HF_TOKEN = os.environ.get("HF_TOKEN", "")

def change_member_role(org_name: str, username: str, role: str, resource_groups: list | None = None):
    payload = {"role": role, "resourceGroups": resource_groups or []}
    r = requests.put(
        f"{BASE_URL}/api/organizations/{org_name}/members/{username}/role",
        headers={"Authorization": f"Bearer {HF_TOKEN}", "Content-Type": "application/json"},
        json=payload,
    )
    if r.status_code != 200:
        raise RuntimeError(f"{r.status_code}: {r.text}")
    return r.json()

org_name = "my-org"
role = "read"

for username in ["member1", "member2", "member3", "member4"]:
    print(f"Setting {username} to {role} ... ", end="")
    try:
        change_member_role(org_name, username, role)
        print("OK")
    except Exception as e:
        print(f"Failed: {e}")
```

For different roles per user, loop over `(username, role)` pairs (e.g. from a CSV) and call `change_member_role` for each.

---

## Resource Groups API

The following endpoints let you **list** resource groups and **add** users to them.
To **change** an existing member's organization-level role or their resource group assignments, see [Change member role via API](#change-member-role-via-api) above.

**OpenAPI reference:** [Resource groups](https://huggingface.co/spaces/huggingface/openapi#tag/resource-groups)

**Table of contents — API approaches:**

| Goal | Section |
| -------------------------------------------------- | ----------------------------------------------------------------------- |
| Add many users to **one** resource group | [Add users to a resource group](#add-users-to-a-resource-group) |
| Add the **same** users to **many** resource groups | [Batch-add by looping over the API](#batch-add-by-looping-over-the-api) |
| Add **different** users per group | [Batch-add by looping over the API](#batch-add-by-looping-over-the-api) |

### Base URL and authentication

- **Base URL:** `https://huggingface.co`
- **Authentication:** Use one of:
  - **Access token (recommended for scripts):** Create a fine-grained token with the "Write access to organizations settings / member management" permission scoped to your org at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). Send it in the request header:
    ```http
    Authorization: Bearer <token>
    ```
  - **Session cookie:** If calling from a browser or tool that shares the same session as the Hub UI, the cookie is sent automatically.

### List resource groups

Get all resource groups you can manage for the organization. Use this to obtain each group's `id` for the add-users calls.

**Request**

```http
GET /api/organizations/{org_name}/resource-groups
Authorization: Bearer <token>
```

**Example (curl)**

```bash
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/api/organizations/my-org/resource-groups"
```

**Example response (trimmed)**

```json
[
  {
    "id": "507f1f77bcf86cd799439011",
    "name": "Cohort 2024",
    "description": "Members in this group",
    "users": [...],
    "repos": [...]
  }
]
```

Use the `id` of each resource group when adding users.

### Add users to a resource group

Add one or more users to a single resource group; you can send multiple users in the same request.

**Request**

```http
POST /api/organizations/{org_name}/resource-groups/{resource_group_id}/users
Authorization: Bearer <token>
Content-Type: application/json

{
  "users": [
    { "user": "member1", "role": "read" },
    { "user": "member2", "role": "read" },
    { "user": "member3", "role": "write" }
  ]
}
```

- **Path parameters**
  - `org_name`: Organization slug (e.g. `my-org`).
  - `resource_group_id`: The resource group's `id` (24-character hex string from the list endpoint).
- **Body**
  - `users`: Array of objects. Each object must have:
    - `user`: Hugging Face **username** (required).
    - `role`: One of `"read"`, `"contributor"`, `"write"`, `"admin"`.

**Example (curl)**

```bash
curl -s -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"users":[{"user":"member1","role":"read"},{"user":"member2","role":"read"}]}' \
  "https://huggingface.co/api/organizations/my-org/resource-groups/507f1f77bcf86cd799439011/users"
```

**Success:** Status `200 OK`; body is the updated resource group object (includes the new users in `users`).

**Typical errors:**

- `400` — e.g. user not found, duplicate usernames, or invalid body.
- `403` — Not allowed (e.g. not in the org, or already in the resource group). The message indicates whether users are not in the organization or already in the group.

### Adding members via email (workaround)

The add-users endpoint only accepts **Hugging Face usernames**, not emails. If you have a list of **emails** (e.g. member emails), you can resolve email → username first, then call the add-users API.
Note that email filtering **only** works when the email's domain matches one of the organization's allowed domains: the **Organization email domain** (Settings → Account → Organization email domain) and/or the org's **SSO allowed domains** (if SSO is configured).

**Step 1 – Resolve email to username**

```http
GET /api/organizations/{org_name}/members?email={email}&limit=1
Authorization: Bearer <token>
```

The response is an array of members; each member has `user` (username). Use `user` for the add-users call.

**Step 2 – Add to resource group**

Use the username from step 1 in a normal add-users request:

```http
POST /api/organizations/{org_name}/resource-groups/{resource_group_id}/users
Content-Type: application/json

{ "users": [{ "user": "<username>", "role": "read" }] }
```

**Example: one email (bash)**

```bash
ORG_NAME="my-org"
RG_ID="507f1f77bcf86cd799439011"
EMAIL="member@org.com"

# Step 1: look up member by email (domain must match org's Organization email domain or SSO allowed domains)
MEMBERS=$(curl -s -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/api/organizations/$ORG_NAME/members?email=$EMAIL&limit=1")
USERNAME=$(echo "$MEMBERS" | jq -r '(.[0] // {} | .user // "")')
if [ -z "$USERNAME" ]; then
  echo "No member found for $EMAIL"
  exit 1
fi

# Step 2: add to resource group
curl -s -X POST -H "Authorization: Bearer $HF_TOKEN" -H "Content-Type: application/json" \
  -d "{\"users\":[{\"user\":\"$USERNAME\",\"role\":\"read\"}]}" \
  "https://huggingface.co/api/organizations/$ORG_NAME/resource-groups/$RG_ID/users"
```

**Example: multiple emails in a loop (Python)**

```python
import os
import requests

BASE = "https://huggingface.co"
ORG = "my-org"
RG_ID = "507f1f77bcf86cd799439011"
ROLE = "read"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}", "Content-Type": "application/json"}

emails = ["member1@org.com", "member2@org.com"]
for email in emails:
    # Step 1: resolve email → username (email domain must match org's Organization email domain or SSO allowed domains)
    r = requests.get(f"{BASE}/api/organizations/{ORG}/members",
                     params={"email": email, "limit": 1}, headers=headers)
    r.raise_for_status()
    members = r.json()
    if not members:
        print(f"No member found for {email}")
        continue
    username = members[0]["user"]

    # Step 2: add that user to the resource group
    add_r = requests.post(
        f"{BASE}/api/organizations/{ORG}/resource-groups/{RG_ID}/users",
        headers=headers,
        json={"users": [{"user": username, "role": ROLE}]},
    )
    if add_r.status_code == 200:
        print(f"Added {username} ({email})")
    else:
        print(f"Failed {email}: {add_r.status_code} {add_r.text}")
```

If a user is already in the resource group, the add call returns `403`; the script reports it as a failure, and you can skip or ignore that case if you prefer.

**Limitation:** The email filter only applies when the org has an **Organization email domain** and/or **SSO allowed domains** set, and the email's domain matches one of them. Otherwise you cannot look up by email via the members API; you'd need another source for email → username (e.g. your own directory).

### Batch-add by looping over the API

You can add many users to **one** resource group in one or a few requests (e.g. chunk your list of usernames), or add users to **several** resource groups by looping over groups and calling the add-users endpoint for each.

**Example: Bash – one group, multiple users in one request**

```bash
#!/bin/bash
# Add a list of users to a single resource group.
# Usage: ./add-users-to-rg.sh [org_name] [resource_group_id] [role]
ORG_NAME="${1:-my-org}"
RG_ID="${2:-507f1f77bcf86cd799439011}"
ROLE="${3:-read}"
USERS="member1 member2 member3 member4"

USERS_JSON=$(echo "$USERS" | tr ' ' '\n' | while read u; do
  [ -n "$u" ] && echo "{\"user\":\"$u\",\"role\":\"$ROLE\"}"
done | paste -sd ',' -)

curl -s -w "\nHTTP_STATUS:%{http_code}" -X POST \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"users\":[$USERS_JSON]}" \
  "https://huggingface.co/api/organizations/$ORG_NAME/resource-groups/$RG_ID/users"
```

**Example: Bash – loop over multiple groups**

```bash
# Get group IDs and add users to each
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/api/organizations/my-org/resource-groups" \
  | jq -r '.[].id' \
  | while read -r RG_ID; do
      [ -z "$RG_ID" ] && continue
      echo "Adding users to resource group $RG_ID ..."
      curl -s -X POST -H "Authorization: Bearer $HF_TOKEN" -H "Content-Type: application/json" \
        -d "{\"users\":[$USERS_JSON]}" \
        "https://huggingface.co/api/organizations/my-org/resource-groups/$RG_ID/users"
    done
```

**Example: Python – batch-add to one or many groups**

```python
import os
import requests

BASE_URL = "https://huggingface.co"
HF_TOKEN = os.environ.get("HF_TOKEN", "")

def list_resource_groups(org_name: str):
    r = requests.get(
        f"{BASE_URL}/api/organizations/{org_name}/resource-groups",
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
    )
    r.raise_for_status()
    return r.json()

def add_users_to_resource_group(org_name: str, resource_group_id: str, users_with_roles: list):
    """users_with_roles: list of {"user": "username", "role": "read"|"write"|"contributor"|"admin"}"""
    r = requests.post(
        f"{BASE_URL}/api/organizations/{org_name}/resource-groups/{resource_group_id}/users",
        headers={"Authorization": f"Bearer {HF_TOKEN}", "Content-Type": "application/json"},
        json={"users": users_with_roles},
    )
    if r.status_code != 200:
        raise RuntimeError(f"Add users failed {r.status_code}: {r.text}")
    return r.json()

# Example: same users added to every resource group
org_name = "my-org"
role = "read"
usernames = ["member1", "member2", "member3"]
users_with_roles = [{"user": u, "role": role} for u in usernames]
for rg in list_resource_groups(org_name):
    add_users_to_resource_group(org_name, rg["id"], users_with_roles)
```

For a long list of usernames, chunk them (e.g. 50 per request) and call the API once per chunk to avoid large request bodies or timeouts.

### Important notes

1. **Usernames only** — The API accepts Hugging Face **usernames**, not emails. You need a mapping from email → username (e.g. from your directory or the org members list) before calling the API.
2. **Users must be in the organization** — Every user in the request must already be a member of the organization; otherwise the request returns `403` with a message that some users are not in the org.
3. **Idempotency** — If a user is already in the resource group, the backend may return `403` for that request. Your script can catch errors and continue, or skip users already in the group if you first fetch the group's `users` list.
4. **Rate limits** — For large batches, consider adding a short delay between requests (e.g. 0.5–1 second) to avoid hitting rate limits.
5. **Token scope** — The access token must have sufficient permissions for the organization (typically at least "Write access to organizations settings / member management"). Create and store the token securely; do not commit it to version control.

---

## Configure auto-join via API

[Auto-join](./security-resource-groups#auto-join) automatically adds every org member to a Resource Group at a specified role. You can enable or disable it via the API.

**Enable auto-join**

```http
POST /api/organizations/{org_name}/resource-groups/{resource_group_id}/settings
Authorization: Bearer <token>
Content-Type: application/json

{ "autoJoin": { "enabled": true, "role": "read" } }
```

- **Path parameters**
  - `org_name`: Organization slug (e.g. `my-org`).
  - `resource_group_id`: The Resource Group's ID (24-character hex string; get IDs from the [list resource groups endpoint](#list-resource-groups)).
- **Body**
  - `role`: The role to assign to all org members. One of `"read"`, `"contributor"`, `"write"`, or `"admin"`.

Enabling auto-join on an existing Resource Group immediately adds all current org members (backfill).

**Disable auto-join**

Send the same request with `"enabled": false`. The `role` field is not required when disabling:

```http
POST /api/organizations/{org_name}/resource-groups/{resource_group_id}/settings
Authorization: Bearer <token>
Content-Type: application/json

{ "autoJoin": { "enabled": false } }
```

> [!NOTE]
> Disabling auto-join does **not** remove members who were previously auto-joined. It only stops future org members from being added automatically. Existing members remain in the Resource Group.

### Datasets Overview

https://huggingface.co/docs/hub/datasets-overview.md

# Datasets Overview

## Datasets on the Hub

The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/nyu-mll/glue), include a [Dataset Viewer](./data-studio) to showcase the data.

Each dataset is a [Git repository](./repositories) that contains the data required to generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Data files Configuration page](./datasets-data-files-configuration). Following the supported repo structure ensures that the dataset page on the Hub has a Viewer.

## Search for datasets

As with models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets).
There's a large number of languages, tasks, and licenses that you can use to filter your results to find a dataset that's right for you. ## Privacy Since datasets are repositories, you can [toggle their visibility between private and public](./repositories-settings#private-repositories) through the Settings tab. If a dataset is owned by an [organization](./organizations), the privacy settings apply to all the members of the organization. ### Widgets https://huggingface.co/docs/hub/models-widgets.md # Widgets ## What's a widget? Many model repos have a widget that allows anyone to run inferences directly in the browser. These widgets are powered by [Inference Providers](https://huggingface.co/docs/inference-providers), which provide developers streamlined, unified access to hundreds of machine learning models, backed by our serverless inference partners. Here are some examples of current popular models: - [DeepSeek V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) - State-of-the-art open-weights conversational model - [Flux Kontext](https://huggingface.co/black-forest-labs/FLUX.1-Kontext-dev) - Open-weights transformer model for image editing - [Falconsai's NSFW Detection](https://huggingface.co/Falconsai/nsfw_image_detection) - Image content moderation - [ResembleAI's Chatterbox](https://huggingface.co/ResembleAI/chatterbox) - Production-grade open source text-to-speech model. You can explore more models and their widgets on the [models page](https://huggingface.co/models?inference_provider=all&sort=trending) or try them interactively in the [Inference Playground](https://huggingface.co/playground). ## Enabling a widget Widgets are displayed when the model is hosted by at least one Inference Provider, ensuring optimal performance and reliability for the model's inference. Providers autonomously choose and control what models they deploy. 
The type of widget displayed (text-generation, text-to-image, etc.) is inferred from the model's `pipeline_tag`, a special tag that the Hub tries to compute automatically for all models. The only exception is the `conversational` widget, which is shown on models with a `pipeline_tag` of either `text-generation` or `image-text-to-text`, as long as they're also tagged as `conversational`. We choose to expose **only one** widget per model for simplicity.

For some libraries, such as `transformers`, the model type can be inferred automatically from configuration files (`config.json`). The architecture can determine the type: for example, `AutoModelForTokenClassification` corresponds to `token-classification`. If you're interested in this, you can see pseudo-code in [this gist](https://gist.github.com/julien-c/857ba86a6c6a895ecd90e7f7cab48046).

For most other use cases, we use the model tags to determine the model task type. For example, if there is `tag: text-classification` in the [model card metadata](./model-cards), the inferred `pipeline_tag` will be `text-classification`.

**You can always manually override your pipeline type with `pipeline_tag: xxx` in your [model card metadata](./model-cards#model-card-metadata).** (You can also use the metadata GUI editor to do this.)

### How can I control my model's widget example input?

You can specify the widget input in the model card metadata section:

```yaml
widget:
- text: "This new restaurant has amazing food and great service!"
  example_title: "Positive Review"
- text: "I'm really disappointed with this product. Poor quality and overpriced."
  example_title: "Negative Review"
- text: "The weather is nice today."
  example_title: "Neutral Statement"
```

You can provide more than one example input. In the examples dropdown menu of the widget, they will appear as `Example 1`, `Example 2`, etc. Optionally, you can supply `example_title` as well.

```yaml
widget:
- text: "Is this review positive or negative?
Review: Best cast iron skillet you will ever buy."
  example_title: "Sentiment analysis"
- text: "Barack Obama nominated Hilary Clinton as his secretary of state on Monday. He chose her because she had ..."
  example_title: "Coreference resolution"
- text: "On a shelf, there are five books: a gray book, a red book, a purple book, a blue book, and a black book ..."
  example_title: "Logic puzzles"
- text: "The two men running to become New York City's next mayor will face off in their first debate Wednesday night ..."
  example_title: "Reading comprehension"
```

Moreover, you can specify non-text example inputs in the model card metadata. Refer [here](./models-widgets-examples) for a complete list of sample input formats for all widget types. For vision and audio widget types, provide example inputs with `src` rather than `text`. For example, you can let users choose from two sample audio files for automatic speech recognition tasks like this:

```yaml
widget:
- src: https://example.org/somewhere/speech_samples/sample1.flac
  example_title: Speech sample 1
- src: https://example.org/somewhere/speech_samples/sample2.flac
  example_title: Speech sample 2
```

Note that you can also include example files in your model repository and use them as:

```yaml
widget:
- src: https://huggingface.co/username/model_repo/resolve/main/sample1.flac
  example_title: Custom Speech Sample 1
```

Even more conveniently, if the file lives in the corresponding model repo, you can just use the filename or file path inside the repo:

```yaml
widget:
- src: sample1.flac
  example_title: Custom Speech Sample 1
```

or, if it is nested inside the repo:

```yaml
widget:
- src: nested/directory/sample1.flac
```

We provide example inputs for some languages and most widget types in the [default-widget-inputs.ts file](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/default-widget-inputs.ts). If some examples are missing, we welcome PRs from the community to add them!
## Example outputs As an extension to example inputs, for each widget example, you can also optionally describe the corresponding model output, directly in the `output` property. This is useful when the model is not yet supported by Inference Providers, so that the model page can still showcase how the model works and what results it gives. For instance, for an [automatic-speech-recognition](./models-widgets-examples#automatic-speech-recognition) model: ```yaml widget: - src: sample1.flac output: text: "Hello my name is Julien" ``` The `output` property should be a YAML dictionary that represents the output format from Inference Providers. For a model that outputs text, see the example above. For a model that outputs labels (like a [text-classification](./models-widgets-examples#text-classification) model for instance), output should look like this: ```yaml widget: - text: "I liked this movie" output: - label: POSITIVE score: 0.8 - label: NEGATIVE score: 0.2 ``` Finally, for a model that outputs an image, audio, or any other kind of asset, the output should include a `url` property linking to either a file name or path inside the repo or a remote URL. For example, for a text-to-image model: ```yaml widget: - text: "picture of a futuristic tiger, artstation" output: url: images/tiger.jpg ``` We can also surface the example outputs in the Hugging Face UI, for instance, for a text-to-image model to display a gallery of cool image generations. ## Widget Availability and Provider Support Not all models have widgets available. Widget availability depends on: 1. **Task Support**: The model's task must be supported by at least one provider in the Inference Providers network 2. **Provider Availability**: At least one provider must be serving the specific model 3. 
**Model Configuration**: The model must have proper metadata and configuration files

To view the full list of supported tasks, check out [our dedicated documentation page](https://huggingface.co/docs/inference-providers/tasks/index). The list of all providers and the tasks they support is available in [this documentation page](https://huggingface.co/docs/inference-providers/index#partners).

For models without provider support, you can still showcase functionality using [example outputs](#example-outputs) in your model card. You can also click _Ask for provider support_ directly on the model page to encourage providers to serve the model, given there is enough community interest.

## Exploring Models with the Inference Playground

Before integrating models into your applications, you can test them interactively with the [Inference Playground](https://huggingface.co/playground). The playground allows you to:

- Test different [chat completion models](https://huggingface.co/models?inference_provider=all&sort=trending&other=conversational) with custom prompts
- Compare responses across different models
- Experiment with inference parameters like temperature, max tokens, and more
- Find the perfect model for your specific use case

The playground uses the same Inference Providers infrastructure that powers the widgets, so you can expect similar performance and capabilities when you integrate the models into your own applications.

### Search

https://huggingface.co/docs/hub/search.md

# Search

You can easily search anything on the Hub with **Full-text search**. We index model cards, dataset cards, and Spaces `app.py` files.

Go directly to https://huggingface.co/search or, using the search bar at the top of https://huggingface.co, select "Try Full-text search" to find what you seek on the Hub across models, datasets, and Spaces:

## Filter with ease

By default, models, datasets, and Spaces are all searched when you enter a query.
If you prefer, you can filter the search to only models, datasets, or Spaces. You can also copy and share the URL from your browser's address bar, which contains the filter information as URL query parameters. For example, searching for `llama` with a filter to show `Spaces` only gives the URL https://huggingface.co/search/full-text?q=llama&type=space

### Data Studio

https://huggingface.co/docs/hub/data-studio.md

# Data Studio

Each dataset page includes a table with the contents of the dataset, arranged by pages of 100 rows. You can navigate between pages using the buttons at the bottom of the table.

## Inspect data distributions

At the top of the columns you can see graphs representing the distribution of their data. This gives you a quick insight into how balanced your classes are, the range and distribution of numerical data and text lengths, and what portion of the column data is missing.

## Filter by value

If you click on a bar of a histogram from a numerical column, the dataset viewer will filter the data and show only the rows with values that fall in the selected range. Similarly, if you select one class from a categorical column, it will show only the rows from the selected category.

## Search a word in the dataset

You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and matches any row containing the word. The text is searched in `string` columns, even if the values are nested in a dictionary or a list.

## Run SQL queries on the dataset

You can run SQL queries on the dataset in the browser using the SQL Console. This feature also leverages our [auto-conversion to Parquet](data-studio#access-the-parquet-files). For more information see our guide on [SQL Console](./datasets-viewer-sql-console).

## Share a specific row

You can share a specific row by clicking on it, and then copying the URL from the address bar of your browser.
For example, https://huggingface.co/datasets/nyu-mll/glue/viewer/mrpc/test?p=2&row=241 opens the dataset studio on the MRPC dataset, on the test split, and on row 241.

## Large scale datasets

The Dataset Viewer supports large scale datasets, but depending on the data format it may only show the first 5GB of the dataset:

- For Parquet datasets: the Dataset Viewer shows the full dataset, but sorting, filtering, and search are only enabled on the first 5GB.
- For datasets >5GB in other formats (e.g. [WebDataset](https://github.com/webdataset/webdataset) or JSON Lines): the Dataset Viewer only shows the first 5GB, and sorting, filtering, and search are enabled on these first 5GB. In this case, an informational message lets you know that the Viewer is partial. This should be a large enough sample to represent the full dataset accurately; let us know if you need a bigger sample.

## Access the Parquet files

To power the dataset viewer, the first 5GB of every dataset are auto-converted to the Parquet format (unless it was already a Parquet dataset). In the dataset viewer (for example, see [GLUE](https://huggingface.co/datasets/nyu-mll/glue)), you can click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/nyu-mll/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Please refer to the [dataset viewer docs](/docs/datasets-server/parquet_process) to learn how to query the dataset Parquet files with libraries such as Polars, Pandas, or DuckDB.

> [!TIP]
> Parquet is a columnar storage format optimized for querying and processing large datasets. It is a popular choice for big data processing and analytics and is widely used for data processing and machine learning. You can learn more about the advantages associated with this format in the documentation.

### Conversion bot

When you create a new dataset, the [`parquet-converter` bot](https://huggingface.co/parquet-converter) notifies you once it converts the dataset to Parquet.
The [discussion](./repositories-pull-requests-discussions) it opens in the repository provides details about the Parquet format and links to the Parquet files.

### Programmatic access

You can also access the list of Parquet files programmatically using the [Hub API](./api#get-apidatasetsrepoidparquet); for example, the endpoint [`https://huggingface.co/api/datasets/nyu-mll/glue/parquet`](https://huggingface.co/api/datasets/nyu-mll/glue/parquet) lists the Parquet files of the `nyu-mll/glue` dataset.

We also have dedicated documentation for the [Dataset Viewer API](https://huggingface.co/docs/dataset-viewer), which you can call directly. That API lets you access the contents, metadata, and basic statistics of all Hugging Face Hub datasets, and powers the Dataset Viewer frontend.

## Dataset preview

For the biggest datasets, the page shows a preview of the first 100 rows instead of a full-featured viewer. This restriction only applies to datasets over 5GB that are not natively in Parquet format or that have not been auto-converted to Parquet.

## Embed the Dataset Viewer in a webpage

You can embed the Dataset Viewer in your own webpage using an iframe. The URL to use is `https://huggingface.co/datasets/<namespace>/<dataset-name>/embed/viewer`, where `<namespace>` is the owner of the dataset and `<dataset-name>` is the name of the dataset. You can also pass other parameters like the subset, split, filter, search, or selected row.

For more information see our guide on [How to embed the Dataset Viewer in a webpage](./datasets-viewer-embed).

## Configure the Dataset Viewer

To have a properly working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. There is also an option to configure your dataset using YAML. You can specify which files to display in the Dataset Viewer by adding a YAML configuration block at the top of your dataset's `README.md` file.
For example, to choose which file goes into which split: ```yaml --- configs: - config_name: default data_files: - split: train path: "data.csv" - split: test path: "holdout.csv" --- ``` You can also select multiple files per split or use glob patterns: ```yaml --- configs: - config_name: default data_files: - split: train path: - "data/train_part1.csv" - "data/train_part2.csv" - split: test path: "data/*.csv" --- ``` For **private** datasets, the Dataset Viewer is enabled for [PRO users](https://huggingface.co/pricing) and [Team or Enterprise organizations](https://huggingface.co/enterprise). For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure). ### Data files Configuration https://huggingface.co/docs/hub/datasets-data-files-configuration.md # Data files Configuration There are no constraints on how to structure dataset repositories. However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`. ## What are splits and subsets? Machine learning datasets typically have splits and may also have subsets. A dataset is generally made of _splits_ (e.g. `train` and `test`) that are used during different stages of training and evaluating a model. A _subset_ (also called _configuration_) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets where there may be a different subset for each language. If you're interested in learning more about splits and subsets, check out the [Splits and subsets](/docs/datasets-server/configs_and_splits) guide! 
![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## Automatic splits detection

Splits are automatically detected based on file and directory names. For example, this is a dataset with `train`, `test`, and `validation` splits:

```
my_dataset_repository/
├── README.md
├── train.csv
├── test.csv
└── validation.csv
```

To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation and the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135).

## Manual splits and subsets configuration

You can choose the data files to show in the Dataset Viewer for your dataset using YAML. This is useful if you want to specify which file goes into which split manually. You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files). Here is an example of a configuration defining a subset called "benchmark" with a `test` split:

```yaml
configs:
- config_name: benchmark
  data_files:
  - split: test
    path: benchmark.csv
```

See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. Also see the [example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87).

## Supported file formats

See the [File formats](./datasets-adding#file-formats) doc page to find the list of supported formats and recommendations for your dataset. If your dataset uses CSV or TSV files, you can find more information in the [example datasets](https://huggingface.co/collections/datasets-examples/format-csv-and-tsv-655f681cb9673a4249cccb3d).
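As a sketch of the builder-parameter option mentioned above (assuming a hypothetical tab-separated file named `data.tsv` at the repo root), the YAML configuration can pass the separator to the CSV loader:

```yaml
configs:
- config_name: default
  data_files: "data.tsv"
  sep: "\t"
```

Any extra keys on a config entry (such as `sep` here) are forwarded as building parameters; see the Manual configuration docs linked above for the full list supported per format.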
### Dataset Viewer size-limit errors (`TooBigContentError`)

If you see `Error code: TooBigContentError`, the Dataset Viewer could not read a preview within its limits. Common messages include `Parquet error: Scan size limit exceeded` and `The size of the content of the first rows exceeds the maximum supported size`.

What you can do:

- For Parquet files, use smaller row groups and include a page index (`write_page_index=True`) so the Viewer can read only what it needs.
- Avoid very large values in the first rows (very long strings, large JSON blobs, base64 payloads). Move large payloads to separate files when possible.
- Split very large files into smaller shards or splits, then re-upload.
- If the issue remains, review [Configure the Dataset Viewer](./datasets-viewer-configure) and open a discussion on your dataset page with the full error text.

## Image, Audio and Video datasets

For image/audio/video classification datasets, you can also use directories to name the image/audio/video classes. And if your images/audio/video files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them.

We provide guides that you can check out:

- [How to create an image dataset](./datasets-image) ([example datasets](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65))
- [How to create an audio dataset](./datasets-audio) ([example datasets](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607))
- [How to create a video dataset](./datasets-video)

### Transforming your dataset

https://huggingface.co/docs/hub/datasets-polars-operations.md

# Transforming your dataset

On this page we'll guide you through some of the most common operations used when doing data analysis. This is only a small subset of what's possible in Polars. For more information, please visit the [Documentation](https://docs.pola.rs/).
For the example we will use the [Common Crawl statistics](https://huggingface.co/datasets/commoncrawl/statistics) dataset. These statistics include: number of pages, distribution of top-level domains, crawl overlaps, etc. For more detailed information and graphs, please visit their [official statistics page](https://commoncrawl.github.io/cc-crawl-statistics/plots/tlds).

## Reading

```python
import polars as pl

df = pl.read_csv(
    "hf://datasets/commoncrawl/statistics/tlds.csv",
    try_parse_dates=True,
)
df.head(3)
```

```bash
┌─────┬────────┬───────────────────┬────────────┬───┬───────┬──────┬───────┬─────────┐
│     ┆ suffix ┆ crawl             ┆ date       ┆ … ┆ pages ┆ urls ┆ hosts ┆ domains │
│ --- ┆ ---    ┆ ---               ┆ ---        ┆   ┆ ---   ┆ ---  ┆ ---   ┆ ---     │
│ i64 ┆ str    ┆ str               ┆ date       ┆   ┆ i64   ┆ i64  ┆ f64   ┆ f64     │
╞═════╪════════╪═══════════════════╪════════════╪═══╪═══════╪══════╪═══════╪═════════╡
│ 0   ┆ a.se   ┆ CC-MAIN-2008-2009 ┆ 2009-01-12 ┆ … ┆ 18    ┆ 18   ┆ 1.0   ┆ 1.0     │
│ 1   ┆ a.se   ┆ CC-MAIN-2009-2010 ┆ 2010-09-25 ┆ … ┆ 3462  ┆ 3259 ┆ 166.0 ┆ 151.0   │
│ 2   ┆ a.se   ┆ CC-MAIN-2012      ┆ 2012-11-02 ┆ … ┆ 6957  ┆ 6794 ┆ 172.0 ┆ 150.0   │
└─────┴────────┴───────────────────┴────────────┴───┴───────┴──────┴───────┴─────────┘
```

## Selecting columns

The dataset contains some columns we don't need.
To remove them, we will use the `select` method:

```python
df = df.select("suffix", "crawl", "date", "tld", "pages", "domains")
df.head(3)
```

```bash
┌────────┬───────────────────┬────────────┬─────┬───────┬─────────┐
│ suffix ┆ crawl             ┆ date       ┆ tld ┆ pages ┆ domains │
│ ---    ┆ ---               ┆ ---        ┆ --- ┆ ---   ┆ ---     │
│ str    ┆ str               ┆ date       ┆ str ┆ i64   ┆ f64     │
╞════════╪═══════════════════╪════════════╪═════╪═══════╪═════════╡
│ a.se   ┆ CC-MAIN-2008-2009 ┆ 2009-01-12 ┆ se  ┆ 18    ┆ 1.0     │
│ a.se   ┆ CC-MAIN-2009-2010 ┆ 2010-09-25 ┆ se  ┆ 3462  ┆ 151.0   │
│ a.se   ┆ CC-MAIN-2012      ┆ 2012-11-02 ┆ se  ┆ 6957  ┆ 150.0   │
└────────┴───────────────────┴────────────┴─────┴───────┴─────────┘
```

## Filtering

We can filter the dataset using the `filter` method. This method accepts complex expressions, but let's start simple by filtering based on the crawl date:

```python
import datetime

df = df.filter(pl.col("date") >= datetime.date(2020, 1, 1))
```

You can combine multiple predicates with the `&` or `|` operators:

```python
df = df.filter(
    (pl.col("date") >= datetime.date(2020, 1, 1))
    | pl.col("crawl").str.contains("CC")
)
```

## Transforming

To add new columns to the dataset, use `with_columns`. In the example below we calculate the total number of pages per domain and add a new column `pages_per_domain` using the `alias` method. The entire statement within `with_columns` is called an expression.
Read more about expressions and how to use them in the [Polars user guide](https://docs.pola.rs/user-guide/expressions/).

```python
df = df.with_columns(
    (pl.col("pages") / pl.col("domains")).alias("pages_per_domain")
)
df.sample(3)
```

```bash
┌────────┬─────────────────┬────────────┬─────┬───────┬─────────┬──────────────────┐
│ suffix ┆ crawl           ┆ date       ┆ tld ┆ pages ┆ domains ┆ pages_per_domain │
│ ---    ┆ ---             ┆ ---        ┆ --- ┆ ---   ┆ ---     ┆ ---              │
│ str    ┆ str             ┆ date       ┆ str ┆ i64   ┆ f64     ┆ f64              │
╞════════╪═════════════════╪════════════╪═════╪═══════╪═════════╪══════════════════╡
│ net.bt ┆ CC-MAIN-2014-41 ┆ 2014-10-06 ┆ bt  ┆ 4     ┆ 1.0     ┆ 4.0              │
│ org.mk ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ mk  ┆ 1445  ┆ 430.0   ┆ 3.360465         │
│ com.lc ┆ CC-MAIN-2016-44 ┆ 2016-10-31 ┆ lc  ┆ 1     ┆ 1.0     ┆ 1.0              │
└────────┴─────────────────┴────────────┴─────┴───────┴─────────┴──────────────────┘
```

## Aggregation & Sorting

To aggregate data you can use the `group_by`, `agg` and `sort` methods. Within the aggregation context you can combine expressions to create powerful statements that are still easy to read.
First, we aggregate all the data to the top-level domain `tld` per scraped date:

```python
df = df.group_by("tld", "date").agg(
    pl.col("pages").sum(),
    pl.col("domains").sum(),
)
```

Now we can calculate several statistics per top-level domain:

- Number of unique scrape dates
- Average number of domains in the scraped period
- Average growth rate in terms of number of pages

```python
df = df.group_by("tld").agg(
    pl.col("date").unique().count().alias("number_of_scrapes"),
    pl.col("domains").mean().alias("avg_number_of_domains"),
    pl.col("pages").sort_by("date").pct_change().mean().alias("avg_page_growth_rate"),
)
df = df.sort("avg_number_of_domains", descending=True)
df.head(10)
```

```bash
┌─────┬───────────────────┬───────────────────────┬──────────────────────┐
│ tld ┆ number_of_scrapes ┆ avg_number_of_domains ┆ avg_page_growth_rate │
│ --- ┆ ---               ┆ ---                   ┆ ---                  │
│ str ┆ u32               ┆ f64                   ┆ f64                  │
╞═════╪═══════════════════╪═══════════════════════╪══════════════════════╡
│ com ┆ 101               ┆ 1.9571e7              ┆ 0.022182             │
│ de  ┆ 101               ┆ 1.8633e6              ┆ 0.5232               │
│ org ┆ 101               ┆ 1.5049e6              ┆ 0.019604             │
│ net ┆ 101               ┆ 1.5020e6              ┆ 0.021002             │
│ cn  ┆ 101               ┆ 1.1101e6              ┆ 0.281726             │
│ ru  ┆ 101               ┆ 1.0561e6              ┆ 0.416303             │
│ uk  ┆ 101               ┆ 827453.732673         ┆ 0.065299             │
│ nl  ┆ 101               ┆ 710492.623762         ┆ 1.040096             │
│ fr  ┆ 101               ┆ 615471.594059         ┆ 0.419181             │
│ jp  ┆ 101               ┆ 615391.455446         ┆ 0.246162             │
└─────┴───────────────────┴───────────────────────┴──────────────────────┘
```

### Shiny on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-shiny.md

# Shiny on Spaces

[Shiny](https://shiny.posit.co/) is an open-source framework for building simple, beautiful, and performant data applications. The goal when developing Shiny was to build something simple enough to teach someone in an afternoon but extensible enough to power large, mission-critical applications. You can create a useful Shiny app in a few minutes, but if the scope of your project grows, you can be sure that Shiny can accommodate that application.

The main feature that differentiates Shiny from other frameworks is its reactive execution model. When you write a Shiny app, the framework infers the relationships between inputs, outputs, and intermediary calculations, and uses those relationships to render only the things that need to change as a result of a user's action. The result is that users can easily develop efficient, extensible applications without explicitly caching data or writing callback functions.

## Shiny for Python

[Shiny for Python](https://shiny.rstudio.com/py/) is a pure Python implementation of Shiny. This gives you access to all of the great features of Shiny like reactivity, complex layouts, and modules, without needing to use R. Shiny for Python is ideal for Hugging Face applications because it integrates smoothly with other Hugging Face tools.

To get started deploying a Space, click this button to select your hardware and specify if you want a public or private Space. The Space template will populate a few files to get your app started.

_app.py_

This file defines your app's logic.
To learn more about how to modify this file, see [the Shiny for Python documentation](https://shiny.rstudio.com/py/docs/overview.html). As your app gets more complex, it's a good idea to break your application logic up into [modules](https://shiny.rstudio.com/py/docs/workflow-modules.html).

_Dockerfile_

The Dockerfile for a Shiny for Python app is very minimal because the library doesn't have many system dependencies, but you may need to modify this file if your application has additional system dependencies. The one essential feature of this file is that it exposes and runs the app on the port specified in the Space README file (which is 7860 by default).

_requirements.txt_

The Space will automatically install dependencies listed in the `requirements.txt` file. Note that you must include `shiny` in this file.

## Shiny for R

[Shiny for R](https://shiny.rstudio.com/) is a popular and well-established application framework in the R community, and is a great choice if you want to host an R app on Hugging Face infrastructure or make use of some of the great [Shiny R extensions](https://github.com/nanxstats/awesome-shiny-extensions). To integrate Hugging Face tools into an R app, you can either use [httr2](https://httr2.r-lib.org/) to call Hugging Face APIs, or [reticulate](https://rstudio.github.io/reticulate/) to call one of the Hugging Face Python SDKs.

To deploy an R Shiny Space, click this button and fill out the Space metadata. This will populate the Space with all the files you need to get started.

_app.R_

This file contains all of your application logic. If you prefer, you can break this file up into `ui.R` and `server.R`.

_Dockerfile_

The Dockerfile builds off of the [rocker shiny](https://hub.docker.com/r/rocker/shiny) image. You'll need to modify this file to use additional packages. If you are using a lot of tidyverse packages, we recommend switching the base image to [rocker/shiny-verse](https://hub.docker.com/r/rocker/shiny-verse).
You can install additional R packages by adding them under the `RUN install2.r` section of the Dockerfile, and GitHub packages can be installed by adding the repository under `RUN installGithub.r`. There are two main requirements for this Dockerfile:

- First, the file must expose the port that you have listed in the README. The default is 7860, and we recommend not changing this port unless you have a reason to.
- Second, for the moment you must use the development version of [httpuv](https://github.com/rstudio/httpuv), which resolves an issue with app timeouts on Hugging Face.

### Using SetFit with Hugging Face

https://huggingface.co/docs/hub/setfit.md

# Using SetFit with Hugging Face

SetFit is an efficient and prompt-free framework for few-shot fine-tuning of [Sentence Transformers](https://sbert.net/). It achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset, SetFit is competitive with fine-tuning RoBERTa Large on the full training set of 3k examples 🤯!

Compared to other few-shot learning methods, SetFit has several unique features:

* 🗣 **No prompts or verbalizers:** Current techniques for few-shot fine-tuning require handcrafted prompts or verbalizers to convert examples into a format suitable for the underlying language model. SetFit dispenses with prompts altogether by generating rich embeddings directly from text examples.
* 🏎 **Fast to train:** SetFit doesn't require large-scale models like [T0](https://huggingface.co/bigscience/T0) or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
* 🌎 **Multilingual support:** SetFit can be used with any [Sentence Transformer](https://huggingface.co/models?library=sentence-transformers&sort=downloads) on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.
## Exploring SetFit on the Hub

You can find SetFit models by filtering at the left of the [models page](https://huggingface.co/models?library=setfit).

All models on the Hub come with these useful features:

1. An automatically generated model card with a brief description.
2. An interactive widget you can use to play with the model directly in the browser.
3. An Inference Providers widget that allows you to make inference requests.

## Installation

To get started, you can follow the [SetFit installation guide](https://huggingface.co/docs/setfit/installation). You can also use the following one-line install through pip:

```
pip install -U setfit
```

## Using existing models

All `setfit` models can easily be loaded from the Hub.

```py
from setfit import SetFitModel

model = SetFitModel.from_pretrained("tomaarsen/setfit-paraphrase-mpnet-base-v2-sst2-8-shot")
```

Once loaded, you can use [`SetFitModel.predict`](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitModel.predict) to perform inference, passing one or more sentences:

```py
model.predict([
    "It's a charming and often affecting journey.",
    "It's slow -- very, very slow.",
])
```

```bash
['positive', 'negative']
```

If you want to load a specific SetFit model, you can click `Use in SetFit` and you will be given a working snippet!

## Additional resources

* [All SetFit models available on the Hub](https://huggingface.co/models?library=setfit)
* SetFit [repository](https://github.com/huggingface/setfit)
* SetFit [docs](https://huggingface.co/docs/setfit)
* SetFit [paper](https://arxiv.org/abs/2209.11055)

### Using 🤗 `transformers` at Hugging Face

https://huggingface.co/docs/hub/transformers.md

# Using 🤗 `transformers` at Hugging Face

🤗 `transformers` is a library maintained by Hugging Face and the community, for state-of-the-art Machine Learning for PyTorch, TensorFlow and JAX. It provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.
We are a bit biased, but we really like 🤗 `transformers`!

## Exploring 🤗 transformers in the Hub

There are over 630,000 `transformers` models on the Hub, which you can find by filtering at the left of [the models page](https://huggingface.co/models?library=transformers&sort=downloads).

You can find models for many different tasks:

* Extracting the answer from a context ([question-answering](https://huggingface.co/models?library=transformers&pipeline_tag=question-answering&sort=downloads)).
* Creating summaries from a large text ([summarization](https://huggingface.co/models?library=transformers&pipeline_tag=summarization&sort=downloads)).
* Classifying text (e.g. as spam or not spam, [text-classification](https://huggingface.co/models?library=transformers&pipeline_tag=text-classification&sort=downloads)).
* Generating new text with models such as GPT ([text-generation](https://huggingface.co/models?library=transformers&pipeline_tag=text-generation&sort=downloads)).
* Identifying parts of speech (verb, subject, etc.) or entities (country, organization, etc.) in a sentence ([token-classification](https://huggingface.co/models?library=transformers&pipeline_tag=token-classification&sort=downloads)).
* Transcribing audio files to text ([automatic-speech-recognition](https://huggingface.co/models?library=transformers&pipeline_tag=automatic-speech-recognition&sort=downloads)).
* Classifying the speaker or language in an audio file ([audio-classification](https://huggingface.co/models?library=transformers&pipeline_tag=audio-classification&sort=downloads)).
* Detecting objects in an image ([object-detection](https://huggingface.co/models?library=transformers&pipeline_tag=object-detection&sort=downloads)).
* Segmenting an image ([image-segmentation](https://huggingface.co/models?library=transformers&pipeline_tag=image-segmentation&sort=downloads)).
* Doing Reinforcement Learning ([reinforcement-learning](https://huggingface.co/models?library=transformers&pipeline_tag=reinforcement-learning&sort=downloads))!

You can try out the models directly in the browser if you want to test them without downloading them, thanks to the in-browser widgets!

## Transformers repository files

A [Transformers](https://hf.co/docs/transformers/index) model repository generally contains model files and preprocessor files.

### Model

- The **`config.json`** file stores details about the model architecture, such as the number of hidden layers, vocabulary size, number of attention heads, the dimensions of each head, and more. This metadata is the model blueprint.
- The **`model.safetensors`** file stores the model's pretrained layers and weights. For large models, the safetensors file is sharded to limit the amount of memory required to load it. Browse the **`model.safetensors.index.json`** file to see which safetensors file the model weights are being loaded from.

  ```json
  {
    "metadata": {
      "total_size": 16060522496
    },
    "weight_map": {
      "lm_head.weight": "model-00004-of-00004.safetensors",
      "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
      ...
    }
  }
  ```

  You can also visualize this mapping by clicking the ↗ button on the model card.

  [Safetensors](https://hf.co/docs/safetensors/index) is a safer and faster serialization format - compared to [pickle](./security-pickle#use-your-own-serialization-format) - for storing model weights. You may encounter weights pickled in formats such as **`bin`**, **`pth`**, or **`ckpt`**, but **`safetensors`** is increasingly adopted in the model ecosystem as a better alternative.

- A model may also have a **`generation_config.json`** file, which stores details about how to generate text, such as whether to sample, the top tokens to sample from, the temperature, and the special tokens for starting and stopping generation.
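The shard lookup described above is just a dictionary lookup. As an illustration, this stdlib-only sketch resolves which shard a given weight lives in, using a truncated stand-in for a real `model.safetensors.index.json`:

```python
import json

# Truncated, hypothetical stand-in for a model.safetensors.index.json file
index_json = """
{
  "metadata": {"total_size": 16060522496},
  "weight_map": {
    "lm_head.weight": "model-00004-of-00004.safetensors",
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors"
  }
}
"""

index = json.loads(index_json)

def shard_for(weight_name: str) -> str:
    """Return the safetensors shard that stores a given weight."""
    return index["weight_map"][weight_name]

print(shard_for("lm_head.weight"))  # model-00004-of-00004.safetensors
```

A loader only needs to open the shards that actually contain the weights it is asked for, which is what keeps memory usage bounded for sharded checkpoints.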
### Preprocessor

- The **`tokenizer_config.json`** file stores the special tokens added by a model. These special tokens signal many things to a model, such as the beginning of a sentence, specific formatting for chat templates, or indicating an image. This file also shows the maximum input sequence length the model can accept, the preprocessor class, and the outputs it returns.
- The **`tokenizer.json`** file stores the model's learned vocabulary.
- The **`special_tokens_map.json`** is a mapping of the special tokens. For example, in [Llama 3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/special_tokens_map.json), the beginning of string token is `<|begin_of_text|>`.

> [!TIP]
> For other modalities, the `tokenizer_config.json` file is replaced by `preprocessor_config.json`.

## Using existing models

All `transformers` models are a line away from being used! Depending on how you want to use them, you can use the high-level API via the `pipeline` function, or you can use `AutoModel` for more control.

```py
# With pipeline, just specify the task and the model id from the Hub.
from transformers import pipeline

pipe = pipeline("text-generation", model="distilbert/distilgpt2")

# If you want more control, you will need to define the tokenizer and model.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
```

You can also load a model from a specific version (based on commit hash, tag name, or branch) as follows:

```py
model = AutoModel.from_pretrained(
    "julien-c/EsperBERTo-small",
    revision="v2.0.1"  # tag name, or branch name, or commit hash
)
```

If you want to see how to load a specific model, you can click `Use in Transformers` and you will be given a working snippet you can use to load it!
If you need further information about the model architecture, you can also click "Read model documentation" at the bottom of the snippet.

## Sharing your models

To read all about sharing models with `transformers`, please head over to the [Share a model](https://huggingface.co/docs/transformers/model_sharing) guide in the official documentation.

Many classes in `transformers`, such as the models and tokenizers, have a `push_to_hub` method that allows you to easily upload the files to a repository.

```py
# Pushing model to your own account
model.push_to_hub("my-awesome-model")

# Pushing your tokenizer
tokenizer.push_to_hub("my-awesome-model")

# Pushing all things after training
trainer.push_to_hub()
```

There is much more you can do, so we suggest reviewing the [Share a model](https://huggingface.co/docs/transformers/model_sharing) guide.

## Additional resources

* Transformers [library](https://github.com/huggingface/transformers).
* Transformers [docs](https://huggingface.co/docs/transformers/index).
* Share a model [guide](https://huggingface.co/docs/transformers/model_sharing).

### Team & Enterprise plans

https://huggingface.co/docs/hub/enterprise.md

# Team & Enterprise plans

> [!TIP]
> Subscribe to a Team or Enterprise plan to get access to advanced features for your organization.

Team & Enterprise organization plans add advanced capabilities to organizations, enabling safe, compliant, and managed collaboration for companies and teams on Hugging Face.
## Compare our plans at a quick glance

### Core usage, storage, rate limits

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | --- | --- | --- | --- |
| Storage – Public repos | Best effort | 12TB base + 1TB/seat | 200TB base + 1TB/seat | 500TB base + 1TB/seat |
| Storage – Private repos | 100GB | 1TB/seat + PAYG | 1TB/seat + PAYG | 1TB/seat + PAYG |
| [Extra storage](./storage-limits#pay-as-you-go-price) | ❌ | ✅ PAYG | ✅ PAYG | ✅ PAYG |
| API requests / period\* | 1,000 | 3,000 | 6,000 | 10,000 up to 100,000† |
| Resolver requests / period\* | 5,000 | 20,000 | 50,000 | 100,000 up to 500,000† |
| Pages requests / period\* | 200 | 400 | 600 | 1,000 up to 10,000† |

\* All quotas are calculated over 5-minute fixed windows

† When Organization IP Ranges are defined

### Inference & Hub credits

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | --- | --- | --- | --- |
| Serve models with Inference Providers | ✅ PAYG | ✅ $2/seat/mo included + PAYG | ✅ $2/seat/mo included + PAYG | ✅ $2/seat/mo included + PAYG |
| [Usage & billing control](https://huggingface.co/docs/inference-providers/pricing#inference-providers-usage-breakdown) | ❌ | ✅ | ✅ | ✅ |
| Scale deployment with Inference Endpoints | ✅ PAYG | ✅ PAYG | ✅ PAYG | ✅ PAYG |
| Hub credits\* included in plan | ❌ | ❌ (bulk purchase available) | $2k included | 5% of ACV included |

\* Hub credits can be used for Inference Providers, Inference Endpoints, Jobs, Space upgrades, and ZeroGPU quota extensions

### Spaces & Jobs

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | --- | --- | --- | --- |
| Spaces – CPU-based runtime | 8 units\* | ✅ No limit | ✅ No limit | ✅ No limit |
| Spaces – ZeroGPU usage tiers | 3.5 min† | 25 min† | 45 min† | 45 min† |
| Spaces – Upgraded hardware | PAYG | PAYG | PAYG | PAYG |
| Dev Mode / Custom domain for Spaces | ❌ | ✅ | ✅ | ✅ |
| Jobs & Scripts (train/fine-tune, eval) | PAYG | PAYG | PAYG | PAYG |

\* running at the same time

† included daily quota; paid plans can extend beyond quota using credits at $1 per 10 min of GPU time

### Repo rules, access control, visibility

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Access control granularity | [Standard](./organizations-security) | ✅ Fine-grained | ✅ Fine-grained | ✅ Fine-grained + policies |
| Org controls | ❌ | ✅ | ✅ | ✅ |
| Hub controls | ❌ | ❌ | ❌ | ✅ |
| Default private repos | ❌ | ✅ | ✅ | ✅ |
| Disable public repositories (org-wide) | ❌ | ✅ | ✅ | ✅ |
| [Data residency](./storage-regions) | ❌ | ✅ | ✅ | ✅ |
| Data Studio (private datasets) | ❌ | ✅ | ✅ | ✅ |
| Gating Group Collections | ❌ | ✅ | ✅ | ✅ |

### Identity, authentication, org security

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| [SSO to private org](./enterprise-sso) | ❌ | ✅ Basic SSO | ✅ Basic SSO | ✅ Managed SSO |
| [SSO to public Hub](./enterprise-advanced-sso) | ❌ | ❌ | ❌ | ✅ |
| [Enforce 2FA](./enterprise-advanced-security) | ❌ | ✅ | ✅ | ✅ |
| [OAuth Token Exchange](./oauth#token-exchange-for-organizations-rfc-8693) | ❌ | ❌ | ✅ | ✅ |
| Disable personal public repos for users | ❌ | ❌ | ❌ | ✅ |
| Disable joining other orgs for users | ❌ | ❌ | ❌ | ✅ |
| Disable PRO subscription | ❌ | ❌ | ❌ | ✅ |
| Hide members list | ❌ | ✅ | ✅ | ✅ |

### Governance, auditing, compliance

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| RBAC | ✅ | ✅ Advanced | ✅ Advanced | ✅ Advanced |
| [Audit logs](./audit-logs) | ❌ | ✅ | ✅ | ✅ |
| [Resource groups](./enterprise-advanced-security) | ❌ | ✅ | ✅ | ✅ |
| [Tokens admin / management](./enterprise-tokens-management) | ❌ | ✅ | ✅ | ✅ |
| [Token revocation](./enterprise-tokens-management#revoking-via-api) | ❌ | ❌ | ✅ | ✅ |
| [Users Download analytics](./enterprise-network-security) | ❌ | ❌ | ❌ | ✅ |
| [Content access / policy controls](./enterprise-network-security) | ❌ | ❌ | ❌ | ✅ |
| [Network access controls](./enterprise-network-security) | ❌ | ❌ | ❌ | ✅ |
| [Enforced authentication (advanced)](./enterprise-network-security) | ❌ | ❌ | ❌ | ✅ |

### User provisioning & admin

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Onboarding/Offboarding | ✅ manual | ✅ controlled | ✅ controlled | ✅ automated |
| SCIM provisioning | ❌ | ❌ | ✅ Invitation-based | ✅ Full lifecycle |
| Managed users | ❌ | ❌ | ❌ | ✅ |

### Support, billing, procurement

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Support | Forum access | Best effort | Email support with SLA | Advanced Slack support |
| Billing | - | Credit card self-serve | Pay with Invoice | Pay with Invoice |
| Contract (including Purchase Order) | ❌ | ❌ | ✅ HF template | ✅ customer paper |
| Legal Review | ❌ | ❌ | ❌ | ✅ |
| Vendor onboarding & Security questionnaires | ❌ | ❌ | ❌ | ✅ |

### Community

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Org Article | ❌ | ✅ | ✅ | ✅ |
| [Publisher Analytics Dashboard](./publisher-analytics) | ❌ | ✅ | ✅ | ✅ |
| [Set your primary org on your profile](https://huggingface.co/changelog/primary-organization-on-profiles) | ❌ | ✅ | ✅ | ✅ |

### Pricing

| Feature | Free | Team | Enterprise | Enterprise Plus |
| --- | :---: | :---: | :---: | :---: |
| Pricing | - | $20/user/month | from $50/user/month | custom |
| Pilot availability | ❌ | ❌ | ❌ | ✅ |

## Dive deeper

In the following sections we document these Team & Enterprise features:

- [Single Sign-On (SSO)](./enterprise-sso)
- [Audit Logs](./audit-logs)
- [Storage Regions](./storage-regions)
- [Data Studio for Private datasets](./enterprise-datasets)
- [Resource Groups](./security-resource-groups)
- [Advanced Compute Options](./advanced-compute-options)
- [Advanced Security](./enterprise-advanced-security)
- [Tokens Management](./enterprise-tokens-management)
- [OAuth Token Exchange](./oauth#token-exchange-for-organizations-rfc-8693)
- [Publisher Analytics](./publisher-analytics)
- [Gating Group Collections](./enterprise-gating-group-collections)
- [Network Security](./enterprise-network-security)
- [Higher Rate limits](./rate-limits)
- [Blog Articles](./enterprise-blog-articles)

Finally, Team & Enterprise plans include vastly more [included public storage](./storage-limits), as well as 1TB of [private storage](./storage-limits) per seat in the subscription, i.e. if your organization has 40 members, then you have 40TB of included storage for your private models and datasets.
### Langfuse on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-langfuse.md

# Langfuse on Spaces

This guide shows you how to deploy Langfuse on Hugging Face Spaces and start instrumenting your LLM application for observability. This integration helps you to experiment with LLM APIs on the Hugging Face Hub, manage your prompts in one place, and evaluate model outputs.

## What is Langfuse?

[Langfuse](https://langfuse.com) is an open-source LLM engineering platform that helps teams collaboratively debug, evaluate, and iterate on their LLM applications.

Key features of Langfuse include LLM tracing to capture the full context of your application's execution flow, prompt management for centralized and collaborative prompt iteration, evaluation metrics to assess output quality, dataset creation for testing and benchmarking, and a playground to experiment with prompts and model configurations.

_This video is a 10 min walkthrough of the Langfuse features:_

## Why LLM Observability?

- As language models become more prevalent, understanding their behavior and performance is important.
- **LLM observability** involves monitoring and understanding the internal states of an LLM application through its outputs.
- It is essential for addressing challenges such as:
  - **Complex control flows** with repeated or chained calls, making debugging challenging.
  - **Non-deterministic outputs**, adding complexity to consistent quality assessment.
  - **Varied user intents**, requiring deep understanding to improve user experience.
- Building LLM applications involves intricate workflows, and observability helps in managing these complexities.

## Step 1: Set up Langfuse on Spaces

The Langfuse Hugging Face Space allows you to get up and running with a deployed version of Langfuse with just a few clicks.

To get started, click the button above or follow these steps:

1. Create a [**new Hugging Face Space**](https://huggingface.co/new-space)
2. Select **Docker** as the Space SDK
3.
Select **Langfuse** as the Space template
4. Attach a **[Storage Bucket](https://huggingface.co/docs/hub/storage-buckets)** to ensure your Langfuse data is persisted across restarts
5. Ensure the space is set to **public** visibility so Langfuse APIs/SDKs can access the app (see note below for more details)
6. [Optional but recommended] For a secure deployment, replace the default values of the **environment variables**:
   - `NEXTAUTH_SECRET`: Used to validate login session cookies. Generate a secret with at least 256 bits of entropy using `openssl rand -base64 32`.
   - `SALT`: Used to salt hashed API keys. Generate a secret with at least 256 bits of entropy using `openssl rand -base64 32`.
   - `ENCRYPTION_KEY`: Used to encrypt sensitive data. Must be 256 bits (64 hex characters); generate via `openssl rand -hex 32`.
7. Click **Create Space**!

![Clone the Langfuse Space](https://langfuse.com/images/cookbook/huggingface/huggingface-space-setup.png)

### User Access

Your Langfuse Space is pre-configured with Hugging Face OAuth for secure authentication, so you'll need to authorize `read` access to your Hugging Face account upon first login by following the instructions in the pop-up. Once inside the app, you can use [the native Langfuse features](https://langfuse.com/docs/rbac) to manage Organizations, Projects, and Users.

The Langfuse space _must_ be set to **public** visibility so that Langfuse APIs/SDKs can reach the app. This means that, by default, _any_ logged-in Hugging Face user will be able to access the Langfuse space. You can prevent new users from signing up and accessing the space via two different methods:

#### 1. (Recommended) Hugging Face native org-level OAuth restrictions

If you want to restrict access to only members of specified organization(s), you can simply set the `hf_oauth_authorized_org` metadata field in the space's `README.md` file, as shown [here](https://huggingface.co/docs/hub/spaces-oauth#create-an-oauth-app).
Once configured, only users who are members of the specified organization(s) will be able to access the space.

#### 2. Manual access control

You can also restrict access on a per-user basis by setting the `AUTH_DISABLE_SIGNUP` environment variable to `true`. Be sure that you've first signed in and authenticated to the space before setting this variable; otherwise, your own user profile won't be able to authenticate.

> [!TIP]
> **Note:** If you've set the `AUTH_DISABLE_SIGNUP` environment variable to `true` to restrict access, and want to grant a new user access to the space, you'll need to first set it back to `false` (wait for the rebuild to complete), add the user and have them authenticate with OAuth, and then set it back to `true`.

## Step 2: Use Langfuse

Now that you have Langfuse running, you can start instrumenting your LLM application to capture traces and manage your prompts. Let's see how!

### Monitor Any Application

Langfuse is model agnostic and can be used to trace any application. Follow the [get-started guide](https://langfuse.com/docs) in the Langfuse documentation to see how you can instrument your code.

Langfuse maintains native integrations with many popular LLM frameworks, including [Langchain](https://langfuse.com/docs/integrations/langchain/tracing), [LlamaIndex](https://langfuse.com/docs/integrations/llama-index/get-started) and [OpenAI](https://langfuse.com/docs/integrations/openai/python/get-started), and offers Python and JS/TS SDKs to instrument your code. Langfuse also offers various API endpoints to ingest data and has been integrated by other open source projects such as [Langflow](https://langfuse.com/docs/integrations/langflow), [Dify](https://langfuse.com/docs/integrations/dify) and [Haystack](https://langfuse.com/docs/integrations/haystack/get-started).
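Whichever SDK or integration you use, the client reads its connection details from environment variables. A minimal sketch with placeholder values (the host is your Space's direct URL; the keys come from your Langfuse project settings, and all three values below are placeholders):

```shell
# All values below are placeholders -- substitute your Space's direct URL
# and the API keys generated in your Langfuse project settings.
export LANGFUSE_HOST="https://your-langfuse-space.hf.space"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
```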
### Example 1: Trace Calls to Inference Providers

As a simple example, here's how to trace LLM calls to [Inference Providers](https://huggingface.co/docs/inference-providers/en/index) using the Langfuse Python SDK.

Be sure to first configure your `LANGFUSE_HOST`, `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables, and make sure you've [authenticated with your Hugging Face account](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication).

```python
from langfuse.openai import openai
from huggingface_hub import get_token

client = openai.OpenAI(
    base_url="https://router.huggingface.co/hf-inference/models/meta-llama/Llama-3.3-70B-Instruct/v1",
    api_key=get_token(),
)

messages = [{"role": "user", "content": "What is observability for LLMs?"}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    max_tokens=100,
)
```

### Example 2: Monitor a Gradio Application

We created a Gradio template space that shows how to create a simple chat application using a Hugging Face model and trace model calls and user feedback in Langfuse, without leaving Hugging Face.

To get started, [duplicate this Gradio template space](https://huggingface.co/spaces/langfuse/langfuse-gradio-example-template?duplicate=true) and follow the instructions in the [README](https://huggingface.co/spaces/langfuse/langfuse-gradio-example-template/blob/main/README.md).

## Step 3: View Traces in Langfuse

Once you have instrumented your application, and ingested traces or user feedback into Langfuse, you can view your traces in Langfuse.
![Example trace with Gradio](https://langfuse.com/images/cookbook/huggingface/huggingface-gradio-example-trace.png)

_[Example trace in the Langfuse UI](https://langfuse-langfuse-template-space.hf.space/project/cm4r1ajtn000a4co550swodxv/traces/9cdc12fb-71bf-4074-ab0b-0b8d212d839f?timestamp=2024-12-20T12%3A12%3A50.089Z&view=preview)_

## Additional Resources and Support

- [Langfuse documentation](https://langfuse.com/docs)
- [Langfuse GitHub repository](https://github.com/langfuse/langfuse)
- [Langfuse Discord](https://langfuse.com/discord)
- [Langfuse template Space](https://huggingface.co/spaces/langfuse/langfuse-template-space)

For more help, open a support thread on [GitHub discussions](https://langfuse.com/discussions) or [open an issue](https://github.com/langfuse/langfuse/issues).

### Using SpeechBrain at Hugging Face

https://huggingface.co/docs/hub/speechbrain.md

# Using SpeechBrain at Hugging Face

`speechbrain` is an open-source and all-in-one conversational toolkit for audio/speech. The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognition, speech enhancement, speech separation, language identification, multi-microphone signal processing, and many others.

## Exploring SpeechBrain in the Hub

You can find `speechbrain` models by filtering at the left of the [models page](https://huggingface.co/models?library=speechbrain).

All models on the Hub come with the following features:
1. An automatically generated model card with a brief description.
2. Metadata tags that help with discoverability, with information such as the language, license, paper, and more.
3. An interactive widget you can use to play with the model directly in the browser.
4. An Inference Providers widget that allows you to make inference requests.
## Using existing models

`speechbrain` offers different interfaces to manage pretrained models for different tasks, such as `EncoderClassifier`, `EncoderDecoderASR`, `SepformerSeparation`, and `SpectralMaskEnhancement`. These classes have a `from_hparams` method you can use to load a model from the Hub. Here is an example of running inference for sound recognition on urban sounds.

```py
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/urbansound8k_ecapa"
)
out_prob, score, index, text_lab = classifier.classify_file('speechbrain/urbansound8k_ecapa/dog_bark.wav')
```

If you want to see how to load a specific model, you can click `Use in speechbrain` and you will be given a working snippet to load it!

## Additional resources

* SpeechBrain [website](https://speechbrain.github.io/).
* SpeechBrain [docs](https://speechbrain.readthedocs.io/en/latest/index.html).

### ZenML on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-zenml.md

# ZenML on Spaces

[ZenML](https://github.com/zenml-io/zenml) is an extensible, open-source MLOps framework for creating portable, production-ready MLOps pipelines. It's built for Data Scientists, ML Engineers, and MLOps Developers to collaborate as they develop to production. ZenML offers a simple and flexible syntax, is cloud- and tool-agnostic, and has interfaces/abstractions catered toward ML workflows. With ZenML you'll have all your favorite tools in one place, so you can tailor a workflow that caters to your specific needs.

The ZenML Huggingface Space allows you to get up and running with a deployed version of ZenML with just a few clicks. Within a few minutes, you'll have this default ZenML dashboard deployed and ready for you to connect to from your local machine. In the sections that follow, you'll learn to deploy your own instance of ZenML and use it to view and manage your machine learning pipelines right from the Hub.
ZenML on Huggingface Spaces is a **self-contained application completely hosted on the Hub using Docker**. The diagram below illustrates the complete process.

![ZenML on HuggingFace Spaces -- default deployment](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/zenml/hf_spaces_chart.png)

Visit [the ZenML documentation](https://docs.zenml.io/) to learn more about its features and how to get started with running your machine learning pipelines through your Huggingface Spaces deployment. You can check out [some small sample examples](https://github.com/zenml-io/zenml/tree/main/examples) of ZenML pipelines to get started or take your pick of some more complex production-grade projects at [the ZenML Projects repository](https://github.com/zenml-io/zenml-projects). ZenML integrates with many of your favorite tools out of the box, [including Huggingface](https://zenml.io/integrations/huggingface) of course! If there's something else you want to use, we're built to be extensible and you can easily make it work with whatever your custom tool or workflow is.

## ⚡️ Deploy ZenML on Spaces

You can deploy ZenML on Spaces with just a few clicks:

To set up your ZenML app, you need to specify three main components: the Owner (either your personal account or an organization), a Space name, and the Visibility (a bit lower down the page). Note that the space visibility needs to be set to 'Public' if you wish to connect to the ZenML server from your local machine.

![Choose the ZenML Docker template](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/zenml/choose_space.png)

You have the option here to select a higher tier machine to use for your server. The advantage of selecting a paid CPU instance is that it is not subject to auto-shutdown policies and thus will stay up as long as you leave it up.
In order to make use of a persistent CPU, you'll likely want to create and set up a MySQL database to connect to (see below).

To personalize your Space's appearance, such as the title, emojis, and colors, navigate to "Files and Versions" and modify the metadata in your README.md file. Full information on Spaces configuration parameters can be found in the HuggingFace [documentation reference guide](https://huggingface.co/docs/hub/spaces-config-reference).

After creating your Space, you'll notice a 'Building' status along with logs displayed on the screen. When this switches to 'Running', your Space is ready for use. If the ZenML login UI isn't visible, try refreshing the page.

In the upper-right hand corner of your space you'll see a button with three dots which, when you click on it, will offer you a menu option to "Embed this Space". (See [the HuggingFace documentation](https://huggingface.co/docs/hub/spaces-embed) for more details on this feature.) Copy the "Direct URL" shown in the box that you can now see on the screen. This should look something like this: `https://-.hf.space`. Open that URL and use our default login to access the dashboard (username: 'default', password: (leave it empty)).

## Connecting to your ZenML Server from your Local Machine

Once you have your ZenML server up and running, you can connect to it from your local machine. To do this, you'll need to get your Space's 'Direct URL' (see above).

> [!WARNING]
> Your Space's URL will only be available and usable for connecting from your
> local machine if the visibility of the space is set to 'Public'.

You can use the 'Direct URL' to connect to your ZenML server from your local machine with the following CLI command (after installing ZenML, and using your custom URL instead of the placeholder):

```shell
zenml connect --url '' --username='default' --password=''
```

You can also use the Direct URL in your browser to use the ZenML dashboard as a fullscreen application (i.e.
without the HuggingFace Spaces wrapper around it).

> [!WARNING]
> The ZenML dashboard will currently not work when viewed from within the Huggingface
> webpage (i.e. wrapped in the main `https://huggingface.co/...` website). This is on
> account of a limitation in how cookies are handled between ZenML and Huggingface.
> You **must** view the dashboard from the 'Direct URL' (see above).

## Extra Configuration Options

By default the ZenML application will be configured to use a SQLite non-persistent database. If you want to use a persistent database, you can configure this by amending the `Dockerfile` in your Space's root directory. For full details on the various parameters you can change, see [our reference documentation](https://docs.zenml.io/getting-started/deploying-zenml/docker#zenml-server-configuration-options) on configuring ZenML when deployed with Docker.

> [!TIP]
> If you are using the space just for testing and experimentation, you don't need
> to make any changes to the configuration. Everything will work out of the box.

You can also use an external secrets backend together with your HuggingFace Spaces as described in [our documentation](https://docs.zenml.io/getting-started/deploying-zenml/docker#zenml-server-configuration-options). You should be sure to use HuggingFace's inbuilt 'Repository secrets' functionality to configure any secrets you need to use in your `Dockerfile` configuration. [See the documentation](https://huggingface.co/docs/hub/spaces-sdks-docker#secret-management) for more details on how to set this up.

> [!WARNING]
> If you wish to use a cloud secrets backend together with ZenML for secrets
> management, **you must take the following minimal security precautions** on your ZenML Server on the
> Dashboard:
>
> - change your password on the `default` account that you get when you start. You
>   can do this from the Dashboard or via the CLI.
> - create a new user account with a password and assign it the `admin` role.
This
> can also be done from the Dashboard (by 'inviting' a new user) or via the CLI.
> - reconnect to the server using the new user account and password as described
>   above, and use this new user account as your working account.
>
> This is because the default user created by the
> HuggingFace Spaces deployment process has no password assigned to it and, as the
> Space is publicly accessible (since the Space is public), *potentially anyone
> could access your secrets without this extra step*. To change your password,
> navigate to the Settings page by clicking the button in the upper right hand
> corner of the Dashboard and then click 'Update Password'.

## Upgrading your ZenML Server on HF Spaces

The default space will use the latest version of ZenML automatically. If you want to update your version, you can simply select the 'Factory reboot' option within the 'Settings' tab of the space. Note that this will wipe any data contained within the space, so if you are not using a MySQL persistent database (as described above) you will lose any data contained within your ZenML deployment on the space. You can also configure the space to use an earlier version by updating the `Dockerfile`'s `FROM` import statement at the very top.

## Next Steps

As a next step, check out our [Starter Guide to MLOps with ZenML](https://docs.zenml.io/starter-guide/pipelines) which is a series of short practical pages on how to get going quickly. Alternatively, check out [our `quickstart` example](https://github.com/zenml-io/zenml/tree/main/examples/quickstart) which is a full end-to-end example of many of the features of ZenML.

## 🤗 Feedback and support

If you are having trouble with your ZenML server on HuggingFace Spaces, you can view the logs by clicking on the "Open Logs" button at the top of the space. This will give you more context of what's happening with your server.
If you have suggestions or need specific support for anything else that isn't working, please [join the ZenML Slack community](https://zenml.io/slack-invite/) and we'll be happy to help you out!

### Using SpanMarker at Hugging Face

https://huggingface.co/docs/hub/span_marker.md

# Using SpanMarker at Hugging Face

[SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa and DeBERTa. Tightly implemented on top of the 🤗 Transformers library, SpanMarker can take good advantage of it. As a result, SpanMarker will be intuitive to use for anyone familiar with Transformers.

## Exploring SpanMarker in the Hub

You can find `span_marker` models by filtering at the left of the [models page](https://huggingface.co/models?library=span-marker).

All models on the Hub come with these useful features:
1. An automatically generated model card with a brief description.
2. An interactive widget you can use to play with the model directly in the browser.
3. An Inference Providers widget that allows you to make inference requests.

## Installation

To get started, you can follow the [SpanMarker installation guide](https://tomaarsen.github.io/SpanMarkerNER/install.html). You can also use the following one-line install through pip:

```
pip install -U span_marker
```

## Using existing models

All `span_marker` models can easily be loaded from the Hub.

```py
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
```

Once loaded, you can use [`SpanMarkerModel.predict`](https://tomaarsen.github.io/SpanMarkerNER/api/span_marker.modeling.html#span_marker.modeling.SpanMarkerModel.predict) to perform inference.
```py
model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
```

```json
[
  {"span": "Amelia Earhart", "label": "person-other", "score": 0.7629689574241638, "char_start_index": 0, "char_end_index": 14},
  {"span": "Lockheed Vega 5B", "label": "product-airplane", "score": 0.9833564758300781, "char_start_index": 38, "char_end_index": 54},
  {"span": "Atlantic", "label": "location-bodiesofwater", "score": 0.7621214389801025, "char_start_index": 66, "char_end_index": 74},
  {"span": "Paris", "label": "location-GPE", "score": 0.9807717204093933, "char_start_index": 78, "char_end_index": 83}
]
```

If you want to load a specific SpanMarker model, you can click `Use in SpanMarker` and you will be given a working snippet!

## Additional resources

* SpanMarker [repository](https://github.com/tomaarsen/SpanMarkerNER)
* SpanMarker [docs](https://tomaarsen.github.io/SpanMarkerNER)

### Downloading datasets

https://huggingface.co/docs/hub/datasets-downloading.md

# Downloading datasets

## Integrated libraries

If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use this dataset" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/knkarthick/samsum?library=datasets) shows how to do so with `datasets` below.

## Using the Hugging Face Client Library

You can use the [`huggingface_hub`](/docs/huggingface_hub) library to create, delete, update and retrieve information from repos. For example, to download the `HuggingFaceH4/ultrachat_200k` dataset from the command line, run

```bash
hf download HuggingFaceH4/ultrachat_200k --repo-type dataset
```

See the [HF CLI download documentation](https://huggingface.co/docs/huggingface_hub/en/guides/cli#download-a-dataset-or-a-space) for more information.

You can also integrate this into your own library!
For example, you can quickly load a CSV dataset with a few lines using Pandas.

```py
from huggingface_hub import hf_hub_download
import pandas as pd

REPO_ID = "YOUR_REPO_ID"
FILENAME = "data.csv"

dataset = pd.read_csv(
    hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset")
)
```

## Using Git

Since all datasets on the Hub are Xet-backed Git repositories, you can clone the datasets locally by [installing git-xet](./xet/using-xet-storage#git-xet) and running:

```bash
git xet install
git lfs install
git clone git@hf.co:datasets/
# example: git clone git@hf.co:datasets/allenai/c4
```

If you have write-access to the particular dataset repo, you'll also have the ability to commit and push revisions to the dataset. Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos.

## Using hf-mount

For large datasets, you can mount a repo as a local filesystem with [hf-mount](https://github.com/huggingface/hf-mount) instead of downloading the full repo. Files are fetched lazily: only the bytes your code reads hit the network. Useful when your workflow expects local file paths (e.g. `tarfile`, `zipfile`, `imagefolder`) rather than Python iterators.

```bash
curl -fsSL https://raw.githubusercontent.com/huggingface/hf-mount/main/install.sh | sh
hf-mount start repo datasets/stanfordnlp/imdb /tmp/imdb
```

Repos are mounted read-only. See [Mount as a Local Filesystem](./storage-buckets-access#mount-as-a-local-filesystem) for full setup details, backend options, and caching.

### Using MLX at Hugging Face

https://huggingface.co/docs/hub/mlx.md

# Using MLX at Hugging Face

[MLX](https://github.com/ml-explore/mlx) is a model training and serving framework for Apple silicon made by Apple Machine Learning Research.
It comes with a variety of examples:

- [Generate text with MLX-LM](https://github.com/ml-explore/mlx-lm/tree/main) and [generating text with MLX-LM for models in GGUF format](https://github.com/ml-explore/mlx-examples/tree/main/llms/gguf_llm).
- Large-scale text generation with [LLaMA](https://github.com/ml-explore/mlx-examples/tree/main/llms/llama).
- Fine-tuning with [LoRA](https://github.com/ml-explore/mlx-examples/tree/main/lora).
- Generating images with [Stable Diffusion](https://github.com/ml-explore/mlx-examples/tree/main/stable_diffusion).
- Speech recognition with [OpenAI's Whisper](https://github.com/ml-explore/mlx-examples/tree/main/whisper).

## Exploring MLX on the Hub

You can find MLX models by filtering at the left of the [models page](https://huggingface.co/models?library=mlx&sort=trending).

There's also an open [MLX community](https://huggingface.co/mlx-community) of contributors converting and publishing weights in MLX format. Thanks to the MLX Hugging Face Hub integration, you can load MLX models with a few lines of code.

## Installation

MLX comes as a standalone package, and there's a subpackage called MLX-LM with Hugging Face integration for Large Language Models. To install MLX-LM, you can use the following one-line install through `pip`:

```bash
pip install mlx-lm
```

You can get more information about it [here](https://github.com/ml-explore/mlx-lm/tree/main).

If you install `mlx-lm`, you don't need to install `mlx`. If you don't want to use `mlx-lm` but only MLX, you can install MLX itself as follows.

With `pip`:

```bash
pip install mlx
```

With `conda`:

```bash
conda install -c conda-forge mlx
```

## Using Existing Models

MLX-LM has useful utilities to generate text. The following line directly downloads and loads the model and starts generating text.
```bash
python -m mlx_lm.generate --model mistralai/Mistral-7B-Instruct-v0.2 --prompt "hello"
```

For a full list of generation options, run

```bash
python -m mlx_lm.generate --help
```

You can also load a model and start generating text through Python like below:

```python
from mlx_lm import load, generate

model, tokenizer = load("mistralai/Mistral-7B-Instruct-v0.2")
response = generate(model, tokenizer, prompt="hello", verbose=True)
```

MLX-LM supports popular LLM architectures including LLaMA, Phi-2, Mistral, and Qwen. Models other than the supported ones can easily be downloaded as follows. Setting `HF_XET_HIGH_PERFORMANCE=1` raises concurrency bounds and buffer sizes for machines with high bandwidth and at least 64 GB of RAM:

```bash
pip install -U huggingface_hub
export HF_XET_HIGH_PERFORMANCE=1
hf download --local-dir /
```

## Converting and Sharing Models

You can convert, and optionally quantize, LLMs from the Hugging Face Hub as follows:

```bash
python -m mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.1 -q
```

If you want to directly push the model after the conversion, you can do it as shown below.

```bash
python -m mlx_lm.convert \
    --hf-path mistralai/Mistral-7B-v0.1 \
    -q \
    --upload-repo /
```

## Additional Resources

* [MLX Repository](https://github.com/ml-explore/mlx)
* [MLX Docs](https://ml-explore.github.io/mlx/)
* [MLX-LM](https://github.com/ml-explore/mlx-lm/tree/main)
* [MLX Examples](https://github.com/ml-explore/mlx-examples/tree/main)
* [All MLX models on the Hub](https://huggingface.co/models?library=mlx&sort=trending)

### Third-party scanner: Protect AI

https://huggingface.co/docs/hub/security-protectai.md

# Third-party scanner: Protect AI

> [!TIP]
> Interested in joining our security partnership / providing scanning information on the Hub?
> Please get in touch with us over at security@huggingface.co.

[Protect AI](https://protectai.com/)'s [Guardian](https://protectai.com/guardian) catches pickle, Keras, and other exploits as detailed on their [Knowledge Base page](https://protectai.com/insights/knowledge-base/). Guardian also benefits from reports sent in by their community of bounty hunters at [Huntr](https://huntr.com/).

![Protect AI report for the danger.dat file contained in mcpotato/42-eicar-street](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/protect-ai-report.png)
*Example of a report for [danger.dat](https://huggingface.co/mcpotato/42-eicar-street/blob/main/danger.dat)*

We partnered with Protect AI to provide scanning in order to make the Hub safer. The same way files are scanned by our internal scanning system, public repositories' files are scanned by Guardian. Our frontend has been redesigned specifically for this purpose, in order to accommodate new scanners:

Here is an example repository you can check out to see the feature in action: [mcpotato/42-eicar-street](https://huggingface.co/mcpotato/42-eicar-street).

## Model security refresher

To share models, we serialize the data structures we use to interact with the models, in order to facilitate storage and transport. Some serialization formats are vulnerable to nasty exploits, such as arbitrary code execution (looking at you, pickle), making sharing models potentially dangerous.

As Hugging Face has become a popular platform for model sharing, we'd like to protect the community from this, which is why we have developed tools like [picklescan](https://github.com/mmaitre314/picklescan) and why we integrate third-party scanners.

Pickle is not the only exploitable format out there; [see for reference](https://github.com/Azure/counterfit/wiki/Abusing-ML-model-file-formats-to-create-malware-on-AI-systems:-A-proof-of-concept) how one can exploit Keras Lambda layers to achieve arbitrary code execution.
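To make the pickle risk concrete, here is a minimal, deliberately benign sketch of the mechanism these scanners look for: `pickle` lets an object's `__reduce__` method name any callable, and that callable runs at load time. The `Payload` class below is purely illustrative; an attacker would substitute something like `os.system` for the harmless `len` used here.

```python
import pickle


class Payload:
    """Benign stand-in for a malicious pickle payload."""

    def __reduce__(self):
        # Whatever callable is returned here executes during pickle.loads().
        # An attacker would return (os.system, ("...",)) instead of len.
        return (len, ("this string is 'executed' at load time",))


blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # len(...) runs here, before you ever see an object
print(result)
```

Loading an untrusted pickle is therefore equivalent to running untrusted code, which is exactly why scanners flag suspicious imports inside pickles and why safer weight formats are preferred.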
### Getting Started with Repositories

https://huggingface.co/docs/hub/repositories-getting-started.md

# Getting Started with Repositories

This beginner-friendly guide will help you get the basic skills you need to create and manage your repository on the Hub. Each section builds on the previous one, so feel free to choose where to start!

## Requirements

This document shows how to handle repositories through the web interface as well as through the terminal. There are no requirements if working with the UI. If you want to work with the terminal, please follow these installation instructions.

If you do not have `git` available as a CLI command yet, you will need to [install Git](https://git-scm.com/downloads) for your platform. You will also need to [install Git-Xet](./xet/using-xet-storage#git-xet), which will be used to handle large files such as images and model weights.

> [!TIP]
> To be able to download and upload large files from Git, you need to install the [Git Xet](./xet/using-xet-storage#git) extension.

To be able to push your code to the Hub, you'll need to authenticate. The easiest way to do this is by installing the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/guides/cli) and running the login command:

```bash
# Install hf:
# brew install hf
# or
# pip install hf
hf auth login
```

**The content in the Getting Started section of this document is also available as a video!**

## Creating a repository

Using the Hub's web interface you can easily create repositories, add files (even large ones!), explore models, visualize diffs, and much more. There are three kinds of repositories on the Hub, and in this guide you'll be creating a **model repository** for demonstration purposes. For information on creating and managing models, datasets, and Spaces, refer to their respective documentation.

1. To create a new repository, visit [huggingface.co/new](http://huggingface.co/new):
2.
Specify the owner of the repository: this can be either you or any of the organizations you're affiliated with. 3. Enter your model's name. This will also be the name of the repository. 4. Specify whether you want your model to be public or private. 5. Specify the license. You can leave the *License* field blank for now. To learn about licenses, visit the [**Licenses**](repositories-licenses) documentation. After creating your model repository, you should see a page like this: Note that the Hub prompts you to create a *Model Card*, which you can learn about in the [**Model Cards documentation**](./model-cards). Including a Model Card in your model repo is best practice, but since we're only making a test repo at the moment we can skip this. ## Adding files to a repository (Web UI) To add files to your repository via the web UI, start by selecting the **Files** tab, navigating to the desired directory, and then clicking **Add file**. You'll be given the option to create a new file or upload a file directly from your computer. ### Creating a new file Choosing to create a new file will take you to the following editor screen, where you can choose a name for your file, add content, and save your file with a message that summarizes your changes. Instead of directly committing the new file to your repo's `main` branch, you can select `Open as a pull request` to create a [Pull Request](./repositories-pull-requests-discussions). ### Uploading a file If you choose _Upload file_ you'll be able to choose a local file to upload, along with a message summarizing your changes to the repo. As with creating new files, you can select `Open as a pull request` to create a [Pull Request](./repositories-pull-requests-discussions) instead of adding your changes directly to the `main` branch of your repo.
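If you prefer scripting to the web UI, the same create-and-upload flow can be done with the `huggingface_hub` Python client. `create_repo` and `upload_file` are real client methods; the helper name, repo id, and README text below are placeholders:

```python
def create_and_seed_repo(repo_id: str, readme_text: str) -> None:
    """Create a model repo (no-op if it already exists) and commit a README.

    Requires prior authentication, e.g. via `hf auth login`.
    """
    from huggingface_hub import HfApi  # official Hub client

    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_file(
        path_or_fileobj=readme_text.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        commit_message="Add model card",
    )

# create_and_seed_repo("your-username/your-model-name", "# My test model")
```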
## Adding files to a repository (CLI)[[cli]] You can upload files to your repository directly from the terminal using the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/guides/cli). Use the `hf upload` command to push local files or entire folders: ```bash # Upload a single file to your model repo hf upload your-username/your-model-name model.safetensors # Upload an entire directory hf upload your-username/your-model-name ./my-model-directory # Upload to a dataset repo hf upload your-username/your-dataset-name ./data --repo-type dataset ``` The `hf` CLI handles large files automatically; no extra setup is required. ## Adding files to a repository (git)[[terminal]] ### Cloning repositories Downloading repositories to your local machine is called *cloning*. You can use the following commands to load your repo and navigate to it: ```bash git clone https://huggingface.co/<your-username>/<your-model-name> cd <your-model-name> ``` Or for a dataset repo: ```bash git clone https://huggingface.co/datasets/<your-username>/<your-dataset-name> cd <your-dataset-name> ``` You can clone over SSH with the following command: ```bash git clone git@hf.co:<your-username>/<your-model-name> cd <your-model-name> ``` You'll need to add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes or access private repositories. ### Set up Now you can add any files you want to the repository! 🔥 Do you have files larger than 10MB? Those files should be tracked with [`git-xet`](./xet/using-xet-storage#git-xet), which you can initialize with: ```bash git xet install ``` When you use Hugging Face to create a repository, Hugging Face automatically provides a list of common file extensions for common Machine Learning large files in the `.gitattributes` file, which `git-xet` uses to efficiently track changes to your large files. However, you might need to add new extensions if your file types are not already handled. You can do so with `git xet track "*.your_extension"`.
### Pushing files You can use Git to save new files and any changes to already existing files as a bundle of changes called a *commit*, which can be thought of as a "revision" to your project. To create a commit, you have to `add` the files to let Git know that you're planning to save the changes and then `commit` those changes. In order to sync the new commit with the Hugging Face Hub, you then `push` the commit to the Hub. ```bash # Create any files you like! Then... git add . git commit -m "First model version" # You can choose any descriptive message git push ``` And you're done! You can check your repository on Hugging Face with all the recently added files. For example, in the screenshot below the user added a number of files. Note that some files in this example have a size of `1.04 GB`, so the repo uses Xet to track them. > [!TIP] > If you cloned the repository with HTTP, you might be asked to enter your username and password on every push operation. The simplest way to avoid repetition is to [switch to SSH](#cloning-repositories) instead of HTTP. Alternatively, if you have to use HTTP, you might find it helpful to set up a [git credential helper](https://git-scm.com/docs/gitcredentials#_avoiding_repetition) to autofill your username and password. ## Viewing a repo's history Every time you go through the `add`-`commit`-`push` cycle, the repo will keep track of every change you've made to your files. The UI allows you to explore the model files and commits and to see the difference (also known as *diff*) introduced by each commit. To see the history, you can click on the **History: X commits** link. You can click on an individual commit to see what changes that commit introduced: ### Embedding Atlas https://huggingface.co/docs/hub/datasets-embedding-atlas.md # Embedding Atlas [Embedding Atlas](https://apple.github.io/embedding-atlas/) is an interactive visualization tool for exploring large embedding spaces.
It enables you to visualize, cross-filter, and search embeddings alongside associated metadata, helping you understand patterns and relationships in high-dimensional data. All computation happens on your computer, ensuring your data remains private and secure. Here is an [example atlas](https://huggingface.co/spaces/davanstrien/megascience) for the [MegaScience](https://huggingface.co/datasets/MegaScience/MegaScience) dataset hosted as a Static Space: ## Key Features - **Interactive exploration**: Navigate through millions of embeddings with smooth, responsive visualization - **Browser-based computation**: Compute embeddings and projections locally without sending data to external servers - **Cross-filtering**: Link and filter data across multiple metadata columns - **Search capabilities**: Find similar data points to a given query or existing item - **Multiple integration options**: Use via command line, Jupyter widgets, or web interface ## Prerequisites First, install Embedding Atlas: ```bash pip install embedding-atlas ``` If you plan to load private datasets from the Hugging Face Hub, you'll also need to [log in with your Hugging Face account](/docs/huggingface_hub/quick-start#login): ```bash hf auth login ``` ## Loading Datasets from the Hub Embedding Atlas provides seamless integration with the Hugging Face Hub, allowing you to visualize embeddings from any dataset directly. ### Using the Command Line The simplest way to visualize a Hugging Face dataset is through the command line interface.
Try it with the IMDB dataset: ```bash # Load the IMDB dataset from the Hub embedding-atlas stanfordnlp/imdb # Specify the text column for embedding computation embedding-atlas stanfordnlp/imdb --text "text" # Load only a sample for faster exploration embedding-atlas stanfordnlp/imdb --text "text" --sample 5000 ``` For your own datasets, use the same pattern: ```bash # Load your dataset from the Hub embedding-atlas username/dataset-name # Load multiple splits embedding-atlas username/dataset-name --split train --split test # Specify custom text column embedding-atlas username/dataset-name --text "content" ``` ### Using Python and Jupyter You can also use Embedding Atlas in Jupyter notebooks for interactive exploration: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load the IMDB dataset from Hugging Face Hub dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]") # Convert to pandas DataFrame df = dataset.to_pandas() # Create interactive widget widget = EmbeddingAtlasWidget(df) widget ``` For your own datasets: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load your dataset from the Hub dataset = load_dataset("username/dataset-name", split="train") df = dataset.to_pandas() # Create interactive widget widget = EmbeddingAtlasWidget(df) widget ``` ### Working with Pre-computed Embeddings If you have datasets with pre-computed embeddings, you can load them directly: ```bash # Load dataset with pre-computed coordinates embedding-atlas username/dataset-name \ --x "embedding_x" \ --y "embedding_y" # Load with pre-computed nearest neighbors embedding-atlas username/dataset-name \ --neighbors "neighbors_column" ``` ## Customizing Embeddings Embedding Atlas uses [SentenceTransformers](https://huggingface.co/sentence-transformers) by default but supports custom embedding models: ```bash # Use a specific embedding model 
embedding-atlas stanfordnlp/imdb \ --text "text" \ --model "sentence-transformers/all-MiniLM-L6-v2" # For models requiring remote code execution embedding-atlas username/dataset-name \ --model "custom/model" \ --trust-remote-code ``` ### UMAP Projection Parameters Fine-tune the dimensionality reduction for your specific use case: ```bash embedding-atlas stanfordnlp/imdb \ --text "text" \ --umap-n-neighbors 30 \ --umap-min-dist 0.1 \ --umap-metric "cosine" ``` ## Use Cases ### Exploring Text Datasets Visualize and explore text corpora to identify clusters, outliers, and patterns: ```python from embedding_atlas.widget import EmbeddingAtlasWidget from datasets import load_dataset import pandas as pd # Load a text classification dataset dataset = load_dataset("stanfordnlp/imdb", split="train[:5000]") df = dataset.to_pandas() # Visualize with metadata widget = EmbeddingAtlasWidget(df) widget ``` ## Additional Resources - [Embedding Atlas GitHub Repository](https://github.com/apple/embedding-atlas) - [Official Documentation](https://apple.github.io/embedding-atlas/) - [Interactive Demo](https://apple.github.io/embedding-atlas/upload/) - [Command Line Reference](https://apple.github.io/embedding-atlas/tool.html) ### Using PEFT at Hugging Face https://huggingface.co/docs/hub/peft.md # Using PEFT at Hugging Face 🤗 [Parameter-Efficient Fine-Tuning (PEFT)](https://huggingface.co/docs/peft/index) is a library for efficiently adapting pre-trained language models to various downstream applications without fine-tuning all the model's parameters. ## Exploring PEFT on the Hub You can find PEFT models by filtering at the left of the [models page](https://huggingface.co/models?library=peft&sort=trending). ## Installation To get started, you can check out the [Quick Tour in the PEFT docs](https://huggingface.co/docs/peft/quicktour). To install, follow the [PEFT installation guide](https://huggingface.co/docs/peft/install).
You can also use the following one-line install through pip: ``` $ pip install peft ``` ## Using existing models All PEFT models can be loaded from the Hub. To use a PEFT model you also need to load the base model that was fine-tuned, as shown below. Every fine-tuned model lists its base model in its model card. ```py from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel, PeftConfig base_model = "mistralai/Mistral-7B-v0.1" adapter_model = "dfurman/Mistral-7B-Instruct-v0.2" model = AutoModelForCausalLM.from_pretrained(base_model) model = PeftModel.from_pretrained(model, adapter_model) tokenizer = AutoTokenizer.from_pretrained(base_model) model = model.to("cuda") model.eval() ``` Once loaded, you can pass your inputs to the tokenizer to prepare them, and call `model.generate()` in regular `transformers` fashion. ```py import torch inputs = tokenizer("Tell me the recipe for chocolate chip cookie", return_tensors="pt") with torch.no_grad(): outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10) print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0]) ``` It outputs the following: ```text Tell me the recipe for chocolate chip cookie dough. 1. Preheat oven to 375 degrees F (190 degrees C). 2. In a large bowl, cream together 1/2 cup (1 stick) of butter or margarine, 1/2 cup granulated sugar, and 1/2 cup packed brown sugar. 3. Beat in 1 egg and 1 teaspoon vanilla extract. 4. Mix in 1 1/4 cups all-purpose flour. 5. Stir in 1/2 teaspoon baking soda and 1/2 teaspoon salt. 6. Fold in 3/4 cup semisweet chocolate chips. 7. Drop by ``` If you want to load a specific PEFT model, you can click `Use in PEFT` in the model card and you will be given a working snippet!
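After loading, you can also fold the adapter weights into the base model so that inference no longer needs PEFT at runtime. `merge_and_unload()` is part of the real `PeftModel` API; the helper function below and the commented model ids are illustrative:

```python
def load_merged(base_model_id: str, adapter_id: str):
    """Return a plain transformers model with the adapter weights merged in."""
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base, adapter_id)
    return model.merge_and_unload()  # no PeftModel wrapper afterwards

# model = load_merged("mistralai/Mistral-7B-v0.1", "dfurman/Mistral-7B-Instruct-v0.2")
```

Merging trades adapter flexibility for slightly faster, dependency-free inference.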
## Additional resources * PEFT [repository](https://github.com/huggingface/peft) * PEFT [docs](https://huggingface.co/docs/peft/index) * PEFT [models](https://huggingface.co/models?library=peft&sort=trending) ### Using Keras at Hugging Face https://huggingface.co/docs/hub/keras.md # Using Keras at Hugging Face Keras is an open-source multi-backend deep learning framework, with support for JAX, TensorFlow, and PyTorch. You can find more details about it on [keras.io](https://keras.io/). ## Exploring Keras in the Hub You can list `keras` models on the Hub by filtering by library name on the [models page](https://huggingface.co/models?library=keras&sort=downloads). Keras models on the Hub come with useful features when uploaded directly from the Keras library: 1. A generated model card with a description, a plot of the model, and more. 2. A download count to monitor the popularity of a model. 3. A code snippet to quickly get started with the model. ## Using existing models Keras is deeply integrated with the Hugging Face Hub. This means you can load and save models on the Hub directly from the library. To do that, you need to install a recent version of Keras and `huggingface_hub`. The `huggingface_hub` library is a lightweight Python client used by Keras to interact with the Hub. ``` pip install -U keras huggingface_hub ``` Once you have the library installed, you just need to use the regular `keras.saving.load_model` method by passing as argument a Hugging Face path. An HF path is a `repo_id` prefixed by `hf://`, e.g. `"hf://keras-io/weather-prediction"`. Read more about `load_model` in the [Keras documentation](https://keras.io/api/models/model_saving_apis/model_saving_and_loading/#load_model-function). ```py import keras model = keras.saving.load_model("hf://Wauplin/mnist_example") ``` If you want to see how to load a specific model, you can click **Use this model** on the model page to get a working code snippet!
## Sharing your models Similarly to `load_model`, you can save and share a `keras` model on the Hub using `model.save()` with an HF path: ```py model = ... model.save("hf://your-username/your-model-name") ``` If the repository does not exist on the Hub, it will be created for you. The uploaded model contains a model card, a plot of the model, the `metadata.json` and `config.json` files, and a `model.weights.h5` file containing the model weights. By default, the repository will contain a minimal model card. Check out the [Model Card guide](https://huggingface.co/docs/hub/model-cards) to learn more about model cards and how to complete them. You can also programmatically update model cards using `huggingface_hub.ModelCard` (see [guide](https://huggingface.co/docs/huggingface_hub/guides/model-cards)). > [!TIP] > You might already be familiar with `.keras` files. In fact, a `.keras` file is simply a zip file containing the `.json` and `model.weights.h5` files. When pushed to the Hub, the model is saved as an unzipped folder in order to let you navigate through the files. Note that if you manually upload a `.keras` file to a model repository on the Hub, the repository will automatically be tagged as `keras` but you won't be able to load it using `keras.saving.load_model`. ## Additional resources * Keras Developer [Guides](https://keras.io/guides/). * Keras [examples](https://keras.io/examples/). ### Models https://huggingface.co/docs/hub/models.md # Models The Hugging Face Hub hosts many models for a [variety of machine learning tasks](https://huggingface.co/tasks). Models are stored in repositories, so they benefit from [all the features](./repositories) available to every repo on the Hugging Face Hub. Additionally, model repos have attributes that make exploring and using models as easy as possible. These docs will take you through everything you'll need to know to find models on the Hub, upload your models, and make the most of everything the Model Hub offers!
## Contents - [The Model Hub](./models-the-hub) - [Model Cards](./model-cards) - [CO2 emissions](./model-cards-co2) - [Eval Results](./eval-results) - [Gated models](./models-gated) - [Uploading Models](./models-uploading) - [Downloading Models](./models-downloading) - [Libraries](./models-libraries) - [Widgets](./models-widgets) - [Widget Examples](./models-widgets-examples) - [Model Inference](./models-inference) - [Local Apps](./local-apps) - [Frequently Asked Questions](./models-faq) - [Advanced Topics](./models-advanced) - [Integrating libraries with the Hub](./models-adding-libraries) - [Tasks](./models-tasks) ### Gated models https://huggingface.co/docs/hub/models-gated.md # Gated models To give more control over how models are used, the Hub allows model authors to enable **access requests** for their models. When enabled, users must agree to share their contact information (username and email address) with the model authors to access the model files. Model authors can configure this request with additional fields. A model with access requests enabled is called a **gated model**. Access requests are always granted to individual users rather than to entire organizations. A common use case of gated models is to provide access to early research models before the wider release. ## Manage gated models as a model author To enable access requests, go to the model settings page. By default, the model is not gated. Click on **Enable Access request** in the top-right corner. By default, access to the model is automatically granted to the user when requesting it. This is referred to as **automatic approval**. In this mode, any user can access your model once they've shared their personal information with you. If you want to manually approve which users can access your model, you must set it to **manual approval**. When this is the case, you will notice more options: - **Add access** allows you to search for a user and grant them access even if they did not request it.
- **Notification frequency** lets you configure when to get notified if new users request access. It can be set to once a day or real-time. By default, an email is sent to your primary email address. For models hosted under an organization, emails are by default sent to the first 5 admins of the organization. In both cases (user or organization) you can set a different email address in the **Notifications email** field. ### Review access requests Once access requests are enabled, you have full control over who can access your model, whether the approval mode is manual or automatic. You can review and manage requests either from the UI or via the API. #### From the UI You can review who has access to your gated model from its settings page by clicking on the **Review access requests** button. This will open a modal with 3 lists of users: - **pending**: the list of users waiting for approval to access your model. This list is empty unless you've selected **manual approval**. You can either **Accept** or **Reject** the request. If the request is rejected, the user cannot access your model and cannot request access again. - **accepted**: the complete list of users with access to your model. You can choose to **Reject** access at any time for any user, whether the approval mode is manual or automatic. You can also **Cancel** the approval, which will move the user to the *pending* list. - **rejected**: the list of users you've manually rejected. Those users cannot access your models. If they go to your model repository, they will see a message *Your request to access this repo has been rejected by the repo's authors*. #### Via the API You can automate the approval of access requests by using the API. You must pass a `token` with `write` access to the gated repository. To generate a token, go to [your user settings](https://huggingface.co/settings/tokens).
| Method | URI | Description | Headers | Payload |
| ------ | --- | ----------- | ------- | ------- |
| `GET` | `/api/models/{repo_id}/user-access-request/pending` | Retrieve the list of pending requests. | `{"authorization": "Bearer $token"}` | |
| `GET` | `/api/models/{repo_id}/user-access-request/accepted` | Retrieve the list of accepted requests. | `{"authorization": "Bearer $token"}` | |
| `GET` | `/api/models/{repo_id}/user-access-request/rejected` | Retrieve the list of rejected requests. | `{"authorization": "Bearer $token"}` | |
| `POST` | `/api/models/{repo_id}/user-access-request/handle` | Change the status of a given access request to `status`. | `{"authorization": "Bearer $token"}` | `{"status": "accepted"/"rejected"/"pending", "user": "username", "rejectionReason": "Optional rejection reason that will be visible to the user (max 200 characters)."}` |
| `POST` | `/api/models/{repo_id}/user-access-request/grant` | Allow a specific user to access your repo. | `{"authorization": "Bearer $token"}` | `{"user": "username"}` |

The base URL for the HTTP endpoints above is `https://huggingface.co`. **NEW!** Those endpoints are now officially supported in our Python client `huggingface_hub`. List the access requests to your model with [`list_pending_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_pending_access_requests), [`list_accepted_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_accepted_access_requests) and [`list_rejected_access_requests`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_rejected_access_requests).
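As a sketch of how the raw HTTP endpoints above are used (the repo id, token, and username are placeholders, and the network calls are left commented out):

```python
import json
from urllib import request as urlrequest

REPO_ID = "your-username/your-model-name"  # a gated repo you administer
TOKEN = "hf_xxx"                           # a token with write access

base = f"https://huggingface.co/api/models/{REPO_ID}/user-access-request"
headers = {"authorization": f"Bearer {TOKEN}", "content-type": "application/json"}

# GET the pending requests
pending = urlrequest.Request(f"{base}/pending", headers=headers)

# POST to accept a specific user's request
body = json.dumps({"status": "accepted", "user": "some-user"}).encode("utf-8")
accept = urlrequest.Request(f"{base}/handle", data=body, headers=headers, method="POST")

# with urlrequest.urlopen(pending) as resp:  # requires a valid token
#     print(json.load(resp))
```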
You can also accept, cancel and reject access requests with [`accept_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.accept_access_request), [`cancel_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.cancel_access_request), [`reject_access_request`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.reject_access_request). Finally, you can grant access to a user with [`grant_access`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.grant_access). ### Download access report You can download a report of all access requests for a gated model with the **download user access report** button. Click on it to download a JSON file with a list of users. For each entry, you have: - **user**: the user id. Example: *julien-c*. - **fullname**: name of the user on the Hub. Example: *Julien Chaumond*. - **status**: status of the request. Either `"pending"`, `"accepted"` or `"rejected"`. - **email**: email of the user. - **time**: datetime when the user initially made the request. ### Customize requested information By default, users landing on your gated model will be asked to share their contact information (email and username) by clicking the **Agree and send request to access repo** button. If you want to collect more user information, you can configure additional fields. This information will be accessible from the **Settings** tab. To do so, add an `extra_gated_fields` property to your [model card metadata](./model-cards#model-card-metadata) containing a list of key/value pairs. The *key* is the name of the field and the *value* is its type, or an object with a `type` field. The list of field types is: - `text`: a single-line text field. - `checkbox`: a checkbox field. - `date_picker`: a date picker field. - `country`: a country dropdown.
The list of countries is based on the [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) standard. - `select`: a dropdown with a list of options. The list of options is defined in the `options` field. Example: `options: ["option 1", "option 2", {label: "option3", value: "opt3"}]`. Finally, you can also personalize the message displayed to the user with the `extra_gated_prompt` extra field. Here is an example of a customized request form where the user is asked to provide their company name and country and acknowledge that the model is for non-commercial use only. ```yaml --- extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects." extra_gated_fields: Company: text Country: country Specific date: date_picker I want to use this model for: type: select options: - Research - Education - label: Other value: other I agree to use this model for non-commercial use ONLY: checkbox --- ``` In some cases, you might also want to modify the default text in the gate heading, description, and button. For those use cases, you can modify `extra_gated_heading`, `extra_gated_description` and `extra_gated_button_content` like this: ```yaml --- extra_gated_heading: "Acknowledge license to accept the repository" extra_gated_description: "Our team may take 2-3 days to process your request" extra_gated_button_content: "Acknowledge license" --- ``` ### Example use cases of programmatically managing access requests Here are a few interesting use cases of programmatically managing access requests for gated repos we've seen organically emerge in the community. As a reminder, the model repo needs to be set to manual approval, otherwise users get access to it automatically. Possible use cases of programmatic management include: - If you have advanced user request screening requirements (advanced compliance requirements, etc.) or you wish to handle the user requests outside the Hub.
- An example for this was Meta's [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) initial release, where users had to request access on a Meta website. - You can ask users for their HF username in your access flow, and then use a script to programmatically accept user requests on the Hub based on your set of conditions. - If you want to condition access to a model based on completing a payment flow (note that the actual payment flow happens outside of the Hub). - Here's an [example repo](https://huggingface.co/Trelis/openchat_3.5-function-calling-v3) from TrelisResearch that implements this flow. - [@RonanMcGovern](https://huggingface.co/RonanMcGovern) has posted a [video about the flow](https://www.youtube.com/watch?v=2OT2SI5auQU) and tips on how to implement it. ## Manage gated models as an organization (Team & Enterprise) [Team & Enterprise](https://huggingface.co/docs/hub/en/enterprise) subscribers can create a Gating Group Collection to grant (or reject) access to all the models and datasets in a collection at once. More information about Gating Group Collections can be found in [our dedicated doc](https://huggingface.co/docs/hub/en/enterprise-gating-group-collections). ## Access gated models as a user As a user, if you want to use a gated model, you will need to request access to it. This means that you must be logged in to a Hugging Face user account. Requesting access can only be done from your browser. Go to the model on the Hub and you will be prompted to share your information: By clicking on **Agree**, you agree to share your username and email address with the model authors. In some cases, additional fields might be requested. To help the model authors decide whether to grant you access, try to fill out the form as completely as possible. Once the access request is sent, there are two possibilities. If the approval mechanism is automatic, you immediately get access to the model files.
Otherwise, the requests have to be approved manually by the authors, which can take more time. > [!WARNING] > The model authors have complete control over model access. In particular, they can decide at any time to block your access to the model without prior notice, regardless of approval mechanism or if your request has already been approved. ### Download files To download files from a gated model you'll need to be authenticated. In the browser, this is automatic as long as you are logged in with your account. If you are using a script, you will need to provide a [user token](./security-tokens). In the Hugging Face Python ecosystem (`transformers`, `diffusers`, `datasets`, etc.), you can log in on your machine using the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index) library by running in your terminal: ```bash hf auth login ``` Alternatively, you can log in programmatically using `login()` in a notebook or a script: ```python >>> from huggingface_hub import login >>> login() ``` You can also provide the `token` parameter to most loading methods in the libraries (`from_pretrained`, `hf_hub_download`, `load_dataset`, etc.), directly from your scripts. For more details about how to login, check out the [login guide](https://huggingface.co/docs/huggingface_hub/quick-start#login). ### Restricting Access for EU Users For gated models, you can add an additional layer of access control to specifically restrict users from European Union countries. This is useful if your model's license or terms of use prohibit its distribution in the EU. To enable this, add the `extra_gated_eu_disallowed: true` property to your model card's metadata. **Important:** This feature will only activate if your model is already gated. If `gated: false` or the property is not set, this restriction will not apply. ```yaml --- license: mit gated: true extra_gated_eu_disallowed: true --- ``` The system identifies a user's location based on their IP address.
### Run with Docker https://huggingface.co/docs/hub/spaces-run-with-docker.md # Run with Docker You can use Docker to run most Spaces locally. To view instructions to download and run Spaces' Docker images, click on the "Run with Docker" button on the top-right corner of your Space page: ## Login to the Docker registry Some Spaces will require you to log in to Hugging Face's Docker registry. To do so, you'll need to provide: - Your Hugging Face username as `username` - A User Access Token as `password`. Generate one [here](https://huggingface.co/settings/tokens). ### Integrate your library with the Hub https://huggingface.co/docs/hub/models-adding-libraries.md # Integrate your library with the Hub The Hugging Face Hub aims to facilitate sharing machine learning models, checkpoints, and artifacts. This endeavor includes integrating the Hub into many of the amazing third-party libraries in the community. Some of the ones already integrated include [spaCy](https://spacy.io/usage/projects#huggingface_hub), [Sentence Transformers](https://sbert.net/), [OpenCLIP](https://github.com/mlfoundations/open_clip), and [timm](https://huggingface.co/docs/timm/index), among many others. Integration means users can download files from and upload files to the Hub directly from your library. We hope you will integrate your library and join us in democratizing artificial intelligence for everyone. Integrating the Hub with your library provides many benefits, including: - Free model hosting for you and your users. - Built-in file versioning - even for huge files - made possible by [Git-Xet](./xet/using-xet-storage#git-xet). - Community features (discussions, pull requests, likes). - Usage metrics for all models run with your library. This tutorial will help you integrate the Hub into your library so your users can benefit from all the features offered by the Hub.
Before you begin, we recommend you create a [Hugging Face account](https://huggingface.co/join) from which you can manage your repositories and files. If you need help with the integration, feel free to open an [issue](https://github.com/huggingface/huggingface_hub/issues/new/choose), and we would be more than happy to help you. ## Implementation Implementing an integration of a library with the Hub often means providing built-in methods to load models from the Hub and allow users to push new models to the Hub. This section will cover the basics of how to do that using the `huggingface_hub` library. For more in-depth guidance, check out [this guide](https://huggingface.co/docs/huggingface_hub/guides/integrations). ### Installation To integrate your library with the Hub, you will need to add the `huggingface_hub` library as a dependency: ```bash pip install huggingface_hub ``` For more details about `huggingface_hub` installation, check out [this guide](https://huggingface.co/docs/huggingface_hub/installation). > [!TIP] > In this guide, we will focus on Python libraries. If you've implemented your library in JavaScript, you can use [`@huggingface/hub`](https://www.npmjs.com/package/@huggingface/hub) instead. The rest of the logic (i.e. hosting files, code samples, etc.) does not depend on the code language. > > ``` > npm add @huggingface/hub > ``` Users will need to authenticate once they have successfully installed the `huggingface_hub` library. The easiest way to authenticate is to save the token on the machine. Users can do that from the terminal using the `hf auth login` command: ```bash hf auth login ``` The command tells them if they are already logged in and prompts them for their token. The token is then validated and saved in their `HF_HOME` directory (defaults to `~/.cache/huggingface/token`). Any script or library interacting with the Hub will use this token when sending requests.
Alternatively, users can log in programmatically using `login()` in a notebook or a script: ```py from huggingface_hub import login login() ``` Authentication is optional when downloading files from public repos on the Hub. ### Download files from the Hub Integrations allow users to download a model from the Hub and instantiate it directly from your library. This is often made possible by providing a method (usually called `from_pretrained` or `load_from_hf`) that is specific to your library. To instantiate a model from the Hub, your library has to: - download files from the Hub. This is what we will discuss now. - instantiate the Python model from these files. Use the [`hf_hub_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.hf_hub_download) method to download files from a repository on the Hub. Downloaded files are stored in the cache: `~/.cache/huggingface/hub`. Users won't have to re-download the file the next time they use it, which saves a lot of time for large files. Furthermore, if the repository is updated with a new version of the file, `huggingface_hub` will automatically download the latest version and store it in the cache. Users don't have to worry about updating their files manually. For example, download the `config.json` file from the [lysandre/arxiv-nlp](https://huggingface.co/lysandre/arxiv-nlp) repository: ```python >>> from huggingface_hub import hf_hub_download >>> config_path = hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json") >>> config_path '/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json' ``` `config_path` now contains a path to the downloaded file. You are guaranteed that the file exists and is up-to-date.
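Putting this together, a `from_pretrained`-style entry point typically resolves a local path first and only reaches out to the Hub otherwise. A minimal sketch (the function and parameter names are illustrative, not part of any official API):

```python
import json
import os

def load_config(model_id_or_path: str, filename: str = "config.json") -> dict:
    """Resolve and parse a config file from a local directory or a Hub repo id."""
    local_candidate = os.path.join(model_id_or_path, filename)
    if os.path.isfile(local_candidate):
        # Local directory: read the file directly, no network call.
        config_path = local_candidate
    else:
        # Hub repo id: download the file, or reuse the cached copy.
        from huggingface_hub import hf_hub_download
        config_path = hf_hub_download(repo_id=model_id_or_path, filename=filename)
    with open(config_path) as f:
        return json.load(f)
```

Your library would then build the model object from the parsed config and the downloaded weight files.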
If your library needs to download an entire repository, use [`snapshot_download`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download). It will take care of downloading all the files in parallel. The return value is a path to the directory containing the downloaded files. ```py >>> from huggingface_hub import snapshot_download >>> snapshot_download(repo_id="lysandre/arxiv-nlp") '/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade' ``` Many options exist to download files from a specific revision, to filter which files to download, to provide a custom cache directory, to download to a local directory, etc. Check out the [download guide](https://huggingface.co/docs/huggingface_hub/en/guides/download) for more details. ### Upload files to the Hub You might also want to provide a method so that users can push their own models to the Hub. This allows the community to build an ecosystem of models compatible with your library. The `huggingface_hub` library offers methods to create repositories and upload files: - `create_repo` creates a repository on the Hub. - `upload_file` and `upload_folder` upload files to a repository on the Hub. The `create_repo` method creates a repository on the Hub. Use the `repo_id` parameter to provide a name for your repository: ```python >>> from huggingface_hub import create_repo >>> create_repo(repo_id="test-model") 'https://huggingface.co/lysandre/test-model' ``` When you check your Hugging Face account, you should now see a `test-model` repository under your namespace. The [`upload_file`](https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api#huggingface_hub.HfApi.upload_file) method uploads a file to the Hub. This method requires the following: - A path to the file to upload. - The final path in the repository. - The repository you wish to push the files to.
For example: ```python >>> from huggingface_hub import upload_file >>> upload_file( ... path_or_fileobj="/home/lysandre/dummy-test/README.md", ... path_in_repo="README.md", ... repo_id="lysandre/test-model" ... ) 'https://huggingface.co/lysandre/test-model/blob/main/README.md' ``` If you check your Hugging Face account, you should see the file inside your repository. Usually, a library will serialize the model to a local directory and then upload the entire folder to the Hub at once. This can be done using [`upload_folder`](https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api#huggingface_hub.HfApi.upload_folder): ```py >>> from huggingface_hub import upload_folder >>> upload_folder( ... folder_path="/home/lysandre/dummy-test", ... repo_id="lysandre/test-model", ... ) ``` For more details about how to upload files, check out the [upload guide](https://huggingface.co/docs/huggingface_hub/en/guides/upload). ## Model cards Model cards are files that accompany the models and provide handy information. Under the hood, model cards are simple Markdown files with additional metadata. Model cards are essential for discoverability, reproducibility, and sharing! You can find a model card as the README.md file in any model repo. See the [model cards guide](./model-cards) for more details about how to create a good model card. If your library allows pushing a model to the Hub, it is recommended to generate a minimal model card with prefilled metadata (typically `library_name`, `pipeline_tag` or `tags`) and information on how the model has been trained. This helps provide a standardized description for all models built with your library. ## Register your library Well done! You should now have a library able to load a model from the Hub and push new models to it. The next step is to make sure that your models on the Hub are well-documented and integrated with the platform.
To do so, libraries can be registered on the Hub, which comes with a few benefits for the users: - a pretty label can be shown on the model page (e.g. `KerasNLP` instead of `keras-nlp`) - a link to your library repository and documentation is added to each model page - a custom download count rule can be defined - code snippets can be generated to show how to load the model using your library To register a new library, please open a Pull Request [here](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts) following the instructions below: - The library id should be lowercased and hyphen-separated (example: `"adapter-transformers"`). Make sure to preserve alphabetical order when opening the PR. - Set `repoName` and `prettyLabel` with user-friendly casing (example: `DeepForest`). - Set `repoUrl` with a link to the library source code (usually a GitHub repository). - (optional) Set `docsUrl` with a link to the docs of the library. If the documentation is in the GitHub repo referenced above, there is no need to set it twice. - Set `filter` to `false`. - (optional) Define how downloads must be counted by setting `countDownloads`. Downloads can be tracked by file extensions or filenames. Make sure not to double-count. For instance, if loading a model requires 3 files, the download count rule must count downloads on only 1 of the 3 files. Otherwise, the download count will be overestimated. **Note:** if the library uses one of the default config files (`config.json`, `config.yaml`, `hyperparams.yaml`, `params.json`, and `meta.yaml`, see [here](https://huggingface.co/docs/hub/models-download-stats#which-are-the-query-files-for-different-libraries)), there is no need to manually define a download count rule. - (optional) Define `snippets` to let the user know how they can quickly instantiate a model. More details below.
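As an illustration of an extension-based download-count rule (like the one for the `nemo` library, which counts files with the `.nemo` extension), here is a rough Python equivalent of the matching logic; the real rules are TypeScript objects in `model-libraries.ts`:

```python
import re

# Illustrative only: a rule that counts any ".nemo" file, anywhere in the
# repo, as a download. A request for a matching path counts exactly once.
EXTENSION_RULE = re.compile(r".*\.nemo$")

def counts_as_download(path_in_repo: str) -> bool:
    """Return True if a request for this file should increment the counter."""
    return bool(EXTENSION_RULE.match(path_in_repo))
```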
Before opening the PR, make sure that at least one model is referenced on https://huggingface.co/models?other=my-library-name. If not, the model card metadata of the relevant models must be updated with `library_name: my-library-name` (see [example](https://huggingface.co/google/gemma-scope/blob/main/README.md?code=true#L3)). If you are not the owner of the models on the Hub, please open PRs (see [example](https://huggingface.co/MCG-NJU/VFIMamba/discussions/1)). Here is a minimal [example](https://github.com/huggingface/huggingface.js/pull/885/files) adding integration for VFIMamba. ### Code snippets We recommend adding a code snippet to explain how to use a model in your downstream library. To add a code snippet, you should update the [model-libraries-snippets.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts) file with instructions for your model. For example, the [Asteroid](https://huggingface.co/asteroid-team) integration includes a brief code snippet for how to load and use an Asteroid model: ```typescript const asteroid = (model: ModelData) => `from asteroid.models import BaseModel model = BaseModel.from_pretrained("${model.id}")`; ``` Doing so will also add a tag to your model so users can quickly identify models from your library. Once your snippet has been added to [model-libraries-snippets.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries-snippets.ts), you can reference it in [model-libraries.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts) as described above. ## Document your library Finally, you can add your library to the Hub's documentation. Check for example the [Setfit PR](https://github.com/huggingface/hub-docs/pull/1150) that added [SetFit](./setfit) to the documentation. 
### The HF PRO subscription 🔥 https://huggingface.co/docs/hub/pro.md # The HF PRO subscription 🔥 The PRO subscription unlocks essential features for serious users, including: - Higher [storage capacity](./storage-limits) for public and private repositories - Higher bandwidth and API [rate limits](./rate-limits) - Included credits for [Inference Providers](/docs/inference-providers/) - Higher tier for [ZeroGPU Spaces](./spaces-zerogpu) usage, and pay-as-you-go quota extension - Ability to create ZeroGPU Spaces and use [Dev Mode](./spaces-dev-mode) - Ability to publish Social Posts and Community Blogs - Leverage the [Data Studio](./data-studio) on private datasets - Run and schedule serverless [CPU/GPU Jobs](./jobs) View the full list of benefits at **https://huggingface.co/pro**, then subscribe at https://huggingface.co/subscribe/pro ### Using BERTopic at Hugging Face https://huggingface.co/docs/hub/bertopic.md # Using BERTopic at Hugging Face [BERTopic](https://github.com/MaartenGr/BERTopic) is a topic modeling framework that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. BERTopic supports all kinds of topic modeling techniques: guided, supervised, semi-supervised, manual, multi-topic distributions, hierarchical, class-based, dynamic, online/incremental, multimodal, multi-aspect, text generation/LLM, zero-shot (new!), merge models (new!), and seed words (new!). ## Exploring BERTopic on the Hub You can find BERTopic models by filtering at the left of the [models page](https://huggingface.co/models?library=bertopic&sort=trending). BERTopic models hosted on the Hub have a model card with useful information about the models. Thanks to the BERTopic Hugging Face Hub integration, you can load BERTopic models with a few lines of code. You can also deploy these models using [Inference Endpoints](https://huggingface.co/inference-endpoints).
## Installation To get started, you can follow the [BERTopic installation guide](https://github.com/MaartenGr/BERTopic#installation). You can also use the following one-line install through pip: ```bash pip install bertopic ``` ## Using Existing Models All BERTopic models can easily be loaded from the Hub: ```py from bertopic import BERTopic topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia") ``` Once loaded, you can use BERTopic's features to predict the topics for new instances. Note that `transform` returns a list of topic ids and an array of probabilities, so we index the first prediction: ```py topics, probs = topic_model.transform("This is an incredible movie!") topic_model.topic_labels_[topics[0]] ``` Which gives us the following topic: ```text 64_rating_rated_cinematography_film ``` ## Sharing Models When you have created a BERTopic model, you can easily share it with others through the Hugging Face Hub. To do so, you can make use of the `push_to_hf_hub` function that allows you to directly push the model to the Hugging Face Hub: ```python from bertopic import BERTopic # Train model topic_model = BERTopic().fit(my_docs) # Push to HuggingFace Hub topic_model.push_to_hf_hub( repo_id="MaartenGr/BERTopic_ArXiv", save_ctfidf=True ) ``` Note that the saved model does not include the dimensionality reduction and clustering algorithms. Those are removed since they are only necessary to train the model and find relevant topics. Inference is done through a straightforward cosine similarity between the topic and document embeddings. This not only speeds up the model but allows us to have a tiny BERTopic model that we can work with. ## Additional Resources * [BERTopic repository](https://github.com/MaartenGr/BERTopic) * [BERTopic docs](https://maartengr.github.io/BERTopic/) * [BERTopic models in the Hub](https://huggingface.co/models?library=bertopic&sort=trending) ### Models Download Stats https://huggingface.co/docs/hub/models-download-stats.md # Models Download Stats ## How are downloads counted for models?
Counting the number of downloads for models is not a trivial task, as a single model repository might contain multiple files, including multiple model weight files (e.g., with sharded models) and different formats depending on the library (GGUF, PyTorch, TensorFlow, etc.). To avoid double counting downloads (e.g., counting a single download of a model as multiple downloads), the Hub uses a set of query files that are employed for download counting. No information is sent from the user, and no additional calls are made for this. The count is done server-side as the Hub serves files for downloads. Every HTTP request to these files, including `GET` and `HEAD`, will be counted as a download. By default, when no library is specified, the Hub uses `config.json` as the default query file. Otherwise, the query file depends on each library, and the Hub might examine files such as `pytorch_model.bin` or `adapter_config.json`. ## Which are the query files for different libraries? By default, the Hub looks at `config.json`, `config.yaml`, `hyperparams.yaml`, `params.json`, and `meta.yaml`. Some libraries override these defaults by specifying their own filter (specifying `countDownloads`). The code that defines these overrides is [open-source](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts). For example, for the `nemo` library, all files with `.nemo` extension are used to count downloads. ## Can I add my query files for my library? Yes, you can open a Pull Request [here](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts). Here is a minimal [example](https://github.com/huggingface/huggingface.js/pull/885/files) adding download metrics for VFIMamba. Check out the [integration guide](./models-adding-libraries#register-your-library) for more details. ## How are `GGUF` files handled? GGUF files are self-contained and are not tied to a single library, so all of them are counted for downloads. 
This can double-count downloads when a user clones a whole repository, but most users and interfaces download a single GGUF file for a given repo. ## How is `diffusers` handled? The `diffusers` library is an edge case and has its filter configured in the internal codebase. The filter ensures repos tagged as `diffusers` count both files loaded via the library as well as through UIs that require users to manually download the top-level safetensors. ``` filter: [ { bool: { /// Include documents that match at least one of the following rules should: [ /// Downloaded from diffusers lib { term: { path: "model_index.json" }, }, /// Direct downloads (LoRA, Auto1111 and others) /// Filter out nested safetensors and pickle weights to avoid double counting downloads from the diffusers lib { regexp: { path: "[^/]*\\.safetensors" }, }, { regexp: { path: "[^/]*\\.ckpt" }, }, { regexp: { path: "[^/]*\\.bin" }, }, ], minimum_should_match: 1, }, }, ] ``` ### Docker Spaces Examples https://huggingface.co/docs/hub/spaces-sdks-docker-examples.md # Docker Spaces Examples We gathered some example demos in the [Spaces Examples](https://huggingface.co/SpacesExamples) organization. Please check them out!
* Dummy FastAPI app: https://huggingface.co/spaces/DockerTemplates/fastapi_dummy * FastAPI app serving a static site and using `transformers`: https://huggingface.co/spaces/DockerTemplates/fastapi_t5 * Phoenix app for https://huggingface.co/spaces/DockerTemplates/single_file_phx_bumblebee_ml * HTTP endpoint in Go with query parameters: https://huggingface.co/spaces/XciD/test-docker-go?q=Adrien * Shiny app written in Python: https://huggingface.co/spaces/elonmuskceo/shiny-orbit-simulation * Genie.jl app in Julia: https://huggingface.co/spaces/nooji/GenieOnHuggingFaceSpaces * Argilla app for data labelling and curation: https://huggingface.co/spaces/argilla/live-demo and [write-up about hosting Argilla on Spaces](./spaces-sdks-docker-argilla) by [@dvilasuero](https://huggingface.co/dvilasuero) 🎉 * JupyterLab and VSCode: https://huggingface.co/spaces/DockerTemplates/docker-examples by [@camenduru](https://twitter.com/camenduru) and [@nateraw](https://hf.co/nateraw). * Zeno app for interactive model evaluation: https://huggingface.co/spaces/zeno-ml/diffusiondb and [instructions for setup](https://zenoml.com/docs/deployment#hugging-face-spaces) * Gradio App: https://huggingface.co/spaces/sayakpaul/demo-docker-gradio ### Advanced Compute Options https://huggingface.co/docs/hub/advanced-compute-options.md # Advanced Compute Options > [!WARNING] > This feature is part of the Team & Enterprise plans. Team & Enterprise organizations gain access to advanced compute options to accelerate their machine learning journey. ## Host ZeroGPU Spaces in your organization ZeroGPU is a dynamic GPU allocation system that optimizes AI deployment on Hugging Face Spaces. By automatically allocating and releasing NVIDIA H200 GPU slices (70GB VRAM) as needed, organizations can efficiently serve their AI applications without dedicated GPU instances.
**Key benefits for organizations** - **Free GPU Access**: Access powerful NVIDIA H200 GPUs at no additional cost through dynamic allocation - **Enhanced Resource Management**: Host up to 50 ZeroGPU Spaces for efficient team-wide AI deployment - **Simplified Deployment**: Easy integration with PyTorch-based models, Gradio apps, and other Hugging Face libraries - **Enterprise-Grade Infrastructure**: Access to high-performance NVIDIA H200 GPUs with 70GB VRAM per workload [Learn more about ZeroGPU →](https://huggingface.co/docs/hub/spaces-zerogpu) ### Managed SSO https://huggingface.co/docs/hub/enterprise-advanced-sso.md # Managed SSO > [!WARNING] > This feature is part of the Enterprise Plus plan. Managed SSO **replaces the Hugging Face login entirely**. Your Identity Provider becomes the sole authentication method for your organization's members across the entire Hugging Face platform. The organization controls the full user lifecycle, from account creation to deactivation. For a comparison with Basic SSO, see the [SSO overview](./enterprise-sso). ## How it works > [!NOTE] > **Managed SSO replaces the Hugging Face login.** Your IdP is the only way for managed users to authenticate on Hugging Face; there is no separate Hugging Face login. Unlike Basic SSO, members do not need a pre-existing Hugging Face account. When a user authenticates through your IdP for the first time, an account is automatically created for them. Your IdP is the mandatory authentication route for all your organization's members interacting with any part of the Hugging Face platform. Members are required to authenticate via your IdP for all Hugging Face services, not just when accessing private or organizational repositories. When a user is deactivated in your IdP, their Hugging Face account is deactivated as well. This gives your organization complete control over identity, access, and data governance. ## Getting started Managed SSO cannot be self-configured.
To enable Managed SSO for your organization, please contact the Hugging Face team. The setup is done in collaboration with our technical team to ensure a smooth transition for your organization. Both SAML 2.0 and OIDC protocols are supported and can be integrated with popular identity providers such as Okta, Microsoft Entra ID (Azure AD), and Google Workspace. ## User provisioning Managed SSO introduces automated user provisioning through [SCIM](./enterprise-scim), which manages the entire user lifecycle on Hugging Face. SCIM allows your IdP to communicate user identity information to Hugging Face, enabling automatic creation, updates (e.g., name changes, role changes), and deactivation of user accounts as changes occur in your IdP. Learn more about how to set up and manage SCIM in our [dedicated guide](./enterprise-scim). ## SSO features Managed SSO supports [role mapping, resource group mapping, session timeout, and external collaborators](./security-sso-user-management). These features are configurable from your organization's settings. ## Restrictions on managed accounts > [!WARNING] > Important considerations for managed accounts. To ensure organizational control and data governance, managed user accounts have specific restrictions: * **No personal content creation**: Managed users cannot create any content (models, datasets, or Spaces) in their personal user namespace. All content must be created within the organization. * **Organization-bound collaboration**: Managed users are restricted to collaborating solely within their managing organization. They cannot join other organizations or contribute to repositories outside of their managing organization. * **Content visibility**: Content created by managed users resides within the organization. While the managed users cannot create public content in their personal profile, they can **create public content within the organization** if the organization's settings permit it. 
These restrictions maintain your enterprise's security boundaries. For personal projects or broader collaboration outside your organization, members should use a separate, unmanaged Hugging Face account. ### Webhooks https://huggingface.co/docs/hub/webhooks.md # Webhooks Webhooks are a foundation for MLOps-related features. They allow you to listen for new changes on specific repos or to all repos belonging to a particular set of users/organizations (not just your repos, but any repo). You can use them to auto-convert models, build community bots, or build CI/CD for your models, datasets, and Spaces (and much more!). Webhooks can also [trigger Jobs](./jobs-webhooks) to automate compute tasks in response to repo events. The documentation for Webhooks is below, or you can browse our **guides** showcasing a few possible use cases of Webhooks: - [Fine-tune a new model whenever a dataset gets updated (Python)](./webhooks-guide-auto-retrain) - [Create a discussion bot on the Hub, using an LLM API (NodeJS)](./webhooks-guide-discussion-bot) - [Create metadata quality reports (Python)](./webhooks-guide-metadata-review) - and more to come… ## Create your Webhook You can create new Webhooks and edit existing ones in your Webhooks [settings](https://huggingface.co/settings/webhooks): ![Settings of an individual webhook](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-settings.png) Webhooks can watch for repos updates, Pull Requests, discussions, and new comments. It's even possible to create a Space to react to your Webhooks! ## Webhook Payloads After registering a Webhook, you will be notified of new events via an `HTTP POST` call on the specified target URL. The payload is encoded in JSON.
You can view the history of payloads sent in the activity tab of the webhook settings page; it's also possible to replay past webhooks for easier debugging: ![image.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-activity.png) As an example, here is the full payload when a Pull Request is opened: ```json { "event": { "action": "create", "scope": "discussion" }, "repo": { "type": "model", "name": "openai-community/gpt2", "id": "621ffdc036468d709f17434d", "private": false, "url": { "web": "https://huggingface.co/openai-community/gpt2", "api": "https://huggingface.co/api/models/openai-community/gpt2" }, "owner": { "id": "628b753283ef59b5be89e937" } }, "discussion": { "id": "6399f58518721fdd27fc9ca9", "title": "Update co2 emissions", "url": { "web": "https://huggingface.co/openai-community/gpt2/discussions/19", "api": "https://huggingface.co/api/models/openai-community/gpt2/discussions/19" }, "status": "open", "author": { "id": "61d2f90c3c2083e1c08af22d" }, "num": 19, "isPullRequest": true, "changes": { "base": "refs/heads/main" } }, "comment": { "id": "6399f58518721fdd27fc9caa", "author": { "id": "61d2f90c3c2083e1c08af22d" }, "content": "Add co2 emissions information to the model card", "hidden": false, // Note: when `hidden` is `true`, `content` will be undefined "url": { "web": "https://huggingface.co/openai-community/gpt2/discussions/19#6399f58518721fdd27fc9caa" } }, "webhook": { "id": "6390e855e30d9209411de93b", "version": 3 } } ``` ### Event The top-level property `event` is always specified and is used to determine the nature of the event. It has two sub-properties: `event.action` and `event.scope`. `event.scope` will be one of the following values: - `"repo"` - Global events on repos. Possible values for the associated `action`: `"create"`, `"delete"`, `"update"`, `"move"`. - `"repo.content"` - Events on the repo's content, such as new commits or tags.
It triggers on new Pull Requests as well due to the newly created reference/commit. The associated `action` is always `"update"`. - `"repo.config"` - Events on the config: update Space secrets, update settings, update DOIs, disabled or not, etc. The associated `action` is always `"update"`. - `"discussion"` - Creating a discussion or Pull Request, updating the title or status, and merging. Possible values for the associated `action`: `"create"`, `"delete"`, `"update"`. - `"discussion.comment"` - Creating, updating, and hiding a comment. Possible values for the associated `action`: `"create"`, `"update"`. More scopes can be added in the future. To handle unknown events, your webhook handler can consider any action on a narrowed scope to be an `"update"` action on the broader scope. For example, if the `"repo.config.dois"` scope is added in the future, any event with that scope can be considered by your webhook handler as an `"update"` action on the `"repo.config"` scope. ### Repo In the current version of webhooks, the top-level property `repo` is always specified, as events can always be associated with a repo. For example, consider the following value: ```json "repo": { "type": "model", "name": "some-user/some-repo", "id": "6366c000a2abcdf2fd69a080", "private": false, "url": { "web": "https://huggingface.co/some-user/some-repo", "api": "https://huggingface.co/api/models/some-user/some-repo" }, "headSha": "c379e821c9c95d613899e8c4343e4bfee2b0c600", "owner": { "id": "61d2000c3c2083e1c08af22d" } } ``` `repo.headSha` is the sha of the latest commit on the repo's `main` branch. It is only sent when `event.scope` starts with `"repo"`, not on community events like discussions and comments. ### Code changes On code changes, the top-level property `updatedRefs` is specified on repo events. It is an array of references that have been updated. 
Here is an example value: ```json "updatedRefs": [ { "ref": "refs/heads/main", "oldSha": "ce9a4674fa833a68d5a73ec355f0ea95eedd60b7", "newSha": "575db8b7a51b6f85eb06eee540738584589f131c" }, { "ref": "refs/tags/test", "oldSha": null, "newSha": "575db8b7a51b6f85eb06eee540738584589f131c" } ] ``` Newly created references will have `oldSha` set to `null`. Deleted references will have `newSha` set to `null`. You can react to new commits on specific pull requests, new tags, or new branches. ### Config changes When the top-level property `event.scope` is `"repo.config"`, the `updatedConfig` property is specified. It is an object containing the updated config. Here is an example value: ```json "updatedConfig": { "private": false } ``` When the updated config key is not supported by the webhook, the object will be empty: ```json "updatedConfig": {} ``` For now, only `private` is supported. If you would benefit from more config keys being present here, please let us know at website@huggingface.co. ### Discussions and Pull Requests The top-level property `discussion` is specified on community events (discussions and Pull Requests). The `discussion.isPullRequest` property is a boolean indicating if the discussion is also a Pull Request (on the Hub, a PR is a special type of discussion). Here is an example value: ```json "discussion": { "id": "639885d811ae2bad2b7ba461", "title": "Hello!", "url": { "web": "https://huggingface.co/some-user/some-repo/discussions/3", "api": "https://huggingface.co/api/models/some-user/some-repo/discussions/3" }, "status": "open", "author": { "id": "61d2000c3c2083e1c08af22d" }, "isPullRequest": true, "changes": { "base": "refs/heads/main" }, "num": 3 } ``` ### Comment The top-level property `comment` is specified when a comment is created (including on discussion creation) or updated.
Here is an example value: ```json "comment": { "id": "6398872887bfcfb93a306f18", "author": { "id": "61d2000c3c2083e1c08af22d" }, "content": "This adds an env key", "hidden": false, "url": { "web": "https://huggingface.co/some-user/some-repo/discussions/4#6398872887bfcfb93a306f18" } } ``` ## Webhook secret Setting a Webhook secret is useful to make sure payloads sent to your Webhook handler URL are actually from Hugging Face. If you set a secret for your Webhook, it will be sent along as an `X-Webhook-Secret` HTTP header on every request. Only ASCII characters are supported. > [!TIP] > It's also possible to add the secret directly in the handler URL. For example, setting it as a query parameter: https://example.com/webhook?secret=XXX. > > This can be helpful if accessing the HTTP headers of the request is complicated for your Webhook handler. ## Rate limiting Each Webhook is limited to 1,000 triggers per 24 hours. You can view your usage in the Webhook settings page in the "Activity" tab. If you need to increase the number of triggers for your Webhook, upgrade to PRO, Team or Enterprise and contact us at website@huggingface.co. ## Developing your Webhooks If you do not have an HTTPS endpoint/URL, you can try out public tools for webhook testing. These tools act as a catch-all for requests sent to them and respond with a 200 OK status code. [Beeceptor](https://beeceptor.com/) is one tool you can use to create a temporary HTTP endpoint and review the incoming payload. Another such tool is [Webhook.site](https://webhook.site/). Additionally, you can route a real Webhook payload to the code running locally on your machine during development. This is a great way to test and debug integrations faster. You can do this by exposing your localhost port to the Internet. To go down this path, you can use [ngrok](https://ngrok.com/) or [localtunnel](https://theboroer.github.io/localtunnel-www/).
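The pieces above can be combined into a small handler: check the `X-Webhook-Secret` header, then dispatch on `event.scope` and `event.action`. A framework-agnostic sketch (the secret value and the `(status, message)` return convention are placeholders):

```python
import hmac

WEBHOOK_SECRET = "replace-with-your-secret"  # placeholder value

def handle_webhook(headers: dict, payload: dict) -> tuple:
    """Validate the secret header, then react to the event."""
    received = headers.get("X-Webhook-Secret", "")
    # Constant-time comparison avoids leaking the secret through timing.
    if not hmac.compare_digest(received, WEBHOOK_SECRET):
        return (401, "invalid secret")
    scope = payload["event"]["scope"]
    action = payload["event"]["action"]
    if scope == "discussion" and action == "create" and payload["discussion"]["isPullRequest"]:
        return (200, "new PR on " + payload["repo"]["name"])
    # Unknown scopes can be treated as an update on their broader scope.
    return (200, "ignored " + scope + "/" + action)
```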
## Debugging Webhooks You can easily find recently generated events for your webhooks. Open the activity tab for your webhook. There you will see the list of recent events. ![image.png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhook-payload.png) Here you can review the HTTP status code and the payload of the generated events. Additionally, you can replay these events by clicking on the `Replay` button! Note: When changing the target URL or secret of a Webhook, replaying an event will send the payload to the updated URL. ## FAQ ##### Can I define webhooks on my organization vs my user account? No, this is not currently supported. ##### How can I subscribe to all events on HF (or across a whole repo type, like on all models)? This is not currently exposed to end users but we can toggle this for you if you send an email to website@huggingface.co. ### marimo on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-marimo.md # marimo on Spaces [marimo](https://github.com/marimo-team/marimo) is a reactive notebook for Python that models notebooks as dataflow graphs. When you run a cell or interact with a UI element, marimo automatically runs affected cells (or marks them as stale), keeping code and outputs consistent and preventing bugs before they happen. Every marimo notebook is stored as pure Python, executable as a script, and deployable as an app. 
Key features:

- ⚡️ **reactive:** run a cell, and marimo reactively runs all dependent cells or marks them as stale
- 🖐️ **interactive:** bind sliders, tables, plots, and more to Python — no callbacks required
- 🔬 **reproducible:** no hidden state, deterministic execution, built-in package management
- 🏃 **executable:** execute as a Python script, parametrized by CLI args
- 🛜 **shareable:** deploy as an interactive web app or slides, run in the browser via WASM
- 🛢️ **designed for data:** query dataframes and databases with SQL, filter and search dataframes

## Deploying marimo apps on Spaces

To get started with marimo on Spaces, click the button below:

This will start building your Space using marimo's Docker template. If successful, you should see an application similar to the [marimo introduction notebook](https://huggingface.co/spaces/marimo-team/marimo-app-template).

## Customizing your marimo app

When you create a marimo Space, you'll get a few key files to help you get started:

### 1. app.py

This is your main marimo notebook file that defines your app's logic. marimo notebooks are pure Python files that use the `@app.cell` decorator to define cells. To learn more about building notebooks and apps, see [the marimo documentation](https://docs.marimo.io). As your app grows, you can organize your code into modules and import them into your main notebook.

### 2. Dockerfile

The Dockerfile for a marimo app is minimal since marimo has few system dependencies. The key requirements are:

- It installs the dependencies listed in `requirements.txt` (using `uv`)
- It creates a non-root user for security
- It runs the app using `marimo run app.py`

You may need to modify this file if your application requires additional system dependencies, permissions, or other CLI flags.

### 3. requirements.txt

The Space will automatically install dependencies listed in the `requirements.txt` file. At minimum, you must include `marimo` in this file.
You will want to add any other required packages your app needs. The marimo Space template provides a basic setup that you can extend based on your needs. When deployed, your notebook will run in "app mode", which hides the code cells and only shows the interactive outputs - perfect for sharing with end users. You can opt to include the code cells in your app by adding `--include-code` to the `marimo run` command in the Dockerfile.

## Additional Resources and Support

- [marimo documentation](https://docs.marimo.io)
- [marimo GitHub repository](https://github.com/marimo-team/marimo)
- [marimo Discord](https://marimo.io/discord)
- [marimo template Space](https://huggingface.co/spaces/marimo-team/marimo-app-template)

## Troubleshooting

If you encounter issues:

1. Make sure your notebook runs locally in app mode using `marimo run app.py`
2. Check that all required packages are listed in `requirements.txt`
3. Verify the port configuration matches (7860 is the default for Spaces)
4. Check Space logs for any Python errors

For more help, visit the [marimo Discord](https://marimo.io/discord) or [open an issue](https://github.com/marimo-team/marimo/issues).

### How to configure SAML SSO with Okta

https://huggingface.co/docs/hub/security-sso-okta-saml.md

# How to configure SAML SSO with Okta

In this guide, we will use Okta as the SSO provider, with the Security Assertion Markup Language (SAML) protocol as our preferred identity protocol. We currently support SP-initiated and IdP-initiated authentication. For user provisioning, see [SCIM](./enterprise-scim).

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

## Step 1: Create a new application in your Identity Provider

Open a new tab/window in your browser and sign in to your Okta account. Navigate to "Admin/Applications" and click the "Create App Integration" button. Then choose a "SAML 2.0" application and click "Create".
## Step 2: Configure your application on Okta Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the SAML protocol. Copy the "Assertion Consumer Service URL" from the organization's settings on Hugging Face, and paste it in the "Single sign-on URL" field on Okta. The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/saml/consume`. On Okta, set the following settings: - Set Audience URI (SP Entity Id) to match the "SP Entity ID" value on Hugging Face. - Set Name ID format to EmailAddress. - Under "Show Advanced Settings", verify that Response and Assertion Signature are set to: Signed. Save your new application. ## Step 3: Finalize configuration on Hugging Face In your Okta application, under "Sign On/Settings/More details", find the following fields: - Sign-on URL - Public certificate - SP Entity ID You will need them to finalize the SSO setup on Hugging Face. In the SSO section of your organization's settings, copy-paste these values from Okta: - Sign-on URL - SP Entity ID - Public certificate The public certificate must have the following format: ``` -----BEGIN CERTIFICATE----- {certificate} -----END CERTIFICATE----- ``` You can now click on "Update and Test SAML configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt. Once logged in, you'll be redirected to your organization's settings page. A green check mark near the SAML selector will attest that the test was successful. ## Step 4: Enable SSO in your organization Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button. Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section. 
### DuckDB

https://huggingface.co/docs/hub/datasets-duckdb.md

# DuckDB

[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system. You can use the Hugging Face paths (`hf://`) to access data on the Hub:

The [DuckDB CLI](https://duckdb.org/docs/api/cli/overview.html) (Command Line Interface) is a single, dependency-free executable. There are also other APIs available for running DuckDB, including Python, C++, Go, Java, Rust, and more. For additional details, visit their [clients](https://duckdb.org/docs/api/overview.html) page.

> [!TIP]
> For installation details, visit the [installation page](https://duckdb.org/docs/installation).

Starting from version `v0.10.3`, the DuckDB CLI includes native support for accessing datasets on the Hugging Face Hub via URLs with the `hf://` scheme. Here are some features you can leverage with this powerful tool:

- Query public datasets and your own gated and private datasets
- Analyze datasets and perform SQL operations
- Combine datasets and export them to different formats
- Conduct vector similarity search on embedding datasets
- Implement full-text search on datasets

For a complete list of DuckDB features, visit the DuckDB [documentation](https://duckdb.org/docs/).

To start the CLI, execute the following command in the installation folder:

```bash
./duckdb
```

## Forging the Hugging Face URL

To access Hugging Face datasets, use the following URL format:

```plaintext
hf://datasets/{my-username}/{my-dataset}/{path_to_file}
```

- **my-username**, the user or organization of the dataset, e.g. `ibm`
- **my-dataset**, the dataset name, e.g. `duorc`
- **path_to_file**, the file path, which supports glob patterns, e.g. `**/*.parquet` to query all Parquet files

> [!TIP]
> You can query auto-converted Parquet files using the @~parquet branch, which corresponds to the `refs/convert/parquet` revision.
> For more details, refer to the documentation at https://huggingface.co/docs/datasets-server/en/parquet#conversion-to-parquet.
>
> To reference the `refs/convert/parquet` revision of a dataset, use the following syntax:
>
> ```plaintext
> hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}
> ```
>
> Here is a sample URL following the above syntax:
>
> ```plaintext
> hf://datasets/ibm/duorc@~parquet/ParaphraseRC/test/0000.parquet
> ```

Let's start with a quick demo to query all the rows of a dataset:

```sql
FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
```

Or using traditional SQL syntax:

```sql
SELECT * FROM 'hf://datasets/ibm/duorc/ParaphraseRC/*.parquet' LIMIT 3;
```

In the following sections, we will cover more complex operations you can perform with DuckDB on Hugging Face datasets.

> [!TIP]
> **Querying Storage Buckets**: When using the DuckDB Python client, you can query data stored in [Storage Buckets](./storage-buckets) by registering the Hugging Face filesystem:
>
> ```python
> import duckdb
> from huggingface_hub import HfFileSystem
> duckdb.register_filesystem(HfFileSystem())
> duckdb.sql("SELECT * FROM 'hf://buckets/username/my-bucket/data.parquet' LIMIT 10")
> ```

Native `hf://buckets/` support in DuckDB is expected in a future release.

### Using Stanza at Hugging Face

https://huggingface.co/docs/hub/stanza.md

# Using Stanza at Hugging Face

`stanza` is a collection of accurate and efficient tools for the linguistic analysis of many human languages. Starting from raw text to syntactic analysis and entity recognition, Stanza brings state-of-the-art NLP models to languages of your choosing.

## Exploring Stanza in the Hub

You can find `stanza` models by filtering at the left of the [models page](https://huggingface.co/models?library=stanza&sort=downloads). You can find over 70 models for different languages! All models on the Hub come with the following features: 1.
An automatically generated model card with a brief description and metadata tags that help with discoverability. 2. An interactive widget you can use to play with the model directly in the browser (for named entity recognition and part of speech). 3. An Inference Providers widget that allows you to make inference requests (for named entity recognition and part of speech).

## Using existing models

The `stanza` library automatically downloads models from the Hub. You can use `stanza.Pipeline` to download the model from the Hub and do inference.

```python
import stanza

nlp = stanza.Pipeline('en')  # download the English model and initialize an English neural pipeline
doc = nlp("Barack Obama was born in Hawaii.")  # run annotation over a sentence
```

## Sharing your models

To add new official Stanza models, you can follow the process to [add a new language](https://stanfordnlp.github.io/stanza/new_language.html) and then [share your models with the Stanza team](https://stanfordnlp.github.io/stanza/new_language.html#contributing-back-to-stanza). You can also find the official script to upload models to the Hub [here](https://github.com/stanfordnlp/huggingface-models/blob/main/hugging_stanza.py).

## Additional resources

* `stanza` [docs](https://stanfordnlp.github.io/stanza/).

### Notifications

https://huggingface.co/docs/hub/notifications.md

# Notifications

Notifications allow you to know when new activities (**Pull Requests or discussions**) happen on models, datasets, and Spaces belonging to users or organizations you are watching. By default, you'll receive a notification if:

- Someone mentions you in a discussion/PR.
- A new comment is posted in a discussion/PR you participated in.
- A new discussion/PR or comment is posted in one of the repositories of an organization or user you are watching.
- Someone replies to one of your posts, blog articles, or paper pages.
![Notifications page](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-page.png)

You'll get new notifications by email and [directly on the website](https://huggingface.co/notifications); you can change this in your [notifications settings](#notifications-settings).

## Filtering and managing notifications

On the [notifications page](https://huggingface.co/notifications), you have several options for filtering and managing your notifications more effectively:

- Filter by Repository: Choose to display notifications from a specific repository only.
- Filter by Read Status: Display only unread notifications or all notifications.
- Filter by Participation: Show notifications you have participated in or those in which you have been directly mentioned.

Additionally, you can take the following actions to manage your notifications:

- Mark as Read/Unread: Change the status of notifications to mark them as read or unread.
- Mark as Done: Once marked as done, notifications will no longer appear in the notification center (they are deleted).

By default, changes made to notifications will only apply to the selected notifications on the screen. However, you can also apply changes to all matching notifications (like in Gmail, for instance) for greater convenience.

## Watching users and organizations

By default, you'll be watching all the organizations you are a member of and will be notified of any new activity on those. You can also choose to get notified about arbitrary users or organizations. To do so, use the "Watch repos" button on their HF profiles. Note that you can also quickly watch/unwatch users and organizations directly from your [notifications settings](#notifications-settings). Finally, you can choose to watch a specific repository and get notified about any new activity without having to watch the whole organization or user account.
## Notifications settings

In your [notifications settings](https://huggingface.co/settings/notifications) page, you can choose specific channels to get notified on depending on the type of activity; for example, receiving an email for direct mentions but only a web notification for new activity on watched users and organizations. By default, you'll get an email and a web notification for any new activity, but feel free to adjust your settings depending on your needs.

_Note that clicking the unsubscribe link in an email will unsubscribe you from that type of activity, e.g. direct mentions._

![Notifications settings page](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/notifications-settings.png)

You can quickly add any user/organization to your watch list by searching for them by name using the dedicated search bar. Unsubscribe from a specific user/organization simply by unticking the corresponding checkbox.

## Mute notifications for a specific repository

It's possible to mute notifications for a particular repository by using the "Mute notifications" action in the repository's contextual menu. This will prevent you from receiving any new notifications for that particular repository. You can unmute the repository at any time by clicking the "Unmute notifications" action in the same repository menu.
![mute notification menu](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-mute-menu.png)

_Note: if a repository is muted, you won't receive any new notifications unless you're directly mentioned or participating in a discussion._

The list of muted repositories is available from the notifications settings page:

![Notifications settings page muted repositories](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-settings-muted.png)

## Mute notifications for a specific discussion or PR

You can also mute notifications for individual discussions or pull requests by clicking the mute icon in the header. Doing this prevents you from receiving any further notifications from that specific discussion or PR, including direct mentions. You can unmute at any time by clicking the same icon again.

![Notifications mute discussions](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/notifications-mute-discussion.png)

### Gradio Spaces

https://huggingface.co/docs/hub/spaces-sdks-gradio.md

# Gradio Spaces

**Gradio** provides an easy and intuitive interface for running a model from a list of inputs and displaying the outputs in formats such as images, audio, 3D objects, and more. Gradio now even has a [Plot output component](https://gradio.app/docs/#o_plot) for creating data visualizations with Matplotlib, Bokeh, and Plotly! For more details, take a look at the [Getting started](https://gradio.app/getting_started/) guide from the Gradio team.

Selecting **Gradio** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space with the latest version of Gradio by setting the `sdk` property to `gradio` in your `README.md` file's YAML block. If you'd like to change the Gradio version, you can edit the `sdk_version` property.
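For illustration, a Space's `README.md` YAML block configured this way might look like the following sketch (the title, emoji, and version values are hypothetical placeholders):

```yaml
---
title: Hot Dog Classifier       # hypothetical display name
emoji: 🌭                        # hypothetical thumbnail emoji
sdk: gradio                      # selects the Gradio SDK
sdk_version: "4.44.0"            # hypothetical version; pin the Gradio release you want
app_file: app.py                 # entry point of the app
---
```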
Visit the [Gradio documentation](https://gradio.app/docs/) to learn all about its features and check out the [Gradio Guides](https://gradio.app/guides/) for some handy tutorials to help you get started!

## Your First Gradio Space: Hot Dog Classifier

In the following sections, you'll learn the basics of creating a Space, configuring it, and deploying your code to it. We'll create a **Hot Dog Classifier** Space with Gradio that'll be used to demo the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which can detect whether a given picture contains a hot dog 🌭

You can find a completed version of this hosted at [NimaBoscarino/hotdog-gradio](https://huggingface.co/spaces/NimaBoscarino/hotdog-gradio).

## Create a new Gradio Space

We'll start by [creating a brand new Space](https://huggingface.co/new-space) and choosing **Gradio** as our SDK. Hugging Face Spaces are Git repositories, meaning that you can work on your Space incrementally (and collaboratively) by pushing commits. Take a look at the [Getting Started with Repositories](./repositories-getting-started) guide to learn about how you can create and edit files before continuing.

## Add the dependencies

For the **Hot Dog Classifier** we'll be using a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to use the model, so we need to start by installing a few dependencies. This can be done by creating a **requirements.txt** file in our repository, and adding the following dependencies to it:

```
transformers
torch
```

The Spaces runtime will handle installing the dependencies!
## Create the Gradio interface

To create the Gradio app, make a new file in the repository called **app.py**, and add the following code:

```python
import gradio as gr
from transformers import pipeline

pipeline = pipeline(task="image-classification", model="julien-c/hotdog-not-hotdog")

def predict(input_img):
    predictions = pipeline(input_img)
    return input_img, {p["label"]: p["score"] for p in predictions}

gradio_app = gr.Interface(
    predict,
    inputs=gr.Image(label="Select hot dog candidate", sources=['upload', 'webcam'], type="pil"),
    outputs=[gr.Image(label="Processed Image"), gr.Label(label="Result", num_top_classes=2)],
    title="Hot Dog? Or Not?",
)

if __name__ == "__main__":
    gradio_app.launch()
```

This Python script uses a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to load the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which is used by the Gradio interface. The Gradio app will expect you to upload an image, which it'll then classify as *hot dog* or *not hot dog*. Once you've saved the code to the **app.py** file, visit the **App** tab to see your app in action!

## Embed Gradio Spaces on other webpages

You can embed a Gradio Space on other webpages by using either Web Components or the HTML `<iframe>` tag. Check out [our documentation](./spaces-embed) or the [Gradio documentation](https://gradio.app/sharing_your_app/#embedding-hosted-spaces) for more details.

### Spaces as API endpoints

https://huggingface.co/docs/hub/spaces-api-endpoints.md

# Spaces as API endpoints

Every Gradio Space on Hugging Face is automatically available as an API endpoint. You can call it from Python, JavaScript, or any HTTP client. If you can use a Space in your browser, you can call it as an API.
## Quick start

Install the Python client and call any public Space:

```bash
pip install --upgrade gradio_client
```

```python
from gradio_client import Client

client = Client("abidlabs/en2fr", token="hf_...")
result = client.predict("Hello, world!", api_name="/predict")
print(result)  # "Bonjour, le monde!"
```

## View available API endpoints

Every Gradio Space has a "Use via API" link in the footer. Click it to see:

- All available endpoints and their names
- Parameter types and descriptions
- Auto-generated code snippets for Python and JavaScript
- An API Recorder that generates code from your UI interactions

Every Space also exposes an OpenAPI specification at:

```
https://<space-subdomain>.hf.space/gradio_api/openapi.json
```

For example: `https://abidlabs-en2fr.hf.space/gradio_api/openapi.json`

This is useful to understand the full API schema and integrate it into your own applications. You can also inspect endpoints programmatically:

```python
from gradio_client import Client

client = Client("abidlabs/whisper", token="hf_...")
client.view_api()  # Prints all endpoints with parameters
```

## Python client

### Installation

```bash
pip install --upgrade gradio_client
```

Requires Python 3.10+.

### Connect to a Space

```python
from gradio_client import Client

# Public Space
client = Client("username/space-name")

# Private Space (requires token)
client = Client("username/private-space", token="hf_xxxxx")
```

> [!TIP]
> Get your Hugging Face token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). For private Spaces, you need a token with **READ** permissions.

### Make Predictions

**Synchronous (blocking):**

```python
result = client.predict("Hello", api_name="/predict")
```

**Asynchronous (non-blocking):**

```python
job = client.submit("Hello", api_name="/predict")
# Do other work...
result = job.result()  # Get result when ready
```

### Handle Files

Use `handle_file()` for any file inputs:

```python
from gradio_client import Client, handle_file

client = Client("abidlabs/whisper", token="hf_...")

# From local file
result = client.predict(audio=handle_file("audio.wav"), api_name="/predict")

# From URL
result = client.predict(audio=handle_file("https://example.com/audio.wav"), api_name="/predict")
```

### Monitor Job Status

```python
job = client.submit("Hello", api_name="/predict")

# Check status
status = job.status()
print(f"Queue position: {status.rank}, ETA: {status.eta}")

# Check if complete
if job.done():
    result = job.result()

# Cancel a pending job
job.cancel()
```

### Streaming/Generator Endpoints

For endpoints that yield multiple outputs:

```python
job = client.submit(prompt="Write a story", api_name="/generate")

# Iterate over streaming outputs
for output in job:
    print(output)
```

## JavaScript client

### Installation

```bash
npm i @gradio/client
```

Or use via CDN:

```html
<script type="module">
  import { Client } from "https://cdn.jsdelivr.net/npm/@gradio/client/dist/index.min.js";
</script>
```

### Connect and Predict

```javascript
import { Client } from "@gradio/client";

const app = await Client.connect("abidlabs/en2fr", { token: "hf_..." });
const result = await app.predict("/predict", ["Hello"]);
console.log(result.data);
```

### Handle Files

```javascript
import { Client, handle_file } from "@gradio/client";

const app = await Client.connect("abidlabs/whisper", { token: "hf_..." });
const result = await app.predict("/predict", [
  handle_file("https://example.com/audio.wav")
]);
```

### Stream Results

```javascript
const job = app.submit("/predict", ["Hello"]);

for await (const message of job) {
  if (message.type === "data") {
    console.log("Result:", message.data);
  }
  if (message.type === "status") {
    console.log("Queue position:", message.position);
  }
}
```

## REST API (curl)

You can also call Gradio Spaces directly via HTTP without any client library.
### Queue-Based API (Recommended)

Most Spaces use a two-step process:

**Step 1: Submit your request**

```bash
curl -X POST "https://abidlabs-en2fr.hf.space/gradio_api/call/predict" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -d '{"data": ["Hello, world"]}'
```

Response:

```json
{"event_id": "abc123"}
```

**Step 2: Get the result**

```bash
curl -N "https://abidlabs-en2fr.hf.space/gradio_api/call/predict/abc123" \
  -H "Authorization: Bearer $HF_TOKEN"
```

Response (Server-Sent Events):

```
event: complete
data: ["Bonjour, le monde!"]
```

The `Authorization` header is required for private Spaces and gives better rate limits on public Spaces.

## ZeroGPU Spaces

ZeroGPU Spaces have usage quotas based on your account type:

| Account Type | Included Daily GPU Quota |
|-------------|--------------------------|
| Unauthenticated | 2 minutes |
| Free account | 3.5 minutes |
| PRO account | 25 minutes |

When you authenticate with your token, your account's GPU quota is consumed. Unauthenticated requests use a shared pool with stricter limits. PRO, Team, and Enterprise users can go beyond their included daily quota using pre-paid credits at the rate of **$1 per 10 minutes** of GPU time.

> [!TIP]
> You can [subscribe to PRO](https://huggingface.co/subscribe/pro) for 25 minutes of daily GPU quota, higher queue priority, and the ability to extend your quota with credits.
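As a minimal sketch (not an official client), the Server-Sent Events body returned by step 2 of the queue-based REST flow above can be parsed by looking for the `data:` line that follows `event: complete`; the helper name here is hypothetical:

```python
import json

def parse_sse_complete(sse_text: str):
    """Extract the JSON payload of the `complete` event from an SSE response body."""
    current_event = None
    for line in sse_text.splitlines():
        if line.startswith("event:"):
            current_event = line.split(":", 1)[1].strip()
        elif line.startswith("data:") and current_event == "complete":
            return json.loads(line.split(":", 1)[1].strip())
    return None

body = 'event: complete\ndata: ["Bonjour, le monde!"]\n'
print(parse_sse_complete(body))  # ['Bonjour, le monde!']
```

Real SSE streams may interleave other events (e.g. progress updates) before `complete`; this sketch simply skips them.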
## Common patterns

### FastAPI Integration

```python
from fastapi import FastAPI
from gradio_client import Client, handle_file

app = FastAPI()
client = Client("abidlabs/whisper", token="hf_...")

@app.post("/transcribe/")
async def transcribe(file_url: str):
    result = client.predict(audio=handle_file(file_url), api_name="/predict")
    return {"transcription": result}
```

### Error Handling with Retries

```python
import time
from gradio_client import Client

def predict_with_retry(client, *args, max_retries=3, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.predict(*args, **kwargs)
        except Exception as e:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

client = Client("username/space", token="hf_...")
result = predict_with_retry(client, "input", api_name="/predict")
```

### Calling Spaces from Another Space

When calling a ZeroGPU Space from your own Gradio app, forward the user's authentication:

```python
import gradio as gr
from gradio_client import Client

def process(prompt, request: gr.Request):
    x_ip_token = request.headers.get('x-ip-token', '')
    client = Client("owner/zerogpu-space", headers={"x-ip-token": x_ip_token})
    return client.predict(prompt, api_name="/predict")

demo = gr.Interface(fn=process, inputs="text", outputs="text")
demo.launch()
```

## Find Spaces with semantic search

With thousands of Gradio Spaces available, you sometimes want to find one for a particular task:

```bash
curl -s "https://huggingface.co/api/spaces/semantic-search?q=text+to+speech&sdk=gradio"
```

This returns Spaces ranked by semantic relevance, with metadata including the Space ID, likes, and a short description. Use the `sdk=gradio` parameter to filter for Spaces that expose an API.
## Learn more

- [Gradio Python Client Guide](https://www.gradio.app/guides/getting-started-with-the-python-client)
- [Gradio JavaScript Client Guide](https://www.gradio.app/guides/getting-started-with-the-js-client)
- [Querying Gradio Apps with curl](https://www.gradio.app/guides/querying-gradio-apps-with-curl)
- [Spaces ZeroGPU](./spaces-zerogpu)

### Licenses

https://huggingface.co/docs/hub/repositories-licenses.md

# Licenses

You are able to add a license to any repo that you create on the Hugging Face Hub to let other users know about the permissions that you want to attribute to your code or data. The license can be specified in your repository's `README.md` file, known as a _card_ on the Hub, in the card's metadata section. Remember to seek out and respect a project's license if you're considering using their code or data.

A full list of the available licenses is available here:

| Fullname | License identifier (to use in repo card) |
| -------- | ---------------------------------------- |
| Apache license 2.0 | `apache-2.0` |
| MIT | `mit` |
| OpenRAIL license family | `openrail` |
| BigScience OpenRAIL-M | `bigscience-openrail-m` |
| CreativeML OpenRAIL-M | `creativeml-openrail-m` |
| BigScience BLOOM RAIL 1.0 | `bigscience-bloom-rail-1.0` |
| BigCode Open RAIL-M v1 | `bigcode-openrail-m` |
| Academic Free License v3.0 | `afl-3.0` |
| Artistic license 2.0 | `artistic-2.0` |
| Boost Software License 1.0 | `bsl-1.0` |
| BSD license family | `bsd` |
| BSD 2-clause "Simplified" license | `bsd-2-clause` |
| BSD 3-clause "New" or "Revised" license | `bsd-3-clause` |
| BSD 3-clause Clear license | `bsd-3-clause-clear` |
| Computational Use of Data Agreement | `c-uda` |
| Creative Commons license family | `cc` |
| Creative Commons Zero v1.0 Universal | `cc0-1.0` |
| Creative Commons Attribution 2.0 | `cc-by-2.0` |
| Creative Commons Attribution 2.5 | `cc-by-2.5` |
| Creative Commons Attribution 3.0 | `cc-by-3.0` |
| Creative Commons Attribution 4.0 | `cc-by-4.0` |
| Creative Commons Attribution Share Alike 3.0 | `cc-by-sa-3.0` |
| Creative Commons Attribution Share Alike 4.0 | `cc-by-sa-4.0` |
| Creative Commons Attribution Non Commercial 2.0 | `cc-by-nc-2.0` |
| Creative Commons Attribution Non Commercial 3.0 | `cc-by-nc-3.0` |
| Creative Commons Attribution Non Commercial 4.0 | `cc-by-nc-4.0` |
| Creative Commons Attribution No Derivatives 4.0 | `cc-by-nd-4.0` |
| Creative Commons Attribution Non Commercial No Derivatives 3.0 | `cc-by-nc-nd-3.0` |
| Creative Commons Attribution Non Commercial No Derivatives 4.0 | `cc-by-nc-nd-4.0` |
| Creative Commons Attribution Non Commercial Share Alike 2.0 | `cc-by-nc-sa-2.0` |
| Creative Commons Attribution Non Commercial Share Alike 3.0 | `cc-by-nc-sa-3.0` |
| Creative Commons Attribution Non Commercial Share Alike 4.0 | `cc-by-nc-sa-4.0` |
| Community Data License Agreement – Sharing, Version 1.0 | `cdla-sharing-1.0` |
| Community Data License Agreement – Permissive, Version 1.0 | `cdla-permissive-1.0` |
| Community Data License Agreement – Permissive, Version 2.0 | `cdla-permissive-2.0` |
| Do What The F\*ck You Want To Public License | `wtfpl` |
| Educational Community License v2.0 | `ecl-2.0` |
| Eclipse Public License 1.0 | `epl-1.0` |
| Eclipse Public License 2.0 | `epl-2.0` |
| Etalab Open License 2.0 | `etalab-2.0` |
| European Union Public License 1.1 | `eupl-1.1` |
| European Union Public License 1.2 | `eupl-1.2` |
| GNU Affero General Public License v3.0 | `agpl-3.0` |
| GNU Free Documentation License family | `gfdl` |
| GNU General Public License family | `gpl` |
| GNU General Public License v2.0 | `gpl-2.0` |
| GNU General Public License v3.0 | `gpl-3.0` |
| GNU Lesser General Public License family | `lgpl` |
| GNU Lesser General Public License v2.1 | `lgpl-2.1` |
| GNU Lesser General Public License v3.0 | `lgpl-3.0` |
| ISC | `isc` |
| H Research License | `h-research` |
| Intel Research Use License Agreement | `intel-research` |
| LaTeX Project Public License v1.3c | `lppl-1.3c` |
| Microsoft Public License | `ms-pl` |
| Apple Sample Code license | `apple-ascl` |
| Apple Model License for Research | `apple-amlr` |
| Mozilla Public License 2.0 | `mpl-2.0` |
| Open Data Commons License Attribution family | `odc-by` |
| Open Database License family | `odbl` |
| Open Model, Data & Weights License Agreement | `openmdw-1.0` |
| Open Rail++-M License | `openrail++` |
| Open Software License 3.0 | `osl-3.0` |
| PostgreSQL License | `postgresql` |
| SIL Open Font License 1.1 | `ofl-1.1` |
| University of Illinois/NCSA Open Source License | `ncsa` |
| The Unlicense | `unlicense` |
| zLib License | `zlib` |
| Open Data Commons Public Domain Dedication and License | `pddl` |
| Lesser General Public License For Linguistic Resources | `lgpl-lr` |
| DeepFloyd IF Research License Agreement | `deepfloyd-if-license` |
| FAIR Noncommercial Research License | `fair-noncommercial-research-license` |
| Llama 2 Community License Agreement | `llama2` |
| Llama 3 Community License Agreement | `llama3` |
| Llama 3.1 Community License Agreement | `llama3.1` |
| Llama 3.2 Community License Agreement | `llama3.2` |
| Llama 3.3 Community License Agreement | `llama3.3` |
| Llama 4 Community License Agreement | `llama4` |
| Grok 2 Community License Agreement | `grok2-community` |
| Gemma Terms of Use | `gemma` |
| Unknown | `unknown` |
| Other | `other` |

In case of `license: other` please add the license's text to a `LICENSE` file inside your repo (or contact us to add the license you use to this list), and set a name for it in `license_name`.

### Hugging Face CLI for AI Agents

https://huggingface.co/docs/hub/agents-cli.md

# Hugging Face CLI for AI Agents

The `hf` CLI is a great way to connect your agents to the Hugging Face ecosystem. Search models, manage datasets and buckets, launch Spaces, and run jobs from any coding agent.

> [!TIP]
> This is a quick guide on agents that use the CLI.
For more detailed information, see the [CLI Reference](https://huggingface.co/docs/huggingface_hub/guides/cli).

## Install the CLI

Make sure the `hf` CLI is installed and up to date. See the [CLI installation guide](https://huggingface.co/docs/huggingface_hub/guides/cli#getting-started) for setup instructions.

## Add the CLI Skill

Skills give your agent the context it needs to use tools effectively. Install the CLI Skill so your agent knows every `hf` command and stays current with the latest updates. Learn more about Skills at [agentskills.io](https://agentskills.io).

```bash
# install globally (available in all projects, works with Codex, Cursor, OpenCode,
# and any agent that loads skills from ~/.agents/skills)
hf skills add --global

# for Claude Code, use the --claude flag
hf skills add --claude --global

# or install for the current project only (works with Codex, Cursor, OpenCode,
# and any agent that loads skills from .agents/skills)
hf skills add

# for Claude Code, use the --claude flag
hf skills add --claude
```

> [!TIP]
> The Skill is generated from your locally installed CLI version, so it's always up to date.
Alternatively, you can install via the Claude Code plugin system:

```bash
claude
/plugin marketplace add huggingface/skills
/plugin install hf-cli@huggingface/skills
```

## Resources

- [CLI Reference](https://huggingface.co/docs/huggingface_hub/guides/cli) - Complete command documentation
- [Token Settings](https://huggingface.co/settings/tokens) - Manage your tokens
- [Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) - Compute jobs guide

### Webhook guide: Set up an automatic metadata quality review for models and datasets

https://huggingface.co/docs/hub/webhooks-guide-metadata-review.md

# Webhook guide: Set up an automatic metadata quality review for models and datasets

This guide will walk you through creating a system that reacts to changes to a user's or organization's models or datasets on the Hub and creates a 'metadata review' for the changed repository.

## What are we building and why?

Before we dive into the technical details involved in this particular workflow, we'll quickly outline what we're creating and why.

[Model cards](https://huggingface.co/docs/hub/model-cards) and [dataset cards](https://huggingface.co/docs/hub/datasets-cards) are essential tools for documenting machine learning models and datasets. The Hugging Face Hub uses a `README.md` file containing a [YAML](https://en.wikipedia.org/wiki/YAML) header block to generate model and dataset cards. This YAML section defines metadata relating to the model or dataset. For example:

```yaml
---
language:
- "List of ISO 639-1 code for your language"
- lang1
- lang2
tags:
- tag1
- tag2
license: "any valid license identifier"
datasets:
- dataset1
---
```

This metadata contains essential information about your model or dataset for potential users. The license, for example, defines the terms under which a model or dataset can be used.
Hub users can also use the fields defined in the YAML metadata as filters for identifying models or datasets that fit specific criteria.

Since the metadata defined in this block is essential for potential users of our models and datasets, it is important that we complete this section. In a team or organization setting, users pushing models and datasets to the Hub may have differing familiarity with the importance of this YAML metadata block. While someone in a team could take on the responsibility of reviewing this metadata, we can instead automate some of this work. The result will be a metadata review report automatically posted or updated when a repository on the Hub changes. This system works much like [CI/CD](https://en.wikipedia.org/wiki/CI/CD), but for our metadata quality.

![Metadata review](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/003-metadata-review/metadata-report-screenshot.png)

You can also find an example review [here](https://huggingface.co/datasets/davanstrien/test_webhook/discussions/1#63d932fe19aa7b8ed2718b3f).

## Using the Hub Client Library to create a model review card

`huggingface_hub` is a Python library that allows you to interact with the Hub. We can use this library to [download model and dataset cards](https://huggingface.co/docs/huggingface_hub/how-to-model-cards) from the Hub using the `DatasetCard.load` or `ModelCard.load` methods. In particular, we'll use these methods to load a Python dictionary containing the metadata defined in the YAML of our model or dataset card. We'll create a small Python function to wrap these methods and do some exception handling.
```python
from huggingface_hub import DatasetCard, ModelCard
from huggingface_hub.utils import EntryNotFoundError


def load_repo_card_metadata(repo_type, repo_name):
    if repo_type == "dataset":
        try:
            return DatasetCard.load(repo_name).data.to_dict()
        except EntryNotFoundError:
            return {}
    if repo_type == "model":
        try:
            return ModelCard.load(repo_name).data.to_dict()
        except EntryNotFoundError:
            return {}
```

This function will return a Python dictionary containing the metadata associated with the repository (or an empty dictionary if there is no metadata).

```python
{'license': 'afl-3.0'}
```

## Creating our metadata review report

Once we have a Python dictionary containing the metadata associated with a repository, we'll create a 'report card' for our metadata review. In this particular instance, we'll review our metadata by defining some metadata fields for which we want values. For example, we may want to ensure that the `license` field has always been completed. To rate our metadata, we'll count which of our desired metadata fields are present and return a percentage score based on their coverage.

Since we have a Python dictionary containing our metadata, we can loop through this dictionary to check if our desired keys are there. If a desired metadata field (a key in our dictionary) is missing, we'll assign the value `None`.

```python
def create_metadata_key_dict(card_data, repo_type: str):
    shared_keys = ["tags", "license"]
    if repo_type == "model":
        model_keys = ["library_name", "datasets", "metrics", "co2", "pipeline_tag"]
        shared_keys.extend(model_keys)
        keys = shared_keys
        return {key: card_data.get(key) for key in keys}
    if repo_type == "dataset":
        # [...]
```

This function will return a dictionary containing keys representing the metadata fields we require for our model or dataset.
The dictionary values will either include the metadata entered for that field or `None` if that metadata field is missing in the YAML.

```python
{'tags': None,
 'license': 'afl-3.0',
 'library_name': None,
 'datasets': None,
 'metrics': None,
 'co2': None,
 'pipeline_tag': None}
```

Once we have this dictionary, we can create our metadata report. In the interest of brevity, we won't include the complete code here, but the Hugging Face Spaces [repository](https://huggingface.co/spaces/librarian-bot/webhook_metadata_reviewer/blob/main/main.py) for this Webhook contains the full code.

We create one function which renders the data in our metadata coverage dictionary as a prettier markdown table.

```python
def create_metadata_breakdown_table(desired_metadata_dictionary):
    # [...]
    return tabulate(
        table_data, tablefmt="github", headers=("Metadata Field", "Provided Value")
    )
```

We also have a Python function that generates a score (representing the percentage of the desired metadata fields present)

```python
def calculate_grade(desired_metadata_dictionary):
    # [...]
    return round(score, 2)
```

and a Python function that creates a markdown report for our metadata review. This report contains both the score and metadata table, along with some explanation of what the report contains.

```python
def create_markdown_report(
    desired_metadata_dictionary, repo_name, repo_type, score, update: bool = False
):
    # [...]
    return report
```

## How to post the review automatically?

We now have a markdown formatted metadata review report. We'll use the `huggingface_hub` library to post this review. We define a function that takes the Webhook data received from the Hub, parses the data, and creates the metadata report. Depending on whether a report has previously been created, the function creates a new report or posts a new comment to an existing metadata review thread.
```python
def create_or_update_report(data):
    if parsed_post := parse_webhook_post(data):
        repo_type, repo_name = parsed_post
    else:
        return Response("Unable to parse webhook data", status_code=400)
    # [...]
    return True
```

> [!TIP]
> `:=` is Python's assignment expression operator (colloquially known as the walrus operator), added to the language in version 3.8. People have mixed opinions on this syntax, and using it is optional: the same logic can always be written without it. You can read more about this operator in this Real Python article.

## Creating a Webhook to respond to changes on the Hub

We've now got the core functionality for creating a metadata review report for a model or dataset. The next step is to use Webhooks to respond to changes automatically.

## Create a Webhook in your user profile

First, create your Webhook by going to https://huggingface.co/settings/webhooks.

- Input a few target repositories that your Webhook will listen to (you will likely want to limit this to your own repositories or the repositories of the organization you belong to).
- Input a secret to make your Webhook more secure (if you don't know what to choose for this, you may want to use a [password generator](https://1password.com/password-generator/) to generate a sufficiently long random string for your secret).
- We can pass a dummy URL for the `Webhook URL` parameter for now.

Your Webhook will look like this:

![webhook settings](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/003-metadata-review/webhook-settings.png)

## Create a new Bot user profile

This guide creates a separate user account that will post the metadata reviews.
![Bot user account](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/003-metadata-review/librarian-bot-profile.png)

> [!TIP]
> When creating a bot that will interact with other users on the Hub, we ask that you clearly label the account as a "Bot" (see profile screenshot).

## Create a Webhook listener

We now need some way of listening to Webhook events. There are many possible tools you can use to listen to Webhook events. Many existing services, such as [Zapier](https://zapier.com/) and [IFTTT](https://ifttt.com), can use Webhooks to trigger actions (for example, they could post a tweet every time a model is updated). In this case, we'll implement our Webhook listener using [FastAPI](https://fastapi.tiangolo.com/), a Python web framework. In particular, we need to implement a route that accepts `POST` requests on `/webhook`. For authentication, we'll compare the `X-Webhook-Secret` header with a `WEBHOOK_SECRET` secret that can be passed to our [Docker container at runtime](./spaces-sdks-docker#runtime).

```python
import os

from fastapi import FastAPI, Request, Response

KEY = os.environ.get("WEBHOOK_SECRET")

app = FastAPI()


@app.post("/webhook")
async def webhook(request: Request):
    if request.method == "POST":
        if request.headers.get("X-Webhook-Secret") != KEY:
            return Response("Invalid secret", status_code=401)
        data = await request.json()
        result = create_or_update_report(data)
        return "Webhook received!" if result else result
```

The above function receives Webhook events and creates or updates the metadata review report for the changed repository.

## Use Spaces to deploy our Webhook app

Our [main.py](https://huggingface.co/spaces/librarian-bot/webhook_metadata_reviewer/blob/main/main.py) file contains all the code we need for our Webhook app. To deploy it, we'll use a [Space](./spaces-overview).
For our Space, we'll use Docker to run our app. The [Dockerfile](https://huggingface.co/spaces/librarian-bot/webhook_metadata_reviewer/blob/main/Dockerfile) copies our app file, installs the required dependencies, and runs the application. To populate the `KEY` variable, we'll also set a `WEBHOOK_SECRET` secret for our Space with the secret we generated earlier. You can read more about Docker Spaces [here](./spaces-sdks-docker).

Finally, we need to update the URL in our Webhook settings to the URL of our Space. We can get our Space's "direct URL" from the contextual menu. Click on "Embed this Space" and copy the "Direct URL".

![direct url](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/003-metadata-review/direct-url.png)

Once we have this URL, we can pass it to the `Webhook URL` parameter in our Webhook settings. Our bot should now start posting reviews when monitored repositories change!

## Conclusion and next steps

We now have an automatic metadata review bot! Here are some ideas for how you could build on this guide:

- The metadata review done by our bot was relatively crude; you could add more complex rules for reviewing metadata.
- You could use the full `README.md` file for doing the review.
- You may want to define 'rules' which are particularly important for your organization and use a webhook to check these are followed.

If you build a metadata quality app using Webhooks, please tag me @davanstrien; I would love to know about it!

### Use AI Models Locally

https://huggingface.co/docs/hub/local-apps.md

# Use AI Models Locally

You can run AI models from the Hub locally on your machine. This gives you several advantages:

- **Privacy**: You won't be sending your data to a remote server.
- **Speed**: Your hardware is the limiting factor, not the server or connection speed.
- **Control**: You can configure models to your liking.
- **Cost**: You can run models locally without paying for an API provider.

## How to Use Local Apps

Local apps are applications that can run Hugging Face models directly on your machine. To get started:

1. **Enable local apps** in your [Local Apps settings](https://huggingface.co/settings/local-apps).

![Local Apps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local-apps/settings.png)

2. **Choose a supported model** from the Hub by searching for it. You can filter by `app` in the `Other` section of the navigation bar:

![Local Apps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local-apps/search_llamacpp.png)

3. **Select the local app** from the "Use this model" dropdown on the model page.

![Local Apps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local-apps/button.png)

4. **Copy and run** the provided command in your terminal.

![Local Apps](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/local-apps/command.png)

## Supported Local Apps

The best way to check if a local app is supported is to go to the Local Apps settings and see if the app is listed. Here is a quick overview of some of the most popular local apps:

> [!TIP]
> 👨‍💻 To use these local apps, copy the snippets from the model card as above.
>
> 👷 If you're building a local app, you can learn about integrating with the Hub in [this guide](https://huggingface.co/docs/hub/en/models-adding-libraries).

### Llama.cpp

Llama.cpp is a high-performance C/C++ library for running LLMs locally, with optimized inference across many kinds of hardware, including CPUs, CUDA, and Metal.

**Advantages:**

- Extremely fast performance for CPU-based models on multiple CPU families
- Low resource usage
- Multiple interface options (CLI, server, Python library)
- Hardware-optimized for CPUs and GPUs

To use Llama.cpp, navigate to the model card, click "Use this model", and copy the command.
```sh
# Load and run the model:
./llama-server -hf unsloth/gpt-oss-20b-GGUF:Q4_K_M
```

Read our dedicated [llama.cpp + HF doc page](./gguf-llamacpp).

### Ollama

Ollama is an application that lets you run large language models locally on your computer with a simple command-line interface.

**Advantages:**

- Easy installation and setup
- Direct integration with Hugging Face Hub

To use Ollama, navigate to the model card, click "Use this model", and copy the command.

```sh
ollama run hf.co/unsloth/gpt-oss-20b-GGUF:Q4_K_M
```

### Jan

Jan is an open-source ChatGPT alternative that runs entirely offline with a user-friendly interface.

**Advantages:**

- User-friendly GUI
- Chat with documents and files
- OpenAI-compatible API server, so you can run models and use them from other apps

To use Jan, navigate to the model card and click "Use this model". Jan will open and you can start chatting through the interface.

### LM Studio

> [!NOTE]
> Read our dedicated [LM Studio doc page](./lmstudio)

LM Studio is a desktop application that provides an easy way to download, run, and experiment with local LLMs.

**Advantages:**

- Intuitive graphical interface
- Built-in model browser
- Developer tools and APIs
- Free for personal and commercial use

Navigate to the model card and click "Use this model". LM Studio will open and you can start chatting through the interface.

### Adding a Sign-In with HF button to your Space

https://huggingface.co/docs/hub/spaces-oauth.md

# Adding a Sign-In with HF button to your Space

You can enable a built-in sign-in flow in your Space by seamlessly creating and associating an [OAuth/OpenID connect](https://developer.okta.com/blog/2019/10/21/illustrated-guide-to-oauth-and-oidc) app, so users can log in with their HF account. This enables new use cases for your Space.
For instance, when combined with [Storage Buckets](https://huggingface.co/docs/hub/storage-buckets), a generative AI Space could allow users to log in to access their previous generations, only accessible to them.

> [!TIP]
> This guide will take you through the process of integrating a *Sign-In with HF* button into any Space. If you're seeking a fast and simple method to implement this in a **Gradio** Space, take a look at its [built-in integration](https://www.gradio.app/guides/sharing-your-app#o-auth-login-via-hugging-face).

> [!TIP]
> You can also use the HF OAuth flow to create a "Sign in with HF" flow in any website or app, outside of Spaces. [Read our general OAuth page](./oauth).

## Create an OAuth app

All you need to do is add `hf_oauth: true` to your Space's metadata inside your `README.md` file. Here's an example of metadata for a Gradio Space:

```yaml
title: Gradio Oauth Test
emoji: 🏆
colorFrom: pink
colorTo: pink
sdk: gradio
sdk_version: 3.40.0
python_version: 3.10.6
app_file: app.py
hf_oauth: true
# optional, default duration is 8 hours/480 minutes. Max duration is 30 days/43200 minutes.
hf_oauth_expiration_minutes: 480
# optional, see "Scopes" below. "openid profile" is always included.
hf_oauth_scopes:
  - read-repos
  - gated-repos
  - write-repos
  - manage-repos
  - inference-api
# optional, restrict access to members of specific organizations
hf_oauth_authorized_org: ORG_NAME
# or, for several organizations:
hf_oauth_authorized_org:
  - ORG_NAME1
  - ORG_NAME2
```

You can check out the [configuration reference docs](./spaces-config-reference) for more information.

This will add the following [environment variables](https://huggingface.co/docs/hub/spaces-overview#helper-environment-variables) to your Space:

- `OAUTH_CLIENT_ID`: the client ID of your OAuth app (public)
- `OAUTH_CLIENT_SECRET`: the client secret of your OAuth app
- `OAUTH_SCOPES`: scopes accessible by your OAuth app.
- `OPENID_PROVIDER_URL`: The URL of the OpenID provider.
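To make the flow concrete, here is a rough sketch of building the authorization URL from these environment variables, following the redirect described in the "Adding the button to your Space" section of this page. The function name is ours, and `redirect_uri` is whatever callback your app actually exposes:

```python
import os
import secrets
from urllib.parse import urlencode


def build_authorize_url(redirect_uri: str) -> tuple[str, str]:
    """Build the HF OAuth authorize URL plus the `state` to verify on callback."""
    state = secrets.token_urlsafe(16)
    params = {
        "redirect_uri": redirect_uri,
        "scope": "openid profile",
        "client_id": os.environ.get("OAUTH_CLIENT_ID", ""),
        "state": state,
    }
    return "https://huggingface.co/oauth/authorize?" + urlencode(params), state


# hypothetical Space callback URL
url, state = build_authorize_url("https://example-space.hf.space/login/callback")
```

The returned `state` must be stored (e.g. in the user's session) and compared against the `state` query parameter when the callback fires.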
The OpenID metadata will be available at [`{OPENID_PROVIDER_URL}/.well-known/openid-configuration`](https://huggingface.co/.well-known/openid-configuration).

As for any other environment variable, you can use them in your code with `os.getenv("OAUTH_CLIENT_ID")`, for example.

## Redirect URLs

You can use any redirect URL you want, as long as it targets your Space. Note that `SPACE_HOST` is [available](https://huggingface.co/docs/hub/spaces-overview#helper-environment-variables) as an environment variable. For example, you can use `https://{SPACE_HOST}/login/callback` as a redirect URI.

## Scopes

The following scopes are always included for Spaces:

- `openid`: Get the ID token in addition to the access token.
- `profile`: Get the user's profile information (username, avatar, etc.)

These scopes are optional and can be added by setting `hf_oauth_scopes` in your Space's metadata:

- `email`: Get the user's email address.
- `read-billing`: Know whether the user has a payment method set up.
- `read-repos`: Get read access to the user's personal repos.
- `gated-repos`: Get read access to the content of public gated repos the user has been granted access to. Unlike `read-repos`, this does not grant access to private repos.
- `contribute-repos`: Can create repositories and access those created by this app. Cannot access any other repositories unless additional permissions are granted.
- `write-repos`: Get write/read access to the user's personal repos.
- `manage-repos`: Get full access to the user's personal repos. Also grants repo creation and deletion.
- `read-collections`: Get read access to the user's personal collections.
- `write-collections`: Get write/read access to the user's personal collections. Also grants collection creation and deletion.
- `inference-api`: Get access to [Inference Providers](https://huggingface.co/docs/inference-providers/index); you will be able to make inference requests on behalf of the user.
- `jobs`: Run [jobs](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs)
- `webhooks`: Manage [webhooks](https://huggingface.co/docs/huggingface_hub/main/en/guides/webhooks)
- `write-discussions`: Open discussions and Pull Requests on behalf of the user, as well as interact with discussions (including reactions, posting/editing comments, closing discussions, ...). To open Pull Requests on private repos, you need to request the `read-repos` scope as well.

## Accessing organization resources

By default, the OAuth app does not need to access organization resources, but some scopes like `read-repos` or `read-billing` apply to organizations as well. The user can select which organizations to grant access to when authorizing the app. If you require access to a specific organization, you can add `orgIds=ORG_ID` as a query parameter to the OAuth authorization URL. You have to replace `ORG_ID` with the organization ID, which is available in the `organizations.sub` field of the userinfo response.

## Adding the button to your Space

You now have all the information to add a "Sign-in with HF" button to your Space. Some libraries ([Python](https://github.com/lepture/authlib), [NodeJS](https://github.com/panva/node-openid-client)) can help you implement the OpenID/OAuth protocol. Gradio and huggingface.js also provide **built-in support**, making implementing the Sign-in with HF button a breeze; you can check out the associated guides with [gradio](https://www.gradio.app/guides/sharing-your-app#o-auth-login-via-hugging-face) and with [huggingface.js](https://huggingface.co/docs/huggingface.js/hub/README#oauth-login).

Basically, you need to:

- Redirect the user to `https://huggingface.co/oauth/authorize?redirect_uri={REDIRECT_URI}&scope=openid%20profile&client_id={CLIENT_ID}&state={STATE}`, where `STATE` is a random string that you will need to verify later.
- Handle the callback on `/auth/callback` or `/login/callback` (or your own custom callback URL) and verify the `state` parameter.
- Use the `code` query parameter to get an access token and id token from `https://huggingface.co/oauth/token` (POST request with `client_id`, `code`, `grant_type=authorization_code` and `redirect_uri` as form data, and with `Authorization: Basic {base64(client_id:client_secret)}` as a header).

> [!WARNING]
> You should use `target=_blank` on the button to open the sign-in page in a new tab, unless you run the space outside its `iframe`. Otherwise, you might encounter issues with cookies on some browsers.

## Examples:

- [Gradio test app](https://huggingface.co/spaces/Wauplin/gradio-oauth-test)
- [HuggingChat (NodeJS/SvelteKit)](https://huggingface.co/spaces/huggingchat/chat-ui)
- [Inference Widgets (Auth.js/SvelteKit)](https://huggingface.co/spaces/huggingfacejs/inference-widgets), uses the `inference-api` scope to make inference requests on behalf of the user.
- [Client-Side in a Static Space (huggingface.js)](https://huggingface.co/spaces/huggingfacejs/client-side-oauth) - very simple JavaScript example.

JS Code example:

```js
import { oauthLoginUrl, oauthHandleRedirectIfPresent } from "@huggingface/hub";

const oauthResult = await oauthHandleRedirectIfPresent();

if (!oauthResult) {
  // If the user is not logged in, redirect to the login page
  window.location.href = await oauthLoginUrl();
}

// You can use oauthResult.accessToken, oauthResult.userInfo among other things
console.log(oauthResult);
```

### Local Agents with llama.cpp

https://huggingface.co/docs/hub/agents-local.md

# Local Agents with llama.cpp

You can run a coding agent entirely on your own hardware. Several open-source agents can connect to a local [llama.cpp](https://github.com/ggerganov/llama.cpp) server to give you an experience similar to Claude Code or Codex, but with everything running on your machine.

## Getting Started

### 1. Set Your Local Hardware

Set your local hardware profile so the Hub can show you which models are compatible with your setup. Go to [huggingface.co/settings/local-apps](https://huggingface.co/settings/local-apps) and configure your local hardware profile. Select `llama.cpp` in the Local Apps section, as this will be the engine you'll use.

### 2. Find a Compatible Model

Browse the [llama.cpp-compatible models](https://huggingface.co/models?apps=llama.cpp&sort=trending).

### 3. Launch the llama.cpp Server

On the model page, click the **"Use this model"** button and select `llama.cpp`. It will show you the exact commands for your setup. The first step is to start a llama.cpp server, e.g.

```bash
llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M --jinja
```

This downloads the model and starts an OpenAI-compatible API server on your machine. See the [llama.cpp guide](./gguf-llamacpp) for installation instructions.

### 4. Connect Your Agent

Pick one of the agents below and follow the setup instructions.

## Pi

[Pi](https://pi.dev) is the agent behind [OpenClaw](https://github.com/openclaw) and is now integrated directly into Hugging Face, giving you access to thousands of compatible models.

Install Pi:

```bash
npm install -g @mariozechner/pi-coding-agent
```

Then add your local model to Pi's configuration file at `~/.pi/agent/models.json`:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "ggml-org-gemma-4-26b-4b-gguf" }
      ]
    }
  }
}
```

Start Pi in your project directory:

```bash
pi
```

Pi connects to your local llama.cpp server and gives you an interactive agent session.
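Under the hood, every agent configured this way talks to the same OpenAI-compatible endpoint on the llama.cpp server. As an illustration (the model id and port mirror the example config above, and the helper function is ours), this is roughly the request an agent sends to `http://localhost:8080/v1/chat/completions`:

```python
import json
from urllib.request import Request


def build_chat_request(prompt: str, model: str = "ggml-org-gemma-4-26b-4b-gguf") -> Request:
    """Build an OpenAI-style chat-completions request for a local llama.cpp server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("List the files in this repo.")
# with the server running: urllib.request.urlopen(req) returns the completion JSON
```

Because the wire format is the standard OpenAI chat-completions shape, any OpenAI-compatible client or agent can be pointed at the local server just by swapping the base URL.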
![Demo](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/pi-llama-cpp-demo.gif)

### Enabling vision support

For vision-capable models, add `"input": ["text", "image"]` to the model entry in `~/.pi/agent/models.json`:

```json
"models": [
  {
    "id": "unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL",
    "input": ["text", "image"]
  }
]
```

Browse [vision-language models compatible with Pi](https://huggingface.co/models?pipeline_tag=image-text-to-text&apps=pi).

## OpenClaw

[OpenClaw](https://github.com/openclaw) works locally with llama.cpp. You can set your model via the onboard command:

```bash
openclaw onboard --non-interactive \
  --auth-choice custom-api-key \
  --custom-base-url "http://127.0.0.1:8080/v1" \
  --custom-model-id "ggml-org-gemma-4-26b-a4b-gguf" \
  --custom-api-key "llama.cpp" \
  --secret-input-mode plaintext \
  --custom-compatibility openai \
  --accept-risk
```

You can also run `openclaw onboard` interactively, select `custom-compatibility` with `openai`, and pass the same configuration.

## Hermes

[Hermes](https://hermes-agent.nousresearch.com/) works locally with llama.cpp. Define a default config as:

```yaml
model:
  provider: custom
  default: ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M
  base_url: http://127.0.0.1:8080/v1
  api_key: llama.cpp
custom_providers:
  - name: Local (127.0.0.1:8080)
    base_url: http://127.0.0.1:8080/v1
    api_key: llama.cpp
    model: ggml-org/gemma-4-26B-A4B-it-GGUF:Q4_K_M
```

## OpenCode

[OpenCode](https://opencode.ai) works locally with llama.cpp. Define a `~/.config/opencode/opencode.json`:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": { "baseURL": "http://127.0.0.1:8080/v1" },
      "models": {
        "gemma-4-26b-4b-it": {
          "name": "Gemma 4 (local)",
          "limit": { "context": 128000, "output": 8192 }
        }
      }
    }
  }
}
```

## How It Works

The setup has two components running locally:

1. **llama.cpp server**: Serves the model as an OpenAI-compatible API on `localhost`.
2. **Your agent**: The agent process that sends prompts to the local server, reasons about tasks, and executes actions.

```
┌─────────┐    API calls     ┌────────────────────┐
│  Agent  │ ───────────────▶ │  llama.cpp server  │
│         │ ◀─────────────── │   (local model)    │
└─────────┘    responses     └────────────────────┘
      │
      ▼
Your files, terminal, etc.
```

## Alternative: llama-agent

[llama-agent](https://github.com/gary149/llama-agent) takes a different approach: it builds the agent loop directly into [llama.cpp](https://github.com/ggerganov/llama.cpp) as a single binary with zero external dependencies. No Node.js, no Python, just compile and run:

```bash
git clone https://github.com/gary149/llama-agent.git
cd llama-agent

# Build
cmake -B build
cmake --build build --target llama-agent

# Run (downloads the model automatically)
./build/bin/llama-agent -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M
```

Because tool calls happen in-process rather than over HTTP, there is no network overhead between the model and the agent. It also supports subagents, MCP servers, and an HTTP API server mode.

## Next Steps

- [Use AI Models Locally](./local-apps): Learn more about running models on your machine
- [llama.cpp Guide](./gguf-llamacpp): Detailed llama.cpp installation and usage
- [Agents on the Hub](./agents-overview): Connect agents to the Hugging Face ecosystem

### Using 🧨 `diffusers` at Hugging Face

https://huggingface.co/docs/hub/diffusers.md

# Using 🧨 `diffusers` at Hugging Face

Diffusers is the go-to library for state-of-the-art pretrained diffusion models for generating images, audio, and even 3D structures of molecules.
Whether you're looking for a simple inference solution or want to train your own diffusion model, Diffusers is a modular toolbox that supports both. The library is designed with a focus on usability over performance, simple over easy, and customizability over abstractions.

## Exploring Diffusers in the Hub

There are over 10,000 `diffusers`-compatible pipelines on the Hub, which you can find by filtering at the left of [the models page](https://huggingface.co/models?library=diffusers&sort=downloads). Diffusion systems are typically composed of multiple components such as a text encoder, UNet, VAE, and scheduler. Even though they are not standalone models, the pipeline abstraction makes it easy to use them for inference or training.

You can find diffusion pipelines for many different tasks:

* Generating images from natural language text prompts ([text-to-image](https://huggingface.co/models?library=diffusers&pipeline_tag=text-to-image&sort=downloads)).
* Transforming images using natural language text prompts ([image-to-image](https://huggingface.co/models?library=diffusers&pipeline_tag=image-to-image&sort=downloads)).
* Generating videos from natural language descriptions ([text-to-video](https://huggingface.co/models?library=diffusers&pipeline_tag=text-to-video&sort=downloads)).

You can try out the models directly in the browser without downloading them, thanks to the in-browser widgets!

## Diffusers repository files

A [Diffusers](https://hf.co/docs/diffusers/index) model repository contains all the required model sub-components, such as the variational autoencoder for encoding images and decoding latents, the text encoder, the transformer model, and more. These sub-components are organized into a multi-folder layout. Each subfolder contains the weights and configuration (where applicable) for each component, similar to a [Transformers](./transformers) model.
Weights are usually stored as safetensors files, and the configuration is usually a JSON file with information about the model architecture.

## Using existing pipelines

All `diffusers` pipelines are a line away from being used! To run generation, we recommend always starting from the `DiffusionPipeline`:

```py
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
```

If you want to load a specific pipeline component such as the UNet, you can do so by:

```py
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet")
```

## Sharing your pipelines and models

All the [pipeline classes](https://huggingface.co/docs/diffusers/main/api/pipelines/overview), [model classes](https://huggingface.co/docs/diffusers/main/api/models/overview), and [scheduler classes](https://huggingface.co/docs/diffusers/main/api/schedulers/overview) are fully compatible with the Hub. More specifically, they can be easily loaded from the Hub using the `from_pretrained()` method and shared with others using the `push_to_hub()` method. For more details, please check out the [documentation](https://huggingface.co/docs/diffusers/main/en/using-diffusers/push_to_hub).

## Additional resources

* Diffusers [library](https://github.com/huggingface/diffusers).
* Diffusers [docs](https://huggingface.co/docs/diffusers/index).

### Resource groups

https://huggingface.co/docs/hub/enterprise-resource-groups.md

# Resource groups

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Resource Groups allow organizations to enforce fine-grained access control to their repositories.
This feature allows organization administrators to:

- Group related repositories together for better organization
- Control member access at a group level rather than at the individual repository level
- Assign different permission roles (no_access, read, contributor, write, admin) to team members
- Keep private repositories visible only to authorized group members
- Enable multiple teams to work independently within the same organization
- Configure which member roles are allowed to create new resource groups

This Team & Enterprise feature helps organizations manage complex team structures and maintain proper access control over their repositories.

[Getting started with Resource Groups →](./security-resource-groups)

### User access tokens

https://huggingface.co/docs/hub/security-tokens.md

# User access tokens

## What are User Access Tokens?

User Access Tokens are the preferred way to authenticate an application or notebook to Hugging Face services. You can manage your access tokens in your [settings](https://huggingface.co/settings/tokens).

Access tokens allow applications and notebooks to perform specific actions specified by the scope of the roles shown in the following:

- `fine-grained`: tokens with this role can be used to provide fine-grained access to specific resources, such as a specific model or models in a specific organization. This type of token is useful in production environments, as you can use your own token without sharing access to all your resources.
- `read`: tokens with this role can only be used to provide read access to repositories you can read. That includes public and private repositories that you, or an organization you're a member of, own. Use this role if you only need to read content from the Hugging Face Hub (e.g. when downloading private models or doing inference).
- `write`: tokens with this role additionally grant write access to the repositories you have write access to.
Use this token if you need to create or push content to a repository (e.g., when training a model or modifying a model card).

If you are a member of an organization with a read/write/admin role, then your User Access Tokens will be able to read/write the resources according to the token permission (read/write) and organization membership (read/write/admin).

## How to manage User Access Tokens?

To create an access token, go to your settings, then click on the [Access Tokens tab](https://huggingface.co/settings/tokens). Click on the **New token** button to create a new User Access Token.

Select a role and a name for your token and voilà - you're ready to go!

You can delete and refresh User Access Tokens by clicking on the **Manage** button.

## How to use User Access Tokens?

There are plenty of ways to use a User Access Token to access the Hugging Face Hub, granting you the flexibility you need to build awesome apps on top of it.

User Access Tokens can be:

- used **in place of a password** to access the Hugging Face Hub with git or with basic authentication.
- passed as a **bearer token** when calling [Inference Providers](https://huggingface.co/docs/inference-providers).
- used in the Hugging Face Python libraries, such as `transformers` or `datasets`:

```python
from transformers import AutoModel

access_token = "hf_..."
model = AutoModel.from_pretrained("private/model", token=access_token)
```

> [!WARNING]
> Try not to leak your token! Though you can always rotate it, anyone will be able to read or write your private repos in the meantime, which is 💩

### Best practices

We recommend you create one access token per app or usage. For instance, you could have a separate token for:

- A local machine.
- A Colab notebook.
- An awesome custom inference server.

This way, you can invalidate one token without impacting your other usages.

We also recommend using only fine-grained tokens for production usage.
The impact, if leaked, will be reduced, and they can be shared among your organization without impacting your account.

For example, if your production application needs read access to a gated model, a member of your organization can request access to the model and then create a fine-grained token with read access to that model. This token can then be used in your production application without giving it access to all your private models.

### For Enterprise organizations

If your organization needs to programmatically issue tokens for members without requiring each user to create their own token, see [OAuth Token Exchange](./oauth#token-exchange-for-organizations-rfc-8693). This Enterprise plan feature is ideal for building internal platforms, CI/CD pipelines, or custom integrations that need to access Hugging Face resources on behalf of organization members.

## Tokens in organizations with token management policies

Organizations on Team and Enterprise plans can enforce token policies that affect how your tokens work when accessing that organization's resources.

### When your token requires approval (Team & Enterprise organizations)

When you create a fine-grained token scoped to an organization that requires administrator approval, the token automatically enters a **Pending** state. It cannot access that organization's resources until an administrator approves it. You will receive an email notification when your token is approved or denied.

You can check the status from your token list page: a pending token shows an orange hourglass icon next to its permissions badge, and a denied or revoked token shows a red exclamation icon. A red error banner also appears on the token's edit page if your token was denied or revoked.

> [!NOTE]
> If you are an administrator of the organization, fine-grained tokens you create scoped to that organization are automatically approved — no review step is required.
### When your token is denied (Team & Enterprise organizations)

If your token is denied, you will receive an email notification. The token remains in your account and can still be used for resources outside the organization. A denied token can later be approved by an administrator, restoring access without you needing to create a new token.

When attempting to use a denied token against organization resources, you will receive a `403` error.

### When your token is revoked (Enterprise organizations)

Revocation is permanent. Unlike denial, a revoked token cannot be reinstated. If your token has been revoked, you must delete it and create a new one. If the organization requires administrator approval, the new token will start in a pending state.

When attempting to use a revoked token against organization resources, you will receive a `403` error with the message: _"Your token has been revoked by the organization administrator, you can no longer access organization resources. Please contact them for more information."_

Revocation only affects the organization that revoked it. The token continues to work normally for all other resources it is scoped to.

### When your organization only allows fine-grained tokens (Team & Enterprise organizations)

If your organization has set a policy requiring fine-grained tokens, read/write tokens will be rejected with a `403` error when used against that organization's resources.

### Bucket Integrations

https://huggingface.co/docs/hub/storage-buckets-integrations.md

# Bucket Integrations

Storage Buckets can be read and written from many Python data libraries using `hf://buckets/` paths, backed by the [`huggingface_hub` filesystem interface](/docs/huggingface_hub/guides/hf_file_system). For the underlying access mechanisms (mounts, volume mounts, and fsspec), see [Access Patterns](./storage-buckets-access).
## pandas

```python
import pandas as pd

df = pd.read_parquet("hf://buckets/username/my-bucket/data.parquet")
df.to_parquet("hf://buckets/username/my-bucket/output.parquet")
```

## Dask

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://buckets/username/my-bucket/data.parquet")
```

## PyArrow

```python
import pyarrow.parquet as pq

table = pq.read_table("hf://buckets/username/my-bucket/data.parquet")
```

## PySpark

With [`pyspark_huggingface`](https://github.com/huggingface/pyspark_huggingface) installed:

```python
df = (
    spark.read.format("huggingface")
    .option("data_files", '["data.parquet"]')
    .load("buckets/username/my-bucket")
)
```

See [PySpark on the Hub](./datasets-pyspark) for more.

## 🤗 Datasets

```python
from datasets import load_dataset

ds = load_dataset("buckets/username/my-bucket", data_files=["data.parquet"])
```

## Filesystem operations

For direct file operations, `huggingface_hub` exposes a pre-instantiated [filesystem object](/docs/huggingface_hub/guides/hf_file_system), `hffs`:

```python
from huggingface_hub import hffs

with hffs.open("buckets/username/my-bucket/hello.txt", "w") as f:
    f.write("Hello world!")

hffs.cp("buckets/username/my-bucket/hello.txt", "buckets/username/my-bucket/hello2.txt")
hffs.rm("buckets/username/my-bucket/hello2.txt")

files = hffs.ls("buckets/username/my-bucket")
text_files = hffs.glob("buckets/username/my-bucket/*.txt")
```

## Other languages

[OpenDAL](https://opendal.apache.org/) provides a similar filesystem interface for Rust, Java, Go, JavaScript, and more.

## Coming soon

Support for more libraries is on the way, including Polars, DuckDB (native `hf://` URL support), Daft, and webdataset.

### Tasks

https://huggingface.co/docs/hub/models-tasks.md

# Tasks

## What's a task?

Tasks, or pipeline types, describe the "shape" of each model's API (inputs and outputs) and are used to determine which Inference API and widget we want to display for any given model.
This classification is relatively coarse-grained (you can always add more fine-grained task names in your model tags), so **you should rarely have to create a new task**. If you want to add support for a new task, this document explains the required steps.

## Overview

Having a new task integrated into the Hub means that:

* Users can search for all models – and datasets – of a given task.
* The Inference API supports the task.
* Users can try out models directly with the widget. 🏆

Note that you don't need to implement all the steps by yourself. Adding a new task is a community effort, and multiple people can contribute. 🧑‍🤝‍🧑

To begin the process, open a new issue in the [huggingface_hub](https://github.com/huggingface/huggingface_hub/issues) repository. Please use the "Adding a new task" template. ⚠️ Before doing any coding, it's suggested to go over this document. ⚠️

The first step is to upload a model for your proposed task. Once you have a model in the Hub for the new task, the next step is to enable it in the Inference API. There are three types of support that you can choose from:

* 🤗 using a `transformers` model
* 🐳 using a model from an [officially supported library](./models-libraries)
* 🖨️ using a model with custom inference code. This experimental option has downsides, so we recommend using one of the other approaches.

Finally, you can add a couple of UI elements, such as the task icon and the widget, that complete the integration in the Hub. 📷

Some steps are orthogonal; you don't need to do them in order. **You don't need the Inference API to add the icon.** This means that, even if there isn't full integration yet, users can still search for models of a given task.

## Adding new tasks to the Hub

### Using Hugging Face transformers library

If your model is a `transformers`-based model, there is a 1:1 mapping between the Inference API task and a `pipeline` class.
Here are some example PRs from the `transformers` library:

* [Adding ImageClassificationPipeline](https://github.com/huggingface/transformers/pull/11598)
* [Adding AudioClassificationPipeline](https://github.com/huggingface/transformers/pull/13342)

Once the pipeline is submitted and deployed, you should be able to use the Inference API for your model.

### Using Community Inference API with a supported library

The Hub also supports over 10 open-source libraries in the [Community Inference API](https://github.com/huggingface/api-inference-community).

**Adding a new task is relatively straightforward and requires 2 PRs:**

* PR 1: Add the new task to the API [validation](https://github.com/huggingface/api-inference-community/blob/main/api_inference_community/validation.py). This code ensures that the inference input is valid for a given task. Some PR examples:
  * [Add text-to-image](https://github.com/huggingface/huggingface_hub/commit/5f040a117cf2a44d704621012eb41c01b103cfca#diff-db8bbac95c077540d79900384cfd524d451e629275cbb5de7a31fc1cd5d6c189)
  * [Add audio-classification](https://github.com/huggingface/huggingface_hub/commit/141e30588a2031d4d5798eaa2c1250d1d1b75905#diff-db8bbac95c077540d79900384cfd524d451e629275cbb5de7a31fc1cd5d6c189)
  * [Add tabular-classification](https://github.com/huggingface/huggingface_hub/commit/dbea604a45df163d3f0b4b1d897e4b0fb951c650#diff-db8bbac95c077540d79900384cfd524d451e629275cbb5de7a31fc1cd5d6c189)
* PR 2: Add the new task to a library docker image. You should also add a template to [`docker_images/common/app/pipelines`](https://github.com/huggingface/api-inference-community/tree/main/docker_images/common/app/pipelines) to facilitate integrating the task in other libraries.
Here is an example PR:

* [Add text-classification to spaCy](https://github.com/huggingface/huggingface_hub/commit/6926fd9bec23cb963ce3f58ec53496083997f0fa#diff-3f1083a92ca0047b50f9ad2d04f0fe8dfaeee0e26ab71eb8835e365359a1d0dc)

### Adding Community Inference API for a quick prototype

**My model is not supported by any library. Am I doomed? 😱** We recommend using [Hugging Face Spaces](./spaces) for these use cases.

### UI elements

The Hub allows users to filter models by a given task. To do this, you need to add the task to several places. You'll also get to pick an icon for the task!

1. Add the task type to `Types.ts`

   In [huggingface.js/packages/tasks/src/pipelines.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/pipelines.ts), you need to do a couple of things:

   * Add the type to `PIPELINE_DATA`. Note that pipeline types are sorted into different categories (NLP, Audio, Computer Vision, and others).
   * You will also need to make minor changes in [huggingface.js/packages/tasks/src/tasks/index.ts](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/tasks/index.ts)

2. Choose an icon

   You can add an icon in the [lib/Icons](https://github.com/huggingface/huggingface.js/tree/main/packages/widgets/src/lib/components/Icons) directory. We usually choose carbon icons from https://icones.js.org/collection/carbon. Also add the icon to [PipelineIcon](https://github.com/huggingface/huggingface.js/blob/main/packages/widgets/src/lib/components/PipelineIcon/PipelineIcon.svelte).

### Widget

Once the task is in production, what could be more exciting than implementing some way for users to play directly with the models in their browser? 🤩 You can find all the widgets [here](https://huggingface.co/spaces/huggingfacejs/inference-widgets).

If you are interested in contributing a widget, you can look at the [implementation](https://github.com/huggingface/huggingface.js/tree/main/packages/widgets) of all the widgets.
### Access control in organizations

https://huggingface.co/docs/hub/organizations-security.md

# Access control in organizations

> [!TIP]
> You can set up [Single Sign-On (SSO)](./security-sso) to map access control rules from your organization's Identity Provider.

> [!TIP]
> Advanced and more fine-grained access control can be achieved with [Resource Groups](./security-resource-groups).
>
> The Resource Group feature is part of the Team & Enterprise plans.

Members of organizations can have five different roles: `no_access`, `read`, `contributor`, `write`, or `admin`:

- `no_access`: the member belongs to the Organization but has no access to its repositories or settings. Use with [Resource Groups](./security-resource-groups) to grant access to specific repos only.
- `read`: read-only access to the Organization's repos and metadata/settings (e.g., the Organization's profile, members list, API token, etc.).
- `contributor`: additional write rights to the subset of the Organization's repos that were created by the user. I.e., users can create repos and _then_ modify only those repos. This is similar to the `write` role, but scoped to repos _created_ by the user.
- `write`: write rights to all the Organization's repos. Users can create, delete, or rename any repo in the Organization namespace. A user can also edit and delete files from the browser editor and push content with `git`.
- `admin`: in addition to write rights on repos, admin members can update the Organization's profile, refresh the Organization's API token, and manage Organization members.

As an organization `admin`, go to the **Members** section of the org settings to manage roles for users. To change roles or resource group assignments programmatically, see the [Programmatic User Access Control Management](./programmatic-user-access-control) guide.

## Viewing members' email address

> [!WARNING]
> This feature is part of the Team & Enterprise plans.
You may be able to view the email addresses of members of your organization. The visibility of the email addresses depends on the organization's SSO configuration or verified organization status.

- By [verifying an email domain](./organizations-managing#organization-email-domain) for your organization, you can view the email addresses of members with a matching email domain.
- If SSO is configured for your organization, you can view the email address of each of your organization members by setting `Matching email domains` in the SSO configuration.

## Managing Access Tokens with access to my organization

See [Tokens Management](./enterprise-tokens-management).

### Optimizations

https://huggingface.co/docs/hub/datasets-polars-optimizations.md

# Optimizations

We briefly touched upon the difference between lazy and eager evaluation. On this page we will show how the lazy API can be used to get huge performance benefits.

## Lazy vs Eager

Polars supports two modes of operation: lazy and eager. In the eager API the query is executed immediately, while in the lazy API the query is only evaluated once it's 'needed'. Deferring the execution to the last minute can have significant performance advantages, and is why the lazy API is preferred in most non-interactive cases.

## Example

We will be using the example from the previous page to show the performance benefits of using the lazy API. The code below will compute the number of uploads from `archive.org`.
### Eager

```python
import polars as pl
import datetime

df = pl.read_csv("hf://datasets/commoncrawl/statistics/tlds.csv", try_parse_dates=True)

df = df.select("suffix", "crawl", "date", "tld", "pages", "domains")

df = df.filter(
    (pl.col("date") >= datetime.date(2020, 1, 1)) | pl.col("crawl").str.contains("CC")
)

df = df.with_columns(
    (pl.col("pages") / pl.col("domains")).alias("pages_per_domain")
)

df = df.group_by("tld", "date").agg(
    pl.col("pages").sum(),
    pl.col("domains").sum(),
)

df = df.group_by("tld").agg(
    pl.col("date").unique().count().alias("number_of_scrapes"),
    pl.col("domains").mean().alias("avg_number_of_domains"),
    pl.col("pages").sort_by("date").pct_change().mean().alias("avg_page_growth_rate"),
).sort("avg_number_of_domains", descending=True).head(10)
```

### Lazy

```python
import polars as pl
import datetime

lf = (
    pl.scan_csv("hf://datasets/commoncrawl/statistics/tlds.csv", try_parse_dates=True)
    .filter(
        (pl.col("date") >= datetime.date(2020, 1, 1)) | pl.col("crawl").str.contains("CC")
    ).with_columns(
        (pl.col("pages") / pl.col("domains")).alias("pages_per_domain")
    ).group_by("tld", "date").agg(
        pl.col("pages").sum(),
        pl.col("domains").sum(),
    ).group_by("tld").agg(
        pl.col("date").unique().count().alias("number_of_scrapes"),
        pl.col("domains").mean().alias("avg_number_of_domains"),
        pl.col("pages").sort_by("date").pct_change().mean().alias("avg_page_growth_rate"),
    ).sort("avg_number_of_domains", descending=True).head(10)
)

df = lf.collect()
```

### Timings

Running both queries leads to the following run times on a regular laptop with a household internet connection:

- Eager: `1.96` seconds
- Lazy: `410` milliseconds

The lazy query is ~5 times faster than the eager one. The reason for this is the query optimizer: if we delay `collect`-ing our dataset until the end, Polars will be able to reason about which columns and rows are required and apply filters as early as possible when reading the data. For file formats such as Parquet that contain metadata (e.g. min, max in a certain group of rows), the difference can be even bigger, as Polars can skip entire row groups based on the filters and the metadata without sending the data over the wire.

### Featured Spaces

https://huggingface.co/docs/hub/spaces-featured.md

# Featured Spaces

Hugging Face highlights certain Spaces to make it easier for you to discover high-quality demos and examples built by the community.

## Author-owned Spaces

Model, Dataset, and Paper pages can include a section showing the Spaces that use or reference that artifact. A star icon identifies author-owned Spaces, which are official demos or examples created by the owner of the artifact. Learn how to link a Space to a Model or Dataset here, and how to link it to a Paper here.

## Spaces of the Week

Spaces of the Week are a curated weekly selection of standout Spaces chosen by the Hugging Face team.

### PyArrow

https://huggingface.co/docs/hub/datasets-pyarrow.md

# PyArrow

[Arrow](https://github.com/apache/arrow) is a columnar format and a toolbox for fast data interchange and in-memory analytics. Since PyArrow supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths ([`hf://`](/docs/huggingface_hub/guides/hf_file_system#integrations)) to read and write data on the Hub.

It is especially useful for [Parquet](https://parquet.apache.org/) data, since Parquet is the most common file format on Hugging Face. Indeed, Parquet is particularly efficient thanks to its structure, typing, metadata and compression.

## Load a Table

You can load data from local files or from remote storage like Hugging Face Datasets. PyArrow supports many formats including CSV, JSON and, more importantly, Parquet:

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table("path/to/data.parquet")
```

To load a file from Hugging Face, the path needs to start with `hf://`.
For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet` as a pyarrow Table (it requires `pyarrow>=21.0`):

```python
>>> import pyarrow.parquet as pq
>>> table = pq.read_table("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> table
pyarrow.Table
text: string
label: int64
----
text: [["I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it (... 1542 chars omitted)", ...],...,[..., "The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritan (... 221 chars omitted)"]]
label: [[0,0,0,0,0,...,0,0,0,0,0],...,[1,1,1,1,1,...,1,1,1,1,1]]
```

If you don't want to load the full Parquet data, you can get the Parquet metadata or load row group by row group instead:

```python
>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> pf.metadata
created_by: parquet-cpp-arrow version 12.0.0
num_columns: 2
num_rows: 25000
num_row_groups: 25
format_version: 2.6
serialized_size: 62036
>>> for i in range(pf.num_row_groups):
...     table = pf.read_row_group(i)
...     ...
```

For more information on the Hugging Face paths and how they are implemented, please refer to [the client library's documentation on the HfFileSystem](/docs/huggingface_hub/guides/hf_file_system).

## Save a Table

You can save a pyarrow Table using `pyarrow.parquet.write_table` to a local file or to Hugging Face directly.
To save the Table on Hugging Face, you first need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
hf auth login
```

Then you can [create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:

```python
from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")
```

Finally, you can use [Hugging Face paths](/docs/huggingface_hub/guides/hf_file_system#integrations) in PyArrow:

```python
import pyarrow.parquet as pq

pq.write_table(table, "hf://datasets/username/my_dataset/imdb.parquet", use_content_defined_chunking=True)

# or write in separate files if the dataset has train/validation/test splits
pq.write_table(table_train, "hf://datasets/username/my_dataset/train.parquet", use_content_defined_chunking=True)
pq.write_table(table_valid, "hf://datasets/username/my_dataset/validation.parquet", use_content_defined_chunking=True)
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", use_content_defined_chunking=True)
```

Note that Parquet files on Hugging Face are optimized to improve storage efficiency, accelerate downloads and uploads, and enable efficient dataset streaming and editing:

* [Parquet Content Defined Chunking](https://huggingface.co/blog/parquet-cdc) optimizes Parquet for [Xet](https://huggingface.co/docs/hub/en/xet/index), Hugging Face's storage backend. It accelerates uploads and downloads thanks to chunk-based deduplication and allows efficient file editing.
* Page index accelerates filters when streaming and enables efficient random access, e.g.
in the [Dataset Viewer](https://huggingface.co/docs/dataset-viewer).

PyArrow requires extra arguments to write optimized Parquet files:

```python
import pyarrow.parquet as pq

pq.write_table(
    table,
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)
```

* `use_content_defined_chunking=True` to enable Parquet Content Defined Chunking, for [deduplication](https://huggingface.co/blog/parquet-cdc) and [editing](./datasets-editing) (it requires `pyarrow>=21.0`)
* `write_page_index=True` to include a page index in the Parquet metadata, for [streaming and random access](./datasets-streaming)

> [!TIP]
> Content defined chunking (CDC) makes the Parquet writer chunk the data pages in a way that makes duplicate data chunked and compressed identically.
> Without CDC, the pages are arbitrarily chunked, and duplicate data is therefore impossible to detect because of compression.
> Thanks to CDC, Parquet uploads and downloads from Hugging Face are faster, since duplicate data is uploaded or downloaded only once.

Find more information about Xet [here](https://huggingface.co/join/xet).

## Leverage Xet deduplication for Parquet

Optimized Parquet files are written with Content Defined Chunking, which enables deduplication. This accelerates uploads since chunks of data that already exist on Hugging Face don't need to be uploaded again, which saves a lot of I/O. For example, this code uploads the content of `table`, and then for `edited_table` the upload is faster since it only uploads the chunks that changed:

```python
import pyarrow.parquet as pq

pq.write_table(
    table,
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)

edited_table = ...  # e.g.
# with added/modified/removed rows or columns
pq.write_table(
    edited_table,
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)
```

Chunks are ~64kB and Parquet saves data column per column, so in practice this is what happens when editing an Optimized Parquet file:

* add a new column -> only the chunks of the new column are uploaded
* add/edit/delete a row -> one chunk per column is uploaded

And in addition to this, the chunks of the Parquet footer containing metadata are also uploaded.

## Use Images

You can load a folder with a metadata file containing a field for the names or paths to the images, structured like this:

```
Example 1:              Example 2:
folder/                 folder/
├── metadata.parquet    ├── metadata.parquet
├── img000.png          └── images
├── img001.png              ├── img000.png
...                         ...
└── imgNNN.png              └── imgNNN.png
```

You can iterate on the images paths like this:

```python
from pathlib import Path

import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    image_path = folder_path / file_name
    ...
```

Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face, and the Dataset Viewer shows both the metadata and images.

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_image_dataset",
    repo_type="dataset",
)
```

### Embed Images inside Parquet

PyArrow has a binary type which allows storing the image bytes in Arrow tables.
Therefore, the dataset can be saved as a single Parquet file containing both the images (bytes and path) and the samples' metadata:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the image bytes in Arrow
image_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
table = table.append_column("image", image_array)

# (Optional) Set the HF Image type for the Dataset Viewer and the `datasets` library
features = {"image": {"_type": "Image"}}  # or using datasets.Features(...).to_dict()
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
```

Setting the Image type in the Arrow schema metadata lets other libraries and the Hugging Face Dataset Viewer know that "image" contains images and not just binary data.

## Use Audios

You can load a folder with a metadata file containing a field for the names or paths of the audio files, structured like this:

```
Example 1:             Example 2:
folder/                folder/
├── metadata.parquet   ├── metadata.parquet
├── rec000.wav         └── audios
├── rec001.wav             ├── rec000.wav
...                        ...
└── recNNN.wav             └── recNNN.wav
```

You can iterate over the audio paths like this:

```python
from pathlib import Path

import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    audio_path = folder_path / file_name
    ...
```

Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-audio#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and the audio.
```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_audio_dataset",
    repo_type="dataset",
)
```

### Embed Audio inside Parquet

PyArrow has a binary type that makes it possible to store the audio bytes in Arrow tables. Therefore, the dataset can be saved as a single Parquet file containing both the audio (bytes and path) and the samples' metadata:

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the audio bytes in Arrow
audio_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
table = table.append_column("audio", audio_array)

# (Optional) Set the HF Audio type for the Dataset Viewer and the `datasets` library
features = {"audio": {"_type": "Audio"}}  # or using datasets.Features(...).to_dict()
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
```

Setting the Audio type in the Arrow schema metadata lets other libraries and the Hugging Face Dataset Viewer recognise that "audio" contains audio data, not just binary data.

### Building with the SDK

https://huggingface.co/docs/hub/agents-sdk.md

# Building with the SDK

Build MCP-powered agents with the Hugging Face agentic SDKs. The `huggingface_hub` (Python) and `@huggingface/tiny-agents` (JavaScript) libraries provide everything you need to connect LLMs to MCP tools.
## Installation

```bash
pip install "huggingface_hub[mcp]"
```

```bash
npm install @huggingface/tiny-agents
# or
pnpm add @huggingface/tiny-agents
```

## Quick Start: Run an Agent

The fastest way to get started is with the `tiny-agents` CLI:

```bash
tiny-agents run julien-c/flux-schnell-generator
```

```bash
npx @huggingface/tiny-agents run "julien-c/flux-schnell-generator"
```

This loads an agent from the [tiny-agents collection](https://huggingface.co/datasets/tiny-agents/tiny-agents), connects to its MCP servers, and starts an interactive chat.

## Using the Agent Class

The `Agent` class manages the chat loop and MCP tool execution. It uses [Inference Providers](https://huggingface.co/docs/inference-providers) to run the LLM.

```python
import asyncio

from huggingface_hub import Agent

agent = Agent(
    model="Qwen/Qwen2.5-72B-Instruct",
    provider="novita",
    servers=[
        {
            "type": "sse",
            "url": "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
        }
    ]
)

async def main():
    async for chunk in agent.run("Generate an image of a sunset"):
        if hasattr(chunk, "choices"):
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="")

asyncio.run(main())
```

See the [Agent reference](https://huggingface.co/docs/huggingface_hub/package_reference/mcp#huggingface_hub.Agent) for all options.

```typescript
import { Agent } from "@huggingface/tiny-agents";

const agent = new Agent({
    model: "Qwen/Qwen2.5-72B-Instruct",
    provider: "novita",
    apiKey: process.env.HF_TOKEN,
    servers: [
        {
            type: "sse",
            url: "https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
        }
    ]
});

await agent.loadTools();
for await (const chunk of agent.run("Generate an image of a sunset")) {
    if ("choices" in chunk) {
        const delta = chunk.choices[0]?.delta;
        if (delta?.content) {
            console.log(delta.content);
        }
    }
}
```

See the [tiny-agents documentation](https://huggingface.co/docs/huggingface.js/tiny-agents/README) for all options.
## Using MCPClient Directly

For more control, use `MCPClient` to manage MCP servers and tool calls directly.

```python
import asyncio

from huggingface_hub import MCPClient

async def main():
    async with MCPClient(
        model="Qwen/Qwen2.5-72B-Instruct",
        provider="novita",
    ) as client:
        # Connect to an MCP server
        await client.add_mcp_server(
            type="sse",
            url="https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse"
        )

        # Process a request with tools
        messages = [{"role": "user", "content": "Generate an image of a sunset"}]
        async for chunk in client.process_single_turn_with_tools(messages):
            if hasattr(chunk, "choices"):
                delta = chunk.choices[0].delta
                if delta.content:
                    print(delta.content, end="")

asyncio.run(main())
```

See the [MCPClient reference](https://huggingface.co/docs/huggingface_hub/package_reference/mcp#huggingface_hub.MCPClient) for all options.

The JavaScript SDK uses the `Agent` class for MCP interactions. For lower-level control, see the [@huggingface/mcp-client](https://huggingface.co/docs/huggingface.js/mcp-client/README) package.

## Share Your Agent

Contribute agents to the [tiny-agents collection](https://huggingface.co/datasets/tiny-agents/tiny-agents) on the Hub.
Include:

- `agent.json` - Agent configuration (required)
- `PROMPT.md` or `AGENTS.md` - System prompt (optional)
- `EXAMPLES.md` - Sample prompts and use cases (optional)

## Learn More

- [huggingface_hub MCP Reference](https://huggingface.co/docs/huggingface_hub/package_reference/mcp) - Python API reference
- [tiny-agents Documentation](https://huggingface.co/docs/huggingface.js/tiny-agents/README) - JavaScript API reference
- [Inference Providers](https://huggingface.co/docs/inference-providers) - Available LLM providers
- [tiny-agents Collection](https://huggingface.co/datasets/tiny-agents/tiny-agents) - Browse community agents
- [MCP Server Guide](./agents-mcp) - Connect to the Hugging Face MCP Server

### Video Dataset

https://huggingface.co/docs/hub/datasets-video.md

# Video Dataset

This guide will show you how to configure your dataset repository with video files. A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub.

Additional information about your videos - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`).

Alternatively, videos can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format.

## Only videos

If your dataset only consists of one column with videos, you can simply store your video files at the root:

```
my_dataset_repository/
├── 1.mp4
├── 2.mp4
├── 3.mp4
└── 4.mp4
```

or in a subdirectory:

```
my_dataset_repository/
└── videos
    ├── 1.mp4
    ├── 2.mp4
    ├── 3.mp4
    └── 4.mp4
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including MP4, MOV and AVI.
```
my_dataset_repository/
└── videos
    ├── 1.mp4
    ├── 2.mov
    └── 3.avi
```

If you have several splits, you can put your videos into directories named accordingly:

```
my_dataset_repository/
├── train
│   ├── 1.mp4
│   └── 2.mp4
└── test
    ├── 3.mp4
    └── 4.mp4
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.

## Additional columns

If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like [video generation](https://huggingface.co/tasks/text-to-video) or [object detection](https://huggingface.co/tasks/object-detection).

```
my_dataset_repository/
└── train
    ├── 1.mp4
    ├── 2.mp4
    ├── 3.mp4
    ├── 4.mp4
    └── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column which links video files with their metadata:

```csv
file_name,text
1.mp4,an animation of a green pokemon with red eyes
2.mp4,a short video of a green and yellow toy with a red nose
3.mp4,a red and white ball shows an angry look on its face
4.mp4,a cartoon ball is smiling
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.mp4","text": "an animation of a green pokemon with red eyes"}
{"file_name": "2.mp4","text": "a short video of a green and yellow toy with a red nose"}
{"file_name": "3.mp4","text": "a red and white ball shows an angry look on its face"}
{"file_name": "4.mp4","text": "a cartoon ball is smiling"}
```

And for bigger datasets, or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`.
## Relative paths

The metadata file must be located either in the same directory as the videos it is linked to, or in any parent directory, like in this example:

```
my_dataset_repository/
└── train
    ├── videos
    │   ├── 1.mp4
    │   ├── 2.mp4
    │   ├── 3.mp4
    │   └── 4.mp4
    └── metadata.csv
```

In this case, the `file_name` column must be a full relative path to the videos, not just the filename:

```csv
file_name,text
videos/1.mp4,an animation of a green pokemon with red eyes
videos/2.mp4,a short video of a green and yellow toy with a red nose
videos/3.mp4,a red and white ball shows an angry look on its face
videos/4.mp4,a cartoon ball is smiling
```

Metadata files cannot be put in subdirectories of a directory containing the videos. More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the videos.

## Video classification

For video classification datasets, you can also use a simple setup: use directories to name the video classes. Store your video files in a directory structure like:

```
my_dataset_repository/
├── green
│   ├── 1.mp4
│   └── 2.mp4
└── red
    ├── 3.mp4
    └── 4.mp4
```

The dataset created with this structure contains two columns: `video` and `label` (with values `green` and `red`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```
my_dataset_repository/
├── test
│   ├── green
│   │   └── 2.mp4
│   └── red
│       └── 4.mp4
└── train
    ├── green
    │   └── 1.mp4
    └── red
        └── 3.mp4
```

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration).
If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
  - config_name: default # Name of the dataset subset, if applicable.
    drop_labels: true
```

## Large scale datasets

### WebDataset format

The [WebDataset](./datasets-webdataset) format is well suited for large scale video datasets. It consists of TAR archives containing videos and their metadata and is optimized for streaming. It is useful if you have a large number of videos and want streaming data loaders for large scale training.

```
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
```

To make a WebDataset TAR archive, create a directory containing the videos and metadata files to be archived, and create the TAR archive using e.g. the `tar` command. The usual size per archive is generally around 1GB. Make sure each video and metadata pair share the same file prefix, for example:

```
train-0000/
├── 000.mp4
├── 000.json
├── 001.mp4
├── 001.json
├── ...
├── 999.mp4
└── 999.json
```

Note that for user convenience and to enable the [Dataset Viewer](./data-studio), every dataset hosted in the Hub is automatically converted to Parquet format up to 5GB. Since videos can be quite large, the URLs to the videos are stored in the converted Parquet data without the video bytes themselves. Read more about it in the [Parquet format](./data-studio#access-the-parquet-files) documentation.

### Libraries

https://huggingface.co/docs/hub/models-libraries.md

# Libraries

The Hub has support for dozens of libraries in the Open Source ecosystem. Thanks to the `huggingface_hub` Python library, it's easy to enable sharing your models on the Hub. The Hub supports many libraries, and we're working on expanding this support. We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward.
The table below summarizes the supported libraries and their level of integration. Find all our supported libraries in [the model-libraries.ts file](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/model-libraries.ts).

| Library | Description | Inference Providers | Widgets | Download from Hub | Push to Hub |
|---|---|---|---|---|---|
| [Adapters](./adapters) | A unified Transformers add-on for parameter-efficient and modular fine-tuning. | ✅ | ✅ | ✅ | ✅ |
| [AllenNLP](./allennlp) | An open-source NLP research library, built on PyTorch. | ✅ | ✅ | ✅ | ❌ |
| [Asteroid](./asteroid) | PyTorch-based audio source separation toolkit | ✅ | ✅ | ✅ | ❌ |
| [BERTopic](./bertopic) | BERTopic is a topic modeling library for text and images | ✅ | ✅ | ✅ | ✅ |
| [Diffusers](./diffusers) | A modular toolbox for inference and training of diffusion models | ✅ | ✅ | ✅ | ✅ |
| [docTR](https://github.com/mindee/doctr) | Models and datasets for OCR-related tasks in PyTorch & TensorFlow | ✅ | ✅ | ✅ | ❌ |
| [ESPnet](./espnet) | End-to-end speech processing toolkit (e.g. TTS) | ✅ | ✅ | ✅ | ❌ |
| [fastai](./fastai) | Library to train fast and accurate models with state-of-the-art outputs. | ✅ | ✅ | ✅ | ✅ |
| [Keras](./keras) | Open-source multi-backend deep learning framework, with support for JAX, TensorFlow, and PyTorch. | ❌ | ❌ | ✅ | ✅ |
| [KerasNLP](https://keras.io/guides/keras_nlp/upload/) | Natural language processing library built on top of Keras that works natively with TensorFlow, JAX, or PyTorch. | ❌ | ❌ | ✅ | ✅ |
| [TF-Keras](./tf-keras) (legacy) | Legacy library that uses a consistent and simple API to build models leveraging TensorFlow and its ecosystem. | ❌ | ❌ | ✅ | ✅ |
| [Flair](./flair) | Very simple framework for state-of-the-art NLP. | ✅ | ✅ | ✅ | ✅ |
| [MBRL-Lib](https://github.com/facebookresearch/mbrl-lib) | PyTorch implementations of MBRL Algorithms. | ❌ | ❌ | ✅ | ✅ |
| [MidiTok](https://github.com/Natooz/MidiTok) | Tokenizers for symbolic music / MIDI files. | ❌ | ❌ | ✅ | ✅ |
| [ML-Agents](./ml-agents) | Enables games and simulations made with Unity to serve as environments for training intelligent agents. | ❌ | ❌ | ✅ | ✅ |
| [MLX](./mlx) | Model training and serving framework on Apple silicon made by Apple. | ❌ | ❌ | ✅ | ✅ |
| [NeMo](https://github.com/NVIDIA/NeMo) | Conversational AI toolkit built for researchers | ✅ | ✅ | ✅ | ❌ |
| [OpenCLIP](./open_clip) | Library for open-source implementation of OpenAI's CLIP | ❌ | ❌ | ✅ | ✅ |
| [PaddleNLP](./paddlenlp) | Easy-to-use and powerful NLP library built on PaddlePaddle | ✅ | ✅ | ✅ | ✅ |
| [PEFT](./peft) | Cutting-edge Parameter Efficient Fine-tuning Library | ✅ | ✅ | ✅ | ✅ |
| [Pyannote](https://github.com/pyannote/pyannote-audio) | Neural building blocks for speaker diarization. | ❌ | ❌ | ✅ | ❌ |
| [PyCTCDecode](https://github.com/kensho-technologies/pyctcdecode) | Language model supported CTC decoding for speech recognition | ❌ | ❌ | ✅ | ❌ |
| [Pythae](https://github.com/clementchadebec/benchmark_VAE) | Unified framework for Generative Autoencoders in Python | ❌ | ❌ | ✅ | ✅ |
| [RL-Baselines3-Zoo](./rl-baselines3-zoo) | Training framework for Reinforcement Learning, using [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3). | ❌ | ✅ | ✅ | ✅ |
| [Sample Factory](./sample-factory) | Codebase for high throughput asynchronous reinforcement learning. | ❌ | ✅ | ✅ | ✅ |
| [Sentence Transformers](./sentence-transformers) | Compute dense vector representations for sentences, paragraphs, and images. | ✅ | ✅ | ✅ | ✅ |
| [SetFit](./setfit) | Efficient few-shot text classification with Sentence Transformers | ✅ | ✅ | ✅ | ✅ |
| [spaCy](./spacy) | Advanced Natural Language Processing in Python and Cython. | ✅ | ✅ | ✅ | ✅ |
| [SpanMarker](./span_marker) | Familiar, simple and state-of-the-art Named Entity Recognition. | ✅ | ✅ | ✅ | ✅ |
| [Scikit Learn (using skops)](https://skops.readthedocs.io/en/stable/) | Machine Learning in Python. | ✅ | ✅ | ✅ | ✅ |
| [Speechbrain](./speechbrain) | A PyTorch Powered Speech Toolkit. | ✅ | ✅ | ✅ | ❌ |
| [Stable-Baselines3](./stable-baselines3) | Set of reliable implementations of deep reinforcement learning algorithms in PyTorch | ❌ | ✅ | ✅ | ✅ |
| [TensorFlowTTS](https://github.com/TensorSpeech/TensorFlowTTS) | Real-time state-of-the-art speech synthesis architectures. | ❌ | ❌ | ✅ | ❌ |
| [Timm](./timm) | Collection of image models, scripts, pretrained weights, etc. | ✅ | ✅ | ✅ | ✅ |
| [Transformers](./transformers) | State-of-the-art Natural Language Processing for PyTorch, TensorFlow, and JAX | ✅ | ✅ | ✅ | ✅ |
| [Transformers.js](./transformers-js) | State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server! | ❌ | ❌ | ✅ | ❌ |
| [Unity Sentis](./unity-sentis) | Inference engine for the Unity 3D game engine | ❌ | ❌ | ❌ | ❌ |

### How can I add support for a new library?

If you're interested in adding your library, please reach out to us! Read about it in [Adding a Library Guide](./models-adding-libraries).

### 🟧 Label Studio on Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker-label-studio.md

# 🟧 Label Studio on Spaces

[Label Studio](https://labelstud.io) is an [open-source data labeling platform](https://github.com/heartexlabs/label-studio) for labeling, annotating, and exploring many different data types.
Additionally, Label Studio includes a powerful [machine learning interface](https://labelstud.io/guide/ml.html) that can be used for new model training, active learning, supervised learning, and many other training techniques.

This guide will teach you how to deploy Label Studio for data labeling and annotation within the Hugging Face Hub. You can use the default configuration of Label Studio as a self-contained application hosted completely on the Hub using Docker for demonstration and evaluation purposes, or you can attach your own database and cloud storage to host a fully-featured, production-ready application on Spaces.

## ⚡️ Deploy Label Studio on Spaces

You can deploy Label Studio on Spaces with just a few clicks. Spaces requires you to define:

* An **Owner**: either your personal account or an organization you're a part of.
* A **Space name**: the name of the Space within the account you're creating the Space in.
* The **Visibility**: _private_ if you want the Space to be visible only to you or your organization, or _public_ if you want it to be visible to other users or applications using the Label Studio API (suggested).

## 🚀 Using the Default Configuration

By default, Label Studio is installed in Spaces with a configuration that uses local storage for the application database to store configuration, account credentials, and project information. Labeling tasks and data items are also held in local storage.

> [!WARNING]
> Storage in Hugging Face Spaces is ephemeral by default. To persist your data
> across restarts, attach a [Storage Bucket](./storage-buckets); see the
> [persistence section](#enable-persistence-with-hf-storage-buckets) below.

After launching Label Studio, you will be presented with the standard login screen. You can start by creating a new account using your email address and logging in with your new credentials.
Periodically after logging in, Label Studio will warn you that the storage is ephemeral and data could be lost if your Space is restarted. You will also be presented with a prompt from Heidi, the helpful Label Studio mascot, to create a new project to start labeling your data. To get started, check out the Label Studio ["Zero to One" tutorial](https://labelstud.io/blog/introduction-to-label-studio-in-hugging-face-spaces/) with a guide on how to build an annotation interface for sentiment analysis.

## 🛠️ Configuring a Production-Ready Instance of Label Studio

To make your Space production-ready, you will need to make three configuration changes:

* Disable the unrestricted creation of new accounts.
* Enable persistence by attaching a [Storage Bucket](./storage-buckets) or an external database.
* Optionally, attach cloud storage for labeling tasks.

### Disable Unrestricted Creation of New Accounts

The default configuration of Label Studio allows for the unrestricted creation of new accounts by anyone who has the URL of your application. You can [restrict signups](https://labelstud.io/guide/signup.html#Restrict-signup-for-local-deployments) by adding the following configuration secrets to your Space **Settings**.

* `LABEL_STUDIO_DISABLE_SIGNUP_WITHOUT_LINK`: Setting this value to `true` will disable unrestricted account creation.
* `LABEL_STUDIO_USERNAME`: The username of the account that you will use as the first user in your Label Studio Space. It should be a valid email address.
* `LABEL_STUDIO_PASSWORD`: The password that will be associated with the first user account.

Restart the Space to apply these settings. The ability to create new accounts from the login screen will be disabled. To create new accounts, you will need to invite new users in the `Organization` settings in the Label Studio application.
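If you'd rather script this than click through the Space **Settings** UI, the same secrets can be set with `huggingface_hub`. A minimal sketch, assuming placeholder credential values and a placeholder Space ID that you would replace with your own:

```python
from huggingface_hub import HfApi

# The signup-restriction secrets described above; the values here are placeholders
SIGNUP_SECRETS = {
    "LABEL_STUDIO_DISABLE_SIGNUP_WITHOUT_LINK": "true",
    "LABEL_STUDIO_USERNAME": "admin@example.com",  # must be a valid email address
    "LABEL_STUDIO_PASSWORD": "a-strong-password",
}

def configure_signup_secrets(api: HfApi, space_id: str) -> None:
    """Set the signup-restriction secrets on a Space, then restart it."""
    for key, value in SIGNUP_SECRETS.items():
        api.add_space_secret(space_id, key, value)
    api.restart_space(space_id)

# Usage (requires a token with write access to the Space):
# configure_signup_secrets(HfApi(), "your-username/label-studio")
```

Setting credentials as Space secrets rather than variables keeps them out of the public Space configuration.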
### Enable Persistence with HF Storage Buckets

By default, this Space stores all project configuration and data annotations in local storage with SQLite. If the Space is reset, all configuration and annotation data in the Space will be lost.

The simplest way to enable persistence is to attach a [Storage Bucket](./storage-buckets), which mounts persistent object storage directly into the Space. Label Studio writes its SQLite database and media uploads into the mounted bucket, so projects and annotations survive restarts.

1. **Create a bucket:**

   ```bash
   hf buckets create your-username/label-studio-data
   ```

2. **Attach it** in Space Settings → Storage Buckets, mount path `/data`.

3. **Set two Space Variables:**

   ```
   LABEL_STUDIO_BASE_DATA_DIR=/data
   STORAGE_PERSISTENCE=1
   ```

4. **Factory rebuild** the Space.

> [!TIP]
> Set a `SECRET_KEY` Space Secret to keep user sessions alive across restarts.
> Without it, Label Studio generates a random key on each boot and all users
> are logged out on restart.

#### Have a coding agent do it for you

If you'd rather not click through the Space settings, you can ask a coding agent with access to `huggingface_hub` to provision the Space for you.
Tell it your Space ID and bucket name, and it can run the equivalent of:

```python
from huggingface_hub import HfApi, Volume

api = HfApi()
space_id = "your-username/your-space-name"

# Attach the bucket at /data
api.set_space_volumes(
    space_id,
    volumes=[
        Volume(type="bucket", source="your-username/label-studio-data", mount_path="/data"),
    ],
)

# Tell Label Studio to write its SQLite DB and media into the mounted bucket
api.add_space_variable(space_id, "LABEL_STUDIO_BASE_DATA_DIR", "/data")
api.add_space_variable(space_id, "STORAGE_PERSISTENCE", "1")

# Optional: set a stable SECRET_KEY so sessions survive restarts
api.add_space_secret(space_id, "SECRET_KEY", "a-long-random-string")

# Factory rebuild so the new mount and variables take effect
api.restart_space(space_id, factory_reboot=True)
```

See the [`manage-spaces` guide](/docs/huggingface_hub/guides/manage-spaces) for more on managing Spaces and volume mounts via `huggingface_hub`.

### Enable Persistence with Postgres

For heavier multi-user deployments, you can instead enable persistence by [connecting an external Postgres database to your Space](https://labelstud.io/guide/storedata.html#PostgreSQL-database), guaranteeing that all project and annotation settings are preserved. Set the following secret variables to match your own hosted instance of Postgres. We strongly recommend setting these as secrets to prevent leaking information about your database service to the public in your Space's definition.

* `DJANGO_DB`: Set this to `default`.
* `POSTGRE_NAME`: Set this to the name of the Postgres database.
* `POSTGRE_USER`: Set this to the Postgres username.
* `POSTGRE_PASSWORD`: Set this to the password for your Postgres user.
* `POSTGRE_HOST`: Set this to the host that your Postgres database is running on.
* `POSTGRE_PORT`: Set this to the port that your Postgres database is running on.
* `STORAGE_PERSISTENCE`: Set this to `1` to remove the warning about ephemeral storage.

Restart the Space to apply these settings.
Information about users, projects, and annotations will be stored in the database, and will be reloaded by Label Studio if the Space is restarted or reset.

### Enable Cloud Storage

By default, the only data storage enabled for this Space is local. In the case of a Space reset, all data will be lost. To enable permanent storage, you must enable a [cloud storage connector](https://labelstud.io/guide/storage.html). Choose the appropriate cloud connector and configure the secrets for it.

#### Amazon S3

* `STORAGE_TYPE`: Set this to `s3`.
* `STORAGE_AWS_ACCESS_KEY_ID`: `<YOUR_ACCESS_KEY_ID>`
* `STORAGE_AWS_SECRET_ACCESS_KEY`: `<YOUR_SECRET_ACCESS_KEY>`
* `STORAGE_AWS_BUCKET_NAME`: `<YOUR_BUCKET_NAME>`
* `STORAGE_AWS_REGION_NAME`: `<YOUR_BUCKET_REGION>`
* `STORAGE_AWS_FOLDER`: Set this to an empty string.

#### Google Cloud Storage

* `STORAGE_TYPE`: Set this to `gcs`.
* `STORAGE_GCS_BUCKET_NAME`: `<YOUR_BUCKET_NAME>`
* `STORAGE_GCS_PROJECT_ID`: `<YOUR_PROJECT_ID>`
* `STORAGE_GCS_FOLDER`: Set this to an empty string.
* `GOOGLE_APPLICATION_CREDENTIALS`: Set this to `/opt/heartex/secrets/key.json`.

#### Azure Blob Storage

* `STORAGE_TYPE`: Set this to `azure`.
* `STORAGE_AZURE_ACCOUNT_NAME`: `<YOUR_STORAGE_ACCOUNT>`
* `STORAGE_AZURE_ACCOUNT_KEY`: `<YOUR_STORAGE_KEY>`
* `STORAGE_AZURE_CONTAINER_NAME`: `<YOUR_STORAGE_CONTAINER>`
* `STORAGE_AZURE_FOLDER`: Set this to an empty string.

## 🤗 Next Steps, Feedback, and Support

To get started with Label Studio, check out the Label Studio ["Zero to One" tutorial](https://labelstud.io/blog/introduction-to-label-studio-in-hugging-face-spaces/), which walks you through an example sentiment analysis annotation project. You can find a full set of resources about Label Studio and the Label Studio community at the [Label Studio Home Page](https://labelstud.io). This includes [full documentation](https://labelstud.io/guide/), an [interactive playground](https://labelstud.io/playground/) for trying out different annotation interfaces, and links to join the [Label Studio Slack Community](https://slack.labelstudio.heartex.com/?source=spaces).
### Repository Settings

https://huggingface.co/docs/hub/repositories-settings.md

# Repository Settings

## Repository visibility

You can choose a repository's visibility when you create it, and any repository that you own can have its visibility changed in the **Settings** tab. Unless your repository is owned by an [organization](./organizations), you are the only user that can make changes to your repo or upload any code.

For models and datasets, visibility can be toggled between *public* and *private*. For Spaces, visibility is set through a dropdown with three options: *public*, *protected*, and *private*. Protected visibility is available on [PRO](https://huggingface.co/pro) and [Team & Enterprise](https://huggingface.co/enterprise) plans. See [Spaces Overview](./spaces-overview#space-visibility) for details on protected Spaces.

Setting your visibility to *private* will:

- Ensure your repo does not show up in other users' search results.
- Return a `404 - Repo not found` error to other users who visit the URL of your private repo.
- Prevent other users from cloning your repo.

## Renaming or transferring a repo

If you own a repository, you can visit the **Settings** tab to manage its name and transfer ownership. Transferring or renaming a repo will automatically redirect the old URL to the new location, and will preserve download counts and likes. There are limitations that depend on [your access level permissions](./organizations-security).

Moving can be used in these use cases ✅

- Renaming a repository within the same user.
- Renaming a repository within the same organization. You must have "write" or "admin" rights in the organization.
- Transferring a repository from a user to an organization. You must be a member of the organization and have "contributor" rights, at least.
- Transferring a repository from an organization to yourself. You must have "admin" rights in the organization.
- Transferring a repository from a source organization to another target organization. You must have "admin" rights in the source organization **and** at least "contributor" rights in the target organization.

Moving does not work in the following cases ❌

- Transferring a repository from an organization to another user who is not yourself.
- Transferring a repository from a source organization to another target organization if the user does not have both "admin" rights in the source organization **and** at least "contributor" rights in the target organization.
- Transferring a repository from user A to user B.

If these are use cases you need help with, please send us an email at **website at huggingface.co**.

## Disabling Discussions / Pull Requests

You can disable all discussions and Pull Requests. Once disabled, all community and contribution features won't be available anymore. This action can be reverted without losing any previous discussions or Pull Requests.

### Agents

https://huggingface.co/docs/hub/agents.md

# Agents

`Hugging Face Agents` connect AI agents to the Hub. Using MCP (Model Context Protocol), Skills, or open-source tooling, agents can search models, explore datasets, run Spaces, and use community tools.

You can connect agents via the HF MCP Server, install pre-built Skills for coding agents, or build agents programmatically with the `huggingface_hub` SDK. Agents work with any MCP-compatible client, including ChatGPT, Claude Desktop, Cursor, VS Code, and more.
## Contents

- [Agents Overview](./agents-overview)
- [Hugging Face CLI for AI Agents](./agents-cli)
- [Hugging Face MCP Server](./agents-mcp)
- [Hugging Face Agent Skills](./agents-skills)
- [Building agents with the HF SDK](./agents-sdk)
- [Local Agents with llama.cpp and Pi](./agents-local)
- [Agent Libraries](./agents-libraries)

### Using GPU Spaces

https://huggingface.co/docs/hub/spaces-gpus.md

# Using GPU Spaces

You can upgrade your Space to use a GPU accelerator using the _Settings_ button in the top navigation bar of the Space. You can even request a free upgrade if you are building a cool demo for a side project!

> [!TIP]
> Longer-term, we would also like to expose non-GPU hardware, like HPU, IPU or TPU. If you have a specific AI hardware you'd like to run on, please let us know (website at huggingface.co).

As soon as your Space is running on GPU you can see which hardware it's running on directly from this badge:

## Hardware Specs

In the following tables, you can see the specs for the different upgrade options.

### CPU

| **Hardware** | **CPU** | **Memory** | **GPU Memory** | **Disk** | **Hourly Price** |
|--------------|---------|------------|----------------|----------|------------------|
| CPU Basic    | 2 vCPU  | 16 GB      | -              | 50 GB    | Free!            |
| CPU Upgrade  | 8 vCPU  | 32 GB      | -              | 50 GB    | $0.03            |

### GPU

| **Hardware**           | **CPU**  | **Memory** | **GPU Memory** | **Disk** | **Hourly Price** |
|------------------------|----------|------------|----------------|----------|------------------|
| Nvidia T4 - small      | 4 vCPU   | 15 GB      | 16 GB          | 50 GB    | $0.40            |
| Nvidia T4 - medium     | 8 vCPU   | 30 GB      | 16 GB          | 100 GB   | $0.60            |
| 1x Nvidia L4           | 8 vCPU   | 30 GB      | 24 GB          | 400 GB   | $0.80            |
| 4x Nvidia L4           | 48 vCPU  | 186 GB     | 96 GB          | 3200 GB  | $3.80            |
| 1x Nvidia L40S         | 8 vCPU   | 62 GB      | 48 GB          | 380 GB   | $1.80            |
| 4x Nvidia L40S         | 48 vCPU  | 382 GB     | 192 GB         | 3200 GB  | $8.30            |
| 8x Nvidia L40S         | 192 vCPU | 1534 GB    | 384 GB         | 6500 GB  | $23.50           |
| Nvidia A10G - small    | 4 vCPU   | 15 GB      | 24 GB          | 110 GB   | $1.00            |
| Nvidia A10G - large    | 12 vCPU  | 46 GB      | 24 GB          | 200 GB   | $1.50            |
| 2x Nvidia A10G - large | 24 vCPU  | 92 GB      | 48 GB          | 1000 GB  | $3.00            |
| 4x Nvidia A10G - large | 48 vCPU  | 184 GB     | 96 GB          | 2000 GB  | $5.00            |
| Nvidia A100 - large    | 12 vCPU  | 142 GB     | 80 GB          | 1000 GB  | $2.50            |
| ~~Nvidia H100~~ *(removed December 2025)* | | | | | |
| ~~8x Nvidia H100~~ *(removed December 2025)* | | | | | |
| 4x Nvidia A100         | 48 vCPU  | 568 GB     | 320 GB         | 4000 GB  | $10.00           |
| 8x Nvidia A100         | 96 vCPU  | 1136 GB    | 640 GB         | 8000 GB  | $20.00           |

## Configure hardware programmatically

You can programmatically configure your Space hardware using `huggingface_hub`. This allows for a wide range of use cases where you need to dynamically assign GPUs. Check out [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage_spaces) for more details.

## Framework specific requirements[[frameworks]]

Most Spaces should run out of the box after a GPU upgrade, but sometimes you'll need to install CUDA versions of the machine learning frameworks you use. Please follow this guide to ensure your Space takes advantage of the improved hardware.
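As a worked example of the hourly prices in the tables above, note that charges accrue per minute of runtime. The helper below is a purely illustrative sketch; the hardware slugs follow the flavor names used by `huggingface_hub` (an assumption, not part of the pricing tables):

```python
# Illustrative cost estimate for running a Space on upgraded hardware.
# Hourly rates are taken from the tables above; billing is per minute.
HOURLY_PRICE = {
    "cpu-upgrade": 0.03,
    "t4-small": 0.40,
    "t4-medium": 0.60,
    "a10g-small": 1.00,
    "a100-large": 2.50,
}

def estimated_cost(hardware: str, minutes: float) -> float:
    """Estimated cost in USD for running `hardware` for the given number of minutes."""
    return round(HOURLY_PRICE[hardware] * minutes / 60, 4)

# 90 minutes on a T4 small at $0.40/hour:
print(estimated_cost("t4-small", 90))  # 0.6
```

In practice you would pair an estimate like this with a custom sleep time (see below) so that idle minutes are not billed.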
### PyTorch

You'll need to install a version of PyTorch compatible with the built-in CUDA drivers. Adding the following two lines to your `requirements.txt` file should work:

```
--extra-index-url https://download.pytorch.org/whl/cu113
torch
```

You can verify whether the installation was successful by running the following code in your `app.py` and checking the output in your Space logs:

```Python
import torch

print(f"Is CUDA available: {torch.cuda.is_available()}")
# True
print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
# Tesla T4
```

Many frameworks automatically use the GPU if one is available. This is the case for the Pipelines in 🤗 `transformers`, `fastai` and many others. In other cases, or if you use PyTorch directly, you may need to move your models and data to the GPU to ensure computation is done on the accelerator and not on the CPU. You can use PyTorch's `.to()` syntax, for example:

```Python
model = load_pytorch_model()
model = model.to("cuda")
```

### JAX

If you use JAX, you need to specify the URL that contains CUDA compatible packages. Please add the following lines to your `requirements.txt` file:

```
-f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
jax[cuda11_pip]
jaxlib
```

After that, you can verify the installation by printing the output from the following code and checking it in your Space logs.

```Python
import jax

print(f"JAX devices: {jax.devices()}")
# JAX devices: [StreamExecutorGpuDevice(id=0, process_index=0)]
print(f"JAX device type: {jax.devices()[0].device_kind}")
# JAX device type: Tesla T4
```

### TensorFlow

The default `tensorflow` installation should recognize the CUDA device. Just add `tensorflow` to your `requirements.txt` file and use the following code in your `app.py` to verify in your Space logs.
```Python
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```

## Billing

Billing on Spaces is based on hardware usage and is computed by the minute: you get charged for every minute the Space runs on the requested hardware, regardless of whether the Space is used.

During a Space's lifecycle, it is only billed when the Space is `Starting` or `Running`. This means that there is no cost during build. If a running Space starts to fail, it will be automatically suspended and the billing will stop.

Spaces running on free hardware are suspended automatically if they are not used for an extended period of time (e.g. two days). Upgraded Spaces run indefinitely by default, even if there is no usage. You can change this behavior by [setting a custom "sleep time"](#sleep-time) in the Space's settings.

To interrupt the billing on your Space, you can change the Hardware to CPU basic, or [pause](#pause) it.

Additional information about billing can be found in the [dedicated Hub-wide section](./billing).

### Community GPU Grants

Do you have an awesome Space but need help covering the GPU hardware upgrade costs? We love helping out those with an innovative Space so please feel free to apply for a community GPU grant and see if yours makes the cut! This application can be found in your Space hardware repo settings in the lower left corner under "sleep time settings":

![Community GPU Grant](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/ask-for-community-grant.png)

## Set a custom sleep time[[sleep-time]]

If your Space runs on the default `cpu-basic` hardware, it will go to sleep if inactive for more than a set time (currently, 48 hours). Anyone visiting your Space will restart it automatically.

If you want your Space never to deactivate or if you want to set a custom sleep time, you need to upgrade to paid hardware.
By default, an upgraded Space will never go to sleep. However, you can use this setting for your upgraded Space to become idle (`stopped` stage) when it's unused 😴. You are not going to be charged for the upgraded hardware while it is asleep. The Space will 'wake up' or get restarted once it receives a new visitor. The corresponding interface, with the available sleep time options, will then be shown in your Space's hardware settings.

## Replicas

You can scale your Space horizontally by requesting multiple replicas. This distributes traffic across multiple instances of your Space for improved availability and throughput. You can set the number of replicas via the API:

```
POST https://huggingface.co/api/spaces/{namespace}/{repo}/replicas
Content-Type: application/json

{
  "replicas": 2
}
```

> [!NOTE]
> Replicas are only available for upgraded (paid) hardware. Each replica is billed independently.

## Streaming Logs, Events, and Metrics[[streaming]]

You can stream real-time logs, status events, and metrics from your Space via SSE (Server-Sent Events):

- **Build or run logs**: `GET /api/spaces/{namespace}/{repo}/logs/{build|run}`
- **Status events**: `GET /api/spaces/{namespace}/{repo}/events`
- **Metrics**: `GET /api/spaces/{namespace}/{repo}/metrics`

These endpoints require authentication and return data using the [SSE protocol](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events).

## Pausing a Space[[pause]]

You can `pause` a Space from the repo settings. A "paused" Space means that the Space is on hold and will not use resources until manually restarted, and only the owner of a paused Space can restart it. Paused time is not billed.

### Managing organizations

https://huggingface.co/docs/hub/organizations-managing.md

# Managing organizations

## Creating an organization

Visit the [New Organization](https://hf.co/organizations/new) form to create an organization.
## Managing members

New members can be added to an organization by visiting the **Organization settings** and clicking on the **Members** tab. There, you'll be able to generate an invite link, add members individually, or send out email invitations in bulk. If the **Allow requests to join from the organization page** setting is enabled, you'll also be able to approve or reject any pending requests on the **Members** page.

You can also revoke a user's membership or change their role on this page.

### Inviting members with SSO

If [Basic SSO](./security-sso-basic) is enabled on your organization, a direct join link can be copied and shared with new members. This SSO join link is available in both the **SSO** and **Members** settings tabs. Since organizations with SSO enabled cannot use classic invite links, the SSO join link is the primary method for inviting teammates to your organization.

Simply click the copy button to copy the link to your clipboard and share it with the members you want to invite. When recipients click the shared link, they will be able to authenticate via SSO and directly join your organization.

Organizations using [Managed SSO](./enterprise-advanced-sso) provision users directly through their Identity Provider via [SCIM](./enterprise-scim).

## Organization domain name

Under the **Account** tab in the Organization settings, you can set an **Organization email domain**. Specifying a domain will allow any user with a matching email address on the Hugging Face Hub to join your organization.

## Leaving an organization

Users can leave an organization by visiting their [organization settings](https://huggingface.co/settings/organizations) and clicking **Leave Organization** next to the organization they want to leave. Organization administrators can always remove users as explained above.
### Using Stable-Baselines3 at Hugging Face

https://huggingface.co/docs/hub/stable-baselines3.md

# Using Stable-Baselines3 at Hugging Face

`stable-baselines3` is a set of reliable implementations of reinforcement learning algorithms in PyTorch.

## Exploring Stable-Baselines3 in the Hub

You can find Stable-Baselines3 models by filtering at the left of the [models page](https://huggingface.co/models?library=stable-baselines3).

All models on the Hub come with useful features:

1. An automatically generated model card with a description, a training configuration, and more.
2. Metadata tags that help with discoverability.
3. Evaluation results to compare with other models.
4. A video widget where you can watch your agent performing.

## Install the library

To install the `stable-baselines3` library, you need to install two packages:

- `stable-baselines3`: the Stable-Baselines3 library.
- `huggingface-sb3`: additional code to load and upload Stable-Baselines3 models from the Hub.

```
pip install stable-baselines3
pip install huggingface-sb3
```

## Using existing models

You can simply download a model from the Hub using the `load_from_hub` function:

```
from huggingface_sb3 import load_from_hub

checkpoint = load_from_hub(
    repo_id="sb3/demo-hf-CartPole-v1",
    filename="ppo-CartPole-v1.zip",
)
```

You need to define two parameters:

- `repo_id`: the name of the Hugging Face repo you want to download.
- `filename`: the file you want to download.

## Sharing your models

You can easily upload your models using two different functions:

1. `package_to_hub()`: save the model, evaluate it, generate a model card and record a replay video of your agent before pushing the complete repo to the Hub.

```
package_to_hub(model=model,
               model_name="ppo-LunarLander-v2",
               model_architecture="PPO",
               env_id=env_id,
               eval_env=eval_env,
               repo_id="ThomasSimonini/ppo-LunarLander-v2",
               commit_message="Test commit")
```

You need to define seven parameters:

- `model`: your trained model.
- `model_name`: the name of the saved model.
- `model_architecture`: name of the architecture of your model (DQN, PPO, A2C, SAC...).
- `env_id`: name of the environment.
- `eval_env`: environment used to evaluate the agent.
- `repo_id`: the name of the Hugging Face repo you want to create or update. It's `/`.
- `commit_message`: the commit message.

2. `push_to_hub()`: simply push a file to the Hub:

```
push_to_hub(
    repo_id="ThomasSimonini/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
    commit_message="Added LunarLander-v2 model trained with PPO",
)
```

You need to define three parameters:

- `repo_id`: the name of the Hugging Face repo you want to create or update. It's `/`.
- `filename`: the file you want to push to the Hub.
- `commit_message`: the commit message.

## Additional resources

* Hugging Face Stable-Baselines3 [documentation](https://github.com/huggingface/huggingface_sb3#hugging-face--x-stable-baselines3-v20)
* Stable-Baselines3 [documentation](https://stable-baselines3.readthedocs.io/en/master/)

### Audio Dataset

https://huggingface.co/docs/hub/datasets-audio.md

# Audio Dataset

This guide will show you how to configure your dataset repository with audio files. You can find accompanying examples of repositories in this [Audio datasets examples collection](https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607).

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub.

---

Additional information about your audio files - such as transcriptions - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`).

Alternatively, audio files can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format.
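A metadata file of the kind mentioned above can be generated with nothing more than the Python standard library. The sketch below is illustrative only: the repository path, file names, and `transcription` column are placeholders, not part of any required schema beyond the `file_name` column:

```python
import csv
from pathlib import Path

repo = Path("my_dataset_repository")
repo.mkdir(exist_ok=True)

# One row per audio file in the repository; `file_name` is the required
# column, `transcription` is an illustrative extra column.
rows = [
    {"file_name": "1.wav", "transcription": "hello world"},
    {"file_name": "2.wav", "transcription": "good morning"},
]

with open(repo / "metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file_name", "transcription"])
    writer.writeheader()
    writer.writerows(rows)

print((repo / "metadata.csv").read_text())
```

The sections below detail where such a file must live in the repository and how its `file_name` column links rows to audio files.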
## Only audio files

If your dataset only consists of one column with audio, you can simply store your audio files at the root:

```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
└── 4.wav
```

or in a subdirectory:

```plaintext
my_dataset_repository/
└── audio
    ├── 1.wav
    ├── 2.wav
    ├── 3.wav
    └── 4.wav
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including AIFF, FLAC, MP3, OGG and WAV.

```plaintext
my_dataset_repository/
└── audio
    ├── 1.aiff
    ├── 2.ogg
    ├── 3.mp3
    └── 4.flac
```

If you have several splits, you can put your audio files into directories named accordingly:

```plaintext
my_dataset_repository/
├── train
│   ├── 1.wav
│   └── 2.wav
└── test
    ├── 3.wav
    └── 4.wav
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.

## Additional columns

If there is additional information you'd like to include about your dataset, like the transcription, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different audio tasks like [text-to-speech](https://huggingface.co/tasks/text-to-speech) or [automatic speech recognition](https://huggingface.co/tasks/automatic-speech-recognition).
```plaintext
my_dataset_repository/
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column which links the audio files with their metadata:

```csv
file_name,animal
1.wav,cat
2.wav,cat
3.wav,dog
4.wav,dog
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.wav","animal": "cat"}
{"file_name": "2.wav","animal": "cat"}
{"file_name": "3.wav","animal": "dog"}
{"file_name": "4.wav","animal": "dog"}
```

And for bigger datasets or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`.

## Relative paths

The metadata file must be located either in the same directory with the audio files it is linked to, or in any parent directory, like in this example:

```plaintext
my_dataset_repository/
└── test
    ├── audio
    │   ├── 1.wav
    │   ├── 2.wav
    │   ├── 3.wav
    │   └── 4.wav
    └── metadata.csv
```

In this case, the `file_name` column must be a full relative path to the audio files, not just the filename:

```csv
file_name,animal
audio/1.wav,cat
audio/2.wav,cat
audio/3.wav,dog
audio/4.wav,dog
```

Metadata files cannot be put in subdirectories of a directory with the audio files.

More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the audio files.

In this example, the `test` directory is used to set the name of the split. See [File names and splits](./datasets-file-names-and-splits) for more information.

## Audio classification

For audio classification datasets, you can also use a simple setup: use directories to name the audio classes.
Store your audio files in a directory structure like:

```plaintext
my_dataset_repository/
├── cat
│   ├── 1.wav
│   └── 2.wav
└── dog
    ├── 3.wav
    └── 4.wav
```

The dataset created with this structure contains two columns: `audio` and `label` (with values `cat` and `dog`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```plaintext
my_dataset_repository/
├── test
│   ├── cat
│   │   └── 2.wav
│   └── dog
│       └── 4.wav
└── train
    ├── cat
    │   └── 1.wav
    └── dog
        └── 3.wav
```

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
  - config_name: default  # Name of the dataset subset, if applicable.
    drop_labels: true
```

## Large scale datasets

### WebDataset format

The [WebDataset](./datasets-webdataset) format is well suited for large scale audio datasets (see [AlienKevin/sbs_cantonese](https://huggingface.co/datasets/AlienKevin/sbs_cantonese) for example). It consists of TAR archives containing audio files and their metadata and is optimized for streaming. It is useful if you have a large number of audio files and to get streaming data loaders for large scale training.

```plaintext
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
```

To make a WebDataset TAR archive, create a directory containing the audio files and metadata files to be archived and create the TAR archive using e.g. the `tar` command. The usual size per archive is generally around 1GB.
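The archiving step can also be scripted. Below is a minimal sketch using Python's built-in `tarfile` module (the shard name, number of pairs, and dummy file contents are illustrative); members are added in sorted order so each audio file sits next to its metadata file in the archive:

```python
import json
import tarfile
from pathlib import Path

shard_dir = Path("train-0000")
shard_dir.mkdir(exist_ok=True)

# Create a few dummy audio/metadata pairs (a real shard would contain
# actual FLAC bytes and real metadata).
for i in range(3):
    (shard_dir / f"{i:03d}.flac").write_bytes(b"\x00" * 16)
    (shard_dir / f"{i:03d}.json").write_text(json.dumps({"transcription": f"sample {i}"}))

# Add members in sorted order so each .flac is adjacent to its .json.
with tarfile.open("train-0000.tar", "w") as tar:
    for path in sorted(shard_dir.iterdir()):
        tar.add(path, arcname=path.name)

with tarfile.open("train-0000.tar") as tar:
    print(tar.getnames())
    # ['000.flac', '000.json', '001.flac', '001.json', '002.flac', '002.json']
```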
Make sure each audio file and metadata pair share the same file prefix, for example:

```plaintext
train-0000/
├── 000.flac
├── 000.json
├── 001.flac
├── 001.json
├── ...
├── 999.flac
└── 999.json
```

Note that for user convenience and to enable the [Dataset Viewer](./data-studio), every dataset hosted in the Hub is automatically converted to Parquet format up to 5GB. Read more about it in the [Parquet format](./data-studio#access-the-parquet-files) documentation.

### Parquet format

Instead of uploading the audio files and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file. This is useful if you have a large number of audio files, if you want to embed multiple audio columns, or if you want to store additional information about the audio in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```plaintext
my_dataset_repository/
└── train.parquet
```

Parquet files with audio data can be created using `pandas` or the `datasets` library.

To create Parquet files with audio data in `pandas`, you can use [pandas-audio-methods](https://github.com/lhoestq/pandas-audio-methods) and `df.to_parquet()`. In `datasets`, you can set the column type to `Audio()` and use the `ds.to_parquet(...)` method or `ds.push_to_hub(...)`. You can find a guide on loading audio datasets in `datasets` [here](/docs/datasets/audio_load).

Alternatively you can manually set the audio type of Parquet created using other tools. First, make sure your audio columns are of type _struct_, with a binary field `"bytes"` for the audio data and a string field `"path"` for the audio file name or path.
Then you should specify the feature types of the columns directly in YAML in the README header, for example:

```yaml
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: caption
    dtype: string
```

Note that Parquet is recommended for small audio files (<1MB per audio file) and small row groups (100 rows per row group, which is what `datasets` uses for audio). For larger audio files it is recommended to use the WebDataset format, or to share the original audio files (optionally with metadata files).

### How to configure OIDC SSO with Microsoft Entra ID (Azure AD)

https://huggingface.co/docs/hub/security-sso-azure-oidc.md

# How to configure OIDC SSO with Microsoft Entra ID (Azure AD)

This guide will use Microsoft Entra ID as the SSO provider and the Open ID Connect (OIDC) protocol as our preferred identity protocol.

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

## Step 1: Create a new application in your Identity Provider

Open a new tab/window in your browser and sign in to the Azure portal of your organization. Navigate to the Microsoft Entra ID admin center and click on "Enterprise applications". You'll be redirected to this page. Then click "New application" at the top and "Create your own application".

Input a name for your application (for example, Hugging Face SSO), then select "Register an application to integrate with Microsoft Entra ID (App you're developing)".

## Step 2: Configure your application on Azure

Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the OIDC protocol.

Copy the "Redirection URI" from the organization's settings on Hugging Face and paste it into the "Redirect URI" field on Azure Entra ID. Make sure you select "Web" in the dropdown menu. The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/oidc/consume`.

Save your new application.
## Step 3: Finalize configuration on Hugging Face

We will need to collect the following information to finalize the setup on Hugging Face:

- The Client ID of the OIDC app
- A Client secret of the OIDC app
- The Issuer URL of the OIDC app

In Microsoft Entra ID, navigate to Enterprise applications, and click on your newly created application in the list. In the application overview, click on "Single sign-on", then "Go to application".

In the OIDC app overview, you will find a copiable field named "Application (client) ID". Copy that ID to your clipboard and paste it into the "Client ID" field on Hugging Face.

Next, click "Endpoints" in the top menu in Microsoft Entra. Copy the value in the "OpenID connect metadata document" field and paste it into the "Issuer URL" field in Hugging Face.

Back in Microsoft Entra, navigate to "Certificates & secrets", and generate a new secret by clicking "New client secret". Once you have created the secret, copy the secret value and paste it into the "Client secret" field on Hugging Face.

You can now click "Update and Test OIDC configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt. Once logged in, you'll be redirected to your organization's settings page. A green check mark near the OIDC selector will attest that the test was successful.

## Step 4: Enable SSO in your organization

Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button. Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section.
### File formats

https://huggingface.co/docs/hub/datasets-polars-file-formats.md

# File formats

Polars supports the following file formats when reading from Hugging Face:

- [Parquet](https://docs.pola.rs/api/python/stable/reference/api/polars.read_parquet.html)
- [CSV](https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html)
- [JSON Lines](https://docs.pola.rs/api/python/stable/reference/api/polars.read_ndjson.html)

The examples below show the default settings only. Use the links above to view all available parameters in the API reference guide.

# Parquet

Parquet is the preferred file format as it stores the schema with type information within the file. This avoids any ambiguity with parsing and speeds up reading.

To read a Parquet file in Polars, use the `read_parquet` function:

```python
pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-00000-of-00004-2d5a1467fff1081b.parquet")
```

# CSV

The `read_csv` function can be used to read a CSV file:

```python
pl.read_csv("hf://datasets/lhoestq/demo1/data/train.csv")
```

# JSON

Polars supports reading new line delimited JSON, also known as [json lines](https://jsonlines.org/), with the `read_ndjson` function:

```python
pl.read_ndjson("hf://datasets/proj-persona/PersonaHub/persona.jsonl")
```

### Gated datasets

https://huggingface.co/docs/hub/datasets-gated.md

# Gated datasets

To give more control over how datasets are used, the Hub allows dataset authors to enable **access requests** for their datasets. When enabled, users must agree to share their contact information (username and email address) with the dataset authors to access the dataset files. Dataset authors can configure this request with additional fields. A dataset with access requests enabled is called a **gated dataset**. Access requests are always granted to individual users rather than to entire organizations.
A common use case of gated datasets is to provide access to early research datasets before the wider release.

## Manage gated datasets as a dataset author

To enable access requests, go to the dataset settings page. By default, the dataset is not gated. Click on **Enable Access request** in the top-right corner.

By default, access to the dataset is automatically granted to the user when requesting it. This is referred to as **automatic approval**. In this mode, any user can access your dataset once they've shared their personal information with you.

If you want to manually approve which users can access your dataset, you must set it to **manual approval**. When this is the case, you will notice more options:

- **Add access** allows you to search for a user and grant them access even if they did not request it.
- **Notification frequency** lets you configure when to get notified if new users request access. It can be set to once a day or real-time. By default, an email is sent to your primary email address. For datasets hosted under an organization, emails are by default sent to the first 5 admins of the organization. In both cases (user or organization) you can set a different email address in the **Notifications email** field.

### Review access requests

Once access requests are enabled, you have full control over who can access your dataset, whether the approval mode is manual or automatic. You can review and manage requests either from the UI or via the API.

#### From the UI

You can review who has access to your gated dataset from its settings page by clicking on the **Review access requests** button. This will open a modal with 3 lists of users:

- **pending**: the list of users waiting for approval to access your dataset. This list is empty unless you've selected **manual approval**. You can either **Accept** or **Reject** each request. If a request is rejected, the user cannot access your dataset and cannot request access again.
- **accepted**: the complete list of users with access to your dataset. You can choose to **Reject** access at any time for any user, whether the approval mode is manual or automatic. You can also **Cancel** the approval, which will move the user to the *pending* list.
- **rejected**: the list of users you've manually rejected. Those users cannot access your datasets. If they go to your dataset repository, they will see a message *Your request to access this repo has been rejected by the repo's authors*.

#### Via the API

You can automate the approval of access requests by using the API. You must pass a `token` with `write` access to the gated repository. To generate a token, go to [your user settings](https://huggingface.co/settings/tokens).

| Method | URI | Description | Headers | Payload |
| ------ | --- | ----------- | ------- | ------- |
| `GET` | `/api/datasets/{repo_id}/user-access-request/pending` | Retrieve the list of pending requests. | `{"authorization": "Bearer $token"}` | |
| `GET` | `/api/datasets/{repo_id}/user-access-request/accepted` | Retrieve the list of accepted requests. | `{"authorization": "Bearer $token"}` | |
| `GET` | `/api/datasets/{repo_id}/user-access-request/rejected` | Retrieve the list of rejected requests. | `{"authorization": "Bearer $token"}` | |
| `POST` | `/api/datasets/{repo_id}/user-access-request/handle` | Change the status of a given access request to `status`. | `{"authorization": "Bearer $token"}` | `{"status": "accepted"/"rejected"/"pending", "user": "username", "rejectionReason": "Optional rejection reason that will be visible to the user (max 200 characters)."}` |
| `POST` | `/api/datasets/{repo_id}/user-access-request/grant` | Allow a specific user to access your repo. | `{"authorization": "Bearer $token"}` | `{"user": "username"}` |

The base URL for the HTTP endpoints above is `https://huggingface.co`.

**NEW!** Those endpoints are now officially supported in our Python client `huggingface_hub`.
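The raw endpoints above can be called with any HTTP client. A minimal sketch using only the Python standard library is shown below; the token and repo id are placeholders, and the request is only constructed (sending it requires a valid `write` token for a gated repo you own):

```python
import json
import urllib.request

token = "hf_xxx"                    # placeholder: a token with write access
repo_id = "user/my-gated-dataset"   # placeholder: your gated dataset

# Accept a pending access request for a given user via the `handle` endpoint.
req = urllib.request.Request(
    f"https://huggingface.co/api/datasets/{repo_id}/user-access-request/handle",
    data=json.dumps({"status": "accepted", "user": "username"}).encode(),
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    method="POST",
)

# urllib.request.urlopen(req)  # uncomment to actually send the request
print(req.full_url)
```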
List the access requests to your dataset with [`list_pending_access_requests`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_pending_access_requests), [`list_accepted_access_requests`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_accepted_access_requests) and [`list_rejected_access_requests`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.list_rejected_access_requests). You can also accept, cancel and reject access requests with [`accept_access_request`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.accept_access_request), [`cancel_access_request`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.cancel_access_request), [`reject_access_request`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.reject_access_request). Finally, you can grant access to a user with [`grant_access`](/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.grant_access).

### Download access report

You can download a report of all access requests for a gated dataset with the **download user access report** button. Click on it to download a JSON file with a list of users. For each entry, you have:

- **user**: the user id. Example: *julien-c*.
- **fullname**: name of the user on the Hub. Example: *Julien Chaumond*.
- **status**: status of the request. Either `"pending"`, `"accepted"` or `"rejected"`.
- **email**: email of the user.
- **time**: datetime when the user initially made the request.

### Customize requested information

By default, users landing on your gated dataset will be asked to share their contact information (email and username) by clicking the **Agree and send request to access repo** button. If you want to request more user information to provide access, you can configure additional fields. This information will be accessible from the **Settings** tab.
To do so, add an `extra_gated_fields` property to your [dataset card metadata](./datasets-cards#dataset-card-metadata) containing a list of key/value pairs. The *key* is the name of the field and the *value* is its type, or an object with a `type` field. The list of field types is:
- `text`: a single-line text field.
- `checkbox`: a checkbox field.
- `date_picker`: a date picker field.
- `country`: a country dropdown. The list of countries is based on the [ISO 3166-1 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) standard.
- `select`: a dropdown with a list of options. The list of options is defined in the `options` field. Example: `options: ["option 1", "option 2", {label: "option3", value: "opt3"}]`.

Finally, you can also personalize the message displayed to the user with the `extra_gated_prompt` extra field.

Here is an example of a customized request form where the user is asked to provide their company name and country and to acknowledge that the dataset is for non-commercial use only.

```yaml
---
extra_gated_prompt: "You agree to not use the dataset to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this dataset for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to use this dataset for non-commercial use ONLY: checkbox
---
```

In some cases, you might also want to modify the default text in the gate heading, description, and button.
For those use cases, you can modify `extra_gated_heading`, `extra_gated_description` and `extra_gated_button_content` like this:

```yaml
---
extra_gated_heading: "Acknowledge license to accept the repository"
extra_gated_description: "Our team may take 2-3 days to process your request"
extra_gated_button_content: "Acknowledge license"
---
```

## Manage gated datasets as an organization (Team & Enterprise)

[Team & Enterprise](https://huggingface.co/docs/hub/en/enterprise) subscribers can create a Gating Group Collection to grant (or reject) access to all the models and datasets in a collection at once. More information about Gating Group Collections can be found in [our dedicated doc](https://huggingface.co/docs/hub/en/enterprise-gating-group-collections).

## Access gated datasets as a user

As a user, if you want to use a gated dataset, you will need to request access to it. This means that you must be logged in to a Hugging Face user account. Requesting access can only be done from your browser. Go to the dataset on the Hub and you will be prompted to share your information:

By clicking on **Agree**, you agree to share your username and email address with the dataset authors. In some cases, additional fields might be requested. To help the dataset authors decide whether to grant you access, try to fill out the form as completely as possible.

Once the access request is sent, there are two possibilities. If the approval mechanism is automatic, you immediately get access to the dataset files. Otherwise, the request has to be approved manually by the authors, which can take more time.

> [!WARNING]
> The dataset authors have complete control over dataset access. In particular, they can decide at any time to block your access to the dataset without prior notice, regardless of the approval mechanism, even if your request has already been approved.

### Download files

To download files from a gated dataset you'll need to be authenticated.
In the browser, this is automatic as long as you are logged in with your account. If you are using a script, you will need to provide a [user token](./security-tokens). In the Hugging Face Python ecosystem (`transformers`, `diffusers`, `datasets`, etc.), you can log in your machine using the [`huggingface_hub`](/docs/huggingface_hub/index) library, by running this in your terminal:

```bash
hf auth login
```

Alternatively, you can log in programmatically with `login()` in a notebook or a script:

```python
>>> from huggingface_hub import login
>>> login()
```

You can also provide the `token` parameter to most loading methods in the libraries (`from_pretrained`, `hf_hub_download`, `load_dataset`, etc.), directly from your scripts. For more details about how to log in, check out the [login guide](/docs/huggingface_hub/quick-start#login).

### Restricting Access for EU Users

For gated datasets, you can add an additional layer of access control to specifically restrict users from European Union countries. This is useful if your dataset's license or terms of use prohibit its distribution in the EU.

To enable this, add the `extra_gated_eu_disallowed: true` property to your dataset card's metadata.

**Important:** This feature will only activate if your dataset is already gated. If `gated: false` or the property is not set, this restriction will not apply.

```yaml
---
license: mit
gated: true
extra_gated_eu_disallowed: true
---
```

The system identifies a user's location based on their IP address.

### Accessing Benchmark Leaderboard Data

https://huggingface.co/docs/hub/leaderboard-data-guide.md

# Accessing Benchmark Leaderboard Data

[Benchmark datasets](./eval-results#benchmark-datasets) on the Hub contain leaderboards ranking models by their evaluation scores. You can access this data programmatically to analyse it, or to build dashboards or tools on top of it.
## Discovering official benchmarks

Use `huggingface_hub` to find all official benchmark datasets:

```python
from huggingface_hub import HfApi

api = HfApi()
for ds in api.list_datasets(benchmark=True):
    print(ds.id)
```

Or via the REST API directly (useful for agents and scripting):

```
GET https://huggingface.co/api/datasets?filter=benchmark:official
```

## Getting leaderboard rankings

The leaderboard API returns ranked model scores for a benchmark dataset:

```
GET https://huggingface.co/api/datasets/{dataset_id}/leaderboard
```

Use [`get_dataset_leaderboard`](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.HfApi.get_dataset_leaderboard) to fetch ranked model scores as typed [`DatasetLeaderboardEntry`](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.DatasetLeaderboardEntry) objects:

```python
from huggingface_hub import HfApi

api = HfApi()
leaderboard = api.get_dataset_leaderboard("SWE-bench/SWE-bench_Verified")
for entry in leaderboard[:5]:
    print(f"#{entry.rank} {entry.model_id}: {entry.value}")
```

> [!TIP]
> `huggingface_hub` uses your cached token by default. For gated benchmark datasets, make sure you are logged in (`huggingface-cli login`) or pass a token explicitly:
> ```python
> leaderboard = api.get_dataset_leaderboard("gated/benchmark", token="hf_...")
> ```

> [!TIP]
> Curl one-liner for quick access (useful for agents and scripting):
> ```bash
> curl https://huggingface.co/api/datasets/cais/hle/leaderboard \
>   --header "Authorization: Bearer $(cat ~/.cache/huggingface/token)" | jq .
> ```

### Response fields

Each [`DatasetLeaderboardEntry`](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.DatasetLeaderboardEntry) contains:

| Field | Description |
|---|---|
| `rank` | Position on the leaderboard |
| `model_id` | Full model ID (e.g. `Qwen/Qwen3.5-397B-A17B`) |
| `value` | The benchmark score |
| `verified` | Whether the result has been independently verified |
| `author` | A [`User`](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.User) or [`Organization`](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.Organization) object |
| `source` | Where the result was submitted from (model card, external, etc.) |
| `filename` | Path to the eval results YAML file (e.g. `.eval_results/swe_bench_verified.yaml`) |
| `pull_request` | PR number for the submission on the benchmark dataset repo |
| `notes` | Optional notes associated with the entry |

## Pre-aggregated multi-benchmark dataset

If you want scores from multiple benchmarks in a single file, the [`OpenEvals/leaderboard-data`](https://huggingface.co/datasets/OpenEvals/leaderboard-data) dataset aggregates scores across official benchmarks into one Parquet file.

You can load it directly with [pandas](./datasets-pandas) using the `hf://` path:

```python
import pandas as pd

df = pd.read_parquet(
    "hf://datasets/OpenEvals/leaderboard-data/data/train-00000-of-00001.parquet"
)
print(df[["model_name", "provider", "aime2026_score", "mmluPro_score"]].head())
```

This is the fastest way to get a cross-benchmark view without calling multiple API endpoints.

## Enriching with model metadata

Use `huggingface_hub` to enrich leaderboard data with release dates, parameter counts, and other metadata:

```python
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("Qwen/Qwen3.5-397B-A17B")
print(f"Released: {info.created_at}")
print(f"Parameters: {info.safetensors.total / 1e9:.1f}B" if info.safetensors else "")
```

## Model-centric view: eval results per model

The leaderboard API gives a dataset-centric view (all models on one benchmark).
For the reverse (all benchmark scores for a single model), use `model_info` with `expand=["evalResults"]`:

```python
from huggingface_hub import HfApi

api = HfApi()
info = api.model_info("Qwen/Qwen3.5-397B-A17B", expand=["evalResults"])
for result in info.eval_results:
    print(f"{result.dataset_id}: {result.value}")
```

This returns [`EvalResultEntry`](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api#huggingface_hub.EvalResultEntry) objects parsed from the model's `.eval_results/` files.

## Example: building on leaderboard data

The [Benchmark Leaderboard Race](https://huggingface.co/spaces/davanstrien/benchmark-race) Space combines these data sources to create an animated visualization of how model rankings evolve over time. You can build your own analyses and visualizations on top of this data; see the [source code](https://huggingface.co/spaces/davanstrien/benchmark-race/tree/main) for a complete example.

## Related

- [Eval Results](./eval-results): how to submit evaluation results and register benchmarks
- [Official Benchmark Datasets](https://huggingface.co/datasets?benchmark=benchmark:official&sort=trending): browse all official benchmarks

### Data Designer

https://huggingface.co/docs/hub/datasets-data-designer.md

# Data Designer

[Data Designer](https://github.com/NVIDIA-NeMo/DataDesigner) is NVIDIA NeMo's framework for generating high-quality synthetic datasets using LLMs. It enables you to create diverse data using statistical samplers, LLMs, or existing seed datasets.

## Prerequisites

```bash
pip install data-designer
```

## Download datasets from the Hub as seeds

Use `HuggingFaceSeedSource` to load datasets directly from the Hub as seed data for generation.
```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()

# Load seed data from HuggingFace
seed_source = dd.HuggingFaceSeedSource(
    path="datasets/gretelai/symptom_to_diagnosis/data/train.parquet",
    token="hf_...",  # Optional, for private datasets
)
config_builder.with_seed_dataset(seed_source)

# Reference seed columns in prompts
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="physician_notes",
        model_alias="openai-gpt-5",
        prompt="Write notes for a patient with {{ diagnosis }}. Symptoms: {{ patient_summary }}",
    )
)

preview = data_designer.preview(config_builder, num_records=5)
```

## Push generated datasets to the Hub

Use the built-in `push_to_hub` method to upload generated datasets to the Hub.

```python
# Generate dataset
results = data_designer.create(config_builder, num_records=1000, dataset_name="my-dataset")

# Push to Hub
url = results.push_to_hub(
    repo_id="username/my-synthetic-dataset",
    description="Synthetic dataset generated with Data Designer.",
    tags=["medical", "notes"],
    private=False,
)
```

## Resources

- [Data Designer Documentation](https://nvidia-nemo.github.io/DataDesigner/)
- [GitHub Repository](https://github.com/NVIDIA-NeMo/DataDesigner)
- [Seed Datasets Guide](https://nvidia-nemo.github.io/DataDesigner/latest/concepts/seed-datasets/)
- [Guide to using Data Designer with Inference Providers](https://huggingface.co/docs/inference-providers/integrations/datadesigner)

### TF-Keras (legacy)

https://huggingface.co/docs/hub/tf-keras.md

## TF-Keras (legacy)

`tf-keras` is the package name given to the Keras 2.x version. It is now hosted in a separate GitHub repo [here](https://github.com/keras-team/tf-keras). Though it's a legacy framework, there are still [4.5k+ models](https://huggingface.co/models?library=tf-keras&sort=trending) hosted on the Hub. These models can be loaded using the `huggingface_hub` library.
You **must** have either `tf-keras` or Keras 2.x installed on your machine to load them; you can then use [`from_pretrained_keras`](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/mixins#huggingface_hub.from_pretrained_keras) to download and instantiate a model from the Hub.

You can also host your `tf-keras` model on the Hub. However, keep in mind that `tf-keras` is a legacy framework. To reach the maximum number of users, we recommend creating your model using Keras 3.x and sharing it natively as described above. For more details about uploading `tf-keras` models, check out the [`push_to_hub_keras` documentation](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/mixins#huggingface_hub.push_to_hub_keras).

```py
from huggingface_hub import push_to_hub_keras

push_to_hub_keras(model,
    "your-username/your-model-name",
    "your-tensorboard-log-directory",
    tags=["object-detection", "some_other_tag"],
    **model_save_kwargs,
)
```

## Additional resources

- [GitHub repo](https://github.com/keras-team/tf-keras)
- Blog post [Putting Keras on 🤗 Hub for Collaborative Training and Reproducibility](https://merveenoyan.medium.com/putting-keras-on-hub-for-collaborative-training-and-reproducibility-9018301de877) (April 2022)

### Pickle Scanning

https://huggingface.co/docs/hub/security-pickle.md

# Pickle Scanning

Pickle is a widely used serialization format in ML. Most notably, it is the default format for PyTorch model weights.

There are dangerous arbitrary code execution attacks that can be perpetrated when you load a pickle file. We suggest loading models from users and organizations you trust, relying on signed commits, and/or loading models from TF or Jax formats with the `from_tf=True` auto-conversion mechanism. We also alleviate this issue by displaying/"vetting" the list of imports in any pickled file, directly on the Hub. Finally, we are experimenting with a new, simple serialization format for weights called [`safetensors`](https://github.com/huggingface/safetensors).

## What is a pickle?
From the [official docs](https://docs.python.org/3/library/pickle.html):

> The `pickle` module implements binary protocols for serializing and de-serializing a Python object structure.

What this means is that pickle is a serializing protocol, something you use to efficiently share data amongst parties. We call a pickle the binary file that is generated while pickling.

At its core, a pickle is basically a stack of instructions or opcodes. As you have probably guessed, it's not human readable. The opcodes are generated when pickling and read sequentially at unpickling. Based on the opcode, a given action is executed.

Here's a small example:

```python
import pickle
import pickletools

var = "data I want to share with a friend"

# store the pickle data in a file named 'payload.pkl'
with open('payload.pkl', 'wb') as f:
    pickle.dump(var, f)

# disassemble the pickle
# and print the instructions to the command line
with open('payload.pkl', 'rb') as f:
    pickletools.dis(f)
```

When you run this, it will create a pickle file and print the following instructions in your terminal:

```python
    0: \x80 PROTO      4
    2: \x95 FRAME      48
   11: \x8c SHORT_BINUNICODE 'data I want to share with a friend'
   57: \x94 MEMOIZE    (as 0)
   58: .    STOP
highest protocol among opcodes = 4
```

Don't worry too much about the instructions for now; just know that the [pickletools](https://docs.python.org/3/library/pickletools.html) module is very useful for analyzing pickles. It allows you to read the instructions in the file ***without*** executing any code.

Pickle is not simply a serialization protocol: it allows more flexibility by giving users the ability to run Python code at de-serialization time. Doesn't sound good, does it?

## Why is it dangerous?

As we've stated above, de-serializing pickle means that code can be executed. But this comes with certain limitations: you can only reference functions and classes from the top level module; you cannot embed them in the pickle file itself.
Back to the drawing board:

```python
import pickle
import pickletools

class Data:
    def __init__(self, important_stuff: str):
        self.important_stuff = important_stuff

d = Data("42")

with open('payload.pkl', 'wb') as f:
    pickle.dump(d, f)
```

When we run this script we get the `payload.pkl` again. When we check the file's contents:

```bash
# cat payload.pkl
__main__Data)}important_stuff42sb.%

# hexyl payload.pkl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 80 04 95 33 00 00 00 00 ┊ 00 00 00 8c 08 5f 5f 6d │ו×30000┊000ו__m│
│00000010│ 61 69 6e 5f 5f 94 8c 04 ┊ 44 61 74 61 94 93 94 29 │ain__××•┊Data×××)│
│00000020│ 81 94 7d 94 8c 0f 69 6d ┊ 70 6f 72 74 61 6e 74 5f │××}×וim┊portant_│
│00000030│ 73 74 75 66 66 94 8c 02 ┊ 34 32 94 73 62 2e       │stuff××•┊42×sb.  │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
```

We can see that there isn't much in there, just a few opcodes and the associated data. You might be thinking: so what's the problem with pickle?

Let's try something else:

```python
from fickling.pickle import Pickled
import pickle

# Create a malicious pickle
data = "my friend needs to know this"
pickle_bin = pickle.dumps(data)

p = Pickled.load(pickle_bin)
p.insert_python_exec('print("you\'ve been pwned !")')

with open('payload.pkl', 'wb') as f:
    p.dump(f)

# innocently unpickle and get your friend's data
with open('payload.pkl', 'rb') as f:
    data = pickle.load(f)
    print(data)
```

Here we're using the [fickling](https://github.com/trailofbits/fickling) library for simplicity.
It allows us to add pickle instructions to execute code contained in a string via the `exec` function. This is how you circumvent the fact that you cannot define functions or classes in your pickles: you run `exec` on Python code saved as a string.

When you run this, it creates a `payload.pkl` and prints the following:

```
you've been pwned !
my friend needs to know this
```

If we check the contents of the pickle file, we get:

```bash
# cat payload.pkl
c__builtin__
exec
(Vprint("you've been pwned !")
tR my friend needs to know this.%

# hexyl payload.pkl
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 63 5f 5f 62 75 69 6c 74 ┊ 69 6e 5f 5f 0a 65 78 65 │c__built┊in___exe│
│00000010│ 63 0a 28 56 70 72 69 6e ┊ 74 28 22 79 6f 75 27 76 │c_(Vprin┊t("you'v│
│00000020│ 65 20 62 65 65 6e 20 70 ┊ 77 6e 65 64 20 21 22 29 │e been p┊wned !")│
│00000030│ 0a 74 52 80 04 95 20 00 ┊ 00 00 00 00 00 00 8c 1c │_tR×•× 0┊000000ו│
│00000040│ 6d 79 20 66 72 69 65 6e ┊ 64 20 6e 65 65 64 73 20 │my frien┊d needs │
│00000050│ 74 6f 20 6b 6e 6f 77 20 ┊ 74 68 69 73 94 2e       │to know ┊this×.  │
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘
```

Basically, this is what's happening when you unpickle:

```python
# ...
opcodes_stack = [exec_func, "malicious argument", "REDUCE"]
opcode = stack.pop()
if opcode == "REDUCE":
    arg = opcodes_stack.pop()
    callable = opcodes_stack.pop()
    opcodes_stack.append(callable(arg))
# ...
```

The instructions that pose a threat are `STACK_GLOBAL`, `GLOBAL` and `REDUCE`.
`REDUCE` is what tells the unpickler to execute the function with the provided arguments, and the `*GLOBAL` instructions tell the unpickler to `import` stuff.

To sum up, pickle is dangerous because:
- when importing a Python module, arbitrary code can be executed
- you can import builtin functions like `eval` or `exec`, which can be used to execute arbitrary code
- when instantiating an object, the constructor may be called

This is why most docs that use pickle state: do not unpickle data from untrusted sources.

## Mitigation Strategies

***Don't use pickle***

Sound advice Luc, but pickle is used profusely and isn't going anywhere soon: finding a new format everyone is happy with and initiating the change will take some time.

So what can we do for now?

### Load files from users and organizations you trust

On the Hub, you have the ability to [sign your commits with a GPG key](./security-gpg). This does **not** guarantee that your file is safe, but it does guarantee the origin of the file.

If you know and trust user A, and the commit that includes the file on the Hub is signed by user A's GPG key, it's pretty safe to assume that you can trust the file.

### Load model weights from TF or Flax

TensorFlow and Flax checkpoints are not affected, and can be loaded within PyTorch architectures using the `from_tf` and `from_flax` kwargs for the `from_pretrained` method to circumvent this issue. E.g.:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("google-bert/bert-base-cased", from_flax=True)
```

### Use your own serialization format

- [MsgPack](https://msgpack.org/index.html)
- [Protobuf](https://developers.google.com/protocol-buffers)
- [Cap'n'proto](https://capnproto.org/)
- [Avro](https://avro.apache.org/)
- [safetensors](https://github.com/huggingface/safetensors)

This last format, `safetensors`, is a simple serialization format that we are currently working on and experimenting with!
Please help or contribute if you can 🔥.

### Improve `torch.load/save`

There's an open discussion in progress at PyTorch on having a [safe way of loading only weights from a *.pt file by default](https://github.com/pytorch/pytorch/issues/52181); please chime in there!

### Hub's Security Scanner

#### What we have now

We have created a security scanner that scans every file pushed to the Hub and runs security checks. At the time of writing, it runs two types of scans:

- ClamAV scans
- Pickle Import scans

For ClamAV scans, files are run through the open-source antivirus [ClamAV](https://www.clamav.net). While this covers a good amount of dangerous files, it doesn't cover pickle exploits.

We have implemented a Pickle Import scan, which extracts the list of imports referenced in a pickle file. Every time you upload a `pytorch_model.bin` or any other pickled file, this scan is run.

On the Hub, the list of imports will be displayed next to each file containing imports. If any import looks suspicious, it will be highlighted.

We get this data thanks to [`pickletools.genops`](https://docs.python.org/3/library/pickletools.html#pickletools.genops), which allows us to read the file without executing potentially dangerous code. Note that this is what allows us to know whether, when unpickling a file, it will `REDUCE` on a potentially dangerous function that was imported by `*GLOBAL`.

***Disclaimer***: this is not 100% foolproof. It is your responsibility as a user to check if something is safe or not. We are not actively auditing Python packages for safety; the safe/unsafe import lists we have are maintained in a best-effort manner.
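The same idea of allowing only known-safe imports can also be applied at load time on your own machine. Below is a minimal sketch (not the Hub's scanner) of a whitelist unpickler built on the standard `pickle.Unpickler.find_class` hook from the Python docs; the `ALLOWED` set and helper names are illustrative.

```python
import io
import pickle
from collections import OrderedDict

# Illustrative whitelist: only imports you explicitly trust are resolved.
ALLOWED = {("collections", "OrderedDict")}

class WhitelistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every GLOBAL/STACK_GLOBAL opcode at unpickling time.
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked import: {module}.{name}")
        return super().find_class(module, name)

def safe_loads(data: bytes):
    return WhitelistUnpickler(io.BytesIO(data)).load()

# Plain data structures need no imports at all and load fine:
print(safe_loads(pickle.dumps({"weights": [1, 2, 3]})))

# A whitelisted global is resolved normally:
print(safe_loads(pickle.dumps(OrderedDict(a=1))))

# A reference to `eval` is rejected before any code can run:
try:
    safe_loads(pickle.dumps(eval))
except pickle.UnpicklingError as e:
    print(e)  # blocked import: builtins.eval
```

Note that this only restricts which globals can be resolved; as discussed below, opcode-level tricks mean such an unpickler is not a complete defense on its own.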
Please contact us by sending an email to website at huggingface.co if you think something is not safe, and we will flag it as such.

#### Potential solutions

One could think of creating a custom [Unpickler](https://docs.python.org/3/library/pickle.html#pickle.Unpickler) along the lines of [this one](https://github.com/facebookresearch/CrypTen/blob/main/crypten/common/serial.py). But as we can see in this [sophisticated exploit](https://ctftime.org/writeup/16723), this won't work.

Thankfully, there is always a trace of the `eval` import, so reading the opcodes directly should allow us to catch malicious usage.

The current solution I propose is creating a file resembling a `.gitignore` but for imports. This file would be a whitelist of imports, and a `pytorch_model.bin` file would be flagged as dangerous if it contains imports not included in the whitelist. One could imagine having a regex-ish format where you could, for instance, allow all numpy submodules via a simple line like `numpy.*`.

## Further Reading

- [pickle - Python object serialization - Python 3.10.6 documentation](https://docs.python.org/3/library/pickle.html#what-can-be-pickled-and-unpickled)
- [Dangerous Pickles - Malicious Python Serialization](https://intoli.com/blog/dangerous-pickles/)
- [GitHub - trailofbits/fickling: A Python pickling decompiler and static analyzer](https://github.com/trailofbits/fickling)
- [Exploiting Python pickles](https://davidhamann.de/2020/04/05/exploiting-python-pickle/)
- [cpython/pickletools.py at 3.10 · python/cpython](https://github.com/python/cpython/blob/3.10/Lib/pickletools.py)
- [cpython/pickle.py at 3.10 · python/cpython](https://github.com/python/cpython/blob/3.10/Lib/pickle.py)
- [CrypTen/serial.py at main · facebookresearch/CrypTen](https://github.com/facebookresearch/CrypTen/blob/main/crypten/common/serial.py)
- [CTFtime.org / Balsn CTF 2019 / pyshv1 / Writeup](https://ctftime.org/writeup/16723)
- [Rehabilitating Python's pickle module](https://github.com/moreati/pickle-fuzz)

### Hugging Face MCP Server

https://huggingface.co/docs/hub/agents-mcp.md

# Hugging Face MCP Server

The Hugging Face MCP (Model Context Protocol) Server connects your MCP-compatible AI assistant (for example Codex, Cursor, VS Code extensions, Zed, ChatGPT or Claude Desktop) directly to the Hugging Face Hub. Once connected, your assistant can search and explore Hub resources and use community tools, all from within your editor, chat, or CLI.

## What you can do

- Search and explore Hub resources: models, datasets, Spaces, and papers.
- Search the Hugging Face documentation with natural language queries.
- Run community tools via MCP-compatible Gradio apps hosted on [Spaces](https://hf.co/spaces).
- Bring results back into your assistant with metadata, links, and context.

## Get started

1. Open your [MCP settings](https://huggingface.co/settings/mcp) while logged in.
2. Pick your client: select your MCP-compatible client (for example Cursor, VS Code, Zed, Claude Desktop). The page shows client-specific instructions and a ready-to-copy configuration snippet.
3. Paste and restart: copy the snippet into your client's MCP configuration, save, and restart/reload the client. You should see "Hugging Face" (or similar) listed as a connected MCP server in your client.

> [!TIP]
> The settings page generates the exact configuration your client expects. Use it rather than writing config by hand.

![MCP Settings Example](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hf-mcp-settings.png)

## Using the server

After connecting, ask your assistant to use the Hugging Face tools. Example prompts:

- "Search Hugging Face models for Qwen 3 Quantizations."
- "Find a Space that can transcribe audio files."
- "Show datasets about weather time-series."
- "Create a 1024 x 1024 image of a cat in Ghibli style."
- "How do I use LoRA adapters with PEFT?" (uses Documentation Semantic Search)
- "Find papers about vision-language models."
Your assistant will call MCP tools exposed by the Hugging Face MCP Server (including Spaces you have selected, as shown in the next section) and return results (titles, owners, downloads, links, and so on). You can then open the resource on the Hub or continue iterating in the same chat.

![HF MCP with Spaces in VS Code](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hf-mcp-vscode.png)

## Built-in Tools

The Hugging Face MCP Server includes several built-in tools that connect your AI assistant to the Hugging Face ecosystem. You can enable or disable each tool from your [MCP settings](https://huggingface.co/settings/mcp).

| Tool | Description |
|------|-------------|
| **Spaces Semantic Search** | Find the best AI Apps via natural language queries. |
| **Papers Semantic Search** | Find ML Research Papers via natural language queries. |
| **Model Search** | Search for ML models with filters for task, library, and more. |
| **Dataset Search** | Search for datasets with filters for author, tags, and more. |
| **Documentation Semantic Search** | Search the Hugging Face documentation using natural language. Great for finding guides, API references, and tutorials across all Hugging Face libraries. |
| **Run and Manage Jobs** | Run, monitor, and schedule jobs on Hugging Face infrastructure. |
| **Hub Repository Details** | Get detailed information about Models, Datasets, and Spaces. Optionally enable **Include repository README files** to include README content in results. |

> [!TIP]
> Enable **Documentation Semantic Search** to let your assistant find relevant Hugging Face documentation. For example, ask "How do I fine-tune a model with PEFT?" or "What are the options for the transformers Trainer?"

## Add community tools (Spaces)

You can extend your setup with MCP-compatible Gradio Spaces built by the community:

- Explore Spaces with MCP support [here](https://huggingface.co/spaces?filter=mcp-server).
- Add the relevant Space in your MCP settings on Hugging Face [here](https://huggingface.co/settings/mcp).

Gradio MCP apps expose their functions as tools (with arguments and descriptions) so your assistant can call them directly. Please restart or refresh your client so it picks up any new tools you add.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/ex9KRpvamn84ZaOlSp_Bj.png)

Check out our dedicated guide to Spaces as MCP servers [here](https://huggingface.co/docs/hub/spaces-mcp-servers#add-an-existing-space-to-your-mcp-tools).

### Spaces options

Your [MCP settings](https://huggingface.co/settings/mcp) provide several options to customize how Spaces work:

| Option | Description |
|--------|-------------|
| **Dynamic Spaces** *(Experimental)* | Dynamically call MCP Spaces at runtime. When enabled, your assistant can discover and use MCP-compatible Spaces on the fly without adding them manually. |
| **Remove Embedded Images** | Remove embedded images generated by Gradio Spaces. Useful if your MCP client has limited image support or you want text-only responses. |
| **MCP-UI Support** *(Experimental)* | Embed Gradio Spaces directly in your mcp-ui client. This enables richer interactive experiences when your client supports it. |

## Learn more

- Settings and client setup: https://huggingface.co/settings/mcp
- Changelog announcement: https://huggingface.co/changelog/hf-mcp-server
- Hugging Face MCP Server: https://huggingface.co/mcp
- Build your own MCP Server with Gradio Spaces: https://www.gradio.app/guides/building-mcp-server-with-gradio

### Configure the Dataset Viewer

https://huggingface.co/docs/hub/datasets-viewer-configure.md

# Configure the Dataset Viewer

The Dataset Viewer supports many [data file formats](./datasets-adding#file-formats), from text to tabular and from image to audio formats. It also separates the train/validation/test splits based on file and folder names.
To configure the Dataset Viewer for your dataset, first make sure your dataset is in a [supported data format](./datasets-adding#file-formats).

## Configure dropdowns for splits or subsets

In the Dataset Viewer you can view the [train/validation/test](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets) splits of datasets, and sometimes additionally choose between multiple subsets (e.g. one per language). To define those dropdowns, you can name the data files or their folders after their split names (train/validation/test). It is also possible to customize your splits manually using YAML.

For more information, feel free to check out the documentation on [Data files Configuration](./datasets-data-files-configuration) and the [collections of example datasets](https://huggingface.co/datasets-examples). The [Image Dataset doc page](./datasets-image) proposes various methods to structure a dataset with images.

## Disable the viewer

The dataset viewer can be disabled. To do this, add a YAML section to the dataset's `README.md` file (create one if it does not already exist) and add a `viewer` property with the value `false`.

```yaml
---
viewer: false
---
```

## Private datasets

For **private** datasets, the Dataset Viewer is enabled for [PRO users](https://huggingface.co/pricing) and [Team or Enterprise organizations](https://huggingface.co/enterprise).

### Hub Local Cache

https://huggingface.co/docs/hub/local-cache.md

# Hub Local Cache

This document describes the on-disk layout of the HF Hub local cache. It is intended as a reference for reimplementing the cache system in any language.

Here is a partial list of applications and libraries that use this cache layout. Please open a PR to add your own.

| Library or Application | Language | Notes |
|---------|----------|-------|
| [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) | Python | And any library that depends on it (e.g. `transformers`, `diffusers`, `datasets`, `mlx`, `vllm`, …) |
| [`hf-hub`](https://github.com/huggingface/hf-hub) | Rust | |
| [`swift-huggingface`](https://github.com/huggingface/swift-huggingface) | Swift | |
| [`@huggingface/hub`](https://github.com/huggingface/huggingface.js) | JavaScript | Node.js only |
| [`HuggingFaceModelDownloader`](https://github.com/bodaay/HuggingFaceModelDownloader) | Go | |
| [`llama.cpp`](https://github.com/ggml-org/llama.cpp) | C++ | *Work in progress* |

## Cache location

The default cache directory is:

```
~/.cache/huggingface/hub
```

This can be overridden with environment variables:

- `HF_HUB_CACHE` - direct path to the cache directory (takes priority)
- `HF_HOME` - path to the Hugging Face home directory; if set, the cache lives at `$HF_HOME/hub`

## Overview

```
<CACHE_DIR>/
├── .locks/                   # Lock files for concurrent download safety
├── models--<org>--<name>/    # Cached model repositories
├── datasets--<org>--<name>/  # Cached dataset repositories
└── spaces--<org>--<name>/    # Cached space repositories
```

Each downloaded repository gets a single flat folder. Inside each repo folder, files are stored once in a content-addressed `blobs/` directory and accessed through `snapshots/` symlinks. Named references (branches, tags) are tracked in `refs/`.
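The environment-variable priority described under "Cache location" can be sketched in a few lines of Python. This is an illustrative helper, not part of any library API; it takes an environment mapping as an argument so the priority is easy to exercise:

```python
from pathlib import Path

def resolve_cache_dir(env: dict) -> Path:
    """Sketch of the cache-location priority described above:
    HF_HUB_CACHE wins, then $HF_HOME/hub, then the default location."""
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])       # direct path, takes priority
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"    # cache lives at $HF_HOME/hub
    return Path.home() / ".cache" / "huggingface" / "hub"  # default

print(resolve_cache_dir({"HF_HOME": "/data/hf"}))
```

In a real program you would pass `os.environ` instead of a literal dict.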
## Schema

```
┌────────────────────────────────────┐
│         Repository folder          │
│ models--julien-c--EsperBERTo-small │
└─────────────────┬──────────────────┘
                  │
    ┌─────────────┼──────────────┬──────────────┐
    │             │              │              │
    v             v              v              v
┌───────┐    ┌────────┐    ┌────────────┐  ┌──────────┐
│ refs/ │    │ blobs/ │    │ snapshots/ │  │.no_exist/│
└───┬───┘    └───┬────┘    └──────┬─────┘  └────┬─────┘
    │            │                │             │
 "main"       Files stored    One folder    Empty marker
 contains     by content      per commit    files for
 commit hash  hash (SHA-1     hash, e.g.    files known
 e.g. "aaaaaa" or SHA-256)    aaaaaa/       not to exist
                              bbbbbb/
 Resolves a                      │
 branch/tag to                   └──► Contains symlinks
 a snapshot                           to ../../blobs/{hash}
```

## Repository folder naming

Repositories are stored as flat directories at the cache root.
The folder name encodes the repo type and repo ID:

```
{type}s--{repo_id_with_slashes_replaced_by_--}
```

Rules:

- The repo type is **pluralized**: `models`, `datasets`, `spaces`
- Forward slashes (`/`) in the repo ID are replaced with `--`
- The separator between all parts is `--`
- Casing is preserved

Examples:

| Hub repo ID | Repo type | Cache folder name |
|-------------|-----------|-------------------|
| `julien-c/EsperBERTo-small` | model | `models--julien-c--EsperBERTo-small` |
| `huggingface/DataMeasurementsFiles` | dataset | `datasets--huggingface--DataMeasurementsFiles` |
| `dalle-mini/dalle-mini` | space | `spaces--dalle-mini--dalle-mini` |

> [!NOTE]
> Buckets are not handled by this cache as they are not git-backed. Use the dedicated `hf buckets sync` command instead.

## Inside a repository folder

Every cached repository has the same internal structure:

```
<repo_folder>/
├── blobs/
├── refs/
├── snapshots/
└── .no_exist/   # may not always be present
```

### `blobs/`: content-addressed file storage

The `blobs/` directory stores the actual file contents. Each file is named after its file etag on the Hub:

- **Git-tracked files**: named by their **SHA-1** hash (40 hexadecimal characters)
- **Git LFS files**: named by their **SHA-256** hash (64 hexadecimal characters)

This is a flat directory -- no subdirectories. Identical files across different revisions are stored only once.

```
blobs/
├── 403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd  # SHA-256 (LFS)
├── 7cb18dc9bafbfcf74629a4b760af1b160957a83e                          # SHA-1 (git)
└── d7edf6bd2a681fb0175f7735299831ee1b22b812                          # SHA-1 (git)
```

### `refs/`: branch and tag references

The `refs/` directory maps human-readable references (branch names, tags, PR numbers) to commit hashes. Each reference is a plain text file containing a single line: the full commit hash (40 hexadecimal characters).
```
refs/
├── main        # contains e.g. "bbc77c8132af1cc5cf678da3f1ddf2de43606d48"
├── 2.4.0       # a tag
└── refs/
    └── pr/
        └── 1   # pull request reference
```

When a file is downloaded using a branch or tag name, the corresponding ref file is created or updated with the latest commit hash.

### `snapshots/`: revision views

The `snapshots/` directory contains one subdirectory per cached revision (commit hash). Each revision directory mirrors the file structure of the repository on the Hub, but files are **symlinks** pointing into `../../blobs/{hash}`.

```
snapshots/
├── 2439f60ef33a0d46d85da5001d52aeda5b00ce9f/
│   ├── README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
│   └── pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
└── bbc77c8132af1cc5cf678da3f1ddf2de43606d48/
    ├── README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
    └── pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
```

Key properties:

- Symlinks use **relative paths**: `../../blobs/{hash}`
- If a file is unchanged between two revisions, both symlinks point to the **same blob** with no data duplication
- Files in subdirectories on the Hub are represented as subdirectories in the snapshot (the full relative path is preserved)

Switching between snapshots is similar to using `git checkout` in a local git repository.

### `.no_exist/`: non-existence cache

The `.no_exist/` directory tracks files that were requested but do not exist on the Hub. This avoids repeated HTTP requests for optional files. Structure mirrors `snapshots/`: one subdirectory per commit hash, containing **empty files** (not symlinks) named after the missing file.

```
.no_exist/
└── 2439f60ef33a0d46d85da5001d52aeda5b00ce9f/
    └── config_that_does_not_exist.json   # empty file
```

Disk usage is negligible since these are only empty marker files.
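The naming scheme from the "Repository folder naming" section above is simple enough to express as a one-line helper. This is an illustrative sketch, not a library API:

```python
def repo_folder_name(repo_id: str, repo_type: str = "model") -> str:
    """Build the cache folder name: pluralized repo type, then the repo ID
    with forward slashes replaced by '--' (casing preserved)."""
    return f"{repo_type}s--{repo_id.replace('/', '--')}"

print(repo_folder_name("julien-c/EsperBERTo-small"))
# models--julien-c--EsperBERTo-small
```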
## Lock files

Lock files prevent concurrent processes from downloading the same blob simultaneously. They are stored in a `.locks/` directory at the cache root (not inside the repo folder):

```
<CACHE_DIR>/.locks/<repo_folder>/<etag>.lock
```

Example:

```
<CACHE_DIR>/.locks/models--julien-c--EsperBERTo-small/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd.lock
```

## Full example

```
~/.cache/huggingface/hub/
├── .locks/
│   └── models--julien-c--EsperBERTo-small/
│       └── 403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd.lock
│
└── models--julien-c--EsperBERTo-small/
    ├── blobs/
    │   ├── [321M] 403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
    │   ├── [ 398] 7cb18dc9bafbfcf74629a4b760af1b160957a83e
    │   └── [1.4K] d7edf6bd2a681fb0175f7735299831ee1b22b812
    │
    ├── refs/
    │   └── main   # contains "bbc77c8132af1cc5cf678da3f1ddf2de43606d48"
    │
    ├── snapshots/
    │   ├── 2439f60ef33a0d46d85da5001d52aeda5b00ce9f/
    │   │   ├── README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
    │   │   └── pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
    │   │
    │   └── bbc77c8132af1cc5cf678da3f1ddf2de43606d48/
    │       ├── README.md -> ../../blobs/7cb18dc9bafbfcf74629a4b760af1b160957a83e
    │       └── pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
    │
    └── .no_exist/
        └── 2439f60ef33a0d46d85da5001d52aeda5b00ce9f/
            └── optional_config.json   # empty file
```

Note how `pytorch_model.bin` points to the **same blob** in both revisions. The 321 MB file is stored only once on disk.

## File resolution logic

To locate a cached file on disk:

1. **Resolve the revision to a commit hash**
   - If the revision is already a 40-character hex string, use it directly
   - Otherwise, read the file at `refs/{revision}` to get the commit hash
2.
   **Check the snapshot**
   - Look for `snapshots/{commit_hash}/{relative_path}`
   - If it exists (as a symlink or file), the file is cached. Follow the symlink to get the content
3. **Check non-existence**
   - Look for `.no_exist/{commit_hash}/{relative_path}`
   - If it exists, the file is known not to exist on the Hub for this revision
4. **Cache miss**
   - If neither path exists, the file has not been cached yet

## Windows behavior

The cache relies on **symbolic links**. On Windows systems where symlinks are not available, the cache operates in a **degraded mode**: actual file copies are placed directly in `snapshots/` instead of symlinks. The `blobs/` directory is not used in this mode. This means the same file content may be duplicated across revisions, increasing disk usage.

To enable symlink support on Windows, activate [Developer Mode](https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development) or run as administrator.

### Custom Python Spaces

https://huggingface.co/docs/hub/spaces-sdks-python.md

# Custom Python Spaces

> [!TIP]
> Spaces now support arbitrary Dockerfiles so you can host any Python app directly using [Docker Spaces](./spaces-sdks-docker).

While not an official workflow, you can run your own Python + interface stack in Spaces by selecting Gradio as your SDK and serving a frontend on port `7860`. See the [templates](https://huggingface.co/templates#spaces) for examples.

Spaces are served in iframes, which by default restrict links from opening in the parent page. The simplest solution is to open them in a new window, for example (the `href` below is illustrative):

```HTML
<a href="https://huggingface.co/spaces" target="_blank">Spaces</a>
```

Usually, the height of Spaces is automatically adjusted when using the Gradio library interface.
However, if you provide your own frontend in the Gradio SDK and the content height is larger than the viewport, you'll need to add an [iFrame Resizer script](https://cdnjs.com/libraries/iframe-resizer) so the content is scrollable in the iframe. Include the library's `contentWindow` script in your page, for example (the exact version is illustrative):

```HTML
<script src="https://cdnjs.cloudflare.com/ajax/libs/iframe-resizer/4.3.1/iframeResizer.contentWindow.min.js"></script>
```

As an example, here is the same Space with and without the script:

- https://huggingface.co/spaces/ronvolutional/http-server
- https://huggingface.co/spaces/ronvolutional/iframe-test

### Model(s) Release Checklist

https://huggingface.co/docs/hub/model-release-checklist.md

# Model(s) Release Checklist

The [Hugging Face Hub](https://huggingface.co/models) is the go-to platform for sharing machine learning models. A well-executed release can boost your model's visibility and impact. This section covers **essential** steps for a concise, informative, and user-friendly model release.

## ⏳ Preparing Your Model for Release

### Upload Model Weights

When uploading models to the Hub, follow these best practices:

- **Use separate repositories for different model weights**: Create individual repositories for each variant of the same architecture. This lets you group them into a [collection](https://huggingface.co/docs/hub/en/collections), which is easier to navigate than a directory listing. It also improves visibility because each model has its own URL (`hf.co/org/model-name`), makes search easier, and provides download counts for each of your models. A great example is the recent [Qwen3-VL collection](https://huggingface.co/collections/Qwen/qwen3-vl), which features various variants of the VL architecture.

- **Prefer [`safetensors`](https://huggingface.co/docs/safetensors/en/index) over `pickle` for weight serialization**: `safetensors` is safer and faster than Python's `pickle` or `pth`. If you have a `.bin` pickle file, use the [weight conversion tool](https://huggingface.co/docs/safetensors/en/convert-weights) to convert it.
### Write a Comprehensive Model Card

A well-crafted model card (the `README.md` in your repository) is essential for discoverability, reproducibility, and effective sharing. Make sure to cover:

1. **Metadata Configuration**: The [metadata section](https://huggingface.co/docs/hub/model-cards#model-card-metadata) (YAML) at the top of your model card is key for search and categorization. Include:

   ```yaml
   ---
   pipeline_tag: text-generation    # Specify the task
   library_name: transformers       # Specify the library
   language:
     - en                           # List languages your model supports
   license: apache-2.0              # Specify a license
   datasets:
     - username/dataset             # List datasets used for training
   base_model: username/base-model  # If applicable (your model is a fine-tuned, quantized, or merged version of another model)
   tags:                            # Add extra tags to make the repo searchable
     - tag1
     - tag2
   ---
   ```

   If you create the `README.md` in the Web UI, you'll see a form with the most important metadata fields we recommend 🤗.

   | ![metadata template on the hub ui](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/metadata-template.png) |
   | :--: |
   | Metadata Form on the Hub UI |

2. **Detailed Model Description**: Provide a clear explanation of what your model does, its architecture, and its intended use cases. Help users quickly decide if it fits their needs.

3. **Usage Examples**: Provide clear, copy-and-run code snippets for inference, fine-tuning, or other common tasks. Keep the edits users need to make to a minimum. *Bonus*: Add a well-structured `notebook.ipynb` in the repo showing inference or fine-tuning, so users can open it in [Google Colab and Kaggle Notebooks](https://huggingface.co/docs/hub/en/notebooks) directly.

   | ![colab and kaggle button](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/colab-kaggle.png) |
   | :--: |
   | Google and Kaggle Usage Buttons |

4.
   **Technical Specifications**: Include training parameters, hardware requirements, and other details that help users run the model effectively.

5. **Performance Metrics**: Share benchmarks and evaluation results. Include quantitative metrics and qualitative examples to show strengths and limitations.

6. **Limitations and Biases**: Document known limitations, biases, and ethical considerations so users can make informed choices.

To make the process more seamless, click **Import model card template** to pre-fill the `README.md` with placeholders.

| ![model card template button on the hub ui](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/model-card-template-button.png) | ![model card template on the hub](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/model-card-template.png) |
| :--: | :--: |
| The button to import the model card template | A section of the imported template |

### Enhance Model Discoverability and Usability

To maximize reach and usability:

1. **Library Integration**: Add support for one of the many [libraries integrated with the Hugging Face Hub](https://huggingface.co/docs/hub/models-libraries) (such as `transformers`, `diffusers`, `sentence-transformers`, `timm`). This integration significantly increases your model's accessibility and provides users with code snippets for working with your model. For example, to specify that your model works with the `transformers` library:

   ```yaml
   ---
   library_name: transformers
   ---
   ```

   | ![code snippet tab](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/code-snippet.png) |
   | :--: |
   | Code snippet tab |

   You can also [register your own model library](https://huggingface.co/docs/hub/en/models-adding-libraries) or add Hub support to your library and codebase, so users know how to download model weights from the Hub.
We wrote an extensive guide on uploading best practices [here](https://huggingface.co/docs/hub/models-uploading).

> [!NOTE]
> Using a registered library also allows you to track downloads of your model over time.

2. **Correct Metadata**:
   - **Pipeline Tag:** Choose the correct [pipeline tag](https://huggingface.co/docs/hub/model-cards#specifying-a-task--pipelinetag-) so your model shows up in the right searches and widgets. Examples of common pipeline tags:
     - `text-generation` - For language models that generate text
     - `text-to-image` - For text-to-image generation models
     - `image-text-to-text` - For vision-language models (VLMs) that generate text
     - `text-to-speech` - For models that generate audio from text
   - **License:** License information is crucial for users to understand how they can use the model.

3. **Research Papers**: If your model has associated papers, cite them in the model card. They will be [cross-linked automatically](https://huggingface.co/docs/hub/model-cards#linking-a-paper).

   ```markdown
   ## References

   * [Model Paper](https://arxiv.org/abs/xxxx.xxxxx)
   ```

4. **Collections**: If you're releasing multiple related models or variants, organize them into a [collection](https://huggingface.co/docs/hub/collections). Collections help users discover related models and understand relationships across versions.

5. **Demos**: Create a [Hugging Face Space](https://huggingface.co/docs/hub/spaces) with an interactive demo. This lets users try your model without writing code. You can also [link the model](https://huggingface.co/docs/hub/spaces-config-reference) from the Space to make it appear on the model page UI.

   ```markdown
   ## Demo

   Try this model directly in your browser: [Space Demo](https://huggingface.co/spaces/username/model-demo)
   ```

   When you create a demo, download the model from its Hub repository (not external sources like Google Drive). This cross-links artifacts and improves visibility.

6.
   **Quantized Versions**: Consider uploading quantized versions (for example, GGUF) in a separate repository to improve accessibility for users with limited compute. Link these versions using the [`base_model` metadata field](https://huggingface.co/docs/hub/model-cards#specifying-a-base-model) on the quantized model cards, and document performance differences.

   ```yaml
   ---
   base_model: username/original-model
   base_model_relation: quantized
   ---
   ```

   | ![model tree showcasing relations](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/release-checklist/model-tree.png) |
   | :--: |
   | Model tree showing quantized versions |

7. **Linking Datasets on the Model Page**: Link datasets in your metadata so they appear directly on your model page.

   ```yaml
   ---
   datasets:
     - username/dataset
     - username/dataset-2
   ---
   ```

8. **New Model Version**: If your model is an update of an existing one, specify it on the older model's card. This will [display a banner](https://huggingface.co/docs/hub/en/model-cards#specifying-a-new-version) on the older page linking to the update.

   ```yaml
   ---
   new_version: username/updated-model
   ---
   ```

9. **Visual Examples**: For image or video generation models, include examples directly on your model page using the [`<Gallery />` card component](https://huggingface.co/docs/hub/en/model-cards-components#the-gallery-component).

   ```markdown
   ![Example 1](./images/example1.png)
   ![Example 2](./images/example2.png)
   ```

10. **Carbon Emissions**: If possible, specify the [carbon emissions](https://huggingface.co/docs/hub/model-cards-co2) from training.

    ```yaml
    ---
    co2_eq_emissions:
      emissions: 123.45
      source: "CodeCarbon"
      training_type: "pre-training"
      geographical_location: "US-East"
      hardware_used: "8xA100 GPUs"
    ---
    ```

### Access Control and Visibility

1. **Visibility Settings**: When ready to share your model, switch it to public in your [model settings](https://huggingface.co/docs/hub/repositories-settings).
   Before doing so, double-check all documentation and code examples to ensure they're accurate and complete.

2. **Gated Access**: If your model needs controlled access, use the [gated access feature](https://huggingface.co/docs/hub/models-gated) and clearly state the conditions users must meet. This is important for models with dual-use concerns or commercial restrictions.

## 🏁 After Releasing Your Model

A successful model release extends beyond the initial publication. To maintain quality and maximize impact:

### Maintenance and Community Engagement

1. **Verify Functionality**: After release, test all code snippets in a clean environment to confirm they work as expected. This ensures users can run your model without errors or confusion. For example, if your model is a `transformers`-compatible LLM:

   ```python
   from transformers import pipeline

   # This should run without errors
   pipe = pipeline("text-generation", model="your-username/your-model")
   result = pipe("Your test prompt")
   print(result)
   ```

2. **Share, Share, Share**: Most users discover models through social media, chat channels (like Slack or Discord), or newsletters. Share your model links in these spaces, and also add them to your website or GitHub repositories. The more visits and likes your model receives, the higher it appears in the [Hugging Face Trending section](https://huggingface.co/models?sort=trending), bringing even more visibility.

3. **Community Interaction**: Use the Community tab to answer questions, address feedback, and resolve issues promptly. Clarify confusion, accept helpful suggestions, and close off-topic threads to keep discussions focused.

### Add Evaluation Results

If you evaluate on any of the [supported benchmark datasets on the Hub](https://huggingface.co/datasets?benchmark=benchmark:official&sort=trending), you can add evaluation results to your model repository.
This will make benchmark scores visible directly on the model page and on the benchmark leaderboard in the dataset repository. See the [Evaluation Results documentation](https://huggingface.co/docs/hub/eval-results) for the full specification.

To add evaluation results, create YAML files in the `.eval_results/` folder of your model repository. Each file references a Hub Benchmark dataset:

```yaml
# .eval_results/gpqa.yaml
- dataset:
    id: Idavidrein/gpqa
  task_id: diamond
  value: 76.1
  date: "2026-03-19"
  source:
    url: https://huggingface.co/your-org/your-model
    name: Model Card
    user: your-username
```

The `task_id` must match a task defined in the benchmark dataset's `eval.yaml`. You can find available benchmarks and their task IDs by checking the `eval.yaml` file in benchmark dataset repos like [HLE](https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml).

Anyone in the community can also submit evaluation results to any model by opening a Pull Request. Community-submitted scores display a "community" badge on the model page. To streamline this process, you can use the [community-evals](https://github.com/huggingface/community-evals) repository, which provides scripts and an agent skill for extracting scores from model cards and creating PRs automatically.

### Tracking Usage and Impact

1. **Usage Metrics**: [Track downloads](https://huggingface.co/docs/hub/en/models-download-stats) and likes to understand your model's reach and adoption. You can view total download metrics in your model's settings.

2. **Review Community Contributions**: Regularly check your model's repository for contributions from other users. Community pull requests and discussions can provide useful feedback, ideas, and opportunities for collaboration.

## 🏢 Enterprise Features

A [Hugging Face Team & Enterprise](https://huggingface.co/enterprise) subscription offers additional capabilities for teams and organizations:

1.
   **Access Control**: Set up [resource groups](https://huggingface.co/docs/hub/security-resource-groups) to manage access for specific teams or users. This ensures the right permissions and secure collaboration across your organization.

2. **Storage Region**: Choose the data storage region (US or EU) for your model files to meet regional data regulations and compliance requirements.

3. **Advanced Analytics**: Use [Publisher Analytics](https://huggingface.co/docs/hub/publisher-analytics) to gain deeper insights into model usage patterns, downloads, and adoption trends across your organization.

4. **Extended Storage**: Access additional private storage capacity to host more models and larger artifacts as your model portfolio expands.

5. **Organization Blog Posts**: Enterprise organizations can now [publish blog articles directly on Hugging Face](https://huggingface.co/blog/huggingface/blog-articles-for-orgs). This lets you share model releases, research updates, and announcements with the broader community, all from your organization's profile.

By following these guidelines and examples, you'll make your model release on Hugging Face clear, useful, and impactful. This helps your work reach more people, strengthens the AI community, and increases your model's visibility. We can't wait to see what you share next! 🤗

### Using _Adapters_ at Hugging Face

https://huggingface.co/docs/hub/adapters.md

# Using _Adapters_ at Hugging Face

> Note: _Adapters_ has replaced the `adapter-transformers` library and is fully compatible in terms of model weights. See [here](https://docs.adapterhub.ml/transitioning.html) for more.

[_Adapters_](https://github.com/adapter-hub/adapters) is an add-on library to 🤗 `transformers` for efficiently fine-tuning pre-trained language models using adapters and other parameter-efficient methods. _Adapters_ also provides various methods for composition of adapter modules during training and inference.
You can learn more about this in the [_Adapters_ paper](https://arxiv.org/abs/2311.11077).

## Exploring _Adapters_ on the Hub

You can find _Adapters_ models by filtering at the left of the [models page](https://huggingface.co/models?library=adapter-transformers&sort=downloads). Some adapter models can be found in the Adapter Hub [repository](https://github.com/adapter-hub/hub). Models from both sources are aggregated on the [AdapterHub website](https://adapterhub.ml/explore/).

## Installation

To get started, you can refer to the [AdapterHub installation guide](https://docs.adapterhub.ml/installation.html). You can also use the following one-line install through pip:

```
pip install adapters
```

## Using existing models

For a full guide on loading pre-trained adapters, we recommend checking out the [official guide](https://docs.adapterhub.ml/loading.html).

As a brief summary, a full setup consists of three steps:

1. Load a base `transformers` model with the `AutoAdapterModel` class provided by _Adapters_.
2. Use the `load_adapter()` method to load and add an adapter.
3. Activate the adapter via `active_adapters` (for inference) or activate and set it as trainable via `train_adapter()` (for training).

Make sure to also check out [composition of adapters](https://docs.adapterhub.ml/adapter_composition.html).

```py
from adapters import AutoAdapterModel

# 1.
model = AutoAdapterModel.from_pretrained("FacebookAI/roberta-base")
# 2.
adapter_name = model.load_adapter("AdapterHub/roberta-base-pf-imdb")
# 3.
model.active_adapters = adapter_name
# or model.train_adapter(adapter_name)
```

You can also use `list_adapters` to find all adapter models programmatically:

```py
from adapters import list_adapters

# source can be "ah" (AdapterHub), "hf" (hf.co) or None (for both, default)
adapter_infos = list_adapters(source="hf", model_name="FacebookAI/roberta-base")
```

If you want to see how to load a specific model, you can click `Use in Adapters` and you will be given a working snippet that you can use to load it!

## Sharing your models

For a full guide on sharing models with _Adapters_, we recommend checking out the [official guide](https://docs.adapterhub.ml/huggingface_hub.html#uploading-to-the-hub).

You can share your adapter by using the `push_adapter_to_hub` method from a model that already contains an adapter.

```py
model.push_adapter_to_hub(
    "my-awesome-adapter",
    "awesome_adapter",
    adapterhub_tag="sentiment/imdb",
    datasets_tag="imdb"
)
```

This command creates a repository with an automatically generated model card and all necessary metadata.
## Additional resources

* _Adapters_ [repository](https://github.com/adapter-hub/adapters)
* _Adapters_ [docs](https://docs.adapterhub.ml)
* _Adapters_ [paper](https://arxiv.org/abs/2311.11077)
* Integration with Hub [docs](https://docs.adapterhub.ml/huggingface_hub.html)

### Examples & Tutorials

https://huggingface.co/docs/hub/jobs-examples.md

# Examples & Tutorials

## Guides to train with Jobs

Guides for using popular libraries with Jobs:

- [Training with TRL on Jobs](https://huggingface.co/docs/trl/jobs_training) - Run SFT, GRPO, DPO and more using TRL and TRL Jobs
- [Fine-tune with Unsloth on Jobs](https://huggingface.co/blog/unsloth-jobs) - ~2x faster training and ~60% less VRAM using Unsloth
- [Transformers example scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch) - UV-compatible training scripts for text classification, summarization, image classification, NER, speech recognition, and more, which you can run directly on Jobs:

```bash
hf jobs uv run --flavor a10g-small --secrets HF_TOKEN \
  https://raw.githubusercontent.com/huggingface/transformers/main/examples/pytorch/image-classification/run_image_classification.py \
  --model_name_or_path google/vit-base-patch16-224-in21k \
  --dataset_name ethz/food101 \
  --output_dir vit-food101 \
  --push_to_hub
```

## UV Scripts

The [uv-scripts](https://huggingface.co/uv-scripts) organization maintains a collection of self-contained uv scripts that run on Jobs with a single command. Scripts cover OCR, batch inference, text classification, object detection, dataset statistics, embedding visualization, and more.

[Unsloth](https://huggingface.co/datasets/unsloth/jobs) also provides ready-to-run training scripts for fine-tuning LLMs and VLMs on Jobs.

## Coding Agent Skills

The [hugging-face-jobs skill](https://github.com/huggingface/skills/tree/main/skills/hugging-face-jobs) lets coding agents like Claude Code and Cursor submit and monitor Jobs directly from your editor.
## Community Tutorials and Projects

- [Train on massive datasets without downloading](https://danielvanstrien.xyz/posts/2026/hf-streaming-unsloth/train-massive-datasets-without-downloading.html) - Stream datasets directly on Jobs with Unsloth, no local storage needed
- [Fine-tune a vision-language model with TRL](https://danielvanstrien.xyz/posts/2025/iconclass-vlm-sft/trl-vlm-fine-tuning-iconclass.html) - Fine-tune Qwen2.5-VL for art history tasks using TRL and Jobs
- [FreeFlow](https://github.com/wjbmattingly/freeflow) - Open-source annotation platform with built-in Jobs integration for training YOLOv11 object detection models

---

Have a tutorial or project using Jobs? [Open a PR](https://github.com/huggingface/hub-docs/edit/main/docs/hub/jobs-examples.md) to add it here.

### File names and splits

https://huggingface.co/docs/hub/datasets-file-names-and-splits.md

# File names and splits

To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Datasets Hub features like the Dataset Viewer. Look at the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135) for more details. A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a dataset viewer on its page on the Hub. Note that if none of the structures below suits your case, you can have more control over how you define splits and subsets with the [Manual Configuration](./datasets-manual-configuration).
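As a preview of the naming rules detailed in the sections below, the way a split keyword is detected in a file name can be sketched in Python. This is an illustrative approximation, not the Hub's actual implementation: the split name must be delimited by non-letter characters (underscores, dashes, spaces, dots, or digits).

```python
import re

# Canonical split names; the sections below also list equivalent
# keywords such as "training", "valid", "dev", and "eval".
SPLIT_NAMES = ["train", "validation", "test"]

def infer_split(filename):
    """Guess the split from a file name: the split keyword must not be
    glued to other letters, so "my_train_file.csv" matches "train"
    but "trainer.csv" and "testfile.csv" match nothing."""
    for split in SPLIT_NAMES:
        if re.search(rf"(?<![a-zA-Z]){split}(?![a-zA-Z])", filename):
            return split
    return None
```

For example, `infer_split("my_train_file_00001.csv")` returns `"train"`, while `infer_split("testfile.csv")` returns `None` because the keyword is not delimited.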
## Basic use-case

If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any [supported file format](./datasets-adding#file-formats) and any file name). Your repository will also contain a `README.md` file, the [dataset card](./datasets-cards) displayed on your dataset page.

```
my_dataset_repository/
├── README.md
└── data.csv
```

## Splits

Some patterns in the dataset repository can be used to assign certain files to train/validation/test splits.

### File name

You can name your data files after the `train`, `test`, and `validation` splits:

```
my_dataset_repository/
├── README.md
├── train.csv
├── test.csv
└── validation.csv
```

If you don't have any non-traditional splits, then you can place the split name anywhere in the data file name. The only rule is that the split name must be delimited by non-word characters, like `test-file.csv` for example instead of `testfile.csv`. Supported delimiters include underscores, dashes, spaces, dots, and numbers. For example, the following file names are all acceptable:

- train split: `train.csv`, `my_train_file.csv`, `train1.csv`
- validation split: `validation.csv`, `my_validation_file.csv`, `validation1.csv`
- test split: `test.csv`, `my_test_file.csv`, `test1.csv`

### Directory name

You can place your data files into different directories named `train`, `test`, and `validation` where each directory contains the data files for that split:

```
my_dataset_repository/
├── README.md
└── data/
    ├── train/
    │   └── data.csv
    ├── test/
    │   └── more_data.csv
    └── validation/
        └── even_more_data.csv
```

### Keywords

There are several ways to refer to train/validation/test splits. Validation splits are sometimes called "dev", and test splits may be referred to as "eval".
These other split names are also supported, and the following keywords are equivalent:

- train, training
- validation, valid, val, dev
- test, testing, eval, evaluation

Therefore, the structure below is a valid repository:

```
my_dataset_repository/
├── README.md
└── data/
    ├── training.csv
    ├── eval.csv
    └── valid.csv
```

### Multiple files per split

Splits can span several files, for example:

```
my_dataset_repository/
├── README.md
├── train_0.csv
├── train_1.csv
├── train_2.csv
├── train_3.csv
├── test_0.csv
└── test_1.csv
```

Make sure all the files of your `train` set have *train* in their names (same for test and validation). You can even add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example). For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.

```
my_dataset_repository/
├── README.md
└── data/
    ├── train/
    │   ├── shard_0.csv
    │   ├── shard_1.csv
    │   ├── shard_2.csv
    │   └── shard_3.csv
    └── test/
        ├── shard_0.csv
        └── shard_1.csv
```

### Custom split name

If your dataset splits have custom names that aren't `train`, `test`, or `validation`, then you can name your data files like `data/<split_name>-xxxxx-of-xxxxx.csv`.
Here is an example with three splits, `train`, `test`, and `random`:

```
my_dataset_repository/
├── README.md
└── data/
    ├── train-00000-of-00003.csv
    ├── train-00001-of-00003.csv
    ├── train-00002-of-00003.csv
    ├── test-00000-of-00001.csv
    ├── random-00000-of-00003.csv
    ├── random-00001-of-00003.csv
    └── random-00002-of-00003.csv
```

### Your First Docker Space: Text Generation with T5

https://huggingface.co/docs/hub/spaces-sdks-docker-first-demo.md

# Your First Docker Space: Text Generation with T5

In the following sections, you'll learn the basics of creating a Docker Space, configuring it, and deploying your code to it. We'll create a **Text Generation** Space with Docker that'll be used to demo the [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) model, which can generate text given some input text, using FastAPI as the server. You can find a completed version of this hosted [here](https://huggingface.co/spaces/DockerTemplates/fastapi_t5).

## Create a new Docker Space

We'll start by [creating a brand new Space](https://huggingface.co/new-space) and choosing **Docker** as our SDK. Hugging Face Spaces are Git repositories, meaning that you can work on your Space incrementally (and collaboratively) by pushing commits. Take a look at the [Getting Started with Repositories](./repositories-getting-started) guide to learn about how you can create and edit files before continuing. If you prefer to work with a UI, you can also do the work directly in the browser.

Selecting **Docker** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Docker Space by setting the `sdk` property to `docker` in your `README.md` file's YAML block.

```yaml
sdk: docker
```

You have the option to change the default application port of your Space by setting the `app_port` property in your `README.md` file's YAML block. The default port is `7860`.
```yaml
app_port: 7860
```

## Add the dependencies

For the **Text Generation** Space, we'll be building a FastAPI app that showcases a text generation model called Flan T5. For the model inference, we'll be using a [🤗 Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to use the model. We need to start by installing a few dependencies. This can be done by creating a **requirements.txt** file in our repository, and adding the following dependencies to it:

```
fastapi==0.74.*
requests==2.27.*
sentencepiece==0.1.*
torch==1.11.*
transformers==4.*
uvicorn[standard]==0.17.*
```

These dependencies will be installed in the Dockerfile we'll create later.

## Create the app

Let's kick off the process with a dummy FastAPI app to see that we can get an endpoint working. The first step is to create an app file, in this case, we'll call it `main.py`.

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": "World!"}
```

## Create the Dockerfile

The main step for a Docker Space is creating a Dockerfile. You can read more about Dockerfiles [here](https://docs.docker.com/get-started/). Although we're using FastAPI in this tutorial, Dockerfiles give great flexibility to users, allowing you to build a new generation of ML demos. Let's write the Dockerfile for our application:

```Dockerfile
# read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
# you will also find guides on how best to write your Dockerfile
FROM python:3.9

# The two following lines are requirements for the Dev Mode to be functional
# Learn more about the Dev Mode at https://huggingface.co/dev-mode-explorers
RUN useradd -m -u 1000 user
WORKDIR /app

COPY --chown=user ./requirements.txt requirements.txt
RUN pip install --no-cache-dir --upgrade -r requirements.txt

COPY --chown=user . /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
```

When the changes are saved, the Space will rebuild and your demo should be up after a couple of seconds! [Here](https://huggingface.co/spaces/DockerTemplates/fastapi_dummy) is an example result at this point.

### Testing locally

**Tip for power users (you can skip):** If you're developing locally, this is a good moment to run `docker build` and `docker run` to debug locally, but it's even easier to push the changes to the Hub and see what it looks like!

```bash
docker build -t fastapi .
docker run -it -p 7860:7860 fastapi
```

If you have [Secrets](spaces-sdks-docker#secret-management), you can use `docker buildx` and pass the secrets as build arguments:

```bash
export SECRET_EXAMPLE="my_secret_value"
docker buildx build --secret id=SECRET_EXAMPLE,env=SECRET_EXAMPLE -t fastapi .
```

and run with `docker run` passing the secrets as environment variables:

```bash
export SECRET_EXAMPLE="my_secret_value"
docker run -it -p 7860:7860 -e SECRET_EXAMPLE=$SECRET_EXAMPLE fastapi
```

## Adding some ML to our app

As mentioned before, the idea is to use a Flan T5 model for text generation. We'll want to add some HTML and CSS for an input field, so let's create a directory called static with `index.html`, `style.css`, and `script.js` files. At this moment, your file structure should look like this:

```bash
/static
/static/index.html
/static/script.js
/static/style.css
Dockerfile
main.py
README.md
requirements.txt
```

Let's go through all the steps to make this work. We'll skip some of the details of the CSS and HTML. You can find the whole code in the Files and versions tab of the [DockerTemplates/fastapi_t5](https://huggingface.co/spaces/DockerTemplates/fastapi_t5) Space.

1. Write the FastAPI endpoint to do inference

We'll use the `pipeline` from `transformers` to load the [google/flan-t5-small](https://huggingface.co/google/flan-t5-small) model.
We'll set up an endpoint called `infer_t5` that receives an input and outputs the result of the inference call:

```python
from transformers import pipeline

pipe_flan = pipeline("text2text-generation", model="google/flan-t5-small")

@app.get("/infer_t5")
def t5(input):
    output = pipe_flan(input)
    return {"output": output[0]["generated_text"]}
```

2. Write the `index.html` to have a simple form containing the code of the page. A minimal version looks like this; the ids and classes are the ones `script.js` expects:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Text generation using Flan T5</title>
    <link rel="stylesheet" href="style.css" />
    <script type="module" src="script.js"></script>
  </head>
  <body>
    <h1>Text generation using Flan T5</h1>
    <p>Model: <a href="https://huggingface.co/google/flan-t5-small">google/flan-t5-small</a></p>
    <form class="text-gen-form">
      <label for="text-gen-input">Text prompt</label>
      <input id="text-gen-input" type="text" />
      <button type="submit">Submit</button>
    </form>
    <p class="text-gen-output"></p>
  </body>
</html>
```

3. In the `main.py` file, mount the static files and serve the HTML file on the root route:

```python
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

app.mount("/", StaticFiles(directory="static", html=True), name="static")

@app.get("/")
def index() -> FileResponse:
    return FileResponse(path="/app/static/index.html", media_type="text/html")
```

4. In the `script.js` file, make it handle the request:

```javascript
const textGenForm = document.querySelector(".text-gen-form");

const translateText = async (text) => {
  const inferResponse = await fetch(`infer_t5?input=${text}`);
  const inferJson = await inferResponse.json();

  return inferJson.output;
};

textGenForm.addEventListener("submit", async (event) => {
  event.preventDefault();

  const textGenInput = document.getElementById("text-gen-input");
  const textGenParagraph = document.querySelector(".text-gen-output");

  textGenParagraph.textContent = await translateText(textGenInput.value);
});
```

5. Grant permissions to the right directories

As discussed in the [Permissions Section](./spaces-sdks-docker#permissions), the container runs with user ID 1000. That means that the Space might face permission issues. For example, `transformers` downloads and caches the models under the `HF_HOME` path. The easiest way to solve this is to create a user with the right permissions and use it to run the container application. We can do this by adding the following lines to the `Dockerfile`.
```Dockerfile
# Switch to the "user" user
USER user

# Set home to the user's home directory
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH
```

The final `Dockerfile` should look like this:

```Dockerfile
# read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
# you will also find guides on how best to write your Dockerfile
FROM python:3.9

# The two following lines are requirements for the Dev Mode to be functional
# Learn more about the Dev Mode at https://huggingface.co/dev-mode-explorers
RUN useradd -m -u 1000 user
WORKDIR /app

COPY --chown=user ./requirements.txt requirements.txt
RUN pip install --no-cache-dir --upgrade -r requirements.txt

COPY --chown=user . /app

USER user
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
```

Success! Your app should be working now! Check out [DockerTemplates/fastapi_t5](https://huggingface.co/spaces/DockerTemplates/fastapi_t5) to see the final result. What a journey! Please remember that Docker Spaces give you lots of freedom, so you're not limited to using FastAPI. From a [Go Endpoint](https://huggingface.co/spaces/DockerTemplates/test-docker-go) to a [Shiny App](https://huggingface.co/spaces/DockerTemplates/shiny-with-python), the limit is the moon! Check out [some official examples](./spaces-sdks-docker-examples). You can also upgrade your Space to a GPU if needed 😃

## Debugging

You can debug your Space by checking the **Build** and **Container** logs. Click on the **Open Logs** button to open the modal. If everything went well, you will see `Pushing Image` and `Scheduling Space` on the **Build** tab. On the **Container** tab, you will see the application status, in this case, `Uvicorn running on http://0.0.0.0:7860`. Additionally, you can enable the Dev Mode on your Space. The Dev Mode allows you to connect to your running Space via VSCode or SSH.
Learn more here: https://huggingface.co/dev-mode-explorers

## Read More

- [Docker Spaces](spaces-sdks-docker)
- [List of Docker Spaces examples](spaces-sdks-docker-examples)

### Malware Scanning

https://huggingface.co/docs/hub/security-malware.md

# Malware Scanning

We run every file of your repositories through a [malware scanner](https://www.clamav.net/). Scanning is triggered at each commit. Here is an [example view](https://huggingface.co/mcpotato/42-eicar-street/tree/main) of an infected file:

> [!TIP]
> If your file has neither an ok nor infected badge, it could mean that it is either currently being scanned, waiting to be scanned, or that there was an error during the scan. It can take up to a few minutes to be scanned.

If at least one file has been scanned as unsafe, a message will warn users:

> [!TIP]
> As the repository owner, we advise you to remove the suspicious file. The repository will then appear as safe again.

### Hub Rate limits

https://huggingface.co/docs/hub/rate-limits.md

# Hub Rate limits

To protect our platform's integrity and ensure availability to as many AI community members as possible, we enforce rate limits on all requests made to the Hugging Face Hub. We define different rate limits for distinct classes of requests. We distinguish three main buckets:

- **Hub APIs** - e.g. model or dataset search, repo creation, user management, etc. All endpoints that belong to this bucket are documented in [Hub API Endpoints](./api).
- **Resolvers** - They're all the URLs that contain a `/resolve/` segment in their path, which serve user-generated content from the Hub. Concretely, those are the URLs that are constructed by open source libraries (transformers, datasets, vLLM, llama.cpp, …) or AI applications (LM Studio, Jan, ollama, …) to download model/dataset files from HF.
  - Specifically, this is the ["Resolve a file" endpoint](https://huggingface-openapi.hf.space/#tag/models/get/apiresolve-cachemodelsnamespacereporevpath) documented in our OpenAPI spec.
  - Resolve requests are heavily used by the community, and since we optimize our infrastructure to serve them with maximum efficiency, the rate limits for Resolvers are the highest.
- **Pages** - All the Web pages we host on huggingface.co.
  - Usually Web browsing requests are made by humans, hence rate limits don't need to be as high as for the above-mentioned programmatic endpoints.

> [!TIP]
> All values are defined over 5-minute windows, which allows for some level of "burstiness" from an application or developer's point of view. If you, your organization, or your application need higher rate limits, we encourage you to upgrade your account to PRO, Team, or Enterprise. We prioritize support requests from PRO, Team, and Enterprise customers - see built-in limits in [Rate limit Tiers](#rate-limit-tiers).

## Billing dashboard

At any point, you can check your rate limit status on your (or your org's) Billing page: https://huggingface.co/settings/billing

![dashboard for rate limits](https://cdn-uploads.huggingface.co/production/uploads/5dd96eb166059660ed1ee413/0pzQQyuVG3c9tWjCqrX9Y.png)

On the right side, you will see three gauges, one for each bucket of Requests. Each bucket presents the number of current (last 5 minutes) requests, and the number of allowed requests based on your user account or organization plan. Whenever you exceed the limit in the past 5 minutes (the view is updated in real-time), the bar will turn red.

Note: You can use the context switcher to easily switch between your user account and your orgs.

## HTTP Headers

Whenever you or your organization hits a rate limit, you will receive a **429** `Too Many Requests` HTTP error.
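For intuition, the mapping from a request URL to one of the three buckets described above can be sketched in Python. This is illustrative only: the real classification happens server-side, and treating the `/api/` path prefix as the Hub API bucket is an assumption made here for the sketch.

```python
from urllib.parse import urlparse

def classify_request(url):
    """Roughly map a huggingface.co URL to a rate limit bucket.
    Resolvers are URLs with a /resolve/ segment; the /api/ prefix
    is assumed to cover Hub API endpoints; everything else is a page."""
    path = urlparse(url).path
    if "/resolve/" in path:
        return "resolvers"
    if path.startswith("/api/"):
        return "api"
    return "pages"
```

For example, a model file download URL falls in the `resolvers` bucket, while browsing a docs page falls in `pages`.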
We implement the mechanism described in the [IETF draft (Version 9)](https://datatracker.ietf.org/doc/draft-ietf-httpapi-ratelimit-headers/) titled "RateLimit HTTP header fields for HTTP" (also known as `draft-ietf-httpapi-ratelimit-headers`). The goal is to define standardized HTTP headers that servers can use to advertise quota / rate-limit policies and communicate current usage / limits to clients so that they can avoid being throttled. Precisely, we implement the following headers:

| Header | Purpose / Meaning |
| --- | --- |
| **`RateLimit`** | The state of the rate limit for the current window: how many requests (of this type) you can still perform, and when the window resets. |
| **`RateLimit-Policy`** | Carries the rate limit policy itself (e.g. "100 requests per 5 minutes"). It's informative; shows what policy the client is subject to. |

A set of examples is as follows:

| Header | Example |
| --- | --- |
| **`RateLimit`** | `"api\|pages\|resolvers";r=[remaining];t=[seconds remaining until reset]` |
| **`RateLimit-Policy`** | `"fixed window";"api\|pages\|resolvers";q=[total allowed for window];w=[window duration in seconds]` |

## Rate limit Tiers

Here are the current rate limits (as of September '25) based on your plan:

| Plan | API | Resolvers | Pages |
| --- | --- | --- | --- |
| Anonymous user (per IP address) | 500 \* | 3,000 \* | 100 \* |
| Free user | 1,000 \* | 5,000 \* | 200 \* |
| PRO user | 2,500 | 12,000 | 400 |
| Team organization | 3,000 | 20,000 | 400 |
| Enterprise organization | 6,000 | 50,000 | 600 |
| Enterprise Plus organization | 10,000 | 100,000 | 1,000 |
| Enterprise Plus organization when Organization IP Ranges are defined | 100,000 | 500,000 | 10,000 |
| Academia Hub organization | 2,500 | 12,000 | 400 |

\* Limits for Anonymous and Free users are subject to change over time depending on platform health 🤞

> [!NOTE]
> All quotas are calculated over 5-minute fixed windows.

Note: For organizations, rate limits are applied individually to each member, not shared among members.

## What if I get rate-limited

First, make sure you always pass a `HF_TOKEN`, and that it is passed downstream to all libraries or applications that download _stuff_ from the Hub. This is the number one reason users get rate limited and is a very easy fix. If you are still rate limited despite passing `HF_TOKEN`, you can:

- spread out your requests over longer periods of time
- replace Hub API calls with Resolver calls, whenever possible (Resolver rate limits are much higher and much more optimized)
- upgrade to PRO, Team, or Enterprise
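The `RateLimit` header format shown in the examples table can also be parsed by hand to decide how long to wait before retrying. A minimal sketch, assuming the `r=…;t=…` layout from the table above (`parse_ratelimit` is an illustrative helper, not a Hub or `huggingface_hub` API):

```python
import re

def parse_ratelimit(header_value):
    """Extract (remaining, seconds_until_reset) from a RateLimit header
    value such as '"resolvers";r=0;t=42'; returns None if it doesn't match."""
    match = re.search(r"r=(\d+);t=(\d+)", header_value)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

# On a 429 response, a client could sleep for the advertised reset time
# before retrying:
remaining, reset_in = parse_ratelimit('"resolvers";r=0;t=42')
# time.sleep(reset_in) would wait out the current 5-minute window
```

This is essentially what the `huggingface_hub` retry logic described below does for you automatically.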
## Smart rate limit handling with `huggingface_hub`

The Hub Python Library [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index) (version **1.2.0+**) includes smart retry handling for rate limit errors. When a 429 error occurs, the SDK automatically parses the `RateLimit` header to extract the exact number of seconds until the rate limit resets, then waits precisely that duration before retrying. This applies to file downloads (i.e. Resolvers) and paginated Hub API calls (list models, datasets, spaces, etc.). **We strongly recommend using `huggingface_hub` for all programmatic access to the Hub** to benefit from this optimized retry behavior and avoid implementing custom rate limit handling.

## Granular user action Rate limits

In addition to those main classes of rate limits, we enforce limits on certain specific kinds of user actions, like:

- repo creation
- repo commits
- discussions and comments
- moderation actions
- etc.

We don't currently document the rate limits for those specific actions, given that they tend to change over time more often. If you get quota errors, we encourage you to upgrade your account to PRO, Team, or Enterprise. Feel free to get in touch with us via the support team.

### Using RL-Baselines3-Zoo at Hugging Face

https://huggingface.co/docs/hub/rl-baselines3-zoo.md

# Using RL-Baselines3-Zoo at Hugging Face

`rl-baselines3-zoo` is a training framework for Reinforcement Learning using Stable Baselines3.

## Exploring RL-Baselines3-Zoo in the Hub

You can find RL-Baselines3-Zoo models by filtering at the left of the [models page](https://huggingface.co/models?library=stable-baselines3). The Stable-Baselines3 team is hosting a collection of 150+ trained Reinforcement Learning agents with tuned hyperparameters that you can find [here](https://huggingface.co/sb3).

All models on the Hub come with useful features:

1. An automatically generated model card with a description, a training configuration, and more.
2. Metadata tags that help for discoverability.
3. Evaluation results to compare with other models.
4. A video widget where you can watch your agent performing.

## Using existing models

You can simply download a model from the Hub using `load_from_hub`:

```
# Download dqn SpaceInvadersNoFrameskip-v4 model and save it into the logs/ folder
python -m rl_zoo3.load_from_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/ -orga sb3
python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/
```

You can define three parameters:

- `--repo-name`: The name of the repo.
- `-orga`: A Hugging Face username or organization.
- `-f`: The destination folder.

## Sharing your models

You can easily upload your models with `push_to_hub`. That will save the model, evaluate it, generate a model card and record a replay video of your agent before pushing the complete repo to the Hub.

```
python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name dqn-SpaceInvadersNoFrameskip-v4 -orga ThomasSimonini -f logs/
```

You can define three parameters:

- `--repo-name`: The name of the repo.
- `-orga`: Your Hugging Face username.
- `-f`: The folder where the model is saved.

## Additional resources

* RL-Baselines3-Zoo [official trained models](https://huggingface.co/sb3)
* RL-Baselines3-Zoo [documentation](https://github.com/DLR-RM/rl-baselines3-zoo)

### Handling Spaces Dependencies in Gradio Spaces

https://huggingface.co/docs/hub/spaces-dependencies.md

# Handling Spaces Dependencies in Gradio Spaces

## Default dependencies

The default Gradio Spaces environment comes with several pre-installed dependencies:

* The [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index) client library allows you to manage your repository and files on the Hub with Python and programmatically access [Inference Providers](./models-inference) from your Space.
If you choose to instantiate the model in your app with Inference Providers, you can benefit from the built-in acceleration optimizations. This option also consumes less computing resources, which is always nice for the environment! 🌎 Refer to this [page](https://huggingface.co/docs/huggingface_hub/how-to-inference) for more information on how to programmatically access Inference Providers.

* [`requests`](https://docs.python-requests.org/en/master/) is useful for calling third-party APIs from your app.
* [`datasets`](https://github.com/huggingface/datasets) allows you to fetch or display any dataset from the Hub inside your app.
* [`gradio`](https://github.com/gradio-app/gradio). You can optionally require a specific version using [`sdk_version` in the `README.md` file](spaces-config-reference).
* Common Debian packages, such as `ffmpeg`, `cmake`, `libsm6`, and a few others.

## Adding your own dependencies

If you need other Python packages to run your app, add them to a **requirements.txt** file at the root of the repository. The Spaces runtime engine will create a custom environment on-the-fly. You can also add a **pre-requirements.txt** file describing dependencies that will be installed before your main dependencies. It can be useful if you need to update pip itself. Debian dependencies are also supported. Add a **packages.txt** file at the root of your repository, and list all your dependencies in it. Each dependency should be on a separate line, and each line will be read and installed by `apt-get install`.

### Using ML-Agents at Hugging Face

https://huggingface.co/docs/hub/ml-agents.md

# Using ML-Agents at Hugging Face

`ml-agents` is an open-source toolkit that enables games and simulations made with Unity to serve as environments for training intelligent agents.

## Exploring ML-Agents in the Hub

You can find `ml-agents` models by filtering at the left of the [models page](https://huggingface.co/models?library=ml-agents).
All models on the Hub come with useful features:

1. An automatically generated model card with a description, a training configuration, and more.
2. Metadata tags that help for discoverability.
3. Tensorboard summary files to visualize the training metrics.
4. A link to the Spaces web demo where you can visualize your agent playing in your browser.

## Install the library

To install the `ml-agents` library, you need to clone the repo:

```
# Clone the repository
git clone https://github.com/Unity-Technologies/ml-agents

# Go inside the repository and install the package
cd ml-agents
pip3 install -e ./ml-agents-envs
pip3 install -e ./ml-agents
```

## Using existing models

You can simply download a model from the Hub using `mlagents-load-from-hf`.

```
mlagents-load-from-hf --repo-id="ThomasSimonini/MLAgents-Pyramids" --local-dir="./downloads"
```

You need to define two parameters:

- `--repo-id`: the name of the Hugging Face repo you want to download.
- `--local-dir`: the path to download the model.

## Visualize an agent playing

You can easily watch any model playing directly in your browser:

1. Go to your model repo.
2. In the `Watch Your Agent Play` section, click on the link.
3. In the demo, on step 1, choose your model repository, which is the model id.
4. In step 2, choose what model you want to replay.

## Sharing your models

You can easily upload your models using `mlagents-push-to-hf`:

```
mlagents-push-to-hf --run-id="First Training" --local-dir="results/First Training" --repo-id="ThomasSimonini/MLAgents-Pyramids" --commit-message="Pyramids"
```

You need to define four parameters:

- `--run-id`: the name of the training run.
- `--local-dir`: where the model was saved.
- `--repo-id`: the name of the Hugging Face repo you want to create or update. It's `<your-username>/<repo-name>`.
- `--commit-message`: the commit message.
## Additional resources

* ML-Agents [documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Hugging-Face-Integration.md)
* Official Unity ML-Agents Spaces [demos](https://huggingface.co/unity)

### Streaming datasets

https://huggingface.co/docs/hub/datasets-streaming.md

# Streaming datasets

## Integrated libraries

If a dataset on the Hub is compatible with a [supported library](./datasets-libraries) that allows streaming from Hugging Face, streaming the dataset can be done in just a few lines. For information on accessing the dataset, you can click on the "Use this dataset" button on the dataset page to see how to do so. For example, [`knkarthick/samsum`](https://huggingface.co/datasets/knkarthick/samsum?library=datasets) shows how to do so with `datasets` below.

## Using the Hugging Face Client Library

You can use the [`huggingface_hub`](/docs/huggingface_hub) library to create, delete, and access files from repositories. For example, to stream the `allenai/c4` dataset in Python, simply install the library (we recommend using the latest version) and run the following code.

```bash
pip install -U huggingface_hub
```

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

repo_id = "allenai/c4"
path_in_repo = "en/c4-train.00000-of-01024.json.gz"

# Stream the file
with fs.open(f"datasets/{repo_id}/{path_in_repo}", "r", compression="gzip") as f:
    print(f.readline())  # read only the first line
    # {"text":"Beginners BBQ Class Taking Place in Missoula!...}
```

See the [`HfFileSystem` documentation](https://huggingface.co/docs/huggingface_hub/en/guides/hf_file_system) for more information. You can also integrate this into your own library! For example, you can quickly stream a CSV dataset using Pandas in batches.
```py
import pandas as pd
from huggingface_hub import HfFileSystem

fs = HfFileSystem()

repo_id = "YOUR_REPO_ID"
path_in_repo = "data.csv"
batch_size = 5

# Stream the file
with fs.open(f"datasets/{repo_id}/{path_in_repo}") as f:
    for df in pd.read_csv(f, iterator=True, chunksize=batch_size):  # read 5 lines at a time
        print(len(df))  # 5
```

Streaming is especially useful for reading big files on Hugging Face progressively, or for reading only a small portion of them. For example, `tarfile` can iterate over the files of TAR archives, `zipfile` can read files from ZIP archives, and `pyarrow` can access row groups of Parquet files.

> [!TIP]
> There is an equivalent filesystem implementation in Rust available in [OpenDAL](https://github.com/apache/opendal).

## Using cURL

Since all files on the Hub are available via HTTP, you can stream files using `cURL`:

```bash
>>> curl -L https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/resolve/main/prompts.csv | head -n 5
"act","prompt"
"An Ethereum Developer","Imagine you are an experienced Ethereum developer tasked with creating...
"SEO Prompt","Using WebPilot, create an outline for an article that will be 2,000 words on the ...
"Linux Terminal","I want you to act as a linux terminal. I will type commands and you will repl...
"English Translator and Improver","I want you to act as an English translator, spelling correct...
```

Use range requests to access a specific portion of a file:

```bash
>>> curl -r 40-88 -L https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/resolve/main/prompts.csv
Imagine you are an experienced Ethereum developer
```

Stream from private repositories using an [access token](https://huggingface.co/docs/hub/en/security-tokens):

```bash
>>> export HF_TOKEN=hf_xxx
>>> curl -H "Authorization: Bearer $HF_TOKEN" -L https://huggingface.co/...
```

## Streaming Parquet

Parquet is a great format for AI datasets. It offers good compression, a columnar structure for efficient processing and projections, and multi-level metadata for fast filtering, and is suitable for datasets of all sizes.
Parquet files are divided into row groups that are often around 100MB each. This lets data loaders and data processing frameworks stream data progressively, iterating over row groups. Inside row groups are individual columns, which are divided into pages. Pages are compressed blocks of around 1MB that contain the actual data. ### Stream Row Groups Use PyArrow to stream row groups from Parquet files on Hugging Face: ```python import pyarrow.parquet as pq repo_id = "HuggingFaceFW/finewiki" path_in_repo = "data/enwiki/000_00000.parquet" # Stream the Parquet file row group by row group with pq.ParquetFile(f"hf://datasets/{repo_id}/{path_in_repo}") as pf: for row_group_idx in range(pf.num_row_groups): row_group_table = pf.read_row_group(row_group_idx) df = row_group_table.to_pandas() ``` > [!TIP] > PyArrow supports `hf://` paths out-of-the-box and uses `HfFileSystem` automatically. Find more information in the [PyArrow documentation](./datasets-pyarrow). ### Efficient random access Row groups are further divided into columns, and columns into pages. Pages are often around 1MB and are the smallest unit of data in Parquet, since this is where compression is applied. Accessing pages enables loading specific rows without having to load a full row group, and is possible if the Parquet file has a page index. However, not every Parquet framework supports reading at the page level.
PyArrow doesn't, for example, but the `parquet` crate in Rust does: ```rust use std::sync::Arc; use object_store::path::Path; use object_store_opendal::OpendalStore; use opendal::services::Huggingface; use opendal::Operator; use parquet::arrow::async_reader::ParquetObjectReader; use parquet::arrow::ParquetRecordBatchStreamBuilder; use futures::TryStreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let repo_id = "HuggingFaceFW/finewiki"; let path_in_repo = Path::from("data/enwiki/000_00000.parquet"); let offset = 0; let limit = 10; let builder = Huggingface::default().repo_type("dataset").repo_id(repo_id); let operator = Operator::new(builder)?.finish(); let store = Arc::new(OpendalStore::new(operator)); let reader = ParquetObjectReader::new(store, path_in_repo.clone()); let batch_stream = ParquetRecordBatchStreamBuilder::new(reader).await? .with_offset(offset as usize) .with_limit(limit as usize) .build()?; let results = batch_stream.try_collect::<Vec<_>>().await?; println!("Read {} batches", results.len()); Ok(()) } ``` > [!TIP] > In Rust, we use OpenDAL's `Huggingface` service, which is equivalent to `HfFileSystem` in Python. Pass `write_page_index=True` in PyArrow to include the page index, which enables efficient random access. It notably adds "offset_index_offset" and "offset_index_length" to Parquet columns that you can see in the [Parquet metadata viewer on Hugging Face](https://huggingface.co/blog/cfahlgren1/intro-to-parquet-format). Page indexes also speed up the [Hugging Face Dataset Viewer](https://huggingface.co/docs/dataset-viewer) and allow it to show data without a row group size limit. ### Organization cards https://huggingface.co/docs/hub/organizations-cards.md # Organization cards You can create an organization card to help users learn more about what your organization is working on and how users can use your libraries, models, datasets, and Spaces.
An organization card is displayed on an organization's profile: If you're a member of an organization, you'll see a button to create or edit your organization card on the organization's main page. Organization cards are a `README.md` static file inside a Space repo named `README`. The card can be as simple as Markdown text, or you can create a more customized appearance with HTML. The card for the [Hugging Face Course organization](https://huggingface.co/huggingface-course), shown above, [contains the following HTML](https://huggingface.co/spaces/huggingface-course/README/blob/main/README.md): ```html This is the organization grouping all the models and datasets used in the Hugging Face course. ``` For more examples, take a look at: * [Amazon's](https://huggingface.co/spaces/amazon/README/blob/main/README.md) organization card source code * [spaCy's](https://huggingface.co/spaces/spacy/README/blob/main/README.md) organization card source code. ### fenic https://huggingface.co/docs/hub/datasets-fenic.md # fenic [fenic](https://github.com/typedef-ai/fenic) is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. fenic provides support for reading datasets directly from the Hugging Face Hub. 
## Getting Started To get started, pip install `fenic`: ```bash pip install fenic ``` ### Create a Session Instantiate a fenic session with the default configuration (sufficient for reading datasets and other non-semantic operations): ```python import fenic as fc session = fc.Session.get_or_create(fc.SessionConfig()) ``` ## Overview fenic is an opinionated data processing framework that combines: - **DataFrame API**: PySpark-inspired operations for familiar data manipulation - **Semantic Operations**: Built-in AI/LLM operations including semantic functions, embeddings, and clustering - **Model Integration**: Native support for AI providers (Anthropic, OpenAI, Cohere, Google) - **Query Optimization**: Automatic optimization through logical plan transformations ## Read from Hugging Face Hub fenic can read datasets directly from the Hugging Face Hub using the `hf://` protocol. This functionality is built into fenic's DataFrameReader interface. ### Supported Formats fenic supports reading the following formats from Hugging Face: - **Parquet files** (`.parquet`) - **CSV files** (`.csv`) ### Reading Datasets To read a dataset from the Hugging Face Hub: ```python import fenic as fc session = fc.Session.get_or_create(fc.SessionConfig()) # Read a CSV file from a public dataset df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv") # Read Parquet files using glob patterns df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet") # Read from a specific dataset revision df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet") ``` ### Reading with Schema Management ```python # Read multiple CSV files with schema merging df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True) # Read multiple Parquet files with schema merging df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True) ``` > **Note:** In fenic, a schema is the 
set of column names and their data types. When you enable `merge_schemas`, fenic tries to reconcile differences across files by filling missing columns with nulls and widening types where it can. Some layouts still cannot be merged; consult the fenic docs for [CSV schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.csv) and [Parquet schema merging limitations](https://docs.fenic.ai/latest/reference/fenic/?h=parquet#fenic.DataFrameReader.parquet). ### Authentication To read private datasets, you need to set your Hugging Face token as an environment variable: ```shell export HF_TOKEN="your_hugging_face_token_here" ``` ### Path Format The Hugging Face path format in fenic follows this structure: ``` hf://{repo_type}/{repo_id}/{path_to_file} ``` You can also specify dataset revisions or versions: ``` hf://{repo_type}/{repo_id}@{revision}/{path_to_file} ``` Features: - Supports glob patterns (`*`, `**`) - Dataset revisions/versions using `@` notation: - Specific commit: `@d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e` - Branch: `@refs/convert/parquet` - Branch alias: `@~parquet` - Requires `HF_TOKEN` environment variable for private datasets ### Mixing Data Sources fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols: ```python # Mix HF and local files in one read call df = session.read.parquet([ "hf://datasets/cais/mmlu/astronomy/*.parquet", "file:///local/data/*.parquet", "./relative/path/data.parquet" ]) ``` This flexibility allows you to seamlessly combine data from Hugging Face Hub and local files in your data processing pipeline.
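The `hf://` path grammar above can be sketched as a small helper. Note that `hf_path` is not part of fenic's API; it is only an illustration of how the URI pieces combine:

```python
# Illustrative helper for the hf:// path grammar; `hf_path` is NOT part of
# fenic's API, just a sketch of how the URI pieces combine.
def hf_path(repo_id, path, repo_type="datasets", revision=None):
    rev = f"@{revision}" if revision is not None else ""
    return f"hf://{repo_type}/{repo_id}{rev}/{path}"

print(hf_path("cais/mmlu", "astronomy/*.parquet"))
# hf://datasets/cais/mmlu/astronomy/*.parquet
print(hf_path("datasets-examples/doc-formats-csv-1", "**/*.parquet", revision="~parquet"))
# hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet
```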
## Processing Data from Hugging Face Once loaded from Hugging Face, you can use fenic's full DataFrame API: ### Basic DataFrame Operations ```python import fenic as fc session = fc.Session.get_or_create(fc.SessionConfig()) # Load IMDB dataset from Hugging Face df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet") # Filter and select positive_reviews = df.filter(fc.col("label") == 1).select("text", "label") # Group by and aggregate label_counts = df.group_by("label").agg( fc.count("*").alias("count") ) ``` ### AI-Powered Operations To use semantic and embedding operations, configure language and embedding models in your SessionConfig. Once configured: ```python import fenic as fc # Requires OPENAI_API_KEY to be set for language and embedding calls session = fc.Session.get_or_create( fc.SessionConfig( semantic=fc.SemanticConfig( language_models={ "gpt-4o-mini": fc.OpenAILanguageModel( model_name="gpt-4o-mini", rpm=60, tpm=60000, ) }, embedding_models={ "text-embedding-3-small": fc.OpenAIEmbeddingModel( model_name="text-embedding-3-small", rpm=60, tpm=60000, ) }, ) ) ) # Load a text dataset from Hugging Face df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet") # Add embeddings to text columns df_with_embeddings = df.select( "*", fc.semantic.embed(fc.col("text")).alias("embedding") ) # Apply semantic functions for sentiment analysis df_analyzed = df_with_embeddings.select( "*", fc.semantic.analyze_sentiment( fc.col("text"), model_alias="gpt-4o-mini", # Optional: specify model ).alias("sentiment") ) ``` ## Example: Analyzing MMLU Dataset ```python import fenic as fc # Requires OPENAI_API_KEY to be set for semantic calls session = fc.Session.get_or_create( fc.SessionConfig( semantic=fc.SemanticConfig( language_models={ "gpt-4o-mini": fc.OpenAILanguageModel( model_name="gpt-4o-mini", rpm=60, tpm=60000, ) }, ) ) ) # Load MMLU astronomy subset from Hugging Face df = 
session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet") # Process the data processed_df = (df # Filter for specific criteria .filter(fc.col("subject") == "astronomy") # Select relevant columns .select("question", "choices", "answer") # Add difficulty analysis using semantic.map .select( "*", fc.semantic.map( "Rate the difficulty of this question from 1-5: {{question}}", question=fc.col("question"), model_alias="gpt-4o-mini" # Optional: specify model ).alias("difficulty") ) ) # Show results processed_df.show() ``` ## Resources - [fenic GitHub Repository](https://github.com/typedef-ai/fenic) - [fenic Documentation](https://docs.fenic.ai/latest/) ### Argilla https://huggingface.co/docs/hub/datasets-argilla.md # Argilla Argilla is a collaboration tool for AI engineers and domain experts who need to build high quality datasets for their projects. ![image](https://github.com/user-attachments/assets/0e6ce1d8-65ca-4211-b4ba-5182f88168a0) Argilla can be used for collecting human feedback for a wide variety of AI projects like traditional NLP (text classification, NER, etc.), LLMs (RAG, preference tuning, etc.), or multimodal models (text to image, etc.). Argilla's programmatic approach lets you build workflows for continuous evaluation and model improvement. The goal of Argilla is to ensure your data work pays off by quickly iterating on the right data and models. ## What do people build with Argilla? The community uses Argilla to create amazing open-source [datasets](https://huggingface.co/datasets?library=library:argilla&sort=trending) and [models](https://huggingface.co/models?other=distilabel). ### Open-source datasets and models Argilla contributed some models and datasets to open-source too. - [Cleaned UltraFeedback dataset](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) used to fine-tune the [Notus](https://huggingface.co/argilla/notus-7b-v1) and [Notux](https://huggingface.co/argilla/notux-8x7b-v1) models. 
The original UltraFeedback dataset was curated using Argilla UI filters to find and report a bug in the original data generation code. Based on this data curation process, Argilla built this new version of the UltraFeedback dataset and fine-tuned Notus, outperforming Zephyr on several benchmarks. - [distilabeled Intel Orca DPO dataset](https://huggingface.co/datasets/argilla/distilabel-intel-orca-dpo-pairs) used to fine-tune the [improved OpenHermes model](https://huggingface.co/argilla/distilabeled-OpenHermes-2.5-Mistral-7B). This dataset was built by combining human curation in Argilla with AI feedback from distilabel, leading to an improved version of the Intel Orca dataset and outperforming models fine-tuned on the original dataset. ### Example use cases AI teams from companies like [the Red Cross](https://510.global/), [Loris.ai](https://loris.ai/) and [Prolific](https://www.prolific.com/) use Argilla to improve the quality and efficiency of AI projects. They shared their experiences in our [AI community meetup](https://lu.ma/embed-checkout/evt-IQtRiSuXZCIW6FB). - AI for good: [the Red Cross presentation](https://youtu.be/ZsCqrAhzkFU?feature=shared) showcases how the Red Cross domain experts and AI team collaborated by classifying and redirecting requests from refugees of the Ukrainian crisis to streamline the support processes of the Red Cross. - Customer support: during [the Loris meetup](https://youtu.be/jWrtgf2w4VU?feature=shared) they showed how their AI team uses unsupervised and few-shot contrastive learning to help them quickly validate and gain labelled samples for a huge number of multi-label classifiers. - Research studies: [the showcase from Prolific](https://youtu.be/ePDlhIxnuAs?feature=shared) announced their integration with our platform. They use it to actively distribute data collection projects among their annotating workforce. This allows Prolific to quickly and efficiently collect high-quality data for research studies.
## Prerequisites First [login with your Hugging Face account](/docs/huggingface_hub/quick-start#login): ```bash hf auth login ``` Make sure you have `argilla>=2.0.0` installed: ```bash pip install -U argilla ``` Lastly, you will need to deploy the Argilla server and UI, which can be done [easily on the Hugging Face Hub](https://argilla-io.github.io/argilla/latest/getting_started/quickstart/#run-the-argilla-server). ## Importing and exporting datasets and records This guide shows how to import and export your dataset to the Hugging Face Hub. In Argilla, you can import/export two main components of a dataset: - The dataset's complete configuration defined in `rg.Settings`. This is useful if you want to share your feedback task or restore it later in Argilla. - The records stored in the dataset, including `Metadata`, `Vectors`, `Suggestions`, and `Responses`. This is useful if you want to use your dataset's records outside of Argilla. ### Push an Argilla dataset to the Hugging Face Hub You can push a dataset from Argilla to the Hugging Face Hub. This is useful if you want to share your dataset with the community or version control it. You can push the dataset to the Hugging Face Hub using the `rg.Dataset.to_hub` method. ```python import argilla as rg client = rg.Argilla(api_url="", api_key="") dataset = client.datasets(name="my_dataset") dataset.to_hub(repo_id="") ``` #### With or without records The example above will push the dataset's `Settings` and records to the hub. If you only want to push the dataset's configuration, you can set the `with_records` parameter to `False`. This is useful if you're just interested in a specific dataset template or you want to make changes in the dataset settings and/or records. ```python dataset.to_hub(repo_id="", with_records=False) ``` ### Pull an Argilla dataset from the Hugging Face Hub You can pull a dataset from the Hugging Face Hub to Argilla. This is useful if you want to restore a dataset and its configuration.
You can pull the dataset from the Hugging Face Hub using the `rg.Dataset.from_hub` method. ```python import argilla as rg client = rg.Argilla(api_url="", api_key="") dataset = rg.Dataset.from_hub(repo_id="") ``` The `rg.Dataset.from_hub` method loads the configuration and records from the dataset repo. If you only want to load records, you can pass a `datasets.Dataset` object to the `rg.Dataset.log` method. This enables you to configure your own dataset and reuse existing Hub datasets. #### With or without records The example above will pull the dataset's `Settings` and records from the hub. If you only want to pull the dataset's configuration, you can set the `with_records` parameter to `False`. This is useful if you're just interested in a specific dataset template or you want to make changes in the dataset settings and/or records. ```python dataset = rg.Dataset.from_hub(repo_id="", with_records=False) ``` With the dataset's configuration you could then make changes to the dataset. For example, you could adapt the dataset's settings for a different task: ```python dataset.settings.questions = [rg.TextQuestion(name="answer")] ``` You could then log the dataset's records using the `load_dataset` function of the `datasets` package and pass the dataset to the `rg.Dataset.log` method. ```python from datasets import load_dataset hf_dataset = load_dataset("") dataset.log(hf_dataset) ``` ## 📚 Resources - [🚀 Argilla Docs](https://argilla-io.github.io/argilla/) - [🚀 Argilla Docs - import export guides](https://argilla-io.github.io/argilla/latest/how_to_guides/import_export/) ### Widget Examples https://huggingface.co/docs/hub/models-widgets-examples.md # Widget Examples Note that each widget example can also optionally describe the corresponding model output, directly in the `output` property. See [the spec](./models-widgets#example-outputs) for more details. ## Natural Language Processing ### Fill-Mask ```yaml widget: - text: "Paris is the <mask> of France."
example_title: "Capital" - text: "The goal of life is <mask>." example_title: "Philosophy" ``` ### Question Answering ```yaml widget: - text: "What's my name?" context: "My name is Clara and I live in Berkeley." example_title: "Name" - text: "Where do I live?" context: "My name is Sarah and I live in London" example_title: "Location" ``` ### Summarization ```yaml widget: - text: "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct." example_title: "Eiffel Tower" - text: "Laika, a dog that was the first living creature to be launched into Earth orbit, on board the Soviet artificial satellite Sputnik 2, on November 3, 1957. It was always understood that Laika would not survive the mission, but her actual fate was misrepresented for decades. Laika was a small (13 pounds [6 kg]), even-tempered, mixed-breed dog about two years of age. She was one of a number of stray dogs that were taken into the Soviet spaceflight program after being rescued from the streets. Only female dogs were used because they were considered to be anatomically better suited than males for close confinement." example_title: "First in Space" ``` ### Table Question Answering ```yaml widget: - text: "How many stars does the transformers repository have?"
table: Repository: - "Transformers" - "Datasets" - "Tokenizers" Stars: - 36542 - 4512 - 3934 Contributors: - 651 - 77 - 34 Programming language: - "Python" - "Python" - "Rust, Python and NodeJS" example_title: "Github stars" ``` ### Text Classification ```yaml widget: - text: "I love football so much" example_title: "Positive" - text: "I don't really like this type of food" example_title: "Negative" ``` ### Text Generation ```yaml widget: - text: "My name is Julien and I like to" example_title: "Julien" - text: "My name is Merve and my favorite" example_title: "Merve" ``` ### Text2Text Generation ```yaml widget: - text: "My name is Julien and I like to" example_title: "Julien" - text: "My name is Merve and my favorite" example_title: "Merve" ``` ### Token Classification ```yaml widget: - text: "My name is Sylvain and I live in Paris" example_title: "Parisian" - text: "My name is Sarah and I live in London" example_title: "Londoner" ``` ### Translation ```yaml widget: - text: "My name is Sylvain and I live in Paris" example_title: "Parisian" - text: "My name is Sarah and I live in London" example_title: "Londoner" ``` ### Zero-Shot Classification ```yaml widget: - text: "I have a problem with my car that needs to be resolved asap!!" candidate_labels: "urgent, not urgent, phone, tablet, computer" multi_class: true example_title: "Car problem" - text: "Last week I upgraded my iOS version and ever since then my phone has been overheating whenever I use your app." candidate_labels: "mobile, website, billing, account access" multi_class: false example_title: "Phone issue" ``` ### Sentence Similarity ```yaml widget: - source_sentence: "That is a happy person" sentences: - "That is a happy dog" - "That is a very happy person" - "Today is a sunny day" example_title: "Happy" ``` ### Conversational ```yaml widget: - text: "Hey my name is Julien! How are you?" example_title: "Julien" - text: "Hey my name is Clara! How are you?" 
example_title: "Clara" ``` ### Feature Extraction ```yaml widget: - text: "My name is Sylvain and I live in Paris" example_title: "Parisian" - text: "My name is Sarah and I live in London" example_title: "Londoner" ``` ## Audio ### Text-to-Speech ```yaml widget: - text: "My name is Sylvain and I live in Paris" example_title: "Parisian" - text: "My name is Sarah and I live in London" example_title: "Londoner" ``` ### Automatic Speech Recognition ```yaml widget: - src: https://cdn-media.huggingface.co/speech_samples/sample1.flac example_title: Librispeech sample 1 - src: https://cdn-media.huggingface.co/speech_samples/sample2.flac example_title: Librispeech sample 2 ``` ### Audio-to-Audio ```yaml widget: - src: https://cdn-media.huggingface.co/speech_samples/sample1.flac example_title: Librispeech sample 1 - src: https://cdn-media.huggingface.co/speech_samples/sample2.flac example_title: Librispeech sample 2 ``` ### Audio Classification ```yaml widget: - src: https://cdn-media.huggingface.co/speech_samples/sample1.flac example_title: Librispeech sample 1 - src: https://cdn-media.huggingface.co/speech_samples/sample2.flac example_title: Librispeech sample 2 ``` ### Voice Activity Detection ```yaml widget: - src: https://cdn-media.huggingface.co/speech_samples/sample1.flac example_title: Librispeech sample 1 - src: https://cdn-media.huggingface.co/speech_samples/sample2.flac example_title: Librispeech sample 2 ``` ## Computer Vision ### Image Classification ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg example_title: Tiger - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg example_title: Teapot ``` ### Object Detection ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg example_title: Football Match - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg example_title: Airport ``` ### Image 
Segmentation ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg example_title: Football Match - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg example_title: Airport ``` ### Image-to-Image ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/canny-edge.jpg prompt: Girl with Pearl Earring # `prompt` field is optional in case the underlying model supports text guidance ``` ### Image-to-Video ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/canny-edge.jpg prompt: Girl with Pearl Earring # `prompt` field is optional in case the underlying model supports text guidance ``` ### Text-to-Image ```yaml widget: - text: "A cat playing with a ball" example_title: "Cat" - text: "A dog jumping over a fence" example_title: "Dog" ``` ### Document Question Answering ```yaml widget: - text: "What is the invoice number?" src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png" - text: "What is the purchase amount?" src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/contract.jpeg" ``` ### Visual Question Answering ```yaml widget: - text: "What animal is it?" src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg" - text: "Where is it?" 
src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg" ``` ### Zero-Shot Image Classification ```yaml widget: - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png candidate_labels: playing music, playing sports example_title: Cat & Dog ``` ## Other ### Structured Data Classification ```yaml widget: - structured_data: fixed_acidity: - 7.4 - 7.8 - 10.3 volatile_acidity: - 0.7 - 0.88 - 0.32 citric_acid: - 0 - 0 - 0.45 residual_sugar: - 1.9 - 2.6 - 6.4 chlorides: - 0.076 - 0.098 - 0.073 free_sulfur_dioxide: - 11 - 25 - 5 total_sulfur_dioxide: - 34 - 67 - 13 density: - 0.9978 - 0.9968 - 0.9976 pH: - 3.51 - 3.2 - 3.23 sulphates: - 0.56 - 0.68 - 0.82 alcohol: - 9.4 - 9.8 - 12.6 example_title: "Wine" ``` ### Spaces Settings https://huggingface.co/docs/hub/spaces-settings.md # Spaces Settings You can configure your Space's appearance and other settings inside the `YAML` block at the top of the **README.md** file at the root of the repository. For example, if you want to create a Space with Gradio named `Demo Space` with a yellow to orange gradient thumbnail: ```yaml --- title: Demo Space emoji: 🤗 colorFrom: yellow colorTo: orange sdk: gradio app_file: app.py pinned: false --- ``` For additional settings, refer to the [Reference](./spaces-config-reference) section. ### Spaces Configuration Reference https://huggingface.co/docs/hub/spaces-config-reference.md # Spaces Configuration Reference Spaces are configured through the `YAML` block at the top of the **README.md** file at the root of the repository. All the accepted parameters are listed below. **`title`** : _string_ Display title for the Space. **`emoji`** : _string_ Space emoji (emoji-only character allowed). **`colorFrom`** : _string_ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray). **`colorTo`** : _string_ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray).
**`sdk`** : _string_ Can be either `gradio`, `docker`, or `static`. **`python_version`**: _string_ Any valid Python `3.x` or `3.x.x` version. Defaults to `3.10`. **`sdk_version`** : _string_ Specify the version of Gradio to use. All versions of Gradio are supported. **`suggested_hardware`** : _string_ Specify the suggested [hardware](https://huggingface.co/docs/hub/spaces-gpus) on which this Space must be run. Useful for Spaces that are meant to be duplicated by other users. Setting this value will not automatically assign hardware to this Space. Value must be a valid hardware flavor. Current valid hardware flavors: - CPU: `"cpu-basic"`, `"cpu-upgrade"` - GPU: `"t4-small"`, `"t4-medium"`, `"l4x1"`, `"l4x4"`, `"l40sx1"`, `"l40sx4"`, `"l40sx8"`, `"a10g-small"`, `"a10g-large"`, `"a10g-largex2"`, `"a10g-largex4"`, `"a100-large"`, `"a100x4"`, `"a100x8"` **`suggested_storage`** : _string_ Specify the suggested [permanent storage](https://huggingface.co/docs/hub/spaces-storage) on which this Space must be run. Useful for Spaces that are meant to be duplicated by other users. Setting this value will not automatically assign permanent storage to this Space. Value must be one of `"small"`, `"medium"` or `"large"`. > [!NOTE] > The persistent storage feature is no longer available, so this setting will be ignored. **`app_file`** : _string_ Path to your main application file (which contains either `gradio` Python code or `static` HTML code). Path is relative to the root of the repository. **`app_build_command`** : _string_ For static Spaces, command to run first to generate the HTML to render. Example: `npm run build`. This is used in conjunction with `app_file` which points to the built index file: e.g. `app_file: dist/index.html`. On each update, the build command will run in a Job and the build output will be stored in `refs/convert/build`, which will be served by the Space.
See an example at https://huggingface.co/spaces/coyotte508/static-vite **`app_port`** : _int_ Port on which your application is running. Used only if `sdk` is `docker`. Default port is `7860`. **`base_path`**: _string_ For non-static Spaces, initial URL to render. Needs to start with `/`. For static Spaces, use `app_file` instead. **`fullWidth`**: _boolean_ Whether your Space is rendered inside a full-width (when `true`) or fixed-width column (i.e. "container" CSS) inside the iframe. Defaults to `true`. **`header`**: _string_ Can be either `mini` or `default`. If `header` is set to `mini`, the Space will be displayed full-screen with a mini floating header. **`short_description`**: _string_ A short description of the Space. This will be displayed in the Space's thumbnail. **`models`** : _List[string]_ HF model IDs (like `openai-community/gpt2` or `deepset/roberta-base-squad2`) used in the Space. Will be parsed automatically from your code if not specified here. **`datasets`** : _List[string]_ HF dataset IDs (like `mozilla-foundation/common_voice_13_0` or `oscar-corpus/OSCAR-2109`) used in the Space. Will be parsed automatically from your code if not specified here. **`tags`** : _List[string]_ List of terms that describe your Space task or scope. **`thumbnail`**: _string_ URL for defining a custom thumbnail for social sharing. **`pinned`** : _boolean_ Whether the Space stays on top of your profile. Can be useful if you have a lot of Spaces so you and others can quickly see your best Space. **`hf_oauth`** : _boolean_ Whether a connected OAuth app is associated to this Space. See [Adding a Sign-In with HF button to your Space](https://huggingface.co/docs/hub/spaces-oauth) for more details. **`hf_oauth_scopes`** : _List[string]_ Authorized scopes of the connected OAuth app. `openid` and `profile` are authorized by default and do not need this parameter. See [Adding a Sign-In with HF button to your space](https://huggingface.co/docs/hub/spaces-oauth) for more details.
**`hf_oauth_expiration_minutes`** : _int_ Duration of the OAuth token in minutes. Defaults to 480 minutes (8 hours). Maximum duration is 43200 minutes (30 days). See [Adding a Sign-In with HF button to your space](https://huggingface.co/docs/hub/spaces-oauth) for more details. **`hf_oauth_authorized_org`** : _string_ or _List[string]_ Restrict OAuth access to members of specific organizations. See [Adding a Sign-In with HF button to your space](https://huggingface.co/docs/hub/spaces-oauth) for more details. **`disable_embedding`** : _boolean_ Whether the Space iframe can be embedded in other websites. Defaults to false, i.e. Spaces *can* be embedded. **`startup_duration_timeout`**: _string_ Set a custom startup duration timeout for your Space. This is the maximum time your Space is allowed to start before it times out and is flagged as unhealthy. Defaults to 30 minutes, but any valid duration (like `1h`, `30m`) is acceptable. **`custom_headers`** : _Dict[string, string]_ Set custom HTTP headers that will be added to all HTTP responses when serving your Space. For now, only the [cross-origin-embedder-policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cross-Origin-Embedder-Policy) (COEP), [cross-origin-opener-policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cross-Origin-Opener-Policy) (COOP), and [cross-origin-resource-policy](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cross-Origin-Resource-Policy) (CORP) headers are allowed. These headers can be used to set up a cross-origin isolated environment and enable powerful features like `SharedArrayBuffer`, for example: ```yaml custom_headers: cross-origin-embedder-policy: require-corp cross-origin-opener-policy: same-origin cross-origin-resource-policy: cross-origin ``` *Note:* all headers and values must be lowercase. **`preload_from_hub`**: _List[string]_ Specify a list of Hugging Face Hub models or other large files to be preloaded during the build time of your Space. 
This optimizes the startup time by having the files ready when your application starts. This is particularly useful for Spaces that rely on large models or datasets that would otherwise need to be downloaded at runtime.

The format for each item is `"repository_name"` to download all files from a repository, or `"repository_name file1,file2"` to download specific files within that repository. You can also pin a specific commit to download using the format `"repository_name file1,file2 commit_sha256"`.

Example usage:

```yaml
preload_from_hub:
  - warp-ai/wuerstchen-prior text_encoder/model.safetensors,prior/diffusion_pytorch_model.safetensors
  - coqui/XTTS-v1
  - openai-community/gpt2 config.json 11c5a3d5811f50298f278a704980280950aedb10
```

In this example, the Space will preload specific .safetensors files from `warp-ai/wuerstchen-prior`, the complete `coqui/XTTS-v1` repository, and a specific revision of the `config.json` file in the `openai-community/gpt2` repository from the Hugging Face Hub during build time.

> [!WARNING]
> Files are saved in the default `huggingface_hub` disk cache `~/.cache/huggingface/hub`. If your application expects them elsewhere, or if you changed your `HF_HOME` variable, this preloading does not currently follow it.

> [!NOTE]
> Preloading of private repos is not supported yet.

### WebDataset

https://huggingface.co/docs/hub/datasets-webdataset.md

# WebDataset

[WebDataset](https://github.com/webdataset/webdataset) is a library for writing I/O pipelines for large datasets. Its sequential I/O and sharding features make it especially useful for streaming large-scale datasets to a DataLoader.

## The WebDataset format

A WebDataset file is a TAR archive containing a series of data files.
All successive data files with the same prefix are considered to be part of the same example (e.g., an image/audio file and its label or metadata). Labels and metadata can be in a `.json` file, in a `.txt` file (for a caption or a description), or in a `.cls` file (for a class index).

A large-scale WebDataset is made of many files called shards, where each shard is a TAR archive. Each shard is often ~1GB, but the full dataset can be multiple terabytes!

## Multimodal support

WebDataset is designed for multimodal datasets, i.e. for image, audio and/or video datasets. Indeed, since media files tend to be quite big, WebDataset's sequential I/O enables large reads and buffering, resulting in the best data loading speed.

Here is a non-exhaustive list of supported data formats:

- image: jpeg, png, tiff
- audio: mp3, m4a, wav, flac
- video: mp4, mov, avi
- other: npy, npz

The full list evolves over time and depends on the implementation. For example, you can find which formats the `webdataset` package supports in the source code [here](https://github.com/webdataset/webdataset/blob/main/src/webdataset/autodecode.py).

## Streaming

Streaming TAR archives is fast because it reads contiguous chunks of data. It can be orders of magnitude faster than reading separate data files one by one.
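The prefix grouping and sequential reads described above can be sketched with Python's standard library alone. This is a minimal illustration with hypothetical file names, not the actual `webdataset` implementation (which additionally handles `__key__` fields, decoding, and remote shards): it writes a tiny in-memory shard, then streams it back member by member, grouping successive entries that share a key prefix into samples.

```python
import io
import tarfile
from itertools import groupby

# Write a tiny two-sample shard in the WebDataset layout: successive files
# sharing a prefix ("key") belong to the same example. Names are hypothetical.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [
        ("sample000.jpg", b"<jpeg bytes>"),
        ("sample000.cls", b"0"),
        ("sample001.jpg", b"<jpeg bytes>"),
        ("sample001.cls", b"1"),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read the shard back sequentially and group successive members by prefix.
buf.seek(0)
samples = []
with tarfile.open(fileobj=buf, mode="r") as tar:
    members = ((m.name.split(".", 1), tar.extractfile(m).read()) for m in tar)
    for key, group in groupby(members, key=lambda item: item[0][0]):
        # Each sample maps file extension -> raw bytes, like {"jpg": ..., "cls": ...}
        samples.append({ext: data for (_, ext), data in group})

print([sorted(s) for s in samples])  # -> [['cls', 'jpg'], ['cls', 'jpg']]
```

Because the grouping only looks at successive members, the whole shard can be consumed as a pure stream, which is exactly what makes the format friendly to sequential I/O.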
WebDataset streaming offers high-speed performance both when reading from disk and from cloud storage, which makes it an ideal format to feed to a DataLoader. For example, here is how to stream the [timm/imagenet-12k-wds](https://huggingface.co/datasets/timm/imagenet-12k-wds) dataset directly from Hugging Face.

First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```
hf auth login
```

And then you can stream the dataset with WebDataset:

```python
>>> import webdataset as wds
>>> from huggingface_hub import get_token
>>> from torch.utils.data import DataLoader

>>> hf_token = get_token()
>>> url = "https://huggingface.co/datasets/timm/imagenet-12k-wds/resolve/main/imagenet12k-train-{0000..1023}.tar"
>>> url = f"pipe:curl -s -L {url} -H 'Authorization:Bearer {hf_token}'"
>>> dataset = wds.WebDataset(url).decode()
>>> dataloader = DataLoader(dataset, batch_size=64, num_workers=4)
```

## Shuffle

Generally, datasets in WebDataset formats are already shuffled and ready to feed to a DataLoader. But you can still reshuffle the data with WebDataset's approximate shuffling.

In addition to shuffling the list of shards, WebDataset uses a buffer to shuffle a dataset without any cost to speed. To shuffle the list of sharded files and randomly sample from the shuffle buffer:

```python
>>> buffer_size = 1000
>>> dataset = (
...     wds.WebDataset(url, shardshuffle=True)
...     .shuffle(buffer_size)
...     .decode()
... )
```

### Authentication for private and gated datasets

https://huggingface.co/docs/hub/datasets-duckdb-auth.md

# Authentication for private and gated datasets

To access private or gated datasets, you need to configure your Hugging Face Token in the DuckDB Secrets Manager.

Visit [Hugging Face Settings - Tokens](https://huggingface.co/settings/tokens) to obtain your access token.
DuckDB supports two providers for managing secrets:

- `CONFIG`: Requires the user to pass all configuration information into the CREATE SECRET statement.
- `CREDENTIAL_CHAIN`: Automatically tries to fetch credentials. For the Hugging Face token, it will try to get it from `~/.cache/huggingface/token`.

For more information about DuckDB Secrets visit the [Secrets Manager](https://duckdb.org/docs/configuration/secrets_manager.html) guide.

## Creating a secret with `CONFIG` provider

To create a secret using the CONFIG provider, use the following command:

```bash
CREATE SECRET hf_token (TYPE HUGGINGFACE, TOKEN 'your_hf_token');
```

Replace `your_hf_token` with your actual Hugging Face token.

## Creating a secret with `CREDENTIAL_CHAIN` provider

To create a secret using the CREDENTIAL_CHAIN provider, use the following command:

```bash
CREATE SECRET hf_token (TYPE HUGGINGFACE, PROVIDER credential_chain);
```

This command automatically retrieves the stored token from `~/.cache/huggingface/token`. First you need to [Login with your Hugging Face account](/docs/huggingface_hub/quick-start#login), for example using:

```bash
hf auth login
```

Alternatively, you can set your Hugging Face token as an environment variable:

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxx"
```

For more information on authentication, see the [Hugging Face authentication](/docs/huggingface_hub/main/en/quick-start#authentication) documentation.

### Datasets

https://huggingface.co/docs/hub/datasets.md

# Datasets

The Hugging Face Hub is home to a growing collection of datasets that span a variety of domains and tasks. These docs will guide you through interacting with the datasets on the Hub, uploading new datasets, exploring the dataset contents, and using datasets in your projects.

This documentation focuses on the datasets functionality in the Hugging Face Hub and how to use the datasets with supported libraries.
For detailed information about the 🤗 Datasets python package, visit the [🤗 Datasets documentation](/docs/datasets/index).

## Contents

- [Datasets Overview](./datasets-overview)
- [Dataset Cards](./datasets-cards)
- [Gated Datasets](./datasets-gated)
- [Uploading Datasets](./datasets-adding)
- [Downloading Datasets](./datasets-downloading)
- [Libraries](./datasets-libraries)
- [Dataset Viewer](./datasets-viewer)
- [Data files Configuration](./datasets-data-files-configuration)

### Storage Buckets

https://huggingface.co/docs/hub/storage-buckets.md

# Storage Buckets

Storage Buckets are a repo type on the Hugging Face Hub providing S3-like object storage, powered by the [Xet](./xet/index) storage backend. Unlike Git-based [repositories](./repositories) (models, datasets, Spaces), buckets are **non-versioned** and **mutable**, designed for use cases where you need simple, fast storage such as training checkpoints, logs, intermediate artifacts, or any large collection of files that doesn't need version control.

You can interact with buckets using the Hub web interface, the [`hf` CLI](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-buckets), or the [Python API](https://huggingface.co/docs/huggingface_hub/guides/buckets).

> [!TIP]
> Buckets are available to all users and organizations. See [hf.co/storage](https://huggingface.co/storage) for pricing details.

> [!TIP]
> See [Access Patterns](./storage-buckets-access) for how to reach bucket data from your tools (mount as a filesystem, `hf://` paths, volume mounts in Jobs/Spaces), and [Bucket Integrations](./storage-buckets-integrations) for ready-to-use snippets in popular data libraries like pandas, Dask, and Spark.

## Buckets vs Repositories

The Hub offers two types of storage: Git-based **repositories** for versioned, collaborative work and **buckets** for fast, mutable object storage.
| Feature             | Repositories (Git-based)      | Storage Buckets                     |
| ------------------- | ----------------------------- | ----------------------------------- |
| Versioning          | Full Git history              | None (mutable, overwrite-in-place)  |
| Types               | Models, Datasets, Spaces      | Standalone bucket                   |
| Primary use case    | Publishing finished artifacts | Working storage / intermediate data |
| Operations          | Hub API, Git push/pull        | S3-like `sync`, `cp`, `rm`          |
| Deduplication       | Xet chunk-level               | Xet chunk-level                     |
| Pull Requests       | Yes                           | No                                  |
| Model/Dataset Cards | Yes                           | No                                  |

Use **repositories** when you want version history, collaboration features (PRs, discussions), and library integrations. Use **buckets** when you need fast, mutable storage for data that changes frequently: files can be overwritten or deleted in place.

## Creating a Bucket

### From the Hub UI

1. Navigate to [huggingface.co/new-bucket](https://huggingface.co/new-bucket).
2. Specify the owner of the bucket: this can be either you or any of the organizations you're affiliated with.
3. Enter a bucket name.
4. Choose whether the bucket should be public or private.
5. Optionally, preselect [CDN pre-warming](#pre-warming-and-cdn) regions to cache your data closer to your compute from the start.
After creating the bucket, you should see the bucket page.

### From the CLI

```bash
# Create a bucket under your namespace
hf buckets create my-bucket

# Create a private bucket
hf buckets create my-bucket --private

# Create a bucket under an organization
hf buckets create my-org/shared-bucket
```

### From Python

```python
from huggingface_hub import create_bucket

# Create a bucket under your namespace
create_bucket("my-bucket")

# Create a private bucket
create_bucket("my-bucket", private=True)

# Create a bucket under an organization
create_bucket("my-org/shared-bucket")
```

For the full Python API reference including deleting, moving, and listing buckets, see the [`huggingface_hub` Buckets guide](https://huggingface.co/docs/huggingface_hub/guides/buckets).

## Browsing Buckets on the Hub

Every bucket has a page on the Hub where you can browse its contents, navigate directories, and view file details. Bucket pages are available at `https://huggingface.co/buckets//`.

You can also list bucket contents from the CLI:

```bash
# List files in a bucket (with human-readable sizes)
hf buckets list julien-c/my-training-bucket -h
Feb 17 14:46 art/
Feb 17 14:58 arxivqa/
Feb 17 15:02 arxivqa2/
Feb 17 15:04 arxivqa3/
Feb 17 14:47 captcha/
Feb 17 14:53 captcha2/
Feb 24 17:22 julien/

# Recursive listing
hf buckets list julien-c/my-training-bucket/art -h -R
423.6 MB Feb 17 14:29 art/train-00000-of-00011.parquet
441.0 MB Feb 17 14:29 art/train-00001-of-00011.parquet
521.7 MB Feb 17 14:29 art/train-00002-of-00011.parquet
481.4 MB Feb 17 14:29 art/train-00003-of-00011.parquet
444.6 MB Feb 17 14:29 art/train-00004-of-00011.parquet
461.6 MB Feb 17 14:29 art/train-00005-of-00011.parquet
466.4 MB Feb 17 14:29 art/train-00006-of-00011.parquet
486.3 MB Feb 17 14:29 art/train-00007-of-00011.parquet
477.0 MB Feb 17 14:29 art/train-00008-of-00011.parquet
454.0 MB Feb 17 14:29 art/train-00009-of-00011.parquet
483.1 MB Feb 17 14:29 art/train-00010-of-00011.parquet

# Tree view
hf buckets list julien-c/my-training-bucket --tree -h -R
├── art/
423.6 MB Feb 17 14:29 │   ├── train-00000-of-00011.parquet
441.0 MB Feb 17 14:29 │   ├── train-00001-of-00011.parquet
521.7 MB Feb 17 14:29 │   ├── train-00002-of-00011.parquet
481.4 MB Feb 17 14:29 │   ├── train-00003-of-00011.parquet
444.6 MB Feb 17 14:29 │   ├── train-00004-of-00011.parquet
461.6 MB Feb 17 14:29 │   ├── train-00005-of-00011.parquet
466.4 MB Feb 17 14:29 │   ├── train-00006-of-00011.parquet
486.3 MB Feb 17 14:29 │   ├── train-00007-of-00011.parquet
477.0 MB Feb 17 14:29 │   ├── train-00008-of-00011.parquet
454.0 MB Feb 17 14:29 │   ├── train-00009-of-00011.parquet
483.1 MB Feb 17 14:29 │   └── train-00010-of-00011.parquet
├── arxivqa/
495.9 MB Feb 17 14:32 │   ├── train-00000-of-00164.parquet
518.3 MB Feb 17 14:32 │   ├── train-00001-of-00164.parquet
495.5 MB Feb 17 14:32 │   ├── train-00002-of-00164.parquet
486.6 MB Feb 17 14:32 │   ├── train-00003-of-00164.parquet
490.4 MB Feb 17 14:32 │   ├── train-00004-of-00164.parquet
...
```

## Managing Files

You can upload and download files directly from the bucket page on the Hub, or use the CLI and Python API for programmatic access. Bucket files are referenced using `hf://buckets/` paths (e.g., `hf://buckets/username/my-bucket/path/to/file`).

The `hf buckets cp` command handles individual file transfers, while `hf buckets sync` is better suited for directories. All commands work in both directions: local-to-remote and remote-to-local.

### Uploading files

For quick uploads, you can drag and drop files directly on the bucket page in your browser.

For programmatic use, `hf buckets cp` copies individual files into a bucket. The source is a local path and the destination is an `hf://buckets/` path. You can also pipe data from stdin, which is handy for programmatically generated content.
**CLI:**

```bash
# Upload a single file
hf buckets cp ./model.safetensors hf://buckets/username/my-bucket/models/model.safetensors

# Upload from stdin
cat config.json | hf buckets cp - hf://buckets/username/my-bucket/config.json
```

In Python, use `batch_bucket_files` to upload one or more files in a single call. Each entry is a tuple of `(local_path, remote_path)`.

**Python:**

```python
from huggingface_hub import batch_bucket_files

batch_bucket_files(
    "username/my-bucket",
    add=[
        ("./model.safetensors", "models/model.safetensors"),
        ("./config.json", "models/config.json"),
    ],
)
```

For more upload options (raw bytes, combined upload+delete, etc.), see the [`huggingface_hub` upload guide](https://huggingface.co/docs/huggingface_hub/guides/buckets#upload-files).

### Downloading files

You can download individual files directly from the bucket page on the Hub by clicking on them.

For programmatic access, downloading mirrors the upload syntax: swap the source and destination in `hf buckets cp`. You can also stream a file to stdout by using `-` as the destination, which lets you pipe bucket contents directly into other tools.

**CLI:**

```bash
# Download a single file
hf buckets cp hf://buckets/username/my-bucket/models/model.safetensors ./model.safetensors

# Download to stdout and pipe
hf buckets cp hf://buckets/username/my-bucket/config.json - | jq .
```

In Python, use `download_bucket_files` with a list of `(remote_path, local_path)` tuples.

**Python:**

```python
from huggingface_hub import download_bucket_files

download_bucket_files(
    "username/my-bucket",
    files=[
        ("models/model.safetensors", "./local/model.safetensors"),
        ("config.json", "./local/config.json"),
    ],
)
```

For faster downloads using pre-fetched metadata, see the [`huggingface_hub` download guide](https://huggingface.co/docs/huggingface_hub/guides/buckets#download-files).
### Syncing directories

The `sync` command works like `rsync` or `aws s3 sync`: it compares source and destination and only transfers files that have changed. This is the most efficient way to keep a local directory and a bucket in sync.

By default, `sync` only adds and updates files. Pass `--delete` to also remove files at the destination that no longer exist at the source. Use `--dry-run` to preview what would happen without actually transferring anything.

**CLI:**

```bash
# Upload a local directory to a bucket
hf buckets sync ./data hf://buckets/username/my-bucket/data

# Download from a bucket to a local directory
hf buckets sync hf://buckets/username/my-bucket/data ./data

# Sync with deletion of extraneous files
hf buckets sync ./data hf://buckets/username/my-bucket/data --delete

# Preview what would be synced without executing
hf buckets sync ./data hf://buckets/username/my-bucket/data --dry-run

# Plan and apply: review the sync plan before executing
hf buckets sync ./data hf://buckets/username/my-bucket/data --plan sync-plan.jsonl
# ... review the plan file, then apply it
hf buckets sync --apply sync-plan.jsonl
```

> [!TIP]
> `hf sync` is a convenient alias for `hf buckets sync`.

**Python:**

```python
from huggingface_hub import sync_bucket

# Upload a local directory to a bucket
sync_bucket("./data", "hf://buckets/username/my-bucket/data")

# Download from a bucket to a local directory
sync_bucket("hf://buckets/username/my-bucket/data", "./data")
```

The `sync` command supports filtering (`--include`, `--exclude`), comparison modes (`--ignore-times`, `--existing`), and a **plan-and-apply** workflow to review operations before executing them. For the full set of options, see the [`huggingface_hub` sync guide](https://huggingface.co/docs/huggingface_hub/guides/buckets#sync-directories).

### Deleting files

Since buckets are non-versioned, deletions are immediate and permanent: there is no way to recover a deleted file.
Use `--dry-run` to double-check before removing files, especially when using `--recursive`.

**CLI:**

```bash
# Remove a single file
hf buckets rm username/my-bucket/old-model.bin

# Remove all files under a prefix
hf buckets rm username/my-bucket/logs/ --recursive

# Preview what would be deleted
hf buckets rm username/my-bucket/checkpoints/ --recursive --dry-run
```

**Python:**

```python
from huggingface_hub import batch_bucket_files

batch_bucket_files("username/my-bucket", delete=["old-model.bin", "logs/debug.log"])
```

For more deletion options (pattern-based filtering, recursive removal, etc.), see the [`huggingface_hub` delete guide](https://huggingface.co/docs/huggingface_hub/guides/buckets#delete-files).

### Copying files between repos and buckets

You can copy [Xet](./xet/index)-tracked files from any repository (model, dataset, Space) or bucket into a destination bucket without re-uploading the data. The copy is server-side: only the Xet content hashes are migrated, so even very large files are copied instantly.

> [!NOTE]
> Only Xet-tracked files are copied server-to-server. Small non-Xet files (e.g., config files and READMEs) are automatically downloaded and re-uploaded.

**CLI:**

```bash
hf buckets cp \
  hf://datasets/HuggingFaceFW/fineweb/data \
  hf://buckets/username/fineweb-data
```

**Python:**

```python
from huggingface_hub import HfApi

api = HfApi()
api.copy_files(
    "hf://datasets/HuggingFaceFW/fineweb/data",
    "hf://buckets/username/fineweb-data",
)
```

You need read access to the source repository or bucket and write access to the destination bucket. Note that transferring data the other way, from a bucket to a repository (model, dataset, Space), without re-uploading is not yet available, but is on the roadmap.

## Pre-warming and CDN

Buckets live on the Hub's global storage by default. For workloads where storage location directly affects throughput, you can **pre-warm** bucket data to bring it closer to your compute.
Pre-warming caches files at edge locations near specific cloud providers and regions, so your jobs read data locally instead of pulling it across regions. This is especially useful for:

- Training clusters that need fast access to large datasets or checkpoints
- Multi-region setups where different parts of a pipeline run in different clouds
- Distributing large artifacts to many consumers worldwide

See [hf.co/storage](https://huggingface.co/storage) for available regions and details on enabling pre-warming.

## Use Cases

### Training checkpoints and logs

When running training jobs (e.g., via [Jobs](./jobs)), save checkpoints and logs to a bucket. Unlike a Git repo, you can overwrite the latest checkpoint without accumulating version history, and `sync` ensures only changed data is transferred.

```bash
# After each evaluation step, sync checkpoints to a bucket
hf sync ./checkpoints hf://buckets/my-org/training-run-42/checkpoints
```

Because buckets are built on [Xet](./xet/index), successive checkpoints where large parts of the model are frozen benefit from chunk-level deduplication. Only the changed chunks are uploaded.

### Data processing pipelines

Buckets serve as staging areas for data processing workflows. Process raw data, write intermediate outputs to a bucket, then promote the final artifact to a versioned [Dataset](./datasets) repository when the pipeline completes. This keeps your versioned repo clean while giving your pipeline fast, mutable storage. Note that transferring data from a bucket to a repository without re-uploading is not yet available, but is on the roadmap.

### Agentic storage

AI agents need scratch storage for intermediate results, tool outputs, traces, and working memory. Buckets provide a Hub-native place for this data: fast mutable access without Git overhead, standard Hugging Face permissions, and addressability via `hf://buckets/` paths across the Hub ecosystem.

### Rolling backups

Buckets are well-suited for maintaining rolling backups.
With a Git-based [Dataset](./datasets) repository, deleting outdated files doesn't free storage: Git history retains every past version, so you'd need to squash commits or rewrite history to actually reclaim space. With buckets, old files are truly gone once deleted, and you only pay for what's currently stored.

```bash
# Sync today's backup, removing files that no longer exist locally
hf sync ./daily-backup hf://buckets/my-user/backups/latest --delete
```

### Linking models to buckets

You can create a two-way link between a model and a bucket by adding the `buckets` field to the model card metadata. The linked models will then appear on the bucket page, and the bucket will appear as a tag on the model page.

```yaml
# In the model card YAML frontmatter
buckets:
  - my-org/my-bucket
```

See [Specifying a bucket](./model-cards#specifying-a-bucket) in the model cards documentation for more details.

## Pricing

Storage Buckets are billed based on the amount of data stored, with simple per-TB pricing. Enterprise plans benefit from dedup-based billing, where shared chunks across files directly reduce the billed footprint.

As for other repositories, buckets are free to create and have a free storage allowance. For usage above the [free tier](https://huggingface.co/docs/hub/storage-limits), see [hf.co/storage](https://huggingface.co/storage). For general billing information, see the [Billing](./billing) documentation.

### Polars

https://huggingface.co/docs/hub/datasets-polars.md

# Polars

[Polars](https://pola.rs/) is an in-memory DataFrame library on top of an [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) query engine. It is fast, easy to use, and [open source](https://github.com/pola-rs/polars/).

Starting from version `1.2.0`, Polars provides _native_ support for the Hugging Face file system. This means that all the benefits of the Polars query optimizer (e.g.
predicate and projection pushdown) are applied, and Polars will only load the data necessary to complete the query. This significantly speeds up reading, especially for large datasets (see [optimizations](./datasets-polars-optimizations)).

You can use Hugging Face paths (`hf://`) to access data on the Hub.

## Getting started

To get started, you can simply `pip install` Polars into your environment:

```bash
pip install polars
```

Once you have installed Polars, you can directly query a dataset based on a Hugging Face URL. No other dependencies are needed for this.

```python
import polars as pl

pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-00000-of-00004-2d5a1467fff1081b.parquet")
```

> [!TIP]
> Polars provides two APIs: a lazy API (`scan_parquet`) and an eager API (`read_parquet`). We recommend using the eager API for interactive workloads and the lazy API for performance, as it allows for better query optimization. For more information on the topic, check out the [Polars user guide](https://docs.pola.rs/user-guide/concepts/lazy-api/#when-to-use-which).

Polars supports globbing to download multiple files at once into a single DataFrame.

```python
pl.read_parquet("hf://datasets/roneneldan/TinyStories/data/train-*.parquet")
```

### Hugging Face URLs

A Hugging Face URL can be constructed from the `username` and `dataset` name like this:

- `hf://datasets/{username}/{dataset}/{path_to_file}`

The path may include globbing patterns such as `**/*.parquet` to query all the files matching the pattern. Additionally, for any non-supported [file formats](./datasets-polars-file-formats) you can use the auto-converted parquet files that Hugging Face provides using the `@~parquet` branch:

- `hf://datasets/{my-username}/{my-dataset}@~parquet/{path_to_file}`

### Schedule Jobs

https://huggingface.co/docs/hub/jobs-schedule.md

# Schedule Jobs

Schedule and manage jobs that will run on HF infrastructure.
Use `hf jobs uv run` or `hf jobs run` with a schedule of `@annually`, `@yearly`, `@monthly`, `@weekly`, `@daily`, `@hourly`, or a CRON schedule expression (e.g., `"0 9 * * 1"` for 9 AM every Monday):

```bash
# Schedule a job that runs every hour
>>> hf jobs scheduled uv run @hourly python -c "print('This runs every hour!')"

# Use the CRON syntax
>>> hf jobs scheduled uv run "*/5 * * * *" python -c "print('This runs every five minutes!')"

# Schedule with GPU
>>> hf jobs scheduled uv run --flavor a10g-small --with torch @hourly python -c 'import torch; print(f"This code ran with the following GPU: {torch.cuda.get_device_name()}")'

# Schedule with a Docker image
>>> hf jobs scheduled run @hourly python:3.12 python -c "print('This runs every hour!')"

# Schedule a Python script with a label
>>> hf jobs scheduled uv run --label fine-tuning @hourly my_script.py
```

Use the same parameters as `hf jobs uv run` and `hf jobs run` to pass environment variables, secrets, timeout, labels, etc.

Manage scheduled jobs using `hf jobs scheduled ps`, `hf jobs scheduled inspect`, `hf jobs scheduled suspend`, `hf jobs scheduled resume`, and `hf jobs scheduled delete`:

```bash
# List your active scheduled jobs
>>> hf jobs scheduled ps

# List all your scheduled jobs (including suspended jobs)
>>> hf jobs scheduled ps -a

# Inspect the status of a job
>>> hf jobs scheduled inspect

# Suspend (pause) a scheduled job
>>> hf jobs scheduled suspend

# Resume a scheduled job
>>> hf jobs scheduled resume

# Delete a scheduled job
>>> hf jobs scheduled delete
```

### Authentication

https://huggingface.co/docs/hub/datasets-polars-auth.md

# Authentication

In order to access private or gated datasets, you need to authenticate first. Authentication works by providing an access token which will be used to authenticate and authorize your access to gated and private datasets.

The first step is to create an access token for your account.
This can be done by visiting [Hugging Face Settings - Tokens](https://huggingface.co/settings/tokens).

There are three ways to provide the token: setting an environment variable, passing a parameter to the reader, or using the Hugging Face CLI.

## Environment variable

If you set the environment variable `HF_TOKEN`, Polars will automatically use it when requesting datasets from Hugging Face.

```bash
export HF_TOKEN="hf_xxxxxxxxxxxxx"
```

## Parameters

You can also explicitly provide the access token to the reader (e.g. `read_parquet`) through the `storage_options` parameter. For a full overview of all the parameters, check out the [API reference guide](https://docs.pola.rs/api/python/stable/reference/api/polars.read_parquet.html).

```python
pl.read_parquet(
    "hf://datasets/roneneldan/TinyStories/data/train-*.parquet",
    storage_options={"token": ACCESS_TOKEN},
)
```

## CLI

Alternatively, you can use the [Hugging Face CLI](/docs/huggingface_hub/en/guides/cli) to authenticate. After successfully logging in with `hf auth login`, an access token will be stored in the `HF_HOME` directory, which defaults to `~/.cache/huggingface`. Polars will then use this token for authentication.

If multiple methods are specified, they are prioritized in the following order:

- Parameters (`storage_options`)
- Environment variable (`HF_TOKEN`)
- CLI

### Datasets

https://huggingface.co/docs/hub/enterprise-datasets.md

# Datasets

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Data Studio is enabled on private datasets under your Team or Enterprise organization. Data Studio allows teams to understand their data and helps them build better data processing and filtering for AI. This powerful viewer allows you to explore dataset content, inspect data distributions, filter by values, search for keywords, or even run SQL queries on your data without leaving your browser.

More information about [Data Studio](./datasets-viewer).
### Combine datasets and export

https://huggingface.co/docs/hub/datasets-duckdb-combine-and-export.md

# Combine datasets and export

In this section, we'll demonstrate how to combine two datasets and export the result. The first dataset is in CSV format, and the second dataset is in Parquet format.

Let's start by examining our datasets. The first will be [TheFusion21/PokemonCards](https://huggingface.co/datasets/TheFusion21/PokemonCards):

```bash
FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' LIMIT 3;

┌─────────┬──────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┬───────┬─────────────────┐
│   id    │      image_url       │                                                                caption                                                                 │    name    │  hp   │    set_name     │
│ varchar │       varchar        │                                                                varchar                                                                 │  varchar   │ int64 │     varchar     │
├─────────┼──────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┼───────┼─────────────────┤
│ pl3-1   │ https://images.pok…  │ A Basic, SP Pokemon Card of type Darkness with the title Absol G and 70 HP of rarity Rare Holo from the set Supreme Victors. It has …  │ Absol G    │    70 │ Supreme Victors │
│ ex12-1  │ https://images.pok…  │ A Stage 1 Pokemon Card of type Colorless with the title Aerodactyl and 70 HP of rarity Rare Holo evolved from Mysterious Fossil from … │ Aerodactyl │    70 │ Legend Maker    │
│ xy5-1   │ https://images.pok…  │ A Basic Pokemon Card of type Grass with the title Weedle and 50 HP of rarity Common from the set Primal Clash and the flavor text: It… │ Weedle     │    50 │ Primal Clash    │
└─────────┴──────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┴───────┴─────────────────┘
```

And the second one will be [wanghaofan/pokemon-wiki-captions](https://huggingface.co/datasets/wanghaofan/pokemon-wiki-captions):

```bash
FROM 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' LIMIT 3;
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ image โ”‚ name_en โ”‚ name_zh โ”‚ text_en โ”‚ text_zh โ”‚ โ”‚ struct(bytes blob,โ€ฆ โ”‚ varchar โ”‚ varchar โ”‚ varchar โ”‚ varchar โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ {'bytes': \x89PNG\โ€ฆ โ”‚ abomasnow โ”‚ ๆšด้›ช็Ž‹ โ”‚ Grass attributes,Blizzard King standing on two feet, with โ€ฆ โ”‚ ่‰ๅฑžๆ€ง๏ผŒๅŒ่„š็ซ™็ซ‹็š„ๆšด้›ช็Ž‹๏ผŒๅ…จ่บซ็™ฝ่‰ฒ็š„็ป’ๆฏ›๏ผŒๆทก็ดซ่‰ฒ็š„็œผ็›๏ผŒๅ‡ ็ผ•้•ฟๆก่ฃ…็š„ๆฏ›็šฎ็›–็€ๅฎƒ็š„ๅ˜ดๅทด โ”‚ โ”‚ {'bytes': \x89PNG\โ€ฆ โ”‚ abra โ”‚ ๅ‡ฏ่ฅฟ โ”‚ Super power attributes, the whole body is yellow, the headโ€ฆ โ”‚ ่ถ…่ƒฝๅŠ›ๅฑžๆ€ง๏ผŒ้€šไฝ“้ป„่‰ฒ๏ผŒๅคด้ƒจๅค–ๅฝข็ฑปไผผ็‹็‹ธ๏ผŒๅฐ–ๅฐ–้ผปๅญ๏ผŒๆ‰‹ๅ’Œ่„šไธŠ้ƒฝๆœ‰ไธ‰ไธชๆŒ‡ๅคด๏ผŒ้•ฟๅฐพๅทดๆœซ็ซฏๅธฆ็€ไธ€ไธช่ค่‰ฒๅœ†็Žฏ โ”‚ โ”‚ {'bytes': \x89PNG\โ€ฆ โ”‚ absol โ”‚ ้˜ฟๅ‹ƒๆขญ้ฒ โ”‚ Evil 
attribute, with white hair, blue-gray part without haโ€ฆ โ”‚ ๆถๅฑžๆ€ง๏ผŒๆœ‰็™ฝ่‰ฒๆฏ›ๅ‘๏ผŒๆฒกๆฏ›ๅ‘็š„้ƒจๅˆ†ๆ˜ฏ่“็ฐ่‰ฒ๏ผŒๅคดๅณ่พน็ฑปไผผๅผ“็š„่ง’๏ผŒ็บข่‰ฒ็œผ็› โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` Now, let's try to combine these two datasets by joining on the `name` column: ```bash SELECT a.image_url , a.caption AS card_caption , a.name , a.hp , b.text_en as wiki_caption FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b ON LOWER(a.name) = b.name_en LIMIT 3; โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ image_url โ”‚ card_caption โ”‚ name โ”‚ hp โ”‚ wiki_caption โ”‚ โ”‚ varchar โ”‚ varchar โ”‚ varchar โ”‚ int64 โ”‚ varchar โ”‚ 
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ https://images.pokโ€ฆ โ”‚ A Stage 1 Pokemon โ€ฆ โ”‚ Aerodactyl โ”‚ 70 โ”‚ A Pokรฉmon with rock attributes, gray body, blue pupils, purple inner wings, two sharp claws on the wings, jagged teeth, and an arrow-like โ€ฆ โ”‚ โ”‚ https://images.pokโ€ฆ โ”‚ A Basic Pokemon Caโ€ฆ โ”‚ Weedle โ”‚ 50 โ”‚ Insect-like, caterpillar-like in appearance, with a khaki-yellow body, seven pairs of pink gastropods, a pink nose, a sharp poisonous needโ€ฆ โ”‚ โ”‚ https://images.pokโ€ฆ โ”‚ A Basic Pokemon Caโ€ฆ โ”‚ Caterpie โ”‚ 50 โ”‚ Insect attributes, caterpillar appearance, green back, white abdomen, Y-shaped red antennae on the head, yellow spindle-shaped tail, two pโ€ฆ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ ``` We can export the result to a Parquet file using the `COPY` 
command:

```bash
COPY (SELECT a.image_url
           , a.caption AS card_caption
           , a.name
           , a.hp
           , b.text_en AS wiki_caption
      FROM 'hf://datasets/TheFusion21/PokemonCards/train.csv' a
      JOIN 'hf://datasets/wanghaofan/pokemon-wiki-captions/data/*.parquet' b
        ON LOWER(a.name) = b.name_en)
TO 'output.parquet' (FORMAT PARQUET);
```

Let's validate the new Parquet file:

```bash
SELECT COUNT(*) FROM 'output.parquet';

┌──────────────┐
│ count_star() │
│ int64        │
├──────────────┤
│         9460 │
└──────────────┘
```

> [!TIP]
> You can also export to [CSV](https://duckdb.org/docs/guides/file_formats/csv_export), [Excel](https://duckdb.org/docs/guides/file_formats/excel_export) and [JSON](https://duckdb.org/docs/guides/file_formats/json_export) formats.

Finally, let's push the resulting dataset to the Hub. You can use the Hub UI, the `huggingface_hub` client library, and more to upload your Parquet file; see more information [here](./datasets-adding).

And that's it! You've successfully combined two datasets, exported the result, and uploaded it to the Hugging Face Hub.

### Publisher Analytics

https://huggingface.co/docs/hub/publisher-analytics.md

# Publisher Analytics

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

## Publisher Analytics Dashboard

Track all your repository activity with a detailed downloads overview that shows total downloads for all the Models and Datasets published by your organization. Toggle between "All Time" and "Last Month" views to gain insights across your repositories over different periods.

### Per-repo breakdown

Explore the metrics of individual repositories with the per-repository drill-down table. Use the built-in search feature to quickly locate specific repositories. Each row also features a time-series graph that illustrates the trend of downloads over time.
## Export Publisher Analytics as CSV

Download a comprehensive CSV file containing analytics for all your repositories, including model and dataset download activity. You can also access this data programmatically via the following API endpoint:

```bash
curl -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
  "https://huggingface.co/organizations/YOUR_ORG_NAME/settings/publisher-analytics/download-breakdown" \
  --output breakdown.csv
```

### Response Structure

The CSV file consists of daily download records for each of your models and datasets.

```csv
repoType,repoName,total,timestamp,downloads
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-22T00:00:00.000Z,4
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-23T00:00:00.000Z,7
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-24T00:00:00.000Z,2
dataset,huggingface/documentation-images,2167284,2021-11-27T00:00:00.000Z,3
dataset,huggingface/documentation-images,2167284,2021-11-28T00:00:00.000Z,18
dataset,huggingface/documentation-images,2167284,2021-11-29T00:00:00.000Z,7
```

### Repository Object Structure

Each record in the CSV contains:

- `repoType`: The type of repository (e.g., "model", "dataset")
- `repoName`: Full repository name including organization (e.g., "huggingface/documentation-images")
- `total`: Cumulative number of downloads for this repository
- `timestamp`: ISO 8601 formatted date (UTC)
- `downloads`: Number of downloads for that day

Records are ordered chronologically and provide a daily granular view of download activity for each repository.

> [!NOTE]
> Download figures are **not** deduplicated by user. If you need unique download counts, refer to the next section.

## Unique downloaders and more granular logs

> [!WARNING]
> This feature is an add-on for the Enterprise Plus plan.

As an advanced feature, Hugging Face can export anonymized, request-level access logs for all of the models and datasets published by your organization.
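As a sketch of how the daily breakdown records above might be aggregated once downloaded, the standard library is enough. The sample reuses the response rows shown above; `downloads_per_repo` is a helper defined for this sketch, not part of any Hugging Face API:

```python
import csv
import io
from collections import defaultdict

# Sample rows in the breakdown CSV format shown above
SAMPLE = """repoType,repoName,total,timestamp,downloads
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-22T00:00:00.000Z,4
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-23T00:00:00.000Z,7
model,huggingface/CodeBERTa-small-v1,4362460,2021-01-24T00:00:00.000Z,2
dataset,huggingface/documentation-images,2167284,2021-11-27T00:00:00.000Z,3
dataset,huggingface/documentation-images,2167284,2021-11-28T00:00:00.000Z,18
dataset,huggingface/documentation-images,2167284,2021-11-29T00:00:00.000Z,7
"""

def downloads_per_repo(csv_text: str) -> dict:
    """Sum the daily `downloads` column for each (repoType, repoName)."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[(row["repoType"], row["repoName"])] += int(row["downloads"])
    return dict(totals)

print(downloads_per_repo(SAMPLE))
# {('model', 'huggingface/CodeBERTa-small-v1'): 13, ('dataset', 'huggingface/documentation-images'): 28}
```

In a real pipeline, you would read `breakdown.csv` fetched with the `curl` command above instead of the inline sample.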
Each line represents a single download-related request, giving you full granularity over your models' and datasets' download data. Your team is responsible for ingesting these logs and running computations on them. The export intentionally includes raw HTTP status codes and methods so you can classify `HEAD`, partial-content, redirect, and other request patterns based on your own analytics needs.

| Column         | Description                                                      |
| -------------- | ---------------------------------------------------------------- |
| `timestamp`    | Request timestamp                                                |
| `status`       | HTTP status code (for example `200`, `206`, `302`, `307`, `304`) |
| `method`       | HTTP method (for example `GET`, `HEAD`)                          |
| `repoName`     | Full repo name (e.g. `nvidia/segformer-b0`)                      |
| `repoType`     | Repository type: `model`, `dataset`, or `space`                  |
| `hashedUserId` | Non-reversible hash of user ID (if authenticated)                |
| `hashedIp`     | Non-reversible hash of IP address (if unauthenticated)           |
| `country`      | Country ISO code                                                 |
| `region`       | Region or city name                                              |

As it requires setting up a custom data export pipeline on our side (custom Elastic index, etc.), this is only available as an add-on to Enterprise Plus.

### Using fastai at Hugging Face

https://huggingface.co/docs/hub/fastai.md

# Using fastai at Hugging Face

`fastai` is an open-source Deep Learning library that leverages PyTorch and Python to provide high-level components to train fast and accurate neural networks with state-of-the-art outputs on text, vision, and tabular data.

## Exploring fastai in the Hub

You can find `fastai` models by filtering at the left of the [models page](https://huggingface.co/models?library=fastai&sort=downloads).

All models on the Hub come with the following features:

1. An automatically generated model card with a brief description and metadata tags that help with discoverability.
2. An interactive widget you can use to play with the model directly in the browser (for Image Classification).
3. An Inference Providers widget that allows you to make inference requests (for Image Classification).

## Using existing models

The `huggingface_hub` library is a lightweight Python client with utility functions to download models from the Hub.

```bash
pip install huggingface_hub["fastai"]
```

Once you have the library installed, you just need to use the `from_pretrained_fastai` method. This method not only loads the model, but also validates the `fastai` version the model was saved with, which is important for reproducibility.

```py
from huggingface_hub import from_pretrained_fastai

learner = from_pretrained_fastai("espejelomar/identify-my-cat")

_, _, probs = learner.predict(img)
print(f"Probability it's a cat: {100*probs[1].item():.2f}%")
# Probability it's a cat: 100.00%
```

If you want to see how to load a specific model, you can click `Use in fastai` and you will be given a working snippet you can use to load it!

## Sharing your models

You can share your `fastai` models by using the `push_to_hub_fastai` method.

```py
from huggingface_hub import push_to_hub_fastai

push_to_hub_fastai(learner=learn, repo_id="espejelomar/identify-my-cat")
```

## Additional resources

* fastai [course](https://course.fast.ai/).
* fastai [website](https://www.fast.ai/).
* Integration with Hub [docs](https://docs.fast.ai/huggingface.html).
* Integration with Hub [announcement](https://huggingface.co/blog/fastai).

### Models Frequently Asked Questions

https://huggingface.co/docs/hub/models-faq.md

# Models Frequently Asked Questions

## How can I see what dataset was used to train the model?

It's up to the person who uploaded the model to include the training information! A user can [specify](./model-cards#specifying-a-dataset) the dataset used for training a model. If the datasets used for the model are on the Hub, the uploader may have included them in the [model card's metadata](https://huggingface.co/Jiva/xlm-roberta-large-it-mnli/blob/main/README.md#L7-L9).
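For example, the YAML metadata block at the top of a model repo's `README.md` might declare the training datasets; the dataset name below is a placeholder:

```yaml
---
datasets:
- username/my-training-dataset
---
```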
In that case, the datasets would be linked with a handy card on the right side of the model page.

## How can I see an example of the model in action?

Models can have inference widgets that let you try out the model in the browser! Inference widgets are easy to configure, and there are many different options at your disposal. Visit the [Widgets documentation](models-widgets) to learn more.

The Hugging Face Hub is also home to Spaces, which are interactive demos used to showcase models. If a model has any Spaces associated with it, you'll find them linked on the model page. Spaces are a great way to show off a model you've made or explore new ways to use existing models! Visit the [Spaces documentation](./spaces) to learn how to make your own.

## How do I upload an update / new version of the model?

Releasing an update to a model that you've already published can be done by pushing a new commit to your model's repo. To do this, go through the same process that you followed to upload your initial model. Your previous model versions will remain in the repository's commit history, so you can still download previous model versions from a specific git commit or tag, or revert to previous versions if needed.

## What if I have a different checkpoint of the model trained on a different dataset?

By convention, each model repo should contain a single checkpoint. You should upload any new checkpoints trained on different datasets to the Hub in a new model repo. You can link the models together by using a tag specified in the `tags` key in your [model card's metadata](./model-cards), by using [Collections](./collections) to group distinct related repositories together, or by linking to them in the model cards. The [akiyamasho/AnimeBackgroundGAN-Shinkai](https://huggingface.co/akiyamasho/AnimeBackgroundGAN-Shinkai#other-pre-trained-model-versions) model, for example, references other checkpoints in the model card under *"Other pre-trained model versions"*.
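As a sketch of the tag-based approach, each related checkpoint repo could carry the same custom tag in its card metadata; the tag value below is illustrative:

```yaml
---
tags:
- anime-background-gan
---
```

Clicking the tag on any of the model pages then filters the Hub to the other repositories that share it.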
## Can I link my model to a paper on arXiv?

If the model card includes a link to a paper on arXiv, the Hugging Face Hub will extract the arXiv ID and include it in the model tags with the format `arxiv:`. Clicking on the tag will let you:

* Visit the paper page
* Filter for other models on the Hub that cite the same paper.

Read more about paper pages [here](./paper-pages).

### Organizations, Security, and the Hub API

https://huggingface.co/docs/hub/other.md

# Organizations, Security, and the Hub API

## Contents

- [Organizations](./organizations)
- [Managing Organizations](./organizations-managing)
- [Organization Cards](./organizations-cards)
- [Access control in organizations](./organizations-security)
- [Team & Enterprise](./enterprise)
- [Moderation](./moderation)
- [Billing](./billing)
- [Digital Object Identifier (DOI)](./doi)
- [Security](./security)
- [User Access Tokens](./security-tokens)
- [Signing commits with GPG](./security-gpg)
- [Malware Scanning](./security-malware)
- [Pickle Scanning](./security-pickle)
- [Hub API Endpoints](./api)
- [Webhooks](./webhooks)

### Using Flair at Hugging Face

https://huggingface.co/docs/hub/flair.md

# Using Flair at Hugging Face

[Flair](https://github.com/flairNLP/flair) is a very simple framework for state-of-the-art NLP, developed by [Humboldt University of Berlin](https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/) and friends.

## Exploring Flair in the Hub

You can find `flair` models by filtering at the left of the [models page](https://huggingface.co/models?library=flair).

All models on the Hub come with these useful features:

1. An automatically generated model card with a brief description.
2. An interactive widget you can use to play with the model directly in the browser.
3. An Inference Providers widget that allows you to make inference requests.
## Installation

To get started, you can follow the [Flair installation guide](https://github.com/flairNLP/flair?tab=readme-ov-file#requirements-and-installation). You can also use the following one-line install through pip:

```
$ pip install -U flair
```

## Using existing models

All `flair` models can easily be loaded from the Hub:

```py
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-multi")
```

Once loaded, you can use `predict()` to perform inference:

```py
sentence = Sentence("George Washington ging nach Washington.")
tagger.predict(sentence)

# print sentence
print(sentence)
```

It outputs the following:

```text
Sentence[6]: "George Washington ging nach Washington." → ["George Washington"/PER, "Washington"/LOC]
```

If you want to load a specific Flair model, you can click `Use in Flair` in the model card and you will be given a working snippet!

## Additional resources

* Flair [repository](https://github.com/flairNLP/flair)
* Flair [docs](https://flairnlp.github.io/docs/intro)
* Official Flair [models](https://huggingface.co/flair) on the Hub (mainly trained by [@alanakbik](https://huggingface.co/alanakbik) and [@stefan-it](https://huggingface.co/stefan-it))

### How to configure OIDC SSO with Google Workspace

https://huggingface.co/docs/hub/security-sso-google-oidc.md

# How to configure OIDC SSO with Google Workspace

In this guide, we will use Google Workspace as the SSO provider with the OpenID Connect (OIDC) protocol as our preferred identity protocol. We currently support SP-initiated authentication. For user provisioning, see [SCIM](./enterprise-scim).

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

## Step 1: Create OIDC App in Google Workspace

- In your Google Cloud console, search for and navigate to `Google Auth Platform` > `Clients`.
- Click `Create Client`.
- For Application Type, select `Web Application`.
- Provide a name for your application.
- Retrieve the `Redirection URI` from your Hugging Face organization settings: go to the `SSO` tab and select the `OIDC` protocol.
- Click `Create`.
- A pop-up will appear with the `Client ID` and `Client Secret`. Copy those and paste them into your Hugging Face organization settings: in the `SSO` tab (make sure `OIDC` is selected), paste the corresponding values for `Client Identifier` and `Client Secret`.

## Step 2: Configure Hugging Face with Google's OIDC Details

- At this point, the **Client ID** and **Client Secret** should be set in your Hugging Face organization settings `SSO` tab.
- Set the **Issuer URL** to `https://accounts.google.com`.

## Step 3: Test and Enable SSO

> [!WARNING]
> Before testing, ensure you have granted access to the application for the appropriate users. The admin performing the test must have access.

- Now, in your Hugging Face SSO settings, click on **"Update and Test OIDC configuration"**.
- You should be redirected to your Google login prompt. Once logged in, you'll be redirected to your organization's settings page.
- A green check mark near the OIDC selector will confirm that the test was successful.
- Once the test is successful, you can enable SSO for your organization by clicking the "Enable" button.
- Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section.

### Editing Datasets in Data Studio

https://huggingface.co/docs/hub/datasets-cell-editing.md

# Editing Datasets in Data Studio

Data Studio lets you edit dataset values directly in the browser, then commit those edits back to your repo, with no re-upload required.

Cell editing is available when:

- You have **write access** to the dataset repository.
- The selected split is backed by a single `.csv`, `.tsv`, or `.parquet` file.

## Enter edit mode

Click **Toggle Edit Mode** to start editing the current split.
## Edit and stage changes

In edit mode, editable cells show an edit action. You can edit **string**, **number**, and **boolean** values (including setting values to `null`).

Edits are staged locally first and shown as an unsaved-changes counter, so you can review everything before committing.

## Commit your edits

Click **Commit**, update the commit message if needed, and confirm.

> [!TIP]
> Commits are optimized for large files: Data Studio computes the exact edited byte ranges and commits only those changes instead of rewriting the full file. Storage and upload on the Hub are handled efficiently with Xet-backed chunking and deduplication.

## Discard staged edits

You can leave edit mode at any time. If you don't want to keep your staged edits, discard them before exiting edit mode.

### Disk usage on Spaces

https://huggingface.co/docs/hub/spaces-storage.md

# Disk usage on Spaces

Every Space comes with a small amount of disk storage. This disk space is ephemeral, meaning its content will be lost if your Space restarts or is stopped. If you need to persist data with a longer lifetime than the Space itself, you can attach one or more [Storage Buckets](./storage-buckets) as volumes.

## Attached Volumes

[Storage Buckets](./storage-buckets) are the recommended way to persist data in your Space. Attached buckets are mounted into the Space container at the path you specify, making their contents available as local files at runtime.

Buckets can be attached when creating a Space, from the Space settings UI, or programmatically via the [`huggingface_hub`](/docs/huggingface_hub/guides/manage-spaces#mount-volumes-in-your-space) Python API. They can be mounted read-write (the default) or read-only. See the [Storage Buckets documentation](./storage-buckets) for full details on creating and using buckets.

### Viewing attached volumes

The Space page displays attached volumes in the actions dropdown.
Each volume shows its source bucket, its mount path inside the container, and whether it is mounted as read-only or read-write.

## Mounting models, datasets, and other Spaces

Models, datasets, and other Spaces can be attached as volumes through the [`huggingface_hub`](/docs/huggingface_hub/guides/manage-spaces#mount-volumes-in-your-space) Python API. They are always mounted as read-only.

Once attached, repo volumes appear in the Space actions dropdown alongside buckets and can be viewed or unmounted from the UI. When a volume references a private repository, users without access will still see the volume listed (with its mount path and access mode), but the source will be masked as `****/******` with a "(private)" label.

### Reference

https://huggingface.co/docs/hub/jobs-reference.md

# Reference

# Jobs Command Line Interface (CLI)

The `huggingface_hub` Python package comes with a built-in CLI called `hf`. This tool allows you to interact with the Hugging Face Hub directly from a terminal. For example, you can log in to your account, create a repository, upload and download files, etc. It also comes with handy features to configure your machine or manage your cache, and to start and manage Jobs.

Find the `hf jobs` installation steps, guides, and reference in the `huggingface_hub` documentation here:

* [Installation](https://huggingface.co/docs/huggingface_hub/en/guides/cli#getting-started)
* [Run and manage Jobs](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-jobs)
* [CLI reference for Jobs](https://huggingface.co/docs/huggingface_hub/en/package_reference/cli#hf-jobs)

## Python client

The `huggingface_hub` Python package comes with a client called `HfApi`. This client allows you to interact with the Hugging Face Hub directly in Python. For example, you can log in to your account, create a repository, upload and download files, etc. It also comes with handy features to configure your machine or manage your cache, and to start and manage Jobs.
Find the installation steps and guides in the `huggingface_hub` documentation:

* [Installation](https://huggingface.co/docs/huggingface_hub/en/installation)
* [Run and manage Jobs](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)

## HTTP API

The Jobs HTTP API endpoints are available under `https://huggingface.co/api/jobs`. Authenticate using a Hugging Face token with the permission to start and manage Jobs under your namespace (your account or organization). Pass the token as a Bearer token with the header `"Authorization: Bearer {token}"`.

Here is a list of available endpoints and arguments:

* [View Jobs OpenAPI](https://huggingface-openapi.hf.space/#tag/jobs)

### Jobs

https://huggingface.co/docs/hub/jobs.md

# Jobs

`Hugging Face Jobs` provide compute for AI and data workflows, allowing you to run workloads on Hugging Face infrastructure with a familiar UV & Docker-like interface. Jobs are ideal for fine-tuning AI models, running inference with GPUs, and data ingestion and processing.

You can run jobs using the `hf` CLI, the `huggingface_hub` Python client, or the Jobs HTTP API. Jobs support any hardware from CPUs to A100s & TPUs, with pay-as-you-go pricing where you only pay for the seconds used.

## Contents

- [Jobs Overview](./jobs-overview)
- [Quickstart](./jobs-quickstart)
- [Pricing](./jobs-pricing)
- [Manage Jobs](./jobs-manage)
- [Jobs Configuration](./jobs-configuration)
- [Popular Images](./jobs-popular-images)
- [Schedule Jobs](./jobs-schedule)
- [Webhooks Automation](./jobs-webhooks)
- [Reference](./jobs-reference)

### Using AllenNLP at Hugging Face

https://huggingface.co/docs/hub/allennlp.md

# Using AllenNLP at Hugging Face

`allennlp` is an NLP library for developing state-of-the-art models on different linguistic tasks. It provides high-level abstractions and APIs for common components and models in modern NLP. It also provides an extensible framework that makes it easy to run and manage NLP experiments.

## Exploring allennlp in the Hub

You can find `allennlp` models on the Hub by filtering at the left of the [models page](https://huggingface.co/models?library=allennlp).

All models on the Hub come with useful features:

1. A training metrics tab with automatically hosted TensorBoard traces.
2. Metadata tags that help with discoverability.
3. An interactive widget you can use to play with the model directly in the browser.
4. An Inference Providers widget that allows you to make inference requests.

## Using existing models

You can use the `Predictor` class to load existing models on the Hub. To achieve this, use the `from_path` method with the `"hf://"` prefix and the repository id. Here is an end-to-end example:

```py
import allennlp_models
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("hf://allenai/bidaf-elmo")

predictor_input = {
    "passage": "My name is Wolfgang and I live in Berlin",
    "question": "Where do I live?"
}
predictions = predictor.predict_json(predictor_input)
```

To get a snippet such as this, you can click `Use in AllenNLP` at the top right.

## Sharing your models

The first step is to save the model locally. For example, you can use the [`archive_model`](https://docs.allennlp.org/main/api/models/archival/#archive_model) method to save the model as a `model.tar.gz` file. You can then push the zipped model to the Hub. When you train a model with `allennlp`, the model is automatically serialized, so you can use that as a preferred option.

### Using the AllenNLP CLI

To push with the CLI, you can use the `allennlp push_to_hf` command as seen below.

```bash
allennlp push_to_hf --repo_name test_allennlp --archive_path model
```

| Argument                    | Type         | Description                                                                                                                    |
| --------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------ |
| `--repo_name`, `-n`         | str / `Path` | Name of the repository on the Hub.                                                                                             |
| `--organization`, `-o`      | str          | Optional name of organization to which the pipeline should be uploaded.                                                        |
| `--serialization-dir`, `-s` | str / `Path` | Path to directory with the serialized model.                                                                                   |
| `--archive-path`, `-a`      | str / `Path` | If instead of a serialization path you're using a zipped model (e.g. model/model.tar.gz), you can use this flag.               |
| `--local-repo-path`, `-l`   | str / `Path` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory.  |
| `--commit-message`, `-c`    | str          | Commit message to use for update. Defaults to `"update repository"`.                                                           |

### From a Python script

The `push_to_hf` function has the same parameters as the bash script.

```py
from allennlp.common.push_to_hf import push_to_hf

serialization_dir = "path/to/serialization/directory"

push_to_hf(
    repo_name="my_repo_name",
    serialization_dir=serialization_dir,
    local_repo_path="hub",  # defaults to `hub` in the current working directory
)
```

In just a minute, you can get your model on the Hub, try it out directly in the browser, and share it with the rest of the community. All the required metadata will be uploaded for you!

## Additional resources

* AllenNLP [website](https://allenai.org/allennlp).
* AllenNLP [repository](https://github.com/allenai/allennlp).

### Libraries

https://huggingface.co/docs/hub/datasets-libraries.md

# Libraries

The Datasets Hub has support for several libraries in the Open Source ecosystem. Thanks to the [huggingface_hub Python library](/docs/huggingface_hub), it's easy to enable sharing your datasets on the Hub. We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward.

## Libraries table

The table below summarizes the supported libraries and their level of integration.
| Library | Description | Download from Hub | Stream from Hub | Push to Hub | Stream to Hub | Optimized Parquet files | | ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ----------------- | --------------- | ----------- | ------------- | ----------------------- | | [Argilla](./datasets-argilla) | Collaboration tool for AI engineers and domain experts that value high quality data. | โœ… | โŒ | โœ… | โŒ | โŒ | | [Daft](./datasets-daft) | Data engine for large scale, multimodal data processing with a Python-native interface. | โœ… | โœ… | โœ… | โœ… | โœ… | | [Dask](./datasets-dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | โœ… | โœ… | โœ… | โœ… | โœ…* | | [Data Designer](./datasets-data-designer) | NVIDIA NeMo framework for generating synthetic datasets using LLMs. | โœ… | โŒ | โœ… | โŒ | โŒ | | [Datasets](./datasets-usage) | ๐Ÿค— Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | โœ… | โœ… | โœ… | โœ… | โœ… | | [Distilabel](./datasets-distilabel) | The framework for synthetic data generation and AI feedback. | โœ… | โŒ | โœ… | โŒ | โŒ | | [DuckDB](./datasets-duckdb) | In-process SQL OLAP database management system. | โœ… | โœ… | โŒ | โŒ | โŒ | | [Embedding Atlas](./datasets-embedding-atlas) | Interactive visualization and exploration tool for large embeddings. | โœ… | โœ… | โŒ | โŒ | โŒ | | [Fenic](./datasets-fenic) | PySpark-inspired DataFrame framework for building production AI and agentic applications. | โœ… | โœ… | โŒ | โŒ | โŒ | | [FiftyOne](./datasets-fiftyone) | FiftyOne is a library for curation and visualization of image, video, and 3D data. | โœ… | โœ… | โœ… | โŒ | โŒ | | [Lance](./datasets-lance) | An open lakehouse format for multimodal AI. 
| ✅ | ✅ | ❌ | ❌ | ❌ | | [Pandas](./datasets-pandas) | Python data analysis toolkit. | ✅ | ❌ | ✅ | ❌ | ✅* | | [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ | ✅ | ❌ | ❌ | | [PyArrow](./datasets-pyarrow) | Apache Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics. | ✅ | ✅ | ✅ | ❌ | ✅* | | [Spark](./datasets-spark) | Real-time, large-scale data processing tool in a distributed environment. | ✅ | ✅ | ✅ | ✅ | ✅ | | [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. | ✅ | ✅ | ❌ | ❌ | ❌ | _* Requires passing extra arguments to write optimized Parquet files_ ## Data Processing Libraries ### Streaming Dataset streaming allows iterating on a dataset from Hugging Face progressively without having to download it completely. It saves local disk space because the data is never written to disk. It saves memory since only a small portion of the dataset is used at a time. And it saves time, since there is no need to download data before starting the CPU or GPU workload. In addition to streaming *from* Hugging Face, many libraries also support streaming *back to* Hugging Face. Therefore, they can run end-to-end streaming pipelines: streaming from a source and writing to Hugging Face progressively, often overlapping the download, upload, and processing steps. For more details on how to do streaming, check out the documentation of a library that supports streaming (see table above) or the [streaming datasets](./datasets-streaming) documentation if you want to stream datasets from Hugging Face by yourself.
Optimized Parquet files are Parquet files with additional features: * [Parquet Content Defined Chunking](https://huggingface.co/blog/parquet-cdc) optimizes Parquet for [Xet](https://huggingface.co/docs/hub/en/xet/index), Hugging Face's storage backend. It accelerates uploads and downloads thanks to chunk-based deduplication and allows efficient file editing. * Page index accelerates filters when streaming and enables efficient random access, e.g. in the [Dataset Viewer](https://huggingface.co/docs/dataset-viewer) Some libraries, such as `Pandas` and `PyArrow`, require extra arguments to write Optimized Parquet files: * `use_content_defined_chunking=True` to enable Parquet Content Defined Chunking, for [deduplication](https://huggingface.co/blog/parquet-cdc) and [editing](./datasets-editing) * `write_page_index=True` to include a page index in the Parquet metadata, for [streaming and random access](./datasets-streaming) ## Training Libraries The following training libraries integrate with Hub datasets for model training. The table below shows their streaming capabilities: the ability to train on datasets without downloading them first.
| Library | Description | Stream from Hub | | ------- | ----------- | --------------- | | [Axolotl](https://docs.axolotl.ai/docs/streaming.html) | Low-code LLM fine-tuning framework | ✅ | | [LlamaFactory](https://github.com/hiyouga/LLaMA-Factory) | Unified fine-tuning for 100+ LLMs | ✅ | | [Sentence Transformers](https://sbert.net/docs/sentence_transformer/training_overview.html) | Text embeddings and semantic similarity | ✅ | | [Transformers](https://huggingface.co/docs/transformers/trainer) | 🤗 Transformers Trainer for fine-tuning models | ✅ | | [TRL](https://huggingface.co/docs/trl) | Training LLMs with reinforcement learning (SFT, DPO, GRPO) | ⚠️* | | [Unsloth](https://docs.unsloth.ai) | Fast LLM fine-tuning (2x speedup, 70% less memory) | ✅ | _* SFTTrainer and DPOTrainer support streaming; GRPOTrainer does not yet support streaming input_ ### Streaming from Hub Streaming allows training on massive datasets without downloading them first. This is valuable when: - Your dataset is too large to fit on disk - You want to start training immediately - You're using [HF Jobs](https://huggingface.co/docs/hub/jobs) where co-located compute provides faster streaming Recent improvements have made streaming [up to 100x more efficient](https://huggingface.co/blog/streaming-datasets) with faster startup, prefetching, and better scaling to many workers. **Note:** Streaming requires `max_steps` in training arguments since dataset length is unknown, and uses buffer-based shuffling. See [streaming datasets](./datasets-streaming) for more details. ### Logging to Hub Some tools can stream training data back to the Hub during training: - **[Trackio](https://github.com/huggingface/trackio)**: Streams training metrics to a Hub dataset in real-time ## Integrating data libraries and tools with the Hub This guide is designed for developers and maintainers of data libraries and tools who want to integrate with the Hugging Face Hub.
Whether you're building a data processing library, analysis tool, or any software that needs to interact with datasets, this documentation will help you implement a Hub integration. The guide covers: - Possible approaches to loading data from the Hub into your library/tool - Possible approaches to uploading data from your library/tool to the Hub ### Loading data from the Hub If you have a library for working with data, it can be helpful for your users to load data from the Hub. In general, we suggest relying on an existing library like `datasets`, `pandas` or `polars` to do this unless you have a specific reason to implement your own. If you require more control over the loading process, you can use the `huggingface_hub` library, which will allow you, for example, to download a specific subset of files from a repository. You can find more information about loading data from the Hub [here](https://huggingface.co/docs/hub/datasets-downloading). #### Integrating via the Dataset Viewer and Parquet Files The Hub's dataset viewer and Parquet conversion system provide a standardized way to integrate with datasets, regardless of their original format. This infrastructure is a reliable integration layer between the Hub and external libraries. If the dataset is not already in Parquet, the Hub automatically converts the first 5GB of every dataset to Parquet format to power the dataset viewer and provide consistent access patterns. This standardization offers several benefits for library integrations: - Consistent data access patterns regardless of original format - Built-in dataset preview and exploration through the Hub's dataset viewer. The dataset viewer can also be embedded as an iframe in your applications, making it easy to provide rich dataset previews. For more information about embedding the viewer, see the [dataset viewer embedding documentation](https://huggingface.co/docs/hub/en/datasets-viewer-embed). - Efficient columnar storage optimized for querying. 
For example, you could use a tool like [DuckDB](https://duckdb.org/) to query or filter for a specific subset of data. - Parquet is well supported across the machine learning and data science ecosystem. For more details on working with the Dataset Viewer API, see the [Dataset Viewer API documentation](https://huggingface.co/docs/dataset-viewer/index). ### Uploading data to the Hub This section covers possible approaches for adding the ability to upload data to the Hub in your library, i.e. how to implement a `push_to_hub` method. This guide will cover four primary ways to upload data to the Hub: - using the `datasets` library and the `push_to_hub` method - using `pandas` to write to the Hub - using the `huggingface_hub` library and its upload methods, such as `upload_file` and `upload_folder` - directly using the API or Git with git-xet #### Use the `datasets` library The most straightforward approach to pushing data to the Hub is to rely on the existing [`push_to_hub`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.push_to_hub) method from the `datasets` library. The `push_to_hub` method will automatically handle: - the creation of the repository - the conversion of the dataset to Parquet - chunking the dataset into suitable parts - uploading the data For example, if you have a synthetic data generation library that returns a list of dictionaries, you could simply do the following: ```python from datasets import Dataset data = [{"prompt": "Write a cake recipe", "response": "Measure 1 cup ..."}] ds = Dataset.from_list(data) ds.push_to_hub("USERNAME_OR_ORG/repo_ID") ``` Examples of this kind of integration: - [Distilabel](https://github.com/argilla-io/distilabel/blob/8ad48387dfa4d7bd5639065661f1975dcb44c16a/src/distilabel/distiset.py#L77) #### Rely on an existing library's integration with the Hub Polars, Pandas, Dask, Spark, DuckDB, and Daft can all write to a Hugging Face Hub repository.
See [datasets libraries](https://huggingface.co/docs/hub/datasets-libraries) for more details. If you are already using one of these libraries in your code, adding the ability to push to the Hub is straightforward. For example, if you have a synthetic data generation library that can return a Pandas DataFrame, here is the code you would need to write it to the Hub: ```python import os from huggingface_hub import HfApi # Initialize the Hub API hf_api = HfApi(token=os.getenv("HF_TOKEN")) # Create a repository (if it doesn't exist) hf_api.create_repo(repo_id="username/my-dataset", repo_type="dataset", exist_ok=True) # Convert your data to a DataFrame and save directly to the Hub df.to_parquet("hf://datasets/username/my-dataset/data.parquet") ``` #### Using the huggingface_hub Python library The `huggingface_hub` Python library offers a more flexible approach to uploading data to the Hub. The library allows you to upload specific files or subsets of files to a repository. This is useful if you have a large dataset that you don't want to convert to Parquet, want to upload a specific subset of files, or want more control over the repo structure. Depending on your use case, you can upload a file or folder at a specific point in your code, i.e., export annotations from a tool to the Hub when a user clicks "push to Hub". For example, ```python from huggingface_hub import HfApi api = HfApi(token=HF_TOKEN) api.upload_folder( folder_path="/my-cool-library/data-folder", repo_id="username/my-cool-space", repo_type="dataset", commit_message="Push annotations to Hub", allow_patterns="*.jsonl", ) ``` You can find more information about ways to upload data to the Hub [here](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload). Alternatively, there are situations where you may want to upload data in the background, for example, synthetic data being generated every 10 minutes. In this case you can use the `scheduled_uploads` feature of the `huggingface_hub` library.
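As an illustrative sketch (the repo id and local folder below are hypothetical), the `CommitScheduler` helper that implements scheduled uploads commits a local folder to a Hub repo at a regular interval in a background thread:

```python
from huggingface_hub import CommitScheduler

def start_background_sync():
    # Hypothetical repo and folder: CommitScheduler pushes the contents of
    # `data/` to the dataset repo every 10 minutes in a background thread,
    # uploading only new or changed files on each run.
    return CommitScheduler(
        repo_id="username/my-dataset",
        repo_type="dataset",
        folder_path="data",
        every=10,  # minutes between commits
    )
```

Note that calling `start_background_sync()` requires a valid token and creates the repository if it doesn't exist yet.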
For more details, see the [scheduled uploads documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads). You can see an example of using this approach to upload data to the Hub in: - The [fastdata](https://github.com/AnswerDotAI/fastdata/blob/main/nbs/00_core.ipynb) library - This [magpie](https://huggingface.co/spaces/davanstrien/magpie/blob/fc79672c740b8d3d098378dca37c0f191c208de0/app.py#L67) Demo Space ## More support For technical questions about integration, feel free to contact the datasets team at datasets@huggingface.co. ### Model Card components https://huggingface.co/docs/hub/model-cards-components.md # Model Card components **Model Card Components** are special elements that you can inject directly into your Model Card markdown to display powerful custom components in your model page. These components are authored by us; feel free to share ideas about new Model Card components in [this discussion](https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/17). ## The Gallery component The `<Gallery />` component can be used in your model card to showcase your generated images and videos. ### How to use it? 1. Update your Model Card [widget metadata](/docs/hub/models-widgets-examples#text-to-image) to add the media you want to showcase. ```yaml widget: - text: a girl wandering through the forest output: url: images/6CD03C101B7F6545EB60E9F48D60B8B3C2D31D42D20F8B7B9B149DD0C646C0C2.jpeg - text: a tiny witch child output: url: images/7B482E1FDB39DA5A102B9CD041F4A2902A8395B3835105C736C5AD9C1D905157.jpeg - text: an artist leaning over to draw something output: url: images/7CCEA11F1B74C8D8992C47C1C5DEA9BD6F75940B380E9E6EC7D01D85863AF718.jpeg ``` 2. Add the `<Gallery />` component to your card. The widget metadata will be used by the `<Gallery />` component to display the media with each associated prompt. ```md ## Model description A very classic hand drawn cartoon style. <Gallery />
``` See result [here](https://huggingface.co/alvdansen/littletinies#little-tinies). > Hint: Support for Card Components through the GUI editor is coming soon... ### Next Steps https://huggingface.co/docs/hub/repositories-next-steps.md # Next Steps These next sections highlight features and additional information that you may find useful to make the most out of the Git repositories on the Hugging Face Hub. ## How to programmatically manage repositories Hugging Face supports accessing repos with Python via the [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/index). The operations that we've explored, such as downloading repositories and uploading files, are available through the library, as well as other useful functions! If you prefer to use git directly, please read the sections below. ## Learning more about Git A good place to visit if you want to continue learning about Git is [this Git tutorial](https://learngitbranching.js.org/). For even more background on Git, you can take a look at [GitHub's Git Guides](https://github.com/git-guides). ## How to use branches To effectively use Git repos collaboratively and to work on features without releasing premature code, you can use **branches**. Branches allow you to separate your "work in progress" code from your "production-ready" code, with the additional benefit of letting multiple people work on a project without frequently conflicting with each other's contributions. You can use branches to isolate experiments in their own branch, and even [adopt team-wide practices for managing branches](https://ericmjl.github.io/essays-on-data-science/workflow/gitflow/). To learn about Git branching, you can try out the [Learn Git Branching interactive tutorial](https://learngitbranching.js.org/). ## Using tags Git allows you to *tag* commits so that you can easily note milestones in your project. As such, you can use tags to mark commits in your Hub repos!
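As a minimal sketch (using a throwaway local repository as a stand-in for a clone of your Hub repo), tagging a milestone looks like this:

```shell
# Create a throwaway repo with an initial commit to tag
git init tag-demo
cd tag-demo
git -c user.name=demo -c user.email=demo@example.com commit --allow-empty -m "initial commit"

# Annotate the current commit with a milestone tag, then list tags
git tag -a v1.0 -m "first stable release"
git tag --list

# In a clone of a real Hub repo, you would then publish the tag with:
#   git push origin v1.0
```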
To learn about using tags, you can visit [this DevConnected post](https://devconnected.com/how-to-create-git-tags/). Beyond making it easy to identify important commits in your repo's history, using Git tags also allows you to do A/B testing, [clone a repository at a specific tag](https://www.techiedelight.com/clone-specific-tag-with-git/), and more! The `huggingface_hub` library also supports working with tags, such as [downloading files from a specific tagged commit](https://huggingface.co/docs/huggingface_hub/main/en/how-to-downstream#hfhuburl). ## How to duplicate a repo There are several ways to duplicate a repository, depending on whether you need to preserve the Git history. ### Duplicating from the Hub Click the three dots at the top right of any repository page, then select **Duplicate this model**, **Duplicate this dataset**, or **Duplicate this Space**. This operation is nearly instant, thanks to the use of [Xet deduplication technology](./xet/deduplication). You will be able to choose: * **Owner**: Your account or any organization in which you have write access. * **Repository name**: The name of the duplicated repository. By default it keeps the same name as the source, under your namespace (e.g. duplicating `bigscience/bloom-560m` creates `your-username/bloom-560m`). * **Visibility**: You can choose to make the duplicated repo public or private. Read more about private repositories [here](./repositories-settings#private-repositories). For models and datasets, the Git history is squashed into a single commit. For Spaces, the full Git history is preserved. Public variables are copied over for Spaces, but secrets must be re-entered manually. #### Restrictions Some repositories cannot be duplicated: - **Gated repositories** (models or datasets with access requests enabled). - Repositories where the author has **disabled duplication**. - **Cross-region duplication** is not supported (e.g. 
a repository stored in the US region cannot be duplicated to an EU organization). ### Duplicating programmatically You can also duplicate repositories using the `huggingface_hub` library or CLI. These use the same server-side API as the Hub button above (Git history is squashed for models and datasets, preserved for Spaces). Using Python: ```python from huggingface_hub import duplicate_repo duplicate_repo("bigscience/bloom-560m", private=False) duplicate_repo("openai/gdpval", repo_type="dataset") duplicate_repo("multimodalart/dreambooth-training", repo_type="space", private=False) ``` Or using the CLI: ```bash hf repos duplicate bigscience/bloom-560m hf repos duplicate openai/gdpval --type dataset ``` For Spaces, you will still need to configure your own settings (hardware, sleep time, storage, variables and secrets). Check out the [Manage your Space](https://huggingface.co/docs/huggingface_hub/guides/manage-spaces) guide for more details. Alternatively, if you want to keep a local copy of the repo, you can use `hf download` followed by `hf upload` to a different namespace. This won't preserve the Git history either. ### Forking manually with Git If you need to preserve Git history for models/datasets, or want more control over the process (e.g. rebasing on top of your own changes), you can fork a repository manually using Git. You will need [`git-xet`](https://huggingface.co/docs/hub/xet/using-xet-storage#git) installed. Forking can take time depending on your bandwidth because you will have to fetch and re-upload all the LFS files (though the re-upload will be fast thanks to Xet). 1. Create a destination repository (e.g. `me/myfork`) on https://huggingface.co 2. Clone it and add the source repo as a remote: ```bash git clone git@hf.co:me/myfork cd myfork git xet install git remote add upstream git@hf.co:friend/upstream git fetch upstream git lfs fetch --all upstream ``` 3. 
Replace the fork contents with the upstream history: ```bash git reset --hard upstream/main ``` 4. Push: ```bash git push --force origin main ``` ### Agents https://huggingface.co/docs/hub/agents-overview.md # Agents Hugging Face provides tools and protocols that connect AI agents directly to the Hub. Whether you're chatting with Claude, building with Codex, or developing custom agents, you can access models, datasets, Spaces, and community tools. This page covers connecting your [chat agents](#chat-with-hugging-face) and [coding agents](#coding-agents) to the Hub. | Page | Description | | ---- | ----------- | | [CLI](./agents-cli) | Give your agent the `hf` CLI with a built-in Skill | | [MCP Server](./agents-mcp) | Connect any MCP-compatible client to the Hub | | [Skills](./agents-skills) | Task-specific guidance for AI/ML workflows | | [SDK](./agents-sdk) | Build agents programmatically with Python or JavaScript | | [Local Agents](./agents-local) | Run fully local agents with llama.cpp and Pi | ## Chat with Hugging Face Connect your AI assistant directly to the Hugging Face Hub using the Model Context Protocol (MCP). Once connected, you can search models, explore datasets, generate images, and use community tools, all from within your chat interface. ### Supported Assistants The HF MCP Server works with any MCP-compatible client: - **ChatGPT** (via plugins) - **Claude Desktop** - **Custom MCP clients** ### Setup #### 1. Open MCP Settings ![MCP Settings Example](https://huggingface.co/huggingface/documentation-images/resolve/main/agents-docs/mcp-settings.png) Visit [huggingface.co/settings/mcp](https://huggingface.co/settings/mcp) while logged in. #### 2. Select Your Client Choose your MCP-compatible client from the list. The page shows client-specific instructions and a ready-to-copy configuration snippet. #### 3. Configure and Restart Copy the configuration snippet into your client's MCP settings, save, and restart your client.
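For illustration only, a snippet for a client that supports remote MCP servers typically has a shape along these lines (the field names vary by client, so treat this as an assumption and prefer the exact snippet from your settings page):

```json
{
  "mcpServers": {
    "hf-mcp-server": {
      "url": "https://huggingface.co/mcp",
      "headers": {
        "Authorization": "Bearer <YOUR_HF_TOKEN>"
      }
    }
  }
}
```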
> [!TIP] > The settings page generates the exact configuration your client expects. Use it rather than writing config by hand. ### What You Can Do Once connected, ask your assistant to use any of the Hugging Face tools you selected in your configuration: | Task | Example Prompt | | ---- | -------------- | | Search models | "Find Qwen 3 quantizations on Hugging Face" | | Explore datasets | "Show datasets about weather time-series" | | Find Spaces | "Find a Space that can transcribe audio files" | | Generate images | "Create a 1024x1024 image of a cat in Ghibli style" | | Search papers | "Find recent papers on vision-language models" | Your assistant calls MCP tools exposed by the Hugging Face server and returns results with metadata, links, and context. ### Add Community Tools Extend your setup with MCP-compatible Gradio Spaces: 1. Browse [Spaces with MCP support](https://huggingface.co/spaces?filter=mcp-server) 2. Add them in your [MCP settings](https://huggingface.co/settings/mcp) 3. Restart your client to pick up new tools Gradio MCP apps expose their functions as tools with arguments and descriptions, so your assistant can call them directly. ### Learn More - [MCP Server Guide](./agents-mcp) - Detailed setup and configuration - [HF MCP Settings](https://huggingface.co/settings/mcp) - Configure your client - [MCP-compatible Spaces](https://huggingface.co/spaces?filter=mcp-server) - Community tools ## Coding Agents Integrate Hugging Face into your coding workflow with the MCP Server and Skills. Access models, datasets, and ML tools directly from your IDE or coding agent.
For example, we cover these coding agents and more with MCP and/or Skills: | Coding Agent | Integration Method | | ------------ | ------------------ | | [Claude Code](https://code.claude.com/docs) | MCP Server + Skills | | [OpenAI Codex](https://openai.com/codex/) | MCP Server + Skills | | [Open Code](https://opencode.ai/) | MCP Server + Skills | | [Cursor](https://www.cursor.com/) | MCP Server + Skills | | [VS Code](https://code.visualstudio.com/) | MCP Server | | [Gemini CLI](https://geminicli.com/) | MCP Server | | [Zed](https://zed.dev/) | MCP Server | ### Quick Setup #### MCP Server The MCP Server gives your coding agent access to Hub search, Spaces, and community tools. **Cursor / VS Code / Zed:** 1. Visit [huggingface.co/settings/mcp](https://huggingface.co/settings/mcp) 2. Select your IDE from the list 3. Copy the configuration snippet 4. Add it to your IDE's MCP settings 5. Restart the IDE **Claude Code:** ```bash claude mcp add hf-mcp-server -t http "https://huggingface.co/mcp?login" ``` #### Skills Skills provide task-specific guidance for AI/ML workflows. They work alongside MCP or standalone. ```bash # start claude claude # install the skills marketplace plugin /plugin marketplace add huggingface/skills ``` Then, to install a Skill specification: ```bash /plugin install hf-cli@huggingface/skills ``` See the [Skills Guide](./agents-skills) for available skills and usage. ### What You Can Do Once configured, your coding agent can: | Capability | Example | | ---------- | ------- | | Search the Hub | "Find a code generation model under 7B parameters" | | Generate images | "Create a diagram of a transformer architecture" | | Explore datasets | "What datasets are available for sentiment analysis?" | | Run Spaces | "Use the Whisper Space to transcribe this audio file" | | Get documentation | "How do I fine-tune a model with transformers?" 
| ### Environment Configuration #### Authentication Set your Hugging Face token as an environment variable: ```bash export HF_TOKEN="hf_..." ``` Or authenticate via the [CLI](./agents-cli): ```bash hf auth login ``` #### Adding Community Tools Extend your setup with MCP-compatible Gradio Spaces: 1. Browse [Spaces with MCP support](https://huggingface.co/spaces?filter=mcp-server) 2. Add them in your [MCP settings](https://huggingface.co/settings/mcp) 3. Restart your IDE ### Example Workflow ```text You: Find a text classification model that works well on short texts Agent: [Searches Hugging Face Hub] Found several options: - distilbert-base-uncased-finetuned-sst-2-english (sentiment) - facebook/bart-large-mnli (zero-shot) ... You: Show me how to use the first one Agent: [Fetches documentation] Here's how to use it with transformers: from transformers import pipeline classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english") result = classifier("I love this product!") ``` ## Next Steps - [CLI](./agents-cli) - Command-line interface for Hub operations - [MCP Server](./agents-mcp) - Connect any MCP-compatible AI assistant to the Hub - [Skills](./agents-skills) - Pre-built capabilities for coding agents - [SDK](./agents-sdk) - Python and JavaScript libraries for building agents ### Moderation https://huggingface.co/docs/hub/moderation.md # Moderation > [!TIP] > Check out the [Code of Conduct](https://huggingface.co/code-of-conduct) and the [Content Guidelines](https://huggingface.co/content-guidelines). ## Reporting a repository To report a repository, you can click the three dots at the top right of a repository. Afterwards, you can click "Report the repository". This will allow you to select the reason behind the report (ethical issue, legal issue, not working, or other) and add a description for the report. Once you do this, a **public discussion** will be opened.
## Reporting a comment To report a comment, you can click the three dots at the top right of a comment. That will submit a request for the Hugging Face team to review. ### Inference Providers https://huggingface.co/docs/hub/models-inference.md # Inference Providers Hugging Face's model pages have pay-as-you-go inference for thousands of models, so you can try them all out right in the browser. The service is powered by Inference Providers and includes a free tier. Inference Providers give developers streamlined, unified access to hundreds of machine learning models, powered by the best serverless inference partners. 👉 **For complete documentation, visit the [Inference Providers Documentation](https://huggingface.co/docs/inference-providers)**. ## Inference Providers on the Hub Inference Providers is deeply integrated with the Hugging Face Hub, and you can use it in a few different ways: - **Interactive Widgets** - Test models directly on model pages with interactive widgets that use Inference Providers under the hood. Check out the [DeepSeek-R1-0528 model page](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) for an example. - **Inference Playground** - Easily test and compare chat completion models with your prompts. Check out the [Inference Playground](https://huggingface.co/playground) to get started. - **Search** - Filter models by inference provider on the [models page](https://huggingface.co/models?inference_provider=all) to find models available through specific providers. - **Data Studio** - Use AI to explore datasets on the Hub. Check out [Data Studio](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/viewer?views%5B%5D=train) on your favorite dataset. ## Build with Inference Providers You can integrate Inference Providers into your own applications using our SDKs or HTTP clients.
Here's a quick start with Python and JavaScript; for more details, check out the [Inference Providers Documentation](https://huggingface.co/docs/inference-providers). You can use our Python SDK to interact with Inference Providers. ```python from huggingface_hub import InferenceClient import os client = InferenceClient( api_key=os.environ["HF_TOKEN"], provider="auto", # Automatically selects best provider ) # Chat completion completion = client.chat.completions.create( model="deepseek-ai/DeepSeek-V3-0324", messages=[{"role": "user", "content": "A story about hiking in the mountains"}] ) # Image generation image = client.text_to_image( prompt="A serene lake surrounded by mountains at sunset, photorealistic style", model="black-forest-labs/FLUX.1-dev" ) ``` Or, you can just use the OpenAI API compatible client. ```python import os from openai import OpenAI client = OpenAI( base_url="https://router.huggingface.co/v1", api_key=os.environ["HF_TOKEN"], ) completion = client.chat.completions.create( model="deepseek-ai/DeepSeek-V3-0324", messages=[ { "role": "user", "content": "A story about hiking in the mountains" } ], ) ``` > [!WARNING] > The OpenAI API compatible client is not supported for image generation. You can use our JavaScript SDK to interact with Inference Providers. ```javascript import { InferenceClient } from "@huggingface/inference"; const client = new InferenceClient(process.env.HF_TOKEN); const chatCompletion = await client.chatCompletion({ provider: "auto", // Automatically selects best provider model: "deepseek-ai/DeepSeek-V3-0324", messages: [{ role: "user", content: "Hello!" }] }); const imageBlob = await client.textToImage({ model: "black-forest-labs/FLUX.1-dev", inputs: "A serene lake surrounded by mountains at sunset, photorealistic style", }); ``` Or, you can just use the OpenAI API compatible client.
```javascript import { OpenAI } from "openai"; const client = new OpenAI({ baseURL: "https://router.huggingface.co/v1", apiKey: process.env.HF_TOKEN, }); const completion = await client.chat.completions.create({ model: "meta-llama/Llama-3.1-8B-Instruct", messages: [{ role: "user", content: "A story about hiking in the mountains" }], }); ``` > [!WARNING] > The OpenAI API compatible client is not supported for image generation. You'll need a Hugging Face token with inference permissions. Create one at [Settings > Tokens](https://huggingface.co/settings/tokens/new?ownUserPermissions=inference.serverless.write&tokenType=fineGrained). ### How Inference Providers works To dive deeper into Inference Providers, check out the [Inference Providers Documentation](https://huggingface.co/docs/inference-providers). Here are some key resources: - **[Quick Start](https://huggingface.co/docs/inference-providers)** - **[Pricing & Billing Guide](https://huggingface.co/docs/inference-providers/pricing)** - **[Hub Integration Details](https://huggingface.co/docs/inference-providers/hub-integration)** ### What was the HF-Inference API? HF-Inference API is one of the providers available through Inference Providers. It was previously called "Inference API (serverless)" and is powered by [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index) under the hood. For more details about the HF-Inference provider specifically, check out its [dedicated page](https://huggingface.co/docs/inference-providers/providers/hf-inference). ### Spaces as MCP servers https://huggingface.co/docs/hub/spaces-mcp-servers.md # Spaces as MCP servers You can **turn any public Space that has a visible `MCP` badge into a callable tool** that will be available in any MCP-compatible client. You can add as many Spaces as you want, without writing a single line of code.
## Setup your MCP Client From your [Hub MCP settings](https://huggingface.co/settings/mcp), select your MCP client (VSCode, Cursor, Claude Code, etc.) then follow the setup instructions. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/wWm_GeuWF17OrMyJT4tMx.png) > [!WARNING] > You need a valid Hugging Face token with READ permissions to use MCP tools. If you don't have one, create a new "Read" access token here. ## Add an existing Space to your MCP tools ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/ex9KRpvamn84ZaOlSp_Bj.png) 1. Browse compatible [Spaces](https://huggingface.co/spaces?filter=mcp-server) to find Spaces that are usable via MCP. You can also look for the grey **MCP** badge on any Spaces card. 2. Click the badge and choose **Add to MCP tools** then confirm when asked. 3. The Space should be listed in your MCP Server settings in the Spaces Tools section. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f17f0a0925b9863e28ad517/uI4PsneUZoWn_TExhNJyt.png) ## Use Spaces from your MCP client If your MCP client is configured correctly, the Spaces you added will be available instantly without changing anything (if a Space doesn't appear, restart your client). Most MCP clients will list what tools are currently loaded so you can make sure the Space is available. > [!TIP] > For ZeroGPU Spaces, your quota will be used when the tool is called. If you run out of quota, you can subscribe to PRO to get 25 minutes of daily quota (x8 more quota than free users). For example, your PRO account lets you generate up to 600 images per day using FLUX.1-schnell. ## Build your own MCP-compatible Gradio Space To create your own MCP-enabled Space, you need to [Create a new Gradio Space](https://huggingface.co/new-space?sdk=gradio) then make sure to enable MCP support in the code.
Get started with [Gradio Spaces](https://huggingface.co/docs/hub/en/spaces-sdks-gradio) and make sure to check the [detailed MCP guide](https://www.gradio.app/guides/building-mcp-server-with-gradio) for more details. First, install Gradio with MCP support: ```bash pip install "gradio[mcp]" ``` Then create your app with clear type hints and docstrings: ```python import gradio as gr def letter_counter(word: str, letter: str) -> int: """Count occurrences of a letter in a word. Args: word: The word to search in letter: The letter to count Returns: Number of times the letter appears in the word """ return word.lower().count(letter.lower()) demo = gr.Interface(fn=letter_counter, inputs=["text", "text"], outputs="number") demo.launch(mcp_server=True) # exposes an MCP schema automatically ``` Push the app to a **Gradio Space** and it will automatically receive the **MCP** badge. Anyone can then add it as a tool with a single click. > [!TIP] > It's also quite easy to convert an existing Gradio Space to an MCP server. Duplicate it from the context menu, then just add the `mcp_server=True` parameter to your `launch()` method, and ensure your functions have clear type hints and docstrings - you can use AI tools to automate this quite easily (example of AI-generated docstrings). ## Be creative by mixing Spaces! As Hugging Face Spaces is the largest directory of AI apps, you can find many creative tools that can be used as MCP tools. Mixing and matching different Spaces can lead to powerful and creative workflows. This video demonstrates the use of Lightricks/ltx-video-distilled and ResembleAI/Chatterbox in Claude Code to generate a video with audio. ### Tokens Management https://huggingface.co/docs/hub/enterprise-tokens-management.md # Tokens Management > [!WARNING] > This feature is part of the Team & Enterprise plans. Tokens Management enables organization administrators to oversee access tokens within their organization, ensuring secure access to organization resources. 
> [!NOTE] > For the member experience when token management policies are in effect — including how to check token status, what errors to expect, and how denial and revocation affect a token — see [Tokens in organizations with token management policies](./security-tokens#tokens-in-organizations-with-token-management-policies). ## Viewing and Managing Access Tokens The token listing feature displays all access tokens within your organization. Administrators can: - Monitor token usage and identify or prevent potential security risks: - Unauthorized access to private resources ("leaks") - Overly broad access scopes - Suboptimal token hygiene (e.g., tokens that have not been rotated in a long time) - Identify inactive or unused tokens Fine-grained tokens display their specific permissions: Revoked tokens are hidden from the listing by default. Use the **Show revoked tokens** toggle above the list to include them in the view. A **REVOKED** status badge is shown for any revoked token regardless of the organization's token policy. **PENDING**, **APPROVED**, and **DENIED** badges only appear in organizations with the "Require administrator approval" policy enabled, as those states are only created by the approval flow. 
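The badge-visibility rules above can be sketched as a small helper (a sketch with hypothetical names, not the Hub's actual code):

```python
def visible_badge(token_status, requires_admin_approval):
    """Which status badge an administrator sees for a token.

    REVOKED is shown regardless of the organization's token policy;
    PENDING/APPROVED/DENIED only exist under the approval flow.
    """
    if token_status == "revoked":
        return "REVOKED"
    if requires_admin_approval and token_status in ("pending", "approved", "denied"):
        return token_status.upper()
    return None  # no badge outside the approval flow
```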
## Token Policy Team & Enterprise organization administrators can enforce the following policies: | **Policy** | **Unscoped (Read/Write) Access Tokens** | **Fine-Grained Tokens** | | ------------------------------------------------- | --------------------------------------- | ----------------------------------------------------------- | | **Allow access via User Access Tokens (default)** | Authorized | Authorized | | **Only access via fine-grained tokens** | Unauthorized | Authorized | | **Require administrator approval** | Unauthorized | Unauthorized without an approval (except for admin-created) | ## Reviewing Token Authorization When token policy is set to "Require administrator approval", organization administrators can review details of all fine-grained tokens accessing organization-owned resources and approve or deny access. When a new token enters the pending state, up to 5 organization administrators with confirmed email addresses receive a notification with a direct link to the token review page. No notification is sent when a token is auto-approved (e.g., because the creator is an org admin). - **Pending** tokens are awaiting an administrator decision - **Approved** tokens have been authorized and are active - **Denied** tokens have been blocked from accessing organization resources When a token is approved or denied, the token owner receives an email notification. Denial is not permanent: a denied token can later be approved by an administrator, restoring its access. Likewise, an already-approved token can be denied at any time, which removes its approval-based access immediately. > [!NOTE] > Token names are only visible to administrators when the "Require administrator approval" policy is enabled. 
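The policy table above can be sketched as an authorization check (a sketch with hypothetical names, not the Hub's actual logic; here `approved` covers both admin-approved and admin-created fine-grained tokens):

```python
def is_token_authorized(policy, token_type, approved=False):
    """Authorization matrix from the Token Policy table."""
    if policy == "allow_user_access_tokens":   # default: everything authorized
        return True
    if policy == "fine_grained_only":          # unscoped read/write tokens blocked
        return token_type == "fine_grained"
    if policy == "require_admin_approval":     # fine-grained AND approved only
        return token_type == "fine_grained" and approved
    raise ValueError(f"unknown policy: {policy}")
```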
### What Members See When Blocked Members whose tokens are pending or denied receive a `403` error when accessing organization resources: _"Due to the organization token policy, your token needs to be approved by the organization before you can access this resource."_ The error message is the same for both pending and denied states. To see the status, members can navigate to the individual token's edit page. The organization administrator token management settings page shows status badges for all member tokens. ## Deny vs. Revoke Administrators have two ways to remove a token's access to an organization: | **Aspect** | **Deny** | **Revoke** | | ------------------- | ------------------------------------------ | ------------------------------------------------------------- | | **Plan** | Team & Enterprise | Enterprise plan and above | | **Scope** | Operates within the approval workflow | Independent of the token policy | | **Effect** | Blocks or removes approval-based access | Forcefully removes access regardless of policy or token state | | **Reversible?** | Yes — a denied token can later be approved | No — revoked status persists even if the policy changes | | **Token elsewhere** | No effect outside the org | No effect outside the org | Use **deny** when managing access within the approval workflow (the token transitions to a `denied` state and can be re-approved later). Use **revoke** when you need to permanently cut off a token's access to the organization. ## Revoking Tokens > [!WARNING] > This feature is part of the Enterprise plan and above. Organization administrators can revoke any member's access token from the token detail page. Revocation is available regardless of whether the organization uses the "Require administrator approval" policy. A revoked token can no longer access the organization's resources, but continues to work elsewhere. The token owner receives an email notification upon revocation. 
Revoked tokens remain revoked even if the organization's token policy is later changed or disabled. Revocation is permanent at the organization level — there is no un-revoke action. If a member needs access restored, they must delete the revoked token and create a new one. If the organization uses the "Require administrator approval" policy, the new token will start in the pending state and require admin approval. Members whose tokens have been revoked receive a `403` error with the message: _"Your token has been revoked by the organization administrator, you can no longer access organization resources. Please contact them for more information."_ This message is shown regardless of whether the organization uses the "Require administrator approval" policy. ### Revoking via API Administrators can also revoke a token programmatically by providing the raw token value. This is useful for automated workflows such as secrets scanning, where a leaked token is detected and needs to be revoked immediately. ```bash # ORG_NAME should be your organization name and ADMIN_HF_TOKEN an admin's access token # LEAKED_HF_TOKEN should contain the raw token value to revoke curl -X POST "https://huggingface.co/api/organizations/${ORG_NAME}/settings/tokens/revoke" \ -H "Authorization: Bearer ${ADMIN_HF_TOKEN}" \ -H "Content-Type: application/json" \ -d "{\"token\": \"${LEAKED_HF_TOKEN}\"}" ``` > [!TIP] > To avoid leaking token values in shell history or logs, pass them via environment variables or files, and avoid pasting raw tokens directly into command lines. An administrator cannot revoke their own token (`LEAKED_HF_TOKEN` cannot have the same value as `ADMIN_HF_TOKEN` in the snippet above). ## Programmatic Token Issuance For organizations that need to programmatically issue access tokens for their members (e.g., for internal platforms, CI/CD pipelines, or custom integrations), see [OAuth Token Exchange](./oauth#token-exchange-for-organizations-rfc-8693). 
This Enterprise plan feature allows your backend services to issue scoped tokens for organization members without requiring interactive user consent. ### Agent Libraries https://huggingface.co/docs/hub/agents-libraries.md # Agent Libraries ## tiny-agents A lightweight toolkit for running MCP-powered agents on top of Hugging Face Inference. Available in [JavaScript](https://huggingface.co/docs/huggingface.js/en/tiny-agents/README) (`@huggingface/tiny-agents`) and [Python](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/mcp) (`huggingface_hub`). ```bash # JavaScript npx @huggingface/tiny-agents run "agent/id" # Python pip install "huggingface_hub[mcp]" tiny-agents run "agent/id" ``` Create your own agent with an `agent.json` config: ```json { "model": "Qwen/Qwen2.5-72B-Instruct", "provider": "together", "servers": [ { "type": "stdio", "command": "npx", "args": ["@playwright/mcp@latest"] } ] } ``` For local LLMs, add an `endpointUrl` pointing to your server (e.g. `http://localhost:1234/v1`). Learn more in the [SDK guide](./agents-sdk). ## Gradio MCP Server Turn any Gradio app into an MCP server with a single-line change: ```python demo.launch(mcp_server=True) ``` The server exposes each function as a tool, with descriptions auto-generated from docstrings. Connect it to any MCP client. Thousands of MCP-compatible Spaces are available on the [Hub](https://huggingface.co/spaces?filter=mcp-server). Learn more in the [Gradio MCP guide](https://www.gradio.app/guides/building-mcp-server-with-gradio). ## smolagents [smolagents](https://github.com/huggingface/smolagents) is a lightweight Python library for building agents in a few lines of code. It supports `CodeAgent` (writes actions in Python) and `ToolCallingAgent` (uses JSON tool calls), works with any model via [Inference Providers](../inference-providers/index), and integrates with MCP servers. ```bash smolagent "Plan a trip to Tokyo, Kyoto and Osaka between Mar 28 and Apr 7." 
\ --model-type "InferenceClientModel" \ --model-id "Qwen/Qwen2.5-Coder-32B-Instruct" \ --tools "web_search" ``` Agents can be pushed to the Hub as Spaces. Browse community agents [here](https://huggingface.co/spaces?filter=smolagents&sort=likes). Learn more in the [smolagents documentation](https://huggingface.co/docs/smolagents/tutorials/tools#use-mcp-tools-with-mcpclient-directly). ### Paper Pages https://huggingface.co/docs/hub/paper-pages.md # Paper Pages Paper pages allow people to find artifacts related to a paper such as models, datasets and apps/demos (Spaces). Paper pages also enable the community to discuss the paper. ## Linking a Paper to a model, dataset or Space If the repository card (`README.md`) includes a link to a Paper page (either on HF or an Arxiv abstract/PDF), the Hugging Face Hub will extract the arXiv ID and include it in the repository's tags. Clicking on the arxiv tag will let you: * Visit the Paper page. * Filter for other models or datasets on the Hub that cite the same paper. ## Claiming authorship to a Paper The Hub will attempt to automatically match papers to users based on their email. If your paper is not linked to your account, you can click on your name in the corresponding Paper page and click "claim authorship". This will automatically redirect you to your paper settings, where you can confirm the request. The admin team will validate your request soon. Once confirmed, the Paper page will show as verified. If you don't have any papers on Hugging Face yet, you can index your first one as explained [here](#can-i-have-a-paper-page-even-if-i-have-no-modeldatasetspace). Once available, you can claim authorship. ## Frequently Asked Questions ### Can I control which Paper pages show in my profile? Yes! You can visit your Papers in [settings](https://huggingface.co/settings/papers), where you will see a list of verified papers. There, you can click the "Show on profile" checkbox to hide/show it in your profile. 
### Do you support ACL anthology? We're starting with Arxiv as it accounts for 95% of the paper URLs Hugging Face users have linked in their repos organically. We'll check how this evolves and potentially extend to other paper hosts in the future. ### Can I have a Paper page even if I have no model/dataset/Space? Yes. You can go to [the main Papers page](https://huggingface.co/papers), click search and write the name of the paper or the full Arxiv id. If the paper does not exist, you will get an option to index it. You can also just visit the page `hf.co/papers/xxxx.yyyyy` replacing with the arxiv id of the paper you wish to index. ### Use Ollama with any GGUF Model on Hugging Face Hub https://huggingface.co/docs/hub/ollama.md # Use Ollama with any GGUF Model on Hugging Face Hub ![cover](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ollama/cover.png) Ollama is an application based on llama.cpp that lets you interact with LLMs directly on your computer. You can use any GGUF quants created by the community ([bartowski](https://huggingface.co/bartowski), [MaziyarPanahi](https://huggingface.co/MaziyarPanahi) and [many more](https://huggingface.co/models?pipeline_tag=text-generation&library=gguf&sort=trending)) on Hugging Face directly with Ollama, without creating a new `Modelfile`. At the time of writing there are 45K public GGUF checkpoints on the Hub, and you can run any of them with a single `ollama run` command. We also provide customisations like choosing quantization type, system prompt and more to improve your overall experience. Getting started is as simple as: 1. Enable `ollama` under your [Local Apps settings](https://huggingface.co/settings/local-apps). 2. On a model page, choose `ollama` from `Use this model` dropdown. For example: [bartowski/Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF). 
The snippet has the following format: ```sh ollama run hf.co/{username}/{repository} ``` Please note that you can use both `hf.co` and `huggingface.co` as the domain name. Here are some models you can try: ```sh ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF ``` ## Custom Quantization By default, the `Q4_K_M` quantization scheme is used when it's present inside the model repo. If not, we default to picking a reasonable quant type present inside the repo. To select a different scheme, simply: 1. From `Files and versions` tab on a model page, open GGUF viewer on a particular GGUF file. 2. Choose `ollama` from `Use this model` dropdown. The snippet then has the following format (quantization tag added): ```sh ollama run hf.co/{username}/{repository}:{quantization} ``` For example: ```sh ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 # the quantization name is case-insensitive, this will also work ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m # you can also directly use the full filename as a tag ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-IQ3_M.gguf ``` ## Custom Chat Template and Parameters By default, a template will be selected automatically from a list of commonly used templates. It will be selected based on the built-in `tokenizer.chat_template` metadata stored inside the GGUF file. If your GGUF file doesn't have a built-in template or if you want to customize your chat template, you can create a new file called `template` in the repository. The template must be a Go template, not a Jinja template. 
Here's an example: ``` {{ if .System }} {{ .System }} {{ end }}{{ if .Prompt }} {{ .Prompt }} {{ end }} {{ .Response }} ``` To learn more about the Go template format, please refer to [this documentation](https://github.com/ollama/ollama/blob/main/docs/template.mdx) You can optionally configure a system prompt by putting it into a new file named `system` in the repository. To change sampling parameters, create a file named `params` in the repository. The file must be in JSON format. For the list of all available parameters, please refer to [this documentation](https://github.com/ollama/ollama/blob/main/docs/modelfile.mdx#parameter). ## Run Private GGUFs from the Hugging Face Hub You can run private GGUFs from your personal account or from an associated organisation account in a few simple steps: 1. Copy your Ollama SSH key; you can do so via: `cat ~/.ollama/id_ed25519.pub | pbcopy` 2. Add the corresponding key to your Hugging Face account by going to [your account settings](https://huggingface.co/settings/keys) and clicking on `Add new SSH key`. 3. That's it! You can now run private GGUFs from the Hugging Face Hub: `ollama run hf.co/{username}/{repository}`. ## References - https://github.com/ollama/ollama/blob/main/docs/README.md - https://huggingface.co/docs/hub/en/gguf ### Two-Factor Authentication (2FA) https://huggingface.co/docs/hub/security-2fa.md # Two-Factor Authentication (2FA) Using two-factor authentication verifies a user's identity with two methods, adding extra security to ensure only authorized individuals can access an account, even if the password is compromised. If you choose to enable two-factor authentication, at every login you will need to provide: - Username or email & password (normal login credentials) - One-time security code via app ## Enable Two-factor Authentication (2FA) To enable Two-factor Authentication with a one-time password: On the Hugging Face Hub: 1. Go to your [Authentication settings](https://hf.co/settings/authentication) 2. 
Select Add Two-Factor Authentication On your device (usually your phone): 1. Install a compatible application. For example: - Authy - Google Authenticator - Microsoft Authenticator - FreeOTP 2. In the application, add a new entry in one of two ways: - Scan the code displayed on the Hub with your device's camera to add the entry automatically - Enter the details provided to add the entry manually In Hugging Face Hub: 1. Enter the six-digit PIN from your authentication device into "Code" 2. Save If you entered the correct PIN, the Hub displays a list of recovery codes. Download them and keep them in a safe place. > [!TIP] > You will be prompted for 2FA every time you log in, and every 30 days ## Recovery codes Right after you've successfully activated 2FA with a one-time password, you're requested to download a collection of generated recovery codes. If you ever lose access to your one-time password authenticator, you can use one of these recovery codes to log in to your account. - Each code can be used only **once** to sign in to your account - You should copy and print the codes, or download them for storage in a safe place. If you choose to download them, the file is called **huggingface-recovery-codes.txt** If you lose the recovery codes, or want to generate new ones, you can use the [Authentication settings](https://hf.co/settings/authentication) page. ## Regenerate two-factor authentication recovery codes To regenerate 2FA recovery codes: 1. Access your [Authentication settings](https://hf.co/settings/authentication) 2. If you've already configured 2FA, select Recovery Code 3. Click on Regenerate recovery codes > [!WARNING] > If you regenerate 2FA recovery codes, save them. You can't use any previously created recovery codes. ## Sign in with two-factor authentication enabled When you sign in with 2FA enabled, the process is only slightly different than the standard sign-in procedure. 
After entering your username and password, you'll encounter an additional prompt, depending on the type of 2FA you've set up. When prompted, provide the PIN from your one-time password authenticator's app or a recovery code to complete the sign-in process. ## Disable two-factor authentication To disable 2FA: 1. Access your [Authentication settings](https://hf.co/settings/authentication) 2. Click on "Remove". This clears all your 2FA registrations. ## Recovery options If you no longer have access to your authentication device, you can still recover access to your account: - Use a saved recovery code, if you saved them when you enabled two-factor authentication - Request help with two-factor authentication ### Use a recovery code To use a recovery code: 1. Enter your username or email, and password, on the [Hub sign-in page](https://hf.co/login) 2. When prompted for a two-factor code, click on "Lost access to your two-factor authentication app? Use a recovery code" 3. Enter one of your recovery codes After you use a recovery code, you cannot re-use it. You can still use the other recovery codes you saved. ### Requesting help with two-factor authentication In case you've forgotten your password and lost access to your two-factor authentication credentials, you can reach out to support (website@huggingface.co) to regain access to your account. You'll be required to verify your identity using a recovery authentication factor, such as an SSH key or personal access token. ### Spaces as Agent Tools https://huggingface.co/docs/hub/spaces-agents.md # Spaces as Agent Tools Every Gradio Space exposes a plain-text `agents.md` that coding agents (Claude Code, Codex, OpenCode, Pi, etc.) can call directly. Find one via semantic search on [huggingface.co/spaces](https://huggingface.co/spaces) (e.g. "audio transcription"), optionally try it in the UI first, then point your agent at its `agents.md`. The response is four lines: schema URL, call template, poll template, auth hint. 
This gets even more powerful when **chaining Spaces**. An agent can turn a prompt into a 3D asset by calling [`black-forest-labs/flux-klein-9b-kv`](https://huggingface.co/spaces/black-forest-labs/flux-klein-9b-kv) for an image, then passing the generated image into [`microsoft/TRELLIS.2`](https://huggingface.co/spaces/microsoft/TRELLIS.2) for the 3D model. No client library, no hardcoded integration. All you need is an [HF_TOKEN](https://huggingface.co/settings/tokens) set in your environment to get started. ## From the UI Every compatible Space page has an **Agents** button in the header. Click it to copy the `curl` command for that Space's `agents.md`, then paste it into your agent. ## The agents.md endpoint ``` https://huggingface.co/spaces/{username}/{space}/agents.md ``` Example: ```bash curl https://huggingface.co/spaces/microsoft/TRELLIS.2/agents.md ``` Returns: ``` To use this application (microsoft/TRELLIS.2: Create 3D model from a single image): API schema: GET https://microsoft-trellis-2.hf.space/gradio_api/info Call endpoint: POST https://microsoft-trellis-2.hf.space/gradio_api/call/{endpoint} {"data": [...]} Poll result: GET https://microsoft-trellis-2.hf.space/gradio_api/call/{endpoint}/{event_id} Auth: Bearer $HF_TOKEN ``` ## Authentication and ZeroGPU Most popular Spaces run on [ZeroGPU](./spaces-zerogpu), which uses the caller's daily quota. Agents should always pass an `$HF_TOKEN` so calls are billed to your account rather than a throttled anonymous pool. The same token is also required for private Spaces. ```bash curl -H "Authorization: Bearer $HF_TOKEN" \ https://microsoft-trellis-2.hf.space/gradio_api/call/predict \ -d '{"data": [{"path": "https://example.com/chair.png"}]}' ``` ## How agents will use this 1. The agent `curl`s `/agents.md` for the Space. 2. It fetches `/gradio_api/info` to learn endpoint names and inputs. 3. It POSTs to `/gradio_api/call/{endpoint}`, then GETs the poll URL to stream the result. 
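The three-step flow above can be sketched in Python (a sketch assuming the TRELLIS.2 endpoints shown earlier; `start_job` only builds the authenticated request, sending it and polling are left to the agent):

```python
import json
import urllib.request

BASE = "https://microsoft-trellis-2.hf.space"  # Space host from the agents.md example

def call_url(endpoint):
    """POST here to start a job; the JSON response carries an event_id."""
    return f"{BASE}/gradio_api/call/{endpoint}"

def poll_url(endpoint, event_id):
    """GET here to stream the job result once it is ready."""
    return f"{BASE}/gradio_api/call/{endpoint}/{event_id}"

def start_job(endpoint, data, token):
    """Build the authenticated POST request (not sent here)."""
    return urllib.request.Request(
        call_url(endpoint),
        data=json.dumps({"data": data}).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
```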
### How to configure SCIM with Microsoft Entra ID (Azure AD) https://huggingface.co/docs/hub/security-sso-entra-id-scim.md # How to configure SCIM with Microsoft Entra ID (Azure AD) This guide explains how to set up automatic user and group provisioning between Microsoft Entra ID and your Hugging Face organization using SCIM. > [!WARNING] > This feature is part of the Enterprise and Enterprise Plus plans. ## Step 1: Get SCIM configuration from Hugging Face 1. Navigate to your organization's settings page on Hugging Face. 2. Go to the **SSO** tab, then click on the **SCIM** sub-tab. 3. Copy the **SCIM Tenant URL**. You will need this for the Entra ID configuration. 4. Click **Generate an access token**. A new SCIM token will be generated. Copy this token immediately and store it securely, as you will not be able to see it again. ## Step 2: Configure Provisioning in Microsoft Entra ID 1. In the Microsoft Entra admin center, navigate to your Hugging Face Enterprise Application. 2. In the left-hand menu, select **Provisioning**. 3. Click **Get started**. 4. Change the **Provisioning Mode** from "Manual" to **Automatic**. ## Step 3: Enter Admin Credentials 1. In the **Admin Credentials** section, paste the **SCIM Tenant URL** from Hugging Face into the **Tenant URL** field. 2. Paste the **SCIM token** from Hugging Face into the **Secret Token** field. 3. Click **Test Connection**. You should see a success notification. 4. Click **Save**. ## Step 4: Configure Attribute Mappings 1. Under the **Mappings** section, click on **Provision Microsoft Entra ID Users**. 2. The default attribute mappings often require adjustments for robust provisioning. We recommend using the following configuration. 
You can delete attributes that are not listed here: | `customappsso` Attribute | Microsoft Entra ID Attribute | Matching precedence | |---|---|---| | `userName` | `Replace([mailNickname], ".", "", "", "", "", "")` | | | `active` | `Switch([IsSoftDeleted], , "False", "True", "True", "False")` | | | `emails[type eq "work"].value` | `userPrincipalName` | | | `name.givenName` | `givenName` | | | `name.familyName` | `surname` | | | `externalId` | `objectId` | `1` | 3. The Username needs to comply with the following rules. > [!WARNING] > > Only alphanumeric characters and `-` are accepted in the Username. > `--` (double dash) is forbidden. > `-` cannot start or end the name. > Digit-only names are not accepted. > Minimum length is 2 and maximum length is 42. > Username has to be unique within your org. > 4. After configuring the user mappings, go back to the Provisioning screen and click on **Provision Microsoft Entra ID Groups** to review group mappings. The default settings for groups are usually sufficient. ## Step 5: Start Provisioning 1. On the main Provisioning screen, set the **Provisioning Status** to **On**. 2. Under **Settings**, you can configure the **Scope** to either "Sync only assigned users and groups" or "Sync all users and groups". We recommend starting with "Sync only assigned users and groups". 3. Save your changes. The initial synchronization can take up to 40 minutes to start. You can monitor the progress in the **Provisioning logs** tab. ### Assigning Users and Groups for Provisioning To control which users and groups are provisioned to your Hugging Face organization, you need to assign them to the Hugging Face Enterprise Application in Microsoft Entra ID. This is done in the **Users and groups** tab of your application. 1. Navigate to your Hugging Face Enterprise Application in the Microsoft Entra admin center. 2. Go to the **Users and groups** tab. 3. Click **Add user/group**. 4. Select the users and groups you want to provision and click **Assign**. 
Only the users and groups you assign here will be provisioned to Hugging Face if you have set the **Scope** to "Sync only assigned users and groups". > [!TIP] > Active Directory Plan Considerations > > With Free, Office 365, and Premium P1/P2 plans, you can assign individual users to the application for provisioning. > With Premium P1/P2 plans, you can also assign groups. This is the recommended approach for managing access at scale, as you can manage group membership in AD, and the changes will automatically be reflected in Hugging Face. > ## Step 6: Verify Provisioning in Hugging Face Once the synchronization is complete, navigate back to your Hugging Face organization settings: - Provisioned users will appear in the **Users Management** tab. - Provisioned groups will appear in the **SCIM** tab under **SCIM Groups**. These groups can then be assigned to [Resource Groups](./security-resource-groups) for fine-grained access control. ## Step 7: Link SCIM Groups to Hugging Face Resource Groups Once your groups are provisioned from Entra ID, you can link them to Hugging Face Resource Groups to manage permissions at scale. This allows all members of a SCIM group to automatically receive specific roles (like read or write) for a collection of resources. > [!NOTE] > Before linking, make sure the Resource Group you want to link is **empty** (has no existing members) and does **not** have auto-join enabled. Both conditions are required — linking will fail otherwise. 1. In your Hugging Face organization settings, navigate to the **SSO** -> **SCIM** tab. You will see a list of your provisioned groups under **SCIM Groups**. 2. Locate the group you wish to configure and click **Link resource groups** in its row. 3. A dialog will appear. Click **Link a Resource Group**. 4. From the dropdown menus, select the **Resource Group** you want to link and the **Role Assignment** you want to grant to the members of the SCIM group. 5. Click **Link to SCIM group** and save the mapping. 
Once linked, the Resource Group becomes **SCIM-managed**: any members already in the SCIM group are immediately added to the Resource Group (backfill), and all future membership changes in Entra ID are automatically reflected. Manual membership edits on the Resource Group via the Hub UI or API will be blocked. ### Single Sign-On (SSO) https://huggingface.co/docs/hub/security-sso.md # Single Sign-On (SSO) > [!WARNING] > This feature is part of the Team & Enterprise plans. Hugging Face supports Single Sign-On (SSO) to let organizations manage user authentication through their own Identity Provider (IdP). Both SAML 2.0 and OpenID Connect (OIDC) protocols are supported. There are two SSO models available, depending on your plan and needs. For a detailed comparison, see the [SSO overview](./enterprise-sso). - **[Basic SSO](./security-sso-basic)** — Available on Team & Enterprise plans. Adds an access-control layer on top of the standard Hugging Face login to secure your organization's resources. - **[Managed SSO](./enterprise-advanced-sso)** — Available on the Enterprise Plus plan. Replaces the Hugging Face login entirely, giving your organization full control over user accounts and access. Requires setup with the Hugging Face team — contact us to get started. 
## Further reading - [User Management](./security-sso-user-management) — Role mapping, resource group mapping, session timeout, and more - [Configuration Guides](./security-sso-configuration-guides) — Step-by-step setup instructions for Okta, Microsoft Entra ID, and Google Workspace - [User Provisioning (SCIM)](./enterprise-scim) — Automated user provisioning from your Identity Provider ### How to configure SAML SSO with Microsoft Entra ID (Azure AD) https://huggingface.co/docs/hub/security-sso-azure-saml.md # How to configure SAML SSO with Microsoft Entra ID (Azure AD) In this guide, we will use Microsoft Entra ID as the SSO provider, with the Security Assertion Markup Language (SAML) protocol as our preferred identity protocol. We currently support SP-initiated and IdP-initiated authentication. For user provisioning, see [SCIM](./enterprise-scim). > [!WARNING] > This feature is part of the Team & Enterprise plans. ## Step 1: Create a new application in your Identity Provider Open a new tab/window in your browser and sign in to the Azure portal of your organization. Navigate to "Enterprise applications" and click the "New application" button. You'll be redirected to this page, click on "Create your own application", fill the name of your application, and then "Create" the application. Then select "Single Sign-On", and select SAML ## Step 2: Configure your application on Azure Open a new tab/window in your browser and navigate to the SSO section of your organization's settings. Select the SAML protocol. Copy the "SP Entity Id" from the organization's settings on Hugging Face, and paste it in the "Identifier (Entity Id)" field on Azure (1). Copy the "Assertion Consumer Service URL" from the organization's settings on Hugging Face, and paste it in the "Reply URL" field on Azure (2). The URL looks like this: `https://huggingface.co/organizations/[organizationIdentifier]/saml/consume`. 
Then under "SAML Certificates", verify that "Signing Option" is set to "Sign SAML response and assertion". Save your new application. ## Step 3: Finalize configuration on Hugging Face In your Azure application, under "Set up", find the following field: - Login Url And under "SAML Certificates": - Download the "Certificate (base64)" You will need them to finalize the SSO setup on Hugging Face. In the SSO section of your organization's settings, copy-paste these values from Azure: - Login Url -> Sign-on URL - Certificate -> Public certificate The public certificate must have the following format: ``` -----BEGIN CERTIFICATE----- {certificate} -----END CERTIFICATE----- ``` You can now click on "Update and Test SAML configuration" to save the settings. You should be redirected to your SSO provider (IdP) login prompt. Once logged in, you'll be redirected to your organization's settings page. A green check mark near the SAML selector will attest that the test was successful. ## Step 4: Enable SSO in your organization Now that Single Sign-On is configured and tested, you can enable it for members of your organization by clicking on the "Enable" button. Once enabled, members of your organization must complete the SSO authentication flow described in the [How it works](./security-sso-basic#how-it-works) section. ### Advanced Access Control in Organizations with Resource Groups https://huggingface.co/docs/hub/security-resource-groups.md # Advanced Access Control in Organizations with Resource Groups > [!WARNING] > This feature is part of the Team & Enterprise plans. In your Hugging Face organization, you can use Resource Groups to control which members have access to specific repositories. ## How does it work? Resource Groups allow organization administrators to group related repositories together, enabling different teams in your organization to work on independent sets of repositories. A repository can belong to only one Resource Group.
Organization members must be added to the Resource Group to access its repositories. An Organization Member can belong to several Resource Groups. Members are assigned a role in each Resource Group that determines their permissions for the group's repositories. Four distinct roles exist for Resource Groups: - `read`: Grants read access to repositories within the Resource Group. - `contributor`: Provides extra write rights to the subset of the Organization's repositories created by the user (i.e., users can create repos and then modify only those repos). Similar to the 'Write' role, but limited to repos created by the user. - `write`: Offers write access to all repositories in the Resource Group. Users can create, delete, or rename any repository in the Resource Group. - `admin`: In addition to write permissions on repositories, admin members can administer the Resource Group: add, remove, and alter the roles of other members. They can also manage existing repositories in a Resource Group. In addition, Organization admins can manage all resource groups inside the organization. This includes moving repositories in and out of any Resource Group. Resource Groups also affect the visibility of private repositories inside the organization. A private repository that is part of a Resource Group will only be visible to members of that Resource Group. Public repositories, on the other hand, are visible to anyone, inside and outside the organization. ## Getting started Head to your Organization's settings, then navigate to the "Resource Group" tab in the left menu. Organization admins can create and manage Resource Groups from that page. Depending on the organization's settings, members with lower roles may also be allowed to create Resource Groups (see [Who can create Resource Groups](#who-can-create-resource-groups) below). After creating a Resource Group and giving it a meaningful name, you can start adding repositories and users to it.
> [!TIP] > When adding users to a Resource Group, you can search by email address if the user has an organization-specific email (e.g., `user@your-company.com`) matching your organization email domain. Remember that a repository can be part of only one Resource Group. You'll be warned when trying to add a repository that already belongs to another Resource Group. ## Auto-join Auto-join automatically adds **every org member** to a Resource Group at a specified role: both members who are already in the org when auto-join is enabled, and any new members who join in the future. This is useful for Resource Groups that should be accessible to your entire organization without requiring manual membership management. ### Enabling auto-join **Via the UI**: Open the Resource Group's settings page and check the **Include all org members** option, then select the role to assign. **Via the API**: See [Configure auto-join via API](./programmatic-user-access-control#configure-auto-join-via-api). When auto-join is enabled on an existing Resource Group, all current org members are **immediately added** to the group at the configured role (backfill). ### Auto-join and SCIM Auto-join and SCIM management are **mutually exclusive** on the same Resource Group. Auto-join adds every org member automatically; SCIM management means only the IdP controls membership. These two behaviors conflict, so: - You cannot enable auto-join on a Resource Group that is linked to a SCIM group. - You cannot link a SCIM group to a Resource Group that has auto-join enabled. To switch a Resource Group from auto-join to SCIM-managed (or vice versa), disable the current setting first. ## Who can create Resource Groups By default, only organization admins can create new Resource Groups. Org admins can change this by setting the **minimum member role required to create Resource Groups** on the Resource Groups settings page.
The available options are: - **Admins only** (default): only org admins can create Resource Groups. - **Write**: members with Write or Admin role can create Resource Groups. - **Contributor**: members with Contributor, Write, or Admin role can create Resource Groups. - **Read+**: any org member can create Resource Groups. When a non-admin member creates a Resource Group through the UI, they are automatically added as an **admin** of that newly created group. Through the API, this does not happen automatically, since API callers may be creating groups on behalf of others. Non-admin API callers must include at least one user with the admin role in the group's initial member list. ## Resource Groups API You can list resource groups and add users to them (or change a member's org role and resource group assignments) via the Hub API. For the full reference, examples, and batch workflows, see the [Programmatic User Access Control Management](./programmatic-user-access-control) guide. ### Organizations https://huggingface.co/docs/hub/organizations.md # Organizations The Hugging Face Hub offers **Organizations**, which can be used to group accounts and manage datasets, models, and Spaces. The Hub also allows admins to set user roles to [**control access to repositories**](./organizations-security) and manage their organization's [payment method and billing info](https://huggingface.co/pricing). If an organization needs to track user access to a dataset or a model due to licensing or privacy issues, an organization can enable [user access requests](./datasets-gated). Note: Use the context switcher in your org settings to quickly switch between your account and your orgs.
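As a minimal sketch of calling the Resource Groups API mentioned earlier with Python's `requests`: the endpoint path below is an assumption based on common Hub API conventions, and the [Programmatic User Access Control Management](./programmatic-user-access-control) guide remains the authoritative reference.

```python
API_ROOT = "https://huggingface.co/api"

def resource_groups_url(org: str) -> str:
    # Assumed endpoint path; verify it against the Programmatic User
    # Access Control guide before relying on it.
    return f"{API_ROOT}/organizations/{org}/resource-groups"

def list_resource_groups(org: str, token: str):
    # Imported here so the URL helper itself has no dependencies.
    import requests

    # Requires a token with admin rights on the organization.
    resp = requests.get(
        resource_groups_url(org),
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage (org name and token are placeholders):
# groups = list_resource_groups("my-org", "hf_xxx")
```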
## Contents - [Managing Organizations](./organizations-managing) - [Organization Cards](./organizations-cards) - [Access Control in Organizations](./organizations-security) ## Next: Power up your organization - [Team & Enterprise Plans](./enterprise) ### Aim on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-aim.md # Aim on Spaces **Aim** is an easy-to-use & supercharged open-source experiment tracker. Aim logs your training runs, provides a beautiful UI to compare them, and an API to query them programmatically. ML engineers and researchers use Aim explorers to compare 1000s of training runs in a few clicks. Check out the [Aim docs](https://aimstack.readthedocs.io/en/latest/) to learn more about Aim. If you have an idea for a new feature or have noticed a bug, feel free to [open a feature request or report a bug](https://github.com/aimhubio/aim/issues/new/choose). In the following sections, you'll learn how to deploy Aim on Hugging Face Spaces and explore your training runs directly from the Hub. ## Deploy Aim on Spaces You can deploy Aim on Spaces with a single click! Once you have created the Space, you'll see the `Building` status, and once it becomes `Running`, your Space is ready to go! Now, when you navigate to your Space's **App** section, you can access the Aim UI. ## Compare your experiments with Aim on Spaces Let's use a quick example of a PyTorch CNN trained on MNIST to demonstrate end-to-end Aim on Spaces deployment. The full example is in the [Aim repo examples folder](https://github.com/aimhubio/aim/blob/main/examples/pytorch_track.py). ```python from aim import Run from aim.pytorch import track_gradients_dists, track_params_dists # Initialize a new Run aim_run = Run() ...
items = {'accuracy': acc, 'loss': loss} aim_run.track(items, epoch=epoch, context={'subset': 'train'}) # Track weights and gradients distributions track_params_dists(model, aim_run) track_gradients_dists(model, aim_run) ``` The experiments tracked by Aim are stored in the `.aim` folder. **To display the logs with the Aim UI in your Space, you need to compress the `.aim` folder to a `tar.gz` file and upload it to your Space using `git` or the Files and Versions section of your Space.** Here's a bash command for that: ```bash tar -czvf aim_repo.tar.gz .aim ``` That's it! Now open the App section of your Space and the Aim UI is available with your logs. Here is what to expect: ![Aim UI on HF Hub Spaces](https://user-images.githubusercontent.com/23078323/232034340-0ba3ebbf-0374-4b14-ba80-1d36162fc994.png) Filter your runs using Aim's Pythonic search. You can write Pythonic [queries](https://aimstack.readthedocs.io/en/latest/using/search.html) against everything you have tracked: metrics, hyperparams, etc. Check out some [examples](https://huggingface.co/aimstack) on HF Hub Spaces. > [!TIP] > Note that if your logs are in TensorBoard format, you can easily convert them to Aim with one command and use the advanced, high-performance training run comparison features available. ## More on HF Spaces - [HF Docker spaces](https://huggingface.co/docs/hub/spaces-sdks-docker) - [HF Docker space examples](https://huggingface.co/docs/hub/spaces-sdks-docker-examples) ## Feedback and Support If you have improvement suggestions or need support, please open an issue on the [Aim GitHub repo](https://github.com/aimhubio/aim). The [Aim community Discord](https://github.com/aimhubio/aim#-community) is also available for community discussions. ### Lance https://huggingface.co/docs/hub/datasets-lance.md # Lance [Lance](https://lance.org) is an open multimodal lakehouse table format for AI. You can use Hugging Face paths (`hf://`) to access Lance datasets on the Hub.
This lets you scan and search large datasets on the Hugging Face Hub without having to copy the entire dataset locally. ## Getting Started To get started, pip install `pylance` and `pyarrow`: ```bash pip install pylance pyarrow ``` ## Why Lance? - Optimized for ML/AI workloads: Lance is a modern columnar format designed for fast random access without compromising scan performance, making it useful for search, analytics, training, feature engineering, and many more use cases. - Multimodal assets are stored as bytes or binary objects ("[blobs as files](https://lance.org/guide/blob/)") in Lance, alongside embeddings and traditional scalar data; this makes it easier to govern, share, and distribute your large datasets via the Hub. - Indexing is a first-class citizen (native to the format itself): Lance comes with fast, on-disk, scalable [vector](https://lance.org/quickstart/vector-search) and FTS indexes that sit right alongside the dataset on the Hub, so you can share not only your data but also your embeddings and indexes without your users needing to recompute them. - Flexible schema and [data evolution](https://lance.org/guide/data_evolution) let you incrementally add new features/columns (moderation tags, embeddings, etc.) **without** needing to rewrite the entire table. ## Store all your data in one place In Lance, your multimodal data assets (images, audio, video) are stored as raw bytes alongside your scalar metadata and embeddings. This makes it easy to scan and filter your dataset in one place without needing to stitch together multiple storage systems. ## Stream from the Hub with `datasets` Use `load_dataset(..., streaming=True)` to scan and iterate through the data without downloading it locally.
```python from datasets import load_dataset # Return as a Hugging Face dataset ds = load_dataset( "lance-format/laion-1m", split="train", streaming=True ) # Take first three rows for row in ds.take(3): print(row["caption"]) ``` Streaming is great for sampling metadata to understand what you have. For vector search or working with large binary blobs, you can use the Lance `dataset` API, explained below. > [!WARNING] > Streaming is fast for sampling simple scalar metadata but not as quick for embeddings or large multimodal assets. To work with large datasets, it's recommended to scan the metadata, identify subsets of what you need, and download that portion of the dataset locally to avoid facing Hub rate limits: > `hf download lance-format/laion-1m --repo-type dataset --local-dir ./laion` ## Stream from the Hub with `lance.dataset` You can also scan a Lance dataset that's stored on the Hugging Face Hub using the `hf://` path specifier. This scans the remote dataset without requiring that you download it locally. Using the Lance `dataset` API, it's very simple to set limits, filters and projections to only fetch the data you need. ```python import lance # Return as a Lance dataset ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") scanner = ds.scanner( columns=["caption", "url", "similarity"], limit=5 ) rows = scanner.to_table().to_pylist() for row in rows: print(row) ``` ## Work with binary assets The example below shows how images are retrieved from a Lance dataset as raw JPEG bytes in the `image` column, and used downstream. Use `ds.take` to fetch the bytes and write them to disk so you can use them elsewhere. 
```python import lance from pathlib import Path ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") dir_name = "laion_samples" Path(dir_name).mkdir(exist_ok=True) rows = ds.take([0, 1], columns=["image", "caption"]).to_pylist() for idx, row in enumerate(rows): with open(f"{dir_name}/{idx}.jpg", "wb") as f: f.write(row["image"]) print(f"Wrote image with caption: {row['caption']}") ``` ## Write a subset to a new Lance dataset Working with large datasets? It's simple to run a filtered scan to select a subset of rows from the Hub and materialize them into a local Lance dataset. ```python import lance ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") scanner = ds.scanner( columns=["image", "caption", "width", "height"], filter="width >= 200 AND height >= 100", limit=10, ) subset = scanner.to_table() lance.write_dataset(subset, "./laion_subset") ``` ## Create index If your dataset doesn't already have an index associated with it, you can create one after downloading it locally. ```python # ds is a local Lance dataset ds.create_index( "img_emb", index_type="IVF_PQ", num_partitions=256, num_sub_vectors=96, replace=True, ) ``` See the [Lance docs](https://lance.org/quickstart/vector-search/) on vector index creation for a more detailed example. Once you have a vector index created, you can run similarity search on the data via embeddings. ## Vector search Because indexes are first-class citizens in Lance, you can store not only your data but also your embeddings and indexes together and query them **directly on the Hub**. Simply use the `list_indices()` method to list the index information for the dataset. If an index doesn't exist in the dataset, you can use `lance.write_dataset()` to write a local version of the dataset and use [LanceDataset.create_index](https://lance-format.github.io/lance-python-doc/all-modules.html#lance.dataset.LanceDataset.create_index) to create an index for your needs.
The example below shows a dataset for which we already have a vector index on the `img_emb` field: ```python import lance ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") print(ds.list_indices()) # Returns # [ # IndexDescription( # name=img_emb_idx, # type_url=/lance.table.VectorIndexDetails, # num_rows_indexed=1209588, # fields=[15], # field_names=["img_emb"], # num_segments=1 # ) # ] ``` You can run vector search queries directly on the remote dataset without downloading it (or, if you prefer, download the dataset locally and create a new index). The example below shows how to run a nearest neighbor search on a vector index using an image embedding as the query vector. ```python import lance import pyarrow as pa ds = lance.dataset("hf://datasets/lance-format/laion-1m/data/train.lance") emb_field = ds.schema.field("img_emb") ref = ds.take([0], columns=["img_emb"]).to_pylist()[0]["img_emb"] query = pa.array([ref], type=emb_field.type) neighbors = ds.scanner( nearest={ "column": emb_field.name, "q": query[0], "k": 6, "nprobes": 16, "refine_factor": 30, }, columns=["caption", "url", "similarity"], ).to_table().to_pylist() ``` > [!NOTE] > Setting a large `k` or `nprobes` value, or sending a large batch of queries all at once, can hit Hub rate limits. For heavy usage, download the dataset (or a subset of it) locally and point Lance at the local path to avoid throttling. ## Dataset evolution One of Lance's most powerful features is flexible, zero-cost data evolution, meaning that you can effortlessly add derived columns **without** rewriting the original table. For very large tables with a lot of large blobs, the savings in I/O can be quite significant. This feature is very relevant if you're experimenting with your data for ML/AI engineering tasks and you frequently find yourself adding new features, embeddings, or derived metadata.
The example below shows how to add a derived `moderation_label` column that marks an image as `NSFW` based on an existing score column. When you make this change, backfilling the new column **only** writes the new column data, without touching the original image blobs or data in other columns. You can also choose to just add the new column schema without backfilling any data. ```python import lance import pyarrow as pa # Assumes you ran the export to Lance example above to store a local subset of the data local_ds = lance.dataset("./laion_subset") # schema only (data to be added later) local_ds.add_columns(pa.field("moderation_label", pa.string())) # with data backfill local_ds.add_columns( { "moderation_label": "case WHEN \"NSFW\" > 0.5 THEN 'review' ELSE 'ok' END" } ) ``` See the Lance docs on [data evolution](https://lance.org/guide/data_evolution/) to learn how to alter and drop columns in Lance datasets. ## Work with video blobs Lance tables also support large inline video blobs. The `OpenVid-1M` dataset (from [this paper](https://arxiv.org/abs/2407.02371)) contains high-quality, expressive videos and their captions. The video data is stored in the `video_blob` column of the following Lance dataset on the Hub. ```python import lance lance_ds = lance.dataset("hf://datasets/lance-format/Openvid-1M/data/train.lance") blob_file = lance_ds.take_blobs("video_blob", ids=[0])[0] video_bytes = blob_file.read() ``` Unlike other data formats, large multimodal binary objects (blobs) are first-class citizens in Lance. The [blob API](https://lance.org/guide/blob/) provides a high-level API to store and retrieve large blobs in Lance datasets. The following example shows how to efficiently browse metadata without loading the heavier video blobs, then fetch the relevant video blobs on demand. ```python import lance ds = lance.dataset("hf://datasets/lance-format/Openvid-1M/data/train.lance") # 1. Browse metadata without loading video blobs. 
metadata = ds.scanner( columns=["caption", "aesthetic_score"], filter="aesthetic_score >= 4.5", limit=2, ).to_table().to_pylist() # 2. Fetch a single video blob by row index. selected_index = 0 blob_file = ds.take_blobs("video_blob", ids=[selected_index])[0] with open("video_0.mp4", "wb") as f: f.write(blob_file.read()) ``` ## Prepare data for training Training is another area where Lance's fast random access and scan performance can be useful. You can use Lance datasets as the storage mechanism for your training data, shuffling it and loading into batches as part of your training pipelines. The blob API in Lance is compatible with `torchcodec`, so you can easily decode video blobs as `torch` tensors: ```python from torchcodec.decoders import VideoDecoder decoder = VideoDecoder(blob_file) tensor = decoder[0] # uint8 tensor of shape [C, H, W] ``` See the [torchcodec docs](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.VideoDecoder.html) for more functions for efficiently decoding videos. In addition, you can also check out the [Lance documentation](https://lance.org/examples/python/clip_training/) for more examples on loading image data into `torchvision` for training your own image models. ## Explore more Lance datasets Lance is an open format with native support for multimodal blobs alongside your traditional tabular data. With the Hugging Face Hub integration, you can easily work with images, audio, video, text, embeddings, and scalar metadata all in one place. Explore more Lance datasets on the [Hugging Face Hub](https://huggingface.co/datasets?format=format:lance), and share your own Lance datasets with others in the community! You can visit [lance.org](https://lance.org/integrations/huggingface/) for more code snippets and examples. ### Using TensorBoard https://huggingface.co/docs/hub/tensorboard.md # Using TensorBoard TensorBoard provides tooling for tracking and visualizing metrics as well as visualizing models. 
All repositories that contain TensorBoard traces have an automatic tab with a hosted TensorBoard instance for anyone to check it out without any additional effort! ## Exploring TensorBoard models on the Hub Over 52k repositories have TensorBoard traces on the Hub. You can find them by filtering at the left of the [models page](https://huggingface.co/models?filter=tensorboard). As an example, if you go to the [aubmindlab/bert-base-arabertv02](https://huggingface.co/aubmindlab/bert-base-arabertv02) repository, there is a **Metrics** tab. If you select it, you'll view a TensorBoard instance. ## Adding your TensorBoard traces The Hub automatically detects TensorBoard traces (such as `tfevents`). Once you push your TensorBoard files to the Hub, they will automatically start an instance. ## Additional resources * TensorBoard [documentation](https://www.tensorflow.org/tensorboard). ### User Management https://huggingface.co/docs/hub/security-sso-user-management.md # User Management > [!WARNING] > This feature is part of the Team & Enterprise plans. The following features are available to organizations with SSO enabled. See [Basic SSO](./security-sso-basic) and [Managed SSO](./enterprise-advanced-sso) for details on each mode. ## Session Timeout This value sets the duration of the session for members of your organization. After this time, members will be prompted to re-authenticate with your Identity Provider to access the organization's resources. The default value is 7 days. ## Role Mapping When enabled, Role Mapping allows you to dynamically assign [roles](./organizations-security#access-control-in-organizations) to organization members based on data provided by your Identity Provider. This section allows you to define a mapping from your IdP's user profile data to the assigned role in Hugging Face. - **IdP Role Attribute Path** A JSON path to an attribute in your user's IdP profile data. It supports dot notation (e.g. `user.role` or `groups`). 
For SAML, this can be a URI (e.g. `http://schemas.microsoft.com/ws/2008/06/identity/claims/role`). - **Role Mapping** A mapping from the IdP attribute value to the assigned role in the Hugging Face organization. Available roles are `admin`, `write`, `contributor`, and `read`. See [roles documentation](./organizations-security#access-control-in-organizations) for more details. > [!WARNING] > You must map at least one `admin` role in your configuration. If the attribute in the IdP response contains multiple values (e.g. a list of groups), the **first matching mapping** will be used to determine the user's role. If there is no match, a user will be assigned the default role for your organization. The default role can be customized in the `Members` section of the organization's settings. Role synchronization is performed on every login. ## Resource Group Mapping When enabled, Resource Group Mapping allows you to dynamically assign members to [resource groups](./enterprise-resource-groups) in your organization, based on data provided by your Identity Provider. - **IdP Attribute Path** A JSON path to an attribute in your user's IdP profile data. Similar to Role Mapping, this supports dot notation or URIs for SAML. - **Resource Group Mapping** A mapping from the IdP attribute value to a resource group in your Hugging Face organization. You can assign a specific role (`admin`, `write`, `contributor`, `read`) for each resource group mapping. Unlike Role Mapping, **Resource Group Mapping is additive**. If a user matches multiple mappings (e.g. they belong to multiple groups in your IdP that are mapped to different Resource Groups), they will be added to **all** matched Resource Groups. If there is no match, the user will not be assigned to any resource group. ## Matching email domains > [!NOTE] > This feature is only relevant for [Basic SSO](./security-sso-basic). 
With [Managed SSO](./enterprise-advanced-sso), user accounts are fully managed by the organization, so email domain matching does not apply. When enabled, 'Matching email domains' only allows organization members to complete SSO if the email provided by your identity provider matches one of their emails on Hugging Face. To add an email domain, fill out the 'Matching email domains' field, press Enter, and save. ## External Collaborators This enables certain users within your organization to access resources without completing the Single Sign-On (SSO) flow. This can be helpful when you work with external parties who aren't part of your organization's Identity Provider (IdP) but require access to specific resources. To add a user as an "External Collaborator", visit the `SSO/Users Management` section in your organization's settings. Once added, these users won't need to go through the SSO process. However, they will still be subject to your organization's access controls ([Resource Groups](./enterprise-resource-groups)). It's crucial to manage their access carefully to maintain your organization's data security. ### Collections https://huggingface.co/docs/hub/collections.md # Collections Use Collections to group repositories from the Hub (Models, Datasets, Spaces and Papers) on a dedicated page. ![Collection page](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-intro.webp) Collections have many use cases: - Highlight specific repositories on your personal or organizational profile. - Separate key repositories from others for your profile visitors. - Showcase and share a complete project with its paper(s), dataset(s), model(s) and Space(s). - Bookmark things you find on the Hub in categories. - Have a dedicated page of curated things to share with others.
- Gate a group of models/datasets (Team & Enterprise) This is just a list of possible uses, but remember that collections are just a way of grouping things, so use them in the way that best fits your use case. ## Creating a new collection There are several ways to create a collection: - For personal collections: Use the **+ New** button on your logged-in homepage (1). - For organization collections: Use the **+ New** button available on your organization's page (2). ![New collection](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-new.webp) It's also possible to create a collection on the fly when adding the first item from a repository page: select **+ Create new collection** from the dropdown menu. You'll need to enter a title and short description for your collection to be created. ## Adding items to a collection There are two ways to add items to a collection: - From any repository page: Use the context menu available on any repository page, then select **Add to collection** to add it to a collection (1). - From the collection page: If you know the name of the repository you want to add, use the **+ add to collection** option in the right-hand menu (2). ![Add items to collections](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-add.webp) It's possible to add external repositories to your collections, not just your own. ## Collaborating on collections Organization collections are a great way to build collections together. Members with Read-only access can view collections, but only members with Write (or higher) organization permissions can create collections or add, edit, and remove items. Use the **history feature** to keep track of who has edited the collection.
![Collection history](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-history.webp) ## Collection options ### Collection visibility ![Collections on profiles](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-profile.webp) **Public** collections appear at the top of your profile or organization page and can be viewed by anyone. The first 3 items in each collection are visible directly in the collection preview (1). To see more, the user must click to go to the collection page. Set your collection to **private** if you don't want it to be accessible via its URL (it will not be displayed on your profile/organization page). For organizations, private collections are only available to members of the organization. ### Gating Group Collections (Team & Enterprise) You can use a collection to [gate](https://huggingface.co/docs/hub/en/models-gated) all the models/datasets belonging to it, allowing you to grant (or reject) access to all of them at once. This feature is reserved for [Team & Enterprise](https://huggingface.co/docs/hub/en/enterprise) subscribers: more information about Gating Group Collections can be found in [our dedicated doc](https://huggingface.co/docs/hub/en/enterprise-gating-group-collections). ### Ordering your collections and their items You can use the drag and drop handles in the collections list (on the left side of your collections page) to change the order of your collections (1). The first two collections will be directly visible on your profile/organization pages. You can also sort repositories within a collection by dragging the handles next to each item (2). 
![Collections sort](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-sort.webp) ### Deleting items from a collection To delete an item from a collection, click the trash icon in the menu that shows up on the right when you hover over an item (1). To delete the whole collection, click delete on the right-hand menu (2) - you'll need to confirm this action. ![Collection delete](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-delete.webp) ### Adding notes to collection's items It's possible to add a note to any item in a collection to give it more context (for others, or as a reminder to yourself). You can add notes by clicking the pencil icon when you hover over an item with your mouse. Notes are plain text and don't support markdown, to keep things clean and simple. URLs in notes are converted into clickable links. ![Collection note](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collection-note.webp) ### Adding images to a collection item Similarly, you can attach images to a collection item. This is useful for showcasing the output of a model, the content of a dataset, attaching an infographic for context, etc. To start adding images to your collection, you can click on the image icon in the contextual menu of an item. The menu shows up when you hover over an item with your mouse. ![Collection image icon](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collections-image-button.webp) Then, add images by dragging and dropping images from your computer. You can also click on the gray zone to select image files from your computer's file system. ![Collection image drop zone with images](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collections-image-gallery.webp) You can re-order images by drag-and-dropping them. 
Clicking on an image will open it in full-screen mode. ![Collection image viewer](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/collections/collections-image-viewer.webp) ## Your feedback on collections We're working on improving collections, so if you have any bugs, questions, or new features you'd like to see added, please post a message in the [dedicated discussion](https://huggingface.co/spaces/huggingface/HuggingDiscussions/discussions/12). ### Spaces Overview https://huggingface.co/docs/hub/spaces-overview.md # Spaces Overview Hugging Face Spaces make it easy for you to create and deploy ML-powered demos in minutes. Watch the following video for a quick introduction to Spaces: In the following sections, you'll learn the basics of creating a Space, configuring it, and deploying your code to it. ## Creating a new Space **To make a new Space**, visit the [Spaces main page](https://huggingface.co/spaces) and click on **Create new Space**. Along with choosing a name for your Space, selecting an optional license, and setting your Space's [visibility](#space-visibility) (public, protected, or private), you'll be prompted to choose the **SDK** for your Space. The Hub offers three SDK options: Gradio, Docker and static HTML. If you select "Gradio" as your SDK, you'll be navigated to a new repo showing the following page: Under the hood, Spaces stores your code inside a git repository, just like the model and dataset repositories. Thanks to this, the same tools we use for all the [other repositories on the Hub](./repositories) (`git` and `git-xet`) also work for Spaces. Follow the same flow as in [Getting Started with Repositories](./repositories-getting-started) to add files to your Space. Each time a new commit is pushed, the Space will automatically rebuild and restart. 
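Because a Space is a plain git repository, the same create-and-push flow can also be driven from Python with `huggingface_hub`. A sketch, assuming the library is installed and you are logged in; `your-username/my-demo` and the inline `app.py` contents are placeholders:

```python
def space_page_url(repo_id: str) -> str:
    """Hub page URL for a Space, e.g. "user/name" -> "https://huggingface.co/spaces/user/name"."""
    return f"https://huggingface.co/spaces/{repo_id}"

if __name__ == "__main__":
    # Network calls below require authentication (`hf auth login`)
    from huggingface_hub import HfApi

    api = HfApi()
    repo_id = "your-username/my-demo"  # placeholder
    # Spaces are git repos under the hood; repo_type="space" selects them
    api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio", exist_ok=True)
    # Every upload is a commit, and each commit triggers a rebuild and restart
    api.upload_file(
        path_or_fileobj=b'import gradio as gr\ngr.Interface(lambda x: x, "text", "text").launch()\n',
        path_in_repo="app.py",
        repo_id=repo_id,
        repo_type="space",
    )
    print("Pushed to", space_page_url(repo_id))
```

Plain `git clone` and `git push` against the same repository work equally well, as described above.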
For step-by-step tutorials on creating your first Space, see the guides below:

* [Creating a Gradio Space](./spaces-sdks-gradio)
* [Creating a Docker Space](./spaces-sdks-docker-first-demo)

## Space visibility

You can set a Space's visibility from the **Settings** tab using the visibility dropdown. Spaces support three visibility levels: **public**, **protected**, and **private**.

> [!WARNING]
> Protected visibility is part of PRO or Team & Enterprise plans.

| | Public | Protected | Private |
|---|---|---|---|
| Source code on the Hub | Visible to everyone | Private (only owner/collaborators) | Private (only owner/collaborators) |
| App accessible via embed URL | Yes | Yes | No |
| App accessible via [custom domain](./spaces-custom-domain) | Yes | Yes | No |
| Clonable by others | Yes | No | No |

**Public** Spaces are fully open: anyone can view the source code, access the running app, and clone the repository.

**Protected** Spaces keep their source code private on the Hub: only the owner and collaborators can view or clone the repository. However, the running app is publicly accessible through its embed URL (`https://<space-subdomain>.hf.space`) or through a [custom domain](./spaces-custom-domain) when one is configured. This is especially useful for hosting websites or apps without publishing the source code.

**Private** Spaces are fully private: the source code and the running app are only accessible to the owner and collaborators. The Space will not appear in search results and other users will receive a `404` error when visiting its URL.

## Hardware resources

Each Spaces environment is limited to 16GB RAM, 2 CPU cores and 50GB of (non-persistent) disk space by default, which you can use free of charge. You can upgrade to better hardware, including a variety of GPU accelerators, for a [competitive price](https://huggingface.co/pricing#spaces). To request an upgrade, click the _Settings_ button in your Space and select your preferred hardware environment.
| **Hardware** | **CPU** | **Memory** | **GPU Memory** | **Hourly Price** | |----------------------- |-------------- |------------- |---------------- | ----------------- | | CPU Basic | 2 vCPU | 16 GB | | FREE | | CPU Upgrade | 8 vCPU | 32 GB | | $0.03 | | Nvidia T4 - small | 4 vCPU | 15 GB | 16 GB | $0.40 | | Nvidia T4 - medium | 8 vCPU | 30 GB | 16 GB | $0.60 | | 1x Nvidia L4 | 8 vCPU | 30 GB | 24 GB | $0.80 | | 4x Nvidia L4 | 48 vCPU | 186 GB | 96 GB | $3.80 | | 1x Nvidia L40S | 8 vCPU | 62 GB | 48 GB | $1.80 | | 4x Nvidia L40S | 48 vCPU | 382 GB | 192 GB | $8.30 | | 8x Nvidia L40S | 192 vCPU | 1534 GB | 384 GB | $23.50 | | Nvidia A10G - small | 4 vCPU | 15 GB | 24 GB | $1.00 | | Nvidia A10G - large | 12 vCPU | 46 GB | 24 GB | $1.50 | | 2x Nvidia A10G - large | 24 vCPU | 92 GB | 48 GB | $3.00 | | 4x Nvidia A10G - large | 48 vCPU | 184 GB | 96 GB | $5.00 | | Nvidia A100 - large | 12 vCPU | 142 GB | 80 GB | $2.50 | | 4x Nvidia A100 | 48 vCPU | 568 GB | 320 GB | $10.00 | | 8x Nvidia A100 | 96 vCPU | 1136 GB | 640 GB | $20.00 | Note: Find more detailed and comprehensive pricing information on [our pricing page](https://huggingface.co/pricing). Do you have an awesome Space but need help covering the hardware upgrade costs? We love helping out those with an innovative Space so please feel free to apply for a community GPU grant using the link in the _Settings_ tab of your Space and see if yours makes the cut! Read more in our dedicated sections on [Spaces GPU Upgrades](./spaces-gpus) and [Spaces Disk Usage & Storage](./spaces-storage). ## Managing secrets and environment variables[[managing-secrets]] If your app requires environment variables (for instance, secret keys or tokens), do not hard-code them inside your app! Instead, go to the Settings page of your Space repository and add a new **variable** or **secret**. 
You can use:

* **Variables** if you need to store non-sensitive configuration values. They are publicly accessible and viewable and will be automatically added to Spaces duplicated from yours.
* **Secrets** to store access tokens, API keys, or any sensitive values or credentials. They are private and their value cannot be read from the Space's settings page once set. They won't be added to Spaces duplicated from your repository.

Accessing secrets and variables differs depending on your Space SDK:

- For Static Spaces, both are available through client-side JavaScript in `window.huggingface.variables`
- For Docker Spaces, check out [environment management with Docker](./spaces-sdks-docker#secrets-and-variables-management)

For other Spaces, both are exposed to your app as environment variables. Here is a very simple example of accessing the previously declared `MODEL_REPO_ID` variable in Python (it would be the same for secrets):

```py
import os

print(os.getenv('MODEL_REPO_ID'))
```

Space owners are warned when our `Spaces Secrets Scanner` [finds hard-coded secrets](./security-secrets).

## Duplicating a Space

Duplicating a Space can be useful if you want to build a new demo using another demo as an initial template. Duplicated Spaces can also be useful if you want to have an individual Upgraded Space for your use with fast inference.

If you want to duplicate a Space, you can click the three dots at the top right of the Space and click **Duplicate this Space**. Once you do this, you will be able to change the following attributes:

* Owner: The duplicated Space can be under your account or any organization in which you have write access
* Space name
* Visibility: The Space is private by default. Read more about visibility options [here](./repositories-settings#repository-visibility).
* Hardware: You can choose the hardware on which the Space will be running. Read more about hardware upgrades [here](./spaces-gpus).
* Storage: If the original repo uses a storage bucket, you will be prompted to configure storage. Read more about disk usage and storage [here](./spaces-storage).
* Secrets and variables: If the original repo has set some secrets and variables, you'll be able to set them while duplicating the repo.

Some Spaces might have environment variables that you may need to set up. In these cases, the duplicate workflow will auto-populate the public Variables from the source Space, and give you a warning about setting up the Secrets. The duplicated Space will use free CPU hardware by default, but you can later upgrade if needed.

## Networking

If your Space needs to make network requests, it can do so through the standard HTTP and HTTPS ports (80 and 443) along with port 8080. Any requests going to other ports will be blocked.

## Lifecycle management

On free hardware, your Space will "go to sleep" and stop executing after a period of time if unused. If you wish for your Space to run indefinitely, consider [upgrading to paid hardware](./spaces-gpus). You can also manually pause your Space from the **Settings** tab. A paused Space stops executing until manually restarted by its owner. Paused time is not billed.

## Built-in environment variables

In some cases, you might be interested in having programmatic access to the Space author or repository name. This feature is particularly useful when you expect users to duplicate your Space. To help with this, Spaces exposes different environment variables at runtime (see also [built-in environment variables in Jobs](./jobs-configuration#built-in-environment-variables)).

Given a Space [`osanseviero/i-like-flan`](https://huggingface.co/spaces/osanseviero/i-like-flan):

* `ACCELERATOR`: The type of accelerator available (e.g., `t4-medium`, `a10g-small`), or `none` for CPU-only Spaces.
* `CPU_CORES`: 4 * `MEMORY`: 15Gi * `SPACE_AUTHOR_NAME`: osanseviero * `SPACE_REPO_NAME`: i-like-flan * `SPACE_TITLE`: I Like Flan (specified in the README file) * `SPACE_ID`: `osanseviero/i-like-flan` * `SPACE_HOST`: `osanseviero-i-like-flan.hf.space` * `SPACE_CREATOR_USER_ID`: `6032802e1f993496bc14d9e3` - This is the ID of the user that originally created the Space. It's useful if the Space is under an organization. You can get the user information with an API call to `https://huggingface.co/api/users/{SPACE_CREATOR_USER_ID}/overview`. In case [OAuth](./spaces-oauth) is enabled for your Space, the following variables will also be available: * `OAUTH_CLIENT_ID`: the client ID of your OAuth app (public) * `OAUTH_CLIENT_SECRET`: the client secret of your OAuth app * `OAUTH_SCOPES`: scopes accessible by your OAuth app. Currently, this is always `"openid profile"`. * `OPENID_PROVIDER_URL`: The URL of the OpenID provider. The OpenID metadata will be available at [`{OPENID_PROVIDER_URL}/.well-known/openid-configuration`](https://huggingface.co/.well-known/openid-configuration). ## Clone the Repository You can easily clone your Space repo locally. Start by clicking on the dropdown menu in the top right of your Space page: Select "Clone repository", and then you'll be able to follow the instructions to clone the Space repo to your local machine using HTTPS or SSH. ## Linking Models and Datasets on the Hub You can showcase all the models and datasets that your Space links to by adding their identifier in your Space's README metadata. To do so, you can define them under the `models` and `datasets` keys. In addition to listing the artefacts in the README file, you can also record them in any `.py`, `.ini` or `.html` file as well. We'll parse it auto-magically! 
Here's an example linking two models from a Space:

```
title: My lovely space
emoji: 🤗
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
models:
- reach-vb/musicgen-large-fp16-endpoint
- reach-vb/wav2vec2-large-xls-r-1B-common_voice7-lt-ft
```

### Configuration

https://huggingface.co/docs/hub/jobs-configuration.md

# Configuration

## Authentication

You need to be authenticated with `hf auth login` to run Jobs, and use a token with the permission to start and manage Jobs. Alternatively, pass a Hugging Face token manually with `--token` in the CLI, or the `token` argument in Python.

## UV Jobs

Specify the UV script or Python command to run as you would with UV:

```bash
>>> hf jobs uv run train.py
```

```bash
>>> hf jobs uv run python -c 'print("Hello from the cloud!")'
```

The `hf jobs uv run` command accepts UV arguments like `--with` and `--python`. The `--with` argument lets you specify Python dependencies, and `--python` lets you choose the Python version to use:

```bash
>>> hf jobs uv run --with trl train.py
>>> hf jobs uv run --python 3.12 train.py
```

Arguments following the command (or script) are not interpreted as arguments to UV. All options to UV must be provided before the command, e.g., `uv run --verbose foo`. A `--` can be used to separate the command from jobs/uv options for clarity, e.g.

```bash
>>> hf jobs uv run --with trl-jobs -- trl-jobs sft --model_name Qwen/Qwen3-0.6B --dataset_name trl-lib/Capybara
```

Find the list of all arguments in the [CLI documentation](https://huggingface.co/docs/huggingface_hub/package_reference/cli#hf-jobs-uv-run) and the [UV Commands documentation](https://docs.astral.sh/uv/reference/cli/#uv-run).

By default, UV Jobs run with the `ghcr.io/astral-sh/uv:python3.12-bookworm` Docker image, but you can use another image as long as it has UV installed, using the `--image` option.
## Docker Jobs Specify the Docker image and the command to run as you would with docker: ```bash >>> hf jobs run ubuntu echo "Hello from the cloud!" ``` All options to Jobs must be provided before the command. A `--` can be used to separate the command from jobs/uv options for clarity, e.g. ```bash >>> hf jobs run --token hf_xxx ubuntu -- echo "Hello from the cloud!" ``` Find the list of all arguments in the [CLI documentation](https://huggingface.co/docs/huggingface_hub/package_reference/cli#hf-jobs-run). ## Environment variables and Secrets ### Built-in environment variables Similarly to the [built-in environment variables in Spaces](./spaces-overview#built-in-environment-variables), Jobs automatically provide the following environment variables inside the container: | Variable | Description | |----------|-------------| | `JOB_ID` | The unique identifier of the current job (e.g., `699d874f1aad19adb8aaeadc`). This is the same ID shown in the UI and the job URL. | | `ACCELERATOR` | The type of accelerator available (e.g., `t4-medium`, `a10g-small`, `a100x4`), or `none` for CPU-only jobs. | | `CPU_CORES` | The number of CPU cores allocated to the job. | | `MEMORY` | The amount of memory allocated to the job (e.g., `8Gi`). 
You can use these variables to track outputs, adapt your code to available resources, or reference the current job programmatically:

```bash
# Access job environment information
>>> hf jobs run python:3.12 python -c "import os; print(f'Job: {os.environ.get(\"JOB_ID\")}, CPU: {os.environ.get(\"CPU_CORES\")}, Mem: {os.environ.get(\"MEMORY\")}')"
```

### User-defined environment variables

You can pass environment variables to your job using:

```bash
# Pass environment variables
>>> hf jobs uv run -e FOO=foo -e BAR=bar python -c 'import os; print(os.environ["FOO"], os.environ["BAR"])'
```

```bash
# Pass environment variables from a local .env file
>>> hf jobs uv run --env-file .env python -c 'import os; print(os.environ["FOO"], os.environ["BAR"])'
```

```bash
# Pass secrets - they will be encrypted server side
>>> hf jobs uv run -s MY_SECRET=psswrd python -c 'import os; print(os.environ["MY_SECRET"])'
```

```bash
# Pass secrets from a local .env.secrets file - they will be encrypted server side
>>> hf jobs uv run --secrets-file .env.secrets python -c 'import os; print(os.environ["MY_SECRET"])'
```

> [!TIP]
> Use `--secrets HF_TOKEN` to pass your local Hugging Face token implicitly.
> With this syntax, the secret is retrieved from the environment variable.
> For `HF_TOKEN`, it may read the token file located in the Hugging Face home folder if the environment variable is unset.

## Volumes

Mount Hugging Face repositories (models, datasets) or [Storage Buckets](./storage-buckets) as volumes in your job container using `-v` or `--volume`. The syntax uses the `hf://` URL scheme: `hf://[TYPE/]SOURCE:/MOUNT_PATH[:ro]`.
Volume types: | Type | Example | |------|---------| | Model repo | `-v hf://openai/gpt-oss-120b:/model` | | Dataset repo | `-v hf://datasets/stanfordnlp/imdb:/data` | | Storage bucket | `-v hf://buckets/username/my-bucket:/mnt` | | Subfolder | `-v hf://datasets/org/my-dataset/train:/data` | Then use the mounted volume as a local directory inside the container: ```bash # Mount a dataset and query it with DuckDB >>> hf jobs run -v hf://datasets/stanfordnlp/imdb:/dataset \ ... duckdb/duckdb duckdb -c "SELECT * FROM '/dataset/**/*.parquet' LIMIT 5" # Mount a bucket to save training checkpoints >>> hf jobs uv run -v hf://buckets/username/my-bucket:/training-outputs \ ... sft.py --output-dir /training-outputs/training-v3-final ``` Multiple volumes can be mounted by repeating the `-v` flag: ```bash >>> hf jobs run -v hf://datasets/username/my-dataset:/data -v hf://buckets/username/my-bucket:/output \ ... python:3.12 python script.py ``` Models and datasets are always mounted **read-only**. Storage buckets are **read-write** by default, which is useful for saving outputs, checkpoints, or intermediate results. Use `:ro` to mount a bucket in read-only mode: ```bash >>> hf jobs run -v hf://buckets/username/my-bucket:/mnt:ro python:3.12 ls /mnt ``` In Python, use the [`Volume`](https://huggingface.co/docs/huggingface_hub/package_reference/jobs#huggingface_hub.Volume) class: ```python from huggingface_hub import Volume, run_job job = run_job( image="python:3.12", command=["python", "-c", "import os; print(os.listdir('/data'))"], volumes=[ Volume(type="dataset", source="username/my-dataset", mount_path="/data"), Volume(type="bucket", source="username/my-bucket", mount_path="/output"), ], ) ``` > [!NOTE] > Volume mounting requires `huggingface_hub` >= 1.8.0. See the [Python client documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs#mount-a-volume) and [installation guide](https://huggingface.co/docs/huggingface_hub/installation) for more details. 
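To make the `hf://[TYPE/]SOURCE:/MOUNT_PATH[:ro]` syntax above concrete, here is a small parser sketch. `parse_volume` is a hypothetical helper for illustration only, not part of the CLI or `huggingface_hub`:

```python
from dataclasses import dataclass

@dataclass
class Mount:
    type: str        # "model" (default), "dataset", or "bucket"
    source: str      # repo id or bucket path, possibly with a subfolder
    mount_path: str  # absolute path inside the container
    read_only: bool  # note: models and datasets mount read-only regardless

def parse_volume(spec: str) -> Mount:
    """Parse `hf://[TYPE/]SOURCE:/MOUNT_PATH[:ro]` as used by `-v`/`--volume`."""
    if not spec.startswith("hf://"):
        raise ValueError(f"not an hf:// volume: {spec}")
    rest = spec[len("hf://"):]
    read_only = rest.endswith(":ro")
    if read_only:
        rest = rest[: -len(":ro")]
    # The mount path starts at the first ":/"
    source, sep, mount_path = rest.partition(":/")
    if not sep:
        raise ValueError(f"missing mount path in {spec}")
    mount_path = "/" + mount_path
    repo_type = "model"  # bare sources are model repos
    for prefix, name in (("datasets/", "dataset"), ("buckets/", "bucket")):
        if source.startswith(prefix):
            repo_type, source = name, source[len(prefix):]
            break
    return Mount(repo_type, source, mount_path, read_only)
```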
## Hardware flavor Run jobs on GPUs or TPUs with the `flavor` argument. For example, to run a PyTorch job on an A10G GPU: ```bash >>> hf jobs uv run --with torch --flavor a10g-small python -c "import torch; print(f'This code ran with the following GPU: {torch.cuda.get_device_name()}')" ``` Running this will show the following output! ``` This code ran with the following GPU: NVIDIA A10G ``` Here is another example to run a fine-tuning script like [trl/scripts/sft.py](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py): ```bash >>> hf jobs uv run --with trl --flavor a10g-small -s HF_TOKEN -- sft.py --model_name_or_path Qwen/Qwen2-0.5B ... ``` > [!TIP] > For comprehensive guidance on running model training jobs with TRL on Hugging Face infrastructure, check out the [TRL Jobs Training documentation](https://huggingface.co/docs/trl/main/en/jobs_training). It covers fine-tuning recipes, hardware selection, and best practices for training models efficiently. See the list of available `--flavor` options using the `hf jobs hardware` command (default is `cpu-basic`): ```bash >>> hf jobs hardware NAME PRETTY NAME CPU RAM ACCELERATOR COST/MIN COST/HOUR ------------ ---------------------- -------- ------- ---------------- -------- --------- cpu-basic CPU Basic 2 vCPU 16 GB N/A $0.0002 $0.01 cpu-upgrade CPU Upgrade 8 vCPU 32 GB N/A $0.0005 $0.03 t4-small Nvidia T4 - small 4 vCPU 15 GB 1x T4 (16 GB) $0.0067 $0.40 t4-medium Nvidia T4 - medium 8 vCPU 30 GB 1x T4 (16 GB) $0.0100 $0.60 a10g-small Nvidia A10G - small 4 vCPU 15 GB 1x A10G (24 GB) $0.0167 $1.00 a10g-large Nvidia A10G - large 12 vCPU 46 GB 1x A10G (24 GB) $0.0250 $1.50 a10g-largex2 2x Nvidia A10G - large 24 vCPU 92 GB 2x A10G (48 GB) $0.0500 $3.00 a10g-largex4 4x Nvidia A10G - large 48 vCPU 184 GB 4x A10G (96 GB) $0.0833 $5.00 a100-large Nvidia A100 - large 12 vCPU 142 GB 1x A100 (80 GB) $0.0417 $2.50 a100x4 4x Nvidia A100 48 vCPU 568 GB 4x A100 (320 GB) $0.1667 $10.00 a100x8 8x Nvidia A100 96 vCPU 1136 GB 
8x A100 (640 GB) $0.3333   $20.00
l4x1         1x Nvidia L4           8 vCPU   30 GB    1x L4 (24 GB)     $0.0133  $0.80
l4x4         4x Nvidia L4           48 vCPU  186 GB   4x L4 (96 GB)     $0.0633  $3.80
l40sx1       1x Nvidia L40S         8 vCPU   62 GB    1x L40S (48 GB)   $0.0300  $1.80
l40sx4       4x Nvidia L40S         48 vCPU  382 GB   4x L40S (192 GB)  $0.1383  $8.30
l40sx8       8x Nvidia L40S         192 vCPU 1534 GB  8x L40S (384 GB)  $0.3917  $23.50
```

## Timeout

Jobs have a default timeout (30 minutes), after which they will automatically stop. This is important to know when running long-running tasks like model training. You can specify a custom timeout value using the `--timeout` parameter when running a job. The timeout can be specified in two ways:

1. **As a number** (interpreted as seconds). Use `--timeout` and pass the number in seconds (here 2 hours = 7200 seconds):

```bash
>>> hf jobs uv run --timeout 7200 --with torch --flavor a10g-large train.py
```

2. **As a string with time units**. Use `--timeout` with different time units:

```bash
>>> hf jobs uv run --timeout 2h --with torch --flavor a10g-large train.py
```

Other examples:

```bash
--timeout 30m # 30 minutes
--timeout 1.5h # 1.5 hours
--timeout 1d # 1 day
--timeout 3600s # 3600 seconds
```

Supported time units:

- `s` - seconds
- `m` - minutes
- `h` - hours
- `d` - days

> [!WARNING]
> If you don't specify a timeout, a default timeout will be applied to your job. For long-running tasks like model training that may take hours, make sure to set an appropriate timeout to avoid unexpected job terminations.

## Namespace

Run Jobs under your organization account using the `--namespace` argument. Make sure you are logged in with a token that has the permission to start and manage Jobs under your organization account.
```bash
>>> hf jobs uv run --namespace my-org-name python -c "print('Running in an org account')"
```

Note that you can pass a token with the right permission manually:

```bash
>>> hf jobs uv run --namespace my-org-name --token hf_xxx python -c "print('Running in an org account')"
```

## Labels

Add one or more labels to a Job to attach metadata to it with `-l` or `--label`. You can use such metadata later to filter Jobs on the website or in the CLI. Add plain labels with `--label my-label` or key-value labels with `--label key=value`. For example:

```bash
hf jobs uv run --label fine-tuning --label model=Qwen3-0.6B --label dataset=Capybara ...
```

Note that using the same `key` multiple times causes the last `key=value` to overwrite and discard any previous label with that `key`.

### Repositories

https://huggingface.co/docs/hub/repositories.md

# Repositories

Models, Spaces, and Datasets are hosted on the Hugging Face Hub as [Git repositories](https://git-scm.com/about), which means that version control and collaboration are core elements of the Hub. In a nutshell, a repository (also known as a **repo**) is a place where code and assets can be stored to back up your work, share it with the community, and work in a team.

> [!TIP]
> Looking for non-versioned, mutable storage? Check out [Storage Buckets](./storage-buckets), which provide S3-like object storage without Git history.

Unlike other collaboration platforms, our Git repositories are optimized for Machine Learning and AI files: large binary files, usually in specific file formats like Parquet and Safetensors, and up to [Terabyte-scale sizes](https://huggingface.co/blog/from-files-to-chunks)! To achieve this, we built [Xet](./xet/index), a modern custom storage system built specifically for AI/ML development, enabling chunk-level deduplication, smaller uploads, and faster downloads.

In these pages, you will go over the basics of getting started with Git and Xet and interacting with repositories on the Hub.
Once you get the hang of it, you can explore the best practices and next steps that we've compiled for effective repository usage.

## Contents

- [Getting Started with Repositories](./repositories-getting-started)
- [Settings](./repositories-settings)
- [Storage Limits](./storage-limits)
- [Storage Backend (Xet)](./xet/index)
- [Local Cache](./local-cache)
- [Pull Requests & Discussions](./repositories-pull-requests-discussions)
- [Pull Requests advanced usage](./repositories-pull-requests-discussions#pull-requests-advanced-usage)
- [Collections](./collections)
- [Notifications](./notifications)
- [Webhooks](./webhooks)
- [Next Steps](./repositories-next-steps)
- [Licenses](./repositories-licenses)

### Secrets Scanning

https://huggingface.co/docs/hub/security-secrets.md

# Secrets Scanning

It is important to manage [your secrets (env variables) properly](./spaces-overview#managing-secrets). The most common way people expose their secrets to the outside world is by hard-coding them directly in their code files, which makes it possible for a malicious user to utilize your secrets and the services they grant access to.

For example, this is what a compromised `app.py` file might look like:

```py
import numpy as np
import scipy as sp

api_key = "sw-xyz1234567891213"

def call_inference(prompt: str) -> str:
    result = call_api(prompt, api_key)
    return result
```

To prevent this issue, we run [TruffleHog](https://trufflesecurity.com/trufflehog) on each push you make. TruffleHog scans for hard-coded secrets, and we will send you an email upon detection.

You'll only receive emails for verified secrets, which are the ones that have been confirmed to work for authentication against their respective providers. Note, however, that unverified secrets are not necessarily harmless or invalid: verification can fail due to technical reasons, such as in the case of a network error.
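The remedy is the pattern described in the Spaces secrets section: read the value from the environment at runtime. A minimal sketch, where `API_KEY` is a placeholder secret name you would declare in your Space's Settings page:

```python
import os

def get_api_key() -> str:
    """Read the key from an environment variable (a Space secret), never from source code."""
    key = os.getenv("API_KEY")  # "API_KEY" is a placeholder secret name
    if key is None:
        raise RuntimeError("API_KEY secret is not configured")
    return key
```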
TruffleHog can verify secrets across multiple services; it is not restricted to Hugging Face tokens. You can opt out of those email notifications in [your settings](https://huggingface.co/settings/notifications).

### Advanced Topics

https://huggingface.co/docs/hub/models-advanced.md

# Advanced Topics

## Contents

- [Integrate your library with the Hub](./models-adding-libraries)
- [Adding new tasks to the Hub](./models-tasks)
- [GGUF format](./gguf)
- [DDUF format](./dduf)

### Embed your Space in another website

https://huggingface.co/docs/hub/spaces-embed.md

# Embed your Space in another website

Once your Space is up and running you might wish to embed it in a website or in your blog. Embedding or sharing your Space is a great way to allow your audience to interact with your work and demonstrations without requiring any setup on their side.

To embed a Space, its visibility needs to be **public** or **protected**. Protected Spaces keep their source code private on the Hub while remaining publicly accessible through their embed URLs (and [custom domains](./spaces-custom-domain), when configured). See [Space visibility](./spaces-overview#space-visibility) for more details.

## Direct URL

A Space is assigned a unique URL you can use to share your Space or embed it in a website. This URL is of the form: `https://<space-subdomain>.hf.space`. For instance, the Space [NimaBoscarino/hotdog-gradio](https://huggingface.co/spaces/NimaBoscarino/hotdog-gradio) has the corresponding URL of `https://nimaboscarino-hotdog-gradio.hf.space`. The subdomain is unique and only changes if you move or rename your Space. Your Space is always served from the root of this subdomain.

You can find the Space URL along with example snippets of how to embed it directly from the options menu.

## Embedding with IFrames

The default embedding method for a Space is using IFrames.
Add the following element in the HTML location where you want to embed your Space:

```html
<iframe
	src="https://<space-subdomain>.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>
```

For instance, this is how you would embed the [NimaBoscarino/hotdog-gradio](https://huggingface.co/spaces/NimaBoscarino/hotdog-gradio) Space.

## Embedding with WebComponents

If the Space you wish to embed is Gradio-based, you can use Web Components to embed your Space. WebComponents are faster than IFrames and automatically adjust to your web page so that you do not need to configure `width` or `height` for your element.

First, you need to import the Gradio JS library that corresponds to the Gradio version in the Space by adding a script tag to your HTML. Then, add a `gradio-app` element where you want to embed your Space:

```html
<gradio-app src="https://<space-subdomain>.hf.space"></gradio-app>
```

Check out the [Gradio documentation](https://www.gradio.app/guides/sharing-your-app#embedding-hosted-spaces) for more details.

### Spark

https://huggingface.co/docs/hub/datasets-spark.md

# Spark

Spark enables real-time, large-scale data processing in a distributed environment. You can use `pyspark_huggingface` to access Hugging Face dataset repositories in PySpark via the "huggingface" Data Source.

Try out [Spark Notebooks](https://huggingface.co/spaces/Dataset-Tools/Spark-Notebooks) on Hugging Face Spaces to get Notebooks with PySpark and `pyspark_huggingface` pre-installed.

## Set up

### Installation

To be able to read and write to Hugging Face Datasets, you need to install the `pyspark_huggingface` library:

```
pip install pyspark_huggingface
```

This will also install required dependencies like `huggingface_hub` for authentication, and `pyarrow` for reading and writing datasets.

### Authentication

You need to authenticate to Hugging Face to read private/gated dataset repositories or to write to your dataset repositories.
You can use the CLI, for example:

```
hf auth login
```

It's also possible to provide your Hugging Face token with the `HF_TOKEN` environment variable or by passing the `token` option to the reader. For more details about authentication, check out [this guide](https://huggingface.co/docs/huggingface_hub/quick-start#authentication).

### Enable the "huggingface" Data Source

PySpark 4 came with a new Data Source API which allows using datasets from custom sources. If `pyspark_huggingface` is installed, PySpark auto-imports it and enables the "huggingface" Data Source. The library also backports the Data Source API for the "huggingface" Data Source to PySpark 3.5, 3.4 and 3.3. However, in this case `pyspark_huggingface` should be imported explicitly to activate the backport and enable the "huggingface" Data Source:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

## Read

The "huggingface" Data Source lets you read datasets from Hugging Face, using `pyarrow` under the hood to stream Arrow data. This is compatible with all the datasets in a [supported format](https://huggingface.co/docs/hub/datasets-adding#file-formats) on Hugging Face, like Parquet datasets.

For example, here is how to load the [stanfordnlp/imdb](https://huggingface.co/stanfordnlp/imdb) dataset:

```python
>>> import pyspark_huggingface
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("demo").getOrCreate()
>>> df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```

Here is another example with the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset. It is a gated repository: users have to accept the terms of use before accessing it. It also has multiple subsets, namely "3M" and "7M", so we need to specify which one to load.
We use the `.format()` function to use the "huggingface" Data Source, and `.load()` to load the dataset (more precisely the config or subset named "7M" containing 7M samples). Then we compute the number of dialogue per language and filter the dataset. After logging-in to access the gated repository, we can run: ```python >>> import pyspark_huggingface >>> from pyspark.sql import SparkSession >>> spark = SparkSession.builder.appName("demo").getOrCreate() >>> df = spark.read.format("huggingface").option("config", "7M").load("BAAI/Infinity-Instruct") >>> df.show() +---+----------------------------+-----+----------+--------------------+ | id| conversations|label|langdetect| source| +---+----------------------------+-----+----------+--------------------+ | 0| [{human, def exti...| | en| code_exercises| | 1| [{human, See the ...| | en| flan| | 2| [{human, This is ...| | en| flan| | 3| [{human, If you d...| | en| flan| | 4| [{human, In a Uni...| | en| flan| | 5| [{human, Read the...| | en| flan| | 6| [{human, You are ...| | en| code_bagel| | 7| [{human, I want y...| | en| Subjective| | 8| [{human, Given th...| | en| flan| | 9|[{human, ๅ› ๆžœ่”็ณปๅŽŸๅˆ™ๆ˜ฏๆณ•...| | zh-cn| Subjective| | 10| [{human, Provide ...| | en|self-oss-instruct...| | 11| [{human, The univ...| | en| flan| | 12| [{human, Q: I am ...| | en| flan| | 13| [{human, What is ...| | en| OpenHermes-2.5| | 14| [{human, In react...| | en| flan| | 15| [{human, Write Py...| | en| code_exercises| | 16| [{human, Find the...| | en| MetaMath| | 17| [{human, Three of...| | en| MetaMath| | 18| [{human, Chandra ...| | en| MetaMath| | 19|[{human, ็”จ็ปๆตŽๅญฆ็Ÿฅ่ฏ†ๅˆ†ๆž...| | zh-cn| Subjective| +---+----------------------------+-----+----------+--------------------+ ``` This loads the dataset in a streaming fashion, and the output DataFrame has one partition per data file in the dataset to enable efficient distributed processing. 
To compute the number of dialogues per language, we run this code that uses the `columns` option and a `groupBy()` operation. The `columns` option is useful to load only the data we need, since PySpark doesn't enable predicate push-down with the Data Source API. There is also a `filters` option to load only the rows whose values are within a certain range.

```python
>>> df_langdetect_only = (
...     spark.read.format("huggingface")
...     .option("config", "7M")
...     .option("columns", '["langdetect"]')
...     .load("BAAI/Infinity-Instruct")
... )
>>> df_langdetect_only.groupBy("langdetect").count().show()
+----------+-------+
|langdetect| count|
+----------+-------+
| en|6697793|
| zh-cn| 751313|
+----------+-------+
```

To filter the dataset and only keep dialogues in Chinese:

```python
>>> df_chinese_only = (
...     spark.read.format("huggingface")
...     .option("config", "7M")
...     .option("filters", '[("langdetect", "=", "zh-cn")]')
...     .load("BAAI/Infinity-Instruct")
... )
>>> df_chinese_only.show()
+---+----------------------------+-----+----------+----------+
| id| conversations|label|langdetect| source|
+---+----------------------------+-----+----------+----------+
| 9|[{human, ๅ› ๆžœ่”็ณปๅŽŸๅˆ™ๆ˜ฏๆณ•...| | zh-cn|Subjective|
| 19|[{human, ็”จ็ปๆตŽๅญฆ็Ÿฅ่ฏ†ๅˆ†ๆž...| | zh-cn|Subjective|
| 38| [{human, ๆŸไธช่€ƒ่ฏ•ๅ…ฑๆœ‰Aใ€...| | zh-cn|Subjective|
| 39|[{human, ๆ’ฐๅ†™ไธ€็ฏ‡ๅ…ณไบŽๆ–ๆณข...| | zh-cn|Subjective|
| 57|[{human, ๆ€ป็ป“ไธ–็•ŒๅކๅฒไธŠ็š„...| | zh-cn|Subjective|
| 61|[{human, ็”Ÿๆˆไธ€ๅˆ™ๅนฟๅ‘Š่ฏใ€‚...| | zh-cn|Subjective|
| 66|[{human, ๆ่ฟฐไธ€ไธชๆœ‰ๆ•ˆ็š„ๅ›ข...| | zh-cn|Subjective|
| 94|[{human, ๅฆ‚ๆžœๆฏ”ๅˆฉๅ’Œ่’‚่Š™ๅฐผ...| | zh-cn|Subjective|
|102|[{human, ็”Ÿๆˆไธ€ๅฅ่‹ฑๆ–‡ๅ่จ€...| | zh-cn|Subjective|
|106|[{human, ๅ†™ไธ€ๅฐๆ„Ÿ่ฐขไฟก๏ผŒๆ„Ÿ...| | zh-cn|Subjective|
|118| [{human, ็”Ÿๆˆไธ€ไธชๆ•…ไบ‹ใ€‚}...| | zh-cn|Subjective|
|174|[{human, ้ซ˜่ƒ†ๅ›บ้†‡ๆฐดๅนณ็š„ๅŽ...| | zh-cn|Subjective|
|180|[{human, ๅŸบไบŽไปฅไธ‹่ง’่‰ฒไฟกๆฏ...| | zh-cn|Subjective|
|192|[{human, ่ฏทๅ†™ไธ€็ฏ‡ๆ–‡็ซ ๏ผŒๆฆ‚...| | zh-cn|Subjective|
|221|[{human, ไปฅ่ฏ—ๆญŒๅฝขๅผ่กจ่พพๅฏน...| | zh-cn|Subjective|
|228|[{human, ๆ นๆฎ็ป™ๅฎš็š„ๆŒ‡ไปค๏ผŒ...| | zh-cn|Subjective|
|236|[{human, ๆ‰“ๅผ€ไธ€ไธชๆ–ฐ็š„็”Ÿๆˆ...| | zh-cn|Subjective|
|260|[{human, ็”Ÿๆˆไธ€ไธชๆœ‰ๅ…ณๆœชๆฅ...| | zh-cn|Subjective|
|268|[{human, ๅฆ‚ๆžœๆœ‰ไธ€ๅฎšๆ•ฐ้‡็š„...| | zh-cn|Subjective|
|273| [{human, ้ข˜็›ฎ๏ผšๅฐๆ˜Žๆœ‰5ไธช...| | zh-cn|Subjective|
+---+----------------------------+-----+----------+----------+
```

It is also possible to apply filters or remove columns on the loaded DataFrame, but it is more efficient to do so while loading, especially on Parquet datasets. Indeed, Parquet contains metadata at the file and row group level, which makes it possible to skip entire parts of the dataset that don't contain samples satisfying the criteria. Columns in Parquet can also be loaded independently, which makes it possible to skip the excluded columns and avoid loading unnecessary data.

### Options

Here is the list of available options you can pass to `read.option()`:

* `config` (string): select a dataset subset/config
* `split` (string): select a dataset split (default is "train")
* `token` (string): your Hugging Face token

Instead of specifying a config or split, you can select which files to load manually:

* `data_dir` (string): select a directory
* `data_files` (string): select one or many files, e.g. `"data/*.parquet"` or `'["part1.parquet", "part2.parquet"]'`

For Parquet datasets:

* `columns` (string): select a subset of columns to load, e.g. `'["id"]'`
* `filters` (string): to skip files and row groups that don't match a given criterion, e.g. `'[("source", "=", "code_exercises")]'`. Filters are passed to [pyarrow.parquet.ParquetDataset](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html).
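Since the `columns` and `filters` options are strings rather than native lists, it can be convenient to build them programmatically. A small stdlib-only sketch using `json.dumps` (note that the documented examples use Python-literal tuples like `("langdetect", "=", "zh-cn")`; whether JSON-style lists are also accepted may depend on your `pyspark_huggingface` version, so treat this form as an assumption):

```python
import json

# JSON-encode the list of columns for the `columns` option
columns_opt = json.dumps(["langdetect"])
print(columns_opt)  # ["langdetect"]

# JSON has no tuples, so the (column, op, value) filter is written as a list here
filters_opt = json.dumps([["langdetect", "=", "zh-cn"]])
print(filters_opt)  # [["langdetect", "=", "zh-cn"]]
```

The resulting strings can then be passed directly, e.g. `.option("columns", columns_opt)`.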
Any other option is passed as an argument to [`datasets.load_dataset`](https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset).

### Run SQL queries

Once you have your PySpark DataFrame ready, you can run SQL queries using `spark.sql`:

```python
>>> import pyspark_huggingface
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("demo").getOrCreate()
>>> df = (
...     spark.read.format("huggingface")
...     .option("config", "7M")
...     .option("columns", '["source"]')
...     .load("BAAI/Infinity-Instruct")
... )
>>> spark.sql("SELECT source, count(*) AS total FROM {df} GROUP BY source ORDER BY total DESC", df=df).show()
+--------------------+-------+
| source| total|
+--------------------+-------+
| flan|2435840|
| Subjective|1342427|
| OpenHermes-2.5| 855478|
| MetaMath| 690138|
| code_exercises| 590958|
|Orca-math-word-pr...| 398168|
| code_bagel| 386649|
| MathInstruct| 329254|
|python-code-datas...| 88632|
|instructional_cod...| 82920|
| CodeFeedback| 79513|
|self-oss-instruct...| 50467|
|Evol-Instruct-Cod...| 43354|
|CodeExercise-Pyth...| 27159|
|code_instructions...| 23130|
| Code-Instruct-700k| 10860|
|Glaive-code-assis...| 9281|
|python_code_instr...| 2581|
|Python-Code-23k-S...| 2297|
+--------------------+-------+
```

Again, specifying the `columns` option is not necessary, but it is useful to avoid loading unnecessary data and makes the query faster.

## Write

You can write a PySpark DataFrame to Hugging Face with the "huggingface" Data Source. It uploads Parquet files in parallel in a distributed manner, and only commits the files once they're all uploaded. It works like this:

```python
>>> import pyspark_huggingface
>>> df.write.format("huggingface").save("username/dataset_name")
```

Here is how we can use this function to write the filtered version of the [BAAI/Infinity-Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct) dataset back to Hugging Face.
First you need to [create a dataset repository](https://huggingface.co/new-dataset), e.g. `username/Infinity-Instruct-Chinese-Only` (you can set it to private if you want). Then, make sure you are authenticated so you can use the "huggingface" Data Source, set the `mode` to "overwrite" (or "append" if you want to extend an existing dataset), and push to Hugging Face with `.save()`:

```python
>>> df_chinese_only.write.format("huggingface").mode("overwrite").save("username/Infinity-Instruct-Chinese-Only")
```

### Mode

Two modes are available when pushing a dataset to Hugging Face:

* "overwrite": overwrite the dataset if it already exists
* "append": append the dataset to an existing dataset

### Options

Here is the list of available options you can pass to `write.option()`:

* `token` (string): your Hugging Face token

Contributions are welcome to add more options here, in particular `subset` and `split`.

## Storage Buckets

It is common to process raw data in [Storage Buckets](/docs/hub/storage-buckets) and experiment there before publishing AI-ready data in Dataset repositories. You can access Storage Buckets the same way as Dataset repositories, but with the `buckets/` prefix and with the `data_dir` or `data_files` options:

```python
>>> df = spark.read.format("huggingface").option("data_dir", "data").load("buckets/username/my-bucket")
>>> # OR with a glob pattern
>>> # df = spark.read.format("huggingface").option("data_files", "data/*.parquet").load("buckets/username/my-bucket")
>>> df.write.format("huggingface").option("data_dir", "new-data").save("buckets/username/my-bucket")
```

### Using ESPnet at Hugging Face

https://huggingface.co/docs/hub/espnet.md

# Using ESPnet at Hugging Face

`espnet` is an end-to-end toolkit for speech processing, including automatic speech recognition, text to speech, speech enhancement, diarization and other tasks.
## Exploring ESPnet in the Hub

You can find hundreds of `espnet` models by filtering at the left of the [models page](https://huggingface.co/models?library=espnet&sort=downloads).

All models on the Hub come with useful features:

1. An automatically generated model card with a description, a training configuration, licenses and more.
2. Metadata tags that help with discoverability and contain information such as license, language and datasets.
3. An interactive widget you can use to play with the model directly in the browser.
4. An Inference Providers widget that lets you make inference requests.

## Using existing models

For a full guide on loading pre-trained models, we recommend checking out the [official guide](https://github.com/espnet/espnet_model_zoo).

If you're interested in doing inference, different classes for different tasks have a `from_pretrained` method that allows loading models from the Hub. For example:

* `Speech2Text` for Automatic Speech Recognition.
* `Text2Speech` for Text to Speech.
* `SeparateSpeech` for Audio Source Separation.

Here is an inference example:

```py
import soundfile
from espnet2.bin.tts_inference import Text2Speech

text2speech = Text2Speech.from_pretrained("model_name")
speech = text2speech("foobar")["wav"]
soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16")
```

If you want to see how to load a specific model, you can click `Use in ESPnet` and you will be given a working snippet to load it!

## Sharing your models

`ESPnet` outputs a `zip` file that can be uploaded to Hugging Face easily. For a full guide on sharing models, we recommend checking out the [official guide](https://github.com/espnet/espnet_model_zoo#register-your-model).

The `run.sh` script allows you to upload a given model to a Hugging Face repository.

```bash
./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo
```

## Additional resources

* ESPnet [docs](https://espnet.github.io/espnet/index.html).
* ESPnet model zoo [repository](https://github.com/espnet/espnet_model_zoo). * Integration [docs](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md). ### Daft https://huggingface.co/docs/hub/datasets-daft.md # Daft [Daft](https://daft.ai/) is a high-performance data engine providing simple and reliable data processing for any modality and scale. Daft has native support for reading from and writing to Hugging Face datasets. ## Getting Started To get started, pip install `daft` with the `huggingface` feature: ```bash pip install 'daft[huggingface]' ``` ## Read Daft is able to read datasets directly from the Hugging Face Hub using the [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface) function or via the `hf://datasets/` protocol. ### Reading an Entire Dataset Using [`daft.read_huggingface()`](https://docs.daft.ai/en/stable/api/io/#daft.read_huggingface), you can easily load a dataset. ```python import daft df = daft.read_huggingface("username/dataset_name") ``` This will read the entire dataset into a DataFrame. ### Reading Specific Files Not only can you read entire datasets, but you can also read individual files from a dataset repository. 
Using a read function that takes in a path (such as [`daft.read_parquet()`](https://docs.daft.ai/en/stable/api/io/#daft.read_parquet), [`daft.read_csv()`](https://docs.daft.ai/en/stable/api/io/#daft.read_csv), or [`daft.read_json()`](https://docs.daft.ai/en/stable/api/io/#daft.read_json)), specify a Hugging Face dataset path via the `hf://datasets/` prefix:

```python
import daft

# read a specific Parquet file
df = daft.read_parquet("hf://datasets/username/dataset_name/file_name.parquet")

# or a CSV file
df = daft.read_csv("hf://datasets/username/dataset_name/file_name.csv")

# or a set of Parquet files using a glob pattern
df = daft.read_parquet("hf://datasets/username/dataset_name/**/*.parquet")
```

## Write

Daft is able to write Parquet files to a Hugging Face dataset repository using [`daft.DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface). Daft supports [Content-Defined Chunking](https://huggingface.co/blog/parquet-cdc) and [Xet](https://huggingface.co/blog/xet-on-the-hub) for faster, deduplicated writes.

Basic usage:

```python
import daft

df: daft.DataFrame = ...
df.write_huggingface("username/dataset_name")
```

See the [`DataFrame.write_huggingface`](https://docs.daft.ai/en/stable/api/dataframe/#daft.DataFrame.write_huggingface) API page for more info.

## Authentication

The `token` parameter in [`daft.io.HuggingFaceConfig`](https://docs.daft.ai/en/stable/api/config/#daft.io.HuggingFaceConfig) can be used to specify a Hugging Face access token for requests that require authentication (e.g. reading private dataset repositories or writing to a dataset repository).
Example of loading a dataset with a specified token:

```python
import daft
from daft.io import IOConfig, HuggingFaceConfig

io_config = IOConfig(hf=HuggingFaceConfig(token="your_token"))
df = daft.read_parquet("hf://datasets/username/dataset_name", io_config=io_config)
```

### Dataset Cards

https://huggingface.co/docs/hub/datasets-cards.md

# Dataset Cards

## What are Dataset Cards?

Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub, and [data files configuration](./datasets-manual-configuration) options. Tags are defined in a YAML metadata section at the top of the `README.md` file.

## Dataset card metadata

A dataset repo will render its README.md as a dataset card. To control how the Hub displays the card, you should create a YAML section in the README file to define some metadata. Start by adding three `---` at the top, then include all of the relevant metadata, and close the section with another group of `---` like the example below:

```yaml
language:
- "List of ISO 639-1 code for your language"
- lang1
- lang2
pretty_name: "Pretty Name of the Dataset"
tags:
- tag1
- tag2
license: "any valid license identifier"
task_categories:
- task1
- task2
```

The metadata that you add to the dataset card enables certain interactions on the Hub. For example:

* Allow users to filter and discover datasets at https://huggingface.co/datasets.
* If you choose a license using the keywords listed in the right column of [this table](./repositories-licenses), the license will be displayed on the dataset page.

When creating a README.md file in a dataset repository on the Hub, use the Metadata UI to fill in the main metadata.

For the full list of metadata fields, see the detailed [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1).

### Dataset card creation guide

For a step-by-step guide on creating a dataset card, check out the [Create a dataset card](https://huggingface.co/docs/datasets/dataset_card) guide. Reading through existing dataset cards, such as the [ELI5 dataset card](https://huggingface.co/datasets/eli5/blob/main/README.md), is a great way to familiarize yourself with the common conventions.

### Linking a Paper

If the dataset card includes a link to a Paper page (either on HF or an arXiv abstract/PDF), the Hub will extract the arXiv ID and include it in the dataset tags with the format `arxiv:<PAPER ID>`. Clicking on the tag will let you:

* Visit the Paper page
* Filter for other models on the Hub that cite the same paper.

Read more about paper pages [here](./paper-pages).

### Force set a dataset modality

The Hub will automatically detect the modality of a dataset based on the files it contains (audio, video, geospatial, etc.). If you want to force a specific modality, you can add a tag to the dataset card metadata: `3d`, `audio`, `geospatial`, `image`, `tabular`, `text`, `timeseries`, `video`. For example, to force the modality to `audio`, add the following to the dataset card metadata:

```yaml
tags:
- audio
```

### Associate a library to the dataset

The dataset page automatically shows libraries and tools that are able to natively load the dataset, but if you want to show another specific library, you can add a tag to the dataset card metadata: `argilla`, `dask`, `datasets`, `distilabel`, `fiftyone`, `mlcroissant`, `pandas`, `webdataset`.
See the [list of supported libraries](https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/dataset-libraries.ts) for more information, or to propose adding a new library.

For example, to associate the `argilla` library to the dataset card, add the following to the dataset card metadata:

```yaml
tags:
- argilla
```

### Docker Spaces

https://huggingface.co/docs/hub/spaces-sdks-docker.md

# Docker Spaces

Spaces accommodate custom [Docker containers](https://docs.docker.com/get-started/) for apps outside the scope of Streamlit and Gradio. Docker Spaces allow users to go beyond the limits of what was previously possible with the standard SDKs. From FastAPI and Go endpoints to Phoenix apps and ML Ops tools, Docker Spaces can help in many different setups.

## Setting up Docker Spaces

Selecting **Docker** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space by setting the `sdk` property to `docker` in your `README.md` file's YAML block. Alternatively, given an existing Space repository, set `sdk: docker` inside the `YAML` block at the top of your Spaces **README.md** file. You can also change the default exposed port `7860` by setting a different `app_port` value. Afterwards, you can create a usual `Dockerfile`.

```Yaml
---
title: Basic Docker SDK Space
emoji: ๐Ÿณ
colorFrom: purple
colorTo: gray
sdk: docker
app_port: 7860
---
```

Internally you could have as many open ports as you want. For instance, you can install Elasticsearch inside your Space and call it internally on its default port 9200. If you want to expose apps served on multiple ports to the outside world, a workaround is to use a reverse proxy like Nginx to dispatch requests from the broader internet (on a single port) to different internal ports.

## Secrets and Variables Management

You can manage a Space's environment variables in the Space Settings. Read more [here](./spaces-overview#managing-secrets).
### Variables

#### Buildtime

Variables are passed as `build-arg`s when building your Docker Space. Read [Docker's dedicated documentation](https://docs.docker.com/engine/reference/builder/#arg) for a complete guide on how to use this in the Dockerfile.

```Dockerfile
# Declare your environment variables with the ARG directive
ARG MODEL_REPO_NAME

FROM python:latest
# [...]
# You can use them like environment variables
RUN predict.py $MODEL_REPO_NAME
```

#### Runtime

Variables are injected into the container's environment at runtime.

### Secrets

#### Buildtime

In Docker Spaces, secrets are managed differently for security reasons. Once you create a secret in the [Settings tab](./spaces-overview#managing-secrets), you can expose it in your Dockerfile. For example, if `SECRET_EXAMPLE` is the name of the secret you created in the Settings tab, you can read it at build time by mounting it to a file, then reading it with `$(cat /run/secrets/SECRET_EXAMPLE)`. See the examples below:

```Dockerfile
# Expose the secret SECRET_EXAMPLE at buildtime and use its value as git remote URL
RUN --mount=type=secret,id=SECRET_EXAMPLE,mode=0444,required=true \
    git init && \
    git remote add origin $(cat /run/secrets/SECRET_EXAMPLE)
```

```Dockerfile
# Expose the secret SECRET_EXAMPLE at buildtime and use its value as a Bearer token for a curl request
# (double quotes are required so the $(cat ...) command substitution is expanded)
RUN --mount=type=secret,id=SECRET_EXAMPLE,mode=0444,required=true \
    curl test -H "Authorization: Bearer $(cat /run/secrets/SECRET_EXAMPLE)"
```

#### Runtime

Same as for public Variables, at runtime you can access the secrets as environment variables. For example, in Python you would use `os.environ.get("SECRET_EXAMPLE")`. Check out this [example](https://huggingface.co/spaces/DockerTemplates/secret-example) of a Docker Space that uses secrets.

## Permissions

The container runs with user ID 1000. To avoid permission issues, you should create a user and set its `WORKDIR` before any `COPY` or download.
```Dockerfile
# Set up a new user named "user" with user ID 1000
RUN useradd -m -u 1000 user

# Switch to the "user" user
USER user

# Set home to the user's home directory
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH

# Set the working directory to the user's home directory
WORKDIR $HOME/app

# Run the pip command after setting the user with `USER user` to avoid permission issues with Python
RUN pip install --no-cache-dir --upgrade pip

# Copy the current directory contents into the container at $HOME/app, setting the owner to the user
COPY --chown=user . $HOME/app

# Download a checkpoint
RUN mkdir content
ADD --chown=user https:// content/
```

Always specify `--chown=user` with `ADD` and `COPY` to ensure the new files are owned by your user.

If you still face permission issues, you might need to use `chmod` or `chown` in your `Dockerfile` to grant the right permissions. For example, if you want to use the directory `/data`, you can do:

```Dockerfile
RUN mkdir -p /data
RUN chmod 777 /data
```

You should always avoid superfluous chowns.

> [!WARNING]
> Updating metadata for a file creates a new copy stored in the new layer. Therefore, a recursive chown can result in a very large image due to the duplication of all affected files.

Rather than fixing permissions by running `chown`:

```
COPY checkpoint .
RUN chown -R user checkpoint
```

you should always do:

```
COPY --chown=user checkpoint .
```

(the same goes for the `ADD` command)

## Data Persistence

The data written on disk is lost whenever your Docker Space restarts. To persist data across restarts, you can attach a [Storage Bucket](./storage-buckets) to your Space. At the moment, the `/data` volume is only available at runtime, i.e. you cannot use `/data` during the build step of your Dockerfile. You can also use our Datasets Hub for specific cases, where you can store state and data in a git LFS repository.
You can find an example of persistence [here](https://huggingface.co/spaces/Wauplin/space_to_dataset_saver), which uses the [`huggingface_hub` library](https://huggingface.co/docs/huggingface_hub/index) for programmatically uploading files to a dataset repository. This Space example, along with [this guide](https://huggingface.co/docs/huggingface_hub/main/en/guides/upload#scheduled-uploads), will help you decide which solution best fits your data type.

Finally, in some cases, you might want to use an external storage solution from your Space's code, such as an externally hosted DB, S3, etc.

### Docker container with GPU

You can run Docker containers with GPU support by using one of our GPU-flavored [Spaces Hardware](./spaces-gpus) options. We recommend using the [`nvidia/cuda`](https://hub.docker.com/r/nvidia/cuda) image from Docker Hub as a base image, which comes with CUDA and cuDNN pre-installed.

During Docker buildtime, you don't have access to GPU hardware. Therefore, you should not try to run any GPU-related command during the build step of your Dockerfile. For example, you can't run `nvidia-smi` or `torch.cuda.is_available()` while building an image. Read more [here](https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#description).

## Read More

- [Full Docker demo example](spaces-sdks-docker-first-demo)
- [List of Docker Spaces examples](spaces-sdks-docker-examples)
- [Spaces Examples](https://huggingface.co/SpacesExamples)

### Streamlit Spaces

https://huggingface.co/docs/hub/spaces-sdks-streamlit.md

# Streamlit Spaces

**Streamlit** gives users freedom to build a full-featured web app with Python in a *reactive* way. Your code is rerun each time the state of the app changes. Streamlit is also great for data visualization and supports several charting libraries such as Bokeh, Plotly, and Altair. Read this [blog post](https://huggingface.co/blog/streamlit-spaces) about building and hosting Streamlit apps in Spaces.
Selecting **Streamlit** as the SDK when [creating a new Space](https://huggingface.co/new-space) will initialize your Space with the latest version of Streamlit by setting the `sdk` property to `streamlit` in your `README.md` file's YAML block. If you'd like to change the Streamlit version, you can edit the `sdk_version` property.

To use Streamlit in a Space, select **Streamlit** as the SDK when you create a Space through the [**New Space** form](https://huggingface.co/new-space). This will create a repository with a `README.md` that contains the following properties in the YAML configuration block:

```yaml
sdk: streamlit
sdk_version: 1.25.0 # The latest supported version
```

You can edit the `sdk_version`, but note that issues may occur when you use an unsupported Streamlit version. Not all Streamlit versions are supported, so please refer to the [reference section](./spaces-config-reference) to see which versions are available.

For in-depth information about Streamlit, refer to the [Streamlit documentation](https://docs.streamlit.io/).

> [!WARNING]
> Only port 8501 is allowed for Streamlit Spaces (default port). As a result, if you provide a `config.toml` file for your Space, make sure the default port is not overridden.

## Your First Streamlit Space: Hot Dog Classifier

In the following sections, you'll learn the basics of creating a Space, configuring it, and deploying your code to it. We'll create a **Hot Dog Classifier** Space with Streamlit that'll be used to demo the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which can detect whether a given picture contains a hot dog ๐ŸŒญ

You can find a completed version of this hosted at [NimaBoscarino/hotdog-streamlit](https://huggingface.co/spaces/NimaBoscarino/hotdog-streamlit).

## Create a new Streamlit Space

We'll start by [creating a brand new Space](https://huggingface.co/new-space) and choosing **Streamlit** as our SDK.
Hugging Face Spaces are Git repositories, meaning that you can work on your Space incrementally (and collaboratively) by pushing commits. Take a look at the [Getting Started with Repositories](./repositories-getting-started) guide to learn about how you can create and edit files before continuing. ## Add the dependencies For the **Hot Dog Classifier** we'll be using a [๐Ÿค— Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to use the model, so we need to start by installing a few dependencies. This can be done by creating a **requirements.txt** file in our repository, and adding the following dependencies to it: ``` transformers torch ``` The Spaces runtime will handle installing the dependencies! ## Create the Streamlit app To create the Streamlit app, make a new file in the repository called **app.py**, and add the following code: ```python import streamlit as st from transformers import pipeline from PIL import Image pipeline = pipeline(task="image-classification", model="julien-c/hotdog-not-hotdog") st.title("Hot Dog? Or Not?") file_name = st.file_uploader("Upload a hot dog candidate image") if file_name is not None: col1, col2 = st.columns(2) image = Image.open(file_name) col1.image(image, use_column_width=True) predictions = pipeline(image) col2.header("Probabilities") for p in predictions: col2.subheader(f"{ p['label'] }: { round(p['score'] * 100, 1)}%") ``` This Python script uses a [๐Ÿค— Transformers pipeline](https://huggingface.co/docs/transformers/pipeline_tutorial) to load the [julien-c/hotdog-not-hotdog](https://huggingface.co/julien-c/hotdog-not-hotdog) model, which is used by the Streamlit interface. The Streamlit app will expect you to upload an image, which it'll then classify as *hot dog* or *not hot dog*. Once you've saved the code to the **app.py** file, visit the **App** tab to see your app in action! 
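For reference, the `predictions` returned by the image-classification pipeline are a list of dictionaries with `label` and `score` keys. A self-contained sketch of the percentage formatting used in `app.py`, run on made-up sample predictions rather than real model output:

```python
# Made-up predictions in the shape returned by an image-classification pipeline
predictions = [
    {"label": "hot dog", "score": 0.9871},
    {"label": "not hot dog", "score": 0.0129},
]

# Same label/percentage formatting as the col2.subheader(...) line in app.py
lines = [f"{p['label']}: {round(p['score'] * 100, 1)}%" for p in predictions]
print("\n".join(lines))
# hot dog: 98.7%
# not hot dog: 1.3%
```

This makes it easy to check the display logic without downloading the model.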
## Embed Streamlit Spaces on other webpages

You can use the HTML `<iframe>` tag to embed a Streamlit Space as an inline frame on other webpages. Simply include the URL of your Space, ending with the `.hf.space` suffix. To find the URL of your Space, you can use the "Embed this Space" button from the Spaces options.

For example, the demo above can be embedded with a tag like the following (using a placeholder subdomain):

```html
<iframe
	src="https://your-space-subdomain.hf.space?embed=true"
	frameborder="0"
	width="850"
	height="450"
></iframe>
```

Please note that we have added `?embed=true` to the URL, which activates the embed mode of the Streamlit app, removing some spacers and the footer for slim embeds.

## Embed Streamlit Spaces with auto-resizing IFrames

Streamlit has supported automatic iframe resizing since [1.17.0](https://docs.streamlit.io/library/changelog#version-1170) so that the size of the parent iframe is automatically adjusted to fit the content volume of the embedded Streamlit application.

It relies on the [`iFrame Resizer`](https://github.com/davidjbradshaw/iframe-resizer) library, for which you need to add a few lines of code, as in the following example where:

- `id` is set on the `iframe` element to specify the auto-resize target.
- The `iFrame Resizer` library is loaded via a `script` tag.
- The `iFrameResize()` function is called with the ID of the target `iframe` element, so that its size changes automatically.

We can pass options to the first argument of `iFrameResize()`. See [the document](https://github.com/davidjbradshaw/iframe-resizer/blob/master/docs/parent_page/options.md) for the details.

The example below uses a placeholder Space subdomain, and loads the resizer script from a CDN (adjust the script URL or pin a version as needed):

```html
<iframe
	id="your-iframe-id"
	src="https://your-space-subdomain.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>
<script src="https://cdn.jsdelivr.net/npm/iframe-resizer/js/iframeResizer.min.js"></script>
<script>
	iFrameResize({}, "#your-iframe-id")
</script>
```

Additionally, you can check out [our documentation](./spaces-embed).
### Annotated Model Card Template https://huggingface.co/docs/hub/model-card-annotated.md # Annotated Model Card Template ## Template [modelcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md) ## Directions Fully filling out a model card requires input from a few different roles. (One person may have more than one role.) Weโ€™ll refer to these roles as the **developer**, who writes the code and runs training; the **sociotechnic**, who is skilled at analyzing the interaction of technology and society long-term (this includes lawyers, ethicists, sociologists, or rights advocates); and the **project organizer**, who understands the overall scope and reach of the model, can roughly fill out each part of the card, and who serves as a contact person for model card updates. * The **developer** is necessary for filling out [Training Procedure](#training-procedure-optional) and [Technical Specifications](#technical-specifications-optional). They are also particularly useful for the โ€œLimitationsโ€ section of [Bias, Risks, and Limitations](#bias-risks-and-limitations). They are responsible for providing [Results](#results) for the Evaluation, and ideally work with the other roles to define the rest of the Evaluation: [Testing Data, Factors & Metrics](#testing-data-factors--metrics). * The **sociotechnic** is necessary for filling out โ€œBiasโ€ and โ€œRisksโ€ within [Bias, Risks, and Limitations](#bias-risks-and-limitations), and particularly useful for โ€œOut of Scope Useโ€ within [Uses](#uses). * The **project organizer** is necessary for filling out [Model Details](#model-details) and [Uses](#uses). They might also fill out [Training Data](#training-data). 
Project organizers could also be in charge of [Citation](#citation-optional), [Glossary](#glossary-optional), [Model Card Contact](#model-card-contact), [Model Card Authors](#model-card-authors-optional), and [More Information](#more-information-optional). _Instructions are provided below, in italics._ Template variable names appear in `monospace`. --- # Model Name **Section Overview:** Provide the model name and a 1-2 sentence summary of what the model is. `model_id` `model_summary` # Table of Contents **Section Overview:** Provide this with links to each section, to enable people to easily jump around/use the file in other locations with the preserved TOC/print out the content/etc. # Model Details **Section Overview:** This section provides basic information about what the model is, its current status, and where it came from. It should be useful for anyone who wants to reference the model. ## Model Description `model_description` _Provide basic details about the model. This includes the architecture, version, if it was introduced in a paper, if an original implementation is available, and the creators. Any copyright should be attributed here. General information about training procedures, parameters, and important disclaimers can also be mentioned in this section._ * **Developed by:** `developers` _List (and ideally link to) the people who built the model._ * **Funded by:** `funded_by` _List (and ideally link to) the funding sources that financially, computationally, or otherwise supported or enabled this model._ * **Shared by [optional]:** `shared_by` _List (and ideally link to) the people/organization making the model available online._ * **Model type:** `model_type` _You can name the โ€œtypeโ€ as:_ _1. Supervision/Learning Method_ _2. Machine Learning Type_ _3. 
Modality_ * **Language(s)** [NLP]: `language` _Use this field when the system uses or processes natural (human) language._ * **License:** `license` _Name and link to the license being used._ * **Finetuned From Model [optional]:** `base_model` _If this model has another model as its base, link to that model here._ ## Model Sources [optional] * **Repository:** `repo` * **Paper [optional]:** `paper` * **Demo [optional]:** `demo` _Provide sources for the user to directly see the model and its details. Additional kinds of resources – training logs, lessons learned, etc. – belong in the [More Information](#more-information-optional) section. If you include one thing for this section, link to the repository._ # Uses **Section Overview:** This section addresses questions around how the model is intended to be used in different applied contexts, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model. Note this section is not intended to include the license usage details. For that, link directly to the license. ## Direct Use `direct_use` _Explain how the model can be used without fine-tuning, post-processing, or plugging into a pipeline. An example code snippet is recommended._ ## Downstream Use [optional] `downstream_use` _Explain how this model can be used when fine-tuned for a task or when plugged into a larger ecosystem or app. An example code snippet is recommended._ ## Out-of-Scope Use `out_of_scope_use` _List how the model may foreseeably be misused (used in a way it will not work for) and address what users ought not do with the model._ # Bias, Risks, and Limitations **Section Overview:** This section identifies foreseeable harms, misunderstandings, and technical and sociotechnical limitations. It also provides information on warnings and potential mitigations. Bias, risks, and limitations can sometimes be inseparable/refer to the same issues. 
Generally, bias and risks are sociotechnical, while limitations are technical: - A **bias** is a stereotype or disproportionate performance (skew) for some subpopulations. - A **risk** is a socially-relevant issue that the model might cause. - A **limitation** is a likely failure mode that can be addressed following the listed Recommendations. `bias_risks_limitations` _What are the known or foreseeable issues stemming from this model?_ ## Recommendations `bias_recommendations` _What are recommendations with respect to the foreseeable issues? This can include everything from โ€œdownsample your imageโ€ to filtering explicit content._ # Training Details **Section Overview:** This section provides information to describe and replicate training, including the training data, the speed and size of training elements, and the environmental impact of training. This relates heavily to the [Technical Specifications](#technical-specifications-optional) as well, and content here should link to that section when it is relevant to the training procedure. It is useful for people who want to learn more about the model inputs and training footprint. It is relevant for anyone who wants to know the basics of what the model is learning. ## Training Data `training_data` _Write 1-2 sentences on what the training data is. Ideally this links to a Dataset Card for further information. Links to documentation related to data pre-processing or additional filtering may go here as well as in [More Information](#more-information-optional)._ ## Training Procedure [optional] ### Preprocessing `preprocessing` _Detail tokenization, resizing/rewriting (depending on the modality), etc._ ### Speeds, Sizes, Times `speeds_sizes_times` _Detail throughput, start/end time, checkpoint sizes, etc._ # Evaluation **Section Overview:** This section describes the evaluation protocols, what is being measured in the evaluation, and provides the results. 
Evaluation ideally has at least two parts, with one part looking at quantitative measurement of general performance ([Testing Data, Factors & Metrics](#testing-data-factors--metrics)), such as may be done with benchmarking; and another looking at performance with respect to specific social safety issues ([Societal Impact Assessment](#societal-impact-assessment-optional)), such as may be done with red-teaming. You can also specify your model's evaluation results in a structured way in the model card metadata. Results are parsed by the Hub and displayed in a widget on the model page. See https://huggingface.co/docs/hub/model-cards#evaluation-results. ## Testing Data, Factors & Metrics _Evaluation is ideally **disaggregated** with respect to different factors, such as task, domain and population subgroup; and calculated with metrics that are most meaningful for foreseeable contexts of use. Equal evaluation performance across different subgroups is said to be "fair" across those subgroups; target fairness metrics should be decided based on which errors are more likely to be problematic in light of the model use. However, this section is most commonly used to report aggregate evaluation performance on different task benchmarks._ ### Testing Data `testing_data` _Describe testing data or link to its Dataset Card._ ### Factors `testing_factors` _What are the foreseeable characteristics that will influence how the model behaves? Evaluation should ideally be disaggregated across these factors in order to uncover disparities in performance._ ### Metrics `testing_metrics` _What metrics will be used for evaluation?_ ## Results `results` _Results should be based on the Factors and Metrics defined above._ ### Summary `results_summary` _What do the results say? 
This can function as a kind of tl;dr for general audiences._ ## Societal Impact Assessment [optional] _Use this free text section to explain how this model has been evaluated for risk of societal harm, such as for child safety, NCII, privacy, and violence. This might take the form of answers to the following questions:_ - _Is this model safe for kids to use? Why or why not?_ - _Has this model been tested to evaluate risks pertaining to non-consensual intimate imagery (including CSEM)?_ - _Has this model been tested to evaluate risks pertaining to violent activities, or depictions of violence? What were the results?_ _Quantitative numbers on each issue may also be provided._ # Model Examination [optional] **Section Overview:** This is an experimental section some developers are beginning to add, where work on explainability/interpretability may go. `model_examination` # Environmental Impact **Section Overview:** Summarizes the information necessary to calculate environmental impacts such as electricity usage and carbon emissions. * **Hardware Type:** `hardware_type` * **Hours used:** `hours_used` * **Cloud Provider:** `cloud_provider` * **Compute Region:** `cloud_region` * **Carbon Emitted:** `co2_emitted` _Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700)._ # Technical Specifications [optional] **Section Overview:** This section includes details about the model objective and architecture, and the compute infrastructure. It is useful for people interested in model development. Writing this section usually requires the model developer to be directly involved. ## Model Architecture and Objective `model_specs` ## Compute Infrastructure `compute_infrastructure` ### Hardware `hardware_requirements` _What are the minimum hardware requirements, e.g. 
processing, storage, and memory requirements?_ ### Software `software` # Citation [optional] **Section Overview:** The developers' preferred citation for this model. This is often a paper. ### BibTeX `citation_bibtex` ### APA `citation_apa` # Glossary [optional] **Section Overview:** This section defines common terms and how metrics are calculated. `glossary` _Clearly define terms in order to be accessible across audiences._ # More Information [optional] **Section Overview:** This section provides links to writing on dataset creation, technical specifications, lessons learned, and initial results. `more_information` # Model Card Authors [optional] **Section Overview:** This section lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction. `model_card_authors` # Model Card Contact **Section Overview:** Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors. `model_card_contact` # How to Get Started with the Model **Section Overview:** Provides a code snippet to show how to use the model. `get_started_code` --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### How to handle URL parameters in Spaces https://huggingface.co/docs/hub/spaces-handle-url-parameters.md # How to handle URL parameters in Spaces You can use URL query parameters as a data sharing mechanism, for instance to be able to deep-link into an app with a specific state. On a Space page (`https://huggingface.co/spaces//`), the actual application page (`https://*.hf.space/`) is embedded in an iframe. The query string and the hash attached to the parent page URL are propagated to the embedded app on initial load, so the embedded app can read these values without special consideration. 
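For instance, the embedded app can parse the propagated values directly from `location.search` and `location.hash` on load. A minimal sketch (the `readUrlState` helper name is ours, not part of any Spaces API):

```javascript
// Parse the query string and hash that the parent page propagates
// to the embedded app on initial load. `readUrlState` is a
// hypothetical helper; in a real app you would call it with
// `location.search` and `location.hash`.
function readUrlState(search, hash) {
  const query = Object.fromEntries(new URLSearchParams(search));
  return { query, hash: hash.replace(/^#/, "") };
}

// In the embedded app:
// const { query, hash } = readUrlState(location.search, location.hash);
```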
In contrast, updating the query string and the hash of the parent page URL from the embedded app is slightly more complex. If you want to do this in a Docker or static Space, you need to add the following JS code, which sends a message with a `queryString` and/or `hash` key to the parent page. ```js const queryString = "..."; const hash = "..."; window.parent.postMessage({ queryString, hash, }, "https://huggingface.co"); ``` **This is only for Docker or static Spaces.** For Streamlit apps, Spaces automatically syncs the URL parameters. Gradio apps can read the query parameters from the Spaces page, but do not sync updated URL parameters with the parent page. Note that the URL parameters of the parent page are propagated to the embedded app *only* on the initial load. So `location.hash` in the embedded app will not change even if the parent URL hash is updated using this method. An example of this method can be found in this static Space, [`whitphx/static-url-param-sync-example`](https://huggingface.co/spaces/whitphx/static-url-param-sync-example). ### Signing commits with GPG https://huggingface.co/docs/hub/security-gpg.md # Signing commits with GPG `git` has an authentication layer to control who can push commits to a repo, but it does not authenticate the actual commit authors. In other words, you can commit changes as `Elon Musk `, push them to your preferred `git` host (for instance github.com), and your commit will link to Elon's GitHub profile. (Try it! But don't blame us if Elon gets mad at you for impersonating him.) The reasons we implemented GPG signing were: - To provide finer-grained security, especially as more and more Enterprise users rely on the Hub. - To provide ML benchmarks backed by a cryptographically-secure source. See Ale Segala's [How (and why) to sign `git` commits](https://withblue.ink/2020/05/17/how-and-why-to-sign-git-commits.html) for more context. 
You can prove a commit was authored by you with GNU Privacy Guard (GPG) and a key server. GPG is a cryptographic tool used to verify the authenticity of a message's origin. We'll explain how to set this up on Hugging Face below. The Pro Git book is, as usual, a good resource about commit signing: [Pro Git: Signing your work](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work). ## Setting up signed commits verification You will need to install [GPG](https://gnupg.org/) on your system in order to execute the following commands. > It's included by default in most Linux distributions. > On Windows, it is included in Git Bash (which comes with `git` for Windows). You can sign your commits locally using [GPG](https://gnupg.org/). Then configure your profile to mark these commits as **verified** on the Hub, so other people can be confident that they come from a trusted source. For a more in-depth explanation of how git and GPG interact, please visit the [git documentation on the subject](https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work). Commits can have the following signing statuses: | Status | Explanation | | ----------------- | ------------------------------------------------------------ | | Verified | The commit is signed and the signature is verified | | Unverified | The commit is signed but the signature could not be verified | | No signing status | The commit is not signed | For a commit to be marked as **verified**, you need to upload the public key used to sign it on your Hugging Face account. Use the `gpg --list-secret-keys` command to list the GPG keys for which you have both a public and private key. A private key is required for signing commits or tags. If you don't have a GPG key pair or you don't want to use the existing keys to sign your commits, go to **Generating a new GPG key**. Otherwise, go straight to [Adding a GPG key to your account](#adding-a-gpg-key-to-your-account). 
## Generating a new GPG key To generate a GPG key, run the following: ```bash gpg --gen-key ``` GPG will then guide you through the process of creating a GPG key pair. Make sure you specify an email address for this key, and that the email address matches the one you specified in your Hugging Face [account](https://huggingface.co/settings/account). ## Adding a GPG key to your account 1. First, select or generate a GPG key on your computer. Make sure the email address of the key matches the one in your Hugging Face [account](https://huggingface.co/settings/account) and that the email of your account is verified. 2. Export the public part of the selected key: ```bash gpg --armor --export ``` 3. Then visit your profile [settings page](https://huggingface.co/settings/keys) and click on **Add GPG Key**. Copy & paste the output of the `gpg --export` command in the text area and click on **Add Key**. 4. Congratulations! 🎉 You've just added a GPG key to your account! ## Configure git to sign your commits with GPG The last step is to configure git to sign your commits: ```bash git config user.signingkey git config user.email ``` Then add the `-S` flag to your `git commit` commands to sign your commits! ```bash git commit -S -m "My first signed commit" ``` Once pushed on the Hub, you should see the commit with a "Verified" badge. > [!TIP] > To sign all commits by default in any local repository on your computer, you can run `git config --global commit.gpgsign true`. ### User Studies https://huggingface.co/docs/hub/model-cards-user-studies.md # User Studies ## Model Card Audiences and Use Cases During our investigation into the landscape of model documentation tools (data cards, etc.), we noted how different stakeholders make use of existing infrastructure to create a kind of model card with information focused on their needed domain. 
One such example is 'business analysts', or those whose focus is on B2B as well as an internal-only audience. The static, more manual approach for this audience is using Confluence pages (*if PMs write the page, we are detaching the model creators from its theoretical consumption; if ML engineers write the page, they may tend to stress only a certain type of information.* [^1]), or a proposed combination of HTML (Jinja) templates, Metaflow classes, and external API keys, in order to create model cards that include the perspective of the model information needed for their domain/use case. We conducted a user study with the aim of validating a literature-informed model card structure and understanding which sections/areas rank as most important from the different stakeholders' perspectives. The study aimed to validate the following components: * **Model Card Layout** During our examination of the state of the art of model cards, we noted recurring sections from the top ~100 downloaded models on the Hub that had model cards. From this analysis we catalogued the top recurring model card sections and recurring information; this, coupled with the structure of the Bloom model card, led us to the initial version of a standard model card structure. As we began to structure our user studies, two variations of model cards - which made use of the [initial model card structure](./model-card-annotated) - were used as interactive demonstrations. The aim of these demos was to understand not only the different user perspectives on the visual elements of the model cards but also the content presented to users. The desired outcome would enable us to further understand what makes a model card easier to read, while still providing some level of interactivity within the model cards and presenting the information in an easily understandable, approachable manner. 
* **Stakeholder Perspectives** As different people of varying technical backgrounds could be collaborating on a model, and subsequently the model card, we sought to validate the need for different stakeholders' perspectives. Participants ranked the different sections of model cards, first from the perspective of a reader of a model card and then as an author of a model card, based on which sections one would read first and on the ease of writing the different sections. An ordering scheme - 1 being the highest weight and 10 being the lowest - was applied to the sections a user would usually read first in a model card and to the sections a model card author would find easiest to write. ## Summary of Responses to the User Studies Survey Our user studies provided further clarity on the sections that different user profiles/stakeholders would find more challenging or easier to write. The results illustrated below show that while the Bias, Risks, and Limitations section ranks second for both model card writers and model card readers (for *In what order do you write the model card* and *What section do you look at first*, respectively), it is also noted as the most challenging/longest section to write. This endorsed the need to further evaluate the Bias, Risks, and Limitations sections in order to assist with writing this imperative section. These templates were then used to generate model cards for the top 200 most downloaded Hugging Face (HF) models. * We first began by pulling all Hugging Face models on the Hub and, in particular, subsections on Limitations and Bias ("Risks" subsections were largely not present). * Based on the inputs most frequently used in models with a higher number of downloads, grouped by model type, the tool provides prompted text within the Bias, Risks, and Limitations sections. We also prompt a default text if the model type is not specified. 
Using this information, we returned to our analysis of all model cards on the Hub, coupled with suggestions from other researchers and peers at HF and additional research on the type of prompted information we could provide to users while they are creating model cards. This default prompted text allowed us to satisfy the following aims: 1) For those who have not created model cards before, or who do not usually make a model card or any other type of model documentation for their models, the prompted text enables these users to easily create a model card. This in turn increased the number of model cards created. 2) For users who already write model cards, the prompted text invites them to add more to their model card, further developing the content/standard of model cards. ## User Study Details We selected people from a variety of different backgrounds relevant to machine learning and model documentation. Below, we detail their demographics, the questions they were asked, and the corresponding insights from their responses. Full details on responses are available in [Appendix A](./model-card-appendix#appendix-a-user-study). ### Respondent Demographics * Tech & Regulatory Affairs Counsel * ML Engineer (x2) * Developer Advocate * Executive Assistant * Monetization Lead * Policy Manager/AI Researcher * Research Intern **What are the key pieces of information you want or need to know about a model when interacting with a machine learning model?** **Insight:** * Respondents prioritised information about the model task/domain (x3), training data/training procedure (x2), how to use the model (with code) (x2), bias and limitations, and the model licence ### Feedback on Specific Model Card Formats #### Format 1: **Current [distilbert/distilgpt2 model card](https://huggingface.co/distilbert/distilgpt2) on the Hub** **Insights:** * Respondents found this model card format to be concise, complete, and readable. 
* There was no consensus about the collapsible sections (some liked them and wanted more, some disliked them). * Some respondents said "Risks and Limitations" should go with "Out of Scope Uses". #### Format 2: **Nazneen Rajani's [Interactive Model Card space](https://huggingface.co/spaces/nazneen/interactive-model-cards)** **Insights:** * While a few respondents really liked this format, most found it overwhelming or an overload of information. Several suggested this could be a nice tool to layer onto a base model card for more advanced audiences. #### Format 3: **Ezi Ozoani's [Semi-Interactive Model Card Space](https://huggingface.co/spaces/Ezi/ModelCardsAnalysis)** **Insights:** * Several respondents found this format overwhelming, but they generally found it less overwhelming than format 2. * Several respondents disagreed with the current layout and gave specific feedback about which sections should be prioritised within each column. ### Section Rankings *Ordered based on average ranking. Arrows are shown relative to the order of the associated section in the question on the survey.* **Insights:** * When writing model cards, respondents generally said they would write a model card in the same order in which the sections were listed in the survey question. * When ranking the sections of the model card by ease/quickness of writing, consensus was that the sections on uses and limitations and risks were the most difficult. * When reading model cards, respondents said they looked at the cards' sections in an order that was close to – but not perfectly aligned with – the order in which the sections were listed in the survey question. 
![user studies results 1](https://huggingface.co/datasets/huggingface/documentation-images/blob/main/hub/usaer-studes-responses(1).png) ![user studies results 2](https://huggingface.co/datasets/huggingface/documentation-images/blob/main/hub/user-studies-responses(2).png) > [!TIP] > [Check out the Appendix](./model-card-appendix) Acknowledgements ================ We want to acknowledge and thank [Bibi Ofuya](https://www.figma.com/proto/qrPCjWfFz5HEpWqQ0PJSWW/Bibi's-Portfolio?page-id=0%3A1&node-id=1%3A28&viewport=243%2C48%2C0.2&scaling=min-zoom&starting-point-node-id=1%3A28) for her question creation and her guidance on user-focused ordering and presentation during the user studies. [^1]: See https://towardsdatascience.com/dag-card-is-the-new-model-card-70754847a111 --- **Please cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Skills https://huggingface.co/docs/hub/agents-skills.md # Skills > [!TIP] > Looking for the `hf` CLI Skill? It's the quickest way to connect your agent to the Hugging Face ecosystem. See the [Hugging Face CLI for AI Agents](./agents-cli) guide. Hugging Face provides a curated set of Skills built for AI builders. Train models, create datasets, run evaluations, track experiments. Each Skill is a self-contained `SKILL.md` that your agent follows while working on the task. Skills work with all major coding agents: Claude Code, OpenAI Codex, Google Gemini CLI, and Cursor. Learn more about the format at [agentskills.io](https://agentskills.io). ## Installation **Claude Code:** ```bash # register the skills marketplace /plugin marketplace add huggingface/skills # install a specific Skill /plugin install @huggingface/skills ``` **Codex:** Copy or symlink skills from the [repository](https://github.com/huggingface/skills) into one of Codex's standard `.agents/skills` locations (e.g. `$REPO_ROOT/.agents/skills` or `$HOME/.agents/skills`). 
Codex discovers them automatically via the Agent Skills standard. Alternatively, use the bundled [`agents/AGENTS.md`](https://github.com/huggingface/skills/blob/main/agents/AGENTS.md) as a fallback. **Gemini CLI:** ```bash gemini extensions install https://github.com/huggingface/skills.git --consent ``` **Cursor:** Install via the Cursor plugin flow using the [repository URL](https://github.com/huggingface/skills). The repo includes `.cursor-plugin/plugin.json` and `.mcp.json` manifests. ## Available Skills | Skill | What it does | | ----- | ------------ | | [`hf-cli`](https://github.com/huggingface/skills/tree/main/skills/hf-cli) | Hub operations via the `hf` CLI: download, upload, manage repos, run jobs | | [`huggingface-datasets`](https://github.com/huggingface/skills/tree/main/skills/huggingface-datasets) | Explore datasets, paginate rows, search text, apply filters | | [`huggingface-llm-trainer`](https://github.com/huggingface/skills/tree/main/skills/huggingface-llm-trainer) | Train or fine-tune LLMs with TRL (SFT, DPO, GRPO) on HF Jobs | | [`huggingface-vision-trainer`](https://github.com/huggingface/skills/tree/main/skills/huggingface-vision-trainer) | Train object detection and image classification models | | [`huggingface-community-evals`](https://github.com/huggingface/skills/tree/main/skills/huggingface-community-evals) | Run evaluations against models on the Hugging Face Hub on local hardware | | [`huggingface-trackio`](https://github.com/huggingface/skills/tree/main/skills/huggingface-trackio) | Track and visualize ML training experiments with Trackio | | [`huggingface-papers`](https://github.com/huggingface/skills/tree/main/skills/huggingface-papers) | Look up and read Hugging Face paper pages in markdown | | [`huggingface-paper-publisher`](https://github.com/huggingface/skills/tree/main/skills/huggingface-paper-publisher) | Publish and manage research papers on the Hub | | 
[`huggingface-tool-builder`](https://github.com/huggingface/skills/tree/main/skills/huggingface-tool-builder) | Build reusable scripts for HF API operations | | [`gradio`](https://github.com/huggingface/skills/tree/main/skills/huggingface-gradio) | Build Gradio web UIs and demos | | [`transformers-js`](https://github.com/huggingface/skills/tree/main/skills/transformers-js) | Run ML models in JavaScript/TypeScript with WebGPU/WASM | ## Using Skills Once installed, mention the Skill directly in your prompt: - "Use the HF model trainer Skill to fine-tune Qwen3-0.6B with SFT on the Capybara dataset" - "Use the HF evaluation Skill to add benchmark results to my model card" - "Use the HF datasets Skill to create a new dataset from these examples" Your agent loads the corresponding `SKILL.md` instructions and helper scripts automatically. ## Resources - [Skills Repository](https://github.com/huggingface/skills) - Browse and contribute - [Agent Skills format](https://agentskills.io/home) - Specification and docs - [CLI Guide](./agents-cli) - Hugging Face CLI for AI Agents - [MCP Guide](./agents-mcp) - Use alongside Skills ### GGUF usage with GPT4All https://huggingface.co/docs/hub/gguf-gpt4all.md # GGUF usage with GPT4All [GPT4All](https://gpt4all.io/) is an open-source LLM application developed by [Nomic](https://nomic.ai/). Version 2.7.2 introduces a brand new, experimental feature called `Model Discovery`. `Model Discovery` provides a built-in way to search for and download GGUF models from the Hub. To get started, open GPT4All and click `Download Models`. From here, you can use the search bar to find a model. After you have selected and downloaded a model, you can go to `Settings` and provide an appropriate prompt template in the GPT4All format (`%1` and `%2` placeholders). Then from the main page, you can select the model from the list of installed models and start a conversation. 
### GGUF https://huggingface.co/docs/hub/gguf.md # GGUF Hugging Face Hub supports all file formats, but has built-in features for the [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md), a binary format that is optimized for quick loading and saving of models, making it highly efficient for inference purposes. GGUF is designed for use with GGML and other executors. GGUF was developed by [@ggerganov](https://huggingface.co/ggerganov), who is also the developer of [llama.cpp](https://github.com/ggerganov/llama.cpp), a popular C/C++ LLM inference framework. Models initially developed in frameworks like PyTorch can be converted to GGUF format for use with those engines. Unlike tensor-only file formats like [safetensors](https://huggingface.co/docs/safetensors) – which is also a recommended model format for the Hub – GGUF encodes both the tensors and a standardized set of metadata. ## Finding GGUF files You can browse all models with GGUF files by filtering by the GGUF tag: [hf.co/models?library=gguf](https://huggingface.co/models?library=gguf). Moreover, you can use the [ggml-org/gguf-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) tool to convert/quantize your model weights into GGUF weights. For example, you can check out [TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF) to see GGUF files in action. ## Viewer for metadata & tensors info The Hub has a viewer for GGUF files that lets a user check out metadata & tensors info (name, shape, precision). The viewer is available on the model page ([example](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF?show_tensors=mixtral-8x7b-instruct-v0.1.Q4_0.gguf)) & the files page ([example](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/tree/main?show_tensors=mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf)). 
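Because GGUF is a binary format with a fixed header, the "tensors plus metadata" layout can be illustrated with a small sketch that builds and parses the leading header fields (4-byte magic `GGUF`, `uint32` version, `uint64` tensor count, `uint64` metadata KV count, all little-endian, per the GGUF spec linked above). The helper names are ours, not part of any library:

```javascript
// Minimal sketch: serialize and parse the fixed GGUF header fields.
// Helper names are illustrative, not part of any library.
function writeGgufHeader(version, tensorCount, kvCount) {
  const buf = new ArrayBuffer(24);
  const view = new DataView(buf);
  // 4-byte magic "GGUF"
  [..."GGUF"].forEach((c, i) => view.setUint8(i, c.charCodeAt(0)));
  view.setUint32(4, version, true);                // uint32, little-endian
  view.setBigUint64(8, BigInt(tensorCount), true); // uint64 tensor count
  view.setBigUint64(16, BigInt(kvCount), true);    // uint64 metadata KV count
  return buf;
}

function readGgufHeader(buf) {
  const view = new DataView(buf);
  const magic = String.fromCharCode(...new Uint8Array(buf, 0, 4));
  if (magic !== "GGUF") throw new Error("not a GGUF file");
  return {
    magic,
    version: view.getUint32(4, true),
    tensorCount: Number(view.getBigUint64(8, true)),
    kvCount: Number(view.getBigUint64(16, true)),
  };
}
```

The tensor info entries and metadata key/value pairs follow these fixed fields in the real format; parsing those is what the `@huggingface/gguf` package below handles for you.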
## Usage with open-source tools * [llama.cpp](./gguf-llamacpp) * [LM Studio](./lmstudio) * [GPT4All](./gguf-gpt4all) * [Ollama](./ollama) ## Parsing the metadata with @huggingface/gguf We've also created a JavaScript GGUF parser that works on remotely hosted files (e.g. Hugging Face Hub). ```bash npm install @huggingface/gguf ``` ```ts import { gguf } from "@huggingface/gguf"; // remote GGUF file from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF const URL_LLAMA = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/191239b/llama-2-7b-chat.Q2_K.gguf"; const { metadata, tensorInfos } = await gguf(URL_LLAMA); ``` Find more information [here](https://github.com/huggingface/huggingface.js/tree/main/packages/gguf). ## Quantization Types | type | source | description | |---------------------------|--------|-------------| | F64 | [Wikipedia](https://en.wikipedia.org/wiki/Double-precision_floating-point_format) | 64-bit standard IEEE 754 double-precision floating-point number. | | I64 | [GH](https://github.com/ggerganov/llama.cpp/pull/6062) | 64-bit fixed-width integer number. | | F32 | [Wikipedia](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) | 32-bit standard IEEE 754 single-precision floating-point number. | | I32 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 32-bit fixed-width integer number. | | F16 | [Wikipedia](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) | 16-bit standard IEEE 754 half-precision floating-point number. | | BF16 | [Wikipedia](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) | 16-bit shortened version of the 32-bit IEEE 754 single-precision floating-point number. | | I16 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 16-bit fixed-width integer number. | | Q8_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 8-bit quantization (`q`). Each block has 256 weights. Only used for quantizing intermediate results. 
All 2-6 bit dot products are implemented for this quantization type. Weight formula: `w = q * block_scale`. | | I8 | [GH](https://github.com/ggerganov/llama.cpp/pull/6045) | 8-bit fixed-width integer number. | | Q6_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 6-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(8-bit)`, resulting in 6.5625 bits-per-weight. | | Q5_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 5-bit quantization (`q`). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: `w = q * block_scale(6-bit) + block_min(6-bit)`, resulting in 5.5 bits-per-weight. | | Q4_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 4-bit quantization (`q`). Super-blocks with 8 blocks, each block has 32 weights. Weight formula: `w = q * block_scale(6-bit) + block_min(6-bit)`, resulting in 4.5 bits-per-weight. | | Q3_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 3-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(6-bit)`, resulting in 3.4375 bits-per-weight. | | Q2_K | [GH](https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305) | 2-bit quantization (`q`). Super-blocks with 16 blocks, each block has 16 weights. Weight formula: `w = q * block_scale(4-bit) + block_min(4-bit)`, resulting in 2.625 bits-per-weight. | | IQ4_NL | [GH](https://github.com/ggerganov/llama.cpp/pull/5590) | 4-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`. | | IQ4_XS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 4-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 4.25 bits-per-weight. 
| | IQ3_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 3-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 3.44 bits-per-weight. | | IQ3_XXS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 3-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 3.06 bits-per-weight. | | IQ2_XXS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.06 bits-per-weight. | | IQ2_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.5 bits-per-weight. | | IQ2_XS | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 2-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 2.31 bits-per-weight. | | IQ1_S | [HF](https://huggingface.co/CISCai/OpenCodeInterpreter-DS-6.7B-SOTA-GGUF/blob/main/README.md?code=true#L59-L70) | 1-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 1.56 bits-per-weight. | | IQ1_M | [GH](https://github.com/ggerganov/llama.cpp/pull/6302) | 1-bit quantization (`q`). Super-blocks with 256 weights. Weight `w` is obtained using `super_block_scale` & `importance matrix`, resulting in 1.75 bits-per-weight. 
| | TQ1_0 | [GH](https://github.com/ggml-org/llama.cpp/pull/8151) | Ternary quantization. | | TQ2_0 | [GH](https://github.com/ggml-org/llama.cpp/pull/8151) | Ternary quantization. | | MXFP4 | [GH](https://github.com/ggml-org/llama.cpp/pull/15091) | 4-bit Microscaling Block Floating Point. | | **Legacy types** | | | | Q8_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 8-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). | | Q8_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 8-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). | | Q5_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 5-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). | | Q5_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 5-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). | | Q4_0 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557654249) | 4-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale`. Legacy quantization method (not used widely as of today). | | Q4_1 | [GH](https://github.com/huggingface/huggingface.js/pull/615#discussion_r1557682290) | 4-bit round-to-nearest quantization (`q`). Each block has 32 weights. Weight formula: `w = q * block_scale + block_minimum`. Legacy quantization method (not used widely as of today). 
| *if there's any inaccuracy on the table above, please open a PR on [this file](https://github.com/huggingface/huggingface.js/blob/main/packages/gguf/src/quant-descriptions.ts).* ### Jupyter Notebooks on the Hugging Face Hub https://huggingface.co/docs/hub/notebooks.md # Jupyter Notebooks on the Hugging Face Hub [Jupyter notebooks](https://jupyter.org/) are a very popular format for sharing code and data analysis for machine learning and data science. They are interactive documents that can contain code, visualizations, and text. ## Open models in Google Colab and Kaggle When you visit a model page on the Hugging Face Hub, youโ€™ll see a new โ€œGoogle Colabโ€/ "Kaggle" button in the โ€œUse this modelโ€ drop down. Clicking this will generate a ready-to-run notebook with basic code to load and test the model. This is perfect for quick prototyping, inference testing, or fine-tuning experiments โ€” all without leaving your browser. ![Google Colab and Kaggle option for models on the Hub](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/hf-google-colab/gemma3-4b-it-dark.png) Users can also access a ready-to-run notebook by appending /colab to the model cardโ€™s URL. As an example, for the latest Gemma 3 4B IT model, the corresponding Colab notebook can be reached by taking the model card URL: https://huggingface.co/google/gemma-3-4b-it And then appending `/colab` to it: https://huggingface.co/google/gemma-3-4b-it/colab and similarly for kaggle: https://huggingface.co/google/gemma-3-4b-it/kaggle If a model repository includes a file called `notebook.ipynb`, we will use it for Colab and Kaggle instead of the auto-generated notebook content. Model authors can provide tailored examples, detailed walkthroughs, or advanced use cases while still benefiting from one-click Colab integration. [NousResearch/Genstruct-7B](https://huggingface.co/NousResearch/Genstruct-7B) is one such example. 
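Since these notebook links follow a fixed URL scheme, they are easy to build programmatically. The helper below is our own small sketch, not an official API:

```python
def notebook_links(model_id: str) -> dict:
    """Build the ready-to-run Colab and Kaggle notebook URLs for a Hub model id."""
    base = f"https://huggingface.co/{model_id}"
    # appending /colab or /kaggle to a model card URL opens the generated notebook
    return {"colab": f"{base}/colab", "kaggle": f"{base}/kaggle"}
```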
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/hf-google-colab/genstruct-notebook-dark.png)

## Rendering .ipynb Jupyter notebooks on the Hub

Under the hood, Jupyter Notebook files (usually shared with a `.ipynb` extension) are JSON files. While viewing these files directly is possible, the format is not intended to be read by humans. The Hub has rendering support for notebooks hosted on the Hub, which means notebooks are displayed in a human-readable format.

![Before and after notebook rendering](https://huggingface.co/blog/assets/135_notebooks-hub/before_after_notebook_rendering.png)

Notebooks will be rendered when included in any type of repository on the Hub. This includes models, datasets, and Spaces.

### Launch in Google Colab

[Google Colab](https://colab.google/) is a free Jupyter Notebook environment that requires no setup and runs entirely in the cloud. It's a great way to run Jupyter Notebooks without having to install anything on your local machine. All `.ipynb` files hosted on the Hub are automatically given an "Open in Colab" button. This allows you to open the notebook in Colab with a single click.

### Pull requests and Discussions https://huggingface.co/docs/hub/repositories-pull-requests-discussions.md

# Pull requests and Discussions

Hub Pull requests and Discussions allow users to make community contributions to repositories. Pull requests and discussions work the same way for all repo types. At a high level, the aim is to build a simpler version of other git hosts' (like GitHub's) PRs and Issues:

- no forks are involved: contributors push to a special `ref` branch directly on the source repo.
- there's no hard distinction between discussions and PRs: they are essentially the same, so they are displayed in the same lists.
- they are streamlined for ML (i.e. models/datasets/spaces repos), not arbitrary repos.
_Note: Pull Requests and discussions can be enabled or disabled from the [repository settings](./repositories-settings#disabling-discussions--pull-requests)._

## List

By going to the community tab in any repository, you can see all Discussions and Pull requests. You can also filter to only see the ones that are open.

## View

The Discussion page allows you to see the comments from different users. If it's a Pull Request, you can see all the changes by going to the Files changed tab.

## Editing a Discussion / Pull request title

If you opened a PR or discussion, are the author of the repository, or have write access to it, you can edit the discussion title by clicking on the pencil button.

## Pin a Discussion / Pull Request

If you have write access to a repository, you can pin discussions and Pull Requests. Pinned discussions appear at the top of all the discussions.

## Lock a Discussion / Pull Request

If you have write access to a repository, you can lock discussions or Pull Requests. Once a discussion is locked, previous comments are still visible and users won't be able to add new comments.

## Comment editing and moderation

If you wrote a comment or have write access to the repository, you can edit the content of the comment from the contextual menu in the top-right corner of the comment box. Once the comment has been edited, a new link will appear above the comment. This link shows the edit history. You can also hide a comment. Hiding a comment is irreversible, and nobody will be able to see its content or edit it anymore. Read also [moderation](./moderation) to see how to report an abusive comment.

## Can I use Markdown and LaTeX in my comments and discussions?

Yes! You can use Markdown to add formatting to your comments. Additionally, you can use LaTeX for mathematical typesetting; your formulas will be rendered with [KaTeX](https://katex.org/) before being parsed as Markdown. For LaTeX equations, you have to use the following delimiters:

- `$$ ... $$` for display mode
- `\\(...\\)` for inline mode (no space between the slashes and the parentheses).

## How do I manage Pull requests locally?

Let's assume your PR number is 42.

```bash
git fetch origin refs/pr/42:pr/42
git checkout pr/42
# Do your changes
git add .
git commit -m "Add your change"
git push origin pr/42:refs/pr/42
```

### Draft mode

Draft mode is the default status when opening a new Pull request from scratch in "Advanced mode". With this status, other contributors know that your Pull request is under work and it cannot be merged. When your branch is ready, just hit the "Publish" button to change the status of the Pull request to "Open". Note that once published you cannot go back to draft mode.

## Deleting a Pull request ref

When a Pull request is closed or merged, you can delete its associated git ref (the branch storing the PR's commits) to free up storage space. After closing or merging a PR, you'll see a notice at the bottom of the discussion showing the estimated storage that could be freed by deleting the ref. Click the "Delete ref" button to permanently remove the PR's git ref and reclaim the storage.

> [!TIP]
> This is especially useful when the main branch has been squashed and files removed later on. Those files remain in the PR branch history even if they weren't added by the PR itself, taking up storage that could be freed.

> [!WARNING]
> Deleting a PR ref is irreversible. Once deleted, you won't be able to fetch or checkout the PR's commits locally anymore.

## Pull requests advanced usage

### Where in the git repo are changes stored?

Our Pull requests do not use forks and branches, but instead custom "branches" called `refs` that are stored directly on the source repo. [Git References](https://git-scm.com/book/en/v2/Git-Internals-Git-References) are the internal machinery of git which already stores tags and branches.
The advantage of using custom refs (like `refs/pr/42`, for instance) instead of branches is that they're not fetched (by default) by people (including the repo "owner") cloning the repo, but they can still be fetched on demand.

### Fetching all Pull requests: for git magicians 🧙‍♀️

You can tweak your local **refspec** to fetch all Pull requests:

1. Fetch:

```bash
git fetch origin refs/pr/*:refs/remotes/origin/pr/*
```

2. Create a local branch tracking the ref:

```bash
git checkout pr/{PR_NUMBER}
# for example: git checkout pr/42
```

3. If you make local changes, push to the PR ref:

```bash
git push origin pr/{PR_NUMBER}:refs/pr/{PR_NUMBER}
# for example: git push origin pr/42:refs/pr/42
```

### JupyterLab on Spaces https://huggingface.co/docs/hub/spaces-sdks-docker-jupyter.md

# JupyterLab on Spaces

[JupyterLab](https://jupyter.org/) is a web-based interactive development environment for Jupyter notebooks, code, and data. It is a great tool for data science and machine learning, and it is widely used by the community. With Hugging Face Spaces, you can deploy your own JupyterLab instance and use it for development directly from the Hugging Face website.

## ⚡️ Deploy a JupyterLab instance on Spaces

You can deploy JupyterLab on Spaces with just a few clicks. First, go to [this link](https://huggingface.co/new-space?template=SpacesExamples/jupyterlab) or click the button below:

Creating the Space requires you to define:

* An **Owner**: either your personal account or an organization you're a part of.
* A **Space name**: the name of the Space within the account you're creating the Space in.
* The **Visibility**: _private_ if you want the Space to be visible only to you or your organization, or _public_ if you want it to be visible to other users.
* The **Hardware**: the hardware you want to use for your JupyterLab instance. This goes from CPUs to H100s.
* You can optionally configure a `JUPYTER_TOKEN` password to protect your JupyterLab workspace.
When unspecified, it defaults to `huggingface`. We strongly recommend setting this up if your Space is public or if the Space is in an organization.

Storage in Hugging Face Spaces is ephemeral, and the data you store in the default configuration can be lost in a reboot or reset of the Space. We recommend saving your work to a remote location or attaching a [Storage Bucket](https://huggingface.co/docs/hub/storage-buckets) to your Space for persistent data.

## Read more

- [HF Docker Spaces](https://huggingface.co/docs/hub/spaces-sdks-docker)

If you have any feedback or change requests, please don't hesitate to reach out to the owners on the [Feedback Discussion](https://huggingface.co/spaces/SpacesExamples/jupyterlab/discussions/3).

## Acknowledgments

This template was created by [camenduru](https://twitter.com/camenduru) and [nateraw](https://huggingface.co/nateraw), with contributions from [osanseviero](https://huggingface.co/osanseviero) and [azzr](https://huggingface.co/azzr).

### Perform vector similarity search https://huggingface.co/docs/hub/datasets-duckdb-vector-similarity-search.md

# Perform vector similarity search

The Fixed-Length Arrays feature was added in DuckDB version 0.10.0. This lets you use vector embeddings in DuckDB tables, making your data analysis even more powerful. Additionally, the `array_cosine_similarity` function was introduced. This function measures the cosine of the angle between two vectors, indicating their similarity: a value of 1 means they're perfectly aligned, 0 means they're perpendicular, and -1 means they're completely opposite.

Let's explore how to use this function for similarity searches. In this section, we'll show you how to perform similarity searches using DuckDB. We will use the [asoria/awesome-chatgpt-prompts-embeddings](https://huggingface.co/datasets/asoria/awesome-chatgpt-prompts-embeddings) dataset.
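As a reference for what `array_cosine_similarity` computes, here is the same formula in plain Python. This is a sketch for intuition only, not how DuckDB implements it:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # 1 -> perfectly aligned, 0 -> perpendicular, -1 -> completely opposite
    return dot / (norm_a * norm_b)
```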
First, let's preview a few records from the dataset:

```bash
FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
SELECT act, prompt, len(embedding) AS embed_len
LIMIT 3;

┌──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────┐
│         act          │                                                                                  prompt                                                                                   │ embed_len │
│       varchar        │                                                                                  varchar                                                                                  │   int64   │
├──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────┤
│ Linux Terminal       │ I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output insid… │       384 │
│ English Translator…  │ I want you to act as an English translator, spelling corrector and improver. I will speak to you in any language and you will detect the language, translate it and answer… │       384 │
│ `position` Intervi…  │ I want you to act as an interviewer. I will be the candidate and you will ask me the interview questions for the `position` position. I want you to only reply as the inte… │       384 │
└──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────┘
```

Next, let's choose an embedding to use for the similarity search:

```bash
FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
SELECT embedding
WHERE act = 'Linux Terminal';

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                                                   embedding                                                                                                    │
│                                                                                                    float[]                                                                                                     │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ [-0.020781303, -0.029143505, -0.0660217, -0.00932716, -0.02601602, -0.011426172, 0.06627567, 0.11941507, 0.0013917526, 0.012889079, 0.053234346, -0.07380514, 0.04871567, -0.043601237, -0.0025319182, 0.0448… │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

Now, let's use the selected embedding to find similar records:

```bash
SELECT act, prompt,
       array_cosine_similarity(
         embedding::float[384],
         (SELECT embedding FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet' WHERE act = 'Linux Terminal')::float[384]
       ) AS similarity
FROM 'hf://datasets/asoria/awesome-chatgpt-prompts-embeddings/data/*.parquet'
ORDER BY similarity DESC
LIMIT 3;

┌──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┐
│         act          │                                                                                  prompt                                                                                   │ similarity │
│       varchar        │                                                                                  varchar                                                                                  │   float    │
├──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤
│ Linux Terminal       │ I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output insi… │        1.0 │
│ JavaScript Console   │ I want you to act as a javascript console. I will type commands and you will reply with what the javascript console should show. I want you to only reply with the termin… │  0.7599728 │
│ R programming Inte…  │ I want you to act as a R interpreter. I'll type commands and you'll reply with what the terminal should show. I want you to only reply with the terminal output inside on… │  0.7303775 │
└──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┘
```

That's it! You have successfully performed a vector similarity search using DuckDB.

### Blog Articles for Organizations https://huggingface.co/docs/hub/enterprise-blog-articles.md

# Blog Articles for Organizations

> [!WARNING]
> This feature is part of the Team & Enterprise plans.

Blog Articles allow Team and Enterprise organizations to publish long-form content directly under your organization profile, enabling you to share model releases, research updates, and announcements with the broader community.

## Publishing as an Organization

When creating a new article at [huggingface.co/new-blog](https://huggingface.co/new-blog), select your organization from the dropdown to publish as the organization rather than as an individual. Once published, the article will appear on your organization's profile page.

## Permissions

To publish blog articles under an organization namespace, members need `write` or `admin` role at the organization level. See [Access Control in Organizations](./organizations-security) for more details on roles.

> [!NOTE]
> Blog article permissions are currently tied to organization-level roles and cannot be scoped using [Resource Groups](./security-resource-groups).
Resource Groups only control access to repositories (models, datasets, and Spaces), not blog articles.

### Uploading models https://huggingface.co/docs/hub/models-uploading.md

# Uploading models

To upload models to the Hub, you'll need to create an account at [Hugging Face](https://huggingface.co/join). Models on the Hub are [Git-based repositories](./repositories), which give you versioning, branches, discoverability and sharing features, integration with dozens of libraries, and more! You have control over what you want to upload to your repository, which could include checkpoints, configs, and any other files.

You can link repositories with an individual user, such as [osanseviero/fashion_brands_patterns](https://huggingface.co/osanseviero/fashion_brands_patterns), or with an organization, such as [facebook/bart-large-xsum](https://huggingface.co/facebook/bart-large-xsum). Organizations can collect models related to a company, community, or library! If you choose an organization, the model will be featured on the organization's page, and every member of the organization will have the ability to contribute to the repository. You can create a new organization [here](https://huggingface.co/organizations/new).

> **_NOTE:_** Models do NOT need to be compatible with the Transformers/Diffusers libraries to get download metrics. Any custom model is supported. Read more below!

There are several ways to upload models so that they are nicely integrated into the Hub and get [download metrics](models-download-stats), described below.

- In case your model is designed for a library that has [built-in support](#upload-from-a-library-with-built-in-support), you can use the methods provided by the library. Custom models that use `trust_remote_code=True` can also leverage these methods.
- In case your model is a custom PyTorch model, you can leverage the [`PyTorchModelHubMixin` class](#upload-a-pytorch-model-using-huggingfacehub), which adds `from_pretrained` and `push_to_hub` to any `nn.Module` class, just like models in the Transformers, Diffusers and Timm libraries.
- In addition to programmatic uploads, you can always use the [web interface](#using-the-web-interface) or [the git command line](#using-git).

Once your model is uploaded, we suggest adding a [Model Card](./model-cards) to your repo to document your model and make it more discoverable.

Example [repository](https://huggingface.co/LiheYoung/depth_anything_vitl14) that leverages [PyTorchModelHubMixin](#upload-a-pytorch-model-using-huggingfacehub). Downloads are shown on the right.

## Using the web interface

To create a brand new model repository, visit [huggingface.co/new](http://huggingface.co/new). Then follow these steps:

1. In the "Files and versions" tab, select "Add File" and specify "Upload File":
2. From there, select a file from your computer to upload and leave a helpful commit message to know what you are uploading:
3. Afterwards, click **Commit changes** to upload your model to the Hub!
4. Inspect files and history. You can check your repository with all the recently added files! The UI allows you to explore the model files and commits and to see the diff introduced by each commit:
5. Add metadata. You can add metadata to your model card. You can specify:
   * the type of task this model is for, enabling widgets and Inference Providers.
   * the used library (`transformers`, `spaCy`, etc.)
   * the language
   * the dataset
   * metrics
   * license
   * a lot more!
   Read more about model tags [here](./model-cards#model-card-metadata).
6. Add TensorBoard traces. Any repository that contains TensorBoard traces (filenames that contain `tfevents`) is categorized with the [`TensorBoard` tag](https://huggingface.co/models?filter=tensorboard).
As a convention, we suggest that you save traces under the `runs/` subfolder. The "Training metrics" tab then makes it easy to review charts of the logged variables, like the loss or the accuracy. Models trained with 🤗 Transformers will generate [TensorBoard traces](https://huggingface.co/docs/transformers/main_classes/callback#transformers.integrations.TensorBoardCallback) by default if [`tensorboard`](https://pypi.org/project/tensorboard/) is installed.

## Upload from a library with built-in support

First check if your model is from a library that has built-in support to push to/load from the Hub, like Transformers, Diffusers, Timm, Asteroid, etc.: https://huggingface.co/docs/hub/models-libraries. Below we'll show how easy this is for a library like Transformers:

```python
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)
model.push_to_hub("nielsr/my-awesome-bert-model")

# reload
model = BertModel.from_pretrained("nielsr/my-awesome-bert-model")
```

Some libraries, like Transformers, support loading [code from the Hub](https://huggingface.co/docs/transformers/custom_models). This is a way to make your model work with Transformers using the `trust_remote_code=True` flag. You may want to consider this option instead of a full-fledged library integration.

## Upload a PyTorch model using huggingface_hub

In case your model is a (custom) PyTorch model, you can leverage the `PyTorchModelHubMixin` [class](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) available in the [huggingface_hub](https://github.com/huggingface/huggingface_hub) Python library. It is a minimal class which adds `from_pretrained` and `push_to_hub` capabilities to any `nn.Module`, along with download metrics.
Here is how to use it (assuming you have run `pip install huggingface_hub`):

```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin


class MyModel(
    nn.Module,
    PyTorchModelHubMixin,
    # optionally, you can add metadata which gets pushed to the model card
    repo_url="your-repo-url",
    pipeline_tag="text-to-image",
    license="mit",
):
    def __init__(self, num_channels: int, hidden_size: int, num_classes: int):
        super().__init__()
        self.param = nn.Parameter(torch.rand(num_channels, hidden_size))
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        return self.linear(x + self.param)


# create model
config = {"num_channels": 3, "hidden_size": 32, "num_classes": 10}
model = MyModel(**config)

# save locally
model.save_pretrained("my-awesome-model")

# push to the hub
model.push_to_hub("your-hf-username/my-awesome-model")

# reload
model = MyModel.from_pretrained("your-hf-username/my-awesome-model")
```

As you can see, the only requirement is that your model inherits from `PyTorchModelHubMixin`. All instance attributes will be automatically serialized to a `config.json` file. Note that the `__init__` method can only take arguments which are JSON serializable. Python dataclasses are supported.

This comes with automated download metrics, meaning that you'll be able to see how many times the model is downloaded, the same way as for models integrated natively in the Transformers, Diffusers or Timm libraries. With this mixin class, each separate checkpoint is stored on the Hub in a single repository consisting of 2 files:

- a `pytorch_model.bin` or `model.safetensors` file containing the weights
- a `config.json` file which is a serialized version of the model configuration.

This class is used for counting download metrics: every time a user calls `from_pretrained` to load a `config.json`, the count goes up by one. See [this guide](https://huggingface.co/docs/hub/models-download-stats) regarding automated download metrics.
It's recommended to add a model card to each checkpoint so that people can read what the model is about, have a link to the paper, etc. Visit [the huggingface_hub's documentation](https://huggingface.co/docs/huggingface_hub/guides/integrations) to learn more. Alternatively, one can also simply programmatically upload files or folders to the hub: https://huggingface.co/docs/huggingface_hub/guides/upload. ## Using Git Finally, since model repos are just Git repositories, you can also use Git to push your model files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started#terminal) to learn about using the `git` CLI to commit and push your models. ### Appendix https://huggingface.co/docs/hub/model-card-appendix.md # Appendix ## Appendix A: User Study _Full text responses to key questions_ ### How would you define model cards? ***Insight: Respondents had generally similar views of what model cards are: documentation focused on issues like training, use cases, and bias/limitations*** * Model cards are model descriptions, both of how they were trained, their use cases, and potential biases and limitations * Documents describing the essential features of a model in order for the reader/user to understand the artefact he/she has in front, the background/training, how it can be used, and its technical/ethical limitations. * They serve as a living artefact of models to document them. Model cards contain information that go from a high level description of what the specific model can be used to, to limitations, biases, metrics, and much more. They are used primarily to understand what the model does. * Model cards are to models what GitHub READMEs are to GitHub projects. It tells people all the information they need to know about the model. If you don't write one, nobody will use your model. * From what I understand, a model card uses certain benchmarks (geography, culture, sex, etc) to define both a model's usability and limitations. 
It's essentially a model's 'nutrition facts label' that can show how a model was created and educates others on its reusability. * Model cards are the metadata and documentation about the model, everything I need to know to use the model properly: info about the model, what paper introduced it, what dataset was it trained on or fine-tuned on, whom does it belong to, are there known risks and limitations with this model, any useful technical info. * IMO model cards are a brief presentation of a model which includes: * short summary of the architectural particularities of the model * describing the data it was trained on * what is the performance on reference datasets (accuracy and speed metrics if possible) * limitations * how to use it in the context of the Transformers library * source (original article, Github repo,...) * Easily accessible documentation that any background can read and learn about critical model components and social impact ### What do you like about model cards? * They are interesting to teach people about new models * As a non-technical guy, the possibility of getting to know the model, to understand the basics of it, it's an opportunity for the author to disclose its innovation in a transparent & explainable (i.e. trustworthy) way. * I like interactive model cards with visuals and widgets that allow me to try the model without running any code. * What I like about good model cards is that you can find all the information you need about that particular model. * Model cards are revolutionary to the world of AI ethics. It's one of the first tangible steps in mitigating/educating on biases in machine learning. They foster greater awareness and accountability! * Structured, exhaustive, the more info the better. * It helps to get an understanding of what the model is good (or bad) at. * Conciseness and accessibility ### What do you dislike about model cards? 
* Might get to technical and/or dense * They contain lots of information for different audiences (researchers, engineers, non engineers), so it's difficult to explore model cards with an intended use cases. * [NOTE: this comment could be addressed with toggle views for different audiences] * Good ones are time consuming to create. They are hard to test to make sure the information is up to date. Often times, model cards are formatted completely differently - so you have to sort of figure out how that certain individual has structured theirs. * [NOTE: this comment helps demonstrate the value of a standardized format and automation tools to make it easier to create model cards] * Without the help of the community to pitch in supplemental evals, model cards might be subject to inherent biases that the developer might not be aware of. It's early days for them, but without more thorough evaluations, a model card's information might be too limited. * Empty model cards. No license information - customers need that info and generally don't have it. * They are usually either too concise or too verbose. * writing them lol bless you ### Other key new insights * Model cards are best filled out when done by people with different roles: Technical specifications can generally only be filled out by the developers; ethical considerations throughout are generally best informed by people who tend to work on ethical issues. * Model users care a lot about licences -- specifically, whether a model can legally be used for a specific task. 
## Appendix B: Landscape Analysis _Overview of the state of model documentation in Machine Learning_ ### MODEL CARD EXAMPLES Examples of model cards and closely-related variants include: * Google Cloud: [Face Detection](https://modelcards.withgoogle.com/face-detection), [Object Detection](https://modelcards.withgoogle.com/object-detection) * Google Research: [ML Kit Vision Models](https://developers.google.com/s/results/ml-kit?q=%22Model%20Card%22), [Face Detection](https://sites.google.com/view/perception-cv4arvr/blazeface), [Conversation AI](https://github.com/conversationai/perspectiveapi/tree/main/model-cards) * OpenAI: [GPT-3](https://github.com/openai/gpt-3/blob/master/model-card.md), [GPT-2](https://github.com/openai/gpt-2/blob/master/model_card.md), [DALL-E dVAE](https://github.com/openai/DALL-E/blob/master/model_card.md), [CLIP](https://github.com/openai/CLIP-featurevis/blob/master/model-card.md) * [NVIDIA Model Cards](https://catalog.ngc.nvidia.com/models?filters=&orderBy=weightPopularASC&query=) * [Salesforce Model Cards](https://blog.salesforceairesearch.com/model-cards-for-ai-model-transparency/) * [Allen AI Model Cards](https://github.com/allenai/allennlp-models/tree/main/allennlp_models/modelcards) * [Co:here AI Model Cards](https://docs.cohere.ai/responsible-use/) * [Duke PULSE Model Card](https://arxiv.org/pdf/2003.03808.pdf) * [Stanford Dynasent](https://github.com/cgpotts/dynasent/blob/main/dynasent_modelcard.md) * [GEM Model Cards](https://gem-benchmark.com/model_cards) * Parl.AI: [Parl.AI sample model cards](https://github.com/facebookresearch/ParlAI/tree/main/docs/sample_model_cards), [BlenderBot 2.0 2.7B](https://github.com/facebookresearch/ParlAI/blob/main/parlai/zoo/blenderbot2/model_card.md) * [Perspective API Model Cards](https://github.com/conversationai/perspectiveapi/tree/main/model-cards) * See https://github.com/ivylee/model-cards-and-datasheets for more examples! 
### MODEL CARDS FOR LARGE LANGUAGE MODELS Large language models are often released with associated documentation. Large language models that have an associated model card (or related documentation tool) include: * [Big Science BLOOM model card](https://huggingface.co/bigscience/bloom) * [GPT-2 Model Card](https://github.com/openai/gpt-2/blob/master/model_card.md) * [GPT-3 Model Card](https://github.com/openai/gpt-3/blob/master/model-card.md) * [DALL-E 2 Preview System Card](https://github.com/openai/dalle-2-preview/blob/main/system-card.md) * [OPT-175B model card](https://arxiv.org/pdf/2205.01068.pdf) ### MODEL CARD GENERATION TOOLS Tools for programmatically or interactively generating model cards include: * [Salesforce Model Card Creation](https://help.salesforce.com/s/articleView?id=release-notes.rn_bi_edd_model_card.htm&type=5&release=232) * [TensorFlow Model Card Toolkit](https://ai.googleblog.com/2020/07/introducing-model-card-toolkit-for.html) * [Python library](https://pypi.org/project/model-card-toolkit/) * [GSA / US Census Bureau Collaboration on Model Card Generator](https://bias.xd.gov/resources/model-card-generator/) * [Parl.AI Auto Generation Tool](https://parl.ai/docs/tutorial_model_cards.html) * [VerifyML Model Card Generation Web Tool](https://www.verifyml.com) * [RMarkdown Template for Model Card as part of vetiver package](https://cran.r-project.org/web/packages/vetiver/vignettes/model-card.html) * [Databaseline ML Cards toolkit](https://databaseline.tech/ml-cards/) ### MODEL CARD EDUCATIONAL TOOLS Tools for understanding model cards and understanding how to create model cards include: * [Hugging Face Hub docs](https://huggingface.co/course/chapter4/4?fw=pt) * [Perspective API](https://developers.perspectiveapi.com/s/about-the-api-model-cards) * [Kaggle](https://www.kaggle.com/code/var0101/model-cards/tutorial) * [Code.org](https://studio.code.org/s/aiml-2021/lessons/8) * [UNICEF](https://unicef.github.io/inventory/data/model-card/) --- **Please 
cite as:** Ozoani, Ezi and Gerchick, Marissa and Mitchell, Margaret. Model Card Guidebook. Hugging Face, 2022. https://huggingface.co/docs/hub/en/model-card-guidebook ### Quickstart https://huggingface.co/docs/hub/jobs-quickstart.md # Quickstart In this guide you will run a Job to fine-tune an open source model on Hugging Face infrastructure in only a few minutes. Make sure you are logged in to Hugging Face and have access to your [Jobs page](https://huggingface.co/settings/jobs). Jobs are available to any user or organization with [pre-paid credits](https://huggingface.co/pricing). ## Getting started First install the Hugging Face CLI: ### 1. Install the CLI Recommended approach: ```bash >>> curl -LsSf https://hf.co/cli/install.sh | bash ``` Or using Homebrew: ```bash >>> brew install hf ``` Or using uv: ```bash >>> uv tool install hf ``` ### 2. Login to your Hugging Face account Login ```bash >>> hf auth login ``` ### 3. Create your first jobs using the `hf jobs` command Run a UV command or script ```bash >>> hf jobs uv run python -c 'print("Hello from the cloud!")' Job started with ID: 693aef401a39f67af5a41c0e View at: https://huggingface.co/jobs/lhoestq/693aef401a39f67af5a41c0e Hello from the cloud! ``` ```bash >>> echo "print('Hello from uv script!')" > script.py >>> hf jobs uv run script.py Job started with ID: 695f6cd8d2f3efac77e8cf7f View at: https://huggingface.co/jobs/lhoestq/695f6cd8d2f3efac77e8cf7f Hello from uv script! ``` Run a Docker command ```bash >>> hf jobs run ubuntu echo 'Hello from the cloud!' Job started with ID: 693aee76c67c9f186cfe233e View at: https://huggingface.co/jobs/lhoestq/693aee76c67c9f186cfe233e Hello from the cloud! ``` ### 4. Check your first jobs The job logs appear in your terminal, but you can also see them in your jobs page. 
Open the job page to see the job information, status and logs:

## The training script

Here is a simple training script to fine-tune a base model into a conversational model using Supervised Fine-Tuning (SFT). It uses the [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) model, the [trl-lib/Capybara](https://huggingface.co/datasets/trl-lib/Capybara) dataset, and the [TRL](https://huggingface.co/docs/trl/en/index) library, and saves the resulting model to your Hugging Face account under the name `"Qwen2.5-0.5B-SFT"`:

```python
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()
trainer.push_to_hub("Qwen2.5-0.5B-SFT")
```

Save this script as `train.py`; we can now run it with UV on Hugging Face Jobs.

## Run the training job

`hf jobs` takes several arguments: select the hardware with `--flavor`, choose a maximum duration with `--timeout`, and pass environment variables with `--env` and `--secrets`. Here we use the A100 Large GPU flavor with `--flavor a100-large` and pass your Hugging Face token as a secret with `--secrets HF_TOKEN` in order to be able to push the resulting model to your account. Moreover, UV accepts the `--with` argument to define Python dependencies, so we use `--with trl` to have the `trl` library available.

You can now run the final command, which looks like this:

```bash
hf jobs uv run \
    --flavor a100-large \
    --timeout 6h \
    --with trl \
    --secrets HF_TOKEN \
    train.py
```

The logs appear in your terminal, and you can safely Ctrl+C to stop streaming them; the job will keep running.

```
...
Downloaded nvidia-cudnn-cu12
Downloaded torch
Installed 66 packages in 233ms
Generating train split: 100%|██████████| 15806/15806 [00:00...
```

> Monitor GPU usage and other metrics in the CLI or use the [MacOS menu bar](./jobs-manage#macos-menu-bar).
Here is what the CLI shows:

```bash
>>> hf jobs stats
JOB ID                    CPU %  NUM CPU  MEM %  MEM USAGE         NET I/O          GPU UTIL %  GPU MEM %  GPU MEM USAGE
------------------------  -----  -------  -----  ----------------  ---------------  ----------  ---------  ---------------
695e83c5d2f3efac77e8cf18  8%     12.0     7.18%  10.9GB / 152.5GB  0.0bps / 0.0bps  100%        31.92%     25.9GB / 81.2GB
```

Once the job is done, find your model on your account. Congrats! You just ran your first Job to fine-tune an open source model 🔥

Feel free to try out your model locally and evaluate it using e.g. [transformers](https://huggingface.co/docs/transformers) by clicking on "Use this model", or deploy it to [Inference Endpoints](https://huggingface.co/docs/inference-endpoints) in one click using the "Deploy" button.

### Image Dataset

https://huggingface.co/docs/hub/datasets-image.md

# Image Dataset

This guide will show you how to configure your dataset repository with image files. You can find accompanying examples of repositories in this [Image datasets examples collection](https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65).

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its page on the Hub. Additional information about your images - such as captions or bounding boxes for object detection - is automatically loaded as long as you include this information in a metadata file (`metadata.csv`/`metadata.jsonl`/`metadata.parquet`).

Alternatively, images can be in Parquet files or in TAR archives following the [WebDataset](https://github.com/webdataset/webdataset) format.
## Only images

If your dataset only consists of one column with images, you can simply store your image files at the root:

```
my_dataset_repository/
├── 1.jpg
├── 2.jpg
├── 3.jpg
└── 4.jpg
```

or in a subdirectory:

```
my_dataset_repository/
└── images
    ├── 1.jpg
    ├── 2.jpg
    ├── 3.jpg
    └── 4.jpg
```

Multiple [formats](./datasets-adding#file-formats) are supported at the same time, including PNG, JPEG, TIFF and WebP.

```
my_dataset_repository/
└── images
    ├── 1.jpg
    ├── 2.png
    ├── 3.tiff
    └── 4.webp
```

If you have several splits, you can put your images into directories named accordingly:

```
my_dataset_repository/
├── train
│   ├── 1.jpg
│   └── 2.jpg
└── test
    ├── 3.jpg
    └── 4.jpg
```

See [File names and splits](./datasets-file-names-and-splits) for more information and other ways to organize data by splits.

## Additional columns

If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your repository. This lets you quickly create datasets for different computer vision tasks like [text captioning](https://huggingface.co/tasks/image-to-text) or [object detection](https://huggingface.co/tasks/object-detection).
```
my_dataset_repository/
└── train
    ├── 1.jpg
    ├── 2.jpg
    ├── 3.jpg
    ├── 4.jpg
    └── metadata.csv
```

Your `metadata.csv` file must have a `file_name` column which links image files with their metadata:

```csv
file_name,text
1.jpg,a drawing of a green pokemon with red eyes
2.jpg,a green and yellow toy with a red nose
3.jpg,a red and white ball with an angry look on its face
4.jpg,a cartoon ball with a smile on its face
```

You can also use a [JSONL](https://jsonlines.org/) file `metadata.jsonl`:

```jsonl
{"file_name": "1.jpg","text": "a drawing of a green pokemon with red eyes"}
{"file_name": "2.jpg","text": "a green and yellow toy with a red nose"}
{"file_name": "3.jpg","text": "a red and white ball with an angry look on its face"}
{"file_name": "4.jpg","text": "a cartoon ball with a smile on its face"}
```

And for bigger datasets or if you are interested in advanced data retrieval features, you can use a [Parquet](https://parquet.apache.org/) file `metadata.parquet`.

## Relative paths

The metadata file must be located either in the same directory as the images it is linked to, or in any parent directory, like in this example:

```
my_dataset_repository/
└── train
    ├── images
    │   ├── 1.jpg
    │   ├── 2.jpg
    │   ├── 3.jpg
    │   └── 4.jpg
    └── metadata.csv
```

In this case, the `file_name` column must be a full relative path to the images, not just the filename:

```csv
file_name,text
images/1.jpg,a drawing of a green pokemon with red eyes
images/2.jpg,a green and yellow toy with a red nose
images/3.jpg,a red and white ball with an angry look on its face
images/4.jpg,a cartoon ball with a smile on its face
```

Metadata files cannot be put in subdirectories of a directory with the images. More generally, any column named `file_name` or `*_file_name` should contain the full relative path to the images.
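For more than a handful of images, it can be convenient to generate the metadata file with a small script rather than writing it by hand. A minimal sketch writing a `metadata.jsonl` (the file names and captions below are illustrative):

```python
import json

# Hypothetical mapping from image file names to captions
captions = {
    "1.jpg": "a drawing of a green pokemon with red eyes",
    "2.jpg": "a green and yellow toy with a red nose",
}

# Write one JSON object per line, each with the mandatory `file_name` column
with open("metadata.jsonl", "w") as f:
    for file_name, text in captions.items():
        f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
```

The same loop can emit `metadata.csv` instead with the `csv` module; only the `file_name` column is required, any other columns are loaded as additional dataset columns.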
## Image classification

For image classification datasets, you can also use a simple setup: use directories to name the image classes. Store your image files in a directory structure like:

```
my_dataset_repository/
├── green
│   ├── 1.jpg
│   └── 2.jpg
└── red
    ├── 3.jpg
    └── 4.jpg
```

The dataset created with this structure contains two columns: `image` and `label` (with values `green` and `red`).

You can also provide multiple splits. To do so, your dataset directory should have the following structure (see [File names and splits](./datasets-file-names-and-splits) for more information):

```
my_dataset_repository/
├── test
│   ├── green
│   │   └── 2.jpg
│   └── red
│       └── 4.jpg
└── train
    ├── green
    │   └── 1.jpg
    └── red
        └── 3.jpg
```

You can disable this automatic addition of the `label` column in the [YAML configuration](./datasets-manual-configuration). If your directory names have no special meaning, set `drop_labels: true` in the README header:

```yaml
configs:
  - config_name: default  # Name of the dataset subset, if applicable.
    drop_labels: true
```

## Large scale datasets

### WebDataset format

The [WebDataset](./datasets-webdataset) format is well suited for large scale image datasets (see [timm/imagenet-12k-wds](https://huggingface.co/datasets/timm/imagenet-12k-wds) for example). It consists of TAR archives containing images and their metadata and is optimized for streaming. It is useful if you have a large number of images and to get streaming data loaders for large scale training.

```
my_dataset_repository/
├── train-0000.tar
├── train-0001.tar
├── ...
└── train-1023.tar
```

To make a WebDataset TAR archive, create a directory containing the images and metadata files to be archived and create the TAR archive using e.g. the `tar` command. The usual size per archive is generally around 1GB.
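As a minimal sketch using Python's standard `tarfile` module (directory and file names below are illustrative), creating one such archive could look like:

```python
import tarfile
from pathlib import Path

# Prepare a sample directory; each sample's files share a prefix ("000" here)
src = Path("train-0000")
src.mkdir(exist_ok=True)
(src / "000.jpg").write_bytes(b"fake image bytes")
(src / "000.json").write_text('{"caption": "example"}')

# Create an uncompressed TAR archive; adding files in sorted order keeps
# each sample's files adjacent, which WebDataset readers rely on
with tarfile.open("train-0000.tar", "w") as tar:
    for path in sorted(src.iterdir()):
        tar.add(path, arcname=path.name)

with tarfile.open("train-0000.tar") as tar:
    print([m.name for m in tar.getmembers()])  # ['000.jpg', '000.json']
```

The plain `tar` command works just as well; the important part is that files belonging to the same sample end up next to each other in the archive.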
Make sure each image and metadata pair share the same file prefix, for example:

```
train-0000/
├── 000.jpg
├── 000.json
├── 001.jpg
├── 001.json
├── ...
├── 999.jpg
└── 999.json
```

Note that for user convenience and to enable the [Dataset Viewer](./data-studio), every dataset hosted in the Hub is automatically converted to Parquet format up to 5GB. Read more about it in the [Parquet format](./data-studio#access-the-parquet-files) documentation.

### Parquet format

Instead of uploading the images and metadata as individual files, you can embed everything inside a [Parquet](https://parquet.apache.org/) file. This is useful if you have a large number of images, if you want to embed multiple image columns, or if you want to store additional information about the images in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

```
my_dataset_repository/
└── train.parquet
```

Parquet files with image data can be created using `pandas` or the `datasets` library. To create Parquet files with image data in `pandas`, you can use [pandas-image-methods](https://github.com/lhoestq/pandas-image-methods) and `df.to_parquet()`. In `datasets`, you can set the column type to `Image()` and use the `ds.to_parquet(...)` method or `ds.push_to_hub(...)`. You can find a guide on loading image datasets in `datasets` [here](/docs/datasets/image_load).

Alternatively you can manually set the image type of Parquet files created using other tools. First, make sure your image columns are of type _struct_, with a binary field `"bytes"` for the image data and a string field `"path"` for the image file name or path.
Then you should specify the feature types of the columns directly in YAML in the README header, for example: ```yaml dataset_info: features: - name: image dtype: image - name: caption dtype: string ``` Note that Parquet is recommended for small images (<1MB per image) and small row groups (100 rows per row group, which is what `datasets` uses for images). For larger images it is recommended to use the WebDataset format, or to share the original image files (optionally with metadata files, and following the [repositories recommendations and limits](https://huggingface.co/docs/hub/en/storage-limits) for storage and number of files). ### Uploading datasets https://huggingface.co/docs/hub/datasets-adding.md # Uploading datasets The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away! Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet. ## Upload using the Hub UI The Hub's web-based interface allows users without any developer experience to upload a dataset. ### Create a repository A repository hosts all your dataset files, including the revision history, making storing more than one dataset version possible. 1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset). 2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization. ### Upload dataset 1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. 
We support many text, audio, image and other data extensions such as `.csv`, `.mp3`, and `.jpg` (see the full list of [File formats](#file-formats)).

2. Drag and drop your dataset files.

3. After uploading your dataset files, they are stored in your dataset repository.

### Create a Dataset card

Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly.

1. Click on **Create Dataset Card** to create a [Dataset card](./datasets-cards). This button creates a `README.md` file in your repository.

2. At the top, you'll see the **Metadata UI** with several fields to select from such as license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub (when applicable). When you select an option for a field, it will be automatically added to the top of the dataset card.

   You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which has a complete set of allowed tags, including optional ones like `annotations_creators`, to help you choose the ones that are useful for your dataset.

3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand what is inside: what are the use cases and limitations, where the data comes from, what are important ethical considerations, and any other relevant details.

   You can click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail).

## Using the `huggingface_hub` client library

The rich feature set of the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Hub.
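As a minimal sketch (the repo id `user/my-dataset` and the local `data/` folder are placeholders for your own), creating a dataset repo and uploading a folder to it could look like:

```python
from huggingface_hub import HfApi


def upload_dataset_folder(repo_id: str = "user/my-dataset", folder: str = "data/"):
    """Create the dataset repo if needed, then upload a local folder in one commit."""
    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="dataset")
```

Once you are authenticated (e.g. via `hf auth login`), calling `upload_dataset_folder()` pushes everything under `data/` to the repo.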
Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more.

## Using other libraries

Some libraries like [🤗 Datasets](/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Polars](https://pola.rs), [Dask](https://www.dask.org/), [DuckDB](https://duckdb.org/), or [Daft](https://daft.ai/) can upload files to the Hub. See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information.

## Using Git

Since dataset repos are Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets.

## Ingest datasets

If you have data in databases, cloud storage or behind APIs, you can ingest it to Hugging Face as ready-to-use datasets. Find more information in the [documentation on ingesting datasets](./datasets-ingesting).

## File formats

The Hub natively supports multiple file formats:

- Parquet (.parquet)
- CSV (.csv, .tsv)
- JSON Lines, JSON (.jsonl, .json)
- Arrow streaming and IPC formats (.arrow)
- Text (.txt)
- Images (.png, .jpg, etc.)
- Audio (.wav, .mp3, etc.)
- Video (.mp4, .mov, .avi, etc.)
- PDF (.pdf)
- [WebDataset](./datasets-webdataset) (.tar)
- [Lance](./datasets-lance) (.lance)

It supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz).

Image and audio files can also have additional metadata files. See the [Data files Configuration](./datasets-data-files-configuration#image-and-audio-datasets) on image and audio datasets, as well as the collections of [example datasets](https://huggingface.co/datasets-examples) for CSV, TSV and images.

You may want to convert your files to these formats to benefit from all the Hub features. Other formats and structures may not be recognized by the Hub.

### Which file format should I use?
For most types of datasets, **Parquet** is the recommended format due to its efficient compression, rich typing, and because a variety of tools support it with optimized read and batched operations. Alternatively, CSV or JSON Lines/JSON can be used for tabular data (prefer JSON Lines for nested data). Although easy to parse compared to Parquet, these formats are not recommended for data larger than several GBs.

For image and audio datasets, uploading raw files is the most practical for most use cases since it's easy to access individual files. For streaming large scale image and audio datasets, [WebDataset](https://github.com/webdataset/webdataset) should be preferred over raw image and audio files to avoid the overhead of accessing individual files. Though for more general use cases involving analytics, data filtering or metadata parsing, Parquet is the recommended option for large scale image and audio datasets.

### Data Studio

The [Data Studio](./data-studio) is useful to see how the data actually looks before you download it. It is enabled by default for all public datasets. It is also available for private datasets owned by a [PRO user](https://huggingface.co/pricing) or a [Team or Enterprise organization](https://huggingface.co/enterprise).

After uploading your dataset, make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure).

## Large scale datasets

The Hugging Face Hub supports large scale datasets, usually uploaded in Parquet (e.g. via `push_to_hub()` using [🤗 Datasets](/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.push_to_hub)) or [WebDataset](https://github.com/webdataset/webdataset) format.

You can upload large scale datasets at high speed using the `huggingface_hub` library.
See [how to upload a folder by chunks](/docs/huggingface_hub/guides/upload#upload-a-folder-by-chunks), the [tips and tricks for large uploads](/docs/huggingface_hub/guides/upload#tips-and-tricks-for-large-uploads) and the [repository storage limits and recommendations](./storage-limits).

### DDUF

https://huggingface.co/docs/hub/dduf.md

# DDUF

## Overview

DDUF (**D**DUF's **D**iffusion **U**nified **F**ormat) is a single-file format for diffusion models that aims to unify the different model distribution methods and weight-saving formats by packaging all model components into a single file. It is language-agnostic and built to be parsable from a remote location without downloading the entire file. This work draws inspiration from the [GGUF](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) format.

Check out the [DDUF](https://huggingface.co/DDUF) org to start using some of the most popular diffusion models in DDUF.

> [!TIP]
> We welcome contributions with open arms!
>
> To create a widely adopted file format, we need early feedback from the community. Nothing is set in stone, and we value everyone's input. Is your use case not covered? Please let us know in the DDUF organization [discussions](https://huggingface.co/spaces/DDUF/README/discussions/2).

Its key features include the following.

1. **Single file** packaging.
2. Based on **ZIP file format** to leverage existing tooling.
3. No compression, ensuring **`mmap` compatibility** for fast loading and saving.
4. **Language-agnostic**: tooling can be implemented in Python, JavaScript, Rust, C++, etc.
5. **HTTP-friendly**: metadata and file structure can be fetched remotely using HTTP Range requests.
6. **Flexible**: each model component is stored in its own directory, following the current Diffusers structure.
7. **Safe**: uses [Safetensors](https://huggingface.co/docs/diffusers/using-diffusers/other-formats#safetensors) as a weight-saving format and prohibits nested directories to prevent ZIP bombs.
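Because a `.dduf` file is a plain ZIP archive stored without compression, the standard `zipfile` module is enough to build and inspect one. A toy sketch (the file names and index content are illustrative, not a real model):

```python
import json
import zipfile

# Build a toy DDUF-style archive: ZIP, stored uncompressed, with a
# model_index.json at the root and one config file per component directory
index = {"_class_name": "ToyPipeline", "vae": ["diffusers", "AutoencoderKL"]}
with zipfile.ZipFile("toy.dduf", "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("model_index.json", json.dumps(index))
    zf.writestr("vae/config.json", "{}")

# Because it is a plain ZIP, standard tooling can inspect it without
# any DDUF-specific library
with zipfile.ZipFile("toy.dduf") as zf:
    print(zf.namelist())  # ['model_index.json', 'vae/config.json']
    print(all(i.compress_type == zipfile.ZIP_STORED for i in zf.infolist()))  # True
```

Storing entries uncompressed is what makes `mmap`-based lazy loading and HTTP Range requests possible, since every file's bytes sit contiguously in the archive.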
## Technical specifications Technically, a `.dduf` file **is** a [`.zip` archive](https://en.wikipedia.org/wiki/ZIP_(file_format)). By building on a universally supported file format, we ensure robust tooling already exists. However, some constraints are enforced to meet diffusion models' requirements: - Data must be stored uncompressed (flag `0`), allowing lazy-loading using memory-mapping. - Data must be stored using the ZIP64 protocol, enabling files larger than 4 GB to be saved. - The archive can only contain `.json`, `.safetensors`, `.model` and `.txt` files. - A `model_index.json` file must be present at the root of the archive. It must contain a key-value mapping with metadata about the model and its components. - Each component must be stored in its own directory (e.g., `vae/`, `text_encoder/`). Nested files must use UNIX-style path separators (`/`). - Each directory must correspond to a component in the `model_index.json` index. - Each directory must contain a JSON config file (one of `config.json`, `tokenizer_config.json`, `preprocessor_config.json`, `scheduler_config.json`). - Sub-directories are forbidden. Want to check if your file is valid? Check it out using this Space: https://huggingface.co/spaces/DDUF/dduf-check. ## Usage The `huggingface_hub` library provides tooling to handle DDUF files in Python. It includes built-in rules to validate file integrity and helpers to read and export DDUF files. The goal is to see this tooling adopted in the Python ecosystem, such as in the `diffusers` integration. Similar tooling can be developed for other languages (JavaScript, Rust, C++, etc.). ### How to read a DDUF file? Pass a path to `read_dduf_file` to read a DDUF file. Only the metadata is read, meaning this is a lightweight call that won't explode your memory. In the example below, we assume you've already downloaded the [`FLUX.1-dev.dduf`](https://huggingface.co/DDUF/FLUX.1-dev-DDUF/blob/main/FLUX.1-dev.dduf) file locally.
```python >>> from huggingface_hub import read_dduf_file # Read DDUF metadata >>> dduf_entries = read_dduf_file("FLUX.1-dev.dduf") ``` `read_dduf_file` returns a mapping where each entry corresponds to a file in the DDUF archive. A file is represented by a `DDUFEntry` dataclass that contains the filename, offset, and length of the entry in the original DDUF file. This information is useful to read its content without loading the whole file. In practice, you won't have to handle low-level reading but rely on helpers instead. For instance, here is how to load the `model_index.json` content: ```python >>> import json >>> json.loads(dduf_entries["model_index.json"].read_text()) {'_class_name': 'FluxPipeline', '_diffusers_version': '0.32.0.dev0', '_name_or_path': 'black-forest-labs/FLUX.1-dev', ... ``` For binary files, you'll want to access the raw bytes using `as_mmap`. This returns bytes as a memory-mapping on the original file. The memory-mapping allows you to read only the bytes you need without loading everything in memory. For instance, here is how to load safetensors weights: ```python >>> import safetensors.torch >>> with dduf_entries["vae/diffusion_pytorch_model.safetensors"].as_mmap() as mm: ... state_dict = safetensors.torch.load(mm) # `mm` is a bytes object ``` > [!TIP] > `as_mmap` must be used in a context manager to benefit from the memory-mapping properties. ### How to write a DDUF file? Pass a folder path to `export_folder_as_dduf` to export a DDUF file. ```python # Export a folder as a DDUF file >>> from huggingface_hub import export_folder_as_dduf >>> export_folder_as_dduf("FLUX.1-dev.dduf", folder_path="path/to/FLUX.1-dev") ``` This tool scans the folder, adds the relevant entries and ensures the exported file is valid. If anything goes wrong during the process, a `DDUFExportError` is raised. 
For more flexibility, use [`export_entries_as_dduf`] to explicitly specify a list of files to include in the final DDUF file: ```python # Export specific files from the local disk. >>> from huggingface_hub import export_entries_as_dduf >>> export_entries_as_dduf( ... dduf_path="stable-diffusion-v1-4-FP16.dduf", ... entries=[ # List entries to add to the DDUF file (here, only FP16 weights) ... ("model_index.json", "path/to/model_index.json"), ... ("vae/config.json", "path/to/vae/config.json"), ... ("vae/diffusion_pytorch_model.fp16.safetensors", "path/to/vae/diffusion_pytorch_model.fp16.safetensors"), ... ("text_encoder/config.json", "path/to/text_encoder/config.json"), ... ("text_encoder/model.fp16.safetensors", "path/to/text_encoder/model.fp16.safetensors"), ... # ... add more entries here ... ] ... ) ``` `export_entries_as_dduf` works well if you've already saved your model on disk. But what if you have a model loaded in memory and want to serialize it directly into a DDUF file? `export_entries_as_dduf` lets you do that by providing a Python `generator` that describes how to serialize the data iteratively: ```python (...) # Export state_dicts one by one from a loaded pipeline >>> def as_entries(pipe: DiffusionPipeline) -> Generator[Tuple[str, bytes], None, None]: ... # Build a generator that yields the entries to add to the DDUF file. ... # The first element of the tuple is the filename in the DDUF archive. The second element is the content of the file. ... # Entries will be evaluated lazily when the DDUF file is created (only 1 entry is loaded in memory at a time) ... yield "vae/config.json", pipe.vae.to_json_string().encode() ... yield "vae/diffusion_pytorch_model.safetensors", safetensors.torch.save(pipe.vae.state_dict()) ... yield "text_encoder/config.json", pipe.text_encoder.config.to_json_string().encode() ... yield "text_encoder/model.safetensors", safetensors.torch.save(pipe.text_encoder.state_dict()) ... # ...
add more entries here >>> export_entries_as_dduf(dduf_path="my-cool-diffusion-model.dduf", entries=as_entries(pipe)) ``` ### Loading a DDUF file with Diffusers Diffusers has a built-in integration for DDUF files. Here is an example of how to load a pipeline from a checkpoint stored on the Hub: ```py from diffusers import DiffusionPipeline import torch pipe = DiffusionPipeline.from_pretrained( "DDUF/FLUX.1-dev-DDUF", dduf_file="FLUX.1-dev.dduf", torch_dtype=torch.bfloat16 ).to("cuda") image = pipe( "photo of a cat holding a sign that says Diffusers", num_inference_steps=50, guidance_scale=3.5 ).images[0] image.save("cat.png") ``` ## F.A.Q. ### Why build on top of ZIP? ZIP provides several advantages: - Universally supported file format - No additional dependencies for reading - Built-in file indexing - Wide language support ### Why not use a TAR with a table of contents at the beginning of the archive? See the explanation in this [comment](https://github.com/huggingface/huggingface_hub/pull/2692#issuecomment-2519863726). ### Why no compression? - Enables direct memory mapping of large files - Ensures consistent and predictable remote file access - Prevents CPU overhead during file reading - Maintains compatibility with safetensors ### Can I modify a DDUF file? No. For now, DDUF files are designed to be immutable. To update a model, create a new DDUF file. ### Which frameworks/apps support DDUFs? - [Diffusers](https://github.com/huggingface/diffusers) We are constantly reaching out to other libraries and frameworks. If you are interested in adding support to your project, open a Discussion in the [DDUF org](https://huggingface.co/spaces/DDUF/README/discussions). ### Using sample-factory at Hugging Face https://huggingface.co/docs/hub/sample-factory.md # Using sample-factory at Hugging Face [`sample-factory`](https://github.com/alex-petrenko/sample-factory) is a codebase for high throughput asynchronous reinforcement learning.
It has integrations with the Hugging Face Hub to share models with evaluation results and training metrics. ## Exploring sample-factory in the Hub You can find `sample-factory` models by filtering at the left of the [models page](https://huggingface.co/models?library=sample-factory). All models on the Hub come with useful features: 1. An automatically generated model card with a description, a training configuration, and more. 2. Metadata tags that help with discoverability. 3. Evaluation results to compare with other models. 4. A video widget where you can watch your agent performing. ## Install the library To install the `sample-factory` library, run: `pip install sample-factory` SF is known to work on Linux and macOS. There is no Windows support at this time. ## Loading models from the Hub ### Using load_from_hub To download a model from the Hugging Face Hub to use with Sample-Factory, use the `load_from_hub` script: ``` python -m sample_factory.huggingface.load_from_hub -r <repo_id> -d <train_dir_path> ``` The command line arguments are: - `-r`: The repo ID for the HF repository to download from. The repo ID should be in the format `<username>/<model_name>` - `-d`: An optional argument to specify the directory to save the experiment to. Defaults to `./train_dir` which will save the repo to `./train_dir/<model_name>` ### Download Model Repository Directly Hugging Face repositories can be downloaded directly using `git clone`: ``` git clone git@hf.co:<repo_id> # example: git clone git@hf.co:bigscience/bloom ``` ## Using Downloaded Models with Sample-Factory After downloading the model, you can run the models in the repo with the enjoy script corresponding to your environment.
For example, if you are downloading a `mujoco-ant` model, it can be run with: ``` python -m sf_examples.mujoco.enjoy_mujoco --algo=APPO --env=mujoco_ant --experiment=<experiment_name> --train_dir=./train_dir ``` Note: you may have to specify `--train_dir` if your local train_dir has a different path than the one in the `cfg.json`. ## Sharing your models ### Using push_to_hub If you want to upload without generating evaluation metrics or a replay video, you can use the `push_to_hub` script: ``` python -m sample_factory.huggingface.push_to_hub -r <hf_username>/<hf_repo_name> -d <experiment_dir_path> ``` The command line arguments are: - `-r`: The repo_id to save on HF Hub. This is the same as `hf_repository` in the enjoy script and must be in the form `<hf_username>/<hf_repo_name>` - `-d`: The full path to your experiment directory to upload ### Using enjoy.py You can upload your models to the Hub using your environment's `enjoy` script with the `--push_to_hub` flag. Uploading using `enjoy` can also generate evaluation metrics and a replay video. The evaluation metrics are generated by running your model on the specified environment for a number of episodes and reporting the mean and std reward of those runs. Other relevant command line arguments are: - `--hf_repository`: The repository to push to. Must be of the form `<hf_username>/<hf_repo_name>`. The model will be saved to `https://huggingface.co/<hf_username>/<hf_repo_name>` - `--max_num_episodes`: Number of episodes to evaluate on before uploading. Used to generate evaluation metrics. It is recommended to use multiple episodes to generate an accurate mean and std. - `--max_num_frames`: Number of frames to evaluate on before uploading. An alternative to `max_num_episodes` - `--no_render`: A flag that disables rendering and showing the environment steps. It is recommended to set this flag to speed up the evaluation process. You can also save a video of the model during evaluation to upload to the hub with the `--save_video` flag - `--video_frames`: The number of frames to be rendered in the video.
Defaults to `-1`, which renders an entire episode - `--video_name`: The name of the video to save as. If `None`, will save to `replay.mp4` in your experiment directory. For example: ``` python -m sf_examples.mujoco_examples.enjoy_mujoco --algo=APPO --env=mujoco_ant --experiment=<experiment_name> --train_dir=./train_dir --max_num_episodes=10 --push_to_hub --hf_username=<username> --hf_repository=<hf_repo_name> --save_video --no_render ``` ### Using Asteroid at Hugging Face https://huggingface.co/docs/hub/asteroid.md # Using Asteroid at Hugging Face `asteroid` is a PyTorch toolkit for audio source separation. It enables fast experimentation on common datasets with support for a large range of datasets and recipes to reproduce papers. ## Exploring Asteroid in the Hub You can find `asteroid` models by filtering at the left of the [models page](https://huggingface.co/models?filter=asteroid). All models on the Hub come with the following features: 1. An automatically generated model card with a description, training configuration, metrics, and more. 2. Metadata tags that help with discoverability and contain information such as licenses and datasets. 3. An interactive widget you can use to play with the model directly in the browser. 4. An Inference Providers widget that allows you to make inference requests. ## Using existing models For a full guide on loading pre-trained models, we recommend checking out the [official guide](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md). All model classes (`BaseModel`, `ConvTasNet`, etc.) have a `from_pretrained` method that allows you to load models from the Hub. ```py from asteroid.models import ConvTasNet model = ConvTasNet.from_pretrained('mpariente/ConvTasNet_WHAM_sepclean') ``` If you want to see how to load a specific model, you can click `Use in Asteroid` and you will be given a working snippet to load it!
## Sharing your models At the moment there is no automatic method to upload your models to the Hub, but the process to upload them is documented in the [official guide](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md#share-your-models). The recipes create all the files needed to upload a model to the Hub. The process usually involves the following steps: 1. Create and clone a model repository. 2. Move files from the recipe output to the repository (model card, model file, TensorBoard traces). 3. Push the files (`git add` + `git commit` + `git push`). Once you do this, you can try out your model directly in the browser and share it with the rest of the community. ## Additional resources * Asteroid [website](https://asteroid-team.github.io/). * Asteroid [library](https://github.com/asteroid-team/asteroid). * Integration [docs](https://github.com/asteroid-team/asteroid/blob/master/docs/source/readmes/pretrained_models.md). ### Security https://huggingface.co/docs/hub/security.md # Security The Hugging Face Hub offers several security features to ensure that your code and data are secure. Beyond offering [private repositories](./repositories-settings#private-repositories) for models, datasets, and Spaces, the Hub supports access tokens, resource groups, MFA, commit signatures, malware scanning, and more. Hugging Face is GDPR compliant. If a contract or specific data storage is something you'll need, we recommend taking a look at our [Team & Enterprise Support](https://huggingface.co/support). Hugging Face can also offer Business Associate Addendums or GDPR data processing agreements through an [Enterprise Plan](https://huggingface.co/pricing). Hugging Face is also [SOC2 Type 2 certified](https://us.aicpa.org/interestareas/frc/assuranceadvisoryservices/aicpasoc2report.html), meaning we provide security certification to our customers and actively monitor and patch any security weaknesses.
For any other security questions, please feel free to send us an email at security@huggingface.co. ## Contents - [User Access Tokens](./security-tokens) - [Two-Factor Authentication (2FA)](./security-2fa) - [Git over SSH](./security-git-ssh) - [Signing commits with GPG](./security-gpg) - [Single Sign-On (SSO)](./security-sso) - [Advanced Access Control (Resource Groups)](./security-resource-groups) - [Malware Scanning](./security-malware) - [Pickle Scanning](./security-pickle) - [Secrets Scanning](./security-secrets) - [Third-party scanner: Protect AI](./security-protectai) - [Third-party scanner: JFrog](./security-jfrog) ### Popular Images https://huggingface.co/docs/hub/jobs-popular-images.md # Popular Images Here is the list of ready-to-use Docker images from popular frameworks that you can use in Jobs with uv. These Docker images already have uv installed; if you want to use uv with an image that doesn't ship it, you'll need to make sure uv is installed first. This works well in many cases, but LLM inference libraries can have quite specific requirements, so it can be useful to use an image that already has the library installed. ## vLLM vLLM is a well-known and heavily used inference engine. It is known for its ability to scale inference for LLMs. They provide the `vllm/vllm-openai` Docker image with vLLM and uv ready. This image is ideal for running batch inference. Use the `--image` argument to use this Docker image: ```bash >>> hf jobs uv run --image vllm/vllm-openai --flavor l4x4 generate-responses.py ``` You can find more information on vLLM batch inference on Jobs in [Daniel Van Strien's blog post](https://danielvanstrien.xyz/posts/2025/hf-jobs/vllm-batch-inference.html). ## TRL TRL is a library designed for post-training models using techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO).
An up-to-date Docker image with uv and all TRL dependencies is available at `huggingface/trl` and can be used directly with Hugging Face Jobs. Use the `--image` argument to use this Docker image: ```bash >>> hf jobs uv run --image huggingface/trl --flavor a100-large -s HF_TOKEN train.py ``` ### Webhook guide: Setup an automatic system to re-train a model when a dataset changes https://huggingface.co/docs/hub/webhooks-guide-auto-retrain.md # Webhook guide: Setup an automatic system to re-train a model when a dataset changes This guide walks you through the setup of an automatic training pipeline on the Hugging Face platform using HF Datasets, Webhooks, Spaces, and AutoTrain. We will build a Webhook that listens to changes on an image classification dataset and triggers a fine-tuning of [microsoft/resnet-50](https://huggingface.co/microsoft/resnet-50) using [AutoTrain](https://huggingface.co/autotrain). ## Prerequisite: Upload your dataset to the Hub We will use a [simple image classification dataset](https://huggingface.co/datasets/huggingface-projects/auto-retrain-input-dataset) for the sake of the example. Learn more about uploading your data to the Hub [here](https://huggingface.co/docs/datasets/upload_dataset). ![dataset](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/dataset.png) ## Create a Webhook to react to the dataset's changes First, let's create a Webhook from your [settings](https://huggingface.co/settings/webhooks). - Select your dataset as the target repository. We will target [huggingface-projects/input-dataset](https://huggingface.co/datasets/huggingface-projects/input-dataset) in this example. - You can put a dummy Webhook URL for now. Defining your Webhook will let you look at the events that will be sent to it. You can also replay them, which will be useful for debugging! - Input a secret to make it more secure.
- Subscribe to "Repo update" events, as we want to react to data changes. Your Webhook will look like this: ![webhook-creation](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/webhook-creation.png) ## Create a Space to react to your Webhook We now need a way to react to your Webhook events. An easy way to do this is to use a [Space](https://huggingface.co/docs/hub/spaces-overview)! You can find an example Space [here](https://huggingface.co/spaces/huggingface-projects/auto-retrain/tree/main). This Space uses Docker, Python, [FastAPI](https://fastapi.tiangolo.com/), and [uvicorn](https://www.uvicorn.org) to run a simple HTTP server. Read more about Docker Spaces [here](https://huggingface.co/docs/hub/spaces-sdks-docker). The entry point is [src/main.py](https://huggingface.co/spaces/huggingface-projects/auto-retrain/blob/main/src/main.py). Let's walk through this file and detail what it does: 1. It spawns a FastAPI app that will listen to HTTP `POST` requests on `/webhook`: ```python from fastapi import FastAPI # [...] @app.post("/webhook") async def post_webhook( # ... ): # ... ``` 2. This route checks that the `X-Webhook-Secret` header is present and that its value is the same as the one you set in your Webhook's settings. The `WEBHOOK_SECRET` secret must be set in the Space's settings and be the same as the secret set in your Webhook. ```python # [...] WEBHOOK_SECRET = os.getenv("WEBHOOK_SECRET") # [...] @app.post("/webhook") async def post_webhook( # [...] x_webhook_secret: Optional[str] = Header(default=None), # ^ checks for the X-Webhook-Secret HTTP header ): if x_webhook_secret is None: raise HTTPException(401) if x_webhook_secret != WEBHOOK_SECRET: raise HTTPException(403) # [...] ``` 3. The event's payload is encoded as JSON. Here, we'll be using pydantic models to parse the event payload.
We also specify that we will run our Webhook only when: - the event concerns the input dataset - the event is an update on the repo's content, i.e., there has been a new commit ```python # defined in src/models.py class WebhookPayloadEvent(BaseModel): action: Literal["create", "update", "delete", "move"] scope: str class WebhookPayloadRepo(BaseModel): type: Literal["dataset", "model", "space"] name: str id: str private: bool headSha: str class WebhookPayload(BaseModel): event: WebhookPayloadEvent repo: WebhookPayloadRepo # [...] @app.post("/webhook") async def post_webhook( # [...] payload: WebhookPayload, # ^ Pydantic model defining the payload format ): # [...] if not ( payload.event.action == "update" and payload.event.scope.startswith("repo.content") and payload.repo.name == config.input_dataset and payload.repo.type == "dataset" ): # no-op if the payload does not match our expectations return {"processed": False} #[...] ``` 4. If the payload is valid, the next step is to create a project on AutoTrain, schedule a fine-tuning of the input model (`microsoft/resnet-50` in our example) on the input dataset, and create a discussion on the dataset when it's done! ```python def schedule_retrain(payload: WebhookPayload): # Create the autotrain project try: project = AutoTrain.create_project(payload) AutoTrain.add_data(project_id=project["id"]) AutoTrain.start_processing(project_id=project["id"]) except requests.HTTPError as err: print("ERROR while requesting AutoTrain API:") print(f" code: {err.response.status_code}") print(f" {err.response.json()}") raise # Notify in the community tab notify_success(project["id"]) ``` Visit the link inside the comment to review the training cost estimate, and start fine-tuning the model! 
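For illustration only, the payload filter above can be condensed into a plain function (field names follow the pydantic models shown; the `input_dataset` parameter stands in for the value read from `config.json`):

```python
def should_retrain(payload: dict, input_dataset: str) -> bool:
    """Return True only for content updates on the watched dataset repo."""
    event = payload.get("event", {})
    repo = payload.get("repo", {})
    return (
        event.get("action") == "update"
        and event.get("scope", "").startswith("repo.content")
        and repo.get("name") == input_dataset
        and repo.get("type") == "dataset"
    )

# A commit on the watched dataset passes the filter...
commit_event = {
    "event": {"action": "update", "scope": "repo.content"},
    "repo": {"type": "dataset", "name": "huggingface-projects/input-dataset"},
}
# ...while e.g. a discussion event on the same repo does not.
discussion_event = {
    "event": {"action": "create", "scope": "discussion"},
    "repo": {"type": "dataset", "name": "huggingface-projects/input-dataset"},
}
```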
![community tab notification](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/notification.png) In this example, we used Hugging Face AutoTrain to fine-tune our model quickly, but you can of course plug in your own training infrastructure! Feel free to duplicate the Space to your personal namespace and play with it. You will need to provide two secrets: - `WEBHOOK_SECRET`: the secret from your Webhook. - `HF_ACCESS_TOKEN`: a User Access Token with `write` rights. You can create one [from your settings](https://huggingface.co/settings/tokens). You will also need to tweak the [`config.json` file](https://huggingface.co/spaces/huggingface-projects/auto-retrain/blob/main/config.json) to use the dataset and model of your choice: ```json { "target_namespace": "the namespace where the trained model should end up", "input_dataset": "the dataset on which the model will be trained", "input_model": "the base model to re-train", "autotrain_project_prefix": "A prefix for the AutoTrain project" } ``` ## Configure your Webhook to send events to your Space Last but not least, you'll need to configure your webhook to send POST requests to your Space. Let's first grab our Space's "direct URL" from the contextual menu. Click on "Embed this Space" and copy the "Direct URL". ![embed this Space](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/duplicate-space.png) ![direct URL](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/direct-url.png) Update your Webhook to send requests to that URL: ![webhook settings](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/webhooks-guides/002-auto-retrain/update-webhook.png) And that's it!
Now every commit to the input dataset will trigger a fine-tuning of ResNet-50 with AutoTrain 🎉 ### Sign in with Hugging Face https://huggingface.co/docs/hub/oauth.md # Sign in with Hugging Face You can use the HF OAuth / OpenID Connect flow to create a **"Sign in with HF"** flow in any website or app. This will allow users to sign in to your website or app using their HF account, by clicking a button similar to this one: ![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl-dark.svg) After clicking this button, your users will be presented with a permissions modal to authorize your app: ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-accept-application.png) ## Creating an oauth app You can create your application in your [settings](https://huggingface.co/settings/applications/new): ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-create-application.png) ### Public OAuth apps (no secret) You can create or use OAuth apps without a client secret. This is useful for native apps, CLIs, or other contexts where keeping a secret is impractical. - **At app creation**: When creating a new OAuth app, you can choose to create it without a secret. - **After creation**: For an existing app, you can delete the client secret in the app settings. The app will then work as a public app. Public apps authenticate using only the client ID (e.g. in device code or authorization code flows with PKCE). Apps that have a secret can still use the secret when needed (e.g. `Authorization: Basic` for token requests). ### If you are hosting in Spaces > [!TIP] > If you host your app on Spaces, then the flow will be even easier to implement (and built into Gradio directly); check our [Spaces OAuth guide](https://huggingface.co/docs/hub/spaces-oauth).
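As a sketch of the PKCE mechanism that public apps rely on (following RFC 7636's S256 method, standard library only; nothing here is specific to Hugging Face):

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Generate a PKCE code_verifier and its S256 code_challenge (RFC 7636)."""
    # 32 random bytes -> a 43-character URL-safe verifier (the spec allows 43-128)
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")
    # code_challenge = BASE64URL(SHA256(verifier)), without '=' padding
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge

verifier, challenge = make_pkce_pair()
# Send `challenge` (with code_challenge_method=S256) in the authorization
# request; send `verifier` later with the token request.
```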
### Automated oauth app creation Hugging Face supports CIMD, aka [Client ID Metadata Documents](https://datatracker.ietf.org/doc/draft-ietf-oauth-client-id-metadata-document/), which allows you to create an oauth app for your website in an automated manner: - Add an endpoint to your website `/.well-known/oauth-cimd` which returns the following JSON: ```json { client_id: "[your website url]/.well-known/oauth-cimd", client_name: "Your Website", redirect_uris: ["[your website url]/oauth/callback/huggingface"], token_endpoint_auth_method: "none", logo_uri: "https://....", // optional client_uri: "[your website url]", // optional } ``` - Use `"[your website url]/.well-known/oauth-cimd"` as the client ID, and PKCE as the auth mechanism. This is particularly useful for ephemeral environments or MCP clients. See an [implementation example](https://github.com/huggingface/chat-ui/pull/1978) in Hugging Chat. ## Device code OAuth Device code flow lets users authorize an app on one device (e.g. a CLI) by entering a short code on another device (e.g. a phone or browser). No redirect URI or browser on the device running the app is required. ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-device-first-step.png) ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/oauth-device-second-step.png) ### Testing with a sample script You can test a device-code OAuth app with the following script. Set `CLIENT_ID` to your app's client ID. For **public apps** (no secret), the script works as-is. For **apps with a secret**, add an `Authorization: Basic` header (Base64 of `client_id:client_secret`) to both the device and token requests.
```sh #!/bin/bash CLIENT_ID="" # Step 1: Get device code RESPONSE=$(curl -s -X POST https://huggingface.co/oauth/device \ -d "client_id=$CLIENT_ID") DEVICE_CODE=$(echo $RESPONSE | jq -r '.device_code') USER_CODE=$(echo $RESPONSE | jq -r '.user_code') VERIFICATION_URI=$(echo $RESPONSE | jq -r '.verification_uri') echo "Device Code: $DEVICE_CODE" echo "User Code: $USER_CODE" echo "" echo "Open: ${VERIFICATION_URI}" echo "Enter the user code: $USER_CODE" echo "" # Step 2: Wait for the user to authorize in the browser read -p "Press Enter after authorizing..." # Step 3: Get token curl -X POST https://huggingface.co/oauth/token \ -d "grant_type=urn:ietf:params:oauth:grant-type:device_code" \ -d "device_code=$DEVICE_CODE" \ -d "client_id=$CLIENT_ID" ``` > [!NOTE] > For OAuth apps that have a client secret, include an `Authorization: Basic` header (with Base64-encoded `client_id:client_secret`) on both the device code request and the token request. ## Currently supported scopes The currently supported scopes are: - `openid`: Get the ID token in addition to the access token. - `profile`: Get the user's profile information (username, avatar, etc.) - `email`: Get the user's email address. - `read-billing`: Know whether the user has a payment method set up. - `read-repos`: Get read access to the user's personal repos. - `gated-repos`: Get read access to the content of public gated repos the user has been granted access to. Unlike `read-repos`, this does not grant access to private repos. - `contribute-repos`: Can create repositories and access those created by this app. Cannot access any other repositories unless additional permissions are granted. - `write-repos`: Get write/read access to the user's personal repos. - `manage-repos`: Get full access to the user's personal repos. Also grants repo creation and deletion. - `read-collections`: Get read access to the user's personal collections. - `write-collections`: Get write/read access to the user's personal collections. Also grants collection creation and deletion.
- `inference-api`: Get access to [Inference Providers](https://huggingface.co/docs/inference-providers/index): you will be able to make inference requests on behalf of the user. - `jobs`: Run [jobs](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs) - `webhooks`: Manage [webhooks](https://huggingface.co/docs/huggingface_hub/main/en/guides/webhooks) - `write-discussions`: Open discussions and Pull Requests on behalf of the user as well as interact with discussions (including reactions, posting/editing comments, closing discussions, ...). To open Pull Requests on private repos, you need to request the `read-repos` scope as well. All other information is available in the [OpenID metadata](https://huggingface.co/.well-known/openid-configuration). > [!WARNING] > Please contact us if you need any extra scopes. ## Accessing organization resources By default, the oauth app does not need to access organization resources. But some scopes like `read-repos` or `read-billing` apply to organizations as well. The user can select which organizations to grant access to when authorizing the app. If you require access to a specific organization, you can add `orgIds=ORG_ID` as a query parameter to the OAuth authorization URL. You have to replace `ORG_ID` with the organization ID, which is available in the `organizations.sub` field of the userinfo response. ## Branding You are free to use your own design for the button. Below are some SVG images provided for convenience. Check out [our badges](https://huggingface.co/datasets/huggingface/badges#sign-in-with-hugging-face) with explanations for integrating them in markdown or HTML.
[![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-sm.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-sm-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-md.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-md-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-lg.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-lg-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE) [![Sign in with Hugging 
Face](https://huggingface.co/datasets/huggingface/badges/resolve/main/sign-in-with-huggingface-xl-dark.svg)](https://huggingface.co/oauth/authorize?client_id=CLIENT_ID&redirect_uri=REDIRECT_URI&scope=openid%20profile&state=STATE)

## Token Exchange for Organizations (RFC 8693)

> [!WARNING]
> This feature is part of the Enterprise plan.

Token Exchange allows organizations to programmatically issue access tokens for their members without requiring interactive user consent. This is particularly useful for building internal tools, automation pipelines, and enterprise integrations that need to access Hugging Face resources on behalf of organization members. This feature implements [RFC 8693 - OAuth 2.0 Token Exchange](https://www.rfc-editor.org/rfc/rfc8693.html), a standard protocol for token exchange scenarios.

### Use cases

Token Exchange is designed for scenarios where your organization needs to:

- **Build internal platforms**: Create dashboards or portals that access Hugging Face resources on behalf of your team members, without requiring each user to manually authenticate.
- **Automate CI/CD pipelines**: Issue short-lived, scoped tokens for automated workflows that need to push models or datasets to organization repositories.
- **Integrate with enterprise identity systems**: Bridge your existing identity provider with Hugging Face by issuing tokens based on your internal user directory.
- **Implement custom access controls**: Build middleware that issues tokens with specific scopes based on your organization's internal policies.

### How it works

1. Your organization has an OAuth application bound to your organization with the `token-exchange` privilege.
2. Your backend service authenticates with this OAuth app using client credentials.
3. Your service requests an access token for a specific organization member (identified by email).
4. Hugging Face verifies the user is a member of your organization and issues a scoped token.
5.
The issued token can only access resources within your organization's scope.

### Prerequisites

To use Token Exchange, you need an organization-bound OAuth application with the `token-exchange` privilege. Contact Hugging Face support to set up an eligible OAuth app for your organization. Once configured, you will receive:

- A **Client ID** (e.g., `a1b2c3d4-e5f6-7890-abcd-ef1234567890`)
- A **Client Secret** (keep this secure!)

> [!WARNING]
> Organization administrators can manage the OAuth app after creation, including refreshing the client secret and configuring the token duration.

### Authentication

Token Exchange uses HTTP Basic Authentication with your OAuth app credentials. Create the authorization header by Base64-encoding your `client_id:client_secret`:

```bash
# Create the authorization header
export CLIENT_ID="your-client-id"
export CLIENT_SECRET="your-client-secret"
export AUTH_HEADER=$(echo -n "${CLIENT_ID}:${CLIENT_SECRET}" | base64)
```

### Issuing tokens by email

To issue an access token for an organization member using their email address:

```bash
curl -X POST "https://huggingface.co/oauth/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -H "Authorization: Basic ${AUTH_HEADER}" \
  -d "grant_type=urn:ietf:params:oauth:grant-type:token-exchange" \
  -d "subject_token=user@yourorg.com" \
  -d "subject_token_type=urn:huggingface:token-type:user-email"
```

### Response

A successful request returns an access token:

```json
{
  "access_token": "hf_oauth_...",
  "token_type": "bearer",
  "expires_in": 28800,
  "scope": "openid profile email read-repos",
  "id_token": "eyJhbGciOiJS...",
  "issued_token_type": "urn:ietf:params:oauth:token-type:access_token"
}
```

The `id_token` field is included when the `openid` scope is requested.
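For backends written in Python, the same exchange request can be assembled with the standard library alone. The sketch below only builds the headers and form body (it does not send the request); the credentials and email are placeholders to replace with your own:

```python
import base64
import urllib.parse

# Endpoint from the curl example above
HF_TOKEN_URL = "https://huggingface.co/oauth/token"

def build_exchange_request(client_id: str, client_secret: str, user_email: str):
    """Build the (headers, body) pair for the RFC 8693 token-exchange POST."""
    # HTTP Basic auth: base64 of "client_id:client_secret"
    auth = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {auth}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # Same form fields as the curl example
    body = urllib.parse.urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": user_email,
        "subject_token_type": "urn:huggingface:token-type:user-email",
    })
    return headers, body

# Placeholder credentials -- substitute your OAuth app's values
headers, body = build_exchange_request("my-client-id", "my-client-secret", "user@yourorg.com")
```

The resulting pair can then be posted to `https://huggingface.co/oauth/token` with any HTTP client (e.g. `urllib.request` or `requests`).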
You can then use this token to make API requests on behalf of the user:

```bash
curl "https://huggingface.co/api/whoami-v2" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}"
```

### Scope control

By default, issued tokens inherit all scopes configured on the OAuth app. You can request specific scopes by adding the `scope` parameter. See [Currently supported scopes](#currently-supported-scopes) for available values. The token's effective permissions are limited both by the requested scope and by the user's role within the organization.

```bash
curl -X POST "https://huggingface.co/oauth/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -H "Authorization: Basic ${AUTH_HEADER}" \
  -d "grant_type=urn:ietf:params:oauth:grant-type:token-exchange" \
  -d "subject_token=user@yourorg.com" \
  -d "subject_token_type=urn:huggingface:token-type:user-email" \
  -d "scope=openid profile"
```

> [!TIP]
> Follow the principle of least privilege: request only the scopes your application actually needs.

### Security considerations

Tokens issued via Token Exchange have built-in security restrictions:

- **Organization-scoped**: Tokens can only access resources within your organization (models, datasets, Spaces, and collections owned by the org). Outside the org, access is read-only and limited to: public collections from any user or organization, and public gated repos the user has been individually granted access to.
- **No personal access**: Tokens cannot access the user's personal private repositories or private repos from other organizations.
- **Short-lived**: Tokens expire after 8 hours by default. Organization administrators can configure the token duration (up to 30 days) in the OAuth app settings. No refresh tokens are provided.
- **Auditable**: All token exchanges are logged and visible in your organization's [audit logs](./audit-logs).

> [!WARNING]
> Protect your OAuth app credentials carefully. Anyone with access to your client secret can issue tokens for any member of your organization.

### Error responses

| Error | Description |
|-------|-------------|
| `invalid_client` | Client is not authorized to use token exchange, or the app is not bound to an organization |
| `invalid_grant` | User not found in the bound organization |
| `invalid_scope` | Requested scope is not valid |

### Reference

**Grant type:**

```
urn:ietf:params:oauth:grant-type:token-exchange
```

**Request parameter (`subject_token_type`):**

| Value | Description |
|-------|-------------|
| `urn:huggingface:token-type:user-email` | Identify the user by their email address |

**Response field (`issued_token_type`):**

| Value | Description |
|-------|-------------|
| `urn:ietf:params:oauth:token-type:access_token` | Indicates an access token was issued |

**Related documentation:**

- [RFC 8693 - OAuth 2.0 Token Exchange](https://www.rfc-editor.org/rfc/rfc8693.html)
- [Audit Logs](./audit-logs)

### Using Spaces for Organization Cards

https://huggingface.co/docs/hub/spaces-organization-cards.md

# Using Spaces for Organization Cards

Organization cards are a way to describe your organization to other users. They take the form of a `README.md` static file, inside a Space repo named `README`. Please read more in the [dedicated doc section](./organizations-cards).

### Query datasets

https://huggingface.co/docs/hub/datasets-duckdb-select.md

# Query datasets

Querying datasets is a fundamental step in data analysis. Here, we'll guide you through querying datasets using various methods. There are [several ways](https://duckdb.org/docs/data/parquet/overview.html) to select your data.
Using the `FROM` syntax:

```bash
FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' SELECT city, country, region LIMIT 3;

┌────────────────┬─────────────┬───────────────┐
│      city      │   country   │    region     │
│    varchar     │   varchar   │    varchar    │
├────────────────┼─────────────┼───────────────┤
│ Kabul          │ Afghanistan │ Southern Asia │
│ Kandahar       │ Afghanistan │ Southern Asia │
│ Mazar-e Sharif │ Afghanistan │ Southern Asia │
└────────────────┴─────────────┴───────────────┘
```

Using the `SELECT` and `FROM` syntax:

```bash
SELECT city, country, region FROM 'hf://datasets/jamescalam/world-cities-geo/train.jsonl' USING SAMPLE 3;

┌──────────┬─────────┬────────────────┐
│   city   │ country │     region     │
│ varchar  │ varchar │    varchar     │
├──────────┼─────────┼────────────────┤
│ Wenzhou  │ China   │ Eastern Asia   │
│ Valdez   │ Ecuador │ South America  │
│ Aplahoue │ Benin   │ Western Africa │
└──────────┴─────────┴────────────────┘
```

Count all JSONL files matching a glob pattern:

```bash
SELECT COUNT(*) FROM 'hf://datasets/jamescalam/world-cities-geo/*.jsonl';

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         9083 │
└──────────────┘
```

You can also query Parquet files using the `read_parquet` function (or its alias `parquet_scan`).
This function, along with other [parameters](https://duckdb.org/docs/data/parquet/overview.html#parameters), provides flexibility in handling Parquet files, especially if they don't have a `.parquet` extension. Let's explore these functions using the auto-converted Parquet files from the same dataset.

Select using the [read_parquet](https://duckdb.org/docs/guides/file_formats/query_parquet.html) function:

```bash
SELECT * FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet') LIMIT 3;

┌────────────────┬─────────────┬───────────────┬───────────┬────────────┬────────────┬────────────────────┬───────────────────┬────────────────────┐
│      city      │   country   │    region     │ continent │  latitude  │ longitude  │         x          │         y         │         z          │
│    varchar     │   varchar   │    varchar    │  varchar  │   double   │   double   │       double       │      double       │       double       │
├────────────────┼─────────────┼───────────────┼───────────┼────────────┼────────────┼────────────────────┼───────────────────┼────────────────────┤
│ Kabul          │ Afghanistan │ Southern Asia │ Asia      │ 34.5166667 │ 69.1833344 │  1865.546409629258 │ 4906.785732164055 │ 3610.1012966606136 │
│ Kandahar       │ Afghanistan │ Southern Asia │ Asia      │      31.61 │ 65.6999969 │  2232.782351694877 │ 4945.064042683584 │  3339.261233224765 │
│ Mazar-e Sharif │ Afghanistan │ Southern Asia │ Asia      │ 36.7069444 │ 67.1122208 │ 1986.5057687360124 │  4705.51748048584 │  3808.088900172991 │
└────────────────┴─────────────┴───────────────┴───────────┴────────────┴────────────┴────────────────────┴───────────────────┴────────────────────┘
```

Read all files that match a glob pattern and include a filename column specifying which file each row came from:

```bash
SELECT city, country, filename FROM read_parquet('hf://datasets/jamescalam/world-cities-geo@~parquet/default/**/*.parquet', filename = true) LIMIT 3;

┌────────────────┬─────────────┬───────────────────────────────────────────────────────────────────────────────┐
│      city      │   country   │                                   filename                                    │
│    varchar     │   varchar   │                                    varchar                                    │
├────────────────┼─────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ Kabul          │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │
│ Kandahar       │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │
│ Mazar-e Sharif │ Afghanistan │ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │
└────────────────┴─────────────┴───────────────────────────────────────────────────────────────────────────────┘
```

## Get metadata and schema

The [parquet_metadata](https://duckdb.org/docs/data/parquet/metadata.html) function can be used to query the metadata contained within a Parquet file.

```bash
SELECT * FROM parquet_metadata('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet');

┌───────────────────────────────────────────────────────────────────────────────┬──────────────┬────────────────────┬─────────────┐
│                                   file_name                                   │ row_group_id │ row_group_num_rows │ compression │
│                                    varchar                                    │    int64     │       int64        │   varchar   │
├───────────────────────────────────────────────────────────────────────────────┼──────────────┼────────────────────┼─────────────┤
│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │            0 │               1000 │ SNAPPY      │
│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │            0 │               1000 │ SNAPPY      │
│ hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet │            0 │               1000 │ SNAPPY      │
└───────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────────┘
```

Fetch the column names and column types:

```bash
DESCRIBE SELECT * FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet';

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ city        │ VARCHAR     │ YES     │         │         │         │
│ country     │ VARCHAR     │ YES     │         │         │         │
│ region      │ VARCHAR     │ YES     │         │         │         │
│ continent   │ VARCHAR     │ YES     │         │         │         │
│ latitude    │ DOUBLE      │ YES     │         │         │         │
│ longitude   │ DOUBLE      │ YES     │         │         │         │
│ x           │ DOUBLE      │ YES     │         │         │         │
│ y           │ DOUBLE      │ YES     │         │         │         │
│ z           │ DOUBLE      │ YES     │         │         │         │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
```

Fetch the internal schema (excluding the file name):

```bash
SELECT * EXCLUDE (file_name) FROM parquet_schema('hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet');

┌───────────┬────────────┬─────────────┬─────────────────┬──────────────┬────────────────┬───────┬───────────┬──────────┬──────────────┐
│   name    │    type    │ type_length │ repetition_type │ num_children │ converted_type │ scale │ precision │ field_id │ logical_type │
│  varchar  │  varchar   │   varchar   │     varchar     │    int64     │    varchar     │ int64 │   int64   │  int64   │   varchar    │
├───────────┼────────────┼─────────────┼─────────────────┼──────────────┼────────────────┼───────┼───────────┼──────────┼──────────────┤
│ schema    │            │             │ REQUIRED        │            9 │                │       │           │          │              │
│ city      │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
│ country   │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
│ region    │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
│ continent │ BYTE_ARRAY │             │ OPTIONAL        │              │ UTF8           │       │           │          │ StringType() │
│ latitude  │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
│ longitude │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
│ x         │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
│ y         │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
│ z         │ DOUBLE     │             │ OPTIONAL        │              │                │       │           │          │              │
└───────────┴────────────┴─────────────┴─────────────────┴──────────────┴────────────────┴───────┴───────────┴──────────┴──────────────┘
```

## Get statistics

The `SUMMARIZE` command can be used to get various aggregates over a query (min, max, approx_unique, avg, std, q25, q50, q75, count). It returns these statistics along with the column name, column type, and the percentage of NULL values.

```bash
SUMMARIZE SELECT latitude, longitude FROM 'hf://datasets/jamescalam/world-cities-geo@~parquet/default/train/0000.parquet';

┌─────────────┬─────────────┬──────────────┬─────────────┬───────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬───────┬─────────────────┐
│ column_name │ column_type │     min      │     max     │ approx_unique │        avg         │        std         │        q25         │        q50         │        q75         │ count │ null_percentage │
│   varchar   │   varchar   │   varchar    │   varchar   │     int64     │      varchar       │      varchar       │      varchar       │      varchar       │      varchar       │ int64 │  decimal(9,2)   │
├─────────────┼─────────────┼──────────────┼─────────────┼───────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼───────┼─────────────────┤
│ latitude    │ DOUBLE      │ -54.8        │ 67.8557214  │          7324 │ 22.5004568364307   │ 26.770454684690925 │ 6.089858461951687  │ 29.321258648324747 │ 44.90191158328915  │  9083 │            0.00 │
│ longitude   │ DOUBLE      │ -175.2166595 │ 179.3833313 │          7802 │ 14.699333721953098 │ 63.93672742608224  │ -6.877990418604821 │ 19.12963979385393  │ 43.873513093419966 │  9083 │            0.00 │
└─────────────┴─────────────┴──────────────┴─────────────┴───────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴────────────────────┴───────┴─────────────────┘
```

### Manual Configuration

https://huggingface.co/docs/hub/datasets-manual-configuration.md

# Manual Configuration

This guide will show you how to configure a custom structure for your dataset repository.
The [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87) showcases each section of the documentation.

A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to define the splits, subsets and builder parameters that are used by the Viewer.

It is also possible to define multiple subsets (also called "configurations") for the same dataset (e.g. if the dataset has various independent files).

## Splits

If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your README.md.

For example, given a repository like this one:

```
my_dataset_repository/
├── README.md
├── data.csv
└── holdout.csv
```

You can define a subset for your splits by adding the `configs` field in the YAML block at the top of your README.md:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data.csv"
  - split: test
    path: "holdout.csv"
---
```

You can select multiple files per split using a list of paths:

```
my_dataset_repository/
├── README.md
├── data/
│   ├── abc.csv
│   └── def.csv
└── holdout/
    └── ghi.csv
```

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path:
    - "data/abc.csv"
    - "data/def.csv"
  - split: test
    path: "holdout/ghi.csv"
---
```

Or you can use glob patterns to automatically list all the files you need:

```yaml
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/*.csv"
  - split: test
    path: "holdout/*.csv"
---
```

> [!WARNING]
> Note that the `config_name` field is required even if you have a single subset.

## Multiple Subsets

Your dataset might have several subsets of data that you want to be able to use separately. For example, each subset has its own dropdown in the Dataset Viewer on the Hugging Face Hub.
In that case you can define a list of subsets inside the `configs` field in YAML:

```
my_dataset_repository/
├── README.md
├── main_data.csv
└── additional_data.csv
```

```yaml
---
configs:
- config_name: main_data
  data_files: "main_data.csv"
- config_name: additional_data
  data_files: "additional_data.csv"
---
```

Note that subsets are shown in the viewer with the default subset first, then the rest in alphabetical order.

> [!TIP]
> You can set a default subset using `default: true`
>
> ```yaml
> - config_name: main_data
>   data_files: "main_data.csv"
>   default: true
> ```
>
> This is useful to set which subset the Dataset Viewer shows first, and which subset data libraries load by default.

## Data Directory

Instead of listing individual files with `data_files`, you can use `data_dir` to point to a directory. Files inside that directory are resolved automatically based on file extensions. This is especially useful when your data is organized in subdirectories.

For example, in a case like this you can simply use `data_dir`, since each subset's data lives in its own directory:

```
my_dataset_repository/
├── README.md
├── main/
│   ├── train.csv
│   └── test.csv
└── extra/
    ├── train.csv
    └── test.csv
```

```yaml
---
configs:
- config_name: main
  data_dir: "main"
- config_name: extra
  data_dir: "extra"
---
```

When `data_dir` is set, the builder resolves files relative to that directory. If the directory contains files matching the default split naming pattern (e.g. `train.csv`, `test.csv`), splits are assigned automatically without needing explicit `data_files`. You can also combine `data_dir` with `data_files` for more control:

```yaml
---
configs:
- config_name: default
  data_dir: "data"
  data_files:
  - split: train
    path: "training_*.csv"
  - split: test
    path: "eval_*.csv"
---
```

In this case, the `path` patterns in `data_files` are resolved relative to the `data_dir`.
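To make the resolution rule concrete, here is a small stdlib-only illustration (not the actual `datasets` implementation) of how `path` patterns match files relative to `data_dir`. The repository listing is hypothetical:

```python
import fnmatch

# Hypothetical repository file listing
repo_files = [
    "data/training_00.csv",
    "data/training_01.csv",
    "data/eval_00.csv",
    "README.md",
]

def resolve(data_dir: str, pattern: str) -> list:
    """Match `pattern` relative to `data_dir` against the repo file list."""
    return sorted(f for f in repo_files if fnmatch.fnmatch(f, f"{data_dir}/{pattern}"))

train_files = resolve("data", "training_*.csv")  # the `split: train` pattern
test_files = resolve("data", "eval_*.csv")       # the `split: test` pattern
# train_files == ["data/training_00.csv", "data/training_01.csv"]
# test_files  == ["data/eval_00.csv"]
```

Each split ends up with exactly the files whose paths match its pattern under the configured directory, which is the behavior the YAML above relies on.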
## Builder parameters

Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which subset to load your `csv` files:

```yaml
---
configs:
- config_name: tab
  data_files: "main_data.csv"
  sep: "\t"
- config_name: comma
  data_files: "additional_data.csv"
  sep: ","
---
```

Refer to the [specific builders' documentation](/docs/datasets/package_reference/builder_classes) to see what parameters they have.

### Spaces Changelog

https://huggingface.co/docs/hub/spaces-changelog.md

# Spaces Changelog

## [2026-03-18] - Protected Spaces visibility

- Spaces now support a **protected** visibility option, in addition to public and private. In Space settings, visibility is now set through a dropdown with three options instead of a simple public/private toggle.
- Protected visibility is available on [PRO](https://huggingface.co/pro) and [Team & Enterprise](https://huggingface.co/enterprise) plans.
- A protected Space keeps its source code private on the Hub, while the app remains publicly accessible through its embed URL or [custom domain](./spaces-custom-domain).
- This is especially useful for hosting websites without publishing source code.
- Read more in the [Spaces Overview](./spaces-overview#space-visibility).

## [2025-04-30] - Deprecate Streamlit SDK

- Streamlit is no longer provided as a default built-in SDK option. Streamlit applications are now created using the Docker template.

## [2023-07-28] - Upstream Streamlit frontend for `>=1.23.0`

- The Streamlit SDK uses the upstream packages published on PyPI for `>=1.23.0`, so the newly released versions are available from the day of release.

## [2023-05-30] - Add support for Streamlit 1.23.x and 1.24.0

- Added support for Streamlit `1.23.0`, `1.23.1`, and `1.24.0`.
- Since `1.23.0`, the Streamlit frontend has been changed to the upstream version from the HF-customized one.
## [2023-05-30] - Add support for Streamlit 1.22.0

- Added support for Streamlit `1.22.0`.

## [2023-05-15] - The default Streamlit version

- The default Streamlit version is set as `1.21.0`.

## [2023-04-12] - Add support for Streamlit up to 1.19.0

- Support for `1.16.0`, `1.17.0`, `1.18.1`, and `1.19.0` is added and the default SDK version is set as `1.19.0`.

## [2023-03-28] - Bug fix

- Fixed a bug causing inability to scroll on iframe-embedded or directly accessed Streamlit apps, which was reported at https://discuss.huggingface.co/t/how-to-add-scroll-bars-to-a-streamlit-app-using-space-direct-embed-url/34101. The patch has been applied to Streamlit>=1.18.1.

## [2022-12-15] - Spaces supports Docker Containers

- Read more: [Docker Spaces](./spaces-sdks-docker)

## [2022-12-14] - Ability to set a custom `sleep` time

- Read more: [Spaces sleep time](./spaces-gpus#sleep-time)

## [2022-12-07] - Add support for Streamlit 1.15

- Announcement: https://twitter.com/osanseviero/status/1600881584214638592

## [2022-06-07] - Add support for Streamlit 1.10.0

- The new multipage apps feature is working out-of-the-box on Spaces.
- Streamlit blogpost: https://blog.streamlit.io/introducing-multipage-apps

## [2022-05-23] - Spaces speedup and reactive system theme

- All Spaces using Gradio 3+ and Streamlit 1.x.x have a significant speedup in loading.
- System theme is now reactive inside the app. If the user changes to dark mode, it automatically changes.

## [2022-05-21] - Default Debian packages and Factory Reboot

- Spaces environments now come with pre-installed popular packages (`ffmpeg`, `libsndfile1`, etc.).
- This way, most of the time, you don't need to specify any additional package for your Space to work properly.
- The `packages.txt` file can still be used if needed.
- Added a factory reboot button to Spaces, which allows users to do a full restart avoiding cached requirements and freeing GPU memory.
## [2022-05-17] - Add support for Streamlit 1.9.0

- All `1.x.0` versions are now supported (up to `1.9.0`).

## [2022-05-16] - Gradio 3 is out!

- This is the default version when creating a new Space, don't hesitate to [check it out](https://huggingface.co/blog/gradio-blocks).

## [2022-03-04] - SDK version lock

- The `sdk_version` field is now automatically pre-filled at Space creation time.
- It ensures that your Space stays on the same SDK version after an update.

## [2022-03-02] - Gradio version pinning

- The `sdk_version` configuration field now works with the Gradio SDK.

## [2022-02-21] - Python versions

- You can specify the version of Python that you want your Space to run on.
- Only Python 3 versions are supported.

## [2022-01-24] - Automatic model and dataset linking from Spaces

- We attempt to automatically extract model and dataset repo ids used in your code
- You can always manually define them with `models` and `datasets` in your YAML.

## [2021-10-20] - Add support for Streamlit 1.0

- We now support all versions between 0.79.0 and 1.0.0

## [2021-09-07] - Streamlit version pinning

- You can now choose which version of Streamlit will be installed within your Space

## [2021-09-06] - Upgrade Streamlit to `0.84.2`

- Supporting Session State API
- [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.84.0)

## [2021-08-10] - Upgrade Streamlit to `0.83.0`

- [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.83.0)

## [2021-08-04] - Debian packages

- You can now add your `apt-get` dependencies into a `packages.txt` file

## [2021-08-03] - Streamlit components

- Add support for [Streamlit components](https://streamlit.io/components)

## [2021-08-03] - Flax/Jax GPU improvements

- For GPU-activated Spaces, make sure Flax / Jax runs smoothly on GPU

## [2021-08-02] - Upgrade Streamlit to `0.82.0`

- [Streamlit changelog](https://github.com/streamlit/streamlit/releases/tag/0.82.0)

## [2021-08-01] - Raw logs available

- Add
link to raw logs (build and container) from the space repository (viewable by users with write access to a Space).

### Jobs Overview

https://huggingface.co/docs/hub/jobs-overview.md

# Jobs Overview

Run compute jobs on Hugging Face infrastructure with a familiar UV & Docker-like interface!

- **UV & Docker-like CLI**: `uv`, `run`, `ps`, `logs`, `stats`, `inspect`
- **Any Hardware**: CPUs to A100s & TPUs
- **Run Anything**: UV, Docker, HF Spaces & more
- **Pay-as-you-go**: pay only for the seconds used

The Hugging Face Hub provides compute for AI and data workflows via Jobs. Jobs run on Hugging Face infrastructure and aim to give AI builders, data engineers, developers, and AI agents easy access to cloud infrastructure for their workloads. They are ideal for fine-tuning AI models and running inference on GPUs, but also for data ingestion and processing.

A Job is defined by a command to run (e.g. a UV or Python command), a hardware flavor (CPU, GPU, TPU), and optionally a Docker image from Hugging Face Spaces or Docker Hub. Many Jobs can run in parallel, which is useful e.g. for parameter tuning or parallel inference and data processing.

## Run Jobs from anywhere

There are multiple tools you can use to run Jobs:

* the `hf` Command Line Interface (see the [CLI installation steps](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) and the [Jobs CLI documentation](https://huggingface.co/docs/huggingface_hub/guides/cli#hf-jobs) for more information)
* the `huggingface_hub` Python client (see the [`huggingface_hub` Jobs documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs) for more information)
* the Jobs HTTP API (see the [Jobs OpenAPI](https://huggingface-openapi.hf.space/#tag/jobs) for more information)

## Run any workload

The `hf` Jobs CLI and the `huggingface_hub` Python client offer a UV-like interface to run Python workloads. UV installs the required Python dependencies and runs the Python script in a single command.
Python dependencies may also be defined in a self-contained UV script, in which case there is no need to specify anything but the UV script to run the Job.

```diff
- uv run
+ hf jobs uv run
```

More generally, Hugging Face Jobs supports any workload based on a Docker image and a command. Jobs offers a Docker-like interface to run Jobs, where you can specify a Docker image from Hugging Face Spaces or Docker Hub, as well as the command to run. Docker provides the ability to package ready-to-use environments as Docker images, shared by the community or custom made. You can therefore choose or define a Docker image based on what your workload needs (e.g. Python, torch, vLLM) and run any command. This is more advanced than using UV but provides more flexibility.

```diff
- docker run
+ hf jobs run
```

## Automate Jobs

Trigger Jobs automatically with a schedule or using webhooks.

With a schedule, you can run Jobs every X minutes, hours, days, weeks or months. Scheduling Jobs uses the `cron` syntax, like `"*/5 * * * *"` for "every 5 minutes", or aliases like `"@hourly"`, `"@daily"`, `"@weekly"` or `"@monthly"`.

With webhooks, Jobs can run whenever there is an update on a Hugging Face repository. For example, you can configure webhooks to trigger for every model update under a given account, and retrieve the updated model from the webhook payload in the Job.

### Cookie limitations in Spaces

https://huggingface.co/docs/hub/spaces-cookie-limitations.md

# Cookie limitations in Spaces

In Hugging Face Spaces, applications have certain limitations when using cookies. This is primarily due to the structure of the Spaces' pages (`https://huggingface.co/spaces//`), which contain applications hosted on a different domain (`*.hf.space`) within an iframe. For security reasons, modern browsers tend to restrict the use of cookies from iframe pages hosted on a different domain than the parent page.
## Impact on Hosting Streamlit Apps with Docker SDK

One instance where these cookie restrictions become problematic is hosting Streamlit applications with the Docker SDK. By default, Streamlit enables cookie-based XSRF protection. As a result, certain components that submit data to the server, such as `st.file_uploader()`, will not work properly on HF Spaces, where cookie usage is restricted.

To work around this issue, set the `server.enableXsrfProtection` option in Streamlit to `false`. There are two ways to do this:

1. Command line argument: The option can be specified as a command line argument when running the Streamlit application. Here is an example command:
   ```shell
   streamlit run app.py --server.enableXsrfProtection false
   ```
2. Configuration file: Alternatively, you can specify the option in the Streamlit configuration file `.streamlit/config.toml`. You would write it like this:
   ```toml
   [server]
   enableXsrfProtection = false
   ```

> [!TIP]
> When you are using the Streamlit SDK, you don't need to worry about this because the SDK does it for you.

### Editing datasets

https://huggingface.co/docs/hub/datasets-editing.md

# Editing datasets

The [Hub](https://huggingface.co/datasets) enables collaborative curation of community and research datasets. We encourage you to explore the datasets available on the Hub and contribute to their improvement to help grow the ML community and accelerate progress for everyone. All contributions are welcome! Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet.

## Edit using the Hub UI

> [!WARNING]
> This feature is only available for CSV, TSV, and Parquet datasets for now.

The Hub's web interface allows users without any technical expertise to edit a dataset. Open the dataset page and navigate to the **Data Studio** tab to begin editing. Click on **Toggle edit mode** to enable dataset editing.
Edit as many cells as you want and finally click **Commit** to commit your changes and leave a commit message.

## Using the `huggingface_hub` client library

The `huggingface_hub` library can manage Hub repositories, including editing datasets. For example, here is how to edit a CSV file using the [Hugging Face FileSystem API](https://huggingface.co/docs/huggingface_hub/en/guides/hf_file_system):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
path = f"datasets/{repo_id}/data.csv"
with fs.open(path, "r") as f:
    content = f.read()
edited_content = content.replace("foo", "bar")
with fs.open(path, "w") as f:
    f.write(edited_content)
```

You can also apply edits locally on your disk and commit the changes:

```python
from huggingface_hub import hf_hub_download, upload_file

local_path = hf_hub_download(repo_id=repo_id, filename="data.csv", repo_type="dataset")
with open(local_path, "r") as f:
    content = f.read()
edited_content = content.replace("foo", "bar")
with open(local_path, "w") as f:
    f.write(edited_content)
upload_file(path_or_fileobj=local_path, path_in_repo="data.csv", repo_id=repo_id, repo_type="dataset")
```

> [!TIP]
> To have the entire dataset repository locally and edit many files at once, use `snapshot_download` and `upload_folder` instead of `hf_hub_download` and `upload_file`.

Visit [the client library's documentation](/docs/huggingface_hub/index) to learn more.

## Integrated libraries

If a dataset on the Hub is compatible with a [supported library](./datasets-libraries), loading, editing, and pushing the dataset takes just a few lines. Here is how to edit a CSV file with Pandas:

```python
import pandas as pd

# Load the dataset
df = pd.read_csv(f"hf://datasets/{repo_id}/data.csv")

# Edit
df = df.apply(...)

# Commit the changes
df.to_csv(f"hf://datasets/{repo_id}/data.csv")
```

Libraries like Polars and DuckDB also implement the `hf://` protocol to read, edit and write files on Hugging Face. Other libraries, like Spark, Dask or 🤗 Datasets, are useful for editing datasets made of many files.
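If you just want to experiment with this read-edit-write pattern before touching a real repository, the same code shape works on a local file. A minimal sketch (the file contents and the `foo` → `bar` edit are illustrative, not from any real dataset):

```python
import os
import tempfile

# Create a small local CSV to stand in for a dataset file
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write("text\nfoo\nkeep\n")

# Read, edit, write back: the same pattern as the Hub examples above
with open(path, "r") as f:
    content = f.read()
edited_content = content.replace("foo", "bar")
with open(path, "w") as f:
    f.write(edited_content)

print(edited_content)  # -> text / bar / keep, one value per line
```

Swapping `open` for `HfFileSystem().open` (and the local path for a `datasets/{repo_id}/...` path) turns this into the Hub version.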
See the full list of supported libraries [here](./datasets-libraries).

For information on accessing the dataset on the website, you can click on the "Use this dataset" button on the dataset page to see how to do so. For example, [`samsum`](https://huggingface.co/datasets/knkarthick/samsum?library=datasets) shows how to do so with 🤗 Datasets.

## Only upload the new data

Hugging Face's storage is powered by [Xet](https://huggingface.co/docs/hub/en/xet), which uses chunk deduplication to make uploads more efficient. Unlike traditional cloud storage, Xet doesn't require the entire dataset to be re-uploaded to commit changes. Instead, it automatically detects which parts of the dataset have changed and instructs the client library to upload only the updated parts. To do that, Xet uses a smart algorithm to find chunks of 64kB that already exist on Hugging Face.

Let's revisit our previous example with Pandas:

```python
import pandas as pd

# Load the dataset
df = pd.read_csv(f"hf://datasets/{repo_id}/data.csv")

# Edit part of the dataset
df = df.apply(...)

# Commit the changes
df.to_csv(f"hf://datasets/{repo_id}/data.csv")
```

This code first loads a dataset and then edits it. Once the edits are done, `to_csv()` materializes the file in memory, chunks it, asks Xet which chunks are already on Hugging Face and which chunks have changed, and then uploads only the new data.

## Optimized Parquet editing

The amount of data to upload depends on the edits and the file structure. The Parquet format is columnar and compressed at the page level (pages are around ~1MB). We optimized Parquet for Xet with [Parquet Content Defined Chunking](https://huggingface.co/blog/parquet-cdc), which ensures unchanged data generally results in unchanged pages.
For example, this code uploads the content of `df`, and then for `edited_df` the upload is faster since only the chunks that changed are uploaded:

```python
import pandas as pd

df = ...  # your pandas DataFrame
df.to_parquet(
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)

edited_df = ...  # e.g. with added/modified/removed rows or columns
edited_df.to_parquet(
    "hf://datasets/username/my_dataset/imdb.parquet",
    # Optimize for Xet
    use_content_defined_chunking=True,
    write_page_index=True,
)
```

Chunks are ~64kB and Parquet saves data column by column, so in practice this is what happens when editing an optimized Parquet file:

* add a new column -> only the chunks of the new column are uploaded
* add/edit/delete a row -> one chunk per column is uploaded

In addition, the chunks of the Parquet footer containing metadata are also uploaded.

Check whether your library supports optimized Parquet in the [supported libraries](./datasets-libraries) page.

## Streaming

For big datasets, libraries with dataset streaming features are recommended for end-to-end streaming pipelines. In this case, the dataset processing runs progressively as the old data arrives and the new data is uploaded to the Hub. Check whether your library supports streaming in the [supported libraries](./datasets-libraries) page.

### Manage Jobs

https://huggingface.co/docs/hub/jobs-manage.md

# Manage Jobs

## List Jobs

Find your list of Jobs in the Jobs page or your organization's Jobs page (user/organization page > settings > Jobs).

It is also available in the Hugging Face CLI. Show the list of running Jobs with `hf jobs ps` and use `-a` to show all the Jobs:

```bash
>>> hf jobs ps
JOB ID       IMAGE/SPACE      COMMAND     CREATED             STATUS
------------ ---------------- ----------- ------------------- -------
69402ea6c... ghcr.io/astra... uv run p... 2025-12-15 15:52:06 RUNNING

>>> hf jobs ps -a
JOB ID       IMAGE/SPACE COMMAND         CREATED             STATUS
------------ ---------- --------------- ------------------- ---------
69402ea6c... ghcr.io... uv run pytho... 2025-12-15 15:52:06 RUNNING
693b06b8c... ghcr.io... uv run pytho... 2025-12-11 18:00:24 CANCELED
693b069fc... ghcr.io... uv run pytho... 2025-12-11 17:59:59 ERROR
693aef401... ghcr.io... uv run pytho... 2025-12-11 16:20:16 COMPLETED
693aee76c... ubuntu     echo Hello f... 2025-12-11 16:16:54 COMPLETED
693ae8e3c... python:... python -c pr... 2025-12-11 15:53:07 COMPLETED
```

Specify your organization `namespace` to list Jobs under your organization:

```bash
>>> hf jobs ps --namespace
```

## Filter Jobs

Click on a Job's label to filter Jobs by label.

In the CLI, you can filter Jobs based on the conditions provided, using the format `key=value`.

Filter by labels:

```bash
>>> hf jobs ps --filter label=fine-tuning --filter label=model=Qwen3-06B -a
JOB ID       IMAGE/SPACE  COMMAND          CREATED             STATUS
------------ ------------ ---------------- ------------------- ---------
6978b1254... ghcr.io/a... uv run --with... 2026-01-27 12:35:49 COMPLETED
6978b11d4... ghcr.io/a... uv run --with... 2026-01-27 12:33:53 COMPLETED
```

Filter on any condition:

```bash
>>> hf jobs ps --filter status=error -a
JOB ID       IMAGE/SPACE COMMAND            CREATED             STATUS
------------ ---------- ------------------ ------------------- ------
693b069fc... ghcr.io... uv run python -... 2025-12-11 17:59:59 ERROR
693996dec... ghcr.io... bash -c python ... 2025-12-10 15:50:54 ERROR
69399695c... ghcr.io... uv run --with t... 2025-12-10 15:49:41 ERROR
693994bdc... ghcr.io... uv run --with t... 2025-12-10 15:41:49 ERROR
68d3c1af3... ghcr.io... uv run bash -c ... 2025-09-24 10:02:23 ERROR
```

Filtering supports negation `!=` and glob patterns (including `*` and `?`):

```bash
# Show Jobs that are not completed
>>> hf jobs ps -a --filter status!=completed

# Show Jobs with a command that ends with "train.py"
>>> hf jobs ps -a --filter "command=*train.py"

# Show Jobs with a "fine-tuning" label
>>> hf jobs ps -a --filter label=fine-tuning

# Show Jobs that don't have the "prod" label and have a label that starts with "data-"
>>> hf jobs ps -a --filter label!=prod --filter "label=data-*"

# Show Jobs based on key=value labels
>>> hf jobs ps -a --filter label=model=Qwen3-06B --filter label=dataset!=Capybara
```

## Monitor resource usage

Use `hf jobs stats` to get the usage statistics for CPU, memory, network and GPU (if any) of running Jobs:

```bash
>>> hf jobs stats
JOB ID                   CPU % NUM CPU MEM % MEM USAGE        NET I/O         GPU UTIL % GPU MEM % GPU MEM USAGE
------------------------ ----- ------- ----- ---------------- --------------- ---------- --------- ---------------
695e83c5d2f3efac77e8cf18 8%    12.0    7.18% 10.9GB / 152.5GB 0.0bps / 0.0bps 100%       31.92%    25.9GB / 81.2GB
```

Specify one or several Job ids to only show the statistics of certain Jobs:

```bash
>>> hf jobs stats [job-ids]...
```

## Inspect a Job

You can see the status and logs of a Job in the Job page, or using the CLI:

```bash
>>> hf jobs inspect 693994e21a39f67af5a41ad0
[
  {
    "id": "693994e21a39f67af5a41ad0",
    "created_at": "2025-12-10 15:42:26.835000+00:00",
    "docker_image": "ghcr.io/astral-sh/uv:python3.12-bookworm",
    "space_id": null,
    "command": ["bash", "-c", "python -c \"import urllib.request; import os; from pathlib import Path; o = urllib.request.build_opener(); o.addheaders = [(\\\"Authorization\\\", \\\"Bearer \\\" + os.environ[\\\"UV_SCRIPT_HF_TOKEN\\\"])]; Path(\\\"/tmp/script.py\\\").write_bytes(o.open(os.environ[\\\"UV_SCRIPT_URL\\\"]).read())\" && uv run --with trl /tmp/script.py"],
    "arguments": [],
    "environment": {"UV_SCRIPT_URL": "https://huggingface.co/datasets/lhoestq/hf-cli-jobs-uv-run-scripts/resolve/728cc5682eb402d7ffe66a2f6f97645b34cb08dd/train.py"},
    "secrets": ["HF_TOKEN", "UV_SCRIPT_HF_TOKEN"],
    "flavor": "a100-large",
    "status": {"stage": "COMPLETED", "message": null},
    "owner": {"id": "5e9ecfc04957053f60648a3e", "name": "lhoestq", "type": "user"},
    "endpoint": "https://huggingface.co",
    "url": "https://huggingface.co/jobs/lhoestq/693994e21a39f67af5a41ad0"
  }
]
```

and for the logs:

```bash
>>> hf jobs logs 693994e21a39f67af5a41ad0
Downloading nvidia-cuda-nvrtc-cu12 (84.0MiB)
Downloading numpy (15.8MiB)
Downloading nvidia-cuda-cupti-cu12 (9.8MiB)
Downloading tokenizers (3.1MiB)
Downloading nvidia-cusolver-cu12 (255.1MiB)
Downloading nvidia-cufft-cu12 (184.2MiB)
Downloading transformers (11.4MiB)
Downloading setuptools (1.1MiB)
...
```

Specify your organization `namespace` to inspect a Job under your organization:

```bash
hf jobs inspect --namespace
hf jobs logs --namespace
```

## Debug a Job

If a Job has an error, you can see it on the Job page. Look at the status message and the logs to see what went wrong. You may also look at the last lines of the logs to see what happened before the Job failed.
You can see that on the Job page, or using the CLI:

```bash
>>> hf jobs logs 69405cf51a39f67af5a41f29 | tail -n 10
Downloaded nvidia-cudnn-cu12
Downloaded torch
Installed 66 packages in 226ms
Generating train split: 100%|██████████| 15806/15806 [00:00
    train_dataset=train_dataset,
    ^^^^^^^^^^^^^
NameError: name 'train_dataset' is not defined. Did you mean: 'load_dataset'?
```

Debug a Job locally using your local UV or Docker setup:

* `hf jobs uv run ...` -> `uv run ...`
* `hf jobs run ...` -> `docker run ...`

The status message can say "Job timeout": it means the Job didn't finish before the timeout (the default is 30min) and was therefore stopped. In this case you need to specify a higher timeout, using `--timeout` in the CLI, e.g.

```bash
hf jobs uv run --timeout 3h ...
```

## Cancel Jobs

Use the "Cancel" button on the Job page to cancel a Job, or use the CLI:

```bash
hf jobs cancel 693b06b8c67c9f186cfe239e
```

Specify your organization `namespace` to cancel a Job under your organization:

```bash
hf jobs cancel --namespace
```

## macOS menu bar

Find your list of Jobs in the macOS [`hfjobs-menubar`](https://github.com/drbh/hfjobs-menubar) client to get Jobs information and monitor logs and resource usage statistics.

### Using Xet Storage

https://huggingface.co/docs/hub/xet/using-xet-storage.md

# Using Xet Storage

## Python

To access a Xet-aware version of `huggingface_hub`, simply install the latest version:

```bash
pip install -U huggingface_hub
```

As of `huggingface_hub` 0.32.0, this will also install `hf_xet`. The `hf_xet` package integrates `huggingface_hub` with [`xet-core`](https://github.com/huggingface/xet-core), the Rust client for the Xet backend.

If you use the `transformers` or `datasets` libraries, they already use `huggingface_hub`. As long as the version of `huggingface_hub` is >= 0.32.0, no further action needs to be taken.
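A quick way to check the 0.32.0 threshold programmatically. This is a hedged sketch: the helper name `is_xet_aware` is ours, not part of the library; it only compares version numbers against the `hf_xet` cutoff mentioned above.

```python
def is_xet_aware(version: str) -> bool:
    # Parse up to three numeric components ("0.32.0" -> (0, 32, 0))
    # and compare against the release that started bundling hf_xet.
    parts = [int(p) for p in version.split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts) >= (0, 32, 0)

print(is_xet_aware("0.32.0"))  # -> True
print(is_xet_aware("0.29.3"))  # -> False
```

In practice you would pass it `huggingface_hub.__version__` (pre-release suffixes like `.dev0` would need extra handling).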
## Git

Xet support was introduced in `huggingface_hub` 0.30.0. Git users can access the benefits of Xet by downloading and installing the Git Xet extension. Once installed, simply use the [standard workflows for managing Hub repositories with Git](../repositories-getting-started) - no additional changes necessary.

### Prerequisites

Install [Git](https://git-scm.com/) and [Git LFS](https://git-lfs.com/).

### Install on macOS or Linux (amd64 or aarch64)

Install using an installation script with the following command in your terminal (requires `curl` and `unzip`):

```
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/huggingface/xet-core/refs/heads/main/git_xet/install.sh | sh
```

Or, install using [Homebrew](https://brew.sh/):

```
brew install git-xet
git xet install
```

To verify the installation, run:

```
git xet --version
```

### Windows (amd64)

Using `winget`:

```
winget install git-xet
```

Using an installer:

- Download `git-xet-windows-installer-x86_64.zip` ([available here](https://github.com/huggingface/xet-core/releases/download/git-xet-v0.2.0/git-xet-windows-installer-x86_64.zip)) and unzip.
- Run the `msi` installer file and follow the prompts.

Manual installation:

- Download `git-xet-windows-x86_64.zip` ([available here](https://github.com/huggingface/xet-core/releases/download/git-xet-v0.2.0/git-xet-windows-x86_64.zip)) and unzip.
- Place the extracted `git-xet.exe` under a `PATH` directory.
- Run `git xet install` in a terminal.

To verify the installation, run:

```
git xet --version
```

### Using Git Xet

Once installed on your platform, using Git Xet is as simple as following the Hub's standard Git workflows.
Make sure all [prerequisites are installed and configured](https://huggingface.co/docs/hub/repositories-getting-started#requirements), follow the [setup instructions for working with repositories on the Hub](https://huggingface.co/docs/hub/repositories-getting-started#set-up), then commit your changes and `push` to the Hub:

```
# Create any files you like! Then...
git add .
git commit -m "Uploading new models"  # You can choose any descriptive message
git push
```

Under the hood, the [Xet protocol](https://huggingface.co/docs/xet/index) is invoked to upload large files directly to Xet storage, increasing upload speeds through the power of [chunk-level deduplication](./deduplication).

### Uninstall on macOS or Linux

Using Homebrew:

```bash
git xet uninstall
brew uninstall git-xet
```

If you used the installation script (for macOS or Linux), run the following in your terminal:

```bash
git xet uninstall
sudo rm $(which git-xet)
```

### Uninstall on Windows

If you used `winget`:

```
winget uninstall git-xet
```

If you used the installer:

- Navigate to Settings -> Apps -> Installed apps
- Find "Git-Xet".
- Select the "Uninstall" option available in the context menu.

If you manually installed:

- Run `git xet uninstall` in a terminal.
- Delete the `git-xet.exe` file from the location where it was originally placed.

## Recommendations

Xet integrates seamlessly with all of the Hub's workflows. However, there are a few steps you may consider to get the most benefits from Xet storage.

When uploading or downloading with Python:

- **Make sure `hf_xet` is installed**: While Xet remains backward compatible with legacy clients optimized for Git LFS, the `hf_xet` integration with `huggingface_hub` delivers optimal chunk-based performance and faster iteration on large files.
- **Adaptive concurrency is on by default**: `hf_xet` automatically adjusts the number of parallel transfer streams based on real-time network conditions, with no configuration required. The default settings will saturate most network paths without any tuning.
- **Advanced tuning**: For fine-grained control, `HF_XET_FIXED_DOWNLOAD_CONCURRENCY` and `HF_XET_FIXED_UPLOAD_CONCURRENCY` let you pin concurrency to a fixed value, bypassing the adaptive controller. See `hf_xet`'s [environment variables](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#xet) for the full list of options.

When uploading or downloading in Git or Python:

- **Leverage frequent, incremental commits**: Xet's chunk-level deduplication means you can safely make incremental updates to models or datasets. Only changed chunks are uploaded, so frequent commits are both fast and storage-efficient.
- **Be specific in `.gitattributes`**: When defining patterns for Xet or LFS, use precise file extensions (e.g., `*.safetensors`, `*.bin`) to avoid unnecessarily routing smaller files through large-file storage.
- **Prioritize community access**: Xet substantially increases the efficiency and scale of large file transfers. Instead of structuring your repository to reduce its total size (or the size of individual files), organize it for collaborators and community users so they may easily navigate and retrieve the content they need.

## Environment Variables

Both `hf_xet` and Git Xet are powered by `xet-core`, which can be configured via environment variables. The tables below list the individual variables for fine-grained control. Most users will not need to change any of these: the defaults are tuned to saturate most network paths automatically.

> [!NOTE]
> `HF_XET_HIGH_PERFORMANCE=1` is a convenience flag that adjusts several settings at once (concurrency bounds, buffer sizes, and parallel file limits). It is intended for machines with high bandwidth **and at least 64 GB of RAM** for buffering. On machines with less memory, it may degrade performance.
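For example, to bypass the adaptive controller and pin transfer concurrency using the variables mentioned above (the value `8` is an arbitrary example, not a recommendation):

```bash
# Pin upload and download parallelism to a fixed value
export HF_XET_FIXED_DOWNLOAD_CONCURRENCY=8
export HF_XET_FIXED_UPLOAD_CONCURRENCY=8

# Or, on a high-bandwidth machine with at least 64 GB of RAM:
export HF_XET_HIGH_PERFORMANCE=1
```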
### Adaptive Concurrency

By default, `xet-core` uses adaptive concurrency, dynamically adjusting parallelism based on real-time network conditions. These are advanced settings that are unlikely to be needed in most cases. The variables below control the adaptive controller's behavior:

| Environment Variable | Default | Description |
|---|---|---|
| `HF_XET_CLIENT_ENABLE_ADAPTIVE_CONCURRENCY` | `true` | Enable or disable adaptive concurrency control. When disabled, concurrency stays at the initial value. |
| `HF_XET_CLIENT_AC_INITIAL_UPLOAD_CONCURRENCY` | `1` | Starting number of concurrent upload streams. HP mode: `16`. |
| `HF_XET_CLIENT_AC_INITIAL_DOWNLOAD_CONCURRENCY` | `1` | Starting number of concurrent download streams. HP mode: `16`. |
| `HF_XET_CLIENT_AC_MIN_UPLOAD_CONCURRENCY` | `1` | Lower bound for upload concurrency. HP mode: `4`. |
| `HF_XET_CLIENT_AC_MIN_DOWNLOAD_CONCURRENCY` | `1` | Lower bound for download concurrency. HP mode: `4`. |
| `HF_XET_CLIENT_AC_MAX_UPLOAD_CONCURRENCY` | `64` | Upper bound for upload concurrency. HP mode: `124`. |
| `HF_XET_CLIENT_AC_MAX_DOWNLOAD_CONCURRENCY` | `64` | Upper bound for download concurrency. HP mode: `124`. |
| `HF_XET_CLIENT_AC_TARGET_RTT` | `60s` | Target round-trip time. Concurrency increases as long as the predicted round-trip time for a full transfer is below this value. |
| `HF_XET_CLIENT_AC_MAX_HEALTHY_RTT` | `90s` | Maximum acceptable round-trip time. Transfers taking longer than this are counted as failures by the adaptive controller. |
| `HF_XET_CLIENT_AC_HEALTHY_SUCCESS_RATIO_THRESHOLD` | `0.8` | Success ratio above which the controller increases concurrency. |
| `HF_XET_CLIENT_AC_UNHEALTHY_SUCCESS_RATIO_THRESHOLD` | `0.5` | Success ratio below which the controller decreases concurrency. |
| `HF_XET_CLIENT_AC_LOGGING_INTERVAL_MS` | `10000` | Interval (in ms) at which concurrency status is logged. |

> [!TIP]
> To pin concurrency to a fixed value (bypassing the adaptive controller), use the convenience aliases `HF_XET_FIXED_UPLOAD_CONCURRENCY` and `HF_XET_FIXED_DOWNLOAD_CONCURRENCY`. These set the initial, minimum, and maximum concurrency to the same value.

### Network and Retry

| Environment Variable | Default | Description |
|---|---|---|
| `HF_XET_CLIENT_RETRY_MAX_ATTEMPTS` | `5` | Maximum number of retry attempts for failed requests. |
| `HF_XET_CLIENT_RETRY_BASE_DELAY` | `3000ms` | Base delay between retries (with exponential backoff). |
| `HF_XET_CLIENT_RETRY_MAX_DURATION` | `360s` | Maximum total time to spend retrying a request. |
| `HF_XET_CLIENT_CONNECT_TIMEOUT` | `60s` | TCP connection timeout. |
| `HF_XET_CLIENT_READ_TIMEOUT` | `120s` | Read timeout for HTTP responses. |
| `HF_XET_CLIENT_IDLE_CONNECTION_TIMEOUT` | `60s` | Timeout before idle connections are closed. |
| `HF_XET_CLIENT_MAX_IDLE_CONNECTIONS` | `16` | Maximum number of idle connections in the pool. |

### Data Transfer

| Environment Variable | Default | Description |
|---|---|---|
| `HF_XET_DATA_MAX_CONCURRENT_FILE_INGESTION` | `8` | Maximum number of files processed concurrently during upload. HP mode: `100`. |
| `HF_XET_DATA_MAX_CONCURRENT_FILE_DOWNLOADS` | `8` | Maximum number of files downloaded concurrently. |
| `HF_XET_DATA_INGESTION_BLOCK_SIZE` | `8mb` | Size of blocks read during file ingestion. |
| `HF_XET_DATA_PROGRESS_UPDATE_INTERVAL` | `200ms` | How often progress bars are updated. |
| `HF_XET_DATA_PROGRESS_UPDATE_SPEED_SAMPLING_WINDOW` | `10s` | Time window used for aggregating transfer speed measurements in progress reporting. |

### Download Buffers

These control memory usage during downloads. `HF_XET_HIGH_PERFORMANCE=1` raises these significantly.

| Environment Variable | Default | HP Mode | Description |
|---|---|---|---|
| `HF_XET_RECONSTRUCTION_MIN_RECONSTRUCTION_FETCH_SIZE` | `256mb` | `1gb` | Minimum fetch size for reconstruction requests. |
| `HF_XET_RECONSTRUCTION_MAX_RECONSTRUCTION_FETCH_SIZE` | `8gb` | `16gb` | Maximum fetch size for reconstruction requests. |
| `HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE` | `2gb` | `16gb` | Total download buffer size. |
| `HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_PERFILE_SIZE` | `512mb` | `2gb` | Per-file download buffer size. |
| `HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_LIMIT` | `8gb` | `64gb` | Hard limit on total download buffer memory. |
| `HF_XET_RECONSTRUCTION_TARGET_BLOCK_COMPLETION_TIME` | `15m` | — | Target time for completing a prefetch block. Used to determine how much data to prefetch ahead during downloads. |
| `HF_XET_RECONSTRUCTION_MIN_PREFETCH_BUFFER` | `1gb` | — | Minimum amount of data to keep prefetched during downloads, regardless of estimated completion time. |

### Logging

| Environment Variable | Default | Description |
|---|---|---|
| `HF_XET_LOG_DEST` | (none) | Log destination. Accepts a file path or directory path (ending with `/`). When set to a directory, log files are created with timestamped names. When set to an empty string, logs go to the console. When unset, logs go to the `logs/` subdirectory in the Hugging Face Xet cache directory. |
| `HF_XET_LOG_FORMAT` | (none) | Log format. Set to `json` for JSON-formatted logs; otherwise plain text. By default, file logging uses JSON and console logging uses text. |
| `HF_XET_LOG_PREFIX` | `xet` | Prefix for log file names when logging to a directory. |
| `HF_XET_LOG_DIR_DISABLE_CLEANUP` | `false` | Disable automatic cleanup of old log files in the log directory. |
| `HF_XET_LOG_DIR_MAX_SIZE` | `250mb` | Maximum total size of log files in the log directory. Old files are pruned to stay under this limit. |
| `HF_XET_LOG_DIR_MIN_DELETION_AGE` | `1d` | Minimum age before a log file can be deleted during cleanup. |
| `HF_XET_LOG_DIR_MAX_RETENTION_AGE` | `14d` | Maximum age for log files. Files older than this are always deleted during cleanup. |

## Current Limitations

While Xet brings fine-grained deduplication and enhanced performance to Git-based storage, some features and platform compatibilities are still in development. As a result, keep the following constraints in mind when working with a Xet-enabled repository:

- **64-bit systems only**: Both `hf_xet` and Git Xet currently require a 64-bit architecture; 32-bit systems are not supported.

### Xet: our Storage Backend

https://huggingface.co/docs/hub/xet/index.md

# Xet: our Storage Backend

Repositories on the Hugging Face Hub are different from those on software development platforms. They contain files that are:

- Large - model or dataset files are in the range of GB and above. We have a few TB-scale files!
- Binary - not in a human readable format by default (e.g., [Safetensors](https://huggingface.co/docs/safetensors/en/index) or [Parquet](https://huggingface.co/docs/dataset-viewer/en/parquet#what-is-parquet))

While the Hub leverages modern version control with the support of Git, these differences make [Model](https://huggingface.co/docs/hub/models) and [Dataset](https://huggingface.co/docs/hub/datasets) repositories quite different from those that contain only source code. Storing these files directly in a pure Git repository is impractical. Not only are the typical storage systems behind Git repositories unsuited for such files, but when you clone a repository, Git retrieves the entire history, including all file revisions. This can be prohibitively large for massive binaries, forcing you to download gigabytes of historic data you may never need.

Instead, on the Hub, these large files are tracked using "pointer files" and identified through a `.gitattributes` file (both discussed in more detail below), which remain in the Git repository while the actual data is stored in remote storage (like [Amazon S3](https://aws.amazon.com/s3/)). As a result, the repository stays small and typical Git workflows remain efficient.
Historically, Hub repositories have relied on [Git LFS](https://git-lfs.com/) for this mechanism. While Git LFS remains supported (see [Backwards Compatibility & Legacy](./legacy-git-lfs)), the Hub has adopted Xet, a modern custom storage system built specifically for AI/ML development. It enables chunk-level deduplication, smaller uploads, and faster downloads than Git LFS. ## Open Source Xet Protocol If you are looking to understand the underlying Xet protocol or are looking to build a new client library to access Xet Storage, check out the [Xet Protocol Specification](https://huggingface.co/docs/xet/index). In these pages you will get started in using Xet Storage. ## Contents - [Xet History & Overview](./overview) - [Using Xet Storage](./using-xet-storage) - [Security](./security) - [Backwards Compatibility & Legacy](./legacy-git-lfs) - [Deduplication](./deduplication) ### Xet History & Overview https://huggingface.co/docs/hub/xet/overview.md # Xet History & Overview [In August 2024 Hugging Face acquired XetHub](https://huggingface.co/blog/xethub-joins-hf), a [seed-stage startup based in Seattle](https://www.geekwire.com/2023/ex-apple-engineers-raise-7-5m-for-new-seattle-data-storage-startup/), to replace Git LFS on the Hub. Like Git LFS, a Xet-backed repository utilizes S3 as the remote storage with a `.gitattributes` file at the repository root helping identify what files should be stored remotely. A Git LFS pointer file provides metadata to locate the actual file contents in remote storage: - **SHA256**: Provides a unique identifier for the actual large file. This identifier is generated by computing the SHA-256 hash of the fileโ€™s contents. - **Pointer size**: The size of the pointer file stored in the Git repository. - **Size of the remote file**: Indicates the size of the actual large file in bytes. This metadata is useful for both verification purposes and for managing storage and transfer operations. 
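For reference, this is what a Git LFS pointer file committed to the repository looks like; the `oid` and `size` values below are the illustrative ones from the Git LFS specification, not from a real file:

```
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345
```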
A Xet pointer includes all of this information by design, with the addition of a `Xet backed hash` field for referencing the file in Xet storage. Refer to the section on [backwards compatibility with Git LFS](legacy-git-lfs#backward-compatibility-with-lfs) for details.

Unlike Git LFS, which deduplicates at the file level, Xet-enabled repositories deduplicate at the level of bytes. When a file backed by Xet storage is updated, only the modified data is uploaded to remote storage, significantly saving on network transfers. For many workflows, like incremental updates to model checkpoints or appending/inserting new data into a dataset, this improves iteration speed for you and your collaborators.

To learn more about deduplication in Xet storage, refer to [Deduplication](deduplication).

### Deduplication

https://huggingface.co/docs/hub/xet/deduplication.md

# Deduplication

Xet-enabled repositories utilize [content-defined chunking (CDC)](https://huggingface.co/blog/from-files-to-chunks) to deduplicate at the level of bytes (~64KB of data, also referred to as a "chunk"). Each chunk is identified by a rolling hash that determines chunk boundaries based on the actual file contents, making it resilient to insertions or deletions anywhere in the file.

When a file is uploaded to a Xet-backed repository using a Xet-aware client, its contents are broken down into these variable-sized chunks. Only new chunks not already present in Xet storage are kept after chunking; everything else is discarded.

## How Content-Defined Chunking Works

To understand content-defined chunking, imagine a file as a long passage of text. The system scans the data using a rolling hash, a small mathematical function that slides over the bytes. Whenever the hash hits a special pattern, a chunk boundary is placed at that position. Because the boundaries are determined by the *content itself* (not by fixed positions), identical regions of data always produce the same chunks, even if surrounding content changes.
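A toy version of this idea fits in a few lines of Python. This is purely illustrative: the window size, hash function, and boundary mask are arbitrary choices, not the real Xet chunker, which targets ~64 KB chunks with a production-grade rolling hash:

```python
# Toy content-defined chunker: place a boundary wherever a simple hash of
# the last `window` bytes matches a pattern. Boundaries depend only on the
# content, so identical regions chunk identically regardless of position.
def chunk(data: bytes, window: int = 4, mask: int = 0x07) -> list[bytes]:
    chunks, start = [], 0
    for i in range(window, len(data)):
        h = 0
        for byte in data[i - window:i]:   # recomputed per position for clarity;
            h = (h * 31 + byte) & 0xFFFF  # a real rolling hash updates in O(1)
        if h & mask == 0:                 # "special pattern": low bits all zero
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

data = b"The quick brown fox jumps over the lazy dog. " * 3
parts = chunk(data)

# Chunking is lossless, and boundaries in a shared prefix are identical,
# so all chunks before the final one reappear when data is appended:
extended = chunk(data + b"appended bytes")
assert b"".join(parts) == data
assert parts[:-1] == extended[:len(parts) - 1]
```

Because a boundary depends only on the preceding `window` bytes, unchanged regions keep their boundaries even when content is added elsewhere.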
### Why Not Fixed-Size Chunks?

Consider what happens when you insert a small amount of data in the middle of a file. With fixed-size chunking, every chunk boundary after the insertion shifts, invalidating all downstream chunks even though most of the data is unchanged:

```text
Original file, fixed 6-byte chunks:

|The qu|ick br|own fo|x jump|s over| the l|azy do|g     |
 chunk1 chunk2 chunk3 chunk4 chunk5 chunk6 chunk7 chunk8

Insert "very " before "lazy":

|The qu|ick br|own fo|x jump|s over| the v|ery la|zy dog|
 chunk1 chunk2 chunk3 chunk4 chunk5 chunk6 chunk7 chunk8
                                    ~~~~~~ ~~~~~~ ~~~~~~
                                    3 chunks changed!
```

Even though only 5 bytes were inserted, **3 out of 8 chunks changed** because all boundaries after the edit shifted by 5 positions. In real files at a 64KB chunk size, a small edit can invalidate hundreds of megabytes of chunks.

### Content-Defined Chunking Keeps Boundaries Stable

With CDC, boundaries are placed where the *content* matches a pattern, not at fixed intervals. This means an insertion only affects the chunk where the edit occurs. Chunks before and after remain identical:

```text
Original file, content-defined chunks (boundaries marked by "|"):

|The quick |brown fox |jumps over |the lazy dog|
  chunk 1    chunk 2    chunk 3     chunk 4

Insert "very " before "lazy":

|The quick |brown fox |jumps over |the very lazy dog|
  chunk 1    chunk 2    chunk 3     chunk 4'
  (same)     (same)     (same)      (changed)
```

Only **1 out of 4 chunks changed**: the one containing the edit. The other three are byte-for-byte identical and are deduplicated. This is why CDC is so effective for versioned data: when you update a model checkpoint or append rows to a dataset, only the modified portions need to be uploaded and stored.
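The fixed-size arithmetic above is easy to check programmatically. This short, illustrative sketch counts how many 6-byte chunks change after the insertion:

```python
# Count how many fixed-size chunks differ after inserting "very " into the
# sentence used in the example above.
def fixed_chunks(data: bytes, size: int = 6) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

original = b"The quick brown fox jumps over the lazy dog"
edited = original.replace(b"lazy", b"very lazy")  # insert 5 bytes mid-file

before, after = fixed_chunks(original), fixed_chunks(edited)
changed = sum(a != b for a, b in zip(before, after)) + abs(len(before) - len(after))
print(changed, "of", len(before), "chunks changed")  # 3 of 8 chunks changed
```

Every chunk at or after the insertion point differs, while all chunks before it are untouched.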
### From Chunks to Storage

The full deduplication pipeline works as follows:

```mermaid
flowchart LR
    A["File"] --> B["Content-Defined\nChunking"]
    B --> C{"Chunk already\nstored?"}
    C -- "Yes (duplicate)" --> D["Skip upload\n(reuse existing)"]
    C -- "No (new)" --> E["Group into\n64 MB blocks"]
    E --> F["Upload to\nXet Storage"]
```

When a file is chunked, each chunk's hash is checked against what is already stored. This happens at multiple levels: first against chunks already seen in the current upload session, then against a local cache of previously uploaded metadata, and finally a subset of chunks are checked against all of Xet storage via a global deduplication query. Duplicate chunks are skipped entirely. New chunks are grouped into 64 MB blocks and uploaded. Each block is stored once in a content-addressed store (CAS), keyed by its hash.

## Storage Savings in Practice

The Hub's [current recommendation](https://huggingface.co/docs/hub/storage-limits#recommendations) is to limit files to 200 GB. At a 64KB chunk size, a 20GB file has 312,500 chunks, many of which go unchanged from version to version. Git LFS is designed to notice only that a file has changed and store the entirety of that revision. By deduplicating at the level of chunks, the Xet backend enables storing only the modified content in a file (which might only be a few KB or MB) and securely deduplicates shared blocks across repositories. For the large binary files found in Model and Dataset repositories, this provides significant improvements to file transfer times.

For more details, refer to the [From Files to Chunks](https://huggingface.co/blog/from-files-to-chunks) and [From Chunks to Blocks](https://huggingface.co/blog/from-chunks-to-blocks) blog posts, or the [Git is for Data](https://www.cidrdb.org/cidr2023/papers/p43-low.pdf) paper by Low et al. that served as the launch point for XetHub prior to its acquisition by Hugging Face.
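The skip-or-upload decision above can be sketched as a toy content-addressed store. The class and function names here are made up for illustration; the real CAS groups new chunks into ~64 MB blocks before upload and runs the multi-level dedup checks described above:

```python
import hashlib

class ContentAddressedStore:
    """Toy CAS: each chunk is stored once, keyed by its hash."""
    def __init__(self):
        self.blobs: dict[str, bytes] = {}
        self.uploaded = 0   # new chunks actually transferred
        self.skipped = 0    # duplicates reused from existing storage

    def put(self, chunk: bytes) -> str:
        key = hashlib.sha256(chunk).hexdigest()
        if key in self.blobs:
            self.skipped += 1          # duplicate: skip upload, reuse existing
        else:
            self.blobs[key] = chunk    # new chunk: "upload" it
            self.uploaded += 1
        return key

def upload_file(cas: ContentAddressedStore, chunks: list[bytes]) -> list[str]:
    """Store a file's chunks; the returned key list is enough to rebuild it."""
    return [cas.put(c) for c in chunks]

cas = ContentAddressedStore()
v1 = upload_file(cas, [b"chunk-a", b"chunk-b", b"chunk-c"])   # 3 uploads
v2 = upload_file(cas, [b"chunk-a", b"chunk-B2", b"chunk-c"])  # 1 upload, 2 skips

# Either version can be reconstructed from its ordered key list:
restored = b"".join(cas.blobs[k] for k in v2)
```

Uploading the second version transfers only the one chunk that changed; the key lists play the role of the per-file reconstruction information.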
### Backward Compatibility with LFS

https://huggingface.co/docs/hub/xet/legacy-git-lfs.md

# Backward Compatibility with LFS

Uploads from legacy / non-Xet-aware clients still follow the standard Git LFS path, even if the repo is already Xet-backed. Once the file is uploaded to LFS, a background process automatically migrates the file to Xet storage.

The Xet architecture provides backwards compatibility for legacy clients downloading files from Xet-backed repos by offering a Git LFS bridge. While a Xet-aware client receives file reconstruction information from the CAS to download a Xet-backed file, a legacy client gets a single URL from the bridge, which does the work of reconstructing the requested file and returning a URL to the resource. This allows downloading files through a URL so that you can continue to use the Hub's web interface or `curl`.

By having LFS file uploads automatically migrate and having older clients continue to download files from Xet-backed repositories, maintainers and the rest of the Hub can update their pipelines at their own pace. Xet storage provides a seamless transition for existing Hub repositories; it isn't necessary to know whether the Xet backend is involved at all.

Xet-backed repositories continue to use the Git LFS pointer file format; the `Xet backed hash` is only added to the web interface as a convenience. Practically, this means existing repos and newly created repos will not look any different if you do a bare clone of them. Each of the large (or binary) files will continue to have a pointer file that matches the Git LFS pointer file specification. This symmetry allows non-Xet-aware clients (e.g., older versions of `huggingface_hub`) to interact with Xet-backed repositories without concern. In fact, within a repository a mixture of Git LFS and Xet-backed files is supported.
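As a sketch of what "matches the Git LFS pointer file specification" means in practice, a pointer in a Xet-backed repo still has the familiar three-line layout. The `is_lfs_pointer` helper and the digest below are hypothetical, for illustration only:

```python
import re

# Minimal shape of a Git LFS pointer: version line, then oid, then size.
LFS_POINTER_RE = re.compile(
    r"^version https://git-lfs\.github\.com/spec/v1\n"
    r"oid sha256:[0-9a-f]{64}\n"
    r"size [0-9]+\n$"
)

def is_lfs_pointer(text: str) -> bool:
    return LFS_POINTER_RE.match(text) is not None

# A pointer from a bare clone of a Xet-backed repo still passes this check:
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "ab" * 32 + "\n"
    "size 134217728\n"
)
```

Because the on-disk format is unchanged, a legacy client cannot tell (and does not need to know) which backend holds the content.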
The Xet backend indicates whether a file is in Git LFS or Xet storage, allowing downstream services to request the proper URL(s) from S3, regardless of which storage system holds the content.

## Legacy Storage: Git LFS

Git LFS, the legacy storage system on the Hub, utilizes many of the same conventions as Xet-backed repositories. The Hub's Git LFS backend is [Amazon Simple Storage Service (S3)](https://aws.amazon.com/s3/). When Git LFS is invoked, it stores the file contents in S3, using the SHA256 hash to name the file for future access. This storage architecture is relatively simple and has allowed the Hub to store files for millions of model, dataset, and Space repositories.

The primary limitation of Git LFS is its file-centric approach to deduplication. Any change to a file, irrespective of how large or small that change is, means the entire file is versioned, incurring significant overheads in file transfers as the entire file is uploaded (if committing to a repository) or downloaded (if pulling the latest version to your machine). This leads to a worse developer experience along with a proliferation of additional storage.

### Security Model

https://huggingface.co/docs/hub/xet/security.md

# Security Model

Xet storage provides data deduplication over all chunks stored in Hugging Face. This is done via cryptographic hashing in a privacy-sensitive way. The contents of chunks are protected and are associated with repository permissions, i.e. you can only read chunks which are required to reproduce files you have access to, and no more.

More information and details on how deduplication is done in a privacy-preserving way are described in the [Xet Protocol Specification](https://huggingface.co/docs/xet/deduplication).