AI & ML interests

This organization hosts the official transformers implementation of Microsoft's Florence-2 model.

Organization Card

This is the organization for the official transformers-converted checkpoints of Microsoft's Florence-2 model. Try the model itself here. This integration unlocks the use of Florence-2 with all the libraries and APIs in the Hugging Face ecosystem.

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, making it a competitive vision foundation model.
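
The prompt-based interface can be pictured as a lookup from task to prompt token. The sketch below uses task tokens as listed in Microsoft's Florence-2 model card, and follows that card's convention of appending any extra text directly after the token. `TASK_PROMPTS`, `TASKS_WITH_TEXT_INPUT`, and `build_prompt` are hypothetical names introduced here for illustration; they are not part of transformers.

```python
# Task tokens as listed in Microsoft's Florence-2 model card.
# TASK_PROMPTS and build_prompt are illustrative names, not a library API.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "ocr": "<OCR>",
}

# Tasks whose prompt carries extra text after the task token.
TASKS_WITH_TEXT_INPUT = {"phrase_grounding", "referring_segmentation"}


def build_prompt(task: str, text: str = "") -> str:
    """Return the prompt string for a task, appending text where the task needs it."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task: {task!r}")
    token = TASK_PROMPTS[task]
    if task in TASKS_WITH_TEXT_INPUT:
        if not text:
            raise ValueError(f"Task {task!r} needs a text input")
        return token + text
    if text:
        raise ValueError(f"Task {task!r} does not take a text input")
    return token


print(build_prompt("object_detection"))
print(build_prompt("phrase_grounding", "a green car"))
```

Keeping the tokens in one table makes it harder to mistype a prompt, which matters because the task token is what selects the model's behavior.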

Resources and Technical Documentation:

| Model | Model size | Model description |
|---|---|---|
| Florence-2-base[HF] | 0.23B | Pretrained model with FLD-5B |
| Florence-2-large[HF] | 0.77B | Pretrained model with FLD-5B |
| Florence-2-base-ft[HF] | 0.23B | Finetuned model on a collection of downstream tasks |
| Florence-2-large-ft[HF] | 0.77B | Finetuned model on a collection of downstream tasks |
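
As a rough guide to choosing a checkpoint, the parameter counts listed here translate into a weight footprint of parameters times bytes per parameter. The sketch below covers weights only and ignores activations, the generation cache, and framework overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, not a library function.

```python
# Parameter counts (in billions) as listed for each checkpoint.
PARAMS_BILLIONS = {
    "Florence-2-base": 0.23,
    "Florence-2-large": 0.77,
    "Florence-2-base-ft": 0.23,
    "Florence-2-large-ft": 0.77,
}

# Storage size of one parameter for common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(model_name: str, dtype: str = "bfloat16") -> float:
    """Estimate weight storage in decimal gigabytes (weights only)."""
    params = PARAMS_BILLIONS[model_name] * 1e9
    return params * BYTES_PER_PARAM[dtype] / 1e9


for name in PARAMS_BILLIONS:
    print(f"{name}: ~{weight_memory_gb(name):.2f} GB in bfloat16")
```

By this estimate the base checkpoints need about 0.46 GB for weights in bfloat16 and the large ones about 1.54 GB.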

Use the code below to get started with the model.

import torch
import requests
from PIL import Image
from transformers import AutoProcessor, Florence2ForConditionalGeneration


# Load the fine-tuned base checkpoint in bfloat16 and place it on the available device.
model = Florence2ForConditionalGeneration.from_pretrained(
    "florence-community/Florence-2-base-ft",
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("florence-community/Florence-2-base-ft")

# Download a sample image and convert it to RGB.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# "<OD>" is the task prompt for object detection.
task_prompt = "<OD>"
inputs = processor(text=task_prompt, images=image, return_tensors="pt").to(model.device, torch.bfloat16)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    num_beams=3,
)
# Keep special tokens: post-processing needs them to recover the task output.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the generated text into a structured result scaled to the image size.
image_size = image.size
parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=image_size)

print(parsed_answer)
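
For the `<OD>` task, Microsoft's Florence-2 model card documents the parsed result as a dictionary keyed by the task token, holding `bboxes` (each `[x1, y1, x2, y2]` in pixels) and a parallel `labels` list. Assuming that layout, a minimal sketch for turning the result into typed records might look like this. The `Detection` class, the `to_detections` helper, and the sample values are hypothetical and are not real model output.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def to_detections(parsed_answer: dict, task: str = "<OD>") -> list:
    """Pair each bounding box with its label, assuming the documented <OD> layout."""
    result = parsed_answer[task]
    return [
        Detection(label, *(float(v) for v in box))
        for box, label in zip(result["bboxes"], result["labels"])
    ]


# Hypothetical values for illustration only; not real model output.
sample = {"<OD>": {"bboxes": [[34.0, 160.0, 597.0, 371.0]], "labels": ["car"]}}

detections = to_detections(sample)
for det in detections:
    print(det.label, "width:", det.x2 - det.x1, "height:", det.y2 - det.y1)
```

In the quick-start, passing `parsed_answer` in place of `sample` would give one record per detected object.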
