Diffusers documentation

ZImageTransformer2DModel


A Transformer model for image-like data from Z-Image.

ZImageTransformer2DModel

class diffusers.ZImageTransformer2DModel

( all_patch_size = (2,), all_f_patch_size = (1,), in_channels = 16, dim = 3840, n_layers = 30, n_refiner_layers = 2, n_heads = 30, n_kv_heads = 30, norm_eps = 1e-05, qk_norm = True, cap_feat_dim = 2560, siglip_feat_dim = None, rope_theta = 256.0, t_scale = 1000.0, axes_dims = [32, 48, 48], axes_lens = [1024, 512, 512] )
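The default configuration is internally consistent in a way worth noting: the hidden size divided by the number of heads gives the per-head dimension, and `axes_dims` partitions that per-head dimension across the rotary-embedding axes (presumably frame/time, height, and width). A minimal sketch checking this, using only the default values shown above:

```python
# Sketch (not library code): sanity-check the default ZImageTransformer2DModel
# configuration values listed in the signature above.
dim = 3840                  # hidden size
n_heads = 30                # attention heads (n_kv_heads equals n_heads here)
axes_dims = [32, 48, 48]    # per-axis RoPE dims; axis meaning is an assumption
axes_lens = [1024, 512, 512]  # max positions per axis

head_dim = dim // n_heads
print(head_dim)             # 128
print(sum(axes_dims))       # 128 -- the RoPE axes partition the head dimension
```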

forward

( x: list, t, cap_feats: list, return_dict: bool = True, controlnet_block_samples: dict[int, torch.Tensor] | None = None, siglip_feats: list[list[torch.Tensor]] | None = None, image_noise_mask: list[list[int]] | None = None, patch_size: int = 2, f_patch_size: int = 1 )

Flow: patchify -> t_embed -> x_embed -> x_refine -> cap_embed -> cap_refine -> [siglip_embed -> siglip_refine] -> build_unified -> main_layers -> final_layer -> unpatchify
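The `t_embed` step in the flow above turns the scalar timestep `t` into a vector. A hedged sketch of how such a step commonly works, assuming `t` lies in [0, 1] and is first multiplied by `t_scale = 1000.0` (the actual embedding dimension and frequency schedule inside the model are not shown in this page and are assumptions here):

```python
import math

def timestep_embedding(t, dim=256, t_scale=1000.0, max_period=10000.0):
    # Hypothetical sketch of the t_embed step: scale the timestep by t_scale,
    # then apply a standard sinusoidal embedding over `dim` channels.
    t = t * t_scale
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(0.5)
print(len(emb))  # 256
```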

patchify_and_embed

( all_image: list, all_cap_feats: list, patch_size: int, f_patch_size: int )

Patchify for basic mode: single image per batch item.
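Patchification folds non-overlapping `patch_size` × `patch_size` spatial blocks of a latent into a token sequence. A minimal NumPy sketch of the idea (a simplification, not the library's implementation), using the model defaults `in_channels = 16` and `patch_size = 2`:

```python
import numpy as np

def patchify(x, patch_size=2):
    # Fold non-overlapping patch_size x patch_size spatial patches of a
    # (C, H, W) latent into a (num_tokens, token_dim) sequence.
    c, h, w = x.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = x.reshape(c, h // patch_size, patch_size, w // patch_size, patch_size)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape(-1, c * patch_size * patch_size)

latent = np.random.randn(16, 64, 64)  # one latent, in_channels=16
tokens = patchify(latent)
print(tokens.shape)  # (1024, 64): 32*32 tokens of dim 16*2*2
```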

patchify_and_embed_omni

( all_x: list, all_cap_feats: list, all_siglip_feats: list, patch_size: int, f_patch_size: int, images_noise_mask: list )

Patchify for omni mode: multiple images per batch item with noise masks.
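In omni mode, each batch item can carry several images, and the noise mask marks which of them are noisy latents to be denoised versus clean conditioning images. The helper below is hypothetical and only illustrates how per-image token sequences and a per-image mask could be flattened into one unified sequence (the model's actual unified-sequence construction also interleaves caption and SigLIP tokens, which is omitted here):

```python
def build_unified(image_tokens, noise_mask):
    # image_tokens: list of per-image token lists for one batch item.
    # noise_mask: per-image flags (1 = noisy latent to denoise, 0 = clean
    # conditioning image) -- an assumed convention for illustration.
    unified, token_mask = [], []
    for tokens, is_noisy in zip(image_tokens, noise_mask):
        unified.extend(tokens)
        token_mask.extend([is_noisy] * len(tokens))
    return unified, token_mask

images = [[("img0", i) for i in range(4)], [("img1", i) for i in range(4)]]
seq, mask = build_unified(images, [0, 1])  # first image clean, second noisy
print(len(seq), sum(mask))  # 8 4
```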
