---
license: mit
pipeline_tag: document-question-answering
tags:
- donut
- image-to-text
- vision
widget:
- text: "What is the invoice number?"
  src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png"
- text: "What is the purchase amount?"
  src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/contract.jpeg"
---
# donut-base
Donut model, pre-trained only. It was introduced in the paper OCR-free Document Understanding Transformer by Geewook Kim et al. and first released in this repository.

Disclaimer: the team releasing Donut did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape (batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on that encoding.

## Intended uses & limitations

This model is meant to be fine-tuned on a downstream task, such as document image classification or document parsing. See the model hub to look for fine-tuned versions for a task that interests you. We refer to the documentation, which includes code examples.
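To make the encoder/decoder flow concrete, below is a minimal, self-contained sketch of the two stages. The `toy_encoder` and `toy_decoder_step` functions are hypothetical stand-ins (they are not part of Donut or Transformers): the real Swin encoder produces a `(batch_size, seq_len, hidden_size)` tensor of embeddings, and the real BART decoder then generates tokens one at a time, each step conditioned on that encoding and on the tokens produced so far.

```python
# Toy sketch of Donut's encode-then-generate flow. All names and scoring
# rules here are illustrative assumptions, not the actual model.
import random

BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE = 1, 4, 8  # toy dimensions
BOS, EOS = 0, 1                             # special token ids
VOCAB_SIZE = 16

def toy_encoder(image):
    """Stand-in for the Swin encoder: map an 'image' (here, a list of ints)
    to a (batch_size, seq_len, hidden_size) tensor of embeddings."""
    rng = random.Random(sum(image))  # deterministic per input
    return [[[rng.random() for _ in range(HIDDEN_SIZE)]
             for _ in range(SEQ_LEN)]
            for _ in range(BATCH_SIZE)]

def toy_decoder_step(encoding, tokens):
    """Stand-in for one BART decoder step: pick the next token id from the
    encoder states plus the tokens generated so far (greedy)."""
    context = sum(sum(vec) for vec in encoding[0])  # crude pooled summary
    return (int(context * 100) + sum(tokens) + len(tokens)) % VOCAB_SIZE

def generate(encoding, max_len=10):
    """Autoregressive loop: start from BOS, append one token per step,
    stop at EOS or at max_len."""
    tokens = [BOS]
    while len(tokens) < max_len:
        next_token = toy_decoder_step(encoding, tokens)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

encoding = toy_encoder([1, 2, 3])
print(len(encoding), len(encoding[0]), len(encoding[0][0]))  # the 3 dims
print(generate(encoding))  # token ids, beginning with BOS
```

In the real model the decoder emits a structured token sequence (e.g. task-specific tags plus text) that is then parsed into the final output; in practice you would load the weights through the Transformers library rather than implement any of this yourself.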