---
license: mit
pipeline_tag: document-question-answering
tags:
- donut
- image-to-text
- vision
widget:
- text: "What is the invoice number?"
  src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/invoice.png"
- text: "What is the purchase amount?"
  src: "https://huggingface.co/spaces/impira/docquery/resolve/2359223c1837a7587402bda0f2643382a6eefeab/contract.jpeg"
---
# donut-base
Donut model, pre-trained only. It was introduced in the paper OCR-free Document Understanding Transformer by Geewook Kim et al. and first released in this repository.

Disclaimer: the team releasing Donut did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

Donut consists of a vision encoder (Swin Transformer) and a text decoder (BART). Given an image, the encoder first encodes it into a tensor of embeddings of shape (batch_size, seq_len, hidden_size), after which the decoder autoregressively generates text, conditioned on that encoding.

## Intended uses & limitations

This model is meant to be fine-tuned on a downstream task, such as document image classification or document parsing. See the model hub to look for fine-tuned versions for a task that interests you. We refer to the documentation, which includes code examples.
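To make the encoder/decoder flow concrete, below is a minimal, self-contained sketch of the two stages. The `toy_encoder` and `toy_decoder_step` functions are hypothetical stand-ins (they are not part of Donut or Transformers): the real Swin encoder produces a `(batch_size, seq_len, hidden_size)` tensor of embeddings, and the real BART decoder then generates tokens one at a time, each step conditioned on that encoding and on the tokens produced so far.

```python
# Toy sketch of Donut's encode-then-generate flow. All names and scoring
# rules here are illustrative assumptions, not the actual model.
import random

BATCH_SIZE, SEQ_LEN, HIDDEN_SIZE = 1, 4, 8  # toy dimensions
BOS, EOS = 0, 1                             # special token ids
VOCAB_SIZE = 16

def toy_encoder(image):
    """Stand-in for the Swin encoder: map an 'image' (here, a list of ints)
    to a (batch_size, seq_len, hidden_size) tensor of embeddings."""
    rng = random.Random(sum(image))  # deterministic per input
    return [[[rng.random() for _ in range(HIDDEN_SIZE)]
             for _ in range(SEQ_LEN)]
            for _ in range(BATCH_SIZE)]

def toy_decoder_step(encoding, tokens):
    """Stand-in for one BART decoder step: pick the next token id from the
    encoder states plus the tokens generated so far (greedy)."""
    context = sum(sum(vec) for vec in encoding[0])  # crude pooled summary
    return (int(context * 100) + sum(tokens) + len(tokens)) % VOCAB_SIZE

def generate(encoding, max_len=10):
    """Autoregressive loop: start from BOS, append one token per step,
    stop at EOS or at max_len."""
    tokens = [BOS]
    while len(tokens) < max_len:
        next_token = toy_decoder_step(encoding, tokens)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

encoding = toy_encoder([1, 2, 3])
print(len(encoding), len(encoding[0]), len(encoding[0][0]))  # the 3 dims
print(generate(encoding))  # token ids, beginning with BOS
```

In the real model the decoder emits a structured token sequence (e.g. task-specific tags plus text) that is then parsed into the final output; in practice you would load the weights through the Transformers library rather than implement any of this yourself.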