numind
NuExtract-1.5
---
license: mit
language:
- multilingual
tags:
- nlp
base_model: microsoft/Phi-3.5-mini-instruct
pipeline_tag: text-generation
inference: true
new_version: numind/NuExtract-2.0-4B
---
NuNER_Zero
NuNER Zero is a zero-shot Named Entity Recognition (NER) model (see NuNER for the few-shot setting). NuNER Zero uses the GLiNER architecture: its input is a concatenation of entity types and text. Unlike GLiNER, NuNER Zero is a token classifier, which allows it to detect arbitrarily long entities. NuNER Zero was trained on the NuNER v2.0 dataset, which combines subsets of Pile and C4 annotated via LLMs using NuNER's procedure. At the time of its release, NuNER Zero was the best compact zero-shot NER model (+3.1% token-level F1-score over GLiNER-large-v2.1 on GLiNER's benchmark).
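The concatenated input described above can be sketched as follows; this is a minimal illustration of the GLiNER-style format, and the exact `<<ENT>>`/`<<SEP>>` marker tokens (taken from the GLiNER convention) plus the lowercasing of labels are assumptions here, not guaranteed by this card:

```python
# Build a GLiNER-style zero-shot input: entity types are concatenated
# in front of the text, each introduced by a marker token.
def build_zero_shot_input(entity_types: list[str], text: str) -> str:
    prefix = "".join(f"<<ENT>>{t.lower()}" for t in entity_types)
    return f"{prefix}<<SEP>>{text}"

prompt = build_zero_shot_input(["Person", "Organization"],
                               "Ada Lovelace met Charles Babbage.")
# prompt == "<<ENT>>person<<ENT>>organization<<SEP>>Ada Lovelace met Charles Babbage."
```

Because entity types travel with the input rather than being baked into a label set, new types can be queried at inference time without retraining.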
NuExtract-tiny
NuNER-v0.1
NuNER-multilingual-v0.1
SOTA Entity Recognition Multilingual Foundation Model by NuMind 🔥

This model provides the best embeddings for the Entity Recognition task and supports 9+ languages.

Check out other models by NuMind:
- SOTA Entity Recognition Foundation Model in English: link
- SOTA Sentiment Analysis Foundation Model: English, Multilingual

Multilingual BERT fine-tuned on an artificially annotated multilingual subset of the Oscar dataset. This model provides domain- and language-independent embeddings for the Entity Recognition task. We fine-tuned it on only 9 languages, but the model can generalize to other languages supported by Multilingual BERT.

Read more about the evaluation protocol & datasets in our blog post.

| Model | F1 macro |
|----------|----------|
| bert-base-multilingual-cased | 0.5206 |
| ours | 0.5892 |
| ours + two emb | 0.6231 |

Embeddings can be used out of the box or fine-tuned on specific datasets.
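As a toy illustration of the "ours + two emb" row in the table: my assumption is that "two emb" refers to concatenating the last two hidden layers of the encoder to form richer token embeddings. The shapes below are tiny stand-ins (a real BERT layer output would be batch × seq_len × 768):

```python
import numpy as np

# Stand-in for the per-layer hidden states an encoder would return.
hidden_states = [np.zeros((1, 4, 8)) for _ in range(12)]

# Concatenate the last two layers along the feature axis to get a
# doubled-width embedding per token.
two_emb = np.concatenate(hidden_states[-2:], axis=-1)
print(two_emb.shape)  # (1, 4, 16)
```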
NuNER-v2.0
NuExtract-2.0-4B
🖥️ API / Platform   |   📑 Blog   |   🗣️ Discord   |   🔗 GitHub

NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs and is multilingual. We provide several versions of different sizes, all based on pre-trained models from the QwenVL family.

| Model Size | Model Name | Base Model | License | Huggingface Link |
|------------|------------|------------|---------|------------------|
| 2B | NuExtract-2.0-2B | Qwen2-VL-2B-Instruct | MIT | 🤗 NuExtract-2.0-2B |
| 4B | NuExtract-2.0-4B | Qwen2.5-VL-3B-Instruct | Qwen Research License | 🤗 NuExtract-2.0-4B |
| 8B | NuExtract-2.0-8B | Qwen2.5-VL-7B-Instruct | MIT | 🤗 NuExtract-2.0-8B |

❗️Note: `NuExtract-2.0-2B` is based on Qwen2-VL rather than Qwen2.5-VL because the smallest Qwen2.5-VL model (3B) has a more restrictive, non-commercial license. We therefore include `NuExtract-2.0-2B` as a small model option that can be used commercially.

## Benchmark

Performance on a collection of ~1,000 diverse extraction examples containing both text and image inputs.

## Usage

To use the model, provide an input text/image and a JSON template describing the information you need to extract. The template should be a JSON object specifying field names and their expected types. Supported types include:

- `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
- `string` - a generic string field that can incorporate paraphrasing/abstraction.
- `integer` - a whole number.
- `number` - a whole or decimal number.
- `date-time` - an ISO-formatted date.
- Array of any of the above types (e.g. `["string"]`).
- `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
- `multi-label` - an enum that can have multiple answers (represented in the template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).

If the model does not identify relevant information for a field, it will return `null` or `[]` (for arrays and multi-labels).

⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7, which is not well suited to many extraction tasks.

You will need the following function to handle loading of image input data:

## In-Context Examples

Sometimes the model might not perform as well as we want because the task is challenging or involves some degree of ambiguity. Alternatively, we may want the model to follow some specific formatting, or just give it a bit more help. In cases like this, it can be valuable to provide "in-context examples" to help NuExtract better understand the task. To do so, we provide a list of examples (dictionaries of input/output pairs). In the example below, we show the model that we want the extracted names in capital letters with `-` on either side (for the sake of illustration). Usually, providing multiple examples leads to better results.

## Image Inputs

If we want to give image inputs to NuExtract instead of text, we simply provide a dictionary specifying the desired image file as the message content, instead of a string (e.g. `{"type": "image", "image": "file://image.jpg"}`). You can also specify an image URL (e.g. `{"type": "image", "image": "http://path/to/your/image.jpg"}`) or a base64 encoding (e.g. `{"type": "image", "image": "data:image;base64,/9j/..."}`).

## Template Generation

If you want to convert existing schema files in other formats (e.g. XML, YAML, etc.) or start from an example, NuExtract 2.0 models can automatically generate the template for you, e.g. generating a template from a natural-language description:

## Fine-Tuning

You can find a fine-tuning tutorial notebook in the cookbooks folder of the GitHub repo.

## vLLM Deployment

Run the command below to serve an OpenAI-compatible API:

If you encounter memory issues, set `--max-model-len` accordingly.
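A text-only request to such an OpenAI-compatible server can be sketched as below. The server model name, the `# Template:`/`# Context:` prompt layout, and the document text are illustrative assumptions, not the card's exact format:

```python
import json

# A template using the field types described above (field names invented
# for this example).
template = {"name": "verbatim-string", "age": "integer", "skills": ["string"]}
document = "John Smith, 33, is a Python and Rust developer."

# Body for POST /v1/chat/completions on the vLLM server.
request_body = {
    "model": "numind/NuExtract-2.0-8B",
    "messages": [{
        "role": "user",
        "content": f"# Template:\n{json.dumps(template)}\n# Context:\n{document}",
    }],
    "temperature": 0,  # extraction works best at (or very near) 0
}
```

Fields the model cannot fill come back as `null` (scalars) or `[]` (arrays and multi-labels), so downstream code should handle both.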
For image inputs, structure requests as shown below. Make sure to order the images in `"content"` as they appear in the prompt (i.e. any in-context examples before the main input).
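The image-ordering rule can be sketched as a message payload like this; the file paths, field names, and model name are placeholders for illustration:

```python
import json

template = {"invoice_total": "number", "vendor": "verbatim-string"}

# Images must appear in "content" in the same order as in the prompt:
# any in-context example image first, then the main input image.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "file://example_invoice.jpg"}},  # in-context example
        {"type": "image_url", "image_url": {"url": "file://input_invoice.jpg"}},    # main input
        {"type": "text", "text": json.dumps(template)},
    ],
}]
```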
NuExtract-2.0-2B
NuExtract-2.0-8B
NuMarkdown-8B-Thinking
NuExtract-2.0-8B-GGUF
NuMarkdown-8B-Thinking-GGUF
NuExtract-1.5-tiny
NuExtract
NuExtract-2.0-2B-GGUF
NuExtract-2.0-4B-GGUF
NuExtract-2.0-4B-GPTQ
NuExtract-1.5-smol
NuExtract-2-8B-experimental
NuExtract-2-2B-experimental
NuNER_Zero-span
NuNER Zero-span is the span-prediction version of NuNER Zero. NuNER Zero-span shows slightly better performance than NuNER Zero but cannot detect entities longer than 12 tokens.
NuNER_Zero-4k
NuNER Zero 4k is the long-context (4k tokens) version of NuNER Zero. NuNER Zero 4k is generally less performant than NuNER Zero, but can outperform NuNER Zero on applications where context size matters.