Paranioar
## NEO1_0-2B-SFT

### From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

Two lingering clouds cast shadows over the widespread exploration and promotion of native VLMs:

- What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome?
- How can research on native VLMs be made more accessible and democratized, thereby accelerating progress in the field?

We construct native VLMs from first principles, whose primitives should:

- effectively align pixel and word representations within a shared semantic space;
- seamlessly integrate the strengths of separate vision and language modules;
- inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning.

Highlights:

- With only 390M image-text examples, NEO develops strong visual perception from scratch inside a dense and monolithic model via elaborate primitives.
- NEO serves as a cornerstone for scalable and powerful native VLMs, paired with reusable components that foster a cost-effective and extensible ecosystem.
- Number of layers: 40 (12 for the Pre-Buffer & 28 for the Post-LLM)

We release the 2B weights of NEO1.0 from the Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT) stages.

| Model name | Weight |
| ---------- | ------ |
| NEO-2B-PT  | 🤗 NEO-2B-PT HF link |
| NEO-2B-MT  | 🤗 NEO-2B-MT HF link |
| NEO-2B-SFT | 🤗 NEO-2B-SFT HF link |

### Citation

If NEO is helpful for your research, please consider a star ⭐ and citation:
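The Pre-Buffer / Post-LLM layer split described above can be captured in a small configuration sketch. The class and field names below are illustrative assumptions for exposition, not the official NEO implementation; only the layer counts come from the card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeoLayout:
    """Depth budget of a NEO-style monolithic VLM (illustrative sketch)."""
    pre_buffer_layers: int  # early layers fusing pixel and word inputs
    post_llm_layers: int    # remaining language-model layers

    @property
    def total_layers(self) -> int:
        return self.pre_buffer_layers + self.post_llm_layers


# Layer counts as stated for the 2B release: 40 = 12 (Pre-Buffer) + 28 (Post-LLM).
NEO_2B = NeoLayout(pre_buffer_layers=12, post_llm_layers=28)
```

A config object like this only records the split; the card's point is that both parts live inside one dense model rather than a separate vision encoder bolted onto an LLM.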
## NEO1_0-9B-PT

### From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

The 9B release follows the same motivation and first-principles design as the 2B release above; the variant-specific details are:

- Number of layers: 42 (6 for the Pre-Buffer & 36 for the Post-LLM)

We release the 9B weights of NEO1.0 from the Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT) stages.

| Model name | Weight |
| ---------- | ------ |
| NEO-9B-PT  | 🤗 NEO-9B-PT HF link |
| NEO-9B-MT  | 🤗 NEO-9B-MT HF link |
| NEO-9B-SFT | 🤗 NEO-9B-SFT HF link |

### Citation

If NEO is helpful for your research, please consider a star ⭐ and citation:
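For comparison across the two released sizes, here is a quick sketch of the fraction of depth each variant devotes to the Pre-Buffer, using only the layer counts stated in the 2B and 9B cards (the dictionary keys are informal labels, not repo ids):

```python
# (Pre-Buffer, Post-LLM) layer splits as stated in the model cards.
splits = {
    "NEO-2B": (12, 28),  # 40 layers total
    "NEO-9B": (6, 36),   # 42 layers total
}

for name, (pre, post) in splits.items():
    total = pre + post
    print(f"{name}: {total} layers, Pre-Buffer share {pre / total:.0%}")
```

Under these numbers, the 9B variant spends a smaller fraction of its depth on the Pre-Buffer (about 14%) than the 2B variant does (30%).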