# Model Card for DeTikZify v2.5 (8b)

DeTikZify v2.5 (8b) is a multimodal language model that automatically converts sketches and existing scientific figures into editable, semantics-preserving TikZ graphics programs. It builds on DeTikZify v2, post-trained with reinforcement learning and self-computed rewards. This approach, which we call reinforcement learning from self-feedback (RLSF), allows the model to considerably improve itself without requiring external reward functions. Check out the DeTikZify project for more information and tips on how to best run the model.

## Background

DeTikZify employs an iterative inference algorithm based on Monte Carlo Tree Search (MCTS), enabling it to continuously refine its outputs without additional training. The reward scores required by MCTS are computed entirely by DeTikZify's vision encoder, which visually assesses the similarity between input figures and compiled generated outputs. External reward models are not only unnecessary but often correlate less well with human judgments, since the vision encoder was fine-tuned end-to-end with the entire model, optimizing it for exactly this task. We refer readers to the DeTikZify and TikZero papers for further details.

These self-computed rewards have proven effective at enhancing model outputs during inference. With reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO), the same reward signal can also be used by the model to improve itself during a post-training step, i.e., reinforcement learning from self-feedback (RLSF).

## Model Training

Our post-training setup only requires figures, not the aligned code needed for supervised fine-tuning, granting us more flexibility in selecting training data. 50% of the training data comes from the subset of DaTikZ v3 that was filtered out during the training of DeTikZify v2. The remaining 50% is sampled from the SPIQA dataset, which contains image labels for figures extracted from arXiv.
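The self-computed reward at the heart of both MCTS inference and RLSF can be illustrated as a similarity score between the encoder's view of the input figure and of the compiled output. Below is a minimal NumPy sketch, where `encode` is a hypothetical stand-in for DeTikZify's vision encoder (the actual model uses a more involved patch-level similarity, described in the DeTikZify v2 card):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def self_reward(encode, input_figure, rendered_output):
    # `encode` is a hypothetical stand-in for the vision encoder: it maps
    # an image to patch embeddings of shape (num_patches, dim).
    a = encode(input_figure).mean(axis=0)      # mean-pool patch embeddings
    b = encode(rendered_output).mean(axis=0)
    return cosine(a, b)  # in [-1, 1]; higher = more visually similar
```

Because this score comes from the model's own encoder, no external reward model is needed at any point of the pipeline.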
We exclude all figures from papers included in DaTikZ v3. We sample this split so that 60% of the figures are labeled as schematics, 20% as plots, and 20% come from other categories. Since these figures are not necessarily created with TikZ, they may help improve the model's generalization capabilities. As with DeTikZify v2, input figures are randomly converted into synthetic sketches using image transformations and UltraSketch.

Using this dataset, we post-train DeTikZify v2 with RLSF, employing a batch size of 16. For each image, 32 outputs are generated, so the model is trained on 512 outputs per step. We train for a total of 500 steps, which takes 5 days on eight Nvidia H200 GPUs. We keep the vision encoder frozen to mitigate reward hacking.

## Experiments and Results

We evaluate DeTikZify v2.5 (8b) on the test split of DaTikZ v3 and compare it to DeTikZify v2 (8b). The metrics employed include DreamSim (DSim), Kernel Inception Distance (KID), CrystalBLEU (cBLEU), TeX Edit Distance (TED), Mean Token Efficiency (MTE), and Mean Sampling Throughput (MST). Refer to the DeTikZify paper for further details. All scores except MST are multiplied by 100.

| Model | DSim ↑ | KID ↓ | cBLEU ↑ | TED ↓ | MTE ↑ |
|---|---|---|---|---|---|
| DeTikZify v2 (8b) | 80.503 | 0.626 | 6.105 | 54.946 | 93.326 |
| DeTikZify v2.5 (8b) | 84.6438 | 0.298 | 4.202 | 52.939 | 100 |

In sampling-based inference (i.e., accepting the first output that compiles successfully) with reference figures and synthetic sketch inputs, DeTikZify v2.5 (8b) outperforms DeTikZify v2 (8b) on most metrics, demonstrating that RLSF effectively enhances performance. The considerably higher DreamSim scores indicate that DeTikZify v2.5 (8b) generates outputs that are much more visually similar to the reference figures. Furthermore, it is much less likely to produce outputs that do not compile, as evidenced by its perfect MTE score.
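The group-relative update used in this post-training step can be sketched briefly: for each image, the 32 sampled outputs are scored with the self-computed reward, and each output's advantage is its reward normalized by its own group's statistics. This is a minimal NumPy sketch of the standard GRPO advantage computation, not the exact training code:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each sampled output for one image is
    scored relative to the mean/std of its own group of samples."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# With the settings above: 16 images x 32 samples = 512 scored outputs per step.
```

Outputs scoring above their group's mean reward receive positive advantages and are reinforced; below-average outputs are discouraged, with no learned critic involved.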
Interestingly, while it scores lower on the code-based metric CrystalBLEU, it performs better on the code-based TED. DeTikZify v2.5 (8b) tends to generate more concise programs with less syntactic noise. While this likely reduces the n-gram overlap with the reference code, it also decreases the number of edits needed to convert one program into the other, explaining this phenomenon. Generally, more concise programs are beneficial as long as the semantics are preserved.

| Model | DSim ↑ | KID ↓ | cBLEU ↑ | TED ↓ | MST ↑ |
|---|---|---|---|---|---|
| DeTikZify v2 (8b) | 89.020 | 0.016 | 6.593 | 52.466 | 52.723 |
| DeTikZify v2.5 (8b) | 90.889 | -0.047 | 4.646 | 51.824 | 68.12 |

We observe similar trends with our MCTS-based inference algorithm under a time budget of 10 minutes. Compared to sampling-based inference, DeTikZify v2.5 (8b) noticeably improves its scores, illustrating that MCTS on top of RLSF can still yield additional gains. Additionally, within the same timeframe, DeTikZify v2.5 (8b) generates 25 more outputs than DeTikZify v2 (8b), supporting our hypothesis that the generated programs are more concise. On reference figures, DeTikZify v2.5 (8b) scores better on both DreamSim and KID, with the KID score even turning slightly negative due to the high similarity of the distributions. For synthetic sketches, it achieves a higher DreamSim score but performs worse on KID, indicating that it prioritizes faithfulness to the reference figure over general aesthetics.

| Model | DSim ↑ | KID ↓ | CLIP ↑ | cBLEU ↑ | TED ↓ | MTE ↑ |
|---|---|---|---|---|---|---|
| DeTikZify v2 (8b) | 52.829 | 5.103 | 10.051 | 1.603 | 65.51 | 82.291 |
| DeTikZify v2.5 (8b) | 53.564 | 7.471 | 7.968 | 0.732 | 62.189 | 100 |

TikZero adapters integrate into the vision encoder of DeTikZify models, enabling them to be conditioned on text in addition to images. Since we keep the vision encoder frozen, we can evaluate DeTikZify v2.5 (8b) with adapters trained for DeTikZify v2 (8b). Compared to our previous experiments, the results are more mixed.
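The sampling-based inference used in the first comparison can be sketched as a simple accept-first-compiling loop. The `generate` and `compiles` helpers below are hypothetical placeholders; the real pipeline samples TikZ programs from the model and checks whether the TeX compiler succeeds on each candidate:

```python
def first_compiling(generate, compiles, max_tries=32):
    """Return the first sampled program that compiles, plus the number of
    samples drawn. Compile failures are what the MTE metric penalizes."""
    for tries in range(1, max_tries + 1):
        code = generate()
        if compiles(code):
            return code, tries
    return None, max_tries
```

A model with a perfect MTE score effectively terminates this loop on the first iteration almost every time, which also raises throughput (MST) under a fixed time budget.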
While DeTikZify v2.5 (8b) achieves a better DreamSim value and maintains a perfect MTE, it performs worse on CLIPScore, suggesting difficulties in reproducing text from captions. This could be due to an increased modality gap, as RLSF further refines the model for image-only inputs. We plan to address this in future work by incorporating caption inputs into RLSF training.

## Summary

Overall, RLSF greatly enhances model performance on most tasks. For image and sketch inputs, DeTikZify v2.5 (8b) emerges as the clear leader. For text inputs via TikZero adapters, the choice between model versions depends on the specific use case, given the trade-offs involved.

## Acknowledgments

This model was trained using computational resources provided by the bwForCluster Helix as part of the bwHPC-S5 project. The authors acknowledge support from the state of Baden-Württemberg through the bwHPC initiative and the German Research Foundation (DFG) under grant INST 35/1597-1 FUGG. This project was inspired by the paper "Rendering-Aware Reinforcement Learning for Vector Graphics Generation".
# Model Card for DeTikZify v2 (8b)

DeTikZify v2 (8b) is a multimodal language model that automatically converts sketches and existing scientific figures into editable, semantics-preserving TikZ graphics programs. It is based on LLaMA 3.1 (8b) and the SigLIP vision encoder of PaliGemma Mix-448 (3b). Check out the DeTikZify project for more information and tips on how to best run the model.

## Changes from DeTikZify v1

We document all changes between DeTikZify v1 and DeTikZify v2 in our paper, "TikZero: Zero-Shot Text-Guided Graphics Program Synthesis". For convenience, they are also listed below.

### Architecture

Like DeTikZify v1, DeTikZify v2 uses a SigLIP vision encoder. However, inspired by the continued ViT pretraining of InternVL, we initialize the weights with the fine-tuned vision encoder of PaliGemma Mix-448 (3b) and increase DeTikZify's resolution to 420x420 pixels. Further, the vision encoder is no longer kept frozen but is fully fine-tuned with the rest of the model.

### Training Data

For pretraining, we switch from MetaFig to the much larger ArXivCap dataset and extract 1 million (figure, caption, OCR) tuples for pretraining the modality connector. For fine-tuning, we create the new DaTikZ v3 dataset with over 450k TikZ drawings. We also train a new model called UltraSketch to generate synthetic sketches during training. It is based on UltraEdit and achieves a congruence coefficient (CC) of 0.74. Additionally, we generate synthetic sketches using image transformations. While these sketches are less diverse, they better preserve text rendering, achieving a similar CC of 0.75. When we average the sketch representations produced by both methods, the resulting CC increases to 0.82, indicating that the methods are orthogonal and complement each other effectively.

### Training & Inference

We observe improved performance by extending training to 5 epochs and increasing the learning rate to 5e-5.
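The congruence coefficient used above to compare sketch representations can be written down compactly. This sketch assumes the standard Tucker congruence formulation over flattened representation matrices; the paper's exact computation may differ:

```python
import numpy as np

def congruence_coefficient(x, y):
    """Tucker's congruence coefficient: a cosine-style similarity in
    [-1, 1] between two (flattened) representation matrices."""
    x, y = np.ravel(x), np.ravel(y)
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))
```

Averaging the representations produced by the two sketch methods before comparison is what lifts the CC from roughly 0.74/0.75 individually to 0.82, the sign that their errors are largely uncorrelated.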
Fully fine-tuning the vision encoder means that we can no longer compute SelfSim as the cosine similarity between pooled outputs during inference, as the pooling head is not fine-tuned. However, computing the Earth Mover's Distance on the fine-tuned patch embeddings instead actually enhances the correlation with human judgments (0.456 segment-level and 0.911 system-level correlation). This means that DeTikZify v2 also works well with our MCTS-based inference algorithm.

## Evaluation

Here is how DeTikZify v2 (8b) compares to DeTikZify DS (7b), previously the best-performing DeTikZify model, evaluated on the test split of DaTikZ v3. Scores are multiplied by 100. The first group of metric columns is computed on reference figures, the second on synthetic sketches.

| Model | DSim ↑ | KID ↓ | cBLEU ↑ | TED ↓ | MTE ↑ | DSim ↑ | KID ↓ | cBLEU ↑ | TED ↓ | MTE ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| DeTikZify DS (7b) | 75.46 | 0.842 | 2.953 | 56.851 | 84.019 | 67.379 | 0.766 | 1.541 | 59.589 | 84.401 |
| DeTikZify v2 (8b) | 80.503 | 0.626 | 6.105 | 54.946 | 93.326 | 74.584 | 0.751 | 3.356 | 58.32 | 93.858 |
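The patch-level SelfSim described in the Training & Inference section can be approximated with a short sketch: when both images yield the same number of patches and weights are uniform, the Earth Mover's Distance reduces to a minimum-cost matching over pairwise patch distances. This is a simplified illustration, not the exact implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def selfsim_emd(a, b):
    """Simplified SelfSim: EMD between two sets of patch embeddings,
    each of shape (N, D). Lower distance = more visually similar."""
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, N)
    rows, cols = linear_sum_assignment(cost)  # optimal patch matching
    return cost[rows, cols].mean()
```

Unlike cosine similarity over pooled outputs, this comparison operates directly on the fine-tuned patch embeddings and therefore needs no pooling head at all.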