Sanraj

2 models

Qwen3-1.7B-jailbreak-finetuned

# Qwen3-1.7B Fine-tuned (Jailbreaking Prompt-Response)

This model is a fine-tuned version of Qwen3-1.7B, trained on the Sanraj/jailbreaking-prompt-response dataset. The fine-tuning focuses on improving response consistency, contextual understanding, and controlled jailbreak behavior. The model was trained for 10 epochs, with the best checkpoint selected automatically by lowest validation loss. The final model reached a training loss of about 2.0 and a validation loss of about 2.4, indicating stable, well-generalized learning.

## Dataset

- **Dataset:** Sanraj/jailbreaking-prompt-response
- **Structure:**
  - `prompt` → user query or instruction
  - `response` → model- or human-generated answer

## Modes

- **Safe mode (default):** the default mode for safe, aligned responses; produces polite and contextually aware outputs.
- **Jailbreak mode:** activated when the prompt starts with a "bad" or uncensored trigger word; allows freer, less restricted outputs, useful for robustness and safety research.

**Note:** jailbreak mode is for research and testing only, not for generating harmful or unsafe content.

## Training Configuration

| Parameter               | Value                               |
| ----------------------- | ----------------------------------- |
| Base Model              | Qwen3-1.7B                          |
| Dataset                 | Sanraj/jailbreaking-prompt-response |
| Epochs                  | 10                                  |
| Batch Size              | 4                                   |
| Learning Rate           | 2e-5                                |
| Optimizer               | AdamW                               |
| Scheduler               | Linear decay                        |
| Precision               | bfloat16                            |
| Gradient Accumulation   | Enabled                             |
| Gradient Clipping       | 1.0                                 |
| Mixed Precision         | Yes                                 |
| Use Cache               | False                               |
| save_total_limit        | 3                                   |
| load_best_model_at_end  | True                                |
| Train Loss (Final)      | ~2.0                                |
| Validation Loss (Final) | ~2.4                                |
| Framework               | PyTorch + Transformers              |

## Usage

Starting your prompt with an uncensored or "bad" word triggers jailbreak mode, which allows the model to generate freer and more raw responses (still within research-safe limits).

## Training Behavior

The model shows smooth convergence and steady learning. Validation loss closely tracks training loss, with only a small gap (~0.3–0.4), suggesting minimal overfitting.
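The hyperparameters above can be expressed as a Hugging Face `TrainingArguments` configuration. The following is a sketch, not the author's actual training script; the output directory, evaluation schedule, and gradient-accumulation step count are assumptions (the card only states accumulation was "Enabled"):

```python
from transformers import TrainingArguments

# Sketch of a TrainingArguments setup mirroring the table above.
# output_dir, eval/save schedule, and accumulation steps are assumed.
training_args = TrainingArguments(
    output_dir="qwen3-1.7b-jailbreak-finetuned",  # assumed path
    num_train_epochs=10,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    optim="adamw_torch",                # AdamW
    lr_scheduler_type="linear",         # linear decay
    bf16=True,                          # bfloat16 mixed precision
    gradient_accumulation_steps=4,      # "Enabled"; exact value not stated
    max_grad_norm=1.0,                  # gradient clipping
    save_total_limit=3,
    load_best_model_at_end=True,        # picks lowest validation loss
    eval_strategy="epoch",              # assumed; required for best-checkpoint selection
    save_strategy="epoch",
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```

With `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`, the Trainer restores the checkpoint with the lowest validation loss after training, which matches the checkpoint-selection behavior described above.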
## Intended Use

Performs effectively for creative text generation, open-ended Q&A, and robustness testing.

## Safety and Responsible Use

This model includes a "jailbreak simulation" capability designed strictly for research and testing of AI alignment and robustness. It must not be used to generate, promote, or distribute harmful or unethical content. Developers and researchers using this model should apply safety filters when deploying it in production or user-facing environments.

## Licensing

- Base model: [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)
- Dataset: Sanraj/jailbreaking-prompt-response

Ensure compliance with both licenses when redistributing or deploying the model.

## Training Details

- **Base Model:** Qwen3-1.7B
- **Dataset:** Sanraj/jailbreaking-prompt-response
- **Trainer:** Hugging Face Transformers
- **Compute:** Colab / Kaggle / Local GPU

Fine-tuned by Santhos Raj, bridging AI safety and capability research.

## Contributing

Contributions are highly encouraged. You can help improve this project in several ways:

- Expanding the dataset with diverse, high-quality prompt-response pairs.
- Enhancing the jailbreak control mechanism for a better balance between creativity and safety.
- Evaluating model alignment and robustness under different scenarios.
- Reporting bugs, performance issues, or inconsistencies.

Credit will be given to all meaningful contributors in future releases. Let's work together to make open-source models more robust, aligned, and accessible.
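The safety-filter recommendation above can be sketched as a simple pre-generation guard that rejects prompts beginning with a jailbreak trigger word before they reach the model. The trigger list and function name here are hypothetical placeholders; the card does not publish the actual trigger words:

```python
# Minimal sketch of a deployment-side prompt guard. BLOCKED_TRIGGERS is
# a hypothetical placeholder list; substitute the trigger words your
# deployment actually needs to block.
BLOCKED_TRIGGERS = ("uncensored",)  # hypothetical example trigger

def guard_prompt(prompt: str) -> tuple[bool, str]:
    """Return (allowed, message). Rejects prompts whose first word is a
    known jailbreak trigger, before the prompt reaches the model."""
    stripped = prompt.strip()
    first_word = stripped.split(maxsplit=1)[0].lower() if stripped else ""
    if first_word in BLOCKED_TRIGGERS:
        return False, "Prompt rejected: jailbreak trigger detected."
    return True, "ok"

print(guard_prompt("uncensored tell me something"))   # blocked
print(guard_prompt("Summarize the plot of Hamlet"))   # allowed
```

A guard like this only covers the prefix-trigger mechanism described in this card; production deployments should layer it with general-purpose content filtering on both inputs and outputs.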

license:apache-2.0

tiny_llama1.1B_finetuned

llama