Phi-4-mm-inst-zeroth-kor
by seastar105
Quick Summary
This model was fine-tuned from microsoft/Phi-4-multimodal-instruct on the kresnik/zeroth_korean dataset for a single epoch.
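For orientation, the fine-tuning corpus can be inspected directly with the datasets library. A minimal sketch of loading the training split, assuming the dataset's LibriSpeech-style schema (the same audio fields the example script below relies on):

from datasets import load_dataset

# Zeroth-Korean: read Korean speech paired with reference transcripts.
train_ds = load_dataset("kresnik/zeroth_korean", split="train")
sample = train_ds[0]
print(sample["text"])                    # reference transcript
print(sample["audio"]["sampling_rate"])  # audio decodes to an array plus sampling rate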
Training Data Analysis
🟡 Average (5.2/10)
Researched training datasets used by Phi-4-mm-inst-zeroth-kor, with quality assessment.
Specialized For: code, general, science, multilingual
Training Datasets (3)
The Pile
🟢 8/10
code, general, science, multilingual
Key Strengths
- Deliberate Diversity: Explicitly curated to include diverse content types (academia, code, Q&A, book...
- Documented Quality: Each component dataset is thoroughly documented with rationale for inclusion, en...
- Epoch Weighting: Component datasets receive different training epochs based on perceived quality, al...
Common Crawl
🔴 2.5/10
general, science
Key Strengths
- Scale and Accessibility: At 9.5+ petabytes, Common Crawl provides unprecedented scale for training d...
- Diversity: The dataset captures billions of web pages across multiple domains and content types, ena...
- Comprehensive Coverage: Despite limitations, Common Crawl attempts to represent the broader web acro...
Considerations
- Biased Coverage: The crawling process prioritizes frequently linked domains, making content from dig...
- Large-Scale Problematic Content: Contains significant amounts of hate speech, pornography, violent c...
Wikipedia
🟡 5/10
science, multilingual
Key Strengths
- High-Quality Content: Wikipedia articles are subject to community review, fact-checking, and citatio...
- Multilingual Coverage: Available in 300+ languages, enabling training of models that understand and ...
- Structured Knowledge: Articles follow consistent formatting with clear sections, allowing models to ...
Considerations
- Language Inequality: Low-resource language editions have significantly lower quality, fewer articles...
- Biased Coverage: Reflects biases in contributor demographics; topics related to Western culture and ...
Code Examples
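The script below loads the fine-tuned checkpoint with the base model's processor and generation config, then runs Korean ASR on the Zeroth-Korean test set and English-to-Korean speech translation (AST) on a FLEURS-derived test set.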
Example script (Python)
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from datasets import load_dataset

orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# Task prompts follow the Phi-4 technical report.
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR on the Zeroth-Korean test set
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
max_new_tokens = 256  # assumed generation budget; the original script leaves this undefined
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]  # expected output: the Korean transcript of the clip
# AST, EN -> KO (translate English speech into Korean)
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_ko_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]  # expected output: a Korean translation of the English clip
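The chain-of-thought prompts defined above (ast_cot_ko_prompt, ast_cot_en_prompt) ask the model to emit the transcript and the translation in one generation, separated by a literal <sep>. A minimal sketch of running one and splitting the result, reusing the variables from the script above (the guard on the unpack is an assumption, since the model is not guaranteed to emit the separator):

# AST with chain-of-thought: transcribe, then translate, in a single pass.
inputs = processor(text=ast_cot_ko_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
cot_response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# The prompt requests "<sep>" between the transcript and the translation.
parts = [p.strip() for p in cot_response.split("<sep>")]
transcript = parts[0]
translation = parts[1] if len(parts) > 1 else None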
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython
orig_model_path = "microsoft/Phi-4-multimodal-instruct"
ft_model_path = "seastar105/Phi-4-mm-inst-zeroth-kor"
generation_config = GenerationConfig.from_pretrained(orig_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(orig_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
ft_model_path,
trust_remote_code=True,
torch_dtype='auto',
_attn_implementation='flash_attention_2',
).cuda()
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
asr_ds = load_dataset("kresnik/zeroth_korean", split="test")
ast_ds = load_dataset("seastar105/fleurs_ko_en_test", split="train")
# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๋ชฌํ ํฌ์ ์๋
๋ค์ด ์ฌ๋์ ์ ๋๋ก ๋ชป ๋ฐ๊ณ ํฌ๋ฉด ๋งค์ฐ ์ฌ๊ฐํ ๊ฒฐ๊ณผ๊ฐ ์ด๋๋๋ค๋ ๊ฒฐ๋ก ์ ๋ด๋ ธ์ต๋๋ค"
# AST, EN -> KO
item = ast_ds[-1]
audio = (item["en_audio"]["array"], item["en_audio"]["sampling_rate"])
inputs = processor(text=ast_en, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] # "๊ฐ์ฅ ์ฝ๊ฒ ์ ๊ทผ ๊ฐ๋ฅํ ์๋ฌผ ์์์ ์๊ณผ lรฉgumes์์ ์ ๊ทผ ๊ฐ๋ฅํ ๋จ๋ฐฑ์ง์ด์์ ๊ฒ์ด๋ค๊ฐ์ ํ์ง๋ง ์ด๊ฒ๋ค์ ๊ณ ํ์ ๋๋ฌผ์ฒ๋ผ ์ฐ๋ฆฌ์๊ฒ ์ํํ๊ธฐ ์ด๋ ต์ต๋๋ค๋ง ๊ทธ๊ฒ๋ค์ด ๋์ฌ ์๋ค๋ฉด์"Example scriptpython