yjoonjang

11 models

splade-ko-v1

license:apache-2.0
5,196
10

splade-ko-v1.0

splade-ko-v1 is a Korean-specific SPLADE sparse encoder model finetuned from skt/A.X-Encoder-base using the sentence-transformers library. It maps sentences & paragraphs to a 50000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

Model Details

Model Description
- Model Type: SPLADE Sparse Encoder
- Base model: skt/A.X-Encoder-base
- Maximum Sequence Length: 8192 tokens
- Output Dimensionality: 50000 dimensions
- Similarity Function: Dot Product

MTEB-ko-retrieval Leaderboard

Evaluated on all Korean retrieval benchmarks in MTEB.

Korean Retrieval Benchmark

| Dataset | Description | Average Length (characters) |
|---|---|---|
| Ko-StrategyQA | Korean ODQA multi-hop retrieval dataset (translated from StrategyQA) | 305.15 |
| AutoRAGRetrieval | Korean document retrieval dataset constructed by parsing PDFs across 5 domains: finance, public sector, healthcare, legal, and commerce | 823.60 |
| MIRACLRetrieval | Wikipedia-based Korean document retrieval dataset | 166.63 |
| PublicHealthQA | Korean document retrieval dataset for the medical and public health domains | 339.00 |
| BelebeleRetrieval | FLORES-200-based Korean document retrieval dataset | 243.11 |
| MrTidyRetrieval | Wikipedia-based Korean document retrieval dataset | 166.90 |
| MultiLongDocRetrieval | Korean long-document retrieval dataset across various domains | 13,813.44 |

- In our evaluation, we excluded the XPQARetrieval dataset. XPQA is designed to evaluate cross-lingual QA capabilities, and we judged it inappropriate for evaluating retrieval tasks that require finding supporting documents for a query.
- Examples from the XPQARetrieval dataset are as follows:
- Details on excluding this dataset are given in the GitHub issue.

Evaluation Metrics
- Recall@10
- NDCG@10
- MRR@10
- AVGQueryActiveDims
- AVGCorpusActiveDims

Evaluation Code

Our evaluation uses the SparseInformationRetrievalEvaluator from the sentence-transformers library.

| Model | Parameters | Recall@10 | NDCG@10 | MRR@10 | AVGQueryActiveDims | AVGCorpusActiveDims |
|---|---|---|---|---|---|---|
| yjoonjang/splade-ko-v1 | 0.1B | 0.8391 | 0.7376 | 0.7260 | 110.7664 | 783.7026 |
| telepix/PIXIE-Splade-Preview | 0.1B | 0.8107 | 0.7175 | 0.7072 | 30.481 | 566.8242 |
| opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 | 0.1B | 0.6570 | 0.5383 | 0.5233 | 27.8722 | 177.5564 |

Look here for more details.

| Model | Parameters | Average NDCG@10 |
| :--- | :--- | :--- |
| Sparse Embedding | | |
| yjoonjang/splade-ko-v1 | 0.1B | 0.7376 |
| telepix/PIXIE-Splade-Preview | 0.1B | 0.7175 |
| opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1 | 0.1B | 0.5383 |
| Dense Embedding | | |
| Qwen/Qwen3-Embedding-8B | 8B | 0.7635 |
| Qwen/Qwen3-Embedding-4B | 4B | 0.7484 |
| telepix/PIXIE-Rune-Preview | 0.6B | 0.7420 |
| nlpai-lab/KURE-v1 | 0.6B | 0.7395 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 0.6B | 0.7386 |
| telepix/PIXIE-Spell-Preview-1.7B | 1.7B | 0.7342 |
| BAAI/bge-m3 | 0.6B | 0.7339 |
| dragonkue/BGE-m3-ko | 0.6B | 0.7312 |
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.6B | 0.7179 |
| telepix/PIXIE-Spell-Preview-0.6B | 0.6B | 0.7106 |
| intfloat/multilingual-e5-large | 0.6B | 0.7075 |
| FronyAI/frony-embed-medium-arctic-ko-v2.5 | 0.6B | 0.7067 |
| nlpai-lab/KoE5 | 0.6B | 0.7043 |
| google/embeddinggemma-300m | 0.3B | 0.6944 |
| BAAI/bge-multilingual-gemma2 | 9.4B | 0.6931 |
| Qwen/Qwen3-Embedding-0.6B | 0.6B | 0.6895 |
| Alibaba-NLP/gte-multilingual-base | 0.3B | 0.6879 |
| jinaai/jina-embeddings-v3 | 0.6B | 0.6872 |
| SamilPwC-AXNode-GenAI/PwC-Embeddingexpr | 0.6B | 0.6846 |
| nomic-ai/nomic-embed-text-v2-moe | 0.5B | 0.6799 |
| intfloat/multilingual-e5-large-instruct | 0.6B | 0.6799 |
| intfloat/multilingual-e5-base | 0.3B | 0.6709 |
| Alibaba-NLP/gte-Qwen2-7B-instruct | 7.6B | 0.6689 |
| intfloat/e5-mistral-7b-instruct | 7.1B | 0.6649 |
| openai/text-embedding-3-large | Unknown | 0.6513 |
| upskyy/bge-m3-korean | 0.6B | 0.6434 |
| Salesforce/SFR-Embedding-2R | 2.6B | 0.6391 |
| jhgan/ko-sroberta-multitask | 0.1B | 0.5165 |

Training Hyperparameters

Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 4
- `per_device_eval_batch_size`: 2
- `learning_rate`: 2e-05
- `num_train_epochs`: 2
- `warmup_ratio`: 0.1
- `bf16`: True
- `negs_per_query`: 6 (from our dataset)
- `gather_device`: True (makes samples available to be shared across devices)

All Hyperparameters
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 4
- `per_device_eval_batch_size`: 2
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 2e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 2
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 7
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: True
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: True
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}

Framework Versions
- Python: 3.10.18
- Sentence Transformers: 5.1.1
- Transformers: 4.56.2
- PyTorch: 2.8.0+cu128
- Accelerate: 1.10.1
- Datasets: 4.1.1
- Tokenizers: 0.22.1
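The card above scores queries against documents by taking the dot product of very high-dimensional sparse vectors (here 50000 dimensions, of which only ~100–800 are active). As a language-agnostic illustration only (toy vectors, not the model's actual encoder), sparse dot-product ranking can be sketched with `{token_id: weight}` dictionaries standing in for the sparse activations:

```python
# Sketch of sparse dot-product retrieval, SPLADE-style.
# The vectors are toy stand-ins: a real encoder such as splade-ko-v1 would
# produce ~50000-dim sparse activations with a few hundred nonzero entries.

def sparse_dot(q, d):
    """Dot product of two sparse vectors stored as {token_id: weight} dicts."""
    if len(d) < len(q):          # iterate over the smaller dict
        q, d = d, q
    return sum(w * d[t] for t, w in q.items() if t in d)

def rank_docs(query_vec, corpus_vecs, top_k=2):
    """Return (doc_id, score) pairs sorted by descending dot-product score."""
    scores = [(doc_id, sparse_dot(query_vec, vec))
              for doc_id, vec in corpus_vecs.items()]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]

query = {101: 1.2, 2045: 0.8}
corpus = {
    "doc_a": {101: 0.9, 77: 0.3},            # one active dim shared with the query
    "doc_b": {101: 0.5, 2045: 1.1, 9: 0.2},  # two active dims shared
    "doc_c": {500: 2.0},                     # no overlap -> score 0
}
print(rank_docs(query, corpus))  # doc_b ranks first: 1.2*0.5 + 0.8*1.1 = 1.48
```

Because only overlapping active dimensions contribute, scoring cost scales with the number of nonzero entries rather than the full 50000-dim space, which is why inverted-index engines handle such vectors efficiently.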

license:apache-2.0
352
6

colbert-ko-v1.0

colbert-ko-v1 is a Korean ColBERT model finetuned with PyLate. This model is trained exclusively on Korean datasets. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

Model Description
- Model Type: PyLate model
- Document Length: 1024 tokens
- Query Length: 32 tokens
- Output Dimensionality: 128 dimensions
- Similarity Function: MaxSim

Evaluation Dataset

| Dataset | Description | Average Length (characters) |
|---|---|---|
| Ko-StrategyQA | Korean ODQA multi-hop retrieval dataset (translated from StrategyQA) | 305.15 |
| AutoRAGRetrieval | Korean document retrieval dataset constructed by parsing PDFs from 5 domains: finance, public, medical, legal, and commerce | 823.60 |
| PublicHealthQA | Korean document retrieval dataset for the medical and public health domains | 339.00 |
| BelebeleRetrieval | Korean document retrieval dataset based on FLORES-200 | 243.11 |
| MultiLongDocRetrieval | Korean long-document retrieval dataset covering various domains | 13,813.44 |

We omit MIRACLRetrieval and MrTidyRetrieval in evaluation due to hardware constraints.
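The result tables for these models report metrics such as Recall@10 and NDCG@10. As a reference for how such numbers arise (not the cards' actual evaluation code, which uses library evaluators), NDCG@k for a single query with binary relevance labels can be computed as:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal DCG (relevances sorted descending)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Toy query: relevant documents retrieved at ranks 1 and 4, rest non-relevant.
rels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(rels, 10), 4))
```

The benchmark numbers are this per-query value averaged over all queries in a dataset, then averaged over datasets.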
| Model | Parameters | Average Recall@10 | Average Precision@10 | Average NDCG@10 | Average F1@10 |
|---|---|---|---|---|---|
| colbert-ko-v1 | 0.1B | 0.7999 | 0.0930 | 0.7172 | 0.1655 |
| jina-colbert-v2 | 0.5B | 0.7518 | 0.0888 | 0.6671 | 0.1577 |

If you only want to use the ColBERT model to perform reranking on top of your first-stage retrieval pipeline without building an index, you can simply use the rank function and pass it the queries and documents to rerank:

Usage with MUVERA

First install muvera-py (a Python implementation of MUVERA):

Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search. Load the ColBERT model and initialize the PLAID index, then encode and index your documents:

Note that you do not have to recreate the index and re-encode the documents every time. Once you have created an index and added the documents, you can reuse the index later by loading it:

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and then retrieve the top-k documents to get the top matching ids and relevance scores:

Training Details

Loss: pylate.losses.cached_contrastive.CachedContrastive

Training Hyperparameters

Non-Default Hyperparameters
- `per_device_train_batch_size`: 128
- `per_device_eval_batch_size`: 32
- `learning_rate`: 3e-06
- `num_train_epochs`: 1
- `warmup_ratio`: 0.1
- `bf16`: True

All Hyperparameters
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 128
- `per_device_eval_batch_size`: 32
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 3e-06
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.1
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: True
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: True
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

Framework Versions
- Python: 3.10.18
- Sentence Transformers: 4.0.2
- PyLate: 1.3.0
- Transformers: 4.52.3
- PyTorch: 2.8.0+cu128
- Accelerate: 1.10.1
- Datasets: 3.6.0
- Tokenizers: 0.21.4
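The colbert-ko-v1 card lists MaxSim as its similarity function: each query token embedding is matched against its best-scoring document token embedding, and these maxima are summed. A self-contained sketch with toy 2-dimensional token vectors (the real model uses 128 dimensions per token):

```python
# MaxSim (ColBERT late interaction) scoring sketch with toy embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    """For every query token embedding, take the maximum dot product over
    all document token embeddings, then sum across query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Toy per-token embeddings (2-dim instead of the model's 128-dim).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[0.9, 0.1], [0.2, 0.8]]   # both query tokens find a strong match
doc_b = [[0.1, 0.1], [0.0, 0.2]]   # weak matches only
print(maxsim(query, doc_a), maxsim(query, doc_b))
```

Unlike single-vector dense retrieval, this keeps one vector per token on both sides, which is why the card reports a document length (1024 tokens) and a query length (32 tokens) rather than a single embedding size.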

license:apache-2.0
56
13

colbert-ko-v1

license:apache-2.0
38
13

reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-ranknetloss-softmax

2
0

e5-large-unsupervised-sparseq

license:mit
2
0

preranker-v1

license:apache-2.0
1
3

reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-plistmle-customweight

1
0

reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-ranknetloss-sigmoid

1
0

reranker-msmarco-v1.1-MiniLM-L12-H384-uncased-lambdaloss-noweight

1
0

e5-large-sparseq

News (May 2023): please switch to e5-large-v2, which has better performance and the same method of usage.

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022

This model has 24 layers and an embedding size of 1024.

Below is an example of encoding queries and passages from the MS-MARCO passage ranking dataset.

Please refer to our paper at https://arxiv.org/pdf/2212.03533.pdf. Check out unilm/e5 to reproduce evaluation results on the BEIR and MTEB benchmarks.

Below is an example of usage with sentence_transformers.

FAQ

1. Do I need to add the prefix "query: " and "passage: " to input texts?
Yes, this is how the model is trained; otherwise you will see a performance degradation. Here are some rules of thumb:
- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity and paraphrase retrieval.
- Use the "query: " prefix if you want to use embeddings as features, e.g. for linear-probing classification or clustering.

2. Why are my reproduced results slightly different from those reported in the model card?
Different versions of `transformers` and `pytorch` can cause negligible but non-zero performance differences.

3. Why do the cosine similarity scores fall mostly between 0.7 and 1.0?
This is known and expected behavior, since we use a low temperature of 0.01 for the InfoNCE contrastive loss. For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be an issue.

If you find our paper or models helpful, please consider citing as follows:

Limitations: this model only works for English texts. Long texts will be truncated to at most 512 tokens.
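The prefix rules from the FAQ above can be captured in a small helper. This function and its parameter names are my own illustration, not part of the e5 API; the only facts taken from the card are the "query: "/"passage: " strings and when to use each:

```python
def e5_prefix(text, role="query", task="asymmetric"):
    """Apply the input prefix the e5 models were trained with.

    Per the model card (helper itself is illustrative, not official):
    - asymmetric retrieval: "query: " for queries, "passage: " for documents
    - symmetric tasks (similarity, paraphrase retrieval) and
      embeddings-as-features (clustering, linear probing): "query: " for all inputs
    """
    if task == "asymmetric" and role == "passage":
        return "passage: " + text
    return "query: " + text

print(e5_prefix("how tall is mount everest", role="query"))
print(e5_prefix("Mount Everest is 8,849 m tall.", role="passage"))
print(e5_prefix("a sentence to cluster", task="symmetric"))
```

Forgetting these prefixes does not raise an error; it silently degrades retrieval quality, which makes a helper like this worth centralizing in a pipeline.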

license:mit
1
0