---
title: DGEB
app_file: leaderboard/app.py
sdk: docker
sdk_version: 4.36.1
---
<h1 align="center">Diverse Genomic Embedding Benchmark</h1>

<p align="center">
    <a href="https://github.com/tattabio/dgeb/releases">
        <img alt="GitHub release" src="https://img.shields.io/github/v/release/tattabio/dgeb.svg">
    </a>
    <a href="">
        <img alt="arXiv URL" src="">
    </a>
    <a href="https://github.com/tattabio/dgeb/blob/main/LICENSE">
        <img alt="License" src="https://img.shields.io/github/license/tattabio/dgeb.svg">
    </a>
    <a href="https://pepy.tech/project/dgeb">
        <img alt="Downloads" src="https://static.pepy.tech/personalized-badge/dgeb?period=total&units=international_system&left_color=grey&right_color=orange&left_text=Downloads">
    </a>
</p>

<h4 align="center">
    <p>
        <a href="#installation">Installation</a> |
        <a href="#usage">Usage</a> |
        <a href="https://huggingface.co/spaces/tattabio/DGEB">Leaderboard</a> |
        <a href="#citing">Citing</a>
    </p>
</h4>

<h3 align="center">
    <a href="https://huggingface.co/spaces/dgeb"><img style="float: middle; padding: 10px 10px 10px 10px;" width="100" height="100" src="./docs/images/tatta_logo.png" /></a>
</h3>
DGEB is a benchmark for evaluating biological sequence models on functional and evolutionary information.

DGEB is designed to evaluate model embeddings using:

- Diverse sequences across the tree of life.
- Diverse tasks that capture different aspects of biological function.
- Both amino acid and nucleotide sequences.

The current version of DGEB consists of 18 datasets covering all three domains of life (Bacteria, Archaea, and Eukarya). DGEB evaluates embeddings using six embedding tasks: Classification, BiGene mining, Evolutionary Distance Similarity (EDS), Pair Classification, Clustering, and Retrieval.

We welcome contributions of new tasks and datasets.

## Installation

Install DGEB using pip.

```bash
pip install dgeb
```
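
To confirm that the install succeeded, the package version can be queried from Python. This is a minimal sketch using only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of DGEB.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("dgeb")
    print(f"dgeb version: {version}" if version else "dgeb is not installed")
```
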
## Usage

- Launch an evaluation from the command line (see [cli.py](https://github.com/tattabio/dgeb/blob/main/dgeb/cli.py)):

```bash
dgeb --model facebook/esm2_t6_8M_UR50D
```

- To see all supported models and tasks:

```bash
dgeb --help
```

- Using the Python API:

```py
import dgeb

# Load a supported model by name.
model = dgeb.get_model("facebook/esm2_t6_8M_UR50D")
# Select the tasks defined for protein sequences.
tasks = dgeb.get_tasks_by_modality(dgeb.Modality.PROTEIN)
evaluation = dgeb.DGEB(tasks=tasks)
# Run the evaluation and write the results to the "results" folder.
evaluation.run(model, output_folder="results")
```
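
After a run, the output folder can be inspected from Python. The sketch below rests on assumptions: it supposes that results are stored as `.json` files somewhere under the output folder (the Leaderboard section mentions a submission `.json` file, but the exact layout and schema are not documented here). `load_result_files` is a hypothetical helper, not part of the DGEB API.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_result_files(output_folder: str) -> Dict[str, Any]:
    """Parse every .json file under `output_folder`, keyed by relative path.

    Assumption: results are plain JSON files; no DGEB-specific schema is assumed.
    """
    root = Path(output_folder)
    if not root.is_dir():
        return {}
    results: Dict[str, Any] = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.relative_to(root).as_posix()] = json.load(handle)
    return results


if __name__ == "__main__":
    for name, content in load_result_files("results").items():
        # Show only top-level keys, since the schema is not assumed here.
        summary = list(content) if isinstance(content, dict) else type(content).__name__
        print(name, summary)
```
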
### Using a custom model

Custom models should subclass the `dgeb.models.BioSeqTransformer` abstract class and specify the modality, number of layers, and embedding dimension. See [models.py](https://github.com/tattabio/dgeb/blob/main/dgeb/models.py) for additional examples of custom model loading and inference.

```python
import dgeb
from dgeb.models import BioSeqTransformer
from dgeb.tasks.tasks import Modality


class MyModel(BioSeqTransformer):
    @property
    def modality(self) -> Modality:
        # Sequence type the model embeds.
        return Modality.PROTEIN

    @property
    def num_layers(self) -> int:
        # Number of layers, read from the model config.
        return self.config.num_hidden_layers

    @property
    def embed_dim(self) -> int:
        # Embedding dimension, read from the model config.
        return self.config.hidden_size


model = MyModel(model_name='path_to/huggingface_model')
tasks = dgeb.get_tasks_by_modality(model.modality)
evaluation = dgeb.DGEB(tasks=tasks)
evaluation.run(model)
```
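
If `BioSeqTransformer` declares these attributes as abstract properties (an assumption based on its description as an abstract class), a subclass that omits one of them cannot be instantiated. The sketch below shows that standard Python mechanism with a stand-in base class. `SequenceModelBase`, `IncompleteModel`, `ToyModel`, and `can_instantiate` are hypothetical names used for illustration, the returned values are arbitrary, and nothing here reproduces the actual DGEB implementation.

```python
from abc import ABC, abstractmethod


class SequenceModelBase(ABC):
    """Hypothetical stand-in for an abstract model wrapper (not the DGEB class)."""

    @property
    @abstractmethod
    def num_layers(self) -> int:
        ...

    @property
    @abstractmethod
    def embed_dim(self) -> int:
        ...


class IncompleteModel(SequenceModelBase):
    """Defines num_layers but leaves embed_dim unimplemented."""

    @property
    def num_layers(self) -> int:
        return 6


class ToyModel(SequenceModelBase):
    """Defines every abstract property, so it can be instantiated."""

    @property
    def num_layers(self) -> int:
        return 6

    @property
    def embed_dim(self) -> int:
        return 320


def can_instantiate(cls) -> bool:
    """Return True if `cls()` succeeds, False if it raises TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False


if __name__ == "__main__":
    print("ToyModel:", can_instantiate(ToyModel))
    print("IncompleteModel:", can_instantiate(IncompleteModel))
```
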
### Evaluating on a custom dataset

**We strongly encourage users to contribute their custom datasets to DGEB. Please open a PR adding your dataset so that the community can benefit!**

To evaluate on a custom dataset, first upload your dataset to the [Hugging Face Hub](https://huggingface.co/docs/hub/en/datasets-adding). Then define a `Task` subclass with `TaskMetadata` that points to your Hugging Face dataset. For example, a classification task on a custom dataset can be defined as follows:

```python
import dgeb
from dgeb.models import BioSeqTransformer
from dgeb.tasks import Dataset, Task, TaskMetadata, TaskResult
from dgeb.tasks.classification_tasks import run_classification_task


class MyCustomTask(Task):
    metadata = TaskMetadata(
        id="my_custom_classification",
        display_name="...",
        description="...",
        type="classification",
        modality=dgeb.Modality.PROTEIN,
        datasets=[
            Dataset(
                path="path_to/huggingface_dataset",
                revision="...",
            )
        ],
        primary_metric_id="f1",
    )

    def run(self, model: BioSeqTransformer) -> TaskResult:
        # Delegate to the shared classification routine.
        return run_classification_task(model, self.metadata)


model = dgeb.get_model("facebook/esm2_t6_8M_UR50D")
evaluation = dgeb.DGEB(tasks=[MyCustomTask])
evaluation.run(model)
```
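
The example above sets `primary_metric_id="f1"`. As a reminder of what that metric measures, here is a minimal pure-Python sketch of per-class and macro-averaged F1. It is illustrative only: the helper names and the toy labels are hypothetical, and whether DGEB reports macro, micro, or weighted F1 is not stated here.

```python
from typing import Hashable, Sequence


def f1_for_class(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable
) -> float:
    """F1 = 2 * precision * recall / (precision + recall) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No true positives means precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = ["kinase", "kinase", "transporter", "transporter"]
    y_pred = ["kinase", "transporter", "transporter", "transporter"]
    print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case, the "kinase" class has precision 1 and recall 1/2 (F1 = 2/3), and the "transporter" class has precision 2/3 and recall 1 (F1 = 4/5), so the macro average is 11/15.
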
## Leaderboard

To add your submission to the DGEB leaderboard, follow these steps:

1. Fork the DGEB repository by following GitHub's [forking workflow](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request-from-a-fork) instructions.
2. Add your submission `.json` file to the `leaderboard/submissions/<HF_MODEL_NAME>/` directory.

```bash
mv /path/to/<SUBMISSION_FILE>.json /path/to/DGEB/leaderboard/submissions/<HF_MODEL_NAME>/
```

3. Update your fork with the new submission:

```bash
git add leaderboard/submissions/<HF_MODEL_NAME>/<SUBMISSION_FILE>.json
git commit -m "Add submission for <HF_MODEL_NAME>"
git push
```

4. Open a pull request to the main branch of the repository via the GitHub interface.
5. Once the PR is reviewed and merged, your submission will be added to the leaderboard!

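
The file-placement step can also be scripted. The following is a minimal sketch, not an official DGEB tool: `stage_submission` is a hypothetical helper that checks the submission parses as JSON, creates the model's submission directory if it does not exist yet, and copies (rather than moves) the file into it. It does not validate the submission against any DGEB schema.

```python
import json
import shutil
from pathlib import Path


def stage_submission(submission_file: str, repo_root: str, hf_model_name: str) -> Path:
    """Copy a submission into leaderboard/submissions/<hf_model_name>/ under repo_root.

    Raises json.JSONDecodeError if the file is not valid JSON.
    """
    source = Path(submission_file)
    # Fail early on malformed JSON, before touching the repository.
    with source.open(encoding="utf-8") as handle:
        json.load(handle)
    target_dir = Path(repo_root) / "leaderboard" / "submissions" / hf_model_name
    # Create the per-model directory if it is not there yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target_dir))


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 4:
        print(stage_submission(sys.argv[1], sys.argv[2], sys.argv[3]))
    else:
        print("usage: stage_submission.py <submission.json> <repo_root> <hf_model_name>")
```

After staging the file, the `git add`, `git commit`, and `git push` steps above apply unchanged.
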
## Acknowledgements

DGEB follows the design of the text embedding benchmark [MTEB](https://github.com/embeddings-benchmark/mteb) developed by Hugging Face 🤗. The evaluation code is adapted from the MTEB codebase.

## Citing

DGEB was introduced in "[Diverse Genomic Embedding Benchmark for Functional Evaluation Across the Tree of Life]()". Feel free to cite:

TODO