---
base_model:
- JetBrains-Research/Qwen3-8B-am
datasets:
- JetBrains-Research/PIPer-envbench-zeroshot-rl
- JetBrains-Research/PIPer-SFT-2500-sharegpt
library_name: transformers
license: mit
pipeline_tag: text-generation
---

<img src="https://github.com/JetBrains-Research/PIPer/blob/main/misc/piper-logo.png?raw=true" alt="PIPer Mascot" style="height: 6em">
<h1>
  PIPer: On-Device Environment Setup via Online Reinforcement Learning
</h1>

[Paper](https://huggingface.co/papers/2509.25455) | [Code](https://github.com/JetBrains-Research/PIPer)

<div align="center">

[![Models](https://img.shields.io/badge/🤗%20Hugging%20Face-Models-orange.svg)](https://jb.gg/PIPer)
[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-green.svg)](https://huggingface.co/datasets/JetBrains-Research/PIPer-envbench-zeroshot-rl)
[![License](https://img.shields.io/badge/License-MIT-red.svg)](LICENSE)

*Democratizing environment setup with on-device sized models that match the performance of much larger proprietary systems*

</div>

## 🎯 Overview

Environment setup, the process of configuring a system to work with a specific software project, remains a persistent challenge in software engineering. **PIPer** addresses this by training specialized on-device models that can automatically generate correct Bash scripts for environment configuration.

Our approach combines:
- 📚 **Supervised Fine-Tuning (SFT)** on executable scripts generated by larger models
- 🎯 **Reinforcement Learning with Verifiable Rewards (RLVR)** with a lightweight LLM-based proxy reward
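The second ingredient can be pictured with a toy sketch (our own illustration, not the paper's implementation): instead of executing a generated script to obtain a reward, a judge model rates how likely the script is to set up the environment correctly, and the parsed score becomes the RL reward. The prompt wording and the `judge` callable below are hypothetical placeholders.

```python
import re
from typing import Callable

# Hypothetical judge prompt; the actual prompt used by PIPer may differ.
JUDGE_TEMPLATE = (
    "You are reviewing a Bash script meant to set up the environment for a "
    "software repository.\n\nScript:\n{script}\n\n"
    "Rate from 0 to 10 how likely this script is to succeed. "
    "Answer with a single integer."
)

def proxy_reward(script: str, judge: Callable[[str], str]) -> float:
    """Score a candidate script with a judge LLM; returns a reward in [0, 1].

    `judge` is any callable mapping a prompt string to the model's reply,
    e.g. a thin wrapper around an inference API.
    """
    reply = judge(JUDGE_TEMPLATE.format(script=script))
    match = re.search(r"\d+", reply)
    if match is None:
        return 0.0  # unparseable judge output earns no reward
    return min(int(match.group()), 10) / 10.0
```

Because the judge never executes anything, this kind of proxy reward stays cheap enough to query at every RL step.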

## 🏆 Key Results

| Model | Size | EnvBench avg@5 | Cost per 1M tokens |
|-------|------|----------------|-------------------|
| **PIPer** | 8B | **19.4** | $0.60 |
| GPT-4o | - | 19.4 | $15.00 |
| Qwen3-32B | 32B | 16.2 | $2.00 |
| Qwen3-8B | 8B | 2.6 | $0.60 |

> 🎉 **PIPer achieves a 9× improvement** over its base model while **matching GPT-4o performance** at **25× lower cost**

![Performance vs Cost Analysis](https://github.com/JetBrains-Research/PIPer/blob/main/misc/combined_pass_n_and_cost.png?raw=true)

## 📦 Available Artifacts

### 🤖 Model Checkpoints

| Model | Description | HuggingFace Link |
|-------|-------------|------------------|
| **🏅 PIPer (Full)** | Complete SFT+RL trained model | [JetBrains-Research/PIPer-8B](https://huggingface.co/JetBrains-Research/PIPer-8B) |
| 🎯 PIPer (RL-only) | RLVR checkpoint only | [JetBrains-Research/PIPer-8B-RL-only](https://huggingface.co/JetBrains-Research/PIPer-8B-RL-only) |
| 📚 PIPer (SFT-only) | Supervised fine-tuning only | [JetBrains-Research/PIPer-8B-SFT-only](https://huggingface.co/JetBrains-Research/PIPer-8B-SFT-only) |

### 📊 Datasets

| Dataset                   | Description                                            | HuggingFace Link                                                                                    |
|---------------------------|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
| **EnvBench Zero-shot RL** | Training prompts and evaluation data                   | [JetBrains-Research/PIPer-envbench-zeroshot-rl](https://huggingface.co/datasets/JetBrains-Research/PIPer-envbench-zeroshot-rl)  |
| **EnvBench SFT 2500**     | Zeroshot trajectories from Qwen-32B in ShareGPT format | [JetBrains-Research/PIPer-SFT-2500-sharegpt](https://huggingface.co/datasets/JetBrains-Research/PIPer-SFT-2500-sharegpt)  |
| **PIPer Eval**            | Full evaluation results for EnvBench and Repo2Run      | [JetBrains-Research/PIPer-eval](https://huggingface.co/datasets/JetBrains-Research/PIPer-eval/tree/main)  |


## 🚀 Reproduce the results
We use [uv](https://docs.astral.sh/uv/) for dependency management and [Ray](https://docs.ray.io/en/latest/ray-core/ray-core.html) for distributed training.

```bash
git clone https://github.com/JetBrains-Research/PIPer.git
cd PIPer
git submodule update --init --recursive
uv sync
```

To run the experiments, you need a node with at least 4 H200 GPUs and [Ray](https://docs.ray.io/en/latest/ray-core/ray-core.html) installed and running.
Then you can run all the experiments with the following command:

```bash
uv run piper/hparams_entrypoint.py --multirun +experiment=llm-reward
```

You can look up the experiment [Hydra](https://hydra.cc/docs/intro/) configurations in the `piper/config/` folder, or print the full config with the following command:

```bash
uv run piper/hparams_entrypoint.py +experiment=llm-reward --info config
```

## 📊 Evaluation Benchmarks

| Benchmark | Description | Metric | Our Result |
|-----------|-------------|---------|------------|
| **EnvBench-Python** | 329 Python repositories | pass@5 | 🏆 **27/329** |
| **Repo2Run** | 420 Python repositories | pass@5 | 🏆 **103/420** |
| **Terminal-Bench** | 80 terminal tasks | pass@10 | **4/80** |

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.