Dark-O-Ether committed on
Commit
8764b41
·
1 Parent(s): 5e4d6f3

Added application file

Files changed (2)
  1. app.py +814 -0
  2. requirements.txt +3 -0
app.py ADDED
@@ -0,0 +1,814 @@
+ import streamlit as st
+ import streamlit.components.v1 as components
+ import pandas as pd
+ import random
+ import json
+ import hashlib  # stable hashing for run-to-run consistent token colours
+ import time
+
+ # Set page configuration
+ st.set_page_config(
+     page_title="tokeniser-py Demonstration",
+     page_icon="🔣",
+     layout="wide",
+ )
+
+ # Custom CSS for better UI
+ st.markdown("""
+ <style>
+     .main {
+         background-color: #0e1117;
+         color: white;
+     }
+     .stTextInput > div > div > input, .stTextArea > div > div > textarea {
+         background-color: #1e2130;
+         color: white;
+         border: 1px solid #30343e;
+         border-radius: 4px;
+         padding: 10px;
+     }
+     .token-display {
+         margin-top: 20px;
+         padding: 15px;
+         border-radius: 5px;
+         background-color: #1e2130;
+         line-height: 2;
+         overflow-wrap: break-word;
+     }
+     .token {
+         display: inline-block;
+         padding: 2px 4px;
+         margin: 2px;
+         border-radius: 3px;
+         position: relative;
+         cursor: pointer;
+         color: #0e1117 !important;
+         font-weight: 600;
+         text-shadow: 0px 0px 1px rgba(0,0,0,0.2);
+     }
+     .token:hover::after {
+         content: attr(data-id);
+         position: absolute;
+         top: -25px;
+         left: 0;
+         background: #3c4356;
+         color: white;
+         padding: 2px 6px;
+         border-radius: 3px;
+         font-size: 12px;
+         white-space: nowrap;
+         z-index: 100;
+     }
+     .button-container {
+         display: flex;
+         gap: 10px;
+         margin-bottom: 15px;
+     }
+     .stButton button {
+         background-color: #2c313d;
+         border: none;
+         color: white;
+     }
+     .stButton button:hover {
+         background-color: #3c4356;
+     }
+     .info-box {
+         margin-top: 20px;
+         padding: 20px;
+         border-radius: 5px;
+         background-color: #1e2130;
+         font-size: 14px;
+         line-height: 1.6;
+     }
+     .quote {
+         border-left: 4px solid #00ba7c;
+         padding-left: 10px;
+         margin: 10px 0;
+         color: #e0e0e0;
+     }
+     .highlight {
+         background-color: rgba(0, 186, 124, 0.15);
+         padding: 2px 4px;
+         border-radius: 3px;
+         font-weight: 500;
+     }
+     .comparison-table {
+         background-color: #262b38;
+         padding: 15px;
+         border-radius: 5px;
+         margin: 15px 0;
+     }
+     .section-title {
+         font-weight: 600;
+         margin-top: 15px;
+         margin-bottom: 8px;
+         color: #00ba7c;
+     }
+     .stRadio [role=radiogroup] {
+         background-color: #1e2130;
+         padding: 5px;
+         border-radius: 5px;
+     }
+     .header-container {
+         display: flex;
+         justify-content: space-between;
+         align-items: center;
+         padding: 10px 0;
+         margin-top: -80px;
+     }
+     .stats-container {
+         display: flex;
+         gap: 20px;
+         padding: 10px;
+         background-color: #1e2130;
+         border-radius: 5px;
+         margin-bottom: 20px;
+     }
+     .stat-box {
+         padding: 10px;
+     }
+     .stat-label {
+         font-size: 0.9em;
+         color: #aaa;
+     }
+     .stat-value {
+         font-size: 1.5em;
+         font-weight: bold;
+     }
+     a {
+         color: #00ba7c !important;
+         text-decoration: none;
+     }
+     a:hover {
+         text-decoration: underline;
+     }
+     .monospace {
+         font-family: monospace;
+     }
+     .note-box {
+         background-color: rgba(255, 204, 0, 0.1);
+         border-left: 3px solid rgba(255, 204, 0, 0.7);
+         padding: 10px 15px;
+         margin: 10px 0;
+         border-radius: 0 5px 5px 0;
+     }
+     .buttons-row {
+         display: flex;
+         gap: 10px;
+     }
+     /* Enhanced bullet points styling */
+     .bullet-point {
+         display: flex;
+         align-items: baseline;
+         margin: 8px 0;
+         padding: 4px 0;
+     }
+     .bullet-point-icon {
+         display: inline-flex;
+         align-items: center;
+         justify-content: center;
+         min-width: 24px;
+         height: 24px;
+         background-color: rgba(0, 186, 124, 0.2);
+         color: #00ba7c;
+         border-radius: 50%;
+         margin-right: 10px;
+         font-weight: bold;
+     }
+     .secondary-bullet {
+         background-color: rgba(0, 186, 124, 0.1);
+     }
+     .comparison-item {
+         display: flex;
+         align-items: baseline;
+         margin: 10px 0;
+         padding: 6px 0;
+     }
+     .comparison-icon {
+         display: inline-flex;
+         align-items: center;
+         justify-content: center;
+         min-width: 28px;
+         height: 28px;
+         background-color: rgba(0, 186, 124, 0.25);
+         color: #00ba7c;
+         border-radius: 50%;
+         margin-right: 12px;
+         font-weight: bold;
+     }
+     .comparison-text {
+         flex: 1;
+     }
+     .learn-more-section {
+         background-color: #1e2130;
+         border-radius: 5px;
+         padding: 20px;
+     }
+     .icon-wrapper {
+         display: inline-flex;
+         align-items: center;
+         justify-content: center;
+     }
+     .colored-icon {
+         display: inline-block;
+         color: #00ba7c;
+         font-size: 1.4em;
+         margin-right: 10px;
+     }
+     .library-feature {
+         display: flex;
+         align-items: baseline;
+         margin: 10px 0;
+     }
+     .feature-dot {
+         min-width: 18px;
+         height: 18px;
+         background-color: rgba(0, 186, 124, 0.2);
+         border-radius: 50%;
+         margin-right: 10px;
+         display: flex;
+         align-items: center;
+         justify-content: center;
+     }
+     .feature-text {
+         flex: 1;
+     }
+     .sub-feature {
+         display: flex;
+         padding-left: 30px;
+         margin: 8px 0;
+         align-items: baseline;
+     }
+     .sub-feature-dot {
+         min-width: 12px;
+         height: 12px;
+         background-color: rgba(0, 186, 124, 0.1);
+         border-radius: 50%;
+         margin-right: 10px;
+     }
+     .code-block {
+         background-color: #0e1117;
+         padding: 15px;
+         border-radius: 5px;
+         font-family: 'Courier New', monospace;
+         margin: 15px 0;
+         color: #e0e0e0;
+         border-left: 3px solid #00ba7c;
+     }
+     .code-line {
+         padding: 2px 0;
+         display: block;
+     }
+     .code-import {
+         color: #ff79c6;
+     }
+     .code-class {
+         color: #8be9fd;
+     }
+     .code-function {
+         color: #50fa7b;
+     }
+     .code-var {
+         color: #f1fa8c;
+     }
+     .code-string {
+         color: #f1fa8c;
+     }
+     .code-comment {
+         color: #6272a4;
+     }
+     .link-top-a {
+         color: rgb(72, 140, 255) !important;
+         font-size: 18px;
+     }
+     .link-top {
+         color: rgb(180, 220, 255) !important;
+         font-size: 18px;
+     }
+ </style>
+ """, unsafe_allow_html=True)
+
+ # Header with logo and title
+ st.markdown("""
+ <div class="header-container">
+     <div>
+         <h1>tokeniser-py 🔣</h1>
+         <a href="https://github.com/Tasmay-Tibrewal/tokeniser-py" class="link-top-a" style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">Library GitHub</span></a>
+         <p class="link-top" style="display: inline;"> | </p>
+         <a href="https://huggingface.co/datasets/Tasmay-Tib/Tokeniser" class="link-top-a" style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">HF Dataset</span></a>
+         <p class="link-top" style="display: inline;"> | </p>
+         <a href="https://github.com/Tasmay-Tibrewal/Tokeniser" class="link-top-a" style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">GitHub Dataset (chunked)</span></a>
+         <p class="link-top" style="display: inline;"> | </p>
+         <a href="https://github.com/Tasmay-Tibrewal/Tokeniser-imp" class="link-top-a" style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">GitHub Imp Files</span></a>
+         <p class="link-top" style="display: inline;"> | </p>
+         <a href="https://pypi.org/project/tokeniser-py/" class="link-top-a" style="display: inline;"><span style="background-color:rgba(100,146,154,0.17); padding:2px 4px; border-radius:3px;">PyPI Package</span></a>
+         <p></p>
+         <p style="font-size: 20px;"><strong>Learn about language model tokenization</strong></p>
+         <p style="font-size: 17px; margin-bottom: 5px;">
+             <span style="background-color:rgba(154, 187, 255,0.4); padding:2px 4px; border-radius:3px;">tokeniser-py's</span> custom tokenizer processes text using tokens, which are common sequences of characters found in a body of text. The model learns to understand the statistical relationships
+             between these tokens, and excels at producing the next token in a sequence of tokens. You can use the tool below to see how a piece of text might be tokenized by a language model, and the total count of tokens in that piece of text.
+         </p>
+     </div>
+ </div>
+ """, unsafe_allow_html=True)
+
+ # Initialize tokenizer
+ @st.cache_resource
+ def load_tokenizer(ln="1b", token_ordered=False):
+     try:
+         from tokeniser import Tokeniser
+         # Pass parameters based on selection
+         return Tokeniser(ln=ln, token_ordered=token_ordered)
+     except Exception as e:
+         st.error(f"Error loading tokenizer: {e}")
+         return None
+
+ st.markdown("###### Model")
+ # Radio selector for the model variant
+ model_version = st.radio(
+     "Model",
+     ["Default (1b model unordered)", "1b model ordered", "0.5b model unordered", "0.5b model ordered"],
+     horizontal=True,
+     label_visibility="collapsed",
+ )
+
+ # Map selected model version to parameters
+ if model_version == "Default (1b model unordered)":
+     ln_param = "1b"
+     ordered_param = False
+ elif model_version == "1b model ordered":
+     ln_param = "1b"
+     ordered_param = True
+ elif model_version == "0.5b model unordered":
+     ln_param = "0.5b"
+     ordered_param = False
+ else:
+     ln_param = "0.5b"
+     ordered_param = True
+
+ # Load tokenizer with selected parameters
+ tokenizer = load_tokenizer(ln=ln_param, token_ordered=ordered_param)
+
+ # Function to generate consistent pastel colors for tokens
+ @st.cache_data
+ def get_token_colors(tokens):
+     # Use a stable digest so colours stay consistent across reruns and restarts
+     # (Python's built-in hash() is randomised per process for strings)
+     colors = {}
+     for token in set(tokens):
+         # Generate a pastel color based on the hash of the token
+         hash_val = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % 360
+         colors[token] = f"hsl({hash_val}, 80%, 75%)"
+     return colors
+
+ # Function to display tokens with colors and hover effects
+ def display_colored_tokens(tokens, token_ids, token_colors):
+     html = ""
+     for token, token_id in zip(tokens, token_ids):
+         # Handle special characters for display
+         if token == '\n':
+             display_token = '\\n'
+         elif token == '\t':
+             display_token = '\\t'
+         else:
+             # Escape '&' first so the entities below aren't double-escaped
+             display_token = token.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace(" ", "&nbsp;")
+
+         html += f'<span class="token" style="background-color: {token_colors[token]};" data-id="{token_id}">{display_token}</span>'
+     return html
+
+ # Function to display token IDs
+ def display_token_ids(token_ids):
+     return f'<div class="monospace">{json.dumps(token_ids)}</div>'
+
+ # Initialize session state for text input if not exists
+ if 'text_input' not in st.session_state:
+     st.session_state.text_input = "Hi I am Tasmay, I am a third year undergraduate at IIT Kharagpur and this is my tokeniser. Please enter your text in this box"
+     st.session_state.text_ind = 0
+
+ st.markdown("###### Enter text to tokenize")
+ # Text input area
+ text_input = st.text_area(
+     "Enter text to tokenize",
+     st.session_state.text_input,
+     height=150,
+     placeholder="Please enter the text to tokenise",
+     label_visibility="collapsed",
+ )
+
+ def clear_text():
+     st.session_state.text_input = ""
+
+ def show_example():
+     examples = [
+         "Hi I am Tasmay, I am a third year undergraduate at IIT Kharagpur and this is my tokeniser. Please enter your text in this box",
+         "Wop, wop, wop, wop, wop, I'ma do my stuff",
+         "I got loyalty, got royalty inside my DNA",
+         "Sit down, be humble",
+         "We gon' be alright"
+     ]
+     st.session_state.text_ind = (st.session_state.text_ind + 1) % len(examples)
+     st.session_state.text_input = examples[st.session_state.text_ind]
+
+ # Add CSS for fixed-width buttons that wrap to new line
+ st.markdown("""
+ <style>
+     div[data-testid="stHorizontalBlock"] {
+         flex-wrap: wrap;
+         gap: 10px;
+         margin-top: -15px;
+         padding-top: 0px;
+         margin-bottom: -15px;
+     }
+
+     div[data-testid="stHorizontalBlock"] > div {
+         flex: 0 0 auto !important;
+         width: auto !important;
+         min-width: initial !important;
+     }
+
+     div[data-testid="stHorizontalBlock"] button {
+         width: 80px; /* Fixed width for "Clear" button */
+         margin-top: 0px;
+     }
+
+     div[data-testid="stHorizontalBlock"] div:nth-child(2) button {
+         margin-top: 0px;
+         width: 150px; /* Fixed width for "Show example" button */
+     }
+ </style>
+ """, unsafe_allow_html=True)
+
+ # Create a horizontal block for buttons
+ button_container = st.container()
+ with button_container:
+     cols = st.columns([1, 1, 10])
+     with cols[0]:
+         st.button("Clear", on_click=clear_text)
+     with cols[1]:
+         st.button("Show example", on_click=show_example)
+
+ # Process the text for tokenization.
+ # Defaults first, so the UI below still renders if the tokeniser failed to
+ # load (tokenizer is None) and the block below is skipped entirely.
+ tokens, token_ids = [], []
+ num_tokens, num_chars, chars_per_token = 0, 0, 0
+ if tokenizer:
+     try:
+         tokens, count = tokenizer.tokenise(text_input)
+         token_ids = tokenizer.token_ids(tokens)
+         num_tokens = len(tokens)
+         num_chars = len(text_input)
+         chars_per_token = num_chars / num_tokens if num_tokens > 0 else 0
+     except Exception as e:
+         st.error(f"Error tokenizing text: {e}")
+         tokens, token_ids = [], []
+         num_tokens, num_chars, chars_per_token = 0, 0, 0
+
+ # Inject custom CSS
+ st.markdown(
+     """
+ <style>
+     div[role="radiogroup"] > label {
+         height: 40px !important;
+         padding-left: 10px;
+         display: flex;
+         align-items: center;
+     }
+     div[role="radiogroup"] {
+         margin-top: -30px;
+         margin-bottom: 0px;
+     }
+     div[data-testid="stTextArea"] {
+         margin-top: -30px;
+     }
+ </style>
+     """,
+     unsafe_allow_html=True
+ )
+
+ # Create view toggle
+ view_option = st.radio(
+     "View",
+     ["Text", "Token IDs"],
+     horizontal=True,
+     label_visibility="collapsed",
+ )
+
+ # Get token colors if we have tokens
+ token_colors = get_token_colors(tokens) if tokens else {}
+
+ # Always display the token display, even if empty
+ if view_option == "Text":
+     if tokens:
+         st.markdown(f'<div class="token-display" style="margin-top: -25px;">{display_colored_tokens(tokens, token_ids, token_colors)}</div>', unsafe_allow_html=True)
+     else:
+         st.markdown('<div class="token-display" style="margin-top: -25px;">No tokens to display</div>', unsafe_allow_html=True)
+ else:
+     if token_ids:
+         st.markdown(f'<div class="token-display" style="margin-top: -25px;">{display_token_ids(token_ids)}</div>', unsafe_allow_html=True)
+     else:
+         st.markdown('<div class="token-display" style="margin-top: -25px;">No token IDs to display</div>', unsafe_allow_html=True)
+
+ # Always display the stats container, even if empty
+ st.markdown("""
+ <div class="stats-container" style="margin-top: -10px; margin-bottom: 10px;">
+     <div class="stat-box">
+         <div class="stat-label">Tokens</div>
+         <div class="stat-value">{}</div>
+     </div>
+     <div class="stat-box">
+         <div class="stat-label">Characters</div>
+         <div class="stat-value">{}</div>
+     </div>
+     <div class="stat-box">
+         <div class="stat-label">Chars per token</div>
+         <div class="stat-value">{:.2f}</div>
+     </div>
+ </div>
+ """.format(num_tokens, num_chars, chars_per_token),
+ unsafe_allow_html=True)
+
+ # Section 1: Tokenization Efficiency
+ st.markdown("---")
+ st.markdown("<h3 style='color:#00ba7c; margin-top:10px;'>Tokenization Efficiency</h3>", unsafe_allow_html=True)
+
+ # Quote block
+ st.markdown("""
+ <div style="border-left: 4px solid #00ba7c; padding-left: 15px; margin: 15px 0; color: #e0e0e0;">
+     A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for
+     common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
+     <div style="font-style: italic; color: #aaa; margin-top: 5px;">— OpenAI</div>
+ </div>
+ """, unsafe_allow_html=True)
+
+ # Section 2: Our Analysis
+ st.markdown("<h3 style='color:#00ba7c; margin-top:20px;'>Our Analysis</h3>", unsafe_allow_html=True)
+ st.markdown("<p>We've conducted a thorough analysis of the token efficiency of our tokeniser against other tokenizers:</p>", unsafe_allow_html=True)
+
+ # Analysis points with enhanced styling
+ st.markdown("""
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>The <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px;">GPT-2 tokenizer</span> corresponds to approximately <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px;">3.9 characters per token</span></div>
+ </div>
+
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>English text corpora typically have average word lengths ranging from <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px;">4.7 to 5.1 characters</span>, which was observed to be <span style="background-color:rgba(0,186,124,0.4); padding:2px 4px; border-radius:3px;">4.73-4.79 in our dataset</span></div>
+ </div>
+
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>Thus for our dataset, traditional tokenizers convert to roughly <span style="background-color:rgba(0,186,124,0.4); padding:2px 4px; border-radius:3px;">⁴⁄₅ of a word</span> (100 tokens ≈ 80 words)</div>
+ </div>
+ """, unsafe_allow_html=True)
+
+ # Section 3: tokeniser-py Efficiency
+ st.markdown("<h3 style='color:#00ba7c; margin-top:20px;'><u>tokeniser-py</u> efficiency</h3>", unsafe_allow_html=True)
+ st.markdown("<p>Our tokenizer demonstrates different characteristics:</p>", unsafe_allow_html=True)
+
+ # Efficiency points with enhanced styling
+ st.markdown("""
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>Average token size of <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px;">~2.52 characters**</span> across all token types</div>
+ </div>
+
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>For alphanumeric tokens only: <span style="background-color:rgba(0,186,124,0.4); padding:2px 4px; border-radius:3px;">~3.97 characters per token</span></div>
+ </div>
+
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>This translates to approximately <span style="background-color:rgba(0,186,124,0.4); padding:2px 4px; border-radius:3px;">⁹⁄₁₀ of a word</span> (100 tokens ≈ 90 words)</div>
+ </div>
+ """, unsafe_allow_html=True)
+
+ # Section 4: Real-world Comparison with completely redesigned styling
+ st.markdown("""
+ <div style="background-color:#262b38; padding:20px; border-radius:5px; margin:25px 0;">
+     <h3 style="color:#00ba7c; margin-top:0px; margin-bottom:15px; font-size:1.3em;">Real-world Comparison</h3>
+     <p style="margin-bottom:15px;">We tested a 28-page blog post across different tokenizers:</p>
+     <div class="comparison-item">
+         <div class="comparison-icon">1</div>
+         <div class="comparison-text">
+             <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px; font-weight:500;">GPT-4o/GPT-4:</span>
+             <span style="font-size:1.1em; margin-left:8px;">~10.4k tokens</span>
+         </div>
+     </div>
+     <div class="comparison-item">
+         <div class="comparison-icon">2</div>
+         <div class="comparison-text">
+             <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px; font-weight:500;">GPT-3:</span>
+             <span style="font-size:1.1em; margin-left:8px;">~12.1k tokens</span>
+         </div>
+     </div>
+     <div class="comparison-item">
+         <div class="comparison-icon">3</div>
+         <div class="comparison-text">
+             <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px; font-weight:500;">tokeniser-py:</span>
+             <span style="font-size:1.1em; margin-left:8px;">~18.8k tokens</span>
+             <span style="color:#aaa;">(including ~8.4k space tokens and ~2.6k other special-char based tokens)</span>
+         </div>
+     </div>
+     <div class="comparison-item">
+         <div class="comparison-icon">4</div>
+         <div class="comparison-text">
+             <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px; font-weight:500;">tokeniser-py (alphanumeric only):</span>
+             <span style="font-size:1.1em; margin-left:8px;">~7.8k tokens</span>
+         </div>
+     </div>
+     <div class="comparison-item">
+         <div class="comparison-icon">5</div>
+         <div class="comparison-text">
+             <span style="background-color:rgba(0,186,124,0.15); padding:2px 4px; border-radius:3px; font-weight:500;">GPT-4/GPT-4o (alphanumeric):</span>
+             <span style="font-size:1.1em; margin-left:8px;">~8k tokens</span>
+         </div>
+     </div>
+ </div>
+ """, unsafe_allow_html=True)
+
+ # Note box with enhanced styling
+ st.markdown("""
+ <div style="background-color:rgba(255,204,0,0.1); border-left:3px solid rgba(255,204,0,0.7); padding:15px; margin:20px 0; border-radius:0 5px 5px 0;">
+     <div style="font-size:18px; font-weight:bold; margin-bottom:12px; color:#ffcc00;">Note:</div>
+     <p style="line-height:2.2;"><span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.2); color:#ffcc00;">•</span>
+     <span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">**2.52 characters</span> is the average (adjusted frequency)-weighted token size, i.e. we weight each token's size by its true occurrences, obtained after adjusting its observed occurrences by the occurrences of its super-tokens.<br>
+     <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+     <span>A super-token of a token, say '<span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">e</span>', is any token which contains '<span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">e</span>' (like '<span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">ear</span>', '<span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">ears</span>', '<span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">years</span>', etc.). When weighting by token length we find that smaller tokens get an unduly high weightage, because their occurrences inside super-tokens are counted as well.
+     To adjust for this we hierarchically subtract a token's occurrences within its super-tokens to get a true frequency.</span><br>
+     <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+     <span>Un-adjusted frequency weighting gives an average size of <span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">~2.2 characters</span> per token, and a raw (un-weighted) average results in <span style="background-color:rgba(255,204,0,0.15); padding:2px 4px; border-radius:3px;">~4.6-4.7 chars</span> per token.</span><br>
+     <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+     <span>Our tokenization strategy separates non-underscore special characters from alphanumeric tokens.</span><br>
+     <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+     <span>We define alphanumeric tokens as any word that doesn't contain special characters (except underscores).</span><br>
+     <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+     <span>For OpenAI's tokens, we considered any token containing at least one alphanumeric character (excluding underscores) as an alphanumeric token.</span><br>
+     <span class="bullet-point-icon" style="background-color:rgba(255,204,0,0.15); color:#ffcc00;">•</span>
+     <span>This difference stems from the different special-character handling methodologies of the two tokenisers.</span></p>
+ </div>
+ """, unsafe_allow_html=True)
+
+ # Section 5: Design Philosophy with enhanced styling
+ st.markdown("<h3 style='color:#00ba7c; margin-top:20px;'>Design Philosophy</h3>", unsafe_allow_html=True)
+ st.markdown("<p>Our approach prioritizes semantic representation over token count minimization:</p>", unsafe_allow_html=True)
+
+ # Philosophy points with enhanced styling
+ st.markdown("""
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>We consciously separate special characters from alphanumeric tokens</div>
+ </div>
+
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>This provides more available alphanumeric tokens in the vocabulary</div>
+ </div>
+
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>While this may increase total token count, it improves semantic representation</div>
+ </div>
+
+ <div class="bullet-point">
+     <div class="bullet-point-icon">•</div>
+     <div>Our design philosophy favors representation quality over token count minimization</div>
+ </div>
+ """, unsafe_allow_html=True)
+
+ # Footer link
+ st.markdown("""
+ <p style="margin-top:20px;">
+     Need a programmatic interface for tokenizing text? Check out our
+     <a href="https://pypi.org/project/tokeniser-py/">tokeniser-py</a> package for Python.
+ </p>
+ """, unsafe_allow_html=True)
+
+ # Footer with additional information
+ st.markdown("---")
+ st.markdown("""<h2 style='color:#00ba7c; margin-top:0px;'>About tokeniser-py</h2>
+
+ A high-performance, fully custom tokeniser built from scratch — no BPE, no existing NLP tokenisation scheme.
+ This tokeniser is based on a unique algorithm developed independently and trained on over 1 billion tokens
+ from the SlimPajama dataset (Val + Test), providing an efficient, interpretable, and extendable tokenisation pipeline.
+
+ <div class="library-feature">
+     <div class="feature-dot">•</div>
+     <div class="feature-text"><strong>Tokeniser built on a vocabulary of 131,072 tokens</strong></div>
+ </div>
+
+ <div class="library-feature">
+     <div class="feature-dot">•</div>
+     <div class="feature-text"><strong>Two versions of vocab:</strong> <code>0.5B</code> (Validation-only data) and <code>1B</code> (Validation + Test data)</div>
+ </div>
+
+ <div class="library-feature">
+     <div class="feature-dot">•</div>
+     <div class="feature-text"><strong>Token vocab built via a custom algorithm</strong> — no Byte Pair Encoding (BPE)</div>
+ </div>
+
+ <div class="library-feature">
+     <div class="feature-dot">•</div>
+     <div class="feature-text"><strong>Lightweight JSON format</strong> for token maps & token count maps</div>
+ </div>
+
+ <div class="library-feature">
+     <div class="feature-dot">•</div>
+     <div class="feature-text"><strong>Ready for integration</strong> into any LLM pre-tokenisation pipeline</div>
+ </div>
+
+ [GitHub Repository](https://github.com/Tasmay-Tibrewal/tokeniser-py) | [PyPI Package](https://pypi.org/project/tokeniser-py/)
+ """, unsafe_allow_html=True)
+
+ # Add explanation of the library in expandable section
+ with st.expander("Learn more about tokeniser-py"):
+     st.markdown("""
+ ### 🚀 What This Library Offers
+
+ - Tokeniser built on a vocabulary of **131,072 tokens**
+ - Two versions of vocab:
+     - `0.5B`: Validation-only data
+     - `1B`: Validation + Test data
+ - Token vocab built via a **custom algorithm** — no Byte Pair Encoding (BPE)
+ - Tokenisation logic includes:
+     - Token lookup from pre-generated token map
+     - Dynamic programming-based segmentation for out-of-vocab tokens
+     - One-hot encoding (NumPy or PyTorch)
+     - Visualisation utilities for tokens and token IDs
+ - Lightweight JSON format for token maps & token count maps
+ - Ready for integration into any LLM pre-tokenisation pipeline
+     """)
+
+ # Add custom CSS
+ st.markdown("""
+ <style>
+     div.stCodeBlock {
+         background-color: #1a1c24 !important;
+         border-radius: 10px;
+         padding-left: 25px;
+         padding-top: 15px;
+         padding-bottom: 15px;
+     }
+     pre.language-python {
+         background-color: #1a1c24 !important;
+         border-radius: 10px;
+     }
+     .code-header {
+         font-size: 1.5em;
+         font-weight: bold;
+         margin-top: 0em;
+         margin-bottom: 0.5em;
+         display: flex;
+         align-items: center;
+     }
+     .code-block {
+         background-color: #1a1c24;
+         border-radius: 5px;
+         padding: 1em;
+         margin-bottom: 1em;
+         font-family: 'Courier New', monospace;
+         white-space: pre;
+         color: #d4d4d4;
+         overflow-x: auto;
+         line-height: 1.5;
+     }
+     .keyword { color: #c586c0; }
+     .string { color: #CE9178; }
+     .function { color: #4ec9b0; }
+     .parenthesis { color: #ffd700; }
+     .var { color: #8cdcfe; }
+ </style>
+ """, unsafe_allow_html=True)
+
+ # Code header and block with simpler HTML
+ st.markdown("""
+ <div class="code-header">🛠️ Usage</div>
+ <pre class="code-block"><span class="keyword">from</span> <span class="function">tokeniser</span> <span class="keyword">import</span> <span class="function">Tokeniser</span><br>
+ <span class="var">t</span> = <span class="function">Tokeniser</span><span class="parenthesis">()</span><br>
+ <span class="var">tokens</span>, <span class="var">count</span> = <span class="var">t</span>.<span class="function">tokenise</span><span class="parenthesis">(</span><span class="string">"Your input text here."</span><span class="parenthesis">)</span><br>
+ <span class="var">token_ids</span> = <span class="var">t</span>.<span class="function">token_ids</span><span class="parenthesis">(</span><span class="var">tokens</span><span class="parenthesis">)</span></pre>
+ """, unsafe_allow_html=True)
+
+ st.markdown("""
+ Use `t.one_hot_tokens(token_ids)` for NumPy-based one-hot encoding, or `op='torch'` for PyTorch.
+
+ ### 📁 Vocab Files
+
+ - `ordered_tokenizer_1b_val_test_data.json` — Ordered tokens (1B data)
+ - `unordered_tokenizer_1b_val_test_data.json` — Unordered tokens (1B)
+ - `count_tokenizer_1b_val_test_data.json` — Token counts (1B)
+ - Similar structure for 0.5B val-only version
+ """)
requirements.txt ADDED
@@ -0,0 +1,3 @@
+ streamlit>=1.27.0
+ pandas>=1.5.0
+ tokeniser-py