# MediumWord-559k
Ever heard of TinyWord? Yeah, this is a scaled-up version. But whether you can call it medium or not is a coin toss.
MediumWord is a 559,000-parameter word generator trained on 753,000 words. It shows a significant quality boost over TinyWord-134k while staying relatively cheap.
## Architecture
MediumWord uses a scaled-down version of the Qwen3 architecture.
| Parameter | Value |
|---|---|
| Hidden Layers | 3 |
| Hidden Size | 96 |
| Attention Heads | 1 |
| KV Heads | 1 |
| Intermediate Size | 384 |
| RoPE Theta | 1000.0 |
| Max Position Embeddings | 32 |
| Tie Word Embeddings | True |
| Vocab Size | 1200 |
Note: 1 attention head and a RoPE Theta of 1000 (vs Qwen3's 1,000,000) are intentional reductions for this scale. Max sequence length is 32, so positional generalization at range isn't a concern.
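As a sanity check on the name, the table above roughly pins down the parameter count. A minimal back-of-the-envelope sketch, assuming standard Qwen3-style blocks (tied embeddings, per-head Q/K RMSNorms, a gated MLP, and head_dim = hidden_size / num_heads, none of which are stated explicitly above):

```python
# Back-of-the-envelope parameter count for the config in the table above.
# Assumes Qwen3-style blocks: tied embeddings, RMSNorms, gated (SwiGLU) MLP,
# and head_dim = hidden_size // num_attention_heads = 96.
hidden, layers, inter, vocab = 96, 3, 384, 1200

embedding = vocab * hidden       # tied with the LM head, so counted once
attention = 4 * hidden * hidden  # Q, K, V, O projections
qk_norms  = 2 * hidden           # Qwen3's Q/K RMSNorms (head_dim each)
mlp       = 3 * hidden * inter   # gate, up, down projections
norms     = 2 * hidden           # pre-attention + pre-MLP RMSNorms

per_layer = attention + qk_norms + mlp + norms
total = embedding + layers * per_layer + hidden  # + final RMSNorm

print(total)  # 558816, i.e. ~559k, matching the model name
```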
## Training
MediumWord was trained on 753,232 unique words, totaling 3,225,398 tokens and 7,022,310 characters. ~660k of those words are English and ~90k are Spanish. Talk about multilingual, right? Way better than GPT-5. (P.S. That's not true, but in our dreams, maybe.)
### Dataset
| Key | Value |
|---|---|
| Entries (words) | 753,232 |
| Tokens | 3,225,398 |
| Characters | 7,022,310 |
| Avg. Tokens Per Entry | ~4.3 |
| Avg. Words Per Entry | 1 |
| Avg. Chars Per Entry | ~9.3 |
| Longest Entry (Tokens) | 36 |
| Shortest Entry (Tokens) | 1 |
| English Words | ~660k |
| Spanish Words | ~90k |
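The per-entry averages follow directly from the totals; a quick arithmetic check, using the counts from the table above:

```python
# Totals from the dataset table
entries = 753_232
tokens = 3_225_398
chars = 7_022_310

print(round(tokens / entries, 2))  # 4.28 tokens per entry
print(round(chars / entries, 2))   # 9.32 characters per entry
```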
### Hardware
MediumWord was trained on one NVIDIA RTX 2060 GPU for 1.5 epochs with a batch size of 8.
### Training Results
| Epoch | Train Loss | Val Loss | Train PPL | Val PPL |
|---|---|---|---|---|
| 0.03 | 5.3714 | 4.3303 | 215.74 | 75.87 |
| 0.13 | 2.7478 | 2.5754 | 15.61 | 13.14 |
| 0.23 | 2.2428 | 2.1622 | 9.42 | 8.69 |
| 0.32 | 2.0692 | 1.9979 | 7.92 | 7.37 |
| 0.42 | 1.9682 | 1.8948 | 7.16 | 6.65 |
| 0.52 | 1.8981 | 1.8302 | 6.67 | 6.23 |
| 0.62 | 1.8256 | 1.7769 | 6.21 | 5.91 |
| 0.71 | 1.7900 | 1.7332 | 5.99 | 5.66 |
| 0.81 | 1.7589 | 1.7009 | 5.81 | 5.48 |
| 0.91 | 1.7254 | 1.6700 | 5.62 | 5.31 |
| 1.01 | 1.6840 | 1.6368 | 5.39 | 5.14 |
| 1.10 | 1.6417 | 1.6174 | 5.17 | 5.04 |
| 1.20 | 1.6421 | 1.6058 | 5.17 | 4.98 |
| 1.30 | 1.5954 | 1.5755 | 4.93 | 4.83 |
| 1.40 | 1.5970 | 1.5704 | 4.94 | 4.81 |
| 1.49 | 1.5787 | 1.5458 | 4.85 | 4.69 |
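The perplexity columns are just the exponential of the loss, assuming the usual natural-log cross-entropy. Checking the final row:

```python
import math

# Final row of the training table (epoch 1.49)
train_loss, val_loss = 1.5787, 1.5458

print(round(math.exp(train_loss), 2))  # 4.85, matches Train PPL
print(round(math.exp(val_loss), 2))    # 4.69, matches Val PPL
```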
## Generations
| Prompt | Output |
|---|---|
| mo | moed |
| app | appurist |
| c | ers |
| b | oro |
| z | ed |
| tho | es |
| tho | et |
| ye | et |
| b | rum |
| b | ed |
| b | urry |
As you can see, the model generates both real words and plausible-looking words. For example, *burry* (prompt `b` + output `urry`) is a real word, and *appurist* follows a valid English agentive suffix pattern (*-ist*). It wasn't trained to generate real words; it was trained to generate plausible words that reflect the morphology of English and Spanish.
## Limitations
- It does not generate sentences, prose, code, or anything besides a single word-like sequence.
- It cannot reason or produce complex language.
- Generated words may or may not be real. The goal isn't real word generation but reflecting the lexicon and morphology of the English and Spanish languages through tiny language models.
- Output is non-deterministic. The same prompt can produce very different completions across runs.
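The non-determinism comes from sampling: logits are softened by temperature, restricted to the top-k / top-p nucleus, and then a token is drawn at random from what remains. A minimal pure-Python illustration of that filtering step (this is a sketch, not the transformers implementation, and the logit values below are made up):

```python
import math

def nucleus(logits, temperature=1.2, top_k=50, top_p=0.95):
    """Return the renormalized distribution that sampling would draw from."""
    # Temperature-scaled softmax (higher temperature flattens the distribution)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Top-k: keep only the k most likely tokens
    order = sorted(range(len(probs)), key=lambda i: -probs[i])[:top_k]
    # Top-p: smallest prefix of those whose cumulative mass reaches top_p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the surviving tokens
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# With a sharp distribution and a small nucleus, only one token survives,
# so sampling is effectively greedy; a larger top_p keeps more candidates.
print(nucleus([2.0, 1.0, 0.0, -1.0], top_p=0.5))  # {0: 1.0}
```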
## Inference
The script below loads the model and tokenizer, then samples a single word from a prompt.

```python
# =============================================================================
# Inference
# =============================================================================
MODEL_DIR = "Harley-ml/MediumWord-559k"       # local path or Hub repo id
TOKENIZER_PATH = "Harley-ml/MediumWord-559k"  # tokenizer.json, local dir, or Hub repo id

# --- Generation settings ---
PROMPT = "b"
MAX_NEW_TOKENS = 32
TEMPERATURE = 1.2
TOP_P = 0.95
TOP_K = 50
REPETITION_PENALTY = 1.1
DO_SAMPLE = True
# =============================================================================

from pathlib import Path

import torch
from transformers import (
    AddedToken,
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedTokenizerFast,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------
device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer (mirrors training setup)
# ---------------------------------------------------------------------------
def load_tokenizer(path: str):
    p = Path(path)
    if p.is_file():
        # A bare tokenizer.json saved during training
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p))
    else:
        # A local directory or a Hub repo id
        tok = AutoTokenizer.from_pretrained(path)
    specials = {}
    if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
    if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
    if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
    if tok.pad_token is None:
        if tok.eos_token is not None:
            tok.pad_token = tok.eos_token
        else:
            specials["pad_token"] = AddedToken("<|pad|>", special=True)
    if specials:
        tok.add_special_tokens(specials)
    tok.padding_side = "left"  # left-pad for batched generation
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {tokenizer.vocab_size}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)
total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------
def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt
    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this
    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k
    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)
    # Strip the prompt tokens so we only return what was generated
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)

# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)
    output = generate(PROMPT)
    print("Generated:")
    print(output)
```