---
license: cc-by-4.0
language:
- en
- de
- fr
- pl
- ru
- it
- pt
- cs
- nl
- es
- fi
- tr
- hu
- bg
- uk
- bs
- hr
- da
- et
- lt
- ro
- sk
- sl
- sv
- 'no'
- lv
- sr
- sq
- mk
- is
- mt
- ga
datasets:
- HPLT/HPLT2.0_cleaned
- HPLT/hplt_monolingual_v1_2
- HuggingFaceFW/fineweb-2
- allenai/MADLAD-400
- uonlp/CulturaX
- bigcode/the-stack
- common-pile/arxiv_papers
- HuggingFaceFW/finepdfs
library_name: transformers
---

**Developed by:** [Tilde.ai](https://tilde.ai/tildeopen-llm/)

**Paper:** [https://arxiv.org/abs/2603.08182](https://arxiv.org/abs/2603.08182)

**Funded by:** European Commission via [EuroHPC JU Large AI Grand Challenge](https://www.eurohpc-ju.europa.eu/winners-announced-large-ai-grand-challenge-2024-06-26_en)

**Model type:** A 30B parameter dense decoder-only transformer

**Languages:** Albanian, Bosnian, Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Hungarian, Icelandic, Irish, Italian, Latgalian, Latvian, Lithuanian, Macedonian, Maltese, Montenegrin, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Spanish, Swedish, Turkish, Ukrainian, as well as mathematical proofs, programming code and XML documents containing translation data

**License:** CC-BY-4.0

## Info

This is the long-context version of the [TildeOpen 30B](https://arxiv.org/abs/2603.08182) foundational model, featuring context extension from 8k to 64k tokens using [YaRN](https://arxiv.org/abs/2309.00071). We used a mixture of TildeOpen 30B base training data, HuggingFace's [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset and synthetic data based on [KNOT](https://arxiv.org/abs/2409.04774v1).

Due to YaRN, this model is primarily intended for use with ```transformers >= 5``` or vLLM. We provide auto patches for use with older ```transformers``` versions, explained in this [differences section](#differences-from-huggingface-llama-model-implementation-for-transformers-5).
For more detailed background information, please refer to the original model repository: [https://huggingface.co/TildeAI/TildeOpen-30b](https://huggingface.co/TildeAI/TildeOpen-30b).

This model remains a base model and **has not been instruction tuned**, nor human aligned.

## Model Hyper-Parameters

| Parameter | Value |
|-----------|-------|
| Sequence Length | 65536 |
| Number of Layers | 60 |
| Embedding Size | 6144 |
| FFN Hidden Size | 21504 |
| Number of Heads | 48 |
| Number of KV Heads (GQA) | 8 |
| Activation Function | SwiGLU |
| Position Encodings | YaRN |
| Embedding Parameters | 8.05E+08 |
| LM Head Parameters | 8.05E+08 |
| Non-embedding Parameters | 2.91E+10 |
| Total Parameters | 3.07E+10 |

We use the following YaRN configuration for RoPE scaling:

| Parameter | Value |
|-----------|-------|
| Attention Factor | 1 |
| Beta Fast | 32 |
| Beta Slow | 1 |
| Scaling Factor | 10 |
| Original Max. Position Embeddings | 8192 |
| Rope Theta | 200000 |

We follow [DeepSeek-V3](https://arxiv.org/pdf/2412.19437) and slightly overscale the YaRN embeddings to 10x rather than the 8x implied by the 8k to 64k extension; related PyTorch warnings can be ignored.

## Running model using HF ```transformers >= 5```

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer + model
tokenizer = AutoTokenizer.from_pretrained("TildeAI/TildeOpen-30b-64k")
model = AutoModelForCausalLM.from_pretrained(
    "TildeAI/TildeOpen-30b-64k",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Input prompt (placeholder; replace with your own text)
user_in = "The capital of Latvia is"

# Tokenize
inputs = tokenizer(user_in, return_tensors="pt").to(model.device)

# Generate (greedy, deterministic)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    repetition_penalty=1.2,
    do_sample=False,
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Running model using (old) HF ```transformers < 5```

**NOTE**: The provided YaRN patch was written specifically for **transformers==4.46.3**. It likely supports other versions, but this has not been thoroughly tested.
We suggest avoiding patches and using ```transformers >= 5``` or vLLM.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer + model
tokenizer = AutoTokenizer.from_pretrained("TildeAI/TildeOpen-30b-64k")
model = AutoModelForCausalLM.from_pretrained(
    "TildeAI/TildeOpen-30b-64k",
    trust_remote_code=True,  # important, this applies patches
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Input prompt (placeholder; replace with your own text)
user_in = "The capital of Latvia is"

# Tokenize
inputs = tokenizer(
    user_in,
    return_tensors="pt",
    return_token_type_ids=False,  # sometimes needed for older transformers
).to(model.device)

# Generate (greedy, deterministic)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    repetition_penalty=1.2,
    do_sample=False,
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

# Evaluation

## Needle-in-a-Haystack (NIAH) Evaluation for context extension

![Classic Needle Illustration]()

Smaller "Needle Position" values indicate placement closer to the beginning of the document; larger values indicate placement closer to the end.

### Experiment Setup

For each context length and needle insertion depth:

1. Text of the target context length was selected from Paul Graham’s essays.
2. The sentence `The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.` was inserted between two sentences around the target depth in the document.
3. The sentence `The best thing to do in San Francisco is` was appended.
4. The model output was considered correct if the model generated `eat a sandwich and sit in Dolores Park on a sunny day` within the next 15 tokens.

### Alternative NIAH Experiment

Because the classic needle test has been public for a long time, there is a risk that the model has overfit to this exact sentence. To double-check the model's capabilities, we also ran the same experiment with the string triplet `The secret key is 954834` (inserted sentence), `The secret key is` (appended prompt) and `954834` (expected answer), using 5 different samples of text.
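The classic setup described above can be sketched as follows. This is a minimal illustration, not the evaluation code that was actually used: the helper names `build_niah_prompt` and `is_correct` are hypothetical, the haystack is assumed to be already split into sentences and trimmed to the target length, and limiting generation to 15 new tokens is left to the caller.

```python
# Needle, appended query, and expected answer, as listed in the experiment setup.
NEEDLE = (
    "The best thing to do in San Francisco is eat a sandwich "
    "and sit in Dolores Park on a sunny day."
)
QUERY = "The best thing to do in San Francisco is"
ANSWER = "eat a sandwich and sit in Dolores Park on a sunny day"


def build_niah_prompt(haystack_sentences, depth, needle=NEEDLE, query=QUERY):
    """Insert the needle between two sentences near `depth` and append the query.

    `depth` is a fraction in [0, 1]: 0 places the needle at the start of the
    document, 1 at the end. (Hypothetical helper, for illustration only.)
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(haystack_sentences))
    sentences = haystack_sentences[:index] + [needle] + haystack_sentences[index:]
    return " ".join(sentences + [query])


def is_correct(generated_text, answer=ANSWER):
    """Return True if the expected answer appears in the generated continuation."""
    return answer in generated_text


# Toy haystack standing in for text sampled from the essays.
toy_sentences = [f"Sentence {i}." for i in range(10)]
prompt = build_niah_prompt(toy_sentences, depth=0.5)
```

The returned prompt would then be passed to `model.generate` with `max_new_tokens=15`, and the decoded continuation checked with `is_correct`.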
![Secret Key Needle Illustration]()

## Belebele Benchmark: Reading Comprehension

**What is Belebele Benchmark?** [Belebele](https://aclanthology.org/anthology-files/anthology-files/pdf/acl/2024.acl-long.44.pdf) is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks.

**Why does this Matter?** Belebele tests an LLM's ability to answer questions based on a given text, a standard use case in retrieval-augmented generation workflows.

**What did we do?** We used the standard implementation of the [belebele](https://github.com/eleutherai/lm-evaluation-harness/tree/main/lm_eval/tasks/belebele) task from the LLM Evaluation Harness. We report **5-shot** accuracy.

| 5-shot | **Gemma 2 27b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 30b 64k** |
|----------|:-------------:|:----------:|:------------:|:-------------------:|
| Bulgarian | 79.8% | 78.8% | **85.3%** | 84.1% |
| Czech | 81.4% | 78.3% | 85.3% | **85.4%** |
| German | 81.2% | 80.6% | **85.0%** | **85.0%** |
| English | **88.9%** | 83.0% | 87.6% | 87.7% |
| Estonian | 72.1% | 73.7% | 82.0% | **82.1%** |
| Finnish | 79.0% | 78.1% | **84.3%** | 84.0% |
| French | 82.6% | 80.1% | **85.7%** | 83.8% |
| Hungarian | 77.9% | 76.2% | **83.3%** | 82.4% |
| Icelandic | 70.8% | 58.2% | 54.3% | **82.8%** |
| Italian | 82.1% | 77.8% | 81.0% | **84.0%** |
| Lithuanian | 76.1% | 76.1% | **85.2%** | 84.1% |
| Latvian | 78.4% | 77.7% | 84.6% | **84.8%** |
| Dutch | 80.2% | 78.9% | 83.2% | **85.4%** |
| Polish | 78.3% | 77.9% | 82.2% | **82.4%** |
| Portuguese | 83.8% | 80.1% | **86.1%** | 85.6% |
| Romanian | 80.3% | 78.8% | **85.3%** | 83.1% |
| Russian | 79.4% | 79.4% | 84.2% | **84.3%** |
| Slovak | 78.9% | 78.0% | 84.1% | **85.0%** |
| Slovenian | 78.0% | 80.0% | **83.7%** | 83.6% |
| Spanish | 82.1% | 78.4% | 84.1% | **84.8%** |
| Serbian | 79.8% | 78.4% | 74.1% | **84.2%** |
| Swedish | 80.6% | 76.3% | **85.3%** | 83.6% |
| Turkish | 77.4% | 62.3% | 79.9% | **81.3%** |
| Ukrainian | 78.0% | 77.0% | **83.9%** | **83.9%** |
| **Average** | 79.5% | 76.8% | 82.5% | **84.1%** |

## MultiBLiMP Benchmark: Grammar Test

**What is MultiBLiMP?** [MultiBLiMP](https://arxiv.org/pdf/2504.02768) is a massively multilingual test of core grammar. It gives models pairs of almost-identical sentences, one grammatical and one ungrammatical, and asks whether the model assigns a higher probability to the correct one. Version 1.0 covers 101 languages.

**Why does this Matter?** MultiBLiMP tests a model's ability to distinguish correct from erroneous language. As with human writers, producing mostly correct language is not a notable achievement; what matters is making as few grammatical mistakes as possible.
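The minimal-pair scoring described above can be illustrated with a short sketch. It is not the LM Evaluation Harness implementation: `minimal_pair_accuracy` is a hypothetical helper, and the log-probabilities in `TOY_LOGPROBS` are made-up numbers standing in for the summed token log-probabilities a real model would assign.

```python
def minimal_pair_accuracy(pairs, sentence_logprob):
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical
    sentence receives the strictly higher log-probability.

    `sentence_logprob` is any callable mapping a sentence to a float score;
    in a real evaluation it would query the language model.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one sentence pair is required")
    correct = sum(
        1 for good, bad in pairs if sentence_logprob(good) > sentence_logprob(bad)
    )
    return correct / len(pairs)


# Made-up scores for illustration only (assumption, not model output).
TOY_LOGPROBS = {
    "The dogs bark.": -12.0,
    "The dogs barks.": -15.5,
    "She walks home.": -11.0,
    "She walk home.": -10.5,
}
toy_pairs = [
    ("The dogs bark.", "The dogs barks."),
    ("She walks home.", "She walk home."),
]

accuracy = minimal_pair_accuracy(toy_pairs, TOY_LOGPROBS.__getitem__)
```

With these toy scores the first pair is ranked correctly and the second is not, so `accuracy` is 0.5.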
**What did we do?** We used the standard implementation of the [MultiBLiMP](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/multiblimp) task from the LLM Evaluation Harness. We report **0-shot** accuracy.

| Language | **Gemma 2 27b** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 30b 64k** |
|----------|-------------|----------|---------------------|-------------|
| Bulgarian | 95.4% | 98.8% | 97.7% | **99.7%** |
| Czech | 98.6% | **98.9%** | 98.5% | 98.3% |
| German | 98.8% | 98.7% | 98.0% | **99.4%** |
| English | 98.4% | 98.7% | 98.7% | **99.2%** |
| Estonian | 92.0% | 95.6% | 95.8% | **98.5%** |
| Finnish | 93.0% | 96.3% | 95.2% | **98.4%** |
| French | 98.2% | 98.8% | 98.7% | **99.2%** |
| Serbo-Croatian | 94.6% | 98.5% | 96.4% | **99.6%** |
| Hungarian | 95.9% | 98.8% | 97.8% | **99.8%** |
| Icelandic | 88.5% | 80.3% | 74.4% | **99.0%** |
| Italian | 96.0% | 96.7% | 96.6% | **98.3%** |
| Latvian | 91.6% | 95.2% | 96.9% | **99.0%** |
| Lithuanian | 95.3% | 99.0% | 99.0% | **99.4%** |
| Dutch | 94.0% | 96.6% | 96.5% | **99.0%** |
| Polish | 97.0% | 97.5% | 97.6% | **99.3%** |
| Portuguese | 96.1% | 97.6% | 97.1% | **98.3%** |
| Romanian | 97.7% | 98.9% | 98.5% | **99.3%** |
| Russian | 94.7% | 96.6% | 97.3% | **99.3%** |
| Slovak | 97.7% | 98.8% | 97.7% | **99.5%** |
| Slovenian | 99.0% | **100.0%** | **100.0%** | 98.5% |
| Spanish | 95.6% | 98.0% | 97.3% | **98.5%** |
| Swedish | 95.8% | 85.1% | 93.8% | **100.0%** |
| Turkish | 97.6% | **98.7%** | 97.9% | 96.6% |
| Ukrainian | 95.6% | 98.0% | 97.3% | **99.0%** |
| **Average** | 95.7% | 96.7% | 96.4% | **99.0%** |

## Knowledge tests

### ARC Benchmark Results

**What is ARC?** [ARC](https://arxiv.org/pdf/1803.05457), the AI2 Reasoning Challenge, is a multiple-choice science question benchmark **in English**, derived from U.S. grade-school standardized exams. It has two subsets (ARC Easy and ARC Challenge) designed to test factual knowledge and common sense.
**Why does this Matter?** ARC probes a model’s ability to answer non-trivial questions by applying world knowledge. Although the answer can sometimes be inferred from the question, in the classic lm-evaluation-harness ARC implementation the answer choices for each question are **not** provided during inference, which places the emphasis on world knowledge rather than on the model's reasoning capabilities.

**What did we do?** We use multilingual translations of ARC provided by [Eurolingua](https://huggingface.co/datasets/Eurolingua/arcx); please refer to the [publication](https://arxiv.org/pdf/2410.08928). Other than the data source, we replicate the standard [LM Evaluation Harness configuration for ARC](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/arc). Our exact configuration is available at [TBA]. We report **5-shot** accuracy.

| 5-shot | | **ARC Easy**| | | **ARC Challenge**| |
|----------|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| **Language** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 30b 64k** | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 30b 64k** |
| Danish | 79.9% | **80.1%** | 79.8% | 53.4% | 52.6% | **54.2%** |
| German | 79.6% | **79.9%** | 77.7% | 53.4% | **53.6%** | 51.9% |
| Spanish | **82.9%** | 81.7% | 78.8% | **57.3%** | 56.1% | 52.4% |
| French | **81.7%** | 81.1% | 79.2% | **56.0%** | 54.5% | 52.6% |
| Italian | 80.5% | **81.6%** | 78.4% | **56.4%** | 54.8% | 53.3% |
| Dutch | **80.1%** | 80.0% | 78.3% | **54.0%** | 53.8% | 53.0% |
| Portuguese | **81.7%** | 81.1% | 79.1% | **56.9%** | 55.5% | 53.9% |
| Swedish | 80.3% | **80.5%** | 79.0% | 53.8% | 53.1% | **54.7%** |
| **AVG WEST** | **80.8%** | **80.8%** | 78.8% | **55.2%** | 54.2% | 53.2% |
| | | | | | | |
| Bulgarian | **79.8%** | 79.2% | **79.8%** | **53.8%** | 51.8% | 53.5% |
| Czech | **79.5%** | **79.5%** | 79.3% | 51.5% | 52.3% | **54.3%** |
| Estonian | 72.4% | 73.0% | **73.4%** | 49.6% | 49.8% | **51.3%** |
| Finnish | 73.8% | **74.2%** | 73.8% | 48.7% | 51.1% | **52.1%** |
| Hungarian | 74.0% | 73.9% | **74.2%** | 49.3% | 49.0% | **50.6%** |
| Lithuanian | 76.4% | 76.1% | **78.2%** | 50.3% | 51.6% | **52.6%** |
| Latvian | 76.2% | **76.4%** | 75.6% | 50.7% | 49.8% | **51.0%** |
| Polish | **79.2%** | 78.2% | 78.2% | **54.5%** | 53.3% | 53.0% |
| Romanian | **79.6%** | 78.8% | 79.5% | **55.5%** | 53.7% | 53.8% |
| Slovak | 78.8% | **79.2%** | **79.2%** | 52.5% | 53.0% | **53.9%** |
| Slovenian | 78.3% | 78.5% | **78.5%** | 53.4% | 52.2% | **53.8%** |
| **AVG EAST** | 77.1% | 77.0% | **77.2%** | 51.8% | 51.6% | **53.0%** |

### MMLU Benchmark Results

**What is MMLU?** [MMLU](https://arxiv.org/pdf/2009.03300) is a massive multitask test consisting of multiple-choice questions from various branches of knowledge, **in English**. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability. Questions are four-option multiple choice and assess factual knowledge, reading comprehension, and reasoning across disciplines. The questions can be grouped under four topics (`stem`, `humanities`, `social_sciences` and `other`), allowing for individual evaluation of each group.

**Why does this Matter?** Similarly to ARC, MMLU measures broad, general-purpose factual knowledge and some reasoning capabilities. The possible answer choices are included in the prompt, which allows the model to use reasoning to discard false answers rather than relying only on knowing the correct one. It should be noted that some question groups are specific to the Anglocentric world, e.g. US history or law.

**What did we do?** We use multilingual translations of MMLU provided by [Eurolingua](https://huggingface.co/datasets/Eurolingua/mmlux); please refer to the [publication](https://arxiv.org/pdf/2410.08928). Other than the data source, we replicate the standard [LM Evaluation Harness configuration for MMLU](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu/default). Our configuration is available at [TODO]. We report **0-shot** accuracy.

| 0-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 30b 64k** |
|----------|:-----------------:|:---------------------:|:-------------------:|
| Bulgarian | 48.3% | 52.0% | **56.5%** |
| Czech | 49.1% | 51.7% | **55.9%** |
| Danish | 50.2% | 51.1% | **56.9%** |
| German | 51.0% | 51.8% | **55.5%** |
| Spanish | 53.3% | 53.4% | **57.1%** |
| Estonian | 48.7% | 49.2% | **55.5%** |
| Finnish | 47.4% | 48.9% | **55.4%** |
| French | 53.1% | 53.8% | **56.9%** |
| Hungarian | 49.9% | 44.4% | **55.5%** |
| Italian | 52.3% | 53.7% | **57.3%** |
| Lithuanian | 47.3% | 49.4% | **55.2%** |
| Latvian | 46.9% | 48.0% | **54.2%** |
| Dutch | 50.8% | 53.0% | **56.7%** |
| Polish | 50.6% | 49.6% | **56.0%** |
| Portuguese | 52.4% | 53.7% | **57.2%** |
| Romanian | 51.0% | 52.1% | **56.4%** |
| Slovak | 49.0% | 52.2% | **56.3%** |
| Slovenian | 48.2% | 50.7% | **56.0%** |
| Swedish | 49.6% | 51.2% | **56.3%** |
| **Average** | 50.0% | 51.1% | **56.1%** |

### National Exams Results

**What are National Exams?** A curated suite of **multilingual**, publicly available past questions from national-level standardized exams across multiple countries (e.g., high-school exit and university-entrance exams). Please refer to the [publication](https://aclanthology.org/2020.emnlp-main.438.pdf). The dataset is available on HuggingFace [here](https://huggingface.co/datasets/mhardalov/exams). Items are presented in multiple-choice format.

**Why does this Matter?** Similarly to MMLU, the model is tested on factual knowledge and reasoning capabilities. However, it should be stressed that the benchmark is **unique** for each language (the exams are different) and is available in the **source language** (i.e. not translated). This places emphasis on the model's regional knowledge and eliminates the translation noise that is present in many other multilingual benchmarks. Possible answer choices are once again included during inference, allowing the model to use reasoning if factual knowledge is lacking.
**What did we do?** [TODO]

| 5-shot | **ALIA 40b** | **EuroLLM Prev. 22b** | **TildeOpen 30b 64k** |
|----------|----------|-------------------|-------------------|
| Bulgarian | 62.4% | **66.8%** | 66.4% |
| Croatian | 70.8% | **72.5%** | 71.9% |
| Hungarian | 48.9% | **51.9%** | 47.6% |
| Italian | 65.5% | 64.6% | **65.9%** |
| Macedonian | 74.2% | 72.0% | **80.5%** |
| Polish | 61.2% | **61.4%** | **61.4%** |
| Portuguese | **61.4%** | 60.9% | 56.5% |
| Albanian | 55.6% | 55.0% | **72.7%** |
| Serbian | 64.7% | 57.3% | **65.3%** |
| **Average** | 62.7% | 62.5% | **65.4%** |

# Differences from Huggingface LLaMa model implementation for transformers <5

The patch is primarily contained in ```llama_yarn_patch_4x.py``` and ```configuration_llama_patch_4x.py```. We provide wrappers for enabling the AutoModel API. Running the patched code with ```transformers >= 5``` is safe: the patch files are ignored and the base implementation is used.

Main differences:

- A new rotary embedding class, `NeoXRotaryEmbeddings`, was written.
  - Supports **YaRN**, implemented by analogy with vLLM’s YaRN approach and the [YaRN paper](https://arxiv.org/abs/2309.00071).
  - Designed to match the rotary embedding behaviour used during pretraining and context extension.
- The attention implementation was modified to more closely match the attention behaviour used during training.

Running the code requires `flash-attn >= 2.0.6, < 3.0`.

The model can still be run with the original Hugging Face LLaMa code. However, we found that with YaRN this can lead to vastly different logits.

**NOTE**: If you are using transformers >= 5 or vLLM, this section does not apply and can safely be ignored.
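For readers who want to see what a YaRN rotary embedding computes, here is a rough sketch of the inverse-frequency blending from the YaRN paper, using the RoPE scaling values listed in the hyper-parameter section. It is an illustration under stated assumptions, not the code in `llama_yarn_patch_4x.py`: the function name `yarn_inv_freq` is hypothetical, and the head dimension of 128 is assumed from 6144 / 48 (embedding size divided by number of heads).

```python
import math


def yarn_inv_freq(
    head_dim=128,            # assumption: embedding size 6144 / 48 heads
    rope_theta=200000.0,
    factor=10.0,
    original_max_pos=8192,
    beta_fast=32.0,
    beta_slow=1.0,
):
    """Blend extrapolated (unscaled) and interpolated (scaled) RoPE frequencies."""

    def correction_dim(num_rotations):
        # Dimension index whose frequency completes `num_rotations` full
        # rotations over the original context window.
        return (
            head_dim
            * math.log(original_max_pos / (num_rotations * 2 * math.pi))
            / (2 * math.log(rope_theta))
        )

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), head_dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    inv_freq = []
    for i in range(head_dim // 2):
        extrapolated = rope_theta ** (-2 * i / head_dim)  # original RoPE frequency
        interpolated = extrapolated / factor              # position-interpolated frequency
        # Linear ramp: 0 keeps the original frequency, 1 uses the scaled one.
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        inv_freq.append(interpolated * ramp + extrapolated * (1.0 - ramp))
    return inv_freq


inv_freq = yarn_inv_freq()
```

High-frequency dimensions keep their original rotation speed, low-frequency dimensions are slowed down by the scaling factor, and the dimensions in between are linearly blended. Since the attention factor in the configuration is 1, this sketch applies no additional rescaling to the cosine and sine tables.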