Benchmarks performed on an A100-40GB, batch size 1, FP16.

| Metric | Original 0105 | Webe Tori Model 0105 Patched |
|--------|---------------|------------------------------|
| | 42.3 | 44.1 |
| TruthfulQA | 51.7 | 54.2 |
| GSM8K (Math reasoning) | 23.1 | 27.6 |
| Multilingual NER (F1) | 68.4 | 81.3 |
| Inference Time (100 tokens) | 2.1s | 1.6s |
| Hallucination Rate | 12.4% | 6.8% |
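
To make the size of each difference easier to compare, the absolute and relative changes can be derived from the figures reported above. The following is a minimal sketch, not part of the original benchmark tooling: the `results` dictionary and the `relative_change` helper are illustrative names, and the first table row is left out because its metric name is missing in the source.

```python
# Scores copied from the comparison table: (original, patched).
# The first table row is omitted because its metric name is missing.
results = {
    "TruthfulQA": (51.7, 54.2),
    "GSM8K (Math reasoning)": (23.1, 27.6),
    "Multilingual NER (F1)": (68.4, 81.3),
    "Inference Time (s per 100 tokens)": (2.1, 1.6),
    "Hallucination Rate (%)": (12.4, 6.8),
}


def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


for metric, (before, after) in results.items():
    absolute = after - before
    relative = relative_change(before, after)
    print(f"{metric}: {absolute:+.1f} absolute, {relative:+.1f}% relative")
```

For inference time and hallucination rate, lower is better, so a negative change is an improvement. For example, the hallucination rate falls by 5.6 percentage points, which is roughly a 45% relative reduction.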

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.config.attention_dropout = 0.0
    model = model.to("cuda")

    # Inference example
    input_text = "Explain the concept of a 'patched model' in AI."
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        repetition_penalty=1.1,
    )

| Model | Size | MMLU | Speed (tok/s) |
|-------|------|------|---------------|
| TinyLlama 1.1B | 1.1B | 43.5 | 85 |
| | 1.2B | 44.1 | 92 |
| Phi-2 | 2.7B | 56.0 | 68 |
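
The source does not describe how the speed figures were measured. As a rough illustration of how a tokens-per-second number could be obtained, here is a minimal sketch using only the standard library. Both `tokens_per_second` and `fake_generate` are hypothetical names introduced for this example; `fake_generate` is a stand-in that sleeps instead of running a model.

```python
import time
from statistics import mean
from typing import Callable


def tokens_per_second(
    generate_fn: Callable[[], int], warmup: int = 1, runs: int = 5
) -> float:
    """Average tokens per second over `runs` timed calls.

    `generate_fn` is a hypothetical zero-argument callable that performs
    one generation and returns the number of new tokens it produced.
    """
    for _ in range(warmup):
        generate_fn()  # untimed warm-up call

    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return mean(rates)


def fake_generate() -> int:
    """Stand-in for a real generation call: sleeps briefly, reports 100 tokens."""
    time.sleep(0.01)
    return 100


rate = tokens_per_second(fake_generate, warmup=1, runs=3)
print(f"{rate:.0f} tok/s")
```

With a real model, `generate_fn` would wrap the `model.generate(...)` call and return the count of newly generated tokens. On a GPU, the device should be synchronized (for example with `torch.cuda.synchronize()`) before each clock reading, since CUDA kernels run asynchronously and the timing would otherwise be too optimistic.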