Tweety-Tatar-7B: A Tatar Large Language Model

Tweety Tatar / Base 7b / 2024-v1

Model description

This model is our trans-tokenized LLM for the Tatar language, converted from the Mistral-7B-Instruct-v0.2 model trained by MistralAI. Trans-tokenized LLMs are language models finetuned to produce output in a particular language, using a novel tokenizer native to that language.

  • Developed by: François Remy (UGent), Alfiya Khabibullina (BeCode), et al.
  • Funded by: IDLab / GPULab (UGent)
  • Model type: Foundation model using the Mistral architecture
  • Language(s) (NLP): Tatar
  • License: Apache 2.0

In-scope usage

This model can be used as-is for basic language modeling in Tatar, or finetuned for more complex tasks. It has not undergone instruction- or chat-based finetuning, so it works best in few-shot settings.
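Few-shot prompting here means showing the base model a handful of solved examples before the open query. The following is a minimal sketch, assuming a bulleted `x : y` layout like the one in the Word Analogies example of this card. The helper name `build_few_shot_prompt`, its parameters, and the English placeholder strings are hypothetical and not part of the released code.

```python
from typing import Iterable, Tuple


def build_few_shot_prompt(instruction: str,
                          examples: Iterable[Tuple[str, str]],
                          query: str) -> str:
    """Return a prompt with solved examples followed by an open query line."""
    lines = [instruction.rstrip()]
    for source, target in examples:
        # Each solved example occupies one bullet line: "* source : target"
        lines.append(f"* {source} : {target}")
    # The final bullet is left open so the model completes the target
    lines.append(f"* {query} :")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Complete the table:",  # placeholder; use a Tatar instruction in practice
    [("Paris", "France"), ("Rome", "Italy")],
    "Berlin",
)
print(prompt)
```

The resulting string can be passed to a text-generation pipeline; adding more solved examples usually gives a base model a clearer pattern to continue.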

Usage instructions

This model can be used like any other LLM in the Hugging Face transformers library:

import transformers

MODEL_NAME = "Tweeties/tweety-tatar-base-7b-2024-v1"
generate = transformers.pipeline("text-generation", model=MODEL_NAME)

Word Analogies

# Tatar instruction, roughly: "Fill in this table of analogies:"
ANALOGY_PROMPT = """Бу аналоглар таблицасын тутырыгыз:
* {x1} : {y1}
* {x2} :"""

def score_analogy(x1, y1, x2, y2):
    Y2_PROMPT = ANALOGY_PROMPT.replace('{x1}', x1).replace('{y1}', y1).replace('{x2}', x2)
    answer = generate(
        Y2_PROMPT, use_cache=True, do_sample=False, max_new_tokens=10, return_full_text=False,
        pad_token_id=generate.tokenizer.eos_token_id,
        # stop at the first newline ('<0x0A>') or end-of-sequence token
        eos_token_id=generate.tokenizer.convert_tokens_to_ids(['<0x0A>', '</s>']),
    )[0]['generated_text'].strip()
    return 1 if answer == y2 else 0

score_analogy('Мәскәү', 'Русия', 'Әнкара', 'Төркия') # 1 (Moscow : Russia, Ankara : Turkey)
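To turn the 0/1 scores into a benchmark figure, the per-item results can be averaged over a list of analogy quadruples. This is a minimal sketch: `analogy_accuracy` is a hypothetical helper, and the scorer is passed in as an argument so the snippet runs without loading the model. In practice one would pass `score_analogy` from above instead of the toy scorer.

```python
from typing import Callable, Sequence, Tuple

Quad = Tuple[str, str, str, str]  # (x1, y1, x2, y2)


def analogy_accuracy(quads: Sequence[Quad],
                     score_fn: Callable[[str, str, str, str], int]) -> float:
    """Average a 0/1 scorer over analogy quadruples; 0.0 for an empty list."""
    if not quads:
        return 0.0
    return sum(score_fn(*quad) for quad in quads) / len(quads)


# Stand-in scorer for illustration only: it looks answers up in a fixed table
# instead of querying the language model.
_TOY_ANSWERS = {"Ankara": "Turkey", "Paris": "France"}


def toy_score(x1: str, y1: str, x2: str, y2: str) -> int:
    return 1 if _TOY_ANSWERS.get(x2) == y2 else 0


quads = [
    ("Moscow", "Russia", "Ankara", "Turkey"),
    ("Moscow", "Russia", "Paris", "Spain"),
]
accuracy = analogy_accuracy(quads, toy_score)
print(accuracy)
```

Because the scorer uses exact string matching, any difference in spelling or casing between the generated answer and `y2` counts as a miss.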

Summarization


SUMMARIZE = "Түбәндәге текстка йомгак ясагыз:\n" # "Make a summary of the text below:"
LONG_TEXT = "\n\nОзын текст:\n" # "Long text:"
LONG_TEXT_DEMO = "Кеше организмы катлаулы организм, аның өчен кирәкле туклыклы матдәләрнең аерым баланс таләп итә. Кеше организмының туклану рационы нигездә пешекләнгән ризыклардан тора икән, аның организмы бу ысул белән туклануга җайлаша. Әмма, шул ук кеше кинәт чимал диетасына күчә икән, аның организмы әлеге үзгәрешне кабул итә алмый, бу мөмкин кадәр зыян китерергә мөмкин." # The human body is a complex organism that requires a specific balance of nutrients. If the human body's diet consists mainly of cooked foods, its body adapts to this type of nutrition. However, if the same person suddenly switches to a raw diet, his body cannot adapt to this change, which can be harmful. # The human body is a complex organism that requires a specific balance of nutrients to function optimally. When a person's diet consists primarily of cooked food, their body adapts to this way of eating. However, if that same person suddenly switches to a raw food diet, their body may not be able to handle the sudden change, leading to potential harm.
SHORT_TEXT = "\n\nКыска текст:\n" # "Short text:"
SHORT_TEXT_DEMO = "Әмма пешкән ризык ашауга гына күнгән организмга кинәт чи ризык белән туклануга күчүнең зарарлы нәтиҗәсе дә булырга мөмкин." # However, a body accustomed to eating only cooked food can have harmful consequences when suddenly switching to eating raw food.

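import torch

# Reuse the tokenizer and model already loaded by the pipeline above
tokenizer, model = generate.tokenizer, generate.model
cuda_device = model.device  # the device the pipeline placed the model on

def generate_tatar_summary(tatar_text_to_summarize: str) -> str:
    """Summarize a Tatar text using a 1-shot prompt and beam search."""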
    # craft the 1-shot example
    input_ids = torch.concat([
        tokenizer.encode(SUMMARIZE, return_tensors='pt'),
        tokenizer.encode(LONG_TEXT, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(LONG_TEXT_DEMO, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(SHORT_TEXT, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(SHORT_TEXT_DEMO, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode("\n\n", add_special_tokens=False, return_tensors='pt')
    ], axis=1)
    
    # craft the input
    input_ids = torch.concat([
        input_ids,
        tokenizer.encode(SUMMARIZE, return_tensors='pt'),
        tokenizer.encode(LONG_TEXT, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(tatar_text_to_summarize, add_special_tokens=False, return_tensors='pt'),
        tokenizer.encode(SHORT_TEXT, add_special_tokens=False, return_tensors='pt'),
    ], axis=1)

    # generate the output
    model_inputs = {'input_ids':input_ids.to(cuda_device)}
    model_outputs = model.generate(
        **model_inputs,
        max_new_tokens=80,
        num_beams=8,
        no_repeat_ngram_size=6,
        early_stopping=False,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.convert_tokens_to_ids(['<0x0A>','</s>']),
    )

    # decode the output
    return tokenizer.decode(model_outputs[0][input_ids.shape[1]:], skip_special_tokens=True).rstrip()

generate_tatar_summary("Зур шартлау (ингл. Big Bang) – Галәмнең башлангыч, сингуляр халәттә торган чорын тасвирлаучы космологик модель. Әле ХХ гасырда да без яшәгән Галәм статик структуралы, дигән фикер яшәгән. Ягъни, Галәмнең башы һәм ахыры юк, имеш, ул һәрвакыт булган һәм булачак. Бу фикер фән дөньясында бик озак, астрономия фәненең бөтен нигезләрен җимереп яңа теория барлыкка килгәнче яшәгән. Бу теориянең исеме – «Зур шартлау» теориясе.")
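The token-level concatenation above corresponds to a simple text layout: one solved demonstration, followed by the new text and an open short-text header. Below is a rough sketch of that layout as a plain string, which may help when inspecting or debugging the prompt. `build_summary_prompt` is a hypothetical helper, and the English headers are placeholders for the Tatar constants. It ignores tokenizer details such as special tokens.

```python
def build_summary_prompt(text: str,
                         demo_long: str,
                         demo_short: str,
                         instruction: str = "Summarize the text below:\n",
                         long_header: str = "\n\nLong text:\n",
                         short_header: str = "\n\nShort text:\n") -> str:
    """Mirror the segment order used by generate_tatar_summary, as one string."""
    # 1-shot demonstration: instruction, long text, then its reference summary
    demo = instruction + long_header + demo_long + short_header + demo_short + "\n\n"
    # Actual query: same layout, but the short text is left for the model
    query = instruction + long_header + text + short_header
    return demo + query


prompt = build_summary_prompt("NEW TEXT", "DEMO LONG", "DEMO SHORT")
print(prompt)
```

Passing `SUMMARIZE`, `LONG_TEXT`, and `SHORT_TEXT` as the three keyword arguments would reproduce the Tatar layout.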

Citation

If you use this model, please cite our work as:

@article{tweeties2024,
    title = {Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP},
    author = {François Remy and Pieter Delobelle and Hayastan Avetisyan and Alfiya Khabibullina and Miryam de Lhoneux and Thomas Demeester},
    url = {https://arxiv.org/abs/2408.04303},
    year = {2024},
    note = {Accepted at COLM 2024}
}