Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
510 questions
0 votes · 0 answers · 13 views
BPE tokenizer add_tokens overlap with trained tokens
I am training a BPE tokenizer from scratch. I want the vocabulary to include certain tokens that may or may not occur in the training dataset.
from datasets import load_dataset
from tokenizers import models,...
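A minimal sketch of one way to guarantee such tokens a vocabulary slot: pass them as `special_tokens` to the trainer, which reserves ids for them whether or not they occur in the data. The token names and corpus below are placeholders.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tokens passed as special_tokens to the trainer are guaranteed a vocab
# slot, whether or not they appear in the training data.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]", "[PAD]", "[CUSTOM]"],  # hypothetical tokens
)
corpus = ["some training text", "more training text"]  # stand-in corpus
tokenizer.train_from_iterator(corpus, trainer=trainer)

assert tokenizer.token_to_id("[CUSTOM]") is not None
```

Because special tokens are inserted before training, they cannot be split by later merges, which avoids most overlap issues with trained subwords.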
0 votes · 0 answers · 25 views
special_tokens parameter of SentencePieceBPETokenizer.train_from_iterator()
I want to train a custom tokenizer from scratch. Some online tutorials suggest passing a series of special tokens to the train_from_iterator() function:
special_tokens = ["<unk&...
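A minimal sketch of that pattern, assuming a typical SentencePiece-style special-token list and a stand-in corpus:

```python
from tokenizers import SentencePieceBPETokenizer

# special_tokens reserves ids for these strings even if they never
# occur in the corpus; the list below is illustrative.
special_tokens = ["<unk>", "<s>", "</s>", "<pad>"]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    ["an example sentence", "another example sentence"],  # stand-in corpus
    vocab_size=500,
    special_tokens=special_tokens,
)

assert all(tokenizer.token_to_id(t) is not None for t in special_tokens)
```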
0 votes · 0 answers · 7 views
How to get custom trained Bert tokenizer not to split certain characters
I am training my own tokenizer based on bert-base-cased. The problem I have is that in my data (a dead language), there are tokens that begin with =, and this should not be split off from the rest of ...
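One possible approach, sketched with the tokenizers library: BERT's default pre-tokenizer splits on punctuation such as "=", so swapping in a whitespace-only pre-tokenizer keeps "=" attached to its word. The toy corpus below is invented for illustration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# WhitespaceSplit splits on whitespace only, unlike BertPreTokenizer,
# which also splits on punctuation characters such as "=".
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
trainer = trainers.WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["=ma qa =ti", "=ma qa"], trainer=trainer)  # toy corpus

tokens = tokenizer.encode("=ma qa").tokens  # "=" stays attached to its word
```

Whether this is the right trade-off depends on how much of BERT's punctuation handling the rest of the data needs.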
-1 votes · 0 answers · 13 views
MBART-50 appears to be incompatible with Pipeline
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
article_en = "When you have a medical appointment, your health provider writes notes on the visit that are available ...
0 votes · 2 answers · 136 views
How do I increase max_new_tokens?
I'm facing this error while running my code:
ValueError: Input length of input_ids is 1495, but max_length is set to 20. This can lead to unexpected behavior. You should consider increasing ...
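The usual fix is to pass max_new_tokens to generate(), which bounds only the generated continuation rather than prompt plus continuation. A sketch, using gpt2 as a stand-in checkpoint (it assumes the model can be downloaded; substitute your own model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; replace with your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "When you have a medical appointment, your health provider writes notes"
inputs = tokenizer(prompt, return_tensors="pt")

# max_new_tokens bounds only the continuation, so a long prompt no
# longer collides with the default max_length of 20.
outputs = model.generate(
    **inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Setting max_length directly also works, but it must then exceed the prompt length, which is fragile for variable-length inputs.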
1 vote · 0 answers · 113 views
How to fine-tune merlinite 7B model in Python
I am new to LLM programming in Python and I am trying to fine-tune the instructlab/merlinite-7b-lab model on my Mac M1. My goal is to teach this model about a new music composer, Xenobi Amilen. I have ...
0 votes · 1 answer · 422 views
How to set eos_token_id in llama3 in HuggingFaceLLM?
I want to set my eos_token_id and pad_token_id. I have googled a lot, and most answers suggest using e.g. tokenizer.pad_token_id (as here: https://huggingface.co/meta-llama/Meta-Llama-3-8B/discussions/...
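A common pattern for models that ship without a pad token is to reuse the EOS token for padding. A sketch using gpt2 as a stand-in (the Llama 3 checkpoint is gated); the resulting ids can then be passed to generate(), or to a wrapper like llama-index's HuggingFaceLLM through its generation kwargs:

```python
from transformers import AutoTokenizer

# gpt2 stands in for the Llama 3 checkpoint; the pattern is the same
# for any model that has an EOS token but no pad token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as padding

print(tokenizer.eos_token_id, tokenizer.pad_token_id)
```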
1 vote · 0 answers · 122 views
With HuggingFace's transformers, I'm getting the message "Some non-default generation parameters are set in the model config"
I need help with a problem. The code works, but I want to resolve this warning so that everything is correct. The purpose of this code is to save the model before training. It saves, but this ...
0 votes · 0 answers · 82 views
How to Deploy a Hugging Face Transformers Model for Inference Using KServe (without KServe 0.13v)?
I'm working on deploying a pre-trained Hugging Face Transformers model for inference using KServe, but my Kubernetes environment does not support KServe 0.13v. I've researched the topic and found ...
0 votes · 0 answers · 20 views
Can I wrap a PyTorch model into ONNX together with tokenizers?
This worked trivially in TensorFlow, but PyTorch does not support strings natively at all. I tried adding mapping nodes manually to the resulting ONNX model, but I'm getting all kinds of ...
1 vote · 1 answer · 111 views
How do we add/modify the normalizer in a pretrained Huggingface tokenizer?
Given a Huggingface tokenizer that already has a normalizer, e.g. "mistralai/Mistral-7B-v0.1", we can do the following to modify the normalizer:
import json
from transformers import AutoTokenizer
...
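A minimal sketch of replacing a normalizer, using a locally built tokenizer in place of the pretrained one (Mistral-7B is gated); on a fast transformers tokenizer the same object is reachable as tokenizer.backend_tokenizer.normalizer:

```python
from tokenizers import Tokenizer, models, normalizers

# A locally built tokenizer stands in for a pretrained one.
tok = Tokenizer(models.BPE())

# Sequence composes normalizers; assigning it replaces whatever
# normalizer was set before.
tok.normalizer = normalizers.Sequence([
    normalizers.NFKC(),
    normalizers.Lowercase(),
])

assert tok.normalizer.normalize_str("Héllo World") == "héllo world"
```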
0 votes · 0 answers · 14 views
Can I increase tiktoken throughput?
Hello, I'm trying to speed up processing when using tiktoken. Is there a default limit on document-processing throughput in tiktoken, or can I somehow change the thread settings? Would ...
0 votes · 1 answer · 76 views
Seq2SeqTrainer produces incorrect EvalPrediction after changing another Tokenizer
I'm using Seq2SeqTrainer to train my model with a custom tokenizer. The base model is BART Chinese (fnlp/bart-base-chinese). If the original tokenizer of BART Chinese is used, the output is normal. ...
1 vote · 1 answer · 33 views
Reordering GPT2Tokenizer tokens by frequency leads to unrecognized tokens
I am trying to create a new tokenizer by reordering the token ids in my existing tokenizer based on frequency. In theory, the order of token ids has no effect on performance or usability, but it ...
0 votes · 0 answers · 30 views
Mac: Unable to install pyzmq and tokenizers in virtual environment
I can install them in the global environment, but I keep getting error messages when trying to install them in a virtual environment (I'm a Mac user).
Part of the messages are:
× Building wheel ...