
Questions tagged [huggingface-tokenizers]

Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers

0 votes
0 answers
13 views

BPE tokenizer add_tokens overlap with trained tokens

I am training a BPE from scratch. I want the vocabulary to include certain tokens that might or might not exist in the training dataset. from datasets import load_dataset from tokenizers import models,...
meliksahturker
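A runnable sketch of the situation in this question, using the `tokenizers` library (the toy corpus and token names here are invented): train a BPE model, then call `add_tokens()` — tokens already learned during training are skipped, and only genuinely new ones are appended to the vocabulary.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# train a small BPE tokenizer from scratch on a toy corpus
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(["low lower lowest"] * 100, trainer=trainer)

# add_tokens() skips entries that already exist in the trained vocabulary
# and returns how many tokens were actually appended
added = tokenizer.add_tokens(["low", "brand_new_token"])
print("brand_new_token" in tokenizer.get_vocab())  # True
```

Here "low" is already in the trained vocabulary, so only "brand_new_token" is added and `added` is 1.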
0 votes
0 answers
25 views

special_tokens parameter of SentencePieceBPETokenizer.train_from_iterator()

I want to train a custom tokenizer from scratch. Some online tutorials suggest passing a series of special tokens to the train_from_iterator() function: special_tokens = ["<unk>...
Raptor • 53.6k
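A minimal sketch of what that parameter does (the corpus is made up): tokens passed via `special_tokens` to `train_from_iterator()` are inserted into the vocabulary first, in the order given, so they receive the lowest ids and are never split during encoding.

```python
from tokenizers import SentencePieceBPETokenizer

special_tokens = ["<unk>", "<s>", "</s>", "<pad>"]
tokenizer = SentencePieceBPETokenizer()  # default unk_token is "<unk>"

# special_tokens are added to the vocabulary before anything learned
# from the data, so they occupy ids 0..3 here
tokenizer.train_from_iterator(
    ["hello world", "hello there"] * 100,
    vocab_size=150,
    special_tokens=special_tokens,
)
print(tokenizer.token_to_id("<unk>"))  # 0
```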
0 votes
0 answers
7 views

How to get a custom-trained BERT tokenizer not to split certain characters

I am training my own tokenizer based on bert-base-cased. The problem I have is that in my data (a dead language) there are tokens that begin with =, and the = should not be split off from the rest of ...
bulbul • 80
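One possible workaround, sketched with a locally trained WordPiece tokenizer (the words below are invented stand-ins for the dead-language forms): register the `=`-prefixed forms as added tokens. BERT's pre-tokenizer splits on punctuation, but added tokens are matched on the text before pre-tokenization, so they survive intact.

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train_from_iterator(
    ["=ma qibma =la summa"] * 50,
    vocab_size=200,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# the BERT pre-tokenizer would normally split "=ma" into "=", "##ma";
# added tokens are matched before pre-tokenization and stay whole
tokenizer.add_tokens(["=ma", "=la"])
print(tokenizer.encode("=ma qibma").tokens)
```

The downside of this approach is that every `=`-prefixed form must be listed explicitly; for an open-ended set, a custom pre-tokenizer would be needed instead.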
-1 votes
0 answers
13 views

MBART-50 appears incompatible with Pipeline

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast article_en = "When you have a medical appointment, your health provider writes notes on the visit that are available ...
LearnToGrow • 1,720
0 votes
2 answers
136 views

How do I increase max_new_tokens

I'm facing this error while running my code: ValueError: Input length of input_ids is 1495, but max_length is set to 20. This can lead to unexpected behavior. You should consider increasing ...
khuzi yunus
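The error comes from the default max_length=20, which counts the prompt tokens as well; passing max_new_tokens instead bounds only the generated continuation. A runnable sketch using a tiny randomly initialised GPT-2 as a stand-in for the real checkpoint (all sizes here are arbitrary):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# tiny randomly initialised model; no checkpoint download needed
config = GPT2Config(vocab_size=100, n_positions=64, n_embd=32,
                    n_layer=2, n_head=2, bos_token_id=None, eos_token_id=None)
model = GPT2LMHeadModel(config)
model.eval()

input_ids = torch.randint(0, 100, (1, 10))
# max_length limits prompt + output together; max_new_tokens limits only
# the newly generated part, so long prompts no longer trigger the error
out = model.generate(input_ids, max_new_tokens=5, pad_token_id=0)
print(out.shape[1] - input_ids.shape[1])  # 5
```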
1 vote
0 answers
113 views

How to fine-tune merlinite 7B model in Python

I am new to LLM programming in Python and I am trying to fine-tune the instructlab/merlinite-7b-lab model on my Mac M1. My goal is to teach this model about a new music composer, Xenobi Amilen. I have ...
Salvatore D'angelo
0 votes
1 answer
422 views

How to set eos_token_id in llama3 in HuggingFaceLLM?

I want to set my eos_token_id and pad_token_id. I googled a lot, and most answers suggest using e.g. tokenizer.pad_token_id (like here: https://huggingface.co/meta-llama/Meta-Llama-3-8B/discussions/...
yts61 • 1,509
1 vote
0 answers
122 views

Getting "Some non-default generation parameters are set in the model config" from HuggingFace's transformers

I need help with a problem. The model works, but I want to fix this warning so that everything is correct. The purpose of this code is to save the model before training. It saves, but this ...
Lucas Dias Noronha
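This warning usually means sampling parameters (temperature, do_sample, ...) were set on model.config rather than on model.generation_config. A hedged sketch with a tiny locally built GPT-2 (standing in for the real model) showing where to put them before save_pretrained:

```python
import os
import tempfile

from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(vocab_size=50, n_embd=32,
                                   n_layer=2, n_head=2))

# set generation parameters on generation_config, not on model.config;
# values left on model.config are what trigger the warning on save
model.generation_config.do_sample = True
model.generation_config.temperature = 0.7

out_dir = tempfile.mkdtemp()
model.save_pretrained(out_dir)
print(os.path.exists(os.path.join(out_dir, "generation_config.json")))  # True
```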
0 votes
0 answers
82 views

How to Deploy a Hugging Face Transformers Model for Inference Using KServe (without KServe v0.13)?

I'm working on deploying a pre-trained Hugging Face Transformers model for inference using KServe, but my Kubernetes environment does not support KServe v0.13. I've researched the topic and found ...
Reehan • 1
0 votes
0 answers
20 views

Can I wrap a PyTorch model into ONNX together with tokenizers?

This is something that worked trivially in TensorFlow, but in PyTorch strings are not natively supported at all. I tried manually adding mapping nodes to the resulting ONNX model, but I'm getting all kinds of ...
rudolfovic • 3,206
1 vote
1 answer
111 views

How do we add/modify the normalizer in a pretrained Huggingface tokenizer?

Given a Huggingface tokenizer that already has a normalizer, e.g. "mistralai/Mistral-7B-v0.1", we can do this to modify the normalizer: import json from transformers import AutoTokenizer ...
alvas • 120k
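On a `tokenizers.Tokenizer` object the normalizer is a plain assignable attribute, so it can be swapped out after the fact. A sketch with a locally trained BPE tokenizer (the pretrained Mistral case should work the same way through `tokenizer.backend_tokenizer.normalizer` on the fast transformers tokenizer, though that is not shown here):

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=120, special_tokens=["<unk>"])
tok.train_from_iterator(["hello world"] * 100, trainer=trainer)

# replace the normalizer: input is now NFKC-normalized and lowercased
# before the BPE model ever sees it
tok.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]
)
print(tok.encode("HELLO world").tokens)
```

Note that changing the normalizer after training can make previously reachable tokens unreachable if the training data was not normalized the same way.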
0 votes
0 answers
14 views

Can I increase tiktoken throughput?

I'm trying to speed up processing when using tiktoken. Is there a default limit on how documents are processed with tiktoken, or can I somehow change the thread settings? Would ...
Ben • 324
0 votes
1 answer
76 views

Seq2SeqTrainer produces incorrect EvalPrediction after switching to another tokenizer

I'm using Seq2SeqTrainer to train my model with a custom tokenizer. The base model is BART Chinese (fnlp/bart-base-chinese). If the original tokenizer of BART Chinese is used, the output is normal. ...
Raptor • 53.6k
1 vote
1 answer
33 views

Reordering GPT2Tokenizer tokens by frequency leads to unrecognized tokens

I am trying to create a new tokenizer by reordering the token ids in my existing tokenizer based on frequency. In theory, the order of token ids has no effect on performance or usability, but it ...
Cade Harger
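For GPT-2 specifically the real vocab/merges files would be needed, but the underlying point can be sketched with a locally trained BPE tokenizer (the corpus is invented): BPE merges reference token strings, not ids, so a pure permutation of the vocabulary's ids leaves the produced token strings unchanged. Problems tend to appear only if the merges are reordered or the id remapping is inconsistent elsewhere (e.g. in the model's embedding matrix).

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# train a small BPE tokenizer on a toy corpus
tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=60, special_tokens=["<unk>"])
tok.train_from_iterator(["hug hugs hugger"] * 50, trainer=trainer)

# dump the merges via the model so we can rebuild with permuted ids
d = tempfile.mkdtemp()
tok.model.save(d)
merges = []
with open(os.path.join(d, "merges.txt")) as f:
    for line in f:
        if line.startswith("#") or not line.strip():
            continue  # skip the version header and blank lines
        a, b = line.split()
        merges.append((a, b))

# reverse the id order: same token strings, different ids
vocab = tok.get_vocab()
n = len(vocab)
reordered = {t: n - 1 - i for t, i in vocab.items()}
tok2 = Tokenizer(models.BPE(vocab=reordered, merges=merges,
                            unk_token="<unk>"))
tok2.pre_tokenizer = pre_tokenizers.Whitespace()

enc1, enc2 = tok.encode("hugs hugger"), tok2.encode("hugs hugger")
print(enc1.tokens == enc2.tokens)  # True: merges depend on strings, not ids
```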
0 votes
0 answers
30 views

Mac: Unable to install pyzmq and tokenizers in virtual environment

I can install them in the global environment, but I keep getting error messages when trying to install them in a virtual environment (I'm a Mac user). Part of the message is: × Building wheel ...
Jessica Li
