Questions tagged [huggingface-tokenizers]
Use this tag for questions related to the tokenizers project from Hugging Face. GitHub: https://github.com/huggingface/tokenizers
510 questions
0 votes · 0 answers · 13 views
BPE tokenizer add_tokens overlap with trained tokens
I am training a BPE tokenizer from scratch. I want the vocabulary to include certain tokens that may or may not occur in the training dataset.
from datasets import load_dataset
from tokenizers import models,...
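A minimal sketch of one way to guarantee such tokens a vocabulary slot: pass them as `special_tokens` to the trainer, which reserves ids for them whether or not they occur in the data. The token names and corpus below are placeholders.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tokens passed as special_tokens to the trainer are guaranteed a vocab
# slot, whether or not they appear in the training data.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=1000,
    special_tokens=["[UNK]", "[PAD]", "[CUSTOM]"],  # hypothetical tokens
)
corpus = ["some training text", "more training text"]  # stand-in corpus
tokenizer.train_from_iterator(corpus, trainer=trainer)

assert tokenizer.token_to_id("[CUSTOM]") is not None
```

Because special tokens are inserted before training, they cannot be split by later merges, which avoids most overlap issues with trained subwords.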
0 votes · 0 answers · 25 views
special_tokens parameter of SentencePieceBPETokenizer.train_from_iterator()
I want to train a custom tokenizer from scratch. Some online tutorials suggest passing a series of special tokens to the train_from_iterator() function:
special_tokens = ["<unk&...
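A minimal sketch of that pattern, assuming a typical SentencePiece-style special-token list and a stand-in corpus:

```python
from tokenizers import SentencePieceBPETokenizer

# special_tokens reserves ids for these strings even if they never
# occur in the corpus; the list below is illustrative.
special_tokens = ["<unk>", "<s>", "</s>", "<pad>"]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    ["an example sentence", "another example sentence"],  # stand-in corpus
    vocab_size=500,
    special_tokens=special_tokens,
)

assert all(tokenizer.token_to_id(t) is not None for t in special_tokens)
```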
0 votes · 0 answers · 7 views
How to get custom trained Bert tokenizer not to split certain characters
I am training my own tokenizer based on bert-base-cased. The problem I have is that in my data (a dead language), there are tokens that begin with =, and this should not be split off from the rest of ...
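One possible approach, sketched with the tokenizers library: BERT's default pre-tokenizer splits on punctuation such as "=", so swapping in a whitespace-only pre-tokenizer keeps "=" attached to its word. The toy corpus below is invented for illustration.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# WhitespaceSplit splits on whitespace only, unlike BertPreTokenizer,
# which also splits on punctuation characters such as "=".
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
trainer = trainers.WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(["=ma qa =ti", "=ma qa"], trainer=trainer)  # toy corpus

tokens = tokenizer.encode("=ma qa").tokens  # "=" stays attached to its word
```

Whether this is the right trade-off depends on how much of BERT's punctuation handling the rest of the data needs.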
-1 votes · 0 answers · 13 views
MBART-50 appears to be incompatible with Pipeline
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
article_en = "When you have a medical appointment, your health provider writes notes on the visit that are available ...
0 votes · 2 answers · 136 views
How do I increase max_new_tokens?
I'm facing this error while running my code:
ValueError: Input length of input_ids is 1495, but max_length is set to 20. This can lead to unexpected behavior. You should consider increasing ...
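The usual fix is to pass max_new_tokens to generate(), which bounds only the generated continuation rather than prompt plus continuation. A sketch, using gpt2 as a stand-in checkpoint (it assumes the model can be downloaded; substitute your own model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; replace with your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "When you have a medical appointment, your health provider writes notes"
inputs = tokenizer(prompt, return_tensors="pt")

# max_new_tokens bounds only the continuation, so a long prompt no
# longer collides with the default max_length of 20.
outputs = model.generate(
    **inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Setting max_length directly also works, but it must then exceed the prompt length, which is fragile for variable-length inputs.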
1 vote · 0 answers · 113 views
How to fine-tune merlinite 7B model in Python
I am new to LLM programming in Python and I am trying to fine-tune the instructlab/merlinite-7b-lab model on my Mac M1. My goal is to teach this model about a new music composer, Xenobi Amilen. I have ...
0 votes · 1 answer · 422 views
How to set eos_token_id in llama3 in HuggingFaceLLM?
I want to set my eos_token_id and pad_token_id. I have googled a lot, and most answers suggest using e.g. tokenizer.pad_token_id (as here: https://huggingface.co/meta-llama/Meta-Llama-3-8B/discussions/...
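A common pattern for models that ship without a pad token is to reuse the EOS token for padding. A sketch using gpt2 as a stand-in (the Llama 3 checkpoint is gated); the resulting ids can then be passed to generate(), or to a wrapper like llama-index's HuggingFaceLLM through its generation kwargs:

```python
from transformers import AutoTokenizer

# gpt2 stands in for the Llama 3 checkpoint; the pattern is the same
# for any model that has an EOS token but no pad token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as padding

print(tokenizer.eos_token_id, tokenizer.pad_token_id)
```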
1 vote · 0 answers · 122 views
With HuggingFace's transformers, I'm getting the message "Some non-default generation parameters are set in the model config"
I need help with a problem. The code works, but I want to resolve this warning so that everything is correct. The purpose of this code is to save the model before training. It saves, but this ...
0 votes · 0 answers · 82 views
How to Deploy a Hugging Face Transformers Model for Inference Using KServe (without KServe 0.13v)?
I'm working on deploying a pre-trained Hugging Face Transformers model for inference using KServe, but my Kubernetes environment does not support KServe 0.13v. I've researched the topic and found ...
0 votes · 0 answers · 20 views
Can I wrap a PyTorch model into ONNX together with tokenizers?
This worked trivially in TensorFlow, but PyTorch does not support strings natively at all. I tried adding mapping nodes manually to the resulting ONNX model, but I'm getting all kinds of ...
1 vote · 1 answer · 111 views
How do we add/modify the normalizer in a pretrained Huggingface tokenizer?
Given a Huggingface tokenizer that already has a normalizer, e.g. "mistralai/Mistral-7B-v0.1", we can do the following to modify the normalizer:
import json
from transformers import AutoTokenizer
...
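A minimal sketch of replacing a normalizer, using a locally built tokenizer in place of the pretrained one (Mistral-7B is gated); on a fast transformers tokenizer the same object is reachable as tokenizer.backend_tokenizer.normalizer:

```python
from tokenizers import Tokenizer, models, normalizers

# A locally built tokenizer stands in for a pretrained one.
tok = Tokenizer(models.BPE())

# Sequence composes normalizers; assigning it replaces whatever
# normalizer was set before.
tok.normalizer = normalizers.Sequence([
    normalizers.NFKC(),
    normalizers.Lowercase(),
])

assert tok.normalizer.normalize_str("Héllo World") == "héllo world"
```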
0 votes · 0 answers · 14 views
Can I increase tiktoken throughput?
Hello, I'm trying to speed up processing when using tiktoken. Is there a default limit on document-processing throughput in tiktoken, or can I somehow change the thread settings? Would ...
0 votes · 1 answer · 76 views
Seq2SeqTrainer produces incorrect EvalPrediction after changing another Tokenizer
I'm using Seq2SeqTrainer to train my model with a custom tokenizer. The base model is BART Chinese (fnlp/bart-base-chinese). If the original tokenizer of BART Chinese is used, the output is normal. ...
1 vote · 1 answer · 33 views
Reordering GPT2Tokenizer tokens by frequency leads to unrecognized tokens
I am trying to create a new tokenizer by reordering the token ids in my existing tokenizer based on frequency. In theory, the order of token ids has no effect on performance or usability, but it ...
0 votes · 0 answers · 30 views
Mac: Unable to install pyzmq and tokenizers in virtual environment
I can install them in the global environment, but I keep getting error messages when trying to install them in a virtual environment (I'm a Mac user).
Part of the messages are:
× Building wheel ...