Questions
Browse questions with relevant NLP tags
44 questions
Error while converting google flan T5 model to onnx
Use https://huggingface.co/datasets/bakks/flan-t5-onnx instead. And to convert the google/flan-t5, see https://huggingface.co/datasets/bakks/flan-t5-onnx/blob/main/exportt5.py from pathlib import ...
Why does my fine-tuned T5-Base model for a sequence-to-sequence task produce short, incomplete generations?
Because of: labels = tokenizer(targets, max_length=32, padding="max_length", truncation=True) Most probably your model has learned to generate outputs that are only ~32 tokens long. Try: ...
How to save the LLM2Vec model as a HuggingFace PreTrainedModel object?
Wrapping the LLM2Vec object around like in https://stackoverflow.com/a/74109727/610569 We can try this: import torch.nn as nn from transformers import PreTrainedModel, PretrainedConfig from ...
Mistral model generates the same embeddings for different input texts
You're not slicing the dimensions right at outputs.last_hidden_state[0, 0, :].numpy() Q: What is the 0th token in all inputs? A: The beginning-of-sentence (BOS) token Q: So that's the "embeddings" ...
How to fine-tune a Mistral-7B model for machine translation?
The key is to re-format the data from a traditional machine translation dataset that splits the source and target text and piece them up together in a format that the model expects. For the Mistral 7B ...
What is the expected inputs to Mistral model's embedding layer?
Try the return_tensors='pt' argument, e.g. model.model.embed_tokens(tokenizer("Hello world", return_tensors='pt').input_ids)
Huggingface Tokenizer not adding the padding tokens
Depends on what you want to do with the padded tokens. Most probably, if you're just going to run inference or feed it to the Trainer object, then you won't need special arguments to get the batch size ...
What is the TREC 2006 Spam Track Public Corpora Format?
Disclaimer: Before reading the answer, please note that since I did not participate in the TREC06 task nor am I the data creator/provider, I can only make some educated guesses about the questions you have ...
How to calculate the weighted sum of last 4 hidden layers using Roberta?
First, let's do some digging in the OG BERT code, https://github.com/google-research/bert If we just do a quick search for "sum" on the GitHub repo, we find this https://github.com/google-...
BERT token vs. embedding
Inside BERT, as well as most other NLP deep learning models, the conversion from token IDs to vectors is done with an Embedding layer. For instance, in Pytorch it is the torch.nn.Embedding module. The ...
Error while loading a tagger (probably missing model file)
Almost certainly this means that the tagger model file is not present at runtime on the path specified. At runtime, does the path you give: libs/stanford-corenlp-4.5.4-models.jar work to access the ...
How to get Enhanced++ dependency labels with a java command line in the terminal?
You actually are getting enhanced++ dependency labels. However, it looks like you are looking for something else or an older version. UD was somewhat revised between UDv1 and UDv2. One of the changes ...
How to de-normalize text in Python?
I understand that you mean to generate all possible inflections of an English word. For that, you may use LemmInflect as follows: from lemminflect import getAllInflections > getInflection('watch', ...
How to concatenate a split word using NLP caused by tokenizers after machine translation?
Try a detokenizer, but because there are rules to process tokens that are expected to change x 's -> x's but not x ' s -> x's, you might have to apply the detokenizer iteratively, e.g. using ...
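The iterative application mentioned in the snippet above can be sketched with plain regexes. This is a toy, hypothetical rule set for illustration only, not the actual nltk/Moses detokenizer: one pass turns `x ' s` into `x 's`, and a further pass is needed to reach `x's`.

```python
import re

# Toy detokenization rules (hypothetical, for illustration):
# each rule merges one space, so reaching "x's" from "x ' s"
# requires applying the rules repeatedly until nothing changes.
RULES = [
    (re.compile(r"'\s+s\b"), "'s"),   # "x ' s" -> "x 's"
    (re.compile(r"\s+'s\b"), "'s"),   # "x 's"  -> "x's"
]

def detokenize(text: str) -> str:
    prev = None
    while prev != text:          # iterate until a fixed point
        prev = text
        for pattern, repl in RULES:
            text = pattern.sub(repl, text)
    return text

print(detokenize("the dog ' s bone"))  # the dog's bone
```

A real detokenizer carries many more such rules (quotes, currency, contractions), but the fixed-point loop is the key idea.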
Can I use 4-bit or 8-bit versions of transformers translation models?
It might not work for every model, but you can try 8-bit quantization with native pytorch, https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html like this: import gc import torch ...
Low score and wrong answer for Flan-T5-XXL "question-answering" task
Pre/Script: This is more of a science-experiment-design or product-development question than a programming question, so most probably someone will flag to close this question on Stack Overflow ...
How to skip tokenization and translation of custom glossary in huggingface NMT models?
Constraining beam search (or sampling from a generative model) is difficult because even when you know what string you want to have in the target sentence, you do not know at what position it should appear. ...
Detecting adding/removal from string difference between texts
If you like some git-diff like functions, you can try: from difflib import unified_diff s1 = "(a) The provisions of this article apply to machinery of class 6." s2 = "(a) The ...
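The git-diff-like approach from the snippet above can be sketched end to end with `difflib.unified_diff` from the standard library. The two example sentences are taken from the snippet; splitting on whitespace to get a word-level diff is an assumption for illustration:

```python
from difflib import unified_diff

s1 = "(a) The provisions of this article apply to machinery of class 6."
s2 = "(a) The provisions apply to machinery of class 6 and class 7."

# unified_diff compares sequences; splitting on whitespace gives a
# word-level diff where "-" marks removals and "+" marks additions
diff_lines = list(unified_diff(s1.split(), s2.split(), lineterm=""))
for line in diff_lines:
    print(line)
```

Removed words such as `this` and `article` appear with a `-` prefix, and added words such as `and` with a `+` prefix, mirroring `git diff` output.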
How to highlight the differences between two strings in Python?
Use difflib to get the matching blocks: from difflib import SequenceMatcher s1 = "I'm enjoying the summer breeze on the beach while I do some pilates." s2 = "I'm enjoying the summer ...
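The matching-blocks idea from the snippet above can be sketched with `SequenceMatcher.get_opcodes()`, which labels equal and differing spans; the example strings are taken from the snippet:

```python
from difflib import SequenceMatcher

s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the summer breeze at the park while I do some yoga."

matcher = SequenceMatcher(None, s1, s2)
opcodes = matcher.get_opcodes()
# each opcode is (tag, i1, i2, j1, j2); 'equal' spans are the matching
# blocks, everything else is a difference you could highlight
for tag, i1, i2, j1, j2 in opcodes:
    if tag != "equal":
        print(f"{tag}: {s1[i1:i2]!r} -> {s2[j1:j2]!r}")
```

To highlight differences in a UI, you would wrap the non-`equal` spans in markers (e.g. ANSI colors or HTML tags) instead of printing them.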
Backpropagation / minibatching in training large language models (LLMs)
There are mainly two training routines for most auto-regressive language models: Causal Language Model (given the previous words, predict the next word) Masked Language Model (given a fixed sequence space, predict ...
CUDA out of memory using trainer in huggingface during validation (training is fine)
First, ensure that you have the latest accelerate>=0.21.0 installed. pip install -U accelerate Then, try using auto_find_batch_size args=transformers.TrainingArguments( ...
Validation and Training Loss when using HuggingFace
In Short Depends on what you want to do with the evaluation function, knowing the internal workings of the evaluation might or might not be practical for you to train the model appropriately. Scroll ...
Some doubts about huggingface's BPE algorithm
The end of word marker </w> is part of the tokens during the creation of a vocabulary, not a token per se. Once the BPE vocabulary creation is finished, you normally invert the mark: you mark ...
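A toy sketch of how the `</w>` marker from the snippet above participates in BPE vocabulary creation: during pair counting it sits at the end of each word's symbol sequence and can merge with a preceding symbol like any other pair. The tiny corpus and the `pair_counts` helper are hypothetical, for illustration only:

```python
from collections import Counter

# Hypothetical word-frequency corpus: each word is a tuple of symbols
# ending with the end-of-word marker "</w>"
corpus = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2}

def pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            counts[(a, b)] += freq
    return counts

counts = pair_counts(corpus)
# the marker pairs up like any other symbol, e.g. ("w", "</w>")
print(counts.most_common(3))
```

Once the merges are learned and the vocabulary is fixed, the marker convention is typically inverted into a continuation mark on subword pieces, as the snippet describes.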
Why do we add |V| in the denominator in the Add-One smoothing for n-gram language models?
The |V| variable that we see in the denominator of the additive smoothing function is not actually a direct definition of the probabilistic estimation of the n-gram. It is derived from: First, we start ...
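A small sketch of why |V| ends up in the denominator: with add-one smoothing, P(w | h) = (count(h, w) + 1) / (count(h) + |V|), and the |V| term is exactly what keeps the probabilities over the whole vocabulary summing to 1. The toy corpus below is an assumption for illustration:

```python
from collections import Counter

# Toy corpus for a bigram model with add-one (Laplace) smoothing
tokens = "the cat sat on the mat".split()
vocab = set(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def p_add_one(h, w):
    # add 1 to every bigram count; add |V| to the history count so
    # the distribution over next words still normalizes
    return (bigrams[(h, w)] + 1) / (unigrams[h] + len(vocab))

# summing over every possible next word gives (almost exactly) 1.0
total = sum(p_add_one("the", w) for w in vocab)
print(round(total, 10))
```

Without the `len(vocab)` term in the denominator, adding 1 to each numerator would make the probabilities sum to more than 1.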
Freeze and unfreeze certain layers in TFDistilBertModel
First, we need to access the layers/params with its name so that we know what we want to freeze/unfreeze: from transformers import AutoModel model = AutoModel.from_pretrained('distilbert-base-uncased'...
Why is an nn.Embedding layer used for positional encoding in BERT?
nn.Embedding is just a table of vectors. Its inputs are indices into the table. Its outputs are the vectors associated with the indices from the input. Conceptually, it is equivalent to having one-hot ...
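The lookup-table view in the snippet above can be sketched in plain Python. The 4-row table is hypothetical; the point is that a lookup by index returns the same vector as multiplying a one-hot vector by the table:

```python
# A toy embedding "table": row i is the vector for index i,
# mirroring what nn.Embedding stores (hypothetical vocab 4, dim 3)
table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def embed(indices):
    # plain table lookup: index in, row out
    return [table[i] for i in indices]

def one_hot_matmul(index):
    # equivalent view: one-hot vector times the table
    one_hot = [1.0 if j == index else 0.0 for j in range(len(table))]
    return [sum(one_hot[j] * table[j][d] for j in range(len(table)))
            for d in range(len(table[0]))]

print(embed([2]))         # lookup for index 2
print(one_hot_matmul(2))  # same vector via one-hot multiplication
```

The lookup is just a cheaper implementation of the one-hot matrix product, which is why the table's rows are trainable parameters like any other weight matrix.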
How to use Pytorch Tokenize without punctuation and all lowercase?
You can apply a preprocessing step to add punctuation and proper casing to the text. For this, you may use Re-punctuate. from transformers import T5Tokenizer, TFT5ForConditionalGeneration tokenizer = ...
Why are Neural Networks Needed with Word Embeddings?
Because, in general, neural networks give better results than word embeddings alone. Many text classification problems can be addressed just by using word embeddings. However, word embeddings tend to present ...
RuntimeError when trying to extract text features from a BERT model then using KNN for classification
It seems that you are feeding ALL your data to the model at once and you don't have enough memory to do that. Instead of doing that, you can invoke the model sentence by sentence or with small ...
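The sentence-by-sentence / small-batch fix from the snippet above can be sketched as a plain chunking helper; `batched` is a hypothetical name, and in practice each chunk would be fed to the model in turn instead of printed:

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks so the model never sees
    all the data at once (a common fix for out-of-memory errors)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

sentences = [f"sentence {i}" for i in range(10)]
batches = list(batched(sentences, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

The extracted features from each chunk would then be concatenated before being handed to the KNN classifier.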
download hugging face llama2 model to local server
The error tells you that there is no space left on the storage drive (e.g. a hard drive partition). From the names of the variables, it could be the partition where the model is, or maybe the temporary ...
Getting connection refused error using openllm library of python
You should refer to the "Starting an LLM Server" section of the github project you linked to. For instance, to start a server with the OPT model, you would do as follows: openllm start opt ...
Bertbaseuncased install with spacy is not working
This is addressed in a discussion on the Spacy github repo. The explanation of the error is that en_trf_bertbaseuncased_lg is a Spacy 2.x model and you are using 3.x. Instead of said model, you can ...
OSError: Can't find model 'en_core_web_sm' in spaCy when running script in Python IDE
You can download the model at runtime to ensure it is present. For this, you can wrap the creation of the model in a function like this one: import spacy def create_spacy_model() -> spacy....
How to force falcon 40B to print in JSON format?
You can use the Guidance library. From the "Guaranteeing valid syntax JSON example" section of their readme: Large language models are great at generating useful outputs, but they are not ...
What is Stanford CoreNLP's recipe for tokenization?
Here are a few notes from one of the main authors of it. What you write in your answer is all basically correct, but there are many nuances. 😊 Yes, the CoreNLP tokenizer was written to follow the ...
Parsing city of origin / destination city from a string
TL;DR Pretty much impossible at first glance, unless you have access to some API that contains pretty sophisticated components. In Long At first look, it seems like you're asking to solve a ...
How can I use Stanford NLP commercially?
You can either use the software under the GPL license, or you can purchase a commercial license. For the latter, you can contact us at the support email address found here.
Resource u'tokenizers/punkt/english.pickle' not found
If you're looking to only download the punkt model: import nltk nltk.download('punkt') If you're unsure which data/model you need, you can install the popular datasets, models and taggers from NLTK: ...
How do I download NLTK data?
TL;DR To download a particular dataset/models, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use: $ python3 >>> import nltk >>>...
What exactly is an n Gram?
Usually a picture is worth a thousand words. Source: http://recognize-speech.com/language-model/n-gram-model/comparison
n-grams in python, four, five, six grams?
Great native Python-based answers given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library). There is an ...
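For reference, the sliding-window idea behind four-, five-, and six-grams can be sketched without nltk in a few lines of pure Python (a minimal illustration, not the nltk implementation):

```python
def ngrams(tokens, n):
    # slide a window of size n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a small sentence".split()
print(ngrams(tokens, 4))
# [('this', 'is', 'a', 'small'), ('is', 'a', 'small', 'sentence')]
```

The same function covers any n; `nltk.ngrams` additionally offers padding options for sequence boundaries.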
What are all possible POS tags of NLTK?
To save some folks some time, here is a list I extracted from a small corpus. I do not know if it is complete, but it should have most (if not all) of the help definitions from upenn_tagset... CC: ...
What is NLTK POS tagger asking me to download?
For NLTK versions higher than v3.2, please use: >>> import nltk >>> nltk.__version__ '3.2.1' >>> nltk.download('averaged_perceptron_tagger') [nltk_data] Downloading ...
Failed loading english.pickle with nltk.data.load
The main reason why you see that error is that nltk couldn't find the punkt package. Due to the size of the nltk suite, not all available packages are downloaded by default when one installs it. You can download ...