
NLP Collective

Questions

Browse questions with relevant NLP tags

44 questions

1 vote
1 answer
168 views

Error while converting Google Flan-T5 model to ONNX

I am looking to convert the Flan-T5 model downloaded from Hugging Face into ONNX format and run inference with it. My input data is the symptoms of a disease and the expected output is the disease name ...
Romi's user avatar
  • 253
Answer

Use https://huggingface.co/datasets/bakks/flan-t5-onnx instead. And to convert the google/flan-t5, see https://huggingface.co/datasets/bakks/flan-t5-onnx/blob/main/exportt5.py from pathlib import ...
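
For context, a minimal export sketch using Hugging Face Optimum (an assumption on my part; the linked exportt5.py may use a different toolchain) could look like this:

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_id = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly (recent Optimum versions)
ort_model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
ort_model.save_pretrained("flan-t5-small-onnx")

inputs = tokenizer("Symptoms: fever, cough, fatigue. Disease:", return_tensors="pt")
outputs = ort_model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))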

View answer
alvas's user avatar
  • 120k
1 vote
1 answer
93 views

Why does my fine-tuned T5-Base model for a sequence-to-sequence task produce short, incomplete generations?

I am trying to fine-tune a t5-base model for creating an appropriate question against a compliance item. Compliance items are paragraphs of text and my questions are in the past format of them. I have ...
Daremitsu's user avatar
  • 609
Answer

Because of: labels = tokenizer(targets, max_length=32, padding="max_length", truncation=True) Most probably your model has learnt to just output/generate outputs that are ~32 tokens. Try: ...
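
A minimal sketch of that fix, assuming targets and tokenizer are defined as in the question:

labels = tokenizer(
    targets,                 # same target strings as in the question
    max_length=256,          # large enough for the longest expected question
    padding="max_length",
    truncation=True,
)
# and allow longer outputs at generation time as well, e.g.
# model.generate(**inputs, max_new_tokens=256)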

View answer
alvas's user avatar
  • 120k
1 vote
1 answer
179 views

How to save the LLM2Vec model as a HuggingFace PreTrainedModel object?

Typically, we should be able to save a merged base + PEFT model, like this: import torch from transformers import AutoTokenizer, AutoModel, AutoConfig from peft import PeftModel # Loading base MNTP ...
alvas's user avatar
  • 120k
Answer

Wrapping the LLM2Vec object like in https://stackoverflow.com/a/74109727/610569, we can try this: import torch.nn as nn from transformers import PreTrainedModel, PretrainedConfig from ...

View answer
alvas's user avatar
  • 120k
3 votes
1 answer
529 views

Mistral model generates the same embeddings for different input texts

I am using a pre-trained LLM to generate a representative embedding for an input text. But it is weird that the output embeddings are all the same regardless of the input text. The code: from ...
Howie's user avatar
  • 101
Answer Accepted

You're not slicing the dimensions right at outputs.last_hidden_state[0, 0, :].numpy() Q: What is the 0th token in all inputs? A: Beginning of sentence token (BOS) Q: So that's the "embeddings ...
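
As a hedged illustration of one alternative pooling (not necessarily the answer's exact fix), mean-pooling the last hidden state over non-padding tokens avoids reading only the BOS position, which is identical for every input. Assumes tokenizer, model and text come from the question's code:

import torch

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state                     # (batch, seq_len, dim)
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, dim)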

View answer
alvas's user avatar
  • 120k
4 votes
1 answer
723 views

How to fine-tune a Mistral-7B model for machine translation?

There are a lot of tutorials online that use raw text affixed with arcane syntax to indicate document boundaries, accessed through the Hugging Face datasets.Dataset object via the text key. E.g. from ...
alvas's user avatar
  • 120k
Answer

The key is to re-format the data from a traditional machine translation dataset that splits the source and target text, and piece them together in a format that the model expects. For the Mistral 7B ...
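
A minimal formatting sketch, assuming the Mistral-7B-Instruct [INST] ... [/INST] template (the answer's exact template may differ):

def format_translation_example(src: str, tgt: str) -> str:
    # [INST] ... [/INST] is the Mistral-7B-Instruct chat format; adjust to your setup
    return f"<s>[INST] Translate English to German: {src} [/INST] {tgt}</s>"

text = format_translation_example(
    "The cat sat on the mat.",
    "Die Katze saß auf der Matte.",
)
# Strings like `text` can then be collected under a single "text" key in a datasets.Dataset.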

View answer
alvas's user avatar
  • 120k
0 votes
1 answer
209 views

What are the expected inputs to the Mistral model's embedding layer?

After installing !pip install -U bitsandbytes !pip install -U transformers !pip install -U peft !pip install -U accelerate !pip install -U trl And then some boilerplate to load the Mistral model: ...
alvas's user avatar
  • 120k
Answer

Try the return_tensors='pt' argument, e.g. model.model.embed_tokens(tokenizer("Hello world", return_tensors='pt').input_ids)
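
A short sketch of that call, assuming model and tokenizer are the loaded Mistral objects from the question; the embedding layer expects a LongTensor of token ids rather than Python lists:

ids = tokenizer("Hello world", return_tensors="pt").input_ids   # LongTensor, shape (1, seq_len)
vectors = model.model.embed_tokens(ids)                          # shape (1, seq_len, hidden_size)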

View answer
alvas's user avatar
  • 120k
1 vote
1 answer
386 views

Huggingface Tokenizer not adding the padding tokens

I am trying to follow this to translate English sentences to Japanese. Using this line: import torch from transformers import AutoTokenizer from auto_gptq import AutoGPTQForCausalLM ...
Labyrinthian's user avatar
Answer Accepted

Depends on what you want to do with the padded tokens; most probably, if you're just going to run inference or feed it to the Trainer object, then you won't need special arguments to get the batch size ...
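
A minimal padding sketch, assuming tokenizer is the one loaded in the question; padding is only applied when tokenizing a batch and asking for it explicitly:

batch = ["Hello world", "A much longer sentence that needs no padding at all"]
# causal LMs often ship without a pad token; reuse EOS if needed
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)   # both rows padded to the same length
print(encoded["attention_mask"])    # zeros mark the padded positions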

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
77 views

What is the TREC 2006 Spam Track Public Corpora Format?

link to original dataset I have downloaded this dataset The TREC 2006 Public Corpus -- 75MB (trec06p.tgz). Here is the folder structure: . └── trec06p/ ├── data ├── data-delay ├── full ...
Manish Joyeuse's user avatar
Answer Accepted

Disclaimer Before reading the answer, please note that since I had not participated in the TREC06 task nor am I the data creator/provider, I can only make some educated guesses about the questions you have ...

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
214 views

How to calculate the weighted sum of last 4 hidden layers using Roberta?

The table from this paper explains various approaches to obtain the embedding; I think these approaches are also applicable to RoBERTa too. I'm trying to calculate the weighted sum of the last 4 ...
user avatar
Answer Accepted

First, let's do some digging from the OG BERT code, https://github.com/google-research/bert If we just do a quick search for "sum" on the github repo, we find this https://github.com/google-...
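
A minimal sketch of the weighted-sum idea with an off-the-shelf roberta-base (the layer weights below are illustrative, not taken from the paper):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)

inputs = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors, each (batch, seq_len, dim)
last_four = torch.stack(outputs.hidden_states[-4:])          # (4, batch, seq_len, dim)
weights = torch.tensor([0.1, 0.2, 0.3, 0.4]).view(4, 1, 1, 1)
weighted_sum = (last_four * weights).sum(dim=0)              # (batch, seq_len, dim)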

View answer
alvas's user avatar
  • 120k
5 votes
2 answers
2k views

BERT token vs. embedding

I understand that WordPiece is used to break text into tokens. And I understand that, somewhere in BERT, the model maps tokens into token embeddings that represent the meaning of the tokens. But ...
i82much's user avatar
  • 61
Answer

Inside BERT, as well as most other NLP deep learning models, the conversion from token IDs to vectors is done with an Embedding layer. For instance, in Pytorch it is the torch.nn.Embedding module. The ...
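
A tiny illustration of such a layer (the vocabulary size and token ids below are the bert-base-uncased ones, used only as an example):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=30522, embedding_dim=768)   # bert-base-uncased sizes
token_ids = torch.tensor([[101, 7592, 2088, 102]])                   # roughly "[CLS] hello world [SEP]"
vectors = embedding(token_ids)                                       # shape (1, 4, 768)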

View answer
noe's user avatar
  • 2,024
0 votes
1 answer
269 views

Error while loading a tagger (probably missing model file)

I am creating an Android app in Android Studio where I use Stanford Core NLP and Jetpack Compose. I have been looking for hours on this platform to see if someone has a problem similar to mine, but I ...
Eduardo's user avatar
  • 71
Answer

Almost certainly this means that the tagger model file is not present at runtime on the path specified. At runtime, does the path you give: libs/stanford-corenlp-4.5.4-models.jar work to access the ...

View answer
Christopher Manning's user avatar
0 votes
1 answer
66 views

How to get Enhanced++ dependency labels with a java command line in the terminal?

I don't really know java, but I was just trying to use the documentation of the Stanford NLP parser to get the Enhanced++ dependency labels. This is the line I ran: java -cp "*" -Xmx2g edu....
Galit's user avatar
  • 67
Answer Accepted

You actually are getting enhanced++ dependency labels. However, it looks like you are looking for something else or an older version. UD was somewhat revised between UDv1 and UDv2. One of the changes ...

View answer
Christopher Manning's user avatar
1 vote
2 answers
220 views

How to de-normalize text in Python?

I am currently working on a Python project using text semantics to match similarities. In the end, my goal is to have a dataset column containing all my words of interest, to be searched by a ...
Lefloch Had's user avatar
Answer

I understand that you mean to generate all possible inflexions of an English word. For that, you may use LemmInflect as follows: from lemminflect import getAllInflections > getInflection('watch', ...

View answer
noe's user avatar
  • 2,024
0 votes
1 answer
226 views

How to concatenate a word split by tokenizers after machine translation using NLP?

Russian translation produces the following result; is there an NLP function we can use to concatenate it as "Europe's" in the following string? "Nitzchia Protector Todibo can go to ...
user352290's user avatar
  • 1,105
Answer Accepted

Try detokenizers, but because there are rules to process tokens that are expected to change x 's -> x's but not x ' s -> x's, you might have to apply the detokenizer iteratively, e.g. using ...
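
A minimal sketch of the iterative idea, assuming NLTK's TreebankWordDetokenizer (the answer may use a different detokenizer, and some patterns may still need hand-written rules afterwards):

from nltk.tokenize.treebank import TreebankWordDetokenizer

detok = TreebankWordDetokenizer()
text = "Europe ' s budget"
previous = None
while text != previous:               # keep re-applying until the string stops changing
    previous = text
    text = detok.detokenize(text.split())
print(text)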

View answer
alvas's user avatar
  • 120k
0 votes
1 answer
714 views

Can I use 4-bit or 8-bit versions of transformers translation models?

Are quantized versions available for other transformer models beyond LLMs, specifically for translation models? I'm looking for information about the following models: https://...
mary evans's user avatar
Answer

It might not work for every model, but you can try 8-bit quantization with native PyTorch, https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html like this: import gc import torch ...
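
A minimal dynamic-quantization sketch following that PyTorch recipe, applied here for illustration to Helsinki-NLP/opus-mt-en-de (results vary by architecture):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# int8-quantize the Linear layers; activations stay in float
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Quantization keeps the model small.", return_tensors="pt")
print(tokenizer.decode(quantized.generate(**inputs)[0], skip_special_tokens=True))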

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
913 views

Low score and wrong answer for Flan-T5-XXL "question-answering" task

I'm trying to run Flan-T5-XXL model for a "question-answering" task. Here's how I loaded and executed the model: model_id = "~/Downloads/test_LLM/flan-t5-xxl" tokenizer = ...
AnonX's user avatar
  • 169
Answer

Pre/Script: This is more of a science experiment design or product development question than a programming question, so most probably someone will flag to close this question on Stackoverflow ...

View answer
alvas's user avatar
  • 120k
0 votes
1 answer
434 views

How to skip tokenization and translation of custom glossary in huggingface NMT models?

I am using mBART50 and opus-MT-en-de for bilingual translations from huggingface. We have a custom dictionary of organization-specific glossary containing ~10,000 English terms (ngrams with n=1-5) and ...
Bharatiya's user avatar
Answer Accepted

Constraining beam search (or sampling from a generative model) is difficult because even when you know what string you want to have in the target sentence, you do not know at which position it should appear. ...

View answer
Jindřich's user avatar
  • 11k
1 vote
1 answer
83 views

Detecting adding/removal from string difference between texts

I have two versions of a short text, e.g.: old = "(a) The provisions of this article apply to machinery of class 6." new = "(a) The provisions of this article apply to machinery of ...
user456789's user avatar
Answer

If you like some git-diff like functions, you can try: from difflib import unified_diff s1 = "(a) The provisions of this article apply to machinery of class 6." s2 = "(a) The ...
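
A minimal sketch of the git-diff-like approach; the second string below is made up for illustration because the question's version is truncated:

from difflib import unified_diff

old = "(a) The provisions of this article apply to machinery of class 6."
new = "(a) The provisions of this article apply to machinery of class 7."   # hypothetical new version

for line in unified_diff(old.split(), new.split(), lineterm=""):
    print(line)
# lines starting with "-" were removed, lines starting with "+" were added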

View answer
alvas's user avatar
  • 120k
1 vote
3 answers
2k views

How to highlight the differences between two strings in Python?

I want to highlight the differences between two strings in a colour using Python code. Example 1: sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates." sentence2 ...
Oliver's user avatar
  • 562
Answer

Use difflib to get the matching blocks: from difflib import SequenceMatcher s1 = "I'm enjoying the summer breeze on the beach while I do some pilates." s2 = "I'm enjoying the summer ...
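
A minimal sketch with SequenceMatcher, here using ANSI colours to do the highlighting; the second sentence is made up since the question's is truncated:

from difflib import SequenceMatcher

s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the winter breeze on the beach while I do some yoga."   # hypothetical sentence2

matcher = SequenceMatcher(None, s1, s2)
highlighted = []
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "equal":
        highlighted.append(s2[j1:j2])
    else:
        highlighted.append(f"\033[91m{s2[j1:j2]}\033[0m")   # differing spans of s2 in red
print("".join(highlighted))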

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
904 views

Backpropagation / minibatching in training large language models (LLMs)

I am struggling to understand how backprop works for transformer-based LLMs. Here is my guess of how this process works. Given a sequence of tokens with length 64, we process the sequence in parallel ...
Chinmaya Andukuri's user avatar
Answer

There are mainly two training routines for most auto-regressive language models: Causal Language Model (given a word, predict the next word) Masked Language Model (given a fixed sequence space, predict ...

View answer
alvas's user avatar
  • 120k
5 votes
1 answer
7k views

CUDA out of memory using trainer in huggingface during validation (training is fine)

When doing fine-tuning with the HF Trainer, training is fine but it fails during validation. Even reducing eval_accumulation_steps to 1 did not work. I followed the procedure in the link: Why is ...
Tommy's user avatar
  • 59
Answer

First, ensure that you have the latest accelerate>=0.21.0 installed. pip install -U accelerate Then, try using auto_find_batch_size args=transformers.TrainingArguments( ...
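
A minimal sketch of those arguments (the other TrainingArguments values here are hypothetical placeholders; keep the rest of your setup as in the question):

import transformers

args = transformers.TrainingArguments(
    output_dir="outputs",
    auto_find_batch_size=True,     # retries with a smaller batch size on CUDA OOM (needs accelerate)
    eval_accumulation_steps=1,     # move eval predictions off the GPU every step
    per_device_eval_batch_size=1,  # small eval batch to reduce peak memory
)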

View answer
alvas's user avatar
  • 120k
4 votes
1 answer
4k views

Validation and Training Loss when using HuggingFace

I cannot seem to find an explanation of how the validation and training losses are calculated when we fine-tune a model using the Hugging Face Trainer. Does anyone know where to find this information?
tt40kiwi's user avatar
  • 411
Answer

In Short: Depends on what you want to do with the evaluation function; knowing the internal workings of the evaluation might or might not be practical for you to train the model appropriately. Scroll ...

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
561 views

Some doubts about huggingface's BPE algorithm

In most BPE(Byte-Pair Encoding) tutorials, it is mentioned to add </w> after a word. The function of this mark is to distinguish whether a subword is a prefix of a word or a suffix of a word. We ...
korangar leo's user avatar
Answer Accepted

The end of word marker </w> is part of the tokens during the creation of a vocabulary, not a token per se. Once the BPE vocabulary creation is finished, you normally invert the mark: you mark ...

View answer
noe's user avatar
  • 2,024
1 vote
1 answer
400 views

Why do we add |V| in the denominator in the Add-One smoothing for n-gram language models?

In NLP when we use Laplace(Add-one) smoothing technique we assume that the every word is seen one more time than the actual count and the formula is like this where V is the size of the vocabulary. ...
hxdshell's user avatar
Answer Accepted

The |V| variable that we see in the denominator of the additive smoothing function is not actually a direct definition of the probabilistic estimation of the n-gram. It is derived from: First, we start ...
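
The derivation the answer sketches can be written out as follows: adding one pseudo-count to each of the |V| possible continuations adds |V| to the total count, which keeps the distribution normalized.

% Add-one (Laplace) smoothing for a bigram model, with V the vocabulary:
P_{\text{Laplace}}(w_i \mid w_{i-1})
  = \frac{C(w_{i-1} w_i) + 1}{\sum_{w \in V} \bigl( C(w_{i-1} w) + 1 \bigr)}
  = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + |V|}
% The +1 is applied to every one of the |V| possible next words, so the denominator
% grows by |V|, and \sum_{w_i} P(w_i \mid w_{i-1}) = 1 still holds.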

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
386 views

Freeze and unfreeze certain layers in TFDistilBertModel

I am trying to implement either TFBertModel or TFDistilBertModel in my neural network model (there are other layers such as dense and batch norm). My understanding is that there are hidden layers in ...
marmamar's user avatar
Answer

First, we need to access the layers/params with its name so that we know what we want to freeze/unfreeze: from transformers import AutoModel model = AutoModel.from_pretrained('distilbert-base-uncased'...
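
A minimal freeze/unfreeze sketch using the PyTorch AutoModel named above (for the TF classes in the question, the analogous switch is layer.trainable):

from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")

for name, param in model.named_parameters():
    param.requires_grad = False                    # freeze everything

for name, param in model.named_parameters():
    if name.startswith("transformer.layer.5"):     # DistilBERT's last (6th) transformer block
        param.requires_grad = True                 # unfreeze just that block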

View answer
alvas's user avatar
  • 120k
1 vote
1 answer
3k views

Why is the nn.Embedding layer used for positional encoding in BERT?

In the Hugging Face implementation of the BERT model, nn.Embedding is used for the positional embeddings. Why is it used instead of the traditional sin/cos positional encoding described in the Transformer paper? ...
dsoum's user avatar
  • 45
Answer Accepted

nn.Embedding is just a table of vectors. Its input are indices to the table. Its output are the vectors associated to the indices from the input. Conceptually, it is equivalent to having one-hot ...

View answer
noe's user avatar
  • 2,024
1 vote
1 answer
106 views

How to tokenize with PyTorch without punctuation and in all lowercase?

source_string = "first a modest refactor to fit the current project size and second a full refactor to move all our code into plugins dont feel like you have to code along to this whole book"...
Daniel's user avatar
  • 11
Answer

You can apply a previous step to add punctuation and proper casing to the text. For this, you may use Re-punctuate. from transformers import T5Tokenizer, TFT5ForConditionalGeneration tokenizer = ...

View answer
noe's user avatar
  • 2,024
1 vote
2 answers
185 views

Why are Neural Networks Needed with Word Embeddings?

Why do we need a neural network to do text classification when we vectorize documents with word embeddings? If word embeddings capture the meaning of words/documents, then why can't we just use cosine ...
Tejas_hooray's user avatar
Answer

Because neural networks give better results than word embeddings in general. Many text classification problems can be addressed just by using word embeddings. However, word embeddings tend to present ...

View answer
noe's user avatar
  • 2,024
2 votes
1 answer
103 views

RuntimeError when trying to extract text features from a BERT model and then using KNN for classification

I'm trying to use the CamemBERT model just to extract text features. After that, I'm trying to use a KNN classifier to classify the feature vectors as inputs. This is the code I wrote: import torch from ...
Wajih101's user avatar
Answer Accepted

It seems that you are feeding ALL your data to the model at once and you don't have enough memory to do that. Instead of doing that, you can invoke the model sentence by sentence or with small ...
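
A minimal batched-inference sketch, assuming tokenizer, model and the list sentences come from the question's code:

import torch

features = []
batch_size = 16
for start in range(0, len(sentences), batch_size):
    batch = sentences[start:start + batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():                          # no gradients needed for feature extraction
        outputs = model(**inputs)
    features.append(outputs.last_hidden_state[:, 0, :])   # one vector per sentence (first token)

X = torch.cat(features).numpy()                    # feature matrix for the KNN classifier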

View answer
noe's user avatar
  • 2,024
1 vote
3 answers
7k views

download hugging face llama2 model to local server

I am running the PyTorch code below in a Jupyter notebook. The notebook is running on my Ubuntu server. I'm trying to download the llama2-70b-chat model from Hugging Face. my ...
user3476463's user avatar
  • 4,455
Answer

The error tells you that there is no space left on the storage drive (e.g. a hard drive partition). From the name of the variables, it could be the partition where the model is, or maybe the temporary ...

View answer
noe's user avatar
  • 2,024
1 vote
1 answer
422 views

Getting connection refused error using openllm library of python

I am trying to utilise this github repo, particularly the below python code: import openllm client = openllm.client.HTTPClient('http://localhost:3000') client.query('Explain to me the difference ...
Shanam Afzal's user avatar
Answer

You should refer to the "Starting an LLM Server" section of the github project you linked to. For instance, to start a server with the OPT model, you would do as follows: openllm start opt ...

View answer
noe's user avatar
  • 2,024
0 votes
1 answer
234 views

Bertbaseuncased install with spacy is not working

I am trying to start an NLP project using spacy transformers. When trying to install bertbaseuncased I'm getting this error: ✘ No compatible package found for 'en_trf_bertbaseuncased_lg' (spaCy v3.6.0)...
Patrick's user avatar
  • 29
Answer Accepted

This is addressed in a discussion on the Spacy github repo. The explanation of the error is that en_trf_bertbaseuncased_lg is a Spacy 2.x model and you are using 3.x. Instead of said model, you can ...

View answer
noe's user avatar
  • 2,024
1 vote
1 answer
125 views

OSError: Can't find model 'en_core_web_sm' in spaCy when running script in Python IDE

I'm trying to run a Python script in Python's built-in IDE (IDLE) that uses the spaCy library, and I'm running into an issue where it can't seem to find the 'en_core_web_sm' model. Here's the error I'...
Brandon biba's user avatar
Answer

You can download the model at runtime to ensure it is present. For this, you can wrap the creation of the model in a function like this one: import spacy def create_spacy_model() -> spacy....
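
A minimal sketch of such a function, downloading en_core_web_sm at runtime if it is missing:

import spacy

def create_spacy_model(name: str = "en_core_web_sm") -> spacy.language.Language:
    try:
        return spacy.load(name)
    except OSError:
        spacy.cli.download(name)        # fetch the model package if it is not installed yet
        return spacy.load(name)

nlp = create_spacy_model()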

View answer
noe's user avatar
  • 2,024
-1 votes
1 answer
482 views

How to force falcon 40B to print in JSON format?

I have been trying to extract start time and end time from the input text using Falcon 40B. This was my prompt, Identify the following items from the given text which states random shipping details: ...
Tamizhini Venkatesan's user avatar
Answer

You can use the Guidance library. From the "Guaranteeing valid syntax JSON example" section of their readme: Large language models are great at generating useful outputs, but they are not ...

View answer
noe's user avatar
  • 2,024
0 votes
2 answers
164 views

What is Stanford CoreNLP's recipe for tokenization?

Whether you're using the Stanza or CoreNLP (now deprecated) python wrappers, or the original Java implementation, the tokenization rules that Stanford CoreNLP follows are super hard for me to figure out ...
lrthistlethwaite's user avatar
Answer Accepted

Here are a few notes from one of its main authors. What you write in your answer is all basically correct, but there are many nuances. 😊 Yes, the CoreNLP tokenizer was written to follow the ...

View answer
Christopher Manning's user avatar
33 votes
2 answers
10k views

Parsing city of origin / destination city from a string

I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the city of origin and destination city (I would like to ...
Merv Merzoug's user avatar
  • 1,237
Answer Accepted Answer recommended by NLP Collective

TL;DR Pretty much impossible at first glance, unless you have access to some API that contains pretty sophisticated components. In Long From first look, it seems like you're asking to solve a ...

View answer
alvas's user avatar
  • 120k
4 votes
1 answer
2k views

How can I use Stanford NLP commercially?

I'm working at a company that makes toy cars that can talk with children. We want to use Stanford Core NLP as a parser. However, it is licensed under the GPL, which doesn't allow us to use the NLP tools commercially. ...
Cheong-An Lee's user avatar
Answer Accepted Answer recommended by NLP Collective

You can either use the software under the GPL license, or you can purchase a commercial license. For the latter, you can contact us at the support email address found here.

View answer
Christopher Manning's user avatar
124 votes
19 answers
157k views

Resource u'tokenizers/punkt/english.pickle' not found

My Code: import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ERROR Message: [ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py Traceback (most recent ...
Supreeth Meka's user avatar
Answer Answer recommended by NLP Collective

If you're looking to only download the punkt model: import nltk nltk.download('punkt') If you're unsure which data/model you need, you can install the popular datasets, models and taggers from NLTK: ...

View answer
alvas's user avatar
  • 120k
70 votes
15 answers
210k views

How do I download NLTK data?

Updated answer: NLTK works well with 2.7. I had 3.2. I uninstalled 3.2 and installed 2.7. Now it works!! I have installed NLTK and tried to download NLTK Data. What I did was to follow the instruction ...
Q-ximi's user avatar
  • 951
Answer Answer recommended by NLP Collective

TL;DR To download a particular dataset/models, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use: $ python3 >>> import nltk >>>...

View answer
alvas's user avatar
  • 120k
29 votes
3 answers
44k views

What exactly is an n Gram?

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked if it was correct: Sentence: "I live in NY." word level bigrams (2 for n): "# I', "I ...
user2649614's user avatar
Answer Answer recommended by NLP Collective

Usually a picture is worth thousand words. Source: http://recognize-speech.com/language-model/n-gram-model/comparison

View answer
Kamran's user avatar
  • 2,711
179 votes
17 answers
261k views

n-grams in python, four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like: import nltk from nltk import bigrams string = "I really like python, it's pretty awesome." string_bigrams = ...
Shifu's user avatar
  • 2,165
Answer Accepted Answer recommended by NLP Collective

Great native Python-based answers given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library). There is an ...
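
A minimal sketch of that nltk approach for arbitrary n (4-grams shown here):

from nltk import ngrams, word_tokenize   # nltk.download('punkt') may be needed once

string = "I really like python, it's pretty awesome."
tokens = word_tokenize(string)
fourgrams = list(ngrams(tokens, 4))      # change 4 to 5, 6, ... for longer n-grams
print(fourgrams[:2])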

View answer
alvas's user avatar
  • 120k
198 votes
9 answers
157k views

What are all possible POS tags of NLTK?

How do I find a list with all possible POS tags used by the Natural Language Toolkit (NLTK)?
OrangeTux's user avatar
  • 11.4k
Answer Answer recommended by NLP Collective

To save some folks some time, here is a list I extracted from a small corpus. I do not know if it is complete, but it should have most (if not all) of the help definitions from upenn_tagset... CC: ...
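
The same help definitions can also be printed directly from NLTK once the tagset data is downloaded:

import nltk

nltk.download("tagsets")          # help files used by nltk.help
nltk.help.upenn_tagset()          # print every Penn Treebank tag with its definition
nltk.help.upenn_tagset("CC")      # or look up a single tag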

View answer
binarymax's user avatar
  • 3,325
32 votes
7 answers
49k views

What is NLTK POS tagger asking me to download?

I just started using a part-of-speech tagger, and I am facing many problems. I started POS tagging with the following: import nltk text=nltk.word_tokenize("We are going out.Just you and me.") When ...
Pearl's user avatar
  • 759
Answer Answer recommended by NLP Collective

From NLTK versions higher than v3.2, please use: >>> import nltk >>> nltk.__version__ '3.2.1' >>> nltk.download('averaged_perceptron_tagger') [nltk_data] Downloading ...

View answer
alvas's user avatar
  • 120k
191 votes
18 answers
231k views

Failed loading english.pickle with nltk.data.load

When trying to load the punkt tokenizer... import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ...a LookupError was raised: > LookupError: > **************...
Martin's user avatar
  • 1,913
Answer Answer recommended by NLP Collective

The main reason why you see that error is that nltk couldn't find the punkt package. Due to the size of the nltk suite, all available packages are not downloaded by default when one installs it. You can download ...

View answer
Naren Yellavula's user avatar