
NLP Collective

Questions

Browse questions with relevant NLP tags

44 questions

1 vote
1 answer
168 views

Error while converting Google Flan-T5 model to ONNX

I am looking to convert the Flan-T5 model downloaded from Hugging Face into ONNX format and run inference with it. My input data is the symptoms of a disease and the expected output is the disease name ...
Romi's user avatar
  • 253
Answer

Use https://huggingface.co/datasets/bakks/flan-t5-onnx instead. And to convert the google/flan-t5, see https://huggingface.co/datasets/bakks/flan-t5-onnx/blob/main/exportt5.py from pathlib import ...
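
For context, a minimal export sketch using Hugging Face Optimum (an assumption on my part; the linked exportt5.py may use a different toolchain) could look like this:

from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

model_id = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly (recent Optimum versions)
ort_model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)
ort_model.save_pretrained("flan-t5-small-onnx")

inputs = tokenizer("Symptoms: fever, cough, fatigue. Disease:", return_tensors="pt")
outputs = ort_model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))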

View answer
alvas's user avatar
  • 120k
1 vote
1 answer
93 views

Why does my fine-tuned T5-Base model for a sequence-to-sequence task produce short, incomplete generations?

I am trying to fine-tune a t5-base model for creating an appropriate question against a compliance item. Compliance items are paragraphs of text and my questions are in the past format of them. I have ...
Daremitsu's user avatar
  • 609
Answer

Because of: labels = tokenizer(targets, max_length=32, padding="max_length", truncation=True) Most probably your model has learnt to just output/generate outputs that are ~32 tokens. Try: ...
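
A minimal sketch of that fix, assuming targets and tokenizer are defined as in the question:

labels = tokenizer(
    targets,                 # same target strings as in the question
    max_length=256,          # large enough for the longest expected question
    padding="max_length",
    truncation=True,
)
# and allow longer outputs at generation time as well, e.g.
# model.generate(**inputs, max_new_tokens=256)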

View answer
alvas's user avatar
  • 120k
1 vote
1 answer
179 views

How to save the LLM2Vec model as a HuggingFace PreTrainedModel object?

Typically, we should be able to save a merged base + PEFT model, like this: import torch from transformers import AutoTokenizer, AutoModel, AutoConfig from peft import PeftModel # Loading base MNTP ...
alvas's user avatar
  • 120k
Answer

Wrapping the LLM2Vec object like in https://stackoverflow.com/a/74109727/610569, we can try this: import torch.nn as nn from transformers import PreTrainedModel, PretrainedConfig from ...

View answer
alvas's user avatar
  • 120k
3 votes
1 answer
529 views

Mistral model generates the same embeddings for different input texts

I am using a pre-trained LLM to generate a representative embedding for an input text. But it is weird that the output embeddings are all the same regardless of the input text. The code: from ...
Howie's user avatar
  • 101
Answer Accepted

You're not slicing the dimensions right at outputs.last_hidden_state[0, 0, :].numpy() Q: What is the 0th token in all inputs? A: Beginning of sentence token (BOS) Q: So that's the "embeddings ...
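
As a hedged illustration of one alternative pooling (not necessarily the answer's exact fix), mean-pooling the last hidden state over non-padding tokens avoids reading only the BOS position, which is identical for every input. Assumes tokenizer, model and text come from the question's code:

import torch

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state                     # (batch, seq_len, dim)
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, dim)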

View answer
alvas's user avatar
  • 120k
4 votes
1 answer
723 views

How to fine-tune a Mistral-7B model for machine translation?

There are a lot of tutorials online that use raw text affixed with arcane syntax to indicate document boundaries, accessed through the Hugging Face datasets.Dataset object via the text key. E.g. from ...
alvas's user avatar
  • 120k
Answer

The key is to re-format the data from a traditional machine translation dataset that splits the source and target text, and piece them together in a format that the model expects. For the Mistral 7B ...
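
A minimal formatting sketch, assuming the Mistral-7B-Instruct [INST] ... [/INST] template (the answer's exact template may differ):

def format_translation_example(src: str, tgt: str) -> str:
    # [INST] ... [/INST] is the Mistral-7B-Instruct chat format; adjust to your setup
    return f"<s>[INST] Translate English to German: {src} [/INST] {tgt}</s>"

text = format_translation_example(
    "The cat sat on the mat.",
    "Die Katze saß auf der Matte.",
)
# Strings like `text` can then be collected under a single "text" key in a datasets.Dataset.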

View answer
alvas's user avatar
  • 120k
0 votes
1 answer
209 views

What are the expected inputs to the Mistral model's embedding layer?

After installing !pip install -U bitsandbytes !pip install -U transformers !pip install -U peft !pip install -U accelerate !pip install -U trl And then some boilerplate to load the Mistral model: ...
alvas's user avatar
  • 120k
Answer

Try the return_tensors='pt' argument, e.g. model.model.embed_tokens(tokenizer("Hello world", return_tensors='pt').input_ids)
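
A short sketch of that call, assuming model and tokenizer are the loaded Mistral objects from the question; the embedding layer expects a LongTensor of token ids rather than Python lists:

ids = tokenizer("Hello world", return_tensors="pt").input_ids   # LongTensor, shape (1, seq_len)
vectors = model.model.embed_tokens(ids)                          # shape (1, seq_len, hidden_size)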

View answer
alvas's user avatar
  • 120k
1 vote
1 answer
386 views

Huggingface Tokenizer not adding the padding tokens

I am trying to follow this to translate English sentences to Japanese. Using this line: import torch from transformers import AutoTokenizer from auto_gptq import AutoGPTQForCausalLM ...
Labyrinthian's user avatar
Answer Accepted

Depends on what you want to do with the padded tokens; most probably, if you're just going to run inference or feed it to the Trainer object, then you won't need special arguments to get the batch size ...
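
A minimal padding sketch, assuming tokenizer is the one loaded in the question; padding is only applied when tokenizing a batch and asking for it explicitly:

batch = ["Hello world", "A much longer sentence that needs no padding at all"]
# causal LMs often ship without a pad token; reuse EOS if needed
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
encoded = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)   # both rows padded to the same length
print(encoded["attention_mask"])    # zeros mark the padded positions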

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
77 views

What is the TREC 2006 Spam Track Public Corpora Format?

link to original dataset I have downloaded this dataset The TREC 2006 Public Corpus -- 75MB (trec06p.tgz). Here is the folder structure: . └── trec06p/ ├── data ├── data-delay ├── full ...
Manish Joyeuse's user avatar
Answer Accepted

Disclaimer Before reading the answer, please note that since I had not participated in the TREC06 task nor am I the data creator/provider, I can only make some educated guesses about the questions you have ...

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
214 views

How to calculate the weighted sum of last 4 hidden layers using Roberta?

The table from this paper explains various approaches to obtain the embedding; I think these approaches are also applicable to RoBERTa too. I'm trying to calculate the weighted sum of the last 4 ...
user avatar
Answer Accepted

First, let's do some digging from the OG BERT code, https://github.com/google-research/bert If we just do a quick search for "sum" on the github repo, we find this https://github.com/google-...
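
A minimal sketch of the weighted-sum idea with an off-the-shelf roberta-base (the layer weights below are illustrative, not taken from the paper):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)

inputs = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors, each (batch, seq_len, dim)
last_four = torch.stack(outputs.hidden_states[-4:])          # (4, batch, seq_len, dim)
weights = torch.tensor([0.1, 0.2, 0.3, 0.4]).view(4, 1, 1, 1)
weighted_sum = (last_four * weights).sum(dim=0)              # (batch, seq_len, dim)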

View answer
alvas's user avatar
  • 120k
5 votes
2 answers
2k views

BERT token vs. embedding

I understand that WordPiece is used to break text into tokens. And I understand that, somewhere in BERT, the model maps tokens into token embeddings that represent the meaning of the tokens. But ...
i82much's user avatar
  • 61
Answer

Inside BERT, as well as most other NLP deep learning models, the conversion from token IDs to vectors is done with an Embedding layer. For instance, in Pytorch it is the torch.nn.Embedding module. The ...
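
A tiny illustration of such a layer (the vocabulary size and token ids below are the bert-base-uncased ones, used only as an example):

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=30522, embedding_dim=768)   # bert-base-uncased sizes
token_ids = torch.tensor([[101, 7592, 2088, 102]])                   # roughly "[CLS] hello world [SEP]"
vectors = embedding(token_ids)                                       # shape (1, 4, 768)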

View answer
noe's user avatar
  • 2,024
0 votes
1 answer
269 views

Error while loading a tagger (probably missing model file)

I am creating an Android app in Android Studio where I use Stanford Core NLP and Jetpack Compose. I have been looking for hours on this platform to see if someone has a problem similar to mine, but I ...
Eduardo's user avatar
  • 71
Answer

Almost certainly this means that the tagger model file is not present at runtime on the path specified. At runtime, does the path you give: libs/stanford-corenlp-4.5.4-models.jar work to access the ...

View answer
Christopher Manning's user avatar
0 votes
1 answer
66 views

How to get Enhanced++ dependency labels with a java command line in the terminal?

I don't really know java, but I was just trying to use the documentation of the Stanford NLP parser to get the Enhanced++ dependency labels. This is the line I ran: java -cp "*" -Xmx2g edu....
Galit's user avatar
  • 67
Answer Accepted

You actually are getting enhanced++ dependency labels. However, it looks like you are looking for something else or an older version. UD was somewhat revised between UDv1 and UDv2. One of the changes ...

View answer
Christopher Manning's user avatar
1 vote
2 answers
220 views

How to de-normalize text in Python?

I am currently working on a Python project using text semantics to match similarities. In the end, my goal is to have a dataset column containing all my words of interest, to be searched by a ...
Lefloch Had's user avatar
Answer

I understand that you mean to generate all possible inflexions of an English word. For that, you may use LemmInflect as follows: from lemminflect import getAllInflections > getInflection('watch', ...

View answer
noe's user avatar
  • 2,024
0 votes
1 answer
226 views

How to concatenate a word split by tokenizers after machine translation using NLP?

Russian translation produces the following result; is there an NLP function we can use to concatenate it as "Europe's" in the following string? "Nitzchia Protector Todibo can go to ...
user352290's user avatar
  • 1,105
Answer Accepted

Try detokenizers, but because there are rules to process tokens that are expected to change x 's -> x's but not x ' s -> x's, you might have to apply the detokenizer iteratively, e.g. using ...
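
A minimal sketch of the iterative idea, assuming NLTK's TreebankWordDetokenizer (the answer may use a different detokenizer, and some patterns may still need hand-written rules afterwards):

from nltk.tokenize.treebank import TreebankWordDetokenizer

detok = TreebankWordDetokenizer()
text = "Europe ' s budget"
previous = None
while text != previous:               # keep re-applying until the string stops changing
    previous = text
    text = detok.detokenize(text.split())
print(text)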

View answer
alvas's user avatar
  • 120k
0 votes
1 answer
714 views

Can I use 4-bit or 8-bit versions of transformers translation models?

Are quantized versions available for other transformer models beyond LLMs, specifically for translation models? I'm looking for information about the following models: https://...
mary evans's user avatar
Answer

It might not work for every model, but you can try 8-bit quantization with native PyTorch, https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html like this: import gc import torch ...
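
A minimal dynamic-quantization sketch following that PyTorch recipe, applied here for illustration to Helsinki-NLP/opus-mt-en-de (results vary by architecture):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# int8-quantize the Linear layers; activations stay in float
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Quantization keeps the model small.", return_tensors="pt")
print(tokenizer.decode(quantized.generate(**inputs)[0], skip_special_tokens=True))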

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
913 views

Low score and wrong answer for Flan-T5-XXL "question-answering" task

I'm trying to run Flan-T5-XXL model for a "question-answering" task. Here's how I loaded and executed the model: model_id = "~/Downloads/test_LLM/flan-t5-xxl" tokenizer = ...
AnonX's user avatar
  • 169
Answer

Pre/Script: This is more of a science experiment design or product development question than a programming question, so most probably someone will flag to close this question on Stackoverflow ...

View answer
alvas's user avatar
  • 120k
0 votes
1 answer
434 views

How to skip tokenization and translation of custom glossary in huggingface NMT models?

I am using mBART50 and opus-MT-en-de for bilingual translations from huggingface. We have a custom dictionary of organization-specific glossary containing ~10,000 English terms (ngrams with n=1-5) and ...
Bharatiya's user avatar
Answer Accepted

Constraining beam search (or sampling from a generative model) is difficult because even when you know what string you want to have in the target sentence, you do not know at which position it should appear. ...

View answer
Jindřich's user avatar
  • 11k
1 vote
1 answer
83 views

Detecting adding/removal from string difference between texts

I have two versions of a short text, e.g.: old = "(a) The provisions of this article apply to machinery of class 6." new = "(a) The provisions of this article apply to machinery of ...
user456789's user avatar
Answer

If you like some git-diff like functions, you can try: from difflib import unified_diff s1 = "(a) The provisions of this article apply to machinery of class 6." s2 = "(a) The ...
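
A minimal sketch of the git-diff-like approach; the second string below is made up for illustration because the question's version is truncated:

from difflib import unified_diff

old = "(a) The provisions of this article apply to machinery of class 6."
new = "(a) The provisions of this article apply to machinery of class 7."   # hypothetical new version

for line in unified_diff(old.split(), new.split(), lineterm=""):
    print(line)
# lines starting with "-" were removed, lines starting with "+" were added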

View answer
alvas's user avatar
  • 120k
1 vote
3 answers
2k views

How to highlight the differences between two strings in Python?

I want to highlight the differences between two strings in a colour using Python code. Example 1: sentence1 = "I'm enjoying the summer breeze on the beach while I do some pilates." sentence2 ...
Oliver's user avatar
  • 562
Answer

Use difflib to get the matching blocks: from difflib import SequenceMatcher s1 = "I'm enjoying the summer breeze on the beach while I do some pilates." s2 = "I'm enjoying the summer ...
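
A minimal sketch with SequenceMatcher, here using ANSI colours to do the highlighting; the second sentence is made up since the question's is truncated:

from difflib import SequenceMatcher

s1 = "I'm enjoying the summer breeze on the beach while I do some pilates."
s2 = "I'm enjoying the winter breeze on the beach while I do some yoga."   # hypothetical sentence2

matcher = SequenceMatcher(None, s1, s2)
highlighted = []
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "equal":
        highlighted.append(s2[j1:j2])
    else:
        highlighted.append(f"\033[91m{s2[j1:j2]}\033[0m")   # differing spans of s2 in red
print("".join(highlighted))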

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
904 views

Backpropagation / minibatching in training large language models (LLMs)

I am struggling to understand how backprop works for transformer-based LLMs. Here is my guess of how this process works. Given a sequence of tokens with length 64, we process the sequence in parallel ...
Chinmaya Andukuri's user avatar
Answer

There are mainly two training routines for most auto-regressive language models: Causal Language Model (given a word, predict the next word) Masked Language Model (given a fixed sequence space, predict ...

View answer
alvas's user avatar
  • 120k
5 votes
1 answer
7k views

CUDA out of memory using trainer in huggingface during validation (training is fine)

When doing fine-tuning with the HF Trainer, training is fine but it fails during validation. Even reducing eval_accumulation_steps to 1 did not work. I followed the procedure in the link: Why is ...
Tommy's user avatar
  • 59
Answer

First, ensure that you have the latest accelerate>=0.21.0 installed. pip install -U accelerate Then, try using auto_find_batch_size args=transformers.TrainingArguments( ...
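
A minimal sketch of those arguments (the other TrainingArguments values here are hypothetical placeholders; keep the rest of your setup as in the question):

import transformers

args = transformers.TrainingArguments(
    output_dir="outputs",
    auto_find_batch_size=True,     # retries with a smaller batch size on CUDA OOM (needs accelerate)
    eval_accumulation_steps=1,     # move eval predictions off the GPU every step
    per_device_eval_batch_size=1,  # small eval batch to reduce peak memory
)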

View answer
alvas's user avatar
  • 120k
4 votes
1 answer
4k views

Validation and Training Loss when using HuggingFace

I cannot seem to find an explanation of how the validation and training losses are calculated when we fine-tune a model using the Hugging Face Trainer. Does anyone know where to find this information?
tt40kiwi's user avatar
  • 411
Answer

In Short: Depends on what you want to do with the evaluation function; knowing the internal workings of the evaluation might or might not be practical for you to train the model appropriately. Scroll ...

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
561 views

Some doubts about huggingface's BPE algorithm

In most BPE(Byte-Pair Encoding) tutorials, it is mentioned to add </w> after a word. The function of this mark is to distinguish whether a subword is a prefix of a word or a suffix of a word. We ...
korangar leo's user avatar
Answer Accepted

The end of word marker </w> is part of the tokens during the creation of a vocabulary, not a token per se. Once the BPE vocabulary creation is finished, you normally invert the mark: you mark ...

View answer
noe's user avatar
  • 2,024
1 vote
1 answer
400 views

Why do we add |V| in the denominator in the Add-One smoothing for n-gram language models?

In NLP when we use Laplace(Add-one) smoothing technique we assume that the every word is seen one more time than the actual count and the formula is like this where V is the size of the vocabulary. ...
hxdshell's user avatar
Answer Accepted

The |V| variable that we see in the denominator of the additive smoothing function is not actually a direct definition of the probabilistic estimation of the n-gram. It is derived from: First, we start ...
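
The derivation the answer sketches can be written out as follows: adding one pseudo-count to each of the |V| possible continuations adds |V| to the total count, which keeps the distribution normalized.

% Add-one (Laplace) smoothing for a bigram model, with V the vocabulary:
P_{\text{Laplace}}(w_i \mid w_{i-1})
  = \frac{C(w_{i-1} w_i) + 1}{\sum_{w \in V} \bigl( C(w_{i-1} w) + 1 \bigr)}
  = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + |V|}
% The +1 is applied to every one of the |V| possible next words, so the denominator
% grows by |V|, and \sum_{w_i} P(w_i \mid w_{i-1}) = 1 still holds.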

View answer
alvas's user avatar
  • 120k
2 votes
1 answer
386 views

Freeze and unfreeze certain layers in TFDistilBertModel

I am trying to implement either TFBertModel or TFDistilBertModel in my neural network model (there are other layers such as dense and batch norm). My understanding is that there are hidden layers in ...
marmamar's user avatar
Answer

First, we need to access the layers/params with its name so that we know what we want to freeze/unfreeze: from transformers import AutoModel model = AutoModel.from_pretrained('distilbert-base-uncased'...
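
A minimal freeze/unfreeze sketch using the PyTorch AutoModel named above (for the TF classes in the question, the analogous switch is layer.trainable):

from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")

for name, param in model.named_parameters():
    param.requires_grad = False                    # freeze everything

for name, param in model.named_parameters():
    if name.startswith("transformer.layer.5"):     # DistilBERT's last (6th) transformer block
        param.requires_grad = True                 # unfreeze just that block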

View answer
alvas's user avatar
  • 120k
1 vote
1 answer
3k views

Why is the nn.Embedding layer used for positional encoding in BERT?

In the Hugging Face implementation of the BERT model, nn.Embedding is used for the positional embeddings. Why is it used instead of the traditional sin/cos positional encoding described in the Transformer paper? ...
dsoum's user avatar
  • 45
Answer Accepted

nn.Embedding is just a table of vectors. Its input are indices to the table. Its output are the vectors associated to the indices from the input. Conceptually, it is equivalent to having one-hot ...

View answer
noe's user avatar
  • 2,024
1 vote
1 answer
106 views

How to tokenize with PyTorch without punctuation and in all lowercase?

source_string = "first a modest refactor to fit the current project size and second a full refactor to move all our code into plugins dont feel like you have to code along to this whole book"...
Daniel's user avatar
  • 11
Answer

You can apply a previous step to add punctuation and proper casing to the text. For this, you may use Re-punctuate. from transformers import T5Tokenizer, TFT5ForConditionalGeneration tokenizer = ...

View answer
noe's user avatar
  • 2,024
1 vote
2 answers
185 views

Why are Neural Networks Needed with Word Embeddings?

Why do we need a neural network to do text classification when we vectorize documents with word embeddings? If word embeddings capture the meaning of words/documents, then why can't we just use cosine ...
Tejas_hooray's user avatar
Answer

Because neural networks give better results than word embeddings in general. Many text classification problems can be addressed just by using word embeddings. However, word embeddings tend to present ...

View answer
noe's user avatar
  • 2,024
2 votes
1 answer
103 views

RuntimeError when trying to extract text features from a BERT model and then using KNN for classification

I'm trying to use the CamemBERT model just to extract text features. After that, I'm trying to use a KNN classifier to classify the feature vectors as inputs. This is the code I wrote: import torch from ...
Wajih101's user avatar
Answer Accepted

It seems that you are feeding ALL your data to the model at once and you don't have enough memory to do that. Instead of doing that, you can invoke the model sentence by sentence or with small ...
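
A minimal batched-inference sketch, assuming tokenizer, model and the list sentences come from the question's code:

import torch

features = []
batch_size = 16
for start in range(0, len(sentences), batch_size):
    batch = sentences[start:start + batch_size]
    inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():                          # no gradients needed for feature extraction
        outputs = model(**inputs)
    features.append(outputs.last_hidden_state[:, 0, :])   # one vector per sentence (first token)

X = torch.cat(features).numpy()                    # feature matrix for the KNN classifier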

View answer
noe's user avatar
  • 2,024
1 vote
3 answers
7k views

download hugging face llama2 model to local server

I am running the PyTorch code below in a Jupyter notebook. The notebook is running on my Ubuntu server. I'm trying to download the llama2-70b-chat model from Hugging Face. my ...
user3476463's user avatar
  • 4,455
Answer

The error tells you that there is no space left on the storage drive (e.g. a hard drive partition). From the name of the variables, it could be the partition where the model is, or maybe the temporary ...

View answer
noe's user avatar
  • 2,024
1 vote
1 answer
422 views

Getting connection refused error using openllm library of python

I am trying to utilise this github repo, particularly the below python code: import openllm client = openllm.client.HTTPClient('http://localhost:3000') client.query('Explain to me the difference ...
Shanam Afzal's user avatar
Answer

You should refer to the "Starting an LLM Server" section of the github project you linked to. For instance, to start a server with the OPT model, you would do as follows: openllm start opt ...

View answer
noe's user avatar
  • 2,024
0 votes
1 answer
234 views

Bertbaseuncased install with spacy is not working

I am trying to start an NLP project using spacy transformers. When trying to install bertbaseuncased I'm getting this error: ✘ No compatible package found for 'en_trf_bertbaseuncased_lg' (spaCy v3.6.0)...
Patrick's user avatar
  • 29
Answer Accepted

This is addressed in a discussion on the Spacy github repo. The explanation of the error is that en_trf_bertbaseuncased_lg is a Spacy 2.x model and you are using 3.x. Instead of said model, you can ...

View answer
noe's user avatar
  • 2,024
1 vote
1 answer
125 views

OSError: Can't find model 'en_core_web_sm' in spaCy when running script in Python IDE

I'm trying to run a Python script in Python's built-in IDE (IDLE) that uses the spaCy library, and I'm running into an issue where it can't seem to find the 'en_core_web_sm' model. Here's the error I'...
Brandon biba's user avatar
Answer

You can download the model at runtime to ensure it is present. For this, you can wrap the creation of the model in a function like this one: import spacy def create_spacy_model() -> spacy....
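
A minimal sketch of such a function, downloading en_core_web_sm at runtime if it is missing:

import spacy

def create_spacy_model(name: str = "en_core_web_sm") -> spacy.language.Language:
    try:
        return spacy.load(name)
    except OSError:
        spacy.cli.download(name)        # fetch the model package if it is not installed yet
        return spacy.load(name)

nlp = create_spacy_model()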

View answer
noe's user avatar
  • 2,024
-1 votes
1 answer
482 views

How to force falcon 40B to print in JSON format?

I have been trying to extract start time and end time from the input text using Falcon 40B. This was my prompt, Identify the following items from the given text which states random shipping details: ...
Tamizhini Venkatesan's user avatar
Answer

You can use the Guidance library. From the "Guaranteeing valid syntax JSON example" section of their readme: Large language models are great at generating useful outputs, but they are not ...

View answer
noe's user avatar
  • 2,024
0 votes
2 answers
164 views

What is Stanford CoreNLP's recipe for tokenization?

Whether you're using the Stanza or CoreNLP (now deprecated) python wrappers, or the original Java implementation, the tokenization rules that Stanford CoreNLP follows are super hard for me to figure out ...
lrthistlethwaite's user avatar
Answer Accepted

Here are a few notes from one of its main authors. What you write in your answer is all basically correct, but there are many nuances. 😊 Yes, the CoreNLP tokenizer was written to follow the ...

View answer
Christopher Manning's user avatar
33 votes
2 answers
10k views

Parsing city of origin / destination city from a string

I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the city of origin and destination city (I would like to ...
Merv Merzoug's user avatar
  • 1,237
Answer Accepted Answer recommended by NLP Collective

TL;DR Pretty much impossible at first glance, unless you have access to some API that contains pretty sophisticated components. In Long From first look, it seems like you're asking to solve a ...

View answer
alvas's user avatar
  • 120k
4 votes
1 answer
2k views

How can I use Stanford NLP commercially?

I'm working at a company that makes toy cars that can talk with children. We want to use Stanford Core NLP as a parser. However, it is licensed under the GPL, which doesn't allow us to use the NLP tools commercially. ...
Cheong-An Lee's user avatar
Answer Accepted Answer recommended by NLP Collective

You can either use the software under the GPL license, or you can purchase a commercial license. For the latter, you can contact us at the support email address found here.

View answer
Christopher Manning's user avatar
124 votes
19 answers
157k views

Resource u'tokenizers/punkt/english.pickle' not found

My Code: import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ERROR Message: [ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py Traceback (most recent ...
Supreeth Meka's user avatar
Answer Answer recommended by NLP Collective

If you're looking to only download the punkt model: import nltk nltk.download('punkt') If you're unsure which data/model you need, you can install the popular datasets, models and taggers from NLTK: ...

View answer
alvas's user avatar
  • 120k
70 votes
15 answers
210k views

How do I download NLTK data?

Updated answer: NLTK works well with 2.7. I had 3.2. I uninstalled 3.2 and installed 2.7. Now it works!! I have installed NLTK and tried to download NLTK Data. What I did was to follow the instruction ...
Q-ximi's user avatar
  • 951
Answer Answer recommended by NLP Collective

TL;DR To download a particular dataset/models, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use: $ python3 >>> import nltk >>>...

View answer
alvas's user avatar
  • 120k
29 votes
3 answers
44k views

What exactly is an n Gram?

I found this previous question on SO: N-grams: Explanation + 2 applications. The OP gave this example and asked if it was correct: Sentence: "I live in NY." word level bigrams (2 for n): "# I', "I ...
user2649614's user avatar
Answer Answer recommended by NLP Collective

Usually a picture is worth thousand words. Source: http://recognize-speech.com/language-model/n-gram-model/comparison

View answer
Kamran's user avatar
  • 2,711
179 votes
17 answers
261k views

n-grams in python, four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like: import nltk from nltk import bigrams string = "I really like python, it's pretty awesome." string_bigrams = ...
Shifu's user avatar
  • 2,165
Answer Accepted Answer recommended by NLP Collective

Great native Python-based answers given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library). There is an ...
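
A minimal sketch of that nltk approach for arbitrary n (4-grams shown here):

from nltk import ngrams, word_tokenize   # nltk.download('punkt') may be needed once

string = "I really like python, it's pretty awesome."
tokens = word_tokenize(string)
fourgrams = list(ngrams(tokens, 4))      # change 4 to 5, 6, ... for longer n-grams
print(fourgrams[:2])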

View answer
alvas's user avatar
  • 120k
198 votes
9 answers
157k views

What are all possible POS tags of NLTK?

How do I find a list with all possible POS tags used by the Natural Language Toolkit (NLTK)?
OrangeTux's user avatar
  • 11.4k
Answer Answer recommended by NLP Collective

To save some folks some time, here is a list I extracted from a small corpus. I do not know if it is complete, but it should have most (if not all) of the help definitions from upenn_tagset... CC: ...
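
The same help definitions can also be printed directly from NLTK once the tagset data is downloaded:

import nltk

nltk.download("tagsets")          # help files used by nltk.help
nltk.help.upenn_tagset()          # print every Penn Treebank tag with its definition
nltk.help.upenn_tagset("CC")      # or look up a single tag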

View answer
binarymax's user avatar
  • 3,325
32 votes
7 answers
49k views

What is NLTK POS tagger asking me to download?

I just started using a part-of-speech tagger, and I am facing many problems. I started POS tagging with the following: import nltk text=nltk.word_tokenize("We are going out.Just you and me.") When ...
Pearl's user avatar
  • 759
Answer Answer recommended by NLP Collective

From NLTK versions higher than v3.2, please use: >>> import nltk >>> nltk.__version__ '3.2.1' >>> nltk.download('averaged_perceptron_tagger') [nltk_data] Downloading ...

View answer
alvas's user avatar
  • 120k
191 votes
18 answers
231k views

Failed loading english.pickle with nltk.data.load

When trying to load the punkt tokenizer... import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ...a LookupError was raised: > LookupError: > **************...
Martin's user avatar
  • 1,913
Answer Answer recommended by NLP Collective

The main reason why you see that error is that nltk couldn't find the punkt package. Due to the size of the nltk suite, all available packages are not downloaded by default when one installs it. You can download ...

View answer
Naren Yellavula's user avatar