
NLP Collective

Questions

Browse questions with relevant NLP tags

39,269 questions

0 votes
0 answers
3 views

BertTokenizer vocab_size remains unchanged after adding tokens

I am using the HuggingFace BertTokenizer and adding some tokens to it. Here is the code: from transformers import BertTokenizer tokenizer = BertTokenizer.from_pretrained('fnlp/bart-base-chinese') print(...
Raptor's user avatar
  • 53.6k
0 votes
1 answer
10 views

SageMaker training: what's the correct REGEX pattern to capture metrics?

This is the pattern I've seen suggested in a few different posts on SO: metric_definitions = [ {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"}, {'Name': 'learning_rate', ...
Yoan B. M.Sc's user avatar
  • 1,503
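
The quoted `loss` pattern can be sanity-checked locally before it goes into `metric_definitions`. The following is a minimal sketch using Python's `re` module as a stand-in; SageMaker's own matcher may behave differently. The sample log line is made up for illustration and is not taken from the question, and the escaped-dot variant is one possible tightening, not a confirmed fix.

```python
import re

# Pattern quoted in the question excerpt (the unescaped "." matches any character).
QUOTED = r"'loss': ([0-9]+(.|e\-)[0-9]+),?"
# Assumed tightening: escape the dot so only a literal "." or "e-" separates the digits.
STRICT = r"'loss': ([0-9]+(\.|e\-)[0-9]+),?"

# Hypothetical log line, for illustration only.
sample = "{'loss': 0.6931, 'learning_rate': 5e-05, 'epoch': 1.0}"


def first_capture(pattern, text):
    """Return the first capture group of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


quoted_value = first_capture(QUOTED, sample)
strict_value = first_capture(STRICT, sample)
loose_hit = first_capture(QUOTED, "'loss': 1x5")    # unescaped dot accepts "x"
strict_hit = first_capture(STRICT, "'loss': 1x5")   # escaped dot rejects it
```

Both patterns capture the loss value from the sample line; they differ only on malformed input, where the unescaped dot lets any character sit between the digits.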
0 votes
0 answers
6 views

GGUF model in LM Studio returns broken answer

I am trying to run the GGUF model QuantFactory/T-lite-instruct-0.1-GGUF, specifically its quantized version T-lite-instruct-0.1.Q2_K.gguf, in LM Studio. Sometimes it works fine. But sometimes it returns "...
pav's user avatar
  • 99
1 vote
0 answers
15 views

LDA is predicting same topics for all data

I'm using the German political speech dataset to train the LDA model. My goal here is to categorize each speech into some topics. But the problem is that the generated topics are too similar, and all ...
Ryu Ahmed's user avatar
0 votes
0 answers
5 views

RuntimeError with DeBERTaV3 Sequence Classification: Tensor Size Mismatch

I am trying to fine-tune the microsoft/deberta-v3-base model for sequence classification with three labels. I have set up my tokenizer and data preprocessing, but I encounter a RuntimeError during ...
suri's user avatar
  • 21
-1 votes
0 answers
13 views

How can I use Word Embeddings for Sentiment Analysis?

I have a project where I've created a classifier but I've learned that word embeddings are a better approach. From my search, I found that CBOW and Skip-grams are the methods to use with Word2Vec. I ...
LoukasPap's user avatar
  • 1,350
1 vote
0 answers
17 views

CPU Memory Leak While Inference Models in Infinite Loop

I'm experiencing a CPU memory leak while running a Python script that processes text using various NLP models in an infinite loop. The script includes language translation, sentiment analysis, and ...
Amritesh Nandan's user avatar
-2 votes
0 answers
28 views

Divide a text based on Intent Analysis with NLP

I have this input from a chat: "Set an alarm for 7:00 am and play a song by Caparezza on Spotify." The input may contain multiple actions to do on the back-end. I want to divide a text based ...
flowibbia's user avatar
0 votes
2 answers
71 views

What is the best practice to calculate global frequency of list of elements with exact orders in python within multiple pandas dataframe?

Let's say I have the following dataframe df1 corresponding to user1: +-------------------+-------+--------+-------+-------+----------+----------------+ | Models | MAE | MSE | RMSE | ...
Mario's user avatar
  • 1,831
0 votes
0 answers
26 views

CUDA error: device-side assert triggered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions

I am trying to convert my text into embeddings using a BERT model. When I apply this to my dataset, it works fine for some of my inputs, then stops and gives that error. I have set ...
Gaurav B.V's user avatar
-3 votes
0 answers
25 views

Extracting CEO Information [closed]

I am working on a project in which I have to extract CEO information (company, name, and tenure) from the last 25 years throughout the US and save it in CSV format for further ...
Izhan Ali Syed's user avatar
-1 votes
0 answers
24 views

Poor Performance and Signs of Overfitting When Fine-Tuning BART with Adapters on CNN/DailyMail Dataset

I am currently fine-tuning the BART model with adapters for a summarization task using the CNN/DailyMail dataset. I've noticed that the model shows poor performance and signs of overfitting. Below is ...
Emilia Delizia's user avatar
1 vote
0 answers
15 views

Execute Lucene query in multiple languages utilizing an AI model

We have a requirement to support multi-language search on the same field. For example, the title is "Badminton" and the subject is "sports". I want to search in Solr like title:Badminton ...
Jigar Gajjar's user avatar
-1 votes
0 answers
18 views

Multitasking bert for multilabel classification of 5 classes [duplicate]

I built 5 BioClinicalBERT-based models (finetuned bert) to predict labels for medical records for the following categories: specialties = ["aud","den","oph","oto&...
FATMA HAMZA's user avatar
-1 votes
0 answers
11 views

Hybridized collaborative filtering and sentence similarity-based system for doctor recommendation based on user input of symptoms and location

I'm trying to solve a problem of recommending a doctor based on a user's symptoms and location using a hybridized collaborative filtering and sentence similarity-based recommender system that follow ...
Sadura Akinrinwa's user avatar
1 vote
0 answers
22 views

Multitasking bert for multilabel classification of 5 categories

I built and finetuned 5 BioClinicalBERT-based models (finetuned bert) to predict labels for medical records for the following categories: specialties = ["aud","den","oph",...
FATMA HAMZA's user avatar
1 vote
0 answers
8 views

Hugging Face pipeline vs manual processing produces different embeddings for Vision Transformers

I am using the transformers library with the ViTForImageClassification model ('google/vit-base-patch16-224') to extract embeddings from images. However, I am observing different embeddings when I use ...
martinelliadr's user avatar
0 votes
0 answers
14 views

RuntimeError: Failed to import transformers.training_args

I am trying to use transformers in a task of building a chatbot from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, trainer import torch import time ...
Chawki.Hjaiji's user avatar
0 votes
0 answers
34 views

How do I run this model in HuggingFace from Nvidia and Mistral?

The model is: nvidia/Mistral-NeMo-12B-Instruct And the link in HuggingFace nvidia/Mistral-NeMo-12B-Instruct Most model pages in HuggingFace have example Python code. But this model page doesn't have ...
abbas-h's user avatar
  • 420
0 votes
1 answer
18 views

HF transformers: ValueError: Unable to create tensor

I was following this guide for text classification and I got an error: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=...
Ryan's user avatar
  • 402
0 votes
1 answer
39 views

Azure AI Search Scoring Profiles are not modifying the score retrieval

I have been using Azure AI Search and scoring profiles to boost the documents in my index that come from the 'reviewed' source, meaning I want to send to the very TOP the documents that have the string '...
R_Student's user avatar
  • 711
0 votes
1 answer
36 views

Score Profiles Azure AI search NOT WORKING

I have configured a default score profile on my index to use on all of my searches. I have a test index that has a field named 'source'; if the field is == to 'reviewed' I want those docs to be move up ...
R_Student's user avatar
  • 711
0 votes
0 answers
13 views

BPE tokenizer add_tokens overlap with trained tokens

I am training a BPE from scratch. I want the vocabulary to include certain tokens that might or might not exist in the training dataset. from datasets import load_dataset from tokenizers import models,...
meliksahturker's user avatar
0 votes
0 answers
35 views

Separating text into smaller chunks based on meaning

I am working on a project involving approximately 8,000 job advertisements in CSV format. I have extracted job titles, IDs, descriptions, and other relevant information and saved it in a PostgreSQL ...
Ameya's user avatar
  • 1
0 votes
0 answers
18 views

Transformer models for contextual word embedding in large datasets

I'm interested in using contextual word embeddings generated by a transformer-based model to explore the similarity of certain words in a large dataset. Most transformer models only allow up to 512 ...
C_B's user avatar
  • 13
0 votes
1 answer
52 views

Do I need to use Named Entity Recognition (NER) in tokenization?

I am working on an NLP project for sentiment analysis. I am using SpaCy to tokenize sentences. As I was reading the documentation, I learned about NER. I've read that it can be used to extract ...
LoukasPap's user avatar
  • 1,350
-2 votes
0 answers
17 views

Big O notation of neural network [closed]

My problem is how to calculate the computational complexity, in big O notation, of a deep neural network, CART, LightGBM and random forest. And where can I find the proofs for these? I ...
Guod Wu's user avatar
-4 votes
0 answers
35 views

Using regex for Account Number Extraction [closed]

Using Regex, how to read the accounts from below table in such a manner that from the first row, four IDs can be extracted- 300501798101, 359073848101, 359073848102 and 300501798101 whereas from the ...
Rohit's user avatar
  • 9
-1 votes
0 answers
16 views

How to Modify and Replace Embeddings in a Large Language Model (LLM)? [closed]

I am a beginner in large language models (LLMs) and I am working on a project. I have a question regarding embeddings in an LLM. How can I modify the embeddings of an LLM? Are they stored in a ...
Steven Thorn's user avatar
0 votes
0 answers
66 views

CUDA out of memory when training Llama-2-7b-hf model locally

I want to finetune meta-llama/Llama-2-7b-hf locally on my laptop. I am running out of CUDA memory when instantiating the Trainer class. I have 16 GB of system RAM and a GTX 1060 with 6 GB of GPU memory. I ...
Vinmean's user avatar
  • 113
0 votes
0 answers
8 views

Fine-Tuning T5 for Question Answering using HuggingFace Transformers, Pytorch Lightning & Python

I am trying to follow a video on fine-tuning T5 for question answering, link: https://www.youtube.com/watch?v=r6XY80Z9eSA&list=RDCMUCoW_WzQNJVAjxo4osNAxd_g&index=1. When I run 53 trainer.fit(model,...
Nhất Duy Nguyễn Trần's user avatar
0 votes
0 answers
22 views

Is updating points in Qdrant vectordb without re-embedding the data safe?

I'm building a RAG chatbot using Langchain, using the data I've stored in a Qdrant vector database. I wanted to change the metadata of a few documents in my qdrant vector database. For this, I stored ...
Akshitha Rao's user avatar
0 votes
0 answers
17 views

Transformer Model Repeating Same Codon During Inference Despite High Training Accuracy

I'm working on a transformer-based model to translate amino acids to codons. During training and validation, my model achieves 95-98% accuracy. However, during inference, I encounter an issue where ...
Farshid B's user avatar
-1 votes
1 answer
34 views

IndexError: list index out of range, when trying to predict from the fine-tuned model using Hugging Face

I am trying to learn how to fine-tune a pretrained model and use it. This is my code: from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer from ...
Lijin Durairaj's user avatar
-2 votes
1 answer
29 views

How to Implement NLP for Text Analysis in Evaluating Business Projects? [closed]

I need to evaluate business activities (projects) for eligibility based on specific criteria. We gather data through interviews with stakeholders, capturing details like project names, descriptions, ...
quadratic's user avatar
-3 votes
1 answer
18 views

How to match job titles with vacancy names or vacancy descriptions? [closed]

How to match 400 professions to 10,000 job vacancies? I have two files: one contains the profession names and the sector to which they belong, and the second file is 10,000 vacancies from hh.kz, ...
Maulen Omirtay's user avatar
1 vote
2 answers
62 views

Identify starting row of actual data in Pandas DataFrame with merged header cells

My original df looks like this - df Note in the data frame: The headers are there till row 3 & from row 4 onwards, the values for those headers are starting. The numbers of rows & columns ...
Debojit Roy's user avatar
-1 votes
0 answers
26 views

How to Estimate GPU Memory for training and inference, Data Requirements, and Training Time for Large Language Models?

This is a very concrete and well-defined computer engineering question. I don't understand why someone would want to close it. Today, I faced this question during an interview for an ML Engineer ...
maplemaple's user avatar
  • 1,435
-1 votes
0 answers
38 views
+50

How to use HuggingFace's run_translation.py script to train a translation from scratch?

I tried various HuggingFace scripts to build language models, such as run_mlm.py (link), run_clm.py (link) and run_translation.py (link). For the former 2 scripts, it can train a language model from ...
Raptor's user avatar
  • 53.6k
0 votes
0 answers
18 views

Training LLM uses unexpected amount of GPU memory

I'm training a model with a self-implemented training loop. A 1.5B Qwen2 occupies 40G of GPU memory. When I did the same training using LLaMA-Factory, it only took about 24G. I tried to delete some ...
StaEx_G's user avatar
  • 13
0 votes
0 answers
31 views

How to evaluate LLM response [closed]

I am retrieving responses using the Qwen 72B model. I want to validate my responses but don’t have ground-truth answers. How can I evaluate my responses without ground-truth answers? I want to use ...
Prashanth Kolaneru's user avatar
-1 votes
0 answers
19 views

What kind of pre-processing needs to be applied to a sentence before passing it to a dependency parser?

I'm trying out sentiment analysis where I convert the sentence into a Graph with nodes being word embedding and edges being dependency between the two words. I'm still confused how exactly should I ...
Harsh Chauhan's user avatar
0 votes
0 answers
18 views

Finetuning BERT on classification task, tensor device mismatch error

I'm having trouble on fine-tuning a BERT model on a classification task, as I'm quite new to this. My data is composed of two columns, "item_title" (my input) and "meta_categ_id" (...
Jerry Zhu's user avatar
-1 votes
0 answers
48 views

cleaning list object containing text and creating new variables using Python

I am trying to create a data frame running the following code - # pip install edgartools import pandas as pd from edgar import * # Tell the SEC who you are set_identity("Your Name youremail@...
Sharif's user avatar
  • 177
0 votes
0 answers
37 views

ValueError: expected sequence of length 129 at dim 1 (got 46)

I was trying to fine-tune an image-to-text model using the following code: import json import torch from torch.utils.data import DataLoader import io from transformers import VisionEncoderDecoderModel,...
demostene's user avatar
-1 votes
0 answers
23 views

Why is the spaCy model not recognizing the entities of the trained model? [closed]

I created a training dataset for a natural language processing model, using the spaCy library, based on an environmental publication about an oil spill in the northeastern sea. ...
user26424635's user avatar
0 votes
0 answers
22 views

Huggingface Trainer CUDA Out Of Memory for 500M Model

I'm training MobiLLama for classification. This model has just 500 million parameters, and when I fine-tune it for downstream tasks, the trainer keeps giving me the CUDA out of memory error. I faced ...
Hoangdz's user avatar
  • 187
-1 votes
0 answers
9 views

I want to evaluate three models (LDA, LSA and CTM) for my data based on coherence score?

My name is Phani. I want to choose the best model for my data among Latent Dirichlet Allocation, Latent Semantic Analysis and the Correlated Topic Model. I already preprocessed the data but I want ...
Phaneswar Manchina's user avatar
1 vote
1 answer
30 views

How to deal with word counts of zero when calculating Pointwise Mutual Information (PMI) for word cooccurrences in Natural Language Processing

I have a co-occurrence matrix of words in a text (two words x and y are considered co-occurring, if they both occur in a context window of w words). I want to calculate the Pointwise Mutual ...
AlinaOs's user avatar
  • 25
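
One common way to sidestep log(0) for word pairs that never co-occur is positive PMI (PPMI), which sets those cells, and any negative PMI values, to 0. The sketch below assumes that convention; it is not necessarily what the asker chose. The function name `ppmi` and the toy count matrix are hypothetical, and the matrix is treated as a plain list of lists of co-occurrence counts.

```python
import math


def ppmi(counts):
    """Positive PMI for a co-occurrence count matrix given as a list of lists.

    PMI(x, y) = log2( c(x, y) * N / (c(x) * c(y)) ), where N is the sum of all
    counts and c(x), c(y) are the row and column sums. Cells with a zero count
    are set to 0.0 instead of -infinity (the PPMI convention, an assumption here).
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]

    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, count in enumerate(row):
            if count == 0:
                # Never co-occurred: avoid log(0) and report 0.0.
                out_row.append(0.0)
            else:
                pmi = math.log2(count * total / (row_sums[i] * col_sums[j]))
                out_row.append(max(pmi, 0.0))  # clip negative PMI to 0
        result.append(out_row)
    return result


# Toy 2x2 matrix, for illustration only.
example = ppmi([[0, 2], [2, 0]])
```

Alternatives such as add-k smoothing of the counts keep negative associations visible, at the cost of an extra hyperparameter.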
-3 votes
0 answers
20 views

Need guidance on a document version control project [closed]

I have a document version control project where basically two things need to be done: (1) identify which document is the latest of them; (2) what are the historical version control changes on the documents? ...
Daremitsu's user avatar
  • 609

