NLP Collective

0 votes

0 answers

3 views

BertTokenizer vocab_size remains unchanged after adding tokens

I am using the HuggingFace BertTokenizer and adding some tokens to it. Here is the code: from transformers import BertTokenizer tokenizer = BertTokenizer.from_pretrained('fnlp/bart-base-chinese') print(...

Raptor

53.6k

asked 5 mins ago

0 votes

1 answer

10 views

SageMaker training: what's the correct regex pattern to capture metrics?

This is the pattern I've seen suggested in a few different posts on SO: metric_definitions = [ {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"}, {'Name': 'learning_rate', ...

Yoan B. M.Sc

1,503

asked 14 hours ago

0 votes

0 answers

6 views

GGUF model in LM Studio returns broken answer

I am trying to run the GGUF model QuantFactory/T-lite-instruct-0.1-GGUF, specifically its quantized version T-lite-instruct-0.1.Q2_K.gguf, in LM Studio. Sometimes it works fine. But sometimes it returns "...

pav

99

asked 15 hours ago

1 vote

0 answers

15 views

LDA is predicting same topics for all data

I'm using the German political speech dataset to train the LDA model. My goal here is to categorize each speech into some topics. But the problem is that the generated topics are too similar, and all ...

Ryu Ahmed

11

asked 15 hours ago

0 votes

0 answers

5 views

RuntimeError with DeBERTaV3 Sequence Classification: Tensor Size Mismatch

I am trying to fine-tune the microsoft/deberta-v3-base model for sequence classification with three labels. I have set up my tokenizer and data preprocessing, but I encounter a RuntimeError during ...

suri

21

asked 15 hours ago

-1 votes

0 answers

13 views

How can I use Word Embeddings for Sentiment Analysis?

I have a project where I've created a classifier but I've learned that word embeddings are a better approach. From my search, I found that CBOW and Skip-grams are the methods to use with Word2Vec. I ...

LoukasPap

1,350

asked 17 hours ago

1 vote

0 answers

17 views

CPU Memory Leak While Inference Models in Infinite Loop

I'm experiencing a CPU memory leak while running a Python script that processes text using various NLP models in an infinite loop. The script includes language translation, sentiment analysis, and ...

Amritesh Nandan

41

asked 20 hours ago

-2 votes

0 answers

28 views

Divide a text based on Intent Analysis with NLP

I have this input from a chat: "Set an alarm for 7:00 am and play a song by Caparezza on Spotify." The input may contain multiple actions to do on the back-end. I want to divide a text based ...

flowibbia

17

asked yesterday

0 votes

2 answers

71 views

What is the best practice to calculate global frequency of list of elements with exact orders in python within multiple pandas dataframe?

Let's say I have the following dataframe df1 corresponding to user1: +-------------------+-------+--------+-------+-------+----------+----------------+ | Models | MAE | MSE | RMSE | ...

Mario

1,831

asked yesterday

0 votes

0 answers

26 views

CUDA error: device-side assert triggered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions

I am trying to convert my text into embeddings using a BERT model. When I apply this to my dataset, it works fine for some of my inputs, then stops and gives that error. I have set ...

Gaurav B.V

1

asked yesterday

-3 votes

0 answers

25 views

Extracting CEO Information [closed]

I am working on a project in which I have to extract CEO information (company, name, and tenure) for the last 25 years throughout the US and save it in CSV format for further ...

Izhan Ali Syed

1

asked yesterday

-1 votes

0 answers

24 views

Poor Performance and Signs of Overfitting When Fine-Tuning BART with Adapters on CNN/DailyMail Dataset

I am currently fine-tuning the BART model with adapters for a summarization task using the CNN/DailyMail dataset. I've noticed that the model shows poor performance and signs of overfitting. Below is ...

Emilia Delizia

349

asked 2 days ago

1 vote

0 answers

15 views

Execute Lucene query in multiple languages using an AI model

We have a requirement to support multi-language search on the same field. For example, the title is "Badminton" and the subject is "sports". I want to search in Solr like title:Badminton ...

Jigar Gajjar

333

asked 2 days ago

-1 votes

0 answers

18 views

Multitasking bert for multilabel classification of 5 classes [duplicate]

I built 5 BioClinicalBERT-based models (finetuned bert) to predict labels for medical records for the following categories: specialties = ["aud","den","oph","oto&...

FATMA HAMZA

9

asked 2 days ago

-1 votes

0 answers

11 views

Hybridized collaborative filtering and sentence similarity-based system for doctor recommendation based on user input of symptoms and location

I'm trying to solve a problem of recommending a doctor based on a user's symptoms and location using a hybridized collaborative filtering and sentence similarity-based recommender system that follow ...

Sadura Akinrinwa

1

asked 2 days ago

1 vote

0 answers

22 views

Multitasking bert for multilabel classification of 5 categories

I built and finetuned 5 BioClinicalBERT-based models (finetuned bert) to predict labels for medical records for the following categories: specialties = ["aud","den","oph",...

FATMA HAMZA

9

asked 2 days ago

1 vote

0 answers

8 views

Hugging Face pipeline vs manual processing produces different embeddings for Vision Transformers

I am using the transformers library with the ViTForImageClassification model ('google/vit-base-patch16-224') to extract embeddings from images. However, I am observing different embeddings when I use ...

martinelliadr

11

asked 2 days ago

0 votes

0 answers

14 views

RuntimeError: Failed to import transformers.training_args

I am trying to use transformers in a task of building a chatbot from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, trainer import torch import time ...

Chawki.Hjaiji

1

asked 2 days ago

0 votes

0 answers

34 views

How do I run this model in HuggingFace from Nvidia and Mistral?

The model is: nvidia/Mistral-NeMo-12B-Instruct And the link in HuggingFace nvidia/Mistral-NeMo-12B-Instruct Most model pages in HuggingFace have example Python code. But this model page doesn't have ...

abbas-h

420

asked Jul 23 at 7:21

0 votes

1 answer

18 views

HF transformers: ValueError: Unable to create tensor

I was following this guide for text classification and I got an error: ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=...

Ryan

402

asked Jul 23 at 2:57

0 votes

1 answer

39 views

Azure AI Search Scoring Profiles are not modifying the score retrieval

I have been using Azure AI Search and scoring profiles to boost the documents in my index that come from the 'reviewed' source. That means I want to send to the very top the documents that have the string '...

R_Student

711

asked Jul 23 at 2:37

0 votes

1 answer

36 views

Score Profiles Azure AI search NOT WORKING

I have configured a default scoring profile on my index to use on all of my searches. I have a test index that has a field named 'source'; if the field is equal to 'reviewed', I want those docs to be moved up ...

R_Student

711

asked Jul 22 at 22:45

0 votes

0 answers

13 views

BPE tokenizer add_tokens overlap with trained tokens

I am training a BPE from scratch. I want the vocabulary to include certain tokens that might or might not exist in the training dataset. from datasets import load_dataset from tokenizers import models,...

meliksahturker

1,404

asked Jul 22 at 19:11

0 votes

0 answers

35 views

Separating text into smaller chunks based on meaning

I am working on a project involving approximately 8,000 job advertisements in CSV format. I have extracted job titles, IDs, descriptions, and other relevant information and saved it in a PostgreSQL ...

Ameya

1

asked Jul 22 at 15:21

0 votes

0 answers

18 views

Transformer models for contextual word embedding in large datasets

I'm interested in using contextual word embeddings generated by a transformer-based model to explore the similarity of certain words in a large dataset. Most transformer models only allow up to 512 ...

C_B

13

asked Jul 22 at 13:48

0 votes

1 answer

52 views

Do I need to use Named Entity Recognition (NER) in tokenization?

I am working on an NLP project for sentiment analysis. I am using SpaCy to tokenize sentences. As I was reading the documentation, I learned about NER. I've read that it can be used to extract ...

LoukasPap

1,350

asked Jul 22 at 13:28

-2 votes

0 answers

17 views

Big O notation of neural network [closed]

My question is how to calculate the computational complexity, in big O notation, of a deep neural network, CART, LightGBM, and random forest, and where I can find the proofs of these. I ...

Guod Wu

1

asked Jul 22 at 11:53

-4 votes

0 answers

35 views

Using regex for Account Number Extraction [closed]

Using Regex, how to read the accounts from below table in such a manner that from the first row, four IDs can be extracted- 300501798101, 359073848101, 359073848102 and 300501798101 whereas from the ...

Rohit

9

asked Jul 22 at 8:58

-1 votes

0 answers

16 views

How to Modify and Replace Embeddings in a Large Language Model (LLM)? [closed]

I am a beginner in large language models (LLMs) and I am working on a project. I have a question regarding embeddings in an LLM. How can I modify the embeddings of an LLM? Are they stored in a ...

Steven Thorn

1

asked Jul 22 at 3:09

0 votes

0 answers

66 views

CUDA out of memory when training Llama-2-7b-hf model locally

I want to fine-tune meta-llama/Llama-2-7b-hf locally on my laptop. I am running out of CUDA memory when instantiating the Trainer class. I have 16 GB of system RAM and a GTX 1060 with 6 GB of GPU memory. I ...

Vinmean

113

asked Jul 22 at 2:27

0 votes

0 answers

8 views

Fine-Tuning T5 for Question Answering using HuggingFace Transformers, Pytorch Lightning & Python

I am trying to follow a video on fine-tuning T5 for question answering (link: https://www.youtube.com/watch?v=r6XY80Z9eSA&list=RDCMUCoW_WzQNJVAjxo4osNAxd_g&index=1). When I run 53 trainer.fit(model,...

Nhất Duy Nguyễn Trần

1

asked Jul 21 at 22:03

0 votes

0 answers

22 views

Is updating points in Qdrant vectordb without re-embedding the data safe?

I'm building a RAG chatbot using Langchain, using the data I've stored in a Qdrant vector database. I wanted to change the metadata of a few documents in my qdrant vector database. For this, I stored ...

Akshitha Rao

11

asked Jul 21 at 21:14

0 votes

0 answers

17 views

Transformer Model Repeating Same Codon During Inference Despite High Training Accuracy

I'm working on a transformer-based model to translate amino acids to codons. During training and validation, my model achieves 95-98% accuracy. However, during inference, I encounter an issue where ...

Farshid B

1

asked Jul 21 at 11:06

-1 votes

1 answer

34 views

IndexError: list index out of range, when trying to predict from the fine-tuned model using Hugging Face

I am trying to learn how to fine-tune a pretrained model and use it. This is my code: from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer from ...

Lijin Durairaj

5,072

asked Jul 20 at 19:44

-2 votes

1 answer

29 views

How to Implement NLP for Text Analysis in Evaluating Business Projects? [closed]

I need to evaluate business activities (projects) for eligibility based on specific criteria. We gather data through interviews with stakeholders, capturing details like project names, descriptions, ...

quadratic

1

asked Jul 20 at 18:23

-3 votes

1 answer

18 views

How to match job titles with vacancy names or vacancy descriptions? [closed]

How to match 400 professions to 10,000 job vacancies? I have two files: one contains the profession names and the sector to which they belong, and the second file is 10,000 vacancies from hh.kz, ...

Maulen Omirtay

1

asked Jul 20 at 14:06

1 vote

2 answers

62 views

Identify starting row of actual data in Pandas DataFrame with merged header cells

My original df looks like this - df Note in the data frame: The headers are there till row 3 & from row 4 onwards, the values for those headers are starting. The numbers of rows & columns ...

Debojit Roy

11

asked Jul 20 at 10:55

-1 votes

0 answers

26 views

How to Estimate GPU Memory for training and inference, Data Requirements, and Training Time for Large Language Models?

This is a very concrete and well-defined computer engineering question. I don't understand why someone would want to close it. Today, I faced this question during an interview for an ML Engineer ...

maplemaple

1,435

asked Jul 20 at 7:32

-1 votes

0 answers

38 views

+50

How to use HuggingFace's run_translation.py script to train a translation from scratch?

I tried various HuggingFace scripts to build language models, such as run_mlm.py (link), run_clm.py (link) and run_translation.py (link). For the former 2 scripts, it can train a language model from ...

Raptor

53.6k

asked Jul 19 at 14:53

0 votes

0 answers

18 views

Training LLM uses unexpected amount of GPU memory

I'm training a model with a self-implemented training loop. A 1.5B Qwen2 occupies 40G of GPU memory. When I did the same training using llama factory, it only takes about 24G. I tried to delete some ...

StaEx_G

13

asked Jul 19 at 10:02

0 votes

0 answers

31 views

How to evaluate LLM response [closed]

I am retrieving responses using the QWEN 72B model. I want to validate my responses but don't have ground truth answers. How can I evaluate my responses without the help of ground truth answers? I want to use ...

Prashanth Kolaneru

15

asked Jul 19 at 9:32

-1 votes

0 answers

19 views

What kind of pre-processing should be applied to a sentence before passing it to a dependency parser?

I'm trying out sentiment analysis where I convert the sentence into a Graph with nodes being word embedding and edges being dependency between the two words. I'm still confused how exactly should I ...

Harsh Chauhan

1

asked Jul 19 at 6:57

0 votes

0 answers

18 views

Finetuning BERT on classification task, tensor device mismatch error

I'm having trouble fine-tuning a BERT model on a classification task, as I'm quite new to this. My data is composed of two columns, "item_title" (my input) and "meta_categ_id" (...

Jerry Zhu

1

asked Jul 18 at 20:01

-1 votes

0 answers

48 views

cleaning list object containing text and creating new variables using Python

I am trying to create a data frame running the following code - # pip install edgartools import pandas as pd from edgar import * # Tell the SEC who you are set_identity("Your Name youremail@...

Sharif

177

asked Jul 18 at 19:23

0 votes

0 answers

37 views

ValueError: expected sequence of length 129 at dim 1 (got 46)

I was trying to fine-tune an image-to-text model using the following code: import json import torch from torch.utils.data import DataLoader import io from transformers import VisionEncoderDecoderModel,...

demostene

1

asked Jul 18 at 18:40

-1 votes

0 answers

23 views

Why is the spaCy model not recognizing the entities from the trained model? [closed]

I created a training set for a natural language processing model, using the spaCy library, based on an environmental publication about an oil spill in the northeastern sea. ...

user26424635

1

asked Jul 18 at 18:17

0 votes

0 answers

22 views

Huggingface Trainer CUDA Out Of Memory for 500M Model

I'm training MobiLLama for classification. This model has just 500 million parameters, and when I fine-tune it for the downstream tasks, the trainer keeps giving me the CUDA out of memory error. I faced ...

Hoangdz

187

asked Jul 18 at 16:28

-1 votes

0 answers

9 views

I want to evaluate three models (LDA, LSA and CTM) for my data based on coherence score

My name is Phani. I want to choose the best model for my data among Latent Dirichlet Allocation, Latent Semantic Analysis and Correlated Topic Model. I already preprocessed the data but I want ...

Phaneswar Manchina

1

asked Jul 18 at 14:19

1 vote

1 answer

30 views

How to deal with word counts of zero when calculating Pointwise Mutual Information (PMI) for word cooccurrences in Natural Language Processing

I have a co-occurrence matrix of words in a text (two words x and y are considered co-occurring, if they both occur in a context window of w words). I want to calculate the Pointwise Mutual ...

AlinaOs

25

asked Jul 18 at 12:36

-3 votes

0 answers

20 views

Need guidance on a document version control project [closed]

I have a document version control project where basically two things need to be done: identify which document is the latest, and determine what the historical version control changes on the documents are. ...

Daremitsu

609

asked Jul 18 at 11:39

Collectives™ on Stack Overflow

Questions

39,269 questions