NLP Collective

Questions

Browse questions with relevant NLP tags

39,293 questions

467 votes
18 answers
103k views

How does the Google "Did you mean?" Algorithm work? [closed]

I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines' ability to very quickly ...
Andrew Harry
  • 13.9k
354 votes
7 answers
219k views

What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as: Entropy is the sum of the probability of each label times the log probability of that same label How can I apply ...
TIMEX
  • 268k
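
The definition quoted in the excerpt above can be made concrete with a small sketch. Note that the standard Shannon entropy carries a leading minus sign, H = -Σ p·log₂(p), which the quoted sentence leaves out. The function names `entropy` and `information_gain` are illustrative choices, not taken from the NLTK book or the question.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    # H = -sum(p * log2(p)); the minus sign keeps the result non-negative.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its splits."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted
```

An evenly mixed set of two labels has an entropy of one bit, and a split that separates the two labels perfectly recovers all of it as information gain.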
161 votes
34 answers
426k views

spacy Can't find model 'en_core_web_sm' on windows 10 and Python 3.5.3 :: Anaconda custom (64-bit)

What is the difference between spacy.load('en_core_web_sm') and spacy.load('en')? This link explains different model sizes. But I am still not clear how spacy.load('en_core_web_sm') and spacy.load('en') ...
user2543622
  • 6,466
284 votes
14 answers
304k views

How to compute the similarity between two text documents?

I am looking at working on an NLP project, in any programming language (though Python will be my preference). I want to take two documents and determine how similar they are.
Reily Bourne
  • 5,227
217 votes
18 answers
223k views

googletrans stopped working with error 'NoneType' object has no attribute 'group'

I was trying googletrans and it was working quite well. Since this morning I started getting the error below. I went through multiple posts from Stack Overflow and other sites and found probably my IP is ...
steveJ
  • 2,371
141 votes
29 answers
403k views

pip issue installing almost any library

I have a difficult time using pip to install almost anything. I'm new to coding, so I thought maybe this is something I've been doing wrong and have opted out to easy_install to get most of what I ...
contentclown
  • 1,441
191 votes
18 answers
231k views

Failed loading english.pickle with nltk.data.load

When trying to load the punkt tokenizer... import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ...a LookupError was raised: > LookupError: > **************...
Martin
  • 1,913
195 votes
16 answers
186k views

How to determine the language of a piece of text?

I want to get this: Input text: "ру́сский язы́к" Output text: "Russian" Input text: "中文" Output text: "Chinese" Input text: "にほんご" Output text: &...
Rita
  • 2,247
55 votes
51 answers
18k views

Is there a human readable programming language? [closed]

I mean, is there a coded language with human style coding? For example: Create an object called MyVar and initialize it to 10; Take MyVar and call MyMethod() with parameters. . . I know it's not so ...
179 votes
17 answers
261k views

n-grams in python, four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like: import nltk from nltk import bigrams string = "I really like python, it's pretty awesome." string_bigrams = ...
Shifu
  • 2,165
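
As a rough illustration of what the question above asks for, n-grams of any order can be produced with a sliding window over a token list. This is a minimal sketch using plain Python; the helper name `ngrams` and the whitespace tokenization are assumptions for the example, not code from the question (NLTK also ships its own `nltk.util.ngrams`).

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # range() is empty when n exceeds len(tokens), so the result is [].
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace splitting is a simplification; a real tokenizer would
# also separate punctuation from words.
tokens = "I really like python".split()
four_grams = ngrams(tokens, 4)
```

The same call with `n` set to 5 or 6 covers the five- and six-gram cases.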
206 votes
15 answers
164k views

What is the difference between lemmatization vs stemming?

When do I use each? Also, is the NLTK lemmatization dependent upon parts of speech? Wouldn't it be more accurate if it was?
TIMEX
  • 268k
190 votes
12 answers
296k views

How to check if a word is an English word with Python?

I want to check in a Python program if a word is in the English dictionary. I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task. def ...
Barthelemy
  • 8,567
198 votes
9 answers
157k views

What are all possible POS tags of NLTK?

How do I find a list with all possible POS tags used by the Natural Language Toolkit (NLTK)?
OrangeTux
  • 11.4k
124 votes
19 answers
157k views

Resource u'tokenizers/punkt/english.pickle' not found

My Code: import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ERROR Message: [ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py Traceback (most recent ...
Supreeth Meka
186 votes
10 answers
105k views

Java Stanford NLP: Part of Speech labels?

The Stanford NLP, demo'd here, gives an output like this: Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./. What do the Part of Speech tags mean? I am unable to find an official list. Is it ...
Nick Heiner
156 votes
17 answers
83k views

Detecting syllables in a word

I need to find a fairly efficient way to detect syllables in a word. E.g., Invisible -> in-vi-sib-le There are some syllabification rules that could be used: V CV VC CVC CCV CCCV CVCC *where V is ...
user50705
  • 1,623
167 votes
13 answers
302k views

How to get rid of punctuation using NLTK tokenizer?

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. ...
lizarisk
  • 7,750
174 votes
9 answers
79k views

What does tf.nn.embedding_lookup function do?

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None) I cannot understand the duty of this function. Is it like a lookup table? Which means to return the parameters corresponding ...
Poorya Pzm
  • 2,133
145 votes
14 answers
133k views

How to calculate the sentence similarity using word2vec model of gensim with python

According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words. e.g. trained_model.similarity('woman', 'man') 0.73723527 However, the ...
zhfkt
  • 2,441
114 votes
22 answers
144k views

How do I do word Stemming or Lemmatization?

I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones. My test words are: "cats running ran cactus cactuses cacti community communities", and both get ...
manixrock
  • 2,533
61 votes
33 answers
28k views

What programming language is most like natural language? [closed]

I got the idea for this question from numerous situations where I don't understand what the person is talking about and when others don't understand me. So, a "smart" solution would be to speak a ...
146 votes
14 answers
271k views

How to remove stop words using nltk or python

I have a dataset from which I would like to remove stop words. I used NLTK to get a list of stop words: from nltk.corpus import stopwords stopwords.words('english') Exactly how do I compare the data ...
Alex
  • 1,923
103 votes
25 answers
19k views

How can I correctly prefix a word with "a" and "an"?

I have a .NET application where, given a noun, I want it to correctly prefix that word with "a" or "an". How would I do that? Before you think the answer is to simply check if the first letter is a ...
ryeguy
  • 66.4k
104 votes
15 answers
129k views

NLTK download SSL: Certificate verify failed

I get the following error when trying to install Punkt for nltk: nltk.download('punkt') [nltk_data] Error loading Punkt: <urlopen error [SSL: [nltk_data] CERTIFICATE_VERIFY_FAILED] ...
user3429986
  • 1,195
100 votes
18 answers
114k views

How to use Stanford Parser in NLTK using Python

Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)
ThanaDaray
  • 1,693
94 votes
16 answers
97k views

Ordinal numbers replacement

I am currently looking for a way to replace words like first, second, third, ... with the appropriate ordinal number representation (1st, 2nd, 3rd). I have been googling for the last week and I didn't ...
skornos
  • 3,231
118 votes
17 answers
32k views

How do you implement a "Did you mean"? [duplicate]

Possible Duplicate: How does the Google “Did you mean?” Algorithm work? Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>...
pek
  • 18k
138 votes
10 answers
361k views

How to check which version of nltk, scikit learn is installed?

In a shell script I am checking whether these packages are installed or not; if not installed, then install them. So within the shell script: import nltk echo nltk.__version__ but it stops the shell script at ...
nlper
  • 2,377
128 votes
6 answers
110k views

Understanding min_df and max_df in scikit CountVectorizer

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly mean? Is it the frequency ...
moeabdol
  • 4,979
97 votes
12 answers
239k views

Corpora/stopwords not found when import nltk library

I am trying to import the nltk package in Python 2.7: import nltk stopwords = nltk.corpus.stopwords.words('english') print(stopwords[:10]) Running this gives me the following error: LookupError: ...
Frits Verstraten
38 votes
22 answers
13k views

Code Golf: Number to Words

The code golf series seem to be fairly popular. I ran across some code that converts a number to its word representation. Some examples would be (powers of 2 for programming fun): 2 -> Two 1024 -> ...
133 votes
6 answers
26k views

How does Apple find dates, times and addresses in emails?

In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink and it is possible to create an appointment or look at a map simply by tapping the link. It not ...
Martin
  • 40.1k
108 votes
8 answers
162k views

How to change huggingface transformers default cache directory

The default cache directory lacks disk capacity; I need to change the configuration of the default cache directory.
Ivan Lee
  • 4,021
145 votes
4 answers
319k views

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working on a sentiment analysis problem; the data looks like this: label instances 5 1190 4 838 3 239 1 204 2 127 So my data is unbalanced since 1190 ...
new_with_python
70 votes
15 answers
210k views

How do I download NLTK data?

Updated answer: NLTK works well with 2.7. I had 3.2. I uninstalled 3.2 and installed 2.7. Now it works!! I have installed NLTK and tried to download NLTK Data. What I did was to follow the instruction ...
Q-ximi
  • 951
112 votes
6 answers
155k views

Python: tf-idf-cosine: to find document similarity

I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the ...
add-semi-colons
105 votes
9 answers
103k views

Where does hugging face's transformers save models?

Running the below code downloads a model - does anyone know what folder it downloads it to? !pip install -q transformers from transformers import pipeline model = pipeline('fill-mask')
user3472360
  • 1,835
90 votes
8 answers
151k views

Calculate cosine similarity given 2 sentence strings

From Python: tf-idf-cosine: to find document similarity, it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are there any ways to calculate ...
alvas
  • 120k
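
One way to approach the question above with only the standard library is to treat each sentence as a bag of word counts and take the cosine of the two count vectors. This is a minimal sketch under that assumption (raw term counts, no tf-idf weighting); the names `text_to_vector` and `cosine_similarity` are hypothetical helpers chosen for the example.

```python
import math
import re
from collections import Counter


def text_to_vector(text):
    """Map a sentence to a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two sentences."""
    va, vb = text_to_vector(a), text_to_vector(b)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Sentences with identical word counts score 1, and sentences with no words in common score 0.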
96 votes
9 answers
86k views

How to get vector for a sentence from the word2vec of tokens in sentence

I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vector of the tokens in the sentence....
trialcritic
  • 1,245
88 votes
12 answers
52k views

Sentiment analysis for Twitter in Python [closed]

I'm looking for an open source implementation, preferably in Python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such open source ...
Ran
  • 7,609
119 votes
4 answers
113k views

What does Keras Tokenizer method exactly do?

On occasion, circumstances require us to do the following: from keras.preprocessing.text import Tokenizer tokenizer = Tokenizer(num_words=my_max) Then, invariably, we chant this mantra: tokenizer....
Jack Fleeting
131 votes
6 answers
531k views

re.sub erroring with "Expected string or bytes-like object"

I have read multiple posts regarding this error, but I still can't figure it out. When I try to loop through my function: def fix_Plan(location): letters_only = re.sub("[^a-zA-Z]", # ...
imanexcelnoob
79 votes
6 answers
166k views

Stopword removal with NLTK

I am trying to process user-entered text by removing stopwords using the nltk toolkit, but with stopword removal, words like 'and', 'or', 'not' get removed. I want these words to be present after ...
Grahesh Parkar
50 votes
14 answers
89k views

Load Pretrained glove vectors in python

I have downloaded pretrained glove vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word vector binary file using gensim but I don't ...
Same
  • 759
72 votes
15 answers
188k views

How to download a model from huggingface?

For example, I want to download bert-base-uncased on https://huggingface.co/models, but can't find a 'Download' link. Or is it not downloadable?
marlon
  • 7,197
111 votes
7 answers
194k views

NLTK python error: "TypeError: 'dict_keys' object is not subscriptable"

I'm following instructions for a class homework assignment and I'm supposed to look up the top 200 most frequently used words in a text file. Here's the last part of the code: fdist1 = FreqDist(...
user3760644
  • 1,167
94 votes
4 answers
130k views

Fuzzy String Comparison

What I am striving to complete is a program which reads in a file and will compare each sentence according to the original sentence. The sentence which is a perfect match to the original will receive ...
jacksonstephenc
77 votes
9 answers
69k views

What do spaCy's part-of-speech and dependency tags mean?

spaCy tags up each of the Tokens in a Document with a part of speech (in two different formats, one stored in the pos and pos_ properties of the Token and the other stored in the tag and tag_ ...
Mark Amery
  • 150k
120 votes
3 answers
60k views

word2vec: negative sampling (in layman term)? [closed]

I'm reading the paper below and I have some trouble understanding the concept of negative sampling. http://arxiv.org/pdf/1402.3722v1.pdf Can anyone help, please?
Andy K
  • 5,014
98 votes
10 answers
102k views

How to use Bert for long text classification?

We know that BERT has a maximum length limit of 512 tokens. So if an article is much longer than 512 tokens, such as 10,000 tokens of text, how can BERT be used?
user1337896
  • 1,261

