Questions
Browse questions with relevant NLP tags
39,293 questions
467
votes
18
answers
103k
views
How does the Google "Did you mean?" Algorithm work? [closed]
I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines' ability to very quickly ...
354
votes
7
answers
219k
views
What is "entropy and information gain"?
I am reading this book (NLTK) and it is confusing. Entropy is defined as:
Entropy is the sum of the probability of each label
times the log probability of that same label
How can I apply ...
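The definition quoted in this excerpt can be sketched in a few lines of Python. This is a minimal illustration and not code from the question or the book: the function name `entropy` and the sample labels are assumptions, and the sketch uses the usual Shannon form, which negates the sum so that the result is non-negative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels (hypothetical helper)."""
    counts = Counter(labels)
    total = len(labels)
    # For each distinct label: probability times log2 of that probability.
    # The sum is negated so the entropy comes out non-negative.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two equally likely labels carry one bit of uncertainty.
print(entropy(["male", "male", "female", "female"]))
```

A list containing a single repeated label gives zero entropy, since there is no uncertainty about which label comes next.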
161
votes
34
answers
426k
views
spacy Can't find model 'en_core_web_sm' on windows 10 and Python 3.5.3 :: Anaconda custom (64-bit)
What is the difference between spacy.load('en_core_web_sm') and spacy.load('en')? This link explains different model sizes. But I am still not clear how spacy.load('en_core_web_sm') and spacy.load('en') ...
284
votes
14
answers
304k
views
How to compute the similarity between two text documents?
I am looking at working on an NLP project, in any programming language (though Python will be my preference).
I want to take two documents and determine how similar they are.
217
votes
18
answers
223k
views
googletrans stopped working with error 'NoneType' object has no attribute 'group'
I was trying googletrans and it was working quite well. Since this morning I have been getting the error below. I went through multiple posts from Stack Overflow and other sites and found probably my IP is ...
141
votes
29
answers
403k
views
pip issue installing almost any library
I have a difficult time using pip to install almost anything. I'm new to coding, so I thought maybe this is something I've been doing wrong and have opted out to easy_install to get most of what I ...
191
votes
18
answers
231k
views
Failed loading english.pickle with nltk.data.load
When trying to load the punkt tokenizer...
import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
...a LookupError was raised:
> LookupError:
> **************...
195
votes
16
answers
186k
views
How to determine the language of a piece of text?
I want to get this:
Input text: "ру́сский язы́к"
Output text: "Russian"
Input text: "中文"
Output text: "Chinese"
Input text: "にほんご"
Output text: "...
55
votes
51
answers
18k
views
Is there a human readable programming language? [closed]
I mean, is there a coded language with human style coding?
For example:
Create an object called MyVar and initialize it to 10;
Take MyVar and call MyMethod() with parameters. . .
I know it's not so ...
179
votes
17
answers
261k
views
n-grams in python, four, five, six grams?
I'm looking for a way to split a text into n-grams.
Normally I would do something like:
import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = ...
206
votes
15
answers
164k
views
What is the difference between lemmatization vs stemming?
When do I use each?
Also...is the NLTK lemmatization dependent upon Parts of Speech?
Wouldn't it be more accurate if it was?
190
votes
12
answers
296k
views
How to check if a word is an English word with Python?
I want to check in a Python program if a word is in the English dictionary.
I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.
def ...
198
votes
9
answers
157k
views
What are all possible POS tags of NLTK?
How do I find a list with all possible POS tags used by the Natural Language Toolkit (NLTK)?
124
votes
19
answers
157k
views
Resource u'tokenizers/punkt/english.pickle' not found
My Code:
import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')
ERROR Message:
[ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py
Traceback (most recent ...
186
votes
10
answers
105k
views
Java Stanford NLP: Part of Speech labels?
The Stanford NLP, demo'd here, gives an output like this:
Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./.
What do the Part of Speech tags mean? I am unable to find an official list. Is it ...
156
votes
17
answers
83k
views
Detecting syllables in a word
I need to find a fairly efficient way to detect syllables in a word. E.g.,
Invisible -> in-vi-sib-le
There are some syllabification rules that could be used:
V
CV
VC
CVC
CCV
CCCV
CVCC
*where V is ...
167
votes
13
answers
302k
views
How to get rid of punctuation using NLTK tokenizer?
I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. ...
174
votes
9
answers
79k
views
What does tf.nn.embedding_lookup function do?
tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None)
I cannot understand the duty of this function. Is it like a lookup table? Which means to return the parameters corresponding ...
145
votes
14
answers
133k
views
How to calculate the sentence similarity using word2vec model of gensim with python
According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words.
e.g.
trained_model.similarity('woman', 'man')
0.73723527
However, the ...
114
votes
22
answers
144k
views
How do I do word Stemming or Lemmatization?
I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones.
My test words are: "cats running ran cactus cactuses cacti community communities", and both get ...
61
votes
33
answers
28k
views
What programming language is most like natural language? [closed]
I got the idea for this question from numerous situations where I don't understand what the person is talking about and when others don't understand me.
So, a "smart" solution would be to speak a ...
146
votes
14
answers
271k
views
How to remove stop words using nltk or python
I have a dataset from which I would like to remove stop words.
I used NLTK to get a list of stop words:
from nltk.corpus import stopwords
stopwords.words('english')
Exactly how do I compare the data ...
103
votes
25
answers
19k
views
How can I correctly prefix a word with "a" and "an"?
I have a .NET application where, given a noun, I want it to correctly prefix that word with "a" or "an". How would I do that?
Before you think the answer is to simply check if the first letter is a ...
104
votes
15
answers
129k
views
NLTK download SSL: Certificate verify failed
I get the following error when trying to install Punkt for nltk:
nltk.download('punkt')
[nltk_data] Error loading Punkt: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] ...
100
votes
18
answers
114k
views
How to use Stanford Parser in NLTK using Python
Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)
94
votes
16
answers
97k
views
Ordinal numbers replacement
I am currently looking for a way to replace words like first, second, third, ... with the appropriate ordinal number representation (1st, 2nd, 3rd).
I have been googling for the last week and I didn't ...
118
votes
17
answers
32k
views
How do you implement a "Did you mean"? [duplicate]
Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?
Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>...
138
votes
10
answers
361k
views
how to check which version of nltk, scikit learn installed?
In a shell script I am checking whether these packages are installed or not; if not installed, then install them. So within the shell script:
import nltk
echo nltk.__version__
but it stops shell script at ...
128
votes
6
answers
110k
views
Understanding min_df and max_df in scikit CountVectorizer
I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly mean? Is it the frequency ...
97
votes
12
answers
239k
views
Corpora/stopwords not found when import nltk library
I am trying to import the nltk package in Python 2.7
import nltk
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[:10])
Running this gives me the following error:
LookupError:
...
38
votes
22
answers
13k
views
Code Golf: Number to Words
The code golf series seem to be fairly popular. I ran across some code that converts a number to its word representation. Some examples would be (powers of 2 for programming fun):
2 -> Two
1024 -> ...
133
votes
6
answers
26k
views
How does Apple find dates, times and addresses in emails?
In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink and it is possible to create an appointment or look at a map simply by tapping the link. It not ...
108
votes
8
answers
162k
views
How to change huggingface transformers default cache directory
The default cache directory is short on disk capacity; I need to change the configuration of the default cache directory.
145
votes
4
answers
319k
views
How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?
I'm working on a sentiment analysis problem. The data looks like this:
label instances
5 1190
4 838
3 239
1 204
2 127
So my data is unbalanced since 1190 ...
70
votes
15
answers
210k
views
How do I download NLTK data?
Updated answer: NLTK works well for 2.7. I had 3.2. I uninstalled 3.2 and installed 2.7. Now it works!!
I have installed NLTK and tried to download NLTK Data. What I did was to follow the instruction ...
112
votes
6
answers
155k
views
Python: tf-idf-cosine: to find document similarity
I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the ...
105
votes
9
answers
103k
views
Where does hugging face's transformers save models?
Running the below code downloads a model - does anyone know what folder it downloads it to?
!pip install -q transformers
from transformers import pipeline
model = pipeline('fill-mask')
90
votes
8
answers
151k
views
Calculate cosine similarity given 2 sentence strings
From Python: tf-idf-cosine: to find document similarity, it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are there any ways to calculate ...
96
votes
9
answers
86k
views
How to get vector for a sentence from the word2vec of tokens in sentence
I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vector of the tokens in the sentence....
88
votes
12
answers
52k
views
Sentiment analysis for Twitter in Python [closed]
I'm looking for an open source implementation, preferably in python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such open source ...
119
votes
4
answers
113k
views
What does Keras Tokenizer method exactly do?
On occasion, circumstances require us to do the following:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=my_max)
Then, invariably, we chant this mantra:
tokenizer....
131
votes
6
answers
531k
views
re.sub erroring with "Expected string or bytes-like object"
I have read multiple posts regarding this error, but I still can't figure it out. When I try to loop through my function:
def fix_Plan(location):
letters_only = re.sub("[^a-zA-Z]", # ...
79
votes
6
answers
166k
views
Stopword removal with NLTK
I am trying to process user-entered text by removing stopwords using the NLTK toolkit, but with stopword removal, words like 'and', 'or', 'not' get removed. I want these words to be present after ...
50
votes
14
answers
89k
views
Load Pretrained glove vectors in python
I have downloaded a pretrained GloVe vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word vector binary file using gensim but I don't ...
72
votes
15
answers
188k
views
How to download a model from huggingface?
For example, I want to download bert-base-uncased on https://huggingface.co/models, but can't find a 'Download' link. Or is it not downloadable?
111
votes
7
answers
194k
views
NLTK python error: "TypeError: 'dict_keys' object is not subscriptable"
I'm following instructions for a class homework assignment and I'm supposed to look up the top 200 most frequently used words in a text file.
Here's the last part of the code:
fdist1 = FreqDist(...
94
votes
4
answers
130k
views
Fuzzy String Comparison
What I am striving to complete is a program which reads in a file and will compare each sentence according to the original sentence. The sentence which is a perfect match to the original will receive ...
77
votes
9
answers
69k
views
What do spaCy's part-of-speech and dependency tags mean?
spaCy tags up each of the Tokens in a Document with a part of speech (in two different formats, one stored in the pos and pos_ properties of the Token and the other stored in the tag and tag_ ...
120
votes
3
answers
60k
views
word2vec: negative sampling (in layman term)? [closed]
I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
98
votes
10
answers
102k
views
How to use Bert for long text classification?
We know that BERT has a maximum length limit of 512 tokens. So if an article is much longer than that, such as 10000 tokens of text, how can BERT be used?