NLP Collective

467 votes

18 answers

103k views

How does the Google "Did you mean?" Algorithm work? [closed]

I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I've been really impressed with some search engines' ability to very quickly ...

Andrew Harry

13.9k

asked Nov 20, 2008 at 23:34

354 votes

7 answers

219k views

What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as: "Entropy is the sum of the probability of each label times the log probability of that same label." How can I apply ...

TIMEX

268k

asked Dec 7, 2009 at 11:54

161 votes

34 answers

426k views

spacy Can't find model 'en_core_web_sm' on windows 10 and Python 3.5.3 :: Anaconda custom (64-bit)

What is the difference between spacy.load('en_core_web_sm') and spacy.load('en')? This link explains different model sizes. But I am still not clear how spacy.load('en_core_web_sm') and spacy.load('en') ...

user2543622

6,466

asked Jan 23, 2019 at 19:24

284 votes

14 answers

304k views

How to compute the similarity between two text documents?

I am looking at working on an NLP project, in any programming language (though Python will be my preference). I want to take two documents and determine how similar they are.

Reily Bourne

5,227

asked Jan 17, 2012 at 15:51

217 votes

18 answers

223k views

googletrans stopped working with error 'NoneType' object has no attribute 'group'

I was trying googletrans and it was working quite well. Since this morning I started getting the error below. I went through multiple posts from Stack Overflow and other sites and found probably my IP is ...

steveJ

2,371

asked Sep 22, 2018 at 10:29

141 votes

29 answers

403k views

pip issue installing almost any library

I have a difficult time using pip to install almost anything. I'm new to coding, so I thought maybe this is something I've been doing wrong and have opted out to easy_install to get most of what I ...

contentclown

1,441

asked May 4, 2013 at 4:29

191 votes

18 answers

231k views

Failed loading english.pickle with nltk.data.load

When trying to load the punkt tokenizer... import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ...a LookupError was raised: > LookupError: > **************...

Martin

1,913

asked Feb 1, 2011 at 19:43

195 votes

16 answers

186k views

How to determine the language of a piece of text?

I want to get this: Input text: "ру́сский язы́к" Output text: "Russian" Input text: "中文" Output text: "Chinese" Input text: "にほんご" Output text: "...

Rita

2,247

asked Aug 25, 2016 at 10:26

55 votes

51 answers

18k views

Is there a human readable programming language? [closed]

I mean, is there a coded language with human style coding? For example: Create an object called MyVar and initialize it to 10; Take MyVar and call MyMethod() with parameters. . . I know it's not so ...

Community wiki

5 revs, 5 users 89%
Enreeco

179 votes

17 answers

261k views

n-grams in python, four, five, six grams?

I'm looking for a way to split a text into n-grams. Normally I would do something like: import nltk from nltk import bigrams string = "I really like python, it's pretty awesome." string_bigrams = ...

Shifu

2,165

asked Jul 8, 2013 at 16:35

206 votes

15 answers

164k views

What is the difference between lemmatization vs stemming?

When do I use each? Also, is the NLTK lemmatization dependent upon parts of speech? Wouldn't it be more accurate if it were?

TIMEX

268k

asked Nov 24, 2009 at 0:48

190 votes

12 answers

296k views

How to check if a word is an English word with Python?

I want to check in a Python program if a word is in the English dictionary. I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task. def ...

Barthelemy

8,567

asked Sep 24, 2010 at 16:01

198 votes

9 answers

157k views

What are all possible POS tags of NLTK?

How do I find a list with all possible POS tags used by the Natural Language Toolkit (NLTK)?

OrangeTux

11.4k

asked Mar 13, 2013 at 14:59

124 votes

19 answers

157k views

Resource u'tokenizers/punkt/english.pickle' not found

My Code: import nltk.data tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle') ERROR Message: [ec2-user@ip-172-31-31-31 sentiment]$ python mapper_local_v1.0.py Traceback (most recent ...

Supreeth Meka

1,889

asked Oct 26, 2014 at 7:52

186 votes

10 answers

105k views

Java Stanford NLP: Part of Speech labels?

The Stanford NLP, demo'd here, gives an output like this: Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./. What do the Part of Speech tags mean? I am unable to find an official list. Is it ...

Nick Heiner

122k

asked Dec 2, 2009 at 14:30

156 votes

17 answers

83k views

Detecting syllables in a word

I need to find a fairly efficient way to detect syllables in a word. E.g., Invisible -> in-vi-sib-le There are some syllabification rules that could be used: V CV VC CVC CCV CCCV CVCC *where V is ...

user50705

1,623

asked Jan 1, 2009 at 17:08

167 votes

13 answers

302k views

How to get rid of punctuation using NLTK tokenizer?

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. ...

lizarisk

7,750

asked Mar 21, 2013 at 12:22

174 votes

9 answers

79k views

What does tf.nn.embedding_lookup function do?

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None) I cannot understand the duty of this function. Is it like a lookup table? Which means to return the parameters corresponding ...

Poorya Pzm

2,133

asked Jan 19, 2016 at 7:14

145 votes

14 answers

133k views

How to calculate the sentence similarity using word2vec model of gensim with python

According to the Gensim Word2Vec, I can use the word2vec model in gensim package to calculate the similarity between 2 words. e.g. trained_model.similarity('woman', 'man') 0.73723527 However, the ...

zhfkt

2,441

asked Mar 2, 2014 at 16:04

114 votes

22 answers

144k views

How do I do word Stemming or Lemmatization?

I've tried PorterStemmer and Snowball but both don't work on all words, missing some very common ones. My test words are: "cats running ran cactus cactuses cacti community communities", and both get ...

manixrock

2,533

asked Apr 21, 2009 at 10:07

61 votes

33 answers

28k views

What programming language is most like natural language? [closed]

I got the idea for this question from numerous situations where I don't understand what the person is talking about and when others don't understand me. So, a "smart" solution would be to speak a ...

Community wiki

3 revs, 3 users 100%
kliketa

146 votes

14 answers

271k views

How to remove stop words using nltk or python

I have a dataset from which I would like to remove stop words. I used NLTK to get a list of stop words: from nltk.corpus import stopwords stopwords.words('english') Exactly how do I compare the data ...

Alex

1,923

asked Mar 30, 2011 at 12:36

103 votes

25 answers

19k views

How can I correctly prefix a word with "a" and "an"?

I have a .NET application where, given a noun, I want it to correctly prefix that word with "a" or "an". How would I do that? Before you think the answer is to simply check if the first letter is a ...

ryeguy

66.4k

asked Aug 17, 2009 at 14:34

104 votes

15 answers

129k views

NLTK download SSL: Certificate verify failed

I get the following error when trying to install Punkt for nltk: nltk.download('punkt') [nltk_data] Error loading Punkt: <urlopen error [SSL: [nltk_data] CERTIFICATE_VERIFY_FAILED] ...

user3429986

1,195

asked Aug 12, 2016 at 11:04

100 votes

18 answers

114k views

How to use Stanford Parser in NLTK using Python

Is it possible to use Stanford Parser in NLTK? (I am not talking about Stanford POS.)

ThanaDaray

1,693

asked Dec 14, 2012 at 17:12

94 votes

16 answers

97k views

Ordinal numbers replacement

I am currently looking for the way to replace words like first, second, third,...with appropriate ordinal number representation (1st, 2nd, 3rd). I have been googling for the last week and I didn't ...

skornos

3,231

asked Mar 10, 2012 at 14:27

118 votes

17 answers

32k views

How do you implement a "Did you mean"? [duplicate]

Possible Duplicate: How does the Google “Did you mean?” Algorithm work? Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>...

pek

18k

asked Sep 3, 2008 at 10:36

138 votes

10 answers

361k views

how to check which version of nltk, scikit learn installed?

In a shell script I am checking whether these packages are installed or not; if not installed, then install them. So within the shell script: import nltk echo nltk.__version__ but it stops the shell script at ...

nlper

2,377

asked Feb 13, 2015 at 13:46

128 votes

6 answers

110k views

Understanding min_df and max_df in scikit CountVectorizer

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly mean? Is it the frequency ...

moeabdol

4,979

asked Dec 29, 2014 at 23:57

97 votes

12 answers

239k views

Corpora/stopwords not found when import nltk library

I am trying to import the nltk package in Python 2.7: import nltk stopwords = nltk.corpus.stopwords.words('english') print(stopwords[:10]) Running this gives me the following error: LookupError: ...

Frits Verstraten

2,159

asked Jan 12, 2017 at 10:19

38 votes

22 answers

13k views

Code Golf: Number to Words

The code golf series seems to be fairly popular. I ran across some code that converts a number to its word representation. Some examples would be (powers of 2 for programming fun): 2 -> Two 1024 -> ...

Community wiki

6 revs, 5 users 100%
Jason Z

133 votes

6 answers

26k views

How does Apple find dates, times and addresses in emails?

In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink and it is possible to create an appointment or look at a map simply by tapping the link. It not ...

Martin

40.1k

asked Feb 15, 2012 at 14:12

108 votes

8 answers

162k views

How to change huggingface transformers default cache directory

The default cache directory lacks disk capacity; I need to change the configuration of the default cache directory.

Ivan Lee

4,021

asked Aug 8, 2020 at 7:28

145 votes

4 answers

319k views

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working on a sentiment analysis problem; the data looks like this (label: instances): 5: 1190, 4: 838, 3: 239, 1: 204, 2: 127. So my data is unbalanced since 1190 ...

new_with_python

1,607

asked Jul 15, 2015 at 4:17

70 votes

15 answers

210k views

How do I download NLTK data?

Updated answer: NLTK works well with 2.7. I had 3.2. I uninstalled 3.2 and installed 2.7. Now it works!! I have installed NLTK and tried to download NLTK Data. What I did was to follow the instruction ...

Q-ximi

951

asked Mar 5, 2014 at 23:19

112 votes

6 answers

155k views

Python: tf-idf-cosine: to find document similarity

I was following a tutorial which was available at Part 1 & Part 2. Unfortunately the author didn't have the time for the final section which involved using cosine similarity to actually find the ...

add-semi-colons

18.6k

asked Aug 25, 2012 at 2:41

105 votes

9 answers

103k views

Where does hugging face's transformers save models?

Running the below code downloads a model - does anyone know what folder it downloads it to? !pip install -q transformers from transformers import pipeline model = pipeline('fill-mask')

user3472360

1,835

asked May 14, 2020 at 13:27

90 votes

8 answers

151k views

Calculate cosine similarity given 2 sentence strings

From Python: tf-idf-cosine: to find document similarity, it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, are there any ways to calculate ...

alvas

120k

asked Mar 2, 2013 at 10:06

96 votes

9 answers

86k views

How to get vector for a sentence from the word2vec of tokens in sentence

I have generated the vectors for a list of tokens from a large document using word2vec. Given a sentence, is it possible to get the vector of the sentence from the vector of the tokens in the sentence....

trialcritic

1,245

asked Apr 21, 2015 at 0:46

88 votes

12 answers

52k views

Sentiment analysis for Twitter in Python [closed]

I'm looking for an open source implementation, preferably in python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such open source ...

Ran

7,609

asked Feb 21, 2009 at 21:20

119 votes

4 answers

113k views

What does Keras Tokenizer method exactly do?

On occasion, circumstances require us to do the following: from keras.preprocessing.text import Tokenizer tokenizer = Tokenizer(num_words=my_max) Then, invariably, we chant this mantra: tokenizer....

Jack Fleeting

24.9k

asked Aug 21, 2018 at 20:08

131 votes

6 answers

531k views

re.sub erroring with "Expected string or bytes-like object"

I have read multiple posts regarding this error, but I still can't figure it out. When I try to loop through my function: def fix_Plan(location): letters_only = re.sub("[^a-zA-Z]", # ...

imanexcelnoob

1,333

asked May 1, 2017 at 22:47

79 votes

6 answers

166k views

Stopword removal with NLTK

I am trying to process user-entered text by removing stopwords using the nltk toolkit, but with stopword removal, words like 'and', 'or', 'not' get removed. I want these words to be present after ...

Grahesh Parkar

1,017

asked Oct 2, 2013 at 5:29

50 votes

14 answers

89k views

Load Pretrained glove vectors in python

I have downloaded pretrained glove vector file from the internet. It is a .txt file. I am unable to load and access it. It is easy to load and access a word vector binary file using gensim but I don't ...

Same

759

asked Jun 13, 2016 at 15:01

72 votes

15 answers

188k views

How to download a model from huggingface?

For example, I want to download bert-base-uncased on https://huggingface.co/models, but can't find a 'Download' link. Or is it not downloadable?

marlon

7,197

asked May 19, 2021 at 0:34

111 votes

7 answers

194k views

NLTK python error: "TypeError: 'dict_keys' object is not subscriptable"

I'm following instructions for a class homework assignment and I'm supposed to look up the top 200 most frequently used words in a text file. Here's the last part of the code: fdist1 = FreqDist(...

user3760644

1,167

asked Oct 16, 2014 at 1:20

94 votes

4 answers

130k views

Fuzzy String Comparison

What I am striving to complete is a program which reads in a file and will compare each sentence according to the original sentence. The sentence which is a perfect match to the original will receive ...

jacksonstephenc

941

asked Apr 30, 2012 at 11:37

77 votes

9 answers

69k views

What do spaCy's part-of-speech and dependency tags mean?

spaCy tags up each of the Tokens in a Document with a part of speech (in two different formats, one stored in the pos and pos_ properties of the Token and the other stored in the tag and tag_ ...

Mark Amery

150k

asked Oct 27, 2016 at 15:14

120 votes

3 answers

60k views

word2vec: negative sampling (in layman term)? [closed]

I'm reading the paper below and I have some trouble understanding the concept of negative sampling. http://arxiv.org/pdf/1402.3722v1.pdf Can anyone help, please?

Andy K

5,014

asked Jan 9, 2015 at 12:31

98 votes

10 answers

102k views

How to use Bert for long text classification?

We know that BERT has a maximum length limit of 512 tokens. So if an article is much longer than 512 tokens, such as 10,000 tokens of text, how can BERT be used?

user1337896

1,261

asked Oct 31, 2019 at 3:34

39,293 questions