Instagram is blocking our scraping #665

Closed
snarfed opened this issue Apr 30, 2016 · 31 comments
@snarfed
Owner

snarfed commented Apr 30, 2016

... by returning empty 429s to our profile page HTTP requests. seems like it started under 36h ago. may have happened before too though. eg https://brid.gy/log?start_time=1461974780&key=aglzfmJyaWQtZ3lyFgsSCUluc3RhZ3JhbSIHc25hcmZlZAw

instagram-atom isn't having this problem, and it's using the same IPs, so maybe changing user agent might fix it.
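
A minimal sketch of what swapping the user agent could look like. It uses Python's standard `urllib` rather than App Engine's urlfetch, and both the user agent string and the `build_profile_request` helper are made up for illustration; they are not Bridgy's actual code.

```python
import urllib.request

# Hypothetical browser-like user agent; the thread doesn't say which string was used.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/50.0 Safari/537.36"
)


def build_profile_request(username, user_agent=BROWSER_USER_AGENT):
    """Build (but don't send) a GET request for a public profile page."""
    url = "https://www.instagram.com/%s/" % username
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_profile_request("snarfed")
# urllib normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

Nothing is sent over the network here; passing `req` to `urllib.request.urlopen` would do the actual fetch.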

@snarfed
Owner Author

snarfed commented Apr 30, 2016

changed our user agent to a normal browser string, which seemed to fix this....but i doubt it'll stay fixed for long. app engine still appends our app id to the user agent, so instagram will still be able to identify us. we'll see.

@snarfed
Owner Author

snarfed commented Apr 30, 2016

didn't work :(

snarfed added a commit to snarfed/granary that referenced this issue Apr 30, 2016
@snarfed
Owner Author

snarfed commented Apr 30, 2016

the 429 body:

[screenshot of the 429 response body, taken 2016-04-30 at 9:21:43 am]

@snarfed
Owner Author

snarfed commented Apr 30, 2016

i also dropped instagram max poll freq down to 2h. with that and the new user agent, we're back in business. >75% of active instagram accounts have polled successfully in the last few hrs. eg https://brid.gy/instagram/aaronpk
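
A rough sketch of a poll frequency cap, assuming "max poll freq down to 2h" means no account gets polled more often than once every two hours. The names `MIN_POLL_INTERVAL` and `next_poll_time` are hypothetical and not taken from Bridgy's code.

```python
from datetime import datetime, timedelta

# Assumed reading of the comment above: never poll more often than every 2 hours.
MIN_POLL_INTERVAL = timedelta(hours=2)


def next_poll_time(last_poll, desired_interval, min_interval=MIN_POLL_INTERVAL):
    """Return when to poll next, never sooner than min_interval after last_poll."""
    return last_poll + max(desired_interval, min_interval)


last = datetime(2016, 4, 30, 9, 0)
# An active account that would otherwise poll every 30 minutes gets clamped to 2h.
print(next_poll_time(last, timedelta(minutes=30)))
# A quiet account on a daily schedule is left alone.
print(next_poll_time(last, timedelta(days=1)))
```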

not sure which of the two changes did the trick. we'll see.

@snarfed
Owner Author

snarfed commented May 1, 2016

we're still mostly blocked after all. :/ a few fetches went through ok, but they were the exception.

i'm going to disable instagram entirely for a day or two to see if that resets anything on their end.

@snarfed
Owner Author

snarfed commented May 1, 2016

i wonder if this is all of app engine's (slash google's) IP block, not just bridgy. eg granary-demo sees the same problem: https://granary-demo.appspot.com/?site=instagram

@snarfed
Owner Author

snarfed commented May 1, 2016

evidence for that: scraping instagram with bridgy's user agent works fine on my local machine.

@snarfed
Owner Author

snarfed commented May 2, 2016

tried switching to sockets instead of urlfetch in the hopes that it used a different IP block, but no luck. one request made it through out of five, but the other four were 429ed. :/

@snarfed
Owner Author

snarfed commented May 2, 2016

@snarfed
Owner Author

snarfed commented May 2, 2016

i set up a reverse proxy to get around the IP block.

@snarfed
Owner Author

snarfed commented May 3, 2016

this has been working ok for a couple days now, yay. we'll see how long it lasts. :P closing.

@snarfed snarfed closed this as completed May 3, 2016
@gerbz

gerbz commented May 24, 2016

I noticed you're scraping the profile page - you should check out /username/media/. No auth needed.
https://www.instagram.com/snarfed/media/

Discovering this blew my mind.

Are you using a single IP? How often are you polling? Has it been working for the 21+ days since your fix?

@snarfed
Owner Author

snarfed commented May 24, 2016

@gerbz sadly that only works if you're logged in. http://stackoverflow.com/questions/17373886/33783840#comment61481772_33783840

the proxy was a single IP, yes, but instagram actually stopped blocking app engine recently, so i switched back to fetching directly instead.

we're polling ~1k users between once a day and once an hr, depending on how active they are. each poll may also fetch up to N individual media pages too though. in practice it looks like we average <1qpm right now, slightly bursty.
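
As a back-of-the-envelope check on those numbers, here is the arithmetic for the two extremes, counting profile polls only and ignoring the extra media page fetches. `polls_per_minute` is just an illustrative helper.

```python
USERS = 1000  # "~1k users" from the comment above


def polls_per_minute(users, polls_per_user_per_day):
    """Average profile polls per minute, ignoring extra media page fetches."""
    return users * polls_per_user_per_day / (24 * 60)


slowest = polls_per_minute(USERS, 1)   # every account polled once a day
fastest = polls_per_minute(USERS, 24)  # every account polled once an hour
print("%.2f to %.2f polls per minute" % (slowest, fastest))
# prints "0.69 to 16.67 polls per minute"
```

An observed average under 1 qpm would be consistent with most accounts sitting near the once-a-day end of that range.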

@gerbz

gerbz commented May 24, 2016

@snarfed that comment is incorrect - try for yourself. I've even hit it unauthed using Tor. Works fine. Haven't polled it excessively but should work.

Thanks for the info.

@snarfed
Owner Author

snarfed commented May 24, 2016

good point! you're right. thanks! i just realized i was testing on a private account. public accounts work fine.

@shafikhaan

@snarfed what's the current status of your scraping? Is the project still up?
P.S. 👍 Thanks for the comments, they're really helping.

@snarfed
Owner Author

snarfed commented Jun 30, 2018

@shafikhaan yup! https://brid.gy/ , https://granary.io , and https://instagram-atom.appspot.com are still happily scraping Instagram.

@shafikhaan

@snarfed Which one would you pick from the above?

@snarfed
Owner Author

snarfed commented Jun 30, 2018

@shafikhaan sorry? i don't follow the question.

they all share this scraping code, if that helps:

https://github.com/snarfed/granary/blob/master/granary/instagram.py#L758-L975

@snarfed snarfed reopened this Aug 26, 2019
@snarfed
Owner Author

snarfed commented Aug 26, 2019

happening again. started 8/21, probably due to an ongoing flood of https://granary.io/ instagram fetches for individual profiles via subscriptions in Aperture-based news readers. ugh. i've disabled instagram in granary entirely for now.

for the record, and since i might need to use it again, when i proxied requests last time, i used Apache 2.4's mod_proxy and mod_ssl with this config:

LoadModule proxy_module /usr/lib64/httpd/modules/mod_proxy.so
LoadModule ssl_module /usr/lib64/httpd/modules/mod_ssl.so
SSLProxyEngine on
ProxyPass /instagram/ https://www.instagram.com/
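
With that `ProxyPass` rule, a request for `/instagram/<path>` on the proxy is forwarded to `https://www.instagram.com/<path>`. A sketch of the client-side URL rewrite that would go with it is below. The proxy host is a placeholder (the real one isn't given here) and `via_proxy` is a hypothetical helper.

```python
from urllib.parse import urlsplit

# Placeholder proxy host; the real one isn't given in this thread.
PROXY_BASE = "http://proxy.example.com/instagram/"


def via_proxy(url, proxy_base=PROXY_BASE):
    """Rewrite a www.instagram.com URL to go through the ProxyPass prefix."""
    parts = urlsplit(url)
    if parts.netloc != "www.instagram.com":
        return url  # leave non-Instagram URLs untouched
    rewritten = proxy_base + parts.path.lstrip("/")
    if parts.query:
        rewritten += "?" + parts.query
    return rewritten


print(via_proxy("https://www.instagram.com/snarfed/"))
# prints "http://proxy.example.com/instagram/snarfed/"
```
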
@snarfed
Owner Author

snarfed commented Aug 26, 2019

interestingly, the symptom this time is different. when it happened originally, back in 2016, we got 429s with a nice "Sorry, too many requests." HTML body. now, it's 401s with an empty body. example log.
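
For illustration, a small check that recognizes both symptoms could look like the sketch below. `looks_blocked` is a hypothetical helper, not part of granary or Bridgy, and it only encodes the two cases described in this thread.

```python
def looks_blocked(status_code, body):
    """Guess whether a response matches either blocking symptom seen in this thread."""
    if status_code == 429:
        # 2016 symptom: HTTP 429, with a "Sorry, too many requests." HTML body.
        return True
    if status_code == 401 and not body.strip():
        # 2019 symptom: HTTP 401 with an empty body.
        return True
    return False


print(looks_blocked(429, "Sorry, too many requests."))
print(looks_blocked(401, ""))
print(looks_blocked(200, "<html>...</html>"))
```

A 401 with a non-empty body is deliberately not treated as a block, since that could be an ordinary auth failure.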

@snarfed
Owner Author

snarfed commented Aug 26, 2019

back to proxying. working for now. i've re-enabled all affected IG accounts.

@snarfed
Owner Author

snarfed commented Aug 27, 2019

instagram blocked my proxy's IP. whee.

snarfed added a commit to snarfed/granary that referenced this issue Aug 27, 2019
snarfed added a commit to snarfed/granary that referenced this issue Aug 29, 2019
trying to discourage people from using granary for social feeds, esp due to eg IG's recent blocking, snarfed/bridgy#665 (comment)
snarfed added a commit to snarfed/instagram-atom that referenced this issue Aug 30, 2019
...since i had to block instagram in granary due to their rate limiting/blocking. snarfed/bridgy#665 (comment)
snarfed added a commit to snarfed/twitter-atom that referenced this issue Aug 30, 2019
inspired by snarfed/instagram-atom@856575b, since i had to block instagram in granary due to their rate limiting/blocking. snarfed/bridgy#665 (comment)

UI next!
@snarfed
Owner Author

snarfed commented Jan 13, 2020

we've been scraping with a logged in session cookie for a while now. not ideal, maybe not sustainable, but it's been working. tentatively closing.
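
A rough sketch of what attaching a session cookie to a fetch could look like with Python's standard `urllib`. The cookie name `sessionid`, the placeholder value, and the `build_logged_in_request` helper are all assumptions for illustration; the thread doesn't say how the cookie is actually passed.

```python
import urllib.request


def build_logged_in_request(url, session_cookie, cookie_name="sessionid"):
    """Build (but don't send) a request carrying a logged-in session cookie.

    cookie_name is an assumption; this thread doesn't name the cookie.
    """
    req = urllib.request.Request(url)
    req.add_header("Cookie", "%s=%s" % (cookie_name, session_cookie))
    return req


req = build_logged_in_request("https://www.instagram.com/snarfed/", "PLACEHOLDER")
print(req.get_header("Cookie"))
```

The cookie value would come from a real logged-in browser session and should be kept out of source control.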

@snarfed snarfed closed this as completed Jan 13, 2020
@staabm

staabm commented Apr 5, 2020

The website still reports that Bridgy is blocked. Is this message still up to date?

@snarfed
Owner Author

snarfed commented Apr 5, 2020

@staabm which web site? where?

@staabm

staabm commented Apr 5, 2020

[image]

@staabm

staabm commented Apr 5, 2020

I am getting this error when trying to add Instagram to my Bridgy account.

@snarfed
Owner Author

snarfed commented Apr 5, 2020

ah, got it, thanks. I'll look soon!

snarfed added a commit that referenced this issue Apr 5, 2020
snarfed added a commit that referenced this issue Apr 5, 2020
@snarfed
Owner Author

snarfed commented Apr 5, 2020

@staabm thanks again for the report. i think i've fixed this. mind trying again?

@staabm

staabm commented Apr 5, 2020

Thanks for the fast fix. The error is gone.

Seems to work now ✌️
