
port statistics and to_tsv.py to bigquery #715

Closed
snarfed opened this issue Jan 5, 2017 · 10 comments
snarfed commented Jan 5, 2017

we currently use to_tsv.py and a google spreadsheet to generate our periodic usage and growth stats, e.g. https://snarfed.org/2016-06-06_bridgy-stats-update-2

...but we've outgrown it. we have ~25GB of data now, so downloading it alone is a bit prohibitive, and processing it afterward takes ~4GB memory and >6h on a 2014 MBP. ugh.

fortunately google has solved this for us. we can load the existing datastore backups into bigquery, then port to_tsv.py to a handful of SQL queries. we may even get better viz and charts for free. yahtzee!

snarfed commented Aug 16, 2017

loading into bigquery would also help with a recent request from @tantek: finding the most-webmentioned posts, both all time and ongoing. https://chat.indieweb.org/2017-08-16#t1502909939532000

snarfed commented Aug 18, 2017

i tried to load the datastore backups for the Response entity into bigquery and hit a couple of roadblocks.

first, the bigquery docs claim it supports wildcards in cloud storage URIs, e.g. gs://brid-gy.appspot.com/weekly/datastore_backup_full_2017_08_06_Response/15677859164382301757551D37199F8/*, but when i tried, it failed with "not found". details in https://bigquery.cloud.google.com/jobs/brid-gy .

second, when i tried loading a single file (output-0), i hit the size limit. app engine datastore backup generates 30-65MB files, but bigquery's limit for these files is 10MB. got this error: Backup_info file 'gs://brid-gy.appspot.com/weekly/datastore_backup_full_2017_08_06_Response/15677859164382301757551D37199F8/output-0' is too big: 68780032 bytes. Maximum allowed size is: 10485760 bytes (error code: invalid), linked to https://cloud.google.com/bigquery/troubleshooting-errors#invalid .

snarfed commented Aug 18, 2017

aha, i needed to load the .backup_info file: https://cloud.google.com/bigquery/loading-data-cloud-datastore#webui

e.g. gs://brid-gy.appspot.com/weekly/aglzfmJyaWQtZ3lyQQsSHF9BRV9EYXRhc3RvcmVBZG1pbl9PcGVyYXRpb24Yqd-8BQwLEhZfQUVfQmFja3VwX0luZm9ybWF0aW9uGAEM.Response.backup_info
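for the record, the same load can be done with the bq CLI instead of the web UI. this is a sketch, assuming the bq tool is installed and authenticated; the dataset/table name and backup_info URI are the ones above:

```shell
# sketch: load an app engine datastore backup into bigquery via the bq CLI.
# point it at the .backup_info file, not the individual output-N files.
bq load --source_format=DATASTORE_BACKUP \
  datastore.Response \
  gs://brid-gy.appspot.com/weekly/aglzfmJyaWQtZ3lyQQsSHF9BRV9EYXRhc3RvcmVBZG1pbl9PcGVyYXRpb24Yqd-8BQwLEhZfQUVfQmFja3VwX0luZm9ybWF0aW9uGAEM.Response.backup_info
```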

snarfed commented Aug 18, 2017

ok, Responses are loaded, and i opened the dataset up to the public: https://bigquery.cloud.google.com/table/brid-gy:datastore.Response

i may or may not maintain it. we'll see.

snarfed commented Nov 22, 2017

loaded the rest of the data (accounts, publishes, etc.) into bigquery from an 11/19 snapshot, and started trying google data studio as a BI tool: https://datastudio.google.com/open/10lOfwjq6844GqRmqtfv-2nZc2F9BZyWq

overall, it's improved a lot since i tried it this past summer for indie map...but it's still not quite ready for prime time.

  1. it chokes on big time series datasets, e.g. the 7+M Response entities: The server encountered an internal error and was unable to complete your request. Error ID: ce0acec0
  2. it can't add calculated fields (e.g. COUNT) on bigquery views or custom queries, which i kind of need for merging all of the different source (account) types. i'd load them all into the same bigquery table, but bigquery's datastore import doesn't support that.

i've sent these as feedback and filed them in the issue tracker:
https://issuetracker.google.com/issues/69639296
https://issuetracker.google.com/issues/69639425

snarfed commented Nov 27, 2017

snarfed commented Dec 15, 2017

so, qlik's bigquery connector requires their desktop app, and mode only has a free trial, not a permanent free tier, and my trial ended.

so i'm now trying out local metabase. working pretty well so far.
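one easy way to run metabase locally is via docker. this is just a sketch of that option, not necessarily how it was set up here; the image, port, and container name are docker hub defaults:

```shell
# run a local metabase instance (official docker image, default port 3000)
docker run -d -p 3000:3000 --name metabase metabase/metabase
# then open http://localhost:3000 and add bigquery as a database
```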

snarfed commented Dec 16, 2017

spoke too soon. metabase has been giving me a hell of a time dealing with bigquery views (e.g. a combined Sources view with all the different silo account types), legacy vs standard SQL (especially on views), and graphing multiple things on the same chart.

next idea: use bigquery to populate the current spreadsheet, and continue to use that to generate the graphs.

snarfed commented Jan 2, 2018

worked! the spreadsheet now has updated data and graphs.

will document the new process soon. in the meantime, here's the monster bigquery query that populates the spreadsheet data:

WITH
days AS (SELECT * FROM UNNEST(
  GENERATE_DATE_ARRAY('2013-12-01', (SELECT MAX(CAST(created AS date)) FROM datastore.Response))) AS day),
s as (SELECT cast(created as date) AS day, __key__.kind, count(*) as num FROM datastore.Sources GROUP BY day, kind),
bl as (SELECT * FROM s WHERE kind = 'Blogger'),
fb as (SELECT * FROM s WHERE kind = 'FacebookPage'),
fl as (SELECT * FROM s WHERE kind = 'Flickr'),
gp as (SELECT * FROM s WHERE kind = 'GooglePlusPage'),
ig as (SELECT * FROM s WHERE kind = 'Instagram'),
me as (SELECT * FROM s WHERE kind = 'Medium'),
tu as (SELECT * FROM s WHERE kind = 'Tumblr'),
tw as (SELECT * FROM s WHERE kind = 'Twitter'),
wp as (SELECT * FROM s WHERE kind = 'WordPress'),
resp as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.Response GROUP BY day),
bp as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.BlogPost GROUP BY day),
pub as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.Publish GROUP BY day),
bw as (SELECT cast(created as date) AS day, count(*) as num FROM datastore.BlogWebmention GROUP BY day),

resps_and_bps as (
  SELECT created, sent, unsent, error, failed, skipped FROM datastore.Response
  UNION ALL
  SELECT created, sent, unsent, error, failed, skipped FROM datastore.BlogPost
),
wms as (
  SELECT cast(created as date) AS day, count(*) as num,
  SUM(ARRAY_LENGTH(sent)) AS sent,
  SUM(ARRAY_LENGTH(sent) + ARRAY_LENGTH(unsent) + ARRAY_LENGTH(error) + ARRAY_LENGTH(failed) + ARRAY_LENGTH(skipped)) AS links
  FROM resps_and_bps GROUP BY day),
features as (
  SELECT cast(created as date) AS day, feat, count(*) as num
  FROM datastore.Sources s, s.features feat GROUP BY feat, day),
listen as (SELECT * FROM features WHERE feat = 'listen'),
publish as (SELECT * FROM features WHERE feat = 'publish'),
webmention as (SELECT * FROM features WHERE feat = 'webmention'),

domains as (
  SELECT DISTINCT NET.HOST(s) AS domain, MIN(CAST(rbs.created AS date)) AS day
  FROM resps_and_bps rbs, rbs.sent s GROUP BY domain ORDER BY day),
domain_counts as (SELECT day, COUNT(*) AS num FROM domains GROUP BY day)

SELECT day,
  SUM(bl.num) OVER win AS Blogger,
  SUM(fb.num) OVER win AS FacebookPage,
  SUM(fl.num) OVER win AS Flickr,
  SUM(gp.num) OVER win AS GooglePlusPage,
  SUM(ig.num) OVER win AS Instagram,
  SUM(me.num) OVER win AS Medium,
  SUM(tu.num) OVER win AS Tumblr,
  SUM(tw.num) OVER win AS Twitter,
  SUM(wp.num) OVER win AS WordPress,
  SUM(resp.num) OVER win AS Response,
  SUM(bp.num) OVER win AS BlogPost,
  SUM(pub.num) OVER win AS Publish,
  SUM(bw.num) OVER win AS BlogWebmention,
  SUM(listen.num) OVER win AS f_listen,
  SUM(publish.num) OVER win AS f_publish,
  SUM(webmention.num) OVER win AS f_blog_webmention,
  SUM(wms.links) OVER win AS links,
  SUM(wms.sent + bp.num) OVER win AS webmentions,
  SUM(domain_counts.num) OVER win AS domains

FROM days
LEFT JOIN bl USING(day)
LEFT JOIN fb USING(day)
LEFT JOIN fl USING(day)
LEFT JOIN gp USING(day)
LEFT JOIN ig USING(day)
LEFT JOIN me USING(day)
LEFT JOIN tu USING(day)
LEFT JOIN tw USING(day)
LEFT JOIN wp USING(day)
LEFT JOIN resp USING(day)
LEFT JOIN bp USING(day)
LEFT JOIN pub USING(day)
LEFT JOIN bw USING(day)
LEFT JOIN listen USING(day)
LEFT JOIN publish USING(day)
LEFT JOIN wms USING(day)
LEFT JOIN webmention USING(day)
LEFT JOIN domain_counts USING(day)

WINDOW win AS (ORDER BY day ROWS UNBOUNDED PRECEDING)
ORDER BY day
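the query above can be re-run from the command line with the bq CLI and the output imported into the spreadsheet. a sketch, assuming the query is saved as stats.sql (the filename and row limit are my choices, not part of the setup above):

```shell
# sketch: run the stats query with the bq CLI, dump CSV for the spreadsheet.
# the query uses standard SQL, so disable legacy SQL explicitly.
bq query --use_legacy_sql=false --format=csv --max_rows=10000 < stats.sql > stats.csv
```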
snarfed added a commit that referenced this issue Jan 2, 2018
@snarfed snarfed closed this as completed Jan 2, 2018