
Optimize costs #1149

Open
snarfed opened this issue Jun 24, 2024 · 16 comments

@snarfed
Owner

snarfed commented Jun 24, 2024

I don't want to draw attention to this; I've been looking at it mostly behind the scenes. But I'd like to start tracking at least the investigation and work more publicly.

I expect there's plenty of low-hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.

[cost charts: past month, past year]

snarfed added the `now` label Jun 25, 2024
@snarfed
Owner Author

snarfed commented Jun 26, 2024

related: #1152 (but I expect outbox is a small fraction of overall cost)

@snarfed
Owner Author

snarfed commented Jun 26, 2024

Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .

@snarfed
Owner Author

snarfed commented Jun 28, 2024

Got ~20h of datastore logs, digging into them in log analytics now. https://console.cloud.google.com/logs/analytics

@snarfed
Owner Author

snarfed commented Jun 28, 2024

...maybe promising? E.g., this query breaks down calls by API method:

```sql
SELECT DISTINCT
  proto_payload.audit_log.method_name as method_name,
  count(*) as count
FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
group by method_name
order by count desc
LIMIT 1000
```

and this one samples the actual contents of queries:

```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
limit 200
```

I can't aggregate (group by) on fields inside proto_payload, though; I get `Grouping by expressions of type JSON is not allowed`. Next step: try BigQuery to see if it can get around that.

@snarfed
Owner Author

snarfed commented Jun 28, 2024

@snarfed
Owner Author

snarfed commented Jun 28, 2024

Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.
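One possible workaround (an untested sketch; the CTE and column choices here are mine, not something already in use): extract the JSON fields into plain STRING columns first, either inline or in a saved view, then group on the extracted columns:

```sql
-- Sketch: pull the JSON fields out as STRINGs first, so GROUP BY works.
WITH flat AS (
  SELECT
    STRING(proto_payload.audit_log.method_name) AS method_name,
    STRING(proto_payload.audit_log.request.query.kind[0].name) AS kind
  FROM `bridgy-federated.global._Default._Default`
  WHERE log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
)
SELECT method_name, kind, COUNT(*) AS count
FROM flat
GROUP BY method_name, kind
ORDER BY count DESC
```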

@snarfed
Owner Author

snarfed commented Jul 25, 2024

Back to looking at this.

@snarfed
Owner Author

snarfed commented Jul 30, 2024

Some useful datastore API usage analytics below, and already some useful results: queries are ~40% of datastore API calls, and the vast majority of those are looking up original Objects and users for a given copy.

Query

```sql
SELECT
  DISTINCT proto_payload.audit_log.method_name as method_name, count(*) as count
FROM bridgy-federated.logs._AllLogs
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  group by method_name
  order by count desc
```

| method_name | count |
| --- | --- |
| Lookup | 4374961 |
| RunQuery | 3596350 |
| Commit | 548773 |
| BeginTx | 353552 |
| Rollback | 23279 |
| ... | |
Query

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) as num_composite_filters,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
group by kind, num_composite_filters
order by count(*) desc
```

| kind | num_composite_filters | count |
| --- | --- | --- |
| Object | | 767016 |
| ATProto | | 764103 |
| MagicKey | | 755585 |
| ActivityPub | | 754627 |
| UIProtocol | | 468614 |
| Follower | 2 | 49023 |
| AtpBlock | | 17479 |
| AtpRemoteBlob | | 6890 |
| AtpRepo | | 3566 |
| Follower | | 3147 |
| AtpRepo | 2 | |
| ATProto | 2 | |
| ... | | |
Query

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  FilterStr(proto_payload.audit_log.request.query.filter) as filter,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
  and proto_payload.audit_log.request.query.filter.propertyFilter is not null
group by kind, filter
order by count(*) desc
```

| kind | filter | count |
| --- | --- | --- |
| ActivityPub | copies.uri EQUAL | 753344 |
| MagicKey | copies.uri EQUAL | 753290 |
| Object | copies.uri EQUAL | 753245 |
| ATProto | copies.uri EQUAL | 753243 |
| UIProtocol | copies.uri EQUAL | 468614 |
| AtpBlock | seq GREATER_THAN_OR_EQUAL | 17479 |
| Object | users EQUAL | 12535 |
| AtpRemoteBlob | cid EQUAL | 6890 |
| ATProto | handle EQUAL | 6516 |
| MagicKey | manual_opt_out EQUAL | 2295 |
| Follower | from EQUAL | 1575 |
| Follower | to EQUAL | 1572 |
| ATProto | enabled_protocols NOT_EQUAL | 1232 |
| ActivityPub | enabled_protocols NOT_EQUAL | 1231 |
| ... | | |
@snarfed
Owner Author

snarfed commented Jul 30, 2024

Now looking at lookups, aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

Query

```sql
SELECT
  array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) as num_keys,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
group by num_keys
order by count(*) desc
```

| num_keys | count |
| --- | --- |
| 1 | 4371422 |
| 2 | 999 |
| 3 | 519 |
| 4 | 365 |
| 5 | 229 |
| 6 | 171 |
| 7 | 101 |
| 8 | 83 |
| 9 | 74 |
| 100 | 68 |
| 12 | 67 |
| ... | |
Query

```sql
SELECT
  string(proto_payload.audit_log.request.keys[0].path[0].kind) as kind,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
group by kind
order by count(*) desc
```

| kind | count |
| --- | --- |
| Object | 4080965 |
| ATProto | 112141 |
| AtpBlock | 97120 |
| MagicKey | 38829 |
| ActivityPub | 29996 |
| AtpRepo | 7222 |
| AtpSequence | 3551 |
| AtpRemoteBlob | 1574 |
| Cursor | 24 |

Object lookups by id scheme:

Query

```sql
SELECT
  split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] as scheme,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
group by scheme
order by count(*) desc
```

| scheme | count |
| --- | --- |
| did | 3434021 |
| at | 413000 |
| https | 235446 |
| http | 85 |
| ... | |
@snarfed
Owner Author

snarfed commented Jul 31, 2024

I don't understand why the DID Object lookups aren't using memcache. The ndb contexts in all three services seem to have it configured right. Here are the top DIDs by number of lookups, in just a ~9h window. Almost all of them should have been cached. Why weren't they?

Query

```sql
SELECT
  JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) as name,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  and split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
group by name
order by count(*) desc
limit 100
```

| name | count |
| --- | --- |
| did:plc:ayutykgvyf4x7ev5ornltyzz | 176211 |
| did:plc:4mjwxpnhoeaknxqabwhf2n6i | 64569 |
| did:plc:p2ygpwluon3vrk5yecjq7wc5 | 56364 |
| did:plc:hm5cxb2g2q4ma4ucsks73tex | 53051 |
| did:plc:b5glolrcdfnaxwc4dbe4zbty | 43557 |
| did:plc:3rxcq5wacdop5thjoh3sny3p | 35103 |
| did:plc:6zfgmvghpidjvcn3cqtitnx5 | 28309 |
| did:plc:5syae7p7dr6zsfdoe4k6buky | 26882 |
| did:plc:dqgdfku26vpxdktkzne5x2xj | 26240 |
| did:plc:sese4fb6luojywziva3s7zjo | 24876 |
| ... | |
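One way to narrow down where the cache misses come from: group the same DID lookups by caller metadata. A sketch (the `caller_supplied_user_agent` field here is my guess at a useful discriminator; it may not distinguish our services in practice):

```sql
SELECT
  string(proto_payload.audit_log.request_metadata.caller_supplied_user_agent) as agent,
  count(*) as count
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  and split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
group by agent
order by count desc
```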
snarfed added a commit that referenced this issue Aug 1, 2024
@snarfed
Owner Author

snarfed commented Aug 1, 2024

Progress here! Managed to cut both datastore lookups and queries way down, 5-10x each.

[charts: datastore lookups and queries, before and after]
snarfed added a commit to snarfed/webutil that referenced this issue Aug 2, 2024
snarfed added a commit that referenced this issue Aug 2, 2024
snarfed added a commit that referenced this issue Aug 2, 2024
@snarfed
Owner Author

snarfed commented Aug 2, 2024

Now looking at optimizing log storage. We were doing ~250 GB/mo a bit ago; we're down to ~150 GB/mo or so now.

The main cost here is initial ingest: the first 50 GB/mo is free, then $0.50/GB after that, so ~150 GB/mo works out to roughly $50/mo. Ingest pricing includes 30d of retention, and our retention period is already set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing

Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging
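To see which log streams dominate, entry counts per log are a rough first cut (these views don't expose per-entry byte sizes, so this only approximates ingest volume):

```sql
SELECT log_name, count(*) as entries
FROM bridgy-federated.logs._AllLogs
group by log_name
order by entries desc
limit 20
```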

https://console.cloud.google.com/monitoring/dashboards/builder/24c22d42-91d8-4feb-aa6b-99dbb84c6417;duration=PT8H?project=bridgy-federated

[chart: log ingestion volume]
@snarfed
Owner Author

snarfed commented Aug 2, 2024

The next cost to look at is CPU. router is currently on four cores, atproto-hub on one; we should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs

[chart: CPU usage by service]
@snarfed
Owner Author

snarfed commented Aug 3, 2024

Next step for datastore reads: snarfed/arroba#30
