
Optimize costs #1149

Open
snarfed opened this issue Jun 24, 2024 · 16 comments

@snarfed
Owner

snarfed commented Jun 24, 2024

I don't want to draw attention to this; I've been looking at it mostly behind the scenes. But I'd like to start tracking at least the investigation and work more publicly.

I expect there's plenty of low-hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.

[cost charts: past month, past year]

snarfed added the `now` label Jun 25, 2024
@snarfed
Owner Author

snarfed commented Jun 26, 2024

related: #1152 (but I expect outbox is a small fraction of overall cost)

@snarfed
Owner Author

snarfed commented Jun 26, 2024

Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .

@snarfed
Owner Author

snarfed commented Jun 28, 2024

Got ~20h of datastore logs, digging into them in log analytics now. https://console.cloud.google.com/logs/analytics

@snarfed
Owner Author

snarfed commented Jun 28, 2024

...maybe promising? E.g., this query breaks down calls by API method:

```sql
SELECT DISTINCT
  proto_payload.audit_log.method_name as method_name,
  count(*) as count
FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
group by method_name
order by count desc
LIMIT 1000
```

and this one samples the actual contents of queries:

```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
limit 200
```

I can't aggregate (group by) on fields inside proto_payload, though; I get `Grouping by expressions of type JSON is not allowed`. Next step: try BigQuery to see if it can get around that.

@snarfed
Owner Author

snarfed commented Jun 28, 2024

@snarfed
Owner Author

snarfed commented Jun 28, 2024

Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.
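One possible workaround (an untested sketch; the CTE and column choices here are mine, not something already in use): extract the JSON fields into plain STRING columns first, either inline or in a saved view, then group on the extracted columns:

```sql
-- Sketch: pull the JSON fields out as STRINGs first, so GROUP BY works.
WITH flat AS (
  SELECT
    STRING(proto_payload.audit_log.method_name) AS method_name,
    STRING(proto_payload.audit_log.request.query.kind[0].name) AS kind
  FROM `bridgy-federated.global._Default._Default`
  WHERE log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
)
SELECT method_name, kind, COUNT(*) AS count
FROM flat
GROUP BY method_name, kind
ORDER BY count DESC
```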

@snarfed
Owner Author

snarfed commented Jul 25, 2024

Back to looking at this.

@snarfed
Owner Author

snarfed commented Jul 30, 2024

Some useful datastore API usage analytics below, and already some useful results: queries are ~40% of datastore API calls, and the vast majority of those are looking up original Objects and users for a given copy.

Query

```sql
SELECT
  DISTINCT proto_payload.audit_log.method_name as method_name, count(*) as count
FROM bridgy-federated.logs._AllLogs
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  group by method_name
  order by count desc
```

| method_name | count |
| --- | --- |
| Lookup | 4374961 |
| RunQuery | 3596350 |
| Commit | 548773 |
| BeginTx | 353552 |
| Rollback | 23279 |
| ... | |
Query

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) as num_composite_filters,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
group by kind, num_composite_filters
order by count(*) desc
```

| kind | num_composite_filters | count |
| --- | --- | --- |
| Object | | 767016 |
| ATProto | | 764103 |
| MagicKey | | 755585 |
| ActivityPub | | 754627 |
| UIProtocol | | 468614 |
| Follower | 2 | 49023 |
| AtpBlock | | 17479 |
| AtpRemoteBlob | | 6890 |
| AtpRepo | | 3566 |
| Follower | | 3147 |
| AtpRepo | 2 | |
| ATProto | 2 | |
| ... | | |
Query

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) as kind,
  FilterStr(proto_payload.audit_log.request.query.filter) as filter,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.RunQuery'
  and proto_payload.audit_log.request.query.filter.propertyFilter is not null
group by kind, filter
order by count(*) desc
```

| kind | filter | count |
| --- | --- | --- |
| ActivityPub | copies.uri EQUAL | 753344 |
| MagicKey | copies.uri EQUAL | 753290 |
| Object | copies.uri EQUAL | 753245 |
| ATProto | copies.uri EQUAL | 753243 |
| UIProtocol | copies.uri EQUAL | 468614 |
| AtpBlock | seq GREATER_THAN_OR_EQUAL | 17479 |
| Object | users EQUAL | 12535 |
| AtpRemoteBlob | cid EQUAL | 6890 |
| ATProto | handle EQUAL | 6516 |
| MagicKey | manual_opt_out EQUAL | 2295 |
| Follower | from EQUAL | 1575 |
| Follower | to EQUAL | 1572 |
| ATProto | enabled_protocols NOT_EQUAL | 1232 |
| ActivityPub | enabled_protocols NOT_EQUAL | 1231 |
| ... | | |
@snarfed
Owner Author

snarfed commented Jul 30, 2024

Now looking at lookups, aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

Query

```sql
SELECT
  array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) as num_keys,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
group by num_keys
order by count(*) desc
```

| num_keys | count |
| --- | --- |
| 1 | 4371422 |
| 2 | 999 |
| 3 | 519 |
| 4 | 365 |
| 5 | 229 |
| 6 | 171 |
| 7 | 101 |
| 8 | 83 |
| 9 | 74 |
| 100 | 68 |
| 12 | 67 |
| ... | |
Query

```sql
SELECT
  string(proto_payload.audit_log.request.keys[0].path[0].kind) as kind,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
group by kind
order by count(*) desc
```

| kind | count |
| --- | --- |
| Object | 4080965 |
| ATProto | 112141 |
| AtpBlock | 97120 |
| MagicKey | 38829 |
| ActivityPub | 29996 |
| AtpRepo | 7222 |
| AtpSequence | 3551 |
| AtpRemoteBlob | 1574 |
| Cursor | 24 |

Object lookups by id scheme:

Query

```sql
SELECT
  split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] as scheme,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
group by scheme
order by count(*) desc
```

| scheme | count |
| --- | --- |
| did | 3434021 |
| at | 413000 |
| https | 235446 |
| http | 85 |
| ... | |
@snarfed
Owner Author

snarfed commented Jul 31, 2024

I don't understand why the DID Object lookups aren't using memcache. The ndb contexts in all three services seem to have it configured right. Here are the top DIDs by number of lookups, in just a ~9h window. Almost all of them should have been cached. Why weren't they?

Query

```sql
SELECT
  JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) as name,
  count(*)
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  and split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
group by name
order by count(*) desc
limit 100
```

| name | count |
| --- | --- |
| did:plc:ayutykgvyf4x7ev5ornltyzz | 176211 |
| did:plc:4mjwxpnhoeaknxqabwhf2n6i | 64569 |
| did:plc:p2ygpwluon3vrk5yecjq7wc5 | 56364 |
| did:plc:hm5cxb2g2q4ma4ucsks73tex | 53051 |
| did:plc:b5glolrcdfnaxwc4dbe4zbty | 43557 |
| did:plc:3rxcq5wacdop5thjoh3sny3p | 35103 |
| did:plc:6zfgmvghpidjvcn3cqtitnx5 | 28309 |
| did:plc:5syae7p7dr6zsfdoe4k6buky | 26882 |
| did:plc:dqgdfku26vpxdktkzne5x2xj | 26240 |
| did:plc:sese4fb6luojywziva3s7zjo | 24876 |
| ... | |
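One way to narrow down where the cache misses come from: group the same DID lookups by caller metadata. A sketch (the `caller_supplied_user_agent` field here is my guess at a useful discriminator; it may not distinguish our services in practice):

```sql
SELECT
  string(proto_payload.audit_log.request_metadata.caller_supplied_user_agent) as agent,
  count(*) as count
FROM bridgy-federated.logs._Default
where log_name='projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  and proto_payload.audit_log.method_name='google.datastore.v1.Datastore.Lookup'
  and string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  and split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
group by agent
order by count desc
```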
snarfed added a commit that referenced this issue Aug 1, 2024
@snarfed
Owner Author

snarfed commented Aug 1, 2024

Progress here! Managed to cut both datastore lookups and queries way down, 5-10x each.

[charts: datastore lookups and queries, before and after]
snarfed added a commit to snarfed/webutil that referenced this issue Aug 2, 2024
snarfed added a commit that referenced this issue Aug 2, 2024
snarfed added a commit that referenced this issue Aug 2, 2024
@snarfed
Owner Author

snarfed commented Aug 2, 2024

Now looking at optimizing log storage. We were doing ~250 GB/mo a bit ago; we're down to ~150 GB/mo or so now.

The main cost here is initial ingest: the first 50 GB/mo is free, then $0.50/GB after that, so ~150 GB/mo works out to roughly $50/mo. Ingest pricing includes 30d of retention, and our retention period is already set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing

Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging
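To see which log streams dominate, entry counts per log are a rough first cut (these views don't expose per-entry byte sizes, so this only approximates ingest volume):

```sql
SELECT log_name, count(*) as entries
FROM bridgy-federated.logs._AllLogs
group by log_name
order by entries desc
limit 20
```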

https://console.cloud.google.com/monitoring/dashboards/builder/24c22d42-91d8-4feb-aa6b-99dbb84c6417;duration=PT8H?project=bridgy-federated

[chart: log ingestion volume]
@snarfed
Owner Author

snarfed commented Aug 2, 2024

The next cost to look at is CPU. router is currently on four cores, atproto-hub on one; we should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs

[chart: CPU usage by service]
@snarfed
Owner Author

snarfed commented Aug 3, 2024

Next step for datastore reads: snarfed/arroba#30
