Optimize costs #1149
Related: #1152 (but I expect outbox is a small fraction of overall cost)
Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for the Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .
Other leads:
Got ~20h of datastore logs; digging into them in Log Analytics now: https://console.cloud.google.com/logs/analytics
...maybe promising? E.g. this query breaks down by API method:

```sql
SELECT DISTINCT
  proto_payload.audit_log.method_name AS method_name,
  count(*) AS count
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
LIMIT 1000
```

and this one samples the actual contents of queries:

```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
LIMIT 200
```

I can't aggregate (group by) on fields inside the JSON payload, though.
It's in BigQuery, https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbridgy-federated!2slogs!3s_Default, let's see how that goes. 37M rows total.

Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.
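For reference, the aggregation I'm after is simple once the nested JSON is pulled out. A minimal client-side sketch of the same breakdown, using synthetic rows rather than the real export schema:

```python
import json
from collections import Counter

# Synthetic audit-log rows; real rows carry the method under
# proto_payload.audit_log.method_name in the BigQuery export.
rows = [
    '{"audit_log": {"method_name": "google.datastore.v1.Datastore.RunQuery"}}',
    '{"audit_log": {"method_name": "google.datastore.v1.Datastore.Lookup"}}',
    '{"audit_log": {"method_name": "google.datastore.v1.Datastore.RunQuery"}}',
]

# Extract the nested field first, then aggregate: the workaround a view
# with JSON_EXTRACT_SCALAR would do server-side.
counts = Counter(json.loads(row)['audit_log']['method_name'] for row in rows)
for method, count in counts.most_common():
    print(method, count)
```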
Back to looking at this.
Some useful datastore API usage analytics below, and already finding some useful results. Queries are ~40% of datastore API calls, and the vast majority of those are looking up originals.

API calls by method:

```sql
SELECT DISTINCT proto_payload.audit_log.method_name AS method_name, count(*) AS count
FROM `bridgy-federated.logs._AllLogs`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
```

RunQuery calls by kind and number of composite filters:

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) AS kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) AS num_composite_filters,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
GROUP BY kind, num_composite_filters
ORDER BY count(*) DESC
```

RunQuery calls by kind and property filter:

```sql
SELECT
  string(proto_payload.audit_log.request.query.kind[0].name) AS kind,
  FilterStr(proto_payload.audit_log.request.query.filter) AS filter,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
  AND proto_payload.audit_log.request.query.filter.propertyFilter IS NOT NULL
GROUP BY kind, filter
ORDER BY count(*) DESC
```
^ should hopefully help with queries. Here's the monitoring I'm watching: https://console.cloud.google.com/apis/api/datastore.googleapis.com/metrics?project=bridgy-federated&pageState=%28%22duration%22%3A%28%22groupValue%22%3A%22PT1H%22%2C%22customValue%22%3Anull%29%29
Now looking at lookups, aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

Lookup calls by number of keys:

```sql
SELECT
  array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) AS num_keys,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
GROUP BY num_keys
ORDER BY count(*) DESC
```

Single-key lookups by kind:

```sql
SELECT
  string(proto_payload.audit_log.request.keys[0].path[0].kind) AS kind,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
GROUP BY kind
ORDER BY count(*) DESC
```

Object lookups by id scheme:

```sql
SELECT
  split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] AS scheme,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
GROUP BY scheme
ORDER BY count(*) DESC
```
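The scheme bucketing in that last query is just a split on the first `:`. The same logic in Python terms, with made-up example key names:

```python
from collections import Counter

# Hypothetical Object key names: AT Protocol DIDs vs. web URLs vs. at:// URIs.
ids = [
    'did:plc:abc123',
    'did:web:example.com',
    'https://example.com/post/1',
    'did:plc:xyz789',
    'at://did:plc:abc123/app.bsky.feed.post/1',
]

# Mirrors split(name, ':')[0] in the SQL above.
schemes = Counter(key.split(':', 1)[0] for key in ids)
print(schemes.most_common())
```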
I don't understand why the DID doc lookups are so frequent. Breaking them down by DID:

```sql
SELECT
  JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) AS name,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  AND split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
GROUP BY name
ORDER BY count(*) DESC
LIMIT 100
```
Now looking at optimizing log storage. We were doing ~250 GB/mo a bit ago; we're now down to ~150 GB/mo or so. The main cost here is initial ingest: we get 50 GB/mo for free, then $0.50/GB after that. That includes 30d retention, and our current retention period is set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging
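Sanity-checking those numbers, a quick sketch (the 50 GB free tier and $0.50/GB rate are from the pricing page above):

```python
def monthly_log_cost(ingest_gb, free_gb=50, rate_per_gb=0.50):
    """Cloud Logging ingest cost: first free_gb GB/month free, flat rate after."""
    return max(ingest_gb - free_gb, 0) * rate_per_gb

print(monthly_log_cost(250))  # old ~250 GB/mo: $100
print(monthly_log_cost(150))  # current ~150 GB/mo: $50
```

So the log-volume reduction so far has roughly halved the ingest bill.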
The next cost to look at is CPU. router is currently on four cores, atproto-hub on one. We should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs
Next step for datastore reads: snarfed/arroba#30 |
Don't want to draw attention to this; I've been looking at it mostly behind the scenes, but I'd like to start tracking at least the investigation and work more publicly.

I expect there's plenty of low-hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.
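For context on why the memcached instance should be absorbing these reads: a minimal read-through cache sketch. The client and lookup function below are illustrative stand-ins, not Bridgy Fed's actual code.

```python
import json

class FakeMemcache:
    """Stand-in for a memcache-style client (get/set with TTL)."""
    def __init__(self):
        self.store = {}
        self.gets = 0
    def get(self, key):
        self.gets += 1
        return self.store.get(key)
    def set(self, key, value, expire=0):
        self.store[key] = value

DATASTORE_READS = 0

def datastore_lookup(key):
    """Pretend datastore fetch; each call here is a billed read."""
    global DATASTORE_READS
    DATASTORE_READS += 1
    return {'id': key}

def cached_lookup(cache, key, ttl=3600):
    """Read-through: try memcache first, fall back to datastore, then populate."""
    val = cache.get(key)
    if val is not None:
        return json.loads(val)
    entity = datastore_lookup(key)
    cache.set(key, json.dumps(entity), expire=ttl)
    return entity

cache = FakeMemcache()
for _ in range(10):
    cached_lookup(cache, 'did:plc:abc123')
print(DATASTORE_READS)  # only the first lookup hits the datastore
```

If repeated lookups of the same keys (e.g. hot DID docs) are still reaching the datastore, something in this path isn't behaving: misses from TTL expiry, keys that bypass the cache, or entities too large to store.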