router sometimes slows down badly #1138

Closed
snarfed opened this issue Jun 18, 2024 · 3 comments

@snarfed
Owner

snarfed commented Jun 18, 2024

The router sometimes gets into a bad state where it takes forever to handle /queue/receive requests, e.g. 30-75s when they should average 0.5-2s or so. I don't understand what's going on here yet, or why it only happens sometimes.

image

It seems loosely related to the number of WSGI workers and threads per worker: it seems worse with one worker with 100 threads and better with five workers with 10 threads each, but only somewhat, and I'm not 100% sure of the correlation.

Maybe it's context switching overhead between threads? But the slowdown seems way too drastic to be caused by that alone. Another theory is that the thread pool gets stuck on tasks that make HTTP requests to external servers that are down or very slow. Either our per-request timeout is too long, or the timeout is fine but we attempt a lot of different outbound requests per task, so these tasks starve other tasks. That theory feels unsatisfying too, but I don't have any others yet. Hrmph.
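To make the starvation theory concrete, here's a rough sketch of how a per-task time budget could cap the total time one task spends on outbound requests, no matter how many it attempts. `TaskBudget`, `worst_case_seconds`, and the numbers (10 requests, 15s timeout, 30s budget) are hypothetical, made up for illustration; none of this is the router's actual code.

```python
import time


class TaskBudget:
    """Hypothetical helper: one time budget shared by all outbound requests in a task."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._deadline - self._clock())

    def next_timeout(self, per_request_seconds):
        """Timeout to use for the next request.

        Returns the per-request cap or the time left in the budget, whichever
        is smaller. Raises TimeoutError once the budget is used up, so the
        task can give up and free its thread.
        """
        left = self.remaining()
        if left <= 0:
            raise TimeoutError('task budget exhausted')
        return min(per_request_seconds, left)


def worst_case_seconds(num_requests, per_request_seconds, budget_seconds=None):
    """Longest a task can hold a thread if every outbound request times out."""
    total = num_requests * per_request_seconds
    return total if budget_seconds is None else min(total, budget_seconds)


# Made-up numbers: 10 sequential requests that each hit a 15s timeout.
print(worst_case_seconds(10, 15))      # no budget: 150
print(worst_case_seconds(10, 15, 30))  # with a 30s task budget: 30
```

The value from `next_timeout()` would be passed as the timeout of whatever HTTP client the task uses. The point is just that a reasonable per-request timeout multiplied by many requests per task can still tie up a thread for minutes.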

@snarfed
Owner Author

snarfed commented Jun 25, 2024

The explanation here may be simpler: it may just be that the CPU gets pegged and we're suddenly CPU-bound. Here's the last day or so; note the correlation:

image image

Ugh. Well, the silver lining is that at least that's very understandable and manageable: adding cores and/or optimizing should fix it.

@snarfed
Owner Author

snarfed commented Jun 25, 2024

More CPU vs latency correlation. At 1:45p, we bumped router up from 2 cores to 4, with a single WSGI worker with 200 threads. That didn't seem to work well, maybe because of the GIL and context switching? So at 2:15p (I think) I switched it to 4 WSGI workers, one per core, with 50 threads each.

image image
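
For reference, the one-worker-per-core layout described above might look roughly like this as a config file. This assumes the WSGI server is gunicorn, which this thread doesn't actually state; `worker_layout` is a hypothetical helper, while `workers`, `threads`, and `worker_class` are standard gunicorn settings.

```python
# gunicorn.conf.py
# Hypothetical config: assumes the WSGI server is gunicorn.


def worker_layout(total_threads, cores):
    """Split a total thread count evenly into one worker process per core."""
    workers = cores
    threads_per_worker = max(1, total_threads // cores)
    return workers, threads_per_worker


# 200 threads total on 4 cores, as described above: 4 workers x 50 threads.
workers, threads = worker_layout(total_threads=200, cores=4)

# gunicorn's threaded worker class.
worker_class = 'gthread'
```

In standard CPython, each worker process has its own interpreter and GIL, so one worker per core lets CPU-bound work run in parallel across cores, which a single worker with 200 threads can't do.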
@snarfed
Owner Author

snarfed commented Jul 18, 2024

Haven't seen this since we went to four cores, which pretty much confirms it was CPU. Closing.

@snarfed snarfed closed this as completed Jul 18, 2024