router sometimes slows down badly #1138

snarfed · 2024-06-18T17:13:09Z

The router sometimes gets into a bad state where it takes forever to handle /queue/receive requests, e.g. 30-75s when they should average 0.5-2s or so. I don't understand what's going on here yet, or why this only happens sometimes.

It seems maybe loosely related to the number of WSGI workers and threads per worker, i.e. it seems worse with one worker with 100 threads and better with five workers with 10 threads each, but only somewhat, and I'm not 100% sure of the correlation.

Maybe it's context switching overhead between threads? But the slowdown seems way too drastic to be caused by that alone. Another theory is that the thread pool gets stuck on tasks that need HTTP requests to external servers that are down or very slow. Either our per-request timeout is too long, or the timeout is ok but we attempt a lot of different outbound requests per task. Either way, these tasks hold threads for a long time and starve other tasks. That theory feels unsatisfying too, but I don't have any other theories yet. Hrmph.
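The starvation theory above can be illustrated with a minimal sketch. It assumes a fixed-size thread pool, with `time.sleep` standing in for an outbound HTTP request to a slow server. The pool size, the durations, and all function names are hypothetical and are not taken from the router's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2        # hypothetical; the issue discusses 10-100 threads per worker
SLOW_SECONDS = 0.3   # stands in for an outbound request to a slow or dead server


def slow_task():
    """Simulates a task blocked on a slow or unreachable external server."""
    time.sleep(SLOW_SECONDS)


def fast_task():
    """Returns the time at which it actually started running on a thread."""
    return time.monotonic()


def measure_fast_task_delay():
    """Returns how long a fast task waits when slow tasks hold every thread."""
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Occupy every thread in the pool with a slow task.
        for _ in range(POOL_SIZE):
            pool.submit(slow_task)
        submitted = time.monotonic()
        # The fast task sits in the queue until a slow task releases a thread.
        started = pool.submit(fast_task).result()
    return started - submitted


delay = measure_fast_task_delay()
print(f"fast task waited {delay:.2f}s for a free thread")
```

The fast task does almost no work itself, but its observed latency is roughly the full duration of a slow task. If this theory applies, a task that makes several outbound requests in sequence would hold its thread for up to the per-request timeout times the number of requests.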

snarfed · 2024-06-25T20:32:13Z

The explanation here may be simpler: it may actually just be that the CPU gets pegged and we're suddenly CPU-bound. Here's the last day or so; note the correlation:

Ugh. Well, the silver lining is that this is at least very understandable and manageable: adding cores and/or optimizing should fix it.
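One way to put a number on an eyeballed correlation like this is a Pearson coefficient. This is a sketch that assumes CPU utilization and request latency have been exported as two equal-length series sampled at the same times. That export step and the function name `pearson` are assumptions, not something described in this thread.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and each series' sum of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would support the CPU-bound explanation, and a value near 0 would point back toward other theories. A high coefficient alone does not show which way the causation runs.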

snarfed · 2024-06-25T22:37:27Z

More CPU vs latency correlation. At 1:45p, we bumped router up from 2 cores to 4, with a single WSGI worker with 200 threads. That didn't seem to work well, maybe because of the GIL and context switching? ...so at 2:15p (I think) I switched it to 4 WSGI workers, one per core, with 50 threads each.
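The switch described above keeps the same total thread count but splits it into one worker per core. It can be sketched as a gunicorn-style config file. Using gunicorn is an assumption, since the thread only says "WSGI workers". `workers`, `threads`, and `worker_class` are gunicorn settings, and `split_threads` is a hypothetical helper.

```python
# gunicorn.conf.py (sketch; assumes the WSGI server is gunicorn)


def split_threads(total_threads, cores):
    """Splits a total thread budget into one worker process per core.

    Returns (workers, threads_per_worker). Any remainder from the integer
    division is dropped, so the total can come out slightly lower.
    """
    workers = max(1, cores)
    threads_per_worker = max(1, total_threads // workers)
    return workers, threads_per_worker


# 200 total threads on 4 cores -> 4 workers with 50 threads each.
workers, threads = split_threads(total_threads=200, cores=4)

# Threaded worker type, so each worker process serves requests from a thread pool.
worker_class = "gthread"
```

The reasoning behind one worker per core: in CPython, the GIL lets only one thread per process execute Python bytecode at a time, so a single worker cannot spread Python work across 4 cores no matter how many threads it has.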

snarfed · 2024-07-18T04:09:29Z

Haven't seen this since we went to four cores, which pretty much confirms it was CPU. Closing.

snarfed added now infra labels Jun 18, 2024

snarfed added a commit that referenced this issue Jun 19, 2024

router config: back to two workers, 50 threads each 🤷

875db04

for #1138

snarfed removed the now label Jun 24, 2024

snarfed added a commit that referenced this issue Jun 25, 2024

bump up router to 4 cores, bump up receive and send queues to match

863aa96

for #1138

snarfed closed this as completed Jul 18, 2024
