Thread Per Core

async-execution-model-thread-per-core

Fig: Thread per core async runtime based execution model for laminarmq

In the thread per core model since each processor core is limited to a single thread, tasks in a thread need to be scheduled efficiently. Hence each worker thread runs their own task scheduler.

We currently use glommio as our thread-per-core runtime.

Here, tasks can be scheduled on different task queues and different task queues can be provisioned with specific fractions of CPU time shares. Generally tasks with similar latency profiles are executed on the same task queue. For instance web server tasks will be executed on a different queue than the one that runs tasks for persisting data to the disk.

We re-use the same constructs that we use in the general async runtime execution model. The only difference being, we explicitly care about in which task queue a class of future's tasks are executed. In our case, we have the following 4 task queues:

  • Request router task queue
  • HTTP server request parser task queue
  • Partition replica controller task queue
  • Response poller task queue

Each of these task queue can be assigned specific fractions of CPU time shares. glommio also provides utilities for automatically deducing these CPU time shares based on their runtime latency profiles.

Apart from this glommio leverages the new linux 5.x io_uring API which facilitates true asynchronous IO for both networking and disk interfaces. (Other async runtimes such as tokio make blocking system calls for disk IO operations in a thread-pool.)

io_uring also has the advantage of being able to queue together multiple system calls together and then asynchronously wait for their completion by making a maximum of one context switch. It is also possible to avoid context switches altogether. This is achieved with a pair of ring buffers called the submission-queue and the completion-queue. Once the queues are set up, user can queue multiple system calls on the submission queue. The linux kernel processes the system calls and places the results in the completion queue. The user can then freely read the results from the completion-queue. This entire process after setting up the queues doesn't require any additional context switch.

Read more: https://man.archlinux.org/man/io_uring.7.en

glommio presents additional abstractions on top of io_uring in the form of an async runtime, with support for networking, disk IO, channels, single threaded locks and more.

Read more: https://www.datadoghq.com/blog/engineering/introducing-glommio/