Message-ID: <CAPNVh5cm9LhLEi1Td3rbOWtWH5oCvZTTMRd+p5bu75Epr3mqwA@mail.gmail.com>
Date: Mon, 12 Jul 2021 16:31:01 -0700
From: Peter Oskolkov <posk@...gle.com>
To: Thierry Delisle <tdelisle@...terloo.ca>
Cc: avagin@...gle.com, bsegall@...gle.com, jannh@...gle.com,
jnewsome@...project.org, joel@...lfernandes.org,
linux-api@...r.kernel.org, linux-kernel@...r.kernel.org,
mingo@...hat.com, mkarsten@...terloo.ca, pabuhr@...terloo.ca,
peterz@...radead.org, pjt@...gle.com, posk@...k.io,
tglx@...utronix.de
Subject: Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
On Mon, Jul 12, 2021 at 2:44 PM Thierry Delisle <tdelisle@...terloo.ca> wrote:
>
> > sys_umcg_wait without next_tid puts the task in UMCG_IDLE state; wake
> > wakes it. These are standard sched operations. If they are emulated
> > via futexes, fast context switching will require something like
> > FUTEX_SWAP that was NACKed last year.
>
> I understand these wait and wake semantics and the need for fast
> context switching (swap). As I see it, you need 3 operations:
>
> - SWAP: context-switch directly to a different thread, no scheduler involved
> - WAIT: block current thread, go back to server thread
> - WAKE: unblock target thread, add it to the scheduler, e.g. through
> idle_workers_ptr
>
> There is no existing syscall to handle SWAP, so I agree sys_umcg_wait is
> needed for this to work.
>
> However, there already exists sys_futex to handle WAIT and WAKE. When a
> worker calls either sys_futex WAIT or sys_umcg_wait with next_tid == NULL,
> in both cases the worker will block, SWAP to the server and wait for
> FUTEX_WAKE or UMCG_WAIT_WAKE_ONLY respectively. It's not obvious to me
> that there would be a performance difference, and the semantics seem to be
> the same to me.
>
> So what I am asking is: is UMCG_WAIT_WAKE_ONLY needed?
Because the approach you described was tried last year and was NACKed:
https://lore.kernel.org/lkml/20200722234538.166697-1-posk@posk.io/
In short, futex maintainers do not want to touch the existing futex
code at all other than for bugfixes. No new futex functionality,
period. See e.g. futex2 efforts:
https://lore.kernel.org/lkml/20210603195924.361327-1-andrealmeid@collabora.com/
> Is the idea to support workers directly context-switching among each other,
> without involving server threads and without going through idle_servers_ptr?
Yes.
> If so, can you explain some of the intended state transitions in this case.
Cooperative scheduling: workers can yield to each other (i.e. swap).
This allows building very fast concurrency primitives (mutexes,
producer-consumer queues, etc.). For example, in a producer-consumer
queue the producer prepares an item and then context-switches to the
consumer synchronously, on the same CPU. This can be done much faster
than waking the consumer asynchronously; it helps a lot to reduce
latency (throughput? not so much).
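
FWIW, a rough sketch of what such a synchronous hand-off could look like
on top of sys_umcg_wait. The umcg_wait() wrapper name, its prototype and
the mailbox helpers below are my assumptions for illustration only, not
the final interface:

/*
 * Hypothetical sketch: umcg_wait() stands for a thin userspace wrapper
 * around sys_umcg_wait(); its exact prototype is an assumption based on
 * this RFC and may not match the final patchset.
 */
#include <stdatomic.h>
#include <stdint.h>

extern long umcg_wait(uint32_t flags, uint32_t next_tid,
                      uint64_t abs_timeout);

struct item { int value; };

/* Single-slot mailbox shared by one producer and one consumer worker. */
static struct item mailbox;
static atomic_int mailbox_full;

static void produce(uint32_t consumer_tid, int value)
{
        mailbox.value = value;
        /* The consumer checks mailbox_full once it is switched to. */
        atomic_store_explicit(&mailbox_full, 1, memory_order_release);

        /*
         * Hand the CPU directly to the consumer: no wakeup through the
         * scheduler, the consumer runs synchronously right here.
         */
        umcg_wait(0, consumer_tid, 0);
}

The point is that the hand-off is a single syscall and the consumer starts
running immediately on the same CPU, instead of being woken and scheduled
asynchronously.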
>
>
> > > However, I do not understand how the userspace is expected to use it.
> > > I also do not understand if these link fields form a stack or a queue
> > > and where is the head.
> >
> > When a server has nothing to do (no work to run), it is put into IDLE
> > state and added to the list. The kernel wakes an IDLE server if a
> > blocked worker unblocks.
>
> From the code in umcg_wq_worker_running (Step 3), I am guessing users are
> expected to provide a global head somewhere in memory and
> umcg_task.idle_servers_ptr points to the head of the list for all workers.
> Servers are then added in user space using atomic_stack_push_user. Is this
> correct? I did not find any documentation on the list head.
This is going to change somewhat. I'll post a new patchset soon-ish
(later this week?).
>
> I like the idea that each worker thread points to a given list; it allows
> the possibility of separate containers with their own independent servers,
> workers and scheduling. However, it seems that the list itself could be
> implemented using existing kernel APIs, for example a futex or an eventfd.
> Like so:
>
> struct umcg_task {
>         [...]
>
>         /**
>          * @idle_futex_ptr: pointer to a futex used for idle server threads.
>          *
>          * When waking a worker, the kernel decrements the pointed-to futex
>          * value if it is non-zero, and wakes a server if the decrement
>          * occurred.
>          *
>          * Server threads that have no work to do should increment the
>          * futex value and FUTEX_WAIT.
>          */
>         uint64_t        idle_futex_ptr;         /* r/w */
>
>         [...]
> } __attribute__((packed, aligned(8 * sizeof(__u64))));
>
> I believe the futex approach, like the list, has the advantage that when
> there are no idle servers, checking the list requires no locking. I don't
> know if that can be achieved with eventfd.
As mentioned above, a futex-based solution was rejected by the
maintainers. Believe me, I tried. Not only did I try: we have a working
userspace scheduling stack on top of FUTEX_SWAP deployed internally at
Google, with some actual usage (mostly testing/debugging workloads). I
suggest we stop discussing futexes in this context, as I do not see the
maintainers changing their position. And the approach used in this
patchset seems to work (and I actually like how the code is shaping
up).
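
For reference, here is roughly how the three operations you listed map
onto sys_umcg_wait alone. Again, the wrapper prototype and the flag value
are my assumptions for illustration, not the final API:

/* Hypothetical wrapper around sys_umcg_wait(); the real prototype and
 * flag values in the final patchset may differ. */
#include <stdint.h>

extern long umcg_wait(uint32_t flags, uint32_t next_tid,
                      uint64_t abs_timeout);

#define UMCG_WAIT_WAKE_ONLY 1u          /* assumed value */

/* WAIT: no next_tid - this worker goes UMCG_IDLE and the kernel
 * switches back to its server thread. */
static void worker_block(void)
{
        umcg_wait(0, 0, 0);
}

/* WAKE: unblock the target worker without blocking the caller; the
 * woken worker is handed to a server, e.g. via idle_workers_ptr. */
static void worker_wake(uint32_t worker_tid)
{
        umcg_wait(UMCG_WAIT_WAKE_ONLY, worker_tid, 0);
}

/* SWAP: block the caller and run the target worker in its place. */
static void worker_swap(uint32_t next_worker_tid)
{
        umcg_wait(0, next_worker_tid, 0);
}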