Message-ID: <20230406103841.GJ386572@hirez.programming.kicks-ass.net>
Date: Thu, 6 Apr 2023 12:38:41 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Peter Oskolkov <posk@...gle.com>
Cc: tglx@...utronix.de, bsegall@...gle.com,
linux-kernel@...r.kernel.org, pjt@...gle.com, posk@...k.io,
joshdon@...gle.com
Subject: Re: UMCG - how should we proceed? Should we?
On Tue, Mar 28, 2023 at 02:07:54PM -0700, Peter Oskolkov wrote:
> Hi Peter!
>
> TL;DR: which approach, if any, should a UMCG implementation in the mainline kernel use?
>
> Details:
>
> We are internally rolling out a UMCG implementation, copied below (with some
> boilerplate omitted), so I would like to restart our discussion on the topic.
>
> The implementation below is different from what we had earlier
> (https://lore.kernel.org/lkml/20220120155517.066795336@infradead.org/)
> in that it keeps UMCG state in the kernel rather than TLS.
>
> While having UMCG state in TLS is _much_ better, as it makes state synchronization
> between userspace and the kernel much simpler, the whole page pinning
> machinery in the link above looked very scary, honestly.
>
> So if we are ever going to have something like UMCG in the mainline kernel, we need
> to figure out which approach to use: the TLS-based one, something similar
> to what we have now internally (details below), or something else. Or none at all...
>
> While I would very much prefer to have it done your way (state in TLS), the page pinning
> business was too much for me. If you can figure out a way to do it cleanly and reliably, great!
A few quick notes without having looked at the patch...
> The main differences between what you had in the TLS patchset and what is below:
(note that in the end the per-task UMCG info thing didn't *need* to be
TLS, although it is a logical place to put it)
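For context, a minimal sketch of what such a per-task state word in user
memory could look like (field names and layout are illustrative, not the
actual patchset's ABI):

#include <stdint.h>
#include <stdatomic.h>

/*
 * Illustrative only -- not the real UMCG ABI. One instance per
 * worker/server task, living in user memory (TLS is a natural home),
 * shared between userspace and the kernel.
 */
enum umcg_task_state {
	UMCG_TASK_RUNNING	= 1,
	UMCG_TASK_RUNNABLE	= 2,
	UMCG_TASK_BLOCKED	= 3,
};

struct umcg_task {
	/*
	 * State word, cmpxchg'd by both kernel and userspace; having
	 * the kernel touch it from non-sleeping context is what drags
	 * in the page pinning machinery mentioned above.
	 */
	_Atomic uint64_t	state_ts;
	uint32_t		next_tid;	/* hypothetical: worker to run next, or 0 */
	uint32_t		server_tid;	/* hypothetical: server handling our events */
} __attribute__((aligned(64)));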
> - per worker/server state not in TLS but in task_struct
> - we keep a list of idle workers and a list of idle servers in mm
How much of a scalability fail is that? Mathieu and I are currently
poking at an rseq/cid regression due to heavy multi-threaded contention on
mm data.
But yeah, I think this was one of the open issues we still had with the
other implementation -- I seem to have a half-finished patch for an
idle_server list.
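To make the contention concern concrete: a single mm-wide list means every
block/wake transition of every thread in the process serializes on one lock
and bounces the same cache lines. A hedged userspace mock-up of one possible
mitigation -- sharding the idle list -- purely to illustrate the shape of the
fix (a kernel version would more likely use per-CPU or per-node data):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define NSHARDS 16	/* illustrative shard count */

struct idle_node {
	struct idle_node *next;
	int tid;
};

/*
 * One lock + list head per shard instead of a single mm-wide pair, so
 * concurrent enqueue/dequeue from many CPUs mostly touch different
 * cache lines.
 */
struct idle_shard {
	pthread_mutex_t lock;
	struct idle_node *head;
} __attribute__((aligned(64)));	/* keep shards on separate cache lines */

static struct idle_shard shards[NSHARDS];

static void shards_init(void)
{
	for (int i = 0; i < NSHARDS; i++)
		pthread_mutex_init(&shards[i].lock, NULL);
}

static void idle_push(struct idle_node *n)
{
	unsigned int s = (unsigned int)sched_getcpu() % NSHARDS;

	pthread_mutex_lock(&shards[s].lock);
	n->next = shards[s].head;
	shards[s].head = n;
	pthread_mutex_unlock(&shards[s].lock);
}

/* Pop from the local shard first, then scan the others. */
static struct idle_node *idle_pop(void)
{
	unsigned int start = (unsigned int)sched_getcpu() % NSHARDS;

	for (unsigned int i = 0; i < NSHARDS; i++) {
		unsigned int s = (start + i) % NSHARDS;
		struct idle_node *n;

		pthread_mutex_lock(&shards[s].lock);
		n = shards[s].head;
		if (n)
			shards[s].head = n->next;
		pthread_mutex_unlock(&shards[s].lock);
		if (n)
			return n;
	}
	return NULL;	/* no idle servers anywhere */
}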
> - worker wake events are delivered not to servers which ran the workers earlier,
> but to idle servers from the idle server list
Provided there is one, I take it; it's very easy to run out of idle things.
Also, what if you want to explicitly manage placement: can you still
direct the wakeup?
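On the placement question, an event could carry an explicit target and fall
back to the idle list only when none is given. Entirely hypothetical
interface, just to show the idea; none of these names exist in the patch:

#include <stdint.h>

struct umcg_wake_event {
	uint32_t worker_tid;	/* worker that just became runnable */
	uint32_t target_server;	/* 0 == "any idle server" */
};

/* Returns the tid of the server picked to handle the event, 0 if none. */
static uint32_t umcg_pick_server(const struct umcg_wake_event *ev,
				 uint32_t (*pop_idle_server)(void))
{
	if (ev->target_server)		/* explicit placement wins */
		return ev->target_server;
	return pop_idle_server();	/* may well come up empty under load */
}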
> - worker preemption happens not via a syscall (umcg_kick) but by hooking
> into sched_tick
Hooking into sched_tick would render it somewhat unsuitable for RT
workloads/schedulers, where you might need more immediate preemption.
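To put numbers on that: a sched_tick hook can delay preemption by up to a
full tick (1-10 ms depending on HZ), whereas a kick-style operation takes
effect more or less immediately. A userspace stand-in, using a signal in
place of the earlier umcg_kick() syscall:

#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Stand-in for a umcg_kick()-style operation: interrupt a specific
 * worker thread right now instead of waiting for the next tick.
 * SIGURG is an arbitrary example; a real UMCG kick would be a
 * dedicated syscall, not a signal.
 */
static int kick_worker(pid_t tgid, pid_t worker_tid)
{
	return syscall(SYS_tgkill, tgid, worker_tid, SIGURG);
}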
> None of the differences above are deal breakers; again, if the TLS/page pinning
> approach is viable, we will gladly use it.
Urgh, so yeah... I meant to go look at the whole UMCG thing again with an
eye specifically on inter-process support.
I'm hoping inter-process UMCG can be used to implement a custom libpthread
(rough sketch below) that would allow running most of userspace under a
custom UMCG scheduler and obviate the need for this horrible piece of shit
eBPF sched thing.
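A rough sketch of that interposition idea, with umcg_register_worker() as a
placeholder for whatever registration API eventually lands (build as a
shared object and LD_PRELOAD it):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Placeholder -- whatever the eventual UMCG userspace API provides. */
extern void umcg_register_worker(void);

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
			 void *(*)(void *), void *);

struct trampoline_arg {
	void *(*fn)(void *);
	void *arg;
};

static void *trampoline(void *p)
{
	struct trampoline_arg a = *(struct trampoline_arg *)p;

	free(p);
	umcg_register_worker();	/* hand this thread to the UMCG scheduler */
	return a.fn(a.arg);
}

/*
 * LD_PRELOAD-style interposer: every thread the process creates
 * becomes a UMCG worker without the application noticing.
 */
int pthread_create(pthread_t *t, const pthread_attr_t *attr,
		   void *(*fn)(void *), void *arg)
{
	static create_fn real_create;
	struct trampoline_arg *p;

	if (!real_create)
		real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");

	p = malloc(sizeof(*p));
	if (!p)
		return EAGAIN;
	p->fn = fn;
	p->arg = arg;
	return real_create(t, attr, trampoline, p);
}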
But I keep getting side-tracked with other stuff :/ I'll try and bump
this up the todo list.