Message-ID: <20200804123147.GI2674@hirez.programming.kicks-ass.net>
Date:   Tue, 4 Aug 2020 14:31:47 +0200
From:   peterz@...radead.org
To:     Peter Oskolkov <posk@...k.io>
Cc:     Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Ingo Molnar <mingo@...nel.org>,
        Darren Hart <dvhart@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Peter Oskolkov <posk@...gle.com>,
        Andrei Vagin <avagin@...gle.com>, Paul Turner <pjt@...gle.com>,
        Ben Segall <bsegall@...gle.com>, Aaron Lu <aaron.lwe@...il.com>
Subject: Re: [PATCH for 5.9 v2 1/4] futex: introduce FUTEX_SWAP operation

On Mon, Aug 03, 2020 at 03:15:07PM -0700, Peter Oskolkov wrote:
> A simplified/idealized use case: imagine a multi-user service application
> (e.g. a DBMS) that has to implement the following user CPU quota
> policy:

So the last posting made Hacker News, and a bunch of people there
expressed far more interest in coroutines which, if I'm not mistaken,
can also be implemented using all this.

Would that not make for a far simpler and more convincing use-case?

> - block detection: when a task blocks in the kernel (on a network
>   read, for example), the userspace scheduler is notified and
>   schedules (resumes or swaps into) a pending task in the newly available
>   CPU slot;
> - wake detection: when a task wakes from a previously blocking kernel
>   operation (e.g. can now process some data on a network socket), the
>   userspace scheduler is notified and can now schedule the task to
>   run on a CPU when a CPU is available and the task can use it according
>   to its scheduling policy.
> 
> (Technically, block/wake detection is still experimental and not
> used widely: as we control the userspace, we can actually determine
> blocking/waking syscalls without kernel support).
> 
> Internally we currently use kernel patches that are too "intrusive" to be
> included in a general-purpose Linux kernel, so we are exploring ways to
> upstream this functionality.
> 
> The easiest/least intrusive approach that we have come up with is this:
> 
> - block/resume map perfectly to futex wait/wake;
> - switch_to thus maps to FUTEX_SWAP;
> - block and wake detection can be done either through tracing
>   or by introducing new BPF attach points (when a task blocks or wakes,
>   a BPF program is triggered that then communicates with the userspace);
> - the BPF attach points are per task, and the task needs to "opt in"
>   (i.e. all other tasks suffer just an additional pointer comparison
>   on block/wake);
> - the BPF programs triggered on block/wake should be able to perform
>   futex ops (e.g. wake a designated userspace scheduling task) - this
>   probably indicates that tracing is not enough, and a new BPF prog type
>   is needed.
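
For reference, the block/resume mapping quoted above can be sketched
with today's futex ops. This is a minimal illustration, not code from
the patch: struct task, task_block(), task_resume(), task_switch_to()
and futex_demo() are hypothetical names, and since FUTEX_SWAP is not in
mainline headers, switch_to is emulated here as a wake followed by a
wait (two syscalls, which is the pair the proposed op would fold into
one).

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical per-task state: one futex word, 1 means "may run". */
struct task {
	_Atomic uint32_t runnable;
};

/* glibc exports no futex() wrapper, so go through syscall(2). */
static long futex_wait(_Atomic uint32_t *addr, uint32_t expected)
{
	return syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
		       NULL, NULL, 0);
}

static long futex_wake(_Atomic uint32_t *addr, int nr)
{
	return syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, nr,
		       NULL, NULL, 0);
}

/* block: sleep until someone marks this task runnable. */
static void task_block(struct task *t)
{
	/* Loop: FUTEX_WAIT can return early (EAGAIN, EINTR, spurious). */
	while (atomic_load(&t->runnable) == 0)
		futex_wait(&t->runnable, 0);
	atomic_store(&t->runnable, 0);
}

/* resume: mark the task runnable and wake it if it is sleeping. */
static void task_resume(struct task *t)
{
	atomic_store(&t->runnable, 1);
	futex_wake(&t->runnable, 1);
}

/* Emulated switch_to: wake the next task, then put ourselves to sleep. */
static void task_switch_to(struct task *self, struct task *next)
{
	task_resume(next);
	task_block(self);
}

static struct task task_a, task_b;
static int trace[3];
static int ntrace;

static void *worker(void *arg)
{
	(void)arg;
	task_block(&task_b);		/* wait until task A switches to us */
	trace[ntrace++] = 2;
	task_resume(&task_a);		/* hand the "CPU slot" back */
	return NULL;
}

/* Returns 1 if the two tasks ran in the order A, B, A. */
int futex_demo(void)
{
	pthread_t th;

	ntrace = 0;
	atomic_store(&task_a.runnable, 0);
	atomic_store(&task_b.runnable, 0);
	if (pthread_create(&th, NULL, worker, NULL) != 0)
		return 0;

	trace[ntrace++] = 1;
	task_switch_to(&task_a, &task_b);
	trace[ntrace++] = 3;

	pthread_join(th, NULL);
	return ntrace == 3 && trace[0] == 1 && trace[1] == 2 && trace[2] == 3;
}
```

Calling futex_demo() from main() (link with -pthread on older glibc)
runs one A -> B -> A hand-off. Note that nothing here tells task A when
task B blocks inside the kernel for some other reason; that is the
block/wake detection part discussed above.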

I really think we want to have block/resume detection sorted before this
goes anywhere. I also strongly feel BPF should not be used for
functional interfaces like that.

That is, I want to see a complete interface before I commit to an ABI
that we're stuck with.

I also want to see userspace that goes along with it; like with
sys_membarrier() / liburcu and sys_rseq() / librseq (which seems to be
heading for glibc).

Also, and this seems to be the crux of the whole endeavour, you want to
allow your 'fibers' to block. Which is what makes
{make,swap,get,set}context() unsuited for your needs and gives rise to
the whole block/resume issue above.
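
To make the contrast concrete, here is a minimal sketch of a coroutine
built on the ucontext calls. The names fiber_fn() and coroutine_demo()
and the 64 KiB stack size are assumptions made for illustration. Control
moves between the two contexts on a single kernel thread; if fiber_fn()
were to block in a syscall such as read(), that whole thread, and every
other context multiplexed on it, would stall.

```c
#include <ucontext.h>

#define FIBER_STACK_SIZE (64 * 1024)	/* assumed size, plenty for this demo */

static ucontext_t main_ctx, fiber_ctx;
static char fiber_stack[FIBER_STACK_SIZE];
static int steps[4];
static int nsteps;

/* Body of the fiber: runs on its own stack, yields once, then returns. */
static void fiber_fn(void)
{
	steps[nsteps++] = 2;
	swapcontext(&fiber_ctx, &main_ctx);	/* yield back to the caller */
	steps[nsteps++] = 4;
	/* Returning from here continues at uc_link, i.e. main_ctx. */
}

/* Returns 1 if control alternated main, fiber, main, fiber. */
int coroutine_demo(void)
{
	nsteps = 0;

	if (getcontext(&fiber_ctx) != 0)
		return 0;
	fiber_ctx.uc_stack.ss_sp = fiber_stack;
	fiber_ctx.uc_stack.ss_size = sizeof(fiber_stack);
	fiber_ctx.uc_link = &main_ctx;
	makecontext(&fiber_ctx, fiber_fn, 0);

	steps[nsteps++] = 1;
	swapcontext(&main_ctx, &fiber_ctx);	/* run fiber until it yields */
	steps[nsteps++] = 3;
	swapcontext(&main_ctx, &fiber_ctx);	/* resume fiber until it returns */

	return nsteps == 4 && steps[0] == 1 && steps[1] == 2 &&
	       steps[2] == 3 && steps[3] == 4;
}
```

Calling coroutine_demo() from main() performs two round trips between
the main context and the fiber. The scheduler never learns about a
blocked fiber in this model, because the kernel only sees one task.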

Also, I want words on the interaction between resume notification and
wake-up preemption. That is, how do you envision managing the
interaction between the two schedulers?


All in all, I don't think you're even close to having something
mergeable.