linux-kernel - Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers


Message-ID: <CALCETrUnnHN0dyx02ewuRu=_TZ+cD=HaEzNs34ADNbWps1tTBw@mail.gmail.com>
Date:   Wed, 15 Sep 2021 09:50:41 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Andy Lutomirski <luto@...nel.org>, Jann Horn <jannh@...gle.com>,
        Peter Oskolkov <posk@...gle.com>,
        Peter Oskolkov <posk@...k.io>, Ingo Molnar <mingo@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>,
        Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
        Andrei Vagin <avagin@...gle.com>,
        Thierry Delisle <tdelisle@...terloo.ca>
Subject: Re: [PATCH 2/4 v0.5] sched/umcg: RFC: add userspace atomic helpers

On Wed, Sep 15, 2021 at 8:45 AM Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Tue, Sep 14, 2021 at 11:40:01AM -0700, Andy Lutomirski wrote:
> >
> >
> > On Tue, Sep 14, 2021, at 11:11 AM, Peter Zijlstra wrote:
> > > On Tue, Sep 14, 2021 at 09:52:08AM -0700, Andy Lutomirski wrote:
> > > > With a custom mapping, you don’t need to pin pages at all, I think.
> > > > As long as you can reconstruct the contents of the shared page and
> > > > you’re willing to do some slightly careful synchronization, you can
> > > > detect that the page is missing when you try to update it and skip the
> > > > update. The vm_ops->fault handler can repopulate the page the next
> > > > time it’s accessed.
> > >
> > > The point is that the moment we know we need to do this user-poke, is
> > > schedule(), which could be called while holding mmap_sem (it being a
> > > preemptable lock). Which means we cannot go and do faults.
> >
> > That’s fine. The page would be in one of two states: present and
> > writable by the kernel, or completely gone. If it’s present, the
> > scheduler writes it. If it’s gone, the scheduler skips the write and
> > the next fault fills it in.
>
> That's non-deterministic, and as such not suitable.

What's the precise problem?  The code would be roughly:

if (try_pin_the_page) {
  write it;
  unpin;
} else {
  /* do nothing: .fault will fill in the correct contents */
}

The time this takes is nondeterministic, but it's bounded and short.
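
To make that concrete, here is a minimal userspace model of the pattern above. It is only a sketch: every name in it (struct shared_page, try_pin, sched_update, fault_in, user_read) is hypothetical and not a kernel API, and it assumes the page contents can always be rebuilt from state the kernel keeps elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct shared_page {
    bool present;       /* models "page is resident and can be pinned" */
    uint64_t contents;  /* the word the scheduler wants to publish */
};

struct task_model {
    uint64_t state;          /* authoritative copy, always up to date */
    struct shared_page page; /* copy that userspace reads */
};

/* Models try_pin_the_page: succeeds only if the page is present. Never blocks. */
static bool try_pin(struct shared_page *p) { return p->present; }

/* Nothing to release in this model. */
static void unpin(struct shared_page *p) { (void)p; }

/* Scheduler-side update: a bounded amount of work, and it never faults. */
static void sched_update(struct task_model *t, uint64_t new_state)
{
    t->state = new_state;
    if (try_pin(&t->page)) {
        t->page.contents = new_state;  /* write it */
        unpin(&t->page);
    }
    /* else: do nothing; fault_in() will fill in the correct contents */
}

/* Models the .fault handler: rebuild the page from the authoritative state. */
static void fault_in(struct task_model *t)
{
    t->page.contents = t->state;
    t->page.present = true;
}

/* Userspace read: takes the "fault" first if the page is missing. */
static uint64_t user_read(struct task_model *t)
{
    if (!t->page.present)
        fault_in(t);
    return t->page.contents;
}
```

In this model a skipped write is never lost, because the reader only ever sees the page after the fault path has rebuilt it from the authoritative state.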

>
> > > > All that being said, I feel like I’m missing something. The point of
> > > > this is to send what the old M:N folks called “scheduler activations”,
> > > > right?  Wouldn’t it be more efficient to explicitly wake something
> > > > blockable/pollable and write the message into a more efficient data
> > > > structure?  Polling one page per task from userspace seems like it
> > > > will have inherently high latency due to the polling interval and will
> > > > also have very poor locality.  Or am I missing something?
> > >
> > > > The idea was to link the user structures together in a (single) linked
> > > > list. The server structure gets a list of all the blocked tasks. This
> > > > avoids having to do a full N iteration (like Java, they're talking a
> > > > stupid number of N).
> > >
> > > > Polling should not happen: once we run out of runnable tasks, the
> > > > server task gets run again and it can instantly pick up all the
> > > > blocked notifications.
> > >
> >
> > How does the server task know when to read the linked list?  And
> > what’s wrong with a ring buffer or a syscall?
>
> Same problem, ring-buffer has the case where it's full and events get
> dropped, at which point you've completely lost state. If it is at all
> possible to recover from that, doing so is non-deterministic.
>
> I really want this stuff to work for realtime workloads too.

A ring buffer would have a bounded size -- one word (of whatever size)
per user thread.
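
A rough sketch of why such a ring cannot overflow, under an assumption that goes beyond what is stated above: each user thread has at most one event outstanding at a time. All names here (struct event_ring, ring_push, ring_pop, NTHREADS) are hypothetical, and the model is single-threaded with no atomics or memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 4  /* hypothetical number of user threads */

struct event_ring {
    uint32_t slots[NTHREADS]; /* one word per user thread */
    size_t head;              /* index of the next slot to read */
    size_t count;             /* number of queued entries */
    bool pending[NTHREADS];   /* true while a thread's event is queued */
};

/*
 * Queue an event for thread tid. Fails only for an invalid tid or a thread
 * that already has an event queued; it never fails for lack of space.
 */
static bool ring_push(struct event_ring *r, uint32_t tid)
{
    if (tid >= NTHREADS || r->pending[tid])
        return false;
    /* At most one entry per thread, so fewer than NTHREADS are queued. */
    assert(r->count < NTHREADS);
    r->slots[(r->head + r->count) % NTHREADS] = tid;
    r->count++;
    r->pending[tid] = true;
    return true;
}

/* Dequeue the oldest event; returns false if the ring is empty. */
static bool ring_pop(struct event_ring *r, uint32_t *tid)
{
    if (r->count == 0)
        return false;
    *tid = r->slots[r->head];
    r->head = (r->head + 1) % NTHREADS;
    r->count--;
    r->pending[*tid] = false;
    return true;
}
```

With the one-event-per-thread invariant, the ring is full only when every thread already has an entry queued, so there is never a new event to drop.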