linux-kernel - Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YnA5CUJKvqmXJxf2@zx2c4.com>
Date:   Mon, 2 May 2022 22:03:21 +0200
From:   "Jason A. Donenfeld" <Jason@...c4.com>
To:     Alexander Graf <graf@...zon.com>
Cc:     Lennart Poettering <mzxreary@...inter.de>,
        linux-kernel@...r.kernel.org, linux-crypto@...r.kernel.org,
        Dominik Brodowski <linux@...inikbrodowski.net>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Theodore Ts'o <tytso@....edu>,
        Colm MacCarthaigh <colmmacc@...zon.com>,
        Torben Hansen <htorben@...zon.co.uk>,
        Jann Horn <jannh@...gle.com>,
        "Michael Kelley (LINUX)" <mikelley@...rosoft.com>
Subject: Re: [PATCH 2/2] random: add fork_event sysctl for polling VM forks

Hi Alex,

On Mon, May 02, 2022 at 08:57:01PM +0200, Alexander Graf wrote:
> So far I see little that would block your patch? It seems to go into a 
> good direction from all I can tell.

With kernel code for the kernel, if it's good enough, it's good enough,
and can be made better later. With kernel code for userspace, it might
not be possible to make it better later, so if we've got a plethora of
options, it makes the incrementalism I was enthusiastic about a little
bit more risky.

Here's where we're at:

- /proc/sys has this easy to use poll() interface [which currently breaks
  normal poll() semantics; see patch 1/2]. It'd be nice to use this as
  an async notifier interface.

- A race-free mmap()d mechanism needs virtual hardware support to be
  useful, so I don't want to add it now. But when we do have it, where
  should it be added?

  * Atop the same /proc/sys file that this patch has returning -ENODATA
    at the moment? /proc/sys doesn't do mmap now so that'd be some
    plumbing work, which would be pretty odd for the sysctl abstraction.

  * Atop some new file elsewhere? Not hard to do, but also begs the
    question of which driver should do it, where the file should be, and
    a whole host of bikesheddy questions.

  * In the VDSO? This would make the most sense to me if we're ever
    going to do a VDSO RNG, and to design it along with the userspace
    RNG stuff people think they want, even if it's maybe not the best
    idea in the end. This is a huge can of worms. Termites, even, and
    they'll eat through the beams of your house.

  * Nowhere, because even though this race-free mechanism is attractive,
    it's way too hard to program for, and nobody is going to use it, and
    not many people care /that/ much about having this mechanism being
    100% race-free. Colm said Amazon doesn't, for example, since it uses
    quiescent states. And apparently Microsoft doesn't care much either,
    and they don't even use quiescent states. Maybe I'm the only one
    raising a fuss about the race window (while hypocritically also not
    being very eager to write code around an interface that has to check
    a variable all the time).

- A file in /proc/sys can return -ENODATA, as this patch has it doing,
  or it can return something else:

  * A counter. Flawed for reasons discussed elsewhere.

  * A UUID, just like /proc/sys/kernel/random/boot_id, except we'd name
    it generation_id or something and it'd change on VM fork. An earlier
    version of this patch did that.

  * -ENODATA, as this patch does, because each caller can get a new unique
    value with getrandom(), as there's no point in having some global
    identifier like with boot_id. This is what I'm leaning toward. It
    seems like getrandom() would handle Lennart's case, Go's case, s2n's
    case, and we don't need a particular ID.

  * The direct value of the ACPI vmgenid value. Discussed in earlier
    emails; I don't want to do this.

- You don't care about race-free, but you do care about mmap(), for
  "programming reasons" in s2n. I'm not yet sympathetic to this case
  :-P. The point of mmap() should be for the race-free stuff. Otherwise
  poll() should do the job, and if you can't poll, you can hack around
  it with another process or a thread or whatever other trick. Your
  systemd proposal with a file was just punting the idea elsewhere in
  userspace (which I think is a bit ugly too...).

This state of affairs inclines me to:

- Have a /proc/sys/kernel/random/fork_event that has a poll() interface
  with reads returning -ENODATA (this patch).

- Try to convince whomever that poll()ing on fork_event and then calling
  getrandom() for a process-local snapshot ID is better than poll()ing
  on and subsequently read()ing a /proc/sys/kernel/random/generation_id
  that returns a global UUID like boot_id. [Lennart - I'm specifically
  interested in trying to convince you of this. Thoughts?]

- Given that you are okay with an async mechanism, figure out how poll()
  can work for s2n rather than mmap().

- Not really pursue the race-free mmap() thing unless Microsoft goes
  full steam ahead with it because it's complicated to program for and
  maybe (or maybe not? I don't know? data please?) a theoretical
  concern.
 
> You block the thread on poll, so the only option is a thread. I was 
> until now always under the working assumption that we can't do this in a 
> thread because you don't want your single threaded application to turn 
> into a pthreaded one, but you make me wonder. Let me check with Torben 
> tomorrow.

Can't you use raw clone() and just have a super dirty thread that shares
a single page mapping and nothing else? Poll on the pidfd of the parent
and the fork_event fd, quit on the former, and ++ on the latter.

> >> 2) A way to notify larger applications (think Java here) that a system
> >> is going to be suspended soon so it can wipe PII before it gets cloned
> >> for example.
> > Suspension, like S3 power notification stuff? Talk to Rafael about that;
> 
> 
> Whether you go running -> S3 -> clone or you go running -> paused -> 
> clone is an implementation detail I'm not terribly worried about. Users 
> can do either, because on both cases the VM is in paused state.

Ahh, I see.
https://lore.kernel.org/lkml/20220501123849.3858-1-Jason@zx2c4.com/
might interest you.

Jason