Message-ID: <ZogzJCb66vwxwSLN@zx2c4.com>
Date: Fri, 5 Jul 2024 19:53:40 +0200
From: "Jason A. Donenfeld" <Jason@...c4.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: jolsa@...nel.org, mhiramat@...nel.org, cgzones@...glemail.com,
brauner@...nel.org, linux-kernel@...r.kernel.org, arnd@...db.de
Subject: Re: deconflicting new syscall numbers for 6.11
Hi Linus,
On Fri, Jul 05, 2024 at 10:39:48AM -0700, Linus Torvalds wrote:
> Yes. And it should be pretty trivial.
>
> We just at least initially have to be very careful to limit it to
> MAP_ANONYMOUS and MAP_PRIVATE. Because dropping dirty bits on shared
> mappings sounds insane and like a possible source of confusion (and
> thus bugs and maybe even security issues).
>
> It's possible that we might even use a MAP_TYPE flag for this. Or make
> it a PROT_xyz bit rather than a MAP_xyz.
>
> So there's some trivial sanity checks and some UI issues to just pick,
> but apart from "just pick something sane", exposing this for mmap() is
> _not_ hard, and I do think it needs to be done first.
I can take a stab at it.
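Something like the following, maybe, as a zeroth-order sketch of those
sanity checks in do_mmap() -- the MAP_DROPPABLE name and VM_DROPPABLE
bit here are placeholders, not a settled interface:

    /* Hypothetical check in do_mmap(): only anonymous private
     * mappings may be droppable. Dropping dirty pages of a shared
     * or file-backed mapping would silently lose data. */
    if (flags & MAP_DROPPABLE) {
            if (!(flags & MAP_ANONYMOUS) ||
                (flags & MAP_TYPE) != MAP_PRIVATE)
                    return -EINVAL;
            vm_flags |= VM_DROPPABLE;
    }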
> > - The "mechanism" needs to return allocated memory to userspace that can
> > be chunked up on a per-thread basis, with no state straddling pages,
> > which means it also needs to return the size of each state, and the
> > number of states that were allocated.
> >
> > - The size of each state might change kernel version to kernel version.
>
> Just pick a size large enough.
>
> And why would that size not be one page?
>
> Considering that you really don't want to rely on page-crossing state
> *ANYWAY* because of the whole "one page can go away while another one
> sticks around" issue, I would expect that states over one page per
> thread would be a *very* questionable idea to begin with.
>
> I don't think we'll ever see systems with page sizes smaller than 4k.
> They have existed in the past, but they're not making a comeback.
> People want larger pages, not smaller ones.
That doesn't sound so good: the current state is 144 bytes, and there's
expected to be one of these per thread. Mapping a full 4k -- or 16k on
larger-page systems -- per thread seems pretty bad. Wasting up to 16240
bytes per thread, plus a new VMA for every mapping, can't be okay, I'd
imagine.
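Just to put numbers on it (illustrative userspace arithmetic, assuming
the current 144-byte state):

    #include <stdio.h>

    int main(void)
    {
            unsigned long state = 144;

            /* Chunking many per-thread states into one shared page: */
            printf("states per 4k page: %lu\n", 4096 / state);  /* 28 */
            /* Versus a page-per-thread scheme: */
            printf("4k page per thread wastes: %lu bytes\n",
                   4096 - state);   /* 3952 */
            printf("16k page per thread wastes: %lu bytes\n",
                   16384 - state);  /* 16240 */
            return 0;
    }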
Also, these points still stand:
| - In an effort to match the behaviors of syscall getrandom() as much as
| possible, it needs to be mapped with various flags (the ones in the
| current vgetrandom_alloc() implementation).
|
| - Which flags are needed might change kernel version to kernel version.
|
| - Future memory tagging CPU extensions might allow us to prevent the
| memory from being accessed unless the accesses are coming from vDSO
| code, which would avoid heartbleed-like bugs. This is very appealing.
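For concreteness, a rough userspace approximation of those semantics
would be something like the below -- just a sketch, and notably the
drop-under-memory-pressure behavior has no existing madvise() verb,
which is part of why this allocation is special:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Approximate, in userspace, the flags the current
     * vgetrandom_alloc() implementation applies kernel-side. */
    static void *alloc_states(size_t len)
    {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                           -1, 0);
            if (p == MAP_FAILED)
                    return NULL;
            /* Don't leak state into children or core dumps. */
            madvise(p, len, MADV_WIPEONFORK);
            madvise(p, len, MADV_DONTDUMP);
            return p;
    }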
It seems like leaving it entirely up to mmap() will not only result in
users doing it wrong, but will also limit our options moving forward.
And there's still the whole issue of communicating state sizes so as
not to be wasteful.
Another idea I had, if you hate the syscall, is that I could just add
this as (another) private ioctl() on the /dev/random node. That sounds
worse than a syscall, because it means the node has to exist and an fd
has to be opened -- and concerns about exactly that were what led to
the getrandom() syscall being introduced in the first place -- but it
would at least avoid a new syscall. I'm not crazy about it, though.
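(For the record, the shape of that would presumably be something like
the below; the struct and RNDALLOCVGETRANDOM are made-up names purely
for illustration, not a real proposal:)

    #include <linux/ioctl.h>
    #include <linux/types.h>

    /* Hypothetical ioctl on /dev/random; 'R' is the existing
     * random ioctl magic, 0x20 an arbitrary unused nr. */
    struct vgetrandom_alloc_args {
            __u64 addr;          /* out: base of the allocated states */
            __u32 num;           /* in/out: requested/allocated count */
            __u32 size_per_each; /* out: size of each opaque state */
    };

    #define RNDALLOCVGETRANDOM \
            _IOWR('R', 0x20, struct vgetrandom_alloc_args)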
Maybe the winning solution is to add MAP_DROPPABLE (or PROT_DROPPABLE)
to mmap(), and then, in a following commit, add the vgetrandom_alloc()
syscall on top of it. That way we avoid vgetrandom_alloc() getting
abused, while still having a nice interface that isn't too
constraining.
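Direct use of the flag would then look roughly like this --
MAP_DROPPABLE's value below is a placeholder for whatever the final
ABI picks, and it of course only works on a kernel implementing it:

    #include <sys/mman.h>

    #ifndef MAP_DROPPABLE
    #define MAP_DROPPABLE 0x08 /* placeholder; final value TBD */
    #endif

    int main(void)
    {
            /* Pages in this mapping may be reclaimed outright under
             * memory pressure rather than swapped, so the contents
             * must always be recomputable, as the vDSO getrandom
             * state is. */
            void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_DROPPABLE,
                             -1, 0);
            return mem == MAP_FAILED;
    }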
Jason