netdev - Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text

Message-Id: <90161ac9-3ca0-4c72-b1c4-ab1293e55445@app.fastmail.com>
Date: Sun, 25 Jun 2023 09:59:34 -0700
From: "Andy Lutomirski" <luto@...nel.org>
To: "Mike Rapoport" <rppt@...nel.org>
Cc: "Mark Rutland" <mark.rutland@....com>,
 "Kees Cook" <keescook@...omium.org>,
 "Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>,
 "Andrew Morton" <akpm@...ux-foundation.org>,
 "Catalin Marinas" <catalin.marinas@....com>,
 "Christophe Leroy" <christophe.leroy@...roup.eu>,
 "David S. Miller" <davem@...emloft.net>,
 "Dinh Nguyen" <dinguyen@...nel.org>,
 "Heiko Carstens" <hca@...ux.ibm.com>, "Helge Deller" <deller@....de>,
 "Huacai Chen" <chenhuacai@...nel.org>,
 "Kent Overstreet" <kent.overstreet@...ux.dev>,
 "Luis Chamberlain" <mcgrof@...nel.org>,
 "Michael Ellerman" <mpe@...erman.id.au>,
 "Nadav Amit" <nadav.amit@...il.com>,
 "Naveen N. Rao" <naveen.n.rao@...ux.ibm.com>,
 "Palmer Dabbelt" <palmer@...belt.com>,
 "Puranjay Mohan" <puranjay12@...il.com>,
 "Rick P Edgecombe" <rick.p.edgecombe@...el.com>,
 "Russell King (Oracle)" <linux@...linux.org.uk>,
 "Song Liu" <song@...nel.org>, "Steven Rostedt" <rostedt@...dmis.org>,
 "Thomas Bogendoerfer" <tsbogend@...ha.franken.de>,
 "Thomas Gleixner" <tglx@...utronix.de>, "Will Deacon" <will@...nel.org>,
 bpf@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
 linux-mips@...r.kernel.org, linux-mm@...ck.org,
 linux-modules@...r.kernel.org, linux-parisc@...r.kernel.org,
 linux-riscv@...ts.infradead.org, linux-s390@...r.kernel.org,
 linux-trace-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org,
 loongarch@...ts.linux.dev, netdev@...r.kernel.org,
 sparclinux@...r.kernel.org, "the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()



On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
>> 
>> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
>> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> >> > From: "Mike Rapoport (IBM)" <rppt@...nel.org>
>> >> >
>> >> > module_alloc() is used everywhere as a means to allocate memory for code.
>> >> >
>> >> > Besides being semantically wrong, this unnecessarily ties all subsystems
>> >> > that need to allocate code, such as ftrace, kprobes and BPF, to modules
>> >> > and puts the burden of code allocation on the modules code.
>> >> >
>> >> > Several architectures override module_alloc() because of various
>> >> > constraints where the executable memory can be located and this causes
>> >> > additional obstacles for improvements of code allocation.
>> >> >
>> >> > Start splitting code allocation from modules by introducing
>> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >> >
>> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> >> > module_alloc() and execmem_free() and jit_free() are replacements of
>> >> > module_memfree() to allow updating all call sites to use the new APIs.
>> >> >
>> >> > The intended semantics for new allocation APIs:
>> >> >
>> >> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >> >   close to the kernel image, like loadable kernel modules and generated
>> >> >   code that is restricted by relative addressing.
>> >> >
>> >> > * jit_text_alloc() should be used to allocate memory for generated code
>> >> >   when there are no restrictions for the code placement. For
>> >> >   architectures that require that any code is within certain distance
>> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >> >   execmem_text_alloc().
>> >> >
>> >> 
>> >> Is there anything in this series to help users do the appropriate
>> >> synchronization when they actually populate the allocated memory with
>> >> code?  See here, for example:
>> >
>> > This series only factors out the executable allocations from modules and
>> > puts them in a central place.
>> > Anything else would go on top after this lands.
>> 
>> Hmm.
>> 
>> On the one hand, there's nothing wrong with factoring out common code. On
>> the other hand, this is probably the right time to at least start
>> thinking about synchronization, at least to the extent that it might make
>> us want to change this API.  (I'm not at all saying that this series
>> should require changes -- I'm just saying that this is a good time to
>> think about how this should work.)
>> 
>> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
>> look like the one thing in the Linux ecosystem that actually
>> intelligently and efficiently maps new text into an address space:
>> mmap().
>> 
>> On x86, you can mmap() an existing file full of executable code PROT_EXEC
>> and jump to it with minimal synchronization (just the standard implicit
>> ordering in the kernel that populates the pages before setting up the
>> PTEs and whatever user synchronization is needed to avoid jumping into
>> the mapping before mmap() finishes).  It works across CPUs, and the only
>> possible way userspace can screw it up (for a read-only mapping of
>> read-only text, anyway) is to jump to the mapping too early, in which
>> case userspace gets a page fault.  Incoherence is impossible, and no one
>> needs to "serialize" (in the SDM sense).
>> 
>> I think the same sequence (from userspace's perspective) works on other
>> architectures, too, although I think more cache management is needed on
>> the kernel's end.  As far as I know, no Linux SMP architecture needs an
>> IPI to map executable text into usermode, but I could easily be wrong.
>> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
>> remember the details.)
>> 
>> (Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
>> rather fraught, and I bet many things do it wrong when userspace is
>> multithreaded.  But not in production because it's mostly not used in
>> production.)
>> 
>> But jit_text_alloc() can't do this, because the order of operations
>> doesn't match.  With jit_text_alloc(), the executable mapping shows up
>> before the text is populated, so there is no atomic change from not-there
>> to populated-and-executable.  Which means that there is an opportunity
>> for CPUs, speculatively or otherwise, to start filling various caches
>> with intermediate states of the text, which means that various
>> architectures (even x86!) may need serialization.
>> 
>> For eBPF- and module-like use cases, where JITting/code gen is quite
>> coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address,
>> but does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map() -> map it and synchronize if needed (no sync needed on
>> x86, I think)
>> 
>> could be more efficient and/or safer.
>> 
>> (Modules could use this too.  Getting alternatives right might take some
>> fiddling, because off the top of my head, this doesn't match how it works
>> now.)
>> 
>> To make alternatives easier, this could work, maybe (haven't fully
>> thought it through):
>> 
>> jit_text_alloc()
>> jit_text_map_rw_inplace() -> map at the target address, but RW, !X
>> 
>> write the text and apply alternatives
>> 
>> jit_text_finalize() -> change from RW to RX *and synchronize*
>> 
>> jit_text_finalize() would either need to wait for RCU (possibly extra
>> heavy weight RCU to get "serialization") or send an IPI.
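
To make the ordering in the first proposal above concrete, here is a small
userspace model of it. This is only a sketch: every name in it
(jit_text_model, model_alloc(), model_write(), model_map(), model_free())
is hypothetical and none of it is an existing kernel API. The "executable
view" is just a pointer that stays NULL until the single publish step, which
stands in for installing the PTEs and doing whatever synchronization the
architecture needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a handle-based JIT text API (not a kernel API). */
struct jit_text_model {
	unsigned char *staging;   /* writable backing store, never fetched from */
	unsigned char *exec_view; /* what a CPU could fetch; NULL until mapped */
	size_t size;
};

/* "alloc": reserve backing memory, but do not expose an executable view. */
static struct jit_text_model *model_alloc(size_t size)
{
	struct jit_text_model *jt = calloc(1, sizeof(*jt));

	if (!jt)
		return NULL;
	jt->staging = calloc(1, size ? size : 1);
	if (!jt->staging) {
		free(jt);
		return NULL;
	}
	jt->size = size;
	return jt;
}

/* "write": only legal before the text is published; bounds-checked. */
static int model_write(struct jit_text_model *jt, size_t off,
		       const void *src, size_t len)
{
	assert(jt != NULL);
	if (jt->exec_view || off > jt->size || len > jt->size - off)
		return -1;
	memcpy(jt->staging + off, src, len);
	return 0;
}

/*
 * "map": the single transition from not-there to populated-and-executable.
 * A real implementation would install PTEs and synchronize here.
 */
static const unsigned char *model_map(struct jit_text_model *jt)
{
	assert(jt != NULL);
	jt->exec_view = jt->staging;
	return jt->exec_view;
}

static void model_free(struct jit_text_model *jt)
{
	if (!jt)
		return;
	free(jt->staging);
	free(jt);
}
```

The point of the model is the invariant, not the mechanism: nothing can
observe exec_view in a partially written state, because model_write()
refuses to run once model_map() has published the text.
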
>
> This is essentially how modules work now. The memory is allocated RW, written
> and updated with alternatives and then made ROX in the end with set_memory
> APIs.
>
> The issue with not having the memory mapped X when it's written is that we
> cannot use large pages to map it. One of the goals is to have executable
> memory mapped with large pages and make code allocator able to divide that
> page among several callers.
>
> So the idea was that jit_text_alloc() will have a cache of large pages
> mapped ROX, will allocate memory from those caches and there will be
> jit_update() that uses text poking for writing to that memory.
>
> Upon allocation of a large page to increase the cache, that large page will
> be "invalidated" by filling it with breakpoint instructions (e.g. int3 on
> x86).

Is this actually valid?  In between int3 and real code, there's a
potential torn read of real code mixed up with 0xcc.
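
A tiny userspace simulation of the concern, as a sketch only: the helper
names (fill_int3(), partial_write(), classify()) are made up for
illustration, and the "interrupted write" is modeled by copying only the
first few bytes of an instruction over an int3-filled buffer. The example
instruction is the 5-byte x86 encoding of "mov eax, 0x2a" (b8 2a 00 00 00).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INT3_BYTE 0xcc

enum fetch_state { FETCH_ALL_INT3, FETCH_TORN, FETCH_COMPLETE };

/* "Invalidate" a region by filling it with breakpoint bytes. */
static void fill_int3(unsigned char *buf, size_t len)
{
	memset(buf, INT3_BYTE, len);
}

/*
 * Model a non-atomic write that has only progressed `done` bytes when
 * another CPU fetches from the same location.
 */
static void partial_write(unsigned char *buf, const unsigned char *insn,
			  size_t len, size_t done)
{
	assert(done <= len);
	memcpy(buf, insn, done);
}

/* Classify what a concurrent fetch of `len` bytes would observe. */
static enum fetch_state classify(const unsigned char *buf,
				 const unsigned char *insn, size_t len)
{
	size_t i;

	if (memcmp(buf, insn, len) == 0)
		return FETCH_COMPLETE;
	for (i = 0; i < len; i++)
		if (buf[i] != INT3_BYTE)
			return FETCH_TORN;
	return FETCH_ALL_INT3;
}
```

With one byte written, the buffer holds b8 cc cc cc cc, which does not trap:
it decodes as "mov eax, 0xcccccccc", a valid instruction that is neither the
old int3 nor the intended code. As far as I understand it, this is the case
that the int3-first, tail-next, first-byte-last ordering in x86's
text_poke_bp() exists to avoid.
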

>
> To improve the performance of this process, we can write to !X copy and
> then text_poke it to the actual address in one go. This will require some
> changes to get the alternatives right.
>
> -- 
> Sincerely yours,
> Mike.