linux-kernel - Re: [PATCH 1/3] module: Introduce module_alloc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZHFDO/95KMXRdOWI@moria.home.lan>
Date:   Fri, 26 May 2023 19:39:39 -0400
From:   Kent Overstreet <kent.overstreet@...ux.dev>
To:     Song Liu <song@...nel.org>
Cc:     linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
        mcgrof@...nel.org, peterz@...radead.org, tglx@...utronix.de,
        x86@...nel.org, rppt@...nel.org
Subject: Re: [PATCH 1/3] module: Introduce module_alloc_type

On Fri, May 26, 2023 at 04:09:01PM -0700, Song Liu wrote:
> On Fri, May 26, 2023 at 3:30 PM Kent Overstreet
> <kent.overstreet@...ux.dev> wrote:
> >
> > On Thu, May 25, 2023 at 10:15:27PM -0700, Song Liu wrote:
> > > Introduce memory type aware module_alloc_type, which provides a unified
> > > allocator for all different archs. This work was discussed in [1].
> > >
> > > Each arch can configure the allocator to do the following:
> > >
> > >    1. Specify module_vaddr and module_end
> > >    2. Random module start address for KASLR
> > >    3. kasan_alloc_module_shadow()
> > >    4. kasan_reset_tag()
> > >    5. Preferred and secondary module address ranges
> > >
> > > enum mod_alloc_params_flags are used to control the behavior of
> > > module_alloc_type. Specifically: MOD_ALLOC_FALLBACK let module_alloc_type
> > > fallback to existing module_alloc. MOD_ALLOC_SET_MEMORY let
> > > module_alloc_type to protect the memory before returning to the user.
> > >
> > > module_allocator_init() call is added to start_kernel() to initialize
> > > module_alloc_type.
> > >
> >
> > ...
> >
> > > +/**
> > > + * struct vmalloc_params - Parameters to call __vmalloc_node_range()
> > > + * @start:          Address space range start
> > > + * @end:            Address space range end
> > > + * @gfp_mask:       The gfp_t mask used for this range
> > > + * @pgprot:         The page protection for this range
> > > + * @vm_flags        The vm_flag used for this range
> > > + */
> > > +struct vmalloc_params {
> > > +     unsigned long   start;
> > > +     unsigned long   end;
> > > +     gfp_t           gfp_mask;
> > > +     pgprot_t        pgprot;
> > > +     unsigned long   vm_flags;
> > > +};
> > > +
> > > +/**
> > > + * struct mod_alloc_params - Parameters for module allocation type
> > > + * @flags:          Properties in mod_alloc_params_flags
> > > + * @granularity:    The allocation granularity (PAGE/PMD) in bytes
> > > + * @alignment:      The allocation alignment requirement
> > > + * @vmp:            Parameters used to call vmalloc
> > > + * @fill:           Function to fill allocated space. If NULL, use memcpy()
> > > + * @invalidate:     Function to invalidate memory space.
> > > + *
> > > + * If @granularity > @alignment the allocation can reuse free space in
> > > + * previously allocated pages. If they are the same, then fresh pages
> > > + * have to be allocated.
> > > + */
> > > +struct mod_alloc_params {
> > > +     unsigned int            flags;
> > > +     unsigned int            granularity;
> > > +     unsigned int            alignment;
> > > +     struct vmalloc_params   vmp[MOD_MAX_ADDR_SPACES];
> > > +     void *                  (*fill)(void *dst, const void *src, size_t len);
> > > +     void *                  (*invalidate)(void *ptr, size_t len);
> > > +};
> > > +
> > > +struct mod_type_allocator {
> > > +     struct mod_alloc_params params;
> > > +};
> > > +
> > > +struct mod_allocators {
> > > +     struct mod_type_allocator *types[MOD_MEM_NUM_TYPES];
> > > +};
> > > +
> > > +void *module_alloc_type(size_t size, enum mod_mem_type type);
> > > +void module_memfree_type(void *ptr, enum mod_mem_type type);
> > > +void module_memory_fill_type(void *dst, void *src, size_t len, enum mod_mem_type type);
> > > +void module_memory_invalidate_type(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_memory_protect(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_memory_unprotect(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_memory_force_protect(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_memory_force_unprotect(void *ptr, size_t len, enum mod_mem_type type);
> > > +void module_alloc_type_init(struct mod_allocators *allocators);
> >
> > This is a pretty big and complicated interface, and I haven't been able
> > to find any reasoning or justification for why it's needed.
> >
> > Why is this going in module.h? Don't do that, this is supposed to be a
> > general purpose allocator for executable memory so start a new file.
> 
> The goal of this work is to build a memory allocator for modules, text,
> rw data, and ro data. So it is not the same as execmem_alloc or jitalloc.
> 
> >
> > What is vmalloc_params doing here? Why is it needed? Can we just get rid
> > of it?
> 
> We need it to have an interface for all archs. They are all using different
> variations of vmalloc for module_alloc.

But it should be an internal implementation detail, I don't think we
want the external interface tied to vmalloc - 

> These two APIs allow the core code work with all archs. They won't break
> sub-page allocations. (not all archs will have sub-page allocations)

So yes, text_poke() doesn't work on all architectures, and we need a
fallback.

But if the fallback needs to go the unprotect/protect route, I think we
need to put that in the helper, and not expose it. Part of what people
are wanting is to limit or eliminate pages that are RWX, so we
definitely shouldn't be creating new interfaces to flipping page
permissions: that should be pushed down to as low a level as possible.

E.g., with my jitalloc_update(), a more complete version would look like

void jitalloc_update(void *dst, void *src, size_t len)
{
	if (text_poke_available) {
		text_poke(...);
	} else {
		unprotect();
		memcpy();
		protect();
	}
}

See? We provide a simpler, safer interface, and this also lets us handle
multiple threads racing to update/flip permissions on the same page in a
single place (e.g. with sticking a lock somewhere in the jitalloc
structures).

Or ideally, we'd have a kmap_local() variant that actually creates a new
mapping on all architectures, and we'd just use that to do the update
without any arch-dependent code - if we get that, having this helper
means we'll only have to change a single place later.

> 
> OTOH, the "force" version of the two APIs should be removed later. In this
> set, we only need them for arch_prepare_bpf_trampoline(). I plan to remove
> this limitation for x86 soon.
> 
> >
> > The text_poke() abstraction works; the exposed interface should look
> > like that.
> >
> > Please kill MOD_ALLOC_SET_MEMORY, and _only_ return memory that is
> > read-only. You should be able to kill mod_alloc_params entirely.
> >
> > Our other memory allocators don't expose kasan - why does this one?
> > Again, that looks wrong.
> >
> > I would like to see something much simpler (that doesn't encode awkward
> > assumptions from module and bpf!): please look at the code I sketched
> > out below and tell me why this interface won't work - or if it can, go
> > with _that_.
> 
> I think we need to align the goal here. PS: we did align the goal about
> 6 months ago when I proposed the execmem_alloc() set.

Yeah, I know what it feels like to get yanked around by different
reviewers who see and join in the discussion at different times. Sorry :)

Having a rational/design doc in the header file would help a _lot_. 

> My current goal is to build an allocator for all the use cases of modules,
> text, data, rodata, etc. Then the same allocator can be used by other
> dynamic kernel text: bpf, ftrace, kprobe, bcachefs, etc.

This must have been part of the previous discussion since you started
with execmem_alloc(), but I did scan through that and I'm still not
seeing the justification for why this needs to handle non-text
allocations.

If I had to hazard a guess it would be because of tglx wanting to solve
tlb fragmentation for modules, but to me that doesn't sound like a good
enough reason and it looks like we're ending up with a messy "try to be
all things to all people" interface as a result :/

Do we _really_ need that?

Mike was just saying at LSF that direct map fragmentation turned out to
be a non issue for performance for non-text, so maybe we don't.

Also - module_memory_fill_type(), module_memory_invalidate_type() - I
know these are for BPF, but could you explain why we need them in the
external interface here? Could they perhaps be small helpers in the bpf
code that use something like jitalloc_update()?

> OTOH, jit_alloc (as-is) solves a subset of the problem (only the text).
> We can probably use it (with some updates) instead of the vmap_area
> based allocator. But I guess we need to align the direction first.

Let's see if we can get tglx to chime in again, since he was pretty
vocal in the last discussion.

It's also good practice to try to summarize all the relevant "whys" from
the discussion that went into a feature and put that in at least the
commit message - or even better, header file comments.

Also, organization: the module code is a huge mess, let's _please_ do
split this out and think about organization a bit more, not add to the
mess :)

Cheers,
Kent