linux-kernel - Re: [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8fc4249b-9155-438a-8ef8-c678a12ea30d@lucifer.local>
Date: Mon, 28 Apr 2025 09:53:05 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Kees Cook <kees@...nel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
        "Liam R . Howlett" <Liam.Howlett@...cle.com>,
        Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
        Pedro Falcato <pfalcato@...e.de>, David Hildenbrand <david@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
        Suren Baghdasaryan <surenb@...gle.com>, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem

On Fri, Apr 25, 2025 at 10:09:34AM -0700, Kees Cook wrote:
> On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
> > There are peculiarities within the kernel where what is very clearly mm
> > code is performed elsewhere arbitrarily.
> >
> > This violates separation of concerns and makes it harder to refactor code
> > to make changes to how fundamental initialisation and operation of mm logic
> > is performed.
> >
> > One such case is the creation of the VMA containing the initial stack upon
> > execve()'ing a new process. This is currently performed in __bprm_mm_init()
> > in fs/exec.c.
> >
> > Abstract this operation to create_init_stack_vma(). This allows us to limit
> > use of vma allocation and free code to fork and mm only.
> >
> > We previously did the same for the step at which we relocate the initial
> > stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> > establishment too.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> > Acked-by: David Hildenbrand <david@...hat.com>
> > Reviewed-by: Suren Baghdasaryan <surenb@...gle.com>
> > ---
> >  fs/exec.c          | 51 +---------------------------------
>
> I'm kind of on the fence about this. On the one hand, yes, it's all vma
> goo, and should live with the rest of vma code, as you suggest. On the
> other had, exec is the only consumer of this behavior, and moving it
> out of fs/exec.c means that changes to the code that specifically only
> impacts exec are now in a separate file, and will no longer get exec
> maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
> notoriously fragile, so I'm kind of generally paranoid about changes to
> its behaviors going unnoticed.
>
> In defense of moving it, yes, this routine has gotten updates over the
> many years, but it's relatively stable. But at least one thing has gone in
> without exec maintainer review recently (I would have Acked it, but the
> point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl")
> Everything else was before I took on the role officially (Nov 2022).
>
> So I guess I'm asking, how do we make sure stuff pulled out of exec
> still gets exec maintainer review?

I think we have two options here:

1. Separate out this code into mm/vma_exec.c and treat it like
   mm/vma_init.c, then add you as a reviewer, so you have visibility on
   everything that happens there.
2. Add you as a reviewer to memory mapping in general.

I think 1 is preferable actually, as it'll reduce noise for you and then
you'll _always_ get notified about change in this code.

Note that we have done this previously for similar reasons with
relocate_vma_down() we could move this function into that file.

For the sake of saving time given time zone differences, let me explore
option 1in a v3, and obviously if that doesn't work for you let me know! :)

We'll have to then have fs/exec.c include ../mm/vma_exec.h, so it is _only_
exec that gets access to this.

>
> > [...]
> >  static int __bprm_mm_init(struct linux_binprm *bprm)
> >  {
> > -	int err;
> > [...]
> > -	return err;
> > +	return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
> >  }
>
> I'd prefer __bprm_mm_init() go away if it's just a 1:1 wrapper now.
> However, it doesn't really look like it makes too much sense for the NOMMU
> logic get moved as well, since it explicitly depends on exec-specific
> values (MAX_ARG_PAGES), so perhaps something like this:
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 8e4ea5f1e64c..313dc70e0012 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -382,9 +382,13 @@ static int bprm_mm_init(struct linux_binprm *bprm)
>  	bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
>  	task_unlock(current->group_leader);
>
> -	err = __bprm_mm_init(bprm);
> +#ifndef CONFIG_MMU
> +	bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *);
> +#else
> +	err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
>  	if (err)
>  		goto err;
> +#endif
>
>  	return 0;

Sure I can respin with this.

>
>
>
> On a related note, I'd like to point out that my claim that exec is
> the only consumer here, is slightly a lie. Technically this is correct,
> but only because this is specifically setting up the _stack_.
>
> The rest of the VMA setup actually surrounds this code (another
> reason I remain unhappy about moving it). Specifically the mm_alloc()
> before __bprm_mm_init (which is reached through alloc_brpm()). And
> then, following alloc_bprm() in do_execveat_common(), is the call to
> setup_new_exec(), which does the rest of the VMA setup, specifically
> arch_pick_mmap_layout() and related fiddling.

No other callers try to allocate/free vmas, which is the issue here.

>
> The "create userspace VMA" logic, mostly through mm_alloc(), is
> used in a few places (e.g. text poking), but the "bring up a _usable_
> userspace VMA" logic (i.e. one also with functional mmap) is repeated in
> lib/kunit/alloc_user.c for allowing testing of code that touches userspace
> (see kunit_attach_mm() and the kunit_vm_mmap() users). (But these tests
> don't actually run userspace code, so no stack is set up.)

Hm but this seems something different? This is using the mm_alloc()
interface?

>
> I guess what I'm trying to say is that I think we need a more clearly
> defined "create usable userspace VMA" API, as we've got at least 3
> scattered approaches right now: exec ("everything"), non-mmap-non-stack
> users (text poking, et al), and mmap-but-not-stack users (kunit tests).

But only exec is actually allocating and essentially 'forcing in' a stack
vma like this?

I mean I'm all for having some nicely abstracted interface rather than
scattering open-coded stuff around, but the one and only objective _here_
is to disallow the use of very clearly mm-specific internal APIs elsewhere.

exec is of course 'special' in this stack behaviour, but I feel we can
express this 'specialness' better by having this kind of well-defined,
well-described functions exposed by mm, rather than handing over absolutely
fundamental API calls to any part of the kernel that wants them.

>
> And the One True User of a full userspace VMA, exec, has the full setup
> scattered into several phases, mostly due to needing to separate those
> phases because it needs to progressively gather the information needed
> to correctly configure each piece:
> - set up userspace VMA at all (mm_alloc)
> - set up a stack because exec args need to go somewhere (__bprm_mm_init)
> - move stack to the right place (depends on executable binary and task bits)
> - set up mmap (arch_pick_mmap_layout) to actually load executable binary
>   (depends on arch, binary, and task bits)
>
> Hopefully this all explains why I'm uncomfortable to see __bprm_mm_init
> get relocated. It'll _probably_ be fine, but I get antsy about changes
> to code that only exec uses...

Right, I understand and appreciate that, for sure. This is fiddly core
code, you worry about things breaking - I mean I feel you (don't ask me
about brk(), please please don't ask me about brk() :P)

So hopefully the proposed solution with vma_exec.c should cover this off
nicely?


>
> --
> Kees Cook

Cheers, Lorenzo