linux-kernel - Re: [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <202504250925.58434D763@keescook>
Date: Fri, 25 Apr 2025 10:09:34 -0700
From: Kees Cook <kees@...nel.org>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	"Liam R . Howlett" <Liam.Howlett@...cle.com>,
	Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
	Pedro Falcato <pfalcato@...e.de>,
	David Hildenbrand <david@...hat.com>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
	Suren Baghdasaryan <surenb@...gle.com>, linux-mm@...ck.org,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/3] mm: abstract initial stack setup to mm subsystem

On Fri, Apr 25, 2025 at 03:54:34PM +0100, Lorenzo Stoakes wrote:
> There are peculiarities within the kernel where what is very clearly mm
> code is performed elsewhere arbitrarily.
> 
> This violates separation of concerns and makes it harder to refactor code
> to make changes to how fundamental initialisation and operation of mm logic
> is performed.
> 
> One such case is the creation of the VMA containing the initial stack upon
> execve()'ing a new process. This is currently performed in __bprm_mm_init()
> in fs/exec.c.
> 
> Abstract this operation to create_init_stack_vma(). This allows us to limit
> use of vma allocation and free code to fork and mm only.
> 
> We previously did the same for the step at which we relocate the initial
> stack VMA downwards via relocate_vma_down(), now we move the initial VMA
> establishment too.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
> Acked-by: David Hildenbrand <david@...hat.com>
> Reviewed-by: Suren Baghdasaryan <surenb@...gle.com>
> ---
>  fs/exec.c          | 51 +---------------------------------

I'm kind of on the fence about this. On the one hand, yes, it's all vma
goo, and should live with the rest of vma code, as you suggest. On the
other had, exec is the only consumer of this behavior, and moving it
out of fs/exec.c means that changes to the code that specifically only
impacts exec are now in a separate file, and will no longer get exec
maintainer/reviewer CCs (based on MAINTAINERS file matching). Exec is
notoriously fragile, so I'm kind of generally paranoid about changes to
its behaviors going unnoticed.

In defense of moving it, yes, this routine has gotten updates over the
many years, but it's relatively stable. But at least one thing has gone in
without exec maintainer review recently (I would have Acked it, but the
point is review): 9e567ca45f ("mm/ksm: fix ksm exec support for prctl")
Everything else was before I took on the role officially (Nov 2022).

So I guess I'm asking, how do we make sure stuff pulled out of exec
still gets exec maintainer review?

> [...]
>  static int __bprm_mm_init(struct linux_binprm *bprm)
>  {
> -	int err;
> [...]
> -	return err;
> +	return create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
>  }

I'd prefer __bprm_mm_init() go away if it's just a 1:1 wrapper now.
However, it doesn't really look like it makes too much sense for the NOMMU
logic get moved as well, since it explicitly depends on exec-specific
values (MAX_ARG_PAGES), so perhaps something like this:

diff --git a/fs/exec.c b/fs/exec.c
index 8e4ea5f1e64c..313dc70e0012 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -382,9 +382,13 @@ static int bprm_mm_init(struct linux_binprm *bprm)
 	bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
 	task_unlock(current->group_leader);
 
-	err = __bprm_mm_init(bprm);
+#ifndef CONFIG_MMU
+	bprm->p = PAGE_SIZE * MAX_ARG_PAGES - sizeof(void *);
+#else
+	err = create_init_stack_vma(bprm->mm, &bprm->vma, &bprm->p);
 	if (err)
 		goto err;
+#endif
 
 	return 0;
 


On a related note, I'd like to point out that my claim that exec is
the only consumer here, is slightly a lie. Technically this is correct,
but only because this is specifically setting up the _stack_.

The rest of the VMA setup actually surrounds this code (another
reason I remain unhappy about moving it). Specifically the mm_alloc()
before __bprm_mm_init (which is reached through alloc_brpm()). And
then, following alloc_bprm() in do_execveat_common(), is the call to
setup_new_exec(), which does the rest of the VMA setup, specifically
arch_pick_mmap_layout() and related fiddling.

The "create userspace VMA" logic, mostly through mm_alloc(), is
used in a few places (e.g. text poking), but the "bring up a _usable_
userspace VMA" logic (i.e. one also with functional mmap) is repeated in
lib/kunit/alloc_user.c for allowing testing of code that touches userspace
(see kunit_attach_mm() and the kunit_vm_mmap() users). (But these tests
don't actually run userspace code, so no stack is set up.)

I guess what I'm trying to say is that I think we need a more clearly
defined "create usable userspace VMA" API, as we've got at least 3
scattered approaches right now: exec ("everything"), non-mmap-non-stack
users (text poking, et al), and mmap-but-not-stack users (kunit tests).

And the One True User of a full userspace VMA, exec, has the full setup
scattered into several phases, mostly due to needing to separate those
phases because it needs to progressively gather the information needed
to correctly configure each piece:
- set up userspace VMA at all (mm_alloc)
- set up a stack because exec args need to go somewhere (__bprm_mm_init)
- move stack to the right place (depends on executable binary and task bits)
- set up mmap (arch_pick_mmap_layout) to actually load executable binary
  (depends on arch, binary, and task bits)

Hopefully this all explains why I'm uncomfortable to see __bprm_mm_init
get relocated. It'll _probably_ be fine, but I get antsy about changes
to code that only exec uses...

-- 
Kees Cook