linux-kernel - Re: [PATCH v2] audit: Avoid excessive dput/dget in audit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260206052218.GV3183987@ZenIV>
Date: Fri, 6 Feb 2026 05:22:18 +0000
From: Al Viro <viro@...iv.linux.org.uk>
To: Waiman Long <llong@...hat.com>
Cc: Paul Moore <paul@...l-moore.com>, Eric Paris <eparis@...hat.com>,
	Christian Brauner <brauner@...nel.org>,
	linux-kernel@...r.kernel.org, audit@...r.kernel.org,
	Richard Guy Briggs <rgb@...hat.com>,
	Ricardo Robaina <rrobaina@...hat.com>
Subject: Re: [PATCH v2] audit: Avoid excessive dput/dget in audit_context
 setup and reset paths

On Thu, Feb 05, 2026 at 11:11:51PM -0500, Waiman Long wrote:

> __latent_entropy
> struct mnt_namespace *copy_mnt_ns(u64 flags, struct mnt_namespace *ns,
>                 struct user_namespace *user_ns, struct fs_struct *new_fs)
> {
>   :
>                 if (new_fs) {
>                         if (&p->mnt == new_fs->root.mnt) {
>                                 new_fs->root.mnt = mntget(&q->mnt);
>                                 rootmnt = &p->mnt;
>                         }
>                         if (&p->mnt == new_fs->pwd.mnt) {
>                                 new_fs->pwd.mnt = mntget(&q->mnt);
>                                 pwdmnt = &p->mnt;
>                         }
>                 }
> 
> It is replacing the fs->pwd.mnt with a new one while pwd_refs is 1. I can
> make this work with the new fs_struct field. I do have one question though.
> Do we need to acquire write_seqlock(&new_fs->seq) if we are changing root or
> pwd here or if the new_fs are in such a state that it will never change when
> this copying operation is in progress?

In all cases when we get to that point, new_fs is always a freshly
created private copy of current->fs, not reachable from anywhere
other than stack frames of the callers, but the proof is not pretty.
copy_mnt_ns() is called only by create_new_namespaces() and it gets to
copying anything if and only if CLONE_NEWNS is in the flags.  So far,
so good.  The call in create_new_namespaces() is
	new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
and both flags and new_fs come straight from create_new_namespaces()
callers.  There are 3 of those:
1) int copy_namespaces(u64 flags, struct task_struct *tsk):
	new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
That gets called by copy_process(), with tsk being the child we are
creating there.  tsk->fs is set a bit earlier, by copy_fs() call - either
to extra ref to current->fs (when CLONE_FS is present in flags) or to
a private copy thereof.  Since in the very beginning of copy_process()
we have
	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
		return ERR_PTR(-EINVAL);
and we are interested in case when CLONE_NEWNS is set, tsk->fs is going
to be a private copy.

2) int unshare_nsproxy_namespaces(unsigned long unshare_flags,
	struct nsproxy **new_nsp, struct cred *new_cred, struct fs_struct *new_fs):
        *new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
					 new_fs ? new_fs : current->fs);
That gets called from ksys_unshare().  Earlier in ksys_unshare() we have
        /*
	 * If unsharing namespace, must also unshare filesystem information.
	 */
	if (unshare_flags & CLONE_NEWNS)
		unshare_flags |= CLONE_FS;
so in our case CLONE_FS is going to be set.  new_fs is initially set
to NULL and, since CLONE_FS is there, the call of unshare_fs() has
replaced it with a reference to private copy of current->fs.  Again,
we get a private copy.

3) int exec_task_namespaces(void):
        new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
No CLONE_NEWNS in flags, so we don't get there at all.

Incidentally, one bogosity I've spotted in unshare_fs() today is
        if (!(unshare_flags & CLONE_FS) || !fs)
					^^^^^^ this
		return 0;

The history is amusing - it had been faithfully preserved since
cf2e340f4249 ("[PATCH] unshare system call -v5: system call handler
function") back in 2006, when unshare(2) had been added; initially it
had been a stub:
+static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
+{
+       struct fs_struct *fs = current->fs;
+
+       if ((unshare_flags & CLONE_FS) &&
+           (fs && atomic_read(&fs->count) > 1))
+               return -EINVAL;
+
+       return 0;
+}

The thing is, task->fs does not become NULL until exit_fs() clears it, at
which point we'd better *not* run into any syscalls, unshare(2) included
;-) The same had been true back in 2006; as the matter of fact, I don't
know if it _ever_ had not been true.  I suspect that the logics had been
copied from exit_fs(), which also has a pointless check that seems to have
been added there in 1.3.26, when task->fs went kmalloc'ed.  The thing
is, in 1.3.26 copy_fs() a failed allocation aborted do_fork() (today's
copy_process()) with exit_fs() never called for the child-not-to-be...

Anyway, with exit_fs() there might have been some value in making it
idempotent, but for unshare_fs() that never made sense.  Nobody noticed
and nobody who'd massaged that function afterwards (myself included)
had ever asked WTF would that exit be about and what would happen if we
ever ended up going there (answer: an oops galore)...