linux-ext4 - Re: generic_permission() optimization

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250414-anomalie-abpfiff-9f293dce366b@brauner>
Date: Mon, 14 Apr 2025 12:21:28 +0200
From: Christian Brauner <brauner@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Mateusz Guzik <mjguzik@...il.com>, Al Viro <viro@...iv.linux.org.uk>, 
	linux-fsdevel <linux-fsdevel@...r.kernel.org>, Jan Kara <jack@...e.cz>, 
	Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: generic_permission() optimization

On Sat, Apr 12, 2025 at 01:22:38PM -0700, Linus Torvalds wrote:
> On Sat, 12 Apr 2025 at 09:26, Mateusz Guzik <mjguzik@...il.com> wrote:
> >
> > I plopped your snippet towards the end of __ext4_iget:
> 
> That's literally where I did the same thing, except I put it right after the
> 
>           brelse(iloc.bh);
> 
> line, rather than before as you did.
> 
> And it made no difference for me, but I didn't try to figure out why.
> Maybe some environment differences? Or maybe I just screwed up my
> testing...
> 
> As mentioned earlier in the thread, I had this bi-modal distribution
> of results, because if I had a load where the *non*-owner of the inode
> looked up the pathnames, then the ACL information would get filled in
> when the VFS layer would do the lookup, and then once the ACLs were
> cached, everything worked beautifully.
> 
> But if the only lookups of a path were done by the owner of the inodes
> (which is typical for at least my normal kernel build tree - nothing
> but my build will look at the files, and they are obviously always
> owned by me) then the ACL caches will never be filled because there
> will never be any real ACL lookups.
> 
> And then rather than doing the nice efficient "no ACLs anywhere, no
> need to even look", it ends up having to actually do the vfsuid
> comparison for the UID equality check.
> 
> Which then does the extra accesses to look up the idmap etc, and is
> visible in the profiles due to that whole dance:
> 
>         /* Are we the owner? If so, ACL's don't matter */
>         vfsuid = i_uid_into_vfsuid(idmap, inode);
>         if (likely(vfsuid_eq_kuid(vfsuid, current_fsuid()))) {
> 
> even when idmap is 'nop_mnt_idmap' and it is reasonably cheap. Just
> because it ends up calling out to different functions and does extra
> D$ accesses to the inode and the suberblock (ie i_user_ns() is this
> 
>         return inode->i_sb->s_user_ns;

I think we can improve this. Right now multiple mounts from different
superblocks can share the same struct mnt_idmap. But I can change the
code so that struct mnt_idmap can only be shared between mounts from the
same superblock. With that we could do:

diff --git a/fs/mnt_idmapping.c b/fs/mnt_idmapping.c
index a37991fdb194..a5ec15c8c754 100644
--- a/fs/mnt_idmapping.c
+++ b/fs/mnt_idmapping.c
@@ -20,6 +20,7 @@
 struct mnt_idmap {
        struct uid_gid_map uid_map;
        struct uid_gid_map gid_map;
+       struct user_namespace *s_user_ns;
        refcount_t count;
 };

And then stuff like:

static inline vfsuid_t i_uid_into_vfsuid(struct mnt_idmap *idmap,
                                         const struct inode *inode)
{
        return make_vfsuid(idmap, i_user_ns(inode), inode->i_uid);
}

just becomes:

static inline vfsuid_t i_uid_into_vfsuid(struct mnt_idmap *idmap,
                                         const struct inode *inode)
{
        return make_vfsuid(idmap, inode->i_uid);
}

which means:

vfsuid_t make_vfsuid(struct mnt_idmap *idmap,
                     kuid_t kuid)
{
        uid_t uid;

        if (idmap == &nop_mnt_idmap)
                return VFSUIDT_INIT(kuid);

<snip>
}

will only have to verify nop_mnt_idmap and we never have to access the
inode->i_sb->s_user_ns at all.

I'll wip up a patch for this.

> 
> so just to *see* that it's nop_mnt_idmap takes effort.
> 
> One improvement might be to cache that 'nop_mnt_idmap' thing in the
> inode as a flag.
> 
> But it would be even better if the filesystem just initializes the
> inode at inode read time to say "I have no ACL's for this inode" and
> none of this code will even trigger.

Yes, let's please do this.