linux-kernel - Re: [PATCH v3 2/2] vfs: avoid duplicating creds in faccessat if possible

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGudoHH8t9_5iLd8FsTW4PBZ+_vGad3YAd8K=n=SrRtnWHm49Q@mail.gmail.com>
Date:   Sat, 4 Mar 2023 20:01:45 +0100
From:   Mateusz Guzik <mjguzik@...il.com>
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Alexander Potapenko <glider@...gle.com>,
        Al Viro <viro@...iv.linux.org.uk>,
        Kees Cook <keescook@...omium.org>,
        Eric Biggers <ebiggers@...gle.com>,
        Christian Brauner <brauner@...nel.org>, serge@...lyn.com,
        paul@...l-moore.com, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-security-module@...r.kernel.org
Subject: Re: [PATCH v3 2/2] vfs: avoid duplicating creds in faccessat if possible

On 3/3/23, Linus Torvalds <torvalds@...ux-foundation.org> wrote:
> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@...il.com> wrote:
>>
>> I think there is a systemic problem which comes with the kzalloc API
>
> Well, it's not necessarily the API that is bad, but the implementation.
>
> We could easily make kzalloc() with a constant size just expand to
> kmalloc+memset, and get the behavior you want.
>
> We already do magical things for "find the right slab bucket" part of
> kmalloc too for constant sizes. It's changed over the years, but that
> policy goes back a long long time. See
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8
>
> from the BK history tree.
>
> Exactly because some things are worth optimizing for when the size is
> known at compile time.
>
> Maybe just extending kzalloc() similarly? Trivial and entirely untested
> patch:
>
>    --- a/include/linux/slab.h
>    +++ b/include/linux/slab.h
>    @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
>      */
>     static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>     {
>    +    if (__builtin_constant_p(size)) {
>    +            void *ret = kmalloc(size, flags);
>    +            if (ret)
>    +                    memset(ret, 0, size);
>    +            return ret;
>    +    }
>         return kmalloc(size, flags | __GFP_ZERO);
>     }
>

So I played with this and have a rather nasty summary. Bullet points:
1. patched kzalloc does not reduce memsets calls during kernel build
2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop
it significantly (36150671 -> 14414454)
3. ... inline memset generated by gcc sucks by resorting to rep stosq
around 48 bytes
4. memsets not sorted out have sizes not known at compilation time and
are not necessarily perf bugs on their own [read: would benefit from
faster memset]

Onto the the meat:

I patched the kernel with a slightly tweaked version of the above:
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af70315a94..7abb5490690f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
kmem_cache *k, gfp_t flags)
  */
 static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
 {
+       if (__builtin_constant_p(size)) {
+               void *ret = kmalloc(size, flags);
+               if (likely(ret))
+                       memset(ret, 0, size);
+               return ret;
+       }
        return kmalloc(size, flags | __GFP_ZERO);
 }

and verified it indeed zeroes inline:

void kztest(void)
{
        void *ptr;

        ptr = kzalloc(32, GFP_KERNEL);
        if (unlikely(!ptr))
                return;
        memsettest_rec(ptr);
}

$ objdump --disassemble=kztest vmlinux
[snip]
call   ffffffff8135e130 <kmalloc_trace>
test   %rax,%rax
je     ffffffff81447d5f <kztest+0x4f>
movq   $0x0,(%rax)
mov    %rax,%rdi
movq   $0x0,0x8(%rax)
movq   $0x0,0x10(%rax)
movq   $0x0,0x18(%rax)
call   ffffffff81454060 <memsettest_rec>
[snip]

This did *NOT* lead to reduction of memset calls when building the kernel.

I verified few cases by hand, it is all either kmem_cache_zalloc or
explicitly added memsets with sizes not known at compilation time.

Two most frequent callers:
@[
    memset+5
    __alloc_file+40
    alloc_empty_file+73
    path_openat+77
    do_filp_open+182
    do_sys_openat2+159
    __x64_sys_openat+89
    do_syscall_64+93
    entry_SYSCALL_64_after_hwframe+114
]: 11028994
@[
    memset+5
    security_file_alloc+45
    __alloc_file+89
    alloc_empty_file+73
    path_openat+77
    do_filp_open+182
    do_sys_openat2+159
    __x64_sys_openat+89
    do_syscall_64+93
    entry_SYSCALL_64_after_hwframe+114
]: 11028994

My wip addition is:

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af70315a94..12b5b02ef3d3 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct
kmem_cache *k, gfp_t flags)
        return kmem_cache_alloc(k, flags | __GFP_ZERO);
 }

+#define kmem_cache_zalloc_ptr(k, f, retp) ({                           \
+       __typeof(retp) _retp = kmem_cache_alloc(k, f);                  \
+       bool _rv = false;                                               \
+       retp = _retp;                                                   \
+       if (likely(_retp)) {                                            \
+               memset(_retp, 0, sizeof(*_retp));                       \
+               _rv = true;                                             \
+       }                                                               \
+       _rv;                                                            \
+})
+
diff --git a/security/security.c b/security/security.c
index cf6cc576736f..0f769ede0e54 100644
--- a/security/security.c
+++ b/security/security.c
@@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file)
                return 0;
        }

-       file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL);
-       if (file->f_security == NULL)
+       if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL,
file->f_security))
                return -ENOMEM;
        return 0;
 }
diff --git a/fs/file_table.c b/fs/file_table.c
index 372653b92617..8e0dabf9530e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const
struct cred *cred)
        struct file *f;
        int error;

-       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
-       if (unlikely(!f))
+       if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f))
                return ERR_PTR(-ENOMEM);

        f->f_cred = get_cred(cred);

As mentioned above it cuts total calls in more than half.

The problem is it is it rolls with rep stosq way too easily, partially
defeating the point of inlining anything. clang does not have this
problem.

Take a look at __alloc_file:
[snip]
mov    0x19cab05(%rip),%rdi        # ffffffff82df4318 <filp_cachep>
call   ffffffff813dd610 <kmem_cache_alloc>
test   %rax,%rax
je     ffffffff814298b7 <__alloc_file+0xc7>
mov    %rax,%r12
mov    $0x1d,%ecx
xor    %eax,%eax
mov    %r12,%rdi
rep stos %rax,%es:(%rdi)
[/snip]

Here is a sample consumer which can't help but have a variable size --
select, used by gmake:
@[
    memset+5
    do_pselect.constprop.0+202
    __x64_sys_pselect6+101
    do_syscall_64+93
    entry_SYSCALL_64_after_hwframe+114
]: 13160

In conclusion:
1. fixing up memsets is worthwhile regardless of what happens to its
current consumers -- not all of them are necessarily doing something
wrong
2. inlining memset can be a pessimization vs non-plain-erms memset as
evidenced above. will have to figure out how to convince gcc to be
less eager to use it.

Sometimes I hate computers.

-- 
Mateusz Guzik <mjguzik gmail.com>