linux-kernel - Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to put_unused

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <suehfaqsibehz3vws6oourswenaz7bbbn75a7joi5cxi6komyk@3fxp7v3bg5qk>
Date: Mon, 17 Jun 2024 13:23:31 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: "Ma, Yu" <yu.ma@...el.com>
Cc: viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, tim.c.chen@...ux.intel.com, 
	tim.c.chen@...el.com, pan.deng@...el.com, tianyou.li@...el.com
Subject: Re: [PATCH 3/3] fs/file.c: move sanity_check from alloc_fd() to
 put_unused_fd()

On Sun, Jun 16, 2024 at 11:47:40AM +0800, Ma, Yu wrote:
> 
> On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
> > So you are moving this to another locked area, but one which does not
> > execute in the benchmark?
> 
> The consideration here as mentioned is to reduce the performance impact (if
> to reserve the sanity check, and have the same functionality) by moving it
> from critical path to non-critical, as put_unused_fd() is mostly used for
> error handling when fd is allocated successfully but struct file failed to
> obtained in the next step.
> 

As mentioned by Christian in his mail this check can just be removed.

>         error = -EMFILE;
>         if (fd < fdt->max_fds) {

I would likely() on it.

>                 if (~fdt->open_fds[0]) {
>                         fd = find_next_zero_bit(fdt->open_fds,
> BITS_PER_LONG, fd);
>                         goto fastreturn;
>                 }
>                 fd = find_next_fd(fdt, fd);
>         }
> 
>         if (unlikely(fd >= fdt->max_fds)) {
>                 error = expand_files(files, fd);
>                 if (error < 0)
>                         goto out;
>                 if (error)
>                         goto repeat;
>         }
> fastreturn:
>         if (unlikely(fd >= end))
>                 goto out;
>         if (start <= files->next_fd)
>                 files->next_fd = fd + 1;
> 
>        ....
> 

This is not a review, but it does read fine.

LTP (https://github.com/linux-test-project/ltp.git) has a bunch of fd
tests, I would make sure they still pass before posting a v2.

I would definitely try moving out the lock to its own cacheline --
currently it resides with the bitmaps (and first 4 fds of the embedded
array).

I expect it to provide some win on top of the current patchset, but
whether it will be sufficient to justify it I have no idea.

Something of this sort:
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 2944d4aa413b..423cb599268a 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -50,11 +50,11 @@ struct files_struct {
    * written part on a separate cache line in SMP
    */
        spinlock_t file_lock ____cacheline_aligned_in_smp;
-       unsigned int next_fd;
+       unsigned int next_fd ____cacheline_aligned_in_smp;
        unsigned long close_on_exec_init[1];
        unsigned long open_fds_init[1];
        unsigned long full_fds_bits_init[1];
-       struct file __rcu * fd_array[NR_OPEN_DEFAULT];
+       struct file __rcu * fd_array[NR_OPEN_DEFAULT] ____cacheline_aligned_in_smp;
 };

 struct file_operations;

(feel free to just take)

All that said, I have to note blogbench has numerous bugs. For example
it collects stats by merely bumping a global variable (not even with
atomics), so some updates are straight up lost.

To my understanding it was meant to test filesystems and I'm guessing
threading (instead of separate processes) was only used to make it
easier to share statistics. Given the current scales this
unintentionally transitioned into bottlenecking on the file_lock
instead.

There were scalability changes made about 9 years ago and according to
git log they were benchmarked by Eric Dumazet, while benchmark code was
not shared. I suggest CC'ing the guy with your v2 and asking if he can
bench.  Even if he is no longer in position to do it perhaps he can rope
someone in (or even better share his benchmark).