Message-ID: <4d25c1fb-040a-4e0d-bc12-67f17a04378f@uls.co.za>
Date: Sat, 29 Mar 2025 10:59:42 +0200
From: Jaco Kroon <jaco@....co.za>
To: David Laight <david.laight.linux@...il.com>
Cc: linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
 bernd.schubert@...tmail.fm, miklos@...redi.hu, rdunlap@...radead.org,
 trapexit@...wn.link
Subject: Re: fuse: increase readdir() buffer size

Hi David,

On 2025/03/28 21:40, David Laight wrote:
> On Fri, 28 Mar 2025 12:15:47 +0200
> Jaco Kroon <jaco@....co.za> wrote:
>
>> Hi All,
>>
>> I've not seen feedback on this, please may I get some direction on this?
> The only thing I can think of is that the longer loop might affect
> scheduling latencies.

I'm assuming you're referring to the fact that it's now possible to
iterate through the loop more times during the final phase?

The first loop, which sets up the size of the folio, should only ever
execute once unless there is severe memory pressure and allocations
fail.
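
Roughly, the idea of that loop is the below (an illustrative sketch
only, not the exact code from the patch; "want" stands for the clamped
request size, ~128KiB in the usual case):

	/* sketch: start at the order matching the wanted size and fall
	 * back to smaller folios only if allocation fails, so the body
	 * normally runs exactly once */
	struct folio *folio;
	unsigned int order = get_order(want);

	for (;;) {
		folio = folio_alloc(GFP_KERNEL, order);
		if (folio || order == 0)
			break;
		order--;
	}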

The clamping code is unfortunate in my opinion, but it is possible for
us to receive either an insanely huge count (based on inspection of
other code, intended to be -1) or a zero value.

The upper limit here is arbitrary, but in the common case I expect a
128KiB buffer to be allocated, which should succeed on the first
attempt.
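
Conceptually the clamp is nothing more than the below (illustrative
only; the exact bounds and expression in the patch may differ):

	/* illustrative: bound the userspace-supplied count, which may
	 * arrive as 0 or effectively -1, before sizing the folio */
	size_t want = clamp_t(size_t, count, PAGE_SIZE, 128 * 1024);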

The call to fuse_read_args_fill, together with the clamping code, may
result in marginally longer-running code, but because larger buffers of
dentries are returned, this is more than compensated for by the drastic
reduction in user-kernel context switches: far fewer getdents() system
calls are needed to iterate a full readdir().
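
To make the syscall-count argument concrete, the toy program below (not
part of the patch; the struct simply mirrors the kernel's
linux_dirent64 layout) counts how many getdents64() calls a directory
walk needs for a given buffer size:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {
	uint64_t       d_ino;
	int64_t        d_off;
	unsigned short d_reclen;
	unsigned char  d_type;
	char           d_name[];
};

int main(int argc, char **argv)
{
	static char buf[128 * 1024];	/* drop to 4096 to compare */
	int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
	long calls = 0, entries = 0, n;

	if (fd < 0)
		return 1;
	while ((n = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
		calls++;
		for (long off = 0; off < n; ) {
			struct linux_dirent64 *d =
				(struct linux_dirent64 *)(buf + off);
			entries++;
			off += d->d_reclen;
		}
	}
	printf("%ld entries in %ld getdents64() calls\n", entries, calls);
	close(fd);
	return 0;
}

Pointed at a large directory, the 128KiB buffer needs roughly 32x fewer
getdents64() calls than a 4KiB one, which is the kind of reduction the
larger readdir buffer is aiming for on the FUSE side.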

Other systems may be more resilient, but glusterfs without my dentry
cache patches has a tendency to "forget" dentries under pressure, and
then has to restart getdents() runs via client-server round trips,
often resulting in getdents() system call durations upwards of 30s at a
time.  With this patch, the overall latency comes down drastically, and
the number of these (inferred) restart runs on the server side to find
the right place to continue reading from is also reduced, even without
increasing the dentry cache in the glusterfs code.  In my experience,
these two patches in combination make glusterfs operate no worse than
NFS on average.

Given the feedback, I'll discuss with the team deploying updated
kernels including these patches to our production rather than test
environments (one node at a time, 21 in total), and then provide
feedback from there.  Our rule remains no more than one node at a time
for potentially disruptive changes like these, and preferably no more
than one node per day per environment.

In our test environment this panned out on the newest (at the time of
mailing) kernel, and informal time trials (time find /mount/point, with
180m inodes) were quite positive, and orders of magnitude better than
without the patches (we killed the unpatched run after it had taken
about 3x longer than the patched one and was still incomplete).

Kind regards,
Jaco


>
> 	David
>
>> Kind regards,
>> Jaco
>>
>> On 2025/03/15 00:16, Jaco Kroon wrote:
>>> This is a follow up to the attempt made a while ago.
>>>
>>> Whilst the patch worked, newer kernels have moved from pages to folios,
>>> which gave me the motivation to implement the mechanism based on the
>>> userspace buffer size patch that Miklos supplied.
>>>
>>> That patch works as is, but I note there are changes to components
>>> (overlayfs and exportfs) that I've got very little experience with, and
>>> have not tested specifically here.  They do look logical.  I've marked
>>> Miklos as the Author: here, and added my own Signed-off-by - I hope this
>>> is correct.
>>>
>>> The second patch in the series implements the changes to fuse's readdir
>>> in order to utilize the first to enable reading more than one page of
>>> dent structures at a time from userspace, I've included a strace from
>>> before and after this patch in the commit to illustrate the difference.
>>>
>>> To get the relevant performance on glusterfs improved (which was
>>> mentioned elsewhere in the thread) changes to glusterfs to increase the
>>> number of cached dentries is also required (these are pushed to github
>>> but not yet merged, because similar to this patch, got stalled before
>>> getting to the "ready for merge" phase even though it's operational).
>>>
>>> Please advise if these two patches look good (I've only done relatively
>>> basic testing for now, and it's not running on production systems yet).
>>>
>>> Kind regards,
>>> Jaco
>>>   
