Message-ID: <ADB5A1E8-F1BF-4B42-BD77-96C57B135305@hammerspace.com>
Date: Sun, 09 Nov 2025 13:34:26 -0500
From: Benjamin Coddington <bcodding@...merspace.com>
To: Dai Ngo <dai.ngo@...cle.com>
Cc: chuck.lever@...cle.com, jlayton@...nel.org, neilb@...mail.net,
 okorniev@...hat.com, tom@...pey.com, hch@....de, alex.aring@...il.com,
 viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
 linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-nfs@...r.kernel.org
Subject: Re: [Patch 0/2] NFSD: Fix server hang when there are multiple layout
 conflicts

On 6 Nov 2025, at 12:05, Dai Ngo wrote:

> When a layout conflict triggers a call to __break_lease, the function
> nfsd4_layout_lm_break clears the fl_break_time timeout before sending
> the CB_LAYOUTRECALL. As a result, __break_lease repeatedly restarts
> its loop, waiting indefinitely for the conflicting file lease to be
> released.
>
> If the number of lease conflicts matches the number of NFSD threads (which
> defaults to 8), all available NFSD threads become occupied. Consequently,
> there are no threads left to handle incoming requests or callback replies,
> leading to a total hang of the NFS server.
>
> This issue is reliably reproducible by running the Git test suite on a
> configuration using SCSI layout.
>
> This patchset fixes this problem by introducing the new lm_breaker_timedout
> operation to lease_manager_operations and using timeout for layout
> lease break.

Hey Dai,

I like your solution here, but I worry it can cause unexpected or
unnecessary client fencing when the problem is server-side (not enough
threads).  Clients might be dutifully sending LAYOUTRETURN while the server
can't service them, and this change would fence those clients in
environments where the problem could instead be fixed by adding more knfsd
threads.  Also, I think we significantly bumped the default thread count
recently in nfs-utils:
eb5abb5c60ab (tag: nfs-utils-2-8-2-rc3) nfsd: dump default number of threads to 16

You probably have already seen previous discussions about this:
https://lore.kernel.org/linux-nfs/1CC82EC5-6120-4EE4-A7F0-019CF7BC762C@redhat.com/

This also changes the behavior for all layouts.  I haven't thought through
the implications of that, but I wish we could have a knob for this behavior,
or perhaps a knfsd-specific fl_break_time tunable.

Last thought (for now): I think Neil has some work on a dynamic knfsd thread
count, or was it Jeff?  (I am having trouble finding it.)  Would that work
around this problem?
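
To make the starvation concern concrete, here is a minimal user-space
sketch. It is a toy model only, not NFSD code: the names run, break_lease
and layoutreturn are hypothetical, and a Python thread pool stands in for
the knfsd threads. Each conflict ties up one worker until its LAYOUTRETURN
is processed by some other worker or until the break timeout expires.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run(nthreads, conflicts, break_timeout):
    """Toy model (hypothetical, not NFSD code) of lease-break waiters
    competing with client replies for a fixed pool of worker threads."""
    returned = [threading.Event() for _ in range(conflicts)]

    def break_lease(i):
        # Occupies a worker while waiting for the layout to be returned.
        ok = returned[i].wait(timeout=break_timeout)
        return "returned" if ok else "fenced"

    def layoutreturn(i):
        # The client's reply; it can only run on a free worker.
        returned[i].set()

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # Waiters are queued first, so they claim workers first.
        waiters = [pool.submit(break_lease, i) for i in range(conflicts)]
        # The replies arrive afterwards and queue behind the waiters.
        for i in range(conflicts):
            pool.submit(layoutreturn, i)
        return [w.result() for w in waiters]


if __name__ == "__main__":
    print(run(1, 1, 0.2))  # no spare worker: the reply is never serviced in time
    print(run(2, 1, 5.0))  # one spare worker: the reply is serviced
```

In this model, run(1, 1, 0.2) reports the client as fenced even though its
reply was already queued, while run(2, 1, 5.0) reports it as returned.
Without a timeout the first case would simply never finish, which is the
hang being fixed; with one, a well-behaved client is still the one that pays
for the missing thread.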

Regards,
Ben
