Message-ID: <dia6l34faugmuwmgpyvpeeppqjwmv2qhhvu57nrerc34qknwlo@ltwkoy7pstrm>
Date: Mon, 23 Sep 2024 23:47:54 -0400
From: Kent Overstreet <kent.overstreet@...ux.dev>
To: Dave Chinner <david@...morbit.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>, 
	linux-bcachefs@...r.kernel.org, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Dave Chinner <dchinner@...hat.com>
Subject: Re: [GIT PULL] bcachefs changes for 6.12-rc1

On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote:
> On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> > But stat/statx always pulls the inode into the vfs inode cache, and
> > that's likely worth fixing.
> 
> No, let's not even consider going there.
> 
> Unlike most people, old-time XFS developers have direct experience
> with the problems that "uncached" inode access for stat purposes
> causes.
> 
> XFS has had the bulkstat API for a long, long time (i.e. since 1998
> on Irix). When it was first implemented on Irix, it was VFS cache
> coherent. But in the early 2000s, that caused problems with HSMs
> needing to scan billions of inodes indexing petabytes of stored data
> with certain SLA guarantees (i.e. needing to scan at least a million
> inodes a second).  The CPU overhead of cache instantiation and
> teardown was too great to meet those performance targets on 500MHz
> MIPS CPUs.
> 
> So we converted bulkstat to run directly out of the XFS buffer cache
> (i.e. uncached from the perspective of the VFS). This reduced the
> CPU overhead per inode substantially, allowing bulkstat rates to
> increase by a factor of 10. However, it introduced all sorts of
> coherency problems between cached inode state vs what was stored in
> the buffer cache. It was basically O_DIRECT for stat() and, as you'd
> expect from that description, the coherency problems were horrible.
> Detecting iallocated-but-not-yet-updated and
> unlinked-but-not-yet-freed inodes was a particularly consistent
> source of issues.
> 
> The only way to fix these coherency problems was to check the inode
> cache for a resident inode first, which basically defeated the
> entire purpose of bypassing the VFS cache in the first place.

Eh? Of course it'd have to be coherent, but just checking if an inode is
present in the VFS cache is what, 1-2 cache misses? Depending on hash
table fill factor...

That's going to show up, but I have a hard time seeing that as
"defeating the entire purpose" of bypassing the VFS cache, as you say.
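
Roughly what I mean - an untested sketch, where
bulkstat_fill_uncached() is a made-up stand-in for reading inode
state straight out of the fs buffer cache:

#include <linux/fs.h>
#include <linux/stat.h>

/*
 * Probe the VFS inode cache first so the result stays coherent
 * with dirty cached state; fall back to an uncached read only
 * when the inode isn't resident.
 */
static int bulkstat_one(struct super_block *sb, unsigned long ino,
			struct kstat *stat)
{
	struct inode *inode = ilookup(sb, ino);	/* the 1-2 cache misses */

	if (inode) {
		/* Resident and coherent: answer from the VFS copy */
		stat->ino   = inode->i_ino;
		stat->mode  = inode->i_mode;
		stat->nlink = inode->i_nlink;
		stat->size  = i_size_read(inode);
		iput(inode);
		return 0;
	}

	/*
	 * Cold: read straight from the buffer cache, with no
	 * struct inode instantiation/teardown.
	 */
	return bulkstat_fill_uncached(sb, ino, stat);
}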

> Don't hack around VFS scalability issues if it can be avoided.

Well, maybe if your dlock list patches make it in - I still see crazy
lock contention there...
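
For anyone who hasn't followed those patches: the idea (as I
understand it) is to replace the single global list and lock with
per-CPU sublists, each under its own lock. A toy userspace sketch of
the shape - not the actual patchset:

#include <pthread.h>

#define NR_SHARDS 64			/* kernel: one list per CPU */

struct node {
	struct node *next;
};

static struct {
	pthread_mutex_t lock;
	struct node *head;
} shards[NR_SHARDS];

static void dlist_init(void)
{
	for (int s = 0; s < NR_SHARDS; s++)
		pthread_mutex_init(&shards[s].lock, NULL);
}

static void dlist_add(struct node *n)
{
	/* Cheap shard pick; the kernel uses the local CPU's list */
	int s = ((unsigned long)n >> 6) % NR_SHARDS;

	pthread_mutex_lock(&shards[s].lock);
	n->next = shards[s].head;
	shards[s].head = n;
	pthread_mutex_unlock(&shards[s].lock);
}

static void dlist_for_each(void (*fn)(struct node *))
{
	/*
	 * Whole-list walk takes each shard lock in turn instead of
	 * serializing every walker on one global lock.
	 */
	for (int s = 0; s < NR_SHARDS; s++) {
		pthread_mutex_lock(&shards[s].lock);
		for (struct node *n = shards[s].head; n; n = n->next)
			fn(n);
		pthread_mutex_unlock(&shards[s].lock);
	}
}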
