Message-ID: <ZvNgmoKgWF0TBXP8@dread.disaster.area>
Date: Wed, 25 Sep 2024 11:00:10 +1000
From: Dave Chinner <david@...morbit.com>
To: Kent Overstreet <kent.overstreet@...ux.dev>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-bcachefs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, Dave Chinner <dchinner@...hat.com>
Subject: Re: [GIT PULL] bcachefs changes for 6.12-rc1

On Mon, Sep 23, 2024 at 11:47:54PM -0400, Kent Overstreet wrote:
> On Tue, Sep 24, 2024 at 01:34:14PM GMT, Dave Chinner wrote:
> > On Mon, Sep 23, 2024 at 10:55:57PM -0400, Kent Overstreet wrote:
> > > But stat/statx always pulls the inode into the VFS inode cache, and
> > > that's likely worth fixing.
> > 
> > No, let's not even consider going there.
> > 
> > Unlike most people, old-time XFS developers have direct experience
> > with the problems caused by "uncached" inode access for stat purposes.
> > 
> > XFS has had the bulkstat API for a long, long time (i.e. since 1998
> > on Irix). When it was first implemented on Irix, it was VFS cache
> > coherent. But in the early 2000s, that caused problems with HSMs
> > needing to scan billions of inodes indexing petabytes of stored data
> > with certain SLA guarantees (i.e. needing to scan at least a million
> > inodes a second).  The CPU overhead of cache instantiation and
> > teardown was too great to meet those performance targets on 500MHz
> > MIPS CPUs.
> > 
> > So we converted bulkstat to run directly out of the XFS buffer cache
> > (i.e. uncached from the perspective of the VFS). This reduced the
> > CPU overhead per inode substantially, allowing bulkstat rates to
> > increase by a factor of 10. However, it introduced all sorts of
> > coherency problems between cached inode state vs what was stored in
> > the buffer cache. It was basically O_DIRECT for stat() and, as you'd
> > expect from that description, the coherency problems were horrible.
> > Detecting iallocated-but-not-yet-updated and
> > unlinked-but-not-yet-freed inodes was a particularly consistent
> > source of issues.
> > 
> > The only way to fix these coherency problems was to check the inode
> > cache for a resident inode first, which basically defeated the
> > entire purpose of bypassing the VFS cache in the first place.
> 
> Eh? Of course it'd have to be coherent, but just checking if an inode is
> present in the VFS cache is what, 1-2 cache misses? Depending on hash
> table fill factor...

Sure, when there is no contention and you have CPU to spare. But the
moment the lookup hits contention problems (i.e. we exceed the
scalability limits of the cache lookup), we are straight back to
running at VFS cache speed instead of uncached speed.

IOWs, needing to perform the cache lookup defeated the purpose of
using uncached lookups to avoid the cache scalability problems.
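
To make that concrete, here is a minimal sketch (not the actual XFS
bulkstat code) of what a "coherent uncached" per-inode stat ends up
looking like. ilookup() and iput() are the real VFS helpers; the
fill_stat_*() functions are made up for the example:

#include <linux/fs.h>
#include <linux/stat.h>

/* Hypothetical helpers for illustration only, not real kernel APIs. */
void fill_stat_from_vfs_inode(struct inode *inode, struct kstat *st);
int fill_stat_from_ondisk(struct super_block *sb, unsigned long ino,
			  struct kstat *st);

static int stat_one_inode_coherent(struct super_block *sb,
				   unsigned long ino, struct kstat *st)
{
	struct inode *inode;

	/* Hash lookup + lock traffic for every single inode scanned. */
	inode = ilookup(sb, ino);
	if (inode) {
		/* Resident in the VFS cache: must use this copy. */
		fill_stat_from_vfs_inode(inode, st);
		iput(inode);
		return 0;
	}

	/*
	 * Not resident: read the on-disk copy out of the filesystem
	 * buffer cache. Nothing pins the inode here, so this can race
	 * with a concurrent instantiation.
	 */
	return fill_stat_from_ondisk(sb, ino, st);
}

The per-inode ilookup() is exactly the cost the uncached path was
trying to avoid.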

Keep in mind that not having a referenced inode opens the code up to
things like pre-emption races: a cache miss doesn't prevent the
current task from being preempted before it reads the inode
information into the user buffer. The VFS inode could be
instantiated and modified before the uncached access resumes, pulls
stale information from the underlying buffer, and returns that to
userspace.
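
Roughly, the race window looks like this (illustrative only):

   Task A (uncached stat)           Task B
   ----------------------           --------------------------------
   checks VFS cache: not resident
   <preempted>
                                    instantiates the VFS inode
                                    modifies it (size, timestamps, ...)
                                    changes are still only in memory
   <resumes>
   reads the on-disk copy from the
   fs buffer cache -> stale data
   copies stale data to userspace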

Those were the sorts of problems we continually had with using low
level inode information for stat operations vs using the up-to-date
VFS inode state....

-Dave.
-- 
Dave Chinner
david@...morbit.com
