Message-ID: <CAOQ4uxieqyB9oVAoEL+CG-J-LsWVN0GEke+J=pTad4+D+OrBxA@mail.gmail.com>
Date: Sat, 11 Jan 2025 11:33:18 +0100
From: Amir Goldstein <amir73il@...il.com>
To: "Artem S. Tashkinov" <aros@....com>
Cc: linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-xfs <linux-xfs@...r.kernel.org>
Subject: Re: Spooling large metadata updates / Proposal for a new API/feature
in the Linux Kernel (VFS/Filesystems):
On Sat, Jan 11, 2025 at 10:18 AM Artem S. Tashkinov <aros@....com> wrote:
>
> Hello,
>
> I had this idea on 2021-11-07, then I thought it was wrong/stupid, now
> I've asked AI and it said it was actually not bad, so I'm bringing it
> forward now:
>
> Imagine the following scenarios:
>
> * You need to delete tens of thousands of files.
> * You need to change the permissions, ownership, or security context
> (chmod, chown, chcon) for tens of thousands of files.
> * You need to update timestamps for tens of thousands of files.
>
> All these operations are currently relatively slow because they are
> executed sequentially, generating significant I/O overhead.
>
> What if these operations could be spooled and performed as a single
> transaction? By bundling metadata updates into one atomic operation,
Atomicity is not implied by the use case you described.
IOW, the use case should not care how many sub-transactions
the changes are executed in.
> such tasks could become near-instant or significantly faster. This would
> also reduce the number of writes, leading to less wear and tear on
> storage devices.
>
> Does this idea make sense? If it already exists, or if there’s a reason
> it wouldn’t work, please let me know.
Yes, this is already how journaled filesystems work, but userspace can
only request a commit of the current transaction (a.k.a. fsync), so
transactions can be committed too frequently or in a manner that is
inefficient for the workload (e.g. rm -rf).
There was a big effort, IIRC around v6.1, to improve the scalability of
the rm -rf workload in xfs, which led to a long series of
regression-and-fix cycles.
I think that an API for rm -rf is interesting because:
- It is a *very* common use case, which is often very inefficient
- filesystems already have "orphan" lists to deal with deferred work
on deleted inodes
What could be done in principle:
1. Filesystems could opt-in to implement unlink(path, AT_REMOVE_NONEMPTY_DIR)
2. This API will fail if the directory has subdirs (i_nlink != 2)
3. If the directory has only files, it can be unlinked and its inode added to an
"orphan" list as a special "batch delete" transaction
4. When executed, the "batch delete" transaction will iterate the
directory entries and decrement the nlink of their inodes, likely
adding those inodes to the "orphan" list
5. rm -rf will iterate depth-first, calling unlink(path,
AT_REMOVE_NONEMPTY_DIR) on leaf directories whose nlink is 2
Among other complications, this API does not take into account the
permissions for unlinking the child inodes, which depend on child inode
attributes such as the immutable flag or LSM security policies.
This could be an interesting TOPIC for LSFMM.
Thanks,
Amir.