linux-kernel - Re: [PATCH v3] securityfs: fix missing of d_delete() in securityfs


Message-ID: <20250509032326.GJ2023217@ZenIV>
Date: Fri, 9 May 2025 04:23:26 +0100
From: Al Viro <viro@...iv.linux.org.uk>
To: alexjlzheng@...il.com
Cc: paul@...l-moore.com, jmorris@...ei.org, serge@...lyn.com,
	greg@...ah.com, chrisw@...l.org,
	linux-security-module@...r.kernel.org, linux-kernel@...r.kernel.org,
	Jinliang Zheng <alexjlzheng@...cent.com>
Subject: Re: [PATCH v3] securityfs: fix missing of d_delete() in
 securityfs_remove()

On Thu, May 08, 2025 at 10:04:39PM +0800, alexjlzheng@...il.com wrote:

> In addition, securityfs_recursive_remove() avoids this problem by calling
> __d_drop() directly. As a non-recursive version, it is somewhat strange
> that securityfs_remove() does not clean up the deleted dentry.
> 
> Fix this by adding d_delete() in securityfs_remove().

This is not a fix.  First and foremost, securityfs_recursive_remove()
does *not* just call __d_drop() - it calls simple_recursive_removal(),
which takes care to evict anything possibly mounted on those suckers.

Your variant trivially turns into a mount leak - just bind anything
on that thing and trigger removal.

<a bit of a rant follows; if it offends somebody, feel free to report
to CoC committee>

What's more, securityfs object creation is... special.  It does, for
some odd reason, leave you with a dentry with refcount *two*.  For no
reason whatsoever, as far as I can tell.

securityfs_remove() matches that; securityfs_recursive_remove(),
as far as I can tell, should simply leak them.  That's from RTFS
alone, but I don't see how it could possibly *not* happen -
securityfs_create_file() is a call of securityfs_create_dentry(),
which
	* calls lookup_one_len(), getting a negative dentry with
refcount 1.
	* verifies it's negative
	* gets a new inode
	* does d_instantiate(), attaching it to dentry.
	* does dget(), for some unspeakable reason.  Refcount is 2 now.
	* returns that dentry to caller.
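
The sequence above can be replayed in a userspace toy model.  This is only
a sketch of the refcount arithmetic, not kernel code: struct toy_dentry and
every toy_*() helper are made up for illustration and merely stand in for
the real lookup_one_len()/d_instantiate()/dget() calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct dentry: only what the walk-through needs. */
struct toy_dentry {
	int refcount;
	bool positive;		/* true once an inode is attached */
};

/* Models lookup_one_len() handing back a negative dentry, refcount 1. */
void toy_lookup(struct toy_dentry *d)
{
	d->refcount = 1;
	d->positive = false;
}

/* Models d_instantiate(): attaches an inode, refcount is untouched. */
void toy_instantiate(struct toy_dentry *d)
{
	assert(!d->positive);	/* creation insists on a negative dentry */
	d->positive = true;
}

/* Models dget(): one more reference. */
void toy_dget(struct toy_dentry *d)
{
	assert(d->refcount > 0);
	d->refcount++;
}

/* The creation sequence as described; returns the resulting refcount. */
int toy_securityfs_create(struct toy_dentry *d)
{
	toy_lookup(d);
	toy_instantiate(d);
	toy_dget(d);		/* the extra reference */
	return d->refcount;
}
```

Calling toy_securityfs_create() on a fresh struct toy_dentry leaves it
positive with a refcount of 2, which is the state the caller gets back.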

policyfs stuff calls securityfs_create_dir() (which is a wrapper for
securityfs_create_file(), with nothing extra done to refcounts),
then populates it with a bunch of files, all with the same refcount
weirdness.

Result: directory dentry with refcount 2 + number of children and
a bunch of children, each with refcount 2.

Now, securityfs_recursive_remove() calls simple_recursive_removal(),
which will strip _one_ off the refcount of each dentry in that tree.
Yes, they are all unhashed and any stuff mounted on them is kicked
out, but you have a massive dentry leak now - all of those dentries
have refcount at least 1.
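
The arithmetic can be checked with a small userspace model.  It is a sketch
under stated assumptions, not the kernel's implementation: toy_dentry,
toy_create(), toy_dput() and toy_leak() are hypothetical names, a child is
assumed to pin its parent until the child itself is freed, and removal is
modelled as exactly one reference dropped per dentry, children first.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_KIDS 8

struct toy_dentry {
	int refcount;
	struct toy_dentry *parent;
};

/* Create a dentry holding refs_at_create references; it pins its parent. */
void toy_create(struct toy_dentry *d, struct toy_dentry *parent,
		int refs_at_create)
{
	d->refcount = refs_at_create;
	d->parent = parent;
	if (parent)
		parent->refcount++;
}

/* Drop one reference; a dentry reaching zero releases its parent too. */
void toy_dput(struct toy_dentry *d)
{
	while (d) {
		assert(d->refcount > 0);
		if (--d->refcount)
			return;
		d = d->parent;
	}
}

/*
 * Build one directory with nkids children, drop one reference per dentry
 * (children first) and return how many dentries are still pinned.
 */
int toy_leak(int nkids, int refs_at_create)
{
	struct toy_dentry dir, kids[TOY_MAX_KIDS];
	int leaked = 0;

	assert(nkids >= 0 && nkids <= TOY_MAX_KIDS);
	toy_create(&dir, NULL, refs_at_create);
	for (int i = 0; i < nkids; i++)
		toy_create(&kids[i], &dir, refs_at_create);

	for (int i = 0; i < nkids; i++)
		toy_dput(&kids[i]);
	toy_dput(&dir);

	for (int i = 0; i < nkids; i++)
		leaked += kids[i].refcount > 0;
	leaked += dir.refcount > 0;
	return leaked;
}
```

With refs_at_create set to 2 (the securityfs behaviour described above),
toy_leak(3, 2) reports every dentry in the tree as still pinned; with 1,
the same removal frees all of them.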

I'm not blaming securityfs_recursive_remove() authors - it *should*
have worked; their only fault is that they hadn't guessed that
object creation on securityfs happens to be that strange.

Another special snowflake is efi_secret_unlink() - it calls
securityfs_remove(), which is needed instead of simple_unlink()
since
	* that double refcount needs to be dropped
	* having internal mount pinned is something that needs
to be undone, innit?

Of course, it runs afoul of the parent being locked, but never mind that -
it just unlocks and relocks it, 'cuz what can go wrong?  That - instead
of discussing that with VFS and filesystem folks.

As for "what can go wrong"...  Consider what happens if another process
calls unlink() on the same file, just before the first one drops the
lock on parent.  Parent found, process 2 blocked on the lock.  Process 1
unlocks that lock and loses CPU.  Process 2 runs, gets the lock on parent
and tries to lock the victim; blocks since process 1 is still holding it
locked.  Process 1, in securityfs_remove(): blocks trying to lock the
parent.  AB-BA deadlock.
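
That interleaving can be replayed step by step in a deterministic toy
model.  Nothing here is kernel code: the two "locks" are plain owner
fields, and toy_state, toy_try_lock(), toy_unlock(), toy_deadlocked() and
toy_replay_interleaving() are names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

enum { PARENT, VICTIM, NLOCKS };
enum { P1, P2, NPROCS };
enum { NONE = -1 };

struct toy_state {
	int owner[NLOCKS];	/* process holding each lock, or NONE */
	int waiting_on[NPROCS];	/* lock each process is blocked on, or NONE */
};

/* Take the lock if free; otherwise record that proc is blocked on it. */
bool toy_try_lock(struct toy_state *s, int proc, int lock)
{
	if (s->owner[lock] != NONE) {
		s->waiting_on[proc] = lock;
		return false;
	}
	s->owner[lock] = proc;
	s->waiting_on[proc] = NONE;
	return true;
}

void toy_unlock(struct toy_state *s, int proc, int lock)
{
	assert(s->owner[lock] == proc);
	s->owner[lock] = NONE;
}

/* Each process is blocked on a lock that the other one holds. */
bool toy_deadlocked(const struct toy_state *s)
{
	return s->waiting_on[P1] != NONE && s->waiting_on[P2] != NONE &&
	       s->owner[s->waiting_on[P1]] == P2 &&
	       s->owner[s->waiting_on[P2]] == P1;
}

/* Replays the scenario described above; returns true if it deadlocks. */
bool toy_replay_interleaving(void)
{
	struct toy_state s = {
		.owner = { NONE, NONE },
		.waiting_on = { NONE, NONE },
	};

	/* Process 1 enters ->unlink() with parent and victim locked. */
	(void)toy_try_lock(&s, P1, PARENT);
	(void)toy_try_lock(&s, P1, VICTIM);
	/* Process 2 calls unlink() on the same file, blocks on parent. */
	(void)toy_try_lock(&s, P2, PARENT);
	/* Process 1 drops the parent lock and loses the CPU. */
	toy_unlock(&s, P1, PARENT);
	/* Process 2 gets the parent, then blocks on the victim. */
	(void)toy_try_lock(&s, P2, PARENT);
	(void)toy_try_lock(&s, P2, VICTIM);
	/* Process 1, in the removal path, tries to relock the parent. */
	(void)toy_try_lock(&s, P1, PARENT);

	return toy_deadlocked(&s);
}
```

toy_replay_interleaving() ends with process 1 holding the victim and
waiting for the parent, and process 2 holding the parent and waiting for
the victim.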

Oh, well...

Anyway, the reasons for securityfs_remove() use there are real deficiencies
of securityfs.  Weird shit with refcounts is one thing; internal mount
pinning is a bit more subtle, but it's also solvable.

The thing is, objects on securityfs never change parents.  So you only
need to pin for subdirectories of root - everything deeper will be
automatically fine.  And that kills the second reason for those games.
With that dealt with, efi_secret_unlink() can simply call simple_unlink().

After that securityfs_remove() can become an alias for
securityfs_recursive_remove() (or the other way round, preferably).

BTW,
        d_inode(dent)->i_op = &efi_secret_dir_inode_operations;
in the same drivers/virt/coco is also nasty - you don't change the method
table on an object that is already exposed in shared data structures.
Basic multithreaded programming safety rules...  Yes, _that_ probably runs
too early in the boot for anything to hit it, so it's not a security hole,
but the same "what if somebody copies that code and gets screwed" applies
there...  If anything, that points to the need for a securityfs_create_dir()
variant that would override ->i_op, which should've been discussed back
when the thing had been merged.
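
The general rule (finish setting up the method table, then publish the
object) can be sketched in userspace with C11 atomics.  This is a
hypothetical illustration only: toy_create_dir_with_ops() is not an
existing securityfs call, and the toy_* types are stand-ins invented here.

```c
#include <stdatomic.h>
#include <stddef.h>

struct toy_ops {
	int (*unlink)(void);
};

struct toy_inode {
	const struct toy_ops *i_op;
};

/* Where other threads would find the object once it is visible. */
static _Atomic(struct toy_inode *) toy_published;

int toy_default_unlink(void) { return -1; }
int toy_custom_unlink(void) { return 0; }

const struct toy_ops toy_default_ops = { .unlink = toy_default_unlink };
const struct toy_ops toy_custom_ops = { .unlink = toy_custom_unlink };

/*
 * Hypothetical creation helper that takes the method table up front:
 * i_op is final before the object becomes reachable by anyone else.
 */
void toy_create_dir_with_ops(struct toy_inode *inode,
			     const struct toy_ops *ops)
{
	inode->i_op = ops ? ops : &toy_default_ops;
	atomic_store_explicit(&toy_published, inode, memory_order_release);
}

/* A reader never observes a half-initialized or later-swapped table. */
const struct toy_ops *toy_lookup_ops(void)
{
	struct toy_inode *inode =
		atomic_load_explicit(&toy_published, memory_order_acquire);

	return inode ? inode->i_op : NULL;
}
```

The point of the design is that no caller ever needs to assign ->i_op
after creation, so there is no window in which a concurrent lookup could
see the old table.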

</rant>

I have fixes for some of that crap done on top of tree-in-dcache series;
give me an hour or two and I'll separate those and rebase to mainline...