linux-kernel - Re: [RFC bpf-next fanotify 5/5] selftests/bpf: Add test for BPF based fanotify fastpath handler

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAPhsuW71hKC1G9jayVqmAMFu+kPOibWC9xqdMj9tLmCcteQuEQ@mail.gmail.com>
Date: Thu, 7 Nov 2024 12:48:16 -0800
From: Song Liu <song@...nel.org>
To: Amir Goldstein <amir73il@...il.com>
Cc: Song Liu <songliubraving@...a.com>, bpf <bpf@...r.kernel.org>, 
	Linux-Fsdevel <linux-fsdevel@...r.kernel.org>, LKML <linux-kernel@...r.kernel.org>, 
	Kernel Team <kernel-team@...a.com>, "andrii@...nel.org" <andrii@...nel.org>, 
	"eddyz87@...il.com" <eddyz87@...il.com>, "ast@...nel.org" <ast@...nel.org>, 
	"daniel@...earbox.net" <daniel@...earbox.net>, "martin.lau@...ux.dev" <martin.lau@...ux.dev>, 
	"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>, "brauner@...nel.org" <brauner@...nel.org>, 
	"jack@...e.cz" <jack@...e.cz>, "kpsingh@...nel.org" <kpsingh@...nel.org>, 
	"mattbobrowski@...gle.com" <mattbobrowski@...gle.com>, "repnop@...gle.com" <repnop@...gle.com>, 
	"jlayton@...nel.org" <jlayton@...nel.org>, Josef Bacik <josef@...icpanda.com>
Subject: Re: [RFC bpf-next fanotify 5/5] selftests/bpf: Add test for BPF based
 fanotify fastpath handler

On Thu, Nov 7, 2024 at 12:34 PM Amir Goldstein <amir73il@...il.com> wrote:
>
> On Thu, Nov 7, 2024 at 8:53 PM Song Liu <songliubraving@...a.com> wrote:
> >
> >
> >
> > > On Nov 7, 2024, at 3:10 AM, Amir Goldstein <amir73il@...il.com> wrote:
> > >
> > > On Wed, Oct 30, 2024 at 12:13 AM Song Liu <song@...nel.org> wrote:
> > >>
> > >> This test shows a simplified logic that monitors a subtree. This is
> > >> simplified as it doesn't handle all the scenarios, such as:
> > >>
> > >>  1) moving a subsubtree into/outof the being monitoring subtree;
> > >
> > > There is a solution for that (see below)
> > >
> > >>  2) mount point inside the being monitored subtree
> > >
> > > For that we will need to add the MOUNT/UNMOUNT/MOVE_MOUNT events,
> > > but those have been requested by userspace anyway.
> > >
> > >>
> > >> Therefore, this is not to show a way to reliably monitor a subtree.
> > >> Instead, this is to test the functionalities of bpf based fastpath.
> > >> To really monitor a subtree reliably, we will need more complex logic.
> > >
> > > Actually, this example is the foundation of my vision for efficient and race
> > > free subtree filtering:
> > >
> > > 1. The inode map is to be treated as a cache for the is_subdir() query
> >
> > Using is_subdir() as the truth and managing the cache in inode map seems
> > promising to me.
> >
> > > 2. Cache entries can also have a "distance from root" (i.e. depth) value
> > > 3. Each unknown queried path can call is_subdir() and populate the cache
> > >    entries for all ancestors
> > > 4. The cache/map size should be limited and when limit is reached,
> > >    evicting entries by depth priority makes sense
> > > 5. A rename event for a directory whose inode is in the map and whose
> > >   new parent is not in the map or has a different value than old parent
> > >   needs to invalidate the entire map
> > > 6. fast_path also needs a hook from inode evict to clear cache entries
> >
> > The inode map is physically attached to the inode itself. So the evict
> > event is automatically handled. IOW, an inode's entry in the inode map
> > is automatically removed when the inode is freed. For the same reason,
> > we don't need to set a limit in map size and add evicting logic. Of
> > course, this works based on the assumption that we don't use too much
> > memory for each inode. I think this assumption is true.
>
> Oh no, it is definitely wrong each inode is around 1K and this was the
> main incentive to implement FAN_MARK_EVICTABLE and
> FAN_MARK_FILESYSTEK in the first place, because recursive
> pinning of all inodes in a large tree is not scalable to large trees.

The inode map does not pin the inode in memory. The inode will
still be evicted as normal. The entry associated with the being
evicted inode in the map will be freed together as the inoide is freed.
The overhead I mentioned was the extra few bytes (for the flag, etc.)
per inode. When an evicted inode is loaded again, we will use
is_subdir() to recreate the entry in the inode map.

Does this make sense and address the concern?

Thanks,
Song