Message-ID: <277F219D-75B1-4D9D-A2FF-50FF90BAA0A2@fb.com>
Date: Tue, 19 Nov 2024 22:35:04 +0000
From: Song Liu <songliubraving@...a.com>
To: Casey Schaufler <casey@...aufler-ca.com>
CC: "Dr. Greg" <greg@...ellic.com>, Song Liu <songliubraving@...a.com>,
James
Bottomley <James.Bottomley@...senPartnership.com>,
"jack@...e.cz"
<jack@...e.cz>,
"brauner@...nel.org" <brauner@...nel.org>, Song Liu
<song@...nel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-security-module@...r.kernel.org"
<linux-security-module@...r.kernel.org>,
Kernel Team <kernel-team@...a.com>,
"andrii@...nel.org" <andrii@...nel.org>,
"eddyz87@...il.com"
<eddyz87@...il.com>,
"ast@...nel.org" <ast@...nel.org>,
"daniel@...earbox.net" <daniel@...earbox.net>,
"martin.lau@...ux.dev"
<martin.lau@...ux.dev>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"kpsingh@...nel.org" <kpsingh@...nel.org>,
"mattbobrowski@...gle.com"
<mattbobrowski@...gle.com>,
"amir73il@...il.com" <amir73il@...il.com>,
"repnop@...gle.com" <repnop@...gle.com>,
"jlayton@...nel.org"
<jlayton@...nel.org>,
Josef Bacik <josef@...icpanda.com>,
"mic@...ikod.net"
<mic@...ikod.net>,
"gnoack@...gle.com" <gnoack@...gle.com>
Subject: Re: [PATCH bpf-next 0/4] Make inode storage available to tracing prog
> On Nov 19, 2024, at 10:14 AM, Casey Schaufler <casey@...aufler-ca.com> wrote:
>
> On 11/19/2024 4:27 AM, Dr. Greg wrote:
>> On Sun, Nov 17, 2024 at 10:59:18PM +0000, Song Liu wrote:
>>
>>> Hi Christian, James and Jan,
>> Good morning, I hope the day is starting well for everyone.
>>
>>>> On Nov 14, 2024, at 1:49 PM, James Bottomley <James.Bottomley@...senPartnership.com> wrote:
>>> [...]
>>>
>>>>> We can address this with something like the following:
>>>>>
>>>>> #ifdef CONFIG_SECURITY
>>>>> void *i_security;
>>>>> #elif defined(CONFIG_BPF_SYSCALL)
>>>>> struct bpf_local_storage __rcu *i_bpf_storage;
>>>>> #endif
>>>>>
>>>>> This will help catch all misuse of i_bpf_storage at compile
>>>>> time, as i_bpf_storage doesn't exist with CONFIG_SECURITY=y.
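>>>>>
>>>>> For example, all access could go through a tiny helper like this
>>>>> (a rough sketch; inode_bpf_storage_ptr() is a made-up name, not
>>>>> something in the tree):
>>>>>
>>>>> /* Callers never touch the field directly, so a CONFIG_SECURITY=y
>>>>>  * build cannot accidentally reference i_bpf_storage. */
>>>>> static inline struct bpf_local_storage __rcu **
>>>>> inode_bpf_storage_ptr(struct inode *inode)
>>>>> {
>>>>> #if !defined(CONFIG_SECURITY) && defined(CONFIG_BPF_SYSCALL)
>>>>> 	return &inode->i_bpf_storage;
>>>>> #else
>>>>> 	return NULL;	/* storage lives behind i_security instead */
>>>>> #endif
>>>>> }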
>>>>>
>>>>> Does this make sense?
>>>> Got to say I'm with Casey here; this will generate horrible and
>>>> failure-prone code.
>>>>
>>>> Since effectively you're making i_security always present anyway,
>>>> simply do that and also pull the allocation code out of security.c in a
>>>> way that it's always available? That way you don't have to special
>>>> case the code depending on whether CONFIG_SECURITY is defined.
>>>> Effectively this would give everyone a generic way to attach some
>>>> memory area to an inode. I know it's more complex than this because
>>>> there are LSM hooks that run from security_inode_alloc() but if you can
>>>> make it work generically, I'm sure everyone will benefit.
>>> On second thought, I think making i_security generic is not
>>> the right solution for "BPF inode storage in tracing use cases".
>>>
>>> This is because i_security serves a very specific use case: it
>>> points to a piece of memory whose size is calculated at system
>>> boot time. If some of the supported LSMs are not enabled by the
>>> lsm= kernel arg, the kernel will not allocate memory in
>>> i_security for them. The only way to change lsm= is to reboot
>>> the system. BPF LSM programs can be disabled at the boot time,
>>> which fits well in i_security. However, BPF tracing programs
>>> cannot be disabled at boot time (even if we change the code to
>>> make it possible, we are not likely to disable BPF tracing).
>>> IOW, as long as CONFIG_BPF_SYSCALL is enabled, we expect some
>>> BPF tracing programs to load at some point in time, and these
>>> programs may use BPF inode storage.
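>>>
>>> (For example, a system booted with lsm=smack,bpf only sizes
>>> i_security for those LSMs, and enabling SELinux as well would need
>>> a reboot; BPF tracing, by contrast, is available regardless.)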
>>>
>>> Therefore, with CONFIG_BPF_SYSCALL enabled, some extra memory
>>> will always be attached to i_security (maybe under a different
>>> name, say, i_generic) of every inode. In this case, we should
>>> really add i_bpf_storage directly to the inode, because another
>>> pointer jump via i_generic gives nothing but overhead.
>>>
>>> Does this make sense? Or did I misunderstand the suggestion?
>> There is a colloquialism that seems relevant here: "Pick your poison".
>>
>> In the greater interests of the kernel, it seems that a generic
>> mechanism for attaching per-inode information is the only realistic
>> path forward, unless Christian changes his position on expanding
>> the size of struct inode.
>>
>> There are two pathways forward.
>>
>> 1.) Attach a constant size 'blob' of storage to each inode.
>>
>> This is a similar approach to what the LSM uses where each blob is
>> sized as follows:
>>
>> S = U * sizeof(void *)
>>
>> Where U is the number of sub-systems that have a desire to use
>> inode-specific storage.
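>>
>> In other words, roughly (an illustrative sketch, not existing code):
>>
>> /* Each sub-system reserves one slot at init; every inode's blob is
>>  * then S = U * sizeof(void *) bytes. */
>> static unsigned int inode_blob_users;	/* U, frozen before first alloc */
>>
>> unsigned int inode_blob_reserve_slot(void)
>> {
>> 	return inode_blob_users++;	/* this sub-system's slot index */
>> }
>>
>> void **inode_blob_alloc(gfp_t gfp)
>> {
>> 	return kcalloc(inode_blob_users, sizeof(void *), gfp);
>> }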
>
> I can't tell for sure, but it looks like you don't understand how
> LSM i_security blobs are used. It is *not* the case that each LSM
> gets a pointer in the i_security blob. Each LSM that wants storage
> tells the infrastructure at initialization time how much space it
> wants in the blob. That can be a pointer, but usually it's a struct
> with flags, pointers and even lists.
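>
> Simplified, the existing pattern is closer to this (loosely modeled
> on the real lsm_blob_sizes machinery; the struct and helper names
> here are illustrative):
>
> /* An LSM declares bytes, not a pointer slot; at boot the
>  * infrastructure rewrites lbs_inode into that LSM's offset within
>  * the shared i_security blob. */
> struct my_inode_data {
> 	u32			flags;
> 	struct list_head	list;
> };
>
> static struct lsm_blob_sizes my_blob_sizes __ro_after_init = {
> 	.lbs_inode = sizeof(struct my_inode_data),
> };
>
> static inline struct my_inode_data *my_inode(const struct inode *inode)
> {
> 	return inode->i_security + my_blob_sizes.lbs_inode;
> }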
>
>> Each sub-system uses its pointer slot to manage any additional
>> storage that it desires to attach to the inode.
>
> Again, an LSM may choose to do it that way, but most don't.
> SELinux and Smack need data on every inode. It makes much more sense
> to put it directly in the blob than to allocate a separate chunk
> for every inode.

AFAICT, i_security is unique in that its size is calculated at
boot time. I guess we will just keep most LSM users behind it.

>
>> This has the obvious advantage of O(1) access cost for any
>> sub-system that wants to reach its inode-specific storage.
>>
>> The disadvantage, as you note, is that it wastes memory if a
>> sub-system does not elect to attach per-inode information, for example
>> the tracing infrastructure.
>
> To be clear, that disadvantage only comes up if the sub-system uses
> inode data on an occasional basis. If it never uses inode data there
> is no need to have a pointer to it.
>
>> This disadvantage is parried by the fact that it reduces the size of
>> the inode proper by 24 bytes (4 pointers down to 1) and allows future
>> extensibility without colliding with the interests and desires of the
>> VFS maintainers.
>
> You're adding a level of indirection. Even I would object based on
> the performance impact.
>
>> 2.) Implement key/value mapping for inode specific storage.
>>
>> The key would be a sub-system-specific numeric value that returns a
>> pointer the sub-system uses to manage its inode-specific memory for a
>> particular inode.
>>
>> A participating sub-system in turn uses its identifier to register an
>> inode specific pointer for its sub-system.
>>
>> This strategy loses O(1) lookup complexity but reduces total memory
>> consumption and only imposes memory costs for inodes when a sub-system
>> desires to use inode specific storage.
>
> SELinux and Smack use an inode blob for every inode. The performance
> regression boggles the mind. Not to mention the additional complexity
> of managing the memory.
>
>> Approach 2 requires the introduction of generic infrastructure that
>> allows an inode's key/value mappings to be located, presumably based
>> on the inode's pointer value. We could probably just resurrect the
>> old IMA iint code for this purpose.
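>>
>> As a strawman (invented names; an xarray keyed by the inode pointer
>> is chosen purely for illustration), the lookup side might be no more
>> than:
>>
>> struct inode_kv_entry {
>> 	void *data[INODE_KV_MAX_USERS];	/* one slot per registered user */
>> };
>>
>> static DEFINE_XARRAY(inode_kv);
>>
>> void *inode_kv_get(struct inode *inode, unsigned int id)
>> {
>> 	struct inode_kv_entry *e;
>>
>> 	e = xa_load(&inode_kv, (unsigned long)inode);
>> 	return e ? e->data[id] : NULL;
>> }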
>>
>> In the end it comes down to a rather standard trade-off in this
>> business, memory vs. execution cost.
>>
>> We would posit that option 2 is the only viable scheme if the design
>> metric is the overall good of the Linux kernel eco-system.
>
> No. Really, no. You need look no further than secmarks to understand
> how a key based blob allocation scheme leads to tears. Keys are fine
> in the case where use of data is sparse. They have no place when data
> use is the norm.

OTOH, I think some on-demand key-value storage makes sense for many
other use cases, such as BPF (LSM and tracing), file locks, fanotify,
etc.

Overall, I think we have 3 types of storage attached to an inode:

1. Embedded in struct inode, gated by CONFIG_*.
2. Behind i_security (or maybe call it a different name if we
   can find other uses for it). The size is calculated at boot
   time.
3. Behind a key-value storage.

To evaluate these categories, we have:

  Speed:       1 > 2 > 3
  Flexibility: 3 > 2 > 1

We don't really have 3 right now. I think the direction is to
create it. BPF inode storage is a key-value store. If we can
get another user for 3, in addition to BPF inode storage, it
should be a net win.
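
For 3, the API could be as small as something like this (just a
strawman; none of these helpers exist today):

/* On-demand, shared inode key-value storage; each user registers
 * for an id and pays for memory only on inodes it actually tags. */
int inode_storage_set(struct inode *inode, u32 user_id, void *data);
void *inode_storage_get(struct inode *inode, u32 user_id);
void inode_storage_clear(struct inode *inode);	/* on inode eviction */
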
Does this sound like a viable path forward?

Thanks,
Song