Message-ID:
 <AM6PR03MB508053DF89CDFEB95CBEB20C99E32@AM6PR03MB5080.eurprd03.prod.outlook.com>
Date: Fri, 24 Jan 2025 22:44:46 +0000
From: Juntong Deng <juntong.deng@...look.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
 Daniel Borkmann <daniel@...earbox.net>,
 John Fastabend <john.fastabend@...il.com>,
 Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau
 <martin.lau@...ux.dev>, Eddy Z <eddyz87@...il.com>,
 Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>,
 KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>,
 Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
 Kumar Kartikeya Dwivedi <memxor@...il.com>, Tejun Heo <tj@...nel.org>,
 David Vernet <void@...ifault.com>, bpf <bpf@...r.kernel.org>,
 LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities

On 2025/1/24 04:52, Alexei Starovoitov wrote:
> On Thu, Jan 16, 2025 at 11:47 AM Juntong Deng <juntong.deng@...look.com> wrote:
>>
>> This patch modifies SCX to use BPF capabilities.
>>
>> Make all SCX kfuncs register to BPF capabilities instead of
>> BPF_PROG_TYPE_STRUCT_OPS.
>>
>> Add bpf_scx_bpf_capabilities_adjust as bpf_capabilities_adjust
>> callback function.
>>
>> Signed-off-by: Juntong Deng <juntong.deng@...look.com>
>> ---
>>   kernel/sched/ext.c | 74 ++++++++++++++++++++++++++++++++++++++--------
>>   1 file changed, 62 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 7fff1d045477..53cc7c3ed80b 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -5765,10 +5765,66 @@ bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>>          }
>>   }
> 
> 'capabilities' name doesn't fit.
> The word already has its meaning in the kernel.
> It cannot be reused for a different purpose.
> 
>> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
>> +                                          u32 context_info, bool enter)
>> +{
>> +       if (enter) {
>> +               switch (context_info) {
>> +               case offsetof(struct sched_ext_ops, select_cpu):
>> +                       ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
>> +                       ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> +                       break;
>> +               case offsetof(struct sched_ext_ops, enqueue):
>> +                       ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> +                       break;
>> +               case offsetof(struct sched_ext_ops, dispatch):
>> +                       ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
>> +                       break;
>> +               case offsetof(struct sched_ext_ops, running):
>> +               case offsetof(struct sched_ext_ops, stopping):
>> +               case offsetof(struct sched_ext_ops, enable):
>> +                       ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
>> +                       break;
>> +               case offsetof(struct sched_ext_ops, init):
>> +               case offsetof(struct sched_ext_ops, exit):
>> +                       ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
>> +                       break;
>> +               default:
>> +                       return -EINVAL;
>> +               }
>> +       } else {
>> +               switch (context_info) {
>> +               case offsetof(struct sched_ext_ops, select_cpu):
>> +                       DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
>> +                       DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> +                       break;
>> +               case offsetof(struct sched_ext_ops, enqueue):
>> +                       DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> +                       break;
>> +               case offsetof(struct sched_ext_ops, dispatch):
>> +                       DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
>> +                       break;
>> +               case offsetof(struct sched_ext_ops, running):
>> +               case offsetof(struct sched_ext_ops, stopping):
>> +               case offsetof(struct sched_ext_ops, enable):
>> +                       DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
>> +                       break;
>> +               case offsetof(struct sched_ext_ops, init):
>> +               case offsetof(struct sched_ext_ops, exit):
>> +                       DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
>> +                       break;
>> +               default:
>> +                       return -EINVAL;
>> +               }
>> +       }
>> +       return 0;
>> +}
> 
> and this callback defeats the whole point of u32 bitmask.
> 

Yes, you are right; I agree that a procedural callback defeats the
purpose of BPF capabilities.

> In earlier patch
> env->context_info = __btf_member_bit_offset(t, member) / 8; // moff
> 
> is also wrong.
> The context_info name is too generic and misleading.
> and 'env' isn't a right place to save moff.
> 
> Let's try to implement what was discussed earlier:
> 
> 1
> After successful check_struct_ops_btf_id() save moff in
> prog->aux->attach_st_ops_member_off.
> 
> 2
> Add .filter callback to sched-ext kfunc registration path and
> let it allow/deny kfuncs based on st_ops attach point.
> 
> 3
> Remove scx_kf_allow() and current->scx.kf_mask.
> 
> That will be a nice perf win and will prove that
> this approach works end-to-end.

I am trying, but I found a problem (bug?) when I added test cases
to bpf_testmod.c.

Filters currently do not work with kernel modules.

Filters rely heavily on checks like the following (taking
bpf_fs_kfunc_set_ids as an example):

	if (!btf_id_set8_contains(&bpf_fs_kfunc_set_ids, kfunc_id))

to exclude kfuncs that are not part of their own set
(__btf_kfunc_id_set_contains runs all the filters for each kfunc);
without this check, a filter would false-reject kfuncs that belong
to other sets.
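To illustrate the pattern, here is a minimal userspace sketch (all demo_* names and the id values are hypothetical, not the kernel's actual code): a filter first checks set membership so it never rejects kfuncs owned by other sets, and only enforces its context rules on ids in its own set.

```c
#include <stddef.h>

/* Sorted "BTF ids" this demo filter owns (values are made up). */
static const unsigned int demo_kfunc_ids[] = { 10, 42, 99 };

/* Binary search over a sorted id set, mirroring btf_id_set8_contains()
 * in spirit only. */
static int demo_set_contains(const unsigned int *set, size_t n, unsigned int id)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (set[mid] == id)
			return 1;
		if (set[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Return 0 = allow, nonzero = deny.  A kfunc outside this filter's own
 * set is always allowed (some other set owns it); a kfunc inside the
 * set is denied unless the caller's context permits it. */
static int demo_filter(unsigned int kfunc_id, int ctx_allows)
{
	if (!demo_set_contains(demo_kfunc_ids, 3, kfunc_id))
		return 0;		/* not ours: never false-reject */
	return ctx_allows ? 0 : 1;	/* ours: enforce the context */
}
```

The point of the leading membership test is exactly the false-reject problem above: drop it, and this filter would veto every kfunc it has never heard of.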

But this method cannot be used in kernel modules, because the BTF ids
of all kfuncs there are relocated: btf_populate_kfunc_set relocates
them via btf_relocate_id.

As a result, the kfunc_id passed into the filter is different from the
BTF id recorded in set_ids.
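To make the mismatch concrete, here is a toy sketch. Everything in it is hypothetical: the real btf_relocate_id maps ids through a relocation table, not a flat offset, and the names and numbers below are invented purely for illustration.

```c
/* Hypothetical simplification: pretend relocation just rebases a
 * module-local BTF id past the base (vmlinux) BTF. */
#define DEMO_BASE_BTF_NR_TYPES 100000u	/* made-up vmlinux BTF size */

static unsigned int demo_relocate_id(unsigned int module_local_id)
{
	return module_local_id + DEMO_BASE_BTF_NR_TYPES;
}

/* The module's set was encoded with module-local ids... */
static const unsigned int demo_module_set_ids[] = { 5, 17 };

static int demo_module_set_contains(unsigned int id)
{
	for (unsigned int i = 0; i < 2; i++)
		if (demo_module_set_ids[i] == id)
			return 1;
	return 0;
}
```

...so a membership test against the relocated id fails even though the kfunc really is in the module's set, which is the false reject described above.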

One possible solution is to export btf_relocate_id and
btf_get_module_btf, and let the kernel module do the relocation itself.

But I am not sure exporting them is a good idea.

Do you have any suggestions?


In addition, BTF_KFUNC_FILTER_MAX_CNT is currently 16, which is not
large enough.

If we use filters to enforce per-context restrictions on struct_ops,
then each distinct context needs its own filter, and all filters for
struct_ops users (SCX, HID, TCP congestion control, etc.) land in the
same struct btf_kfunc_hook_filter (the filters array).

It is foreseeable that the 16 slots will be exhausted soon.

Should we change it to a linked list?
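For what I have in mind, a rough userspace sketch (all demo_* names are illustrative; the kernel version would use list_head and the appropriate allocation/locking for the registration path):

```c
#include <stdlib.h>

typedef int (*demo_kfunc_filter_t)(unsigned int kfunc_id);

struct demo_filter_node {
	demo_kfunc_filter_t filter;
	struct demo_filter_node *next;
};

/* Replaces the fixed filters[BTF_KFUNC_FILTER_MAX_CNT] array and its
 * count, so the number of registered filters is unbounded. */
struct demo_hook_filter {
	struct demo_filter_node *head;
};

static int demo_add_filter(struct demo_hook_filter *hf, demo_kfunc_filter_t f)
{
	struct demo_filter_node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->filter = f;
	n->next = hf->head;
	hf->head = n;
	return 0;
}

/* Run every registered filter; any nonzero return denies the kfunc. */
static int demo_run_filters(const struct demo_hook_filter *hf, unsigned int id)
{
	const struct demo_filter_node *n;

	for (n = hf->head; n; n = n->next)
		if (n->filter(id))
			return 1;
	return 0;
}

/* Example filter: denies one made-up id. */
static int demo_deny_42(unsigned int id)
{
	return id == 42;
}
```

The trade-off is an extra pointer chase per filter invocation versus never hitting a hard cap as more struct_ops users register context filters.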
