Message-ID:
<AM6PR03MB508053DF89CDFEB95CBEB20C99E32@AM6PR03MB5080.eurprd03.prod.outlook.com>
Date: Fri, 24 Jan 2025 22:44:46 +0000
From: Juntong Deng <juntong.deng@...look.com>
To: Alexei Starovoitov <alexei.starovoitov@...il.com>
Cc: Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
John Fastabend <john.fastabend@...il.com>,
Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau
<martin.lau@...ux.dev>, Eddy Z <eddyz87@...il.com>,
Song Liu <song@...nel.org>, Yonghong Song <yonghong.song@...ux.dev>,
KP Singh <kpsingh@...nel.org>, Stanislav Fomichev <sdf@...ichev.me>,
Hao Luo <haoluo@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
Kumar Kartikeya Dwivedi <memxor@...il.com>, Tejun Heo <tj@...nel.org>,
David Vernet <void@...ifault.com>, bpf <bpf@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH bpf-next 6/7] sched_ext: Make SCX use BPF capabilities
On 2025/1/24 04:52, Alexei Starovoitov wrote:
> On Thu, Jan 16, 2025 at 11:47 AM Juntong Deng <juntong.deng@...look.com> wrote:
>>
>> This patch modifies SCX to use BPF capabilities.
>>
>> Make all SCX kfuncs register to BPF capabilities instead of
>> BPF_PROG_TYPE_STRUCT_OPS.
>>
>> Add bpf_scx_bpf_capabilities_adjust as bpf_capabilities_adjust
>> callback function.
>>
>> Signed-off-by: Juntong Deng <juntong.deng@...look.com>
>> ---
>> kernel/sched/ext.c | 74 ++++++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 62 insertions(+), 12 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 7fff1d045477..53cc7c3ed80b 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -5765,10 +5765,66 @@ bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> }
>> }
>
> 'capabilities' name doesn't fit.
> The word already has its meaning in the kernel.
> It cannot be reused for a different purpose.
>
>> +static int bpf_scx_bpf_capabilities_adjust(unsigned long *bpf_capabilities,
>> + u32 context_info, bool enter)
>> +{
>> + if (enter) {
>> + switch (context_info) {
>> + case offsetof(struct sched_ext_ops, select_cpu):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
>> + case offsetof(struct sched_ext_ops, enqueue):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
>> + case offsetof(struct sched_ext_ops, dispatch):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
>> + break;
>> + case offsetof(struct sched_ext_ops, running):
>> + case offsetof(struct sched_ext_ops, stopping):
>> + case offsetof(struct sched_ext_ops, enable):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
>> + break;
>> + case offsetof(struct sched_ext_ops, init):
>> + case offsetof(struct sched_ext_ops, exit):
>> + ENABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> + } else {
>> + switch (context_info) {
>> + case offsetof(struct sched_ext_ops, select_cpu):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_SELECT_CPU);
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
>> + case offsetof(struct sched_ext_ops, enqueue):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_ENQUEUE);
>> + break;
>> + case offsetof(struct sched_ext_ops, dispatch):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_DISPATCH);
>> + break;
>> + case offsetof(struct sched_ext_ops, running):
>> + case offsetof(struct sched_ext_ops, stopping):
>> + case offsetof(struct sched_ext_ops, enable):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_REST);
>> + break;
>> + case offsetof(struct sched_ext_ops, init):
>> + case offsetof(struct sched_ext_ops, exit):
>> + DISABLE_BPF_CAPABILITY(bpf_capabilities, BPF_CAP_SCX_KF_UNLOCKED);
>> + break;
>> + default:
>> + return -EINVAL;
>> + }
>> + }
>> + return 0;
>> +}
>
> and this callback defeats the whole point of u32 bitmask.
>
Yes, you are right. I agree that procedural callbacks defeat the
purpose of BPF capabilities.
> In earlier patch
> env->context_info = __btf_member_bit_offset(t, member) / 8; // moff
>
> is also wrong.
> The context_info name is too generic and misleading.
> and 'env' isn't a right place to save moff.
>
> Let's try to implement what was discussed earlier:
>
> 1
> After successful check_struct_ops_btf_id() save moff in
> prog->aux->attach_st_ops_member_off.
>
> 2
> Add .filter callback to sched-ext kfunc registration path and
> let it allow/deny kfuncs based on st_ops attach point.
>
> 3
> Remove scx_kf_allow() and current->scx.kf_mask.
>
> That will be a nice perf win and will prove that
> this approach works end-to-end.
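I am trying to implement this. The .filter callback I have in mind
looks roughly like the sketch below. scx_kfunc_ids_all is a
hypothetical union of all sched-ext kfunc sets, and
attach_st_ops_member_off is the new field from step 1; the per-context
sets (scx_kfunc_ids_dispatch, scx_kfunc_ids_unlocked, ...) follow the
existing naming in ext.c, so this is only a sketch, not final code:

static int scx_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
{
	u32 moff = prog->aux->attach_st_ops_member_off;

	/* Not a sched-ext kfunc: let other sets/filters decide */
	if (!btf_id_set8_contains(&scx_kfunc_ids_all, kfunc_id))
		return 0;

	switch (moff) {
	case offsetof(struct sched_ext_ops, dispatch):
		if (btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id))
			return 0;
		break;
	case offsetof(struct sched_ext_ops, init):
	case offsetof(struct sched_ext_ops, exit):
		if (btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id))
			return 0;
		break;
	/* ... remaining callbacks elided ... */
	}
	return -EACCES;
}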
But I found a problem (a bug?) when adding test cases to
bpf_testmod.c.
Filters currently do not work for kernel modules.

Since __btf_kfunc_id_set_contains() runs every registered filter for
each kfunc, a filter must exclude kfuncs that are not part of its own
set (bpf_fs_kfunc_set_ids is an example), otherwise it causes false
rejections.
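For reference, the fs filter follows this pattern (quoting from
memory, so details may differ slightly from fs/bpf_fs_kfuncs.c):

static int bpf_fs_kfuncs_filter(const struct bpf_prog *prog, u32 kfunc_id)
{
	/* Bail out for kfuncs outside this set so that kfuncs of
	 * other sets are not falsely rejected by this filter.
	 */
	if (!btf_id_set8_contains(&bpf_fs_kfunc_set_ids, kfunc_id) ||
	    prog->type == BPF_PROG_TYPE_LSM)
		return 0;
	return -EACCES;
}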
But this check cannot work in kernel modules, because the BTF ids of
all module kfuncs are relocated by btf_relocate_id() in
btf_populate_kfunc_set(). As a result, the kfunc_id passed into the
filter differs from the BTF ids stored in the module's set_ids.
One possible solution is to export btf_relocate_id() and
btf_get_module_btf(), and let the kernel module do the relocation
itself, along the lines of the sketch below. But I am not sure
exporting them is a good idea.
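Rough sketch only; both helpers are currently static in
kernel/bpf/btf.c, and I am assuming signatures like
btf_get_module_btf(struct module *) and btf_relocate_id(struct btf *,
u32 *):

static u32 bpf_testmod_relocated_id(u32 id)
{
	struct btf *btf = btf_get_module_btf(THIS_MODULE); /* assumed export */

	/* Translate the module-local BTF id into the relocated id
	 * that the verifier later passes to the filter.
	 */
	btf_relocate_id(btf, &id); /* assumed export */
	btf_put(btf);
	return id;
}

The module filter could then compare kfunc_id against the relocated
ids instead of the raw ids in its set.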
Do you have any suggestions?
In addition, BTF_KFUNC_FILTER_MAX_CNT is currently 16, which is not
large enough.

If we use filters to enforce restrictions on struct_ops for different
contexts, then each different context needs its own filter, and the
filters of all struct_ops users (SCX, HID, TCP congestion control,
etc.) end up in the same struct btf_kfunc_hook_filter (the filters
array). It is foreseeable that the 16 slots will be exhausted soon.

Should we change it to a linked list? Something like the sketch below.
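A minimal sketch of what I mean, with a made-up node type replacing
the fixed filters[] array:

struct btf_kfunc_filter_node {
	btf_kfunc_filter_t filter;
	struct list_head node;
};

struct btf_kfunc_hook_filter {
	struct list_head filters; /* btf_kfunc_filter_node entries */
};

That way the number of filters per hook is no longer bounded by
BTF_KFUNC_FILTER_MAX_CNT.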