linux-kernel - Re: [PATCH v3] cgroup/bpf: fast path skb BPF filtering

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <92f69969-42dc-204a-4138-16fdaaebb78d@gmail.com>
Date:   Wed, 15 Dec 2021 19:55:51 +0000
From:   Pavel Begunkov <asml.silence@...il.com>
To:     Stanislav Fomichev <sdf@...gle.com>
Cc:     netdev@...r.kernel.org, bpf@...r.kernel.org,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] cgroup/bpf: fast path skb BPF filtering

On 12/15/21 19:15, Stanislav Fomichev wrote:
> On Wed, Dec 15, 2021 at 10:54 AM Pavel Begunkov <asml.silence@...il.com> wrote:
>>
>> On 12/15/21 18:24, sdf@...gle.com wrote:
>>> On 12/15, Pavel Begunkov wrote:
>>>> On 12/15/21 17:33, sdf@...gle.com wrote:
>>>>> On 12/15, Pavel Begunkov wrote:
>>>>>> On 12/15/21 16:51, sdf@...gle.com wrote:
>>>>>>> On 12/15, Pavel Begunkov wrote:
>>>>>>>> � /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
>>>>>>>> � #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)����������������� \
>>>>>>>> � ({����������������������������������������� \
>>>>>>>> ����� int __ret = 0;��������������������������������� \
>>>>>>>> -��� if (cgroup_bpf_enabled(CGROUP_INET_INGRESS))������������� \
>>>>>>>> +��� if (cgroup_bpf_enabled(CGROUP_INET_INGRESS) && sk &&������������� \
>>>>>>>> +������� CGROUP_BPF_TYPE_ENABLED((sk), CGROUP_INET_INGRESS))���������� \
>>>>>>>
>>>>>>> Why not add this __cgroup_bpf_run_filter_skb check to
>>>>>>> __cgroup_bpf_run_filter_skb? Result of sock_cgroup_ptr() is already there
>>>>>>> and you can use it. Maybe move the things around if you want
>>>>>>> it to happen earlier.
>>>>>
>>>>>> For inlining. Just wanted to get it done right, otherwise I'll likely be
>>>>>> returning to it back in a few months complaining that I see measurable
>>>>>> overhead from the function call :)
>>>>>
>>>>> Do you expect that direct call to bring any visible overhead?
>>>>> Would be nice to compare that inlined case vs
>>>>> __cgroup_bpf_prog_array_is_empty inside of __cgroup_bpf_run_filter_skb
>>>>> while you're at it (plus move offset initialization down?).
>>>
>>>> Sorry but that would be waste of time. I naively hope it will be visible
>>>> with net at some moment (if not already), that's how it was with io_uring,
>>>> that's what I see in the block layer. And in anyway, if just one inlined
>>>> won't make a difference, then 10 will.
>>>
>>> I can probably do more experiments on my side once your patch is
>>> accepted. I'm mostly concerned with getsockopt(TCP_ZEROCOPY_RECEIVE).
>>> If you claim there is visible overhead for a direct call then there
>>> should be visible benefit to using CGROUP_BPF_TYPE_ENABLED there as
>>> well.
>>
>> Interesting, sounds getsockopt might be performance sensitive to
>> someone.
>>
>> FWIW, I forgot to mention that for testing tx I'm using io_uring
>> (for both zc and not) with good submission batching.
> 
> Yeah, last time I saw 2-3% as well, but it was due to kmalloc, see
> more details in 9cacf81f8161, it was pretty visible under perf.
> That's why I'm a bit skeptical of your claims of direct calls being
> somehow visible in these 2-3% (even skb pulls/pushes are not 2-3%?).

migrate_disable/enable together were taking somewhat in-between
1% and 1.5% in profiling, don't remember the exact number. The rest
should be from rcu_read_lock/unlock() in BPF_PROG_RUN_ARRAY_CG_FLAGS()
and other extra bits on the way.

I'm skeptical I'll be able to measure inlining one function,
variability between boots/runs is usually greater and would hide it.

> But tbf I don't understand how it all plays out with the io_uring.

1 syscall per N requests (N=32 IIRC), 1 fdget() per N, no payload
page referencing (for zc), and so on


> (mostly trying to understand where there is some gain left on the
> table for TCP_ZEROCOPY_RECEIVE).

-- 
Pavel Begunkov