netdev - Re: [PATCH net] veth: Fix RX stats for bpf_redirect

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <6d5cb0ef-fabc-7ca3-94b2-5b1925a6805f@iogearbox.net>
Date: Tue, 31 Oct 2023 23:20:23 +0100
From: Daniel Borkmann <daniel@...earbox.net>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Peilin Ye <yepeilin.cs@...il.com>, "David S. Miller"
 <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Paolo Abeni <pabeni@...hat.com>, Alexei Starovoitov <ast@...nel.org>,
 Andrii Nakryiko <andrii@...nel.org>, Martin KaFai Lau
 <martin.lau@...ux.dev>, Song Liu <song@...nel.org>,
 Yonghong Song <yonghong.song@...ux.dev>,
 John Fastabend <john.fastabend@...il.com>, KP Singh <kpsingh@...nel.org>,
 Stanislav Fomichev <sdf@...gle.com>, Hao Luo <haoluo@...gle.com>,
 Jiri Olsa <jolsa@...nel.org>, Jesper Dangaard Brouer <hawk@...nel.org>,
 Peilin Ye <peilin.ye@...edance.com>, netdev@...r.kernel.org,
 bpf@...r.kernel.org, linux-kernel@...r.kernel.org,
 Cong Wang <cong.wang@...edance.com>, Jiang Wang <jiang.wang@...edance.com>,
 Youlun Zhang <zhangyoulun@...edance.com>
Subject: Re: [PATCH net] veth: Fix RX stats for bpf_redirect_peer() traffic

On 10/31/23 8:53 PM, Jakub Kicinski wrote:
> On Mon, 30 Oct 2023 15:19:26 +0100 Daniel Borkmann wrote:
>>> Since I didn't want to update host-veth's TX counters.  If we
>>> bpf_redirect_peer()ed a packet from NIC TC ingress to Pod-veth TC ingress,
>>> I think it means we've bypassed host-veth TX?
>>
>> Yes. So the idea is to transition to tstats replace the location where
>> we used to bump lstats with tstat's tx counter, and only the peer redirect
>> would bump the rx counter.. then upon stats traversal we fold the latter into
>> the rx stats which was populated by the opposite's tx counters. Makes sense.
>>
>> OT: does cadvisor run inside the Pod to collect the device stats? Just
>> curious how it gathers them.
> 
> Somewhat related - where does netkit count stats?

Yeap, it needs it as well, I have a local branch here where I pushed all
of it - coming out soon; I was planning to add some selftests in addition
till end of this week:

https://github.com/cilium/linux/commits/pr/ndo_peer

>>>> Definitely no new stats ndo resp indirect call in fast path.
>>>
>>> Yeah, I think I'll put a comment saying that all devices that support
>>> BPF_F_PEER must use tstats (or must use lstats), then.
>>
>> sgtm.
> 
> Is comment good enough? Can we try to do something more robust?
> Move the allocation of stats into the core at registration based
> on some u8 assigned in the driver? (I haven't looked at the code TBH)
Hm, not sure. One thing that comes to mind is lazy one-time allocation
like in case of netdev_core_stats_inc(), so whenever one of the helpers
like dev_sw_netstats_{rx,tx}_add() are called and dev->tstats are still
NULL, the core knows about the driver's intent, but tbh that doesn't
feel overly clean and in case of netdev_core_stats_inc() it's more in
the exception case rather than fast-path.

Other option could be to have two small helpers in the core which then
set a flag as well:

static inline int netdev_tstats_alloc(struct net_device *dev)
{
	dev->tstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
	if (!dev->tstats)
		return -ENOMEM;
	dev->priv_flags |= IFF_USES_TSTATS;
	return 0;
}

static inline void netdev_tstats_free(struct net_device *dev)
{
	free_percpu(dev->tstats);
}

They can then be used from .ndo_init/uninit - not sure if this would
be overall nicer.. or just leaving it at the .ndo callback comment for
the time being until really more users show up (which I doubt tbh).

Thanks,
Daniel