Message-ID: <64b7e2a81b9b0_267b6729485@willemb.c.googlers.com.notmuch>
Date: Wed, 19 Jul 2023 09:18:32 -0400
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: David Ahern <dsahern@...nel.org>,
Yan Zhai <yan@...udflare.com>,
Jakub Kicinski <kuba@...nel.org>
Cc: Ivan Babrou <ivan@...udflare.com>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
kernel-team <kernel-team@...udflare.com>,
Eric Dumazet <edumazet@...gle.com>,
"David S. Miller" <davem@...emloft.net>,
Paolo Abeni <pabeni@...hat.com>,
Steven Rostedt <rostedt@...dmis.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Willem de Bruijn <willemdebruijn.kernel@...il.com>
Subject: Re: Stacks leading into skb:kfree_skb
David Ahern wrote:
> On 7/18/23 9:10 PM, Yan Zhai wrote:
> > On Tue, Jul 18, 2023 at 5:36 PM Jakub Kicinski <kuba@...nel.org> wrote:
> >>
> >> On Fri, 14 Jul 2023 18:54:14 -0600 David Ahern wrote:
> >>>> I made some aggregations for the stacks we see leading into
> >>>> skb:kfree_skb endpoint. There's a lot of data that is not easily
> >>>> digestible, so I lightly massaged the data and added flamegraphs in
> >>>> addition to raw stack counts. Here's the gist link:
> >>>>
> >>>> * https://gist.github.com/bobrik/0e57671c732d9b13ac49fed85a2b2290
> >>>
> >>> I see a lot of packet_rcv as the tip before kfree_skb. How many packet
> >>> sockets do you have running on that box? Can you accumulate the total
> >>> packet_rcv -> kfree_skb_reasons into 1 count -- regardless of remaining
> >>> stacktrace?
> >>
> >> On a quick look we have 3 branches which can get us to kfree_skb from
> >> packet_rcv:
> >>
> >> 	if (skb->pkt_type == PACKET_LOOPBACK)
> >> 		goto drop;
> >> 	...
> >> 	if (!net_eq(dev_net(dev), sock_net(sk)))
> >> 		goto drop;
> >> 	...
> >> 	res = run_filter(skb, sk, snaplen);
> >> 	if (!res)
> >> 		goto drop_n_restore;
> >>
> >> I'd guess it's the last one? Which we should mark with the SOCKET_FILTER
> >> drop reason?
> >
> > So we have multiple packet socket consumers on our edge:
> > * systemd-networkd: listens on ETH_P_LLDPD, which is the role model
> > that does not do excessive things
>
> ETH level means raw packet socket which means *all* packets are duplicated.
>
> > * lldpd: I am not sure why we needed this one in the presence of
> > systemd-networkd, but it is running atm, which contributes to constant
> > packet_rcv calls. It listens on ETH_P_ALL because of
> > https://github.com/lldpd/lldpd/pull/414. But its filter is doing the
> > correct work, so packets hitting this one are mostly "consumed"
>
> This one I am familiar with and its filter -- the fact that the filter
> applies *after* the clone means it still contributes to the packet load.
>
> Together these 2 sockets might explain why the filter drop shows up in
> packet_rcv.
>
> >
> > Now the bad kids:
> > * arping: listens on ETH_P_ALL. This one contributes all of the
> > skb:kfree_skb spikes, and the reason is that sk_rmem_alloc overflows
> > rcvbuf. I suspect it is due to a poorly constructed filter, so too many
> > packets get queued too fast.
>
> Any packet socket is the problem because the filter is applied to the
> clone. Clone the packet, run the filter, kfree the packet.
Small clarification: on receive in __netif_receive_skb_core, the skb
is only cloned if accepted by packet_rcv. deliver_skb increases
skb->users to ensure that the skb is not freed if a filter declines.
On transmit, dev_queue_xmit_nit does create an initial clone, but
then passes that one clone to all sockets, again using deliver_skb.
A packet socket whose filter accepts the skb is worse, then, as that
clones the initial shared skb.
>
> > * conduit-watcher: a health checker, sending packets on ETH_P_IP in
> > non-init netns. The majority of packet_rcv calls on this one go to a
> > direct drop due to the netns difference.
>
> So this is the raw packet socket at L3 that shows up. This one should not
> be as large of a contributor to the increased packet load.
>
> >
> > So to conclude, it might be useful to set a reason for rcvbuf-related
> > drops at least. On the other hand, almost all packets entering
> > packet_rcv are shared, so clone failure probably can also be a thing
> > under memory pressure.
kfree_skb is being converted across the stack into kfree_skb_reason.
Doing the same for PF_PACKET sounds entirely reasonable to me.
Just be careful about false positives where only the filter does not
match and the shared skb merely has its reference dropped. That is
working as intended (WAI) and not cause for a report.
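A rough sketch of what that could look like for the filter miss in
packet_rcv, reusing the existing SKB_DROP_REASON_SOCKET_FILTER reason
(the drop_reason local and the exact label plumbing here are only
illustrative, not an actual patch):

	enum skb_drop_reason drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
	...
	res = run_filter(skb, sk, snaplen);
	if (!res) {
		drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
		goto drop_n_restore;
	}
	...
drop:
	/* for a tap, a SOCKET_FILTER drop is often working as
	 * intended rather than actual packet loss
	 */
	kfree_skb_reason(skb, drop_reason);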
> >
> >
>