netdev - Re: Stacks leading into skb:kfree

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date: Wed, 19 Jul 2023 15:26:51 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc: David Ahern <dsahern@...nel.org>, Yan Zhai <yan@...udflare.com>, 
	Jakub Kicinski <kuba@...nel.org>, Ivan Babrou <ivan@...udflare.com>, 
	Linux Kernel Network Developers <netdev@...r.kernel.org>, kernel-team <kernel-team@...udflare.com>, 
	"David S. Miller" <davem@...emloft.net>, Paolo Abeni <pabeni@...hat.com>, 
	Steven Rostedt <rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>
Subject: Re: Stacks leading into skb:kfree_skb

On Wed, Jul 19, 2023 at 3:18 PM Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
>
> David Ahern wrote:
> > On 7/18/23 9:10 PM, Yan Zhai wrote:
> > > On Tue, Jul 18, 2023 at 5:36 PM Jakub Kicinski <kuba@...nel.org> wrote:
> > >>
> > >> On Fri, 14 Jul 2023 18:54:14 -0600 David Ahern wrote:
> > >>>> I made some aggregations for the stacks we see leading into
> > >>>> skb:kfree_skb endpoint. There's a lot of data that is not easily
> > >>>> digestible, so I lightly massaged the data and added flamegraphs in
> > >>>> addition to raw stack counts. Here's the gist link:
> > >>>>
> > >>>> * https://gist.github.com/bobrik/0e57671c732d9b13ac49fed85a2b2290
> > >>>
> > >>> I see a lot of packet_rcv as the tip before kfree_skb. How many packet
> > >>> sockets do you have running on that box? Can you accumulate the total
> > >>> packet_rcv -> kfree_skb_reasons into 1 count -- regardless of remaining
> > >>> stacktrace?
> > >>
> > >> On a quick look we have 3 branches which can get us to kfree_skb from
> > >> packet_rcv:
> > >>
> > >>         if (skb->pkt_type == PACKET_LOOPBACK)
> > >>                 goto drop;
> > >> ...
> > >>         if (!net_eq(dev_net(dev), sock_net(sk)))
> > >>                 goto drop;
> > >> ...
> > >>         res = run_filter(skb, sk, snaplen);
> > >>         if (!res)
> > >>                 goto drop_n_restore;
> > >>
> > >> I'd guess is the last one? Which we should mark with the SOCKET_FILTER
> > >> drop reason?
> > >
> > > So we have multiple packet socket consumers on our edge:
> > > * systemd-networkd: listens on ETH_P_LLDPD, which is the role model
> > > that does not do excessive things
> >
> > ETH level means raw packet socket which means *all* packets are duplicated.
> >
> > > * lldpd: I am not sure why we needed this one in presence of
> > > systemd-networkd, but it is running atm, which contributes to constant
> > > packet_rcv calls. It listens on ETH_P_ALL because of
> > > https://github.com/lldpd/lldpd/pull/414. But its filter is doing the
> > > correct work, so packets hitting this one is mostly "consumed"
> >
> > This one I am familiar with and its filter -- the fact that the filter
> > applies *after* the clone means it still contributes to the packet load.
> >
> > Together these 2 sockets might explain why the filter drop shows up in
> > packet_rcv.
> >
> > >
> > > Now the bad kids:
> > > * arping: listens on ETH_P_ALL. This one contributes all the
> > > skb:kfree_skb spikes, and the reason is sk_rmem_alloc overflows
> > > rcvbuf. I suspect it is due to a poorly constructed filter so too many
> > > packets get queued too fast.
> >
> > Any packet socket is the problem because the filter is applied to the
> > clone. Clone the packet, run the filter, kfree the packet.
>
> Small clarification: on receive in __netif_receive_skb_core, the skb
> is only cloned if accepted by packet_rcv. deliver_skb increases
> skb->users to ensure that the skb is not freed if a filter declines.
>
> On transmit, dev_queue_xmit_nit does create an initial clone. But
> then passes this one clone to all sockets, again using deliver_skb.
>
> A packet socket which filter accepts the skb is worse, then, as that
> clones the initial shared skb.
>
> >
> > > * conduit-watcher: a health checker, sending packets on ETH_P_IP in
> > > non-init netns. Majority of packet_rcv on this one goes to direct drop
> > > due to netns difference.
> >
> > So this the raw packet socket at L3 that shows up. This one should not
> > be as large of a contributor to the increases packet load.
> >
> > >
> > > So to conclude, it might be useful to set a reason for rcvbuf related
> > > drops at least. On the other hand, almost all packets entered
> > > packet_rcv are shared, so clone failure probably can also be a thing
> > > under memory pressure.
>
> kfree_skb is changed across the stack into kfree_skb_reason. Doing the
> same for PF_PACKET sounds entirely reasonable to me.
>
> Just be careful about false positives where only the filter does not
> matches and the shared skb is dereferenced. This is WAI and not cause
> for a report.
>

Relevant prior work was:

commit da37845fdce24e174f44d020bc4085ddd1c8a6bd
Author: Weongyo Jeong <weongyo.linux@...il.com>
Date:   Thu Apr 14 14:10:04 2016 -0700

    packet: uses kfree_skb() for errors.

    consume_skb() isn't for error cases that kfree_skb() is more proper
    one.  At this patch, it fixed tpacket_rcv() and packet_rcv() to be
    consistent for error or non-error cases letting perf trace its event
    properly.