Message-ID: <92a4d42491a2c219192ae86fa04b579ea3676d8c.camel@redhat.com>
Date:   Tue, 04 Jul 2023 12:10:10 +0200
From:   Paolo Abeni <pabeni@...hat.com>
To:     Ian Kumlien <ian.kumlien@...il.com>
Cc:     Alexander Lobakin <aleksander.lobakin@...el.com>,
        intel-wired-lan <intel-wired-lan@...ts.osuosl.org>,
        Jakub Kicinski <kuba@...nel.org>,
        Eric Dumazet <edumazet@...gle.com>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [Intel-wired-lan] bug with rx-udp-gro-forwarding offloading?

On Mon, 2023-07-03 at 11:37 +0200, Ian Kumlien wrote:
> So, got back, switched to 6.4.1 and reran with kmemleak and kasan
> 
> I got the splat from:
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index cea28d30abb5..701c1b5cf532 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4328,6 +4328,9 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> 
>         skb->prev = tail;
> 
> +       if (WARN_ON_ONCE(!skb->next))
> +               goto err_linearize;
> +
>         if (skb_needs_linearize(skb, features) &&
>             __skb_linearize(skb))
>                 goto err_linearize;
> 
> I'm just happy I ran with dmesg -W, since there was only minimal output
> on the console:
> [39914.833696] rcu: INFO: rcu_preempt self-detected stall on CPU
> [39914.839598] rcu:     2-....: (20997 ticks this GP)
> idle=dd64/1/0x4000000000000000 softirq=4633489/4633489 fqs=4687
> [39914.849839] rcu:     (t=21017 jiffies g=18175157 q=45473 ncpus=12)
> [39977.862108] rcu: INFO: rcu_preempt self-detected stall on CPU
> [39977.868002] rcu:     2-....: (84001 ticks this GP)
> idle=dd64/1/0x4000000000000000 softirq=4633489/4633489 fqs=28434
> [39977.878340] rcu:     (t=84047 jiffies g=18175157 q=263477 ncpus=12)
> [40040.892521] rcu: INFO: rcu_preempt self-detected stall on CPU
> [40040.898414] rcu:     2-....: (147006 ticks this GP)
> idle=dd64/1/0x4000000000000000 softirq=4633489/4633489 fqs=53043
> [40040.908831] rcu:     (t=147079 jiffies g=18175157 q=464422 ncpus=12)
> [40065.080842] ixgbe 0000:06:00.1 eno2: Reset adapter

Ouch, just another slightly different issue, apparently :(

I'll try some wild guesses. The RCU stall could cause the OOM observed
in the previous tests. Here the OOM did not trigger because, due to
kasan/kmemleak, the kernel is able to process fewer packets in the same
period of time.

[...]
> [39914.857231] skb_segment (net/core/skbuff.c:4519)

I *think* this could be looping "forever" if gso_size becomes 0, which
in turn is completely unexpected ...


> [39914.857257] ? write_profile (kernel/stacktrace.c:83)
> [39914.857296] ? pskb_extract (net/core/skbuff.c:4360)
> [39914.857320] ? rt6_score_route (net/ipv6/route.c:713 (discriminator 1))
> [39914.857346] ? llist_add_batch (lib/llist.c:33 (discriminator 14))
> [39914.857379] __udp_gso_segment (net/ipv4/udp_offload.c:290)
> [39914.857413] ? ip6_dst_destroy (net/ipv6/route.c:788)
> [39914.857442] udp6_ufo_fragment (net/ipv6/udp_offload.c:47)
> [39914.857472] ? udp6_gro_complete (net/ipv6/udp_offload.c:20)
> [39914.857498] ? ipv6_gso_pull_exthdrs (net/ipv6/ip6_offload.c:53)
> [39914.857528] ipv6_gso_segment (net/ipv6/ip6_offload.c:119
> net/ipv6/ip6_offload.c:74)
> [39914.857557] ? ipv6_gso_pull_exthdrs (net/ipv6/ip6_offload.c:76)
> [39914.857583] ? nft_update_chain_stats (net/netfilter/nf_tables_core.c:254)
> [39914.857612] ? fib6_select_path (net/ipv6/route.c:458)
> [39914.857643] skb_mac_gso_segment (net/core/gro.c:141)
> [39914.857673] ? skb_eth_gso_segment (net/core/gro.c:127)
> [39914.857702] ? ipv6_skip_exthdr (net/ipv6/exthdrs_core.c:190)
> [39914.857726] ? kasan_save_stack (mm/kasan/common.c:47)
> [39914.857758] __skb_gso_segment (net/core/dev.c:3401 (discriminator 2))
> [39914.857787] udpv6_queue_rcv_skb (./include/net/udp.h:492
> net/ipv6/udp.c:796 net/ipv6/udp.c:787)
> [39914.857816] __udp6_lib_rcv (net/ipv6/udp.c:906 net/ipv6/udp.c:1013)

... but this means we are processing a multicast packet, so the skb is
likely cloned. If one of the clone instances enters skb_segment_list()
simultaneously, the latter will unconditionally call:

	skb_gso_reset(skb);

clearing the gso area in the shared info and causing unexpected results
(possibly the memory corruption observed before, and the above RCU
stall) for the other clone instances.
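
To make the concern concrete, here is a minimal sketch of the suspected
interaction; it uses only the generic helpers (skb_clone(),
skb_gso_reset(), skb_shinfo()) and is illustrative, not the exact
multicast delivery path:

	/* Clones share the data buffer, and skb_shared_info lives at
	 * the end of that buffer, so every clone sees the same gso
	 * fields.
	 */
	struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

	/* CPU A, inside skb_segment_list() on one clone: */
	skb_gso_reset(clone);	/* zeroes gso_size/gso_segs/gso_type in
				 * the shared skb_shared_info */

	/* CPU B, segmenting the other clone, now reads gso_size == 0.
	 * The skb_segment() cursor only advances by gso_size per
	 * iteration, so it makes no forward progress (a caricature of
	 * the real loop, just to show the stall):
	 */
	unsigned int mss = skb_shinfo(skb)->gso_size;	/* 0 after reset */
	unsigned int offset = 0;

	while (offset < skb->len)	/* spins "forever" when mss == 0 */
		offset += mss;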

Assuming there are no other issues and that the above is not just a
side effect of ENOCOFFEE here, the following should possibly solve it;
could you please add it to your testbed? (still with kasan + the
previous patch, kmemleak is possibly not needed).

Thanks!
---
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6c5915efbc17..ac1ca6c7bff9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4263,6 +4263,11 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
 
 	skb_shinfo(skb)->frag_list = NULL;
 
+	/* later code will clear the gso area in the shared info */
+	err = skb_header_unclone(skb, GFP_ATOMIC);
+	if (err)
+		goto err_linearize;
+
 	while (list_skb) {
 		nskb = list_skb;
 		list_skb = list_skb->next;
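
For context, skb_header_unclone() is roughly the following (simplified;
see include/linux/skbuff.h for the exact body). After it succeeds, the
skb headers and the shared info are private to this skb, so the later
skb_gso_reset() no longer touches the other clones:

	static inline int skb_header_unclone(struct sk_buff *skb, gfp_t pri)
	{
		/* header (and shinfo) still shared with a clone? */
		if (skb_header_cloned(skb))
			return pskb_expand_head(skb, 0, 0, pri); /* private copy */

		return 0;
	}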
