Message-ID: <bff9cef5-84bd-e8f9-a42a-6f5491fe48c5@knorrie.org>
Date: Thu, 24 May 2018 16:49:51 +0200
From: Hans van Kranenburg <hans@...rrie.org>
To: netdev@...r.kernel.org, dev@...nvswitch.org
Cc: "Kranenburg, Hans van" <Hans.van.Kranenburg@...dix.com>,
Eric Dumazet <edumazet@...gle.com>, 899044@...s.debian.org
Subject: Oops: 0000 [#1] SMP in skb_release_data, openvswitch related
To: netdev, dev@...nvswitch
Cc: Eric Dumazet (author of ff04a771ad), debian bug
Hi,
As a follow-up to my bug report at Debian [0], I'm trying to do bug triage
and find out more. I'm not the expert here, but anything could help, and
it's an opportunity to learn things.
I'm observing the attached errors ('general protection fault: 0000 [#1]
SMP' and 'BUG: unable to handle kernel paging request') on Xen dom0
machines running the 4.9.88 Debian Stretch kernel as dom0 kernel.
The errors have happened a few times over the last few weeks. They
started after upgrading the machines from Jessie with a 3.16 kernel to
Stretch with a 4.9 kernel.
The traces printed look very much alike every time.
If I look up the listed address, I get:
-$ addr2line -e /usr/lib/debug/boot/vmlinux-4.9.0-6-amd64 -i -a ffffffff814f5c7d
0xffffffff814f5c7d
./debian/build/build_amd64_none_amd64/./include/linux/compiler.h:243 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/page-flags.h:143 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/mm.h:779 (discriminator 3)
./debian/build/build_amd64_none_amd64/./include/linux/skbuff.h:2592 (discriminator 3)
./debian/build/build_amd64_none_amd64/./net/core/skbuff.c:594 (discriminator 3)
583 static void skb_release_data(struct sk_buff *skb)
584 {
585         struct skb_shared_info *shinfo = skb_shinfo(skb);
586         int i;
587
588         if (skb->cloned &&
589             atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
590                               &shinfo->dataref))
591                 return;
592
593         for (i = 0; i < shinfo->nr_frags; i++)
594 ----->          __skb_frag_unref(&shinfo->frags[i]); <------
595
596         /*
597          * If skb buf is from userspace, we need to notify the caller
598          * the lower device DMA has done;
599          */
600         if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
601                 struct ubuf_info *uarg;
602
603                 uarg = shinfo->destructor_arg;
604                 if (uarg->callback)
605                         uarg->callback(uarg, true);
606         }
607
608         if (shinfo->frag_list)
609                 kfree_skb_list(shinfo->frag_list);
610
611         skb_free_head(skb);
612 }
The most recent big change in this area (well, from 2014) is...
commit ff04a771ad25fc9ba91690e73465b4d34b6bf8b3
Author: Eric Dumazet <edumazet@...gle.com>
Date: Tue Sep 23 18:39:30 2014 -0700
net : optimize skb_release_data()
...which is not present in the 3.16.y kernel that Debian Jessie still
uses. That kernel does not hit this problem (although Jessie also uses
older openvswitch userspace components).
Other changes in this area mention zero copy IO, which sounds like
something openvswitch could be using.
-- background: openvswitch usage --
For networking between domUs and the outside world, we use openvswitch.
After such an error happens:
* The number of "flows" in the kernel quickly rises to the limit,
10000, as seen in the output of ovs-dpctl show.
* Network traffic that should flow through the openvswitch bridge starts
disappearing in a seemingly random way (probably because it can't handle
new traffic flows).
* The memory usage of the userspace ovs-vswitchd starts growing quickly.
* Many of the ovs commands, such as those to add or remove an interface
or bridge, hang.
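To catch the first symptom early, the flow count could be polled and compared against the limit. The following is only a minimal sketch: it assumes that `ovs-dpctl show` prints one `flows: N` line per datapath, and the helper names (`parse_flow_counts`, `near_limit`, `read_dpctl_show`) and the 90% threshold are hypothetical choices, not part of any openvswitch tooling.

```python
import re
import subprocess

# Assumed output format: each datapath section of `ovs-dpctl show`
# contains a line such as "\tflows: 1234".
FLOWS_RE = re.compile(r"^[ \t]*flows:[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def parse_flow_counts(show_output):
    """Return the flow counts found in `ovs-dpctl show` text, one per datapath."""
    return [int(n) for n in FLOWS_RE.findall(show_output)]


def near_limit(show_output, limit=10000, fraction=0.9):
    """True if any datapath holds at least `fraction` of `limit` flows."""
    return any(n >= limit * fraction for n in parse_flow_counts(show_output))


def read_dpctl_show():
    """Run `ovs-dpctl show` and return its stdout (usually needs root)."""
    result = subprocess.run(
        ["ovs-dpctl", "show"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

Calling `near_limit(read_dpctl_show())` from a cron job or monitoring check would flag a dom0 that is approaching the 10000 flow limit, hopefully before traffic starts disappearing.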
After a restart of the openvswitch-switch service, and fixing up a bunch
of configuration of connected interfaces, functionality is restored.
While most of the symptoms seem related to userspace openvswitch
processes, the cause of it all seems to be in the kernel, at a moment
when the userspace ovs-vswitchd process is receiving a network packet?
-- reproducer --
I don't have a reliable reproducer yet, except for waiting days or weeks
until it randomly happens somewhere. There's no sign of unusual amounts
of traffic / load etc when it happens.
One idea I can come up with is building a semi-random UDP packet generator
to start stressing the code path from kernel to ovs-vswitchd.
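A rough sketch of such a generator is below. It sends small UDP datagrams with random payloads to random destination ports, each from a fresh socket so the source port changes too. Whether varying only the ports makes packets miss the kernel flow table (and so reach ovs-vswitchd) depends on how the installed flows wildcard L4 fields, so that part is an assumption. The function names `make_plan` and `send_plan` are hypothetical.

```python
import random
import socket


def make_plan(rng, count, port_range=(1024, 65535), max_len=1400):
    """Build a reproducible list of (dst_port, payload) pairs.

    Payloads stay below a typical 1500 byte MTU to avoid IP fragmentation.
    """
    plan = []
    for _ in range(count):
        dst_port = rng.randint(*port_range)
        length = rng.randint(1, max_len)
        payload = bytes(rng.randrange(256) for _ in range(length))
        plan.append((dst_port, payload))
    return plan


def send_plan(target_ip, plan):
    """Send every datagram in `plan` to `target_ip`; return the number sent.

    A new socket is opened per datagram so the kernel picks a new
    ephemeral source port each time.
    """
    sent = 0
    for dst_port, payload in plan:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (target_ip, dst_port))
        sent += 1
    return sent
```

For example, `send_plan("192.0.2.10", make_plan(random.Random(899044), 100000))` would replay the same sequence on every run, where 192.0.2.10 is a placeholder for the address of a domU behind the bridge. Seeding the generator keeps a run repeatable if it ever triggers the fault.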
If I succeed in reproducing it, I can start trying other kernels or changes.
Please advise what else I could do to help resolve this issue.
Thanks,
Regards,
Hans van Kranenburg
[0] https://bugs.debian.org/899044
View attachment "kernel-errors.txt" of type "text/plain" (20268 bytes)