lists.openwall.net
Open Source and information security mailing list archives
 
Message-ID: <CA+=Fv5Tf=3HLhkDqRoxSgp0p6kpn7F62qiRXOXMSyR2KzQNUDQ@mail.gmail.com>
Date: Fri, 29 Nov 2024 14:50:54 +0100
From: Magnus Lindholm <linmag7@...il.com>
To: netdev@...r.kernel.org
Subject: kernel Oops: net-next-6.9 breaks stuff on Alpha?

Hi,

First, some background:
I've been trying to boot recent kernels on my Alpha machines. Anything
after linux-6.8.12 gives me trouble. After doing a kernel bisect, I
found that commit 9187210eee7d87eea37b45ea93454a88681894a4
(net-next-6.9) is where my troubles begin. The problem is that the
boot process hangs when trying to set parameters for the network
interfaces. The bad commit makes a lot of changes to the networking
code.

When booting the system with kernel 6.12.0, I'm able to boot into
single-user mode, but when I start system services one by one, I
trigger a kernel Oops when the network interface is renamed (see the
stack dump below). Looking at the changes made by the bad commit, it
seems (among other things) to replace the locking mechanism (RCU
instead of rtnl_lock). The stack dump from the Oops suggests that
something is going wrong in the RCU code. I'm no expert on RCU, but I
read somewhere that on every system other than DEC Alpha it is done
with volatile accesses, whereas Alpha requires a memory barrier
instruction. Could this mean the change affects the Alpha
architecture differently?

Inspecting the changes to the networking code in the bad commit,
particularly those in net/core/dev.c, I put together the patch below.
It reverts one of the lines the bad commit changed in net/core/dev.c.
With just this line reverted, I'm able to boot kernel 6.12.0 on my
Alpha ES40 to full multi-user mode again. I've tested this on an Alpha
ES40 and a UP2000+, and the problem is 100% reproducible on both
systems. The patch might not be a real solution to the problem, but it
could be a good place to start when figuring out what's really going
on.

I'm not sure what the next step is here; it would be interesting to
hear whether anyone else has seen this or is able to reproduce it.
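To make the Alpha point above concrete, here is a minimal userspace sketch of the publish/subscribe pattern RCU is built on, written with C11 atomics rather than the kernel's rcu_assign_pointer()/rcu_dereference(). The struct name_node type and the publish_node()/read_ifindex() helpers are hypothetical. The writer initialises an object and then publishes the pointer with a release store; the reader loads the pointer and dereferences it. Linux relies on the CPU keeping such dependent loads in order on every architecture except Alpha, which is why READ_ONCE() (and therefore rcu_dereference()) includes a memory barrier on SMP Alpha builds.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical record handed to readers, standing in for an RCU-protected object. */
struct name_node {
	int ifindex;
	const char *name;
};

/* Shared pointer that readers subscribe to; zero-initialised, so NULL until published. */
static _Atomic(struct name_node *) published;

/* Writer side: fully initialise the object first, then publish the pointer
 * with release semantics (the role rcu_assign_pointer() plays in the kernel). */
static void publish_node(struct name_node *n, int ifindex, const char *name)
{
	assert(n != NULL);
	n->ifindex = ifindex;
	n->name = name;
	atomic_store_explicit(&published, n, memory_order_release);
}

/* Reader side: load the pointer, then dereference it. memory_order_consume
 * asks for ordering of loads that depend on p. Most CPUs provide that through
 * the address dependency alone; Alpha does not, so the kernel adds a barrier
 * there. GCC and Clang currently treat consume as acquire, so this is at
 * least as strong as what the kernel relies on. */
static int read_ifindex(void)
{
	struct name_node *p = atomic_load_explicit(&published, memory_order_consume);
	return p ? p->ifindex : -1;
}
```

For example, read_ifindex() returns -1 before anything is published and 2 after publish_node(&node, 2, "enp1s2").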

Regards
Magnus Lindholm

---------------------------
Patch to "fix" the problem:
---------------------------

diff --git a/net/core/dev.c b/net/core/dev.c
index 13d00fc10f55..26fda14367e5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1261,7 +1261,7 @@ int dev_change_name(struct net_device *dev, const char *newname)

        netdev_name_node_del(dev->name_node);

-       synchronize_net();
+       synchronize_rcu();

        netdev_name_node_add(net, dev->name_node);

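For context, assuming the 6.12 sources behave as described here: synchronize_net() in net/core/dev.c calls synchronize_rcu_expedited() when rtnl_is_locked() is true and synchronize_rcu() otherwise, and dev_change_name() runs with RTNL held. If so, the line changed by the bad commit moves the rename path from a normal to an expedited RCU grace period. The task named in the Oops, rcu_exp_gp_kthr, looks like the truncated name of the kthread worker RCU uses for expedited grace periods, and wait_rcu_exp_gp appears in the trace, which would fit that reading. The userspace model below is only a sketch of that decision; enum gp_kind and the model_synchronize_net(), model_synchronize_rcu() and rename_gp_kind() helpers are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: which kind of RCU grace period a caller waits for. */
enum gp_kind { GP_NORMAL, GP_EXPEDITED };

/* Models synchronize_net(): expedited when RTNL is held, normal otherwise.
 * rtnl_held stands in for rtnl_is_locked(). */
static enum gp_kind model_synchronize_net(bool rtnl_held)
{
	return rtnl_held ? GP_EXPEDITED : GP_NORMAL;
}

/* Models synchronize_rcu(), ignoring overrides such as booting with
 * rcupdate.rcu_expedited=1. */
static enum gp_kind model_synchronize_rcu(void)
{
	return GP_NORMAL;
}

/* Models the call between netdev_name_node_del() and netdev_name_node_add()
 * in dev_change_name(), which runs with RTNL held. reverted = true selects
 * the patched synchronize_rcu(); false selects synchronize_net(). */
static enum gp_kind rename_gp_kind(bool reverted)
{
	const bool rtnl_held = true;
	return reverted ? model_synchronize_rcu() : model_synchronize_net(rtnl_held);
}
```

Under these assumptions, rename_gp_kind(false) yields GP_EXPEDITED and rename_gp_kind(true) yields GP_NORMAL, which is consistent with the revert avoiding the crash, though it does not by itself explain why the expedited path ends up jumping to address 0 on Alpha.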

-----------------
dmesg/kernel log:
-----------------

[   93.431592] tulip 0000:01:02.0 enp1s2: renamed from eth0

[   93.436475] Unable to handle kernel paging request at virtual address 0000000000000000
[   93.436475] CPU 1
[   93.436475] rcu_exp_gp_kthr(17): Oops -1
[   93.436475] pc = [<0000000000000000>]  ra = [<0000000000000000>]  ps = 0000    Not tainted
[   93.436475] pc is at 0x0
[   93.436475] ra is at 0x0
[   93.436475] v0 = 0000000000000007  t0 = fffffc0000e62440  t1 = 0000000000000001
[   93.436475] t2 = 0000000000000000  t3 = 0000000000000001  t4 = 0000000000000001
[   93.436475] t5 = 0000000000000001  t6 = 0000000000000001  t7 = fffffc0003138000
[   93.436475] s0 = fffffc0000e62440  s1 = fffffc0000ec3a10  s2 = fffffc0000ec3a10
[   93.436475] s3 = fffffc0000ec3a10  s4 = fffffc00003a90f0  s5 = fffffc0000e62440
[   93.436475] s6 = 0000000000000000
[   93.436475] a0 = 0000000000000000  a1 = 0000000000000000  a2 = 0000000000000000
[   93.436475] a3 = 0000000000000000  a4 = 0000000000000001  a5 = fffffc0000517744
[   93.436475] t8 = 0000000000000001  t9 = 0000000000000001  t10= fffffc0000e3d320
[   93.436475] t11= fffffc0000220240  pv = fffffc0000b73210  at = 0000000000000000
[   93.436475] gp = fffffc0000eb3a10  sp = 00000000ea2ea184
[   93.436475] Disabling lock debugging due to kernel taint
[   93.436475] Trace:
[   93.436475] [<fffffc00003aee60>] wait_rcu_exp_gp+0x30/0xa0
[   93.436475] [<fffffc0000b6c200>] __cond_resched+0x30/0x90
[   93.436475] [<fffffc00003569b8>] kthread_worker_fn+0xc8/0x1f0
[   93.436475] [<fffffc000035863c>] kthread+0x17c/0x1c0
[   93.436475] [<fffffc00003568f0>] kthread_worker_fn+0x0/0x1f0
[   93.436475] [<fffffc0000311128>] ret_from_kernel_thread+0x18/0x20

[   93.436475] Code:
[   93.436475]  00000000
[   93.436475]  00000000
[   93.436475]  00063301
[   93.436475]  0000077c
[   93.436475]  00001111
[   93.436475]  000022a2
