lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <878se0ktmo.fsf@x220.int.ebiederm.org>
Date:   Thu, 27 Aug 2020 10:15:11 -0500
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Wang Long <w@...qinren.net>
Cc:     netdev@...r.kernel.org, davem@...emloft.net, kuznet@....inr.ac.ru,
        yoshfuji@...ux-ipv6.org, kuba@...nel.org, edumazet@...gle.com,
        eric.dumazet@...il.com, opurdila@...acom.com,
        vegard.nossum@...il.com, LKML <linux-kernel@...r.kernel.org>
Subject: Re: RFC: inet_timewait_sock->tw_timer list corruption

Wang Long <w@...qinren.net> writes:

> Hi,
>
> we encountered a kernel panic as following:
>
> [4394470.273792] general protection fault: 0000 [#1] SMP NOPTI
> [4394470.274038] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G    W
> --------- -  - 4.18.0-80.el8.x86_64 #1
> [4394470.274477] Hardware name: Sugon I620-G30/60P24-US, BIOS MJGS1223
> 04/07/2020
> [4394470.274727] RIP: 0010:run_timer_softirq+0x34e/0x440
> [4394470.274957] Code: 84 3f ff ff ff 49 8b 04 24 48 85 c0 74 58 49 8b 1c 24 48
> 89 5d 08 0f 1f 44 00 00 48 8b 03 48 8b 53 08 48 85 c0 48 89 02 74 04 <48> 89 50
> 08 f6 43 22 20 48 c7 43 08 00 00 00 00 48 89 ef 4c 89 2b
> [4394470.275505] RSP: 0018:ffff88f000803ee0 EFLAGS: 00010086
> [4394470.275783] RAX: dead000000000200 RBX: ffff88e5e33ea078 RCX:
> 0000000000000100
> [4394470.276087] RDX: ffff88f000803ee8 RSI: 0000000000000000 RDI:
> ffff88f00081aa00
> [4394470.276391] RBP: ffff88f00081aa00 R08: 0000000000000001 R09:
> 0000000000000000
> [4394470.276697] R10: ffff88e5e33eb1f0 R11: 0000000000000000 R12:
> ffff88f000803ee8
> [4394470.277030] R13: dead000000000200 R14: ffff88f000803ee0 R15:
> 0000000000000000
> [4394470.277350] FS:  0000000000000000(0000) GS:ffff88f000800000(0000)
> knlGS:0000000000000000
> [4394470.277684] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [4394470.278020] CR2: 00007f200eddd160 CR3: 0000000e0b20a002 CR4:
> 00000000007606f0
> [4394470.278412] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [4394470.278799] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [4394470.279194] PKRU: 55555554
> [4394470.279543] Call Trace:
> [4394470.279889]  <IRQ>
> [4394470.280237]  ? __hrtimer_init+0xb0/0xb0
> [4394470.280618]  ? sched_clock+0x5/0x10
> [4394470.281000]  __do_softirq+0xe8/0x2ef
> [4394470.281397]  irq_exit+0xf1/0x100
> [4394470.281761]  smp_apic_timer_interrupt+0x74/0x130
> [4394470.282132]  apic_timer_interrupt+0xf/0x20
> [4394470.282548]  </IRQ>
> [4394470.282954] RIP: 0010:cpuidle_enter_state+0xa0/0x2b0
> [4394470.283341] Code: 8b 3d 6c fb 59 4c e8 0f ed a6 ff 48 89 c3 0f 1f 44 00 00
> 31 ff e8 80 00 a7 ff 45 84 f6 0f 85 c3 01 00 00 fb 66 0f 1f 44 00 00 <4c> 29 fb
> 48 ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7
> [4394470.284219] RSP: 0018:ffffffffb4603e78 EFLAGS: 00000246 ORIG_RAX:
> ffffffffffffff13
> [4394470.284671] RAX: ffff88f000823080 RBX: 000f9cbf579e86c6 RCX:
> 000000000000001f
> [4394470.285129] RDX: 000f9cbf579e86c6 RSI: 0000000037a6f674 RDI:
> 0000000000000000
> [4394470.285623] RBP: 0000000000000002 R08: 00000000000000c4 R09:
> 0000000000000027
> [4394470.286088] R10: ffffffffb4603e58 R11: 000000000000004c R12:
> ffff88f00082df00
> [4394470.286566] R13: ffffffffb4724118 R14: 0000000000000000 R15:
> 000f9cbf579d44e0
> [4394470.287045]  ? cpuidle_enter_state+0x90/0x2b0
> [4394470.287527]  do_idle+0x200/0x280
> [4394470.288010]  cpu_startup_entry+0x6f/0x80
> [4394470.288501]  start_kernel+0x533/0x553
> [4394470.288994]  secondary_startup_64+0xb7/0xc0
>
>
> After analysis, we found that the timer which expires has timer->entry.next ==
> POISON2 !(the list corruption )
>
> the crash scenario is the same as https://lkml.org/lkml/2017/3/21/732,
>
> I cannot reproduce this issue, but I found that the timer cause crash is the
> inet_timewait_sock->tw_timer(its callback function is tw_timer_handler), and the
> value of tcp_tw_reuse is 1.
>
> # cat /proc/sys/net/ipv4/tcp_tw_reuse
> 1
>
> In the production environment, we encountered this problem many times, and every
> time it was a problem with the inet_timewait_sock->tw_timer.
>
> Do anyone have any ideas for this issue? Thanks.

You might enble list debugging if it isn't already.  That might give you
enough information to track this down.

You might also contact redhat support as it appears you are running a
redhat kernel.

Eric

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ