Message-ID: <2a630a76-bcfb-d08d-619d-eafa6a7b1025@intel.com>
Date: Mon, 14 Nov 2022 16:49:15 -0800
From: Tony Nguyen <anthony.l.nguyen@...el.com>
To: Stefan Assmann <sassmann@...hat.com>,
Ivan Vecera <ivecera@...hat.com>
CC: <netdev@...r.kernel.org>, Jacob Keller <jacob.e.keller@...el.com>,
"Patryk Piotrowski" <patryk.piotrowski@...el.com>,
SlawomirX Laba <slawomirx.laba@...el.com>,
Jesse Brandeburg <jesse.brandeburg@...el.com>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
"moderated list:INTEL ETHERNET DRIVERS"
<intel-wired-lan@...ts.osuosl.org>,
open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net] iavf: Fix a crash during reset task

On 11/8/2022 2:53 AM, Stefan Assmann wrote:
> On 2022-11-08 10:35, Ivan Vecera wrote:
>> Recent commit aa626da947e9 ("iavf: Detach device during reset task")
>> removed netif_tx_stop_all_queues() with an assumption that Tx queues
>> are already stopped by netif_device_detach() in the beginning of
>> reset task. This assumption is incorrect because during reset
>> task a potential link event can start Tx queues again.
>> Revert this change to fix this issue.
>>
>> Reproducer:
>> 1. Run some Tx traffic (e.g. iperf3) over iavf interface
>> 2. Switch MTU of this interface in a loop
>>
>> [root@...t ~]# cat repro.sh
>> #!/bin/sh
>>
>> IF=enp2s0f0v0
>>
>> iperf3 -c 192.168.0.1 -t 600 --logfile /dev/null &
>> sleep 2
>>
>> while :; do
>>     for i in 1280 1500 2000 900 ; do
>>         ip link set $IF mtu $i
>>         sleep 2
>>     done
>> done
>
> With this patch applied iavf doesn't crash anymore but after a few
> cycles with the reproducer tx timeouts are observed.
>
> [ 47.551151] iavf 0000:00:09.0 eth0: NIC Link is Up Speed is 10 Gbps Full Duplex
> [ 54.035902] ------------[ cut here ]------------
> [ 54.037397] NETDEV WATCHDOG: eth0 (iavf): transmit queue 3 timed out
> [ 54.039264] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:526 dev_watchdog+0x20f/0x250
> [ 54.041524] Modules linked in: 8021q intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass rapl pcspkr drm ramoops reed_solomon crct10dif_pclmul crc32_pclmul crc32c_intel ata_generic pata_acpi ghash_clmulni_intel ata_piix aesni_intel crypto_simd iavf libata be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
> [ 54.049723] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.1.0-rc2+ #90
> [ 54.051049] Hardware name: Red Hat KVM, BIOS 1.15.0-2.module+el8.6.0+14757+c25ee005 04/01/2014
> [ 54.052898] RIP: 0010:dev_watchdog+0x20f/0x250
> [ 54.053907] Code: 00 e9 4d ff ff ff 48 89 df c6 05 92 24 96 01 01 e8 c6 f2 f8 ff 44 89 e9 48 89 de 48 c7 c7 28 7f f6 a0 48 89 c2 e8 6e 65 23 00 <0f> 0b e9 2f ff ff ff e8 25 06 2a 00 85 c0 74 b5 80 3d 74 1b 96 01
> [ 54.057282] RSP: 0018:ffffaf56c00e0e80 EFLAGS: 00010282
> [ 54.058164] RAX: 0000000000000000 RBX: ffff993ed95b8000 RCX: 0000000000000103
> [ 54.059345] RDX: 0000000000000103 RSI: 00000000000000f6 RDI: 00000000ffffffff
> [ 54.060473] RBP: ffff993ed95b8508 R08: 0000000000000000 R09: c0000000fff7ffff
> [ 54.061558] R10: 0000000000000001 R11: ffffaf56c00e0d18 R12: ffff993ed95b8420
> [ 54.062640] R13: 0000000000000003 R14: ffff993ed95b8508 R15: ffff993ef74a06c0
> [ 54.063681] FS: 0000000000000000(0000) GS:ffff993ef7480000(0000) knlGS:0000000000000000
> [ 54.064867] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 54.065654] CR2: 00007f42309e1280 CR3: 0000000107f6a003 CR4: 0000000000170ee0
> [ 54.066612] Call Trace:
> [ 54.066985] <IRQ>
> [ 54.067265] ? mq_change_real_num_tx+0xd0/0xd0
> [ 54.067844] call_timer_fn+0xa1/0x2c0
> [ 54.068330] ? mq_change_real_num_tx+0xd0/0xd0
> [ 54.068916] run_timer_softirq+0x527/0x550
> [ 54.069447] ? lock_is_held_type+0xd8/0x130
> [ 54.069998] __do_softirq+0xc3/0x481
> [ 54.070469] irq_exit_rcu+0xe4/0x120
> [ 54.070963] sysvec_apic_timer_interrupt+0x9e/0xc0
> [ 54.071604] </IRQ>
> [ 54.071909] <TASK>
> [ 54.072223] asm_sysvec_apic_timer_interrupt+0x16/0x20
> [ 54.072942] RIP: 0010:default_idle+0x10/0x20
> [ 54.073533] Code: 89 df 31 f6 5b 5d e9 ff 1c a5 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 eb 07 0f 00 2d f2 2a 42 00 fb f4 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 65
>
> This only occurs when the device is detached and reattached during reset.

Hi Ivan,

Was there going to be an update to the patch to resolve this? If not,
I'll take what there is now.

Thanks,
Tony