netdev - Re: mvpp2 crash under load.

Message-ID: <89d13dec-b7b2-89e6-bd45-d56c0f9f4491@arm.com>
Date:   Wed, 24 Jan 2018 11:39:39 -0600
From:   Jeremy Linton <jeremy.linton@....com>
To:     Antoine Tenart <antoine.tenart@...e-electrons.com>
Cc:     "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        thomas.petazzoni@...e-electrons.com, elfring@...rs.sourceforge.net,
        mw@...ihalf.com
Subject: Re: mvpp2 crash under load.

Hi,

First, thanks for taking a look at this.


On 01/23/2018 01:53 AM, Antoine Tenart wrote:
> Hi Jeremy,
> 
> On Mon, Jan 22, 2018 at 05:14:27PM -0600, Jeremy Linton wrote:
>>
>> I'm running 4.15-rc7 and hitting the following crash on the MACCHIATObin.
>> This is 100% reproducible once the adapter is given any load. Within a few
>> seconds of starting an scp or NFS copy inbound to the machine, it dies like
>> this:
>>
>>
>> [12544.192436] mvpp2 f4000000.ethernet eth2: wrong cpu on the end of Tx processing
>> [12548.513734] mvpp2 f4000000.ethernet eth2: wrong cpu on the end of Tx processing
>> [12548.623574] mvpp2 f4000000.ethernet eth2: wrong cpu on the end of Tx processing
> 
> I believe this is the root cause of this issue: txq_done() is scheduled
> on the wrong CPU and we know it can't run on 2 CPUs at the same time. We
> had a similar issue (same stack trace, different root cause):
> 082297e61480c4d72ed75b31077e74aca0e7c799

I'm pretty sure I already had that patch. I've rebased to 4.15-rc9 and the
crash continues. I also cherry-picked "net: mvpp2: only free the TSO header
buffers when it was allocated" from net-next, which didn't appear to fix
it either.

Thanks,


> 
> Thanks for reporting this!
> 
> Antoine
> 
>> [12548.630943] Unable to handle kernel paging request at virtual address 97ffd6fdd28000e8
>> [12548.638897] Mem abort info:
>> [12548.641703]   ESR = 0x96000004
>> [12548.644775]   Exception class = DABT (current EL), IL = 32 bits
>> [12548.650720]   SET = 0, FnV = 0
>> [12548.653795]   EA = 0, S1PTW = 0
>> [12548.656952] Data abort info:
>> [12548.659846]   ISV = 0, ISS = 0x00000004
>> [12548.663700]   CM = 0, WnR = 0
>> [12548.666684] [97ffd6fdd28000e8] address between user and kernel address ranges
>> [12548.673855] Internal error: Oops: 96000004 [#1] SMP
>> [12548.678757] Modules linked in: ax88179_178a usbnet ip6t_rpfilter
>> ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat
>> ebtable_brox
>> [12548.749992]  xhci_plat_hcd ahci_platform [last unloaded: usbnet]
>> [12548.756034] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.15.0-0.rc7.git0.1.fc28.aarch64 #1
>> [12548.764249] Hardware name: Marvell Armada 8040 MacchiatoBin/Armada 8040 MacchiatoBin, BIOS EDK II Oct  2 2017
>> [12548.774210] pstate: 40400005 (nZcv daif +PAN -UAO)
>> [12548.779033] pc : consume_skb+0x1c/0xd8
>> [12548.782802] lr : __dev_kfree_skb_any+0x58/0x68
>> [12548.787264] sp : ffff00000801bc30
>> [12548.790594] x29: ffff00000801bc30 x28: ffff831bed412a40
>> [12548.795934] x27: ffff831bf7ce8000 x26: 0000000000000001
>> [12548.801273] x25: ffff27e28d746120 x24: ffff831bed412948
>> [12548.806612] x23: 0000000000000018 x22: ffff27e28d746120
>> [12548.811950] x21: 0000000000000007 x20: 0000000000000001
>> [12548.817289] x19: 97ffd6fdd2800004 x18: 0000000000000010
>> [12548.822627] x17: 0000000000000000 x16: ffff27e28d5bb4a0
>> [12548.827966] x15: ffffffffffffffff x14: 737365636f727020
>> [12548.833305] x13: 785420666f20646e x12: 6520656874206e6f
>> [12548.838643] x11: ffff27e28e07b448 x10: ffff27e28d35eb00
>> [12548.843981] x9 : 2074656e72656874 x8 : 0000000000000005
>> [12548.849319] x7 : 00000000b26f0000 x6 : 00000000b66f0000
>> [12548.854658] x5 : 0000000000000001 x4 : 0000000000000000
>> [12548.859995] x3 : 0000000000000001 x2 : 97ffd6fdd2800004
>> [12548.865333] x1 : 0000000000000001 x0 : ffff27e28d5bb4f8
>> [12548.870673] Process swapper/3 (pid: 0, stack limit = 0x0000000071feb006)
>> [12548.877404] Call trace:
>> [12548.879863]  consume_skb+0x1c/0xd8
>> [12548.883281]  __dev_kfree_skb_any+0x58/0x68
>> [12548.887411]  mvpp2_txq_bufs_free.isra.53+0xd0/0x118 [mvpp2]
>> [12548.893017]  mvpp2_txq_done.isra.68+0xb0/0xf8 [mvpp2]
>> [12548.898100]  mvpp2_tx_done+0xb4/0x118 [mvpp2]
>> [12548.902484]  mvpp2_poll+0x5c4/0x658 [mvpp2]
>> [12548.906688]  net_rx_action+0x160/0x3f8
>> [12548.910456]  __do_softirq+0x138/0x344
>> [12548.914137]  irq_exit+0xd0/0x100
>> [12548.917381]  __handle_domain_irq+0x6c/0xc0
>> [12548.921497]  gic_handle_irq+0x60/0xb0
>> [12548.925175]  el1_irq+0xd8/0x180
>> [12548.928331]  arch_cpu_idle+0x30/0x188
>> [12548.932011]  do_idle+0x138/0x1f8
>> [12548.935255]  cpu_startup_entry+0x2c/0x30
>> [12548.939197]  secondary_start_kernel+0x11c/0x130
>> [12548.943750] Code: aa0003f3 aa1e03e0 d503201f b4000153 (b940e660)
>> [12548.949876] ---[ end trace c9cfd11479961f0c ]---
>> [12548.954515] Kernel panic - not syncing: Fatal exception in interrupt
>> [12548.960900] SMP: stopping secondary CPUs
>> [12548.964845] Kernel Offset: 0x27e284d50000 from 0xffff000008000000
>> [12548.970967] CPU features: 0x002000
>> [12548.974384] Memory Limit: none
>>
>> It's interesting that the wrong CPU messages are still appearing despite the
>> irqbalance change from MarkZ. I disabled irqbalance and tried starting it in
>> single queue mode, and it did the same thing.
>>
>
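
As a side note for anyone reading the oops above: the "Mem abort info" and
"Data abort info" lines are the ESR value unpacked into its fields. Below is a
minimal sketch in Python, assuming the ESR_ELx layout from the Arm
architecture manual (EC in bits [31:26], IL in bit 25, ISS in bits [24:0])
and the data abort ISS encoding. decode_esr and EC_NAMES are hypothetical
names used for illustration only; they are not kernel code.

```python
# Sketch: unpack an AArch64 ESR_ELx value for a data abort.
# decode_esr and EC_NAMES are illustrative names, not kernel identifiers.

# Only the two data abort exception classes are listed here.
EC_NAMES = {
    0x24: "DABT (lower EL)",
    0x25: "DABT (current EL)",
}


def decode_esr(esr):
    """Return the ESR fields that the oops prints, as a dict."""
    ec = (esr >> 26) & 0x3F      # exception class, bits [31:26]
    il = (esr >> 25) & 0x1       # instruction length, bit 25
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits [24:0]
    return {
        "ec": ec,
        "class": EC_NAMES.get(ec, "other"),
        "il_bits": 32 if il else 16,
        "iss": iss,
        "isv": (iss >> 24) & 0x1,
        "set": (iss >> 11) & 0x3,
        "fnv": (iss >> 10) & 0x1,
        "ea": (iss >> 9) & 0x1,
        "cm": (iss >> 8) & 0x1,
        "s1ptw": (iss >> 7) & 0x1,
        "wnr": (iss >> 6) & 0x1,   # 0 = read, 1 = write
        "dfsc": iss & 0x3F,        # data fault status code
    }


if __name__ == "__main__":
    for name, value in decode_esr(0x96000004).items():
        print(name, "=", hex(value) if isinstance(value, int) else value)
```

For 0x96000004 this gives EC 0x25 (data abort at the current EL), a 32-bit
instruction, WnR = 0 (a read), and DFSC 0x04, which is a level 0 translation
fault. That is consistent with the "address between user and kernel address
ranges" line in the log.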