Open Source and information security mailing list archives
 
Message-ID: <5ba94d4f-65ae-befc-d977-cbad64fa984f@gmail.com>
Date:   Thu, 15 Apr 2021 13:44:20 +0100
From:   Edward Cree <ecree.xilinx@...il.com>
To:     Trevor Hemsley <themsley@...ceflex.com>
Cc:     Network Development <netdev@...r.kernel.org>
Subject: Re: Panic in sfc module on boot since 5.10

On 15/04/2021 10:03, Trevor Hemsley wrote:
> Hi,
> 
> I run Fedora 32 and since kernels in the 5.10 series I have been unable to boot without getting a panic in the sfc module. I tried on 5.11.12 tonight and the crash still occurs. I have tried reporting this via Fedora channels but the silence has been deafening
Seems Red Hat couldn't even be bothered to forward it to us :sigh:

> and I suspect this is an upstream issue anyway.
You could try building an upstream kernel and driver, and attempting to
 reproduce the issue there.  That would remove some of the unknowns.

> BUG: kernel NULL pointer dereference, address: 0000000000000104

> RIP: 0010:efx_farch_ev_process+0x3d2/0x910 [sfc]
> Code: c0 02 39 f0 76 34 c1 fe 02 41 03 b6 28 07 00 00 83 e1 03 49 8b 84 f6 d0 00 00 00 48 8b 94 c8 80 09 00 00 b0 01 00 00 00 31 c9 <f0> 8f b1 8a 04 81 00 00 05 c0 0f 05 37 03 00 00 48 8d 74 24 20 4c
Hmm, I think this is actually <f0> 0f b1 8a 04 01 00 00 85...
 which decodes as lock cmpxchg %ecx,0x104(%rdx)
With other transcription errors fixed, the key sequence appears to be
    mov $0x1,%eax
    xor %ecx,%ecx
    lock cmpxchg %ecx,0x104(%rdx)
So we're saying "if (rdx[0x104] == 1) rdx[0x104] = 0", only atomically.
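For readers less used to reading disassembly, the same compare-and-swap can be sketched in portable C11. This is only an illustration of the semantics: the helper name clear_if_one() is hypothetical, and the driver itself uses the kernel's atomic_cmpxchg() rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper, not driver code: atomically perform
 * "if (*v == 1) *v = 0" and report whether the store happened. */
bool clear_if_one(atomic_int *v)
{
    int expected = 1;   /* mov $0x1,%eax : the value we compare against */
    /* desired value 0 corresponds to xor %ecx,%ecx;
     * the call itself corresponds to lock cmpxchg */
    return atomic_compare_exchange_strong(v, &expected, 0);
}

/* Small usage example. */
void clear_if_one_demo(void)
{
    atomic_int flag = 1;

    assert(clear_if_one(&flag));       /* 1 -> 0, swap succeeds */
    assert(atomic_load(&flag) == 0);
    assert(!clear_if_one(&flag));      /* already 0, nothing to do */
}
```

In the oops, the pointer corresponding to v is derived from a NULL base plus 0x104, which is why the faulting address is 0000000000000104.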
I'd *guess* this is the atomic_cmpxchg() in efx_farch_handle_tx_flush_done()
 (though it'd be nice to have your sfc.ko, with debugging symbols, to
 check for certain).
Which in turn tells us that tx_queue is NULL; this is suspicious
 because the relevant commits
    a81dcd85a7c1 ("sfc: assign TXQs without gaps")
    12804793b17c ("sfc: decouple TXQ type from label")
 happened at about the right time to cause this regression.
So now I have to go off and figure out exactly what the semantics
 of this TX flush done event's 'subdata' field are... looks like it
 probably corresponds to tx_queue->queue from
 efx_farch_flush_tx_queue().
Unfortunately, there is no simple lookup to convert from qid to
 tx_queue, because we just allocate queues as-needed in
 efx_set_channels() and don't store the reverse mapping (everything
 else works by label rather than queue, so doesn't need it).
I think the right fix is probably just to have
 efx_farch_handle_tx_flush_done() (and presumably also
 efx_farch_handle_rx_flush_done()) iterate over all queues (or at
 least all queues on the channel that received the event; but
 possibly the events might always be delivered to channel 0 rather
 than necessarily the channel that owns the queue) and perform the
 handling on any queue whose qid matches.
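A rough sketch of that "iterate and match on qid" idea follows. It is not the eventual patch: every struct, field and function name below (sketch_channel, sketch_tx_queue, SKETCH_TXQ_PER_CHANNEL and so on) is hypothetical and stands in for the real driver structures, and the flush flag is a plain int here where the driver uses an atomic.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TXQ_PER_CHANNEL 4        /* illustrative value only */

struct sketch_tx_queue {
    unsigned int queue;                 /* hardware queue id (qid) */
    int in_use;                         /* non-zero if this slot was allocated */
    int flush_outstanding;              /* 1 while a flush is pending */
};

struct sketch_channel {
    struct sketch_tx_queue txq[SKETCH_TXQ_PER_CHANNEL];
};

/* Walk every queue on every channel and return the one whose hardware
 * id equals qid, or NULL if there is no such queue. */
struct sketch_tx_queue *
sketch_find_txq_by_qid(struct sketch_channel *channels, size_t n_channels,
                       unsigned int qid)
{
    for (size_t c = 0; c < n_channels; c++) {
        for (size_t i = 0; i < SKETCH_TXQ_PER_CHANNEL; i++) {
            struct sketch_tx_queue *q = &channels[c].txq[i];

            if (q->in_use && q->queue == qid)
                return q;
        }
    }
    return NULL;
}

/* Flush-done handling: returns 1 if a pending flush was completed.
 * An unknown qid is ignored instead of dereferencing a NULL queue. */
int sketch_handle_tx_flush_done(struct sketch_channel *channels,
                                size_t n_channels, unsigned int qid)
{
    struct sketch_tx_queue *q =
        sketch_find_txq_by_qid(channels, n_channels, qid);

    if (!q || q->flush_outstanding != 1)
        return 0;
    q->flush_outstanding = 0;
    return 1;
}
```

Searching all channels rather than only the one that received the event sidesteps the question of which channel the flush events are actually delivered to, at the cost of a linear scan on a path that should be rare.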
I will follow up with a patch, hopefully some time next week if I can
 find a 6122F to test with.

> Just prior to the crash I get a pair of messages that don't look particularly right but I get these on 5.9.16 too and that survives.
> 
> [    9.027961] sfc 0000:0b:00.0 enp11s0f0np0: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0
> [    9.029895] sfc 0000:0b:00.1 enp11s0f1np1: MC command 0x2a inlen 16 failed rc=-22 (raw=0) arg=0

0x2a is MC_CMD_SET_LINK, which gets called in a variety of situations
 like MTU change, link advertising change (e.g. ethtool -s), and SFP+
 module hotplug.  An -EINVAL failure typically means we've asked for
 some combination of link modes that is unsupported or nonsensical; to
 investigate this further you could try with the mcdi_logging_default=1
 module parameter, which will log all MC commands and responses at
 KERN_INFO — these can then be decoded by reference to mcdi_pcol.h.
In any case this seems to be unrelated to the above issue.

-ed
