lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a4b2bf44-ef21-6cc8-f79f-d0e41c8f037f@huawei.com>
Date:   Fri, 10 Dec 2021 10:35:56 +0000
From:   John Garry <john.garry@...wei.com>
To:     Damien Le Moal <damien.lemoal@...nsource.wdc.com>,
        <Ajish.Koshy@...rochip.com>
CC:     <jinpu.wang@...ud.ionos.com>, <jejb@...ux.ibm.com>,
        <martin.petersen@...cle.com>, <Viswas.G@...rochip.com>,
        <linux-scsi@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <Niklas.Cassel@....com>,
        <Vasanthalakshmi.Tharmarajan@...rochip.com>
Subject: Re: [PATCH] scsi: pm8001: Fix phys_to_virt() usage on dma_addr_t

On 09/12/2021 23:55, Damien Le Moal wrote:
>> Earlier today it was the mount command which was hanging. From debugging
>> that, I found that the very first SSP command when mounting is sent the
>> HW successfully but no completion interrupt ever occurs there - I really
>> don't know why. Other SSP commands complete successfully before this and
>> after (TMFs in the error handling), including ones which have sgls.
>>
>> sda is a SAS drive, but I think SATA has the same issue - I was just
>> looking at sda.
>>
>> One thing I noticed in the driver is that it uses mb() in between
>> writing to the DMA memory and initiating the HW - I don't think mb is
>> strong enough. However I don't think that is my issue - it wouldn't fail
>> reliably if it was.
> Weird. 

Yeah, quite strange.

I will also note that these earlier logs are also red flags, which I 
have not investigated:

[87.288239] sas: target proto 0x0 at 500e004aaaaaaa1f:0x10 not handled
[87.294793] sas: ex 500e004aaaaaaa1f phy16 failed to discover

> I do not have an arm host to test. Could it be that the card FW is
> crashing ?

But the later TMFs seem to succeed, so I doubt it's crashing. I did 
wonder if it's going into some low-power/idle mode and just not 
responding, but not sure on that.

> Can you recover from the above ? 

It never really recovers and is always caught up in some error handling.

> Or do you have to power cycle for
> the HDD to be accessible again ?

Power cycle is necessary to recover as we can't remove the driver when 
it is in error handling

> 
> Other possibility may be an IRQ controller issue with the platform ?
> 

Highly unlikely. I did wonder if the interrupts are properly allocated 
and requested, and they look ok from /proc/interrupts

I also tried limiting the CPUs we bring up to a single CPU and so that 
we only use a single MSIx and a single HW queue, and now get this crash:

[7.775168] loop: module loaded
[7.783226] pm80xx 0000:04:00.0: Adding to iommu group 0
[7.795787] pm80xx 0000:04:00.0: pm80xx: driver version 0.1.40
[7.806789] pm80xx 0000:04:00.0: enabling device (0140 -> 0142)
[7.818910] :: pm8001_pci_alloc  530:Setting link rate to default value
[8.866618] scsi host0: pm80xx
[8.879056] pm80xx0:: process_oq  4169:Firmware Fatal error! Regval:0xc0f
[8.885842] pm80xx0:: print_scratchpad_registers 
4130:MSGU_SCRATCH_PAD_0: 0x40002000
[8.893661] pm80xx0:: print_scratchpad_registers 
4132:MSGU_SCRATCH_PAD_1:0xc0f
[8.900958] pm80xx0:: print_scratchpad_registers 
4134:MSGU_SCRATCH_PAD_2: 0x0
[8.908169] pm80xx0:: print_scratchpad_registers 
4136:MSGU_SCRATCH_PAD_3: 0x30000000
[8.915986] pm80xx0:: print_scratchpad_registers 
4138:MSGU_HOST_SCRATCH_PAD_0: 0x0
[8.923630] pm80xx0:: print_scratchpad_registers 
4140:MSGU_HOST_SCRATCH_PAD_1: 0x0
[8.931274] pm80xx0:: print_scratchpad_registers 
4142:MSGU_HOST_SCRATCH_PAD_2: 0x0
[8.938917] pm80xx0:: print_scratchpad_registers 
4144:MSGU_HOST_SCRATCH_PAD_3: 0x0
[8.946561] pm80xx0:: print_scratchpad_registers 
4146:MSGU_HOST_SCRATCH_PAD_4: 0x0
[8.954205] pm80xx0:: print_scratchpad_registers 
4148:MSGU_HOST_SCRATCH_PAD_5: 0x0
[8.961849] pm80xx0:: print_scratchpad_registers 
4150:MSGU_RSVD_SCRATCH_PAD_0: 0x0
[8.969493] pm80xx0:: print_scratchpad_registers 
4152:MSGU_RSVD_SCRATCH_PAD_1: 0x0
[8.977143] Unable to handle kernel NULL pointer dereference at virtual 
address 0000000000000018
[8.994782] Mem abort info:
[8.997565]   ESR = 0x96000004
[9.006782]   EC = 0x25: DABT (current EL), IL = 32 bits
[9.018781]   SET = 0, FnV = 0
[9.021824]   EA = 0, S1PTW = 0
[9.030797]   FSC = 0x04: level 0 translation fault
[9.038794] Data abort info:
[9.041662]   ISV = 0, ISS = 0x00000004
[9.050781]   CM = 0, WnR = 0
[9.053737] [0000000000000018] user address but active_mm is swapper
[9.070782] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[9.076343] Modules linked in:
[9.079387] CPU: 0 PID: 20 Comm: kworker/0:2 Not tainted 
5.16.0-rc4-00002-ge23d68774178-dirty #328
[9.088333] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - 
V1.16.01 03/15/2019
[9.096844] Workqueue: pm80xx pm8001_work_fn
[9.101108] pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[9.108057] pc : pm8001_work_fn+0x298/0x690
[9.112229] lr : process_one_work+0x1d0/0x354
[9.116574] sp : ffff800012d23d50
[9.119876] x29: ffff800012d23d50 x28: 0000000000000000 x27: 0000000000000000
[9.127000] x26: ffff8000117e8bc0 x25: ffff8000113aaeb0 x24: ffff00209d4b0000
[9.134124] x23: ffff0020ad23b280 x22: ffff00209d4b8000 x21: 0000000000000001
[9.141249] x20: 0000000000000000 x19: 0000000000000038 x18: 0000000000000000
[9.148373] x17: 4441505f48435441 x16: 5243535f44565352 x15: 000000052ff6b548
[9.155496] x14: 0000000000000018 x13: 0000000000000018 x12: 0000000000000000
[9.162620] x11: 0000000000000014 x10: 00000000000009a0 x9 : ffff002086ef6074
[9.169743] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff002086ef6074
[9.176867] x5 : 0000787830386d70 x4 : ffff00209d5e0000 x3 : 0000000000000000
[9.183990] x2 : ffff00209d5e0038 x1 : ffff800010a20120 x0 : 0000000000000051
[9.191114] Call trace:
[9.193547]  pm8001_work_fn+0x298/0x690
[9.197372]  process_one_work+0x1d0/0x354
[9.201369]  worker_thread+0x13c/0x470
[9.205105]  kthread+0x17c/0x190
[9.208321]  ret_from_fork+0x10/0x20
[9.211886] Code: 17fffff1 310006bf 54fffde0 f9400c54 (f9400e80)
[9.217968] ---[ end trace de649a9be2843866 ]---
[9.339812] pm80xx0:: process_oq  4169:Firmware Fatal error! Regval:0xc0f
[9.346602] pm80xx0:: print_scratchpad_registers 
4130:MSGU_SCRATCH_PAD_0: 0x40002000
[9.354420] pm80xx0:: print_scratchpad_registers 
4132:MSGU_SCRATCH_PAD_1:0xc0f
[9.361717] pm80xx0:: print_scratchpad_registers 
4134:MSGU_SCRATCH_PAD_2: 0x0
[9.368927] pm80xx0:: print_scratchpad_registers 
4136:MSGU_SCRATCH_PAD_3: 0x30000000
[9.376744] pm80xx0:: print_scratchpad_registers 
4138:MSGU_HOST_SCRATCH_PAD_0: 0x0
[9.384388] pm80xx0:: print_scratchpad_registers 
4140:MSGU_HOST_SCRATCH_PAD_1: 0x0
[9.392032] pm80xx0:: print_scratchpad_registers 
4142:MSGU_HOST_SCRATCH_PAD_2: 0x0
[9.399676] pm80xx0:: print_scratchpad_registers 
4144:MSGU_HOST_SCRATCH_PAD_3: 0x0
[9.407319] pm80xx0:: print_scratchpad_registers 
4146:MSGU_HOST_SCRATCH_PAD_4: 0x0
[9.414963] pm80xx0:: print_scratchpad_registers 
4148:MSGU_HOST_SCRATCH_PAD_5: 0x0
[9.422607] pm80xx0:: print_scratchpad_registers 
4150:MSGU_RSVD_SCRATCH_PAD_0: 0x0
[9.430251] pm80xx0:: print_scratchpad_registers 
4152:MSGU_RSVD_SCRATCH_PAD_1: 0x0
[   10.028906] Freeing initrd memory: 413456K

...

Thanks,
John

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ