Date:   Fri, 6 Nov 2020 21:41:37 +0100
From:   David Hildenbrand <david@...hat.com>
To:     Pavel Procopiuc <pavel.procopiuc@...il.com>
Cc:     Vlastimil Babka <vbabka@...e.cz>,
        Kalle Valo <kvalo@...eaurora.org>, ath11k@...ts.infradead.org,
        linux-mm@...ck.org, akpm@...ux-foundation.org,
        linux-kernel@...r.kernel.org, linux-wireless@...r.kernel.org
Subject: Re: Regression: QCA6390 fails with "mm/page_alloc: place pages to
 tail in __free_pages_core()"

On 06.11.20 18:32, Pavel Procopiuc wrote:
> On 05.11.2020 at 21:23, David Hildenbrand wrote:
>>> So just to make sure I understand you correctly, you'd like to see if the problem with ath11k driver on my hardware persists when I boot pristine 5.10-rc2 kernel (without reverting commit 7fef431be9c9ac255838a9578331567b9dba4477) and with page_alloc.shuffle=1, right?
>>>
>>
>> Right, but as lists are randomized then it might take a couple of tries to reproduce. I'll have a look at the driver code / failing path on Monday, when back to work.
> 
> I have done 5 boots of pristine 5.10-rc2 with page_alloc.shuffle=1. Out of those: 1st, 2nd, 4th and 5th resulted in
> working ath11k driver, logs were the same as with the commit 7fef431be9c9ac255838a9578331567b9dba4477 reverted. The 3rd
> one failed, but in a different way, I just had no output from the driver after initialization lines:
> 
> Nov 06 18:19:41 razor kernel: Linux version 5.10.0-rc2 (root@...or) (gcc (Gentoo 9.3.0-r1 p3) 9.3.0, GNU ld (Gentoo 2.34
> p6) 2.34.0) #8 SMP Fri Nov 6 18:14:36 CET 2020
> Nov 06 18:19:41 razor kernel: pci 0000:05:00.0: [17cb:1101] type 00 class 0x028000
> Nov 06 18:19:41 razor kernel: pci 0000:05:00.0: reg 0x10: [mem 0xd2100000-0xd21fffff 64bit]
> Nov 06 18:19:41 razor kernel: pci 0000:05:00.0: PME# supported from D0 D3hot D3cold
> Nov 06 18:19:41 razor kernel: pci 0000:05:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at
> 0000:00:1c.1 (capable of 7.876 Gb/s with 8.0 GT/s PCIe x1 link)
> Nov 06 18:19:41 razor kernel: pci 0000:05:00.0: Adding to iommu group 21
> Nov 06 18:19:42 razor kernel: ath11k_pci 0000:05:00.0: WARNING: ath11k PCI support is experimental!
> Nov 06 18:19:42 razor kernel: ath11k_pci 0000:05:00.0: BAR 0: assigned [mem 0xd2100000-0xd21fffff 64bit]
> Nov 06 18:19:42 razor kernel: ath11k_pci 0000:05:00.0: enabling device (0000 -> 0002)
> Nov 06 18:19:42 razor kernel: mhi 0000:05:00.0: Requested to power ON
> Nov 06 18:19:42 razor kernel: mhi 0000:05:00.0: Power on setup success
> 
> I had this before and usually it was fixed after rebooting into Windows and back. This time I just went and rebooted
> into Linux again and driver was working on that boot (4th).

I'm sorry, but "WARNING: ath11k PCI support is experimental!" and such 
occasional issues don't give me the best feeling that everything is 
operating as it should :)

> 
> After that I removed page_alloc.shuffle=1 and did 2 additional boots, both of them resulted in a non-working driver with
> the error messages about not being able to talk to firmware like I had before on the clean 5.10-rc2:
> 
> Nov 06 18:24:07 razor kernel: Linux version 5.10.0-rc2 (root@...or) (gcc (Gentoo 9.3.0-r1 p3) 9.3.0, GNU ld (Gentoo 2.34
> p6) 2.34.0) #9 SMP Fri Nov 6 18:22:43 CET 2020
> Nov 06 18:24:07 razor kernel: pci 0000:05:00.0: [17cb:1101] type 00 class 0x028000
> Nov 06 18:24:07 razor kernel: pci 0000:05:00.0: reg 0x10: [mem 0xd2100000-0xd21fffff 64bit]
> Nov 06 18:24:07 razor kernel: pci 0000:05:00.0: PME# supported from D0 D3hot D3cold
> Nov 06 18:24:07 razor kernel: pci 0000:05:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at
> 0000:00:1c.1 (capable of 7.876 Gb/s with 8.0 GT/s PCIe x1 link)
> Nov 06 18:24:07 razor kernel: pci 0000:05:00.0: Adding to iommu group 21
> Nov 06 18:24:08 razor kernel: ath11k_pci 0000:05:00.0: WARNING: ath11k PCI support is experimental!
> Nov 06 18:24:08 razor kernel: ath11k_pci 0000:05:00.0: BAR 0: assigned [mem 0xd2100000-0xd21fffff 64bit]
> Nov 06 18:24:08 razor kernel: ath11k_pci 0000:05:00.0: enabling device (0000 -> 0002)
> Nov 06 18:24:08 razor kernel: mhi 0000:05:00.0: Requested to power ON
> Nov 06 18:24:08 razor kernel: mhi 0000:05:00.0: Power on setup success
> Nov 06 18:24:08 razor kernel: ath11k_pci 0000:05:00.0: Respond mem req failed, result: 1, err: 0
> Nov 06 18:24:08 razor kernel: ath11k_pci 0000:05:00.0: qmi failed to respond fw mem req:-22
> Nov 06 18:24:13 razor kernel: ath11k_pci 0000:05:00.0: qmi failed memory request, err = -110
> Nov 06 18:24:13 razor kernel: ath11k_pci 0000:05:00.0: qmi failed to respond fw mem req:-110
> Nov 06 18:25:39 razor kernel: mhi 0000:05:00.0: Device failed to exit MHI Reset state
> 

Okay, that means that you should be able to reproduce the issue on a
kernel from before 7fef431be9c9ac255838a9578331567b9dba4477 with
page_alloc.shuffle=1 as well ... it just might take a lot of tries to
get a problematic page.

I could also imagine that deferring the driver load until after quite
some system/mm activity could result in the same issue.

Looks like something either cannot handle a specific address we received 
via dma_alloc_coherent(), or something is reading out of bounds, and the 
content after our allocated page doesn't have the expected value anymore 
(e.g., used to be zero, now no longer zero).
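
To illustrate the second possibility, here is a minimal userspace
sketch in Python. It is not driver code and it does not claim to be
what ath11k or the firmware actually does: the flat bytearray standing
in for physical memory, the 16-byte overrun, and the helper name
overrun_sees_nonzero() are all assumptions made up for illustration.
It only shows why a consumer that reads past its allocation can work
while the neighbouring page happens to be zero and break once a
different page ordering puts used memory next to it.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


def overrun_sees_nonzero(memory, start, alloc_len, read_len):
    """Return True if a reader consuming read_len bytes from start
    picks up non-zero bytes beyond the alloc_len bytes it owns."""
    overrun = memory[start + alloc_len:start + read_len]
    return any(overrun)


# Two adjacent pages: page 0 is "our" allocation, page 1 is a neighbour.
zero_neighbour = bytearray(2 * PAGE_SIZE)

used_neighbour = bytearray(2 * PAGE_SIZE)
used_neighbour[PAGE_SIZE + 8] = 0xFF  # neighbour page now holds data

# Hypothetical buggy consumer that reads 16 bytes too many.
buggy_read = PAGE_SIZE + 16

print(overrun_sees_nonzero(zero_neighbour, 0, PAGE_SIZE, buggy_read))
print(overrun_sees_nonzero(used_neighbour, 0, PAGE_SIZE, buggy_read))
```

A reader that stays within PAGE_SIZE is unaffected by the neighbour in
either case, which is why such a bug could stay hidden until the page
handed out changes.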

What puzzles me is that "err: 0". That should have been properly set by 
HW, no?
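
For reference when reading the quoted log: the negative values there
look like ordinary Linux errno codes, where 22 is EINVAL and 110 is
ETIMEDOUT. A small sketch to decode them follows. The regular
expression and the decode() helper are assumptions written for these
particular log lines, and the table is hardcoded with the Linux values
rather than taken from Python's errno module, since that module
reflects whatever platform the script runs on. The "err: 0" in the
first failure line is deliberately not decoded, because it appears to
be a separate response field and not an errno.

```python
import re

# Linux errno values (asm-generic), hardcoded so the result does not
# depend on the platform running this script.
LINUX_ERRNO = {22: "EINVAL", 110: "ETIMEDOUT"}

# Matches the "req:-22" and "err = -110" forms seen in the quoted log.
PATTERN = re.compile(r"(?:req:|err = )\s*-(\d+)")


def decode(line):
    """Return the errno name for a negative code in line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return LINUX_ERRNO.get(code, f"errno {code}")


lines = [
    "ath11k_pci 0000:05:00.0: Respond mem req failed, result: 1, err: 0",
    "ath11k_pci 0000:05:00.0: qmi failed to respond fw mem req:-22",
    "ath11k_pci 0000:05:00.0: qmi failed memory request, err = -110",
]

for line in lines:
    print(decode(line), "<-", line)
```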

-- 
Thanks,

David / dhildenb
