netdev - Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci


Message-ID: <c8cf2b24-c790-fa70-c2c5-474201743b4d@nvidia.com>
Date:   Thu, 2 Dec 2021 12:05:42 +0200
From:   Moshe Shemesh <moshe@...dia.com>
To:     Thorsten Leemhuis <regressions@...mhuis.info>,
        Niklas Schnelle <schnelle@...ux.ibm.com>,
        Amir Tzin <amirtz@...dia.com>,
        Saeed Mahameed <saeedm@...dia.com>
CC:     netdev <netdev@...r.kernel.org>, <regressions@...ts.linux.dev>,
        linux-s390 <linux-s390@...r.kernel.org>
Subject: Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may
 try to wait 245 million years


On 12/2/2021 8:52 AM, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking.
>
> On 20.11.21 17:38, Moshe Shemesh wrote:
>> Thank you for reporting, Niklas.
>>
>> This is actually a case of use after free: following that patch, the
>> recovery flow goes through mlx5_tout_cleanup() while the timeouts
>> structure is still needed in this flow.
>>
>> We know the root cause and will send a fix.
> That was twelve days ago, so allow me to ask: has any progress been
> made? I could not find any with a quick search on lore.


Yes, a fix was submitted by Saeed yesterday, titled "[net 10/13] net/mlx5:
Fix use after free in mlx5_health_wait_pci_up".
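
As a rough illustration of how a use after free could surface as an absurd timeout: with slab poisoning enabled, the kernel fills freed memory with the POISON_FREE byte (0x6b), so a stale read of a 64-bit field would return 0x6b repeated eight times, which would be consistent with the 0x6B...6B value in the report. The Python sketch below only simulates this with a byte buffer. The single-slot layout and the 60000 ms starting value are assumptions for illustration, not the actual mlx5 timeouts structure.

```python
import struct

# Value of POISON_FREE in the kernel's include/linux/poison.h.
POISON_FREE = 0x6B


def read_u64(memory: bytes, offset: int = 0) -> int:
    """Read an unsigned 64-bit integer from a byte buffer.

    Byte order does not matter once every byte is identical, so
    little-endian is used here purely for convenience.
    """
    return struct.unpack_from("<Q", memory, offset)[0]


# Hypothetical stand-in for the timeouts table: one u64 slot holding a
# timeout in milliseconds. 60000 is a made-up placeholder value.
timeouts = bytearray(struct.pack("<Q", 60_000))
before = read_u64(timeouts)

# Simulate the object being freed with poisoning enabled:
# every byte of the buffer is overwritten with 0x6b.
timeouts[:] = bytes([POISON_FREE]) * len(timeouts)
after = read_u64(timeouts)

print(before)
print(hex(after))
```

A reader that still holds a pointer to the freed table would then see the poison pattern instead of the intended timeout.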

> Ciao, Thorsten
>
>> On 11/19/2021 12:58 PM, Niklas Schnelle wrote:
>>> Hello Amir, Moshe, and Saeed,
>>>
>>> (resent due to wrong netdev mailing list address, sorry about that)
>>>
>>> During testing of PCI device recovery, I found a problem in the mlx5
>>> recovery support introduced in v5.16-rc1 by commit 32def4120e48
>>> ("net/mlx5: Read timeout values from DTOR"). My analysis of the
>>> problem follows.
>>>
>>> When the device is in an error state, at least on s390 but I believe
>>> also on other systems, it is isolated and all PCI MMIO reads return
>>> 0xff. This is detected by your driver and it will immediately attempt
>>> to recover the device with an mlx5_core driver-specific recovery
>>> mechanism. Since at this point no reset has been done that would take
>>> the device out of isolation, this will of course fail, as the driver
>>> can't communicate with the device. Under normal circumstances this
>>> reset would happen later, during the new recovery flow introduced in
>>> 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
>>> firmware has done its side of the recovery, allowing that flow to
>>> succeed after the driver-specific recovery has failed.
>>>
>>> With v5.16-rc1, however, the driver-specific recovery gets stuck
>>> holding locks, which blocks our recovery. Without our recovery
>>> mechanism, this can also be seen with
>>> "echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
>>> device lock forever.
>>>
>>> Digging into this, I tracked the problem down to
>>> mlx5_health_wait_pci_up() hanging. I added a debug print to it and it
>>> turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
>>> returns 7740398493674204011 (0x6B...6B) milliseconds and we try to wait
>>> 245 million years. After reverting that commit things work again,
>>> though of course the driver specific recovery flow will still fail
>>> before ours can kick in and finally succeed.
>>>
>>> Thanks,
>>> Niklas Schnelle
>>>
>>> #regzbot introduced: 32def4120e48
>>>
>>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my table. I can only look briefly into most of them. Unfortunately
> therefore I sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply. That's in everyone's interest, as
> what I wrote above might be misleading to everyone reading this; any
> suggestion I gave might thus send someone reading this down the
> wrong rabbit hole, which none of us wants.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CCed on
> all further activities wrt this regression.
>
> #regzbot poke
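
The "245 million years" figure in the subject line can be sanity-checked with a little arithmetic. The sketch below assumes the elided 0x6B...6B pattern is a full 64-bit value (eight 0x6b bytes) interpreted as milliseconds, and uses a 365.25-day year; both are assumptions made here rather than details stated in the thread.

```python
# Milliseconds in a 365.25-day year (assumed year length).
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25


def ms_to_years(milliseconds: int) -> float:
    """Convert a duration in milliseconds to years."""
    return milliseconds / MS_PER_YEAR


# Assumed expansion of the elided 0x6B...6B pattern to 64 bits.
timeout_ms = 0x6B6B6B6B6B6B6B6B

years = ms_to_years(timeout_ms)
print(f"{years / 1e6:.0f} million years")
```

Under these assumptions the result lands at roughly 245 million years, matching the wait described in the report.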