netdev - Re: [EXT] Aquantia ethernet driver suspend/resume issues


Message-ID: <32a0ccb2-9570-4099-961c-6a53e1a553d7@pwaller.net>
Date: Tue, 23 Jan 2024 21:02:09 +0000
From: Peter Waller <p@...ller.net>
To: Igor Russkikh <irusskikh@...vell.com>
Cc: Jakub Kicinski <kuba@...nel.org>,
 Linus Torvalds <torvalds@...ux-foundation.org>,
 Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
 Netdev <netdev@...r.kernel.org>
Subject: Re: [EXT] Aquantia ethernet driver suspend/resume issues

Here's part of the log; I can provide more off-list if it helps. - Peter

<previous boot> Filesystems sync: 0.014 seconds
<n>.678271 Freezing user space processes
<n>.678366 Freezing user space processes completed (elapsed 0.001 seconds)
<n>.678383 OOM killer disabled.
<n>.678397 Freezing remaining freezable tasks
<n>.678407 Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
<n>.678423 printk: Suspending console(s) (use no_console_suspend to debug)
<n>.678437 serial 00:04: disabled
<n>.678654 queueing ieee80211 work while going to suspend
<n>.678680 sd 9:0:0:0: [sda] Synchronizing SCSI cache
<n>.678884 ata10.00: Entering standby power mode
<n>.678900 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0xfc80b000 flags=0x0020]
<n>.679124 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0xffeae520 flags=0x0020]
<n>.679270 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0xfc80c000 flags=0x0020]
<n>.679411 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014 address=0xffeae530 flags=0x0020]
<n>.679541 ACPI: EC: interrupt blocked
<n>.679554 amdgpu 0000:03:00.0: amdgpu: MODE1 reset
<n>.679682 amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
<n>.679803 amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
<n>.679919 ACPI: PM: Preparing to enter system sleep state S3
<n>.679931 ACPI: EC: event blocked
<n>.679942 ACPI: EC: EC stopped
<n>.679952 ACPI: PM: Saving platform NVS memory
<n>.679959 Disabling non-boot CPUs ...
<snip>
<n>.682471 atlantic 0000:0c:00.0 eno2: atlantic: link change old 1000 new 0
<snip>
<n>.687497 PM: suspend exit
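
For anyone wanting to tally these IO_PAGE_FAULT events across a longer log, here is a minimal Python sketch. It assumes the message layout seen in the excerpt above; the function name `summarise_faults` and the `SAMPLE` text are illustrative only and are not part of any kernel or driver tooling.

```python
import re
from collections import defaultdict

# Matches AMD-Vi fault reports in the layout seen in the log above. The \s+
# between fields tolerates lines that a mail client wrapped after
# "IO_PAGE_FAULT".
FAULT_RE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT\s+"
    r"domain=(?P<domain>0x[0-9a-f]+)\s+"
    r"address=(?P<address>0x[0-9a-f]+)\s+"
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def summarise_faults(log_text):
    """Return {(driver, pci_address): [faulting addresses as ints, in log order]}."""
    faults = defaultdict(list)
    for m in FAULT_RE.finditer(log_text):
        faults[(m["driver"], m["bdf"])].append(int(m["address"], 16))
    return dict(faults)


# Two entries copied from the excerpt above, wrapped as they were in the mail.
SAMPLE = """\
<n>.678900 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0014 address=0xfc80b000 flags=0x0020]
<n>.679124 atlantic 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT
domain=0x0014 address=0xffeae520 flags=0x0020]
"""

if __name__ == "__main__":
    for (driver, bdf), addrs in summarise_faults(SAMPLE).items():
        print(driver, bdf, len(addrs), [hex(a) for a in addrs])
```

Feeding it the saved output of `dmesg` or `journalctl -k` instead of `SAMPLE` should give a per-device count that can be compared between boots.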

On 23/01/2024 15:13, Peter Waller wrote:
> True, it is a warning rather than a hard crash, though shutdown hangs. Thanks for the workaround.
>
> I can provide more dmesg when I’m back at my computer. Do you need the whole thing or is there something in particular you want from it? From memory there isn’t much more in the way of messages that looked connected to me.
>
> Sent from my mobile, please excuse brevity
>
>> On 23 Jan 2024, at 14:59, Igor Russkikh <irusskikh@...vell.com> wrote:
>>
>> 
>>> On 1/21/2024 10:05 PM, Peter Waller wrote:
>>> I see a fix for double free [0] landed in 6.7; I've been running that
>>> for a few days and have hit a resume from suspend issue twice. Stack
>>> trace looks a little different (via __iommu_dma_map instead of
>>> __iommu_dma_free), provided below.
>>>
>>> I've had resume issues with the atlantic driver since I've had this
>>> hardware, but it went away for a while and seems as though it may have
>>> come back with 6.7. (No crashes since logs begin on Dec 15 till Jan 12,
>>> Upgrade to 6.7; crashes 20th and 21st, though my usage style of the
>>> system has also varied, maybe crashes are associated with higher memory
>>> usage?).
>> Hi Peter,
>>
>> Are these hard crashes, or just warnings you see in dmesg?
>> From the log you provided it looks like a warning, meaning the system is usable
>> and the driver can be restored with an interface down/up sequence.
>>
>> If so, then this is somewhat expected, because I'm still looking into
>> how to refactor this suspend/resume cycle to reduce memory usage.
>> A permanent workaround would be to reduce the rx/tx ring sizes with something like
>>
>>     ethtool -G <interface> rx 1024 tx 1024
>>
>> If it's a hard panic, we should look deeper into it.
>>
>>> Possibly unrelated, but I also see fairly frequent (one to ten times per
>>> boot, since logs begin?) messages in my logs of the form "atlantic
>>> 0000:0c:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0014
>>> address=0xffce8000 flags=0x0020]".
>> This seems to be unrelated, but it basically indicates that the HW or FW is
>> trying to access unmapped memory addresses, and the IOMMU catches that.
>> A full dmesg may help analyze this.
>>
>> Regards
>>   Igor
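
The ring-size workaround quoted above can be scripted. The following is a minimal sketch, assuming the interface is `eno2` (the name shown in the log) and that `ethtool` is installed; the helper names `ring_resize_cmd` and `apply_ring_sizes` are hypothetical. Changing ring sizes normally needs root, and as far as I know the setting is not kept across reboots unless it is reapplied.

```python
import shutil
import subprocess


def ring_resize_cmd(iface, rx=1024, tx=1024):
    """Build the argv for `ethtool -G <iface> rx <rx> tx <tx>`."""
    if not iface:
        raise ValueError("an interface name is required")
    if rx <= 0 or tx <= 0:
        raise ValueError("ring sizes must be positive")
    return ["ethtool", "-G", iface, "rx", str(rx), "tx", str(tx)]


def apply_ring_sizes(iface, rx=1024, tx=1024, dry_run=True):
    """Return the command; run it only if dry_run is False and ethtool exists."""
    cmd = ring_resize_cmd(iface, rx, tx)
    if dry_run or shutil.which("ethtool") is None:
        return cmd
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Dry run by default: only prints the command that would be executed.
    print(" ".join(apply_ring_sizes("eno2")))
```

Running `ethtool -g eno2` before and after shows the current and maximum ring sizes, which is a quick way to confirm the change took effect.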