netdev - Re: [EXT] Aquantia ethernet driver suspend/resume issues

Message-ID: <CAHk-=whLsdX=Kr010LiM2smEu2rC3Hedwmuxtcp0pYtZvFj+=A@mail.gmail.com>
Date: Mon, 27 Nov 2023 10:02:06 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Igor Russkikh <irusskikh@...vell.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, 
	Paolo Abeni <pabeni@...hat.com>, Netdev <netdev@...r.kernel.org>
Subject: Re: [EXT] Aquantia ethernet driver suspend/resume issues

On Mon, 27 Nov 2023 at 09:29, Igor Russkikh <irusskikh@...vell.com> wrote:
>
> I'm trying to repro this on my side with some artificially increased structure
> sizes, but no success so far.

So I suspect that one reason I triggered the problem was simply
because the suspend/resume happened while I walked away from the
computer when it was copying a few hundred gig of data from the old
SSD (over USB, so not hugely fast).

End result: the suspend/resume happened while the machine was actually
quite busy, and presumably a *lot* of memory was just used for disk
buffers etc. The "suspend when idle" logic doesn't really take
background tasks into account, and my logs leading up to the suspend
show

  13:54:09  systemd[1]: Starting systemd-suspend.service - System Suspend...
  13:54:09  wpa_supplicant[2401]: wlo2: CTRL-EVENT-DSCP-POLICY clear_all
  13:54:09  systemd-sleep[12477]: Entering sleep state 'suspend'...
  13:54:09  kernel: PM: suspend entry (deep)
  13:54:19  systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
  13:54:22  kernel: Filesystems sync: 12.738 seconds
  14:06:30  kernel: Freezing user space processes

and while that last timestamp is bogus (the timestamp comes from
syslogd logging, and it actually happens at *resume*), you can see
that the filesystem activity was pretty significant with the sync
taking a long time, because the copy process was still on-going the
whole time. And it continued *after* the sync too.

So I - accidentally - ended up hitting a lot of "that's not great"
cases on this, that I wouldn't hit normally (because I obviously turn
off suspend-at-idle). All on hardware that isn't normally used for
suspend/resume anyway, so it probably has somewhat limited testing to
begin with.

For triggering it, you might try to change that

        self->buff_ring =
                kcalloc(self->size, sizeof(struct aq_ring_buff_s), GFP_KERNEL);

to use GFP_NOWAIT instead of GFP_KERNEL. That makes allocation
failures *much* more likely. It will still work at boot time.

Or just artificially make it fail with a "fail the Nth time you hit it".
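
As a rough illustration of the "fail the Nth time" idea, here is a minimal
userspace sketch. Everything in it is hypothetical (struct fail_nth,
ring_calloc(), struct demo_ring and demo_ring_alloc() are made-up names, and
calloc() stands in for kcalloc()); it is not the driver's code, just the
shape of a counter that forces one chosen allocation to return NULL so the
error path gets exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fault-injection state: fail one chosen call, counted from 1. */
struct fail_nth {
	unsigned long calls;	/* allocation attempts seen so far */
	unsigned long fail_on;	/* 1-based attempt to fail; 0 disables injection */
};

/* Count this attempt and report whether it is the one that should fail. */
static int fail_nth_should_fail(struct fail_nth *f)
{
	assert(f != NULL);
	f->calls++;
	return f->fail_on != 0 && f->calls == f->fail_on;
}

/* Userspace stand-in for the kcalloc() call; calloc() is an assumption here. */
static void *ring_calloc(struct fail_nth *f, size_t n, size_t size)
{
	if (fail_nth_should_fail(f))
		return NULL;	/* simulated allocation failure */
	return calloc(n, size);
}

/* Hypothetical, much-simplified ring; not the real aq_ring_s layout. */
struct demo_ring {
	void *buff_ring;
	size_t size;
};

/* Allocate the buffer array; returns 0 on success, -1 on failure. */
static int demo_ring_alloc(struct demo_ring *r, struct fail_nth *f, size_t size)
{
	r->size = size;
	r->buff_ring = ring_calloc(f, size, 64 /* placeholder element size */);
	if (!r->buff_ring) {
		r->size = 0;	/* leave the ring in a consistent empty state */
		return -1;
	}
	return 0;
}

/* Release the buffer array; safe to call on an already-empty ring. */
static void demo_ring_free(struct demo_ring *r)
{
	free(r->buff_ring);
	r->buff_ring = NULL;
	r->size = 0;
}
```

With fail_on set to 2, the first allocation succeeds, the second returns
NULL, and later ones succeed again, which lets a single suspend/resume cycle
hit the failure path on a chosen attempt.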

Also, make sure you don't have ridiculous amounts of memory in your
machine.  I've got "only" 64GB in mine, which is small for a big
machine, and presumably a lot of it was used for buffer cache, and I'm
not sure what the device suspend/resume ordering was (ie disk might be
resumed after ethernet).

                Linus