Message-ID: <CAHk-=whLsdX=Kr010LiM2smEu2rC3Hedwmuxtcp0pYtZvFj+=A@mail.gmail.com>
Date: Mon, 27 Nov 2023 10:02:06 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Igor Russkikh <irusskikh@...vell.com>
Cc: Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Netdev <netdev@...r.kernel.org>
Subject: Re: [EXT] Aquantia ethernet driver suspend/resume issues
On Mon, 27 Nov 2023 at 09:29, Igor Russkikh <irusskikh@...vell.com> wrote:
>
> I'm trying to repro this on my side with some artificially increased structure
> sizes, but no success so far.
So I suspect that one reason I triggered the problem was simply
because the suspend/resume happened while I walked away from the
computer when it was copying a few hundred gig of data from the old
SSD (over USB, so not hugely fast).
End result: the suspend/resume happened while the machine was actually
quite busy, and presumably a *lot* of memory was just used for disk
buffers etc. The "suspend when idle" logic doesn't really take
background tasks into account, and my logs leading up to the suspend
show
13:54:09 systemd[1]: Starting systemd-suspend.service - System Suspend...
13:54:09 wpa_supplicant[2401]: wlo2: CTRL-EVENT-DSCP-POLICY clear_all
13:54:09 systemd-sleep[12477]: Entering sleep state 'suspend'...
13:54:09 kernel: PM: suspend entry (deep)
13:54:19 systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
13:54:22 kernel: Filesystems sync: 12.738 seconds
14:06:30 kernel: Freezing user space processes
and while that last timestamp is bogus (the timestamp comes from
syslogd logging, and it actually happens at *resume*), you can see
that the filesystem activity was pretty significant with the sync
taking a long time, because the copy process was still on-going the
whole time. And it continued *after* the sync too.
So I - accidentally - ended up hitting a lot of "that's not great"
cases on this, that I wouldn't hit normally (because I obviously turn
off suspend-at-idle). All on hardware that isn't normally used for
suspend/resume anyway, so it probably has somewhat limited testing to
begin with.
For triggering it, you might try to change that
	self->buff_ring =
		kcalloc(self->size, sizeof(struct aq_ring_buff_s), GFP_KERNEL);
to use GFP_NOWAIT instead of GFP_KERNEL. That makes allocation
failures *much* more likely. It will still work at boot time.
Or just artificially make it fail with a "fail the Nth time you hit it".
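
As a rough illustration of the "fail the Nth time" idea, here is a minimal sketch. It is not the driver's code: the names fail_nth, alloc_calls, should_fail_alloc() and ring_buff_alloc() are made up for this example, and calloc() stands in for kcalloc() only so that the sketch builds in user space. In the driver, the same check would sit directly in front of the kcalloc() call quoted above.

```c
#include <stdlib.h>

/* Hypothetical knob: force the Nth allocation attempt (1-based) to fail.
 * Zero disables the injection. */
static unsigned int fail_nth = 2;

/* Number of allocation attempts seen so far. */
static unsigned int alloc_calls;

/* Returns 1 when the current attempt is the one chosen to fail. */
static int should_fail_alloc(void)
{
	alloc_calls++;
	return fail_nth != 0 && alloc_calls == fail_nth;
}

/* Stand-in for the ring buffer allocation. calloc() replaces kcalloc()
 * here purely so the example compiles outside the kernel. */
static void *ring_buff_alloc(size_t n, size_t size)
{
	if (should_fail_alloc())
		return NULL;	/* simulated allocation failure */

	return calloc(n, size);
}
```

With fail_nth set to 2, the first call succeeds (think: boot-time init) and the second one returns NULL (think: the resume path), which should exercise the same error handling without needing real memory pressure.
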
Also, make sure you don't have ridiculous amounts of memory in your
machine. I've got "only" 64GB in mine, which is small for a big
machine, and presumably a lot of it was used for buffer cache, and I'm
not sure what the device suspend/resume ordering was (ie disk might be
resumed after ethernet).
Linus