[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cf6e78b6-e4e2-faab-f8c6-19dc462b1d74@marvell.com>
Date: Mon, 27 Nov 2023 18:29:19 +0100
From: Igor Russkikh <irusskikh@...vell.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
CC: Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, Netdev <netdev@...r.kernel.org>
Subject: Re: [EXT] Aquantia ethernet driver suspend/resume issues
Hi Linus,
> and now the slab cache is corrupt and the system is dead.
>
> My *guess* is that what is going on is that when the kcalloc() failued
> (because it tries to allocate a large area, and it has only been
> tested at boot-time when it succeeds), we end up doing that
You are probably right root causing the issue here, and the bad thing is
that direct fix will not solve the problem.
I'm trying to repro this on my side with some artificially increased structure
sizes, but no success so far.
Datapath initialization in driver requires normally 8 vectors each having rx/tx
pair of rings, in total making 8*(4096+2048)*64 = ~ 3Mb of memory.
...
> Anyway, I suspect a fix for the fatal error might be something like
> the attached, but I think the *root* of the problem is how the
> aquantia driver tried to allocate a humongous buff_ring with kmalloc,
> which really doesn't work. You can see that "order:6", ie we're
> talking an allocation > 100kB, and in low-memory situations that kind
> of kmalloc space simply isn't available. It *will* fail.
Correct.
It seems after some series of pm-related changes the driver logic was changed to always
reinit full sw structures during suspend/resume, which is obviously an overkill.
...
>
> I don't know what the right fix is, but *one* fix would certainly be
> to not tear everything down at suspend time, only to build it up again
> at resume.
>
> And please please please don't double-free things randomly (if that is
> what was going on, but it does look like it was).
You are right its also a potential double free problem (may be not manifesting),
but aq_ring_free is being called from "same level" aq_ring_alloc, but also then
from aq_ring_tx_alloc caller error handler.
Attached is my draft patch to improve that.
Still looking into the bigger problem of fixing suspend/resume.
Thanks
Igor
View attachment "0001-net-atlantic-fixed-double-free-on-reinit.patch" of type "text/plain" (8274 bytes)
Powered by blists - more mailing lists