linux-kernel - Re: [PATCH net] net: ps3_gelic_net: handle skb allocation failures

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <aRRUiYIrOcpSiakH@lithos>
Date: Wed, 12 Nov 2025 10:34:01 +0100
From: Florian Fuchs <fuchsfl@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: Geoff Levand <geoff@...radead.org>, netdev@...r.kernel.org,
	Andrew Lunn <andrew+netdev@...n.ch>,
	"David S . Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
	Madhavan Srinivasan <maddy@...ux.ibm.com>,
	Michael Ellerman <mpe@...erman.id.au>,
	Nicholas Piggin <npiggin@...il.com>,
	Christophe Leroy <christophe.leroy@...roup.eu>,
	linuxppc-dev@...ts.ozlabs.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net] net: ps3_gelic_net: handle skb allocation failures

Hi Jakub,

On 11 Nov 18:04, Jakub Kicinski wrote:
> On Mon, 10 Nov 2025 12:45:23 +0100 Florian Fuchs wrote:
> > Steps to reproduce the issue:
> > 	1. Start a continuous network traffic, like scp of a 20GB file
> > 	2. Inject failslab errors using the kernel fault injection:
> > 	    echo -1 > /sys/kernel/debug/failslab/times
> > 	    echo 30 > /sys/kernel/debug/failslab/interval
> > 	    echo 100 > /sys/kernel/debug/failslab/probability
> > 	3. After some time, traces start to appear, kernel Oopses
> > 	   and the system stops
> > 
> > Step 2 is not always necessary, as it is usually already triggered by
> > the transfer of a big enough file.
> 
> Have you actually tested this on a real device?
> Please describe the testing you have done rather that "how to test".

Yes, of course, I intensively tested the patch on a Sony PS3 (CECHL04
PAL). I ran the final fix for many hours, with continuous system load
and high network transfer load. I am happy to get feedback on better or
acceptable testing.

My testing consisted of:
1. Produce Oops: Test the kernel without any gelic patches, scp a big
   file to usb stick and create high cpu/memory load (like compiling
   some software) or extract verbose, tar xv, a big file via ssh
2. Safely re-produce the Oops using failslab injection, so I dont need
   to wait for it
3. Develop against that failslab injection, high load and network
   transfer
4. First solution was to just always refill the chain, which resulted in
   RX stall after some time, as the dmac seemed to be stopped, when buffer
   was full and NOT_IN_USE head found and needed rmmod/modprobe to work
   again
5. Run the final fix for many hours while injecting failslabs, high load,
   and high network load with continuous scp and netcat
6. Further massive improvement is to convert the driver to use
   napi_gro_receive and napi_skb_alloc, but this would be a separate
   patch

> > --- a/drivers/net/ethernet/toshiba/ps3_gelic_net.c
> > +++ b/drivers/net/ethernet/toshiba/ps3_gelic_net.c
> > @@ -259,6 +259,7 @@ void gelic_card_down(struct gelic_card *card)
> >  	mutex_lock(&card->updown_lock);
> >  	if (atomic_dec_if_positive(&card->users) == 0) {
> >  		pr_debug("%s: real do\n", __func__);
> > +		timer_delete_sync(&card->rx_oom_timer);
> >  		napi_disable(&card->napi);
> 
> I think the ordering here should be inverted

I thought, that there might be a race condition in the inverted order
like that napi gets re-enabled by the timer in between of the down:

1. napi_disable
2. rx_oom_timer runs and calls napi_schedule again
3. timer_delete_sync

So the timer is deleted first, to prevent any possibility to run.

> TBH handling the OOM inside the Rx function seems a little fragile.
> What if there is a packet to Rx as we enter. I don't see any loop here
> it just replaces the used buffer..

I am not sure, the handling needs to happen, when the skb allocation
fails, and that happens in the rx function, right? I am open to better
fitting fix position.

Thank you for your feedback!
Florian