linux-kernel - RE: [PATCH] net: fec: handle page_pool_dev_alloc

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <AM9PR04MB85059A742116F7E57D0DC32288032@AM9PR04MB8505.eurprd04.prod.outlook.com>
Date: Tue, 24 Dec 2024 12:56:49 +0000
From: Wei Fang <wei.fang@....com>
To: Kevin Groeneveld <kgroeneveld@...brook.com>
CC: Kevin Groeneveld <kgroeneveld@...brook.com>, Shenwei Wang
	<shenwei.wang@....com>, Clark Wang <xiaoning.wang@....com>, Andrew Lunn
	<andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>, Eric
 Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
	<pabeni@...hat.com>, "imx@...ts.linux.dev" <imx@...ts.linux.dev>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] net: fec: handle page_pool_dev_alloc_pages error

> The fec_enet_update_cbd function called page_pool_dev_alloc_pages but did
> not handle the case when it returned NULL. There was a
> WARN_ON(!new_page)
> but it would still proceed to use the NULL pointer and then crash.
> 
> This case does seem somewhat rare but when the system is under memory
> pressure it can happen. One case where I can duplicate this with some
> frequency is when writing over a smbd share to a SATA HDD attached to an
> imx6q.
> 
> Example kernel panic:

Please simplify the log, it's too long.

[  112.357543] Unable to handle kernel NULL pointer dereference at virtual address 00000010 when read
[  112.370145] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[  112.397701] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[  112.404233] PC is at fec_enet_rx_napi+0x394/0xba0
[  112.408949] LR is at fec_enet_rx_napi+0x37c/0xba0
[  112.650805] 1fe0: 81067700 80e6befc 90981df8 80122bd8 8013ebd0 80122c78 8013ebd0 807d07d0
[  112.658986] Call trace:
[  112.658992]  fec_enet_rx_napi from __napi_poll+0x28/0x148
[  112.666945]  __napi_poll from net_rx_action+0xf0/0x1f8
[  112.672107]  net_rx_action from handle_softirqs+0x19c/0x204
[  112.677703]  handle_softirqs from __irq_exit_rcu+0x60/0xa8

> 
> [  112.154336] ------------[ cut here ]------------
> [  112.159015] WARNING: CPU: 0 PID: 30 at
> drivers/net/ethernet/freescale/fec_main.c:1584
> fec_enet_rx_napi+0x37c/0xba0
> [  112.169451] Modules linked in: nfnetlink caam_keyblob caam_jr
> rsa_generic mpi caamhash_desc caamalg_desc crypto_engine caam error
> etnaviv gpu_sched
> [  112.182764] CPU: 0 PID: 30 Comm: kswapd0 Not tainted
> 6.9.0-02044-gffb6ab7d6829 #1
> [  112.190261] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> [  112.196794] Call trace:
> [  112.196806]  unwind_backtrace from show_stack+0x10/0x14
> [  112.204592]  show_stack from dump_stack_lvl+0x50/0x64
> [  112.209663]  dump_stack_lvl from __warn+0x70/0xc4
> [  112.214385]  __warn from warn_slowpath_fmt+0x98/0x118
> [  112.219463]  warn_slowpath_fmt from fec_enet_rx_napi+0x37c/0xba0
> [  112.225490]  fec_enet_rx_napi from __napi_poll+0x28/0x148
> [  112.230914]  __napi_poll from net_rx_action+0xf0/0x1f8
> [  112.236075]  net_rx_action from handle_softirqs+0x19c/0x204
> [  112.241672]  handle_softirqs from __irq_exit_rcu+0x60/0xa8
> [  112.247172]  __irq_exit_rcu from irq_exit+0x8/0x10
> [  112.251979]  irq_exit from call_with_stack+0x18/0x20
> [  112.256967]  call_with_stack from __irq_svc+0x98/0xc8
> [  112.262036] Exception stack(0x90981e00 to 0x90981e48)
> [  112.267100] 1e00: 8fdc4400 81067700 81067700 00000001 00000000
> 8fdc4400 81a85500 807fd414
> [  112.275288] 1e20: 00000102 81067700 80e6befc 90981e7c 90981e80
> 90981e50 8080113c 8013ebd0
> [  112.283470] 1e40: 200d0013 ffffffff
> [  112.286964]  __irq_svc from finish_task_switch+0x140/0x1f4
> [  112.292474]  finish_task_switch from __schedule+0x300/0x47c
> [  112.298072]  __schedule from schedule+0x38/0x60
> [  112.302625]  schedule from schedule_timeout+0x84/0xa4
> [  112.307699]  schedule_timeout from kswapd+0x29c/0x674
> [  112.312776]  kswapd from kthread+0xe4/0xec
> [  112.316893]  kthread from ret_from_fork+0x14/0x28
> [  112.321609] Exception stack(0x90981fb0 to 0x90981ff8)
> [  112.326669] 1fa0:                                     00000000
> 00000000 00000000 00000000
> [  112.334855] 1fc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [  112.343038] 1fe0: 00000000 00000000 00000000 00000000 00000013
> 00000000
> [  112.349835] ---[ end trace 0000000000000000 ]---
> [  112.354483] 8<--- cut here ---
> [  112.357543] Unable to handle kernel NULL pointer dereference at virtual
> address 00000010 when read
> [  112.366539] [00000010] *pgd=00000000
> [  112.370145] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
> [  112.375468] Modules linked in: nfnetlink caam_keyblob caam_jr
> rsa_generic mpi caamhash_desc caamalg_desc crypto_engine caam error
> etnaviv gpu_sched
> [  112.388734] CPU: 0 PID: 30 Comm: kswapd0 Tainted: G        W
> 6.9.0-02044-gffb6ab7d6829 #1
> [  112.397701] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> [  112.404233] PC is at fec_enet_rx_napi+0x394/0xba0
> [  112.408949] LR is at fec_enet_rx_napi+0x37c/0xba0
> [  112.413661] pc : [<805355e8>]    lr : [<805355d0>]    psr: 600d0113
> [  112.419934] sp : 90801e68  ip : 00000000  fp : 00000000
> [  112.425163] r10: 8fe2b740  r9 : 000005f0  r8 : 00000000
> [  112.430393] r7 : 8133830c  r6 : 8e740820  r5 : 81338000  r4 :
> 810cf000
> [  112.436927] r3 : 00000100  r2 : 81067700  r1 : 80e6d4b8  r0 :
> 00000000
> [  112.443461] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
> Segment none
> [  112.450605] Control: 10c5387d  Table: 11d3004a  DAC: 00000051
> [  112.456355] Register r0 information: NULL pointer
> [  112.461072] Register r1 information: non-slab/vmalloc memory
> [  112.466742] Register r2 information: slab task_struct start 81067700
> pointer offset 0 size 2176
> [  112.475469] Register r3 information: non-paged memory
> [  112.480528] Register r4 information: slab kmalloc-4k start 810cf000
> pointer offset 0 size 4096
> [  112.489164] Register r5 information: slab kmalloc-4k start 81338000
> pointer offset 0 size 4096
> [  112.497800] Register r6 information: 0-page vmalloc region starting at
> 0x8e400000 allocated at iotable_init+0x0/0xe8
> [  112.508349] Register r7 information: slab kmalloc-4k start 81338000
> pointer offset 780 size 4096
> [  112.517158] Register r8 information: NULL pointer
> [  112.521871] Register r9 information: non-paged memory
> [  112.526930] Register r10 information: non-slab/vmalloc memory
> [  112.532685] Register r11 information: NULL pointer
> [  112.537484] Register r12 information: NULL pointer
> [  112.542283] Process kswapd0 (pid: 30, stack limit = 0x3ca59c34)
> [  112.548211] Stack: (0x90801e68 to 0x90802000)
> [  112.552578] 1e60:                   8fdc4400 8fdc4440 8105aa80
> 00000000 00000000 00000000
> [  112.560764] 1e80: 00000000 00000000 8fdc4440 00000000 00000016
> 00000040 00000000 80147414
> [  112.568950] 1ea0: 00000102 8105aa80 00000000 810cf5c0 00000006
> 00000000 8fdc4400 810cf6a0
> [  112.577135] 1ec0: 90801eec 8014037c 00000100 00000000 00000040
> 00000040 00000000 80d49400
> [  112.585320] 1ee0: 00000000 00000001 81067700 90801f2c 00000101
> 0f07b000 00000001 80e0a240
> [  112.593506] 1f00: 81338c40 800d0193 00001000 00000000 8fdc4080
> 810cf6a0 00000001 00000040
> [  112.601692] 1f20: 90801f67 810cf6a0 00000000 0000012c 90801f70
> 8065ec7c 810cf6a0 90801f67
> [  112.609877] 1f40: 90801f68 8fdc4fc0 80d49fc0 0f07b000 90801f68
> 8065efd4 ffffb6a1 80e02d40
> [  112.618063] 1f60: a00d0193 00000015 90801f68 90801f68 90801f70
> 90801f70 81a5232c 80e0208c
> [  112.626248] 1f80: 00000008 00220840 80e02080 81067700 0000000a
> 00000101 80e02080 801229d8
> [  112.634434] 1fa0: 80d47510 80197904 ffffb6a0 00000004 00000000
> 80d48c80 80e02d40 80d48c80
> [  112.642620] 1fc0: 40000003 00000003 f4000100 8013ebd0 200d0013
> ffffffff 90981e34 00000102
> [  112.650805] 1fe0: 81067700 80e6befc 90981df8 80122bd8 8013ebd0
> 80122c78 8013ebd0 807d07d0
> [  112.658986] Call trace:
> [  112.658992]  fec_enet_rx_napi from __napi_poll+0x28/0x148
> [  112.666945]  __napi_poll from net_rx_action+0xf0/0x1f8
> [  112.672107]  net_rx_action from handle_softirqs+0x19c/0x204
> [  112.677703]  handle_softirqs from __irq_exit_rcu+0x60/0xa8
> [  112.683205]  __irq_exit_rcu from irq_exit+0x8/0x10
> [  112.688009]  irq_exit from call_with_stack+0x18/0x20
> [  112.692996]  call_with_stack from __irq_svc+0x98/0xc8
> [  112.698066] Exception stack(0x90981e00 to 0x90981e48)
> [  112.703129] 1e00: 8fdc4400 81067700 81067700 00000001 00000000
> 8fdc4400 81a85500 807fd414
> [  112.711315] 1e20: 00000102 81067700 80e6befc 90981e7c 90981e80
> 90981e50 8080113c 8013ebd0
> [  112.719499] 1e40: 200d0013 ffffffff
> [  112.722994]  __irq_svc from finish_task_switch+0x140/0x1f4
> [  112.728501]  finish_task_switch from __schedule+0x300/0x47c
> [  112.734098]  __schedule from schedule+0x38/0x60
> [  112.738650]  schedule from schedule_timeout+0x84/0xa4
> [  112.743721]  schedule_timeout from kswapd+0x29c/0x674
> [  112.748795]  kswapd from kthread+0xe4/0xec
> [  112.752910]  kthread from ret_from_fork+0x14/0x28
> [  112.757626] Exception stack(0x90981fb0 to 0x90981ff8)
> [  112.762685] 1fa0:                                     00000000
> 00000000 00000000 00000000
> [  112.770870] 1fc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [  112.779054] 1fe0: 00000000 00000000 00000000 00000000 00000013
> 00000000
> [  112.785676] Code: e0275793 e3a03c01 e5878020 e587301c (e5983010)
> [  112.791907] ---[ end trace 0000000000000000 ]---
> [  112.796535] Kernel panic - not syncing: Fatal exception in interrupt
> [  112.802897] ---[ end Kernel panic - not syncing: Fatal exception in
> interrupt ]---
> 
> When I first encountered this issue on newer kernels I did a bisect to try
> to find exactly when it started. My bisect led me to c0a242394cb9 ("mm,
> page_alloc: scale the number of pages that are batch allocated"). But
> this commit does not seem to be the true problem and just makes the bug
> in the fec driver more likely to be encountered.

I think the above description is unnecessary as it is irrelevant to the issue.

> 
> Setting /proc/sys/vm/min_free_kbytes to higher values also seems to solve
> the problem for my test case. But it still seems wrong that the fec driver
> ignores the memory allocation error and crashes.
> 
> Fixes: 95698ff6177b5 ("net: fec: using page pool to manage RX buffers")
> Signed-off-by: Kevin Groeneveld <kgroeneveld@...brook.com>
> ---
>  drivers/net/ethernet/freescale/fec.h      |  2 ++
>  drivers/net/ethernet/freescale/fec_main.c | 38 ++++++++++++-----------
>  2 files changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/fec.h
> b/drivers/net/ethernet/freescale/fec.h
> index 1cca0425d493..ce44d4f2a798 100644
> --- a/drivers/net/ethernet/freescale/fec.h
> +++ b/drivers/net/ethernet/freescale/fec.h
> @@ -618,6 +618,8 @@ struct fec_enet_private {
>  	struct fec_enet_priv_tx_q *tx_queue[FEC_ENET_MAX_TX_QS];
>  	struct fec_enet_priv_rx_q *rx_queue[FEC_ENET_MAX_RX_QS];
> 
> +	bool rx_err_nomem;
> +
>  	unsigned int total_tx_ring_size;
>  	unsigned int total_rx_ring_size;
> 
> diff --git a/drivers/net/ethernet/freescale/fec_main.c
> b/drivers/net/ethernet/freescale/fec_main.c
> index 1b55047c0237..81832e0940bb 100644
> --- a/drivers/net/ethernet/freescale/fec_main.c
> +++ b/drivers/net/ethernet/freescale/fec_main.c
> @@ -1591,21 +1591,6 @@ static void fec_enet_tx(struct net_device *ndev,
> int budget)
>  		fec_enet_tx_queue(ndev, i, budget);
>  }
> 
> -static void fec_enet_update_cbd(struct fec_enet_priv_rx_q *rxq,
> -				struct bufdesc *bdp, int index)
> -{
> -	struct page *new_page;
> -	dma_addr_t phys_addr;
> -
> -	new_page = page_pool_dev_alloc_pages(rxq->page_pool);
> -	WARN_ON(!new_page);
> -	rxq->rx_skb_info[index].page = new_page;
> -
> -	rxq->rx_skb_info[index].offset = FEC_ENET_XDP_HEADROOM;
> -	phys_addr = page_pool_get_dma_addr(new_page) +
> FEC_ENET_XDP_HEADROOM;
> -	bdp->cbd_bufaddr = cpu_to_fec32(phys_addr);
> -}
> -
>  static u32
>  fec_enet_run_xdp(struct fec_enet_private *fep, struct bpf_prog *prog,
>  		 struct xdp_buff *xdp, struct fec_enet_priv_rx_q *rxq, int cpu)
> @@ -1697,7 +1682,7 @@ fec_enet_rx_queue(struct net_device *ndev, int
> budget, u16 queue_id)
>  	u32 data_start = FEC_ENET_XDP_HEADROOM;
>  	int cpu = smp_processor_id();
>  	struct xdp_buff xdp;
> -	struct page *page;
> +	struct page *page, *new_page;
>  	u32 sub_len = 4;
> 
>  #if !defined(CONFIG_M5272)
> @@ -1759,6 +1744,16 @@ fec_enet_rx_queue(struct net_device *ndev, int
> budget, u16 queue_id)
>  			goto rx_processing_done;
>  		}
> 
> +		/* Make sure we can allocate a new page before we start
> +		 * processing the next frame so we can still more easily abort.
> +		 */
> +		new_page = page_pool_dev_alloc_pages(rxq->page_pool);
> +		if (unlikely(!new_page)) {
> +			fep->rx_err_nomem = true;
> +			pkt_received--;
> +			goto rx_nomem;
> +		}
> +
>  		/* Process the incoming frame. */
>  		ndev->stats.rx_packets++;
>  		pkt_len = fec16_to_cpu(bdp->cbd_datlen);
> @@ -1771,7 +1766,11 @@ fec_enet_rx_queue(struct net_device *ndev, int
> budget, u16 queue_id)
>  					pkt_len,
>  					DMA_FROM_DEVICE);
>  		prefetch(page_address(page));
> -		fec_enet_update_cbd(rxq, bdp, index);
> +
> +		/* Update cbd with new page. */
> +		rxq->rx_skb_info[index].page = new_page;
> +		rxq->rx_skb_info[index].offset = FEC_ENET_XDP_HEADROOM;
> +		bdp->cbd_bufaddr =
> cpu_to_fec32(page_pool_get_dma_addr(new_page) +
> FEC_ENET_XDP_HEADROOM);
> 
>  		if (xdp_prog) {
>  			xdp_buff_clear_frags_flag(&xdp);
> @@ -1883,6 +1882,7 @@ fec_enet_rx_queue(struct net_device *ndev, int
> budget, u16 queue_id)
>  		 */
>  		writel(0, rxq->bd.reg_desc_active);
>  	}
> +rx_nomem:
>  	rxq->bd.cur = bdp;
> 
>  	if (xdp_result & FEC_ENET_XDP_REDIR)
> @@ -1943,10 +1943,12 @@ static int fec_enet_rx_napi(struct napi_struct
> *napi, int budget)
>  	struct fec_enet_private *fep = netdev_priv(ndev);
>  	int done = 0;
> 
> +	fep->rx_err_nomem = false;
> +
>  	do {
>  		done += fec_enet_rx(ndev, budget - done);
>  		fec_enet_tx(ndev, budget);
> -	} while ((done < budget) && fec_enet_collect_events(fep));
> +	} while ((done < budget) && !fep->rx_err_nomem &&
> fec_enet_collect_events(fep));

Is the condition "!fep->rx_err_nomem" necessary here? If not, then there
is no need to add this variable to fec_enet_private.

One situation I am concerned about is that when the issue occurs, the Rx
rings are full. At the same time, because the 'done < budget' condition is
met, the interrupt mode will be used to receive the packets. However,
since the Rx rings are full, no Rx interrupt events will be generated. This
means that the packets on the Rx rings may not be received by the CPU
for a long time unless Tx interrupt events are generated.

> 
>  	if (done < budget) {
>  		napi_complete_done(napi, done);
> --
> 2.43.0
> 
> I am not confident this patch is correct as my knowledge of napi and the
> fec driver is limited. Even if all my assumptions are correct I still do
> not entirely like this patch with how it propagates the error state via a
> variable I added to fec_enet_private. But I could not think of any other
> way that did not involve more significant changes to the existing code and
> I did not want to spend too much time on it until I am more sure the
> overall concept is acceptable. Suggestions welcome.

Another approach is to discard the packets when the issue occurs, as
shown below. Note that the following modification has not been verified.

-static void fec_enet_update_cbd(struct fec_enet_priv_rx_q *rxq,
+static int fec_enet_update_cbd(struct fec_enet_priv_rx_q *rxq,
                                struct bufdesc *bdp, int index)
 {
        struct page *new_page;
        dma_addr_t phys_addr;

        new_page = page_pool_dev_alloc_pages(rxq->page_pool);
-       WARN_ON(!new_page);
+       if (unlikely(!new_page))
+               return -ENOMEM;
+
        rxq->rx_skb_info[index].page = new_page;

        rxq->rx_skb_info[index].offset = FEC_ENET_XDP_HEADROOM;
        phys_addr = page_pool_get_dma_addr(new_page) + FEC_ENET_XDP_HEADROOM;
        bdp->cbd_bufaddr = cpu_to_fec32(phys_addr);
+
+       return 0;
 }

 static u32
@@ -1771,7 +1775,10 @@ fec_enet_rx_queue(struct net_device *ndev, int budget, u16 queue_id)
                                        pkt_len,
                                        DMA_FROM_DEVICE);
                prefetch(page_address(page));
-               fec_enet_update_cbd(rxq, bdp, index);
+               if (fec_enet_update_cbd(rxq, bdp, index)) {
+                       ndev->stats.rx_dropped++;
+                       goto rx_processing_done;
+               }

                if (xdp_prog) {
                        xdp_buff_clear_frags_flag(&xdp);