lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 13 Nov 2023 14:44:41 +0100
From:   Miquel Raynal <miquel.raynal@...tlin.com>
To:     Ronald Monthero <debug.penguin32@...il.com>
Cc:     richard@....at, vigneshr@...com, heiko@...ech.de,
        martin.blumenstingl@...glemail.com, paul@...pouillou.net,
        robh@...nel.org, u.kleine-koenig@...gutronix.de,
        AVKrasnov@...rdevices.ru, r.czerwinski@...gutronix.de,
        andriy.shevchenko@...ux.intel.com, jaimeliao.tw@...il.com,
        linux-mtd@...ts.infradead.org, linux-kernel@...r.kernel.org,
        Philippe Mathieu-Daudé <philmd@...aro.org>
Subject: Re: [PATCH] mtd:nand: Increment IFC_TIMEOUT_MSECS for nand
 controller response

Hi Ronald,

Thanks for the patch, here are a couple of comments I'd like you to
address before taking the fix.

debug.penguin32@...il.com wrote on Mon, 13 Nov 2023 23:16:28 +1000:

The title prefix needs to be aligned with today's policy, you can check
it with git log --oneline -- <your file>.

> The nand controller not responding scenario occurs causing blocked tasks
> and rcu_prempt warnings of stall on cpus. Incrementing the
> IFC_TIMEOUT_MSECS appears to solve the nand controller not responding
> issue.

I would rephrase a bit this paragraph with more confidence. Under heavy
load it is likely that the controller is done with its own task but the
thread unlocking the wait look is never scheduled (or not in time)
resulting in such kind of error. Maybe there is something else wrong in
the code which stalls the CPU in this case, (hence the first message).

Enlarging the timeout to 1s in this case is fine, but maybe there is
still something wrong aside.

> ** ID_531 main/smrcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu:    Tasks blocked on level-0 rcu_node (CPUs 0-1): P116/2:b..l
>         (detected by 1, t=2102 jiffies, g=12653, q=518)
> task:irq/31-arm-irq1 state:D stack: 0 pid:  116 ppid: 2 flags:0x00000000
> [<8064b97f>] (__schedule) from [<8064bb01>] (schedule+0x8d/0xc2)
> [<8064bb01>] (schedule) from [<8064fa65>] (schedule_timeout+0x6d/0xa0)
> [<8064fa65>] (schedule_timeout) from [<804ba353>] (fsl_ifc_run_command+0x6f/0x178)
> [<804ba353>] (fsl_ifc_run_command) from [<804ba72f>] (fsl_ifc_cmdfunc+0x203/0x2b8)
> [<804ba72f>] (fsl_ifc_cmdfunc) from [<804b135f>] (nand_status_op+0xaf/0xe0)
> [<804b135f>] (nand_status_op) from [<804b13b3>] (nand_check_wp+0x23/0x48)
> [<804b13b3>] (nand_check_wp) from [<804b231d>] (nand_do_write_ops+0x99/0x2b8)
> [<804b231d>] (nand_do_write_ops) from [<804b5355>] (nand_write_oob+0x3b/0x4a)
> [<804b5355>] (nand_write_oob) from [<804a3585>] (mtd_write+0x41/0x5c)
> [<804a3585>] (mtd_write) from [<804c1d47>] (ubi_io_write+0x17f/0x22c)
> [<804c1d47>] (ubi_io_write) from [<804c047b>] (ubi_eba_write_leb+0x5b/0x1d0)
> [<804c047b>] (ubi_eba_write_leb) from [<804bf62d>] (ubi_leb_write+0x75/0x90)
> [<804bf62d>] (ubi_leb_write) from [<803745b7>] (ubifs_leb_write+0x4b/0x8c)
> [<803745b7>] (ubifs_leb_write) from [<80374bbb>] (ubifs_wbuf_sync_nolock+0x10f/0x1a4)
> [<80374bbb>] (ubifs_wbuf_sync_nolock) from [<8036c6dd>] (ubifs_jnl_update+0x1e9/0x36c)
> [<8036c6dd>] (ubifs_jnl_update) from [<80370933>] (ubifs_create+0xb3/0x130)
> [<80370933>] (ubifs_create) from [<802cf0c7>] (lookup_open+0x173/0x1c4)
> [<802cf0c7>] (lookup_open) from [<802cf8a3>] (open_last_lookups+0xd7/0x16c)
> [<802cf8a3>] (open_last_lookups) from [<802d08e5>] (path_openat+0x91/0x104)
> [<802d08e5>] (path_openat) from [<802d0989>] (do_filp_open+0x31/0x74)
> [<802d0989>] (do_filp_open) from [<802c4fb3>] (file_open_name+0x33/0x48)
> [<802c4fb3>] (file_open_name) from [<802c4fe9>] (filp_open+0x21/0x2e)
> [<802c4fe9>] (filp_open) from [<80490bd3>] (irq1_handler+0x53/0xa4)
> [<80490bd3>] (irq1_handler) from [<80247dd7>] (irq_forced_thread_fn+0x1f/0x4c)
> [<80247dd7>] (irq_forced_thread_fn) from [<80247cd9>] (irq_thread+0x89/0x114)
> [<80247cd9>] (irq_thread) from [<8022ca67>] (kthread+0xcf/0xe4)
> [<8022ca67>] (kthread) from [<80200149>] (ret_from_fork+0x11/0x28)
> Exception stack(0x822bbfb0 to 0x822bbff8)
> bfa0:                                     00000000 00000000 00000000 00000000
> bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> 
> fsl,ifc-nand 7e800000.nand: Controller is not responding
> ID_531 main/smp_fsm.c:1884 <inrcu: rcu_preempt detected stalls on CPUs:
> rcu:    Tasks blocked on level-0 rcu_node (CPUs 0-1): P116/2:b..l
>         (detected by 1, t=2102 jiffies, g=7729, q=754)
> task:irq/31-arm-irq1 state:D stack: 0 pid:  116 ppid: 2 flags:0x00000000
> [<8064b97f>] (__schedule) from [<8064bb01>] (schedule+0x8d/0xc2)
> [<8064bb01>] (schedule) from [<8064dacd>] (rt_mutex_slowlock_block.con)
> [<8064dacd>] (rt_mutex_slowlock_block.constprop.0) from [<8064db57>] 
> [<8064db57>] (__rt_mutex_slowlock.constprop.0) from [<8064dbf7>] 
> [<8064dbf7>] (rt_mutex_slowlock.constprop.0) from [<804b2047>] 
> [<804b2047>] (nand_get_device) from [<804b5335>] (nand_write_oob+0x1b/0x4a)
> [<804b5335>] (nand_write_oob) from [<804a3585>] (mtd_write+0x41/0x5c)
> [<804a3585>] (mtd_write) from [<804c1d47>] (ubi_io_write+0x17f/0x22c)
> [<804c1d47>] (ubi_io_write) from [<804c047b>] (ubi_eba_write_leb+0x5b/0x1d0)
> [<804c047b>] (ubi_eba_write_leb) from [<804bf62d>] (ubi_leb_write+0x75/0x90)
> [<804bf62d>] (ubi_leb_write) from [<803745b7>] (ubifs_leb_write+0x4b/0x8c)
> [<803745b7>] (ubifs_leb_write) from [<80374bbb>] (ubifs_wbuf_sync_nolock+0x10f/0x1a4)
> [<80374bbb>] (ubifs_wbuf_sync_nolock) from [<8036c6dd>] (ubifs_jnl_update+0x1e9/0x36c)

You can trim down the traces to only show the interesting part.

Here you need a Fixes: and Cc: stable tag.

> Signed-off-by: Ronald Monthero <debug.penguin32@...il.com>
> ---
>  drivers/mtd/nand/raw/fsl_ifc_nand.c | 2 +-
>  drivers/mtd/nand/raw/nand_base.c    | 5 ++++-
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/mtd/nand/raw/fsl_ifc_nand.c b/drivers/mtd/nand/raw/fsl_ifc_nand.c
> index 20bb1e0cb5eb..42f8ea46b6a8 100644
> --- a/drivers/mtd/nand/raw/fsl_ifc_nand.c
> +++ b/drivers/mtd/nand/raw/fsl_ifc_nand.c
> @@ -21,7 +21,7 @@
>  
>  #define ERR_BYTE		0xFF /* Value returned for read
>  					bytes when read failed	*/
> -#define IFC_TIMEOUT_MSECS	500  /* Maximum number of mSecs to wait
> +#define IFC_TIMEOUT_MSECS	1000  /* Maximum number of mSecs to wait
>  					for IFC NAND Machine	*/
>  
>  struct fsl_ifc_ctrl;
> diff --git a/drivers/mtd/nand/raw/nand_base.c b/drivers/mtd/nand/raw/nand_base.c
> index 9e24bedffd89..05b52ed41f4c 100644
> --- a/drivers/mtd/nand/raw/nand_base.c
> +++ b/drivers/mtd/nand/raw/nand_base.c
> @@ -363,8 +363,11 @@ static int nand_check_wp(struct nand_chip *chip)
>  	int ret;
>  
>  	/* Broken xD cards report WP despite being writable */
> -	if (chip->options & NAND_BROKEN_XD)
> +	if (chip->options & NAND_BROKEN_XD) {
> +		pr_info("nand_chip->options indicates NAND_BROKEN_XD %d\n",
> +			(chip->options & NAND_BROKEN_XD));
>  		return 0;
> +	}

This is an unrelated debug message and should be dropped.

>  
>  	/* Check the WP bit */
>  	ret = nand_status_op(chip, &status);

Thanks,
Miquèl

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ