lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55EDA149.6040002@imgtec.com>
Date:	Mon, 7 Sep 2015 15:38:01 +0100
From:	Alex Smith <alex.smith@...tec.com>
To:	Ezequiel Garcia <ezequiel@...guardiasur.com.ar>
CC:	<linux-mtd@...ts.infradead.org>,
	Brian Norris <computersforpeace@...il.com>,
	David Woodhouse <dwmw2@...radead.org>,
	"Zubair Lutfullah Kakakhel" <Zubair.Kakakhel@...tec.com>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v4 1/3] mtd: nand: increase ready wait timeout and report
 timeouts

Hi Ezequiel,

Thanks for reviewing the series.

On 06/09/2015 21:37, Ezequiel Garcia wrote:
> On 27 Jul 02:50 PM, Alex Smith wrote:
>> If nand_wait_ready() times out, this is silently ignored, and its
>> caller will then proceed to read from/write to the chip before it is
>> ready. This can potentially result in corruption with no indication as
>> to why.
>>
>> While a 20ms timeout seems like it should be plenty enough, certain
>> behaviour can cause it to timeout much earlier than expected. The
>> situation which prompted this change was that CPU 0, which is
>> responsible for updating jiffies, was holding interrupts disabled
>> for a fairly long time while writing to the console during a printk,
>> causing several jiffies updates to be delayed. If CPU 1 happens to
>> enter the timeout loop in nand_wait_ready() just before CPU 0 re-
>> enables interrupts and updates jiffies, CPU 1 will immediately time
>> out when the delayed jiffies updates are made. The result of this is
>> that nand_wait_ready() actually waits less time than the NAND chip
>> would normally take to be ready, and then read_page() proceeds to
>> read out bad data from the chip.
>>
>> The situation described above may seem unlikely, but in fact it can be
>> reproduced almost every boot on the MIPS Creator Ci20.
>>
> 
> Not only unlikely but scary :) BTW, can't find SMP patches for Ci20,
> are you sure this behavior will apply once SMP is upstreamed?

Certainly made for fun debugging ;)

SMP support only exists in our 3.18 branch [1] at the moment, which was where this problem was encountered. Support should be upstreamed at some point, and I would guess that this behaviour could still happen then (even though it's a really obscure edge case that we were somehow managing to almost always hit on boot).

[1] https://github.com/MIPS/CI20_linux

> 
>> Debugging this was made more difficult by the misleading comment above
>> nand_wait_ready() stating "The timeout is caught later" - no timeout
>> was ever reported, leading me away from the real source of the problem.
>>
>> Therefore, this patch increases the timeout to 200ms. This should be
>> enough to cover cases where jiffies updates get delayed. Additionally,
>> add a pr_warn() when a timeout does occur so that it is easier to
>> pinpoint any problems in future caused by the chip not becoming ready.
>>
>> Signed-off-by: Alex Smith <alex.smith@...tec.com>
>> Cc: Zubair Lutfullah Kakakhel <Zubair.Kakakhel@...tec.com>
>> Cc: David Woodhouse <dwmw2@...radead.org>
>> Cc: Brian Norris <computersforpeace@...il.com>
>> Cc: linux-mtd@...ts.infradead.org
>> Cc: linux-kernel@...r.kernel.org
>> ---
>> v3 -> v4:
>>  - New patch to fix issue encountered in external Ci20 3.18 kernel
>>    branch which also applies upstream.
>> ---
>>  drivers/mtd/nand/nand_base.c | 15 ++++++++++++---
>>  1 file changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
>> index ceb68ca8277a..a0dab3414f16 100644
>> --- a/drivers/mtd/nand/nand_base.c
>> +++ b/drivers/mtd/nand/nand_base.c
>> @@ -543,23 +543,32 @@ static void panic_nand_wait_ready(struct mtd_info *mtd, unsigned long timeo)
>>  	}
>>  }
>>  
>> -/* Wait for the ready pin, after a command. The timeout is caught later. */
>> +/**
>> + * nand_wait_ready - [GENERIC] Wait for the ready pin after commands.
>> + * @mtd: MTD device structure
>> + *
>> + * Wait for the ready pin after a command, and warn if a timeout occurs.
>> + */
>>  void nand_wait_ready(struct mtd_info *mtd)
>>  {
>>  	struct nand_chip *chip = mtd->priv;
>> -	unsigned long timeo = jiffies + msecs_to_jiffies(20);
>> +	unsigned long timeo = jiffies + msecs_to_jiffies(200);
>>  
>>  	/* 400ms timeout */
>>  	if (in_interrupt() || oops_in_progress)
>>  		return panic_nand_wait_ready(mtd, 400);
>>  
>>  	led_trigger_event(nand_led_trigger, LED_FULL);
>> +
> 
> Spurious change here.

Removed.

> 
>>  	/* Wait until command is processed or timeout occurs */
>>  	do {
>>  		if (chip->dev_ready(mtd))
>> -			break;
>> +			goto out;
>>  		touch_softlockup_watchdog();
>>  	} while (time_before(jiffies, timeo));
>> +
>> +	pr_warn("timeout while waiting for chip to become ready\n");
>> +out:
>>  	led_trigger_event(nand_led_trigger, LED_OFF);
>>  }
> 
> This change looks reasonable, a timeout value should be large enough
> to be confident the operation has _really_ timed out. On non-error
> path, this change shouldn't make any difference.
> 
> And the warning is probably helpful too, so:
> 
> Reviewed-by: Ezequiel Garcia <ezequiel@...guardiasur.com.ar>

Great, thanks.

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ