linux-kernel - Re: [PATCH 3/3] mtd/nand : workaround for Freescale FCM to support large-page Nand chip

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <4EF0E67B.5080002@freescale.com>
Date:	Tue, 20 Dec 2011 13:48:11 -0600
From:	Scott Wood <scottwood@...escale.com>
To:	Li Yang <leoli@...escale.com>
CC:	<dedekind1@...il.com>, <dwmw2@...radead.org>,
	LiuShuo <b35362@...escale.com>, <linux-kernel@...r.kernel.org>,
	<shuo.liu@...escale.com>, <linux-mtd@...ts.infradead.org>,
	<akpm@...ux-foundation.org>, <linuxppc-dev@...ts.ozlabs.org>
Subject: Re: [PATCH 3/3] mtd/nand : workaround for Freescale FCM to support
 large-page Nand chip

On 12/20/2011 03:08 AM, Li Yang wrote:
> On Tue, Dec 20, 2011 at 12:47 AM, Scott Wood <scottwood@...escale.com> wrote:
>> On 12/19/2011 05:05 AM, Li Yang wrote:
>>> On Sat, Dec 17, 2011 at 1:59 AM, Scott Wood <scottwood@...escale.com> wrote:
>>>> On 12/15/2011 08:44 PM, LiuShuo wrote:
>>>>> hi Artem,
>>>>> Could this patch be applied now and we make a independent patch for  bad
>>>>> block information
>>>>> migration later?
>>>>
>>>> This patch is not safe to use without migration.
>>>
>>> Hi Scott,
>>>
>>> We agree it's not entirely safe without migrating the bad block flag.
>>> But let's consider two sides of the situation.
>>>
>>> Firstly, it's only unsafe when there is a need to re-built the Bad
>>> Block Table from scratch(old BBT broken).
>>
>> No, it's unsafe in the presence of bad blocks.
>>
> 
> Instead of migrating the factory bad block markers I proposed to
> modify the code of building BBT to make it different for 4K page, so
> that the default BBT can correctly covers the factory bad blocks.  It
> is the easiest way with nearly no harm to the functionality.

Even if we were to agree to that (I disagree with "nearly no harm"),
this patch doesn't implement that.  As is, this patch simply ignores the
issue.

Note that besides possibly tossing away bad block information during
development, the BBT-only approach will not work for booting from NAND,
as we don't use the BBT in that case (need to keep the code minimal to
fit the 4k boot block).  Yes, this ignores blocks that were marked bad
by software, but that's usually OK since that part of the chip isn't
managed by a software layer such as jffs2 that will mark blocks as bad.

>> The BBT erasure issue relates to how me mark the flash as migrated, not
>> whether we migrate in the first place.
> 
> It is connected to whether we do the migration at all.  I mentioned in
> earlier mail that if we are doing the migration, we need to make sure
> the migration only happens once.  And it need to be done before the
> flash is used for the first time and before BBT is created.  If we
> can't guarantee these condition, we are marking good blocks as bad by
> doing the migration.  Even worse than doing nothing.

You also can't do the special BBT scan once the flash has been written
to with normal data.  This patch does not implement the special BBT scan.

>>>  But currently there is no
>>> easy way to do that(re-build BBT on demand),
>>
>> You scrub the blocks with U-Boot.  It's not supposed to be *easy*, it's
>> a developer recovery mechanism.
> 
> Scrub clears the factory bad block markers also.  

Only if you scrub bad blocks.  I was talking about scrubbing the BBT
specifically, not the entire chip.

>>> Secondly, even if the previous said problem happens(BBT broken).  We
>>> can still recover all the data if we overrule the bad block flag.
>>
>> How so?  The bad block markers -- including ones legitimately written to
>> the BBT after the fact -- are used for block skipping with certain types
>> of writes.  Without the knowledge of which blocks were marked bad, how
>> do we know which blocks were skipped?
> 
> This is not supposed to be *easy*.  We might get more information in
> the file system level.  Or we check the content of the blocks.

If you need to take special, non-automatic steps to recover the data,
that counts as data loss.

>>> however, it can be used
>>> if we take the risk of losing data from errors that ECC can't
>>> notice(low possibility too).
>>
>> Can you quantify "low possibility" here?
>>
>> Note that any block that *was* marked bad will have a multi-bit error
>> from the marker itself, since it will be embedded in the main data area.
> 
> I found the definition of bad block from one NAND chip manual: Bad
> Blocks are blocks that contain one or more invalid bits whose
> reliability is not guaranteed.
> 
> There is no mentioning that the bad block has to have multi-bit error.

"It's not guaranteed to fail" is rather different from "low possibility
of failure".

>  Although the factory bad blocks might have worse error than wear-off
> bad blocks, it's not what I can tell.

Why would they have such a mechanism to mark blocks bad, if it's not needed?

>> Why is it so critical that it be merged now, and not in a few weeks (or
>> next merge window) when I have a chance to do the migration code
>> (assuming nobody else does it first) and add a suitable check for the
>> migration marker in the Linux driver?
> 
> A few weeks might be ok.  But I feared that the merge can be further
> delayed and might finally goes no where.

And I fear that the bad block handling will be forgotten about if enough
gets merged for this to sort-of work.

> And as I argued above, I'm not sure if migrating is necessary in the first place.
> 
> In general.  We are not trying to get unqualified code merged.  But I
> also don't agree we need to perfect all things before any of the code
> can be merged.

I'm not asking for it to be perfect -- this just seems like a difficult
thing to fix once people start using the feature (similar to getting a
userspace API merged, we want to get it right first), and I'm not
comfortable with the risks of people using it without bad block handling.

> My understanding is that even if certain code is not
> complete in feature or have certain drawbacks, if the current chunk
> provided some useful features and the drawbacks are acceptable,

Whether "the drawbacks are acceptable" is the issue.

-Scott

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/