lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <mafs0ldve7qms.fsf@kernel.org>
Date: Mon, 13 Jan 2025 14:08:43 +0000
From: Pratyush Yadav <pratyush@...nel.org>
To: Miquel Raynal <miquel.raynal@...tlin.com>
Cc: Pratyush Yadav <pratyush@...nel.org>,  Tudor Ambarus
 <tudor.ambarus@...aro.org>,  Michael Walle <mwalle@...nel.org>,  Richard
 Weinberger <richard@....at>,  Vignesh Raghavendra <vigneshr@...com>,
  Thomas Petazzoni <thomas.petazzoni@...tlin.com>,  Steam Lin
 <STLin2@...bond.com>,  linux-mtd@...ts.infradead.org,
  linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] mtd: spi-nor: winbond: Add support for w25q01jv

On Mon, Dec 30 2024, Miquel Raynal wrote:

> Hello Pratyush,
>
> On 24/12/2024 at 21:15:41 GMT, Pratyush Yadav <pratyush@...nel.org> wrote:
>
>> On Tue, Dec 24 2024, Miquel Raynal wrote:
[...]
>>> As there are very few situations where this can actually happen, a
>>> status register write being the most likely one, another possibility
>>> might have been to use volatile writes instead of non-volatile writes,
>>> as most of the deviation comes from the action of writing the bit. But
>>> this would overlook other possible actions where both dies can be used
>>> at the same time like a chip erase (or any erase over the die boundary
>>> in general). This last approach would have the least impact but because
>>> it does not feel like it is totally safe to use and because the impact
>>> of the second solution presented above is also negligible, we keep this
>>> second approach for now (which can be further tuned later if it appears
>>> to be too impacting in the end).
>>
>> I am a bit confused by this paragraph. What do you mean by "this" in
>> the
>
> "this" = "the race condition"
>
>> first sentence? What do status register writes have to do with the ready
>> bit being racy?
>
> The bug that has been experienced followed this sequence:
> - send the write enable command (non-volatile)
> - wait for the ready/busy bit, ie. wait for the WEL bit to be set
>   because it is non-volatile write
> - active die is ready, (but idle die is not!)
> - enter 4-byte address mode, only the die that is ready processes the
>   command.
>
> We only observed the issue in this particular case which involves
> writing the status register, because it is one of the very few commands
> targeting all dies at the same time.
>
> I assume another sequence that might lead to a similar issue might be a
> chip erasure, as all dies are involved in parallel, but maybe there are
> other situations I did not think about which might be racy as well.


>
>> I would assume those would be nearly instant since
>> status registers are usually volatile. What do volatile writes mean in
>> this context?
>
> You are actually right. Status register bits can be volatile (in this
> case writing the bits themselves is almost instant) but currently when
> we allow this register to be writable by sending the write enable (06h)
> command, the non-volatile way is used, ie. the state of the bit itself
> is stored in non-volatile memory and write durations can vary from one
> die to another.

Okay, that is strange behaviour. Normally the status registers are
always volatile, and don't have a non-volatile counterpart.

>
> Winbond chips (maybe this is a shared capability?) accepts another
> command, "Write Enable for Volatile Status Register (50h)", which
> specifically change the status register bits to use the volatile method.
>
> Hence, if the only situation we want to solve is the status register
> access, then we may just enable this command (this is the third solution
> I tried to explain in the commit log), but if we think there are other
> racy situations, this approach is not complete and we must fallback to
> one of the approaches listed above.

I am not quite sure how you fix the write-enable-being-racy bug with
your patch. If you look at the code, spi_nor_write_enable() only calls
the write enable command (06h), and does not call
spi_nor_wait_till_ready() after that. After the write enable, it
immediately executes the program or erase operation. So you never
actually wait for all dies to be ready after a write enable.

You can see an example in spi_nor_write(). It does:

    spi_nor_write_enable() -> spi_nor_write_data() -> spi_nor_wait_till_ready()

Do you have a consistent reproducer for the race? If so, does the patch
actually somehow make the race go away? If so, I would be curious to
know why.

>
>>>
>>> However, the fixup, whatever which one we pick, must be applied on
>>> multi-die chips, which hence must be properly flagged. The SFDP tables
>>> implemented give a lot of information but the die details are part of an
>>> optional table that is not implemented, hence we use a post parsing
>>> fixup hook to set the params->n_dice value manually.
>>>
[...]

-- 
Regards,
Pratyush Yadav

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ