linux-kernel - Re: [PATCH] USB:bugfix a controller halt error

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2fbd21f3-2da0-5a10-23c4-fcfecfe48311@suse.com>
Date:   Thu, 27 Jul 2023 10:46:05 +0200
From:   Oliver Neukum <oneukum@...e.com>
To:     liulongfang <liulongfang@...wei.com>,
        Oliver Neukum <oneukum@...e.com>,
        Greg KH <gregkh@...uxfoundation.org>
Cc:     linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] USB:bugfix a controller halt error

On 27.07.23 09:00, liulongfang wrote:
> On 2023/7/26 19:16, Oliver Neukum wrote:

>> 1. temporary - that is you have detected memory corruption but the RAM cell is not broken
>> 2. unrecoverable - that is we have lost data
>> 3. locateable - that is you know it hit the buffer of this operation and only it
>>
>> Am I correct so far?
>>
> You are right about the testing process.
> But this problem can exist in the real environment, just the probability of
> occurrence is very low.

Understood. Bit flips are random.

But this leaves two open questions.

1. How is the error reported

2. How are we supposed to handle it

Firstly, if we already know that there is an ECC failure
on the host we can use a specific error code and can check
for that.

Secondly, does this mean that the affected memory location
must not be touched until the machine is power cycled
or does it simply mean that the buffer is invalid?

> Our test tool only simulates that external interference destroys this part
> of the data in the buffer on the ECC memory. Even without this testing tool.
> This problem may also occur on real business hardware devices.

Understood. But what is the correct remedy if teh problem strikes
for real?

	Regards
		Oliver