linux-kernel - Re: [PATCH v1 11/11] mtd: new support oops logger based on pstore/blk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <05c21cb6-988b-63a7-60ff-c4e975f25817@allwinnertech.com>
Date:   Fri, 7 Feb 2020 18:30:37 +0800
From:   liaoweixiong <liaoweixiong@...winnertech.com>
To:     Miquel Raynal <mraynal@...nel.org>
Cc:     Kees Cook <keescook@...omium.org>,
        Anton Vorontsov <anton@...msg.org>,
        Colin Cross <ccross@...roid.com>,
        Tony Luck <tony.luck@...el.com>,
        Jonathan Corbet <corbet@....net>,
        Richard Weinberger <richard@....at>,
        Vignesh Raghavendra <vigneshr@...com>,
        Mauro Carvalho Chehab <mchehab+samsung@...nel.org>,
        "David S. Miller" <davem@...emloft.net>,
        Rob Herring <robh@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Jonathan Cameron <Jonathan.Cameron@...wei.com>,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mtd@...ts.infradead.org
Subject: Re: [PATCH v1 11/11] mtd: new support oops logger based on pstore/blk

hi Miquel Raynal,

On 2020/2/7 下午4:41, Miquel Raynal wrote:
> Hi Liao,
> 
> liaoweixiong <liaoweixiong@...winnertech.com> wrote on Fri, 7 Feb 2020
> 12:13:08 +0800:
> 
>> hi Miquel Raynal,
>>
>> On 2020/2/6 PM 11:45, Miquel Raynal wrote:
>>> Hi liao,
>>>
>>> liaoweixiong <liaoweixiong@...winnertech.com> wrote on Thu, 6 Feb 2020
>>> 21:10:47 +0800:
>>>    
>>>> hi Miquel Raynal,
>>>>
>>>> On 2020/1/23 AM 1:41, Miquel Raynal wrote:
>>>>> Hello,
>>>>>   
>>>>>     >>>>>>>> +/*
>>>>>>>>>> + * All zones will be read as pstore/blk will read zone one by one when do
>>>>>>>>>> + * recover.
>>>>>>>>>> + */
>>>>>>>>>> +static ssize_t mtdpstore_read(char *buf, size_t size, loff_t off)
>>>>>>>>>> +{
>>>>>>>>>> +	struct mtdpstore_context *cxt = &oops_cxt;
>>>>>>>>>> +	size_t retlen;
>>>>>>>>>> +	int ret;
>>>>>>>>>> +
>>>>>>>>>> +	if (mtdpstore_block_isbad(cxt, off))
>>>>>>>>>> +		return -ENEXT;
>>>>>>>>>> +
>>>>>>>>>> +	pr_debug("try to read off 0x%llx size %zu\n", off, size);
>>>>>>>>>> +	ret = mtd_read(cxt->mtd, off, size, &retlen, (u_char *)buf);
>>>>>>>>>> +	if ((ret < 0 && !mtd_is_bitflip(ret)) || size != retlen)  {
>>>>>>>>>
>>>>>>>>> IIRC size != retlen does not mean it failed, but that you should
>>>>>>>>> continue reading after retlen bytes, no?
>>>>>>>>>       >>
>>>>>>>> Yes, you are right. I will fix it. Thanks.
>>>>>>>>     >>>>> Also, mtd_is_bitflip() does not mean that you are reading a false
>>>>>>>>> buffer, but that the data has been corrected as it contained bitflips.
>>>>>>>>> mtd_is_eccerr() however, would be meaningful.
>>>>>>>>>       >>
>>>>>>>> Sure I know mtd_is_bitflip() does not mean failure, but I do not think
>>>>>>>> mtd_is_eccerr() should be here since the codes are ret < 0 and NOT
>>>>>>>> mtd_is_bitflip().
>>>>>>>
>>>>>>> Yes, just drop this check, only keep ret < 0.
>>>>>>>      >>
>>>>>> If I don't get it wrong, it should not	 be dropped here. Like your words,
>>>>>> "mtd_is_bitflip() does not mean that you are reading a false buffer,
>>>>>> but that the data has been corrected as it contained bitflips.", the
>>>>>> data I get are valid even if mtd_is_bitflip() return true. It's correct
>>>>>> data and it's no need to go to handle error. To me, the codes
>>>>>> should be:
>>>>>> 	if (ret < 0 && !mit_is_bitflip())
>>>>>> 		[error handling]
>>>>>
>>>>> Please check the implementation of mtd_is_bitflip(). You'll probably
>>>>> figure out what I am saying.
>>>>>
>>>>> https://elixir.bootlin.com/linux/latest/source/include/linux/mtd/mtd.h#L585
>>>>>     >>
>>>> How about the codes as follows:
>>>>
>>>> for (done = 0, retlen = 0; done < size; done += retlen) {
>>>> 	ret = mtd_read(..., &retlen, ...);
>>>> 	if (!ret)
>>>> 		continue;
>>>> 	/*
>>>> 	 * do nothing if bitflip and ecc error occurs because whether
>>>> 	 * it's bitflip or ECC error, just a small number of bits flip
>>>> 	 * and the impact on log data is so small. The mtdpstore just
>>>> 	 * hands over what it gets and user can judge whether the data
>>>> 	 * is valid or not.
>>>> 	 */
>>>> 	if (mtd_is_bitflip(ret)) {
>>>> 		dev_warn("bitflip at....");
>>>> 		continue;
>>
>>> I don't understand why do you check for bitflips. Bitflips have been
>>> corrected at this stage, you just get the information that there
>>> has been bitflips, but the data integrity is fine.
>>>    
>>
>> Both of bitflip and eccerror are not real wrong in this
>> case. So we must check them.
>>
>>> I am not against ignoring ECC errors in this case though. I would
>>> propose:
>>>
>>> 	for (...) {
>>> 		if (ret < 0) {
>>> 			complain;
>>> 			return;
>>> 		}
>>>    
>>
>> -117 (-EUCLEAN) means bitflip but be corrected.
>> -74 (-EBADMSG) means ecc error that uncorrectable
>> All of them are negative number that smaller than 0. If it just keeps
>> "ret < 0", it can never make a difference between bitflip/eccerror
>> and others.
> 
> I forgot about these, your're right, so what about:
> 
> 	static bool mtdpstore_is_io_error(int ret)
> 	{
> 		return ret < 0 && !mtd_is_eccerr(ret)> && !mtd_is_bitflip(ret);
> 	}
> 
> 	for (...) {
> 		if (mtdpstore_is_io_error(ret))
> 			return ret;
> 
> 		/*
> 		 * Comment explaining why we still return valid data
> 		 * despite ECC errors.
> 		 */
> 		if (mtd_is_eccerr(ret))
> 			just-complain();
> 	}
> 
> This snippet makes a difference between any "controller/bus
> timeout/error : data could not be retrieved" and "ECC error, maybe we
> can still read it and try to understand (risky, must be warned)".
> 

That seems good to me. I will fix it later.
Thanks for your review.

>>
>>> 		if (mtd_is_eccerr())
>>> 			complain;
>>> 	}
>>> 		
>>>> 	} else if (mtd_is_eccerr(ret)) {
>>>> 		dev_warn("eccerr at....");
>>>> 		retlen = retlen == 0 ? size : retlen;
>>>> 		continue;
>>>> 	} else {
>>>> 		dev_err("read failure at...");
>>>> 		/* this zone is broken, try next one */
>>>> 		return -ENEXT;
>>>> 	}
>>>> }
>>>>   
>>>
>>>
>>> Thanks,
>>> Miquèl
>>>    
> 
> Thanks,
> Miquèl
>