[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20190603131005.e23lovwyvii53vzo@rric.localdomain>
Date: Mon, 3 Jun 2019 13:10:13 +0000
From: Robert Richter <rrichter@...vell.com>
To: James Morse <james.morse@....com>
CC: Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>,
"Mauro Carvalho Chehab" <mchehab@...nel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 11/21] EDAC, ghes: Unify trace_mc_event() code with
edac_mc driver
On 29.05.19 16:12:38, James Morse wrote:
> Hi Robert,
>
> On 29/05/2019 09:44, Robert Richter wrote:
> > Almost duplicate code, remove it.
>
> almost?
The grain ... as noted below.
>
>
> > Note: there is a difference in the calculation of the grain_bits,
> > using the edac_mc's version here.
>
> But is it the right thing to do?
>
> Is this an off-by-one bug being papered over as some cleanup?
> If so could you post a separate fix that can be picked up for an rc.
>
> Do Marvell have firmware that populates this field?
>
> ...
>
> Unless the argument is no one cares about this...
>
> >From ghes_edac_report_mem_error():
> | /* Error grain */
> | if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
> | e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
>
> Fishy, why would the kernel page-size be relevant here?
That looked broken to me too, I did not put to much effort in fixing
the grain yet. So I just took the edac_mc version first in the
assumption, that one is working.
It looks like the intention here is to limit the grain to the page
size. But right, the calculation is wrong here. I am also going to
reply to your patch you sent on this.
>
> If physical_addr_mask were the same as PAGE_MASK this wouldn't this always give ~0?
> (masking logic like this always does my head in)
>
> /me gives it ago:
> | {1}[Hardware Error]: physical_address: 0x00000000deadbeef
> | {1}[Hardware Error]: physical_address_mask: 0xffffffffffff0000
> | {1}[Hardware Error]: error_type: 6, master abort
> | EDAC MC0: 1 CE Master abort on unknown label ( page:0xdead offset:0xbeef
> | grain:-1 syndrome:0x0 - status(0x0000000000000001): reserved)
>
> That 'grain:-1' is because the calculated e->grain was an unlikely 0xffffffffffffffff.
> Patch incoming, if you could test it on your platform that'd be great.
>
> I don't think ghes_edac.c wants this '+1'.
The +1 looks odd to me also for the edac_mc driver, but I need to take
a closer look here as well as some logs suggest the grain is
calculated correctly.
I will do some further examination here and also respond to your
patch.
Thank you for review.
-Robert
Powered by blists - more mailing lists