Message-ID: <245167.20496.qm@web50107.mail.re2.yahoo.com>
Date: Tue, 21 Sep 2010 07:08:58 -0700 (PDT)
From: Doug Thompson <norsk5@...oo.com>
To: Huang Ying <ying.huang@...el.com>,
Borislav Petkov <borislav.petkov@....com>
Cc: RobertRichter <robert.richter@....com>,
Ingo Molnar <mingo@...e.hu>, "H. Peter Anvin" <hpa@...or.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Andi Kleen <andi@...stfloor.org>,
edac-devel <linux-edac@...r.kernel.org>
Subject: Re: [RFC 3/6] x86, NMI, Rename memory parity error to PCI SERR error
--- On Tue, 9/21/10, Borislav Petkov <borislav.petkov@....com> wrote:
> From: Borislav Petkov <borislav.petkov@....com>
> Subject: Re: [RFC 3/6] x86, NMI, Rename memory parity error to PCI SERR error
> To: "Huang Ying" <ying.huang@...el.com>
> Cc: "Richter, Robert" <robert.richter@....com>, "Ingo Molnar" <mingo@...e.hu>, "H. Peter Anvin" <hpa@...or.com>, "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Andi Kleen" <andi@...stfloor.org>, "Doug Thompson" <norsk5@...oo.com>, "edac-devel" <linux-edac@...r.kernel.org>
> Date: Tuesday, September 21, 2010, 12:37 AM
> From: Huang Ying <ying.huang@...el.com>
> Date: Mon, Sep 20, 2010 at 08:22:28PM -0400
>
> (Forgot to add edac-devel to Cc)
thanks
>
> > > What is more, there are a bunch of edac drivers using the PCI SERR nmi
> > > as a means to check for PCI errors so we shouldn't be removing it now,
> > > should we?
> >
> > After checking the source, I found in mem_parity_error (will renamed to
> > pci_serr_error), edac_atomic_assert_error() is called, which increase
> > edac_err_assert, edac_err_assert is used in
> > edac_mc_assert_error_check_and_clear(), which is used in
> > edac_mc_workq_function for memory error only, not for PCI errors.
>
> Yes, I suppose the edac part in the mem_parity_error() was originally
> meant for memory parity errors. Now, I understand your incentive of
> changing that to handle PCI SERR errors but by axing the edac part,
> you're practically disabling the mci->edac_check() call for edac
> drivers using NMIs for error reporting (I don't know how many do that,
> btw...) and almost every edac driver defines that function pointer to a
> driver-specific error checking function.
>
> So if there are no more IBM PC-AT machines running Linux out, I
> think we can rip out the whole code around edac_err_assert and thus
> remove the edac_mc_assert_error_check_and_clear() part from the
> edac_mc_workq_function() which would make all edac drivers solely poll
> for mem errors.
>
> What do the others think, Doug?
History lesson:
PCI bus error scanning was added to EDAC when we (at Linux Networx, in the 2005 timeframe) discovered a bad PCI riser card. Due to bad manufacturing, it would cause PCI bus errors during transfers. We determined that by scanning the PCI bus, we could isolate bad cards and verify proper PCI bus operation on those cards and other PCI devices.
Once that was in place, we found other systems that also had PCI bus errors, though less frequently than our original system. A handful of errors were discovered, and we found that not many systems were handling any of them, or even reporting them. Hence the PCI bus scanner of EDAC. I profiled it, and it took on average 2500 TSC cycles to perform a complete bus scan, which occurred every second. So it was an expensive operation. That is why I added controls to turn off PCI bus scanning if desired.
Handling memory parity errors via NMI was NOT the prime reason it was added to EDAC. Looking at SERR (or any PCI bus error) was the prime reason it was added. There was a patch to handle NMI events better, but it didn't get pushed upstream.
I have not kept up with development of the kernel's PCI SERR error handling code elsewhere (if any) and don't fully know its features or current operation.
If it is determined that EDAC no longer needs to scan the PCI bus because some other system is now managing that error checking operation, great!
It was a wart in a way, hanging in the memory driver. I really wanted to revisit that and either set it up as a separate driver or embed it in the EDAC core or something.
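For readers following the thread, the edac_err_assert flow described in the quoted text (the NMI path bumps a counter, and the periodic work function checks and clears it before calling the driver's check hook) can be modeled in userspace roughly as below. This is only a sketch under assumptions: every model_* name is hypothetical, C11 atomics stand in for the kernel's atomic helpers, and it covers only the NMI-driven path mentioned in the quote, not whatever else the real work function does.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the edac_err_assert counter. */
static atomic_int model_err_assert;

/* Counts how often the (hypothetical) driver check hook ran. */
static int model_checks_run;

/* Models the NMI path: record that an error was asserted. */
static void model_assert_error(void)
{
    atomic_fetch_add(&model_err_assert, 1);
}

/*
 * Models check-and-clear: atomically reset the counter and report
 * whether anything was asserted since the previous call.
 */
static int model_check_and_clear(void)
{
    return atomic_exchange(&model_err_assert, 0) != 0;
}

/* Stand-in for a driver-specific error checking function. */
static void model_edac_check(void)
{
    model_checks_run++;
}

/* Models one pass of the periodic work function. */
static void model_workq_pass(void)
{
    if (model_check_and_clear())
        model_edac_check();
}

/* Small demo: returns how many times the check hook ran. */
static int model_demo(void)
{
    atomic_store(&model_err_assert, 0);
    model_checks_run = 0;

    model_workq_pass();      /* nothing asserted: hook is skipped */
    model_assert_error();    /* two NMIs arrive between passes */
    model_assert_error();
    model_workq_pass();      /* asserted: hook runs once, counter cleared */
    model_workq_pass();      /* already cleared: hook is skipped */

    return model_checks_run;
}
```

In this model, two assertions between passes collapse into a single check, which illustrates why removing the assert call from the NMI handler would leave such a driver's check hook uncalled on this path.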
doug thompson
>
> --
> Regards/Gruss,
> Boris.
>
> Advanced Micro Devices GmbH
> Einsteinring 24, 85609 Dornach
> General Managers: Alberto Bozzo, Andrew Bowd
> Registration: Dornach, Gemeinde Aschheim, Landkreis
> Muenchen
> Registergericht Muenchen, HRB Nr. 43632
>
>