Message-ID: <9907ff256ff74f65aff89255bae3b92f@zhaoxin.com>
Date:   Fri, 31 May 2019 03:17:07 +0000
From:   David Wang <DavidWang@...oxin.com>
To:     "Raj, Ashok" <ashok.raj@...el.com>,
        Tony W Wang-oc <TonyWWang-oc@...oxin.com>
CC:     "tipbot@...or.com" <tipbot@...or.com>, "bp@...e.de" <bp@...e.de>,
        "hpa@...or.com" <hpa@...or.com>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-tip-commits@...r.kernel.org" 
        <linux-tip-commits@...r.kernel.org>,
        "mingo@...nel.org" <mingo@...nel.org>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "stable@...r.kernel.org" <stable@...r.kernel.org>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "tony.luck@...el.com" <tony.luck@...el.com>,
        "torvalds@...ux-foundation.org" <torvalds@...ux-foundation.org>
Subject: Re: Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't participate in rendezvous process

> -----Original Mail-----
> Sender: Raj, Ashok <ashok.raj@...el.com>
> Time: 2019.05.31 1:11
> To : Tony W Wang-oc <TonyWWang-oc@...oxin.com>
> CC: tipbot@...or.com; bp@...e.de; hpa@...or.com;
> linux-edac@...r.kernel.org; linux-kernel@...r.kernel.org;
> linux-tip-commits@...r.kernel.org; mingo@...nel.org; peterz@...radead.org;
> stable@...r.kernel.org; tglx@...utronix.de; tony.luck@...el.com;
> torvalds@...ux-foundation.org; David Wang <DavidWang@...oxin.com>; Ashok
> Raj <ashok.raj@...el.com>
> Topic: Re: Re: Re: [tip:x86/urgent] x86/mce: Ensure offline CPUs don't
> participate in rendezvous process
> 
> On Thu, May 30, 2019 at 09:13:39AM +0000, Tony W Wang-oc wrote:
> > On Thu, May 30, 2019, Tony W Wang-oc wrote:
> > > Hi Ashok,
> > > I have two questions about this patch, could you help to check:
> > >
> > > > 1. For broadcast #MC exceptions, this patch seems to require that #MC
> > > > exception errors set MCG_STATUS_RIPV = 1.
> > > > But on Intel CPUs, some #MC exception errors set MCG_STATUS_RIPV = 0
> > > > (like "Recoverable-not-continuable SRAR Type" errors). For these
> > > > errors the patch doesn't seem to work. Is that okay?
> > > >
> > > > 2. For LMCE exceptions, this patch seems to require that #MC exception
> > > > errors set MCG_STATUS_RIPV = 0 to make sure LMCE is handled normally
> > > > even on an offline CPU.
> > > > For LMCE errors that set MCG_STATUS_RIPV = 1, the patch prevents an
> > > > offline CPU from handling these LMCE errors. Is that okay?
> > >
> >
> > More specifically, this patch seems to require #MC exceptions to meet the
> > condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1". But on a Xeon
> > X5650 machine (SMP),
> 
> The offline CPU will never get an LMCE=1, since those only happen on the CPU
> that's doing active work. Offline CPUs are just sitting in idle.
So, for Intel CPUs, is LMCE only for thread-level (or core-level) errors? If
not, suppose two threads share a level-2 cache: thread 0 is active, and thread 1
has been offlined by software. When an MCE for this level-2 cache occurs,
thread 1 will become active. When thread 1 reads MCG_STATUS.LMCE, will the
result always be 0?

Thanks.
> 
> The specific error here is a PCC=1, so irrespective of what happens, we do
> capture the errors in the per-cpu log, and the kernel would panic.
> 
> What this patch specifically tries to avoid is leaving an error sitting with
> MCG-STATUS.MCIP=1, in which case another recoverable error would shut the
> system down.
> 
> I don't see anything wrong with what this patch does..
> 
> > "Data CACHE Level-2 Generic Error" does not meet this condition.
> >
> > I got below message from:
> > https://www.centos.org/forums/viewtopic.php?p=292742
> >
> > Hardware event. This is not a software error.
> > MCE 0
> > CPU 4 BANK 6 TSC b7065eeaa18b0
> > TIME 1545643603 Mon Dec 24 10:26:43 2018
> > MCG status:MCIP
> > MCi status:
> > Uncorrected error
> > Error enabled
> > Processor context corrupt
> > MCA: Data CACHE Level-2 Generic Error
> > STATUS b200000080000106 MCGSTATUS 4
> > MCGCAP 1c09 APICID 4 SOCKETID 0
> > CPUID Vendor Intel Family 6 Model 44
> >
> > > Thanks
> > > Tony W Wang-oc
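
For reference, the MCGSTATUS and STATUS values in the quoted mcelog output can be decoded by hand. Below is a minimal Python sketch, not kernel code. The constant names mirror the ones in the kernel's arch/x86/include/asm/mce.h and the bit positions follow the Intel SDM layout of IA32_MCG_STATUS and IA32_MCi_STATUS; the helper names decode_flags and ripv_xor_lmces are hypothetical and exist only for illustration. It checks the condition "MCG_STATUS_RIPV ^ MCG_STATUS_LMCES == 1" discussed above against MCGSTATUS 4.

```python
# Illustrative sketch only: decode the register values from the quoted
# mcelog report. Bit positions are assumed from the Intel SDM layout.

# IA32_MCG_STATUS bits
MCG_STATUS_RIPV = 1 << 0   # restart IP valid
MCG_STATUS_EIPV = 1 << 1   # error IP valid
MCG_STATUS_MCIP = 1 << 2   # machine check in progress
MCG_STATUS_LMCES = 1 << 3  # local machine check exception signaled

# IA32_MCi_STATUS bits (only the flags relevant to the quoted report)
MCI_STATUS_VAL = 1 << 63   # register contents valid
MCI_STATUS_UC = 1 << 61    # uncorrected error
MCI_STATUS_EN = 1 << 60    # error reporting enabled
MCI_STATUS_PCC = 1 << 57   # processor context corrupt

MCG_FLAGS = [
    ("RIPV", MCG_STATUS_RIPV),
    ("EIPV", MCG_STATUS_EIPV),
    ("MCIP", MCG_STATUS_MCIP),
    ("LMCES", MCG_STATUS_LMCES),
]

MCI_FLAGS = [
    ("VAL", MCI_STATUS_VAL),
    ("UC", MCI_STATUS_UC),
    ("EN", MCI_STATUS_EN),
    ("PCC", MCI_STATUS_PCC),
]


def decode_flags(value, flags):
    """Return the names of all flags whose bit is set in value."""
    return [name for name, bit in flags if value & bit]


def ripv_xor_lmces(mcgstatus):
    """True if exactly one of RIPV and LMCES is set (RIPV ^ LMCES == 1)."""
    ripv = bool(mcgstatus & MCG_STATUS_RIPV)
    lmces = bool(mcgstatus & MCG_STATUS_LMCES)
    return ripv != lmces


if __name__ == "__main__":
    mcgstatus = 0x4                 # "MCGSTATUS 4" from the report
    status = 0xB200000080000106     # "STATUS b200000080000106"
    print(decode_flags(mcgstatus, MCG_FLAGS))
    print(decode_flags(status, MCI_FLAGS))
    print(ripv_xor_lmces(mcgstatus))
```

Under these assumptions, MCGSTATUS 4 has only MCIP set, so RIPV and LMCES are both 0 and the XOR condition is not met, which is the case Tony describes for the Xeon X5650 report.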
