linux-kernel - Re: Fwd: [WARNING AND ERROR] may be system slow and audio and video breaking

Message-ID: <20201018210323.GD8364@zn.tnic>
Date:   Sun, 18 Oct 2020 23:03:23 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Jeffrin Jose T <jeffrin@...agiritech.edu.in>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        jpoimboe@...hat.com, mbenes@...e.cz,
        "peterz@...radead.org" <peterz@...radead.org>,
        shile.zhang@...ux.alibaba.com, lkml <linux-kernel@...r.kernel.org>,
        Greg KH <gregkh@...uxfoundation.org>,
        Shuah Khan <shuah@...nel.org>
Subject: Re: Fwd: [WARNING AND ERROR]  may be  system slow and  audio and
 video breaking

On Mon, Oct 19, 2020 at 01:51:34AM +0530, Jeffrin Jose T wrote:
> On Sun, 2020-10-18 at 19:49 +0200, Borislav Petkov wrote:
> > On Sun, Oct 18, 2020 at 10:42:39PM +0530, Jeffrin Jose T wrote:
> > > smpboot: Scheduler frequency invariance went wobbly, disabling!
> > > [ 1112.592866] unchecked MSR access error: RDMSR from 0x123 at rIP:
> > > 0xffffffffb5c9a184 (native_read_msr+0x4/0x30)

Ok, you forgot to say in your initial mail that this happens when you
suspend your laptop.

Now, this unchecked MSR error thing happens only once because, that
early during resume, the microcode on CPU1 is not updated yet and
therefore the 0x123 MSR is not "emulated" by the microcode, so to
speak, thus the warning. Why the microcode is not updated at that point
needs to be debugged separately and I'll try to reproduce that on my
machine.

That warning doesn't happen anymore, though, once the microcode is
updated.

But what happens after that is you get a flood of correctable PCIe
errors about a transaction to a device timing out:

pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00001000/00002000
pcieport 0000:00:1c.5:    [12] Timeout 

and it looks like that flood is slowing down the machine because it is
busy logging them.
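
For reference, the status/mask pair in the log above can be decoded by
hand. The sketch below is only an illustration, not something from the
kernel tree: the function names are made up here, and the bit names
follow the Correctable Error Status register layout in the PCIe AER
capability as I understand it. It shows that status 00001000 is bit 12
(the replay timer timeout the kernel prints as "[12] Timeout") and that
mask 00002000 is bit 13.

```python
import re

# Bit positions in the PCIe AER Correctable Error Status register.
# Names are taken from the PCIe specification; anything not listed is
# reported as unknown/reserved.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" part of an AER log line.
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return (bit, name) for every bit set in a 32-bit register value."""
    return [
        (bit, CORRECTABLE_BITS.get(bit, "unknown/reserved"))
        for bit in range(32)
        if value & (1 << bit)
    ]


def decode_aer_line(line):
    """Decode status and mask from one AER log line, or return None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    return {"status": decode_bits(status), "mask": decode_bits(mask)}


if __name__ == "__main__":
    line = ("pcieport 0000:00:1c.5:   device [8086:9d15] "
            "error status/mask=00001000/00002000")
    print(decode_aer_line(line))
```

A set bit in the mask means that error type is not reported, so here
only the timeout in bit 12 is being logged.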

Do

# lspci -nn -xxx

as root. It'll show us which device that 8086:9d15 is.
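
If the full dump is long, the interesting entry can be picked out by
its [vendor:device] ID. A minimal sketch, assuming the usual
"lspci -nn" line format where the ID appears in square brackets; the
helper name and the sample text are invented for illustration and are
not real output from the affected laptop:

```python
def find_device(lspci_nn_output, vendor_device):
    """Return the lspci -nn lines that carry the given vendor:device ID."""
    needle = "[" + vendor_device.lower() + "]"
    return [
        line
        for line in lspci_nn_output.splitlines()
        if needle in line.lower()
    ]


# Hypothetical sample text; the device names here are placeholders.
SAMPLE = """\
00:00.0 Host bridge [0600]: Example Vendor Example Host Bridge [1234:0001]
00:1c.5 PCI bridge [0604]: Example Vendor Example Root Port [8086:9d15] (rev f1)
"""

if __name__ == "__main__":
    for line in find_device(SAMPLE, "8086:9d15"):
        print(line)
```

lspci can also filter on its own with -d, e.g. "lspci -nn -d 8086:9d15",
but the full -xxx dump is still what is being asked for here.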

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette