Message-Id: <D768BA17-8D2E-42FD-92D2-D94F6F1A6BF2@amacapital.net>
Date: Thu, 7 Sep 2017 18:23:27 -0700
From: Andy Lutomirski <luto@...capital.net>
To: Jiri Kosina <jikos@...nel.org>
Cc: Ingo Molnar <mingo@...nel.org>, Andy Lutomirski <luto@...nel.org>,
X86 ML <x86@...nel.org>, Borislav Petkov <bpetkov@...e.de>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH 1/2] x86/mm: Reinitialize TLB state on hotplug and resume
> On Sep 7, 2017, at 12:55 PM, Jiri Kosina <jikos@...nel.org> wrote:
>
> On Thu, 7 Sep 2017, Ingo Molnar wrote:
>
>>>> When Linux brings a CPU down and back up, it switches to init_mm and then
>>>> loads swapper_pg_dir into CR3. With PCID enabled, this has the side effect
>>>> of masking off the ASID bits in CR3.
>>>>
>>>> This can result in some confusion in the TLB handling code. If we
>>>> bring a CPU down and back up with any ASID other than 0, we end up
>>>> with the wrong ASID active on the CPU after resume. This could
>>>> cause our internal state to become corrupt, although major
>>>> corruption is unlikely because init_mm doesn't have any user pages.
>>>> More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
>>>> in the next context switch. The result of *that* is a failure to
>>>> resume from suspend with probability 1 - 1/6^(cpus-1).
>>>>
>>>> Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.
>>>>
>>>> Reported-by: Linus Torvalds <torvalds@...ux-foundation.org>
>>>> Reported-by: Jiri Kosina <jikos@...nel.org>
>>>> Fixes: 10af6235e0d3 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID")
>>>> Signed-off-by: Andy Lutomirski <luto@...nel.org>
>>>
>>> Tested-by: Jiri Kosina <jkosina@...e.cz>
>>
>> The fix should be upstream already, as of 1c9fe4409ce3 and later.
>
> Hm, so I've just experienced two instances in a row of reboot just after
> reading hibernation image (i.e. exactly the same symptom as before) even
> with 3b9f8ed kernel (which contains the fix). Seems like the fix is either
> incomplete (just the probability of it happening is lower), or I'm seeing
> something different with the same symptom.
>
> I'll try to figure out whether it's the same VM_BUG_ON() triggering, but
> probably will be able to do so only tomorrow.
>
Nah, don't waste your time. I think I see the bug, and it's a different bug. It's an easy one-line fix, but I have to figure out how to test it.
> --
> Jiri Kosina
> SUSE Labs
>