linux-kernel - Re: KVM guest sometimes failed to boot because of kernel stack overflow if KPTI is enabled on a hisilicon ARM64 platform.

Message-ID: <9549e15d-4ec6-8dd3-2237-b6c9b52fc816@arm.com>
Date:   Thu, 28 Jun 2018 09:45:56 +0100
From:   James Morse <james.morse@....com>
To:     Wei Xu <xuwei5@...ilicon.com>
Cc:     Will Deacon <will.deacon@....com>, mark.rutland@....com,
        catalin.marinas@....com, Linuxarm <linuxarm@...wei.com>,
        Zhangyi ac <zhangyi.ac@...wei.com>, suzuki.poulose@....com,
        marc.zyngier@....com,
        "Xiongfanggou (James)" <james.xiong@...wei.com>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        dave.martin@....com,
        "Liyuan (Larry, Turing Solution)" <Larry.T@...wei.com>,
        libeijian@...ilicon.com
Subject: Re: KVM guest sometimes failed to boot because of kernel stack
 overflow if KPTI is enabled on a hisilicon ARM64 platform.

Hi Wei,

On 27/06/18 14:26, Wei Xu wrote:
> Sorry, I should highlight that I have only updated the default value
> of CONFIG_NR_CPUS by menuconfig in the previous mail.
> That is why it showed dirty.

(menuconfig changes don't show up like this)

More than 64 CPUs ... Is this system running more VMs than it has VMIDs? Having
too few VMIDs does work with KVM; it's just going to trigger rollover frequently.
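
(To illustrate the rollover point, here is a rough toy model, not the kernel's
actual allocator. It assumes VMs are run strictly round-robin, that VMID 0 is
reserved, and that a VM needs a fresh VMID whenever its generation is stale.
The names simulate_rollovers, vmid_bits and num_runs are made up for this
sketch.)

```python
def simulate_rollovers(num_vms, vmid_bits, num_runs):
    """Count VMID rollovers in a simplified generation-based allocator.

    Toy model only: VMs run round-robin, VMID 0 is treated as reserved,
    and each rollover invalidates every VMID handed out so far.
    """
    max_vmid = 1 << vmid_bits
    generation = 1            # global generation counter
    next_vmid = 1             # next free VMID in the current generation
    vm_gen = [0] * num_vms    # generation each VM's VMID belongs to
    rollovers = 0

    for i in range(num_runs):
        vm = i % num_vms
        if vm_gen[vm] == generation:
            continue          # this VM's VMID is still valid
        if next_vmid == 0:
            # VMID space exhausted: start a new generation.
            generation += 1
            next_vmid = 1
            rollovers += 1
        vm_gen[vm] = generation
        next_vmid = (next_vmid + 1) % max_vmid   # wraps to 0 when exhausted

    return rollovers


if __name__ == "__main__":
    # 8-bit VMIDs give 255 usable values in this model.
    for vms in (64, 255, 256, 300):
        print(vms, "VMs:", simulate_rollovers(vms, 8, vms * 10), "rollovers")
```

(In this model, as long as the number of VMs fits in the VMID space there is no
rollover after the first pass; one VM too many and every pass triggers one.)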

Just to check, what kernel version is the host running? Does it have commit
f0cf47d939d0 ("KVM: arm/arm64: Close VMID generation race")?
(It looks like that went in as a fix for v4.17-rc3.)

Are you running (lots of) other VMs whenever this happens? Do they have multiple
vcpus? (I'm thinking of the scenario in that patch's description.)

Is the host system otherwise idle when this happens?
(If not, can you reproduce the issue without exhausting the VMIDs?)

It may be that writing back the page-table entries with the MMU off and
changing the cache maintenance are just changing the timing of something else.

Thanks,

James