linux-kernel - Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash high memory

Message-ID: <YrQ/98J5UqPh8K89@arm.com>
Date:   Thu, 23 Jun 2022 11:27:03 +0100
From:   Catalin Marinas <catalin.marinas@....com>
To:     Kefeng Wang <wangkefeng.wang@...wei.com>
Cc:     Baoquan He <bhe@...hat.com>, Zhen Lei <thunder.leizhen@...wei.com>,
        Ard Biesheuvel <ardb@...nel.org>,
        Mark Rutland <mark.rutland@....com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        x86@...nel.org, "H . Peter Anvin" <hpa@...or.com>,
        Eric Biederman <ebiederm@...ssion.com>,
        Rob Herring <robh+dt@...nel.org>,
        Frank Rowand <frowand.list@...il.com>,
        devicetree@...r.kernel.org, Dave Young <dyoung@...hat.com>,
        Vivek Goyal <vgoyal@...hat.com>, kexec@...ts.infradead.org,
        linux-kernel@...r.kernel.org, Will Deacon <will@...nel.org>,
        linux-arm-kernel@...ts.infradead.org,
        Jonathan Corbet <corbet@....net>, linux-doc@...r.kernel.org,
        Randy Dunlap <rdunlap@...radead.org>,
        Feng Zhou <zhoufeng.zf@...edance.com>,
        Chen Zhou <dingguo.cz@...group.com>,
        John Donnelly <John.p.donnelly@...cle.com>,
        Dave Kleikamp <dave.kleikamp@...cle.com>,
        liushixin <liushixin2@...wei.com>
Subject: Re: [PATCH 5/5] arm64: kdump: Don't defer the reservation of crash
 high memory

On Wed, Jun 22, 2022 at 08:03:21PM +0800, Kefeng Wang wrote:
> On 2022/6/22 2:04, Catalin Marinas wrote:
> > On Tue, Jun 21, 2022 at 02:24:01PM +0800, Kefeng Wang wrote:
> > > On 2022/6/21 13:33, Baoquan He wrote:
> > > > On 06/13/22 at 04:09pm, Zhen Lei wrote:
> > > > > If the crashkernel has both high memory above DMA zones and low memory
> > > > > in DMA zones, kexec always loads the content such as Image and dtb to the
> > > > > high memory instead of the low memory. This means that only high memory
> > > > > requires write protection based on page-level mapping. The allocation of
> > > > > high memory does not depend on the DMA boundary. So we can reserve the
> > > > > high memory first even if the crashkernel reservation is deferred.
> > > > > 
> > > > > This means that the block mapping can still be performed on other kernel
> > > > > linear address spaces, the TLB miss rate can be reduced and the system
> > > > > performance will be improved.
> > > > Ugh, this looks a little ugly, honestly.
> > > > 
> > > > If it is certain that arm64 can't split large page mappings of the
> > > > linear region, this patch is one way to optimize the linear mapping.
> > > > Given that a kdump setting is necessary on arm64 servers, the boot
> > > > speed is truly heavily impacted.
> > > Is there some conclusion or discussion that arm64 can't split large page
> > > mapping?
> > > 
> > > Could the crashkernel reservation (and KFENCE pool) be split dynamically?
> > > 
> > > I found Mark's reply to "arm64: remove page granularity limitation from
> > > KFENCE"[1]:
> > > 
> > >    "We also avoid live changes from block<->table mappings, since the
> > >    architecture gives us very weak guarantees there and generally requires
> > >    a Break-Before-Make sequence (though IIRC this was tightened up
> > >    somewhat, so maybe going one way is supposed to work). Unless it's
> > >    really necessary, I'd rather not split these block mappings while
> > >    they're live."
> > The problem with splitting is that you can end up with two entries in
> > the TLB for the same VA->PA mapping (e.g. one for a 4KB page and another
> > for a 2MB block). In the lucky case, the CPU will trigger a TLB conflict
> > abort (but it can be worse, e.g. loss of coherency).
> Thanks for your explanation,
> > Prior to FEAT_BBM (added in ARMv8.4), such a scenario was not allowed at
> > all; the software would have to unmap the range, TLBI, remap. With
> > FEAT_BBM (level 2), we can do this without tearing the mapping down but
> > we still need to handle the potential TLB conflict abort. The handler
> > only needs a TLBI but if it touches the memory range being changed it
> > risks faulting again. With vmap stacks and the kernel image mapped in
> > the vmalloc space, we have a small window where this could be handled
> > but we probably can't go into the C part of the exception handling
> > (tracing etc. may access a kmalloc'ed object for example).
> 
> So without FEAT_BBM, we can only guarantee a BBM sequence via
> "unmap the range, TLBI, remap" or the following option,

Yes, that's the break-before-make sequence.
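
To make the ordering concrete, here is a minimal toy model in Python of
splitting a 2MB block into 4KB pages with and without break-before-make.
It is only a sketch under loud assumptions: ToyMMU, split_block and demo
are hypothetical names, the "TLB" is a plain set, and every access is
assumed to re-walk the table. Real hardware (speculative walks, several
CPUs, the actual TLBI instructions) is far more involved.

```python
# Toy model only, not kernel code; every name here is hypothetical.
BLOCK = 2 * 1024 * 1024   # size of a 2MB block mapping
PAGE = 4 * 1024           # size of a 4KB page mapping


class TLBConflict(Exception):
    """Stand-in for a TLB conflict abort."""


class ToyMMU:
    def __init__(self):
        self.table = {}   # base VA -> size of the live mapping
        self.tlb = set()  # cached (base VA, size) translations

    def tlbi(self):
        """Invalidate every cached translation (stand-in for a TLBI)."""
        self.tlb.clear()

    def access(self, va):
        # Assumption: the walker may cache the current table entry even
        # while an older translation for the same VA is still cached.
        for base, size in self.table.items():
            if base <= va < base + size:
                self.tlb.add((base, size))
        hits = [(b, s) for (b, s) in self.tlb if b <= va < b + s]
        if not hits:
            raise LookupError("translation fault")
        if len(hits) > 1:
            raise TLBConflict(hex(va))
        return hits[0]


def split_block(mmu, base, use_bbm):
    """Replace one block mapping at base with 4KB page mappings."""
    del mmu.table[base]                       # break: remove the block entry
    if use_bbm:
        mmu.tlbi()                            # TLBI before the new entries
    for off in range(0, BLOCK, PAGE):
        mmu.table[base + off] = PAGE          # make: install the pages


def demo(use_bbm):
    mmu = ToyMMU()
    mmu.table[0] = BLOCK
    mmu.access(0)                             # block translation gets cached
    split_block(mmu, 0, use_bbm)
    try:
        mmu.access(0)
    except TLBConflict:
        return "conflict"
    return "ok"
```

In this model demo(True) completes cleanly, while demo(False) leaves the
stale block entry cached next to the new page entry and reports a
conflict.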

> and with FEAT_BBM (level 2), we could have an easy way to avoid TLB
> conflicts for some vmalloc space, but it is still hard to deal with
> other scenarios?

It's not too hard in theory. Basically there's a small risk of getting a
TLB conflict abort for the mappings you change without a BBM sequence (I
think it's nearly non-existent when going from a large block to smaller
pages, though the architecture states that it's still possible). Since
we only want to do this for the linear map and the kernel and stack are
in the vmalloc space, we can handle such a trap as a safety measure (it
just needs a TLBI). It may help to tweak a model to force it to generate
such conflict aborts, otherwise we'd not be able to test the code.
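
As a rough illustration of "it just needs a TLBI", the sketch below
forces a conflict in a toy Python model and recovers by invalidating and
retrying. All names (lookup, lookup_with_handler, demo) are hypothetical,
the "TLB" is a plain set, and the model assumes every lookup re-walks the
table; it says nothing about how a real arm64 handler would be wired up.

```python
# Toy model only, not kernel code; every name here is hypothetical.
BLOCK = 2 * 1024 * 1024   # size of a 2MB block mapping
PAGE = 4 * 1024           # size of a 4KB page mapping


class TLBConflict(Exception):
    """Stand-in for a TLB conflict abort."""


def lookup(tlb, table, va):
    """Cache the live table entry for va; fail if the TLB then has two hits."""
    for base, size in table.items():
        if base <= va < base + size:
            tlb.add((base, size))
    hits = [(b, s) for (b, s) in tlb if b <= va < b + s]
    if not hits:
        raise LookupError("translation fault")
    if len(hits) > 1:
        raise TLBConflict(hex(va))
    return hits[0]


def lookup_with_handler(tlb, table, va):
    """Retry once after invalidating the TLB, as a conflict handler might."""
    try:
        return lookup(tlb, table, va)
    except TLBConflict:
        tlb.clear()                       # the handler's only job here: TLBI
        return lookup(tlb, table, va)


def demo():
    tlb = set()
    table = {0: BLOCK}
    lookup(tlb, table, 0)                 # block translation gets cached
    # Change the live mapping without a break-before-make sequence.
    table = {off: PAGE for off in range(0, BLOCK, PAGE)}
    return lookup_with_handler(tlb, table, 0)
```

Here demo() hits the conflict once, clears the toy TLB and then resolves
address 0 through the new 4KB entry. The model ignores the constraint
that the handler itself must not touch the range being changed.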

It's possible that such a trap is raised at EL2 if a guest caused the
conflict abort (the architecture left this as IMP DEF). The hypervisors
may need to be taught to do a TLBI VMALLS12E1 instead of killing the
guest. I haven't checked what KVM does.

-- 
Catalin