linux-kernel - Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZEhbsHqBapHtdrg7@google.com>
Date:   Tue, 25 Apr 2023 16:01:04 -0700
From:   Sean Christopherson <seanjc@...gle.com>
To:     Ackerley Tng <ackerleytng@...gle.com>
Cc:     chao.p.peng@...ux.intel.com, xiaoyao.li@...el.com,
        isaku.yamahata@...il.com, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-arch@...r.kernel.org,
        linux-api@...r.kernel.org, linux-doc@...r.kernel.org,
        qemu-devel@...gnu.org, pbonzini@...hat.com, corbet@....net,
        vkuznets@...hat.com, wanpengli@...cent.com, jmattson@...gle.com,
        joro@...tes.org, tglx@...utronix.de, mingo@...hat.com,
        bp@...en8.de, arnd@...db.de, naoya.horiguchi@....com,
        linmiaohe@...wei.com, x86@...nel.org, hpa@...or.com,
        hughd@...gle.com, jlayton@...nel.org, bfields@...ldses.org,
        akpm@...ux-foundation.org, shuah@...nel.org, rppt@...nel.org,
        steven.price@....com, mail@...iej.szmigiero.name, vbabka@...e.cz,
        vannapurve@...gle.com, yu.c.zhang@...ux.intel.com,
        kirill.shutemov@...ux.intel.com, luto@...nel.org,
        jun.nakajima@...el.com, dave.hansen@...el.com, ak@...ux.intel.com,
        david@...hat.com, aarcange@...hat.com, ddutile@...hat.com,
        dhildenb@...hat.com, qperret@...gle.com, tabba@...gle.com,
        michael.roth@....com, mhocko@...e.com, wei.w.wang@...el.com
Subject: Re: [PATCH v10 9/9] KVM: Enable and expose KVM_MEM_PRIVATE

On Tue, Apr 18, 2023, Ackerley Tng wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
> > I agree, a pure alignment check is too restrictive, and not really what I
> > intended despite past me literally saying that's what I wanted :-)  I think
> > I may have also inverted the "less alignment" statement, but luckily I
> > believe that ends up being a moot point.
> 
> > The goal is to avoid having to juggle scenarios where KVM wants to create a
> > hugepage, but restrictedmem can't provide one because of a misaligned file
> > offset.  I think the rule we want is that the offset must be aligned to the
> > largest page size allowed by the memslot _size_.  E.g. on x86, if the
> > memslot size is >=1GiB then the offset must be 1GiB or beter, ditto for
> > >=2MiB and >=4KiB (ignoring that 4KiB is already a requirement).
> 
> > We could loosen that to say the largest size allowed by the memslot, but I
> > don't think that's worth the effort unless it's trivially easy to implement
> > in code, e.g. KVM could technically allow a 4KiB aligned offset if the
> > memslot is 2MiB sized but only 4KiB aligned on the GPA.  I doubt there's a
> > real use case for such a memslot, so I want to disallow that unless it's
> > super easy to implement.
> 
> Checking my understanding here about why we need this alignment check:
> 
> When KVM requests a page from restrictedmem, KVM will provide an offset
> into the file in terms of 4K pages.
> 
> When shmem is configured to use hugepages, shmem_get_folio() will round
> the requested offset down to the nearest hugepage-aligned boundary in
> shmem_alloc_hugefolio().
> 
> Example of problematic configuration provided to
> KVM_SET_USER_MEMORY_REGION2:
> 
> + shmem configured to use 1GB pages
> + restrictedmem_offset provided to KVM_SET_USER_MEMORY_REGION2: 0x4000
> + memory_size provided in KVM_SET_USER_MEMORY_REGION2: 1GB
> + KVM requests offset (pgoff_t) 0x8, which translates to offset 0x8000
> 
> restrictedmem_get_page() and shmem_get_folio() returns the page for
> offset 0x0 in the file, since rounding down 0x8000 to the nearest 1GB is
> 0x0. This is allocating outside the range that KVM is supposed to use,
> since the parameters provided in KVM_SET_USER_MEMORY_REGION2 is only
> supposed to be offset 0x4000 to (0x4000 + 1GB = 0x40004000) in the file.
> 
> IIUC shmem will actually just round down (0x4000 rounded down to nearest
> 1GB will be 0x0) and allocate without checking bounds, so if offset 0x0
> to 0x4000 in the file were supposed to be used by something else, there
> might be issues.
> 
> Hence, this alignment check ensures that rounding down of any offsets
> provided by KVM (based on page size configured in the backing file
> provided) to restrictedmem_get_page() must not go below the offset
> provided to KVM_SET_USER_MEMORY_REGION2.
> 
> Enforcing alignment of restrictedmem_offset based on the currently-set
> page size in the backing file (i.e. shmem) may not be effective, since
> the size of the pages in the backing file can be adjusted to a larger
> size after KVM_SET_USER_MEMORY_REGION2 succeeds. With that, we may still
> end up allocating outside the range that KVM was provided with.
> 
> Hence, to be safe, we should check alignment to the max page size across
> all backing filesystems, so the constraint is
> 
>     rounding down restrictedmem_offset to
>     min(max page size across all backing filesystems,
>         max page size that fits in memory_size) == restrictedmem_offset
> 
> which is the same check as
> 
>     restrictedmem_offset must be aligned to min(max page size across all
>     backing filesystems, max page size that fits in memory_size)
> 
> which can safely reduce to
> 
>     restrictedmem_offset must be aligned to max page size that fits in
>     memory_size
> 
> since "max page size that fits in memory_size" is probably <= to "max
> page size across all backing filesystems", and if it's larger, it'll
> just be a tighter constraint.

Yes?  The alignment check isn't strictly required, KVM _could_ deal with the above
scenario, it's just a lot simpler and safer for KVM if the file offset needs to
be sanely aligned.

> If the above understanding is correct:
> 
> + We must enforce this in the KVM_SET_USER_MEMORY_REGION2 handler, since
>   IIUC shmem will just round down and allocate without checking bounds.
> 
>     + I think this is okay because holes in the restrictedmem file (in
>       terms of offset) made to accommodate this constraint don't cost us
>       anything anyway(?) Are they just arbitrary offsets in a file? In
>       our case, this file is usually a new and empty file.
> 
>     + In the case of migration of a restrictedmem file between two KVM
>       VMs, this constraint would cause a problem is if the largest
>       possible page size on the destination machine is larger than that
>       of the source machine. In that case, we might have to move the
>       data in the file to a different offset (a separate problem).

Hmm, I was thinking this would be a non-issue because the check would be tied to
the max page _possible_ page size irrespective of hardware support, but that would
be problematic if KVM ever supports 512GiB pages.  I'm not sure that speculatively
requiring super huge memslots to be 512GiB aligned is sensible.

Aha!  If we go with a KVM ioctl(), a clean way around this is tie the alignment
requirement to the memfd flags, e.g. if userspace requests the memfd to be backed
by PMD hugepages, then the memslot offset needs to be 2MiB aligned on x86.  That
will continue to work if (big if) KVM supports 512GiB pages because the "legacy"
memfd would still be capped at 2MiB pages.

Architectures that support variable hugepage sizes might need to do something
else, but I don't think that possibility affects what x86 can/can't do.

> + On this note, it seems like there is no check for when the range is
>   smaller than the allocated page? Like if the range provided is 4KB in
>   size, but shmem is then configured to use a 1GB page, will we end up
>   allocating past the end of the range?

No, KVM already gracefully handles situations like this.  Well, x86 does, I assume
other architectures do too :-)

As above, the intent of the extra restriction is so that KVM doen't need even more
weird code (read: math) to gracefully handle the new edge cases that would come with
fd-only memslots.