Message-ID: <157609629270.3810.9676234389583169255@sif>
Date:   Wed, 11 Dec 2019 14:31:32 -0600
From:   Michael Roth <mdroth@...ux.vnet.ibm.com>
To:     Alexey Kardashevskiy <aik@...abs.ru>, Ram Pai <linuxram@...ibm.com>
Cc:     mpe@...erman.id.au, linuxppc-dev@...ts.ozlabs.org,
        benh@...nel.crashing.org, david@...son.dropbear.id.au,
        paulus@...abs.org, hch@....de, andmike@...ibm.com,
        sukadev@...ux.vnet.ibm.com, mst@...hat.com, ram.n.pai@...il.com,
        cai@....pw, tglx@...utronix.de, bauerman@...ux.ibm.com,
        linux-kernel@...r.kernel.org, leonardo@...ux.ibm.com
Subject: Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with
 the hypervisor.

Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
> 
> 
> On 11/12/2019 02:35, Ram Pai wrote:
> > On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 10/12/2019 16:12, Ram Pai wrote:
> >>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
> >>>>
> >>>>
> >>>> On 07/12/2019 12:12, Ram Pai wrote:
> >>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> >>>>> its parameters.  On secure VMs, hypervisor cannot access the contents of
> >>>>> this page since it gets encrypted.  Hence share the page with the
> >>>>> hypervisor, and unshare when done.
> >>>>
> >>>>
> >>>> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
> >>>> pages. There is a small problem that when DDW is enabled,
> >>>> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complaints
> >>>> about the performance on Slack, but this is caused by the initial cleanup
> >>>> of the default TCE window (which we do not use anyway), and to battle this
> >>>> we can simply reduce its size by adding
> >>>
> >>> something that takes hardly any time with H_PUT_TCE_INDIRECT takes
> >>> 13 secs per device for the H_PUT_TCE approach, during boot. This is with a
> >>> 30GB guest. With a larger guest, the time will deteriorate further.
> >>
> >>
> >> No it will not, I checked. The time is the same for 2GB and 32GB guests:
> >> the delay is caused by clearing the small DMA window, which is small in
> >> the space it maps (1GB) but huge in the number of TCEs since it uses 4K
> >> pages; for the DDW window + emulated devices the IOMMU page size will be
> >> 2M/16M/1G (depending on the system), so the number of TCEs is much smaller.
> > 
> > I can't get your results.  What changes did you make to get them?
> 
> 
> Get what? I passed "-m 2G" and "-m 32G" and got the same time: 13s spent
> clearing the default window, while the huge window took a fraction of a
> second to create and map.

Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
of H_PUT_TCE everywhere?

In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
instant), and then force H_PUT_TCE for new mappings via something like:

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6ba081dd61c9..85d092baf17d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -194,6 +194,6 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
        unsigned long flags;
 
-       if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
+       if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || is_secure_guest()) {
                return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
                                           direction, attrs);
        }

That seems like it would avoid the extra 13s.
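
For reference, the share/unshare pattern under discussion (per the patch
description quoted above) would look roughly like the below. This is only
a sketch from my reading, not the actual patch; in particular,
uv_share_page()/uv_unshare_page() are my assumption about which
ultravisor calls it uses:

        /* Sketch: share the per-cpu TCE page with the hypervisor around
         * H_PUT_TCE_INDIRECT, then unshare it when done. */
        __be64 *tcep = __this_cpu_read(tce_page);

        if (is_secure_guest())
                uv_share_page(PHYS_PFN(__pa(tcep)), 1);

        rc = plpar_tce_put_indirect((u64)tbl->it_index, (u64)tcenum << 12,
                                    (u64)__pa(tcep), limit);

        if (is_secure_guest())
                uv_unshare_page(PHYS_PFN(__pa(tcep)), 1);

The diff above would sidestep that entirely for secure guests by never
taking the multi-TCE path in the first place.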

If we take the additional step of only mapping the SWIOTLB range in
enable_ddw() for is_secure_guest(), that might further improve things
(though the bigger motivation there is the extra isolation it would
grant us for stuff behind the IOMMU, since it apparently doesn't affect
boot time all that much).
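
Something like the following in enable_ddw(), say. Again just a rough
sketch; I'm assuming the global io_tlb_start/io_tlb_end SWIOTLB bounds,
reusing the existing tce_setrange_multi_pSeriesLP() helper, and taking
ddwprop to be the dynamic DMA window property enable_ddw() already sets
up:

        /* Sketch: for secure guests, map only the SWIOTLB bounce buffer
         * through the new window instead of walking all of system RAM. */
        if (is_secure_guest())
                ret = tce_setrange_multi_pSeriesLP(PHYS_PFN(io_tlb_start),
                                PHYS_PFN(io_tlb_end - io_tlb_start), ddwprop);
        else
                ret = walk_system_ram_range(0,
                                memblock_end_of_DRAM() >> PAGE_SHIFT,
                                ddwprop, tce_setrange_multi_pSeriesLP_walk);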

> 
> 
> >>>>
> >>>> -global
> >>>> spapr-pci-host-bridge.dma_win_size=0x4000000
> >>>
> >>> This option speeds it up tremendously. But then, should this option be
> >>> enabled in qemu by default? Only for secure VMs? For both?
> >>
> >>
> >> As discussed in Slack, by default we do not need to clear the entire TCE
> >> table, and we only have to map the swiotlb buffer using the small window.
> >> It is a guest kernel change only. Thanks,
> > 
> > Can you tell me what code you are talking about here?  Where is the TCE
> > table getting cleared? What code needs to be changed to not clear it?
> 
> 
> pci_dma_bus_setup_pSeriesLP()
>         iommu_init_table()
>                 iommu_table_clear()
>                         for () tbl->it_ops->get()
> 
> We do not really need to clear it there; we only need it for VFIO with
> IOMMU SPAPR TCE v1, which reuses these tables, but there are
> iommu_take_ownership/iommu_release_ownership to clear them. I'll
> send a patch for this.


> 
> 
> > Is the code in tce_buildmulti_pSeriesLP() the one that does the clear
> > as well?
> 
> 
> This one does not need to clear TCEs, as it creates a window of known
> size and maps it all.
> 
> Well, actually, it only maps actual guest RAM; if there are gaps in RAM,
> then the TCEs for the gaps will have whatever the hypervisor had there
> (which is zeroes; qemu/kvm clears it anyway).
> 
> 
> > But before I close, you have not told me clearly what the problem is
> > with 'share the page, make the H_PUT_TCE_INDIRECT hcall, unshare the page'.
> 
> Between share and unshare you have a (tiny) window of opportunity to
> attack the guest. No, I do not know how exactly.
> 
> For example, the hypervisor does a lot of PHB+PCI hotplug-unplug with
> 64bit devices; each time, this will create a huge window which will
> share/unshare the same page. No, I do not know exactly how this can
> be exploited either; we cannot rely on what you or I know today. My
> point is that we should not be sharing pages at all unless we really,
> really have to, and this does not seem to be the case.
> 
> But since this seems to be an acceptable compromise anyway,
> 
> Reviewed-by: Alexey Kardashevskiy <aik@...abs.ru>
> 
> 
> 
> 
> 
> > Remember, this is the same page that is earmarked for doing
> > H_PUT_TCE_INDIRECT, and not by my patch; it is already earmarked by the
> > existing code. So it is not some random buffer that was picked. Second,
> > this page is temporarily shared and then unshared; it does not stay shared
> > for life. It does not slow the boot, and it does not need any
> > special command-line options for qemu.
> > Shared-pages technology was put in place exactly for the purpose of
> > sharing data with the hypervisor. We are using this technology exactly
> > for that purpose. And finally, I agreed with your concern about having
> > shared pages stay around. Hence I addressed that concern by
> > unsharing the page. At this point, I fail to understand your concern.
> 
> 
> 
> 
> -- 
> Alexey
