linux-kernel - RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

Message-ID: <MWHPR11MB1645A346723C7A334AC0EA3D8CEA0@MWHPR11MB1645.namprd11.prod.outlook.com>
Date:   Mon, 9 Nov 2020 07:59:19 +0000
From:   "Tian, Kevin" <kevin.tian@...el.com>
To:     "Raj, Ashok" <ashok.raj@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>
CC:     Jason Gunthorpe <jgg@...dia.com>,
        "Williams, Dan J" <dan.j.williams@...el.com>,
        "Jiang, Dave" <dave.jiang@...el.com>,
        "Bjorn Helgaas" <helgaas@...nel.org>,
        "vkoul@...nel.org" <vkoul@...nel.org>,
        "Dey, Megha" <megha.dey@...el.com>,
        "maz@...nel.org" <maz@...nel.org>,
        "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
        "alex.williamson@...hat.com" <alex.williamson@...hat.com>,
        "Pan, Jacob jun" <jacob.jun.pan@...el.com>,
        "Liu, Yi L" <yi.l.liu@...el.com>, "Lu, Baolu" <baolu.lu@...el.com>,
        "Kumar, Sanjay K" <sanjay.k.kumar@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        "kwankhede@...dia.com" <kwankhede@...dia.com>,
        "eric.auger@...hat.com" <eric.auger@...hat.com>,
        "parav@...lanox.com" <parav@...lanox.com>,
        "rafael@...nel.org" <rafael@...nel.org>,
        "netanelg@...lanox.com" <netanelg@...lanox.com>,
        "shahafs@...lanox.com" <shahafs@...lanox.com>,
        "yan.y.zhao@...ux.intel.com" <yan.y.zhao@...ux.intel.com>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        "Ortiz, Samuel" <samuel.ortiz@...el.com>,
        "Hossain, Mona" <mona.hossain@...el.com>,
        "dmaengine@...r.kernel.org" <dmaengine@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "Raj, Ashok" <ashok.raj@...el.com>
Subject: RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

> From: Raj, Ashok <ashok.raj@...el.com>
> Sent: Monday, November 9, 2020 7:59 AM
> 
> Hi Thomas,
> 
> [-] Jing, she isn't working at Intel anymore.
> 
> Now this is getting compiled as a book :-).. Thanks a ton!
> 
> One question on the hypercall case that isn't immediately
> clear to me.
> 
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> >
> > Now if we look at the virtualization scenario and device hand through
> > then the structure in the guest view is not any different from the basic
> > case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> > hypervisor can trap the access to the storage and translate the message:
> >
> >                    |
> >                    |
> >   [CPU]    -- [Bri | dge] -- Bus -- [Device]
> >                    |
> >   Alloc +
> >   Compose                   Store     Use
> >                              |
> >                              | Trap
> >                              v
> >                              Hypervisor translates and stores
> >
> 
> In the above case, the VMM is responsible for writing to the message
> store, whether it's IMS or legacy MSI/MSI-X. The VMM handles
> the writes to the device interrupt region and to the IRTE tables.
> 
> > But obviously with an IMS storage location which is software controlled
> > by the guest side driver (the case Jason is interested in) the above
> > cannot work for obvious reasons.
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> >                      |
> >                      |
> >   [CPU]       -- [VI | RT]  -- [Bridge] --    Bus    -- [Device]
> >                      |
> >   Alloc          "Compose"                   Store         Use
> >
> >   Vectordomain   HCALLdomain                Busdomain
> >                  |        ^
> >                  |        |
> >                  v        |
> >             Hypervisor
> >                Alloc + Compose
> >
> > Why? Because it reflects the boundaries and leaves the busdomain part
> > agnostic as it should be. And it works for _all_ variants of Busdomains.
> >
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
> 
> The isolation problem is not just the guest memory being used as interrupt
> store, right? If the store to the device region is not trapped and controlled
> by the VMM, there is no guarantee the guest OS has done the right thing?
> 
> 
> Thinking about it, guest memory might be more problematic since it's not
> trappable and the VMM can't enforce what is written. This is something that
> needs more attention. But for now, for the devices supporting memory on
> device, the trap and store by the VMM seems to satisfy the security
> properties you highlight here.
> 

Just want to clarify the trap part.

Guest memory is not trappable in Jason's example, which has queue/IMS
storage swapped between device and memory and requires a special command
to sync the state.

But there are also other forms of in-memory IMS implementation. For example,
some devices serve work requests from command buffers instead of HW work
queues. The command buffers are linked in per-process contexts (both in
memory), so IMS could similarly be stored in each context too. There is no
swap per se. The context is allocated by the driver and then registered to
the device through a mgmt. interface. When the mgmt. interface is mediated,
the hypervisor knows the IMS location and could mark it as read-only in
the EPT page table to enable trapping of guest writes. Of course this approach
is awkward if the complexity is paid just for virtualizing IMS.
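
To make the flow concrete, here is a rough toy model in Python. It is
only a sketch of the idea, not real hypervisor or driver code: the class,
its methods, the addresses and the starting vector number are all made up
for illustration, and a plain set of page numbers stands in for EPT
write protection.

```python
PAGE_SIZE = 4096


class ToyHypervisor:
    """Toy model only: a set of write-protected page numbers stands in for EPT."""

    def __init__(self):
        self.write_protected = set()  # guest page numbers marked read-only
        self.guest_memory = {}        # guest physical address -> guest-written value
        self.ims_shadow = {}          # IMS slot address -> message the host programs
        self.next_host_vector = 64    # arbitrary starting vector for this sketch

    def mediate_context_register(self, ctx_gpa, ims_offset):
        """Mediated mgmt. interface call: learn the IMS slot, write-protect its page."""
        ims_gpa = ctx_gpa + ims_offset
        self.ims_shadow[ims_gpa] = None
        self.write_protected.add(ims_gpa // PAGE_SIZE)
        return ims_gpa

    def translate(self, guest_msg):
        """Hypothetical stand-in for the host-side Alloc + Compose step."""
        host_vector = self.next_host_vector
        self.next_host_vector += 1
        return {"guest_msg": guest_msg, "host_vector": host_vector}

    def guest_write(self, gpa, value):
        """Model a guest store; stores to protected pages take the trap path."""
        if gpa // PAGE_SIZE not in self.write_protected:
            self.guest_memory[gpa] = value
            return "direct"
        # Trap path: the guest still sees the value it wrote.
        self.guest_memory[gpa] = value
        if gpa in self.ims_shadow:
            # The store hit the IMS slot: translate before the device uses it.
            self.ims_shadow[gpa] = self.translate(value)
            return "trapped+translated"
        # Unrelated data on the same page: the write must still be emulated.
        return "trapped+emulated"


hv = ToyHypervisor()
ims = hv.mediate_context_register(ctx_gpa=0x10000, ims_offset=0x40)
print(hv.guest_write(ims, 0x31))    # store to the IMS slot
print(hv.guest_write(0x10080, 7))   # other context data on the same page
print(hv.guest_write(0x20000, 7))   # unrelated page
```

The second store shows part of the awkwardness: write protection is
page granular, so any other guest write landing on the same page as the
IMS slot also traps and has to be emulated.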

Thanks
Kevin