linux-kernel - Re: Summary of LPC guest MSI discussion in Santa Fe

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5822214F.2070500@redhat.com>
Date:   Tue, 8 Nov 2016 14:02:39 -0500
From:   Don Dutile <ddutile@...hat.com>
To:     Will Deacon <will.deacon@....com>,
        Auger Eric <eric.auger@...hat.com>
Cc:     Alex Williamson <alex.williamson@...hat.com>, drjones@...hat.com,
        jason@...edaemon.net, kvm@...r.kernel.org, marc.zyngier@....com,
        benh@...nel.crashing.org, joro@...tes.org, punit.agrawal@....com,
        linux-kernel@...r.kernel.org, arnd@...db.de, diana.craciun@....com,
        iommu@...ts.linux-foundation.org, pranav.sawargaonkar@...il.com,
        linux-arm-kernel@...ts.infradead.org, jcm@...hat.com,
        tglx@...utronix.de, robin.murphy@....com, dwmw@...zon.co.uk,
        christoffer.dall@...aro.org, eric.auger.pro@...il.com
Subject: Re: Summary of LPC guest MSI discussion in Santa Fe

On 11/08/2016 12:54 PM, Will Deacon wrote:
> On Tue, Nov 08, 2016 at 03:27:23PM +0100, Auger Eric wrote:
>> On 08/11/2016 03:45, Will Deacon wrote:
>>> Rather than treat these as separate problems, a better interface is to
>>> tell userspace about a set of reserved regions, and have this include
>>> the MSI doorbell, irrespective of whether or not it can be remapped.
>>> Don suggested that we statically pick an address for the doorbell in a
>>> similar way to x86, and have the kernel map it there. We could even pick
>>> 0xfee00000. If it conflicts with a reserved region on the platform (due
>>> to (4)), then we'd obviously have to (deterministically?) allocate it
>>> somewhere else, but probably within the bottom 4G.
>> This is tentatively achieved now with
>> [1] [RFC v2 0/8] KVM PCIe/MSI passthrough on ARM/ARM64 - Alt II
>> (http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1264506.html)
> Yup, I saw that fly by. Hopefully some of the internals can be reused
> with the current thinking on user ABI.
>
>>> The next question is how to tell userspace about all of the reserved
>>> regions. Initially, the idea was to extend VFIO, however Alex pointed
>>> out a horrible scenario:
>>>
>>>    1. QEMU spawns a VM on system 0
>>>    2. VM is migrated to system 1
>>>    3. QEMU attempts to passthrough a device using PCI hotplug
>>>
>>> In this scenario, the guest memory map is chosen at step (1), yet there
>>> is no VFIO fd available to determine the reserved regions. Furthermore,
>>> the reserved regions may vary between system 0 and system 1. This pretty
>>> much rules out using VFIO to determine the reserved regions.Alex suggested
>>> that the SMMU driver can advertise the regions via /sys/class/iommu/. This
>>> would solve part of the problem, but migration between systems with
>>> different memory maps can still cause problems if the reserved regions
>>> of the new system conflict with the guest memory map chosen by QEMU.
>>
>> OK so I understand we do not want anymore the VFIO chain capability API
>> (patch 5 of above series) but we prefer a sysfs approach instead.
> Right.
>
>> I understand the sysfs approach which allows the userspace to get the
>> info earlier and independently on VFIO. Keeping in mind current QEMU
>> virt - which is not the only userspace - will not do much from this info
>> until we bring upheavals in virt address space management. So if I am
>> not wrong, at the moment the main action to be undertaken is the
>> rejection of the PCI hotplug in case we detect a collision.
> I don't think so; it should be up to userspace to reject the hotplug.
> If userspace doesn't have support for the regions, then that's fine --
> you just end up in a situation where the CPU page table maps memory
> somewhere that the device can't see. In other words, you'll end up with
> spurious DMA failures, but that's exactly what happens with current systems
> if you passthrough an overlapping region (Robin demonstrated this on Juno).
>
> Additionally, you can imagine some future support where you can tell the
> guest not to use certain regions of its memory for DMA. In this case, you
> wouldn't want to refuse the hotplug in the case of overlapping regions.
>
> Really, I think the kernel side just needs to enumerate the fixed reserved
> regions, place the doorbell at a fixed address and then advertise these
> via sysfs.
>
>> I can respin [1]
>> - studying and taking into account Robin's comments about dm_regions
>> similarities
>> - removing the VFIO capability chain and replacing this by a sysfs API
> Ideally, this would be reusable between different SMMU drivers so the sysfs
> entries have the same format etc.
>
>> Would that be OK?
> Sounds good to me. Are you in a position to prototype something on the qemu
> side once we've got kernel-side agreement?
>
>> What about Alex comments who wanted to report the usable memory ranges
>> instead of unusable memory ranges?
>>
>> Also did you have a chance to discuss the following items:
>> 1) the VFIO irq safety assessment
> The discussion really focussed on system topology, as opposed to properties
> of the doorbell. Regardless of how the device talks to the doorbell, if
> the doorbell can't protect against things like MSI spoofing, then it's
> unsafe. My opinion is that we shouldn't allow passthrough by default on
> systems with unsafe doorbells (we could piggyback on allow_unsafe_interrupts
> cmdline option to VFIO).
>
> A first step would be making all this opt-in, and only supporting GICv3
> ITS for now.
You're trying to support a config that is < GICv3 and no ITS ? ...
That would be the equiv. of x86 pre-intr-remap, and that's why allow_unsafe_interrupts
hook was created ... to enable devel/kick-the-tires.
>> 2) the MSI reserved size computation (is an arbitrary size OK?)
> If we fix the base address, we could fix a size too. However, we'd still
> need to enumerate the doorbells to check that they fit in the region we
> have. If not, then we can warn during boot and treat it the same way as
> a resource conflict (that is, reallocate the region in some deterministic
> way).
>
> Will