Message-ID: <83C436BD-E12E-420C-B651-B3788F1C4683@vmware.com>
Date:   Mon, 11 Jul 2022 17:04:48 +0000
From:   Nadav Amit <namit@...are.com>
To:     "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC:     Matthew Wilcox <willy@...radead.org>,
        "bhelgaas@...gle.com" <bhelgaas@...gle.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "bp@...en8.de" <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
        "rostedt@...dmis.org" <rostedt@...dmis.org>,
        Srivatsa Bhat <srivatsab@...are.com>,
        "srivatsa@...il.mit.edu" <srivatsa@...il.mit.edu>,
        Alexey Makhalov <amakhalov@...are.com>,
        Anish Swaminathan <anishs@...are.com>,
        Vasavi Sirnapalli <vsirnapalli@...are.com>,
        "er.ajay.kaher@...il.com" <er.ajay.kaher@...il.com>,
        Bjorn Helgaas <helgaas@...nel.org>,
        Ajay Kaher <akaher@...are.com>
Subject: Re: [PATCH] MMIO should have more priority then IO

On Jul 10, 2022, at 11:31 PM, Ajay Kaher <akaher@...are.com> wrote:

> On 09/07/22, 1:19 AM, "Nadav Amit" <namit@...are.com> wrote:
> 
>> On Jul 8, 2022, at 11:43 AM, Matthew Wilcox <willy@...radead.org> wrote:
> 
>>> I have no misconceptions about whatever you want to call the mechanism
>>> for communicating with the hypervisor at a higher level than "prod this
>>> byte". For example, one of the more intensive things we use config
>>> space for is sizing BARs. If we had a hypercall to size a BAR, that
>>> would eliminate:
>>> 
>>> - Read current value from BAR
>>> - Write all-ones to BAR
>>> - Read new value from BAR
>>> - Write original value back to BAR
>>> 
>>> Bingo, one hypercall instead of 4 MMIO or 8 PIO accesses.
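
For readers who have not seen the sizing sequence, here is a minimal
user-space sketch of the four steps listed above, run against a mock
32-bit memory BAR. The names mock_bar, bar_read, bar_write and size_bar
are made up for illustration and are not kernel APIs; the mock assumes a
power-of-two region size and read-only low flag bits. With the legacy
0xCF8/0xCFC port mechanism each config access costs an address write
plus a data access, which is where the "8 PIO accesses" come from.

```c
#include <stdint.h>

/* Hypothetical mock of one 32-bit memory BAR; not a kernel structure. */
struct mock_bar {
    uint32_t value;     /* current register contents */
    uint32_t size;      /* region size in bytes, power of two, >= 16 */
    unsigned accesses;  /* number of config-space accesses performed */
};

/* One config-space read of the BAR register. */
static uint32_t bar_read(struct mock_bar *b)
{
    b->accesses++;
    return b->value;
}

/* One config-space write: address bits below the region size are
 * hardwired to zero and the low four flag bits are read-only. */
static void bar_write(struct mock_bar *b, uint32_t v)
{
    uint32_t flags = b->value & 0xFu;

    b->accesses++;
    b->value = (v & ~(b->size - 1)) | flags;
}

/* The four-step sizing sequence: read, write all-ones, read, restore. */
static uint32_t size_bar(struct mock_bar *b)
{
    uint32_t orig = bar_read(b);        /* 1. read current value      */
    bar_write(b, 0xFFFFFFFFu);          /* 2. write all-ones          */
    uint32_t probe = bar_read(b);       /* 3. read back writable bits */
    bar_write(b, orig);                 /* 4. restore original value  */

    uint32_t mask = probe & ~0xFu;      /* drop the flag bits         */
    return mask ? ~mask + 1 : 0;        /* lowest writable bit = size */
}
```

Calling size_bar() on a mock BAR of size 0x10000 returns 0x10000 after
exactly four accesses and leaves the register at its original value. A
sizing hypercall would collapse those four trapped accesses into one.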
> 
> To improve this further, we can use the following mechanism:
> map the virtual device's config space (i.e. its 4KB ECAM region)
> read-only into the VM's MMIO space. The VM then has direct read
> access using MMIO, but not using PIO.
> 
> Virtual machine test results with the above mechanism:
> 100,000 reads using raw_pci_read() took:
> PIO: 12.809 sec
> MMIO: 0.010 sec
> 
> During VM boot, PCI scan and initialization time is reduced by
> ~65%. In our case it dropped from ~55 ms to ~18 ms.
> 
> Thanks Matthew, for sharing history and your views on this patch.
> 
> As you mentioned, the ordering change may impact some hardware, so
> it is better to apply this change either to the VMware hypervisor
> only or generically to all hypervisors.

I was chatting with Ajay, since I personally did not fully understand his
use-case from the email. Others may have fully understood and can ignore
this email. Here is a short summary of my understanding:

During boot-time there are many PCI reads. Currently, when these reads are
performed by a virtual machine, they all cause a VM-exit, and therefore each
one of them induces a considerable overhead.

When using MMIO (but not PIO), it is possible to map the PCI BARs of the
virtual machine to some memory area that holds the values that the “emulated
hardware” is supposed to return. The memory region is mapped as “read-only”
in the NPT/EPT, so reads from these BAR regions would be treated as regular
memory reads. Writes would still be trapped and emulated by the hypervisor.
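
To make the difference concrete, here is a small model of the scheme as I
understand it. It is a sketch only: cfg_model and the three accessors are
hypothetical names, the "VM-exit" is just a counter, and nothing here
touches real NPT/EPT. Reads through the read-only mapping are plain
loads; writes, and reads without the mapping, each count as one exit.

```c
#include <stdint.h>
#include <string.h>

#define CFG_SIZE 4096u  /* one 4KB config window, as in the thread */

/* Hypothetical model that counts accesses which would leave the guest. */
struct cfg_model {
    uint8_t shadow[CFG_SIZE];  /* page the hypervisor keeps up to date */
    unsigned vm_exits;
};

/* Read with the page mapped read-only: an ordinary memory load.
 * The caller must keep off <= CFG_SIZE - 4. */
static uint32_t cfg_read32_mapped(struct cfg_model *m, uint32_t off)
{
    uint32_t v;

    memcpy(&v, &m->shadow[off], sizeof v);
    return v;
}

/* Read without the mapping (PIO, or trapped MMIO): one exit per access. */
static uint32_t cfg_read32_trapped(struct cfg_model *m, uint32_t off)
{
    m->vm_exits++;
    return cfg_read32_mapped(m, off);
}

/* Writes trap in both schemes; the hypervisor emulates the write and
 * updates the page that later reads will see. */
static void cfg_write32(struct cfg_model *m, uint32_t off, uint32_t v)
{
    m->vm_exits++;
    memcpy(&m->shadow[off], &v, sizeof v);
}
```

In this model one write followed by 1000 mapped reads costs a single
exit, while the same 1000 reads through the trapped path cost 1000.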

I have a vague recollection from a similar project that I worked on 10 years
ago that this might not work for certain emulated device registers. For
instance, some hardware registers, specifically those that report hardware
events, are “clear-on-read”. Apparently, Ajay took that into consideration.

That is the reason for this quite amazing difference - several orders of
magnitude - between the overhead that is caused by raw_pci_read(): roughly
128us per read for PIO and 100ns for MMIO (12.809s and 0.010s over 100,000
reads). Admittedly, I do not understand why PIO access would take that long
(I would have expected it to be 10 times faster, at least), but the benefit
is quite clear.
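
For reference, the per-read figures follow directly from Ajay's totals.
A trivial helper (per_read_ns is a made-up name) that does the division:

```c
/* Average cost of one access in nanoseconds, given the total run time
 * in seconds and the number of accesses performed. */
static double per_read_ns(double total_sec, unsigned long reads)
{
    return total_sec * 1e9 / (double)reads;
}
```

With the reported numbers, per_read_ns(12.809, 100000) is about 128090
ns (roughly 128us) for PIO and per_read_ns(0.010, 100000) is about 100
ns for MMIO, a ratio of roughly 1280x.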