Date:   Sun, 12 Jun 2022 19:29:30 +0200
From:   Paolo Bonzini <pbonzini@...hat.com>
To:     paulmck@...nel.org,
        "zhangfei.gao@...mail.com" <zhangfei.gao@...mail.com>
Cc:     Zhangfei Gao <zhangfei.gao@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        rcu@...r.kernel.org, Lai Jiangshan <jiangshanlai@...il.com>,
        Josh Triplett <josh@...htriplett.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Matthew Wilcox <willy@...radead.org>,
        Shameerali Kolothum Thodi 
        <shameerali.kolothum.thodi@...wei.com>, mtosatti@...hat.com,
        sheng.yang@...el.com
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking
 readers from consuming CPU) cause qemu boot slow

On 6/12/22 18:40, Paul E. McKenney wrote:
>> Do these reserved memory regions really need to be allocated separately?
>> (For example, are they really all non-contiguous?  If not, that is, if
>> there are a lot of contiguous memory regions, could you sort the IORT
>> by address and do one ioctl() for each set of contiguous memory regions?)
>>
>> Are all of these reserved memory regions set up before init is spawned?
>>
>> Are all of these reserved memory regions set up while there is only a
>> single vCPU up and running?
>>
>> Is the SRCU grace period really needed in this case?  (I freely confess
>> to not being all that familiar with KVM.)
> 
> Oh, and there was a similar many-requests problem with networking many
> years ago.  This was solved by adding a new syscall/ioctl()/whatever
> that permitted many requests to be presented to the kernel with a single
> system call.
> 
> Could a new ioctl() be introduced that requested a large number
> of these memory regions in one go so as to make each call to
> synchronize_rcu_expedited() cover a useful fraction of your 9000+
> requests?  Adding a few of the KVM guys on CC for their thoughts.

Unfortunately not.  Apart from this specific case, in general the calls 
to KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers 
in the guest, and those writes then map to an ioctl.  Typically the 
guest sets up one device at a time, and each setup step causes a 
synchronize_srcu()---and expedited at that.
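
To make this concrete, here is a hedged userspace sketch of what one of 
those setup steps looks like from the VMM side.  It is not QEMU code: 
the helper name, slot number and addresses are made up, and error 
handling (e.g. checking mmap() for MAP_FAILED) is omitted.

/*
 * One KVM_SET_USER_MEMORY_REGION ioctl per region; on the kernel side
 * each such call ends with an expedited grace period on kvm->srcu.
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

static int register_one_region(int vm_fd, unsigned int slot,
			       unsigned long long gpa, size_t size)
{
	void *host = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct kvm_userspace_memory_region region = {
		.slot            = slot,
		.guest_phys_addr = gpa,
		.memory_size     = size,
		.userspace_addr  = (unsigned long long)(uintptr_t)host,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

A guest that sets up thousands of devices one at a time therefore pays 
for thousands of expedited grace periods, one per call.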

KVM has two SRCUs:

1) kvm->irq_srcu hardly relies on the "sleepable" part; its readers are 
very short, but it needs extremely fast detection of grace periods; see 
commit 719d93cd5f5c ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING", 
2014-05-05), which split it off from kvm->srcu.  Readers are not so 
frequent (see the read-side sketch below).

2) kvm->srcu is nastier because there are readers all the time.  The 
read-side critical sections are still short-ish, but they need the 
sleepable part because they access user memory.
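
For reference, a hedged kernel-style sketch of the two read-side 
patterns; lookup_route_and_inject() is a made-up stand-in for the real 
routing lookup, the rest is the stock SRCU/KVM API:

#include <linux/srcu.h>
#include <linux/kvm_host.h>

/* Hypothetical stand-in for the real routing lookup + injection. */
static void lookup_route_and_inject(struct kvm *kvm, int gsi)
{
	/* ... look up the routing entry for gsi and raise the interrupt ... */
}

/* 1) kvm->irq_srcu: tiny critical section, wants fast grace periods. */
static void deliver_irq_sketch(struct kvm *kvm, int gsi)
{
	int idx = srcu_read_lock(&kvm->irq_srcu);

	lookup_route_and_inject(kvm, gsi);
	srcu_read_unlock(&kvm->irq_srcu, idx);
}

/* 2) kvm->srcu: short-ish critical section, but it reaches guest
 *    memory through userspace mappings, which can fault and sleep. */
static int read_guest_sketch(struct kvm *kvm, gpa_t gpa, void *dst,
			     unsigned long len)
{
	int idx = srcu_read_lock(&kvm->srcu);
	int ret = kvm_read_guest(kvm, gpa, dst, len);

	srcu_read_unlock(&kvm->srcu, idx);
	return ret;
}

(In the real code the kvm->srcu read lock is typically taken higher up, 
around the vCPU run loop, rather than in each helper.)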

Writers are not frequent per se; the problem is that they come in very 
large bursts when a guest boots.  And while the boot path as a whole can 
be quadratic, the O(n) expensive calls to synchronize_srcu() can have a 
larger impact on runtime than the O(n^2) parts, as demonstrated here.
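
As a purely illustrative back-of-envelope (the per-operation costs are 
assumed, not measured; only the ~9000 region count comes from the 
report):

#include <stdio.h>

int main(void)
{
	const double n = 9000.0;      /* regions, from the report above   */
	const double gp_s = 1e-3;     /* assumed cost of one expedited GP */
	const double step_s = 20e-9;  /* assumed cost of one O(n^2) step  */

	printf("O(n) grace periods: %.1f s\n", n * gp_s);        /* ~9.0 s */
	printf("O(n^2) cheap steps: %.1f s\n", n * n * step_s);  /* ~1.6 s */
	return 0;
}

With numbers in that ballpark the linear chain of grace periods, not the 
quadratic part, dominates the boot time.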

Therefore, we operated on the assumption that the callers of 
synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest 
code and the desire was to get past the booting phase as fast as 
possible.  If the guest wants to eat host CPU it can "for(;;)" as much 
as it wants; so, as long as expedited GPs didn't eat CPU *throughout the 
whole system*, a preemptible busy wait in synchronize_srcu_expedited() 
was not problematic.

These assumptions did match the SRCU code when kvm->srcu and 
kvm->irq_srcu were introduced (in 2009 and 2014, respectively).  But 
perhaps they no longer hold now that each SRCU instance is not as 
independent as it used to be in those years and instead relies on 
workqueues?

Thanks,

Paolo
