linux-kernel - Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking readers from consuming CPU) cause qemu boot slow

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <tencent_E306F58EAEC1D188ED6D5F358A269F34C707@qq.com>
Date:   Mon, 13 Jun 2022 11:04:39 +0800
From:   "zhangfei.gao@...mail.com" <zhangfei.gao@...mail.com>
To:     paulmck@...nel.org, Paolo Bonzini <pbonzini@...hat.com>
Cc:     Zhangfei Gao <zhangfei.gao@...aro.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        rcu@...r.kernel.org, Lai Jiangshan <jiangshanlai@...il.com>,
        Josh Triplett <josh@...htriplett.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Matthew Wilcox <willy@...radead.org>,
        Shameerali Kolothum Thodi 
        <shameerali.kolothum.thodi@...wei.com>, mtosatti@...hat.com,
        sheng.yang@...el.com
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking
 readers from consuming CPU) cause qemu boot slow

Hi, Paul

On 2022/6/13 上午2:49, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
>> On 6/12/22 18:40, Paul E. McKenney wrote:
>>>> Do these reserved memory regions really need to be allocated separately?
>>>> (For example, are they really all non-contiguous?  If not, that is, if
>>>> there are a lot of contiguous memory regions, could you sort the IORT
>>>> by address and do one ioctl() for each set of contiguous memory regions?)
>>>>
>>>> Are all of these reserved memory regions set up before init is spawned?
>>>>
>>>> Are all of these reserved memory regions set up while there is only a
>>>> single vCPU up and running?
>>>>
>>>> Is the SRCU grace period really needed in this case?  (I freely confess
>>>> to not being all that familiar with KVM.)
>>> Oh, and there was a similar many-requests problem with networking many
>>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
>>> that permitted many requests to be presented to the kernel with a single
>>> system call.
>>>
>>> Could a new ioctl() be introduced that requested a large number
>>> of these memory regions in one go so as to make each call to
>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>>> requests?  Adding a few of the KVM guys on CC for their thoughts.
>> Unfortunately not.  Apart from this specific case, in general the calls to
>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
>> guest, and those writes then map to a ioctl.  Typically the guest sets up a
>> device at a time, and each setup step causes a synchronize_srcu()---and
>> expedited at that.
> I was afraid of something like that...
>
>> KVM has two SRCUs:
>>
>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>> that are very very small, but it needs extremely fast detection of grace
>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu.  Readers are
>> not so frequent.
>>
>> 2) kvm->srcu is nastier because there are readers all the time.  The
>> read-side critical section are still short-ish, but they need the sleepable
>> part because they access user memory.
> Which one of these two is in play in this case?
>
>> Writers are not frequent per se; the problem is they come in very large
>> bursts when a guest boots.  And while the whole boot path overall can be
>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
>> impact on runtime than the O(n^2) parts, as demonstrated here.
>>
>> Therefore, we operated on the assumption that the callers of
>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
>> and the desire was to get past the booting phase as fast as possible.  If
>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
>> problematic.
>>
>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
>> were was introduced (respectively in 2009 and 2014).  But perhaps they do
>> not hold anymore now that each SRCU is not as independent as it used to be
>> in those years, and instead they use workqueues instead?
> The problem was not internal to SRCU, but rather due to the fact
> that kernel live patching (KLP) had problems with the CPU-bound tasks
> resulting from repeated synchronize_rcu_expedited() invocations.  So I
> added heuristics to get the occasional sleep in there for KLP's benefit.
> Perhaps these heuristics need to be less aggressive about adding sleep.
>
> These heuristics have these aspects:
>
> 1.	The longer readers persist in an expedited SRCU grace period,
> 	the longer the wait between successive checks of the reader
> 	state.  Roughly speaking, we wait as long as the grace period
> 	has currently been in effect, capped at ten jiffies.
>
> 2.	SRCU grace periods have several phases.  We reset so that each
> 	phase starts by not waiting (new phase, new set of readers,
> 	so don't penalize this set for the sins of the previous set).
> 	But once we get to the point of adding delay, we add the
> 	delay based on the beginning of the full grace period.
>
> Right now, the checking for grace-period length does not allow for the
> possibility that a grace period might start just before the jiffies
> counter gets incremented (because I didn't realize that anyone cared),
> so that is one possible thing to change.  I can also allow more no-delay
> checks per SRCU grace-period phase.
>
> Zhangfei, does something like the patch shown below help?
>
> Additional adjustments are likely needed to avoid re-breaking KLP,
> but we have to start somewhere...
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019dea..6a354368ac1d1 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>   
>   #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
>   #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
>   #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
>   
>   /*
> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>    */
>   static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>   {
> +	unsigned long gpstart;
> +	unsigned long j;
>   	unsigned long jbase = SRCU_INTERVAL;
>   
>   	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>   		jbase = 0;
> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +		j = jiffies - 1;
> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> +		if (time_after(j, gpstart))
> +			jbase += j - gpstart;
> +	}
>   	if (!jbase) {
>   		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>   		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
Unfortunately, this patch does not helpful.

Then re-add the debug info.

During the qemu boot
[  232.997667]  __synchronize_srcu loop=1000

[  361.094493]  __synchronize_srcu loop=9000
[  361.094501] Call trace:
[  361.094502]  dump_backtrace+0xe4/0xf0
[  361.094505]  show_stack+0x20/0x70
[  361.094507]  dump_stack_lvl+0x8c/0xb8
[  361.094509]  dump_stack+0x18/0x34
[  361.094511]  __synchronize_srcu+0x120/0x128
[  361.094514]  synchronize_srcu_expedited+0x2c/0x40
[  361.094515]  kvm_swap_active_memslots+0x130/0x198
[  361.094519]  kvm_activate_memslot+0x40/0x68
[  361.094520]  kvm_set_memslot+0x2f8/0x3b0
[  361.094523]  __kvm_set_memory_region+0x2e4/0x438
[  361.094524]  kvm_set_memory_region+0x78/0xb8
[  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
[  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
[  361.094530]  invoke_syscall+0x4c/0x110
[  361.094533]  el0_svc_common.constprop.0+0x68/0x128
[  361.094536]  do_el0_svc+0x34/0xc0
[  361.094538]  el0_svc+0x30/0x98
[  361.094541]  el0t_64_sync_handler+0xb8/0xc0
[  361.094544]  el0t_64_sync+0x18c/0x190
[  363.942817]  kvm_set_memory_region loop=6000