[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <tencent_F82250B6E3D51A9AC0D2BE1AE43A4E060909@qq.com>
Date: Mon, 13 Jun 2022 14:55:47 +0800
From: "zhangfei.gao@...mail.com" <zhangfei.gao@...mail.com>
To: paulmck@...nel.org
Cc: Paolo Bonzini <pbonzini@...hat.com>,
Zhangfei Gao <zhangfei.gao@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
rcu@...r.kernel.org, Lai Jiangshan <jiangshanlai@...il.com>,
Josh Triplett <josh@...htriplett.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Matthew Wilcox <willy@...radead.org>,
Shameerali Kolothum Thodi
<shameerali.kolothum.thodi@...wei.com>, mtosatti@...hat.com,
Auger Eric <eric.auger@...hat.com>
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking
readers from consuming CPU) cause qemu boot slow
Hi, Paul
On 2022/6/13 下午12:16, Paul E. McKenney wrote:
> On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
>> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@...mail.com wrote:
>>> Hi, Paul
>>>
>>> On 2022/6/13 上午2:49, Paul E. McKenney wrote:
>>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
>>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
>>>>>>> Do these reserved memory regions really need to be allocated separately?
>>>>>>> (For example, are they really all non-contiguous? If not, that is, if
>>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
>>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
>>>>>>>
>>>>>>> Are all of these reserved memory regions set up before init is spawned?
>>>>>>>
>>>>>>> Are all of these reserved memory regions set up while there is only a
>>>>>>> single vCPU up and running?
>>>>>>>
>>>>>>> Is the SRCU grace period really needed in this case? (I freely confess
>>>>>>> to not being all that familiar with KVM.)
>>>>>> Oh, and there was a similar many-requests problem with networking many
>>>>>> years ago. This was solved by adding a new syscall/ioctl()/whatever
>>>>>> that permitted many requests to be presented to the kernel with a single
>>>>>> system call.
>>>>>>
>>>>>> Could a new ioctl() be introduced that requested a large number
>>>>>> of these memory regions in one go so as to make each call to
>>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
>>>>>> requests? Adding a few of the KVM guys on CC for their thoughts.
>>>>> Unfortunately not. Apart from this specific case, in general the calls to
>>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
>>>>> guest, and those writes then map to a ioctl. Typically the guest sets up a
>>>>> device at a time, and each setup step causes a synchronize_srcu()---and
>>>>> expedited at that.
>>>> I was afraid of something like that...
>>>>
>>>>> KVM has two SRCUs:
>>>>>
>>>>> 1) kvm->irq_srcu is hardly relying on the "sleepable" part; it has readers
>>>>> that are very very small, but it needs extremely fast detection of grace
>>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
>>>>> KVM_SET_GSI_ROUTING", 2014-05-05) which split it off kvm->srcu. Readers are
>>>>> not so frequent.
>>>>>
>>>>> 2) kvm->srcu is nastier because there are readers all the time. The
>>>>> read-side critical section are still short-ish, but they need the sleepable
>>>>> part because they access user memory.
>>>> Which one of these two is in play in this case?
>>>>
>>>>> Writers are not frequent per se; the problem is they come in very large
>>>>> bursts when a guest boots. And while the whole boot path overall can be
>>>>> quadratic, O(n) expensive calls to synchronize_srcu() can have a larger
>>>>> impact on runtime than the O(n^2) parts, as demonstrated here.
>>>>>
>>>>> Therefore, we operated on the assumption that the callers of
>>>>> synchronized_srcu_expedited were _anyway_ busy running CPU-bound guest code
>>>>> and the desire was to get past the booting phase as fast as possible. If
>>>>> the guest wants to eat host CPU it can "for(;;)" as much as it wants;
>>>>> therefore, as long as expedited GPs didn't eat CPU *throughout the whole
>>>>> system*, a preemptable busy wait in synchronize_srcu_expedited() were not
>>>>> problematic.
>>>>>
>>>>> This assumptions did match the SRCU code when kvm->srcu and kvm->irq_srcu
>>>>> were was introduced (respectively in 2009 and 2014). But perhaps they do
>>>>> not hold anymore now that each SRCU is not as independent as it used to be
>>>>> in those years, and instead they use workqueues instead?
>>>> The problem was not internal to SRCU, but rather due to the fact
>>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
>>>> resulting from repeated synchronize_rcu_expedited() invocations. So I
>>>> added heuristics to get the occasional sleep in there for KLP's benefit.
>>>> Perhaps these heuristics need to be less aggressive about adding sleep.
>>>>
>>>> These heuristics have these aspects:
>>>>
>>>> 1. The longer readers persist in an expedited SRCU grace period,
>>>> the longer the wait between successive checks of the reader
>>>> state. Roughly speaking, we wait as long as the grace period
>>>> has currently been in effect, capped at ten jiffies.
>>>>
>>>> 2. SRCU grace periods have several phases. We reset so that each
>>>> phase starts by not waiting (new phase, new set of readers,
>>>> so don't penalize this set for the sins of the previous set).
>>>> But once we get to the point of adding delay, we add the
>>>> delay based on the beginning of the full grace period.
>>>>
>>>> Right now, the checking for grace-period length does not allow for the
>>>> possibility that a grace period might start just before the jiffies
>>>> counter gets incremented (because I didn't realize that anyone cared),
>>>> so that is one possible thing to change. I can also allow more no-delay
>>>> checks per SRCU grace-period phase.
>>>>
>>>> Zhangfei, does something like the patch shown below help?
>>>>
>>>> Additional adjustments are likely needed to avoid re-breaking KLP,
>>>> but we have to start somewhere...
>>>>
>>>> Thanx, Paul
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>> index 50ba70f019dea..6a354368ac1d1 100644
>>>> --- a/kernel/rcu/srcutree.c
>>>> +++ b/kernel/rcu/srcutree.c
>>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>> #define SRCU_INTERVAL 1 // Base delay if no expedited GPs pending.
>>>> #define SRCU_MAX_INTERVAL 10 // Maximum incremental delay from slow readers.
>>>> -#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase consecutive no-delay instances.
>>>> +#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase consecutive no-delay instances.
>>>> #define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay instances.
>>>> /*
>>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>>>> */
>>>> static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>>>> {
>>>> + unsigned long gpstart;
>>>> + unsigned long j;
>>>> unsigned long jbase = SRCU_INTERVAL;
>>>> if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>> jbase = 0;
>>>> - if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
>>>> - jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
>>>> + if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
>>>> + j = jiffies - 1;
>>>> + gpstart = READ_ONCE(ssp->srcu_gp_start);
>>>> + if (time_after(j, gpstart))
>>>> + jbase += j - gpstart;
>>>> + }
>>>> if (!jbase) {
>>>> WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
>>>> if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
>>> Unfortunately, this patch does not helpful.
>>>
>>> Then re-add the debug info.
>>>
>>> During the qemu boot
>>> [ 232.997667] __synchronize_srcu loop=1000
>>>
>>> [ 361.094493] __synchronize_srcu loop=9000
>>> [ 361.094501] Call trace:
>>> [ 361.094502] dump_backtrace+0xe4/0xf0
>>> [ 361.094505] show_stack+0x20/0x70
>>> [ 361.094507] dump_stack_lvl+0x8c/0xb8
>>> [ 361.094509] dump_stack+0x18/0x34
>>> [ 361.094511] __synchronize_srcu+0x120/0x128
>>> [ 361.094514] synchronize_srcu_expedited+0x2c/0x40
>>> [ 361.094515] kvm_swap_active_memslots+0x130/0x198
>>> [ 361.094519] kvm_activate_memslot+0x40/0x68
>>> [ 361.094520] kvm_set_memslot+0x2f8/0x3b0
>>> [ 361.094523] __kvm_set_memory_region+0x2e4/0x438
>>> [ 361.094524] kvm_set_memory_region+0x78/0xb8
>>> [ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
>>> [ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
>>> [ 361.094530] invoke_syscall+0x4c/0x110
>>> [ 361.094533] el0_svc_common.constprop.0+0x68/0x128
>>> [ 361.094536] do_el0_svc+0x34/0xc0
>>> [ 361.094538] el0_svc+0x30/0x98
>>> [ 361.094541] el0t_64_sync_handler+0xb8/0xc0
>>> [ 361.094544] el0t_64_sync+0x18c/0x190
>>> [ 363.942817] kvm_set_memory_region loop=6000
>> Huh.
>>
>> One possibility is that the "if (!jbase)" block needs to be nested
>> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
I test this diff and NO helpful
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 50ba70f019de..36286a4b74e6 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
#define SRCU_INTERVAL 1 // Base delay if no expedited
GPs pending.
#define SRCU_MAX_INTERVAL 10 // Maximum incremental delay
from slow readers.
-#define SRCU_MAX_NODELAY_PHASE 1 // Maximum per-GP-phase
consecutive no-delay instances.
+#define SRCU_MAX_NODELAY_PHASE 3 // Maximum per-GP-phase
consecutive no-delay instances.
#define SRCU_MAX_NODELAY 100 // Maximum consecutive no-delay
instances.
/*
@@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct
*ssp)
*/
static unsigned long srcu_get_delay(struct srcu_struct *ssp)
{
+ unsigned long gpstart;
+ unsigned long j;
unsigned long jbase = SRCU_INTERVAL;
if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq),
READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
jbase = 0;
- if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
- jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
- if (!jbase) {
- WRITE_ONCE(ssp->srcu_n_exp_nodelay,
READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
- if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
SRCU_MAX_NODELAY_PHASE)
- jbase = 1;
+ if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
+ j = jiffies - 1;
+ gpstart = READ_ONCE(ssp->srcu_gp_start);
+ if (time_after(j, gpstart))
+ jbase += j - gpstart;
+
+ if (!jbase) {
+ WRITE_ONCE(ssp->srcu_n_exp_nodelay,
READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
+ if (READ_ONCE(ssp->srcu_n_exp_nodelay) >
SRCU_MAX_NODELAY_PHASE)
+ jbase = 1;
+ }
}
> And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> above change reduces the overhead by more than an order of magnitude.
> Except that the overhead of the series is far less than one second,
> not the several minutes that you are seeing. So the per-call overhead
> decreases from about 17 microseconds to a bit more than one microsecond.
>
> I could imagine an extra order of magnitude if you are running HZ=100
> instead of the HZ=1000 that I am running. But that only gets up to a
> few seconds.
>
>> One additional debug is to apply the patch below on top of the one you
apply the patch below?
>> just now kindly tested, then use whatever debug technique you wish to
>> work out what fraction of the time during that critical interval that
>> srcu_get_delay() returns non-zero.
Sorry, I am confused, no patch right?
Just measure srcu_get_delay return to non-zero?
By the way, the issue should be only related with qemu apci. not related
with rmr feature
Test with: https://github.com/qemu/qemu/tree/stable-6.1
Looks it caused by too many kvm_region_add & kvm_region_del if acpi=force,
If no acpi, no print kvm_region_add/del (1000 times print once)
If with acpi=force,
During qemu boot
kvm_region_add region_add = 1000
kvm_region_del region_del = 1000
kvm_region_add region_add = 2000
kvm_region_del region_del = 2000
kvm_region_add region_add = 3000
kvm_region_del region_del = 3000
kvm_region_add region_add = 4000
kvm_region_del region_del = 4000
kvm_region_add region_add = 5000
kvm_region_del region_del = 5000
kvm_region_add region_add = 6000
kvm_region_del region_del = 6000
kvm_region_add/kvm_region_del ->
kvm_set_phys_mem->
kvm_set_user_memory_region->
kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem)
[ 361.094493] __synchronize_srcu loop=9000
[ 361.094501] Call trace:
[ 361.094502] dump_backtrace+0xe4/0xf0
[ 361.094505] show_stack+0x20/0x70
[ 361.094507] dump_stack_lvl+0x8c/0xb8
[ 361.094509] dump_stack+0x18/0x34
[ 361.094511] __synchronize_srcu+0x120/0x128
[ 361.094514] synchronize_srcu_expedited+0x2c/0x40
[ 361.094515] kvm_swap_active_memslots+0x130/0x198
[ 361.094519] kvm_activate_memslot+0x40/0x68
[ 361.094520] kvm_set_memslot+0x2f8/0x3b0
[ 361.094523] __kvm_set_memory_region+0x2e4/0x438
[ 361.094524] kvm_set_memory_region+0x78/0xb8
[ 361.094526] kvm_vm_ioctl+0x5a0/0x13e0
[ 361.094528] __arm64_sys_ioctl+0xb0/0xf8
[ 361.094530] invoke_syscall+0x4c/0x110
[ 361.094533] el0_svc_common.constprop.0+0x68/0x128
[ 361.094536] do_el0_svc+0x34/0xc0
[ 361.094538] el0_svc+0x30/0x98
[ 361.094541] el0t_64_sync_handler+0xb8/0xc0
[ 361.094544] el0t_64_sync+0x18c/0x190
[ 363.942817] kvm_set_memory_region loop=6000
Powered by blists - more mailing lists