Message-ID: <tencent_5C6A354D8C69190DB41D0E83D51195DBC108@qq.com>
Date: Fri, 24 Jun 2022 23:30:37 +0800
From: "zhangfei.gao@...mail.com" <zhangfei.gao@...mail.com>
To: Neeraj Upadhyay <quic_neeraju@...cinc.com>, paulmck@...nel.org
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@...wei.com>,
Paolo Bonzini <pbonzini@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"rcu@...r.kernel.org" <rcu@...r.kernel.org>,
Lai Jiangshan <jiangshanlai@...il.com>,
Josh Triplett <josh@...htriplett.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Matthew Wilcox <willy@...radead.org>,
"mtosatti@...hat.com" <mtosatti@...hat.com>,
Auger Eric <eric.auger@...hat.com>,
"chenxiang (M)" <chenxiang66@...ilicon.com>
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking
readers from consuming CPU) cause qemu boot slow
On 2022/6/20 3:50 PM, Neeraj Upadhyay wrote:
> Hi,
>
>
> On 6/18/2022 8:37 AM, zhangfei.gao@...mail.com wrote:
>>
>>
>> On 2022/6/15 6:40 PM, Neeraj Upadhyay wrote:
>>> Hi,
>>>
>>> On 6/15/2022 2:33 PM, zhangfei.gao@...mail.com wrote:
>>>>
>>>>
>>>> On 2022/6/14 10:17 PM, Paul E. McKenney wrote:
>>>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@...mail.com
>>>>> wrote:
>>>>>>
>>>>>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>>>>>> 5.18-rc4 based ~8sec
>>>>>>>>
>>>>>>>> 5.19-rc1 ~2m43sec
>>>>>>>>
>>>>>>>> 5.19-rc1+fix1 ~19sec
>>>>>>>>
>>>>>>>> 5.19-rc1-fix2 ~19sec
>>>>>>>>
>>>>>>> If you try the diff below on top of either 5.19-rc1+fix1 or
>>>>>>> 5.19-rc1-fix2, does it show any difference in boot time?
>>>>>>>
>>>>>>> --- a/kernel/rcu/srcutree.c
>>>>>>> +++ b/kernel/rcu/srcutree.c
>>>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
>>>>>>>  */
>>>>>>> static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>>> {
>>>>>>> - unsigned long cbdelay;
>>>>>>> + unsigned long cbdelay = 1;
>>>>>>> bool cbs;
>>>>>>> bool last_lvl;
>>>>>>> int cpu;
>>>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>>> spin_lock_irq_rcu_node(ssp);
>>>>>>> idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>>>> WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>>>>> - cbdelay = !!srcu_get_delay(ssp);
>>>>>>> + if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>>> + cbdelay = 0;
>>>>>>> +
>>>>>>> WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
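For anyone following along, the new check relies on RCU's wrap-safe sequence
comparison. A minimal standalone sketch of the idea follows; the ULONG_CMP_LT()
definition is reproduced here purely for illustration and the counter values are
made up, so treat it as a sketch rather than the kernel's own code:

#include <limits.h>
#include <stdio.h>

/*
 * Illustrative only: wrap-safe "a is logically behind b" test for
 * unsigned long sequence counters, which the diff above uses to detect
 * that an expedited grace period has been requested beyond the one
 * that is now ending.
 */
#define ULONG_CMP_LT(a, b)  (ULONG_MAX / 2 < (a) - (b))

int main(void)
{
	unsigned long gp_seq = ULONG_MAX - 2;   /* made-up current GP sequence */
	unsigned long gp_seq_needed_exp = 2;    /* wrapped around, but "ahead" */
	unsigned long cbdelay = 1;

	/* Mirrors the diff: drop the callback delay when an expedited */
	/* grace period is pending past the one that just completed.   */
	if (ULONG_CMP_LT(gp_seq, gp_seq_needed_exp))
		cbdelay = 0;

	printf("cbdelay = %lu\n", cbdelay);     /* prints 0 */
	return 0;
}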
>>>>> Thank you both for the testing and the proposed fix!
>>>>>
>>>>>> Test here:
>>>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>>>>> kernel:
>>>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test
>>>>>>
>>>>>> (in case the test patches are not clear, they are pushed to the git tree)
>>>>>>
>>>>>> Hardware: aarch64
>>>>>>
>>>>>> 1. 5.18-rc6
>>>>>> real 0m8.402s
>>>>>> user 0m3.015s
>>>>>> sys 0m1.102s
>>>>>>
>>>>>> 2. 5.19-rc1
>>>>>> real 2m41.433s
>>>>>> user 0m3.097s
>>>>>> sys 0m1.177s
>>>>>>
>>>>>> 3. 5.19-rc1 + fix1 from Paul
>>>>>> real 2m43.404s
>>>>>> user 0m2.880s
>>>>>> sys 0m1.214s
>>>>>>
>>>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>>>>> real 0m15.262s
>>>>>> user 0m3.003s
>>>>>> sys 0m1.033s
>>>>>>
>>>>>> When building the kernel at the same time, the load time becomes longer.
>>>>>>
>>>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>>> real 0m15.215s
>>>>>> user 0m2.942s
>>>>>> sys 0m1.172s
>>>>>>
>>>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end
>>>>>> real 1m23.936s
>>>>>> user 0m2.969s
>>>>>> sys 0m1.181s
>>>>> And thank you for the testing!
>>>>>
>>>>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>>>>
>>>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>>>>> srcu_gp_end.
>>>>>
>>>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>>>>> times start rising? This is probably best done by starting with
>>>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>>>>> until boot time becomes slow, followed by a binary search between the
>>>>> last two values. (The idea is to bias the search so that fast boot
>>>>> times are the common case.)
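As a concrete sketch of that search procedure (illustrative only: is_boot_slow()
is a hypothetical oracle, and in practice each probe means rebuilding with a new
SRCU_MAX_NODELAY_PHASE and timing a qemu boot; the 300 threshold is made up):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical oracle standing in for "rebuild kernel, time qemu boot". */
static bool is_boot_slow(unsigned long phase)
{
	return phase < 300;                  /* made-up threshold */
}

int main(void)
{
	unsigned long hi = 100000, lo;

	/* Divide by ten until boot turns slow: 100000 -> 10000 -> 1000 ... */
	while (!is_boot_slow(hi / 10))
		hi /= 10;
	lo = hi / 10;                        /* first value that boots slowly */

	/* Binary search between the last fast and the first slow value. */
	while (hi - lo > 50) {
		unsigned long mid = lo + (hi - lo) / 2;
		if (is_boot_slow(mid))
			lo = mid;
		else
			hi = mid;
	}
	printf("degradation starts between %lu and %lu\n", lo, hi);
	return 0;
}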
>>>>
>>>> With SRCU_MAX_NODELAY_PHASE 100, boot becomes slower.
>>>>
>>>>
>>>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>
>>>> real 0m11.154s ~12s
>>>>
>>>> user 0m2.919s
>>>>
>>>> sys 0m1.064s
>>>>
>>>>
>>>>
>>>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>>>>
>>>> real 0m11.258s
>>>>
>>>> user 0m3.113s
>>>>
>>>> sys 0m1.073s
>>>>
>>>>
>>>>
>>>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>>>>
>>>> real 0m30.053s ~ 32s
>>>>
>>>> user 0m2.827s
>>>>
>>>> sys 0m1.161s
>>>>
>>>>
>>>>
>>>> By the way, if the kernel is built on the board at the same time (using
>>>> memory), the time becomes much longer.
>>>>
>>>> real 1m2.763s
>>>>
>>>>
>>>>
>>>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>>>
>>>> real 0m11.443s
>>>>
>>>> user 0m3.022s
>>>>
>>>> sys 0m1.052s
>>>>
>>>>
>>>
>>> This is useful data, thanks! Did you get a chance to check between 100
>>> and 1000, to narrow down further at which point (it does not need to be
>>> an exact value) between 100 and 1000 you start seeing degradation,
>>> for example 250, 500, ...?
>>>
>>> Is it also possible to try experiments 10 and 11 with the diff below?
>>> What the diff below does is call srcu_get_delay() only once
>>> in try_check_zero() (and not on every loop iteration); it also
>>> retries with a different delay for the extra iteration which is done
>>> when srcu_get_delay(ssp) returns 0.
>>>
>>> Once we have this data, can you also try changing
>>> SRCU_RETRY_CHECK_LONG_DELAY to 100 on top of the diff below?
>>>
>>> #define SRCU_RETRY_CHECK_LONG_DELAY 100
>>>
>>> -------------------------------------------------------------------------
>>>
>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>> index 6a354368ac1d..3aff2f3e99ab 100644
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>>> * we repeatedly block for 1-millisecond time periods.
>>> */
>>> #define SRCU_RETRY_CHECK_DELAY 5
>>> +#define SRCU_RETRY_CHECK_LONG_DELAY 5
>>>
>>> /*
>>> * Start an SRCU grace period.
>>> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
>>> */
>>> static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
>>> {
>>> + unsigned long curdelay;
>>> + curdelay = !srcu_get_delay(ssp);
>>> for (;;) {
>>> if (srcu_readers_active_idx_check(ssp, idx))
>>> return true;
>>> - if (--trycount + !srcu_get_delay(ssp) <= 0)
>>> + if (--trycount + curdelay <= 0)
>>> return false;
>>> - udelay(SRCU_RETRY_CHECK_DELAY);
>>> + if (trycount)
>>> + udelay(SRCU_RETRY_CHECK_DELAY);
>>> + else
>>> + udelay(SRCU_RETRY_CHECK_LONG_DELAY);
>>> }
>>> }
>>>
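A userspace sketch of the retry policy this diff introduces (purely illustrative:
srcu_readers_active_idx_check(), srcu_get_delay() and udelay() are replaced by
stubs here, and the constants simply mirror the values quoted above):

#include <stdbool.h>
#include <stdio.h>

#define SRCU_RETRY_CHECK_DELAY      5
#define SRCU_RETRY_CHECK_LONG_DELAY 5

static int checks_until_readers_gone; /* stub for srcu_readers_active_idx_check() */
static int no_delay_requested;        /* stub for !srcu_get_delay(ssp)             */

static bool readers_gone(void)
{
	return --checks_until_readers_gone <= 0;
}

static void udelay(unsigned long usec)
{
	printf("  spin for %lu us\n", usec);  /* stand-in for the real udelay() */
}

/* Same shape as try_check_zero() with the diff above applied. */
static bool try_check_zero(int trycount)
{
	unsigned long curdelay = no_delay_requested;  /* sampled once, not per iteration */

	for (;;) {
		if (readers_gone())
			return true;
		if (--trycount + curdelay <= 0)
			return false;
		if (trycount)
			udelay(SRCU_RETRY_CHECK_DELAY);
		else
			/* The extra iteration granted when no delay was requested. */
			udelay(SRCU_RETRY_CHECK_LONG_DELAY);
	}
}

int main(void)
{
	checks_until_readers_gone = 3;  /* readers drain on the third check */
	no_delay_requested = 1;         /* expedited case: one extra, longer retry */
	printf("success: %d\n", try_check_zero(2));
	return 0;
}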
>>
>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>> real 0m11.443s
>> user 0m3.022s
>> sys 0m1.052s
>>
>> fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
>>
>> real 0m11.401s
>> user 0m2.798s
>> sys 0m1.328s
>>
>>
>> fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
>>
>> real 0m15.748s
>> user 0m2.781s
>> sys 0m1.294s
>>
>>
>> fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
>>
>> real 0m20.704s ~21s
>> user 0m2.954s
>> sys 0m1.226s
>>
>> fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
>>
>> real 0m25.151s
>> user 0m2.980s
>> sys 0m1.256s
>>
>>
>> fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>> real 0m30.053s ~ 32s
>> user 0m2.827s
>> sys 0m1.161s
>>
>>
>> fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
>>
>> real 0m19.263s
>> user 0m3.018s
>> sys 0m1.211s
>>
>>
>>
>> fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 +
>> SRCU_RETRY_CHECK_LONG_DELAY 100
>>
>> real 0m9.347s
>> user 0m3.132s
>> sys 0m1.041s
>>
>>
>
> Thanks. From the data and experiments so far, it looks to me like we get
> timings comparable to 5.18-rc4 when we retry without sleeping for a
> duration close to 4-5 ms, which could correspond to the configured
> HZ (as it is 250, one jiffy is 4 ms). Is it possible to try the
> configurations below on top of fix15?
> If possible, can you try with both HZ_1000 and HZ_250?
> As multiple fixes are being combined in these experiments, for clarity
> please also share the diff of srcutree.c (on top of baseline) for all
> experiments.
>
> 16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 1 jiffy)
>
>
> #define SRCU_MAX_NODELAY_TRY_CHECK_PHASE 10
> #define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2)
> #define SRCU_RETRY_CHECK_LONG_DELAY \
> (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
>
>
> 17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 2 jiffies)
>
> #define SRCU_RETRY_CHECK_LONG_DELAY \
> (2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
>
> 18. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 1/2 jiffy)
>
> #define SRCU_RETRY_CHECK_LONG_DELAY \
> (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2)
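For reference, here is what those jiffy-scaled long delays work out to
numerically; a small userspace sketch (USEC_PER_SEC, SRCU_MAX_NODELAY_TRY_CHECK_PHASE
and the expressions are taken from the quoted proposal, the rest is plain arithmetic):

#include <stdio.h>

#define USEC_PER_SEC 1000000UL
#define SRCU_MAX_NODELAY_TRY_CHECK_PHASE 10UL

static void show(unsigned long hz)
{
	unsigned long one  = USEC_PER_SEC / hz / SRCU_MAX_NODELAY_TRY_CHECK_PHASE;      /* experiment 16 */
	unsigned long two  = 2 * USEC_PER_SEC / hz / SRCU_MAX_NODELAY_TRY_CHECK_PHASE;  /* experiment 17 */
	unsigned long half = USEC_PER_SEC / hz / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2;  /* experiment 18 */

	printf("HZ=%lu: long delay = %lu us (1 jiffy), %lu us (2 jiffies), %lu us (1/2 jiffy)\n",
	       hz, one, two, half);
}

int main(void)
{
	/* At HZ=250 a jiffy is 4000 us, so the 1-jiffy scaling gives 400 us per long retry. */
	show(250);
	/* At HZ=1000 a jiffy is 1000 us, giving 100 us per long retry. */
	show(1000);
	return 0;
}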
fix16: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 1 jiffy)
real 0m10.120s
user 0m3.885s
sys 0m1.040s

fix17: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 2 jiffies)
real 0m9.851s
user 0m3.886s
sys 0m1.011s

fix18: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 1/2 jiffy)
real 0m9.741s
user 0m3.837s
sys 0m1.060s

Code pushed to
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test