Message-ID: <tencent_5C6A354D8C69190DB41D0E83D51195DBC108@qq.com>
Date:   Fri, 24 Jun 2022 23:30:37 +0800
From:   "zhangfei.gao@...mail.com" <zhangfei.gao@...mail.com>
To:     Neeraj Upadhyay <quic_neeraju@...cinc.com>, paulmck@...nel.org
Cc:     Shameerali Kolothum Thodi <shameerali.kolothum.thodi@...wei.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "rcu@...r.kernel.org" <rcu@...r.kernel.org>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Josh Triplett <josh@...htriplett.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Matthew Wilcox <willy@...radead.org>,
        "mtosatti@...hat.com" <mtosatti@...hat.com>,
        Auger Eric <eric.auger@...hat.com>,
        "chenxiang (M)" <chenxiang66@...ilicon.com>
Subject: Re: Commit 282d8998e997 (srcu: Prevent expedited GPs and blocking
 readers from consuming CPU) causes slow qemu boot



On 2022/6/20 3:50 PM, Neeraj Upadhyay wrote:
> Hi,
>
>
> On 6/18/2022 8:37 AM, zhangfei.gao@...mail.com wrote:
>>
>>
>> On 2022/6/15 6:40 PM, Neeraj Upadhyay wrote:
>>> Hi,
>>>
>>> On 6/15/2022 2:33 PM, zhangfei.gao@...mail.com wrote:
>>>>
>>>>
>>>> On 2022/6/14 10:17 PM, Paul E. McKenney wrote:
>>>>> On Tue, Jun 14, 2022 at 10:03:35PM +0800, zhangfei.gao@...mail.com 
>>>>> wrote:
>>>>>>
>>>>>> On 2022/6/14 8:19 PM, Neeraj Upadhyay wrote:
>>>>>>>> 5.18-rc4 based   ~8sec
>>>>>>>> 5.19-rc1         ~2m43sec
>>>>>>>> 5.19-rc1+fix1    ~19sec
>>>>>>>> 5.19-rc1-fix2    ~19sec
>>>>>>>>
>>>>>>> If you try the diff below on top of either 5.19-rc1+fix1 or
>>>>>>> 5.19-rc1-fix2, does it show any difference in boot time?
>>>>>>>
>>>>>>> --- a/kernel/rcu/srcutree.c
>>>>>>> +++ b/kernel/rcu/srcutree.c
>>>>>>> @@ -706,7 +706,7 @@ static void srcu_schedule_cbs_snp(struct srcu_struct *ssp, struct srcu_node *snp
>>>>>>>   */
>>>>>>>  static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>>>  {
>>>>>>> -       unsigned long cbdelay;
>>>>>>> +       unsigned long cbdelay = 1;
>>>>>>>         bool cbs;
>>>>>>>         bool last_lvl;
>>>>>>>         int cpu;
>>>>>>> @@ -726,7 +726,9 @@ static void srcu_gp_end(struct srcu_struct *ssp)
>>>>>>>         spin_lock_irq_rcu_node(ssp);
>>>>>>>         idx = rcu_seq_state(ssp->srcu_gp_seq);
>>>>>>>         WARN_ON_ONCE(idx != SRCU_STATE_SCAN2);
>>>>>>> -       cbdelay = !!srcu_get_delay(ssp);
>>>>>>> +       if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>>>>>>> +               cbdelay = 0;
>>>>>>> +
>>>>>>>         WRITE_ONCE(ssp->srcu_last_gp_end, ktime_get_mono_fast_ns());
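>>>>>>>
>>>>>>> For reference, ULONG_CMP_LT() is the wraparound-safe sequence
>>>>>>> comparison from kernel/rcu/rcu.h:
>>>>>>>
>>>>>>> #define ULONG_CMP_LT(a, b)      (ULONG_MAX / 2 < (a) - (b))
>>>>>>>
>>>>>>> so the added check asks whether an expedited GP has been requested
>>>>>>> beyond the one that just ended; if so, cbdelay is set to 0 and the
>>>>>>> callbacks are not delayed, otherwise the default cbdelay = 1 applies.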
>>>>> Thank you both for the testing and the proposed fix!
>>>>>
>>>>>> Test here:
>>>>>> qemu: https://github.com/qemu/qemu/tree/stable-6.1
>>>>>> kernel:
>>>>>> https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test 
>>>>>>
>>>>>> (in case the test patch is not clear, it is pushed to the git tree)
>>>>>>
>>>>>> Hardware: aarch64
>>>>>>
>>>>>> 1. 5.18-rc6
>>>>>> real    0m8.402s
>>>>>> user    0m3.015s
>>>>>> sys     0m1.102s
>>>>>>
>>>>>> 2. 5.19-rc1
>>>>>> real    2m41.433s
>>>>>> user    0m3.097s
>>>>>> sys     0m1.177s
>>>>>>
>>>>>> 3. 5.19-rc1 + fix1 from Paul
>>>>>> real    2m43.404s
>>>>>> user    0m2.880s
>>>>>> sys     0m1.214s
>>>>>>
>>>>>> 4. 5.19-rc1 + fix2: fix1 + Remove "if (!jbase)" block
>>>>>> real    0m15.262s
>>>>>> user    0m3.003s
>>>>>> sys     0m1.033s
>>>>>>
>>>>>> When building the kernel in the meantime, load time becomes longer.
>>>>>>
>>>>>> 5. 5.19-rc1 + fix3: fix1 + SRCU_MAX_NODELAY_PHASE 1000000
>>>>>> real    0m15.215s
>>>>>> user    0m2.942s
>>>>>> sys    0m1.172s
>>>>>>
>>>>>> 6. 5.19-rc1 + fix4: fix1 + Neeraj's change of srcu_gp_end 
>>>>>> real    1m23.936s
>>>>>> user    0m2.969s
>>>>>> sys    0m1.181s
>>>>> And thank you for the testing!
>>>>>
>>>>> Could you please try fix3 + Neeraj's change of srcu_gp_end?
>>>>>
>>>>> That is, fix1 + SRCU_MAX_NODELAY_PHASE 1000000 + Neeraj's change of
>>>>> srcu_gp_end.
>>>>>
>>>>> Also, at what value of SRCU_MAX_NODELAY_PHASE do the boot
>>>>> times start rising?  This is probably best done by starting with
>>>>> SRCU_MAX_NODELAY_PHASE=100000 and dividing by (say) ten on each run
>>>>> until boot time becomes slow, followed by a binary search between the
>>>>> last two values.  (The idea is to bias the search so that fast boot
>>>>> times are the common case.)
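>>>>>
>>>>> (Illustratively, such a search might run 100000 -> 10000 -> 1000 ->
>>>>> 100; if 100 is the first slow value, bisect between 100 and 1000,
>>>>> e.g. try 550 next, halving the remaining interval on each boot.)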
>>>>
>>>> With SRCU_MAX_NODELAY_PHASE 100, boot becomes slower.
>>>>
>>>> 8. 5.19-rc1 + fix6: fix4 + SRCU_MAX_NODELAY_PHASE 1000000
>>>> real    0m11.154s ~12s
>>>> user    0m2.919s
>>>> sys     0m1.064s
>>>>
>>>> 9. 5.19-rc1 + fix7: fix4 + SRCU_MAX_NODELAY_PHASE 10000
>>>> real    0m11.258s
>>>> user    0m3.113s
>>>> sys     0m1.073s
>>>>
>>>> 10. 5.19-rc1 + fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>>>> real    0m30.053s ~32s
>>>> user    0m2.827s
>>>> sys     0m1.161s
>>>>
>>>> By the way, if building the kernel on the board in the meantime
>>>> (using memory), the time becomes much longer.
>>>> real    1m2.763s
>>>>
>>>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>>>> real    0m11.443s
>>>> user    0m3.022s
>>>> sys     0m1.052s
>>>>
>>>
>>> This is useful data, thanks! Did you get a chance to check between 100
>>> and 1000, to narrow down further from which point between 100 and 1000
>>> (it does not need to be an exact value) you start seeing degradation,
>>> for example 250, 500, ...?
>>>
>>> Is it also possible to try experiments 10 and 11 with the diff below?
>>> What I have done in the diff is to call srcu_get_delay() only once
>>> in try_check_zero() (and not for every loop iteration), and also to
>>> retry with a different delay for the extra iteration which is done
>>> when srcu_get_delay(ssp) returns 0.
>>>
>>> Once we have this data, can you also try changing
>>> SRCU_RETRY_CHECK_LONG_DELAY to 100, on top of the diff below?
>>>
>>> #define SRCU_RETRY_CHECK_LONG_DELAY  100
>>>
>>> ------------------------------------------------------------------------- 
>>>
>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>> index 6a354368ac1d..3aff2f3e99ab 100644
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -620,6 +620,7 @@ EXPORT_SYMBOL_GPL(__srcu_read_unlock);
>>>   * we repeatedly block for 1-millisecond time periods.
>>>   */
>>>  #define SRCU_RETRY_CHECK_DELAY         5
>>> +#define SRCU_RETRY_CHECK_LONG_DELAY    5
>>>
>>>  /*
>>>   * Start an SRCU grace period.
>>> @@ -927,12 +928,17 @@ static void srcu_funnel_gp_start(struct srcu_struct *ssp, struct srcu_data *sdp,
>>>   */
>>>  static bool try_check_zero(struct srcu_struct *ssp, int idx, int trycount)
>>>  {
>>> +       unsigned long curdelay;
>>> +       curdelay = !srcu_get_delay(ssp);
>>>         for (;;) {
>>>                 if (srcu_readers_active_idx_check(ssp, idx))
>>>                         return true;
>>> -               if (--trycount + !srcu_get_delay(ssp) <= 0)
>>> +               if (--trycount + curdelay <= 0)
>>>                         return false;
>>> -               udelay(SRCU_RETRY_CHECK_DELAY);
>>> +               if (trycount)
>>> +                       udelay(SRCU_RETRY_CHECK_DELAY);
>>> +               else
>>> +                       udelay(SRCU_RETRY_CHECK_LONG_DELAY);
>>>         }
>>>  }
>>>
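>>> For intuition, a rough worst-case busy-wait budget per
>>> try_check_zero() call with this diff (trycount = 2 is an illustrative
>>> assumption, not taken from the patch):
>>>
>>> /*
>>>  * With trycount = 2 and curdelay = 1 (the expedited case), the loop
>>>  * does one short and one long delay before giving up:
>>>  *   SRCU_RETRY_CHECK_DELAY + SRCU_RETRY_CHECK_LONG_DELAY
>>>  *     = 5 + 5   = 10 us  (with LONG_DELAY 5)
>>>  *     = 5 + 100 = 105 us (with LONG_DELAY 100)
>>>  * versus 2 * 5 = 10 us with the original single short delay.
>>>  */
>>>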
>>
>> 11. 5.19-rc1 + fix9: fix4 + SRCU_MAX_NODELAY_PHASE 1000
>> real    0m11.443s
>> user    0m3.022s
>> sys     0m1.052s
>>
>> fix10: fix4 + SRCU_MAX_NODELAY_PHASE 500
>> real    0m11.401s
>> user    0m2.798s
>> sys     0m1.328s
>>
>> fix11: fix4 + SRCU_MAX_NODELAY_PHASE 250
>> real    0m15.748s
>> user    0m2.781s
>> sys     0m1.294s
>>
>> fix12: fix4 + SRCU_MAX_NODELAY_PHASE 200
>> real    0m20.704s ~21s
>> user    0m2.954s
>> sys     0m1.226s
>>
>> fix13: fix4 + SRCU_MAX_NODELAY_PHASE 150
>> real    0m25.151s
>> user    0m2.980s
>> sys     0m1.256s
>>
>> fix8: fix4 + SRCU_MAX_NODELAY_PHASE 100
>> real    0m30.053s ~32s
>> user    0m2.827s
>> sys     0m1.161s
>>
>> fix14: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 5
>> real    0m19.263s
>> user    0m3.018s
>> sys     0m1.211s
>>
>> fix15: fix4 + SRCU_MAX_NODELAY_PHASE 100 + SRCU_RETRY_CHECK_LONG_DELAY 100
>> real    0m9.347s
>> user    0m3.132s
>> sys     0m1.041s
>>
>
> Thanks. From the data and experiments done, it looks to me that we get
> timings comparable to 5.18-rc4 when we retry without sleeping for a
> duration close to 4-5 ms, which could be related to the configured HZ
> (as it is 250)? Is it possible to try the configurations below on top
> of fix15?
> If possible, can you try with both HZ_1000 and HZ_250?
> As multiple fixes are getting combined in the experiments, for clarity,
> please also share the diff of srcutree.c (on top of baseline) for all
> experiments.
>
> 16. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 1 jiffy); worked numbers for all three scalings
> follow after experiment 18 below.
>
> #define SRCU_MAX_NODELAY_TRY_CHECK_PHASE       10
> #define SRCU_MAX_NODELAY_PHASE (SRCU_MAX_NODELAY_TRY_CHECK_PHASE * 2)
> #define SRCU_RETRY_CHECK_LONG_DELAY \
>         (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
>
> 17. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 2 jiffies)
>
> #define SRCU_RETRY_CHECK_LONG_DELAY \
>         (2 * USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE)
>
> 18. fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
> (long delay scaled to 1/2 jiffy)
>
> #define SRCU_RETRY_CHECK_LONG_DELAY \
>         (USEC_PER_SEC / HZ / SRCU_MAX_NODELAY_TRY_CHECK_PHASE / 2)
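>
> Worked numbers for the scalings above (a sketch assuming HZ = 250, as
> in the current config):
>
> /*
>  * one jiffy              = USEC_PER_SEC / HZ = 1000000 / 250 = 4000 us
>  * long delay, 1 jiffy    = 4000 / SRCU_MAX_NODELAY_TRY_CHECK_PHASE
>  *                        = 4000 / 10 = 400 us per retry
>  * long delay, 2 jiffies  = 800 us per retry
>  * long delay, 1/2 jiffy  = 200 us per retry
>  *
>  * so the 10 long-delay retries together busy-wait for roughly one,
>  * two, and half a jiffy (~4 ms, ~8 ms, ~2 ms) respectively.
>  */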

fix16: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 1 jiffy)

real    0m10.120s
user    0m3.885s
sys     0m1.040s

fix17: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 2 jiffies)

real    0m9.851s
user    0m3.886s
sys     0m1.011s

fix18: fix15 + SRCU_MAX_NODELAY_PHASE 20 (10 try_check_zero() calls) +
(long delay scaled to 1/2 jiffy)

real    0m9.741s
user    0m3.837s
sys     0m1.060s

Code pushed to
https://github.com/Linaro/linux-kernel-uadk/tree/uacce-devel-5.19-srcu-test

