Message-ID: <7a90a9b5-df13-6824-32d1-931f19c96cba@quicinc.com>
Date: Mon, 4 Apr 2022 11:15:23 +0530
From: Neeraj Upadhyay <quic_neeraju@...cinc.com>
To: <paulmck@...nel.org>, Eric Dumazet <edumazet@...gle.com>
CC: LKML <linux-kernel@...r.kernel.org>, KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...gle.com>, <andrii@...nel.org>,
<ast@...nel.org>
Subject: Re: [BUG] rcu-tasks : should take care of sparse cpu masks
Hi,
Trying to understand the issue.
On 4/1/2022 9:18 PM, Paul E. McKenney wrote:
> On Fri, Apr 01, 2022 at 08:28:14AM -0700, Paul E. McKenney wrote:
>> [ Adding Andrii and Alexei at Andrii's request. ]
>>
>> On Fri, Apr 01, 2022 at 08:20:37AM -0700, Paul E. McKenney wrote:
>>> On Fri, Apr 01, 2022 at 06:24:13AM -0700, Eric Dumazet wrote:
>>>> On Fri, Apr 1, 2022 at 6:01 AM Paul E. McKenney <paulmck@...nel.org> wrote:
>>>>>
>>>>> On Thu, Mar 31, 2022 at 09:39:02PM -0700, Eric Dumazet wrote:
>>>>>> On Thu, Mar 31, 2022 at 5:06 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>>>>>>>
>>>>>>> On Thu, Mar 31, 2022 at 04:28:04PM -0700, Eric Dumazet wrote:
>>>>>>>> On Thu, Mar 31, 2022 at 4:13 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>>>>>>>>>
>>>>>>>>> The initial setting of ->percpu_enqueue_shift forces all in-range CPU
>>>>>>>>> IDs to shift down to zero. The grace-period kthread is allowed to run
>>>>>>>>> where it likes. The callback lists are protected by locking, even in
>>>>>>>>> the case of local access, so this should be safe.
>>>>>>>>>
>>>>>>>>> Or am I missing your point?
>>>>>>>>>
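As an aside, a tiny illustration of the "shift down to zero" behavior
(not the real kernel code; queue_index() is a made-up name): with
->percpu_enqueue_shift initially set to something like
order_base_2(nr_cpu_ids), every in-range CPU maps to callback list 0.

	/* Illustrative only; queue_index() is a made-up name. */
	static inline int queue_index(int cpu, int enqueue_shift)
	{
		/*
		 * With enqueue_shift == order_base_2(nr_cpu_ids), any
		 * cpu < nr_cpu_ids shifts down to 0, i.e. one shared list.
		 */
		return cpu >> enqueue_shift;
	}
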
>>>>>>>>
>>>>>>>> In fact I have been looking at this code, because we bisected a
>>>>>>>> regression back to this patch:
>>>>>>>>
>>>>>>>> 4fe192dfbe5ba9780df699d411aa4f25ba24cf61 rcu-tasks: Shorten
>>>>>>>> per-grace-period sleep for RCU Tasks Trace
>>>>>>>>
>>>>>>>> It is very possible the regression comes because the RCU task thread
>>>>>>>> is using more cpu cycles, from 'CPU 0' where our system daemons are
>>>>>>>> pinned.
>>>>>>>
>>>>>>> Heh! I did express that concern when creating that patch, but was
>>>>>>> assured that the latency was much more important.
>>>>>>>
>>>>>>> Yes, that patch most definitely increases CPU utilization during RCU Tasks
>>>>>>> Trace grace periods. If you can tolerate longer grace-period latencies,
>>>>>>> it might be worth toning it down a bit. The ask was for about twice
>>>>>>> the latency I achieved in my initial attempt, and I made the mistake of
>>>>>>> forwarding that attempt out for testing. They liked the shorter latency
>>>>>>> very much, and objected strenuously to the thought that I might detune
>>>>>>> it back to the latency that they originally asked for. ;-)
>>>>>>>
>>>>>>> But I can easily provide the means to detune it through use of a kernel
>>>>>>> boot parameter or some such, if that would help.
>>>>>>>
>>>>>>>> But I could not spot where the RCU task kthread is forced to run on CPU 0.
>>>>>>>
>>>>>>> I never did intend this kthread be bound anywhere. RCU's policy is
>>>>>>> that any binding of its kthreads is the responsibility of the sysadm,
>>>>>>> be that carbon-based or otherwise.
>>>>>>>
>>>>>>> But this kthread is spawned early enough that only CPU 0 is online,
>>>>>>> so maybe the question is not "what is binding it to CPU 0?" but rather
>>>>>>> "why isn't something kicking it off of CPU 0?"
>>>>>>
>>>>>> I guess the answer to this question can be found in the following
>>>>>> piece of code :)
>>>>>>
>>>>>>         rcu_read_lock();
>>>>>>         for_each_process_thread(g, t)
>>>>>>                 rtp->pertask_func(t, &holdouts);
>>>>>>         rcu_read_unlock();
>>>>>>
>>>>>>
>>>>>> With ~150,000 threads on a 256 cpu host, this holds current cpu for
>>>>>> very long times:
>>>>>>
>>>>>> rcu_tasks_trace 11 [017] 5010.544762: probe:rcu_tasks_wait_gp: (ffffffff963fb4b0)
>>>>>> rcu_tasks_trace 11 [017] 5010.600396: probe:rcu_tasks_trace_postscan: (ffffffff963fb7c0)
>>>>>
>>>>> So about 55 milliseconds for the tasklist scan, correct? Or am I
>>>>> losing the plot here?
>>>>>
>>>>>> rcu_tasks_trace 11 [022] 5010.618783: probe:check_all_holdout_tasks_trace: (ffffffff963fb850)
>>>>>> rcu_tasks_trace 11 [022] 5010.618840: probe:rcu_tasks_trace_postgp: (ffffffff963fba70)
>>>>>>
>>>>>> In this case, CPU 22 is the victim, not CPU 0 :)
>>>>>
>>>>> My faith in the scheduler is restored! ;-)
>>>>>
>>>>> My position has been that this tasklist scan does not need to be broken
>>>>> up because it should happen only when a sleepable BPF program is removed,
>>>>> which is a rare event.
>>>>
>>>> Hmm... what about bpf_sk_storage_free() ?
>>>>
>>>> Definitely not a rare event.
>>>
>>> Hmmm... Are the BPF guys using call_rcu_tasks_trace() to free things that
>>> are not trampolines for sleepable BPF programs? Kind of looks like it.
>>>
>>> Maybe RCU Tasks Trace was too convenient to use? ;-)
>>>
>>>>> In addition, breaking up this scan is not trivial, because as far as I
>>>>> know there is no way to force a given task to stay in the list. I would
>>>>> have to instead use something like rcu_lock_break(), and restart the
>>>>> scan if either of the nailed-down pair of tasks was removed from the list.
>>>>> In a system where tasks were coming and going very frequently, it might
>>>>> be that such a broken-up scan would never complete.
>>>>>
>>>>> I can imagine tricks where the nailed-down tasks are kept on a list,
>>>>> and the nailed-downness is moved to the next task when those tasks
>>>>> are removed. I can also imagine a less-than-happy response to such
>>>>> a proposal.
>>>>>
>>>>> So I am not currently thinking in terms of breaking up this scan.
>>>>>
>>>>> Or is there some trick that I am missing?
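For illustration, the nailing-down described above might look roughly
like the following (hypothetical and untested, modeled on the
rcu_lock_break() pattern in kernel/hung_task.c; rtp_tasklist_lock_break()
is a made-up name):

	/*
	 * Hypothetical sketch: pin the current pair of tasks, drop the RCU
	 * read lock so the scanner can be rescheduled, then reacquire it
	 * and report whether the scan may safely continue.
	 */
	static bool rtp_tasklist_lock_break(struct task_struct *g,
					    struct task_struct *t)
	{
		bool can_continue;

		get_task_struct(g);
		get_task_struct(t);
		rcu_read_unlock();
		cond_resched();
		rcu_read_lock();
		can_continue = pid_alive(g) && pid_alive(t);
		put_task_struct(t);
		put_task_struct(g);
		return can_continue;	/* false: restart the tasklist scan. */
	}

A false return would force a restart from the top of the tasklist, which
is exactly the never-completing-scan risk described above.
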
>>>>>
>>>>> In the meantime, here is a simple patch that reduces the frequency of the
>>>>> scan by a factor of two. But what it halves is not the full-tasklist
>>>>> scan, but rather the frequency of the calls to
>>>>> check_all_holdout_tasks_trace().
>>>>> And the total of these looks to be less than 20 milliseconds, if I am
>>>>> correctly interpreting your trace. And most of that 20 milliseconds
>>>>> is sleeping.
>>>>>
>>>>> Nevertheless, the patch is at the end of this email.
>>>>>
>>>>> Other than that, I could imagine batching removal of sleepable BPF
>>>>> programs and using a single grace period for all of their trampolines.
>>>>> But are there enough sleepable BPF programs ever installed to make this
>>>>> a useful approach?
>>>>>
>>>>> Or is the status quo in fact acceptable? (Hey, I can dream, can't I?)
>>>>>
>>>>> Thanx, Paul
>>>>>
>>>>>>>> I attempted to backport to our kernel all related patches that were
>>>>>>>> not yet backported,
>>>>>>>> and we still see a regression in our tests.
>>>>>>>
>>>>>>> The per-grace-period CPU consumption of rcu_tasks_trace was intentionally
>>>>>>> increased by the above commit, and I never have done anything to reduce
>>>>>>> that CPU consumption. In part because you are the first to call my
>>>>>>> attention to it.
>>>>>>>
>>>>>>> Oh, and one other issue that I very recently fixed, that has not
>>>>>>> yet reached mainline, just in case it matters. If you are building a
>>>>>>> CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y kernel, but also have
>>>>>>> CONFIG_RCU_TORTURE_TEST=m (or, for that matter, =y, but please don't in
>>>>>>> production!), then your kernel will use RCU Tasks instead of vanilla RCU.
>>>>>>> (Note well, RCU Tasks, not RCU Tasks Trace, the latter being necessary
>>>>>>> for sleepable BPF programs regardless of kernel .config).
>>>>>>>
>>>>>>>> Please ignore the sha1 in this current patch series, this is only to
>>>>>>>> show my current attempt to fix the regression in our tree.
>>>>>>>>
>>>>>>>> 450b3244f29b rcu-tasks: Don't remove tasks with pending IPIs from holdout list
>>>>>>>> 5f88f7e9cc36 rcu-tasks: Create per-CPU callback lists
>>>>>>>> 1a943d0041dc rcu-tasks: Introduce ->percpu_enqueue_shift for dynamic queue selection
>>>>>>>> ea5289f12fce rcu-tasks: Convert grace-period counter to grace-period sequence number
>>>>>>>> 22efd5093c3b rcu/segcblist: Prevent useless GP start if no CBs to accelerate
>>>>>>>> 16dee1b3babf rcu: Implement rcu_segcblist_is_offloaded() config dependent
>>>>>>>> 8cafaadb6144 rcu: Add callbacks-invoked counters
>>>>>>>> 323234685765 rcu/tree: Make rcu_do_batch count how many callbacks were executed
>>>>>>>> f48f3386a1cc rcu/segcblist: Add additional comments to explain smp_mb()
>>>>>>>> 4408105116de rcu/segcblist: Add counters to segcblist datastructure
>>>>>>>> 4a0b89a918d6 rcu/tree: segcblist: Remove redundant smp_mb()s
>>>>>>>> 38c0d18e8740 rcu: Add READ_ONCE() to rcu_do_batch() access to rcu_divisor
>>>>>>>> 0b5d1031b509 rcu/segcblist: Add debug checks for segment lengths
>>>>>>>> 8a82886fbf02 rcu_tasks: Convert bespoke callback list to rcu_segcblist structure
>>>>>>>> cbd452a5c01f rcu-tasks: Use spin_lock_rcu_node() and friends
>>>>>>>> 073222be51f3 rcu-tasks: Add a ->percpu_enqueue_lim to the rcu_tasks structure
>>>>>>>> 5af10fb0f8fb rcu-tasks: Abstract checking of callback lists
>>>>>>>> d3e8be598546 rcu-tasks: Abstract invocations of callbacks
>>>>>>>> 65784460a392 rcu-tasks: Use workqueues for multiple rcu_tasks_invoke_cbs() invocations
>>>>>>>> dd6413e355f1 rcu-tasks: Make rcu_barrier_tasks*() handle multiple callback queues
>>>>>>>> 2499cb3c438e rcu-tasks: Add rcupdate.rcu_task_enqueue_lim to set initial queueing
>>>>>>>> a859f409a503 rcu-tasks: Count trylocks to estimate call_rcu_tasks() contention
>>>>>>>> 4ab253ca056e rcu-tasks: Avoid raw-spinlocked wakeups from call_rcu_tasks_generic()
>>>>>>>> e9a3563fe76e rcu-tasks: Use more callback queues if contention encountered
>>>>>>>> 4023187fe31d rcu-tasks: Use separate ->percpu_dequeue_lim for callback dequeueing
>>>>>>>> 533be3bd47c3 rcu: Provide polling interfaces for Tree RCU grace periods
>>>>>>>> f7e5a81d7953 rcu-tasks: Use fewer callbacks queues if callback flood ends
>>>>>>>> bb7ad9078e1b rcu-tasks: Fix computation of CPU-to-list shift counts
>>>>>>>> d9cebde55539 rcu-tasks: Use order_base_2() instead of ilog2()
>>>>>>>> 95606f1248f5 rcu-tasks: Set ->percpu_enqueue_shift to zero upon contention
>>>>>
>>>>>
>>>>> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
>>>>> index 65d6e21a607a..141e2b4c70cc 100644
>>>>> --- a/kernel/rcu/tasks.h
>>>>> +++ b/kernel/rcu/tasks.h
>>>>> @@ -1640,10 +1640,10 @@ static int __init rcu_spawn_tasks_trace_kthread(void)
>>>>>                  rcu_tasks_trace.gp_sleep = HZ / 10;
>>>>>                  rcu_tasks_trace.init_fract = HZ / 10;
>>>>>          } else {
>>>>> -                rcu_tasks_trace.gp_sleep = HZ / 200;
>>>>> +                rcu_tasks_trace.gp_sleep = HZ / 100;
>>>>>                  if (rcu_tasks_trace.gp_sleep <= 0)
>>>>>                          rcu_tasks_trace.gp_sleep = 1;
>>>>> -                rcu_tasks_trace.init_fract = HZ / 200;
>>>>> +                rcu_tasks_trace.init_fract = HZ / 100;
>>>>>                  if (rcu_tasks_trace.init_fract <= 0)
>>>>>                          rcu_tasks_trace.init_fract = 1;
>>>>>          }
>>>>
>>>> It seems that if the scan time is > 50ms in some common cases (at
>>>> least at Google scale),
>>>> the claim of having a latency of 10ms is not reasonable.
>>>
>>> But does the above patch make things better? If it does, I will send
>>> you a proper patch with kernel boot parameters. We can then discuss
>>> better autotuning, for example, making the defaults a function of the
>>> number of CPUs.
>>>
>>> Either way, that certainly is a fair point. Another fair point is that
>>> the offending commit was in response to a bug report from your colleagues. ;-)
>>>
>>> Except that I don't see any uses of synchronize_rcu_tasks_trace(), so
>>> I am at a loss as to why latency matters anymore.
>>>
>>> Is the issue the overall CPU consumption of the scan (which is my
>>> current guess) or the length of time that the scan runs without invoking
>>> cond_resched() or similar?
>>>
I agree on this part. Depending on the results of increasing the sleep
time for the trace kthread to 10 ms, if scanning all of the threads is
what is holding the CPU, we could try a cond_resched()-based lock break
in that scan to isolate the issue; a rough sketch follows below. I also
checked the other commits in this code path and don't see anything
obvious impacting this, but I will check more on this.
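Roughly, what I have in mind is something like this (hypothetical and
untested; it reuses the rtp_tasklist_lock_break() helper sketched earlier
in this mail, and the break interval of 1024 tasks is arbitrary):

	/*
	 * Hypothetical illustration only; not the real rcu_tasks_wait_gp()
	 * code.  Restarting from the top assumes rtp->pertask_func() can
	 * tolerate seeing the same task more than once.
	 */
	static void rtp_scan_tasklist(struct rcu_tasks *rtp,
				      struct list_head *holdouts)
	{
		struct task_struct *g, *t;
		int scanned = 0;

	repeat:
		rcu_read_lock();
		for_each_process_thread(g, t) {
			rtp->pertask_func(t, holdouts);
			if (!(++scanned % 1024) &&
			    !rtp_tasklist_lock_break(g, t)) {
				rcu_read_unlock();
				goto repeat;	/* a pinned task went away */
			}
		}
		rcu_read_unlock();
	}

The restart path here is the same one Paul describes above, so it
inherits the same never-completing-scan concern on systems where tasks
come and go very frequently.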
Thanks
Neeraj
>>> Either way, how frequently is call_rcu_tasks_trace() being invoked in
>>> your setup? If it is being invoked frequently, increasing delays would
>>> allow multiple call_rcu_tasks_trace() instances to be served by a single
>>> tasklist scan.
>>>
>>>> Given that, I do not think bpf_sk_storage_free() can/should use
>>>> call_rcu_tasks_trace(); we probably will have to fix this soon (or
>>>> revert from our kernels).
>>>
>>> Well, you are in luck!!! This commit added call_rcu_tasks_trace() to
>>> bpf_selem_unlink_storage_nolock(), which is invoked in a loop by
>>> bpf_sk_storage_free():
>>>
>>> 0fe4b381a59e ("bpf: Allow bpf_local_storage to be used by sleepable programs")
>>>
>>> This commit was authored by KP Singh, who I am adding on CC. Or I would
>>> have, except that you beat me to it. Good show!!! ;-)
>>>
>>> If this commit provoked this issue, then the above patch might help.
>
> Another question... Were there actually any sleepable BPF
> programs running on this system at that time? If not, then maybe
> bpf_selem_unlink_storage_nolock() should check for that condition, and
> invoke call_rcu_tasks_trace() only if there are actually sleepable BPF
> programs running (or in the process of being removed). If I understand
> correctly (ha!), if there were no sleepable BPF programs running, you
> would instead want call_rcu(). (Here I am assuming that non-sleepable
> BPF programs still run within a normal RCU read-side critical section.)
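If that turns out to be the direction, a minimal sketch of the idea
might be something like the following (hypothetical;
sleepable_bpf_possibly_active() is a made-up predicate, and the other
names only approximate the bpf_local_storage code):

	/*
	 * Hypothetical sketch: use the heavier Tasks Trace grace period only
	 * when a sleepable BPF program might still be running; otherwise a
	 * vanilla RCU grace period suffices, since non-sleepable programs
	 * run inside a normal RCU read-side critical section.
	 */
	static void selem_free_after_gp(struct bpf_local_storage_elem *selem)
	{
		if (sleepable_bpf_possibly_active())
			call_rcu_tasks_trace(&selem->rcu, selem_free_rcu_cb);
		else
			call_rcu(&selem->rcu, selem_free_rcu_cb);
	}
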
>
> Thanx, Paul
>
>>> I am also adding Neeraj Upadhyay on CC for his thoughts, as he has
>>> also been through this code.
>>>
>>> My response time may be a bit slow next week, but I will be checking
>>> email.
>>>
>>> Thanx, Paul