Message-ID: <32f82969-420e-413a-99f9-b631ba894d20@gmail.com>
Date: Mon, 23 Jun 2025 14:34:55 +0800
From: zoucao <zoucaox@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, linux-kernel@...r.kernel.org,
Olice Zou <olicezou@...cent.com>
Subject: Re: [PATCH] sched/stats: TASK_IDLE task bypass the block_starts time
On 6/20/25 16:55, Peter Zijlstra wrote:
> On Fri, Jun 20, 2025 at 11:14:50AM +0800, Olice Zou wrote:
>> For a TASK_IDLE task, we should not record block_start; it is
>> not a real TASK_UNINTERRUPTIBLE task.
> Why, I mean it is still blocked, right?
Thank you for your reply.
I found this problem while running a test case with intense lock
contention: there is contention on thousands of rwsem/mutex locks,
and those are real blocked tasks. But on an idle machine the same
measurement also reports a lot of blocked kworker threads, even though
the machine is idle. A TASK_IDLE task is not a blocked task; it is more
like a sleeping task, as the following shows:
in kernel/workqueue.c, worker_thread():

static int worker_thread(void *__worker)
{
	...
sleep:
	/*
	 * pool->lock is held and there's no work to process and no need to
	 * manage, sleep.  Workers are woken up only while holding
	 * pool->lock or from local cpu, so setting the current state
	 * before releasing pool->lock is enough to prevent losing any
	 * event.
	 */
	worker_enter_idle(worker);
	__set_current_state(TASK_IDLE);	---> sets task->__state to TASK_IDLE,
					     which makes the task count as
					     blocked, although it behaves more
					     like a sleeping task
	raw_spin_unlock_irq(&pool->lock);
	schedule();
	goto woke_up;
}
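For reference, TASK_IDLE is defined in include/linux/sched.h as
TASK_UNINTERRUPTIBLE plus TASK_NOLOAD, which is why it currently
matches the TASK_UNINTERRUPTIBLE check in update_stats_dequeue_fair():

	/* include/linux/sched.h (values as in recent kernels) */
	#define TASK_UNINTERRUPTIBLE	0x00000002
	#define TASK_NOLOAD		0x00000400
	#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)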
The sched:sched_stat_blocked tracepoint is a good way to measure the
duration of lock contention, since it provides the blocked delta time.
With this patch applied, lock contention can be observed in an easy
way:
"
#!/usr/bin/env bpftrace

tracepoint:sched:sched_stat_blocked
{
	if (args->delay > 1000000) {
		/* remember the blocked delay (ns) for this pid */
		@sa[args->pid] = args->delay;
	}
}

kprobe:finish_task_switch
{
	/* the woken task is current here; dump where it resumes */
	if (@sa[tid]) {
		printf("%s %d\n", comm, @sa[tid]);
		print(kstack());
		delete(@sa[tid]);
	}
}
"
This catches the tasks blocked on locks, as follows:
dynamic_offline 8684678
finish_task_switch+1
schedule+108
schedule_timeout+567
wait_for_completion+149
__wait_rcu_gp+316
synchronize_rcu+237
rcu_sync_enter+92
percpu_down_write+41   ---> this task is really blocked, waiting on a percpu_rwsem
cgroup_procs_write_start+111
__cgroup1_procs_write.constprop.0+91
cgroup1_procs_write+23
cgroup_file_write+137
kernfs_fop_write_iter+304
vfs_write+618
ksys_write+107
__x64_sys_write+30
x64_sys_call+5679
do_syscall_64+55
entry_SYSCALL_64_after_hwframe+12
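For context, the delay reported by sched:sched_stat_blocked is
computed at wakeup time in __update_stats_enqueue_sleeper(); a
simplified sketch (my paraphrase of kernel/sched/stats.c, not the
exact code):

	u64 block_start = schedstat_val(stats->block_start);

	if (block_start) {
		u64 delta = rq_clock(rq) - block_start;

		__schedstat_set(stats->block_start, 0);
		__schedstat_add(stats->sum_sleep_runtime, delta);

		if (p->in_iowait) {
			/* iowait sleepers are accounted separately */
			__schedstat_add(stats->iowait_sum, delta);
			trace_sched_stat_iowait(p, delta);
		}
		trace_sched_stat_blocked(p, delta);
	}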
The block_start accounting is still useful for iowait tasks; only
TASK_IDLE should be excluded.
Or should we instead account TASK_IDLE tasks under the sleep statistics
of sched_statistics? A rough sketch of that alternative is below.
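Something like this (my own untested sketch, based on the hunk quoted
at the end, not part of the patch):

	/* hypothetical: treat TASK_IDLE as sleep, not as blocked */
	if ((state & TASK_INTERRUPTIBLE) || state == TASK_IDLE)
		__schedstat_set(tsk->stats.sleep_start,
				rq_clock(rq_of(cfs_rq)));
	if ((state & TASK_UNINTERRUPTIBLE) && state != TASK_IDLE)
		__schedstat_set(tsk->stats.block_start,
				rq_clock(rq_of(cfs_rq)));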
>> It is easy to find this problem on an idle machine as follows:
>>
>> bpftrace -e 'tracepoint:sched:sched_stat_blocked {
>>     if (args->delay > 1000000) {
>>         printf("%s %d\n", args->comm, args->delay);
>>         print(kstack());
>>     }
>> }'
>>
>> rcu_preempt 3881764
>> __update_stats_enqueue_sleeper+604
>> __update_stats_enqueue_sleeper+604
>> enqueue_entity+1014
>> enqueue_task_fair+156
>> activate_task+109
>> ttwu_do_activate+111
>> try_to_wake_up+615
>> wake_up_process+25
>> process_timeout+22
>> call_timer_fn+44
>> run_timer_softirq+1100
>> handle_softirqs+178
>> irq_exit_rcu+113
>> sysvec_apic_timer_interrupt+132
>> asm_sysvec_apic_timer_interrupt+31
>> pv_native_safe_halt+15
>> arch_cpu_idle+13
>> default_idle_call+48
>> do_idle+516
>> cpu_startup_entry+49
>> start_secondary+280
>> secondary_startup_64_no_verify+404
> Not sure what I'm looking at there. What is the problem?
Sorry, I left out the setup step:

echo 1 > /proc/sys/kernel/sched_schedstats

The sched_schedstats sysctl switch must be enabled first.
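(It can also be enabled at boot time with the schedstats=enable kernel
command-line parameter.)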
>> Signed-off-by: Olice Zou <olicezou@...cent.com>
>> ---
>> kernel/sched/fair.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index a85539df75a5..e473e3244dda 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1285,7 +1285,7 @@ update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int fl
>> if (state & TASK_INTERRUPTIBLE)
>> __schedstat_set(tsk->stats.sleep_start,
>> rq_clock(rq_of(cfs_rq)));
>> - if (state & TASK_UNINTERRUPTIBLE)
>> + if (state != TASK_IDLE && (state & TASK_UNINTERRUPTIBLE))
>> __schedstat_set(tsk->stats.block_start,
>> rq_clock(rq_of(cfs_rq)));
>> }
>> --
>> 2.25.1
>>