Date: Sun, 30 Jun 2024 21:09:18 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Mike Galbraith <efault@....de>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Tim Chen <tim.c.chen@...el.com>, Yujie Liu
	<yujie.liu@...el.com>, K Prateek Nayak <kprateek.nayak@....com>, "Gautham R .
 Shenoy" <gautham.shenoy@....com>, Chen Yu <yu.chen.surf@...il.com>,
	<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched/fair: Record the average duration of a task

Hi Mike,

Thanks for your time and for the insights.

On 2024-06-26 at 06:21:43 +0200, Mike Galbraith wrote:
> On Tue, 2024-06-25 at 15:22 +0800, Chen Yu wrote:
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 0935f9d4bb7b..7399c4143528 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> >         p->migration_pending = NULL;
> >  #endif
> >         init_sched_mm_cid(p);
> > +       p->prev_sleep_sum_runtime = 0;
> > +       p->duration_avg = 0;
> >  }
> 
> Beginning life biased toward stacking?
>

OK, so I should change short_task() to skip a duration_avg of 0, to avoid
task stacking at the beginning of a task's life.
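Something along these lines, as a minimal sketch only (the predicate shape
and the threshold are placeholders for illustration, not necessarily what
the final version will use):

/*
 * Sketch only: treat duration_avg == 0 as "no history yet", so a newly
 * forked task is never classified as short. The threshold below is
 * illustrative.
 */
static inline bool short_task(struct task_struct *p)
{
	return p->duration_avg &&
	       p->duration_avg < sysctl_sched_migration_cost;
}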
   
> >  DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 41b58387023d..445877069fbf 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> >
> > @@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  dequeue_throttle:
> >         util_est_update(&rq->cfs, p, task_sleep);
> > +       if (task_sleep)
> > +               dur_avg_update(p);
> > +
> >         hrtick_update(rq);
> >  }
> >
> 
> That qualifier looks a bit dangerous.  Microbench components tend to
> have only one behavior, but the real world goes through all kinds of
> nutty gyrations, intentional and otherwise.
>

Understood. Unfortunately I don't have access to a production environment,
so I have to rely on microbenchmarks and an OLTP benchmark to check the
results. I got feedback from Abel that a previous version of this patch
brought benefits to short-duration tasks like Redis in a production
environment [1].

[1] https://lore.kernel.org/lkml/36ba3b68-5b73-9db0-2247-061627b0d95a@bytedance.com/

I can launch a combination of microbenchmarks in parallel to check the impact.

> The heuristics in the next patch seem to exhibit a healthy level of
> paranoia, but these bits could perhaps use a tad more.  Bad experiences
> springs to mind when I stare at that - sleepers going hog, hogs meet
> sleeping lock contention, preemption, sync hint not meaning much...
>

I see. If I understand correctly, the scenarios mentioned above could
produce a false positive for 'short task', which would cause task
stacking.

If the sleeper task:
1. is preempted frequently: this should not be a problem, because the
   task duration is unlikely to be shortened by preemption, so
   short_task() is unlikely to return true. The duration spans the time
   from when the task is scheduled in to when it is finally scheduled
   out because it sleeps (dequeue_task_fair()), not because it is
   preempted. That span should be long enough not to trigger the
   'short task' case. However, since a delayed-dequeue mechanism is
   under development, calculating the duration in dequeue_task_fair(),
   as the current patch does, might no longer be appropriate.

2. meets sleeping lock contention: this would be a false positive
   'short task', which brings unexpected task stacking. A sketch
   illustrating both cases follows this list.
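To make the two cases concrete, here are hypothetical timelines. The
duration is the se.sum_exec_runtime delta between two sleeps (as in the
patch below), so preemption does not reset it; the helper name here is
made up for illustration:

/*
 * 1. run 2ms, preempted, run 3ms more, then sleep:
 *    duration = 2ms + 3ms = 5ms, so short_task() stays false.
 *
 * 2. run 100us, then block on a contended sleeping lock (dequeued):
 *    duration = 100us, so short_task() returns true: a false positive.
 */
static u64 latest_run_duration(struct task_struct *p)
{
	return p->se.sum_exec_runtime - p->prev_sleep_sum_runtime;
}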

So, considering 1 and 2, I'm thinking of moving the calculation of the
task duration from dequeue_task_fair() to wait_woken(). The reason to
update the task's duration in wait_woken() rather than in
dequeue_task_fair() is that the former knows the task is waiting for a
real resource, rather than blocking on some random sleeping lock. And
wait_woken() is widely used by drivers to indicate that a task is
waiting for a resource. For example, the netperf call trace:

    schedule_timeout+222
    wait_woken+84
    sk_wait_data+378
    tcp_recvmsg_locked+507
    tcp_recvmsg+115
    inet_recvmsg+90
    sock_recvmsg+150

In the future, other scenarios could also invoke the newly introduced
update_curr_duration() when needed. For example, pipe_read() could use
it when the task is going to sleep because the pipe buffer is empty;
see the sketch right below.
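A rough sketch of that pipe_read() idea (illustrative only: the exact
placement inside fs/pipe.c is hypothetical, and error handling is
elided):

	/* Inside pipe_read()'s read loop, when no data is available: */
	update_curr_duration();	/* about to sleep waiting for data */
	if (wait_event_interruptible_exclusive(pipe->rd_wait,
					       pipe_readable(pipe)) < 0)
		return -ERESTARTSYS;

With that in mind, I changed the code as below; may I have your
suggestion on this?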

thanks,
Chenyu

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..78747d3954fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
 	struct callback_head		cid_work;
 #endif
 
+	u64				prev_sleep_sum_runtime;
+	u64				duration_avg;
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..7399c4143528 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	p->prev_sleep_sum_runtime = 0;
+	p->duration_avg = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..bbeba36d0145 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -744,6 +744,23 @@ int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	return vruntime_eligible(cfs_rq, se->vruntime);
 }
 
+/*
+ * Record the duration of the current task's latest run: the delta of
+ * se.sum_exec_runtime since the task last went to sleep.
+ */
+void update_curr_duration(void)
+{
+	struct sched_entity *curr = &current->se;
+	unsigned long flags;
+	u64 dur;
+
+	local_irq_save(flags);
+	dur = curr->sum_exec_runtime - current->prev_sleep_sum_runtime;
+	current->prev_sleep_sum_runtime = curr->sum_exec_runtime;
+	update_avg(&current->duration_avg, dur);
+	local_irq_restore(flags);
+}
+
 static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 {
 	u64 min_vruntime = cfs_rq->min_vruntime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62fd8bc6fd08..7beb604ca76b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3574,6 +3574,7 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
 
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+extern void update_curr_duration(void);
 
 #ifdef CONFIG_RT_MUTEXES
 
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..a0004cc7454f 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -419,8 +419,10 @@ long wait_woken(struct wait_queue_entry *wq_entry, unsigned mode, long timeout)
 	 * or woken_wake_function() sees our store to current->state.
 	 */
 	set_current_state(mode); /* A */
-	if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park())
+	if (!(wq_entry->flags & WQ_FLAG_WOKEN) && !kthread_should_stop_or_park()) {
+		update_curr_duration();
 		timeout = schedule_timeout(timeout);
+	}
 	__set_current_state(TASK_RUNNING);
 
 	/*
-- 
2.25.1
