Date: Thu, 13 Jun 2024 12:04:41 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: John Stultz <jstultz@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Frederic Weisbecker <frederic@...nel.org>,
	Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Daniel Bristot de Oliveira <bristot@...hat.com>,
	Valentin Schneider <vschneid@...hat.com>,
	Qais Yousef <qyousef@...alina.io>,
	Joel Fernandes <joel@...lfernandes.org>, kernel-team@...roid.com
Subject: Re: [PATCH] RFC: sched: Rework task_sched_runtime to avoid calling
 update_rq_clock

On Wed, Jun 12, 2024 at 06:58:26PM -0700, John Stultz wrote:
> I recently got a bug report that
> clock_gettime(CLOCK_THREAD_CPUTIME_ID,...) had regressed between
> 5.10 and 6.1. It's not a huge regression in absolute time
> (~30-40ns), but it is a >10% change.
> 
> I narrowed the cause down to the addition of
> psi_account_irqtime() in update_rq_clock_task(), in commit
> 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
> pressure")
> 
> So that explains the behavior change, 

It doesn't really... that patch just feeds the irq_time we *already*
subtracted prior to it, to PSI, such that PSI can also enjoy the 'view'.

The only explanation I have is that interrupts that end up in the
scheduler (wakeups, tick, etc.) now get to do that PSI cgroup iteration,
and that cost adds to the IRQ time itself, and as such the task time
slows down accordingly.

Are you using silly deep cgroup hierarchies?

> but it also seems odd that
> we're doing psi irq accounting from a syscall that is just
> trying to read the thread's cputime.

In order to avoid doing all the accounting in the IRQ entry/exit paths,
those paths only do the bare minimum of tracking how much IRQ time there
is.

update_rq_clock_task() then looks at the increase of IRQ time and
subtracts this from the task time -- after all, all time spent in the
IRQ wasn't spent on the task itself.

It did that prior to the PSI patch, and it does so after. The only
change is it now feeds this delta into the PSI thing.
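
As a rough illustration of that subtraction, here is a simplified userspace
model. It is a sketch, not the kernel code: struct model_rq and
model_update_clock_task() are made-up names, and the cumulative IRQ time is
passed in as a parameter rather than read per CPU.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two struct rq fields that matter here. */
struct model_rq {
	uint64_t clock_task;    /* time credited to tasks */
	uint64_t prev_irq_time; /* IRQ time already subtracted so far */
};

/*
 * delta:     elapsed ns since the last update
 * irq_total: cumulative IRQ ns, i.e. the bare-minimum number the IRQ
 *            entry/exit paths keep track of
 */
static void model_update_clock_task(struct model_rq *rq, int64_t delta,
				    uint64_t irq_total)
{
	int64_t irq_delta = (int64_t)(irq_total - rq->prev_irq_time);

	/* Never subtract more IRQ time than has actually elapsed. */
	if (irq_delta > delta)
		irq_delta = delta;

	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;

	/* Only the non-IRQ remainder is charged to the task clock. */
	rq->clock_task += delta;
}
```

In this model, an update covering 1000ns of which 300ns were IRQ time
advances clock_task by 700ns; any IRQ time beyond the elapsed delta is left
for the next update.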

> NOTE: I'm not 100% sure this is correct yet. There may be some
> edge cases I've overlooked, so I'd greatly appreciate any
> review or feedback.

Urgh, you're sprinkling the details of what is clock_task over multiple
places.

Does something like so work?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..d4b87539d72a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -724,7 +724,6 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
-	psi_account_irqtime(rq->curr, irq_delta);
 	delayacct_irq(rq->curr, irq_delta);
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
@@ -5459,6 +5458,8 @@ void sched_tick(void)
 
 	sched_clock_tick();
 
+	psi_account_irqtime(curr, &rq->psi_irq_time);
+
 	rq_lock(rq, &rf);
 
 	update_rq_clock(rq);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 146baa91d104..57fdb0b9efbd 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -991,12 +991,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct task_struct *task, u32 delta)
+void psi_account_irqtime(struct task_struct *task, u64 *prev)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
-	u64 now;
+	u64 now, irq;
+	s64 delta;
 
 	if (static_branch_likely(&psi_disabled))
 		return;
@@ -1005,6 +1006,11 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
 		return;
 
 	now = cpu_clock(cpu);
+	irq = irq_time_read(cpu);
+	delta = (s64)(irq - *prev);
+	if (delta < 0)
+		return;
+	*prev = irq;
 
 	group = task_psi_group(task);
 	do {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62fd8bc6fd08..a63eb546bc4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1133,6 +1133,7 @@ struct rq {
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	u64			prev_irq_time;
+	u64			psi_irq_time;
 #endif
 #ifdef CONFIG_PARAVIRT
 	u64			prev_steal_time;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index d1445410840a..1111f060264f 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -110,7 +110,7 @@ __schedstats_from_se(struct sched_entity *se)
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
-void psi_account_irqtime(struct task_struct *task, u32 delta);
+void psi_account_irqtime(struct task_struct *task, u64 *prev);
 
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -192,7 +192,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
-static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
+static inline void psi_account_irqtime(struct task_struct *task, u64 *prev) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
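
The snapshot handling that the patch adds to psi_account_irqtime() can be
modelled in userspace roughly as below. This is a sketch under assumptions:
irq_delta_since() is a hypothetical helper, and the current IRQ time is
passed in as a plain argument instead of coming from irq_time_read().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the snapshot logic: compare the current
 * cumulative IRQ time against the caller's previous snapshot.
 *
 * Returns false, leaving *prev untouched, when the signed difference is
 * negative (the snapshot is ahead of the reading). Otherwise advances
 * *prev and reports the elapsed IRQ time through *out.
 */
static bool irq_delta_since(uint64_t irq_now, uint64_t *prev, uint64_t *out)
{
	/* Unsigned subtraction, then a signed view, tolerates wraparound. */
	int64_t delta = (int64_t)(irq_now - *prev);

	if (delta < 0)
		return false;

	*prev = irq_now;
	*out = (uint64_t)delta;
	return true;
}
```

Keeping a separate snapshot for PSI, as psi_irq_time does in the patch,
means the delta fed to PSI no longer depends on when update_rq_clock_task()
last ran.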
