Message-ID: <20150423215713.3334ae6f@annuminas.surriel.com>
Date: Thu, 23 Apr 2015 21:57:13 -0400
From: Rik van Riel <riel@...riel.com>
To: linux-kernel@...r.kernel.org
Cc: Andy Lutomirsky <amluto@...capital.com>,
Frederic Weisbecker <fweisbec@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Heiko Carstens <heiko.carstens@...ibm.com>,
Luiz Capitulino <lcapitulino@...hat.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Clark Williams <williams@...hat.com>
Subject: [PATCH] context_tracking: remove local_irq_save from
__acct_update_integrals

The function __acct_update_integrals() is called both from irq context
and task context. This creates a race where irq context can advance
tsk->acct_timexpd to a value larger than time, making dtime negative,
which causes a divide error. See commit 6d5b5acca9e5
("Fix fixpoint divide exception in acct_update_integrals").

In 2012, __acct_update_integrals() was changed to get utime and stime
as function parameters. This re-introduced the bug, because an irq
can hit between the call to task_cputime() and the point where irqs
actually get disabled.
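The interleaving described above can be replayed in a single thread. The
following is only an illustrative model, not kernel code: struct task_model,
compute_dtime() and replay_race() are hypothetical names, the tick values are
made up, and cputime_t is assumed to behave like an unsigned long.

```c
/* Hypothetical stand-in for the few task_struct fields involved. */
struct task_model {
	unsigned long utime;
	unsigned long stime;
	unsigned long acct_timexpd;
};

/* The subtraction done by __acct_update_integrals(), on the model. */
static unsigned long compute_dtime(const struct task_model *tsk,
				   unsigned long utime, unsigned long stime)
{
	unsigned long time = stime + utime;

	return time - tsk->acct_timexpd;	/* wraps if acct_timexpd > time */
}

/*
 * Replays the race:
 *   1. task context samples utime/stime,
 *   2. an irq accounts more time and advances acct_timexpd,
 *   3. task context continues with its stale sample.
 * Returns the dtime that task context computes in step 3.
 */
static unsigned long replay_race(void)
{
	struct task_model tsk = { .utime = 60, .stime = 40, .acct_timexpd = 90 };

	/* Step 1: sample the times, as the caller does before the update. */
	unsigned long utime = tsk.utime;
	unsigned long stime = tsk.stime;

	/* Step 2: the irq charges 5 more ticks and records the new total. */
	tsk.stime += 5;
	tsk.acct_timexpd = tsk.utime + tsk.stime;	/* now 105 */

	/* Step 3: stale time is 100, so 100 - 105 wraps around. */
	return compute_dtime(&tsk, utime, stime);
}
```

In this model the value returned by replay_race() is a very large unsigned
number that reads as -5 when reinterpreted as a signed long, which is the
"negative value" the commit message refers to.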

However, this race condition was originally reproduced on Hercules,
and I have not seen any reports of it re-occurring since it was
re-introduced 3 years ago.

On the other hand, the irq disabling and re-enabling, which no longer
even protect us against the race today, show up prominently in the
perf profile of a program that makes a very large number of system calls
in a short period of time, when nohz_full= (and context tracking) is
enabled.

This patch replaces the (now ineffective) irq blocking with a cheaper
way to test for the race condition, and speeds up my microbenchmark
with 10 million iterations:

		run time	system time
vanilla		5.49s		2.08s
patch		5.21s		1.92s

The above shows a reduction in system time of about 7.7% (2.08s to
1.92s). The standard deviation is mostly in the third digit after
the decimal point.
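The cheaper test for the race amounts to reinterpreting the unsigned
difference as a signed value. A minimal standalone sketch, assuming cputime_t
behaves like an unsigned long and that the conversion to long is two's
complement (dtime_is_stale() is a hypothetical helper name, not something the
patch adds):

```c
#include <stdbool.h>

/*
 * Returns true when the update should be skipped because the difference
 * is zero or "negative", meaning acct_timexpd is already at or past
 * `time` and another context has done the accounting.
 */
static bool dtime_is_stale(unsigned long time, unsigned long acct_timexpd)
{
	unsigned long dtime = time - acct_timexpd;	/* wraps on underflow */

	return (long)dtime <= 0;
}
```

With these assumptions, dtime_is_stale(100, 105) and dtime_is_stale(100, 100)
are true, while dtime_is_stale(105, 100) is false. Because the comparison is
done on the difference rather than on the two operands, a legitimate wrap of
the time counter itself still yields a small positive dtime.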
Cc: Andy Lutomirsky <amluto@...capital.com>
Cc: Frederic Weisbecker <fweisbec@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Heiko Carstens <heiko.carstens@...ibm.com>
Cc: Luiz Capitulino <lcapitulino@...hat.com>
Cc: Marcelo Tosatti <mtosatti@...hat.com>
Cc: Clark Williams <williams@...hat.com>
Signed-off-by: Rik van Riel <riel@...hat.com>
---
kernel/tsacct.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..0b967f116a6b 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -126,23 +126,29 @@ static void __acct_update_integrals(struct task_struct *tsk,
 	if (likely(tsk->mm)) {
 		cputime_t time, dtime;
 		struct timeval value;
-		unsigned long flags;
 		u64 delta;

-		local_irq_save(flags);
 		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
+		/*
+		 * This code is called both from irq context and from
+		 * task context. There is a race where irq context advances
+		 * tsk->acct_timexpd to a value larger than time, creating
+		 * a negative value. In that case, the irq has already
+		 * updated the statistics.
+		 */
+		if (unlikely((signed long)dtime <= 0))
+			return;
+
 		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
 		delta = value.tv_sec;
 		delta = delta * USEC_PER_SEC + value.tv_usec;

 		if (delta == 0)
-			goto out;
+			return;
 		tsk->acct_timexpd = time;
 		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
 		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
-	out:
-		local_irq_restore(flags);
 	}
 }