Date: Thu, 23 Apr 2015 21:57:13 -0400
From: Rik van Riel <riel@...riel.com>
To: linux-kernel@...r.kernel.org
Cc: Andy Lutomirsky <amluto@...capital.com>,
	Frederic Weisbecker <fweisbec@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Heiko Carstens <heiko.carstens@...ibm.com>,
	Luiz Capitulino <lcapitulino@...hat.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Clark Williams <williams@...hat.com>
Subject: [PATCH] context_tracking: remove local_irq_save from __acct_update_integrals

The function __acct_update_integrals() is called both from irq context
and task context. This creates a race where irq context can advance
tsk->acct_timexpd to a value larger than time, leading to a negative
value, which causes a divide error. See commit 6d5b5acca9e5 ("Fix
fixpoint divide exception in acct_update_integrals").

In 2012, __acct_update_integrals() was changed to take utime and stime
as function parameters. This re-introduced the bug, because an irq can
hit in between the call to task_cputime() and the point where irqs
actually get disabled.

However, this race condition was originally reproduced on Hercules, and
I have not seen any reports of it re-occurring since it was
re-introduced 3 years ago.

On the other hand, the irq disabling and re-enabling, which no longer
even protects us against the race today, shows up prominently in the
perf profile of a program that makes a very large number of system calls
in a short period of time, when nohz_full= (and context tracking) is
enabled.

This patch replaces the (now ineffective) irq blocking with a cheaper
test for the race condition, and speeds up my microbenchmark with
10 million iterations:

		run time	system time
vanilla		5.49s		2.08s
patch		5.21s		1.92s

The above shows a reduction in system time of about 7%. The standard
deviation is mostly in the third digit after the decimal point.
Cc: Andy Lutomirsky <amluto@...capital.com>
Cc: Frederic Weisbecker <fweisbec@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Heiko Carstens <heiko.carstens@...ibm.com>
Cc: Luiz Capitulino <lcapitulino@...hat.com>
Cc: Marcelo Tosatti <mtosatti@...hat.com>
Cc: Clark Williams <williams@...hat.com>
Signed-off-by: Rik van Riel <riel@...hat.com>
---
 kernel/tsacct.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..0b967f116a6b 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -126,23 +126,29 @@ static void __acct_update_integrals(struct task_struct *tsk,
 	if (likely(tsk->mm)) {
 		cputime_t time, dtime;
 		struct timeval value;
-		unsigned long flags;
 		u64 delta;
 
-		local_irq_save(flags);
 		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
+		/*
+		 * This code is called both from irq context and from
+		 * task context. There is a race where irq context advances
+		 * tsk->acct_timexpd to a value larger than time, creating
+		 * a negative value. In that case, the irq has already
+		 * updated the statistics.
+		 */
+		if (unlikely((signed long)dtime <= 0))
+			return;
+
 		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
 		delta = value.tv_sec;
 		delta = delta * USEC_PER_SEC + value.tv_usec;
 
 		if (delta == 0)
-			goto out;
+			return;
 		tsk->acct_timexpd = time;
 		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
 		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
-	out:
-		local_irq_restore(flags);
 	}
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/