linux-kernel - Re: [GIT PULL] cputime: Full dynticks task/cputime accounting v7

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAFTL4hwx+e3X74Lzi-QTsEwkKicRo7EeFspza5ZNeeboSLt8KQ@mail.gmail.com>
Date:	Tue, 29 Jan 2013 17:55:39 +0100
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	Ingo Molnar <mingo@...nel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Li Zhong <zhong@...ux.vnet.ibm.com>,
	Namhyung Kim <namhyung.kim@....com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Paul Gortmaker <paul.gortmaker@...driver.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	Sedat Dilek <sedat.dilek@...il.com>,
	Gleb Natapov <gleb@...hat.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Tony Luck <tony.luck@...el.com>,
	Fenghua Yu <fenghua.yu@...el.com>
Subject: Re: [GIT PULL] cputime: Full dynticks task/cputime accounting v7

2013/1/29 Steven Rostedt <rostedt@...dmis.org>:
> On Tue, 2013-01-29 at 16:12 +0100, Frederic Weisbecker wrote:
>
>> I have set some cputime tracepoint and looked at the stacktrace. They
>> are sorted by weight, which is the amount of cputime accounted. I
>> can't find anything wrong on these. See attachment.
>
> That's because a histogram doesn't show the issue. You need to see the
> order of events. That is, you need to see that user time is calculated
> again within a system call. I ran my trace again with stack trace
> enabled and it shows the race nicely:
>
> kernelsp-2367    7....  7088.483227: funcgraph_entry:                   |  syscall_trace_leave() {
> kernelsp-2367    7d..1 7088.483228: bprint:               account_system_time : system: 17887
> kernelsp-2367    7d..1  7088.483242: kernel_stack:         <stack trace>
> => __vtime_account_system (ffffffff810646e5)
> => vtime_user_enter (ffffffff81064a5e)
> => user_enter (ffffffff810d8b2b)
> => syscall_trace_leave (ffffffff8100d03f)
> => return_to_handler (ffffffff814c8dc5)
> => int_check_syscall_exit_work (ffffffff814c925f)
>
> syscall_trace_leave() was called, we are leaving a system call and going
> back to userspace. We calculate the system time and set the state to
> IN_USER by user_enter().
>
>
> kernelsp-2367    7d..1  7088.483244: bprint:               account_user_time : user: 15753
> kernelsp-2367    7d..1  7088.483262: kernel_stack:         <stack trace>
> => vtime_account_user (ffffffff81064a12)
> => vtime_account_irq_enter (ffffffff81064b40)
> => irq_enter (ffffffff8103d36e)
> => smp_apic_timer_interrupt (ffffffff814caa83)
> => apic_timer_interrupt (ffffffff814c9bb2)
> => user_enter (ffffffff810d8b58)
> => syscall_trace_leave (ffffffff8100d03f)
> => return_to_handler (ffffffff814c8dc5)
> => int_check_syscall_exit_work (ffffffff814c925f)
>
>
> But before we got to user space, we took an interrupt (smp_apic_timer_interrupt).
>
> Now the vtime_account_irq_enter() sees the IN_USER set, and accounts the
> time back to userspace. That was 16us it calculated incorrectly.

Yeah I know. There will always be this small window between the time
we call user_enter() and the time we actually go to userspace. This
can't be helped much, except perhaps by checking
user_mode(get_irq_regs()) may be. But the time accounted is really
just micro chunk, nothing to worry about. It barely shows up in the
histogram.

>
>
> kernelsp-2367    7d.h1  7088.483267: bprint:               account_system_time : system: 23401
> kernelsp-2367    7d.h1  7088.483286: kernel_stack:         <stack trace>
> => __vtime_account_system (ffffffff810646e5)
> => vtime_account_irq_exit (ffffffff810649b8)
> => irq_exit (ffffffff8103d39e)
> => smp_apic_timer_interrupt (ffffffff814caacf)
> => apic_timer_interrupt (ffffffff814c9bb2)
> => user_enter (ffffffff810d8b58)
> => syscall_trace_leave (ffffffff8100d03f)
> => return_to_handler (ffffffff814c8dc5)
> => int_check_syscall_exit_work (ffffffff814c925f)
>
> Now we are leaving the interrupt and since the state is still IN_USER,
> the system time is calculated.
>
> kernelsp-2367    7d..2  7088.483287: bprint:               account_user_time : user: 19865
> kernelsp-2367    7d..2  7088.483309: kernel_stack:         <stack trace>
> => vtime_account_user (ffffffff81064a12)
> => vtime_account_irq_enter (ffffffff81064b40)
> => __do_softirq (ffffffff8103d0f5)
> => call_softirq (ffffffff814ca2fc)
> => do_softirq (ffffffff81003fdc)
> => irq_exit (ffffffff8103d3df)
> => smp_apic_timer_interrupt (ffffffff814caacf)
> => apic_timer_interrupt (ffffffff814c9bb2)
> => user_enter (ffffffff810d8b58)
> => syscall_trace_leave (ffffffff8100d03f)
> => return_to_handler (ffffffff814c8dc5)
> => int_check_syscall_exit_work (ffffffff814c925f)
>
> But oops, we are now running a softirq, which happens to get everything
> wrong again, and calculates another 20us incorrectly towards user time.

Same thing here. I'm aware of that, this can certainly be fixed but
that's really no big issue, it's only about a few microseconds and
that too almost doesn't show up on the histogram.

> Perhaps the best thing to do is to have the irq vtime determine usermode
> or not with "user_mode(get_irq_regs())". I'll try it.

That will get rid of those few microsecond error margin but I bet
that's not where the issue is.

> -- Steve
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/