Message-ID: <20130510002203.GA2394@somewhere>
Date:	Fri, 10 May 2013 02:22:06 +0200
From:	Frederic Weisbecker <fweisbec@...il.com>
To:	Steven Rostedt <rostedt@...dmis.org>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	"Paul E. McKenney" <paulmck@...ibm.com>,
	Ingo Molnar <mingo@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	"H. Peter Anvin" <hpa@...ux.intel.com>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH] x86/sched/dynamic-ticks: Call new schedule_irq_disable()
 for scheduling in entry_64.S

On Wed, May 08, 2013 at 09:21:20PM -0400, Steven Rostedt wrote:
> I started testing the new NOHZ_FULL in the kernel and had some issues,
> so I started function tracing and this bug report came out:
> 
> 
> [23446.458073] ------------[ cut here ]------------
> [23446.461028] WARNING:
> at /home/rostedt/work/git/linux-trace.git/kernel/rcutree.c:388
> rcu_eqs_enter+0x4b/0x89()
> [23446.466096] Modules linked in: ebtables ipt_MASQUERADE sunrpc bridge
> stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
> ip6_tables ipv6 uinput snd_hda_codec_id
> t snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm
> kvm_intel snd_timer snd kvm soundcore shpchp snd_page_alloc microcode
> i2c_i801 pata_acpi firewire_ohci firewi
> re_core crc_itu_t ata_generic i915 drm_kms_helper drm i2c_algo_bit
> i2c_core video
> [23446.466096] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.9.0-test+ #11
> [23446.466096] Hardware name: To Be Filled By O.E.M. To Be Filled By
> O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
> [23446.466096]  ffffffff814879f0 ffffffff81004fa6 ffffffff810337af
> ffff88007d400000
> [23446.466096]  0000000000000000 0000000000000000 ffff88007d40d900
> ffffffff81a01f00
> [23446.466096]  ffffffff81acba00 ffff88007d6495c0 ffffffff810a1f36
> 0000000000000a28
> [23446.466096] Call Trace:
> [23446.466096]  [<ffffffff814879f0>] ? dump_stack+0xd/0x17
> [23446.466096]  [<ffffffff81004fa6>] ? show_stack+0x5/0x3e
> [23446.466096]  [<ffffffff810337af>] ? warn_slowpath_common+0x64/0x7c
> [23446.466096]  [<ffffffff810a1f36>] ? rcu_eqs_enter+0x4b/0x89
> [23446.466096]  [<ffffffff810a1f8e>] ? rcu_idle_enter+0x1a/0x30
> [23446.466096]  [<ffffffff81491bc5>] ? ftrace_graph_caller+0x85/0x85
> [23446.466096]  [<ffffffff810707f3>] cpu_startup_entry+0xb3/0x11c
> [23446.466096]  [<ffffffff81ae9d2e>] ? start_kernel+0x41a/0x425
> [23446.466096]  [<ffffffff81ae972d>] ? repair_env_string+0x54/0x54
> [23446.466096] ---[ end trace ddbb69ae2a0f6687 ]---
> [23446.466096] ------------[ cut here ]------------
> 
> This is caused by an imbalance of rcu_eqs_enter() and rcu_eqs_exit().
> Adding more tracing, I also discovered that there seemed to be an
> imbalance in user_exit() and user_enter(). Specifically, I found that
> user_enter() was called, and then a migration happened and user_exit()
> was never called before the schedule. Then I noticed this in my
> trace:
> 
> run-isol-11546   1.N.. 37363.637641: function:             schedule_user <-- retint_careful
> run-isol-11546   1.N.. 37363.637641: function:             __schedule <-- preempt_schedule
> 
> The funny part here is that schedule_user does not call
> preempt_schedule. But then I also noticed the flags of the call to
> schedule_user(): "1.N..". This shows that it was running on cpu 1 with
> NEED_RESCHED set (N), but both interrupts and preemption are enabled. If
> an interrupt came in here we can schedule out again, but that would call
> preempt_schedule_irq, which has code to check if it's in "user context
> (according to dynamic_ticks)" and would do the user_exit() too. But
> something didn't seem right here.
> 
> Then I realized that it was actually the function tracer itself, as for
> every function it traces, it calls preempt_disable() and
> preempt_enable() to record the data on a per_cpu buffer. That
> preempt_enable() noticed the NEED_RESCHED set and since preemption is
> enabled, it kindly called preempt_schedule() for us!
> 
> All this before schedule_user() was able to call user_exit() and take us
> out of dynamic tick user context.
> 
> I then looked at the code in entry_64.S, and noticed that the calls to
> SCHEDULE_USER were all encompassed with ENABLE_INTERRUPTS and
> DISABLE_INTERRUPTS before calling schedule. And the SCHEDULE_USER macro
> is different if we have CONTEXT_TRACKING enabled or not.
> 
> I created a new schedule_irq_disabled() which is similar to
> schedule_preempt_disabled() and works like preempt_schedule_irq(), where
> it includes the exception_enter() and exit() routines that take care of
> the accounting of dynamic ticks when interrupting user mode or not. It
> also takes care of it before enabling interrupts, so we don't need to
> worry about any preemption happening at the wrong time.
> 
> Now the code always calls schedule_irq_disabled() whether or not
> CONTEXT_TRACKING is configured, which cleans up entry_64.S and removes
> the need of the context_tracking.h header file in asm.
> 
> Signed-off-by: Steven Rostedt <rostedt@...dmis.org>
> 
> Index: linux-trace.git/arch/x86/include/asm/context_tracking.h
> ===================================================================
> --- linux-trace.git.orig/arch/x86/include/asm/context_tracking.h
> +++ /dev/null
> @@ -1,10 +0,0 @@
> -#ifndef _ASM_X86_CONTEXT_TRACKING_H
> -#define _ASM_X86_CONTEXT_TRACKING_H
> -
> -#ifdef CONFIG_CONTEXT_TRACKING
> -# define SCHEDULE_USER call schedule_user
> -#else
> -# define SCHEDULE_USER call schedule
> -#endif
> -
> -#endif
> Index: linux-trace.git/arch/x86/kernel/entry_64.S
> ===================================================================
> --- linux-trace.git.orig/arch/x86/kernel/entry_64.S
> +++ linux-trace.git/arch/x86/kernel/entry_64.S
> @@ -56,7 +56,6 @@
>  #include <asm/ftrace.h>
>  #include <asm/percpu.h>
>  #include <asm/asm.h>
> -#include <asm/context_tracking.h>
>  #include <asm/smap.h>
>  #include <linux/err.h>
>  
> @@ -654,6 +653,7 @@ sysret_check:
>  	LOCKDEP_SYS_EXIT
>  	DISABLE_INTERRUPTS(CLBR_NONE)
>  	TRACE_IRQS_OFF
> +sysret_check_irqsoff:
>  	movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
>  	andl %edi,%edx
>  	jnz  sysret_careful
> @@ -675,12 +675,10 @@ sysret_check:
>  sysret_careful:
>  	bt $TIF_NEED_RESCHED,%edx
>  	jnc sysret_signal
> -	TRACE_IRQS_ON
> -	ENABLE_INTERRUPTS(CLBR_NONE)
>  	pushq_cfi %rdi
> -	SCHEDULE_USER
> +	call schedule_irq_disabled
>  	popq_cfi %rdi
> -	jmp sysret_check
> +	jmp sysret_check_irqsoff

Don't we still want to call LOCKDEP_SYS_EXIT? The previously
scheduled task might have taken some locks.

>  
>  	/* Handle a signal */
>  sysret_signal:
> @@ -788,13 +786,9 @@ GLOBAL(int_with_check)
>  int_careful:
>  	bt $TIF_NEED_RESCHED,%edx
>  	jnc  int_very_careful
> -	TRACE_IRQS_ON
> -	ENABLE_INTERRUPTS(CLBR_NONE)
>  	pushq_cfi %rdi
> -	SCHEDULE_USER
> +	call schedule_irq_disabled
>  	popq_cfi %rdi
> -	DISABLE_INTERRUPTS(CLBR_NONE)
> -	TRACE_IRQS_OFF
>  	jmp int_with_check
>  
>  	/* handle signals and tracing -- both require a full stack frame */
> @@ -1088,14 +1082,10 @@ retint_careful:
>  	CFI_RESTORE_STATE
>  	bt    $TIF_NEED_RESCHED,%edx
>  	jnc   retint_signal
> -	TRACE_IRQS_ON
> -	ENABLE_INTERRUPTS(CLBR_NONE)
>  	pushq_cfi %rdi
> -	SCHEDULE_USER
> +	call schedule_irq_disabled
>  	popq_cfi %rdi
>  	GET_THREAD_INFO(%rcx)
> -	DISABLE_INTERRUPTS(CLBR_NONE)
> -	TRACE_IRQS_OFF
>  	jmp retint_check
>  
>  retint_signal:
> @@ -1534,11 +1524,7 @@ paranoid_userspace:
>  	TRACE_IRQS_OFF
>  	jmp paranoid_userspace
>  paranoid_schedule:
> -	TRACE_IRQS_ON
> -	ENABLE_INTERRUPTS(CLBR_ANY)
> -	SCHEDULE_USER
> -	DISABLE_INTERRUPTS(CLBR_ANY)
> -	TRACE_IRQS_OFF
> +	call schedule_irq_disabled
>  	jmp paranoid_userspace
>  	CFI_ENDPROC
>  END(paranoid_exit)
> Index: linux-trace.git/kernel/sched/core.c
> ===================================================================
> --- linux-trace.git.orig/kernel/sched/core.c
> +++ linux-trace.git/kernel/sched/core.c
> @@ -3034,21 +3034,6 @@ asmlinkage void __sched schedule(void)
>  }
>  EXPORT_SYMBOL(schedule);
>  
> -#ifdef CONFIG_CONTEXT_TRACKING
> -asmlinkage void __sched schedule_user(void)
> -{
> -	/*
> -	 * If we come here after a random call to set_need_resched(),
> -	 * or we have been woken up remotely but the IPI has not yet arrived,
> -	 * we haven't yet exited the RCU idle mode. Do it here manually until
> -	 * we find a better solution.
> -	 */
> -	user_exit();
> -	schedule();
> -	user_enter();
> -}
> -#endif
> -
>  /**
>   * schedule_preempt_disabled - called with preemption disabled
>   *
> @@ -3127,6 +3112,24 @@ asmlinkage void __sched preempt_schedule
>  
>  #endif /* CONFIG_PREEMPT */
>  
> +/*
> + * this is the entry point to schedule() from kernel irq off context.
> + * Note, that this is called and return with irqs disabled. This will
> + * protect us against recursive calling from irq.
> + */
> +asmlinkage void __sched schedule_irq_disabled(void)
> +{
> +	enum ctx_state prev_state;
> +
> +	prev_state = exception_enter();

So why have you turned the user_exit()/user_enter() pair into
exception_enter()/exception_exit()? Is it to make this work from
anywhere, and not just on preemption at user-resume time?

> +
> +	local_irq_enable();
> +	__schedule();

Are you sure it's fine to call __schedule() instead of schedule()?
I don't know much about the blk flush work handled there, but it may
be worth considering.

> +	local_irq_disable();
> +
> +	exception_exit(prev_state);
> +}

The patch looks pretty good otherwise.

Thanks.

> +
>  int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
>  			  void *key)
>  {
> 
> 
