Open Source and information security mailing list archives
 
Message-Id: <20170412234859.GQ3956@linux.vnet.ibm.com>
Date:   Wed, 12 Apr 2017 16:48:59 -0700
From:   "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Ingo Molnar <mingo@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH v2] sched: Have do_idle() call __schedule() without
  enabling preemption

On Wed, Apr 12, 2017 at 02:27:44PM -0400, Steven Rostedt wrote:
> From: Steven Rostedt (VMware) <rostedt@...dmis.org>
> 
> I finally got around to creating trampolines for dynamically allocated
> ftrace_ops using synchronize_rcu_tasks(). For users of the ftrace
> function hook callbacks, like perf, that allocate the ftrace_ops
> descriptor via kmalloc() and friends, ftrace was not able to optimize
> the functions being traced to use a trampoline, because the trampoline
> would also need to be allocated dynamically. The problem is that such
> a trampoline cannot be freed when CONFIG_PREEMPT is set, as there's no
> way to tell if a task was preempted on the trampoline. That was before
> Paul McKenney implemented synchronize_rcu_tasks(), which makes sure
> all tasks (except idle) have scheduled out or have entered user space.
> 
> While testing this, I triggered this bug:
> 
>  BUG: unable to handle kernel paging request at ffffffffa0230077
>  IP: 0xffffffffa0230077
>  PGD 2414067 
>  PUD 2415063 
>  PMD c463c067 
>  PTE 0
> 
>  Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
>  Modules linked in: ip6table_filter ip6_tables snd_hda_codec_hdmi
>  snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel
>  snd_hda_codec snd_hwdep snd_hda_core x86_pkg_temp_thermal kvm_intel
>  i915 snd_seq kvm snd_seq_device snd_pcm i2c_algo_bit snd_timer
>  drm_kms_helper irqbypass syscopyarea sysfillrect sysimgblt fb_sys_fops
>  drm i2c_i801 snd soundcore wmi i2c_core video e1000e ptp pps_core
>  CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc3-test+ #356
>  Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012
>  task: ffff8800cebbd0c0 task.stack: ffff8800cebd8000
>  RIP: 0010:0xffffffffa0230077
>  RSP: 0018:ffff8800cebdfd80 EFLAGS: 00010286
>  RAX: 0000000000000000 RBX: ffff8800cebbd0c0 RCX: ffffffff8121391e
>  RDX: dffffc0000000000 RSI: ffffffff81ce7c78 RDI: ffff8800cebbf298
>  RBP: ffff8800cebdfe28 R08: 0000000000000003 R09: 0000000000000000
>  R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800cebbd0c0
>  R13: ffffffff828fbe90 R14: ffff8800d392c480 R15: ffffffff826cc380
>  FS:  0000000000000000(0000) GS:ffff8800d3900000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: ffffffffa0230077 CR3: 0000000002413000 CR4: 00000000001406e0
>  Call Trace:
>   ? debug_smp_processor_id+0x17/0x20
>   ? sched_ttwu_pending+0x79/0x190
>   ? schedule+0x5/0xe0
>   ? trace_hardirqs_on_caller+0x182/0x280
>   schedule+0x5/0xe0
>   schedule_preempt_disabled+0x18/0x30
>   ? schedule+0x5/0xe0
>   ? schedule_preempt_disabled+0x18/0x30
>   do_idle+0x172/0x220
> 
> What happened was that the idle task was preempted on the trampoline.
> As synchronize_rcu_tasks() ignores the idle thread, there's nothing
> that lets ftrace know that the idle task was preempted on a trampoline.
> 
> The idle task should never need to enable preemption. The idle task
> is simply a loop that either calls schedule() or places the CPU into
> idle mode. In fact, having preemption enabled there is inefficient,
> because an interrupt can arrive when idle is just about to call
> schedule() anyway, causing schedule() to be called twice: once when
> the interrupt returns to normal context, and then again in the normal
> path that the idle loop is running in, which is pointless, as it has
> already scheduled.
> 
> Adding a new function, local to kernel/sched/, that allows idle to
> call the scheduler without enabling preemption fixes the
> synchronize_rcu_tasks() issue, and also removes the pointless spurious
> schedules caused by interrupts arriving in the brief window where
> preemption is enabled just before idle calls schedule().
> 
> Cc: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
> Signed-off-by: Steven Rostedt (VMware) <rostedt@...dmis.org>

I don't understand why schedule_idle() needs a loop given that
schedule_preempt_disabled() doesn't have one, but other than
that, for whatever it is worth, it looks good to me.

							Thanx, Paul

> ---
> Changes from v1:
> 
>   o Added Thomas to Cc as he wrote the original generic idle code
> 
>   o Added comment on why schedule_idle() exists
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3b31fc0..0788286 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3502,6 +3502,23 @@ asmlinkage __visible void __sched schedule(void)
>  }
>  EXPORT_SYMBOL(schedule);
> 
> +/*
> + * synchronize_rcu_tasks() makes sure that no task is stuck in preempted
> + * state (i.e., scheduled out non-voluntarily) by making sure that all
> + * tasks have either left the run queue or have gone into user space.
> + * As idle tasks do neither, they must never be preempted
> + * (scheduled out non-voluntarily).
> + *
> + * schedule_idle() is similar to schedule_preempt_disabled() except
> + * that it never enables preemption.
> + */
> +void __sched schedule_idle(void)
> +{
> +	do {
> +		__schedule(false);
> +	} while (need_resched());
> +}
> +
>  #ifdef CONFIG_CONTEXT_TRACKING
>  asmlinkage __visible void __sched schedule_user(void)
>  {
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index ac6d517..229c17e 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -264,7 +264,7 @@ static void do_idle(void)
>  	smp_mb__after_atomic();
> 
>  	sched_ttwu_pending();
> -	schedule_preempt_disabled();
> +	schedule_idle();
>  }
> 
>  bool cpu_in_idle(unsigned long pc)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5cbf922..c5ee02b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1465,6 +1465,8 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
>  }
>  #endif
> 
> +extern void schedule_idle(void);
> +
>  extern void sysrq_sched_debug_show(void);
>  extern void sched_init_granularity(void);
>  extern void update_max_interval(void);
> 
