Message-ID: <fbbb0b40-cf40-4fa7-bd3f-828e027b19b5@efficios.com>
Date: Tue, 21 Jan 2025 06:49:26 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Ingo Molnar <mingo@...nel.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Shrikanth Hegde <sshegde@...ux.ibm.com>, Tejun Heo <tj@...nel.org>
Subject: Re: [GIT PULL v2] Scheduler enhancements for v6.14
On 2025-01-21 02:23, Ingo Molnar wrote:
>
> * Mathieu Desnoyers <mathieu.desnoyers@...icios.com> wrote:
>
>> On 20-Jan-2025 12:07:41 PM, Ingo Molnar wrote:
>>>
>>> Linus,
>>>
>>> Please pull the latest sched/core Git tree from:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2025-01-20
>>>
>>> # HEAD: 7d9da040575b343085287686fa902a5b2d43c7ca psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
>>>
>>> Scheduler enhancements for v6.14:
>>
>> [...]
>>
>>> - RSEQ enhancements:
>>>
>>> - Validate read-only fields under DEBUG_RSEQ config
>>> (Mathieu Desnoyers)
>>
>> FYI, a regression introduced by this commit was reported by s390x
>> glibc developers testing against linux-next:
>>
>> https://sourceware.org/pipermail/libc-alpha/2025-January/163993.html
>>
>> I've sent a fix here:
>>
>> https://lore.kernel.org/lkml/20250116205956.836074-1-mathieu.desnoyers@efficios.com/
>>
>> The commit introducing the issue is in this PR, but not the fix.
>
> Indeed - with the bug RSEQ_FLAG_UNREGISTER would fail with an incorrect
> -EFAULT return.
>
> I've applied your fix, and updated the pull request for Linus further
> below. If Linus has already pulled I'll send a fixes pull request
> separately, or Linus can apply the fix from email directly:
>
> Acked-by: Ingo Molnar <mingo@...nel.org>
>
> Or he can pull the sched-core-2025-01-21 tag below safely on top of
> sched-core-2025-01-20, which will result in a diffstat of:
>
> Mathieu Desnoyers (1):
> rseq: Fix rseq unregistration regression
>
> kernel/rseq.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Since I booted the scheduler tree on generic desktops and it was tested
> on other systems as well and nothing appeared to be broken, I presume
> RSEQ_FLAG_UNREGISTER is used only in libc syscall-testcases and in
> specific applications?

Nowadays, rseq unregistration is used by specialized applications (e.g.
tcmalloc) which disable glibc's rseq support through the glibc tunable
(GLIBC_TUNABLES=glibc.pthread.rseq=0) and register rseq themselves.

Recent glibc (2.35+) doesn't use explicit rseq unregistration: the
registration is dropped implicitly when the thread exits.

I'll make a note to add a test case for GLIBC_TUNABLES=glibc.pthread.rseq=0
in the rseq selftests and librseq to improve test coverage when using a
recent glibc.

We have all the code in there to use rseq unregistration, but it is skipped
when glibc 2.35+ is handling the registration.
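
For illustration, the explicit register/unregister path that such an
application exercises could look roughly like the sketch below. This is
not the librseq or selftests code: the function name, the TLS variable
and the signature value are assumptions made for this example, and it
assumes the uapi header <linux/rseq.h> is installed. When glibc has
already registered rseq for the thread, the first syscall is expected to
fail (typically with EINVAL or EBUSY), which is the case where the
unregistration path is skipped.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/rseq.h>

/* Hypothetical signature for this sketch; real users pick a per-arch value. */
#define SKETCH_RSEQ_SIG 0x53053053U

/* Per-thread rseq area owned by the application rather than by glibc. */
static __thread struct rseq sketch_rseq_area;

/*
 * Register the calling thread's rseq area, then unregister it explicitly
 * with RSEQ_FLAG_UNREGISTER. Returns 0 if both syscalls succeed, or -1
 * with errno set by whichever step failed.
 */
int sketch_rseq_roundtrip(void)
{
#ifdef __NR_rseq
	struct rseq *area = &sketch_rseq_area;

	/* The uapi header declares struct rseq with 32-byte alignment. */
	assert(((uintptr_t)area % 32) == 0);

	memset(area, 0, sizeof(*area));
	area->cpu_id = (__u32)RSEQ_CPU_ID_UNINITIALIZED;

	/* Registration: fails if the thread already has an rseq area. */
	if (syscall(__NR_rseq, area, (uint32_t)sizeof(*area), 0,
		    SKETCH_RSEQ_SIG) != 0)
		return -1;

	/* Explicit unregistration: same area, length and signature. */
	if (syscall(__NR_rseq, area, (uint32_t)sizeof(*area),
		    RSEQ_FLAG_UNREGISTER, SKETCH_RSEQ_SIG) != 0)
		return -1;

	return 0;
#else
	/* Toolchain headers do not know the rseq syscall number. */
	errno = ENOSYS;
	return -1;
#endif
}
```

Calling this from a program started with
GLIBC_TUNABLES=glibc.pthread.rseq=0 should reach the unregistration
step, since glibc then leaves registration to the application. On a
kernel with the regression, that second syscall is where the incorrect
-EFAULT would show up.
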
Thanks,
Mathieu
>
> Thanks,
>
> Ingo
>
> ===================================>
> Linus,
>
> Please pull the latest sched/core Git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-core-2025-01-21
>
> # HEAD: 40724ecafccb1fb62b66264854e8c3ad394c8f3d rseq: Fix rseq unregistration regression
>
> Scheduler enhancements for v6.14:
>
> - Fair scheduler (SCHED_FAIR) enhancements:
>
> - Behavioral improvements:
> - Untangle NEXT_BUDDY and pick_next_task() (Peter Zijlstra)
>
> - Delayed-dequeue enhancements & fixes: (Vincent Guittot)
>
> - Rename h_nr_running into h_nr_queued
> - Add new cfs_rq.h_nr_runnable
> - Use the new cfs_rq.h_nr_runnable
> - Removed unsued cfs_rq.h_nr_delayed
> - Rename cfs_rq.idle_h_nr_running into h_nr_idle
> - Remove unused cfs_rq.idle_nr_running
> - Rename cfs_rq.nr_running into nr_queued
> - Do not try to migrate delayed dequeue task
> - Fix variable declaration position
> - Encapsulate set custom slice in a __setparam_fair() function
>
> - Fixes:
> - Fix race between yield_to() and try_to_wake_up() (Tianchen Ding)
> - Fix CPU bandwidth limit bypass during CPU hotplug (Vishal Chourasia)
>
> - Cleanups:
> - Clean up in migrate_degrades_locality() to improve
> readability (Peter Zijlstra)
> - Mark m*_vruntime() with __maybe_unused (Andy Shevchenko)
> - Update comments after sched_tick() rename (Sebastian Andrzej Siewior)
> - Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
> (Valentin Schneider)
>
> - Deadline scheduler (SCHED_DL) enhancements:
>
> - Restore dl_server bandwidth on non-destructive root domain
> changes (Juri Lelli)
>
> - Correctly account for allocated bandwidth during
> hotplug (Juri Lelli)
>
> - Check bandwidth overflow earlier for hotplug (Juri Lelli)
>
> - Clean up goto label in pick_earliest_pushable_dl_task()
> (John Stultz)
>
> - Consolidate timer cancellation (Wander Lairson Costa)
>
> - Load-balancer enhancements:
>
> - Improve performance by prioritizing migrating eligible
> tasks in sched_balance_rq() (Hao Jia)
>
> - Do not compute NUMA Balancing stats unnecessarily during
> load-balancing (K Prateek Nayak)
>
> - Do not compute overloaded status unnecessarily during
> load-balancing (K Prateek Nayak)
>
> - Generic scheduling code enhancements:
>
> - Use READ_ONCE() in task_on_rq_queued(), to consistently use
> the WRITE_ONCE() updated ->on_rq field (Harshit Agarwal)
>
> - Isolated CPUs support enhancements: (Waiman Long)
>
> - Make "isolcpus=nohz" equivalent to "nohz_full"
> - Consolidate housekeeping cpumasks that are always identical
> - Remove HK_TYPE_SCHED
> - Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
>
> - RSEQ enhancements:
>
> - Validate read-only fields under DEBUG_RSEQ config
> (Mathieu Desnoyers)
>
> - PSI enhancements:
>
> - Fix race when task wakes up before psi_sched_switch()
> adjusts flags (Chengming Zhou)
>
> - IRQ time accounting performance enhancements: (Yafang Shao)
>
> - Define sched_clock_irqtime as static key
> - Don't account irq time if sched_clock_irqtime is disabled
>
> - Virtual machine scheduling enhancements:
>
> - Don't try to catch up excess steal time (Suleiman Souhlal)
>
> - Heterogeneous x86 CPU scheduling enhancements: (K Prateek Nayak)
>
> - Convert "sysctl_sched_itmt_enabled" to boolean
> - Use guard() for itmt_update_mutex
> - Move the "sched_itmt_enabled" sysctl to debugfs
> - Remove x86_smt_flags and use cpu_smt_flags directly
> - Use x86_sched_itmt_flags for PKG domain unconditionally
>
> - Debugging code & instrumentation enhancements:
>
> - Change need_resched warnings to pr_err() (David Rientjes)
> - Print domain name in /proc/schedstat (K Prateek Nayak)
> - Fix value reported by hot tasks pulled in /proc/schedstat (Peter Zijlstra)
> - Report the different kinds of imbalances in /proc/schedstat (Swapnil Sapkal)
> - Move sched domain name out of CONFIG_SCHED_DEBUG (Swapnil Sapkal)
> - Update Schedstat version to 17 (Swapnil Sapkal)
>
> Thanks,
>
> Ingo
>
> ------------------>
> Andy Shevchenko (1):
> sched/fair: Mark m*_vruntime() with __maybe_unused
>
> Chengming Zhou (1):
> psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
>
> David Rientjes (1):
> sched/debug: Change need_resched warnings to pr_err
>
> Hao Jia (1):
> sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
>
> Harshit Agarwal (1):
> sched: add READ_ONCE to task_on_rq_queued
>
> John Stultz (1):
> sched: deadline: Cleanup goto label in pick_earliest_pushable_dl_task
>
> Juri Lelli (3):
> sched/deadline: Restore dl_server bandwidth on non-destructive root domain changes
> sched/deadline: Correctly account for allocated bandwidth during hotplug
> sched/deadline: Check bandwidth overflow earlier for hotplug
>
> K Prateek Nayak (8):
> sched/stats: Print domain name in /proc/schedstat
> x86/itmt: Convert "sysctl_sched_itmt_enabled" to boolean
> x86/itmt: Use guard() for itmt_update_mutex
> x86/itmt: Move the "sched_itmt_enabled" sysctl to debugfs
> x86/topology: Remove x86_smt_flags and use cpu_smt_flags directly
> x86/topology: Use x86_sched_itmt_flags for PKG domain unconditionally
> sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
> sched/fair: Do not compute overloaded status unnecessarily during lb
>
> Mathieu Desnoyers (2):
> rseq: Validate read-only fields under DEBUG_RSEQ config
> rseq: Fix rseq unregistration regression
>
> Peter Zijlstra (3):
> sched/fair: Untangle NEXT_BUDDY and pick_next_task()
> sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
> sched/fair: Cleanup in migrate_degrades_locality() to improve readability
>
> Sebastian Andrzej Siewior (1):
> sched/fair: Update comments after sched_tick() rename.
>
> Suleiman Souhlal (1):
> sched: Don't try to catch up excess steal time.
>
> Swapnil Sapkal (3):
> sched: Report the different kinds of imbalances in /proc/schedstat
> sched: Move sched domain name out of CONFIG_SCHED_DEBUG
> docs: Update Schedstat version to 17
>
> Tianchen Ding (1):
> sched: Fix race between yield_to() and try_to_wake_up()
>
> Valentin Schneider (1):
> sched/fair: Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
>
> Vincent Guittot (10):
> sched/fair: Rename h_nr_running into h_nr_queued
> sched/fair: Add new cfs_rq.h_nr_runnable
> sched/fair: Use the new cfs_rq.h_nr_runnable
> sched/fair: Removed unsued cfs_rq.h_nr_delayed
> sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
> sched/fair: Remove unused cfs_rq.idle_nr_running
> sched/fair: Rename cfs_rq.nr_running into nr_queued
> sched/fair: Do not try to migrate delayed dequeue task
> sched/fair: Fix variable declaration position
> sched/fair: Encapsulate set custom slice in a __setparam_fair() function
>
> Vishal Chourasia (1):
> sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug
>
> Waiman Long (4):
> sched/core: Remove HK_TYPE_SCHED
> sched/isolation: Make "isolcpus=nohz" equivalent to "nohz_full"
> sched/isolation: Consolidate housekeeping cpumasks that are always identical
> sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
>
> Wander Lairson Costa (1):
> sched/deadline: Consolidate Timer Cancellation
>
> Yafang Shao (3):
> sched: Define sched_clock_irqtime as static key
> sched: Don't account irq time if sched_clock_irqtime is disabled
> sched, psi: Don't account irq time if sched_clock_irqtime is disabled
>
>
> Documentation/admin-guide/kernel-parameters.txt | 4 +-
> Documentation/scheduler/sched-stats.rst | 126 ++++---
> arch/x86/include/asm/topology.h | 4 +-
> arch/x86/kernel/itmt.c | 81 ++---
> arch/x86/kernel/smpboot.c | 19 +-
> include/linux/sched.h | 10 +
> include/linux/sched/isolation.h | 21 +-
> include/linux/sched/topology.h | 13 +-
> kernel/rseq.c | 98 ++++++
> kernel/sched/core.c | 94 +++--
> kernel/sched/cputime.c | 16 +-
> kernel/sched/deadline.c | 119 +++++--
> kernel/sched/debug.c | 25 +-
> kernel/sched/fair.c | 444 ++++++++++++++----------
> kernel/sched/features.h | 9 +
> kernel/sched/isolation.c | 22 +-
> kernel/sched/pelt.c | 4 +-
> kernel/sched/psi.c | 7 +-
> kernel/sched/sched.h | 37 +-
> kernel/sched/stats.c | 11 +-
> kernel/sched/stats.h | 4 +
> kernel/sched/syscalls.c | 18 +-
> kernel/sched/topology.c | 12 +-
> 23 files changed, 720 insertions(+), 478 deletions(-)
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com