linux-kernel - [RFC] vruntime updated incorrectly when rt

Message-ID: <CAHRSSEwdWhUurOkviS0WdcGKj3374r-nCXH3BkQfwFiObyq+4w@mail.gmail.com>
Date:   Tue, 7 Aug 2018 10:40:42 -0700
From:   Todd Kjos <tkjos@...gle.com>
To:     LKML <linux-kernel@...r.kernel.org>, linux-pm@...r.kernel.org,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Paul Turner <pjt@...gle.com>
Cc:     John Dias <joaodias@...gle.com>,
        Quentin Perret <quentin.perret@....com>,
        Patrick Bellasi <Patrick.Bellasi@....com>,
        Chris Redpath <Chris.Redpath@....com>,
        Morten Rasmussen <Morten.Rasmussen@....com>,
        Android Kernel Team <kernel-team@...roid.com>
Subject: [RFC] vruntime updated incorrectly when rt_mutex boots prio?

This issue was discovered on a 4.9-based Android device, but the
relevant mainline code appears to be the same. The symptom is that
over time some workloads become sluggish, resulting in missed
frames. It appears to be the same issue described in
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-March/567836.html.

Here is the scenario: A task is deactivated while still in the fair
class. The task is then boosted to RT, so rt_mutex_setprio() is
called. This changes the task to RT and calls check_class_changed(),
which eventually calls detach_task_cfs_rq(). There,
vruntime_normalized() sees that the task's state is TASK_WAKING and
returns true, so the rq's min_vruntime is not subtracted from the
task's vruntime. Later, when the prio is deboosted and the task is
moved back to the fair class, the fair rq's min_vruntime is added to
the task's vruntime, resulting in vruntime inflation.
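
To make the arithmetic concrete, here is a minimal user-space sketch of
the sequence described above. It is a toy model, not kernel code: the
toy_* names, the starting values and the unconditional add in
toy_attach() are assumptions made for illustration (the real attach
path has its own checks).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the kernel structures; only the fields needed here. */
struct toy_cfs_rq { uint64_t min_vruntime; };
struct toy_task   { uint64_t vruntime; };

/*
 * Models the detach step when leaving the fair class: min_vruntime is
 * subtracted unless the task is treated as already normalized.
 */
static void toy_detach(struct toy_task *p, const struct toy_cfs_rq *rq,
                       bool treated_as_normalized)
{
        assert(p && rq);
        if (!treated_as_normalized)
                p->vruntime -= rq->min_vruntime;
}

/* Models the attach step when returning to the fair class. */
static void toy_attach(struct toy_task *p, const struct toy_cfs_rq *rq)
{
        assert(p && rq);
        p->vruntime += rq->min_vruntime;
}

/*
 * One boost/deboost cycle. Returns how far the task's vruntime ends up
 * ahead of the rq's min_vruntime afterwards.
 */
int64_t toy_boost_cycle(bool skip_subtraction)
{
        struct toy_cfs_rq rq = { .min_vruntime = 1000 };
        struct toy_task p = { .vruntime = 1010 };  /* 10 ahead of min */

        toy_detach(&p, &rq, skip_subtraction);     /* boosted to RT */
        rq.min_vruntime += 500;                    /* fair tasks keep running */
        toy_attach(&p, &rq);                       /* deboosted to fair */

        return (int64_t)(p.vruntime - rq.min_vruntime);
}
```

With the subtraction performed, the task returns 10 units ahead of
min_vruntime, as it started. With the subtraction skipped, it returns
1010 units ahead, i.e. inflated by the min_vruntime value at detach time.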

When investigating the problem, it was found that the change below
fixes the problem by making vruntime_normalized() treat a TASK_WAKING
task as normalized only if its sched_class is still CFS (though we're
concerned that it might introduce other issues):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 91f7b3322a15..267056f2e2ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11125,7 +11125,7 @@ static inline bool vruntime_normalized(struct task_struct *p)
         * - A task which has been woken up by try_to_wake_up() and
         *   waiting for actually being woken up by sched_ttwu_pending().
         */
-       if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+       if (!se->sum_exec_runtime || (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
                return true;

        return false;
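
For comparison, the condition before and after the change can be
modelled in user space roughly as follows. This is only a sketch of the
one test touched by the diff; the model_* types and enums are
hypothetical stand-ins, and any other checks in the real function are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for p->state and p->sched_class. */
enum model_state { MODEL_RUNNING, MODEL_WAKING };
enum model_class { MODEL_FAIR, MODEL_RT };

struct model_task {
        uint64_t sum_exec_runtime;
        enum model_state state;
        enum model_class sched_class;
};

/* Condition as it reads on the removed ("-") line of the diff. */
bool model_normalized_before(const struct model_task *p)
{
        assert(p != NULL);
        return !p->sum_exec_runtime || p->state == MODEL_WAKING;
}

/* Condition as it reads on the added ("+") line of the diff. */
bool model_normalized_after(const struct model_task *p)
{
        assert(p != NULL);
        return !p->sum_exec_runtime ||
               (p->state == MODEL_WAKING && p->sched_class == MODEL_FAIR);
}
```

The two versions differ only for a task that has run before, is in
TASK_WAKING, and is no longer in the fair class, which is the case in
the scenario above. A task with zero sum_exec_runtime is still reported
as normalized regardless of class.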

Do folks agree that this is incorrect behavior? Does this fix look
appropriate and safe? Other ideas?

-Todd