linux-kernel - Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8734tosyb9.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 19 Feb 2024 15:33:30 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Daniel Bristot de Oliveira <bristot@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>,  Peter Zijlstra <peterz@...radead.org>,
  Juri Lelli <juri.lelli@...hat.com>,  Vincent Guittot
 <vincent.guittot@...aro.org>,  Dietmar Eggemann
 <dietmar.eggemann@....com>,  Steven Rostedt <rostedt@...dmis.org>,  Ben
 Segall <bsegall@...gle.com>,  Mel Gorman <mgorman@...e.de>,  Daniel
 Bristot de Oliveira <bristot@...hat.com>,  Valentin Schneider
 <vschneid@...hat.com>,  linux-kernel@...r.kernel.org,  Luca Abeni
 <luca.abeni@...tannapisa.it>,  Tommaso Cucinotta
 <tommaso.cucinotta@...tannapisa.it>,  Thomas Gleixner
 <tglx@...utronix.de>,  Joel Fernandes <joel@...lfernandes.org>,  Vineeth
 Pillai <vineeth@...byteword.org>,  Shuah Khan <skhan@...uxfoundation.org>,
  Phil Auld <pauld@...hat.com>, Aaron Lu <aaron.lu@...el.com>, Kairui Song
 <kasong@...cent.com>, Guo Ziliang <guo.ziliang@....com.cn>
Subject: Re: [PATCH v5 0/7] SCHED_DEADLINE server infrastructure

Hi, Daniel,

Thanks a lot for your great patchset!

We have a similar starvation issue in mm subsystem too.  Details are in
the patch description of the below commit.  In short, task A is busy
looping on some event, while task B will signal the event after some
work.  If the priority of task A is higher than that of task B, task B
may be starved.

IIUC, if task A is RT task while task B is fair task, then your patchset
will solve the issue.  If both task A and task B is RT tasks, is there
some way to solve the issue?

Now, we use a ugly schedule_timeout_uninterruptible(1) in the loop to
resolve the issue, is there something better?

Best Regards,
Huang, Ying

--------------------------8<---------------------------------------
commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e
Author: Guo Ziliang <guo.ziliang@....com.cn>
Date:   Wed Mar 16 16:15:03 2022 -0700

    mm: swap: get rid of livelock in swapin readahead
    
    In our testing, a livelock task was found.  Through sysrq printing, same
    stack was found every time, as follows:
    
      __swap_duplicate+0x58/0x1a0
      swapcache_prepare+0x24/0x30
      __read_swap_cache_async+0xac/0x220
      read_swap_cache_async+0x58/0xa0
      swapin_readahead+0x24c/0x628
      do_swap_page+0x374/0x8a0
      __handle_mm_fault+0x598/0xd60
      handle_mm_fault+0x114/0x200
      do_page_fault+0x148/0x4d0
      do_translation_fault+0xb0/0xd4
      do_mem_abort+0x50/0xb0
    
    The reason for the livelock is that swapcache_prepare() always returns
    EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
    cannot jump out of the loop.  We suspect that the task that clears the
    SWAP_HAS_CACHE flag never gets a chance to run.  We try to lower the
    priority of the task stuck in a livelock so that the task that clears
    the SWAP_HAS_CACHE flag will run.  The results show that the system
    returns to normal after the priority is lowered.
    
    In our testing, multiple real-time tasks are bound to the same core, and
    the task in the livelock is the highest priority task of the core, so
    the livelocked task cannot be preempted.
    
    Although cond_resched() is used by __read_swap_cache_async, it is an
    empty function in the preemptive system and cannot achieve the purpose
    of releasing the CPU.  A high-priority task cannot release the CPU
    unless preempted by a higher-priority task.  But when this task is
    already the highest priority task on this core, other tasks will not be
    able to be scheduled.  So we think we should replace cond_resched() with
    schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
    call set_current_state first to set the task state, so the task will be
    removed from the running queue, so as to achieve the purpose of giving
    up the CPU and prevent it from running in kernel mode for too long.
    
    (akpm: ugly hack becomes uglier.  But it fixes the issue in a
    backportable-to-stable fashion while we hopefully work on something
    better)
    
    Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
    Signed-off-by: Guo Ziliang <guo.ziliang@....com.cn>
    Reported-by: Zeal Robot <zealci@....com.cn>
    Reviewed-by: Ran Xiaokai <ran.xiaokai@....com.cn>
    Reviewed-by: Jiang Xuexin <jiang.xuexin@....com.cn>
    Reviewed-by: Yang Yang <yang.yang29@....com.cn>
    Acked-by: Hugh Dickins <hughd@...gle.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@....com>
    Cc: Michal Hocko <mhocko@...nel.org>
    Cc: Minchan Kim <minchan@...nel.org>
    Cc: Johannes Weiner <hannes@...xchg.org>
    Cc: Roger Quadros <rogerq@...nel.org>
    Cc: Ziliang Guo <guo.ziliang@....com.cn>
    Cc: <stable@...r.kernel.org>
    Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8d4104242100..ee67164531c0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -478,7 +478,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
 		 * in swap_map, but not yet added its page to swap cache.
 		 */
-		cond_resched();
+		schedule_timeout_uninterruptible(1);
 	}
 
 	/*