linux-kernel - Re: [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker

Message-ID: <20200227181430.GA44024@cmpxchg.org>
Date:   Thu, 27 Feb 2020 13:14:30 -0500
From:   Johannes Weiner <hannes@...xchg.org>
To:     Qian Cai <cai@....pw>
Cc:     Dan Schatzberg <schatzberg.dan@...il.com>,
        Jens Axboe <axboe@...nel.dk>, Tejun Heo <tj@...nel.org>,
        Li Zefan <lizefan@...wei.com>,
        Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Hugh Dickins <hughd@...gle.com>, Roman Gushchin <guro@...com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Chris Down <chris@...isdown.name>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        "open list:BLOCK LAYER" <linux-block@...r.kernel.org>,
        open list <linux-kernel@...r.kernel.org>,
        "open list:CONTROL GROUP (CGROUP)" <cgroups@...r.kernel.org>,
        "open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)" 
        <linux-mm@...ck.org>
Subject: Re: [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker

On Wed, Feb 26, 2020 at 12:02:38PM -0500, Qian Cai wrote:
> On Mon, 2020-02-24 at 17:17 -0500, Dan Schatzberg wrote:
> > Existing uses of loop device may have multiple cgroups reading/writing
> > to the same device. Simply charging resources for I/O to the backing
> > file could result in priority inversion where one cgroup gets
> > synchronously blocked, holding up all other I/O to the loop device.
> > 
> > In order to avoid this priority inversion, we use a single workqueue
> > where each work item is a "struct loop_worker" which contains a queue of
> > struct loop_cmds to issue. The loop device maintains a tree mapping blk
> > css_id -> loop_worker. This allows each cgroup to independently make
> > forward progress issuing I/O to the backing file.
> > 
> > There is also a single queue for I/O associated with the rootcg which
> > can be used in cases of extreme memory shortage where we cannot allocate
> > a loop_worker.
> > 
> > The locking for the tree and queues is fairly heavy handed - we acquire
> > the per-loop-device spinlock any time either is accessed. The existing
> > implementation serializes all I/O through a single thread anyways, so I
> > don't believe this is any worse.
> > 
> > Signed-off-by: Dan Schatzberg <schatzberg.dan@...il.com>
> > Acked-by: Johannes Weiner <hannes@...xchg.org>
> 
> The locking in loop_free_idle_workers() will trigger this with sysfs reading,
> 
> [ 7080.047167] LTP: starting read_all_sys (read_all -d /sys -q -r 10)
> [ 7239.842276] cpufreq transition table exceeds PAGE_SIZE. Disabling
> 
> [ 7247.054961] =====================================================
> [ 7247.054971] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> [ 7247.054983] 5.6.0-rc3-next-20200226 #2 Tainted: G           O     
> [ 7247.054992] -----------------------------------------------------
> [ 7247.055002] read_all/8513 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 7247.055014] c0000006844864c8 (&fs->seq){+.+.}, at: file_path+0x24/0x40
> [ 7247.055041] 
>                and this task is already holding:
> [ 7247.055061] c0002006bab8b928 (&(&lo->lo_lock)->rlock){..-.}, at: loop_attr_do_show_backing_file+0x3c/0x120 [loop]
> [ 7247.055078] which would create a new lock dependency:
> [ 7247.055105]  (&(&lo->lo_lock)->rlock){..-.} -> (&fs->seq){+.+.}
> [ 7247.055125] 
>                but this new dependency connects a SOFTIRQ-irq-safe lock:
> [ 7247.055155]  (&(&lo->lo_lock)->rlock){..-.}
> [ 7247.055156] 
>                ... which became SOFTIRQ-irq-safe at:
> [ 7247.055196]   lock_acquire+0x130/0x360
> [ 7247.055221]   _raw_spin_lock_irq+0x68/0x90
> [ 7247.055230]   loop_free_idle_workers+0x44/0x3f0 [loop]
> [ 7247.055242]   call_timer_fn+0x110/0x5f0
> [ 7247.055260]   run_timer_softirq+0x8f8/0x9f0
> [ 7247.055278]   __do_softirq+0x34c/0x8c8
> [ 7247.055288]   irq_exit+0x16c/0x1d0
> [ 7247.055298]   timer_interrupt+0x1f0/0x680
> [ 7247.055308]   decrementer_common+0x124/0x130
> [ 7247.055328]   arch_local_irq_restore.part.8+0x34/0x90
> [ 7247.055352]   cpuidle_enter_state+0x11c/0x8f0
> [ 7247.055361]   cpuidle_enter+0x50/0x70
> [ 7247.055389]   call_cpuidle+0x4c/0x90
> [ 7247.055398]   do_idle+0x378/0x470
> [ 7247.055414]   cpu_startup_entry+0x3c/0x40
> [ 7247.055442]   start_secondary+0x7a8/0xa80
> [ 7247.055461]   start_secondary_prolog+0x10/0x14

That's kind of hilarious.

So even though it's a spin_lock_irq(), suggesting it's used from both
process and irq context, Dan appears to be adding the first user that
actually runs from irq context. It looks like it should have been a
regular spinlock all along. Until now, anyway.

Fixing it should be straightforward: use get_file() under the lock to
pin the file, drop the lock to do file_path(), then release the file
with fput().
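
To illustrate the pin / unlock / use / unpin ordering, here is a minimal
userspace sketch in C11. It is not the loop driver code and not a proposed
patch: every demo_* name is hypothetical, the atomic_flag spinlock only
stands in for lo->lo_lock, and the refcounted struct only stands in for the
backing struct file. The point it shows is that the path lookup runs after
the lock has been dropped, while the extra reference keeps the file alive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct file: a refcounted object with a path. */
struct demo_file {
	atomic_int refcount;
	const char *path;
};

/* Hypothetical stand-in for struct loop_device. */
struct demo_loop {
	atomic_flag lock;          /* plays the role of lo->lo_lock */
	struct demo_file *backing; /* plays the role of lo->lo_backing_file */
};

static void demo_lock(struct demo_loop *lo)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&lo->lock, memory_order_acquire))
		;
}

static void demo_unlock(struct demo_loop *lo)
{
	atomic_flag_clear_explicit(&lo->lock, memory_order_release);
}

/* Analogue of get_file(): take an extra reference to pin the object. */
static struct demo_file *demo_get_file(struct demo_file *f)
{
	atomic_fetch_add(&f->refcount, 1);
	return f;
}

/* Analogue of fput(): drop the reference taken by demo_get_file(). */
static void demo_fput(struct demo_file *f)
{
	int old = atomic_fetch_sub(&f->refcount, 1);

	assert(old > 0);
}

/*
 * Analogue of file_path(): in the kernel this takes other locks, which is
 * why it must not be called with the spinlock held.
 */
static int demo_file_path(struct demo_file *f, char *buf, size_t len)
{
	return snprintf(buf, len, "%s\n", f->path);
}

/* Pin under the lock, drop the lock, do the lookup, then unpin. */
static int demo_show_backing_file(struct demo_loop *lo, char *buf, size_t len)
{
	struct demo_file *f = NULL;
	int ret;

	demo_lock(lo);
	if (lo->backing)
		f = demo_get_file(lo->backing);
	demo_unlock(lo);

	if (!f)
		return -1; /* no backing file attached */

	ret = demo_file_path(f, buf, len); /* runs without the lock held */
	demo_fput(f);
	return ret;
}
```

With the lookup moved outside the critical section, the lock only ever
protects the pointer read and the reference grab, so no dependency from
the spinlock to whatever the lookup acquires gets created.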