Message-ID: <CAOi1vP_6zY5vZgYwBGEajz1nrjda7cDbWSLOhnBJGE=JZK1vBg@mail.gmail.com>
Date:   Thu, 30 Mar 2017 17:06:51 +0200
From:   Ilya Dryomov <idryomov@...il.com>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        stable@...r.kernel.org, Sergey Jerusalimov <wintchester@...il.com>,
        Jeff Layton <jlayton@...hat.com>, linux-xfs@...r.kernel.org
Subject: Re: [PATCH 4.4 48/76] libceph: force GFP_NOIO for socket allocations

On Thu, Mar 30, 2017 at 4:36 PM, Michal Hocko <mhocko@...nel.org> wrote:
> On Thu 30-03-17 15:48:42, Ilya Dryomov wrote:
>> On Thu, Mar 30, 2017 at 1:21 PM, Michal Hocko <mhocko@...nel.org> wrote:
> [...]
>> > familiar with Ceph at all but do any of its (slab) shrinkers generate
>> > IO to recurse back?
>>
>> We don't register any custom shrinkers.  This is XFS on top of rbd,
>> a ceph-backed block device.
>
> OK, that was the part I was missing. So you depend on XFS to make
> forward progress here.
>
>> >> Well,
>> >> it's got to go through the same ceph_connection:
>> >>
>> >> rbd_queue_workfn
>> >>   ceph_osdc_start_request
>> >>     ceph_con_send
>> >>       mutex_lock(&con->mutex)  # deadlock, OSD X worker is knocked out
>> >>
>> >> Now if that was a GFP_NOIO allocation, we would simply block in the
>> >> allocator.  The placement algorithm distributes objects across the OSDs
>> >> in a pseudo-random fashion, so even if we had a whole bunch of I/Os for
>> >> that OSD, some other I/Os for other OSDs would complete in the meantime
>> >> and free up memory.  If we are under the kind of memory pressure that
>> >> makes GFP_NOIO allocations block for an extended period of time, we are
>> >> bound to have a lot of already-open sockets, as we would have done at least
>> >> some flushing by then.
>> >
>> > How is this any different from xfs waiting for its IO to be done?
>>
>> I feel like we are talking past each other here.  If the worker in
>> question isn't deadlocked, it will eventually get its socket and start
>> flushing I/O.  If it has deadlocked, it won't...
>
> But if the allocation is stuck then the holder of the lock cannot make
> forward progress and is effectively deadlocked, because other IO
> depends on the lock it holds. Maybe I'm just asking bad questions, but what

Only I/O to the same OSD.  A typical ceph cluster has dozens of OSDs,
so there is plenty of room for other in-flight I/Os to finish and move
the allocator forward.  The lock in question is per-ceph_connection
(read: per-OSD).
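
To make that concrete, here is a minimal, purely illustrative sketch of
the problematic pattern -- a blocking GFP_KERNEL allocation done while
holding the per-connection mutex.  The struct and function names are
made up; this is not the actual libceph code path:

#include <linux/mutex.h>
#include <linux/slab.h>

struct example_connection {
	struct mutex mutex;		/* per-OSD connection lock */
};

/*
 * Under memory pressure a GFP_KERNEL allocation may enter direct
 * reclaim and trigger writeback through the filesystem sitting on top
 * of rbd.  Those writes go out through the very connection whose mutex
 * we hold, so they can never complete and the allocation never
 * succeeds -- deadlock.  A GFP_NOIO allocation merely blocks until
 * in-flight I/O to other OSDs frees up memory.
 */
static void *alloc_under_con_mutex(struct example_connection *con,
				   size_t len)
{
	void *buf;

	mutex_lock(&con->mutex);
	buf = kmalloc(len, GFP_KERNEL);		/* should be GFP_NOIO */
	mutex_unlock(&con->mutex);

	return buf;
}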

> makes GFP_NOIO different from GFP_KERNEL here. We know that the latter
> might need to wait for an IO to finish in the shrinker, but it doesn't
> take the lock in question directly itself. The former depends on the
> allocator's forward progress as well, and that in turn waits for somebody
> else to proceed with the IO. So to me, any blocking allocation made while
> holding a lock that blocks further IO from completing is simply broken.

Right, with GFP_NOIO we simply wait -- there is nothing wrong with
a blocking allocation, at least in the general case.  With GFP_KERNEL
we deadlock, either in rbd/libceph (less likely) or in the filesystem
above (more likely, shown in the xfs_reclaim_inodes_ag() traces you
omitted in your quote).
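
For completeness: socket creation doesn't take an explicit gfp_t, so
one way to force GFP_NOIO for it is the memalloc_noio_save()/
memalloc_noio_restore() scope API, which makes any GFP_KERNEL
allocation done by the marked task behave as GFP_NOIO.  A minimal
sketch (not the exact libceph patch; the function name is made up):

#include <linux/in.h>			/* IPPROTO_TCP */
#include <linux/net.h>			/* sock_create_kern() */
#include <linux/sched.h>		/* memalloc_noio_save/restore */
#include <net/net_namespace.h>		/* init_net */

static int example_tcp_connect(struct socket **sock)
{
	unsigned int noio_flag;
	int ret;

	/*
	 * sock_create_kern() allocates with GFP_KERNEL internally;
	 * with PF_MEMALLOC_NOIO set those allocations are stripped of
	 * __GFP_IO/__GFP_FS and cannot recurse into fs/IO reclaim.
	 */
	noio_flag = memalloc_noio_save();
	ret = sock_create_kern(&init_net, AF_INET, SOCK_STREAM,
			       IPPROTO_TCP, sock);
	memalloc_noio_restore(noio_flag);

	return ret;
}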

Thanks,

                Ilya
