Message-ID: <CAGudoHEhvNyQhHG516a6R+vz3b69d-5dCU=_8JpXdRdGnGsjew@mail.gmail.com>
Date: Wed, 17 Sep 2025 15:45:16 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Max Kellermann <max.kellermann@...os.com>
Cc: slava.dubeyko@....com, xiubli@...hat.com, idryomov@...il.com,
amarkuze@...hat.com, ceph-devel@...r.kernel.org, netfs@...ts.linux.dev,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH] ceph: fix deadlock bugs by making iput() calls asynchronous
On Wed, Sep 17, 2025 at 3:39 PM Max Kellermann <max.kellermann@...os.com> wrote:
>
> On Wed, Sep 17, 2025 at 3:14 PM Mateusz Guzik <mjguzik@...il.com> wrote:
> > Does the patch convert literally all iput calls within ceph into the
> > async variant? I would be worried that mandatory deferral of literally
> > all final iputs may be a regression from perf standpoint.
>
> (Forgot to reply to this part)
> No, I changed just the ones that are called from Writeback+Messenger.

Ok, in that case I have no further commentary.

> I don't think this affects performance at all. It almost never happens
> that the last reference gets dropped by somebody other than dcache
> (which only happens under memory pressure).

Well, only changing the problematic consumers, as opposed to *everyone*,
should be the end of it.
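
For illustration, a minimal sketch of the deferral pattern we are talking
about (all names here are invented and this is not the actual patch; the
real ceph code embeds the work item in its own inode structure):

#include <linux/fs.h>
#include <linux/workqueue.h>

/* Illustrative container; a real fs would embed this in its own
 * private inode (ceph: struct ceph_inode_info). */
struct my_inode {
	struct inode vfs_inode;
	struct work_struct put_work;	/* INIT_WORK()ed at inode setup */
};

static inline struct my_inode *MY_I(struct inode *inode)
{
	return container_of(inode, struct my_inode, vfs_inode);
}

static void deferred_iput_workfn(struct work_struct *work)
{
	struct my_inode *mi = container_of(work, struct my_inode, put_work);

	/* The final iput() now runs in workqueue context, where it can
	 * sleep and take fs locks without deadlocking the caller. */
	iput(&mi->vfs_inode);
}

/* Only the problematic callers (writeback/messenger) use this;
 * everyone else keeps calling plain iput(). */
static void my_async_iput(struct inode *inode)
{
	if (!inode)
		return;
	/* Fast path: drop the reference inline unless it is the last
	 * one, so ordinary puts see no deferral at all. */
	if (atomic_add_unless(&inode->i_count, -1, 1))
		return;
	/* Last reference: hand the final put to a workqueue.  (A real
	 * implementation must also handle queue_work() returning false
	 * because the work item is already pending.) */
	queue_work(system_unbound_wq, &MY_I(inode)->put_work);
}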
> It was very difficult to reproduce this bug:
> - "echo 2 >drop_caches" in a loop
> - a kernel patch that adds msleep() to several functions
> - another kernel patch that allows me to disconnect the Ceph server via ioctl
> The latter was to free inode references that are held by Ceph caps.
> For this deadlock to occur, all references other than
> writeback/messenger must be gone already.
> (It did happen on our production servers, crashing all of them a few
> days ago and causing a major service outage, but apparently in all
> these years we're the first ones to observe this deadlock bug.)
>
This makes sense to me.
The VFS layer is hopefully going to get significantly better assert
coverage, so I expect this kind of trouble will get reported without
anyone having to actually run into it. Presumably including
yet-to-be-discovered deadlocks. ;)
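
To illustrate what I mean, a purely hypothetical sketch of such an
assert -- nothing like this exists in the VFS today, and the name is
made up:

#include <linux/fs.h>
#include <linux/sched.h>

/* Hypothetical, not an existing VFS check.  A final iput() can evict
 * the inode and trigger writeback, so flag it when it happens from a
 * context that the eviction path itself may end up waiting on. */
static void assert_final_iput_context(struct inode *inode)
{
	WARN_ON_ONCE(atomic_read(&inode->i_count) == 1 &&
		     (current->flags & (PF_MEMALLOC | PF_KSWAPD)));
}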