linux-kernel - Re: [RFC PATCH 0/4] nvme-tcp: fix hung issues for deleting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADtkEeenw4UbkhstH0F-LTOj1zem8DB8zrSc7zTo_WDfaGFmDg@mail.gmail.com>
Date:   Fri, 9 Jun 2023 11:17:14 +0800
From:   许春光 <brookxu.cn@...il.com>
To:     Ming Lei <ming.lei@...hat.com>
Cc:     kbusch@...nel.org, axboe@...nel.dk, hch@....de, sagi@...mberg.me,
        linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 0/4] nvme-tcp: fix hung issues for deleting

Ming Lei <ming.lei@...hat.com> 于2023年6月8日周四 21:51写道：
>
> On Thu, Jun 08, 2023 at 10:48:50AM +0800, 许春光 wrote:
> > Ming Lei <ming.lei@...hat.com> 于2023年6月8日周四 08:56写道：
> > >
> > > On Wed, Jun 07, 2023 at 12:09:17PM +0800, 许春光 wrote:
> > > > Hi Ming:
> > > >
> > > > Ming Lei <ming.lei@...hat.com> 于2023年6月6日周二 23:15写道：
> > > > >
> > > > > Hello Chunguang,
> > > > >
> > > > > On Mon, May 29, 2023 at 06:59:22PM +0800, brookxu.cn wrote:
> > > > > > From: Chunguang Xu <chunguang.xu@...pee.com>
> > > > > >
> > > > > > We found that nvme_remove_namespaces() may hang in flush_work(&ctrl->scan_work)
> > > > > > while removing ctrl. The root cause may due to the state of ctrl changed to
> > > > > > NVME_CTRL_DELETING while removing ctrl , which intterupt nvme_tcp_error_recovery_work()/
> > > > > > nvme_reset_ctrl_work()/nvme_tcp_reconnect_or_remove().  At this time, ctrl is
> > > > >
> > > > > I didn't dig into ctrl state check in these error handler yet, but error
> > > > > handling is supposed to provide forward progress for any controller state.
> > > > >
> > > > > Can you explain a bit how switching to DELETING interrupts the above
> > > > > error handling and breaks the forward progress guarantee?
> > > >
> > > > Here we freezed ctrl, if ctrl state has changed to DELETING or
> > > > DELETING_NIO(by nvme disconnect),  we will break up and lease ctrl
> > > > freeze, so nvme_remove_namespaces() hang.
> > > >
> > > > static void nvme_tcp_error_recovery_work(struct work_struct *work)
> > > > {
> > > >         ...
> > > >         if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
> > > >                 /* state change failure is ok if we started ctrl delete */
> > > >                 WARN_ON_ONCE(ctrl->state != NVME_CTRL_DELETING &&
> > > >                              ctrl->state != NVME_CTRL_DELETING_NOIO);
> > > >                 return;
> > > >         }
> > > >
> > > >         nvme_tcp_reconnect_or_remove(ctrl);
> > > > }
> > > >
> > > >
> > > > Another path, we will check ctrl state while reconnecting, if it changes to
> > > > DELETING or DELETING_NIO, we will break up and lease ctrl freeze and
> > > > queue quiescing (through reset path), as a result Hang occurs.
> > > >
> > > > static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
> > > > {
> > > >         /* If we are resetting/deleting then do nothing */
> > > >         if (ctrl->state != NVME_CTRL_CONNECTING) {
> > > >                 WARN_ON_ONCE(ctrl->state == NVME_CTRL_NEW ||
> > > >                         ctrl->state == NVME_CTRL_LIVE);
> > > >                 return;
> > > >         }
> > > >         ...
> > > > }
> > > >
> > > > > > freezed and queue is quiescing . Since scan_work may continue to issue IOs to
> > > > > > load partition table, make it blocked, and lead to nvme_tcp_error_recovery_work()
> > > > > > hang in flush_work(&ctrl->scan_work).
> > > > > >
> > > > > > After analyzation, we found that there are mainly two case:
> > > > > > 1. Since ctrl is freeze, scan_work hang in __bio_queue_enter() while it issue
> > > > > >    new IO to load partition table.
> > > > >
> > > > > Yeah, nvme freeze usage is fragile, and I suggested to move
> > > > > nvme_start_freeze() from nvme_tcp_teardown_io_queues to
> > > > > nvme_tcp_configure_io_queues(), such as the posted change on rdma:
> > > > >
> > > > > https://lore.kernel.org/linux-block/CAHj4cs-4gQHnp5aiekvJmb6o8qAcb6nLV61uOGFiisCzM49_dg@mail.gmail.com/T/#ma0d6bbfaa0c8c1be79738ff86a2fdcf7582e06b0
> > > >
> > > > While drive reconnecting, I think we should freeze ctrl or quiescing queue,
> > > > otherwise nvme_fail_nonready_command()may return BLK_STS_RESOURCE,
> > > > and the IOs may retry frequently. So I think we may better freeze ctrl
> > > > while entering
> > > > error_recovery/reconnect, but need to unfreeze it while exit.
> > >
> > > quiescing is always done in error handling, and freeze is actually
> > > not a must, and it is easier to cause race by calling freeze & unfreeze
> > > from different contexts.
> >
> > I think if we donot freeze ctrl, as the IO already submit (just queue
> > to hctx->dispatch) and may
> > pending for a long time,  it may trigger new hang task issue, but
> > freeze ctrl may can avoid these
> > hang task.
>
> How can the freeze make the difference? If driver/device can't move on,
> any request is stuck, so the IO path waits in either submit_bio() or
> upper layer after returning from submit_bio().
>

Now error_recovery and reset ctrl are handled somewhat differently:
1. error_recovery will freeze the controller, but it will unquiescing
queue to fast fail pending IO later,
otherwise this part of IO may cause task hang during the reconnection,
so while error_recovery work
interrupted, just leave ctrl freeze, queue is unquiescing.

Think carefully, the new IO will still hang in enter_queue, it seems
that this solution still not work fine,
so I think we may also need to refactor the logic of error_recovery.

2. Reset ctrl will freeze the controller and quiescing queue at the
same time, while reset interrupted,
ctrl is freeze and the queue is quiescing.

I may got the point of you, what
https://lore.kernel.org/linux-block/CAHj4cs-4gQHnp5aiekvJmb6o8qAcb6nLV61uOGFiisCzM49_dg@mail.gmail.com/T/#ma0d6bbfaa0c8c1be79738ff86a2fdcf7582e06b0
proposal seems better.

> Thanks,
> Ming
>