Message-ID: <20171003200511.GD2294@wotan.suse.de>
Date: Tue, 3 Oct 2017 22:05:11 +0200
From: "Luis R. Rodriguez" <mcgrof@...nel.org>
To: Ming Lei <ming.lei@...hat.com>
Cc: "Luis R. Rodriguez" <mcgrof@...nel.org>, viro@...iv.linux.org.uk,
bart.vanassche@....com, tytso@....edu, darrick.wong@...cle.com,
jikos@...nel.org, rjw@...ysocki.net, pavel@....cz,
len.brown@...el.com, linux-fsdevel@...r.kernel.org,
boris.ostrovsky@...cle.com, jgross@...e.com,
todd.e.brandt@...ux.intel.com, nborisov@...e.com, jack@...e.cz,
martin.petersen@...cle.com, ONeukum@...e.com,
oleksandr@...alenko.name, oleg.b.antonyan@...il.com,
linux-pm@...r.kernel.org, linux-block@...r.kernel.org,
linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC 0/5] fs: replace kthread freezing with filesystem
freeze/thaw
On Wed, Oct 04, 2017 at 03:33:01AM +0800, Ming Lei wrote:
> On Tue, Oct 03, 2017 at 11:53:08AM -0700, Luis R. Rodriguez wrote:
> > INFO: task kworker/u8:8:1320 blocked for more than 10 seconds.
> > Tainted: G E 4.13.0-next-20170907+ #88
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > kworker/u8:8 D 0 1320 2 0x80000000
> > Workqueue: events_unbound async_run_entry_fn
> > Call Trace:
> > __schedule+0x2ec/0x7a0
> > schedule+0x36/0x80
> > io_schedule+0x16/0x40
> > get_request+0x278/0x780
> > ? remove_wait_queue+0x70/0x70
> > blk_get_request+0x9c/0x110
> > scsi_execute+0x7a/0x310 [scsi_mod]
> > sd_sync_cache+0xa3/0x190 [sd_mod]
> > ? blk_run_queue+0x3f/0x50
> > sd_suspend_common+0x7b/0x130 [sd_mod]
> > ? scsi_print_result+0x270/0x270 [scsi_mod]
> > sd_suspend_system+0x13/0x20 [sd_mod]
> > do_scsi_suspend+0x1b/0x30 [scsi_mod]
> > scsi_bus_suspend_common+0xb1/0xd0 [scsi_mod]
> > ? device_for_each_child+0x69/0x90
> > scsi_bus_suspend+0x15/0x20 [scsi_mod]
> > dpm_run_callback+0x56/0x140
> > ? scsi_bus_freeze+0x20/0x20 [scsi_mod]
> > __device_suspend+0xf1/0x340
> > async_suspend+0x1f/0xa0
> > async_run_entry_fn+0x38/0x160
> > process_one_work+0x191/0x380
> > worker_thread+0x4e/0x3c0
> > kthread+0x109/0x140
> > ? process_one_work+0x380/0x380
> > ? kthread_create_on_node+0x70/0x70
> > ret_from_fork+0x25/0x30
>
> Actually we are trying to fix this issue inside block layer/SCSI, please
> see the following link:
>
> https://marc.info/?l=linux-scsi&m=150703947029304&w=2
>
> Even though this patch can make kthreads not do I/O during
> suspend/resume, SCSI quiesce can still cause a similar issue
> in other cases, such as when sending SCSI domain validation
> to transport_spi, which happens in the revalidate path and has nothing
> to do with suspend/resume.
Are you saying that the SCSI layer can generate IO even without the filesystem
triggering it?
If so then by all means those are certainly other areas where we should address
quiescing, as I noted in my email.
Also, *iff* the generated IO is triggered on the SCSI suspend callback, then
clearly the next question is whether it is truly needed. If so then yes, it
should be quiesced and all restrictions should be considered.
Note that device pm ops get called first, then later the notifiers are
processed, and only later is userspace frozen. It's this gap this patch
set addresses, and it's also where I saw the issue creep in. Depending on
the answers to the questions above we may or may not need more work in other
layers.
So I am not saying this patch set is sufficient to address all IO quiescing;
quite the contrary, I acknowledged that each subsystem should vet whether it
has non-FS generated IO (it seems you and Bart are doing a great job at this
analysis on SCSI). This patchset however should help with odd corner cases
which *are* triggered by the FS and for which the spaghetti-code requirements
of kthread freezing clearly do not suffice.
> So IMO the root cause is in SCSI's quiesce.
>
> You can find the similar description in above link:
>
> Once SCSI device is put into QUIESCE, no new request except for
> RQF_PREEMPT can be dispatched to SCSI successfully, and
> scsi_device_quiesce() just simply waits for completion of I/Os
> dispatched to SCSI stack. It isn't enough at all.
I see, so the race here is that *in* the SCSI pm ops we generate IO against a
device already in QUIESCE.
>
> Because new requests can still come in, but none of the allocated
> requests can be dispatched successfully, the request pool can easily be
> consumed. Then an RQF_PREEMPT request can't be allocated, and we
> hang forever, just like the stack trace you posted.
>
I see. Makes sense. So SCSI quiesce has restrictions and they're being
violated.
Anyway, don't think of this as a replacement for your or Bart's work then, but
rather as supplemental.
Are you saying we should not move forward with this patch set, or simply that
the above splat is more properly fixed with SCSI quiescing? Given your
explanation I'd have to agree. But even with this considered and accepted, from
a theoretical perspective -- why would this patch set actually seem to fix the
same issue? Is it that it just *seems* to fix it?
Luis