lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKUOC8VN5n+YnFLPbQWa1hKp+vOWH26FKS92R+h4EvS=e11jFA@mail.gmail.com>
Date:   Wed, 12 Feb 2020 14:27:09 -0800
From:   Salman Qazi <sqazi@...gle.com>
To:     Jens Axboe <axboe@...nel.dk>, Ming Lei <ming.lei@...hat.com>,
        Bart Van Assche <bvanassche@....org>,
        Christoph Hellwig <hch@....de>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-block@...r.kernel.org
Cc:     Gwendal Grignou <gwendal@...gle.com>,
        Jesse Barnes <jsbarnes@...gle.com>
Subject: BLKSECDISCARD ioctl and hung tasks

Hi,

So, here's another issue that we are grappling with, where we have a
root-cause but don't currently have a good fix for.  BLKSECDISCARD is
an operation used for securely destroying a subset of the data on a
device.  Unfortunately, on SSDs, this is an operation with variable
performance.  It can be O(minutes) in the worst case.  The
pathological case is when many erase blocks on the flash contain a
small amount of data that is part of the discard and a large amount of
data that isn't.  In such cases, the erase blocks have to be copied
almost in entirety to fresh blocks, in order to erase the sectors to
be discarded. This can be thought of as a defragmentation operation on
the drive and can be expected to cost in the same ballpark as
rewriting most of the contents of the drive.

Therefore, it is possible for the thread waiting in the IOCTL in
submit_bio_wait call in blkdev_issue_discard to wait for several
minutes.  The hung task watchdog is usually configured for 2 minutes,
and this can expire before the operation finishes.

This operation is very important to the security model of Chrome OS
devices.  Therefore, we would like the kernel to survive this even if
it takes several minutes.

Three approaches come to mind:

One approach is to somehow avoid waiting for a single monolithic
operation and instead wait on bits and pieces of the operation.  These
can be sized to finish within a reasonable timeframe.  The exact size
is likely device-specific.  We already split these operations before
issuing to the device, but the IOCTL thread is waiting for the whole
rather than the parts. The hung task watchdog only sees the total
amount of time the thread slept and not the forward progress taking
place quietly.

Another approach might be to do something in the spirit of the write
system call: complete the partial operation (whatever the kernel
thinks is reasonable), adjust the IOCTL argument and have the
userspace reissue the syscall to continue the operation.  The second
option should probably be done with a different IOCTL name to avoid
breaking userspace.

A third approach, which is perhaps more adventurous, is to have a
notion of forward progress that a thread can export and the hung task
watchdog can evaluate.  This can take the form of a function pointer
and an argument.  The result of the function is a monotonically
decreasing unsigned value.  When this value stops changing, we can
conclude that the thread is hung.  This can be used in place of
context switch count for tasks where this function is available.  This
can potentially solve other similar issues, there is a way to tell if
there is forward progress, but it is not as straightforward as the
context switch count.

What are your thoughts?

Thanks in advance,

Salman

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ