linux-kernel - BLKSECDISCARD ioctl and hung tasks

Message-ID: <CAKUOC8VN5n+YnFLPbQWa1hKp+vOWH26FKS92R+h4EvS=e11jFA@mail.gmail.com>
Date:   Wed, 12 Feb 2020 14:27:09 -0800
From:   Salman Qazi <sqazi@...gle.com>
To:     Jens Axboe <axboe@...nel.dk>, Ming Lei <ming.lei@...hat.com>,
        Bart Van Assche <bvanassche@....org>,
        Christoph Hellwig <hch@....de>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-block@...r.kernel.org
Cc:     Gwendal Grignou <gwendal@...gle.com>,
        Jesse Barnes <jsbarnes@...gle.com>
Subject: BLKSECDISCARD ioctl and hung tasks

Hi,

So, here's another issue that we are grappling with: we have a
root cause but don't currently have a good fix.  BLKSECDISCARD is
an operation used for securely destroying a subset of the data on a
device.  Unfortunately, on SSDs, this is an operation with variable
performance.  It can be O(minutes) in the worst case.  The
pathological case is when many erase blocks on the flash contain a
small amount of data that is part of the discard and a large amount of
data that isn't.  In such cases, the erase blocks have to be copied
almost in their entirety to fresh blocks in order to erase the sectors
to be discarded.  This can be thought of as a defragmentation operation on
the drive and can be expected to cost in the same ballpark as
rewriting most of the contents of the drive.

Therefore, the thread that issued the IOCTL can sit in the
submit_bio_wait() call in blkdev_issue_discard() for several minutes.
The hung task watchdog is usually configured for 2 minutes
(hung_task_timeout_secs defaults to 120), and this can expire before
the operation finishes.
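
For reference, a minimal sketch of how userspace reaches this path.
The helper name secure_discard_range is made up for illustration; the
ioctl and its argument layout (a two-element array of byte offset and
byte length) come from <linux/fs.h>.  The call does not return until
the device has finished, and that is the wait described above.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* BLKSECDISCARD */

/*
 * Securely discard [start, start + len) on the block device open as fd.
 * Both values are in bytes.  Returns 0 on success, or -1 with errno set.
 * The helper name is hypothetical; only the ioctl itself is real.
 */
static int secure_discard_range(int fd, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };

	return ioctl(fd, BLKSECDISCARD, range);
}
```

The kernel rejects ranges that are not 512-byte aligned, and the
descriptor has to be open for writing.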

This operation is very important to the security model of Chrome OS
devices.  Therefore, we would like the kernel to survive this even if
it takes several minutes.

Three approaches come to mind:

One approach is to somehow avoid waiting for a single monolithic
operation and instead wait on bits and pieces of the operation.  These
can be sized to finish within a reasonable timeframe.  The exact size
is likely device-specific.  We already split these operations before
issuing to the device, but the IOCTL thread is waiting for the whole
rather than the parts. The hung task watchdog only sees the total
amount of time the thread slept and not the forward progress taking
place quietly.
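
To make the first approach concrete, here is a rough userspace model,
not kernel code.  The callback type piece_fn, the helper
discard_in_pieces and the counting callback are all hypothetical names;
the only point is that the waiter blocks once per piece, so no single
sleep covers the whole range.

```c
#include <stdint.h>

/* Hypothetical callback: discard one piece and wait for it; 0 on success. */
typedef int (*piece_fn)(void *ctx, uint64_t start, uint64_t len);

/*
 * Walk [start, start + len) in pieces of at most max_piece bytes and
 * wait on each piece separately.  Stops at the first failing piece and
 * returns its error code; returns 0 once every piece has completed.
 */
static int discard_in_pieces(piece_fn fn, void *ctx, uint64_t start,
			     uint64_t len, uint64_t max_piece)
{
	if (max_piece == 0)
		return -1;

	while (len > 0) {
		uint64_t n = len < max_piece ? len : max_piece;
		int ret = fn(ctx, start, n);

		if (ret)
			return ret;
		start += n;
		len -= n;
	}
	return 0;
}

/* Stand-in callback for illustration: only counts how often it is called. */
static int count_piece(void *ctx, uint64_t start, uint64_t len)
{
	(void)start;
	(void)len;
	(*(unsigned int *)ctx)++;
	return 0;
}
```

How large max_piece should be is exactly the device-specific question
raised above; this sketch does not answer it.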

Another approach might be to do something in the spirit of the write
system call: complete the partial operation (whatever the kernel
thinks is reasonable), adjust the IOCTL argument and have
userspace reissue the syscall to continue the operation.  This second
option should probably be done under a different IOCTL name to avoid
breaking existing userspace.
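
A sketch of what the caller side of such a write-like interface could
look like.  No such IOCTL exists today; partial_discard below is a
hypothetical stand-in that models a kernel completing at most "budget"
bytes per call, and struct partial_dev exists only to drive the
example.

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical device model: how much one call may complete. */
struct partial_dev {
	uint64_t budget;	/* max bytes discarded per call */
	unsigned int calls;	/* number of calls made so far */
};

/* Models the proposed IOCTL: returns the number of bytes completed. */
static ssize_t partial_discard(struct partial_dev *dev, uint64_t start,
			       uint64_t len)
{
	(void)start;
	dev->calls++;
	return (ssize_t)(len < dev->budget ? len : dev->budget);
}

/* Caller side: reissue until the range is done, as with a short write(). */
static int discard_all(struct partial_dev *dev, uint64_t start, uint64_t len)
{
	while (len > 0) {
		ssize_t done = partial_discard(dev, start, len);

		if (done <= 0)
			return -1;	/* error, or no progress was made */
		start += (uint64_t)done;
		len -= (uint64_t)done;
	}
	return 0;
}
```

Each call returns to userspace after a bounded amount of work, so the
thread never sleeps long enough in one syscall to trip the watchdog.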

A third approach, which is perhaps more adventurous, is to have a
notion of forward progress that a thread can export and the hung task
watchdog can evaluate.  This can take the form of a function pointer
and an argument.  The result of the function is a monotonically
decreasing unsigned value.  When this value stops changing, we can
conclude that the thread is hung.  This can be used in place of
context switch count for tasks where this function is available.  This
can potentially solve other similar issues, where there is a way to
tell if there is forward progress, but it is not as straightforward as
the context switch count.
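
A small userspace model of the third idea, to show the comparison the
watchdog would make.  struct progress_probe, probe_looks_hung and
read_counter are hypothetical names; nothing like this exists in the
hung task detector as far as this sketch assumes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task record: a function pointer and its argument. */
struct progress_probe {
	uint64_t (*remaining)(void *arg);	/* monotonically decreasing */
	void *arg;
	uint64_t last;		/* value seen on the previous pass */
	bool primed;		/* false until the first sample is taken */
};

/*
 * One watchdog pass.  Returns true if the task looks hung, meaning the
 * exported value did not change since the previous pass.
 */
static bool probe_looks_hung(struct progress_probe *p)
{
	uint64_t now = p->remaining(p->arg);
	bool hung = p->primed && now == p->last;

	p->last = now;
	p->primed = true;
	return hung;
}

/* Example callback: reads a counter of bytes still to be discarded. */
static uint64_t read_counter(void *arg)
{
	return *(const uint64_t *)arg;
}
```

The first pass only records a baseline; after that, two consecutive
passes with the same value flag the task.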

What are your thoughts?

Thanks in advance,

Salman