linux-ext4 - Re: discard feature, mkfs.ext4 and mmc default fallback to normal erase op

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPDyKFrpccdMyqsxTi2dotbtr3_7hL0hUjWXc5Sx52kGnrDuLw@mail.gmail.com>
Date:   Wed, 9 Dec 2020 15:51:24 +0100
From:   Ulf Hansson <ulf.hansson@...aro.org>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     Michael Walle <michael@...le.cc>, linux-ext4@...r.kernel.org,
        "linux-mmc@...r.kernel.org" <linux-mmc@...r.kernel.org>,
        linux-block <linux-block@...r.kernel.org>
Subject: Re: discard feature, mkfs.ext4 and mmc default fallback to normal
 erase op

On Tue, 8 Dec 2020 at 17:52, Theodore Y. Ts'o <tytso@....edu> wrote:
>
> On Tue, Dec 08, 2020 at 12:26:22PM +0100, Michael Walle wrote:
> > Do we really need to map these functions? What if we don't have an
> > actual discard, but just a slow erase (I'm now assuming that erase
> > will likely be slow on sdcards)? Can't we just tell the user space
> > there is no discard? Like on a normal HDD? I really don't know the
> > implications, seems like mmc_erase() is just there for the linux
> > discard feature.
>
> So the potential gotcha here is that "discard" is important for
> reducing write amplification, and thus improving the lifespan of
> devices.  (See my reference to the Tesla engine controller story
> earlier.)  So if a device doesn't have "discard" but has "erase", and
> "erase" is fast, then skipping the discard could end up significantly
> reducing the lifespan of your product, and we're back to the NHTSA
> investigating whether they should stick Tesla for the $1500 engine
> controller replacement when cards die early.

Yes, exactly. The point about wear leveling and the lifespan of the
device are critical.

That said, we should continue to map discard requests to legacy erase
commands for SD cards, unless the card supports the new discard, of
course.

One thing I realized though, is that we should probably announce and
implement support for secure erase (QUEUE_FLAG_SECERASE) for SD cards,
as that seems to map well towards with the erase command.

An erase is specified in the SD spec as, after the erase the data is
either "0" or "1", which I guess is what is expected from a
REQ_OP_SECURE_ERASE operation?

>
> I guess the JEDEC spec does specify a way to query the card for how
> long an erase takes, but I don't have the knowledge about how the
> actual real-world implementations of these specs (and their many
> variants over the years) actually behave.  Can the erase times that
> they advertise actually be trusted to be accurate?  How many of them
> actually supply erase times at all, no matter what the spec says?

For eMMC discard commands are fast, but probably also trim commands.
For erase I don't know.

Then, whether the corresponding "erase times" that are be specified in
the eMMC registers, I guess those always refer to the worst case
scenario. I don't know how useful they really are in the end.

In any case, we may end up with poor erase/discard performance,
because of internal FW implementations.

Although, what I think we may be able to improve, both from eMMC and
SD point of view, is to allow more blocks per discard/erase operation.
But honestly, I don't know how big of a problem this is, even if just
staring at the code, gives me some ideas.

>
> > Coming from the user space side. Does mkfs.ext4 assumes its pre-discard
> > is fast? I'd think so, right? I'd presume it was intented to tell the
> > FTL of the block device, "hey these blocks are unused, you can do some
> > wear leveling with them".
>
> Yes, the assumption is that discard is fast.  Exactly how fast seems
> to vary; this is one of the reasons why there are three different ways
> to do discards on a file system after files are deleted.  One way is
> to do them after the deleted definitely won't be unwound (i.e., after
> the ext4 journal commit).  But on some devices, the discard command,
> while fast, is slow enough that this will compete with the I/O
> completion times of other read commands, thus degrading system
> performance.  So you can also execute the trim commands out of cron,
> using the fstrim command, which will run the discards in the
> background, and the system administrator can adjust when fstrim is
> executed during times wheno performance isn't critical.  (e.g., when
> the phone is on a charger in the middle of the night, or at 4am local
> time, etc.)  Finally, you can configure e2fsck to run the discards
> after the file system consistency check is done.
>
> The reason why we have to leave this up to the system administrators
> is that we have essentially no guidance from the device how slow the
> discard command might be, how it intereferes with other device
> operations, and whether the discard might be more likely ignored if
> the device is busy.  So it might be that the discard will more likely
> improve write endurance when it is done when the device is idle.  All
> of the speccs (SCSI, SATA, UFS, eMMC, SD) are remarkable unhelpful
> because performance considerations is generally consider "out of
> scope" of standards committees.  They want to leave that up to market
> forces; which is why big companies (at handset vendors, hyperscale
> cloud providers, systems integrators, etc.) have to spend as much
> money doing certification testing before deciding which products to
> buy; think of it as a full-employment act for storage engineers.  :-)

A few comments related to the above.

Even if the discarded blocks are flushed at some wisely selected
point, when the device is idle, that doesn't guarantee that the
internal garbage collection runs inside the device. In the end that
depends on the FW implementation of the card - and I assume it's
likely triggered based on some internal idle time and the amount of
"garbage" there is to deal with.

For both eMMC and SD cards, the specs define commands for how to
manually control the background operations inside the cards. In
principle, this allows us to tell the card when it's a good time to
run garbage collection (and when not to).

Both for eMMC and SD, we are not using this, yet. However, I have been
playing with a couple of different ideas to explore this:

*) Use the runtime PM framework to detect an idle period and then
trigger background operations. The problem is, that we don't really
know how long we will be idle, meaning that we don't know if it's
really a wise decision to trigger the background operations in the
end.

**) Invent a new type of generic block request, as to let userspace
trigger this.

Of course, another option is also to leave this as is, thus relying on
the internal FW of the card to act the best it can.

Do you have any thoughts around this?

[...]

Kind regards
Uffe