Message-ID: <656f001e-bd9d-4299-9b8a-65efd62714e6@wiesinger.com>
Date: Wed, 22 Jan 2025 07:47:22 +0100
From: Gerhard Wiesinger <lists@...singer.com>
To: Dave Chinner <david@...morbit.com>
Cc: "Theodore Ts'o" <tytso@....edu>, linux-ext4@...r.kernel.org
Subject: Re: Transparent compression with ext4 - especially with zstd
On 21.01.2025 22:26, Dave Chinner wrote:
> On Tue, Jan 21, 2025 at 07:47:24PM +0100, Gerhard Wiesinger wrote:
>> On 21.01.2025 05:01, Theodore Ts'o wrote:
>>> On Sun, Jan 19, 2025 at 03:37:27PM +0100, Gerhard Wiesinger wrote:
>>>> Are there any plans to include transparent compression with ext4 (especially
>>>> with zstd)?
>>> I'm not aware of anyone in the ext4 development community working on
>>> something like this. Fully transparent compression is challenging,
>>> since supporting random writes into a compressed file is tricky.
>>> There are solutions (for example, the Stac patent which resulted in
>>> Microsoft paying $120 million), but even ignoring the
>>> intellectual property issues, they tend to compromise the efficiency
>>> of the compression.
>>>
>>> More to the point, given how cheap byte storage tends to be (dollars
>>> per IOPS tend to be far more of a constraint than dollars per GB),
>>> it's unclear what the business case would be for any company to fund
>>> development work in this area, when the cost of a slightly larger HDD
>>> or SSD is going to be far cheaper than the necessary software
>>> engineering investment, even for a hyperscaler cloud company
>>> (and even there, it's unclear that transparent compression is really
>>> needed).
>>>
>>> What is the business and/or technical problem which you are trying to
>>> solve?
>>>
>> Regarding necessity:
>> In some scenarios we are talking about reducing disk space by a large
>> factor. E.g. in my database scenario with PostgreSQL around 85% of disk
>> space can be saved (i.e. around a factor of 7).
> So use a database that has built-in data compression capabilities.
>
> e.g. MySQL has transparent table compression functionality.
> This requires sparse files and FALLOC_FL_PUNCH_HOLE support in the
> filesystem, but there is no need for any special filesystem side
> support for data compression to get space gains of up to 75% on
> compressible data sets with the default database (16kB record size)
> and filesystem configs (4kB block size).
>
> The argument that "application level compression is hard, so we want
> the filesystem to do it for us" ignores the fact that it is -much
> harder- to do efficient compression in the filesystem than at the
> application level.
>
> The OS and filesystem don't have the freedom to control
> application level data access patterns nor tailor the compression
> algorithms to match how the application manages data, so everything
> the filesystem implements is a compromise. It will never be optimal
> for any given workload, because we have to make sure that it is
> not complete garbage for any given workload...
MySQL/MariaDB isn't an option for me, but I will look into this.
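
For reference, a rough sketch of where the "up to 75%" figure comes from,
assuming each 16kB page is compressed on its own and the unused tail is
hole-punched at 4kB block granularity. The function names and the rounding
model are my own illustration, not MySQL's actual code:

```python
import math

def stored_bytes(compressed_len, page_size=16384, fs_block=4096):
    """Bytes still allocated on disk for one page after hole punching.

    Assumption: the compressed page occupies whole filesystem blocks,
    at least one block is always kept, and a page never grows.
    """
    blocks = max(1, math.ceil(compressed_len / fs_block))
    return min(blocks * fs_block, page_size)

def savings(compressed_len, page_size=16384, fs_block=4096):
    """Fraction of the page's disk space that is freed."""
    return 1 - stored_bytes(compressed_len, page_size, fs_block) / page_size

# A page that compresses to anything up to 4096 bytes keeps one 4kB block,
# so the best case is 1 - 4096/16384 = 0.75.
best_case = savings(1000)
# A page that compresses to 5000 bytes needs two blocks: 1 - 8192/16384 = 0.5.
typical_case = savings(5000)
```

So under this model the gain is quantized to 25% steps, and a smaller
filesystem block size or a larger page size would raise the ceiling.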
>
>> In cloud usage scenarios you can easily reduce the amount of allocated
>> disk space by around a factor of 7 and thereby reduce cost.
> Same argument: cloud applications should be managing their data
> sets appropriately and efficiently, not relying on the cloud storage
> infrastructure to magically do stuff to "reduce costs" for them.
>
> Remember: there's a massive conflict of interest on the vendor side
> here - the less efficient the application (be it CPU, RAM or storage
> capacity), the more money the cloud vendor makes from users running
> that application. Hence they have little motivation to provide
> infrastructure or application functionality that costs them money to
> implement and has the impact of reducing their overall revenue
> stream...
Right, therefore we want to make the storage usage as small as possible,
either at the application level or at the filesystem level.
>> You might also get a performance boost by using caching mechanisms more
>> efficiently (e.g. using less RAM).
> Not true. Linux caches uncompressed data in the page cache - caching
> compressed data will significantly increase the memory footprint and
> CPU consumption as it has to be constantly uncompressed and
> recompressed as the data changes. This is not a viable caching
> strategy for a general purpose OS.
AFAIK ZFS caches compressed data in the ARC cache. zstd really has a
very low overhead on decompression with a very good compression ratio
(even better than gz and bz2).
>> Also with precompressed files (e.g. photos, videos) you can save around 5-10%
> Video and photos do not compress sufficiently to be a viable runtime
> compression target for filesystem based compression. It's a massive
> waste of resources to attempt compression of internally compressed
> data formats for anything but cold data storage. And even then, if
> it's cold storage then the data should be compressed and checksummed
> by the cold storage application before it is written to the
> filesystem.
With zstd, ZFS uses the lz4 "early abort" feature, which detects with very
low CPU overhead that compression is not worthwhile, aborts the
compression and stores the data uncompressed. If lz4 doesn't abort early,
zstd compression is used. So there are solutions with low resource usage.
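
A minimal sketch of the idea (not the ZFS implementation): Python's standard
library has no lz4 or zstd, so zlib level 1 stands in for the cheap probe and
zlib level 9 for the real compressor. The 12.5% minimum gain and the name
store_block are assumptions chosen for illustration:

```python
import os
import zlib

def store_block(block: bytes, min_gain: float = 0.125):
    """Return ("raw", data) or ("compressed", data) for one block.

    Stand-ins (assumptions): zlib level 1 plays the role of the fast lz4
    probe, zlib level 9 plays the role of the slower zstd pass.
    """
    probe = zlib.compress(block, 1)              # cheap first pass
    if len(probe) > len(block) * (1 - min_gain):
        return ("raw", block)                    # early abort: not worth it
    return ("compressed", zlib.compress(block, 9))

# Random bytes behave like already compressed photo/video data.
kind_random, _ = store_block(os.urandom(65536))
# Highly repetitive data passes the probe and gets the expensive pass.
kind_text, packed = store_block(b"a" * 65536)
```

Only blocks that pass the cheap probe pay for the expensive compressor, which
is why incompressible data costs very little CPU.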
Regarding ratios: in my case 3%:
zfs list -o name,compressratio,compression big/shares/fotovideo
NAME                  RATIO  COMPRESS
big/shares/fotovideo  1.03x  zstd-3
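
To relate the numbers in this mail: a small sketch converting between a
compression ratio and the saved fraction, assuming the ratio is
uncompressed size divided by compressed size. The helper names are
hypothetical:

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of space saved for a ratio = uncompressed / compressed."""
    return 1 - 1 / ratio

def ratio_from_savings(saved: float) -> float:
    """Compression ratio that corresponds to a saved fraction."""
    return 1 / (1 - saved)

# 1.03x on the photo/video dataset is roughly 2.9% saved (the "3%" above).
photo_video = savings_from_ratio(1.03)
# 85% saved on the PostgreSQL data is a ratio of roughly 6.7 ("around factor 7").
database = ratio_from_savings(0.85)
```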
>
>> The technical topic is that IMHO no stable and practically usable Linux
>> filesystem with compression that is included in the default kernel exists.
>> - ZFS works but is not included in the default kernel
>> - BTRFS has stability and repair issues (see mailing lists) and bugs with
>> compression (does not compress on the fly in some scenarios)
> I hear this sort of generic "btrfs is not stable/has bugs" complaint
> as a reason for not using btrfs all the time.
That's my practical experience. I tried BTRFS several times and it failed
in both testing and production. I had a storage incident in which several
thousand 4k blocks were damaged, with several VMs running on top.
All other filesystems (XFS, ext4, ZFS, UFS2) except BTRFS and bcachefs
(which is experimental) were repairable to a consistent state (of course
with some blocks lost).
You can repair BTRFS "forever" without getting it into a consistent state.
A friend of mine also had the experience that it was not mountable and
crashed immediately after a reboot ...
Find the details here on the mailing list:
https://marc.info/?l=linux-btrfs&m=172519149923874&w=2
>
> I hear just as many, if not more, generic "XFS is unstable and loses
> data" claims as a reason for not using XFS, too.
I'm not having that experience. But I try to use ext4 primarily as it is
best for "repair" scenarios.
>
> Anecdotal claims are not proof of fact, and I don't see any real
> evidence that btrfs is unstable. e.g. Fedora has been using btrfs
> as the root filesystem (and has for quite a while now) and there has
> been no noticeable increase in bug reports (either for fs
> functionality or data loss) compared to when ext4 or XFS was used as
> the default filesystem type...
These are not anecdotal claims; it is my practical experience that BTRFS
is not stable and not repairable to a consistent state. It is reproducible,
you can try it for yourself.
I have been using Fedora since FC1 for all production systems.
>
> IOWs, I redirect generic "btrfs is unstable" complaints to /dev/null
> these days, just like I do with generic "XFS is unstable"
> complaints.
>
Try it and you will see that it is not repairable. You can find
details and a test case (a simulation of what I had, overwriting random
blocks) in the link.
Since I run Fedora, I'm using the latest and "fresh" stable kernel versions
as well as filesystem utilities. I still have that "unrepairable"
original BTRFS filesystem and try to repair it to a consistent
state from time to time. So far without success.
Find the details here on the mailing list:
https://marc.info/?l=linux-btrfs&m=172519149923874&w=2
So you shouldn't redirect the complaints to /dev/null if BTRFS is to get better :-)
Thnx.
Ciao,
Gerhard