Date:   Mon, 14 Aug 2023 22:10:05 -0600
From:   Andreas Dilger <adilger@...ger.ca>
To:     "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
Cc:     Bobi Jam <bobijam@...mail.com>, linux-ext4@...r.kernel.org
Subject: Re: [PATCH 1/2] ext4: optimize metadata allocation for hybrid LUNs

On Aug 3, 2023, at 6:10 AM, Ritesh Harjani (IBM) <ritesh.list@...il.com> wrote:
> 
> Bobi Jam <bobijam@...mail.com> writes:
> 
>> With LVM it is possible to create an LV with SSD storage at the
>> beginning of the LV and HDD storage at the end of the LV, and use that
>> to separate ext4 metadata allocations (that need small random IOs)
>> from data allocations (that are better suited for large sequential
>> IOs) depending on the type of underlying storage.  Between 0.5-1.0% of
>> the filesystem capacity would need to be high-IOPS storage in order to
>> hold all of the internal metadata.
>> 
>> This would improve performance for inode and other metadata access,
>> such as ls, find, and e2fsck, and in general improve the latency of
>> file access, modification, truncate, unlink, transaction commit, etc.
>> 
>> This patch splits the largest-free-order group lists and the average
>> fragment size lists into two sets each, one for IOPS/fast storage
>> groups and one for the rest, and does cr 0 / cr 1 group scanning for
>> metadata block allocation in the following order:
>> 
>> cr 0 on largest free order IOPS group list
>> cr 1 on average fragment size IOPS group list
>> cr 0 on largest free order non-IOPS group list
>> cr 1 on average fragment size non-IOPS group list
>> cr >= 2 perform the linear search as before

Hi Ritesh,
thanks for the review and the discussion about the patch.

> Yes. The implementation looks straightforward to me.
> 

>> Non-metadata block allocation does not allocate from the IOPS groups.
>> 
>> Add an option to mke2fs to mark which blocks are in the IOPS region
>> of storage at format time:
>> 
>>  -E iops=0-1024G,4096-8192G
> 

> However, a few things to discuss here are -

As Ted requested on the call, this should be done as two separate calls
to the allocator, rather than embedding the policy in mballoc group
selection itself.  Presumably this would be in ext4_mb_new_blocks()
calling ext4_mb_regular_allocator() twice with different allocation
flags (first with EXT4_MB_HINT_METADATA, then without, though I don't
actually see that this flag was used anywhere in the code before this patch?)

Metadata allocations should try only IOPS groups on the first call,
but would go through all allocation phases.  If IOPS allocation fails,
then the allocator should do a full second pass to allocate from the
non-IOPS groups.  Non-metadata allocations would only allocate from
non-IOPS groups.
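
In rough outline (a sketch only, following the flag names in this
patch; the real control flow in ext4_mb_new_blocks() has more error
handling than shown here):

	if (ac->ac_flags & EXT4_MB_HINT_METADATA) {
		/* pass 1: scan only the IOPS groups, through all cr phases */
		*errp = ext4_mb_regular_allocator(ac);
		if (!*errp && ac->ac_status != AC_STATUS_FOUND) {
			/* IOPS groups full, retry across non-IOPS groups */
			ac->ac_flags &= ~EXT4_MB_HINT_METADATA;
			*errp = ext4_mb_regular_allocator(ac);
		}
	} else {
		/* data allocations only ever scan non-IOPS groups */
		*errp = ext4_mb_regular_allocator(ac);
	}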

> 1. What happens when the hdd space for data gets fully exhausted? AFAICS,
> the allocation for data blocks will still succeed, however we won't be
> able to make use of optimized scanning any more. Because we search within
> iops lists only when EXT4_MB_HINT_METADATA is set in ac->ac_flags.

The intention for our usage is that data allocations should *only* come
from the HDD region of the device, and *not* from the IOPS (flash) region
of the device.  The IOPS region will be comparatively small (0.5-1.0% of
the total device size) so using or not using this space will be mostly
meaningless to the overall filesystem usage, especially with a 1-5%
reserved blocks percentage that is the default for new filesystems.

As you mentioned on the call, it seems this is a defect in the current
patch, that non-metadata allocations may eventually fall back to scan
all block groups for free space including IOPS groups.  They need to
explicitly skip groups that have the IOPS flag set.
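
Something like the following, early in ext4_mb_good_group(), would
close that hole (a sketch; EXT4_MB_GRP_TEST_IOPS() is a hypothetical
helper testing the in-memory per-group IOPS state):

	if (!(ac->ac_flags & EXT4_MB_HINT_METADATA) &&
	    EXT4_MB_GRP_TEST_IOPS(grp))
		return false;	/* data must never land in IOPS groups */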

> 2. Similarly what happens when the ssd space for metadata gets full?
> In this case we keep falling back to cr2 for allocation and we don't
> utilize optimize_scanning to find the block groups from hdd space to
> allocate from.

In the case when the IOPS groups are full then the metadata allocations
should fall back to using non-IOPS groups.  That avoids ENOSPC when the
metadata space is accidentally formatted too small, or unexpected usage
such as large xattrs or many directories consumes more IOPS space.

> 3. So it seems after a period of time, these iops lists can have block
> groups belonging to different ssds. Could this cause the metadata
> allocation of related inodes to come from different ssds?
> Will this be optimal? Checking on this...
>     ...On checking further on this, we start with a goal group and we
> at least scan s_mb_max_linear_groups (4) linearly. So it's unlikely that
> we frequently allocate metadata blocks from different SSDs.

In our usage there will typically be only a single IOPS region at the
start of the device, but the ability to allow multiple IOPS regions was
added for completeness and future flexibility (e.g. filesystem resize).
In our case, the IOPS region would itself be RAIDed, so "different SSDs"
is not really a concern.

> 4. Ok looking into this, do we even require the iops lists for metadata
> allocations? Do we allocate more than 1 block for metadata? If not, then
> maintaining these iops lists for metadata allocation isn't really
> helpful. On the other hand, it does make sense to maintain them when we
> allow data allocations from these ssds when hdds get full.

I don't think we *need* to use the same mballoc code for IOPS allocation
in most cases, though large xattr inode allocations should also be using
the IOPS groups for allocating blocks, and these might be up to 64KB.
I don't think that is actually implemented properly in this patch yet.

Also, the mballoc lists/arrays make it easy to find groups with free
space in a full filesystem instead of having to scan for them, even if
we don't need the full "allocate order-N" functionality.  Having one
list of free groups versus the order-N lists doesn't make it any more
expensive (and having multiple list heads actually improves scalability).
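
Roughly, the split list heads in struct ext4_sb_info would look like
the following (the first four fields are already in mainline; the
*_iops variants are my reading of what this patch adds, with
illustrative names):

	struct list_head *mb_largest_free_orders;      /* non-IOPS groups */
	rwlock_t         *mb_largest_free_orders_locks;
	struct list_head *mb_avg_fragment_size;        /* non-IOPS groups */
	rwlock_t         *mb_avg_fragment_size_locks;
	struct list_head *mb_largest_free_orders_iops; /* IOPS groups */
	struct list_head *mb_avg_fragment_size_iops;   /* IOPS groups */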

One of the future enhancements might be to allow small files (of some
configurable size) to also be allocated from the IOPS groups, so it is
probably easier IMHO to just stick with the same allocator for both.
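
That policy could be as simple as the following when building the data
allocation request (sketch only; s_mb_iops_small_file_limit would be a
new, hypothetical tunable, and a separate hint flag may be cleaner than
reusing EXT4_MB_HINT_METADATA):

	if (!S_ISDIR(inode->i_mode) &&
	    i_size_read(inode) <= sbi->s_mb_iops_small_file_limit)
		/* steer small files into the IOPS groups */
		ar.flags |= EXT4_MB_HINT_METADATA;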

> 5. Did we run any benchmarks with this yet? What kind of gains are we
> looking for? Do we have any numbers for this?

We're working on that.  I just wanted to get the initial patches out for
review sooner rather than later, both to get feedback on implementation
(like this, thanks), and also to reserve the EXT4_BG_IOPS flag so it
doesn't get used in a conflicting manner.
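
For reference, the existing block group descriptor flags in ext4.h,
with the new flag alongside (the 0x0010 value is illustrative only):

	#define EXT4_BG_INODE_UNINIT	0x0001 /* inode table/bitmap unused */
	#define EXT4_BG_BLOCK_UNINIT	0x0002 /* block bitmap not in use */
	#define EXT4_BG_INODE_ZEROED	0x0004 /* on-disk itable zeroed */
	#define EXT4_BG_IOPS		0x0010 /* group on IOPS/fast storage */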

> 6. I couldn't help but start to think of...
> Should there also be a provision from the user to pass hot/cold data
> types which we can use as a hint within the filesystem to allocate from
> ssd v/s hdd? Does it even make sense to think in this direction?

Yes, I also had the same idea, but then left it out of my email to avoid
getting distracted from the initial goal.  There are a number of possible
improvements that could be done with a mechanism like this:
- have fast/slow regions within a single HDD (e.g. the last 20% of the
  spindle is in the "slow" region due to reduced linear velocity and
  bandwidth on inner tracks) to avoid using the slow region unless the
  fast region is (mostly) full
- have several regions across an HDD to *intentionally* allocate some
  extents in the "slow" groups to reduce *peak* bandwidth but keep
  *average* bandwidth higher as the disk becomes more full since there
  would still be free space in the faster groups.

Cheers, Andreas