Message-ID: <20120308070720.GP3592@dastard>
Date: Thu, 8 Mar 2012 18:07:20 +1100
From: Dave Chinner <david@...morbit.com>
To: "Martin K. Petersen" <martin.petersen@...cle.com>
Cc: Andreas Dilger <aedilger@...il.com>, linux-fsdevel@...r.kernel.org,
linux-ext4@...r.kernel.org
Subject: Re: [RFC] fadvise: add more flags to provide a hint for block
allocation
On Wed, Mar 07, 2012 at 11:23:49PM -0500, Martin K. Petersen wrote:
> >>>>> "Dave" == Dave Chinner <david@...morbit.com> writes:
>
> Dave> From what I've seen of the proposed SMR device standards, we're
> Dave> going to have to redesign filesystem allocation policies
>
> [...]
>
> The initial proposal involved SMR disks having a sparse LBA map carved
> into 2GB chunks.
2TB chunks, IIRC - the lower 32 bits of the 48-bit LBA were intended
to be the relative offset into the region (RBA), with the upper 16
bits being the region number.
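
To make that concrete, the split would look something like this in
rough C (illustrative only - the macro and helper names here are
invented, they're not from any draft standard):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch of the region/RBA split described above: a
 * 48-bit LBA whose upper 16 bits select a region and whose lower
 * 32 bits are the relative block address (RBA) within it.
 */
#define RBA_BITS	32
#define RBA_MASK	((1ULL << RBA_BITS) - 1)

static inline uint16_t lba_to_region(uint64_t lba)
{
	return (uint16_t)(lba >> RBA_BITS);
}

static inline uint32_t lba_to_rba(uint64_t lba)
{
	return (uint32_t)(lba & RBA_MASK);
}

int main(void)
{
	/* region 5, RBA 0x1000 */
	uint64_t lba = ((uint64_t)5 << RBA_BITS) | 0x1000;

	printf("region %" PRIu16 ", rba 0x%" PRIx32 "\n",
	       lba_to_region(lba), lba_to_rba(lba));
	return 0;
}

With 512 byte sectors, 2^32 sectors per region is where the 2TB
region size comes from.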
> However, that was shot down pretty hard.
That's unfortunate - it maps really well to how XFS uses allocation
groups. XFS already uses sparse regions to break up allocation and
enable parallelism, so it could map to this sort of layout pretty
easily by placing an allocation group per region. That immediately
turns the SMR regions into discrete allocation regions in the
filesystem, and just requires some tweaking to make use of the
different characteristics of the regions.
For example, we could use the standard btree freespace allocator for
the random write regions, and the bitmap allocator (used by the
realtime device) for the sequential write regions, because its
metadata is held externally to the region it is tracking - i.e. the
bitmap can be located in the random write regions. This could all be
handled by mkfs.xfs, including setting up the regions on the SMR
drives....
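
In rough pseudo-C (purely illustrative - none of these names exist
in XFS), that mkfs-time policy would be something like:

#include <stdint.h>

/* Hypothetical region types reported by an SMR device. */
enum smr_region_type {
	SMR_REGION_RANDOM_WRITE,	/* conventional, rewritable */
	SMR_REGION_SEQUENTIAL_WRITE,	/* shingled, append-only */
};

/* Hypothetical per-AG allocator choice. */
enum ag_allocator {
	AG_ALLOC_BTREE,		/* standard freespace btrees */
	AG_ALLOC_BITMAP,	/* realtime-style bitmap, metadata external */
};

struct ag_config {
	uint16_t		region;		/* backing SMR region */
	enum ag_allocator	allocator;
};

/*
 * One AG per region: btree freespace allocation for random write
 * regions, bitmap allocation (whose metadata lives in a random
 * write region) for sequential write regions.
 */
static void ag_init(struct ag_config *ag, uint16_t region,
		    enum smr_region_type type)
{
	ag->region = region;
	ag->allocator = (type == SMR_REGION_SEQUENTIAL_WRITE) ?
			AG_ALLOC_BITMAP : AG_ALLOC_BTREE;
}

int main(void)
{
	struct ag_config ag;

	/* e.g. region 3 is shingled, so its AG gets the bitmap allocator */
	ag_init(&ag, 3, SMR_REGION_SEQUENTIAL_WRITE);
	return ag.allocator == AG_ALLOC_BITMAP ? 0 : 1;
}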
IOWs, XFS already has most of the allocation infrastructure to
handle the proposed region based SMR devices, and would only need a
bit of modification and extension to fully support sequential write
regions along with random write regions. The allocation policy
stuff (deciding what sort of region to allocate from and aggregating
writes appropriately) is where all the new complexity lies, but we
have to do that anyway to handle all the different sorts of access
hints we are likely to see.
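
That policy layer might look something like this (again illustrative
only - the hint names are placeholders for whatever fadvise flags we
end up with):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-file write hints, fadvise-style. */
enum write_hint {
	WRITE_HINT_NONE,
	WRITE_HINT_SEQUENTIAL,	/* e.g. log/append-only files */
	WRITE_HINT_RANDOM,	/* e.g. database files */
};

/* Decide which sort of region a new allocation should come from. */
static bool use_sequential_region(enum write_hint hint, bool is_metadata)
{
	/* Metadata is rewritten in place - keep it in random write regions. */
	if (is_metadata)
		return false;
	/* Aggregate sequentially-hinted data into sequential write regions. */
	return hint == WRITE_HINT_SEQUENTIAL;
}

int main(void)
{
	printf("log file -> %s write region\n",
	       use_sequential_region(WRITE_HINT_SEQUENTIAL, false) ?
	       "sequential" : "random");
	return 0;
}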
> The approach currently being worked uses either dynamic (flash, tiered
> storage) or static hints (SMR) to put things in an appropriate area
> given the nature of the I/O.
> This puts the burden of virtual to physical LBA management on the device
> rather than in the filesystem allocators. And gives us the benefit of
> having a single interface that can be used for many different device
> types.
So the current proposal hides all the physical characteristics of
the devices from the filesystem and remaps the LBA internally based
on the IO hint? That is the opposite of the direction we've been
taking over the past couple of years - we want more visibility of
device characteristics at the filesystem level, not less, so we can
optimise the filesystem better.
> That said, the current proposal is crazy complex and clearly written
> with Windows in mind. They are creating different access profiles for
> .DLLs, .INI files, apps in the startup folder, and so on.
I'll pass judgement when I see it.
To tell the truth, I'd much prefer that we have direct control of
physical layout in the filesystem rather than have the storage
device virtualise it with some unknown algorithm. Every device will
have different algorithms, so we won't get the relatively consistent
behaviour across devices from different manufacturers that we have
now. If that is all hidden in the drive firmware and differs for
each device we see, then we've got no hope of diagnosing why two
files with identical filesystem layouts at adjacent LBAs have vastly
different performance for the same access pattern....
Cheers,
Dave.
--
Dave Chinner
david@...morbit.com