linux-kernel - Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors

Message-ID: <1390506935.2402.8.camel@dabdike>
Date:	Thu, 23 Jan 2014 11:55:35 -0800
From:	James Bottomley <James.Bottomley@...senPartnership.com>
To:	Mel Gorman <mgorman@...e.de>
Cc:	"linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
	Chris Mason <clm@...com>, Dave Chinner <david@...morbit.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"linux-ide@...r.kernel.org" <linux-ide@...r.kernel.org>,
	"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	"lsf-pc@...ts.linux-foundation.org" 
	<lsf-pc@...ts.linux-foundation.org>,
	"rwheeler@...hat.com" <rwheeler@...hat.com>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] really large storage sectors - going
 beyond 4096 bytes

On Thu, 2014-01-23 at 16:44 +0000, Mel Gorman wrote:
> On Thu, Jan 23, 2014 at 07:47:53AM -0800, James Bottomley wrote:
> > On Thu, 2014-01-23 at 19:27 +1100, Dave Chinner wrote:
> > > On Wed, Jan 22, 2014 at 10:13:59AM -0800, James Bottomley wrote:
> > > > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:
> > > > > On Wed, 2014-01-22 at 09:21 -0800, James Bottomley wrote:
> > > > > > On Wed, 2014-01-22 at 17:02 +0000, Chris Mason wrote:
> > > > > 
> > > > > [ I like big sectors and I cannot lie ]
> > > > 
> > > > I think I might be sceptical, but I don't think that's showing in my
> > > > concerns ...
> > > > 
> > > > > > > I really think that if we want to make progress on this one, we need
> > > > > > > code and someone that owns it.  Nick's work was impressive, but it was
> > > > > > > mostly there for getting rid of buffer heads.  If we have a device that
> > > > > > > needs it and someone working to enable that device, we'll go forward
> > > > > > > much faster.
> > > > > > 
> > > > > > Do we even need to do that (eliminate buffer heads)?  We cope with 4k
> > > > > > sector only devices just fine today because the bh mechanisms now
> > > > > > operate on top of the page cache and can do the RMW necessary to update
> > > > > > a bh in the page cache itself which allows us to do only 4k chunked
> > > > > > writes, so we could keep the bh system and just alter the granularity of
> > > > > > the page cache.
> > > > > > 
> > > > > 
> > > > > We're likely to have people mixing 4K drives and <fill in some other
> > > > > size here> on the same box.  We could just go with the biggest size and
> > > > > use the existing bh code for the sub-pagesized blocks, but I really
> > > > > hesitate to change VM fundamentals for this.
> > > > 
> > > > If the page cache had a variable granularity per device, that would cope
> > > > with this.  It's the variable granularity that's the VM problem.
> > > > 
> > > > > From a pure code point of view, it may be less work to change it once in
> > > > > the VM.  But from an overall system impact point of view, it's a big
> > > > > change in how the system behaves just for filesystem metadata.
> > > > 
> > > > Agreed, but only if we don't do RMW in the buffer cache ... which may be
> > > > a good reason to keep it.
> > > > 
> > > > > > The other question is if the drive does RMW between 4k and whatever its
> > > > > > physical sector size, do we need to do anything to take advantage of
> > > > > > it ... as in what would altering the granularity of the page cache buy
> > > > > > us?
> > > > > 
> > > > > The real benefit is when and how the reads get scheduled.  We're able to
> > > > > do a much better job pipelining the reads, controlling our caches and
> > > > > reducing write latency by having the reads done up in the OS instead of
> > > > > the drive.
> > > > 
> > > > I agree with all of that, but my question is still can we do this by
> > > > propagating alignment and chunk size information (i.e. the physical
> > > > sector size) like we do today.  If the FS knows the optimal I/O patterns
> > > > and tries to follow them, the odd cockup won't impact performance
> > > > dramatically.  The real question is can the FS make use of this layout
> > > > information *without* changing the page cache granularity?  Only if you
> > > > answer me "no" to this do I think we need to worry about changing page
> > > > cache granularity.
> > > 
> > > We already do this today.
> > > 
> > > The problem is that we are limited by the page cache assumption that
> > > the block device/filesystem never need to manage multiple pages as
> > > an atomic unit of change. Hence we can't use the generic
> > > infrastructure as it stands to handle block/sector sizes larger than
> > > a page size...
> > 
> > If the compound page infrastructure exists today and is usable for this,
> > what else do we need to do? ... because if it's a couple of trivial
> > changes and a few minor patches to filesystems to take advantage of it,
> > we might as well do it anyway. 
> 
> Do not do this as there is no guarantee that a compound allocation will
> succeed.

I presume this is because, in the current implementation, compound pages
have to be physically contiguous.  For increasing granularity in the
page cache, we don't necessarily need this ... however, getting writeout
to work properly without physically contiguous pages would be a bit
more challenging (but not impossible) to solve.
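
To put a number on the contiguity requirement, here is a rough sketch in
plain userspace C of how many base pages a single block would span if it
had to be backed by one physically contiguous compound page.  The helper
names block_order() and pages_per_block() are made up for illustration
and are not kernel APIs; the sketch assumes both sizes are powers of two.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): smallest allocation order such
 * that (page_size << order) >= block_size.  Assumes both arguments are
 * non-zero powers of two.
 */
static inline unsigned int block_order(size_t block_size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	/* Double the span until it covers one whole block. */
	while (span < block_size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Number of contiguous base pages needed to back one block. */
static inline size_t pages_per_block(size_t block_size, size_t page_size)
{
	return (size_t)1 << block_order(block_size, page_size);
}
```

With 4k pages, a 64k sector works out to order 4, i.e. 16 physically
contiguous pages per block, whereas on a 64k page system the same sector
is order 0 and needs nothing beyond an ordinary page.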

>  If the allocation fails then it is potentially unrecoverable:
> once we can no longer write to storage, you're hosed. If you are
> now thinking mempool then the problem becomes that the system will be
> in a state of degraded performance for an unknowable length of time and
> may never recover fully. 64K MMU page size systems get away with this
> because the blocksize is still <= PAGE_SIZE and no core VM changes are
> necessary. Critically, pages like the page table pages are the same size as
> the basic unit of allocation used by the kernel so external fragmentation
> simply is not a severe problem.

Right, I understand this ... but we still need to wonder about what it
would take.  Even in the simple case, failing a compound page allocation
gets treated in the kernel the same way as failing a single page
allocation in the page cache.

> > I was only objecting on the grounds that
> > the last time we looked at it, it was major VM surgery.  Can someone
> > give a summary of how far we are away from being able to do this with
> > the VM system today and what extra work is needed (and how big is this
> > piece of work)?
> > 
> 
> Offhand no idea. For fsblock, probably a similar amount of work to what
> had to be done in 2007 and I'd expect it would still have the filesystem
> awareness problems that Dave Chinner pointed out earlier. For large block,
> it'd hit the same wall that allocations must always succeed.

I don't understand this.  Why must they succeed?  4k page allocations
don't have to succeed today in the page cache, so why would compound
page allocations have to succeed?

>  If we
> want to break the connection between the basic unit of memory managed
> by the kernel and the MMU page size then I don't know but it would be a
> fairly large amount of surgery and need a lot of design work. Minimally,
> anything dealing with an MMU-sized amount of memory would now need to
> deal with sub-pages and there would need to be some restrictions on how
> sub-pages were used to mitigate the risk of external fragmentation -- do not
> mix page table page allocations with pages mapped into the address space,
> do not allow sub pages to be used by different processes etc. At the very
> least there would be a performance impact because PAGE_SIZE is no longer a
> compile-time constant. However, it would potentially allow the block size
> to be at least the same size as this new basic allocation unit.

Hm, OK, so less appealing then.

James


