Message-ID: <20070813091235.GI23758@kernel.dk>
Date:	Mon, 13 Aug 2007 11:12:35 +0200
From:	Jens Axboe <jens.axboe@...cle.com>
To:	Daniel Phillips <phillips@...nq.net>
Cc:	Evgeniy Polyakov <johnpol@....mipt.ru>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: Distributed storage.

On Mon, Aug 13 2007, Daniel Phillips wrote:
> On Monday 13 August 2007 00:28, Jens Axboe wrote:
> > On Sun, Aug 12 2007, Daniel Phillips wrote:
> > > Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can
> > > derive efficiently from BIO_POOL_IDX() provided the bio was
> > > allocated in the standard way.
> >
> > That would only be feasible, if we ruled that any bio in the system
> > must originate from the standard pools.
> 
> Not at all.
> 
> > > This leaves a little bit of clean up to do for bios not allocated
> > > from a standard pool.
> >
> > Please suggest how to do such a cleanup.
> 
> Easy, use the BIO_POOL bits to know bi_max_vecs, the same as for a 
> bio from the standard pool.  Just put the power of two size in the bits 
> and map that number to the standard pool arrangement with a table 
> lookup.

So reserve a bit that tells you how to interpret the (now) 3 remaining
bits. Doesn't sound very pretty, does it?
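
Concretely, the lookup being proposed might look something like the sketch
below. Illustrative only, not code from either patch set: the table mirrors
the 2.6.22-era bvec_slabs[] sizes in fs/bio.c, and BIO_STD_POOL is a made-up
flag standing in for the reserved bit just described.

#include <linux/bio.h>

/* Hypothetical flag bit saying "BIO_POOL_IDX() is meaningful here". */
#define BIO_STD_POOL	8

/* Mirrors the bvec_slabs[] nr_vecs table in fs/bio.c. */
static const unsigned int pool_max_vecs[BIOVEC_NR_POOLS] = {
	1, 4, 16, 64, 128, BIO_MAX_PAGES,
};

static unsigned int bio_max_vecs(struct bio *bio)
{
	/*
	 * Bios not allocated from a standard pool carry no meaningful
	 * pool index, so they still need the bi_max_vecs field; that is
	 * exactly where the extra flag bit gets spent.
	 */
	if (!bio_flagged(bio, BIO_STD_POOL))
		return bio->bi_max_vecs;

	return pool_max_vecs[BIO_POOL_IDX(bio)];
}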

> > > On the other hand, vm writeout deadlock ranks smack dab at the top
> > > of the list, so that is where the patching effort must go for the
> > > foreseeable future.  Without bio throttling, the ddsnap load can go
> > > to 24 MB for struct bio alone.  That definitely moves the needle.
> > > In short, we save 3,200 times more memory by putting decent
> > > throttling in place than by saving an int in struct bio.
> >
> > Then fix the damn vm writeout. I always thought it was silly to
> > depend on the block layer for any sort of throttling. If it's not a
> > system wide problem, then throttle the io count in the
> > make_request_fn handler of that problematic driver.
> 
> It is a system wide problem.  Every block device needs throttling, 
> otherwise queues expand without limit.  Currently, block devices that 
> use the standard request library get a slipshod sort of throttling for 
> free, in the form of a limit on in-flight request structs.  Because the 
> amount of IO carried by a single request can vary by two orders of 
> magnitude, the system behavior of this approach is far from 
> predictable.

Is it? Consider just 10 standard SATA disks. The next kernel revision
will have sg chaining support, so that allows 32MiB per request. Even if
we disregard reads (not so interesting in this discussion) and just look
at potentially pinned dirty data in a single queue, then with the default
128 requests per queue that number comes to 32MiB * 128 = 4GiB PER disk.
Or 40GiB for 10 disks. Ouch.

So I still think that this throttling needs to happen elsewhere; you
cannot rely on the block layer to throttle, either globally or for a
single device. It just doesn't make sense.
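
For illustration, the kind of per-driver throttling suggested above (an io
count in the problematic driver's make_request_fn) could be as small as the
following sketch. It is illustrative only: ddev and DDEV_MAX_INFLIGHT are
invented names, the cap of 128 is arbitrary, and error handling is omitted.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/semaphore.h>	/* <asm/semaphore.h> on pre-2.6.26 kernels */

#define DDEV_MAX_INFLIGHT	128	/* arbitrary cap for the example */

struct ddev {
	struct semaphore throttle;	/* counts free submission slots */
	/* ... rest of the driver's state ... */
};

/* At device setup: sema_init(&dev->throttle, DDEV_MAX_INFLIGHT); */

static int ddev_make_request(struct request_queue *q, struct bio *bio)
{
	struct ddev *dev = q->queuedata;

	/* Sleeps here once DDEV_MAX_INFLIGHT bios are in flight. */
	down(&dev->throttle);

	/* ... hand the bio to the driver's real submission path ... */

	return 0;
}

/*
 * The bio's completion handler releases the slot with
 *	up(&dev->throttle);
 * waking the next submitter blocked in ddev_make_request().
 */

The limit, and the memory it pins, is then the driver's own choice rather
than something imposed by the block layer.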

> > > You did not comment on the one about putting the bio destructor in
> > > the ->endio handler, which looks dead simple.  The majority of
> > > cases just use the default endio handler and the default
> > > destructor.  Of the remaining cases, where a specialized destructor
> > > is needed, typically a specialized endio handler is too, so
> > > combining is free.  There are few if any cases where a new
> > > specialized endio handler would need to be written.
> >
> > We could do that without too much work, I agree.
> 
> OK, we got one and another is close to cracking, enough of that.

No, we did not. I already explained why this one fails in my next mail.
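
For context, the destructor-in-endio idea quoted above amounts to something
like the sketch below, using the 2.6.22-era bi_end_io convention; my_dev is a
stand-in for whichever driver owns the bio, not code from either side.

#include <linux/bio.h>

struct my_dev {				/* stand-in driver state */
	struct bio_set *bio_set;	/* pool the bio came from */
};

static int my_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
	struct my_dev *dev = bio->bi_private;

	if (bio->bi_size)		/* partial completion, more to come */
		return 1;

	/* ... the handler's usual completion bookkeeping ... */

	/*
	 * Destruction folded in: return the bio to its pool here, the
	 * work a separate bi_destructor would otherwise have done.
	 */
	bio_free(bio, dev->bio_set);
	return 0;
}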

> > > As far as code stability goes, current kernels are horribly
> > > unstable in a variety of contexts because of memory deadlock and
> > > slowdowns related to the attempt to fix the problem via dirty
> > > memory limits.  Accurate throttling of bio traffic is one of the
> > > two key requirements to fix this instability, the other is
> > > accurate writeout path reserve management, which is only partially
> > > addressed by BIO_POOL.
> >
> > Which, as written above and stated many times over the years on lkml,
> > is not a block layer issue imho.
> 
> Whoever stated that was wrong, but this should be no surprise.  There 
> have been many wrong things said about this particular bug over the 
> years.  The one thing that remains constant is that Linux continues to 
> deadlock under a variety of loads both with and without network 
> involvement, making it effectively useless as a storage platform.
> 
> These deadlocks are, first and foremost, block layer deficiencies.  Even 
> the network becomes part of the problem only because it lies in the 
> block IO path.

The block layer has NEVER guaranteed throttling, so it can - by
definition - not be a block layer deficiency.

-- 
Jens Axboe
