netdev - Re: Distributed storage.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <200708131627.42859.phillips@phunq.net>
Date:	Mon, 13 Aug 2007 16:27:42 -0700
From:	Daniel Phillips <phillips@...nq.net>
To:	Jens Axboe <jens.axboe@...cle.com>
Cc:	Evgeniy Polyakov <johnpol@....mipt.ru>, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: Distributed storage.

On Monday 13 August 2007 02:12, Jens Axboe wrote:
> > It is a system wide problem.  Every block device needs throttling,
> > otherwise queues expand without limit.  Currently, block devices
> > that use the standard request library get a slipshod form of
> > throttling for free in the form of limiting in-flight request
> > structs.  Because the amount of IO carried by a single request can
> > vary by two orders of magnitude, the system behavior of this
> > approach is far from predictable.
>
> Is it? Consider just 10 standard sata disks. The next kernel revision
> will have sg chaining support, so that allows 32MiB per request. Even
> if we disregard reads (not so interesting in this discussion) and
> just look at potentially pinned dirty data in a single queue, that
> number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch.
>
> So I still think that this throttling needs to happen elsewhere, you
> cannot rely the block layer throttling globally or for a single
> device. It just doesn't make sense.

You are right, so long as the unit of throttle accounting remains one 
request.  This is not what we do in ddsnap.  Instead we inc/dec the 
throttle counter by the number of bvecs in each bio, which produces a 
nice steady data flow to the disk under a wide variety of loads, and 
provides the memory resource bound we require.

One throttle count per bvec will not be the right throttling metric for 
every driver.  To customize this accounting metric for a given driver 
we already have the backing_dev_info structure, which provides 
per-device-instance accounting functions and instance data.  Perfect! 
This allows us to factor the throttling mechanism out of the driver, so 
the only thing the driver has to do is define the throttle accounting 
if it needs a custom one.

We can avoid affecting the traditional behavior quite easily, for 
example if backing_dev_info->throttle_fn (new method) is null then 
either not throttle at all (and rely on the struct request in-flight 
limit) or we can move the in-flight request throttling logic into core 
as the default throttling method, simplifying the request library and 
not changing its behavior.

> > These deadlocks are first and foremost, block layer deficiencies. 
> > Even the network becomes part of the problem only because it lies
> > in the block IO path.
>
> The block layer has NEVER guaranteed throttling, so it can - by
> definition - not be a block layer deficiency.

The block layer has always been deficient by not providing accurate 
throttling, or any throttling at all for some devices.  We have 
practical proof that this causes deadlock and a good theoretical basis 
for describing exactly how it happens.

To be sure, vm and net are co-conspirators, however the block layer 
really is the main actor in this little drama.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html