Message-ID: <4D358A47.4020009@yandex-team.ru>
Date: Tue, 18 Jan 2011 15:40:39 +0300
From: "Oleg V. Ukhno" <olegu@...dex-team.ru>
To: John Fastabend <john.r.fastabend@...el.com>
CC: Jay Vosburgh <fubar@...ibm.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [PATCH] bonding: added 802.3ad round-robin hashing policy for
single TCP session balancing
On 01/18/2011 06:16 AM, John Fastabend wrote:
> On 1/14/2011 4:05 PM, Jay Vosburgh wrote:
>> Can somebody (John?) more knowledgeable than I about dm-multipath
>> comment on the above?
>
> Here I'll give it a go.
>
> I don't think detecting L2 link failure this way is very robust. If there
> is a failure farther away than your immediate link, you're going to break
> completely? Your bonding hash will continue to round robin the iscsi
> packets and half of them will get dropped on the floor. dm-multipath handles
> this reasonably gracefully. Also in this bonding environment you seem to
> be very sensitive to RTT times on the network. Maybe not bad outright but
> I wouldn't consider this robust either.
John, I agree: this bonding mode should be used in a fairly limited number
of situations. As for failures farther away than the immediate link,
every bonding mode suffers from the same problem. Bonding detects only
L2 failures; anything else is handled by upper-layer mechanisms. Almost
all bonding modes also depend on equal RTT across the slaves. Besides,
there is already a similar load balancing mode, balance-alb. What I did
is approximately the same, but for the 802.3ad bonding mode, and it
provides "better" (more equal and unconditional, layer 2) load striping
for tx and _rx_.
Perhaps I shouldn't have mentioned the particular use case for this patch.
When I wrote it I tried to make a more general solution; my goal was to
"make equal or near-equal load striping for TX and (most important part)
RX within a single ethernet (layer 2) domain for TCP transmission". This
bonding mode just introduces the ability to stripe rx and tx load for a
single TCP connection between hosts inside one ethernet segment.
iSCSI is just an example. It is possible to stripe load between a
Linux-based router and a Linux-based web/ftp/etc. server in the same
manner. I think this feature will be useful in a number of network
configurations.
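To illustrate the difference being discussed, here is a minimal Python
sketch (not the patch itself, and not kernel code). It compares a
simplified layer2-style hash, taken here as the XOR of the two MAC
addresses modulo the slave count, with per-packet round-robin for a
single flow between two hosts. The function and class names and the MAC
addresses are hypothetical, and the sketch ignores packet reordering,
which is why equal RTT on the slaves matters.

```python
from collections import Counter
from itertools import count


def layer2_hash_slave(src_mac: bytes, dst_mac: bytes, n_slaves: int) -> int:
    """Simplified layer2-style policy: XOR of the MACs modulo slave count."""
    xored = int.from_bytes(src_mac, "big") ^ int.from_bytes(dst_mac, "big")
    return xored % n_slaves


class RoundRobinSelector:
    """Per-packet round-robin: each packet goes to the next slave in turn."""

    def __init__(self, n_slaves: int) -> None:
        self.n_slaves = n_slaves
        self._counter = count()

    def pick(self) -> int:
        return next(self._counter) % self.n_slaves


def distribute(n_packets: int, n_slaves: int, src_mac: bytes, dst_mac: bytes):
    """Count packets per slave for one flow under both policies."""
    hashed = Counter(
        layer2_hash_slave(src_mac, dst_mac, n_slaves) for _ in range(n_packets)
    )
    rr = RoundRobinSelector(n_slaves)
    striped = Counter(rr.pick() for _ in range(n_packets))
    return hashed, striped


# Hypothetical MAC addresses for the two hosts of a single TCP session.
SRC = bytes.fromhex("001b21000001")
DST = bytes.fromhex("001b21000002")

hashed, striped = distribute(1000, 4, SRC, DST)
print("hash policy :", dict(hashed))    # one slave carries the whole flow
print("round-robin :", dict(striped))   # the flow is spread over all slaves
```

With a hash over fixed addresses, every packet of the flow maps to the
same slave, so a single TCP session cannot exceed one link's bandwidth;
round-robin spreads the same packets evenly over all slaves.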
Also, I looked into the net-next code. It seems to me that the patch can
be adapted to the net-next bonding code without any difficulty, and the
hashing function change poses no problem there.
What I've written below is just my personal experience and opinion after
5 years of using Oracle + iSCSI + mpath (later, patched bonding).
From my personal experience I can only say that most iSCSI failures are
caused by link failures. I would also never send any significant
iSCSI traffic through a router, because the router would be a bottleneck.
So, in my case iSCSI traffic flows within one ethernet domain, and in
case of a link failure the bonding driver simply fails one slave,
instead of checking and failing hundreds of paths (as with mpath). The
first case consumes significantly less CPU, network and time (if using
the default mpath checker, readsector0).
Mpath is good for me when I use it to "merge" drbd mirrors from
different hosts, but for simple load striping within a single L2
network switch between 2 .. 16 hosts it is overkill (particularly
when it comes to maintaining human-readable device naming) :).
John, what is your opinion on this load balancing method in general,
without referring to particular use cases?
>
> You could tweak your scsi timeout values and fail_fast values, set the io
> retry to 0 to cause the fail over to occur faster. I suspect you already
> did this and still it is too slow? Maybe adding a checker in multipathd to
> listen for link events would be fast enough. The checker could then fail
> the path immediately.
>
> I'll try to address your comments from the other thread here. In general I
> wonder if it would be better to solve the problems in dm-multipath rather than
> add another bonding mode?
Of course I did this, but mpath is fine only while the device count is
below 30-40 devices with two paths; 150-200 devices with 2+ paths can
make life far more interesting :)
>
> OVU - it is slow (I am using iSCSI for Oracle, so I need to minimize latency)
>
> The dm-multipath layer is adding latency? How much? If this is really true
> maybe it's best to address the real issue here and not avoid it by
> using the bonding layer.
I do not remember the exact number now, but switching one of my databases
to bonding about 2 years ago increased read throughput for the entire db
from 15-20 Tb/day to approximately 30-35 Tb/day (4 iscsi initiators and
8 iscsi targets, 4 ethernet links for iSCSI on each host, all plugged into
one switch) because of "full" bandwidth use. Also, using bonding
simplifies network and application setup greatly (compared to mpath).
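For a rough sense of scale, the daily figures above can be turned into
an average line rate. The sketch below assumes "Tb" means terabytes
(10^12 bytes) and that the load is spread evenly over 24 hours; both
are assumptions, since the message does not say either. The function
name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day_to_gbit_per_s(tb_per_day: float) -> float:
    """Average rate in Gbit/s, assuming 1 TB = 10**12 bytes over 24 hours."""
    bits_per_day = tb_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


for tb in (15, 20, 30, 35):
    rate = tb_per_day_to_gbit_per_s(tb)
    print(f"{tb} TB/day is about {rate:.2f} Gbit/s averaged over a day")
```

These are day-long averages for the whole database; peak rates on the
individual links would be higher.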
>
> OVU - it handles any link failures badly, because of its command queue
> limitation(all queued commands above 32 are discarded in case of path
> failure, as I remember)
>
> Maybe true but only link failures with the immediate peer are handled
> with a bonding strategy. By working at the block layer we can detect
> failures throughout the path. I would need to look into this again I
> know when we were looking at this sometime ago there was some talk about
> improving this behavior. I need to take some time to go back through the
> error recovery stuff to remember how this works.
>
> OVU - it performs very badly when there are many devices and many paths (I was
> unable to utilize more than 2Gbps of 4 even with 100 disks with 4 paths
> per disk)
Well, I think that behavior can be explained as follows:
when balancing by the number of I/Os per path (rr_min_io) with a huge
number of devices, mpath does load balancing per device, and it is
not possible to guarantee equal use of all devices. So there will be
an imbalance across the network interfaces (mpath is unaware of their
existence), and it likely becomes more imbalanced when there
are many devices. Also, counting I/Os for many devices and paths
consumes some CPU resources and can cause excessive context switches.
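The explanation above can be sketched with a toy model in Python. It is
not a model of the real dm-multipath path selector: it assumes one path
per NIC, that each device independently sends rr_min_io I/Os down a
path before moving to the next, and that devices start on staggered
paths. The function names and the I/O counts are hypothetical.

```python
def nic_load(io_counts, n_paths, rr_min_io):
    """Return I/Os carried per NIC when each device round-robins on its own."""
    load = [0] * n_paths
    for dev, ios in enumerate(io_counts):
        path = dev % n_paths  # assumption: devices start on staggered paths
        remaining = ios
        while remaining > 0:
            batch = min(rr_min_io, remaining)  # up to rr_min_io I/Os per turn
            load[path] += batch
            remaining -= batch
            path = (path + 1) % n_paths
    return load


def imbalance(load):
    """Busiest NIC divided by the mean load (1.0 means perfectly even)."""
    return max(load) / (sum(load) / len(load))


# Eight devices with identical load: every NIC carries the same amount.
even = nic_load([400] * 8, n_paths=4, rr_min_io=100)

# Four light devices and one busy one: per-device balancing leaves the
# NICs unevenly loaded, because no device sees the other devices' I/O.
skewed = nic_load([100, 100, 100, 100, 1000], n_paths=4, rr_min_io=100)

print(even, imbalance(even))
print(skewed, imbalance(skewed))
```

Balancing per interface, as bonding does, does not depend on how the
I/O is split among devices.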
>
> Hmm well that seems like something is broken. I'll try this setup when
> I get some time in the next few days. This really shouldn't be the case; dm-multipath
> should not add a bunch of extra latency or affect throughput significantly.
> By the way what are you seeing without mpio?
And one more observation from my 2-year-old tests: reading a device (using
dd) (rhel 5 update 1 kernel, ramdisk via iSCSI via loopback) as an mpath
device with a single path ran at approximately 120-150mb/s, and the same
test on a non-mpath device at 800-900mb/s. Here I am quite sure; it was
something of a revelation to me at the time.
>
> Thanks,
> John
>
--
Best regards,
Oleg Ukhno.
ITO Team Lead,
Yandex LLC.