[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20071118180933.GA17103@lemming.cita.utoronto.ca>
Date: Sun, 18 Nov 2007 13:09:33 -0500
From: Robin Humble <rjh@...a.utoronto.ca>
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: Jeff Garzik <jeff@...zik.org>,
Nick Piggin <nickpiggin@...oo.com.au>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
netdev@...r.kernel.org, trond.myklebust@....uio.no
Subject: Re: [PATCH 00/33] Swap over NFS -v14
<apologies for being insanely late into this thread>
On Wed, Oct 31, 2007 at 01:56:53PM +0100, Peter Zijlstra wrote:
>On Wed, 2007-10-31 at 08:16 -0400, Jeff Garzik wrote:
>> Thoughts:
>> 1) I absolutely agree that NFS is far more prominent and useful than any
>> network block device, at the present time.
>>
>> 2) Nonetheless, swap over NFS is a pretty rare case. I view this work
>> as interesting, but I really don't see a huge need, for swapping over
>> NBD or swapping over NFS. I tend to think swapping to a remote resource
>> starts to approach "migration" rather than merely swapping. Yes, we can
>> do it... but given the lack of burning need one must examine the price.
>
>There is a large corporate demand for this, which is why I'm doing this.
>
>The typical usage scenarios are:
> - cluster/blades, where having local disks is a cost issue (maintenance
> of failures, heat, etc)
HPC clusters are increasingly diskless, especially at the high end.
for all the reasons you mention, but also because networks are faster
than disks.
>But please, people who want this (I'm sure some of you are reading) do
>speak up. I'm just the motivated corporate drone implementing the
>feature :-)
swap to iSCSI has worked well in the past with your anti-deadlock
patches, and I'd definitely like to see that continue and to be merged
into mainline!! swap-to-network is a highly desirable feature for
modern clusters.
performance and scalability of NFS is poor, so it's not a good option.
actually swap to a file on Lustre(*) would be best, but iSER and iSCSI
would be my next choices. iSER is better than iSCSI as it's ~5x faster
in practice, and InfiniBand seems to be here to stay.
hmmm - any idea what the issues are with RDMA in low memory situations?
presumably if DMA regions are mapped early then there's not actually
much of a problem? I might try it with tgtd's iSER...
cheers,
robin
(*) obviously not your responsibility. although Lustre (Sun/CFS) could
presumably use your infrastructure once you have it in mainline.
>> 3) You note
>> > Swap over network has the problem that the network subsystem does not use fixed
>> > sized allocations, but heavily relies on kmalloc(). This makes mempools
>> > unusable.
>>
>> True, but IMO there are mitigating factors that should be researched and
>> taken into account:
>>
>> a) To give you some net driver background/history, most mainstream net
>> drivers were coded to allocate RX skbs of size 1538, under the theory
>> that they would all be allocating out of the same underlying slab cache.
>> It would not be difficult to update a great many of the [non-jumbo]
>> cases to create a fixed size allocation pattern.
>
>One issue that comes to mind is how to ensure we'd still overflow the
>IP-reassembly buffers. Currently those are managed on the number of
>bytes present, not the number of fragments.
>
>One of the goals of my approach was to not rewrite the network subsystem
>to accomodate this feature (and I hope I succeeded).
>
>> b) Spare-time experiments and anecdotal evidence points to RX and TX skb
>> recycling as a potentially valuable area of research. If you are able
>> to do something like that, then memory suddenly becomes a lot more
>> bounded and predictable.
>>
>>
>> So my gut feeling is that taking a hard look at how net drivers function
>> in the field should give you a lot of good ideas that approach the
>> shared goal of making network memory allocations more predictable and
>> bounded.
>
>Note that being bounded only comes from dropping most packets before
>trying them to a socket. That is the crucial part of the RX path, to
>receive all packets from the NIC (regardless their size) but to not pass
>them on to the network stack - unless they belong to a 'special' socket
>that promises undelayed processing.
>
>Thanks for these ideas, I'll look into them.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists