linux-kernel - Re: [PATCH 00/33] Swap over NFS -v14

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20071118180933.GA17103@lemming.cita.utoronto.ca>
Date:	Sun, 18 Nov 2007 13:09:33 -0500
From:	Robin Humble <rjh@...a.utoronto.ca>
To:	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc:	Jeff Garzik <jeff@...zik.org>,
	Nick Piggin <nickpiggin@...oo.com.au>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	netdev@...r.kernel.org, trond.myklebust@....uio.no
Subject: Re: [PATCH 00/33] Swap over NFS -v14

<apologies for being insanely late into this thread>

On Wed, Oct 31, 2007 at 01:56:53PM +0100, Peter Zijlstra wrote:
>On Wed, 2007-10-31 at 08:16 -0400, Jeff Garzik wrote:
>> Thoughts:
>> 1) I absolutely agree that NFS is far more prominent and useful than any 
>> network block device, at the present time.
>> 
>> 2) Nonetheless, swap over NFS is a pretty rare case.  I view this work 
>> as interesting, but I really don't see a huge need, for swapping over 
>> NBD or swapping over NFS.  I tend to think swapping to a remote resource 
>> starts to approach "migration" rather than merely swapping.  Yes, we can 
>> do it...  but given the lack of burning need one must examine the price.
>
>There is a large corporate demand for this, which is why I'm doing this.
>
>The typical usage scenarios are:
> - cluster/blades, where having local disks is a cost issue (maintenance
>   of failures, heat, etc)

HPC clusters are increasingly diskless, especially at the high end.
for all the reasons you mention, but also because networks are faster
than disks.

>But please, people who want this (I'm sure some of you are reading) do
>speak up. I'm just the motivated corporate drone implementing the
>feature :-)

swap to iSCSI has worked well in the past with your anti-deadlock
patches, and I'd definitely like to see that continue and to be merged
into mainline!! swap-to-network is a highly desirable feature for
modern clusters.

performance and scalability of NFS is poor, so it's not a good option.

actually swap to a file on Lustre(*) would be best, but iSER and iSCSI
would be my next choices. iSER is better than iSCSI as it's ~5x faster
in practice, and InfiniBand seems to be here to stay.

hmmm - any idea what the issues are with RDMA in low memory situations?
presumably if DMA regions are mapped early then there's not actually
much of a problem? I might try it with tgtd's iSER...

cheers,
robin

(*) obviously not your responsibility. although Lustre (Sun/CFS) could
presumably use your infrastructure once you have it in mainline.


>> 3) You note
>> > Swap over network has the problem that the network subsystem does not use fixed
>> > sized allocations, but heavily relies on kmalloc(). This makes mempools
>> > unusable.
>> 
>> True, but IMO there are mitigating factors that should be researched and 
>> taken into account:
>> 
>> a) To give you some net driver background/history, most mainstream net 
>> drivers were coded to allocate RX skbs of size 1538, under the theory 
>> that they would all be allocating out of the same underlying slab cache. 
>>   It would not be difficult to update a great many of the [non-jumbo] 
>> cases to create a fixed size allocation pattern.
>
>One issue that comes to mind is how to ensure we'd still overflow the
>IP-reassembly buffers. Currently those are managed on the number of
>bytes present, not the number of fragments.
>
>One of the goals of my approach was to not rewrite the network subsystem
>to accomodate this feature (and I hope I succeeded).
>
>> b) Spare-time experiments and anecdotal evidence points to RX and TX skb 
>> recycling as a potentially valuable area of research.  If you are able 
>> to do something like that, then memory suddenly becomes a lot more 
>> bounded and predictable.
>> 
>> 
>> So my gut feeling is that taking a hard look at how net drivers function 
>> in the field should give you a lot of good ideas that approach the 
>> shared goal of making network memory allocations more predictable and 
>> bounded.
>
>Note that being bounded only comes from dropping most packets before
>trying them to a socket. That is the crucial part of the RX path, to
>receive all packets from the NIC (regardless their size) but to not pass
>them on to the network stack - unless they belong to a 'special' socket
>that promises undelayed processing.
>
>Thanks for these ideas, I'll look into them.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/