Message-ID: <4CE6FE65.9040302@hp.com>
Date: Fri, 19 Nov 2010 14:47:01 -0800
From: Rick Jones <rick.jones2@...com>
To: David Miller <davem@...emloft.net>
CC: therbert@...gle.com, netdev@...r.kernel.org
Subject: Re: Generalizing mmap'ed sockets
David Miller wrote:
> From: Rick Jones <rick.jones2@...com>
> Date: Fri, 19 Nov 2010 13:58:21 -0800
>
>
>>David Miller wrote:
>>
>>>From: Rick Jones <rick.jones2@...com>
>>>Date: Fri, 19 Nov 2010 13:32:57 -0800
>>>
>>>
>>>>I suppose then one would be able to track the consumer pointer (on tx)
>>>>to "know" that certain data had been ACKed by the remote? For TCP
>>>>anyway - and assuming there wouldn't be a case where TCP might copy
>>>>the data out of the ring and assert "completion."
>>>
>>>Yes, that's implicit in his design, the kernel manages the consumer
>>>pointer in the ring and this is how userspace can see when ring entries
>>>are reusable.
>>
>>But does one really want to lock-in that the update to the consumer
>>pointer means the data has been ACKed by the remote (or I suppose that
>>DMA have completed if it were UDP)?
>
>
> I think the ACK (or for UDP, the kfree_skb() after TX completes) should
> move the consumer pointer. Otherwise you have to copy, and the ACKs
> do not clock the sender process properly.

I'm not worried about the ACK/kfree_skb() moving the pointer. I'm simply
worried about what the application should infer from the pointer's movement.
That is, if the design is documented as "Movement of the consumer pointer
implies that the corresponding data has been ACKed by the remote TCP", that
locks the design into a semantic I don't know that it will always want to
maintain, because there may end up being some cases where the stack does want
to copy and so cannot maintain the "pointer update means the remote TCP has
the data" semantic.
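
To make the distinction concrete, here is a rough sketch of the weaker
contract I have in mind. Everything in it (struct tx_ring, the function
names, the free-running counters) is made up for illustration and is not
taken from Tom's proposal. It is single-threaded; a real shared ring would
need memory barriers around the index updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout; none of these names come from the proposal. */
struct tx_ring {
    uint32_t prod;   /* advanced by the application when it queues a slot */
    uint32_t cons;   /* advanced by the kernel when it is done with a slot */
    uint32_t nslots; /* ring capacity, assumed to be a power of two */
};

/* Number of slots the application may fill right now. */
static uint32_t tx_ring_free(const struct tx_ring *r)
{
    /* prod and cons are free-running; unsigned wrap keeps this correct. */
    return r->nslots - (r->prod - r->cons);
}

/* Application side: claim the next slot. Returns false if the ring is full. */
static bool tx_ring_queue(struct tx_ring *r, uint32_t *slot)
{
    if (tx_ring_free(r) == 0)
        return false;
    *slot = r->prod & (r->nslots - 1);
    r->prod++;
    return true;
}

/*
 * Stand-in for the kernel releasing n slots. The only documented promise is
 * "these slots may be reused". Whether that happened because of an ACK, a TX
 * completion, or because the stack copied the data is deliberately not part
 * of the contract.
 */
static void tx_ring_release(struct tx_ring *r, uint32_t n)
{
    assert(n <= r->prod - r->cons);
    r->cons += n;
}
```

With that wording the application only ever asks "how many slots are free?"
and never reads anything about the remote TCP into the answer.
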
> But you do bring up an interesting point about TX buffer space sizing.
>
> This whole scheme currently seems to completely ignore buffer size
> auto-tuning done by TCP, and that won't fly I think. :-)
>
> The whole point is to make it so that applications do not need to know
> about that aspect of buffering at all. With the current mmap design
> we're back to the stone ages where the app essentially has to pick an
> explicit send buffer size.

In some ways, the stone ages were nicer :)

What if... :) the stack had a way to communicate to the application that it
wanted to change the effective socket buffer size? If that is indeed
sufficiently infrequent, perhaps a "signal the new size and the app does a
fresh mmap()" mechanism would suffice. The app would, I presume, need to first
wait for the existing ring to drain, which could cause some complications I
suppose. Is there a way to flip the sense and have the kernel allocate the
ring(s) and communicate that to the application?
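
A rough sketch of how such a "signal, drain, remap" handshake might look from
the application's side. All of the names (struct rz_ring, want_nslots, the
rz_* functions) are invented here, and the actual munmap()/mmap() step is only
indicated by a comment; this is a simulation of the state changes, not a
proposed interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring header shared between the stack and the application. */
struct rz_ring {
    uint32_t prod;        /* application's producer index */
    uint32_t cons;        /* kernel's consumer index */
    uint32_t nslots;      /* current ring size */
    uint32_t want_nslots; /* 0 = no resize pending; otherwise requested size */
};

/* Simulated kernel side: ask the application to move to a new ring size. */
static void rz_request(struct rz_ring *r, uint32_t nslots)
{
    assert(nslots != 0);
    r->want_nslots = nslots;
}

/* Application side: queue nothing new while a resize is pending. */
static bool rz_may_queue(const struct rz_ring *r)
{
    return r->want_nslots == 0 && (r->prod - r->cons) < r->nslots;
}

/*
 * Application side: adopt the new size, but only once the old ring has
 * drained. Returns true if the resize was applied. A real implementation
 * would munmap() the old ring and mmap() a fresh one at this point.
 */
static bool rz_try_apply(struct rz_ring *r)
{
    if (r->want_nslots == 0 || r->prod != r->cons)
        return false;
    r->nslots = r->want_nslots;
    r->want_nslots = 0;
    r->prod = r->cons = 0;
    return true;
}
```

The complication shows up in rz_may_queue(): the sender stalls from the moment
the request arrives until the old ring has fully drained.
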

But doesn't the whole idea of having an explicitly mmap()ed area of memory fly
in the face of autotuning to begin with? (Mind you, I've not always been a fan
of autotuning, as some of my previous "Why is it growing the window so
large?!?" posts will attest :) It suggests that the application has some
"communications memory" (that it won't be itself copying to/from) and
presumably knows or thinks it knows how much of that it needs. For all we know,
Tom is thinking that this mmap()ed region of memory will be rather larger than
the maximum autotuned socket buffer sizes in the first place. Going back to his
initial email, I don't see anything that explicitly describes the relationship
between the size of this mmap()ed region and the socket buffer sizes - I was
just ass-u-me-ing it would set them. Sure, it would have to be an effective
upper bound for copy-less transmit and receive, but there is nothing that says
the windows TCP is using have to be that large.
rick