Message-ID: <AANLkTi=WF2hAGAufv_Anc=b=Fm2WOpOMOv1UrDRvaTHp@mail.gmail.com>
Date: Fri, 19 Nov 2010 12:04:31 -0800
From: Tom Herbert <therbert@...gle.com>
To: Linux Netdev List <netdev@...r.kernel.org>
Subject: Generalizing mmap'ed sockets

This is a project I'm contemplating. If you have any comments or can
point me to prior work in this area, that would be appreciated.

It seems like it should be fairly straightforward to extend the mmap
packet ring mechanisms to be used for arbitrary sockets (like TCP,
UDP, etc.). The idea is that we create a ring buffer for a socket
which is mmap'ed to share between user and kernel. This can be done
for both transmit and receive side, and is basically modeled as a
consumer/producer queue. There are semantic differences between
stream and datagram sockets that need to be considered, but I don't
think anything here is untenable.
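
To make the consumer/producer model concrete, here is a minimal sketch of the index accounting such a ring might use. It is an illustration only, not part of the proposal: it assumes both pointers are free-running 32-bit byte counters (the proposal does not say whether they are counters or offsets), and the names ring_hdr, ring_used and ring_free are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ring header; field names follow the proposal's sketch.
 * Assumption: both fields are free-running byte counters. */
struct ring_hdr {
        uint32_t prod_ptr;      /* total bytes produced so far */
        uint32_t consumer_ptr;  /* total bytes consumed so far */
};

/* Bytes currently queued. Unsigned subtraction stays correct even
 * after the 32-bit counters wrap around. */
static uint32_t ring_used(const struct ring_hdr *h)
{
        return h->prod_ptr - h->consumer_ptr;
}

/* Bytes the producer may still write into a data area of ring_size bytes. */
static uint32_t ring_free(const struct ring_hdr *h, uint32_t ring_size)
{
        uint32_t used = ring_used(h);

        assert(used <= ring_size);      /* the producer must never overrun */
        return ring_size - used;
}
```

With a 4096-byte data area, after the producer queues 1000 bytes and the consumer drains 400 of them, ring_used() returns 600 and ring_free() returns 3496. On TX the application is the producer and the kernel the consumer; on RX the roles are reversed.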
The expected benefits of this are:
TX:
- Zero copy transmit (which is already supported by vmsplice(), but
this might be simpler)
- One system call needed on transmit, which can cover multiple
datagrams or what would have been multiple writes (the call just
kicks the kernel to start sending)
RX:
- Zero system calls needed to do receive (determining data ready is
accomplished by polling)
- Immediate data placement in kernel available all the time,
including OOO placement
- Potential for true zero copy on receive with device support (like
per flow queues, UDP queues)
The userland use of this for TCP might look something like:

struct mmap_sock_hdr {
        __u32 prod_ptr;
        __u32 consumer_ptr;
};

int s, i;
size_t size;
struct mmap_sock_hdr *tx, *rx;
unsigned char *tx_base, *rx_base;
struct s_mmap_req {
        size_t size;
} mmap_req;

s = socket(AF_INET, SOCK_STREAM, 0);

/* Set up ring buffer on socket and mmap into user space for TX */
size = (1 << 19) - sizeof(struct mmap_sock_hdr);
mmap_req.size = size;
setsockopt(s, SOL_SOCKET, TX_RING, (char *)&mmap_req,
           sizeof(mmap_req));
tx = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, s, 0);
tx_base = (unsigned char *)&tx[1]; /* Data area follows the header */

/* Now do same thing for RX */
size = (1 << 19) - sizeof(struct mmap_sock_hdr);
mmap_req.size = size;
setsockopt(s, SOL_SOCKET, RX_RING, (char *)&mmap_req,
           sizeof(mmap_req));
rx = mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, s, 0);
rx_base = (unsigned char *)&rx[1]; /* Data area follows the header */

bind(s, ...);    /* Normal bind */
connect(s, ...); /* Normal connect */

/* Transmit */
/* Application fills some of the available buffer (up to consumer pointer) */
for (i = 0; i < 10000; i++)
        tx_base[tx->prod_ptr + i] = i % 256;
/* Advance producer pointer */
tx->prod_ptr += 10000;
send(s, NULL, 0, 0); /* Tells stack to send new data indicated by prod
                        pointer, just a trigger */
/* Polling for POLLOUT should work as expected */

/* Receive */
while (1) {
        poll(fds, 1, -1); /* Assuming fds[0].fd == s, events = POLLIN */
        if (fds[0].revents & POLLIN) {
                /* Process data from rx_base[rx->consumer_ptr] to
                   rx_base[rx->prod_ptr], modulo size of buffer of course */
                rx->consumer_ptr = rx->prod_ptr; /* Gives back buffer space
                                                    to the kernel */
        }
}
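
The "modulo size of buffer" step can be sketched as a pair of copy helpers that split a transfer at the end of the data area. This is a hedged illustration, not the proposed kernel interface: ring_copy_in and ring_copy_out are hypothetical names, the indices are assumed to be free-running byte counters, and ring_size is assumed to be a power of two so that the modulo stays consistent when a 32-bit counter wraps.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy len bytes from src into the ring, starting
 * at free-running index idx and wrapping at the end of the data area. */
static void ring_copy_in(unsigned char *base, uint32_t ring_size,
                         uint32_t idx, const void *src, uint32_t len)
{
        uint32_t off = idx % ring_size;         /* offset inside data area */
        uint32_t first = ring_size - off;       /* bytes before the wrap */

        assert(len <= ring_size);
        if (first > len)
                first = len;
        memcpy(base + off, src, first);
        memcpy(base, (const unsigned char *)src + first, len - first);
}

/* Hypothetical helper: copy len bytes out of the ring into dst,
 * starting at free-running index idx, with the same wrap handling. */
static void ring_copy_out(void *dst, const unsigned char *base,
                          uint32_t ring_size, uint32_t idx, uint32_t len)
{
        uint32_t off = idx % ring_size;
        uint32_t first = ring_size - off;

        assert(len <= ring_size);
        if (first > len)
                first = len;
        memcpy(dst, base + off, first);
        memcpy((unsigned char *)dst + first, base, len - first);
}
```

For example, writing the six bytes "abcdef" at index 6 of an 8-byte ring places "ab" in slots 6 and 7 and "cdef" in slots 0 to 3. A real ring shared with the kernel would also need memory barriers around the pointer updates, which this sketch leaves out.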