Date:	Tue, 11 Nov 2008 11:05:07 -0800
From:	"Lovich, Vitali" <vlovich@...lcomm.com>
To:	Johann Baudy <johaahn@...il.com>
CC:	Evgeniy Polyakov <zbr@...emap.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: RE: [PATCH] Packet socket: mmapped IO: PACKET_TX_RING

Hi Johann,

> Ok, so we can skip below block according to this assumption:
> 		for (i = 0; i < req->tp_block_nr; i++) {
> 			void *ptr = pg_vec[i];
> 			int k;
> 
> 			for (k = 0; k < rb->frames_per_block; k++) {
> 				__packet_set_status(po, ptr, TP_STATUS_KERNEL);
> 				ptr += req->tp_frame_size;
> 			}
> 		}
>
Yes.

> >
> > Earlier, I think I had said that the mapped variable should be placed
> > in packet_ring_buffer struct - I was wrong.  It's correct where it is.
> > I had forgotten that rx & tx are mapped contiguously and
> > simultaneously.  There needs to be a note in the doc that the rings
> > should be set up before mapping (i.e. you cannot set up the rx or tx
> > ring buffer first, mmap, and then try to change it).
> 
> Yes, we can also add a check for po->mapped in
> PACKET_TX_RING/PACKET_RX_RING cases of packet_setsockopt() switch.
Since those call into packet_set_ring(), we can just add the check there - there's no need to complicate packet_setsockopt() unnecessarily.
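For illustration, the guard in packet_set_ring() could look roughly like this. This is a userspace sketch with made-up names (packet_sock_sketch, packet_set_ring_sketch) and C11 atomics standing in for the kernel's atomic_t - not the actual kernel code:

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the relevant packet_sock state. */
struct packet_sock_sketch {
	atomic_int mapped;          /* nonzero while the rings are mmapped */
	atomic_int tx_pending_skb;  /* skbs handed off, not yet destructed */
};

/* Sketch: reject ring setup/resize while mapped or while sends are pending. */
static int packet_set_ring_sketch(struct packet_sock_sketch *po)
{
	if (atomic_load(&po->mapped))
		return -EBUSY;          /* user still has the rings mapped (1.a) */
	if (atomic_load(&po->tx_pending_skb))
		return -EBUSY;          /* outstanding tx skbs reference the ring (1.b) */
	/* ... actual (re)allocation of the ring would go here ... */
	return 0;
}
```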

> 
> 
> > We still need a check for tx_pending_skb when sizing the ring
> > buffer (in the situation where there are unfreed skbs created by a
> > call to send from Thread B, while Thread A unmaps the ring buffer and
> > tries a resize). Also, in tpacket_snd, we need to increment mapped
> > before allocating the skb and test if the original value was 0 (we
> > decrement after incrementing tx_pending_skb).  If it was, then that
> > indicates that the user had unmapped the ring buffer, which means we
> > should stop sending.  No need for additional locking because of the
> > atomic relationship this sets up between tx_pending_skb and mapped.
> 
> I'm not sure I understand:
Sorry - I do realize I rambled a bit.  I'll try to clarify.

> 
> Let me, sum up :
> * when setting or resizing the ring: we have to return an error if:
> - someone has mapped the socket (1.a)
> - sending is ongoing - tx_pending_skb != 0 (1.b)
> - a send() procedure is ongoing (1.c)
Yes.

> 
> * when closing the socket, we have to return an error if:
> - sending is ongoing - tx_pending_skb != 0 (2.a)
> - a send() procedure is ongoing (2.b)
Only 2.a should be a reason to fail to close.  With 2.b, send should abort, recognizing that the socket is closing, without sending anything out.  Realistically, however, it's probably too hard to coordinate, since 2.b causes 2.a.  So a more realistic approach is that closing never fails.  Instead, we add a flag field to packet_sock.  In send, we atomically set bit 0; when closing, we atomically set bit 1.  If, while closing, bit 0 is also set, then we defer the actual closing of the socket to send.  If send sees bit 1 and bit 0 set and tx_pending_skb == 0, then it frees the socket itself.  Otherwise, the skb destructor is responsible for closing the socket once tx_pending_skb drops to 0 and bit 1 is set.  Calls to send fail if bit 1 is set.
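The deferred-close half of that protocol could be sketched like this. Again a userspace model with hypothetical names (pkt_sock_sketch, do_free), using C11 atomics where the kernel would use set_bit()/test_bit() on a flags word; the skb-destructor path is elided:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define FLAG_SENDING  (1u << 0)  /* bit 0: a send() call is in progress */
#define FLAG_CLOSING  (1u << 1)  /* bit 1: close() has been requested */

struct pkt_sock_sketch {
	atomic_uint flags;
	atomic_int  tx_pending_skb;
	bool        freed;           /* stands in for the final socket teardown */
};

static void do_free(struct pkt_sock_sketch *po) { po->freed = true; }

/* close(): always "succeeds".  If a send is in flight, defer the real
 * teardown to the send path (or, with pending skbs, the skb destructor). */
static void sock_close_sketch(struct pkt_sock_sketch *po)
{
	unsigned prev = atomic_fetch_or(&po->flags, FLAG_CLOSING);
	if (!(prev & FLAG_SENDING) && atomic_load(&po->tx_pending_skb) == 0)
		do_free(po);             /* nothing in flight: free immediately */
}

/* Tail of the send path: drop FLAG_SENDING and finish a deferred close. */
static void send_done_sketch(struct pkt_sock_sketch *po)
{
	unsigned prev = atomic_fetch_and(&po->flags, ~FLAG_SENDING);
	if ((prev & FLAG_CLOSING) && atomic_load(&po->tx_pending_skb) == 0)
		do_free(po);             /* close() ran while we were sending */
}
```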
> 	
> * during do-while loop of send(),:
> - protect ring against simultaneous send() (3.a)
> - ensure that ring is not resized (3.b) (linked with 1.c)
> Our main issue is when tx_pending_skb == 0: Thread A starts to send a
> packet while Thread B unmaps and resizes the ring.
> Thread B can potentially resize the ring until the increment of
> tx_pending_skb in Thread A.
> - the loop can continue even if the ring is not mapped (3.c); this
> makes no sense, but it is not critical
3.a can be protected through the flag field as above - atomically set bit 0 if it's not already set (test_and_set_bit); otherwise, return -EBUSY or a similar error code.  With that protection in place, atomically increment mapped (atomic_inc_return).  If the result is 1, then abort the send, because the user has unmapped the memory.  Don't forget to atomically decrement mapped again (and clear flag bit 0) before exiting from send, regardless of the outcome.
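The entry/exit of the send path described above could be sketched as follows. Hypothetical names throughout (tx_sock_sketch, send_enter_sketch, and -ENXIO as an arbitrary "ring unmapped" error), with C11 atomics in place of the kernel's test_and_set_bit()/atomic_inc_return():

```c
#include <errno.h>
#include <stdatomic.h>

#define FLAG_SENDING (1u << 0)   /* hypothetical bit 0: send in progress */

struct tx_sock_sketch {
	atomic_uint flags;
	atomic_int  mapped;   /* 0 once the user has unmapped the rings */
};

/* Entry of the send path (3.a plus the unmap race check). */
static int send_enter_sketch(struct tx_sock_sketch *po)
{
	/* 3.a: exclude a concurrent send() - test-and-set bit 0. */
	if (atomic_fetch_or(&po->flags, FLAG_SENDING) & FLAG_SENDING)
		return -EBUSY;

	/* Bump 'mapped' and inspect the new value (kernel: atomic_inc_return).
	 * A result of 1 means it was 0, i.e. the user unmapped the ring. */
	if (atomic_fetch_add(&po->mapped, 1) + 1 == 1) {
		atomic_fetch_sub(&po->mapped, 1);              /* undo the bump */
		atomic_fetch_and(&po->flags, ~FLAG_SENDING);   /* clear bit 0 */
		return -ENXIO;
	}
	return 0;
}

/* Exit path: always decrement and clear the flag bit again. */
static void send_exit_sketch(struct tx_sock_sketch *po)
{
	atomic_fetch_sub(&po->mapped, 1);
	atomic_fetch_and(&po->flags, ~FLAG_SENDING);
}
```

The point of incrementing mapped rather than merely reading it is that a concurrent munmap/resize in Thread B cannot slip in between the check and the use: once the increment has landed, mapped is nonzero and the resize path's check fails.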

> 
> * when executing the destructor callback:
> - we must ensure that ring structure has not changed since send() (4.a)
This is a consequence of the above logic. 
> 
> (1.a) can be done with a po->mapped check in packet_setsockopt()
> (1.b) and (2.a) can be done as you said: a tx_pending_skb != 0 check
> when sizing the buffer in packet_set_ring()
> (3.a) by a mutex M outside of the do-while loop
> (1.c), (2.b) and (3.b) can be handled with the same mutex M and
> mutex_trylock() in packet_set_ring(). An error is returned if we can't
> take the mutex. The mutex is released at the end of the
> packet_set_ring() call.
> (4.a) is covered by the tx_pending_skb increment, (3.b), (1.b) and (2.a)
> 
> Do I miss something?
I hope I covered everything above.  I'm not sure whether send, setsockopt, etc. run in an interrupt context; my guess is they do, and thus mutexes are not allowed - spinlocks only.  However, there's also no need for mutexes or spinlocks, because I believe the atomic algorithm above should work.  I should be able to post a patch by the end of the day.

Thanks,
Vitali
