Date:	Sun, 13 Mar 2016 12:29:25 +0200
From:	achiad shochat <achiad.mellanox@...il.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	Saeed Mahameed <saeedm@....mellanox.co.il>,
	Saeed Mahameed <saeedm@...lanox.com>,
	"David S. Miller" <davem@...emloft.net>,
	Linux Netdev List <netdev@...r.kernel.org>,
	Or Gerlitz <ogerlitz@...lanox.com>,
	Eran Ben Elisha <eranbe@...lanox.com>,
	Tal Alon <talal@...lanox.com>,
	Tariq Toukan <tariqt@...lanox.com>,
	Jesper Dangaard Brouer <brouer@...hat.com>
Subject: Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support
 for RX multi packet WQE

On 11 March 2016 at 21:58, Eric Dumazet <eric.dumazet@...il.com> wrote:
> On Fri, 2016-03-11 at 21:25 +0200, Saeed Mahameed wrote:
>> >> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>> >> +static void mlx5e_add_skb_frag(struct sk_buff *skb, int len, struct page *page,
>> >> +                            int page_offset)
>> >> +{
>> >> +     int f = skb_shinfo(skb)->nr_frags++;
>> >> +     skb_frag_t *fr = &skb_shinfo(skb)->frags[f];
>> >> +
>> >> +     skb->len += len;
>> >> +     skb->data_len += len;
>> >> +     get_page(page);
>> >> +     skb_frag_set_page(skb, f, page);
>> >> +     skb_frag_size_set(fr, len);
>> >> +     fr->page_offset = page_offset;
>> >> +     skb->truesize  = SKB_TRUESIZE(skb->len);
>> >> +}
>> >
>> > Really I am speechless.
>> >
>> > It is hard to believe how much effort some driver authors spend
>> > trying to fool the Linux stack and risk OOMing a host under stress.
>>
>> Eric, you got it all wrong, my friend; no one is trying to fool
>> anyone here. I will explain below.
>>
>> >
>> > SKB_TRUESIZE() is absolutely not something a driver is allowed to use.
>> >
>> > Here you want instead :
>> >
>> > skb->truesize += PAGE_SIZE;
>> >
>> > That assumes you allocate and use an order-0 page per fragment. The
>> > fact that you receive, say, a 100-byte datagram is irrelevant to
>> > truesize.
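
For concreteness, the per-page accounting Eric is asking for would look
roughly like the sketch below, assuming one order-0 page per fragment
(the helper is illustrative, not code from the patch):

	static void add_rx_page_frag(struct sk_buff *skb, struct page *page,
				     int offset, int len)
	{
		/* caller is assumed to already hold a reference on page;
		 * charge a full page regardless of the payload length */
		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
				offset, len, PAGE_SIZE);
	}

skb_add_rx_frag() updates skb->len, skb->data_len and skb->truesize in
one step, so the driver never has to touch truesize directly.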
>>
>> Your assumption is wrong: we allocate as many pages as a WQE needs,
>> and a WQE can describe/handle up to 1024 packets which share the same
>> page/pages. So the skb's truesize should really cover only the strides
>> of that page it used, not the WHOLE page as you think.
>>
>> You can already see this in the previous patch.
>>
>> Each WQE (Receive Work Queue Element) covers 1024 strides, each 128B
>> in size; i.e., a packet of 128B or less consumes only one stride of
>> that WQE's page, and the next packets on that WQE use the following
>> strides of the same page.
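
To make the stride arithmetic concrete, a minimal sketch (the macro and
helper names are illustrative, not taken from the patch):

	#include <linux/kernel.h>	/* DIV_ROUND_UP */

	#define MPWRQ_STRIDE_SIZE	128	/* bytes per stride */

	static inline unsigned int strides_consumed(unsigned int pkt_len)
	{
		/* 1..128 bytes -> 1 stride, 129..256 -> 2 strides, ... */
		return DIV_ROUND_UP(pkt_len, MPWRQ_STRIDE_SIZE);
	}

So a 100-byte packet consumes one 128B stride, and the next packet on
the WQE starts at the following stride of the same page.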
>>
>> So, contrary to what you think, this new scheme is better than our
>> old one in terms of memory utilization: before, we wasted an MTU-sized
>> buffer per SKB/packet regardless of the real packet size; now each SKB
>> consumes only as many 128B strides as it needs, no more, no less.
>>
>> BTW, there will be only 16 WQEs per ring :), so this new approach
>> doesn't consume drastically more memory than the previous one, but it
>> sure can handle larger bursts of small packets.
>>
>> >
>> > truesize is the real memory usage of one skb, not the minimal size
>> > of an optimally allocated skb for a given payload.
>>
>> I totally agree with this; we should have reported skb->truesize +=
>> (consumed strides) * (stride size). But again, this is not as critical
>> as you think: in the worst case skb->truesize will be off by at most
>> 127B.
>
> Ouch. Really, you are completely wrong.
>
> If one skb has a fragment of a page and sits in a queue for a long
> time, it really uses a full page, because the remaining part of the
> page is not reusable. Only kmalloc(128) can deliver on the idea of
> other parts of the page being 'freed and reusable'.
>
> It is trivial for an attacker to make sure the host consumes one page
> + sk_buff + skb->head = 4096 + 256 + 512 bytes per packet, simply by
> sending out-of-order packets on TCP flows.
>
> It is very tempting to build special memory allocators, you know, but
> you have to understand that attackers are smart. Smarter than us.
>
> If you are now telling me you plan to allocate 131072-byte pages (1024
> strides of 128 bytes), then a smart attacker can actually bump skb
> truesize to 128KB.
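
Spelling out the figures in that attack scenario (these are Eric's
example numbers, not measurements):

	/* memory pinned per held out-of-order packet, per-page accounting */
	size_t pinned = 4096	/* order-0 page, unusable while the skb lives */
		      + 256	/* struct sk_buff */
		      + 512;	/* skb->head */
	/* ~4864 bytes pinned by a ~100-byte payload; if the backing
	 * allocation is instead 131072 bytes (1024 x 128B strides) and is
	 * held the same way, the cost of one pinned skb approaches 128KB */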
>
> Really, your RX allocation scheme is easily DoS-able.
>

It seems you did not understand the new scheme at all.
With the new scheme, each incoming packet uses only the fraction of the
page that it needs. This gives optimal memory utilization and locality.

So if one skb has a fragment of a page and sits in a queue for a long
time, it really does _not_ use a full page, because the remaining part
of the page _is_ usable for other incoming packets.

BTW, this new scheme was actually introduced in the previous patch
"[PATCH net-next 06/13] net/mlx5e: Support RX multi-packet WQE
(Striding RQ)" rather than in this one.
