netdev - Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling


Message-ID: <CANn89iJxk=7qqGyVMwo8p5vFtPLK49JY1eHMwwakOdCE+vnxXA@mail.gmail.com>
Date:   Sun, 12 Feb 2017 07:32:01 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     Tariq Toukan <ttoukan.linux@...il.com>
Cc:     Jesper Dangaard Brouer <brouer@...hat.com>,
        "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Martin KaFai Lau <kafai@...com>,
        Willem de Bruijn <willemb@...gle.com>,
        Brenden Blanco <bblanco@...mgrid.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [PATCH v2 net-next 00/14] mlx4: order-0 allocations and page recycling

On Sun, Feb 12, 2017 at 7:04 AM, Tariq Toukan <ttoukan.linux@...il.com> wrote:

>
> We consistently see this behavior: the higher the BW, the sharper the
> degradation.
>
> This is because the page-cache is of a fixed-size. Any fixed-size page-cache
> will always meet one of the following:
> 1) Too small to keep the pace when load is high.
> 2) Too big (in terms of memory footprint) when load is low.
>

So, we had order-0 allocations for years at Google, then made the
horrible mistake of rebasing our mlx4 driver onto the upstream one,
and we hit all these issues under load.

I decided to redo the work I did years ago and upstream it.

I have warned Mellanox in the past (for the cx-5 driver) that _any_
high order allocation strategy is nice in benchmarks, but terrible in
the face of real server workloads.
(And I am not even referring to malicious attacks.)

Think about what happens on real servers: on the order of 100,000 TCP
sockets are open.

Then some incast or outcast problem (MapReduce jobs are fond of this)
makes thousands of TCP sockets accumulate _millions_ of TCP messages
per second in their out-of-order queues.

There is no way you can hold millions of pages in mlx4 driver.
A "dynamic" page pool is going to fail very badly.
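
As a rough illustration of the scale involved (a back-of-envelope sketch,
not a measurement): assume each segment parked in an out-of-order queue
keeps one RX page pinned, so the driver cannot recycle it. The page size,
the segment counts, and the helper names `pinned_bytes` and `to_gib` are
assumptions made for this example only; they do not come from the mlx4
driver or from the patch series.

```python
# Back-of-envelope sketch; every number here is an illustrative assumption.
PAGE_SIZE = 4096  # bytes in an order-0 page (typical on x86-64)


def pinned_bytes(segments: int, order: int = 0) -> int:
    """Worst-case bytes pinned if each queued segment holds one page of 2**order base pages."""
    return segments * (PAGE_SIZE << order)


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    for segments in (1_000_000, 5_000_000):
        for order in (0, 3):
            gib = to_gib(pinned_bytes(segments, order))
            print(f"{segments:>9} segments, order-{order} pages: {gib:6.1f} GiB pinned")
```

Under these assumptions, one million queued segments already pin several
GiB of order-0 pages, and the worst case grows by 8x if each segment
pins an order-3 page instead.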

Sure, your iperf bench will look great. But who cares? Do you really
have customers dedicating hosts to run 1 iperf full time?

Make sure you run tests with 100,000 TCP sockets, and add small
network flaps, with 5% packet loss.
This is what we really care about here.

I will send v3 of the patch series. I really hope that it will go
in, because we at Google very much need it ASAP, and I would rather
not have to keep it private in our tree.

Do not focus on your benchmarks; that is marketing only.
Focus on the ability of the servers to _survive_ and continue their work.

You did not answer my questions, by the way.

ethtool -g eth0
ethtool -l eth0

Thanks.