linux-kernel - Re: ARM router NAT performance affected by random/unrelated commits

Message-ID: <20190522121730.fhswxkw4gbflkhei@shell.armlinux.org.uk>
Date:   Wed, 22 May 2019 13:17:30 +0100
From:   Russell King - ARM Linux admin <linux@...linux.org.uk>
To:     Rafał Miłecki <zajec5@...il.com>
Cc:     Network Development <netdev@...r.kernel.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linux-block@...r.kernel.org, John Crispin <john@...ozen.org>,
        Jonas Gorski <jonas.gorski@...il.com>,
        Jo-Philipp Wich <jo@...n.io>, Felix Fietkau <nbd@....name>
Subject: Re: ARM router NAT performance affected by random/unrelated commits

On Wed, May 22, 2019 at 01:51:01PM +0200, Rafał Miłecki wrote:
> On 21.05.2019 12:45, Russell King - ARM Linux admin wrote:
> > On Tue, May 21, 2019 at 12:28:48PM +0200, Rafał Miłecki wrote:
> >> I work on home routers based on Broadcom's Northstar SoCs. Those devices
> >> have ARM Cortex-A9 and most of them are dual-core.
> >>
> >> As for home routers, my main concern is network performance. That CPU
> >> isn't powerful enough to handle gigabit traffic, so all kinds of
> >> optimizations do matter. I noticed some unexpected changes in NAT
> >> performance when switching between kernels.
> >>
> >> My hardware is BCM47094 SoC (dual core ARM) with integrated network
> >> controller and external BCM53012 switch.
> >
> > Guessing, I'd say it's to do with the placement of code wrt cachelines.
> > You could try aligning some of the cache flushing code to a cache line
> > and see what effect that has.
> 
> Is System.map a good place to check the alignment of function code?
> 
> With Linux 4.19 + OpenWrt mtd patches I have:
> (...)
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> (...)
> c02ca3d0 T blk_mq_update_nr_hw_queues
> c02ca69c T blk_mq_alloc_tag_set
> c02ca94c T blk_mq_release
> c02ca9b4 T blk_mq_free_queue
> c02caa88 T blk_mq_update_nr_requests
> c02cab50 T blk_mq_unique_tag
> (...)
> 
> After cherry-picking 9316a9ed6895 ("blk-mq: provide helper for setting
> up an SQ queue and tag set"):
> (...)
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> (...)
> c02ca3d0 T blk_mq_update_nr_hw_queues
> c02ca69c T blk_mq_alloc_tag_set
> c02ca94c T blk_mq_init_sq_queue <-- NEW
> c02ca9c0 T blk_mq_release <-- Different address of this & all below
> c02caa28 T blk_mq_free_queue
> c02caafc T blk_mq_update_nr_requests
> c02cabc4 T blk_mq_unique_tag
> (...)
> 
> As you can see blk_mq_init_sq_queue has appeared in the System.map and
> it affected addresses of ~30000 symbols. I can believe some frequently
> used symbols got luckily aligned and that improved overall performance.
> 
> Interestingly v7_dma_inv_range() and v7_dma_clean_range() were not
> relocated.
> 
> *****
> 
> I followed Russell's suggestion and added .align 5 to cache-v7.S (see
> two attached diffs).
> 
> 1) v4.19 + OpenWrt mtd patches
> > egrep -B 1 -A 1 "v7_dma_(inv|clean)_range" System.map
> c010ea58 T v7_flush_kern_dcache_area
> c010ea94 t v7_dma_inv_range
> c010eae0 t v7_dma_clean_range
> c010eb18 T b15_dma_flush_range
> 
> 2) v4.19 + OpenWrt mtd patches + two .align 5 in cache-v7.S
> c010ea6c T v7_flush_kern_dcache_area
> c010eac0 t v7_dma_inv_range
> c010eb20 t v7_dma_clean_range
> c010eb58 T b15_dma_flush_range
> (actually 15 symbols above v7_dma_inv_range were relocated)
> 
> This method seems to work (at least it affects addresses in
> System.map).
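
One way to read alignment out of System.map is to look at each address modulo the cache line size. Below is a minimal Python sketch of that check, not an established tool. It assumes a 32-byte cache line (the boundary that ".align 5" requests), and the helper names parse_system_map and line_offset are made up for illustration.

```python
# Sketch: report where symbols sit within a cache line, from System.map text.
# Assumption: 32-byte cache line, i.e. the boundary that ".align 5" requests.
CACHE_LINE = 32


def parse_system_map(text):
    """Map symbol name -> address for 'address type name' lines."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip "(...)" markers and blank lines
        addr, _kind, name = parts
        try:
            symbols[name] = int(addr, 16)
        except ValueError:
            continue  # first field was not a hex address
    return symbols


def line_offset(addr, line_size=CACHE_LINE):
    """Byte offset of addr within its cache line; 0 means line-aligned."""
    return addr % line_size


# Addresses copied from the two System.map excerpts quoted above.
BEFORE = """\
c010ea94 t v7_dma_inv_range
c010eae0 t v7_dma_clean_range
"""
AFTER = """\
c010eac0 t v7_dma_inv_range
c010eb20 t v7_dma_clean_range
"""

if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        for name, addr in parse_system_map(text).items():
            print(f"{label:6} {name:20} {addr:08x} offset {line_offset(addr):2d}")
```

On these excerpts the unpatched v7_dma_inv_range starts 20 bytes into a 32-byte line, v7_dma_clean_range at c010eae0 already starts on a 32-byte boundary, and both patched addresses are at offset 0.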
> 
> *****
> 
> I ran 2 tests for each combination of changes. Each test consisted of
> 10 sequences of: 30 seconds iperf session + reboot.
> 
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> Test #1: 738 Mb/s
> Test #2: 737 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_clean_range-align.diff
> Test #1: 746 Mb/s
> Test #2: 747 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 745 Mb/s
> Test #2: 746 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > patch -p1 < v7_dma_clean_range-align.diff
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 762 Mb/s
> Test #2: 761 Mb/s
> 
> As you can see I got quite a nice performance improvement after aligning
> both v7_dma_clean_range() and v7_dma_inv_range().

This is an improvement of about 3.3%.

> It still wasn't as good as with 9316a9ed6895 cherry-picked but pretty
> close.
> 
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> Test #1: 770 Mb/s
> Test #2: 766 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_clean_range-align.diff
> Test #1: 756 Mb/s
> Test #2: 759 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 758 Mb/s
> Test #2: 759 Mb/s
> 
> > git reset --hard v4.19
> > git am OpenWrt-mtd-chages.patch
> > git cherry-pick -x 9316a9ed6895
> > patch -p1 < v7_dma_clean_range-align.diff
> > patch -p1 < v7_dma_inv_range-align.diff
> Test #1: 767 Mb/s
> Test #2: 763 Mb/s
> 
> Now you can see how unpredictable it is. If I cherry-pick 9316a9ed6895
> and do an extra alignment of v7_dma_clean_range() and v7_dma_inv_range()
> that extra alignment can actually *hurt* NAT performance.

You have a maximum variance of 4Mb/s in your tests which is around
0.5%, and this shows a reduction of 3Mb/s, or 0.4%.

If we look at it a different way:
- Without the alignment patches, there is a difference of 4% in
  performance depending on whether 9316a9ed6895 is applied.
- With the alignment patches, there is a difference of 0.4% in
  performance depending on whether 9316a9ed6895 is applied.

How can this not be beneficial?
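
For anyone wanting to reproduce the percentages, here is a small Python sketch of the arithmetic. It averages the two runs of each configuration quoted above; the dictionary keys and the pct_change helper are names chosen here for illustration only.

```python
from statistics import mean

# Throughput in Mb/s (Test #1, Test #2), copied from the quoted results.
results = {
    "base": (738, 737),        # v4.19 + OpenWrt mtd patches
    "base+align": (762, 761),  # ... + both .align 5 patches
    "pick": (770, 766),        # ... + 9316a9ed6895 cherry-picked
    "pick+align": (767, 763),  # ... + cherry-pick + both .align 5 patches
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


avg = {name: mean(runs) for name, runs in results.items()}

# Effect of the two alignment patches on the plain tree.
align_gain = pct_change(avg["base"], avg["base+align"])
# Effect of 9316a9ed6895 without and with the alignment patches.
pick_unaligned = pct_change(avg["base"], avg["pick"])
pick_aligned = pct_change(avg["base+align"], avg["pick+align"])
# Effect of adding the alignment patches on top of the cherry-pick.
align_on_pick = pct_change(avg["pick"], avg["pick+align"])

if __name__ == "__main__":
    print(f"alignment alone:            {align_gain:+.2f}%")
    print(f"cherry-pick, unaligned:     {pick_unaligned:+.2f}%")
    print(f"cherry-pick, aligned:       {pick_aligned:+.2f}%")
    print(f"alignment atop cherry-pick: {align_on_pick:+.2f}%")
```

With only two runs per configuration these are rough figures, so differences of the same size as the run-to-run spread should not be over-read.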

> 
> My guess is that:
> 1) 9316a9ed6895 provides alignment of some very important function(s)
> 2) DMA alignments on top ^^ provide some gain but also break some other alignment
> 
> *****
> 
> SUMMARY
> 
> It seems that for Linux 4.19 + my .config I can get a very lucky &
> optimal alignment of functions by cherry-picking 9316a9ed6895.
> 
> I thought of checking functions reported by the "perf" tool with CPU
> usage of 2%+.
> 
> All following functions keep their original address with 9316a9ed6895:
> __irqentry_text_end
> arch_cpu_idle
> l2c210_clean_range
> l2c210_inv_range
> v7_dma_clean_range
> v7_dma_inv_range
> 
> The remaining 3 functions got relocated:
> -c03e5038 t __netif_receive_skb_core
> +c03e50b0 t __netif_receive_skb_core
> -c03c8b1c t bcma_host_soc_read32
> +c03c8b94 t bcma_host_soc_read32
> -c0475620 T fib_table_lookup
> +c0475698 T fib_table_lookup
> 
> I tried aligning all 3 above functions using:
> __attribute__((aligned(32)))
> and got 756 Mb/s. It's better but still not ~770 Mb/s.
> 
> Is there any easy way of identifying which function alignments had
> such a big impact on NAT performance? I'd like to get those functions
> explicitly aligned using assembler/__attribute__/something.
> 
> What I'm also afraid of are false positives. I may end up aligning some
> unrelated function that just happens to align other ones. Just like
> cherry-picking 9316a9ed6895 having side-effects and not really fixing
> anything explicitly.
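
A possible first step towards narrowing this down is to compare two System.map files and list only the symbols whose position within a cache line changed. The Python sketch below does that for the addresses quoted above. It is an illustration under assumptions, not a proven method: the cache line is taken to be 32 bytes, the function names are invented here, and a changed offset only marks a candidate; it does not show that the symbol is responsible for any throughput difference.

```python
# Sketch: list symbols whose offset within a cache line differs between
# two System.map files. Assumption: 32-byte cache line.
CACHE_LINE = 32


def parse_system_map(text):
    """Map symbol name -> address for 'address type name' lines."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            try:
                symbols[parts[2]] = int(parts[0], 16)
            except ValueError:
                pass  # not a symbol line
    return symbols


def changed_line_offsets(before, after, line_size=CACHE_LINE):
    """Return sorted (name, old_offset, new_offset) for symbols present in
    both maps whose offset within a cache line changed."""
    changed = []
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name] % line_size, after[name] % line_size
        if old != new:
            changed.append((name, old, new))
    return changed


# Addresses copied from the message above: without and with 9316a9ed6895.
WITHOUT_PICK = """\
c010ea94 t v7_dma_inv_range
c03e5038 t __netif_receive_skb_core
c03c8b1c t bcma_host_soc_read32
c0475620 T fib_table_lookup
"""
WITH_PICK = """\
c010ea94 t v7_dma_inv_range
c03e50b0 t __netif_receive_skb_core
c03c8b94 t bcma_host_soc_read32
c0475698 T fib_table_lookup
"""

if __name__ == "__main__":
    rows = changed_line_offsets(parse_system_map(WITHOUT_PICK),
                                parse_system_map(WITH_PICK))
    for name, old, new in rows:
        print(f"{name:28} offset {old:2d} -> {new:2d}")
```

In practice the full System.map files would be read from disk and the output filtered down to the functions that perf reports as hot. Note that for the three relocated functions listed above, none ends up at offset 0 after the cherry-pick, and fib_table_lookup moves from offset 0 to offset 24.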

> diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> index 215df435bfb9..c60046cd34aa 100644
> --- a/arch/arm/mm/cache-v7.S
> +++ b/arch/arm/mm/cache-v7.S
> @@ -373,6 +373,8 @@ v7_dma_inv_range:
>  	ret	lr
>  ENDPROC(v7_dma_inv_range)
>  
> +	.align	5
> +
>  /*
>   *	v7_dma_clean_range(start,end)
>   *	- start   - virtual start address of region

> diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> index 215df435bfb9..0c3999f219ab 100644
> --- a/arch/arm/mm/cache-v7.S
> +++ b/arch/arm/mm/cache-v7.S
> @@ -340,6 +340,8 @@ ENTRY(v7_flush_kern_dcache_area)
>  	ret	lr
>  ENDPROC(v7_flush_kern_dcache_area)
>  
> +	.align	5
> +
>  /*
>   *	v7_dma_inv_range(start,end)
>   *


-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up