Message-ID: <20161011055753.2690178f@grimm.local.home>
Date: Tue, 11 Oct 2016 05:57:53 -0400
From: Steven Rostedt <rostedt@...dmis.org>
To: Joel Fernandes <joelaf@...gle.com>
Cc: linux-kernel@...r.kernel.org
Subject: Re: [RFC 0/7] pstore: Improve performance of ftrace backend with
ramoops
On Fri, 7 Oct 2016 22:28:27 -0700
Joel Fernandes <joelaf@...gle.com> wrote:
> Here's an early RFC for a patch series on improving ftrace throughput with
> ramoops. I am hoping to get some early comments so I'm releasing it in advance.
> It is functional and tested.
>
> Currently ramoops uses a single zone to store function traces. To make this
> work, it has to use locking to synchronize accesses to the buffer. Recently
> the synchronization was moved entirely from a cmpxchg mechanism to raw
> spinlocks due to difficulties in using cmpxchg on uncached memory and also on
> RAM behind PCIe. [1] This change further dropped the performance of the
> ramoops pstore backend by more than half in my tests.
>
> This patch series improves the situation dramatically by around 280% from what
> it is now by creating a ramoops persistent zone for each CPU and avoiding use of
> locking altogether for ftrace. At init time, the persistent zones are then
> merged together.
>
> Here are some tests to show the improvements. Tested using a qemu quad core
> x86_64 instance with -mem-path to persist the guest RAM to a file. I measured
> average throughput of dd over 30 seconds:
>
> dd if=/dev/zero | pv | dd of=/dev/null
>
> Without this patch series: 24MB/s
> With per-cpu buffers and counter increment: 91.5 MB/s (improvement by ~ 281%)
> with per-cpu buffers and trace_clock: 51.9 MB/s
>
> Some more considerations:
> 1. In order to do the merge of the individual buffers, I am using racy counters
> since I didn't want to sacrifice throughput for perfect timestamps. Using
> trace_clock() for timestamps did the job, but gave almost half the throughput
> of counter-based timestamps.
>
> 2. Since the patches divide the available ftrace persistent space by the number
> of CPUs, less space will now be available per CPU; however, the user is free to
> disable the per-CPU behavior and revert to the old behavior via the
> PSTORE_PER_CPU flag. It's a space vs. performance trade-off, so if the user has
> enough space and not a lot of CPUs, then using per-CPU persistent buffers makes
> sense for better performance.
>
> 3. Without using any counters or timestamps, the improvement is even more
> (~140MB/s) but the buffers cannot be merged.
>
> [1] https://lkml.org/lkml/2016/9/8/375
From a tracing point of view, I have no qualms with this patch set.
-- Steve
>
> Joel Fernandes (7):
> pstore: Make spinlock per zone instead of global
> pstore: locking: dont lock unless caller asks to
> pstore: Remove case of PSTORE_TYPE_PMSG write using deprecated
> function
> pstore: Make ramoops_init_przs generic for other prz arrays
> ramoops: Split ftrace buffer space into per-CPU zones
> pstore: Add support to store timestamp counter in ftrace records
> pstore: Merge per-CPU ftrace zones into one zone for output
>
> fs/pstore/ftrace.c | 3 +
> fs/pstore/inode.c | 7 +-
> fs/pstore/internal.h | 34 -------
> fs/pstore/ram.c | 234 +++++++++++++++++++++++++++++++++++----------
> fs/pstore/ram_core.c | 30 +++---
> include/linux/pstore.h | 69 +++++++++++++
> include/linux/pstore_ram.h | 6 +-
> 7 files changed, 280 insertions(+), 103 deletions(-)
>