linux-kernel - Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export more page flags in /proc/kpageflags)

Date:	Tue, 28 Apr 2009 21:31:08 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	Li Zefan <lizf@...fujitsu.com>, Tom Zanussi <tzanussi@...il.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Andi Kleen <andi@...stfloor.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Larry Woodman <lwoodman@...hat.com>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Eduard - Gabriel Munteanu <eduard.munteanu@...ux360.ro>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>,
	Matt Mackall <mpm@...enic.com>,
	Alexey Dobriyan <adobriyan@...il.com>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: export
	more page flags in /proc/kpageflags)

On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@...el.com> wrote:
> 
> > > The above 'get object state' interface (which allows passive 
> > > sampling) - integrated into the tracing framework - would serve 
> > > that goal, agreed?
> > 
> > Agreed. That could in theory be a good complement to dynamic 
> > tracing.
> > 
> > Then what will be the canonical form for all the 'get object 
> > state' interfaces - "object.attr=value", or whatever? [...]
> 
> Lemme outline what i'm thinking of.
> 
> I'd call the feature "object collection tracing", which would live 
> in /debug/tracing, accessed via such files:
> 
>   /debug/tracing/objects/mm/pages/
>   /debug/tracing/objects/mm/pages/format
>   /debug/tracing/objects/mm/pages/filter
>   /debug/tracing/objects/mm/pages/trace_pipe
>   /debug/tracing/objects/mm/pages/stats
>   /debug/tracing/objects/mm/pages/events/
> 
> here's the (proposed) semantics of those files:
> 
> 1) /debug/tracing/objects/mm/pages/
> 
> There's a subsystem / object basic directory structure to make it 
> easy and intuitive to find our way around there.
> 
> 2) /debug/tracing/objects/mm/pages/format
> 
> the format file:
> 
>   /debug/tracing/objects/mm/pages/format
> 
> Would reuse the existing dynamic-tracepoint structured-logging 
> descriptor format and code (this is upstream already):
> 
>  [root@...enix sched_signal_send]# pwd
>  /debug/tracing/events/sched/sched_signal_send
> 
>  [root@...enix sched_signal_send]# cat format 
>  name: sched_signal_send
>  ID: 24
>  format:
> 	field:unsigned short common_type;		offset:0;	size:2;
> 	field:unsigned char common_flags;		offset:2;	size:1;
> 	field:unsigned char common_preempt_count;	offset:3;	size:1;
> 	field:int common_pid;				offset:4;	size:4;
> 	field:int common_tgid;				offset:8;	size:4;
> 
> 	field:int sig;					offset:12;	size:4;
> 	field:char comm[TASK_COMM_LEN];			offset:16;	size:16;
> 	field:pid_t pid;				offset:32;	size:4;
> 
>  print fmt: "sig: %d  task %s:%d", REC->sig, REC->comm, REC->pid
> 
> These format descriptors enumerate fields, types and sizes, in a 
> structured way that user-space tools can parse easily. (The binary 
> records that come from the trace_pipe file follow this format 
> description.)
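
A minimal user-space sketch of how a tool might parse such a format descriptor and decode one binary record with it. The names `Field`, `parse_format` and `decode_record` are hypothetical, and the little-endian byte order and the treatment of `char` arrays as NUL-terminated strings are assumptions, not something the descriptor itself states.

```python
import re
from typing import Dict, List, NamedTuple, Union


class Field(NamedTuple):
    ctype: str      # C type as printed, e.g. "unsigned short"
    name: str       # field name without any array suffix
    offset: int     # byte offset inside the record
    size: int       # size in bytes
    is_array: bool  # True for declarations such as comm[TASK_COMM_LEN]


FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
)


def parse_format(text: str) -> List[Field]:
    """Collect every 'field:...; offset:...; size:...;' line of a descriptor."""
    fields = []
    for m in FIELD_RE.finditer(text):
        ctype, _, name = m.group("decl").strip().rpartition(" ")
        fields.append(Field(ctype, name.split("[")[0],
                            int(m.group("offset")), int(m.group("size")),
                            "[" in name))
    return fields


def decode_record(fields: List[Field], record: bytes) -> Dict[str, Union[int, str]]:
    """Decode one record; assumes little-endian integers (an assumption)."""
    out: Dict[str, Union[int, str]] = {}
    for f in fields:
        raw = record[f.offset:f.offset + f.size]
        if f.is_array and f.ctype == "char":
            out[f.name] = raw.split(b"\0", 1)[0].decode("ascii", "replace")
        else:
            signed = not f.ctype.startswith("unsigned")
            out[f.name] = int.from_bytes(raw, "little", signed=signed)
    return out


# The sched_signal_send descriptor quoted above, used as sample input.
SAMPLE_FORMAT = """\
name: sched_signal_send
ID: 24
format:
    field:unsigned short common_type;        offset:0;   size:2;
    field:unsigned char common_flags;        offset:2;   size:1;
    field:unsigned char common_preempt_count; offset:3;  size:1;
    field:int common_pid;                    offset:4;   size:4;
    field:int common_tgid;                   offset:8;   size:4;

    field:int sig;                           offset:12;  size:4;
    field:char comm[TASK_COMM_LEN];          offset:16;  size:16;
    field:pid_t pid;                         offset:32;  size:4;
"""
```

Because the parser only relies on the `field`/`offset`/`size` triplets, the same code would apply unchanged to an object-collection `format` file that follows this layout.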
> 
> 3) /debug/tracing/objects/mm/pages/filter
> 
> This is the tracing filter that can be set based on the 'format' 
> descriptor. So with the above (signal-send tracepoint) you can 
> define such filter expressions:
> 
>   echo "(sig == 10 && comm == bash) || sig == 13" > filter
> 
> To restrict the 'scope' of the object collection along pretty much 
> any key or combination of keys. (Or you can leave it as it is and 
> dump all objects and do keying in user-space.)
> 
> [ Using in-kernel filtering is obviously faster than streaming it 
>   out to user-space - but there might be details and types of 
>   visualization you want to do in user-space - so we don't want to 
>   restrict things here. ]
> 
> For the mm object collection tracepoint i could imagine such filter 
> expressions:
> 
>   echo "type == shared && file == /sbin/init" > filter
> 
> To dump all shared pages that are mapped to /sbin/init.
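
To make the semantics of such filter expressions concrete, here is a small user-space model of them. It is only a sketch and not the kernel's filter code: the function `compile_filter` is hypothetical, it supports just `==`, `!=`, `&&`, `||` and parentheses, and it compares every value as a string.

```python
import re

# Tokens: parentheses, the four operators, or a bare word such as /sbin/init.
TOKEN_RE = re.compile(r"\(|\)|&&|\|\||==|!=|[^\s()&|=!]+")


def compile_filter(expr):
    """Turn a filter expression into a predicate over dict records."""
    tokens = TOKEN_RE.findall(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise ValueError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def parse_or():
        left = parse_and()
        while peek() == "||":
            take()
            right = parse_and()
            left = (lambda a, b: lambda rec: a(rec) or b(rec))(left, right)
        return left

    def parse_and():
        left = parse_atom()
        while peek() == "&&":
            take()
            right = parse_atom()
            left = (lambda a, b: lambda rec: a(rec) and b(rec))(left, right)
        return left

    def parse_atom():
        if peek() == "(":
            take("(")
            inner = parse_or()
            take(")")
            return inner
        key, op, value = take(), take(), take()
        if op == "==":
            return lambda rec: str(rec.get(key)) == value
        if op == "!=":
            return lambda rec: str(rec.get(key)) != value
        raise ValueError(f"unsupported operator {op!r}")

    pred = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected trailing token {peek()!r}")
    return pred


signal_filter = compile_filter("(sig == 10 && comm == bash) || sig == 13")
page_filter = compile_filter("type == shared && file == /sbin/init")
```

As in the expressions above, `&&` binds tighter than `||`, so the parentheses in the first example are not strictly needed but keep the intent obvious.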
> 
> 4) /debug/tracing/objects/mm/pages/trace_pipe
> 
> The 'trace_pipe' file can be used to dump all objects in the 
> collection, which match the filter ('all objects' by default). The 
> record format is described in 'format'.
> 
> trace_pipe would be a reuse of the existing trace_pipe code: it is a 
> modern, poll()-able, read()-able, splice()-able pipe abstraction.
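
A sketch of how a user-space consumer could wait on such a pipe with poll() and read whatever is available. Since the proposed objects/.../trace_pipe file does not exist, an ordinary `os.pipe()` stands in for it here; `read_available` is a hypothetical helper name.

```python
import os
import select


def read_available(fd: int, timeout_ms: int, chunk: int = 4096) -> bytes:
    """Wait up to timeout_ms for data on fd, then read at most chunk bytes.

    Returns b"" if nothing became readable in time.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    for _fd, events in poller.poll(timeout_ms):
        if events & select.POLLIN:
            return os.read(fd, chunk)
    return b""


# Stand-in for a trace pipe: an anonymous pipe with one pending "record".
read_end, write_end = os.pipe()
os.write(write_end, b"record-bytes")
data = read_available(read_end, 100)
```

The same loop would work on any poll()-able file descriptor, which is the property the text attributes to trace_pipe.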
> 
> 5) /debug/tracing/objects/mm/pages/stats
> 
> The 'stats' file would be a reuse of the existing histogram code of 
> the tracing code. We already make use of it for the branch tracers 
> and for the workqueue tracer - it could be extended to be applicable 
> to object collections as well.
> 
> The advantage there would be that there's no dumping at all - all 
> the integration is done straight in the kernel. ( The 'filter' 
> condition is listened to - increasing flexibility. The filter file 
> could perhaps also act as a default histogram key. )
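
The idea of integrating in place rather than dumping can be modelled in a few lines: records that pass the filter are counted per key and only the counts are kept. This is a user-space illustration only; `histogram` and the sample page records are hypothetical and not taken from the existing histogram code.

```python
from collections import Counter


def histogram(records, key, keep=lambda rec: True):
    """Count records per value of `key`, honouring an optional filter."""
    return Counter(rec[key] for rec in records if keep(rec))


# Hypothetical page records, loosely following the filter example above.
pages = [
    {"type": "shared", "file": "/sbin/init"},
    {"type": "shared", "file": "/sbin/init"},
    {"type": "private", "file": "/sbin/init"},
    {"type": "shared", "file": "/bin/bash"},
]

by_type = histogram(pages, "type")
shared_by_file = histogram(pages, "file", keep=lambda p: p["type"] == "shared")
```

Only the per-key totals leave the loop, which is the saving the 'stats' file is meant to provide over streaming every record out.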
> 
> 6) /debug/tracing/objects/mm/pages/events/
> 
> The 'events' directory offers links back to existing dynamic 
> tracepoints that are under /debug/tracing/events/. This would serve 
> as an additional coherent force that keeps dynamic tracepoints 
> collected by subsystem and by object type as well. (Tools could make 
> use of this information as well - without being aware of actual 
> object semantics.)
> 
> 
> There would be a number of other object collections we could 
> enumerate:
> 
>  tasks:
> 
>   /debug/tracing/objects/sched/tasks/
> 
>  active inodes known to the kernel:
> 
>   /debug/tracing/objects/fs/inodes/
> 
>  interrupts:
> 
>   /debug/tracing/objects/hw/irqs/
> 
> etc.
> 
> These would use the same 'object collection' framework. Once done we 
> can use it for many other things too.
> 
> Note how organically integrated it all is with the tracing 
> framework. You could start from an 'object view' to get an overview 
> and then go towards a more dynamic view of specific object 
> attributes (or specific objects), as you drill down on a specific 
> problem you want to analyze.
> 
> How does this all sound to you?

Great! I see a good opportunity to adapt the not-yet-submitted
/proc/filecache interface to the proposed framework.

Its basic form is:

#      ino       size   cached cached% refcnt state       age accessed  process         dev             file
[snip]
       320          1        4     100      1    D-     50443     1085 udevd           00:11(tmpfs)     /.udev/uevent_seqnum
    460725        123      124     100     35    --     50444     6795 touch           08:02(sda2)      /lib/libpthread-2.9.so
    460727         31       32     100     14    --     50444     2007 touch           08:02(sda2)      /lib/librt-2.9.so
    458865         97       80      82      1    --     50444       49 mount           08:02(sda2)      /lib/libdevmapper.so.1.02.1
    460090         15       16     100      1    --     50444       48 mount           08:02(sda2)      /lib/libuuid.so.1.2
    458866         46       48     100      1    --     50444       47 mount           08:02(sda2)      /lib/libblkid.so.1.0
    460732         43       44     100     69    --     50444     3581 rcS             08:02(sda2)      /lib/libnss_nis-2.9.so
    460739         87       88     100     73    --     50444     3597 rcS             08:02(sda2)      /lib/libnsl-2.9.so
    460726         31       32     100     69    --     50444     3581 rcS             08:02(sda2)      /lib/libnss_compat-2.9.so
    458804        250      252     100     11    --     50445     8175 rcS             08:02(sda2)      /lib/libncurses.so.5.6
    229540        780      752      96      3    --     50445     7594 init            08:02(sda2)      /bin/bash
    460735         15       16     100     89    --     50445    17581 init            08:02(sda2)      /lib/libdl-2.9.so
    460721       1344     1340      99    117    --     50445    48732 init            08:02(sda2)      /lib/libc-2.9.so
    458801        107      104      97     24    --     50445     3586 init            08:02(sda2)      /lib/libselinux.so.1
    671870         37       24      65      1    --     50446        1 swapper         08:02(sda2)      /sbin/init
       175          1    24412     100      1    --     50446        0 swapper         00:01(rootfs)    /dev/root

The patch basically does a traversal through one or more of the inode
lists to produce the output:
        inode_in_use
        inode_unused
        sb->s_dirty
        sb->s_io
        sb->s_more_io
        sb->s_inodes

The filtering feature is a necessity for this interface - otherwise a
full listing takes considerable time. It supports the following
filters:
        { LS_OPT_DIRTY,         "dirty"         },
        { LS_OPT_CLEAN,         "clean"         },
        { LS_OPT_INUSE,         "inuse"         },
        { LS_OPT_EMPTY,         "empty"         },
        { LS_OPT_ALL,           "all"           },
        { LS_OPT_DEV,           "dev=%s"        },

There are two possible challenges for the conversion:

- One trick it does is to select different lists to traverse depending
  on the filter options. Will this be possible in the object tracing
  framework?
- The file name lookup (last field) is the performance killer. Is it
  possible to skip the file name lookup when the filter fails on the
  leading fields?

Will the object tracing interface allow such flexibility?
(Sorry I'm not yet familiar with the tracing framework.)
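
The two requirements can be sketched in user space as follows. This is an illustration of the control flow only: the option-to-list mapping, the record layout and the names `LISTS_FOR_OPTION`, `list_inodes`, `cheap_match` and `lookup_name` are all hypothetical and not taken from the /proc/filecache patch.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical mapping from a filter option to the inode lists worth walking.
LISTS_FOR_OPTION: Dict[str, List[str]] = {
    "dirty": ["s_dirty", "s_io", "s_more_io"],
    "all": ["s_inodes"],
}


def list_inodes(option: str,
                lists: Dict[str, List[dict]],
                cheap_match: Callable[[dict], bool],
                lookup_name: Callable[[int], str]) -> Iterator[dict]:
    """Walk only the lists selected by `option`; resolve names lazily."""
    for list_name in LISTS_FOR_OPTION[option]:
        for inode in lists.get(list_name, ()):
            # Filter on the cheap leading fields first ...
            if not cheap_match(inode):
                continue
            # ... and pay for the file name lookup only on a match.
            yield dict(inode, file=lookup_name(inode["ino"]))


# Tiny demonstration with made-up list contents.
lookups: List[int] = []
names = {320: "/.udev/uevent_seqnum", 175: "/dev/root"}


def counting_lookup(ino: int) -> str:
    lookups.append(ino)
    return names[ino]


inode_lists = {
    "s_dirty": [{"ino": 320, "dev": "00:11"}],
    "s_io": [{"ino": 175, "dev": "00:01"}],
    "s_inodes": [{"ino": 320, "dev": "00:11"}, {"ino": 175, "dev": "00:01"}],
}
result = list(list_inodes("dirty", inode_lists,
                          lambda i: i["dev"] == "00:11", counting_lookup))
```

In the demonstration only one name lookup is performed, for the single inode that passes the `dev` check, and the `s_inodes` list is never touched.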

> Can you see any conceptual holes in the scheme, any use-case that 
> /proc/kpageflags supports but the object collection approach does 
> not?

kpageflags is simply a big (perhaps sparse) binary array.
I'd still prefer to retain its current form - the kernel patches and
user-space tools are all ready-made, and I see no benefit in
converting to the tracing framework.

> Would you be interested in seeing something like this, if we tried 
> to implement it in the tracing tree? The majority of the code 
> already exists, we just need interest from the MM side and we have 
> to hook it all up. (it is by no means trivial to do - but looks like
> a very exciting feature.)

Definitely! /proc/filecache has another 'page view':

        # head /proc/filecache
        # file /bin/bash
        # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
        # idx   len     state           refcnt
        0       1       RAMU________    4
        3       8       RAMU________    4
        12      1       RAMU________    4
        14      5       RAMU________    4
        20      7       RAMU________    4
        27      2       RAMU________    5
        29      1       RAMU________    4

Which is also a good candidate. However, I still need to investigate
whether it offers a considerable advantage over the mincore() syscall.
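
For reference, a small user-space sketch that parses the page view shown above. It assumes that each line describes a run of `len` consecutive pages starting at `idx`, which is a reading of the column headers rather than something stated explicitly; `PageRun` and `parse_page_view` are hypothetical names.

```python
from typing import List, NamedTuple

# Flag letters as given in the '# flags' legend of the output.
FLAG_NAMES = {
    "R": "referenced", "A": "active", "M": "mmap", "U": "uptodate",
    "D": "dirty", "W": "writeback", "X": "readahead", "P": "private",
    "O": "owner", "b": "buffer", "d": "dirty", "w": "writeback",
}


class PageRun(NamedTuple):
    idx: int          # index of the first page in the run
    length: int       # number of pages in the run (assumed meaning of 'len')
    flags: List[str]  # decoded flag names, '_' placeholders dropped
    refcnt: int


def parse_page_view(text: str) -> List[PageRun]:
    """Parse the data lines, skipping '#' comment and header lines."""
    runs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        idx, length, state, refcnt = line.split()
        flags = [FLAG_NAMES[c] for c in state if c != "_"]
        runs.append(PageRun(int(idx), int(length), flags, int(refcnt)))
    return runs


SAMPLE_VIEW = """\
# file /bin/bash
# idx   len     state           refcnt
0       1       RAMU________    4
3       8       RAMU________    4
12      1       RAMU________    4
14      5       RAMU________    4
20      7       RAMU________    4
27      2       RAMU________    5
29      1       RAMU________    4
"""

runs = parse_page_view(SAMPLE_VIEW)
cached_pages = sum(run.length for run in runs)
```

Under that reading, summing the run lengths gives the number of cached pages covered by the listed runs, the kind of figure one would otherwise derive from a mincore() residency vector.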

Thanks and Regards,
Fengguang