Message-ID: <CANn89i+goL82KC8uQoaNKxh=ap_16p7v1AAay1b2ekZAhPjEQg@mail.gmail.com>
Date: Mon, 19 Jan 2026 11:33:29 +0100
From: Eric Dumazet <edumazet@...gle.com>
To: David Laight <david.laight.linux@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-kernel <linux-kernel@...r.kernel.org>,
netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>,
Eric Dumazet <eric.dumazet@...il.com>, Paolo Abeni <pabeni@...hat.com>,
Nicolas Pitre <npitre@...libre.com>
Subject: Re: [PATCH] compiler_types: Introduce inline_for_performance
On Mon, Jan 19, 2026 at 11:25 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Mon, Jan 19, 2026 at 10:33 AM David Laight
> <david.laight.linux@...il.com> wrote:
> >
> > On Sun, 18 Jan 2026 16:01:25 -0800
> > Andrew Morton <akpm@...ux-foundation.org> wrote:
> >
> > > On Sun, 18 Jan 2026 22:58:02 +0000 David Laight <david.laight.linux@...il.com> wrote:
> > >
> > > > > mm/ alone has 74 __always_inlines, none are documented, I don't know
> > > > > why they're present, many are probably wrong.
> > > > >
> > > > > Shit, uninlining only __get_user_pages_locked does this:
> > > > >
> > > > > text data bss dec hex filename
> > > > > 115703 14018 64 129785 1faf9 mm/gup.o
> > > > > 103866 13058 64 116988 1c8fc mm/gup.o-after
> > > >
> > > > The next questions are does anything actually run faster (either way),
> > > > and should anything at all be marked 'inline' rather than 'always_inline'.
> > > >
> > > > After all, if you call a function twice (not in a loop) you may
> > > > want a real function in order to avoid I-cache misses.
> > >
> > > yup
> >
> > I had two adjacent strlen() calls in a bit of code; the first was on an
> > array (in a structure) and gcc inlined the 'word at a time' code, while the
> > second was on a pointer and it called the library function.
> > That had to be sub-optimal...
> >
> > > > But I'm sure there is a lot of code that is 'inline_for_bloat' :-)
> > >
> > > ooh, can we please have that?
> >
> > Or 'inline_to_speed_up_benchmark' and the associated 'unroll this loop
> > because that must make it faster'.
> >
> > > I do think that every always_inline should be justified and commented,
> > > but I haven't been energetic about asking for that.
> >
> > Apart from the 4-line functions where it is clearly obvious.
> > Especially since the compiler can still decide to not-inline them
> > if they are only 'inline'.
> >
> > > A fun little project would be to go through each one, figure out whether
> > > there were good reasons and, if not, just remove them and see if anyone
> > > explains why that was incorrect.
> >
> > It's not just always_inline, a lot of the plain 'inline' markings are
> > dubious too. Probably why the networking code doesn't like it.
>
> Many __always_inline annotations came about because of clang's reluctance
> to inline small things, even when not inlining them makes the resulting
> code bigger and slower.
>
> It is a bit unclear; this seems to happen when the callers are 'big
> enough'. noinstr caller functions are also a problem.
>
> Let's take the list_add() call from dev_gro_receive(): clang does not
> inline it, for some reason.
>
> After adding __always_inline to list_add() and __list_add() we get
> smaller and more efficient code, for real workloads, not only
> benchmarks.
>
> $ scripts/bloat-o-meter -t net/core/gro.o.old net/core/gro.o.new
> add/remove: 2/4 grow/shrink: 2/1 up/down: 86/-130 (-44)
> Function old new delta
> dev_gro_receive 1795 1845 +50
> .Ltmp93 - 16 +16
> .Ltmp89 - 16 +16
> napi_gro_frags 968 972 +4
> .Ltmp94 16 - -16
> .Ltmp90 16 - -16
> .Ltmp83 16 - -16
> .Ltmp0 8396 8364 -32
> list_add 50 - -50
>
> Over the whole kernel (and also using __always_inline for
> list_add_tail(), __list_del(), list_del()) we have a similar outcome:
>
> $ size vmlinux.old vmlinux.new
> text data bss dec hex filename
> 39037635 23688605 4254712 66980952 3fe0c58 vmlinux.old
> 39035644 23688605 4254712 66978961 3fe0491 vmlinux.new
> $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
> add/remove: 2/6 grow/shrink: 103/52 up/down: 6179/-6473 (-294)
> Function old new delta
> __list_del_entry - 672 +672
Ah, and of course after adding __always_inline to __list_del_entry()
as well, we get something even better:
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 1/6 grow/shrink: 105/51 up/down: 5838/-6464 (-626)
Function old new delta
__do_semtimedop 1819 2204 +385
tracer_alloc_buffers 760 920 +160
__bmc_get_device_id 3332 3486 +154
power_supply_register_extension 540 692 +152
ext4_orphan_add 1122 1254 +132
madvise_cold_or_pageout_pte_range 2202 2328 +126
lru_lazyfree 784 906 +122
iommu_sva_bind_device 744 865 +121
copy_page_range 12353 12473 +120
psi_trigger_create 767 886 +119
srcu_gp_start_if_needed 1440 1548 +108
fanout_add 996 1101 +105
seccomp_notify_ioctl 1882 1986 +104
mlock_folio_batch 3137 3234 +97
kvm_dev_ioctl 1507 1599 +92
handle_userfault 2035 2118 +83
relay_open 681 761 +80
input_register_device 1602 1680 +78
__iommu_probe_device 1203 1279 +76
spi_register_controller 1748 1823 +75
copy_process 4112 4186 +74
newseg 825 898 +73
optimize_kprobe 210 282 +72
do_msgsnd 1290 1362 +72
iscsi_queuecommand 948 1019 +71
hci_register_dev 667 737 +70
rdtgroup_mkdir 1500 1569 +69
p9_read_work 961 1030 +69
tcf_register_action 518 586 +68
mntput_no_expire_slowpath 603 669 +66
netdev_run_todo 1341 1405 +64
memcg_write_event_control 1083 1147 +64
handle_one_recv_msg 3307 3371 +64
btf_module_notify 1669 1733 +64
tcf_mirred_init 1229 1292 +63
register_stat_tracer 340 403 +63
qdisc_get_stab 596 658 +62
io_submit_one 2056 2117 +61
fuse_dev_do_read 1186 1247 +61
bpf_map_offload_map_alloc 622 683 +61
zswap_setup 584 644 +60
register_acpi_bus_type 109 169 +60
ipv6_add_addr 1007 1067 +60
mraid_mm_register_adp 1379 1438 +59
register_ife_op 230 288 +58
perf_event_alloc 2591 2649 +58
intel_iommu_domain_alloc_nested 498 556 +58
do_msgrcv 1536 1593 +57
lru_deactivate_file 1280 1336 +56
fib_create_info 2567 2623 +56
acpi_device_add 859 915 +56
iscsi_register_transport 533 588 +55
ext4_register_li_request 448 503 +55
net_devmem_bind_dmabuf 1036 1088 +52
init_one_iommu 309 360 +51
shmem_get_folio_gfp 1379 1429 +50
dev_gro_receive 1795 1845 +50
.Ltmp68 160 208 +48
.Ltmp337 48 96 +48
.Ltmp204 48 96 +48
vmemmap_remap_pte 499 546 +47
dm_get_device 501 548 +47
vfs_move_mount 509 555 +46
intel_nested_attach_dev 391 435 +44
fw_devlink_dev_sync_state 212 255 +43
bond_ipsec_add_sa 406 447 +41
bd_link_disk_holder 459 499 +40
bcm_tx_setup 1414 1454 +40
devl_linecard_create 401 440 +39
copy_tree 677 716 +39
fscrypt_setup_encryption_info 1479 1517 +38
region_del 631 668 +37
devlink_nl_rate_new_doit 501 536 +35
__down_common 549 583 +34
__softirqentry_text_end - 32 +32
.Ltmp57 48 80 +32
.Ltmp126 48 80 +32
ipmi_create_user 411 442 +31
pci_register_host_bridge 1675 1705 +30
move_cluster 160 184 +24
isolate_migratepages_block 4113 4136 +23
clk_notifier_register 474 489 +15
bpf_event_notify 311 326 +15
dm_register_path_selector 245 256 +11
bpf_crypto_register_type 209 220 +11
usb_driver_set_configuration 232 242 +10
parse_gate_list 912 921 +9
devl_rate_leaf_create 266 274 +8
bt_accept_enqueue 465 473 +8
__flush_workqueue 1233 1240 +7
dev_forward_change 772 778 +6
thermal_add_hwmon_sysfs 861 865 +4
set_node_memory_tier 1027 1031 +4
napi_gro_frags 968 972 +4
mei_cl_notify_request 1067 1071 +4
mb_cache_shrink 447 451 +4
kfence_guarded_free 764 768 +4
blk_mq_try_issue_directly 606 610 +4
__neigh_update 2452 2456 +4
__folio_freeze_and_split_unmapped 3198 3202 +4
rwsem_down_write_slowpath 1595 1598 +3
eventfs_create_dir 477 480 +3
alloc_vmap_area 1967 1970 +3
unix_add_edges 618 620 +2
hid_connect 1529 1530 +1
dwc_prep_dma_memcpy 637 638 +1
scsi_eh_test_devices 700 699 -1
acpi_extract_power_resources 559 558 -1
deactivate_slab 758 756 -2
__team_options_change_check 220 218 -2
sock_map_update_common 524 521 -3
rcu_nocb_gp_kthread 2710 2707 -3
memsw_cgroup_usage_register_event 19 16 -3
kthread 564 561 -3
st_add_path 457 453 -4
flow_block_cb_setup_simple 543 539 -4
ep_try_send_events 875 871 -4
elv_register 524 520 -4
__mptcp_move_skbs_from_subflow 1318 1314 -4
__check_limbo 460 456 -4
__bpf_list_add 277 273 -4
trim_marked 375 370 -5
megaraid_mbox_runpendq 345 340 -5
ipmi_timeout_work 1813 1808 -5
handle_new_recv_msgs 419 414 -5
mei_cl_send_disconnect 186 180 -6
mei_cl_send_connect 186 180 -6
mptcp_sendmsg 1776 1769 -7
deferred_split_folio 561 554 -7
pcibios_allocate_resources 879 871 -8
find_css_set 1690 1682 -8
af_alg_sendmsg 2384 2376 -8
posixtimer_send_sigqueue 950 939 -11
mei_irq_write_handler 1350 1338 -12
dpm_prepare 1173 1161 -12
dpll_xa_ref_pin_add 679 667 -12
css_set_move_task 513 501 -12
cache_mark 583 571 -12
scsi_queue_rq 3449 3434 -15
mtd_queue_rq 1068 1053 -15
link_css_set 323 308 -15
key_garbage_collector 1119 1104 -15
do_dma_probe 1664 1649 -15
configfs_new_dirent 306 290 -16
.Ltmp69 208 192 -16
xfrm_state_walk 677 660 -17
__dpll_pin_register 761 742 -19
i2c_do_add_adapter 1018 998 -20
complete_io 421 401 -20
fsnotify_insert_event 390 368 -22
scsi_eh_ready_devs 3026 2997 -29
.Ltmp71 128 96 -32
.Ltmp58 128 96 -32
.Ltmp127 80 48 -32
migrate_pages_batch 4623 4584 -39
.Ltmp338 96 48 -48
.Ltmp205 96 48 -48
__pfx_list_del 256 - -256
__pfx_list_add_tail 384 - -384
__pfx_list_add 656 - -656
list_del 944 - -944
list_add_tail 1360 - -1360
list_add 2212 - -2212
Total: Before=25509319, After=25508693, chg -0.00%
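
For reference, the change under test is just switching these helpers in
include/linux/list.h from 'static inline' to 'static __always_inline'.
A rough sketch follows (CONFIG_DEBUG_LIST hooks omitted, so not the exact
upstream bodies):

/* Only the inline keyword changes; the logic is the usual list code. */
static __always_inline void __list_add(struct list_head *new,
				       struct list_head *prev,
				       struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

static __always_inline void list_add(struct list_head *new,
				     struct list_head *head)
{
	__list_add(new, head, head->next);
}

static __always_inline void list_add_tail(struct list_head *new,
					  struct list_head *head)
{
	__list_add(new, head->prev, head);
}

static __always_inline void __list_del(struct list_head *prev,
				       struct list_head *next)
{
	next->prev = prev;
	WRITE_ONCE(prev->next, next);
}

static __always_inline void __list_del_entry(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
}

static __always_inline void list_del(struct list_head *entry)
{
	__list_del_entry(entry);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}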