linux-stable-mirror

mirror of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git synced 2026-04-24 10:49:54 +02:00

Author	SHA1	Message	Date
Eric Dumazet	b76543b21f	ipv6: reorganise struct ipv6_pinfo Move fields used in tx fast path at the beginning of the structure, and seldom used ones at the end. Note that rxopt is also in the first cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-5-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:09 +02:00
Eric Dumazet	9fba1eb39e	ipv6: np->rxpmtu race annotation Add READ_ONCE() annotations because np->rxpmtu can be changed while udpv6_recvmsg() and rawv6_recvmsg() read it. Since this is a very rarely used feature, and that udpv6_recvmsg() and rawv6_recvmsg() read np->rxopt anyway, change the test order so that np->rxpmtu does not need to be in a hot cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-4-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:09 +02:00
Eric Dumazet	5489f333ef	ipv6: make ipv6_pinfo.daddr_cache a boolean ipv6_pinfo.daddr_cache is either NULL or &sk->sk_v6_daddr We do not need 8 bytes, a boolean is enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-3-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:09 +02:00
Eric Dumazet	3fbb2a6f3a	ipv6: make ipv6_pinfo.saddr_cache a boolean ipv6_pinfo.saddr_cache is either NULL or &np->saddr. We do not need 8 bytes, a boolean is enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250916160951.541279-2-edumazet@google.com Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:17:09 +02:00
Jakub Kicinski	b127e355f1	eth: fbnic: support devmem Tx Support devmem Tx. We already use skb_frag_dma_map(), we just need to make sure we don't try to unmap the frags. Check if frag is unreadable and mark the ring entry. # ./tools/testing/selftests/drivers/net/hw/devmem.py TAP version 13 1..3 ok 1 devmem.check_rx ok 2 devmem.check_tx ok 3 devmem.check_tx_chunks # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0 Acked-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250916145401.1464550-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 10:12:05 +02:00
Paolo Abeni	f60034689f	Merge branch 'accecn-protocol-patch-series' Chia-Yu Chang says: ==================== AccECN protocol patch series Please find the v19 AccECN protocol patch series, which covers the core functionality of Accurate ECN, AccECN negotiation, AccECN TCP options, and AccECN failure handling. The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28, and it will be RFC9768. This patch series is part of the full AccECN patch series, which is available at https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ ---
Chia-Yu Chang (3):
  tcp: accecn: AccECN option send control
  tcp: accecn: AccECN option failure handling
  tcp: accecn: try to fit AccECN option with SACK
Ilpo Järvinen (7):
  tcp: AccECN core
  tcp: accecn: AccECN negotiation
  tcp: accecn: add AccECN rx byte counters
  tcp: accecn: AccECN needs to know delivered bytes
  tcp: sack option handling improvements
  tcp: accecn: AccECN option
  tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics
 Documentation/networking/ip-sysctl.rst     |  55 +-
 .../networking/net_cachelines/tcp_sock.rst |  12 +
 include/linux/tcp.h                        |  28 +-
 include/net/netns/ipv4.h                   |   2 +
 include/net/tcp.h                          |  33 ++
 include/net/tcp_ecn.h                      | 554 +++++++++++++++++-
 include/uapi/linux/tcp.h                   |   9 +
 net/ipv4/syncookies.c                      |   4 +
 net/ipv4/sysctl_net_ipv4.c                 |  19 +
 net/ipv4/tcp.c                             |  30 +-
 net/ipv4/tcp_input.c                       | 318 +++++++++-
 net/ipv4/tcp_ipv4.c                        |   8 +-
 net/ipv4/tcp_minisocks.c                   |  40 +-
 net/ipv4/tcp_output.c                      | 239 +++++++-
 net/ipv6/syncookies.c                      |   2 +
 net/ipv6/tcp_ipv6.c                        |   1 +
 16 files changed, 1278 insertions(+), 76 deletions(-)
==================== Link: https://patch.msgid.link/20250916082434.100722-1-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:54 +02:00
Chia-Yu Chang	e7e9da850a	tcp: accecn: try to fit AccECN option with SACK As SACK blocks tend to eat all option space when there are many holes, it is useful to compromise on sending many SACK blocks in every ACK and attempt to fit the AccECN option there by reducing the number of SACK blocks. However, it will never go below two SACK blocks because of the AccECN option. As the AccECN option is often not put to every ACK, the space hijack is usually only temporary. Depending on the required AccECN fields (can be either 3, 2, 1, or 0, cf. Table 5 in AccECN spec) and the NOPs used for alignment of other TCP options, up to two SACK blocks will be reduced. Please find the table below for more details:
+===========+==========+===========+=============+=============+
| Number of | Required | Remaining |  Number of  |    Final    |
|   SACK    |  AccECN  |  option   |   reduced   |  number of  |
|  blocks   |  fields  |  spaces   | SACK blocks | SACK blocks |
+===========+==========+===========+=============+=============+
|  x (<=2)  |  0 to 3  |    any    |      0      |      x      |
+-----------+----------+-----------+-------------+-------------+
|     3     |    0     |    any    |      0      |      3      |
|     3     |    1     |    <4     |      1      |      2      |
|     3     |    1     |    >=4    |      0      |      3      |
|     3     |    2     |    <8     |      1      |      2      |
|     3     |    2     |    >=8    |      0      |      3      |
|     3     |    3     |    <12    |      1      |      2      |
|     3     |    3     |    >=12   |      0      |      3      |
+-----------+----------+-----------+-------------+-------------+
|  y (>=4)  |    0     |    any    |      0      |      y      |
|  y (>=4)  |    1     |    <4     |      1      |     y-1     |
|  y (>=4)  |    1     |    >=4    |      0      |      y      |
|  y (>=4)  |    2     |    <8     |      1      |     y-1     |
|  y (>=4)  |    2     |    >=8    |      0      |      y      |
|  y (>=4)  |    3     |    <4     |      2      |     y-2     |
|  y (>=4)  |    3     |    <12    |      1      |     y-1     |
|  y (>=4)  |    3     |    >=12   |      0      |      y      |
+===========+==========+===========+=============+=============+
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-11-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Ilpo Järvinen	fe2cddc648	tcp: accecn: AccECN option ceb/cep and ACE field multi-wrap heuristics The AccECN option ceb/cep heuristic algorithm is from AccECN spec Appendix A.2.2 to mitigate against false ACE field overflows. Armed with ceb delta from option, delivered bytes, and delivered packets it is possible to estimate how many times ACE field wrapped. This calculation is necessary only if more than one wrap is possible. Without SACK, delivered bytes and packets are not always trustworthy in which case TCP falls back to the simpler no-or-all wraps ceb algorithm. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-10-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Chia-Yu Chang	b40671b5ee	tcp: accecn: AccECN option failure handling AccECN option may fail in various ways, handle these:
- Attempt to negotiate the use of AccECN on the 1st retransmitted SYN
- From the 2nd retransmitted SYN, stop AccECN negotiation
- Remove option from SYN/ACK rexmits to handle blackholes
- If no option arrives in SYN/ACK, assume Option is not usable
- If an option arrives later, re-enable it
- If option is zeroed, disable AccECN option processing
This patch uses existing padding bits in tcp_request_sock and holes in tcp_sock without increasing the size. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-9-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Chia-Yu Chang	aa55a7dde7	tcp: accecn: AccECN option send control Instead of sending the option in every ACK, limit sending to those ACKs where the option is necessary:
- Handshake
- "Change-triggered ACK" + the ACK following it. The 2nd ACK is necessary to unambiguously indicate which of the ECN byte counters is increasing. The first ACK has two counters increasing due to the ecnfield edge.
- ACKs with CE to allow CEP delta validations to take advantage of the option.
- Force the option to be sent at least once per 2^22 bytes. The check is done using the bit edges of the byte counters (avoids need for extra variables).
- AccECN option beacon to send a few times per RTT even if nothing in the ECN state requires that. The default is 3 times per RTT, and its period can be set via sysctl_tcp_ecn_option_beacon.
Below are the pahole outcomes before and after this patch, in which the group size of tcp_sock_write_tx is increased from 89 to 97 due to the new u64 accecn_opt_tstamp member:
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u64 tcp_wstamp_ns; /* 2488 8 */
    struct list_head tsorted_sent_queue; /* 2496 16 */
    [...]
    __cacheline_group_end__tcp_sock_write_tx[0]; /* 2521 0 */
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
    u8 nonagle:4; /* 2521: 0 1 */
    u8 rate_app_limited:1; /* 2521: 4 1 */
    /* XXX 3 bits hole, try to pack */
    /* Force alignment to the next boundary: */
    u8 :0;
    u8 received_ce_pending:4; /* 2522: 0 1 */
    u8 unused2:4; /* 2522: 4 1 */
    u8 accecn_minlen:2; /* 2523: 0 1 */
    u8 est_ecnfield:2; /* 2523: 2 1 */
    u8 unused3:4; /* 2523: 4 1 */
    [...]
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 171 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u64 tcp_wstamp_ns; /* 2488 8 */
    u64 accecn_opt_tstamp; /* 2496 8 */
    struct list_head tsorted_sent_queue; /* 2504 16 */
    [...]
    __cacheline_group_end__tcp_sock_write_tx[0]; /* 2529 0 */
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2529 0 */
    u8 nonagle:4; /* 2529: 0 1 */
    u8 rate_app_limited:1; /* 2529: 4 1 */
    /* XXX 3 bits hole, try to pack */
    /* Force alignment to the next boundary: */
    u8 :0;
    u8 received_ce_pending:4; /* 2530: 0 1 */
    u8 unused2:4; /* 2530: 4 1 */
    u8 accecn_minlen:2; /* 2531: 0 1 */
    u8 est_ecnfield:2; /* 2531: 2 1 */
    u8 accecn_opt_demand:2; /* 2531: 4 1 */
    u8 prev_ecnfield:2; /* 2531: 6 1 */
    [...]
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2636 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 173 */
}
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-8-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Ilpo Järvinen	b5e74132df	tcp: accecn: AccECN option The Accurate ECN allows echoing back the sum of bytes for each IP ECN field value in the received packets using AccECN option. This change implements AccECN option tx & rx side processing without option send control related features that are added by a later change. Based on specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt (Some features of the spec will be added in the later changes rather than in this one). A full-length AccECN option is always attempted but if it does not fit, the minimum length is selected based on the counters that have changed since the last update. The AccECN option (with 24-bit fields) often ends in odd sizes so the option write code tries to take advantage of some NOPs used to pad the other TCP options. The delivered_ecn_bytes pairs with received_ecn_bytes similar to how delivered_ce pairs with received_ce. In contrast to ACE field, however, the option is not always available to update delivered_ecn_bytes. For ACK w/o AccECN option, the delivered bytes calculated based on the cumulative ACK+SACK information are assigned to one of the counters using an estimation heuristic to select the most likely ECN byte counter. Any estimation error is corrected when the next AccECN option arrives. It may occur that the heuristic gets too confused when there are enough different byte counter deltas between ACKs with the AccECN option in which case the heuristic just gives up on updating the counters for a while. tcp_ecn_option sysctl can be used to select option sending mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM, and TCP_ECN_OPTION_FULL. This patch increases the size of tcp_info struct, as there are no existing holes for new u32 variables. Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_info {
    [...]
    __u32 tcpi_total_rto_time; /* 244 4 */
    /* size: 248, cachelines: 4, members: 61 */
}
[AFTER THIS PATCH]
struct tcp_info {
    [...]
    __u32 tcpi_total_rto_time; /* 244 4 */
    __u32 tcpi_received_ce; /* 248 4 */
    __u32 tcpi_delivered_e1_bytes; /* 252 4 */
    __u32 tcpi_delivered_e0_bytes; /* 256 4 */
    __u32 tcpi_delivered_ce_bytes; /* 260 4 */
    __u32 tcpi_received_e1_bytes; /* 264 4 */
    __u32 tcpi_received_e0_bytes; /* 268 4 */
    __u32 tcpi_received_ce_bytes; /* 272 4 */
    /* size: 280, cachelines: 5, members: 68 */
}
This patch uses the existing 1-byte holes in the tcp_sock_write_txrx group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx group after the new u32 delivered_ecn_bytes[3] member. Therefore, the group size of tcp_sock_write_rx is increased from 96 to 112. Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u8 received_ce_pending:4; /* 2522: 0 1 */
    u8 unused2:4; /* 2522: 4 1 */
    /* XXX 1 byte hole, try to pack */
    [...]
    u32 rcv_rtt_last_tsecr; /* 2668 4 */
    [...]
    __cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 167 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u8 received_ce_pending:4; /* 2522: 0 1 */
    u8 unused2:4; /* 2522: 4 1 */
    u8 accecn_minlen:2; /* 2523: 0 1 */
    u8 est_ecnfield:2; /* 2523: 2 1 */
    u8 unused3:4; /* 2523: 4 1 */
    [...]
    u32 rcv_rtt_last_tsecr; /* 2668 4 */
    u32 delivered_ecn_bytes[3]; /* 2672 12 */
    /* XXX 4 bytes hole, try to pack */
    [...]
    __cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 171 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-7-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Ilpo Järvinen	77a4fdf43c	tcp: sack option handling improvements 1) Don't early return when sack doesn't fit. AccECN code will be placed after this fragment so no early returns please. 2) Make sure opts->num_sack_blocks is not left undefined. E.g., tcp_current_mss() does not memset its opts struct to zero. AccECN code checks if SACK option is present and may even alter it to make room for AccECN option when many SACK blocks are present. Thus, num_sack_blocks needs to be always valid. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-6-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Ilpo Järvinen	a92543d597	tcp: accecn: AccECN needs to know delivered bytes AccECN byte counter estimation requires delivered bytes which can be calculated while processing SACK blocks and cumulative ACK. The delivered bytes will be used to estimate the byte counters between AccECN option (on ACKs w/o the option). Accurate ECN does not depend on SACK to function; however, the calculation would be more accurate if SACK were there. Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-5-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:52 +02:00
Ilpo Järvinen	9a01127744	tcp: accecn: add AccECN rx byte counters These three byte counters track IP ECN field payload byte sums for all arriving (acceptable) packets for ECT0, ECT1, and CE. The AccECN option (added by a later patch in the series) echoes these counters back to sender side; therefore, it is placed within the group of tcp_sock_write_txrx. Below are the pahole outcomes before and after this patch, in which the group size of tcp_sock_write_txrx is increased from 95 + 4 to 107 + 4 and an extra 4-byte hole is created but will be exploited in later patches:
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u32 delivered_ce; /* 2576 4 */
    u32 received_ce; /* 2580 4 */
    u32 app_limited; /* 2584 4 */
    u32 rcv_wnd; /* 2588 4 */
    struct tcp_options_received rx_opt; /* 2592 24 */
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 166 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u32 delivered_ce; /* 2576 4 */
    u32 received_ce; /* 2580 4 */
    u32 received_ecn_bytes[3]; /* 2584 12 */
    u32 app_limited; /* 2596 4 */
    u32 rcv_wnd; /* 2600 4 */
    struct tcp_options_received rx_opt; /* 2604 24 */
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
    /* XXX 4 bytes hole, try to pack */
    [...]
    /* size: 3200, cachelines: 50, members: 167 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-4-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:51 +02:00
Ilpo Järvinen	3cae34274c	tcp: accecn: AccECN negotiation Accurate ECN negotiation parts based on the specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Accurate ECN is negotiated using ECE, CWR and AE flags in the TCP header. TCP falls back into using RFC3168 ECN if one of the ends supports only RFC3168-style ECN. The AccECN negotiation includes reflecting IP ECN field value seen in SYN and SYNACK back using the same bits as negotiation to allow responding to SYN CE marks and to detect ECN field mangling. CE marks should not occur currently because SYN=1 segments are sent with Non-ECT in IP ECN field (but proposal exists to remove this restriction). Reflecting SYN IP ECN field in SYNACK is relatively simple. Reflecting SYNACK IP ECN field in the final/third ACK of the handshake is more challenging. Linux TCP code is not well prepared for using the final/third ACK as a signalling channel which makes things somewhat complicated here. tcp_ecn sysctl can be used to select the highest ECN variant (Accurate ECN, ECN, No ECN) that is attempted to be negotiated and requested for incoming connection and outgoing connection: TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN, TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN. After this patch, the size of tcp_request_sock remains unchanged and no new holes are added. Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_request_sock {
    [...]
    u32 rcv_nxt; /* 352 4 */
    u8 syn_tos; /* 356 1 */
    /* size: 360, cachelines: 6, members: 16 */
}
[AFTER THIS PATCH]
struct tcp_request_sock {
    [...]
    u32 rcv_nxt; /* 352 4 */
    u8 syn_tos; /* 356 1 */
    bool accecn_ok; /* 357 1 */
    u8 syn_ect_snt:2; /* 358: 0 1 */
    u8 syn_ect_rcv:2; /* 358: 2 1 */
    u8 accecn_fail_mode:4; /* 358: 4 1 */
    /* size: 360, cachelines: 6, members: 20 */
}
After this patch, the size of tcp_sock remains unchanged and no new holes are added. Also, 4 bits of the existing 2-byte hole are exploited. Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    u8 dup_ack_counter:2; /* 2761: 0 1 */
    u8 tlp_retrans:1; /* 2761: 2 1 */
    u8 unused:5; /* 2761: 3 1 */
    u8 thin_lto:1; /* 2762: 0 1 */
    u8 fastopen_connect:1; /* 2762: 1 1 */
    u8 fastopen_no_cookie:1; /* 2762: 2 1 */
    u8 fastopen_client_fail:2; /* 2762: 3 1 */
    u8 frto:1; /* 2762: 5 1 */
    /* XXX 2 bits hole, try to pack */
    [...]
    u8 keepalive_probes; /* 2765 1 */
    /* XXX 2 bytes hole, try to pack */
    [...]
    /* size: 3200, cachelines: 50, members: 164 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    u8 dup_ack_counter:2; /* 2761: 0 1 */
    u8 tlp_retrans:1; /* 2761: 2 1 */
    u8 syn_ect_snt:2; /* 2761: 3 1 */
    u8 syn_ect_rcv:2; /* 2761: 5 1 */
    u8 thin_lto:1; /* 2761: 7 1 */
    u8 fastopen_connect:1; /* 2762: 0 1 */
    u8 fastopen_no_cookie:1; /* 2762: 1 1 */
    u8 fastopen_client_fail:2; /* 2762: 2 1 */
    u8 frto:1; /* 2762: 4 1 */
    /* XXX 3 bits hole, try to pack */
    [...]
    u8 keepalive_probes; /* 2765 1 */
    u8 accecn_fail_mode:4; /* 2766: 0 1 */
    /* XXX 4 bits hole, try to pack */
    /* XXX 1 byte hole, try to pack */
    [...]
    /* size: 3200, cachelines: 50, members: 166 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:51 +02:00
Ilpo Järvinen	542a495cba	tcp: AccECN core This change implements Accurate ECN without negotiation and AccECN Option (that will be added by later changes). Based on AccECN specifications: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Accurate ECN allows feeding back the number of CE (congestion experienced) marks accurately to the sender in contrast to RFC3168 ECN that can only signal one marks-seen-yes/no per RTT. Congestion control algorithms can take advantage of the accurate ECN information to fine-tune their congestion response to avoid drastic rate reduction when only mild congestion is encountered. With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps track of how many segments have arrived with a CE mark. Accurate ECN uses ACE field (ECE, CWR, AE) to communicate the value back to the sender which updates tp->delivered_ce (s.cep) based on the feedback. This signalling channel is lossy when ACE field overflow occurs. A conservative strategy is selected here to deal with the ACE overflow; however, some strategies using the AccECN option later in the overall patchset mitigate against false overflows detected. The ACE field values on the wire are offset by TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the real CE marks rather than forcing all downstream users to adapt to the wire offset. This patch uses the first 1-byte hole and the last 4-byte hole of the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'. Also, the group size of tcp_sock_write_txrx is increased from 91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are the trimmed pahole outcomes before and after this patch.
[BEFORE THIS PATCH]
struct tcp_sock {
    [...]
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
    u8 nonagle:4; /* 2521: 0 1 */
    u8 rate_app_limited:1; /* 2521: 4 1 */
    /* XXX 3 bits hole, try to pack */
    /* XXX 2 bytes hole, try to pack */
    [...]
    u32 delivered_ce; /* 2576 4 */
    u32 app_limited; /* 2580 4 */
    u32 rcv_wnd; /* 2584 4 */
    struct tcp_options_received rx_opt; /* 2588 24 */
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2612 0 */
    /* XXX 4 bytes hole, try to pack */
    [...]
    /* size: 3200, cachelines: 50, members: 161 */
}
[AFTER THIS PATCH]
struct tcp_sock {
    [...]
    __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
    u8 nonagle:4; /* 2521: 0 1 */
    u8 rate_app_limited:1; /* 2521: 4 1 */
    /* XXX 3 bits hole, try to pack */
    /* Force alignment to the next boundary: */
    u8 :0;
    u8 received_ce_pending:4; /* 2522: 0 1 */
    u8 unused2:4; /* 2522: 4 1 */
    /* XXX 1 byte hole, try to pack */
    [...]
    u32 delivered_ce; /* 2576 4 */
    u32 received_ce; /* 2580 4 */
    u32 app_limited; /* 2584 4 */
    u32 rcv_wnd; /* 2588 4 */
    struct tcp_options_received rx_opt; /* 2592 24 */
    __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
    [...]
    /* size: 3200, cachelines: 50, members: 164 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2025-09-18 08:47:51 +02:00
Jakub Kicinski	152ba35c04	Merge branch 'net-mlx5e-use-multiple-doorbells' Tariq Toukan says: ==================== net/mlx5e: Use multiple doorbells mlx5e uses a single MMIO-mapped doorbell per netdevice for all send and receive operations. Writes to the doorbell go over the PCIe bus directly to the device, which then services the indicated queues. On certain architectures and with sufficiently high volume of doorbell ringing (many cores, many active channels, small MTU, no GSO, etc.), the MMIO-mapped doorbell address can become contended, leading to delays in servicing writes to that address and a global slowdown of all traffic for that netdevice. mlx5 NICs have supported using multiple doorbells for many years; the mlx5_ib driver for the same hardware has been using multiple doorbells traditionally. This patch series extends the mlx5 Ethernet driver to also use multiple doorbells to solve the MMIO contention issues. By allocating and using more doorbells for all channel queues (TX and RX), the MMIO contention on any particular doorbell address is reduced significantly.
The first patches are cleanups:
  net/mlx5: Fix typo of MLX5_EQ_DOORBEL_OFFSET
  net/mlx5: Remove unused 'offset' field from struct mlx5_sq_bfreg
  net/mlx5e: Remove unused 'xsk' param of mlx5e_build_xdpsq_param
The next patch separates the global doorbell from Ethernet-specific resources:
  net/mlx5: Store the global doorbell in mlx5_priv
Next, plumbing to allow a different doorbell to be used for channel TX and RX queues:
  net/mlx5e: Prepare for using multiple TX doorbells
  net/mlx5e: Prepare for using different CQ doorbells
Then, enable using multiple doorbells for channel queues:
  net/mlx5e: Use multiple TX doorbells
  net/mlx5e: Use multiple CQ doorbells
Finally, introduce a devlink parameter to control this:
  devlink: Add a 'num_doorbells' driverinit param
  net/mlx5e: Use the 'num_doorbells' devlink param
Some performance results, done with the Linux pktgen script, running b2b over Connect-X 8 NICs:
  samples/pktgen/pktgen_sample02_multiqueue.sh -i $NIC -s 64 -d $DST_IP \
      -m $MAC -t 64
  Baseline (1 doorbell): 9 Mpps
  This series (8 doorbells): 56 Mpps
Note that pktgen without 'burst' rings the doorbell after every packet, while real packet TX using NAPI usually batches multiple pending packets with the xmit_more mechanism. So this is in essence a micro-benchmark showcasing the improvement of using multiple doorbells on platforms affected by MMIO contention. Real life traffic usually sees little movement either way. ==================== Link: https://patch.msgid.link/1758031904-634231-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:56 -07:00
Cosmin Ratiu	11bbcfb766	net/mlx5e: Use the 'num_doorbells' devlink param Use the new devlink param to control how many doorbells mlx5e devices allocate and use. The maximum number of doorbells configurable is capped to the maximum number of channels. This only applies to the Ethernet part, the RDMA devices using mlx5 manage their own doorbells. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:54 -07:00
Cosmin Ratiu	6bdcb735fe	devlink: Add a 'num_doorbells' driverinit param This parameter can be used by drivers to configure a different number of doorbells. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:51 -07:00
Cosmin Ratiu	325db9c6f6	net/mlx5e: Use multiple CQ doorbells Channel doorbells are now also used by all channel CQs. A new 'uar' parameter is added to 'struct mlx5e_create_cq_param', which is then used in mlx5e_alloc_cq. A single UAR page has two TX doorbells and a single CQ doorbell, so every consecutive pair of 'struct mlx5_sq_bfreg' (TX doorbells) uses the same underlying 'struct mlx5_uars_page' (CQ doorbell). So by using c->bfreg->up, CQs from every consecutive channel pair will share the same CQ doorbell. Non-channel associated CQs keep using the global CQ doorbell. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:47 -07:00
Cosmin Ratiu	71fb4832d5	net/mlx5e: Use multiple TX doorbells First, allocate more doorbells in mlx5e_create_mdev_resources: - one doorbell remains 'global' and will be used by all non-channel associated SQs (e.g. ASO, HWS, PTP, ...). - allocate additional 'num_doorbells' doorbells. This defaults to minimum between 8 and max number of channels. mlx5e_channel_pick_doorbell() now spreads out channel SQs across available doorbells. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:44 -07:00
Cosmin Ratiu	a315b723e8	net/mlx5e: Prepare for using different CQ doorbells Completion queues (CQs) in mlx5 use the same global doorbell, which may become contended when accessed concurrently from many cores. This patch prepares the CQ management code for supporting different doorbells per CQ. This will be used in downstream patches to allow separate doorbells to be used by channels CQs. The main change is moving the 'uar' pointer from struct mlx5_core_cq to struct mlx5e_cq, as the uar page to be used is better off stored directly there. Other users of mlx5_core_cq also store the UAR to be used separately and therefore the pointer being removed is dead weight for them. As evidence, in this patch there are two users which set the mcq.uar pointer but didn't use it, Software Steering and old Innova CQ creation code. Instead, they rang the doorbell directly from another pointer. The 'uar' pointer added to struct mlx5e_cq remains in a hot cacheline (as before), because it may get accessed for each packet. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:40 -07:00
Cosmin Ratiu	673d7ab756	net/mlx5e: Prepare for using multiple TX doorbells The driver allocates a single doorbell per device and uses it for all Send Queues (SQs). This can become a bottleneck due to the high number of concurrent MMIO accesses when ringing the same doorbell from many channels. This patch makes the doorbells used by channel queues configurable. mlx5e_channel_pick_doorbell() is added to select the doorbell to be used for a given channel, picking the default for now. When opening a channel, the selected doorbell is saved to the channel struct and used whenever channel-related queues are created. Finally, 'uar_page' is added to 'struct mlx5e_create_sq_param' to control which doorbell to use when allocating an SQ, since that can happen outside channel context (e.g. for PTP). Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:36 -07:00
Cosmin Ratiu	aa4595d0ad	net/mlx5: Store the global doorbell in mlx5_priv The global doorbell is used for more than just Ethernet resources, so move it out of mlx5e_hw_objs into a common place (mlx5_priv), to avoid non-Ethernet modules (e.g. HWS, ASO) depending on Ethernet structs. Use this opportunity to consolidate it with the 'uar' pointer already there, which was used as an RX doorbell. Underneath the 'uar' pointer is identical to 'bfreg->up', so store a single resource and use that instead. For CQ doorbells, care is taken to always use bfreg->up->index instead of bfreg->index, which may refer to a subsequent UAR page from the same ALLOC_UAR batch on some NICs. This paves the way for cleanly supporting multiple doorbells in the Ethernet driver. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:32 -07:00
Cosmin Ratiu	913d28f8a7	net/mlx5e: Remove unused 'xsk' param of mlx5e_build_xdpsq_param This was added in commit [1], but its only use removed in commit [2]. The parameter is unused, so remove it from the function parameter list. [1] commit `9ded70fa1d` ("net/mlx5e: Don't prefill WQEs in XDP SQ in the multi buffer mode") [2] commit `1a9304859b` ("net/mlx5: XDP, Enable TX side XDP multi-buffer support") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:28 -07:00
Cosmin Ratiu	05dfe654b5	net/mlx5: Remove unused 'offset' field from mlx5_sq_bfreg The 'offset' field was introduced in the original commit [1] and never used until commit [2], which added an unnecessary use. Remove the field and refactor the write-combining test to use a local variable instead. [1] commit `a6d51b6861` ("net/mlx5: Introduce blue flame register allocator") [2] commit `d98995b4bf` ("net/mlx5: Reimplement write combining test") Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:25 -07:00
Cosmin Ratiu	917449e7c3	net/mlx5: Fix typo of MLX5_EQ_DOORBEL_OFFSET Also convert it to a simple define. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:30:20 -07:00
Jakub Kicinski	cbff0b1ec6	Merge branch 'net-dsa-mv88e6xxx-further-ptp-related-cleanups' Russell King says: ==================== net: dsa: mv88e6xxx: further PTP-related cleanups Further mv88e6xxx PTP-related cleanups, mostly centred around the register definitions, but also moving one function prototype to a more logical header. ==================== Link: https://patch.msgid.link/aMnJ1uRPvw82_aCT@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:21:15 -07:00
Russell King (Oracle)	e866e5118b	net: dsa: mv88e6xxx: move mv88e6xxx_hwtstamp_work() prototype Since mv88e6xxx_hwtstamp_work() is defined in hwtstamp.c, its prototype should be in hwtstamp.h, so move it there. Remove it's redundant stub definition, as both hwtstamp.c (the function provider) and ptp.c (the consumer) are both dependent on the same config symbol. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:21:12 -07:00
Russell King (Oracle)	a295b33b0f	net: dsa: mv88e6xxx: remove unused 88E6165 register definitions Remove the unused 88E6165 register definitions. For the port registers, add a comment describing that each arrival and departure offset is for a set of four registers that correspond with status, two timestamp registers and the PTP sequence ID captured from the packet. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:21:10 -07:00
Russell King (Oracle)	30cf6a875e	net: dsa: mv88e6xxx: remove duplicated register definition There are two identical MV88E6XXX_PTP_GC_ETYPE definitions in ptp.h, and MV88E6XXX_PTP_ETHERTYPE in hwtstamp.h which all refer to the exact same register. As the code that accesses this register is in hwtstamp.c, use the hwtstamp.h definition, and remove the unnecessary duplicated definition in ptp.h Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:21:07 -07:00
Russell King (Oracle)	946fc083fc	net: dsa: mv88e6xxx: remove unused TAI definitions Remove the TAI definitions that the code never uses. Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:21:04 -07:00
Russell King (Oracle)	a12372ac59	net: dsa: mv88e6xxx: rename TAI definitions according to core The TAI_EVENT_STATUS and TAI_CFG definitions are only used for the 88E6352-family of TAI implementations. Rename them as such, and remove the TAI_EVENT_TIME_* definitions that are unused (although we read them as a block.) Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:21:01 -07:00
Jakub Kicinski	e218ae4024	Merge branch 'net-fix-uaf-of-sk_dst_get-sk-dev' Kuniyuki Iwashima says: ==================== net: Fix UAF of sk_dst_get(sk)->dev. syzbot caught use-after-free of sk_dst_get(sk)->dev, which was not fetched under RCU nor RTNL. [0] Patch 1 ~ 5, 7 fix UAF in smc, tcp, ktls, mptcp Patch 6 fixes dst ref leak in mptcp [0]: https://lore.kernel.org/68c237c7.050a0220.3c6139.0036.GAE@google.com v1: https://lore.kernel.org/20250911030620.1284754-1-kuniyu@google.com ==================== Link: https://patch.msgid.link/20250916214758.650211-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:25 -07:00
Kuniyuki Iwashima	893c49a78d	mptcp: Use __sk_dst_get() and dst_dev_rcu() in mptcp_active_enable(). mptcp_active_enable() is called from subflow_finish_connect(), which is icsk->icsk_af_ops->sk_rx_dst_set() and it's not always under RCU. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu(). Fixes: `27069e7cb3` ("mptcp: disable active MPTCP in case of blackhole") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	108a86c71c	mptcp: Call dst_release() in mptcp_active_enable(). mptcp_active_enable() calls sk_dst_get(), which returns dst with its refcount bumped, but forgot dst_release(). Let's add missing dst_release(). Cc: stable@vger.kernel.org Fixes: `27069e7cb3` ("mptcp: disable active MPTCP in case of blackhole") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	c65f27b9c3	tls: Use __sk_dst_get() and dst_dev_rcu() in get_netdev_for_sock(). get_netdev_for_sock() is called during setsockopt(), so not under RCU. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu(). Note that the only ->ndo_sk_get_lower_dev() user is bond_sk_get_lower_dev(), which uses RCU. Fixes: `e8f6979981` ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250916214758.650211-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	0b0e4d51c6	smc: Use __sk_dst_get() and dst_dev_rcu() in smc_vlan_by_tcpsk(). smc_vlan_by_tcpsk() fetches sk_dst_get(sk)->dev before RTNL and passes it to netdev_walk_all_lower_dev(), which is illegal. Also, smc_vlan_by_tcpsk_walk() does not require RTNL at all. Let's use __sk_dst_get(), dst_dev_rcu(), and netdev_walk_all_lower_dev_rcu(). Note that the returned value of smc_vlan_by_tcpsk() is not used in the caller. Fixes: `0cfdd8f92c` ("smc: connection and link group creation") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	235f81045c	smc: Use __sk_dst_get() and dst_dev_rcu() in smc_clc_prfx_match(). smc_clc_prfx_match() is called from smc_listen_work() and not under RCU nor RTNL. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dst_dev_rcu(). Note that the returned value of smc_clc_prfx_match() is not used in the caller. Fixes: `a046d57da1` ("smc: CLC handshake (incl. preparation steps)") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:22 -07:00
Kuniyuki Iwashima	935d783e5d	smc: Use __sk_dst_get() and dst_dev_rcu() in in smc_clc_prfx_set(). smc_clc_prfx_set() is called during connect() and not under RCU nor RTNL. Using sk_dst_get(sk)->dev could trigger UAF. Let's use __sk_dst_get() and dev_dst_rcu() under rcu_read_lock() after kernel_getsockname(). Note that the returned value of smc_clc_prfx_set() is not used in the caller. While at it, we change the 1st arg of smc_clc_prfx_set[46]_rcu() not to touch dst there. Fixes: `a046d57da1` ("smc: CLC handshake (incl. preparation steps)") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:21 -07:00
Kuniyuki Iwashima	3d3466878a	smc: Fix use-after-free in __pnet_find_base_ndev(). syzbot reported use-after-free of net_device in __pnet_find_base_ndev(), which was called during connect(). [0] smc_pnet_find_ism_resource() fetches sk_dst_get(sk)->dev and passes down to pnet_find_base_ndev(), where RTNL is held. Then, UAF happened at __pnet_find_base_ndev() when the dev is first used. This means dev had already been freed before acquiring RTNL in pnet_find_base_ndev(). While dev is going away, dst->dev could be swapped with blackhole_netdev, and the dev's refcnt by dst will be released. We must hold dev's refcnt before calling smc_pnet_find_ism_resource(). Also, smc_pnet_find_roce_resource() has the same problem. Let's use __sk_dst_get() and dst_dev_rcu() in the two functions. [0]: BUG: KASAN: use-after-free in __pnet_find_base_ndev+0x1b1/0x1c0 net/smc/smc_pnet.c:926 Read of size 1 at addr ffff888036bac33a by task syz.0.3632/18609 CPU: 1 UID: 0 PID: 18609 Comm: syz.0.3632 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x240 mm/kasan/report.c:482 kasan_report+0x118/0x150 mm/kasan/report.c:595 __pnet_find_base_ndev+0x1b1/0x1c0 net/smc/smc_pnet.c:926 pnet_find_base_ndev net/smc/smc_pnet.c:946 [inline] smc_pnet_find_ism_by_pnetid net/smc/smc_pnet.c:1103 [inline] smc_pnet_find_ism_resource+0xef/0x390 net/smc/smc_pnet.c:1154 smc_find_ism_device net/smc/af_smc.c:1030 [inline] smc_find_proposal_devices net/smc/af_smc.c:1115 [inline] __smc_connect+0x372/0x1890 net/smc/af_smc.c:1545 smc_connect+0x877/0xd90 net/smc/af_smc.c:1715 __sys_connect_file net/socket.c:2086 [inline] __sys_connect+0x313/0x440 net/socket.c:2105 __do_sys_connect net/socket.c:2111 [inline] __se_sys_connect net/socket.c:2108 [inline] __x64_sys_connect+0x7a/0x90 net/socket.c:2108 
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f47cbf8eba9 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f47ccdb1038 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 00007f47cc1d5fa0 RCX: 00007f47cbf8eba9 RDX: 0000000000000010 RSI: 0000200000000280 RDI: 000000000000000b RBP: 00007f47cc011e19 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007f47cc1d6038 R14: 00007f47cc1d5fa0 R15: 00007ffc512f8aa8 </TASK> The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888036bacd00 pfn:0x36bac flags: 0xfff00000000000(node=0\|zone=1\|lastcpupid=0x7ff) raw: 00fff00000000000 ffffea0001243d08 ffff8880b863fdc0 0000000000000000 raw: ffff888036bacd00 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as freed page last allocated via order 2, migratetype Unmovable, gfp_mask 0x446dc0(GFP_KERNEL_ACCOUNT\|__GFP_ZERO\|__GFP_NOWARN\|__GFP_RETRY_MAYFAIL\|__GFP_COMP), pid 16741, tgid 16741 (syz-executor), ts 343313197788, free_ts 380670750466 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x240/0x2a0 mm/page_alloc.c:1851 prep_new_page mm/page_alloc.c:1859 [inline] get_page_from_freelist+0x21e4/0x22c0 mm/page_alloc.c:3858 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5148 alloc_pages_mpol+0x232/0x4a0 mm/mempolicy.c:2416 ___kmalloc_large_node+0x5f/0x1b0 mm/slub.c:4317 __kmalloc_large_node_noprof+0x18/0x90 mm/slub.c:4348 __do_kmalloc_node mm/slub.c:4364 [inline] __kvmalloc_node_noprof+0x6d/0x5f0 mm/slub.c:5067 alloc_netdev_mqs+0xa3/0x11b0 net/core/dev.c:11812 
tun_set_iff+0x532/0xef0 drivers/net/tun.c:2775 __tun_chr_ioctl+0x788/0x1df0 drivers/net/tun.c:3085 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:598 [inline] __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:584 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f page last free pid 18610 tgid 18608 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1395 [inline] __free_frozen_pages+0xbc4/0xd30 mm/page_alloc.c:2895 free_large_kmalloc+0x13a/0x1f0 mm/slub.c:4820 device_release+0x99/0x1c0 drivers/base/core.c:-1 kobject_cleanup lib/kobject.c:689 [inline] kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x22b/0x480 lib/kobject.c:737 netdev_run_todo+0xd2e/0xea0 net/core/dev.c:11513 rtnl_unlock net/core/rtnetlink.c:157 [inline] rtnl_net_unlock include/linux/rtnetlink.h:135 [inline] rtnl_dellink+0x537/0x710 net/core/rtnetlink.c:3563 rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6946 netlink_rcv_skb+0x208/0x470 net/netlink/af_netlink.c:2552 netlink_unicast_kernel net/netlink/af_netlink.c:1320 [inline] netlink_unicast+0x82f/0x9e0 net/netlink/af_netlink.c:1346 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1896 sock_sendmsg_nosec net/socket.c:714 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:729 ____sys_sendmsg+0x505/0x830 net/socket.c:2614 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2668 __sys_sendmsg net/socket.c:2700 [inline] __do_sys_sendmsg net/socket.c:2705 [inline] __se_sys_sendmsg net/socket.c:2703 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2703 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xfa/0x3b0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Memory state around the buggy address: ffff888036bac200: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888036bac280: ff ff ff ff ff ff ff ff ff ff ff ff ff 
ff ff ff >ffff888036bac300: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff888036bac380: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff888036bac400: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Fixes: `0afff91c6f` ("net/smc: add pnetid support") Fixes: `1619f77058` ("net/smc: add pnetid support for SMC-D and ISM") Reported-by: syzbot+ea28e9d85be2f327b6c6@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/68c237c7.050a0220.3c6139.0036.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916214758.650211-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 18:10:21 -07:00
Jakub Kicinski	6b957c0a36	Merge branch 'net-phy-remove-mdio_board_info-support-from-phylib' Heiner Kallweit says: ==================== net: phy: remove mdio_board_info support from phylib Since its introduction in 2017 mdio_board_info has had only two users: - dsa_loop (still existing) - arm orion, added in 2017 and removed with `fd68572b57` ("ARM: orion5x: remove dsa_chip_data references") So let's remove usage of mdio_board_info from dsa_loop, then support for mdio_board_info can be dropped from phylib. ==================== Link: https://patch.msgid.link/4ccf7476-0744-4f6b-aafc-7ba84d15a432@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:24:04 -07:00
Heiner Kallweit	b67a8631a4	net: phy: remove mdio_board_info support from phylib After having removed mdio_board_info usage from dsa_loop, there's no user left. So let's drop support for it from phylib. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/01542a2e-05f5-4f13-acef-72632b33b5be@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:24:01 -07:00
Heiner Kallweit	41357bc7b9	net: dsa: dsa_loop: remove usage of mdio_board_info dsa_loop is the last remaining user of mdio_board_info. Let's remove using mdio_board_info, so that support for it can be dropped from phylib. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Tested-by: Vladimir Oltean <olteanv@gmail.com> Link: https://patch.msgid.link/da9563a4-8e14-41cf-bfea-cf5f1b58a4b7@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:24:01 -07:00
Wei Fang	2479cba209	ptp: netc: only enable periodic pulse event interrupts for PPS The periodic pulse event interrupts are used to register the PPS events into the system, so it is only applicable to PTP_CLK_REQ_PPS request. However, these interrupts are mistakenly enabled in PTP_CLK_REQ_PEROUT request, so fix this error. Fixes: `671e266835` ("ptp: netc: add periodic pulse output support") Signed-off-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20250915082528.1616361-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:18:58 -07:00
Chaoyi Chen	a09655dde7	Revert "net: ethernet: stmmac: dwmac-rk: Make the clk_phy could be used for external phy" This reverts commit `da114122b8`. As discussed, the PHY clock should be managed by PHY driver instead of other driver like dwmac-rk. Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/a30a8c97-6b96-45ba-bad7-8a40401babc2@samsung.com Fixes: `da114122b8` ("net: ethernet: stmmac: dwmac-rk: Make the clk_phy could be used for external phy") Signed-off-by: Chaoyi Chen <chaoyi.chen@rock-chips.com> Link: https://patch.msgid.link/0A3F1D1604FEE424+20250916012628.1819-1-kernel@airkyi.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:16:36 -07:00
Dave Stevenson	dc110d1b23	net: cadence: macb: Add support for Raspberry Pi RP1 ethernet controller The RP1 chip has the Cadence GEM block, but wants the tx_clock to always run at 125MHz, in the same way as sama7g5. Add the relevant configuration. Signed-off-by: Dave Stevenson <dave.stevenson@raspberrypi.com> Signed-off-by: Stanimir Varbanov <svarbanov@suse.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Reviewed-by: Claudiu Beznea <claudiu.beznea@tuxon.dev> Link: https://patch.msgid.link/20250916081059.3992108-1-svarbanov@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:09:00 -07:00
Jakub Kicinski	aa9f09a26b	Merge branch 'ptp-safely-cleanup-when-unregistering-a-ptp-clock' Russell King says: ==================== ptp: safely cleanup when unregistering a PTP clock The standard rule in the kernel for unregistering user visible devices is to unpublish the userspace API before doing any shutdown of the resources necessary for the operation of the device. PTP has several issues in this area: 1. ptp_clock_unregister() cancells and destroys work while the PTP chardev is still published, which gives the opportunity for a precisely timed user API call to cause a driver to attempt to queue the aux work. 2. PTP pins are not cleaned up - if userspace has enabled PTP pins, e.g. for extts, drivers are forced to do cleanup before calling ptp_clock_unregister() to stop events being forwarded into the PTP layer. E.g mv88e6xxx cancells its internal tai_event_work to avoid calling into the PTP clock code with a stale ptp_clock pointer, but a badly timed userspace EXTTS enable will re-schedule the tai_event_work. Simplify the process by ensuring that: 1. we take a referene on the PTP struct device to stop the ptp_clock structure going away underneath us when we call posix_clock_unregister(). 2. call posix_clock_unregister() to remove the /dev/ptp* device. 3. add additional functionality to disable any PTP EXTTS pins and PPS event generation that have been configured on this device. This should shutdown all events coming from PTP clock drivers. 4. cancel the delayed aux_work and destroy the kthread. 5. remove the PPS source. 6. drop the reference on the PTP struct device to allow the ptp_clock structure to be released. This is difficult for me to test beyond build testing - on the Clearfog platform with Marvell PHY PTP, the ethernet PHY is the primary connectivity, so removing the PHY driver for an in-use network interface isn't possible. 
On the ZII rev B platform, where the DSA switches have the TAI hardware and where root NFS is used, removal of the DSA switch module somehow forces the FEC interface _not_ connected to the DSA switch to lose link, causing the machine to become unresponsive as its root filesystem vanishes. ==================== Link: https://patch.msgid.link/aMnYIu7RbgfXrmGx@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:04:12 -07:00
Russell King (Oracle)	a60fc3294a	ptp: rework ptp_clock_unregister() to disable events The ordering of ptp_clock_unregister() is not ideal, as the chardev remains published while state is being torn down, which means userspace can race with the kernel teardown. There is also no cleanup of enabled pin settings nor of the internal PPS event, which means enabled events can still forward into the core, dereferencing a free'd pointer. Rework the ordering of cleanup in ptp_clock_unregister() so that we unpublish the posix clock (and user chardev), disable any pins that have EXTTS events enabled, disable the PPS event, and then clean up the aux work and PPS source. This avoids potential use-after-free and races in PTP clock driver teardown. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # ocelot, sja1105, netdevsim, vclocks Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Acked-by: Richard Cochran <richardcochran@gmail.com> Link: https://patch.msgid.link/E1uydLH-000000061DM-2gcV@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:04:09 -07:00
Russell King (Oracle)	0fcb1dc3e8	ptp: describe the two disables in ptp_set_pinfunc() Accurately describe what each call to ptp_disable_pinfunc() is doing, rather than the misleading comment above the first disable. This helps to make the code more readable. Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Richard Cochran <richardcochran@gmail.com> Link: https://patch.msgid.link/E1uydLC-000000061DG-2BRt@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2025-09-17 15:04:09 -07:00

1384210 Commits