Support devmem Tx. We already use skb_frag_dma_map(), we just need
to make sure we don't try to unmap the frags. Check if frag is
unreadable and mark the ring entry.
# ./tools/testing/selftests/drivers/net/hw/devmem.py
TAP version 13
1..3
ok 1 devmem.check_rx
ok 2 devmem.check_tx
ok 3 devmem.check_tx_chunks
# Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
Acked-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20250916145401.1464550-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
As SACK blocks tend to eat all option space when there are
many holes, it is useful to compromise on sending many SACK
blocks in every ACK and attempt to fit the AccECN option
there by reducing the number of SACK blocks. However, it will
never go below two SACK blocks because of the AccECN option.
As the AccECN option is often not put to every ACK, the space
hijack is usually only temporary. Depending on the reuqired
AccECN fields (can be either 3, 2, 1, or 0, cf. Table 5 in
AccECN spec) and the NOPs used for alignment of other
TCP options, up to two SACK blocks will be reduced. Please
find below tables for more details:
+====================+=========================================+
| Number of | Required | Remaining | Number of | Final |
| SACK | AccECN | option | reduced | number of |
| blocks | fields | spaces | SACK blocks | SACK blocks |
+===========+==========+===========+=============+=============+
| x (<=2) | 0 to 3 | any | 0 | x |
+-----------+----------+-----------+-------------+-------------+
| 3 | 0 | any | 0 | 3 |
| 3 | 1 | <4 | 1 | 2 |
| 3 | 1 | >=4 | 0 | 3 |
| 3 | 2 | <8 | 1 | 2 |
| 3 | 2 | >=8 | 0 | 3 |
| 3 | 3 | <12 | 1 | 2 |
| 3 | 3 | >=12 | 0 | 3 |
+-----------+----------+-----------+-------------+-------------+
| y (>=4) | 0 | any | 0 | y |
| y (>=4) | 1 | <4 | 1 | y-1 |
| y (>=4) | 1 | >=4 | 0 | y |
| y (>=4) | 2 | <8 | 1 | y-1 |
| y (>=4) | 2 | >=8 | 0 | y |
| y (>=4) | 3 | <4 | 2 | y-2 |
| y (>=4) | 3 | <12 | 1 | y-1 |
| y (>=4) | 3 | >=12 | 0 | y |
+===========+==========+===========+=============+=============+
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-11-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The AccECN option ceb/cep heuristic algorithm is from AccECN spec
Appendix A.2.2 to mitigate against false ACE field overflows. Armed
with ceb delta from option, delivered bytes, and delivered packets it
is possible to estimate how many times ACE field wrapped.
This calculation is necessary only if more than one wrap is possible.
Without SACK, delivered bytes and packets are not always trustworthy in
which case TCP falls back to the simpler no-or-all wraps ceb algorithm.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-10-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
AccECN option may fail in various way, handle these:
- Attempt to negotiate the use of AccECN on the 1st retransmitted SYN
- From the 2nd retransmitted SYN, stop AccECN negotiation
- Remove option from SYN/ACK rexmits to handle blackholes
- If no option arrives in SYN/ACK, assume Option is not usable
- If an option arrives later, re-enabled
- If option is zeroed, disable AccECN option processing
This patch use existing padding bits in tcp_request_sock and
holes in tcp_sock without increasing the size.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-9-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Instead of sending the option in every ACK, limit sending to
those ACKs where the option is necessary:
- Handshake
- "Change-triggered ACK" + the ACK following it. The
2nd ACK is necessary to unambiguously indicate which
of the ECN byte counters in increasing. The first
ACK has two counters increasing due to the ecnfield
edge.
- ACKs with CE to allow CEP delta validations to take
advantage of the option.
- Force option to be sent every at least once per 2^22
bytes. The check is done using the bit edges of the
byte counters (avoids need for extra variables).
- AccECN option beacon to send a few times per RTT even if
nothing in the ECN state requires that. The default is 3
times per RTT, and its period can be set via
sysctl_tcp_ecn_option_beacon.
Below are the pahole outcomes before and after this patch,
in which the group size of tcp_sock_write_tx is increased
from 89 to 97 due to the new u64 accecn_opt_tstamp member:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u64 tcp_wstamp_ns; /* 2488 8 */
struct list_head tsorted_sent_queue; /* 2496 16 */
[...]
__cacheline_group_end__tcp_sock_write_tx[0]; /* 2521 0 */
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
u8 accecn_minlen:2; /* 2523: 0 1 */
u8 est_ecnfield:2; /* 2523: 2 1 */
u8 unused3:4; /* 2523: 4 1 */
[...]
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
[...]
/* size: 3200, cachelines: 50, members: 171 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u64 tcp_wstamp_ns; /* 2488 8 */
u64 accecn_opt_tstamp; /* 2596 8 */
struct list_head tsorted_sent_queue; /* 2504 16 */
[...]
__cacheline_group_end__tcp_sock_write_tx[0]; /* 2529 0 */
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2529 0 */
u8 nonagle:4; /* 2529: 0 1 */
u8 rate_app_limited:1; /* 2529: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2530: 0 1 */
u8 unused2:4; /* 2530: 4 1 */
u8 accecn_minlen:2; /* 2531: 0 1 */
u8 est_ecnfield:2; /* 2531: 2 1 */
u8 accecn_opt_demand:2; /* 2531: 4 1 */
u8 prev_ecnfield:2; /* 2531: 6 1 */
[...]
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2636 0 */
[...]
/* size: 3200, cachelines: 50, members: 173 */
}
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Co-developed-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-8-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The Accurate ECN allows echoing back the sum of bytes for
each IP ECN field value in the received packets using
AccECN option. This change implements AccECN option tx & rx
side processing without option send control related features
that are added by a later change.
Based on specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
(Some features of the spec will be added in the later changes
rather than in this one).
A full-length AccECN option is always attempted but if it does
not fit, the minimum length is selected based on the counters
that have changed since the last update. The AccECN option
(with 24-bit fields) often ends in odd sizes so the option
write code tries to take advantage of some nop used to pad
the other TCP options.
The delivered_ecn_bytes pairs with received_ecn_bytes similar
to how delivered_ce pairs with received_ce. In contrast to
ACE field, however, the option is not always available to update
delivered_ecn_bytes. For ACK w/o AccECN option, the delivered
bytes calculated based on the cumulative ACK+SACK information
are assigned to one of the counters using an estimation
heuristic to select the most likely ECN byte counter. Any
estimation error is corrected when the next AccECN option
arrives. It may occur that the heuristic gets too confused
when there are enough different byte counter deltas between
ACKs with the AccECN option in which case the heuristic just
gives up on updating the counters for a while.
tcp_ecn_option sysctl can be used to select option sending
mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM,
and TCP_ECN_OPTION_FULL.
This patch increases the size of tcp_info struct, as there is
no existing holes for new u32 variables. Below are the pahole
outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_info {
[...]
__u32 tcpi_total_rto_time; /* 244 4 */
/* size: 248, cachelines: 4, members: 61 */
}
[AFTER THIS PATCH]
struct tcp_info {
[...]
__u32 tcpi_total_rto_time; /* 244 4 */
__u32 tcpi_received_ce; /* 248 4 */
__u32 tcpi_delivered_e1_bytes; /* 252 4 */
__u32 tcpi_delivered_e0_bytes; /* 256 4 */
__u32 tcpi_delivered_ce_bytes; /* 260 4 */
__u32 tcpi_received_e1_bytes; /* 264 4 */
__u32 tcpi_received_e0_bytes; /* 268 4 */
__u32 tcpi_received_ce_bytes; /* 272 4 */
/* size: 280, cachelines: 5, members: 68 */
}
This patch uses the existing 1-byte holes in the tcp_sock_write_txrx
group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx
group after the new u32 delivered_ecn_bytes[3] member. Therefore, the
group size of tcp_sock_write_rx is increased from 96 to 112. Below
are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u8 received_ce_pending:4; /* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
/* XXX 1 byte hole, try to pack */
[...]
u32 rcv_rtt_last_tsecr; /* 2668 4 */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */
[...]
/* size: 3200, cachelines: 50, members: 167 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
u8 accecn_minlen:2; /* 2523: 0 1 */
u8 est_ecnfield:2; /* 2523: 2 1 */
u8 unused3:4; /* 2523: 4 1 */
[...]
u32 rcv_rtt_last_tsecr; /* 2668 4 */
u32 delivered_ecn_bytes[3];/* 2672 12 */
/* XXX 4 bytes hole, try to pack */
[...]
__cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */
[...]
/* size: 3200, cachelines: 50, members: 171 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-7-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
1) Don't early return when sack doesn't fit. AccECN code will be
placed after this fragment so no early returns please.
2) Make sure opts->num_sack_blocks is not left undefined. E.g.,
tcp_current_mss() does not memset its opts struct to zero.
AccECN code checks if SACK option is present and may even
alter it to make room for AccECN option when many SACK blocks
are present. Thus, num_sack_blocks needs to be always valid.
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-6-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
These three byte counters track IP ECN field payload byte sums for
all arriving (acceptable) packets for ECT0, ECT1, and CE. The
AccECN option (added by a later patch in the series) echoes these
counters back to sender side; therefore, it is placed within the
group of tcp_sock_write_txrx.
Below are the pahole outcomes before and after this patch, in which
the group size of tcp_sock_write_txrx is increased from 95 + 4 to
107 + 4 and an extra 4-byte hole is created but will be exploited
in later patches:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 app_limited; /* 2584 4 */
u32 rcv_wnd; /* 2588 4 */
struct tcp_options_received rx_opt; /* 2592 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
[...]
/* size: 3200, cachelines: 50, members: 166 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 received_ecn_bytes[3];/* 2584 12 */
u32 app_limited; /* 2596 4 */
u32 rcv_wnd; /* 2600 4 */
struct tcp_options_received rx_opt; /* 2604 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */
/* XXX 4 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 167 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-4-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Accurate ECN negotiation parts based on the specification:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN is negotiated using ECE, CWR and AE flags in the
TCP header. TCP falls back into using RFC3168 ECN if one of the
ends supports only RFC3168-style ECN.
The AccECN negotiation includes reflecting IP ECN field value
seen in SYN and SYNACK back using the same bits as negotiation
to allow responding to SYN CE marks and to detect ECN field
mangling. CE marks should not occur currently because SYN=1
segments are sent with Non-ECT in IP ECN field (but proposal
exists to remove this restriction).
Reflecting SYN IP ECN field in SYNACK is relatively simple.
Reflecting SYNACK IP ECN field in the final/third ACK of
the handshake is more challenging. Linux TCP code is not well
prepared for using the final/third ACK a signalling channel
which makes things somewhat complicated here.
tcp_ecn sysctl can be used to select the highest ECN variant
(Accurate ECN, ECN, No ECN) that is attemped to be negotiated and
requested for incoming connection and outgoing connection:
TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN,
TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN,
TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN.
After this patch, the size of tcp_request_sock remains unchanged
and no new holes are added. Below are the pahole outcomes before
and after this patch:
[BEFORE THIS PATCH]
struct tcp_request_sock {
[...]
u32 rcv_nxt; /* 352 4 */
u8 syn_tos; /* 356 1 */
/* size: 360, cachelines: 6, members: 16 */
}
[AFTER THIS PATCH]
struct tcp_request_sock {
[...]
u32 rcv_nxt; /* 352 4 */
u8 syn_tos; /* 356 1 */
bool accecn_ok; /* 357 1 */
u8 syn_ect_snt:2; /* 358: 0 1 */
u8 syn_ect_rcv:2; /* 358: 2 1 */
u8 accecn_fail_mode:4; /* 358: 4 1 */
/* size: 360, cachelines: 6, members: 20 */
}
After this patch, the size of tcp_sock remains unchanged and no new
holes are added. Also, 4 bits of the existing 2-byte hole are exploited.
Below are the pahole outcomes before and after this patch:
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
u8 dup_ack_counter:2; /* 2761: 0 1 */
u8 tlp_retrans:1; /* 2761: 2 1 */
u8 unused:5; /* 2761: 3 1 */
u8 thin_lto:1; /* 2762: 0 1 */
u8 fastopen_connect:1; /* 2762: 1 1 */
u8 fastopen_no_cookie:1; /* 2762: 2 1 */
u8 fastopen_client_fail:2; /* 2762: 3 1 */
u8 frto:1; /* 2762: 5 1 */
/* XXX 2 bits hole, try to pack */
[...]
u8 keepalive_probes; /* 2765 1 */
/* XXX 2 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 164 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
u8 dup_ack_counter:2; /* 2761: 0 1 */
u8 tlp_retrans:1; /* 2761: 2 1 */
u8 syn_ect_snt:2; /* 2761: 3 1 */
u8 syn_ect_rcv:2; /* 2761: 5 1 */
u8 thin_lto:1; /* 2761: 7 1 */
u8 fastopen_connect:1; /* 2762: 0 1 */
u8 fastopen_no_cookie:1; /* 2762: 1 1 */
u8 fastopen_client_fail:2; /* 2762: 2 1 */
u8 frto:1; /* 2762: 4 1 */
/* XXX 3 bits hole, try to pack */
[...]
u8 keepalive_probes; /* 2765 1 */
u8 accecn_fail_mode:4; /* 2766: 0 1 */
/* XXX 4 bits hole, try to pack */
/* XXX 1 byte hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 166 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This change implements Accurate ECN without negotiation and
AccECN Option (that will be added by later changes). Based on
AccECN specifications:
https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt
Accurate ECN allows feeding back the number of CE (congestion
experienced) marks accurately to the sender in contrast to
RFC3168 ECN that can only signal one marks-seen-yes/no per RTT.
Congestion control algorithms can take advantage of the accurate
ECN information to fine-tune their congestion response to avoid
drastic rate reduction when only mild congestion is encountered.
With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps
track of how many segments have arrived with a CE mark. Accurate
ECN uses ACE field (ECE, CWR, AE) to communicate the value back
to the sender which updates tp->delivered_ce (s.cep) based on the
feedback. This signalling channel is lossy when ACE field overflow
occurs.
Conservative strategy is selected here to deal with the ACE
overflow, however, some strategies using the AccECN option later
in the overall patchset mitigate against false overflows detected.
The ACE field values on the wire are offset by
TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the
real CE marks rather than forcing all downstream users to adapt
to the wire offset.
This patch uses the first 1-byte hole and the last 4-byte hole of
the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'.
Also, the group size of tcp_sock_write_txrx is increased from
91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are
the trimmed pahole outcomes before and after this patch.
[BEFORE THIS PATCH]
struct tcp_sock {
[...]
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* XXX 2 bytes hole, try to pack */
[...]
u32 delivered_ce; /* 2576 4 */
u32 app_limited; /* 2580 4 */
u32 rcv_wnd; /* 2684 4 */
struct tcp_options_received rx_opt; /* 2688 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2612 0 */
/* XXX 4 bytes hole, try to pack */
[...]
/* size: 3200, cachelines: 50, members: 161 */
}
[AFTER THIS PATCH]
struct tcp_sock {
[...]
__cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */
u8 nonagle:4; /* 2521: 0 1 */
u8 rate_app_limited:1; /* 2521: 4 1 */
/* XXX 3 bits hole, try to pack */
/* Force alignment to the next boundary: */
u8 :0;
u8 received_ce_pending:4;/* 2522: 0 1 */
u8 unused2:4; /* 2522: 4 1 */
/* XXX 1 byte hole, try to pack */
[...]
u32 delivered_ce; /* 2576 4 */
u32 received_ce; /* 2580 4 */
u32 app_limited; /* 2584 4 */
u32 rcv_wnd; /* 2588 4 */
struct tcp_options_received rx_opt; /* 2592 24 */
__cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */
[...]
/* size: 3200, cachelines: 50, members: 164 */
}
Signed-off-by: Ilpo Järvinen <ij@kernel.org>
Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com>
Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916082434.100722-2-chia-yu.chang@nokia-bell-labs.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tariq Toukan says:
====================
net/mlx5e: Use multiple doorbells
mlx5e uses a single MMIO-mapped doorbell per netdevice for all send and
receive operations. Writes to the doorbell go over the PCIe bus directly
to the device, which then services the indicated queues.
On certain architectures and with sufficiently high volume of doorbell
ringing (many cores, many active channels, small MTU, no GSO, etc.), the
MMIO-mapped doorbell address can become contended, leading to delays in
servicing writes to that address and a global slowdown of all traffic
for that netdevice.
mlx5 NICs have supported using multiple doorbells for many years, the
mlx5_ib driver for the same hardware has been using multiple doorbells
traditionally.
This patch series extends the mlx5 Ethernet driver to also use multiple
doorbells to solve the MMIO contention issues. By allocating and using
more doorbells for all channel queues (TX and RX), the MMIO contention
on any particular doorbell address is reduced significantly.
The first patches are cleanups:
net/mlx5: Fix typo of MLX5_EQ_DOORBEL_OFFSET
net/mlx5: Remove unused 'offset' field from struct mlx5_sq_bfreg'
net/mlx5e: Remove unused 'xsk' param of mlx5e_build_xdpsq_param
The next patch separates the global doorbell from Ethernet-specific
resources:
net/mlx5: Store the global doorbell in mlx5_priv
Next, plumbing to allow a different doorbell to be used for channel TX
and RX queues:
net/mlx5e: Prepare for using multiple TX doorbells
net/mlx5e: Prepare for using different CQ doorbells
Then, enable using multiple doorbells for channel queues:
net/mlx5e: Use multiple TX doorbells
net/mlx5e: Use multiple CQ doorbells
Finally, introduce a devlink parameter to control this:
devlink: Add a 'num_doorbells' driverinit param
net/mlx5e: Use the 'num_doorbells' devlink param
Some performance results, done with the Linux pktgen script, running b2b
over Connect-X 8 NICs:
samples/pktgen/pktgen_sample02_multiqueue.sh -i $NIC -s 64 -d $DST_IP \
-m $MAC -t 64
Baseline (1 doorbell): 9 Mpps
This series (8 doorbells): 56 Mpps
Note that pktgen without 'burst' rings the doorbell after every packet,
while real packet TX using NAPI usually batches multiple pending packets
with the xmit_more mechanism. So this is in essence a micro-benchmark
showcasing the improvement of using multiple doorbells on platforms
affected by MMIO contention. Real life traffic usually sees little
movement either way.
====================
Link: https://patch.msgid.link/1758031904-634231-1-git-send-email-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the new devlink param to control how many doorbells mlx5e devices
allocate and use. The maximum number of doorbells configurable is capped
to the maximum number of channels. This only applies to the Ethernet
part, the RDMA devices using mlx5 manage their own doorbells.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Channel doorbells are now also used by all channel CQs.
A new 'uar' parameter is added to 'struct mlx5e_create_cq_param',
which is then used in mlx5e_alloc_cq.
A single UAR page has two TX doorbells and a single CQ doorbell, so
every consecutive pair of 'struct mlx5_sq_bfreg' (TX doorbells)
uses the same underlying 'struct mlx5_uars_page' (CQ doorbell).
So by using c->bfreg->up, CQs from every consecutive channel pair will
share the same CQ doorbell.
Non-channel associated CQs keep using the global CQ doorbell.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
First, allocate more doorbells in mlx5e_create_mdev_resources:
- one doorbell remains 'global' and will be used by all non-channel
associated SQs (e.g. ASO, HWS, PTP, ...).
- allocate additional 'num_doorbells' doorbells. This defaults to
minimum between 8 and max number of channels.
mlx5e_channel_pick_doorbell() now spreads out channel SQs across
available doorbells.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Completion queues (CQs) in mlx5 use the same global doorbell, which may
become contended when accessed concurrently from many cores.
This patch prepares the CQ management code for supporting different
doorbells per CQ. This will be used in downstream patches to allow
separate doorbells to be used by channels CQs.
The main change is moving the 'uar' pointer from struct mlx5_core_cq to
struct mlx5e_cq, as the uar page to be used is better off stored
directly there. Other users of mlx5_core_cq also store the UAR to be
used separately and therefore the pointer being removed is dead weight
for them. As evidence, in this patch there are two users which set the
mcq.uar pointer but didn't use it, Software Steering and old Innova CQ
creation code. Instead, they rang the doorbell directly from another
pointer.
The 'uar' pointer added to struct mlx5e_cq remains in a hot cacheline
(as before), because it may get accessed for each packet.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The driver allocates a single doorbell per device and uses
it for all Send Queues (SQs). This can become a bottleneck due to the
high number of concurrent MMIO accesses when ringing the same doorbell
from many channels.
This patch makes the doorbells used by channel queues configurable.
mlx5e_channel_pick_doorbell() is added to select the doorbell to be used
for a given channel, picking the default for now.
When opening a channel, the selected doorbell is saved to the channel
struct and used whenever channel-related queues are created.
Finally, 'uar_page' is added to 'struct mlx5e_create_sq_param' to
control which doorbell to use when allocating an SQ, since that can
happen outside channel context (e.g. for PTP).
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The global doorbell is used for more than just Ethernet resources, so
move it out of mlx5e_hw_objs into a common place (mlx5_priv), to avoid
non-Ethernet modules (e.g. HWS, ASO) depending on Ethernet structs.
Use this opportunity to consolidate it with the 'uar' pointer already
there, which was used as an RX doorbell. Underneath the 'uar' pointer is
identical to 'bfreg->up', so store a single resource and use that
instead.
For CQ doorbells, care is taken to always use bfreg->up->index instead
of bfreg->index, which may refer to a subsequent UAR page from the same
ALLOC_UAR batch on some NICs.
This paves the way for cleanly supporting multiple doorbells in the
Ethernet driver.
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This was added in commit [1], but its only use removed in commit [2].
The parameter is unused, so remove it from the function parameter list.
[1] commit 9ded70fa1d ("net/mlx5e: Don't prefill WQEs in XDP SQ in the
multi buffer mode")
[2] commit 1a9304859b ("net/mlx5: XDP, Enable TX side XDP multi-buffer
support")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 'offset' field was introduced in the original commit [1] and never
used until commit [2], which added an unnecessary use.
Remove the field and refactor the write-combining test to use a local
variable instead.
[1] commit a6d51b6861 ("net/mlx5: Introduce blue flame register
allocator")
[2] commit d98995b4bf ("net/mlx5: Reimplement write combining test")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: dsa: mv88e6xxx: further PTP-related cleanups
Further mv88e6xxx PTP-related cleanups, mostly centred around the
register definitions, but also moving one function prototype to a
more logical header.
====================
Link: https://patch.msgid.link/aMnJ1uRPvw82_aCT@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since mv88e6xxx_hwtstamp_work() is defined in hwtstamp.c, its prototype
should be in hwtstamp.h, so move it there. Remove it's redundant stub
definition, as both hwtstamp.c (the function provider) and ptp.c (the
consumer) are both dependent on the same config symbol.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove the unused 88E6165 register definitions. For the port
registers, add a comment describing that each arrival and departure
offset is for a set of four registers that correspond with status,
two timestamp registers and the PTP sequence ID captured from the
packet.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are two identical MV88E6XXX_PTP_GC_ETYPE definitions in ptp.h,
and MV88E6XXX_PTP_ETHERTYPE in hwtstamp.h which all refer to the
exact same register. As the code that accesses this register is in
hwtstamp.c, use the hwtstamp.h definition, and remove the
unnecessary duplicated definition in ptp.h
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The TAI_EVENT_STATUS and TAI_CFG definitions are only used for the
88E6352-family of TAI implementations. Rename them as such, and
remove the TAI_EVENT_TIME_* definitions that are unused (although
we read them as a block.)
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
get_netdev_for_sock() is called during setsockopt(),
so not under RCU.
Using sk_dst_get(sk)->dev could trigger UAF.
Let's use __sk_dst_get() and dst_dev_rcu().
Note that the only ->ndo_sk_get_lower_dev() user is
bond_sk_get_lower_dev(), which uses RCU.
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20250916214758.650211-6-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_vlan_by_tcpsk() fetches sk_dst_get(sk)->dev before RTNL and
passes it to netdev_walk_all_lower_dev(), which is illegal.
Also, smc_vlan_by_tcpsk_walk() does not require RTNL at all.
Let's use __sk_dst_get(), dst_dev_rcu(), and
netdev_walk_all_lower_dev_rcu().
Note that the returned value of smc_vlan_by_tcpsk() is not used
in the caller.
Fixes: 0cfdd8f92c ("smc: connection and link group creation")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916214758.650211-5-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_clc_prfx_match() is called from smc_listen_work() and
not under RCU nor RTNL.
Using sk_dst_get(sk)->dev could trigger UAF.
Let's use __sk_dst_get() and dst_dev_rcu().
Note that the returned value of smc_clc_prfx_match() is not
used in the caller.
Fixes: a046d57da1 ("smc: CLC handshake (incl. preparation steps)")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916214758.650211-4-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_clc_prfx_set() is called during connect() and not under RCU
nor RTNL.
Using sk_dst_get(sk)->dev could trigger UAF.
Let's use __sk_dst_get() and dev_dst_rcu() under rcu_read_lock()
after kernel_getsockname().
Note that the returned value of smc_clc_prfx_set() is not used
in the caller.
While at it, we change the 1st arg of smc_clc_prfx_set[46]_rcu()
not to touch dst there.
Fixes: a046d57da1 ("smc: CLC handshake (incl. preparation steps)")
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250916214758.650211-3-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit says:
====================
net: phy: remove mdio_board_info support from phylib
Since its introduction in 2017 mdio_board_info has had only two users:
- dsa_loop (still existing)
- arm orion, added in 2017 and removed with fd68572b57 ("ARM: orion5x:
remove dsa_chip_data references")
So let's remove usage of mdio_board_info from dsa_loop, then support
for mdio_board_info can be dropped from phylib.
====================
Link: https://patch.msgid.link/4ccf7476-0744-4f6b-aafc-7ba84d15a432@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The periodic pulse event interrupts are used to register the PPS events
into the system, so it is only applicable to PTP_CLK_REQ_PPS request.
However, these interrupts are mistakenly enabled in PTP_CLK_REQ_PEROUT
request, so fix this error.
Fixes: 671e266835 ("ptp: netc: add periodic pulse output support")
Signed-off-by: Wei Fang <wei.fang@nxp.com>
Link: https://patch.msgid.link/20250915082528.1616361-1-wei.fang@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
ptp: safely cleanup when unregistering a PTP clock
The standard rule in the kernel for unregistering user visible devices
is to unpublish the userspace API before doing any shutdown of the
resources necessary for the operation of the device.
PTP has several issues in this area:
1. ptp_clock_unregister() cancells and destroys work while the PTP
chardev is still published, which gives the opportunity for a
precisely timed user API call to cause a driver to attempt to
queue the aux work.
2. PTP pins are not cleaned up - if userspace has enabled PTP pins,
e.g. for extts, drivers are forced to do cleanup before calling
ptp_clock_unregister() to stop events being forwarded into the
PTP layer. E.g mv88e6xxx cancells its internal tai_event_work
to avoid calling into the PTP clock code with a stale ptp_clock
pointer, but a badly timed userspace EXTTS enable will re-schedule
the tai_event_work.
Simplify the process by ensuring that:
1. we take a referene on the PTP struct device to stop the
ptp_clock structure going away underneath us when we call
posix_clock_unregister().
2. call posix_clock_unregister() to remove the /dev/ptp* device.
3. add additional functionality to disable any PTP EXTTS pins and
PPS event generation that have been configured on this device.
This should shutdown all events coming from PTP clock drivers.
4. cancel the delayed aux_work and destroy the kthread.
5. remove the PPS source.
6. drop the reference on the PTP struct device to allow the
ptp_clock structure to be released.
This is difficult for me to test beyond build testing - on the
Clearfog platform with Marvell PHY PTP, the ethernet PHY is the
primary connectivity, so removing the PHY driver for an in-use
network interface isn't possible.
On the ZII rev B platform, where the DSA switches have the TAI
hardware and where root NFS is used, removal of the DSA switch
module somehow forces the FEC interface _not_ connected to the DSA
switch to lose link, causing the machine to become unresponsive
as its root filesystem vanishes.
====================
Link: https://patch.msgid.link/aMnYIu7RbgfXrmGx@shell.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The ordering of ptp_clock_unregister() is not ideal, as the chardev
remains published while state is being torn down, which means userspace
can race with the kernel teardown. There is also no cleanup of enabled
pin settings nor of the internal PPS event, which means enabled events
can still forward into the core, dereferencing a free'd pointer.
Rework the ordering of cleanup in ptp_clock_unregister() so that we
unpublish the posix clock (and user chardev), disable any pins that
have EXTTS events enabled, disable the PPS event, and then clean up
the aux work and PPS source.
This avoids potential use-after-free and races in PTP clock driver
teardown.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # ocelot, sja1105, netdevsim, vclocks
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://patch.msgid.link/E1uydLH-000000061DM-2gcV@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>