Introduce a new helper dl_argv_parse_with_selector() to be used
by show() functions instead of dl_argv().
Implement it to check if all needed options got get commands are
specified. In case they are not, ask kernel for dump passing only
the options (attributes) that are present, creating sort of partial
key to instruct kernel to do partial dump.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
In preparation to the follow-up dump selector patch, make sure that the
command line arguments parsing function returns -ENOENT in case the
option is missing so the caller can distinguish.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
In preparation to the follow-up dump selector patch, introduce function
dl_argv_dry_parse() which allows to do dry parsing of command line
arguments without printing out any error messages to the user.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Currently, handle parsing is destructive as the "\0" string ends are
being put in certain positions during parsing. That prevents it from
being used repeatedly. This is problematic with the follow-up patch
implementing dry-parsing. Fix by making a copy of handle argv during
parsing.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
This is basically a cosmetic change. The SB index is not required to be
passed by user and implicitly index 0 is used. This is ensured by
special treating at the end of dl_argv_parse(). Move this option from
optional to required options.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Be in-sync with port help and port man page and spell out the possible
states instead of "STATE".
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
It is common for all iproute2 apps to have command line option
names matching with show command outputs. However, that is not true
in case of trap and trap group devlink objects.
Correct would be to have "trap" and "group" in the outputs, but that is
not possible to change now. Instead of that, accept "name" instead of
"trap" and "group" options.
Examples:
$ devlink trap show netdevsim/netdevsim1
netdevsim/netdevsim1:
name source_mac_is_multicast type drop generic true action drop group l2_drops
name vlan_tag_mismatch type drop generic true action drop group l2_drops
name ingress_vlan_filter type drop generic true action drop group l2_drops
name ingress_spanning_tree_filter type drop generic true action drop group l2_drops
name port_list_is_empty type drop generic true action drop group l2_drops
name port_loopback_filter type drop generic true action drop group l2_drops
name fid_miss type exception generic false action trap group l2_drops
name blackhole_route type drop generic true action drop group l3_drops
name ttl_value_is_too_small type exception generic true action trap group l3_exceptions
name tail_drop type drop generic true action drop group buffer_drops
name ingress_flow_action_drop type drop generic true action drop group acl_drops
name egress_flow_action_drop type drop generic true action drop group acl_drops
name igmp_query type control generic true action mirror group mc_snooping
name igmp_v1_report type control generic true action trap group mc_snooping
$ devlink trap show netdevsim/netdevsim1 trap source_mac_is_multicast
netdevsim/netdevsim1:
name source_mac_is_multicast type drop generic true action drop group l2_drops
$ devlink trap show netdevsim/netdevsim1 name source_mac_is_multicast
netdevsim/netdevsim1:
name source_mac_is_multicast type drop generic true action drop group l2_drops
$ devlink trap group
netdevsim/netdevsim1:
name l2_drops generic true
name l3_drops generic true policer 1
name l3_exceptions generic true policer 1
name buffer_drops generic true policer 2
name acl_drops generic true policer 3
name mc_snooping generic true policer 3
$ devlink trap group show netdevsim/netdevsim1 group l2_drops
netdevsim/netdevsim1:
name l2_drops generic true
$ devlink trap group show netdevsim/netdevsim1 name l2_drops
name l2_drops generic true
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The devlink utility stores an interface map that can be used to map an
interface name to a devlink port and vice versa. The map is populated by
issuing a devlink port dump via 'DEVLINK_CMD_PORT_GET' command.
Cited commits started to populate the map only when it is actually
needed. One such case is when a dump (e.g., shared buffer dump) only
returns devlink port handles. When pretty printing is required, the
utility will consult the map to translate the devlink port handles to
the corresponding interface names.
The above is problematic as it means that the port dump response(s) will
be queued to the same receive buffer as the response(s) of the dump that
triggered the port dump, resulting in a failed dump [1].
Fix by using a different netlink socket for the population of the
interface map.
[1]
$ devlink sb tc bind show
kernel answers: Device or resource busy
Failed to create index map
//0:
sb 0 tc 4 type egress pool 4 threshold 9
kernel answers: Device or resource busy
[...]
$ echo $?
1
Fixes: 5cddbb274eab ("devlink: load port-ifname map on demand")
Fixes: 63d84b1fc98d ("devlink: load ifname map on demand from ifname_map_rev_lookup() as well")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
There is a json footer missed for trap-policer output in "devlink mon".
So add it and fix the json output.
Fixes: a66af5569337 ("devlink: Add devlink trap policer set and show commands")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Suppor port function commands to enable / disable migratable
capability, this is used to set the port function as migratable.
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.
In order for a VM to be able to perform LM, all the VM components must
be able to perform migration. e.g.: to be migratable.
In order for VF to be migratable, VF must be bound to VFIO driver with
migration support.
When migratable capability is enable for a function of the port, the
device is making the necessary preparations for the function to be
migratable, which might include disabling features which cannot be
migrated.
Example of LM with migratable function configuration:
Set migratable of the VF's port function.
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable disable
$ devlink port function set pci/0000:06:00.0/2 migratable enable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0
vfnum 1
function:
hw_addr 00:00:00:00:00:00 migratable enable
Bind VF to VFIO driver with migration support:
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
$ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
$ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
Attach VF to the VM.
Start the VM.
Perform LM.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Support port function commands to enable / disable RoCE, this is used to
control the port RoCE device capabilities.
When RoCE is disabled for a function of the port, function cannot create
any RoCE specific resources (e.g GID table).
It also saves system memory utilization. For example disabling RoCE
enable a VF/SF to save 1 Mbytes of system memory per function.
Example of a PCI VF port which supports a port function:
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum
0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce enabled
$ devlink port function set pci/0000:06:00.0/2 roce disable
$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum
0 vfnum 1
function:
hw_addr 00:00:00:00:00:00 roce disabled
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Recent kernels send PORT_NEW message with when ifname changes,
so benefit from that by having ifnames updated.
Whenever there is a message containing DEVLINK_ATTR_PORT_NETDEV_NAME
attribute, use it to update ifname map.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
There is a common code in pr_out_port_handle_start() and
pr_out_port_handle_start_arr(). As the next patch is going to extend it
even more, push the code into common helper.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Currently, when user specifies ifname as a handle on command line of
devlink, the related devlink port is looked-up in previously taken dump
of all devlink ports on the system. There are 3 problems with that:
1) The dump iterates over all devlink instances in kernel and takes a
devlink instance lock for each.
2) Dumping all devlink ports would not scale.
3) Alternative ifnames are not exposed by devlink netlink interface.
Instead, benefit from RTNL get link command extension and get the
devlink port handle info from IFLA_DEVLINK_PORT attribute, if supported.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Add couple of helpers to alloc/free of map object alongside with list
addition/removal.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The kernel has gained support for reading from regions without needing to
create a snapshot. To use this support, the DEVLINK_ATTR_REGION_DIRECT
attribute must be added to the command.
For the "read" command, if the user did not specify a snapshot, add the new
attribute to request a direct read. The "dump" command will still require a
snapshot. While technically a dump could be performed without a snapshot it
is not guaranteed to be atomic unless the region size is no larger than
256 bytes.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Setting a parent during creation of the node doesn't work, despite
documentation [1] clearly saying that it should.
[1] man/man8/devlink-rate.8
Example:
$ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
Unknown option "parent"
Fix this by passing DL_OPT_PORT_FN_RATE_PARENT as an argument to
dl_argv_parse() when it gets called from cmd_port_fn_rate_add().
Fixes: 6c70aca76ef2 ("devlink: Add port func rate support")
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
To fully utilize hierarchical QoS algorithm new attribute 'tx_weight'
needs to be introduced. Weight attribute allows for usage of Weighted
Fair Queuing arbitration scheme among siblings. This arbitration
scheme can be used simultaneously with the strict priority.
Introduce ability to configure tx_weight from devlink userspace
utility. Make the new attribute optional.
Example commands:
$ devlink port function rate add pci/0000:4b:00.0/node_custom \
tx_weight 50 parent node_0
$ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 20
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
To fully utilize hierarchical QoS algorithm new attribute 'tx_priority'
needs to be introduced. Priority attribute allows for usage of strict
priority arbiter among siblings. This arbitration scheme attempts to
schedule nodes based on their priority as long as the nodes remain within
their bandwidth limit.
Introduce ability to configure tx_priority from devlink userspace
utility. Make the new attribute optional.
Example commands:
$ devlink port function rate add pci/0000:4b:00.0/node_custom \
tx_priority 5 parent node_0
$ devlink port function rate set pci/0000:4b:00.0/2 tx_priority 5
Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Commit 5cddbb274eab ("devlink: load port-ifname map on demand") changed
the ifname map to be loaded on demand from ifname_map_lookup(). However,
it didn't put this on-demand loading into ifname_map_rev_lookup() which
causes ifname_map_rev_lookup() to return -ENOENT all the time.
Fix this by triggering on-demand ifname map load
from ifname_map_rev_lookup() as well.
Fixes: 5cddbb274eab ("devlink: load port-ifname map on demand")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Similar to other bool opts that could be set by the user, move the
global variable use_iec to be part of struct dl.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Now that it is possible to flash multiple devlink instances in parallel,
the notification processing callback needs to count in the fact that it
receives message that belongs to different devlink instance. So handle
the it gracefully and don't error out.
Reported-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
So far, the port-ifname map was loaded during devlink init
no matter if actually needed or not. Port dump cmd which is utilized
for this in kernel takes lock for every devlink instance.
That may lead to unnecessary blockage of command.
Load the map only in time it is needed to lookup ifname.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
The dl_argv_parse_put function is used to extract arguments from the
command line and convert them to the appropriate netlink attributes. This
function is a combination of calling dl_argv_parse and dl_put_opts.
A future change is going to refactor dl_argv_parse to check the kernel's
netlink policy for the command. This requires issuing another netlink
message which requires calling dl_argv_parse before
mnlu_gen_socket_cmd_prepare. Otherwise, the get policy command issued in
dl_argv_parse would overwrite the prepared buffer.
This conflicts with dl_argv_parse_put which requires being called after
mnlu_gen_socket_cmd_prepare.
Remove dl_argv_parse_put and replace it with appropriate calls to
dl_argv_parse and dl_put_opts. This allows us to ensure dl_argv_parse is
called before mnlu_gen_socket_cmd_prepare while dl_put_opts is called
afterwards.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Use the helper dl_no_arg function to check for whether the command has any
arguments.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
If line card object contains a nested devlink, expose it.
Example:
$ devlink lc show pci/0000:01:00.0 lc 1
pci/0000:01:00.0:
lc 1 state active type 16x100G nested_devlink auxiliary/mlxsw_core.lc.0
supported_types:
16x100G
$ devlink dev show auxiliary/mlxsw_core.lc.0
auxiliary/mlxsw_core.lc.0
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Add commands and helper APIs to run selftests.
Include a selftest id for a non volatile memory i.e. flash.
Also, update the man page and bash-completion for selftests
commands.
Examples:
$ devlink dev selftests run pci/0000:03:00.0 id flash
pci/0000:03:00.0:
flash:
status passed
$ devlink dev selftests show pci/0000:03:00.0
pci/0000:03:00.0
flash
$ devlink dev selftests show pci/0000:03:00.0 -j
{"selftests":{"pci/0000:03:00.0":["flash"]}}
$ devlink dev selftests run pci/0000:03:00.0 id flash -j
{"selftests":{"pci/0000:03:00.0":{"flash":{"status":"passed"}}}}
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Introduce a new object "lc" to add devlink support for line cards with
two commands:
show - to get the info about the line card state, list of supported
types as reported by kernel/driver.
set - to set/clear the line card type.
Example:
$ devlink lc
pci/0000:01:00.0:
lc 1 state unprovisioned
supported_types:
16x100G
lc 2 state unprovisioned
supported_types:
16x100G
lc 3 state unprovisioned
supported_types:
16x100G
lc 4 state unprovisioned
supported_types:
16x100G
lc 5 state unprovisioned
supported_types:
16x100G
lc 6 state unprovisioned
supported_types:
16x100G
lc 7 state unprovisioned
supported_types:
16x100G
lc 8 state unprovisioned
supported_types:
16x100G
To provision the slot #8:
$ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G
$ devlink lc show pci/0000:01:00.0 lc 8
pci/0000:01:00.0:
lc 8 state active type 16x100G
supported_types:
16x100G
To uprovision the slot #8:
$ devlink lc set pci/0000:01:00.0 lc 8 notype
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
For health reporter dumps it is quite convenient to have the numbers in
hexadecimal format. Introduce a command line option to allow user to
achieve that output.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Fix bug when user calls "devlink health dump" without "show" or "clear":
$ devlink health dump
Command "(null)" not found
Put the dump command into a separate helper as it is usual in the rest
of the code. Also, treat no cmd as "show", as it is common for other
devlink objects.
Fixes: 041e6e651a8e ("devlink: Add devlink health dump show command")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
This patch is fixing a bug, when param set user command includes
configuration mode which is not supported, the tool may not respond
with error if the requested value is 0. In such case
cmd_dev_param_set_cb() won't find the requested configuration mode and
returns ctx->value as initialized (equal 0). Then cmd_dev_param_set()
may find that requested value equals current value and returns success.
Fixing the bug by adding a flag cmode_found which is set only if
cmd_dev_param_set_cb() finds the requested configuration mode.
Fixes: 13925ae9eb38 ("devlink: Add param command support")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Recently the kernel gained ability to report the maximum number of
snapshots a region can have. Print this value out if it was reported.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Port function state can have either of the two values - active or
inactive. Update the documentation and help command for these two
values to tell user about it.
With the introduction of state, hw_addr and state are optional.
Hence mark them as optional in man page that also aligns with the help
command output.
Fixes: bdfb9f1bd61a ("devlink: Support set of port function state")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
When processing device flash update, cmd_dev_flash function waits until
the flash process has completed. This requires the following two
conditions to both be true:
a) we've received an exit status from the child process
b) we've received the DEVLINK_CMD_FLASH_UPDATE_END *or*
we haven't received any status notifications from the driver.
The original devlink flash status monitoring code in 9b13cddfe268
("devlink: implement flash status monitoring") was written assuming that
a driver will either send no status updates, or it will send at least
one DEVLINK_CMD_FLASH_UPDATE_STATUS before DEVLINK_CMD_FLASH_UPDATE_END.
Newer versions of the kernel since commit 52cc5f3a166a ("devlink: move flash
end and begin to core devlink") in v5.10 moved handling of the
DEVLINK_CMD_FLASH_UPDATE_END into the core stack, and will send this
regardless of whether or not the driver sends any of its own status
notifications.
The handling of DEVLINK_CMD_FLASH_UPDATE_END in cmd_dev_flash_status_cb
has an additional condition that it must not be the first message.
Otherwise, it falls back to treating it like
a DEVLINK_CMD_FLASH_UPDATE_STATUS.
This is wrong because it can lead to an infinite loop if a driver does
not send any status updates.
In this case, the kernel will send DEVLINK_CMD_FLASH_UPDATE_END without
any DEVLINK_CMD_FLASH_UPDATE_STATUS. The devlink application will see
that ctx->not_first is false, and will treat this like any other status
message. Thus, ctx->not_first will be set to 1.
The loop condition to exit flash update will thus never be true, since
we will wait forever, because ctx->not_first is true, and
ctx->received_end is false.
This leads to the application appearing to process the flash update, but
it will never exit.
Fix this by simply always treating DEVLINK_CMD_FLASH_UPDATE_END the same
regardless of whether its the first message or not.
This is obviously the correct thing to do: once we've received the
DEVLINK_CMD_FLASH_UPDATE_END the flash update must be finished. For new
kernels this is always true, because we send this message in the core
stack after the driver flash update routine finishes.
For older kernels, some drivers may not have sent any
DEVLINK_CMD_FLASH_UPDATE_STATUS or DEVLINK_CMD_FLASH_UPDATE_END. This is
handled by the while loop conditional that exits if we get a return
value from the child process without having received any status
notifications.
An argument could be made that we should exit immediately when we get
either the DEVLINK_CMD_FLASH_UPDATE_END or an exit code from the child
process. However, at a minimum it makes no sense to ever process
DEVLINK_CMD_FLASH_UPDATE_END as if it were a DEVLINK_CMD_FLASH_UPDATE_STATUS.
This is easy to test as it is triggered by the selftests for the
netdevsim driver, which has a test case for both with and without status
notifications.
Fixes: 9b13cddfe268 ("devlink: implement flash status monitoring")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
devlink currently uses "%lu" to format values of type uint64_t,
but on 32-bit architectures uint64_t is defined as unsigned
long long and this does not work correctly.
Fix this by using the standard macro PRIu64 instead.
Signed-off-by: Ben Hutchings <ben.hutchings@mind.be>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>