A production-quality Kubernetes cluster requires planning and preparation.
If your Kubernetes cluster is to run critical workloads, it must be configured to be resilient.
This page explains steps you can take to set up a production-ready cluster,
or to promote an existing cluster for production use.
If you're already familiar with production setup and want the links, skip to
What's next.
Production considerations
Typically, a production Kubernetes cluster environment has more requirements than a
personal learning, development, or test Kubernetes environment. A production environment may require
secure access by many users, consistent availability, and the resources to adapt
to changing demands.
As you decide where you want your production Kubernetes environment to live
(on premises or in a cloud) and the amount of management you want to take
on or hand to others, consider how your requirements for a Kubernetes cluster
are influenced by the following issues:
Availability: A single-machine Kubernetes learning environment
has a single point of failure. Creating a highly available cluster means considering:
Separating the control plane from the worker nodes.
Replicating the control plane components on multiple nodes.
Load balancing traffic to the cluster’s API server.
Having enough worker nodes available, or able to quickly become available, as changing workloads warrant it.
Scale: If you expect your production Kubernetes environment to receive a stable amount of
demand, you might be able to set up for the capacity you need and be done. However,
if you expect demand to grow over time or change dramatically based on things like
season or special events, you need to plan how to scale to relieve increased
pressure from more requests to the control plane and worker nodes or scale down to reduce unused
resources.
Security and access management: You have full admin privileges on your own
Kubernetes learning cluster. But shared clusters with important workloads, and
more than one or two users, require a more refined approach to who and what can
access cluster resources. You can use role-based access control
(RBAC) and other
security mechanisms to make sure that users and workloads can get access to the
resources they need, while keeping workloads, and the cluster itself, secure.
You can set limits on the resources that users and workloads can access
by managing policies and
container resources.
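As one sketch of such a limit, a ResourceQuota can cap the total resources that workloads in a namespace may request. The namespace name and values below are illustrative, not prescriptive:

```yaml
# Illustrative quota for a hypothetical "team-a" namespace; tune the values
# to the actual demands of that team's workloads.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
```

Once applied, the API server rejects new Pods in that namespace whose aggregate requests or limits would exceed the quota.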
Before building a Kubernetes production environment on your own, consider
handing off some or all of this job to
Turnkey Cloud Solutions
providers or other Kubernetes Partners.
Options include:
Serverless: Just run workloads on third-party equipment without managing
a cluster at all. You will be charged for things like CPU usage, memory, and
disk requests.
Managed control plane: Let the provider manage the scale and availability
of the cluster's control plane, as well as handle patches and upgrades.
Managed worker nodes: Configure pools of nodes to meet your needs,
then the provider makes sure those nodes are available and ready to implement
upgrades when needed.
Integration: There are providers that integrate Kubernetes with other
services you may need, such as storage, container registries, authentication
methods, and development tools.
Whether you build a production Kubernetes cluster yourself or work with
partners, review the following sections to evaluate your needs as they relate
to your cluster’s control plane, worker nodes, user access, and
workload resources.
Production cluster setup
In a production-quality Kubernetes cluster, the control plane manages the
cluster from services that can be spread across multiple computers
in different ways. Each worker node, however, represents a single entity that
is configured to run Kubernetes pods.
Production control plane
The simplest Kubernetes cluster has the entire control plane and worker node
services running on the same machine. You can grow that environment by adding
worker nodes, as reflected in the diagram illustrated in
Kubernetes Components.
If the cluster is meant to be available for a short period of time, or can be
discarded if something goes seriously wrong, this might meet your needs.
If you need a more permanent, highly available cluster, however, you should
consider ways of extending the control plane. By design, control plane
services running on a single machine are not highly available.
If keeping the cluster up and running
and ensuring that it can be repaired if something goes wrong is important,
consider these steps:
Choose deployment tools: You can deploy a control plane using tools such
as kubeadm, kops, and kubespray. See
Installing Kubernetes with deployment tools
to learn tips for production-quality deployments using each of those deployment
methods. Different Container Runtimes
are available to use with your deployments.
Manage certificates: Secure communications between control plane services
are implemented using certificates. Certificates are automatically generated
during deployment or you can generate them using your own certificate authority.
See PKI certificates and requirements for details.
Configure load balancer for apiserver: Configure a load balancer
to distribute external API requests to the apiserver service instances running on different nodes. See
Create an External Load Balancer
for details.
Separate and back up the etcd service: The etcd services can either run on the
same machines as other control plane services or run on separate machines, for
extra security and availability. Because etcd stores cluster configuration data,
backing up the etcd database should be done regularly to ensure that you can
repair that database if needed.
See the etcd FAQ for details on configuring and using etcd.
See Operating etcd clusters for Kubernetes
and Set up a High Availability etcd cluster with kubeadm
for details.
Create multiple control plane systems: For high availability, the
control plane should not be limited to a single machine. If the control plane
services are run by an init service (such as systemd), each service should run on at
least three machines. However, running control plane services as pods in
Kubernetes ensures that the replicated number of services that you request
will always be available.
The scheduler should be fault tolerant,
but not highly available. Some deployment tools set up the Raft
consensus algorithm to do leader election of Kubernetes services. If the
primary goes away, another service elects itself and takes over.
Span multiple zones: If keeping your cluster available at all times is
critical, consider creating a cluster that runs across multiple data centers,
referred to as zones in cloud environments. Groups of zones are referred to as regions.
By spreading a cluster across
multiple zones in the same region, you improve the chances that your
cluster will continue to function even if one zone becomes unavailable.
See Running in multiple zones for details.
Manage ongoing features: If you plan to keep your cluster over time,
there are tasks you need to do to maintain its health and security. For example,
if you installed with kubeadm, there are instructions to help you with
Certificate Management
and Upgrading kubeadm clusters.
See Administer a Cluster
for a longer list of Kubernetes administrative tasks.
Production worker nodes
Production-quality workloads need to be resilient and anything they rely
on needs to be resilient (such as CoreDNS). Whether you manage your own
control plane or have a cloud provider do it for you, you still need to
consider how you want to manage your worker nodes (also referred to
simply as nodes).
Configure nodes: Nodes can be physical or virtual machines. If you want to
create and manage your own nodes, you can install a supported operating system,
then add and run the appropriate
Node services. Consider:
The demands of your workloads when you set up nodes by having appropriate memory, CPU, and disk speed and storage capacity available.
Whether generic computer systems will do or you have workloads that need GPU processors, Windows nodes, or VM isolation.
Validate nodes: See Valid node setup
for information on how to ensure that a node meets the requirements to join
a Kubernetes cluster.
Add nodes to the cluster: If you are managing your own cluster you can
add nodes by setting up your own machines and either adding them manually or
having them register themselves to the cluster’s apiserver. See the
Nodes section for information on how to set up Kubernetes to add nodes in these ways.
Add Windows nodes to the cluster: Kubernetes offers support for Windows
worker nodes, allowing you to run workloads implemented in Windows containers. See
Windows in Kubernetes for details.
Scale nodes: Have a plan for expanding the capacity your cluster will
eventually need. See Considerations for large clusters
to help determine how many nodes you need, based on the number of pods and
containers you need to run. If you are managing nodes yourself, this can mean
purchasing and installing your own physical equipment.
Autoscale nodes: Most cloud providers support
Cluster Autoscaler
to replace unhealthy nodes or grow and shrink the number of nodes as demand requires. See the
Frequently Asked Questions
for how the autoscaler works and
Deployment
for how it is implemented by different cloud providers. For on-premises, there
are some virtualization platforms that can be scripted to spin up new nodes
based on demand.
Set up node health checks: For important workloads, you want to make sure
that the nodes and pods running on those nodes are healthy. Using the
Node Problem Detector
daemon, you can ensure your nodes are healthy.
Production user management
In production, you may be moving from a model where you or a small group of
people are accessing the cluster to where there may potentially be dozens or
hundreds of people. In a learning environment or platform prototype, you might have a single
administrative account for everything you do. In production, you will want
more accounts with different levels of access to different namespaces.
Taking on a production-quality cluster means deciding how you
want to selectively allow access by other users. In particular, you need to
select strategies for validating the identities of those who try to access your
cluster (authentication) and deciding if they have permissions to do what they
are asking (authorization):
Authentication: The apiserver can authenticate users using client
certificates, bearer tokens, an authenticating proxy, or HTTP basic auth.
You can choose which authentication methods you want to use.
Using plugins, the apiserver can leverage your organization’s existing
authentication methods, such as LDAP or Kerberos. See
Authentication
for a description of these different methods of authenticating Kubernetes users.
Authorization: When you set out to authorize your regular users, you will probably choose between RBAC and ABAC authorization. See Authorization Overview to review different modes for authorizing user accounts (as well as service account access to your cluster):
Role-based access control (RBAC): Lets you assign access to your cluster by allowing specific sets of permissions to authenticated users. Permissions can be assigned for a specific namespace (Role) or across the entire cluster (ClusterRole). Then using RoleBindings and ClusterRoleBindings, those permissions can be attached to particular users.
Attribute-based access control (ABAC): Lets you create policies based on resource attributes in the cluster and will allow or deny access based on those attributes. Each line of a policy file identifies versioning properties (apiVersion and kind) and a map of spec properties to match the subject (user or group), resource property, non-resource property (/version or /apis), and readonly. See Examples for details.
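As a sketch of the RBAC approach, a namespaced Role granting read access to Pods can be bound to a single user. The user name jane, the Role name pod-reader, and the default namespace are illustrative:

```yaml
# A Role that allows reading Pods in the "default" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
---
# Bind that Role to a hypothetical user "jane".
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

A ClusterRole and ClusterRoleBinding follow the same shape when the permissions should apply across all namespaces.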
As someone setting up authentication and authorization on your production Kubernetes cluster, here are some things to consider:
Set the authorization mode: When the Kubernetes API server
(kube-apiserver)
starts, the supported authorization modes must be set using the --authorization-mode
flag. For example, that flag in the kube-apiserver.yaml file (in /etc/kubernetes/manifests)
could be set to Node,RBAC. This would allow Node and RBAC authorization for authenticated requests.
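In a kubeadm-style static Pod manifest, the flag appears on the kube-apiserver command line. A trimmed, illustrative excerpt (a real manifest carries many more flags, and the image tag here is just an example):

```yaml
# Excerpt of /etc/kubernetes/manifests/kube-apiserver.yaml (illustrative).
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.20.0
    command:
    - kube-apiserver
    - --authorization-mode=Node,RBAC
```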
Create user certificates and role bindings (RBAC): If you are using RBAC
authorization, users can create a CertificateSigningRequest (CSR) that can be
signed by the cluster CA. Then you can bind Roles and ClusterRoles to each user.
See Certificate Signing Requests
for details.
Create policies that combine attributes (ABAC): If you are using ABAC
authorization, you can assign combinations of attributes to form policies to
authorize selected users or groups to access particular resources (such as a
pod), namespace, or apiGroup. For more information, see
Examples.
Consider Admission Controllers: Additional forms of authorization for
requests that can come in through the API server include
Webhook Token Authentication.
Webhooks and other special authorization types need to be enabled by adding
Admission Controllers
to the API server.
Set limits on workload resources
Demands from production workloads can cause pressure both inside and outside
of the Kubernetes control plane. Consider these items when setting up for the
needs of your cluster's workloads:
Prepare for DNS demand: If you expect workloads to massively scale up,
your DNS service must be ready to scale up as well. See
Autoscale the DNS service in a Cluster.
Create additional service accounts: User accounts determine what users can
do on a cluster, while a service account defines pod access within a particular
namespace. By default, a pod takes on the default service account from its namespace.
See Managing Service Accounts
for information on creating a new service account. For example, you might want to:
Add secrets that a pod could use to pull images from a particular container registry. See Configure Service Accounts for Pods for an example.
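For example, a ServiceAccount can carry an imagePullSecret so that Pods using it can pull from a private registry. The account and secret names are illustrative, and the Secret must already exist in the namespace:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: build-robot        # illustrative account name
  namespace: default
imagePullSecrets:
- name: myregistrykey      # an existing docker-registry type Secret
```

Pods that set spec.serviceAccountName to build-robot then pull images with that credential automatically.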
If you choose to build your own cluster, plan how you want to
handle certificates
and set up high availability for features such as
etcd
and the
API server.
You need to install a
container runtime
into each node in the cluster so that Pods can run there. This page outlines
what is involved and describes related tasks for setting up nodes.
This page lists details for using several common container runtimes with
Kubernetes, on Linux:
Note: For other operating systems, look for documentation specific to your platform.
Cgroup drivers
Control groups are used to constrain resources that are allocated to processes.
When systemd is chosen as the init
system for a Linux distribution, the init process generates and consumes a root control group
(cgroup) and acts as a cgroup manager.
Systemd has a tight integration with cgroups and allocates a cgroup per systemd unit. It's possible
to configure your container runtime and the kubelet to use cgroupfs. Using cgroupfs alongside
systemd means that there will be two different cgroup managers.
A single cgroup manager simplifies the view of what resources are being allocated
and will by default have a more consistent view of the available and in-use resources.
When there are two cgroup managers on a system, you end up with two views of those resources.
In the field, people have reported cases where nodes that are configured to use cgroupfs
for the kubelet and Docker, but systemd for the rest of the processes, become unstable under
resource pressure.
Changing the settings such that your container runtime and kubelet use systemd as the cgroup driver
stabilized the system. To configure this for Docker, set native.cgroupdriver=systemd.
Caution:
Changing the cgroup driver of a Node that has joined a cluster is a sensitive operation.
If the kubelet has created Pods using the semantics of one cgroup driver, changing the container
runtime to another cgroup driver can cause errors when trying to re-create the Pod sandbox
for such existing Pods. Restarting the kubelet may not solve such errors.
If you have automation that makes it feasible, replace the node with another using the updated
configuration, or reinstall it using automation.
Migrating to the systemd driver in kubeadm managed clusters
Follow this Migration guide
if you wish to migrate to the systemd cgroup driver in existing kubeadm managed clusters.
Container runtimes
Caution:
This section links to third party projects that provide functionality required by Kubernetes. The Kubernetes project authors aren't responsible for these projects. This page follows CNCF website guidelines by listing projects alphabetically. To add a project to this list, read the content guide before submitting a change.
containerd
This section contains the necessary steps to use containerd as a CRI runtime.
Use the following commands to install containerd on your system:
Install and configure prerequisites:
cat <<EOF | sudo tee /etc/modules-load.d/containerd.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
# Setup required sysctl params, these persist across reboots.
cat <<EOF | sudo tee /etc/sysctl.d/99-kubernetes-cri.conf
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF

# Apply sysctl params without reboot
sudo sysctl --system
Install the containerd.io package from the official Docker repositories.
Instructions for setting up the Docker repository for your respective Linux distribution and
installing the containerd.io package can be found at
Install Docker Engine.
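After the package is installed, containerd is typically pointed at the systemd cgroup driver through its configuration file, commonly generated with containerd config default and then edited. A sketch of the relevant excerpt:

```toml
# Excerpt of /etc/containerd/config.toml: have runc delegate cgroup
# management to systemd, matching the kubelet's cgroup driver.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true
```

Restart containerd (for example, sudo systemctl restart containerd) after changing the file so the setting takes effect.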
To install CRI-O on the following operating systems, set the environment variable OS
to the appropriate value from the following table:
Operating system
$OS
Debian Unstable
Debian_Unstable
Debian Testing
Debian_Testing
Then, set $VERSION to the CRI-O version that matches your Kubernetes version.
For instance, if you want to install CRI-O 1.20, set VERSION=1.20.
You can pin your installation to a specific release.
To install version 1.20.0, set VERSION=1.20:1.20.0.
To install on the following operating systems, set the environment variable OS
to the appropriate field in the following table:
Operating system
$OS
Ubuntu 20.04
xUbuntu_20.04
Ubuntu 19.10
xUbuntu_19.10
Ubuntu 19.04
xUbuntu_19.04
Ubuntu 18.04
xUbuntu_18.04
To install on the following operating systems, set the environment variable OS
to the appropriate field in the following table:
Operating system
$OS
CentOS 8
CentOS_8
CentOS 8 Stream
CentOS_8_Stream
CentOS 7
CentOS_7
CRI-O uses the systemd cgroup driver by default. To switch to the cgroupfs
cgroup driver, either edit /etc/crio/crio.conf or place a drop-in
configuration in /etc/crio/crio.conf.d/02-cgroup-manager.conf, for example:
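For example, the drop-in file could contain:

```toml
# /etc/crio/crio.conf.d/02-cgroup-manager.conf
[crio.runtime]
conmon_cgroup = "pod"
cgroup_manager = "cgroupfs"
```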
Please also note the changed conmon_cgroup, which has to be set to the value
pod when using CRI-O with cgroupfs. It is generally necessary to keep the
cgroup driver configuration of the kubelet (usually done via kubeadm) and CRI-O
in sync.
Docker
On each of your nodes, install Docker for your Linux distribution as per
Install Docker Engine.
You can find the latest validated version of Docker in this
dependencies file.
Configure the Docker daemon, in particular to use systemd for the management of the container’s cgroups.
Note: overlay2 is the preferred storage driver for systems running Linux kernel version 4.0 or higher,
or RHEL or CentOS using kernel version 3.10.0-514 and above.
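A /etc/docker/daemon.json along these lines selects the systemd cgroup driver and the overlay2 storage driver. The log settings shown are common guidance rather than requirements; adjust to your environment:

```json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
```

Restart the Docker daemon (for example, sudo systemctl restart docker) after writing the file.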
This page shows how to install the kubeadm toolbox.
For information how to create a cluster with kubeadm once you have performed this installation process, see the Using kubeadm to Create a Cluster page.
Before you begin
A compatible Linux host. The Kubernetes project provides generic instructions for Linux distributions based on Debian and Red Hat, and those distributions without a package manager.
2 GB or more of RAM per machine (any less will leave little room for your apps).
2 CPUs or more.
Full network connectivity between all machines in the cluster (public or private network is fine).
Unique hostname, MAC address, and product_uuid for every node. See here for more details.
Certain ports are open on your machines. See here for more details.
Swap disabled. You MUST disable swap in order for the kubelet to work properly.
Verify the MAC address and product_uuid are unique for every node
You can get the MAC address of the network interfaces using the command ip link or ifconfig -a.
The product_uuid can be checked by using the command sudo cat /sys/class/dmi/id/product_uuid.
It is very likely that hardware devices will have unique addresses, although some virtual machines may have
identical values. Kubernetes uses these values to uniquely identify the nodes in the cluster.
If these values are not unique to each node, the installation process
may fail.
Check network adapters
If you have more than one network adapter, and your Kubernetes components are not reachable on the default
route, we recommend you add IP route(s) so Kubernetes cluster addresses go via the appropriate adapter.
Letting iptables see bridged traffic
Make sure that the br_netfilter module is loaded. This can be done by running lsmod | grep br_netfilter. To load it explicitly call sudo modprobe br_netfilter.
As a requirement for your Linux Node's iptables to correctly see bridged traffic, you should ensure net.bridge.bridge-nf-call-iptables is set to 1 in your sysctl config, e.g.
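For example, the sysctl configuration could be written as a file under /etc/sysctl.d and applied, mirroring the pattern used in the containerd section above (the filename is conventional, not required):

```shell
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sudo sysctl --system
```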
Any port numbers marked with * are overridable, so you will need to ensure any
custom ports you provide are also open.
Although etcd ports are included in control-plane nodes, you can also host your own
etcd cluster externally or on custom ports.
The pod network plugin you use (see below) may also require certain ports to be
open. Since this differs with each pod network plugin, please see the
documentation for the plugins about what port(s) those need.
By default, Kubernetes uses the
Container Runtime Interface (CRI)
to interface with your chosen container runtime.
If you don't specify a runtime, kubeadm automatically tries to detect an installed
container runtime by scanning through a list of well known Unix domain sockets.
The following table lists container runtimes and their associated socket paths:
Container runtimes and their socket paths
Runtime
Path to Unix domain socket
Docker
/var/run/dockershim.sock
containerd
/run/containerd/containerd.sock
CRI-O
/var/run/crio/crio.sock
If both Docker and containerd are detected, Docker takes precedence. This is
needed because Docker 18.09 ships with containerd and both are detectable even if you only
installed Docker.
If any other two or more runtimes are detected, kubeadm exits with an error.
The kubelet integrates with Docker through the built-in dockershim CRI implementation.
You will install these packages on all of your machines:
kubeadm: the command to bootstrap the cluster.
kubelet: the component that runs on all of the machines in your cluster
and does things like starting pods and containers.
kubectl: the command line util to talk to your cluster.
kubeadm will not install or manage kubelet or kubectl for you, so you will
need to ensure they match the version of the Kubernetes control plane you want
kubeadm to install for you. If you do not, there is a risk of a version skew occurring that
can lead to unexpected, buggy behaviour. However, one minor version skew between the
kubelet and the control plane is supported, but the kubelet version may never exceed the API
server version. For example, the kubelet running 1.7.0 should be fully compatible with a 1.8.0 API server,
but not vice versa.
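The skew rule above can be sketched as a small shell check. This is a simplification: it compares only the minor version and assumes both components share the same major version.

```shell
# Sketch of the kubeadm version-skew rule: the kubelet may lag the API
# server by at most one minor version, and must never be newer.
# Assumes both versions have the same major version (e.g. 1.x).
skew_ok() {
  kubelet_minor=$(echo "$1" | cut -d. -f2)
  api_minor=$(echo "$2" | cut -d. -f2)
  [ "$kubelet_minor" -le "$api_minor" ] && [ $((api_minor - kubelet_minor)) -le 1 ]
}

skew_ok 1.7.0 1.8.0 && echo "kubelet 1.7.0 with API server 1.8.0: supported"
skew_ok 1.8.0 1.7.0 || echo "kubelet 1.8.0 with API server 1.7.0: not supported"
```

Real deployments should rely on kubeadm's own preflight checks rather than a helper like this; the sketch only makes the stated rule concrete.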
Warning: These instructions exclude all Kubernetes packages from any system upgrades.
This is because kubeadm and Kubernetes require
special attention to upgrade.
Setting SELinux in permissive mode by running setenforce 0 and sed ... effectively disables it.
This is required to allow containers to access the host filesystem, which is needed by pod networks for example.
You have to do this until SELinux support is improved in the kubelet.
You can leave SELinux enabled if you know how to configure it but it may require settings that are not supported by kubeadm.
Install CNI plugins (required for most pod networks):
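The commands below follow the pattern used elsewhere on this page. The pinned CNI_VERSION and the /usr/local/bin download directory are illustrative choices; check the release pages for current versions:

```shell
# Install CNI plugins into the conventional directory (version pinned for illustration).
CNI_VERSION="v0.8.2"
sudo mkdir -p /opt/cni/bin
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-amd64-${CNI_VERSION}.tgz" | sudo tar -C /opt/cni/bin -xz

# Directory used by the crictl and kubeadm download steps that follow.
DOWNLOAD_DIR=/usr/local/bin
sudo mkdir -p $DOWNLOAD_DIR
```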
CRICTL_VERSION="v1.17.0"
curl -L "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" | sudo tar -C $DOWNLOAD_DIR -xz
Install kubeadm, kubelet, kubectl and add a kubelet systemd service:
RELEASE="$(curl -sSL https://dl.k8s.io/release/stable.txt)"
cd $DOWNLOAD_DIR
sudo curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/amd64/{kubeadm,kubelet,kubectl}
sudo chmod +x {kubeadm,kubelet,kubectl}
RELEASE_VERSION="v0.4.0"
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service
sudo mkdir -p /etc/systemd/system/kubelet.service.d
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Enable and start kubelet:
systemctl enable --now kubelet
Note: The Flatcar Container Linux distribution mounts the /usr directory as a read-only filesystem.
Before bootstrapping your cluster, you need to take additional steps to configure a writable directory.
See the Kubeadm Troubleshooting guide to learn how to set up a writable directory.
The kubelet is now restarting every few seconds, as it waits in a crashloop for
kubeadm to tell it what to do.
Configuring a cgroup driver
Both the container runtime and the kubelet have a property called
"cgroup driver", which is important
for the management of cgroups on Linux machines.
Warning:
Matching the container runtime and kubelet cgroup drivers is required; otherwise the kubelet process will fail.
As with any program, you might run into an error installing or running kubeadm.
This page lists some common failure scenarios and provides steps that can help you understand and fix the problem.
If your problem is not listed below, please follow the following steps:
Search for an existing issue in the kubeadm issue tracker.
If no issue exists, please open one and follow the issue template.
If you are unsure about how kubeadm works, you can ask on Slack in #kubeadm,
or open a question on StackOverflow. Please include
relevant tags like #kubernetes and #kubeadm so folks can help you.
Not possible to join a v1.18 Node to a v1.17 cluster due to missing RBAC
In v1.18 kubeadm added prevention for joining a Node in the cluster if a Node with the same name already exists.
This required adding RBAC for the bootstrap-token user to be able to GET a Node object.
However, this causes an issue where kubeadm join from v1.18 cannot join a cluster created by kubeadm v1.17.
To work around the issue, you have two options:
Execute kubeadm init phase bootstrap-token on a control-plane node using kubeadm v1.18.
Note that this enables the rest of the bootstrap-token permissions as well.
or
Apply the following RBAC manually using kubectl apply -f ...:
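The manifest in question grants the bootstrap-token group permission to GET Node objects. A reconstruction along those lines is shown below; the names follow kubeadm's conventions, so verify them against your kubeadm version:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubeadm:get-nodes
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubeadm:get-nodes
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeadm:get-nodes
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:bootstrappers:kubeadm:default-node-token
```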
ebtables or some similar executable not found during installation
If you see the following warnings while running kubeadm init:
[preflight] WARNING: ebtables not found in system path
[preflight] WARNING: ethtool not found in system path
Then you may be missing ebtables, ethtool or a similar executable on your node. You can install them with the following commands:
For Ubuntu/Debian users, run apt install ebtables ethtool.
For CentOS/Fedora users, run yum install ebtables ethtool.
kubeadm blocks waiting for control plane during installation
If you notice that kubeadm init hangs after printing out the following line:
[apiclient] Created API client, waiting for the control plane to become ready
This may be caused by a number of problems. The most common are:
network connection problems. Check that your machine has full network connectivity before continuing.
the default cgroup driver configuration for the kubelet differs from that used by Docker.
Check the system log file (e.g. /var/log/messages) or examine the output from journalctl -u kubelet. If you see something like the following:
error: failed to run Kubelet: failed to create kubelet:
misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
There are two common ways to fix the cgroup driver problem: reconfigure Docker to use the systemd cgroup driver, or change the kubelet's configured cgroup driver to match Docker's.
control plane Docker containers are crashlooping or hanging. You can check this by running docker ps and investigating each container by running docker logs.
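If the cgroup driver mismatch shown above is the cause, one fix is to make both sides agree. For example, a KubeletConfiguration that selects the systemd driver, to be paired with Docker's native.cgroupdriver=systemd setting (the file path is the common kubeadm default):

```yaml
# Excerpt of a kubelet configuration (commonly /var/lib/kubelet/config.yaml).
# The driver named here must match the container runtime's cgroup driver.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
```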
kubeadm blocks when removing managed containers
The following could happen if Docker halts and does not remove any Kubernetes-managed containers:
sudo kubeadm reset
[preflight] Running pre-flight checks
[reset] Stopping the kubelet service
[reset] Unmounting mounted directories in "/var/lib/kubelet"
[reset] Removing kubernetes-managed containers
(block)
A possible solution is to restart the Docker service and then re-run kubeadm reset:
Inspecting the logs for docker may also be useful:
journalctl -u docker
Pods in RunContainerError, CrashLoopBackOff or Error state
Right after kubeadm init there should not be any pods in these states.
If there are pods in one of these states right after kubeadm init, please open an
issue in the kubeadm repo. coredns (or kube-dns) should be in the Pending state
until you have deployed the network add-on.
If you see Pods in the RunContainerError, CrashLoopBackOff or Error state
after deploying the network add-on and nothing happens to coredns (or kube-dns),
it's very likely that the Pod Network add-on that you installed is somehow broken.
You might have to grant it more RBAC privileges or use a newer version. Please file
an issue in the Pod Network providers' issue tracker and get the issue triaged there.
If you install a version of Docker older than 1.12.1, remove the MountFlags=slave option
when booting dockerd with systemd and restart docker. You can see the MountFlags in /usr/lib/systemd/system/docker.service.
MountFlags can interfere with volumes mounted by Kubernetes, and put the Pods in CrashLoopBackOff state.
The error happens when Kubernetes does not find /var/run/secrets/kubernetes.io/serviceaccount files.
coredns is stuck in the Pending state
This is expected and part of the design. kubeadm is network provider-agnostic, so the admin
should install the pod network add-on
of choice. You have to install a Pod Network
before CoreDNS may be deployed fully. Hence the Pending state before the network is set up.
HostPort services do not work
The HostPort and HostIP functionality is available depending on your Pod Network
provider. Please contact the author of the Pod Network add-on to find out whether
HostPort and HostIP functionality are available.
Calico, Canal, and Flannel CNI providers are verified to support HostPort.
If your network provider does not support the portmap CNI plugin, you may need to use the NodePort feature of
services or use HostNetwork=true.
Pods are not accessible via their Service IP
Many network add-ons do not yet enable hairpin mode
which allows pods to access themselves via their Service IP. This is an issue related to
CNI. Please contact the network
add-on provider to get the latest status of their support for hairpin mode.
If you are using VirtualBox (directly or via Vagrant), you will need to
ensure that hostname -i returns a routable IP address. By default the first
interface is connected to a non-routable host-only network. A workaround
is to modify /etc/hosts; see this Vagrantfile
for an example.
TLS certificate errors
The following error indicates a possible certificate mismatch.
# kubectl get pods
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
Verify that the $HOME/.kube/config file contains a valid certificate, and
regenerate a certificate if necessary. The certificates in a kubeconfig file
are base64 encoded. The base64 --decode command can be used to decode the certificate
and openssl x509 -text -noout can be used for viewing the certificate information.
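Putting those two commands together, a sketch for inspecting the embedded client certificate (the kubeconfig path below is the default and may differ on your machine):

```shell
# Pull the base64-encoded client certificate out of the kubeconfig,
# decode it, and print the certificate details.
KUBECONFIG_FILE="${KUBECONFIG_FILE:-$HOME/.kube/config}"
if [ -f "$KUBECONFIG_FILE" ]; then
  grep 'client-certificate-data' "$KUBECONFIG_FILE" \
    | awk '{print $2}' | base64 --decode | openssl x509 -text -noout
fi
```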
Unset the KUBECONFIG environment variable using:
unset KUBECONFIG
Or set it to the default KUBECONFIG location:
export KUBECONFIG=/etc/kubernetes/admin.conf
Another workaround is to overwrite the existing kubeconfig for the "admin" user:
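The commands for that workaround are elided above; a sketch using the kubeadm default paths (back up the existing file first so nothing is lost):

```shell
# Replace the user's kubeconfig with the cluster's admin.conf.
# Guarded so it only acts on a machine that actually has admin.conf.
if [ -f /etc/kubernetes/admin.conf ]; then
  mv "$HOME/.kube/config" "$HOME/.kube/config.backup" 2>/dev/null || true
  sudo cp /etc/kubernetes/admin.conf "$HOME/.kube/config"
  sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"
fi
```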
Kubelet client certificate rotation fails
By default, kubeadm configures a kubelet with automatic rotation of client certificates by using the /var/lib/kubelet/pki/kubelet-client-current.pem symlink specified in /etc/kubernetes/kubelet.conf.
If this rotation process fails you might see errors such as x509: certificate has expired or is not yet valid
in kube-apiserver logs. To fix the issue you must follow these steps:
Backup and delete /etc/kubernetes/kubelet.conf and /var/lib/kubelet/pki/kubelet-client* from the failed node.
From a working control plane node in the cluster that has /etc/kubernetes/pki/ca.key execute
kubeadm kubeconfig user --org system:nodes --client-name system:node:$NODE > kubelet.conf.
$NODE must be set to the name of the existing failed node in the cluster.
Modify the resulting kubelet.conf manually to adjust the cluster name and server endpoint,
or pass kubeconfig user --config (it accepts InitConfiguration). If your cluster does not have
the ca.key you must sign the embedded certificates in the kubelet.conf externally.
Copy the resulting kubelet.conf to /etc/kubernetes/kubelet.conf on the failed node.
Restart the kubelet (systemctl restart kubelet) on the failed node and wait for
/var/lib/kubelet/pki/kubelet-client-current.pem to be recreated.
Run kubeadm init phase kubelet-finalize all on the failed node. This will make the new
kubelet.conf file use /var/lib/kubelet/pki/kubelet-client-current.pem and will restart the kubelet.
Make sure the node becomes Ready.
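The steps above can be condensed into the following sketch; worker-1 is a placeholder for the failed node's name, and each command is annotated with the machine it runs on:

```shell
# Recovery sketch for a failed kubelet client-certificate rotation.
NODE="worker-1"   # placeholder: the name of the failed node

# 1. On the failed node: back up and remove the stale kubeconfig and certs.
#      mv /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.bak
#      rm /var/lib/kubelet/pki/kubelet-client*
# 2. On a working control plane node that has /etc/kubernetes/pki/ca.key:
#      kubeadm kubeconfig user --org system:nodes \
#        --client-name "system:node:${NODE}" > kubelet.conf
# 3. Copy the resulting kubelet.conf to /etc/kubernetes/kubelet.conf on the
#    failed node, then restart the kubelet and finalize:
#      systemctl restart kubelet
#      kubeadm init phase kubelet-finalize all
echo "recovery plan prepared for ${NODE}"
```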
Default NIC When using flannel as the pod network in Vagrant
The following error might indicate that something was wrong in the pod network:
Error from server (NotFound): the server could not find the requested resource
If you're using flannel as the pod network inside Vagrant, then you will have to specify the default interface name for flannel.
Vagrant typically assigns two interfaces to all VMs. The first, for which all hosts are assigned the IP address 10.0.2.15, is for external traffic that gets NATed.
This may lead to problems with flannel, which defaults to the first interface on a host, so all hosts think they have the same public IP address. To prevent this, pass the --iface eth1 flag to flannel so that the second interface is chosen.
Non-public IP used for containers
In some situations kubectl logs and kubectl run commands may return with the following errors in an otherwise functional cluster:
Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc65b868-glc5m/mysql: dial tcp 10.19.0.41:10250: getsockopt: no route to host
This may be due to Kubernetes using an IP that can not communicate with other IPs on the seemingly same subnet, possibly by policy of the machine provider.
DigitalOcean assigns a public IP to eth0 as well as a private one to be used internally as anchor for their floating IP feature, yet kubelet will pick the latter as the node's InternalIP instead of the public one.
Use ip addr show to check for this scenario instead of ifconfig, because ifconfig does not display the offending alias IP address. Alternatively, an API endpoint specific to DigitalOcean allows you to query for the anchor IP from the droplet:
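The endpoint referenced above is part of DigitalOcean's droplet metadata service; a sketch (the link-local address is only reachable from inside a droplet, hence the timeout and tolerance for failure):

```shell
# Ask the link-local metadata service for the droplet's anchor IP.
METADATA_URL="http://169.254.169.254/metadata/v1/interfaces/public/0/anchor_ipv4/address"
curl --silent --max-time 2 "$METADATA_URL" || true
```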
The workaround is to tell the kubelet which IP to use with the --node-ip flag. When using DigitalOcean, this can be the public IP (assigned to eth0) or the private one (assigned to eth1) should you want to use the optional private network. The KubeletExtraArgs section of the kubeadm NodeRegistrationOptions structure can be used for this.
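In a kubeadm configuration file that might look like this sketch (the apiVersion is the v1.21-era v1beta2 API, and the IP is a placeholder):

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    # Placeholder: the droplet IP you want the kubelet to report.
    node-ip: "203.0.113.10"
```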
Then restart kubelet:
systemctl daemon-reload
systemctl restart kubelet
coredns pods have CrashLoopBackOff or Error state
If you have nodes that are running SELinux with an older version of Docker you might experience a scenario
where the coredns pods are not starting. To solve that you can try one of the following options:
Modify the coredns deployment to set allowPrivilegeEscalation to true:
kubectl -n kube-system get deployment coredns -o yaml | \
sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
kubectl apply -f -
Another cause for CoreDNS to have CrashLoopBackOff is when a CoreDNS Pod deployed in Kubernetes detects a loop. A number of workarounds
are available to avoid Kubernetes trying to restart the CoreDNS Pod every time CoreDNS detects the loop and exits.
Warning: Disabling SELinux or setting allowPrivilegeEscalation to true can compromise
the security of your cluster.
etcd pods restart continually
If you encounter the following error:
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""
This issue appears if you run CentOS 7 with Docker 1.13.1.84.
This version of Docker can prevent the kubelet from executing into the etcd container.
To work around the issue, choose one of these options:
Roll back to an earlier version of Docker, such as 1.13.1-75
Not possible to pass a comma separated list of values to arguments inside a --component-extra-args flag
kubeadm init flags such as --component-extra-args allow you to pass custom arguments to a control-plane
component like the kube-apiserver. However, this mechanism is limited due to the underlying type used for parsing
the values (mapStringString).
If you decide to pass an argument that supports multiple, comma-separated values such as
--apiserver-extra-args "enable-admission-plugins=LimitRanger,NamespaceExists" this flag will fail with
flag: malformed pair, expect string=string. This happens because the list of arguments for
--apiserver-extra-args expects key=value pairs and in this case NamespaceExists is considered
a key that is missing a value.
Alternatively, you can try separating the key=value pairs like so:
--apiserver-extra-args "enable-admission-plugins=LimitRanger,enable-admission-plugins=NamespaceExists"
but this will result in the key enable-admission-plugins only having the value of NamespaceExists.
kube-proxy scheduled before node is initialized by cloud-controller-manager
In cloud provider scenarios, kube-proxy can end up being scheduled on new worker nodes before
the cloud-controller-manager has initialized the node addresses. This causes kube-proxy to fail
to pick up the node's IP address properly and has knock-on effects to the proxy function managing
load balancers.
The following error can be seen in kube-proxy Pods:
server.go:610] Failed to retrieve node IP: host IP unknown; known addresses: []
proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
A known solution is to patch the kube-proxy DaemonSet to allow scheduling it on control-plane
nodes regardless of their conditions, keeping it off of other nodes until their initial guarding
conditions abate:
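The patch itself is elided above; one form of it, sketched here with a toleration list that is an assumption about your cluster's taints:

```shell
# Let kube-proxy schedule onto control-plane nodes regardless of their
# conditions; the JSON below is a sketch, not the exact upstream manifest.
PATCH='{"spec":{"template":{"spec":{"tolerations":[{"key":"CriticalAddonsOnly","operator":"Exists"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"}]}}}}'
if kubectl cluster-info >/dev/null 2>&1; then
  kubectl -n kube-system patch daemonset kube-proxy -p "$PATCH"
fi
```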
The NodeRegistration.Taints field is omitted when marshalling kubeadm configuration
Note: This issue only applies to tools that marshal kubeadm types (e.g. to a YAML configuration file). It will be fixed in kubeadm API v1beta2.
By default, kubeadm applies the node-role.kubernetes.io/master:NoSchedule taint to control-plane nodes.
If you prefer that kubeadm not taint the control-plane node, and you set InitConfiguration.NodeRegistration.Taints to an empty slice,
the field will be omitted when marshalling. When the field is omitted, kubeadm applies the default taint.
There are at least two workarounds:
Use the node-role.kubernetes.io/master:PreferNoSchedule taint instead of an empty slice. Pods will get scheduled on masters, unless other nodes have capacity.
/usr is mounted read-only on nodes
On Linux distributions such as Fedora CoreOS or Flatcar Container Linux, the directory /usr is mounted as a read-only filesystem.
For flex-volume support,
Kubernetes components like the kubelet and kube-controller-manager use the default path of
/usr/libexec/kubernetes/kubelet-plugins/volume/exec/, yet the flex-volume directory must be writeable
for the feature to work.
To work around this issue you can configure the flex-volume directory using the kubeadm
configuration file.
On the primary control-plane Node (created using kubeadm init) pass the following
file using --config:
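The configuration file referenced above is elided; a sketch that relocates the flex-volume directory to /opt (the exact path is your choice, and the apiVersion is the v1beta2 kubeadm API):

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    # Writeable replacement for the read-only default under /usr.
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    flex-volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
```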
Alternatively, you can modify /etc/fstab to make the /usr mount writeable, but please
be advised that this is modifying a design principle of the Linux distribution.
kubeadm upgrade plan prints out context deadline exceeded error message
This error message is shown when upgrading a Kubernetes cluster with kubeadm if you are running an external etcd. This is not a critical bug and happens because older versions of kubeadm perform a version check on the external etcd cluster. You can proceed with kubeadm upgrade apply ....
This issue is fixed as of version 1.19.
kubeadm reset unmounts /var/lib/kubelet
If /var/lib/kubelet is being mounted, performing a kubeadm reset will effectively unmount it.
To work around the issue, re-mount the /var/lib/kubelet directory after performing the kubeadm reset operation.
This is a regression introduced in kubeadm 1.15. The issue is fixed in 1.20.
Cannot use the metrics-server securely in a kubeadm cluster
In a kubeadm cluster, the metrics-server
can be used insecurely by passing the --kubelet-insecure-tls flag to it. This is not recommended for production clusters.
If you want to use TLS between the metrics-server and the kubelet there is a problem,
since kubeadm deploys a self-signed serving certificate for the kubelet. This can cause the following errors
on the side of the metrics-server:
x509: certificate signed by unknown authority
x509: certificate is valid for IP-foo not IP-bar
Using kubeadm, you can create a minimum viable Kubernetes cluster that conforms to best practices. In fact, you can use kubeadm to set up a cluster that will pass the Kubernetes Conformance tests.
kubeadm also supports other cluster
lifecycle functions, such as bootstrap tokens and cluster upgrades.
The kubeadm tool is good if you need:
A simple way for you to try out Kubernetes, possibly for the first time.
A way for existing users to automate setting up a cluster and test their application.
A building block in other ecosystem and/or installer tools with a larger
scope.
You can install and use kubeadm on various machines: your laptop, a set
of cloud servers, a Raspberry Pi, and more. Whether you're deploying into the
cloud or on-premises, you can integrate kubeadm into provisioning systems such
as Ansible or Terraform.
Before you begin
To follow this guide, you need:
One or more machines running a deb/rpm-compatible Linux OS; for example: Ubuntu or CentOS.
2 GiB or more of RAM per machine--any less leaves little room for your
apps.
At least 2 CPUs on the machine that you use as a control-plane node.
Full network connectivity among all machines in the cluster. You can use either a
public or a private network.
You also need to use a version of kubeadm that can deploy the version
of Kubernetes that you want to use in your new cluster.
Kubernetes' version and version skew support policy applies to kubeadm as well as to Kubernetes overall.
Check that policy to learn about what versions of Kubernetes and kubeadm
are supported. This page is written for Kubernetes v1.21.
The kubeadm tool's overall feature state is General Availability (GA). Some sub-features are
still under active development. The implementation of creating the cluster may change
slightly as the tool evolves, but the overall implementation should be pretty stable.
Note: Any commands under kubeadm alpha are, by definition, supported on an alpha level.
Objectives
Install a single control-plane Kubernetes cluster
Install a Pod network on the cluster so that your Pods can
talk to each other
If you have already installed kubeadm, run apt-get update && apt-get upgrade or yum update to get the latest version of kubeadm.
When you upgrade, the kubelet restarts every few seconds as it waits in a crashloop for
kubeadm to tell it what to do. This crashloop is expected and normal.
After you initialize your control-plane, the kubelet runs normally.
Initializing your control-plane node
The control-plane node is the machine where the control plane components run, including
etcd (the cluster database) and the
API Server
(which the kubectl command line tool
communicates with).
(Recommended) If you have plans to upgrade this single control-plane kubeadm cluster
to high availability you should specify the --control-plane-endpoint to set the shared endpoint
for all control-plane nodes. Such an endpoint can be either a DNS name or an IP address of a load-balancer.
Choose a Pod network add-on, and verify whether it requires any arguments to
be passed to kubeadm init. Depending on which
third-party provider you choose, you might need to set the --pod-network-cidr to
a provider-specific value. See Installing a Pod network add-on.
(Optional) Since version 1.14, kubeadm tries to detect the container runtime on Linux
by using a list of well-known domain socket paths. To use a different container runtime, or
if more than one is installed on the provisioned node, specify the --cri-socket
argument to kubeadm init. See Installing runtime.
(Optional) Unless otherwise specified, kubeadm uses the network interface associated
with the default gateway to set the advertise address for this particular control-plane node's API server.
To use a different network interface, specify the --apiserver-advertise-address=<ip-address> argument
to kubeadm init. To deploy an IPv6 Kubernetes cluster using IPv6 addressing, you
must specify an IPv6 address, for example --apiserver-advertise-address=fd00::101
(Optional) Run kubeadm config images pull prior to kubeadm init to verify
connectivity to the gcr.io container image registry.
To initialize the control-plane node run:
kubeadm init <args>
Considerations about apiserver-advertise-address and ControlPlaneEndpoint
While --apiserver-advertise-address can be used to set the advertise address for this particular
control-plane node's API server, --control-plane-endpoint can be used to set the shared endpoint
for all control-plane nodes.
--control-plane-endpoint allows both IP addresses and DNS names that can map to IP addresses.
Please contact your network administrator to evaluate possible solutions with respect to such mapping.
Here is an example mapping:
192.168.0.102 cluster-endpoint
Where 192.168.0.102 is the IP address of this node and cluster-endpoint is a custom DNS name that maps to this IP.
This will allow you to pass --control-plane-endpoint=cluster-endpoint to kubeadm init and pass the same DNS name to
kubeadm join. Later you can modify cluster-endpoint to point to the address of your load balancer in a
high availability scenario.
Turning a single control plane cluster created without --control-plane-endpoint into a highly available cluster
is not supported by kubeadm.
To customize control plane components, including optional IPv6 assignment to liveness probe for control plane components and etcd server, provide extra arguments to each component as documented in custom arguments.
If you join a node with a different architecture to your cluster, make sure that your deployed DaemonSets
have container image support for this architecture.
kubeadm init first runs a series of prechecks to ensure that the machine
is ready to run Kubernetes. These prechecks expose warnings and exit on errors. kubeadm init
then downloads and installs the cluster control plane components. This may take several minutes.
After it finishes you should see:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a Pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root:
kubeadm join <control-plane-host>:<control-plane-port> --token <token> --discovery-token-ca-cert-hash sha256:<hash>
To make kubectl work for your non-root user, run these commands, which are
also part of the kubeadm init output:
Alternatively, if you are the root user, you can run:
export KUBECONFIG=/etc/kubernetes/admin.conf
Warning: Kubeadm signs the certificate in the admin.conf to have Subject: O = system:masters, CN = kubernetes-admin.
system:masters is a break-glass, super user group that bypasses the authorization layer (e.g. RBAC).
Do not share the admin.conf file with anyone and instead grant users custom permissions by generating
them a kubeconfig file using the kubeadm kubeconfig user command.
Make a record of the kubeadm join command that kubeadm init outputs. You
need this command to join nodes to your cluster.
The token is used for mutual authentication between the control-plane node and the joining
nodes. The token included here is secret. Keep it safe, because anyone with this
token can add authenticated nodes to your cluster. These tokens can be listed,
created, and deleted with the kubeadm token command. See the
kubeadm reference guide.
Installing a Pod network add-on
Caution:
This section contains important information about networking setup and
deployment order.
Read all of this advice carefully before proceeding.
You must deploy a
Container Network Interface
(CNI) based Pod network add-on so that your Pods can communicate with each other.
Cluster DNS (CoreDNS) will not start up before a network is installed.
Take care that your Pod network does not overlap with any of the host
networks: you are likely to see problems if there is any overlap.
(If you find a collision between your network plugin's preferred Pod
network and some of your host networks, you should think of a suitable
CIDR block to use instead, then use that during kubeadm init with
--pod-network-cidr and as a replacement in your network plugin's YAML).
By default, kubeadm sets up your cluster to use and enforce use of
RBAC (role based access
control).
Make sure that your Pod network plugin supports RBAC, and so do any manifests
that you use to deploy it.
If you want to use IPv6--either dual-stack, or single-stack IPv6 only
networking--for your cluster, make sure that your Pod network plugin
supports IPv6.
IPv6 support was added to CNI in v0.6.0.
Note: Kubeadm should be CNI agnostic and the validation of CNI providers is out of the scope of our current e2e testing.
If you find an issue related to a CNI plugin you should log a ticket in its respective issue
tracker instead of the kubeadm or kubernetes issue trackers.
Several external projects provide Kubernetes Pod networks using CNI, some of which also
support Network Policy.
You can install a Pod network add-on with the following command on the
control-plane node or a node that has the kubeconfig credentials:
kubectl apply -f <add-on.yaml>
You can install only one Pod network per cluster.
Once a Pod network has been installed, you can confirm that it is working by
checking that the CoreDNS Pod is Running in the output of kubectl get pods --all-namespaces.
And once the CoreDNS Pod is up and running, you can continue by joining your nodes.
If your network is not working or CoreDNS is not in the Running state, check out the
troubleshooting guide
for kubeadm.
Control plane node isolation
By default, your cluster will not schedule Pods on the control-plane node for security
reasons. If you want to be able to schedule Pods on the control-plane node, for example for a
single-machine Kubernetes cluster for development, run:
kubectl taint nodes --all node-role.kubernetes.io/master-
node "test-01" untainted
taint "node-role.kubernetes.io/master:" not found
taint "node-role.kubernetes.io/master:" not found
This will remove the node-role.kubernetes.io/master taint from any nodes that
have it, including the control-plane node, meaning that the scheduler will then be able
to schedule Pods everywhere.
Joining your nodes
The nodes are where your workloads (containers and Pods, etc) run. To add new nodes to your cluster do the following for each machine:
SSH to the machine
Become root (e.g. sudo su -)
Run the command that was output by kubeadm init. For example:
If you do not have the token, you can get it by running the following command on the control-plane node:
kubeadm token list
The output is similar to this:
TOKEN                     TTL       EXPIRES                USAGES                   DESCRIPTION                                                EXTRA GROUPS
8ewj1p.9r9hcjoqgajrj4gi   23h       2018-06-12T02:51:28Z   authentication,signing   The default bootstrap token generated by 'kubeadm init'.   system:bootstrappers:kubeadm:default-node-token
By default, tokens expire after 24 hours. If you are joining a node to the cluster after the current token has expired,
you can create a new token by running the following command on the control-plane node:
kubeadm token create
The output is similar to this:
5didvk.d09sbcov8ph2amjw
If you don't have the value of --discovery-token-ca-cert-hash, you can get it by running the following command chain on the control-plane node:
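The command chain itself is elided above; the widely used form reads the cluster CA certificate and hashes its public key (the CA path below is the kubeadm default):

```shell
# Compute the discovery token CA cert hash from the cluster CA certificate.
CA_CERT="${CA_CERT:-/etc/kubernetes/pki/ca.crt}"
if [ -f "$CA_CERT" ]; then
  openssl x509 -pubkey -in "$CA_CERT" \
    | openssl rsa -pubin -outform der 2>/dev/null \
    | sha256sum | cut -d' ' -f1
fi
```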
Note: To specify an IPv6 tuple for <control-plane-host>:<control-plane-port>, IPv6 address must be enclosed in square brackets, for example: [fd00::101]:2073.
The output should look something like:
[preflight] Running pre-flight checks
... (log output of join workflow) ...
Node join complete:
* Certificate signing request sent to control-plane and response
received.
* Kubelet informed of new secure connection details.
Run 'kubectl get nodes' on control-plane to see this machine join.
A few seconds later, you should notice this node in the output from kubectl get nodes when run on the control-plane node.
(Optional) Controlling your cluster from machines other than the control-plane node
In order to get kubectl on some other computer (for example, a laptop) to talk to your
cluster, you need to copy the administrator kubeconfig file from your control-plane node
to your workstation like this:
scp root@<control-plane-host>:/etc/kubernetes/admin.conf .
kubectl --kubeconfig ./admin.conf get nodes
Note:
The example above assumes SSH access is enabled for root. If that is not the
case, you can copy the admin.conf file to be accessible by some other user
and scp using that other user instead.
The admin.conf file gives the user superuser privileges over the cluster.
This file should be used sparingly. For normal users, it's recommended to
generate a unique credential to which you grant privileges. You can do
this with the kubeadm alpha kubeconfig user --client-name <CN>
command. That command will print out a KubeConfig file to STDOUT which you
should save to a file and distribute to your user. After that, grant
privileges by using kubectl create (cluster)rolebinding.
(Optional) Proxying API Server to localhost
If you want to connect to the API Server from outside the cluster you can use
kubectl proxy:
kubectl --kubeconfig ./admin.conf proxy
You can now access the API Server locally at http://localhost:8001/api/v1
Clean up
If you used disposable servers for your cluster, for testing, you can
switch those off and do no further clean up. You can use
kubectl config delete-cluster to delete your local references to the
cluster.
However, if you want to deprovision your cluster more cleanly, you should
first drain the node
and make sure that the node is empty, then deconfigure the node.
Remove the node
Talking to the control-plane node with the appropriate credentials, run:
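The removal commands themselves are elided above; a sketch, where worker-1 stands in for the node you are removing:

```shell
# Drain the node (evict workloads) before removing it; only runs if a
# cluster is reachable.
NODE_NAME="worker-1"   # placeholder node name
if kubectl cluster-info >/dev/null 2>&1; then
  kubectl drain "$NODE_NAME" --ignore-daemonsets
fi
# Afterwards, on the node itself, wipe the kubeadm-installed state:
#   sudo kubeadm reset
```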
See the Cluster Networking page for a bigger list
of Pod network add-ons.
See the list of add-ons to
explore other add-ons, including tools for logging, monitoring, network policy, visualization &
control of your Kubernetes cluster.
Configure how your cluster handles logs for cluster events and from
applications running in Pods.
See Logging Architecture for
an overview of what is involved.
The kubeadm tool of version v1.21 may deploy clusters with a control plane of version v1.21 or v1.20.
kubeadm v1.21 can also upgrade an existing kubeadm-created cluster of version v1.20.
Because we can't see into the future, kubeadm CLI v1.21 may or may not be able to deploy v1.22 clusters.
These resources provide more information on supported version skew between kubelets and the control plane, and other Kubernetes components:
The cluster created here has a single control-plane node, with a single etcd database
running on it. This means that if the control-plane node fails, your cluster may lose
data and may need to be recreated from scratch.
Workarounds:
Regularly back up etcd. The
etcd data directory configured by kubeadm is at /var/lib/etcd on the control-plane node.
kubeadm deb/rpm packages and binaries are built for amd64, arm (32-bit), arm64, ppc64le, and s390x
following the multi-platform
proposal.
Multiplatform container images for the control plane and addons are also supported since v1.12.
Only some of the network providers offer solutions for all platforms. Please consult the list of
network providers above or the documentation from each provider to figure out whether the provider
supports your chosen platform.
Troubleshooting
If you are running into difficulties with kubeadm, please consult our troubleshooting docs.
2.1.4 - Customizing control plane configuration with kubeadm
FEATURE STATE:Kubernetes v1.12 [stable]
The kubeadm ClusterConfiguration object exposes the field extraArgs that can override the default flags passed to control plane
components such as the APIServer, ControllerManager and Scheduler. The components are defined using the following fields:
apiServer
controllerManager
scheduler
The extraArgs field consists of key: value pairs. To override a flag for a control plane component:
Add the appropriate fields to your configuration.
Add the flags to override to the field.
Run kubeadm init with --config <YOUR CONFIG YAML>.
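For example, a ClusterConfiguration that overrides two kube-apiserver flags might look like this sketch (the flag choices are illustrative, not recommendations):

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.21.0
apiServer:
  extraArgs:
    # Illustrative overrides; pick flags appropriate for your cluster.
    anonymous-auth: "false"
    enable-admission-plugins: AlwaysPullImages
```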
For more details on each field in the configuration you can navigate to our
API reference pages.
Note: You can generate a ClusterConfiguration object with default values by running kubeadm config print init-defaults and saving the output to a file of your choice.
This page explains the two options for configuring the topology of your highly available (HA) Kubernetes clusters.
You can set up an HA cluster:
With stacked control plane nodes, where etcd nodes are colocated with control plane nodes
With external etcd nodes, where etcd runs on separate nodes from the control plane
You should carefully consider the advantages and disadvantages of each topology before setting up an HA cluster.
Note: kubeadm bootstraps the etcd cluster statically. Read the etcd Clustering Guide
for more details.
Stacked etcd topology
A stacked HA cluster is a topology where the distributed
data storage cluster provided by etcd is stacked on top of the cluster formed by the nodes managed by
kubeadm that run control plane components.
Each control plane node runs an instance of the kube-apiserver, kube-scheduler, and kube-controller-manager.
The kube-apiserver is exposed to worker nodes using a load balancer.
Each control plane node creates a local etcd member and this etcd member communicates only with
the kube-apiserver of this node. The same applies to the local kube-controller-manager
and kube-scheduler instances.
This topology couples the control planes and etcd members on the same nodes. It is simpler to set up than a cluster
with external etcd nodes, and simpler to manage for replication.
However, a stacked cluster runs the risk of failed coupling. If one node goes down, both an etcd member and a control
plane instance are lost, and redundancy is compromised. You can mitigate this risk by adding more control plane nodes.
You should therefore run a minimum of three stacked control plane nodes for an HA cluster.
This is the default topology in kubeadm. A local etcd member is created automatically
on control plane nodes when using kubeadm init and kubeadm join --control-plane.
External etcd topology
An HA cluster with external etcd is a topology where the distributed data storage cluster provided by etcd is external to the cluster formed by the nodes that run control plane components.
Like the stacked etcd topology, each control plane node in an external etcd topology runs an instance of the kube-apiserver, kube-scheduler, and kube-controller-manager. And the kube-apiserver is exposed to worker nodes using a load balancer. However, etcd members run on separate hosts, and each etcd host communicates with the kube-apiserver of each control plane node.
This topology decouples the control plane and etcd member. It therefore provides an HA setup where
losing a control plane instance or an etcd member has less impact and does not affect
the cluster redundancy as much as the stacked HA topology.
However, this topology requires twice the number of hosts as the stacked HA topology.
A minimum of three hosts for control plane nodes and three hosts for etcd nodes are required for an HA cluster with this topology.
2.1.6 - Creating Highly Available clusters with kubeadm
This page explains two different approaches to setting up a highly available Kubernetes
cluster using kubeadm:
With stacked control plane nodes. This approach requires less infrastructure. The etcd members
and control plane nodes are co-located.
With an external etcd cluster. This approach requires more infrastructure. The
control plane nodes and etcd members are separated.
Before proceeding, you should carefully consider which approach best meets the needs of your applications
and environment. This comparison topic outlines the advantages and disadvantages of each.
If you encounter issues with setting up the HA cluster, please provide us with feedback
in the kubeadm issue tracker.
Caution: This page does not address running your cluster on a cloud provider. In a cloud
environment, neither approach documented here works with Service objects of type
LoadBalancer, or with dynamic PersistentVolumes.
Full network connectivity between all machines in the cluster (public or
private network)
sudo privileges on all machines
SSH access from one device to all nodes in the system
kubeadm and kubelet installed on all machines. kubectl is optional.
For the external etcd cluster only, you also need:
Three additional machines for etcd members
First steps for both methods
Create load balancer for kube-apiserver
Note: There are many configurations for load balancers. The following example is only one
option. Your cluster requirements may need a different configuration.
Create a kube-apiserver load balancer with a name that resolves to DNS.
In a cloud environment you should place your control plane nodes behind a TCP
forwarding load balancer. This load balancer distributes traffic to all
healthy control plane nodes in its target list. The health check for
an apiserver is a TCP check on the port the kube-apiserver listens on
(default value :6443).
It is not recommended to use an IP address directly in a cloud environment.
The load balancer must be able to communicate with all control plane nodes
on the apiserver port. It must also allow incoming traffic on its
listening port.
Make sure the address of the load balancer always matches
the address of kubeadm's ControlPlaneEndpoint.
Add the first control plane node to the load balancer and test the
connection:
nc -v LOAD_BALANCER_IP PORT
A connection refused error is expected because the apiserver is not yet
running. A timeout, however, means the load balancer cannot communicate
with the control plane node. If a timeout occurs, reconfigure the load
balancer to communicate with the control plane node.
Add the remaining control plane nodes to the load balancer target group.
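Putting the flags described below together, initializing the first control plane node looks roughly like the following. This is a dry-run sketch (the command is echoed rather than executed), and LOAD_BALANCER_DNS and LOAD_BALANCER_PORT are placeholders for your own load balancer endpoint:

```shell
# Dry-run sketch: echo the init command instead of executing it.
# LOAD_BALANCER_DNS / LOAD_BALANCER_PORT are placeholders for your endpoint.
LOAD_BALANCER_DNS=lb.example.com
LOAD_BALANCER_PORT=6443
echo sudo kubeadm init \
  --control-plane-endpoint "${LOAD_BALANCER_DNS}:${LOAD_BALANCER_PORT}" \
  --upload-certs
```

Remove the echo (and substitute your real endpoint) to perform the actual initialization.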
You can use the --kubernetes-version flag to set the Kubernetes version to use.
It is recommended that the versions of kubeadm, kubelet, kubectl and Kubernetes match.
The --control-plane-endpoint flag should be set to the address or DNS and port of the load balancer.
The --upload-certs flag is used to upload the certificates that should be shared
across all the control-plane instances to the cluster. If instead, you prefer to copy certs across
control-plane nodes manually or using automation tools, please remove this flag and refer to Manual
certificate distribution section below.
Note: The kubeadm init flags --config and --certificate-key cannot be mixed, therefore if you want
to use the kubeadm configuration
you must add the certificateKey field in the appropriate config locations
(under InitConfiguration and JoinConfiguration: controlPlane).
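For illustration, the certificateKey field sits at the top level of InitConfiguration and under controlPlane in JoinConfiguration; a sketch with a placeholder key:

```yaml
apiVersion: "kubeadm.k8s.io/v1beta2"
kind: InitConfiguration
certificateKey: "<your-certificate-key>"
---
apiVersion: "kubeadm.k8s.io/v1beta2"
kind: JoinConfiguration
controlPlane:
  certificateKey: "<your-certificate-key>"
```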
Note: Some CNI network plugins require additional configuration, for example specifying the pod IP CIDR, while others do not.
See the CNI network documentation.
To add a pod CIDR pass the flag --pod-network-cidr, or if you are using a kubeadm configuration file
set the podSubnet field under the networking object of ClusterConfiguration.
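For example, the equivalent of --pod-network-cidr in a kubeadm configuration file looks like this (the CIDR value is illustrative):

```yaml
apiVersion: "kubeadm.k8s.io/v1beta2"
kind: ClusterConfiguration
networking:
  podSubnet: "10.244.0.0/16"
```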
The output looks similar to:
...
You can now join any number of control-plane node by running the following command on each as a root:
kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07
Please note that the certificate-key gives access to cluster sensitive data, keep it secret!
As a safeguard, uploaded-certs will be deleted in two hours; If necessary, you can use kubeadm init phase upload-certs to reload certs afterward.
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866
Copy this output to a text file. You will need it later to join control plane and worker nodes to the cluster.
When --upload-certs is used with kubeadm init, the certificates of the primary control plane
are encrypted and uploaded in the kubeadm-certs Secret.
To re-upload the certificates and generate a new decryption key, use the following command on a control plane
node that is already joined to the cluster:
sudo kubeadm init phase upload-certs --upload-certs
You can also specify a custom --certificate-key during init that can later be used by join.
To generate such a key you can use the following command:
kubeadm certs certificate-key
Note: The kubeadm-certs Secret and decryption key expire after two hours.
Caution: As stated in the command output, the certificate key gives access to cluster sensitive data, keep it secret!
Apply the CNI plugin of your choice:
Follow these instructions
to install the CNI provider. Make sure the configuration corresponds to the Pod CIDR specified in the kubeadm configuration file if applicable.
Type the following and watch the pods of the control plane components get started:
kubectl get pod -n kube-system -w
Steps for the rest of the control plane nodes
Note: Since kubeadm version 1.15 you can join multiple control-plane nodes in parallel.
Prior to this version, you must join new control plane nodes sequentially, only after
the first node has finished initializing.
For each additional control plane node you should:
Execute the join command that was previously given to you by the kubeadm init output on the first node.
It should look something like this:
The --control-plane flag tells kubeadm join to create a new control plane.
The --certificate-key ... will cause the control plane certificates to be downloaded
from the kubeadm-certs Secret in the cluster and be decrypted using the given key.
External etcd nodes
Setting up a cluster with external etcd nodes is similar to the procedure used for stacked etcd,
with the exception that you should set up etcd first, and you should pass the etcd information
in the kubeadm config file.
Note: The difference between stacked etcd and external etcd here is that the external etcd setup requires
a configuration file with the etcd endpoints under the external object for etcd.
In the case of the stacked etcd topology this is managed automatically.
Replace the following variables in the config template with the appropriate values for your cluster:
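The config template itself is not reproduced here; a minimal sketch of its shape, assuming three etcd members and the client certificate paths that the external etcd setup in this guide produces (replace LOAD_BALANCER_DNS, LOAD_BALANCER_PORT, and the ETCD_*_IP placeholders):

```yaml
apiVersion: "kubeadm.k8s.io/v1beta2"
kind: ClusterConfiguration
kubernetesVersion: stable
controlPlaneEndpoint: "LOAD_BALANCER_DNS:LOAD_BALANCER_PORT"
etcd:
  external:
    endpoints:
      - https://ETCD_0_IP:2379
      - https://ETCD_1_IP:2379
      - https://ETCD_2_IP:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```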
If you choose not to use kubeadm init with the --upload-certs flag, you must manually
copy the certificates from the primary control plane node to the joining control plane nodes.
There are many ways to do this. The following example uses ssh and scp:
SSH is required if you want to control all nodes from a single machine.
Enable ssh-agent on your main device that has access to all other nodes in
the system:
eval $(ssh-agent)
Add your SSH identity to the session:
ssh-add ~/.ssh/path_to_private_key
SSH between nodes to check that the connection is working correctly.
When you SSH to any node, make sure to add the -A flag:
ssh -A 10.0.0.7
When using sudo on any node, make sure to preserve the environment so SSH
forwarding works:
sudo -E -s
After configuring SSH on all the nodes you should run the following script on the first control plane node after
running kubeadm init. This script will copy the certificates from the first control plane node to the other
control plane nodes:
In the following example, replace CONTROL_PLANE_IPS with the IP addresses of the
other control plane nodes.
USER=ubuntu # customizable
CONTROL_PLANE_IPS="10.0.0.7 10.0.0.8"
for host in ${CONTROL_PLANE_IPS}; do
scp /etc/kubernetes/pki/ca.crt "${USER}"@$host:
scp /etc/kubernetes/pki/ca.key "${USER}"@$host:
scp /etc/kubernetes/pki/sa.key "${USER}"@$host:
scp /etc/kubernetes/pki/sa.pub "${USER}"@$host:
scp /etc/kubernetes/pki/front-proxy-ca.crt "${USER}"@$host:
scp /etc/kubernetes/pki/front-proxy-ca.key "${USER}"@$host:
scp /etc/kubernetes/pki/etcd/ca.crt "${USER}"@$host:etcd-ca.crt
# Quote this line if you are using external etcd
scp /etc/kubernetes/pki/etcd/ca.key "${USER}"@$host:etcd-ca.key
done
Caution: Copy only the certificates in the above list. kubeadm will take care of generating the rest of the certificates
with the required SANs for the joining control-plane instances. If you copy all the certificates by mistake,
the creation of additional nodes could fail due to a lack of required SANs.
Then on each joining control plane node you have to run the following script before running kubeadm join.
This script will move the previously copied certificates from the home directory to /etc/kubernetes/pki:
USER=ubuntu # customizable
mkdir -p /etc/kubernetes/pki/etcd
mv /home/${USER}/ca.crt /etc/kubernetes/pki/
mv /home/${USER}/ca.key /etc/kubernetes/pki/
mv /home/${USER}/sa.pub /etc/kubernetes/pki/
mv /home/${USER}/sa.key /etc/kubernetes/pki/
mv /home/${USER}/front-proxy-ca.crt /etc/kubernetes/pki/
mv /home/${USER}/front-proxy-ca.key /etc/kubernetes/pki/
mv /home/${USER}/etcd-ca.crt /etc/kubernetes/pki/etcd/ca.crt
# Quote this line if you are using external etcd
mv /home/${USER}/etcd-ca.key /etc/kubernetes/pki/etcd/ca.key
2.1.7 - Set up a High Availability etcd cluster with kubeadm
Note: While kubeadm is being used as the management tool for external etcd nodes
in this guide, please note that kubeadm does not plan to support certificate rotation
or upgrades for such nodes. The long term plan is to empower the tool
etcdadm to manage these
aspects.
Kubeadm defaults to running a single member etcd cluster in a static pod managed
by the kubelet on the control plane node. This is not a high availability setup
as the etcd cluster contains only one member and cannot sustain any members
becoming unavailable. This task walks through the process of creating a high
availability etcd cluster of three members that can be used as an external etcd
when using kubeadm to set up a kubernetes cluster.
Before you begin
Three hosts that can talk to each other over ports 2379 and 2380. This
document assumes these default ports. However, they are configurable through
the kubeadm config file.
Each host should have access to the Kubernetes container image registry (k8s.gcr.io), or be able to
list and pull the required etcd image using kubeadm config images list/pull. This guide will set up
etcd instances as static pods managed by a kubelet.
Some infrastructure to copy files between hosts. For example ssh and scp
can satisfy this requirement.
Setting up the cluster
The general approach is to generate all certs on one node and only distribute
the necessary files to the other nodes.
Note: kubeadm contains all the necessary cryptographic machinery to generate
the certificates described below; no other cryptographic tooling is required for
this example.
Configure the kubelet to be a service manager for etcd.
Note: You must do this on every host where etcd should be running.
Since etcd was created first, you must override the service priority by creating a new unit file
that has higher precedence than the kubeadm-provided kubelet unit file.
cat << EOF > /etc/systemd/system/kubelet.service.d/20-etcd-service-manager.conf
[Service]
ExecStart=
# Replace "systemd" with the cgroup driver of your container runtime. The default value in the kubelet is "cgroupfs".
ExecStart=/usr/bin/kubelet --address=127.0.0.1 --pod-manifest-path=/etc/kubernetes/manifests --cgroup-driver=systemd
Restart=always
EOF
systemctl daemon-reload
systemctl restart kubelet
Check the kubelet status to ensure it is running.
systemctl status kubelet
Create configuration files for kubeadm.
Generate one kubeadm configuration file for each host that will have an etcd
member running on it using the following script.
# Update HOST0, HOST1, and HOST2 with the IPs or resolvable names of your hosts
export HOST0=10.0.0.6
export HOST1=10.0.0.7
export HOST2=10.0.0.8

# Create temp directories to store files that will end up on other hosts.
mkdir -p /tmp/${HOST0}/ /tmp/${HOST1}/ /tmp/${HOST2}/

ETCDHOSTS=(${HOST0} ${HOST1} ${HOST2})
NAMES=("infra0" "infra1" "infra2")

for i in "${!ETCDHOSTS[@]}"; do
HOST=${ETCDHOSTS[$i]}
NAME=${NAMES[$i]}
cat << EOF > /tmp/${HOST}/kubeadmcfg.yaml
apiVersion: "kubeadm.k8s.io/v1beta2"
kind: ClusterConfiguration
etcd:
    local:
        serverCertSANs:
        - "${HOST}"
        peerCertSANs:
        - "${HOST}"
        extraArgs:
            initial-cluster: ${NAMES[0]}=https://${ETCDHOSTS[0]}:2380,${NAMES[1]}=https://${ETCDHOSTS[1]}:2380,${NAMES[2]}=https://${ETCDHOSTS[2]}:2380
            initial-cluster-state: new
            name: ${NAME}
            listen-peer-urls: https://${HOST}:2380
            listen-client-urls: https://${HOST}:2379
            advertise-client-urls: https://${HOST}:2379
            initial-advertise-peer-urls: https://${HOST}:2380
EOF
done
Generate the certificate authority
If you already have a CA, then the only action required is copying the CA's crt and
key file to /etc/kubernetes/pki/etcd/ca.crt and
/etc/kubernetes/pki/etcd/ca.key. After those files have been copied,
proceed to the next step, "Create certificates for each member".
If you do not already have a CA then run this command on $HOST0 (where you
generated the configuration files for kubeadm).
kubeadm init phase certs etcd-ca
This creates two files
/etc/kubernetes/pki/etcd/ca.crt
/etc/kubernetes/pki/etcd/ca.key
Create certificates for each member
kubeadm init phase certs etcd-server --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST2}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST2}/kubeadmcfg.yaml
cp -R /etc/kubernetes/pki /tmp/${HOST2}/
# cleanup non-reusable certificates
find /etc/kubernetes/pki -not -name ca.crt -not -name ca.key -type f -delete
kubeadm init phase certs etcd-server --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST1}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST1}/kubeadmcfg.yaml
cp -R /etc/kubernetes/pki /tmp/${HOST1}/
find /etc/kubernetes/pki -not -name ca.crt -not -name ca.key -type f -delete
kubeadm init phase certs etcd-server --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs etcd-peer --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs etcd-healthcheck-client --config=/tmp/${HOST0}/kubeadmcfg.yaml
kubeadm init phase certs apiserver-etcd-client --config=/tmp/${HOST0}/kubeadmcfg.yaml
# No need to move the certs because they are for HOST0

# clean up certs that should not be copied off this host
find /tmp/${HOST2} -name ca.key -type f -delete
find /tmp/${HOST1} -name ca.key -type f -delete
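The find ... -not -name ... -delete invocations above keep only the CA pair in a tree. As a safe illustration of that pattern, here it is applied to a scratch directory (the file names are arbitrary):

```shell
# Safe demonstration of the cleanup pattern on a scratch directory:
# delete every regular file except those named ca.crt or ca.key.
tmp=$(mktemp -d)
touch "$tmp/ca.crt" "$tmp/ca.key" "$tmp/server.crt" "$tmp/peer.key"
find "$tmp" -not -name ca.crt -not -name ca.key -type f -delete
ls "$tmp"    # only ca.crt and ca.key remain
```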
Copy certificates and kubeadm configs
The certificates have been generated and now they must be moved to their
respective hosts.
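The copy step is not shown here; one hedged sketch, using a dry-run wrapper so it can be run without real hosts (USER and HOST are example values, and the destination path assumes the layout created by the script above):

```shell
# Dry-run sketch of shipping one host's bundle of certs and config.
# Swap the echo wrapper for real execution once the host is reachable.
USER=ubuntu
HOST=10.0.0.7               # example: one etcd member's address
run() { echo "$@"; }        # dry-run wrapper; replace body with "$@" to execute
run scp -r /tmp/${HOST}/ ${USER}@${HOST}:
run ssh ${USER}@${HOST} "sudo mv /home/${USER}/${HOST}/pki /etc/kubernetes/"
```

Repeat for each etcd member, using its own /tmp/${HOST}/ bundle.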
Now that the certificates and configs are in place it's time to create the
manifests. On each host run the kubeadm command to generate a static manifest
for etcd.
root@HOST0 $ kubeadm init phase etcd local --config=/tmp/${HOST0}/kubeadmcfg.yaml
root@HOST1 $ kubeadm init phase etcd local --config=/tmp/${HOST1}/kubeadmcfg.yaml
root@HOST2 $ kubeadm init phase etcd local --config=/tmp/${HOST2}/kubeadmcfg.yaml
Set ${ETCD_TAG} to the version tag of your etcd image, for example 3.4.3-0. To see the etcd image and tag that kubeadm uses, execute kubeadm config images list --kubernetes-version ${K8S_VERSION}, where ${K8S_VERSION} is, for example, v1.17.0.
Set ${HOST0} to the IP address of the host you are testing.
What's next
Once you have a working 3 member etcd cluster, you can continue setting up a
highly available control plane using the external etcd method with
kubeadm.
2.1.8 - Configuring each kubelet in your cluster using kubeadm
FEATURE STATE: Kubernetes v1.11 [stable]
The lifecycle of the kubeadm CLI tool is decoupled from the
kubelet, which is a daemon that runs
on each node within the Kubernetes cluster. The kubeadm CLI tool is executed by the user when Kubernetes is
initialized or upgraded, whereas the kubelet is always running in the background.
Since the kubelet is a daemon, it needs to be maintained by some kind of an init
system or service manager. When the kubelet is installed using DEBs or RPMs,
systemd is configured to manage the kubelet. You can use a different service
manager instead, but you need to configure it manually.
Some kubelet configuration details need to be the same across all kubelets involved in the cluster, while
other configuration aspects need to be set on a per-kubelet basis to accommodate the different
characteristics of a given machine (such as OS, storage, and networking). You can manage the configuration
of your kubelets manually, but kubeadm now provides a KubeletConfiguration API type for
managing your kubelet configurations centrally.
Kubelet configuration patterns
The following sections describe patterns to kubelet configuration that are simplified by
using kubeadm, rather than managing the kubelet configuration for each Node manually.
Propagating cluster-level configuration to each kubelet
You can provide the kubelet with default values to be used by kubeadm init and kubeadm join
commands. Interesting examples include using a different CRI runtime or setting the default subnet
used by services.
If you want your services to use the subnet 10.96.0.0/12 as the default for services, you can pass
the --service-cidr parameter to kubeadm:
kubeadm init --service-cidr 10.96.0.0/12
Virtual IPs for services are now allocated from this subnet. You also need to set the DNS address used
by the kubelet, using the --cluster-dns flag. This setting needs to be the same for every kubelet
on every manager and Node in the cluster. The kubelet provides a versioned, structured API object
that can configure most parameters in the kubelet and push out this configuration to each running
kubelet in the cluster. This object is called
KubeletConfiguration.
The KubeletConfiguration allows the user to specify flags such as the cluster DNS IP addresses expressed as
a list of values to a camelCased key, illustrated by the following example:
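The referenced example is not included in this page; a minimal sketch (the DNS IP shown is the conventional default for a 10.96.0.0/12 service subnet):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
- 10.96.0.10
```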
For more details on the KubeletConfiguration have a look at this section.
Providing instance-specific configuration details
Some hosts require specific kubelet configurations due to differences in hardware, operating system,
networking, or other host-specific parameters. The following list provides a few examples.
The path to the DNS resolution file, as specified by the --resolv-conf kubelet
configuration flag, may differ among operating systems, or depending on whether you are using
systemd-resolved. If this path is wrong, DNS resolution will fail on the Node whose kubelet
is configured incorrectly.
The Node API object .metadata.name is set to the machine's hostname by default,
unless you are using a cloud provider. You can use the --hostname-override flag to override the
default behavior if you need to specify a Node name different from the machine's hostname.
Currently, the kubelet cannot automatically detect the cgroup driver used by the CRI runtime,
but the value of --cgroup-driver must match the cgroup driver used by the CRI runtime to ensure
the health of the kubelet.
Depending on the CRI runtime your cluster uses, you may need to specify different flags to the kubelet.
For instance, when using Docker, you need to specify flags such as --network-plugin=cni, but if you
are using an external runtime, you need to specify --container-runtime=remote and specify the CRI
endpoint using the --container-runtime-endpoint=<path>.
You can specify these flags by configuring an individual kubelet's configuration in your service manager,
such as systemd.
Configure kubelets using kubeadm
It is possible to configure the kubelet that kubeadm will start if a custom KubeletConfiguration
API object is passed with a configuration file, like so: kubeadm ... --config some-config-file.yaml.
By calling kubeadm config print init-defaults --component-configs KubeletConfiguration you can
see all the default values for this structure.
When you call kubeadm init, the kubelet configuration is marshalled to disk
at /var/lib/kubelet/config.yaml, and also uploaded to a ConfigMap in the cluster. The ConfigMap
is named kubelet-config-1.X, where X is the minor version of the Kubernetes version you are
initializing. A kubelet configuration file is also written to /etc/kubernetes/kubelet.conf with the
baseline cluster-wide configuration for all kubelets in the cluster. This configuration file
points to the client certificates that allow the kubelet to communicate with the API server. This
addresses the need to
propagate cluster-level configuration to each kubelet.
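As a small illustration of the ConfigMap naming scheme, the minor-version suffix can be derived in shell (the version value here is hypothetical):

```shell
# Derive the kubelet ConfigMap name from a cluster version string.
# K8S_VERSION is a hypothetical example value.
K8S_VERSION=v1.21.3
MINOR=${K8S_VERSION#v}   # strip the leading "v" -> 1.21.3
MINOR=${MINOR%.*}        # drop the patch component -> 1.21
echo "kubelet-config-${MINOR}"   # → kubelet-config-1.21
```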
To address the second pattern of
providing instance-specific configuration details,
kubeadm writes an environment file to /var/lib/kubelet/kubeadm-flags.env, which contains a list of
flags to pass to the kubelet when it starts. The flags are presented in the file like this:
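A sketch of what /var/lib/kubelet/kubeadm-flags.env might contain; the flag values shown are illustrative, not necessarily what kubeadm writes on your machine:

```shell
# Illustrative contents of /var/lib/kubelet/kubeadm-flags.env;
# the actual flags depend on your runtime and host.
KUBELET_KUBEADM_ARGS="--container-runtime=remote --container-runtime-endpoint=/var/run/containerd/containerd.sock"
```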
In addition to the flags used when starting the kubelet, the file also contains dynamic
parameters such as the cgroup driver and whether to use a different CRI runtime socket
(--cri-socket).
After marshalling these two files to disk, kubeadm attempts to run the following two
commands, if you are using systemd:
systemctl daemon-reload && systemctl restart kubelet
If the reload and restart are successful, the normal kubeadm init workflow continues.
Workflow when using kubeadm join
When you run kubeadm join, kubeadm uses the Bootstrap Token credential to perform
a TLS bootstrap, which fetches the credential needed to download the
kubelet-config-1.X ConfigMap and writes it to /var/lib/kubelet/config.yaml. The dynamic
environment file is generated in exactly the same way as for kubeadm init.
Next, kubeadm runs the following two commands to load the new configuration into the kubelet:
systemctl daemon-reload && systemctl restart kubelet
After the kubelet loads the new configuration, kubeadm writes the
/etc/kubernetes/bootstrap-kubelet.conf KubeConfig file, which contains a CA certificate and Bootstrap
Token. These are used by the kubelet to perform the TLS Bootstrap and obtain a unique
credential, which is stored in /etc/kubernetes/kubelet.conf. When this file is written, the kubelet
has finished performing the TLS Bootstrap.
The kubelet drop-in file for systemd
kubeadm ships with configuration for how systemd should run the kubelet.
Note that the kubeadm CLI command never touches this drop-in file.
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generate at runtime, populating
# the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably,
# the user should use the .NodeRegistration.KubeletExtraArgs object in the configuration files instead.
# KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
This file specifies the default locations for all of the files managed by kubeadm for the kubelet.
The KubeConfig file to use for the TLS Bootstrap is /etc/kubernetes/bootstrap-kubelet.conf,
but it is only used if /etc/kubernetes/kubelet.conf does not exist.
The KubeConfig file with the unique kubelet identity is /etc/kubernetes/kubelet.conf.
The file containing the kubelet's ComponentConfig is /var/lib/kubelet/config.yaml.
The dynamic environment file that contains KUBELET_KUBEADM_ARGS is sourced from /var/lib/kubelet/kubeadm-flags.env.
The file that can contain user-specified flag overrides with KUBELET_EXTRA_ARGS is sourced from
/etc/default/kubelet (for DEBs), or /etc/sysconfig/kubelet (for RPMs). KUBELET_EXTRA_ARGS
is last in the flag chain and has the highest priority in the event of conflicting settings.
Kubernetes binaries and package contents
The DEB and RPM packages shipped with the Kubernetes releases are:
kubeadm: Installs the /usr/bin/kubeadm CLI tool and the kubelet drop-in file for the kubelet.
kubelet: Installs the kubelet binary in /usr/bin and CNI binaries in /opt/cni/bin.
Your Kubernetes cluster can run in dual-stack networking mode, which means that cluster networking lets you use either address family. In a dual-stack cluster, the control plane can assign both an IPv4 address and an IPv6 address to a single Pod or a Service.
For each server that you want to use as a node, make sure it allows IPv6 forwarding. On Linux, you can set this by running sysctl -w net.ipv6.conf.all.forwarding=1 as the root user on each server.
You need to have an IPv4 and an IPv6 address range to use. Cluster operators typically
use private address ranges for IPv4. For IPv6, a cluster operator typically chooses a global
unicast address block from within 2000::/3, using a range that is assigned to the operator.
You don't have to route the cluster's IP address ranges to the public internet.
The size of the IP address allocations should be suitable for the number of Pods and
Services that you are planning to run.
Note: If you are upgrading an existing cluster then, by default, the kubeadm upgrade command
changes the feature gate IPv6DualStack to true if that is not already enabled.
However, kubeadm does not support making modifications to the pod IP address range
(“cluster CIDR”) nor to the cluster's Service address range (“Service CIDR”).
Create a dual-stack cluster
To create a dual-stack cluster with kubeadm init you can pass command line arguments
similar to the following example:
# These address ranges are examples
kubeadm init --pod-network-cidr=10.244.0.0/16,2001:db8:42:0::/56 --service-cidr=10.96.0.0/16,2001:db8:42:1::/112
To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml for the primary dual-stack control plane node.
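The file itself is not reproduced here; a sketch consistent with the CIDRs used above (the advertise address is a placeholder):

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
networking:
  podSubnet: 10.244.0.0/16,2001:db8:42:0::/56
  serviceSubnet: 10.96.0.0/16,2001:db8:42:1::/112
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: "10.100.0.1"
```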
advertiseAddress in InitConfiguration specifies the IP address that the API Server will advertise it is listening on. The value of advertiseAddress equals the --apiserver-advertise-address flag of kubeadm init
Run kubeadm to initiate the dual-stack control plane node:
kubeadm init --config=kubeadm-config.yaml
Currently, the kube-controller-manager flags --node-cidr-mask-size-ipv4|--node-cidr-mask-size-ipv6 are being left with default values. See enable IPv4/IPv6 dual stack.
Note: The --apiserver-advertise-address flag does not support dual-stack.
Join a node to dual-stack cluster
Before joining a node, make sure that the node has an IPv6-routable network interface and allows IPv6 forwarding.
Here is an example kubeadm configuration file kubeadm-config.yaml for joining a worker node to the cluster.
advertiseAddress in JoinConfiguration.controlPlane specifies the IP address that the API Server will advertise it is listening on. The value of advertiseAddress equals the --apiserver-advertise-address flag of kubeadm join.
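A sketch of such a worker join configuration (the endpoint, token, hash, and addresses are placeholders; node-ip carries one address per family):

```yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: JoinConfiguration
discovery:
  bootstrapToken:
    apiServerEndpoint: 10.100.0.1:6443
    token: "<your-bootstrap-token>"
    caCertHashes:
    - "sha256:<hash>"
nodeRegistration:
  kubeletExtraArgs:
    node-ip: 10.100.0.2,fd00:1:2:3::2
```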
kubeadm join --config=kubeadm-config.yaml ...
Create a single-stack cluster
Note: Enabling the dual-stack feature doesn't mean that you need to use dual-stack addressing.
You can deploy a single-stack cluster that has the dual-stack networking feature enabled.
In 1.21 the IPv6DualStack feature is Beta and the feature gate is defaulted to true. To disable the feature you must configure the feature gate to false. Note that once the feature is GA, the feature gate will be removed.
kubeadm init --feature-gates IPv6DualStack=false
To make things clearer, here is an example kubeadm configuration file kubeadm-config.yaml for the single-stack control plane node.
kops uses DNS for discovery, both inside the cluster and outside, so that you can reach the kubernetes API server
from clients.
kops has a strong opinion on the cluster name: it should be a valid DNS name. By doing so you will
no longer get your clusters confused, you can share clusters with your colleagues unambiguously,
and you can reach them without relying on remembering an IP address.
You can, and probably should, use subdomains to divide your clusters. As our example we will use
useast1.dev.example.com. The API server endpoint will then be api.useast1.dev.example.com.
A Route53 hosted zone can serve subdomains. Your hosted zone could be useast1.dev.example.com,
but also dev.example.com or even example.com. kops works with any of these, so typically
you choose for organization reasons (e.g. you are allowed to create records under dev.example.com,
but not under example.com).
Let's assume you're using dev.example.com as your hosted zone. You create that hosted zone using
the normal process, or
with a command such as aws route53 create-hosted-zone --name dev.example.com --caller-reference 1.
You must then set up your NS records in the parent domain, so that records in the domain will resolve. Here,
you would create NS records in example.com for dev. If it is a root domain name you would configure the NS
records at your domain registrar (e.g. example.com would need to be configured where you bought example.com).
Verify your Route53 domain setup (it is the #1 cause of problems!). You can double-check that
your cluster is configured correctly if you have the dig tool by running:
dig NS dev.example.com
You should see the 4 NS records that Route53 assigned your hosted zone.
(3/5) Create an S3 bucket to store your clusters state
kops lets you manage your clusters even after installation. To do this, it must keep track of the clusters
that you have created, along with their configuration, the keys they are using etc. This information is stored
in an S3 bucket. S3 permissions are used to control access to the bucket.
Multiple clusters can use the same S3 bucket, and you can share an S3 bucket between your colleagues that
administer the same clusters - this is much easier than passing around kubecfg files. But anyone with access
to the S3 bucket will have administrative access to all your clusters, so you don't want to share it beyond
the operations team.
So typically you have one S3 bucket for each ops team (and often the name will correspond
to the name of the hosted zone above!)
In our example, we chose dev.example.com as our hosted zone, so let's pick clusters.dev.example.com as
the S3 bucket name.
Export AWS_PROFILE (if you need to select a profile for the AWS CLI to work)
Create the S3 bucket using aws s3 mb s3://clusters.dev.example.com
You can export KOPS_STATE_STORE=s3://clusters.dev.example.com and then kops will use this location by default.
We suggest putting this in your bash profile or similar.
(4/5) Build your cluster configuration
Run kops create cluster to create your cluster configuration:
kops will create the configuration for your cluster. Note that it only creates the configuration; it does
not actually create the cloud resources - you'll do that in the next step with kops update cluster. This
gives you an opportunity to review the configuration or change it.
It prints commands you can use to explore further:
List your clusters with: kops get cluster
Edit this cluster with: kops edit cluster useast1.dev.example.com
Edit your node instance group: kops edit ig --name=useast1.dev.example.com nodes
Edit your master instance group: kops edit ig --name=useast1.dev.example.com master-us-east-1c
If this is your first time using kops, do spend a few minutes to try those out! An instance group is a
set of instances, which will be registered as kubernetes nodes. On AWS this is implemented via auto-scaling-groups.
You can have several instance groups, for example if you wanted nodes that are a mix of spot and on-demand instances, or
GPU and non-GPU instances.
(5/5) Create the cluster in AWS
Run "kops update cluster" to create your cluster in AWS:
kops update cluster useast1.dev.example.com --yes
That takes a few seconds to run, but then your cluster will likely take a few minutes to actually be ready.
kops update cluster will be the tool you'll use whenever you change the configuration of your cluster; it
applies the changes you have made to the configuration to your cluster - reconfiguring AWS or kubernetes as needed.
For example, after you kops edit ig nodes, then kops update cluster --yes to apply your configuration, and
sometimes you will also have to kops rolling-update cluster to roll out the configuration immediately.
Without --yes, kops update cluster will show you a preview of what it is going to do. This is handy
for production clusters!
Explore other add-ons
See the list of add-ons to explore other add-ons, including tools for logging, monitoring, network policy, visualization, and control of your Kubernetes cluster.
Cleanup
To delete your cluster: kops delete cluster useast1.dev.example.com --yes
Contribute to kops by addressing or raising an issue GitHub Issues
2.3 - Installing Kubernetes with Kubespray
This quickstart helps to install a Kubernetes cluster hosted on GCE, Azure, OpenStack, AWS, vSphere, Packet (bare metal), Oracle Cloud Infrastructure (Experimental) or Baremetal with Kubespray.
Kubespray is a composition of Ansible playbooks, inventory, provisioning tools, and domain knowledge for generic OS/Kubernetes cluster configuration management tasks.
To use Kubespray, provision servers that meet the following requirements:
Ansible v2.9 and python-netaddr are installed on the machine that will run Ansible commands
Jinja 2.11 (or newer) is required to run the Ansible playbooks
The target servers must have access to the Internet in order to pull Docker images. Otherwise, additional configuration is required (see Offline Environment)
The target servers are configured to allow IPv4 forwarding
Your SSH key must be copied to all the servers that are part of your inventory
Firewalls are not managed by Kubespray; you'll need to implement your own rules as you usually would. To avoid any issues during deployment, you should disable your firewall
If Kubespray is run from a non-root user account, a correct privilege escalation method should be configured on the target servers, and the ansible_become flag or the command parameters --become or -b should be specified
Kubespray provides the following utilities to help provision your environment:
Terraform scripts for the following cloud providers:
Kubespray customizations can be made through a variables file. If you are getting started with Kubespray, consider using the Kubespray defaults to deploy your cluster and explore Kubernetes.
Large deployments (100+ nodes) may require specific adjustments for best results.
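A hedged sketch of a typical Kubespray run, assuming the upstream repository layout (the paths, example IPs, and the inventory-builder script location are illustrative and may differ between releases):

```shell
# Illustrative Kubespray deployment flow; verify paths against the
# Kubespray release you are using.
git clone https://github.com/kubernetes-sigs/kubespray.git
cd kubespray
pip install -r requirements.txt              # Ansible, python-netaddr, Jinja
cp -rfp inventory/sample inventory/mycluster # start from the sample inventory

# Generate hosts.yaml from your server IPs (inventory builder ships in contrib/):
declare -a IPS=(10.10.1.3 10.10.1.4 10.10.1.5)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

# Review variables in inventory/mycluster/group_vars/, then deploy:
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
```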
(5/5) Verify the deployment
Kubespray provides a way to verify inter-pod connectivity and DNS resolution with Netchecker. Netchecker ensures the netchecker-agents pods can resolve DNS requests and ping each other within the default namespace. Those pods mimic the behavior of the rest of the workloads and serve as cluster health indicators.
Cluster operations
Kubespray provides additional playbooks to manage your cluster: scale and upgrade.
Scale your cluster
You can add worker nodes to your cluster by running the scale playbook. For more information, see "Adding nodes".
You can remove worker nodes from your cluster by running the remove-node playbook. For more information, see "Remove nodes".
Upgrade your cluster
You can upgrade your cluster by running the upgrade-cluster playbook. For more information, see "Upgrades".
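The scale and upgrade operations above are ordinary ansible-playbook invocations; a hedged sketch (the inventory path, node name, and Kubernetes version are illustrative):

```shell
# Add worker nodes (add them to the inventory first):
ansible-playbook -i inventory/mycluster/hosts.yaml --become scale.yml

# Remove a worker node:
ansible-playbook -i inventory/mycluster/hosts.yaml --become remove-node.yml -e node=worker-3

# Upgrade the cluster, pinning the target version explicitly:
ansible-playbook -i inventory/mycluster/hosts.yaml --become upgrade-cluster.yml -e kube_version=v1.21.0
```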
Cleanup
You can reset your nodes and wipe out all components installed with Kubespray via the reset playbook.
Caution: When running the reset playbook, be sure not to accidentally target your production cluster!
Feedback
Slack Channel: #kubespray (You can get your invite here)
This page provides a list of Kubernetes certified solution providers. From each
provider page, you can learn how to install and set up production-ready
clusters.
4 - Windows in Kubernetes
4.1 - Intro to Windows support in Kubernetes
Windows applications constitute a large portion of the services and
applications that run in many organizations.
Windows containers provide a modern way to
encapsulate processes and package dependencies, making it easier to use DevOps
practices and follow cloud native patterns for Windows applications.
Kubernetes has become the de facto standard container orchestrator, and the
release of Kubernetes 1.14 includes production support for scheduling Windows
containers on Windows nodes in a Kubernetes cluster, enabling a vast ecosystem
of Windows applications to leverage the power of Kubernetes. Organizations
with investments in Windows-based applications and Linux-based applications
don't have to look for separate orchestrators to manage their workloads,
leading to increased operational efficiencies across their deployments,
regardless of operating system.
Windows containers in Kubernetes
To enable the orchestration of Windows containers in Kubernetes, include
Windows nodes in your existing Linux cluster. Scheduling Windows containers in
Pods on Kubernetes is similar to
scheduling Linux-based containers.
In order to run Windows containers, your Kubernetes cluster must include
multiple operating systems, with control plane nodes running Linux and workers
running either Windows or Linux depending on your workload needs. Windows
Server 2019 is the only Windows operating system supported, enabling
Kubernetes Node
on Windows (including kubelet,
container runtime,
and kube-proxy). For a detailed explanation of Windows distribution channels
see the Microsoft documentation.
The Kubernetes control plane, including the
master components,
continues to run on Linux.
There are no plans to have a Windows-only Kubernetes cluster.
In this document, when we talk about Windows containers we mean Windows
containers with process isolation. Support for Windows containers with
Hyper-V isolation
is planned for a future release.
Supported Functionality and Limitations
Supported Functionality
Windows OS Version Support
Refer to the following table for Windows operating system support in
Kubernetes. A single heterogeneous Kubernetes cluster can have both Windows
and Linux worker nodes. Windows containers have to be scheduled on Windows
nodes and Linux containers on Linux nodes.
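Because Windows containers must land on Windows nodes, workloads are typically pinned using a nodeSelector on the well-known kubernetes.io/os node label. A minimal sketch (the Deployment name and image are illustrative):

```shell
# Write a manifest pinning a workload to Windows nodes via the
# well-known kubernetes.io/os label (names and image are illustrative).
cat > win-webserver.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: win-webserver
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: windowswebserver
        image: mcr.microsoft.com/windows/servercore:ltsc2019
EOF
# Apply with: kubectl apply -f win-webserver.yaml
grep 'kubernetes.io/os' win-webserver.yaml
```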
Kubernetes version | Windows Server LTSC releases | Windows Server SAC releases
Kubernetes v1.19   | Windows Server 2019          | Windows Server ver 1909, Windows Server ver 2004
Kubernetes v1.20   | Windows Server 2019          | Windows Server ver 1909, Windows Server ver 2004
Kubernetes v1.21   | Windows Server 2019          | Windows Server ver 2004, Windows Server ver 20H2
Information on the different Windows Server servicing channels including their
support models can be found at
Windows Server servicing channels.
We don't expect all Windows customers to update the operating system for their
apps frequently. Upgrading your applications is what dictates and necessitates
upgrading or introducing new nodes to the cluster. For the customers that
choose to upgrade their operating system for containers running on Kubernetes,
we will offer guidance and step-by-step instructions when we add support for a
new operating system version. This guidance will include recommended upgrade
procedures for upgrading user applications together with cluster nodes.
Windows nodes adhere to Kubernetes
version-skew policy (node to control plane
versioning) the same way as Linux nodes do today.
Microsoft maintains a Windows pause infrastructure container at
mcr.microsoft.com/oss/kubernetes/pause:3.4.1.
Compute
From an API and kubectl perspective, Windows containers behave in much the
same way as Linux-based containers. However, there are some notable
differences in key functionality which are outlined in the
limitation section.
Key Kubernetes elements work the same way in Windows as they do in Linux. In
this section, we talk about some of the key workload enablers and how they map
to Windows.
A Pod is the basic building block of Kubernetes, the smallest and simplest
unit in the Kubernetes object model that you create or deploy. You may not
deploy Windows and Linux containers in the same Pod. All containers in a Pod
are scheduled onto a single Node where each Node represents a specific
platform and architecture. The following Pod capabilities, properties and
events are supported with Windows containers:
Single or multiple containers per Pod with process isolation and volume sharing
Pod status fields
Readiness and Liveness probes
postStart & preStop container lifecycle events
ConfigMap, Secrets: as environment variables or volumes
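As a sketch of the last item, a ConfigMap value surfaced as a container environment variable (the ConfigMap name, key, and image are illustrative):

```shell
# Container env sourced from a ConfigMap key, as listed above
# (ConfigMap and key names are illustrative).
cat > pod-env.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: env-demo
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: app
    image: mcr.microsoft.com/windows/servercore:ltsc2019
    env:
    - name: LOG_LEVEL
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: log.level
EOF
grep 'configMapKeyRef' pod-env.yaml
```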
A Kubernetes Service is an abstraction which defines a logical set of Pods
and a policy by which to access them - sometimes called a micro-service. You
can use services for cross-operating system connectivity. In Windows, services
can utilize the following types, properties and capabilities:
Service Environment variables
NodePort
ClusterIP
LoadBalancer
ExternalName
Headless services
Pods, Controllers and Services are critical elements to managing Windows
workloads on Kubernetes. However, on their own they are not enough to enable
the proper lifecycle management of Windows workloads in a dynamic cloud native
environment. We added support for the following features:
Pod and container metrics
Horizontal Pod Autoscaler support
kubectl Exec
Resource Quotas
Scheduler preemption
Container Runtime
Docker EE
FEATURE STATE: Kubernetes v1.14 [stable]
Docker EE-basic 19.03+ is the recommended container runtime for all Windows
Server versions. This works with the dockershim code included in the kubelet.
CRI-ContainerD
FEATURE STATE: Kubernetes v1.20 [stable]
ContainerD 1.4.0+ can
also be used as the container runtime for Windows Kubernetes nodes.
Kubernetes volumes enable complex
applications, with data persistence and Pod volume sharing requirements, to be
deployed on Kubernetes. Management of persistent volumes associated with a
specific storage back-end or protocol includes actions such as:
provisioning/de-provisioning/resizing of volumes, attaching/detaching a volume
to/from a Kubernetes node and mounting/dismounting a volume to/from individual
containers in a pod that needs to persist data. The code implementing these
volume management actions for a specific storage back-end or protocol is
shipped in the form of a Kubernetes volume
plugin. The following
broad classes of Kubernetes volume plugins are supported on Windows:
In-tree Volume Plugins
Code associated with in-tree volume plugins ships as part of the core
Kubernetes code base. Deployment of in-tree volume plugins does not require
installation of additional scripts or deployment of separate containerized
plugin components. These plugins can handle: provisioning/de-provisioning and
resizing of volumes in the storage backend, attaching/detaching of volumes
to/from a Kubernetes node and mounting/dismounting a volume to/from individual
containers in a pod. The following in-tree plugins support Windows nodes:
FlexVolume Plugins
Code associated with FlexVolume
plugins ships as out-of-tree scripts or binaries that need to be deployed
directly on the host. FlexVolume plugins handle attaching/detaching of volumes
to/from a Kubernetes node and mounting/dismounting a volume to/from individual
containers in a pod. Provisioning/De-provisioning of persistent volumes
associated with FlexVolume plugins may be handled through an external
provisioner that is typically separate from the FlexVolume plugins. The
following FlexVolume
plugins,
deployed as powershell scripts on the host, support Windows nodes:
CSI Plugins
Code associated with CSI plugins
ships as out-of-tree scripts and binaries that are typically distributed as
container images and deployed using standard Kubernetes constructs like
DaemonSets and StatefulSets. CSI plugins handle a wide range of volume
management actions in Kubernetes: provisioning/de-provisioning/resizing of
volumes, attaching/detaching of volumes to/from a Kubernetes node and
mounting/dismounting a volume to/from individual containers in a pod,
backup/restore of persistent data using snapshots and cloning. CSI plugins
typically consist of node plugins (that run on each node as a DaemonSet) and
controller plugins.
CSI node plugins (especially those associated with persistent volumes exposed
as either block devices or over a shared file-system) need to perform various
privileged operations like scanning of disk devices, mounting of file systems,
etc. These operations differ for each host operating system. For Linux worker
nodes, containerized CSI node plugins are typically deployed as privileged
containers. For Windows worker nodes, privileged operations for containerized
CSI node plugins are supported using
csi-proxy, a community-managed,
stand-alone binary that needs to be pre-installed on each Windows node. Please
refer to the deployment guide of the CSI plugin you wish to deploy for further
details.
Networking
Networking for Windows containers is exposed through
CNI plugins.
Windows containers function similarly to virtual machines with regard to
networking. Each container has a virtual network adapter (vNIC) which is
connected to a Hyper-V virtual switch (vSwitch). The Host Networking Service
(HNS) and the Host Compute Service (HCS) work together to create containers
and attach container vNICs to networks. HCS is responsible for the management
of containers whereas HNS is responsible for the management of networking
resources such as:
Virtual networks (including creation of vSwitches)
Windows supports five different networking drivers/modes: L2bridge, L2tunnel,
Overlay, Transparent, and NAT. In a heterogeneous cluster with Windows and
Linux worker nodes, you need to select a networking solution that is
compatible on both Windows and Linux. The following out-of-tree plugins are
supported on Windows, with recommendations on when to use each CNI:
Network Driver
Description
Container Packet Modifications
Network Plugins
Network Plugin Characteristics
L2bridge
Containers are attached to an external vSwitch. Containers are attached
to the underlay network, although the physical network doesn't need to learn
the container MACs because they are rewritten on ingress/egress.
MAC is rewritten to host MAC, IP may be rewritten to host IP using HNS
OutboundNAT policy.
win-bridge uses L2bridge network mode,
connects containers to the underlay of hosts, offering best performance.
Requires user-defined routes (UDR) for inter-node connectivity.
L2Tunnel
This is a special case of l2bridge, but only used on Azure. All packets
are sent to the virtualization host where SDN policy is applied.
Azure-CNI allows integration of containers with Azure vNET, and allows them
to leverage the set of capabilities that
Azure Virtual Network
provides. For example, securely connect to Azure services or use Azure NSGs.
See azure-cni
for some examples.
Overlay (Overlay networking for Windows in Kubernetes is in Alpha stage)
Containers are given a vNIC connected to an external vSwitch. Each overlay
network gets its own IP subnet, defined by a custom IP prefix. The overlay
network driver uses VXLAN encapsulation.
win-overlay should be used when virtual container networks are desired to
be isolated from underlay of hosts (e.g. for security reasons). Allows for IPs
to be re-used for different overlay networks (which have different VNID tags)
if you are restricted on IPs in your datacenter. This option requires
KB4489899 on Windows Server
2019.
Transparent (special use case for ovn-kubernetes)
Requires an external vSwitch. Containers are attached to an external
vSwitch which enables intra-pod communication via logical networks (logical
switches and routers).
Packets are encapsulated via either GENEVE or STT tunneling to reach
pods which are not on the same host. Packets are forwarded or dropped
via the tunnel metadata information supplied by the ovn network controller.
NAT is done for north-south communication.
Deploy via Ansible.
Distributed ACLs can be applied via Kubernetes policies. IPAM support.
Load-balancing can be achieved without kube-proxy. NATing is done without
using iptables/netsh.
NAT (not used in Kubernetes)
Containers are given a vNIC connected to an internal vSwitch. DNS/DHCP is
provided using an internal component called
WinNAT.
As outlined above, the Flannel CNI
meta plugin
is also supported on
Windows
via the VXLAN network backend
(alpha support; delegates to win-overlay) and
host-gateway network backend
(stable support; delegates to win-bridge). This plugin supports delegating to
one of the reference CNI plugins (win-overlay, win-bridge), to work in
conjunction with Flannel daemon on Windows (Flanneld) for automatic node
subnet lease assignment and HNS network creation. This plugin reads in its own
configuration file (cni.conf), and aggregates it with the environment
variables from the FlannelD generated subnet.env file. It then delegates to
one of the reference CNI plugins for network plumbing, and sends the correct
configuration containing the node-assigned subnet to the IPAM plugin (e.g.
host-local).
For the node, pod, and service objects, the following network flows are
supported for TCP/UDP traffic:
Pod -> Pod (IP)
Pod -> Pod (Name)
Pod -> Service (Cluster IP)
Pod -> Service (PQDN, but only if there are no ".")
Pod -> Service (FQDN)
Pod -> External (IP)
Pod -> External (DNS)
Node -> Pod
Pod -> Node
IP address management (IPAM)
The following IPAM options are supported on Windows:
Direct Server Return (DSR)
Description: Load balancing mode where the IP address fixups and the LBNAT
occur at the container vSwitch port directly; service traffic arrives with the
source IP set as the originating pod IP.
Minimum supported Kubernetes version: v1.20
Minimum supported Windows OS build: Windows Server 2019
How to enable: Set the following flags in kube-proxy:
--feature-gates="WinDSR=true" --enable-dsr=true
Preserve-Destination
Description: Skips DNAT of service traffic, thereby preserving the virtual IP
of the target service in packets reaching the backend Pod. Also disables
node-node forwarding.
Minimum supported Kubernetes version: v1.20
Minimum supported Windows OS build: Windows Server, version 1903 (or higher)
How to enable: Set "preserve-destination": "true" in service annotations
and enable DSR in kube-proxy.
IPv4/IPv6 dual-stack networking
Native IPv4-to-IPv4 in parallel with IPv6-to-IPv6 communications to, from,
and within a cluster
On Windows, using IPv6 with Kubernetes requires Windows Server, version 2004
(kernel version 10.0.19041.610) or later.
Overlay (VXLAN) networks on Windows do not support dual-stack networking today.
Limitations
Windows is only supported as a worker node in the Kubernetes architecture and
component matrix. This means that a Kubernetes cluster must always include
Linux master nodes, zero or more Linux worker nodes, and zero or more Windows
worker nodes.
Resource Handling
Linux cgroups are used as a pod boundary for resource controls in Linux.
Containers are created within that boundary for network, process and file
system isolation. The cgroups APIs can be used to gather cpu/io/memory stats.
In contrast, Windows uses a Job object per container with a system namespace
filter to contain all processes in a container and provide logical isolation
from the host. There is no way to run a Windows container without the
namespace filtering in place. This means that system privileges cannot be
asserted in the context of the host, and thus privileged containers are not
available on Windows. Containers cannot assume an identity from the host
because the Security Account Manager (SAM) is separate.
Resource Reservations
Memory Reservations
Windows does not have an out-of-memory process killer as Linux does. Windows
always treats all user-mode memory allocations as virtual, and pagefiles are
mandatory. The net effect is that Windows won't reach out-of-memory conditions
the same way Linux does, and processes page to disk instead of being subject
to out of memory (OOM) termination. If memory is over-provisioned and all
physical memory is exhausted, then paging can slow down performance.
Keeping memory usage within reasonable bounds is possible using the kubelet
parameters --kube-reserved and/or --system-reserved to account for memory
usage on the node (outside of containers). This reduces
NodeAllocatable.
As you deploy workloads, set resource limits on containers (either set only
limits, or make limits equal to requests). This also subtracts from NodeAllocatable
and prevents the scheduler from adding more pods once a node is full.
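Following that guidance, a container resources stanza where limits equal requests (the values are illustrative):

```shell
# Resources fragment where limits equal requests, per the guidance above.
cat > resources-fragment.yaml <<'EOF'
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 512Mi
EOF
grep -c '500m' resources-fragment.yaml
```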
A best practice to avoid over-provisioning is to configure the kubelet with a
system reserved memory of at least 2GB to account for Windows, Docker, and
Kubernetes processes.
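The arithmetic behind this reservation advice can be sketched with illustrative numbers (all values below are made up for the example; only the subtraction itself reflects how NodeAllocatable is derived):

```shell
# Illustrative NodeAllocatable memory calculation (values in MiB).
capacity_mib=8192          # node physical memory (illustrative)
system_reserved_mib=2048   # >= 2 GB for Windows, Docker, Kubernetes processes
kube_reserved_mib=512      # kubelet and other Kubernetes daemons (illustrative)
allocatable_mib=$(( capacity_mib - system_reserved_mib - kube_reserved_mib ))
echo "$allocatable_mib"    # memory the scheduler may allocate to pods
```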
CPU Reservations
To account for Windows, Docker and other Kubernetes host processes it is
recommended to reserve a percentage of CPU so they are able to respond to
events. This value needs to be scaled based on the number of CPU cores
available on the Windows node. To determine this percentage a user should
identify the maximum pod density for each of their nodes and monitor the CPU
usage of the system services choosing a value that meets their workload needs.
Keeping CPU usage within reasonable bounds is possible using the kubelet
parameters --kube-reserved and/or --system-reserved to account for CPU
usage on the node (outside of containers). This reduces
NodeAllocatable.
Feature Restrictions
TerminationGracePeriod: not implemented
Single file mapping: to be implemented with CRI-ContainerD
Termination message: to be implemented with CRI-ContainerD
Privileged Containers: not currently supported in Windows containers
HugePages: not currently supported in Windows containers
The existing node problem detector is Linux-only and requires privileged
containers. In general, we don't expect this to be used on Windows because
privileged containers are not supported
Not all features of shared namespaces are supported (see API section for
more details)
Difference in behavior of flags when compared to Linux
The behavior of the following kubelet flags is different on Windows nodes as described below:
--kube-reserved, --system-reserved, and --eviction-hard flags update
Node Allocatable
Eviction by using --enforce-node-allocatable is not implemented.
Eviction by using --eviction-hard and --eviction-soft is not implemented.
MemoryPressure Condition is not implemented.
There are no OOM eviction actions taken by the kubelet.
The kubelet running on the Windows node does not have memory restrictions.
--kube-reserved and --system-reserved do not set limits on the kubelet or
processes running on the host. This means the kubelet or a process on the host
could cause memory resource starvation outside of node-allocatable, which the
scheduler cannot account for
An additional flag to set the priority of the kubelet process is available
on the Windows nodes called --windows-priorityclass. This flag allows the
kubelet process to get more CPU time slices when compared to other processes
running on the Windows host. More information on the allowable values and
their meaning is available at
Windows Priority Classes.
In order for the kubelet to always have enough CPU cycles, it is recommended to set
this flag to ABOVE_NORMAL_PRIORITY_CLASS or above.
Storage
Windows has a layered filesystem driver to mount container layers and create a
copy filesystem based on NTFS. All file paths in the container are resolved
only within the context of that container.
With Docker, volume mounts can only target a directory in the container, and
not an individual file. This limitation does not exist with CRI-containerD.
Volume mounts cannot project files or directories back to the host
filesystem
Read-only filesystems are not supported because write access is always
required for the Windows registry and SAM database. However, read-only
volumes are supported
Volume user-masks and permissions are not available. Because the SAM is not
shared between the host & container, there's no mapping between them. All
permissions are resolved within the context of the container
As a result, the following storage functionality is not supported on Windows nodes:
Volume subpath mounts. Only the entire volume can be mounted in a Windows container.
Subpath volume mounting for Secrets
Host mount projection
DefaultMode (due to UID/GID dependency)
Read-only root filesystem. Mapped volumes still support readOnly
Block device mapping
Memory as the storage medium
File system features like UID/GID, per-user Linux filesystem permissions
The Windows host networking service and virtual switch implement namespacing
and can create virtual NICs as needed for a pod or container. However, many
configurations such as DNS, routes, and metrics are stored in the Windows
registry database rather than /etc/... files as they are on Linux. The Windows
registry for the container is separate from that of the host, so concepts like
mapping /etc/resolv.conf from the host into a container don't have the same
effect they would on Linux. These must be configured using Windows APIs run in
the context of that container. Therefore CNI implementations need to call the
HNS instead of relying on file mappings to pass network details into the pod
or container.
The following networking functionality is not supported on Windows nodes:
Host networking mode is not available for Windows pods.
Local NodePort access from the node itself fails (works for other nodes or
external clients).
Accessing service VIPs from nodes will be available with a future release of
Windows Server.
A single service can only support up to 64 backend pods / unique destination IPs.
Overlay networking support in kube-proxy is a beta feature. In addition, it
requires KB4482887
to be installed on Windows Server 2019.
Local Traffic Policy is not supported in non-DSR mode.
Windows containers connected to overlay networks do not support
communicating over the IPv6 stack. There is outstanding Windows platform
work required to enable this network driver to consume IPv6 addresses and
subsequent Kubernetes work in kubelet, kube-proxy, and CNI plugins.
Outbound communication using the ICMP protocol via the win-overlay,
win-bridge, and Azure-CNI plugin. Specifically, the Windows data plane
(VFP)
doesn't support ICMP packet transpositions. This means:
ICMP packets directed to destinations within the same network (e.g. pod to
pod communication via ping) work as expected and without any limitations
TCP/UDP packets work as expected and without any limitations
ICMP packets directed to pass through a remote network (e.g. pod to
external internet communication via ping) cannot be transposed and thus
will not be routed back to their source
Since TCP/UDP packets can still be transposed, one can substitute
ping <destination> with curl <destination> to be able to debug connectivity
to the outside world.
These features were added in Kubernetes v1.15:
kubectl port-forward
CNI Plugins
Windows reference network plugins win-bridge and win-overlay do not
currently implement CNI spec
v0.4.0 due to missing "CHECK" implementation.
The Flannel VXLAN CNI has the following limitations on Windows:
Node-pod connectivity isn't possible by design. It's only possible for
local pods with Flannel v0.12.0 (or higher).
We are restricted to using VNI 4096 and UDP port 4789. The VNI limitation
is being worked on and will be overcome in a future release (open-source
flannel changes). See the official
Flannel VXLAN
backend docs for more details on these parameters.
DNS
ClusterFirstWithHostNet is not supported for DNS. Windows treats all names
with a '.' as a FQDN and skips PQDN resolution
On Linux, you have a DNS suffix list, which is used when trying to resolve
PQDNs. On Windows, we only have one DNS suffix, which is the DNS suffix
associated with that pod's namespace (mydns.svc.cluster.local for example).
Windows can resolve FQDNs and services or names resolvable with only that
suffix. For example, a pod spawned in the default namespace will have the DNS
suffix default.svc.cluster.local. On a Windows pod, you can resolve both
kubernetes.default.svc.cluster.local and kubernetes, but not the
in-betweens, like kubernetes.default or kubernetes.default.svc.
On Windows, there are multiple DNS resolvers that can be used. As these come
with slightly different behaviors, using the Resolve-DNSName utility for
name query resolutions is recommended.
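The resolution rule described above (any name containing a "." is treated as an FQDN; only a single-label name gets the namespace's one DNS suffix appended) can be sketched as a small helper; this models name construction only, not actual DNS lookups:

```shell
# Models the Windows pod DNS rule from the text: a name with any "." is
# treated as an FQDN and resolved as-is; a single-label name gets the pod
# namespace's single DNS suffix appended.
resolve_candidate() {
  local name=$1 suffix=$2
  case $name in
    *.*) echo "$name" ;;          # contains ".": FQDN, no suffix search
    *)   echo "$name.$suffix" ;;  # single label: append the one suffix
  esac
}
resolve_candidate kubernetes default.svc.cluster.local
resolve_candidate kubernetes.default.svc.cluster.local default.svc.cluster.local
```

Note that this is why "kubernetes.default" fails on a Windows pod: the embedded "." makes it an FQDN, so the suffix is never appended.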
IPv6
Kubernetes on Windows does not support single-stack "IPv6-only" networking.
However, dual-stack IPv4/IPv6 networking for pods and nodes with single-family
services is supported.
See IPv4/IPv6 dual-stack networking for more details.
Session affinity
Setting the maximum session sticky time for Windows services using
service.spec.sessionAffinityConfig.clientIP.timeoutSeconds is not supported.
Security
Secrets are written in clear text on the node's volume (as compared to
tmpfs/in-memory on linux). This means customers have to do two things:
RunAsUsername
can be specified for Windows Pods or Containers to execute the container
processes as a node-default user. This is roughly equivalent to
RunAsUser.
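A sketch of setting the Windows user for a pod via the securityContext windowsOptions field (the username shown is one of the built-in Windows container accounts; treat the manifest as illustrative):

```shell
# Pod-level runAsUserName for Windows containers (illustrative manifest).
cat > run-as-username.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: run-as-username-demo
spec:
  securityContext:
    windowsOptions:
      runAsUserName: "ContainerUser"
  containers:
  - name: app
    image: mcr.microsoft.com/windows/servercore:ltsc2019
  nodeSelector:
    kubernetes.io/os: windows
EOF
grep 'runAsUserName' run-as-username.yaml
```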
Linux specific pod security context privileges such as SELinux, AppArmor,
Seccomp, Capabilities (POSIX Capabilities), and others are not supported.
In addition, as mentioned already, privileged containers are not supported on
Windows.
API
There are no differences in how most of the Kubernetes APIs work for Windows.
The subtleties around what's different come down to differences in the OS and
container runtime. In certain situations, some properties on workload APIs
such as Pod or Container were designed with an assumption that they are
implemented on Linux, and fail to run on Windows.
At a high level, these OS concepts are different:
Identity - Linux uses userID (UID) and groupID (GID) which are represented
as integer types. User and group names are not canonical - they are an alias
in /etc/group or /etc/passwd back to UID+GID. Windows uses a larger
binary security identifier (SID) which is stored in the Windows Security
Account Manager (SAM) database. This database is not shared between the host
and containers, or between containers.
File permissions - Windows uses an access control list based on SIDs, rather
than a bitmask of permissions and UID+GID
File paths - convention on Windows is to use \ instead of /. The Go IO
libraries accept both types of file path separators. However, when you're
setting a path or command line that's interpreted inside a container, \ may
be needed.
Signals - Windows interactive apps handle termination differently, and can
implement one or more of these:
A UI thread handles well-defined messages including WM_CLOSE
Console apps handle ctrl-c or ctrl-break using a Control Handler
Services register a Service Control Handler function that can accept
SERVICE_CONTROL_STOP control codes
Exit Codes follow the same convention where 0 is success, nonzero is failure.
The specific error codes may differ across Windows and Linux. However, exit
codes passed from the Kubernetes components (kubelet, kube-proxy) are
unchanged.
V1.Container
V1.Container.ResourceRequirements.limits.cpu and
V1.Container.ResourceRequirements.limits.memory - Windows doesn't use hard
limits for CPU allocations. Instead, a share system is used. The existing
fields based on millicores are scaled into relative shares that are followed
by the Windows scheduler.
See kuberuntime/helpers_windows.go,
and resource controls in Microsoft docs
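The millicore-to-shares scaling can be sketched as below. The 1024/1000 ratio mirrors the Linux cgroup shares convention that the millicore fields are defined against; the exact Windows-side mapping lives in kuberuntime/helpers_windows.go and this is only an illustrative model of it:

```shell
# Illustrative model: scale a millicore limit into relative CPU shares
# (1000 millicores -> 1024 shares, the Linux cgroup convention).
milli_cpu_to_shares() {
  local milli=$1
  echo $(( milli * 1024 / 1000 ))
}
milli_cpu_to_shares 500    # half a core
milli_cpu_to_shares 2000   # two cores
```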
Huge pages are not implemented in the Windows container runtime, and are
not available. They require
asserting a user privilege
that's not configurable for containers.
V1.Container.ResourceRequirements.requests.cpu and
V1.Container.ResourceRequirements.requests.memory - Requests are subtracted
from node available resources, so they can be used to avoid overprovisioning a
node. However, they cannot be used to guarantee resources in an
overprovisioned node. They should be applied to all containers as a best
practice if the operator wants to avoid overprovisioning entirely.
V1.Container.SecurityContext.allowPrivilegeEscalation - not possible on
Windows, none of the capabilities are hooked up
V1.Container.SecurityContext.Capabilities - POSIX capabilities are not
implemented on Windows
V1.Container.SecurityContext.privileged - Windows doesn't support privileged
containers
V1.Container.SecurityContext.procMount - Windows doesn't have a /proc filesystem
V1.Container.SecurityContext.readOnlyRootFilesystem - not possible on
Windows, write access is required for registry & system processes to run
inside the container
V1.Container.SecurityContext.runAsGroup - not possible on Windows, no GID support
V1.Container.SecurityContext.runAsNonRoot - Windows does not have a root
user. The closest equivalent is ContainerAdministrator which is an identity
that doesn't exist on the node.
V1.Container.SecurityContext.runAsUser - not possible on Windows, no UID
support as int.
V1.Container.SecurityContext.seLinuxOptions - not possible on Windows, no SELinux
V1.Container.terminationMessagePath - this has some limitations in that
Windows doesn't support mapping single files. The default value is
/dev/termination-log, which does work because it does not exist on Windows by
default.
V1.Pod
V1.Pod.hostIPC, V1.Pod.hostPID - host namespace sharing is not possible on Windows
V1.Pod.hostNetwork - There is no Windows OS support to share the host network
V1.Pod.dnsPolicy - ClusterFirstWithHostNet is not supported because Host
Networking is not supported on Windows.
V1.Pod.podSecurityContext - see V1.PodSecurityContext below
V1.Pod.shareProcessNamespace - this is a beta feature, and depends on Linux
namespaces which are not implemented on Windows. Windows cannot share
process namespaces or the container's root filesystem. Only the network can be
shared.
V1.Pod.terminationGracePeriodSeconds - this is not fully implemented in
Docker on Windows, see:
reference. The behavior today is
that the ENTRYPOINT process is sent CTRL_SHUTDOWN_EVENT, then Windows waits 5
seconds by default, and finally shuts down all processes using the normal
Windows shutdown behavior. The 5 second default is actually in the Windows
registry inside the container,
so it can be overridden when the container is built.
V1.Pod.volumeDevices - this is a beta feature, and is not implemented on
Windows. Windows cannot attach raw block devices to pods.
V1.Pod.volumes - EmptyDir, Secret, ConfigMap, HostPath - all work and have
tests in TestGrid
V1.emptyDirVolumeSource - the Node default medium is disk on Windows.
Memory is not supported, as Windows does not have a built-in RAM disk.
V1.VolumeMount.mountPropagation - mount propagation is not supported on Windows.
V1.PodSecurityContext
None of the PodSecurityContext fields work on Windows. They're listed here for
reference.
V1.PodSecurityContext.SELinuxOptions - SELinux is not available on Windows
V1.PodSecurityContext.RunAsUser - provides a UID, not available on Windows
V1.PodSecurityContext.RunAsGroup - provides a GID, not available on Windows
V1.PodSecurityContext.RunAsNonRoot - Windows does not have a root user. The
closest equivalent is ContainerAdministrator which is an identity that
doesn't exist on the node.
V1.PodSecurityContext.SupplementalGroups - provides GID, not available on Windows
V1.PodSecurityContext.Sysctls - these are part of the Linux sysctl
interface. There's no equivalent on Windows.
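Taken together, these limitations mean a Pod spec targeting Windows should simply omit the Linux-only securityContext fields and rely on requests for capacity planning. A minimal sketch (the image name and resource values are illustrative, not prescriptive):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: win-sample
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: sample
    # Illustrative image; the tag must match your node's host OS version
    image: mcr.microsoft.com/windows/servercore:ltsc2019
    resources:
      requests:
        memory: 128Mi  # requests work on Windows and help avoid overprovisioning
        cpu: 100m
    # No securityContext: runAsUser, capabilities, seLinuxOptions,
    # readOnlyRootFilesystem, etc. are Linux-only and have no effect or fail on Windows
```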
Operating System Version Restrictions
Windows has strict compatibility rules, where the host OS version must match
the container base image OS version. Only Windows containers with a container
operating system of Windows Server 2019 are supported. Hyper-V isolation of
containers, enabling some backward compatibility of Windows container image
versions, is planned for a future release.
Getting Help and Troubleshooting
Your main source of help for troubleshooting your Kubernetes cluster should
start with this
section. Some
additional, Windows-specific troubleshooting help is included in this section.
Logs are an important element of troubleshooting issues in Kubernetes. Make
sure to include them any time you seek troubleshooting assistance from other
contributors. Follow the instructions in the SIG-Windows
contributing guide on gathering logs.
How do I know start.ps1 completed successfully?
You should see kubelet, kube-proxy, and (if you chose Flannel as your
networking solution) flanneld host-agent processes running on your node, with
running logs being displayed in separate PowerShell windows. In addition to
this, your Windows node should be listed as "Ready" in your Kubernetes
cluster.
Can I configure the Kubernetes node processes to run in the background as services?
Kubelet and kube-proxy are already configured to run as native Windows
Services, offering resiliency by restarting the services automatically in the
event of failure (for example, a process crash). You have two options for
configuring these node components as services.
As native Windows Services
Kubelet & kube-proxy can be run as native Windows Services using sc.exe.
# Create the services for kubelet and kube-proxy in two separate commands
sc.exe create <component_name> binPath= "<path_to_binary> --service <other_args>"

# Please note that if the arguments contain spaces, they must be escaped.
sc.exe create kubelet binPath= "C:\kubelet.exe --service --hostname-override 'minion' <other_args>"

# Start the services
Start-Service kubelet
Start-Service kube-proxy

# Stop the services
Stop-Service kubelet (-Force)
Stop-Service kube-proxy (-Force)

# Query the service status
Get-Service kubelet
Get-Service kube-proxy
Using nssm.exe
You can also use an alternative service manager such as
nssm.exe to run these processes (flanneld, kubelet, and
kube-proxy) in the background for you. You can use this
sample script,
leveraging nssm.exe to register kubelet, kube-proxy, and flanneld.exe
to run as Windows services in the background.
register-svc.ps1 -NetworkMode <Network mode> -ManagementIP <Windows Node IP> -ClusterCIDR <Cluster subnet> -KubeDnsServiceIP <Kube-dns Service IP> -LogDir <Directory to place logs>
The parameters are explained below:
NetworkMode: The network mode l2bridge (flannel host-gw, also the
default value) or overlay (flannel vxlan) chosen as a network solution
ManagementIP: The IP address assigned to the Windows node. You can use
ipconfig to find this.
ClusterCIDR: The cluster subnet range. (Default: 10.244.0.0/16)
KubeDnsServiceIP: The Kubernetes DNS service IP. (Default: 10.96.0.10)
LogDir: The directory where kubelet and kube-proxy logs are redirected
into their respective output files. (Default value C:\k)
If the above referenced script is not suitable, you can manually configure
nssm.exe using the following examples.
Register flanneld.exe:
nssm install flanneld C:\flannel\flanneld.exe
nssm set flanneld AppParameters --kubeconfig-file=c:\k\config --iface=<ManagementIP> --ip-masq=1 --kube-subnet-mgr=1
nssm set flanneld AppEnvironmentExtra NODE_NAME=<hostname>
nssm set flanneld AppDirectory C:\flannel
nssm start flanneld
Register kubelet.exe:
# Microsoft releases the pause infrastructure container at mcr.microsoft.com/oss/kubernetes/pause:3.4.1
nssm install kubelet C:\k\kubelet.exe
nssm set kubelet AppParameters --hostname-override=<hostname> --v=6 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.4.1 --resolv-conf="" --allow-privileged=true --enable-debugging-handlers --cluster-dns=<DNS-service-IP> --cluster-domain=cluster.local --kubeconfig=c:\k\config --hairpin-mode=promiscuous-bridge --image-pull-progress-deadline=20m --cgroups-per-qos=false --log-dir=<log directory> --logtostderr=false --enforce-node-allocatable="" --network-plugin=cni --cni-bin-dir=c:\k\cni --cni-conf-dir=c:\k\cni\config
nssm set kubelet AppDirectory C:\k
nssm start kubelet
Register kube-proxy.exe (l2bridge / host-gw):
nssm install kube-proxy C:\k\kube-proxy.exe
nssm set kube-proxy AppDirectory c:\k
nssm set kube-proxy AppParameters --v=4 --proxy-mode=kernelspace --hostname-override=<hostname> --kubeconfig=c:\k\config --enable-dsr=false --log-dir=<log directory> --logtostderr=false
nssm.exe set kube-proxy AppEnvironmentExtra KUBE_NETWORK=cbr0
nssm set kube-proxy DependOnService kubelet
nssm start kube-proxy
For initial troubleshooting, you can use the following flags in
nssm.exe to redirect stdout and stderr to a output file:
nssm set <Service Name> AppStdout C:\k\mysvc.log
nssm set <Service Name> AppStderr C:\k\mysvc.log
For additional details, see official nssm usage docs.
My Windows Pods do not have network connectivity
If you are using virtual machines, ensure that MAC spoofing is enabled on
all the VM network adapter(s).
My Windows Pods cannot ping external resources
Windows Pods do not have outbound rules programmed for the ICMP protocol
today. However, TCP/UDP is supported. When trying to demonstrate connectivity
to resources outside of the cluster, please substitute ping <IP> with
corresponding curl <IP> commands.
If you are still facing problems, most likely your network configuration in
cni.conf
deserves some extra attention. You can always edit this static file. The
configuration update will apply to any newly created Kubernetes resources.
One of the Kubernetes networking requirements (see
Kubernetes network model)
is for cluster communication to occur without NAT internally. To honor this
requirement, there is an
ExceptionList
for all the communication where we do not want outbound NAT to occur. However,
this also means that you need to exclude the external IP you are trying to
query from the ExceptionList. Only then will the traffic originating from your
Windows pods be SNAT'ed correctly to receive a response from the outside
world. In this regard, your ExceptionList in cni.conf should look as
follows:
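Concretely, the ExceptionList enumerates the subnets that must not be SNAT'ed. A sketch using the default cluster and service subnets (the management subnet value is illustrative; adjust all three CIDRs to your environment):

```conf
"ExceptionList": [
    "10.244.0.0/16",  # Cluster subnet
    "10.96.0.0/12",   # Service subnet
    "10.127.130.0/24" # Management (host) subnet
]
```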
My Windows node cannot access NodePort services
Local NodePort access from the node itself fails. This is a known
limitation. NodePort access works from other nodes or external clients.
vNICs and HNS endpoints of containers are being deleted
This issue can be caused when the hostname-override parameter is not
passed to
kube-proxy.
To resolve it, users need to pass the hostname to kube-proxy as follows:
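The original command example appears to have been dropped here; a sketch of passing the hostname explicitly (append any other flags your setup uses):

```powershell
C:\k\kube-proxy.exe --hostname-override=$(hostname)
```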
With flannel my nodes are having issues after rejoining a cluster
Whenever a previously deleted node is being re-joined to the cluster,
flannelD tries to assign a new pod subnet to the node. Users should remove the
old pod subnet configuration files in the following paths:
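The paths themselves are missing from this extract; assuming the default C:\k install location, the stale files to remove are typically:

```powershell
Remove-Item C:\k\SourceVip.json
Remove-Item C:\k\SourceVipRequest.json
```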
After launching start.ps1, flanneld is stuck in "Waiting for the Network
to be created"
There are numerous reports of this
issue; most likely it is a
timing issue for when the management IP of the flannel network is set. A
workaround is to relaunch start.ps1 or relaunch it manually as follows:
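A manual relaunch might look like the following sketch; substitute your node's hostname and management IP for the placeholders:

```powershell
[Environment]::SetEnvironmentVariable("NODE_NAME", "<Windows_Worker_Hostname>")
C:\flannel\flanneld.exe --kubeconfig-file=c:\k\config --iface=<Windows_Worker_Node_IP> --ip-masq=1 --kube-subnet-mgr=1
```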
My Windows Pods cannot launch because of missing /run/flannel/subnet.env
This indicates that Flannel didn't launch correctly. You can either try to
restart flanneld.exe or you can copy the files over manually from
/run/flannel/subnet.env on the Kubernetes master to
C:\run\flannel\subnet.env on the Windows worker node and modify the
FLANNEL_SUBNET row to a different number. For example, if node subnet
10.244.4.1/24 is desired:
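The file contents were dropped from this extract; for that example subnet, C:\run\flannel\subnet.env would look roughly like this (the network, MTU, and masquerade values shown are the flannel defaults):

```conf
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.244.4.1/24
FLANNEL_MTU=1500
FLANNEL_IPMASQ=true
```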
My Windows node cannot access my services using the service IP
This is a known limitation of the current networking stack on Windows.
Windows Pods are able to access the service IP however.
No network adapter is found when starting kubelet
The Windows networking stack needs a virtual adapter for Kubernetes
networking to work. If the following commands return no results (in an admin
shell), virtual network creation — a necessary prerequisite for Kubelet to
work — has failed:
Get-HnsNetwork | ? Name -ieq "cbr0"
Get-NetAdapter | ? Name -Like "vEthernet (Ethernet*"
Often it is worthwhile to modify the
InterfaceName
parameter of the start.ps1 script, in cases where the host's network adapter
isn't "Ethernet". Otherwise, consult the output of the start-kubelet.ps1
script to see if there are errors during virtual network creation.
My Pods are stuck at "Container Creating" or restarting over and over
Check that your pause image is compatible with your OS version. The
instructions
assume that both the OS and the containers are version 1803. If you have a
later version of Windows, such as an Insider build, you need to adjust the
images accordingly. Please refer to the Microsoft's
Docker repository for images.
Regardless, both the pause image Dockerfile and the sample service expect
the image to be tagged as :latest.
kubectl port-forward fails with "unable to do port forwarding: wincat not found"
This was implemented in Kubernetes 1.15 by including wincat.exe in the
pause infrastructure container mcr.microsoft.com/oss/kubernetes/pause:3.4.1.
Be sure to use these versions or newer ones. If you would like to build your
own pause infrastructure container be sure to include
wincat.
My Kubernetes installation is failing because my Windows Server node is
behind a proxy
If you are behind a proxy, the following PowerShell environment variables
must be defined:
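The variable definitions were dropped from this extract; a sketch setting them machine-wide (the proxy address is a placeholder for your own):

```powershell
[Environment]::SetEnvironmentVariable("HTTP_PROXY", "http://proxy.example.com:80/", [EnvironmentVariableTarget]::Machine)
[Environment]::SetEnvironmentVariable("HTTPS_PROXY", "http://proxy.example.com:443/", [EnvironmentVariableTarget]::Machine)
```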
In a Kubernetes Pod, an infrastructure or "pause" container is first created
to host the container endpoint. Containers that belong to the same pod,
including infrastructure and worker containers, share a common network
namespace and endpoint (same IP and port space). Pause containers are needed
to accommodate worker containers crashing or restarting without losing any of
the networking configuration.
The "pause" (infrastructure) image is hosted on Microsoft Container Registry
(MCR). You can access it using mcr.microsoft.com/oss/kubernetes/pause:3.4.1.
For more details, see the
DOCKERFILE.
Further investigation
If these steps don't resolve your problem, you can get help running Windows
containers on Windows nodes in Kubernetes through:
If you have what looks like a bug, or you would like to make a feature
request, please use the
GitHub issue tracking system.
You can open issues on
GitHub and
assign them to SIG-Windows. You should first search the list of issues in case
it was reported previously; if so, comment with your experience on the issue
and add any additional logs. SIG-Windows Slack is also a great avenue to get
some initial support and troubleshooting ideas prior to creating a ticket.
If filing a bug, please include detailed information about how to reproduce
the problem, such as:
Kubernetes version: kubectl version
Environment details: Cloud provider, OS distro, networking choice and
configuration, and Docker version
Tag the issue sig/windows by commenting on the issue with /sig windows to
bring it to a SIG-Windows member's attention
What's next
We have a lot of features in our roadmap. An abbreviated high level list is
included below, but we encourage you to view our
roadmap project and help us make
Windows support better by
contributing.
Hyper-V isolation
Hyper-V isolation is required to enable the following use cases for Windows
containers in Kubernetes:
Hypervisor-based isolation between pods for additional security
Backwards compatibility allowing a node to run a newer Windows Server
version without requiring containers to be rebuilt
Specific CPU/NUMA settings for a pod
Memory isolation and reservations
Hyper-V isolation support will be added in a later release and will require
CRI-Containerd.
Deployment with kubeadm and cluster API
Kubeadm is becoming the de facto standard for users to deploy a Kubernetes
cluster. Windows node support in kubeadm is currently a work-in-progress but a
guide is available
here. We are
also making investments in cluster API to ensure Windows nodes are properly
provisioned.
4.2 - Guide for scheduling Windows containers in Kubernetes
Windows applications constitute a large portion of the services and applications that run in many organizations.
This guide walks you through the steps to configure and deploy a Windows container in Kubernetes.
Objectives
Configure an example deployment to run Windows containers on the Windows node
(Optional) Configure an Active Directory Identity for your Pod using Group Managed Service Accounts (GMSA)
It is important to note that creating and deploying services and workloads on Kubernetes
behaves in much the same way for Linux and Windows containers.
Kubectl commands to interface with the cluster are identical.
The example in the section below is provided to jumpstart your experience with Windows containers.
Getting Started: Deploying a Windows container
To deploy a Windows container on Kubernetes, you must first create an example application.
The example YAML file below creates a simple webserver application.
Create a service spec named win-webserver.yaml with the contents below:
Note: Port mapping is also supported, but for simplicity in this example
the container port 80 is exposed directly to the service.
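The YAML contents were dropped from this extract; a sketch of win-webserver.yaml follows. The image tag must match your node's OS version, and the inline PowerShell web server shown here is a simplified stand-in for the original example's listener:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: win-webserver
  labels:
    app: win-webserver
spec:
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: win-webserver
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: win-webserver
  labels:
    app: win-webserver
spec:
  replicas: 2
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: windowswebserver
        # Illustrative image; pick a tag compatible with your Windows host OS
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        command:
        - powershell.exe
        - -command
        # Simplified HTTP listener replying with the container's hostname
        - "$listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start(); while ($listener.IsListening) { $ctx = $listener.GetContext(); $buf = [Text.Encoding]::UTF8.GetBytes('Hello from ' + $env:COMPUTERNAME); $ctx.Response.ContentLength64 = $buf.Length; $ctx.Response.OutputStream.Write($buf, 0, $buf.Length); $ctx.Response.Close() }"
```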
Check that all nodes are healthy:
kubectl get nodes
Deploy the service and watch for pod updates:
kubectl apply -f win-webserver.yaml
kubectl get pods -o wide -w
When the service is deployed correctly both Pods are marked as Ready. To exit the watch command, press Ctrl+C.
Check that the deployment succeeded. To verify:
Two containers per pod on the Windows node, use docker ps
Two pods listed from the Linux master, use kubectl get pods
Node-to-pod communication across the network, curl port 80 of your pod IPs from the Linux master
to check for a web server response
Pod-to-pod communication, ping between pods (and across hosts, if you have more than one Windows node)
using docker exec or kubectl exec
Service-to-pod communication, curl the virtual service IP (seen under kubectl get services)
from the Linux master and from individual pods
Service discovery, curl the service name with the Kubernetes default DNS suffix
Inbound connectivity, curl the NodePort from the Linux master or machines outside of the cluster
Outbound connectivity, curl external IPs from inside the pod using kubectl exec
Note: Windows container hosts are not able to access the IP of services scheduled on them due to current platform limitations of the Windows networking stack.
Only Windows pods are able to access service IPs.
Observability
Capturing logs from workloads
Logs are an important element of observability; they enable users to gain insights
into the operational aspect of workloads and are a key ingredient to troubleshooting issues.
Because Windows containers and workloads inside Windows containers behave differently from Linux containers,
users had a hard time collecting logs, limiting operational visibility.
Windows workloads for example are usually configured to log to ETW (Event Tracing for Windows)
or push entries to the application event log.
LogMonitor, an open source tool by Microsoft,
is the recommended way to monitor configured log sources inside a Windows container.
LogMonitor supports monitoring event logs, ETW providers, and custom application logs,
piping them to STDOUT for consumption by kubectl logs <pod>.
Follow the instructions in the LogMonitor GitHub page to copy its binaries and configuration files
to all your containers and add the necessary entrypoints for LogMonitor to push your logs to STDOUT.
Using configurable Container usernames
Starting with Kubernetes v1.16, Windows containers can be configured to run their entrypoints and processes
with different usernames than the image defaults.
The way this is achieved is a bit different from the way it is done for Linux containers.
Learn more about it here.
Managing Workload Identity with Group Managed Service Accounts
Starting with Kubernetes v1.14, Windows container workloads can be configured to use Group Managed Service Accounts (GMSA).
Group Managed Service Accounts are a specific type of Active Directory account that provides automatic password management,
simplified service principal name (SPN) management, and the ability to delegate the management to other administrators across multiple servers.
Containers configured with a GMSA can access external Active Directory Domain resources while carrying the identity configured with the GMSA.
Learn more about configuring and using GMSA for Windows containers here.
Taints and Tolerations
Users today need to use some combination of taints and node selectors in order to
keep Linux and Windows workloads on their respective OS-specific nodes.
This likely imposes a burden only on Windows users. The recommended approach is outlined below,
with one of its main goals being that this approach should not break compatibility for existing Linux workloads.
Ensuring OS-specific workloads land on the appropriate container host
Users can ensure Windows containers can be scheduled on the appropriate host using Taints and Tolerations.
All Kubernetes nodes today have the following default labels:
kubernetes.io/os = [windows|linux]
kubernetes.io/arch = [amd64|arm64|...]
If a Pod specification does not specify a nodeSelector like "kubernetes.io/os": windows,
it is possible the Pod can be scheduled on any host, Windows or Linux.
This can be problematic since a Windows container can only run on Windows and a Linux container can only run on Linux.
The best practice is to use a nodeSelector.
However, we understand that in many cases users have a pre-existing large number of deployments for Linux containers,
as well as an ecosystem of off-the-shelf configurations, such as community Helm charts, and programmatic Pod generation cases, such as with Operators.
In those situations, you may be hesitant to make the configuration change to add nodeSelectors.
The alternative is to use Taints. Because the kubelet can set Taints during registration,
it could easily be modified to automatically add a taint when running on Windows only.
For example: --register-with-taints='os=windows:NoSchedule'
By adding a taint to all Windows nodes, nothing will be scheduled on them (that includes existing Linux Pods).
In order for a Windows Pod to be scheduled on a Windows node,
it would need both the nodeSelector to choose Windows, and the appropriate matching toleration.
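The combination described above can be sketched as a Pod template fragment, assuming the os=windows:NoSchedule taint from the earlier example:

```yaml
spec:
  nodeSelector:
    kubernetes.io/os: windows
  tolerations:
  - key: "os"
    operator: "Equal"
    value: "windows"
    effect: "NoSchedule"
```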
Handling multiple Windows versions in the same cluster
The Windows Server version used by each pod must match that of the node. If you want to use multiple Windows
Server versions in the same cluster, then you should set additional node labels and nodeSelectors.
Kubernetes 1.17 automatically adds a new label node.kubernetes.io/windows-build to simplify this.
If you're running an older version, then it's recommended to add this label manually to Windows nodes.
This label reflects the Windows major, minor, and build number that need to match for compatibility.
Here are the values used today for each Windows Server version:

Product Name                   Build Number(s)
Windows Server 2019            10.0.17763
Windows Server version 1809    10.0.17763
Windows Server version 1903    10.0.18362
Simplifying with RuntimeClass
RuntimeClass can be used to simplify the process of using taints and tolerations.
A cluster administrator can create a RuntimeClass object which is used to encapsulate these taints and tolerations.
Save this file to runtimeClasses.yml. It includes the appropriate nodeSelector
for the Windows OS, architecture, and version.
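The file contents were dropped from this extract; a sketch of such a RuntimeClass for Windows Server 2019 nodes follows (the metadata name and handler value are illustrative and depend on your container runtime):

```yaml
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: windows-2019
handler: 'docker'
scheduling:
  nodeSelector:
    kubernetes.io/os: 'windows'
    kubernetes.io/arch: 'amd64'
    node.kubernetes.io/windows-build: '10.0.17763'
  tolerations:
  - effect: NoSchedule
    key: os
    operator: Equal
    value: "windows"
```

Pods that set runtimeClassName: windows-2019 then inherit this nodeSelector and toleration automatically, avoiding per-Deployment boilerplate.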