Understanding Pod Priority and Preemption in Kubernetes: A Detailed Guide

Introduction
In Kubernetes, Pod Priority and Preemption is a powerful scheduling feature that ensures critical workloads are placed and maintained on your cluster, even when resources are scarce. With this mechanism, Kubernetes can automatically preempt (evict) lower-priority pods to make room for higher-priority ones, helping orchestrate resource-efficient and reliable workload execution. Introduced as generally available in Kubernetes v1.14, this feature has become a staple for cluster operations.
1. What Is Pod Priority?
Pod Priority is an integer value assigned to a Pod, representing its importance relative to others. Higher values indicate higher importance in scheduling decisions.
Pods without an explicit priority use a default value of 0.
Priorities are defined through PriorityClass objects, which are non-namespaced (cluster-scoped) resources that map a name to an integer priority.
2. Defining Priority: PriorityClass
A PriorityClass defines both the name and numerical value of a priority:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-apps
value: 1000000
globalDefault: false
description: "Pods critical to business logic."
value: Higher numbers mean higher priority.
globalDefault: If true, this is the default for pods without a specified priorityClassName, but only for pods created after the class exists.
Kubernetes ships with two default system-critical classes:
system-node-critical (value 2000001000) and system-cluster-critical (value 2000000000).
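To confirm which priority classes exist on a cluster, including the two built-in system classes, a quick check with kubectl might look like this (requires access to a running cluster):

```shell
# List all PriorityClass objects in the cluster, including the
# built-in system-node-critical and system-cluster-critical classes.
kubectl get priorityclass

# Show the full definition of one class, e.g. the node-critical one.
kubectl get priorityclass system-node-critical -o yaml
```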
3. Scheduling: How Priority Influences Order
Once Pod Priority is in place, the scheduler sorts pending pods by priority. High-priority pods are attempted first. If scheduling a high-priority pod fails due to resource constraints, the scheduler may then preempt lower-priority pods to make room.
4. Preemption: Making Space for What Matters
When a pending pod cannot be scheduled:
The scheduler looks for nodes where evicting one or more lower-priority pods would free enough capacity.
It evicts the minimal necessary set of pods to schedule the higher-priority pod.
When a pod is evicted (whether due to preemption or node pressure eviction):
The pod is terminated on the node where it is running.
The pod is deleted from the current node.
If the pod belongs to a controller (e.g., Deployment, StatefulSet, ReplicaSet, Job, etc.), that controller will notice the missing replica and create a new pod.
The scheduler will then place this new pod on another suitable node.
So effectively:
A standalone pod (not managed by a controller) is gone permanently once evicted.
A pod managed by a controller is recreated, usually on another node, assuming resources are available.
Scheduling metadata:
The pending pod’s status.nominatedNodeName indicates which node is targeted for preemption. However, the pod may ultimately be scheduled elsewhere if conditions change.
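You can read this field directly with kubectl; the pod name below is a hypothetical example:

```shell
# Print the node nominated for preemption on behalf of a pending pod.
# "critical-app" is a hypothetical pod name.
kubectl get pod critical-app -o jsonpath='{.status.nominatedNodeName}'
```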
Important Constraints:
Victim pods terminate using their graceful termination period (30 seconds by default), which delays when the freed capacity actually becomes available.
A PodDisruptionBudget (PDB) is respected on a best-effort basis but can be violated if no alternative set of victims exists.
Inter-pod affinity: if the pending pod requires co-location with lower-priority pods on a node, preemption won't occur on that node.
Cross-node preemption is not supported: the scheduler does not preempt pods on other nodes, for example to satisfy the pending pod's anti-affinity constraints.
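The PDB constraint above can be made concrete. A minimal PodDisruptionBudget that asks the cluster to keep at least two matching pods available might look like this (name and labels are hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  # Preemption tries (best-effort) to keep at least 2 pods
  # matching this selector available at all times.
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```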
5. Non-Preempting Priority Classes
Stable since Kubernetes v1.24 (introduced as alpha in v1.15), you can define a PriorityClass with:
preemptionPolicy: Never
This means pods with this class will:
Queue ahead of lower-priority pods.
Not preempt other pods.
Be preempted by even higher-priority pods.
This is useful, for example, in ML or data science workflows where you want to ensure high scheduling priority without disrupting running services.
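A non-preempting class might be defined like this (name, value, and description are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 100000
# Pods with this class queue ahead of lower-priority pods
# but never trigger preemption of running pods.
preemptionPolicy: Never
globalDefault: false
description: "High scheduling priority without preempting running pods."
```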
6. Interplay with QoS and Eviction
While Pod QoS classes (Guaranteed, Burstable, BestEffort) affect eviction order during node-pressure scenarios, they don't influence scheduling preemption; there, the scheduler considers only priority values.
At node pressure, the kubelet ranks pods for eviction by:
Whether resource usage exceeds requests
Pod priority
Resource usage relative to requests
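A pod whose usage stays within its requests ranks last for node-pressure eviction; setting requests equal to limits (which yields Guaranteed QoS) makes that bound explicit. A minimal sketch, with hypothetical names and resource figures:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      # requests == limits for every resource in every container
      # gives the pod the Guaranteed QoS class.
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
```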
7. Why Use Pod Priority and Preemption?
Reliability: Ensures critical workloads are scheduled promptly without over-provisioning clusters.
Resource utilization: Hosts both mission-critical and lower-priority workloads together, evicting non-essential pods under pressure.
Operational flexibility: You can finely control behavior using PriorityClass values, preemptionPolicy settings, and PodDisruptionBudgets.
8. Sample YAML Snippet
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-apps
value: 1000000
globalDefault: false
description: "Priority for critical services."
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  priorityClassName: high-priority-apps
  containers:
  - name: nginx
    image: nginx
To create a non-preempting class via kubectl:
kubectl create priorityclass high-priority --value=1000 \
--description="High priority but non-preempting" \
--preemption-policy="Never"
9. Best Practices & Troubleshooting
| Scenario | Guidance |
| --- | --- |
| Unintended preemptions | Verify priority levels are assigned correctly; a pod without a priorityClassName defaults to priority 0 (unless a globalDefault class exists). |
| Pending pods not scheduling after preemption | Another higher-priority pod may have taken the freed capacity. This is expected behavior. |
| Higher-priority pods evicted first | The scheduler prefers victim sets with the lowest priorities, but may choose higher-priority victims if that is the only way to avoid violating a PDB. |
| Affinity issues | Avoid inter-pod affinity that ties a high-priority pod to lower-priority pods, as it can block preemption on a node. |
| Termination latency in the scheduling gap | Set terminationGracePeriodSeconds to a smaller value on lower-priority pods. |
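For the last row, a lower-priority pod that yields its node quickly might look like this (the pod name, priority class, and workload are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  priorityClassName: low-priority-batch  # hypothetical class
  # Short grace period: when preempted, this pod frees its
  # capacity in ~5s instead of the default 30s.
  terminationGracePeriodSeconds: 5
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
```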
