A Kubernetes operator that automatically cleans up stuck and excess CronJob executions (Jobs and Pods) based on user-defined lifecycle policies.
This project was built from a real production issue where CronJob executions entered error or hung states and were never cleaned up, leading to an unbounded buildup of Jobs and Pods in the cluster.
Kubernetes CronJobs create a new Job (and Pod) on every scheduled run. While Kubernetes provides basic history limits (successfulJobsHistoryLimit and failedJobsHistoryLimit), it does not handle:
- Long-running or hung Jobs
- Executions stuck in Active state
- Fine-grained retention policies
- Explicit lifecycle cleanup semantics
Over time, this leads to:
- Orphaned Jobs
- Accumulating Pods
- Wasted cluster resources
The Cron Execution Cleaner Operator introduces a Custom Resource, CronExecutionCleaner, that allows users to define explicit lifecycle policies for CronJob executions.
The operator follows a standard Kubernetes reconcile pattern driven by a custom resource that defines cleanup policies.
```mermaid
flowchart TD
    CR[CronExecutionCleaner - CR]
    CTRL[Controller - Reconcile Loop]
    CACHE[Informer Cache - Jobs and Pods]
    OWN[Filter by Ownership]
    STUCK[Detect Stuck Jobs - Time Based]
    RETAIN[Apply Retention Policy - Count Based]
    DELETE[Delete Jobs via Background Propagation]
    STATUS[Update CR Status]

    CR --> CTRL
    CTRL --> CACHE
    CACHE --> OWN
    OWN --> STUCK
    OWN --> RETAIN
    STUCK --> DELETE
    RETAIN --> DELETE
    DELETE --> STATUS
```
The operator is deployed to the cron-execution-cleaner-system namespace but operates on Jobs and Pods in user-specified namespaces. This provides:
- Isolation: Operator cannot interfere with cluster system components
- Multi-tenancy: Single operator instance manages multiple namespaces
- Least privilege: Explicit RBAC for each namespace
An example CronExecutionCleaner resource:

```yaml
apiVersion: lifecycle.github.io/v1alpha1
kind: CronExecutionCleaner
metadata:
  name: stuck-cron-cleaner
spec:
  namespace: cron-test
  cronJobName: stuck-cron
  retain:
    successfulJobs: 3
    failedJobs: 3
  cleanupStuck:
    enabled: true
    stuckAfter: 2h
  runInterval: 5m
```

The operator:

- Watches cleanup policies via a CRD
- Lists Jobs using informer-backed caches
- Filters Jobs using ownership (CronJob → Job → Pod)
- Detects stuck executions using time-based rules
- Enforces retention limits for completed Jobs
- Deletes Jobs safely with cascading Pod cleanup
- Reports actions via CR status
The controller reconciles based on spec.runInterval (a short sketch follows this list). For example, with runInterval: 5m:
- Cleanup check runs every 5 minutes
- Between cycles, no cleanup occurs even if conditions are met
- This prevents excessive API calls and provides predictable cleanup timing
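One common way to implement this pacing with controller-runtime is to re-queue the request after the configured interval. The sketch below is illustrative, not this repository's code, and assumes spec.runInterval has already been parsed into a time.Duration:

```go
package controller

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// nextCleanupResult asks controller-runtime to call Reconcile again after
// runInterval, giving a fixed cleanup cadence (e.g. every 5 minutes)
// regardless of how often Jobs and Pods change in the cluster.
func nextCleanupResult(runInterval time.Duration) ctrl.Result {
	return ctrl.Result{RequeueAfter: runInterval}
}
```

Returning a result like this at the end of a successful reconcile is what keeps cleanup from running on every watch event.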
A Job is considered stuck if all of the following are true:
- status.active > 0
- status.succeeded == 0
- status.failed == 0
- The Job has been running longer than cleanupStuck.stuckAfter
This logic relies on Job controller semantics (status.startTime) rather than
Pod-level heuristics.
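As a minimal sketch (with an illustrative function name, not the project's actual code), the check can be expressed directly against the Job's status fields:

```go
package controller

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
)

// isStuck reports whether a Job matches the stuck criteria above: it is
// still active, has neither succeeded nor failed, and has been running
// longer than stuckAfter (measured from status.startTime).
func isStuck(job batchv1.Job, stuckAfter time.Duration, now time.Time) bool {
	st := job.Status
	if st.Active == 0 || st.Succeeded > 0 || st.Failed > 0 {
		return false
	}
	if st.StartTime == nil {
		// No start time recorded yet, so we cannot say how long it has run.
		return false
	}
	return now.Sub(st.StartTime.Time) > stuckAfter
}
```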
For completed Jobs:
- Keep the N most recent successful Jobs
- Keep the M most recent failed Jobs
- Delete older executions (oldest first)
Important: Jobs without status.startTime are treated as oldest and deleted first.
Retention is enforced independently from stuck-job cleanup.
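A sketch of the count-based selection, again with illustrative names; it assumes the caller has already partitioned completed Jobs into successful and failed sets:

```go
package controller

import (
	"sort"

	batchv1 "k8s.io/api/batch/v1"
)

// excessJobs returns the completed Jobs that fall outside the retention
// window: everything except the `keep` most recent ones. Jobs without a
// status.startTime sort as oldest, so they are returned (deleted) first.
func excessJobs(completed []batchv1.Job, keep int) []batchv1.Job {
	sort.Slice(completed, func(i, j int) bool {
		ti, tj := completed[i].Status.StartTime, completed[j].Status.StartTime
		switch {
		case ti == nil && tj == nil:
			return false
		case ti == nil:
			return true // missing startTime sorts as oldest
		case tj == nil:
			return false
		default:
			return ti.Time.Before(tj.Time) // oldest first
		}
	})
	if len(completed) <= keep {
		return nil
	}
	return completed[:len(completed)-keep]
}
```

Calling this once with retain.successfulJobs and once with retain.failedJobs yields the two independent deletion sets.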
The operator updates the CronExecutionCleaner status with:
- lastRunTime
- jobsDeleted
- podsDeleted
This provides visibility into cleanup actions and makes the operator easy to observe and debug.
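As a sketch of how such a status write could look, the example below uses an unstructured object so it does not depend on the project's generated Go types; the field paths mirror the status example shown later in this README, but the helper itself is hypothetical:

```go
package controller

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reportCleanup records the outcome of one cleanup cycle on the CR's
// status subresource: how many Jobs and Pods were deleted and when.
func reportCleanup(ctx context.Context, c client.Client, key client.ObjectKey, jobs, pods int64) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "lifecycle.github.io",
		Version: "v1alpha1",
		Kind:    "CronExecutionCleaner",
	})
	if err := c.Get(ctx, key, u); err != nil {
		return err
	}
	_ = unstructured.SetNestedField(u.Object, jobs, "status", "jobsDeleted")
	_ = unstructured.SetNestedField(u.Object, pods, "status", "podsDeleted")
	_ = unstructured.SetNestedField(u.Object, time.Now().UTC().Format(time.RFC3339), "status", "lastRunTime")
	return c.Status().Update(ctx, u)
}
```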
Safety and scoping guarantees:

- Namespace-scoped
- CronJob-scoped
- Ownership-based selection (see the sketch after this list)
- No cluster-wide deletions
- Explicit retention and timeout policies
- Cascading deletion handled by Kubernetes
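Ownership-based selection boils down to an owner-reference check on each Job. A sketch, with an illustrative helper name:

```go
package controller

import (
	batchv1 "k8s.io/api/batch/v1"
)

// ownedByCronJob reports whether the Job was created by the named CronJob,
// based on its ownerReferences. Jobs created by other controllers or by
// hand are ignored, which keeps deletions scoped to a single CronJob.
func ownedByCronJob(job batchv1.Job, cronJobName string) bool {
	for _, ref := range job.OwnerReferences {
		if ref.Kind == "CronJob" && ref.Name == cronJobName {
			return true
		}
	}
	return false
}
```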
The operator uses Kubernetes' DeletePropagationBackground policy (a sketch of the delete call follows this list):
- Jobs are deleted immediately
- Associated Pods are cleaned up asynchronously
- WARNING: Once a Job is deleted, there is no recovery mechanism. Ensure your retention policies are appropriate before enabling cleanup.
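With the controller-runtime client, that policy is passed as a delete option. A sketch, not the repository's code:

```go
package controller

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteJobWithPods removes a Job and lets the garbage collector clean up
// its Pods in the background, matching the behaviour described above.
func deleteJobWithPods(ctx context.Context, c client.Client, job *batchv1.Job) error {
	return c.Delete(ctx, job, client.PropagationPolicy(metav1.DeletePropagationBackground))
}
```

Background propagation returns as soon as the Job deletion is accepted; the garbage collector then removes the owned Pods asynchronously.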
Known limitations:

- One CronExecutionCleaner resource per CronJob
- Assumes a 1:1 Job:Pod ratio
- No dry-run mode available
Prerequisites:

- Go v1.21+
- Docker
- kubectl
- Kubernetes 1.26+
- Access to a Kubernetes cluster (Kind, Minikube, etc.)
- Kubebuilder
The controller is deployed with the following resource requests and limits:
- CPU: 10m (request) / 500m (limit)
- Memory: 64Mi (request) / 128Mi (limit)
See config/manager/manager.yaml for details.
Run the controller locally against your kubeconfig:
```sh
# Install the CRD (register the template)
# This makes Kubernetes understand what a "CronExecutionCleaner" is
make install
make run
```

Build and push the controller image:

```sh
make docker-build docker-push IMG=<your-registry>/cron-execution-cleaner:tag
```

NOTE: Replace <your-registry> with your container registry (e.g., docker.io/username, gcr.io/project-id).
Ensure you have push permissions to the registry.
Install the CRDs into the cluster:
```sh
make install
```

Deploy the controller:

```sh
make deploy IMG=<your-registry>/cron-execution-cleaner:tag
```

NOTE: If you encounter RBAC errors, you may need to grant yourself cluster-admin privileges or be logged in as admin.
Verify the controller is running:
```sh
kubectl get deployment -n cron-execution-cleaner-system
kubectl logs -n cron-execution-cleaner-system deployment/controller-manager
```

First, review the sample configuration:

```sh
cat config/samples/lifecycle_v1alpha1_cronexecutioncleaner.yaml
```

Customize the sample if needed (namespace, cronJobName, retention policy, etc.), then apply:
```sh
# Create a CR (deploy an instance)
kubectl apply -k config/samples/

# Alternatively, kubectl apply -f config/samples/lifecycle_v1alpha1_cronexecutioncleaner.yaml
```

Verify the resource was created:

```sh
# List all CRs of this type
kubectl get cronexecutioncleaner
# Shows: stuck-cron-cleaner

# Get a specific CR
kubectl get cronexecutioncleaner stuck-cron-cleaner -o yaml
# Shows the configuration and status of this specific instance
```

Wait for the next reconciliation cycle (based on runInterval) to see cleanup actions.
After one cycle runs, check the CR status:
```sh
kubectl get cronexecutioncleaner stuck-cron-cleaner -o yaml
```

Expected status output:

```yaml
status:
  jobsDeleted: 5
  podsDeleted: 5
  lastRunTime: "2026-01-07T10:30:00Z"
  conditions:
    - type: Ready
      status: "True"
      reason: ReconcileSuccess
      message: Cleanup executed successfully
      lastTransitionTime: "2026-01-07T10:30:00Z"
```

Delete the instances (CRs) from the cluster:
```sh
kubectl delete -k config/samples/
```

Delete the CRDs from the cluster:

```sh
make uninstall
```

Undeploy the controller from the cluster:

```sh
make undeploy
```

Jobs not being deleted:
- Check controller logs: kubectl logs -n cron-execution-cleaner-system deployment/controller-manager
- Verify RBAC: kubectl auth can-i delete jobs --as=system:serviceaccount:cron-execution-cleaner-system:controller-manager
- Confirm Job ownership: kubectl get jobs -o jsonpath='{.items[*].metadata.ownerReferences}'
Controller pod not running:
- Insufficient RBAC permissions
- CRD not installed: run make install
The project includes end-to-end (e2e) tests generated using Kubebuilder. These tests create and manage a Kind cluster automatically and validate the controller behavior against a real Kubernetes API.
This Kind-based workflow is the recommended and easiest way to run the e2e tests locally.
Make sure the following tools are installed:
- Go (1.21+)
- Docker
- kubectl
- Kind
- Kubebuilder
Verify:
```sh
docker version
kubectl version --client
kind version
kubebuilder version
```

Step 1: Create a Kind cluster named kind
```sh
kind create cluster --name kind
```

Verify:

```sh
kubectl cluster-info
```

Step 2: Run the e2e tests
From the repository root:
```sh
go test ./test/e2e -tags=e2e -v
```

What this does:
- Builds the controller image
- Loads the image into the Kind cluster
- Installs CRDs
- Deploys the controller
- Creates test resources
- Validates reconcile behavior
- Cleans up automatically
Step 3: Clean up (optional)
Delete the Kind cluster when done:
```sh
kind delete cluster --name kind
```

Planned enhancements:

- Prometheus metrics
- Dry-run mode (validate policies without deletion)
- Helm chart
- Support for multiple CronJobs per CR
- Finalizers for CR cleanup on deletion
Contributions are welcome. Please:
- Open an issue to discuss changes
- Keep PRs focused and well-scoped
- Follow existing controller patterns
NOTE: Run make help for more information on all potential make targets
MIT