I have a Kubernetes cluster running on AWS using EKS (Elastic Kubernetes Service) and ECR (Elastic Container Registry). One of my deployments runs fine for the first two or three restarts, then always goes into CrashLoopBackOff on image pull, waits out the back-off period, runs fine again, and then repeats the cycle.
These pods consist of a Docker container which waits for a message from a message queue, runs a process, and then stops, at which point the deployment restarts the container, always pulling the image from ECR.
As these pods are intended to handle a lot of traffic and have a short runtime (~1-30 seconds), having each pod enter CrashLoopBackOff on pull and then wait five minutes before actually running wastes a lot of time.
I’ve had a look around for any answers to this, but all the questions I’ve seen describe cases where CrashLoopBackOff continues to run indefinitely, rather than a pod entering CrashLoopBackOff then running successfully once the wait time has finished.
I’ve checked the logs for the pods which have this issue and there is nothing there which indicates any errors. I’m wondering if there is a way to "pause" the container after it is pulled, to ensure it is up and running correctly before the docker command is actually run? Or is there any other way to delay CrashLoopBackOff by a configurable number of seconds? I’ve added "sleep 15;" to the start of my docker container command, but that hasn’t helped.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: piml-xgboost
spec:
  replicas: 5
  selector:
    matchLabels:
      app: piml-xgboost
  template:
    metadata:
      labels:
        app: piml-xgboost
    spec:
      serviceAccountName: cluster-service-account
      containers:
        - name: piml-unet
          image: 'ecr_path'
          imagePullPolicy: "Always"
          resources:
            requests:
              memory: "500Mi"
            limits:
              memory: "4Gi"
          env:
            - name: BROKER_URL
              value: 'amqp_broker_url'
            - name: QUEUE
              value: 'amqp_queue'
            - name: method
              value: xgboost
            - name: k8s
              value: 'True'
Typical ‘kubectl get pods’ output:
NAME                            READY   STATUS             RESTARTS          AGE
piml-xgboost-77d48f9db8-5txmz   0/1     CrashLoopBackOff   959 (2m51s ago)   3d21h
piml-xgboost-77d48f9db8-gs542   0/1     CrashLoopBackOff   532 (108s ago)    2d1h
piml-xgboost-77d48f9db8-pmvlg   0/1     CrashLoopBackOff   979 (44s ago)     3d23h
piml-xgboost-77d48f9db8-wckmk   0/1     CrashLoopBackOff   533 (59s ago)     2d1h
piml-xgboost-77d48f9db8-wz657   0/1     CrashLoopBackOff   712 (2m39s ago)   2d21h
Docker command from the Dockerfile:
CMD sleep 5;/usr/bin/amqp-consume --url=$BROKER_URL -q $QUEUE -c 1 ./docker_script.py
A Deployment is not suitable for your use case. A Deployment is designed for services that run permanently, e.g. a REST service or a worker that registers with a message queue (which seems close to your use case). When a container stops, Kubernetes will restart it, as you noted, but if that happens too often the pod is considered to be in an erroneous state and each further restart is delayed by a growing back-off. That is what the CrashLoopBackOff status reports; it is about the container exiting repeatedly, not about the image pull.
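To illustrate why the first two or three restarts look fine and later ones wait for minutes: as far as I know, the kubelet restarts an exited container with an exponential back-off (10s, 20s, 40s, ...) capped at five minutes, and only resets it once the container has run for 10 minutes without exiting. A rough sketch of that schedule follows; the function name and its defaults are my own illustration, not a Kubernetes API.

```python
def restart_backoff_delays(restarts, base_seconds=10, cap_seconds=300):
    """Approximate wait (in seconds) before each of the first `restarts` restarts.

    Assumes a doubling back-off starting at base_seconds and capped at
    cap_seconds, which is how the kubelet behaviour is documented.
    """
    return [min(base_seconds * 2 ** n, cap_seconds) for n in range(restarts)]


delays = restart_backoff_delays(7)
print(delays)  # [10, 20, 40, 80, 160, 300, 300]
```

With containers that exit after 1-30 seconds, the 10-minute reset is never reached, so every pod ends up at the five-minute cap after a handful of restarts.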
You may have two options:
redesign your app so that it does not stop after it has finished its work, but listens on the queue again for new messages
switch from a Deployment to a CronJob and remove the sleep from the container’s command (note that CronJob schedules use cron syntax, so the shortest interval is one minute rather than 5 seconds)
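The first option could look roughly like the sketch below. It is only an outline under assumptions: `queue.Queue` stands in for the AMQP queue so the example runs without a broker, `process_message` is a hypothetical placeholder for whatever `docker_script.py` does, and the stop signal exists only so the demo terminates.

```python
import queue


def process_message(body):
    # Hypothetical placeholder for the real work done in docker_script.py.
    return body.upper()


def run_worker(messages, stop_signal=None):
    """Keep consuming messages instead of exiting after the first one."""
    results = []
    while True:
        body = messages.get()  # blocks until a message is available
        if body is stop_signal:
            break  # a real worker would normally loop forever
        results.append(process_message(body))
    return results


# Demo: two jobs followed by the stop signal.
inbox = queue.Queue()
for item in ("job-1", "job-2", None):
    inbox.put(item)
print(run_worker(inbox))  # ['JOB-1', 'JOB-2']
```

Because the process never exits on its own, the container keeps running and Kubernetes has no reason to restart it. If I read the amqp-consume options correctly, `-c 1` limits it to a single message, so dropping that flag from the CMD may already give you the same behaviour without changing the script.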