DevOps Engineer Interview Questions 2026
22 real-world questions covering CI/CD pipelines, containerization, cloud infrastructure, monitoring, and security practices from top employers.
Interview Questions
22 Questions with Answers
Explain the difference between Docker and Kubernetes. When would you use each?
Sample Answer
Docker is a containerization platform that packages applications with their dependencies into portable containers. Kubernetes is a container orchestration platform that manages, scales, and maintains containerized applications across clusters. Use Docker alone for development environments, simple single-host deployments, and CI/CD build stages. Use Kubernetes when you need auto-scaling, self-healing, rolling updates, service discovery, and load balancing across multiple hosts. For smaller workloads, consider Docker Compose or AWS ECS. Kubernetes adds operational complexity that is only justified at scale. Managed Kubernetes services like EKS, GKE, and AKS reduce the operational burden significantly.
Design a CI/CD pipeline for a microservices application.
Sample Answer
Pipeline stages: (1) Source: trigger on git push with branch-based workflows (feature, develop, main). (2) Build: parallel builds for each microservice with dependency caching. (3) Test: unit tests, integration tests, contract tests between services. (4) Security: SAST scanning, dependency vulnerability checks, container image scanning. (5) Build artifacts: Docker image build, push to registry with semantic versioning and git SHA tags. (6) Deploy to staging: automated deployment, smoke tests, integration verification. (7) Deploy to production: canary or blue-green deployment with automated rollback on metric degradation. Use GitOps (ArgoCD/Flux) for declarative deployments. Implement pipeline-as-code in the repository. Add monitoring alerts for deployment failures.
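As a rough illustration of the tagging step in stage (5), the sketch below derives a semantic-version tag and a git-SHA tag for one service image. The registry host, service name, and the `sha-` prefix are hypothetical choices for the example, not a required convention.

```python
import re

SEMVER = re.compile(r"\d+\.\d+\.\d+")
GIT_SHA = re.compile(r"[0-9a-f]{7,40}")


def image_tags(registry: str, service: str, version: str, git_sha: str) -> list[str]:
    """Return the image references to push for one microservice build."""
    if not SEMVER.fullmatch(version):
        raise ValueError(f"not a semantic version: {version!r}")
    if not GIT_SHA.fullmatch(git_sha):
        raise ValueError(f"not a git SHA: {git_sha!r}")
    repo = f"{registry}/{service}"
    # The version tag is for humans; the SHA tag pins the exact commit.
    return [f"{repo}:{version}", f"{repo}:sha-{git_sha[:7]}"]
```

Pushing both tags lets a release be referenced by version while still allowing a deployment to be traced back to the exact commit.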
What is Infrastructure as Code (IaC)? Compare Terraform and CloudFormation.
Sample Answer
Infrastructure as Code manages infrastructure through declarative configuration files instead of manual processes, enabling version control, peer review, automated testing, and reproducible environments. Terraform is cloud-agnostic, uses HCL, maintains state files, and supports a vast provider ecosystem including multi-cloud deployments. CloudFormation is AWS-native, deeply integrated with AWS services, requires no state management (AWS manages it), and supports drift detection. Choose Terraform for multi-cloud environments or when the team already knows it. Choose CloudFormation for AWS-only shops wanting tight integration and native support. Both support modules for reusable infrastructure patterns. With Terraform, always use remote state storage (S3, Terraform Cloud) and state locking for team collaboration.
How would you implement a zero-downtime deployment strategy?
Sample Answer
Three main strategies: Blue-Green deployment maintains two identical environments; route traffic to the new one after verification, keeping the old one as instant rollback. Rolling update gradually replaces old instances with new ones, maintaining service availability throughout. Canary deployment routes a small percentage of traffic (1-5%) to the new version, monitors key metrics, and gradually increases traffic if healthy. In Kubernetes, use rolling update strategy with proper readiness and liveness probes. Set maxSurge and maxUnavailable appropriately. Implement health checks that verify application readiness, not just port availability. Add automated rollback triggers based on error rate, latency, and business metrics.
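The canary logic described above can be sketched as a small decision function. This is a simplified illustration, not the behavior of any specific rollout tool: the step percentages and the thresholds (`max_error_delta`, `max_latency_ratio`) are assumed values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile request latency


def next_canary_weight(current: int, canary: Metrics, baseline: Metrics,
                       steps=(1, 5, 25, 50, 100),
                       max_error_delta: float = 0.01,
                       max_latency_ratio: float = 1.2) -> int:
    """Return the next traffic percentage for the canary, or 0 to roll back."""
    unhealthy = (
        canary.error_rate - baseline.error_rate > max_error_delta
        or canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
    )
    if unhealthy:
        return 0  # automated rollback: send all traffic to the old version
    for step in steps:
        if step > current:
            return step  # healthy: advance to the next traffic step
    return 100  # already fully rolled out
```

Comparing the canary against the live baseline, rather than against fixed numbers, keeps the check meaningful when overall traffic conditions change during the rollout.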
Explain the concept of container networking. How do containers communicate?
Sample Answer
Containers use network namespaces for isolation. Docker provides bridge networks (default, containers communicate via IP), host networks (container shares host network stack, no isolation), overlay networks (multi-host communication for Docker Swarm), and none (no networking). In Kubernetes, every pod gets a unique IP address. Pods communicate directly without NAT. Services provide stable DNS names and load balancing across pod replicas. Ingress controllers handle external traffic routing. Network policies control pod-to-pod communication for security. Service mesh (Istio, Linkerd) adds mTLS encryption, traffic management, and observability. Understanding CNI plugins (Calico, Cilium, Flannel) is important for troubleshooting network issues in production.
How do you monitor and alert on a production system effectively?
Sample Answer
Implement the four golden signals: latency (request duration), traffic (request rate), errors (error rate), and saturation (resource utilization). Use Prometheus for metrics collection and Grafana for dashboards. Implement structured logging with ELK stack or Loki. Use distributed tracing (Jaeger, OpenTelemetry) for request flow across services. Design alerts with clear ownership, actionable runbooks, and appropriate severity levels. Avoid alert fatigue by eliminating noisy alerts and using multi-condition triggers. Implement SLOs and error budgets to make reliability measurable. Monitor both infrastructure (CPU, memory, disk, network) and application metrics (response time, error rate, queue depth, cache hit rate). Use anomaly detection for unknown-unknowns.
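To make the SLO and error-budget idea concrete, here is a minimal arithmetic sketch. It assumes a time-based availability SLO over a rolling window; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget
```

For example, a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, so a single 21.6-minute outage consumes about half the budget.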
What is the difference between horizontal and vertical scaling? When would you use each?
Sample Answer
Vertical scaling (scaling up) adds more resources (CPU, RAM) to an existing instance. It is simpler, requires no application changes, but has hardware limits and creates a single point of failure. Horizontal scaling (scaling out) adds more instances behind a load balancer. It provides better fault tolerance and near-unlimited scaling but requires stateless application design, distributed session management, and database considerations. Use vertical scaling for databases (before sharding), legacy applications that cannot be easily distributed, and when quick scaling is needed. Use horizontal scaling for web servers, microservices, and any stateless workload. Most production systems combine both: vertically scale individual instances to a cost-effective point, then scale horizontally.
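Horizontal autoscalers typically size a fleet from the ratio of an observed metric to its target. The sketch below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current metric / target metric)) with a tolerance band; the default bounds and the function name are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Compute a replica count from the ratio of observed to target metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: avoid flapping
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas averaging 90% CPU against a 60% target, this yields 6 replicas; at 63% it leaves the count unchanged because the ratio is inside the tolerance band.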
Describe how you would handle secrets management in a production environment.
Sample Answer
Never store secrets in source code, in plain-text environment variables, or in Docker images. Use a dedicated secrets manager: AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault. Implement least-privilege access: each service gets only the secrets it needs. Rotate secrets automatically with zero downtime using the rotation lifecycle pattern. In Kubernetes, use external-secrets-operator to sync secrets from Vault or AWS Secrets Manager into Kubernetes Secrets, and enable encryption at rest with a KMS provider, because Secrets are only base64-encoded by default. For CI/CD, use pipeline-native secret variables (GitHub Actions secrets, GitLab CI variables) that are masked in logs. Audit all secret access with logging. Never log sensitive values, and scan code repositories with tools like gitleaks to detect accidentally committed secrets.
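To show what repository secret scanning does in principle, here is a deliberately small sketch. It is not how gitleaks is implemented; the two patterns (the well-known AWS access key ID prefix format and PEM private key headers) are a tiny illustrative subset, and the function name is hypothetical.

```python
import re

# Illustrative patterns only; real scanners ship far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for each line that matches a rule."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

A check like this would typically run as a pre-commit hook or an early pipeline step, failing the build when `find_secrets` returns anything.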
How would you troubleshoot a Kubernetes pod that keeps crashing?
Sample Answer
Systematic approach: (1) kubectl describe pod to check events for OOMKilled, ImagePullBackOff, or failed health checks. (2) kubectl logs for application errors; use the --previous flag to read logs from the crashed container. (3) Check resource limits: a container that exceeds its memory limit is OOMKilled, while hitting its CPU limit causes throttling that can make probes time out. (4) Verify readiness and liveness probes are correctly configured and not timing out. (5) Check if config maps, secrets, or persistent volumes are mounted correctly. (6) Verify network connectivity: can the pod reach dependent services? (7) Use kubectl exec to debug inside a running container. (8) Check node resources: is the node itself under pressure? Common causes: misconfigured environment variables, missing dependencies, insufficient memory limits, and failing health check endpoints.
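Steps (1) and (3) can be partly automated by reading the pod's status. The sketch below takes the parsed JSON of a pod (for example, the output of kubectl get pod with -o json) and summarizes why its containers are waiting or last exited. The function name and output format are illustrative.

```python
def crash_reasons(pod: dict) -> list[str]:
    """Summarize waiting and last-termination reasons for each container."""
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        name = cs["name"]
        # Current state, e.g. CrashLoopBackOff or ImagePullBackOff.
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            findings.append(f"{name}: waiting ({waiting.get('reason', 'unknown')})")
        # How the previous container instance ended, e.g. OOMKilled or Error.
        terminated = cs.get("lastState", {}).get("terminated")
        if terminated:
            findings.append(
                f"{name}: last exit {terminated.get('reason', 'unknown')} "
                f"(code {terminated.get('exitCode')}), restarts={cs.get('restartCount', 0)}"
            )
    return findings
```

An exit reason of OOMKilled points at the memory limit, while a plain Error exit usually means the next stop is the previous container's logs.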
Explain the principle of shift-left security. How do you integrate security into CI/CD?
Sample Answer
Shift-left security integrates security testing early in the development lifecycle rather than as a final gate. In CI/CD: run SAST (Static Application Security Testing) tools like SonarQube or Semgrep on every pull request. Scan dependencies for known vulnerabilities using Snyk or Dependabot. Scan container images with Trivy or Grype before pushing to registry. Implement infrastructure-as-code security scanning with Checkov or tfsec. Run DAST (Dynamic Application Security Testing) in staging environments. Use policy-as-code (OPA/Rego) to enforce security standards on Kubernetes manifests. Secrets detection with gitleaks prevents accidental credential exposure. Make security checks blocking for critical findings and advisory for medium ones to balance velocity with safety.
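The blocking-versus-advisory policy in the last sentence could be expressed as a small gate that a pipeline step calls after the scanners finish. This is a sketch under assumptions: the finding format (a dict with `id` and `severity`) is hypothetical and would need to be adapted to each scanner's real output.

```python
def evaluate_findings(findings: list[dict], blocking=frozenset({"critical"})) -> dict:
    """Split scanner findings into build-blocking and advisory groups."""
    blockers = [f for f in findings if f["severity"].lower() in blocking]
    advisories = [f for f in findings if f["severity"].lower() not in blocking]
    return {"passed": not blockers, "blockers": blockers, "advisories": advisories}
```

Keeping the blocking set as a parameter lets a team tighten the gate over time, for instance by adding "high" once the existing backlog is cleared.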
What is GitOps and how does it differ from traditional DevOps?
Sample Answer
GitOps uses Git as the single source of truth for both application code and infrastructure configuration. A GitOps operator (ArgoCD, Flux) continuously reconciles the desired state in Git with the actual state in the cluster, automatically applying changes and reverting drift. Compared to traditional push-based CI/CD where pipelines deploy directly to targets, GitOps is pull-based: the cluster pulls changes from Git. Benefits: full audit trail via Git history, easy rollback (git revert), consistent environments (what is in Git is what is running), and reduced blast radius (changes go through pull request review). GitOps works best with Kubernetes but the principles apply broadly. Combine with sealed-secrets or external-secrets for managing sensitive configuration.
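The reconciliation idea can be illustrated with a toy planner that compares desired state (what is in Git) with actual state (what is running). This is a conceptual sketch only; ArgoCD and Flux work on real Kubernetes objects, and the dict-based state model here is an assumption for the example.

```python
def plan_reconcile(desired: dict, actual: dict) -> list[tuple[str, str]]:
    """Return the (action, resource_name) steps that converge actual to desired."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))  # drift: revert to what Git declares
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))      # running but no longer declared
    return actions
```

A GitOps operator runs a loop like this continuously, which is why a manual change in the cluster is reverted and a git revert works as a rollback.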
Tell me about a time you improved the reliability of a production system.
Sample Answer
Use the STAR method with specific metrics: 'Our payment processing service had 99.5% uptime but the business needed 99.95%. I analyzed incident reports and found three root causes: single-point-of-failure database, no circuit breakers for downstream services, and manual deployments causing human errors. I implemented a multi-AZ database with automatic failover, added circuit breakers with Resilience4j, and built a fully automated deployment pipeline with canary releases and automatic rollback. I also established SLOs with error budgets and on-call rotations. Over six months, uptime improved to 99.97%, and mean time to recovery dropped from 45 minutes to 8 minutes. The key learning was that reliability is a system property, not a feature you bolt on.'
How do you design a disaster recovery strategy?
Sample Answer
Define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) based on business requirements. Strategy tiers: backup and restore (hours RTO, cheapest), pilot light (minutes to hours RTO, minimal infrastructure running), warm standby (minutes RTO, scaled-down replica), and multi-site active-active (near-zero RTO, most expensive). Implement automated backups with tested restore procedures. Use infrastructure as code for rapid environment recreation. Replicate data across regions (S3 cross-region replication, database read replicas). Document and automate runbooks for failover procedures. Conduct regular disaster recovery drills (gamedays) to verify procedures work under pressure. Monitor backup health and test restores monthly. DR planning without testing is just documentation.
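Monitoring backup health against the RPO can be as simple as comparing the age of the newest successful backup with the objective. A minimal sketch, assuming backup timestamps are available as timezone-aware datetimes; the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def rpo_violated(last_backup: datetime, rpo: timedelta,
                 now: Optional[datetime] = None) -> bool:
    """True if a failure right now would lose more data than the RPO allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup > rpo
```

Wiring this into an alert means a silently failing backup job is noticed within one RPO interval rather than during a restore.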
What are Kubernetes liveness and readiness probes? Why do they matter?
Sample Answer
Liveness probes determine if a container is still healthy enough to keep running. If a liveness probe fails, Kubernetes restarts the container. Use for detecting deadlocked applications or processes that are running but not functioning. Readiness probes determine if a container is ready to receive traffic. If a readiness probe fails, the pod is removed from service load balancing but not restarted. Use for applications that need initialization time or are temporarily unable to serve (for example, when the database connection pool is exhausted). Configure an appropriate initialDelaySeconds, or a startup probe for slow-starting applications, to avoid killing pods during startup. Use HTTP probes for web services, TCP probes for databases, and exec probes for custom health checks. Misconfigured probes are a top cause of unnecessary pod restarts and cascading failures.
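On the application side, the two probes usually map to two separate HTTP endpoints. The sketch below uses only the Python standard library; the paths /healthz and /readyz and the AppState class are common but hypothetical choices for this example.

```python
from http.server import BaseHTTPRequestHandler


class AppState:
    ready = False  # set to True once startup work (caches, DB pool) has finished


def probe_response(path: str, state: AppState) -> tuple[int, str]:
    """Map a probe path to an HTTP status code and body."""
    if path == "/healthz":
        # Liveness: the process can still answer requests at all.
        return 200, "alive"
    if path == "/readyz":
        # Readiness: only accept traffic once dependencies are usable.
        return (200, "ready") if state.ready else (503, "not ready")
    return 404, "not found"


class ProbeHandler(BaseHTTPRequestHandler):
    state = AppState()

    def do_GET(self):
        status, body = probe_response(self.path, self.state)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())
```

Keeping liveness cheap and dependency-free while readiness checks dependencies avoids the failure mode where a database outage causes every pod to be restarted.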
How do you approach capacity planning for cloud infrastructure?
Sample Answer
Analyze historical usage patterns to establish baselines and growth trends. Use load testing (k6, Locust, JMeter) to determine per-instance capacity limits. Calculate required capacity with headroom for traffic spikes (typically a 30-50% buffer). Implement auto-scaling with appropriate metrics (CPU, memory, request queue depth, custom business metrics). Use predictive scaling for known traffic patterns (daily cycles, seasonal peaks). Right-size instances by analyzing actual resource utilization, not just allocated resources. Consider spot instances for fault-tolerant workloads, which can often reduce compute costs by 60-70%. Review capacity monthly and adjust auto-scaling policies. Build dashboards showing current utilization vs capacity limits. Document scaling playbooks for events like product launches or viral growth.
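The headroom calculation is simple arithmetic once load testing has produced a per-instance limit. A minimal sketch, with illustrative parameter names and numbers:

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float, headroom: float = 0.3) -> int:
    """Instances required to serve peak traffic plus a safety buffer."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    return math.ceil(peak_rps * (1 + headroom) / per_instance_rps)
```

For a peak of 2,000 requests per second, a measured limit of 250 per instance, and a 30% buffer, this gives 11 instances (2,600 / 250 = 10.4, rounded up).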
Explain the difference between Docker volumes and bind mounts.
Sample Answer
Docker volumes are managed by Docker, stored in Docker's storage area (/var/lib/docker/volumes on Linux), and are the preferred mechanism for persisting data. They can be shared between containers, can be named for easy reference, support volume drivers for remote storage (NFS, cloud), and Docker manages their lifecycle. Bind mounts map a host directory or file directly into the container, providing direct access to the host filesystem. Use volumes for persistent application data (databases, uploads). Use bind mounts for development (source code hot-reloading) and configuration files. Named volumes survive container removal and can be backed up. Both can be mounted read-only for security. In Kubernetes, use PersistentVolumes and PersistentVolumeClaims instead.
What is your experience with configuration management tools? Compare Ansible, Chef, and Puppet.
Sample Answer
Ansible is agentless (SSH-based), uses YAML playbooks, has a low learning curve, and is ideal for ad-hoc tasks and smaller environments. Chef uses Ruby-based recipes with a client-server model, excels at complex configurations, and is popular in enterprise environments. Puppet uses its own declarative language with a client-server model, strong enforcement of desired state, and excellent reporting. In 2026, Ansible dominates for its simplicity and agentless architecture. However, container orchestration (Kubernetes) and IaC tools (Terraform) have replaced much of what configuration management tools used to do. Use configuration management for server setup and application configuration, IaC for infrastructure provisioning, and container orchestration for application deployment.
How do you handle database migrations in a CI/CD pipeline?
Sample Answer
Use a migration tool (Flyway, Liquibase, Alembic, Prisma Migrate) that tracks applied migrations with version numbers. Store migrations in version control alongside application code. Run migrations as a separate step before application deployment. Design migrations to be backward-compatible: add new columns as nullable, use expand-contract pattern for schema changes, and avoid dropping columns until all application versions stop referencing them. In Kubernetes, use init containers or migration jobs that run before the main application starts. Test migrations against production-like data in staging. Always have a rollback migration ready. For large tables, use online schema change tools (pt-online-schema-change, gh-ost) to avoid locking.
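The core bookkeeping that migration tools perform (record which versions have been applied, then run only the pending ones in order) can be sketched with SQLite from the standard library. This is a teaching sketch, not a replacement for Flyway or Alembic; the table name schema_migrations and the example migrations are assumptions.

```python
import sqlite3

# Ordered (version, SQL) pairs, stored in version control with the application.
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    # Expand step: a nullable column keeps older application versions working.
    (2, "ALTER TABLE users ADD COLUMN display_name TEXT"),
]


def migrate(conn: sqlite3.Connection, migrations=MIGRATIONS) -> list[int]:
    """Apply pending migrations in version order; return the versions applied."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    newly_applied = []
    for version, sql in sorted(migrations):
        if version in applied:
            continue  # already applied on a previous run
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        newly_applied.append(version)
    return newly_applied
```

Because applied versions are recorded, running the step twice is harmless, which is what makes it safe to execute on every deployment.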
What salary range are you targeting for this DevOps role?
Sample Answer
Research market rates on Levels.fyi, Glassdoor, and DevOps-specific salary surveys. DevOps engineers in the US typically earn $110K-$160K for mid-level and $150K-$220K+ for senior roles with Kubernetes and cloud expertise. Frame your response: 'Based on my experience with AWS, Kubernetes, and building CI/CD pipelines at scale, and the market rate for this location, I am targeting total compensation in the range of X to Y. I value the role's scope, the team, and growth opportunities, and am open to discussing the full compensation package.' Let the employer share their range first when possible.
How do you handle an on-call rotation? Describe your incident response process.
Sample Answer
Structure on-call with clear escalation paths, runbooks for common issues, and reasonable rotation schedules (typically weekly rotations, with each engineer on call no more than one week in four). When an alert fires: acknowledge within 5 minutes, assess severity and impact, communicate status in the incident channel, begin diagnosis using monitoring dashboards and logs, and implement a fix or workaround. For complex incidents, designate an incident commander to coordinate the response. After resolution, conduct a blameless post-mortem within 48 hours documenting timeline, root cause, contributing factors, and action items to prevent recurrence. Track incident metrics: MTTD (mean time to detect), MTTR (mean time to recover), and incidents per on-call shift to continuously improve.
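The incident metrics mentioned above are straightforward to compute from incident records. The sketch below assumes each incident is a dict with `started`, `detected`, and `resolved` datetimes, and measures MTTR from the start of the incident; some teams measure it from detection instead, so treat that as a stated assumption.

```python
from datetime import datetime
from statistics import mean


def incident_metrics(incidents: list[dict]) -> dict:
    """Return mean time to detect and mean time to recover, in minutes."""
    mttd = mean((i["detected"] - i["started"]).total_seconds() for i in incidents) / 60
    mttr = mean((i["resolved"] - i["started"]).total_seconds() for i in incidents) / 60
    return {"mttd_minutes": mttd, "mttr_minutes": mttr}
```

Tracking these per quarter shows whether alerting changes (MTTD) and runbook or automation work (MTTR) are actually paying off.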
What is a service mesh and when would you implement one?
Sample Answer
A service mesh is an infrastructure layer that handles service-to-service communication in a microservices architecture. It provides: mTLS encryption (zero-trust security), traffic management (canary deployments, circuit breaking, retries), and observability (distributed tracing, metrics, access logs) without requiring application code changes. Popular implementations include Istio and Linkerd. Implement a service mesh when you have 10+ microservices and need consistent security, traffic management, and observability across all services. Do not implement it for monoliths or small microservice deployments: the operational complexity outweighs the benefits. Consider Linkerd for simplicity or Istio for feature completeness. Evaluate the latency and resource overhead before adoption.
Describe a situation where you had to balance speed and reliability in a deployment.
Sample Answer
Frame with a real scenario: 'We had a critical security vulnerability that needed patching within 24 hours across 15 services. Our normal deployment process took 3 days with full testing. I proposed a risk-based approach: immediately patch and deploy the 5 internet-facing services using our canary pipeline with enhanced monitoring, while following the full process for internal services. I coordinated with the security team to validate patches, set up additional monitoring dashboards, and had the team on standby for quick rollbacks. All patches were deployed within 20 hours with zero customer impact. The experience led us to create a fast-track deployment process for security patches with pre-approved rollback procedures.'
Preparation Tips
Interview Preparation Tips
Be ready to draw architecture diagrams for CI/CD pipelines and cloud infrastructure — many interviews include whiteboard design rounds.
Practice Kubernetes commands and troubleshooting scenarios: kubectl describe, logs, exec, and debugging CrashLoopBackOff pods.
Know your cloud provider deeply: be prepared to discuss specific AWS/GCP/Azure services and when to use each.
Prepare stories about incident response, production debugging, and reliability improvements with specific metrics.
Understand networking fundamentals: DNS, TCP/IP, load balancing, firewalls, and VPNs are frequently tested.
Study Infrastructure as Code patterns: modules, state management, environment promotion, and secret handling.
Avoid These
Common Mistakes to Avoid
Focusing only on tools without explaining the principles and trade-offs behind architectural decisions.
Not being able to explain your monitoring and alerting strategy beyond just naming tools.
Underestimating the importance of security questions: shift-left security, secrets management, and compliance are critical topics.
Not preparing incident response stories with specific timelines, actions, and measurable outcomes.
Over-engineering solutions: proposing Kubernetes for a simple application or a service mesh for five services.
Failing to discuss cost optimization alongside technical solutions — DevOps increasingly includes FinOps responsibilities.