🚀 Job Title: Principal Site Reliability Engineer
📍 Location: Remote (Within North America)
🕒 Seniority Level: Principal (6+ years); Senior (4+ years); Mid-Level (2+ years)
🔧 What You’ll Do
- Own platform reliability – especially Kubernetes clusters hosted on EKS (AWS).
- Drive cloud cost-efficiency – identify and implement infrastructure savings.
- Design and manage CI/CD pipelines – using GitHub Actions, ArgoCD, etc.
- Enhance and manage Infrastructure as Code (IaC) – with Terraform and Helm.
- Build systems for observability – alerts, dashboards, runbooks, auto-healing.
- Collaborate cross-functionally with engineering, analytics, and product teams.
- Champion DevOps, SRE, and cloud-native best practices across the org.
✅ Required Skills & Experience
- 6+ years in SRE/DevOps/infrastructure (or 4+ years for Senior; 2+ years for Mid-Level).
- Deep experience with:
  - Kubernetes (EKS highly preferred)
  - AWS
  - Docker & container orchestration
- Strong Linux/Bash scripting and system troubleshooting skills.
- Proficiency in CI/CD workflows, ideally with GitHub Actions & ArgoCD.
- Hands-on experience with Terraform and IaC principles.
- Strong communication, problem-solving, and collaboration skills.
🧠 Nice to Have
- Programming experience with Python or Go.
- Experience with an observability stack: Prometheus, VictoriaMetrics, Grafana.
- Familiarity with:
  - Karpenter (EKS autoscaling)
  - Helm
  - Microservices & distributed systems
- Experience optimizing system performance & cloud cost at scale.
💡 Soft Skills
- Self-directed learner and independent problem-solver.
- Team-oriented, highly collaborative mindset.
- Strong analytical skills with a pragmatic approach.
- Clear communicator (written & verbal).
🌟 Why Join?
- Fast-moving, collaborative, supportive work environment.
- Infrastructure is treated as core to product delivery.
- Real opportunity to shape the future of the platform's reliability and scale.
- Your work directly improves developer experience, platform performance, and cost-efficiency.