Why Join Us?
Work on high-availability, multi-region deployments
Shape our observability strategy and implement automation at scale
Collaborate with development teams to enhance service reliability
Lead incident response and drive systematic improvements
Essential Skills & Experience
10+ years in SRE, DevOps, or similar roles
Strong networking fundamentals
Skilled with AWS and cloud-native technologies
Proficiency in Python, Go, or JavaScript/TypeScript
Experience with Docker, Kubernetes, CI/CD, and GitOps (Flux/Argo CD)
Knowledge of monitoring tools (Grafana, Prometheus, Loki, Tempo)
Bonus Skills
Advanced Kubernetes certification (CKA/CKAD)
Experience with Terraform, PostgreSQL, MongoDB
Expertise in performance optimization & cost management
Security hardening & compliance implementation
Tech Stack You'll Work With
Containerization: Kubernetes, Docker
Observability: Grafana Stack, Prometheus
Infrastructure: Cloud-native technologies
Programming: Go, Python, TypeScript/JavaScript
CI/CD: Modern pipeline tools
Multi-region deployments & microservices architecture
Key Responsibilities
System Reliability: Design and implement scalable infrastructure solutions
Observability: Architect and maintain monitoring & alerting systems
Automation: Develop automated workflows to reduce manual effort
Incident Management: Lead major incident response and drive improvements
Technical Leadership: Mentor team members and influence engineering decisions
Tool Development: Build internal tools to enhance operational efficiency
Best Practices: Establish and enforce SRE methodologies
Ready to take on this challenge? Apply now with your latest, detailed CV!
Site Reliability Engineer (expert) 0630, Midrand
South Africa, Gauteng, Midrand
Modified June 10, 2025