We are looking for experienced Senior Site Reliability Engineers (SREs) to join our team and help maintain and enhance the reliability, scalability, and performance of our cloud-based systems. Our platform processes vast amounts of data in real time and operates 24/7 with high availability, requiring expertise in automation, monitoring, and incident resolution.
This role requires on-site presence at our office 4 days a week to support effective collaboration and teamwork.
Responsibilities:
Design, implement, and maintain highly available, fault-tolerant cloud infrastructure with an Infrastructure-as-Code (IaC) approach.
Develop and optimize automated CI/CD pipelines following the GitOps methodology.
Improve service scalability and engineering productivity through automation.
Monitor and maintain production systems, proactively identifying and resolving performance bottlenecks.
Implement security and compliance best practices.
Develop and maintain observability solutions, ensuring comprehensive monitoring, alerting, and logging across distributed systems.
Participate in an on-call rotation, incident resolution, and root cause analysis to enhance system resilience.
Plan and execute disaster recovery and system capacity scaling strategies.
Collaborate closely with development and architecture teams to drive performance improvements and optimize infrastructure.
Requirements:
4+ years of experience as an SRE, Systems Engineer, or DevOps Engineer supporting large-scale, high-availability systems.
Strong Linux administration skills and knowledge of networking fundamentals (TCP/IP, DNS, routing).
Hands-on experience with public cloud providers (AWS, GCP, or Azure), containers (Docker), and container orchestration with Kubernetes.
Proven expertise in Infrastructure-as-Code and GitOps tooling (Terraform, Ansible, ArgoCD, or Helm).
Proficiency in automation and scripting using Python, Go, or Bash.
Experience working with distributed systems and databases such as Kafka, Cassandra, ClickHouse, PostgreSQL, MySQL, MongoDB, or VictoriaMetrics.
Familiarity with CI/CD tools such as GitLab CI/CD or Spinnaker, and experience deploying high-availability applications.
Strong knowledge of monitoring and logging systems like Prometheus, Grafana, ELK Stack, Zabbix, or CloudWatch.
Effective communication and problem-solving skills, with the ability to work in a globally distributed team.
Fluent English (written & spoken).
Nice to Have:
Experience with high-load distributed systems and microservices.
Knowledge of VoIP solutions, contact center technologies, or SaaS monitoring practices.
Experience with JVM tuning, Nginx administration, and high-availability configurations (HAProxy, Keepalived).
Familiarity with ITIL or other IT service management frameworks.
What We Offer:
A well-coordinated, professional team working on cutting-edge technologies.
Interesting and challenging tasks in a dynamic environment with opportunities for professional growth.
Additional Health and Life Insurance Package.
Employee Assistance Program.
25 vacation days.
200 BGN Digital Food Vouchers.
Working Expenses Allowance of 120 BGN gross, paid as part of the salary.