Tradu

Site Reliability Engineer

Categories: DevOps, Infrastructure

Posted 3 weeks ago
Anywhere
Tech Stack / Requirements

We are seeking an experienced Site Reliability Engineer (SRE) to join our technology team. The SRE will be responsible for ensuring the reliability, scalability, and performance of our systems and services, with a strong focus on AWS, automation, and infrastructure as code. This role blends software engineering with systems engineering, driving resilience and efficiency across our production platforms.

Primary responsibilities (including but not limited to)

Design, build, and maintain reliable, scalable, and performant systems across AWS-based and on-premises environments, with a cloud-first approach.
Implement monitoring, alerting, and observability solutions to ensure visibility into system health and application performance.
Automate operational tasks, deployments, and configuration management to reduce manual intervention and improve efficiency.
Participate in incident response and postmortem processes, driving improvements to system reliability and reducing mean time to recovery (MTTR).
Collaborate with development teams to embed reliability, performance, and scalability into the software development lifecycle.
Manage capacity planning, performance tuning, and cost optimization within AWS.
Ensure security, compliance, and audit requirements are met in all infrastructure and operational practices.

Requirements

5+ years of hands-on experience in Site Reliability Engineering, DevOps, or related roles.
Strong background in Linux systems administration.
Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.).
Deep experience with AWS services (EC2, ECS/EKS, RDS, S3, IAM, networking, etc.).
Proven expertise with tools like Puppet, Chef, or Ansible for configuration management and Terraform for infrastructure as code.
Strong knowledge of CI/CD pipelines and deployment automation (Jenkins, GitLab, or similar).
Hands-on experience with monitoring/observability tools (Prometheus, Grafana, ELK, Datadog, etc.).
Solid understanding of networking, load balancing, and DNS fundamentals.
Excellent problem-solving skills and ability to work effectively under pressure during incidents.

Preferred Skills

Experience with Kubernetes or other container orchestration systems.
Knowledge of service-level objectives (SLOs), service-level indicators (SLIs), and error budgets.
Background in financial systems or other mission-critical, high-availability environments.

Working Hours: 40/week, Monday–Friday. Hybrid: 3 days in-office.

Please submit your CV in English. Only shortlisted candidates will be contacted for an interview.

All Stratos Support EAD employees must be eligible to work in Bulgaria.

Company Description

Tradu is a new multi-asset global trading platform and is part of the Stratos group of companies. Tradu, built by traders for traders, provides the most sophisticated traders with a serious platform that allows them to move easily between asset classes such as stocks, CFDs and crypto, depending on the regulations that govern the trader’s market.

Equal Opportunity Employer

Company overview

Diversity is important to our company and our colleagues because we rely on each of us and each of our teams to provide new perspectives, collaborative solutions, and out-of-the-box ways of thinking. We encourage productivity and success from our colleagues where the most capable people and ideas have the opportunity to succeed. To achieve this,…
