Site Reliability Engineering
EPAM Systems
Software Engineering, Other Engineering
Ukraine
Posted on Jan 16, 2026
Responsibilities
- Design comprehensive monitoring and logging systems using tools like DataDog, Dynatrace, Prometheus, Grafana, Zabbix, and ELK to ensure robust observability
- Define and manage SLIs and SLOs to measure and enhance system performance, reliability, and scalability
- Lead root cause analysis during incident responses, ensure detailed postmortem evaluations, and develop long-term preventive strategies
- Implement infrastructure as code (IaC) using Terraform and cloud CLIs (AWS, Azure, GCP) for streamlined management and consistency
- Automate workflows and CI/CD pipelines leveraging tools such as Jenkins (Groovy DSL), GitLab CI, and Azure DevOps
- Manage containerized environments with expertise in Docker and Kubernetes orchestration for seamless application deployment
- Collaborate with engineering and DevOps teams to standardize observability practices and proactively address issues before they escalate
- Lead and facilitate post-incident reviews and operational drills to identify areas for improvement and increase system resilience
- Provide optional on-call support for rapid issue resolution and to maintain system stability
Requirements
- Residence in Ukraine, with remote work eligibility limited to candidates based within the country
- Advanced proficiency in scripting and automation with Python, Go, Bash, or PowerShell
- Strong knowledge of monitoring systems and tools like Prometheus, Grafana, DataDog, Dynatrace, Zabbix, or ELK
- Experience with cloud platforms (AWS, Azure, or GCP) and expertise in IaC with Terraform
- Solid understanding of configuration management systems like Ansible
- Background in automating CI/CD pipelines and delivery lifecycles using Jenkins, GitLab CI, and Azure DevOps
- Practical experience deploying and orchestrating applications in Docker and Kubernetes environments
- Exceptional problem-solving capability for incident reconstruction and identifying root causes
- Proven track record in leading post-incident reviews and operational improvement exercises
- Strong collaboration skills to work effectively with engineering teams and stakeholders to maintain reliability and performance
- English level B2 or higher
Nice to have
- Knowledge of advanced security and compliance strategies for observability environments
- Familiarity with chaos engineering approaches for resilience and fault tolerance testing
- Experience integrating observability into development workflows to accelerate issue resolution
- Familiarity with additional cloud monitoring services like AWS CloudWatch, Azure Monitor, or GCP Operations Suite
We offer/Benefits
With us you can:
- Work on a flexible schedule remotely or from any of our comfortable offices or coworking spaces in Ukraine
- Receive the necessary equipment to perform your work tasks
- Change projects and technology stacks within EPAM
- Gain experience in various business domains (Insurance, E-commerce, Healthcare, Finance, Travel, Media, Artificial Intelligence, and more)
- Explore relocation opportunities, which may be available for eligible candidates depending on the role and openings at other EPAM locations
- Participate in volunteer and charity programs and in communities (both technical and interest-based)
We focus on your professional growth:
- Plan your individual career path together with your manager
- Receive regular feedback from colleagues
- Improve your English for free with certified teachers (Speaking Clubs, client interview preparation courses, etc.)
- Get the opportunity to undergo free training and certification in AWS, GCP, or Azure
- Use the internal E-learn training program (18,200+ specialized training and mentoring programs)
- Access corporate accounts on LinkedIn Learning, getAbstract, and other partner resources
- Study at EPAM Solution Architecture School with instructors who are practicing architects
- Develop as a leader by joining the Delivery Management, Resource Management, or Leadership Essentials schools, and more
- Participate in internal communities (500+ meetups, technical discussions, brainstorming sessions, online events and conferences annually)
What we offer:
- Vacation and sick leave (including sick leave without a medical certificate)
- A wide range of Voluntary Medical Insurance programs providing both medical treatment and various preventive options (including sports activities)
- Medical insurance for family members at corporate rates
- Company support during significant life events (childbirth or adoption, marriage, etc.)
- Support for psychological comfort: discounts on services from mental health specialists or coaches, thematic training
- E-kids program - a free programming language training program for EPAMers' children