EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture. Here you will collaborate with multinational teams, contribute to a wide range of innovative projects that deliver creative, cutting-edge solutions, and have the opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

We are seeking a talented and motivated Site Reliability Engineer (SRE) to join our organization.

The SRE will play a crucial role in ensuring the reliability, scalability, capacity planning, and performance of our infrastructure and applications. The ideal candidate will have a strong background in software engineering, system administration, containerization, and cloud technologies.

Responsibilities

Design, build, and maintain scalable, reliable, and efficient cloud infrastructure and services on platforms like AWS, Azure, or Google Cloud
Automate manual work using scripting/programming languages such as Python, Bash, or PowerShell, especially within cloud environments
Utilize automation tools like Jenkins, GitLab, and Ansible/Chef to streamline deployment, monitoring, and management of systems and applications in the cloud
Monitor system performance and proactively troubleshoot issues to ensure high availability and performance
Employ observability tools like Prometheus, Grafana, ELK stack, Splunk, Dynatrace, or Datadog for monitoring, alerting, and logging
Participate in capacity planning and scalability assessments to support business growth and requirements
Manage containerization and orchestration technologies such as Docker and Kubernetes, particularly in cloud-native environments
Implement security best practices and standards to safeguard data and systems in the cloud
Continuously evaluate and recommend new technologies and practices to improve system reliability and efficiency
Document processes, procedures, and configurations to maintain system integrity and facilitate knowledge sharing

Requirements

3–5 years of relevant experience
Proficiency in designing and maintaining cloud infrastructure on AWS, Azure, or Google Cloud
Strong scripting and programming skills in languages like Python, Bash, or PowerShell
Experience with automation tools such as Jenkins, GitLab, and Ansible/Chef
Excellent communication and collaboration skills
Experience with observability tools like Prometheus, Grafana, ELK stack, Splunk, Dynatrace, or Datadog
Hands-on experience with Docker, Kubernetes, or similar containerization and orchestration technologies
Knowledge of security best practices for cloud environments
Familiarity with SLI, SLO, SLA, and Error Budget concepts
Strong problem-solving skills and ability to troubleshoot complex issues under pressure

Nice to have

Experience with Agile methodologies and DevOps practices
Certifications in cloud technologies (AWS, Azure, Google Cloud)
Advanced knowledge of network and security architecture

We offer

Opportunity to work on technical challenges that may have an impact across geographies
Vast opportunities for self-development: online university, global knowledge sharing, and learning through external certifications
Opportunity to share your ideas on international platforms
Sponsored Tech Talks & Hackathons
Unlimited access to LinkedIn Learning solutions
Possibility to relocate to any EPAM office for short- and long-term projects
Focused individual development
Benefit package:
- Health benefits
- Retirement benefits
- Paid time off
- Flexible benefits
Forums to explore passions beyond work (CSR, photography, painting, sports, etc.)