Site Reliability Engineering
EPAM Systems
Software Engineering
Mexico · Remote
Posted on Nov 19, 2025
Responsibilities
- Oversee and enhance the product monitoring system
- Handle incidents, including troubleshooting, resolution, documentation, and analysis
- Share knowledge and insights across teams
- Facilitate collaboration between operations and development
- Create automation for log analysis, testing production systems, and alerting
- Track system health, performance, and SLIs/SLOs/SLAs
- Maintain documentation for incident management procedures
- Conduct incident analyses and implement corrective actions
- Respond to on-call support requests during and after business hours
- Collaborate with teams to enhance system efficiency and reliability
- Leverage tools such as PagerDuty, ELK/Kibana, SEQ logging, Prometheus, and Grafana for system monitoring
- Develop scripts and implement automation solutions using Python, C#, and Bash
- Manage orchestration and infrastructure through SaltStack and Docker
- Support project workflows using Azure DevOps and maintain a comprehensive Wiki
- Maintain code repositories and manage version control with Git
Requirements
- 1+ years of experience building solutions, particularly in Site Reliability Engineering
- Expertise in cloud services and automation scripting with Python and Bash
- Background in Oil & Gas operations and incident handling
- Skill in managing incident responses and providing on-call support
- Familiarity with monitoring tools such as Prometheus and Grafana
- Proficiency in logging tools like ELK/Kibana and SEQ logging
- Knowledge of orchestration and infrastructure solutions including SaltStack and Docker
- Understanding of fundamental networking concepts like inbound/outbound rules and firewalls
- Proficiency in tools for project management and issue tracking like Azure DevOps
- Capability to manage source code with Git
- Strong skills in creating documentation and disseminating knowledge
- Competency in conducting detailed post-incident reviews
- Excellent troubleshooting abilities and problem-solving skills
- Effective communication skills, with an English level of at least B2
Nice to have
- Experience using PagerDuty for incident handling
- Competency in C# programming
- Understanding of SQL and MongoDB databases
- Background in Zededa infrastructure
- Experience in supporting Oil & Gas field operations
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn