Site Reliability Engineering
EPAM Systems
Software Engineering, Other Engineering
New York, NY, USA · Remote
Posted on Apr 9, 2026
Responsibilities
- Define and implement a strategic reliability vision for the trading portfolio, covering infrastructure, network connectivity, application performance, and throughput
- Lead and oversee a team of SRE engineers, providing technical direction, mentorship, and performance guidance
- Own and evolve the SLA/SLO/SLI framework, including error budgets and service health reporting
- Configure and optimize comprehensive monitoring and alerting systems across infrastructure and applications
- Drive observability best practices using APM and monitoring platforms (e.g., Dynatrace)
- Analyze application and infrastructure performance to isolate fault domains and determine root causes of critical incidents
- Lead major incident management, coordinate resolution efforts, and conduct blameless postmortems
- Participate in the 24x7x365 support rotation and ensure operational excellence across the team
- Identify automation opportunities to improve reliability, scalability, and operational efficiency
Requirements
- 8+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering
- Proven leadership experience (technical lead or team lead), with the ability to oversee and mentor engineers
- Strong hands-on experience with SLA/SLO/SLI definition, governance, and reporting
- Solid experience working in Microsoft Azure environments (IaaS, PaaS, networking, monitoring)
- Hands-on experience with Dynatrace (configuration, alerting, dashboards, performance analysis)
- Experience with observability, monitoring, and APM tools in production environments
- Ability to operate effectively under pressure in time-sensitive, high-impact environments
We offer/Benefits
- Medical, Dental and Vision Insurance (Subsidized)
- Health Savings Account
- Flexible Spending Accounts (Healthcare, Dependent Care, Commuter)
- Short-Term and Long-Term Disability (Company Provided)
- Life and AD&D Insurance (Company Provided)
- Employee Assistance Program
- Unlimited access to LinkedIn Learning solutions
- Matched 401(k) Retirement Savings Plan
- Paid Time Off – the employee will be eligible to accrue 15-25 paid days, depending on specific level and tenure with EPAM (accrual eligibility may change over time)
- Paid Holidays – nine (9) total per year
- Legal Plan and Identity Theft Protection
- Accident Insurance
- Employee Discounts
- Pet Insurance
- Employee Stock Purchase Program
- If otherwise eligible, participation in the discretionary annual bonus program
- If otherwise eligible and hired into a qualifying level, participation in the discretionary Long-Term Incentive (LTI) Program