Lead Site Reliability Engineer
EPAM Systems
Description
We are seeking a highly skilled Lead Site Reliability Engineer to join our team.
The ideal candidate will have a strong background in software engineering and systems engineering, with a focus on reliability and scalability in cloud environments, specifically Azure.
This is a fully remote position that offers you the flexibility to work from any location in Armenia, whether it's your home or our well-equipped offices in Yerevan or Gyumri.
Responsibilities
- Design, implement, and maintain highly available and scalable systems across multi-region Azure cloud architectures
- Ensure disaster recovery plans are in place and tested regularly
- Configure and enhance monitoring and alerting processes using Prometheus, Grafana, Alertmanager, and OpsGenie
- Develop dashboards to visualize system performance and reliability metrics
- Utilize Terraform for infrastructure provisioning and management
- Implement best practices for continuous deployment and infrastructure changes
- Work closely with the development team to support ongoing development efforts
- Communicate with the customer’s DevOps team to elaborate on requirements and collaborate on implementations
- Enhance release management and CI/CD processes using Jenkins
- Improve system security based on recommendations from the security team
- Write and test runbooks to streamline operational tasks and incident response
- Manage and optimize services running in Kubernetes and Docker/Linux environments
- Handle data persistence using Cosmos DB (Mongo API & SQL API) and MS SQL Server
- Work with messaging systems like RabbitMQ, Kafka, and EventHub
- Utilize Azure Networking for secure and efficient communication
Requirements
- 5+ years of experience as a DevOps Engineer or Site Reliability Engineer
- Proven experience with multi-region Azure cloud architectures
- Proficiency in Kubernetes and containerization technologies
- Strong knowledge of Cosmos DB (both Mongo API & SQL API) and MS SQL Server
- Familiarity with monitoring and alerting tools such as Prometheus, Grafana, Alertmanager, and OpsGenie
- Experience with .NET Core and ASP.NET Core applications
- Competency in Docker and Linux environments
- Expertise in Terraform for infrastructure as code
- Experience with CI/CD tools
- Solid understanding of Azure Networking concepts
- Excellent communication skills, both verbal and written
- Strong self-motivation and ability to self-manage tasks and projects
Nice to have
- Experience with Azure IoT Hub and EventHub
We offer
- We connect like-minded people:
  - Delivering innovative solutions to industry leaders, making a global impact
  - Enjoyable working environment, whether it is the vibrant office or the comfort of your home
  - Opportunity to work abroad for up to two months per year
  - Relocation opportunities within our offices in 55+ countries
  - Corporate and social events
- We invest in your growth:
  - Leadership development, career advising, soft skills and well-being programs
  - Certifications, including GCP, Azure and AWS
  - Unlimited access to LinkedIn Learning, Get Abstract and O'Reilly
  - Free English classes with certified teachers
- We cover it all:
  - Participation in the Employee Stock Purchase Plan
  - Monetary bonuses for engaging in the referral program
  - Comprehensive medical & family care package
  - Four trust days per year for personal needs
  - Discounts for fitness clubs
  - Benefits package (hotels, restaurants, stores and services)
EPAM Armenia is a team of talented innovators united by a passion for technology. In 2014, we opened our first office in Yerevan, and now we have a second engineering hub in Gyumri. We've built a continuous-learning organization that helps its employees rapidly advance their careers. Here you will work with the world's industry leaders, support impactful projects using the latest technologies, collaborate with multinational teams, and have access to a wide variety of development opportunities.