Site Reliability Engineering & 6 others
EPAM Systems
Software Engineering
Lisbon, Portugal
Posted on Nov 19, 2025
Responsibilities
- Design and enforce monitoring, alerting, and incident management strategies
- Automate repetitive tasks and workflows to increase operational efficiency
- Work alongside software engineering teams to build and launch scalable, dependable systems
- Execute production deployments carefully to preserve platform stability
- Handle incident management with thorough analysis and reporting to maintain service quality
- Engage in on-call duties to support essential systems and services
- Communicate clearly with colleagues to swiftly resolve technical problems
- Maintain up-to-date documentation for operational workflows and system settings
- Drive continuous improvements in system reliability and efficiency through proactive initiatives
Requirements
- Deep understanding of Unix/Linux operating systems and networking, with over 5 years of experience
- Proficiency in Unix/Linux shell scripting and in at least one programming language such as Python, Perl, C, C++, or Java
- Experience with monitoring and observability solutions such as ITRS Geneos, Dynatrace, Prometheus, and Grafana
- Strong troubleshooting skills for complex system issues
- Experience in environments with high availability and heavy traffic
- Bachelor’s or Master’s degree in IT engineering or a related discipline
- Ability to collaborate effectively within a team and adapt to evolving environments
- Self-driven, with excellent problem-solving skills and a thorough approach to issue tracking
- Excellent written and verbal communication abilities with English proficiency at B2+ level
Nice to have
- Familiarity with log analysis tools like Splunk, ELK, Graylog, or Loki
- Knowledge of network monitoring solutions such as Corvil
- Experience with relational databases including Oracle, PostgreSQL, MySQL/MariaDB, or KDB/q
- Understanding of messaging platforms like IBM MQ, Tibco, Solace, LBM, or Kafka
- Experience with Infrastructure as Code tools such as Ansible or Terraform
We offer
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn