Site Reliability Engineering & 6 others
EPAM Systems
Software Engineering
Lisbon, Portugal
Posted on Nov 19, 2025
Responsibilities
- Develop and implement monitoring, alerting, and incident response strategies
- Automate routine tasks and processes to improve efficiency
- Collaborate with software engineering teams to design and deploy reliable, scalable systems
- Deploy production changes with precision to maintain platform integrity
- Manage incidents including detailed analysis and reporting to ensure high service levels
- Participate in on-call rotations to support critical systems and services
- Communicate effectively with team members to resolve issues promptly
- Maintain documentation for operational procedures and system configurations
- Continuously improve system reliability and performance through proactive measures
Requirements
- Strong knowledge of Unix/Linux systems and networking, with 3+ years of experience
- Proficiency in Unix/Linux shell scripting and programming languages such as Python, Perl, C, C++, or Java
- Experience with monitoring and observability tools like ITRS Geneos, Dynatrace, Prometheus, and Grafana
- Ability to troubleshoot complex systems and resolve issues efficiently
- Experience working in high-availability, high-traffic environments
- Bachelor’s or Master’s degree in IT engineering or a related field
- Ability to work effectively in a team and adapt to new environments
- Self-motivated with strong problem-solving and issue follow-up skills
- Excellent written and verbal communication skills with English level B2+
Nice to have
- Experience with log management tools such as Splunk, ELK, Graylog, or Loki
- Knowledge of network monitoring tools like Corvil
- Familiarity with databases including Oracle, PostgreSQL, MySQL/MariaDB, or KDB/q
- Experience with messaging systems such as IBM MQ, Tibco, Solace, LBM, or Kafka
- Familiarity with Infrastructure as Code tools like Ansible or Terraform
Benefits
- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn