Come join oneZero Financial Systems! An exciting, fast-growing company headquartered in Somerville, MA, oneZero empowers banks, brokerages and hedge funds with cutting-edge trade routing and execution technology. Our platform, deployed with 200+ entities globally, features a low-latency trading environment, integrations to the world’s leading execution venues, and reliable IT infrastructure and technical support, all designed to be customized and scaled to serve any business model and any size of market participant. We take pride in our great work atmosphere and highly motivated team of engineers. We are currently looking for a motivated and talented Site Reliability Engineer to join our Johannesburg office.

oneZero is proud to have been named one of Business Intelligence Group's Best Places to Work for four consecutive years:

https://www.onezero.com/awards/onezero-earns-recognition-as-a-2025-best-place-to-work/

The Boston Globe names oneZero a Top Place to Work in 2022, 2023, and 2024: https://www.onezero.com/homepage/the-boston-globe-names-onezero-a-top-place-to-work-for-third-year-in-a-row/

oneZero earns 2024 Great Place To Work Australia Certification
https://www.onezero.com/awards/onezero-2024-great-place-to-work-australia-certification/

Please see oneZero featured in e-Forex Magazine to learn more about the company and our dynamic team (https://goo.gl/vbXw8i)

Job Purpose

The Site Reliability Engineer is responsible for ensuring the high availability, reliability, and performance of oneZero's AWS-centric microservices platform supporting analytics and market-data products delivered to global brokers. This role is deeply technical, requiring strong AWS expertise and Python proficiency to automate operations, debug production services, optimize performance, and support continuous delivery in a 24x7 financial services environment where uptime is mission-critical. Reporting to the IT Operations Manager, this position demands independent technical decision-making and the ability to exercise sound judgment when responding to critical incidents. The SRE operates with significant autonomy in assessing system performance, diagnosing complex issues, and making critical determinations that impact service availability across a high-traffic, globally distributed infrastructure.

Responsibilities

AWS Infrastructure Monitoring and Incident Response

Monitor and manage AWS services supporting production workloads (ECS/EKS, EC2, Lambda, API Gateway, SQS/SNS, RDS, ElastiCache, CloudFront)
Respond to alerts from CloudWatch, Datadog, and custom monitoring scripts with urgency and precision
Exercise independent judgment in assessing incident severity and determining appropriate response strategies
Diagnose scaling, networking, and performance issues in distributed AWS systems
Perform incident response, ensuring rapid recovery and minimal downtime
Coordinate with development teams during critical incidents and outages, serving as the technical authority for infrastructure decisions

Python-Driven Troubleshooting and Automation

Write Python scripts and tools to automate operational tasks, system checks, and data validation routines
Analyze Python microservice behavior by reading logs, debugging issues, and profiling performance
Build or enhance internal CLI tools to improve support workflows
Use Python to interrogate APIs, AWS resources (via boto3), and production data sources
Independently assess automation opportunities and implement solutions to reduce manual workload

Production Systems and Data Flow Stability

Maintain stability across charting engines, data ingestion pipelines, market-data feeds, scanning engines, and sentiment analysis services
Investigate failures in REST APIs, WebSocket streams, and asynchronous workers
Validate deployments and configurations for AWS-based microservices
Ensure data completeness and accuracy across instruments, markets, and broker-specific configurations
Make real-time decisions on system changes, maintenance windows, and emergency response procedures

Collaboration and Continuous Operations Improvement

Work with DevOps engineers to refine CI/CD pipelines, infrastructure-as-code workflows, and AWS deployment patterns
Collaborate with backend teams to improve microservice reliability and observability
Provide feedback on Python code, error-handling logic, and operational robustness
Contribute to post-incident root cause analyses and propose architectural or automation improvements
Participate in an on-call rotation to provide round-the-clock infrastructure support

Documentation and Runbook Management

Maintain detailed operational documentation, AWS service runbooks, and troubleshooting guides
Build automated checks and self-healing routines where feasible
Drive SRE best practices across the team
Document configurations, standards, and operational procedures that align with industry best practices

Required Skills & Experience

Experience Requirements

2+ years in production support, SRE, or DevOps with a strong AWS and Python footprint
Demonstrated ability to exercise independent judgment in high-pressure situations and make critical decisions affecting system availability
Strong Python scripting and debugging skills (must be able to analyze stack traces, write scripts, automate workflows)
Strong analytical mindset and exceptional problem-solving ability
Calm, structured communication during incidents
Ability to work cross-functionally with DevOps, developers, QA, and product staff
Keen attention to detail and strong ownership of production systems
Comfortable working in a high-availability, high-traffic environment
Willingness to provide off-hours support and coverage as part of the on-call rotation

Technical Expertise

AWS Services:

ECS or EKS (service deployments, scaling behavior, debugging containers)
EC2, Lambda, API Gateway
SQS/SNS messaging patterns
RDS (PostgreSQL/MySQL), DynamoDB
S3 and CloudFront
IAM, KMS, networking (VPC, subnets, security groups)

Monitoring & Observability:

CloudWatch, Datadog, Grafana, OpenSearch/Kibana

Infrastructure & DevOps:

Docker containerization
Infrastructure-as-code (CloudFormation, Terraform)
CI/CD pipelines (CodePipeline, GitHub Actions, GitLab CI, or similar)

Development & Data:

REST APIs, WebSocket protocols, asynchronous workers, distributed system behavior
SQL proficiency and performance investigation for relational databases
MongoDB with JavaScript proficiency

Preferred Qualifications

Bachelor's degree in Computer Science, Information Systems, Engineering, or equivalent practical experience
Experience in fintech, trading systems, or market-data streaming
Python experience with data processing, concurrency (asyncio), or task queues (Celery/RQ)
Exposure to Kinesis, Kafka, or other event-streaming platforms
Familiarity with FastAPI-based microservices
Experience with cost optimization and AWS Well-Architected practices
Understanding of foreign exchange markets and trading platform requirements

Success Profile

The ideal candidate will demonstrate:

Operational Excellence: Reduced production incidents and improved uptime through proactive monitoring and rapid incident response
Automation Focus: Faster MTTD and MTTR through automation and AWS-driven improvements
Technical Impact: Operational tooling and Python automation that significantly reduce manual workload
Collaboration: Positive feedback from internal teams and external broker partners
Ownership: Strong sense of accountability for production system health and reliability

This job is no longer accepting applications
