The answer is data: all fast-moving, fast-growing industries rely on it for a competitive edge, and the most advanced companies are realizing the full data advantage by partnering with our client.
As an Observability Engineer in ISS, you will work to improve the reliability and performance of our client's critical infrastructure applications by owning their development and operation. This means setting and owning SLOs for uptime and latency, as well as helping colleagues leverage the features and workflows available to them, all with a focus on keeping the backend web servers, load balancers, and database servers healthy and running smoothly.
They are looking for engineers who have a mix of software and systems skills, are passionate about reliability, performance, and efficiency, and have experience building tools, services, and automation to manage and improve production services.
- Design, operate, maintain, and troubleshoot enterprise systems such as databases, message queues, APIs, and distributed applications through the use of data and metrics such as SLOs and error budgets.
- Establish and practice sustainable incident response and blameless postmortems to prevent problem recurrence.
- Support services before they go live through activities such as system design, developing software platforms and frameworks, capacity planning, and launch reviews.
- Scale systems sustainably through mechanisms like scripting and automation; evolve systems by pushing changes that improve their operational management, reliability, and velocity.
- Collaborate with team members, across business units, and across multiple time zones to create high-quality customer outcomes.
- Demonstrated coding ability in one of the following or a similar language: C, C++, Java, Python, or Go;
- Demonstrated experience in design, implementation, delivery, and maintenance of software systems;
- Able to work in a 24x7 on-call rotation using a follow-the-sun model (i.e., 8am to 8pm local-time pager duty, approximately 1 week every 2-3 months);
- Systematic problem-solving approach, strong communication skills, and a sense of ownership and drive;
- Experience in performance analysis and debugging of enterprise systems.
- 5+ years as a Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer;
- Understanding of Unix/Linux and, optionally, Windows operating systems;
- Experience working with Infrastructure as Code / Automation tools (Ansible, Terraform, CloudFormation);
- Well organized, with the ability to prioritize tasks independently, set goals, and follow through to completion;
- Experience with containers and container orchestration systems such as Docker and/or Kubernetes;
- Expertise with hybrid cloud environments (bare metal and public cloud, AWS preferred).