Perlego

Site Reliability Engineer

Job Description

Posted on: October 25, 2024

We are looking for an experienced Site Reliability Engineer (SRE) with a strong background in AWS services and monitoring tools. In this role, you will ensure the availability and reliability of our services, especially outside the working hours of the rest of the team, most of whom are based in Europe and India. You will play a central part in addressing issues swiftly, resolving incidents independently, and working effectively in a fast-paced environment.

Responsibilities

As a Site Reliability Engineer, your main focus will be to ensure our services remain highly available and performant. Key responsibilities include:

Monitoring and Incident Management:
Monitor and manage platform activity using tools like Datadog, Prometheus, Grafana, or AWS CloudWatch.
Respond quickly to alerts and incidents, independently resolving issues and ensuring service uptime during off-peak hours.
Conduct post-incident reviews and help improve system resiliency through automation and monitoring enhancements.

Job Requirements

Experience in Site Reliability Engineering, DevOps, or a similar field.
Strong experience with AWS services.
Expertise in using monitoring tools (e.g. Prometheus, Grafana, CloudWatch) for real-time platform performance insights.
Hands-on experience with CI/CD pipeline management for deploying containerized (Docker) and serverless applications.
Proficiency in Linux-based operating systems and shell scripting.
Familiarity with Infrastructure as Code tools (Terraform, CloudFormation).
Experience with incident management, troubleshooting, and platform recovery in high-pressure environments.
Strong communication skills with a proven ability to work both independently and collaboratively across time zones.
