

Staff Site Reliability Engineer
Job Description
We are seeking an experienced and talented Staff Site Reliability Engineer to join our Reliability Engineering team. As a Staff SRE, you will play a key role in ensuring the reliability, scalability, and performance of our infrastructure and applications. You will work closely with cross-functional teams to design, build, and maintain systems that deliver exceptional user experiences and improve the uptime and availability of the company’s products and services.
Responsibilities
Collaborate with development and operations teams to identify and implement solutions for improving system reliability, performance, and availability.
Design and implement automation strategies for provisioning, configuration, and monitoring of infrastructure and applications.
Lead incident response efforts, ensuring timely and effective resolution of issues and conducting thorough post-mortems for continuous improvement.
Use tools such as Datadog for observability and Splunk for logging to strengthen monitoring and alerting.
Enable application teams across the company to instrument their services more effectively and improve their observability, thereby strengthening overall system reliability.
Conduct regular performance analysis and capacity planning to proactively address potential issues and optimize system performance.
Implement and manage monitoring, alerting, and logging systems to ensure the early detection of issues.
Contribute to the design and implementation of disaster recovery and business continuity plans.
Stay current with industry trends, emerging technologies, and best practices to continually enhance the reliability and efficiency of our systems.
Troubleshoot and resolve complex issues in production environments.
Participate in the on-call rotation to ensure 24/7 availability of our systems and services.
Lead and mentor junior members of the Reliability Engineering team.
Continuously identify and implement process improvements to increase efficiency and reduce risk.
Job Requirements
Bachelor's degree in Computer Science, Information Technology, or a related field.
8+ years of experience as a Site Reliability Engineer or in a related role.
Strong expertise in cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
Proficiency in scripting and automation using languages such as Python, shell, or Go.
Solid understanding of networking, security, and infrastructure-as-code principles.
Experience with observability tools such as Datadog and logging solutions like Splunk.
Proven track record of successfully leading incident response efforts and conducting post-mortems.
Experience enabling application teams to improve the observability and reliability of their services.
Excellent communication and collaboration skills, with the ability to work effectively in a team environment.
Excellent problem-solving and troubleshooting skills.