

Senior Software Engineer
Job Description
We are looking for a Senior SRE to help us support our highest-value Grafana Cloud customers by increasing the reliability of our Cloud databases, which are based on Mimir, Loki, Tempo, and Pyroscope. We provide these databases as a SaaS product from AWS, GCP, and Azure across all regions.
Responsibilities
Review and create SLOs, and proactively investigate ways to further reduce error budget burn for those SLOs. This work can be self-directed or follow from incident learnings, and may include improvements to monitoring, automation, self-healing, auto-scaling, etc.
Improve observability of customers within the High SLA environments
Configure systems to increase reliability via Helm and Jsonnet
Collaborate with our Engineering Leaders to help define and influence product strategy, roadmaps, and technical designs
Participate in PR reviews and collaborate with other engineers on their Design Docs
Teach others about Site Reliability Engineering and communicate best practices to be applied early in development of new features and functionality
Participate in Incident Response when applicable, from investigation through to resolution, including the Post Incident Review (PIR) and communication with customers via bridge calls where necessary
Job Requirements
Strong engineering background (at least 6 years) that leans towards SRE roles (at least 3 years)
Good communication skills: capable of engaging in deep technical conversations with other engineers and customers, and of collaborating across organizational boundaries
Experience with Kubernetes on any of AWS, GCP, or Azure, and working with Helm charts
Experience with Site Reliability Engineering, System Design, and Distributed Computing
Experience with one or more programming languages (e.g. Go, Python, JavaScript)
Experience with Linux operating system internals, and some knowledge of networking
Experience participating calmly and actively in blame-free Incident Response, following up on actions, and writing high-quality PIRs (Post Incident Reviews, a.k.a. post-mortem documents)
Comfortable working within an engineering team where individuals are encouraged to have a strong sense of autonomy and self-direction
We highly value those who are kind, intellectually curious, default to transparency, and possess a high bias towards action