Sr. Site Reliability Engineer, PLM Operations – Palo Alto, California

Tesla

What to Expect

This position can be based in Palo Alto, CA; San Diego, CA; or Austin, TX.

Every day, thousands of Tesla Engineers around the world use a variety of software tools and data stores to design mechanical, electrical, electronic, and software systems. The PLM/CAD Operations team, POPS for short, maintains and improves these systems as technologies evolve so that Tesla Engineers have access to reliable and performant engineering design tools.

Due to the breadth of technology used by Tesla, members of the POPS team are expected to be technical generalists – with deeper expertise in a few areas, e.g. databases, networking or cluster management. As SREs, we replace toil with automation. We develop tooling in Go, but we encounter plenty of Java, Python, JS frameworks, Tcl, and even some VB. We manage clusters above the node allocation layer, handling, for example, our own kubelet upgrades and Windows nodes.

What You’ll Do
Define SLOs around latency, traffic, errors and saturation. Reliability and performance are the team’s deliverables
Maintain Tesla-custom Helm Charts to deploy highly customized and evolving 3DExperience (Dassault Systèmes) services running on on-prem Kubernetes
Modernize our deployment infrastructure using custom GitHub Actions, ArgoCD, Atlantis, and Terraform
Deliver high-performance services using tools like Prometheus, Grafana, Catchpoint, Splunk and OpsGenie
Participate in an on-call rotation, manage incidents as Incident Commander, and write actionable incident reports
Manage tasks via Jira for observability and human capacity planning. Maintain excellent Jira hygiene
Write and review design docs – testing frameworks, deployment models, environment definitions, etc.

What You’ll Bring
Deep networking experience, e.g. experience troubleshooting outages from L7 to L3, experience contributing to infra or networking GitHub repos or publications
Deep Oracle Database experience, e.g. indexing deltas, schema migrations
Docker/Kubernetes, e.g. performed kubelet upgrades in-situ, used skopeo or CRI-O intentionally, configured containerd
Diagnosing problems in legacy enterprise Java stacks
Installing, managing or using 3DExperience, or similar experience with other PLM software
Outstanding experience with scientific computing or LIMS
Deep understanding of hypervisor technology (VMware)

Palo Alto, California

Full time