Staff Site Reliability Engineer, Infrastructure Engineering – Fremont, California

Tesla

What to Expect

Our team manages multiple functions across Tesla, including public cloud infrastructure (AWS, Azure, GCP), DevOps, factory software, and MLOps. Continued development and automation of deployment, monitoring, self-healing, and alerting processes is imperative to the success of our engineering groups. As a site reliability engineer, you will be responsible for maintaining and improving our platform to ensure our cross-functional teams have the tools and resources they need to be productive.

What You’ll Do
Support factory teams across Tesla by bridging the gap between developers and infrastructure, reviewing architecture, and improving cloud security posture
Improve our Terraform modules, CI/CD platform (cloud & on-premises), and Infrastructure as Code, monitoring, and self-healing pipelines
Optimize our server, storage, and network performance
Build automation for factory teams and onboard them to our centralized CI/CD platform
Design, automate, and maintain observability platforms (logging, metrics, and tracing)
Enable Infrastructure as Code, governance, and security best practices
Participate in 24×7 on-call rotation

What You’ll Bring
Proficiency in Python, Golang and/or Bash
Proficiency with Linux fundamentals and performance optimizations
Experience with configuration management software (Ansible, etc.) and systems monitoring & alerting tools (Prometheus, Grafana, Telegraf, Splunk, etc.)
Experience with AWS, Azure and/or GCP, advanced multi-region cloud networking concepts, and containerization technologies such as Kubernetes
Experience with cloud governance and Identity & Access Management

Excellent troubleshooting skills in production
Bachelor’s degree in Computer Science, Computer Engineering, Electrical Engineering, or Physics; or equivalent experience or proof of exceptional skills in a related field

Fremont, California

Full time
