Staff Site Reliability Engineer, Engineering Tools – Fremont, California

Tesla

What to Expect
The Engineering Tools team manages features critical for enhancing developer productivity as well as Tesla’s internal communication platform to ensure all developers can collaborate seamlessly, share information effortlessly, and maintain smooth operations. Through tailored solutions and integrations, this group enables software development at scale across various internal organizations, including Autopilot, Firmware, Factory Software, and Manufacturing.

This Staff Site Reliability Engineer will be responsible for managing & maintaining critical engineering tools like GitHub, Bitbucket, SVN & Perforce for version control, Jira & Confluence for project tracking, Polarion for requirements management, and Artifactory for software artifact storage. The ideal candidate will have a strong background in both software engineering & systems administration, as well as a passion for automating & optimizing processes; their work will be instrumental in ensuring the reliability, scalability, and performance of our development capabilities across internal organizations.

What You’ll Do
Design, implement, and maintain automation solutions for provisioning, configuration, and monitoring of engineering tools infrastructure
Administer & support the Atlassian application stack (Jira, Confluence), ultimately remaining accountable for the high availability of our infrastructure
Administer Polarion, including configuration, OSLC plugin integration, workflows, reports, templates, access permissions, re-indexing, and restoration processes; work with users to address any issues or concerns promptly
Restore projects, work items, and live documents from the SVN repository
Collaborate with development and operations teams to ensure seamless integration and functionality of engineering tools within our CI/CD pipelines
Perform regular backups, upgrades, and patch management to ensure security & stability
Rapidly troubleshoot and resolve critical issues by identifying root causes across multiple layers (storage, OS, network, virtualization, & application/DB stack)
Conduct performance analysis & capacity planning to prevent service disruptions, anticipate future resource requirements, and optimize infrastructure
Participate in the on-call rotation and respond to incidents in a timely manner, resolving issues to minimize downtime & impact on users

What You’ll Bring
Experience with the installation, configuration, development, debugging, support, and upgrades of GitHub Enterprise
Proficient in setting up, managing & automating Jira projects, Confluence spaces, and permissions
Experience with setting up & maintaining Polarion in High Availability mode, as well as configuring templates, workflows, and permissions within the platform
Experience with general programming/scripting languages (Python, Shell, Golang) & automation frameworks (Ansible) to manage the administration, monitoring and development of custom plug-ins & workflows 
Knowledge of containerization technologies like Docker & orchestration tools like Kubernetes
Familiarity with monitoring & logging solutions such as Prometheus, Grafana and Splunk
Bachelor’s Degree in Computer Science, Computer Engineering, Information Technology, or proof of exceptional skills in a related field

Fremont, California

Full time

Job Overview