Manager – Incident Management, Central Technical Operations Services (CTOS) – USA, TX, Austin

Amazon

  • Full Time

DESCRIPTION

Amazon Stores Central Technical Operations Services (CTOS) is the first line of defense for maintaining high availability of the Amazon Retail Website. We make customer-impacting events shorter, less frequent, and less severe by providing large-scale event and incident management. The Amazon Retail Website has hundreds of millions of customers globally who can be impacted by these types of incidents; the work we do to mitigate them helps real people at a tremendous scale. Our automated tooling quickly identifies the cause of an issue and helps mitigate the impact, and much of our engineers’ time is spent on projects to improve the tooling, automation, and processes to avoid future occurrences. We help direct the resolution of an issue to the relevant service teams, and dive deep into those events retrospectively to drive improvements to our process. It’s an exciting time to join our team as we are rapidly growing and expanding our offerings globally.

Key job responsibilities
– Lead, mentor, and manage a highly adaptable and skilled Incident Response Team.

– Develop and maintain Incident response plans, procedures, and runbooks to effectively respond to Corporate and Online Production incidents.

– Coordinate and oversee Incident response activities, including Detection, Localization, Mitigation, and Recovery efforts.

– Serve as Incident Commander, the primary point of contact for Corporate and Online Production incidents. Coordinate communication and collaboration with internal and external stakeholders.

– Lead investigations into Corporate and Online Production incidents to determine the root cause, scope, and impact of the Incident.

– Provide leadership and guidance during outage situations, including coordinating cross-functional teams and communicating with senior leadership and stakeholders.

– Conduct post-incident actions such as analysis, lessons learned, and reviews to identify areas for improvement to the Incident and Problem Management process.

– Prepare and deliver Incident response reports, presentations, and other communications to senior leadership and stakeholders.

– Incident Management: Perform the Incident Commander role for outage calls. Be the gold standard for “what excellent looks like.”

– Regularly audit documentation to ensure it’s up to date and applicable to the current process.

– Lead reviews of Incident calls to ensure every Incident Manager is consistently raising the bar.

– Work with the team to generate reporting and derive insights from past incidents to identify trends in outage behaviors.

A day in the life
– Conduct stand-ups with the Incident Management team

– Lead the discussion around Incident Analytics and leverage the team to identify the pain points of the Incident process

– Lead documentation reviews

– Conduct one-on-one meetings with each Incident Manager who reports to you.

– Track performance of each team member against the Role Guidelines expectations.

– Identify and create KPIs to raise the bar of the CTOS Incident Management practice.

– Generate career plans with the team to ensure that they have guidance on where to learn and grow.

BASIC QUALIFICATIONS

– Bachelor’s Degree in an IT-related field (e.g. Information Systems, IT Management, Engineering, Computer Science), or equivalent experience.

– 5+ years of experience in managing high-velocity teams in the Incident Management space

– 5+ years of experience with networking, infrastructure, and cloud systems, and with incident and event management tools such as PagerDuty, Datadog, Moogsoft, SolarWinds, Nagios, etc.

– Proven experience (5+ years) in Incident response, with a focus on managing and leading incident response efforts.

– Knowledge of, and/or a completed certification in, incident response frameworks and best practices, such as the ITIL 4 Foundation, Professional, or Strategic Leader certifications.

PREFERRED QUALIFICATIONS

– Strong leadership, communication, and interpersonal skills.

– Strong understanding of Incident Management principles, of detection technologies and platforms, and of the Problem Management process.

– Ability to work effectively in a fast-paced, dynamic environment and manage competing priorities.

– Strong analytical and problem-solving capability, with the ability to analyze complex incidents and identify effective response strategies.

Amazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. For individuals with disabilities who would like to request an accommodation, please visit https://www.amazon.jobs/en/disability/us.

 
