Lead Production Support Analyst

The Lead Production Support & Operations role is responsible for end-to-end production support management for a defined line of business (Individual Solutions and WFG), ensuring availability, stability, performance, and operational excellence for business-critical applications and services. This Lead oversees a vendor/contractor production team, drives incident/problem/change rigor, and delivers measurable improvements through automation, monitoring enhancements, and operational standardization. The role is preferably hands-on (or strongly technically fluent), with the ability to guide triage, diagnose complex issues across application/infrastructure/database layers, and partner effectively with engineering, infrastructure, security, and business stakeholders.

Responsibilities

Operational & Production Support Leadership

Lead day-to-day production support operations for Individual Solutions & WFG applications/services, ensuring high availability, performance, and stability.
Act as the accountable owner for the production support operating model, including L1/L2/L3 routing, on-call rotations, escalation paths, and SLAs/SLOs.
Oversee and coach a vendor/contractor support team, ensuring quality execution, clear accountability, and consistent outcomes across shifts/time zones.
Own application onboarding into production support: ensure runbooks, SOPs, architecture diagrams, support metrics, monitoring/alerting, access, and DR/backup readiness are complete and current.
Establish operational readiness standards across logging, monitoring, access controls, backup, disaster recovery, and maintenance windows.

Vendor Management & Service Delivery

Manage vendor performance (tickets, SLAs, MTTR, quality of RCAs, repeat incidents, documentation hygiene) and drive continuous service improvement.
Run recurring vendor governance: operational reviews, KPI scorecards, backlog prioritization, and corrective action plans.
Coordinate with third-party providers for escalations, service requests, planned maintenance, patching, and production changes.

Incident, Problem & Change Management

Serve as the primary escalation point for high-severity incidents; lead war rooms/bridge calls and drive timely resolution with strong communication.
Ensure Root Cause Analysis (RCA) and Post-Incident Reviews (PIRs) are completed with actionable remediation, prevention plans, and measurable follow-through.
Drive problem management: identify patterns and recurring issues using incident history, logs, and metrics; reduce repeat incidents through permanent fixes.
Oversee change/release execution to minimize production risk: pre-change validation, approvals, rollback plans, post-release monitoring, and “go/no-go” decision support.
Ensure adherence to ITSM processes and audit-ready evidence for incident/change/problem workflows.

Monitoring, Observability & Reliability

Improve detection and response through dashboards, health checks, distributed tracing/APM, synthetic monitoring, and log correlation.
Tune alerting to reduce noise; implement event correlation to prevent alert storms.
Partner with engineering and platform teams to define and track error budgets (where applicable) and reliability improvements.

Continuous Improvement, Automation & Incident Reduction

Proactively identify opportunities for automation (self-healing, auto-remediation, runbook automation, standardized scripts) that reduce toil and improve MTTR.
Drive operational standardization: repeatable onboarding, consistent runbooks, automated checks, and common monitoring patterns.
Lead initiatives focused on reducing incident volume, shortening recovery times, improving release quality, and removing manual steps from common procedures.

Technical Environment

Cloud Platforms

AWS: EC2, Lambda, ECS/EKS, S3, CloudFront, Route 53, IAM, CloudWatch, API Gateway, Secrets Manager
Azure: Virtual Machines, Azure Functions, App Service, AKS, Entra ID, Azure Monitor/Log Analytics, Key Vault, API Management, Azure Backup

Monitoring & Observability

AppDynamics, Splunk, Prometheus, ELK, CloudWatch, Azure Monitor, Grafana

Incident & Event Management

ServiceNow (Incident/Problem/Change/Event), BigPanda, JIRA

Infrastructure, Middleware & Platforms

Linux/Windows Server fundamentals; networking basics (DNS, routing, LB, firewall rules)
Middleware/servers (as applicable): NGINX/Apache, Tomcat/WebLogic/JBoss, Kafka/MQ patterns

CI/CD & Scheduling

Jenkins/GitHub Actions/Cloud pipelines (where applicable)
Control-M/Cron/Airflow (where applicable)

Security & Access

IAM/role-based access, certificates, secrets management, key vaults

Qualifications

8+ years in production support, IT operations, cloud operations, or SRE/Platform operations, with 3+ years in a lead role (team lead, service owner, or vendor lead).
Strong knowledge of ITSM/ITIL practices and hands-on experience with ServiceNow (Inc/Prob/Chg; Event Mgmt preferred).
Demonstrated ability to lead high-severity incident response, drive cross-functional execution, and ensure disciplined RCA/PIR completion.
Proven experience managing vendor/contractor teams, including performance management through KPIs, governance routines, and continuous improvement plans.
Technical fluency across application, infrastructure, cloud, and database layers, with the ability to guide triage and validate solutions.
Strong documentation skills: runbooks, SOPs, support models, escalation procedures, and operational readiness checklists.
Excellent communication skills, with the ability to translate complex technical events into business impact and executive-ready updates.

Preferred Qualifications

Experience supporting financial services/insurance applications and regulated environments (audit, evidence capture, change controls).
Experience implementing automation (runbook automation, scripting, auto-remediation) and improving observability practices.
Exposure to SLO/SLI definitions, reliability reporting, and operational scorecards.
Experience with multi-sourced/global delivery models and coordinating across time zones.
Bachelor’s degree in Information Technology, Computer Science, or a related field (or equivalent experience); advanced degree a plus.

Working Conditions

Hybrid - Office Environment (Tuesdays, Wednesdays, Thursdays)
Moderate Travel 10 to 25%

This job description is not a contract for employment or for any specific job responsibilities. The Company may change, add to, remove, or revoke the terms of this job description at its discretion. Managers may assign other duties and responsibilities as needed. In the event an employee or applicant requests or requires an accommodation to perform job functions, the applicable HR Business Partner should be contacted to evaluate the accommodation request.

Compensation

The salary for this position generally ranges from $114,000 to $140,000 annually. Please note that the salary range is a good faith estimate for this position, and actual starting pay is determined by several factors, including qualifications, experience, geography, work location designation (in-office, hybrid, remote), and operational needs. Salary may vary above and below the stated amounts, as permitted by applicable law.

Additionally, this position is typically eligible for an annual bonus based on the Company Bonus Plan and individual performance, awarded at the Company’s discretion.

Applicants must be authorized to work for any employer in the U.S. We are unable to sponsor or take over sponsorship of an employment visa at this time.

This is a hybrid position requiring three days in office per week in one of our hub locations (Denver, Cedar Rapids or Philadelphia). Relocation assistance will not be provided for this position.
