Lead Production Support Analyst
Responsibilities
Operational & Production Support Leadership
- Lead day-to-day production support operations for Individual Solutions & WFG applications/services, ensuring high availability, performance, and stability.
- Act as the accountable owner for the production support operating model, including L1/L2/L3 routing, on-call rotations, escalation paths, and SLAs/SLOs.
- Oversee and coach a vendor/contractor support team, ensuring quality execution, clear accountability, and consistent outcomes across shifts/time zones.
- Own application onboarding into production support: ensure runbooks, SOPs, architecture diagrams, support metrics, monitoring/alerting, access, and DR/backup readiness are complete and current.
- Establish operational readiness standards across logging, monitoring, access controls, backup, disaster recovery, and maintenance windows.
Vendor Management & Service Delivery
- Manage vendor performance (tickets, SLAs, MTTR, quality of RCAs, repeat incidents, documentation hygiene) and drive continuous service improvement.
- Run recurring vendor governance: operational reviews, KPI scorecards, backlog prioritization, and corrective action plans.
- Coordinate with third-party providers for escalations, service requests, planned maintenance, patching, and production changes.
Incident, Problem & Change Management
- Serve as the primary escalation point for high-severity incidents; lead war rooms/bridge calls and drive timely resolution with strong communication.
- Ensure Root Cause Analysis (RCA) and Post-Incident Reviews (PIRs) are completed with actionable remediation, prevention plans, and measurable follow-through.
- Drive problem management: identify patterns and recurring issues using incident history, logs, and metrics; reduce repeat incidents through permanent fixes.
- Oversee change/release execution to minimize production risk: pre-change validation, approvals, rollback plans, post-release monitoring, and “go/no-go” decision support.
- Ensure adherence to ITSM processes and audit-ready evidence for incident/change/problem workflows.
Monitoring, Observability & Reliability
- Improve detection and response through dashboards, health checks, distributed tracing/APM, synthetic monitoring, and log correlation.
- Tune alerting to improve the signal-to-noise ratio; implement event correlation to prevent alert storms.
- Partner with engineering and platform teams to define and track error budgets (where applicable) and reliability improvements.
Continuous Improvement, Automation & Incident Reduction
- Proactively identify opportunities for automation (self-healing, auto-remediation, runbook automation, standardized scripts) that reduce toil and improve MTTR.
- Drive operational standardization: repeatable onboarding, consistent runbooks, automated checks, and common monitoring patterns.
- Lead initiatives focused on reducing incident volume, shortening recovery times, improving release quality, and removing manual steps from common procedures.
Technical Environment
Cloud Platforms
- AWS: EC2, Lambda, ECS/EKS, S3, CloudFront, Route 53, IAM, CloudWatch, API Gateway, Secrets Manager
- Azure: Virtual Machines, Azure Functions, App Service, AKS, Entra ID, Azure Monitor/Log Analytics, Key Vault, API Management, Azure Backup
Monitoring & Observability
- AppDynamics, Splunk, Prometheus, ELK, CloudWatch, Azure Monitor, Grafana
Incident & Event Management
- ServiceNow (Incident/Problem/Change/Event), BigPanda, JIRA
Infrastructure, Middleware & Platforms
- Linux/Windows Server fundamentals; networking basics (DNS, routing, load balancing, firewall rules)
- Middleware/servers (as applicable): NGINX/Apache, Tomcat/WebLogic/JBoss, Kafka/MQ patterns
CI/CD & Scheduling
- Jenkins/GitHub Actions/Cloud pipelines (where applicable)
- Control-M/Cron/Airflow (where applicable)
Security & Access
- IAM/role-based access, certificates, secrets management, key vaults
Qualifications
- 8+ years in production support, IT operations, cloud operations, or SRE/Platform operations, with 3+ years in a lead role (team lead, service owner, or vendor lead).
- Strong knowledge of ITSM/ITIL practices and hands-on experience with ServiceNow (Incident/Problem/Change; Event Management preferred).
- Demonstrated ability to lead high-severity incident response, drive cross-functional execution, and ensure disciplined RCA/PIR completion.
- Proven experience managing vendor/contractor teams, including performance management through KPIs, governance routines, and continuous improvement plans.
- Technical fluency across applications, infrastructure, cloud, and database layers, with the ability to guide triage and validate solutions.
- Strong documentation skills: runbooks, SOPs, support models, escalation procedures, and operational readiness checklists.
- Excellent communication skills, with the ability to translate complex technical events into business impact and executive-ready updates.
Preferred Qualifications
- Experience supporting financial services/insurance applications and regulated environments (audit, evidence capture, change controls).
- Experience implementing automation (runbook automation, scripting, auto-remediation) and improving observability practices.
- Exposure to SLO/SLI definitions, reliability reporting, and operational scorecards.
- Experience with multi-sourced/global delivery models and coordinating across time zones.
- Bachelor’s degree in Information Technology, Computer Science, or a related field (or equivalent experience); advanced degree a plus.
Working Conditions
- Hybrid office environment (in office Tuesdays, Wednesdays, and Thursdays)
- Moderate travel (10 to 25%)
This job description is not a contract of employment, nor is it a contract for any specific job responsibilities. The Company may change, add to, remove, or revoke the terms of this job description at its discretion. Managers may assign other duties and responsibilities as needed. In the event an employee or applicant requests or requires an accommodation to perform job functions, the applicable HR Business Partner should be contacted to evaluate the accommodation request.
Compensation
The salary for this position generally ranges from $114,000 to $140,000 annually. Please note that the salary range is a good faith estimate for this position, and actual starting pay is determined by several factors, including qualifications, experience, geography, work location designation (in-office, hybrid, remote), and operational needs. Salary may vary above and below the stated amounts, as permitted by applicable law.
Additionally, this position is typically eligible for an Annual Bonus based on the Company Bonus Plan/Individual Performance; any bonus is at the Company’s discretion.
Applicants must be authorized to work for any employer in the U.S. We are unable to sponsor or take over sponsorship of an employment visa at this time.
This is a hybrid position requiring three days in office per week in one of our hub locations (Denver, Cedar Rapids or Philadelphia). Relocation assistance will not be provided for this position.