Site Reliability Engineer III (AIML SRE)
Are you looking for an exciting opportunity to join a dynamic and growing team in a fast-paced and challenging area? This is a unique opportunity to join our team and partner with the Business to provide a comprehensive view.
As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists.
Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients.
You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation.
We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.
Job Responsibilities:
* Define and refine Service Level Objectives (SLOs) for large language model serving and training systems, using metrics such as accuracy, fairness, latency, drift targets, time to first token (TTFT), and time per output token (TPOT), while balancing reliability and development velocity.
* Design, implement, and continuously improve monitoring systems to track availability, latency, drift, and other key metrics for robust observability and rapid issue detection.
* Collaborate in the design and deployment of high-availability language model serving infrastructure that supports high-traffic internal workloads across multiple regions and cloud providers.
* Champion site reliability engineering practices, providing technical leadership and fostering a culture of reliability, resilience, and continuous improvement across teams.
* Develop and manage automated failover and recovery systems for model serving deployments, ensuring seamless operation and rapid recovery from failures.
* Create and lead AI-specific incident response playbooks for issues like model drift or bias spikes, including automated rollbacks, circuit breakers, and systematic post-incident improvements.
* Build and maintain cost optimization systems for large-scale AI infrastructure, leveraging load balancing, caching, optimized GPU scheduling, and AI Gateways to ensure efficient, secure, and scalable operations.
Required qualifications, capabilities, and skills:
* Formal training or certification in AI reliability concepts and 3+ years of applied experience.
* Demonstrate a strong sense of curiosity and a passion for continuous learning, especially in the rapidly evolving field of AI reliability.
* Show proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices.
* Possess deep knowledge and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, an...
- Rate: Not Specified
- Location: Jersey City, US-NJ
- Type: Permanent
- Industry: Finance
- Recruiter: JPMorgan Chase Bank, N.A.
- Contact: Not Specified
- Reference: 210668647
- Posted: 2025-09-20 08:46:08