Senior Site Reliability Engineer

Mastercard · Pune · 8+ yrs experience · Posted 2026-05-22

Tech stack: AWS, Azure, Docker, GCP, Java, Jenkins, Kafka, Kubernetes, Linux, Redis, SQL, Spring Boot, Unix

About the role

Mastercard is a technology company in global payments. We’re seeking a Senior Software Engineer for an SRE role with strong Java microservices development experience to improve the reliability, performance, and operability of mission-critical services. You will serve as a subject matter expert across development, testing, and support for production and non-production environments, driving reliability, monitoring and observability, performance engineering, root cause analysis, and automation to increase production stability. The role acts as a recognized technical authority with sound commercial awareness.

Qualifications

- BE/BTech in Computer Science or equivalent experience.
- 8+ years in Engineering/IT; 3+ years in enterprise Java development; 3+ years with Splunk/ELK (or similar log analytics); 3+ years with Dynatrace or other APM.
- Java, Spring, Spring Batch, Spring Boot, Microservices, RESTful APIs, MVC architecture.
- Databases: Oracle, SQL (queries & stored procedures).
- Platform/infra: Java app servers, web servers, Linux/Unix, Docker, Kubernetes, Kafka, Redis; Cloud Foundry (plus).
- Expertise in building dashboards/views in Splunk and Dynatrace is a plus.
- CI/CD: Jenkins, Git, Maven; Agile/DevOps practices; experience in SAFe at scale.
- Production monitoring, observability, RCA, capacity planning (redline/load testing).
- Performance testing/tuning/analysis; JMeter, BlazeMeter (or equivalent); thread/heap dump analysis.
- Oracle Database SQL and configuration tuning for performance (indexes, hints, execution plan analysis; SQLT, AWR, and similar report analysis).
- Hands-on experience with large-scale, complex, highly available distributed systems (web, relational and non-relational DBs, cache, pub/sub, containers), including resiliency, graceful degradation, DR, and backpressure patterns.
- Clear written/verbal communication; stakeholder engagement across operations, developers, and performance testers.
- Analytical/diagnostic strength; ability to work independently, in a matrix/virtual environment, under pressure and tight deadlines.
- Proven record of operational process changes and continuous improvement.
- Commercial awareness with alignment to business outcomes.
- Ability to influence the engineering function by providing guidance and expert advice and by facilitating services that support decision-making.
- Demonstrated capability to operate effectively within a matrix organization and collaborate with virtual teams.
- Proven ability to work autonomously, manage multiple tasks simultaneously, and assume responsibility for various aspects of projects or initiatives.
- Skilled in performing under pressure and adept at meeting strict deadlines or adapting to changing expectations and requirements.
- AI for Reliability: Applying AI tools to enhance monitoring, observability, automation, and system reliability; experience using Microsoft Copilot for coding, scripting, and automation.
- Big Data / AIOps: Experience with AIOps and Splunk Machine Learning Toolkit; Splunk administration.
- Metrics & Benchmarking: Comprehensive knowledge of design metrics, analytics tools, benchmarking, and reporting to capture best practices.
- How You’ll Work
  - Partner across SRE and engineering to leverage shared tools, processes, and techniques that improve reliability and reduce operational toil.
  - Identify recurring operational challenges and implement reusable, cross-functional solutions that lower cost, complexity, and risk.
  - Safeguard Mastercard’s reputation, clients, and assets through policy adherence and transparent risk management.

Responsibilities

- Reliability & Operations
  - Improve service reliability through better architecture, automation, capacity modeling/planning, and sub-linear operational scaling.
  - Lead incident handling, mitigation, RCA, and postmortems; clearly explain performance bottlenecks to stakeholders.
  - Build best-in-class monitoring/observability using Splunk and Dynatrace (or equivalent APM); create advanced dashboards, queries, and runbooks.
  - Maintain and improve program quality metrics such as availability, MTTM, latency, capacity, scalability, and reliability for the project.
- Engineering & Performance
  - Contribute to and review Java/Spring Boot microservices and RESTful applications for reliability and performance.
  - Drive performance testing, tuning, and analysis using tools such as JMeter and BlazeMeter (or similar), including thread/heap dump analysis and tuning of Java, SQL, network, and application configuration.
  - Partner with software engineering teams as a co-owner of resilient production services and CI/CD pipelines.
- Architecture & Tooling
  - Work across Java app servers, web servers, Docker, Kubernetes, Kafka, Redis, and cloud platforms (AWS/Azure/GCP); Cloud Foundry is a plus.
  - Lead POCs to incubate new features/capabilities; recommend product customization for system integration.
- Collaboration & Governance
  - Collaborate with SRE, operations, production support, performance testing, architects, and application owners.
  - Identify patterns and implement reusable solutions that reduce complexity and operational risk.
  - Uphold compliance with applicable laws, policies, and risk standards; demonstrate ethical judgment and transparency.
  - Mentor junior engineers and influence engineering decisions through counsel and expertise.
  - Work in large-scale agile software development environments using the SAFe framework (Scrum methodology).