Lead Site Reliability Engineer

Zeta · Hyderabad · 6–10 yrs experience · Posted 2026-02-13

Tech stack: Linux, MySQL, SQL, Unix, PostgreSQL, Kubernetes, Security

About the role

We build product lines around key outcomes: addressing real customer pain points, modernizing legacy systems, and strengthening core fundamentals. As a result, our systems and platforms support a wide range of banking and payments capabilities, including:
1. Tachyon, our cloud-native banking stack built for population-scale systems
2. Cipher, our unified authentication platform for secure, high-volume banking environments
3. Digital Credit as a Service, enabling banks to launch credit lines on UPI
4. Elena, our intelligent and conversational AI platform for banking
5. Pixel, India’s first digital-native credit card, launched in partnership with HDFC Bank, for whom we also revamped the PayZapp mobile app (winner of the Celent Model Bank Award for Payments Innovation 2024)
6. Sparrow, the leading card experience for non-prime cardholders in the US
…and more across cards, payments, lending, and core banking.
We are an engineering-first organization that values ownership, bias for action, and long-term thinking. Together, we solve some of the hardest problems in banking tech. Our culture is built around trust, collaboration, and creating the conditions f

Responsibilities

- Establish an SRE site and help build an effective, inclusive SRE team.
- Provide technical leadership for the local team and work closely with partner-team technical leads and cloud leadership.
- Guide other team members on managing the availability and performance of mission-critical services, building automation to prevent problem recurrence, and building automated responses for non-exceptional service conditions.
- Manage execution of project priorities, deadlines, and deliverables.
- Lead incident management during incidents.
- Drive MTTR in line with the incident SLA.
- Ensure 100% alert coverage across application, infrastructure, security, flows, etc.

Qualifications

- 6–10 years of experience in distributed systems, storage systems, or databases; algorithms and data structures; and/or Unix/Linux systems internals (e.g., filesystems, system calls) and administration.
- Experience designing, analyzing, and troubleshooting large-scale distributed systems.
- Experience with MySQL or PostgreSQL databases.
- Hands-on experience operating k8s and any cloud platform.
- Excellent communication skills and a sense of ownership, with a systematic problem-solving approach.
