SENIOR SOFTWARE ENGINEER
Walmart · Bangalore · 10+ yrs experience · Posted 2026-05-20
Tech stack: Cassandra, Docker, Go, Kafka, Kubernetes, MongoDB, MySQL, PostgreSQL, Python, SQL
About the role
We are building a unified observability platform that delivers 360° visibility across distributed systems with minimal instrumentation overhead. The platform seamlessly integrates with existing environments and workflows, leveraging AI-driven insights to detect, predict, and resolve issues in real time. Our goal is to enable self-healing systems through AI agents that autonomously diagnose and trigger remediation actions with minimal human intervention.
Responsibilities:
- Design, develop, and deploy scalable AI/ML models for anomaly detection, forecasting, and root-cause analysis.
- Build and optimize real-time inference APIs and services integrating ML pipelines into production.
- Develop data pipelines for large-scale telemetry, logs, metrics, and traces using event-driven architectures.
- Automate model training, evaluation, and deployment pipelines (MLOps).
- Continuously monitor model performance and optimize for accuracy, latency, and cost.
- Work closely with platform and SRE teams to build AI-powered automation and observability workflows.
- Build high-performance backend systems using Golang and modern design patterns.
- Architect distributed and fault-tolerant systems with strong fundamentals in concurrency, scalability, and resilience.
- Design multi-cloud applications using Kubernetes, Docker, and infrastructure-as-code tools.
- Implement service discovery, load balancing, and failure recovery mechanisms.
- Contribute to CI/CD, observability, and automation frameworks for production systems.
- Design data flows using Kafka, Pub/Sub, or similar event streaming platforms.
- Work with SQL (PostgreSQL/MySQL) and NoSQL (MongoDB, Cassandra, ClickHouse) databases for structured and unstructured data.
- Implement efficient data serialization, compression, and query optimization for large-scale data.
Collaboration & Technical Leadership:
- Collaborate with SRE, DevOps, and Product teams to integrate AI/ML features into observability workflows.
- Write clear design documents, architecture diagrams, and technical proposals.
- Contribute to long-term technical strategy and roadmap decisions.
- Mentor junior engineers on best practices in backend, ML systems, and distributed computing.
Qualifications:
- 5–10 years of software engineering experience, including 2–4 years in AI/ML engineering.
- Proven experience deploying ML models end-to-end (data ingestion → training → inference → monitoring).
- Strong coding skills in Golang (or Python with willingness to learn Go).
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related field.
- Strong understanding of algorithms, data structures, and system design.
- Experience with ML frameworks such as TensorFlow, PyTorch, or Scikit-learn.
- Hands-on experience with time-series modeling, anomaly detection, or forecasting.
- Exposure to LLMs, RAG pipelines, or agentic workflows for automation.
- Familiarity with MLOps tools like Kubeflow, MLflow, Vertex AI, or SageMaker.
- Proficiency with Kafka, Pub/Sub, or similar distributed messaging systems.
- Hands-on experience with SQL/NoSQL databases and schema design for performance at scale.
- Expertise in designing RESTful or gRPC APIs and scalable microservices.
- Strong focus on testing, CI/CD pipelines, and production readiness.
- Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry).
- Experience in real-time observability, AIOps, or incident management platforms.
- Knowledge of distributed consensus (Raft, Paxos) and event sourcing.
- Contributions to open-source ML, observability, or infrastructure projects.
- Familiarity with LLM orchestration frameworks (LangChain, Haystack, Semantic Kernel).