Sr. Site Reliability Engineer Job at Tiger Analytics Inc., Washington DC

  • Tiger Analytics Inc.
  • Washington DC

Job Description

Role Overview

We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on MLOps: bridging the gap between model development and production-grade reliability.

Key Responsibilities

1. Reliability & Performance Engineering

  • SLA/SLO Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
  • Error Budgeting: Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
  • Scalability: Architect and manage auto-scaling strategies for Kubernetes (GKE) to handle fluctuating workloads during model training and high-volume inference.

2. MLOps & AI Infrastructure

  • Model Serving Reliability: Ensure the high availability of Vertex AI endpoints and custom inference services.
  • GPU/TPU Optimization: Monitor and optimize compute resource utilization (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
  • Pipeline Resilience: Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.

3. Automation & Orchestration (Eliminating "Toil")

  • Infrastructure as Code (IaC): Use Terraform or Pulumi to provision and manage consistent, version-controlled cloud environments.
  • CI/CD & GitOps: Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
  • Task Automation: Develop custom Python or Go scripts to automate repetitive operational tasks, self-healing mechanisms, and resource cleanup.

4. Monitoring, Alerting & Incident Response

  • Observability: Build and manage comprehensive dashboards using Prometheus, Grafana, or Google Cloud Operations Suite (formerly Stackdriver).
  • Incident Management: Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
  • Blameless Post-Mortems: Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

Requirements

Orchestration: Expert-level knowledge of Kubernetes (K8s) and Docker.

MLOps Stack: Familiarity with tools such as Kubeflow, Vertex AI, MLflow, or DVC.

Scripting: Strong proficiency in Python (for automation) and Bash; knowledge of Go is a plus.

Data Systems: Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or Vector Databases like Pinecone/Milvus).

Networking: Solid understanding of VPCs, Load Balancers, DNS, and secure service mesh (Istio/Anthos).

Benefits

Significant career development opportunities exist as the company grows. The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.

Tiger Analytics provides equal employment opportunities to applicants and employees without regard to race, color, religion, age, sex, sexual orientation, gender identity/expression, pregnancy, national origin, ancestry, marital status, protected veteran status, disability status, or any other basis as protected by federal, state, or local law.

Job Tags

Local area
