Job Description
About Us

We are a staffing services technology company that helps organizations design, build, and scale digital products and engineering capabilities. Our teams deliver end-to-end software development, engineering, and design services, and we provide flexible staffing solutions to augment internal teams with specialized talent, quickly and reliably.

The Role

We are seeking an innovative and resilient Cloud Engineer to join our distributed engineering team. This role focuses on designing, building, deploying, and operating scalable AI/ML infrastructure that enables product teams to prototype, train, and serve models with reliability and efficiency. You’ll bridge data science, backend engineering, and platform operations to ensure robust, observable, and cost-effective AI systems in production.

What You’ll Do

Cloud Architecture & Infra Design: Design and implement scalable, secure cloud architectures for AI/ML workloads across multiple environments (dev, staging, prod). Architect data pipelines, model training fleets, and model serving endpoints, and define incident response playbooks.

Platform & Automation: Build reusable platform components (CI/CD for ML, feature stores, model registry, experiment tracking, reusable pipelines) and automate deployment, scaling, and self-healing of AI services.

Model Deployment & Operations: Provision GPU/CPU clusters, manage containerized services (Docker/Kubernetes), implement inference caching, autoscaling, and canary/blue-green deployment strategies; monitor service health and model performance in production.

Observability & Governance: Instrument comprehensive monitoring, tracing, logging, and alerting; establish SLAs/SLOs for latency, availability, and model quality; implement cost controls and usage dashboards (an illustrative serving-and-metrics sketch follows this list).

Collaboration & Delivery: Work closely with Data Scientists, ML Engineers, Backend Engineers, and DevOps in an Agile environment to operationalize experiments, standardize APIs, and maintain clear documentation.

Security & Compliance: Implement secure coding and deployment practices; manage IAM, encryption at rest and in transit, and secret management, plus compliance considerations for regulated data environments when applicable.
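For illustration only: a minimal sketch of the kind of instrumented model-serving endpoint this role would operate, assuming FastAPI and prometheus_client as the stack; the predict() function, metric names, route paths, and module name are hypothetical placeholders, not a description of our actual systems.

    # Hypothetical serving sketch; run with: uvicorn serve_example:app
    import time
    from fastapi import FastAPI, Response
    from pydantic import BaseModel
    from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

    app = FastAPI()

    REQUESTS = Counter("inference_requests_total", "Total inference requests")
    LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds")

    class PredictRequest(BaseModel):
        features: list[float]

    def predict(features: list[float]) -> float:
        # Placeholder for a real model call (e.g., a Triton or TorchServe client)
        return sum(features) / max(len(features), 1)

    @app.post("/predict")
    def serve(req: PredictRequest):
        # Count the request and time the model call for latency SLO tracking
        REQUESTS.inc()
        start = time.perf_counter()
        score = predict(req.features)
        LATENCY.observe(time.perf_counter() - start)
        return {"score": score}

    @app.get("/metrics")
    def metrics():
        # Exposed for Prometheus to scrape; feeds the latency/availability dashboards
        return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

In this kind of setup, Prometheus scrapes /metrics and the latency histogram backs the SLOs and cost/usage dashboards described above.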
What We’re Looking For

Experience: 3+ years in cloud engineering, DevOps, or MLOps with production-grade systems; experience supporting AI/ML workloads is a plus.

Education: Bachelor’s or Master’s degree in Computer Science, Electrical/Computer Engineering, Mathematics, or a related field (or equivalent practical experience).

Cloud & Infra Proficiency: Strong hands-on experience with at least one major cloud provider (AWS, Azure, or GCP); familiarity with Kubernetes, containerization, and cloud-native services for compute, storage, and networking.

ML Infrastructure: Experience with ML lifecycle tooling (MLflow, Kubeflow, Weights & Biases, or equivalent) and with feature stores and ML metadata management concepts; comfort with model serving frameworks and GPUs (a short experiment-tracking sketch appears at the end of this posting).

Automation & CI/CD: Proficiency in CI/CD for data/ML workloads, IaC (Terraform, CloudFormation, ARM templates), Git workflows, and configuration management.

Programming & SRE Practices: Proficiency in Python or another language commonly used in ML operations; strong understanding of software engineering best practices (testing, code reviews, documentation).

Observability: Familiarity with monitoring/observability stacks (Prometheus, Grafana, OpenTelemetry, cloud logging/monitoring services); ability to define and track SLOs/SLIs.

Communication: Clear written and verbal communication; ability to translate technical concepts for non-technical stakeholders.

Remote/Collaboration: Comfortable working asynchronously in a distributed team; self-motivated and capable of prioritizing tasks in a dynamic environment.

Adaptability: Comfortable handling rapid changes in priorities, diagnosing issues across distributed systems, and turning incidents into learnings.

Bonus Points

ML/AI Platform Experience: Hands-on with ML model training pipelines, distributed training, or serving architectures; experience with RAG, vector databases, or LLM inference at scale.

GPU & GPU Orchestration: Experience managing GPU clusters, job schedulers, and cost-optimized GPU usage.

Data Compliance: Familiarity with HIPAA, SOC 2, GDPR, or other data-protection and compliance frameworks.
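For illustration only: a sketch of the experiment-tracking workflow the ML Infrastructure item refers to, assuming the mlflow package with its default local file store; the experiment name, parameters, and metric values are hypothetical.

    import mlflow

    # mlflow.set_tracking_uri("http://mlflow.example.internal:5000")  # hypothetical remote server;
    # without it, MLflow falls back to a local ./mlruns file store.

    mlflow.set_experiment("demand-forecast")  # hypothetical experiment name

    with mlflow.start_run(run_name="baseline"):
        # Record hyperparameters and evaluation metrics for this training run
        mlflow.log_params({"learning_rate": 0.05, "n_estimators": 200})
        mlflow.log_metric("val_rmse", 12.4)
        # A real pipeline would also log the trained model artifact and register it,
        # e.g. mlflow.sklearn.log_model(..., registered_model_name="demand-forecast")

Runs recorded this way are what feed the model registry and CI/CD promotion steps described under Platform & Automation.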