Job Description
We are Nexus Future Systems, a pioneering research organization dedicated to architecting the technological backbone for the Artificial General Intelligence (AGI) era. As we accelerate delivery of our 2026 roadmap, we are seeking a visionary Lead AI Infrastructure Engineer to design, deploy, and scale the high-performance computing environments required for next-generation models.
In this role, you will bridge the gap between cutting-edge AI research and robust, scalable engineering. You will ensure our infrastructure can meet exascale computing demands, optimize deep learning workflows, and implement resilience strategies for future-proof systems.
Why Join Us?
- Work on the frontier of AI development.
- Competitive equity package and salary.
- Flexible remote-first culture with headquarters in San Francisco.
Responsibilities
- Architect and manage scalable, multi-region Kubernetes clusters optimized for training large language models (LLMs).
- Design high-throughput, low-latency inference pipelines for real-time AI applications.
- Implement and enforce rigorous security and compliance protocols for sensitive AI data.
- Collaborate with data scientists to optimize model training efficiency and reduce compute costs.
- Plan for and integrate emerging technologies such as quantum-ready hardware interfaces and edge computing nodes.
- Mentor a team of infrastructure engineers and define technical roadmaps for 2026 and beyond.
Qualifications
- 8+ years of experience in backend engineering, DevOps, or Site Reliability Engineering.
- Strong proficiency in Python, Go, or Rust, with deep knowledge of containerization and orchestration (Docker, Kubernetes).
- Extensive experience with cloud providers (AWS, GCP, or Azure) and serverless architectures.
- Proven track record of managing GPU clusters and high-performance computing (HPC) environments.
- Experience with observability tools (Prometheus, Grafana) and incident management (PagerDuty).
- Familiarity with AI/ML frameworks (PyTorch, TensorFlow) and MLOps tools (MLflow, Kubeflow).