About the Team
We are the DevOps team responsible for operating the ASUS xHIS (Healthcare Information System SaaS) platform. We manage both the Azure-hosted Rancher platform and on-prem Kubernetes clusters deployed across multiple hospitals, so you will work in both cloud-native and on-prem environments and build technical breadth as well as depth.
We believe in an “automation-first, documentation-first, AI-augmented” working culture. We welcome people who keep learning and are willing to write down what they learn to share with the team.
What You’ll Do
- Cloud & Cluster Operations
• Manage the full lifecycle of Azure Kubernetes Service (AKS), Rancher Server, and on-prem RKE2 clusters, including annual LTS upgrades
• Handle day-to-day Kubernetes operations: workload troubleshooting, resource tuning, RBAC configuration, storage and networking issues
- CI/CD & Infrastructure Automation
• Design and maintain release pipelines using Azure DevOps Pipelines (artifact management, multi-environment promotion, service connections)
• Manage Azure resources with Terraform / Terragrunt (Resource Groups, Key Vault, Storage Accounts, Application Insights)
• Use Helm to deploy and maintain cluster services (ingress-nginx, cert-manager, Rancher, Loki, etc.)
- Monitoring, Alerting & Logging
• Build and tune the Prometheus / Grafana / Loki monitoring and alerting stack
• Design dashboards, write alert rules, and integrate SMTP and other notification channels
• Architect cluster audit logging (Rancher API audit, K8s audit policy) with a Fluent Bit → Loki / Azure Blob dual-write design
- On-prem Deployment Support
• Support on-prem K8s cluster deployment and environment setup: Ubuntu Server installation, static IP, firewall (UFW), LVM expansion, cgroup rules, TLS certificates, etc.
• Collaborate with hospital IT on networking, firewall, NTP, NAS, and related setup (most work can be done remotely)
- Documentation & Knowledge Sharing
• Write and maintain SOPs and technical documentation (Confluence) so team knowledge compounds over time
• Integrate AI coding assistants (Claude Code, Cursor, Codex, Copilot, etc.) into daily ops workflows to improve troubleshooting, scripting, IaC authoring, and documentation efficiency
Must-have
We’re open to candidates with 2+ years of relevant experience. Some specialized skills can be picked up through on-the-job training (OJT), but the following are non-negotiable baselines for hitting the ground running:
- 2+ years of experience in DevOps / SRE / Backend / Cloud / Systems Administration
- Solid Linux system administration skills (Ubuntu Server, systemd, networking, disk management, SSH, basic shell scripting)
- Working knowledge of Kubernetes fundamentals (Pod, Deployment, Service, Ingress, ConfigMap, Secret — able to troubleshoot with kubectl)
- Hands-on experience with at least one public cloud (Azure preferred; AWS / GCP experience is transferable)
- Experience with any CI/CD pipeline tool (Azure DevOps, GitHub Actions, GitLab CI, Jenkins, etc.)
- Experience with any monitoring tool (Prometheus, Grafana, Datadog, CloudWatch, etc.)
- Solid command of Git workflows and strong technical writing skills
- Self-directed learner with strong troubleshooting instincts (important — you’ll encounter unfamiliar tools)
Nice-to-have
The skills below are a strong plus. If you don’t have them yet, we’ll help you build them through OJT and mentorship:
- Already integrating AI coding assistants (Claude Code, Codex, Cursor, etc.) into daily work (debugging, generating IaC, writing runbooks, automating routine checks)
- Experience using Rancher to manage multiple clusters
- Hands-on experience with RKE2 / K3s or other self-managed Kubernetes distributions
- Familiarity with Helm chart authoring and values customization
- Familiarity with Terraform / Terragrunt or other IaC tooling
- Familiarity with Loki / Fluent Bit / EFK or similar log pipelines
- Exposure to on-prem / hybrid cloud deployment, including VM provisioning and network planning
- Hands-on experience with cert-manager / Let’s Encrypt / ingress-nginx
Working Conditions
On-call Rotation
This role supports core medical systems running in hospital environments, so participation in the team’s on-call rotation is required:
- Team members rotate on a scheduled basis to handle urgent incidents outside business hours
- Out-of-hours incidents are infrequent under normal conditions, but you’ll need to remain reachable by phone and able to log in to investigate within an agreed response time
What You’ll Gain
- Dual-track cloud and on-prem experience, a broader profile than pure-cloud or pure-on-prem roles typically offer
- End-to-end ownership of cluster lifecycle, from Day 0 design to Day 2 upgrades
- Early hands-on involvement in bringing AI tooling into enterprise operations — a core skill for the next wave of DevOps
- Experience operating a healthcare SaaS platform in production
- A structured OJT and mentorship program to help you fill in the gaps