DevOps Engineer

About the Team

We are the DevOps team responsible for operating the ASUS xHIS (Healthcare Information System SaaS) platform. We manage both the Azure-hosted Rancher platform and on-prem Kubernetes clusters deployed across multiple hospitals, so you will work hands-on with both cloud-native architecture and on-prem environments.

We believe in an “automation-first, documentation-first, AI-augmented” working culture. We welcome people who keep learning and are willing to write down what they learn to share with the team.

What You’ll Do

Cloud & Cluster Operations
• Manage the full lifecycle of Azure Kubernetes Service (AKS), Rancher Server, and on-prem RKE2 clusters, including annual LTS upgrades
• Handle day-to-day Kubernetes operations: workload troubleshooting, resource tuning, RBAC configuration, storage and networking issues
CI/CD & Infrastructure Automation
• Design and maintain release pipelines using Azure DevOps Pipelines (artifact management, multi-environment promotion, service connections)
• Manage Azure resources with Terraform / Terragrunt (Resource Groups, Key Vault, Storage Accounts, Application Insights)
• Use Helm to deploy and maintain cluster services (ingress-nginx, cert-manager, Rancher, Loki, etc.)
Monitoring, Alerting & Logging
• Build and tune the Prometheus / Grafana / Loki monitoring and alerting stack
• Design dashboards, write alert rules, and integrate SMTP and other notification channels
• Architect cluster audit logging (Rancher API audit, K8s audit policy) with a Fluent Bit → Loki / Azure Blob dual-write design
On-prem Deployment Support
• Support on-prem K8s cluster deployment and environment setup: Ubuntu Server installation, static IP, firewall (UFW), LVM expansion, cgroup rules, TLS certificates, etc.
• Collaborate with hospital IT on networking, firewall, NTP, NAS, and related setup (most work can be done remotely)
Documentation & Knowledge Sharing
• Write and maintain SOPs and technical documentation (Confluence) so team knowledge compounds over time
• Integrate AI coding assistants (Claude Code, Cursor, Codex, Copilot, etc.) into daily ops workflows to improve troubleshooting, scripting, IaC authoring, and documentation efficiency

Must-have

We’re open to candidates with 2+ years of relevant experience. Some specialized skills can be picked up through on-the-job training (OJT), but the following are non-negotiable baselines for hitting the ground running:

• 2+ years of experience in DevOps / SRE / Backend / Cloud / Systems Administration
• Solid Linux system administration skills (Ubuntu Server, systemd, networking, disk management, SSH, basic shell scripting)
• Working knowledge of Kubernetes fundamentals (Pod, Deployment, Service, Ingress, ConfigMap, Secret), including the ability to troubleshoot with kubectl
• Hands-on experience with at least one public cloud (Azure preferred; AWS / GCP experience is transferable)
• Experience with any CI/CD pipeline tool (Azure DevOps, GitHub Actions, GitLab CI, Jenkins, etc.)
• Experience with any monitoring tool (Prometheus, Grafana, Datadog, CloudWatch, etc.)
• Strong Git workflow and technical writing skills
• Self-directed learner with strong troubleshooting instincts (important: you’ll encounter unfamiliar tools)

Nice-to-have

The skills below are a strong plus. If you don’t have them yet, we’ll help you build them through OJT and mentorship:

• Already integrating AI coding assistants (Claude Code, Codex, Cursor, etc.) into daily work (debugging, generating IaC, writing runbooks, automating routine checks)
• Experience using Rancher to manage multiple clusters
• Hands-on with RKE2 / K3s or other self-managed Kubernetes distributions
• Familiarity with Helm Chart authoring and values customization
• Familiarity with Terraform / Terragrunt or other IaC tooling
• Familiarity with Loki / Fluent Bit / EFK or similar log pipelines
• Exposure to on-prem / hybrid cloud deployment, including VM provisioning and network planning
• Hands-on with cert-manager / Let’s Encrypt / ingress-nginx

Working Conditions

On-call Rotation
This role supports core medical systems running in hospital environments, so participation in the team’s on-call rotation is required:

• Team members rotate on a scheduled basis to handle urgent incidents outside business hours
• Pages are infrequent under normal conditions, but while on call you’ll need to remain reachable by phone and able to log in and investigate within an agreed response time

What You’ll Gain

• Dual-track cloud and on-prem experience, a stronger résumé profile than pure-cloud or pure-on-prem roles
• End-to-end ownership of the cluster lifecycle, from Day 0 design to Day 2 upgrades
• Early hands-on involvement in bringing AI tooling into enterprise operations, a core skill for the next wave of DevOps
• Experience operating a healthcare SaaS platform in production
• A structured OJT and mentorship program to help you fill in the gaps