Site Reliability Engineer

2 weeks ago


Ho Chi Minh City, Ho Chi Minh, Vietnam | PAVE | Full time | ₫4,000,000 - ₫12,000,000 per year

PAVE is an innovative automotive technology company transforming the way the world inspects vehicles. Powered by Intelligent Damage Detection capabilities, PAVE enables anyone with a smartphone to complete a guided vehicle inspection simply by taking photos of their car.

Headquartered in Toronto, our team brings deep expertise from both the automotive and technology industries, blending the best of artificial intelligence and automotive intelligence.

For more information, visit

Position Overview

We're seeking a skilled Site Reliability Engineer to join our DevOps team and ensure the stability and reliability of our enterprise vehicle inspection platform. Reporting to the Lead DevOps Engineer, you'll play a critical role in our GCP-to-AWS migration while maintaining and improving system reliability. As an SRE at PAVE, you'll implement best practices for monitoring, incident response, and automation to achieve 99.9%+ uptime. You'll work hands-on with AWS infrastructure to build resilient systems that process millions of vehicle inspections for dealerships, fleet operators, insurers, and vehicle marketplaces globally.

Key Responsibilities

System Reliability & Stability

  • Monitor and maintain production systems to ensure 99.9%+ uptime
  • Implement proactive monitoring and alerting to detect issues before they impact customers
  • Perform root cause analysis for incidents and implement permanent fixes
  • Create and maintain runbooks for common operational procedures
  • Participate in 24/7 on-call rotation and incident response
  • Conduct regular reliability reviews and implement improvements

AWS Infrastructure Management

  • Deploy and manage AWS services including EC2, ECS/EKS, RDS, S3, CloudFront
  • Optimize AWS infrastructure for performance, cost, and reliability
  • Implement AWS best practices for security, backup, and disaster recovery
  • Configure auto-scaling policies and load balancing for high availability
  • Manage AWS networking components (VPC, Security Groups, ALB/NLB)
  • Support migration efforts from GCP to AWS under the Lead DevOps Engineer's guidance

Monitoring & Observability

  • Design and implement comprehensive monitoring solutions using CloudWatch, Prometheus, Grafana
  • Set up distributed tracing and application performance monitoring
  • Create meaningful dashboards and alerts for service health
  • Define and track SLIs (Service Level Indicators) for critical services
  • Implement log aggregation and analysis using ELK stack or similar
  • Establish baseline metrics and identify performance anomalies

Automation & Infrastructure as Code

  • Develop automation scripts to reduce manual operations and toil
  • Implement Infrastructure as Code using Terraform and CloudFormation
  • Create CI/CD pipelines for reliable and repeatable deployments
  • Automate routine tasks such as backups, scaling, and maintenance
  • Build self-healing mechanisms for common failure scenarios
  • Develop tools to improve developer productivity and deployment velocity

Performance Optimization

  • Analyze system performance and identify bottlenecks
  • Optimize application and database performance
  • Implement caching strategies to reduce latency
  • Conduct load testing and capacity planning
  • Fine-tune resource allocation and utilization
  • Optimize cloud costs without compromising reliability

Incident Management

  • Respond to production incidents with urgency and professionalism
  • Follow incident management procedures and escalation protocols
  • Document incidents and contribute to post-mortem analysis
  • Implement preventive measures based on incident learnings
  • Improve MTTR (Mean Time To Recovery) through better tooling and processes
  • Maintain incident communication with stakeholders

Collaboration & Documentation

  • Work closely with development teams to improve application reliability
  • Provide guidance on reliability best practices during the design phase
  • Document infrastructure, procedures, and troubleshooting guides
  • Share knowledge through team presentations and training sessions
  • Collaborate on capacity planning and scaling strategies
  • Support developers with production debugging and optimization

Required Qualifications

Experience

  • 2-5 years of experience in DevOps, SRE, or Infrastructure Engineering
  • 2+ years of hands-on AWS experience in production environments
  • Experience maintaining high-traffic, high-availability systems
  • Proven track record of improving system reliability and uptime
  • Experience with 24/7 on-call responsibilities and incident management

Technical Skills

AWS Expertise:

  • Strong proficiency with core AWS services (EC2, S3, RDS, VPC, IAM)
  • Experience with container services (ECS, EKS, ECR)
  • Knowledge of AWS monitoring and logging (CloudWatch, CloudTrail)
  • Understanding of AWS security best practices
  • Experience with AWS CLI and SDKs
  • Familiarity with AWS Well-Architected Framework

SRE & DevOps Tools:

  • Infrastructure as Code: Terraform, CloudFormation, or AWS CDK
  • Configuration management: Ansible, Chef, or Puppet
  • CI/CD tools: Jenkins, GitLab CI, GitHub Actions
  • Containerization: Docker, Kubernetes, Helm
  • Version control: Git, GitHub/GitLab
  • Scripting languages: Python, Bash, or Go

Monitoring & Observability:

  • Prometheus, Grafana, or similar metrics platforms
  • Log management: ELK Stack, OpenSearch, or CloudWatch Logs
  • APM tools: New Relic, Datadog, or OpenSearch
  • Distributed tracing: Jaeger, Zipkin, or AWS X-Ray
  • Alert management: PagerDuty, Opsgenie, or similar

Technical Fundamentals:

  • Strong Linux/Unix system administration skills
  • Networking concepts: TCP/IP, DNS, Load Balancing, CDN
  • Database administration: PostgreSQL, MySQL, Redis, MongoDB
  • Understanding of distributed systems and microservices
  • Knowledge of security principles and best practices
  • Experience with performance tuning and optimization

Soft Skills

  • Strong problem-solving and troubleshooting abilities
  • Excellent written and verbal communication skills in both English and Vietnamese
  • Ability to work effectively under pressure during incidents
  • Detail-oriented with strong documentation skills
  • Team player with a collaborative mindset
  • Proactive approach to identifying and solving problems
  • Continuous learning mindset for new technologies

Preferred Qualifications

  • AWS certifications (SysOps Administrator, DevOps Engineer, or Solutions Architect)
  • Experience with GCP and cloud migration projects
  • Knowledge of SRE practices from Google's SRE book
  • Experience with AI/ML infrastructure and GPU workloads
  • Familiarity with automotive industry or vehicle inspection systems
  • Experience with chaos engineering and failure injection
  • Knowledge of compliance frameworks (SOC2, ISO 27001)
  • Experience with serverless architectures (Lambda, API Gateway)
  • Contributions to open-source DevOps/SRE projects
  • Experience with FinOps and cloud cost optimization

Success Metrics

  • Maintain 99.9%+ uptime for assigned services
  • Reduce incident MTTR by 30% within first year
  • Automate 50% of manual operational tasks
  • Zero critical security incidents
  • Achieve all SLO targets for assigned services
  • Complete AWS migration tasks on schedule

What We Offer

  • Competitive salary
  • Flexible work arrangements, including hybrid options
  • 13th-month bonus in accordance with company policy
  • Comprehensive health, dental, and vision insurance for the employee and one dependent
  • Professional development budget for AWS certifications
  • On-call compensation and time-off policies
  • Opportunity to work with cutting-edge cloud technologies
  • Career growth path to Senior SRE or Lead positions
  • Collaborative and innovative work environment

Location
Hybrid, District 1, Ho Chi Minh City.


