Site Reliability Engineer

Hyderabad, Telangana, India | Hexagon PPM | Full-time
almost 2 years ago

Job Description:

Overview

  • Hexagon’s PPM Development organization is looking for an experienced Site Reliability Engineer to join a global team charged with running our production cloud systems. You will do operations work alongside development teams as an engineer focused on eliminating toil and inefficiency. The ideal candidate has strong experience running best-in-class, modern cloud infrastructure, operations, and observability. As part of a brand-new SRE team, you will help decide what the team focuses on and which paths it takes. You will work among experienced peers who want to help you grow in your career. The team promotes positivity, shared ownership, accountability, and self-initiative.

Responsibilities

  • Continually redefine reliability goals and service-level objectives, measure against them, and improve our services as needed
  • Master hands-off administration of Kubernetes running on Azure Kubernetes Service (AKS)
  • Participate in On-Call rotation to respond to availability incidents and provide support for service engineers with customer incidents
  • Use your On-Call shift to prevent incidents from happening again
  • Follow an “automate all things” approach to service delivery and management
  • Code and deploy infrastructure efficiently using Terraform, Terraform Cloud, and Azure DevOps (AzDo)
  • Make monitoring and alerting trigger on symptoms and not on outages
  • Complete Root Cause Analysis (RCA) investigations and blameless post-mortems
  • Perform Readiness Reviews with internal service teams
  • Plan the growth and control the costs of our infrastructure
  • Create scalable and extendable patterns to apply across multiple teams

Educational Qualifications

  • Bachelor’s degree in CS, engineering, software engineering, or a related field.
  • 5-10 years of combined Operations and Software Development / Engineering experience, preferably in DevOps or SRE roles

Skills Required

  • Experience with at least one programming or scripting language (preferred: PowerShell, C#, Python, Go)
  • Solid understanding of the challenges and trade-offs involved in building and deploying systems to production
  • Kubernetes certifications, or an interest in obtaining them, are a big plus: Certified Kubernetes Administrator (CKA) and Certified Kubernetes Security Specialist (CKS)
  • Experience with large scale distributed cloud service development, infrastructure, traffic management, and architecture
  • Good self-awareness, accountability, and conflict resolution skills, and openness to receiving feedback
  • Kubernetes, Terraform, Azure DevOps, Microsoft Azure, PowerShell, C#, PagerDuty, GitOps, SRE, DevOps, Infrastructure as Code (IaC), Operations, Cloud, Docker, Helm, Flux

Soft skills required include:

  • Excellent communication skills (verbal and written).
  • Effective both in a team environment and when working independently.
  • Excellent problem-solving skills.
  • Able to work in a fast-paced, milestone-driven environment.
  • Ability to document designs and communicate status information clearly in writing, including email.