Roles and Responsibilities
The Site Reliability Engineering (SRE) team drives the reliability, recoverability and operational efficiency of this product portfolio. Reporting to the Global SRE Lead, the role involves implementing advanced observability, troubleshooting complex systems, automating tasks, and managing technical debt.
Members of the SRE team align to an Agile squad and are expected to work closely with the ITSM user community on day-to-day usage of the products, as well as with our internal development and engineering squads and the offshore team that provides first-line support.
Candidates will have the technical skills required to support these products on a Linux platform. Prior task automation experience in at least one programming language is expected, along with the resulting hands-on familiarity with typical software development lifecycle toolchains. Hands-on experience with at least one pillar of observability is required, ideally including experience in defining system monitoring rather than only reacting to alerts. Cloud experience is not necessary, as training can be provided, but it would be an advantage.
- Building and maintaining front-to-back knowledge of Application Infrastructure's IT Service Management products, and then specializing in one or two of them
- Maximizing the availability and performance of supported systems through optimized and automated plant management, ongoing problem management, and architecture reviews with dev-side peers
- Reducing the cost of support (hours of effort) by eliminating operational issues, optimizing and automating tasks, developing operational tools, and driving client self-service to minimize constraints
- Identifying and prioritizing technical debt that impacts client developer productivity, reliability or the efficiency of the ops team
- Troubleshooting complex issues in a Linux environment
- Consulting with clients (the Firm's internal development community and IT service practitioners) to maximize their productivity, including troubleshooting the issues they have using the department's products
- Minimizing the escalation rate to the dev-side product delivery team members to ensure the department has the greatest possible flow of feature delivery
- Being operationally responsive, including sharing the on-call rotation with the rest of the global team (with a time-off-in-lieu system)
Desired Candidate Profile
- Strong Linux troubleshooting skills
- Task automation experience in any programming language
- Practical experience of at least one pillar of observability (metrics, logs or traces)
- Working knowledge of at least ONE of the following areas:
o REST services (API)
o Load balancing and networking
o Performance troubleshooting and resolution
- Confident collaboration skills
- Python development for task automation
- Experience with site reliability engineering practices, such as service level objectives (SLOs), error budgets, blameless postmortems and toil reduction
- Prior experience creating operational dashboards (Splunk, Grafana, etc.)
- Experience administering and/or supporting ServiceNow
Role: Site Reliability Engineer
Salary: Not Disclosed by Recruiter
Functional Area: Engineering - Software & QA
Employment Type: Full Time, Permanent
UG: B.Tech/B.E. in Any Specialization
PG: MCA in Any Specialization, M.Tech in Any Specialization