Shire Veteran Jobs

Job Information

Georgia Employer SENIOR SITE RELIABILITY ENGINEER in Atlanta, Georgia

Job Details

Immediate need for a talented Senior Site Reliability Engineer with experience in the airline industry. This is a 6+ month contract opportunity with long-term potential, located in Atlanta, GA. Please review the job description below.

Job ID: 22-32314

Key Responsibilities:

* Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
* Support capacity planning, availability, scalability, security, and latency considerations for new infrastructure and service provisioning as appropriate
* Improve the end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
* Partner with business and technical product owners to set SLOs / SLIs / error budgets to manage the reliability of infrastructure and applications
* Partner with other SREs to share best practices and learnings from across the organization
* Scale and optimize existing infrastructure and services sustainably through mechanisms such as automation, and evolve them by improving reliability and efficiency
* Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence
* Maintain infrastructure (infrastructure as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging, and Production environments
* Practice sustainable incident response and blameless postmortems
* Build infrastructure and drive projects that deliberately break things in order to improve the robustness of production systems
* Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, and production readiness reviews to run the platform
* Step back to observe patterns and develop innovative tools and automation to eliminate or minimize menial tasks. Use those learnings to drive the best operational practices
* Develop and maintain solution and operational documentation and designs for all infrastructure and services within the scope of SRE
* Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation
* Maintain operational uptime and reliability by participating in triage and issue support calls for mission-critical systems

Who are you?

* Strong experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications
* Proficient in one or more of the following scripting languages or tools: JavaScript, Node.js, Python, Maven, Ansible, Bash, etc.
* Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, Ansible
* Proven history of toil elimination by leveraging automation
* Strong background using tools like PagerDuty for managing incidents
* Strong experience with monitoring and alerting systems like Prometheus, Grafana, Datadog
* Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI model, subnetting, and load balancing strategies

Key Requirements and Technology Experience:

* Experience in Serverless Application Framework
* Experience in containerized workloads and management platforms such as Docker or Kubernetes
* Familiarity with distributed systems, including microservices, is a plus
* Experience in infrastructure automation tools such as CloudFormation, Terraform
* Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, Bamboo
* Strong debugging, troubleshooting, and problem-solving skills
* Effective communication, collaboration, and negotiation skills with the ability to interface with various business units and third parties
* Experience liaising with developers, operations staff and third-party res