The Principal Site Reliability Engineer is responsible for the overall performance, stability, reliability, and scalability of enterprise-wide internet-facing systems, and ensures that Dealertrack’s complex, web-scale systems are healthy, monitored, automated, and designed to scale. Leads continuous improvement efforts and directs the incident response of cross-functional support teams to troubleshoot and resolve database, OS, application, network, and other issues. Consults with development teams to build high-performing solutions.
- Serves as a subject matter expert in several technical domains, including infrastructure, storage design, operating systems, networking, and engineering.
- Leads and drives continuous improvement initiatives in areas of domain expertise.
- Uses technical expertise to review and recommend ways of engineering reliability into the code, infrastructure, OS, network, and processes, so that the application is always fast, available, and scalable.
- Oversees the monitoring of software performance, packet flow, hardware, and code interactions in support of managing services, in order to predict and prevent failures.
- As a Subject Matter Expert, consults and provides recommendations to various development teams to ensure the availability, speed, scalability and efficiency of Dealertrack services by engineering reliability into software and monitoring systems.
- Responds to and resolves emergent service problems; builds custom tools to automate daily functions and prevent problem recurrence.
- Works in close contact with Architecture, Development and Infrastructure teams on software and system performance analysis and tuning, service capacity planning and demand forecasting; coordinates efforts of cross-functional teams to design and implement solutions.
- Bachelor’s degree in Computer Science or equivalent practical experience.
- 10+ years engineering and/or administering a high-volume or critical production service environment running on a UNIX/Linux platform.
- Strong working knowledge of C, C++, or Java, and of shell, Perl, or Python.
- Hands-on experience with Apache, JBoss, Tomcat, load balancers (F5), and firewalls.
- Understanding of IP networking, network devices and common topologies.
- Proven technical troubleshooting and performance tuning experience.
- Excellent analytical skills, coupled with a strong sense of ownership, urgency and drive.
- Working knowledge of the following: PureData (DB2), MongoDB, Redis, Qpid (MRG), DataPower.