System Administrator

Advanced Micro Devices, Inc.
$143,280.00/Yr.-$214,920.00/Yr.
United States, Georgia, Atlanta
Aug 22, 2025
WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences: the building blocks for the data center, artificial intelligence, PCs, gaming, and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance_

THE ROLE:

AMD is seeking a hands-on System Administrator to join our Engineering Operations team in Atlanta. You will support and optimize large-scale, multi-GPU/CPU ML infrastructure to enable world-class AI and rendering research. Collaborating with teams across North America and Europe, you will ensure the stability, performance, and operational reliability of the infrastructure through proactive monitoring, hardware and software maintenance, troubleshooting, and system optimization. This is a technical, lab-facing position that requires both system administration expertise and strong problem-solving skills to support ML engineers, researchers, and data scientists.

THE PERSON:

You are a hands-on engineer with a passion for high-performance computing and infrastructure automation. You thrive in fast-paced environments, enjoy solving complex technical challenges, and are comfortable working independently or as part of a distributed team. You are proficient in hardware maintenance tasks, including hardware replacement, as well as server diagnostics and troubleshooting. You have comprehensive networking knowledge and know how to resolve network issues. Additionally, you coordinate effectively with other teams to ensure timely and efficient resolution of technical issues and project execution.
KEY RESPONSIBILITIES:

Operate, configure, and maintain on-premises GPU/CPU server clusters and lab machines to support machine learning workloads and research applications.
Design, implement, and manage monitoring solutions for cluster health, resource utilization, and application performance (using tools such as Prometheus and Grafana).
Proactively monitor for and address system issues and performance bottlenecks, and plan capacity.
Automate infrastructure provisioning, configuration, and maintenance tasks to improve operational efficiency and uptime.
Support worldwide ML/AI teams in optimizing workloads for cluster performance and reliability.
Manage the server cluster hardware lifecycle: installation, upgrades, troubleshooting, and decommissioning.
Maintain and document standard operating procedures, system configurations, troubleshooting guides, and hardware inventories.
Coordinate with local IT teams to facilitate effective troubleshooting, hardware procurement, and infrastructure upgrades.
Stay up to date with new DevOps tools, AI infrastructure improvements, and performance optimization techniques.

PREFERRED EXPERIENCE:

Proven experience in Linux and Windows server administration and management of enterprise-grade on-premises infrastructure.
Experience with enterprise render farms, large-scale NAS, and distributed storage management.
Strong hardware skills, including diagnostics, repair, and component replacement for servers, GPUs, and storage devices.
Solid understanding of networking concepts (DHCP, DNS, VLANs) and network troubleshooting.
Understanding of monitoring and alerting tools (e.g., Prometheus, Grafana, Nagios) for system health and performance management.
Familiarity with hardware diagnostics tools and best practices for detecting and resolving physical component failures.
Ability to manage and maintain complex wiring, cabling, and power management systems within server rooms or lab environments.
Experience with inventory management systems for tracking hardware assets, spares, and consumables.
Knowledge of safety protocols and compliance requirements specific to data centers and technical lab spaces.
Basic understanding of storage systems, including RAID configuration and disk array troubleshooting.
Comfort working with hardware vendors and coordinating on-site service and warranty repairs.

ACADEMIC CREDENTIALS:

Computer Science, Computer Engineering, Electrical Engineering, or a closely related field.

Location: Atlanta, GA Data Center (Onsite)

Benefits offered are described: AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.