
Senior Software Engineer

Microsoft
United States, Nevada, Reno
6840 Sierra Center Parkway
Nov 02, 2025
Overview

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team is responsible for managing the core platform & fleet of AI High Performance Computing products that customers use to run their most performant and demanding workloads. The AI Customer Experience (AICE) engineering team within the HPC & AI Eng. team is on the frontlines managing the flagship supercomputers used by top tier AI customers that enable breakthroughs such as ChatGPT and are highlighted in Top500, MLPerf and Graph500 rankings.

Operating at supercomputing scale requires specialized tools and techniques to ensure system reliability, runtime performance, and job health, while continuing to meet customer Service Level Agreements (SLAs). As a Senior Supercomputing Software & Systems Engineer, you will be responsible for diagnosing & troubleshooting the largest scale supercomputing systems across the infrastructure stack (GPU hardware, networking, datacenter and core software). In this role, you will develop and apply advanced tools, identify operational gaps, and implement features that support the smooth operation of cloud-native supercomputers. This opportunity will give you hands-on experience developing capabilities to manage the largest scale of supercomputers delivered to our customers.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities

Collaborates with appropriate stakeholders to determine user requirements for a scenario.
Drives identification of dependencies and the development of design documents for a product, application, service, or platform.
Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance, maintainability, effectiveness, and return on investment (ROI).
Leverages subject-matter expertise of product features and partners with appropriate stakeholders (e.g., project managers) to drive a workgroup's project plans, release plans, and work items.
Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate.
Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale.