Monitoring Large Scale HPC Systems: Understanding, Diagnosis, and Attribution of Performance Variation and Issues
Primary Session Leader
Event Type
Birds of a Feather
Intermediate
Performance
Location
155-F
Description
This BOF addresses critical issues in large-scale monitoring from the perspectives of HPC center system administrators, users, and vendors worldwide. This year's session will consist entirely of facilitated, interactive audience discussion of tools, techniques, experiences, and gaps in understanding, diagnosing, and attributing the causes of performance variation and poor performance. Such causes include contention for shared network and I/O resources and problems with system components. Our goal is to enhance the community's monitoring and analysis capabilities by identifying useful tools and techniques and by encouraging the development of quickstart guides for these tools, to be posted on the community web site: https://sites.google.com/site/monitoringlargescalehpcsystems/
Secondary Session Leaders











