
OLCF’s Justin Whitt
Everything about the Frontier supercomputer, the world’s first exascale system, residing at Oak Ridge National Laboratory, is outsized: its power, its scale, and the attention it attracts. In HPC circles, that attention has increasingly focused on performance issues as the lab prepares the system for “full user operations” by next January.
We reported issues with Frontier’s HPE Cray Slingshot fabric late last year and again in the spring of this year, issues that the lab and HPE worked to overcome before Frontier was certified last May by the TOP500 organization as having surpassed the exaFLOPS milestone on the HPL (High Performance LINPACK) benchmark. The current issues center on Frontier’s stability when running very demanding workloads, with some of them involving AMD’s Instinct GPU accelerators, which carry most of the system’s processing workload and are paired with AMD EPYC CPUs in the system blades.
In interviews with us this week, Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), confirmed that he and his staff have encountered problems, but he pointed out that they are typical of those he has encountered during his decade-plus of testing and tuning leadership-class supercomputers at the lab.
“It’s mostly scaling issues associated with applications at scale, so the issues we’re having are mostly around running very, very large jobs using the entire system … and getting all the gear running at once to do it,” Whitt said. “It’s sort of the final exam for supercomputers. It’s the hardest part to get right. And that’s the kind of issues we’re running into at this point, with the tuning being general enough to benefit a wide range of applications.”
He said running the HPL benchmark is different from exercising the system with scientific applications running “without hardware failure, without a hitch in the network, and with everything settled.”
Whitt declined to go into detail about Frontier’s “hiccups,” but said he and his team are working to improve Frontier’s current mean time between failures.
“We’re working on the hardware issues and making sure we understand [what they are],” he said, “because you’re going to have outages at this scale. The mean time between failures on a system this size is hours, not days, so you need to make sure that you understand what those failures are and that there’s no pattern to those failures that you need to worry about. It’s about tuning the programming environment so that you … get maximum performance on applications.”
The goal, Whitt said, is to enable users to be productive in their scientific research, which varies by application. A full-day run with no system crashes “would be exceptional,” Whitt said. “Our target is still hours,” though longer than Frontier’s current mean time between failures, he added, noting that “we are not very far from our target.”
Whitt declined to pin most of Frontier’s current challenges on the operation of the Instinct GPUs. “The issues span many different categories, GPUs being just one.”
“A lot of the challenges center around those, but that’s not the majority of the challenges we see,” he said. “That’s a pretty good spread among the common culprits of part failures that have been a big part of it. I don’t think at this point we have much concern about AMD products. We’re facing a lot of the early-life stuff we’ve seen with other machines we’ve deployed, so it’s not too unusual.”
That said, Whitt noted that the problems presented by Frontier have “been a little more difficult” because of the machine’s scale: it includes 685 different part types and 60 million parts in total.
Adding to the pressure to finalize the system by January are pandemic-related supply chain issues that delayed Frontier’s delivery by about three months, which in turn delayed the start of testing and tuning.
“We’re nearing the end of the process, and we’re largely on track,” Whitt said. “When we made the plan for Frontier in 2019, even at the end of 2018, we said that we would be ready for user programs on January 1, 2023. And that’s where we still are, we’re on track to be ready for user programs then. We took the hit of a quarter with the supply chain issues… We’re getting very close to the end of this part of the schedule.”