Amazon Web Services (AWS) offers research grants to facilitate the integration of computing processes with its cloud. Such a grant was awarded in March 2015 as a collaboration between NOvA and SCD. The scientific goal was to perform the required simulation and data analysis for the NOvA experiment's 2014/2015 data set on AWS.
It was estimated to require ~190,000 CPU hours of data processing and 1,900,000 CPU hours of simulation. Three campaigns were run to spend the grant, including the reconstruction of simulated neutrino events in the near detector (ND nonswap nogenierw). The experience gained running the campaigns allowed us to improve resource burst capacity, refine operational practices, introduce metadata and location declaration, and increase access to resources. For NOvA, HEP Cloud achieved a scale of 7,300 slots, more than 3 times the experiment's local slot allocation at Fermilab.
AWS is a highly preemptive environment, with a job failure rate of up to 60% for 1-day jobs. Despite the high failure rate, however, the processing was fairly efficient. Each NOvA job processed five files from the input dataset, so even a preempted job had a high likelihood of completing some of its five files before being terminated. Furthermore, preempted jobs were automatically resubmitted by HEP Cloud in an attempt to complete the full processing of the input dataset. Because jobs were handed files by the SAM data handling system, resubmitted jobs only processed files that had not been processed already, increasing the amount of useful work done. All in all, out of 57,000 files, a single campaign submission (the "2nd recovery" in the figure) processed 46,858 files, despite the high job preemption rate.
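The file-level recovery described above can be sketched as a toy simulation. This is a simplified model: the per-job preemption probability and the point at which a job is cut off are illustrative assumptions, not measured NOvA numbers; only the five-files-per-job grouping, the 57,000-file dataset size, and the "hand out only unprocessed files" behavior come from the text.

```python
import random

FILES_PER_JOB = 5
PREEMPTION_RATE = 0.6  # illustrative: "up to 60%" failure rate for 1-day jobs

def run_submission(files, rng):
    """Model one submission: each job is handed 5 not-yet-processed files
    and may be preempted partway through its batch."""
    processed = set()
    pending = sorted(files)
    while pending:
        batch, pending = pending[:FILES_PER_JOB], pending[FILES_PER_JOB:]
        if rng.random() < PREEMPTION_RATE:
            # Preempted: the job still completed the first few of its files.
            processed.update(batch[:rng.randrange(len(batch))])
        else:
            processed.update(batch)
    return processed

rng = random.Random(42)
dataset = set(range(57_000))
done = run_submission(dataset, rng)
# Recovery submissions: the data handling system hands out only files
# not yet processed, so resubmitted jobs repeat no completed work.
for _ in range(2):
    done |= run_submission(dataset - done, rng)
print(f"{len(done)} of {len(dataset)} files processed")
```

Even with a 60% preemption rate per submission, two recovery passes push the processed fraction above 90% in this model, which is the qualitative behavior the campaign observed.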
Efficiency per file:
- Total job time = 203k h
- Total time consumed = 131k h (64%) → (78% on next recovery)
- Total time failed = 3,353 h (2%) → (5% on next recovery)
- Total time preempted = 69k h (34%) → (16% on next recovery)
where "consumed" files were successfully processed, "failed" files had problems during processing (e.g. out-of-memory errors), and "preempted" files had their processing interrupted by preemption.
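The percentages in the list are simply each category's share of the total job time. As a quick arithmetic check, using the rounded k-hour figures from the list (so the last digit can differ by a percent from the figures quoted above):

```python
total = 203_000      # total job time, hours
consumed = 131_000   # files successfully processed
failed = 3_353       # e.g. out-of-memory errors
preempted = 69_000   # processing interrupted by preemption

for name, hours in [("consumed", consumed),
                    ("failed", failed),
                    ("preempted", preempted)]:
    print(f"{name}: {hours / total:.0%}")
# prints consumed: 65%, failed: 2%, preempted: 34%
# (the 64% quoted in the text reflects rounding of the underlying hours)
```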
Lessons learned include:
- Scale can be achieved only by combining multiple instance types and Availability Zones (AZs).
- Ultimately, cloud jobs are influenced by the health of GPGrid, the local job submission system.
- Running output-intensive jobs on the cloud and transferring the data back is, and will continue to be, costly.
- AWS is a highly preemptive environment.
- Declaring metadata and location to the data handling system directly from the worker node simplifies recovery operations and output transfer.
- It is important to proactively transfer the lessons learned (e.g. configuration changes) from the context of one experiment to another.
For more details, including a case study on egress, please see the presentation "NOvA Processing on HEPCloud – Experience" by Gabriele Garzoglio, or the full report "NOvA Experience on HEP Cloud: Amazon Web Services Demonstration".