HEPCloud: a new paradigm for particle physics computing

The vision for HEPCloud includes:

  • functioning as a portal to an ecosystem of diverse computing resources, commercial or academic
  • providing “complete solutions” to users, with agreed-upon levels of service
  • having the facility route to local or remote resources based on workflow requirements, cost, and the efficiency of accessing various resources
  • managing allocations of users to target compute engines
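
The routing idea in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HEPCloud's actual decision logic: the resource names, core counts, costs, the `high_io` flag, and the `route` function are all hypothetical.

```python
# Hypothetical resource pool; names, capacities and costs are illustrative only.
resources = [
    {"name": "local-farm", "free_cores": 500, "cost_per_core_hour": 0.01, "high_io": True},
    {"name": "grid-site", "free_cores": 5000, "cost_per_core_hour": 0.02, "high_io": False},
    {"name": "commercial-cloud", "free_cores": 50000, "cost_per_core_hour": 0.05, "high_io": True},
]


def route(workflow, pool):
    """Return the cheapest resource that meets the workflow's requirements, or None."""
    eligible = [
        r for r in pool
        if r["free_cores"] >= workflow["cores"]
        and (r["high_io"] or not workflow["needs_high_io"])
    ]
    if not eligible:
        return None
    # Among the resources that satisfy the requirements, prefer the lowest cost.
    return min(eligible, key=lambda r: r["cost_per_core_hour"])


small = route({"cores": 200, "needs_high_io": True}, resources)
burst = route({"cores": 20000, "needs_high_io": True}, resources)
```

With these made-up numbers, the small workflow fits on the local farm, while the burst workflow only fits on the commercial cloud and is routed there despite the higher unit cost.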

As noted in the 2016 Fermilab Strategic Plan, “Particle physics requires copious computing resources to extract physics results. Such resources are delivered by various systems: local batch farms, grid sites, private and commercial clouds, and supercomputing centers. Historically, expert knowledge was required to access and concurrently use all these resources efficiently. Fermilab is pursuing a new paradigm in particle physics computing through a single managed portal (“HEPCloud”) that will allow more scientists, experiments, and projects to use more resources to extract more science, without the need for expert knowledge. HEPCloud will provide cost-effective access by optimizing usage across all available types of computing resources and will elastically expand the resource pool on short notice (e.g. by renting temporary resources on commercial clouds). This new elasticity, together with the transparent accessibility of resources, will change the way experiments use computing resources to produce physics results. The CMS collaboration was amongst the first users of HEPCloud. CMS was able to increase its resources by 50 thousand cores (approximately 1/3 of CMS’s world-wide available resources) by using Amazon Web Services cloud resources through HEPCloud for about one month. This enabled CMS to deliver more physics results for the Moriond conferences in Spring 2016 than were planned with non-HEPCloud resources.”

Fermilab Scientific Computing supports several types of dedicated and shared resources (CPU, disk, and hierarchical storage, including disk cache, tape, and tape libraries) for both data-intensive and compute-intensive scientific work. These are limited, however, to resources provisioned by and hosted at Fermilab, or to remote resources made available through the Open Science Grid. The resources may be dedicated or shared and, in some cases, are offered only at low priority, so that their use may be pre-empted by higher-priority demands. To reliably meet peak demands, Fermilab must still provision for the forecasted peak demand rather than the median or mean demand. This is not cost-effective, since some resources may be underutilized during non-peak periods even with the resource sharing enabled by grids. It can also lower scientific productivity if the forecasted demand is too low, since there is a long lead time to significantly increase current forms of local or remote resources.


An illustration of provisioning for average vs. provisioning for peak.
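
To make the contrast between provisioning for peak and provisioning for average concrete, here is a minimal Python sketch. The hourly demand figures are invented for illustration and do not come from Fermilab data, and the helper `provisioning_summary` is a hypothetical name.

```python
# Hypothetical demand in cores for eight consecutive hours (illustrative numbers).
demand = [2000, 2200, 2100, 2500, 9000, 12000, 3000, 2400]


def provisioning_summary(demand, capacity):
    """Return (idle, unmet) core-hours for a fixed capacity over the demand series."""
    idle = sum(max(capacity - d, 0) for d in demand)   # capacity paid for but unused
    unmet = sum(max(d - capacity, 0) for d in demand)  # demand the facility cannot serve
    return idle, unmet


peak = max(demand)
mean = sum(demand) / len(demand)

idle_at_peak, unmet_at_peak = provisioning_summary(demand, peak)
idle_at_mean, unmet_at_mean = provisioning_summary(demand, mean)

print(f"provision for peak ({peak} cores): idle={idle_at_peak}, unmet={unmet_at_peak}")
print(f"provision for mean ({mean:.0f} cores): idle={idle_at_mean:.0f}, unmet={unmet_at_mean:.0f}")
```

In this toy series, provisioning for the peak leaves 60800 core-hours idle, while provisioning for the mean leaves 12200 core-hours of demand unserved. An elastic facility would aim to cover that unserved portion with temporarily acquired remote resources instead of permanently owned ones.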

The goal is to extend the current Fermilab Computing Facility to transparently run on disparate resources, including commercial and community clouds, grid federations, and HPC centers. The Fermilab HEP Cloud Facility will enable experiments to perform the full spectrum of computing tasks, including data-intensive simulation and reconstruction, at production scale, irrespective of whether the resources are local, remote, or both. It will also allow Fermilab to provision scientific computing resources more efficiently and cost-effectively: elasticity lets the facility respond to demand peaks with a mix of local and remote resources, transparently to facility users, rather than by over-provisioning local resources.