B.2 Executive Summary
As demonstrated in Section A, Compute Canada (CC) supports a vibrant community of researchers spanning all disciplines and regions in Canada. Providing access to world-class infrastructure and expert personnel supports Canadian researchers. All Canadian university researchers have equal opportunity to access the CC resources. Larger requests are accommodated through an annual peer review allocation process that ensures Compute Canada is providing access and support to the most promising research in Canada.
The advanced research computing (ARC) needs of the Canadian research community continue to grow as the next generation of scientific instruments is deployed, as ARC becomes relevant to answering key questions in an ever broader list of disciplines, as new datasets are gathered and mined in innovative ways, and as technological advances allow researchers to construct ever more precise models of the world around us. The current CC infrastructure must keep pace with the needs of Canadian researchers.
This proposal addresses the urgent requirement to replace many aging systems with a consolidated set of systems designed expressly to meet Canadian research needs. These systems are designed to balance the need for technical innovation with ongoing productivity, avoiding technologies that may require many months of refinement before research groups can effectively use them. The new systems are designed to meet the needs of the broad range of users identified in CC’s Strategic Plan. These upgrades will improve services to both “traditional” users who focus on the number of cores available, and “newer” users who need a balance of technology leadership as well as service and support leadership.
In order to promote effective and efficient use of new infrastructure, CC will offer researchers common identity management, software environments and data management tools across a national network of facilities. Integrated services will be matched with the development of a nationally coordinated support regime. Local user support will continue to be provided by on-campus personnel, augmented by a national network of subject matter experts as well as supported user communities.
As the CC data centre footprint is consolidated, a stronger network of systems administrators will be able to serve a wider range of systems, both locally and remotely. Working with our regional partners, we will create a deeper pool of expertise in critical areas such as file systems management, networking, systems software, applications software, and code optimization. This will allow CC to increase the level and professionalism of its service to the community without significantly increasing investments in personnel.
Compute Canada, through consultation with Canadian researchers, has developed a well-documented forecast of needs versus the current capacity and the expected capacity with the current planned investments. The funding available through the Canada Foundation for Innovation’s (CFI’s) Challenge 2, Stage-1 Cyberinfrastructure Initiative is not sufficient to meet all of these needs. As such, choices must be made about which needs will be supported, and to what degree. CC has developed a balanced approach as the recommended baseline option in this proposal. Two alternative options have been developed which shift the balance in favour of either tightly coupled computations or data analytics. Pursuing these alternative options in stage-1 comes at a cost to the existing CC supported science programme. Assuming the baseline option is chosen in stage-1, the alternative options are directions Compute Canada is likely to pursue with the additional funding available in stage-2. The technological refresh and changes to the service delivery model in stage-1 will empower Canadian researchers to pursue leading-edge research. The extensive benefits to Canada documented in Section A will continue as Canadians continue to push forward the boundaries of their disciplines and compete on an international stage. Revolutionary change in many fields and the resulting societal benefits now rely critically on ARC. From personalized medicine to better aircraft design, from the modelling of novel materials to modelling the Canadian economy, CC will continue to enable the creation of new knowledge across a broad range of domains.
B.3 Need for the Infrastructure
B.3.1 Immediate and Pressing Needs
CC currently operates 50 systems in 27 data centres across the country. More than half of the roughly 200,000 computational cores in operation today were deployed in 2010 or earlier and are hence already beyond their normal lifespan of five years. These pre-2011 systems also provide more than 25% of currently available storage resources. The vast majority of the remaining resources were deployed in 2011 and 2012 and will reach their nominal lifespan in 2016 or 2017. As it stands today, most of the pre-2011 systems are on limited maintenance contracts covering only critical components. For the sake of system reliability, there is an urgent need to replace existing infrastructure.
Ignoring concerns about reliability, operating costs for maintenance and repairs are growing yearly as older systems reach the end of their originally purchased warranties and manufacturers no longer offer service on obsolete components. In addition, normal improvements in efficiency mean that modern systems would deliver similar computational performance for much lower electrical energy costs. Maintenance and energy costs need to be reduced to allow increased investment in support and service.
Finally, regardless of reliability or the cost of operations, CC has reached the limits of compute and storage capacity that can be allocated to its most excellent research users. Demand continues to increase, while the ability to meet that demand is falling.
B.3.2 Responding to the Needs of Existing and Emerging Research Communities
The needs of the research community have evolved since the last round of major capital purchases by CC. The rapid growth in data-intensive research has strained the capacity of CC to meet data storage needs for ongoing research projects. The problems being solved via modelling of materials, biological molecules, and other complex systems (e.g. earth-ocean) have increased in precision and concomitant computational intensity. Adoption of accelerators (GPUs) is revolutionizing certain types of problem solving, such as machine learning (so-called deep learning). For some emerging areas (e.g. image analysis), the required system memory per computational core has exceeded the capacity of most existing CC systems, such that use of these systems is becoming less efficient for some problems and impossible for others. In addition to hardware infrastructure changes, the way that researchers interact with the infrastructure has also changed dramatically in the last five years with the emergence of cloud computing and the proliferation of scientific gateways and data portals. In order to adapt to modern workloads, there is an urgent need to replace existing infrastructure.
As illustrated in Section A, CC now serves a rapidly growing number of researchers across a wide range of disciplines. Assessing the ARC needs of such a broad group is challenging and CC has undertaken extensive consultations in order to engage the community. This consultation has included:
• A needs survey distributed to all CC users in the autumn of 2013 (more than 200 faculty responses).
• More than 20 in-person consultations at various Canadian campuses in the winter of 2013-14. This was associated with the writing of the attached Compute Canada strategic plan. Several online-only consultation sessions were also offered.
• A call for white papers was issued in summer 2014. 23 papers were received from a variety of disciplinary bodies and institutions.
• An Advisory Council on Research (ACOR) was formed in 2013 and met regularly through proposal submission to give input to the planning process.
• A draft infrastructure proposal was posted on the Compute Canada website and was broadcast to the CC researcher mailing list (more than 10,000 people) in January 2015.
• In-person consultations were held at 6 locations across Canada in January 2015. This was followed by an online-only consultation session.
In addition, user data from the Compute Canada Database (CCDB) was mined for the 2010-2015 period to search for usage trends. Existing usage data was then combined with the consultation data described above and was compared to international trends. While CC has made extensive efforts to capture needs from all areas of science, there remains an unavoidable bias towards existing CC users compared to researchers in emerging disciplines due to the different response rates from the two communities.
B.3.3 Current and Anticipated Needs by Thematic Research Area
Each of the thematic research areas identified in Section A will see increasing demand for infrastructure over the next 5 years. In some cases, this is due to a constant progression of the field towards more complex models and more compute-intensive treatments. In other cases, anticipated advances in instrumentation are expected to drive data-intensive research in a certain field. Some examples are provided below, organized by the thematic areas of Section A. Common to all thematic areas is the need for expert personnel to enable efficient use of ARC resources in cutting-edge research.
Theme 1: Materials Science, Condensed Matter and Nanoscience
A white paper in this area was submitted to the CC SPARC process by 28 faculty members from 12 Canadian universities. That paper illustrated that the growth in this field is driven by the need for realistic and experimentally relevant real materials simulations. Materials are studied on multiple length and timescales and the methods vary according to those scales. Much of the computation is accomplished today using homemade codes specialized to solve a certain problem of interest. However, Canadians are also involved in some large multi-national initiatives to produce more general-purpose software. The United States is currently funding the “Materials Genome Initiative” to “speed up our understanding of the fundamentals of material science, providing a wealth of practical information that entrepreneurs and innovators will be able to use to develop new products and processes”. In particular, “The initiative funds the development of computational tools, software, new methods for material characterization, and the development of open standards and databases.” As such, this area is poised for substantial growth in computational need (at least a factor of 5 in the next 5 years). Roughly half of the usage in materials science is expected to be serial in nature while the other half would benefit from being able to run parallel codes on highly connected machines. Given the choice, this community would prioritize maximizing the number of cores deployed over optimizing the machine interconnect. The importance of acceleration via FPGA and GPGPU is evolving rapidly.
Theme 2: Chemistry, Biochemistry and Biophysics
This area currently represents the single largest utilization of CC CPU by discipline. This CPU is used to solve problems using molecular dynamics (MD) simulations, quantum mechanical calculations that explore electronic and molecular structure, ab initio MD simulations that derive molecular interactions from first principles, and hybrid techniques.
In order to achieve further advances or to provide new insights, researchers need to move to more detailed descriptions and better models, larger systems, and/or longer timescales. Given that these approaches are in essentially all cases computationally intensive, this translates into significant need for greater computational power, with implications such as increased memory, increased storage, and increased parallelism (need for fast interconnects). Approximately 65% of the CPU time consumed by computational chemistry calculations on CC resources today is by jobs which are at least moderately parallel (64 cores) and 12% is consumed by highly parallel jobs (at least 1024 cores). This community is also extensively exploring the use of GPU accelerators and sees at least a factor-of-four improvement in calculation speed when supported by an accelerator.
Theme 3: Bioinformatics and Medicine
“Over the next decade almost every biomedical investigation in basic and clinical research will be enabled through characterization of an accompanying genome sequence. Genomic technologies have become a critical component not only in human health research but also in other fields such as: agriculture, fisheries, forestry and mining. With next-generation sequencing technologies revolutionizing the life sciences, data processing and interpretation, rather than data production, has become the major limiting factor for new discoveries. In this context, the availability of advanced research computing resources has become a key issue for the genomics community.” – Advanced Research Computing Resources and Needs at 4 Canadian Genome centres (submitted to SPARC process)
The increased demand in genomics will be primarily driven by three factors: improvements in instrumentation, the use of more advanced analysis strategies on acquired genome data, and increased demand for access to informatics infrastructure to utilize large international public datasets. The estimated growth in this area is at least a factor of 8 in CPU and nearly a factor of 30 in disk storage over the next 5 years.
Generally speaking, computations in this area require a “Big Data” infrastructure including high-throughput disk arrays. For some types of analysis, high-memory nodes are required (e.g. at least 512GB per node). Most applications do not take advantage of a high degree of parallelism.
Data privacy restrictions are important considerations in serving the ARC needs in this area. Many projects involve identifiable personal health information that must be protected by both appropriate policies and appropriate technological safeguards. Medical research is now the largest category of special resource allocation requests received by Compute Canada each year. While the number of requests is growing rapidly, each request is not (yet) as compute or storage intensive as requests from some other disciplines. Adopting a better security posture at new data centres is an important adjustment that CC must make in order to serve this community. Since 2012, CC has added two major centres (BC Genome Science Centre and HPC4Health) to the organization in this area. In 2015, CC became a partner in a successful Genomics Innovation Network proposal to Genome Canada and is generally playing an active role in supporting the Canadian genomics community. Providing service to this community is a clear priority for CC and can only be enabled through new infrastructure purchases.
Theme 4: Earth, Ocean and Atmospheric Sciences
A white paper on the needs of the ocean modelling community from researchers at 10 Canadian universities was submitted to the Compute Canada SPARC process. This community strives to “improve our basic understanding of oceanographic processes and our ability to simulate, predict and project physical, biological and chemical ocean characteristics on timescales from days, weeks and seasons to centuries”.
This community currently uses parallel codes which scale well in the range from 100-1000 cores and so requires large compute clusters with high-speed interconnect between the nodes. The lack of a dedicated large parallel machine in Compute Canada with scheduling optimized for large jobs means that members of this community typically wait for days to begin a single calculation. The presently available infrastructure limits the temporal and spatial resolution possible. Doubling the resolution leads to an increase in required compute power of roughly an order of magnitude. Moving from 2-dimensional to 3-dimensional models, which are now becoming more common, increases the required computational power by 2-3 orders of magnitude. This community requires increased capacity in tightly coupled cores in order to remain competitive.
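The "order of magnitude" figure for doubled resolution can be motivated with a simple scaling sketch. It assumes, purely as an illustration that is not taken from the white paper, that cost is proportional to the number of grid points times the number of time steps, and that the time step must shrink in proportion to the grid spacing. The function name and its parameters are hypothetical.

```python
def cost_multiplier(refinement: float, spatial_dims: int, refine_time: bool = True) -> float:
    """Relative compute cost after refining every spatial axis by `refinement`.

    Assumption (not from the source): cost ~ grid points x time steps, and the
    time step shrinks by the same factor as the grid spacing.
    """
    exponent = spatial_dims + (1 if refine_time else 0)
    return refinement ** exponent


# Doubling the resolution of a 3-dimensional and a 2-dimensional model.
print("3-D model, resolution doubled:", cost_multiplier(2, 3))
print("2-D model, resolution doubled:", cost_multiplier(2, 2))
```

Under these assumptions, doubling the resolution of a 3-dimensional model multiplies the cost by 16, which is consistent with "roughly an order of magnitude". Real ocean codes may scale differently depending on their numerical scheme.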
Theme 5: Subatomic Physics and Astronomy
The Canadian subatomic physics community is involved in several high-profile global experiments with significant computational, storage and advanced networking needs. A group of 39 Canadian faculty members currently participate in the ATLAS experiment at the Large Hadron Collider (LHC). Run I at the LHC completed in 2012 and featured the discovery of the Higgs boson. Run II begins in the summer of 2015 with upgraded energy and a doubling in the data-taking rate. The demand for high-throughput storage will grow throughout Run II, which ends in mid-2018. The instrument will then undergo upgrades and will return in the early 2020s at an even higher data-taking rate. Several other major subatomic experiments served by Compute Canada are also being upgraded or are coming online in the next 5 years.
ATLAS compute and storage needs in Canada are currently met by the Tier-1 computing centre at TRIUMF and by four Tier-2 computing centres within Compute Canada. In preparation for this proposal, Compute Canada and TRIUMF have agreed to pursue a partnership in which the current TRIUMF Tier-1 staff would join Compute Canada and Tier-1 functionality would be transitioned from TRIUMF to one of the new consolidated Compute Canada data centres. The Tier-1, which requires 24x7 support and a high-bandwidth connection to CERN, would be co-located with a Compute Canada Tier-2 centre. As part of this process, Compute Canada would consolidate ATLAS Tier-2 support from four sites to two. This is a more efficient operational arrangement and represents a major redesign for ATLAS computing support in Canada.
Experimental subatomic physics requires large quantities of high-throughput storage and nearby computation cores to process the data. The jobs are generally serial, or parallel over a small number of cores (e.g. 8), though GPUs are starting to be used and provide a significant advantage for specific types of calculations. Memory requirements are generally moderate (e.g. 4GB/core). In future, centres that support ATLAS must provide 100Gb connectivity to the LHCONE network. Theoretical subatomic physics often relies on parallel codes that scale across interconnected nodes to at least 100-1000 cores, depending on the sub-discipline.
CANFAR, a collaborative effort of the Canadian university astronomy community, currently makes Canadian astronomy data available to researchers around the world. This platform also provides compute resources that enable those researchers to process and analyze that data. The CANFAR platform operates on Compute Canada resources. The Canadian Astronomy Data Centre (CADC) currently hosts copies of the raw data, as well as database and other support services that are necessary for the proper functioning of CANFAR. Compute Canada and CADC are currently discussing a 3-year plan to migrate these core services to Compute Canada (costs to be paid by the National Research Council, outside the scope of the MSI project award). The Compute Canada services would continue to be supported by CADC personnel.
The CANFAR platform has recently been migrated from a Nimbus cloud to the new Compute Canada cloud systems, which run OpenStack. For some workloads, such as image processing, it requires high-memory nodes (512GB per node). While observational data processing tends to be serial in nature, this is not the case for theoretical astronomy, astrophysics and astrochemistry. These calculations require a large number of computational cores in tightly coupled systems.
Theme 6: Computer and Information Sciences
Computer scientists naturally push some of the technological boundaries of ARC in a variety of technical domains. Compute Canada serves a diverse set of Canadian computer scientists including a strong machine learning community. In particular, the Canadian machine learning community is making extensive use of GPU co-processing in order to mine data using deep learning techniques. These techniques are relied upon for the artificial intelligence behind modern image and speech recognition and are expected to see significant growth in breadth of application. In the coming years, the group of Yoshua Bengio expects to require 240 GPUs for its 60-person laboratory. Across Compute Canada, this research field alone could productively use more than 1000 GPUs, which offer 10-20x speed-ups compared to conventional CPU processing for this type of application.
Theme 7: Social Sciences and Humanities
While Compute Canada resource usage in social sciences and humanities is currently small as a fraction of overall compute and storage usage, this is a growth area in which the delivery and support of services is often more important than the scale.
One limiting factor in the exploitation of CC resources by researchers in the social sciences has been the need to manage private data sets. While CC has recently taken responsibility for housing and managing RCMP crime data at a particular site in collaboration with a local computational criminology group, this is an exception rather than the norm. Adopting an enhanced security posture (both in policy and technology) is vital to supporting social science researchers. Over the last year, CC has engaged in detailed discussions with the Canadian Research Data Centre Network (CRDCN) and Statistics Canada around access by researchers to Statistics Canada datasets. CC is assisting with the design of the refresh of the CRDCN platform and may come to play an ongoing role in this area.
CC received a white paper submission from the Canadian Society for Digital Humanities, which laid out their most pressing needs going forward. In addition to enhanced training resources and specialist Digital Humanities (DH) support personnel, they requested a cloud-based web-accessible infrastructure backed by significant storage resources. CC has invited DH researchers to be beta testers of the Compute Canada cloud and is working closely with these groups to ensure that the required cloud services are available on the infrastructure deployed as a result of this proposal.
B.3.4 Projecting Demand for Compute and Storage Resources
Based on responses to community consultations and analysis of existing usage data, CC has undertaken an exercise to project future infrastructure needs for the Canadian community. The projections below are based on the growing needs of existing Compute Canada users and do not account for anticipated growth in the CC user base.
Computation
In response to a survey distributed to CC users in fall 2013, respondents ranked computational resources as their number one current and future need from Compute Canada. The SPARC white papers demonstrated a broad need for increased computational resources over the next 5 years as shown in the table below.
White Paper                       Predicted Increase from Current to 2020
Numerical Relativity              3x
Subatomic Physics                 3x
Materials Research                5x
Canadian Genome Centres           8x
Canadian Astronomical Society     10x
Theoretical Chemistry             12x
Weighting by current usage by discipline, this leads to an average expected increase of 7x over 5 years. It should be noted that, in some cases, the range of responses within a discipline may include researchers who need 100x over the next 5 years. Based on this and on international norms, the growth rate used here should be considered as a lower bound.
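To make the 5-year multipliers easier to compare with yearly planning figures, the short sketch below converts each one into an equivalent constant annual growth factor. It assumes compound (exponential) growth; the helper name `annual_factor` is hypothetical. The per-discipline usage weights behind the 7x average are not given in this section, so the weighted average itself is taken as quoted rather than recomputed.

```python
def annual_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to `total_factor` over `years`."""
    return total_factor ** (1.0 / years)


# 5-year increases quoted in the white-paper table above.
white_paper_growth = {
    "Numerical Relativity": 3,
    "Subatomic Physics": 3,
    "Materials Research": 5,
    "Canadian Genome Centres": 8,
    "Canadian Astronomical Society": 10,
    "Theoretical Chemistry": 12,
}

for name, factor in white_paper_growth.items():
    print(f"{name}: {factor}x over 5 years is about {annual_factor(factor, 5):.2f}x per year")

# The usage-weighted average of 7x quoted in the text, annualized.
print(f"Weighted average: 7x over 5 years is about {annual_factor(7, 5):.2f}x per year")
```

Under the compound-growth assumption, the 7x weighted average corresponds to demand growing by a little under 50% each year.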
Storage
Many communities see storage growth rates at least commensurate with their compute growth. However, research communities analyzing datasets collected from a variety of different instruments or agencies see additional storage growth beyond their ability to grow computational power. CC has already witnessed a rapid increase in storage demand that has outstripped the supply at existing sites.
The Canadian subatomic physics community has some of the largest storage allocations on Compute Canada resources today. This discipline represents “traditional big data”. The long timelines of the associated experiments and relative maturity of the field mean that the storage growth rate is predictable and controlled. This provides us with an example of a large base experiencing only modest growth. By contrast, in some disciplines the pace of change is very rapid, making it impossible to apply predictable growth limits to the data in advance. As an example, sequencing production in the four largest Canadian genome centres currently doubles every 12 months. The table below illustrates anticipated storage growth from these two Canadian “Big Data” communities. The growth in disk needs for subatomic physics is a relatively modest factor of 3 over the 5-year period from 2015-2020. In contrast, the disk need in genomics increases by a factor of 27 over the same period.
Storage Requirements Growth
                                2014   2016   2018   2020
Subatomic Physics Disk (PB)       13     19     27     37
Genome Centre Disk (PB)           17     51    153    459
Total Disk (PB)                   30     70    180    496
Subatomic Physics Tape (PB)        6     10     16     31
Genome Centre Tape (PB)           13     38    114    343
Total Tape (PB)                   19     48    130    374
In addition, other communities report very rapid growth rates. Neuroimaging researchers supported by a CC Research Platform and Portals award have projected a 14x growth in storage need over the next 3 years. As a result of these expected increases, CC has conservatively assumed an average growth rate of 15x over the next 5 years.
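The storage figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the table totals, derives the growth factor between the first and last table columns (2014 and 2020), and annualizes the assumed 15x average under a compound-growth assumption. The variable and function names are hypothetical; the numbers are copied from the table above.

```python
YEARS = [2014, 2016, 2018, 2020]
DISK_PB = {
    "Subatomic Physics": [13, 19, 27, 37],
    "Genome Centre": [17, 51, 153, 459],
}
TAPE_PB = {
    "Subatomic Physics": [6, 10, 16, 31],
    "Genome Centre": [13, 38, 114, 343],
}


def column_totals(rows):
    """Sum each year's column across all communities."""
    return [sum(column) for column in zip(*rows.values())]


def growth_factor(series):
    """Ratio of the last value in a series to the first."""
    return series[-1] / series[0]


def annual_factor(total_factor, years):
    """Constant yearly multiplier that compounds to `total_factor` over `years`."""
    return total_factor ** (1.0 / years)


print("Total disk (PB):", column_totals(DISK_PB))
print("Total tape (PB):", column_totals(TAPE_PB))
for name, series in DISK_PB.items():
    print(f"{name} disk, {YEARS[0]} to {YEARS[-1]}: {growth_factor(series):.1f}x")
print(f"15x over 5 years is about {annual_factor(15, 5):.2f}x per year")
```

The table endpoints give a factor of 27 for genome centre disk and roughly 2.8 for subatomic physics disk, in line with the "factor of 27" and "factor of 3" quoted in the text.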
Compute and Storage Projections
Using the compute and storage numbers above, CC has produced the growth curves shown below.
For the compute projections, the unit “core-years (CY)” is used. This represents the amount of computation that can be performed by a single computational core running constantly for 1 year, or the computations performed by 12 such cores in one month, etc. (based on the cores deployed in the current CC fleet). The solid line represents demand as extracted from recent CC resource allocation competition data. Future years are calculated using the weighted average 7x growth rate over 5 years described above and assuming that the growth is exponential in form. For the supply curve (blue), it is assumed that the full $15M CFI award in 2015 is allotted to Compute Canada, that the baseline option in this proposal is funded and that the resulting equipment comes online in 2017. When this comes online, pre-2011 systems are assumed to be decommissioned, leading to a net drop in core-count. It is further assumed that the full $15M CFI award in 2016 is allotted to Compute Canada and that this equipment comes online in 2018. This leads to the first real increase in core-count since 2012. Since there are no further CFI competitions approved at this time, no increases are assumed beyond 2018.
For the storage projections the unit petabytes (PB) is used. The solid yellow line is again demand extracted from recent resource allocation competitions and the future demand projections (dashed) use the 15x growth rate over 5 years assuming an exponential form. In estimating the supply (blue), the full stage-1 and stage-2 Cyberinfrastructure funding is assumed. It is further assumed that some storage from stage-1 is front-loaded into the 2016 fiscal year in order to meet pressing current demand.
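The exponential demand projection and the core-year unit described above can be sketched as follows. The base demand of 1.0 is a normalized placeholder, since the actual values plotted in the figures are not reproduced in the text; the function names are hypothetical.

```python
def core_years(cores: float, months: float) -> float:
    """Core-years delivered by `cores` cores running constantly for `months` months."""
    return cores * months / 12.0


def project_demand(base: float, total_factor: float, horizon_years: float, years_ahead: float) -> float:
    """Exponential projection: demand grows by `total_factor` every `horizon_years`."""
    return base * total_factor ** (years_ahead / horizon_years)


# One core for a year and twelve cores for a month are the same amount of work.
print(core_years(1, 12), core_years(12, 1))

# Demand relative to the starting year (normalized to 1.0, a placeholder value).
for years_ahead in range(6):
    compute = project_demand(1.0, 7, 5, years_ahead)
    storage = project_demand(1.0, 15, 5, years_ahead)
    print(f"year +{years_ahead}: compute {compute:.2f}x, storage {storage:.2f}x")
```

Multiplying the normalized values by a measured base year of demand would give curves of the same shape as the dashed projections described above.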
B.3.5 Current Job Size Distribution
CC currently supports a wide range of computational needs. The figures below provide two ways to view the number of cores used in a typical Compute Canada computation (or “job”). The plot on the left shows the number of core years used in CC as a function of the year. The various colours illustrate the fractions of those core years in bins of cores-per-job. It shows, for example, that nearly 50% of CPU consumption in 2014 was by jobs using at least 128 cores. The plot on the right illustrates what fraction of the CC user base (counting project groups, not individual users) have submitted at least one job using a given number of cores, shown as a function of time.
Further information about parallelism in the CC user community is visible in the table below, which summarizes usage data for 2014. In this table, the first column represents the minimum number of computational cores used in a single “job”. The second column represents the fraction of project groups that have submitted at least one job of at least that many cores. The third column represents the fraction of total CPU usage represented by jobs of at least that many cores. This means, for example, that 19% of user groups submitted at least one job of at least 256 parallel cores and that these jobs represent 31% of all CPU resources consumed in 2014.
2014 Summary of Data Usage
Min. Number of Cores/Job    Fraction of Groups (%)    Fraction of CPU Usage (%)
1024                        5 – 6                     10
512                         11                        19
256                         19                        31
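Because each row of the table counts jobs of "at least" a given size, the CPU fractions are cumulative. The sketch below converts them into the share consumed within each size band, which can be easier to read. The function name is hypothetical and the band labelled 0 stands for all jobs smaller than the lowest threshold in the table.

```python
# Cumulative CPU share (%) for jobs using at least this many cores (2014 table).
cumulative_cpu = {1024: 10, 512: 19, 256: 31}


def per_band(cumulative):
    """Convert 'at least N cores' shares into shares for each band between thresholds."""
    bands = {}
    previous = 0
    for threshold in sorted(cumulative, reverse=True):
        bands[threshold] = cumulative[threshold] - previous
        previous = cumulative[threshold]
    bands[0] = 100 - previous  # everything below the smallest threshold
    return bands


for lower_bound, share in per_band(cumulative_cpu).items():
    print(f"jobs starting at {lower_bound} cores: {share}% of CPU usage")
```

Read this way, jobs of 512 to 1023 cores account for 9% of 2014 CPU usage, jobs of 256 to 511 cores for 12%, and jobs below 256 cores for the remaining 69%.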
It should be noted that the size and configuration of CC’s current systems limit the ability of Canadian researchers to submit jobs at the largest scales and so have likely limited the growth of the highly parallel bins. To illustrate this effect, consider the SOSCIP BlueGene system that offers service to southern Ontario researchers. This system provided more than 32,000 core-years of computation in 2014 to jobs using at least 1024 cores. Some of these users have shifted their computational workloads from CC systems to the SOSCIP system in order to take advantage of the highly parallel architecture. Others, notably users from the astrophysics community, have found ways to access resources in other countries, including XSEDE in the US and even Tianhe-2 in China.
B.4 Efficient and Effective Operation
The current distribution of CC data centres and systems reflects the distribution of resources from the seven pre-existing regional consortia that joined to form Compute Canada in 2006. Future hardware investment will be optimized on a national level into fewer, larger systems with national service roles. CC expects the current fleet of 27 data centres to be reduced to 5-10 by 2018. By concentrating investment in this way, important advantages will be realized:
• The CC management regime and role will shift, such that the central organization provides oversight for quality control, central processes for configuration change management and security, and coordinated planning for technology refresh.
• Some expert personnel will support enhanced services available across Canada rather than distinct hardware systems.
• The complexity of the CC enterprise will be reduced by not maintaining 27 bilateral hosting arrangements.
• Many researchers will no longer need to have their resource allocations split across multiple systems. This eases the burden on research groups. At the same time, it simplifies scheduling and storage allocation procedures for CC. Having a mix of hardware types in a single site is particularly valuable to those groups who require a mix of job types throughout their overall workflow.
• Better efficiency of operation and economy of scale will be attained by purchasing fewer, larger systems and having fewer support contracts.
• CC will be aligned with other national and multinational ARC consortia, by heading towards a more sustainable model of operation where hardware resources are centralized at locations where operational conditions are favourable and where qualified on-‐site staff are available. Access by users and most support staff is via the national wide-‐area network
The purchase of new infrastructure and consolidation of compute centres provides a unique opportunity to rethink both the way CC resources are managed and the way researchers interact with those resources. It will help Compute Canada evolve from today’s federation of systems and support into national-‐level
cyberinfrastructure, with support that transcends site and regional boundaries.
During the stage-1 technology refresh, four new sites will receive four new systems (described below), and a number of other systems will be defunded and removed from the CC allocations process. This shift in resources creates an opportunity for a shift in roles and expectations for CC’s staff members. Rather than having the majority of services for systems based at the host institution, the future will see support coming from across all of Compute Canada. The on-site support that users value will continue as a key component of Compute Canada’s services, and will be augmented by experts from across the nation.
A range of activities, from software licensing to 24x7 monitoring and response, will shift from an institutional model to a pan-CC model. CC’s leadership, working closely with regional leaders and member sites, will guide personnel towards thinking more broadly about their roles. Personnel will have the opportunity to become increasingly specialized, knowing that their knowledge might be called upon by any CC user at any site. Canada is ideally positioned to become a world leader in national-level support for ARC. Canada has an outstanding research network backbone, a broad mix of research universities, and a strong record of collaborative scholarship.
The multi-year shift from having ARC resources plus personnel at member sites, towards centralization of resources while retaining on-site personnel, provides two key opportunities:
1. To pursue an active technology refresh program, in which a limited number of sites host large-scale ARC systems to serve all CC constituencies;
2. To create a pan-Canadian support structure for ARC users, in which on-site talent is augmented by experts from across all member institutions.
CC’s plans in each area are described in this section.
B.4.1 National Centres
Four sites have been identified for hosting the next Compute Canada systems, which are anticipated to be available for use by mid-2016. All current CC centres, while part of a national network of systems, have traditionally operated with a large degree of autonomy. As an example, all CC researchers currently have equal access to every system in the network, but there is no mechanism to grant administrator privilege at a given site to staff from outside that site.
Compute Canada has recently established some core principles that define a national site. These core principles were mandatory hosting conditions in the site selection process described later in this document. These core principles are part of the signed agreements between newly selected hosting sites and CC:
• Allocation of resources on the hosted system(s) will be performed through the Compute Canada resource allocation process. No institution or region will receive preferential access to those system(s).
• Decisions on hardware procurement will be made through a national process. Local purchasing rules must allow Compute Canada staff to participate fully in the hardware vendor selection process. The host institution will own the purchased system(s).
• Sites will participate fully in the collection and reporting of information about the operation of the purchased system(s), in accordance with Compute Canada policies. This includes automatic collection of usage information, system up-times, etc. This information will be used to ensure consistent configuration and high levels of reliability and accessibility across the new systems.
• Sites will commit to enforce the Compute Canada Security and Privacy Policies at the hosting site, including affected operations personnel. These Policies will include but will not be limited to: physical and logical access control, security screening, operational security management, internal (i.e. Compute Canada) and external audits.
• System administrator (root) access on the Proposed System(s) may be granted to CC or regional personnel from outside the host institution. This access will be provided on an as-needed, least-privilege basis to qualified and authorized personnel, in order for Compute Canada to implement best practices in systems management and administration.
B.4.2 National Systems and Support
After consolidation, most researchers will rely on remote hardware resources. Compute Canada will therefore provide a similar look and feel when accessing each system. This national-level support approach will ensure users are able to get connected to the best system, and get all the support they need, regardless of location or language (English or French). Several ongoing initiatives in this area are expected to mature and be deployed with the new infrastructure:
• Single sign-on: Whether through a Web browser or command line, Compute Canada is working towards a single username and password for all services. This is in cooperation with the Canadian Access Federation (CAF) project.
• National monitoring: The new systems will be monitored by a new national operations centre, which will provide an improved level of monitoring. Critical services will have 24x7 on-call support. A national issue tracking (ticketing) system will make Compute Canada more resilient to failure and will enable our geographically distributed staff to bring expertise to bear when problems occur.
• Distributed systems administration: By applying granular privilege separation, appropriately trained staff members will be able to effect changes on remote systems. Activities such as software installations, password resets, and investigations of failed computational jobs will be undertaken by remote staff members in addition to staff at the four sites planned in this stage-1 proposal.
• Common software stack, centralized licensing: The four new systems, and subsequent systems, will have similar mechanisms for installing and maintaining software, using modules and other techniques. This will make it easier for users to move their work between systems, and to rapidly become productive on new systems.
• Highly credentialed staff members: Compute Canada will embark on training to ensure anyone with elevated access, or who needs to provide specific technical support for the new systems, obtains and maintains appropriate credentials. This will include vendor training, third party training, and certifications.
• Security profiles: The four new systems, sites, and all personnel who have any sort of elevated privileges will be part of the national-level Compute Canada security enclave. Systems and services will be actively monitored, with defense in depth against any type of attack or accident. The newly formed CC Security Council will oversee this.
• Change management: To maintain consistency across systems, and avoid surprises for users or staff members, there will be per-system and national-level configuration change boards (CCBs). The CCBs will provide oversight and consistency with change management.
B.4.3 Defunding Existing Systems
Compute Canada undertook a cost-benefit analysis to assess which systems should be defunded as a part of the stage-1 plan. The terminology is “defunded” instead of “decommissioned” because the systems belong to the host institutions, which control their ultimate fate.
The cost-benefit analysis took into account many factors, starting with the following well-defined measures:
• Computing power provided by a given system, measured in Tflops;
• Cost of electricity (including cooling);
• Cost of maintenance of the system (not including the maintenance of the data centre itself).
This allows calculating the total cost per Tflops, as shown in the figure below.
Total cost per Tflops for all Compute Canada compute servers online during the fall of 2014. Green identifies the servers that will remain funded and operational after stage-1.
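As a minimal illustration of how a cost-per-Tflops figure can be derived from the three measures listed above, the following Python sketch may be useful. It assumes that total cost is simply the sum of annual electricity and maintenance costs divided by computing power; the exact weighting used in the CC analysis is not stated in this document. The system names and dollar figures are hypothetical and are not CC data.

```python
from dataclasses import dataclass


@dataclass
class System:
    """One compute system with the three measures named in the analysis."""
    name: str
    tflops: float            # computing power, in Tflops
    electricity_cost: float  # annual electricity cost, including cooling
    maintenance_cost: float  # annual system maintenance (data centre excluded)


def cost_per_tflops(system: System) -> float:
    """Assumed formula: (electricity + maintenance) / Tflops."""
    if system.tflops <= 0:
        raise ValueError("tflops must be positive")
    return (system.electricity_cost + system.maintenance_cost) / system.tflops


def rank_by_cost(systems: list[System]) -> list[System]:
    """Return systems ordered from most to least expensive per Tflops."""
    return sorted(systems, key=cost_per_tflops, reverse=True)


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    fleet = [
        System("newer-cluster", tflops=100.0,
               electricity_cost=200_000.0, maintenance_cost=100_000.0),
        System("older-cluster", tflops=20.0,
               electricity_cost=150_000.0, maintenance_cost=50_000.0),
    ]
    for s in rank_by_cost(fleet):
        print(f"{s.name}: {cost_per_tflops(s):.0f} per Tflops per year")
```

Ranking systems this way puts the least cost-effective ones first. As the text notes, the final defunding decision also weighed cluster size, configuration, and test-bed value, none of which this sketch models.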
This analysis determined that most systems commissioned pre-2011 were no longer cost-effective. Based on this analysis, and further taking into account the size and configuration of the various clusters as well as the opportunity to conserve some systems as test beds, CC will stop funding 24 systems, and move out of 12 university data centres, in stage-1. This represents a loss of capacity of 85,000 cores (from approximately 2.0 PF to 1.5 PF) and a loss of 7 PB of storage. The list of defunded systems includes one of the largest parallel clusters in the current fleet (GPC) and the largest storage site (Silo). This will still leave 17 existing systems (over 100,000 cores) in operation. All existing systems, including those slated for defunding, will remain in operation until the new stage-1 capacity is available, in order to allow users and data to be seamlessly migrated.
B.5 Excellence-Based Access
B.5.1 Merit-Based Access
As documented in Section A, CC has a policy that grants access to any eligible Canadian researcher, while allocating approximately 80% of available compute resources through a national merit-based review. This review process includes a technical review, eight separate science panels, and a final multi-disciplinary review committee.
As competition has grown for a fixed pool of resources, the number of applications submitted to the Resource Allocation Competition (RAC) each year has grown from 135 in the fall of 2010 to 348 in the fall of 2014. In 2013, a “FastTrack” stream was introduced for researchers who had received strong science reviews the year before and who were requesting to continue their existing allocation. This is attractive to researchers because it reduces the burden of submitting a new proposal, and it helps streamline the process for CC staff. Fifty projects took advantage of FastTrack when it was first introduced in 2013.
However, the growth in the number and diversity of proposals cannot be sustained without additional streamlining and additional staff support. The running of this competition has put a strain on existing CC staff. To address these operational challenges, for the 2014/15 competition, MSI funding allowed CC to hire a consultant with extensive federal granting council experience to review, document and recommend changes to the RAC process. In addition, a permanent science project manager has been hired (September 2014) with significant responsibilities for running the labour-intensive RAC process.
The first of the externally recommended changes to the allocation process has already been implemented in the fall 2014 competition with the creation of a separate “Research Platforms and Portals” (RPP) competition. Researcher feedback indicated that multi-user platforms, which often serve an international community, should not be evaluated against the same criteria as projects serving the needs of individual researchers. For example, while a one-year allocation may be reasonable for an individual project, a platform may instead require a large multi-year storage allocation that can be accessed by scientists from around the world. CC awarded 13 RPPs in the first competition and expects this competition to grow the list of supported platforms and portals in future years.
Another of the key external recommendations was to develop a project plan for the allocation process, with a detailed timeline and milestones throughout the year. This has been implemented, and planning for the fall 2015 competition launch is well underway at the time of writing. Given the rapid growth in allocation applications, it is vital that CC continue to streamline administrative aspects of the process.
B.5.2 Support for “Contributed Systems”
In parallel with funding CC, the CFI continues to receive proposals for the funding of advanced computing infrastructure in connection with specific research-focussed projects. In 2012, the CFI modified its Policy and Program Guide to address the housing and managing of any ARC infrastructure to be funded by CFI awards. The so-called “Compute Canada Clause” indicates a requirement to consult with CC to determine if the infrastructure described in the project can be provided by CC, integrated into CC facilities, or if the infrastructure must or should be separate from CC facilities. A single consultation usually involves a teleconference between project representatives and CC, as well as the exchange of detailed documentation, before the proposal is submitted. It may involve detailed follow-up between project and CC technical teams, discussions with host data centre teams and work on system design. After the award is granted, CC follows up with all awarded projects that have an identified CC role, for example as an infrastructure host.
Since this change of policy, CC has consulted on 91 smaller proposals (CFI LOF/JELF competitions) in which a total of nearly $10M in ARC infrastructure was proposed. In addition, CC consulted with 59 larger projects as part of the recent CFI Innovation Fund (IF) competition. Overall, integration was recommended in 71 out of