Evolution
Since JASMIN came into being in early 2012, it has grown significantly, not only in scale and complexity but also in the number and variety of users it serves and the types of scientific workflow it supports. As the requirements of its user community evolve, so does JASMIN. The phases below describe the major procurement and upgrade projects which have taken place. These have been complemented by the work of teams within CEDA and STFC's Scientific Computing Department in developing and maintaining the infrastructure and its component services and software, creating the major e-infrastructure facility now familiar to over 1,500 users and 200 science projects.
Phase 1 (2011-2012)
A “super-data-cluster” is born
The initial technical architecture was selected to provide a flexible, high-performance storage and data analysis environment, supporting batch computing, hosted processing and a cloud environment. The CEDA Archive had outgrown its previous hosting environment, and the increasing need for scientific workflows to “bring the compute to the data” drove the development of an infrastructure supporting analysis of archive data alongside datasets brought into, or generated by, projects in their own collaborative workspaces. The first components deployed in this phase were:
- Low latency core network
- High-performance disk storage system supporting parallel write
- Access to expandable tape storage for near-line storage
- Resources to support bare-metal and virtualised compute
- A batch scheduler
- Block storage for storing virtual machine images
A paper describing the initial architecture is available (doi:10.1109/BigData.2013.6691556).
| Component | Details | Capacity / notes |
|---|---|---|
| Disk storage | Initial fast disk | 4.6 PB RAL (0.5 PB Reading, 0.15 PB Leeds) |
| Batch compute | Initial compute for LOTUS | 650 cores |
| Network | Initial Gnodal-based network | |
| Virtual compute | Virtualisation software licences for hosting virtual machines | |
| Tape storage | Tape drives & media | 4 x T10KC drives, 2.5 PB media |
| Software | Data movement software | |
| | Community Intercomparison Suite | |
| Other | Machine room environment monitoring equipment | |
Phase 1.5 (2012-2013)
Enabling NERC Big Data projects
Already establishing its ability to facilitate projects with data-intensive workflows, JASMIN was given additional capability to support several NERC “Big Data” projects across a range of disciplines: near-real-time processing of Earth observation data, Earth surface deformation analysis and seismic hazard analysis, along with a cloud infrastructure used within the genomics community.
| Component | Details | Capacity / notes |
|---|---|---|
| Disk storage | Minor addition to fast disk storage | 0.4 PB PFS |
| Batch compute | Interim expansion | 1920 cores |
| Network | Core network upgrade | |
| Virtual compute | Virtualisation licences: expansion of licensed estate | |
| Tape storage | Tape drives & servers, tape media | 2 x T10KC drives, 3.5 PB media |
| Software | Initial versions of Elastic Tape interface (ET) & JASMIN Analysis Platform (JAP) | |
Phases 2 & 3 (2013-2015)
Major expansion over a 2-year period
Having proved its worth as a concept able to facilitate many large data-intensive environmental science projects, JASMIN underwent a major upgrade to provide the necessary storage and compute for its stakeholder community. Its remit now extended beyond the initial NCAS and NCEO stakeholders to serve the whole of the NERC community.
| Component | Details | Capacity / notes |
|---|---|---|
| Disk storage | Major expansion to fast storage | 11 PB PFS |
| | Block storage for VM hosting | 0.9 TB BLK |
| | High-performance storage for databases | 0.05 TB high-IOPS BLK |
| Batch compute | Major expansion to LOTUS compute | 3800 cores |
| | Dual capability as hypervisors for virtual machines, or as LOTUS nodes | 4 high-memory nodes (2 TB RAM) |
| Network | Major redesign & implementation | |
| Virtual compute | Expansion of licensed estate | |
| Tape storage | Major expansion | 7.5 PB tape media |
| Software | Community Intercomparison Suite | Scientific end-user software |
| | JASMIN Cloud Portal | Cloud tenancy management interface |
| Other | User documentation | |
| | Website | |
| | Dataset construction | |
Phase 3.5 (2016-2017)
Interim upgrades and strategic proof-of-concept projects
Ahead of larger investments in years to come, limited but carefully targeted upgrades ensured that key systems continued to operate at the scales needed. A proof-of-concept project tested the feasibility of using OpenStack instead of a proprietary solution for JASMIN’s growing Community Cloud infrastructure.
| Component | Details | Capacity / notes |
|---|---|---|
| Disk storage | Object store proof of concept | 1.2 PB HPOS |
| | Replacement of cloud block storage | 0.4 PB BLK |
| | Continued use of Phase 1 & 2 storage, inc. battery replacements | |
| Batch compute | Interim expansion of batch compute | 1120 cores |
| | Continued use of Phase 1.5 & 2 compute | ~4000 cores |
| Network | Essential network & firewall support | |
| Virtual compute | Cloud software support | |
| Tape storage | Tape media | 5 PB |
| Software | OpenStack proof of concept | |
Phase 4 (2017-2018)
Major expansion with new technologies
Phase 4 introduced new types of storage at the scales needed to support scientific workflows into the future. Successful proofs of concept with Scale Out Filesystem (SOF) and high-performance object storage (HPOS) enabled large deployments of both, with SOF adopted as the primary medium for Group Workspace storage, and tooling and services under development to enable use of object storage within cloud-based workflows. LOTUS gained a major upgrade of more than 5,000 cores, in a network enhanced for future expansion. Cloud tenancies were migrated to an OpenStack platform and the management interfaces adapted to match. Meanwhile, testbeds for Cluster-as-a-Service and Jupyter Notebooks provided previews of exciting capabilities to come.
| Component | Details | Capacity / notes |
|---|---|---|
| Disk storage | BLK storage for cloud | 0.4 PB BLK |
| | Major expansion of SOF | 30 PB SOF |
| | Object storage (HPOS) | 5 PB HPOS |
| | New SSD for home areas | 0.5 PB SSD |
| | Replacement of earlier PFS | 3 PB PFS |
| Batch & physical compute | Expansion of batch compute | 210 servers, 5040 cores |
| | New servers for Data Transfer Zone | 10 servers for DTZ |
| Network | Implementation of "super-spine" network | Ensuring future connectivity on site |
| | Expansion & upgrade to management network | |
| Virtual compute | Production deployment of OpenStack as cloud platform, migration of tenancies | |
| Software | OpenStack upgrade for JASMIN Cloud Portal | Management capability for OpenStack cloud tenancies |
| | OpenDAP4GWS | Autonomous exposure of data from GWSs |
| | Cluster-as-a-Service testbed | Dynamic virtualised batch compute |
| | Containerised Jupyter Notebook deployed in Kubernetes | PoC for Python Notebook service |
| Other | Bulk migration of data from Phase 1 hardware | Ahead of retirement of old hardware |
| | Machine room hardware | Racks, PDUs, cabling, environment monitoring equipment |
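As an illustration of the kind of cloud-based workflow the new object storage was intended to support, the sketch below lists and reads objects from an S3-compatible endpoint using Python and boto3. The endpoint URL, bucket name and credentials are placeholders for illustration only, not actual JASMIN service details.

```python
# Minimal sketch: reading data from S3-compatible object storage (e.g. HPOS)
# using boto3. Endpoint, bucket and credentials below are illustrative only.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.ac.uk",  # placeholder endpoint
    aws_access_key_id="MY_ACCESS_KEY",                 # placeholder credentials
    aws_secret_access_key="MY_SECRET_KEY",
)

bucket = "my-gws-bucket"  # hypothetical bucket name

# List the first few objects in the bucket
response = s3.list_objects_v2(Bucket=bucket, MaxKeys=10)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a single object to local disk for analysis
s3.download_file(bucket, "example/data.nc", "data.nc")
```

Higher-level tools (for example s3fs or Zarr via fsspec) could equally be used; the key point is that object store data is addressed over HTTP rather than through a POSIX filesystem mount.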
Phase 5 (2018-2019)
Tape storage & other strategic upgrades
Together with STFC’s IRIS consortium, JASMIN procured a major upgrade to a shared tape storage facility, with capacity for 65 PB of near-line storage. JASMIN also acquired its first GPU servers: a small proof-of-concept cluster of 5 systems.
It was also time to say goodbye to several tonnes of storage and compute hardware from previous phases, now retired and removed to make room for new equipment.
| Component | Details | Capacity / notes |
|---|---|---|
| Batch compute | Initial GPU servers | PoC with 2 x small, 1 x large system |
| | Extra SSD disks for Phase 4 batch compute | |
| Network | Firewall hardware | |
| | Routers and 100G connectivity | |
| Virtual compute | New hypervisor servers | For "cattle-class" virtual machines |
| | New backup appliance | |
| Tape storage | Replacement of tape library | Shared procurement with STFC IRIS; 65 PB capacity |
| | Tape media | 11 PB (LTO and TS1160) |
| Software | OpenStack software development | |
| | Cluster-as-a-Service development | |
| Other | Decommissioning of Phase 2 hardware | |
Phase 6 (2019-2020)
Batch compute upgrade and network improvements
LOTUS was the main focus of this phase, with old compute nodes replaced by new higher-memory servers and work to migrate the scheduler from Platform LSF to SLURM. A change of operating system also meant redeploying CEDA and JASMIN service hosts throughout the system.
| Component | Details | Capacity / notes |
|---|---|---|
| Disk storage | BLK storage replacement | Multiple retirement dates, avoiding an all-at-once transition; new storage to run alongside, then replace, existing hardware |
| Batch compute | Replacement of Phase 1 and 2 compute nodes | Solves flow-control issue for interaction with Phase 4 storage; current 4 x 2 TB high-memory nodes to be replaced with 132 x 1 TB nodes |
| Network | Improvements to "exit pod" network | Enhanced connectivity between JASMIN & the wider internet |
| Virtual compute | Replacement of virtualisation servers | For "pet"-class virtual machines where reliability is important |
| Software | Replacement of Platform LSF with SLURM scheduler | Move to an open-source scheduler with lower ongoing costs |
| | Change of operating system | Move from Red Hat Enterprise Linux to CentOS 7 |
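To illustrate the change of scheduler, the sketch below shows a batch job written as a Python script with SLURM `#SBATCH` directives, as it might be submitted to LOTUS with `sbatch`. The partition name and resource values are illustrative placeholders rather than actual LOTUS queue names; under Platform LSF the equivalent settings would have been expressed as `#BSUB` directives and submitted with `bsub`.

```python
#!/usr/bin/env python3
# Minimal sketch of a batch job under SLURM, submitted with:
#   sbatch myjob.py
# The partition name below is a placeholder, not a real LOTUS queue.
#SBATCH --job-name=example-analysis
#SBATCH --partition=short-serial     # illustrative partition name
#SBATCH --time=00:30:00              # wall-clock limit (LSF: #BSUB -W 30)
#SBATCH --ntasks=1                   # single task (LSF: #BSUB -n 1)
#SBATCH --mem=4G                     # memory request (LSF: #BSUB -R "rusage[mem=4000]")
#SBATCH --output=%j.out              # stdout file, %j = job ID (LSF: #BSUB -o %J.out)

import os
import platform

# Trivial payload: report where the job ran and which SLURM job ID it was given.
print("Running on node:", platform.node())
print("SLURM job ID:", os.environ.get("SLURM_JOB_ID", "not set"))
```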
Phase 7 (2020-2021)
Essential storage upgrades and new compute capabilities
This phase brought a much-needed boost to capacity across JASMIN’s many types of storage, coupled with the retirement of older disk systems and increased CPU capacity for the LOTUS batch processing cluster. Following a successful proof of concept in previous years, this phase also established ORCHID, JASMIN’s new GPU cluster, to cater for AI workflows.
| Component | Details | Capacity / notes |
|---|---|---|
| Cloud | Integration of an additional cloud platform | |
| Network | Replacement of Phase 1/2 network pod for Phase 7 hardware | |
| | 25 Gbit/s NIC upgrade for hypervisors in managed cluster | |
| Compute | Full-scale GPU cluster for AI workflows | 2 x 8 x NVIDIA A100 nodes, 14 x 4 x NVIDIA A100 nodes |
| | Replacement of Phase 2/3 CPU nodes and cloud hardware expansion | +768 CPU cores with large RAM; new 100 Gb networking for LOTUS |
| Disk storage | 30% SOF capacity increase, small-file capability | 10 PB SOF + 0.5 PB SSD |
| | 40% HPOS increase | 2 PB HPOS |
| | 125% PFS capacity increase | 5 PB PFS |
| | SSD upgrade for small-file workloads | 300 TB SSD |
| | Block capacity for virtualisation, clouds & container storage; API brought up to date | 400-500 TB flash |
| Tape storage | Tape server hardware replacements | |
| | Tape media | 18 PB media |
| | New colder-storage system design & development to replace ET & JDMA | |
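As a sketch of the kind of AI workflow ORCHID is aimed at, the script below requests GPU resources via SLURM and checks that a CUDA device is visible from PyTorch. The partition name, GPU count and the assumption that PyTorch is installed are illustrative, not a statement of the actual ORCHID configuration.

```python
#!/usr/bin/env python3
# Minimal sketch of a GPU batch job for an ORCHID-like cluster, submitted with:
#   sbatch gpu_check.py
# Partition name and GPU count are placeholders; PyTorch is assumed to be installed.
#SBATCH --job-name=gpu-check
#SBATCH --partition=orchid           # illustrative partition name
#SBATCH --gres=gpu:2                 # request two GPUs on one node
#SBATCH --time=00:10:00
#SBATCH --mem=32G

import torch  # assumes a PyTorch installation with CUDA support

if torch.cuda.is_available():
    n = torch.cuda.device_count()
    print(f"{n} GPU(s) visible")
    for i in range(n):
        print(f"  device {i}: {torch.cuda.get_device_name(i)}")
    # Trivial computation on the first GPU to confirm it is usable
    x = torch.randn(1024, 1024, device="cuda")
    print("matmul checksum:", (x @ x).sum().item())
else:
    print("No CUDA devices visible to this job")
```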
Phase 7.5 / JASMINx Phase 1 (2021-2022)
Strategic investment in tape storage, LOTUS upgrade and consultancy on future user requirements
Commissioning of a new Near-Line Data Store (NLDS) with an essential uplift in tape media capacity. Replacement and expansion of LOTUS capacity, plus a study of future user requirements.
| Component | Details | Capacity / notes |
|---|---|---|
| Tape storage | Commissioning of new NLDS tiered storage system | NLDS design & development project underway at CEDA in collaboration with the University of Reading |
| | Tape media capacity increase | 23 PB media, 4 drives, 2 data frames, chamber licences & associated costs; 2 data servers |
| Compute | Compute nodes to replace & expand LOTUS cluster capacity | 92 x compute nodes with 512 GB RAM and dual AMD EPYC processors, 48 cores per node; total 92 x 48 = 4416 cores, mostly for deployment in the LOTUS cluster |
| User requirements study | Commissioned study to identify potential future user requirements for JASMIN | UKRI JASMINx expansion - User need analysis report |