Evolution

Since JASMIN came into being in early 2012, it has grown significantly, not only in scale and complexity but also in the number and variety of users it serves and the types of scientific workflow it supports. As the requirements of its user community evolve, so does JASMIN. The phases below describe the major procurement and upgrade projects that have taken place. These have been complemented by the work of teams within CEDA and STFC’s Scientific Computing Department in developing and maintaining the infrastructure and its component services and software, creating the major e-infrastructure facility now familiar to over 1,500 users and 200 science projects.

Spectra Time Lapse: installation of new STFC tape library

All racks powered on following major addition to storage and compute capabilities in Phases 2 and 3.

Phase 1 (2012): Panasas shelves close up.

Phase 1 (2012): first two racks of JASMIN storage powered on.

Phase 1 (2012) Machine room floor before installation. Compute servers and block storage arrays.

Block storage added in Phase 3

Artful cabling is required to connect across JASMIN's internal network.

JASMIN evolution in pictures.

Phase 1 (2011-2012)

A “super-data-cluster” is born

The initial technical architecture was selected to provide a flexible, high-performance storage and data analysis environment, supporting batch computing, hosted processing and a cloud environment. The CEDA Archive had outgrown its previous hosting environment, and the increasing need for scientific workflows to “bring the compute to the data” drove the development of an infrastructure to support analysis of archive data alongside datasets brought into or generated by projects in their own collaborative workspaces. The first components deployed in this phase were:

Low latency core network
High-performance disk storage system supporting parallel write
Access to expandable tape storage for near-line storage
Resources to support bare-metal and virtualised compute
A batch scheduler
Block storage for storing virtual machine images
A paper describing the initial architecture is available (doi:10.1109/BigData.2013.6691556).

Details of Phase 1 (2011-2012)
Component	Details
Disk storage	Initial fast disk	4.6 PB RAL (0.5 PB Reading, 0.15 PB Leeds)
Batch compute	Initial compute for LOTUS	650 cores
Network	Initial Gnodal-based network
Virtual compute	VM licences	Virtualisation software licences for hosting virtual machines
Tape storage	Tape drives & media	4 x T10KC drives, 2.5 PB media
Software	Data movement software; Community Intercomparison Suite
Other	Machine room environment monitoring equipment

Phase 1.5 (2012-2013)

Enabling NERC Big Data projects

Having already established its ability to facilitate projects with data-intensive workflows, JASMIN was given additional capability to support several NERC “Big Data” projects across a range of disciplines: near-real-time processing of EO data, Earth surface deformation analysis and seismic hazard analysis, along with support for a cloud infrastructure used within the genomics community.

Details of Phase 1.5 (2012-2013)
Component	Details
Disk storage	Minor addition to fast disk storage	0.4 PB PFS
Batch compute	Interim expansion	1920 cores
Network	Core network upgrade
Virtual compute	Virtualisation licences: expansion of licensed estate
Tape storage	Tape drives & servers, Tape media	2 x T10KC drives, 3.5 PB media
Software	Initial versions of Elastic Tape interface (ET) & JASMIN Analysis Platform (JAP)

Phases 2 & 3 (2013-2015)

Major expansion over a 2-year period

Having proved its worth as a concept able to facilitate many large data-intensive environmental science projects, JASMIN underwent a major upgrade to provide the necessary storage and compute for its stakeholder community. Its remit now extended beyond the initial NCAS and NCEO stakeholders to serve the whole of the NERC community.

Details of Phases 2 & 3 (2013-2015)
Component	Details
Disk storage	Major expansion to fast storage; Block storage for VM hosting; High-performance storage for databases	11 PB PFS; 0.9 TB BLK; 0.05 TB high-IOPS BLK
Batch compute	Major expansion to LOTUS compute; Dual capability as hypervisors for virtual machines, or as LOTUS nodes	3800 cores; 4 high-memory nodes (2 TB RAM)
Network	Major redesign & implementation
Virtual compute	Expansion of licensed estate
Tape storage	Major expansion	7.5 PB tape media
Software	Community Intercomparison Suite; JASMIN Cloud Portal	Scientific end-user software; Cloud tenancy management interface
Other	User documentation; Website; Dataset construction

Phase 3.5 (2016-2017)

Interim upgrades and strategic proof-of-concept projects

Ahead of larger investments in years to come, limited but carefully targeted upgrades ensured that key systems continued to operate at the scales needed. A proof-of-concept project tested the feasibility of using OpenStack instead of a proprietary solution for JASMIN’s growing Community Cloud infrastructure.

Details of Phase 3.5 (2016-2017)
Component	Details
Disk storage	Object store proof of concept; Replacement of cloud block storage; Continued use of Phase 1, 2 storage inc. battery replacements	1.2 PB HPOS; 0.4 PB BLK
Batch compute	Interim expansion of batch compute; Continued use of Phase 1.5 & 2 compute (~4000 cores)	1120 cores
Network	Essential network & firewall support
Virtual compute	Cloud software support
Tape storage	Tape media	5 PB
Software	OpenStack proof of concept

Phase 4 (2017-2018)

Major expansion with new technologies

Phase 4 introduced new types of storage at the scales needed to support scientific workflows into the future. Successful proofs-of-concept with Scale Out Filesystem (SOF) and high-performance object storage (HPOS) enabled large deployments of these, with SOF adopted as the primary storage medium for Group Workspace storage, and tooling and services under development to enable use of object storage within cloud-based workflows. LOTUS gained a major upgrade of >5000 cores, in a network enhanced for future expansion. Cloud tenancies were migrated to an OpenStack platform and management interfaces adapted to match. Meanwhile, testbeds for Cluster-as-a-Service and Jupyter Notebooks provided previews of exciting capabilities to come.

Details of Phase 4 (2017-2018)
Component	Details
Disk storage	BLK storage for cloud; Major expansion of SOF; Object storage (HPOS); New SSD for home areas; Replacement of earlier PFS	0.4 PB BLK; 30 PB SOF; 5 PB HPOS; 0.5 PB SSD; 3 PB PFS
Batch & physical compute	Expansion of batch compute; New servers for Data Transfer Zone	210 servers, 5040 cores; 10 servers for DTZ
Network	Implementation of "super-spine" network; Expansion & upgrade to management network	Ensuring future connectivity on site
Virtual compute	Production deployment of OpenStack as cloud platform, migration of tenancies
Software	OpenStack upgrade for JASMIN cloud portal; OpenDAP4GWS; Cluster-as-a-Service testbed; Containerised Jupyter Notebook deployed in Kubernetes	Management capability for OpenStack cloud tenancies; Autonomous exposure of data from GWSs; Dynamic virtualized batch compute; PoC for Python Notebook service
Other	Bulk migration of data from Phase 1 hardware; Machine room hardware	Ahead of retirement of old hardware; Racks, PDUs, cabling, environment monitoring equipment

Phase 5 (2018-2019)

Tape storage & other strategic upgrades

Together with STFC’s IRIS consortium, a major upgrade to a shared tape storage facility was procured with capacity for 65 PB of near-line storage. JASMIN also acquired its first GPU servers: a small proof-of-concept cluster of 5 systems.

It was time to say goodbye to several tonnes of storage and compute hardware from previous phases, which was now retired and needed to be removed to make room for new equipment.

Details of Phase 5 (2018-2019)
Component	Details
Batch compute	Initial GPU servers; Extra SSD disks for Phase 4 batch compute	PoC with 2 x small, 1 x large system
Network	Firewall hardware; Routers and 100G connectivity
Virtual compute	New hypervisor servers; New backup appliance	For "cattle-class" virtual machines
Tape storage	Replacement of tape library; Tape media	Shared procurement with STFC IRIS, 65 PB capacity; 11 PB (LTO and TS1160)
Software	OpenStack software development; Cluster-as-a-Service development
Other	Decommissioning of Phase 2 hardware

Phase 6 (2019-2020)

Batch compute upgrade and network improvements

LOTUS was the main focus of this phase, with old compute nodes replaced by new higher-memory servers and work to migrate the scheduler from Platform LSF to SLURM. A change of operating system also meant redeployment of CEDA and JASMIN service hosts throughout the system.

Details of Phase 6 (2019-2020)
Component	Details
Disk storage	BLK storage replacement	Multiple retirement dates but avoiding transition all at once. To run alongside then replace existing hardware.
Batch compute	Replacement of Phase 1 and 2 compute nodes	Solves flow control issue for interaction with Phase 4 storage. Current 4 x 2 TB high-memory nodes to be replaced with 132 x 1 TB nodes
Network	Improvements to "exit pod" network	Enhance connectivity between JASMIN & wider internet
Virtual compute	Replacement of virtualisation servers	For “pet” class virtual machines where reliability is important
Software	Replacement of Platform LSF with SLURM scheduler; Change of operating system	Move to open-source scheduler with lower ongoing costs; Move from Red Hat Enterprise Linux to CentOS 7

Phase 7 (2020-2021)

Essential storage upgrades and new compute capabilities

This phase delivered a much-needed boost to capacity across the many types of storage, coupled with the retirement of older disk systems and increased CPU compute for the LOTUS batch processing cluster. Following a successful proof-of-concept in previous years, this phase also established ORCHID, JASMIN’s new GPU cluster to cater for AI workflows.

Details of Phase 7 (2020-2021)
Component	Details
Cloud	Integration of an additional cloud platform
Network	Replacement of Phase 1/2 network pod for Phase 7 hardware; 25 Gbit/s NIC upgrade for hypervisors in managed cluster
Compute	Full-scale GPU cluster for AI workflows; Replacement of Phase 2/3 CPU nodes and cloud hardware expansion	2 x 8 x NVIDIA A100 nodes, 14 x 4 x NVIDIA A100 nodes; +768 CPU cores with large RAM. New 100 Gb networking for LOTUS
Disk storage	30% SOF capacity increase, small file capability; 40% HPOS increase; 125% PFS capacity increase; SSD upgrade for small-file workloads; Block capacity for virtualisation, clouds & container storage, API brought up to date	10 PB SOF + 0.5 PB SSD; 2 PB HPOS; 5 PB PFS; 300 TB SSD; 4-500 TB Flash
Tape storage	Tape server hardware replacements; Tape media; New colder-storage system design & development to replace ET & JDMA	18 PB media

Phase 7.5 / JASMINx Phase 1 (2021-2022)

Strategic investment in tape storage, LOTUS upgrade and consultancy on future user requirements

This phase saw the commissioning of a new Near-Line Data Store (NLDS) with an essential uplift in tape media capacity, the replacement and expansion of LOTUS capacity, and a study of future user requirements.

Details of Phase 7.5 / JASMINx Phase 1 (2021-2022)
Component	Details
Tape storage	Commissioning of new NLDS tiered storage system; Tape media capacity increase	NLDS design & development project underway at CEDA in collaboration with University of Reading; 23 PB media, 4 drives, 2 data frames, chamber licences & associated costs; 2 data servers
Compute	Compute nodes to replace & expand LOTUS cluster capacity	92 x compute nodes with 512 GB RAM, dual AMD EPYC processors, 48 cores per node; Total 92 x 48 = 4416 cores, mostly for deployment in LOTUS cluster.
User requirements study	Commissioned study to identify potential future user requirements for JASMIN	UKRI JASMINx expansion - User need analysis report