
Sebastian Baszczyj - 01.09.2017

Why the hell should I use software defined storage?


It seems we are living in a world where everything is “something-defined” or “something as a service”, and we are surrounded by more and more acronyms. So when faced with Software Defined Storage (SDS), we may ask ourselves: why not use good old block storage on good old physical hardware, if it is available?

These are questions I have been asking myself, with some regret that the past is in the past, as new and emerging technologies have redefined the technological landscape.

So, let’s get to it: why SDS, what does it mean, and how do we consume it?

Firstly, let’s get the semantics out of the way.

What is storage? Storage is an entity for storing data or information for future use; the retention of retrievable data on a computer or other electronic system.

What is software? Software is a collection of programs and other operating information used by a computer.

What is SDS? Software-defined storage (SDS) is a marketing term for computer data storage software that provides policy-based provisioning and management of storage, independent of the underlying hardware.

In other words, SDS is independent of hardware vendors. This implies that the storage can be built using any commodity hardware and the operating systems already in the environment, and it should provide capabilities similar to proprietary hardware. Usually, no additional purchases are necessary, as the highly scalable storage can be built on existing hardware, including virtual machines. Further to that, SDS builds on skills which are already in house, such as Linux administration. It is not necessary to spend tens or hundreds of thousands of dollars to purchase hardware (and lock yourself into a specific technology), nor to hire expensive storage administrators.

An established SDS system in the market is Red Hat’s GlusterFS. GlusterFS is a scale-out network-attached storage (NAS) file system. It has found many applications in various industries, including cloud computing, streaming media services and content delivery networks.

I have recently been involved in the installation and configuration of a highly available Gluster cluster consisting of six nodes for a financial organization in Australia. This organization adopted and implemented Gluster a few years ago to provide NAS storage to their 500+ Linux servers. Gluster is used with multiple volumes: as centralized log storage, as storage for their web-based services and, in some cases, as highly available storage for their databases. After a few years of using Gluster, they decided to review the architecture, introduce new features and ensure that the configuration is indeed highly available.

Gluster provides many topology options, and in this case, the architecture adopted was a stretched cluster across two data centers (with sub 2ms latency between the DCs) with six Gluster nodes (three nodes in each DC).
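
As a rough sketch of how such a pool is formed (the hostnames below are hypothetical placeholders, not the client’s actual servers), all six nodes are joined into a single trusted storage pool by probing them from any one node:

  # Run from the first node; the remaining five nodes join the trusted storage pool
  gluster peer probe node2
  gluster peer probe node3
  gluster peer probe node4
  gluster peer probe node5
  gluster peer probe node6

  # Confirm that all peers are connected
  gluster peer status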

To provide high availability and scalability, a replicated-distributed configuration was used with an arbiter node to provide a fencing or arbitration capability.

Let’s crack the code to understand what it all means.

Replicated Volume – The replicated volume is similar to RAID 1 and can consist of up to three bricks (or at least that is what Red Hat supports). This means that any file written to the replicated volume is automatically written to two or three different storage bricks. But hang on – what is a brick?

A Brick – as the name suggests, a brick is a foundational building element in a Gluster environment. The simplest description of a brick is a mountpoint with a file system which is dedicated to a specific volume, which in turn will be shared and presented to the users. One or more bricks can reside on a Gluster node.
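
To make this concrete, here is a minimal sketch of preparing a brick on one node, assuming a dedicated block device called /dev/sdb and the paths shown (all names are hypothetical; XFS is the file system commonly recommended for Gluster bricks):

  # Create an XFS file system on the dedicated device and mount it
  mkfs.xfs -i size=512 /dev/sdb
  mkdir -p /bricks/brick1
  mount /dev/sdb /bricks/brick1
  echo "/dev/sdb /bricks/brick1 xfs defaults 0 0" >> /etc/fstab

  # The brick itself is a directory inside the mount point
  mkdir -p /bricks/brick1/data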

Distributed Volume – a distributed volume is similar in concept to a concatenated volume. It does not provide any resiliency, but it scales well. To provide the best of both worlds, we can combine the two volume types to create a replicated-distributed volume. This type of volume consists of two replicated subvolumes which together work as a distributed volume.

What? – Let’s imagine a replicated Subvolume A and a replicated Subvolume B, which form part of a distributed Volume X. When file A is written to Subvolume A, the file is automatically (and synchronously) written to all bricks in that subvolume (let’s assume three bricks, with one brick per server). With me so far? When another file is written, it may land on Subvolume B, following the same process. You get the picture?
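
A hedged sketch of how such a volume could be created follows. With six bricks and a replica count of three, Gluster groups the first three bricks into Subvolume A and the next three into Subvolume B, then distributes files across the two subvolumes (hostnames, volume name and brick paths are hypothetical):

  # Six bricks with replica 3 => two replicated subvolumes, distributed (2 x 3 = 6)
  gluster volume create volX replica 3 \
      node1:/bricks/brick1/data node2:/bricks/brick1/data node3:/bricks/brick1/data \
      node4:/bricks/brick1/data node5:/bricks/brick1/data node6:/bricks/brick1/data
  gluster volume start volX
  gluster volume info volX    # Type should report Distributed-Replicate

  # On a client, mount the volume with the native FUSE client
  mkdir -p /mnt/volX
  mount -t glusterfs node1:/volX /mnt/volX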


So, what is this configuration good for? Given we are dealing with two sets of replicated volumes, there is a penalty on write operations, but this is balanced (and not only by peace of mind) by the phenomenal performance of read operations, as reads are served from all the bricks simultaneously. At the same time, if one of the bricks is unavailable, the rest of the volume continues to work without interruption, and the distributed part of the volume provides additional space and improved scalability.
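
If you want to observe this behaviour for yourself, Gluster ships with a built-in profiler which reports per-brick I/O statistics; a quick sketch, assuming the hypothetical volume volX from earlier:

  # Enable, inspect and disable per-brick I/O statistics for the volume
  gluster volume profile volX start
  gluster volume profile volX info
  gluster volume profile volX stop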

But wait… if the replicated volume is a cluster and all parts are synchronous, that means that at some point, a split-brain scenario could happen.

A split what?

What is a split-brain situation in the case of a replica volume? It is defined as a difference in file data/metadata across the bricks of a replica, such that the self-healing daemon cannot identify which brick holds a good copy of the data, even when all bricks are available. As a result, all modifications to the affected file(s) fail with an I/O error.
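
Gluster can report which files it considers to be in split-brain; a minimal sketch, again assuming a volume named volX:

  # List files pending heal, then only those the self-heal daemon cannot resolve
  gluster volume heal volX info
  gluster volume heal volX info split-brain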

This is the point where client quorum comes to the rescue. In the GlusterFS world, this means that a minimum number of bricks needs to be up to allow modifications. For a replica 2 volume (one volume consisting of two bricks which effectively mirror each other), it means that the client needs to be connected to both bricks at all times and there is no fault tolerance – meaning no brick can fail.

This is where replica 3 steps in. Using client quorum in auto mode, more than 50% of the bricks in the replica set have to be up to allow modifications. This implies that in replica 3, two bricks always have to be up to allow operations. The maths becomes quite clear here: the fault tolerance in the case of a replica 3 volume is one brick (one brick can be offline while over 50% of the bricks remain online). It all sounds perfect, right?  … but … is it cost effective to allocate three (3) times the space required for the volume?
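
Client quorum is an ordinary per-volume option; a sketch of enabling the auto mode described above (volume name hypothetical):

  # Require more than 50% of the bricks in each replica set to be reachable for writes
  gluster volume set volX cluster.quorum-type auto

  # Verify the current value
  gluster volume get volX cluster.quorum-type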

At the risk of sounding direct or frank, not really – unless of course the solution has to be as resilient as you can get. There is, however, another approach to the replica 3 configuration which does not require that much storage. This approach is called THE ARBITER. Arbiter volumes are replica 3 volumes where the third brick of the replica stores only file names and metadata, but no data. This configuration is useful in avoiding split-brain scenarios while providing similar levels of consistency as a normal replica 3 volume.
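
Creating an arbitrated volume differs from a plain replica 3 volume only in the create command; a hedged sketch with hypothetical hostnames and paths:

  # Two data bricks plus one arbiter brick per replica set
  gluster volume create arbvol replica 3 arbiter 1 \
      node1:/bricks/brick2/data \
      node2:/bricks/brick2/data \
      node3:/bricks/arbiter1/data
  gluster volume start arbvol
  gluster volume info arbvol    # The third brick is listed as the arbiter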

How much space do I need for the arbiter brick? The safe estimate is around 4KB per file multiplied by the number of files on the volume. Practical estimates vary, but open source community users have estimated around 1KB per file. There are also formulas to estimate the size of the brick given the size of the target volume.
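
To put that in perspective (with purely hypothetical figures), a volume holding 10 million files would need roughly 10,000,000 x 4KB ≈ 40GB on the arbiter brick using the safe estimate, or around 10GB using the community’s 1KB per file figure – a small fraction of a full data brick.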

What are arbiter volumes good for? They support the self-heal mechanism. A little earlier I mentioned that replica bricks can be quarrelsome and blame each other, each trying to prove that its own data is valid and the counterpart’s is corrupted. The arbiter does what arbiters have done ever since the function was introduced into the judicial system: it decides the outcome of the case. In a replica 3 arbitrated volume it serves as a source of metadata (but not data). I have also mentioned that in replica 3 volumes, more than 50% of the bricks have to be online to allow writes. With a replica 3 arbitrated volume, the following scenarios come into play (a quick status check is sketched after the list):

  1. All three (3) bricks (including the arbiter) are online – the volume allows writes
  2. Two (2) data bricks are up and the arbiter is offline – the volume allows writes
  3. One (1) data brick and the arbiter brick are online – the volume allows writes only if the arbiter does not blame the data brick which is online
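
A quick way to check which data and arbiter bricks are currently online (assuming the hypothetical arbvol volume from earlier):

  # Shows each brick, including the arbiter, and whether its brick process is online
  gluster volume status arbvol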

What was the outcome of this particular engagement? A fully functioning Gluster cluster with multiple volumes and well-balanced performance, which is now production ready and about to replace the legacy system.

About Sebastian


Sebastian Baszczyj
Consultant, Information Management
Insentra
e Sebastian.Baszczyj@insentra.com.au

Sebastian is a consultant at Insentra, responsible for the technical design and implementation of leading-class enterprise solutions. Sebastian is a highly accredited Red Hat Architect, Symantec Technical Specialist and Authorized Veritas Consultant, who is well known in the industry for his strong skills, experience and technical knowledge.

Over the past 21 years, Sebastian has worked for local and global system integrators, specializing in the following technologies: Red Hat, Network Security, Data Protection and High Availability. Sebastian has extensive experience in the design and implementation of data protection systems for numerous banks and government agencies, locally and globally. He also has good knowledge of the European technical landscape and methodologies.

