What are my options to scale a data system?!
You might have realized that your system is receiving higher requests than usual, users are experiencing higher latency, your system’s resources (e.g., CPU or memory) are overloaded, and the system is slowing down. One way to handle these issues is to optimize your application, consider a different type of database, or utilize an appropriate indexing technique for your system. These solutions can solve some of the problems mentioned above; however, at some point, and with more load on your system, you should scale your system up or out!!
Scaling is about changing or expanding the capacity of your system to accommodate specific usage requirements and to increase your system’s performance and capabilities (e.g., lower latency, handle more requests, enhance the overall throughput with more processing powers, space to store more data, network connections to support the received traffic). Scaling is achieved by different techniques. The first technique is to add resource(s) to the current system, and that is referred to as vertical scaling. The second technique is to attach new system(s) to the current system, and that is referred to as horizontal scaling. The third technique is a hybrid solution that considers some aspects of the first two techniques together. In this article, we would like to discuss these different techniques, especially in relation to data storage and management. It is worth mentioning that systems can scale up by adding more resources to accommodate higher performance metrics; however, systems can also scale down by removing resources when not needed. Such scaling up and down can be done manually, as in the on-premise environments, or dynamically, as in the on-cloud environments.
Vertical Scaling (Scaling Up)
Your system may reside on a server that can be on-premise or on-cloud. The first option to scale up your system is by adding new resources to the current pool of resources, and this is referred to as Vertical Scaling. These resources can be computational type (e.g., adding more CPU and Cache), storage type (e.g., adding more RAM and secondary storage units), or network type (e.g., network interfaces). The different types of databases -both relational and non-relational- can scale up using this vertical scaling.
Vertical scaling is usually cheaper, faster, and easier to implement; however, this option faces some challenges. At some point, it is difficult to manage new resources continuously added to the same machine (a.k.a. server or node), and there are usually hardware limits on the number of resources that can be added to a single machine. Restarting the machine to add resources might be required, which could result in downtime and service disruption. The system is also not fault-tolerant since all these resources run on the same machine; the entire system and all these resources are down if that machine is down — which is referred to as a Single Point of Failure (SPOF).
Horizontal Scaling (Scaling Out)
To address the limitations of vertical scaling, the horizontal scaling option is about adding new systems to join the existing system. Each newly added system is a server which a pool of computational, storage, and network resources. The newly added systems are designed to work in parallel and to divide the load with the current system. These servers can be collocated or geographically distributed. Horizontal scaling, also called scale-out, addresses the physical limitation of the vertical scaling option when it comes to including more resources by providing a solution that is easier to upgrade for higher throughput. The horizontal scaling option also addresses the SPOF challenge for an overall higher resilient and fault-tolerant solution.
However, in most cases, including additional systems is considered expensive and more challenging to implement, manage, and maintain. Horizontal scaling technologies are used in a wide range of tech domains, and when it comes to data storage and management, there are two common horizontal scaling techniques: replication and partitioning (or sharding). Each machine (referred to as a node) -as part of a group of machines (referred to as a cluster)- can store part or a copy of the data. The horizontal scaling techniques mainly address non-relational types of databases since data collections are independent and can be self-addressed and self-contained. However, spreading a relational database on a cluster of nodes is challenging since data queries depend mainly on the relations that join the different tables.
Horizontal Scaling: Replication
Replication is a horizontal scaling technique where synchronized copies of a database are created and hosted by the different nodes in a cluster. It is worth mentioning that replication differs from database backup (or mirroring). Database backup is the process of keeping an up-to-date copy of the database as a safety procedure in case of a hardware or software failure, where such copy is not used for other purposes. Replication, on the other hand, is the process of enabling higher fault tolerance to the whole system with a focus on data accessibility and data availability, where other nodes still function even if one or more nodes are down. With such distributed copies, rather than overwhelming single node addressing all database read queries, these queries and requests can be distributed across the different nodes in the cluster - providing load balancing for less processing requirements from each node, higher performance executing queries, and lower latency without interfering with the work of the other nodes. Other types of queries (e.g., creating, writing, updating) are addressed to the primary (or master) node in the cluster, and the edits are then propagated to all other nodes in the cluster to keep all copies up to date and to ensure data integrity. There is a wide range of choices when it comes to operating and running replication strategies.
Synchronizing the edits and updates between the different copies adds more traffic and processing requirements and is considered one of the challenges of replication. Several problems may occur in such a synchronization process, ranging from data loss or inconsistent data copies hosted by the different nodes to managing requirements for the destination servers and analyzing the required bandwidth and the expected traffic. Replication is usually an ongoing and continuous operation to keep a consistent and resilient cluster. There are common techniques for replication. Full replication techniques focus on storing complete database copies on each node, and partial replication techniques only focus on identifying and replicating the frequently used database fragments. Each technique offers a set of features, and the choice depends on the requirements to be addressed and the size of the data to be stored.
The Full table replication copies all existing or new data from the primary database to every other node in the cluster. This approach enables high data availability, higher overall system throughput, efficient load distribution, and faster execution of queries. However, this replication option produces more traffic on the network and requires more processing power. In another version, the cluster nodes receive complete initial copies of the primary database and then receive real-time updates as data changes to ensure consistency. The Key-based incremental replication copies only the updated and new data instead of all. This replication technique usually identifies the data records to be copied to other nodes in the cluster based on a key parameter (e.g., timestamps). Using timestamps, the key-based technique can replicate all data records with a timestamp from a certain point forward. The key-based replication provides a faster and more efficient replication solution since fewer data records are copied in case of new or updated records; however, this solution cannot capture if data records are deleted from the primary database and propagate such updates to other nodes in the cluster.
Difference-based, also called snapshot, replication is based on comparing a compressed image (or snapshot) of the primary database with a compressed image of a database hosted by another node in the cluster. Comparing two snapshots enables identifying the changes between them; however, this method runs efficiently when data changes are infrequent and with smaller data sets. Another version of the snapshot is the Log-based incremental replication. This technique copies data records from the primary database to other cluster nodes based on the binary log file. A log file hosts information about changes committed to a database, including inserts, updates, and deletes. The log-based replication option provides the most efficient solution, and a wide range of database management systems (DBMS) support this technique.
Horizontal Scaling: Sharding (Partitioning)
In some applications and services, the amount of data available is massive and grows continuously. Such a volume of data exceeds a single device’s storage and processing capabilities. Instead of replicating copies of the full database on every cluster node, sharding provides a different horizontal scaling solution for database storage and management. Sharding is about dividing a database into smaller chunks or portions (known as shards) with respect to a sharding strategy (or simply a key) and distributing the partitions on the cluster’s nodes, where every node is responsible for processing and managing the assigned shards. This horizontal partitioning supports parallel data processing and query execution. With a query, the cluster decides which node (or nodes) hosts the required data to which the query should be forwarded.
Let us consider a table of employees as an example; we can divide such a table horizontally (by rows) where each shard has a set of employees with their complete information. We might also divide the table vertically (by columns), where each shard has a set of employees’ attributes. In order to support the efficient execution of some database operators, the same partition of data or related data can be hosted on more than one node, or the data references can be replicated to different nodes. This comes with the overhead of transferring the sub-results over the network to be merged into the final result.
Sharding provides an efficient horizontal scalability solution, especially with a massive collection of data; however, re-building the original unsharded database version requires merging the last versions of the shards from the different nodes, which can be a time-consuming and costly process. On the other hand, the criteria for partitioning data is not usually a straightforward process. For example, to achieve load balance between the different nodes, your data records should be partitioned based on how frequently they are requested. There are a few common solutions, known as sharding architectures, that address how to distribute data records on the cluster’s nodes.
The first sharding architecture is hash-based sharding. The hash-based sharding considers a specific attribute of the data record (e.g., employee address) as an input to a hash function, and the output hash value determines which node or shard such record should be stored in. The hash value also guides the query execution process and answers where to find the required data records. The hash-based sharding is simple, easy to implement, and evenly distributes the data records on the available cluster’s nodes; however, it faces certain limitations. The first limitation is that in the case of adding more nodes to the cluster or removing failed nodes, the data records should be re-mapped again.
The second sharding architecture is range-based sharding. The range-based sharding partitions the data with respect to specific attributes of the data records (e.g., product price) into ranges. The range-based is also simple and easy to implement; however, it may not divide the data evenly on the available cluster’s nodes, and efficient load balance may not be achieved.
The third sharding architecture is directory-based sharding. The directory-based sharding maintains a lookup table that keeps track of which data records are stored in which data shards with respect to a certain attribute (key) or a set of attributes (keys) in the data records. This solution provides a dynamic approach to add or remove nodes to the cluster; however, the lookup table represents a single point of failure, and also checking the lookup table before every query may introduce latency.