

      Database Sharding: Concepts, Examples, and Strategies

      Many software applications use a relational database management system (RDBMS) to store data. As the database grows, it becomes more time- and storage-intensive to store and retrieve the data. One popular solution to this problem is database sharding. A sharded database distributes the records in a database’s tables across different databases on different computer systems. This guide explains how database sharding works and discusses some of the advantages and disadvantages of sharding. It also describes some of the main sharding strategies and provides some database sharding examples.

      What is Database Sharding?

      As databases grow larger, they can be scaled in one of two ways. Vertical scaling involves upgrading the server hosting the database with more RAM, CPU capacity, or disk space, allowing it to store more data and process queries more quickly and effectively. Horizontal scaling, also known as “scaling out”, adds additional servers to distribute the workload.

      Data sharding is a common way of implementing horizontal scaling. Database sharding divides the table records in a database into smaller portions. Each portion is a shard, and is stored on a different server. The database can be divided into shards based on different methods. In a simple implementation, individual tables can be assigned to different shards. More often, the rows in a single table are divided between the shards.

      Vertical partitioning and horizontal partitioning are two different methods of dividing tables into shards. Vertical partitioning assigns different columns within a table to different servers, but this technique is not widely used. In most cases, sharding is implemented through horizontal partitioning, and the two terms are often used interchangeably. Horizontal sharding divides the rows within a table amongst the different shards and keeps each individual table row intact.


      Vertical partitioning and horizontal partitioning should not be confused with vertical and horizontal scaling.

      The shards are distributed across the different servers in the cluster. Each shard has the same database schema and table definitions. This maintains consistency across the shards. Sharding allocates each row to a shard based on a sharding key. This key is typically an index or primary key from the table. A good example is a user ID column. However, it is possible to generate a sharding key from any field, or from multiple table columns. The selection of the sharding key should be reasonable for the application and effectively distribute the rows among the shards. For example, a country code or zip code is a good choice to distribute the data to geographically dispersed shards. Sharding is particularly advantageous for databases that store large amounts of data in relatively few tables, and have a high volume of reads and writes.
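      As a rough sketch, allocating rows to shards based on a sharding key might look like the following. The user IDs, the two-shard layout, and the routing rule are assumptions made for illustration, not part of any particular RDBMS:

```javascript
// Illustrative sketch: route each whole row to a shard using its
// sharding key (here, a user_id column). All values are invented.
const NUM_SHARDS = 2;

// Pick the shard for a row; one possible key-based rule.
function shardFor(row) {
  return row.user_id % NUM_SHARDS;
}

const rows = [
  { user_id: 101, name: "Ada" },
  { user_id: 102, name: "Grace" },
  { user_id: 103, name: "Edsger" },
];

// Group rows by shard; each shard keeps complete rows (horizontal sharding).
const shards = new Map();
for (const row of rows) {
  const id = shardFor(row);
  if (!shards.has(id)) shards.set(id, []);
  shards.get(id).push(row);
}

console.log(shards.get(0).length); // 1 (user 102)
console.log(shards.get(1).length); // 2 (users 101 and 103)
```

      Every shard stores the same table definition; only the rows differ, which is what keeps the schema consistent across servers.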

      Each shard can be accessed independently and does not necessarily require access to the other shards. Different tables can use different sharding techniques and not all tables necessarily have to be sharded. As an ideal, sharding strives towards a shared-nothing architecture, in which the shards do not share data and there is no data duplication. In practice, it is often advantageous to replicate certain data to each shard. This avoids the need to access multiple servers for a single query and can result in better performance.

      The following example demonstrates how horizontal sharding works in practice. Before the database is sharded, the example store table is organized in the following way:

      store_ID | city     | state | zip_code
      2459     | New York | NY    | 10022

      After sharding, one shard has half the rows from the table.


      The second shard contains the remainder of the rows.

      store_ID | city     | state | zip_code
      2459     | New York | NY    | 10022

      Sharding does not necessarily make any backup copies of the data. Each record is still only stored on a single server. Replication is used to copy information to another server, resulting in primary and secondary copies of the data. Replication enhances reliability and robustness at the cost of additional complexity and resources. Sharded databases can be replicated, but the procedure for doing so can be very complex.

      Replication and caching are both potential alternatives to sharding, particularly in applications that mainly read data from a database. Replication spreads out the queries to multiple servers, while caching speeds up the requests. See our guide
      How to Configure Source-Replica Replication in MySQL
      to learn more about data replication.

      Pros and Cons of a Sharded Database

      Generally, a horizontal scaling approach is more robust and effective than vertical scaling. Vertical scaling is much easier to implement, because it mainly consists of hardware upgrades. It might be the correct approach for a medium-sized database that is slowly reaching its limit. However, no single server can be upgraded indefinitely, and ongoing growth rapidly becomes unmanageable. The limits of vertical scaling usually lead administrators to seek another alternative.

      Horizontal scaling allows systems to achieve a much higher scaling rate. Additional servers can be added as required, permitting the database system to organically grow and access additional resources. It provides administrators with much more flexibility.

      Database sharding is a horizontal scaling strategy, so it shares the advantages of this approach. However, it also offers several additional benefits, including the following:

      • It improves performance and speeds up data retrieval. Based on the sharding key, the database system immediately knows which shard contains the data. It can quickly route the query to the right server. Because each shard only contains a subset of the rows, it is easier for the database server to find the correct entry.
      • Additional computing capacity can be added with no downtime. Sharding increases the data storage capacity and the total resources available to the database.
      • It can be more cost efficient to run multiple servers than one mega-server.
      • Sharding can simplify upgrades, allowing one server to be upgraded at a time.
      • A sharded approach is more resilient. If one of the servers is offline, the remaining shards are still accessible. Sharding can be combined with high availability techniques for even higher reliability.
      • Many modern database systems provide some tools to assist with sharding, although they do not completely automate the process.

      Unfortunately, sharding also has drawbacks. Some of the downsides include:

      • Sharding greatly increases the complexity of a software development project. Additional logic is required to shard the database and properly direct queries to the correct shard. This increases development time and cost. A more elaborate network mesh is often necessary, which leads to an increase in lab and infrastructure costs.
      • Latency can be higher than with a standard database design.
      • SQL join operations affecting multiple shards are more difficult to execute and take longer to complete. Some operations might become too slow to be feasible. However, the right design can facilitate better performance on common queries.
      • Sharding requires a lot of tuning and tweaking as the database grows. This sometimes requires a reconsideration of the entire sharding strategy and database design. Even with proper planning, the shard distribution can unexpectedly become lopsided.
      • It is not always obvious how many shards and servers to use, or how to choose the sharding key. A poor sharding key can adversely affect performance or data distribution, causing some shards to be overloaded while others sit almost empty, which leads to hotspots and inefficiencies.
      • It is more challenging to change the database schema after sharding is implemented. It is also difficult to convert the database back to its pre-sharded state.
      • Shard failures can cause cross-shard inconsistencies and other failures.
      • Backup and replication tasks are more difficult with a sharded database.
      • Although most RDBMS applications provide some sharding support, the tools are often not robust or complete. Most systems still do not fully support automatic sharding.

      Database Sharding Strategies: Common Architectures

      Any sharding implementation must first decide on a database sharding strategy. Database designers must consider how many shards to use and how to distribute the data to the various servers. They must decide what queries to optimize, and how to handle joins and bulk data retrieval. A system in which the data frequently changes requires a different architecture than one that mainly handles read requests. Replication, reliability, and a maintenance strategy are also important considerations.

      The choice of a sharding architecture is a critical decision, because it affects many of the other considerations. Most sharded databases have one of the following four architectures:

      • Range Sharding.
      • Hashed Sharding.
      • Directory-Based Sharding.
      • Geographic-Based Sharding.

      Range Sharding

      Range sharding examines the value of the sharding key and determines what range it falls into. Each range directly maps to a different shard. The sharding key should ideally be immutable. If the key changes, the shard must be recalculated and the record copied to the new shard. Otherwise, the mapping is destroyed and the location could be lost. Range sharding is also known as dynamic sharding.

      As an example, if the userID field is the sharding key, then records with IDs between 1 and 10000 could be stored in one shard. IDs between 10001 and 20000 map to a second shard, and those between 20001 and 30000 to a third.

      This approach is fairly easy to design and implement, and requires less programming time. The database application only has to compare the value of the sharding key to the predefined ranges using a lookup table. This scheme is also easier to redesign and maintain. Range sharding is a good choice if records with similar keys are frequently viewed together.
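      The lookup described above can be sketched in a few lines. The range boundaries mirror the userID example; the table layout and function name are illustrative assumptions:

```javascript
// Illustrative range-sharding lookup table for the userID example.
// Each entry maps an inclusive ID range to a shard number.
const ranges = [
  { min: 1,     max: 10000, shard: 0 },
  { min: 10001, max: 20000, shard: 1 },
  { min: 20001, max: 30000, shard: 2 },
];

// Return the shard whose range contains the given sharding key.
function shardForUserId(userId) {
  const entry = ranges.find((r) => userId >= r.min && userId <= r.max);
  if (!entry) throw new Error(`no shard range covers userID ${userId}`);
  return entry.shard;
}

console.log(shardForUserId(42));    // 0
console.log(shardForUserId(15000)); // 1
console.log(shardForUserId(20001)); // 2
```

      Because the mapping is explicit, adding a fourth shard only requires appending a new range entry, which is part of why range sharding is easier to maintain.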

      Range sharding works best if there are a large number of possible key values that are fairly evenly distributed across the entire range. This design works poorly if most of the key values map to the same shard, and the architecture is generally prone to a poor distribution of rows among the shards. Even a good design can become unbalanced over time. For example, older accounts are more likely to have been deleted over the years, leaving the corresponding shard relatively empty. This leads to inefficiencies in the database. Choosing fairly large ranges can reduce, but not eliminate, this possibility.

      The database sharding examples below demonstrate how range sharding might work using the data from the store database. In this case, the records for stores with store IDs of 2000 or less are placed in one shard. Stores with IDs of 2001 and greater go in the other.

      The first shard contains the following rows:


      The second shard has the following entries:

      store_ID | city     | state | zip_code
      2459     | New York | NY    | 10022

      This results in a slightly imbalanced distribution of records. However, as new stores are added, they might be assigned larger store IDs. This leads to a greater imbalance as time goes on.

      To keep the database running efficiently, shards and ranges have to be regularly rebalanced. This might involve splitting the shards apart and reassigning the data, or merging several smaller shards. If the data is not regularly monitored, performance can steadily degrade.

      Hash Sharding (Key-Based)

      Hash-based sharding, also known as key-based or algorithmic sharding, likewise uses the shard key to determine which shard a record is assigned to. However, instead of mapping the key directly to a shard, it applies a hash function to the shard key. A hash function transforms one or more data points into a new value that lies within a fixed-size range. In this case, the size of the range is equal to the number of shards. The database uses the output of the hash function to allocate the record to a shard. This typically results in a more even distribution of the records across the different shards.

      This method allows multiple fields to be used as a compound shard key. This eliminates clumping and clustering, and is a better approach to use if several records can share the same key. Hash functions vary in complexity. A simple hash function calculates the remainder, or modulus, of the key divided by the number of shards. More complex hashing algorithms apply mathematically advanced equations to multiple inputs. However, it is important to use the same hash function on the same keys for each hashing operation. As with range sharding, the key value should be immutable. If it changes, the hash value must be recalculated and the database entry remapped.
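      A sketch of the modulus-style hash described above, plus a compound-key variant. The string-combining scheme used for the compound key is an illustrative assumption, not a standard algorithm:

```javascript
// Simple hash sharding: the shard is the key modulo the shard count.
const NUM_SHARDS = 3;

function shardForKey(key) {
  return key % NUM_SHARDS;
}

// A compound shard key hashes several fields combined. This particular
// string-hashing scheme is invented for illustration.
function shardForCompoundKey(...fields) {
  const s = fields.join("|");
  let h = 0;
  for (const ch of s) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // keep an unsigned 32-bit value
  }
  return h % NUM_SHARDS;
}

console.log(shardForKey(2459)); // 2, matching the store example below
console.log(shardForCompoundKey("New York", "NY")); // always the same shard
```

      The essential property is determinism: the same key must always hash to the same shard, which is why the key value should never change after the record is written.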

      Hash sharding is more efficient than range sharding because a lookup table is not required. The hash is calculated in real time for each query. However, it is impossible to group related records together, and there is no logical connection between the records on a given shard. This requires most bulk queries to read records from multiple shards. Hash sharding is more advantageous for applications that read or write one record at a time.

      Hash sharding does not guarantee that the shards remain perfectly balanced. Patterns in the data can still lead to clustering, which can also occur purely by chance. Hash sharding additionally complicates the tasks of rebalancing and rebuilding the shards. To add more shards, it is usually necessary to re-merge all the data, recalculate the hashes, and reassign all the records.
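      The rebalancing problem can be seen directly: changing the shard count changes the hash result for most keys, so most records must move. The key values below are arbitrary examples:

```javascript
// Demonstrates why adding a shard to a modulus-hashed cluster forces
// most records to move: key % 3 and key % 4 rarely agree.
const keys = [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008];

const before = keys.map((k) => k % 3); // three shards
const after  = keys.map((k) => k % 4); // a fourth shard is added

const moved = keys.filter((_, i) => before[i] !== after[i]);
console.log(`${moved.length} of ${keys.length} records change shards`); // 7 of 8
```

      Schemes such as consistent hashing reduce this churn, at the cost of a more complex mapping.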

      The following database sharding example demonstrates a simple hash sharding operation. It uses the simple hash function store_ID % 3 to assign the records in the store database to one of three shards. The first step is to calculate a hash result for each entry.


      The hash results are not actually stored inside the database. They are shown in the final column for clarity.

      store_ID | city     | state | zip_code | hash result
      2459     | New York | NY    | 10022    | 2

      Rows having a hash result of 0 map to the first shard.


      Those that have a hash result of 1 are assigned to shard number two.


      The remainder are stored in the third shard.

      store_ID | city     | state | zip_code
      2459     | New York | NY    | 10022

      In this case, although the data set is quite small, the hash function still distributes the entries evenly. This is not always the case with every database. However, as records are added, the distribution is likely to remain reasonably balanced.

      Directory-Based Sharding

      Directory-based sharding groups related items together on the same shard. This is also known as entity or relationship-based sharding. It typically uses the value contained in a certain field to decide what shard to use. Directory sharding is accomplished through the use of a static lookup table. The table contains a list of mappings between each possible value for the field and its designated shard. Each key can only map to one shard and must appear in the lookup table exactly once. However, many keys can potentially map to the same shard.

      As an example, the records in a table of customers can be mapped to shards based on the customer’s home state. The lookup table contains a list of all fifty states, which serve as the shard keys, along with the shard each state maps to. This allows for a system design where the records of all customers living in New England are stored on the first shard. Clients in the Mid-Atlantic are located on shard two. Clients residing in the Deep South are mapped to the third shard.
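      A directory like the one described above is just a static lookup table. Only a few states are shown here for brevity; the region-to-shard assignments follow the example in the text:

```javascript
// Illustrative directory (lookup table) mapping a customer's home state
// to a shard. A real table would list all fifty states exactly once.
const directory = {
  // New England -> shard 0
  ME: 0, NH: 0, VT: 0, MA: 0, RI: 0, CT: 0,
  // Mid-Atlantic -> shard 1
  NY: 1, NJ: 1, PA: 1,
  // Deep South -> shard 2
  AL: 2, GA: 2, LA: 2, MS: 2, SC: 2,
};

function shardForState(state) {
  const shard = directory[state];
  if (shard === undefined) throw new Error(`state ${state} not in directory`);
  return shard;
}

console.log(shardForState("MA")); // 0
console.log(shardForState("NY")); // 1
console.log(shardForState("GA")); // 2
```

      Note that many keys (states) map to the same shard, but each key appears in the directory exactly once.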

      Directory-based sharding provides a high level of control and flexibility in determining how the data is stored. When intelligently designed, it speeds up common table joins and the bulk retrieval of related data. This architecture is very helpful if the shard key can only be assigned a small number of possible values. Unfortunately, it is highly prone to clustering and imbalanced tables, and the overhead of accessing the lookup table degrades performance. However, the benefits of this architecture often outweigh its drawbacks.

      Directory-based sharding is a good choice for the stores database. The store entries can be distributed to the different shards based on their location. In this design, locations in New England and the mid-Atlantic are stored in the first shard, which serves as the North-East shard. Stores in the Midwest are written to the second shard.

      The first shard contains the entries displayed below.

      store_ID | city     | state | zip_code
      2459     | New York | NY    | 10022

      The second shard contains the remainder of the data.


      Although these two shards are perfectly balanced, this is not the main goal of directory sharding. It instead seeks to generate useful and relevant shards of closely-related information, which this example also accomplishes.

      Geographic-Based Sharding

      Geographic-based sharding, or Geo-sharding, is a specific type of directory-based sharding. Data is divided amongst the shards based on the location of the entry, which relates to the location of the server hosting the shard. The sharding key is typically a city, state, region, country, or continent. This groups geographically similar data on the same shard. It works the same way directory-based sharding does.

      A good example of geo-sharding relates to geographically dispersed customer data. The customer’s home state is used as a sharding key. The lookup table maps customers living in states in the same sales region to the same shard. Each shard is located on a server located in the same region as the customer data it contains. This makes it very quick and efficient for a regional sales team to access customer data.

      Is Sharding Right For Your Business?

      Because sharding has both advantages and drawbacks, it is important to consider which type of database benefits the most from sharding. The first part of any sharding strategy is to decide whether to shard at all. To generalize, sharding makes the most sense for a high-volume database that stores a large amount of data in a few simple tables. Sharding is especially compelling if a company expects a large increase in the size of its database. Sharding is also useful for organizations that want to access or co-locate their data on a regional basis. For instance, a large social media company would want its users to access database servers in the same country or on the same continent. This requires the company to shard its data based on user location.

      In other cases, the complexity and difficulty associated with sharding are greater than the benefits. A database with many small to medium-sized tables could use vertical scaling, increasing the storage and computing power on a single server. It could also use alternative strategies such as replication for greater resilience and read-only throughput.


      This guide answers the question, “What is database sharding?”. Sharding is a method of distributing the data in a database table to several different shards based on the value of a sharding key. Each shard is stored on a different server. Ideally, the records in a sharded database are distributed amongst the shards in an equitable manner. The different shards share the same table definitions and schemas, but each record is only stored on a single shard.

      Sharding allows a database to scale horizontally, taking advantage of the increased storage, memory, and processing power that only multiple servers can offer. It also increases resiliency and performance. Each query only has to search through a portion of the total records, which is much faster. As a drawback, sharding increases the complexity of a database and increases the difficulty of joins and schema changes.

      Sharding can be accomplished using range sharding, hash sharding, or directory-based sharding, of which geographic sharding is a special case. Range sharding is the easiest method to implement, but is more likely to result in unequal shards. Hash sharding distributes the records more effectively, but is more difficult to implement. Directory-based sharding groups related items together on the same shard.

      A sharded database can be implemented using multiple Linode servers. Linode allows you to configure a full web application on a powerful Linux operating system running the industry-standard LAMP stack. Choose from a high-performance
      Dedicated CPU
      service, or a flexible and affordable
      Shared CPU
      alternative. Similarly, you can also use
      Linode’s Managed Database service
      to deploy a database cluster without the need to install and maintain the database infrastructure.

      More Information

      You may wish to consult the following resources for additional information
      on this topic. While these are provided in the hope that they will be
      useful, please note that we cannot vouch for the accuracy or timeliness of
      externally hosted materials.


      How To Use Sharding in MongoDB

      The author selected the Open Internet/Free Speech Fund to receive a donation as part of the Write for DOnations program.


      Database sharding is the process of splitting up records that would normally be held in the same table or collection and distributing them across multiple machines, known as shards. Sharding is especially useful in cases where you’re working with large amounts of data, as it allows you to scale your database horizontally by adding more machines that can function as new shards.

      In this tutorial, you’ll learn how to deploy a sharded MongoDB cluster with two shards. This guide will also outline how to choose an appropriate shard key as well as how to verify whether your MongoDB documents are being split up across shards correctly and evenly.

      Warning: The goal of this guide is to outline how sharding works in MongoDB. To that end, it demonstrates how to get a sharded cluster set up and running quickly for use in a development environment. Upon completing this tutorial you’ll have a functioning sharded cluster, but it will not have any security features enabled.

      Additionally, MongoDB recommends that a sharded cluster’s shard servers and config server all be deployed as replica sets with at least three members. Again, though, in order to get a sharded cluster up and running quickly, this guide outlines how to deploy these components as single-node replica sets.

      For these reasons, this setup is not considered secure and should not be used in production environments. If you plan on using a sharded cluster in a production environment, we strongly encourage you to review the official MongoDB documentation on Internal/Membership Authentication as well as our tutorial on How To Configure a MongoDB Replica Set on Ubuntu 20.04.


      To follow this tutorial, you will need:

      • Four separate servers. Each of these should have a regular, non-root user with sudo privileges and a firewall configured with UFW. This tutorial was validated using four servers running Ubuntu 20.04, and you can prepare your servers by following this initial server setup tutorial for Ubuntu 20.04 on each of them.
      • MongoDB installed on each of your servers. To set this up, follow our tutorial on How to Install MongoDB on Ubuntu 20.04 for each server.
      • All four of your servers configured with remote access enabled for each of the other instances. To set this up, follow our tutorial on How to Configure Remote Access for MongoDB on Ubuntu 20.04. As you follow this guide, make sure that each server has the other three servers’ IP addresses added as trusted IP addresses to allow for open communication between all of the servers.

      Note: The linked tutorials on how to configure your server, install MongoDB, and then allow remote access to MongoDB all refer to Ubuntu 20.04. This tutorial concentrates on MongoDB itself, not the underlying operating system. It will generally work with any MongoDB installation regardless of the operating system as long as each of the four servers are configured as outlined previously.

      For clarity, this tutorial will refer to the four servers as follows:

      • mongo-config, which will function as the cluster’s config server.
      • mongo-shard1 and mongo-shard2, which will serve as shard servers where the data will actually be distributed.
      • mongo-router, which will run a mongos instance and function as the shard cluster’s query router.

      For more details on what these roles are and how they function within a sharded MongoDB cluster, please read the following section on Understanding MongoDB’s Sharding Topology.

      Commands that must be executed on mongo-config will have a blue background, like this:

      Commands that must be executed on mongo-shard1 will have a red background:

      Commands run on mongo-shard2 will have a green background:

      And the mongo-router server’s commands will have a violet background:

      Understanding MongoDB’s Sharding Topology

      When working with a standalone MongoDB database server, you connect to that instance and use it to directly manage your data. In an unsharded replica set, you connect to the cluster’s primary member, and any changes you make to the data there are automatically carried over to the set’s secondary members. Sharded MongoDB clusters, though, are slightly more complex.

      Sharding is meant to help with horizontal scaling, also known as scaling out, since it splits up records from one data set across multiple machines. If the workload becomes too great for the shards in your cluster, you can scale out your database by adding another separate shard to take on some of the work. This contrasts with vertical scaling, also known as scaling up, which involves migrating one’s resources to larger or more powerful hardware.

      Because data is physically divided into multiple database nodes in a sharded database architecture, some documents will be available only on one node, while others will reside on another server. If you decided to connect to a particular instance to query the data, only a subset of the data would be available to you. Additionally, if you were to directly change any data held on one shard, you run the risk of creating inconsistency between your shards.

      To mitigate these risks, sharded clusters in MongoDB are made up of three separate components:

      • Shard servers are individual MongoDB instances used to store a subset of a larger collection of data. Every shard server must always be deployed as a replica set. There must be a minimum of one shard in a sharded cluster, but to gain any benefits from sharding you will need at least two.
      • The cluster’s config server is a MongoDB instance that stores metadata and configuration settings for the sharded cluster. The cluster uses this metadata for setup and management purposes. Like shard servers, the config server must be deployed as a replica set to ensure that this metadata remains highly available.
      • mongos is a special type of MongoDB instance that serves as a query router. mongos acts as a proxy between client applications and the sharded cluster, and is responsible for deciding where to direct a given query. Every application connection goes through a query router in a sharded cluster, thereby hiding the complexity of the configuration from the application.

      Because sharding in MongoDB is done at a collection level, a single database can contain a mixture of sharded and unsharded collections. Although sharded collections are partitioned and distributed across the multiple shards of the cluster, one shard is always elected as a primary shard. Unsharded collections are stored in their entirety on this primary shard.

      Since every application connection must go through the mongos instance, the mongos query router is what’s responsible for making all data consistently available and distributed across individual shards.
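      As a conceptual sketch (not MongoDB’s actual implementation), a query router consults the cluster’s metadata to decide which shard should serve each request. The chunk ranges and shard names below are invented for illustration:

```javascript
// Toy sketch of a query router: it consults config-server metadata to
// find which shard holds a document, so clients never contact shards
// directly. All ranges and names are illustrative assumptions.
const configMetadata = {
  chunks: [
    { minKey: 0,   maxKey: 500,      shard: "shard1" },
    { minKey: 500, maxKey: Infinity, shard: "shard2" },
  ],
};

// Route a query to the shard whose chunk covers the shard key value.
function routeQuery(shardKeyValue) {
  const chunk = configMetadata.chunks.find(
    (c) => shardKeyValue >= c.minKey && shardKeyValue < c.maxKey
  );
  return chunk.shard;
}

console.log(routeQuery(42));   // "shard1"
console.log(routeQuery(9001)); // "shard2"
```

      In a real cluster, mongos caches this metadata from the config server and refreshes it as chunks migrate between shards.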

      Diagram outlining how to connect to a sharded MongoDB cluster. Applications connect to the mongos query router, which connects to a config server to determine how to query and distribute data to shards. The query router also connects to the shards themselves.

      Step 1 — Setting Up a MongoDB Config Server

      After completing the prerequisites, you’ll have four MongoDB installations running on four separate servers. In this step, you’ll convert one of these instances — mongo-config — into a replica set that you can use for testing or development purposes. You’ll also set this MongoDB instance up with features that will allow it to serve as a config server for a sharded cluster.

      Warning: Starting with MongoDB 3.6, both individual shards and config servers must be deployed as replica sets. It’s recommended to always have replica sets with at least three members in a production environment. Using replica sets with three or more members is helpful for keeping your data available and secure, but it also substantially increases the complexity of the sharded architecture. However, you can use single-node replica sets for local development, as this guide outlines.

      To reiterate the warning given previously in the introduction, this guide outlines how to get a sharded cluster up and running quickly. Hence, it outlines how to deploy a sharded cluster using shard servers and a config server that each consist of a single-node replica set. Because of this, and because it will not have any security features enabled, this setup is not secure and should not be used in a production environment.

      On mongo-config, open the MongoDB configuration file in your preferred text editor. Here, we’ll use nano:

      • sudo nano /etc/mongod.conf

      Find the configuration section with lines that read #replication: and #sharding: towards the bottom of the file:


      . . .
      #replication:

      #sharding:
      . . .

      Uncomment the #replication: line by removing the pound sign (#). Then add a replSetName directive below the replication: line, followed by a name MongoDB will use to identify the replica set. Because you’re setting up this MongoDB instance as a replica set that will function as a config server, this guide will use the name config:


      . . .
      replication:
        replSetName: "config"
      . . .

      Note that there are two spaces preceding the new replSetName directive and that its config value is wrapped in quotation marks. This syntax is required for the configuration to be read properly.

      Next, uncomment the #sharding: line as well. On the next line after that, add a clusterRole directive with a value of configsvr:


      . . .
      replication:
        replSetName: "config"

      sharding:
        clusterRole: configsvr
      . . .

      The clusterRole directive tells MongoDB that this server will be a part of the sharded cluster and will take the role of a config server (as indicated by the configsvr value). Again, be sure to precede this line with two spaces.

      Note: When both the replication and security lines are enabled in the mongod.conf file, MongoDB also requires you to configure some means of authentication other than password authentication, such as keyfile authentication or setting up x.509 certificates. If you followed our How To Secure MongoDB on Ubuntu 20.04 tutorial and enabled authentication on your MongoDB instance, you will only have password authentication enabled.

      Rather than setting up more advanced security measures, for the purposes of this tutorial it would be prudent to disable the security block in your mongod.conf file. Do so by commenting out every line in the security block with a pound sign:


. . .
#security:
#  authorization: enabled
. . .

      As long as you only plan to use this database to practice sharding or other testing purposes, this won’t present a security risk. However, if you plan to use this MongoDB instance to store any sensitive data in the future, be sure to uncomment these lines to re-enable authentication.

      After updating these two sections of the file, save and close the file. If you used nano, you can do so by pressing CTRL + X, Y, and then ENTER.

      Then, restart the mongod service:

      • sudo systemctl restart mongod

With that, you’ve enabled replication for the server. However, the MongoDB instance isn’t yet replicating any data. You’ll need to start replication through the MongoDB shell, so open it up with the following command:

      • mongo

      From the MongoDB shell prompt, run the following command to initiate this replica set:

      • rs.initiate()

      This command will start the replication with the default configuration inferred by the MongoDB server. When setting up a replica set that consists of multiple separate servers, as would be the case if you were deploying a production-ready replica set, you would pass a document to the rs.initiate() method that describes the configuration for the new replica set. However, because this guide outlines how to deploy a sharded cluster using a config server and shard servers that each consist of a single node, you don’t need to pass any arguments to this method.
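For reference only, here is a sketch of the kind of configuration document you could pass to rs.initiate() for a hypothetical three-member config server replica set. The member hostnames below are placeholders, not servers used in this tutorial:

```javascript
// Sketch of a multi-node replica set configuration document. All hostnames
// here are hypothetical; this tutorial's single-node setup does not need it.
const replSetConfig = {
  _id: "config",     // must match the replSetName set in mongod.conf
  configsvr: true,   // mirrors the clusterRole of configsvr
  members: [
    { _id: 0, host: "mongo-config-1:27017" },
    { _id: 1, host: "mongo-config-2:27017" },
    { _id: 2, host: "mongo-config-3:27017" }
  ]
};
// In the MongoDB shell you would then run: rs.initiate(replSetConfig)
console.log(replSetConfig.members.length); // 3
```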

      MongoDB will automatically read the replica set name and its role in a sharded cluster from the running configuration. If this method returns "ok" : 1 in its output, it means the replica set was started successfully:


      { "info2" : "no configuration specified. Using a default configuration for the set", . . . "ok" : 1, . . . }

      Assuming this is the case, your MongoDB shell prompt will change to indicate that the instance the shell is connected to is now a member of the config replica set:

      config:SECONDARY>

      The first part of this new prompt will be the name of the replica set you configured previously.

      Note that the second part of this example prompt shows that this MongoDB instance is a secondary member of the replica set. This is to be expected, as there is usually a gap between the time when a replica set is initiated and the time when one of its members becomes the primary member.

      If you were to run a command or even just press ENTER after waiting a few moments, the prompt would update to reflect that you’re connected to the replica set’s primary member:

      config:PRIMARY>

      You can verify that the replica set was configured properly by executing the following rs.status() method in the MongoDB shell:

      • rs.status()

      This will return a lot of output about the replica set configuration, but a few keys are especially important:


      { . . . "set" : "config", . . . "configsvr" : true, "ok" : 1, . . . }

      The set key shows the replica set name, which is config in this example. The configsvr key indicates whether it’s a config server replica set in a sharded cluster, in this case showing true. Lastly, the ok flag has a value of 1, meaning the replica set is working correctly.

      In this step, you’ve configured the replica set that will serve as your sharded cluster’s config server. In the next step, you’ll follow a similar procedure to configure the two individual shards.

      Step 2 — Configuring Shard Server Replica Sets

      After completing the previous step, you will have a fully configured replica set that can function as the config server for a sharded cluster. In this step, you’ll convert the mongo-shard1 and mongo-shard2 instances into replica sets as well. Rather than setting them up as config servers, though, you will configure them to function as the actual shards within your sharded cluster.

      In order to set this up, you’ll need to make a few changes to both mongo-shard1 and mongo-shard2’s configuration files. Because you’re setting up two separate replica sets, though, each configuration will use different replica set names.

      On both mongo-shard1 and mongo-shard2, open the MongoDB configuration file in your preferred text editor:

      • sudo nano /etc/mongod.conf
      • sudo nano /etc/mongod.conf

      Find the configuration section with lines that read #replication: and #sharding: towards the bottom of the files. Again, these lines will be commented out in both files by default:

      . . .
      #replication:

      #sharding:
      . . .
      In both configuration files, uncomment the #replication: line by removing the pound sign (#). Then, add a replSetName directive below the replication: line, followed by the name MongoDB will use to identify the replica set. These examples use the name shard1 for the replica set on mongo-shard1 and shard2 for the set on mongo-shard2:


. . .
replication:
  replSetName: "shard1"
. . .


. . .
replication:
  replSetName: "shard2"
. . .

      Then uncomment the #sharding: line and add a clusterRole directive below that line in each configuration file. In both files, set the clusterRole directive value to shardsvr. This tells the respective MongoDB instances that these servers will function as shards.


. . .
replication:
  replSetName: "shard1"

sharding:
  clusterRole: shardsvr
. . .


. . .
replication:
  replSetName: "shard2"

sharding:
  clusterRole: shardsvr
. . .

      After updating these two sections of the files, save and close the files. Then, restart the mongod service by issuing the following command on both servers:

      • sudo systemctl restart mongod
      • sudo systemctl restart mongod

      With that, you’ve enabled replication for the two shards. As with the config server you set up in the previous step, these replica sets must also be initiated through the MongoDB shell before they can be used. Open the MongoDB shells on both shard servers with the mongo command:

      • mongo
      • mongo

      To reiterate, this guide outlines how to deploy a sharded cluster with a config server and two shard servers, all of which are made up of single-node replica sets. This kind of setup is useful for testing and outlining how sharding works, but it is not suitable for a production environment.

      Because you’re setting up these MongoDB instances to function as single-node replica sets, you can initiate replication on both shard servers by executing the rs.initiate() method without any further arguments:

      • rs.initiate()
      • rs.initiate()

      These will start replication on each MongoDB instance using the default replica set configuration. If these commands return "ok" : 1 in their output, it means the initialization was successful:


      { "info2" : "no configuration specified. Using a default configuration for the set", . . . "ok" : 1, . . . }


      { "info2" : "no configuration specified. Using a default configuration for the set", . . . "ok" : 1, . . . }

      As with the config server replica set, each of these shard servers will be elected as a primary member after only a few moments. Although their prompts may at first read SECONDARY>, if you press the ENTER key in the shell after a few moments the prompts will change to confirm that each server is the primary instance of their respective replica set. The prompts on the two shards will differ only in name, with one reading shard1:PRIMARY> and the other shard2:PRIMARY>.

      You can verify that each replica set was configured properly by executing the rs.status() method in both MongoDB shells. First, check whether the mongo-shard1 replica set was set up correctly:

      • rs.status()

      If this method’s output includes "ok" : 1, it means the replica set is functioning properly:


      { . . . "set" : "shard1", . . . "ok" : 1, . . . }

      Executing the same command on mongo-shard2 will show a different replica set name but otherwise will be nearly identical:


      { . . . "set" : "shard2", . . . "ok" : 1, . . . }

      With that, you’ve successfully configured both mongo-shard1 and mongo-shard2 as single-node replica sets. At this point, though, neither these two replica sets nor the config server replica set you created in the previous step are aware of each other. In the next step, you’ll run a query router and connect all of them together.

      Step 3 — Running mongos and Adding Shards to the Cluster

      The three replica sets you’ve configured so far, one config server and two individual shards, are currently running but are not yet part of a sharded cluster. To connect these components as parts of a sharded cluster, you’ll need one more tool: a mongos query router. This will be responsible for communicating with the config server and managing the shard servers.

      You’ll use your fourth and final MongoDB server — mongo-router — to run mongos and function as your sharded cluster’s query router. The query router daemon is included as part of the standard MongoDB installation, but is not enabled by default and must be run separately.

      First, connect to the mongo-router server and stop the MongoDB database service from running:

      • sudo systemctl stop mongod

      Because this server will not act as a database itself, disable the mongod service from starting whenever the server boots up:

      • sudo systemctl disable mongod

      Now, run mongos and connect it to the config server replica set with a command like the following:

      • mongos --configdb config/mongo_config_ip:27017

      The first part of this command’s connection string, config, is the name of the replica set you defined earlier. Be sure to change this value if you used a different name, and replace mongo_config_ip with the IP address of your mongo-config server.

      By default, mongos runs in the foreground and binds only to the local interface, thereby disallowing remote connections. With no additional security configured apart from firewall settings limiting traffic between all of your servers, this is a sound safety measure.

      Note: In MongoDB, it’s customary to differentiate the ports on which the config server and shard servers run, with 27019 being commonly used for config server replica sets and 27018 for shards. To keep things simple, this guide did not change the port that any of the MongoDB instances in this cluster are running on. Thus, all replica sets are running on the default port of 27017.

      The previous mongos command will produce a verbose and detailed output in a format similar to system logs. At the beginning, you’ll find a message like this:


      {"t":{"$date":"2021-11-07T15:58:36.278Z"},"s":"W", "c":"SHARDING", "id":24132, "ctx":"main","msg":"Running a sharded cluster with fewer than 3 config servers should only be done for testing purposes and is not recommended for production."} . . .

      This means the query router connected to the config server replica set correctly and noticed it’s built with only a single node, a configuration not recommended for production environments.

      Note: Although running in the foreground like this is its default behavior, mongos is typically run as a daemon using a process like systemd.

      Running mongos as a system service is beyond the scope of this tutorial, but we encourage you to learn more about using and administering the mongos query router by reading the official documentation.

      Now you can add the shards you configured previously to the sharded cluster. Because mongos is running in the foreground, open another shell window connected to mongo-router. From this new window, open up the MongoDB shell:

      • mongo

      This command will open the MongoDB shell connected to the local MongoDB instance, which is not a MongoDB server but a running mongos query router. Your prompt will change to indicate this by reading mongos> instead of the MongoDB shell’s usual >.

      You can verify that the query router is connected to the config server by running the sh.status() method:

      • sh.status()

      This command returns the current status of the sharded cluster. At this point, it will show an empty list of connected shards in the shards key:


--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("6187ea2e3d82d39f10f37ea7")
  }
  shards:
  active mongoses:
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled: yes
        Currently running: no
        Failed balancer rounds in last 5 attempts: 0
        Migration Results for the last 24 hours:
                No recent migrations
  databases:
        { "_id" : "config", "primary" : "config", "partitioned" : true }
. . .

      To add the first shard to the cluster, execute the following command. In this example, shard1 is the replica set name of the first shard, and mongo_shard1_ip is the IP address of the server on which that shard, mongo-shard1, is running:

      • sh.addShard("shard1/mongo_shard1_ip:27017")

      This command will return a success message:


      { "shardAdded" : "shard1", "ok" : 1, "operationTime" : Timestamp(1636301581, 6), "$clusterTime" : { "clusterTime" : Timestamp(1636301581, 6), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } } }

      Follow that by adding the second shard:

      • sh.addShard("shard2/mongo_shard2_ip:27017")

      Notice that not only is the IP address in this command different, but so is the replica set name. The command will return a success message:


      { "shardAdded" : "shard2", "ok" : 1, "operationTime" : Timestamp(1639724738, 6), "$clusterTime" : { "clusterTime" : Timestamp(1639724738, 6), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } } }

      You can check that both shards have been properly added by issuing the sh.status() command again:

      • sh.status()


--- Sharding Status ---
  sharding version: {
        "_id" : 1,
        "minCompatibleVersion" : 5,
        "currentVersion" : 6,
        "clusterId" : ObjectId("6187ea2e3d82d39f10f37ea7")
  }
  shards:
        { "_id" : "shard1", "host" : "shard1/mongo_shard1_ip:27017", "state" : 1 }
        { "_id" : "shard2", "host" : "shard2/mongo_shard2_ip:27017", "state" : 1 }
  active mongoses:
        "4.4.10" : 1
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled: yes
        Currently running: no
        Failed balancer rounds in last 5 attempts: 0
        Migration Results for the last 24 hours:
                No recent migrations
  databases:
        { "_id" : "config", "primary" : "config", "partitioned" : true }
. . .

      With that, you have a fully working sharded cluster consisting of two up and running shards. In the next step, you’ll create a new database, enable sharding for the database, and begin partitioning data in a collection.

      Step 4 — Partitioning Collection Data

      One shard within every sharded MongoDB cluster will be elected to be the cluster’s primary shard. In addition to the partitioned data stored across every shard in the cluster, the primary shard is also responsible for storing any non-partitioned data.

      At this point, you can freely use the mongos query router to work with databases, documents, and collections just like you would with a typical unsharded database. However, with no further setup, any data you add to the cluster will end up being stored only on the primary shard; it will not be automatically partitioned, and you won’t experience any of the benefits sharding provides.

      In order to use your sharded MongoDB cluster to its fullest potential, you must enable sharding for a database within the cluster. A MongoDB database can only contain sharded collections if it has sharding enabled.

      To better understand MongoDB’s behavior as it partitions data, you’ll need a set of documents you can work with. This guide will use a collection of documents representing a few of the most populated cities in the world. As an example, the following sample document represents Tokyo:

      The Tokyo document

      {
          "name": "Tokyo",
          "country": "Japan",
          "continent": "Asia",
          "population": 37.400
      }
      You’ll store these documents in a database called populations and a collection called cities.

      You can enable sharding for a database before you explicitly create it. To enable sharding for the populations database, run the following enableSharding() method:

      • sh.enableSharding("populations")

      The command will return a success message:


      { . . . "ok" : 1, . . . }

      Now that the database is configured to allow partitioning, you can enable partitioning for the cities collection.

      MongoDB provides two ways to shard collections and determine which documents will be stored on which shard: ranged sharding and hashed sharding. This guide focuses on how to implement hashed sharding, in which MongoDB maintains an automated hashed index on a field that has been selected to be the cluster’s shard key. This helps to achieve an even distribution of documents. If you’d like to learn about ranged sharding in MongoDB, please refer to the official documentation.

      When implementing a hash-based sharding strategy, it’s the responsibility of the database administrator to choose an appropriate shard key. A poorly chosen shard key can negate many of the benefits one might gain from sharding.

      In MongoDB, a document field that would function well as a shard key should follow these principles:

      • The chosen field should be of high cardinality, meaning that it can have many possible values. Every document added to the collection will always end up being stored on a single shard, so if the field chosen as the shard key will have only a few possible values, adding more shards to the cluster will not benefit performance. Considering the example populations database, the continent field would not be a good shard key since it can only contain a few possible values.
      • The shard key should have a low frequency of duplicate values. If the majority of documents share duplicate values for the field used as the shard key, it’s likely that some shards will be used to store more data than others. The more even the distribution of values in the sharded key across the entire collection, the better.
      • The shard key should facilitate queries. For example, a field that’s frequently used as a query filter would be a good choice for a shard key. In a sharded cluster, the query router uses a single shard to return a query result only if the query contains the shard key. Otherwise, the query will be broadcast to all shards for evaluation, even though the returned documents will come from a single shard. Thus, the population field would not be the best key, as it’s unlikely the majority of queries would involve filtering by the exact population value.

      For the example data used in this guide, the country name field would be a good choice for the cluster’s shard key, since it has the highest cardinality of all fields that will likely be frequently used in filter queries.

      Partition the cities collection — which hasn’t yet been created — with the country field as its shard key by running the following shardCollection() method:

      • sh.shardCollection("populations.cities", { "country": "hashed" })

      The first part of this command refers to the cities collection in the populations database, while the second part selects country as the shard key using the hashed partition method.

      The command will return a success message:


      { "collectionsharded" : "populations.cities", "collectionUUID" : UUID("03823afb-923b-4cd0-8923-75540f33f07d"), "ok" : 1, . . . }

      Now you can insert some sample documents into the sharded cluster. First, switch to the populations database:

      • use populations

      Then insert 20 sample documents with the following insertMany command:

      • db.cities.insertMany([
      • {"name": "Seoul", "country": "South Korea", "continent": "Asia", "population": 25.674 },
      • {"name": "Mumbai", "country": "India", "continent": "Asia", "population": 19.980 },
      • {"name": "Lagos", "country": "Nigeria", "continent": "Africa", "population": 13.463 },
      • {"name": "Beijing", "country": "China", "continent": "Asia", "population": 19.618 },
      • {"name": "Shanghai", "country": "China", "continent": "Asia", "population": 25.582 },
      • {"name": "Osaka", "country": "Japan", "continent": "Asia", "population": 19.281 },
      • {"name": "Cairo", "country": "Egypt", "continent": "Africa", "population": 20.076 },
      • {"name": "Tokyo", "country": "Japan", "continent": "Asia", "population": 37.400 },
      • {"name": "Karachi", "country": "Pakistan", "continent": "Asia", "population": 15.400 },
      • {"name": "Dhaka", "country": "Bangladesh", "continent": "Asia", "population": 19.578 },
      • {"name": "Rio de Janeiro", "country": "Brazil", "continent": "South America", "population": 13.293 },
      • {"name": "São Paulo", "country": "Brazil", "continent": "South America", "population": 21.650 },
      • {"name": "Mexico City", "country": "Mexico", "continent": "North America", "population": 21.581 },
      • {"name": "Delhi", "country": "India", "continent": "Asia", "population": 28.514 },
      • {"name": "Buenos Aires", "country": "Argentina", "continent": "South America", "population": 14.967 },
      • {"name": "Kolkata", "country": "India", "continent": "Asia", "population": 14.681 },
      • {"name": "New York", "country": "United States", "continent": "North America", "population": 18.819 },
      • {"name": "Manila", "country": "Philippines", "continent": "Asia", "population": 13.482 },
      • {"name": "Chongqing", "country": "China", "continent": "Asia", "population": 14.838 },
      • {"name": "Istanbul", "country": "Turkey", "continent": "Europe", "population": 14.751 }
      • ])

      The output will be similar to the typical MongoDB output since, from the user’s perspective, the sharded cluster behaves like a normal MongoDB database:


      {
          "acknowledged" : true,
          "insertedIds" : [
              ObjectId("61880330754a281b83525a9b"),
              ObjectId("61880330754a281b83525a9c"),
              ObjectId("61880330754a281b83525a9d"),
              ObjectId("61880330754a281b83525a9e"),
              ObjectId("61880330754a281b83525a9f"),
              ObjectId("61880330754a281b83525aa0"),
              ObjectId("61880330754a281b83525aa1"),
              ObjectId("61880330754a281b83525aa2"),
              ObjectId("61880330754a281b83525aa3"),
              ObjectId("61880330754a281b83525aa4"),
              ObjectId("61880330754a281b83525aa5"),
              ObjectId("61880330754a281b83525aa6"),
              ObjectId("61880330754a281b83525aa7"),
              ObjectId("61880330754a281b83525aa8"),
              ObjectId("61880330754a281b83525aa9"),
              ObjectId("61880330754a281b83525aaa"),
              ObjectId("61880330754a281b83525aab"),
              ObjectId("61880330754a281b83525aac"),
              ObjectId("61880330754a281b83525aad"),
              ObjectId("61880330754a281b83525aae")
          ]
      }

      Under the hood, however, MongoDB distributed the documents across the sharded nodes.

      You can access information about how the data was distributed across your shards with the getShardDistribution() method:

      • db.cities.getShardDistribution()

      This method’s output provides statistics for every shard that is part of the cluster:


      Shard shard2 at shard2/mongo_shard2_ip:27017
       data : 943B docs : 9 chunks : 2
       estimated data per chunk : 471B
       estimated docs per chunk : 4

      Shard shard1 at shard1/mongo_shard1_ip:27017
       data : 1KiB docs : 11 chunks : 2
       estimated data per chunk : 567B
       estimated docs per chunk : 5

      Totals
       data : 2KiB docs : 20 chunks : 4
       Shard shard2 contains 45.4% data, 45% docs in cluster, avg obj size on shard : 104B
       Shard shard1 contains 54.59% data, 55% docs in cluster, avg obj size on shard : 103B

      This output indicates that the automated hashing strategy on the country field resulted in a mostly even distribution across two shards.

      You have now configured a fully working sharded cluster and inserted data that has been automatically partitioned across multiple shards. In the next step, you’ll learn how to monitor shard usage when executing queries.

      Step 5 — Analyzing Shard Usage

      Sharding is used to scale the performance of the database system and, as such, works best if it’s used efficiently to support database queries. If most of your queries to the database need to scan every shard in the cluster in order to be executed, any benefits of sharding would be lost in the system’s increased complexity.

      This step focuses on verifying whether a query is optimized to only use a single shard or if it spans multiple shards to retrieve a result.

      Start by selecting every document in the cities collection. Since you want to retrieve all the documents, it’s guaranteed that every shard must be used to perform the query:

      • db.cities.find()

      The query will, unsurprisingly, return all cities. Re-run the query, this time with the explain() method attached to the end of it:

      • db.cities.find().explain()

      The long output will provide details about how the query was executed:


      {
          "queryPlanner" : {
              "mongosPlannerVersion" : 1,
              "winningPlan" : {
                  "stage" : "SHARD_MERGE",
                  "shards" : [
                      {
                          "shardName" : "shard1",
                          . . .
                      },
                      {
                          "shardName" : "shard2",
                          . . .
                      }
                  ]
              }
          },
          . . .

      Notice that the winning plan refers to the SHARD_MERGE strategy, which means that multiple shards were used to resolve the query. In the shards key, MongoDB returns the list of shards taking part in the evaluation. In this case, this list includes both shards of the cluster.

      Now test whether the result will be any different if you query against the continent field, which is not the chosen shard key:

      • db.cities.find({"continent": "Europe"}).explain()

      This time, MongoDB also had to use both shards to satisfy the query. The database had no way to know which shard contains documents for European cities:


      {
          "queryPlanner" : {
              "mongosPlannerVersion" : 1,
              "winningPlan" : {
                  "stage" : "SHARD_MERGE",
                  . . .
              }
          },
          . . .

      The result should be different when filtering against the shard key. Try filtering cities only from Japan using the country field, which you previously selected as the shard key:

      • db.cities.find({"country": "Japan"}).explain()


      {
          "queryPlanner" : {
              "mongosPlannerVersion" : 1,
              "winningPlan" : {
                  "stage" : "SINGLE_SHARD",
                  "shards" : [
                      {
                          "shardName" : "shard1",
                          . . .

      This time, MongoDB used a different query strategy: SINGLE_SHARD instead of SHARD_MERGE. This means that only a single shard was needed to satisfy the query. In the shards key, only a single shard will be mentioned. In this example, documents for Japan were stored on the first shard in the cluster.

      By using the explain feature on the query cursor you can check whether the query you are running spans one or multiple shards. In turn, it can also help you determine whether the query will overload the cluster by reaching out to every shard at once. You can use this method — alongside rules of thumb for shard key selection — to select the shard key that will yield the most performance gains.
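If you check explain() output frequently, the stage name is the key signal. As a sketch that assumes only the output shape shown in this step, a small helper could classify a winning plan like this:

```javascript
// Classify an explain() result as targeted or scatter-gather based on the
// winning plan's stage, using the document shape shown in this step.
function queryScope(explainResult) {
  const stage = explainResult.queryPlanner.winningPlan.stage;
  return stage === "SINGLE_SHARD" ? "single shard" : "scatter-gather";
}

console.log(queryScope({ queryPlanner: { winningPlan: { stage: "SINGLE_SHARD" } } }));
// single shard
console.log(queryScope({ queryPlanner: { winningPlan: { stage: "SHARD_MERGE" } } }));
// scatter-gather
```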


      Sharding has seen wide use as a strategy to improve the performance and scalability of large data clusters. When paired with replication, sharding also has the potential to improve availability and data security. Sharding is also MongoDB’s core means of horizontal scaling, in that you can extend the database cluster performance by adding more nodes to the cluster instead of migrating databases to bigger and more powerful servers.

      By completing this tutorial, you’ve learned how sharding works in MongoDB, how to configure config servers and individual shards, and how to connect them together to form a sharded cluster. You’ve used the mongos query router to administer the cluster, partition data in a collection, execute queries against the database, and monitor sharding metrics.

      This strategy comes with many benefits, but also with administrative challenges such as having to manage multiple replica sets and more complex security considerations. To learn more about sharding and running a sharded cluster outside the development environment, we encourage you to check out the official MongoDB documentation on the subject. Otherwise, we encourage you to check out the other tutorials in our series on How To Manage Data with MongoDB.


      Understanding Database Sharding


      Any application or website that sees significant growth will eventually need to scale in order to accommodate increases in traffic. For data-driven applications and websites, it’s critical that scaling is done in a way that ensures the security and integrity of their data. It can be difficult to predict how popular a website or application will become or how long it will maintain that popularity, which is why some organizations choose a database architecture that allows them to scale their databases dynamically.

      In this conceptual article, we will discuss one such database architecture: sharded databases. Sharding has been receiving lots of attention in recent years, but many don’t have a clear understanding of what it is or the scenarios in which it might make sense to shard a database. We will go over what sharding is, some of its main benefits and drawbacks, and also a few common sharding methods.

      What is Sharding?

      Sharding is a database architecture pattern related to horizontal partitioning — the practice of separating one table’s rows into multiple different tables, known as partitions. Each partition has the same schema and columns, but entirely different rows. Likewise, the data held in each is unique and independent of the data held in other partitions.

      It can be helpful to think of horizontal partitioning in terms of how it relates to vertical partitioning. In a vertically-partitioned table, entire columns are separated out and put into new, distinct tables. The data held within one vertical partition is independent from the data in all the others, and each holds both distinct rows and columns. The following diagram illustrates how a table could be partitioned both horizontally and vertically:

      Example tables showing horizontal and vertical partitioning

      Sharding involves breaking up one’s data into two or more smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards, which can hold multiple logical shards. Despite this, the data held within all the shards collectively represent an entire logical dataset.

      Database shards exemplify a shared-nothing architecture. This means that the shards are autonomous; they don’t share any of the same data or computing resources. In some cases, though, it may make sense to replicate certain tables into each shard to serve as reference tables. For example, let’s say there’s a database for an application that depends on fixed conversion rates for weight measurements. By replicating a table containing the necessary conversion rate data into each shard, it would help to ensure that all of the data required for queries is held in every shard.

      Oftentimes, sharding is implemented at the application level, meaning that the application includes code that defines which shard to transmit reads and writes to. However, some database management systems have sharding capabilities built in, allowing you to implement sharding directly at the database level.
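As a sketch of the application-level approach, routing logic can be as simple as a function mapping a key to a connection. The connection strings and the modulo rule below are purely illustrative:

```javascript
// Hypothetical application-level shard routing: the application, not the
// database, decides which shard receives each read or write.
const shards = [
  "mongodb://shard-a.example.com:27017",
  "mongodb://shard-b.example.com:27017"
];

function shardFor(customerId) {
  // Route by a simple modulo on a numeric shard key.
  return shards[customerId % shards.length];
}

console.log(shardFor(40)); // first shard
console.log(shardFor(41)); // second shard
```

A real application would open a connection to the returned address and issue the query there; the point is only that every read and write must pass through this routing layer.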

      Given this general overview of sharding, let’s go over some of the positives and negatives associated with this database architecture.

      Benefits of Sharding

      The main appeal of sharding a database is that it can help to facilitate horizontal scaling, also known as scaling out. Horizontal scaling is the practice of adding more machines to an existing stack in order to spread out the load and allow for more traffic and faster processing. This is often contrasted with vertical scaling, otherwise known as scaling up, which involves upgrading the hardware of an existing server, usually by adding more RAM or CPU.

      It’s relatively simple to have a relational database running on a single machine and scale it up as necessary by upgrading its computing resources. Ultimately, though, any non-distributed database will be limited in terms of storage and compute power, so having the freedom to scale horizontally makes your setup far more flexible.

      Another reason why some might choose a sharded database architecture is to speed up query response times. When you submit a query on a database that hasn’t been sharded, it may have to search every row in the table you’re querying before it can find the result set you’re looking for. For an application with a large, monolithic database, queries can become prohibitively slow. By sharding one table into multiple, though, queries have to go over fewer rows and their result sets are returned much more quickly.

      Sharding can also help to make an application more reliable by mitigating the impact of outages. If your application or website relies on an unsharded database, an outage has the potential to make the entire application unavailable. With a sharded database, though, an outage is likely to affect only a single shard. Even though this might make some parts of the application or website unavailable to some users, the overall impact would still be less than if the entire database crashed.

      Drawbacks of Sharding

      While sharding a database can make scaling easier and improve performance, it can also impose certain limitations. Here, we’ll discuss some of these and why they might be reasons to avoid sharding altogether.

      The first difficulty that people encounter with sharding is the sheer complexity of properly implementing a sharded database architecture. If done incorrectly, there’s a significant risk that the sharding process can lead to lost data or corrupted tables. Even when done correctly, though, sharding is likely to have a major impact on your team’s workflows. Rather than accessing and managing data from a single entry point, users must manage it across multiple shard locations, which could potentially be disruptive to some teams.

      One problem that users sometimes encounter after having sharded a database is that the shards eventually become unbalanced. By way of example, let’s say you have a database with two separate shards, one for customers whose last names begin with letters A through M and another for those whose names begin with the letters N through Z. However, your application serves an inordinate number of people whose last names start with the letter G. Accordingly, the A-M shard gradually accrues more data than the N-Z one, causing the application to slow down and stall out for a significant portion of your users. The A-M shard has become what is known as a database hotspot. In this case, any benefits of sharding the database are canceled out by the slowdowns and crashes. The database would likely need to be repaired and resharded to allow for a more even data distribution.

      Another major drawback is that once a database has been sharded, it can be very difficult to return it to its unsharded architecture. Any backups of the database made before it was sharded won’t include data written since the partitioning. Consequently, rebuilding the original unsharded architecture would require merging the new partitioned data with the old backups or, alternatively, transforming the partitioned DB back into a single DB, both of which would be costly and time consuming endeavors.

      A final disadvantage to consider is that sharding isn’t natively supported by every database engine. For instance, PostgreSQL does not include automatic sharding as a feature, although it is possible to manually shard a PostgreSQL database. There are a number of Postgres forks that do include automatic sharding, but these often trail behind the latest PostgreSQL release and lack certain other features. Some specialized database technologies — like MySQL Cluster or certain database-as-a-service products like MongoDB Atlas — do include auto-sharding as a feature, but vanilla versions of these database management systems do not. Because of this, sharding often requires a “roll your own” approach. This means that documentation for sharding or tips for troubleshooting problems are often difficult to find.

      These are, of course, only some general issues to consider before sharding. There may be many more potential drawbacks to sharding a database depending on its use case.

      Now that we’ve covered a few of sharding’s drawbacks and benefits, we will go over a few different architectures for sharded databases.

      Sharding Architectures

      Once you’ve decided to shard your database, the next thing you need to figure out is how you’ll go about doing so. When running queries or distributing incoming data to sharded tables or databases, it’s crucial that the data goes to the correct shard. Otherwise, it could result in lost data or painfully slow queries. In this section, we’ll go over a few common sharding architectures, each of which uses a slightly different process to distribute data across shards.

      Key Based Sharding

      Key based sharding, also known as hash based sharding, involves using a value taken from newly written data — such as a customer’s ID number, a client application’s IP address, a ZIP code, etc. — and plugging it into a hash function to determine which shard the data should go to. A hash function is a function that takes as input a piece of data (for example, a customer email) and outputs a discrete value, known as a hash value. In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on. Altogether, the process looks like this:

      Key based sharding example diagram

      To ensure that entries are placed in the correct shards and in a consistent manner, the values entered into the hash function should all come from the same column. This column is known as a shard key. In simple terms, shard keys are similar to primary keys in that both are columns which are used to establish a unique identifier for individual rows. Broadly speaking, a shard key should be static, meaning it shouldn’t contain values that might change over time. Otherwise, it would increase the amount of work that goes into update operations, and could slow down performance.
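      Here’s a minimal sketch of key based routing. The shard count, key format, and choice of SHA-256 as the hash function are all illustrative; the important property is that the same shard key always produces the same shard ID:

```python
import hashlib

# Sketch of key based (hash based) sharding: a stable hash of the shard key
# determines which shard an incoming row is written to.

NUM_SHARDS = 4

def shard_id(shard_key: str) -> int:
    """Hash the shard key to a deterministic shard ID in [0, NUM_SHARDS)."""
    # A stable hash is essential. Python's built-in hash() is randomized
    # per process, so a cryptographic digest is used here instead.
    digest = hashlib.sha256(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always maps to the same shard, so reads can find
# the rows that earlier writes stored there.
assert shard_id("customer-1001") == shard_id("customer-1001")
```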

      While key based sharding is a fairly common sharding architecture, it can make things tricky when trying to dynamically add or remove additional servers to a database. As you add servers, each one will need a corresponding hash value and many of your existing entries, if not all of them, will need to be remapped to their new, correct hash value and then migrated to the appropriate server. As you begin rebalancing the data, neither the new nor the old hashing functions will be valid. Consequently, your server won’t be able to write any new data during the migration and your application could be subject to downtime.
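      The remapping problem is easy to demonstrate. With naive modulo hashing like the sketch below (shard count and key format invented for illustration), growing from four shards to five changes the shard assignment for the large majority of existing keys, and every one of those rows would have to be migrated:

```python
import hashlib

def shard_id(key: str, num_shards: int) -> int:
    """Naive hash-modulo shard assignment."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"customer-{i}" for i in range(10_000)]

# Count how many keys change shards when a fifth shard is added.
moved = sum(1 for k in keys if shard_id(k, 4) != shard_id(k, 5))
print(f"{moved / len(keys):.0%} of keys would need to migrate")
```

      Consistent hashing schemes exist precisely to shrink this migration cost, remapping only a small fraction of keys when a shard is added or removed.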

      The main appeal of this strategy is that it can be used to evenly distribute data so as to prevent hotspots. Also, because it distributes data algorithmically, there’s no need to maintain a map of where all the data is located, as is necessary with other strategies like range or directory based sharding.

      Range Based Sharding

      Range based sharding involves sharding data based on ranges of a given value. To illustrate, let’s say you have a database that stores information about all the products within a retailer’s catalog. You could create a few different shards and divvy up each product’s information based on the price range it falls into, like this:

      Range based sharding example diagram

      The main benefit of range based sharding is that it’s relatively simple to implement. Every shard holds a different set of data, but they all share an identical schema with one another and with the original database. The application code simply checks which range the data falls into and writes it to the corresponding shard.
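      That application code can be as small as a table of boundaries and a loop. The price boundaries and shard names below are invented for the retailer example; a real schema would define its own ranges:

```python
# Sketch of range based sharding for the price-range example.
# Each tuple is (inclusive lower bound, exclusive upper bound, shard name).
PRICE_RANGES = [
    (0, 50, "shard_low"),
    (50, 200, "shard_mid"),
    (200, float("inf"), "shard_high"),
]

def shard_for_price(price: float) -> str:
    """Return the shard whose price range covers this product."""
    for low, high, shard in PRICE_RANGES:
        if low <= price < high:
            return shard
    raise ValueError(f"no shard covers price {price}")
```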

      On the other hand, range based sharding doesn’t protect data from being unevenly distributed, leading to the aforementioned database hotspots. Looking at the example diagram, even if each shard holds an equal amount of data the odds are that specific products will receive more attention than others. Their respective shards will, in turn, receive a disproportionate number of reads.

      Directory Based Sharding

      To implement directory based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found. The following diagram shows a simplistic example of directory based sharding:

      Directory based sharding example diagram

      Here, the Delivery Zone column is defined as a shard key. Data from the shard key is written to the lookup table along with whatever shard each respective row should be written to. This is similar to range based sharding, but instead of determining which range the shard key’s data falls into, each key is tied to its own specific shard. Directory based sharding is a good choice over range based sharding in cases where the shard key has a low cardinality and it doesn’t make sense for a shard to store a range of keys. Note that it’s also distinct from key based sharding in that it doesn’t process the shard key through a hash function; it just checks the key against a lookup table to see where the data needs to be written.
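      In code, the directory really is just a lookup table. The zone and shard names below are invented to mirror the diagram; note that there is no hashing and no range logic, only a direct mapping:

```python
# Sketch of directory based sharding: a lookup table maps each shard key
# value (here, Delivery Zone) directly to a shard.
LOOKUP_TABLE = {
    "North": "shard_a",
    "South": "shard_b",
    "East": "shard_a",
    "West": "shard_c",
}

def shard_for_zone(delivery_zone: str) -> str:
    """Check the shard key against the lookup table, with no hash function."""
    try:
        return LOOKUP_TABLE[delivery_zone]
    except KeyError:
        raise ValueError(f"unknown delivery zone: {delivery_zone}")

# Adding a shard is just an edit to the directory, which is what makes
# this approach flexible.
LOOKUP_TABLE["Central"] = "shard_d"
```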

      The main appeal of directory based sharding is its flexibility. Range based sharding architectures limit you to specifying ranges of values, while key based ones limit you to using a fixed hash function which, as mentioned previously, can be exceedingly difficult to change later on. Directory based sharding, on the other hand, allows you to use whatever system or algorithm you want to assign data entries to shards, and it’s relatively easy to dynamically add shards using this approach.

      While directory based sharding is the most flexible of the sharding methods discussed here, the need to connect to the lookup table before every query or write can have a detrimental impact on an application’s performance. Furthermore, the lookup table can become a single point of failure: if it becomes corrupted or otherwise fails, it can impact one’s ability to write new data or access their existing data.

      Should I Shard?

      Whether or not one should implement a sharded database architecture is almost always a matter of debate. Some see sharding as an inevitable outcome for databases that reach a certain size, while others see it as a headache that should be avoided unless it’s absolutely necessary, due to the operational complexity that sharding adds.

      Because of this added complexity, sharding is usually only performed when dealing with very large amounts of data. Here are some common scenarios where it may be beneficial to shard a database:

      • The amount of application data grows to exceed the storage capacity of a single database node.
      • The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.
      • The network bandwidth required by the application outpaces the bandwidth available to a single database node and any read replicas, resulting in slowed response times or timeouts.

      Before sharding, you should exhaust all other options for optimizing your database. Some optimizations you might want to consider include:

      • Setting up a remote database. If you’re working with a monolithic application in which all of its components reside on the same server, you can improve your database’s performance by moving it over to its own machine. This doesn’t add as much complexity as sharding since the database’s tables remain intact. However, it still allows you to vertically scale your database apart from the rest of your infrastructure.
      • Implementing caching. If your application’s read performance is what’s causing you trouble, caching is one strategy that can help to improve it. Caching involves temporarily storing data that has already been requested in memory, allowing you to access it much more quickly later on.
      • Creating one or more read replicas. Another strategy that can help to improve read performance, this involves copying the data from one database server (the primary server) over to one or more secondary servers. Following this, every new write goes to the primary before being copied over to the secondaries, while reads are made exclusively to the secondary servers. Distributing reads and writes like this keeps any one machine from taking on too much of the load, helping to prevent slowdowns and crashes. Note that creating read replicas involves more computing resources and thus costs more money, which could be a significant constraint for some.
      • Upgrading to a larger server. In most cases, scaling up one’s database server to a machine with more resources requires less effort than sharding. As with creating read replicas, an upgraded server with more resources will likely cost more money. Accordingly, you should only go through with resizing if it truly ends up being your best option.
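      As one illustration of the caching strategy above, memoizing read queries in application memory can be a few lines of code. The fetch_product function and its return value are hypothetical stand-ins for a real database read:

```python
import functools

# Sketch of the caching optimization: memoize an expensive read so that
# repeated requests for the same record skip the database entirely.
@functools.lru_cache(maxsize=1024)
def fetch_product(product_id: int) -> dict:
    # In a real application this body would query the database;
    # here it just simulates the result of that query.
    return {"id": product_id, "name": f"product-{product_id}"}

fetch_product(42)  # first call: would hit the database (cache miss)
fetch_product(42)  # repeat call: served from the in-memory cache
assert fetch_product.cache_info().hits == 1
```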

      Bear in mind that if your application or website grows past a certain point, none of these strategies will be enough to improve performance on their own. In such cases, sharding may indeed be the best option for you.

      Conclusion

      Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential failure points for your application. Sharding may be necessary for some, but the time and resources needed to create and maintain a sharded architecture could outweigh the benefits for others.

      By reading this conceptual article, you should have a clearer understanding of the pros and cons of sharding. Moving forward, you can use this insight to make a more informed decision about whether or not a sharded database architecture is right for your application.
