
      How To Use Schema Validation in MongoDB


      The author selected the Open Internet/Free Speech Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      One important aspect of relational databases — which store databases in tables made up of rows and columns — is that they operate on fixed, rigid schemas with fields of known data types. Document-oriented databases like MongoDB are more flexible in this regard, as they allow you to reshape your documents’ structure as needed.

      However, there are likely to be situations in which you might need your data documents to follow a particular structure or fulfill certain requirements. Many document databases allow you to define rules that dictate how parts of your documents’ data should be structured while still offering some freedom to change this structure if needed.

      MongoDB has a feature called schema validation that allows you to apply constraints on your documents’ structure. Schema validation is built around JSON Schema, an open standard for JSON document structure description and validation. In this tutorial, you’ll write and apply validation rules to control the structure of documents in an example MongoDB collection.

      Prerequisites

      To follow this tutorial, you will need:

      Note: The linked tutorials on how to configure your server, install MongoDB, and secure the MongoDB installation refer to Ubuntu 20.04. This tutorial concentrates on MongoDB itself, not the underlying operating system. It will generally work with any MongoDB installation regardless of the operating system as long as authentication has been enabled.

      Step 1 — Inserting Documents Without Applying Schema Validation

      In order to highlight MongoDB’s schema validation features and why they can be useful, this step outlines how to open the MongoDB shell to connect to your locally-installed MongoDB instance and create a sample collection within it. Then, by inserting a number of example documents into this collection, this step will show how MongoDB doesn’t enforce any schema validation by default. In later steps, you’ll begin creating and enforcing such rules yourself.

      To create the sample collection used in this guide, connect to the MongoDB shell as your administrative user. This tutorial follows the conventions of the prerequisite MongoDB security tutorial and assumes the name of this administrative user is AdminSammy and its authentication database is admin. Be sure to change these details in the following command to reflect your own setup, if different:

      • mongo -u AdminSammy -p --authenticationDatabase admin

      Enter the password set during installation to gain access to the shell. After providing the password, you’ll see the > prompt.

      To illustrate the schema validation features, this guide’s examples use a sample database containing documents that represent the highest mountains in the world. The sample document for Mount Everest will take this form:

      The Everest document

      {
          "name": "Everest",
          "height": 8848,
          "location": ["Nepal", "China"],
          "ascents": {
              "first": {
                  "year": 1953
              },
              "first_winter": {
                  "year": 1980
              },
              "total": 5656
          }
      }
      

      This document contains the following information:

      • name: the peak’s name.
      • height: the peak’s elevation, in meters.
      • location: the countries in which the mountain is located. This field stores values as an array to allow for mountains located in more than one country.
      • ascents: this field’s value is another document. When one document is stored within another document like this, it’s known as an embedded or nested document. The ascents document describes successful ascents of the given mountain: its total field lists the total number of successful ascents of the peak. Additionally, it contains two fields whose values are also nested documents:
        • first: this field’s value is a nested document that contains one field, year, which describes the year of the first overall successful ascent.
        • first_winter: this field’s value is a nested document that also contains a year field, the value of which represents the year of the first successful winter ascent of the given mountain.

      Run the following insertOne() method to simultaneously create a collection named peaks in your MongoDB installation and insert the previous example document representing Mount Everest into it:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"],
      • "ascents": {
      • "first": {
      • "year": 1953
      • },
      • "first_winter": {
      • "year": 1980
      • },
      • "total": 5656
      • }
      • }
      • )

      The output will contain a success message and an object identifier assigned to the newly inserted object:

      Output

      { "acknowledged" : true, "insertedId" : ObjectId("618ffa70bfa69c93a8980443") }

      Although you inserted this document by running the provided insertOne() method, you had complete freedom in designing this document’s structure. In some cases, you might want to have some degree of flexibility in how documents within the database are structured. However, you might also want to make sure some aspects of the documents’ structure remain consistent to allow for easier data analysis or processing.

      To illustrate why this can be important, consider a few other example documents that might be entered into this database.

      The following document is almost identical to the previous one representing Mount Everest, but it doesn’t contain a name field:

      The Mountain with no name at all

      {
          "height": 8611,
          "location": ["Pakistan", "China"],
          "ascents": {
              "first": {
                  "year": 1954
              },
              "first_winter": {
                  "year": 2021
              },
              "total": 306
          }
      }
      

      For a database containing a list of the highest mountains in the world, adding a document representing a mountain but not including its name would likely be a serious error.

      In this next example document, the mountain’s name is present but its height is represented as a string instead of a number. Additionally, the location is not an array but a single value, and there is no information on the total number of ascent attempts:

      Mountain with a string value for its height

      {
          "name": "Manaslu",
          "height": "8163m",
          "location": "Nepal"
      }
      

      Interpreting a document with as many omissions as this example could prove difficult. For instance, you would not be able to successfully sort the collection by peak heights if the height attribute values are stored as different data types between documents.
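      To see the problem in miniature, here is a quick plain-JavaScript sketch (an illustration only, not a MongoDB query) of what happens when height values mix numbers and strings:

```javascript
// Hypothetical in-memory sample mixing a numeric and a string height.
const peaks = [
  { name: "Everest", height: 8848 },
  { name: "Manaslu", height: "8163m" },
];

// Coercing the string form to a number fails outright:
console.log(Number("8163m")); // NaN

// A numeric sort comparator therefore can't order the documents,
// because "8163m" - 8848 evaluates to NaN (treated as "equal"):
const byHeight = [...peaks].sort((a, b) => a.height - b.height);
console.log(byHeight.map((p) => p.name));
```

MongoDB's own sort would not fail either; it would order mixed types by BSON type rather than by magnitude, which is rarely what you want.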

      Now run the following insertMany() method to test whether these documents can be inserted into the database without causing any errors:

      • db.peaks.insertMany([
      • {
      • "height": 8611,
      • "location": ["Pakistan", "China"],
      • "ascents": {
      • "first": {
      • "year": 1954
      • },
      • "first_winter": {
      • "year": 2021
      • },
      • "total": 306
      • }
      • },
      • {
      • "name": "Manaslu",
      • "height": "8163m",
      • "location": "Nepal"
      • }
      • ])

      As it turns out, MongoDB will not return any errors and both documents will be inserted successfully:

      Output

      { "acknowledged" : true, "insertedIds" : [ ObjectId("618ffd0bbfa69c93a8980444"), ObjectId("618ffd0bbfa69c93a8980445") ] }

      As this output indicates, both of these documents are valid JSON, which is enough to insert them into the collection. However, this isn’t enough to keep the database logically consistent and meaningful. In the next steps, you’ll build schema validation rules to make sure the data documents in the peaks collection follow a few essential requirements.

      Step 2 — Validating String Fields

      In MongoDB, schema validation works on individual collections by assigning a JSON Schema document to the collection. JSON Schema is an open standard that allows you to define and validate the structure of JSON documents. You do this by creating a schema definition that lists a set of requirements that documents in the given collection must follow to be considered valid.

      Any given collection can only use a single JSON Schema, but you can assign a schema when you create the collection or any time afterwards. If you decide to change your original validation rules later on, you will have to replace the original JSON Schema document with one that aligns with your new requirements.

      To assign a JSON Schema validator document to the peaks collection you created in the previous step, you could run the following command:

      • db.runCommand({
      • "collMod": "collection_name",
      • "validator": {
      • $jsonSchema: {JSON_Schema_document}
      • }
      • })

      The runCommand method executes the collMod command, which modifies the specified collection by applying the validator attribute to it. The validator attribute is responsible for schema validation and, in this example syntax, it accepts the $jsonSchema operator. This operator defines a JSON Schema document which will be used as the schema validator for the given collection.

      Warning: In order to execute the collMod command, your MongoDB user must be granted the appropriate privileges. Assuming you followed the prerequisite tutorial on How To Secure MongoDB on Ubuntu 20.04 and are connected to your MongoDB instance as the administrative user you created in that guide, you will need to grant it an additional role to follow along with the examples in this guide.

      First, switch to your user’s authentication database. This is admin in the following example, but connect to your own user’s authentication database if different:

      • use admin

      Output

      switched to db admin

      Then run a grantRolesToUser() method and grant your user the dbAdmin role over the database where you created the peaks collection. The following example assumes the peaks collection is in the test database:

      • db.grantRolesToUser(
      • "AdminSammy",
      • [ { role : "dbAdmin", db : "test" } ]
      • )

      Alternatively, you can grant your user the dbAdminAnyDatabase role. As this role’s name implies, it will grant your user dbAdmin privileges over every database on your MongoDB instance:

      • db.grantRolesToUser(
      • "AdminSammy",
      • [ "dbAdminAnyDatabase" ]
      • )

      After granting your user the appropriate role, navigate back to the database where your peaks collection is stored. The following example uses the test database:

      • use test

      Output

      switched to db test

      Be aware that you can also assign a JSON Schema validator when you create a collection. To do so, you could use the following syntax:

      • db.createCollection(
      • "collection_name", {
      • "validator": {
      • $jsonSchema: {JSON_Schema_document}
      • }
      • })

      Unlike the previous example, this syntax doesn’t include the collMod command, since the collection doesn’t yet exist and thus can’t be modified. As with the previous example, though, collection_name is the name of the collection to which you want to assign the validator document and the validator option assigns a specified JSON Schema document as the collection’s validator.

      Applying a JSON Schema validator from the start like this means every document you add to the collection must satisfy the requirements set by the validator. When you add validation rules to an existing collection, though, the new rules won’t affect existing documents until you try to modify them.
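      Relatedly, the collMod command accepts two optional settings that control how validation is applied: validationLevel and validationAction. The fragment below only sketches where they fit; the schema placeholder must be replaced with a real JSON Schema document before running it against a live instance:

```javascript
db.runCommand({
  "collMod": "peaks",
  "validator": {
    $jsonSchema: { /* JSON Schema document */ }
  },
  // "strict" (the default) validates all inserts and updates;
  // "moderate" skips updates to existing documents that already
  // violate the rules.
  "validationLevel": "moderate",
  // "error" (the default) rejects invalid writes;
  // "warn" accepts them but logs the violation.
  "validationAction": "warn"
})
```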

      The JSON schema document you pass to the validator attribute should outline every validation rule you want to apply to the collection. The following example JSON Schema will make sure that the name field is present in every document in the collection, and that the name field’s value is always a string:

      Your first JSON Schema document validating the name field

      {
          "bsonType": "object",
          "description": "Document describing a mountain peak",
          "required": ["name"],
          "properties": {
              "name": {
                  "bsonType": "string",
                  "description": "Name must be a string and is required"
              }
          }
      }
      

      This schema document outlines the requirements that documents entered into the collection must follow. The root part of the JSON Schema document (the fields before properties, which in this case are bsonType, description, and required) describes the database document itself.

      The bsonType property describes the data type that the validation engine will expect to find. For the database document itself, the expected type is object. This means that you can only add objects — in other words, complete, valid JSON documents surrounded by curly braces ({ and }) — to this collection. If you were to try to insert some other kind of data type (like a standalone string, integer, or an array), it would cause an error.

      In MongoDB, every document is an object. However, JSON Schema is a standard used to describe and validate all kinds of valid JSON documents, and a plain array or a string is valid JSON, too. When working with MongoDB schema validation, you’ll find that you must always set the root document’s bsonType value as object in the JSON Schema validator.

      Next, the description property provides a short description of the documents found in this collection. This field isn’t required, but in addition to validating documents, JSON Schemas can also be used to annotate a document’s structure. This can help other users understand the purpose of the documents, so including a description field can be good practice.

      The next property in the validation document is the required field. The required field can only accept an array containing a list of document fields that must be present in every document in the collection. In this example, ["name"] means that the documents only have to contain the name field to be considered valid.

      Following that is a properties object that describes the rules used to validate document fields. For each field that you want to define rules for, include an embedded JSON Schema document named after the field. Be aware that you can define schema rules for fields that aren’t listed in the required array. This can be useful in cases where your data has fields that aren’t required, but you’d still like for them to follow certain rules when they are present.

      These embedded schema documents will follow a similar syntax as the main document. In this example, the bsonType property will require every document’s name field to be a string. This embedded document also contains a brief description field.
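      The logic this schema encodes can be sketched in a few lines of plain JavaScript. This is only an illustration of the rules, not MongoDB’s actual validation engine:

```javascript
// Mirrors the schema above: name is required and must be a string.
function validateName(doc) {
  if (!("name" in doc)) return false;             // "required": ["name"]
  if (typeof doc.name !== "string") return false; // "bsonType": "string"
  return true;
}

console.log(validateName({ name: "Everest", height: 8848 })); // true
console.log(validateName({ height: 8611 }));                  // false (missing name)
console.log(validateName({ name: 123 }));                     // false (not a string)
```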

      To apply this JSON Schema to the peaks collection you created in the previous step, run the following runCommand() method:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • }
      • },
      • }
      • }
      • })

      MongoDB will respond with a success message indicating that the collection was successfully modified:

      Output

      { "ok" : 1 }

      Following that, MongoDB will no longer allow you to insert documents into the peaks collection if they don’t have a name field. To test this, try inserting the document from the previous step that fully describes a mountain except for the missing name field:

      • db.peaks.insertOne(
      • {
      • "height": 8611,
      • "location": ["Pakistan", "China"],
      • "ascents": {
      • "first": {
      • "year": 1954
      • },
      • "first_winter": {
      • "year": 2021
      • },
      • "total": 306
      • }
      • }
      • )

      This time, the operation will trigger an error message indicating a failed document validation:

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . . })

      MongoDB won’t insert any documents that fail to pass the validation rules specified in the JSON Schema.

      Note: Starting with MongoDB 5.0, when validation fails the error messages point towards the failed constraint. In MongoDB 4.4 and earlier, the database provides no further details on the failure reason.

      You can also test whether MongoDB will enforce the data type requirement you included in the JSON Schema by running the following insertOne() method. This is similar to the last operation, but this time it includes a name field. However, this field’s value is a number instead of a string:

      • db.peaks.insertOne(
      • {
      • "name": 123,
      • "height": 8611,
      • "location": ["Pakistan", "China"],
      • "ascents": {
      • "first": {
      • "year": 1954
      • },
      • "first_winter": {
      • "year": 2021
      • },
      • "total": 306
      • }
      • }
      • )

      Once again, the validation will fail. Even though the name field is present, it doesn’t meet the constraint that requires it to be a string:

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . . })

      Try once more, but this time with the name field present and set to a string value. Here, name is the only field in the document:

      • db.peaks.insertOne(
      • {
      • "name": "K2"
      • }
      • )

      The operation will succeed, and the document will receive an object identifier as usual:

      Output

      { "acknowledged" : true, "insertedId" : ObjectId("61900965bfa69c93a8980447") }

      The schema validation rules pertain only to the name field. At this point, as long as the name field fulfills the validation requirements, the document will be inserted without error. The rest of the document can take any shape.

      With that, you’ve created your first JSON Schema document and applied the first schema validation rule to the name field, requiring it to be present and a string. However, there are different validation options for different data types. Next, you’ll validate number values stored in each document’s height field.

      Step 3 — Validating Number Fields

      Recall from Step 1 when you inserted the following document into the peaks collection:

      Mountain with a string value for its height

      {
          "name": "Manaslu",
          "height": "8163m",
          "location": "Nepal"
      }
      

      Even though this document’s height value is a string instead of a number, the insertMany() method you used to insert this document was successful. This was possible because you haven’t yet added any validation rules for the height field.

      MongoDB will accept any value for this field — even values that don’t make any sense for this field, like negative values — as long as the inserted document is written in valid JSON syntax. To work around this, you can extend the schema validation document from the previous step to include additional rules regarding the height field.

      Start by ensuring that the height field is always present in newly-inserted documents and that it’s always expressed as a number. Modify the schema validation with the following command:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name", "height"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • },
      • "height": {
      • "bsonType": "number",
      • "description": "Height must be a number and is required"
      • }
      • },
      • }
      • }
      • })

      In this command’s schema document, the height field is included in the required array. Likewise, there’s a height document within the properties object that will require any new height values to be a number. Again, the description field is auxiliary, and any description you include should only be to help other users understand the intention behind the JSON Schema.

      MongoDB will respond with a short success message to let you know that the collection was successfully modified:

      Output

      { "ok" : 1 }

      Now you can test the new rule. Try inserting a document with the minimal structure required to pass validation. The following method will insert a document containing the only two mandatory fields, name and height:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300
      • }
      • )

      The insertion will succeed:

      Output

      { acknowledged: true, insertedId: ObjectId("61e0c8c376b24e08f998e371") }

      Next, try inserting a document with a missing height field:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak"
      • }
      • )

      Then try another that includes the height field, but this field contains a string value:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": "8300m"
      • }
      • )

      Both times, the operations will trigger an error message and fail:

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . . })

      However, if you try inserting a mountain peak with a negative height, the document will be inserted without error:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": -100
      • }
      • )

      To prevent this, you could add a few more properties to the schema validation document. Replace the current schema validation settings by running the following operation:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name", "height"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • },
      • "height": {
      • "bsonType": "number",
      • "description": "Height must be a number between 100 and 10000 and is required",
      • "minimum": 100,
      • "maximum": 10000
      • }
      • },
      • }
      • }
      • })

      The new minimum and maximum attributes set constraints on values included in height fields, ensuring they can’t be lower than 100 or higher than 10000. This range makes sense in this case, as this collection is used to store information about mountain peak heights, but you could choose any values you like for these attributes.
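      As an illustration in plain JavaScript (again, not MongoDB’s engine), the combined height rule now behaves like this:

```javascript
// height must be a number between 100 and 10000 inclusive.
function validateHeight(doc) {
  if (typeof doc.height !== "number") return false; // "bsonType": "number"
  return doc.height >= 100 && doc.height <= 10000;  // "minimum" / "maximum"
}

console.log(validateHeight({ name: "Everest", height: 8848 }));    // true
console.log(validateHeight({ name: "Test peak", height: -100 }));  // false (below minimum)
console.log(validateHeight({ name: "Manaslu", height: "8163m" })); // false (wrong type)
```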

      Now, if you try inserting a peak with a negative height value again, the operation will fail:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": -100
      • }
      • )

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . .

      As this output shows, your document schema now validates string values held in each document’s name field as well as numeric values held in the height fields. Continue reading to learn how to validate array values stored in each document’s location field.

      Step 4 — Validating Array Fields

      Now that each peak’s name and height values are being verified by schema validation constraints, you can turn your attention to the location field to guarantee its data consistency.

      Specifying the location for mountains is trickier than one might expect, since some peaks span more than one country, as is the case for many of the famous eight-thousanders. Because of this, it makes sense to store each peak’s location data as an array containing one or more country names rather than a single string value. As with the height values, keeping each location field’s data type consistent across every document helps when summarizing data with aggregation pipelines.

      First, consider some examples of location values that users might enter, and weigh which ones would be valid or invalid:

      • ["Nepal", "China"]: a two-element array; this would be a valid value for a mountain spanning two countries.
      • ["Nepal"]: a single-element array; this would also be a valid value for a mountain located in a single country.
      • "Nepal": a plain string. Although it names a single country, it would be invalid because the location field should always contain an array.
      • []: an empty array; this would not be a valid value, since every mountain must exist in at least one country.
      • ["Nepal", "Nepal"]: this two-element array would also be invalid, as it contains the same value twice.
      • ["Nepal", 15]: lastly, this two-element array would be invalid, as one of its values is a number rather than a string and therefore not a valid country name.

      To ensure that MongoDB will correctly interpret each of these examples as valid or invalid, run the following operation to create some new validation rules for the peaks collection:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name", "height", "location"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • },
      • "height": {
      • "bsonType": "number",
      • "description": "Height must be a number between 100 and 10000 and is required",
      • "minimum": 100,
      • "maximum": 10000
      • },
      • "location": {
      • "bsonType": "array",
      • "description": "Location must be an array of strings",
      • "minItems": 1,
      • "uniqueItems": true,
      • "items": {
      • "bsonType": "string"
      • }
      • }
      • },
      • }
      • }
      • })

      In this $jsonSchema object, the location field is included within the required array as well as the properties object. There, it’s defined with a bsonType of array to ensure that the location value is always an array rather than a single string or a number.

      The minItems property validates that the array must contain at least one element, and the uniqueItems property is set to true to ensure that elements within each location array will be unique. This will prevent values like ["Nepal", "Nepal"] from being accepted. Lastly, the items subdocument defines the validation schema for each individual array item. Here, the only expectation is that every item within a location array must be a string.
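      Expressed as a plain-JavaScript sketch (an illustration of the rules only), the location constraints accept and reject exactly the example values listed earlier:

```javascript
// location must be a non-empty array of unique strings.
function validateLocation(loc) {
  if (!Array.isArray(loc)) return false;                // "bsonType": "array"
  if (loc.length < 1) return false;                     // "minItems": 1
  if (new Set(loc).size !== loc.length) return false;   // "uniqueItems": true
  return loc.every((item) => typeof item === "string"); // "items": string
}

console.log(validateLocation(["Nepal", "China"])); // true
console.log(validateLocation(["Nepal"]));          // true
console.log(validateLocation("Nepal"));            // false (not an array)
console.log(validateLocation([]));                 // false (empty)
console.log(validateLocation(["Nepal", "Nepal"])); // false (duplicate)
console.log(validateLocation(["Nepal", 15]));      // false (non-string item)
```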

      Note: The available schema document properties are different for each bsonType and, depending on the field type, you will be able to validate different aspects of the field value. For example, with number values you could define minimum and maximum allowable values to create a range of acceptable values. In the previous example, by setting the location field’s bsonType to array, you can validate features particular to arrays.

      You can find details on all possible validation choices in the JSON Schema documentation.

      After executing the command, MongoDB will respond with a short success message that the collection was successfully modified with the new schema document:

      Output

      { "ok" : 1 }

      Now try inserting documents matching the examples prepared earlier to test how the new rule behaves. Once again, let’s use the minimal document structure, with only the name, height, and location fields present.

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": ["Nepal", "China"]
      • }
      • )

      The document will be inserted successfully as it fulfills all the defined validation expectations. Similarly, the following document will insert without error:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": ["Nepal"]
      • }
      • )

      However, if you were to run any of the following insertOne() methods, they would trigger a validation error and fail:

      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": "Nepal"
      • }
      • )
      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": []
      • }
      • )
      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": ["Nepal", "Nepal"]
      • }
      • )
      • db.peaks.insertOne(
      • {
      • "name": "Test peak",
      • "height": 8300,
      • "location": ["Nepal", 15]
      • }
      • )

      As per the validation rules you defined previously, the location values provided in these operations are considered invalid.

      After following this step, the three primary fields describing a mountain peak are validated through MongoDB’s schema validation feature. In the next step, you’ll learn how to validate nested documents using the ascents field as an example.

      Step 5 — Validating Embedded Documents

      At this point, your peaks collection has three fields — name, height and location — that are being kept in check by schema validation. This step focuses on defining validation rules for the ascents field, which describes successful attempts at summiting each peak.

      In the example document from Step 1 that represents Mount Everest, the ascents field was structured as follows:

      The Everest document

      {
          "name": "Everest",
          "height": 8848,
          "location": ["Nepal", "China"],
          "ascents": {
              "first": {
                  "year": 1953
              },
              "first_winter": {
                  "year": 1980
              },
              "total": 5656
          }
      }
      

      The ascents subdocument contains a total field whose value represents the total number of ascent attempts for the given mountain. It also contains information on the first winter ascent of the mountain as well as the first ascent overall. These, however, might not be essential to the mountain description. After all, some mountains might not have been ascended in winter yet, or the ascent dates are disputed or not known. For now, just assume the information that you will always want to have in each document is the total number of ascent attempts.

      You can change the schema validation document so that the ascents field must always be present and its value must always be a subdocument. This subdocument, in turn, must always contain a total attribute holding a number greater than or equal to zero. The first and first_winter fields aren’t required for the purposes of this guide, so the validation rules won’t consider them and they can take flexible forms.

      Once again, replace the schema validation document for the peaks collection by running the following runCommand() method:

      • db.runCommand({
      • "collMod": "peaks",
      • "validator": {
      • $jsonSchema: {
      • "bsonType": "object",
      • "description": "Document describing a mountain peak",
      • "required": ["name", "height", "location", "ascents"],
      • "properties": {
      • "name": {
      • "bsonType": "string",
      • "description": "Name must be a string and is required"
      • },
      • "height": {
      • "bsonType": "number",
      • "description": "Height must be a number between 100 and 10000 and is required",
      • "minimum": 100,
      • "maximum": 10000
      • },
      • "location": {
      • "bsonType": "array",
      • "description": "Location must be an array of strings",
      • "minItems": 1,
      • "uniqueItems": true,
      • "items": {
      • "bsonType": "string"
      • }
      • },
      • "ascents": {
      • "bsonType": "object",
      • "description": "Ascent attempts information",
      • "required": ["total"],
      • "properties": {
      • "total": {
      • "bsonType": "number",
      • "description": "Total number of ascents must be 0 or higher",
      • "minimum": 0
      • }
      • }
      • }
      • },
      • }
      • }
      • })

Whenever a document contains subdocuments under any of its fields, the JSON Schema for that field follows the same syntax as the main document schema. Just as documents can be nested within one another, so can the validation schemas that describe them. This makes it straightforward to define complex validation rules for document structures containing multiple subdocuments in a hierarchy.

      In this JSON Schema document, the ascents field is included within the required array, making it mandatory. It also appears in the properties object where it’s defined with a bsonType of object, just like the root document itself.

Notice that the definition for ascents validation follows the same principles as the root document. It has a required field denoting the properties the subdocument must contain, and it defines a properties list following the same structure. Since the ascents field is a subdocument, its values will be validated just as those of a larger document would be.

      Within ascents, there’s a required array whose only value is total, meaning that every ascents subdocument will be required to contain a total field. Following that, the total value is described thoroughly within the properties object, which specifies that this must always be a number with a minimum value of zero.

      Again, because neither the first nor the first_winter fields are mandatory for the purposes of this guide, they aren’t included in these validation rules.
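Although first and first_winter stay unvalidated in this guide, nothing prevents deeper nesting. As a hypothetical extension (not part of the validator applied above), the fragment below sketches how the ascents rules could additionally describe an optional first subdocument, requiring its year field to be a number whenever first is present:

```javascript
// Hypothetical extension of the "ascents" rules: if a "first" subdocument
// is present, its "year" must be a number. Because "first" is not listed
// in "required", documents may still omit it entirely.
const ascentsSchema = {
  bsonType: "object",
  description: "Ascent attempts information",
  required: ["total"],          // only "total" stays mandatory
  properties: {
    total: { bsonType: "number", minimum: 0 },
    first: {
      bsonType: "object",
      properties: {
        year: { bsonType: "number", description: "Year of the first ascent" }
      }
    }
  }
};
```

This object could replace the ascents entry inside the properties section of the validator passed to collMod.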

      With this schema validation document applied, try inserting the sample Mount Everest document from the first step to verify it allows you to insert documents you’ve already established as valid:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"],
      • "ascents": {
      • "first": {
      • "year": 1953,
      • },
      • "first_winter": {
      • "year": 1980,
      • },
      • "total": 5656,
      • }
      • }
      • )

      The document saves successfully, and MongoDB returns the new object identifier:

      Output

      { "acknowledged" : true, "insertedId" : ObjectId("619100f51292cb2faee531f8") }

      To make sure the last pieces of validation work properly, try inserting a document that doesn’t include the ascents field:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"]
      • }
      • )

      This time, the operation will trigger an error message pointing out a failed document validation:

      Output

      WriteError({ "index" : 0, "code" : 121, "errmsg" : "Document failed validation", . . . })

      Now try inserting a document whose ascents subdocument is missing the total field:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"],
      • "ascents": {
      • "first": {
      • "year": 1953,
      • },
      • "first_winter": {
      • "year": 1980,
      • }
      • }
      • }
      • )

      This will again trigger an error.

      As a final test, try entering a document that contains an ascents field with a total value, but this value is negative:

      • db.peaks.insertOne(
      • {
      • "name": "Everest",
      • "height": 8848,
      • "location": ["Nepal", "China"],
      • "ascents": {
      • "first": {
      • "year": 1953,
      • },
      • "first_winter": {
      • "year": 1980,
      • },
      • "total": -100
      • }
      • }
      • )

      Because of the negative total value, this document will also fail the validation test.

      Conclusion

      By following this tutorial, you became familiar with JSON Schema documents and how to use them to validate document structures before saving them into a collection. You then used JSON Schema documents to verify field types and apply value constraints to numbers and arrays. You’ve also learned how to validate subdocuments in a nested document structure.

MongoDB’s schema validation feature should not be considered a replacement for data validation at the application level, but it can further safeguard against violations of data constraints that are essential to keeping your data meaningful. Schema validation can be a helpful tool for structuring your data while retaining the flexibility of a schemaless approach to storage. With it, you are in total control of which parts of the document structure you want to validate and which you’d like to leave open-ended.

The tutorial described only a subset of MongoDB’s schema validation features. You can apply more constraints to different MongoDB data types, and it’s even possible to change the strictness of validation behavior and use JSON Schema to filter and validate existing documents. We encourage you to study the official MongoDB documentation to learn more about schema validation and how it can help you work with data stored in the database.




      How To Perform Full-text Search in MongoDB


      The author selected the Open Internet/Free Speech Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      MongoDB queries that filter data by searching for exact matches, using greater-than or less-than comparisons, or by using regular expressions will work well enough in many situations. However, these methods fall short when it comes to filtering against fields containing rich textual data.

      Imagine you typed “coffee recipe” into a web search engine but it only returned pages that contained that exact phrase. In this case, you may not find exactly what you were looking for since most popular websites with coffee recipes may not contain the exact phrase “coffee recipe.” If you were to enter that phrase into a real search engine, though, you might find pages with titles like “Great Coffee Drinks (with Recipes!)” or “Coffee Shop Drinks and Treats You Can Make at Home.” In these examples, the word “coffee” is present but the titles contain another form of the word “recipe” or exclude it entirely.

This level of flexibility in matching text to a search query is typical for full-text search engines that specialize in searching textual data. There are multiple specialized open-source tools for such applications, with Elasticsearch being an especially popular choice. However, for scenarios that don’t require the robust search features found in dedicated search engines, some general-purpose database management systems offer their own full-text search capabilities.

      In this tutorial, you’ll learn by example how to create a text index in MongoDB and use it to search the documents in the database against common full-text search queries and filters.

      Prerequisites

      To follow this tutorial, you will need:

      Note: The linked tutorials on how to configure your server, install MongoDB, and secure the MongoDB installation refer to Ubuntu 20.04. This tutorial concentrates on MongoDB itself, not the underlying operating system. It will generally work with any MongoDB installation regardless of the operating system as long as authentication has been enabled.

      Step 1 — Preparing the Test Data

      To help you learn how to perform full-text searches in MongoDB, this step outlines how to open the MongoDB shell to connect to your locally-installed MongoDB instance. It also explains how to create a sample collection and insert a few sample documents into it. This sample data will be used in commands and examples throughout this guide to help explain how to use MongoDB to search text data.

      To create this sample collection, connect to the MongoDB shell as your administrative user. This tutorial follows the conventions of the prerequisite MongoDB security tutorial and assumes the name of this administrative user is AdminSammy and its authentication database is admin. Be sure to change these details in the following command to reflect your own setup, if different:

      • mongo -u AdminSammy -p --authenticationDatabase admin

      Enter the password you set during installation to gain access to the shell. After providing the password, your prompt will change to a greater-than sign:

      Note: On a fresh connection, the MongoDB shell will connect to the test database by default. You can safely use this database to experiment with MongoDB and the MongoDB shell.

      Alternatively, you could switch to another database to run all of the example commands given in this tutorial. To switch to another database, run the use command followed by the name of your database:

      To understand how full-text search can be applied to documents in MongoDB, you’ll need a collection of documents you can filter against. This guide will use a collection of sample documents that include names and descriptions of several different types of coffee drinks. These documents will have the same format as the following example document describing a Cuban coffee drink:

      Example Cafecito document

      {
          "name": "Cafecito",
          "description": "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam."
      }
      

      This document contains two fields: the name of the coffee drink and a longer description which provides some background information about the drink and its ingredients.

      Run the following insertMany() method in the MongoDB shell to create a collection named recipes and, at the same time, insert five sample documents into it:

      • db.recipes.insertMany([
      • {"name": "Cafecito", "description": "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam."},
      • {"name": "New Orleans Coffee", "description": "Cafe Noir from New Orleans is a spiced, nutty coffee made with chicory."},
      • {"name": "Affogato", "description": "An Italian sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream."},
      • {"name": "Maple Latte", "description": "A wintertime classic made with espresso and steamed milk and sweetened with some maple syrup."},
      • {"name": "Pumpkin Spice Latte", "description": "It wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon spices, and pumpkin puree."}
      • ])

      This method will return a list of object identifiers assigned to the newly inserted objects:

      Output

      { "acknowledged" : true, "insertedIds" : [ ObjectId("61895d2787f246b334ece911"), ObjectId("61895d2787f246b334ece912"), ObjectId("61895d2787f246b334ece913"), ObjectId("61895d2787f246b334ece914"), ObjectId("61895d2787f246b334ece915") ] }

      You can verify that the documents were properly inserted by running the find() method on the recipes collection with no arguments. This will retrieve every document in the collection:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece911"), "name" : "Cafecito", "description" : "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam." } . . .

      With the sample data in place, you’re ready to start learning how to use MongoDB’s full-text search features.

      Step 2 — Creating a Text Index

      To start using MongoDB’s full-text search capabilities, you must create a text index on a collection. Indexes are special data structures that store only a small subset of data from each document in a collection separately from the documents themselves. There are several types of indexes users can create in MongoDB, all of which help the database optimize search performance when querying the collection.

      A text index, however, is a special type of index used to further facilitate searching fields containing text data. When a user creates a text index, MongoDB will automatically drop any language-specific stop words from searches. This means that MongoDB will ignore the most common words for the given language (in English, words like “a”, “an”, “the”, or “this”).

      MongoDB will also implement a form of suffix-stemming in searches. This involves MongoDB identifying the root part of the search term and treating other grammar forms of that root (created by adding common suffixes like “-ing”, “-ed”, or perhaps “-er”) as equivalent to the root for the purposes of the search.

      Thanks to these and other features, MongoDB can more flexibly support queries written in natural language and provide better results.
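As an illustration only, the sketch below models stop-word removal and suffix stemming in plain JavaScript. It is a deliberately naive approximation (MongoDB's real implementation uses complete stop-word lists and a proper stemming algorithm), but it shows why different grammatical forms of a word can match the same query:

```javascript
// Illustrative sketch only: a much-simplified model of what happens to
// text at indexing and query time. Not MongoDB's actual implementation.
const STOP_WORDS = new Set(["a", "an", "the", "this", "with", "and"]);

// Naive suffix stripping, standing in for real stemming.
function stem(word) {
  return word.toLowerCase().replace(/(ing|ed|er|es|s)$/, "");
}

// Lowercase, split on non-word characters, drop stop words, stem the rest.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/\W+/)
    .filter((word) => word && !STOP_WORDS.has(word))
    .map(stem);
}

tokenize("A spiced coffee with spices");
// "a" and "with" are dropped; "spiced" and "spices" both reduce to "spic",
// which is why either form matches a search for the other.
```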

      Note: This tutorial focuses on English text, but MongoDB supports multiple languages when using full-text search and text indexes. To learn more about what languages MongoDB supports, refer to the official documentation on supported languages.

      You can only create one text index for any given MongoDB collection, but the index can be created using more than one field. In our example collection, there is useful text stored in both the name and description fields of each document. It could be useful to create a text index for both fields.

      Run the following createIndex() method, which will create a text index for the two fields:

      • db.recipes.createIndex({ "name": "text", "description": "text" });

      For each of the two fields, name and description, the index type is set to text, telling MongoDB to create a text index tailored for full-text search based on these fields. The output will confirm the index creation:

      Output

      { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "ok" : 1 }

      Now that you’ve created the index, you can use it to issue full-text search queries to the database. In the next step, you’ll learn how to execute queries containing both single and multiple words.

      Step 3 — Searching for One or More Individual Words

      Perhaps the most common search problem is to look up documents containing one or more individual words.

      Typically, users expect the search engine to be flexible in determining where the given search terms should appear. As an example, if you were to use any popular web search engine and type in “coffee sweet spicy”, you likely are not expecting results that will contain those three words in that exact order. It’s more likely that you’d expect a list of web pages containing the words “coffee”, “sweet”, and “spicy” but not necessarily immediately near each other.

      That’s also how MongoDB approaches typical search queries when using text indexes. This step outlines how MongoDB interprets search queries with a few examples.

      To begin, say you want to search for coffee drinks with spices in their recipe, so you search for the word spiced alone using the following command:

      • db.recipes.find({ $text: { $search: "spiced" } });

      Notice that the syntax when using full-text search is slightly different from regular queries. Individual field names — like name or description — don’t appear in the filter document. Instead, the query uses the $text operator, telling MongoDB that this query intends to use the text index you created previously. You don’t need to be any more specific than that because, as you may recall, a collection may only have a single text index. Inside the embedded document for this filter is the $search operator taking the search query as its value. In this example, the query is a single word: spiced.

      After running this command, MongoDB produces the following list of documents:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece915"), "name" : "Pumpkin Spice Latte", "description" : "It wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon spices, and pumpkin puree." } { "_id" : ObjectId("61895d2787f246b334ece912"), "name" : "New Orleans Coffee", "description" : "Cafe Noir from New Orleans is a spiced, nutty coffee made with chicory." }

There are two documents in the result set, both of which contain words resembling the search query. While the New Orleans Coffee document does have the word spiced in its description, the Pumpkin Spice Latte document doesn’t.

Regardless, it was still returned by this query thanks to MongoDB’s use of stemming. MongoDB stripped the word spiced down to its root, spice, and looked that root up in the index, where indexed words have been stemmed the same way. Because of this, the words spice and spices in the Pumpkin Spice Latte document matched the search query successfully, even though you didn’t search for either of those words specifically.

      Now, suppose you’re particularly fond of espresso drinks. Try looking up documents with a two-word query, spiced espresso, to look for a spicy, espresso-based coffee.

      • db.recipes.find({ $text: { $search: "spiced espresso" } });

      The list of results this time is longer than before:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece914"), "name" : "Maple Latte", "description" : "A wintertime classic made with espresso and steamed milk and sweetened with some maple syrup." } { "_id" : ObjectId("61895d2787f246b334ece913"), "name" : "Affogato", "description" : "An Italian sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream." } { "_id" : ObjectId("61895d2787f246b334ece911"), "name" : "Cafecito", "description" : "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam." } { "_id" : ObjectId("61895d2787f246b334ece915"), "name" : "Pumpkin Spice Latte", "description" : "It wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon spices, and pumpkin puree." } { "_id" : ObjectId("61895d2787f246b334ece912"), "name" : "New Orleans Coffee", "description" : "Cafe Noir from New Orleans is a spiced, nutty coffee made with chicory." }

      When using multiple words in a search query, MongoDB performs a logical OR operation, so a document only has to match one part of the expression to be included in the result set. The results contain documents containing both spiced and espresso or either term alone. Notice that words do not necessarily need to appear near each other as long as they appear in the document somewhere.
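This OR behavior can be modeled in a few lines of JavaScript. The sketch below only illustrates the matching logic, assuming documents have already been tokenized and stemmed:

```javascript
// A document matches a multi-word query if it contains at least one of the
// query's terms (logical OR), regardless of where the terms appear.
function matchesQuery(docTokens, queryTokens) {
  return queryTokens.some((term) => docTokens.includes(term));
}

// "Maple Latte" contains "espresso" but not "spiced": still a match.
matchesQuery(["wintertime", "classic", "espresso", "milk"], ["spiced", "espresso"]); // → true

// A document containing neither term does not match.
matchesQuery(["chicory", "nutty"], ["spiced", "espresso"]); // → false
```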

      Note: If you try to execute any full-text search query on a collection for which there is no text index defined, MongoDB will return an error message instead:

      Error message

      Error: error: { "ok" : 0, "errmsg" : "text index required for $text query", "code" : 27, "codeName" : "IndexNotFound" }

      In this step, you learned how to use one or multiple words as a text search query, how MongoDB joins multiple words with a logical OR operation, and how MongoDB performs stemming. Next, you’ll use a complete phrase in a text search query and begin using exclusions to narrow down your search results further.

      Step 4 — Searching for Full Phrases and Using Exclusions

      Looking up individual words might return too many results, or the results may not be precise enough. In this step, you’ll use phrase search and exclusions to control search results more precisely.

      Suppose you have a sweet tooth, it’s hot outside, and coffee topped with ice cream sounds like a nice treat. Try finding an ice cream coffee using the basic search query as outlined previously:

      • db.recipes.find({ $text: { $search: "ice cream" } });

      The database will return two coffee recipes:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece913"), "name" : "Affogato", "description" : "An Italian sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream." } { "_id" : ObjectId("61895d2787f246b334ece911"), "name" : "Cafecito", "description" : "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam." }

      While the Affogato document matches your expectations, Cafecito isn’t made with ice cream. The search engine, using the logical OR operation, accepted the second result just because the word cream appears in the description.

      To tell MongoDB that you are looking for ice cream as a full phrase and not two separate words, use the following query:

• db.recipes.find({ $text: { $search: "\"ice cream\"" } });

Notice the backslashes preceding each of the double quotes surrounding the phrase: \"ice cream\". The search query you’re executing is "ice cream", with the double quotes denoting a phrase that should be matched exactly. The backslashes (\) escape the inner double quotes so that they’re treated as part of the $search value rather than as the end of the string.

      This time, MongoDB returns a single result:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece913"), "name" : "Affogato", "description" : "An Italian sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream." }

      This document matches the search term exactly, and neither cream nor ice alone would be enough to count as a match.
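Because the MongoDB shell is a JavaScript environment, an equivalent alternative to backslash-escaping is to use single quotes for the outer string, so the inner double quotes need no escaping. The comparison below shows that both literals produce the same string:

```javascript
// Both literals evaluate to the same 11-character string: "ice cream",
// including the surrounding double quotes that mark a phrase search.
const escaped = "\"ice cream\"";     // double-quoted string, quotes escaped
const singleQuoted = '"ice cream"';  // single-quoted string, no escaping needed

// Either form could be passed as the $search value, for example:
// db.recipes.find({ $text: { $search: '"ice cream"' } });
```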

Another useful full-text search feature is the exclusion modifier. To illustrate how this works, first run the following query to get a list of all the coffee drinks in the collection based on espresso:

      • db.recipes.find({ $text: { $search: "espresso" } });

      This query returns four documents:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece914"), "name" : "Maple Latte", "description" : "A wintertime classic made with espresso and steamed milk and sweetened with some maple syrup." } { "_id" : ObjectId("61895d2787f246b334ece913"), "name" : "Affogato", "description" : "An Italian sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream." } { "_id" : ObjectId("61895d2787f246b334ece915"), "name" : "Pumpkin Spice Latte", "description" : "It wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon spices, and pumpkin puree." } { "_id" : ObjectId("61895d2787f246b334ece911"), "name" : "Cafecito", "description" : "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam." }

Notice that two of these drinks are served with milk, but suppose you want a milk-free drink. This is a case where exclusions can come in handy. In a single query, you can combine words that you want to appear in the results with words you want excluded by prepending each excluded word or phrase with a minus sign (-).

      As an example, say you run the following query to look up espresso coffees that do not contain milk:

      • db.recipes.find({ $text: { $search: "espresso -milk" } });

      With this query, two documents will be excluded from the previously returned results:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece913"), "name" : "Affogato", "description" : "An Italian sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream." } { "_id" : ObjectId("61895d2787f246b334ece911"), "name" : "Cafecito", "description" : "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam." }

      You can also exclude full phrases. To search for coffees without ice cream, you could include -"ice cream" in your search query. Again, you’d need to escape the double quotes with backslashes, like this:

• db.recipes.find({ $text: { $search: "espresso -\"ice cream\"" } });

      Output

      { "_id" : ObjectId("61d48c31a285f8250c8dd5e6"), "name" : "Maple Latte", "description" : "A wintertime classic made with espresso and steamed milk and sweetened with some maple syrup." } { "_id" : ObjectId("61d48c31a285f8250c8dd5e7"), "name" : "Pumpkin Spice Latte", "description" : "It wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon spices, and pumpkin puree." } { "_id" : ObjectId("61d48c31a285f8250c8dd5e3"), "name" : "Cafecito", "description" : "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam." }

      Now that you’ve learned how to filter documents based on a phrase consisting of multiple words and how to exclude certain words and phrases from search results, you can acquaint yourself with MongoDB’s full-text search scoring.

      Step 5 — Scoring the Results and Sorting By Score

      When a query, especially a complex one, returns multiple results, some documents are likely to be a better match than others. For example, when you look for spiced espresso drinks, those that are both spiced and espresso-based are more fitting than those without spices or not using espresso as the base.

      Full-text search engines typically assign a relevance score to the search results, indicating how well they match the search query. MongoDB also does this, but the search relevance is not visible by default.

      Search once again for spiced espresso, but this time have MongoDB also return each result’s search relevance score. To do this, you could add a projection after the query filter document:

      • db.recipes.find(
      • { $text: { $search: "spiced espresso" } },
      • { score: { $meta: "textScore" } }
      • )

      The projection { score: { $meta: "textScore" } } uses the $meta operator, a special kind of projection that returns specific metadata from returned documents. This example returns the documents’ textScore metadata, a built-in feature of MongoDB’s full-text search engine that contains the search relevance score.

After executing the query, the returned documents will include a new field named score, as specified in the projection document:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece913"), "name" : "Affogato", "description" : "An Italian sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream.", "score" : 0.5454545454545454 } { "_id" : ObjectId("61895d2787f246b334ece911"), "name" : "Cafecito", "description" : "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam.", "score" : 0.5384615384615384 } { "_id" : ObjectId("61895d2787f246b334ece914"), "name" : "Maple Latte", "description" : "A wintertime classic made with espresso and steamed milk and sweetened with some maple syrup.", "score" : 0.55 } { "_id" : ObjectId("61895d2787f246b334ece912"), "name" : "New Orleans Coffee", "description" : "Cafe Noir from New Orleans is a spiced, nutty coffee made with chicory.", "score" : 0.5454545454545454 } { "_id" : ObjectId("61895d2787f246b334ece915"), "name" : "Pumpkin Spice Latte", "description" : "It wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon spices, and pumpkin puree.", "score" : 2.0705128205128203 }

      Notice how much higher the score is for Pumpkin Spice Latte, the only coffee drink that contains both the words spiced and espresso. According to MongoDB’s relevance score, it’s the most relevant document for that query. However, by default, the results are not returned in order of relevance.

      To change that, you could add a sort() clause to the query, like this:

      • db.recipes.find(
      • { $text: { $search: "spiced espresso" } },
      • { score: { $meta: "textScore" } }
      • ).sort(
      • { score: { $meta: "textScore" } }
      • );

      The syntax for the sorting document is the same as that of the projection. Now, the list of documents is the same, but their order is different:

      Output

      { "_id" : ObjectId("61895d2787f246b334ece915"), "name" : "Pumpkin Spice Latte", "description" : "It wouldn't be autumn without pumpkin spice lattes made with espresso, steamed milk, cinnamon spices, and pumpkin puree.", "score" : 2.0705128205128203 } { "_id" : ObjectId("61895d2787f246b334ece914"), "name" : "Maple Latte", "description" : "A wintertime classic made with espresso and steamed milk and sweetened with some maple syrup.", "score" : 0.55 } { "_id" : ObjectId("61895d2787f246b334ece913"), "name" : "Affogato", "description" : "An Italian sweet dessert coffee made with fresh-brewed espresso and vanilla ice cream.", "score" : 0.5454545454545454 } { "_id" : ObjectId("61895d2787f246b334ece912"), "name" : "New Orleans Coffee", "description" : "Cafe Noir from New Orleans is a spiced, nutty coffee made with chicory.", "score" : 0.5454545454545454 } { "_id" : ObjectId("61895d2787f246b334ece911"), "name" : "Cafecito", "description" : "A sweet and rich Cuban hot coffee made by topping an espresso shot with a thick sugar cream foam.", "score" : 0.5384615384615384 }

      The Pumpkin Spice Latte document appears as the first result since it has the highest relevance score.

      Sorting results according to their relevance score can be helpful. This is especially true with queries containing multiple words, where the most fitting documents will usually contain multiple search terms while the less relevant documents might contain only one.
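Conceptually, the server-side sort behaves like an ordinary descending sort on the score values. A minimal client-side model of the ordering, using the scores from the earlier output:

```javascript
// Sorting documents by relevance score, highest first, mirroring what
// sort({ score: { $meta: "textScore" } }) does to the result set.
const results = [
  { name: "Maple Latte", score: 0.55 },
  { name: "Cafecito", score: 0.5384615384615384 },
  { name: "Pumpkin Spice Latte", score: 2.0705128205128203 },
];

// Descending comparator: the best match comes first.
results.sort((a, b) => b.score - a.score);
// results[0].name → "Pumpkin Spice Latte"
```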

      Conclusion

By following this tutorial, you’ve acquainted yourself with MongoDB’s full-text search features. You created a text index and wrote text search queries using single and multiple words, full phrases, and exclusions. You’ve also examined the relevance scores of returned documents and sorted the search results to show the most relevant results first. While MongoDB’s full-text search features may not be as robust as those of some dedicated search engines, they are capable enough for many use cases.

Note that there are more search query modifiers, such as case and diacritic sensitivity and support for multiple languages within a single text index, which can be used in more demanding text search applications. For more information on MongoDB’s full-text search features and how they can be used, we encourage you to check out the official MongoDB documentation.




      How To Use Sharding in MongoDB


      The author selected the Open Internet/Free Speech Fund to receive a donation as part of the Write for DOnations program.

      Introduction

Database sharding is the process of splitting up records that would normally be held in the same table or collection and distributing them across multiple machines, known as shards. Sharding is especially useful in cases where you’re working with large amounts of data, as it allows you to scale your database horizontally by adding more machines that can function as new shards.
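The core idea behind hashed sharding can be sketched in a few lines. The hash function below is a toy stand-in (MongoDB uses its own hash function and chunk-based balancing), but it shows how a shard key value deterministically maps each record to one machine:

```javascript
// Toy illustration of routing a document to a shard by hashing its shard
// key. Not MongoDB's actual hash; only demonstrates the core idea that a
// given key always routes to the same shard.
function pickShard(shardKeyValue, shardCount) {
  let hash = 0;
  for (const ch of String(shardKeyValue)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple rolling hash
  }
  return hash % shardCount; // shard index in [0, shardCount)
}

// The same key always routes to the same shard:
pickShard("customer-1042", 2); // deterministic for a given key
```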

      In this tutorial, you’ll learn how to deploy a sharded MongoDB cluster with two shards. This guide will also outline how to choose an appropriate shard key as well as how to verify whether your MongoDB documents are being split up across shards correctly and evenly.

      Warning: The goal of this guide is to outline how sharding works in MongoDB. To that end, it demonstrates how to get a sharded cluster set up and running quickly for use in a development environment. Upon completing this tutorial you’ll have a functioning sharded cluster, but it will not have any security features enabled.

      Additionally, MongoDB recommends that a sharded cluster’s shard servers and config server all be deployed as replica sets with at least three members. Again, though, in order to get a sharded cluster up and running quickly, this guide outlines how to deploy these components as single-node replica sets.

      For these reasons, this setup is not considered secure and should not be used in production environments. If you plan on using a sharded cluster in a production environment, we strongly encourage you to review the official MongoDB documentation on Internal/Membership Authentication as well as our tutorial on How To Configure a MongoDB Replica Set on Ubuntu 20.04.

      Prerequisites

      To follow this tutorial, you will need:

      • Four separate servers. Each of these should have a regular, non-root user with sudo privileges and a firewall configured with UFW. This tutorial was validated using four servers running Ubuntu 20.04, and you can prepare your servers by following this initial server setup tutorial for Ubuntu 20.04 on each of them.
      • MongoDB installed on each of your servers. To set this up, follow our tutorial on How to Install MongoDB on Ubuntu 20.04 for each server.
      • All four of your servers configured with remote access enabled for each of the other instances. To set this up, follow our tutorial on How to Configure Remote Access for MongoDB on Ubuntu 20.04. As you follow this guide, make sure that each server has the other three servers’ IP addresses added as trusted IP addresses to allow for open communication between all of the servers.

      Note: The linked tutorials on how to configure your server, install MongoDB, and then allow remote access to MongoDB all refer to Ubuntu 20.04. This tutorial concentrates on MongoDB itself, not the underlying operating system. It will generally work with any MongoDB installation regardless of the operating system as long as each of the four servers are configured as outlined previously.

      For clarity, this tutorial will refer to the four servers as follows:

      • mongo-config, which will function as the cluster’s config server.
      • mongo-shard1 and mongo-shard2, which will serve as shard servers where the data will actually be distributed.
      • mongo-router, which will run a mongos instance and function as the shard cluster’s query router.

      For more details on what these roles are and how they function within a sharded MongoDB cluster, please read the following section on Understanding MongoDB’s Sharding Topology.

      Commands that must be executed on mongo-config will have a blue background, like this:

      Commands that must be executed on mongo-shard1 will have a red background:

      Commands run on mongo-shard2 will have a green background:

      And the mongo-router server’s commands will have a violet background:

      Understanding MongoDB’s Sharding Topology

      When working with a standalone MongoDB database server, you connect to that instance and use it to directly manage your data. In an unsharded replica set, you connect to the cluster’s primary member, and any changes you make to the data there are automatically carried over to the set’s secondary members. Sharded MongoDB clusters, though, are slightly more complex.

      Sharding is meant to help with horizontal scaling, also known as scaling out, since it splits up records from one data set across multiple machines. If the workload becomes too great for the shards in your cluster, you can scale out your database by adding another separate shard to take on some of the work. This contrasts with vertical scaling, also known as scaling up, which involves migrating one’s resources to larger or more powerful hardware.

      Because data is physically divided into multiple database nodes in a sharded database architecture, some documents will be available only on one node, while others will reside on another server. If you decided to connect to a particular instance to query the data, only a subset of the data would be available to you. Additionally, if you were to directly change any data held on one shard, you run the risk of creating inconsistency between your shards.

      To mitigate these risks, sharded clusters in MongoDB are made up of three separate components:

      • Shard servers are individual MongoDB instances used to store a subset of a larger collection of data. Every shard server must always be deployed as a replica set. There must be a minimum of one shard in a sharded cluster, but to gain any benefits from sharding you will need at least two.
      • The cluster’s config server is a MongoDB instance that stores metadata and configuration settings for the sharded cluster. The cluster uses this metadata for setup and management purposes. Like shard servers, the config server must be deployed as a replica set to ensure that this metadata remains highly available.
      • mongos is a special type of MongoDB instance that serves as a query router. mongos acts as a proxy between client applications and the sharded cluster, and is responsible for deciding where to direct a given query. Every application connection goes through a query router in a sharded cluster, thereby hiding the complexity of the configuration from the application.

      Because sharding in MongoDB is done at a collection level, a single database can contain a mixture of sharded and unsharded collections. Although sharded collections are partitioned and distributed across the multiple shards of the cluster, one shard is always elected as a primary shard. Unsharded collections are stored in their entirety on this primary shard.

      Since every application connection must go through the mongos instance, the mongos query router is what’s responsible for making all data consistently available and distributed across individual shards.
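To make the query router’s role more concrete, the following is a conceptual sketch in plain JavaScript of how a mongos-style router decides between targeting a single shard and broadcasting to all of them. This is an illustration only, not MongoDB’s actual implementation, and `simpleHash` is a stand-in for the hashing MongoDB performs internally:

```javascript
// Conceptual sketch (not MongoDB source code): a router can target a single
// shard only when the query filter contains the shard key; otherwise the
// query must be broadcast to every shard.
function routeQuery(filter, shardKey, shards) {
  if (Object.prototype.hasOwnProperty.call(filter, shardKey)) {
    // Shard key present: hash it to find the single owning shard.
    const owner = shards[simpleHash(String(filter[shardKey])) % shards.length];
    return { type: "targeted", shards: [owner] };
  }
  // No shard key in the filter: every shard must evaluate the query.
  return { type: "broadcast", shards: [...shards] };
}

// Stand-in for MongoDB's real hashed-index function, for illustration only.
function simpleHash(value) {
  let h = 0;
  for (const ch of value) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

const shards = ["shard1", "shard2"];
console.log(routeQuery({ country: "Japan" }, "country", shards).type);        // "targeted"
console.log(routeQuery({ population: { $gt: 20 } }, "country", shards).type); // "broadcast"
```

This is why, as discussed later in this guide, a good shard key is one that appears in most query filters: queries that omit it force a broadcast to every shard.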

      Diagram outlining how to connect to a sharded MongoDB cluster. Applications connect to the mongos query router, which connects to a config server to determine how to query and distribute data to shards. The query router also connects to the shards themselves.

      Step 1 — Setting Up a MongoDB Config Server

      After completing the prerequisites, you’ll have four MongoDB installations running on four separate servers. In this step, you’ll convert one of these instances — mongo-config — into a replica set that you can use for testing or development purposes. You’ll also set this MongoDB instance up with features that will allow it to serve as a config server for a sharded cluster.

      Warning: Starting with MongoDB 3.6, both individual shards and config servers must be deployed as replica sets. It’s recommended to always have replica sets with at least three members in a production environment. Using replica sets with three or more members is helpful for keeping your data available and secure, but it also substantially increases the complexity of the sharded architecture. However, you can use single-node replica sets for local development, as this guide outlines.

      To reiterate the warning given previously in the introduction, this guide outlines how to get a sharded cluster up and running quickly. Hence, it outlines how to deploy a sharded cluster using shard servers and a config server that each consist of a single-node replica set. Because of this, and because it will not have any security features enabled, this setup is not secure and should not be used in a production environment.

      On mongo-config, open the MongoDB configuration file in your preferred text editor. Here, we’ll use nano:

      • sudo nano /etc/mongod.conf

      Find the configuration section with lines that read #replication: and #sharding: towards the bottom of the file:

      /etc/mongod.conf

      
      . . .
      #replication:
      
      #sharding:
      

      Uncomment the #replication: line by removing the pound sign (#). Then add a replSetName directive below the replication: line, followed by a name MongoDB will use to identify the replica set. Because you’re setting up this MongoDB instance as a replica set that will function as a config server, this guide will use the name config:

      /etc/mongod.conf

      . . .
      replication:
        replSetName: "config"
      
      #sharding:
      . . .
      

      Note that there are two spaces preceding the new replSetName directive and that its config value is wrapped in quotation marks. This syntax is required for the configuration to be read properly.

      Next, uncomment the #sharding: line as well. On the next line after that, add a clusterRole directive with a value of configsvr:

      /etc/mongod.conf

      . . .
      replication:
        replSetName: "config"
      
      sharding:
        clusterRole: configsvr
      . . .
      

      The clusterRole directive tells MongoDB that this server will be a part of the sharded cluster and will take the role of a config server (as indicated by the configsvr value). Again, be sure to precede this line with two spaces.

      Note: When both the replication and security lines are enabled in the mongod.conf file, MongoDB also requires you to configure some means of authentication other than password authentication, such as keyfile authentication or setting up x.509 certificates. If you followed our How To Secure MongoDB on Ubuntu 20.04 tutorial and enabled authentication on your MongoDB instance, you will only have password authentication enabled.

      Rather than setting up more advanced security measures, for the purposes of this tutorial it would be prudent to disable the security block in your mongod.conf file. Do so by commenting out every line in the security block with a pound sign:

      /etc/mongod.conf

      . . .
      
      #security:
      #  authorization: enabled
      
      . . .
      

      As long as you only plan to use this database to practice sharding or other testing purposes, this won’t present a security risk. However, if you plan to use this MongoDB instance to store any sensitive data in the future, be sure to uncomment these lines to re-enable authentication.

      After updating these two sections of the file, save and close the file. If you used nano, you can do so by pressing CTRL + X, Y, and then ENTER.

      Then, restart the mongod service:

      • sudo systemctl restart mongod

      With that, you’ve enabled replication for the server. However, the MongoDB instance isn’t yet replicating any data. You’ll need to start replication through the MongoDB shell, so open it up with the following command:

      • mongo

      From the MongoDB shell prompt, run the following command to initiate this replica set:

      • rs.initiate()

      This command will start the replication with the default configuration inferred by the MongoDB server. When setting up a replica set that consists of multiple separate servers, as would be the case if you were deploying a production-ready replica set, you would pass a document to the rs.initiate() method that describes the configuration for the new replica set. However, because this guide outlines how to deploy a sharded cluster using a config server and shard servers that each consist of a single node, you don’t need to pass any arguments to this method.
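For reference, initiating a production-style, three-member config server replica set would involve passing a configuration document like the one in this sketch. The hostnames here are hypothetical placeholders, and the _id value must match the replSetName set in mongod.conf:

```js
// mongosh sketch of a multi-member initiation (hostnames are placeholders):
rs.initiate({
  _id: "config",
  configsvr: true,
  members: [
    { _id: 0, host: "mongo-config-1.example.com:27017" },
    { _id: 1, host: "mongo-config-2.example.com:27017" },
    { _id: 2, host: "mongo-config-3.example.com:27017" }
  ]
})
```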

      MongoDB will automatically read the replica set name and its role in a sharded cluster from the running configuration. If this method returns "ok" : 1 in its output, it means the replica set was started successfully:

      Output

      { "info2" : "no configuration specified. Using a default configuration for the set", . . . "ok" : 1, . . . }

      Assuming this is the case, your MongoDB shell prompt will change to indicate that the instance the shell is connected to is now a member of the config replica set:

      config:SECONDARY>

      The first part of this new prompt will be the name of the replica set you configured previously.

      Note that the second part of this example prompt shows that this MongoDB instance is a secondary member of the replica set. This is to be expected, as there is usually a gap between the time when a replica set is initiated and the time when one of its members becomes the primary member.

      If you were to run a command or even just press ENTER after waiting a few moments, the prompt would update to reflect that you’re connected to the replica set’s primary member:

      config:PRIMARY>

      You can verify that the replica set was configured properly by executing the rs.status() method in the MongoDB shell:

      • rs.status()

      This will return a lot of output about the replica set configuration, but a few keys are especially important:

      Output

      { . . . "set" : "config", . . . "configsvr" : true, "ok" : 1, . . . }

      The set key shows the replica set name, which is config in this example. The configsvr key indicates whether it’s a config server replica set in a sharded cluster, in this case showing true. Lastly, the ok flag has a value of 1, meaning the replica set is working correctly.

      In this step, you’ve configured your first replica set for the config servers in sharded clusters. In the next step, you’ll follow through a similar configuration for the two individual shards.

      Step 2 — Configuring Shard Server Replica Sets

      After completing the previous step, you will have a fully configured replica set that can function as the config server for a sharded cluster. In this step, you’ll convert the mongo-shard1 and mongo-shard2 instances into replica sets as well. Rather than setting them up as config servers, though, you will configure them to function as the actual shards within your sharded cluster.

      In order to set this up, you’ll need to make a few changes to both mongo-shard1 and mongo-shard2’s configuration files. Because you’re setting up two separate replica sets, though, each configuration will use different replica set names.

      On both mongo-shard1 and mongo-shard2, open the MongoDB configuration file in your preferred text editor:

      • sudo nano /etc/mongod.conf
      • sudo nano /etc/mongod.conf

      Find the configuration section with lines that read #replication: and #sharding: towards the bottom of the files. Again, these lines will be commented out in both files by default:

      /etc/mongod.conf

      #replication:
      
      #sharding:
      

      In both configuration files, uncomment the #replication: line by removing the pound sign (#). Then, add a replSetName directive below the replication: line, followed by the name MongoDB will use to identify the replica set. These examples use the name shard1 for the replica set on mongo-shard1 and shard2 for the set on mongo-shard2:

      /etc/mongod.conf

      . . .
      replication:
        replSetName: "shard1"
      
      #sharding:
      . . .
      

      /etc/mongod.conf

      . . .
      replication:
        replSetName: "shard2"
      
      #sharding:
      . . .
      

      Then uncomment the #sharding: line and add a clusterRole directive below that line in each configuration file. In both files, set the clusterRole directive value to shardsvr. This tells the respective MongoDB instances that these servers will function as shards.

      /etc/mongod.conf

      . . .
      replication:
        replSetName: "shard1"
      
      sharding:
        clusterRole: shardsvr
      . . .
      

      /etc/mongod.conf

      . . .
      replication:
        replSetName: "shard2"
      
      sharding:
        clusterRole: shardsvr
      . . .
      

      After updating these two sections of the files, save and close the files. Then, restart the mongod service by issuing the following command on both servers:

      • sudo systemctl restart mongod
      • sudo systemctl restart mongod

      With that, you’ve enabled replication for the two shards. As with the config server you set up in the previous step, these replica sets must also be initiated through the MongoDB shell before they can be used. Open the MongoDB shells on both shard servers with the mongo command:

      • mongo
      • mongo

      To reiterate, this guide outlines how to deploy a sharded cluster with a config server and two shard servers, all of which are made up of single-node replica sets. This kind of setup is useful for testing and outlining how sharding works, but it is not suitable for a production environment.

      Because you’re setting up these MongoDB instances to function as single-node replica sets, you can initiate replication on both shard servers by executing the rs.initiate() method without any further arguments:

      • rs.initiate()
      • rs.initiate()

      These will start replication on each MongoDB instance using the default replica set configuration. If these commands return "ok" : 1 in their output, it means the initialization was successful:

      Output

      { "info2" : "no configuration specified. Using a default configuration for the set", . . . "ok" : 1, . . . }

      Output

      { "info2" : "no configuration specified. Using a default configuration for the set", . . . "ok" : 1, . . . }

      As with the config server replica set, each of these shard servers will be elected as a primary member after only a few moments. Although their prompts may at first read SECONDARY>, if you press the ENTER key in the shell after a few moments the prompts will change to confirm that each server is the primary instance of their respective replica set. The prompts on the two shards will differ only in name, with one reading shard1:PRIMARY> and the other shard2:PRIMARY>.

      You can verify that each replica set was configured properly by executing the rs.status() method in both MongoDB shells. First, check whether the mongo-shard1 replica set was set up correctly:

      • rs.status()

      If this method’s output includes "ok" : 1, it means the replica set is functioning properly:

      Output

      { . . . "set" : "shard1", . . . "ok" : 1, . . . }

      Executing the same command on mongo-shard2 will show a different replica set name but otherwise will be nearly identical:

      Output

      { . . . "set" : "shard2", . . . "ok" : 1, . . . }

      With that, you’ve successfully configured both mongo-shard1 and mongo-shard2 as single-node replica sets. At this point, though, neither these two replica sets nor the config server replica set you created in the previous step are aware of each other. In the next step, you’ll run a query router and connect all of them together.

      Step 3 — Running mongos and Adding Shards to the Cluster

      The three replica sets you’ve configured so far, one config server and two individual shards, are currently running but are not yet part of a sharded cluster. To connect these components as parts of a sharded cluster, you’ll need one more tool: a mongos query router. This will be responsible for communicating with the config server and managing the shard servers.

      You’ll use your fourth and final MongoDB server — mongo-router — to run mongos and function as your sharded cluster’s query router. The query router daemon is included as part of the standard MongoDB installation, but is not enabled by default and must be run separately.

      First, connect to the mongo-router server and stop the MongoDB database service from running:

      • sudo systemctl stop mongod

      Because this server will not act as a database itself, disable the mongod service from starting whenever the server boots up:

      • sudo systemctl disable mongod

      Now, run mongos and connect it to the config server replica set with a command like the following:

      • mongos --configdb config/mongo_config_ip:27017

      The first part of this command’s connection string, config, is the name of the config server replica set you defined earlier. Be sure to change this value if you used a different name, and replace mongo_config_ip with the IP address of your mongo-config server.

      By default, mongos runs in the foreground and binds only to the local interface, thereby disallowing remote connections. With no additional security configured apart from firewall settings limiting traffic between all of your servers, this is a sound safety measure.

      Note: In MongoDB, it’s customary to differentiate the ports on which the config server and shard servers run, with 27019 commonly used for config server replica sets and 27018 for shards. To keep things simple, this guide did not change the port that any of the MongoDB instances in this cluster run on. Thus, all replica sets are running on the default port of 27017.
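If you did want to follow that port convention, one way is the net.port option in each server’s /etc/mongod.conf. This is a sketch, not a required step for this guide; if you change the port here, you would also need to use it in the mongos --configdb connection string:

```yaml
# Fragment of /etc/mongod.conf — a sketch of running a config server on the
# conventional port 27019 instead of the default 27017.
net:
  port: 27019
```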

      The previous mongos command will produce a verbose and detailed output in a format similar to system logs. At the beginning, you’ll find a message like this:

      Output

      {"t":{"$date":"2021-11-07T15:58:36.278Z"},"s":"W", "c":"SHARDING", "id":24132, "ctx":"main","msg":"Running a sharded cluster with fewer than 3 config servers should only be done for testing purposes and is not recommended for production."} . . .

      This means the query router connected to the config server replica set correctly and noticed it’s built with only a single node, a configuration not recommended for production environments.

      Note: Although running in the foreground like this is its default behavior, mongos is typically run as a daemon using a process like systemd.

      Running mongos as a system service is beyond the scope of this tutorial, but we encourage you to learn more about using and administering the mongos query router by reading the official documentation.
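As a rough sketch of what that could look like, a systemd unit file for mongos might resemble the following. The paths, user, and connection string here are assumptions you would need to adapt to your own servers:

```ini
# /etc/systemd/system/mongos.service — illustrative sketch only
[Unit]
Description=MongoDB shard cluster query router
After=network.target

[Service]
# Assumes the mongos binary installed by the MongoDB packages and the
# mongodb service user; adjust to match your installation.
User=mongodb
ExecStart=/usr/bin/mongos --configdb config/mongo_config_ip:27017 --bind_ip localhost
Restart=on-failure

[Install]
WantedBy=multi-user.target
```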

      Now you can add the shards you configured previously to the sharded cluster. Because mongos is running in the foreground, open another shell window connected to mongo-router. From this new window, open up the MongoDB shell:

      • mongo

      This command will open the MongoDB shell connected to the local MongoDB instance, which is not a MongoDB server but a running mongos query router. Your prompt will change to indicate this by reading mongos> instead of the MongoDB shell’s usual >.

      You can verify that the query router is connected to the config server by running the sh.status() method:

      • sh.status()

      This command returns the current status of the sharded cluster. At this point, it will show an empty list of connected shards in the shards key:

      Output

      --- Sharding Status ---
        sharding version: {
          "_id" : 1,
          "minCompatibleVersion" : 5,
          "currentVersion" : 6,
          "clusterId" : ObjectId("6187ea2e3d82d39f10f37ea7")
        }
        shards:
        active mongoses:
        autosplit:
          Currently enabled: yes
        balancer:
          Currently enabled: yes
          Currently running: no
          Failed balancer rounds in last 5 attempts: 0
          Migration Results for the last 24 hours:
            No recent migrations
        databases:
          { "_id" : "config", "primary" : "config", "partitioned" : true }
      . . .

      To add the first shard to the cluster, execute the following command. In this example, shard1 is the replica set name of the first shard, and mongo_shard1_ip is the IP address of the server on which that shard, mongo-shard1, is running:

      • sh.addShard("shard1/mongo_shard1_ip:27017")

      This command will return a success message:

      Output

      {
        "shardAdded" : "shard1",
        "ok" : 1,
        "operationTime" : Timestamp(1636301581, 6),
        "$clusterTime" : {
          "clusterTime" : Timestamp(1636301581, 6),
          "signature" : {
            "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
            "keyId" : NumberLong(0)
          }
        }
      }

      Follow that by adding the second shard:

      • sh.addShard("shard2/mongo_shard2_ip:27017")

      Notice that not only is the IP address in this command different, but the replica set name is as well. The command will return a success message:

      Output

      {
        "shardAdded" : "shard2",
        "ok" : 1,
        "operationTime" : Timestamp(1639724738, 6),
        "$clusterTime" : {
          "clusterTime" : Timestamp(1639724738, 6),
          "signature" : {
            "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
            "keyId" : NumberLong(0)
          }
        }
      }

      You can check that both shards have been properly added by issuing the sh.status() command again:

      • sh.status()

      Output

      --- Sharding Status ---
        sharding version: {
          "_id" : 1,
          "minCompatibleVersion" : 5,
          "currentVersion" : 6,
          "clusterId" : ObjectId("6187ea2e3d82d39f10f37ea7")
        }
        shards:
          { "_id" : "shard1", "host" : "shard1/mongo_shard1_ip:27017", "state" : 1 }
          { "_id" : "shard2", "host" : "shard2/mongo_shard2_ip:27017", "state" : 1 }
        active mongoses:
          "4.4.10" : 1
        autosplit:
          Currently enabled: yes
        balancer:
          Currently enabled: yes
          Currently running: no
          Failed balancer rounds in last 5 attempts: 0
          Migration Results for the last 24 hours:
            No recent migrations
        databases:
          { "_id" : "config", "primary" : "config", "partitioned" : true }
      . . .

      With that, you have a fully working sharded cluster consisting of two up and running shards. In the next step, you’ll create a new database, enable sharding for the database, and begin partitioning data in a collection.

      Step 4 — Partitioning Collection Data

      One shard within every sharded MongoDB cluster will be elected to be the cluster’s primary shard. In addition to the partitioned data stored across every shard in the cluster, the primary shard is also responsible for storing any non-partitioned data.

      At this point, you can freely use the mongos query router to work with databases, documents, and collections just like you would with a typical unsharded database. However, with no further setup, any data you add to the cluster will end up being stored only on the primary shard; it will not be automatically partitioned, and you won’t experience any of the benefits sharding provides.

      In order to use your sharded MongoDB cluster to its fullest potential, you must enable sharding for a database within the cluster. A MongoDB database can only contain sharded collections if it has sharding enabled.

      To better understand MongoDB’s behavior as it partitions data, you’ll need a set of documents you can work with. This guide will use a collection of documents representing a few of the most populated cities in the world. As an example, the following sample document represents Tokyo:

      The Tokyo document

      {
          "name": "Tokyo",
          "country": "Japan",
          "continent": "Asia",
          "population": 37.400
      }
      

      You’ll store these documents in a database called populations and a collection called cities.

      You can enable sharding for a database before you explicitly create it. To enable sharding for the populations database, run the following enableSharding() method:

      • sh.enableSharding("populations")

      The command will return a success message:

      Output

      { . . . "ok" : 1, . . . }

      Now that the database is configured to allow partitioning, you can enable partitioning for the cities collection.

      MongoDB provides two ways to shard collections and determine which documents will be stored on which shard: ranged sharding and hashed sharding. This guide focuses on how to implement hashed sharding, in which MongoDB maintains an automated hashed index on a field that has been selected to be the cluster’s shard key. This helps to achieve an even distribution of documents. If you’d like to learn about ranged sharding in MongoDB, please refer to the official documentation.
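To illustrate the idea behind hashed sharding, here is a conceptual JavaScript sketch. MongoDB’s real hashed index uses its own internal hash function; hashKey below is only an illustrative stand-in, and the shard names mirror this guide’s cluster:

```javascript
// Conceptual sketch of hashed sharding (not MongoDB's actual implementation):
// each document's shard key value is hashed, and the hash determines which
// shard owns the document.
function hashKey(value) {
  let h = 2166136261; // FNV-1a-style hash, used here for illustration only
  for (const ch of String(value)) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h;
}

// Map a document to one of `shardCount` shards based on its shard key field.
function assignShard(doc, shardKey, shardCount) {
  return "shard" + ((hashKey(doc[shardKey]) % shardCount) + 1);
}

const docs = [
  { name: "Tokyo", country: "Japan" },
  { name: "Osaka", country: "Japan" },
  { name: "Delhi", country: "India" },
];

// Documents sharing a shard key value always land on the same shard:
console.log(assignShard(docs[0], "country", 2) === assignShard(docs[1], "country", 2)); // true
```

Because the hash is deterministic, all documents with the same shard key value are stored together, while distinct values spread roughly evenly across shards.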

      When implementing a hash-based sharding strategy, it’s the responsibility of the database administrator to choose an appropriate shard key. A poorly chosen shard key has the potential to negate many of the benefits one might gain from sharding.

      In MongoDB, a document field that would function well as a shard key should follow these principles:

      • The chosen field should be of high cardinality, meaning that it can have many possible values. Every document added to the collection will always end up being stored on a single shard, so if the field chosen as the shard key will have only a few possible values, adding more shards to the cluster will not benefit performance. Considering the example populations database, the continent field would not be a good shard key since it can only contain a few possible values.
      • The shard key should have a low frequency of duplicate values. If the majority of documents share duplicate values for the field used as the shard key, it’s likely that some shards will be used to store more data than others. The more even the distribution of values in the sharded key across the entire collection, the better.
      • The shard key should facilitate queries. For example, a field that’s frequently used as a query filter would be a good choice for a shard key. In a sharded cluster, the query router uses a single shard to return a query result only if the query contains the shard key. Otherwise, the query will be broadcast to all shards for evaluation, even though the returned documents will come from a single shard. Thus, the population field would not be the best key, as it’s unlikely the majority of queries would involve filtering by the exact population value.

      For the example data used in this guide, the country field would be a good choice for the cluster’s shard key, since it has the highest cardinality of all the fields that are likely to be used frequently in filter queries.
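One quick, informal way to compare candidate fields is to count distinct values in a sample of your documents. The sketch below does this in plain JavaScript over a subset of this guide’s sample data:

```javascript
// Informal shard key screening: compare the number of distinct values
// (cardinality) each candidate field has in a sample of the data.
const sample = [
  { name: "Seoul", country: "South Korea", continent: "Asia" },
  { name: "Mumbai", country: "India", continent: "Asia" },
  { name: "Lagos", country: "Nigeria", continent: "Africa" },
  { name: "Tokyo", country: "Japan", continent: "Asia" },
  { name: "Osaka", country: "Japan", continent: "Asia" },
  { name: "New York", country: "United States", continent: "North America" },
];

// Count distinct values of a field across the documents.
function cardinality(docs, field) {
  return new Set(docs.map((d) => d[field])).size;
}

console.log(cardinality(sample, "continent")); // 3 -> poor shard key
console.log(cardinality(sample, "country"));   // 5 -> better
console.log(cardinality(sample, "name"));      // 6 -> highest, but rarely filtered on
```

Against a live deployment, you could get the same kind of figure in the MongoDB shell with db.cities.distinct("country").length.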

      Partition the cities collection — which hasn’t yet been created — with the country field as its shard key by running the following shardCollection() method:

      • sh.shardCollection("populations.cities", { "country": "hashed" })

      The first part of this command refers to the cities collection in the populations database, while the second part selects country as the shard key using the hashed partition method.

      The command will return a success message:

      Output

      { "collectionsharded" : "populations.cities", "collectionUUID" : UUID("03823afb-923b-4cd0-8923-75540f33f07d"), "ok" : 1, . . . }

      Now you can insert some sample documents into the sharded cluster. First, switch to the populations database:

      • use populations

      Then insert 20 sample documents with the following insertMany command:

      • db.cities.insertMany([
      • {"name": "Seoul", "country": "South Korea", "continent": "Asia", "population": 25.674 },
      • {"name": "Mumbai", "country": "India", "continent": "Asia", "population": 19.980 },
      • {"name": "Lagos", "country": "Nigeria", "continent": "Africa", "population": 13.463 },
      • {"name": "Beijing", "country": "China", "continent": "Asia", "population": 19.618 },
      • {"name": "Shanghai", "country": "China", "continent": "Asia", "population": 25.582 },
      • {"name": "Osaka", "country": "Japan", "continent": "Asia", "population": 19.281 },
      • {"name": "Cairo", "country": "Egypt", "continent": "Africa", "population": 20.076 },
      • {"name": "Tokyo", "country": "Japan", "continent": "Asia", "population": 37.400 },
      • {"name": "Karachi", "country": "Pakistan", "continent": "Asia", "population": 15.400 },
      • {"name": "Dhaka", "country": "Bangladesh", "continent": "Asia", "population": 19.578 },
      • {"name": "Rio de Janeiro", "country": "Brazil", "continent": "South America", "population": 13.293 },
      • {"name": "São Paulo", "country": "Brazil", "continent": "South America", "population": 21.650 },
      • {"name": "Mexico City", "country": "Mexico", "continent": "North America", "population": 21.581 },
      • {"name": "Delhi", "country": "India", "continent": "Asia", "population": 28.514 },
      • {"name": "Buenos Aires", "country": "Argentina", "continent": "South America", "population": 14.967 },
      • {"name": "Kolkata", "country": "India", "continent": "Asia", "population": 14.681 },
      • {"name": "New York", "country": "United States", "continent": "North America", "population": 18.819 },
      • {"name": "Manila", "country": "Philippines", "continent": "Asia", "population": 13.482 },
      • {"name": "Chongqing", "country": "China", "continent": "Asia", "population": 14.838 },
      • {"name": "Istanbul", "country": "Turkey", "continent": "Europe", "population": 14.751 }
      • ])

      The output will be similar to the typical MongoDB output since, from the user’s perspective, the sharded cluster behaves like a normal MongoDB database:

      Output

{
   "acknowledged" : true,
   "insertedIds" : [
      ObjectId("61880330754a281b83525a9b"),
      ObjectId("61880330754a281b83525a9c"),
      ObjectId("61880330754a281b83525a9d"),
      ObjectId("61880330754a281b83525a9e"),
      ObjectId("61880330754a281b83525a9f"),
      ObjectId("61880330754a281b83525aa0"),
      ObjectId("61880330754a281b83525aa1"),
      ObjectId("61880330754a281b83525aa2"),
      ObjectId("61880330754a281b83525aa3"),
      ObjectId("61880330754a281b83525aa4"),
      ObjectId("61880330754a281b83525aa5"),
      ObjectId("61880330754a281b83525aa6"),
      ObjectId("61880330754a281b83525aa7"),
      ObjectId("61880330754a281b83525aa8"),
      ObjectId("61880330754a281b83525aa9"),
      ObjectId("61880330754a281b83525aaa"),
      ObjectId("61880330754a281b83525aab"),
      ObjectId("61880330754a281b83525aac"),
      ObjectId("61880330754a281b83525aad"),
      ObjectId("61880330754a281b83525aae")
   ]
}

      Under the hood, however, MongoDB distributed the documents across the sharded nodes.

      You can access information about how the data was distributed across your shards with the getShardDistribution() method:

      • db.cities.getShardDistribution()

      This method’s output provides statistics for every shard that is part of the cluster:

      Output

Shard shard2 at shard2/mongo_shard2_ip:27017
 data : 943B docs : 9 chunks : 2
 estimated data per chunk : 471B
 estimated docs per chunk : 4

Shard shard1 at shard1/mongo_shard1_ip:27017
 data : 1KiB docs : 11 chunks : 2
 estimated data per chunk : 567B
 estimated docs per chunk : 5

Totals
 data : 2KiB docs : 20 chunks : 4
 Shard shard2 contains 45.4% data, 45% docs in cluster, avg obj size on shard : 104B
 Shard shard1 contains 54.59% data, 55% docs in cluster, avg obj size on shard : 103B

      This output indicates that the automated hashing strategy on the country field resulted in a mostly even distribution across two shards.
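In addition to per-collection statistics, you can inspect the cluster as a whole, including each sharded collection's shard key and which chunk ranges landed on which shard. A minimal sketch, assuming you are connected to the mongos instance (the exact output depends on your cluster configuration):

```javascript
// Run from the mongos shell: prints the sharding status of the
// whole cluster, including shard membership, sharded databases,
// and the chunk distribution for each sharded collection.
sh.status()
```

This is a convenient complement to getShardDistribution() when you want a cluster-wide view rather than statistics for a single collection.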

      You have now configured a fully working sharded cluster and inserted data that has been automatically partitioned across multiple shards. In the next step, you’ll learn how to monitor shard usage when executing queries.

      Step 5 — Analyzing Shard Usage

      Sharding is used to scale the performance of the database system and, as such, works best if it’s used efficiently to support database queries. If most of your queries to the database need to scan every shard in the cluster in order to be executed, any benefits of sharding would be lost in the system’s increased complexity.

      This step focuses on verifying whether a query is optimized to only use a single shard or if it spans multiple shards to retrieve a result.

Start by selecting every document in the cities collection. Since this query retrieves all the documents, every shard must be used to satisfy it:

      • db.cities.find()

      The query will, unsurprisingly, return all the cities. Re-run the query, this time with the explain() method attached to the end of it:

      • db.cities.find().explain()

      The long output will provide details about how the query was executed:

      Output

{
   "queryPlanner" : {
      "mongosPlannerVersion" : 1,
      "winningPlan" : {
         "stage" : "SHARD_MERGE",
         "shards" : [
            {
               "shardName" : "shard1",
               . . .
            },
            {
               "shardName" : "shard2",
               . . .
            }
         ]
      }
   },
   . . .

      Notice that the winning plan refers to the SHARD_MERGE strategy, which means that multiple shards were used to resolve the query. In the shards key, MongoDB returns the list of shards taking part in the evaluation. In this case, this list includes both shards of the cluster.
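If you only care about the chosen strategy rather than the full plan document, you can read the stage directly from the explain output. A small mongosh sketch, assuming the cities collection from this tutorial:

```javascript
// Extract just the winning plan's stage from the explain output.
// On a sharded cluster this is "SHARD_MERGE" for a scatter-gather
// query, or "SINGLE_SHARD" for a query targeted by the shard key.
const plan = db.cities.find().explain().queryPlanner.winningPlan;
print(plan.stage);
```

This makes it easy to spot-check a set of candidate queries without scrolling through the full explain output each time.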

      Now test whether the result will be any different if you query against the continent field, which is not the chosen shard key:

      • db.cities.find({"continent": "Europe"}).explain()

      This time, MongoDB also had to use both shards to satisfy the query. The database had no way to know which shard contains documents for European cities:

      Output

{
   "queryPlanner" : {
      "mongosPlannerVersion" : 1,
      "winningPlan" : {
         "stage" : "SHARD_MERGE",
         . . .
      }
   },
   . . .

      The result should be different when filtering against the shard key. Try filtering cities only from Japan using the country field, which you previously selected as the shard key:

      • db.cities.find({"country": "Japan"}).explain()

      Output

{
   "queryPlanner" : {
      "mongosPlannerVersion" : 1,
      "winningPlan" : {
         "stage" : "SINGLE_SHARD",
         "shards" : [
            {
               "shardName" : "shard1",
               . . .
            }
   . . .

      This time, MongoDB used a different query strategy: SINGLE_SHARD instead of SHARD_MERGE. This means that only a single shard was needed to satisfy the query. In the shards key, only a single shard will be mentioned. In this example, documents for Japan were stored on the first shard in the cluster.

By using the explain feature on the query cursor, you can check whether a query spans one shard or many. This, in turn, helps you determine whether a query risks overloading the cluster by reaching out to every shard at once. You can use this method, alongside rules of thumb for shard key selection, to choose the shard key that will yield the most performance gains.
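Beyond the default query-planner output, explain() accepts a verbosity mode. For example, requesting executionStats reports how many documents the query actually examined, which is useful when comparing candidate shard keys. A sketch assuming the same cities collection:

```javascript
// Request execution statistics in addition to the query plan.
// totalDocsExamined shows how much scanning work the query did
// across all of the shards that took part in it.
const stats = db.cities.find({"country": "Japan"}).explain("executionStats");
print(stats.executionStats.nReturned);
print(stats.executionStats.totalDocsExamined);
```

A well-targeted query on the shard key should examine a number of documents close to the number it returns; a large gap between the two values suggests the query is scanning shards unnecessarily.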

      Conclusion

Sharding is widely used as a strategy to improve the performance and scalability of large data clusters. When paired with replication, it can also improve availability and data security. Sharding is MongoDB’s core means of horizontal scaling: you extend the cluster’s performance by adding more nodes rather than migrating databases to bigger, more powerful servers.

By completing this tutorial, you’ve learned how sharding works in MongoDB, how to configure config servers and individual shards, and how to connect them together to form a sharded cluster. You’ve used the mongos query router to administer the cluster, partition data, execute queries against the database, and monitor sharding metrics.

      This strategy comes with many benefits, but also with administrative challenges such as having to manage multiple replica sets and more complex security considerations. To learn more about sharding and running a sharded cluster outside the development environment, we encourage you to check out the official MongoDB documentation on the subject. Otherwise, we encourage you to check out the other tutorials in our series on How To Manage Data with MongoDB.


