      How To Test Your Data With Great Expectations


      The author selected the Diversity in Tech Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      In this tutorial, you will set up a local deployment of Great Expectations, an open source data validation and documentation library written in Python. Data validation is crucial to ensuring that the data you process in your pipelines is correct and free of any data quality issues that might occur due to errors such as incorrect inputs or transformation bugs. Great Expectations allows you to establish assertions about your data called Expectations, and validate any data using those Expectations.

      When you’re finished, you’ll be able to connect Great Expectations to your data, create a suite of Expectations, validate a batch of data using those Expectations, and generate a data quality report with the results of your validation.

      Prerequisites

To complete this tutorial, you will need:

      • A local Python 3 environment with a virtual environment you can activate, since you will install the Great Expectations package with pip.
      • Git installed on your machine, which you will use to clone the sample data repository.

      Step 1 — Installing Great Expectations and Initializing a Great Expectations Project

      In this step, you will install the Great Expectations package in your local Python environment, download the sample data you’ll use in this tutorial, and initialize a Great Expectations project.

      To begin, open a terminal and make sure to activate your virtual Python environment. Install the Great Expectations Python package and command-line tool (CLI) with the following command:

      • pip install great_expectations==0.13.35

      Note: This tutorial was developed for Great Expectations version 0.13.35 and may not be applicable to other versions.
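
      If you want to confirm which version ended up in your environment, you can inspect the installed package with pip:

      • pip show great_expectations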

      In order to have access to the example data repository, run the following git command to clone the directory and change into it as your working directory:

      • git clone https://github.com/do-community/great_expectations_tutorial
      • cd great_expectations_tutorial

The repository only contains one folder called data, which contains two example CSV files with data that you will use in this tutorial. Take a look at the contents of the data directory:

      • ls data

      You’ll see the following output:

      Output

      yellow_tripdata_sample_2019-01.csv yellow_tripdata_sample_2019-02.csv

      Great Expectations works with many different types of data, such as connections to relational databases, Spark dataframes, and various file formats. For the purpose of this tutorial, you will use these CSV files containing a small set of taxi ride data to get started.
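
      Because Great Expectations installs pandas as a dependency, you can optionally peek at the raw data yourself before configuring anything. This short sketch assumes only the file paths shown above:

      import pandas as pd

      # Load the January sample file and inspect its dimensions and first rows
      df = pd.read_csv("data/yellow_tripdata_sample_2019-01.csv")
      print(df.shape)   # (number of rows, number of columns)
      print(df.head())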

      Finally, initialize your directory as a Great Expectations project by running the following command. Make sure to use the --v3-api flag, as this will switch you to using the most recent API of the package:

      • great_expectations --v3-api init

      When asked OK to proceed? [Y/n]:, press ENTER to proceed.

This will create a folder called great_expectations, which contains the basic configuration for your Great Expectations project, also called the Data Context. You can inspect the contents of the folder:

      • ls great_expectations

      You will see the first level of files and subdirectories that were created inside the great_expectations folder:

      Output

      checkpoints great_expectations.yml plugins expectations notebooks uncommitted

      The folders store all the relevant content for your Great Expectations setup. The great_expectations.yml file contains all important configuration information. Feel free to explore the folders and configuration file a little more before moving on to the next step in the tutorial.
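
      Once initialized, the Data Context can also be loaded from Python code. As a minimal sketch (run from the project root, where the great_expectations folder lives):

      import great_expectations as ge

      # Load the Data Context created by the init command
      context = ge.get_context()

      # No Datasources are configured yet, so this prints an empty list
      print(context.list_datasources())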

      In the next step, you will add a Datasource to point Great Expectations at your data.

      Step 2 — Adding a Datasource

      In this step, you will configure a Datasource in Great Expectations, which allows you to automatically create data assertions called Expectations as well as validate data with the tool.

      While in your project directory, run the following command:

      • great_expectations --v3-api datasource new

      You will see the following output. Enter the options shown when prompted to configure a file-based Datasource for the data directory:

      Output

What data would you like Great Expectations to connect to?
          1. Files on a filesystem (for processing with Pandas or Spark)
          2. Relational database (SQL)
      : 1

      What are you processing your files with?
          1. Pandas
          2. PySpark
      : 1

      Enter the path of the root directory where the data files are stored. If files are on local disk enter a path relative to your current working directory or an absolute path.
      : data

      After confirming the directory path with ENTER, Great Expectations will open a Jupyter notebook in your web browser, which allows you to complete the configuration of the Datasource and store it to your Data Context. The following screenshot shows the first few cells of the notebook.

      Screenshot of a Jupyter notebook

      The notebook contains several pre-populated cells of Python code to configure your Datasource. You can modify the settings for the Datasource, such as the name, if you like. However, for the purpose of this tutorial, you’ll leave everything as-is and execute all cells using the Cell > Run All menu option. If run successfully, the last cell output will look as follows:

      Output

      [{'data_connectors': {'default_inferred_data_connector_name': {'module_name': 'great_expectations.datasource.data_connector', 'base_directory': '../data', 'class_name': 'InferredAssetFilesystemDataConnector', 'default_regex': {'group_names': ['data_asset_name'], 'pattern': '(.*)'}}, 'default_runtime_data_connector_name': {'module_name': 'great_expectations.datasource.data_connector', 'class_name': 'RuntimeDataConnector', 'batch_identifiers': ['default_identifier_name']}}, 'module_name': 'great_expectations.datasource', 'class_name': 'Datasource', 'execution_engine': {'module_name': 'great_expectations.execution_engine', 'class_name': 'PandasExecutionEngine'}, 'name': 'my_datasource'}]

      This shows that you have added a new Datasource called my_datasource to your Data Context. Feel free to read through the instructions in the notebook to learn more about the different configuration options before moving on to the next step.
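
      For reference, the heart of what this notebook executed is a YAML Datasource configuration that Great Expectations first tests and then saves into your Data Context. The following is a condensed sketch of that flow, assuming the default names from the CLI workflow (the actual notebook also configures a second, runtime data connector):

      import great_expectations as ge

      context = ge.get_context()

      datasource_yaml = """
      name: my_datasource
      class_name: Datasource
      module_name: great_expectations.datasource
      execution_engine:
        module_name: great_expectations.execution_engine
        class_name: PandasExecutionEngine
      data_connectors:
        default_inferred_data_connector_name:
          class_name: InferredAssetFilesystemDataConnector
          base_directory: ../data
          default_regex:
            group_names:
              - data_asset_name
            pattern: (.*)
      """

      # test_yaml_config parses the configuration and reports the data assets
      # it can see without saving anything; the notebook then saves the
      # Datasource into great_expectations.yml for you
      context.test_yaml_config(yaml_config=datasource_yaml)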

      Warning: Before moving forward, close the browser tab with the notebook, return to your terminal, and press CTRL+C to shut down the running notebook server before proceeding.

      You have now successfully set up a Datasource that points at the data directory, which will allow you to access the CSV files in the directory through Great Expectations. In the next step, you will use one of these CSV files in your Datasource to automatically generate Expectations with a profiler.

      Step 3 — Creating an Expectation Suite With an Automated Profiler

      In this step of the tutorial, you will use the built-in Profiler to create a set of Expectations based on some existing data. For this purpose, let’s take a closer look at the sample data that you downloaded:

      • The files yellow_tripdata_sample_2019-01.csv and yellow_tripdata_sample_2019-02.csv contain taxi ride data from January and February 2019, respectively.
• This tutorial assumes that you know the January data is correct, and that you want to ensure that any subsequent data files match the January data in terms of the number of rows, the columns, and the distributions of certain column values.

      For this purpose, you will create Expectations (data assertions) based on certain properties of the January data and then, in a later step, use those Expectations to validate the February data. Let’s get started by creating an Expectation Suite, which is a set of Expectations that are grouped together:

      • great_expectations --v3-api suite new

      By selecting the options shown in the output below, you specify that you would like to use a profiler to generate Expectations automatically, using the yellow_tripdata_sample_2019-01.csv data file as an input. Enter the name my_suite as the Expectation Suite name when prompted and press ENTER at the end when asked Would you like to proceed? [Y/n]:

      Output

Using v3 (Batch Request) API

      How would you like to create your Expectation Suite?
          1. Manually, without interacting with a sample batch of data (default)
          2. Interactively, with a sample batch of data
          3. Automatically, using a profiler
      : 3

      A batch of data is required to edit the suite - let's help you to specify it.

      Which data asset (accessible by data connector "my_datasource_example_data_connector") would you like to use?
          1. yellow_tripdata_sample_2019-01.csv
          2. yellow_tripdata_sample_2019-02.csv
      : 1

      Name the new Expectation Suite [yellow_tripdata_sample_2019-01.csv.warning]: my_suite

      When you run this notebook, Great Expectations will store these expectations in a new Expectation Suite "my_suite" here:

        <path_to_project>/great_expectations_tutorial/great_expectations/expectations/my_suite.json

      Would you like to proceed? [Y/n]: <press ENTER>

      This will open another Jupyter notebook that lets you complete the configuration of your Expectation Suite. The notebook contains a fair amount of code to configure the built-in profiler, which looks at the CSV file you selected and creates certain types of Expectations for each column in the file based on what it finds in the data.

      Scroll down to the second code cell in the notebook, which contains a list of ignored_columns. By default, the profiler will ignore all columns, so let’s comment out some of them to make sure the profiler creates Expectations for them. Modify the code so it looks like this:

      ignored_columns = [
      #     "vendor_id"
      # ,    "pickup_datetime"
      # ,    "dropoff_datetime"
      # ,    "passenger_count"
          "trip_distance"
      ,    "rate_code_id"
      ,    "store_and_fwd_flag"
      ,    "pickup_location_id"
      ,    "dropoff_location_id"
      ,    "payment_type"
      ,    "fare_amount"
      ,    "extra"
      ,    "mta_tax"
      ,    "tip_amount"
      ,    "tolls_amount"
      ,    "improvement_surcharge"
      ,    "total_amount"
      ,    "congestion_surcharge"
      ,]
      

      Make sure to remove the comma before "trip_distance". By commenting out the columns vendor_id, pickup_datetime, dropoff_datetime, and passenger_count, you are telling the profiler to generate Expectations for those columns. In addition, the profiler will also generate table-level Expectations, such as the number and names of columns in your data, and the number of rows. Once again, execute all cells in the notebook by using the Cell > Run All menu option.
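
      Under the hood, the key cell in this notebook feeds a Validator for the January batch into the built-in UserConfigurableProfiler. The following is a condensed sketch of that logic, assuming the Datasource and suite names used in this tutorial (the generated notebook contains additional options):

      import great_expectations as ge
      from great_expectations.core.batch import BatchRequest
      from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

      context = ge.get_context()

      # Create (or overwrite) the suite and request the January file as a batch
      context.create_expectation_suite("my_suite", overwrite_existing=True)
      batch_request = BatchRequest(
          datasource_name="my_datasource",
          data_connector_name="default_inferred_data_connector_name",
          data_asset_name="yellow_tripdata_sample_2019-01.csv",
      )
      validator = context.get_validator(
          batch_request=batch_request,
          expectation_suite_name="my_suite",
      )

      # Shortened here for brevity; use the full list you edited above
      ignored_columns = ["trip_distance", "rate_code_id", "store_and_fwd_flag"]

      # The profiler inspects the batch and generates Expectations for every
      # column not listed in ignored_columns, plus table-level Expectations
      profiler = UserConfigurableProfiler(
          profile_dataset=validator,
          ignored_columns=ignored_columns,
      )
      suite = profiler.build_suite()

      # Persist the generated Expectations to expectations/my_suite.json
      validator.save_expectation_suite(discard_failed_expectations=False)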

      When executing all cells in this notebook, two things happen:

      1. The code creates an Expectation Suite using the automated profiler and the yellow_tripdata_sample_2019-01.csv file you told it to use.
      2. The last cell in the notebook is also configured to run validation and open a new browser window with Data Docs, which is a data quality report.

      In the next step, you will take a closer look at the Data Docs that were opened in the new browser window.

      Step 4 — Exploring Data Docs

      In this step of the tutorial, you will inspect the Data Docs that Great Expectations generated and learn how to interpret the different pieces of information. Go to the browser window that just opened and take a look at the page, shown in the screenshot below.

      Screenshot of Data Docs

      At the top of the page, you will see a box titled Overview, which contains some information about the validation you just ran using your newly created Expectation Suite my_suite. It will tell you Status: Succeeded and show some basic statistics about how many Expectations were run. If you scroll further down, you will see a section titled Table-Level Expectations. It contains two rows of Expectations, showing the Status, Expectation, and Observed Value for each row. Below the table Expectations, you will see the column-level Expectations for each of the columns you commented out in the notebook.

Let’s focus on one specific Expectation: the passenger_count column has an Expectation stating “values must belong to this set: 1 2 3 4 5 6”, which is marked with a green checkmark and has an Observed Value of “0% unexpected”. This tells you that the profiler looked at the values in the passenger_count column in the January CSV file and detected only the values 1 through 6, meaning that all taxi rides had between 1 and 6 passengers. Great Expectations then created an Expectation for this fact. The last cell in the notebook then triggered validation of the January CSV file, which found no unexpected values. This is trivially true, since the same data that was used to create the Expectation was also used for validation.
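
      In code, this assertion corresponds to the expect_column_values_to_be_in_set method. If you were writing it by hand against a validator in the v3 API (rather than letting the profiler generate it), it would look roughly like this:

      # Hand-written equivalent of the profiler-generated Expectation;
      # `validator` is a Validator for a batch of the taxi data, as in Step 3
      validator.expect_column_values_to_be_in_set(
          column="passenger_count",
          value_set=[1, 2, 3, 4, 5, 6],
      )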

In this step, you reviewed the Data Docs and examined the Expectation on the passenger_count column. In the next step, you’ll see how you can validate a different batch of data.

      Step 5 — Creating a Checkpoint and Running Validation

      In the final step of this tutorial, you will create a new Checkpoint, which bundles an Expectation Suite and a batch of data to execute validation of that data. After creating the Checkpoint, you will then run it to validate the February taxi data CSV file and see whether the file passed the Expectations you previously created. To begin, return to your terminal and stop the Jupyter notebook by pressing CTRL+C if it is still running. The following command will start the workflow to create a new Checkpoint called my_checkpoint:

      • great_expectations --v3-api checkpoint new my_checkpoint

      This will open a Jupyter notebook with some pre-populated code to configure the Checkpoint. The second code cell in the notebook will have a random data_asset_name pre-populated from your existing Datasource, which will be one of the two CSV files in the data directory you’ve seen earlier. Ensure that the data_asset_name is yellow_tripdata_sample_2019-02.csv and modify the code if needed to use the correct filename.

my_checkpoint_name = "my_checkpoint" # This was populated from your CLI command.

      yaml_config = f"""
      name: {my_checkpoint_name}
      config_version: 1.0
      class_name: SimpleCheckpoint
      run_name_template: "%Y%m%d-%H%M%S-my-run-name-template"
      validations:
        - batch_request:
            datasource_name: my_datasource
            data_connector_name: default_inferred_data_connector_name
            data_asset_name: yellow_tripdata_sample_2019-02.csv
            data_connector_query:
              index: -1
          expectation_suite_name: my_suite
      """
      print(yaml_config)
      

This configuration creates a new Checkpoint that reads the data asset yellow_tripdata_sample_2019-02.csv, i.e., your February CSV file, and validates it using the Expectation Suite my_suite. Confirm that you modified the code correctly, then execute all cells in the notebook. This will save the new Checkpoint to your Data Context.

Finally, in order to run this new Checkpoint and validate the February data, scroll down to the last cell in the notebook. Uncomment the code in the cell so that it looks as follows:

      context.run_checkpoint(checkpoint_name=my_checkpoint_name)
      context.open_data_docs()
      

      Select the cell and run it using the Cell > Run Cells menu option or the SHIFT+ENTER keyboard shortcut. This will open Data Docs in a new browser tab.

On the Validation Results overview page, click on the topmost run to navigate to the Validation Result details page. This page will look very similar to the one you saw in the previous step, but it will now show that the Expectation Suite failed when validating the new CSV file. Scroll through the page to see which Expectations have a red X next to them, marking them as failed.

Find the Expectation on the passenger_count column you looked at in the previous step: “values must belong to this set: 1 2 3 4 5 6”. You will notice that it now shows up as failed, highlighting that 1579 unexpected values were found, approximately 15.79% of the 10000 total rows. The row also displays a sample of the unexpected values that were found in the column, namely the value 0. This means that the February taxi ride data suddenly introduced the unexpected value 0 in the passenger_count column, which looks like a potential data bug. By running the Checkpoint, you validated the new data with your Expectation Suite and detected this issue.

      Note that each time you execute the run_checkpoint method in the last notebook cell, you kick off another validation run. In a production data pipeline environment, you would call the run_checkpoint command outside of a notebook whenever you’re processing a new batch of data to ensure that the new data passes all validations.
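
      A minimal standalone script for such a pipeline step could look like the following sketch, which loads the Data Context, runs the Checkpoint from this tutorial, and exits with a non-zero status code when validation fails (the failure handling here is illustrative):

      import sys

      import great_expectations as ge

      # Load the project's Data Context and run the Checkpoint from Step 5
      context = ge.get_context()
      result = context.run_checkpoint(checkpoint_name="my_checkpoint")

      # Fail the pipeline step if any Expectation did not pass
      if not result["success"]:
          print("Data validation failed")
          sys.exit(1)
      print("Data validation succeeded")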

      Conclusion

In this article, you created your first local deployment of the Great Expectations framework for data validation. You initialized a Great Expectations Data Context, created a new file-based Datasource, and automatically generated an Expectation Suite using the built-in profiler. You then created a Checkpoint to run validation against a new batch of data and inspected the Data Docs to view the validation results.

This tutorial only taught you the basics of Great Expectations. The package contains more options for configuring Datasources to connect to other types of data, for example relational databases. It also comes with a powerful mechanism to automatically recognize new batches of data based on pattern-matching in the table name or filename, which allows you to configure a Checkpoint once and validate any future data inputs. You can learn more about Great Expectations in the official documentation.




      How To Populate a Database with Sample Data using Laravel Seeders and Eloquent Models



      Part of the Series:
      A Practical Introduction to Laravel Eloquent ORM

      Eloquent is an object relational mapper (ORM) that is included by default within the Laravel framework. In this project-based series, you’ll learn how to make database queries and how to work with relationships in Laravel Eloquent. To practice the examples explained throughout the series, you’ll improve a demo application with new models and relationships.

Laravel’s Seeders are special classes, stored in the database/seeders directory of a Laravel project, that allow you to programmatically insert a collection of default or sample records into the database. The demo application has a seeder class that imports links from a links.yml file in the root of the application folder.

      In your code editor, open the following file:

      database/seeders/LinkSeeder.php
      

      It will contain the following code:

      database/seeders/LinkSeeder.php

<?php

      namespace Database\Seeders;

      use App\Models\Link;
      use Illuminate\Database\Seeder;
      use Illuminate\Support\Facades\DB;
      use Symfony\Component\Yaml\Yaml;

      class LinkSeeder extends Seeder
      {
          /**
           * Run the database seeds.
           *
           * @return void
           */
          public function run()
          {
              //only import seeds if DB is empty.
              if (!Link::count()) {
                  $this->importLinks();
              }
          }

          /**
           * Imports Links from the default links.yml file at the root of the app.
           * Change that file to import a set of personal basic links you want to show
           * as soon as the application is deployed.
           */
          public function importLinks()
          {
              $links_import_path = __DIR__ . '/../../links.yml';

              $yaml = new Yaml();
              if (is_file($links_import_path)) {
                  $links = $yaml->parseFile($links_import_path);

                  foreach ($links as $link) {
                      DB::table('links')->insert([
                          'url' => $link['url'],
                          'description' => $link['description']
                      ]);
                  }
              }
          }
      }
      
      

      Notice that this code does not use the Link model and instead uses the query builder to insert new links in the database. This is a different way of working with database records in Laravel that doesn’t depend on Eloquent models. Even though this works well, by using Eloquent models you’ll have access to a series of helpful methods and shortcuts to make your code more concise and easier to read.
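
      For instance, the query builder call inside the loop has a direct Eloquent counterpart. A minimal sketch of inserting a single link through the model instead (reusing the same $link array from the loop) could look like this:

      use App\Models\Link;

      // Create a Link model instance, set its attributes, and persist it;
      // Eloquent maps the object properties to columns of the links table
      $seed_link = new Link();
      $seed_link->url = $link['url'];
      $seed_link->description = $link['description'];
      $seed_link->save();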

      To improve this code, you’ll change the foreach loop to use Eloquent models instead of querying the database directly with the query builder. You’ll also have to create a default list of links (called $default_list in the following code) before the loop is started, so that you can reference this list in each new link created.

      Replace the current content in your seeder class with the following code:

      database/seeders/LinkSeeder.php

<?php

      namespace Database\Seeders;

      use App\Models\Link;
      use App\Models\LinkList;
      use Illuminate\Database\Seeder;
      use Illuminate\Support\Facades\DB;
      use Symfony\Component\Yaml\Yaml;

      class LinkSeeder extends Seeder
      {
          /**
           * Run the database seeds.
           *
           * @return void
           */
          public function run()
          {
              //only import seeds if DB is empty.
              if (!Link::count()) {
                  $this->importLinks();
              }
          }

          /**
           * Imports Links from the default links.yml file at the root of the app.
           * Change that file to import a set of personal basic links you want to show
           * as soon as the application is deployed.
           */
          public function importLinks()
          {
              $links_import_path = __DIR__ . '/../../links.yml';

              $yaml = new Yaml();
              if (is_file($links_import_path)) {
                  $links = $yaml->parseFile($links_import_path);

                  $default_list = new LinkList();
                  $default_list->title = "Default";
                  $default_list->description = "Default List";
                  $default_list->slug = "default";
                  $default_list->save();

                  foreach ($links as $link) {
                      $seed_link = new Link();
                      $seed_link->url = $link['url'];
                      $seed_link->description = $link['description'];

                      $default_list->links()->save($seed_link);
                  }
              }
          }
      }
      

The updated code uses an object-oriented approach for setting up the properties of the LinkList and Link models, which Eloquent translates into table columns. The final line of the foreach loop uses the links() relationship method on $default_list to save each new link within that list.
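
      For this to work, the LinkList model must define a hasMany relationship named links(). A sketch of that method, as it would typically appear in app/Models/LinkList.php, looks like this:

      // app/Models/LinkList.php: the relationship the seeder relies on
      public function links()
      {
          return $this->hasMany(Link::class);
      }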

Save the file when you’re done. The seeder will only import records when the database is empty, due to the Link::count() check, so that it doesn’t conflict with actual data inserted into the database by other means. Thus, in order to run the modified seeder, you’ll need to wipe the database once again with the artisan db:wipe command.

      Run the following command to wipe the development database:

      • docker-compose exec app php artisan db:wipe

      Output

      Dropped all tables successfully.

Now, to recreate the tables and run the updated seeders, use the artisan migrate command with the --seed flag:

      • docker-compose exec app php artisan migrate --seed

      You should receive output that is similar to the following:

      Output

Migration table created successfully.
      Migrating: 2014_10_12_000000_create_users_table
      Migrated:  2014_10_12_000000_create_users_table (124.20ms)
      Migrating: 2014_10_12_100000_create_password_resets_table
      Migrated:  2014_10_12_100000_create_password_resets_table (121.75ms)
      Migrating: 2019_08_19_000000_create_failed_jobs_table
      Migrated:  2019_08_19_000000_create_failed_jobs_table (112.43ms)
      Migrating: 2020_11_18_165241_create_links_table
      Migrated:  2020_11_18_165241_create_links_table (61.04ms)
      Migrating: 2021_07_09_122027_create_link_lists_table
      Migrated:  2021_07_09_122027_create_link_lists_table (112.18ms)
      Seeding: Database\Seeders\LinkSeeder
      Seeded: Database\Seeders\LinkSeeder (84.57ms)
      Database seeding completed successfully.

      In the next chapter of this series, you’ll learn in more detail how to query database records with Eloquent.

      This tutorial is part of an ongoing weekly series about Laravel Eloquent. You can subscribe to the Laravel tag if you want to be notified when new tutorials are published.




      What is Data Analysis?


      Data analysis is the practice of investigating the structure of data and using it to identify patterns and possible solutions to problems. A foundational element of data science, data analysis draws on methodologies from statistics, mathematics, and computer science to analyze data from past events and to predict possible outcomes.

      One important trend within data analysis is machine learning, which attempts to train algorithms based on preexisting data so that they may automatically classify or predict future data. In this way, machine learning enables practices such as automated decision-making.
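
      To make that train-then-predict pattern concrete, here is a minimal Python sketch using the scikit-learn library (not mentioned in the article itself) on a tiny synthetic dataset:

      from sklearn.tree import DecisionTreeClassifier

      # Preexisting data: feature rows and the known label for each row
      features = [[20, 0], [25, 1], [60, 1], [65, 0]]
      labels = [0, 0, 1, 1]

      # Train a classifier on past data, then predict labels for unseen rows
      model = DecisionTreeClassifier()
      model.fit(features, labels)
      print(model.predict([[22, 1], [63, 0]]))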

      For more information about data analysis, check out our data analysis tag page, and An Introduction to Machine Learning.


