

      How To Set Up a Continuous Delivery Pipeline with Flux on DigitalOcean Kubernetes


      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

      Introduction

By itself, Kubernetes does not offer continuous integration and deployment features. While smaller projects can often do without such processes, bigger teams who host and update their deployments extensively find it much easier to set them up, alleviating time-consuming manual tasks so they can instead focus on developing the software that's being deployed. One approach to maintaining continuous delivery for Kubernetes is GitOps.

      GitOps views the Git repositories hosting the application and Kubernetes manifests as the central source of truth regarding deployments. It allows for separated deployment environments by using repository branches, gives you the ability to quickly reproduce any config state, current or past, on any cluster, and makes rollbacks trivial thanks to Git versioning. The manifests are secure, synchronized, and easily accessible at all times. Modifications to the manifest or application can be audited, allowed, or denied depending on external factors (usually, the continuous integration system). Automating the process from pushing the code to having it deploy on a cluster can greatly increase productivity and enhance the developer experience while making the deployment always consistent with the central code base.

Flux is an open-source tool facilitating the GitOps continuous delivery approach for Kubernetes. Flux allows for automated application and configuration deployments to your clusters by monitoring the configured Git repositories and automatically applying the changes as soon as they become available. It can apply Kustomize manifests (which provide an easy way to optionally patch parts of the usual Kubernetes manifests on the fly), as well as watch over Helm chart releases. You can also configure it to notify you via Slack, Discord, Microsoft Teams, or any other service that supports webhooks. Webhooks provide a way of notifying an app or a service of an event that's happened somewhere else, along with a description of that event.

      In this tutorial, you’ll install Flux and use it to set up continuous delivery for the podinfo app to your DigitalOcean Kubernetes cluster. podinfo is an app that provides details about the environment it’s running in. You’ll host the repositories holding Flux configuration and podinfo on your GitHub account. You’ll set up Flux to watch over the app repository, automatically apply the changes, and notify you on Slack using webhooks. In the end, all changes that you make to the monitored repository will quickly be propagated to your cluster.

      Prerequisites

      To complete this tutorial, you will need:

• A DigitalOcean Kubernetes cluster with your connection configuration set as the kubectl default. Instructions on how to configure kubectl are shown under the Connect to your Cluster step when you create your cluster. To learn how to create a Kubernetes cluster on DigitalOcean, see Kubernetes Quickstart.
      • A Slack workspace you’re a member of. To learn how to create a workspace, visit the official docs.
      • A GitHub account with a Personal Access Token (PAT) created with all privileges. To learn how to create one, visit the official docs.

      • Git initialized and set up on your local machine. To get started with Git, as well as see installation instructions, visit the How To Contribute to Open Source: Getting Started with Git tutorial.

      • The podinfo app repository forked to your GitHub account. For instructions on how to fork a repository to your account, visit the official getting started docs.

      Step 1 — Installing and Bootstrapping Flux

      In this step, you’ll set up Flux on your local machine, install it to your cluster, and set up a dedicated Git repository for storing and versioning its configuration.

On Linux, you can use the official Bash script to install Flux. If you're on macOS, you can either use the official script, following the same steps as for Linux, or use Homebrew to install Flux with the following command:

      • brew install fluxcd/tap/flux

      To install Flux using the officially provided script, download it by running the following command:

      • curl https://fluxcd.io/install.sh -so flux-install.sh

You can inspect the flux-install.sh script to verify that it's safe by running this command:

• less flux-install.sh

To be able to run it, you must mark it as executable:

• chmod +x flux-install.sh

Then, execute the script to install Flux:

• sudo ./flux-install.sh

      You’ll see the following output, detailing what version is being installed:

      Output

[INFO] Downloading metadata https://api.github.com/repos/fluxcd/flux2/releases/latest
[INFO] Using 0.13.4 as release
[INFO] Downloading hash https://github.com/fluxcd/flux2/releases/download/v0.13.4/flux_0.13.4_checksums.txt
[INFO] Downloading binary https://github.com/fluxcd/flux2/releases/download/v0.13.4/flux_0.13.4_linux_amd64.tar.gz
[INFO] Verifying binary download
[INFO] Installing flux to /usr/local/bin/flux

      To enable command autocompletion, run the following command to configure the shell:

      • echo ". <(flux completion bash)" >> ~/.bashrc

For the changes to take effect, reload ~/.bashrc by running:

• . ~/.bashrc

You now have Flux available on your local machine. Before installing it to your cluster, you'll first need to run the prerequisite checks that verify compatibility:

• flux check --pre

      Flux will connect to your cluster, which you’ve set up a connection to in the prerequisites. You’ll see an output similar to this:

      Output

► checking prerequisites
✔ kubectl 1.21.1 >=1.18.0-0
✔ Kubernetes 1.20.2 >=1.16.0-0
✔ prerequisites checks passed

      Note: If you see an error or a warning, double check the cluster you’re connected to. It’s possible that you may need to perform an upgrade to be able to use Flux. If kubectl is reported missing, repeat the steps from the prerequisites for your platform and check that it’s in your PATH.

During the bootstrapping process, Flux creates a Git repository at a specified provider and initializes it with a default configuration. Doing so requires your GitHub username and personal access token, which you retrieved in the prerequisites. The repository will be available under your account on GitHub.

      You’ll store your GitHub username and personal access token as environment variables to avoid typing them multiple times. Run the following commands, replacing the highlighted parts with your GitHub credentials:

      • export GITHUB_USER=your_username
      • export GITHUB_TOKEN=your_personal_access_token

      You can now bootstrap Flux and install it to your cluster by running:

• flux bootstrap github \
• --owner=$GITHUB_USER \
• --repository=flux-config \
• --branch=main \
• --path=./clusters/my-cluster \
• --personal

In this command, you specify that the repository should be called flux-config and hosted at the github provider, owned by the user you've just defined. The new repository will be personal (not under an organization) and will be made private by default.

      The output you’ll see will be similar to this:

      Output

► connecting to github.com
► cloning branch "main" from Git repository "https://github.com/GITHUB_USER/flux-config.git"
✔ cloned repository
► generating component manifests
✔ generated component manifests
✔ committed sync manifests to "main" ("b750ffae686c2f110364694d2ddae26c7f18c6a2")
► pushing component manifests to "https://github.com/GITHUB_USER/flux-config.git"
► installing components in "flux-system" namespace
✔ installed components
✔ reconciled components
► determining if source secret "flux-system/flux-system" exists
► generating source secret
✔ public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDKw943TnUiKLVk4WMLC5YCeC+tIPVvJprQxTfLqcwkHtedMJPanJFifmbQ/M3CAq1IgqyQTydRJSJu6E/4YDOwx1vawStR9XU16rkn+rZbmvRxZ97E0HNb5m54OwmziAWf0EPdsfiIIJYSRkCMihpKJUNoakl+sng6LQsW+WIRlOK39aJRWud+rygQEuEKmD7YHKQ0VSb/L5v50jiPgEZImiREHNfjBU+RkEni3aZuOO3jNy5WdlPkpdqfHe8fdFsjJnvNB0zmfe3eTIB2fbdDzxo2usLbFeAMhGCRYsGnniHsytBHNLmxDM/4I18xlNN9e6WEYpgHEJVb8azKmwSX
✔ configured deploy key "flux-system-main-flux-system-./clusters/my-cluster" for "https://github.com/GITHUB_USER/flux-config"
► applying source secret "flux-system/flux-system"
✔ reconciled source secret
► generating sync manifests
✔ generated sync manifests
✔ committed sync manifests to "main" ("1dc033e24f3288a70ff80c57816e16c52bc62303")
► pushing sync manifests to "https://github.com/GITHUB_USER/flux-config.git"
► applying sync manifests
✔ reconciled sync configuration
◎ waiting for Kustomization "flux-system/flux-system" to be reconciled
✔ Kustomization reconciled successfully
► confirming components are healthy
✔ source-controller: deployment ready
✔ kustomize-controller: deployment ready
✔ helm-controller: deployment ready
✔ notification-controller: deployment ready
✔ all components are healthy

      Flux noted that it made a new Git repository, committed a basic starting configuration to it, and provisioned necessary controllers in your cluster.

      In this step, you’ve installed Flux on your local machine, created a new Git repository to hold its configuration, and deployed its server-side components to your cluster. The changes defined by the commits in the repository will now get propagated to your cluster automatically. In the next step, you’ll create configuration manifests ordering Flux to automate deployments of the podinfo app you’ve forked whenever a change occurs.

      Step 2 — Configuring the Automated Deployment

      In this section, you will configure Flux to watch over the podinfo repository that you’ve forked and apply the changes to your cluster as soon as they become available.

In addition to creating the repository and initial configuration, Flux offers commands to help you generate config manifests with your parameters faster than writing them from scratch. The manifests, regardless of what they define, must be available in Flux's Git config repository to be taken into consideration. To add them to the repository, you'll first need to clone it to your machine so you can push changes. Do so by running the following command:

      • git clone https://github.com/$GITHUB_USER/flux-config ~/flux-config

      You may be asked for your username and password. Input your account username and provide your personal access token for the password.

Then, navigate to it:

• cd ~/flux-config

      To instruct Flux to monitor the forked podinfo repository, you’ll first need to let it know where it’s located. This is achieved by creating a GitRepository manifest, which details the repository URL, branch, and monitoring interval.

      To create the manifest, run the following command:

• flux create source git podinfo \
• --url=https://github.com/$GITHUB_USER/podinfo \
• --branch=master \
• --interval=30s \
• --export > ./clusters/my-cluster/podinfo-source.yaml

      Here, you specify that the source will be a Git repository with the given URL and branch. You pass in --export to output the generated manifest and pipe it into podinfo-source.yaml, located under ./clusters/my-cluster/ in the main config repository, where manifests for the current cluster are stored.

      You can show the contents of the generated file by running:

      • cat ./clusters/my-cluster/podinfo-source.yaml

      The output will look similar to this:

      ~/flux-config/clusters/my-cluster/podinfo-source.yaml

      ---
      apiVersion: source.toolkit.fluxcd.io/v1beta1
      kind: GitRepository
      metadata:
        name: podinfo
        namespace: flux-system
      spec:
        interval: 30s
        ref:
          branch: master
        url: https://github.com/GITHUB_USER/podinfo
      

      You can check that the parameters you just passed into Flux are correctly laid out in the generated manifest.

You’ve now defined a source Git repository that Flux can access, but you still need to tell it what to deploy. Flux supports Kustomize resources, which podinfo exposes under the kustomize directory. By supporting Kustomizations, Flux does not limit itself, because Kustomize manifests can be as simple as a list of the usual manifests, included unchanged, as the minimal example below illustrates.
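
For illustration, a minimal kustomization.yaml that simply includes a set of ordinary manifests unchanged could look like this (the file names here are hypothetical):

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml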

      Create a Kustomization manifest, which tells Flux where to look for deployable manifests, by running the following command:

• flux create kustomization podinfo \
• --source=podinfo \
• --path="./kustomize" \
• --prune=true \
• --validation=client \
• --interval=5m \
• --export > ./clusters/my-cluster/podinfo-kustomization.yaml

      For the --source, you specify the podinfo Git repository you’ve just created. You also set the --path to ./kustomize, which refers to the filesystem structure of the source repository. Then, you save the YAML output into a file called podinfo-kustomization.yaml in the directory for the current cluster.
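
Since you used --export, the file contains only the generated manifest. Its contents should look similar to this (exact fields can differ slightly between Flux versions):

~/flux-config/clusters/my-cluster/podinfo-kustomization.yaml

---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta1
kind: Kustomization
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 5m0s
  path: ./kustomize
  prune: true
  sourceRef:
    kind: GitRepository
    name: podinfo
  validation: client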

      The Git repository and Kustomization you’ve created are now available, but the cluster-side of Flux can’t yet see them because they’re not in the remote repository on GitHub. To push them, you must first commit them by running:

      • git add . && git commit -m "podinfo added"

With the changes now committed, push them to the remote repository:

• git push

      Same as last time, git may ask you for your credentials. Input your username and your personal access token to continue.

      The new manifests are now live, and cluster-side Flux will soon pick them up. You can watch it sync the cluster’s state with the one presented in the manifests by running:

      • watch flux get kustomizations

      After the refresh interval specified for the Git repository elapses (which you’ve set to 30s in the manifest above), Flux will retrieve its latest commit and update the cluster. Once it does, you’ll see output similar to this:

      Output

NAME         READY   MESSAGE
flux-system  True    Applied revision: main/fc07af652d3168be329539b30a4c3943a7d12dd8
podinfo      True    Applied revision: master/855f7724be13f6146f61a893851522837ad5b634

      You can see that a podinfo Kustomization was applied, along with its branch and commit hash. You can list deployments and services as well to check that podinfo is deployed:

      • kubectl get deployments,services

      You’ll see that they are present, configured according to their respective manifests:

      Output

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/podinfo   2/2     2            2           56s

NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/kubernetes   ClusterIP   10.245.0.1      <none>        443/TCP             34m
service/podinfo      ClusterIP   10.245.78.189   <none>        9898/TCP,9999/TCP   56s

      Any changes that you manually make to these and other resources that Flux controls will quickly be overwritten with the ones referenced from Git repositories. To make changes, you’d need to modify the central sources, not the actual deployments in a cluster. This applies to deleting resources as well — any resources you manually delete from the cluster will soon be reinstated. To delete them, you’d need to remove their manifests from the monitored repositories and wait for the changes to be propagated.

      Flux’s behavior is intentionally rigid because it operates on what it finds in the remote repositories at the end of each refresh interval. Suspending Kustomization monitoring and, in turn, state reconciliation is useful when you need to manually override the resources in the cluster without being interrupted by Flux.

      You can pause monitoring of a Kustomization indefinitely by running:

      • flux suspend kustomization kustomization_name

      The default behavior can be brought back by running flux resume on a paused Kustomization:

      • flux resume kustomization kustomization_name

      You now have an automated process in place that will deploy podinfo to your cluster every time a change occurs. You’ll now set up Slack notifications, so you’ll know when a new version of podinfo is being deployed.

      Step 3 — Setting up Slack Notifications

      Now that you’ve set up automatic podinfo deployments to your cluster, you’ll connect Flux to a Slack channel, where you’ll be notified of every deployment and its outcome.

      To integrate with Slack, you’ll need to have an incoming webhook on Slack for your workspace. Incoming webhooks are a way of posting messages to the configured Slack channel.

If you haven’t ever created a webhook, you’ll first need to create an app for your workspace. To do so, first log in to Slack and navigate to the app creation page. Press the green Create New App button and select From scratch. Name it flux-app, select the desired workspace, and click Create New App.

      You’ll be redirected to the settings page for the new app. Click on Incoming Webhooks on the left navigation bar.

      Slack app - Incoming Webhooks

      Enable webhooks for flux-app by flipping the switch button next to the title Activate Incoming Webhooks.

      Slack app - Activate Incoming Webhooks

      A new section further down the page will be uncovered. Scroll down and click the Add New Webhook to Workspace button. On the next page, select the channel you want the reports to be sent to and click Allow.

You’ll be redirected back to the settings page for webhooks, and you’ll see a new webhook listed in the table. Click on Copy to copy it to the clipboard and make note of it for later use.

      You’ll store the generated Slack webhook for your app in a Kubernetes Secret in your cluster, so that Flux can access it without explicitly specifying it in its configuration manifests. Storing the webhook as a Secret also lets you easily replace it in the future.

      Create a Secret called slack-url containing the webhook by running the following command, replacing your_slack_webhook with the URL you’ve just copied:

      • kubectl -n flux-system create secret generic slack-url --from-literal=address=your_slack_webhook

      The output will be:

      Output

      secret/slack-url created

You’ll now create a Provider, which allows Flux to talk to the specified service using webhooks. Providers read the webhook URL from Secrets, which is why you’ve just created one. Run the following Flux command to create a Slack Provider:

• flux create alert-provider slack \
• --type slack \
• --channel general \
• --secret-ref slack-url \
• --export > ./clusters/my-cluster/slack-alert-provider.yaml
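
For reference, the exported Provider manifest should look similar to this (assuming the notification API version shipped with this Flux release):

~/flux-config/clusters/my-cluster/slack-alert-provider.yaml

---
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  channel: general
  secretRef:
    name: slack-url
  type: slack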

      Aside from Slack, Flux supports communicating with Microsoft Teams, Discord, and other platforms via webhooks. It also supports sending generic JSON to accommodate more software that parses this format.

      A Provider only allows Flux to send messages and does not specify when messages should be sent. For Flux to react to events, you’ll need to create an Alert using the slack Provider by running:

• flux create alert slack-alert \
• --event-severity info \
• --event-source Kustomization/* \
• --event-source GitRepository/* \
• --provider-ref slack \
• --export > ./clusters/my-cluster/slack-alert.yaml

      This command creates an alert manifest called slack-alert that will react to all Kustomization and Git repository changes and report them to the slack provider. The event severity is set to info, which will allow the alert to be triggered on all events, such as Kubernetes manifests being created or applied, something delaying deployment, or an error occurring. To report only errors, you can specify error instead. The resulting generated YAML is exported to a file called slack-alert.yaml.
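
The exported Alert manifest should look similar to this (again, assuming the notification API version of this Flux release):

~/flux-config/clusters/my-cluster/slack-alert.yaml

---
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: slack-alert
  namespace: flux-system
spec:
  eventSeverity: info
  eventSources:
  - kind: Kustomization
    name: '*'
  - kind: GitRepository
    name: '*'
  providerRef:
    name: slack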

      Commit the changes by running:

      • git add . && git commit -m "Added Slack alerts"

Push the changes to the remote repository by running the following command, inputting your GitHub username and personal access token if needed:

• git push

      After the configured refresh interval for the Git repository elapses, Flux will retrieve and apply the changes. You can watch the Alert become available by running:

      • watch kubectl -n flux-system get alert

      You’ll soon see that it’s Initialized:

      Output

NAME          READY   STATUS        AGE
slack-alert   True    Initialized   7s

      With alerting now set up, any actions that Flux takes will be logged in the Slack channel of the workspace that the webhook is connected to.

You’ll test this connection by introducing a change to your fork of podinfo. First, clone it to your local machine by running:

      • git clone https://github.com/$GITHUB_USER/podinfo.git ~/podinfo

Navigate to the cloned repository:

• cd ~/podinfo

      You’ll modify the name of its Service, which is defined in ~/podinfo/kustomize/service.yaml. Open it for editing:

      • nano ~/podinfo/kustomize/service.yaml

      Modify the Service name, like so:

      ~/podinfo/kustomize/service.yaml

      apiVersion: v1
      kind: Service
      metadata:
        name: podinfo-1
      spec:
        type: ClusterIP
        selector:
          app: podinfo
        ports:
          - name: http
            port: 9898
            protocol: TCP
            targetPort: http
          - port: 9999
            targetPort: grpc
            protocol: TCP
            name: grpc
      

      Save and close the file, then commit the changes by running:

      • git add . && git commit -m "Service name modified"

Then, push the changes:

• git push

      After a few minutes, you’ll see the changes pop up in Slack as they are deployed:

      Slack - Flux reported changes

      Flux fetched the new commit, created a new Service called podinfo-1, configured it, and deleted the old one. This order of actions ensures that the old Service (or any other manifest) stays untouched if provisioning of the new one fails.

      In case the new revision of the watched manifests contains a syntax error, Flux will report an error:

      Slack - Flux reported failed deployment

      You’ve connected Flux to your Slack workspace, and will immediately be notified of all actions and deployments that happen. You’ll now set up Flux to watch over Helm releases.

      Step 4 — (Optional) Automating Helm Release Deployments

      In addition to watching over Kustomizations and Git repositories, Flux can also monitor Helm charts. Flux can monitor charts residing in Git or Helm repositories, as well as in S3 cloud storage. You’ll now set it up to watch over the podinfo chart, which is located in a Helm repository.

The process of instructing Flux to monitor a Helm chart is similar to what you did in Step 2. You’ll first need to define a source (of one of the three types noted earlier) that it can poll for changes. Then, you’ll specify which chart to actually deploy among the ones it finds by creating a HelmRelease.

Navigate back to the flux-config repository:

• cd ~/flux-config

      Run the following command to create a source for the Helm repository that contains podinfo:

• flux create source helm podinfo \
• --url=https://stefanprodan.github.io/podinfo \
• --interval=10m \
• --export > ./clusters/my-cluster/podinfo-helm-repo.yaml

      Here, you specify the URL of the repository and how often it should be checked. Then, you save the output into a file called podinfo-helm-repo.yaml.
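
The generated HelmRepository manifest should look similar to this (the exact API version depends on your Flux release):

~/flux-config/clusters/my-cluster/podinfo-helm-repo.yaml

---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 10m0s
  url: https://stefanprodan.github.io/podinfo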

      With the source repository now defined, you can create a HelmRelease, defining which chart to monitor:

• flux create hr podinfo \
• --interval=10m \
• --source=HelmRepository/podinfo \
• --chart=podinfo \
• --target-namespace=podinfo-helm \
• --export > ./clusters/my-cluster/podinfo-helm-chart.yaml

As in the previous command, you save the resulting YAML output to a file, here called podinfo-helm-chart.yaml. You also pass in the name of the chart (podinfo), set the --source to the repository you’ve just defined, and specify that the chart will be installed into the podinfo-helm namespace.
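
The exported HelmRelease should look similar to this (field layout may vary slightly between Flux versions):

~/flux-config/clusters/my-cluster/podinfo-helm-chart.yaml

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: flux-system
spec:
  chart:
    spec:
      chart: podinfo
      sourceRef:
        kind: HelmRepository
        name: podinfo
  interval: 10m0s
  targetNamespace: podinfo-helm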

      Since the podinfo-helm namespace does not exist, create it by running:

      • kubectl create namespace podinfo-helm

      Then, commit and push the changes:

      • git add . && git commit -m "Added podinfo Helm chart" && git push

      After a few minutes, you’ll see that Flux logged a successful Helm chart upgrade in Slack:

      Slack - Flux logged successful chart install

      You can check the pods contained in the podinfo-helm namespace by running:

      • kubectl get pods -n podinfo-helm

      The output will be similar to this:

      Output

NAME                                     READY   STATUS    RESTARTS   AGE
podinfo-chart-podinfo-7c9b7667cb-gshkb   1/1     Running   0          33s

      This means that you have successfully configured Flux to monitor and deploy the podinfo Helm chart. As soon as a new version is released, or a modification is pushed, Flux will retrieve and deploy the newest variant of the Helm chart for you.

      Conclusion

      You’ve now automated Kubernetes manifest deployments using Flux, which allows you to push commits to watched repositories and have them automatically applied to your cluster. You’ve also set up alerting to Slack, so you’ll always know what deployments are happening in real time, and you can look up previous ones and see any errors that might have occurred.

      In addition to GitHub, Flux also supports retrieving and bootstrapping Git repositories hosted at GitLab. You can visit the official docs to learn more.




      How To Build a Data Processing Pipeline Using Luigi in Python on Ubuntu 20.04


      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      Luigi is a Python package that manages long-running batch processing, which is the automated running of data processing jobs on batches of items. Luigi allows you to define a data processing job as a set of dependent tasks. For example, task B depends on the output of task A. And task D depends on the output of task B and task C. Luigi automatically works out what tasks it needs to run to complete a requested job.

      Overall Luigi provides a framework to develop and manage data processing pipelines. It was originally developed by Spotify, who use it to manage plumbing together collections of tasks that need to fetch and process data from a variety of sources. Within Luigi, developers at Spotify built functionality to help with their batch processing needs including handling of failures, the ability to automatically resolve dependencies between tasks, and visualization of task processing. Spotify uses Luigi to support batch processing jobs, including providing music recommendations to users, populating internal dashboards, and calculating lists of top songs.

In this tutorial, you will build a data processing pipeline to analyze the most common words from the most popular books on Project Gutenberg. To do this, you will build a pipeline using the Luigi package. You will use Luigi tasks, targets, dependencies, and parameters to build your pipeline.

      Prerequisites

      To complete this tutorial, you will need the following:

      Step 1 — Installing Luigi

      In this step, you will create a clean sandbox environment for your Luigi installation.

First, create a project directory; for this tutorial, call it luigi-demo:

• mkdir luigi-demo

Navigate into the newly created luigi-demo directory:

• cd luigi-demo

      Create a new virtual environment luigi-venv:

      • python3 -m venv luigi-venv

      And activate the newly created virtual environment:

      • . luigi-venv/bin/activate

      You will find (luigi-venv) appended to the front of your terminal prompt to indicate which virtual environment is active:

      Output

      (luigi-venv) username@hostname:~/luigi-demo$

      For this tutorial, you will need three libraries: luigi, beautifulsoup4, and requests. The requests library streamlines making HTTP requests; you will use it to download the Project Gutenberg book lists and the books to analyze. The beautifulsoup4 library provides functions to parse data from web pages; you will use it to parse out a list of the most popular books on the Project Gutenberg site.

      Run the following command to install these libraries using pip:

      • pip install wheel luigi beautifulsoup4 requests

      You will get a response confirming the installation of the latest versions of the libraries and all of their dependencies:

      Output

      Successfully installed beautifulsoup4-4.9.1 certifi-2020.6.20 chardet-3.0.4 docutils-0.16 idna-2.10 lockfile-0.12.2 luigi-3.0.1 python-daemon-2.2.4 python-dateutil-2.8.1 requests-2.24.0 six-1.15.0 soupsieve-2.0.1 tornado-5.1.1 urllib3-1.25.10

      You’ve installed the dependencies for your project. Now, you’ll move on to building your first Luigi task.

      Step 2 — Creating a Luigi Task

      In this step, you will create a “Hello World” Luigi task to demonstrate how they work.

      A Luigi task is where the execution of your pipeline and the definition of each task’s input and output dependencies take place. Tasks are the building blocks that you will create your pipeline from. You define them in a class, which contains:

      • A run() method that holds the logic for executing the task.
      • An output() method that returns the artifacts generated by the task. The run() method populates these artifacts.
• An optional requires() method that returns any additional tasks in your pipeline that are required to execute the current task. The run() method can then access their outputs through input() to carry out the task.

Create a new file hello-world.py:

• nano hello-world.py

      Now add the following code to your file:

      hello-world.py

      import luigi
      
      class HelloLuigi(luigi.Task):
      
          def output(self):
              return luigi.LocalTarget('hello-luigi.txt')
      
          def run(self):
              with self.output().open("w") as outfile:
                  outfile.write("Hello Luigi!")
      
      

You define that HelloLuigi() is a Luigi task by subclassing luigi.Task.

      The output() method defines one or more Target outputs that your task produces. In the case of this example, you define a luigi.LocalTarget, which is a local file.

      Note: Luigi allows you to connect to a variety of common data sources including AWS S3 buckets, MongoDB databases, and SQL databases. You can find a complete list of supported data sources in the Luigi docs.

The run() method contains the code you want to execute for your pipeline stage. For this example, you open the output() target file in write mode with self.output().open("w") as outfile: and write "Hello Luigi!" to it with outfile.write("Hello Luigi!").

      To execute the task you created, run the following command:

      • python -m luigi --module hello-world HelloLuigi --local-scheduler

      Here, you run the task using python -m instead of executing the luigi command directly; this is because Luigi can only execute code that is within the current PYTHONPATH. You can alternatively add PYTHONPATH='.' to the front of your Luigi command, like so:

      • PYTHONPATH='.' luigi --module hello-world HelloLuigi --local-scheduler

      With the --module hello-world HelloLuigi flag, you tell Luigi which Python module and Luigi task to execute.

      The --local-scheduler flag tells Luigi to not connect to a Luigi scheduler and, instead, execute this task locally. (We explain the Luigi scheduler in Step 4.) Running tasks using the local-scheduler flag is only recommended for development work.

      Luigi will output a summary of the executed tasks:

      Output

===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 ran successfully:
    - 1 HelloLuigi()

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

      And it will create a new file hello-luigi.txt with content:

      hello-luigi.txt

      Hello Luigi!
      

      You have created a Luigi task that generates a file and then executed it using the Luigi local-scheduler. Now, you’ll create a task that can extract a list of books from a web page.

Step 3 — Creating a Task to Download the Book List

In this step, you will create a Luigi task and define a run() method for the task to download a list of the most popular books on Project Gutenberg. You’ll define an output() method to store links to these books in a file. You will run these using the Luigi local scheduler.

      Create a new directory data inside of your luigi-demo directory. This will be where you will store the files defined in the output() methods of your tasks. You need to create the directories before running your tasks—Python throws exceptions when you try to write a file to a directory that does not exist yet:

      • mkdir data
      • mkdir data/counts
      • mkdir data/downloads

Create a new file word-frequency.py:

• nano word-frequency.py

      Insert the following code, which is a Luigi task to extract a list of links to the top most-read books on Project Gutenberg:

      word-frequency.py

      import requests
      import luigi
      from bs4 import BeautifulSoup
      
      
      class GetTopBooks(luigi.Task):
          """
          Get list of the most popular books from Project Gutenberg
          """
      
          def output(self):
              return luigi.LocalTarget("data/books_list.txt")
      
          def run(self):
              resp = requests.get("http://www.gutenberg.org/browse/scores/top")
      
              soup = BeautifulSoup(resp.content, "html.parser")
      
              pageHeader = soup.find_all("h2", string="Top 100 EBooks yesterday")[0]
              listTop = pageHeader.find_next_sibling("ol")
      
              with self.output().open("w") as f:
                  for result in listTop.select("li>a"):
                      if "/ebooks/" in result["href"]:
                    f.write("http://www.gutenberg.org{link}.txt.utf-8\n"
                              .format(
                                  link=result["href"]
                              )
                          )
      

      You define an output() target of file "data/books_list.txt" to store the list of books.

      In the run() method, you:

      • use the requests library to download the HTML contents of the Project Gutenberg top books page.
      • use the BeautifulSoup library to parse the contents of the page. The BeautifulSoup library allows us to scrape information out of web pages. To find out more about using the BeautifulSoup library, read the How To Scrape Web Pages with Beautiful Soup and Python 3 tutorial.
      • open the output file defined in the output() method.
• iterate over the HTML structure to get all of the links in the Top 100 EBooks yesterday list. For this page, this means locating all links <a> that are within a list item <li>. For each of those links, if the href contains /ebooks/, you can assume it is a book and write that link to your output() file.

      Screenshot of the Project Gutenberg top books web page with the top ebooks links highlighted

      Save and exit the file once you’re done.

      Execute this new task using the following command:

      • python -m luigi --module word-frequency GetTopBooks --local-scheduler

      Luigi will output a summary of the executed tasks:

      Output

===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 ran successfully:
    - 1 GetTopBooks()

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

In the data directory, Luigi will create a new file (data/books_list.txt). Run the following command to output the contents of the file:

• cat data/books_list.txt

      This file contains a list of URLs extracted from the Project Gutenberg top projects list:

      Output

http://www.gutenberg.org/ebooks/1342.txt.utf-8
http://www.gutenberg.org/ebooks/11.txt.utf-8
http://www.gutenberg.org/ebooks/2701.txt.utf-8
http://www.gutenberg.org/ebooks/1661.txt.utf-8
http://www.gutenberg.org/ebooks/16328.txt.utf-8
http://www.gutenberg.org/ebooks/45858.txt.utf-8
http://www.gutenberg.org/ebooks/98.txt.utf-8
http://www.gutenberg.org/ebooks/84.txt.utf-8
http://www.gutenberg.org/ebooks/5200.txt.utf-8
http://www.gutenberg.org/ebooks/51461.txt.utf-8
...

      You’ve created a task that can extract a list of books from a web page. In the next step, you’ll set up a central Luigi scheduler.

      Step 4 — Running the Luigi Scheduler

      Now, you’ll launch the Luigi scheduler to execute and visualize your tasks. You will take the task developed in Step 3 and run it using the Luigi scheduler.

      So far, you have been running Luigi using the --local-scheduler tag to run your jobs locally without allocating work to a central scheduler. This is useful for development, but for production usage it is recommended to use the Luigi scheduler. The Luigi scheduler provides:

      • A central point to execute your tasks.
      • Visualization of the execution of your tasks.

To access the Luigi scheduler interface, you need to enable access to port 8082. To do this, run the following command:

• sudo ufw allow 8082

To run the scheduler, execute the following command:

      • sudo sh -c ". luigi-venv/bin/activate ;luigid --background --port 8082"

Note: We have re-run the virtualenv activate script as root before launching the Luigi scheduler as a background task. This is because, when running sudo, the virtualenv environment variables and aliases are not carried over.

      If you do not want to run as root, you can run the Luigi scheduler as a background process for the current user. This command runs the Luigi scheduler in the background and hides messages from the scheduler background task. You can find out more about managing background processes in the terminal at How To Use Bash’s Job Control to Manage Foreground and Background Processes:

      • luigid --port 8082 > /dev/null 2> /dev/null &

Open a browser to access the Luigi user interface. This will be at http://your_server_ip:8082, or at http://your_domain:8082 if you have set up a domain for your server.

      Luigi default user interface

By default, Luigi tasks run using the Luigi scheduler. To run one of your previous tasks using the Luigi scheduler, omit the --local-scheduler argument from the command. Re-run the task from Step 3 using the following command:

      • python -m luigi --module word-frequency GetTopBooks

      Refresh the Luigi scheduler user interface. You will find the GetTopBooks task added to the run list and its execution status.

      Luigi User Interface after running the GetTopBooks Task

      You will continue to refer back to this user interface to monitor the progress of your pipeline.

      Note: If you’d like to secure your Luigi scheduler through HTTPS, you can serve it through Nginx. To set up an Nginx server using HTTPS follow: How To Secure Nginx with Let’s Encrypt on Ubuntu 20.04. See Github - Luigi - Pull Request 2785 for suggestions on a suitable Nginx configuration to connect the Luigi server to Nginx.

      You’ve launched the Luigi Scheduler and used it to visualize your executed tasks. Next, you will create a task to download the list of books that the GetTopBooks() task outputs.

      Step 5 — Downloading the Books

      In this step you will create a Luigi task to download a specified book. You will define a dependency between this newly created task and the task created in Step 3.

First, open your file:

• nano word-frequency.py

      Add an additional class following your GetTopBooks() task to the word-frequency.py file with the following code:

      word-frequency.py

      . . .
      class DownloadBooks(luigi.Task):
          """
          Download a specified list of books
          """
          FileID = luigi.IntParameter()
      
          REPLACE_LIST = """.,"';_[]:*-"""
      
          def requires(self):
              return GetTopBooks()
      
          def output(self):
              return luigi.LocalTarget("data/downloads/{}.txt".format(self.FileID))
      
          def run(self):
              with self.input().open("r") as i:
                  URL = i.read().splitlines()[self.FileID]
      
                  with self.output().open("w") as outfile:
                      book_downloads = requests.get(URL)
                      book_text = book_downloads.text
      
                      for char in self.REPLACE_LIST:
                          book_text = book_text.replace(char, " ")
      
                      book_text = book_text.lower()
                      outfile.write(book_text)
      

      In this task you introduce a Parameter; in this case, an integer parameter. Luigi parameters are inputs to your tasks that affect the execution of the pipeline. Here you introduce a parameter FileID to specify a line in your list of URLs to fetch.

      You have added an additional method to your Luigi task, def requires(); in this method you define the Luigi task that you need the output of before you can execute this task. You require the output of the GetTopBooks() task you defined in Step 3.

      In the output() method, you define your target. You use the FileID parameter to create a name for the file created by this step. In this case, you format data/downloads/{FileID}.txt.

      In the run() method, you:

      • open the list of books generated in the GetTopBooks() task.
      • get the URL from the line specified by parameter FileID.
      • use the requests library to download the contents of the book from the URL.
• filter out the special characters listed in REPLACE_LIST from the book text, so they don’t get included in your word analysis.
      • convert the text to lowercase so you can compare words with different cases.
      • write the filtered output to the file specified in the output() method.

      Save and exit your file.

      Run the new DownloadBooks() task using this command:

      • python -m luigi --module word-frequency DownloadBooks --FileID 2

      In this command, you set the FileID parameter using the --FileID argument.

      Note: Be careful when defining a parameter with an _ in the name. To reference them in Luigi you need to substitute the _ for a -. For example, a File_ID parameter would be referenced as --File-ID when calling a task from the terminal.

      You will receive the following output:

      Output

===== Luigi Execution Summary =====

Scheduled 2 tasks of which:
* 1 complete ones were encountered:
    - 1 GetTopBooks()
* 1 ran successfully:
    - 1 DownloadBooks(FileID=2)

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

      Note from the output that Luigi has detected that you have already generated the output of GetTopBooks() and skipped running that task. This functionality allows you to minimize the number of tasks you have to execute as you can re-use successful output from previous runs.

      You have created a task that uses the output of another task and downloads a set of books to analyze. In the next step, you will create a task to count the most common words in a downloaded book.

      Step 6 — Counting Words and Summarizing Results

      In this step, you will create a Luigi task to count the frequency of words in each of the books downloaded in Step 5. This will be your first task that executes in parallel.

First, open your file again:

• nano word-frequency.py

      Add the following imports to the top of word-frequency.py:

      word-frequency.py

      from collections import Counter
      import pickle
      

      Add the following task to word-frequency.py, after your DownloadBooks() task. This task takes the output of the previous DownloadBooks() task for a specified book, and returns the most common words in that book:

      word-frequency.py

      class CountWords(luigi.Task):
          """
          Count the frequency of the most common words from a file
          """
      
          FileID = luigi.IntParameter()
      
          def requires(self):
              return DownloadBooks(FileID=self.FileID)
      
          def output(self):
              return luigi.LocalTarget(
                  "data/counts/count_{}.pickle".format(self.FileID),
                  format=luigi.format.Nop
              )
      
          def run(self):
              with self.input().open("r") as i:
                  word_count = Counter(i.read().split())
      
                  with self.output().open("w") as outfile:
                      pickle.dump(word_count, outfile)
      

      When you define requires() you pass the FileID parameter to the next task. When you specify that a task depends on another task, you specify the parameters you need the dependent task to be executed with.

      In the run() method you:

      • open the file generated by the DownloadBooks() task.
      • use the built-in Counter object in the collections library. This provides an easy way to analyze the most common words in a book.
• use the pickle library to store the output of the Python Counter object, so you can re-use that object in a later task. pickle converts Python objects into a byte stream, which you can store and restore in a later Python session (see the sketch after this list). You have to set the format property of the luigi.LocalTarget to allow it to write the binary output that the pickle library generates.
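
As a standalone sketch, independent of Luigi and using a hypothetical file name, this is roughly what the pickle round-trip in this task does:

pickle-example.py

from collections import Counter
import pickle

word_count = Counter("the quick brown fox jumps over the lazy dog the".split())

# Serialize the Counter object to a byte stream and store it in a file
with open("count.pickle", "wb") as outfile:
    pickle.dump(word_count, outfile)

# Restore it later, in the same or a different Python session
with open("count.pickle", "rb") as infile:
    restored = pickle.load(infile)

print(restored.most_common(1))  # [('the', 3)]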

      Save and exit your file.

      Run the new CountWords() task using this command:

      • python -m luigi --module word-frequency CountWords --FileID 2

      Open the CountWords task graph view in the Luigi scheduler user interface.

      Showing how to view a graph from the Luigi user interface

      Deselect the Hide Done option, and deselect Upstream Dependencies. You will find the flow of execution from the tasks you have created.

      Visualizing the execution of the CountWords task

      You have created a task to count the most common words in a downloaded book and visualized the dependencies between those tasks. Next, you will define parameters that you can use to customize the execution of your tasks.

      Step 7 — Defining Configuration Parameters

      In this step, you will add configuration parameters to the pipeline. These will allow you to customize how many books to analyze and the number of words to include in the results.

      When you want to set parameters that are shared among tasks, you can create a Config() class. Other pipeline stages can reference the parameters defined in the Config() class; these are set by the pipeline when executing a job.

      Add the following Config() class to the end of word-frequency.py. This will define two new parameters in your pipeline for the number of books to analyze and the number of most frequent words to include in the summary:

      word-frequency.py

      class GlobalParams(luigi.Config):
          NumberBooks = luigi.IntParameter(default=10)
          NumberTopWords = luigi.IntParameter(default=500)
      

Add the following class to word-frequency.py. This class aggregates the results from all of the CountWords() tasks to create a summary of the most frequent words:

      word-frequency.py

      class TopWords(luigi.Task):
          """
          Aggregate the count results from the different files
          """
      
          def requires(self):
              requiredInputs = []
              for i in range(GlobalParams().NumberBooks):
                  requiredInputs.append(CountWords(FileID=i))
              return requiredInputs
      
          def output(self):
              return luigi.LocalTarget("data/summary.txt")
      
          def run(self):
              total_count = Counter()
              for input in self.input():
                  with input.open("rb") as infile:
                      nextCounter = pickle.load(infile)
                      total_count += nextCounter
      
              with self.output().open("w") as f:
                  for item in total_count.most_common(GlobalParams().NumberTopWords):
                f.write("{0: <15}{1}\n".format(*item))
      
      

      In the requires() method, you can provide a list where you want a task to use the output of multiple dependent tasks. You use the GlobalParams().NumberBooks parameter to set the number of books you need word counts from.

      In the output() method, you define a data/summary.txt output file that will be the final output of your pipeline.

      In the run() method you:

• create a Counter() object to store the total count.
• for each count produced by a CountWords() task, open its output file and “unpickle” it (convert it from a byte stream back into a Python Counter object).
• add the loaded count to the total count.
• write the most common words to the target output file.

      Run the pipeline with the following command:

      • python -m luigi --module word-frequency TopWords --GlobalParams-NumberBooks 15 --GlobalParams-NumberTopWords 750

      Luigi will execute the remaining tasks needed to generate the summary of the top words:

      Output

===== Luigi Execution Summary =====

Scheduled 31 tasks of which:
* 2 complete ones were encountered:
    - 1 CountWords(FileID=2)
    - 1 GetTopBooks()
* 29 ran successfully:
    - 14 CountWords(FileID=0,1,10,11,12,13,14,3,4,5,6,7,8,9)
    - 14 DownloadBooks(FileID=0,1,10,11,12,13,14,3,4,5,6,7,8,9)
    - 1 TopWords()

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

      You can visualize the execution of the pipeline from the Luigi scheduler. Select the GetTopBooks task in the task list and press the View Graph button.

      Showing how to view a graph from the Luigi user interface

      Deselect the Hide Done and Upstream Dependencies options.

      Visualizing the execution of the TopWords Task

      It will show the flow of processing that is happening in Luigi.

Open the data/summary.txt file:

• cat data/summary.txt

      You will find the calculated most common words:

      Output

the            64593
and            41650
of             31896
to             31368
a              25265
i              23449
in             19496
it             16282
that           15907
he             14974
...

      In this step, you have defined and used parameters to customize the execution of your tasks. You have generated a summary of the most common words for a set of books.

      Find all the code for this tutorial in this repository.

      Conclusion

      This tutorial has introduced you to using the Luigi data processing pipeline and its major features including tasks, parameters, configuration parameters, and the Luigi scheduler.

Luigi supports connecting to a large number of common data sources out of the box. You can also scale it to run large, complex data pipelines. This provides a powerful framework to start solving your data processing challenges.

      For more tutorials, check out our Data Analysis topic page and Python topic page.




      How To Set Up a Continuous Deployment Pipeline with GitLab CI/CD on Ubuntu 18.04


      The author selected the Free and Open Source Fund to receive a donation as part of the Write for DOnations program.

      Introduction

      GitLab is an open source collaboration platform that provides powerful features beyond hosting a code repository. You can track issues, host packages and registries, maintain Wikis, set up continuous integration (CI) and continuous deployment (CD) pipelines, and more.

      In this tutorial you’ll build a continuous deployment pipeline with GitLab. You will configure the pipeline to build a Docker image, push it to the GitLab container registry, and deploy it to your server using SSH. The pipeline will run for each commit pushed to the repository.

      You will deploy a small, static web page, but the focus of this tutorial is configuring the CD pipeline. The static web page is only for demonstration purposes; you can apply the same pipeline configuration using other Docker images for the deployment as well.

      When you have finished this tutorial, you can visit http://your_server_IP in a browser for the results of the automatic deployment.

      Prerequisites

      To complete this tutorial, you will need:

      Step 1 — Creating the GitLab Repository

      Let’s start by creating a GitLab project and adding an HTML file to it. You will later copy the HTML file into an Nginx Docker image, which in turn you’ll deploy to the server.

      Log in to your GitLab instance and click New project.

      The new project button in GitLab

      1. Give it a proper Project name.
      2. Optionally add a Project description.
      3. Make sure to set the Visibility Level to Private or Public depending on your requirements.
4. Finally, click Create project.

      The new project form in GitLab

      You will be redirected to the Project’s overview page.

      Let’s create the HTML file. On your Project’s overview page, click New file.

      The new file button on the project overview page

      Set the File name to index.html and add the following HTML to the file body:

      index.html

      <html>
      <body>
      <h1>My Personal Website</h1>
      </body>
      </html>
      

      Click Commit changes at the bottom of the page to create the file.

      This HTML will produce a blank page with one headline showing My Personal Website when opened in a browser.

      Dockerfiles are recipes used by Docker to build Docker images. Let’s create a Dockerfile to copy the HTML file into an Nginx image.

      Go back to the Project’s overview page, click the + button and select the New file option.

      New file option in the project's overview page listed in the plus button

      Set the File name to Dockerfile and add these instructions to the file body:

      Dockerfile

      FROM nginx:1.18
      COPY index.html /usr/share/nginx/html
      

      The FROM instruction specifies the image to inherit from—in this case the nginx:1.18 image. 1.18 is the image tag representing the Nginx version. The nginx:latest tag references the latest Nginx release, but that could break your application in the future, which is why fixed versions are recommended.

      The COPY instruction copies the index.html file to /usr/share/nginx/html in the Docker image. This is the directory where Nginx stores static HTML content.

      Click Commit changes at the bottom of the page to create the file.
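
If you have Docker installed on a machine with a local clone of this repository, you can optionally verify the image before wiring up the pipeline (the image name personal-website is arbitrary):

• docker build -t personal-website .
• docker run --rm -p 8080:80 personal-website

Visiting http://localhost:8080 in a browser would then show the page.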

      In the next step, you’ll configure a GitLab runner to keep control of who gets to execute the deployment job.

      Step 2 — Registering a GitLab Runner

      In order to keep track of the environments that will have contact with the SSH private key, you’ll register your server as a GitLab runner.

      In your deployment pipeline you want to log in to your server using SSH. To achieve this, you’ll store the SSH private key in a GitLab CI/CD variable (Step 5). The SSH private key is a very sensitive piece of data, because it is the entry ticket to your server. Usually, the private key never leaves the system it was generated on. In the usual case, you would generate an SSH key on your host machine, then authorize it on the server (that is, copy the public key to the server) in order to log in manually and perform the deployment routine.
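
For reference, that manual routine typically looks something like this (the key path, comment, and user name here are illustrative):

• ssh-keygen -t ed25519 -f ~/.ssh/gitlab_deploy -C "deployment key"
• ssh-copy-id -i ~/.ssh/gitlab_deploy.pub sammy@your_server_ip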

      Here the situation changes slightly: You want to grant an autonomous authority (GitLab CI/CD) access to your server to automate the deployment routine. Therefore the private key needs to leave the system it was generated on and be given in trust to GitLab and other involved parties. You never want your private key to enter an environment that is not either controlled or trusted by you.

      Besides GitLab, the GitLab runner is yet another system that your private key will enter. For each pipeline, GitLab uses runners to perform the heavy work, that is, execute the jobs you have specified in the CI/CD configuration. That means the deployment job will ultimately be executed on a GitLab runner, hence the private key will be copied to the runner such that it can log in to the server using SSH.

      If you use unknown GitLab Runners (for example, shared runners) to execute the deployment job, then you’d be unaware of the systems getting in contact with the private key. Even though GitLab runners clean up all data after job execution, you can avoid sending the private key to unknown systems by registering your own server as a GitLab runner. The private key will then be copied to the server controlled by you.

Start by logging in to your server:

      • ssh sammy@your_server_IP

      In order to install the gitlab-runner service, you’ll add the official GitLab repository. Download and inspect the install script:

      • curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh > script.deb.sh
      • less script.deb.sh

Once you are satisfied with the safety of the script, run the installer:

      • sudo bash script.deb.sh

      It may not be obvious, but you have to enter your non-root user’s password to proceed. When you execute the previous command, the output will be similar to:

      Output

[sudo] password for sammy:
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100  5945  100  5945    0     0   8742      0 --:--:-- --:--:-- --:--:--  8729

      When the curl command finishes, you will receive the following message:

      Output

      The repository is setup! You can now install packages.

Next, install the gitlab-runner service:

      • sudo apt install gitlab-runner

      Verify the installation by checking the service status:

      • systemctl status gitlab-runner

You will see active (running) in the output:

      Output

● gitlab-runner.service - GitLab Runner
         Loaded: loaded (/etc/systemd/system/gitlab-runner.service; enabled; vendor preset: enabled)
         Active: active (running) since Mon 2020-06-01 09:01:49 UTC; 4s ago
       Main PID: 16653 (gitlab-runner)
          Tasks: 6 (limit: 1152)
         CGroup: /system.slice/gitlab-runner.service
                 └─16653 /usr/lib/gitlab-runner/gitlab-runner run --working-directory /home/gitlab-runner --config /etc/gitla

      To register the runner, you need to get the project token and the GitLab URL:

      1. In your GitLab project, navigate to Settings > CI/CD > Runners.
      2. In the Set up a specific Runner manually section, you’ll find the registration token and the GitLab URL. Copy both to a text editor; you’ll need them for the next command. They will be referred to as https://your_gitlab.com and project_token.

      The runners section in the ci/cd settings with the copy token button

Back in your terminal, register the runner for your project:

      • sudo gitlab-runner register -n --url https://your_gitlab.com --registration-token project_token --executor docker --description "Deployment Runner" --docker-image "docker:stable" --tag-list deployment --docker-privileged

      The command options can be interpreted as follows:

• -n executes the register command non-interactively (all parameters are specified as command options).
      • --url is the GitLab URL you copied from the runners page in GitLab.
      • --registration-token is the token you copied from the runners page in GitLab.
      • --executor is the executor type. docker executes each CI/CD job in a Docker container (see GitLab’s documentation on executors).
      • --description is the runner’s description, which will show up in GitLab.
      • --docker-image is the default Docker image to use in CI/CD jobs, if not explicitly specified.
      • --tag-list is a list of tags assigned to the runner. Tags can be used in a pipeline configuration to select specific runners for a CI/CD job. The deployment tag will allow you to refer to this specific runner to execute the deployment job.
• --docker-privileged executes the Docker container created for each CI/CD job in privileged mode. A privileged container has access to all devices on the host machine and has nearly the same access to the host as processes running outside containers (see Docker’s documentation about runtime privilege and Linux capabilities). The reason for running in privileged mode is so you can use Docker-in-Docker (dind) to build a Docker image in your CI/CD pipeline. It is good practice to give a container only the minimum requirements it needs, and running in privileged mode is a requirement here in order to use Docker-in-Docker. Be aware that you registered the runner for this specific project only, where you are in control of the commands being executed in the privileged container.

      After executing the gitlab-runner register command, you will receive the following output:

      Output

      Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

      Verify the registration process by going to Settings > CI/CD > Runners in GitLab, where the registered runner will show up.

      The registered runner in the runners section of the ci/cd settings

      In the next step you’ll create a deployment user.

      Step 3 — Creating a Deployment User

You are going to create a non-sudo user dedicated to the deployment task, so that its power is limited and the deployment takes place in an isolated user space. You will later configure the CI/CD pipeline to log in to the server as that user.

On your server, create a new user:

      • sudo adduser deployer

      You’ll be guided through the user creation process. Enter a strong password and, optionally, any further user information you want to specify. Finally, confirm the user creation with Y.

      Add the user to the Docker group:

      • sudo usermod -aG docker deployer

      This permits deployer to execute the docker command, which is required to perform the deployment.

      In the next step you’ll create an SSH key to be able to log in to the server as deployer.

      Step 4 — Setting Up an SSH Key

      You are going to create an SSH key for the deployment user. GitLab CI/CD will later use the key to log in to the server and perform the deployment routine.

Let’s start by switching to the newly created deployer user, for whom you’ll generate the SSH key:

      • su deployer

      You’ll be prompted for the deployer password to complete the user switch.

      Next, generate a 4096-bit SSH key. It is important to answer the questions of the ssh-keygen command correctly:

      1. First question: answer it with ENTER, which stores the key in the default location (the rest of this tutorial assumes the key is stored in the default location).
2. Second question: configures a passphrase to protect the SSH private key (the key used for authentication). If you specify a passphrase, you’ll have to enter it each time the private key is used. In general, a passphrase adds another layer of security to SSH keys, which is good practice: somebody in possession of the private key would also require the passphrase to use it. For the purposes of this tutorial, it is important that you have an empty passphrase, because the CI/CD pipeline executes non-interactively and therefore does not allow you to enter a passphrase.

To summarize, run the following command and confirm both questions with ENTER to create a 4096-bit SSH key and store it in the default location with an empty passphrase:

      • ssh-keygen -b 4096

      To authorize the SSH key for the deployer user, you need to append the public key to the authorized_keys file:

      • cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

~ is short for the user’s home directory in Linux. The cat program prints the contents of a file; here you use the >> operator to redirect the output of cat and append it to the authorized_keys file.

      In this step you have created an SSH key pair for the CI/CD pipeline to log in and deploy the application. Next you’ll store the private key in GitLab to make it accessible during the pipeline process.

      Step 5 — Storing the Private Key in a GitLab CI/CD Variable

      You are going to store the SSH private key in a GitLab CI/CD file variable, so that the pipeline can make use of the key to log in to the server.

      When GitLab creates a CI/CD pipeline, it will send all variables to the corresponding runner and the variables will be set as environment variables for the duration of the job. In particular, the values of file variables are stored in a file and the environment variable will contain the path to this file.

While you’re in the variables section, you’ll also add variables for the server IP and the server user, which tell the pipeline which server to connect to and which user to log in as.

Start by showing the SSH private key:

      • cat ~/.ssh/id_rsa

      Copy the output to your clipboard. Make sure to copy everything, including the BEGIN and END lines:

      ~/.ssh/id_rsa

      -----BEGIN RSA PRIVATE KEY-----
      ...
      -----END RSA PRIVATE KEY-----
      

Now navigate to Settings > CI/CD > Variables in your GitLab project and click Add Variable. Fill out the form as follows:

      • Key: ID_RSA
      • Value: Paste your SSH private key from your clipboard with CTRL+V
      • Type: File
      • Environment Scope: All (default)
      • Protect variable: Checked
      • Mask variable: Unchecked

Note: The variable can’t be masked because it does not meet the regular expression requirements (see GitLab’s documentation about masked variables). However, the private key will never appear in the console log, which makes masking unnecessary.

      A file containing the private key will be created on the runner for each CI/CD job and its path will be stored in the $ID_RSA environment variable.
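
      To make this concrete, here is a hypothetical excerpt from a job script, for illustration only; the exact temporary path is chosen by the runner and will differ:

      job script excerpt (illustrative)

      # $ID_RSA holds a path to a file, not the key itself
      echo $ID_RSA        # prints something like /builds/your_project.tmp/ID_RSA
      ssh -i $ID_RSA ...  # ssh reads the private key from that file
      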

      Create another variable with your server IP. Click Add Variable and fill out the form as follows:

      • Key: SERVER_IP
      • Value: your_server_IP
      • Type: Variable
      • Environment scope: All (default)
      • Protect variable: Checked
      • Mask variable: Checked

      Finally, create a variable with the login user. Click Add Variable and fill out the form as follows:

      • Key: SERVER_USER
      • Value: deployer
      • Type: Variable
      • Environment scope: All (default)
      • Protect variable: Checked
      • Mask variable: Checked

You have now stored the private key in a GitLab CI/CD variable, which makes the key available during pipeline execution. In the next step, you’ll move on to configuring the CI/CD pipeline.

      Step 6 — Configuring the .gitlab-ci.yml File

You are going to configure the GitLab CI/CD pipeline. The pipeline will build a Docker image and push it to the container registry. GitLab provides a container registry for each project. You can explore the container registry by going to Packages & Registries > Container Registry in your GitLab project (read more in GitLab’s container registry documentation). The final step in your pipeline is to log in to your server, pull the latest Docker image, remove the old container, and start a new one.

      Now you’re going to create the .gitlab-ci.yml file that contains the pipeline configuration. In GitLab, go to the Project overview page, click the + button and select New file. Then set the File name to .gitlab-ci.yml.

      (Alternatively you can clone the repository and make all following changes to .gitlab-ci.yml on your local machine, then commit and push to the remote repository.)

To begin, add the following:

      .gitlab-ci.yml

      stages:
        - publish
        - deploy
      

Each job is assigned to a stage. Jobs assigned to the same stage run in parallel (if there are enough runners available). Stages are executed in the order they are specified: here, the publish stage runs first and the deploy stage second. A successive stage only starts when the previous stage has finished successfully (that is, all of its jobs have passed). Stage names can be chosen arbitrarily.

If you want to combine this CD configuration with an existing CI pipeline that tests and builds the app, add the publish and deploy stages after your existing stages, so that the deployment only takes place if the tests pass (a sketch follows below).
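
      For example, an extended stage list could look like this; the test stage here is hypothetical and not part of this tutorial’s pipeline:

      .gitlab-ci.yml (sketch)

      stages:
        - test
        - publish
        - deploy
      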

Next, add the following to your .gitlab-ci.yml file:

      .gitlab-ci.yml

      . . .
      variables:
        TAG_LATEST: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:latest
        TAG_COMMIT: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:$CI_COMMIT_SHORT_SHA
      

The variables section defines environment variables that will be available in the context of a job’s script section. These variables will be available as usual Linux environment variables; that is, you can reference them in the script by prefixing them with a dollar sign, such as $TAG_LATEST. GitLab creates some predefined variables for each job that provide context-specific information, such as the branch name or the commit hash the job is working on (read more about predefined variables). Here you compose two environment variables out of predefined variables. The predefined variables involved are:

      • CI_REGISTRY_IMAGE: Represents the URL of the container registry tied to the specific project. This URL depends on the GitLab instance. For example, registry URLs for gitlab.com projects follow the pattern: registry.gitlab.com/your_user/your_project. But since GitLab will provide this variable, you do not need to know the exact URL.
• CI_COMMIT_REF_NAME: The branch or tag name for which the project is built.
      • CI_COMMIT_SHORT_SHA: The first eight characters of the commit revision for which the project is built.

      Both of the variables are composed of predefined variables and will be used to tag the Docker image.

TAG_LATEST adds the latest tag to the image. This is a common strategy to provide a tag that always represents the latest release. For each deployment, the latest image in the container registry is overwritten with the newly built Docker image.

      TAG_COMMIT, on the other hand, uses the first eight characters of the commit SHA being deployed as the image tag, thereby creating a unique Docker image for each commit. You will be able to trace the history of Docker images down to the granularity of Git commits. This is a common technique when doing continuous deployments, because it allows you to quickly deploy an older version of the code in case of a defective deployment.
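
      With hypothetical values plugged in (a gitlab.com project sammy/my-project, branch master, short commit SHA 1a2b3c4d), the two tags would expand to something like:

      example tag values (hypothetical)

      TAG_LATEST: registry.gitlab.com/sammy/my-project/master:latest
      TAG_COMMIT: registry.gitlab.com/sammy/my-project/master:1a2b3c4d
      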

      As you’ll explore in the coming steps, the process of rolling back a deployment to an older Git revision can be done directly in GitLab.

      $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME specifies the Docker image base name. According to GitLab’s documentation, a Docker image name has to follow this scheme:

      image name scheme

      <registry URL>/<namespace>/<project>/<image>

      $CI_REGISTRY_IMAGE represents the <registry URL>/<namespace>/<project> part and is mandatory because it is the project’s registry root. $CI_COMMIT_REF_NAME is optional but useful to host Docker images for different branches. In this tutorial you will only work with one branch, but it is good to build an extendable structure. In general, there are three levels of image repository names supported by GitLab:

      repository name levels

registry.example.com/group/project:some-tag
      registry.example.com/group/project/image:latest
      registry.example.com/group/project/my/image:rc1

      For your TAG_COMMIT variable you used the second option, where image will be replaced with the branch name.

      Next, add the following to your .gitlab-ci.yml file:

      .gitlab-ci.yml

      . . .
      publish:
        image: docker:latest
        stage: publish
        services:
          - docker:dind
        script:
          - docker build -t $TAG_COMMIT -t $TAG_LATEST .
          - docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY
          - docker push $TAG_COMMIT
          - docker push $TAG_LATEST
      

      The publish section is the first job in your CI/CD configuration. Let’s break it down:

• image is the Docker image to use for this job. The GitLab runner will create a Docker container for each job and execute the script within this container. The docker:latest image ensures that the docker command will be available.
      • stage assigns the job to the publish stage.
      • services specifies Docker-in-Docker—the dind service. This is the reason why you registered the GitLab runner in privileged mode.

The script section of the publish job specifies the shell commands to execute for this job. The working directory will be set to the repository root when these commands are executed.

• docker build ...: Builds the Docker image based on the Dockerfile and tags it with both the commit tag and the latest tag defined in the variables section.
      • docker login ...: Logs Docker in to the project’s container registry. You use the predefined variable $CI_BUILD_TOKEN as an authentication token. GitLab generates the token, and it stays valid for the job’s lifetime.
      • docker push ...: Pushes both image tags to the container registry.

      Following this, add the deploy job to your .gitlab-ci.yml:

      .gitlab-ci.yml

      . . .
      deploy:
        image: alpine:latest
        stage: deploy
        tags:
          - deployment
        script:
          - chmod og= $ID_RSA
          - apk update && apk add openssh-client
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker pull $TAG_COMMIT"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker container rm -f my-app || true"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker run -d -p 80:80 --name my-app $TAG_COMMIT"
      

      Alpine is a lightweight Linux distribution and is sufficient as a Docker image here. You assign the job to the deploy stage. The deployment tag ensures that the job will be executed on runners that are tagged deployment, such as the runner you configured in Step 2.

The script section of the deploy job starts with two configuration commands:

• chmod og= $ID_RSA: Revokes all permissions for group and others on the private key file, so that only the owner can use it. This is a requirement; otherwise, SSH refuses to work with the private key (a quick check follows this list).
      • apk update && apk add openssh-client: Updates Alpine’s package manager (apk) and installs the openssh-client, which provides the ssh command.
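
      If you are curious, og= assigns an empty permission set to group and others, which is equivalent to og-rwx and leaves the owner’s permission bits untouched. You could verify the effect like this (an illustrative session; the owner bits shown in the output are an assumption):

      • chmod og= $ID_RSA
      • ls -l $ID_RSA

      Output

      -rw------- ... ID_RSA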

      Four consecutive ssh commands follow. The pattern for each is:

      ssh connect pattern for all deployment commands

      ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "command"

      In each ssh statement you are executing command on the remote server. To do so, you authenticate with your private key.

The options are as follows (a fully expanded example follows the list):

      • -i stands for identity file and $ID_RSA is the GitLab variable containing the path to the private key file.
• -o StrictHostKeyChecking=no bypasses the question of whether or not you trust the remote host, which cannot be answered in a non-interactive context such as the pipeline.
      • $SERVER_USER and $SERVER_IP are the GitLab variables you created in Step 5. They specify the remote host and login user for the SSH connection.
      • command will be executed on the remote host.
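
      With all variables resolved, a single deployment command would look something like this (all values here are hypothetical; 203.0.113.5 stands in for your server IP and the temporary key path is chosen by the runner):

      expanded ssh command (hypothetical values)

      ssh -i /builds/your_project.tmp/ID_RSA -o StrictHostKeyChecking=no deployer@203.0.113.5 "docker pull registry.your_gitlab.com/your_user/your_project/master:1a2b3c4d"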

      The deployment ultimately takes place by executing these four commands on your server:

      1. docker login ...: Logs Docker in to the container registry.
      2. docker pull ...: Pulls the latest image from the container registry.
3. docker container rm ...: Deletes the existing container if it exists. || true makes sure that the exit code is always successful, even if there was no container running by the name my-app. This guarantees a “delete if exists” routine without breaking the pipeline when the container does not exist (for example, on the first deployment); the behavior is illustrated after this list.
      4. docker run ...: Starts a new container using the latest image from the registry. The container will be named my-app. Port 80 on the host will be bound to port 80 of the container (the order is -p host:container). -d starts the container in detached mode, otherwise the pipeline would be stuck waiting for the command to terminate.
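
      The || true pattern deserves a closer look: in a shell, A || B runs B only if A fails, and true always exits with code 0, so the combined command never reports failure. A quick illustration, assuming no container named my-app exists (docker still prints its error message, but the exit code stays 0):

      • docker container rm -f my-app || true
      • echo $?

      Output

      0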

Note: It may seem odd to use SSH to run these commands on your server, considering that the GitLab runner executing them is the exact same server. Yet it is required, because the runner executes the commands in a Docker container; without SSH you would deploy inside that container instead of on the server. One could argue that instead of using Docker as the runner executor, you could use the shell executor to run the commands on the host itself. But that would constrain your pipeline: the runner would have to be the same server as the one you want to deploy to. This is not a sustainable and extensible solution, because one day you may want to migrate the application to a different server or use a different runner server. In any case, it makes sense to use SSH to execute the deployment commands, whether for technical or migration-related reasons.

Let’s move on by adding the following to the deploy job in your .gitlab-ci.yml:

      .gitlab-ci.yml

      . . .
      deploy:
      . . .
        environment:
          name: production
          url: http://your_server_IP
        only:
          - master
      

GitLab environments allow you to control your deployments from within GitLab. You can examine the environments in your GitLab project by going to Operations > Environments. If the pipeline has not finished yet, there will be no environment available, as no deployment has taken place so far.

      When a pipeline job defines an environment section, GitLab will create a deployment for the given environment (here production) each time the job successfully finishes. This allows you to trace all the deployments created by GitLab CI/CD. For each deployment you can see the related commit and the branch it was created for.

There is also a re-deploy button available that allows you to roll back to an older version of the software. The URL specified in the environment section will be opened when you click the View deployment button.

The only section defines the names of branches and tags for which the job will run. By default, GitLab starts a pipeline for each push to the repository and runs all jobs (provided that the .gitlab-ci.yml file exists). The only section is one way of restricting job execution to certain branches or tags. Here you want to execute the deployment job for the master branch only. To define more complex rules on whether a job should run or not, have a look at the rules syntax (a brief sketch follows).
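
      For reference, if you preferred the newer rules syntax, the only section could be replaced with a condition like the following. This is a sketch only and not part of this tutorial’s final configuration:

      .gitlab-ci.yml (sketch)

      deploy:
      . . .
        rules:
          # run the deploy job only on the master branch
          - if: '$CI_COMMIT_BRANCH == "master"'
      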

Note: In October 2020, GitHub changed its naming convention for the default branch from master to main. Other providers such as GitLab, and the developer community in general, are starting to follow this approach. The term master branch is used in this tutorial to denote the default branch, for which you may have a different name.

      Your complete .gitlab-ci.yml file will look like the following:

      .gitlab-ci.yml

      stages:
        - publish
        - deploy
      
      variables:
        TAG_LATEST: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:latest
        TAG_COMMIT: $CI_REGISTRY_IMAGE/$CI_COMMIT_REF_NAME:$CI_COMMIT_SHORT_SHA
      
      publish:
        image: docker:latest
        stage: publish
        services:
          - docker:dind
        script:
          - docker build -t $TAG_COMMIT -t $TAG_LATEST .
          - docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY
          - docker push $TAG_COMMIT
          - docker push $TAG_LATEST
      
      deploy:
        image: alpine:latest
        stage: deploy
        tags:
          - deployment
        script:
          - chmod og= $ID_RSA
          - apk update && apk add openssh-client
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker pull $TAG_COMMIT"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker container rm -f my-app || true"
          - ssh -i $ID_RSA -o StrictHostKeyChecking=no $SERVER_USER@$SERVER_IP "docker run -d -p 80:80 --name my-app $TAG_COMMIT"
        environment:
          name: production
          url: http://your_server_IP
        only:
          - master
      

Finally, click Commit changes at the bottom of the page in GitLab to create the .gitlab-ci.yml file. Alternatively, if you have cloned the Git repository locally, commit and push the file to the remote.

You’ve created a GitLab CI/CD configuration for building a Docker image and deploying it to your server. In the next step, you will validate the deployment.

      Step 7 — Validating the Deployment

      Now you’ll validate the deployment in various places of GitLab as well as on your server and in a browser.

      When a .gitlab-ci.yml file is pushed to the repository, GitLab will automatically detect it and start a CI/CD pipeline. At the time you created the .gitlab-ci.yml file, GitLab started the first pipeline.

Go to CI/CD > Pipelines in your GitLab project to see the pipeline’s status. If the jobs are still running or pending, wait until they are complete. You will see a Passed pipeline with two green checkmarks, denoting that the publish and deploy jobs ran successfully.

      The pipeline overview page showing a passed pipeline

      Let’s examine the pipeline. Click the passed button in the Status column to open the pipeline’s overview page. You will get an overview of general information such as:

      • Execution duration of the whole pipeline.
      • For which commit and branch the pipeline was executed.
• Related merge requests. If there is an open merge request for the branch, it will show up here.
      • All jobs executed in this pipeline as well as their status.

      Next click the deploy button to open the result page of the deploy job.

      The result page of the deploy job

On the job result page you can see the shell output of the job’s script. This is the place to look when debugging a failed pipeline. In the right sidebar you’ll find the deployment tag you added to this job, and you can see that it was executed on your Deployment Runner.

      If you scroll to the top of the page, you will find the This job is deployed to production message. GitLab recognizes that a deployment took place because of the job’s environment section. Click the production link to move over to the production environment.

      The production environment in GitLab

You will get an overview of all production deployments. So far there has been only a single deployment. For each deployment there is a re-deploy button on the far right. A re-deployment repeats the deploy job of that particular pipeline.

Whether a re-deployment works as intended depends on the pipeline configuration, because it does no more than repeat the deploy job under the same circumstances. Since you configured the pipeline to deploy a Docker image tagged with the commit SHA, a re-deployment will work for your pipeline.

Note: Your GitLab container registry may have an expiration policy. The expiration policy regularly removes older images and tags from the container registry. As a consequence, a deployment older than the expiration policy would fail to re-deploy, because the Docker image for its commit will have been removed from the registry. You can manage the expiration policy in Settings > CI/CD > Container Registry tag expiration policy. The expiration interval is usually set to something high, like 90 days. If you do run into the case of trying to deploy an image that has been removed from the registry due to the expiration policy, you can solve the problem by re-running the publish job of that particular pipeline as well, which will re-create the image for the given commit and push it to the registry.

      Next click the View deployment button, which will open http://your_server_IP in a browser and you should see the My Personal Website headline.

Finally, check the deployed container on your server. Head over to your terminal and log in again if you have disconnected already (this works for both users, sammy and deployer):

      • ssh sammy@your_server_IP

Now list the running containers:

      • docker container ls

      This will list the my-app container:

      Output

CONTAINER ID   IMAGE                                                                           COMMAND                  CREATED       STATUS       PORTS                NAMES
      5b64df4b37f8   registry.your_gitlab.com/your_gitlab_user/your_project/master:your_commit_sha   "nginx -g 'daemon of…"   4 hours ago   Up 4 hours   0.0.0.0:80->80/tcp   my-app

      Read the How To Install and Use Docker on Ubuntu 18.04 guide to learn more about managing Docker containers.

      You have now validated the deployment. In the next step, you will go through the process of rolling back a deployment.

      Step 8 — Rolling Back a Deployment

Next, you’ll update the web page, which creates a new deployment, and then re-deploy the previous deployment using GitLab environments. This covers the use case of rolling back in case of a defective deployment.

      Start by making a little change in the index.html file:

      1. In GitLab, go to the Project overview and open the index.html file.
      2. Click the Edit button to open the online editor.
      3. Change the file content to the following:

      index.html

      <html>
      <body>
      <h1>My Enhanced Personal Website</h1>
      </body>
      </html>
      

      Save the changes by clicking Commit changes at the bottom of the page.

A new pipeline will be created to deploy the changes. In GitLab, go to CI/CD > Pipelines. When the pipeline has completed, you can open http://your_server_IP in a browser to see the updated web page, now showing My Enhanced Personal Website instead of My Personal Website.

When you move over to Operations > Environments > production, you will see the newly created deployment. Now click the re-deploy button of the initial, older deployment:

A list of the deployments of the production environment in GitLab with emphasis on the re-deploy button of the first deployment

      Confirm the popup by clicking the Rollback button.

      The deploy job of that older pipeline will be restarted and you will be redirected to the job’s overview page. Wait for the job to finish, then open http://your_server_IP in a browser, where you’ll see the initial headline My Personal Website showing up again.

      Let’s summarize what you have achieved throughout this tutorial.

      Conclusion

      In this tutorial, you have configured a continuous deployment pipeline with GitLab CI/CD. You created a small web project consisting of an HTML file and a Dockerfile. Then you configured the .gitlab-ci.yml pipeline configuration to:

      1. Build the Docker image.
      2. Push the Docker image to the container registry.
      3. Log in to the server, pull the latest image, stop the current container, and start a new one.

      GitLab will now deploy the web page to your server for each push to the repository.

Furthermore, you have verified a deployment in GitLab and on your server. You have also created a second deployment and rolled back to the first deployment using GitLab environments, which demonstrates how to deal with defective deployments.

At this point you have automated the whole deployment chain. You can now share code changes more frequently with the world and/or your customers. As a result, development cycles are likely to become shorter, as less time is required to gather feedback and publish changes.

As a next step, you could make your service accessible via a domain name and secure the communication with HTTPS, for which How To Use Traefik as a Reverse Proxy for Docker Containers is a good follow-up.


