Creating a single secure Airflow instance on Ubuntu 18.04

Installing Airflow beyond the basics is pretty involved; these are my notes on how. This note assumes the reader wants to access Airflow from the public web and:

  • has a domain name to use
  • knows how to create an Ubuntu instance in the cloud (AWS, GCE, etc.) with sudo access
  • knows how to configure a firewall for that instance

Create a Linux user for Airflow

We will create a user named airflow and use it to install and set up the services. It is better to isolate this user from the one we normally log in with.

On Google Compute Engine, this is as simple as logging in as a new user:

$ gcloud compute ssh airflow@airflow

Or create it manually by following [How To Create a Sudo User on Ubuntu Quickstart | DigitalOcean].
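
For reference, a minimal sketch of the manual route from that guide (assuming the username airflow):

$ sudo adduser airflow
$ sudo usermod -aG sudo airflow
$ su - airflow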

From this point on, it is assumed that we are logged in as the user airflow.

Install Airflow

First, install pip3 (Python 3 is already installed on Ubuntu 18.04):

$ sudo apt update
$ sudo apt install python3-pip

Now install Airflow:

# Tell the installer it is OK to pull in the GPL-licensed unidecode dependency
$ export AIRFLOW_GPL_UNIDECODE=yes

# Install the package itself
$ pip3 install apache-airflow

Restart the shell (or log off / log on SSH again) to make sure PATH is updated for pip3-installed binaries. Otherwise, we cannot execute `airflow` from bash.
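
If `airflow` is still not found after logging back in, pip3 places user-installed binaries under ~/.local/bin; adding that to PATH by hand is a reasonable fallback (a sketch, assuming the default bash setup):

$ echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.profile
$ source ~/.profile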

After logging back in, run airflow once to create the `~/airflow` directory:

$ airflow

Create a PostgreSQL database as the backend

Although Airflow uses SQLite by default, that restricts it to only one task at a time, so we should go ahead and set up a proper database backend.

$ sudo apt install postgresql

Then create the database, a user, and their password. psql authenticates as the postgres system user, so we need to sudo as that user:

$ sudo -u postgres psql -c "create database airflow"
$ sudo -u postgres psql -c "create user airflow with encrypted password 'mypass'";
$ sudo -u postgres psql -c "grant all privileges on database airflow to airflow";

After that, install the Airflow extras package for PostgreSQL support, plus the driver:

$ pip3 install 'apache-airflow[postgres]'
$ pip3 install psycopg2

Change the Airflow config to connect to the newly created database:

### vi ~/airflow/airflow.cfg ###

sql_alchemy_conn = postgresql+psycopg2://airflow:mypass@localhost/airflow
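
While we are in airflow.cfg, note that Postgres alone does not unlock parallelism; the executor setting in the [core] section also has to move off the default SequentialExecutor. A sketch of both lines together (same config file, default layout assumed):

[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:mypass@localhost/airflow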

Run this command to initialize the database:

$ airflow initdb
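
A quick way to confirm the metadata tables landed in Postgres rather than SQLite (optional):

$ sudo -u postgres psql -d airflow -c "\dt"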

Test Run

At this point we should do a test run to verify that Airflow works. Make sure port 8080 is open:

$ airflow webserver -p 8080

In order to start running a job, the scheduler also needs to run in the foreground. Log in with another SSH session, then execute:

$ airflow scheduler

To test-run a job, go to http://<yoursite>:8080. Don't forget to turn the DAG “ON” before clicking run.
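
The same can be done from the CLI with one of the bundled example DAGs (assuming load_examples is left at its default of True):

$ airflow unpause example_bash_operator
$ airflow trigger_dag example_bash_operator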

Create Service with Systemd

This lets Airflow run in the background and start up automatically with the server.

First, copy the default systemd service files from the Airflow GitHub repo:

$ sudo curl -o /etc/systemd/system/airflow-webserver.service https://raw.githubusercontent.com/apache/airflow/master/scripts/systemd/airflow-webserver.service

$ sudo curl -o /etc/systemd/system/airflow-scheduler.service https://raw.githubusercontent.com/apache/airflow/master/scripts/systemd/airflow-scheduler.service

The default scripts are meant for CentOS/Red Hat, so we need to adjust some parameters.

#############################################################
### sudo vi /etc/systemd/system/airflow-webserver.service ### 
#############################################################

# EnvironmentFile=/etc/sysconfig/airflow (comment out this line)
Environment="PATH=/home/airflow/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/home/airflow/.local/bin/airflow webserver --pid /home/airflow/airflow-webserver.pid

############################################################# 
### sudo vi /etc/systemd/system/airflow-scheduler.service ### 
#############################################################

# EnvironmentFile=/etc/sysconfig/airflow (comment out this line)
Environment="PATH=/home/airflow/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

ExecStart=/home/airflow/.local/bin/airflow scheduler

After the service files are edited, reload the systemd daemon:

$ sudo systemctl daemon-reload

Then start the services:

$ sudo systemctl start airflow-webserver
$ sudo systemctl start airflow-scheduler

We can check the status of each service with:

$ sudo systemctl status airflow-webserver
$ sudo systemctl status airflow-scheduler
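
If either service fails to start, the systemd journal usually shows why:

$ sudo journalctl -u airflow-webserver -n 50 --no-pager
$ sudo journalctl -u airflow-scheduler -n 50 --no-pager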

If all is well, enable these two services to start at boot:

$ sudo systemctl enable airflow-webserver
$ sudo systemctl enable airflow-scheduler

Secure with Nginx and SSL

Although Airflow can do SSL by itself, it is probably better to put it behind an Nginx proxy so that the certs are taken care of automatically by Let's Encrypt.

This is just a shorthand note of https://www.digitalocean.com/community/tutorials/how-to-secure-nginx-with-let-s-encrypt-on-ubuntu-18-04

First, install and enable Nginx. Make sure port 80 is open.

$ sudo apt install nginx

# Verify that nginx works by going to http://<yoursite>

$ sudo systemctl enable nginx

Create an Nginx config to proxy port 80 -> 8080.

### sudo vi /etc/nginx/sites-available/airflow ### 

server {
    listen 80;
    server_name <your server name>;

    location / {
        proxy_pass http://localhost:8080;
        proxy_set_header Host $host;
        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Then replace the default config with this one:

$ sudo rm /etc/nginx/sites-enabled/default
$ sudo ln -s /etc/nginx/sites-available/airflow /etc/nginx/sites-enabled/airflow

# Run to check that nginx configs are correct
$ sudo nginx -t

# Reload the config, no need for restart
$ sudo systemctl reload nginx

After that, modify the Airflow config to enable the proxy fix:

### vi ~/airflow/airflow.cfg ###

enable_proxy_fix = True

###
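
If links generated by the UI still point at port 8080 after this, setting base_url in the same [webserver] section may also help (an assumption; behavior depends on the Airflow version):

base_url = https://<yoursite>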

# Restart airflow webserver
$ sudo systemctl restart airflow-webserver

Verify by going to http://<yoursite> (without the port 8080). It should be proxied correctly.

At this point we can drop port 8080 from the firewall.
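
If the instance's own firewall is managed with ufw (cloud-level firewalls differ), the equivalent would be something like:

$ sudo ufw delete allow 8080
$ sudo ufw allow 'Nginx Full'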

SSL with Certbot

Make sure port 443 (HTTPS) is open:

$ sudo add-apt-repository ppa:certbot/certbot
$ sudo apt install python-certbot-nginx

$ sudo certbot --nginx -d www.yourwebsite.com

Answer the prompts. When asked “Please choose whether or not to redirect HTTP traffic to HTTPS, removing HTTP access”, choose the redirect option (2).

Verify by going to http://<yoursite> (without the port 8080). It should get redirected to https://<yoursite> and the website should be displayed correctly.
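
Certbot sets up automatic renewal; a dry run confirms the renewal path works:

$ sudo certbot renew --dry-run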

Protect with simple password auth

Airflow has a few auth backends. The simplest one lets us add username/password users via the command line.

Install flask-bcrypt (the manual does not mention this):

$ pip3 install flask-bcrypt

Then edit the config file:

### vi ~/airflow/airflow.cfg ### 

[webserver]
authenticate = True
auth_backend = airflow.contrib.auth.backends.password_auth

####

$ sudo systemctl restart airflow-webserver

Create an Airflow user from the command line:

# navigate to the airflow installation directory
$ cd ~/airflow
$ python3

import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser

# Build the user record
user = PasswordUser(models.User())
user.username = 'new_user_name'
user.email = '[email protected]'
user.password = 'set_the_password'

# Persist it to the metadata database
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
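
Back at https://<yoursite>, the webserver should now present a login page that accepts the credentials created above.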