Deploy ClickHouse DBMS at the Edge with StackPath

Resource

Topic

Date

Mar 31, 2020

Author

Businesses use data analytics to extract meaningful, decision-making information for their day-to-day operations. With analytics we can discover hidden relations, find unexpected patterns, learn about customer behavior, understand the past, and predict the future.

In this tutorial, we’ll learn about ClickHouse, a distributed column-based database for real-time analytics. We will explore how it works and take it for a test drive using StackPath Containers, a distributed edge compute product that lets you deploy images to multiple points of presence to reduce latency and bandwidth.

Before diving into the tutorial, first we’ll provide an overview of column-based databases, ClickHouse, and the benefits of running ClickHouse on StackPath.

Column-Based Databases

Compared with traditional row-based databases, column-based databases have a subtle yet profound difference in how data is stored on disk.

A traditional database stores data as rows.

ID	COMPANY	CEO_NAME	YEAR_FROM	YEAR_TO
1	Microsoft	Bill Gates	1975	2000
2	Microsoft	Satya Nadella	2014	—
3	Apple	Steve Cook	2011	—
4	Facebook	M. Zuckerberg	2004	—
5	Apple	Steve Jobs	1997	2001

A column-based database serialize them by columns.

COMPANY	Microsoft (1,2)	Apple (3,5)	Facebook (4)
CEO_NAME	Bill Gates (1)	Satya Nadella (2)	Steve Cook (3)	M. Zuckerberg (4)	Steve Jobs (5)
YEAR_FROM	1975 (1)	2014 (2)	2011 (3)	2014 (4)	1997 (5)
YEAR_TO	2000 (1)	2011 (5)

Our expected workload determines what type of database we should choose. Row I/O is designed to get all the information for a small set of records while column I/O is designed to get a small subset of columns for a huge set of records (or even a whole table).

Now, imagine we ask the server to calculate how many years each of the persons on the tables above were CEOs of their respective companies. A row database has to retrieve each row, then get to the columns with the relevant years and discard the rest. A column database only needs two I/O operations (regardless of the size of the table): one for each date column. No data is discarded. A column system will perform much better in this case and scales much better with large tables.

As a result of their design, column databases have the following unique properties.

Efficient column aggregation and materialized views
Wide tables with optional columns pose no problems; on the contrary, they are encouraged
Because similar data compress best, column orientation uses less space on disk than rows
Due to less reliance on locking, column architectures can run with a much higher degree of parallelization
Batch inserts are more efficient with columns than with rows
Column databases have less reliance on indexes which reduces overhead in the system.

ClickHouse: a Distributed Column-Based DBMS

ClickHouse is a distributed database management system (DBMS) created by Yandex, the Russian Internet giant and the second-largest web analytics platform in the world. ClickHouse is available as open-source software under the Apache 2.0 License.

Beyond Yandex, many other large companies have adopted ClickHouse for their analytics, including Cisco, CERN, Bloomberg, and Spotify.

The main features of ClickHouse include:

True column-based DBMS
Horizontal scalability
Multi-petabyte scale support
Massive parallel processing on a single node
Distributed queries using multiple nodes
Support of approximated calculations through sampling
Fault tolerance and no-downtime upgrades
A subset of the SQL Language, feature-rich with analytics and statistics functions

ClickHouse supports two mechanisms to scale out databases. Sharding distributes big tables among multiple machines and requests get automatically parallelized. Replication creates redundant copies of the data and can be used to distribute and filter data to different data centers. Also, ClickHouse supports multi-master replication.

Is ClickHouse right for your project?

ClickHouse IS good for:

OLAP (Online Analytical Processing) and OLEP (Online Event Processing) workloads, as long as they don’t need full-text search
Data Warehouse: ClickHouse can store and process petabytes of data and build customized reports on the fly.
Big tables: tables with hundreds or thousands of columns
Applications that need a high read throughput
Applications that make heavy use of aggregated columns or materialized views

While ClickHouse IS NOT good for:

OLTP (Online Transactional Processing) workloads: ClickHouse doesn’t support full-fledged transactions.
Applications that need to insert new data batches greater than once per second per clickHouse worker
Applications with complex or random write or update activity
Applications that need full-text search

Running ClickHouse on StackPath

Databases have unique requirements that can clash with common cloud offerings:

Cost: high per-GB costs can make downloads and uploads expensive
Bandwidth: big databases can take a long time to load or export
Latency: round trip times can add unacceptable latencies to the queries

Edge computing solves these issues by moving servers as close as possible to customers and sources of data. As for StackPath, it offers the best of both worlds: the convenience and flexibility of the cloud with on premise-like performance.

Single Instance Configuration

In this section we’ll deploy a single ClickHouse edge computing node in order to get familiar with the system. Then, we’ll learn how to implement a cluster.

Prerequisites

First, we need to create a custom Docker image with our private config. Before continuing, ensure you have the following things.

Docker installed on your workstation. On Mac and Windows, we recommend installing Docker Desktop. On Linux machines install Docker from the repository.
A Docker Hub account to store the images

Prepare the Docker image

Yandex supplies two ready-to-use Docker images:

clickhouse-server is the server component. The server stores data, accepts client connections, and processes requests.
clickhouse-client is the official command-line client, much like psql or mysql.

To download the images to your machine, use the following commands.

docker pull yandex/clickhouse-server
docker pull yandex/clickhouse-client

By default, the ClickHouse server comes with a single blank-password user called default. For testing purposes, having a blank password is not a problem, but since we’re going to deploy the node in production, we need to secure it.

To customize the user settings, we need to change a file called users.xml inside the clickhouse-server image.

To do this, create a directory to store the project files.

mkdir my-clickhouse-deployment
cd my-clickhouse-deployment

Start an instance of the container so we can copy out the file.

docker create -it --name clickhouse-server clickhouse server

Copy the file from the container to your machine. (Don’t forget the dot at the end of the command.)

cp clickhouse-server:/etc/clickhouse-server/users.xml

Open users.xml with a text editor and locate the following lines.

ClickHouse supports several password settings:

Plaintext: insecure and not recommended
Double SHA1: uses a cryptographic hash function to make the password harder to guess for anyone looking at our image. Double SHA1 strings go inside the
SHA256: arguably the most secure alternative and the one we will use. This option uses the

To generate a SHA256 password in Linux or Mac, execute the following command in a new terminal.

PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-'

The first line printed is the password, the second is the encoded SHA256 string. Alternatively, you can use an online SHA256 generator to obtain the string. In either case, keep both the password and the SHA string at hand as we’ll need them both soon.

Going back to users.xml:

Remove the line to get rid of the blank password. Then add the following line with your SHA256 string.

YOUR_SHA254_HEX_STRING

For example:


    
    78bb49329e64a3df1ea5a48fb09ce22ed5223171cf140e3b08683ab3926a0b9b

        . . .

Build the image

Create a file called Dockerfile with the following content.

FROM yandex/clickhouse-server
COPY users.xml /etc/clickhouse-server

Then build the image.

docker build -t my-clickhouse-server .

To test that it’s working correctly, run the following.

docker run -it -d --name my-server my-clickhouse-server

Now, try connecting using your password.

docker run -it --rm --link clickhouse:clickhouse-server yandex/clickhouse-client --host clickhouse-server --user default --password YOUR_PASSWORD

Push image to registry

Before we can use the image in StackPath, we have to send it to Docker Hub. To do this, first log in to Docker Hub with your account.

$ docker login -u YOUR_DOCKERHUB_USERNAME

Build the image using your Docker Hub username as a prefix.

$ docker build -t YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest

Push the image to Docker Hub.

$ docker push YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest

Deploy to StackPath

Once we have pushed the image to the Docker Registry, we can deploy it on StackPath. Head over to StackPath and login or sign up for a new account. Then click Create and select Workload.

We’ll name the workload “clickhouse-servers” and set the type to Container. For Image, type the address to your Docker image: YOUR_DOCKERHUB_USERNAME/clickhouse-server:latest

If you are using a private repository or a different registry, check the Add Image Pull Credentials box and type in the pull details.

Click Continue to Settings.

If you want the servers to be accessible to everyone, add the following ports under Public ports. This will assign Public IPs and open the ports to the Internet.

All containers get a Private IP assigned automatically.

Port 8123 TCP is the default HTTP API Endpoint
Port 9000 TCP is the default port for the native clients

Click Continue to Spec.

In Spec, choose the machine size. ClickHouse developers recommend a minimum of 4GB of RAM to allow for complex queries. But if you choose the smallest machine you can always scale the machine up with a few clicks later on.

Next we’ll need a volume to persist our data. Uunder Additional volume, enter /var/lib/clickhouse and set its size to your requirements.

Under Deployment Target, group servers with similar scaling needs. Type a descriptive name like “clickhouse-servers”. Then select the PoP nearest to you or your customers, set the number of instances to 1, and click Create Workload.

Wait a few seconds until the instance is on the Running state.

Test the connection

The following test will only work if you have opened ports 8123 and 9000.

First, click on the running instance to get its details. Then scroll down to the Instance details section and copy the Public IP address.

Next, open a terminal and try the /ping endpoint with Curl or browser. If the server is running it should respond with Ok.

$ curl -w "n" YOUR_PUBLIC_IP:8123/ping
Ok

We can also connect with ClickHouse Native Client.

docker run -it yandex/clickhouse-client --host YOUR_PUBLIC_IP --user default --password YOUR_PASSWORD

Now you can run ClickHouse SQL commands on your new node.

clickhouse-servers-clickhouse-servers-jfk-0 :) create database my_db;

CREATE DATABASE my_db

Ok.

0 rows in set. Elapsed: 0.393 sec.

clickhouse-servers-clickhouse-servers-jfk-0 :) show databases;

SHOW DATABASES

┌─name────┐
│ default │
│ my_db   │
│ system  │
└─────────┘

3 rows in set. Elapsed: 0.326 sec.

Sharded Cluster Configuration

In this section, we’ll deploy two more database instances to create a 3-node edge compute cluster.

Create more instances

First, we’ll create the remaining instances so that we can get their Private IPs.

On the left navigation menu, click Workloads and select your container workload. Then click the gear icon to edit the workload configuration.

Click the Edit icon next to Spec, Storage & Deployment and increase the number of Instances per PoP to 3. Then click Save Changes.

StackPath will create the remaining instances in only a few seconds.

When the instances start running, copy the Private IP address of each one.

Configure ClickHouse shards

We have to tell ClickHouse where all the nodes are. To do this, we need to get the config.xml file from the ClickHouse image.

Open a terminal and type the following command to copy the file out.

docker cp clickhouse-server:/etc/clickhouse-server/config.xml

Then, open the file and locate the remote_servers line.

Remove all the lines inside and enter the following config, replacing the IP addresses with your servers’ Private IPs.


    
        
            
                SERVER_1_PRIVATE_IP
                9000
            
        
        
            
                SERVER_2_PRIVATE_IP
                9000
            
        
        
            
                SERVER_3_PRIVATE_IP
                9000

Edit your Dockerfile to add the modified config:

FROM yandex/clickhouse-server
COPY users.xml /etc/clickhouse-server
COPY config.xml /etc/clickhouse-server

Build a new version of the image. (Notice how we’re using a different tag here.)

docker build -t YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded

Push the new image to the Docker Registry like we did before.

docker push YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded

Update the StackPath workload

The final step is to tell StackPath to use our new configuration. To do this, go to your StackPath Workload and click the gear icon again to change its configuration. Edit the Image name with the new tag: YOUR_DOCKERHUB_USERNAME/clickhouse-server:sharded

Click Save Changes.

StackPath will start recycling the containers. When this is done, you should find all of them running the newest version.

Testing the cluster

Again, we can use Curl or any browser to check each of the ClickHouse nodes. Just replace the IP addresses with your Private IPs.

$ curl -w "n" YOUR_PUBLIC_IP_1:8123/ping
Ok

$ curl -w "n" YOUR_PUBLIC_IP_2:8123/ping
Ok

$ curl -w "n" YOUR_PUBLIC_IP_2:8123/ping
Ok

We can also connect to any of the instances and check the cluster health. Run the following query to see the cluster status.

select * from system.clusters;

For example:

$ docker run -it yandex/clickhouse-client --host YOUR_PUBLIC_IP --user default --password YOUR_PASSWORD

ClickHouse client version 20.3.2.1 (official build).
Connecting to 151.139.31.12:9000 as user default.
Connected to ClickHouse server version 20.3.2 revision 54433.

clickhouse-servers-clickhouse-servers-jfk-0 :) select * from system.clusters;

SELECT cluster, shard_num, host_address, is_local, errors_count
FROM system.clusters

┌─cluster────────────┬─shard_num─┬─host_address─┬─is_local─┬─errors_count─┐
│ my_sharded_config  │         1 │ 10.128.96.2  │ 0        │ 0            │
│ my_sharded_config  │         2 │ 10.128.96.4  │ 1        │ 0            │
│ my_sharded_config  │         3 │ 10.128.96.3  │ 0        │ 0            │
└────────────────────┴───────────┴──────────────┴──────────┴──────────────┘

3 rows in set. Elapsed: 2.140 sec.

And that’s it! If you made it this far, you have sucessfully deployed a 3-node cluster on StackPath.

Next Steps

For the next step in your road to mastering ClickHouse, try setting up a second cluster on a different PoP and establishing table replication between the two data centers.

If you want to learn more, check out these related docs and projects:

What are edge containers?
ClickHouse SQL Reference
ClickHouse Operations
ClickHouse official JDBC and ODBC
ClickHouse third party drivers

Request a Demo

Schedule time with our edge experts for a StackPath demo tailored to your needs.

Follow SP//

Byte-sized activity and announcements: