Synerise is a Polish company conducting commercial activities and R&D work in the field of Big Data and AI (Artificial Intelligence). We created the Synerise platform to help organizations grow rapidly.
It’s no secret that we’re an engineering-driven company that constantly pushes boundaries with fresh and innovative products. At Synerise, we believe that only organizations that innovate with passion, practice consistency, and perform intellectual experiments will stay relevant and become the leaders in their space. That's how we stay competitive against other companies with strong intellectual property – by producing the best products, based on science.
Today, I want to share a few words about our own data storage engine, which is the core of all Synerise components.
Terrarium is an "all-in-one" database project that solves challenges no other available tool can. With Terrarium, we didn’t need to string together several independent third-party technologies, which would have resulted in both higher system complexity and higher costs. Terrarium enables effective, simultaneous execution of analytical (column) and row queries.
In early 2017 at Synerise, after searching for the perfect solution to analyze and execute business scenarios on large-scale data, we decided, together with the IT Core Team led by Milosz Balus and Marcin Pasternak, to build our own storage engine.
At that time, the landscape of big data technologies was vast, and the market offered many different engines: column stores for analytics and row stores for documents.
We defined the fundamental features that a perfect solution should have:
The above requirements span various types of storage: document stores, wide column stores, RDBMSs, time series DBMSs, and OLAP stores.
Before we started developing our own solution, we set a benchmark and, during tests, tried to use and adapt different kinds of technologies. We tested real use cases with MemSQL, ClickHouse, Tarantool (worker implementation), CitusData, Druid, HBase, Spark SQL, ElasticSearch, Apache Ignite, and Hazelcast. We also tried hybrid setups, such as Druid or ClickHouse for analytics combined with HBase and Spark for direct queries on single objects. After a few months of tests, we shortlisted a few promising technologies along with a custom, in-house implementation of Tarantool for a single worker used in cluster mode. We learned that, to achieve the required performance, data should be local and stored both in memory and on disk. The shortlisted technologies serve different purposes, but they share the substantial advantages we were looking for in our final solution: data locality and in-memory & disk storage for fast computing.
None of the technologies mentioned above met all of our expectations, business needs, and requirements, because of problematic edge cases such as data consistency, schema requirements, the need to write different queries for analytical and direct (row-store) access, performance, redundancy, complexity, and more.
No, there are no solutions "more reliable or better" that would handle the traffic characteristics served by Terrarium while providing the required scalability. However, it is likely that such technologies will begin to emerge in the next few years.
There is currently no single solution available to replace Terrarium. None of the TOP30 (https://db-engines.com/en/ranking) commonly used tools meet the requirements: a solution is needed that efficiently performs analytical queries (e.g., ClickHouse) while offering easy access to individual rows (e.g., Scylla). Terrarium should not be classified as a database in the Row Stores (MySQL, PostgreSQL) / Column Stores (Vertica, Greenplum, ClickHouse) / Wide Column Stores (Cassandra, Scylla, HBase) categories, but as a Cloud-Native Database (https://dzone.com/articles/cloud-native-databases-and-why-should-you-use-them) and as a solution providing a tool to store data in all of the above categories simultaneously. Just as a few years ago there was demand for container orchestration solutions (Amazon ECS, Kubernetes) that allow better use of the resources provided by the cloud (VMs), the growth in the amount of generated and stored data now creates a need for a new category of databases designed to operate in the cloud. This allows us to use the full potential of the cloud (Blob Storage, Spot Instances), which translates into a reduction in data storage and computing costs.
Firstly, to reduce the maintenance costs of the system and to ensure the scalability of the provided solutions.
Secondly, to deliver a unified experience for our clients. The core requirement at Synerise is real-time processing, defined as access to unstructured, heterogeneous data and object metrics within milliseconds. We analyze and deliver information at the same time.
The column store approach is perfect for the data analysis our clients need to make informed decisions. At the same time, after gaining insights from data, our clients need to act on them - for example, by building segments of objects (customers, products, locations, etc.) based on the stream of events generated by those objects and delivering output data to 3rd-party solutions (POS, campaign engines, call centers, etc.) in real time. They require real-time and ad-hoc analyses and data execution at the same time.
Figure 1. Traditional approach for data analytics and business scenario execution
In a traditional legacy setup, this is delivered by separate technologies. First, you look for insights, e.g., some metrics, in one technology. Then you take a data snapshot and deliver it to a separate database engine where you can enrich and act on the data. This approach is not effective in terms of time, effort, or cost. Using multiple technologies creates problems with the maintenance and scalability of the provided solutions. It is also problematic for real-time requirements, e.g., personalized content on a website, counting loyalty points/metrics based on behavioral data, delivering insights to a POS, or serving AI recommendations, churn models, scoring, etc.
Working with our clients, we dive into the tech world and see the demand for technology that combines Column Stores (Vertica, Greenplum, ClickHouse) with Row Stores/Wide Column Stores (Cassandra, Scylla, HBase). The currently available solutions do not meet the requirements (e.g., schema-less storage), and solutions such as Vitess only partially mitigate the current restrictions; as time goes by, this will become even more apparent. Investing in technologies that will be outdated in a few years is pointless. Terrarium was built from scratch to avoid such problems.
Figure 2. High-level architecture definition for Terrarium
One needs very sound reasons to create a completely new database engine, especially considering how much has already been done with regard to column and row stores in general. That’s why we devoted ourselves to studying and grasping the complexity of our use case and its consequences for our technology. Our work resulted in many benefits and led us to develop our own database.
As there are two sides to every story, there were also significant disadvantages to this approach, such as the required engineering effort and the chance that we would be forced to recreate mechanisms responsible for reliability and scalability.
The main features of the Terrarium DBMS are:
Terrarium has some features that can be considered disadvantages:
Terrarium was designed to analyze behavioral data, where data order and time are important for making business decisions.
Efficiency, scalability, combining Column Store and Row Store characteristics, cost optimization (separation of the computation layer from the data storage layer, allowing dynamic adjustment of resource usage depending on load characteristics), and analytical calculation features.
To achieve a high compression ratio for stored data, it was necessary to use a columnar file format. Unfortunately, using existing formats such as Parquet or ORC was not possible due to the need to store schema-less data. To meet these requirements, we decided to develop a proprietary data storage format.
The file consists of column blocks containing up to 1M values. A dictionary of unique values is generated for each column. The dictionary-encoded values are then split into chunks of up to 65k values, and each chunk is dictionary-encoded once again. This solution allows us to achieve a very high compression ratio and selective loading of data. To filter rows containing a specified value, we first search the column dictionary; if we do not find the value we are looking for, there is no need to load the remaining data (chunk dictionaries and chunk values), and we can immediately move on to filtering the columns of the next block. Dictionary encoding with a fixed length for each encoded value also makes it easy to use vector extensions to decompress the data. Compression of the column dictionary, chunk dictionaries, and encoded values can be further improved by enabling generic compression algorithms such as ZSTD, LZ4, or ZLIB.
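To illustrate the idea (this is only a minimal sketch, not Terrarium's actual implementation; the helper names and the use of plain Python objects are our own assumptions), the two-level dictionary encoding described above could look roughly like this in Python:

BLOCK_SIZE = 1_000_000   # up to 1M values per column block
CHUNK_SIZE = 65_536      # up to 65k values per chunk

def dictionary_encode(values):
    # Map each value to a small integer code; return (dictionary, codes).
    dictionary, codes, index = [], [], {}
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def encode_block(block_values):
    # One dictionary for the whole block, then each chunk of codes is encoded again.
    block_dict, block_codes = dictionary_encode(block_values)
    chunks = []
    for i in range(0, len(block_codes), CHUNK_SIZE):
        chunks.append(dictionary_encode(block_codes[i:i + CHUNK_SIZE]))
    return block_dict, chunks

def block_may_contain(block_dict, value):
    # Selective loading: if the value is absent from the block dictionary,
    # the chunk dictionaries and chunk values never need to be read.
    return value in block_dict

In a real format the codes would be written with a fixed width so that vector (SIMD) instructions can decode them, and the dictionaries and codes could additionally be compressed with ZSTD, LZ4, or ZLIB.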
To support writing to HDFS, we decided to store file metadata as a footer. Our file format also allows us to encrypt stored data using generic encryption algorithms.
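To show why a footer works well here, consider a minimal sketch assuming a layout of column blocks, then a serialized metadata footer, then the footer length as the final bytes - a convention similar to Parquet, not necessarily Terrarium's actual on-disk layout:

import json, struct

def write_file(path, column_blocks, metadata):
    # Layout: [column blocks][footer JSON][footer length, 8 bytes].
    # Everything is written sequentially, which suits append-only filesystems like HDFS.
    with open(path, "wb") as f:
        offsets = []
        for block in column_blocks:          # block: bytes of an encoded column block
            offsets.append(f.tell())
            f.write(block)
        metadata["block_offsets"] = offsets
        footer = json.dumps(metadata).encode()
        f.write(footer)
        f.write(struct.pack("<Q", len(footer)))

def read_metadata(path):
    # A reader seeks to the end, reads the footer length, then the footer itself.
    with open(path, "rb") as f:
        f.seek(-8, 2)
        (footer_len,) = struct.unpack("<Q", f.read(8))
        f.seek(-(8 + footer_len), 2)
        return json.loads(f.read(footer_len))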
For our use cases, we prepared a benchmark comparing the Terrarium and ClickHouse engines - we have great respect for that technology.
ClickHouse is an open-source column-oriented DBMS (columnar database management system) for online analytical processing (OLAP).
ClickHouse was developed by the Russian IT company Yandex for the Yandex.Metrica web analytics service. ClickHouse allows analysis of data that is updated in real time. The system is marketed for high performance. ClickHouse is used by the Yandex.Tank load testing tool. ClickHouse was also implemented at CERN’s LHCb experiment to store and process metadata on 10 billion events with over 1000 attributes per event, and Tinkoff Bank uses ClickHouse as a data store for a project.
Hardware
We used Azure virtual machines of the L32s_v2 series with:
For both technologies, we prepared clusters with 10 nodes. The total capacity of a cluster was:
The ClickHouse configuration was tuned to achieve the highest performance:
set max_block_size=262144;
set distributed_group_by_no_merge=0;
set optimize_distributed_group_by_sharding_key=1;
set distributed_aggregation_memory_efficient=1;
set optimize_skip_unused_shards=1;
Engine definition for ClickHouse schema:
dist.col_all_events_124:
ENGINE = Distributed('analytics', 'buffers', 'col_all_events_124', client_id)

buffers.col_all_events_124:
ENGINE = Buffer('default', 'col_all_events_124', 16, 120, 300, 10000, 50000, 10240000, 10240000)

default.col_all_events_124:
ENGINE = CollapsingMergeTree(sign)
PARTITION BY (xxHash32(action) % 29, toYYYYMM(event_date))
ORDER BY (client_id, event_timestamp, client_uuid_hash, event_uuid_hash)
SETTINGS index_granularity = 8192, storage_policy = 'nvme_jbod'
Data Sizes
For both solutions, we inserted 1.3 billion real customer events, such as page visits, transactions, and other client activities.
Query Performance
For test purposes, we defined three types of queries that correspond to real business cases:
Q1 (analytical query) returns the list of clients that fit the segment definition:
Q2 (direct query) is the same as Q1, but checks whether a particular client_id is in the segment.
Q3 (direct query) counts points in the loyalty program.
The segment is defined as follows: clients visited a page whose URI contains "TV" at least three times in a row (each visit completed within 30 seconds of the previous one, within the last 30 days), or they bought a product in the "TVs" category in the last 30 days.
From the first entry into the TV category to the purchase of a product in this category, no more than 30 minutes passed.
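As an illustration only (not the actual Terrarium or ClickHouse query), here is a minimal Python sketch of the first rule of the segment check evaluated over a single client's event stream; the event field names and event types below are our own assumptions:

from datetime import timedelta

def client_in_segment(events, now):
    # events: list of dicts with "type", "uri", "category", "timestamp" (datetime),
    # sorted by timestamp; returns True if the client fits the segment.
    recent = [e for e in events if e["timestamp"] >= now - timedelta(days=30)]

    # Rule A: at least three consecutive page visits whose URI contains "TV",
    # each completed within 30 seconds of the previous one.
    streak, last_ts = 0, None
    for e in recent:
        if e["type"] == "page.visit" and "TV" in e.get("uri", ""):
            streak = streak + 1 if last_ts and e["timestamp"] - last_ts <= timedelta(seconds=30) else 1
            last_ts = e["timestamp"]
            if streak >= 3:
                return True
        else:
            streak, last_ts = 0, None

    # Rule B: bought a product in the "TVs" category in the last 30 days.
    return any(e["type"] == "transaction" and e.get("category") == "TVs" for e in recent)

Q1 would evaluate such a definition for every client, while Q2 evaluates it for a single client_id.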
Query performance summary
Table 1. Query performance with data locality for a single unit in Terrarium (worker) and ClickHouse (node)
Table 2. Query performance for a cluster mode with 10 nodes and random client_id – production load
Figure 4. Production workload (6,000 requests per second) with execution time for single direct query in Terrarium.
Terrarium is exposed to heavy public load. In one of our production clusters, it handles 6k rps. For direct queries like those in scenarios Q2 and Q3, the response time is less than 5 ms at the 95th percentile.
We see that ClickHouse has 2x better performance when data locality is ensured. However, it does not support native query routing to preserve data locality. If we wanted to adopt ClickHouse for our business use cases and public load (6k rps), we would need:
● To write a service that routes each query to the node holding the given client_id's data, to ensure computation with data locality (see the sketch after this list)
● To increase the cluster size from 10 to 34 nodes to handle 6k rps
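A minimal sketch of what such a routing service might do, assuming consistent hashing on client_id; the node addresses and hash choice are illustrative assumptions, not the mechanism used by Terrarium or ClickHouse:

import hashlib
from bisect import bisect

class ClientRouter:
    # Consistent-hash router: every query for a client_id is sent to the node
    # that owns that client's data, preserving data locality.
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, client_id):
        # Return the node that should execute queries for this client_id.
        pos = bisect(self.keys, self._hash(str(client_id))) % len(self.ring)
        return self.ring[pos][1]

# Usage: route a Q2/Q3-style direct query to the owning node (node names are hypothetical).
router = ClientRouter([f"ch-node-{i:02d}:9000" for i in range(1, 11)])
target = router.node_for(client_id=123456789)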
Our Analytics Engine is one of the key differentiators that has helped us scale to meet the needs of our customers and help them build better products. We continue to invest resources to support large volume customers, improve query latency, add new features, and reduce costs.
Auto-scaling & spot instances
A stateless architecture and spot instances decrease infrastructure costs in cloud deployments by about 60% to 80% compared to the current deployment. The re-architecture also significantly reduced the bootstrap time for a new node.
Building support for a third data layer – object storage for unstructured data
Today, Terrarium natively supports processing data stored in memory and on fast NVMe drives in order to meet performance requirements. In the next few months, we will implement native support for a third data source layer - object storage for unstructured data, available from all cloud providers. This solution decreases the cost of storing huge data sets without losing performance. Our customers will be able to store petabytes of data and pay only for what they use - storage and computation will be separated. Data will always be available for fast computation and execution.
Implementing SQL for querying data
For the time being, Terrarium lacks the ability to process SQL queries. However, implementing SQL is our main focus going forward, to improve the quality of work with the tool and make it accessible through user-friendly interfaces for people who work with data, such as data scientists, analysts, and data engineers.