Migrate Data to ClickHouse

ClickHouse (CH) has integration pipelines to many other data sources: RDBMS, cloud storage (for example S3), Kafka, JDBC-based sources, Hive, etc. One of our projects required migrating historical data from a Hive (HDFS) based data warehouse into ClickHouse. Reference: ClickHouse migration documentation (https://clickhouse.com/docs/en/cloud/migration/clickhouse-local)

Step 1: Create the ClickHouse target table. Unlike some RDBMS systems, where the target table can be created automatically on the fly during migration (import), here we have to create the ClickHouse tables manually or with an open-source Python library. We took the traditional schema-generation approach. The client's Hive metastore is backed by MySQL, so we can query the metastore tables to generate the ClickHouse CREATE TABLE scripts. The first step is to define the data type mapping between Hive and ClickHouse:

Hive Data Type | ClickHouse Data Type
---------------|---------------------
int            | UInt32
string         | String
boolean        | Boolean
date           | Date
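
To make the mapping concrete, here is a hypothetical Hive table (the name and columns are ours, for illustration only):

create table page_views (
    user_id    int,
    url        string,
    active     boolean,
    event_date date
);

Under the mapping above, its columns become user_id UInt32, url String, active Boolean, event_date Date on the ClickHouse side; we reuse this table in the examples below.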

Three tables in the Hive metastore are involved: SDS, TBLS, and COLUMNS_V2. The following script generates the CREATE statements (a Python script could do the same):

with tmp as
(
    select
        t.TBL_NAME,
        group_concat(concat(v.COLUMN_NAME, ' ', v.Type_name) order by v.INTEGER_IDX) as col
    from TBLS t
    inner join SDS s
        on t.SD_ID = s.SD_ID
    inner join
    (
        select
            CD_ID,
            COLUMN_NAME,
            case TYPE_NAME                  -- Hive -> ClickHouse type mapping
                when 'int'     then 'UInt32'
                when 'string'  then 'String'
                when 'date'    then 'Date'
                when 'boolean' then 'Boolean'
                else TYPE_NAME
            end as Type_name,
            INTEGER_IDX
        from COLUMNS_V2
    ) v
        on s.CD_ID = v.CD_ID
    where t.DB_ID = 61                      -- the Hive database being migrated
    group by t.TBL_NAME
)
select
    concat('create table ', TBL_NAME, '(', col, ')', ' ENGINE = MergeTree()') as statement
from tmp;
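
For the hypothetical page_views table above, the query would emit one row like this (a sketch, not the client's actual schema):

create table page_views(user_id UInt32, url String, active Boolean, event_date Date) ENGINE = MergeTree()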

The above query produces about 90% of the CREATE statement for CH, since a CH MergeTree table also needs either a PRIMARY KEY or an ORDER BY clause. We added the missing part manually, choosing sort keys with query and storage performance in mind.
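
For example, the generated statement for the hypothetical page_views table becomes a complete, valid CH DDL once a sort key is added (event_date and user_id are our illustrative choice, not a recommendation):

create table page_views
(
    user_id    UInt32,
    url        String,
    active     Boolean,
    event_date Date
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id);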

Step 2: Load the data into ClickHouse. There are multiple ways to load an external table into CH: CSV bridging (Hive -> CSV -> CH), Python (Hive ODBC, pyodbc, and the CH Python connector), the ClickHouse JDBC Bridge (https://github.com/ClickHouse/clickhouse-jdbc-bridge), or Apache SeaTunnel (https://seatunnel.apache.org/blog/hive-to-clickhouse… we haven't tested this one yet). Which option fits best depends on table sizes and record counts. We used the ClickHouse JDBC Bridge, whose performance was quite acceptable; the only extra work is following the tutorial to install the JDBC drivers. A sample statement using the JDBC bridge:

INSERT INTO <CH Table> SELECT * FROM jdbc('datasource name','db','table')
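
Putting it together for the hypothetical page_views table: 'hive' is an assumed datasource name defined in the bridge's configuration, and 'default' is the Hive database (both are placeholders for your own setup):

-- quick sanity check: read the row count from Hive through the bridge
SELECT count() FROM jdbc('hive', 'default', 'page_views');

-- bulk copy into the pre-created ClickHouse table
INSERT INTO page_views SELECT * FROM jdbc('hive', 'default', 'page_views');

After the insert finishes, comparing count() on both sides is a cheap way to verify the copy.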