This exercise uses HDInsight Kafka to stream data into Azure DW. Although Kafka is now available in Azure HDInsight, there are few resources to refer to unless you already know Apache Kafka well.
The exercise needs two tools that Windows does not ship by default: the Java JDK 7 or 8 (downloadable from the Oracle website) and Maven (the Apache Maven project, downloadable from http://maven.apache.org/).
- In Windows -> Environment Variables, add a JAVA_HOME system variable that points to the JDK installation folder, for example: c:\Program Files\Java\jdk1.8...
- In Environment Variables, add the JAVA_HOME bin folder to the Path variable, for example: %JAVA_HOME%\bin;...
- In Environment Variables, add the Maven bin folder to the Path as well, for example: c:\maven\bin;...
- Download the project from here: https://github.com/Azure-Samples/hdinsight-kafka-java-get-started. This sample appears to use Kafka as both producer and consumer...
- In the Windows command line, build the sample:
- cd producer-Consumer
- mvn clean package
- In Azure, create a VNET. This VNET will host the Kafka cluster and any other clusters (Storm or Spark in the future).
- In Azure, create a Storage Account. This storage account will be used to save the Kafka stream data. (Downloading and installing MS Storage Explorer is a big help for accessing the Storage Account.)
- In Azure, create the Kafka HDInsight cluster, specifying the storage account from step 7) and the VNET from step 6).
- In Azure, open Cloud Shell (use Bash, as the sample was created with Apache tooling and Java).
- Copy the compiled jar file to the Kafka cluster. (I couldn't reproduce the command line mentioned in the sample, but WinSCP can be downloaded and is much easier than the command line.) Log in to WinSCP using email@example.com and copy the jar file to the sshuser folder on the Kafka cluster.
- In Azure Cloud Shell, use SSH to connect to the cluster: ssh sshuser@kafkaclustername-ssh.azurehdinsight.net
- sudo apt -y install jq (use lower case).
- Run step 7 on the project webpage (both statements have been verified).
- Run step 8 on the project webpage (both statements have been verified) to verify the kafkahosts and kafkabrokers.
- Create the 'test' topic:
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 2 --partitions 8 --topic test --zookeeper $KAFKAZKHOSTS
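After creating the topic, it can help to confirm that it exists before starting the producer. The sketch below is one way to do that, assuming KAFKAZKHOSTS was exported in the earlier step. The is_host_list helper is hypothetical (it is not part of the sample) and only guards against an empty or malformed host list; --list and --describe are standard kafka-topics.sh options.

```shell
# Hypothetical helper (not part of the sample): succeeds only when the
# argument is a non-empty, comma-separated list of host:port entries.
is_host_list() {
  [ -n "$1" ] || return 1
  old_ifs=$IFS
  IFS=','
  set -- $1
  IFS=$old_ifs
  for entry in "$@"; do
    case "$entry" in
      *:*) ;;
      *) return 1 ;;
    esac
    host=${entry%:*}
    port=${entry##*:}
    [ -n "$host" ] || return 1
    case "$port" in
      ''|*[!0-9]*) return 1 ;;
    esac
  done
  return 0
}

KAFKA_BIN=/usr/hdp/current/kafka-broker/bin

if is_host_list "${KAFKAZKHOSTS:-}"; then
  # List all topics, then show partitions and replicas for 'test'.
  "$KAFKA_BIN/kafka-topics.sh" --list --zookeeper "$KAFKAZKHOSTS"
  "$KAFKA_BIN/kafka-topics.sh" --describe --topic test --zookeeper "$KAFKAZKHOSTS"
else
  echo "KAFKAZKHOSTS is empty or malformed; rerun the host lookup step." >&2
fi
```

If the describe output shows 8 partitions with a replication factor of 2, the topic matches the create command above.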
Data security is a top priority when migrating to a cloud environment, especially when data moves over the public internet. When moving data flows or files from on-premises into Azure, there are several ways to protect the data in transit, depending on the transfer method being used.
- Data Connection Encryption.
Azure DB and Azure SQL DW generate connection strings (ADO.NET, JDBC, ODBC and PHP) that can be used in client applications. Be aware that "Encrypt = True" is always included (in the ADO.NET and JDBC strings only). Always keep this setting when using the connection string in client applications.
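A quick way to make sure the setting has not been dropped is to check the connection string before deploying it. The following is a minimal sketch under stated assumptions: has_encrypt_enabled is a hypothetical helper, and the server, database, and credentials in the sample string are placeholders.

```shell
# Hypothetical helper: succeeds when a connection string explicitly
# enables encryption (Encrypt=True / encrypt=true, spaces ignored).
has_encrypt_enabled() {
  printf '%s' "$1" | tr -d ' ' | grep -Eqi '(^|;)encrypt=true(;|$)'
}

# Placeholder ADO.NET-style string; every value in it is made up.
CONN='Server=tcp:yourserver.database.windows.net,1433;Initial Catalog=yourdw;User ID=youruser;Password=yourpassword;Encrypt=True;TrustServerCertificate=False;'

if has_encrypt_enabled "$CONN"; then
  echo "encryption enabled"
else
  echo "WARNING: connection string does not set Encrypt=True" >&2
fi
```

The same check works for a JDBC string, where the key is written as encrypt=true.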
- SSIS (SQL Server Integration Services) Data Flow Encryption.
I believe most MS SQL Server people will want these technical tips when using SSIS as the main tool for data transformation. When creating a connection to Azure DW, Azure SQL Server or Azure NoSQL Database with OLE DB, be sure to change 'Use Encryption for Data' to True; this enables SSL encryption on the connection.
If you use the ADO.NET provider to connect to Azure DW, there are more settings to be aware of:
Column Encryption Setting: This feature is currently supported only in SQL Server 2016 (not in Azure DW). If the target is Azure DW, this value should be disabled. This property sets up encryption for particular columns.
Encrypt: When True, SQL Server uses SSL encryption for all data sent between the client and server if the server has a certificate installed. In Azure
- Data Factory Data Encryption
Data Factory is an Azure-based data movement service. It can move data from on-premises to Azure and also move data within the Azure environment.
Our first exercise will be loading a local (on-premises) SQL Server database into Azure SQL DW. There are several things to think about for the migration.
- Technical approach. There are several ways to move data: SQL Server Integration Services, Data Factory, Data Warehouse Migration utility and Azure commands...
- Historical Loading and Incremental Loading. Incremental data may need a different loading approach.
- Security configuration. How to move the user/group accounts into Azure DW.
- SQL Code migration. Even though this is SQL to SQL, there are still some problems to address, for example the limitations in Azure DW.
No matter which approach we use, there are three major steps we have to follow: Move Schema, Move Data and Move Code.
1) Move Schema. Moving the schema means creating the table schemas before loading the data. Typically in SQL Server, you can generate the table scripts from SSMS or with other 3rd party tools. Keep in mind that not all SQL Server objects are needed in Azure DW, for example 'primary key'. In the generated scripts, add the following table options (inside a WITH clause) after the column list of each 'Create Table' statement:
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
This sets the table's distribution method (to be discussed later) and creates a clustered columnstore index.
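To show where these options sit in a full statement, here is a minimal sketch. The table dbo.SalesFact and its columns are made up for illustration; a real script would come from SSMS with its primary key definitions removed. The function only prints the script so that it can be redirected into a .sql file.

```shell
# Print a sample CREATE TABLE script for Azure DW.
# dbo.SalesFact and its columns are hypothetical.
emit_create_table() {
  cat <<'SQL'
CREATE TABLE dbo.SalesFact
(
    SaleId     INT            NOT NULL,
    CustomerId INT            NOT NULL,
    SaleDate   DATE           NOT NULL,
    Amount     DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
);
SQL
}

emit_create_table
```

Redirecting the output (for example, emit_create_table > create_salesfact.sql) gives a file that can be opened and run from SSMS.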
2) Move Data. This is the core step. Moving data from on-premises SQL to Azure DW comes in two kinds: Historical Loading and Incremental Loading. In data warehousing methodology, Historical Loading means a one-time or fresh load; it is used to load the entire on-premises data set into Azure DW. Most articles cover only Historical Loading. Incremental Loading means we load only new records into Azure DW, and this process should run on a scheduled frequency. Incremental Loading and Historical Loading may use the same technique, but they do not have to. I will cover this discussion later.
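One common way to implement Incremental Loading is a high-water mark: remember the largest modification timestamp already loaded and select only rows newer than it. The sketch below only builds the extraction query. It assumes a hypothetical source table dbo.Sales with a ModifiedDate column and a made-up watermark value, and it is not necessarily the technique this series will settle on.

```shell
# Hypothetical helper: build a query that selects rows changed after the
# last loaded watermark. Arguments: table, timestamp column, watermark value.
# Values are interpolated as-is, so pass trusted input only.
build_incremental_query() {
  printf "SELECT * FROM %s WHERE %s > '%s';\n" "$1" "$2" "$3"
}

# Watermark recorded by the previous load (made-up value for illustration).
LAST_LOADED='2017-01-01 00:00:00'

build_incremental_query dbo.Sales ModifiedDate "$LAST_LOADED"
```

A Historical Loading would use the same SELECT without the WHERE filter, which is why the two kinds of loading can often share tooling.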
Move data by using SQL Server Integration Services
This is probably the easiest way to do both Historical Loading and Incremental Loading.
I am using this series to cover topics related to migrating to Azure SQL Data Warehouse, Microsoft's MPP (Massively Parallel Processing) data platform in Azure. The name confuses many people who are familiar with on-premises SQL Server: Azure SQL Data Warehouse is in fact an Azure SQL Database that supports MPP by using distributed Compute Nodes, so it has to be created under the Azure SQL Database service.
I will try to evaluate migration from several different DW systems: SQL Server, Oracle, DB2, RDS, etc. I will also use this series to compare the pros and cons of different DWs from a technical perspective (I hope I can. :)).
Setting up and initializing Azure SQL Data Warehouse is pretty simple, much easier than installing a SQL Server instance on a local server :). It takes several steps:
- Create SQL DW (Specify or create new SQL Server)
- Specify Performance Tier (in DWUs, 100 DWU initially)
- Specify administrator account
The following screen shot shows the server name just created:
A very basic question at this point is how to connect to and operate this Azure DW. Fortunately, there is a simple way to do that with SQL Server Management Studio 2016. The screen shot below shows that we connect to the Azure SQL Server rather than to Azure SQL DW, because Azure SQL DW is a DATABASE.
When connecting to Azure SQL, the system asks you to log in to Azure from the client computer so that the Azure SQL firewall can allow the computer's IP address to connect. The screen shot below shows the result after connecting to Azure SQL, including the first Azure SQL DW: bridatadw1
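The same server-versus-database distinction applies when connecting from a command line instead of SSMS. The sketch below assumes the sqlcmd utility is installed; query_dw and server_fqdn are hypothetical helpers, and the server name, user, and password in the example call are placeholders. Only bridatadw1 comes from the setup above.

```shell
# Build the fully qualified Azure SQL server name from its short name.
server_fqdn() {
  printf '%s.database.windows.net\n' "$1"
}

# Hypothetical wrapper: connect to the SERVER and select the DW as the database.
# Arguments: server short name, database, user, password, query.
# -N requests an encrypted connection; -I turns on quoted identifiers.
query_dw() {
  sqlcmd -S "tcp:$(server_fqdn "$1")" -d "$2" -U "$3" -P "$4" -N -I -Q "$5"
}

# Example call (placeholders except bridatadw1):
# query_dw yourserver bridatadw1 youruser 'yourpassword' 'SELECT @@VERSION;'
```

As with SSMS, the connection succeeds only after the client IP address has been allowed through the Azure SQL firewall.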