Creating a SQL Server 2019 Big Data Cluster
SQL Server 2019 Big Data Clusters represent a new feature of the SQL Server platform, combining specific services and resources used to efficiently manage and analyze very large sets of mostly non-relational data, and allowing that data to be used alongside relational data hosted in SQL Server databases. To achieve these principal objectives, Big Data Clusters implement highly scalable big-data storage (HDFS), highly versatile querying capabilities (Spark), the power of distributed computing (Kubernetes), and a data virtualization infrastructure (PolyBase).
To deploy all the required features that represent a single Big Data Clusters instance, you can use multiple physical—or virtual—machines that can either be hosted on premises or in the cloud.
As we do not want you to carry the burden of providing the necessary infrastructure to host the Big Data Cluster instance yourself, you are going to make use of Azure Kubernetes Service (AKS), hosted on a cluster of Azure Virtual Machines (VMs), which provides data storage, computational capabilities, and connectivity, as well as management and monitoring.
The default VM size recommended to host a Big Data Cluster is the ESv3 standard, memory-optimized Azure VM with eight CPU cores, running the Linux operating system. Five VMs are required to host an AKS instance used by a SQL Server 2019 Big Data Cluster instance.
Important note
Before using Azure VMs or setting up an AKS instance, you should also be familiar with the cost of hosting these resources in Azure. You can find more information at https://azure.microsoft.com/en-gb/pricing/details/virtual-machines/linux/.
Getting ready
The configuration and deployment of a Big Data Cluster hosted on AKS are fully integrated into Azure Data Studio (ADS); therefore, you will complete most of the following activities in ADS.
Important note
Due to the restrictions of the free trial subscription, you will also need access to a Pay-As-You-Go Azure subscription.
If you do not have access to a Pay-As-You-Go Azure subscription that you will be able to use for the recipes in this book, you can upgrade the free subscription that you created in Chapter 1, Getting Started with Azure and SSIS 2019.
- To upgrade your free Azure subscription, log in to the Azure portal at https://portal.azure.com/.
- On the Home page, select Subscriptions.
- On the Subscriptions page, select the Free Trial subscription that you should have already created in Chapter 1, Getting Started with Azure and SSIS 2019.
- On the subscription overview page, click Upgrade to start the upgrade procedure.
It should take a few moments to check the state of your current subscription.
- On the Upgrade page, enter a friendly name of your new Pay-As-You-Go subscription (for instance, ETL in Azure Cookbook), and then check the No technical support option, meaning that you do not want to add a support plan to your subscription.
Important note
The no-support plan should be good enough for typical development and testing purposes and is certainly good enough for the recipes in this book. However, for production use, and especially for enterprise-level development activities, one of the other plans would probably be much more appropriate.
You can learn more about Azure subscription support plans from the vendor documentation, available at https://azure.microsoft.com/en-us/support/plans/.
When ready, click Upgrade to continue. Your payment information will be verified, and you might have to update it, so keep your credit card handy. When the upgrade is completed, a new message will appear in your Azure Notifications area.
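If you would like to confirm that the upgrade went through, you can list your subscriptions and their state from Azure PowerShell (either locally or in the Azure Cloud Shell introduced in the following steps). This is just a quick sketch; the subscription name used with Set-AzContext assumes the friendly name suggested earlier:

# List your subscriptions and confirm that the upgraded one is enabled
Get-AzSubscription | Select-Object Name, Id, State

# Optionally, make the upgraded subscription the current context
Set-AzContext -Subscription "ETL in Azure Cookbook"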
Before even attempting to deploy a Big Data Cluster to Azure, you should also be familiar with the current resource restrictions in your region. As explained in the introduction to this recipe, you are going to need five standard ESv3 VMs with 8 CPU cores each, for a total of 40 cores. You can check the resource quotas by following these steps:
- On the Azure portal, open the Azure PowerShell pane by clicking the Cloud Shell icon, located to the right of the search text box, as shown in the following screenshot:
- A Terminal pane should open at the bottom of the browser window, as shown in the following screenshot:
If PowerShell is not the selected language, select it from the list box, located on the far left of the Cloud Shell menu.
- Use the following PowerShell command to check the current quotas in the East US region:
Get-AzVMUsage -Location "East US"
These instructions assume that you are interested in the current quotas in this particular region, as they will also be used later in the recipe. To check the quotas in a different region, replace the value of the Location argument before executing the command.
- In the results returned by the Get-AzVMUsage command, locate the Standard ESv3 Family vCPUs counter:
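Rather than scanning through the entire list, you can also filter the output so that only the ESv3 family counter is shown; for example:

Get-AzVMUsage -Location "East US" |
    Where-Object { $_.Name.LocalizedValue -like "*ESv3*" } |
    Select-Object @{ n = "Counter"; e = { $_.Name.LocalizedValue } }, CurrentValue, Limit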
Important note
If the Limit column value is less than 40, you will have to request a quota increase by sending a support case request to the administrators. Relevant information about this is available in the online vendor documentation, at https://docs.microsoft.com/en-us/azure/azure-portal/supportability/per-vm-quota-requests.
You only need to request an increase of the virtual central processing unit (vCPU) quota to 40 for the ESv3 family of Azure VMs in the selected region (eastus, by default), for the subscription you created in Chapter 1, Getting Started with Azure and SSIS 2019. No other requests should be needed in order to complete the recipes in this chapter.
You will not be able to continue with this recipe until your quota allows you to deploy the necessary number of VMs of the recommended size.
Support cases with a minimal-impact severity are resolved within 8 business hours, but are often resolved within minutes (of course, depending on how busy the support team is at the given time).
You are also going to need a place on your workstation where specific files needed in this chapter will be kept:
- By using Windows Explorer, create a new folder named ETL-with-Azure on the C:\ drive of your workstation.
- Inside the new folder, create a folder for this chapter, named Chapter03, and inside that one, create two additional folders, named BDC and Files:
The preceding screenshot shows how the folders need to be organized.
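If you prefer the command line, the same folder structure can be created from a PowerShell prompt instead of Windows Explorer:

# Create the chapter folders (-Force also creates any missing parent folders)
New-Item -ItemType Directory -Path "C:\ETL-with-Azure\Chapter03\BDC" -Force
New-Item -ItemType Directory -Path "C:\ETL-with-Azure\Chapter03\Files" -Force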
- Use your favorite internet browser, and connect to the GitHub location for this cookbook at https://github.com/PacktPublishing/ETL-with-Azure-Cookbook.
If you have not done so yet, clone the cookbook repository to the default location on your workstation.
- Locate the /Chapter03/Files folder in your local clone of the cookbook repository, and copy its contents to the C:\ETL-with-Azure\Chapter03\Files folder you created earlier (an example of performing the copy from the command line follows this step).
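The clone and the copy can also be performed from a PowerShell prompt; in the following sketch, the C:\Source clone location is only an assumption, so adjust it to wherever you keep your local repositories:

# Clone the cookbook repository (skip this if you have already cloned it)
git clone https://github.com/PacktPublishing/ETL-with-Azure-Cookbook.git C:\Source\ETL-with-Azure-Cookbook

# Copy the Chapter03 files to the folder created earlier
Copy-Item -Path "C:\Source\ETL-with-Azure-Cookbook\Chapter03\Files\*" -Destination "C:\ETL-with-Azure\Chapter03\Files" -Recurse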
You now have all the prerequisites in place to start with the first recipe.
How to do it…
You are going to use ADS to activate a new service in your Azure subscription—namely, an AKS cluster. You will then configure the newly created AKS cluster to host a SQL Server 2019 Big Data Cluster.
- Start ADS, unless it is already open, and on the Welcome page, click Deploy a server to start the deployment procedure. If the Welcome page is closed, you can open it by using the Welcome command, located in the Help menu.
- In the Select the deployment options window, select SQL Server Big Data Cluster as the deployment type, SQL Server 2019 as the Version, and New Azure Kubernetes Service Cluster as the Deployment target:
When deploying a Big Data Cluster for the first time, there should also be a warning displayed at the top of the window, saying that certain required tools have not yet been installed. If any tools are missing, as shown in the preceding screenshot, click Install tools to correct this, and then wait a few minutes for the installation to complete.
During the installation, when prompted by the Windows User Account Control to allow the applications to make changes to your device, click Yes to continue.
Tip
You should use this time to familiarize yourself with the Microsoft Privacy Statement and the License Terms. Follow the links above the deployment options to access the documents.
- When all required tools have been installed successfully (marked by the status Installed in the Required Tools table), accept the privacy statement and license terms, and click Select to continue. It should take about a minute to load the deployment configurations and start the deployment wizard.
- On the Deployment configuration profile page, make sure the aks-dev-test deployment configuration profile is selected. In this test deployment, you will not need any High Availability (HA) features.
When ready, click Next to continue.
- On the Azure settings page, leave the Subscription id text box empty so that the default Azure subscription will be used, and use the following values for the other settings:
a) New resource group name: etlinazure-ch-3
b) Location: eastus
East US is the default location in the template; other locations that offer the required version of AKS are also available. You can access the complete list by clicking View available Azure locations.
Tip
Generally, the location should be determined based on its proximity to the location where the services you plan on hosting in Azure are going to be used—for instance, the proximity to the end users, if the services are to be accessed by the end users directly; or, the proximity of your on-premises location, if your organization is the principal consumer.
For development and testing purposes, the default location should be good enough, but you can select any other location if you prefer:
c) AKS cluster name: etlinazure-ch-3-aks
d) VM count: 5
e) VM size: Standard_E8s_v3
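Before continuing, you can optionally confirm that the selected VM size is actually offered in the chosen region. This is not part of the wizard, merely a quick sanity check you can run in Azure PowerShell:

# Returns a single row if the size is available in the region, nothing otherwise
Get-AzVMSize -Location "eastus" | Where-Object { $_.Name -eq "Standard_E8s_v3" }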
When ready, click Next to continue.
- On the Cluster settings page, use etlinazure-ch-3-aks-cluster as the cluster name, leave admin as the administrator username, and assign a password for the administrator account.
Leave all the Docker settings unchanged.
When ready, click Next to continue.
- On the Service settings page, leave all default settings unchanged.
When ready, click Next to continue.
- On the Summary page, review all the configuration settings, and make sure that they match the preceding instructions. Use the Previous and Next buttons to navigate through the pages of the wizard if you need to make any changes.
If all the settings are correct, click Save config files to save the configuration to the local disk of your workstation. Save the files to the C:\ETL-with-Azure\Chapter03\BDC folder.
When ready, click Script to Notebook, after which the wizard generates a new notebook with a script based on the configuration settings you provided earlier.
- If this is the first time that you have used a Python Notebook, the Configure Python for Notebooks dialog should open.
Either select New Python installation as the installation type or specify the folder where an existing Python installation is located.
When ready, click Install to continue.
The download and installation procedure should take a few minutes to complete.
During the installation, you might be prompted to upgrade any installed Python packages.
Click Yes to confirm the upgrade.
When the installation completes, the results will be displayed in the Tasks pane.
- The deploy-bdc-aks Python Notebook should now be open in ADS. Using the Ctrl + S shortcut on your keyboard, or by selecting Save from the File menu, save the Notebook to the C:\ETL-with-Azure\Chapter03\BDC folder.
Inspect it, and then click Run Cells in the Notebook editor menu to execute the scripts.
As no Azure subscription ID was supplied during the configuration, a browser window will open, prompting you to sign into your Azure account. Use the account you created in Chapter 1, Getting Started with Azure and SSIS 2019, with the corresponding credentials. The browser window might open in the background, so pay attention to that.
The installation might take several minutes to complete; patience is advised.
While the installation is in progress, the process might not return any messages; however, the first message returned should contain a command line that you can execute in a Windows Command Prompt window to track the current progress (a rough example follows this step).
After the installation has completed, the notebook should log you in to the newly created Big Data Cluster and retrieve the endpoint addresses. Some of the information returned in the results will be needed later; therefore, save the Notebook again—this time including the most recent results so that the information retrieved during the installation is now available in the Notebook for later use.
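Use the exact command reported in the notebook output; purely as an illustration (and assuming that the notebook has already configured kubectl for the new AKS cluster, and that the Big Data Cluster name entered on the Cluster settings page is also used as the Kubernetes namespace), the progress can be tracked with commands such as these:

# Watch the Big Data Cluster pods as they are created (press Ctrl + C to stop watching)
kubectl get pods -n etlinazure-ch-3-aks-cluster --watch

# After the deployment has finished and you have logged in with azdata, you can also check the overall status
azdata bdc status show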
Tip
One of the significant benefits provided by Notebooks is the ability to not only display the results of the operations inside the Notebook but also to have them saved into a file to be looked at later.
- If the execution completed successfully, a new Click here to connect to SQL Server Master instance hyperlink should appear at the bottom of the Notebook, as shown in the following screenshot:
Follow the link. It might take a few moments for the connection to be made, so patience is advised. Once the connection has been made, the information about the SQL Server master instance of your first Big Data Cluster should appear in a new ADS tab.
- In the left column of the tab, under Monitoring, click SQL Server Big Data Cluster to view the properties of your Big Data Cluster, as shown in the following screenshot:
- In the window's menu ribbon, select Cluster Dashboard. After a few minutes, the current state of all of the parts of your newly created Big Data Cluster should be displayed, as shown in the next screenshot:
In normal circumstances, all six elements should be in the Ready state, with a Healthy health status.
If this is not the case, you can try following the link to the problematic item and running a troubleshooting operation for it.
Depending on the exact situation, the only successful remedy might eventually be simply to remove the entire resource group from your subscription and start over.
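If you do decide to start over, removing the resource group created earlier also removes the AKS cluster and everything deployed into it; for instance:

# This deletes every resource in the group and cannot be undone
Remove-AzResourceGroup -Name "etlinazure-ch-3" -Force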
- Return to the ADS tab with the Big Data Cluster properties. As you can see in the ADS tab shown in Figure 3.7, all the endpoints use the secure HTTP protocol (HTTPS) in their Uniform Resource Locators (URLs); however, no certificates are installed on these endpoints by default. This prevents client applications from connecting to the endpoints securely. In the Notebooks section of the current tab, click SQL Server 2019 Guide to open the SQL Server 2019 Big Data Clusters Operations and Support Jupyter Notebook. This Notebook provides access to a number of additional Notebooks that you can use to manage your SQL Server 2019 Big Data Clusters—for instance, to monitor, analyze, troubleshoot, and repair the installation.
- In the Chapters section of the Notebook, click Certificate Management to open the set of notebooks used for Certificate Management.
- In the Notebooks in this Chapter section of the Certificate Management notebook, locate the CER100 - Configure Cluster with Self Signed Certificates link, and follow the link to open the corresponding Notebook.
Tip
For development and testing purposes, you can safely use self-signed certificates to secure the communication between the server and the client applications; however, in a production environment, you would normally use a certificate issued by a Root Certificate Authority (CA). To use an existing Root CA certificate, you can use the CER101 - Configure Cluster with Self Signed Certificates using existing Root CA Python notebook, instead of the CER100 one.
- When the CER100 - Configure Cluster with Self Signed Certificates Python notebook loads, inspect it, and then click Run Cells in the ADS tab menu to execute the script.
Unfortunately, during the execution, you might run into one or more problems:
a) If the execution of the main notebook (the CER100) should fail (for instance, when executing the Run the notebooks cell), you can work around this problem by executing each of the associated notebooks individually instead. They are listed in the correct order in the Related section of the main notebook; their names begin with CER (you can skip the CAN100 notebook).
b) If the execution of an individual notebook should fail, inspect the error messages returned by the failed cell execution. These might contain references to additional notebooks that you can use to solve the problem. After loading and running any additional notebooks, repeat the execution of the previously failed notebook—depending on the particular notebook, you might have to run all cells, or run each individual cell, starting with the one that originally failed.
c) The Install the Root CA cell in the CER010 - Install generated Root CA locally notebook will use PowerShell to download the certificate to be installed on your workstation. Pay attention to the PowerShell window, as it might open in the background; it opens a dialog asking you to confirm the installation of the certificate. The execution will not continue until you respond.
After all the notebooks have been executed successfully, the certificates will be installed on all the endpoints of your Big Data Cluster as well as on your workstation, which should now allow client applications to connect.
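As an optional check, you can test an encrypted connection to the SQL Server master instance from your workstation by using sqlcmd. The placeholder values below must be replaced with the endpoint address and port retrieved by the deployment notebook, and with the administrator password you assigned earlier:

# -N requests an encrypted connection, which should now succeed thanks to the installed certificates
sqlcmd -S <master-endpoint>,<port> -U admin -P <your-password> -N -Q "SELECT @@VERSION;"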
If by any chance you do get stuck when running the notebooks and find yourself unable to solve a particular problem, please contact Microsoft Support; they might be able to assist you further.
Important note
When using self-signed certificates to secure server resources (for instance, endpoints), such certificates are not trusted by client applications as their validity cannot be proven through an appropriate CA. This might prevent client applications from using the secured resource altogether.
For instance, internet browsers or browser-based client applications will warn the user of these circumstances and might require user interaction before a connection can be made. Non-browser client applications, on the other hand, might be able to establish the connection with such a resource in a secure way, as long as they have access to the information needed in establishing trust.
The CER010 notebook installs the self-signed certificate on your workstation in a secure way, and thus makes the necessary information available to the client applications running on your workstation.
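If you want to confirm that the generated root CA really has been added to the trusted store on your workstation, you can list the most recently issued certificates there. Depending on how the certificate was installed, check Cert:\CurrentUser\Root or Cert:\LocalMachine\Root; the exact subject name of the generated certificate may vary:

# Show the five most recently issued certificates in the current user's trusted root store
Get-ChildItem -Path Cert:\CurrentUser\Root |
    Sort-Object NotBefore -Descending |
    Select-Object -First 5 Subject, NotBefore, Thumbprint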
How it works…
You have just deployed a SQL Server 2019 Big Data Cluster to Azure, using AKS to host it. You will use this resource later in this chapter to load data into the cloud, process it, and then retrieve the results to be loaded into an on-premises database.
Important note
The resource that you just created in your Azure subscription is not free; you are already being charged for its use. To reduce the cost of your subscription, we recommend that you complete Steps 1 through 4 of the next recipe, Stopping and starting Azure resources.