-. (Optional) Mount the ADLS storage account in the Databricks workspace

  1. Register an Azure AD application

    • Search and go to Azure Active Directory.
    • Click on the "App registrations" link, then "New registration".
    • Once an application is registered, get a copy of the Application (client) ID & Directory (tenant) ID.
  2. Generate an Authentication Key for the registered application

    • On the registered application page, click on the "Certificates & secrets" link.
    • Click on the "New client secret" link.
    • Pick a lifetime for the secret in the "Expires" drop-down list, then click on the Add button to add the secret.
    • Get a copy of the Secret from the Value field as the Authentication Key.
  3. Grant the registered application access to a storage account

    • Navigate to the storage account that the registered application needs to access.
    • Click on the "Access Control (IAM)" link, then "Add role assignment".
    • Select the "Storage Blob Data Contributor" role from the list, then click on the Next button.
    • Click on the "Select members" link to add the registered application, then click on the "Review + assign" button.
  4. Add secrets in Azure Key Vault


    • Search and go to Azure Key Vault.
    • Click on the Secrets link to add a secret for the Application (client) ID from step 1.
    • Create a secret for the Directory (tenant) ID from step 1.
    • Create a secret for the Authentication Key from step 2.
  5. Get the Vault URI & Resource ID

    • On the Key Vault page from step 4, click on the Properties link.
    • Get a copy of the "Vault URI" and "Resource ID".
  6. Create an Azure Key Vault-backed Secret Scope in Azure Databricks

    • Search and go to Azure Databricks Workspace.
    • Launch a Databricks Workspace.
    • Append #secrets/createScope to the end of the workspace URL, such as https://???.azuredatabricks.net/?o=???#secrets/createScope.
    • Fill in the "Vault URI" and "Resource ID" from step 5 as the "DNS Name" and "Resource ID" respectively.
    • Click on the Create button. A quick notebook check of the new scope is sketched after this section.
  7. Mount ADLS storage account in Azure Databricks

    • Create the following containers under the storage account from step 3.
      • input
      • conf
      • pipelines
      • lib
      • scripts
      • output
    • Create a Scala notebook in the Databricks workspace from step 6.
    • Run the following code in the notebook (a re-runnable variant is sketched after this section):
      //ws-hands-on-scope is the name of the secret scope created in step 6
      //adls-client-id is the name of the secret created in step 4 for the Application (client) ID
      val applicationId = dbutils.secrets.get(scope = "ws-hands-on-scope", key = "adls-client-id")
      //adls-secret-id is the name of the secret created in step 4 for the Authentication Key
      val secretId = dbutils.secrets.get(scope = "ws-hands-on-scope", key = "adls-secret-id")
      //adls-tenant-id is the name of the secret created in step 4 for the Directory (tenant) ID
      val tenantId = dbutils.secrets.get(scope = "ws-hands-on-scope", key = "adls-tenant-id")
      
      //OAuth (client-credentials) settings used by the ABFS driver
      val endpoint = "https://login.microsoftonline.com/" + tenantId + "/oauth2/token"
      val configs = Map(
        "fs.azure.account.auth.type" -> "OAuth",
        "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id" -> applicationId,
        "fs.azure.account.oauth2.client.secret" -> secretId,
        "fs.azure.account.oauth2.client.endpoint" -> endpoint
      )
      
      //mount each container under storage account wstorage to /mnt/<container>
      dbutils.fs.mount(source = "abfss://input@wstorage.dfs.core.windows.net/", mountPoint = "/mnt/input", extraConfigs = configs)
      dbutils.fs.mount(source = "abfss://conf@wstorage.dfs.core.windows.net/", mountPoint = "/mnt/conf", extraConfigs = configs)
      dbutils.fs.mount(source = "abfss://lib@wstorage.dfs.core.windows.net/", mountPoint = "/mnt/lib", extraConfigs = configs)
      dbutils.fs.mount(source = "abfss://scripts@wstorage.dfs.core.windows.net/", mountPoint = "/mnt/scripts", extraConfigs = configs)
      dbutils.fs.mount(source = "abfss://pipelines@wstorage.dfs.core.windows.net/", mountPoint = "/mnt/pipelines", extraConfigs = configs)
      dbutils.fs.mount(source = "abfss://output@wstorage.dfs.core.windows.net/", mountPoint = "/mnt/output", extraConfigs = configs)
    • List & verify all mounted containers
      %fs
      ls /mnt
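
To confirm the secret scope from step 6 is wired to the Key Vault correctly, the lookups below can be run in a notebook before attempting any mounts. This is a minimal sketch assuming the scope and secret names used in step 7 (ws-hands-on-scope, adls-client-id, adls-tenant-id, adls-secret-id); adjust them to the names actually created in steps 4 & 6. Secret values are redacted in notebook output, so only the key names and a length check are printed.

```scala
// List the scopes visible to this workspace -- the Key Vault-backed scope should appear here.
dbutils.secrets.listScopes().foreach(println)

// List the secret keys in the scope (values are never shown).
dbutils.secrets.list("ws-hands-on-scope").foreach(println)

// Printed secret values are redacted, so check the length instead to prove the lookup works.
println(dbutils.secrets.get(scope = "ws-hands-on-scope", key = "adls-client-id").length)
```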
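
dbutils.fs.mount fails when the mount point already exists, so re-running the mount cell above in a workspace that already has the mounts will throw an error. The sketch below is a re-runnable variant of the same mounts: it assumes the configs map from the cell above is still in scope, and it unmounts any existing /mnt/&lt;container&gt; mount before mounting it again. The storage account name (wstorage) and container list follow the example above; change them to match your own account.

```scala
// Re-runnable version of the mounts: refresh any container that is already mounted
// instead of failing. Assumes `configs` (the OAuth settings above) is defined.
val containers = Seq("input", "conf", "pipelines", "lib", "scripts", "output")
val alreadyMounted = dbutils.fs.mounts().map(_.mountPoint).toSet

containers.foreach { container =>
  val mountPoint = s"/mnt/$container"
  if (alreadyMounted.contains(mountPoint)) {
    // unmount first so updated credentials/configs take effect
    dbutils.fs.unmount(mountPoint)
  }
  dbutils.fs.mount(
    source = s"abfss://$container@wstorage.dfs.core.windows.net/",
    mountPoint = mountPoint,
    extraConfigs = configs
  )
}
```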

-. Prepare for the Spark job by uploading job data & files.

  • Upload users.csv to /input/users
  • Upload train.txt to /input/train
  • Upload application.conf to /conf
  • Upload transform-user-train.sql to /scripts
  • Upload pipeline_fileRead-fileWrite.xml to /pipelines
  • Upload spark-etl-framework_2.12_3.2.1-1.0.jar to /lib
  • Verify the uploaded files (a quick content check with Spark is sketched after this list)
    %fs
    ls /mnt/input/users
    
    %fs
    ls /mnt/input/train
    
    %fs
    ls /mnt/conf
    
    %fs
    ls /mnt/pipelines
    
    %fs
    ls /mnt/lib
    
    %fs
    ls /mnt/scripts
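
Beyond listing the folders, a quick way to confirm the uploads are readable through the mounts is to load them with Spark. The sketch below assumes users.csv is a comma-separated file with a header row and train.txt is plain text; adjust the read options to the actual file formats.

```scala
// Read the uploaded inputs through the mount points to confirm they are accessible.
// The header/format assumptions here are illustrative only -- adjust to the real files.
val users = spark.read.option("header", "true").csv("/mnt/input/users")
users.printSchema()
users.show(5, truncate = false)

val train = spark.read.text("/mnt/input/train")
println(s"train line count: ${train.count()}")
```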

-. Create a JAR job in the Databricks workspace

In the Azure Databricks workspace, create a job with the following properties (a notebook smoke-test of the same entry point is sketched after the list):

- Name: fileRead-fileWrite
- Type: JAR
- Main class: com.qwshen.etl.Launcher
- Dependent library (added via DBFS/ADLS): dbfs:/mnt/lib/spark-etl-framework_2.12_3.2.1-1.0.jar
- Cluster: Single Node cluster with Scala 2.12 / Spark 3.2.1 and worker type Standard_DS3_v2
- Parameters: ["--pipeline-def","dbfs:/mnt/pipelines/pipeline_fileRead-fileWrite.xml","--application-conf","dbfs:/mnt/conf/application.conf"]
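
Before creating the job, the same entry point can optionally be smoke-tested from a notebook. This is a hypothetical sketch: it assumes the jar dbfs:/mnt/lib/spark-etl-framework_2.12_3.2.1-1.0.jar has been attached to the cluster as a library (so com.qwshen.etl.Launcher is on the classpath), and it passes the same arguments the JAR job would receive.

```scala
// Hypothetical smoke-test of the job's main class from a notebook.
// Requires the spark-etl-framework jar to be installed on the cluster as a library.
val args = Array(
  "--pipeline-def", "dbfs:/mnt/pipelines/pipeline_fileRead-fileWrite.xml",
  "--application-conf", "dbfs:/mnt/conf/application.conf"
)
com.qwshen.etl.Launcher.main(args)
```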

-. Run the job & check the result in the output container, as sketched below.
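
Once the job finishes, the output can be inspected through the /mnt/output mount. The read at the end is only a sketch: the output format depends on the pipeline definition, and CSV with a header is assumed here purely for illustration.

```scala
// List whatever the job wrote to the output container.
display(dbutils.fs.ls("/mnt/output"))

// Illustrative read only -- switch the format/options to whatever the pipeline actually writes.
val result = spark.read.option("header", "true").csv("/mnt/output")
result.show(10, truncate = false)
```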