Mounting Azure Data Lake

Hadoop configuration options set with spark.conf.set(...) are not accessible via the SparkContext. This means that while they are visible to the DataFrame and Dataset APIs, they are not visible to the RDD API. For this reason we mount the Azure Data Lake Storage (ADLS) file system to DBFS using OAuth.
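
To make this concrete, here is a minimal sketch (placeholder values and a hypothetical file path throughout) of the non-mounted alternative: credentials set on the Spark session are picked up by the DataFrame reader, but the same read through the RDD API fails because the RDD API takes its Hadoop configuration from the SparkContext.

# Sketch only -- placeholder values; these settings are scoped to the Spark session
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<service-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<service-client-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
               "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")

path = "abfss://<file-system>@<account-name>.dfs.core.windows.net/<file-name>.csv"
df = spark.read.csv(path)                 # DataFrame API: sees the session configuration
rdd = spark.sparkContext.textFile(path)   # RDD API: does not, so this read fails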

As a prerequisite to mounting an ADLS instance, you must first create a service principal. When performing the steps in the Assign the application to a role section of the article, make sure to assign the Storage Blob Data Contributor role to the service principal. When performing the steps in the Get values for signing in section of the article, paste the tenant (directory) ID, application (client) ID, and authentication key (client secret) values into a text file. You'll need them for the following configuration.

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<service-client-id>",
    "fs.azure.account.oauth2.client.secret": "<service-client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token",
    "fs.azure.createRemoteFileSystemDuringInitialization": "true"
}

dbutils.fs.mount(
    source = "abfss://<file-system>@<account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
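
Rather than pasting the client secret directly into a notebook, it can be stored in a Databricks secret scope and read at runtime before calling dbutils.fs.mount. A minimal sketch, assuming a hypothetical secret scope named adls-scope containing a key named client-secret:

# Hypothetical scope/key names; the value replaces "<service-client-secret>" above
client_secret = dbutils.secrets.get(scope = "adls-scope", key = "client-secret")
configs["fs.azure.account.oauth2.client.secret"] = client_secret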

Mounting S3

An S3 bucket can be mounted to DBFS in a similar way, using an AWS access key and secret key:

ACCESS_KEY = "<access-key>"
SECRET_KEY = "<secret-key>"
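# URL-encode any "/" characters in the secret key so it can be embedded in the mount URI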
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "<bucket-name>"
MOUNT_NAME = "<mount-name>"

dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)

Where <mount-name> is the Databricks File System (DBFS) mount point.
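
For either mount, you can confirm that the mount point was created by listing the current mounts; mountPoint and source are fields of the entries returned by dbutils.fs.mounts():

for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)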

Accessing Mounted Files

Listing the files:

dbutils.fs.ls("/mnt/<mount-name>/")

Iterating over files:

for f in dbutils.fs.ls("/mnt/<mount-name>/"):
    # Each entry is a FileInfo object with path, name, and size attributes
    print(f.path)
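
Once mounted, the files behave like any other DBFS path, so they can be read with the usual Spark readers or with local file APIs through the /dbfs FUSE mount. A short sketch, with the file name as a placeholder:

df = spark.read.csv("/mnt/<mount-name>/<file-name>.csv", header = True)

# The same path is also reachable with standard Python file APIs via /dbfs
with open("/dbfs/mnt/<mount-name>/<file-name>.csv") as fh:
    header = fh.readline()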

Unmounting a Data Source

dbutils.fs.unmount("/mnt/<mount-name>")
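
If the path is not currently mounted, dbutils.fs.unmount raises an error, so a defensive variant (a sketch) checks the existing mounts first:

mount_point = "/mnt/<mount-name>"
if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.unmount(mount_point)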