Mounting Azure Data Lake
Hadoop configuration options set using spark.conf.set(...) are not accessible via SparkContext. This means that while they are visible to the DataFrame and Dataset API, they are not visible to the RDD API. For this reason, we must mount the Azure Data Lake Storage (ADLS) account to DBFS via OAuth.
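To illustrate the limitation, here is a minimal sketch (placeholder credentials and a placeholder path only, mirroring the configuration keys used in the mount below) that sets the OAuth options through spark.conf.set(...): the DataFrame read can pick them up, but an RDD read of the same path cannot.
# Session-level OAuth settings; visible to the DataFrame/Dataset API only.
spark.conf.set("fs.azure.account.auth.type", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id", "<service-client-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret", "<service-client-secret>")
spark.conf.set("fs.azure.account.oauth2.client.endpoint",
               "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

path = "abfss://<file-system>@<account-name>.dfs.core.windows.net/path/to/file.csv"
df = spark.read.csv(path)    # Works: the DataFrame reader sees the session configuration.
sc.textFile(path).count()    # Fails: the RDD API only sees SparkContext's Hadoop configuration.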
As a prerequisite to mounting an ADLS instance, you must first create a service principal. When performing the steps in the Assign the application to a role section of the article, make sure to assign the Storage Blob Data Contributor role to the service principal. When performing the steps in the Get values for signing in section of the article, paste the tenant ID, application ID, and authentication key values into a text file; you'll need them for the following configuration.
# OAuth configuration for the ABFS driver; replace the placeholders with the
# service principal values collected above.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<service-client-id>",
    "fs.azure.account.oauth2.client.secret": "<service-client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

# Mount the ADLS file system under /mnt/<mount-name> in DBFS.
dbutils.fs.mount(
    source = "abfss://<file-system>@<account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/<mount-name>",
    extra_configs = configs)
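Hard-coding the client secret in a notebook is best avoided. If a Databricks secret scope is available (the scope and key names below are hypothetical), the secret can be pulled at mount time with dbutils.secrets.get instead:
# Hypothetical secret scope and key names; create them beforehand with the Databricks CLI.
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<client-secret-key>")
configs["fs.azure.account.oauth2.client.secret"] = client_secret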
Mounting S3
ACCESS_KEY = "<access-key>"
SECRET_KEY = "<secret-key>"
# URL-encode forward slashes so the secret key can be embedded in the s3a URI.
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "<bucket-name>"
MOUNT_NAME = "<mount-name>"

# Mount the bucket under /mnt/<mount-name> in DBFS.
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME),
                 "/mnt/%s" % MOUNT_NAME)
Where <mount-name> is the Databricks File System (DBFS) mount point.
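Once mounted, the bucket behaves like any other DBFS path, so both the DataFrame and RDD APIs can read from it. The file path below is a placeholder:
# Placeholder path inside the mounted bucket.
df = spark.read.json("/mnt/<mount-name>/path/to/data.json")
rdd = sc.textFile("/mnt/<mount-name>/path/to/data.json")  # The RDD API works on mounts too.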
Accessing Mounted Files
Listing the files:
dbutils.fs.ls("/mnt/<mount-name>/")
Iterating over files:
for f in dbutils.fs.ls("/mnt/<mount-name>/"):
    # Each entry is a FileInfo object with path, name, and size attributes.
    print(f.name, f.size)
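A common pattern is to filter the listing before handing the paths to Spark; the .csv extension and header option below are illustrative only:
# Collect top-level CSV files from the mount (illustrative filter) and load them as one DataFrame.
csv_paths = [f.path for f in dbutils.fs.ls("/mnt/<mount-name>/") if f.name.endswith(".csv")]
df = spark.read.csv(csv_paths, header=True)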
Unmounting Mounted Data Source
dbutils.fs.unmount("/mnt/<mount-name>")
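To confirm which mounts are active before or after unmounting, dbutils.fs.mounts() returns the current mount points and their backing sources:
# Print every active DBFS mount point and the source it maps to.
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)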