Databricks (AWS S3) Connector Setup Guide
This article describes how to set up the Databricks (AWS S3) connector.
Actions
Action Name | AudienceStream | EventStream |
---|---|---|
Send Entire Event Data | ✗ | ✓ |
Send Custom Event Data | ✗ | ✓ |
Send Entire Visitor Data | ✓ | ✗ |
Send Custom Visitor Data | ✓ | ✗ |
How it works
The Databricks (AWS S3) connector requires two sets of connections: Tealium to AWS S3 and Databricks to AWS S3.
Tealium to AWS S3 connection
Tealium requires a connection to an AWS S3 instance to list buckets and upload event and audience data into S3 objects. The Databricks (AWS S3) connector supports two authentication options:
- Provide an Access Key and Access Secret.
- Provide STS (Security Token Service) credentials.
Access Key and Secret credentials
To find your AWS Access Key and Secret:
- Log in to the AWS Management Console and go to the IAM (Identity and Access Management) service.
- Click Users and then Add user.
- Enter a username. For example, `TealiumS3User`.
- Attach policies to the user you have just created.
- In the Permissions tab, click Attach existing policies directly.
- Search for and attach the `AmazonS3FullAccess` policy for full access. If you want to restrict access to a specific bucket, you can write a policy similar to the example below. In this example, `YOUR_BUCKET_NAME` is the bucket that Tealium would use to upload event and audience data into S3 objects.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/*"
      ]
    }
  ]
}
```
- Create the keys.
- Go to the Security credentials tab and click Create Access Key.
- Copy the Access Key ID and Secret Access Key and save them securely.
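If you want to confirm that the new key works before entering it in the connector configuration, a minimal sketch along the following lines (assuming the boto3 library, a placeholder bucket name, and placeholder credentials) exercises the same S3 permissions the connector needs:

```python
# Sketch: verify the new access key against the bucket. YOUR_BUCKET_NAME and the
# credential values are placeholders; boto3 must be installed.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",          # from the Security credentials tab
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
    region_name="us-east-1",                          # use your bucket's region
)

# ListBucket permission check
s3.list_objects_v2(Bucket="YOUR_BUCKET_NAME", MaxKeys=1)

# PutObject / GetObject permission check
s3.put_object(Bucket="YOUR_BUCKET_NAME", Key="tealium-connector-test.txt", Body=b"test")
print(s3.get_object(Bucket="YOUR_BUCKET_NAME", Key="tealium-connector-test.txt")["Body"].read())
```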
STS credentials
- Log in to the AWS Management Console and go to the IAM (Identity and Access Management) service.
- Click Roles and then Create role.
- For the Trusted entity type, select the AWS account.
- Select Another AWS account and specify the Tealium account ID: `719386139886`.
- Optional. Check the Require external ID checkbox and specify the external ID that you want to use. External IDs can be up to 256 characters long and can include alphanumeric characters (`A-Z`, `a-z`, `0-9`) and symbols, such as hyphens (`-`), underscores (`_`), and periods (`.`).
- Enter a name for the role. The role name must start with `tealium-databricks`. For example, `tealium-databricks-s3-test`.
. - Attach policies to the role.
- In the Permissions tab, click Attach existing policies directly.
- Search for and attach the `AmazonS3FullAccess` policy for full access. If you want to restrict access to a specific bucket, you can write a policy similar to the example below. In this example, `YOUR_BUCKET_NAME` is the bucket that Tealium would use to upload event and audience data into S3 objects.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET_NAME",
        "arn:aws:s3:::YOUR_BUCKET_NAME/*"
      ]
    }
  ]
}
```
- Create a trust policy.
- Go to the Trust relationships tab and click Edit trust relationship.
- Ensure that the trust policy allows the Tealium production account ID `757913464184` to assume the role you created and that it requires your specific external ID, as shown in the example below.
- Set the `EXTERNAL_ID` value for the connection to Tealium. The ID can be up to 256 characters long and can include alphanumeric characters (`A-Z`, `a-z`, `0-9`) and symbols, such as hyphens (`-`), underscores (`_`), and periods (`.`).
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::757913464184:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "EXTERNAL_ID"
        }
      }
    }
  ]
}
```
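For illustration only, the role and policies described above could also be created programmatically. The console steps above are the documented path; the sketch below assumes boto3 and placeholder values for the role name, bucket name, and external ID.

```python
# Sketch: create the Tealium role with the trust policy and attach the bucket
# policy shown above. All names and IDs are placeholders; adjust to your account.
import json
import boto3

iam = boto3.client("iam")

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::757913464184:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "EXTERNAL_ID"}},
    }],
}

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:ListBucket", "s3:PutObject", "s3:GetObject",
            "s3:ListBucketMultipartUploads", "s3:ListMultipartUploadParts",
        ],
        "Resource": [
            "arn:aws:s3:::YOUR_BUCKET_NAME",
            "arn:aws:s3:::YOUR_BUCKET_NAME/*",
        ],
    }],
}

role = iam.create_role(
    RoleName="tealium-databricks-s3-test",   # role name must start with tealium-databricks
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="tealium-databricks-s3-test",
    PolicyName="tealium-s3-access",
    PolicyDocument=json.dumps(bucket_policy),
)
print(role["Role"]["Arn"])  # use this ARN in the connector's Assume Role: ARN field
```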
Databricks to AWS S3 connection
To connect Databricks to an AWS S3 instance, you must first create an IAM role in your AWS instance that can be used to create storage credentials in the Databricks instance. For more information about creating the AWS IAM role, see Databricks: Create a storage credential for connecting to AWS S3.
After the storage credentials have been created, define an external location for the S3 location that you will pull data from. For more information, see Databricks: Create an external location to connect cloud storage to Databricks.
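As a rough sketch of what that looks like on the Databricks side, and assuming a storage credential (here a hypothetical one named `tealium_s3_credential`) has already been created as described in the linked documentation, an external location can be defined from a notebook cell:

```python
# Sketch, run inside a Databricks notebook (where `spark` is predefined): define
# an external location over the S3 bucket that Tealium writes to.
# tealium_s3_location, tealium_s3_credential, and YOUR_BUCKET_NAME are placeholders.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS tealium_s3_location
    URL 's3://YOUR_BUCKET_NAME/'
    WITH (STORAGE CREDENTIAL tealium_s3_credential)
""")
```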
Batch Limits
This connector uses batched requests to support high-volume data transfers to the vendor. For more information, see Batched Actions. Requests are queued until one of the following thresholds is met or the profile is published:
- Max number of requests: 100,000
- Max time since oldest request: 10 minutes
- Max size of requests: 10 MB
Configuration
Navigate to the Connector Marketplace and add a new connector. For general instructions on how to add a connector, see About Connectors.
After adding the connector, configure the following settings:
- Authentication Type
- (Required) Select the authentication type. Available options are: STS and Access Key.
- STS: Requires the Assume Role: ARN and Assume Role: Session Name fields.
- Access Key: Requires the AWS Access Key and AWS Secret Access Key fields.
- Region
- (Required) Select a region.
- STS - Assume Role: ARN
- Required for STS authentication. Provide the Amazon Resource Name (ARN) of the role to assume. For example: `arn:aws:iam::222222222222:role/myrole`. For more information, see AWS Identity and Access Management: Switch to an IAM role (AWS API).
- STS - Assume Role: Session Name
- Required for STS authentication. Provide the session name of the role to assume. Minimum length of 2, maximum length of 64.
- STS - Assume Role: External ID
- Required for STS authentication. Provide a third-party external identifier. For more information, see AWS Identity and Access Management: Access to AWS accounts owned by third parties.
- Access Key - AWS Access Key
- Required for Access Key authentication. Provide the AWS access key.
- Access Key - AWS Secret Access Key
- Required for Access Key authentication. Provide the AWS secret access key.
- Databricks Host URL
- Provide the Databricks account URL. For example: `https://{ACCOUNT_NAME}.cloud.databricks.com`.
- Databricks Token
- Provide the Databricks access token. To create an access token in Databricks, click the user avatar in Databricks and navigate to Settings > Developer > Access Tokens > Manage > Generate New Token.
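As an optional sanity check that the host URL and token are valid before saving the configuration, you can call any authenticated Databricks REST endpoint with them. The sketch below uses the Python requests library and the workspace get-status endpoint; the host and token values are placeholders.

```python
# Sketch: confirm the Databricks host URL and access token work by fetching the
# status of the workspace root. ACCOUNT_NAME and DATABRICKS_TOKEN are placeholders.
import requests

host = "https://ACCOUNT_NAME.cloud.databricks.com"
token = "DATABRICKS_TOKEN"

resp = requests.get(
    f"{host}/api/2.0/workspace/get-status",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # a successful response confirms the URL and token are valid
```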
Create a notebook
Notebooks in Databricks are documents that contain executable code, visualizations, and narrative text. They are used for data exploration, visualization, and collaboration. You can create a new notebook while setting up the connector by clicking Create Notebook in the configuration step.
- In the connector configuration screen, click Create Notebook.
- Enter the table name. The schema is specified when creating the job, so do not add it to this field.
  - Names can include alphanumeric characters (`A-Z`, `a-z`, `0-9`) and underscores (`_`).
  - Spaces and special characters, such as `!`, `@`, `#`, `-`, and `.`, are not allowed.
  - Names are case-sensitive. For example, `tableName` and `tablename` are considered different names.
  - Names cannot start with a digit. For example, `1table` is invalid.
- For Notebook Path, enter the absolute path of the notebook. For example: `/Users/user@example.com/project/NOTEBOOK_NAME`.
  - To locate the absolute path of the notebook in Databricks, go to your Databricks workspace and expand the Users section.
  - Click on the user and then expand the options menu.
  - Click on Copy URL/Path > Full Path. The path name will be in the following format: `/Workspace/Users/myemail@company.com`. Add the virtual folder and notebook name that you want, separated by a slash (`/`). For example, `/Workspace/Users/myemail@company.com/virtualfolder/virtualsubfolder/MyNotebook`.
- For S3 Bucket, select the S3 bucket to connect to Databricks.
- The Overwrite option indicates whether to overwrite a notebook that already exists at the specified path.
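For context, the notebook created here typically contains streaming ingestion logic that reads the files Tealium uploads to S3 into the target table. The following is a hypothetical sketch using Spark Auto Loader; the bucket path, file format, and table name are placeholder assumptions, and the notebook generated by the connector may differ.

```python
# Hypothetical notebook sketch, run inside a Databricks notebook (where `spark`
# is predefined): ingest files from the connector's S3 bucket into a table.
# All paths and names below are placeholders.
s3_path = "s3://YOUR_BUCKET_NAME/"
checkpoint = "s3://YOUR_BUCKET_NAME/_checkpoints/tealium_events"

(spark.readStream
    .format("cloudFiles")                  # Auto Loader source
    .option("cloudFiles.format", "json")   # file format is an assumption here
    .load(s3_path)
    .writeStream
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)            # process available files, then stop
    .toTable("your_catalog.your_schema.your_table"))
```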
Create a job
Jobs in Databricks automate running notebooks on a schedule or based on specific triggers. Jobs allow you to perform tasks, such as data processing, analysis, and reporting, at regular intervals or triggered on certain events.
- In the connector configuration screen, click Create Job.
- Enter a name for the processing job.
- For Catalog, specify a catalog from Unity Catalog to use as the destination for publishing pipeline data.
- For Target, specify the schema in the above catalog in which you want to publish or update your table. Do not specify the target table here; the table used is the one specified in the notebook.
- For Notebook Path, enter the absolute path of the notebook. For example: `/Users/user@example.com/project/NOTEBOOK_NAME`.
  - To locate the absolute path of the notebook in Databricks, go to your Databricks workspace and expand the Users section.
  - Click on the user and then expand the options menu.
  - Click on Copy URL/Path > Full Path. The path name will be in the following format: `/Workspace/Users/myemail@company.com`. Add the virtual folder and notebook name that you want, separated by a slash (`/`). For example, `/Workspace/Users/myemail@company.com/virtualfolder/virtualsubfolder/MyNotebook`.
- For S3 Bucket, select the S3 bucket to connect to Databricks.
- For Trigger Type, select when to process the data. Available options are:
- File Arrived: Process data every time a new file arrives.
- Scheduled: Process data on a repeating schedule that you specify.
- Cron: Process data on a repeating schedule that you define in the Cron field.
- For Start Time, specify the start time for job processing in `hh:mm` format. The default start time is `00:00`.
- For Timezone, specify the timezone in `region/city` format. For example, `Europe/London`. This field is required if you provide a start time.
- For Cron, enter the Quartz cron expression that you want to use for scheduled processing. For example, `20 30 * * * ?` processes files at second 20 of minute 30 of every hour, every day. For more information, see Quartz: Cron Trigger Tutorial.
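For context, the options collected in this step correspond to a Databricks job definition. A simplified, hypothetical sketch of an equivalent job created directly through the Databricks Jobs API 2.1 with a cron schedule is shown below; all names, IDs, and the cluster reference are placeholders.

```python
# Sketch: create a scheduled Databricks job that runs the ingestion notebook,
# roughly mirroring the connector's Create Job options. All values are placeholders.
import requests

host = "https://ACCOUNT_NAME.cloud.databricks.com"
token = "DATABRICKS_TOKEN"

job_spec = {
    "name": "tealium-ingest-job",
    "tasks": [
        {
            "task_key": "ingest",
            "existing_cluster_id": "CLUSTER_ID",  # placeholder cluster to run on
            "notebook_task": {
                "notebook_path": "/Workspace/Users/myemail@company.com/virtualfolder/MyNotebook"
            },
        }
    ],
    # Quartz cron: second 20 of minute 30 of every hour, in the given timezone.
    "schedule": {
        "quartz_cron_expression": "20 30 * * * ?",
        "timezone_id": "Europe/London",
    },
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id
```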
Actions
The following section lists the supported parameters for each action.
Send Entire Event Data
Parameters
Parameter | Description |
---|---|
Amazon S3 Bucket | Select the Amazon S3 bucket or provide a custom value. |
Databricks Catalog | Select the Databricks catalog or provide a custom value. |
Databricks Schema | Select the Databricks schema or provide a custom value. |
Databricks Table | Select the Databricks table or provide a custom value. |
Column to record the payload | Select the VARIANT column to record the payload. |
Column to record the Timestamp | Select the column to record the timestamp. |
Timestamp Attribute | By default, the current timestamp for the action is sent. To send a different value or format, select an attribute to assign as the timestamp. If the assigned attribute produces an empty value, the current timestamp is sent. |
Send Custom Event Data
Parameters
Parameter | Description |
---|---|
Amazon S3 Bucket | Select the Amazon S3 bucket or provide a custom value. |
Databricks Catalog | Select the Databricks catalog or provide a custom value. |
Databricks Schema | Select the Databricks schema or provide a custom value. |
Databricks Table | Select the Databricks table or provide a custom value. |
Event Parameters
Map parameters to the columns of the Databricks table. You must map at least one parameter.
Send Entire Visitor Data
Parameters
Parameter | Description |
---|---|
Amazon S3 Bucket | Select the Amazon S3 bucket or provide a custom value. |
Databricks Catalog | Select the Databricks catalog or provide a custom value. |
Databricks Schema | Select the Databricks schema or provide a custom value. |
Databricks Table | Select the Databricks table or provide a custom value. |
Column to record the visitor data | Select the VARIANT column to record the visitor data. |
Column to record the Timestamp | Select the column to record the timestamp. |
Timestamp Attribute | By default, the current timestamp for the action is sent. To send a different value or format, select an attribute to assign as the timestamp. If the assigned attribute produces an empty value, the current timestamp is sent. |
Include Current Visit Data with Visitor Data | Select to include the current visit data with visitor data. |
Send Custom Visitor Data
Parameters
Parameter | Description |
---|---|
Amazon S3 Bucket | Select the Amazon S3 bucket or provide a custom value. |
Databricks Catalog | Select the Databricks catalog or provide a custom value. |
Databricks Schema | Select the Databricks schema or provide a custom value. |
Databricks Table | Select the Databricks table or provide a custom value. |
Visitor Parameters
Map parameters to the columns of the Databricks table. You must map at least one parameter.
This page was last updated: December 12, 2024