Kafka Connect
Integration Details
This plugin extracts the following:
- Source and Sink Connectors in Kafka Connect as Data Pipelines
- For Source connectors - Data Jobs to represent lineage information from source dataset to Kafka topic per {connector_name}:{source_dataset} combination
- For Sink connectors - Data Jobs to represent lineage information from Kafka topic to destination dataset per {connector_name}:{topic} combination
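To illustrate the granularity described in the list above, the following minimal sketch builds one job identifier per source dataset for a source connector and one per topic for a sink connector. The function names and input shapes are hypothetical and are not part of the plugin; only the {connector_name}:{source_dataset} and {connector_name}:{topic} patterns come from this page.

```python
from typing import List


def source_job_ids(connector_name: str, source_datasets: List[str]) -> List[str]:
    """Hypothetical helper: one id per {connector_name}:{source_dataset} combination."""
    return [f"{connector_name}:{dataset}" for dataset in source_datasets]


def sink_job_ids(connector_name: str, topics: List[str]) -> List[str]:
    """Hypothetical helper: one id per {connector_name}:{topic} combination."""
    return [f"{connector_name}:{topic}" for topic in topics]


if __name__ == "__main__":
    # Illustrative names only.
    print(source_job_ids("mysql_source", ["librarydb.book", "librarydb.member"]))
    print(sink_job_ids("bigquery_sink", ["orders"]))
```

A connector that reads two tables would therefore yield two Data Jobs under the same Data Pipeline.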
Concept Mapping
This ingestion source maps the following Source System Concepts to DataHub Concepts:
Source Concept | DataHub Concept | Notes |
---|---|---|
"kafka-connect" | Data Platform | |
Connector | DataFlow | |
Kafka Topic | Dataset | |
Current limitations
Works only for
- Source connectors: JDBC, Debezium, Mongo and Generic connectors with user-defined lineage graph
- Sink connectors: BigQuery
Module kafka-connect
Important Capabilities
Capability | Status | Notes |
---|---|---|
Platform Instance | ✅ | Enabled by default |
CLI based Ingestion
Install the Plugin
pip install 'acryl-datahub[kafka-connect]'
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
  type: "kafka-connect"
  config:
    # Coordinates
    connect_uri: "http://localhost:8083"
    # Credentials
    username: admin
    password: password
    # Optional
    platform_instance_map:
      bigquery: bigquery_platform_instance_id
sink:
  # sink configs
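The platform_instance_map entry in the recipe above affects the dataset URNs that lineage points to. The sketch below is a simplified stand-in, not the plugin's code: it assumes the common DataHub dataset URN layout, in which the platform instance is prefixed to the dataset name. The function name dataset_urn and the table name are hypothetical.

```python
from typing import Dict, Optional


def dataset_urn(
    platform: str,
    name: str,
    env: str = "PROD",
    platform_instance_map: Optional[Dict[str, str]] = None,
) -> str:
    """Build a dataset URN, prefixing the platform instance when one is mapped.

    Assumption: the URN layout is
    urn:li:dataset:(urn:li:dataPlatform:<platform>,<instance>.<name>,<env>).
    """
    instance = (platform_instance_map or {}).get(platform)
    qualified_name = f"{instance}.{name}" if instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified_name},{env})"


if __name__ == "__main__":
    mapping = {"bigquery": "bigquery_platform_instance_id"}
    # Hypothetical table name, for illustration only.
    print(dataset_urn("bigquery", "project.dataset.table", platform_instance_map=mapping))
    print(dataset_urn("kafka", "orders"))
```

If the instance configured here does not match the one used when the destination platform itself was ingested, the lineage edges would point at different URNs than the existing datasets.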
Config Details
Note that a dot (.) is used to denote nested fields in the YAML recipe.
Field | Required | Type | Description | Default |
---|---|---|---|---|
env | | string | The environment that all assets produced by this connector belong to | PROD |
platform_instance_map | | Dict[str,string] | Platform instance mapping to use when constructing URNs. e.g. platform_instance_map: { "hive": "warehouse" } | |
connect_uri | | string | URI to connect to. | http://localhost:8083/ |
username | | string | Kafka Connect username. | None |
password | | string | Kafka Connect password. | None |
cluster_name | | string | Cluster to ingest from. | connect-cluster |
convert_lineage_urns_to_lowercase | | boolean | Whether to convert the urns of ingested lineage dataset to lowercase | False |
provided_configs | | Array of object | Provided Configurations | None |
connect_to_platform_map | | Dict | Platform instance mapping when multiple instances for a platform are available. Entry for a platform should be in either platform_instance_map or connect_to_platform_map. e.g. connect_to_platform_map: { "postgres-connector-finance-db": { "postgres": "core_finance_instance" } } | |
generic_connectors | | Array of object | Provide lineage graph for source connectors other than Confluent JDBC Source Connector, Debezium Source Connector, and Mongo Source Connector | [] |
connector_patterns | | AllowDenyPattern (see below for fields) | Regex patterns for connectors to filter for ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
connector_patterns.allow | | Array of string | List of regex patterns to include in ingestion | ['.*'] |
connector_patterns.deny | | Array of string | List of regex patterns to exclude from ingestion. | [] |
connector_patterns.ignoreCase | | boolean | Whether to ignore case sensitivity during pattern matching. | True |
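The connector_patterns fields in the table above can be read as an allow/deny filter over connector names. The sketch below shows one plausible interpretation and is not the plugin's implementation: it assumes that deny patterns take precedence over allow patterns and that patterns are matched from the start of the name. The function name is_connector_allowed and the connector names are hypothetical.

```python
import re
from typing import List


def is_connector_allowed(
    name: str,
    allow: List[str],
    deny: List[str],
    ignore_case: bool = True,
) -> bool:
    """Return True if the connector name passes the allow/deny patterns.

    Assumptions: deny wins over allow, and each pattern is anchored at the
    start of the name (re.match semantics).
    """
    flags = re.IGNORECASE if ignore_case else 0
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


if __name__ == "__main__":
    for connector in ["MySQL-Source", "bq-sink"]:
        print(connector, is_connector_allowed(connector, [".*"], ["mysql-.*"]))
```

With the default ignoreCase of True, a deny pattern written in lowercase would also exclude connectors whose names use uppercase letters.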
The JSONSchema for this configuration is inlined below.
{
  "title": "KafkaConnectSourceConfig",
  "description": "Any non-Dataset source that produces lineage to Datasets should inherit this class.\ne.g. Orchestrators, Pipelines, BI Tools etc.",
  "type": "object",
  "properties": {
    "env": {
      "title": "Env",
      "description": "The environment that all assets produced by this connector belong to",
      "default": "PROD",
      "type": "string"
    },
    "platform_instance_map": {
      "title": "Platform Instance Map",
      "description": "Platform instance mapping to use when constructing URNs. e.g. `platform_instance_map: { \"hive\": \"warehouse\" }`",
      "type": "object",
      "additionalProperties": {
        "type": "string"
      }
    },
    "connect_uri": {
      "title": "Connect Uri",
      "description": "URI to connect to.",
      "default": "http://localhost:8083/",
      "type": "string"
    },
    "username": {
      "title": "Username",
      "description": "Kafka Connect username.",
      "type": "string"
    },
    "password": {
      "title": "Password",
      "description": "Kafka Connect password.",
      "type": "string"
    },
    "cluster_name": {
      "title": "Cluster Name",
      "description": "Cluster to ingest from.",
      "default": "connect-cluster",
      "type": "string"
    },
    "convert_lineage_urns_to_lowercase": {
      "title": "Convert Lineage Urns To Lowercase",
      "description": "Whether to convert the urns of ingested lineage dataset to lowercase",
      "default": false,
      "type": "boolean"
    },
    "connector_patterns": {
      "title": "Connector Patterns",
      "description": "regex patterns for connectors to filter for ingestion.",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "allOf": [
        {
          "$ref": "#/definitions/AllowDenyPattern"
        }
      ]
    },
    "provided_configs": {
      "title": "Provided Configs",
      "description": "Provided Configurations",
      "type": "array",
      "items": {
        "$ref": "#/definitions/ProvidedConfig"
      }
    },
    "connect_to_platform_map": {
      "title": "Connect To Platform Map",
      "description": "Platform instance mapping when multiple instances for a platform are available. Entry for a platform should be in either `platform_instance_map` or `connect_to_platform_map`. e.g. `connect_to_platform_map: { \"postgres-connector-finance-db\": { \"postgres\": \"core_finance_instance\" } }`",
      "type": "object"
    },
    "generic_connectors": {
      "title": "Generic Connectors",
      "description": "Provide lineage graph for source connectors other than Confluent JDBC Source Connector, Debezium Source Connector, and Mongo Source Connector",
      "default": [],
      "type": "array",
      "items": {
        "$ref": "#/definitions/GenericConnectorConfig"
      }
    }
  },
  "additionalProperties": false,
  "definitions": {
    "AllowDenyPattern": {
      "title": "AllowDenyPattern",
      "description": "A class to store allow deny regexes",
      "type": "object",
      "properties": {
        "allow": {
          "title": "Allow",
          "description": "List of regex patterns to include in ingestion",
          "default": [
            ".*"
          ],
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "deny": {
          "title": "Deny",
          "description": "List of regex patterns to exclude from ingestion.",
          "default": [],
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "ignoreCase": {
          "title": "Ignorecase",
          "description": "Whether to ignore case sensitivity during pattern matching.",
          "default": true,
          "type": "boolean"
        }
      },
      "additionalProperties": false
    },
    "ProvidedConfig": {
      "title": "ProvidedConfig",
      "type": "object",
      "properties": {
        "provider": {
          "title": "Provider",
          "type": "string"
        },
        "path_key": {
          "title": "Path Key",
          "type": "string"
        },
        "value": {
          "title": "Value",
          "type": "string"
        }
      },
      "required": [
        "provider",
        "path_key",
        "value"
      ],
      "additionalProperties": false
    },
    "GenericConnectorConfig": {
      "title": "GenericConnectorConfig",
      "type": "object",
      "properties": {
        "connector_name": {
          "title": "Connector Name",
          "type": "string"
        },
        "source_dataset": {
          "title": "Source Dataset",
          "type": "string"
        },
        "source_platform": {
          "title": "Source Platform",
          "type": "string"
        }
      },
      "required": [
        "connector_name",
        "source_dataset",
        "source_platform"
      ],
      "additionalProperties": false
    }
  }
}
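The GenericConnectorConfig definition in the schema above requires connector_name, source_dataset, and source_platform and allows no other keys. As a rough pre-flight check, the sketch below validates generic_connectors entries against those rules before a recipe is run. It is a hand-written illustration, not the validation the plugin performs; the function name and the sample values are hypothetical.

```python
from typing import Any, Dict, List

# Required keys, copied from the GenericConnectorConfig schema definition.
REQUIRED_FIELDS = ("connector_name", "source_dataset", "source_platform")


def validate_generic_connectors(entries: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the entries look valid."""
    errors: List[str] = []
    for index, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        unexpected = sorted(set(entry) - set(REQUIRED_FIELDS))
        if missing:
            errors.append(f"generic_connectors[{index}]: missing {', '.join(missing)}")
        if unexpected:
            errors.append(f"generic_connectors[{index}]: unexpected {', '.join(unexpected)}")
    return errors


if __name__ == "__main__":
    # Sample values are made up for illustration.
    entries = [
        {"connector_name": "generic_source", "source_dataset": "orders", "source_platform": "postgres"},
        {"connector_name": "incomplete_source", "source_dataset": "orders"},
    ]
    for problem in validate_generic_connectors(entries):
        print(problem)
```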
Advanced Configurations
Kafka Connect supports pluggable configuration providers, which can load configuration data from external sources at runtime. These values are not available to the DataHub ingestion source through the Kafka Connect APIs. If you use such provided configurations to specify a connection URL (database, etc.) in a Kafka Connect connector configuration, you will also need to add them to the provided_configs section of the recipe for DataHub to generate correct lineage.
# Optional mapping of provider configurations if using
provided_configs:
  - provider: env
    path_key: MYSQL_CONNECTION_URL
    value: jdbc:mysql://test_mysql:3306/librarydb
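To make the role of provided_configs concrete, the sketch below substitutes Kafka Connect placeholders of the form ${provider:path_key} in a connector configuration with the values given in the recipe. It is an approximation for illustration, not the plugin's code: the function name resolve_provided_configs is hypothetical, the connector configuration is a made-up example, and the optional path segment of the placeholder syntax is not handled separately.

```python
from typing import Dict, List


def resolve_provided_configs(
    connector_config: Dict[str, str],
    provided_configs: List[Dict[str, str]],
) -> Dict[str, str]:
    """Replace ${provider:path_key} placeholders with the values from the recipe."""
    # Build the placeholder text for every provided config entry.
    lookups = {
        "${" + item["provider"] + ":" + item["path_key"] + "}": item["value"]
        for item in provided_configs
    }
    resolved: Dict[str, str] = {}
    for key, value in connector_config.items():
        for placeholder, replacement in lookups.items():
            value = value.replace(placeholder, replacement)
        resolved[key] = value
    return resolved


if __name__ == "__main__":
    provided = [
        {
            "provider": "env",
            "path_key": "MYSQL_CONNECTION_URL",
            "value": "jdbc:mysql://test_mysql:3306/librarydb",
        }
    ]
    # Illustrative connector configuration as the Kafka Connect API might return it.
    config = {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "${env:MYSQL_CONNECTION_URL}",
    }
    print(resolve_provided_configs(config, provided)["connection.url"])
```

Without the substitution, the connection URL would remain an opaque placeholder, and the source database could not be identified for lineage.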
Code Coordinates
- Class Name: datahub.ingestion.source.kafka_connect.KafkaConnectSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Kafka Connect, feel free to ping us on our Slack.