Airflow Troubleshooting & Advanced

This page covers installation validation, Git Sync guidance, SSL configuration, and troubleshooting for Airflow-based ingestion pipelines. For setup and configuration, see the OpenMetadata Ingestion Overview.

Validating the installation

We need to verify that the OpenMetadata server can reach the Airflow API endpoints (wherever they live: bare metal, containers, k8s pods…). One way to check this is to connect to the deployment hosting your OpenMetadata server and run a query against the /health endpoint. For example:
$ curl -XGET ${PIPELINE_SERVICE_CLIENT_ENDPOINT}/api/v1/openmetadata/health
{"status": "healthy", "version": "x.y.z"}
It is important to run the command exactly as written (i.e., curl -XGET ${PIPELINE_SERVICE_CLIENT_ENDPOINT}/api/v1/openmetadata/health) and let the environment perform the substitution. That is the only way to be sure that the endpoint the server is actually configured with is correct.
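If curl is not available on the host, the same check can be approximated in Python. This is a minimal sketch, not part of OpenMetadata: the function names are hypothetical, and it assumes the health endpoint returns the JSON body shown above. Like the curl command, it reads the endpoint from the environment instead of a hand-typed URL.

```python
import json
import os
import urllib.request


def health_url(endpoint: str) -> str:
    """Build the health URL from the configured pipeline service client endpoint."""
    return endpoint.rstrip("/") + "/api/v1/openmetadata/health"


def is_healthy(body: str) -> bool:
    """Return True when the response body is JSON reporting status 'healthy'."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # An HTML error page or an empty body is not a valid health response.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(timeout: float = 10.0) -> bool:
    """Query the endpoint taken from the environment, as the curl command does."""
    endpoint = os.environ["PIPELINE_SERVICE_CLIENT_ENDPOINT"]
    with urllib.request.urlopen(health_url(endpoint), timeout=timeout) as resp:
        return is_healthy(resp.read().decode("utf-8"))
```

Calling check_health() raises KeyError if the variable is not set in that environment, which is itself a useful signal that the configuration is not what you expect.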

More validations in the installation

If you have an existing DAG in Airflow, you can further test your setup by running the following:
curl -XPOST http://localhost:8080/api/v1/openmetadata/enable --data-raw '{"dag_id": "example_bash_operator"}' -u "admin:admin" --header 'Content-Type: application/json'
Note that this example assumes:
  • There is an Airflow instance running at localhost:8080.
  • There is a user admin with password admin.
  • There is a DAG named example_bash_operator.
A generic call would look like:
curl -XPOST <PIPELINE_SERVICE_CLIENT_ENDPOINT>/api/v1/openmetadata/enable --data-raw '{"dag_id": "<DAG name>"}' -u "<user>:<password>" --header 'Content-Type: application/json'
Replace the placeholders with the values for your deployment.
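The generic call can also be assembled programmatically. The sketch below only builds the request object so that the URL, payload, and headers can be inspected before anything is sent. The function name is hypothetical, and it assumes the endpoint accepts HTTP basic authentication, as the -u flag in the curl example implies.

```python
import base64
import json
import urllib.request


def build_enable_request(
    endpoint: str, dag_id: str, user: str, password: str
) -> urllib.request.Request:
    """Build the POST request equivalent to the generic curl call."""
    # HTTP basic auth: base64 of "user:password", the same header curl -u sends.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/api/v1/openmetadata/enable",
        data=json.dumps({"dag_id": dag_id}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

To send it, pass the returned object to urllib.request.urlopen(). A non-2xx response raises urllib.error.HTTPError, whose code attribute carries the HTTP status.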

Git Sync?

One recurrent question when setting up Airflow is whether git-sync can be used to manage the ingestion DAGs. Let’s highlight the differences between git-sync and what we want to achieve by installing our custom API plugins:
  1. git-sync uses Git as the source of truth for your DAGs. This means that any DAG you have in Git will eventually be picked up and scheduled in Airflow.
  2. With the openmetadata-managed-apis, the OpenMetadata server is the source of truth. They enable dynamic DAG creation from OpenMetadata into your Airflow instance every time you create a new Ingestion Workflow.
Then, should you use git-sync?
  • If you have an existing Airflow instance and you want to build and maintain your own ingestion DAGs, then you can go for it. Check a DAG example here.
  • If, instead, you want to use the full deployment process from OpenMetadata, git-sync is not the right tool, since the DAGs won’t be backed by Git but created from OpenMetadata. Note that if you ever lose the Airflow volumes, you can simply redeploy the DAGs from OpenMetadata.

SSL

To learn how to configure Airflow with SSL, see the Airflow SSL guide.

Troubleshooting

Ingestion Pipeline deployment issues

Airflow APIs Not Found

Validate the installation, making sure that from the OpenMetadata server you can reach the Airflow host, and the call to /health gives us the proper response:
$ curl -XGET ${PIPELINE_SERVICE_CLIENT_ENDPOINT}/api/v1/openmetadata/health
{"status": "healthy", "version": "x.y.z"}
Also, make sure that the version of your OpenMetadata server matches the openmetadata-ingestion client version installed in Airflow.

GetServiceException: Could not get service from type XYZ

In this case, the OpenMetadata client running on the Airflow host could not fetch the service you are trying to deploy from the API. Note that once pipelines are deployed, authentication happens via the ingestion-bot. There are a few points to validate:
  1. The JWT of the ingestion bot is valid. You can use services such as https://jwt.io/ to check whether the token has expired or whether there are any configuration issues.
  2. The ingestion-bot has the proper role. If you go to <openmetadata-server>/bots/ingestion-bot, the bot should show the Ingestion bot role. You can also review the role policies to make sure they were not changed and that the bot can indeed view and access services from the API.
  3. Run an API call for your service to reproduce the issue. An example that fetches a database service looks like this:
    curl -XGET 'http://<server>:8585/api/v1/services/databaseServices/name/<service name>' \
    -H 'Accept: application/json' -H 'Authorization: Bearer <token>'
    
    If, for example, you have an issue with the roles you would be getting a message similar to:
    {"code":403,"message":"Principal: CatalogPrincipal{name='ingestion-bot'} operations [ViewAll] not allowed"}
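For the first point, the expiry of the bot token can also be inspected locally instead of pasting it into a web page. The following is a minimal sketch using only the standard library. The function names are hypothetical, it assumes the token is a standard three-segment JWT with an optional exp claim, and it does not verify the signature, so it only answers whether the token has expired.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload (second segment) of a JWT without verifying it."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWTs strip base64url padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """Return True when the token carries an exp claim that is in the past."""
    exp = jwt_claims(token).get("exp")
    if exp is None:
        # No exp claim: the token does not expire based on time.
        return False
    current = time.time() if now is None else now
    return current >= exp
```

An unexpired token can still be rejected for other reasons (for example, a signing key mismatch), so treat a False result as ruling out expiry only.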
    

AirflowException: Dag ‘XYZ’ could not be found

If you see an error similar to:
[...]
task_run
    _dag = get_dag(args.subdir, args.dag_id)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/cli.py", line 235, in get_dag
    raise AirflowException(
airflow.exceptions.AirflowException: Dag '...' could not be found; either it does not exist or it failed to parse.
This usually means that the shared volumes between Webserver <> Scheduler <> Worker are not properly enabled in your distributed environment. We have specific instructions on how to set up the shared volumes in Kubernetes depending on your cloud deployment here.

ClientInitializationError

The most common root cause here is a version mismatch between the server and the client. Make sure that the openmetadata-ingestion Python package installed on the Airflow host has the same version as the OpenMetadata server. For example, for OpenMetadata server 0.13.2 you need to install openmetadata-ingestion~=0.13.2. Note that the version is validated as x.y.z. Any differences after the PATCH component are not taken into account, as they are usually small bug fixes to existing functionality.
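The comparison rule described above can be illustrated with a short sketch. It mirrors the rule as stated (compare x.y.z, ignore anything after PATCH) and is not the actual check performed by OpenMetadata; both function names are hypothetical.

```python
import re
from typing import Tuple


def core_version(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch), ignoring anything after PATCH."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not an x.y.z version: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def versions_match(server_version: str, client_version: str) -> bool:
    """Return True when server and client agree on x.y.z."""
    return core_version(server_version) == core_version(client_version)
```

Under this rule, server 0.13.2 and client 0.13.2.1 are treated as compatible, while server 0.13.2 and client 0.13.1 are not.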

401 Unauthorized

If you get this response during a Test Connection or Deploy:
airflow API returned Unauthorized and response
{ "detail": null, "status": 401, "title": "Unauthorized", "type": "https://airflow.apache.org/docs/apache-airflow/2.3.3/stable-rest-api-ref.html#section/Errors/Unauthenticated" }
This is an authentication issue between the OpenMetadata server and the Airflow instance. The server can reach the Airflow host, but the user and password it provides are not correct. Note the following section of the server configuration:
pipelineServiceClientConfiguration:
  [...]
  parameters:
    username: ${AIRFLOW_USERNAME:-admin}
    password: ${AIRFLOW_PASSWORD:-admin}
Verify that the values of the environment variables AIRFLOW_USERNAME and AIRFLOW_PASSWORD allow you to authenticate to the instance.
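A quick way to see which credentials are in effect is to resolve the two variables the same way the configuration does. This is a small sketch with hypothetical function names. It assumes that an unset variable falls back to admin, as the defaults in the snippet above suggest, and it does not cover how the server treats a variable that is set but empty.

```python
import os
from typing import Mapping, Tuple


def resolve_airflow_credentials(env: Mapping[str, str]) -> Tuple[str, str]:
    """Resolve username and password, falling back to the 'admin' defaults."""
    # Assumption: only an unset variable triggers the default.
    return (
        env.get("AIRFLOW_USERNAME", "admin"),
        env.get("AIRFLOW_PASSWORD", "admin"),
    )


def describe_credentials(env: Mapping[str, str]) -> str:
    """Summarize where each value comes from without printing the password."""
    user, _ = resolve_airflow_credentials(env)
    user_src = "environment" if "AIRFLOW_USERNAME" in env else "default"
    pass_src = "environment" if "AIRFLOW_PASSWORD" in env else "default"
    return f"username={user} ({user_src}), password=<hidden> ({pass_src})"


if __name__ == "__main__":
    # Run this in the same environment as the OpenMetadata server.
    print(describe_credentials(os.environ))
```

If either value is reported as coming from the default while your Airflow instance uses a different user, the variable is probably not reaching the server process.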

CentOS / Debian - The name ‘template_blueprint’ is already registered

If you are using a CentOS / Debian system to install the openmetadata-managed-apis you might encounter the following issue when starting Airflow:
airflow standalone
standalone | Starting Airflow Standalone
standalone | Checking database is initialized
INFO  [alembic.runtime.migration] Context impl SQLiteImpl.
INFO  [alembic.runtime.migration] Will assume non-transactional DDL.
WARNI [airflow.models.crypto] empty cryptography key - values will not be stored encrypted.
standalone | Database ready
[2023-08-11 05:39:28,851] {manager.py:508} INFO - Created Permission View: can create on DAGs
[2023-08-11 05:39:28,910] {manager.py:508} INFO - Created Permission View: menu access on REST API Plugin
[2023-08-11 05:39:28,916] {manager.py:568} INFO - Added Permission menu access on REST API Plugin to role Admin
Traceback (most recent call last):
  File "/home/pmcevoy/airflow233/bin/airflow", line 8, in <module>
    sys.exit(main())
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/airflow/__main__.py", line 38, in main
    args.func(args)
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/airflow/cli/cli_parser.py", line 51, in command
    return func(*args, **kwargs)
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/airflow/cli/commands/standalone_command.py", line 48, in entrypoint
    StandaloneCommand().run()
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/airflow/cli/commands/standalone_command.py", line 64, in run
    self.initialize_database()
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/airflow/cli/commands/standalone_command.py", line 180, in initialize_database
    appbuilder = cached_app().appbuilder
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/airflow/www/app.py", line 158, in cached_app
    app = create_app(config=config, testing=testing)
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/airflow/www/app.py", line 140, in create_app
    init_plugins(flask_app)
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/airflow/www/extensions/init_views.py", line 141, in init_plugins
    app.register_blueprint(blue_print["blueprint"])
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/flask/scaffold.py", line 56, in wrapper_func
    return f(self, *args, **kwargs)
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/flask/app.py", line 1028, in register_blueprint
    blueprint.register(self, options)
  File "/home/pmcevoy/airflow233/lib64/python3.9/site-packages/flask/blueprints.py", line 305, in register
    raise ValueError(
ValueError: The name 'template_blueprint' is already registered for this blueprint. Use 'name=' to provide a unique name.
The issue occurs because a lib64 symlink pointing to lib exists inside the venv, so the plugin files are visible under both paths:
(airflow233) [pmcevoy@lab1 airflow233]$ ls -la
total 28
drwxr-xr-x 6 pmcevoy pmcevoy 4096 Aug 14 00:34 .
drwx------ 6 pmcevoy pmcevoy 4096 Aug 14 00:32 ..
drwxr-xr-x 3 pmcevoy pmcevoy 4096 Aug 14 00:34 bin
drwxr-xr-x 3 pmcevoy pmcevoy 4096 Aug 14 00:33 include
drwxr-xr-x 3 pmcevoy pmcevoy 4096 Aug 14 00:32 lib
lrwxrwxrwx 1 pmcevoy pmcevoy    3 Aug 14 00:32 lib64 -> lib
-rw-r--r-- 1 pmcevoy pmcevoy   70 Aug 14 00:32 pyvenv.cfg
drwxr-xr-x 3 pmcevoy pmcevoy 4096 Aug 14 00:34 share
(airflow233) [pmcevoy@lab1 airflow233]$ grep -r template_blueprint *
lib/python3.9/site-packages/openmetadata_managed_apis/plugin.py:template_blueprint = Blueprint(
lib/python3.9/site-packages/openmetadata_managed_apis/plugin.py:    "template_blueprint",
lib/python3.9/site-packages/openmetadata_managed_apis/plugin.py:    flask_blueprints = [template_blueprint, api_blueprint]
grep: lib/python3.9/site-packages/openmetadata_managed_apis/__pycache__/plugin.cpython-39.pyc: binary file matches
lib64/python3.9/site-packages/openmetadata_managed_apis/plugin.py:template_blueprint = Blueprint(
lib64/python3.9/site-packages/openmetadata_managed_apis/plugin.py:    "template_blueprint",
lib64/python3.9/site-packages/openmetadata_managed_apis/plugin.py:    flask_blueprints = [template_blueprint, api_blueprint]
grep: lib64/python3.9/site-packages/openmetadata_managed_apis/__pycache__/plugin.cpython-39.pyc: binary file matches
A workaround is to remove the lib64 symlink from the venv root: rm lib64.
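Before removing anything, you may want to confirm that your venv has the same layout as the listing above. The sketch below checks whether lib64 is a symlink that resolves to lib. The function name is hypothetical, and the check only detects the layout; it does not remove the link.

```python
import os


def has_lib64_alias(venv_path: str) -> bool:
    """Return True when <venv>/lib64 is a symlink resolving to <venv>/lib."""
    lib = os.path.join(venv_path, "lib")
    lib64 = os.path.join(venv_path, "lib64")
    # islink() is False for a real directory, so separate lib64 trees are ignored.
    return os.path.islink(lib64) and os.path.realpath(lib64) == os.path.realpath(lib)
```

For the environment shown above, has_lib64_alias("/home/pmcevoy/airflow233") would be expected to return True.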