Lookml Pre

metadata-ingestion/docs/sources/looker/lookml_pre.md

Overview

The lookml module ingests metadata from Looker into DataHub. It is intended for production ingestion workflows, and its module-specific capabilities are documented below.

Ingestion Options

You have three options for where your LookML ingestion runs:

  • The DataHub UI (recommended for the easiest out-of-the-box experience)
  • As a GitHub Action (recommended to ensure that you have the freshest metadata pushed on change)
  • Using the CLI (scheduled via an orchestrator like Airflow)

Read on to learn more about these options.

UI-based Ingestion [Recommended for ease of use]

To ingest LookML metadata through the UI, you must first set up a GitHub deploy key using the instructions in the Prerequisites section below. Once that is complete, follow the on-screen instructions on the Ingestion page to set up a LookML source. The following video shows how to ingest LookML metadata through the UI and where to find the relevant information in your Looker account.

<div style={{ position: "relative", paddingBottom: "56.25%", height: 0 }}> <iframe src="https://www.loom.com/embed/c66dd625de7f48b39005e0eb9c345f5a" frameBorder={0} webkitallowfullscreen="" mozallowfullscreen="" allowFullScreen="" style={{ position: "absolute", top: 0, left: 0, width: "100%", height: "100%" }} /> </div>

GitHub Action based Ingestion [Recommended for push-based integration]

You can set up ingestion using a GitHub Action to push metadata whenever your main Looker GitHub repo changes. The following sample GitHub Action file can be modified to emit LookML metadata whenever there is a change to your repository. This ensures that metadata is always fresh and up to date.

Sample GitHub Action

Drop this file into your .github/workflows directory inside your Looker GitHub repo. You need to set up the following secrets in your GitHub repository to get this workflow to work:

  • DATAHUB_GMS_URL: The endpoint where your DataHub host is running
  • DATAHUB_GMS_TOKEN: An authentication token provisioned for DataHub ingestion
  • LOOKER_BASE_URL: The base url where your Looker assets are hosted (e.g. https://acryl.cloud.looker.com)
  • LOOKER_CLIENT_ID: A provisioned Looker Client ID
  • LOOKER_CLIENT_SECRET: A provisioned Looker Client Secret
yml
name: lookml metadata upload
on:
  # Note that this action only runs on pushes to your main branch. If you want to also
  # run on pull requests, we'd recommend running datahub ingest with the `--dry-run` flag.
  push:
    branches:
      - main
  release:
    types: [published, edited]
  workflow_dispatch:

jobs:
  lookml-metadata-upload:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Run LookML ingestion
        run: |
          pip install 'acryl-datahub[lookml,datahub-rest]'
          cat << EOF > lookml_ingestion.yml
          # LookML ingestion configuration.
          # This is a full ingestion recipe, and supports all config options that the LookML source supports.
          source:
            type: "lookml"
            config:
              base_folder: ${{ github.workspace }}
              parse_table_names_from_sql: true
              git_info:
                repo: ${{ github.repository }}
                branch: ${{ github.ref }}
              # Options
              #connection_to_platform_map:
              #  connection-name:
              #    platform: platform-name (e.g. snowflake)
              #    default_db: default-db-name (e.g. DEMO_PIPELINE)
              api:
                client_id: ${LOOKER_CLIENT_ID}
                client_secret: ${LOOKER_CLIENT_SECRET}
                base_url: ${LOOKER_BASE_URL}
              # Enable API-based lineage extraction (required for field splitting features)
              use_api_for_view_lineage: true
              # Optional: Large view handling configuration
              # field_threshold_for_splitting: 100
              # allow_partial_lineage_results: true
              # enable_individual_field_fallback: true
              # max_workers_for_parallel_processing: 10
          sink:
            type: datahub-rest
            config:
              server: ${DATAHUB_GMS_URL}
              token: ${DATAHUB_GMS_TOKEN}
          EOF
          datahub ingest -c lookml_ingestion.yml
        env:
          DATAHUB_GMS_URL: ${{ secrets.DATAHUB_GMS_URL }}
          DATAHUB_GMS_TOKEN: ${{ secrets.DATAHUB_GMS_TOKEN }}
          LOOKER_BASE_URL: ${{ secrets.LOOKER_BASE_URL }}
          LOOKER_CLIENT_ID: ${{ secrets.LOOKER_CLIENT_ID }}
          LOOKER_CLIENT_SECRET: ${{ secrets.LOOKER_CLIENT_SECRET }}
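
The comment in the workflow above recommends the --dry-run flag for pull requests. A minimal sketch of such a companion workflow follows. It assumes the recipe is committed to the repository root as lookml_ingestion.yml (a hypothetical file name, used here instead of generating the recipe inline), that its base_folder points at the checked-out repository, and that the same secrets are available to pull request runs, which is typically not the case for pull requests opened from forks.

```yml
# Hypothetical companion workflow: check LookML ingestion on pull requests
# without writing metadata to DataHub.
name: lookml metadata dry run
on:
  pull_request:
    branches:
      - main

jobs:
  lookml-metadata-dry-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Dry-run LookML ingestion
        run: |
          pip install 'acryl-datahub[lookml,datahub-rest]'
          # Assumption: lookml_ingestion.yml is checked in at the repo root.
          # --dry-run runs the source but skips writing to the sink.
          datahub ingest -c lookml_ingestion.yml --dry-run
        env:
          DATAHUB_GMS_URL: ${{ secrets.DATAHUB_GMS_URL }}
          DATAHUB_GMS_TOKEN: ${{ secrets.DATAHUB_GMS_TOKEN }}
          LOOKER_BASE_URL: ${{ secrets.LOOKER_BASE_URL }}
          LOOKER_CLIENT_ID: ${{ secrets.LOOKER_CLIENT_ID }}
          LOOKER_CLIENT_SECRET: ${{ secrets.LOOKER_CLIENT_SECRET }}
```

Keeping the dry run in a separate workflow leaves the push-triggered upload untouched, so only merged changes reach DataHub.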

If you want to ingest LookML using the DataHub CLI directly, read on for instructions and configuration details.

Prerequisites

[Recommended] Create a GitHub Deploy Key

To use LookML ingestion through the UI, or to automate GitHub checkout through the CLI, you must set up a GitHub deploy key for your Looker GitHub repository. Read this document for how to set up deploy keys for your Looker git repo.

The process has three steps:

  1. Generate an SSH key pair without a passphrase, for example by running ssh-keygen -t rsa -f looker_datahub_deploy_key and leaving the passphrase empty when prompted. This creates looker_datahub_deploy_key and looker_datahub_deploy_key.pub.

  2. Add the public key to your Looker git repo as a read-only deploy key (guide).

  3. Save the contents of the private key file; you will paste them into the GitHub Deploy Key field in UI-based ingestion.

Clone Timeout

By default, DataHub allows up to 600 seconds for the git clone to complete. If your repository is large or your network is slow, you can increase this value:

yml
source:
  type: lookml
  config:
    git_info:
      repo: https://github.com/your-org/your-lookml-repo
      branch: main
      deploy_key: ${DEPLOY_KEY}
      clone_timeout: 900 # seconds; set to null to disable

If the clone fails (for example, due to a network error, SSH misconfiguration, or timeout), ingestion stops and reports a clear error entry rather than crashing the pipeline.

Set up your connection mapping

Connection mapping enables accurate lineage to upstream warehouses by mapping Looker connection names to platforms and databases.

Two configuration options:

  1. Automatic (recommended): Provide Looker admin API credentials for automatic mapping (details below)
  2. Manual: Populate connection_to_platform_map and project_name fields (see starter recipe)

[Optional] Create an API key with admin privileges

Create a client ID and secret by following the Looker authentication docs. Ensure that the API key has Admin privileges.

Without admin API credentials, manually populate connection_to_platform_map and project_name in your recipe.
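
For the manual option, a recipe might look roughly like the sketch below. The connection name my_snowflake_connection, the project name my_looker_project, and the base_folder path are placeholders chosen for illustration; the platform and default_db values reuse the examples from the commented-out options in the sample workflow. The sink section is omitted for brevity.

```yml
source:
  type: lookml
  config:
    # Placeholder: path to your local checkout of the LookML repo.
    base_folder: /path/to/your/lookml/repo
    # Placeholder: the name of your Looker project.
    project_name: my_looker_project
    # Maps each Looker connection name to its upstream platform and database.
    connection_to_platform_map:
      my_snowflake_connection: # placeholder Looker connection name
        platform: snowflake
        default_db: DEMO_PIPELINE
```

Add one entry under connection_to_platform_map for each Looker connection that your views reference, so lineage can resolve to the correct upstream warehouse.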