Openapi Post - Datahub

Capabilities

Schema Extraction Behavior

The source uses a multi-step approach to extract schemas for API endpoints:

1. OpenAPI Specification (Primary) - The source first attempts to extract schemas directly from the OpenAPI specification file. This includes:
   - Response schemas from 200 responses
   - Request body schemas for POST/PUT/PATCH methods
   - Parameter schemas when available
2. Example Data (Secondary) - If schemas aren't fully defined, the source looks for example data in the specification.
3. Live API Calls (Optional) - If enable_api_calls_for_schema_extraction=True and credentials are provided, the source makes GET requests to endpoints when all of the following hold:
   - Schema extraction from the spec fails
   - The endpoint uses the GET method
   - Valid credentials are available (username/password, token, or bearer_token)

:::note
API calls are only made for GET methods. POST, PUT, and PATCH methods rely solely on schema definitions in the OpenAPI specification.
:::

:::tip
Most schemas are extracted from the OpenAPI specification itself. API calls are primarily used as a fallback when the specification is incomplete.
:::
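The decision order described above can be sketched as a small function. This is an illustration only, not the source's actual code: `pick_schema_source` and its parameters are hypothetical names, and the sketch assumes the three steps are tried strictly in order.

```python
from typing import Optional


def pick_schema_source(
    spec_schema: Optional[dict],
    spec_example: Optional[dict],
    method: str,
    api_calls_enabled: bool,
    has_credentials: bool,
) -> str:
    """Return which step would supply the schema for one endpoint (hypothetical helper)."""
    # 1. A schema defined in the specification always wins.
    if spec_schema:
        return "spec"
    # 2. Otherwise fall back to example data embedded in the specification.
    if spec_example:
        return "example"
    # 3. A live call is considered only for GET, and only when it is
    #    enabled and credentials are available.
    if api_calls_enabled and has_credentials and method.upper() == "GET":
        return "api_call"
    return "none"
```

Under these assumptions, a POST endpoint with neither a schema nor an example in the spec ends up with no schema, even when API calls are enabled.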

Schema Extraction Priority

When multiple HTTP methods are available for an endpoint, the source prioritizes extracting metadata from methods in this order:

1. GET
2. POST
3. PUT
4. PATCH

The description, tags, and schema metadata all come from the same priority method to ensure consistency.
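As a rough sketch of this rule, the snippet below picks the first available method in priority order and reads the description and tags from that same operation. The function names are hypothetical, and the input is assumed to be a path item as it appears under `paths` in an OpenAPI document.

```python
from typing import Optional

# Priority order taken from the list above.
METHOD_PRIORITY = ("get", "post", "put", "patch")


def pick_priority_method(path_item: dict) -> Optional[str]:
    """Return the highest-priority method defined for this path, if any."""
    for method in METHOD_PRIORITY:
        if method in path_item:
            return method
    return None


def endpoint_metadata(path_item: dict) -> Optional[dict]:
    """Read description and tags from the single priority operation."""
    method = pick_priority_method(path_item)
    if method is None:
        return None
    operation = path_item[method]
    return {
        "method": method,
        "description": operation.get("description", ""),
        "tags": operation.get("tags", []),
    }
```

Reading every field from one operation is what keeps the description, tags, and schema consistent with each other.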

Browse Paths

All ingested endpoints are organized in DataHub's browse interface using browse paths based on their endpoint path structure. This makes it easy to navigate and discover related endpoints.

For example:

- /pet/findByStatus appears under the pet browse path
- /pet/{petId} appears under the pet browse path
- /store/order/{orderId} appears under store → order

Endpoints are grouped by their path segments, making it easy to find all endpoints related to a particular resource or feature.
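One simple rule that reproduces the three examples above is to split the path on `/` and keep every segment except the last. The sketch below implements only that assumed rule; `browse_path_segments` is a hypothetical name, and the source may derive browse paths differently for other path shapes.

```python
from typing import List


def browse_path_segments(endpoint: str) -> List[str]:
    """Return the parent segments of an endpoint path (assumed grouping rule)."""
    # Drop empty pieces produced by leading or trailing slashes.
    segments = [segment for segment in endpoint.split("/") if segment]
    # Everything before the final segment forms the browse path.
    return segments[:-1]
```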

Authentication Methods

Bearer Token

yaml

source:
  type: openapi
  config:
    name: my_api
    url: https://api.example.com
    swagger_file: openapi.json
    bearer_token: "your-bearer-token-here"

Custom Token

yaml

source:
  type: openapi
  config:
    name: my_api
    url: https://api.example.com
    swagger_file: openapi.json
    token: "your-token-here"

Basic Authentication

yaml

source:
  type: openapi
  config:
    name: my_api
    url: https://api.example.com
    swagger_file: openapi.json
    username: your_username
    password: your_password

Dynamic Token Retrieval

The source can retrieve a token dynamically by making a request to a token endpoint. This is useful when tokens expire and need to be refreshed.

yaml

source:
  type: openapi
  config:
    name: my_api
    url: https://api.example.com
    swagger_file: openapi.json
    get_token:
      request_type: get # or "post"
      url_complement: api/auth/login?username={username}&password={password}
    username: your_username
    password: your_password

:::note
When using get_token with request_type: get, the username and password are sent in the URL query parameters, which is less secure. Use request_type: post when possible.
:::
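To make the security concern concrete, the sketch below shows how a token URL could be assembled from url and url_complement for a GET request. It assumes the `{username}` and `{password}` placeholders are replaced with the configured credentials; `build_token_url` is a hypothetical helper, not part of the source.

```python
from urllib.parse import quote


def build_token_url(base_url: str, url_complement: str, username: str, password: str) -> str:
    """Fill the credential placeholders and join the result onto the base URL."""
    # Percent-encode the credentials so special characters survive in a query string.
    filled = url_complement.format(
        username=quote(username, safe=""),
        password=quote(password, safe=""),
    )
    return base_url.rstrip("/") + "/" + filled.lstrip("/")
```

The resulting URL contains the password in clear text, so it can end up in proxy and server access logs. That is the reason a POST request is the safer choice.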

Forced Examples

For endpoints with path parameters where the source cannot automatically determine example values, you can provide them manually using forced_examples:

yaml

source:
  type: openapi
  config:
    name: petstore_api
    url: https://petstore.swagger.io
    swagger_file: /v2/swagger.json
    forced_examples:
      /pet/{petId}: [1]
      /store/order/{orderId}: [1]
      /user/{username}: ["user1"]

The source will use these values to construct URLs for API calls when needed.
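A minimal sketch of that substitution follows. It assumes the values in each list are matched to the path parameters by position, in order of appearance. `fill_path_params` is a hypothetical helper used only for illustration.

```python
import re
from typing import List


def fill_path_params(endpoint: str, values: List[object]) -> str:
    """Replace each {param} in the endpoint with the value at the same position."""
    params = re.findall(r"\{([^}]+)\}", endpoint)
    if len(params) != len(values):
        raise ValueError(
            f"{endpoint} has {len(params)} path parameter(s) but {len(values)} value(s) were given"
        )
    url = endpoint
    for name, value in zip(params, values):
        # Replace only the first occurrence so repeated names stay positional.
        url = url.replace("{" + name + "}", str(value), 1)
    return url
```

With the configuration above, `/pet/{petId}` and the value list `[1]` would produce `/pet/1`, which is then appended to the configured url.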

Ignoring Endpoints

You can exclude specific endpoints from ingestion:

yaml

source:
  type: openapi
  config:
    name: my_api
    url: https://api.example.com
    swagger_file: openapi.json
    ignore_endpoints:
      - /health
      - /metrics
      - /internal/debug

Examples

Basic Configuration (Schema from Spec Only)

yaml

source:
  type: openapi
  config:
    name: petstore_api
    url: https://petstore.swagger.io
    swagger_file: /v2/swagger.json
    enable_api_calls_for_schema_extraction: false

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

With API Calls Enabled

yaml

source:
  type: openapi
  config:
    name: petstore_api
    url: https://petstore.swagger.io
    swagger_file: /v2/swagger.json
    bearer_token: "${BEARER_TOKEN}"
    enable_api_calls_for_schema_extraction: true

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

Complete Example with All Options

yaml

source:
  type: openapi
  config:
    name: petstore_api
    url: https://petstore.swagger.io
    swagger_file: /v2/swagger.json

    # Authentication
    bearer_token: "${BEARER_TOKEN}"

    # Optional: Enable/disable API calls
    enable_api_calls_for_schema_extraction: true

    # Optional: Ignore specific endpoints
    ignore_endpoints:
      - /user/logout

    # Optional: Provide example values for parameterized endpoints
    forced_examples:
      /pet/{petId}: [1]
      /store/order/{orderId}: [1]
      /user/{username}: ["user1"]

    # Optional: Proxy configuration
    proxies:
      http: "http://proxy.example.com:8080"
      https: "https://proxy.example.com:8080"

    # Optional: SSL verification
    verify_ssl: true

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

No schemas extracted

If schemas aren't being extracted:

- Check the OpenAPI specification - Ensure your spec includes schema definitions in responses or request bodies
- Enable API calls - Set enable_api_calls_for_schema_extraction: true and provide credentials
- Check authentication - Verify your credentials are correct if API calls are enabled
- Review warnings - Check the ingestion report for warnings about specific endpoints
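For the first check, it can help to confirm where a response schema lives in your spec. OpenAPI v2 places it at `responses.200.schema`, while OpenAPI v3 places it at `responses.200.content.<media type>.schema`. The helper below is a hypothetical sketch that looks in both places for a single operation.

```python
from typing import Optional


def response_schema(operation: dict) -> Optional[dict]:
    """Return the schema of the 200 response for one operation, or None."""
    ok_response = operation.get("responses", {}).get("200", {})
    # OpenAPI v2 (Swagger): the schema sits directly on the response object.
    if "schema" in ok_response:
        return ok_response["schema"]
    # OpenAPI v3: the schema sits under content -> media type.
    for media in ok_response.get("content", {}).values():
        if "schema" in media:
            return media["schema"]
    return None
```

An operation for which this returns None has no response schema in the spec, so its schema can only come from example data or, for GET endpoints, a live API call.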

Limitations

- API calls are GET-only: Live API calls for schema extraction are only made for GET methods. POST, PUT, and PATCH methods rely solely on schema definitions in the OpenAPI specification.
- Authentication required for API calls: If enable_api_calls_for_schema_extraction=True, valid credentials must be provided.
- 200 response codes only: Only endpoints with 200 response codes are ingested.
- Schema extraction from spec is preferred: The source prioritizes extracting schemas from the OpenAPI specification. API calls are used as a fallback.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

Endpoints not appearing

If endpoints aren't appearing in DataHub:

- Check ignore_endpoints - Ensure endpoints aren't in the ignore list
- Verify response codes - Only endpoints with 200 response codes are ingested
- Check OpenAPI spec format - Ensure the specification is valid OpenAPI v2 or v3
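The response-code check can be run against a spec before ingestion. The following sketch lists paths where none of the GET, POST, PUT, or PATCH operations declares a 200 response. `endpoints_without_200` is a hypothetical helper, and restricting the check to those four methods is an assumption based on the methods this page mentions.

```python
from typing import List

# Methods named in this document; other methods are ignored by this check.
HTTP_METHODS = {"get", "post", "put", "patch"}


def endpoints_without_200(spec: dict) -> List[str]:
    """Return paths where no considered operation declares a 200 response."""
    missing = []
    for path, path_item in spec.get("paths", {}).items():
        operations = [
            op for method, op in path_item.items()
            if method in HTTP_METHODS and isinstance(op, dict)
        ]
        # YAML specs may parse status codes as integers, JSON specs as strings.
        has_200 = any(
            "200" in op.get("responses", {}) or 200 in op.get("responses", {})
            for op in operations
        )
        if not has_200:
            missing.append(path)
    return missing
```

To use it with a JSON spec, load the file with `json.load` and pass the resulting dictionary to the function.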

Authentication errors

If you see authentication errors:

- Verify credentials - Check that username/password or tokens are correct
- Check token format - Bearer tokens should not include the "Bearer " prefix
- Review get_token configuration - Ensure the token endpoint URL and method are correct
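The token format check matters because the prefix would otherwise be doubled. The sketch below assumes the source builds the Authorization header by prepending "Bearer " to the configured value, as the advice above implies. Both function names are hypothetical.

```python
def normalize_bearer_token(token: str) -> str:
    """Strip an accidental 'Bearer ' prefix before the token goes into the recipe."""
    token = token.strip()
    if token.lower().startswith("bearer "):
        token = token[len("bearer "):].strip()
    return token


def authorization_header(token: str) -> dict:
    """Build the header the way the source is assumed to build it."""
    return {"Authorization": f"Bearer {token}"}
```

Passing a value such as "Bearer abc123" straight through would yield "Bearer Bearer abc123", which a server would typically reject.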