# Create Datasets

> Datasets let you apply the same parsing configuration to many documents — useful for versioning schemas and OCR settings across workflows.

A dataset applies one document parsing configuration to every document you add to it.

This makes it easy to version schemas and OCR configurations and to apply them to new documents in your workflows.

<Note>
  The Python examples on this page use the [Tensorlake Python SDK](https://github.com/tensorlakeai/tensorlake).
  For other languages, please consult our [API Reference](/api-reference/v2/datasets/).
</Note>

## Quick Start

<Info>
  The example can be run in a [Google Colab
  notebook](https://colab.research.google.com/drive/1Bz6wFrJd64RY9cslpwmJ4nncCTbwV6rL?usp=sharing).
</Info>

<Steps>
  <Step title="Create a dataset">
    <CodeGroup>
      ```python Python theme={null}
      from tensorlake.documentai.client import DocumentAI

      doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_API_KEY")

      # Create a dataset. The only required argument is its name.
      # A dataset name may only contain alphanumeric characters, hyphens, or underscores.
      #
      # If you do not specify parsing or extraction options, the dataset parses
      # documents with our recommended defaults.
      dataset = doc_ai.create_dataset(
          name="your_dataset_name"
      )
      ```

      ```bash curl theme={null}
      curl --request POST \
        --url https://api.tensorlake.ai/documents/v2/datasets \
        --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
        --header 'Content-Type: application/json' \
        --data '{
          "name": "your_dataset_name"
        }'

      # The response contains the dataset ID, which you use to reference the dataset in later API calls.
      # {
      #   "dataset_id": "dataset_xxxxx"
      # }
      ```
    </CodeGroup>
  </Step>

  <Step title="Parse a file">
    <CodeGroup>
      ```python Python theme={null}
      # Use a publicly accessible URL or upload a file to Tensorlake and use the file ID.
      file_url = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/real-estate-purchase-all-signed.pdf"

      parse_id = doc_ai.parse_dataset_file(
          dataset=dataset,
          file=file_url
      )
      ```

      ```bash curl theme={null}
      # Replace {dataset_id} with the ID returned when you created the dataset.
      curl --request POST \
        --url https://api.tensorlake.ai/documents/v2/datasets/{dataset_id}/parse \
        --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
        --header 'Content-Type: application/json' \
        --data '{
          "file_url": "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/real-estate-purchase-all-signed.pdf"
        }'
      ```
    </CodeGroup>
  </Step>

  <Step title="Retrieve outputs">
    <CodeGroup>
      ```python Python theme={null}
      # Retrieve the outputs of the parsing job.
      result = doc_ai.wait_for_completion(parse_id)

      # The result contains the parsed document and any extracted data.
      print(result)
      ```

      ```bash curl theme={null}
      # Replace {parse_id} with the ID of the parsing job.
      curl --request GET \
        --url https://api.tensorlake.ai/documents/v2/parse/{parse_id} \
        --header "Authorization: Bearer ${TENSORLAKE_API_KEY}" \
        --header 'Content-Type: application/json'

      # The result is only available once the parsing job is complete.
      ```
    </CodeGroup>

    <Note>
      The Python SDK's `wait_for_completion` method blocks until the parsing job is complete, then returns the result.
    </Note>
  </Step>
</Steps>
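
The naming rule from the first step (alphanumeric characters, hyphens, or underscores) can be checked locally before calling the API. The sketch below is not part of the SDK: `is_valid_dataset_name` is a hypothetical helper, and it assumes that "alphanumeric" means ASCII letters and digits and that there is no length limit, neither of which this page confirms.

```python
import re

# Assumption: ASCII letters, digits, hyphens, and underscores only, at least one character.
_DATASET_NAME = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` uses only the characters the naming rule allows."""
    return _DATASET_NAME.fullmatch(name) is not None


print(is_valid_dataset_name("your_dataset_name"))  # True
print(is_valid_dataset_name("invoices 2024"))      # False (contains a space)
```

The API remains the authority on what it accepts; a local check like this only catches obvious mistakes earlier.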

A dataset accepts as many files as you want and applies its parsing configuration to each one. You can also create a dataset with structured extraction options to extract structured data from related documents.
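
As a rough illustration of sending several files through one dataset, the sketch below loops over a list of URLs using only the two SDK calls shown in the Quick Start, `parse_dataset_file` and `wait_for_completion`. The `parse_files` helper is hypothetical, and the sketch assumes it is acceptable to submit all jobs before waiting on any of them.

```python
from typing import Any, Dict, Iterable


def parse_files(doc_ai: Any, dataset: Any, file_urls: Iterable[str]) -> Dict[str, Any]:
    """Submit every URL to the same dataset, then collect each result by URL."""
    # Submit all files first so every job is queued before we start waiting.
    parse_ids = {
        url: doc_ai.parse_dataset_file(dataset=dataset, file=url)
        for url in file_urls
    }
    # wait_for_completion blocks until the given parsing job is complete.
    return {url: doc_ai.wait_for_completion(pid) for url, pid in parse_ids.items()}
```

With the `doc_ai` client and `dataset` from the Quick Start, a call such as `parse_files(doc_ai, dataset, [file_url])` would return a dictionary mapping each URL to its result.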
