Product Scraper
Learning how to leverage secrets and images on Tensorlake Serverless.
In this tutorial, we will:
- Create a Tensorlake Graph
- Test Locally
- Define Dependencies and Secrets
- Deploy to Tensorlake Serverless
- Invoke the Graph Remotely
- Troubleshoot Remote Executions
Let’s create a simple workflow that scrapes an e-commerce product page, summarizes the product details, and extracts some structured information about the product.
Prerequisites
Before proceeding, ensure you have the following:
- Python Environment: Python 3.9 or higher installed.
- Tensorlake Account: Sign up at Tensorlake.
- API Key: After creating your account, generate an API key for the Tensorlake CLI and set it as an environment variable:
- Tensorlake SDK: Install the Tensorlake SDK using pip:
- OpenAI API Key: Create one at OpenAI.
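The setup steps above can be sketched as shell commands. The `TENSORLAKE_API_KEY` variable name and the placeholder value are assumptions; use the variable name shown in your Tensorlake dashboard when you generate the key.

```shell
# Set the Tensorlake API key for the CLI (variable name is an assumption).
export TENSORLAKE_API_KEY="<your-tensorlake-api-key>"

# Install the Tensorlake SDK.
pip install tensorlake
```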
Step 1: Writing the Graph
In `workflow.py`, we write three functions:

- `scrape_website` will leverage https://jina.ai/reader/ to parse websites into text.
- `summarize_text` will leverage OpenAI's ChatGPT to summarize the text output by `scrape_website`.
- `extract_structured_data` will leverage OpenAI's ChatGPT to extract structured data, defined as a Python class, from the text output by `scrape_website`.

The Graph `website-summarizer` executes the `scrape_website` function, and then executes both `summarize_text` and `extract_structured_data` in parallel with the output of the scraper.
Step 2: Testing Locally
Before running the code locally, we need to ensure all of the graph's dependencies are available locally. For this graph, run `pip install openai` to install the OpenAI SDK.
Additionally, the OpenAI SDK requires the `OPENAI_API_KEY` environment variable:
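For example (the placeholder value is illustrative):

```shell
# Install the OpenAI SDK locally.
pip install openai

# Make the OpenAI API key available to the local run.
export OPENAI_API_KEY="<your-openai-api-key>"
```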
Once the dependencies and secrets are available, add the following code to enable running the graph locally:
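A sketch of that entrypoint, appended to the end of `workflow.py`, might look like this. It assumes the Graph object is named `g` and that the SDK's local-run API takes the shape `g.run(...)` returning an invocation id, with `g.outputs(...)` fetching per-function results; the exact signatures and the URL are assumptions.

```python
# Appended to workflow.py; `g` is the Graph defined above.
if __name__ == "__main__":
    invocation_id = g.run(
        block_until_done=True,
        url="https://example.com/some-product",  # illustrative URL
    )
    # Two print statements: the summary and the structured extraction.
    print(g.outputs(invocation_id, "summarize_text"))
    print(g.outputs(invocation_id, "extract_structured_data"))
```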
Running `python workflow.py` will execute the workflow locally and print the outputs. There are two print statements
for this graph: one for the text summary and one for the structured extraction.
Step 3: Defining Dependencies and Secrets
The current version of the graph requires some Python dependencies and some environment variables containing secrets.
Tensorlake Serverless provides Images and Secrets to define what a `tensorlake_function` requires when running on the Tensorlake Cloud.
Dependencies
With Tensorlake Serverless, every function runs in its own sandbox defined via images. We define two images and associate each with the function that requires it.
As part of adding an image attribute to the `tensorlake_function` decorator, we also moved the imports inside each function.
This allows creating smaller per-function images without bundling every dependency into every image, reducing cold-start times when large dependencies, such as AI models, are needed.
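A sketch of the image definitions, assuming the SDK exposes an `Image` builder as in the pattern above; the image names and pip packages are illustrative.

```python
from tensorlake import Image, tensorlake_function

# Two per-function images; names and contents are illustrative.
scraper_image = Image().name("scraper-image")
openai_image = (
    Image()
    .name("openai-image")
    .run("pip install openai")  # only this image carries the OpenAI SDK
)


@tensorlake_function(image=openai_image)
def summarize_text(text: str) -> str:
    # Imported inside the function so other images don't need it.
    from openai import OpenAI
    ...
```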
Secrets
The graph requires the `OPENAI_API_KEY` environment variable, which contains a sensitive value.
Tensorlake Serverless provides secrets, which are injected at runtime into the functions that depend on them.
Secrets are stored encrypted and are only decrypted when injected into functions.
Create the tensorlake secret using the Tensorlake CLI:
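The exact subcommand syntax is an assumption; check `tensorlake --help` for the current form.

```shell
# Store the OpenAI key as a Tensorlake secret (syntax is an assumption).
tensorlake secrets set OPENAI_API_KEY=<your-openai-api-key>
```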
Change the function requiring the OpenAI API Key so that Tensorlake Serverless can inject the value at runtime:
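For example, assuming the decorator accepts a `secrets=` parameter (the parameter name is an assumption):

```python
from tensorlake import tensorlake_function


@tensorlake_function(secrets=["OPENAI_API_KEY"])
def summarize_text(text: str) -> str:
    from openai import OpenAI
    client = OpenAI()  # picks up the OPENAI_API_KEY injected at runtime
    ...
```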
Every remote invocation will now use the value of the secret we created when running the `summarize_text` and `extract_structured_data` functions.
Step 4: Deploying the Graph
The graph can be deployed as a remote API on Tensorlake Cloud and called from any application on demand.
This process builds a new image capable of running your functions and deploys the graph as a remote API.
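The command shape below is an assumption; see `tensorlake --help` for the exact syntax.

```shell
# Deploy the graph defined in workflow.py to Tensorlake Cloud.
tensorlake deploy workflow.py
```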
Step 5: Invoking the Graph Remotely
Once the graph is deployed, you can invoke it remotely by modifying the main code:
Alternatively, you can obtain a reference to the deployed graph and invoke it:
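A sketch of the second approach, assuming the SDK exposes a `RemoteGraph` handle with a by-name lookup and the same `run`/`outputs` surface used for local execution; the lookup method, signatures, and URL are assumptions.

```python
from tensorlake import RemoteGraph

# Obtain a handle to the deployed graph by its name.
graph = RemoteGraph.by_name("website-summarizer")

# Invoke remotely; the input maps to the starting node's `url` parameter.
invocation_id = graph.run(
    block_until_done=True,
    url="https://example.com/some-product",  # illustrative URL
)

# Fetch each function's result by invocation id and function name.
print(graph.outputs(invocation_id, "summarize_text"))
print(graph.outputs(invocation_id, "extract_structured_data"))
```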
The Graph is called with the input of the starting node of the graph, in this case `scrape_website`, so the input to the graph is the `url` parameter.
The result of calling a graph is an `Invocation`. Since data applications can take a long time to complete, calling `outputs` on an invocation will wait for the invocation to complete.
In either case, the result of an individual function can be retrieved using the invocation ID and the name of the function.
Step 6: Monitoring and Troubleshooting
Monitor your graph’s invocations and logs using the Tensorlake CLI:
These commands help you track executions and diagnose any issues that may arise during remote invocations.