# Data Analysis

> Perform parallel data analysis and model benchmarking in isolated sandboxes.

Run parallel data analysis, model training, and benchmarking tasks in secure, isolated sandbox environments. Each sandbox can have its own dependencies and resource limits, allowing you to compare different models or process large datasets concurrently.

The examples below benchmark several `scikit-learn` classification models in parallel by running each one in its own sandbox.

## TypeScript SDK starter

The same benchmarking pattern works in Node.js: one model per sandbox, `Promise.all()` for fan-out, and JSON on stdout for aggregation.

```typescript
import { Sandbox } from "tensorlake";


async function runModelBenchmark(modelName: string, sklearnPath: string) {
  const splitAt = sklearnPath.lastIndexOf(".");
  const modulePath = sklearnPath.slice(0, splitAt);
  const className = sklearnPath.slice(splitAt + 1);
  const code = `
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from ${modulePath} import ${className}
import json

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)
model = ${className}()
model.fit(X_train, y_train)
print(json.dumps({"model": "${modelName}", "accuracy": model.score(X_test, y_test)}))
`;

  const sandbox = await Sandbox.create({
    timeoutSecs: 900,
    allowInternetAccess: false,
  });

  try {
    await sandbox.run("pip", {
      args: [
        "install",
        "numpy",
        "scikit-learn",
        "--user",
        "--break-system-packages",
      ],
    });
    const result = await sandbox.run("python", {
      args: ["-c", code],
    });
    return JSON.parse(result.stdout);
  } finally {
    await sandbox.terminate();
  }
}

const modelsToTest = {
  "Random Forest": "sklearn.ensemble.RandomForestClassifier",
  SVM: "sklearn.svm.SVC",
  "Logistic Regression": "sklearn.linear_model.LogisticRegression",
};

const results = await Promise.all(
  Object.entries(modelsToTest).map(([name, path]) =>
    runModelBenchmark(name, path),
  ),
);

console.table(results);
```

## Example: Parallel Model Benchmarking

The following script benchmarks five different `scikit-learn` models on the Iris dataset. Each model is trained and evaluated in a separate, concurrent sandbox.

```python
import asyncio
import json

from dotenv import load_dotenv
load_dotenv()

from tensorlake.sandbox import Sandbox


async def run_model_benchmark(model_name, sklearn_path):
    """
    Runs a model benchmark inside an isolated sandbox.
    Returns a dict with model name and accuracy.
    """
    module_path, class_name = sklearn_path.rsplit('.', 1)

    code = f"""
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from {module_path} import {class_name}
import json

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)

model = {class_name}()
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(json.dumps({{"model": "{model_name}", "accuracy": score}}))
"""

    def _sync_benchmark():
        # Sandbox calls are blocking, so this helper runs in a worker thread.
        sandbox = Sandbox.create()
        print(f"🚀 Sandbox started for {model_name}...")
        # Install scikit-learn and its dependencies in the sandbox.
        sandbox.run("pip", ["install", "--user", "--break-system-packages", "numpy", "scikit-learn"])
        # Run the generated benchmark script in the sandbox.
        result = sandbox.run("python", ["-c", code])

        # The script prints a single JSON object to stdout.
        return json.loads(result.stdout.strip())

    return await asyncio.to_thread(_sync_benchmark)

async def main():
    models_to_test: dict[str, str] = {
        "Random Forest": "sklearn.ensemble.RandomForestClassifier",
        "SVM": "sklearn.svm.SVC",
        "Logistic Regression": "sklearn.linear_model.LogisticRegression",
        "Decision Tree": "sklearn.tree.DecisionTreeClassifier",
        "KNN": "sklearn.neighbors.KNeighborsClassifier",
    }

    tasks = [run_model_benchmark(name, path) for name, path in models_to_test.items()]
    print("Gathering results from all sandboxes...\n")
    results = await asyncio.gather(*tasks)

    print("--- Benchmark Results ---")
    for r in results:
        print(f"{r['model']:<20}: {r['accuracy']:.4f}")

if __name__ == "__main__":
    asyncio.run(main())
```
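
The script interpolates `sklearn_path` directly into generated Python source. That is fine for a hard-coded dictionary, but if the paths ever come from user input it may be worth validating them first. The helper below is a minimal sketch of that idea; `split_class_path` is a hypothetical name chosen for illustration and is not part of the Tensorlake SDK.

```python
import keyword


def split_class_path(dotted_path: str) -> tuple[str, str]:
    """Split 'pkg.module.Class' into ('pkg.module', 'Class').

    Raises ValueError unless every dot-separated part is a plain,
    non-keyword Python identifier.
    """
    parts = dotted_path.split(".")
    valid = len(parts) >= 2 and all(
        p.isidentifier() and not keyword.iskeyword(p) for p in parts
    )
    if not valid:
        raise ValueError(f"not a valid dotted class path: {dotted_path!r}")
    return ".".join(parts[:-1]), parts[-1]


print(split_class_path("sklearn.svm.SVC"))
```

It could stand in for the `rsplit('.', 1)` call in `run_model_benchmark`, so that malformed paths fail before a sandbox is created.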

## How It Works

The script orchestrates the parallel execution of model benchmarks using Python's `asyncio` library.

**1. Parallel Execution:** The `main` function defines a dictionary of models to test and builds one coroutine per model with a list comprehension. `asyncio.gather` runs them concurrently. Because the sandbox calls are blocking, each benchmark executes in a worker thread through `asyncio.to_thread`.

**2. Sandbox Task:** The `run_model_benchmark` function is responsible for a single benchmark. For each model, it:

* Creates a new, isolated sandbox.
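* Installs the necessary Python libraries (`numpy` and `scikit-learn`) inside the sandbox using `sandbox.run()`. The `--break-system-packages` flag overrides the externally-managed-environment check that PEP 668 introduced, which would otherwise block `pip install` outside a virtual environment on newer Python installations.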
* Executes a Python script that trains the model on the Iris dataset and calculates its accuracy.
* Prints the results as a JSON string to standard output.
* Captures the `stdout`, parses the JSON, and returns the result.
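
The last step assumes `stdout` contains nothing but the JSON object. If a library prints extra lines to `stdout`, `json.loads` on the whole string fails. The sketch below shows one way to make the parse more tolerant by taking the last line that decodes as a JSON object; `parse_last_json_line` is a hypothetical helper, not an SDK function.

```python
import json


def parse_last_json_line(stdout: str) -> dict:
    """Return the last line of `stdout` that decodes as a JSON object."""
    for line in reversed(stdout.strip().splitlines()):
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip log or progress lines
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # looked like JSON but was not
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in sandbox output")


sample_stdout = 'some log line\n{"model": "SVM", "accuracy": 0.9556}\n'
print(parse_last_json_line(sample_stdout))
```

The sample string is made up for illustration; in the benchmark script the input would be `result.stdout`.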

**3. Aggregate Results:** Once all sandboxes have completed their tasks, `asyncio.gather` returns the results as a list in the same order as the input tasks, and the script prints them to the console.
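
By default, `asyncio.gather` raises the first exception it sees, so one failed sandbox hides the results of the others. The following sketch shows the same fan-out pattern with `return_exceptions=True`, which keeps the successful results. `blocking_benchmark` is a local stand-in for the sandbox work and makes no Tensorlake calls; the accuracy value is a placeholder, not a measured score.

```python
import asyncio
import time


def blocking_benchmark(name: str) -> dict:
    """Stand-in for the blocking sandbox work (hypothetical, no SDK calls)."""
    if name == "Broken":
        raise RuntimeError("simulated sandbox failure")
    time.sleep(0.1)  # pretend to install packages and train a model
    return {"model": name, "accuracy": 0.0}  # placeholder value


async def run_all(names: list[str]):
    # One worker thread per benchmark, all awaited together.
    tasks = [asyncio.to_thread(blocking_benchmark, n) for n in names]
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)

    results, failures = [], []
    # gather preserves input order, so outcomes line up with names.
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            failures.append((name, outcome))
        else:
            results.append(outcome)
    return results, failures


if __name__ == "__main__":
    results, failures = asyncio.run(run_all(["SVM", "KNN", "Broken"]))
    print("succeeded:", [r["model"] for r in results])
    print("failed:", [(name, str(err)) for name, err in failures])
```

Whether to retry or report the failed models is a design choice that depends on the workload.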

<Note>
  This example uses the `python-dotenv` library to load your Tensorlake API key from a `.env` file. Create a file named `.env` in your project root and add your key:

  ```
  TENSORLAKE_API_KEY="your-api-key-here"
  ```

  The SDK will automatically use this key.
</Note>
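
If the key is missing, the failure may only show up once the first sandbox is requested. A small check right after `load_dotenv()` can surface the problem earlier. This is a minimal sketch that assumes only the `TENSORLAKE_API_KEY` variable name from the note above; `require_api_key` is a hypothetical helper, not part of the SDK.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return TENSORLAKE_API_KEY, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("TENSORLAKE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TENSORLAKE_API_KEY is not set; add it to .env or export it in your shell"
        )
    return key
```

Calling `require_api_key()` before `asyncio.run(main())` stops the script before any sandbox is created.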

## Pro Tips

### Faster Execution with Snapshots

The example installs dependencies every time a sandbox is created. This is simple, but repeated runs pay the full `pip install` cost in every sandbox. To avoid repeating the install step, use **Snapshots**.

1. Create a "base" sandbox and install all your dependencies.
2. Create a snapshot of that sandbox.
3. Start new sandboxes from the snapshot ID. The new sandboxes have all the dependencies pre-installed, so they skip the install step.

Learn more in the [Snapshots guide](/sandboxes/snapshots).

## Learn More

<CardGroup cols={2}>
  <Card title="Sandboxes Overview" icon="rocket" href="/sandboxes/introduction">
    Install Tensorlake and create your first sandbox.
  </Card>

  <Card title="File Operations" icon="folder-open" href="/sandboxes/file-operations">
    Learn how to upload custom datasets and other files to your sandboxes.
  </Card>
</CardGroup>
