Tensorlake Workflows supports Map-Reduce for large-scale ETL over data.

Map is the process of applying a function to each item of a list in parallel. Reduce is the process of aggregating the results of the map phase.
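In plain Python terms, the two phases look like the following (a sequential illustration; Tensorlake runs the map phase in parallel across machines):

```python
from functools import reduce

items = ["alpha beta", "gamma", "delta epsilon zeta"]

# Map: apply a function to each item (conceptually in parallel).
word_counts = [len(item.split(" ")) for item in items]

# Reduce: aggregate the mapped results into a single value.
total = reduce(lambda acc, n: acc + n, word_counts, 0)
# total == 6
```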

Map

Tensorlake automatically parallelizes a function across multiple machines when an upstream function returns a sequence and the downstream function accepts a single element of that sequence.

```python
from typing import List

@tensorlake_function(image=image, input_encoding="json")
def chunk_text(text: str) -> List[str]:
    return text.split("\n\n")

@tensorlake_function(image=image, input_encoding="json")
def count_words(text: str) -> int:
    return len(text.split(" "))
```

In the above example, count_words is called in parallel for each paragraph of the text. With tools like Airflow, you would need something like Spark, Ray, or Dask to parallelize the processing; Tensorlake does this automatically.
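Outside the Tensorlake runtime, the fan-out behavior can be sketched in plain Python (a local illustration of the semantics, not the actual scheduler):

```python
from typing import List

def chunk_text(text: str) -> List[str]:
    # Split a document into paragraphs.
    return text.split("\n\n")

def count_words(text: str) -> int:
    return len(text.split(" "))

doc = "one two\n\nthree four five"
# Tensorlake would invoke count_words once per chunk, in parallel;
# locally this is just a comprehension over the mapped elements.
counts = [count_words(chunk) for chunk in chunk_text(doc)]
# counts == [2, 3]
```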

Reduce

Reduce functions in Tensorlake Serverless aggregate outputs from one or more functions that return sequences. They operate with the following characteristics:

  • Lazy Evaluation: Reduce functions are invoked incrementally as elements become available for aggregation. This allows for efficient processing of large datasets or streams of data.
  • Stateful Aggregation: The aggregated value is persisted between invocations. Each time the Reduce function is called, it receives the current aggregated state along with the new element to be processed.
```python
from pydantic import BaseModel

@tensorlake_function()
def fetch_numbers() -> list[int]:
    return [1, 2, 3, 4, 5]

class Total(BaseModel):
    value: int = 0

@tensorlake_function(accumulate=Total)
def accumulate_total(total: Total, number: int) -> Total:
    total.value += number
    return total
```

Example use case: aggregating a summary from hundreds of web pages.
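The accumulator semantics above can be sketched locally (a plain-Python stand-in using a dataclass in place of the BaseModel accumulator, with no Tensorlake scheduling):

```python
from dataclasses import dataclass

@dataclass
class Total:
    value: int = 0

def accumulate_total(total: Total, number: int) -> Total:
    # Receives the current aggregated state plus one new element,
    # mirroring how the reducer is invoked incrementally.
    total.value += number
    return total

total = Total()
for number in [1, 2, 3, 4, 5]:
    total = accumulate_total(total, number)
# total.value == 15
```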