Tensorlake Applications support Map-Reduce for large-scale ETL of data. Map applies a function to each item of a list in parallel; Reduce aggregates the results of the map phase. When you call the map function, Tensorlake automatically parallelizes it across multiple machines. The reduce function waits until all functions in the map phase have completed, aggregates their results, and returns a single output. In the following example, we compute the square of each number in a list in parallel, then sum the results once the map phase completes.
from typing import List

from pydantic import BaseModel
from tensorlake.applications import application, function, map as tl_map, reduce as tl_reduce

class Total(BaseModel):
    value: int = 0

def generate_seq(x: int) -> List[int]:
    return [i for i in range(x)]

@application()
@function()
def aggregate_squares(total_numbers: int) -> Total:
    squares = tl_map(square, generate_seq(total_numbers))
    total = tl_reduce(accumulate_total, squares, Total(value=0))
    return total

@function()
def square(number: int) -> int:
    return number ** 2

@function()
def accumulate_total(total: Total, number: int) -> Total:
    total.value += number
    return total
Map/Reduce in Tensorlake has several advantages that make it a powerful tool for processing large datasets:
  • Lazy Evaluation: Reduce functions are invoked incrementally as elements become available for aggregation. This allows for efficient processing of large datasets or streams of data.
  • Stateful Aggregation: The aggregated value is persisted between invocations. Each time the Reduce function is called, it receives the current aggregated state along with the new element to be processed.
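The two properties above can be illustrated with a plain-Python sketch (no Tensorlake API involved, just an assumed `incremental_reduce` helper written for this example): the accumulator is threaded through each call, so every invocation receives the current aggregated state plus one new element, and elements are consumed lazily as they become available.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Total:
    value: int = 0

def square(number: int) -> int:
    return number ** 2

def accumulate_total(total: Total, number: int) -> Total:
    total.value += number
    return total

def incremental_reduce(
    fn: Callable[[Total, int], Total],
    elements: Iterable[int],
    state: Total,
) -> Total:
    # The reduce function is invoked once per element; each call receives
    # the current aggregated state plus the new element, mirroring how the
    # aggregate is persisted between invocations in Tensorlake.
    for element in elements:
        state = fn(state, element)
    return state

# A generator stands in for map outputs arriving one at a time.
squares = (square(n) for n in range(5))
total = incremental_reduce(accumulate_total, squares, Total(value=0))
print(total.value)  # 0 + 1 + 4 + 9 + 16 = 30
```

Because `squares` is a generator, no square is computed until the reducer asks for it, which is the essence of the lazy evaluation described above.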