Map-Reduce
Tensorlake Workflows support Map-Reduce for large-scale ETL.
Map is the process of applying a function to each item of a list in parallel. Reduce is the process of aggregating the results of the map phase.
Map
Tensorlake automatically parallelizes a function across multiple machines when an upstream function returns a sequence and the downstream function accepts only a single element of that sequence.
In the above example, count_words is called in parallel for each paragraph of the text. With a tool like Airflow, you would need something like Spark, Ray, or Dask to parallelize the processing. Tensorlake does this automatically.
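The fan-out behavior can be approximated in plain Python. The following is a minimal sketch that uses only the standard library's concurrent.futures and does not use the Tensorlake API. The split_paragraphs and map_in_parallel helpers and the sample text are hypothetical; only count_words is a name taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def split_paragraphs(text: str) -> list[str]:
    """Hypothetical upstream function: returns a sequence of paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def count_words(paragraph: str) -> int:
    """Downstream function: accepts a single element of that sequence."""
    return len(paragraph.split())


def map_in_parallel(text: str) -> list[int]:
    """Apply count_words to every paragraph concurrently (local stand-in for the map phase)."""
    paragraphs = split_paragraphs(text)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so results line up with paragraphs.
        return list(pool.map(count_words, paragraphs))


# Hypothetical sample input: three paragraphs separated by blank lines.
sample_text = "one two three\n\nfour five\n\nsix"
word_counts = map_in_parallel(sample_text)
```

Here a local thread pool stands in for the multiple machines described above; the point is only that the downstream function receives one element at a time while the sequence as a whole is processed in parallel.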
Reduce
Reduce functions in Tensorlake Serverless aggregate outputs from one or more functions that return sequences. They operate with the following characteristics:
- Lazy Evaluation: Reduce functions are invoked incrementally as elements become available for aggregation. This allows for efficient processing of large datasets or streams of data.
- Stateful Aggregation: The aggregated value is persisted between invocations. Each time the Reduce function is called, it receives the current aggregated state along with the new element to be processed.
Use Cases: Building a single summary from the contents of hundreds of web pages.
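The two characteristics above (incremental invocation and state carried between calls) can be illustrated with a plain Python sketch. It does not use the Tensorlake API: page_word_counts, add_to_total, reduce_incrementally, and the sample pages are all hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, TypeVar

State = TypeVar("State")
Item = TypeVar("Item")


def page_word_counts(pages: Iterable[str]) -> Iterator[int]:
    """Hypothetical upstream step: yields one word count per page as it is ready."""
    for page in pages:
        yield len(page.split())


def add_to_total(total: int, count: int) -> int:
    """Reducer: receives the current aggregated state plus one new element."""
    return total + count


def reduce_incrementally(
    elements: Iterable[Item],
    reducer: Callable[[State, Item], State],
    initial: State,
) -> State:
    """Call the reducer once per element, carrying the state between calls."""
    state = initial
    for element in elements:  # elements are consumed lazily, one at a time
        state = reducer(state, element)
    return state


# Hypothetical sample data standing in for fetched web pages.
pages = ["alpha beta", "gamma", "delta epsilon zeta"]
total_words = reduce_incrementally(page_word_counts(pages), add_to_total, 0)
```

Because the upstream step is a generator, the reducer runs as each element becomes available rather than after the full sequence is materialized. In this sketch the state lives in a local variable; the text above describes the aggregated value as persisted between invocations.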