In practice we can reduce the execution time by paying attention to memory usage and I/O operations. We should copy the data close to the compute infrastructure (or even into memory).
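A minimal sketch of the idea of copying data close to the compute. It simulates shared network storage with a temporary directory so it runs anywhere; on a real cluster node the source would be a shared filesystem and the destination a local SSD or a RAM disk such as `/dev/shm` (both are assumptions here, not fixed paths):

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in for slow shared network storage (e.g. an NFS mount on a cluster).
shared = Path(tempfile.mkdtemp())
network_file = shared / "data.csv"
network_file.write_text("a,b\n1,2\n")

# Stand-in for node-local scratch (local SSD, or a RAM disk like /dev/shm).
local_dir = Path(tempfile.mkdtemp())
local_file = local_dir / network_file.name

# Copy once over the (slow) network link, then do all repeated IO locally.
shutil.copy(network_file, local_file)

# The pipeline now reads local_file instead of network_file.
print(local_file.read_text() == network_file.read_text())
```

The payoff comes when the pipeline reads the file many times: the network transfer cost is paid once, and every subsequent read hits fast local storage.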
## Optimize and Estimate the Hardware Requirements
This approach works well for embarrassingly parallel problems.
**Sample!**
1. **Take a representative sample** from the batch, whether it consists of data, files, or parameters. If the batch is heterogeneous, take a few representative samples.
2. **Run the start-to-finish** computation with that sample.
3. Run the same pipeline with **more resources** and ask yourself, **"does it speed up?"** Optimize the hardware requirements until you are happy.
4. **Extrapolate** the time needed for the entire data or parameter set from this small sample.
Doing this will also help you find bugs or errors in the code much faster than starting with the entire data or parameter set.
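The timing-and-extrapolation steps above can be sketched as follows. The `process` function, the batch size, and the 1% sample are all hypothetical placeholders; the extrapolation assumes runtime scales linearly with batch size, which holds for embarrassingly parallel, per-record work:

```python
import time

def process(records):
    # Stand-in for the real pipeline step: any per-record computation.
    return [sum(range(r)) for r in records]

full_size = 100_000        # size of the entire (hypothetical) batch
sample = [500] * 1_000     # representative small sample: 1% of the batch

# Steps 1-2: run the pipeline start-to-finish on the sample and time it.
start = time.perf_counter()
process(sample)
sample_seconds = time.perf_counter() - start

# Step 4: extrapolate to the full batch, assuming linear scaling.
estimated_total = sample_seconds * (full_size / len(sample))
print(f"sample: {sample_seconds:.3f}s, "
      f"estimated full run: {estimated_total:.3f}s")
```

Step 3 is then a matter of repeating the timed sample run with more cores or memory and checking whether `sample_seconds` actually drops before committing the full batch to that hardware.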