# What is Azure ML ParallelRunStep?

In some cases you may need to process a large amount of data before it can be used in a machine learning workflow.

A common concern when processing big data is how long it will take, and that time grows quickly as the data volume increases.

A common solution is to process the data in parallel.

Instead of using only one machine to do all the work, you can split the data into small batches and distribute the computation across several machines (say you have n of them in total), processing the batches at the same time. Each machine then only needs to handle roughly 1/n of the original data, which can shorten the total processing time significantly.
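
To make the idea concrete, here is a toy, single-machine sketch of the same split-and-process pattern using Python's built-in multiprocessing module. The `process_batch` function and the doubling logic are placeholders for illustration only; they are not part of Azure ML, which applies the same pattern across multiple machines.

```python
from multiprocessing import Pool

def process_batch(batch):
    # Placeholder for the real per-record work (e.g. scoring or feature engineering).
    return [record * 2 for record in batch]

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4

    # Split the data into n roughly equal batches, one per worker.
    batches = [data[i::n_workers] for i in range(n_workers)]

    # Each worker processes ~1/n of the data at the same time.
    with Pool(n_workers) as pool:
        results = pool.map(process_batch, batches)
```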

Azure ML allows you to implement this solution easily with a feature called ParallelRunStep.

Much like running a normal Python script, Azure ML lets you run a script across multiple processes on different machines by adding a special ParallelRunStep to a pipeline.

When you add a ParallelRunStep to an Azure ML pipeline, you need to define settings such as the following (a minimal configuration sketch follows the list),

  • how many nodes to run the script on
  • how many processes per node to run the script
  • an input dataset that will be partitioned into small batches and sent to each parallel run
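
Below is a minimal sketch of how these settings map onto the SDK v1 classes ParallelRunConfig and ParallelRunStep. The concrete names used here (the scripts folder, batch_score.py, my-cluster, the registered dataset, the curated environment, and the output PipelineData) are placeholder assumptions for illustration, not values from this article.

```python
from azureml.core import Workspace, Dataset, Environment
from azureml.core.compute import ComputeTarget
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()
compute_target = ComputeTarget(workspace=ws, name="my-cluster")   # assumed compute cluster
input_ds = Dataset.get_by_name(ws, name="my-input-dataset")       # assumed registered dataset
env = Environment.get(ws, name="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu")  # example curated environment

output = PipelineData(name="parallel_output", datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory="scripts",          # folder containing the entry script
    entry_script="batch_score.py",       # script run by every worker process
    mini_batch_size="10",                # size of each small batch handed to a process
    error_threshold=10,                  # failed items tolerated before the step fails
    output_action="append_row",          # collect all results into a single output file
    environment=env,
    compute_target=compute_target,
    node_count=4,                        # how many nodes run the script
    process_count_per_node=2,            # how many processes per node run the script
)

parallel_step = ParallelRunStep(
    name="parallel-batch-step",
    parallel_run_config=parallel_run_config,
    inputs=[input_ds.as_named_input("input_data")],  # dataset partitioned into mini-batches
    output=output,
    allow_reuse=True,
)

pipeline = Pipeline(workspace=ws, steps=[parallel_step])
```

The entry script (batch_score.py in this sketch) is expected to define a run(mini_batch) function, and optionally init(), which Azure ML calls on each worker process for every mini-batch it receives.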

For more details, check the following links,

  • ParallelRunStep Class
  • ParallelRunConfig Class
  • Example: batch-inference-notebooks
