Question

What is the purpose of 'Thread Count' and 'Batch Scalability Factor' in the Edit Settings of the Data Flow configuration screen?

We are trying to optimize our data flow to handle around 14 million records. On the Designer Studio --> Infrastructure --> Services --> Data Flow configuration screen, clicking 'Edit settings' shows two configuration items: 'Thread Count' and 'Batch Scalability Factor'. What is the purpose of the 'Batch Scalability Factor' setting? If we set Thread Count to 3 and Batch Scalability Factor to 2, what exactly does that imply?

We are on Pega 7.2.1 and Pega Marketing 7.21.

Thanks


Comments

Pega
April 25, 2018 - 7:59am

 

Hi Srikanth,  

Batch scalability factor is used to calculate the suggested number of partitions for a data flow run. That number is calculated with this formula: numOfNodes * threadCount * scalabilityFactor. Keep in mind that this calculation only suggests a number of partitions; it is up to the data set implementation to decide how many partitions will actually be used.
 
Thread count: by default, nodes are configured to run with 5 threads. Each node that takes part in the data flow execution needs to be included in the service cluster. Note that setting a large thread count won't necessarily improve data flow execution speed. It's important to take the number of available cores into consideration when deciding on this value.
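As a rough illustration of the formula above, here is a small sketch of the arithmetic. It is not Pega code: the function name is made up for this example, and the single-node count in the usage line is an assumption, since the question does not say how many nodes are in the service cluster.

```python
def suggested_partitions(num_nodes: int, thread_count: int, scalability_factor: int) -> int:
    """Suggested partition count: numOfNodes * threadCount * scalabilityFactor.

    Hypothetical helper for illustration only; not a Pega API.
    The data set implementation may still use a different number.
    """
    return num_nodes * thread_count * scalability_factor


# Settings from the question: Thread Count 3, Batch Scalability Factor 2.
# num_nodes=1 is an assumption; the question does not give the node count.
print(suggested_partitions(num_nodes=1, thread_count=3, scalability_factor=2))  # 6
```

So with those settings, each node in the cluster adds 6 to the suggested partition count.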

Let me know if the above information helps.

Regards,

Basavaraj

 

 

April 25, 2018 - 9:19am
Response to BASAVARAJ

Thanks Basavaraj.

Does this mean that the scalabilityFactor has no effect on the number of partitions that are processed in parallel? For example, if we have 5 nodes, set threadCount to 2 and batchScalabilityFactor to 2, and have 20 partitions, is the number of partitions processed in parallel 10 (5 nodes * 2 threads) and not 20 (5 nodes * 2 threads * 2 scalabilityFactor)?

Regards,

Srikanth

April 26, 2018 - 4:58am
Response to SrikanthK2988

Hello Srikanth

Are there any new findings on this?

I am trying to figure this out too.

Regards

Raju Botu

 

April 26, 2018 - 12:54pm
Response to Raju_Botu

From what we have seen in our testing, the batchScalabilityFactor does not have any effect on the number of partitions that are processed in parallel. Only the 'Thread Count' determines how many partitions are processed in parallel. This is the case for an RDBMS database (Oracle). We heard that the batchScalabilityFactor comes into the picture when dealing with a Cassandra database, but I am not sure how that works.
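To put numbers on this, here is a small sketch using the 5-node example from earlier in the thread. It only models what we observed on Oracle and is not Pega code: the function names are made up, and capping parallelism at the number of available partitions is my own assumption.

```python
def suggested_partitions(num_nodes: int, thread_count: int, scalability_factor: int) -> int:
    # Formula from the earlier reply; only a suggestion to the data set.
    return num_nodes * thread_count * scalability_factor


def parallel_partitions(num_nodes: int, thread_count: int, num_partitions: int) -> int:
    # Observed behavior (Oracle): parallelism depends on nodes and threads only.
    # Assumption: each thread handles one partition at a time, and parallelism
    # can never exceed the number of partitions that exist.
    return min(num_nodes * thread_count, num_partitions)


# 5 nodes, threadCount 2, batchScalabilityFactor 2, 20 partitions
print(suggested_partitions(5, 2, 2))   # suggested partition count
print(parallel_partitions(5, 2, 20))   # partitions processed at the same time
```

Under these assumptions the factor changes the suggested partition count (20) but not how many partitions run at once (10).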

Regards,

Srikanth