This repository has been archived by the owner on Sep 18, 2023. It is now read-only.

[NSE-1075] Dynamically adjust input partition size #1076

Open · wants to merge 8 commits into main

Conversation

PHILO-HE (Collaborator)

No description provided.

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/oap-project/native-sql-engine/issues

Then could you also rename the commit message and pull request title in the following format?

[NSE-${ISSUES_ID}] ${detailed message}

See also:

@PHILO-HE PHILO-HE changed the title Dynamically adjust input partition size [NSE-1075] Dynamically adjust input partition size Aug 19, 2022
@github-actions

#1075

@zhouyuan (Collaborator)

@jackylee-ch

val minPartitionNum = sparkSession.sessionState.conf.filesMinPartitionNum
  .getOrElse(SparkShimLoader.getSparkShims.leafNodeDefaultParallelism(sparkSession))
val PREFERRED_PARTITION_SIZE_LOWER_BOUND: Long = 128 * 1024 * 1024
val PREFERRED_PARTITION_SIZE_UPPER_BOUND: Long = 512 * 1024 * 1024
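(A minimal sketch, not this PR's actual code: assuming these two bounds clamp the dynamically computed split size in the usual FilePartition.maxSplitBytes style; selectedPartitions and the clamp itself are assumptions for illustration.)

val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
// totalBytes mirrors Spark's maxSplitBytes computation: each file pays an open cost.
val totalBytes = selectedPartitions
  .flatMap(_.files.map(_.getLen + openCostInBytes)).sum
val bytesPerCore = totalBytes / minPartitionNum
// Clamp the target split size into [lower bound, upper bound].
val maxSplitBytes = math.min(PREFERRED_PARTITION_SIZE_UPPER_BOUND,
  math.max(PREFERRED_PARTITION_SIZE_LOWER_BOUND, bytesPerCore))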
Contributor


Maybe add a new config for these two values?

Collaborator Author


Thanks for your advice. PREFERRED_PARTITION_SIZE_UPPER_BOUND may impose the same limit as Spark's max partition size configuration, so the two can be unified. Maybe we can make PREFERRED_PARTITION_SIZE_LOWER_BOUND configurable.
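(For illustration only, a sketch of making the lower bound configurable via a session conf; the key name below is hypothetical, not something this PR defines.)

// Hypothetical config key; the real name would be decided in this PR.
val preferredPartitionSizeLowerBound: Long =
  sparkSession.sessionState.conf.getConfString(
    "spark.oap.sql.columnar.preferredPartitionSizeLowerBound",
    (128 * 1024 * 1024).toString).toLong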


// This implementation is ported from spark FilePartition.scala with changes for
// adjusting openCost.
def getFilePartitions(sparkSession: SparkSession,
Collaborator Author


@jackylee-ch, please put your code changes for open cost here. It should be workable.
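(For context, upstream Spark's FilePartition.getFilePartitions packs files with a next-fit loop like the sketch below; the open-cost changes discussed here would adjust openCostInBytes before the loop. This is the upstream logic, not this PR's final code.)

import scala.collection.mutable.ArrayBuffer

def getFilePartitions(
    sparkSession: SparkSession,
    partitionedFiles: Seq[PartitionedFile],
    maxSplitBytes: Long): Seq[FilePartition] = {
  val partitions = new ArrayBuffer[FilePartition]
  val currentFiles = new ArrayBuffer[PartitionedFile]
  var currentSize = 0L

  // Close the current bin and start a new one.
  def closePartition(): Unit = {
    if (currentFiles.nonEmpty) {
      partitions += FilePartition(partitions.size, currentFiles.toArray)
    }
    currentFiles.clear()
    currentSize = 0
  }

  val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
  // Next-fit packing: close the bin once the next file would overflow it;
  // every file also pays an open cost on top of its length.
  partitionedFiles.foreach { file =>
    if (currentSize + file.length > maxSplitBytes) {
      closePartition()
    }
    currentSize += file.length + openCostInBytes
    currentFiles += file
  }
  closePartition()
  partitions.toSeq
}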

Contributor


Okay

// val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
// // val minPartitionNum = sparkSession.sessionState.conf.filesMinPartitionNum
// // .getOrElse(sparkSession.leafNodeDefaultParallelism)
// val minPartitionNum = sparkSession.sessionState.conf.filesMinPartitionNum
Collaborator Author


@jackylee-ch, I noticed you have introduced a computation for taskParallelismNum. Is it the same as minPartitionNum? This piece of code is ported from the Spark source code.

Contributor


No, they are not the same. taskParallelismNum is actually spark.sql.files.expectedPartitionNum, which can be configured by the user; its default value is the maximum number of tasks that can run in parallel in the current application.
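(A minimal sketch of that lookup, assuming the default "maximum parallel tasks" can be approximated with SparkContext.defaultParallelism; the PR's exact derivation may differ.)

val conf = sparkSession.sessionState.conf
val taskParallelismNum: Int = {
  // spark.sql.files.expectedPartitionNum is the config named in this discussion.
  val configured = conf.getConfString("spark.sql.files.expectedPartitionNum", "")
  if (configured.nonEmpty) configured.toInt
  // Fallback assumption: total task slots approximated by defaultParallelism.
  else sparkSession.sparkContext.defaultParallelism
}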
