MinHasher execution depends on the number of nodes #634

Gaglia88 · 2017-06-06T13:17:08Z

Hi,
I am using MinHasher32 to create clusters of similar records, tokenizing the record values to create the signatures (as I explained in #609), but it seems that the resulting buckets depend on the Spark configuration.
I ran the same code several times on a single node of a cluster machine with 16 cores and always obtained the same number of buckets, X. Then, on the same machine, I increased the number of cores to 20, and the number of buckets changed to a different number, Y. I repeated the test and obtained Y again.

Is it possible that the execution of the MinHasher is influenced by the number of nodes? Can someone explain why?

Thanks

Regards
Luca

Gaglia88 · 2017-06-06T18:43:04Z

I can confirm that the bucket generation depends on the level of Spark parallelism.
I ran a test on my laptop, repartitioning the tokens before initializing the MinHasher:

val attributeWithHashes: RDD[(String, Iterable[MinHashSignature])] = attributesToken.repartition(10).map {
   case (attribute, token) =>
      (attribute, minHasher.init(token))
}.groupByKey()

With the same number of partitions I always obtain the same buckets; if I change it, I obtain different buckets.
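
For anyone who wants to reproduce this, the comparison described above could be sketched roughly as follows. This is only a minimal sketch under several assumptions: the sample tokens, the `BucketStabilityCheck` object, the `bucketSet` helper and the `MinHasher32(20, 5)` parameters are hypothetical and are not taken from the original job, and merging the signatures per attribute with `plus` is a guess at how the buckets are derived downstream. It needs Spark and Algebird on the classpath.

```scala
import com.twitter.algebird.MinHasher32
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object BucketStabilityCheck {

  // Hypothetical helper: distinct LSH bucket ids obtained for a given partition count.
  // Assumption: signatures of the same attribute are merged with the MinHasher monoid.
  def bucketSet(tokens: RDD[(String, String)],
                minHasher: MinHasher32,
                partitions: Int): Set[Long] =
    tokens
      .repartition(partitions)
      .mapValues(token => minHasher.init(token))              // one signature per token
      .reduceByKey((a, b) => minHasher.plus(a, b))            // merge signatures per attribute
      .flatMap { case (_, signature) => minHasher.buckets(signature) }
      .distinct()
      .collect()
      .toSet

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucket-stability-check")
      .master("local[4]")
      .getOrCreate()

    // Made-up (attribute, token) pairs, for illustration only.
    val tokens = spark.sparkContext.parallelize(Seq(
      "name" -> "alice", "name" -> "bob",
      "city" -> "rome", "city" -> "paris"
    ))

    // Illustrative parameters: 20 hashes split into 5 bands.
    val minHasher = new MinHasher32(20, 5)

    val withTwo = bucketSet(tokens, minHasher, 2)
    val withTen = bucketSet(tokens, minHasher, 10)

    println(s"2 partitions: ${withTwo.size} buckets, 10 partitions: ${withTen.size} buckets")
    println(s"identical bucket sets: ${withTwo == withTen}")

    spark.stop()
  }
}
```

If the two sets differ for the same input, the dependence on partitioning is reproduced outside the full pipeline, which should make it easier to narrow down the step that introduces it.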
