[FIX] K-means slowness #4541

Merged
merged 6 commits into from
Mar 31, 2020

markotoplak
Member

Issue

After Orange 3.21 we introduced several performance regressions to K-means, which became very slow on large data sets.

Description of changes
  1. Preprocess data once. In master, the data set is preprocessed separately for each number of clusters and then again when computing silhouettes. When used with from-to, this created too many in-memory copies of the data, which caused crashes on big data due to memory usage. Note that sklearn's k-means makes another copy of the data.

  2. Only compute approximate (sampled) silhouettes for big data, and compute them in worker threads. This fixes a bug introduced in Orange 3.22.

  3. Fix O(|features|^2) complexity when creating centroids.
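
The sampling idea in item 2 can be sketched in plain Python. This is only an illustration, not the code from this PR: the function names, the `sample_size` default, and the choice to score singleton clusters as 0 are all assumptions. The point is that the full silhouette needs O(n^2) distance computations, while the sampled variant caps that cost at O(sample_size^2).

```python
import math
import random


def silhouette(points, labels):
    """Mean silhouette over all points; needs O(n^2) distance computations."""
    n = len(points)
    scores = []
    for i in range(n):
        # Sum distances from point i to every other point, grouped by cluster.
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            sums[labels[j]] = sums.get(labels[j], 0.0) + d
            counts[labels[j]] = counts.get(labels[j], 0) + 1
        own = labels[i]
        others = [sums[c] / counts[c] for c in sums if c != own]
        if own not in sums or not others:
            # Assumption: singleton cluster or a single cluster scores 0.
            scores.append(0.0)
            continue
        a = sums[own] / counts[own]  # mean distance within own cluster
        b = min(others)              # mean distance to nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


def sampled_silhouette(points, labels, sample_size=5000, seed=0):
    """Approximate the silhouette on a random subset when the data is large.

    `sample_size` is a hypothetical threshold chosen for this sketch.
    """
    if len(points) <= sample_size:
        return silhouette(points, labels)
    rng = random.Random(seed)
    idx = rng.sample(range(len(points)), sample_size)
    return silhouette([points[i] for i in idx], [labels[i] for i in idx])
```

With scikit-learn, a similar effect is available through the `sample_size` argument of `sklearn.metrics.silhouette_score`.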

There is a further obvious improvement that I did not tackle in this PR: preprocessing in a worker thread. Currently (in this branch, master, and the current release) preprocessing blocks the UI for about a minute on a data set with 98304 rows and 806 columns. Most of that time is normalization; with normalization disabled, it only blocks for a few seconds.
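
As a rough sketch of what moving preprocessing off the main thread could look like, here is a version using only the standard library. It does not use Orange's or Qt's concurrency helpers; `normalize` and `preprocess_in_background` are hypothetical names, and `normalize` is a stand-in for the real normalization step.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize(rows):
    """Hypothetical stand-in for the expensive preprocessing step:
    rescale each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    scaled = []
    for col in columns:
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        std = var ** 0.5 or 1.0  # leave constant columns unscaled
        scaled.append([(x - mean) / std for x in col])
    return [list(row) for row in zip(*scaled)]


def preprocess_in_background(executor, rows, on_done):
    """Submit preprocessing to a worker so the caller is not blocked."""
    future = executor.submit(normalize, rows)
    # The callback receives the preprocessed rows once the worker finishes.
    future.add_done_callback(lambda f: on_done(f.result()))
    return future
```

Note that the done-callback runs outside the main thread in most cases, so a GUI application would still have to hand the result back to the UI thread through its own event mechanism.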

Benchmarks

Orange 3.21

[from_to_100_100] with 3 loops, best of 3:
	min 372 msec per loop
	avg 374 msec per loop
[from_to_100_100_no_normalize] with 3 loops, best of 3:
	min 379 msec per loop
	avg 414 msec per loop
[from_to_sampled_silhouette] with 3 loops, best of 3:
	min 2.1 sec per loop
	avg 2.18 sec per loop
[wide] with 3 loops, best of 3:
	min 306 msec per loop
	avg 315 msec per loop

Master

[from_to_100_100] with 3 loops, best of 3:
	min 1.98 sec per loop
	avg 1.99 sec per loop
[from_to_100_100_no_normalize] with 3 loops, best of 3:
	min 824 msec per loop
	avg 828 msec per loop
[from_to_sampled_silhouette] with 3 loops, best of 3:
	min 10.6 sec per loop
	avg 10.7 sec per loop
[wide] with 3 loops, best of 3:
	min 2.43 sec per loop
	avg 2.44 sec per loop

This branch

[from_to_100_100] with 3 loops, best of 3:
	min 636 msec per loop
	avg 663 msec per loop
[from_to_100_100_no_normalize] with 3 loops, best of 3:
	min 462 msec per loop
	avg 486 msec per loop
[from_to_sampled_silhouette] with 3 loops, best of 3:
	min 2.62 sec per loop
	avg 2.99 sec per loop
[wide] with 3 loops, best of 3:
	min 640 msec per loop
	avg 649 msec per loop

Still not quite Orange 3.21 performance, but it will do. The difference in [wide] is due to normalization.
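
For anyone who wants to collect comparable numbers on their own workload, a minimal timing helper built on the standard `timeit` module might look like the following. It only mimics the layout of the output above and is not the benchmark code used for this PR; `bench` and its parameters are names chosen for this sketch.

```python
import timeit


def bench(name, func, loops=3, repeat=3):
    """Time `func` and report the min and avg time per loop."""
    # Each entry in `totals` is the time for `loops` consecutive calls.
    totals = timeit.repeat(func, number=loops, repeat=repeat)
    per_loop = [t / loops for t in totals]
    best = min(per_loop)
    avg = sum(per_loop) / len(per_loop)
    print(f"[{name}] with {loops} loops, best of {repeat}:")
    print(f"\tmin {best * 1000:.3g} msec per loop")
    print(f"\tavg {avg * 1000:.3g} msec per loop")
    return best, avg
```

A call such as `bench("wide", lambda: run_kmeans(wide_data))` would then time a hypothetical `run_kmeans` function on a hypothetical `wide_data` table.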

Includes
  • Code changes
  • Benchmarks
  • Documentation

This saves both time and memory. Before, the data set was preprocessed
separately for each number of clusters and then again when computing
silhouettes. When used with from-to, this made many in-memory copies of
the data.

Also compute silhouettes within worker threads. Fixes a bug introduced
in Orange 3.22, when Orange started computing full silhouette
scores for big data sets as well. Furthermore, they were computed in the
main thread, which blocked the UI.

Because silhouette scores are computed in worker threads, the
__preprocessed_data attribute is not needed anymore.
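
The "preprocess once" pattern described in these commit messages can be summarized in a short sketch. The function `cluster_range` and its `preprocess`, `fit`, and `score` callables are hypothetical placeholders, not names from the Orange code base.

```python
def cluster_range(data, k_from, k_to, preprocess, fit, score):
    """Preprocess `data` once, then fit and score every k on the same copy."""
    prepared = preprocess(data)  # single preprocessing pass
    results = {}
    for k in range(k_from, k_to + 1):
        model = fit(prepared, k)             # no per-k preprocessing
        results[k] = score(prepared, model)  # scoring reuses the same data
    return results
```

Compared with preprocessing inside `fit` and again inside `score`, this keeps one preprocessed copy alive for the whole from-to range instead of creating a new one per step.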

codecov bot commented Mar 17, 2020

Codecov Report

Merging #4541 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4541      +/-   ##
==========================================
+ Coverage   83.35%   83.35%   +<.01%     
==========================================
  Files         274      274              
  Lines       54950    54974      +24     
==========================================
+ Hits        45801    45825      +24     
  Misses       9149     9149

@thocevar thocevar merged commit c88978d into biolab:master Mar 31, 2020
@markotoplak markotoplak deleted the kmeans-faster branch September 28, 2020 08:32