Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Scatter Plot Graph and Scaling can handle metas #2699

Merged
merged 5 commits into from
Nov 10, 2017
Merged

[ENH] Scatter Plot Graph and Scaling can handle metas #2699

merged 5 commits into from
Nov 10, 2017

Conversation

jerneju
Copy link
Contributor

@jerneju jerneju commented Oct 20, 2017

Issue

Scatter Plot moves primitive meas to attributes.

Description of changes
Includes
  • Code changes
  • Tests
  • Documentation

Copy link
Contributor

@janezd janezd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so much simpler. All widgets should use models for combos/lists of variables.

no_jit[index] /= 2 * len(attr.values)
c *= 2
c += 1
c /= 2 * len(attr.values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't (2x+1)/(2a) same as (x + 0.5)/a? Aren't the above three lines the same as

c += 0.5
c /= len(attr.values)

After rewriting it this way, I also understand what the code actually does. :) Moves the data into the middle of intervals and normalizes it to [0, 1].

I know this was here before, but only God knows why it was written like this.


def _compute_jittered_data(self):
data = self.data
self.jittered_data = self.scaled_data.copy()
random = np.random.RandomState(seed=self.jitter_seed)
for index, col in enumerate(self.jittered_data):
domain = self.domain
for attr in domain.attributes + domain.class_vars + domain.metas:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use chain(domain, domain.metas) in such situations.

self.data = data
domain = data.domain
self.primitives = data.domain.attributes + data.domain.class_vars + \
tuple([v for v in data.domain.metas if v.is_primitive()])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use chain(data.domain, (v for v in data.domain.metas if v.is_primitive())).

If you keep it as it is: don't use tuple([...]): just use generators, tuple(...), without constructing a list first.

tuple([v for v in data.domain.metas if v.is_primitive()])
new_domain = Domain(attributes=domain.attributes,
class_vars=domain.class_vars,
metas=tuple([v for v in domain.metas if v.is_primitive()]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same - no need to construct a list first.

col = self.jittered_data.get_column_view(attr)[0]
col = 1 - col
col = self.scaled_data.get_column_view(attr)[0]
col = 1 - col
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These four lines have no effect, do they?

Did you mean something like this?

col *= -1
col += 1

Or (shorter, but maybe slower) col[:] = 1 - col.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the latter one can consume even 4 times more time.


attributes = self.data.domain.attributes + (self.variable_x, self.variable_y) + \
primitive_metas
attributes = data.domain.attributes + (self.variable_x, self.variable_y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BlazZupan, do we want the columns that contain x and y coordinates from the MDS to appear as ordinary attributes or as metas?

They used to be in metas, but now that visualization widgets can show metas as well, I would tend to store x and y as metas to not pollute that feature set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this line here is not related to that (it's for constructing data for plotting).
Output data has new coordinates in metas as we want, see commit function below (L671)

attr = self.domain[index]
getCached(self.data, DomainBasicStats, (self.data, True))
domain = self.domain
for attr in domain.attributes + domain.class_vars + domain.metas:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe chain(domain, domain.metas)?

Copy link
Contributor Author

@jerneju jerneju Oct 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sincere, I would prefer domain.all_variables or something similar. Like we have domain.variables = domain.attributes + domain.class_vars we could have something that includes all three: attributes, class_vars and metas.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate this, too.

When one gets older, (s)he start adding historic notes to everything. So let me say that in the beginning, like 20 years ago, Orange used to have attributes and classes, and both together were called variables. Then I introduced meta attributes, and they didn't mean the same thing as today: meta attributes were (essentially) sparse, while attributes and classes were dense ("essentially", because there was no numpy, these were C++ structures). Since all code - before the introduction of metas - expected that variables were attributes + classes, I would break everything if I added metas there. It also made more sense than today since variables (=attributes + classes) were the dense part of the data.

The idea that any part can be dense or sparse, and that metas are not so different from attributes and so forth evolved gradually even in Orange 3. Hence we have not initially decided that variables will be attributes + class + meta. To make things worse, we (most probably: I) extended this bad design to iteration over Domain. @kernc (I think) correctly pointed out that since var in domain returns True also when var is in domain.metas, so for var in domain should iterate over attributes, classes and also metas. So in all the places where you could use chain(domain, domain.metas) in your code, just domain should suffice instead. Alas, fixing the iteration over Domain would break so much code that we gave up. This horrible design decision stays forever, I fear.

I don't like all_variables. variables should already be all, so all_variables sounds almost like really_all_variables. Plus, the next thing you'll want will be primitive_variables. And maybe also continuous_variables. I wouldn't like to multiply the number of domains properties.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the sake of brainstorming - and if this is indeed necessary - what about having a method Domain.all(condition). If condition is a type derived from Variable, all returns all variables of that type (and derived types). Hence , Domain.all(DiscreteVariable) returns all discrete variables. We then add PrimitiveVariable as a new common base class for DiscreteVariable and ContinuousVariable. With that, Domain.all(PrimitiveVariable) returns all primitive variables. Finally, the default value for condition is Variable, so Domain.all() returns attributes + class_vars + metas.

Method all could also accept other types of conditions if we can think of any.

Just for brainstorming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, I implemented it.

"""
Get array of 0 and 1 of len = len(self.data). If there is a missing
value at any attribute in indices return 0 for that instance.
"""
if self.valid_data_array is None or len(self.valid_data_array) == 0:
return np.array([], np.bool)
domain = self.domain
indices = []
for index, attr in enumerate(domain.attributes + domain.class_vars + domain.metas):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enumerate(chain(domain, domain.metas))?

kernc
kernc previously requested changes Oct 23, 2017
@@ -211,6 +211,9 @@ def variables(self):
def metas(self):
return self._metas

def all(self, condition=Variable):
return [v for v in chain(self._variables, self._metas) if isinstance(v, condition)]
Copy link
Contributor

@kernc kernc Oct 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.all() is not good. The idiom means something completely different in Python/NumPy context. .filter() would be better, .where() would be best.

This change should require re-touching everywhere the former ([]+[] or chain) idiom was used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like where. :)

all reads well: I want all discrete variables, hence domain.all(DiscreteVariable). But I agree the name is wrong wrt its usual meaning in Python.

domain.where(DiscreteVariable) doesn't read so well and it also means two different things in numpy, one of which is similar to this Domain.where (but with different arguments) and one which is completely different.

domain.filter(DiscreteVariable) is sort of fine, but it gives an impression that it will return a new instance of Domain with just discrete variables.

What about domain.collect(DiscreteVariable)? Or domain.pick(DiscreteVariable) -- though I prefer collect?

@codecov-io
Copy link

codecov-io commented Oct 23, 2017

Codecov Report

Merging #2699 into master will decrease coverage by 0.01%.
The diff coverage is 88.39%.

@@            Coverage Diff             @@
##           master    #2699      +/-   ##
==========================================
- Coverage   76.06%   76.05%   -0.02%     
==========================================
  Files         338      338              
  Lines       59675    59670       -5     
==========================================
- Hits        45391    45381      -10     
- Misses      14284    14289       +5

@@ -211,6 +211,9 @@ def variables(self):
def metas(self):
return self._metas

def where(self, condition=Variable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you do this in a separate PR?
While I agree that this change is neatly hidden here, I would prefer to have modifications to Variables / Domain in a separate, clearly marked PR (where we can discuss different alternatives).

attr = self.domain[index]
getCached(self.data, DomainBasicStats, (self.data, True))
domain = self.domain
for attr in domain.where():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks weird.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jerneju jerneju mentioned this pull request Oct 24, 2017
3 tasks
@jerneju jerneju changed the title [WIP][ENH] Scatter Plot Graph and Scaling can handle metas [ENH] Scatter Plot Graph and Scaling can handle metas Oct 30, 2017
@jerneju jerneju added this to the 3.8 milestone Nov 3, 2017
@jerneju
Copy link
Contributor Author

jerneju commented Nov 3, 2017

cols = []
for var in attrs:
cols.append(graph.jittered_data.get_column_view(var)[0][valid])
X = np.vstack(cols).T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X = np.column_stack(cols)

all_data = (data.X, Y, data.metas)
else:
all_data = (data.X, Y)
all_data = np.hstack(all_data).T
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can have dtype==object when stacking with metas which will cause np.isfinite to fail 2 lines below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here, but I get similar errors elsewhere (see sentry report).
Try fixing all metas dtype problems (converting all primitive metas to float might be enough, maybe something else is also needed). Then please test thoroughly! Use at least the workflow from the sentry report.

Copy link
Contributor Author

@jerneju jerneju Nov 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found another error caused by Corpus (issue: biolab/orange3-text#324 )

@jerneju
Copy link
Contributor Author

jerneju commented Nov 7, 2017

@lanzagar lanzagar merged commit 45d8402 into biolab:master Nov 10, 2017
@jerneju jerneju deleted the spg-scaling-metas branch November 10, 2017 10:22
@markotoplak markotoplak modified the milestones: 3.8, 3.7.1 Nov 10, 2017
@jerneju jerneju mentioned this pull request Nov 14, 2017
3 tasks
nikicc pushed a commit to nikicc/orange3 that referenced this pull request Nov 17, 2017
[ENH] Scatter Plot Graph and Scaling can handle metas
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants