Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up & deflake tests by disabling bundling #1689

Merged
merged 4 commits into from
Sep 18, 2020

Conversation

evankanderson
Copy link
Member

Fixes #1672

It turns out that both Stackdriver and OpenCensus have rpc Bundlers that hold reports until at least N have been accumulated or for 1-2s. I had to send an upstream PR to OpenCensus to expose the bundler options in order to speed up reporting, which accounts for the go.mod update.

It also turns out that gRPC will take (on my machine, going through ??? WSL2 -> Windows DNS -> internet) about 1s to attempt to resolve external-svc, so changing that to a name that will resolve sped up the config_test by a second or so. It also turns out that gRPC takes an additional 1-2 seconds to set up "fail-fast" channels on Windows but not on a Linux kernel -- caveat testor if you're attempting to debug timing on Windows.

I think this has been de-flaked from a timeout perspective; on my machine (i7 mobile processor, plugged in, running WSL2), the entire suite takes about 5s, compared with 10s before these changes.

@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Sep 8, 2020
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 8, 2020
@evankanderson
Copy link
Member Author

/assign @vagababov since I know this was driving him crazy -- any additional thoughts on tests?

$ go test ./metrics -count 40 -timeout 900s 2>&1 > /tmp/test-log
$ cat /tmp/test-log
ok      knative.dev/pkg/metrics 208.145s

(I didn't push too much past -count 40 because there seems to be an FD leak somewhere in the tests which was causing the ulimit to be exhausted with 2k FDs)

@evankanderson
Copy link
Member Author

Fixing th e races now, forgot to run with -race.

@evankanderson evankanderson force-pushed the speed-test branch 2 times, most recently from c40b6ae to 263b86c Compare September 8, 2020 23:29
@evankanderson
Copy link
Member Author

Note that at head, -count 40 does not complete within 15 minutes. After this change, it's consistently less than 6 minutes.

I've noticed that Prow test runs seem to be much slower than my local machine, so it's possible the timeouts are due to the slower processing time on the Prow cluster.

@evankanderson
Copy link
Member Author

/retest

In hopes of getting another Prow run on this.

records = append(records, extracted)
if strings.HasPrefix(ts.Metric.Type, "knative.dev/") {
if diff := cmp.Diff(ts.Resource.Type, metricskey.ResourceTypeKnativeRevision); diff != "" {
t.Errorf("Incorrect resource type for %q: (-want +got): %s", ts.Metric.Type, diff)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t.Errorf("Incorrect resource type for %q: (-want +got): %s", ts.Metric.Type, diff)
t.Errorf("Incorrect resource type for %q: (-want +got):\n%s", ts.Metric.Type, diff)

sdFake.srv.GracefulStop()
break loop
}
case <-time.After(5 * time.Second):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should those timeouts be consistent? it's 4s above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There used to be an extra delay on the Stackdriver outputs, but they should be similar now.


// A variable for testing to reduce the size (number of metrics) buffered before
// Stackdriver will send a bundled metric report. Only applies if non-zero.
TestOverrideBundleCount = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a public variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seemed like e2e tests in Serving or Eventing might want to be able to disable the batching for the same reasons that the metrics tests would want to do so.

I'd like to add tests for important exports to the Serving and Eventing repos once these tests are no longer flaky.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure then let's specify that, since just looking at the comment I see no reason why it should be public.
Also let's do this in a linter happy format TestOverrideBundleCount is a variable...

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 11, 2020
@evankanderson
Copy link
Member Author

Updated!

@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 13, 2020
@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-pkg-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
metrics/opencensus_exporter.go 65.8% 75.0% 9.2
metrics/resource_view.go 86.6% 91.5% 4.9
metrics/stackdriver_exporter.go 85.9% 86.7% 0.7

@@ -225,9 +226,10 @@ func sortMetrics() cmp.Option {

// Begin table tests for exporters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Begin table tests for exporters

break loop
}
case <-time.After(4 * time.Second):
t.Error("Timeout reading input")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if fatal is more appropriate here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

select {
case record := <-ocFake.published:
if record == nil {
continue loop
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's not important here?

Suggested change
continue loop
continue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you need the label, otherwise the continue applies to the select, rather than to the loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if it's a lambda you return from the func rather than continue. But since we're keeping the label, it doesn't matter.

labels := map[string]string{}
if record.Resource != nil {
labels = record.Resource.Labels
loop:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically I'd prefer a lambda with return rather than goto label, but up to you. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would look like:

for {
  breakErr := error.New("intentional break")
  err := func() error {
    select {
    case record := <-ocFake.published:
      if record == nil {
        return nil
      }
      // ...
      if len(records) >= len(expected) {
        return breakErr
      }
    case <-time.After(5 * time.Second):
      return errors.New("timeout")
    }
  }()
  if err != breakErr {
    t.Fatal("Unable to fetch events", err)
  }
  if err == breakErr {
    break
  }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with the label is better, stylistically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you :-)

Value: ts.Points[0].Value.GetInt64Value(),
}
// Override 'cluster-name' label to reset to a fixed value
if extracted.Labels["cluster_name"] != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it not be set if empty?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using Golang defaults, so mapVal["doesNotExist"] will return the default value from the map map[string]string, which is "".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly theoretically there may be the value in the map with "" value. So if this happens due to a bug it won't be caught. But 🤷 .

@vagababov
Copy link
Contributor

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 18, 2020
@knative-prow-robot knative-prow-robot merged commit 8c2ebdb into knative:master Sep 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[flaky] TestMetricsExport is super flaky
5 participants