Speed up & deflake tests by disabling bundling #1689

evankanderson · 2020-09-08T22:41:40Z

It turns out that both Stackdriver and OpenCensus have RPC bundlers that hold reports until at least N have accumulated or 1-2s have elapsed. I had to send an upstream PR to OpenCensus to expose the bundler options in order to speed up reporting, which accounts for the go.mod update.
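
To make the bundling behavior concrete, here is a minimal stdlib-only sketch of a count-or-delay bundler. The `bundler` type, its `countThreshold` and `delayThreshold` fields, and the `immediateFlushes` helper are hypothetical names for illustration, not the Stackdriver or OpenCensus API; the real exporters configure this through their own options.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bundler is a hypothetical, simplified stand-in for an RPC bundler: it holds
// items until countThreshold items are buffered or delayThreshold has elapsed
// since the first buffered item, whichever comes first.
type bundler struct {
	mu             sync.Mutex
	countThreshold int
	delayThreshold time.Duration
	buf            []string
	timer          *time.Timer
	flush          func([]string) // must not call Add; it runs with mu held
}

// Add buffers one item and flushes if the count threshold is reached.
func (b *bundler) Add(item string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.buf = append(b.buf, item)
	if len(b.buf) >= b.countThreshold {
		b.flushLocked()
		return
	}
	if b.timer == nil {
		// Start the delay clock when the first item of a bundle arrives.
		b.timer = time.AfterFunc(b.delayThreshold, func() {
			b.mu.Lock()
			defer b.mu.Unlock()
			b.flushLocked()
		})
	}
}

// flushLocked sends the buffered items, if any. The caller must hold b.mu.
func (b *bundler) flushLocked() {
	if b.timer != nil {
		b.timer.Stop()
		b.timer = nil
	}
	if len(b.buf) == 0 {
		return
	}
	items := b.buf
	b.buf = nil
	b.flush(items)
}

// immediateFlushes reports how many times flush ran synchronously while adding
// n items, using a delay long enough that the timer never fires.
func immediateFlushes(countThreshold, n int) int {
	flushes := 0
	b := &bundler{
		countThreshold: countThreshold,
		delayThreshold: time.Hour,
		flush:          func([]string) { flushes++ },
	}
	for i := 0; i < n; i++ {
		b.Add(fmt.Sprintf("metric-%d", i))
	}
	return flushes
}

func main() {
	fmt.Println(immediateFlushes(10, 3)) // 0: reports wait for more items or the delay
	fmt.Println(immediateFlushes(1, 3))  // 3: every report is sent as soon as it is added
}
```

With a count threshold of 1, a test sees each report immediately instead of waiting out the delay, which is the effect this PR is after.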

It also turns out that gRPC will take (on my machine, going through ??? WSL2 -> Windows DNS -> internet) about 1s to attempt to resolve external-svc, so changing that to a name that will resolve sped up the config_test by a second or so. It also turns out that gRPC takes an additional 1-2 seconds to set up "fail-fast" channels on Windows but not on a Linux kernel -- caveat testor if you're attempting to debug timing on Windows.

I think this has been de-flaked from a timeout perspective; on my machine (i7 mobile processor, plugged in, running WSL2), the entire suite takes about 5s, compared with 10s before these changes.

knative-prow-robot · 2020-09-08T22:41:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [evankanderson]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

evankanderson · 2020-09-08T22:45:21Z

/assign @vagababov since I know this was driving him crazy -- any additional thoughts on tests?

$ go test ./metrics -count 40 -timeout 900s 2>&1 > /tmp/test-log
$ cat /tmp/test-log
ok      knative.dev/pkg/metrics 208.145s

(I didn't push too much past -count 40 because there seems to be an FD leak somewhere in the tests which was causing the ulimit to be exhausted with 2k FDs)

evankanderson · 2020-09-08T22:51:45Z

Fixing the races now; I forgot to run with -race.

evankanderson · 2020-09-08T23:30:57Z

Note that at head, -count 40 does not complete within 15 minutes. After this change, it's consistently less than 6 minutes.

I've noticed that Prow test runs seem to be much slower than my local machine, so it's possible the timeouts are due to the slower processing time on the Prow cluster.

evankanderson · 2020-09-09T01:13:52Z

/retest

In hopes of getting another Prow run on this.

vagababov · 2020-09-09T18:43:29Z

metrics/resource_view_test.go

+						records = append(records, extracted)
+						if strings.HasPrefix(ts.Metric.Type, "knative.dev/") {
+							if diff := cmp.Diff(ts.Resource.Type, metricskey.ResourceTypeKnativeRevision); diff != "" {
+								t.Errorf("Incorrect resource type for %q: (-want +got): %s", ts.Metric.Type, diff)


Suggested change

t.Errorf("Incorrect resource type for %q: (-want +got): %s", ts.Metric.Type, diff)

t.Errorf("Incorrect resource type for %q: (-want +got):\n%s", ts.Metric.Type, diff)

vagababov · 2020-09-09T18:44:06Z

metrics/resource_view_test.go

+						sdFake.srv.GracefulStop()
+						break loop
+					}
+				case <-time.After(5 * time.Second):


Should those timeouts be consistent? It's 4s above.

There used to be an extra delay on the Stackdriver outputs, but they should be similar now.

vagababov · 2020-09-09T18:44:41Z

metrics/stackdriver_exporter.go

+
+	// A variable for testing to reduce the size (number of metrics) buffered before
+	// Stackdriver will send a bundled metric report. Only applies if non-zero.
+	TestOverrideBundleCount = 0


why is this a public variable?

It seemed like e2e tests in Serving or Eventing might want to be able to disable the batching for the same reasons that the metrics tests would want to do so.

I'd like to add tests for important exports to the Serving and Eventing repos once these tests are no longer flaky.

Sure, then let's specify that in the comment; just looking at it, I see no reason why the variable should be public.
Also, let's do this in a linter-happy format: "TestOverrideBundleCount is a variable..."
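
As an illustration of the requested format, a doc comment that starts with the identifier name and states why it is exported could look like the sketch below. The comment wording is one possible phrasing, not necessarily what was merged, and `effectiveBundleCount` with its default of 50 is a hypothetical helper that only demonstrates the "only applies if non-zero" rule.

```go
package main

import "fmt"

// TestOverrideBundleCount is a variable for testing to reduce the size (number
// of metrics) buffered before Stackdriver will send a bundled metric report.
// Only applies if non-zero. It is exported so that e2e tests in other
// repositories can disable batching as well.
var TestOverrideBundleCount = 0

// effectiveBundleCount is a hypothetical helper showing the "only applies if
// non-zero" rule: the override wins when set, otherwise the default is kept.
func effectiveBundleCount(defaultCount int) int {
	if TestOverrideBundleCount != 0 {
		return TestOverrideBundleCount
	}
	return defaultCount
}

func main() {
	fmt.Println(effectiveBundleCount(50)) // 50: no override set
	TestOverrideBundleCount = 1
	fmt.Println(effectiveBundleCount(50)) // 1: the test override applies
}
```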

evankanderson · 2020-09-13T05:29:56Z

Updated!

knative-metrics-robot · 2020-09-13T05:34:09Z

The following is the coverage report on the affected files.
Say /test pull-knative-pkg-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
metrics/opencensus_exporter.go	65.8%	75.0%	9.2
metrics/resource_view.go	86.6%	91.5%	4.9
metrics/stackdriver_exporter.go	85.9%	86.7%	0.7

vagababov · 2020-09-13T16:45:29Z

metrics/resource_view_test.go

@@ -225,9 +226,10 @@ func sortMetrics() cmp.Option {

 // Begin table tests for exporters


Suggested change

// Begin table tests for exporters

vagababov · 2020-09-13T16:46:20Z

metrics/resource_view_test.go

+						break loop
+					}
+				case <-time.After(4 * time.Second):
+					t.Error("Timeout reading input")


I wonder if fatal is more appropriate here?

vagababov · 2020-09-13T16:46:40Z

metrics/resource_view_test.go

+				select {
+				case record := <-ocFake.published:
+					if record == nil {
+						continue loop


I guess it's not important here?

Suggested change

continue loop

continue

Actually, you need the label, otherwise the continue applies to the select, rather than to the loop.

No, if it's a lambda you return from the func rather than continue. But since we're keeping the label, it doesn't matter.
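
For reference, the two statements differ in Go: an unlabeled break inside a select exits only the select, while continue always targets the innermost enclosing for loop, with or without a label. The sketch below shows both in a receive loop; `drain` is a hypothetical function for illustration, not the code in resource_view_test.go.

```go
package main

import "fmt"

// drain reads from ch until it has collected want non-empty values, skipping
// empty ones. It blocks if the channel never delivers enough values.
func drain(ch <-chan string, want int) []string {
	var got []string
loop:
	for {
		select {
		case v := <-ch:
			if v == "" {
				// continue always targets the enclosing for loop, so
				// "continue" and "continue loop" behave the same here.
				continue
			}
			got = append(got, v)
			if len(got) >= want {
				// An unlabeled break would only leave the select and the
				// for loop would keep running, so the label is required.
				break loop
			}
		}
	}
	return got
}

func main() {
	ch := make(chan string, 4)
	for _, v := range []string{"a", "", "b", "c"} {
		ch <- v
	}
	fmt.Println(drain(ch, 2)) // [a b]
}
```

Keeping the label on continue is harmless and arguably clearer when the loop already needs it for break.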

vagababov · 2020-09-13T16:50:31Z

metrics/resource_view_test.go

-						labels := map[string]string{}
-						if record.Resource != nil {
-							labels = record.Resource.Labels
+		loop:


Stylistically I'd prefer a lambda with return rather than goto label, but up to you. :)

That would look like:

	for {
		breakErr := errors.New("intentional break")
		err := func() error {
			select {
			case record := <-ocFake.published:
				if record == nil {
					return nil
				}
				// ...
				if len(records) >= len(expected) {
					return breakErr
				}
				return nil
			case <-time.After(5 * time.Second):
				return errors.New("timeout")
			}
		}()
		if err == breakErr {
			break
		}
		if err != nil {
			t.Fatal("Unable to fetch events", err)
		}
	}

I think with the label is better, stylistically.

Up to you :-)

vagababov · 2020-09-13T16:51:25Z

metrics/resource_view_test.go

+							Value:  ts.Points[0].Value.GetInt64Value(),
+						}
+						// Override 'cluster-name' label to reset to a fixed value
+						if extracted.Labels["cluster_name"] != "" {


Should it not be set if empty?

This relies on Go's zero-value semantics: mapVal["doesNotExist"] returns the zero value of the map's value type, which for map[string]string is "".

Strictly speaking, the map could contain the key with an explicit "" value, so if that happens due to a bug it won't be caught. But 🤷.
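
The distinction raised here can be shown with Go's comma-ok map lookup. This is a minimal sketch; `hasNonEmpty` and `isPresent` are hypothetical helpers, and only the first mirrors the check in the diff.

```go
package main

import "fmt"

// hasNonEmpty mirrors the check in the diff: it cannot tell a missing key
// from a key stored with an empty value.
func hasNonEmpty(labels map[string]string, key string) bool {
	return labels[key] != ""
}

// isPresent uses the comma-ok form, which reports whether the key exists
// regardless of its value.
func isPresent(labels map[string]string, key string) bool {
	_, ok := labels[key]
	return ok
}

func main() {
	missing := map[string]string{}
	empty := map[string]string{"cluster_name": ""}

	fmt.Println(hasNonEmpty(missing, "cluster_name"), hasNonEmpty(empty, "cluster_name")) // false false
	fmt.Println(isPresent(missing, "cluster_name"), isPresent(empty, "cluster_name"))     // false true
}
```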

vagababov · 2020-09-18T23:17:19Z

/lgtm

evankanderson added 2 commits September 8, 2020 09:42

Speed up & deflake tests by disabling bundling

3ee4008

Fix go vet and races detected

263b86c

googlebot added the "cla: yes" label (indicates the PR's author has signed the CLA) Sep 8, 2020

knative-prow-robot requested review from tcnghia and yanweiguo September 8, 2020 22:41

knative-prow-robot added the size/XXL (denotes a PR that changes 1000+ lines, ignoring generated files) and approved (indicates a PR has been approved by an approver from all required OWNERS files) labels Sep 8, 2020

evankanderson force-pushed the speed-test branch 2 times, most recently from c40b6ae to 263b86c Compare September 8, 2020 23:29

vagababov reviewed Sep 9, 2020

evankanderson added 2 commits September 9, 2020 23:30

Address @vagababov comments, add nil check

581c977

Merge remote-tracking branch 'upstream/master' into speed-test

eb728c8

knative-prow-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Sep 11, 2020

knative-prow-robot removed the needs-rebase label Sep 13, 2020

vagababov reviewed Sep 13, 2020

knative-prow-robot assigned vagababov Sep 18, 2020

knative-prow-robot added the lgtm label (indicates that a PR is ready to be merged) Sep 18, 2020

knative-prow-robot merged commit 8c2ebdb into knative:master Sep 18, 2020
