Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address istio mesh e2e mesh failures #11286

Closed
dprotaso opened this issue Apr 30, 2021 · 25 comments
Closed

Address istio mesh e2e mesh failures #11286

dprotaso opened this issue Apr 30, 2021 · 25 comments
Assignees
Labels
area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.

Comments

@dprotaso
Copy link
Member

dprotaso commented Apr 30, 2021

/area test-and-release
/area networking
/kind cleanup
/kind good-first-issue

Expected Behavior

e2e tests pass when running our e2e tests with istio and the local gateway disabled (mesh only) with mTLS enabled (STRICT). At a minimum we should be skipping tests we know that will fail when strict mTLS is enabled.

This also affects DomainMapping but that's being tracked by knative-extensions/net-istio#562.

Actual Behavior

The following tests will always fail since we have global mTLS turned on (STRICT)

test/e2e: TestSvcToSvcViaActivator/both-disabled
test/e2e: TestSvcToSvcViaActivator/a-disabled
test/e2e: TestCallToPublicService/local_address 

Steps to Reproduce the Problem

  1. Ensure your net-istio latest/stable mesh has the right PeerAuthentication to ensure global mTLS is on
    ie. [istio-stable] renable global mTLS when in mesh mode knative-extensions/net-istio#617
  2. Once e2e - drop imperative bash yaml for declarative ytt/kapp #11175 merges open a PR with a noop change and run
/test pull-knative-serving-istio-latest-mesh
/test pull-knative-serving-istio-stable-mesh

Additional Info

@knative-prow-robot knative-prow-robot added area/test-and-release It flags unit/e2e/conformance/perf test issues for product features area/networking kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/good-first-issue labels Apr 30, 2021
@dprotaso
Copy link
Member Author

dprotaso commented May 3, 2021

/assign @arturenault

@nak3
Copy link
Contributor

nak3 commented May 10, 2021

@dprotaso @arturenault The root cause is local-gateway.mesh: "mesh".

The option makes tests failed w/o sidecar injection. The failed tests disable sidecar as:

"sidecar.istio.io/inject": strconv.FormatBool(inject),

I verified these tests worked if local-gateway.mesh: "mesh" was dropped by https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_serving/11323/pull-knative-serving-istio-stable-mesh/1391619647853301760

So, we need to change either:

a) Remove local-gateway.mesh: "mesh".
b) Skip tests with w/o sidecar -> app for mesh tests.

@arturenault
Copy link
Contributor

Thanks for looking into this @nak3. I think option 2 makes sense. I'll send a PR later today.

@dprotaso
Copy link
Member Author

dprotaso commented May 11, 2021

yeah - some of the tests skip disable sidecars so the test will never pass with STRICT mTLS

@nak3
Copy link
Contributor

nak3 commented Jun 11, 2021

Now, I am inclined to think "a) Remove local-gateway.mesh: "mesh"".
Even if we skip test with w/o sidecar -> app (by #11331), DomainMapping tests cannot be fixed as knative-extensions/net-istio#562.

So can we remove the option in net-istio https://github.com/knative-sandbox/net-istio/blob/main/third_party/istio-stable/istio-kind-mesh/config-istio.yaml#L11 ? (I kicked the test run in #11503)

cc @ZhiminXiang

@dprotaso
Copy link
Member Author

"a) Remove local-gateway.mesh: "mesh""

Then are we really running mesh (with mTLS) ?

@ZhiminXiang
Copy link

"a) Remove local-gateway.mesh: "mesh""

Then are we really running mesh (with mTLS) ?

I think we use mTLS STRICT for mesh https://github.com/knative-sandbox/net-istio/blob/8a3cf7e23bc14a27218bddaf0a3757c4d64f891a/third_party/istio-latest/extra/istio-mesh.yaml#L9

@ZhiminXiang
Copy link

So can we remove the option in net-istio https://github.com/knative-sandbox/net-istio/blob/main/third_party/istio-stable/istio-kind-mesh/config-istio.yaml#L11 ? (I kicked the test run in #11503)

cc @ZhiminXiang

I think we can remove the option local-gateway.mesh: "mesh".
If we remove this option, we may want to add an extra test for service-to-service communication when there is no pods for knative-local-gateway service in order to make sure there is no extra hop for service-to-service communication (in case there is any regression in Istio side).

@dprotaso
Copy link
Member Author

I'm confused I thought by removing local-gateway.mesh that would required service to service communication to go through an additional hop (a gateway pod) - vs. sidecar to sidecar.

I'd only consider sidecar to sidecar to be 'mesh'

@julz
Copy link
Member

julz commented Jun 11, 2021

I'm confused I thought by removing local-gateway.mesh that would required service to service communication to go through an additional hop (a gateway pod) - vs. sidecar to sidecar.

I believe removing this just allows access through the gateway for pods that arent in the mesh, it doesn't require it, and pods with a sidecar will still go pod-to-pod.

@ZhiminXiang
Copy link

I'm confused I thought by removing local-gateway.mesh that would required service to service communication to go through an additional hop (a gateway pod) - vs. sidecar to sidecar.

I believe removing this just allows access through the gateway for pods that arent in the mesh, it doesn't require it, and pods with a sidecar will still go pod-to-pod.

@julz is right. From my test knative-extensions/net-istio#562 (comment), after removing local-gateway.mesh, the internal communication is still sidecar to sidecar. I think the sidecar of client knows what is the right destination and will directly proxy the requests to the destination.

@dprotaso
Copy link
Member Author

ok great :)

@nak3
Copy link
Contributor

nak3 commented Jun 14, 2021

So we will proceed this issue by the following steps:

  1. Do not enable local-gateway.mesh in e2e (remove it from net-istio).
  2. Consider (open an issue/send a mail knative-user@) to deprecate local-gateway.mesh option. (The deprecation was mentioned in DomainMapping does not work with local gateway "mesh" mode knative-extensions/net-istio#562 (comment). And we should deprecate it as we no longer test it.)
  3. Add a new test to verify svc-to-svc in mesh does not use local gateway.

I will send the PRs so please help your reviews 🙏

@arturenault
Copy link
Contributor

Thanks all, sorry for dropping the ball on the previous PR (got busy with other things and then went on vacation for 2 weeks). Sounds like we're going with a different solution and assign this issue to kenjiro.

/assign @nak3

@nak3
Copy link
Contributor

nak3 commented Jun 21, 2021

So, the last monster is scale-200 test 😅

#11552 verified that all test passed with -short option (it runs scale-10 instead of scale-200) so I think we can use the -short options for somewhere.

I think we have a few options for the solution like:

🟢 ... test will pass.
🔴 ... test will fail.

option 1. Use short option for pre-submit test.

option 2. Add new job trigger /test pull-knative-serving-istio-stable-mesh-short to run mesh w/ -short.

  • 🟢 /test pull-knative-serving-istio-stable-mesh-short (w/ -short)
  • 🔴 periodic job (test grid)
  • 🔴 /test pull-knative-serving-istio-stable-mesh

@nak3
Copy link
Contributor

nak3 commented Jun 21, 2021

And I am thinking that option 2 is good for now.
(option 1 will make confusion as pre-submit passed all tests but test grid always red. Someone wonder what is the difference...)

@julz
Copy link
Member

julz commented Jun 21, 2021

Option 2 seems reasonable to me too fwiw

(I would love, at some point, separately, to figure out if we can tweak the timeouts on the full scale tests in mesh mode to have it pass, though -- it's obviously not ideal if we have no scale tests for mesh mode long-term since I imagine people who use mesh might have relatively big clusters/workloads)

@dprotaso
Copy link
Member Author

To @julz's point is configuring the timeouts for mesh an option easier option - ie. make them test flags

@dprotaso
Copy link
Member Author

dprotaso commented Jun 21, 2021

aside: a caveat is these timings would be sensitive to GKE performance and it could regress again. I think these scale tests should be spun up on GCE/kubeadm were we have more control of the machine types of the API server

But we've only been bitten by this once so maybe not worth doing so yet

nak3 added a commit to nak3/serving that referenced this issue Jun 22, 2021
This patch adds short option to e2e test and support it on scale test.

Please see knative#11286 for the discussion.
knative-prow-robot pushed a commit that referenced this issue Jun 22, 2021
* Add short option to scale test

This patch adds short option to e2e test and support it on scale test.

Please see #11286 for the discussion.

* Temporary enable short option.

* Revert "Temporary enable short option."

This reverts commit 52eed09.

* Use short option for all test when enabled
@nak3
Copy link
Contributor

nak3 commented Jun 28, 2021

Now, mesh test fails with only scale-200. HA test and TestHPAAutoscaleUpDownUp fails sometime but it was also caused by scale-200 test 😓 (It can be fixed by tweaking the test order as #11552).

Should we close this issue? Or keep open until scale-200 test could be fixed?

@julz
Copy link
Member

julz commented Jun 28, 2021

Should we close this issue? Or keep open until scale-200 test could be fixed?

Personal opinion: I reckon since we now have consistently green tests (🎉🎉!) we should close this issue but we should create another one to track re-enabling the fill scale-200 (unless we already did obviously), because we do want to do that one way or another

@dprotaso
Copy link
Member Author

testgrid still shows red for me - is there a new job?

https://testgrid.k8s.io/r/knative-own-testgrid/serving

@nak3
Copy link
Contributor

nak3 commented Jun 29, 2021

The mesh test's failure is scale-200.

knative/test-infra#2817 added pull-knative-serving-istio-stable-mesh-short but it is just for a manual trigger or updaate of third_party/istio-latest/net-istio.yaml.
Should we add or replace stable-mesh-short for the test grid?

@dprotaso
Copy link
Member Author

dprotaso commented Jul 5, 2021

I guess I'm trying to answer the question:

Where do I look or what tests do I run to ensure the mesh case is working as expected.

If we know scale-200 is going to fail should we use skip it for the mesh case and decide after whether it's possible to fix?

@arturenault arturenault removed their assignment Jul 8, 2021
@dprotaso dprotaso added good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. and removed kind/good-first-issue labels Jul 23, 2021
@nak3 nak3 closed this as completed Aug 28, 2021
@nak3
Copy link
Contributor

nak3 commented Aug 28, 2021

stable-mesh-short passes all test.
Removing/decreasing scale-200 for mesh (not short) was already proposed before but #5989 was not accepted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/networking area/test-and-release It flags unit/e2e/conformance/perf test issues for product features good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
Projects
None yet
6 participants