Istio Tasks; Traffic Management - Circuit Breaking

Preparation

This time we set up the httpbin sample app instead of bookinfo (the container image on the docker registry is kong's, which is said to be a fork of the original httpbin). To keep it separate from the default namespace where bookinfo is already installed, I created a new namespace httpbin and enabled istio sidecar (envoy proxy) injection:
$ k create ns httpbin
$ k label ns httpbin istio-injection=enabled
Install the httpbin resources. The service is exposed on port 8000 (container port 80 behind it):
$ k apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/httpbin/httpbin.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: httpbin
---
apiVersion: v1
kind: Service
metadata:
  name: httpbin
  labels:
    app: httpbin
    service: httpbin
spec:
  ports:
  - name: http
    port: 8000
    targetPort: 80
  selector:
    app: httpbin
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: httpbin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: httpbin
      version: v1
  template:
    metadata:
      labels:
        app: httpbin
        version: v1
    spec:
      serviceAccountName: httpbin
      containers:
      - image: docker.io/kong/httpbin
        imagePullPolicy: IfNotPresent
        name: httpbin
        ports:
        - containerPort: 80

Circuit Breaking

Apply a destinationrule configured with circuit breaking:
$ k apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF
Circuit breaking kicks in when the trafficPolicy.connectionPool limits, tcp.maxConnections and http.http1MaxPendingRequests, are exceeded. The outlierDetection block additionally ejects a host from the pool for baseEjectionTime once it returns consecutive5xxErrors consecutive 5xx responses, checked every interval.
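To double-check what was actually stored, the applied rule can be read back with the same k/yq shorthand used elsewhere in this post (a quick sanity check, not part of the official task):
$ k get destinationrule httpbin -oyaml | yq .spec.trafficPolicy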

Client

We will use a client called fortio to send requests to httpbin. fortio is an http/grpc load generator made by the istio project:
$ k apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/httpbin/sample-client/fortio-deploy.yaml
apiVersion: v1
kind: Service
metadata:
  name: fortio
  labels:
    app: fortio
    service: fortio
spec:
  ports:
  - port: 8080
    name: http
  selector:
    app: fortio
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fortio-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fortio
  template:
    metadata:
      annotations:
        # This annotation causes Envoy to serve cluster.outbound statistics via 15000/stats
        # in addition to the stats normally served by Istio. The Circuit Breaking example task
        # gives an example of inspecting Envoy stats via proxy config.
        proxy.istio.io/config: |-
          proxyStatsMatcher:
            inclusionPrefixes:
            - "cluster.outbound"
            - "cluster_manager"
            - "listener_manager"
            - "server"
            - "cluster.xds-grpc"
      labels:
        app: fortio
    spec:
      containers:
      - name: fortio
        image: fortio/fortio:latest_release
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
          name: http-fortio
        - containerPort: 8079
          name: grpc-ping
According to the comment, the pod annotation exposes additional envoy statistics related to circuit breaking (cluster.outbound?). There is a docs page on this that looks relevant; I'll dig into the details another time (we check these stats from istio-proxy at the end of this task).
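As a quick sanity check (a sketch using the same k/yq pattern as the other commands here), you can confirm that the proxyStatsMatcher annotation actually landed on the running fortio pod:
$ k get po -l app=fortio -oyaml | yq '.items[0].metadata.annotations["proxy.istio.io/config"]'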
Exec into fortio and use its curl to check that a request to httpbin goes through:
❯ k exec $(k get po -l app=fortio -oyaml | yq .items[0].metadata.name) -c fortio -- /usr/bin/fortio curl -quiet http://httpbin:8000/get
HTTP/1.1 200 OK
server: envoy
date: Sat, 06 Jan 2024 13:02:13 GMT
content-type: application/json
content-length: 594
access-control-allow-origin: *
access-control-allow-credentials: true
x-envoy-upstream-service-time: 17

{
  "args": {},
  "headers": {
    "Host": "httpbin:8000",
    "User-Agent": "fortio.org/fortio-1.60.3",
    "X-B3-Parentspanid": "5052877c3cc61319",
    "X-B3-Sampled": "0",
    "X-B3-Spanid": "7d4b2fba746272de",
    "X-B3-Traceid": "8ece8bdbb28e04ad5052877c3cc61319",
    "X-Envoy-Attempt-Count": "1",
    "X-Forwarded-Client-Cert": "By=spiffe://cluster.local/ns/httpbin/sa/httpbin;Hash=7691258e47df2875ae47978a5d18b0bfc87a64400e3393d131e5bf58ce87f4e0;Subject=\"\";URI=spiffe://cluster.local/ns/httpbin/sa/default"
  },
  "origin": "127.0.0.6",
  "url": "http://httpbin:8000/get"
}

Tripping

The destinationrule specifies tcp.maxConnections=1, http.http1MaxPendingRequests=1, and http.maxRequestsPerConnection=1.
These fields are written in istio's object model; the actual detailed behavior and configuration belong to envoy:
What the configuration above corresponds to is the "Cluster maximum connections" / "Cluster maximum requests" type of circuit breaking. In envoy, a group of upstream hosts is called a cluster (though I'm not sure whether that means all pods of the service; doesn't each pod get its own envoy proxy?). For each type there is a calculation that decides when the circuit should break, i.e., when a value counts as an outlier. It's not very intuitive without a working knowledge of envoy configuration.
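One way to see the istio-to-envoy translation directly (a sketch assuming istioctl is installed; the exact output shape may differ by version) is to dump the outbound envoy cluster for httpbin from the fortio sidecar; the circuitBreakers thresholds should reflect the DestinationRule values:
$ istioctl proxy-config cluster $(k get po -l app=fortio -oyaml | yq .items[0].metadata.name) \
    --fqdn httpbin.httpbin.svc.cluster.local --direction outbound -o json \
    | yq '.[0].circuitBreakers'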
The destinationrule spec descriptions are a good reference:
tcp.maxConnections: the maximum number of HTTP1/TCP connections to a destination host; default 2^32-1.
http.http1MaxPendingRequests: the maximum length of the queue of requests waiting for a ready connection from the connection pool; default 2^32-1. Applies to both HTTP/1.1 and HTTP2.
http.maxRequestsPerConnection: the maximum number of requests per backend connection. Setting it to 1 disables keep-alive. Default 0 (unlimited); valid range 0 to 2^29.
I read the last option as disabling keep-alive so that each connection carries as few requests as possible.
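To see that effect in isolation (a sketch, not a captured run; your numbers will differ): with a single connection (-c 1) nothing should queue up or trip the breaker, but because maxRequestsPerConnection=1 disables keep-alive, the "Sockets used" line in the summary should be close to the call count rather than the ideal 1:
$ k exec $(k get po -l app=fortio -oyaml | yq .items[0].metadata.name) -c fortio -- /usr/bin/fortio load -c 1 -qps 0 -n 10 -loglevel Warning http://httpbin:8000/get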
Now let's generate some load: 20 requests over two connections. The log level is set to Warning so that the 503 errors are visible:
❯ k exec $(k get po -l app=fortio -oyaml | yq .items[0].metadata.name) -c fortio -- /usr/bin/fortio load -c 2 -qps 0 -n 20 -loglevel Warning http://httpbin:8000/get
{"ts":1704547471.734806,"level":"info","r":1,"file":"logger.go","line":254,"msg":"Log level is now 3 Warning (was 2 Info)"}
Fortio 1.60.3 running at 0 queries per second, 2->2 procs, for 20 calls: http://httpbin:8000/get
Starting at max qps with 2 thread(s) [gomax 2] for exactly 20 calls (10 per thread + 0)
{"ts":1704547471.744735,"level":"warn","r":7,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547471.758365,"level":"warn","r":6,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704547471.763986,"level":"warn","r":7,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547471.770005,"level":"warn","r":7,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547471.784960,"level":"warn","r":7,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547471.790872,"level":"warn","r":6,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704547471.796857,"level":"warn","r":7,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547471.803149,"level":"warn","r":6,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
Ended after 83.25321ms : 20 calls. qps=240.23
Aggregated Function Time : count 20 avg 0.0077428385 +/- 0.005626 min 0.000572906 max 0.019008139 sum 0.15485677
# range, mid point, percentile, count
>= 0.000572906 <= 0.001 , 0.000786453 , 15.00, 3
> 0.001 <= 0.002 , 0.0015 , 25.00, 2
> 0.003 <= 0.004 , 0.0035 , 30.00, 1
> 0.004 <= 0.005 , 0.0045 , 35.00, 1
> 0.005 <= 0.006 , 0.0055 , 45.00, 2
> 0.006 <= 0.007 , 0.0065 , 50.00, 1
> 0.008 <= 0.009 , 0.0085 , 60.00, 2
> 0.009 <= 0.01 , 0.0095 , 70.00, 2
> 0.011 <= 0.012 , 0.0115 , 85.00, 3
> 0.016 <= 0.018 , 0.017 , 90.00, 1
> 0.018 <= 0.0190081 , 0.0185041 , 100.00, 2
# target 50% 0.007
# target 75% 0.0113333
# target 90% 0.018
# target 99% 0.0189073
# target 99.9% 0.0189981
Error cases : count 8 avg 0.0027348501 +/- 0.002553 min 0.000572906 max 0.008008356 sum 0.021878801
# range, mid point, percentile, count
>= 0.000572906 <= 0.001 , 0.000786453 , 37.50, 3
> 0.001 <= 0.002 , 0.0015 , 62.50, 2
> 0.003 <= 0.004 , 0.0035 , 75.00, 1
> 0.005 <= 0.006 , 0.0055 , 87.50, 1
> 0.008 <= 0.00800836 , 0.00800418 , 100.00, 1
# target 50% 0.0015
# target 75% 0.004
# target 90% 0.00800167
# target 99% 0.00800769
# target 99.9% 0.00800829
# Socket and IP used for each connection:
[0] 4 socket used, resolved to 172.20.30.15:8000, connection timing : count 4 avg 0.00027655325 +/- 0.0002784 min 9.1591e-05 max 0.000757319 sum 0.001106213
[1] 6 socket used, resolved to 172.20.30.15:8000, connection timing : count 6 avg 0.00020128 +/- 0.0001627 min 0.00010712 max 0.000562831 sum 0.00120768
Connection time (s) : count 10 avg 0.0002313893 +/- 0.0002197 min 9.1591e-05 max 0.000757319 sum 0.002313893
Sockets used: 10 (for perfect keepalive, would be 2)
Uniform: false, Jitter: false, Catchup allowed: true
IP addresses distribution:
172.20.30.15:8000: 10
Code 200 : 12 (60.0 %)
Code 503 : 8 (40.0 %)
Response Header Sizes : count 20 avg 138.15 +/- 112.8 min 0 max 231 sum 2763
Response Body/Total Sizes : count 20 avg 590.95 +/- 285.7 min 241 max 825 sum 11819
All done 20 calls (plus 0 warmup) 7.743 ms avg, 240.2 qps
In my case the success-to-failure ratio was 6:4, so failures were quite frequent. Raising the concurrency to 3 fails even more often, i.e., the circuit breaker trips more frequently:
❯ k exec $(k get po -l app=fortio -oyaml | yq .items[0].metadata.name) -c fortio -- /usr/bin/fortio load -c 3 -qps 0 -n 30 -loglevel Warning http://httpbin:8000/get
{"ts":1704547619.743989,"level":"info","r":1,"file":"logger.go","line":254,"msg":"Log level is now 3 Warning (was 2 Info)"}
Fortio 1.60.3 running at 0 queries per second, 2->2 procs, for 30 calls: http://httpbin:8000/get
Starting at max qps with 3 thread(s) [gomax 2] for exactly 30 calls (10 per thread + 0)
{"ts":1704547619.750389,"level":"warn","r":10,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547619.751528,"level":"warn","r":10,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547619.754584,"level":"warn","r":10,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547619.755895,"level":"warn","r":10,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547619.757561,"level":"warn","r":9,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704547619.762045,"level":"warn","r":9,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704547619.768052,"level":"warn","r":11,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704547619.776519,"level":"warn","r":9,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704547619.777420,"level":"warn","r":10,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547619.780687,"level":"warn","r":10,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547619.782058,"level":"warn","r":11,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704547619.785895,"level":"warn","r":9,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704547619.794717,"level":"warn","r":9,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704547619.797338,"level":"warn","r":9,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":0,"run":0}
{"ts":1704547619.797767,"level":"warn","r":11,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704547619.798744,"level":"warn","r":11,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704547619.799656,"level":"warn","r":11,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704547619.800629,"level":"warn","r":11,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704547619.801611,"level":"warn","r":11,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":2,"run":0}
{"ts":1704547619.811691,"level":"warn","r":10,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
{"ts":1704547619.814642,"level":"warn","r":10,"file":"http_client.go","line":1104,"msg":"Non ok http code","code":503,"status":"HTTP/1.1 503","thread":1,"run":0}
Ended after 67.999653ms : 30 calls. qps=441.18
Aggregated Function Time : count 30 avg 0.0056838562 +/- 0.006266 min 0.000939173 max 0.030577574 sum 0.170515686
# range, mid point, percentile, count
>= 0.000939173 <= 0.001 , 0.000969586 , 10.00, 3
> 0.001 <= 0.002 , 0.0015 , 30.00, 6
> 0.002 <= 0.003 , 0.0025 , 46.67, 5
> 0.003 <= 0.004 , 0.0035 , 60.00, 4
> 0.004 <= 0.005 , 0.0045 , 63.33, 1
> 0.005 <= 0.006 , 0.0055 , 66.67, 1
> 0.007 <= 0.008 , 0.0075 , 76.67, 3
> 0.008 <= 0.009 , 0.0085 , 83.33, 2
> 0.011 <= 0.012 , 0.0115 , 86.67, 1
> 0.012 <= 0.014 , 0.013 , 93.33, 2
> 0.016 <= 0.018 , 0.017 , 96.67, 1
> 0.03 <= 0.0305776 , 0.0302888 , 100.00, 1
# target 50% 0.00325
# target 75% 0.00783333
# target 90% 0.013
# target 99% 0.0304043
# target 99.9% 0.0305602
Error cases : count 21 avg 0.003817704 +/- 0.006214 min 0.000939173 max 0.030577574 sum 0.080171784
# range, mid point, percentile, count
>= 0.000939173 <= 0.001 , 0.000969586 , 14.29, 3
> 0.001 <= 0.002 , 0.0015 , 42.86, 6
> 0.002 <= 0.003 , 0.0025 , 66.67, 5
> 0.003 <= 0.004 , 0.0035 , 85.71, 4
> 0.005 <= 0.006 , 0.0055 , 90.48, 1
> 0.007 <= 0.008 , 0.0075 , 95.24, 1
> 0.03 <= 0.0305776 , 0.0302888 , 100.00, 1
# target 50% 0.0023
# target 75% 0.0034375
# target 90% 0.0059
# target 99% 0.0304563
# target 99.9% 0.0305654
# Socket and IP used for each connection:
[0] 6 socket used, resolved to 172.20.30.15:8000, connection timing : count 6 avg 0.00030770633 +/- 0.0002855 min 8.4577e-05 max 0.000871142 sum 0.001846238
[1] 8 socket used, resolved to 172.20.30.15:8000, connection timing : count 8 avg 0.00018375637 +/- 0.0001988 min 6.638e-05 max 0.000685104 sum 0.001470051
[2] 7 socket used, resolved to 172.20.30.15:8000, connection timing : count 7 avg 0.000454983 +/- 0.0008248 min 6.9092e-05 max 0.002473526 sum 0.003184881
Connection time (s) : count 21 avg 0.00030957952 +/- 0.0005274 min 6.638e-05 max 0.002473526 sum 0.00650117
Sockets used: 21 (for perfect keepalive, would be 3)
Uniform: false, Jitter: false, Catchup allowed: true
IP addresses distribution:
172.20.30.15:8000: 21
Code 200 : 9 (30.0 %)
Code 503 : 21 (70.0 %)
Response Header Sizes : count 30 avg 69.066667 +/- 105.5 min 0 max 231 sum 2072
Response Body/Total Sizes : count 30 avg 389.56667 +/- 286.9 min 153 max 825 sum 11687
All done 30 calls (plus 0 warmup) 5.684 ms avg, 441.2 qps
Now check the statistics we enabled with the annotation earlier, via the pilot-agent in the sidecar (filtered on httpbin\.httpbin because of the namespace):
❯ k exec $(k get po -l app=fortio -oyaml | yq .items[0].metadata.name) -c istio-proxy -- pilot-agent request GET stats | grep httpbin\.httpbin | grep pending
cluster.outbound|8000||httpbin.httpbin.svc.cluster.local.circuit_breakers.default.remaining_pending: 1
cluster.outbound|8000||httpbin.httpbin.svc.cluster.local.circuit_breakers.default.rq_pending_open: 0
cluster.outbound|8000||httpbin.httpbin.svc.cluster.local.circuit_breakers.high.rq_pending_open: 0
cluster.outbound|8000||httpbin.httpbin.svc.cluster.local.upstream_rq_pending_active: 0
cluster.outbound|8000||httpbin.httpbin.svc.cluster.local.upstream_rq_pending_failure_eject: 0
cluster.outbound|8000||httpbin.httpbin.svc.cluster.local.upstream_rq_pending_overflow: 20
cluster.outbound|8000||httpbin.httpbin.svc.cluster.local.upstream_rq_pending_total: 26
You can see that upstream_rq_pending_overflow has accumulated. Among the envoy circuit breaker types noted above, this reads as an overflow of "Cluster maximum pending requests" or "maximum requests".
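To rerun the experiment from a clean slate, envoy's admin interface can zero these counters; assuming pilot-agent forwards POSTs the same way it forwards the GET above, something like this should work:
$ k exec $(k get po -l app=fortio -oyaml | yq .items[0].metadata.name) -c istio-proxy -- pilot-agent request POST reset_counters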

Lessons learned

Istio's circuit breaking takes envoy's implementation as-is; the destinationrule spec is the istio-side surface for configuring it.