Collecting Kubernetes Events in Datadog

Events from a Kubernetes cluster integrated with Datadog can be collected and used as Datadog events. This post explains how to collect them and shows a few simple uses.

Pricing

Unless you are on the free tier, collecting Kubernetes events costs nothing extra. Datadog does not charge for standard events and metrics, and events from the Kubernetes integration count as standard events.

Pricing | Datadog
https://www.datadoghq.com/pricing/

(My team had a history of not collecting these events because they used to be expensive, but) since it is free, I turned it on right away.

Agent configuration

Further Datadog Agent configuration on Kubernetes
https://docs.datadoghq.com/ko/containers/kubernetes/configuration?tab=helm#kubernetes-%EC%9D%B4%EB%B2%A4%ED%8A%B8-%EC%88%98%EC%A7%91-%ED%99%9C%EC%84%B1%ED%99%94

My team uses the cluster agent Helm chart for the Kubernetes integration, so we enable datadog.collectEvents (see the Helm chart values below).

values.yaml#L391 (datadog)

Pipeline

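Once collection is enabled, go to Service Mgmt > Event Management > Explorer in Datadog's left navigation bar, and you can see events with the kubernetes source coming in. The Event Explorer resembles the Log Explorer, but it does not offer exactly the same features (more on this later).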

What you see there, however, are Datadog events, and they do not map 1:1 to Kubernetes events. To search by the fields of a Kubernetes event, you need to configure a pipeline.

grok parser

1 Created: Created container xxxx
1 Pulled: Container image "xxxx:latest" already present on machine
1 Scheduled: Successfully assigned yyyy/xxx to zzzz
1 Started: Started container xxxx

Events emitted by the default-scheduler seen at 2024-10-29 05:35:01 +0000 UTC since 2024-10-29 05:35:01 +0000 UTC

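Click one of the events in the explorer and you will see a message like the one above, covering the period from pod scheduling to container start. From the top, each line shows a count for an event reason, followed by the detailed message (I added line breaks for readability; in the actual message they are whitespace). In Kubernetes these are separate events, but Datadog aggregates them into one per related Kubernetes object (involvedObject) over some time frame (I am not sure which). The last line records when the events started (since) and ended (at).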

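Searching the explorer for a keyword such as "created" would return the result above, but the messages of all the other events would show up as well. For more precise search, and to use facets, let's parse messages like this one.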

In the current Datadog web console UI, selecting the rightmost tab, Pipelines, lets you create a new pipeline. I set mine up as follows:

The first processor is a grok parser. Grok syntax is not particularly intuitive, so I use helper rules to make the expressions a little easier to follow, and I put debugging messages into the event samples.

To use helper rules, expand Advanced Settings at the bottom. Extract from: should already be filled in with message. Under Helper Rules: I wrote the following:

count %{integer:count}
reason \*\*%{word:reason}\*\*\
origin_message %{data:origin_message}\n
event (%{integer:count} \*\*%{word:reason}\*\*\: %{data:origin_message}\n\s)?


source by the %{notSpace:source}
at %{date("yyyy-MM-dd HH:mm:ss"):at} \+0000 UTC
since %{date("yyyy-MM-dd HH:mm:ss"):since} \+0000 UTC_
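The source, at, and since helpers combine a date pattern with literal text. As a rough way to sanity-check that pattern outside Datadog, the sketch below parses the footer line of the earlier sample message with Python's re and datetime modules. The name parse_footer and the regular expression are my own stand-ins, not Datadog's implementation, and the strptime format is only an assumed equivalent of yyyy-MM-dd HH:mm:ss.

```python
import re
from datetime import datetime, timezone

# Hypothetical stand-in for the source/at/since helper rules. The optional
# underscores cover the Markdown italics markers around the footer.
FOOTER_RE = re.compile(
    r"_?Events emitted by the (?P<source>\S+) "
    r"seen at (?P<at>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 UTC "
    r"since (?P<since>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 UTC_?"
)

# Assumed Python equivalent of the grok date pattern "yyyy-MM-dd HH:mm:ss".
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_footer(line):
    """Return source, at and since from a footer line, or None if it does not match."""
    match = FOOTER_RE.search(line)
    if match is None:
        return None

    def to_utc(text):
        # The footer always carries "+0000 UTC", so the value is treated as UTC.
        return datetime.strptime(text, DATE_FORMAT).replace(tzinfo=timezone.utc)

    return {
        "source": match["source"],
        "at": to_utc(match["at"]),
        "since": to_utc(match["since"]),
    }


footer = parse_footer(
    "Events emitted by the default-scheduler seen at 2024-10-29 05:35:01 +0000 UTC "
    "since 2024-10-29 05:35:01 +0000 UTC"
)
print(footer)
```

If a footer fails to parse here, the literal text around the dates is the first thing to compare against the helper rules.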

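For grok parser syntax, refer to the Datadog documentation on the Grok Parser. I wrote every rule using only the %{MATCHER:EXTRACT} form (a matcher, then the name of the attribute to extract into), and for the parts that do not need to be parsed into fields I wrote whitespace or the literal characters. The ** around reason appears to be the message's Markdown markup for bold text.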

The parsing rule (Define parsing rules) built on these helper rules is as follows:

messageRule %%%\s+%{event}%{event}%{event}%{event}%{event}%{event}%{event}%{event}%{event}%{event}_Events emitted %{source} seen at %{at} since %{since}\s+%{data}%%%

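Notice that the event helper, which has the structure count reason origin_message, is repeated. As explained with the example message earlier, a single Datadog event message contains several Kubernetes events. There are broadly two ways to parse this.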

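The first is to statically declare the field names ahead of time as separate extracts, such as count_1, count_2, …, reason_1, reason_2, …, for however many events there may (or may not) be. The second is to apply the same extract several times, as in the rule above. The parsed reason field then becomes an array, and in the explorer you can search with reason:REASON without specifying an index. For example,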

{
  "reason": [
    "Pulling",
    "Scheduled",
    "Pulling",
    "Pulled",
    "Created",
    "Created",
    "Pulled",
    "Started",
    "Started"
  ],
  ...
}

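If the parsing result looks like the above, the queries reason:pulling and reason:scheduled each match in the explorer. Either way, the number of reasons has to be fixed statically. I chose the second approach and repeated the event helper ten times.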
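Repeating the helper by hand is easy to miscount, so one option is to generate the rule text and paste it into the UI. The sketch below does that in Python. The function build_message_rule and its max_events parameter are hypothetical names of mine; the rule text itself is copied from the messageRule above.

```python
def build_message_rule(max_events: int = 10) -> str:
    """Build the messageRule text with the event helper repeated max_events times."""
    # Raw strings keep the backslashes of the grok rule intact.
    prefix = r"messageRule %%%\s+"
    suffix = r"_Events emitted %{source} seen at %{at} since %{since}\s+%{data}%%%"
    return prefix + "%{event}" * max_events + suffix


rule = build_message_rule(10)
print(rule)
```

Since the event helper is wrapped in an optional group, messages with fewer events should still match; the repetition count is only an upper bound on how many events get extracted.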

To check that the grok parser works, add event samples. The sample provided by default probably contains only a single event reason. Add an example like the one below and check that the parsing rule validates and that extraction succeeds:

%%% 
1 **Created**: Created container xxxx
 1 **Created**: Created container yyyy
 1 **Pulled**: Successfully pulled image "xxxx:latest" in 436ms (436ms including waiting)
 1 **Pulled**: Successfully pulled image "yyyy:latest" in 210ms (210ms including waiting)
 1 **Pulling**: Pulling image "xxxx:latest"
 1 **Pulling**: Pulling image "yyyy:latest"
 1 **Scheduled**: Successfully assigned zzzz/aaaa to bbbb
 1 **Started**: Started container xxxx
 1 **Started**: Started container yyyy
 
 _Events emitted by the default-scheduler seen at 2024-09-27 01:17:51 +0000 UTC since 2024-09-27 01:17:51 +0000 UTC_ 

 %%%
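To preview what the repeated event helper should extract from a sample like this, a rough regular-expression counterpart can be run locally. This is only a sketch: EVENT_RE is my approximation of the event helper and may not behave exactly like Datadog's grok engine, and parse_events is a hypothetical name. Unlike the grok rule, it is not capped at ten events.

```python
import re

# Approximation of the event helper:
#   %{integer:count} \*\*%{word:reason}\*\*\: %{data:origin_message}\n\s
EVENT_RE = re.compile(
    r"(?P<count>\d+) \*\*(?P<reason>\w+)\*\*: (?P<origin_message>.*?)\n\s"
)


def parse_events(message):
    """Collect count, reason and origin_message as parallel lists."""
    parsed = {"count": [], "reason": [], "origin_message": []}
    for match in EVENT_RE.finditer(message):
        parsed["count"].append(int(match["count"]))
        parsed["reason"].append(match["reason"])
        parsed["origin_message"].append(match["origin_message"])
    return parsed


EVENT_LINES = [
    "1 **Created**: Created container xxxx",
    "1 **Created**: Created container yyyy",
    '1 **Pulled**: Successfully pulled image "xxxx:latest" in 436ms (436ms including waiting)',
    '1 **Pulled**: Successfully pulled image "yyyy:latest" in 210ms (210ms including waiting)',
    '1 **Pulling**: Pulling image "xxxx:latest"',
    '1 **Pulling**: Pulling image "yyyy:latest"',
    "1 **Scheduled**: Successfully assigned zzzz/aaaa to bbbb",
    "1 **Started**: Started container xxxx",
    "1 **Started**: Started container yyyy",
]

# Rebuild the sample with the same whitespace layout: every event line ends
# with a newline followed by one space.
SAMPLE = (
    "%%% \n"
    + "\n ".join(EVENT_LINES)
    + "\n \n _Events emitted by the default-scheduler seen at 2024-09-27 01:17:51 +0000 UTC"
    + " since 2024-09-27 01:17:51 +0000 UTC_ \n\n %%%"
)

result = parse_events(SAMPLE)
print(result["reason"])
```

The reason list that comes out has the same array shape as the parsed attribute that the explorer query matches against.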

That was the grok parser for message. The title also has information worth parsing (Extract from: title), namely details about the involved object:

// Helper Rules:

kind %{word:object_kind}
namespace %{notSpace:object_namespace}/
name %{notSpace:object_name}

// Define parsing rules:
titleRule Events from the %{kind} (%{namespace})?%{name}

Add event samples both with and without a namespace to test it:

// sample 1; cluster scope
Events from the Node aaaa

// sample 2; namespaced
Events from the Pod bbbb/cccc
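The optional namespace group is the part most likely to go wrong, so it can help to try the same shape as a plain regular expression first. The sketch below mirrors titleRule in Python; TITLE_RE and parse_title are hypothetical names, and the regex is an approximation of the grok rule rather than what Datadog actually runs.

```python
import re

# Approximation of: Events from the %{kind} (%{namespace})?%{name}
# where namespace is "%{notSpace:object_namespace}/".
TITLE_RE = re.compile(
    r"Events from the (?P<object_kind>\w+) "
    r"(?:(?P<object_namespace>\S+)/)?"
    r"(?P<object_name>\S+)"
)


def parse_title(title):
    """Return the involved object fields, omitting the namespace when absent."""
    match = TITLE_RE.fullmatch(title)
    if match is None:
        return None
    return {key: value for key, value in match.groupdict().items() if value is not None}


cluster_scoped = parse_title("Events from the Node aaaa")
namespaced = parse_title("Events from the Pod bbbb/cccc")
print(cluster_scoped)
print(namespaced)
```

For a cluster-scoped object the namespace key is simply missing from the result, which is also what the optional group in the grok rule is meant to allow.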

remapper

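The reason for all the parsing above is that I want to search by the parsed fields of the Kubernetes events. What the grok parser extracts becomes a Datadog event attribute, and I could not search on those in the explorer. So I map them to tags.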

The message rule also parses count and origin_message (named that way to distinguish it from message) for each event, but for now I only wanted to search by reason, so that is the only one I remapped to a tag.

In the pipeline, add a Remapper via + Add Processor and configure it to remap the attribute to a tag. I did the same for each of object_(kind|namespace|name) parsed from the title.

Uses?

Now each event can be queried by the reason and involved_object_(kind|namespace|name) tags (I prefixed the grok attribute names with involved_ when remapping them to tags). They can also be added as facets on the left.

For example, filtering for events whose reason starts with create (source:kubernetes reason:create*) shows that the only involved object kinds are pod and order.

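Events can also be used as a query source for widgets in dashboards and notebooks. The following are less real-world use cases than things I squeezed out (??) because the events were being collected anyway: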

Top list of reasons for warnings (= status other than info)

Count of events whose reason starts with fail, by involved object kind

Pods in backoff