Avro

This note is organized mainly around the following two links: the Avro documentation and part of Chapter 4 of DDIA:

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

https://avro.apache.org/docs/

Designing Data-Intensive Applications

Chapter 4. Encoding and Evolution Everything changes and nothing stands still. Heraclitus of Ephesus, as quoted by Plato in Cratylus (360 BCE) Applications inevitably change over time. Features are added … - Selection from Designing Data-Intensive Applications [Book]

https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/ch04.html

TMI: Background


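Avro is a data serialization system.

It reportedly came out of Hadoop because Thrift was not a good fit for Hadoop's data transfer needs, and it works quite differently from Thrift. It makes schema evolution possible by relying on the "exact" schema the data was written with, as described later.

It is also one of the formats supported by Confluent Schema Registry.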

Python example

Getting Started (Python)

This is a short guide for getting started with Apache Avro™ using Python. This guide only covers using Avro for data serialization; see Patrick Hunt’s Avro RPC Quick Start for a good introduction to using Avro for RPC.

https://avro.apache.org/docs/1.11.1/getting-started-python/

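I put the code below together based on the guide above and the example data from DDIA. It is a very simple example for seeing what Avro is and what it looks like.

First, install the avro package (assuming python3):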

$ pip install avro==1.11.1

Then create a JSON file with the .avsc extension (person.avsc here):

{
  "namespace": "example.avro",
  "type": "record",
  "name": "Person",
  "fields": [
    {
      "name": "userName",
      "type": "string"
    },
    {
      "name": "favoriteNumber",
      "type": [
        "long",
        "null"
      ],
      "default": null
    },
    {
      "name": "interests",
      "type": {
        "type": "array",
        "items": "string"
      }
    }
  ]
}

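This defines the fields of the Person record. favoriteNumber may also be null, and interests is an array of strings. Note that the Avro 1.11.1 specification says the default of a union field must match the first branch of the union, so ["null", "long"] is the conforming order when the default is null. The Python package accepted the order shown here, and the dump later in this post was produced with it.

Next, create the following example file (avro_example.py):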

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Parse the writer's schema from the .avsc file created above.
schema = avro.schema.parse(open("person.avsc", "rb").read())

# Write two records to an Avro object container file.
writer = DataFileWriter(open("people.avro", "wb"), DatumWriter(), schema)
writer.append(
    {
        "userName": "Martin",
        "favoriteNumber": 1337,
        "interests": ["daydreaming", "hacking"],
    }
)
writer.append(
    {
        # favoriteNumber is omitted on purpose; it is read back as None.
        "userName": "flavono123",
        "interests": ["kafka", "confluent", "schema registry", "avro"],
    }
)
writer.close()

# No schema is passed when reading: it is taken from the file header.
reader = DataFileReader(open("people.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()

Using the avro package API, this writes Avro-formatted data to a file named people.avro and reads it back. It uses the JSON schema written above, person.avsc.

The second record has no favoriteNumber. Running the script prints the following:

❯ python avro_example.py 
{'userName': 'Martin', 'favoriteNumber': 1337, 'interests': ['daydreaming', 'hacking']}
{'userName': 'flavono123', 'favoriteNumber': None, 'interests': ['kafka', 'confluent', 'schema registry', 'avro']}

IDL

IDL Language

This document defines Avro IDL, a higher-level language for authoring Avro schemata. Before reading this document, you should have familiarity with the concepts of schemata and protocols, as well as the various primitive and complex types available in Avro.

https://avro.apache.org/docs/1.11.1/idl-language/

Avro provides an IDL (Interface Description Language) so that schemas can be defined more simply than in JSON (avsc). The person.avsc above could be written roughly like this (person.avdl):

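// import schema "person.avsc"
// Avro 1.11.1 IDL expects records inside a protocol; the protocol name is a placeholder.
@namespace("example.avro")
protocol PersonProtocol {
  record Person {
    string userName;
    union {null, long} favoriteNumber = null;
    array<string> interests;
  }
}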

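The tools and packages that consume or generate avdl files all seem to be on the Java side. I have not run this; I only confirmed that a schema can be expressed more concisely than in JSON.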

people.avro?

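The people.avro file created by the Python code above is the data actually encoded with the Avro schema. If we think of the file as a Kafka broker, writing corresponds to produce and reading to consume. So what is written in it, and how?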

$ xxd people.avro 
00000000: 4f62 6a01 0414 6176 726f 2e63 6f64 6563  Obj...avro.codec
00000010: 086e 756c 6c16 6176 726f 2e73 6368 656d  .null.avro.schem
00000020: 6182 047b 2274 7970 6522 3a20 2272 6563  a..{"type": "rec
00000030: 6f72 6422 2c20 226e 616d 6522 3a20 2250  ord", "name": "P
00000040: 6572 736f 6e22 2c20 226e 616d 6573 7061  erson", "namespa
00000050: 6365 223a 2022 6578 616d 706c 652e 6176  ce": "example.av
00000060: 726f 222c 2022 6669 656c 6473 223a 205b  ro", "fields": [
00000070: 7b22 7479 7065 223a 2022 7374 7269 6e67  {"type": "string
00000080: 222c 2022 6e61 6d65 223a 2022 7573 6572  ", "name": "user
00000090: 4e61 6d65 227d 2c20 7b22 7479 7065 223a  Name"}, {"type":
000000a0: 205b 226c 6f6e 6722 2c20 226e 756c 6c22   ["long", "null"
000000b0: 5d2c 2022 6e61 6d65 223a 2022 6661 766f  ], "name": "favo
000000c0: 7269 7465 4e75 6d62 6572 222c 2022 6465  riteNumber", "de
000000d0: 6661 756c 7422 3a20 6e75 6c6c 7d2c 207b  fault": null}, {
000000e0: 2274 7970 6522 3a20 7b22 7479 7065 223a  "type": {"type":
000000f0: 2022 6172 7261 7922 2c20 2269 7465 6d73   "array", "items
00000100: 223a 2022 7374 7269 6e67 227d 2c20 226e  ": "string"}, "n
00000110: 616d 6522 3a20 2269 6e74 6572 6573 7473  ame": "interests
00000120: 227d 5d7d 00c0 622e 8d2a a232 f456 14ed  "}]}..b..*.2.V..
00000130: e0d9 fa4a f404 a601 0c4d 6172 7469 6e00  ...J.....Martin.
00000140: f214 0416 6461 7964 7265 616d 696e 670e  ....daydreaming.
00000150: 6861 636b 696e 6700 1466 6c61 766f 6e6f  hacking..flavono
00000160: 3132 3302 080a 6b61 666b 6112 636f 6e66  123...kafka.conf
00000170: 6c75 656e 741e 7363 6865 6d61 2072 6567  luent.schema reg
00000180: 6973 7472 7908 6176 726f 00c0 622e 8d2a  istry.avro..b..*
00000190: a232 f456 14ed e0d9 fa4a f4              .2.V.....J.

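This is a dump of the binary file people.avro. It starts with the magic bytes Obj\x01 that mark an Avro object container file, followed by file metadata: the codec (null) and a schema much like the avsc file, which runs up to offset 0x123. That is why reading worked without supplying a separate schema file: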

reader = DataFileReader(open("people.avro", "rb"), DatumReader())

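After that, following a 16-byte sync marker and a short block header, the data portion is written compactly. Strings are recognizable even in the dump. Now let's understand it using the figure from the DDIA book (I deliberately made the first person's data the same as the book's):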

Designing Data-Intensive Applications

https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/ch04.html

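This corresponds to the data portion near the end of the dump above. Broadly, it is a repeating pattern of length and value for each field: strings carry a length prefix, and numbers are variable-length integers. The data portion contains no field names or types, because the bytes are interpreted in order by looking at the schema. That is why reading Avro requires the "exact" schema the data was written with.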
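To make the layout concrete, here is a minimal sketch that decodes the first record by hand using only the standard library. The bytes are copied from offsets 0x138 to 0x157 of the dump above; the helper names (read_long, read_string, read_string_array, read_person) are made up for this illustration and are not part of the avro package.

```python
import io


def read_long(buf):
    """Decode one variable-length, zigzag-encoded integer."""
    shift = 0
    result = 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo zigzag


def read_string(buf):
    """A string is a length (long) followed by that many UTF-8 bytes."""
    length = read_long(buf)
    return buf.read(length).decode("utf-8")


def read_string_array(buf):
    """An array is a series of blocks: item count, items, ended by a 0 count."""
    items = []
    while True:
        count = read_long(buf)
        if count == 0:
            return items
        if count < 0:  # a negative count is followed by the block size in bytes
            count = -count
            read_long(buf)
        for _ in range(count):
            items.append(read_string(buf))


def read_person(buf):
    """Read fields in the order of the Person schema; no names are in the data."""
    user_name = read_string(buf)
    branch = read_long(buf)  # union index: 0 = long, 1 = null in this schema
    favorite_number = read_long(buf) if branch == 0 else None
    interests = read_string_array(buf)
    return {
        "userName": user_name,
        "favoriteNumber": favorite_number,
        "interests": interests,
    }


# Bytes of the first record, copied from the xxd dump above.
FIRST_RECORD = bytes.fromhex(
    "0c4d617274696e" "00f214" "0416646179647265616d696e67" "0e6861636b696e67" "00"
)
person = read_person(io.BytesIO(FIRST_RECORD))
print(person)
```

Without the schema there would be no way to tell that 0c is a string length rather than a number, which is the point of the paragraph above.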

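Writer's schema and reader's schema

In the example above, the writer's and reader's schemas used for encoding and decoding were the same. But a schema is needed for each of writing and reading (they just happened to be identical), and as long as the two are compatible, the same data can be written and read:

The writer's schema on the left and the reader's schema on the right differ but can be compatible.

First, field order does not matter. The data is laid out in the writer's field order, and the reader matches fields by name, so it can still read them even if its own schema lists them in a different order.

If a field such as photoURL is in the writer's schema (so it was encoded) but not in the reader's schema, it is ignored (skipped rather than decoded).

If a field such as userID is in the reader's schema but was not in the writer's schema, it is filled in with the default value declared in the reader's schema.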
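The three rules above can be sketched in a few lines of plain Python. This is only an illustration of the resolution idea, not how the avro package implements it; resolve_record is a hypothetical helper, the field names loosely follow the DDIA figure, and the values are invented.

```python
def resolve_record(writer_record, reader_fields):
    """Project a record decoded with the writer's schema onto the reader's fields."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            # Matched by name, so the field order in either schema is irrelevant.
            resolved[name] = writer_record[name]
        elif "default" in field:
            # Only in the reader's schema: fill in the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields only in the writer's schema (photoURL here) are simply dropped.
    return resolved


# Hypothetical record as written with the writer's schema.
writer_record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "photoURL": "http://example.com/martin.png",
}

# Hypothetical reader's schema fields, in a different order, with userID added.
reader_fields = [
    {"name": "userID", "type": ["null", "long"], "default": None},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    {"name": "userName", "type": "string"},
]

resolved = resolve_record(writer_record, reader_fields)
print(resolved)
```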

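Schema evolution rules

For schema evolution in Avro, that is, forward and backward compatibility between schema versions, a few rules must be followed.

Forward compatibility means that a new version (e.g. v2) of the writer's schema is compatible with an old version (v1) of the reader's schema. Backward compatibility means that the v1 writer's schema is compatible with the v2 reader's schema.

To keep compatibility when evolving (upgrading the schema version), you may only add or remove fields that have a default value. Following the rules above, adding a field without a default breaks backward compatibility (a v2 reader has nothing to fill in for v1 data), and removing a field without a default breaks forward compatibility (a v1 reader has nothing to fill in for v2 data).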
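As a rough sketch of the add/remove rule, the check below only asks whether a reader can obtain every one of its fields, either from the writer or from a default. It deliberately ignores types and aliases, and the function names and schema versions are invented for illustration.

```python
def can_read(writer_fields, reader_fields):
    """True if every reader field exists in the writer's schema or has a default."""
    writer_names = {field["name"] for field in writer_fields}
    return all(
        field["name"] in writer_names or "default" in field
        for field in reader_fields
    )


def compatibility(old_fields, new_fields):
    return {
        "backward": can_read(old_fields, new_fields),  # old writer, new reader
        "forward": can_read(new_fields, old_fields),   # new writer, old reader
    }


v1 = [{"name": "userName", "type": "string"}]
v2_with_default = v1 + [
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None}
]
v2_without_default = v1 + [{"name": "favoriteNumber", "type": "long"}]

# Adding a field that has a default keeps both directions working.
added_with_default = compatibility(v1, v2_with_default)
# Adding a field without a default breaks backward compatibility.
added_without_default = compatibility(v1, v2_without_default)
# Removing a field without a default breaks forward compatibility.
removed_without_default = compatibility(v2_without_default, v1)

print(added_with_default, added_without_default, removed_without_default)
```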

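Avro expresses nullable a little unusually: as a union of null with other types, conventionally listing null first so that null can be the default (e.g. union {null, long, string}).

Field types can be changed only between supported pairs. Data written as int can be read as long (a widening conversion). The reverse, narrowing long to int, is not among the promotions listed in the specification's schema resolution rules, so it should not be relied on (I have not tried this myself. Maybe Confluent Schema Registry handles it cleanly internally...?). Alternatively, you can go int → union {int, long} → long, keeping backward compatibility between adjacent versions and changing the type safely and slowly. See the documentation for details on types: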

Specification

This document defines Apache Avro. It is intended to be the authoritative specification. Implementations of Avro must adhere to this document.

https://avro.apache.org/docs/1.11.1/specification/
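The promotions the specification lists for primitive types can be written down as a small table. The sketch below is a conservative static check under that table, not the avro package's resolver: can_resolve is a hypothetical helper, unions are modeled as Python lists, and a writer union is accepted only if every branch is readable.

```python
# Writer type -> reader types it may be promoted to (Avro 1.11.1 schema resolution).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def can_resolve(writer_type, reader_type):
    """Check whether data written as writer_type can be read as reader_type."""
    if isinstance(writer_type, list):
        # Writer union: any branch may appear in the data, so require all of them.
        return all(can_resolve(branch, reader_type) for branch in writer_type)
    if isinstance(reader_type, list):
        # Reader union: one matching branch is enough.
        return any(can_resolve(writer_type, branch) for branch in reader_type)
    return writer_type == reader_type or reader_type in PROMOTIONS.get(
        writer_type, set()
    )


widening = can_resolve("int", "long")           # old int data, new long reader
narrowing = can_resolve("long", "int")          # long data, int reader
step_one = can_resolve("int", ["int", "long"])  # int -> union {int, long}
step_two = can_resolve(["int", "long"], "long") # union {int, long} -> long

print(widening, narrowing, step_one, step_two)
```

Each step of the gradual path is backward compatible on its own, while the direct long to int change is rejected.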

A field can be renamed by declaring the previous name as an alias in the reader's schema. This, too, preserves only backward compatibility:

Specification

This document defines Apache Avro. It is intended to be the authoritative specification. Implementations of Avro must adhere to this document.

https://avro.apache.org/docs/1.11.1/specification/#aliases
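A small sketch of why aliases help in one direction only. The helper resolve_with_aliases and the new field name loginName are hypothetical, chosen just for this illustration; the logic mimics the idea of alias matching rather than the avro package's implementation.

```python
def resolve_with_aliases(writer_record, reader_fields):
    """Match each reader field by its name first, then by any of its aliases."""
    resolved = {}
    for field in reader_fields:
        candidates = [field["name"], *field.get("aliases", [])]
        for name in candidates:
            if name in writer_record:
                resolved[field["name"]] = writer_record[name]
                break
        else:
            if "default" not in field:
                raise ValueError(f"cannot resolve field {field['name']!r}")
            resolved[field["name"]] = field["default"]
    return resolved


# v1 used userName; v2 renames it and keeps the old name as an alias.
v1_fields = [{"name": "userName", "type": "string"}]
v2_fields = [{"name": "loginName", "type": "string", "aliases": ["userName"]}]

# Backward: data written with v1, read with v2. The alias finds the old name.
backward = resolve_with_aliases({"userName": "Martin"}, v2_fields)

# Forward: data written with v2, read with v1. The old reader knows no alias.
try:
    resolve_with_aliases({"loginName": "Martin"}, v1_fields)
    forward_ok = True
except ValueError:
    forward_ok = False

print(backward, forward_ok)
```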

Lessons learned

Before studying Confluent Schema Registry, I looked at the principle by which Avro makes schema evolution possible:

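• Avro needs the "exact" schema for each of encoding and decoding.

• Schema evolution rules

◦ Forward compatibility: new-version writer's schema, old-version reader's schema

◦ Backward compatibility: old-version writer's schema, new-version reader's schema

◦ Add or remove only fields that have a default value, to keep backward and forward compatibility respectively.

◦ Type changes and renames can preserve only backward compatibility.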