Data Ingestion (2)

Data Preparation

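Each ExampleGen component lets you configure an input_config and an output_config for the dataset. To ingest a dataset incrementally, you can define spans in the input configuration. You can also configure how the data is split. Training datasets are often generated together with evaluation and test datasets, and this kind of preprocessing can be defined in the output configuration.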

Splitting Datasets

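Later in the pipeline, we want to evaluate the machine learning model during training and test it in the model analysis step. It is therefore useful to split the dataset into the required subsets up front.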

Splitting a Single Dataset into Subsets

Here the data is split into training, evaluation, and test sets at a ratio of 6:2:2. The ratio is defined with hash_buckets.

import os
from pathlib import Path

from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2

context = InteractiveContext()

dir_path = Path().parent.absolute()
data_dir = os.path.join(dir_path, "..", "..", "data", "processed")
output = example_gen_pb2.Output(
    # Define the desired splits.
    split_config=example_gen_pb2.SplitConfig(splits=[
        # Specify the ratio of each split via hash_buckets.
        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=6),
        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name='test', hash_buckets=2)
    ]))

# Pass the output configuration through the output_config argument.
example_gen = CsvExampleGen(input_base=data_dir, output_config=output)
context.run(example_gen)

If no output configuration is specified, the ExampleGen component splits the dataset into training and evaluation sets at a default ratio of 2:1.
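
To make the role of hash_buckets more concrete, here is a minimal sketch of hash-based partitioning in plain Python. It is not TFX's implementation: the function name assign_split, the SPLITS list, and the use of MD5 on a record key are all assumptions made for illustration. The idea is only that each record is hashed into one of 6 + 2 + 2 = 10 buckets, and each split owns a contiguous range of those buckets.

```python
import hashlib
from collections import Counter

# Hypothetical split specification mirroring the 6:2:2 configuration above.
SPLITS = [("train", 6), ("eval", 2), ("test", 2)]


def assign_split(record_key: str, splits=SPLITS) -> str:
    """Map a record to a split name based on a stable hash of its key."""
    total_buckets = sum(buckets for _, buckets in splits)
    if total_buckets <= 0:
        raise ValueError("splits must define at least one bucket")
    # A stable hash (unlike Python's built-in hash()) yields the same bucket on every run.
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total_buckets
    # Walk the cumulative bucket ranges: train owns 0-5, eval 6-7, test 8-9.
    upper = 0
    for name, buckets in splits:
        upper += buckets
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always below total_buckets")


# Count how 10,000 synthetic record keys are distributed over the splits.
counts = Counter(assign_split(f"record-{i}") for i in range(10_000))
print(counts)
```

Because the assignment depends only on the record key, the same record lands in the same split on every run, and the split sizes approach the configured ratio as the dataset grows. Passing [("train", 2), ("eval", 1)] as splits would mimic the default 2:1 behavior described above.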

Preserving Existing Splits

Sometimes the dataset has already been divided into subsets, and we want to keep those splits when ingesting it. Defining an input configuration preserves them.

import os
from pathlib import Path

from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2

from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()

dir_path = Path().parent.absolute()
data_dir = os.path.join(dir_path, "..", "..", "data", "tfrecord")

tfrecord_filename = "consumer-complaints.tfrecord"
tfrecord_filepath = os.path.join(data_dir, tfrecord_filename)

# Map each split to its existing subdirectory.
# (Named input_config rather than input to avoid shadowing the Python built-in.)
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(name='train', pattern='train/*'),
    example_gen_pb2.Input.Split(name='eval', pattern='eval/*'),
    example_gen_pb2.Input.Split(name='test', pattern='test/*')
])

# Pass the input configuration through the input_config argument.
example_gen = CsvExampleGen(input_base=data_dir, input_config=input_config)
context.run(example_gen)

After defining the input configuration, pass it to the ExampleGen component through the input_config argument.
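
As a rough illustration of what the split patterns express, the sketch below groups file paths (relative to input_base) by the first pattern they match. This is not how TFX resolves patterns internally; group_files_by_split, SPLIT_PATTERNS, and the example file names are hypothetical, and the matching relies on Python's fnmatch module rather than on any TFX code.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns mirroring the input configuration above.
SPLIT_PATTERNS = {"train": "train/*", "eval": "eval/*", "test": "test/*"}


def group_files_by_split(relative_paths, split_patterns=SPLIT_PATTERNS):
    """Group paths (relative to input_base) by the first pattern they match."""
    grouped = {name: [] for name in split_patterns}
    for path in relative_paths:
        for name, pattern in split_patterns.items():
            if fnmatchcase(path, pattern):
                grouped[name].append(path)
                break  # a file belongs to at most one split
    return grouped


# Made-up file listing; README.md matches no pattern and is ignored.
files = [
    "train/part-0.csv",
    "train/part-1.csv",
    "eval/part-0.csv",
    "test/part-0.csv",
    "README.md",
]
grouped = group_files_by_split(files)
print(grouped)
```

Files that match none of the patterns are simply left out, which is why the subdirectory layout has to line up with the patterns in the input configuration.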

Spanning Datasets

A major benefit of machine learning pipelines is that models can be updated when new data becomes available. The ExampleGen component supports this through spans.

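A span may replicate existing data records, so a newer span can contain records that already appeared in an earlier one. You can specify a pattern in the input configuration with the {SPAN} placeholder, which stands for the number (0, 1, 2, ...) in the folder names, for example export-0, export-1, export-2.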

import os

from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()

base_dir = os.getcwd()
data_dir = os.path.join(os.pardir, "data")

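# {SPAN} is replaced by the span number found in the folder name.
# (Named input_config rather than input to avoid shadowing the Python built-in.)
input_config = example_gen_pb2.Input(splits=[
    example_gen_pb2.Input.Split(pattern='export-{SPAN}/*')
])

example_gen = CsvExampleGen(input_base=os.path.join(base_dir, data_dir), input_config=input_config)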
context.run(example_gen)
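
The sketch below shows one way a {SPAN} pattern could be resolved to a span number, assuming that the most recent (highest-numbered) span is the one of interest. It is a standalone illustration, not TFX code: span_pattern_to_regex, latest_span, and the example paths are hypothetical names introduced here.

```python
import re


def span_pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a pattern such as 'export-{SPAN}/*' into a regex capturing the span number."""
    if "{SPAN}" not in pattern:
        raise ValueError("pattern must contain the {SPAN} placeholder")
    prefix, _, suffix = pattern.partition("{SPAN}")
    # '*' in the suffix matches anything; every other character is matched literally.
    suffix_regex = ".*".join(re.escape(part) for part in suffix.split("*"))
    return re.compile(f"^{re.escape(prefix)}(\\d+){suffix_regex}$")


def latest_span(relative_paths, pattern="export-{SPAN}/*"):
    """Return the highest span number among the matching paths, or None if none match."""
    regex = span_pattern_to_regex(pattern)
    spans = [int(m.group(1)) for p in relative_paths if (m := regex.match(p))]
    return max(spans) if spans else None


# Made-up directory listing relative to input_base.
paths = [
    "export-0/data.csv",
    "export-1/data.csv",
    "export-10/data.csv",
    "other/data.csv",
]
print(latest_span(paths))
```

Span numbers are compared as integers, so export-10 is treated as newer than export-1 even though it would sort earlier than export-2 as a string.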

Versioning Datasets

In a machine learning pipeline, versioning helps when you want to track the produced model together with the dataset that was used to train it.

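The file names and paths of the ingested data are stored in the ML MetadataStore, and more meta-information about the raw dataset, such as a hash of the ingested data, can be tracked as well. This kind of version tracking lets you verify that the dataset used during training is the same as the dataset at a later point in time. This capability is essential for end-to-end ML reproducibility.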
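
To illustrate the idea of checking a dataset against a recorded hash, here is a minimal sketch using only the Python standard library. It does not use ML Metadata or any TFX API; file_sha256, dataset_fingerprint, and the temporary CSV file are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_fingerprint(data_dir) -> dict:
    """Map each file under data_dir (by relative POSIX path) to its SHA-256 digest."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


# Demo on a throwaway directory: record a fingerprint, then detect a later change.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    csv_path.write_text("id,label\n1,0\n")
    at_training_time = dataset_fingerprint(tmp)
    unchanged = dataset_fingerprint(tmp) == at_training_time
    csv_path.write_text("id,label\n1,1\n")  # simulate the data changing after training
    changed = dataset_fingerprint(tmp) != at_training_time

print(unchanged, changed)
```

Storing such a fingerprint next to the trained model would make it possible to tell later whether the data on disk still matches what the model was trained on.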

The following tools can be used to version datasets.

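DVC: DVC (http://dvc.org/) is an open source version control system for machine learning projects. It lets you commit a hash of the dataset instead of the entire dataset itself. The state of the dataset is therefore tracked through git or a similar tool, while the repository is not loaded with the full dataset.
Pachyderm: Pachyderm (https://www.pachyderm.com/) is an open source machine learning platform that runs on Kubernetes. It started from the idea of Git for data, but has since grown into a full data platform that includes pipeline orchestration based on data versions.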

This post is based on what I read and studied in the book below.
https://www.oreilly.com/library/view/building-machine-learning/9781492053187/
