Amazon CloudWatch - Fully Managed Monitoring and Observability Service

Monitoring serverless applications with CloudWatch Logs, Metrics, and Alarms

Posted Dec 13, 2024

By kimmin1kk

21 min read

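What Is Amazon CloudWatch?

Amazon CloudWatch is a fully managed service that monitors AWS resources and applications in near real time.

It collects and visualizes logs, metrics, and events, and can trigger automated actions to keep systems healthy.

Why Use CloudWatch?

Problems with the Traditional Approach

A traditional self-hosted monitoring setup: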

┌─────────────────────────────────────────┐
│  Self-Hosted Monitoring Stack           │
│                                         │
│  ├─ ELK Stack (logs)                    │
│  │  ├─ Elasticsearch (3 nodes)          │
│  │  ├─ Logstash                         │
│  │  └─ Kibana                           │
│  │  Monthly cost: ~$300                 │
│  │                                      │
│  ├─ Prometheus + Grafana (metrics)      │
│  │  ├─ Prometheus Server                │
│  │  ├─ Grafana Dashboard                │
│  │  └─ AlertManager                     │
│  │  Monthly cost: ~$150                 │
│  │                                      │
│  └─ Operational burden                  │
│     ├─ Infrastructure maintenance       │
│     ├─ Version upgrades                 │
│     ├─ Backup & recovery                │
│     └─ High-availability setup          │
│                                         │
│  Total monthly cost: ~$450              │
└─────────────────────────────────────────┘

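Problems:

High fixed cost: you pay for the infrastructure regardless of usage
Operational burden: the monitoring system itself has to be monitored
Limited scalability: scaling is manual as log and metric volume grows
Complex integration: each service needs its own agent installed
Single point of failure: logs are lost if the Elasticsearch cluster goes down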

How CloudWatch Solves This

┌─────────────────────────────────────────┐
│  Amazon CloudWatch (fully managed)      │
│                                         │
│  ├─ CloudWatch Logs                     │
│  │  ├─ Virtually unlimited scale        │
│  │  ├─ Automatic retention/expiry       │
│  │  └─ Logs Insights (queries)          │
│  │                                      │
│  ├─ CloudWatch Metrics                  │
│  │  ├─ Standard metrics (free)          │
│  │  ├─ Custom metrics                   │
│  │  └─ Dashboards                       │
│  │                                      │
│  ├─ CloudWatch Alarms                   │
│  │  ├─ Metric-based alarms              │
│  │  └─ SNS/Lambda integration           │
│  │                                      │
│  └─ Pay-as-you-go pricing               │
│     - Logs: $0.50/GB ingested           │
│     - Custom metrics: $0.30/metric/mo   │
│                                         │
│  Monthly cost: ~$20-30 (10 GB of logs)  │
└─────────────────────────────────────────┘

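Advantages:

Fully managed: no infrastructure to operate
Elastic scaling: capacity grows automatically with volume
AWS service integration: native support for Lambda, ECS, EC2, and more
Pay-as-you-go pricing: pay only for what you use
High availability: built on the durability of AWS infrastructure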

Core CloudWatch Components

1. CloudWatch Logs

Collects, stores, and analyzes log data.

Key concepts:

Log Group: a logical group of log streams (e.g., /aws/lambda/order-processor)
Log Stream: a sequence of log events from the same source
Log Event: a timestamp plus a message

2. CloudWatch Metrics

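A metric is a set of time-series data points.

Metric types:

Standard Metrics: generated automatically by AWS services (free)
Custom Metrics: published by your application (paid)

Dimensions: name-value pairs that identify and filter a metric (e.g., FunctionName=order-processor)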
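
One consequence of dimensions is easy to miss: CloudWatch treats every unique combination of namespace, metric name, and dimensions as a separate metric, and it does not aggregate across combinations. The sketch below models that identity as a string key. metricKey is a hypothetical helper written for illustration, not part of the AWS SDK.

```javascript
// Hypothetical helper: builds a canonical key for a metric from its namespace,
// name, and dimensions. Dimension order does not matter, so the pairs are sorted.
function metricKey(namespace, metricName, dimensions = []) {
  const dims = dimensions
    .map(({ Name, Value }) => `${Name}=${Value}`)
    .sort()
    .join(',');
  return `${namespace}/${metricName}{${dims}}`;
}

const byFunction = metricKey('AWS/Lambda', 'Errors', [
  { Name: 'FunctionName', Value: 'order-processor' }
]);
const byFunctionAndResource = metricKey('AWS/Lambda', 'Errors', [
  { Name: 'Resource', Value: 'order-processor:prod' },
  { Name: 'FunctionName', Value: 'order-processor' }
]);

console.log(byFunction);            // AWS/Lambda/Errors{FunctionName=order-processor}
console.log(byFunctionAndResource); // AWS/Lambda/Errors{FunctionName=order-processor,Resource=order-processor:prod}
console.log(byFunction === byFunctionAndResource); // false: two separate metrics
```

An alarm or dashboard widget has to name exactly one of these combinations, so a query with a subset of the published dimensions returns no data.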

3. CloudWatch Alarms

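Sends a notification or runs an action when a metric breaches a threshold.

Alarm states:

OK: the metric is within the threshold
ALARM: the metric has breached the threshold
INSUFFICIENT_DATA: not enough data to determine the state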
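
To make the three states concrete, here is a deliberately simplified model of a "greater than threshold" alarm. It is a sketch for intuition only: real CloudWatch evaluation has more rules (for example "M out of N" datapoints and a wider evaluation range), and evaluateAlarm and its option names are made up for this example.

```javascript
// Simplified model of a GreaterThanThreshold alarm. `datapoints` holds one value
// per period, oldest first; null marks a period with no data.
function evaluateAlarm(datapoints, { threshold, evaluationPeriods, treatMissingData = 'missing' }) {
  const recent = datapoints.slice(-evaluationPeriods);
  if (recent.length < evaluationPeriods) return 'INSUFFICIENT_DATA';

  // Map each period to true (breaching), false (not breaching), or null (missing)
  const verdicts = recent.map((value) => {
    if (value === null || value === undefined) {
      if (treatMissingData === 'breaching') return true;
      if (treatMissingData === 'notBreaching') return false;
      return null;
    }
    return value > threshold;
  });

  // In this model, missing periods are ignored; with nothing left there is no verdict
  const known = verdicts.filter((v) => v !== null);
  if (known.length === 0) return 'INSUFFICIENT_DATA';

  // ALARM only when every evaluated period breaches the threshold
  return known.every(Boolean) ? 'ALARM' : 'OK';
}

const options = { threshold: 30, evaluationPeriods: 2 };
console.log(evaluateAlarm([28, 31, 32], options));   // ALARM
console.log(evaluateAlarm([31, 29], options));       // OK
console.log(evaluateAlarm([null, null], options));   // INSUFFICIENT_DATA
console.log(evaluateAlarm([null, null], { ...options, treatMissingData: 'notBreaching' })); // OK
```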

4. CloudWatch Dashboards

Customizable dashboards for visualizing metrics and logs.

Real-World Project Examples

1. IoT Sensor Monitoring (iot-sensor-lab)

Architecture:

IoT Device (sensor)
    │
    ├─ Kinesis Data Streams
    │
    ↓
Lambda (process-sensor-data)
    │
    ├─ DynamoDB (data storage)
    │
    ├─ CloudWatch Metrics (custom metrics)
    │  ├─ Temperature
    │  ├─ Humidity
    │  └─ Dimensions: DeviceId, SensorType, Location
    │
    └─ CloudWatch Alarms
       ├─ Temperature threshold exceeded → SNS notification
       └─ Humidity threshold exceeded → SNS notification

Lambda handler (publishing custom metrics):

  
import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';

const cwClient = new CloudWatchClient();
const snsClient = new SNSClient();

const METRIC_NAMESPACE = 'IoTSensor/Metrics';
const SNS_TOPIC_ARN = process.env.SNS_TOPIC_ARN;

export const handler = async (event) => {
  console.log('Processing sensor data:', event.Records.length);

  for (const record of event.Records) {
    // Decode the Kinesis record (base64-encoded JSON)
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf-8');
    const sensorData = JSON.parse(payload);

    const { deviceId, sensorType, temperature, humidity, location, timestamp } = sensorData;

    // CloudWatch treats each distinct dimension combination as a separate metric and
    // does not aggregate across combinations. Each reading is therefore published
    // per device, per sensor type, and without dimensions, so the alarms and the
    // dashboard (which query by SensorType or with no dimensions) have data to match.
    const dimensionSets = [
      [
        { Name: 'DeviceId', Value: deviceId },
        { Name: 'SensorType', Value: sensorType },
        { Name: 'Location', Value: location }
      ],
      [{ Name: 'SensorType', Value: sensorType }],
      []
    ];

    // Publish the custom metrics to CloudWatch
    const metrics = dimensionSets.flatMap((Dimensions) => [
      {
        MetricName: 'Temperature',
        Value: temperature,
        Unit: 'None',
        Timestamp: new Date(timestamp),
        Dimensions
      },
      {
        MetricName: 'Humidity',
        Value: humidity,
        Unit: 'Percent',
        Timestamp: new Date(timestamp),
        Dimensions
      }
    ]);

    await cwClient.send(new PutMetricDataCommand({
      Namespace: METRIC_NAMESPACE,
      MetricData: metrics
    }));

    console.log(`Metrics published for device ${deviceId}`);

    // Threshold check (temperature > 30°C or humidity > 80%)
    if (temperature > 30 || humidity > 80) {
      await snsClient.send(new PublishCommand({
        TopicArn: SNS_TOPIC_ARN,
        Subject: `Sensor alert: ${deviceId}`,  // SNS subjects must be ASCII
        Message: `
Device: ${deviceId}
Location: ${location}
Temperature: ${temperature}°C
Humidity: ${humidity}%
Time: ${timestamp}

${temperature > 30 ? '⚠️ Temperature exceeded the 30°C threshold!' : ''}
${humidity > 80 ? '⚠️ Humidity exceeded the 80% threshold!' : ''}
        `.trim()
      }));

      console.log('Alert sent for threshold violation');
    }
  }

  return { statusCode: 200 };
};
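
The handler calls PutMetricData once per record, so a full Kinesis batch (batchSize is 100 in the configuration) means up to 100 API calls. One option is to collect the datums for the whole batch and send them in a few requests. The sketch below covers only the pure data-shaping part. The helper names are hypothetical, and the 1,000-datum cap per request is an assumption to check against the current PutMetricData quota.

```javascript
// Decode one Kinesis record (base64-encoded JSON) into a plain object.
function decodeKinesisRecord(record) {
  return JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf-8'));
}

// Turn one sensor reading into the two per-device metric datums used in this post.
function toMetricData({ deviceId, sensorType, temperature, humidity, location, timestamp }) {
  const base = {
    Timestamp: new Date(timestamp),
    Dimensions: [
      { Name: 'DeviceId', Value: deviceId },
      { Name: 'SensorType', Value: sensorType },
      { Name: 'Location', Value: location }
    ]
  };
  return [
    { ...base, MetricName: 'Temperature', Value: temperature, Unit: 'None' },
    { ...base, MetricName: 'Humidity', Value: humidity, Unit: 'Percent' }
  ];
}

// Split an array into chunks of at most `size` items.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Assumption: per-request limit on MetricData entries; verify before relying on it.
const MAX_DATUMS_PER_REQUEST = 1000;

// Build the MetricData arrays for a whole Kinesis event, one array per request.
function buildMetricBatches(event) {
  const datums = event.Records.map(decodeKinesisRecord).flatMap(toMetricData);
  return chunk(datums, MAX_DATUMS_PER_REQUEST);
}
```

Each array returned by buildMetricBatches would then be passed as MetricData to a single PutMetricDataCommand, replacing the per-record call inside the loop.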

Serverless Framework configuration (alarms):

  
# serverless.yml
resources:
  Resources:
    # Temperature alarm
    HighTemperatureAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ${self:service}-high-temperature-${sls:stage}
        AlarmDescription: Alert when temperature exceeds 30 degrees
        MetricName: Temperature
        Namespace: IoTSensor/Metrics
        Statistic: Average
        Period: 60  # 1 minute
        EvaluationPeriods: 2  # 2 consecutive periods
        Threshold: 30
        ComparisonOperator: GreaterThanThreshold
        TreatMissingData: notBreaching
        AlarmActions:
          - !Ref AlertTopic
        # Must exactly match a dimension set that the function publishes
        Dimensions:
          - Name: SensorType
            Value: DHT22

    # Humidity alarm
    # No Dimensions here, so it only sees datapoints published without dimensions
    HighHumidityAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ${self:service}-high-humidity-${sls:stage}
        AlarmDescription: Alert when humidity exceeds 80%
        MetricName: Humidity
        Namespace: IoTSensor/Metrics
        Statistic: Average
        Period: 60
        EvaluationPeriods: 2
        Threshold: 80
        ComparisonOperator: GreaterThanThreshold
        AlarmActions:
          - !Ref AlertTopic

    # SNS Topic
    AlertTopic:
      Type: AWS::SNS::Topic
      Properties:
        TopicName: ${self:service}-alerts-${sls:stage}
        Subscription:
          - Protocol: email
            Endpoint: admin@example.com

functions:
  processSensorData:
    handler: functions/process-sensor-data.handler
    events:
      - stream:
          type: kinesis
          arn: !GetAtt SensorDataStream.Arn
          batchSize: 100
          startingPosition: LATEST
    environment:
      SNS_TOPIC_ARN: !Ref AlertTopic
    iamRoleStatements:  # per-function statements need the serverless-iam-roles-per-function plugin
      - Effect: Allow
        Action:
          - cloudwatch:PutMetricData
        Resource: '*'
      - Effect: Allow
        Action:
          - sns:Publish
        Resource: !Ref AlertTopic

Creating a CloudWatch dashboard:

  
import { CloudWatchClient, PutDashboardCommand } from '@aws-sdk/client-cloudwatch';

const cwClient = new CloudWatchClient();

const dashboard = {
  widgets: [
    {
      type: 'metric',
      properties: {
        metrics: [
          ['IoTSensor/Metrics', 'Temperature', { stat: 'Average' }],
          ['.', '.', { stat: 'Maximum' }]
        ],
        period: 300,
        stat: 'Average',
        region: 'ap-northeast-2',
        title: 'Temperature Over Time',
        yAxis: { left: { label: 'Celsius' } }
      }
    },
    {
      type: 'metric',
      properties: {
        metrics: [
          ['IoTSensor/Metrics', 'Humidity', { stat: 'Average' }]
        ],
        period: 300,
        stat: 'Average',
        region: 'ap-northeast-2',
        title: 'Humidity Over Time',
        yAxis: { left: { label: 'Percent' } }
      }
    }
  ]
};

await cwClient.send(new PutDashboardCommand({
  DashboardName: 'IoTSensorMonitoring',
  DashboardBody: JSON.stringify(dashboard)
}));
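
The two widgets above differ only in metric name, statistics, title, and axis label, so the dashboard body can also be generated. A minimal sketch, assuming the same namespace and region as above. metricWidget is a hypothetical helper, and it spells out every series instead of using the '.' shorthand.

```javascript
// Hypothetical helper that builds one metric widget for the dashboard body.
function metricWidget({
  title,
  metricName,
  stats,
  yLabel,
  namespace = 'IoTSensor/Metrics',
  region = 'ap-northeast-2'
}) {
  return {
    type: 'metric',
    properties: {
      // One series per statistic, e.g. ['IoTSensor/Metrics', 'Temperature', { stat: 'Average' }]
      metrics: stats.map((stat) => [namespace, metricName, { stat }]),
      period: 300,
      stat: stats[0],
      region,
      title,
      yAxis: { left: { label: yLabel } }
    }
  };
}

const dashboardBody = JSON.stringify({
  widgets: [
    metricWidget({
      title: 'Temperature Over Time',
      metricName: 'Temperature',
      stats: ['Average', 'Maximum'],
      yLabel: 'Celsius'
    }),
    metricWidget({
      title: 'Humidity Over Time',
      metricName: 'Humidity',
      stats: ['Average'],
      yLabel: 'Percent'
    })
  ]
});
```

The resulting string can be passed as DashboardBody to PutDashboardCommand in place of JSON.stringify(dashboard).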

2. Error Log Monitoring (aws-log-monitoring)

Architecture:

Application Logs
    │
    ↓
CloudWatch Logs
    │
    ├─ Log Group: /aws/lambda/order-processor
    │
    ├─ Subscription Filter (ERROR logs only)
    │
    ↓
Lambda (log-filter)
    │
    ├─ Log analysis and aggregation
    │
    ├─ Severity assessment
    │
    └─ SNS notification (email)

Subscription filter configuration:

  
# serverless.yml
resources:
  Resources:
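    ErrorLogSubscription:
      Type: AWS::Logs::SubscriptionFilter
      # Create the invoke permission first so CloudWatch Logs can validate the destination
      DependsOn: LogFilterInvokePermission
      Properties:
        LogGroupName: /aws/lambda/order-processor
        FilterPattern: '{ $.level = "error" }'  # JSON log filter
        DestinationArn: !GetAtt LogFilterLambdaFunction.Arn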

    LogFilterInvokePermission:
      Type: AWS::Lambda::Permission
      Properties:
        FunctionName: !Ref LogFilterLambdaFunction
        Action: lambda:InvokeFunction
        Principal: logs.amazonaws.com
        SourceArn: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/order-processor:*"

functions:
  logFilter:
    handler: functions/log-filter.handler
    environment:
      SNS_TOPIC_ARN: !Ref ErrorAlertTopic

Lambda handler (log filtering):

  
import { gunzip } from 'zlib';
import { promisify } from 'util';
import { SNSClient, PublishCommand } from '@aws-sdk/client-sns';

const gunzipAsync = promisify(gunzip);
const snsClient = new SNSClient();
const SNS_TOPIC_ARN = process.env.SNS_TOPIC_ARN;

export const handler = async (event) => {
  console.log('Processing CloudWatch Logs event');

  // CloudWatch Logs delivers the payload gzip-compressed and base64-encoded
  const payload = Buffer.from(event.awslogs.data, 'base64');
  const decompressed = await gunzipAsync(payload);
  const logData = JSON.parse(decompressed.toString('utf8'));

  console.log('Log Group:', logData.logGroup);
  console.log('Log Stream:', logData.logStream);
  console.log('Log Events:', logData.logEvents.length);

  // Keep only ERROR-level logs; messages that are not valid JSON are skipped
  const errorLogs = logData.logEvents
    .map(logEvent => {
      try {
        return JSON.parse(logEvent.message);
      } catch {
        return null;
      }
    })
    .filter(log => log && log.level === 'error');

  if (errorLogs.length === 0) {
    console.log('No error logs found');
    return;
  }

  console.log('Error logs found:', errorLogs.length);

  // Build the error summary
  const errorSummary = errorLogs.map(log => ({
    timestamp: log.timestamp,
    message: log.message,
    stack: log.stack,
    context: log.context
  }));

  // Send the SNS notification
  const emailBody = `
Error log alert

Log group: ${logData.logGroup}
Log stream: ${logData.logStream}
Error count: ${errorLogs.length}

Errors:
${errorSummary.map((err, idx) => `
${idx + 1}. ${err.timestamp}
   Message: ${err.message}
   Stack: ${err.stack ? err.stack.split('\n')[0] : 'N/A'}
`).join('\n')}

View the log group in the CloudWatch console:
https://console.aws.amazon.com/cloudwatch/home?region=ap-northeast-2#logsV2:log-groups/log-group/${encodeURIComponent(logData.logGroup)}
  `.trim();

  await snsClient.send(new PublishCommand({
    TopicArn: SNS_TOPIC_ARN,
    Subject: `Error alert - ${errorLogs.length} error(s)`,  // SNS subjects must be ASCII
    Message: emailBody
  }));

  console.log('Alert sent successfully');
};
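
The handler lists every error individually, so a burst of identical errors produces a very long email. One possible refinement is to group errors by message before building the body. This is only a sketch: summarizeErrors and its maxGroups parameter are hypothetical, and it assumes the parsed log objects carry message and timestamp fields as in the handler above.

```javascript
// Group parsed error logs by message and keep the most frequent groups.
function summarizeErrors(errorLogs, maxGroups = 10) {
  const groups = new Map();
  for (const log of errorLogs) {
    const key = log.message ?? '(no message)';
    // firstSeen keeps the timestamp of the first occurrence in the batch
    const group = groups.get(key) ?? { message: key, count: 0, firstSeen: log.timestamp };
    group.count += 1;
    groups.set(key, group);
  }
  // Most frequent first, capped at maxGroups entries
  return [...groups.values()]
    .sort((a, b) => b.count - a.count)
    .slice(0, maxGroups);
}

const summary = summarizeErrors([
  { message: 'Payment declined', timestamp: '2024-12-13T01:00:00Z' },
  { message: 'Timeout calling inventory', timestamp: '2024-12-13T01:00:05Z' },
  { message: 'Payment declined', timestamp: '2024-12-13T01:00:09Z' }
]);
console.log(summary[0]); // { message: 'Payment declined', count: 2, firstSeen: '2024-12-13T01:00:00Z' }
```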

CloudWatch Logs Insights queries:

  
# Search error logs (set the query time range to the last hour)
fields @timestamp, level, message, context.orderId, context.userId
| filter level = "error"
| sort @timestamp desc
| limit 100

# Count errors by message
fields message
| filter level = "error"
| stats count() as errorCount by message
| sort errorCount desc

# Trace errors for a specific user
fields @timestamp, message, context.orderId
| filter level = "error" and context.userId = "user-123"
| sort @timestamp desc
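
The time window is not part of the query text. In the console it comes from the time picker, and when a query is started programmatically it is passed as a start and end time in epoch seconds. The sketch below only builds such an input object. buildRecentErrorsQuery is a hypothetical helper, and the field names follow what StartQueryCommand in @aws-sdk/client-cloudwatch-logs is expected to accept, so verify them against the SDK documentation.

```javascript
// Builds the input for a Logs Insights query covering the last `minutes` minutes.
// startTime and endTime are expressed in seconds since the Unix epoch.
function buildRecentErrorsQuery(logGroupName, minutes = 60, now = Date.now()) {
  const endTime = Math.floor(now / 1000);
  return {
    logGroupName,
    startTime: endTime - minutes * 60,
    endTime,
    queryString: [
      'fields @timestamp, level, message, context.orderId, context.userId',
      '| filter level = "error"',
      '| sort @timestamp desc',
      '| limit 100'
    ].join('\n')
  };
}

const queryInput = buildRecentErrorsQuery('/aws/lambda/order-processor');
console.log(queryInput.endTime - queryInput.startTime); // 3600
```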

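3. Lambda Performance Monitoring

Built-in Lambda metrics (collected automatically, free):

Invocations: number of invocations
Duration: execution time
Errors: number of failed invocations
Throttles: number of throttled invocations
ConcurrentExecutions: number of concurrent executions
DeadLetterErrors: number of failed deliveries to the dead-letter queue (DLQ)

Lambda performance alarm configuration: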

  
resources:
  Resources:
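    # Error rate alarm (above 5%), computed with metric math as errors / invocations
    LambdaErrorRateAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ${self:service}-lambda-error-rate-${sls:stage}
        AlarmDescription: Alert when Lambda error rate exceeds 5%
        EvaluationPeriods: 1
        Threshold: 5
        ComparisonOperator: GreaterThanThreshold
        TreatMissingData: notBreaching  # no invocations means no error-rate datapoint
        AlarmActions:
          - !Ref AlertTopic
        Metrics:
          - Id: errorRate
            Expression: "100 * errors / invocations"
            Label: "Error rate (%)"
            ReturnData: true
          - Id: errors
            ReturnData: false
            MetricStat:
              Stat: Sum
              Period: 300  # 5 minutes
              Metric:
                Namespace: AWS/Lambda
                MetricName: Errors
                Dimensions:
                  - Name: FunctionName
                    Value: !Ref OrderProcessorLambdaFunction
          - Id: invocations
            ReturnData: false
            MetricStat:
              Stat: Sum
              Period: 300
              Metric:
                Namespace: AWS/Lambda
                MetricName: Invocations
                Dimensions:
                  - Name: FunctionName
                    Value: !Ref OrderProcessorLambdaFunction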

    # Duration alarm (over 3 seconds)
    LambdaDurationAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ${self:service}-lambda-duration-${sls:stage}
        MetricName: Duration
        Namespace: AWS/Lambda
        Statistic: Average
        Period: 300
        EvaluationPeriods: 2
        Threshold: 3000  # 3 seconds (in milliseconds)
        ComparisonOperator: GreaterThanThreshold
        AlarmActions:
          - !Ref AlertTopic
        Dimensions:
          - Name: FunctionName
            Value: !Ref OrderProcessorLambdaFunction

    # Throttles alarm
    LambdaThrottlesAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ${self:service}-lambda-throttles-${sls:stage}
        MetricName: Throttles
        Namespace: AWS/Lambda
        Statistic: Sum
        Period: 60
        EvaluationPeriods: 1
        Threshold: 1
        ComparisonOperator: GreaterThanOrEqualToThreshold
        AlarmActions:
          - !Ref AlertTopic
        Dimensions:
          - Name: FunctionName
            Value: !Ref OrderProcessorLambdaFunction
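
An error rate is the ratio of the Errors sum to the Invocations sum over the same period, which is not the same thing as the raw error count. A small sketch of the arithmetic (errorRatePercent is a hypothetical helper, and the numbers are made up):

```javascript
// Error rate in percent for one period, or null when there were no invocations.
function errorRatePercent(errors, invocations) {
  if (invocations === 0) return null;
  return (100 * errors) / invocations;
}

// The same six errors mean very different things depending on traffic
console.log(errorRatePercent(6, 1000)); // 0.6
console.log(errorRatePercent(6, 60));   // 10
console.log(errorRatePercent(0, 0));    // null
```

A count threshold of 5 fires in both of the first two cases, while a 5% rate threshold fires only in the second.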

Advanced CloudWatch Logs Insights Queries

1. Log Analysis

API response time analysis:

  
fields @timestamp, method, path, duration
| filter @type = "request"
| stats avg(duration) as avgDuration, max(duration) as maxDuration, min(duration) as minDuration, count() as requestCount by path
| sort avgDuration desc

Error pattern analysis:

  
fields @timestamp, errorType, errorMessage
| filter level = "error"
| parse errorMessage /(?<errorCode>\d{3})/
| stats count() as errorCount by errorType, errorCode
| sort errorCount desc

Per-user activity tracking:

  
fields @timestamp, userId, action, resourceId
| filter userId = "user-123"
| sort @timestamp desc
| limit 100

2. Performance Optimization

Cold start analysis:

  
fields @timestamp, @duration, @initDuration
| filter ispresent(@initDuration)
| stats count(), avg(@initDuration), max(@initDuration)

Memory usage analysis:

  
fields @timestamp, @memorySize, @maxMemoryUsed
| stats avg(@maxMemoryUsed / @memorySize * 100) as avgMemoryUtilization by bin(5m)
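
The @duration, @initDuration, @memorySize, and @maxMemoryUsed fields come from the REPORT line that Lambda writes at the end of each invocation. To show what the two queries above compute, here is a sketch that parses one such line by hand. parseReportLine is a hypothetical helper, and the sample line (including its request ID) is illustrative.

```javascript
// Parses a Lambda REPORT line. initDurationMs is null unless the invocation
// included a cold start.
function parseReportLine(line) {
  const num = (pattern) => {
    const match = line.match(pattern);
    return match ? Number(match[1]) : null;
  };
  const memorySizeMb = num(/Memory Size: (\d+) MB/);
  const maxMemoryUsedMb = num(/Max Memory Used: (\d+) MB/);
  return {
    // Anchored on RequestId so "Billed Duration" and "Init Duration" are not matched
    durationMs: num(/RequestId: \S+\s+Duration: ([\d.]+) ms/),
    initDurationMs: num(/Init Duration: ([\d.]+) ms/),
    memorySizeMb,
    maxMemoryUsedMb,
    memoryUtilizationPercent:
      memorySizeMb && maxMemoryUsedMb !== null ? (100 * maxMemoryUsedMb) / memorySizeMb : null
  };
}

const report = parseReportLine(
  'REPORT RequestId: example-request-id\tDuration: 102.25 ms\tBilled Duration: 103 ms' +
  '\tMemory Size: 128 MB\tMax Memory Used: 64 MB\tInit Duration: 250.12 ms'
);
console.log(report.initDurationMs);           // 250.12
console.log(report.memoryUtilizationPercent); // 50
```

If utilization stays low across invocations, the configured memory size may be larger than the function needs.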

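CloudWatch vs. Alternatives

CloudWatch vs ELK Stack

Item	CloudWatch	ELK Stack
Management	Fully managed	Self-hosted
Price	$0.50/GB (logs)	$300+/month (infrastructure)
Scalability	Automatic, virtually unlimited	Manual scaling required
AWS integration	✓ Native	✗ Agent required
Query engine	Logs Insights	Elasticsearch
Customization	Limited	Fully customizable
Use case	AWS-centric	Complex analysis

When to choose which:

CloudWatch: AWS-centric workloads, simple queries, minimal operational burden
ELK: complex log analysis, heavy customization, on-premises integration

CloudWatch vs Datadog/New Relic

Item	CloudWatch	Datadog/New Relic
Price	$0.50/GB	$15-31/host/month
AWS integration	✓ Native	✓ API integration
APM	X-Ray (separate)	✓ Built in
Dashboards	Basic	Advanced
Alerting	Basic	Advanced (AI)
Use case	AWS only	Multi-cloud

When to choose which:

CloudWatch: AWS only, cost-sensitive, basic monitoring
Datadog/New Relic: multi-cloud, APM required, advanced dashboards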

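Cost Optimization

Cost Breakdown for a Real Project

Small-to-medium serverless app (per month):

CloudWatch Logs:
- Log ingestion: 10 GB × $0.50/GB = $5.00
- Log storage: 10 GB × $0.03/GB-month = $0.30
- Subtotal: $5.30/month

CloudWatch Metrics:
- Custom metrics: 50 × $0.30 = $15.00
- API requests: 10,000 × $0.01 per 1,000 = $0.10
- Subtotal: $15.10/month

CloudWatch Alarms:
- Standard alarms: 10 × $0.10 = $1.00
- Subtotal: $1.00/month

Total CloudWatch cost: $21.40/month

Equivalent ELK Stack deployment: $450+/month
Savings: about $429/month (roughly 95%)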
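
The same arithmetic as a small estimator, useful for plugging in other volumes. This is a sketch under stated assumptions: the unit prices are the ones quoted in this post (actual prices vary by region and change over time), log storage is billed per GB-month on the retained volume, and the free tier is ignored. estimateMonthlyCost and the PRICES object are hypothetical names.

```javascript
// Unit prices quoted in this post (USD); treat them as assumptions.
const PRICES = {
  logIngestionPerGb: 0.50,
  logStoragePerGbMonth: 0.03,
  customMetricPerMonth: 0.30,
  apiRequestsPer1000: 0.01,
  standardAlarmPerMonth: 0.10
};

// Round to whole cents to hide floating-point noise
const roundCents = (amount) => Math.round(amount * 100) / 100;

function estimateMonthlyCost({ ingestedGb, storedGb, customMetrics, apiRequests, alarms }) {
  const logs =
    ingestedGb * PRICES.logIngestionPerGb + storedGb * PRICES.logStoragePerGbMonth;
  const metrics =
    customMetrics * PRICES.customMetricPerMonth +
    (apiRequests / 1000) * PRICES.apiRequestsPer1000;
  const alarmCost = alarms * PRICES.standardAlarmPerMonth;
  return {
    logs: roundCents(logs),
    metrics: roundCents(metrics),
    alarms: roundCents(alarmCost),
    total: roundCents(logs + metrics + alarmCost)
  };
}

const estimate = estimateMonthlyCost({
  ingestedGb: 10,
  storedGb: 10,
  customMetrics: 50,
  apiRequests: 10000,
  alarms: 10
});
console.log(estimate);
```

Custom metrics dominate this estimate, so dimension cardinality is usually the first thing to review when the bill grows.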

Cost-Saving Tips

1. Tune log retention:

  
resources:
  Resources:
    ApplicationLogGroup:
      Type: AWS::Logs::LogGroup
      Properties:
        LogGroupName: /aws/lambda/my-function
        RetentionInDays: 7  # keep logs for 7 days (default: never expire)

2. Filter by log level:

  
// BAD: every log line is always written
console.log('Processing item:', item);  // DEBUG
console.log('Item validated');          // INFO
console.log('Item saved');              // INFO

// GOOD: log level set per environment
const LOG_LEVEL = process.env.LOG_LEVEL || 'INFO';

function log(level, message) {
  const levels = ['DEBUG', 'INFO', 'WARN', 'ERROR'];
  if (levels.indexOf(level) >= levels.indexOf(LOG_LEVEL)) {
    console.log(JSON.stringify({ level, message, timestamp: new Date().toISOString() }));
  }
}

log('DEBUG', 'Processing item');  // not written in prod (LOG_LEVEL=INFO or higher)
log('ERROR', 'Failed to process');  // always written

3. Adjust metric resolution:

  
// Standard resolution (1-minute granularity, cheaper)
await cwClient.send(new PutMetricDataCommand({
  Namespace: 'MyApp',
  MetricData: [{
    MetricName: 'RequestCount',
    Value: 1
  }]
}));

// High resolution (1-second granularity, more expensive to publish and alarm on)
await cwClient.send(new PutMetricDataCommand({
  Namespace: 'MyApp',
  MetricData: [{
    MetricName: 'CriticalMetric',
    Value: 1,
    StorageResolution: 1  // high resolution (1 second)
  }]
}));

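4. Minimize Logs Insights query cost:

Narrow the time range and the set of log groups (queries are billed by the amount of data scanned)
Select only the fields you need (use fields)
Cap the result set with limit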

Lessons Learned in Practice

1. Structured Logging Is Key

  
// BAD: plain string log
console.log(`Order ${orderId} created by ${userId}`);

// GOOD: structured JSON log
console.log(JSON.stringify({
  level: 'info',
  message: 'Order created',
  orderId,
  userId,
  amount,
  timestamp: new Date().toISOString(),
  context: {
    requestId: event.requestContext.requestId
  }
}));

Benefits:

Fields can be searched and aggregated individually in Logs Insights
Automatic parsing and analysis
Consistent log format
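
One way to keep the format consistent is to route every log call through a tiny logger. A minimal sketch: createLogger is a hypothetical helper, and the field names mirror the example above.

```javascript
// Minimal structured logger: every entry is one JSON line with the same
// top-level fields. `write` is injectable so the output can be captured in tests.
function createLogger(context = {}, write = console.log) {
  const emit = (level) => (message, fields = {}) =>
    write(JSON.stringify({
      level,
      message,
      timestamp: new Date().toISOString(),
      ...fields,
      context
    }));
  return { info: emit('info'), warn: emit('warn'), error: emit('error') };
}

const logger = createLogger({ requestId: 'req-123' });
logger.info('Order created', { orderId: 'order-1', userId: 'user-123', amount: 42 });
logger.error('Payment failed', { orderId: 'order-1' });
```

Because level is always a top-level field, filters such as level = "error" work the same way across every function that uses the logger.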

2. Prefer Composite Alarms

  
# BAD: individual alarms (noisy)
HighErrorRateAlarm: ...
LowSuccessRateAlarm: ...

# GOOD: composite alarm (combines conditions)
ServiceHealthAlarm:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmName: service-health-alarm
    AlarmRule: (ALARM(HighErrorRateAlarm) OR ALARM(LowSuccessRateAlarm))
    AlarmActions:
      - !Ref AlertTopic

3. Log Group Naming Conventions

/aws/lambda/{function-name}         # Lambda (automatic)
/aws/ecs/{cluster}/{service}        # ECS (automatic)
/application/{env}/{service}        # application logs
/access-logs/{env}/{service}        # access logs

Systematic names make management and permission setup easier.
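
A sketch of how the convention can be enforced in code and reused for permissions. Both helpers are hypothetical. The ARN pattern assumes the usual arn:aws:logs:region:account:log-group:name layout with wildcards for region and account, and should be tightened for real policies.

```javascript
// Builds a log group name for application or access logs following the convention above.
function logGroupName(kind, env, service) {
  const prefixes = { application: '/application', access: '/access-logs' };
  const prefix = prefixes[kind];
  if (!prefix) throw new Error(`Unknown log kind: ${kind}`);
  return `${prefix}/${env}/${service}`;
}

// Builds an IAM resource pattern covering every application log group in one environment.
function applicationLogArnPattern(env) {
  return `arn:aws:logs:*:*:log-group:/application/${env}/*`;
}

console.log(logGroupName('application', 'prod', 'order-service')); // /application/prod/order-service
console.log(logGroupName('access', 'dev', 'api-gateway'));         // /access-logs/dev/api-gateway
console.log(applicationLogArnPattern('prod'));                     // arn:aws:logs:*:*:log-group:/application/prod/*
```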

4. Use Together with X-Ray

CloudWatch Logs plus X-Ray gives full observability:

CloudWatch: logs and metrics
X-Ray: distributed tracing and service maps

  
import { captureAWSv3Client } from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';

// Trace DynamoDB calls with X-Ray
const dynamoClient = captureAWSv3Client(new DynamoDBClient());

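Wrapping Up

CloudWatch is the foundation of monitoring on AWS. Because it is fully managed, logs, metrics, and alarms can be handled in one place without operating any infrastructure.

Recommendations:

Use structured JSON logging
Adjust log levels per environment
Set alarms only on the metrics that matter
Use Logs Insights for deeper analysis
Combine with X-Ray for full observability

On a recent project, using CloudWatch instead of ELK brought the monthly monitoring cost down from about $450 to roughly $30 or less, a saving of more than 90%, and the operational burden dropped sharply as well, which was very satisfying.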

AWS, Monitoring, CloudWatch

This post is licensed under CC BY 4.0 by the author.