[AWS] Handling Lambda Concurrency for 20,000 Agents: Lessons from a PoC

How we resolved Lambda concurrent execution limits and 503 errors in a large-scale agent system

Posted Nov 12, 2025

By kimmin1kk

Problem

We built a system in which 20,000 agents send data to a Collector (Lambda) every day between 00:00 and 01:00 KST. Small-scale tests (100 and 1,000 agents) worked fine, but scaling to 20,000 agents triggered a flood of 503 Service Unavailable errors.

PoC Scenario

┌─────────────────────────────────────────────────────────┐
│  20,000 Agents (Linux 10,000 + Windows 10,000)          │
│                                                         │
│  - Data Collection: 00:00-01:00 KST (Random)            │
│  - Performance Collection: Every 1 hour                 │
│  - Gzip Compression                                     │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│           API Gateway (HTTP API)                        │
│           POST /data                                    │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│              Lambda (Validator)                         │
│              Memory: 512MB                              │
│              Timeout: 30s                               │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│                    SQS                                  │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│              Lambda (Processor)                         │
│              → S3, DynamoDB, CloudWatch                 │
└─────────────────────────────────────────────────────────┘

Error Log

Error: Request failed with status code 503
Error: TooManyRequestsException
Message: Rate exceeded

Cost Spike

Current month: $31.95 (up 7,330%)
Forecasted month end: $55.06 (up 3,217%)

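Root Cause

1. Lambda Concurrent Execution Limit (Primary Issue)

Limit	Default	Problem
Concurrent executions per account	1,000	20,000 simultaneous agent requests leave 19,000 throttled
Concurrent executions per Region	1,000	Same (the quota applies per account, per Region)
Scaling rate	+1,000 concurrent executions per function every 10 seconds (previously a 500-3,000 burst, then +500/min)	Throttling once the initial burst is used up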

20,000 agents request simultaneously
    ↓
Lambda concurrent execution limit: 1,000
    ↓
19,000 requests throttled
    ↓
503 Service Unavailable

2. API Gateway Limits

Limit	Default	Impact
Throttle Limit	10,000 RPS	At 20,000 requests/s, 10,000 are throttled
Burst Limit	5,000	Only the initial burst is allowed
Account-level limit	10,000 RPS	Shared across all APIs in the Region

3. Lambda Payload Limits

Limit	Synchronous	Asynchronous
Request Payload	6MB	256KB
Response Payload	6MB	N/A

Gzip-compressed agent data occasionally exceeded 256KB, and those payloads failed on the asynchronous hand-off through SQS.
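
A size check before the hand-off is one way to catch oversized payloads early. The following is a minimal sketch, assuming Node.js and JSON-serializable agent data; `checkPayload` and the sample payload are hypothetical names for illustration, not the original Collector code.

```javascript
const zlib = require("zlib");

// 256 KB: the asynchronous payload limit quoted in the table above.
const MAX_ASYNC_PAYLOAD_BYTES = 256 * 1024;

// Gzip a JSON-serializable payload and report whether it fits under the limit.
// `checkPayload` is a hypothetical helper, not part of the original Collector.
function checkPayload(data, limitBytes = MAX_ASYNC_PAYLOAD_BYTES) {
  const compressed = zlib.gzipSync(Buffer.from(JSON.stringify(data), "utf8"));
  return {
    compressedBytes: compressed.length,
    fits: compressed.length <= limitBytes,
  };
}

// A small sample payload (made-up fields) stays well under the limit.
const smallPayload = checkPayload({ agentId: "agent-0001", metrics: [1, 2, 3] });
console.log(smallPayload.fits, smallPayload.compressedBytes);
```

What to do with a payload that does not fit (split it, or store it elsewhere and send a reference) is a design decision the check itself does not make.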

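Solution

1. Configure Reserved Concurrency

Reserved Concurrency sets aside a fixed number of concurrent executions for a specific Lambda function. The same number also acts as that function's upper limit.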

  
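# serverless.yml
# Note: Lambda always keeps at least 100 concurrent executions unreserved,
# so reserving 500 + 500 requires an account quota above the default 1,000.
functions:
  validator:
    handler: src/validator/handler.handler
    reservedConcurrency: 500  # reserve 500 concurrent executions
    memorySize: 512
    timeout: 30

  processor:
    handler: src/processor/handler.handler
    reservedConcurrency: 500  # reserve 500 concurrent executions
    memorySize: 1024
    timeout: 60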

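Trade-offs

Setting	Pros	Cons
Reserved	Unaffected by other functions	Reduces the concurrency left for the rest of the account
Provisioned	No cold starts	Higher cost (always running)
None (default)	Cost-efficient	Affected by other functions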

2. SQS Buffering (Key Fix)

Problem: Requests that exceed the Lambda concurrency limit are throttled.
Solution: Use SQS as a buffer so Lambda can process messages at its own pace.

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│   Agent      │       │   Lambda     │       │     SQS      │
│  (20,000)    │──────▶│  Validator   │──────▶│   (Queue)    │
│              │       │   (1,000)    │       │  (Buffering) │
└──────────────┘       └──────────────┘       └──────────────┘
                                                      │
                                                      ▼
                                               ┌──────────────┐
                                               │   Lambda     │
                                               │  Processor   │
                                               │   (Poll)     │
                                               └──────────────┘
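
To make the buffering effect concrete, here is a small simulation sketch of a queue absorbing a burst while a consumer drains it at a fixed rate. The drain rate of 500 messages per second is an illustrative assumption, not a measurement from the PoC, and `simulateBuffer` is a hypothetical helper.

```javascript
// Simulate a queue that absorbs a burst while a consumer drains it at a fixed rate.
// All numbers are illustrative assumptions, not measurements from the PoC.
function simulateBuffer(arrivalsPerSecond, drainPerSecond) {
  if (drainPerSecond <= 0) {
    throw new RangeError("drainPerSecond must be positive");
  }
  let depth = 0;
  let peakDepth = 0;
  let seconds = 0;
  // Keep stepping until every arrival has been seen and the queue is empty.
  while (seconds < arrivalsPerSecond.length || depth > 0) {
    depth += arrivalsPerSecond[seconds] ?? 0;
    peakDepth = Math.max(peakDepth, depth);
    depth -= Math.min(depth, drainPerSecond);
    seconds += 1;
  }
  return { peakDepth, secondsToDrain: seconds };
}

// Worst case: all 20,000 messages land in the same second,
// and the Processor is assumed to clear 500 messages per second.
const burst = simulateBuffer([20000], 500);
console.log(burst); // { peakDepth: 20000, secondsToDrain: 40 }
```

Nothing is rejected in this model: the burst simply waits in the queue, which is the property that removes the throttling errors on the consumer side.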

SQS Configuration

  
resources:
  Resources:
    DataQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: collector-data-queue
        VisibilityTimeout: 360  # 6 x the Processor timeout (60s)
        MessageRetentionPeriod: 1209600  # 14 days
        ReceiveMessageWaitTimeSeconds: 20  # Long polling
        RedrivePolicy:
          deadLetterTargetArn: !GetAtt DeadLetterQueue.Arn
          maxReceiveCount: 3  # move to the DLQ after 3 failed receives

    DeadLetterQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: collector-data-dlq
        MessageRetentionPeriod: 1209600  # 14 days

Lambda Event Source Mapping

  
functions:
  processor:
    handler: src/processor/handler.handler
    events:
      - sqs:
          arn: !GetAtt DataQueue.Arn
          batchSize: 10  # process up to 10 messages per invocation
          maximumBatchingWindow: 5  # wait up to 5 seconds to fill a batch
          functionResponseType: ReportBatchItemFailures  # allow partial batch failures
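
With `ReportBatchItemFailures` enabled, the handler is expected to return the IDs of the messages that failed so that only those are retried. The following is a minimal sketch of such a handler; `processRecord` and the `agentId` field are hypothetical stand-ins for the real Processor logic, which is not shown in this post.

```javascript
// Hypothetical per-message work: parse the SQS body and reject malformed input.
// The real Processor would write to S3/DynamoDB here.
async function processRecord(record) {
  const payload = JSON.parse(record.body); // throws on invalid JSON
  if (!payload.agentId) {
    throw new Error("missing agentId");
  }
  return payload;
}

// SQS-triggered Lambda handler that reports only the failed messages.
// Lambda then makes just those message IDs visible again for retry.
async function handler(event) {
  const batchItemFailures = [];
  for (const record of event.Records) {
    try {
      await processRecord(record);
    } catch (err) {
      console.error(`message ${record.messageId} failed: ${err.message}`);
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

Returning an empty `batchItemFailures` array marks the whole batch as successful. To deploy this, export `handler` from the module referenced by `src/processor/handler.handler`.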

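3. Incremental Scaling Strategy

Do not test all 20,000 agents at once; scale up in stages.

Stage	Agents	Purpose	Expected concurrency
1. Small	100	Verify basic functionality	~10
2. Medium	1,000	Initial concurrency test	~100
3. Large	5,000	Find the limits	~500
4. XLarge	10,000	Validate Reserved Concurrency	~1,000
5. Full	20,000	Final load test	~1,000 (buffered)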

  
# Staged test runs
npm run poc -- --agents 100
# After verifying...
npm run poc -- --agents 1000
# After verifying...
npm run poc -- --agents 5000
# After verifying...
npm run poc -- --agents 10000
# After verifying...
npm run poc -- --agents 20000
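
The "verify, then continue" step can also be automated. The sketch below runs the stages in order and stops at the first one whose error rate crosses a threshold. `runStage` is a hypothetical caller-supplied function, the 1% threshold is an assumption, and the stand-in runner only imitates a 1,000-request concurrency ceiling.

```javascript
// Agent counts from the staged plan above.
const STAGES = [100, 1000, 5000, 10000, 20000];

// Run each stage in order and stop at the first one whose error rate is too high.
// `runStage` is a caller-supplied (hypothetical) function that resolves to
// { sent, failed } for the given number of agents.
async function runStages(stages, runStage, maxErrorRate = 0.01) {
  const results = [];
  for (const agents of stages) {
    const { sent, failed } = await runStage(agents);
    const errorRate = sent === 0 ? 0 : failed / sent;
    results.push({ agents, errorRate });
    if (errorRate > maxErrorRate) {
      break; // investigate before scaling further
    }
  }
  return results;
}

// Stand-in runner for illustration: pretends every request beyond 1,000
// is throttled, imitating the default concurrency limit with no buffering.
const fakeRunStage = async (agents) => ({
  sent: agents,
  failed: Math.max(0, agents - 1000),
});

runStages(STAGES, fakeRunStage).then((results) => console.log(results));
```

With the stand-in runner, the run halts at the 5,000-agent stage, which is the point where the limits would need to be investigated before moving on.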

4. Apply a Random Offset

Add a random delay so that the agents do not all send their requests at exactly 00:00:00.

  
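// PoC Simulator code (assuming `cron` is the node-cron package)
const cron = require('node-cron');

class TaskScheduler {
  scheduleDataCollection(agent) {
    // Pick a random time between 00:00:00 and 00:59:59 KST
    const hour = 0;
    const randomMinute = Math.floor(Math.random() * 60);
    const randomSecond = Math.floor(Math.random() * 60);

    // node-cron accepts an optional leading seconds field
    const cronExpression = `${randomSecond} ${randomMinute} ${hour} * * *`;

    cron.schedule(cronExpression, async () => {
      await agent.sendDataCollection();
    }, { timezone: 'Asia/Seoul' }); // evaluate the schedule in KST, not host time
  }
}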

Effect

Without Random Offset:
00:00:00 → 20,000 requests (SPIKE!) → Throttle

With Random Offset:
00:00:00-01:00:00 → ~333 requests/min → Smooth
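
The ~333 requests/min figure follows from simple arithmetic, and Little's law (average concurrency = arrival rate × average duration) gives a rough idea of the concurrency that rate implies. The sketch below assumes a perfectly uniform spread and a 0.5 s average invocation, the same duration used in the cost estimate later in this post; measured concurrency will be higher whenever arrivals cluster or cold starts lengthen invocations.

```javascript
// Average request rate when `agents` requests are spread uniformly over a window.
function averageRate(agents, windowSeconds) {
  const perSecond = agents / windowSeconds;
  return { perSecond, perMinute: perSecond * 60 };
}

// Little's law: average concurrency = arrival rate x average duration.
function averageConcurrency(perSecond, avgDurationSeconds) {
  return perSecond * avgDurationSeconds;
}

const spread = averageRate(20000, 3600); // 20,000 agents over a 1-hour window
console.log(spread.perMinute.toFixed(0)); // "333"
console.log(averageConcurrency(spread.perSecond, 0.5).toFixed(1)); // "2.8"
```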

5. CloudWatch Monitoring

  
resources:
  Resources:
    LambdaThrottleAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: validator-throttle-alarm
        MetricName: Throttles
        Namespace: AWS/Lambda
        Statistic: Sum
        Period: 60
        EvaluationPeriods: 1
        Threshold: 10  # alarm when throttles exceed 10 per minute
        ComparisonOperator: GreaterThanThreshold
        Dimensions:
          - Name: FunctionName
            Value: !Ref ValidatorLambdaFunction

Final Architecture

┌─────────────────────────────────────────────────────────┐
│  20,000 Agents                                          │
│  Random Offset: 00:00-01:00 (1 hour window)             │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼  ~333 req/min (average)
┌─────────────────────────────────────────────────────────┐
│           API Gateway (HTTP API)                        │
│           Throttle: 10,000 RPS (sufficient)             │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│              Lambda Validator                           │
│              Reserved: 500                              │
│              Actual: ~100 (average)                     │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼  all requests buffered
┌─────────────────────────────────────────────────────────┐
│                    SQS Queue                            │
│                    (key buffer)                         │
│              - Visibility Timeout: 360s                 │
│              - DLQ configured                           │
└─────────────────────────────────────────────────────────┘
                        │
                        ▼  batches of 10
┌─────────────────────────────────────────────────────────┐
│              Lambda Processor                           │
│              Reserved: 500                              │
│              Actual: ~50 (average)                      │
│              → S3, DynamoDB, CloudWatch                 │
└─────────────────────────────────────────────────────────┘

Final Results

Metric	Before	After
503 error rate	95% (19,000/20,000)	0%
Lambda throttles	19,000/hour	0
Average response time	Failed	200-300ms
Data loss	95%	0% (including the DLQ)
Cost	$55/month (forecast)	$15/month (actual)

Cost Optimization

Lambda Cost Calculation

Requests: 20,000 agents × 2 requests/day (data + performance) × 30 days = 1,200,000 requests/month

Lambda request cost:
- First 1M requests: free
- Remaining 200,000 requests: $0.20 × 0.2 = $0.04

Lambda duration cost (512MB, 500ms average):
- GB-seconds: 0.5GB × 0.5s × 1,200,000 = 300,000 GB-s
- First 400,000 GB-s: free
- Cost: $0 (within the free tier)

SQS cost:
- Requests: 1,200,000 (1.2M)
- First 1M: free
- Remaining 200,000: $0.40 × 0.2 = $0.08

Total: ~$0.12/month (with the free tier applied)
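
The same estimate can be written as a small function so that the request count becomes a parameter. This matters here because the PoC scenario lists hourly performance collection, so the requests-per-day figure is worth varying. This is a sketch under stated assumptions: the per-GB-second price is the published x86 rate rather than a figure from this post, and, like the calculation above, it counts one SQS request per message and ignores API Gateway, CloudWatch, S3 and DynamoDB charges.

```javascript
// On-demand prices and free-tier allowances used in the estimate above.
// lambdaPerGbSecond is an assumption (x86 rate); the rest are quoted in the post.
const PRICING = {
  lambdaPerMillionRequests: 0.2,
  lambdaFreeRequests: 1000000,
  lambdaPerGbSecond: 0.0000166667,
  lambdaFreeGbSeconds: 400000,
  sqsPerMillionRequests: 0.4,
  sqsFreeRequests: 1000000,
};

function estimateMonthlyCost({ requests, memoryGb, avgDurationSeconds }, p = PRICING) {
  // Usage beyond the free allowance, never negative.
  const billable = (used, free) => Math.max(0, used - free);

  const lambdaRequests =
    (billable(requests, p.lambdaFreeRequests) / 1e6) * p.lambdaPerMillionRequests;
  const gbSeconds = memoryGb * avgDurationSeconds * requests;
  const lambdaCompute = billable(gbSeconds, p.lambdaFreeGbSeconds) * p.lambdaPerGbSecond;
  const sqs = (billable(requests, p.sqsFreeRequests) / 1e6) * p.sqsPerMillionRequests;

  return { lambdaRequests, lambdaCompute, sqs, total: lambdaRequests + lambdaCompute + sqs };
}

const monthlyRequests = 20000 * 2 * 30; // agents x requests/day x days
const monthlyCost = estimateMonthlyCost({
  requests: monthlyRequests,
  memoryGb: 0.5,
  avgDurationSeconds: 0.5,
});
console.log(monthlyCost.total.toFixed(2)); // "0.12"
```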

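Cost-Saving Tips

Minimize Reserved Concurrency: reserve only what is strictly needed
Batch SQS processing: a batchSize of 10 reduces the number of Lambda invocations
Remove Provisioned Concurrency: unnecessary in a PoC environment
CloudWatch Logs retention: limit it to 7 days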

Best Practices

1. Monitor Lambda Concurrency

  
# Check CloudWatch metrics
# Times are UTC: 00:00-01:00 KST on Dec 1 is 15:00-16:00 UTC on Nov 30
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name ConcurrentExecutions \
  --dimensions Name=FunctionName,Value=validator \
  --start-time 2025-11-30T15:00:00Z \
  --end-time 2025-11-30T16:00:00Z \
  --period 60 \
  --statistics Maximum

2. Calculating Reserved Concurrency

Total account concurrency: 1,000
Reserved for Validator: 500
Reserved for Processor: 300
Remaining for all other functions: 200
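
The budget above can be checked mechanically. The sketch below is a hypothetical helper (`planReservedConcurrency` is not an AWS API) that sums the reservations and verifies that at least 100 concurrent executions stay unreserved, which is the minimum Lambda enforces per account.

```javascript
// Lambda keeps at least this much concurrency unreserved in every account.
const MIN_UNRESERVED = 100;

// Sum per-function reservations and check what is left for everything else.
function planReservedConcurrency(accountLimit, reservations) {
  const reserved = Object.values(reservations).reduce((sum, n) => sum + n, 0);
  const unreserved = accountLimit - reserved;
  return { reserved, unreserved, valid: unreserved >= MIN_UNRESERVED };
}

console.log(planReservedConcurrency(1000, { validator: 500, processor: 300 }));
// { reserved: 800, unreserved: 200, valid: true }
console.log(planReservedConcurrency(1000, { validator: 500, processor: 500 }));
// { reserved: 1000, unreserved: 0, valid: false }
```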

3. Monitor the SQS DLQ

  
# Check the number of messages in the DLQ
aws sqs get-queue-attributes \
  --queue-url https://sqs.ap-northeast-2.amazonaws.com/.../collector-data-dlq \
  --attribute-names ApproximateNumberOfMessages

If messages accumulate in the DLQ, investigate immediately.
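
When scripting this check, note that the command returns attribute values as strings. The sketch below parses the JSON output and decides whether to raise a flag; `dlqDepth`, `shouldInvestigate`, and the sample response are hypothetical, with the response shaped like the CLI output.

```javascript
// Parse the JSON printed by `aws sqs get-queue-attributes`.
// Attribute values arrive as strings, so convert before comparing.
function dlqDepth(response) {
  const raw = response?.Attributes?.ApproximateNumberOfMessages ?? "0";
  return Number.parseInt(raw, 10);
}

// Any message in the DLQ is worth a look; raise the threshold if that is too noisy.
function shouldInvestigate(response, threshold = 0) {
  return dlqDepth(response) > threshold;
}

const sampleResponse = { Attributes: { ApproximateNumberOfMessages: "3" } };
console.log(dlqDepth(sampleResponse), shouldInvestigate(sampleResponse)); // 3 true
```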

4. Request a Concurrency Limit Increase

You can ask AWS Support to raise the concurrency limit.

Service: Lambda
Limit Type: Concurrent executions
Region: ap-northeast-2
New Limit: 5,000
Reason: Scaling to support 20,000 agents

Approval takes 1-2 days.

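Troubleshooting Checklist

When 503 errors occur

Check ConcurrentExecutions in CloudWatch Metrics
Check the Throttles metric in CloudWatch (throttled invocations never run, so they leave no function logs)
Check the API Gateway 4XXError and 5XXError metrics
Check the SQS queue depth (ApproximateNumberOfMessages)
Check the Lambda Reserved Concurrency settings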

When responses are slow

Check for Lambda cold starts (InitDuration)
Check Lambda memory usage (allocate an appropriate amount)
Check S3/DynamoDB response times
Check for a VPC NAT Gateway bottleneck (if the Lambda runs in a VPC)

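When costs spike

Check for runaway logging in CloudWatch Logs
Check for recursive Lambda invocations (a function calling itself)
Check for unnecessary Provisioned Concurrency settings
Confirm that API Gateway caching is turned off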

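Wrapping Up

The keys to building a Lambda system for a large agent fleet are understanding the concurrency limits and buffering with SQS.

Lambda concurrency is limited to 1,000 per account (per Region) by default, and SQS buffering is the key fix for request floods. Use a random offset to prevent spikes, and test incrementally: 100 → 1,000 → 5,000 → full scale. Weigh the trade-offs of Reserved Concurrency and set it only when needed.

Serverless architectures scale automatically, but hard limits still exist. Understanding them and designing around them is the key to making a large-scale system succeed.

Hope this helps! 😀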

AWS, Serverless

This post is licensed under CC BY 4.0 by the author.