
Transcript of "On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization"

Page 1:

On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization

Hao Yu, Rong Jin
Machine Intelligence Technology
Alibaba Group (US) Inc., Bellevue, WA

Poster @ Pacific Ballroom #103

Page 2:

Stochastic Non-Convex Optimization

• Stochastic non-convex optimization

$$\min_{x \in \mathbb{R}^m} f(x) \triangleq \mathbb{E}_{\zeta \sim \mathcal{D}}[F(x; \zeta)]$$

Page 3:

Stochastic Non-Convex Optimization

• Stochastic non-convex optimization

• SGD:

$$\min_{x \in \mathbb{R}^m} f(x) \triangleq \mathbb{E}_{\zeta \sim \mathcal{D}}[F(x; \zeta)]$$

$$x_{t+1} = x_t - \gamma \, \frac{1}{B} \sum_{i=1}^{B} \nabla F(x_t; \zeta_i)$$

(the stochastic gradient is averaged from a mini-batch of size $B$)
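For concreteness, a minimal NumPy sketch of one such SGD step; grad_F and sample_batch are hypothetical placeholders standing in for the loss gradient and the data distribution D, not anything from the paper:

import numpy as np

def sgd_step(x, grad_F, sample_batch, B, gamma):
    # Draw a mini-batch of B samples zeta_1, ..., zeta_B ~ D (sampler is a placeholder).
    zetas = sample_batch(B)
    # Average the B stochastic gradients: (1/B) * sum_i grad F(x; zeta_i).
    g = np.mean([grad_F(x, z) for z in zetas], axis=0)
    # Descend: x_{t+1} = x_t - gamma * g.
    return x - gamma * g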

Page 4:

Stochastic Non-Convex Optimization

• Stochastic non-convex optimization

• SGD:

• Single-node training:

• Larger B can improve the utilization of computing hardware

$$\min_{x \in \mathbb{R}^m} f(x) \triangleq \mathbb{E}_{\zeta \sim \mathcal{D}}[F(x; \zeta)]$$

$$x_{t+1} = x_t - \gamma \, \frac{1}{B} \sum_{i=1}^{B} \nabla F(x_t; \zeta_i)$$

(the stochastic gradient is averaged from a mini-batch of size $B$)

Page 5:

Stochastic Non-Convex Optimization

• Stochastic non-convex optimization

• SGD:

• Single-node training:

• Larger B can improve the utilization of computing hardware

• Data-parallel training:

• Multiple nodes form a bigger "mini-batch" by aggregating individual mini-batch gradients at each step (see the allreduce sketch after the update rule below).

• Given a budget of gradient accesses, a larger batch size yields fewer update/communication steps.

$$\min_{x \in \mathbb{R}^m} f(x) \triangleq \mathbb{E}_{\zeta \sim \mathcal{D}}[F(x; \zeta)]$$

$$x_{t+1} = x_t - \gamma \, \frac{1}{B} \sum_{i=1}^{B} \nabla F(x_t; \zeta_i)$$

(the stochastic gradient is averaged from a mini-batch of size $B$)
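A minimal sketch of the data-parallel aggregation using an mpi4py allreduce; local_grad and the surrounding setup are hypothetical placeholders, not the paper's implementation:

import numpy as np
from mpi4py import MPI  # assumes an MPI launch, e.g. mpirun -n N python train.py

def parallel_sgd_step(comm, x, local_grad, B_local, gamma):
    # Each of the N workers averages B_local stochastic gradients locally.
    g_local = local_grad(x, B_local)
    # Allreduce sums the N local averages; dividing by N gives the big-batch gradient,
    # so the effective batch size per update/communication step is N * B_local.
    g = np.empty_like(g_local)
    comm.Allreduce(g_local, g, op=MPI.SUM)
    g /= comm.Get_size()
    return x - gamma * g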

Page 6:

Batch size for (parallel) SGD

• Question: Should we always use a batch size (BS) as large as possible in (parallel) SGD?

Page 7:

Batch size for (parallel) SGD

• Question: Should we always use a batch size (BS) as large as possible in (parallel) SGD?

• You may be tempted to say "yes", because in the strongly convex case SGD with an extremely large BS is close to GD.

Page 8:

Batch size for (parallel) SGD

• Question: Should we always use a batch size (BS) as large as possible in (parallel) SGD?

• You may be tempted to say "yes", because in the strongly convex case SGD with an extremely large BS is close to GD.

• Theoretically, no! [Bottou&Bousquet'08] and [Bottou et al.'18] show that with a limited budget of stochastic gradient (Stochastic First Order, SFO) access, GD (i.e., SGD with an extremely large BS) has slower convergence than SGD with small batch sizes.

Page 9:

Batch size for (parallel) SGD

• Question: Should we always use a batch size (BS) as large as possible in (parallel) SGD?

• You may be tempted to say "yes", because in the strongly convex case SGD with an extremely large BS is close to GD.

• Theoretically, no! [Bottou&Bousquet'08] and [Bottou et al.'18] show that with a limited budget of stochastic gradient (Stochastic First Order, SFO) access, GD (i.e., SGD with an extremely large BS) has slower convergence than SGD with small batch sizes.

• Under a finite SFO access budget, [Bottou et al.'18] shows that SGD with B=1 achieves a better stochastic optimization error than GD.

Page 10:

Batch size for (parallel) SGD

• Question: Should we always use a batch size (BS) as large as possible in (parallel) SGD?

• You may be tempted to say "yes", because in the strongly convex case SGD with an extremely large BS is close to GD.

• Theoretically, no! [Bottou&Bousquet'08] and [Bottou et al.'18] show that with a limited budget of stochastic gradient (Stochastic First Order, SFO) access, GD (i.e., SGD with an extremely large BS) has slower convergence than SGD with small batch sizes.

• Under a finite SFO access budget, [Bottou et al.'18] shows that SGD with B=1 achieves a better stochastic optimization error than GD.

• Recall that B=1 means poor hardware utilization and a huge communication cost.

Page 11:

Dynamic BS: reduce communication without sacrificing SFO convergence

• Motivating result: for strongly convex stochastic opt, [Friedlander&Schmidt'12] and [Bottou et al.'18] show that SGD with exponentially increasing BS can achieve the same $O(1/T)$ SFO convergence as SGD with a fixed small BS.

Page 12:

Dynamic BS: reduce communication without sacrificing SFO convergence

• Motivating result: for strongly convex stochastic opt, [Friedlander&Schmidt'12] and [Bottou et al.'18] show that SGD with exponentially increasing BS can achieve the same $O(1/T)$ SFO convergence as SGD with a fixed small BS.

• This paper explores how to use dynamic BS for non-convex opt such that:

Page 13:

Dynamic BS: reduce communication without sacrificing SFO convergence

• Motivating result: for strongly convex stochastic opt, [Friedlander&Schmidt'12] and [Bottou et al.'18] show that SGD with exponentially increasing BS can achieve the same $O(1/T)$ SFO convergence as SGD with a fixed small BS.

• This paper explores how to use dynamic BS for non-convex opt such that:

• it does not sacrifice SFO convergence in (parallel) SGD. Recall that (N-node parallel) SGD with B=1 has $O(1/\sqrt{NT})$ SFO convergence, where T is the SFO access budget at each node: a linear speedup w.r.t. the number of nodes, i.e., computation power is perfectly scaled out.

Page 14:

Dynamic BS: reduce communication without sacrificing SFO convergence

• Motivating result: for strongly convex stochastic opt, [Friedlander&Schmidt'12] and [Bottou et al.'18] show that SGD with exponentially increasing BS can achieve the same $O(1/T)$ SFO convergence as SGD with a fixed small BS.

• This paper explores how to use dynamic BS for non-convex opt such that:

• it does not sacrifice SFO convergence in (parallel) SGD. Recall that (N-node parallel) SGD with B=1 has $O(1/\sqrt{NT})$ SFO convergence, where T is the SFO access budget at each node: a linear speedup w.r.t. the number of nodes, i.e., computation power is perfectly scaled out.

Page 15:

Dynamic BS: reduce communication without sacrificing SFO convergence

• Motivating result: for strongly convex stochastic opt, [Friedlander&Schmidt'12] and [Bottou et al.'18] show that SGD with exponentially increasing BS can achieve the same $O(1/T)$ SFO convergence as SGD with a fixed small BS.

• This paper explores how to use dynamic BS for non-convex opt such that:

• it does not sacrifice SFO convergence in (parallel) SGD. Recall that (N-node parallel) SGD with B=1 has $O(1/\sqrt{NT})$ SFO convergence, where T is the SFO access budget at each node: a linear speedup w.r.t. the number of nodes, i.e., computation power is perfectly scaled out;

• it reduces the communication complexity (# of used batches) in parallel SGD.

Page 16:

Non-Convex under PL condition

• PL condition:

$$\frac{1}{2}\|\nabla f(x)\|^2 \;\geq\; \mu\,(f(x) - f^*), \quad \forall x$$

• Milder than strong convexity: strong convexity implies the PL condition.

• A non-convex function under PL is typically as well-behaved as a strongly convex function.
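As a quick check of the claim that strong convexity implies PL: if $f$ is $\mu$-strongly convex with minimizer $x^*$, then $f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\|y - x\|^2$ for all $y$; minimizing both sides over $y$ (the right-hand side is minimized at $y = x - \frac{1}{\mu}\nabla f(x)$) gives $f^* \geq f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2$, which rearranges to the PL inequality above with the same $\mu$.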


Algorithm 1 CR-PSGD(f, N, T, x_1, B_1, ρ, γ)

1: Input: N, T, x_1 ∈ R^m, γ, B_1 and ρ > 1.
2: Initialize t = 1.
3: while Σ_{τ=1}^{t} B_τ ≤ T do
4:   Each worker calculates its batch gradient average g_{t,i} = (1/B_t) Σ_{j=1}^{B_t} ∇F(x_t; ζ_{i,j}).
5:   Each worker aggregates the gradient averages into g_t = (1/N) Σ_{i=1}^{N} g_{t,i}.
6:   Each worker updates in parallel via x_{t+1} = x_t − γ g_t.
7:   Set batch size B_{t+1} = ⌊ρ^t B_1⌋.
8:   Update t ← t + 1.
9: end while
10: Return: x_t

(T: budget of SFO accesses at each worker)
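Below is a minimal single-process sketch of CR-PSGD that simulates the N workers in one loop; stoch_grad (an unbiased stochastic gradient oracle for f) and the calling convention are hypothetical placeholders, not the paper's code:

import numpy as np

def cr_psgd(stoch_grad, x1, N, T, B1, rho, gamma):
    # Sketch of CR-PSGD: synchronized SGD with exponentially increasing batch sizes.
    x, t, B_t, used = x1.copy(), 1, B1, 0
    while used + B_t <= T:  # stop once the per-worker SFO budget T would be exceeded
        # Simulate the N workers: average N * B_t stochastic gradients at x_t.
        g = np.mean([stoch_grad(x) for _ in range(N * B_t)], axis=0)
        x = x - gamma * g                 # x_{t+1} = x_t - gamma * g_t
        used += B_t                       # SFO accesses consumed per worker so far
        B_t = int(np.floor(rho**t * B1))  # B_{t+1} = floor(rho^t * B1)
        t += 1
    return x

Because the batch sizes grow geometrically, the number of loop iterations, i.e. communication rounds, needed to spend the budget T is only O(log T).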

Page 17:

Non-Convex under PL condition

• Under PL, we show that using exponentially increasing batch sizes in PSGD with N workers attains $O(\frac{1}{NT})$ SFO convergence with only $O(\log T)$ comm rounds (a quick numerical check follows below).

• The state-of-the-art $O(\frac{1}{NT})$ SFO convergence with $O(\sqrt{NT})$ inter-worker comm rounds is attained by local SGD in [Stich'18], and only for strongly convex opt.
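A quick numerical sanity check of the $O(\log T)$ communication count, using the schedule B_{t+1} = ⌊ρ^t B_1⌋ from Algorithm 1 with illustrative values B_1 = 1, ρ = 2, T = 10^6 (chosen only for this example):

def comm_rounds(T, B1=1, rho=2.0):
    # Count synchronized updates (communication rounds) until the per-worker budget T is spent.
    t, B, used = 1, B1, 0
    while used + B <= T:
        used += B
        B = int(rho**t * B1)
        t += 1
    return t - 1

print(comm_rounds(10**6))  # 19, roughly log2(10^6) ~ 20, versus 10^6 rounds for fixed B = 1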

Page 18:

Non-Convex under PL condition

• Under PL, we show that using exponentially increasing batch sizes in PSGD with N workers attains $O(\frac{1}{NT})$ SFO convergence with only $O(\log T)$ comm rounds.

• The state-of-the-art $O(\frac{1}{NT})$ SFO convergence with $O(\sqrt{NT})$ inter-worker comm rounds is attained by local SGD in [Stich'18], and only for strongly convex opt.

• How about general non-convex opt without PL?

Page 19:

Non-Convex under PL condition

• Under PL, we show that using exponentially increasing batch sizes in PSGD with N workers attains $O(\frac{1}{NT})$ SFO convergence with only $O(\log T)$ comm rounds.

• The state-of-the-art $O(\frac{1}{NT})$ SFO convergence with $O(\sqrt{NT})$ inter-worker comm rounds is attained by local SGD in [Stich'18], and only for strongly convex opt.

• How about general non-convex opt without PL?

• Inspiration from the "catalyst acceleration" developed in [Lin et al.'15][Paquette et al.'18]:

• instead of solving the original problem directly, it repeatedly solves "strongly convex" proximal minimization subproblems.

Page 20:

General Non-Convex Opt

• A new catalyst-like parallel SGD method:

Algorithm 2 CR-PSGD-Catalyst(f, N, T, y_0, B_1, ρ, γ)

1: Input: N, T, θ, y_0 ∈ R^m, γ, B_1 and ρ > 1.
2: Initialize y^(0) = y_0 and k = 1.
3: while k ≤ ⌊√(NT)⌋ do
4:   Define h_θ(x; y^(k−1)) ≜ f(x) + (θ/2) ‖x − y^(k−1)‖².
5:   Update y^(k) via y^(k) = CR-PSGD(h_θ(·; y^(k−1)), N, ⌊√(T/N)⌋, y^(k−1), B_1, ρ, γ).
6:   Update k ← k + 1.
7: end while

(h_θ(·; y^(k−1)) is a strongly convex function whose unbiased stochastic gradients are easily estimated.)
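A minimal sketch of the outer loop, reusing the cr_psgd sketch above; theta, stoch_grad_f, and the closed-form gradient of the proximal term are assumptions for illustration only:

import numpy as np

def cr_psgd_catalyst(stoch_grad_f, y0, N, T, theta, B1, rho, gamma):
    # Sketch of CR-PSGD-Catalyst: repeatedly run CR-PSGD on the strongly convex
    # surrogate h_theta(x; y) = f(x) + (theta/2) * ||x - y||^2.
    y = y0.copy()
    K = int(np.floor(np.sqrt(N * T)))        # number of outer catalyst rounds
    T_inner = int(np.floor(np.sqrt(T / N)))  # per-worker SFO budget of each inner CR-PSGD call
    for _ in range(K):
        y_prev = y
        # Unbiased stochastic gradient of h_theta(.; y_prev): grad F(x; zeta) + theta * (x - y_prev).
        stoch_grad_h = lambda x, y_prev=y_prev: stoch_grad_f(x) + theta * (x - y_prev)
        y = cr_psgd(stoch_grad_h, y_prev, N, T_inner, B1, rho, gamma)
    return y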

• We show that this catalyst-like parallel SGD (with dynamic BS) attains $O(1/\sqrt{NT})$ SFO convergence with $O(\sqrt{NT}\,\log(\frac{T}{N}))$ comm rounds.

• The state of the art is $O(1/\sqrt{NT})$ SFO convergence with $O(N^{3/4}T^{3/4})$ inter-worker comm rounds.

Page 21:

Experiments

Distributed Logistic Regression: N=10

Page 22:

Experiments

Training ResNet20 on CIFAR-10: N=8

Page 23:

Thanks!

Poster on Wed Jun 12th 06:30 -- 09:00 PM @ Pacific Ballroom #103