Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache

Enric Gibert 1, Jesús Sánchez 2, Antonio González 1,2

1 Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
2 Intel Barcelona Research Center, Intel Labs, Barcelona

UPC, CGO’03, San Francisco, March 2003

Page 1: Title

UPC, CGO’03, San Francisco, March 2003

Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache

Enric Gibert 1, Jesús Sánchez 2, Antonio González 1,2

1 Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona

2 Intel Barcelona Research Center, Intel Labs, Barcelona

Page 2: Motivation

Capacity- vs. communication-bound

Clustered microarchitectures
– Simpler + faster
– Power consumption
– Communications not homogeneous

Clustering in the embedded/DSP domain

Page 3: Clustered Microarchitectures

[Figure: three organizations of a four-cluster processor. In each one, every cluster has its own register file and functional units (FUs), and the clusters are linked by register-to-register communication buses.
 1. A single L1 cache with memory buses, backed by an L2 cache.
 2. L1 cache modules distributed among the clusters, backed by an L2 cache.
 3. L1 cache modules distributed among the clusters, with memory buses, backed by an L2 cache.]

Page 4: Contributions

Distribution of data cache
– Architecture design + data mapping
  • Word-interleaved scheme [ICS’02]
– Appropriate scheduling techniques [MICRO’02]
– Memory coherence

Scheduling techniques for mem. coherence
– Local software-based techniques
– Applied to word-interleaved cache
  • Complex conf. (with Attraction Buffers; refer to paper)
  • Simple conf. (without Attraction Buffers)
– Applicable to any other cache configuration

Page 5: Talk Outline

Architecture and Scheduling Algorithms
Memory Coherence Problem
Solutions
– Memory Dependent Chains (MDC)
– DDG Transformations (DDGT)
Evaluation
Conclusions

Page 6: Word-Interleaved Distribution

[Figure: four clusters, each with a register file, functional units and its own cache module, linked by register-to-register communication buses and backed by an L2 cache.]

Cache block in L2: TAG W0 W1 W2 W3 W4 W5 W6 W7

Subblocks held by each cache module:
  Cluster 1: TAG W0 W4 (subblock 1)
  Cluster 2: TAG W1 W5
  Cluster 3: TAG W2 W6
  Cluster 4: TAG W3 W7

Access types: local hit, remote hit, local miss, remote miss
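
As a rough illustration of the word-interleaved mapping in the figure, the sketch below assigns consecutive words of a cache block to cache modules in round-robin order, so that W0 and W4 land in the first module, W1 and W5 in the second, and so on. The function names, the 4-byte interleaving factor and the zero-based cluster numbering are assumptions made for this example; it is not the authors' implementation.

```python
NUM_CLUSTERS = 4       # four clusters, as in the slide
INTERLEAVE_BYTES = 4   # assumed interleaving factor (word size)


def home_cluster(address: int) -> int:
    """Zero-based index of the cluster whose cache module maps this address."""
    return (address // INTERLEAVE_BYTES) % NUM_CLUSTERS


def classify_access(address: int, issuing_cluster: int, is_hit: bool) -> str:
    """Label an access with one of the four access types named on the slide."""
    where = "local" if home_cluster(address) == issuing_cluster else "remote"
    return f"{where} {'hit' if is_hit else 'miss'}"


# Words W0..W7 of one 32-byte block, grouped by the module that holds them.
subblocks = {c: [] for c in range(NUM_CLUSTERS)}
for word in range(8):
    subblocks[home_cluster(word * INTERLEAVE_BYTES)].append(f"W{word}")
print(subblocks)
```

Module 0 in the sketch corresponds to cluster 1 on the slide. An access is local only when the issuing cluster is also the home cluster of the address.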

Page 7: Scheduling Techniques

Data layout across cache modules:
  Cluster 1: a[0] a[4]
  Cluster 2: a[1] a[5]
  Cluster 3: a[2] a[6]
  Cluster 4: a[3] a[7]

Original loop:

  for (i=0; i<MAX; i++) {
    ld r3, a[i]
    r4 = OP(r3)
    st r4, b[i]
  }

After unrolling by 4, each load has a 16-byte stride and always accesses the same cache module:

  for (i=0; i<MAX; i+=4) {
    ld r31, a[i]     (stride 16 bytes)
    ld r32, a[i+1]   (stride 16 bytes)
    ld r33, a[i+2]   (stride 16 bytes)
    ld r34, a[i+3]   (stride 16 bytes)
    ...
  }

Techniques:
– Modulo scheduling
– Loop unrolling
– Assignment of latencies
– Padding + Profiling
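
To make the effect of unrolling concrete, the sketch below lists which cache modules a load touches before and after unrolling by 4. It assumes 4-byte array elements, a 4-byte interleaving factor and an array base aligned to a block boundary (which is what padding would provide); `clusters_touched` is a hypothetical helper, not something taken from the talk.

```python
NUM_CLUSTERS = 4
WORD_BYTES = 4  # assumed element size and interleaving factor


def clusters_touched(base: int, first_index: int, step: int, iterations: int) -> set:
    """Cache modules (zero-based) accessed by a load of a[first_index + k*step]."""
    touched = set()
    for k in range(iterations):
        address = base + (first_index + k * step) * WORD_BYTES
        touched.add((address // WORD_BYTES) % NUM_CLUSTERS)
    return touched


# Original loop: the single load walks over every module.
original = clusters_touched(base=0, first_index=0, step=1, iterations=16)

# Loop unrolled by 4: each load has a 16-byte stride and sticks to one module.
unrolled = [clusters_touched(base=0, first_index=j, step=4, iterations=4)
            for j in range(4)]

print(original)
print(unrolled)
```

With a fixed home module per load, the scheduler can place each load in the cluster that owns its data.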

Page 8: Cluster Assignment

Non-memory instructions
• Minimize register communications
• Maximize workload balance

Memory instructions: 2 heuristics
– PrefClus Heuristic
  • Preferred Cluster = most accessed cluster
  • Profiling + Padding
– MinComs Heuristic
  • Minimize register communications
  • Maximize workload balance
  • Post-pass phase to increase local accesses
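
The "preferred cluster = most accessed cluster" rule can be sketched as a simple count over profiled addresses. The profile format, the tie-breaking rule and the function name below are assumptions made for illustration; the talk only states that the preferred cluster comes from profiling.

```python
from collections import Counter

NUM_CLUSTERS = 4
INTERLEAVE_BYTES = 4  # assumed interleaving factor


def preferred_cluster(profiled_addresses):
    """Zero-based cluster whose cache module a memory instruction accesses most."""
    counts = Counter((addr // INTERLEAVE_BYTES) % NUM_CLUSTERS
                     for addr in profiled_addresses)
    # Ties go to the lowest cluster index (an arbitrary choice for this sketch).
    return min(counts, key=lambda c: (-counts[c], c))


# Hypothetical profile of one load: four of six addresses map to module 2.
profile = [8, 24, 40, 56, 12, 0]
print(preferred_cluster(profile))
```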

Page 10: Memory Coherence Problem

[Figure: clusters 1 to 4, each with a cache module (cluster 1 holds a[0] a[4], cluster 4 holds a[3] a[7]), connected through memory buses to the next memory level.]

Schedule (one column per cluster):

            CLUSTER 1        CLUSTER 2  CLUSTER 3  CLUSTER 4
cycle i     -                -          -          store to a[0]
cycle i+1   -                -          -          -
cycle i+2   -                -          -          -
cycle i+3   -                -          -          -
cycle i+4   load from a[0]   -          -          -

The store to a[0] issued in cluster 4 must travel over the memory buses to update a[0] in cluster 1, where the load reads a[0]. If the update has not arrived by cycle i+4, the load reads a stale value.

NON-DETERMINISTIC BUS LATENCY!!!
Sources: remote accesses, misses, replacements, others
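
The hazard can be reduced to a single comparison: does the remote store's update reach the home cache module before the load reads it? The sketch below is a toy model under that assumption. The delay values are hypothetical, and treating an update that arrives in the same cycle as the load as visible is a modelling choice made here, not something stated in the talk.

```python
def load_sees_new_value(store_cycle: int, bus_delay: int, load_cycle: int) -> bool:
    """True if a remote store's update reaches the home module in time.

    The store is issued in a remote cluster at store_cycle and its update
    arrives at the home cache module bus_delay cycles later. The load reads
    the home module at load_cycle.
    """
    return store_cycle + bus_delay <= load_cycle


# Schedule from the slide: store at cycle i (cluster 4), load at cycle i+4 (cluster 1).
i = 0
for bus_delay in (2, 4, 7):  # hypothetical delays; the real one is not known statically
    ok = load_sees_new_value(i, bus_delay, i + 4)
    print(f"bus delay {bus_delay}: load reads {'new' if ok else 'STALE'} a[0]")
```

Because the compiler cannot bound `bus_delay` at schedule time, no fixed distance between the store and the load is safe in this model.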

Page 12: Solutions Outline

Local scheduling solutions applied at a loop granularity
– Memory Dependent Chains (MDC)
– Data Dependence Graph Transformations (DDGT)
  • Store replication
  • Load-store synchronization

Software-based solutions

Applicable to other configurations
– Replicated distributed cache
– MultiVLIW [MICRO00]

Page 13: Memory Dependent Chains

Sets of aliased instructions:
– Memory Dependent Chains (MDC)

Instructions in same set:
– Assigned to same cluster

Restrictions on cluster assignment
– PrefClus: average preferred cluster
– MinComs: minimize comms. when scheduling first node

[Figure: example data dependence graph with nodes n1 (load), n2 (load), n3 (add), n4 (store), n6 (load), n7 (div) and n8 (add), connected by RF, MA and MF edges.]

MF = memory-flow, MA = memory-anti, RF = register-flow
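
One plausible way to build the sets of aliased instructions is to take the connected components of the graph formed by memory dependence edges only. The sketch below does this with a small union-find; the edge list is hypothetical, since the slide's figure does not survive in enough detail to recover the real edges, and the function name is an assumption.

```python
def memory_dependent_chains(memory_nodes, memory_edges):
    """Group memory instructions into chains: connected components of the
    graph formed by memory dependence edges (MF, MA, MO), ignoring direction."""
    parent = {n: n for n in memory_nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for src, dst, _kind in memory_edges:
        parent[find(src)] = find(dst)

    chains = {}
    for n in memory_nodes:
        chains.setdefault(find(n), []).append(n)
    return sorted(sorted(chain) for chain in chains.values())


# Hypothetical memory dependences between the memory nodes of the example graph.
nodes = ["n1", "n2", "n4", "n6"]
edges = [("n1", "n4", "MA"), ("n4", "n2", "MF")]
chains = memory_dependent_chains(nodes, edges)
print(chains)
```

Every instruction in a chain would then be assigned to the same cluster, while an instruction that forms a chain on its own keeps its freedom.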

Page 14: Memory Dependent Chains

[Figure: clusters 1 to 4, each with a cache module (cluster 1 holds a[0] a[4], cluster 4 holds a[3] a[7]), connected through memory buses to the next memory level.]

Schedule (one column per cluster):

            CLUSTER 1        CLUSTER 2  CLUSTER 3  CLUSTER 4
cycle i     -                -          -          store to a[0]
cycle i+1   -                -          -          -
cycle i+2   -                -          -          -
cycle i+3   -                -          -          -
cycle i+4   load from a[0]   -          -          -

With MDC, "store to a[0]" and "load from a[0]" belong to the same chain and are assigned to the same cluster.

Page 15: DDGT: Store Replication

Overcome MEM_FLOW (MF) and MEM_OUT (MO)

[Figure, MF case: storeA → loadB (MF). After store replication, storeA becomes storeA, storeA’, storeA’’ and storeA’’’, with the MF dependence to loadB.]

[Figure, MO case: storeA → storeB (MO). After store replication, storeA becomes storeA, storeA’, storeA’’ and storeA’’’, and storeB becomes storeB, storeB’, storeB’’ and storeB’’’, with the MO dependence between them.]

Legend: local instance, remote instances

Page 16: DDGT: Store Replication

[Figure: clusters 1 to 4, each with a cache module (cluster 1 holds a[0] a[4], cluster 4 holds a[3] a[7]), connected through memory buses to the next memory level.]

Schedule (one column per cluster):

            CLUSTER 1        CLUSTER 2       CLUSTER 3       CLUSTER 4
cycle i     -                -               -               store to a[0]
cycle i+1   store to a[0]    -               store to a[0]   -
cycle i+2   -                -               -               -
cycle i+3   -                store to a[0]   -               -
cycle i+4   load from a[0]   -               -               -

Legend: local instance, remote instances

Increases the number of register communications!!!
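
A minimal model of store replication, assuming (from the "local instance" and "remote instances" labels) that one copy of the store runs in every cluster and only the copy in the cluster that owns the address actually updates its cache module. The toy cache modules, the 4-byte interleaving factor and the function name are all hypothetical.

```python
NUM_CLUSTERS = 4
INTERLEAVE_BYTES = 4  # assumed interleaving factor


def replicate_store(address: int, value: int, modules: list) -> int:
    """Run one store instance per cluster; only the local instance writes.

    Returns the zero-based index of the cluster that performed the update.
    """
    home = (address // INTERLEAVE_BYTES) % NUM_CLUSTERS
    for cluster in range(NUM_CLUSTERS):
        if cluster == home:
            modules[cluster][address] = value  # local instance: performs the write
        # remote instances have no effect in this model
    return home


modules = [dict() for _ in range(NUM_CLUSTERS)]  # one toy cache module per cluster
owner = replicate_store(address=0, value=42, modules=modules)
print(owner, modules)
```

Under this reading the update never crosses the memory buses, so a later load in the home cluster sees it after a fixed local latency. The price is that every instance needs the address and the value in its own register file, which is consistent with the slide's warning about register communications.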

Page 17: DDGT: ld-st Synchronization

Overcome MEM_ANTI (MA) dependences

[Figure: before, loadA → storeB (MA) and loadA → add (RF). After load-store sync., loadA → add (RF) and add → storeB (SYNC).]

Special cases:
– Store is already REG_FLOW dependent on the load
– Impossible recurrences

[Figure, special case: before, loadA → storeC (RF), loadA → storeB (MA), and an MO dependence between storeB and storeC. After load-store sync., a fake consumer ("fake cons") is added: loadA → fake cons (RF) and fake cons → storeB (SYNC); the RF and MO edges remain.]
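
The transformation can be sketched as a rewrite on an edge list: each MA edge from a load to a store is replaced by a SYNC edge from a register-flow consumer of that load to the store, presumably because a consumer cannot issue before the load's data has arrived. When the store already consumes the load, nothing is added. When the only consumers would close a cycle, a fake consumer is created. The edge representation, the node names and the direction chosen for the MO edge are assumptions, and loop-carried distances are ignored.

```python
def reachable(edges, start, target):
    """True if target can be reached from start by following edge direction."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dst for src, dst, _ in edges if src == node)
    return False


def load_store_sync(edges):
    """Replace every MA edge (load -> store) by a SYNC edge from a consumer."""
    kept = [e for e in edges if e[2] != "MA"]
    result = list(kept)
    for load, store, kind in edges:
        if kind != "MA":
            continue
        if (load, store, "RF") in kept:
            continue  # special case: store already REG_FLOW dependent on the load
        # Skip consumers reachable from the store: SYNC would create a recurrence.
        candidates = [dst for src, dst, k in kept
                      if src == load and k == "RF" and not reachable(kept, store, dst)]
        if candidates:
            consumer = candidates[0]
        else:
            consumer = f"fake_cons_{load}"
            result.append((load, consumer, "RF"))
        result.append((consumer, store, "SYNC"))
    return result


simple = load_store_sync([("loadA", "storeB", "MA"), ("loadA", "add", "RF")])
special = load_store_sync([("loadA", "storeC", "RF"),
                           ("loadA", "storeB", "MA"),
                           ("storeB", "storeC", "MO")])
print(simple)
print(special)
```

In the second call, storeC is the only consumer of loadA, but it already depends on storeB, so the sketch falls back to a fake consumer, as in the slide's special case.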

Page 18: MDC Solution: Case Study

Example: loadA, loadB and storeC, linked by MA and MF memory dependences; an add consumes a loaded value (RF).

Impact on compute time
– May increase the IIres

[Figure: modulo reservation tables (MRT) over clusters C1 to C4 holding A, B and C. IIres=2 in the first table and IIres=3 in the second.]

Impact on stall time
– May increase remote accesses
  • Extra stall cycles = 3 cycles / iteration

[Figure: one instruction always accesses data in cluster 1 and another always accesses data in cluster 2. Latency LH = 1 cycle, Latency RH = 5 cycles. Schedule labels: cycle 1, cycle 3.]
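
Both effects can be checked with small arithmetic. The sketch below computes the initiation interval imposed by memory units alone (one per cluster) and the stall seen by a consumer when a load turns out to be a remote hit. It only counts memory operations, so it does not reproduce the slide's IIres=2 starting point, which presumably includes other instructions. Reading "cycle 1" and "cycle 3" as the issue cycles of a load and its consumer is an assumption, as are the function names and cluster assignments.

```python
import math
from collections import Counter


def ii_res_memory(assignment: dict, mem_units_per_cluster: int = 1) -> int:
    """Initiation interval forced by memory units alone: the busiest cluster
    needs ceil(ops / units) cycles per iteration."""
    per_cluster = Counter(assignment.values())
    return max(math.ceil(n / mem_units_per_cluster) for n in per_cluster.values())


def stall_cycles(actual_latency: int, issue_cycle: int, consumer_cycle: int) -> int:
    """Cycles the consumer waits when the load is slower than the schedule assumed."""
    return max(0, actual_latency - (consumer_cycle - issue_cycle))


free = {"loadA": 1, "loadB": 2, "storeC": 3}     # hypothetical free assignment
chained = {"loadA": 1, "loadB": 1, "storeC": 1}  # MDC: whole chain in one cluster
print(ii_res_memory(free), ii_res_memory(chained))

# Remote hit (5 cycles) for a load issued at cycle 1 whose consumer sits at cycle 3.
print(stall_cycles(actual_latency=5, issue_cycle=1, consumer_cycle=3))
```

Under these assumptions the chain of three memory instructions forces an interval of at least 3, and the remote hit costs 3 stall cycles, matching the figures on the slide.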

Page 19: DDGT Solution: Case Study

Impact on compute time
– More instructions (IIres)
  • Store replication
  • Fake consumers (few)
  • Register communications

[Figure: example with loadA and storeB linked by MA and MF dependences, and modulo reservation tables (MRT) over clusters C1 to C4, where X marks a set of memory instructions. IIres=2 in the first table; IIres=3 in the second, where B appears in all four clusters.]

Impact on stall time
– Small
  • New dependences may decrease slack of some memory instructions

Page 21: Evaluation Framework

IMPACT C compiler
• Compile + optimize + memory disambiguation

Mediabench benchmark suite

Benchmark    Profile      Execution
epicdec      test_image   titanic
g721dec      clinton      S_16_44
g721enc      clinton      S_16_44
gsmdec       clinton      S_16_44
gsmenc       clinton      S_16_44
jpegdec      testimg      monalisa
jpegenc      testimg      monalisa
mpeg2dec     mei16v2      tek6
pegwitdec    pegwit       techrep
pegwitenc    pgptest      techrep
pgpdec       pgptext      techrep
pgpenc       pgptest      techrep
rasta        ex5_c1       ex5_c1

Page 22: Evaluation Framework

Word-Interleaved Cache Clustered VLIW Processor

# clusters            4
Functional units      1 FP / cluster + 1 integer / cluster + 1 memory / cluster
Register buses        4 buses running at ½ the core freq.
Memory buses          4 buses running at ½ the core freq.
Cache configuration   8KB, 2-way set-associative, 32 byte blocks; L2 always hits
Cache latencies       Local Hit=1, Remote Hit=5, Local Miss=10, Remote Miss=15
Algorithm             PrefClus and MinComs
Interleaving factor   2 or 4 bytes depending on benchmark

BASELINE: Same architecture but complete freedom when assigning instructions to clusters
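
To show how the cache latencies in the table combine with an access mix, the sketch below computes an average memory latency. The latencies come from the table; the access mix is a made-up example, not a measured result, and `average_latency` is a hypothetical helper.

```python
# Latencies in cycles, taken from the configuration table.
LATENCY = {"local hit": 1, "remote hit": 5, "local miss": 10, "remote miss": 15}


def average_latency(mix: dict) -> float:
    """Average latency of a memory access for a given mix of access types."""
    total = sum(mix.values())
    return sum(LATENCY[kind] * share for kind, share in mix.items()) / total


# Hypothetical mix (percent of accesses), for illustration only.
mix = {"local hit": 60, "remote hit": 30, "local miss": 6, "remote miss": 4}
print(average_latency(mix))
```

Shifting accesses from remote to local hits lowers this average, which is why the share of local accesses matters for stall time.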

Page 23: Local vs. Remote Accesses

[Figure: stacked bars, 0% to 100%, splitting memory accesses into local hits, remote hits, local misses and remote misses for NoRes, MDC and DDGT on epicdec, jpegdec, pegwitdec, pgpdec, rasta and AMEAN.]

Page 24: Execution Time

[Figure: stacked bars of execution time (compute time plus stall time), y axis from 0 to 1.4, for MDC PrefClus, MDC MinComs, DDGT PrefClus and DDGT MinComs on epicdec, jpegdec, pegwitdec, pgpdec, rasta and AMEAN.]

Page 25: Other Configurations

Configuration 1

                 # Buses   Latency
Register buses   2         4
Memory buses     4         2

More pressure on register buses
MDC outperforms DDGT in all cases
MDC requires fewer register communications

Configuration 2

                 # Buses   Latency
Register buses   4         2
Memory buses     2         4

More pressure on memory buses
DDGT outperforms best MDC in several cases: epicdec 17%, pgpdec 20%, pgpenc 9%, rasta 7%…

Page 27: Conclusions

Memory coherence problem
– Two software-based solutions: MDC and DDGT
– Applied to a word-interleaved cache clustered VLIW processor

MDC vs DDGT
– Results depending on architecture configuration
  • MDC outperforms DDGT in most cases
  • DDGT better by up to 20% in specific configuration
– Sets of memory dependent insts. are small
– DDGT freedom in cluster assignment
  • Increases local accesses by 15%, reducing stall time

Page 28: Questions?