Enric Gibert, Jesús Sánchez, Antonio González

UPC

CGO’03, San Francisco

March 2003

Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache


Enric Gibert¹, Jesús Sánchez², Antonio González¹,²

¹ Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona

² Intel Barcelona Research Center, Intel Labs, Barcelona


Motivation

Capacity- vs. communication-bound

Clustered microarchitectures
– Simpler + faster
– Power consumption
– Communications not homogeneous

Clustering in the embedded/DSP domain


Clustered Microarchitectures

[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file (Reg. File) and functional units (FUs), connected by register-to-register communication buses. In the centralized organization, all clusters reach a single L1 cache over memory buses, backed by an L2 cache. In the distributed organization, each cluster has its own L1 cache module; the modules are connected by memory buses and backed by the L2 cache.]


Contributions

Distribution of data cache
– Architecture design + data mapping
  • Word-interleaved scheme [ICS’02]
– Appropriate scheduling techniques [MICRO’02]
– Memory coherence

Scheduling techniques for mem. coherence
– Local software-based techniques
– Applied to word-interleaved cache
  • Complex conf. (with Attraction Buffers; refer to paper)
  • Simple conf. (without Attraction Buffers)
– Applicable to any other cache configuration


Talk Outline

Architecture and Scheduling Algorithms

Memory Coherence Problem

Solutions
– Memory Dependent Chains (MDC)
– DDG Transformations (DDGT)

Evaluation

Conclusions


Word-Interleaved Distribution

[Figure: four clusters, each with a register file, functional units and a cache module, backed by the L2 cache. A cache block (TAG W0 W1 W2 W3 W4 W5 W6 W7) is split into four subblocks, one per cache module:
  cluster 1: TAG W0 W4
  cluster 2: TAG W1 W5
  cluster 3: TAG W2 W6
  cluster 4: TAG W3 W7
Each access is classified as a local hit, remote hit, local miss or remote miss.]
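The word-to-module mapping shown on this slide can be sketched in a few lines. This is only an illustration: the function names, the 0-based cluster numbering and the 4-byte interleaving factor are assumptions made for the example, not details taken from the talk.

```python
def home_cluster(addr, interleave_bytes=4, n_clusters=4):
    """Return the 0-based cluster whose cache module holds byte address addr.

    Consecutive words go to consecutive clusters, wrapping around.
    """
    return (addr // interleave_bytes) % n_clusters


def classify(addr, issuing_cluster, hit, interleave_bytes=4, n_clusters=4):
    """Label an access as local/remote hit/miss from the issuing cluster's view."""
    is_local = home_cluster(addr, interleave_bytes, n_clusters) == issuing_cluster
    where = "local" if is_local else "remote"
    return f"{where} {'hit' if hit else 'miss'}"


# Words W0..W7 of one 32-byte block with 4-byte words (assumed sizes):
# W0 and W4 share a module, W1 and W5 share the next one, and so on.
block_homes = [home_cluster(4 * w) for w in range(8)]
```

With these assumptions, `block_homes` places W0 and W4 in the first module and W3 and W7 in the last, which matches the subblock layout on the slide.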


Scheduling Techniques

[Figure: array a distributed over the four cache modules:
  cluster 1: a[0] a[4]
  cluster 2: a[1] a[5]
  cluster 3: a[2] a[6]
  cluster 4: a[3] a[7]]

Original loop:

    for (i=0; i<MAX; i++) {
      ld r3, a[i]
      r4 = OP(r3)
      st r4, b[i]
    }

After unrolling, ld r3, a[i] becomes four loads, each with a stride of 16 bytes:

    for (i=0; i<MAX; i+=4) {
      ld r31, a[i]     (stride 16 bytes)
      ld r32, a[i+1]   (stride 16 bytes)
      ld r33, a[i+2]   (stride 16 bytes)
      ld r34, a[i+3]   (stride 16 bytes)
      ...
    }

Modulo scheduling

Loop unrolling

Assignment of latencies

Padding + Profiling
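Why unrolling helps can be checked with a small sketch. It assumes 4-byte elements, a 4-byte interleaving factor, four clusters and an array that starts at a module boundary (which is what padding is meant to guarantee); all names are hypothetical.

```python
WORD = 4       # bytes per array element and interleaving factor (assumed)
CLUSTERS = 4   # number of clusters / cache modules


def cluster_of(addr):
    """0-based cluster whose cache module holds byte address addr."""
    return (addr // WORD) % CLUSTERS


def clusters_touched(offset, step, iterations=16, base=0):
    """Clusters accessed by `ld a[i + offset]` when i advances by `step` elements."""
    return {cluster_of(base + WORD * (it * step + offset)) for it in range(iterations)}


# Original loop: a single load walks over every module.
original = clusters_touched(offset=0, step=1)

# Unrolled by 4: each of the four loads has a 16-byte stride,
# so it keeps returning to one and the same module.
unrolled = [clusters_touched(offset=k, step=4) for k in range(4)]
```

In this setting the single load of the original loop touches all four modules, while each load of the unrolled loop touches exactly one, so the scheduler can place it in the cluster where its data lives.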


Cluster Assignment

Non-memory instructions
  • Minimize register communications
  • Maximize workload balance

Memory instructions, 2 heuristics:
– PrefClus Heuristic
  • Preferred Cluster = most accessed cluster
  • Profiling + Padding
– MinComs Heuristic
  • Minimize register communications
  • Maximize workload balance
  • Post-pass phase to increase local accesses
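As a rough illustration of "preferred cluster = most accessed cluster", the sketch below derives a preferred cluster from a list of profiled addresses for one memory instruction. The function name, the 0-based numbering, the 4-byte interleaving factor and the tie-breaking rule are assumptions for the example, not the authors' implementation.

```python
from collections import Counter


def preferred_cluster(profiled_addresses, interleave_bytes=4, n_clusters=4):
    """Return the cluster that a memory instruction accessed most often in the profile."""
    counts = Counter((a // interleave_bytes) % n_clusters for a in profiled_addresses)
    # Ties are broken toward the lowest cluster index (arbitrary choice for this sketch).
    return max(range(n_clusters), key=lambda c: (counts[c], -c))
```

For example, a load whose profiled addresses are 0, 16, 32 and 4 touched the first module three times and the second once, so its preferred cluster is the first one.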


Memory Coherence Problem

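[Figure: four clusters with their cache modules (cluster 1 holds a[0] a[4]; cluster 4 holds a[3] a[7]), connected by memory buses to each other and to the next memory level.]

Schedule:

              cluster 1        cluster 2   cluster 3   cluster 4
  cycle i     -                -           -           store to a[0]
  cycle i+1   -                -           -           -
  cycle i+2   -                -           -           -
  cycle i+3   -                -           -           -
  cycle i+4   load from a[0]   -           -           -

The store to a[0] issued in cluster 4 has to travel over the memory buses to update a[0] in the cache module of cluster 1, where the load reads a[0].

The memory buses also carry:
  • Remote accesses
  • Misses
  • Replacements
  • Others

NON-DETERMINISTIC BUS LATENCY! If the update arrives after the load has read a[0], the load sees a stale value.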
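The race on this slide can be reproduced with a toy timing model. The function below is a minimal sketch under assumed semantics: the remote store's update reaches the home cache module `bus_latency` cycles after it is issued, and a load issued in the same cycle as the update's arrival already sees the new value. None of these names or rules come from the talk.

```python
def load_value(store_cycle, bus_latency, load_cycle, old=0, new=1):
    """Value read by a local load of a[0] after a remote store to a[0].

    The update reaches the home cache module at store_cycle + bus_latency.
    """
    arrival = store_cycle + bus_latency
    return new if arrival <= load_cycle else old


# Store issued at cycle i = 0, load scheduled at cycle i + 4.
on_time = load_value(store_cycle=0, bus_latency=4, load_cycle=4)   # bus was free
too_late = load_value(store_cycle=0, bus_latency=7, load_cycle=4)  # bus was busy
```

The schedule is fixed at compile time, but the bus latency is not, so the same code reads the new value in one run and the stale one in another.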


Solutions Outline

Local scheduling solutions applied at a loop granularity
– Memory Dependent Chains (MDC)
– Data Dependence Graph Transformations (DDGT)
  • Store replication
  • Load-store synchronization

Software-based solutions

Applicable to other configurations
– Replicated distributed cache
– MultiVLIW [MICRO00]

…


Memory Dependent Chains

Sets of aliased instructions:
– Memory Dependent Chains (MDC)

Instructions in same set:
– Assigned to same cluster

Restrictions on cluster assignment
– PrefClus: average preferred cluster
– MinComs: minimize comms. when scheduling first node

[Figure: example data dependence graph with nodes n1 (load), n2 (load), n3 (add), n4 (store), n6 (load), n7 (div) and n8 (add), connected by RF, MA and MF edges.
MF = memory-flow, MA = memory-anti, RF = register-flow]
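One way to picture a memory dependent chain is as a connected component of the graph formed by the memory dependence edges alone. The sketch below groups memory instructions that way with a small union-find; the instruction names and edges in the usage note are made up for illustration and are not the graph from the slide.

```python
def memory_dependent_chains(mem_instrs, mem_edges):
    """Group memory instructions linked (directly or transitively) by memory dependences.

    mem_instrs: iterable of instruction names.
    mem_edges:  iterable of (a, b) pairs, one per MF/MA/MO dependence.
    Returns a sorted list of chains, each a sorted list of names.
    """
    parent = {n: n for n in mem_instrs}

    def find(n):
        # Follow parent links to the representative, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in mem_edges:
        parent[find(a)] = find(b)

    chains = {}
    for n in parent:
        chains.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in chains.values())
```

For instance, with hypothetical instructions `ld1`, `ld2`, `st1`, `ld3` and dependences `ld1`-`st1` and `st1`-`ld3`, the result is one chain of three instructions plus `ld2` on its own; every instruction in a chain would then be assigned to the same cluster.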


Memory Dependent Chains

[Figure: the same four-cluster example (cluster 1 holds a[0] a[4]; cluster 4 holds a[3] a[7]; memory buses between the cache modules). The store to a[0] and the load from a[0] belong to the same chain and are now scheduled in the same cluster.]


DDGT: Store Replication

Overcome MEM_FLOW (MF) and MEM_OUT (MO)

[Figure: store replication.
MF case: storeA -MF-> loadB becomes storeA, storeA’, storeA’’, storeA’’’ -MF-> loadB.
MO case: storeA -MO-> storeB becomes storeA, storeA’, storeA’’, storeA’’’ -MO-> storeB, storeB’, storeB’’, storeB’’’.
One copy of each replicated store is the local instance; the others are remote instances.]
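A minimal sketch of the idea behind store replication follows. It assumes that one instance of the store is issued in every cluster and that only the instance in the address's home cluster actually writes, while the remote instances do nothing; this behaviour, the 4-byte interleaving and all names are assumptions for the example rather than details stated on the slide.

```python
N_CLUSTERS = 4
WORD = 4  # interleaving factor in bytes (assumed)


def home(addr):
    """0-based cluster whose cache module holds byte address addr."""
    return (addr // WORD) % N_CLUSTERS


def replicated_store(modules, addr, value):
    """Run one store instance per cluster; return the clusters that actually wrote."""
    writers = []
    for cluster in range(N_CLUSTERS):
        if home(addr) == cluster:
            # Local instance: the data lives here, so the update is performed.
            modules[cluster][addr] = value
            writers.append(cluster)
        # Remote instances take an issue slot but leave their module untouched.
    return writers


modules = [dict() for _ in range(N_CLUSTERS)]
writers = replicated_store(modules, addr=0, value=42)
```

Because the update is always performed locally, it never crosses a memory bus, so a later load scheduled in the home cluster observes it regardless of bus latency. The price is three extra store instances per replicated store.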


DDGT: Store Replication

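[Figure: the same four-cluster example (cluster 1 holds a[0] a[4]; cluster 4 holds a[3] a[7]; memory buses between the cache modules), with one local instance and several remote instances of the store.]

Schedule (rows recoverable from the slide):

              cluster 1       cluster 2       cluster 3       cluster 4
  cycle i+1   store to a[0]   -               store to a[0]   -
  cycle i+2   -               -               -               -
  cycle i+3   -               store to a[0]   -               -

Increases the number of register communications!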


DDGT: ld-st Synchronization

Overcome MEM_ANTI (MA) dependences

[Figure: load-store synchronization. Before: loadA -MA-> storeB and loadA -RF-> add. After: loadA -RF-> add -SYNC-> storeB.]

Special cases:
– Store is already REG_FLOW dependent on the load
– Impossible recurrences

[Figure: special case with loadA, storeC and storeB connected by RF, MA and MO edges. After load-store synchronization, a fake consumer (fake cons) is RF dependent on loadA and linked to storeB by a SYNC edge.]
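The transformation can be sketched as a rewrite over a list of dependence edges. This is a simplified illustration under stated assumptions: the MA edge is dropped once a SYNC edge enforces the order, the first register-flow consumer of the load is used as the synchronization point, a fake consumer is created only when the load has no consumer, and the "impossible recurrences" case is not modelled. Edge encoding and names are hypothetical.

```python
def synchronize(edges):
    """Replace each MA (load -> store) edge by a SYNC edge from a consumer of the load.

    edges: list of (src, dst, kind) tuples, kind being "RF", "MA", "MF" or "MO".
    Returns a new edge list.
    """
    out = [e for e in edges if e[2] != "MA"]
    for load, store, kind in edges:
        if kind != "MA":
            continue
        if (load, store, "RF") in edges:
            # Special case: the store already waits for the load through a register.
            continue
        consumers = [dst for src, dst, k in edges if src == load and k == "RF"]
        if consumers:
            consumer = consumers[0]
        else:
            # No real consumer: add a fake one that just reads the loaded register.
            consumer = f"fake_cons({load})"
            out.append((load, consumer, "RF"))
        out.append((consumer, store, "SYNC"))
    return out
```

Applied to the first figure's shape, `synchronize([("loadA", "add", "RF"), ("loadA", "storeB", "MA")])` yields the RF edge plus a SYNC edge from `add` to `storeB`: once `add` has its operand, the load has completed, so the store can proceed from any cluster.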


MDC Solution: Case Study

Impact on compute time
– May increase the IIres

Impact on stall time
– May increase remote accesses
  • Extra stall cycles = 3 cycles / iteration

[Figure: example with loadA (always accesses data in cluster 1), loadB (always accesses data in cluster 2), storeC and add, connected by MA, MF and RF edges. Modulo reservation tables (MRT) over clusters C1 C2 C3 C4: IIres=2 when A, B and C can be spread over the clusters, IIres=3 when A, B and C are placed in the same cluster. Latency LH = 1 cycle, latency RH = 5 cycles; cycle 1 and cycle 3 are marked in the schedule.]
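The effect on IIres can be illustrated with the usual resource bound: the busiest memory unit determines the minimum initiation interval. The sketch below counts memory instructions per cluster, assuming one memory unit per cluster and considering memory instructions only; the two cluster assignments in the usage note are hypothetical placements chosen for illustration.

```python
from collections import Counter
from math import ceil


def ii_res(assignment, units_per_cluster=1):
    """Resource-constrained initiation interval for memory instructions only.

    assignment maps each memory instruction to the cluster it is scheduled in.
    """
    per_cluster = Counter(assignment.values())
    return max(ceil(n / units_per_cluster) for n in per_cluster.values())


# Hypothetical free assignment: A and C share cluster 1, B goes to cluster 2.
free = ii_res({"loadA": 1, "storeC": 1, "loadB": 2})

# MDC forces the whole chain into a single cluster.
chained = ii_res({"loadA": 1, "storeC": 1, "loadB": 1})
```

Under these assumptions the free assignment gives an IIres of 2 and the chained one an IIres of 3, which is the kind of increase the slide warns about.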


DDGT Solution: Case Study

Impact on compute time
– More instructions (IIres)
  • Store replication
  • Fake consumers (few)
  • Register communications

Impact on stall time
– Small
  • New dependences may decrease slack of some memory instructions

[Figure: example with loadA and storeB connected by MA and MF edges. Modulo reservation tables (MRT) over clusters C1 C2 C3 C4: IIres=2 before the transformation, IIres=3 once storeB is replicated in every cluster. X = set of memory instructions.]


Evaluation Framework

IMPACT C compiler
  • Compile + optimize + memory disambiguation

Mediabench benchmark suite

  Benchmark    Profile      Execution
  epicdec      test_image   titanic
  g721dec      clinton      S_16_44
  g721enc      clinton      S_16_44
  gsmdec       clinton      S_16_44
  gsmenc       clinton      S_16_44
  jpegdec      testimg      monalisa
  jpegenc      testimg      monalisa
  mpeg2dec     mei16v2      tek6
  pegwitdec    pegwit       techrep
  pegwitenc    pgptest      techrep
  pgpdec       pgptext      techrep
  pgpenc       pgptest      techrep
  rasta        ex5_c1       ex5_c1


Evaluation Framework

Word-Interleaved Cache Clustered VLIW Processor

  Number of clusters:    4
  Functional units:      1 FP / cluster + 1 integer / cluster + 1 memory / cluster
  Register buses:        4 buses running at ½ the core freq.
  Memory buses:          4 buses running at ½ the core freq.
  Cache configuration:   8KB, 2-way set-associative, 32-byte blocks; L2 always hits
  Cache latencies:       Local Hit = 1, Remote Hit = 5, Local Miss = 10, Remote Miss = 15
  Algorithm:             PrefClus and MinComs
  Interleaving factor:   2 or 4 bytes depending on benchmark
  BASELINE:              Same architecture but complete freedom when assigning instructions to clusters
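To see how the four latencies combine, the sketch below computes an average memory access latency from a breakdown of accesses by type. The latencies are the ones listed above; the function name and the example breakdown are purely hypothetical and are not results from the talk.

```python
# Latencies in cycles, as listed in the configuration.
LATENCY = {"local hit": 1, "remote hit": 5, "local miss": 10, "remote miss": 15}


def average_latency(fractions):
    """Weighted average access latency for a breakdown whose fractions sum to 1."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(LATENCY[kind] * share for kind, share in fractions.items())


# Hypothetical breakdown: half the accesses are local hits, half remote hits.
example = average_latency({"local hit": 0.5, "remote hit": 0.5})
```

Shifting accesses from remote to local hits lowers this average quickly, since a remote hit costs five times as much as a local one.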


Local vs. Remote Accesses

[Figure: stacked bars from 0% to 100% breaking down memory accesses into local hits, remote hits, local misses and remote misses, for NoRes, MDC and DDGT, on epicdec, jpegdec, pegwitdec, pgpdec, rasta and AMEAN.]


Execution Time

[Figure: stacked bars of execution time (axis from 0 to 1.4), split into compute time and stall time, for MDC PrefClus, MDC MinComs, DDGT PrefClus and DDGT MinComs, on epicdec, jpegdec, pegwitdec, pgpdec, rasta and AMEAN.]


Other Configurations

Configuration 1

                    # Buses   Latency
  Register buses    2         4
  Memory buses      4         2

More pressure on register buses
MDC outperforms DDGT in all cases
MDC requires fewer register communications

Configuration 2

                    # Buses   Latency
  Register buses    4         2
  Memory buses      2         4

More pressure on memory buses
DDGT outperforms best MDC in several cases: epicdec 17%, pgpdec 20%, pgpenc 9%, rasta 7%…


Conclusions

Memory coherence problem
– Two software-based solutions: MDC and DDGT
– Applied to a word-interleaved cache clustered VLIW processor

MDC vs DDGT
– Results depend on the architecture configuration
  • MDC outperforms DDGT in most cases
  • DDGT is better by up to 20% in a specific configuration
– Sets of memory dependent insts. are small
– DDGT gives freedom in cluster assignment
  • Increases local accesses by 15%, which reduces stall time


Questions?
