Ordered Mesh Network Interconnect (OMNI)
Design & Implementation of In-Network Coherence
Suvinay Subramanian, Advisor: Li-Shiuan Peh
Collaborators: Chia-Hsin Owen Chen, Bhavya Daya, Tushar Krishna, Woo-Cheol Kwon, Sunghyun Park
[Figure: Number of Cores (1 to 128, in powers of two) versus Year (1980 to 2015). Processors plotted: 286, 386, 486, Pentium, P2, Celeron, P3, P4, Itanium, Core, Nehalem, Westmere, Sandybridge, Larrabee, TeraFlops, SCC, Opteron, Barcelona, Bulldozer, PowerPC, Power3, Power4, Power5, Power6, Power7, Cell, Niagara, TILE64, TILE-GX, Octeon III. Interconnect labels: Uniprocessor, Bus, Ring, Mesh.]
Compute + Communicate
• Moore’s Law: “The number of transistors incorporated in a chip will approximately double every 24 months.” Result: increased compute density.
• Shift to multi-core processors: better performance vis-à-vis power efficiency.
• We need a scalable coherence mechanism to maintain a “consistent” view of “shared memory”.
• We need a scalable communication fabric.
Specification
• 36 Power-ISA cores
• 32 KB L1; 128 KB L2
• 6x6 Mesh network: OMNI
Why Snoopy Coherence?
• Efficient cache-to-cache transfers
• No directory overhead
• No indirection latency
Snoopy COherent Research Processor with Interconnect Ordering
Evaluations
Introduction
Walkthrough Example
Channel Width: 16 bytes
Number of Virtual Networks: 3
Number of Virtual Channels: 4 + 2 + 2 = 8
[Figure: pie charts titled Area Breakdown and Average Power. Recovered segment labels: Core 50%, L2 + Cache Controller 40%, Network Interface + Router 10%.]
Critical Path: 1.2 ns
The SCORPIO Processor
OMNI Design
Traditional interconnects (e.g., buses) route all messages to a central ordering point. All nodes see all messages in the same order.
OMNI: The network itself provides the mechanism for global ordering. There is no central ordering point.
Idea: Nodes agree on a global order at synchronized time intervals.
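The ordering idea above can be illustrated with a minimal sketch. It assumes each message is tagged with the time window in which it was injected and with its source node ID, and that every node breaks ties within a window by ascending source ID. That tie-break rule, the function name `global_order`, and the message payloads are illustrative assumptions, not details taken from the poster.

```python
from collections import defaultdict

def global_order(arrivals):
    """Order messages by (time window, source node), ignoring arrival order.

    arrivals: iterable of (window, source, payload) tuples, in whatever
    order one particular node happened to receive them.
    """
    by_window = defaultdict(list)
    for window, source, payload in arrivals:
        by_window[window].append((source, payload))

    ordered = []
    for window in sorted(by_window):
        # Assumed rule: within a window, lower source ID goes first.
        for source, payload in sorted(by_window[window]):
            ordered.append((window, source, payload))
    return ordered

# Two nodes receive the same messages in different orders.
node_a = [(0, 3, "GetS X"), (0, 1, "GetM Y"), (1, 2, "GetS Y")]
node_b = [(0, 1, "GetM Y"), (1, 2, "GetS Y"), (0, 3, "GetS X")]

order_a = global_order(node_a)
order_b = global_order(node_b)
print(order_a == order_b)
```

Because every node applies the same deterministic rule to the same set of messages per window, each arrives at the same order without consulting a central arbiter.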
[Figure: Router Microarchitecture]
[Figure: Network Interface Block Diagram]
[Figure: Normalized Runtime of OMNI against directory protocol. Y-axis: Normalized Runtime, 0 to 1.]
Technology Node: 45 nm commercial
Lookahead bypassing provides an effective single-cycle path per hop when possible.
A reserved VC (rVC) provides an escape path, preventing deadlock.
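One way to picture an escape path is as a virtual-channel allocation rule. The sketch below assumes that ordinary VCs may be granted to any packet, while the reserved VC is granted only to the packet that is next in the global order, so that packet can always make progress even when all other VCs are full. This allocation rule, the function `allocate_vc`, and its parameters are hypothetical illustrations; the poster states only that the rVC provides an escape path.

```python
def allocate_vc(free_normal_vcs, rvc_free, packet_seq, expected_seq):
    """Pick a virtual channel for a packet, or return None if it must wait.

    free_normal_vcs: list of free ordinary VC ids (mutated on allocation).
    rvc_free: True if the reserved VC is currently unoccupied.
    packet_seq: position of this packet in the global order (assumed tag).
    expected_seq: position the downstream node is waiting for (assumed).
    """
    if free_normal_vcs:
        return ("normal", free_normal_vcs.pop())
    # Assumed rule: only the next-in-order packet may take the reserved VC.
    if rvc_free and packet_seq == expected_seq:
        return ("reserved", "rVC")
    return None

# All ordinary VCs are full; only the expected packet gets through.
blocked = allocate_vc([], True, packet_seq=6, expected_seq=5)
escaped = allocate_vc([], True, packet_seq=5, expected_seq=5)
normal = allocate_vc([2], True, packet_seq=6, expected_seq=5)
```

Under this assumed rule, later packets cannot occupy every buffer ahead of the one packet that the receiver must consume next.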
Filter redundant broadcasts to save bandwidth and power.
Routers maintain & propagate sharing information.
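A minimal sketch of broadcast filtering follows. It assumes each router keeps, per output port, the set of addresses that have a sharer somewhere in that direction, and forwards a snoop only toward ports where a sharer may exist. The class `FilteringRouter`, its methods, and the per-address granularity are illustrative assumptions; a hardware design would more plausibly track coarse regions with small tables.

```python
class FilteringRouter:
    """Toy router that drops broadcasts toward directions with no sharers."""

    def __init__(self, ports):
        # For each output port: addresses cached somewhere in that direction.
        self.sharers_behind = {port: set() for port in ports}

    def note_sharer(self, port, address):
        """Record sharing information propagated from a neighbor."""
        self.sharers_behind[port].add(address)

    def forward_ports(self, address, in_port):
        """Ports a broadcast for `address` is forwarded to (never back to in_port)."""
        return [
            port
            for port, addresses in self.sharers_behind.items()
            if port != in_port and address in addresses
        ]

router = FilteringRouter(["N", "E", "S", "W"])
router.note_sharer("E", 0x40)

shared = router.forward_ports(0x40, in_port="W")    # sharer lies to the east
unshared = router.forward_ports(0x80, in_port="W")  # no sharer in any direction
```

In this sketch a broadcast for an address with no recorded sharers is not forwarded at all, which is where the bandwidth and power savings would come from.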