# Architecting Optically-Controlled Phase Change Memory

ADITYA NARAYAN, Boston University, USA

YVAIN THONNART, Univ. Grenoble Alpes, CEA, List, France

PASCAL VIVET, Univ. Grenoble Alpes, CEA, List, France

AYSE K. COSKUN, Boston University, USA

AJAY JOSHI, Boston University, USA


Phase Change Memory (PCM) is an attractive candidate for main memory as it offers non-volatility and zero leakage power, while providing higher cell densities, longer data retention time, and higher capacity scaling compared to DRAM. In PCM, data is stored in the crystalline or amorphous state of the phase change material. The typical electrically-controlled PCM (EPCM), however, suffers from longer write latency and higher write energy compared to DRAM and limited multi-level cell (MLC) capacities. These challenges limit the performance of data-intensive applications running on computing systems with EPCMs.

Recently, researchers demonstrated optically-controlled PCM (OPCM) cells, with support for 5 bits/cell in contrast to 2 bits/cell in EPCM. These OPCM cells can be accessed directly with optical signals that are multiplexed in high-bandwidth-density silicon-photonic links. The higher MLC capacity in OPCM and the direct cell access using optical signals enable an increased read/write throughput and lower energy per access than EPCM. However, due to the direct cell access using optical signals, OPCM systems cannot be designed using conventional memory architecture. We need a complete redesign of the memory architecture that is tailored to the properties of OPCM technology.

This paper presents the design of a unified network and main memory system called COSMOS that combines OPCM and silicon-photonic links to achieve high memory throughput. COSMOS is composed of a hierarchical multi-banked OPCM array with novel read and write access protocols. COSMOS uses an Electrical-Optical-Electrical (E-O-E) control unit to map standard DRAM read/write commands (sent in electrical domain) from the memory controller onto optical signals that access the OPCM cells. Our evaluation of a 2.5D-integrated system containing a processor and COSMOS demonstrates 2.14× average speedup across graph and HPC workloads compared to an EPCM system. COSMOS consumes 3.8× lower read energy-per-bit and 5.97× lower write energy-per-bit compared to EPCM. COSMOS is the first non-volatile memory that provides performance and energy consumption comparable to DDR5, in addition to increased bit density, higher area efficiency and improved scalability.

CCS Concepts: • Hardware  $\rightarrow$  Emerging optical and photonic technologies; Emerging architectures.

Additional Key Words and Phrases: phase change memory, silicon-photonics, 2.5D computing system, non-volatile memory

## **ACM Reference Format:**

Aditya Narayan, Yvain Thonnart, Pascal Vivet, Ayse K. Coskun, and Ajay Joshi. 2021. Architecting Optically-Controlled Phase Change 

This is a new paper, not an extension of a conference paper. This work was funded by NSF CCF 2131127 and NSF CCF 1716352 grants.

Authors' addresses: Aditya Narayan, adityan@bu.edu, Boston University, 8 Saint Mary's Street, Boston, MA, USA, 02215; Yvain Thonnart, yvain.thonnart@cea.fr, Univ. Grenoble Alpes, CEA, List, Grenoble, France; Pascal Vivet, yvain.thonnart@cea.fr, Univ. Grenoble Alpes, CEA, List, Grenoble, France; Ayse K. Coskun, acoskun@bu.edu, Boston University, 8 Saint Mary's Street, Boston, MA, USA, 02215; Ajay Joshi, joshi@bu.edu, Boston University, 8 Saint Mary's Street, Boston, MA, USA, 02215.

© 2021 Association for Computing Machinery.

Manuscript submitted to ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

## 1 INTRODUCTION

Today's data-driven applications that use graph processing [30, 53, 56, 79], machine learning [15, 29, 91], or privacy-preserving paradigms [3, 19, 82] demand memory sizes on the order of hundreds of *GBs* and bandwidths on the order of *TB/s*. The widely-used main memory technology, DRAM, is facing critical technology scaling challenges and fails to meet the increasing bandwidth and capacity demands of these data-driven applications [37, 40, 41, 48, 58, 96]. Phase Change Memory (PCM) is emerging as a class of non-volatile memory (NVM) that is a promising alternative to DRAM [33, 39, 46, 47, 71, 72]. PCMs outperform other NVM candidates owing to their higher reliability, increased bit density, and better write endurance [13, 16, 61, 93].

In PCMs, data is stored in the state of the phase change material, i.e., crystalline (logic 1) or amorphous (logic 0) [64, 94]. A SET operation triggers a transition to crystalline state, and a RESET operation triggers a transition to amorphous state. PCMs also enable multi-level cells (MLC) using the partially crystalline states. Higher MLC capacity enables increased bit density (*bits/mm<sup>2</sup>*). PCM cells are typically controlled electrically (we refer to them as EPCM cells), where different PCM states have distinct resistance values. EPCM cells are SET or RESET by passing the corresponding current through the phase change material (via the bitline) to trigger the desired state transition. The state of the EPCM cells is read out by passing a read current and measuring the voltage on the bitline. Main memory systems using EPCM cells are designed using the same microarchitecture and read/write access protocol as DRAM systems [45, 85]. EPCM systems, however, experience resistance drift over time and so are limited to 2 bits/cell [13, 17], have $3-4\times$ higher write latency than DRAM leading to lower performance [5, 45], consume high power due to the need for large on-chip charge pumps [35, 66, 90], and have lower lifetime than DRAM due to faster cell wearout [70].

Recent advances in device research have demonstrated optically-controlled PCM cells (we refer to them as OPCM cells) [18, 26, 27, 78]. OPCM cells exhibit higher MLC capacity than EPCM cells (up to 5 *bits/cell* [52]). Moreover, high-bandwidth-density silicon-photonic links [84, 87], which are being developed for processor-to-memory communication, can directly access these OPCM cells, thereby yielding higher throughput and lower energy-per-access than EPCM. These two factors make OPCM a more attractive candidate for main memory than EPCM.

Given that in OPCM the optical signals in silicon-photonic links directly access the OPCM cells, the traditional row-buffer based memory microarchitecture and the read/write access protocol encounter critical design challenges when adapted for OPCM. We need a complete redesign of the memory microarchitecture and a novel access protocol that is tailored to the OPCM cell technology. In this paper, we propose a COmbined System of Optical Phase Change Memory and Optical LinkS, COSMOS, which integrates the OPCM technology and the silicon-photonic link technology, thereby providing seamless high-bandwidth access from the processor to a high-density memory. Figure 1 shows a computing system with COSMOS. COSMOS includes a hierarchical multi-banked OPCM array, E-O-E control unit, silicon-photonic links, and laser sources. The multi-banked OPCM array uses 3D optical integration to stack multiple banks vertically, with 1 bank/layer. The cells in the OPCM array are directly accessed using silicon-photonic links that carry optical signals, thereby eliminating the need for electrical-optical (E-O) and optical-electrical (O-E) conversion in the OPCM array. These optical signals are generated by an E-O-E control unit that serves as an intermediary between the memory controller (MC) in the processor and the OPCM array. This E-O-E control unit is responsible for mapping the standard DRAM protocol commands sent by the MC onto optical signals, and then sending these optical signals to the OPCM array.




Fig. 1. Overview of a 2.5D-integrated computing system with OPCM array stack as the main memory, E-O-E control unit chiplet, processor chiplet, and laser sources chiplet.<sup>1</sup>

The major contributions of our work are as follows:

- (1) We architect COSMOS, which consists of a hierarchical multi-banked OPCM array, where the cells are accessed directly using optical signals in silicon-photonic links. The OPCM array design combines wavelength-division-multiplexing (WDM) and mode-division-multiplexing (MDM) properties of optical signals to deliver high memory bandwidth. Moreover, the OPCM array contains only passive optical elements and does not consume power, thus providing cost and efficiency advantages.
- (2) We propose a novel mechanism for read and write operation of cache lines in COSMOS. A cache line is interleaved across multiple banks in the OPCM array to enable high-throughput access. The write data is encoded in the intensity of optical signals that uniquely address the OPCM cell. The readout of an OPCM cell uses a 3-step operation that measures the attenuation of the optical signal transmitted through the cell, where the attenuation corresponds to a predetermined bit pattern. Since the read operation is destructive, we design an opportunistic writeback operation of the read data to restore the OPCM cell state.
- (3) We design an E-O-E control unit to interface COSMOS with the processor. This E-O-E control unit receives standard
   DRAM commands from the processor, and converts them into the OPCM-specific address, data, and control signals
   that are mapped onto optical signals. These optical signals are then used to read/write data from/to the OPCM array.
   The responses from the OPCM array are converted by the E-O-E control unit back into standard DRAM protocol
   commands that are sent to the processor.

Evaluation of a 2.5D system with a multi-core processor and COSMOS demonstrates 2.15× higher write throughput and 2.09× higher read throughput compared to an equivalent system with EPCM. This increased memory throughput in COSMOS reduces the memory latency by 33%. For graph and high performance computing (HPC) workloads, when compared to EPCM, COSMOS has 2.14× better performance, 3.8× lower read energy-per-bit, and 5.97× lower write energy-per-bit. Moreover, COSMOS provides a scalable and non-volatile alternative to DDR5 DRAM systems, with similar performance and energy consumption for read and write accesses. With DRAM technology undergoing critical scaling challenges, COSMOS presents the first non-volatile main memory system with improved scalability, increased bit density, high area efficiency, and performance and energy consumption comparable to DDR5 DRAM.

<sup>1</sup>COSMOS-based computing system is agnostic of the integration technology. However, 3D-integrated systems raise thermal concerns and 2D systems result in large system footprint and communication overheads.



Fig. 2. (a) 3D view of GST-based PCM cell. (b) Cross-sectional view of GST deposited on a Si<sub>3</sub>N<sub>4</sub> waveguide.

# 2 BACKGROUND

In this section, we discuss the basic operation of an OPCM cell along with its properties, and the silicon-photonic links that enable optical signals to directly access the OPCM cells.

# 2.1 OPCM Cell

$Ge_2Sb_2Te_5$ (GST) is a well-known phase change material that exhibits high contrast in the electrical property (resistance) and the optical property (refractive index) between its two states, in addition to long data retention time and nanoscale size [55, 75, 94]. Thus, GST has been widely used as a storage element in a PCM cell (EPCM and OPCM cells). An OPCM cell consists of only a GST element and, unlike an EPCM cell, does not use a separate access transistor. Figure 2 shows the structure of an OPCM cell, where the GST is integrated on a waveguide [52, 78]. The waveguides are fabricated using a $Si_3N_4$ layer deposited over a $SiO_2$ layer [51]. The GST layer is covered with a layer of Indium-Tin-Oxide (ITO) to prevent oxidation. The optical signals to read and write the OPCM cell lie in the C band (1530nm - 1565nm) and L band (1565nm - 1625nm) of the telecommunication spectrum.

# 2.2 Write Operation in OPCM Cells

For write operation, the optical signal traversing through the waveguide is coupled to the GST element. The energy of this optical signal heats the GST element and triggers a state transition. For RESET operation, i.e., switching the GST element to an amorphous state (a-GST), an optical pulse of 180pJ energy is applied to the GST element for 25ns [52]. For SET operation, i.e., switching the GST element to a fully crystalline state (c-GST), an optical pulse with an energy of 130pJ is applied to the GST element for 250ns [52]. The transition of the GST state to a partially crystalline state requires different values of pulse energies (60pJ - 130pJ) applied for varying durations (50ns - 250ns) [52].
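As a rough check on these figures, the average optical power a pulse must deliver is its energy divided by its duration. The sketch below uses only the RESET and SET values quoted above; the function name is illustrative, and the rectangular pulse shape is an assumption that the cited measurements do not state.

```python
def average_pulse_power_mw(energy_pj: float, duration_ns: float) -> float:
    """Average power of a rectangular pulse. pJ divided by ns gives mW."""
    if duration_ns <= 0:
        raise ValueError("duration must be positive")
    return energy_pj / duration_ns


# Values quoted in the text for a GST-based OPCM cell [52].
reset_power_mw = average_pulse_power_mw(energy_pj=180.0, duration_ns=25.0)
set_power_mw = average_pulse_power_mw(energy_pj=130.0, duration_ns=250.0)

print(f"RESET: {reset_power_mw:.2f} mW, SET: {set_power_mw:.2f} mW")
```

Under this assumption, RESET is a short pulse of higher power, while SET is a pulse ten times longer at lower power.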

## 2.3 Read Operation in OPCM Cells

The readout mechanism for an OPCM cell uses the high contrast in the refractive indices of a-GST (3.56) and c-GST (6.33) [57]. When an optical signal is passed through the GST element, the higher refractive index of c-GST results in an increased optical absorption by the GST element. Rios et al. [78] demonstrate that c-GST absorbs 79% of the input optical signal and allows transmission of only 21% of the optical signal. In contrast, a-GST transmits 100% of the optical signal. The transmission of partially crystalline states lies between 100% and 21% [78]. An OPCM cell is, therefore, read out by sending a sub-ns optical pulse through the GST element and measuring the transmitted optical intensity of the output pulse. This transmitted intensity corresponds to a pre-determined bit pattern, thus allowing the readout of the stored data in the GST element.


# 2.4 High MLC Capacity of OPCM Cells

In OPCM cells, the read operation uses the refractive index of the GST state to determine the stored value. Unlike the resistance value used in EPCM cells, the refractive index experiences minimal to no drift over time [52, 78]. This enables designing OPCM cells with multiple stable partially crystalline states with each having a unique refractive index. Prior works have demonstrated that it is possible to reliably program an OPCM cell to contain more than 34 unique partially crystalline states [52, 100], which enables an OPCM cell to have an MLC capacity of up to 5 bits/cell. Using a higher capacity MLC enables the read and write operation of a higher number of bits per access than EPCM, thereby increasing the memory throughput. 
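The relation between the number of distinguishable states and the MLC capacity can be sketched as follows. The helper name is hypothetical; it simply takes the floor of the base-2 logarithm, so 34 states cover the 32 levels needed for 5 bits.

```python
def mlc_bits_per_cell(distinguishable_states: int) -> int:
    """Whole bits that fit in a cell with the given number of stable states."""
    if distinguishable_states < 1:
        raise ValueError("a cell needs at least one state")
    # floor(log2(x)) for a positive integer, computed without floating point.
    return distinguishable_states.bit_length() - 1


opcm_bits = mlc_bits_per_cell(34)   # 34 states reported for OPCM [52, 100]
epcm_bits = mlc_bits_per_cell(4)    # 4 states correspond to 2 bits/cell
print(opcm_bits, epcm_bits)
```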

# 2.5 Silicon-Photonic Links

In a computing system that uses a main memory composed of OPCM cells, optical signals in silicon-photonic links can directly read/write the cells. The silicon-photonic links provide higher bandwidth density at negligible data-dependent power compared to electrical links [8, 9, 42]. In addition, these silicon-photonic links have single-cycle latency, in contrast to electrical links that often take 3-4 cycles each for a memory request and a memory response. Moreover, we can multiplex multiple optical signals (up to 32 signals) in a single waveguide, resulting in dense WDM [44]. MicroRing Resonators (MRRs) can modulate these optical signals at data rates up to 12Gbps [4, 67, 86] giving a peak memory throughput of 384Gbps per link. Therefore, it is possible to design densely multiplexed silicon-photonic links that can directly access the OPCM cells, further increasing the memory throughput. 
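The peak per-link figure follows directly from the WDM degree and the per-wavelength data rate. A minimal sketch of that arithmetic, with an illustrative function name and the conversion to bytes added for convenience:

```python
def peak_link_throughput_gbps(wavelengths: int, rate_gbps: float) -> float:
    """Peak throughput of one waveguide carrying several modulated wavelengths."""
    return wavelengths * rate_gbps


# 32 wavelengths per waveguide, each modulated at 12 Gbps (values from the text).
link_gbps = peak_link_throughput_gbps(wavelengths=32, rate_gbps=12.0)
link_gbytes_per_s = link_gbps / 8   # 8 bits per byte

print(f"{link_gbps:.0f} Gbps per link, i.e. {link_gbytes_per_s:.0f} GB/s")
```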

# 3 MOTIVATION

In this section, we motivate the need for a novel memory microarchitecture and access protocol for OPCM, by first describing the typical EPCM architecture and then explaining why such an architectural design is impractical for OPCM arrays. Figure 3 shows the architecture of EPCM [39, 45]. The EPCM array is a hierarchical organization of banks, blocks, and sub-blocks [45]. During read or write operation, the EPCM first receives a row address. The row address decoder reads the appropriate row from the EPCM array into a row buffer. The EPCM next receives the column address, and the column address multiplexer selects the appropriate data block from the row buffer. The bitlines of the selected data block are connected to the write drivers for write operation or to the sense amplifiers for read operation. For write operation, the charge pumps supply the required drive voltage to the write drivers, which corresponds to SET or RESET operation. For read operation, a read current is first passed through the GST element in the EPCM cell through an access transistor [45]. Then, sense amplifiers determine the voltage on the bitline to read out logic 0 or logic 1.

Naively adapting the EPCM architecture for OPCM by simply replacing the EPCM cells with OPCM cells raises latency, energy, and thermal concerns, thereby rendering such a design impractical. To understand these concerns, let us consider an OPCM array that uses the EPCM architecture from Figure 3 with either an optical row buffer or an electrical row buffer. Such an OPCM array architecture has the following limitations:

Limitations with optical row buffer: An optical row buffer can be designed using a row of GST elements, whose states are controlled using optical signals. When a row is read from the OPCM array using an optical signal, the data is encoded in the signal's intensity. This intensity is not large enough to update the state of the GST elements in the optical row buffer. So the read value first needs to be converted into an electrical signal. Based on this value, an optical signal with the appropriate intensity is generated to write the value into the optical row buffer. Essentially we perform an extra O-E and E-O conversion. This necessitates the use of photodetectors, receivers, transmitters and optical pulse generators, which adds to the energy and latency of a memory access. Hence, an optical row buffer is not a viable option.

Limitations with electrical row buffer: An electrical row buffer can be designed either using capacitor cells as in DRAM or using phase change materials controlled using electrical current as in EPCM. In both cases, the row buffer is accessed using electrical signals (assuming electrical links between the processor and memory). This increases the access latency and energy, and creates thermal issues as follows:

- (1) Impact on read latency: Upon receiving a row address from the MC on electrical links, the address first needs to be converted to an optical pulse, which is then used to read data from OPCM cells. After optical readout of an entire row from the OPCM array, the data has to be converted back into the electrical domain to store it in the row buffer. These two operations require an E-O and an O-E conversion, respectively, inside the OPCM array. These E-O/O-E conversions add a latency of 25-30 cycles for each read access [6].
- (2) Impact on write latency: When writing data from the row buffer to the OPCM array, a set of sense amplifiers reads the data from the electrical row buffer. This row buffer data is then mapped onto optical signals with appropriate intensities using pulse generation circuitry within memory. The optical signals are then used to write the data to the OPCM cells. Therefore, the write operation requires three E-O/O-E conversions, which adds a latency of 40-45 cycles for each write access [6].
- (3) Impact on read/write energy: The energy spent in the peripheral circuitry for optical signal generation and readout, as well as in the circuitry for E-O-E conversion, increases the active power dissipation within memory [6, 60, 63]. Since each read/write operation encounters multiple E-O-E conversions, the energy per read and write access rises considerably (> 200 pJ/bit) [24].
- (4) Thermal issues: The MRRs used in the OPCM array are highly sensitive to thermal variations [65]. The thermal variations due to active electrical circuits within memory lower the reliability of the MRR operation. Such a design calls for active thermal and power management in OPCM, which contributes to a power overhead of 10-30 W [2].

Furthermore, with this EPCM architecture, using silicon-photonic links in combination with OPCM requires additional E-O and O-E conversions at the MC and at the OPCM array, which exacerbates the problems discussed above. Hence, we argue for the need to redesign the microarchitecture and the read/write access mechanisms so that they are tailored to the properties of the OPCM cell technology and the associated silicon-photonic link technology.


# 4 COSMOS ARCHITECTURE

In this section, we describe the microarchitecture of the high-throughput OPCM array in COSMOS. The key innovation of our proposed microarchitecture is enabling direct access of OPCM cells by the optical signals in the silicon-photonic links. This direct access avoids the extra E-O and O-E conversions that are required if we were to adapt the EPCM architecture for COSMOS. Our OPCM array microarchitecture is a hierarchical multi-banked design that maximizes the degree of parallelism for read and write accesses within the array using a combination of WDM and MDM. A distinguishing feature of our OPCM array design is that it does not contain any active circuits that consume power, i.e., it only contains passive optical devices. Figure 4 illustrates the detailed microarchitecture of our proposed OPCM array in COSMOS that uses GST as the phase change material. We base our architectural design on prior OPCM cell prototype designs [26, 27, 52, 78], which demonstrate the switching of OPCM cells between multiple states with high reproducibility. The confidence of cell read/write is mainly limited by the variations in cell switching and by the SNR of readout circuits. For 4-bit OPCM cells, prior works show minimal variations in cell switching and high SNR, resulting in high confidence of read/write. We describe each component of the proposed architecture, particularly focusing on how to read and write an OPCM cell in the optical domain with minimal E-O and O-E conversions.

# 4.1 OPCM Tile


An OPCM tile (see Figure 4c) consists of an $n \times n$ array of GST elements, i.e., OPCM cells. The GST elements are placed on top of waveguide crossings as shown in Figure 4d. This organization enables every OPCM cell to be accessed using a unique pair of optical signals: one on the associated row and one on the associated column. We need a total of $n$ unique optical signals with wavelengths $\lambda_1, \lambda_2, \dots, \lambda_n$ that are routed in the rows (one per row waveguide), and $n$ unique optical signals with wavelengths $\lambda_{n+1}, \lambda_{n+2}, \dots, \lambda_{2n}$ that are routed in the columns (one per column waveguide). Wavelengths $\lambda_1$ to $\lambda_n$ together form the Tile Row Access (TRA)-channel, and wavelengths $\lambda_{n+1}$ to $\lambda_{2n}$ together form the Tile Column Access (TCA)-channel. A TRA-channel (and similarly each TCA-channel) is mapped to one or more waveguides depending on the number of wavelengths that can be multiplexed in a waveguide. Owing to MLC, each OPCM cell stores $b_{cell}$ bits. The total capacity of an OPCM tile is $n^2 \cdot b_{cell}$ bits. A maximum of $n$ cells can be read/written in parallel from a single tile, which gives us a peak throughput of $n \cdot b_{cell}$ bits per read/write access for a tile.
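The tile-level quantities above can be captured in a small model. This is a minimal sketch: the class and method names are hypothetical, wavelengths are represented only by their indices 1 to 2n, and the values n = 32 and 4 bits/cell are borrowed from the address-mapping example in Section 4.4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpcmTile:
    n: int        # cells per row and per column
    b_cell: int   # bits stored per cell (MLC capacity)

    def capacity_bits(self) -> int:
        return self.n * self.n * self.b_cell

    def peak_bits_per_access(self) -> int:
        # At most n cells (one row) are accessed in parallel.
        return self.n * self.b_cell

    def tra_wavelengths(self) -> range:
        return range(1, self.n + 1)                # lambda_1 .. lambda_n

    def tca_wavelengths(self) -> range:
        return range(self.n + 1, 2 * self.n + 1)   # lambda_{n+1} .. lambda_{2n}

    def cell_wavelength_pair(self, row: int, col: int) -> tuple[int, int]:
        """Wavelength indices (row signal, column signal) for a 0-based cell."""
        if not (0 <= row < self.n and 0 <= col < self.n):
            raise IndexError("cell outside the tile")
        return (row + 1, self.n + col + 1)


tile = OpcmTile(n=32, b_cell=4)
print(tile.capacity_bits(), tile.peak_bits_per_access())
```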

# 4.2 OPCM Bank

Figure 4b shows the organization of an OPCM bank. The OPCM bank is composed of an array of $m \times m$ OPCM tiles, and has a total capacity of $m^2 \cdot n^2 \cdot b_{cell}$ bits. The OPCM bank uses $m$ TRA-channels, one for each row in the bank, and $m$ TCA-channels, one for each column in the bank, to communicate with the E-O-E control unit. Each TRA-channel uses $\lambda_1$ to $\lambda_n$, and each TCA-channel uses $\lambda_{n+1}$ to $\lambda_{2n}$. We design a hierarchical array of OPCM cells ($m^2$ tiles with $n^2$ OPCM cells per tile) instead of a large monolithic array ($m^2 n^2$ OPCM cells), as designed by Feldman *et al.* [26, 27], to decrease the laser power required by the optical signals. With our proposed design, the laser sources only need to support $2n$ unique optical signals (in the range of $\lambda_1$ to $\lambda_{2n}$) instead of the $m \cdot 2n$ unique optical signals that would be required in a large monolithic array. We utilize MRRs to couple the optical signals of each TRA-channel and TCA-channel to its corresponding tile. We need $n$ MRRs that are tuned to $\lambda_1$ to $\lambda_n$ in each of the $m$ TRA-channels and $n$ MRRs that are tuned to $\lambda_{n+1}$ to $\lambda_{2n}$ in each of the $m$ TCA-channels.
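The bank-level counts can be checked with a few lines of arithmetic. In the sketch below the function names are illustrative and m = 8 is a hypothetical value chosen only for the example, since this section does not fix m; n = 32 and 4 bits/cell follow the example in Section 4.4.

```python
def bank_capacity_bits(m: int, n: int, b_cell: int) -> int:
    """Capacity of an m x m array of tiles, each with n x n cells."""
    return m * m * n * n * b_cell


def wavelengths_hierarchical(n: int) -> int:
    """Unique optical signals needed when every tile reuses lambda_1..lambda_2n."""
    return 2 * n


def wavelengths_monolithic(m: int, n: int) -> int:
    """Unique optical signals a monolithic array would need, as stated in the text."""
    return m * 2 * n


def mrr_count(m: int, n: int) -> int:
    """n MRRs in each of the m TRA-channels plus n in each of the m TCA-channels."""
    return 2 * m * n


m, n, b_cell = 8, 32, 4   # m is a hypothetical value for illustration
print(bank_capacity_bits(m, n, b_cell),
      wavelengths_hierarchical(n),
      wavelengths_monolithic(m, n),
      mrr_count(m, n))
```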

# 4.3 Multi-banked OPCM Array

Figure 4a shows the proposed multi-banked organization of the OPCM array using MDM. We interleave a cache-line across multiple banks. There are *p* banks, each supporting one of the *p* spatial modes of the 2*n* optical signals. Bank 1 only uses mode 1 of all optical signals $\lambda_1, \dots, \lambda_n$ and $\lambda_{n+1}, \dots, \lambda_{2n}$, Bank 2 only uses mode 2 of all optical signals, and so on. The waveguides connecting the OPCM to the E-O-E control unit are multi-mode waveguides, which carry all the *p* spatial modes of optical signals. We employ single-mode MRRs [89, 97] that couple a single spatial mode of optical signals from the multi-mode waveguide to a bank. Multiple prior works have exploited the MDM property of optical signals coupled with WDM to design high-bandwidth-density silicon-photonic links [54, 92].

Fig. 4. (a) A multibanked-OPCM uses p optical modes to access p banks. (b) An OPCM bank is an array of  $m \times m$  tiles. Every tile is accessed by a TRA-channel and a TCA-channel, each channel containing n optical signals. (c) An OPCM tile is an array of  $n \times n$  cells. Every cell is accessed by a unique pair of optical signals. (d) OPCM cells are placed at every waveguide crossing. (e) Address mapping of the physical address to cells in the OPCM array. The physical address corresponds to OPCM cells in the shaded blue row of OPCM array.

## 4.4 Address Mapping in COSMOS

Figure 4e shows an example mapping of the physical address received by the MC to the physical location of cells within the OPCM array in COSMOS. A cache line of 64B is stored in a total of 128 OPCM cells with 4 bits/cell. We interleave the cache line across 4 different banks. Within a bank, we map the 128-bit chunk of a cache line to a tile. The tile has $32 \times 32$ cells, and so we map that 128-bit chunk to an entire row within a tile. The row (column) field of physical address in the MC is mapped to the row ID of tile (column ID of tile) field and the row ID of cell (column ID of cell) field. In Figure 4e, we show how the different fields of the physical address 0x10301FC0 are mapped to bank ID, row ID of tile, column ID of tile, row ID of cell, and column ID of cell.
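The sizes in this example are mutually consistent, which the following arithmetic sketch makes explicit. The constant names are illustrative; the bit-level layout of the address fields is given only in Figure 4e and is deliberately not reproduced here.

```python
CACHE_LINE_BYTES = 64    # cache line size used in the example
BITS_PER_CELL = 4        # MLC capacity used in the example
BANKS_PER_LINE = 4       # banks a cache line is interleaved across
TILE_DIM = 32            # the tile is 32 x 32 cells

line_bits = CACHE_LINE_BYTES * 8
cells_per_line = line_bits // BITS_PER_CELL
cells_per_bank = cells_per_line // BANKS_PER_LINE
bits_per_bank = cells_per_bank * BITS_PER_CELL

# Each bank's share of the cache line fills exactly one row of a tile.
fills_one_tile_row = cells_per_bank == TILE_DIM
print(cells_per_line, cells_per_bank, bits_per_bank, fills_one_tile_row)
```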


# 5 ACCESS PROTOCOL IN COSMOS

To enable high-throughput access of OPCM cells within the OPCM array, we propose a novel read and write access protocol for COSMOS. When the MC issues a read or write operation, the row address and column address are entered into the Row Address Queue and Column Address Queue, respectively, and the write data is entered into the Data Buffer in the E-O-E control unit.


# 5.1 Writing a cache line to OPCM array

To write a cache line to the OPCM array, the E-O-E control unit identifies the bank ID, the row ID and column ID of the tile, and the row ID and column ID of the cell within a tile using the address mapping. In our example with a $32 \times 32$ array of cells in a tile, when writing a 128-bit chunk of a cache line, we end up updating all the cells in a row (any misaligned accesses are handled on the processor side). Hence, for writes at cache line granularity, the column ID within a tile is not used. The E-O-E control unit determines the optical intensity that is required at each OPCM cell in the row to write the 128-bit chunk of the cache line. It then breaks down the optical intensity into two signals, one with a constant intensity of $I_0$ and the other with a data-dependent intensity of $I_i$, where $i = 1, 2, \dots, 128$. The E-O-E control unit modulates the constant intensity $I_0$ onto the optical signal corresponding to the row (selected by the row ID of cell) within a tile. The E-O-E control unit then modulates the data-dependent optical intensities (i.e., $I_1, I_2, \dots, I_{128}$) onto the optical signals corresponding to the 4 tiles spread across 4 banks with 32 columns per tile. The E-O-E control unit transmits the row signal $I_0$ and the column optical signals $I_1, I_2, \dots, I_{128}$ in parallel to write the cache line in the OPCM array. The superposition of the optical signals, i.e., $I_0+I_1$, $I_0+I_2$, ..., $I_0+I_{128}$, updates the state of the OPCM cells. Note that since a cache line is spread across 4 banks, the E-O-E control unit modulates data on optical signals to write to an OPCM tile in each of these 4 banks. None of the optical signals individually carries sufficient intensity to trigger a state transition at any cell, so none of the other cells along the row or column are affected.
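The split into a shared row signal and per-column signals can be sketched as below. This is an illustration under stated assumptions, not the control unit's actual circuit: intensities are in arbitrary units, the switching `threshold` and all numeric values are hypothetical, and the function name is our own.

```python
def split_write_intensities(targets, i0, threshold):
    """Return the column intensities I_i such that I_0 + I_i equals each target.

    All values are in arbitrary units. `threshold` is a hypothetical minimum
    intensity below which a GST cell does not change state.
    """
    if not 0 < i0 < threshold:
        raise ValueError("row signal I_0 alone must stay below the threshold")
    column_intensities = []
    for target in targets:
        if target < threshold:
            raise ValueError("combined intensity too low to program the cell")
        i_col = target - i0
        if i_col >= threshold:
            raise ValueError("column signal alone must stay below the threshold")
        column_intensities.append(i_col)
    return column_intensities


# Hypothetical numbers: threshold 100, row signal 60, three target levels.
columns = split_write_intensities(targets=[100, 120, 150], i0=60, threshold=100)
print(columns)
```

The two checks mirror the condition stated above: only the cell at the crossing of the selected row and a driven column sees the sum of both signals, while every other cell sees a single signal that is too weak to switch it.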

# 5.2 Reading a cache line from OPCM array


To read a cache line from OPCM array, the E-O-E control unit transmits sub-ns optical pulses along all the columns in a tile that contain the cache line and measures the pulse attenuation. However, there are multiple OPCM cells along each column and so the output intensity of optical signals will be attenuated by all cells in that column. It is, therefore, not possible to determine the OPCM cell values using a one-pulse readout. Hence, we use a three-step process for read operation of OPCM array in COSMOS. (1) To read a cache line, the E-O-E control unit first determines the bank ID, row ID and column ID of tile, and row ID and column ID of cell. The E-O-E control unit transmits a read pulse $RD_1$ through all the columns in a tile containing the cache line. Note that since a cache line is spread across 4 banks, the E-O-E control unit transmits $RD_1$ on the 4 different optical modes corresponding to the 4 banks. Each read pulse is attenuated by all the OPCM cells in the column. The attenuated pulses are received by the E-O-E control unit, which records the intensities of these attenuated pulses as $I_{1,1}, I_{2,1}, ..., I_{128,1}$. These intensities are converted into electrical voltage and stored as $V_{1,1}, V_{2,1}, ..., V_{128,1}$. (2) The E-O-E control unit then transmits a RESET pulse to the OPCM cells of the cache line, i.e., all the cells along a row within a tile. All the cells along the row are now amorphized and have 100% optical transmission. (3) The E-O-E control unit then sends a second read pulse $RD_2$ through all the columns of a tile containing the cache line. Each read pulse is again attenuated by all OPCM cells in the column. Given that step 2 amorphized all OPCM cells of the cache line, the output pulse intensities are different from those in step 1. The attenuated pulses are received by the E-O-E control unit, which records the intensities of these attenuated pulses as $I_{1,2}, I_{2,2}, ..., I_{128,2}$. These intensities are converted into electrical voltage and stored as $V_{1,2}, V_{2,2}, ..., V_{128,2}$. The E-O-E control unit computes the difference of the stored voltages of steps 1 and 3, i.e., $V_{1,1} - V_{1,2}$, $V_{2,1} - V_{2,2}$, ..., $V_{128,1} - V_{128,2}$. This difference is used to determine the cache line data stored in the OPCM cells.
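A toy model of the three-step read for a single column may help to see why the difference isolates the target cell. It rests on an assumption that is not stated in the text: detector levels are expressed on a logarithmic (dB) scale, so that the attenuations of the cells along a column add up. The attenuation values and the function name `read_cell` are hypothetical.

```python
# Hypothetical model of the three-step differential read for one column.
# Assumption: levels are in dB, so per-cell attenuations along a column add.
AMORPHOUS_DB = 0.0  # a RESET (amorphous) cell is modeled as 100% transmissive

def read_cell(column_db, row, input_db=0.0):
    """Return the attenuation (dB) of the cell at `row`; the read is destructive."""
    v1 = input_db - sum(column_db)   # step 1: RD_1 is attenuated by every cell
    column_db[row] = AMORPHOUS_DB    # step 2: RESET amorphizes the target cell
    v2 = input_db - sum(column_db)   # step 3: RD_2 is attenuated by the rest
    return v2 - v1                   # magnitude of V_1 - V_2: the target cell alone

column = [0.5, 2.0, 1.25, 0.75]      # made-up attenuations of 4 cells in a column
value = read_cell(column, row=2)
```

The contributions of all untouched cells cancel in the difference, so `value` depends only on the state the target cell held before step 2. The sketch returns $V_2 - V_1$, which has the opposite sign of the $V_{i,1} - V_{i,2}$ difference in the text, and it leaves the target cell erased, which is why a later writeback is needed.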

## 5.3 Opportunistic Writeback after Read

The RESET operation in step 2 of the read operation destroys the original data in the OPCM cells. We, therefore, perform an opportunistic writeback of the cache line to the OPCM cells. After completing the 3 steps of the read operation, the read data and the address are saved into a holding buffer in the E-O-E control unit. When there are no pending read or write operations from the MC, the E-O-E control unit reads the data and its address from the holding buffer and writes the data back to the OPCM array. This writeback operation does not block any critical pending read and write operations coming from the MC. The dependencies in read and write requests between the holding buffer and the data buffer are handled in the E-O-E control unit. For a Read-After-Read case, the second read operation reads the data from the holding




Fig. 5. (a) E-O-E control unit design. DMU: Generates the modulation voltage and the bias current corresponding to read/write data. AMU: Determines optical signals that correspond to read/write address. PSU: Selects the optical signals. PAU: Amplifies the optical signals using the bias current. PFU: Filters the optical signals to read cell data. Different micro-steps performed in E-O-E control unit and OPCM array during (b) write operation and (c) read operation.

*buffer* if present. If the data is not in the *holding buffer* then the second read operation just uses the 3-step process + writeback (described above) to complete the read operation. For a Write-After-Read case, if the write address matches the read address and there is an entry for that read in the *holding buffer*, then the corresponding entry in the *holding buffer* is invalidated. The write data is then entered into the data buffer and then written into the appropriate OPCM array.
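The Read-After-Read and Write-After-Read handling can be sketched as simple bookkeeping around a dictionary. This is a minimal sketch, not the authors' design: the class and method names are hypothetical, a Python dictionary stands in for the OPCM array, and the data buffer is omitted.

```python
# Hypothetical sketch of holding-buffer bookkeeping in the E-O-E control unit.
class EOEControlUnit:
    def __init__(self, array):
        self.array = array      # address -> data, stands in for the OPCM array
        self.holding = {}       # address -> data awaiting writeback

    def read(self, addr):
        if addr in self.holding:        # Read-After-Read: serve from the buffer
            return self.holding[addr]
        data = self.array[addr]         # 3-step read ...
        self.array[addr] = None         # ... whose RESET erases the cells
        self.holding[addr] = data       # keep the data for a later writeback
        return data

    def write(self, addr, data):
        self.holding.pop(addr, None)    # Write-After-Read: invalidate stale entry
        self.array[addr] = data

    def idle(self):
        """Called when the MC has no pending requests: write back one entry."""
        if self.holding:
            addr, data = self.holding.popitem()
            self.array[addr] = data
```

In this sketch a read never waits for the writeback; the erased cells are restored only when `idle()` is called, mirroring the opportunistic policy described above.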

# 6 E-O-E CONTROL UNIT DESIGN

Our proposed E-O-E control unit provides the interface between the processor and the OPCM array. The MC sends standard DRAM access protocol commands to the E-O-E control unit. The E-O-E control unit maps these commands onto optical signals that read/write the data from/to OPCM array. Though we can design a COSMOS-specific MC and the associated read/write protocol, our goal is to enable the COSMOS operation with a standard MC in any processor. The E-O-E control unit uses the following five sub-units to read from and write to the OPCM array: data modulation unit (DMU), address mapping unit (AMU), pulse selector unit (PSU), pulse amplification unit (PAU), and pulse filtering unit (PFU). Each OPCM bank has a dedicated set of these five sub-units in the E-O-E control unit. Figure 5a shows the design of the E-O-E control unit in COSMOS and the internals of these sub-units. 

Figure 5b illustrates the sequence of operations in the E-O-E control unit for write operation to a bank containing $512 \times 512$ tiles with $32 \times 32$ cells per tile (same design as that used in Figure 4e). The AMU in the E-O-E control unit first receives the row address and then the column address from MC (Step 1). Depending on the addresses, the PSU in the E-O-E control unit selects the appropriate optical signals using the address mapping explained in Section 4.4 (Step 2). The PSU selects one optical signal for the row and 32 optical signals for the 32 columns in the row to write to 32 cells in a tile. In parallel with the write address, the DMU in the E-O-E control unit receives the write data from the MC (Step 3). The DMU generates a unique bias current for each of the 32 optical signals depending on write data and applies the currents to the semiconductor optical amplifiers (SOA) in the PAU (Step 4). The SOAs amplify the optical signals to the required intensities. These amplified signals and the optical signal (corresponding to the row) traverse through the silicon-photonic links to the appropriate OPCM cells in the bank, and SET/RESET the cell (Step 5). The E-O-E control

unit incurs a latency of $T_{EO}$ cycles to map the address and data onto optical signals, resulting in a peak throughput of $1/T_{EO}$. It should be noted that the physical location of a cell in the OPCM array in COSMOS determines the level of losses that will be experienced by an optical signal that is writing to the cell. These losses in turn dictate the amplification of that optical signal in the E-O-E control unit. To address this, the E-O-E control unit uses the address mapping (refer to Fig. 4e) to map the physical address to the corresponding OPCM cell that needs to be written. Based on the physical location of the cell, the DMU in the E-O-E control unit looks up a pre-programmed LUT, which holds the amplification factor required for each cell. The DMU applies a bias current as a function of this amplification factor to the PAU, which amplifies the optical signals to the required level.

Figure 5c illustrates the sequence of operations in the E-O-E control unit for the 3-step read operation from a bank. In the first step, the AMU receives the row and column addresses from MC and selects the appropriate 32 optical signals in the PSU using the address mapping explained in Section 4.4 (Step 1.1). The DMU generates a low-intensity readout pulse ($RD_1$) and the PAU modulates this pulse on the 32 optical signals (Step 1.2). The optical signals traverse through the silicon-photonic link and then through the columns in the tile. The optical signals lose intensity as they pass through all the OPCM cells in their associated columns (Step 1.3). The intensities of these attenuated signals are recorded by the PFU (Step 1.4). The PFU then converts the optical intensities into electrical voltages, $V_{1,1}$, $V_{2,1}$, ..., $V_{32,1}$ (Step 1.5). In the second step, the DMU generates the RESET pulse. This RESET pulse is mapped onto the appropriate optical signals and these signals are sent to the OPCM array (Step 2.1). The signals traverse through the silicon-photonic links and amorphize the OPCM cells corresponding to the read address (Step 2.2). In the third step, the DMU generates another readout pulse ($RD_2$) and the PAU modulates this pulse on a set of 32 optical signals (Step 3.1). These signals traverse through the silicon-photonic links and then through the appropriate columns in the tile. These signals also lose intensity as they pass through all the OPCM cells in their associated columns (Step 3.2). The PFU records these attenuated signals (Step 3.3) and converts these optical signals into electrical voltages $V_{1,2}$, $V_{2,2}$, ..., $V_{32,2}$ (Step 3.4). Finally, the PFU computes $V_{1,1} - V_{1,2}$, $V_{2,1} - V_{2,2}$, ..., $V_{32,1} - V_{32,2}$ to determine the data (Step 3.5) and sends the data to the MC. The PFU also writes this data back to the holding buffer in the DMU (Step 3.6).

## 7 EVALUATION METHODOLOGY


## 7.1 Multicore System with COSMOS

Our simulations for COSMOS are primarily based on parameters derived from prior multi-bit prototypes [26, 78]. These works demonstrate the scalability and precision of up to 5-bit/cell OPCM arrays under different load conditions using state-of-the-art optical devices for signal modulation and filtering. Moreover, the cell-to-cell static variability in the refractive indices of GST elements has been shown to be minimal in these works [52, 78]. Due to the lack of active circuitry within the OPCM array, the dynamic variations in COSMOS due to thermal gradients are negligible. The minimal impact of these variations on GST cell operation enables high-fidelity optical detection and SET/RESET operation of OPCM arrays. As part of our future work we plan to further explore the impact of these variations on reliability for larger scale OPCM arrays at an architectural level. In our simulations, we use OPCM cell parameters (MLC, pulse intensity and GST size) from real prototypes [26, 27, 78, 102], losses in optical elements based on prior demonstrations [10, 31, 52, 81], and silicon-photonic link parameters (signals/waveguide, data rates, MRR sizes) from prior chip prototypes [8-10, 12]. In addition to 4-bit OPCM cells, we also evaluate the potential performance benefits of an 8-bit OPCM cell. Though designing optical circuitry for high-precision filtering of 8-bit OPCM cells is a challenge, our goal is to motivate the potential benefits of higher-density OPCM arrays.

| Component                     | Configuration                                                                                         |
|-------------------------------|-------------------------------------------------------------------------------------------------------|
| **Processor, on-chip caches** |                                                                                                       |
| Cores                         | 8-core, 2.5GHz x86 ISA, Out-of-Order, 192 ROB entries, dispatch/fetch/issue/commit width=8            |
| L1 caches                     | 32kB split L1 I\$ and D\$, 2-way, 1-cycle hit, 64B, LRU, write-through, MSHR: 4 instruction & 32 data |
| L2 cache                      | Shared L2\$, 2MB, 8-way, 8-cycle hit, 64B, LRU, write-back, MSHR: 32 (I & D)                          |
| **Main memory (2GB)**         |                                                                                                       |
| EPCM [20]                     | 4 banks, 8 devices/rank, 1 rank/channel, bus width = 64, burst length = 4                             |
|                               | $t_{SET} = 120ns, t_{RESET} = 50ns, t_{read} = 60ns, t_{BURST} = 4ns$                                 |
| OPCM array in COSMOS [52, 78] | 8 banks, 1 rank/channel, 1 device/rank, bus width = $32 \times b_{cell}$, burst length = 8            |
|                               | $t_{SET} = 160ns, t_{RESET} = 25ns, t_{read} = 25ns, t_{BURST} = 1ns, t_{EOE} = 5ns$                  |

Table 1. Architectural Details of the Simulated System.

We use an 8-core processor for our evaluation. We primarily evaluate COSMOS with 4-bit MLC OPCM cells (given that OPCM cell with 5 *bits/cell* has been prototyped [52]) against an EPCM with 2 *bits/cell*. We choose 2 *bits/cell* instead of 4 *bits/cell* [61] for EPCM as prior works [13, 17] have shown that a cell density higher than 2 *bits/cell* leads to unreliable EPCM designs. Table 1 details the processor and memory configurations. For processor-memory networks, we consider electrical as well as silicon-photonic links, with 1*GT/s* transfer rate per link. We obtain a peak bandwidth of 64GB/s in EPCM and 256GB/s in COSMOS. Peak bandwidth in COSMOS is calculated as the product of the data rate, the bus width (64 lines between processor and memory), the OPCM MLC capability (each optical signal can read/write 4 *bits/cell*), and the number of parallel banks: $1GT/s \times 64\,lines \times 4\,bits/cell \times 8\,banks = 256GB/s$.
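The peak-bandwidth product for COSMOS can be checked with a few lines of arithmetic. The variable names below are illustrative; the values are the ones quoted in the paragraph above.

```python
# Peak bandwidth of COSMOS-4bit from the parameters stated in the text.
transfer_rate = 1e9   # transfers per second per link (1 GT/s)
links = 64            # lines between processor and memory
bits_per_cell = 4     # MLC capacity of a COSMOS-4bit cell
banks = 8             # banks accessed in parallel

bits_per_second = transfer_rate * links * bits_per_cell * banks
gigabytes_per_second = bits_per_second / 8 / 1e9
```

The product is 2048 Gb/s, which corresponds to the 256GB/s figure after dividing by 8 bits per byte.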

The OPCM array in COSMOS is organized as a single rank connected to a memory channel via the E-O-E control unit. Each one of the 8 OPCM banks has its dedicated set of DMU, AMU, PSU, PAU, and PFU in the E-O-E control unit. The average SET latency is $t_{SET} + t_{EOE}$, i.e., 165*ns*, the RESET latency is $t_{RESET} + t_{EOE}$, i.e., 30*ns*, and the read latency is $t_{read}$ (time for 3-step read operation) + $t_{EOE}$, i.e., 30*ns*. A maximum of $t_{SET}/t_{EOE} = 32$ writes can be issued from the E-O-E control unit to OPCM in parallel. So, we can write $32 \times b_{cell}$ bits in parallel. A maximum of $t_{read}/t_{EOE} = 5$ reads can be issued from the E-O-E control unit to OPCM in parallel. So, we can read $5 \times b_{cell}$ bits in parallel. We use a holding buffer that is large enough (16 cache line slots from our evaluations) to avoid stalling any read/write memory requests from the MC.
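The latency and parallelism figures follow directly from the COSMOS timing parameters in Table 1. The short calculation below reproduces them; the variable names are illustrative, and the 64B line size is taken from the cache configuration in Table 1.

```python
# COSMOS timing parameters from Table 1, in nanoseconds.
t_set, t_reset, t_read, t_eoe = 160, 25, 25, 5

set_latency = t_set + t_eoe        # average SET latency
reset_latency = t_reset + t_eoe    # RESET latency
read_latency = t_read + t_eoe      # 3-step read latency

parallel_writes = t_set // t_eoe   # writes that can be in flight per bank
parallel_reads = t_read // t_eoe   # reads that can be in flight per bank

holding_buffer_bytes = 16 * 64     # 16 cache-line slots of 64B each
```

With these inputs the SET, RESET, and read latencies are 165ns, 30ns, and 30ns, the parallel write and read counts are 32 and 5, and the holding buffer holds 1024 bytes.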

## 7.2 Simulation Framework

We model the architectural specifications of the system in gem5 [14]. We conduct full-system simulations in gem5 with Ubuntu 12.04 OS and Linux kernel v4.8.13. We fast-forward to the end of Linux boot and execute each workload for 10 billion instructions. We model DDR5 main memory and its timing parameters in DRAMSim2 [77]. For modeling EPCM and OPCM, we integrate NVMain2.0 [68] in gem5.

# 7.3 Workloads

We simulate graph applications from GAP-BS benchmark [11] and HPC applications from NAS-PB benchmark [7]. We evaluate the graph applications on three different input datasets from SNAP repository [49]: Google web graph (*google*), road network graph of Pennsylvania (*roadNetPA*), and Youtube online social network (*youtube*). For HPC applications from NAS-PB benchmark, we use the large dataset. We execute 8 threads of these applications in a workload.

# 8 EVALUATION RESULTS

# 8.1 COSMOS vs EPCM

8.1.1 Performance. We compare EPCM (2bit MLC or EPCM-2bit) that uses 64 processor-to-memory electrical links with COSMOS (4bit OPCM cells, or COSMOS-4bit) that also uses 64 processor-to-memory silicon-photonic links, and



Fig. 6. Performance comparison of COSMOS with EPCM.

with COSMOS-4bit that uses 256 processor-to-memory silicon-photonic links. Figure 6 shows the overall performance (execution time in seconds) for systems with these three configurations. Compared to the EPCM-2bit with 64 electrical links, COSMOS-4bit with 64 silicon-photonic links has on average  $1.52 \times$  better performance across all workloads. This performance improvement is due to the higher *bits/access* throughput of COSMOS resulting from higher MLC capacity and the single-cycle latency in silicon-photonic links. Increasing the number of silicon-photonic links from 64 to 256 further improves the system performance. Compared to EPCM-2bit using 64 electrical links, we observe performance improvement of  $2.14 \times$  on average for graph and HPC workloads with COSMOS-4bit using 256 silicon-photonic links. These performance benefits are due to denser WDM in silicon-photonic links. *The key takeaway from this comparison is that even though the OPCM cells suffer from long write latency similar to EPCM cells, the superior MLC capacity of OPCM cells that are directly accessed by high-bandwidth-density silicon-photonic links improves the system performance in COSMOS.* 

8.1.2 Throughput. Figures 7a and 7b show the read and write throughput, respectively, of COSMOS-4bit with 256 silicon-photonic links, and EPCM-2bit with 64 electrical links. Compared to EPCM-2bit with 64 electrical links, COSMOS-4bit with 256 silicon-photonic links theoretically has $8 \times$ higher peak throughput, i.e., $2 \times$ due to the higher MLC capacity and $4 \times$ due to the increased number of processor-to-memory links. Therefore, it is possible to issue an increased number of parallel read and write operations in COSMOS-4bit. As a result, from Figure 7a and Figure 7b, we observe that COSMOS-4bit can achieve $2.09 \times$ higher read throughput and $2.15 \times$ higher write throughput, respectively, than EPCM-2bit for graph and HPC workloads. This increased read and write throughput of COSMOS-4bit hides the long write latencies. Figure 7c shows that the average memory latency (read+write) of COSMOS-4bit is 33% lower than that of EPCM-2bit across all workloads. The key insight from this study is that the increased read and write throughput provided by the higher MLC capacity and the silicon-photonic links hides the long write latencies of OPCM cells in COSMOS.

*8.1.3 Energy Consumption.* The primary contributors to the overall power consumption during the read and write operations are the different active components in the E-O-E control unit and the laser sources that drive the silicon-photonic links. The OPCM array in COSMOS consists of only passive optical devices, so it does not consume any active or idle power. The electrical power consumed in the laser source is proportional to its optical output power, which in turn depends on the optical losses in the path of the optical signal and the minimum power required to switch the farthest GST element. Table 2 lists the optical losses in the various components and the maximum switching power required at



Fig. 7. (a) Read throughput, (b) Write throughput, (c) Average memory latency

the GST element in decibels (dB). The various optical losses and SOA gains are obtained from prior characterization works [10, 31, 52, 81]. By accounting for the wall-plug efficiency, we calculate the minimum required laser power per optical signal as 0.95*mW*. Aggregating the laser power for all optical signals required in a 2*GB* COSMOS system, we get a total laser power of 16.38*W*.
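The unit conversions behind the per-signal laser power can be checked with a short calculation. This sketch only converts between dBm and mW and applies the 20% wall-plug efficiency from Table 2; it does not re-derive the loss budget itself, and the helper names `dbm_to_mw` and `mw_to_dbm` are illustrative.

```python
import math

def dbm_to_mw(dbm):
    """Convert a power level in dBm to milliwatts."""
    return 10 ** (dbm / 10)

def mw_to_dbm(mw):
    """Convert a power in milliwatts to dBm."""
    return 10 * math.log10(mw)

# Maximum power needed to SET a GST element: 135 pJ delivered over 250 ns.
set_power_mw = 135e-12 / 250e-9 * 1e3
set_power_dbm = mw_to_dbm(set_power_mw)   # Table 2 lists about -2.67 dBm

# Optical power per signal from Table 2, then the electrical laser power
# needed to produce it at 20% wall-plug efficiency.
optical_mw = dbm_to_mw(-7.22)             # Table 2 lists 0.19 mW
electrical_mw = optical_mw / 0.20         # the text reports 0.95 mW
```

The computed values agree with the table and the text to within rounding.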

In the E-O-E control unit, the current-DAC in DMU and the ADC in PFU consume 0.3mW each [74]. For OPCM-4bit, 32 write operations can be issued in parallel per bank, i.e., we can write  $32 \times b_{cell} \times 8 = 128B$  in parallel with an average write latency of 160*ns*. That aggregates to writing 2 cache lines of 64*B* each in parallel. A cache line is interleaved across


| Loss/gain component                | Single                     | Total               |
|------------------------------------|----------------------------|---------------------|
| Coupling loss                      | -1 dB [10]                 | -1 dB               |
| MRR drop loss (E-O-E control)      | -0.5 dB [31]               | -0.5 dB             |
| MRR through loss (E-O-E control)   | -0.05 dB [31]              | -3.2 dB             |
| Propagation loss (Laser to SOA)    | -0.3 dB/cm [81]            | -0.09 dB            |
| SOA gain                           | 20 dB                      | 20 dB               |
| Propagation loss (SOA to OPCM)     | -0.3 dB/cm [81]            | -0.09 dB            |
| Bending loss                       | -0.167 dB [81]             | -0.167 dB           |
| MRR drop loss (OPCM)               | -0.5 dB [31]               | -0.5 dB             |
| MRR through loss (OPCM)            | -0.05 dB [31]              | -3.2 dB             |
| Propagation loss (in OPCM)         | -0.03 dB/cm [81]           | -4.91 dB            |
| Max. power required to SET the GST | $\frac{135pJ}{250ns}$ [52] | -2.67 dBm           |
| Power per optical signal           |                            | -7.22 dBm = 0.19 mW |
| Laser wall-plug efficiency         |                            | 20%                 |
| Total laser power                  |                            | 16.38 W             |

Table 2. Optical power budget for 2GB COSMOS. The table shows optical power losses and SOA gain along the optical path from laser source to OPCM cells.


| Energy-per-bit (pJ/bit) | EPCM-2bit | COSMOS-4bit |
|-------------------------|-----------|-------------|
| Write                   | 243       | 40.68       |
| Read                    | 44.5      | 11.6        |
| Opportunistic writeback | NA        | 40.68       |

Table 3. Energy-per-bit for read and write accesses.

4 banks and is row aligned in an OPCM tile. Therefore, we need 4 row optical signals and $4 \times 32$ column optical signals to write a cache line. The total power of the laser, SOAs and DACs in the E-O-E control unit for writing 2 cache lines in parallel aggregates to 334.8mW. This equates to 40.68pJ/bit for writing to COSMOS-4bit.

For read operation, up to 5 read operations can be issued in parallel per bank, i.e., $5 \times b_{cell} \times 8 = 20B$ in parallel, with a read latency of 25*ns*. The total power of the laser, SOA, DAC, and ADC in E-O-E control for 5 parallel read operations is 9.3*mW*, resulting in a read energy of 11.6*pJ/bit* for COSMOS-4bit. The energy consumed in the electrical links connecting the processor and the E-O-E control unit is < 1pJ/bit [21]. For EPCM, we use parameters from the HSpice models in prior work [39] and model them in NVSim [24] to estimate the energy-per-bit for read and write operations. The opportunistic writeback operation in COSMOS uses the same energy as that required for write operation. Table 3 shows the energy-per-bit for EPCM-2bit and COSMOS-4bit. The read and write energy-per-bit of COSMOS-4bit are $3.8 \times$ and $5.97 \times$ lower, respectively, than that of EPCM-2bit.
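The two reduction factors quoted above are the ratios of the Table 3 entries, as the following short check shows. The dictionary names are illustrative.

```python
# Energy-per-bit values (pJ/bit) from Table 3.
epcm = {"write": 243.0, "read": 44.5}
cosmos = {"write": 40.68, "read": 11.6}

# Reduction factor of COSMOS-4bit relative to EPCM-2bit for each operation.
reduction = {op: epcm[op] / cosmos[op] for op in epcm}
```

The write ratio evaluates to about 5.97 and the read ratio to about 3.8, matching the figures in the text.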

## 8.2 Sensitivity Analysis of COSMOS

8.2.1 MLC values. Rios et al. gave the first demonstration of a 2-bit OPCM cell operation [78]. Advances in optical signaling and control have resulted in the demonstration of denser multilevel OPCM cells. Li et al. demonstrated 5-6 bits per OPCM cell [52]. Further prototypes have demonstrated scalable integration of OPCM cell arrays in silicon and silicon nitride platforms [27, 51]. With the maturity in optical integration technologies, we also evaluate the performance of 8-bit OPCM cells to provide a forward-looking comprehensive view of the potential benefits of developing higher bit density OPCM cells compared to DRAM. We compare the performance of COSMOS that uses OPCM cells with different MLC capacities, ranging from 2 *bits/cell* to 8 *bits/cell*, for the same number of silicon-photonic links (see Figure 8). The performance across applications increases, on average, by 39.2% and 26.4% as the MLC capacity of OPCM cells



Fig. 8. Performance comparison of COSMOS with different MLC.



Fig. 9. Performance comparison of COSMOS with different number of silicon-photonic links.

increases from 2 *bits/cell* to 4 *bits/cell* and from 4 *bits/cell* to 8 *bits/cell*, respectively. An OPCM cell with higher MLC capacity will provide higher memory throughput.

8.2.2 Number of Silicon-Photonic Links. We compare the performance of COSMOS-4bit with different numbers of silicon-photonic links (see Figure 9). Multiplexing a higher number of optical signals in silicon-photonic links enables parallel read and write accesses of a higher number of OPCM cells. Due to this increased throughput, the overall system performance improves as the number of silicon-photonic links increases. We observe a performance improvement of 29.3% (on average) for COSMOS-4bit with 256 silicon-photonic links over COSMOS-4bit with 64 links.

8.2.3 Holding Buffer. As discussed earlier, in the absence of the holding buffer, the read data needs to be written back to the OPCM cells immediately after readout due to the destructive read operation. Therefore, the complete read operation incurs a total latency of readout latency (25ns) + writeback latency (160ns). In contrast, when the E-O-E control unit uses a holding buffer, the read data is stored in the holding buffer at the end of read operation. The data from the holding buffer is written back to the OPCM cells only when the data buffer in the E-O-E control unit is empty, ensuring that the writeback operation does not stall any critical read and write operations. Using the highest read and write rate of the workloads that we evaluated, we determine that a *holding buffer* with 16 cache line slots, i.e., 1KB, is enough to avoid any memory read/write stalls. The holding buffer occupies < 1000 $\mu m^2$ area and can be integrated into the E-O-E control unit with minimal overhead. Figure 10 shows that using a holding buffer in COSMOS provides 59.2% average performance uplift.



Fig. 10. Performance comparison of COSMOS with and without holding buffer for opportunistic writeback in read operation.




Fig. 11. Average lifetime (in years) of COSMOS with different MLC capacities of OPCM cells and different memory capacities.

## 8.3 Endurance Analysis of COSMOS

Similar to EPCM cells, OPCM cells have limited endurance due to cell wearout. The OPCM cell endurance depends on how often we write to that cell [70]. Given that the read operation in COSMOS also includes a write (RESET) in step 2, the endurance of OPCM cells also depends on the read rate. We estimate the COSMOS lifetime using the equation proposed by Qureshi *et al.* [71]:

$$Y = \frac{S \cdot W_m}{B \cdot F \cdot 2^{25}}$$

where, *Y* is lifetime in years,  $W_m$  is maximum allowable writes per cell (10<sup>6</sup> for OPCM cells [52, 78]), *B* is write rate in bytes/cycle (average read+write rate across graph and HPC workloads), *F* is core frequency in Hz (1*GHz*), and *S* is COSMOS size in bytes (2*GB*, 4*GB* and 8*GB*).
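The lifetime equation is straightforward to evaluate once the access rate is known. The sketch below implements it directly; the access rate `ASSUMED_RATE` is a placeholder chosen for illustration and is not a value reported in the paper, the function name is hypothetical, and 1GB is taken as $2^{30}$ bytes.

```python
def lifetime_years(size_bytes, max_writes_per_cell, rate_bytes_per_cycle, freq_hz):
    """Y = S * W_m / (B * F * 2^25), with B in bytes/cycle and F in Hz."""
    return size_bytes * max_writes_per_cell / (rate_bytes_per_cycle * freq_hz * 2**25)

GIB = 2**30
W_M = 1e6             # maximum allowable writes per OPCM cell, from the text
FREQ = 1e9            # core frequency used in the text (1 GHz)
ASSUMED_RATE = 0.01   # bytes/cycle; placeholder, NOT a measured value

lifetimes = {
    size_gb: lifetime_years(size_gb * GIB, W_M, ASSUMED_RATE, FREQ)
    for size_gb in (2, 4, 8)
}
```

For a fixed access rate the estimate scales linearly with the memory size, so doubling the capacity doubles the estimated lifetime.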

Figure 11 plots the average lifetime for OPCM with different MLC capacities. Here, we assume that for a given memory size, all MLC options use the same number of silicon-photonic links. Hence, the COSMOS with 8-bit OPCM cells has higher effective throughput than the COSMOS with 4-bit OPCM cells and so an application running on COSMOS-8bit runs faster than an application running on COSMOS-4bit. As a result, for an application, even if the absolute number of memory writes is the same for both COSMOS-8bit and COSMOS-4bit, the average number of *writes/second* to COSMOS-8bit is higher than the average number of *writes/second* to COSMOS-4bit. Hence, the lifetime of COSMOS-8bit is lower than that of COSMOS-4bit, and similarly the lifetime of COSMOS-4bit is lower than that of COSMOS-2bit.

# 8.4 Area Analysis of the OPCM Array

To design the OPCM array in COSMOS, we use the prototype of a GST element developed by Rios et al. [75, 78] and the MRR dimensions from prior work as shown in Table 4. We use 3D stacking for OPCM array, with different banks stacked vertically (one bank per layer). The multi-mode waveguides in the interposer are routed vertically, and at each layer single-mode MRRs filter out the mode of all optical signals that belong to its corresponding bank. For a 2GB 4-bit OPCM array with 8 banks, a single bank consists of 1024 tiles with 32 cells/tile and a row and column of MRRs as shown in Figure 4b.<sup>2</sup> A bank, therefore, is composed of $1024 \times 32$ GSTs along a row/column with $(1024 \times 32 - 1) \times 50$nm of separation between GSTs, and a single row/column of MRRs at the beginning. Using the dimensions of these optical devices listed in Table 4, we calculate the area of a 2GB OPCM array and its bit density and report it in Table 5.

<sup>&</sup>lt;sup>2</sup>The tile size is limited by the number of unique optical signals in C and L bands with sufficient guardbands (32 in our case). The number of banks depends on the number of unique electromagnetic modes that can be supported (8 in our case).

Table 4. Dimensions of optical devices in the OPCM array.

| Optical device                   | Dimension              |
|----------------------------------|------------------------|
| GST                              | 500nm × 500nm [75, 78] |
| Separation between adjacent GSTs | 50nm [32]              |
| MRR diameter                     | 5µm [50]               |

Table 5. Bit density (MB/mm<sup>2</sup>) of memory technologies.

| Memory technology | Area of 2GB memory                   | Bit density (MB/mm<sup>2</sup>) |
|-------------------|--------------------------------------|---------------------------------|
| DDR4              | 224mm<sup>2</sup> [1]                | 9.14                            |
| HBM2.0            | 91.99mm<sup>2</sup> [38]             | 22.26                           |
| EPCM-2bit         | 336mm<sup>2</sup> (simulated [24])   | 6.095                           |
| 3D OPCM-4bit      | 268.43mm<sup>2</sup> (calculated)    | 7.63                            |
| 3D OPCM-8bit      | 67.1mm<sup>2</sup> (calculated)      | 30.52                           |


We compare the area and bit density of the 3D-stacked OPCM array in COSMOS with DDR4, 3D-stacked HBM2.0 and EPCM-2bit memory system (see Table 5).<sup>3</sup> With current OPCM cell footprints, 3D-stacked OPCM-4bit has $1.2 \times$ and $2.9 \times$ lower bit density than DDR4 and HBM2.0, respectively, and $1.25 \times$ higher bit density than EPCM-2bit. 3D-stacked OPCM-8bit has $3.4 \times$, $1.4 \times$ and $5 \times$ higher bit density than DDR4, HBM2.0 and EPCM-2bit, respectively. Nevertheless, device-level research efforts have demonstrated that GST elements are highly scalable and can retain the electrical and optical characteristics at amorphous and crystalline states [73, 88]. An aggressive chip prototype with $200nm \times 200nm$ GST element with 50nm separation has been recently fabricated [32]. These aggressive optical fabrication technologies promise to achieve several orders higher densities for OPCM arrays than current DRAM technologies.
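The densities in Table 5 are the 2GB capacity divided by each reported area, and the comparison factors are ratios of those densities. The sketch below recomputes them; the names are illustrative and 2GB is taken as 2048MB.

```python
CAPACITY_MB = 2 * 1024  # 2GB expressed in MB

# Areas (mm^2) of a 2GB memory, as listed in Table 5.
area_mm2 = {
    "DDR4": 224.0,
    "HBM2.0": 91.99,
    "EPCM-2bit": 336.0,
    "3D OPCM-4bit": 268.43,
    "3D OPCM-8bit": 67.1,
}

# Bit density in MB/mm^2 for each technology.
density = {name: CAPACITY_MB / area for name, area in area_mm2.items()}

def density_ratio(a, b):
    """How many times denser technology `a` is than technology `b`."""
    return density[a] / density[b]

opcm4_vs_epcm = density_ratio("3D OPCM-4bit", "EPCM-2bit")
hbm_vs_opcm4 = density_ratio("HBM2.0", "3D OPCM-4bit")
```

Because all entries share the same capacity, each density ratio is simply the inverse ratio of the corresponding areas.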

## 8.5 COSMOS vs DRAM

The overarching goal of COSMOS is to replace DRAM systems that are used widely in computing systems. We noted that though all other NVM systems (in their current form) provide non-volatility, data persistence and high scalability, their poor performance negates their benefits and makes them impractical to replace DRAM systems. We, therefore, compare the performance and energy of current state-of-the-art DRAM systems, DDR5 with 64 electrical links, DDR5 with 256 silicon-photonic links [12], COSMOS-4bit with 256 silicon-photonic links, and COSMOS-8bit with 256 silicon-photonic links. Figure 12 shows the overall system performance across the four configurations. For DDR5, replacing 64 electrical links with 256 silicon-photonic links provides 24% average performance improvement. This improvement results from the higher throughput due to dense WDM and single-cycle latency of silicon-photonic links. With COSMOS-4bit with 256 silicon-photonic links, we obtain 1.2% improvement in performance compared to DDR5 with 64 electrical links. This is in stark contrast to EPCM-2bit, which performs $4-5 \times$ worse than DDR5. COSMOS-8bit with 256 silicon-photonic links performs 24.7% better than DDR5 with 64 electrical links and 1.8% better than DDR5 with 256 silicon-photonic links. Here the increased read and write throughput due to the higher MLC capacity and dense WDM silicon-photonic links reduces the average memory access latency of COSMOS and in turn improves performance. Figure 7c shows that the average memory latency in COSMOS is 33.64ns across all workloads, which is lower than that of DDR5 DRAM (48ns).

Though we evaluate DDR5 memory with silicon-photonic links, such a system encounters several design challenges. To support silicon-photonic links in DDR5, memory requests from MC require an E-O conversion in MC and an O-E conversion in memory, and memory responses from DDR5 require an E-O conversion in memory and an O-E conversion in MC. Effectively, we need two extra conversions on the memory side. The active peripheral circuitry to support E-O-E conversions within memory increases the power density and raises thermal concerns. Due to the high thermal sensitivity of MRRs, there is a need for active thermal management. The power and resulting thermal concerns affect the reliability of optical communication in DRAM systems.

Fig. 12. Performance comparison of OPCM with DDR5.

<sup>3</sup>DDR5 area models were not publicly available at the time of submitting the manuscript. So we report a comparison with DDR4.
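The conversion count described above can be tallied with a small bookkeeping sketch. The path lists and the function name `conversions_per_location` are hypothetical and only encode the request and response paths as the text describes them; they do not model any real DDR5 or photonic interface.

```python
from collections import Counter

# Each hop is (location, conversion), following the description in the text:
# a request goes E-O in the MC and O-E in memory; a response goes E-O in
# memory and O-E in the MC.
REQUEST_PATH = [("MC", "E-O"), ("memory", "O-E")]
RESPONSE_PATH = [("memory", "E-O"), ("MC", "O-E")]


def conversions_per_location(*paths):
    """Count how many domain conversions each location performs per access."""
    counts = Counter()
    for path in paths:
        for location, _conversion in path:
            counts[location] += 1
    return counts


counts = conversions_per_location(REQUEST_PATH, RESPONSE_PATH)
print(dict(counts))
```

The tally shows two conversions at the memory for every request and response pair. These are the memory-side conversions that the text identifies as the source of added power density.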

We observe that COSMOS with a 4 *bits/cell* OPCM array demonstrates performance and energy characteristics similar to those of current state-of-the-art DDR5 systems, while COSMOS with an 8 *bits/cell* OPCM array improves performance. This is particularly exciting as COSMOS exhibits zero leakage power, better scaling and non-volatility, making it a viable replacement for DRAM in the near future.

#### 9 RELATED WORK

#### 9.1 Phase Change Memories

Several works have proposed architectural and management policies to address the PCM challenges and have designed EPCM systems either as a standalone main memory, as part of hybrid DRAM-PCM systems or as a storage memory between DRAM and flash memory [5, 25, 33, 34, 36, 39, 43, 46, 47, 69, 71, 72, 83, 85, 95, 98]. Most of these efforts have focused on addressing the long write latency and high write energy. A summary of these efforts is shown in Table 6. Hybrid DRAM-PCM systems leverage the higher bit density in PCMs for improved performance, but at the cost of higher write energy [33, 46, 47, 71, 72]. To address PCM cell wearout, the techniques to enhance the write endurance include rotation-based wear leveling [70], process variation-aware leveling [23, 103], and writeback minimization and endurance management [28]. Due to their lower write endurance, PCM cells are also susceptible to malicious write attacks. Common strategies employed in EPCMs to thwart these attacks and improve reliability include write-efficient data encryption [99], multi-way wear leveling [101], write-verify-write [62] and randomized address mapping [80]. These techniques can be readily deployed in OPCM. While several approaches discussed above address EPCM limitations, EPCM is not yet a viable alternative to DRAM due to its scalability and reliability challenges, high energy overhead and constrained bandwidth density.

In Table 6, we see that optical control of PCMs combined with silicon-photonic links significantly improves performance and lowers energy, without using any of the complementary methods provided in prior work. Applying these complementary methods to OPCM will further improve its performance and lower energy.

| | Fine-grained power budgeting [34] | Write truncation [36] | Logical decoupling & mapping [98] | Proactive SET [69] | Partition-aware scheduling [83] | Double-XOR mapping [25] | Boosting rank parallelism [5] | COSMOS |
|---|---|---|---|---|---|---|---|---|
| Performance gains | 76% | 26% | 19.2% | 34% | 28% | 12% | 16.7% | $2.31\times$ |
| Energy reductions | NR | NR | 14.4% | 25% | 20% | NR | NR | $4\times$ |

Table 6. Survey of research efforts to improve write performance and write energy for using EPCM as main memory. The performance gains and energy reductions are shown in comparison to a naive EPCM system. (NR: Not reported)
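Table 6 reports the prior EPCM techniques as percentage gains and COSMOS as a multiplicative factor. The sketch below puts the two on the same scale. The helper name `gain_to_factor` and the constant names are illustrative. The percentages are the performance-gain entries of Table 6, listed without attributing them to individual techniques. The sketch assumes all gains are measured against a comparable naive EPCM baseline, as the caption states.

```python
def gain_to_factor(percent_gain: float) -> float:
    """Convert a percentage gain over a baseline into a speedup factor."""
    return 1.0 + percent_gain / 100.0


# Performance gains of the prior EPCM techniques listed in Table 6 (percent).
PRIOR_GAINS_PERCENT = [76.0, 26.0, 19.2, 34.0, 28.0, 12.0, 16.7]
# COSMOS is reported directly as a factor in Table 6.
COSMOS_SPEEDUP = 2.31

prior_factors = [gain_to_factor(g) for g in PRIOR_GAINS_PERCENT]
best_prior = max(prior_factors)

for gain, factor in zip(PRIOR_GAINS_PERCENT, prior_factors):
    print(f"{gain:5.1f}% gain -> {factor:.3f}x")
print(f"best prior factor: {best_prior:.2f}x, COSMOS: {COSMOS_SPEEDUP:.2f}x")
```

On this scale the largest prior gain (76%) corresponds to a factor of 1.76, which is below the 2.31 reported for COSMOS.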

#### 9.2 Silicon-Photonic Links and OPCM Cells

Silicon-photonic links have enabled high bandwidth-density and low-energy communication between processor and memory [9, 10, 12, 22, 59, 84, 86, 87]. To provide high DRAM internal bandwidth, Beamer *et al.* [12] proposed a joint silicon-photonic link and electro-photonic DRAM design. However, the O-E-O conversion in DRAM adds to the latency. Optical control of memory cells can avoid this O-E-O conversion and enable signals in the silicon-photonic links to directly access the cells and deliver higher memory throughput.

Several recent efforts have prototyped GST-based PCM cells with optical control. Ríos *et al.* demonstrate the optical control of multi-bit GST-based PCMs with fast readout and low switching energies [78]. Zhang *et al.* [102] present an approach to selectively couple optical signals from MRR to GST. Feldmann *et al.* [26, 27] design a prototype of a monolithic OPCM array based on waveguide crossings, but do not provide a comprehensive memory microarchitecture or access protocols. Subsequent efforts demonstrate higher bit density per GST [52], in-memory computing on PCM cells using optical signals [76], basic arithmetic operations in OPCM [26, 27], and a behavioral model for neuromorphic computing [18]. We are the first to propose a comprehensive OPCM microarchitecture with custom read/write access protocols, and to design an E-O-E control unit to interface the OPCM array with the processor.

#### 10 CONCLUSION

EPCM systems suffer from long write latencies and high write energies, yielding poor performance and high energy consumption for data-intensive applications. In contrast, OPCM technology provides the opportunity to design high-performance and low-energy memory systems due to its higher MLC capacity and the direct cell access via high-bandwidth-density and low-latency silicon-photonic links. Adapting the current EPCM design architecture for OPCM systems, however, raises major issues in terms of latency, energy and thermal concerns, thereby rendering such a design impractical. We are the first to architect a complete memory system, COSMOS, which consists of an OPCM array microarchitecture, a read/write access protocol tailored for OPCM technology, and an E-O-E control unit that interfaces the OPCM array with the MC. Our evaluations show that, compared to an EPCM system, our proposed COSMOS system provides $2.09\times$ higher read throughput and $2.15\times$ higher write throughput, thereby reducing the execution time by $2.14\times$, read energy by $1.24\times$, and write energy by $4.06\times$.

We show that COSMOS designed with state-of-the-art technology provides similar performance and energy as DDR5. This is a significant finding as future higher-density OPCM cells are expected to provide better performance. Our promising first version of a COSMOS architecture opens doors for new architecture-level, circuit-level, and system-level methods to enable practical integration of OPCM-based main memory in future computing systems. Moreover, the high-throughput and scalable OPCM technology ushers in interesting research opportunities in persistent memory, in-memory computing, and accelerator-specific memory designs.


#### REFERENCES

- [1] "DDR4 area," http://www.micron.com/-/media/client/global/documents/products/technical-note/dram/tn4041\_adding\_ecc\_with\_ddr4\_x16\_
- [2] J. L. Abellán, A. K. Coskun, A. Gu, W. Jin, A. Joshi, A. B. Kahng, J. Klamkin, C. Morales, J. Recchio, V. Srinivas *et al.*, "Adaptive tuning of photonic devices in a photonic NoC through dynamic workload allocation," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 36, no. 5, pp. 801–814, 2016.
- [3] A. Acar, H. Aksu, A. S. Uluagac, and M. Conti, "A survey on homomorphic encryption schemes: Theory and implementation," *ACM Computing Surveys*, vol. 51, no. 4, pp. 1–35, 2018.
- [4] A. Alduino, L. Liao, R. Jones, M. Morse, B. Kim, W.-Z. Lo, J. Basak, B. Koch, H.-F. Liu, H. Rong, M. Sysak, C. Krause, R. Saba, D. Lazar, L. Horwitz, R. Bar, S. Litski, A. Liu, K. Sullivan, O. Dosunmu, N. Na, T. Yin, F. Haubensack, I. wei Hsieh, J. Heck, R. Beatty, H. Park, J. Bovington, S. Lee, H. Nguyen, H. Au, K. Nguyen, P. Merani, M. Hakami, and M. Paniccia, "Demonstration of a high speed 4-channel integrated silicon photonics WDM link with hybrid silicon lasers," in *Proc. Integrated Photonics Research, Silicon and Nanophotonics and Photonics in Switching, Monterey, California, USA*. Optical Society of America, 2010, p. PDIWI5.
- [5] M. Arjomand, M. T. Kandemir, A. Sivasubramaniam, and C. R. Das, "Boosting access parallelism to PCM-based main memory," in *Proc. International Symposium on Computer Architecture, Seoul, South Korea*, 2016, pp. 695–706.
- [6] M. Bahadori, R. Polster, S. Rumley, Y. Thonnart, J.-L. Gonzalez-Jimenez, and K. Bergman, "Energy-bandwidth design exploration of silicon photonic interconnects in 65nm CMOS," in *Proc. Optical Interconnects Conference, San Diego, CA, USA*, 2016, pp. 2–3.
- [7] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS parallel benchmarks," *The International Journal of Supercomputing Applications*, vol. 5, no. 3, pp. 63–73, 1991.
- [8] T. Barwicz, H. Byun, F. Gan, C. Holzwarth, M. Popovic, P. Rakich, M. Watts, E. Ippen, F. Kärtner, H. Smith, J. S. Orcutt, R. J. Ram, V. Stojanovic, O. O. Olubuyide, J. L. Hoyt, S. Spector, M. Geis, M. Grein, T. Lyszczarz, and J. U. Yoon, "Silicon photonics for compact, energy-efficient interconnects," *Journal of Optical Networking*, vol. 6, no. 1, pp. 63–73, 2007.
- [9] C. Batten, A. Joshi, V. Stojanovic, and K. Asanovic, "Designing chip-level nanophotonic interconnection networks," *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, vol. 2, no. 2, pp. 137–153, 2012.
- [10] C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C. W. Holzwarth, M. A. Popovic, H. Li, H. I. Smith, J. L. Hoyt, F. Kartner, R. J. Ram, V. Stojanović, and K. Asanović, "Building many-core processor-to-DRAM networks with monolithic CMOS silicon photonics," in *Proc. Symposium on High Performance Interconnects, Stanford, CA, USA*, 2008, pp. 21–30.
- [11] S. Beamer, K. Asanović, and D. Patterson, "The GAP benchmark suite," arXiv preprint arXiv:1508.03619, 2015.
- [12] S. Beamer, C. Sun, Y.-j. Kwon, A. Joshi, C. Batten, V. Stojanovic, and K. Asanović, "Re-architecting DRAM with monolithically integrated silicon photonics," in *Proc. International Symposium on Computer Architecture, Saint-Malo, France*, 2009, pp. 129–140.
- [13] F. Bedeschi, R. Fackenthal, C. Resta, E. M. Donze, M. Jagasivamani, E. C. Buda, F. Pellizzer, D. W. Chow, A. Cabrini, G. M. A. Calvi, R. Faravelli, A. Fantini, G. Torelli, D. Mills, R. Gastaldi, and G. Casagrande, "A multi-level-cell bipolar-selected phase-change memory," in *Proc. International Solid-State Circuits Conference, San Francisco, CA*, 2008, pp. 428–625.
- [14] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 simulator," *ACM SIGARCH computer architecture news*, vol. 39, no. 2, pp. 1–7, 2011.
- [15] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," *arXiv preprint arXiv:2005.14165*, 2020.
- [16] G. Burr, M. Breitwisch, M. Franceschini, D. Garetto, K. Gopalakrishnan, B. Jackson, B. Kurdi, C. Lam, L. Lastras, A. Padilla, and B. Rajendran, "Phase change memory technology," *Journal of Vacuum Science & Technology B, Nanotechnology and Microelectronics: Materials, Processing, Measurement, and Phenomena*, vol. 28, no. 2, pp. 223–262, 2010.
- [17] A. Cabrini, S. Braga, A. Manetto, and G. Torelli, "Voltage-driven multilevel programming in phase change memories," in *Proc. International Workshop on Memory Technology, Design, and Testing, Hsinchu, Taiwan*, 2009, pp. 3–6.
- [18] S. G.-C. Carrillo, E. Gemo, X. Li, N. Youngblood, A. Katumba, P. Bienstman, W. Pernice, H. Bhaskaran, and C. D. Wright, "Behavioral modeling of integrated phase-change photonic devices for neuromorphic computing applications," *APL Materials*, vol. 7, no. 9, p. 091113, 2019.
- [19] J. H. Cheon, A. Kim, M. Kim, and Y. Song, "Homomorphic encryption for arithmetic of approximate numbers," in *Proc. International Conference on the Theory and Application of Cryptology and Information Security, Hong Kong, China.* Springer, 2017, pp. 409–437.
- [20] Y. Choi, I. Song, M.-H. Park, S. C. Hoeju Chung, B. Cho, J. Kim, Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M. G. Kang, J. Lee, Y. Kwon, S. Kim, J. Kim, Y. jun Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H.-S. Joo, K. Lee, Y.-T. Lee, J.-H. Yoo, and G. Jeong, "A 20nm 1.8V 8Gb PRAM with 40MB/s program bandwidth," in *Proc. International Solid-State Circuits Conference, San Francisco, CA, USA*, 2012, pp. 46–48.
- [21] A. Coskun, F. Eris, A. Joshi, A. B. Kahng, Y. Ma, A. Narayan, and V. Srinivas, "Cross-layer co-optimization of network design and chiplet placement in 2.5 D systems," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 39, no. 12, pp. 5183–5196, 2020.
- [22] Y. Demir, Y. Pan, S. Song, N. Hardavellas, J. Kim, and G. Memik, "Galaxy: A high-performance energy-efficient multi-chip architecture using photonic interconnects," in *Proc. International Conference on Supercomputing, Munich, Germany*, 2014, pp. 303–312.
- [23] J. Dong, L. Zhang, Y. Han, Y. Wang, and X. Li, "Wear rate leveling: Lifetime enhancement of PRAM with endurance variation," in *Proc. Design Automation Conference, New York, NY, USA*, 2011, pp. 972–977.
- [24] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 31, no. 7, pp. 994–1007, 2012.
- [25] Y. Du, M. Zhou, B. R. Childers, D. Mossé, and R. Melhem, "Bit mapping for balanced PCM cell programming," in *Proc. International Symposium on Computer Architecture, Tel Aviv, Israel*, 2013, pp. 428–439.
- [26] J. Feldmann, M. Stegmaier, N. Gruhler, C. Ríos, H. Bhaskaran, C. Wright, and W. Pernice, "Calculating with light using a chip-scale all-optical abacus," *Nature communications*, vol. 8, no. 1, pp. 1–8, 2017.
- [27] J. Feldmann, N. Youngblood, X. Li, C. D. Wright, H. Bhaskaran, and W. H. Pernice, "Integrated 256 cell photonic phase-change memory with 512-bit capacity," *IEEE Journal of Selected Topics in Quantum Electronics*, vol. 26, no. 2, pp. 1–7, 2019.
- [28] A. P. Ferreira, M. Zhou, S. Bock, B. Childers, R. Melhem, and D. Mossé, "Increasing PCM main memory lifetime," in *Proc. Design, Automation & Test in Europe Conference & Exhibition, Dresden, Germany*, 2010, pp. 914–919.
- [29] X. Gao, C. Shan, C. Hu, Z. Niu, and Z. Liu, "An adaptive ensemble machine learning model for intrusion detection," *IEEE Access*, vol. 7, pp. 82 512–82 521, 2019.
- [30] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, "Powergraph: Distributed graph-parallel computation on natural graphs," in *USENIX Symposium on Operating Systems Design and Implementation*, 2012, pp. 17–30.
- [31] P. Grani and S. Bartolini, "Design options for optical ring interconnect in future client devices," *ACM Journal on Emerging Technologies in Computing Systems*, vol. 10, no. 4, pp. 1–25, 2014.
- [32] P. Hosseini, C. D. Wright, and H. Bhaskaran, "An optoelectronic framework enabled by low-dimensional phase-change films," *Nature*, vol. 511, no. 7508, pp. 206–211, 2014.
- [33] G. Jia, G. Han, J. Jiang, and L. Liu, "Dynamic adaptive replacement policy in shared last-level cache of DRAM/PCM hybrid memory for big data storage," *IEEE Transactions on Industrial Informatics*, vol. 13, no. 4, pp. 1951–1960, 2016.
- [34] L. Jiang, Y. Zhang, B. R. Childers, and J. Yang, "FPB: Fine-grained power budgeting to improve write throughput of multi-level cell phase change memory," in *Proc. International Symposium on Microarchitecture, Vancouver, BC, Canada*, 2012, pp. 1–12.
- [35] L. Jiang, B. Zhao, J. Yang, and Y. Zhang, "A low power and reliable charge pump design for phase change memories," *Proc. International Symposium on Computer Architecture, Minneapolis, MN, USA*, vol. 42, no. 3, pp. 397–408, 2014.
- [36] L. Jiang, B. Zhao, Y. Zhang, J. Yang, and B. R. Childers, "Improving write operations in MLC phase change memory," in *Proc. International Symposium on High Performance Computer Architecture, New Orleans, LA, USA*, 2012, pp. 1–10.
- [37] U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. S. Choi, "Co-architecting controllers and dram to enhance dram process scaling," in *The memory forum*, vol. 14, 2014.
- [38] J. Kim and Y. Kim, "HBM: Memory solution for bandwidth-hungry processors," in *Proc. Hot Chips Symposium, Cupertino, CA, USA*, 2014, pp. 1–24.
- [39] N. S. Kim, C. Song, W. Y. Cho, J. Huang, and M. Jung, "LL-PCM: Low-latency phase change memory architecture," in *Proc. Design Automation Conference, Las Vegas, NV, USA*, 2019, pp. 1–6.
- [40] S. K. Kim, S. W. Lee, J. H. Han, B. Lee, S. Han, and C. S. Hwang, "Capacitors with an equivalent oxide thickness of < 0.5nm for nanoscale electronic semiconductor memory," *Advanced Functional Materials*, vol. 20, no. 18, pp. 2989–3003, 2010.
- [41] S. K. Kim and M. Popovici, "Future of dynamic random-access memory as main memory," *MRS Bulletin*, vol. 43, no. 5, p. 334, 2018.
- [42] A. Krishnamoorthy, H. Schwetman, X. Zheng, and R. Ho, "Energy-efficient photonics in future high-connectivity computing systems," *Journal Of Lightwave Technology*, vol. 33, no. 4, pp. 889–900, 2015.
- [43] M. Le Gallo, A. Sebastian, R. Mathis, M. Manica, H. Giefers, T. Tuma, C. Bekas, A. Curioni, and E. Eleftheriou, "Mixed-precision in-memory computing," *Nature Electronics*, vol. 1, no. 4, pp. 246–253, 2018.
- [44] B. G. Lee, X. Chen, A. Biberman, X. Liu, I. Hsieh, C. Chou, J. I. Dadap, F. Xia, W. M. J. Green, L. Sekaric, Y. A. Vlasov, R. M. Osgood, and K. Bergman, "Ultrahigh-bandwidth silicon photonic nanowire waveguides for on-chip networks," *IEEE Photonics Technology Letters*, vol. 20, no. 6, pp. 398–400, 2008.
- [45] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in *Proc. International Symposium on Computer Architecture, Austin, TX, USA*, 2009, pp. 2–13.
- [46] H. G. Lee, S. Baek, C. Nicopoulos, and J. Kim, "An energy-and performance-aware DRAM cache architecture for hybrid DRAM/PCM main memory systems," in *Proc. International Conference on Computer Design, Amherst, MA, USA*, 2011, pp. 381–387.
- [47] S. Lee, H. Bahn, and S. H. Noh, "Characterizing memory write references for efficient management of hybrid PCM and DRAM memory," in *Proc. International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, Singapore, Singapore*, 2011, pp. 168–175.
- [48] C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and T. W. Keller, "Energy management for commercial servers," *Computer*, vol. 36, no. 12, pp. 39–48, 2003.
- [49] J. Leskovec and A. Krevl, "SNAP Datasets: Stanford large network dataset collection," http://snap.stanford.edu/data, Jun. 2014.

- [50] C. Li, R. Bai, A. Shafik, E. Z. Tabasy, G. Tang, C. Ma, C.-H. Chen, Z. Peng, M. Fiorentino, P. Chiang, and S. Palermo, "A ring-resonator-based silicon photonics transceiver with bias-based wavelength stabilization and adaptive-power-sensitivity receiver," in *Proc. International Solid-State Circuits Conference, San Francisco, CA, USA*, 2013, pp. 124–125.
- [51] X. Li, N. Youngblood, Z. Cheng, S. G.-C. Carrillo, E. Gemo, W. H. Pernice, C. D. Wright, and H. Bhaskaran, "Experimental investigation of silicon and silicon nitride platforms for phase-change photonic in-memory computing," *Optica*, vol. 7, no. 3, pp. 218–225, 2020.
- [52] X. Li, N. Youngblood, C. Ríos, Z. Cheng, C. D. Wright, W. H. Pernice, and H. Bhaskaran, "Fast and reliable storage using a 5 bit, nonvolatile photonic memory cell," *Optica*, vol. 6, no. 1, pp. 1–6, 2019.
- [53] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud," *Proc. VLDB Endowment, Istanbul, Turkey*, vol. 5, no. 8, pp. 716–727, Apr. 2012. [Online]. Available: https://doi.org/10.14778/2212351.2212354
- [54] L.-W. Luo, N. Ophir, C. P. Chen, L. H. Gabrielli, C. B. Poitras, K. Bergman, and M. Lipson, "WDM-compatible mode-division multiplexing on a silicon chip," *Nature communications*, vol. 5, no. 1, pp. 1–7, 2014.
- [55] H.-K. Lyeo, D. G. Cahill, B.-S. Lee, J. R. Abelson, M.-H. Kwon, K.-B. Kim, S. G. Bishop, and B.-k. Cheong, "Thermal conductivity of phase-change material *Ge*<sub>2</sub>*Sb*<sub>2</sub>*Te*<sub>5</sub>," *Applied Physics Letters*, vol. 89, no. 15, p. 151904, 2006.
- [56] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in *Proc. of International Conference on Management of data, Indianapolis, Indiana, USA*, 2010, pp. 135–146.
- [57] A.-K. U. Michel, P. Zalden, D. N. Chigrin, M. Wuttig, A. M. Lindenberg, and T. Taubner, "Reversible optical switching of infrared antenna resonances with ultrathin phase-change layers using femtosecond laser pulses," *ACS Photonics*, vol. 1, no. 9, pp. 833–839, 2014.
- [58] O. Mutlu, "Memory scaling: A systems architecture perspective," in *International Memory Workshop*, 2013, pp. 21–25.
- [59] A. Narayan, Y. Thonnart, P. Vivet, A. Joshi, and A. K. Coskun, "System-level evaluation of chip-scale silicon photonic networks for emerging data-intensive applications," in *Proc. Design, Automation & Test in Europe Conference & Exhibition, Grenoble, France*, 2020, pp. 1444–1449.
- [60] A. Narayan, Y. Thonnart, P. Vivet, C. F. Tortolero, and A. K. Coskun, "WAVES: Wavelength selection for power-efficient 2.5D-integrated photonic NoCs," in *Proc. Design, Automation & Test in Europe Conference & Exhibition, Florence, Italy*, 2019, pp. 516–521.
- [61] T. Nirschl, J. Philipp, T. Happ, G. Burr, B. Rajendran, M.-H. Lee, A. Schrott, M. Yang, M. Breitwisch, C.-F. Chen, E. Joseph, M. Lamorey, R. Cheek, S.-H. Chen, S. Zaidi, S. Raoux, Y. Chen, Y. Zhu, R. Bergmann, H.-L. Lung, and C. Lam, "Write strategies for 2 and 4-bit multi-level phase-change memory," in *Proc. International Electron Devices Meeting, Washington, DC, USA*, 2007, pp. 461–464.
- [62] H. Noguchi, K. Ikegami, S. Takaya, E. Arima, K. Kushida, A. Kawasumi, H. Hara, K. Abe, N. Shimomura, J. Ito, S. Fujita, T. Nakada, and H. Nakamura, "4Mb STT-MRAM-based cache with memory-access-aware power optimization and write-verify-write / read-modify-write scheme," in *Proc. International Solid-State Circuits Conference, San Francisco, CA, USA*, 2016, pp. 132–133.
- [63] M. Notomi, K. Nozaki, A. Shinya, S. Matsuo, and E. Kuramochi, "Toward fj/bit optical communication in a chip," *Optics Communications*, vol. 314, pp. 3–17, 2014.
- [64] S. R. Ovshinsky, "Reversible electrical switching phenomena in disordered structures," *Physical Review Letters*, vol. 21, no. 20, p. 1450, 1968.
- [65] K. Padmaraju and K. Bergman, "Resolving the thermal challenges for silicon microring resonator devices," *Nanophotonics*, vol. 3, no. 4-5, pp. 269–281, 2014.
- [66] G. Palumbo and D. Pappalardo, "Charge pump circuits: An overview on design strategies and topologies," *IEEE Circuits and Systems Magazine*, vol. 10, no. 1, pp. 31–45, 2010.
- [67] A. Pandey and S. K. Selvaraja, "Four channel 48Gbps multicasting in a coupled Si ring resonator with tunable channel spacing," in *Proc. Conference on Lasers and Electro-Optics/Pacific Rim, Hong Kong, Hong Kong*. Optical Society of America, 2018, pp. W2D–4.
- [68] M. Poremba, T. Zhang, and Y. Xie, "Nvmain 2.0: A user-friendly memory simulator to model (non-) volatile memory systems," *IEEE Computer Architecture Letters*, vol. 14, no. 2, pp. 140–143, 2015.
- [69] M. K. Qureshi, M. M. Franceschini, A. Jagmohan, and L. A. Lastras, "PreSET: Improving performance of phase change memories by exploiting asymmetry in write times," in *Proc. International Symposium on Computer Architecture, Portland, Oregon, USA*, 2012.
- [70] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, "Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling," in *Proc. international Symposium on Microarchitecture, New York, NY, USA*, 2009, pp. 14–23.
- [71] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main memory system using phase-change memory technology," in *Proc. International Symposium on Computer Architecture, Austin, Texas, USA*, 2009, pp. 24–33.
- [72] L. E. Ramos, E. Gorbatov, and R. Bianchini, "Page placement in hybrid memory systems," in *Proc. International Conference on Supercomputing, Tucson, Arizona, USA*, 2011, pp. 85–95.
- [73] S. Raoux, G. W. Burr, M. J. Breitwisch, C. T. Rettner, Y. Chen, R. M. Shelby, M. Salinga, D. Krebs, S. Chen, H. Lung, and C. H. Lam, "Phase-change random access memory: A scalable technology," *IBM Journal of Research and Development*, vol. 52, no. 4.5, pp. 465–479, 2008.
- [74] A. S. Rekhi, B. Zimmer, N. Nedovic, N. Liu, R. Venkatesan, M. Wang, B. Khailany, W. J. Dally, and C. T. Gray, "Analog/mixed-signal hardware error modeling for deep learning inference," in *Proceedings of Design Automation Conference, Las Vegas, NV, USA*, 2019, pp. 1–6.
- [75] C. Rios, P. Hosseini, C. D. Wright, H. Bhaskaran, and W. H. Pernice, "On-chip photonic memory elements employing phase-change materials," *Advanced Materials*, vol. 26, no. 9, pp. 1372–1377, 2014.
- [76] C. Ríos, N. Youngblood, Z. Cheng, M. Le Gallo, W. H. Pernice, C. D. Wright, A. Sebastian, and H. Bhaskaran, "In-memory computing on a photonic platform," *Science advances*, vol. 5, no. 2, p. eaau5759, 2019.
- [77] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, "DRAMSim2: A cycle accurate memory system simulator," *IEEE computer architecture letters*, vol. 10, no. 1, pp. 16–19, 2011.
- [78] C. Ríos, M. Stegmaier, P. Hosseini, D. Wang, T. Scherer, C. D. Wright, H. Bhaskaran, and W. H. Pernice, "Integrated all-photonic non-volatile multi-level memory," *Nature Photonics*, vol. 9, no. 11, p. 725, 2015.
- [79] S. Salihoglu and J. Widom, "Gps: A graph processing system," in *Proc. of the International Conference on Scientific and Statistical Database Management, Baltimore, Maryland, USA*, 2013, pp. 1–12.
- [80] N. H. Seong, D. H. Woo, and H.-H. S. Lee, "Security refresh: Prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping," in *Proc. International Symposium on Computer Architecture, Saint-Malo, France*, 2010, pp. 383–394.
- [81] K. Shang, S. Pathak, B. Guan, G. Liu, and S. Yoo, "Low-loss compact multilayer silicon nitride platform for 3D photonic integrated circuits," *Optics Express*, vol. 23, no. 16, pp. 21 334–21 342, 2015.
- [82] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," *IEEE internet of things journal*, vol. 3, no. 5, pp. 637–646, 2016.
- [83] S. Song, A. Das, O. Mutlu, and N. Kandasamy, "Enabling and exploiting partition-level parallelism (PALP) in phase change memories," *ACM Transactions on Embedded Computing Systems*, vol. 18, no. 5s, pp. 1–25, 2019.
- [84] C. Sun, M. T. Wade, Y. Lee, J. S. Orcutt, L. Alloatti, M. S. Georgas, A. S. Waterman, J. M. Shainline, R. R. Avizienis, S. Lin, B. R. Moss, R. Kumar, F. Pavanello, A. H. Atabaki, H. M. Cook, A. J. Ou, J. C. Leu, Y.-H. Chen, K. Asanović, R. J. Ram, M. A. Popović, and V. M. Stojanović, "Single-chip microprocessor that communicates directly using light," *Nature*, vol. 528, no. 7583, pp. 534–538, 2015.
- [85] I. G. Thakkar and S. Pasricha, "DyPhase: A dynamic phase change memory architecture with symmetric write latency and restorable endurance," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 37, no. 9, pp. 1760–1773, 2017.
- [86] Y. Thonnart, S. Bernabé, J. Charbonnier, C. Bernard, D. Coriat, C. Fuguet, P. Tissier, B. Charbonnier, S. Malhouitre, D. Saint-Patrice, M. Assous, A. Narayan, A. Coskun, D. Dutoit, and P. Vivet, "POPSTAR: A robust modular optical NoC architecture for chiplet-based 3D integrated systems," in *Proc. Design, Automation & Test in Europe Conference & Exhibition, Grenoble, France*, 2020, pp. 1456–1461.
- [87] M. Wade, E. Anderson, S. Ardalan, P. Bhargava, S. Buchbinder, M. L. Davenport, J. Fini, H. Lu, C. Li, R. Meade, C. Ramamurthy, M. Rust, F. Sedgwick, V. Stojanovic, D. Van Orden, C. Zhang, C. Sun, S. Y. Shumarayev, C. O'Keeffe, T. T. Hoang, D. Kehlet, R. V. Mahajan, M. T. Guzy, A. Chan, and T. Tran, "Teraphy: A chiplet technology for low-power, high-bandwidth in-package optical i/o," *IEEE Micro*, vol. 40, no. 2, pp. 63–71, 2020.
- [88] J. Wang, L. Wang, and J. Liu, "Overview of phase-change materials based photonic devices," *IEEE Access*, vol. 8, pp. 121 211–121 245, 2020.
- [89] S. Wang, X. Feng, S. Gao, Y. Shi, T. Dai, H. Yu, H.-K. Tsang, and D. Dai, "On-chip reconfigurable optical add-drop multiplexer for hybrid wavelength/mode-division-multiplexing systems," *Optics letters*, vol. 42, no. 14, pp. 2802–2805, 2017.
- [90] O.-Y. Wong, H. Wong, W.-S. Tam, and C. Kok, "A comparative study of charge pumping circuits for flash memory applications," *Microelectronics Reliability*, vol. 52, no. 4, pp. 670–687, 2012.
- [91] J. Woodhouse, "Big, big, big data: higher and higher resolution video surveillance," technology.ihs.com, 2016.
- [92] X. Wu, C. Huang, K. Xu, C. Shu, and H. K. Tsang, "Mode-division multiplexing for silicon photonic network-on-chip," *Journal of Lightwave Technology*, vol. 35, no. 15, pp. 3223–3228, 2017.
- [93] M. Wuttig, H. Bhaskaran, and T. Taubner, "Phase-change materials for non-volatile photonic applications," *Nature Photonics*, vol. 11, no. 8, pp. 465–476, 2017.
- [94] M. Wuttig and N. Yamada, "Phase-change materials for rewriteable data storage," *Nature materials*, vol. 6, no. 11, pp. 824–832, 2007.
- [95] F. Xia, D. Jiang, J. Xiong, M. Chen, L. Zhang, and N. Sun, "DWC: Dynamic write consolidation for phase change memory systems," in *Proc. International conference on Supercomputing, Munich, Germany*, 2014, pp. 211–220.
- [96] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the challenges of crossbar resistive memory architectures," in *Proc. International Symposium on High Performance Computer Architecture, Bay Area, California, USA*, 2015, pp. 476–488.
- [97] Y.-D. Yang, Y. Li, Y.-Z. Huang, and A. W. Poon, "Silicon nitride three-mode division multiplexing and wavelength-division multiplexing using asymmetrical directional couplers and microring resonators," *Optics express*, vol. 22, no. 18, pp. 22 172–22 183, 2014.
- [98] H. Yoon, J. Meza, N. Muralimanohar, N. P. Jouppi, and O. Mutlu, "Efficient data mapping and buffering techniques for multilevel cell phase-change memories," *ACM Transactions on Architecture and Code Optimization*, vol. 11, no. 4, pp. 1–25, 2014.
- [99] V. Young, P. J. Nair, and M. K. Qureshi, "DEUCE: Write-efficient encryption for non-volatile memories," in *Proc. Architectural Support for Programming Languages and Operating Systems, Istanbul, Turkey*, 2015, pp. 33–44.
- [100] N. Youngblood, C. Ríos, E. Gemo, J. Feldmann, Z. Cheng, A. Baldycheva, W. H. Pernice, C. D. Wright, and H. Bhaskaran, "Tunable volatility of Ge<sub>2</sub>Sb<sub>2</sub>Te<sub>5</sub> in integrated photonics," *Advanced Functional Materials*, vol. 29, no. 11, p. 1807571, 2019.
- [101] H. Yu and Y. Du, "Increasing endurance and security of phase-change memory with multi-way wear-leveling," *IEEE Transactions on Computers*, vol. 63, no. 5, pp. 1157–1168, 2012.
- [102] H. Zhang, L. Zhou, J. Xu, L. Lu, J. Chen, and B. Rahman, "All-optical non-volatile tuning of an AMZI-coupled ring resonator with GST phase-change material," *Optics letters*, vol. 43, no. 22, pp. 5539–5542, 2018.
- [103] M. Zhao, L. Jiang, Y. Zhang, and C. J. Xue, "SLC-enabled wear leveling for MLC PCM considering process variation," in *Proc. Design Automation Conference, San Francisco, CA, USA*, 2014, pp. 1–6.