1. Introduction
As of 2020, data centers consumed about 1% of worldwide electricity use [1]. As cloud services such as artificial intelligence, autonomous vehicles, and distributed manufacturing systems expand, more data centers are in demand. Their energy use has been predicted to increase by a factor of three or four within the next decade. As a result, many researchers have been focusing on reducing power without affecting computing performance.
Figure 1 compares power breakdowns for data centers in [2,3,4]: (1) servers and storage systems, (2) cooling and humidification systems, (3) power conditioning equipment, (4) networking equipment, and (5) lighting and physical security. Server power strongly depends on workload. Cooling power depends on ambient weather conditions around the data center facility. These three data sources show that servers and cooling consume about 80% of data center power.
Figure 2 shows the breakdown of “servers” into CPU, DRAM, and storage. Storage includes hard disk drives (HDDs) and NAND Flash based solid-state drives (SSDs). NAND Flash memory is a nonvolatile semiconductor memory. As its bit density has increased, SSDs, whose storage components are NAND Flash memory dies, have been replacing HDDs in data centers as well as personal computers because of their lower power and lower latency.
Distributed temperature control units regulate the local temperature surrounding CPUs running different workloads [5]. As a result, the total power for servers and cooling can be minimized. Energy-efficient distribution of power converters is also important for reducing the total power of data centers [6]. Shuffled topologies spread secondary power feeds over the power grid, which tolerates a single power unit failure. Power routing schedules workload dynamically. The network consumes much less power than servers and cooling at full utilization. However, since servers typically operate at much lower levels of utilization, the network power cannot be ignored. If the system is 15% utilized and the servers are fully energy-proportional, the network will consume about 50% of overall power [7]. Thus, the network power needs to be proportional to the workload. It has been shown that a flattened butterfly topology is inherently more power efficient than other commonly proposed topologies for high-performance data center networks [7]. Database software also affects the energy efficiency of servers. It has been shown that CPU power consumption varies by as much as 60% across operators at the same CPU utilization [8]. Thus, data center energy consumption depends on the individual hardware components such as CPUs, memories, storage, cooling, and network; on the construction and control of the hardware and software; and on workload and environmental temperature. Therefore, energy consumption models are important for designing energy-efficient data centers and optimizing their operation. Reference [9] surveys more than 200 models covering all hierarchical levels of the hardware. In [10], analytical models called FlashPower were developed to estimate NAND Flash memory chip energy dissipation during basic flash operations such as read, program, and erase. Each component, such as selected and unselected wordlines (WLs), bitlines (BLs) for data 0 and 1, the source line, the decoder, and the sense amplifier, is parameterized for each operation.
We now turn to the internal structure of an SSD.
Figure 3 illustrates a block diagram of the internal structure of an SSD [11]. Multiple NAND Flash dies are integrated to store large volumes of data. A DRAM die is used as a memory buffer for the multiple NAND dies. The SSD controller manages data traffic at the interface between SATA (Serial Advanced Technology Attachment) and the NAND dies. When data is written to the SSD, the sequential write data is first stored in DRAM and then transferred to one or more NAND dies through the internal data bus according to the write address. When data is read from the SSD, the sequential data first moves from one or more NAND dies to the DRAM die through the internal data bus and is then transferred to SATA. A DC/DC converter takes power from a 3 V source and outputs multiple voltages for the controller, NAND, and DRAM.
The array access time of NAND is much longer than that of DRAM, by a factor of about 1000 for read operations and 10,000 for write operations. To increase bandwidth for read and write operations, multiple NAND dies in an SSD operate in parallel. The maximum number of NAND dies operating in parallel is determined by peak power [12]. The peak occurs when the heavily capacitive WLs and BLs are charged up. As shown in Figure 4, the peak can be shifted by adding a suspend time (TSUSPEND) between the dies, which improves parallelism. Reducing the power itself is even more favorable, not only for parallelism but also for the energy reduction of SSDs and data centers.
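As a rough illustration of this peak-shifting idea (a sketch, not the authors' controller logic: the die current profile, timings, and peak budget below are assumed purely for demonstration), the following snippet counts how many dies fit under a peak-current budget with and without a suspend time between die starts:

```python
import numpy as np

# Illustrative assumptions (not from the paper): each die draws a short
# current spike while its WLs/BLs charge up, then settles to a plateau.
T_STEP = 0.1e-6                      # simulation step: 0.1 us
T_OP = 50e-6                         # one read operation: 50 us
t = np.arange(0, 2 * T_OP, T_STEP)

def die_current(t, t_start):
    """Current profile of one die: 80 mA spike for 2 us, then 20 mA plateau."""
    rel = t - t_start
    i = np.zeros_like(t)
    i[(rel >= 0) & (rel < 2e-6)] = 80e-3       # WL/BL charge-up spike
    i[(rel >= 2e-6) & (rel < T_OP)] = 20e-3    # steady sensing current
    return i

def peak_current(n_dies, t_suspend):
    """Peak of the summed profile when die k starts at k * t_suspend."""
    total = sum(die_current(t, k * t_suspend) for k in range(n_dies))
    return total.max()

I_BUDGET = 250e-3  # peak power budget, expressed as a current limit
for t_sus in (0.0, 2e-6):
    n = 1
    while peak_current(n + 1, t_sus) <= I_BUDGET:
        n += 1
    print(f"TSUSPEND = {t_sus*1e6:.0f} us -> {n} dies fit under the budget")
```

With aligned spikes only three of these hypothetical dies fit under the budget, while staggering the starts by the spike width allows nine, which is the parallelism benefit the suspend time provides.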
As NAND bit density has increased, the page length has increased from 512 B to 16 KB. Because read and program operations are performed on a page basis, the IO speed has had to increase accordingly to prevent a bottleneck in the IO path. To ease the development of advanced SSD controllers and advanced SSDs, two groups, the Open NAND Flash Interface (ONFI) [13] and Toggle [14] working groups, have standardized the NAND interface.
Figure 5 illustrates the power distribution for a NAND die with ONFI 1.0 (a), 2.0 (b), and 4.0 (c), together with the proposed power distribution (d). MA, SA, and IO denote the memory array, sense amplifier, and IO buffers, respectively. To increase the bandwidth of the NAND interface, IO transistors have been scaled. As a result, the power supply for the IO buffers, VDDQ, has decreased from 3 V to 1.8 V and then from 1.8 V to 1.2 V. An IO speed of 2.4 Gb/s was achieved in a 1 Tb 3D NAND Flash [15]. ONFI 1.0 NAND has one set of power supply VDD and ground VSS; IO operates at 3 V from VDD/VSS. Since ONFI 2.0 [16,17], a dedicated power/ground pair VDDQ/VSSQ has been provided for IO, allowing scaled transistors to operate at a lower voltage for faster IO operation. VDD remains at 3 V even when VDDQ is lowered because high voltages of over 20 V for program and erase operations must be generated by on-chip charge pumps [18]. If VDD were scaled like VDDQ, the charge pumps would require more circuit area, which affects cost. To further increase the IO frequency and bandwidth, more aggressively scaled transistors required a lower VDDQ of 1.2 V with ONFI 4.0 [19]. As shown in Figure 5(d), the proposed design [20] uses VDDQ not only for the IO buffers but also for the SA, which significantly reduces power in the BL path, as described in the following sections.
This paper is an extended version of a previously reported conference paper [20] on a low power design for NAND Flash that retains the existing NAND Flash interface. NAND Flash dies with this low power design can replace existing ones without additional cost because no changes are needed to the SSD printed circuit boards or to the NAND controller design.
This paper is organized as follows: Section 2 overviews and models two operations for BL read access: shielded BL (SBL) [21] and all BL (ABL) [22] read operations. Section 3 compares the circuit diagrams and read operations of the conventional and proposed circuits for the ABL read operation. Experimental results are shown in Section 4. Section 5 discusses design considerations such as scalability in BL capacitance and noise immunity.
2. BL Access for Read Operation
Figure 6 illustrates the BL path of NAND Flash. This section overviews and models two operations for BL read access: shielded BL (SBL) [21] and all BL (ABL) [22] read operations. The long, narrow, and tightly pitched BL has relatively high parasitic resistance and capacitance, depicted as R and C. Multiple cells are connected to the BL. (For simplicity, Figure 6 shows only the cell connected at the farthest node, which has the longest delay.) The gates of the NAND Flash cells are connected to word-lines (WLs) and two select gates (SGD, SGS). A read operation proceeds as follows. Only the selected WL, e.g., WL127, goes up to a certain voltage, e.g., 1 V, while the other deselected WLs and the two select gates go up to a higher voltage, such as 5 V, to turn on regardless of the cell threshold voltages. When the selected cell has a threshold voltage below (above) 1 V, it turns on (off).
The NAND string is modeled, for simplicity, by a switching transistor controlled by the SGD signal and a linear resistor RC. In this paper, the cell data is related to RC as follows: the cell storing 0, the 0-cell, has much lower current than the cell storing 1, the 1-cell; i.e., the 0-cell has much higher RC than the 1-cell. The BL is modeled by a simple 2π RC model. The PASS gate acts as a source follower to limit the BL voltage VBL to about 0.5 V. The lower bound on VBL is set by the point at which the cell enters its linear region, where the cell current ICELL is a strong function of VBL; the BL access time increases as ICELL decreases. The upper bound is limited by reliability: too high a VBL increases the probability of hot carrier injection into the gates of the cell transistors, resulting in a substantial shift in the cell threshold voltage. From the viewpoint of power, VBL should be as low as possible: the lower VBL, the lower the power in the BL path. SN denotes the storage node. Its parasitic capacitance CSN stores charge temporarily, and the amount of charge is translated into a digital value of 1 or 0 at DO by a clocked inverter controlled by the sensing signal SNS. The BL is charged through the PREB transistor from VDDBL.
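As a minimal numerical sketch of this model (assuming, as one reading of the 2π model, two capacitive nodes of C/2 joined by the BL resistance R, with the cell resistor at the far node; the R, C, and cell-resistance values below are illustrative assumptions, not the paper's parameters):

```python
# Illustrative, assumed values (the paper's actual R, C, and cell
# resistances are not given here).
R = 100e3        # total BL resistance
C = 3e-12        # total BL capacitance
RC1 = 500e3      # 1-cell: low resistance, i.e., high cell current
RC0 = 50e6       # 0-cell: high resistance, i.e., negligible current
VBL_PRE = 0.5    # pre-charged BL voltage set by the PASS source follower
DT = 1e-9        # integration step

def discharge(rc_cell, t_dis):
    """Two-node (2-pi) BL model during TDIS: PASS is off, so the BL floats
    and discharges only through the NAND string at the far node."""
    v_n = v_f = VBL_PRE                    # both nodes start pre-charged
    for _ in range(int(t_dis / DT)):
        i_link = (v_n - v_f) / R           # current through the BL resistance
        i_cell = v_f / rc_cell             # current into the NAND string
        v_n += DT * (-i_link) / (C / 2)
        v_f += DT * (i_link - i_cell) / (C / 2)
    return v_n

T_DIS = 2e-6
for name, rc in (("1-cell", RC1), ("0-cell", RC0)):
    dv = VBL_PRE - discharge(rc, T_DIS)
    print(f"{name}: dVBL = {dv*1e3:.1f} mV after {T_DIS*1e6:.0f} us")
```

Under these assumed values the 1-cell produces a clearly larger swing than the 0-cell, which is exactly the signal the sense amplifier resolves.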
2.1. BL Delay Time in Case of Shielded BL Read
Figure 7 shows the waveforms of the BL control signals and VBL for the shielded BL (SBL) read operation. The BL access time TBL (TSBL for SBL and TABL for ABL) is the sum of a pre-charge period TPRE and a discharge period TDIS. The BL starts charging up with PASS high and PREB low. Because SGD is forced to ground during TPRE, VBLF and VBLN go up regardless of the cell data. The discharge period starts with PASS low and SGD high. Depending on the cell data, VBLN gradually drops by ΔVBL for a 1-cell, whereas it remains the same for a 0-cell. After TDIS, PASS goes up to a voltage slightly lower, by ΔVPASS, than that in TPRE. SN rapidly drops for the 1-BL with ΔVBL > ΔVPASS, whereas it stays as high as VDDBL for the 0-BL. With SNS high, DO is set to present the cell data.
Next, TSBL is estimated with the simple model shown in Figure 6. Assuming that VBLN, the VBL at the node nearest the sense amplifier, is forced to a constant voltage VBL_PRE with PASS high during TPRE, VBLF, the VBL at the node farthest from the sense amplifier, is given by (1). During TDIS, the differential equations for VBLN and VBLF are given by (2) and (3). Using the initial conditions (4) and (5), VBLN(t) is solved as (6), where f0, f1, VA, and VB are defined by (7)–(10), respectively. ΔVBL = VBL_PRE − VBLN(TDIS) can then be calculated from (6) at t = TDIS with the specific RC values for the 0-cell and the 1-cell.
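For readers reconstructing the model, a plausible form of (2), (3), and (6), assuming the 2π model reduces to two capacitive nodes of C/2 joined by R with the cell resistor RC at the far node (an assumption consistent with the surrounding text, not a transcription of the paper's equations), is:

\[ \frac{C}{2}\,\frac{dV_{BLN}}{dt} = \frac{V_{BLF} - V_{BLN}}{R}, \qquad \frac{C}{2}\,\frac{dV_{BLF}}{dt} = \frac{V_{BLN} - V_{BLF}}{R} - \frac{V_{BLF}}{R_C}. \]

This linear system has two eigenvalues, so the near-node voltage takes a double-exponential form,

\[ V_{BLN}(t) = V_A\, e^{f_0 t} + V_B\, e^{f_1 t}, \]

where the rates \(f_0, f_1\) follow from the characteristic equation of the system and the amplitudes \(V_A, V_B\) from the initial conditions (4) and (5).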
2.2. BL Delay Time in Case of All BL Read
Figure 8 shows the waveforms of the BL control signals and VBL for the all BL (ABL) read operation. The BL starts charging up with SGD and PASS high and PREB low. VBLF rises depending on the cell data: VBLF for a 1-cell is lower than that for a 0-cell. The sense current IS approaches the cell current. After TABL, PASS goes up to discharge CSN. SN drops rapidly for the 1-BL compared with the 0-BL. When VSN becomes low enough, SNS toggles to transfer the cell data to DO.
TABL is estimated as follows. VBLF is governed by (11), which is solved as (12) with the initial condition VBLF(0) = 0. IS can then be calculated by (13). A sense margin for ABL can be defined by %IS; thus, TABL is a function of %IS.
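A minimal numerical sketch of this estimate follows, under the further simplification that the near node is held at VBL_PRE by the PASS clamp and the whole BL capacitance is lumped at the far node, and taking %IS as the ratio ICELL/IS (an assumed reading of the margin; all values are illustrative):

```python
# Assumed, illustrative parameters (not the paper's values).
R, C = 100e3, 3e-12        # BL resistance and capacitance
RC1 = 500e3                # 1-cell equivalent resistance
VBL_PRE, DT = 0.5, 1e-9

def t_abl(pct_is):
    """Charge the far node until the supplied current IS has settled to
    within the chosen margin of the cell current: ICELL/IS >= pct_is."""
    v_f, t = 0.0, 0.0
    while True:
        i_s = (VBL_PRE - v_f) / R      # current delivered from the sense side
        i_cell = v_f / RC1             # current sunk by the conducting 1-cell
        if i_s > 0 and i_cell / i_s >= pct_is:
            return t
        v_f += DT * (i_s - i_cell) / C
        t += DT

for pct in (0.70, 0.80, 0.90):
    print(f"%IS = {pct:.0%} -> TABL ~ {t_abl(pct)*1e9:.0f} ns")
```

As expected from the text, tightening the margin from 70% to 90% lengthens TABL, since IS must settle closer to the steady-state cell current.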
2.3. Energy in BL Path
In the case of SBL, VDDBL supplies the energy ESBL into every BL, as given by (15), regardless of whether the data is 1 or 0. ABL, on the other hand, requires more energy because VDDBL must supply the direct cell current in addition to the displacement current for the BL parasitic capacitance, as given by (16), where EABL is the averaged energy per BL.
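As a rough reconstruction of the form that (15) and (16) would take (the displayed equations are not reproduced in this excerpt, so this is an assumption based on the surrounding text):

\[ E_{SBL} = C\,V_{BL\_PRE}\,V_{DDBL}, \qquad E_{ABL} = C\,V_{BL\_PRE}\,V_{DDBL} + \frac{1}{2}\,V_{DDBL}\!\int_0^{T_{ABL}} I_{CELL}(t)\,dt, \]

where the first term is the displacement energy drawn from VDDBL to pre-charge the BL capacitance to VBL_PRE, and the assumed factor of 1/2 averages over data 0 and 1, since only 1-cells conduct appreciable cell current.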
2.4. Performance Comparison between SBL and ABL
TBL and EBL depend on the sensing scheme (SBL or ABL), on technology-dependent parameters such as R and C, and on design parameters such as VBL_PRE, ΔVBL, and %IS. It is difficult to determine in general which sensing scheme performs better, so it is instructive to compare them under a specific condition. In this subsection, the following parameters are used for demonstration: VDDBL = 2.0 V, ΔVBL = 25, 50, and 75 mV for SBL, and %IS = 70, 80, and 90% for ABL. VBL_PRE is also skewed, as shown in Table 1.
Figure 9 shows a performance comparison between SBL with ΔVBL = 25, 50, and 75 mV (from left to right) and ABL with %IS = 70, 80, and 90% (from left to right). Note that the conditions were selected so as to produce crossing points between SBL and ABL. Larger sensing margins result in a longer BL delay but contribute nothing (for SBL) or only slightly (for ABL) to the read energy in the BL path. The finite slope of the EBL–TBL curves for ABL comes from the cell current: the longer TBL is with a larger sensing margin, the larger the integrated energy due to the cell current. In Section 3 and afterwards, the proposed design is based on ABL, but its effectiveness for energy reduction is also expected with SBL.
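The following sketch reproduces the shape of this trade-off under the same assumed parameters as in the earlier sketches (illustrative values only, not a recalculation of Figure 9): SBL energy stays flat as ΔVBL grows, while ABL energy rises with TBL through the conducted cell current.

```python
import numpy as np

# Assumed, illustrative parameters (not the paper's values).
R, C, RC1 = 100e3, 3e-12, 500e3      # BL resistance/capacitance, 1-cell
VDDBL, VBL_PRE, DT = 2.0, 0.5, 1e-9

def sbl_point(dv_bl):
    """SBL: delay grows with the required swing dVBL; energy is the fixed
    displacement energy drawn from VDDBL to pre-charge the BL."""
    tau = (R + RC1) * C                      # lumped discharge constant
    t_dis = tau * np.log(VBL_PRE / (VBL_PRE - dv_bl))
    return t_dis, C * VBL_PRE * VDDBL

def abl_point(pct_is):
    """ABL: delay is the settling time of IS; energy integrates the total
    supply current (displacement plus conducted 1-cell current)."""
    v_f, t, e = 0.0, 0.0, 0.0
    while True:
        i_s, i_cell = (VBL_PRE - v_f) / R, v_f / RC1
        if i_s > 0 and i_cell / i_s >= pct_is:
            return t, e
        e += VDDBL * i_s * DT                # energy drawn from VDDBL
        v_f += DT * (i_s - i_cell) / C
        t += DT

for dv in (25e-3, 50e-3, 75e-3):
    t, e = sbl_point(dv)
    print(f"SBL dVBL={dv*1e3:.0f} mV: TBL={t*1e9:.0f} ns, EBL={e*1e12:.2f} pJ")
for p in (0.70, 0.80, 0.90):
    t, e = abl_point(p)
    print(f"ABL %IS={p:.0%}:      TBL={t*1e9:.0f} ns, EBL={e*1e12:.2f} pJ")
```

Even with these toy numbers, the SBL points line up at constant energy while the ABL points climb in energy as the margin tightens, mirroring the finite slope discussed above.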
4. Experimental
To validate the effectiveness of the proposed design for power reduction, a test circuit was designed and fabricated in 180 nm CMOS, as shown in Figure 13. In actual NAND Flash memory, the parasitic BL resistance and capacitance arise from the wiring itself; in this test circuit, a poly resistor and a MIM capacitor were used to mimic them. Since no memory process was available, normal NMOSFETs were used as the cell transistors. To obtain ICELL1 and ICELL0 with a normal NMOSFET, the WL voltage (i.e., the gate voltage of the transistors) was switched between high and low. The equivalent of 100 sets of BL and sense amplifier was implemented. The area required for the power switch was sufficiently small. Analog buffers were placed next to the BLs to monitor VBL at different locations.
Figure 14a illustrates an analog buffer. To monitor VBL in the range between 0 V and 0.5 V, a PMOSFET is used as a source follower amplifier. Because 3 V transistors are available in the 180 nm CMOS process, a VDDMON of 3 V is sufficient to monitor VBL up to 1 V, as shown in Figure 14b.
To estimate the energy of the conventional design, the power switch was not toggled, forcing VDDBL to VDDint for the entire cycle. Figure 15 shows the waveform for a 0-1-0 access with a cycle time of 60 μs in the proposed circuit mode. Due to the insufficient tail current of the fabricated analog buffers, the cycle time had to be longer than expected to measure the energy accurately. Therefore, the energy for a 5 μs read cycle was estimated from the data in Figure 16. The estimated energy per 5 μs read cycle was 1.3 nJ in the conventional circuit mode and 0.54 nJ in the proposed circuit mode. As a result, an energy reduction of 59% ((1.3 − 0.54)/1.3 ≈ 0.59) was achieved.