When mobile phones first appeared, they were powered by very simple cores like the venerable ARM7 and later the ARM9. Low clock frequencies, zero microarchitectural sophistication, sufficient for the job. In recent years, as smartphones have come into their own as the most important computing device for most people, the processor performance of mobile phones has increased tremendously. Today, cutting-edge phones and tablets contain four or eight cores, running at clock frequencies well above 2 gigahertz. The performance race for most of the market (more about that in a moment) has mostly been about pushing higher clock frequencies and more cores, while the microarchitecture was left comparatively simple. Mobile meant “fairly simple”, and IPC was nowhere near what you would get with a typical Intel processor for a laptop or desktop.
Today, that seems to be changing, as the Nvidia Denver core and Apple’s Cyclone core both go the route of a few fat cores rather than many thin cores.
ARM’s own leading-edge core, the Cortex-A57, is considered aggressive for a core from ARM. Still, it gets much worse IPC than a Haswell, and it is supposedly similar to the Cortex-A15. The Denver core from Nvidia is much larger: the 32-bit variant of the Nvidia Tegra K1 has four Cortex-A15 cores, while the 64-bit K1 has only two Denver cores. Going by that data, a Denver core would be roughly twice as big as a Cortex-A15.
Reading up on Nvidia’s Denver core, on their blog and in the HotChips presentation and elsewhere, it is clear that Nvidia is going for wide issue and high IPC to gain performance, if nothing else because there seems to be little additional clock frequency to be had. The 64-bit K1 is supposed to match or beat the performance of the 32-bit K1 at very similar clock frequencies (seems to be 2.3 GHz vs 2.5 GHz, but details are a bit sketchy). This pretty much means that each of the Denver cores has to do the work of two Cortex-A15 cores to provide performance parity.
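To make the parity argument concrete, here is a rough back-of-the-envelope sketch. It assumes that aggregate throughput is simply cores × clock × IPC on perfectly parallel work, and that the 2.3 GHz figure belongs to the quad Cortex-A15 part and the 2.5 GHz figure to the dual Denver part; both are simplifying assumptions, and the function name is made up for illustration.

```python
# Back-of-the-envelope parity check. Core counts and clock figures come from
# the text; the throughput model (cores x clock x IPC) is an assumption.

def required_ipc_ratio(cores_many, ghz_many, cores_few, ghz_few):
    """IPC advantage each 'few' core needs so that both chips reach the
    same aggregate throughput on perfectly parallel work."""
    return (cores_many * ghz_many) / (cores_few * ghz_few)

# Assumed pairing: 4x Cortex-A15 at 2.3 GHz versus 2x Denver at 2.5 GHz.
ratio = required_ipc_ratio(4, 2.3, 2, 2.5)
print(f"Denver needs about {ratio:.2f}x the IPC of a Cortex-A15")
```

Under these assumptions the ratio comes out at about 1.84, which is where the "work of two Cortex-A15 cores" estimate comes from. On software that does not scale to four threads, the bar is lower.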
I think this makes sense from a number of perspectives.
- It appears that mobile software also has problems actually being parallel. At the SiCS Multicore Day last month, Qualcomm showed some graphs indicating that on a quad-core Snapdragon, very few programs spent much time at all using more than two cores. Given this, making a single core that is even just 1.5x faster (for the sake of argument) at 2x the size still provides a faster overall system. A quad-core A15 competing with a dual-core Denver could be seen as two A15s competing with two much faster Denvers, with the other two A15s being useless bystanders.
- Having a powerful core that can get through single threads of software much faster saves power, as it is important to get back to idle as quickly as possible. A lower-power processor that processes for a longer time might use more energy.
- A core with a higher IPC can provide the same actual latency for a given piece of software at a lower clock frequency compared to a simpler core. Given good power management and good insight into the running software, a fat core should be more power efficient for simple programs that do not tax the processor. It is not just about matching peak performance, but also matching power-performance for mundane software.
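The first two points can be put into a toy model. This is only a sketch: the 1.5x per-core speedup is the for-the-sake-of-argument figure used above, and every wattage and duration below is a hypothetical number picked for illustration, not a measurement of any real chip.

```python
# Toy model of "few fat cores" versus "many thin cores".
# All numeric inputs are illustrative assumptions.

def throughput(cores, speed_per_core, runnable_threads):
    """Useful work per unit time: cores without a thread to run add nothing."""
    return min(cores, runnable_threads) * speed_per_core

def task_energy(active_watts, busy_seconds, idle_watts, window_seconds):
    """Energy in joules over a fixed window: busy first, then idle."""
    idle_seconds = window_seconds - busy_seconds
    return active_watts * busy_seconds + idle_watts * idle_seconds

# Four thin cores at speed 1.0 versus two fat cores at speed 1.5.
for threads in range(1, 5):
    thin = throughput(4, 1.0, threads)
    fat = throughput(2, 1.5, threads)
    print(f"{threads} runnable threads: thin={thin:.1f} fat={fat:.1f}")

# Race to idle: the fat core draws more while busy but finishes sooner.
fat_joules = task_energy(active_watts=2.0, busy_seconds=1.0,
                         idle_watts=0.1, window_seconds=2.0)
thin_joules = task_energy(active_watts=1.2, busy_seconds=2.0,
                          idle_watts=0.1, window_seconds=2.0)
print(f"fat core: {fat_joules:.2f} J, thin core: {thin_joules:.2f} J")
```

With these made-up numbers the fat pair is ahead whenever one or two threads are runnable, ties at three, and only loses at four. The energy comparison can go either way depending on the wattages chosen, which is why the claim is that a slower low-power core *might* use more energy, not that it always does.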
I think the Denver approach of using a few fat cores for a mobile form factor is interesting. However, there is a precedent: Apple’s Cyclone core, used in the fruit company’s A7 and A8 SoCs. Fewer details have been published about it, but it is clear that it very much takes the “few and fat” approach to processing. The clock speed of the A7 and A8 chips is reportedly below 1.5 GHz, which is really very slow for a high-end mobile processor (the Denver is shooting for 2.5 GHz). That lower clock appears to have big benefits, since power consumption rises much faster than linearly with clock frequency: dynamic power scales with the frequency times the square of the supply voltage, and higher clocks generally need a higher voltage. Apple’s phones should get a nice battery life benefit thanks to this. It seems iOS is a bit more efficient than Android, which would help explain how Apple gets away with such a low clock frequency. It should be noted that it is not clear for how long a Qualcomm or Samsung or Nvidia chip can actually run at 2 GHz or more before throttling down to avoid being cooked.
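How much a lower clock buys depends on how steeply power scales with frequency. The sketch below treats the scaling exponent as a parameter instead of asserting one value: 2 is a common rule of thumb, and 3 follows from dynamic power being proportional to V²·f if supply voltage is assumed to scale linearly with frequency. The 1.4 GHz figure is a stand-in for "below 1.5 GHz", and the function names and the 1.5x IPC ratio are hypothetical.

```python
# Rough power/energy scaling sketch. The exponent and all inputs are
# assumptions for illustration, not measured values.

def relative_power(freq_ghz, ref_ghz, exponent):
    """Power at freq_ghz relative to power at ref_ghz."""
    return (freq_ghz / ref_ghz) ** exponent

def relative_energy_per_task(freq_ghz, ref_ghz, exponent, ipc_ratio=1.0):
    """Energy for a fixed task relative to the reference core.
    Run time scales as 1 / (IPC x frequency)."""
    relative_time = ref_ghz / (freq_ghz * ipc_ratio)
    return relative_power(freq_ghz, ref_ghz, exponent) * relative_time

for exponent in (2, 3):
    power = relative_power(1.4, 2.5, exponent)
    energy = relative_energy_per_task(1.4, 2.5, exponent, ipc_ratio=1.5)
    print(f"exponent {exponent}: power x{power:.2f}, energy per task x{energy:.2f}")
```

This ignores leakage power and the extra switched capacitance of a wider core, so it flatters the fat core somewhat. The direction of the effect is the point: a core that gets its performance from IPC rather than clock can run the same task on considerably less energy.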
I think it is clear that a fat-core design philosophy is currently the best way forward in mobile processor design, at least at the high end (one anecdotal data point is provided here). For the mid-range and low end of the market, the trend rather seems to be to use four or eight quite weak ARM Cortex-A7 cores (or maybe the new Cortex-A12), providing core counts that sound good.
The actual way the Denver is built, using binary translation and a wide in-order core, is quite different from the Apple Cyclone, which is a traditional core that directly executes ARMv8 code. It will be very interesting to see how this works out in practice once the Nexus 9 is properly benchmarked. When Transmeta tried the same thing in 2000, they did not quite manage to beat the incumbent. In mobile, I guess the first test will be to compare a dual Denver with a 4+4 ARM big.LITTLE Cortex-A57+A53 setup. And then, against whatever 64-bit ARM core Qualcomm builds. But right here, right now, this is quite probably the fastest Android processor around for code that is not insanely multithreaded.