CPU Instruction Performance

This cluster focuses on why simple low-level instruction sequences (e.g., shifts, masks, adds) are often surprisingly fast on modern x86 CPUs compared with dedicated instructions, thanks to pipelining, micro-op caches, register renaming, and addressing modes.
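As a rough illustration of the kind of sequences discussed in this cluster, here is a minimal C sketch. The function names are hypothetical and chosen only for this example. On x86-64, optimizing compilers often turn the multiply-by-5 into a single `lea`, and the indexed load into one scaled-index addressing mode, but that depends on the compiler, flags, and target, so treat it as an assumption to verify in the generated assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Multiply by 5 written as a plain multiplication. */
uint32_t times5_mul(uint32_t x) {
    return x * 5u;
}

/* The same result from a shift and an add: (x * 4) + x.
   Unsigned arithmetic wraps, so both versions agree for every input. */
uint32_t times5_shift_add(uint32_t x) {
    return (x << 2) + x;
}

/* Extract bits 8..15 with one shift and one mask. */
uint32_t byte1(uint32_t x) {
    return (x >> 8) & 0xFFu;
}

/* Load element i + 2 of an array of 4-byte values. The address is
   base + i*4 + 8, a form that x86 scaled-index addressing can express
   (assumption: whether one instruction is emitted is up to the compiler). */
uint32_t load_offset(const uint32_t *base, size_t i) {
    return base[i + 2];
}
```

Comparing the compiler output for `times5_mul` and `times5_shift_add` (for example with `-O2 -S`) is one way to check whether the two forms end up as the same instructions on a given toolchain.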

➡️ Stable 0.6x Hardware
Comments: 3,955
Years Active: 19
Top Authors: 5
Topic ID: #7865

Activity Over Time

2008: 5
2009: 33
2010: 28
2011: 78
2012: 79
2013: 149
2014: 167
2015: 146
2016: 260
2017: 203
2018: 250
2019: 230
2020: 405
2021: 359
2022: 389
2023: 356
2024: 383
2025: 408
2026: 29

Keywords

RAM REP e.g CPU ARM ALU CPUS AMD IMO ISA cpus instructions sum cpu cache instruction lookup cycles latency registers

Sample Comments

adamnemecek Dec 17, 2013

Explain please. Are you talking about the fact that if you are doing calculations on 64-bit operands that it might take you fewer cycles?

saagarjha May 4, 2018

Might be something to do with caching or fitting operands in registers?

the8472 Feb 3, 2020

Don't fancy x86 addressing modes provide most of those multiplications and offsets with very little IPC penalty?

pclmulqdq May 22, 2024

Not when you build the CPU. You can pipeline the operation. All these ops have very long latency.

MaxBarraclough Nov 4, 2020

Could that be due to the cleverness of modern heavyweight CPUs, with techniques like register renaming? Would things change if you used less sophisticated processors?

mike_hock Nov 30, 2023

You'd think the CPU vendor knows their CPU best. If there's a faster "software" implementation, why doesn't REP MOVS at least do the same thing in microcode?

MichaelGG Aug 11, 2016

Why? It's a common op that requires internal knowledge of every microarchitecture, isn't it? Seems like something that should be totally offloaded to the CPU so you're guaranteed best performance.

voltagex_ Nov 18, 2020

Why does this provide such a speed up on modern CPUs?

iforgotpassword Nov 14, 2023

Because they cost no/less cycles compared to NOPs?

ilyt Aug 13, 2023

I'd wager also yes.

> As long as it isn't a bottleneck in common software, a few shifts/masks/add/integer multiply or whatever, are very quick on modern cpus. Often 1-cycle. If not >1 such instructions in parallel per clock.

One instruction takes less cache space than multiple instructions. At worst it might mean fitting vs not fitting hot loop in cache.