CPU Instruction Performance
This cluster focuses on why simple low-level instruction sequences (e.g., shifts, masks, adds) are surprisingly fast on modern x86 CPUs compared to dedicated ops, thanks to pipelining, micro-op caches, register renaming, and rich addressing modes.
Activity Over Time
Top Contributors
Keywords
Sample Comments
Explain please. Are you talking about the fact that if you are doing calculations on 64-bit operands that it might take you fewer cycles?
Might be something to do with caching or fitting operands in registers?
Don't fancy x86 addressing modes provide most of those multiplications and offsets with very little IPC penalty?
Not when you build the CPU. You can pipeline the operation. All these ops have very long latency.
Could that be due to the cleverness of modern heavyweight CPUs, with techniques like register renaming? Would things change if you used less sophisticated processors?
You'd think the CPU vendor knows their CPU best. If there's a faster "software" implementation, why doesn't REP MOVS at least do the same thing in microcode?
Why? It's a common op that requires internal knowledge of every microarchitecture, isn't it? Seems like something that should be totally offloaded to the CPU so you're guaranteed best performance.
Why does this provide such a speed up on modern CPUs?
Because they cost no cycles, or fewer cycles than NOPs?
I'd wager also yes.

> As long as it isn't a bottleneck in common software, a few shifts/masks/adds/integer multiplies or whatever are very quick on modern CPUs — often 1 cycle, if not more than one such instruction in parallel per clock.

One instruction also takes less cache space than multiple instructions. At worst it can mean the difference between a hot loop fitting in cache or not.