CPU Instruction Performance
This cluster focuses on why simple low-level instruction sequences (e.g., shifts, masks, adds) are surprisingly fast on modern x86 CPUs compared to dedicated ops, thanks to pipelining, micro-op caches, register renaming, and rich addressing modes.
Activity Over Time
Top Contributors
Keywords
Sample Comments
Explain please. Are you talking about the fact that if you are doing calculations on 64-bit operands that it might take you fewer cycles?
Might be something to do with caching or fitting operands in registers?
Don't fancy x86 addressing modes provide most of those multiplications and offsets with very little IPC penalty?
Not when you build the CPU. You can pipeline the operation. All these ops have very long latency.
Could that be due to the cleverness of modern heavyweight CPUs, with techniques like register renaming? Would things change if you used less sophisticated processors?
You'd think the CPU vendor knows their CPU best. If there's a faster "software" implementation, why doesn't REP MOVS at least do the same thing in microcode?
Why? It's a common op that requires internal knowledge of every microarchitecture, isn't it? Seems like something that should be totally offloaded to the CPU so you're guaranteed best performance.
Why does this provide such a speed up on modern CPUs?
Because they cost no cycles, or fewer cycles than NOPs?
I'd wager also yes.

> As long as it isn't a bottleneck in common software, a few shifts/masks/adds/integer multiplies or whatever are very quick on modern CPUs — often 1 cycle, if not more than one such instruction in parallel per clock.

One instruction also takes less cache space than multiple instructions. At worst it can mean the difference between a hot loop fitting in cache or not.