Girish Mahajan (Editor)

SSE4

SSE4 (Streaming SIMD Extensions 4) is a SIMD CPU instruction set used in the Intel Core microarchitecture and AMD K10 (K8L). It was announced on 27 September 2006 at the Fall 2006 Intel Developer Forum, with only vague details given in a white paper; more precise details of 47 instructions became available in a presentation at the Spring 2007 Intel Developer Forum in Beijing. SSE4 is fully compatible with software written for previous generations of Intel 64 and IA-32 architecture microprocessors: existing software continues to run correctly without modification on microprocessors that incorporate SSE4, including alongside existing and new applications that use SSE4.

SSE4 subsets

Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in Penryn. Additionally, SSE4.2, a second subset consisting of the 7 remaining instructions, is first available in Nehalem-based Core i7. Intel credits feedback from developers as playing an important role in the development of the instruction set.

Starting with Barcelona-based processors, AMD introduced the SSE4a instruction set, which has 4 SSE4 instructions and 4 new SSE instructions. These instructions are not found in Intel's processors supporting SSE4.1, and AMD processors only started supporting Intel's SSE4.1 and SSE4.2 (the full SSE4 instruction set) with the Bulldozer-based FX processors. SSE4a also introduced the misaligned SSE feature, which made unaligned load instructions as fast as their aligned counterparts when the address is in fact aligned. It also allowed the alignment check on non-load SSE operations that access memory to be disabled. Intel later introduced similar speed improvements for unaligned SSE loads in its Nehalem processors, but did not introduce misaligned access by non-load SSE instructions until AVX.

Name confusion

What is now known as SSSE3 (Supplemental Streaming SIMD Extensions 3), introduced in the Intel Core 2 processor line, was referred to as SSE4 by some media until Intel came up with the SSSE3 moniker. The instructions were internally dubbed Merom New Instructions, and Intel originally did not plan to assign them a special name, which was criticized by some journalists. Intel eventually cleared up the confusion and reserved the SSE4 name for its next instruction set extension.

Intel uses the marketing term HD Boost to refer to SSE4.

New instructions

Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.
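The two operand styles mentioned above can be illustrated with a blend operation, which picks each lane of the result from one of two inputs. The following Python sketch models a 4-lane blend controlled by an immediate constant (in the style of BLENDPS) and one controlled by the sign bits of a mask held in XMM0 (in the style of BLENDVPS). The function names and the list-based register representation are illustrative assumptions, not part of any real API, and the model ignores everything except lane selection.

```python
def blend_imm(dest, src, imm8):
    """Immediate-controlled blend: bit i of imm8 selects src lane i,
    otherwise the dest lane is kept. Registers are modeled as 4-element lists."""
    return [src[i] if (imm8 >> i) & 1 else dest[i] for i in range(4)]


def blend_var(dest, src, xmm0_lanes):
    """Variable blend: the sign bit (bit 31) of each 32-bit lane of the
    implicit XMM0 operand selects src lane i, otherwise dest is kept."""
    return [src[i] if xmm0_lanes[i] & 0x80000000 else dest[i] for i in range(4)]


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# The selection pattern is fixed when the code is written (lanes 0 and 2).
print(blend_imm(a, b, 0b0101))

# The selection pattern is computed at run time and passed through XMM0.
print(blend_var(a, b, [0x80000000, 0, 0xFFFFFFFF, 0x7FFFFFFF]))
```

Both calls select lanes 0 and 2 from the second input; the difference is only whether the selection is encoded in the instruction or supplied in a register.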

Several of these instructions are enabled by the single-cycle shuffle engine in Penryn. (Shuffle operations reorder bytes within a register.)

SSE4.1

These instructions were introduced with the Penryn microarchitecture, the 45 nm shrink of Intel's Core microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE41[Bit 19] flag.
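As a rough illustration of how such a CPUID flag is read, the Python sketch below tests a single bit of a register value. It assumes the ECX value returned by CPUID leaf 01H has already been obtained by native code and is passed in as a plain integer; the helper name and the example register values are hypothetical.

```python
SSE41_BIT = 19  # CPUID.01H:ECX.SSE41[Bit 19]


def cpuid_flag(register_value, bit):
    """Return True if the given bit is set in a CPUID output register."""
    return bool((register_value >> bit) & 1)


# Hypothetical ECX values, chosen only to exercise the bit test.
ecx_with_sse41 = 1 << SSE41_BIT
ecx_without_sse41 = 0

print(cpuid_flag(ecx_with_sse41, SSE41_BIT))
print(cpuid_flag(ecx_without_sse41, SSE41_BIT))
```

The same helper applies to any other feature flag that is documented as a bit position in a CPUID output register.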

SSE4.2

SSE4.2 added STTNI (String and Text New Instructions), several new instructions that perform character searches and comparisons on two 16-byte operands at a time. These were designed (among other things) to speed up the parsing of XML documents. It also added a CRC32 instruction to compute cyclic redundancy checks as used in certain data transfer protocols. These instructions were first implemented in the Nehalem-based Intel Core i7 product line and complete the SSE4 instruction set. Support is indicated via the CPUID.01H:ECX.SSE42[Bit 20] flag.
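The CRC32 instruction accumulates a checksum using the CRC-32C (Castagnoli) polynomial rather than the polynomial used by zlib and Ethernet. The Python sketch below is a bit-by-bit software model of that accumulation, one byte per step. The function names are illustrative, and the 0xFFFFFFFF initial value and final inversion follow the common CRC-32C convention; they are applied by the caller and are not part of the instruction itself.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # bit-reflected form of 0x1EDC6F41


def crc32c_step(crc, byte):
    """Fold one input byte into a running 32-bit CRC-32C value."""
    crc ^= byte
    for _ in range(8):
        if crc & 1:
            crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
        else:
            crc >>= 1
    return crc


def crc32c(data):
    """CRC-32C of a bytes object, with conventional init and final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = crc32c_step(crc, byte)
    return crc ^ 0xFFFFFFFF


print(hex(crc32c(b"123456789")))  # standard check value: 0xe3069283
```

A hardware implementation replaces the inner loop with a single instruction per 1, 2, 4, or 8 input bytes; the software model is useful mainly as a fallback or for cross-checking results.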

POPCNT and LZCNT

These instructions operate on integer rather than SSE registers because they are not SIMD instructions, although they appeared at the same time. Although AMD introduced them together with the SSE4a instruction set, they are counted as separate extensions with their own dedicated CPUID bits to indicate support. Intel implements POPCNT beginning with the Nehalem microarchitecture and LZCNT beginning with the Haswell microarchitecture. AMD implements both beginning with the Barcelona microarchitecture.

AMD calls this pair of instructions Advanced Bit Manipulation (ABM).

For a 32-bit operand, the result of lzcnt is 31 minus the result of bsr (bit scan reverse), except when the input is 0: lzcnt then produces a result of 32, while bsr produces an undefined result (and sets the zero flag). The encoding of lzcnt is similar enough to that of bsr that if lzcnt is executed on a CPU that does not support it, such as Intel CPUs prior to Haswell, the CPU performs the bsr operation instead of raising an invalid-opcode exception.
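The relationship between the two instructions can be checked with a small software model. This Python sketch assumes 32-bit operands; the function names are illustrative, and None stands in for the architecturally undefined bsr result on a zero input.

```python
def bsr32(x):
    """Index of the highest set bit of a 32-bit value (bit scan reverse).
    Returns None for 0, where the hardware result is undefined."""
    x &= 0xFFFFFFFF
    if x == 0:
        return None
    return x.bit_length() - 1


def lzcnt32(x):
    """Number of leading zero bits in a 32-bit value; 32 for an input of 0."""
    x &= 0xFFFFFFFF
    return 32 - x.bit_length()


for value in (1, 0x00010000, 0x80000000):
    # For every nonzero input the two results sum to 31.
    assert lzcnt32(value) == 31 - bsr32(value)

print(lzcnt32(0), bsr32(0))  # 32 None
```

Because the two results differ for every input except those whose highest set bit is at index 15 or 16, code that silently falls back to bsr on an older CPU will generally compute wrong values, which is why the dedicated CPUID bit should be checked before lzcnt is used.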

Trailing zeros can be counted using the existing bsf instruction.
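For a nonzero input, the index of the lowest set bit returned by bsf (bit scan forward) equals the number of trailing zero bits. A minimal Python model, again assuming 32-bit operands and using an illustrative function name, with None standing in for the undefined result on a zero input:

```python
def bsf32(x):
    """Index of the lowest set bit of a 32-bit value (bit scan forward).
    For nonzero x this is the trailing-zero count; None models the
    undefined hardware result for an input of 0."""
    x &= 0xFFFFFFFF
    if x == 0:
        return None
    lowest_bit = x & -x  # isolates the lowest set bit
    return lowest_bit.bit_length() - 1


print(bsf32(8))           # 3
print(bsf32(0x80000000))  # 31
```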

SSE4a

The SSE4a instruction group was introduced in AMD's Barcelona microarchitecture. These instructions are not available in Intel processors. Support is indicated via the CPUID.80000001H:ECX.SSE4A[Bit 6] flag.

Supporting CPUs

  • Intel
      • Intel Silvermont processors (SSE4.1, SSE4.2 and POPCNT supported)
      • Intel Goldmont processors (SSE4.1, SSE4.2 and POPCNT supported)
      • Intel Penryn processors (SSE4.1 supported)
      • Intel Nehalem processors and newer (SSE4.1, SSE4.2 and POPCNT supported)
      • Intel Haswell processors and newer (SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  • AMD
      • AMD Barcelona-based processors and newer (SSE4a, POPCNT and LZCNT supported)
      • AMD Bulldozer-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
      • AMD Bobcat-based processors (SSE4a, POPCNT and LZCNT supported)
      • AMD Jaguar-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
      • AMD Piledriver-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  • VIA
      • VIA Nano-based processors (SSE4.1 supported)

References

    SSE4 Wikipedia