C에서 "signed int"가 "unsigned int"보다 빠른 이유는 무엇입니까? (In C, why is "signed int" faster than "unsigned int"?)

문제 설명

C에서 signed int가 unsigned int보다 빠른 이유는 무엇입니까? 사실, 나는 이것이 이 웹사이트(아래 링크)에서 여러 번 묻고 대답했다는 것을 알고 있습니다. 그러나 대부분의 사람들은 차이가 없다고 말했습니다. 코드를 작성했는데 실수로 상당한 성능 차이를 발견했습니다.

내 코드의 "서명되지 않은" 버전이 "서명된" 버전보다 느린 이유는 무엇입니까(동일한 번호를 테스트하는 경우에도)? (저는 x86‑64 Intel 프로세서가 있습니다).

유사한 링크

부호 없는 정수보다 부호 있는 정수 비교
The code compiled with clang ‑O2 on OS/X produces this output:
```
         isprime_slow:  788.004ms
unsigned_isprime_slow:  965.381ms
         isprime_fast:    0.065ms
unsigned_isprime_fast:    0.089ms
```
These timings are consistent with the OP's observed behavior on a different system, but show the dramatic improvement caused by the more efficient iteration test: 10000 times faster!
Regarding the question Why is the function slower with unsigned?, let's look at the generated code (gcc 7.2 ‑O2):
```
isprime_slow(int):
        ...
.L5:
        movl    %edi, %eax
        cltd
        idivl   %ecx
        testl   %edx, %edx
        je      .L1
.L4:
        addl    $2, %ecx
        cmpl    %esi, %ecx
        jne     .L5
.L6:
        movl    $1, %edx
.L1:
        movl    %edx, %eax
        ret

unsigned_isprime_slow(unsigned int):
        ...
.L19:
        xorl    %edx, %edx
        movl    %edi, %eax
        divl    %ecx
        testl   %edx, %edx
        je      .L22
.L18:
        addl    $2, %ecx
        cmpl    %esi, %ecx
        jne     .L19
.L20:
        movl    $1, %eax
        ret
       ...
.L22:
        xorl    %eax, %eax
        ret
```
The inner loops are very similar, same number of instructions, similar instructions. Here are however some potential explanations:
- cltd extends the sign of the eax register into the edx register, which may be causing an instruction delay because eax is modified by the immediately preceeding instruction movl %edi, %eax. Yet this would make the signed version slower than the unsigned one, not faster.
- the loops' initial instructions might be misaligned for the unsigned version, but it is unlikely as changing the order in the source code has no effect on the timings.
- Although the register contents are identical for the signed and unsigned division opcodes, it is possible that the idivl instruction take fewer cycles than the divl instruction. Indeed the signed division operates on one less bit of precision than the unsigned division, but the difference seems quite large for this small change.
- I suspect more effort was put into the silicon implementation of idivl because signed divisions are more common that unsigned divisions (as measured by years of coding statistics at Intel).
- as commented by rcgldr, looking at instruction tables for Intel process, for Ivy Bridge, DIV 32 bit takes 10 micro ops, 19 to 27 cycles, IDIV 9 micro ops, 19 to 26 cycles. The benchmark times are consistent with these timings. The extra micro‑op may be due to the longer operands in DIV (64/32 bits) as opposed to IDIV (63/31 bits).
This surprising result should teach us a few lessons:
- optimizing is a difficult art, be humble and procrastinate.
- correctness is often broken by optimizations.
- choosing a better algorithm beats optimization by a long shot.
- always benchmark code, do not trust your instincts.
방법 2:

Because signed integer overflow is undefined, the compiler can make a lot of assumptions and optimizations on code involving signed integers. Unsigned integer overflow is defined to wrap around, so the compiler won't be able to optimize as much. See also http://blog.llvm.org/2011/05/what‑every‑c‑programmer‑should‑know.html#signed_overflow and http://www.airs.com/blog/archives/120.

방법 3:

From Instruction specification on AMD/Intel we have (for K7):
```
Instruction Ops Latency Throughput
DIV r32/m32 32  24      23
IDIV r32    81  41      41
IDIV m32    89  41      41 
```
For i7, latency and throughput are the same for IDIVL and DIVL, a slight difference exists for the µops.
This may explain the difference as ‑O3 assembly codes only differ by signedness (DIVL vs IDIVL) on my machine.
방법 4:

Alternative wiki candidate test that may/may not show a significant time difference.
```
#include <stdio.h>
#include <time.h>

#define J 10
#define I 5

int main(void) {
  clock_t c1,c2,c3;
  for (int j=0; j<J; j++) {
    c1 = clock();
    for (int i=0; i<I; i++) {
      isprime(294967241);
      isprime(294367293);
    }
    c2 = clock();
    for (int i=0; i<I; i++) {
      isunsignedprime(294967241);
      isunsignedprime(294367293);
    }
    c3 = clock();
    printf("%d %d %d\n", (int)(c2‑c1), (int)(c3‑c2), (int)((c3‑c2) ‑ (c2‑c1)));
    fflush(stdout);
  }
  return 0;
}
```
Sample output
```
2761 2746 ‑15
2777 2777 0
2761 2745 ‑16
2793 2808 15
2792 2730 ‑62
2746 2730 ‑16
2746 2730 ‑16
2776 2793 17
2823 2808 ‑15
2793 2823 30
```
(by Devyn Collier Johnson、chqrlie、shadow_map、Jean‑Baptiste Yunès、chux ‑ Reinstate Monica)

참조 문서
1. In C, why is "signed int" faster than "unsigned int"? (CC BY‑SA 2.5/3.0/4.0)

C에서 "signed int"가 "unsigned int"보다 빠른 이유는 무엇입니까? (In C, why is "signed int" faster than "unsigned int"?)

문제 설명

방법 2:

방법 3:

방법 4:

참조 문서

관련 질문

코멘트

C에서 "signed int"가 "unsigned int"보다 빠른 이유는 무엇입니까? (In C, why is "signed int" faster than "unsigned int"?)

문제 설명

방법 2:

방법 3:

방법 4:

참조 문서

관련 질문

세 개의 작은 숫자를 하나의 두 배에 저장하는 방법은 무엇입니까? (How to store three small numbers into one double?)

중첩 최적화에서 솔루션을 반환하는 방법은 무엇입니까? (How to return solutions in a nested optimization?)

C에서 "signed int"가 "unsigned int"보다 빠른 이유는 무엇입니까? (In C, why is "signed int" faster than "unsigned int"?)

값이 없는 경우 HTML5 로컬 저장소 개체는 무엇을 반환합니까? (What does the HTML5 Local Storage object return if there is no value?)

200K 게시물이 있는 Wordpress 사이트는 JOIN이 느린 SQL_CALC_FOUND_ROWS를 게시합니까? 속도 최적화? (Wordpress site with 200K posts SQL_CALC_FOUND_ROWS with JOINs slow? Optimize for speed?)

xlx의 각 시트에서 특정 셀을 추출하기 위해 이 R 코드를 개선하는 데 도움이 필요합니다. (I need help to improve this R code to extract specific cell from each sheet in xlxs)

이 for 루프에서 fminunc가 매번 동일한 결과를 제공하는 이유는 무엇입니까? (Why is fminunc in this for loop giving the same result each time?)

정수 스칼라 배열만 Hyperopt에서 스칼라 인덱스 오류로 변환될 수 있습니다. (only integer scalar arrays can be converted to a scalar index error in Hyperopt)

벡터 병합 알고리즘 도움말 (Help with algorithm for merging vectors)

이 MIPS 프로그램이 이렇게 오래 걸리나요? (파이에 가까운 프로그램) (Is this MIPS program supposed to take this long? (program that approximates pi))

힘을 최소화하기 위한 최적의 궤적, 최종 조건 문제 (Optimal trajectory to minimize force, issues with final conditions)

단일 svg 요소를 만들기 위해 여러 모양 병합 (Merging number of shapes to make a single svg element)

코멘트