
How does one write code that best utilizes the CPU cache to improve performance?

itboxs 2020. 6. 7. 10:53



This may sound like a subjective question, but what I am looking for are specific cases and examples you may have encountered related to this.

  1. How do you write code effectively and make it cache-efficient / cache-friendly (as many cache hits and as few cache misses as possible)? From both perspectives, the data cache and the program cache (instruction cache), i.e. with respect to data structures and code constructs, what should one take care of in the code to make it cache-effective?

  2. Are there any particular data structures one should use or avoid, or particular ways of accessing the members of such structures, etc., to make the code cache-effective?

  3. Are there any program constructs (if, for, switch, break, goto, ...) or code flows (if inside an if, if inside an if, etc.) one should follow or avoid in this matter?

I am looking forward to hearing individual experiences of making cache-efficient code. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux, Symbian, ...), etc.

The variety will help to understand it in depth.


The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transferred (preserving memory bandwidth).

Techniques for avoiding the pain of memory fetch latency are typically the first thing to consider, and sometimes help a long way. Limited memory bandwidth is also a limiting factor, particularly for multicore and multithreaded applications where many threads want to use the memory bus. A different set of techniques helps address the latter issue.

Improving spatial locality means ensuring that each cache line is used in full once it has been mapped into the cache. Looking at various standard benchmarks, we have seen that a surprisingly large fraction of them fail to use 100% of the fetched cache lines before the lines are evicted.

Improving cache line utilization helps in three respects:

  • It tends to fit more useful data into the cache, essentially increasing the effective cache size.
  • It tends to fit more useful data into the same cache line, increasing the likelihood that requested data will be found in the cache.
  • It reduces the memory bandwidth requirements, as there will be fewer fetches.

Common techniques are:

  • Use smaller data types.
  • Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way).
  • Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
  • Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking data structures up into hot and cold components, so that the hot loops use hot data (see the sketch after this list).
  • Avoid algorithms and data structures that exhibit irregular access patterns, and favor linear data structures.
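
As a minimal sketch of the struct-related points above (ordering members by decreasing size, and splitting hot from cold data); the types Padded, Packed, ParticleHot, etc. are hypothetical names for illustration, not from the original answer:

#include <cstdint>
#include <vector>

// Poorly ordered: padding after 'flag' and 'id' inflates the struct.
struct Padded {
    uint8_t  flag;   // 1 byte + 7 bytes padding
    double   x;      // 8 bytes
    uint16_t id;     // 2 bytes + 6 bytes padding
    double   y;      // 8 bytes
};                   // typically 32 bytes

// Sorted by decreasing size: no internal padding.
struct Packed {
    double   x;
    double   y;
    uint16_t id;
    uint8_t  flag;
};                   // typically 24 bytes

// Hot/cold split: the fields touched every iteration live in one compact
// array; rarely used fields live elsewhere, indexed by the same id.
struct ParticleHot  { double x, y, vx, vy; };
struct ParticleCold { char name[64]; uint64_t creation_time; };

void advance(std::vector<ParticleHot>& hot, double dt) {
    for (auto& p : hot) {        // linear walk over densely packed hot data
        p.x += p.vx * dt;
        p.y += p.vy * dt;
    }
}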

Note that there are also other ways to hide memory latency than using caches.

Modern CPUs often have one or more hardware prefetchers. They train on misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hardware prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher usually does a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.
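
A hedged sketch of adding a prefetch instruction by hand, using the GCC/Clang builtin __builtin_prefetch; the gather_sum function and the prefetch distance of 8 elements are illustrative assumptions, and the right distance has to be tuned per machine:

#include <cstddef>

// Sum elements gathered through an index array (an irregular access
// pattern that the hardware prefetcher may not predict well).
double gather_sum(const double* data, const int* idx, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 8 < n)
            __builtin_prefetch(&data[idx[i + 8]]);  // hint: fetch ahead of use
        sum += data[idx[i]];
    }
    return sum;
}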

By regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustains one latency hit (memory level parallelism).

To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.

Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking, all strive to avoid those extra memory fetches.

While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop-carried data dependencies to make sure you don't affect the semantics of the program.
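
Two small sketches of these rewrites, loop fusion and tiling/blocking; the function names and the tile size B are assumptions for illustration, not values from the answer:

#include <cstddef>

// Loop fusion: instead of walking over a[] twice in two separate loops,
// do both updates while the element is still in the cache.
void fused(double* a, const double* b, const double* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] += b[i];        // first pass
        a[i] *= c[i];        // second pass, reuses a[i] while it is cached
    }
}

// Tiling/blocking a naive matrix transpose: work on BxB blocks so that
// both the source rows and the destination rows stay cache-resident.
void transpose_tiled(const double* src, double* dst, std::size_t n) {
    const std::size_t B = 32;                      // tile size: tune per machine
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];
}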

These things are what really pay off in the multicore world, where you typically won't see much throughput improvement after adding the second thread.


I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":

pseudocode
for (i = 0; i < size; i++)
  for (j = 0; j < size; j++)
    do_something_with(ary[j][i]);   // strides down a column each step

The reason this is cache-inefficient is that modern CPUs will load a cache line of "nearby" memory addresses from main memory when you access a single memory address. We are iterating through the "j" (outer) rows of the array in the inner loop, so for each trip through the inner loop the cache line will be flushed and loaded with a line of addresses that are near the [j][i] entry. If this is changed to the equivalent:

for (i = 0; i < size; i++)
  for (j = 0; j < size; j++)
    do_something_with(ary[i][j]);   // walks each row sequentially

then it will run much faster.


If you're interested in how memory and software interact, I recommend reading Ulrich Drepper's nine-part article What Every Programmer Should Know About Memory. It is also available as a 104-page PDF.

Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).


The basic rules are actually pretty simple. Where it gets tricky is how they apply to your code.

The cache works on two principles: temporal locality and spatial locality. The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.

The cache tries to accommodate this by remembering the most recently used chunks of data. It operates with cache lines, typically sized 128 bytes or so, so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it'll already be in the cache.

And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.

A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.

The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.

Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.

In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss when you dereference them.
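
A small sketch of that contrast, assuming nothing beyond the standard containers: summing a std::vector walks memory linearly, while summing a std::list chases pointers to nodes that may be scattered across the heap.

#include <list>
#include <numeric>
#include <vector>

// Contiguous: each cache line fetched brings in several neighbouring ints.
long sum_vector(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0L);
}

// Node-based: every element may live on a different cache line, so each
// step is likely a dependent cache miss.
long sum_list(const std::list<int>& l) {
    return std::accumulate(l.begin(), l.end(), 0L);
}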

And if you want to exploit multiple cores, it can get really interesting, as usually, only one CPU may have any given address in its L1 cache at a time. So if both cores constantly access the same address, it will result in constant cache misses, as they're fighting over the address.


Apart from data access patterns, a major factor in cache-friendly code is data size. Less data means more of it fits into the cache.

This is mainly a factor with memory-aligned data structures. "Conventional" wisdom says data structures must be aligned at word boundaries because the CPU can only access entire words, and if a word contains more than one value, you have to do extra work (read-modify-write instead of a simple write). But caches can completely invalidate this argument.

Similarly, a Java boolean array uses an entire byte for each value in order to allow operating on individual values directly. You can reduce the data size by a factor of 8 if you use actual bits, but then access to individual values becomes much more complex, requiring bit shift and mask operations (the BitSet class does this for you). However, due to cache effects, this can still be considerably faster than using a boolean[] when the array is large. IIRC I once achieved a speedup by a factor of 2 or 3 this way.
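
The answer's example is Java, but the same trade-off can be sketched in C++ (this analogy is mine, not the answer's): std::vector<char> spends one byte per flag, while std::vector<bool> packs eight flags per byte at the cost of shift/mask work on every access.

#include <cstddef>
#include <vector>

// One byte per flag: simple access, 8x the memory traffic.
std::size_t count_set_bytes(const std::vector<char>& flags) {
    std::size_t n = 0;
    for (char f : flags) n += (f != 0);
    return n;
}

// One bit per flag: each access does shift/mask work internally, but far
// more flags fit into every cache line, which can win for large arrays.
std::size_t count_set_bits(const std::vector<bool>& flags) {
    std::size_t n = 0;
    for (bool f : flags) n += f;
    return n;
}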


The most effective data structure for a cache is an array. Caches work best if your data structure is laid out sequentially, as CPUs read entire cache lines (usually 32 bytes or more) at once from main memory.

Any algorithm which accesses memory in random order thrashes the caches because it always needs new cache lines to accommodate the randomly accessed memory. On the other hand, an algorithm which runs sequentially through an array is best because:

  1. It gives the CPU a chance to read-ahead, e.g. speculatively put more memory into the cache, which will be accessed later. This read-ahead gives a huge performance boost.

  2. Running a tight loop over a large array also allows the CPU to cache the code executing in the loop and in most cases allows you to execute an algorithm entirely from cache memory without having to block for external memory access.


One example I saw used in a game engine was to move data out of objects and into their own arrays. A game object that was subject to physics might have a lot of other data attached to it as well. But during the physics update loop all the engine cared about was data about position, speed, mass, bounding box, etc. So all of that was placed into its own arrays and optimized as much as possible for SSE.

So during the physics loop the physics data was processed in array order using vector math. The game objects used their object ID as the index into the various arrays. It was not a pointer because pointers could become invalidated if the arrays had to be relocated.

In many ways this violated object-oriented design patterns but it made the code a lot faster by placing data close together that needed to be operated on in the same loops.
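
A rough sketch of that idea, often called structure-of-arrays versus array-of-structures; the type and field names here are invented for illustration and are not from the engine being described:

#include <cstddef>
#include <vector>

// Array of structures: the physics loop drags along fields it never reads.
struct GameObject {
    float px, py, pz, vx, vy, vz, mass;
    char  name[64];          // cold data interleaved with hot data
    int   ai_state;          // cold
};

// Structure of arrays: only the data the physics loop touches is streamed.
struct PhysicsState {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;
};

void integrate(PhysicsState& s, float dt) {
    for (std::size_t i = 0; i < s.px.size(); ++i) {  // dense, SIMD-friendly
        s.px[i] += s.vx[i] * dt;
        s.py[i] += s.vy[i] * dt;
        s.pz[i] += s.vz[i] * dt;
    }
}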

This example is probably out of date because I expect most modern games use a prebuilt physics engine like Havok.


Only one post touched on it, but a big issue comes up when sharing data between processes. You want to avoid having multiple processes attempting to modify the same cache line simultaneously. Something to look out for here is "false" sharing, where two adjacent data structures share a cache line and modifications to one invalidates the cache line for the other. This can cause cache lines to unnecessarily move back and forth between processor caches sharing the data on a multiprocessor system. A way to avoid it is to align and pad data structures to put them on different lines.
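
One way to do that padding in C++ is alignas on the shared structures; this sketch assumes a 64-byte cache line, which is common but, as another answer notes, not guaranteed:

#include <atomic>
#include <thread>

// Without the alignment, the two counters could share one cache line and
// the two threads would invalidate it back and forth (false sharing).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
    // the alignas padding keeps the next PaddedCounter on its own line
};

PaddedCounter counters[2];

void bump(int id, long iterations) {
    for (long i = 0; i < iterations; ++i)
        counters[id].value.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::thread t0(bump, 0, 10'000'000L);
    std::thread t1(bump, 1, 10'000'000L);
    t0.join();
    t1.join();
}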


A remark to the "classic example" by user 1800 INFORMATION (too long for a comment)

I wanted to check the time differences for the two iteration orders ("outer" and "inner"), so I made a simple experiment with a large 2D array:

measure::start();                    // measure:: is the answer's timing helper
for ( int y = 0; y < N; ++y )
    for ( int x = 0; x < N; ++x )
        sum += A[ x + y*N ];         // A is an N*N array, sum a running total
measure::stop();

and the second case with the for loops swapped.

The slower version ("x first") was 0.88 sec and the faster one was 0.06 sec. That's the power of caching :)

I used gcc -O2 and still the loops were not optimized out. The comment by Ricardo that "most of the modern compilers can figure this out by themselves" does not hold.


I can answer (2) by saying that in the C++ world, linked lists can easily kill the CPU cache. Arrays are a better solution where possible. No experience on whether the same applies to other languages, but it's easy to imagine the same issues would arise.


Cache is arranged in "cache lines" and (real) memory is read from and written to in chunks of this size.

Data structures that are contained within a single cache-line are therefore more efficient.

Similarly, algorithms which access contiguous memory blocks will be more efficient than algorithms which jump through memory in a random order.

Unfortunately the cache line size varies dramatically between processors, so there's no way to guarantee that a data structure that's optimal on one processor will be efficient on any other.
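
C++17 does offer a portable hint for this value; a small sketch, with the caveat that std::hardware_destructive_interference_size is optional and not provided by every standard library:

#include <iostream>
#include <new>

int main() {
#ifdef __cpp_lib_hardware_interference_size
    std::cout << "destructive interference (padding) size: "
              << std::hardware_destructive_interference_size << " bytes\n";
    std::cout << "constructive interference (co-location) size: "
              << std::hardware_constructive_interference_size << " bytes\n";
#else
    std::cout << "hardware interference sizes not available; "
                 "a common assumption is 64 bytes\n";
#endif
}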


Asking how to make code cache-effective / cache-friendly (like most of the other questions here) usually amounts to asking how to optimize a program; that's because the cache has such a huge impact on performance that any optimized program is one that is cache-effective / cache-friendly.

I suggest reading about optimization; there are some good answers on this site. In terms of books, I recommend Computer Systems: A Programmer's Perspective, which has some fine text about the proper usage of the cache.

(By the way, as bad as a cache miss can be, there is worse: a program that is paging from the hard drive...)


There have been a lot of answers with general advice like data structure selection, access patterns, etc. Here I would like to add another code design pattern called software pipelining that makes use of active cache management.

The idea is borrowed from other pipelining techniques, e.g. CPU instruction pipelining.

This type of pattern best applies to procedures that

  1. can be broken down into multiple reasonable sub-steps, S[1], S[2], S[3], ... whose execution time is roughly comparable with RAM access time (~60-70 ns), and
  2. take a batch of inputs and apply the aforementioned steps to each of them to get the results.

Let's take a simple case where there is only one sub-procedure. Normally the code would look like:

def proc(input):
    return sub_step(input)    # sub_step is the single sub-procedure

To get better performance, you might want to pass multiple inputs to the function in a batch, so that you amortize the function call overhead and also increase code cache locality.

def batch_proc(inputs):
    results = []
    for i in inputs:
        # avoids instruction cache misses, but still suffers data (inputs) misses
        results.append(sub_step(i))
    return results

However, as said earlier, if the execution time of the step is roughly the same as the RAM access time, you can further improve the code to something like this:

def batch_pipelined_proc(inputs):
    results = []
    for i in range(0, len(inputs) - 1):
        prefetch(inputs[i + 1])   # hint the cache to fetch the next input
        # work on the current item while inputs[i+1] is flying back from RAM
        results.append(sub_step(inputs[i]))

    results.append(sub_step(inputs[-1]))
    return results

The execution flow would look like:

  1. prefetch(1) asks the CPU to prefetch inputs[1] into the cache; the prefetch instruction itself takes P cycles and returns, and in the background inputs[1] arrives in the cache after R cycles.
  2. works_on(0) takes a cold miss on inputs[0] and works on it, which takes M cycles.
  3. prefetch(2) issues another prefetch.
  4. works_on(1): if P + R <= M, then inputs[1] should already be in the cache before this step, thus avoiding a data cache miss.
  5. works_on(2) ...

There could be more steps involved; you can then design a multi-stage pipeline as long as the timing of the steps and the memory access latency match, and you will suffer few code/data cache misses. However, this process needs to be tuned with many experiments to find the right grouping of steps and the right prefetch distance. Due to the effort required, it sees more adoption in high-performance data/packet stream processing. A good production code example can be found in the DPDK QoS Enqueue pipeline design: http://dpdk.org/doc/guides/prog_guide/qos_framework.html, Chapter 21.2.4.3 "Enqueue Pipeline".

More information can be found at:

https://software.intel.com/en-us/articles/memory-management-for-optimal-performance-on-intel-xeon-phi-coprocessor-alignment-and

http://infolab.stanford.edu/~ullman/dragon/w06/lectures/cs243-lec13-wei.pdf


Write your program so that it takes minimal size. That is why it is not always a good idea to use -O3 optimisations for GCC: it produces larger code. Often, -Os is just as good as -O2. It all depends on the processor used though. YMMV.

Work with small chunks of data at a time. That is why a less efficient sorting algorithm can run faster than quicksort if the data set is large. Find ways to break up your larger data sets into smaller ones. Others have suggested this.

In order to help you better exploit instruction temporal/spatial locality, you may want to study how your code gets converted into assembly. For example:

for(i = 0; i < MAX; ++i)
for(i = MAX; i > 0; --i)

The two loops produce different code even though they merely iterate over an array. In any case, your question is very architecture-specific. So, your only way to tightly control cache use is by understanding how the hardware works and optimising your code for it.


Besides aligning your structure and fields, if your structure is heap allocated, you may want to use allocators that support aligned allocations, like _aligned_malloc(sizeof(DATA), SYSTEM_CACHE_LINE_SIZE); otherwise you may have random false sharing. Remember that in Windows, the default heap has a 16-byte alignment.
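
For portable code, a hedged alternative sketch using C++17's aligned operator new instead of the Windows-specific _aligned_malloc; the 64-byte figure and the Counter type are assumptions for illustration:

#include <cstddef>
#include <new>

struct alignas(64) Counter {     // assume a 64-byte cache line
    long value = 0;
};

int main() {
    // Over-aligned types are heap-allocated with the aligned operator new
    // in C++17, so each Counter starts on its own cache line.
    Counter* a = new Counter;
    Counter* b = new Counter;

    a->value += 1;
    b->value += 1;

    delete a;
    delete b;
}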

Reference URL: https://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
