手撕ECS #2: Archetypes and Vectorization

2020-12-28

语法糖

C++

原文链接：Building an ECS #2: Archetypes and Vectorization

上回说到如何访问 Entity 所拥有的 Components，这一章将会探索如何储存 Component 数据。久闻 ECS 以高性能自称，来看看它到底有何能耐。

性能杀手锏：序列化储存数据

经验之谈，如果想让运行速度变快，用数组就完事了。背后的理论在计算机架构课本中，由于内存对齐、换页等因素，内存顺序读取总是要比随机读取快不少。详细可以观看 Data-Oriented Design and C++。

内存连续状的数组具有一个强大的特性，就是可被**向量化(Vectorization)**，即可以被 CPU 使用 SIMD 系列指令同时读写。一个好的向量化代码运行效率可以加速至 2 倍、甚至 16 倍。

在构建 ECS 时，我们当然希望能够同时使用**快速串流(fast streaming)和矢量化(vectorization)*的特性。为了实现这一点，我们不能仅仅局限于在数组中存储数据，还需要令数组连续，即要求其不包含任何未使用的元素，并且存储在单个内存块中*。只要一个数组有一个空位，都会影响到我们利用 SIMD 优化运算。

ABC 问题

不幸的是，将所有数据完美地存储在连续数组中并非易事。

至于何出此言，来举个栗：假设有个应用含有 3 个实体，并且每个实体都有组件 A，我们可以像这样储存它们：

A a[3];

我们可以通过实体的 id 来获取对应的组件数据，如 a[2] 返回 id 为 2 的 A 组件数据。此时再规定 3 个实体中的2 个拥有组件 B，该如何设计这个组件的储存方式呢？一种直观的方式是直接创建一个与 A 等长的数组：

B b[3];

这样做虽然简单，但是此时只有 2 个实体拥有组件 B，我们无法保证 id 对应的储存方式能使 B 连续！（有可能是 id = 0 与 id = 2 拥有 B）应该怎么解决呢？

让我们从定义出发，先定义一个长度为 2 的连续的 B 数组：

B b[2];

且定义一个特殊情况：

1
2
3

0: [A B]
1: [A  ]
2: [A B]

显然很难满足 A、B 的连续。这里介绍一个 trick，即对实体 id 做一次映射，使其在内存中变换一下位置：

1
2
3

0: [A B]
2: [A B]
1: [A  ]

可以发现这回两个数组都连续了。为了一个并行运算的可能，完全值得我们牺牲一部分性能用于实体 id 映射。这种模式可以支持我们设计非常多的组件搭配，比如再加一个仅拥有组件 B 的实体，这种方法仍然奏效：

3: [  B]
0: [A B]
2: [A B]
1: [A  ]

那么问题解决了吗? 非常不幸：没有。让我们看看如果我们在这之上加入另一个组件会发生什么:

0: [  B  ]
1: [  B C]
2: [A B C]
3: [A B  ]
4: [A    ]
5: [A   C]
6: [    C]

上述这种情况不可能使 A、B、C 均连续（你可以试试hhh）。那有人问了：我们在煎水作冰吗？别慌，追求并行虽不可亵玩，却仍可近观。

原型为何物？

让我们重新构建数据结构，尝试另一种方法：创建以 Component 列表为类型的若干数组。比如下例：

// Type [A]
A a[2];
// Type [A, B]
A a[2];
B b[2];
// Type [A, C]
A a[2];
C c[2];

在 entity index 中对应的数据为：

0: [A]
1: [A]

2: [A B]
3: [A B]

4: [A C]
5: [A C]

这里设计的三个数组均符合我们之前所声明的条件，且仅当遍历所有实例的 A 组件时有一些额外的性能开销。如此一来可以直接利用向量化特性，来执行类似如下仅涉及 A 与 B 的逻辑：

1
2
3

for (int i = 0; i < 10; i ++) {
    a[i] += b[i];
}

这种存储数据的方法通常被称为原型(Archetypes)。ECS 框架在内部将具有相同组件组合的所有实体组合在一组打包的数组中，即原型。原型通常在一个组件组合首次出现在程序中时自动创建。

原型方法不仅能提高并行效率，还能方便我们使用系统(System)匹配符合的实体。假设我们拥有 1000 个包含 [A, B] 的实体与 1000 个包含 [A, C] 的实体，当我们创建一个需要匹配 [A, B] 的系统时，就只需要遍历原型列表（而非所有 2000 个实例）来定位匹配的实例。

下面就是一组描述原型的代码案例:

struct ComponentArray {
    void *elements; // vector<T>
    int size;
};
struct Archetype {
    Type type;
    vector<ComponentArray> components;
    int length;
};

我们无法提前得知原型中储存什么组件，故使用一个 void 指针来接受任意类型的组件。type 成员储存了原型的组件列表（详见上一章的 Type 定义）。为了简单，components 与 type 中的组件顺序保持一致。

定义好数据结构，我们就可以继续讨论基于原型ECS结构中的几个典型操作了。

操作：获得实体与组件

在能够顺序储存相同类型实体的数据结构基础上，我们仍需要定位到某一个实体。试着尝试修改上一章提及的 entity inedx，我们可以用一个名为Record 的数据结构替换原先的定义。

struct Record {
    Archetype *archetype;
    int row;
};

unordered_map<EntityId, Record> entity_index;

row 是指实体对应的组件在 Archetype::components 中的位置。此时如果想获得 10 号实体的 A 组件，我们可以这样做：

Record& r = entity_index[10];
Type& type = r.archetype->type;

for (int i = 0; i < type.size(); i ++) {
    if (type[i] == A) {
        return r.archetype->components[i].elements[r.row];
    }
}

这里简化了一些代码，比如不能用 [] 操作符直接获取 elements 对应地址内存，因为它被定义为 void*；类型比较时需要转换为 typehash 比较。但基本思路是这样的。

我们还希望在迭代组件数组时能够访问我们的实体 ID。为此，我们可以简单地在原型中添加一个额外的数组。

struct Archetype {
    Type type;
    vector<EntityId> entity_ids; // Entities stored in archetype
    vector<ComponentArray> components;
    int length;
    vector<Edge> edges;
};

别看占用了一些空间，这些数据可以极大地加快组件的插入、删除和复制操作。

操作：添加、删除组件

添加组件到一个实体上会改变它的类型，这意味着要将它从原先的 Archetype 移至另一个。但对于序列化加速来说，这么做还是值得的。

我们举一个例子：为一个原型为[A, B] 的实体添加组件 C。首先我们需要找到目标原型，即 [A, B, C] ，详细操作我们下一节细谈。接着我们需要添加一列新内存，并将组件 A 与 B 的数据复制过去，最后将前原型中的实体删除。简单来说分为四步：

找到目标原型(Find destination archetype)
在目标原型中插入该实体(Insert entity into component arrays of destination)
复制原有的组件数据(Copy overlapping components from source to destination)
从原来的原型中移除实体(Remove entity from component arrays of source)

移除组件的步骤也大致相同。

操作：获得或创建目标原型

轮子场景中，仅需要本章内第一种方法，即顺序搜索

An important factor in how well add / remove operations perform is how fast an ECS framework can find an archetype. A simple approach would be to hash the component ids of an archetype, and then use a hash map to lookup an archetype:

1 2	using TypeHash = uint64_t; unordered_map<TypeHash, Archetype*> type_index;

This approach is neither slow nor is it particularly fast:

First we need to create our destination type. If we take our previous example where we add C to [A, B], we need to create [A, B, C]. In our case, where a type is stored as a vector of component ids this is an O(n) operation (to avoid adding duplicates). Then we need to compute the hash, which is another O(n) operation, followed by a hash table lookup which is O(1) on average, and O(n) worst case.

This seems like a lot of work for one of the most common ECS operations.

A faster approach is to build a graph, where each archetype represents a node in the graph, and the components to add / remove are the edges. Consider this example:

An example archetype graph

Each arrow to the right represents an “add” operation, and each arrow to the left represents a “remove” operation. Now, to move from archetype [A, B] to [A, B, C] we simply follow the C edge, which is an O(1) operation. When we follow an edge that does not have an archetype yet, we create it.

Let’s add the graph to our previous archetype data structure:

struct Archetype;struct ComponentArray {
    void *elements; // vector<T>
    int size;
};
struct Edge {
    Archetype *add;    // traverse right
    Archetype *remove; // traverse left
};
struct Archetype {
    Type type;
    vector<EntityId> entity_ids;
    vector<ComponentArray> components;
    int length;
    vector<Edge> edges;
};

We can now write this function to find our next archetype:

Archetype *node = root;
for (int i = 0; i < type.size(); i ++) {
   Edge *edge = &node->edges[type[i]];
   if (!edge->add) {
       edge->add = create_archetype(node, type[i]);
   }
   
   node = edge->add;
}

Note how we use a type to traverse the graph. If we only add a single component to an entity, the type will only contain a single element, but by writing the function this way, we can quickly traverse multiple nodes.

AoS vs. SoA

So far we assumed that an archetype stores an array per component. This is usually referred to as the “struct of arrays” (SoA) approach. The reason it is named like this is because the way data is stored resembles that of a struct with array members:

struct SoA_Archetype {
    A a[100];
    B b[100];
    C c[100];
};

It turns out that SoA works really well in combination with ECS. This is because a system is usually not interested in all components of an entity, and the SoA approach lets us load only the arrays into the CPU cache that we need. There is another valid approach though which is called “array of structs” (AoS). AoS can be represented like this:

struct EntityType {
    A a;
    B b;
    C c;
};
EntityType AoS_ArcheType[100];

This example stores the same data, but it is laid out differently in memory. The biggest difference between AoS and SoA is that with AoS, all data of an entity is loaded into the CPU cache. This may slow down applications when only iterating a few components. AoS can however perform better when randomly accessing components. Flecs uses the SoA approach.

Summary

Our ECS framework is starting to shape up nicely. Let’s combine the data structures of our archetype with those of the previous post:

// Entities, types and entity index
using EntityId = uint64_t;
using Type = vector<EntityId>;
unordered_map<EntityId, Type> entity_index;// Type flags
const EntityId INSTANCEOF = 1 << 63;
const EntityId CHILDOF = 2 << 62;// Component data
struct ComponentArray {
    void *elements; // vector<T>
    int size;
};// Archetype graph
struct Archetype;
struct Edge {
    Archetype *add;
    Archetype *remove;
};
struct Archetype {
    Type type;
    vector<EntityId> entity_ids;
    vector<ComponentArray> components;
    int length;
    vector<Edge> edges;
};

Archetypes are a popular approach that provide good general performance, but they are not the only way to implement an ECS. I hope this gave you some background on their pro’s and con’s.

If you liked what you read, consider checking out the Flecs repository (https://github.com/SanderMertens/flecs) and the Flecs Discord channel!