<h3>The Specification</h3>
<p>
Recently at Uni we were given a piece of coursework that had us write a game
in Python. Normally, this is no big deal: simply import pygame, write a
<code>while not quit: ...</code> loop, and be done with it. The catch was
that we weren't allowed to use any non-standard features (i.e. no pip). This
meant that pygame, Pillow, and many other useful libraries were unavailable,
and the only remaining windowing option was Tkinter. Now, I like to do things
from scratch as much as the next person, but even I would not suggest Tkinter
as a (viable) game development framework. But the coursework specification had
had its say, and thus Tkinter it was to be.
</p>

<h3>Tkinter as a Game Engine</h3>
<p>
For fun and profit I decided to write an ECS game engine in Python. Partly
because I had never written an ECS engine before, and wanted to learn about the
upsides and downsides of the ECS paradigm. Partly because I like to make my
life harder than it has to be. Mainly because Python is inherently slow and
I wanted to make the game engine as fast as possible to amortize the cost of
the language. Thus I started reading up about ECS and Tkinter. A great resource
for Tk is <span><a href="http://effbot.org">effbot.org</a></span>, which I used
extensively while researching for the engine. It seems to be
down as of 08-12-2020, which is a shame, but it is still accessible on the
<span><a href="https://web.archive.org/web/20201111171246/https://effbot.org/tkinterbook/">wayback machine</a></span>.
</p>
<p>
Whilst I may come off as negative towards Tkinter, after working with it for
a bit I do think it is quite good. Its performance surprised me, as it was easily
able to draw thousands of canvas entities and move them around without issue at
60 frames per second.
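</p>

<p>
For reference, this is all it takes to get a game "screen" out of Tkinter (an
illustrative sketch with made-up names, not the engine's actual code):
</p>

```python
import tkinter as tk

def make_game_screen(width=800, height=600):
    """Create the root window and a canvas to draw the game on."""
    root = tk.Tk()  # one call initialises the whole library
    canvas = tk.Canvas(root, width=width, height=height, bg="black")
    canvas.pack()   # make the canvas visible in the window
    return root, canvas

# On a machine with a display, this opens the game window:
# root, canvas = make_game_screen()
# root.mainloop()  # hand control over to Tk's event loop
```

<p>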
It was also really easy to work with: a single
<code>Tk()</code> call initialises the entire library, and then a single
<code>Canvas()</code> and <code>canvas.pack()</code> call makes the game
screen show up.
</p>
<p>
Overall I enjoyed the project, and below is further explanation of the ECS
engine itself. For now, the code for the game engine is available on my
<span><a href="https://git.lenczewski.org/pyecs/log.html">git instance</a></span>.
</p>

<h3>Basics of an ECS</h3>
<p>
ECS stands for Entity-Component-System, and is a way of designing programs that
allegedly leads to very clean code and easy composition of behaviours. If you
have component A, with its relevant system (controlling said component's behaviour),
every entity with component A will exhibit the associated behaviour (as it will
be processed by system A). Say that system A causes an entity to oscillate up
and down, and that you have another entity which corresponds to a rock in your scene.
By simply adding component A to this other entity, it will (hopefully) start being
processed by system A, and will start moving up and down.
</p>

<h3>Entities</h3>
<p>
Entities are the "things" in a game: simply unique identifiers, with no data of
their own. They represent discrete objects in the game (e.g. the player, rocks
in the scene, a hidden manager object). Some ECS architectures store tags and
other lightweight metadata on the entity, which is fine, but I chose not to for
"purity" and simplicity reasons.
</p>

<pre><code class="language-c">
typedef unsigned long entity_t; // as simple as a uint64

typedef struct Entity {
    unsigned long id;
    unsigned long flags;
    unsigned long tags;
    // ...
} Entity_t; // or something more complex
</code></pre>

<h3>Components</h3>
<p>
Components hold all the data for the game.
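</p>

<p>
The "bolt a component onto an entity and the matching system picks it up" idea
can be sketched in a few lines of Python (hypothetical names, not the engine's
actual API): entities are plain integers, and each component type gets its own
store keyed by entity id.
</p>

```python
from itertools import count

class World:
    def __init__(self):
        self._ids = count()
        self._stores = {}  # component type -> {entity id: component}

    def create_entity(self):
        return next(self._ids)  # an entity is just a unique integer

    def add_component(self, entity, component):
        self._stores.setdefault(type(component), {})[entity] = component

    def store(self, component_type):
        return self._stores.get(component_type, {})

class Oscillate:  # "component A" from the example above
    def __init__(self, amplitude):
        self.amplitude = amplitude

world = World()
rock = world.create_entity()
world.add_component(rock, Oscillate(amplitude=2.0))

# "system A" is just a loop over the Oscillate store:
for entity, osc in world.store(Oscillate).items():
    print(entity, osc.amplitude)  # here we would bob the entity up and down
```

<p>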
Every entity that needs to have
its mesh rendered, for example, will have a MeshComponent. Each component
stores only the data that it needs, and no more. This separation means that
no memory is wasted on entities that don't need a certain component.
</p>
<p>
Another good reason for storing data in components is that when you have a huge
number of entities, cache locality and the efficient filling of cache become
important. Storing all possible state in one struct, and iterating over an array
of said struct just to do one operation on it (say, update physics), means that
the cache is constantly cold for the next piece of data to operate on.
</p>

<h4>A Little Detour to Hardware Land!</h4>
<p>
When a CPU wants to fetch some data from RAM, it will load a certain slice
(say 64 bytes) into cache first (this is the idea of a cacheline). By loading
more data than necessary from RAM, the CPU makes the assumption that it will
need to fetch nearby data soon, and so it preloads the data into cache (which
is <span><a href="https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf">much much faster</a></span>
than RAM; see pg 22 of the above PDF). The relative timings of cache vs RAM,
taken from an Intel PDF, are as follows:
</p>

<pre><code class="language-text">
Data Source                                        Latency
L3 Cache hit, line unshared between cores          ~40 cycles
L3 Cache hit, line shared readonly between cores   ~65 cycles
L3 Cache hit, line shared readwrite between cores  ~75 cycles
L3 Cache hit, from other core                      ~100-300 cycles
Local DRAM                                         ~60ns
Remote DRAM                                        ~100ns
</code></pre>

<p>
For a CPU clocked at 2GHz (slower than modern CPUs), each cycle takes half a
nanosecond. This means that an unshared L3 hit takes 20ns, whereas going to RAM
takes triple that.
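</p>

<p>
The arithmetic, spelled out (using the 2GHz clock and the Intel figures above):
</p>

```python
CLOCK_GHZ = 2.0  # at 2GHz, the CPU completes 2 cycles per nanosecond

def cycles_to_ns(cycles):
    return cycles / CLOCK_GHZ

l3_hit_ns = cycles_to_ns(40)  # unshared L3 hit
dram_ns = 60                  # local DRAM

print(l3_hit_ns)           # -> 20.0
print(dram_ns / l3_hit_ns) # -> 3.0, i.e. RAM has triple the latency
```

<p>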
The lower-level caches (L1 and L2) can be on the order of 2-5 ns for a
hit, meaning that going to RAM takes between 12 and 30 times longer. Thus, the
CPU can save a lot of cycles by loading chunks instead of fetching data byte by
byte, provided nearby data is also used (so that there will be a cache hit when
said new data is accessed).
</p>

<h4>Getting Back on Track</h4>
<p>
Getting back to why components are a thing, with an example:
</p>

<pre><code class="language-c">
#define ARRLEN(arr) (sizeof(arr) / sizeof((arr)[0]))

// clean separation into multiple discrete components
typedef struct Transform {
    double px, py, pz;
} Transform_t;

typedef struct Velocity {
    double vx, vy, vz;
} Velocity_t;

typedef struct Mesh {
    void *mesh_data;
} Mesh_t;

// chucking it all into one mixer and pressing "blend"
typedef struct State {
    Transform_t transform;
    Velocity_t velocity;
    Mesh_t mesh;
} State_t;

// ...

int main(void) {
    // this approach will have worse cacheline efficiency
    State_t game_objects[] = { ... };

    for (int i = 0; i < ARRLEN(game_objects); i++) {
        calculate_physics(&game_objects[i]);
        render_object(&game_objects[i]);
    }

    // this approach will have better cacheline efficiency
    Entity_t entities[] = { ... };
    Transform_t transforms[] = { ... };
    Velocity_t velocities[] = { ... };
    Mesh_t meshes[] = { ... };

    for (int i = 0; i < ARRLEN(entities); i++) {
        process_physics(&transforms[i], &velocities[i]);
    }

    for (int i = 0; i < ARRLEN(entities); i++) {
        render(&transforms[i], &meshes[i]);
    }
}
</code></pre>

<p>
Why is the second approach better? Don't we iterate twice instead of once? The
reason the second approach is preferable is that it packs the cacheline
much more efficiently (leading to a higher percentage of cacheline hits, and
thus time saved).
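</p>

<p>
A quick back-of-envelope check of the two layouts (assuming 8-byte doubles,
8-byte pointers, 64-byte cachelines, and ignoring padding):
</p>

```python
CACHELINE = 64
DOUBLE = PTR = 8

TRANSFORM = 3 * DOUBLE               # px, py, pz -> 24 bytes
VELOCITY = 3 * DOUBLE                # vx, vy, vz -> 24 bytes
MESH = PTR                           # mesh_data  ->  8 bytes
STATE = TRANSFORM + VELOCITY + MESH  # the mixed struct -> 56 bytes

# entities' worth of data per cacheline when running physics:
print(CACHELINE // STATE)      # mixed layout: 1
print(CACHELINE // TRANSFORM)  # dedicated transforms array: 2
```

<p>
The mixed struct barely fits one entity's data per cacheline, whilst a
dedicated transforms array fits two.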
</p>
<p>
Below is a visualisation of this difference, with <code>x</code>, <code>y</code>,
and <code>z</code> standing in for Transform, Velocity, and Mesh. The dashes
show a simplified cacheline.
</p>

<pre><code class="language-text">
Mixed state:
| x0 y0 z0 x1 | y1 z1 x2 y2 | z2 x3 y3 z3 | x4 y4 z4 .. |
| ----------- | ----------- | ----------- | ----------- |

Components:
| x0 y0 x1 y1 | x2 y2 x3 y3 | x4 y4 .. .. |
| ----------- | ----------- | ----------- |

| x0 z0 x1 z1 | x2 z2 x3 z3 | x4 z4 .. .. |
| ----------- | ----------- | ----------- |
</code></pre>

<p>
Not only does processing the mixed state do worse on the instruction cache, it
also fills the regular data cache worse. Only one struct fits in a cacheline at
a time, and to get the next struct to process, another fetch must be made. This
contrasts with the component approach, where each cacheline can fit the data for
two entities (Transform + one other component) at once. As the number of
processed entities grows, the worse cache performance of the mixed-state
approach becomes more and more evident.
</p>
<p>
The fact that you have to iterate twice doesn't really matter: you end up doing
the same amount of work (you still do <code>render()</code> and
<code>physics()</code> for each entity). The difference is that in the first
approach the instruction cache is cold on every iteration, as the CPU alternates
between physics and rendering code. In the second approach it is cold only for
the first entity, and hot thereafter. This means that the second approach can
outperform the first.
</p>

<h3>Systems</h3>
<p>
Systems implement the behaviour seen in a game. From rendering and physics to
input handling and enemy AI, everything the player experiences is the result of
a "system". The only difference between a system and a standalone "script" is
that where a script acts on a single object, the system acts on an iterator of
objects (be that an array or a vector).
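</p>

<p>
As a sketch (hypothetical names, not the engine's actual code), the difference
looks like this:
</p>

```python
# script-style: behaviour lives on each individual object
class Rock:
    def __init__(self):
        self.x = self.y = 0.0

    def update(self, dt):
        self.y += 9.8 * dt  # every object carries its own update logic

# system-style: one function sweeps over whole component stores
def physics_system(transforms, velocities, dt):
    # transforms, velocities: {entity id: component}
    for eid, vel in velocities.items():
        t = transforms[eid]
        t.x += vel.vx * dt
        t.y += vel.vy * dt
```

<p>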
This means that the instruction cache
of the CPU is kept hot while processing objects, as the CPU doesn't have to go
from processing a Collider to rendering a Mesh to then deserialising some data
from the network.
</p>
<p>
Whilst systems at their simplest process a single component type, in real
situations it is often necessary to reference other components on the same
entity. For example, when running physics, it is important to know an entity's
current position (e.g. stored in a Transform component) and its velocity (e.g.
stored in a Velocity component), as well as its current collider (yet another
component).
</p>
<p>
One solution is to move the velocity data into the Transform. This means that it
is no longer possible to have an entity with no velocity, only one with zero
velocity. This makes the Transform component larger, harming performance, and
eventually leads to a single "mega-component" holding all possible data an
entity could have.
</p>
<p>
Another solution is to have systems work not on individual components, but on
unions of components instead. This is what the "archetypal" ECS architecture is
about. By doing this, data can be kept in smaller, more cache-friendly chunks,
as discussed earlier.
</p>

<pre><code class="language-python">
class RenderSystem:
    def actions(self):
        return {
            self.process_active: (Transform2D, ScreenElement),
            self.process_stale: (StaleTag,),
        }
</code></pre>

<p>
This is the approach I took: each System has an <code>actions()</code>
method which returns delegates that process entities, and the archetypes those
delegates operate on. This allows a single system to contain many processing
methods, and thus keeps related state in one place instead of strewn around many
different systems. The above example shows how one system can handle multiple
archetypes.
</p>
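<p>
To make the dispatch concrete, here is a simplified sketch (hypothetical helper
names, not the engine's actual code) of how a world might match entities against
each action's archetype and invoke the delegates:
</p>

```python
class Transform2D: pass
class ScreenElement: pass
class StaleTag: pass

class RenderSystem:
    def __init__(self):
        self.processed = []  # record calls so the dispatch is visible

    def process_active(self, entities):
        self.processed.append(("active", entities))

    def process_stale(self, entities):
        self.processed.append(("stale", entities))

    def actions(self):
        return {
            self.process_active: (Transform2D, ScreenElement),
            self.process_stale: (StaleTag,),
        }

def run_systems(systems, stores):
    """stores: {component type: {entity id: component instance}}"""
    for system in systems:
        for delegate, archetype in system.actions().items():
            # an entity matches if it has every component in the archetype
            per_type = [stores.get(t, {}).keys() for t in archetype]
            matching = set(per_type[0]).intersection(*per_type[1:])
            delegate(sorted(matching))

# entity 1 is on screen, entity 2 is stale and due for cleanup
stores = {
    Transform2D: {1: Transform2D()},
    ScreenElement: {1: ScreenElement()},
    StaleTag: {2: StaleTag()},
}
renderer = RenderSystem()
run_systems([renderer], stores)
print(renderer.processed)  # -> [('active', [1]), ('stale', [2])]
```

<p>
Intersecting the per-type stores is the naive way to resolve an archetype; a
real engine would likely cache these unions rather than recompute them every
frame.
</p>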