<h3>The Specification</h3>
<p>
Recently at Uni we were given a piece of coursework that had us write a game
in Python. Normally, this is no big deal: simply import pygame, write a
<code>while not quit: ...</code> loop, and be done with it. The catch was
that we weren't allowed to use any non-standard features (i.e. no pip). This
meant that pygame, Pillow, and many other useful libraries were unavailable,
and the only remaining windowing option was Tkinter. Now, I like to do things
from scratch as much as the next person, but even I would not suggest Tkinter
as a (viable) game development framework. But the coursework specification had
had its say, and thus Tkinter it was to be.
</p>

<h3>Tkinter as a Game Engine</h3>
<p>
For fun and profit I decided to write an ECS game engine in Python. Partly
because I had never written an ECS engine before, and wanted to learn about the
upsides and downsides of the ECS paradigm. Partly because I like to make my
life harder than it has to be. Mainly because Python is inherently slow and
I wanted to make the game engine as fast as possible to amortize the cost of
the language. Thus I started reading up about ECS and Tkinter. A great resource
for Tk is <span><a href="http://effbot.org">effbot.org</a></span>, which I used
extensively while researching for the engine. It seems to be
down as of 08-12-2020, which is a shame, but it is still accessible on the
<span><a href="https://web.archive.org/web/20201111171246/https://effbot.org/tkinterbook/">wayback machine</a></span>.
</p>
<p>
Whilst I may come off as negative towards Tkinter, after working with it for
a bit I do think it is quite good. Its performance surprised me, as it was easily
able to draw thousands of canvas entities and move them around without issue at
60 frames per second.
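</p>

<p>
For reference, this is all it takes to get a game "screen" out of Tkinter (an
illustrative sketch with made-up names, not the engine's actual code):
</p>

```python
import tkinter as tk

def make_game_screen(width=800, height=600):
    """Create the root window and a canvas to draw the game on."""
    root = tk.Tk()  # one call initialises the whole library
    canvas = tk.Canvas(root, width=width, height=height, bg="black")
    canvas.pack()   # make the canvas visible in the window
    return root, canvas

# On a machine with a display, this opens the game window:
# root, canvas = make_game_screen()
# root.mainloop()  # hand control over to Tk's event loop
```

<p>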
It was also really easy to work with: a single
<code>Tk()</code> call initialises the entire library, and then a single
<code>Canvas()</code> and <code>canvas.pack()</code> call makes the game
screen show up.
</p>
<p>
Overall I enjoyed the project, and below is further explanation of the ECS
engine itself. For now, the code for the game engine is available on my
<span><a href="https://git.lenczewski.org/pyecs/log.html">git instance</a></span>.
</p>

<h3>Basics of an ECS</h3>
<p>
ECS stands for Entity-Component-System, and is a way of designing programs that
allegedly leads to very clean code and easy composition of behaviours. If you
have component A, with its relevant system (controlling said component's behaviour),
every entity with component A will exhibit the associated behaviour (as it will
be processed by system A). Say that system A causes an entity to oscillate up
and down, and that you have another entity which corresponds to a rock in your scene.
By simply adding component A to this other entity, it will (hopefully) start being
processed by system A, and will start moving up and down.
</p>

<h3>Entities</h3>
<p>
Entities are the "things" in a game: simply unique identifiers, with no data of
their own. They represent discrete objects in the game (e.g. the player, rocks
in the scene, a hidden manager object). Some ECS architectures store tags and
other lightweight metadata on the entity, which is fine, but I chose not to for
"purity" and simplicity reasons.
</p>

<pre><code class="language-c">
typedef unsigned long entity_t; // as simple as a uint64

typedef struct Entity {
    unsigned long id;
    unsigned long flags;
    unsigned long tags;
    // ...
} Entity_t; // or something more complex
</code></pre>

<h3>Components</h3>
<p>
Components hold all the data for the game.
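</p>

<p>
The "bolt a component onto an entity and the matching system picks it up" idea
can be sketched in a few lines of Python (hypothetical names, not the engine's
actual API): entities are plain integers, and each component type gets its own
store keyed by entity id.
</p>

```python
from itertools import count

class World:
    def __init__(self):
        self._ids = count()
        self._stores = {}  # component type -> {entity id: component}

    def create_entity(self):
        return next(self._ids)  # an entity is just a unique integer

    def add_component(self, entity, component):
        self._stores.setdefault(type(component), {})[entity] = component

    def store(self, component_type):
        return self._stores.get(component_type, {})

class Oscillate:  # "component A" from the example above
    def __init__(self, amplitude):
        self.amplitude = amplitude

world = World()
rock = world.create_entity()
world.add_component(rock, Oscillate(amplitude=2.0))

# "system A" is just a loop over the Oscillate store:
for entity, osc in world.store(Oscillate).items():
    print(entity, osc.amplitude)  # here we would bob the entity up and down
```

<p>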
Every entity that needs to have
its mesh rendered, for example, will have a MeshComponent. Each component
stores only the data that it needs, and no more. This separation means that
no memory is wasted on entities that don't need a certain component.
</p>
<p>
Another good reason for storing data in components is that when you have a huge
number of entities, cache locality and the efficient filling of cache become
important. Storing all possible state in one struct, and iterating over an array
of said struct just to do one operation on it (say, update physics), means that
the cache is constantly cold for the next piece of data to operate on.
</p>

<h4>A Little Detour to Hardware Land!</h4>
<p>
When a CPU wants to fetch some data from RAM, it will load a certain slice
(say 64 bytes) into cache first (this is the idea of a cacheline). By loading
more data than necessary from RAM, the CPU makes the assumption that it will
need to fetch nearby data soon, and so it preloads the data into cache (which
is <span><a href="https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf">much much faster</a></span>
than RAM; see pg 22 of the above PDF). The relative timings of cache vs RAM,
taken from an Intel PDF, are as follows:
</p>

<pre><code class="language-text">
Data Source                                        Latency
L3 Cache hit, line unshared between cores          ~40 cycles
L3 Cache hit, line shared readonly between cores   ~65 cycles
L3 Cache hit, line shared readwrite between cores  ~75 cycles
L3 Cache hit, from other core                      ~100-300 cycles
Local DRAM                                         ~60ns
Remote DRAM                                        ~100ns
</code></pre>

<p>
For a CPU clocked at 2GHz (slower than modern CPUs), each cycle takes half a
nanosecond. This means that an unshared L3 hit takes 20ns, whereas going to RAM
takes triple that.
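</p>

<p>
The arithmetic, spelled out (using the 2GHz clock and the Intel figures above):
</p>

```python
CLOCK_GHZ = 2.0  # at 2GHz, the CPU completes 2 cycles per nanosecond

def cycles_to_ns(cycles):
    return cycles / CLOCK_GHZ

l3_hit_ns = cycles_to_ns(40)  # unshared L3 hit
dram_ns = 60                  # local DRAM

print(l3_hit_ns)           # -> 20.0
print(dram_ns / l3_hit_ns) # -> 3.0, i.e. RAM has triple the latency
```

<p>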
The lower-level caches (L1 and L2) can be on the order of 2-5 ns for a
hit, meaning that going to RAM takes between 12 and 30 times longer. Thus, the
CPU can save a lot of cycles by loading chunks instead of fetching data byte by
byte, provided nearby data is also used (so that there will be a cache hit when
said new data is accessed).
</p>

<h4>Getting Back on Track</h4>
<p>
Getting back to why components are a thing, with an example:
</p>

<pre><code class="language-c">
#define ARRLEN(arr) (sizeof(arr) / sizeof((arr)[0]))

// clean separation into multiple discrete components
typedef struct Transform {
    double px, py, pz;
} Transform_t;

typedef struct Velocity {
    double vx, vy, vz;
} Velocity_t;

typedef struct Mesh {
    void *mesh_data;
} Mesh_t;

// chucking it all into one mixer and pressing "blend"
typedef struct State {
    Transform_t transform;
    Velocity_t velocity;
    Mesh_t mesh;
} State_t;

// ...

int main(void) {
    // this approach will have worse cacheline efficiency
    State_t game_objects[] = { ... };

    for (int i = 0; i < ARRLEN(game_objects); i++) {
        calculate_physics(&game_objects[i]);
        render_object(&game_objects[i]);
    }

    // this approach will have better cacheline efficiency
    Entity_t entities[] = { ... };
    Transform_t transforms[] = { ... };
    Velocity_t velocities[] = { ... };
    Mesh_t meshes[] = { ... };

    for (int i = 0; i < ARRLEN(entities); i++) {
        process_physics(&transforms[i], &velocities[i]);
    }

    for (int i = 0; i < ARRLEN(entities); i++) {
        render(&transforms[i], &meshes[i]);
    }
}
</code></pre>

<p>
Why is the second approach better? Don't we iterate twice instead of once? The
reason the second approach is preferable is that it packs the cacheline
much more efficiently (leading to a higher percentage of cacheline hits, and
thus time saved).
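</p>

<p>
A quick back-of-envelope check of the two layouts (assuming 8-byte doubles,
8-byte pointers, 64-byte cachelines, and ignoring padding):
</p>

```python
CACHELINE = 64
DOUBLE = PTR = 8

TRANSFORM = 3 * DOUBLE               # px, py, pz -> 24 bytes
VELOCITY = 3 * DOUBLE                # vx, vy, vz -> 24 bytes
MESH = PTR                           # mesh_data  ->  8 bytes
STATE = TRANSFORM + VELOCITY + MESH  # the mixed struct -> 56 bytes

# entities' worth of data per cacheline when running physics:
print(CACHELINE // STATE)      # mixed layout: 1
print(CACHELINE // TRANSFORM)  # dedicated transforms array: 2
```

<p>
The mixed struct barely fits one entity's data per cacheline, whilst a
dedicated transforms array fits two.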
</p>
<p>
Below is a visualisation of this difference, with <code>x</code>, <code>y</code>,
and <code>z</code> standing in for Transform, Velocity, and Mesh. The dashes
show a simplified cacheline.
</p>

<pre><code class="language-text">
Mixed state:
| x0 y0 z0 x1 | y1 z1 x2 y2 | z2 x3 y3 z3 | x4 y4 z4 .. |
| ----------- | ----------- | ----------- | ----------- |

Components:
| x0 y0 x1 y1 | x2 y2 x3 y3 | x4 y4 .. .. |
| ----------- | ----------- | ----------- |

| x0 z0 x1 z1 | x2 z2 x3 z3 | x4 z4 .. .. |
| ----------- | ----------- | ----------- |
</code></pre>

<p>
Not only does processing the mixed state do worse on the instruction cache, it
also fills the regular data cache worse. Only one struct fits in a cacheline at
a time, and to get the next struct to process, another fetch must be made. This
contrasts with the component approach, where each cacheline can fit the data for
two entities (Transform + one other component) at once. As the number of
processed entities grows, the worse cache performance of the mixed-state
approach becomes more and more evident.
</p>
<p>
The fact that you have to iterate twice doesn't really matter: you end up doing
the same amount of work (you still do <code>render()</code> and
<code>physics()</code> for each entity). The difference is that in the first
approach the instruction cache is cold on every iteration, as the CPU alternates
between physics and rendering code. In the second approach it is cold only for
the first entity, and hot thereafter. This means that the second approach can
outperform the first.
</p>

<h3>Systems</h3>
<p>
Systems implement the behaviour seen in a game. From rendering and physics to
input handling and enemy AI, everything the player experiences is the result of
a "system". The only difference between a system and a standalone "script" is
that where a script acts on a single object, the system acts on an iterator of
objects (be that an array or a vector).
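</p>

<p>
As a sketch (hypothetical names, not the engine's actual code), the difference
looks like this:
</p>

```python
# script-style: behaviour lives on each individual object
class Rock:
    def __init__(self):
        self.x = self.y = 0.0

    def update(self, dt):
        self.y += 9.8 * dt  # every object carries its own update logic

# system-style: one function sweeps over whole component stores
def physics_system(transforms, velocities, dt):
    # transforms, velocities: {entity id: component}
    for eid, vel in velocities.items():
        t = transforms[eid]
        t.x += vel.vx * dt
        t.y += vel.vy * dt
```

<p>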
This means that the instruction cache
of the CPU is kept hot while processing objects, as the CPU doesn't have to go
from processing a Collider to rendering a Mesh to then deserialising some data
from the network.
</p>
<p>
Whilst systems at their simplest process a single component type, in real
situations it is often necessary to reference other components on the same
entity. For example, when running physics, it is important to know an entity's
current position (e.g. stored in a Transform component) and its velocity (e.g.
stored in a Velocity component), as well as its current collider (yet another
component).
</p>
<p>
One solution is to move the velocity data into the Transform. This means that it
is no longer possible to have an entity with no velocity, only one with zero
velocity. This makes the Transform component larger, harming performance, and
eventually leads to a single "mega-component" holding all possible data an
entity could have.
</p>
<p>
Another solution is to have systems work not on individual components, but on
unions of components instead. This is what the "archetypal" ECS architecture is
about. By doing this, data can be kept in smaller, more cache-friendly chunks,
as discussed earlier.
</p>

<pre><code class="language-python">
class RenderSystem:
    def actions(self):
        return {
            self.process_active: (Transform2D, ScreenElement),
            self.process_stale: (StaleTag,),
        }
</code></pre>

<p>
This is the approach I took: each System has an <code>actions()</code>
method which returns delegates that process entities, and the archetypes those
delegates operate on. This allows a single system to contain many processing
methods, and thus keeps related state in one place instead of strewn around many
different systems. The above example shows how one system can handle multiple
archetypes.
</p>
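<p>
To make the dispatch concrete, here is a simplified sketch (hypothetical helper
names, not the engine's actual code) of how a world might match entities against
each action's archetype and invoke the delegates:
</p>

```python
class Transform2D: pass
class ScreenElement: pass
class StaleTag: pass

class RenderSystem:
    def __init__(self):
        self.processed = []  # record calls so the dispatch is visible

    def process_active(self, entities):
        self.processed.append(("active", entities))

    def process_stale(self, entities):
        self.processed.append(("stale", entities))

    def actions(self):
        return {
            self.process_active: (Transform2D, ScreenElement),
            self.process_stale: (StaleTag,),
        }

def run_systems(systems, stores):
    """stores: {component type: {entity id: component instance}}"""
    for system in systems:
        for delegate, archetype in system.actions().items():
            # an entity matches if it has every component in the archetype
            per_type = [stores.get(t, {}).keys() for t in archetype]
            matching = set(per_type[0]).intersection(*per_type[1:])
            delegate(sorted(matching))

# entity 1 is on screen, entity 2 is stale and due for cleanup
stores = {
    Transform2D: {1: Transform2D()},
    ScreenElement: {1: ScreenElement()},
    StaleTag: {2: StaleTag()},
}
renderer = RenderSystem()
run_systems([renderer], stores)
print(renderer.processed)  # -> [('active', [1]), ('stale', [2])]
```

<p>
Intersecting the per-type stores is the naive way to resolve an archetype; a
real engine would likely cache these unions rather than recompute them every
frame.
</p>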