
Will KSP 2 have better file/data formats?


kfsone


I recently had to try and help a team of devs understand why our Windows port of an app went from a 15s startup on Mac and Mobile to 2-5 minutes: The current Windows OS is based on Windows NT, whose design in turn owes a lot to the VMS operating system. So it starts with a fantastic permissions system, but that comes with a perf cost. Then, over the years, it's acquired various things in the kernel that Linux, for example, does in user space.

The upshot is that just opening a file on Windows can be anywhere up to 80 times slower than opening it under another OS.

This isn't "Windows is slow" or "Windows IO is slow", it's just opening files. Open a few large files and you'll not notice it. Open a lot of small files, and you'll add seconds or minutes.

The project I was helping port had 429,000+ assets it was accessing on startup.

If you want to see this for yourself: https://github.com/kfsone/filebench  (uses C++17)

Quote

Output on a high-end 2019 i7 on Samsung nvme drive/Windows:


Iteration Stats (4096 files): min=56ms, avg=59ms, p90=62ms, p95=63ms, p99=68ms, max=77ms
Per File: 14411ns

Output from an Ubuntu 18.04 VM on an Intel NUC with a SATA 2 SSD:


Iteration Stats (4096 files): min=10ms, avg=11ms, p90=12ms, p95=13ms, p99=15ms, max=18ms
Per File: 2786ns

Output from an Ubuntu 19.04 Linux machine on an EVO 960 SSD:


Iteration Stats (4096 files): min=4ms, avg=5ms, p90=5ms, p95=5ms, p99=6ms, max=7ms
Per File: 1314ns

The other significant factor is the amount of whitespace in the KSP1 files. I have 76MB of .cfg files in my fairly lightweight Windows folder, and about 16MB of that is [edit: beginning-of-line] whitespace... On startup, you spend quite a lot of time skipping over this.

I was able to shave a chunk off my startup time by running this under WSL on my KSP .cfg files (note that `perl -i` rewrites the files in place, so take a backup first):

# s://.*?([\r\n]+):$1: - removes comments but leaves end-of-line as is
# s:^[ \t]+::  - removes leading whitespaces
# s:[ \t]+([\r\n]+):$1: - removes trailing whitespaces but leaves end-of-line as is
find . -name \*.cfg -print0 | xargs -0 perl -i -pe 's://.*?([\r\n]+):$1:; s:^[ \t]+::; s:[ \t]+([\r\n]+):$1:'

Fair enough - the file format has been a great boon to the modding community, I get that, and I get how handy it is as a dev not to have to put files thru the asset pipeline to see a change.

Could you maybe at least convert the outputs to a binary representation and cache those in N large files? Allocate a bunch of padding between source files so that relatively small changes don't require rebuilding the entire file.
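
One way to read the padding idea, as a rough sketch only: each compiled blob gets its own slot in a big cache file, rounded up with some slack, so a slightly larger rebuild of one source can be rewritten in place. The names `SLOT_ALIGN`, `SLACK`, `pack_blobs` and `update_blob` are made up for illustration, and the 4 KiB / 25% figures are arbitrary assumptions, not anything KSP does.

```python
import io

SLOT_ALIGN = 4096   # assumed slot granularity (illustrative)
SLACK = 1.25        # assumed 25% headroom per entry (illustrative)

def slot_size(length):
    """Round length * SLACK up to the next multiple of SLOT_ALIGN."""
    padded = int(length * SLACK) + 1
    return -(-padded // SLOT_ALIGN) * SLOT_ALIGN

def pack_blobs(blobs, out):
    """Write each blob into its own padded slot.

    Returns an index of {name: (offset, length, capacity)}.
    """
    index = {}
    offset = 0
    for name, data in blobs.items():
        capacity = slot_size(len(data))
        out.seek(offset)
        out.write(data)
        out.write(b"\0" * (capacity - len(data)))  # padding for future growth
        index[name] = (offset, len(data), capacity)
        offset += capacity
    return index

def update_blob(out, index, name, data):
    """Rewrite one entry in place; return False if it no longer fits its slot."""
    offset, _, capacity = index[name]
    if len(data) > capacity:
        return False  # caller would have to repack the whole file
    out.seek(offset)
    out.write(data)
    index[name] = (offset, len(data), capacity)
    return True

def read_blob(out, index, name):
    """Read back exactly the live bytes of one entry."""
    offset, length, _ = index[name]
    out.seek(offset)
    return out.read(length)
```

The index would need to be stored somewhere too (a header or a sidecar file); that part is left out here.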

The other thing you might consider is running tokenization in parallel with loading. On a fast filesystem, the tokenizer is unlikely to starve the alloc-heavy instantiator; on a slow filesystem it may fall behind, but the whole thing will still be significantly faster than before.
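
A minimal sketch of that split, assuming a reader/tokenizer thread feeding a bounded queue and the instantiator running on the calling thread. `tokenize` here is a whitespace-splitting stand-in and `instantiate` is whatever callback the caller supplies; both are placeholders, not KSP's real loader. In CPython the win comes from overlapping file I/O with instantiation; a native engine would get real CPU parallelism as well.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def tokenize(text):
    """Stand-in tokenizer: split on whitespace."""
    return text.split()

def reader(paths, out_q):
    """Read and tokenize each file, handing results to the consumer."""
    try:
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as fh:
                out_q.put((path, tokenize(fh.read())))
    finally:
        out_q.put(_DONE)  # always unblock the consumer

def load_all(paths, instantiate, max_pending=64):
    """Overlap reading/tokenizing with instantiation.

    The bounded queue stops the reader from racing arbitrarily far ahead.
    """
    q = queue.Queue(maxsize=max_pending)
    worker = threading.Thread(target=reader, args=(paths, q), daemon=True)
    worker.start()
    results = []
    while True:
        item = q.get()
        if item is _DONE:
            break
        path, tokens = item
        results.append(instantiate(path, tokens))
    worker.join()
    return results
```

Files come out in the same order they went in, so anything order-dependent downstream keeps working.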

 

Edited by kfsone

Interesting analysis.  I'm not sure how the devs will approach this, but I could see the current config file format staying fairly static (or at least being left as something similar) - unless you're heavily modded, it's not going to add huge amounts of time.  (Though parallelization would be a good idea.)  Module Manager already tries to do the caching to an extent, and it shows the main issue with caching for this: You still need to check for changes to see if the cache is still valid.

Personally if I were on the dev team I'd be looking at something like SQLite for the main save file/craft file format.  It's not much harder to work with manually than a text file, but it'll be far more robust in use.  But that's just a personal thought.


40 minutes ago, DStaal said:

the main issue with caching for this: You still need to check for changes to see if the cache is still valid

Personally if I were on the dev team I'd be looking at something like SQLite for the main save file/craft file format.  It's not much harder to work with manually than a text file, but it'll be far more robust in use.  But that's just a personal thought.

`stat`ing files is much cheaper than opening them, and if you choose your directory crawl carefully you get that info with your list of files and without any further effort. ext4 and ntfs anticipate and optimize for this pattern.
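
As a sketch of that kind of crawl (function names `crawl` and `changed` are made up here): Python's `os.scandir` returns directory entries that carry file metadata, and the Python docs note that on Windows `DirEntry.stat()` normally needs no extra system call for regular files. On POSIX it still costs one `stat` per file, but never an open.

```python
import os

def crawl(root, suffixes=(".cfg",)):
    """Return {path: (size, mtime_ns)} for matching files under root."""
    manifest = {}
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.lower().endswith(suffixes):
                    st = entry.stat()  # no file is opened here
                    manifest[entry.path] = (st.st_size, st.st_mtime_ns)
    return manifest

def changed(old, new):
    """Paths that were added, removed, or differ in size or mtime."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Comparing the manifest from the previous run against a fresh one gives the list of sources whose cached form would need rebuilding.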

If they want a text format, they should really consider using an existing one optimized for performance, such as json or https://cuelang.org/. If they choose yaml, I'm gonna necrorez this thread and beg everyone to kick me in the head, though.

Other options, especially since they're talking multi-player, would be using something like thrift/protobuf/grpc and serializing the message formats. Yes, as a programmer, I love that I can open the files in vim/emacs/vscode/textedit ... And yes, I hate the guy who says "if we did it in formatX we could write an editor", but the other option is to look at scripting choices like Jupyter Notebooks where you can use a textual entry/modification method that translates to a binary representation.

Going to something like sqlite would reap the benefits of a single file that can be mmap/createfilemapping'd, definite win there. It can be a little more overhead than you bargain for, tho, and it means you have to start being a dba along with everything else. I've used sqlite for a few things, including https://pypi.org/project/tradedangerous/, but I supplemented it with textual ingestion and transfer formats so users still have access to text files if/when they want them, but general usage isn't bothered by them, and startup times are significantly less painful than they might have been.

Alternatively, aim for a more centralized approach and try to encourage collation of files to reduce the overall file count.

You can try this for yourself. This is how long it takes *just* to read all the part/craft files, never mind all the texture files etc. No parsing, just reading the files, and this is immediately after I loaded the game on a 64GB machine with a high-end Plextor 2TB NVMe:

C:/KSPPath/> $files = dir -r *.cfg,*.part,*.craft
C:/KSPPath/> $files | measure
Count             : 3461
...
C:/KSPPath/> measure-command { get-content $files >$null }
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 42
Milliseconds      : 751
Ticks             : 427514498
TotalDays         : 0.000494808446759259
TotalHours        : 0.0118754027222222
TotalMinutes      : 0.712524163333333
TotalSeconds      : 42.7514498
TotalMilliseconds : 42751.4498

You'll have to take my word that these are representative samples rather than one-offs :) But...

[G:\Steam\steamapps\common\Kerbal Space Program\GameData]> wsl
oliver@Spud:/mnt/g/Steam/steamapps/common/Kerbal Space Program/GameData$ find . -name \*.cfg -print0 | xargs -0 perl -i -pe 's/^[ \t]+//; s/[ \t]+([\r\n]+)/$1/; s/\s+\/\/.*?([\r\n]+)/$1/'
oliver@Spud:/mnt/g/Steam/steamapps/common/Kerbal Space Program/GameData$ exit
logout
[G:\Steam\steamapps\common\Kerbal Space Program\GameData]
> measure-command { get-content $files >$null }
Get-Content: An object at the specified path G:\Steam\steamapps\common\Kerbal Space Program\GameData\[x] Science!\PluginData\[x] Science!\settings.cfg does not exist, or has been filtered by the -Include or -Exclude parameter.
Get-Content: An object at the specified path G:\Steam\steamapps\common\Kerbal Space Program\GameData\[x] Science!\PluginData\[x] Science!\settings.cfg does not exist, or has been filtered by the -Include or -Exclude parameter.

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 18
Milliseconds      : 806
Ticks             : 188060925
TotalDays         : 0.000217663107638889
TotalHours        : 0.00522391458333333
TotalMinutes      : 0.313434875
TotalSeconds      : 18.8060925
TotalMilliseconds : 18806.0925

So - 20 seconds, big whoop. I triple dare the KSP team to add metrics to their build and launch system so that they track every dev-build startup and see just how much they are paying their staff to watch loading screens :( Open a simple TCP connection to something like a flask/sinatra server, and log the connection time, write the build info and pid over the socket, and leave the socket open. Any termination of the process will close the socket; if the machine hard-locks, the server will see the TCP session time out a while later (sooner with keepalives enabled). Otherwise, you can have the client send breadcrumbs during execution and shutdown that you can use to monitor general health of your dev builds.
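
A bare-bones sketch of that socket idea, using Python's standard `socketserver` as a stand-in for the flask/sinatra-style collector. The newline-delimited JSON messages and the names `open_metrics_socket`, `breadcrumb`, `SessionHandler` and `MetricsServer` are assumptions for illustration, not an existing tool.

```python
import json
import os
import socket
import socketserver
import time

def open_metrics_socket(host, port, build_info):
    """Client side: connect, announce pid and build, keep the socket open."""
    sock = socket.create_connection((host, port), timeout=5)
    sock.settimeout(None)
    # Keepalives let the server notice a hard-locked machine eventually.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    hello = {"pid": os.getpid(), "build": build_info}
    sock.sendall((json.dumps(hello) + "\n").encode())
    return sock  # hold on to this for the life of the process

def breadcrumb(sock, label):
    """Client side: report a milestone such as 'main_menu'."""
    sock.sendall((json.dumps({"event": label}) + "\n").encode())

class SessionHandler(socketserver.StreamRequestHandler):
    """Server side: one session per process run; ends when the socket closes."""
    def handle(self):
        started = time.time()
        events = []
        try:
            for line in self.rfile:  # one JSON object per line
                events.append(json.loads(line))
        except OSError:
            pass  # abrupt disconnect still counts as end of session
        self.server.sessions.append(
            {"started": started, "lasted": time.time() - started, "events": events}
        )

class MetricsServer(socketserver.ThreadingTCPServer):
    daemon_threads = True

    def __init__(self, address):
        super().__init__(address, SessionHandler)
        self.sessions = []  # a real collector would write to a log or DB
```

Time from connect to the first breadcrumb is your startup time, and a session that ends without a shutdown breadcrumb is a crash or a kill.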
 

At my last game gig, their asset pipeline spent 15 minutes processing text files, which I whipped into 50 seconds with a minor change to their language spec and some changes to the parser. I actually got it down to ~800ms with a quick parser written in C++ but the tooling was all in Python so it wasn't worth the codebase overhead.

-Oliver


13 minutes ago, kfsone said:

`stat`ing files is much cheaper than opening them, and if you choose your directory crawl carefully you get that info with your list of files and without any further effort. ext4 and ntfs anticipate and optimize for this pattern.

I'll take your word for it, as I haven't used Windows in ages.  ;)  (I know that holds true on Unix and Mac, of course.)

15 minutes ago, kfsone said:

If they want a text format, they should really consider using an existing one optimized for performance, such as json or https://cuelang.org/. If they choose yaml, I'm gonna necrorez this thread and beg everyone to kick me in the head, though.

Oh, definitely.  My preference would likely be JSON, as it's similar to the current format, and widely-known.  (Also, I'm not a fan of white-space delimiters.  Too many ways for that to get confused.)

19 minutes ago, kfsone said:

Going to something like sqlite would reap the benefits of a single file that can be mmap/createfilemapping'd, definite win there. It can be a little more overhead than you bargain for, tho, and it means you have to start being a dba along with everything else. I've used sqlite for a few things, including https://pypi.org/project/tradedangerous/, but I supplemented it with textual ingestion and transfer formats so users still have access to text files if/when they want them, but general usage isn't bothered by them, and startup times are significantly less painful than they might have been.

Alternatively, aim for a more centralized approach and try to encourage collation of files to reduce the overall file count

An ingest/export program wouldn't be that hard to write, assuming a fairly sane db layout.  Even potentially for a modder to write.  And there's no particular reason you couldn't just share the sqlite files directly to share craft files.   Alternatively, have one central craft repository, and an export/import function for sharing files between users - either into text or into a DB format.
 

And you'd get a simplistic multiplayer by just allowing SQL commands between running instances.  (Though that would only really be a start.)

I'll take your word for the metrics - as I said, I don't have a Windows machine to play with.  ;)  Interesting facet to consider however.


31 minutes ago, kfsone said:

Going to something like sqlite would reap the benefits of a single file that can be mmap/createfilemapping'd, definite win there. It can be a little more overhead than you bargain for, tho, and it means you have to start being a dba along with everything else. I've used sqlite for a few things, including https://pypi.org/project/tradedangerous/, but I supplemented it with textual ingestion and transfer formats so users still have access to text files if/when they want them, but general usage isn't bothered by them, and startup times are significantly less painful than they might have been.

Please not sqlite.  In fact, no SQL.  Way too easy to screw up queries and totally bog down a system

I could live with JSON.  But what I'd like to see would be something like what KSP currently uses, and a JSON translator to take the source file and output JSON.  Source file for ease of use, JSON for speed of implementation
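
To give a feel for how small such a translator could be, here is a sketch that assumes the usual KSP 1 layout: a node name on its own line, `{` and `}` on their own lines, `key = value` pairs, and `//` comments. It is a simplified reading of the format, not the game's parser. Keys can repeat, so values are kept as an ordered list of pairs, and the naive comment stripping would mangle a value that contains `//`.

```python
import json

def parse_cfg(text):
    """Parse KSP-style config text into nested dicts (simplified sketch)."""
    root = {"name": None, "values": [], "nodes": []}
    stack = [root]
    pending_name = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # naive comment stripping
        if not line:
            continue
        if line == "{":
            node = {"name": pending_name, "values": [], "nodes": []}
            stack[-1]["nodes"].append(node)
            stack.append(node)
            pending_name = None
        elif line == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced '}'")
            stack.pop()
        elif "=" in line:
            key, _, value = line.partition("=")
            stack[-1]["values"].append([key.strip(), value.strip()])
        else:
            pending_name = line  # node name; its '{' is expected next
    if len(stack) != 1:
        raise ValueError("unclosed node")
    return root

def cfg_to_json(text):
    """Translate config text to a JSON string."""
    return json.dumps(parse_cfg(text))
```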


2 minutes ago, DStaal said:

I'll take your word for the metrics - as I said, I don't have a Windows machine to play with.  ;)  Interesting facet to consider however.

I got back into the bad habit while working on a port, and found with vscode, wsl, docker, etc, it makes a really nice window manager for stuff. "PowerShell Core" - the open source/cross-platform version - had just come out, and I jokingly used that to solve one of the CI issues we were having, and was like "wait, what, this is posix shell with objects? so like a repl for the OS?". s/35 years of ed, sed, awk, grep, ksh, csh, bash, zsh/pwsh/ftw. All my home vms, mac mini, bsd boxes have it as either a second or default shell.

$s = new-pssession vm1, vm2, vm3, win1, win2, mac1
invoke-command $s { upgrade-os-packages ; upgrade-pip-packages ; update-go-packages }
 

Mmmm :)


17 minutes ago, linuxgurugamer said:

Please not sqlite.  In fact, no SQL.  Way too easy to screw up queries and totally bog down a system

That's part of why I'm only suggesting it for craft/save files - mods should rarely if ever touch those directly.  (They should add structure info to the data structures they're built off of.)  It should only be the main developers that are working in them.


5 minutes ago, linuxgurugamer said:

Please not sqlite.  In fact, no SQL.  Way too easy to screw up queries and totally bog down a system

I could live with JSON.  But what I'd like to see would be something like what KSP currently uses, and a JSON translator to take the source file and output JSON.  Source file for ease of use, JSON for speed of implementation

Facebook used ZooKeeper to distribute configs fleet-wide(*) and the configs were pretty massive. They used python to allow variable-complexity config generation, which they then serialized as json and fed to zk to distribute. It proved to be hella-fast and relatively easy to maintain, but it meant the overhead of two human-readable language formats to contend with. I think you'd be better off with something like thrift or protobufs for ksp: there are great, highly optimized parsers; the human format is relatively readable; versioning included; designed for the wire and in-memory use, which is pretty important because that's the other half of loading: allocating for and populating the structures the data represents. Several of the teams I've interacted with have hard-coded pool allocators with speculative initial sizes that are basically always undersized because they're based on the last art/asset build the programmers have pulled and not the current art/asset dev branch :) "Yeah, but we don't know how many total meshes or bones there are going to be". Right, because you've got 429k .mesh files to read and you're doing up to 50 allocations per line of text you read. The technical CS term for this is "derp" :)


8 minutes ago, DStaal said:

That's part of why I'm only suggesting it for craft/save files - mods should rarely if ever touch those directly.  (They should add structure info to the data structures they're built off of.)  It should only be the main developers that are working in them.

But then you are suggesting having two different "languages", one for craft/save files and a different one for part files.  That seems like extra work and more chances for bugs.

And, given that we occasionally have to edit either craft or save files, it would be better IMHO if the syntax was identical, and readable by humans


2 minutes ago, linuxgurugamer said:

But then you are suggesting having two different "languages", one for craft/save files and a different one for part files.  That seems like extra work and more chances for bugs.

And, given that we occasionally have to edit either craft or save files, it would be better IMHO if the syntax was identical, and readable by humans

Agreed, sqlite is heavy-handed unless you're actively going to use it as a datastore. If all you want to do is load the data, it's a sub-optimal approach because it's not in a memory-ready format. Better, where possible, to try and compile the data into a memory-ready binary representation, even if that winds up wasting some storage/etc. If you can just mmap the files into memory and know they have a usable binary layout, the size stops mattering so much (https://github.com/google/flatbuffers, https://github.com/capnproto/capnproto, https://grpc.io/docs/guides/)
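
To make "mmap a file with a usable binary layout" concrete, here is a toy sketch using Python's `struct` and `mmap`. The record layout (16-byte name, float mass, unsigned flags) and the `PRT0` magic are invented for the example; flatbuffers or capnproto would define and version the layout for you.

```python
import mmap
import struct

HEADER = struct.Struct("<4sI")     # hypothetical: magic, record count
RECORD = struct.Struct("<16sfI")   # hypothetical: name, mass, flags (24 bytes)

def write_parts(path, parts):
    """Compile (name, mass, flags) tuples into fixed-size binary records."""
    with open(path, "wb") as fh:
        fh.write(HEADER.pack(b"PRT0", len(parts)))
        for name, mass, flags in parts:
            fh.write(RECORD.pack(name.encode("ascii"), mass, flags))

def open_parts(path):
    """Map the whole file read-only; nothing is parsed or copied up front."""
    fh = open(path, "rb")
    view = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    magic, count = HEADER.unpack_from(view, 0)
    if magic != b"PRT0":
        raise ValueError("unexpected file magic")
    return fh, view, count

def part_at(view, i):
    """Decode record i straight out of the mapping (random access by offset)."""
    name, mass, flags = RECORD.unpack_from(view, HEADER.size + i * RECORD.size)
    return name.rstrip(b"\0").decode("ascii"), mass, flags
```

Because every record is the same size, finding part `i` is a multiplication rather than a scan, and the OS only pages in what actually gets touched.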

 


For moddability, portability, and ease of sharing save files, flat text continues to be a good choice.

Something that would add a bit of complexity would be to essentially bake MM into KSP 2, such that you can have "source" flat-text files in GameData with a cache that can be rapidly loaded if it hasn't changed. Hopefully, modders this time around remember to store settings and other mutable data outside GameData.

For save files, they may consider bundling a save file converter which would let the player extract binary save files into flat text and back again

This would have the effect of making life a bit more difficult for people who frequently manually edit or view save games, but is at least a compromise between speed and viewability/moddability.

Edited by Starman4308

24 minutes ago, Starman4308 said:

Hopefully, modders this time around remember to store settings and other mutable data outside GameData.

I thought they already did? All the mods I can think of place settings (that are likely to change) either in the settings.cfg or your persistent.sfs file. I can't think of one that did so in GameData, though I'm sure some exist.


2 minutes ago, 5thHorseman said:

I thought they already did? All the mods I can think of place settings (that are likely to change) either in the settings.cfg or your persistent.sfs file. I can't think of one that did so in Gamedata, though I'm sure some exist.

I think there was at least one point where KAC stored settings in GameData, and there's some mod in my current RP-1 install that occasionally triggers an MM cache rebuild.


5 minutes ago, Starman4308 said:

I think there was at least one point where KAC stored settings in GameData

Amusingly that was the example I was going to use, it's the only one I can think of and it stopped way way long ago. Like possibly before KSP 1.0. I remember it so well because it used to be such a pain to upgrade, mostly because I kept forgetting to save the settings file and lost it.


If it doesn't have a different file format entirely, then it could be backward compatible, or it would be easy enough to write a converter.

The devs have ruled out file transfer between versions 1 and 2, so it's safe to assume they think the new system is better for the game overall.


8 hours ago, Starman4308 said:

For moddability, portability, and ease of sharing save files, flat text continues to be a good choice.

Something that would add a bit of complexity would be to essentially bake MM into KSP 2, such that you can have "source" flat-text files in GameData with a cache that can be rapidly loaded if it hasn't changed. Hopefully, modders this time around remember to store settings and other mutable data outside GameData.

For save files, they may consider bundling a save file converter which would let the player extract binary save files into flat text and back again

This would have the effect of making life a bit more difficult for people who frequently manually edit or view save games, but is at least a compromise between speed and viewability/moddability.

If you're going to go with a textual layout, at the very least use an existing one or go with a config/scripting language like lua or one of the .net equivalents. Alternatively, provide a robust interface to the data. Using something like thrift/protobufs lets you declare the structure so that you can access the representation through widely available tools to work with the data in human form, and you can use human-form during development. Modders/hackers can mess with the data from a variety of languages: node, py, c#, mono, powershell across platforms, javascript, webasm, rust, go, c++, ruby, ...

I love me some text files - I've written MUD languages, and parsers are a hobby of mine - but, as KSP 1 demonstrates handily, they can hamstring engines by fooling you into dealing with object instantiation in certain limited and non-performant ways.

That said, as others have mentioned - there's the middle ground of text ingestion and binary retention with a simple timestamp+size check on the source files.
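
A small sketch of that middle ground, assuming a per-file cache stored next to the source: parse the text once, keep a binary copy stamped with the source's mtime and size, and reuse it while the stamp still matches. `parse_text` is a trivial `key = value` stand-in for a real parser and the `.bin` suffix is arbitrary. `pickle` is used only because it ships with Python; it should only ever be pointed at cache files you wrote yourself.

```python
import os
import pickle

def parse_text(text):
    """Hypothetical stand-in parser: flat 'key = value' lines only."""
    out = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            out[key.strip()] = value.strip()
    return out

def load_cached(src, cache=None):
    """Return (data, cache_hit). Re-parses only if mtime or size changed."""
    cache = cache or src + ".bin"
    st = os.stat(src)  # cheap: the source text is not opened on a hit
    stamp = (st.st_mtime_ns, st.st_size)
    try:
        with open(cache, "rb") as fh:
            saved_stamp, data = pickle.load(fh)
        if saved_stamp == stamp:
            return data, True
    except (OSError, EOFError, pickle.UnpicklingError, TypeError, ValueError):
        pass  # missing or unreadable cache: fall through and rebuild
    with open(src, encoding="utf-8") as fh:
        data = parse_text(fh.read())
    with open(cache, "wb") as fh:
        pickle.dump((stamp, data), fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data, False
```

Modders keep editing plain text; the second and later launches only pay for a `stat` and a binary load per unchanged file.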

