Will 32-bit be the death of KSP

sarbian · February 19, 2015

The Squad code is written for 32bit addresses because the Unity editor is 32bit, thankfully x64 hardware and operating systems are 32bit backwards compatible, but there can be unintended effects when compiling the same code on 64bit.
Memory addresses are larger, so code that expects an integer to be 2 bytes (16 bits) now has to work with 4 byte integers, this can become a problem if you're doing bitwise operations for speed for example.
You can code to avoid depending on an integer being 2 bytes long, and the compiler will warn you of problems with your own code most of the time, but what do you do if you are calling engine functions that depend on it?

That's not the case with .NET/C#/KSP code. An int in C# is 32 bit whatever the platform is was compiled for or run on. There are type that change with 32 or 64 bits but they should only be used to interface with native code, and afaik it's not the case with KSP code ( minus from the 3D mouse stuff AFAIK )

Rokmonkey · February 19, 2015

Sure, I follow. Also, double checked and I'm a derp, they are releasing 64-bit builds for Linux I thought they weren't. I guess I'll have to put a Linux partition back on my machine, I like my mods.

sal_vager · February 19, 2015

That's not the case with .NET/C#/KSP code.(snip)

The Unity engine is written C and C++, C# is used to extend Unity with both developer code and addon plugins

Tuareg · February 19, 2015

And you, sir, are prime example of why 64bit windows are slow moving form "major pain in the ass" to "slight nuissance", never mind being actually useful. Even if your app has zero advantage in running 64bit (and actually slight penalty that comes with it), your users will have significant advantage of running native. Don't you find it strange that linux port of Unity, while getting much less attention then windows one, is just fine? Is linux Unity somehow magic? Let's seeÃ¢â‚¬Â¦ On my linux box, 100% of installed packages is 64bit. And all of them work just fine. On win7 box, about half of system is 32bit, and anything 64bit exhibits all colours of trouble (still great compared to 64bit XP though). Doesn't look to me like it's just "some more virtual memory space". 32bit emulation need some complicated and error prone machinery, and geting rid of it is advatage for your users even if app itself does not change one bit.

once this^^, on the other hand 64bit wouldn't just give 64bit memory addresses but native 64bit floating-point calculations and the entire spacegameprecision shakingfallingapart etc problem is because of precision. ksp already use a 3rd party 64bit lib but only for a couple of calculations and they are slow on 32bit. using all double precision calculations on native 64bit support would make the game having less precision issues running still smoother.

Renegrade · February 19, 2015

Motherboard limited, unfortunately - two slots for DDR2, so the practical maximum is 4 GB; 4 GB DDR2 sticks exist but are extortionately priced. I'm waiting for Skylake and DDR4 for my next build to avoid getting in the same situation again. /offtopic

Ouch, I feel for you. I'd say go for a SandyBridge-E on the cheap right now (or shortly) - you'll get 95% of the performance (or 140% of the performance if you roll a bit of overclock), 200% the memory, for a fraction of the cost. Future CPU generations aren't going to have any significant single-threaded gains in terms of IPC, and mhz was tapped out in the P4 era

They can pry my i7-3820/32GiB from my cold, dead fingers! LGA2011 forever! (mainboard supports EIGHT DDR3 slots, half of which are populated with 8GiB units. The memory cost less than the video card or the processor)

Most compilers are quite good at optimizing the code for a specific architecture, if you just tell them to do it. It's just rarely done in software you don't compile yourself. A program that's been optimized for one CPU could be slower on another kind of CPU than without the optimizations, or it might not even run at all. That's why most software is built for a generic x64 architecture.

Actually, by and large, I've found compiling for x86-64 (properly 'amd64'. Not a big fan of AMD these days, but they did invent it, and save us from IA64, which was idiotic) can easily take upwards of a 10% performance hit over the same CPU with the same compiler running x86.

DOn't forget that x86-64 CPUs can often employ the heavy decoder architecture necessary for the fat x86-64 instructions for faster/better/deeper x86 decoding when they're running 32-bit code - often with a significant speedup (especially on the Intel side of the street). PUSH AL ; whoops! I just pushed two UTF-32 unicode characters onto the stack!

The bigger pointers aren't such a big deal in performance-critical code. If you're working with a nontrivial amount of data, the reasonable assumption is that if you follow a pointer, you get a cache miss. Hence optimized code tends to avoid pointer-based structures.

I'd love to see that in a 'modern' language, which are basically constructed out of WEBS of pointers inside~ Anyhow, no, you cannot assume that following a pointer results in a cache miss, NOR that you're following them simply because they're present. If you end up with say, an array of pointers, it will be twice the size under x86-64 as x86 or 68k. You could very well miss the initial LOOKUP on x86-64. Also, if you're following a pointer to something like an array, then the cache has probably preloaded the next 64 bytes of it for you, which contains 8 64bit values, 16 32-bit values, or 32 16-bit values (or 64 characters, of course). Granted these newer processors tend to have larger caches - but that just pushes it back a bit.

Leaving that aside though, x86-64 also uses 64-bit strides on the stack, also pressuring the cache.

Even if you build 64-bit code, the default integer types are still 32-bit on most compilers. That's kind of silly, because 32-bit integers are too small for most purposes these days. They should never be used, unless you're certain that their size isn't going to be a problem. (And even then, your assumptions are often going to be wrong.)

Like the Funds variable for KSP (which is probably a single-precision float, knowing Squad)? Hard capped to 99,999,999? That'll totally be a problem for a 32-bit int down the road. Not.

While I agree that caution is necessary with datatype selection (painting oneself into a corner by having a 16-bit funds variable would be rather embarrassing, or say a 32-bit timestamp), there are many, many times when you know hard limits won't be broken (ex. good luck loading an image that has to have it's width and height specified in something longer than 32-bit. heck try loading a truecolor 32bitx32bit image at the max size. You'll quickly discover that your actual physical memory is usually limited to 40-48 bits total). And many of these times, the values fit within things smaller than 32-bits.

I seem to recall that IBM settled on the ILP64 model themselves .. but in any case, even when actual "int" ints are 32-bits in a given environment (and they ARE quite a few environments where this is not the case, at least in the C world. Obviously in Java or Java-like languages that's a different story), almost everybody has adopted the LP64 model, and thus the long ints are now almost universally 64-bit (outside of Windows).

This isn't like the 16-bit era, when x86 was running away in terror from crappy nonsense like segmented memory and very-limited-purpose registers. x86-64 is basically IA32 stretched out a bit (with some minor trimming of some legacy bits).

your users will have significant advantage of running native.

What advantage? Longer runtimes? Bigger executables? More heat buildup for the same code? Less TurboBoost?

On win7 box, about half of system is 32bit, and anything 64bit exhibits all colours of trouble (still great compared to 64bit XP though).

90% of my processes running right now are 64-bit (under Win7, or as I like to call it "NT 6.1"), and they aren't giving me any more problems than their 32-bit ancestors did...save for the overhead. The few that are 32-bit aren't giving me problems either, aside from the heavy-memory-usage ones running out (like KSP or Firefox). Oh well, at least Notepad can load an 8G file now! That was surely limiting me before.

but native 64bit floating-point calculations

The 8087, released something like thirty years ago, is full 64-bit floating point calculations. Actually, it does the calculations internally at 80 bits (externally too if you ask for 'long double'). The mighty 68881 from that same era uses 96-bit precision internally if I recall correctly.

People just spam 'float' (32-bit, by default), as the SIMD instructions (SSE) can process upto four floats at once, but only two doubles (they're 128 bits).

x86-64 was hailed as a godsend for that, as x86-64 CPUs automatically have SSE2 or better, but there's no guarantee of that 'or better'. SSE2 is pretty vicious and nasty when it comes to any accurate calculation, as it's only handling doubles internally as doubles (vs 80/96 bits for classical FPUs), and given that it prefers floats for performance (twice the throughput), you're really suffering unless you're developing some really limited-scope FPS (new Doom 5! Now with 40m wide rooms! That's twice as big as before!)

...

Of course this all being said, KSP could definitely use a 64-bit memory space, if nothing more than so support the large-scale mods (errr I mean like KW and Interstellar and 8k skyboxes and such, not just RSS).

cpast · February 19, 2015

once this^^, on the other hand 64bit wouldn't just give 64bit memory addresses but native 64bit floating-point calculations and the entire spacegameprecision shakingfallingapart etc problem is because of precision. ksp already use a 3rd party 64bit lib but only for a couple of calculations and they are slow on 32bit. using all double precision calculations on native 64bit support would make the game having less precision issues running still smoother.

Er, no. Double-precision floating point is exactly as fast in 32-bit and 64-bit, and any processor with an FPU (aka any processor in the last well over 10 years) will actually natively support doubles at *least* as well as floats. Floating-point arithmetic is not done in general-purpose registers, and Intel FPUs actually use 80-bit values internally (they support doubles *and* floats but execute at 80 bits of precision). GPUs generally don't support double-precision natively in hardware. At no point is 32-bit or 64-bit relevant at all; many GPUs just don't have the circuitry to natively handle doubles, while CPUs push floating-point calculations off to an FPU that operates past double-precision regardless of 32- or 64-bit, and at all levels the relevant factor is the actual circuitry in the hardware (a 32-bit program takes a tiny bit longer to load a double from memory, but that's not really a significant performance factor), and there's no difference in circuitry between 32- and 64-bit. Furthermore, many graphics and physics systems, including Unity, use floats, because GPUs happen to generally work better with floats than doubles, for reasons of hardware.

Renegrade · February 19, 2015

an FPU that operates past double-precision regardless of 32- or 64-bit, and at all levels the relevant factor is the actual circuitry in the hardware (a 32-bit program takes a tiny bit longer to load a double from memory, but that's not really a significant performance factor), and there's no difference in circuitry between 32- and 64-bit.

Actually, there's no performance difference in that respect. The Pentium.. Original, P5 Pentium, from 1995, no, sorry, 1993, has a 64-bit datapath, and can load a double in a cycle memory cycle. It's only 486s and 386DXes that would take two cycles, and 386SXes/286s/8086s that would take four, and uh.. not sure if you can put an FPU with the 8088..maybe? That would take eight cycles.

Aside from memory addressability, 64-bit has been largely irrelevant for 22 years.

Furthermore, many graphics and physics systems, including Unity, use floats, because GPUs happen to generally work better with floats than doubles, for reasons of hardware.

Well, plus SSE, which runs a lot faster on floats, as you can do a whole 3d vector operation on floats in one SSE instruction, whereas it takes two for doubles (and it doesn't have high internal precision like the FPU).

cpast · February 19, 2015

I didn't actually know that much about SIMD, actually. To clarify a bit, though: That applies exactly the same to 64-bit and 32-bit stuff, right? I know Unity has SSE2 enabled, so it seems like there's no difference then between what it could do in 64-bit and what it can do now.

It also seems like floats' smaller size means it's better for memory-restricted uses (e.g. GPUs, stuff that you want lots of to live in cache). Again, it works the same between 32- and 64-bit. The point's roughly the same, though: Processors can actually have all manner of more complex stuff than just "operate on WORDLENGTH registers addressing WORDLENGTH amount of RAM;" Intel actually made 32-bit processors that could handle 36-bit physical addresses (if the OS supported it), working with the OS to coordinate longer physical addresses (because there's no reason the physical address should be limited to the same size as logical address); they handle high-precision floating-point numbers natively on a separate unit; they can do a round of AES with a 256-bit key on a 128-bit block in hardware; word length matters for many things, but it's best not to assume that it controls everything, because it doesn't.

sarbian · February 19, 2015

The Unity engine is written C and C++
, C# is used to extend Unity with both developer code and addon plugins

I know that perfectly. But you were talking about "The Squad code" which is in C#.

Renegrade · February 19, 2015

I didn't actually know that much about SIMD, actually. To clarify a bit, though: That applies exactly the same to 64-bit and 32-bit stuff, right? I know Unity has SSE2 enabled, so it seems like there's no difference then between what it could do in 64-bit and what it can do now.

Yep. SIMD/SSE predates x86-64 (amd64), it's descended from MMX afterall. The only thing x86-64 does is ensure that your CPU is new enough to GUARANTEE it has SSE2. But you can do that with a label on the box too, "Pentium IV or higher required." No 64-bit CPU needed.

It also seems like floats' smaller size means it's better for memory-restricted uses (e.g. GPUs, stuff that you want lots of to live in cache).

Yep. Also rolling 64-bit numbers through a system (or 80-bit) as a unit requires more transistors, which means a bigger die, more heat, longer pathways, poorer timing, and less space for more units (modern video cards have thousands of execution units; doubling them up in width would halve the number you could put on the chip). Plus the SSE instructions pack them all into a 128-bit word, and do a single instruction on all four floats at once (or two doubles , depending)...

Also .. your typical FPS game doesn't need extended precision anyways. The character is 1.7m tall (give or take), and the environments are actually very, very small (the 'huge' Reach levels were probably no more than a dozen or so km across), so a float's limited mantissa (err 24 bits?) isn't too big of a deal. So for Unity to be float-happy, and SSE to be kinda limited, isn't much of a problem for your typical game. So as much as I'd like to roast those guys alive for limiting our precision, I have to remember that for the typical use, it's good enough and very fast.

Hardware people are starting to turn to double-precision but it's a very slow thing (partly because the FPU handles that 'well enough' for most uses)...

The point's roughly the same, though: Processors can actually have all manner of more complex stuff than just "operate on WORDLENGTH registers addressing WORDLENGTH amount of RAM;" Intel actually made 32-bit processors that could handle 36-bit physical addresses (if the OS supported it), working with the OS to coordinate longer physical addresses (because there's no reason the physical address should be limited to the same size as logical address); they handle high-precision floating-point numbers natively on a separate unit; they can do a round of AES with a 256-bit key on a 128-bit block in hardware; word length matters for many things, but it's best not to assume that it controls everything, because it doesn't.

Absolutely. (the 36-bit thing is called 'PAE' by the way, all my Linux servers that have less than 8G of ram use x86+PAE instead of amd64/x86-64. Most of 'em either have eight-or-more or two-or-less gigs of memory anyhow)

The bit-hype wears me down very thoroughly. Things were quite different in the 8-16 and 16-32 bit transition periods (heck the Atari 2600 had so little memory that this line would require a memory expansion), when we were shaking free of some very extreme limitations. For example, the segment:offset thing was such a shock to me that I never fully learned x86 assembly (I'm a 68k sort of person), and I only know about SSE as I was working on some code that used inline SSE/SSE2 instructions historically (thankfully segment:offset nonsense had mostly died by then). More isn't always better

sal_vager · February 19, 2015

Sorry, not seeing your point Sarbian, I'm pretty sure it's not impossible to have memory leaks with C# even with garbage collection, a programmer can't always rely on the CG

Jouni · February 19, 2015

I'd love to see that in a 'modern' language, which are basically constructed out of WEBS of pointers inside~

That's why you don't use such languages for performance-critical code.

Anyhow, no, you cannot assume that following a pointer results in a cache miss, NOR that you're following them simply because they're present. If you end up with say, an array of pointers, it will be twice the size under x86-64 as x86 or 68k.

When you're implementing low-level algorithms and data structures, counting the number of random memory accesses (and the equivalent measures for scanning large contiguous memory regions) is often the best way to predict performance. The rest of the code matters surprisingly little, because computation is cheap, while memory is slow.

An array of pointers sounds like inefficient code, btw.

Like the Funds variable for KSP (which is probably a single-precision float, knowing Squad)? Hard capped to 99,999,999? That'll totally be a problem for a 32-bit int down the road. Not.

Based on what we see in the game, it's probably double in some places and int in other places. float doesn't have enough precision for the sums we routinely handle in the game. int would be barely enough most of the time, but 21.5 million would already be a negative number in some places.

While I agree that caution is necessary with datatype selection (painting oneself into a corner by having a 16-bit funds variable would be rather embarrassing, or say a 32-bit timestamp), there are many, many times when you know hard limits won't be broken (ex. good luck loading an image that has to have it's width and height specified in something longer than 32-bit. heck try loading a truecolor 32bitx32bit image at the max size. You'll quickly discover that your actual physical memory is usually limited to 40-48 bits total). And many of these times, the values fit within things smaller than 32-bits.

My rule of thumb is that integers are always uint64_t, unless there are very good reasons for using smaller integers. If I choose to use smaller integers, I try to make them 2x larger than seems necessary, because my justifications and assumptions may turn to be wrong 5 years from now.

Of course, there's a difference between data types and the way the data is actually stored. I use things like 21-bit integers and variable-length codes every day, because I write code for computers with only 256 gigabytes of memory, so I have to avoid using large arrays of native integers.

radonek · February 19, 2015

Umm, I see here so many misconceptions about 64bit arch I don't even know where to begin. But let me tryÃ¢â‚¬Â¦

First, 64bit cpu can't run 32bit applications just like that. Seriously. It can chew 16bit/32bit instructions. Unless you run 8086 real mode code, running instructions is like 5% of "compatibility". Most crucial 5%, but still just a tiny bit of all the work needed. Lot of the other work is done by kernel, which have to contain compatibility layer with complete legacy syscall interface. This is complicated in itself, but fine if kernel is only thing your app comunicates with. But that only apply (and only to a limited extent) to statically linked binaries (read: DOS programs).

Whenever your program comunicates with other, things get wee more complicated. To simplify a bit Ã¢â‚¬â€œ legacy app can talk to other legacy app and native app is friends with other natives, but they don't mix together well. You could in theory have complete stacks for both, but in practice there are lots of thing you can't do twice. For instance, you can't have two GPU or keyboard drivers, so you end up with 64bit binary and wrapper that maps 32bit calls onto it. You can't really do it other way around - that could lose some of those extra bits. So here is another thing Ã¢â‚¬â€œ whenever you get native code in stack, everything under it must be native too. 64bit texture library need 64bit openGL library which need 64 bit GPU driver which need 64bit kernelÃ¢â‚¬Â¦ you get the idea. Note that I'm talking binary interface here Ã¢â‚¬â€œ how one .dll comunicates with other.

This is why running 32bit windows on 64bit cpu is easy Ã¢â‚¬â€œ BIOS sets up a few things and everything else is 32bit only. This is also why you can't run 64bit apps on 32bit windows, even if cpu would be happy to see them and kernel could in theory be patched to allow them.

Now, this is still easy peasy until you start thinking dynamic. Libraries. Dynamic linking is complicated stuff even at best of times, and windows is as far from best of times as "dll hell" can get you. Even a simple program nowadays links tens of libraries, and for big apps this can easily go to hundreds. These do not form simple up-down graph from app to kernel as above, but are interwined together to what can easily become unholy mess (see any office app). Now, if some of them start talking different its very easy to run into problems even with all kinds of wrappers and compatibility layers. This is why many big apps (like firefox or openoffice) took so long to get native. Not because firefox can't make use more memory, but because untangle all its libraries is not an easy task (further complicated by bad practice of bundling libraries which can wreak all kinds havoc when interfering with their system versions. But that's another story). In case of Firefox, it took several years even on linux, where transition is simple: all 32bit to all 64bit. On windows, many third parties can't be bothered to provide 64bit versions, and we are stuck with system that is not actauly able to work without some really deep and scary magic, involving keeping multiple copies of same library. And it ends with silly situation where we go 64bit to have more memory in which to hold all those conflicting libraries that should not be there in a first place. It's no wonder that \winsxs on my box is as big as complete winxp install Ã¢â‚¬â€œ it is one. With some redundancy on top.

To complicate things a bit more, amd64 arch is not merely about adding another 32bits to some registers. (See PAE for that, and note it can be made to work on 32bit system). It is way more complicated (linux kernel amd64 arch was for a long time completely separate from x86 arch, same way as arm or sparc or whatever) which, apart from kernel level headaches, also mean that 32-to-64 bit wrappers are not nearly as simple as adding 32 zero bits to registers.

Same problems can be exhibited by any other means of communication: COM, IPC, network protocolsÃ¢â‚¬Â¦ anything that want to make use of 64bit stuff will have to deal with changing APIs or even ABIs. Full transition may be tricky, half-assed approach just hides garbage under carpet. Just seeing how many system daemons under windows is still 32bit makes me shudder.

All of this is even more pronounced with KSP since its graphic intensive application (read: game) which means lot of low-level stuff. What if that improperly typed integer is pointer to buffer that gets passsed via GL to graphics driver that does DMA transfer off it? We are talking ring0 code here, and all kinds of bad ....Ã¢â‚¬Â¦ What would be at worst simple segfault for other app is like minefield for the likes of Unity.

If all of this sounds like hell, it is not. It works great on linux. Comes with with mentality I guess Ã¢â‚¬â€œ porting is common task in *nix world and any coder sticking with emulation without very good reason would be laughed off as either lazy or clueless (sorry Daid). Because whole system is native and compatibility layers are only used by very, very few apps, there is little room for trouble. Hell, I don't even have all emulation packages installed. And that is moral of the story: There are no "32bit" and "64bit" apps. There are native apps and apps sitting on top of huge hairball of emulated legacy cruft. Redmond guys did incredibly great job of hiding that cruft and even making it work most of the time. But its still cruft and to actually get rid of it, there is no other way then go full blown. You can have perfectly good 32bit system, or even better 64bit one, but there is nothing good in between.

cpast · February 19, 2015

That's...no.

First, calling 64-bit "native" and 32-bit "emulated" is somewhat inaccurate. There are things that the kernel does to provide the same system call interface on amd64 as on x86, but "emulation" implies it actually provides an x86 processor interface itself, which is not true. Likewise, the kernel also has to stand in between 64-bit programs and hardware; the only things that aren't managed by the kernel are kernel-mode things. IA32 on an amd64 system is fairly "native." For a 32-bit OS on a 64-bit processor, there's no emulation at all; the processor is actually running in protected mode, where it is literally running as a 32-bit processor. The CPU isn't more than happy to run a 64-bit program in protected mode. Binary interface has nothing to do with it; the CPU itself won't allow it.

Furthermore, there is a pretty good reason to have emulation: you don't want to break applications if you can conceivably avoid it. *nix does exactly that: it doesn't necessarily include it in all copies, but Microsoft made the decision to ship compatibility with all versions of Windows by default. They did the same thing for 16-bit on 32-bit; in cases where it's *not* relevant anymore (e.g. new versions of Windows Server), it's not included by default. The actual *kernel* aspects that manage 32-bit process are, AFAICT, unchanged - they're just part of the kernel. WOW64 is mostly about handling system calls, and has very low overhead; one of the main issues is just having the libraries, but that's not really all that big a problem (most of the libraries on WOW are also the same as on 32-bit versions of windows). From what I can tell, Linux does it exactly the same -- the kernel can handle most 32-bit stuff just fine, it's just libraries that are missing by default.

"Emulation is bad" is also utterly irrelevant to your observations, which are that 64-bit stuff can run badly. That's because you can't put 32-bit processes in a 64-bit address space, so things that have lots of proprietary addons won't be recompiled to 64-bit, and 64-bit on Windows requires knowing that an int can't store a pointer (unlike Linux, where it can). But WOW64 works extremely well, and there is no advantage other than saving disk space to not including it. Porting to 64-bit on Windows is somewhat harder on Linux (you can't put pointers in ints), and since WOW64 works so well there's really no reason to bother in most cases. People who think "64-bit is key! That's the only native thing, and the rest is bad 'emulation!'" are just wrong - 64-bit gives zero advantage in most applications, requires annoying conversion work, and 32-bit is perfectly suitable.

radonek · February 19, 2015

That's...no.
First, calling 64-bit "native" and 32-bit "emulated" is somewhat inaccurate. There are things that the kernel does to provide the same system call interface on amd64 as on x86, but "emulation" implies it actually provides an x86 processor interface itself, which is not true.

True. But that is why I stated that kernel itself is fine - if you run all static binaries. It's all those 32-to-64bit wrappers that are, in my book, emulation. Call it compatibility layer if you like... my point is its these layers that makes things complicated.

Furthermore, there is a pretty good reason to have emulation: you don't want to break applications if you can conceivably avoid it.

Reason is what I'm missing here. Again, on my linux box, everything is 64bit and works just fine. On windows with all those fancy compatibility/emulation stuff, many 64bits apps are broken. Some of them, like Unity, have same codebase.

From what I can tell, Linux does it exactly the same -- the kernel can handle most 32-bit stuff just fine, it's just libraries that are missing by default.

Which is exactly my point. When you get rid of these, 64bit windows will be as good as linux. And you get there by running as much code as possible without emulation.

(edited) As a sidenote, linux compatibility is nothing like WoW64. It's much simpler - dynamic linker can tell 32bit ELF from 64bit one (obviously :-) and is configured to use different LDPATH. There is set of packages (x86-emul-linux-*) installed there that provides wrapper shims of most common libraries. You can run multiple instances of same library via manipulating LD_* variables, but this is feature of the linker and is not used much. From what I understand, WoW64 works more like Wine, setting up separate enviroment for every binary (my .wine/ look a lot like \winsxs).

"Emulation is bad" is also utterly irrelevant to your observations, which are that 64-bit stuff can run badly. That's because you can't put 32-bit processes in a 64-bit address space, so things that have lots of proprietary addons won't be recompiled to 64-bit, and 64-bit on Windows requires knowing that an int can't store a pointer (unlike Linux, where it can). But WOW64 works extremely well, and there is no advantage other than saving disk space to not including it. Porting to 64-bit on Windows is somewhat harder on Linux (you can't put pointers in ints), and since WOW64 works so well there's really no reason to bother in most cases. People who think "64-bit is key! That's the only native thing, and the rest is bad 'emulation!'" are just wrong - 64-bit gives zero advantage in most applications, requires annoying conversion work, and 32-bit is perfectly suitable.

I'm not claiming that 64bit is any kind of magical key. Actually, I quite agree with most of what Renegade said about 32bit code being better off. I'm only saying that legacy emulation is a bad cruft, and you should run native - be it 32bit apps on 32bit system or the other way.

As for my "observations", my only observation is that my linux box can run anything 64bit including KSP just fine and windows can't. Something that people around here still fail to absorb.

Edited February 20, 2015 by radonek
more info on linux emul

Renegrade · February 20, 2015

That's why you don't use such languages for performance-critical code.

I wish. They shouldn't be, but that doesn't seem to stop people... :/

When you're implementing low-level algorithms and data structures, counting the number of random memory accesses (and the equivalent measures for scanning large contiguous memory regions) is often the best way to predict performance. The rest of the code matters surprisingly little, because computation is cheap, while memory is slow.

Yes-ish and no; trig functions are generally still slower than memory access (ever called sin lately? it's still slow like 40 years later) - and if you're scanning sequentially, that's orders of magnitude faster than random accesses. Modern memory's really bursty, although still only 15-20ns for a random access. It's definitely something to watch out for, but if you do that and ignore tight loops that make no external accesses and run for a long time, you're in for some trouble.

Or to put it shortly, "not wrong, but a bit too much of a generalization".

An array of pointers sounds like inefficient code, btw.

That's one of those "it depends". Keep in mind that navigating some tree that's fifteen levels deep will kill the cache, but if you can compute an offset instead, a single peek at said array is your old fashioned O(1) type dealy. And if you need to do some minor work on all things in that array (major work will obviously evict large swaths of cache and may include the current bit of the table), steppin' through 'em sequentially will be quite favorable.

Based on what we see in the game, it's probably double in some places and int in other places. float doesn't have enough precision for the sums we routinely handle in the game. int would be barely enough most of the time, but 21.5 million would already be a negative number in some places.

Bur? Are they treating it as scaled? If you're worried about funds bonuses and such, there's ways around overflow - instead of adding 5% by something naive like (x*100)/5, one can simply do x+(x/20), as a random example (or temporarily promote to a larger number/floating point if need be. Kinda clunky in most languages, but doable).

(Granted Squad probably doesn't know about 'em. Hmm, I should pay attention to the number when it gets over 8m - could very well BE a float~)

My rule of thumb is that integers are always uint64_t, unless there are very good reasons for using smaller integers. If I choose to use smaller integers, I try to make them 2x larger than seems necessary, because my justifications and assumptions may turn to be wrong 5 years from now.

C99 programmer? Surely not. They don't exist anymore. I'm the opposite, I start at uint8_t, and consider the use of the variable carefully, it's maximums, and the sort of operations it will take. If it's something that increases with time (like a clock), then well, is there a maximum runtime? If not, then the longest thing it can be, otherwise something suitable. If I make a design decision like the 99 million funds thing, that would end up being uint32_t (or int32_t if it can be negative obviously). If there was no top of the funds, than longer, but if it were something that can't exceed 10,000 no matter what, it's u/int16_t.

Of course, there's a difference between data types and the way the data is actually stored. I use things like 21-bit integers and variable-length codes every day, because I write code for computers with only 256 gigabytes of memory, so I have to avoid using large arrays of native integers.

Bur? That sounds like you have more than enough memory for that - even my desktop only has 32 GiB of memory. Still, I applaud wise conservation regardless of the circumstance

And it ends with silly situation where we go 64bit to have more memory in which to hold all those conflicting libraries that should not be there in a first place. It's no wonder that \winsxs on my box is as big as complete winxp install Ã¢â‚¬â€œ it is one. With some redundancy on top.

cpast handled most of this - but I'm a system administrator professionally these days (partly to escape the Javas and C#s and other BS. I'm supposed to be a network admin, but they have me doing both because paying for two people to do two jobs is apparently a waste of money? Maybe they could switch to F/OSS and use the license money to hire another guy to do one of the two jobs?) - anyhow - SxS assemblies get huge even on the old x86 Windows 2003 servers. We have a couple of internal units that run IIS and some hideous monstrosity content/scripting system -> their SxS directories are approaching 20 gigs in size. Granted the x86-64/amd64 versions are like 22-24 gigs, and the 2012 machines are complete disasters.. anyhow, SxS became a big poopy mess all on it's own

Disclaimer: I'm an MCSE so I may have an unreasonably pro-MS attitude~

(Anyhow, just to be clear, I'm not anti-64-bit KSP* or anti-64bit. Obviously my system would choke on some of the photo work I do in a 32-bit address space, but I'm very tired of seeing "OMG everything needs to be 64-bit" and having providers offer me 1GiB of 512MiB systems with 64-bit OSes. There's a time and a place, people! Firefox needs to get their thumb out of their uh.. mohole, and release 64-bit, but there's no need for EditPlus to do the same)

* - assuming they can get it to work. I'm anti-buggy-messes, which is what 64-bit KSP is (heck even 32-bit KSP is to an extent) right now.

- - - Updated - - -

Sorry, not seeing your point Sarbian, I'm pretty sure it's not impossible to have memory leaks with C# even with garbage collection, a programmer can't always rely on the CG

Those new-fangled languages like C# generally use some sort of reference system to know if they can free stuff.. so if you leave a pointer or reference or whatever-they're-calling-it-this-week pointed at some object (or whatever), it generally can't be freed until the reference is cleared... so it totally can leak memory.

I try to stay away from these sorts of thing (it's generally not my style) but it shows up even in some older languages - Perl does that too, for example. (Perl is a rude, messy hack, but it's built FOR making rude, messy, quick hacks, and does so quite well)

Jouni · February 20, 2015

Bur? Are they treating it as scaled? If you're worried about funds bonuses and such, there's ways around overflow - instead of adding 5% by something naive like (x*100)/5, one can simply do x+(x/20), as a random example (or temporarily promote to a larger number/floating point if need be. Kinda clunky in most languages, but doable).

Recovery values are shown with two decimal places, while resource values are defined with one decimal place.

Bur? That sounds like you have more than enough memory for that - even my desktop only has 32 GiB of memory. Still, I applaud wise conservation regardless of the circumstance

People are generating new data at least as quickly as computers are getting faster. Because disk speeds and network bandwidths aren't growing at the same pace, more and more processing needs to be done in memory.

The only time I've had a computer with more memory than I needed was three years ago, when the university I was working at bought a server with 1 TB of memory. The work I was doing back then didn't need more than a few hundred gigabytes, so things were good for a while. Still, the occasional combinatorial explosions resulted in failed 15 PB memory allocations, which just proved the point that no computer has ever had and will ever have enough memory.

sarbian · February 20, 2015

Reason is what I'm missing here. Again, on my linux box, everything is 64bit and works just fine. On windows with all those fancy compatibility/emulation stuff, many 64bits apps are broken. Some of them, like Unity, have same codebase. [...]

You know that you sound like a linux fanatic when saying stuff like that ? When was the last time you launched windows ? Win Me ?

Windows do it exactly like Linux does. You have 2 set of libs, the 32 bit and the 64 bits.

Renegrade : it's double everywhere afaik.

Renegrade · February 20, 2015

Recovery values are shown with two decimal places, while resource values are defined with one decimal place.

Right, good point. Although that could have been implemented as a temporary promotion for that one section.

It probably is a double as sarbian says below - er above this post - although that itself will have issues occasionally (like the mass indicator in the VAB, which will tell you that 18.0 > 18.0, as the first number is actually anywhere upto 18.04999..)

People are generating new data at least as quickly as computers are getting faster. Because disk speeds and network bandwidths aren't growing at the same pace, more and more processing needs to be done in memory.

Amen. Although..disk bandwidth had a nice upwards bump when SSDs were introduced (with an equivalent or bigger jump downward in space).

The only time I've had a computer with more memory than I needed was three years ago, when the university I was working at bought a server with 1 TB of memory. The work I was doing back then didn't need more than a few hundred gigabytes, so things were good for a while. Still, the occasional combinatorial explosions resulted in failed 15 PB memory allocations, which just proved the point that no computer has ever had and will ever have enough memory.

Yeah, you can never have enough. I always buy as much memory as I have of either CPU or graphics for a desktop (and my personal servers are simply retired desktops, they cascade down until they end up in storage..and cascade back up should I ever encounter a complete system failure), and when speccing up a new server for work, I always push the limits on the memory front.

But that just underlines my point about machines with smaller physical memory size (older units, things in the cloud) - every byte counts there. 64-bit is great for my 32 gig monster, but if I had one of those half-gig cloud servers ... :/

Anyhow we're getting a bit far afield from the OP's topic.

On that, topic, I'd say if Squad could fix/implement the following, the pressure on memory would fall dramatically:

- The DX9 memory bloat (OpenGL/DX11 modes can save over a gig of memory so obviously something is wrong with DX9 mode)

- Dynamic loading of planetary surfaces and KSC assets (I don't believe in part dynamic loading, that's just A) a story to scare small children and another potential source for memory leaks)

- Killing the biggest of the memory leaks (some shenanigans ARE going on with 0.90)

- Some light optimization here and there

(again, not opposed to the 64-bit client, if they can make it so it doesn't fall over at the slightest hint of a problem, but that's sounding like a long shot if Unity 5 isn't carrying any sort of guarantee of A) existing and having fully operable Win64 client)

cantab · February 20, 2015

C99 programmer? Surely not. They don't exist anymore. I'm the opposite, I start at uint8_t, and consider the use of the variable carefully, it's maximums, and the sort of operations it will take. If it's something that increases with time (like a clock), then well, is there a maximum runtime? If not, then the longest thing it can be, otherwise something suitable. If I make a design decision like the 99 million funds thing, that would end up being uint32_t (or int32_t if it can be negative obviously). If there was no top of the funds, than longer, but if it were something that can't exceed 10,000 no matter what, it's u/int16_t.

And then you get burnt when the higher-ups decide that actually they're doing a rebalance to make 1 Fund roughly requivalent to 1 US dollar and the maximum is 1 trillion now. (Well, depending on how much work changing the data type is).

sarbian · February 20, 2015

- The DX9 memory bloat (OpenGL/DX11 modes can save over a gig of memory so obviously something is wrong with DX9 mode)

AFAIK those are inherent of how Unity use DX9 (or limits of DX9). It seems that DX11 push read only texture fully in the GPU memory, while DX9 may keep a copy in RAM. I doubt Squad can do much about that beside making DX11 the default build for Win which may require fixing some shader and would block some player without DX11 hardware.

- Dynamic loading of planetary surfaces and KSC assets (I don't believe in part dynamic loading, that's just A) a story to scare small children and another potential source for memory leaks)

The surface texture stay loaded afaik. The KSC I m less sure. I have a tool to check that.

I don't see how dynamic part loading would create more leak. Dynamic loading is common in most engine and keeping a list of the currently needed part texture is not that complex (but not trivial either)

- Killing the biggest of the memory leaks (some shenanigans ARE going on with 0.90)

That may require some a new thread but if you have know reproducible leak then list them. I made a few tool to check find the source of some leak and the more info I get the more detail I can report to Squad.

(again, not opposed to the 64-bit client, if they can make it so it doesn't fall over at the slightest hint of a problem, but that's sounding like a long shot if Unity 5 isn't carrying any sort of guarantee of A) existing and having fully operable Win64 client)

What do you mean by "existing" ? Unity 5 beta are available for month and they released an RC2 this (last?) week.

Tuareg · February 24, 2015

oh, the half informations again...

lets start. FPUs had 80 bit precision to use it for INTERNAL calculations only, there was no processor or motherboard level support to FEED it.

SSE had 128bit register which was able to work with 4 piece of SINGLE PRECISION NUMBERS IN PARALLEL doing the SAME operation on the 4 singles. NO HARDWARE SUPPORT for double precision.

with SSE2 aka Athlon 64 came the hardware level support for double precision calculations. what it means:

a single float in the memory looks like "1 bit minus sign""6 bit dot position" "the number"... with a 32bit register pushing it to the FPU means a simple step and the FPU can use it for calculations straight away

a double float (or just double) looks like the same but the "dot position" is (i think) 8 bit and "the number" bit is far bigger. when you read it from the memory it already takes double the time to read it into the 32bit general registers but the problem comes when you push out the 2x 32bit data to the 32 bit arithmetic pipeline to build a freakin double from it... inside the FPU it takes actually a 32bit bitwise left (<<) and an other 32bit addition BEFORE the FPU can even start working with it. Yes, it supported double but it COULDNT RECEIVE the double. and when it has finished the calculations it had to cut the output into 2x 32 bit data and send it back and then took twice the steps (or even more) to write it back to the memory from the cpu registers.

while with the appearance of the 64bit processors came the 64 bit arithmetic pipeline and the native 64bit registers so you can move and work with doubles everywhere with only 1 steps. its not twice faster just because it became 64bit, it became far far far quicker to work with 64bit floating point numbers

Graphics cards can use single precision (well, actually nvidia cg has native support for doubles but just because it uses worldspace coordinates) because they are calculating only within a (-1) - (1) (called eye or clip coordinates /generally representing the space between the corners of the screen with a depth added/) sized standardized coordinate system which receives the coordinates from the DX or GL (so the single/double precision worldspace calcualtions are done on cpu level, thats why SSE2 was that big step in 3d graphics because it became able to actually use doubles on hardware level)... sad that most (yes, sadly unity developers too) dont get that they would have to separate the physics (in most cases its more than enough to have there singles, though bullet physics uses doubles) the hardware level graphics (like cg, hlsl) and their own internal coordinate system which actually could easily support doubles (cry engine actually works on it with RSI and i think i read somewhere that the new UR engine has an intelligent switch between the required precision) and they could actually use those doubles on DX and GL level too, and even (i dont know about amd graphics but im absolutely sure about nvidia) on low level cg programming. its just their laziness that stops them to do it

DuoDex · February 24, 2015

I beg pardon, tuareg?

I don't see why anything in computers should need feeding.

Good day.

Renegrade · February 26, 2015

And then you get burnt when the higher-ups decide that actually they're doing a rebalance to make 1 Fund roughly requivalent to 1 US dollar and the maximum is 1 trillion now. (Well, depending on how much work changing the data type is).

I just add a "000,000" to the end of every display. Or simply an 'M'. And say that the maximum is now 99 trillion as an added bonus.

With my own workflow, changing a variable referenced a thousand times with FULL security audit would be about an hour. Maybe less. I'd tell the pointy haired idiots that it would take a a week or two though, and spend the bulk of the time playing KSP instead though as revenge, however. You'd be talking about a lot of work for the part designing people too btw, unless we're simply multiplying all costs by some constant, which makes my initial line all that more appropriate. In fact, do we even know if "1" on the funds-meter is actually 1 fund, and not a thousand or million? Most budget things for entities of that size tend to drop some of the trailing zeroes for clarity anyways...

(I'd love to see a space program do it's first launch for $25,000 USD)

AFAIK those are inherent of how Unity use DX9 (or limits of DX9). It seems that DX11 push read only texture fully in the GPU memory, while DX9 may keep a copy in RAM. I doubt Squad can do much about that beside making DX11 the default build for Win which may require fixing some shader and would block some player without DX11 hardware.

I thought DX9 mostly operated in kernel space (once of it's massive limitations over OpenGL) - even if it does keep a texture in main ram, shouldn't it be in someone else's memory space?

If it's a Unity issue, yeah, Squad probably can't fix it :/

The surface texture stay loaded afaik. The KSC I m less sure. I have a tool to check that.

EW.. Well, that's a low-hanging fruit that they could hit right there...

I don't see how dynamic part loading would create more leak. Dynamic loading is common in most engine and keeping a list of the currently needed part texture is not that complex (but not trivial either)

Normally I'd say it wouldn't (although you can get into an emergency situation where you've allocated all your memory without leaking any if someone makes a monster ship with all parts etc), but remember who would be implementing this dynamic loading.

That may require some a new thread but if you have know reproducible leak then list them. I made a few tool to check find the source of some leak and the more info I get the more detail I can report to Squad.

I left KSP running the other day after making a screenshot of FAR's tweakables, with the tweakable open, and it crashed from OOM after like 15-30 mins or somesuch. It was literally an empty sandbox save, I just loaded, clicked together a half-dozen part example rocket and right clicked a tailfin. I saw a thread about that the other day, but I was looking for it just now and couldn't find it.

What do you mean by "existing" ? Unity 5 beta are available for month and they released an RC2 this (last?) week.

It's still not released though. "Don't count your chickens before they hatch" sort of dealie. Also we don't know if it will be something that Squad can use and have definite Win64 support until we get our grubby little hands on it. I've never been able to get a straight answer on whether or not the client parts will be fully working 64-bit components.

with a 32bit register pushing it to the FPU means a simple step and the FPU can use it for calculations straight away
a double float (or just double) looks like the same but the "dot position" is (i think) 8 bit and "the number" bit is far bigger. when you read it from the memory it already takes double the time to read it into the 32bit general registers but the problem comes when you push out the 2x 32bit data to the 32 bit arithmetic pipeline to build a freakin double from it... inside the FPU it takes actually a 32bit bitwise left (<<) and an other 32bit addition BEFORE the FPU can even start working with it. Yes, it supported double but it COULDNT RECEIVE the double. and when it has finished the calculations it had to cut the output into 2x 32 bit data and send it back and then took twice the steps (or even more) to write it back to the memory from the cpu registers.

Even the original 8087 could load a full-length double from memory. Yes, it was dog-slow and took something like 8 bus cycles to load all 8 bytes (the 8086 had a 16-bit data bus, but it multiplexed the data and address lines so it would take two cycles for it to load something), but it could be done. Every CPU since the Pentium has had an integrated FPU (at least, anything you'd find in a desktop), and almost all of those (at least on Intel's side of the street) have had 64-bit data paths since the Pentium. The 486 also featured an integrated FPU, but there were some models without (SX), and it had a 32-bit datapath.

And the 8087-series CAN externally load an 80-bit number. A lot of high level languages don't support that or need switches (see http://en.wikipedia.org/wiki/Long_double), and it can impose some nasty alignment requirements, but it does exist and is a REAL hardware thing. Even without external representation, it still means that FPU operations can and are carried out in extended-precision, whereas SSE2 and graphics card operations are NOT.

I'm pretty sure the FPU has dedicated circuitry to handle any necessary internal data alignment management and it would be a relatively simple matter for it to simply load the upper half of the register from the data on the bus, no need to actually to a bitwise shift. (for 486s. For a Pentium or later, it would simply load the full 64-bit value in one bus cycle assuming it was aligned)

An x86-64 processor has to manipulate data to handle it's larger datatypes too by the way: addresses have to be sign-extended to be made into 'canonical' addresses, and their page tables are FOUR deep, rather than THREE for PAE and TWO for a non-PAE 32-bit processor. I'm sure the address extension is handled in sub-cycle hardware, but the page tables are not.

while with the appearance of the 64bit processors came the 64 bit arithmetic pipeline and the native 64bit registers so you can move and work with doubles everywhere with only 1 steps. its not twice faster just because it became 64bit, it became far far far quicker to work with 64bit floating point numbers

No. The 32-bit chips, especially later models like the Pentium III, can handle data every bit as fast as x86-64 chips, in terms of clock cycles. Like I said, the original "P5" series Pentiums had 64-bit data paths (not counting Overdrive units and other such one-offs). The P5 core was rather slow and pokey, but that was resolved with the P6/Pentium Pro (and modern chips are just an evolved form of the P6, except the Netburst or "NetBust" architecture).

The only advantage that x86-64 is giving you is 64-bit addressing, and more/fatter registers. If having more registers was a critical advantage in any significant way though, we'd all be using 680x0/PPC/MIPS/etc CPUs instead of x86.

Graphics cards can use single precision (well, actually nvidia cg has native support for doubles but just because it uses worldspace coordinates) because they are calculating only within a (-1) - (1)

You mean the view frustum? Yes, I know. but being between -1 and 1 doesn't actually help floating point numbers any. 0.0000001 of 1.0 is just as hard for single-precision as 0.1 of 1000000.0. It's the relative difference between the largest and smallest value that matters, NOT the magnitude (well, so long as you stay < 10^38).

My own FarStar render engine actually sets the far viewing plane to a very large number and disables depth testing entirely during the far-render stage; painter's algorithm is actually often faster and more accurate when drawing distant spheres, it turns out. No z-fighting for me.

Gkirmathal · February 28, 2015

Some very interesting discussions here. I can't add anything else to it, as all important stuff has been mentioned by Renegade and Sarbian.

Only that, on 32bit a lot of enhancements can still be gained by what has been mentioned above.

On and offtopic.

However I have build up an allergy, from an other game I play (World of Tank), for developers who only talk about adding in and working on new features instead of focusing on bug fixes and performance enhancements.

If I may take WoT as the example here. Over the last 2 years those developers only added more and more HD reworked content and added more new graphics features like modernized shaders.

All of those things looked nice, but they did (and still havent done enough) not modernize their BigWorld 1.0 engine. Thus as patches were released performance went down each patch. Due to the dismay of the player base.

Reason was they wanted to keep up with the rest of the world. Only you cannot keep up in a Lada, only by adding racing wheels, spoilers, turbo etc. It's 1.2l petrol engine is still going to hinder development. At one point they need to update the engine.

In my opinion same hold true for KSP and Squad, they want to add more and more new features. While the backbone, Unity4, in the way it currently handles stuff is not really up for it and needs adjusting.

When I read the DevNotes, I get the feeling I also get from reading WoT dev news, that is: they seem to ignore the issues and that is enhanced by not mentioning anything that is currently being done by Squad to tackle these issues.

Foremost, besides the DevNotes, I like to see a Tech DevNote in which we can read on the technical development side of KSP.

Will 32-bit be the death of KSP

Recommended Posts

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation