
C, C++, C# Programming - what is the sense in this


PB666


This may seem like a completely random injection into this thread, but given that the thread is about the sense of the C programming language, it should be completely appropriate.

I was trying to write a file grabbing routine and was cloning code from the net and testing it when I ran across this:

int width, height;
this.GetDefaultSize( out width, out height );
this.Resize( width, height );
FileChooserDialog chooser = new FileChooserDialog( "Please select a File to Open", this, FileChooserAction.Open,
    "Cancel", ResponseType.Cancel, "Open", ResponseType.Accept );

I went looking through my C/C++ books (five of them); there is nothing about this anywhere, the object is never defined or mentioned, and surprisingly the code compiles. There was something briefly mentioned about this:

this->month = mn;

What da hail is this? Is there some super-secret C club in which these out-of-the-blue object handlers are passed? :^)

The code sample there is C#, which is related to the C family of languages in name only. C++ has `this` as a keyword, but C does not. Within a class method, `this` is a pointer to the class instance that the method is currently executing on. Most of the time you don't need to use it since it's implicitly added for you. Some exceptions to this are situations where you've named a method parameter the same as a member variable and you need to disambiguate them. Generally I prefer to name member variables with a prefix indicating that they are, in fact, member variables so that this isn't a problem.
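For instance, here is a minimal C# sketch of that disambiguation case (the `month`/`mn` names are just illustrative, echoing the `this->month = mn;` line above):

class Date
{
    private int month;

    public void SetMonth( int month )
    {
        // The parameter shadows the field, so a bare 'month = month' would
        // just assign the parameter to itself. 'this.' disambiguates:
        this.month = month;
    }
}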

There are other situations where the `this` pointer actually does come up that are more useful. For example, methods that return a reference to the current object:

struct Foo
{
    Foo& SomeMethod( int x )
    {
        // do something with 'x'
        return *this;
    }
};

In a sense, all member methods really have an extra implicit parameter and that extra implicit parameter is called `this` and is a pointer to the instance to operate on. So really you could imagine the above struct could also be implemented like this:

// (Illustrative only: 'this' is a reserved word in C++, so this exact code
// won't compile; it just shows the idea of the hidden parameter.)
struct Foo
{
    static Foo& SomeMethod( Foo* this, int x )
    {
        // do something with 'x'
        return *this;
    }
};

Of course this might prevent the compiler from being a little more clever about optimizing method calls (who knows, C++ compilers are absurdly complicated), but semantically it's similar to the previous situation.

In C# the semantics are a bit different since C# doesn't expose pointers (unless you're working with unsafe code, which you should rarely be doing). In C# the 'this' keyword is a variable containing the value of the instance that the called method is currently operating on. Like in the C++ case it can also be thought of as an implicit method parameter (and, in fact, the IL code emitted by the compiler treats it as such; in non-static methods the `ldarg.0` instruction is always `this`).
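You can even see that parameter made explicit in C#'s extension methods, where the `this` modifier on the first parameter turns a static method into something callable with instance syntax (a minimal sketch; `Shouted` is a made-up name):

static class StringExtensions
{
    // "abc".Shouted() and StringExtensions.Shouted( "abc" ) are the same call;
    // the receiver is simply passed as the first, 'this'-marked argument.
    public static string Shouted( this string s )
    {
        return s.ToUpper() + "!";
    }
}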


Sandy Bridge and Ivy Bridge chipsets only allowed a maximum of 32 GB of memory when they first came out, despite the architecture supporting 64 GB. Now there are some 64 GB and even some 128 GB boards out there. Most everything else is in the Xeon range.

Xeons are x86-64 processors, just like the i3/i5/i7 family. They're just intended for the server/workstation market instead of the desktop/laptop market. In general, computers using Xeon processors have more memory and more CPU cores, while computers using a desktop-grade i5/i7 processor have better single-threaded performance.


(...) the answer is that I want to create an array based on values of 2^48 bits, so it would be highly advantageous to have an on-the-fly method that reversed the order and NOT'ed the bits; the alternative is to carve it into 2 @ 24 bits or 3 @ 16 bits (for which I could use a table).

In order to divide it into three 16-bit parts, the QW first has to be copied, right-shifted 16, copied, and right-shifted again and copied. All bits higher than 16 need to be removed; to do this, take the register, create a dummy register, right-shift 16, left-shift 16, XOR with the source register, then do the lookup. Once the lookup is done, IMUL register 1 by 4,294,967,296 (2^32) and add it to register 3, and register 2 by 65,536 (2^16) and add it to register 3. Done.

The next issue is that each needs a lookup value. Since 2^48 is too large for memory it will have to go to disk; by doing this I can greatly compact the array, and in doing that I don't need to process trillions of values at once.

Each of them needs a lookup value? You're working with an array that actually holds 281 trillion different values? Because if you're not, wouldn't using a sparse array be a much saner solution?
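For what it's worth, a sparse array is nearly a one-liner in C#; a minimal sketch, with an arbitrary int payload:

using System;
using System.Collections.Generic;

class SparseDemo
{
    static void Main()
    {
        // Only the 48-bit keys actually touched consume memory;
        // the other ~2^48 possible indices cost nothing.
        var sparse = new Dictionary<ulong, int>();
        sparse[0xBA9876543210UL] = 42;

        int value;
        if( sparse.TryGetValue( 0xBA9876543210UL, out value ) )
            Console.WriteLine( value );
    }
}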


Sandy Bridge and Ivy Bridge chipsets only allowed a maximum of 32 GB of memory when they first came out, despite the architecture supporting 64 GB. Now there are some 64 GB and even some 128 GB boards out there; most everything else is in the Xeon range. In most cases this is dependent on module size: each module is typically no more than 8 GB and DIMMs are 16 GB, and the manufacturers limit this to two sets; as memory size increases or slot counts increase, this goes to eight. Increasing the number of DIMM slots apparently lowers memory transfer rates.

This is mostly a motherboard limit; they did not bother adding more memory support than is practical with four of the largest DIMM modules.

Back when I bought my socket 2011 system, I did so in part because of the 32 GB limit.

As you say, currently it's 64 GB and moving over to 128 GB.

I also have a strong feeling that they could easily increase the address span from 40 bits if needed.

As Jouni says, this would require 40 times more memory than on the largest servers before becoming a serious issue.


I think there are limits on addressing in 64-bit mode; I remember reading about these modes in the 3603-page manual. Once a 64-bit OS is installed, the OS accesses memory using only the base pointer.

- - - Updated - - -

Each of them needs a lookup value? You're working with an array that actually holds 281 trillion different values? Because if you're not, wouldn't using a sparse array be a much saner solution?

You didn't read what I wrote. What I said was that at the assembler level, or with a very good C compiler, you can right-shift a copied 48-bitter by 0, 16, and 32, cross-reference the word in each of the three registers, then sum (shifting them left by 32, 16, and 0, respectively). The problem is effectively building a table for each and justifying an assembler call.
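In C-ish terms that word-slicing is just three table lookups and a recombine; a hedged C# sketch, assuming a prebuilt 2^16-entry reversal table along the lines of the smReverse16 built later in this thread:

// Reverse the low 48 bits of n: split into three 16-bit words, reverse
// each via the table, then reassemble in the opposite order.
static ulong Reverse48( ulong n, ushort[] table16 )
{
    ulong lo  = table16[n & 0xFFFF];            // bits 0..15
    ulong mid = table16[( n >> 16 ) & 0xFFFF];  // bits 16..31
    ulong hi  = table16[( n >> 32 ) & 0xFFFF];  // bits 32..47
    return ( lo << 32 ) | ( mid << 16 ) | hi;   // shifts of 32, 16, 0
}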


What a great thread :) It just so happens that I recently started toying with QBasic/QB64 again. I quickly found it to be a fun exercise, but a rather futile one if you want to do anything useful without also doing massive amounts of work. I switched to Visual Basic 2015 and, after a good day of initial frustration, things are coming along. Many things are different, but plenty are also the same. In my younger years my programs were mostly glued together with enthusiasm; nowadays a little more structure really does help.

The Visual Basic versus C discussion is an interesting one. There is no doubt that C and C++ are seen as the standard. I am happy to say, however, that learning VB is also allowing me to read C/C++ more easily. Before, I could not really make much of it; now I can often at least understand the intention of the code. I have also found that many APIs exist for VB for programs and applications I use or am interested in, which is a definite plus if you want to dive right in.

Whether I will make the switch to C/C++ later on depends on both the need for it and the usefulness it will have for me. I guess I will need to dive in when I finally want to put those Arduinos to work, where every. cycle. counts. Or OpenCL, which is available for Visual Basic but is not very well documented compared to C.

At least I can say I am now versed in two useless programming languages :D

Edited by Camacha

the answer given online, by numerous people, was to make a table.

Sure, use a small lookup table, like one that can reverse the bits of a single 8-bit byte. Then the problem reduces to just reversing the bytes in a structure and passing each of them through the lookup table (the code sample I provided had an example of this for 64-bit integers). If you try to make a lookup table for reversing every possible 48-bit value, the whole thing is going to end up being even slower because of all the memory shenanigans it'll cause. Even if you had that kind of memory available, you're basically just going to take a dump all over your cache locality and that can be a very big deal for speed.

In fact, I was curious exactly to what extent that would impact performance, so I wrote a quick little app in C# to test this out. I have 3 methods for reversing the bits in a 64-bit integer; each one reverses it either 8, 16, or 24 bits at a time. I first profile generating a random 64-bit integer and XOR-ing it into a result variable (to make sure nothing optimizes out the function calls, I always make sure to use the result). I then profile each of the 3 methods and subtract out the overhead from the first run.

using System;
using System.Diagnostics;

namespace ReverseProfile
{
    static class Program
    {
        const int Samples = 100000000;

        static void Main()
        {
            var rand = new Random( 0 );

            Console.WriteLine( "Testing..." );

            for( int i = 0; i < Samples; ++i )
            {
                ulong x = rand.Next64();

                if( Reverse8( x ) != Reverse16( x ) )
                {
                    Console.WriteLine( "Failed: Reverse8 != Reverse16 @ 0x{0:X16}", x );
                    return;
                }

                if( Reverse16( x ) != Reverse24( x ) )
                {
                    Console.WriteLine( "Failed: Reverse16 != Reverse24 @ 0x{0:X16}", x );
                    return;
                }
            }

            Console.WriteLine( "Tests passed.\n" );

            Console.WriteLine( "Profiling Random..." );

            ulong r = 0;
            rand = new Random( 0 );

            var sw = Stopwatch.StartNew();
            for( int i = 0; i < Samples; ++i )
            {
                r ^= rand.Next64();
            }
            sw.Stop();

            var baseTime = sw.Elapsed;

            Console.WriteLine( "Result = 0x{0:X16}", r );
            Console.WriteLine( "Base time: {0}/call\n", FormatTime( baseTime ) );

            Console.WriteLine( "Profiling Reverse8..." );
            rand = new Random( 0 );

            r = 0;
            sw.Restart();
            for( int i = 0; i < Samples; ++i )
            {
                r ^= Reverse8( rand.Next64() );
            }
            sw.Stop();

            Console.WriteLine( "Result = 0x{0:X16}", r );
            Console.WriteLine( "Reverse8: {0}/call\n", FormatTime( sw.Elapsed - baseTime ) );

            Console.WriteLine( "Profiling Reverse16..." );
            rand = new Random( 0 );

            r = 0;
            sw.Restart();
            for( int i = 0; i < Samples; ++i )
            {
                r ^= Reverse16( rand.Next64() );
            }
            sw.Stop();

            Console.WriteLine( "Result = 0x{0:X16}", r );
            Console.WriteLine( "Reverse16: {0}/call\n", FormatTime( sw.Elapsed - baseTime ) );

            Console.WriteLine( "Profiling Reverse24..." );
            rand = new Random( 0 );

            r = 0;
            sw.Restart();
            for( int i = 0; i < Samples; ++i )
            {
                r ^= Reverse24( rand.Next64() );
            }
            sw.Stop();

            Console.WriteLine( "Result = 0x{0:X16}", r );
            Console.WriteLine( "Reverse24: {0}/call\n", FormatTime( sw.Elapsed - baseTime ) );
        }

        // Static constructor: builds the lookup tables before Main runs.
        static Program()
        {
            Console.WriteLine( "Initializing..." );
            byte[] reverse4 = { 0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE, 0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF };

            smReverse8 = new byte[1 << 8];
            for( int i = 0; i < smReverse8.Length; ++i )
            {
                smReverse8[i] = (byte) ( ( reverse4[i & 0xF] << 4 ) | reverse4[i >> 4] );
            }
            Console.WriteLine( "8-bit table built." );

            smReverse16 = new ushort[1 << 16];
            for( int i = 0; i < smReverse16.Length; ++i )
            {
                smReverse16[i] = (ushort) ( ( smReverse8[i & 0xFF] << 8 ) | smReverse8[i >> 8] );
            }
            Console.WriteLine( "16-bit table built." );

            smReverse24 = new uint[1 << 24];
            for( int i = 0; i < smReverse24.Length; ++i )
            {
                smReverse24[i] =
                    (uint) ( ( smReverse8[i & 0xFF] << 16 ) | ( smReverse8[( i >> 8 ) & 0xFF] << 8 ) | smReverse8[i >> 16] );
            }
            Console.WriteLine( "24-bit table built." );
        }

        private static ulong Reverse8( ulong n )
        {
            return ( (ulong) smReverse8[n & 0xFF] << 56
                | (ulong) smReverse8[( n >> 8 ) & 0xFF] << 48
                | (ulong) smReverse8[( n >> 16 ) & 0xFF] << 40
                | (ulong) smReverse8[( n >> 24 ) & 0xFF] << 32
                | (ulong) smReverse8[( n >> 32 ) & 0xFF] << 24
                | (ulong) smReverse8[( n >> 40 ) & 0xFF] << 16
                | (ulong) smReverse8[( n >> 48 ) & 0xFF] << 8
                | (ulong) smReverse8[n >> 56] );
        }

        private static ulong Reverse16( ulong n )
        {
            return ( (ulong) smReverse16[n & 0xFFFF] << 48
                | (ulong) smReverse16[( n >> 16 ) & 0xFFFF] << 32
                | (ulong) smReverse16[( n >> 32 ) & 0xFFFF] << 16
                | (ulong) smReverse16[( n >> 48 ) & 0xFFFF] );
        }

        private static ulong Reverse24( ulong n )
        {
            // 64 bits = 24 + 24 + 16; the last chunk is only 16 bits wide,
            // so its reversal lands 8 bits high within the 24-bit table
            // entry, which the shifts below account for.
            return ( (ulong) smReverse24[( n >> 40 ) & 0xFFFFFF]
                | (ulong) smReverse24[( n >> 16 ) & 0xFFFFFF] << 24
                | (ulong) smReverse24[n & 0xFFFF] << 40 );
        }

        private static ulong Next64( this Random rand )
        {
            // Note: Random.Next() is non-negative, so bits 31 and 63 are always zero.
            return (ulong) rand.Next() << 32 | (ulong) rand.Next();
        }

        private static string FormatTime( TimeSpan time )
        {
            double ms = time.TotalMilliseconds / Samples;

            if( ms > 10000 ) return string.Format( "{0} s", ms / 1000 );
            if( ms > 10 ) return string.Format( "{0} ms", ms );
            if( ms > 0.01 ) return string.Format( "{0} µs", ms * 1000 );
            return string.Format( "{0} ns", ms * 1000000 );
        }

        private static readonly byte[] smReverse8;
        private static readonly ushort[] smReverse16;
        private static readonly uint[] smReverse24;
    }
}

Now, looking at the methods, one would expect Reverse16 to be about twice as fast as Reverse8, and Reverse24 to be about 3 times as fast. If computation time were purely a function of the number of instructions that pass through the CPU, anyway.

Here's what a typical run looks like:

Initializing...
8-bit table built.
16-bit table built.
24-bit table built.
Testing...
Tests passed.

Profiling Random...
Result = 0x4D522B76016D765B
Base time: 20.373974 ns/call

Profiling Reverse8...
Result = 0xDA6EB6806ED44AB2
Reverse8: 8.063506 ns/call

Profiling Reverse16...
Result = 0xDA6EB6806ED44AB2
Reverse16: 4.015819 ns/call

Profiling Reverse24...
Result = 0xDA6EB6806ED44AB2
Reverse24: 49.221145 ns/call

What do you know, Reverse16 is about twice as fast as Reverse8; that's reassuring. But, oh, Reverse24 is over 6 times slower than Reverse8. So our prediction is off by a factor of 18. What gives? Locality of reference. The lookup table for 24 bits is too big to fit in the cache, so only portions of it are loaded at any given time. If we were iterating through every 64-bit value (which we're not going to do, because none of us has that amount of time to wait), you'd probably see completely different timing behavior. Instead, I'm throwing random values at it, which means lots of cache misses, and the CPU ends up spending a lot of time fetching stuff from main memory (or slower cache levels).

Of course I don't know what the optimal lookup table size is. It's probably hardware dependent at the very least, but I'd be willing to bet that the 16-bit lookup table is a good starting point. As usual, standard advice applies: Profile before optimizing, and make it work, make it right, make it fast in that order.


Of course I don't know what the optimal lookup table size is. It's probably hardware dependent at the very least, but I'd be willing to bet that the 16-bit lookup table is a good starting point. As usual, standard advice applies: Profile before optimizing, and make it work, make it right, make it fast in that order.

You may want to repeat the experiment with the random samples stored in an array instead of being generated on the fly.

These days, processors typically have 64 kilobytes of L1 cache and 256 kilobytes of L2 cache per core. An 8-bit lookup table takes 256 bytes, while a 16-bit lookup table requires 128 kilobytes. When you benchmark Reverse8() and Reverse16() with randomly generated numbers, you're essentially comparing an 8-bit lookup table in L1 cache vs. a 16-bit lookup table in L2 cache. On the other hand, if the data resides in memory (as it would in the real use case), an 8-bit lookup table would probably still remain in L1 cache most of the time, while you might have constant L2 cache misses with a 16-bit lookup table.
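(For scale, the 24-bit table is 2^24 entries x 4 bytes = 64 MB, which no cache level will hold; that is why Reverse24 falls off a cliff in these benchmarks.)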


Sure, use a small lookup table, like one that can reverse the bits of a single 8-bit byte. Then the problem reduces to just reversing the bytes in a structure and passing each of them through the lookup table (the code sample I provided had an example of this for 64-bit integers). If you try to make a lookup table for reversing every possible 48-bit value, the whole thing is going to end up being even slower because of all the memory shenanigans it'll cause. Even if you had that kind of memory available, you're basically just going to take a dump all over your cache locality and that can be a very big deal for speed.

In fact, I was curious exactly to what extent that would impact performance, so I wrote a quick little app in C# to test this out. I have 3 methods for reversing the bits in a 64-bit integer; each one reverses it either 8, 16, or 24 bits at a time. I first profile generating a random 64-bit integer and XOR-ing it into a result variable (to make sure nothing optimizes out the function calls, I always make sure to use the result). I then profile each of the 3 methods and subtract out the overhead from the first run.

........

Here's what a typical .......

Profiling Random...
Result = 0x4D522B76016D765B
Base time: 20.373974 ns/call

Profiling Reverse8...
Result = 0xDA6EB6806ED44AB2
Reverse8: 8.063506 ns/call

Profiling Reverse16...
Result = 0xDA6EB6806ED44AB2
Reverse16: 4.015819 ns/call

Profiling Reverse24...
Result = 0xDA6EB6806ED44AB2
Reverse24: 49.221145 ns/call

What do you know, Reverse16 is about twice as fast as Reverse8; that's reassuring. But, oh, Reverse24 is over 6 times slower than Reverse8. So our prediction is off by a factor of 18. What gives? Locality of reference. The lookup table for 24 bits is too big to fit in the cache, so only portions of it are loaded at any given time. If we were iterating through every 64-bit value (which we're not going to do, because none of us has that amount of time to wait), you'd probably see completely different timing behavior. Instead, I'm throwing random values at it, which means lots of cache misses, and the CPU ends up spending a lot of time fetching stuff from main memory (or slower cache levels).

Of course I don't know what the optimal lookup table size is. It's probably hardware dependent at the very least, but I'd be willing to bet that the 16-bit lookup table is a good starting point. As usual, standard advice applies: Profile before optimizing, and make it work, make it right, make it fast in that order.

Cool, I suspected 16 would be. But the biggest advantage of 16 is that it's a register-compatible size; 32 is way too big and 8 is not specific enough.

- - - Updated - - -

This thread feels either like:

- someone wants help with his homework

or

- someone wants to rant about something he doesn't like but would need

Not homework, it's a project. The major rant that I have is that the documentation, particularly for the GNU stuff, is really weak.

Edited by PB666

You may want to repeat the experiment with the random samples stored in an array instead of being generated on the fly.

These days, processors typically have 64 kilobytes of L1 cache and 256 kilobytes of L2 cache per core. An 8-bit lookup table takes 256 bytes, while a 16-bit lookup table requires 128 kilobytes. When you benchmark Reverse8() and Reverse16() with randomly generated numbers, you're essentially comparing an 8-bit lookup table in L1 cache vs. a 16-bit lookup table in L2 cache. On the other hand, if the data resides in memory (as it would in the real use case), an 8-bit lookup table would probably still remain in L1 cache most of the time, while you might have constant L2 cache misses with a 16-bit lookup table.

That's a simple enough change, here's the modified code and results:


using System;
using System.Diagnostics;

namespace ReverseProfile
{
    static class Program
    {
        const int SampleCount = 100000000;

        static void Main()
        {
            var rand = new Random();

            var samples = new ulong[SampleCount];

            Console.WriteLine( "Testing..." );

            for( int i = 0; i < SampleCount; ++i )
            {
                samples[i] = rand.Next64();

                if( Reverse8( samples[i] ) != Reverse16( samples[i] ) )
                {
                    Console.WriteLine( "Failed: Reverse8 != Reverse16 @ 0x{0:X16}", samples[i] );
                    return;
                }

                if( Reverse16( samples[i] ) != Reverse24( samples[i] ) )
                {
                    Console.WriteLine( "Failed: Reverse16 != Reverse24 @ 0x{0:X16}", samples[i] );
                    return;
                }
            }

            Console.WriteLine( "Tests passed.\n" );

            Console.WriteLine( "Profiling array access..." );

            ulong r = 0;

            var sw = Stopwatch.StartNew();
            for( int i = 0; i < SampleCount; ++i )
            {
                r ^= samples[i];
            }
            sw.Stop();

            var baseTime = sw.Elapsed;

            Console.WriteLine( "Result = 0x{0:X16}", r );
            Console.WriteLine( "Base time: {0}/call\n", FormatTime( baseTime ) );

            Console.WriteLine( "Profiling Reverse8..." );

            r = 0;
            sw.Restart();
            for( int i = 0; i < SampleCount; ++i )
            {
                r ^= Reverse8( samples[i] );
            }
            sw.Stop();

            Console.WriteLine( "Result = 0x{0:X16}", r );
            Console.WriteLine( "Reverse8: {0}/call\n", FormatTime( sw.Elapsed - baseTime ) );

            Console.WriteLine( "Profiling Reverse16..." );

            r = 0;
            sw.Restart();
            for( int i = 0; i < SampleCount; ++i )
            {
                r ^= Reverse16( samples[i] );
            }
            sw.Stop();

            Console.WriteLine( "Result = 0x{0:X16}", r );
            Console.WriteLine( "Reverse16: {0}/call\n", FormatTime( sw.Elapsed - baseTime ) );

            Console.WriteLine( "Profiling Reverse24..." );

            r = 0;
            sw.Restart();
            for( int i = 0; i < SampleCount; ++i )
            {
                r ^= Reverse24( samples[i] );
            }
            sw.Stop();

            Console.WriteLine( "Result = 0x{0:X16}", r );
            Console.WriteLine( "Reverse24: {0}/call\n", FormatTime( sw.Elapsed - baseTime ) );
        }

        static Program()
        {
            Console.WriteLine( "Initializing..." );
            byte[] reverse4 = { 0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE, 0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF };

            smReverse8 = new byte[1 << 8];
            for( int i = 0; i < smReverse8.Length; ++i )
            {
                smReverse8[i] = (byte) ( ( reverse4[i & 0xF] << 4 ) | reverse4[i >> 4] );
            }
            Console.WriteLine( "8-bit table built." );

            smReverse16 = new ushort[1 << 16];
            for( int i = 0; i < smReverse16.Length; ++i )
            {
                smReverse16[i] = (ushort) ( ( smReverse8[i & 0xFF] << 8 ) | smReverse8[i >> 8] );
            }
            Console.WriteLine( "16-bit table built." );

            smReverse24 = new uint[1 << 24];
            for( int i = 0; i < smReverse24.Length; ++i )
            {
                smReverse24[i] =
                    (uint) ( ( smReverse8[i & 0xFF] << 16 ) | ( smReverse8[( i >> 8 ) & 0xFF] << 8 ) | smReverse8[i >> 16] );
            }
            Console.WriteLine( "24-bit table built." );
        }

        private static ulong Reverse8( ulong n )
        {
            return ( (ulong) smReverse8[n & 0xFF] << 56
                | (ulong) smReverse8[( n >> 8 ) & 0xFF] << 48
                | (ulong) smReverse8[( n >> 16 ) & 0xFF] << 40
                | (ulong) smReverse8[( n >> 24 ) & 0xFF] << 32
                | (ulong) smReverse8[( n >> 32 ) & 0xFF] << 24
                | (ulong) smReverse8[( n >> 40 ) & 0xFF] << 16
                | (ulong) smReverse8[( n >> 48 ) & 0xFF] << 8
                | (ulong) smReverse8[n >> 56] );
        }

        private static ulong Reverse16( ulong n )
        {
            return ( (ulong) smReverse16[n & 0xFFFF] << 48
                | (ulong) smReverse16[( n >> 16 ) & 0xFFFF] << 32
                | (ulong) smReverse16[( n >> 32 ) & 0xFFFF] << 16
                | (ulong) smReverse16[( n >> 48 ) & 0xFFFF] );
        }

        private static ulong Reverse24( ulong n )
        {
            return ( (ulong) smReverse24[( n >> 40 ) & 0xFFFFFF]
                | (ulong) smReverse24[( n >> 16 ) & 0xFFFFFF] << 24
                | (ulong) smReverse24[n & 0xFFFF] << 40 );
        }

        private static ulong Next64( this Random rand )
        {
            return (ulong) rand.Next() << 32 | (ulong) rand.Next();
        }

        private static string FormatTime( TimeSpan time )
        {
            double ms = time.TotalMilliseconds / SampleCount;

            if( ms > 10000 ) return string.Format( "{0} s", ms / 1000 );
            if( ms > 10 ) return string.Format( "{0} ms", ms );
            if( ms > 0.01 ) return string.Format( "{0} µs", ms * 1000 );
            return string.Format( "{0} ns", ms * 1000000 );
        }

        private static readonly byte[] smReverse8;
        private static readonly ushort[] smReverse16;
        private static readonly uint[] smReverse24;
    }
}

Initializing...
8-bit table built.
16-bit table built.
24-bit table built.
Testing...
Tests passed.

Profiling array access...
Result = 0x6CFAB40F37C448F1
Base time: 1.97389 ns/call

Profiling Reverse8...
Result = 0x8F1223ECF02D5F36
Reverse8: 10.088138 ns/call

Profiling Reverse16...
Result = 0x8F1223ECF02D5F36
Reverse16: 5.013811 ns/call

Profiling Reverse24...
Result = 0x8F1223ECF02D5F36
Reverse24: 31.326612 ns/call


I did a quick C++ conversion of the code:


#include <array>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <vector>


const uint64_t SAMPLE_COUNT = 100000000;
const double TO_NANO = 1000000000.0 / SAMPLE_COUNT;


double
readTimer()
{
    std::chrono::duration<double> temp = std::chrono::steady_clock::now().time_since_epoch();
    return temp.count();
}

struct array_access
{
    std::string name;

    array_access() : name("access") {}

    inline uint64_t operator() (uint64_t n) { return n; }
};

struct reverse8
{
    std::string name;
    std::array<uint8_t, 256> data;

    reverse8() : name("reverse8")
    {
        std::vector<uint64_t> reverse4 = { 0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE, 0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF };
        for(uint64_t i = 0; i < data.size(); i++)
        {
            data[i] = ((reverse4[i & 0xF] << 4) | reverse4[i >> 4]);
        }
    }

    inline uint64_t operator() (uint64_t n) const
    {
        return ((uint64_t)data[n & 0xFF] << 56
            | (uint64_t)data[(n >> 8) & 0xFF] << 48
            | (uint64_t)data[(n >> 16) & 0xFF] << 40
            | (uint64_t)data[(n >> 24) & 0xFF] << 32
            | (uint64_t)data[(n >> 32) & 0xFF] << 24
            | (uint64_t)data[(n >> 40) & 0xFF] << 16
            | (uint64_t)data[(n >> 48) & 0xFF] << 8
            | (uint64_t)data[n >> 56]);
    }
};

struct reverse16
{
    std::string name;
    std::vector<uint16_t> data;

    reverse16(const reverse8& r8) : name("reverse16"), data(1 << 16)
    {
        for(uint64_t i = 0; i < data.size(); i++)
        {
            data[i] = ((r8.data[i & 0xFF] << 8) | r8.data[i >> 8]);
        }
    }

    inline uint64_t operator() (uint64_t n)
    {
        return ((uint64_t)data[n & 0xFFFF] << 48
            | (uint64_t)data[(n >> 16) & 0xFFFF] << 32
            | (uint64_t)data[(n >> 32) & 0xFFFF] << 16
            | (uint64_t)data[(n >> 48) & 0xFFFF]);
    }
};

struct reverse24
{
    std::string name;
    std::vector<uint32_t> data;

    reverse24(const reverse8& r8) : name("reverse24"), data(1 << 24)
    {
        for(uint64_t i = 0; i < data.size(); i++)
        {
            data[i] = ((r8.data[i & 0xFF] << 16) | (r8.data[(i >> 8) & 0xFF] << 8) | r8.data[i >> 16]);
        }
    }

    inline uint64_t operator() (uint64_t n)
    {
        return ((uint64_t)data[(n >> 40) & 0xFFFFFF]
            | (uint64_t)data[(n >> 16) & 0xFFFFFF] << 24
            | (uint64_t)data[n & 0xFFFF] << 40);
    }
};


template<class Func>
void
profile(Func& func, const std::vector<uint64_t>& samples)
{
    std::cout << "Profiling " << func.name << "..." << std::endl;

    uint64_t r = 0;
    double start = readTimer();
    for(uint64_t i = 0; i < samples.size(); i++)
    {
        r ^= func(samples[i]);
    }
    double seconds = readTimer() - start;

    std::cout << "Result = " << r << std::endl;
    std::cout << func.name << ": " << (seconds * TO_NANO) << " ns/call" << std::endl;
    std::cout << std::endl;
}


int main(int argc, char** argv)
{
    std::cout << "Initializing..." << std::endl;

    array_access aa;

    reverse8 r8;
    std::cout << "8-bit table built." << std::endl;

    reverse16 r16(r8);
    std::cout << "16-bit table built." << std::endl;

    reverse24 r24(r8);
    std::cout << "24-bit table built." << std::endl;

    std::cout << "Testing..." << std::endl;
    std::mt19937_64 rng(0xDEADBEEF);
    std::vector<uint64_t> samples(SAMPLE_COUNT);
    for(uint64_t i = 0; i < samples.size(); i++)
    {
        samples[i] = rng();
        if(r8(samples[i]) != r16(samples[i]))
        {
            std::cout << "Failed: reverse8 != reverse16 @ " << samples[i] << std::endl;
            return 1;
        }
        if(r16(samples[i]) != r24(samples[i]))
        {
            std::cout << "Failed: reverse16 != reverse24 @ " << samples[i] << std::endl;
            return 1;
        }
    }
    std::cout << "Tests passed." << std::endl;
    std::cout << std::endl;

    profile(aa, samples);
    profile(r8, samples);
    profile(r16, samples);
    profile(r24, samples);

    return 0;
}

With the C# version, I also observed a significant performance difference between the 8-bit lookup table and the 16-bit lookup table:

$ mcs -o+ foo.cs
$ mono foo.exe
Initializing...
8-bit table built.
16-bit table built.
24-bit table built.
Testing...
Tests passed.

Profiling array access...
Result = 0x611E686E163A105C
Base time: 1,054863 ns/call

Profiling Reverse8...
Result = 0x3A085C6876167886
Reverse8: 10,821065 ns/call

Profiling Reverse16...
Result = 0x3A085C6876167886
Reverse16: 6,586188 ns/call

Profiling Reverse24...
Result = 0x3A085C6876167886
Reverse24: 34,173045 ns/call

In the C++ version, the differences in performance were quite minor:

$ g++-5 -std=c++11 -Wall -O3 foo.cpp -o foo
$ ./foo
Initializing...
8-bit table built.
16-bit table built.
24-bit table built.
Testing...
Tests passed.

Profiling access...
Result = 1630952772224003662
access: 0.585153 ns/call

Profiling reverse8...
Result = 8247669395094127976
reverse8: 4.34656 ns/call

Profiling reverse16...
Result = 8247669395094127976
reverse16: 4.06664 ns/call

Profiling reverse24...
Result = 8247669395094127976
reverse24: 26.3678 ns/call


Pretty interesting seeing the performance differences. Not sure why exactly that is. I could imagine that it may be related to some of the overhead in the array accesses in C#, since it may be doing some bounds checking on the lookup tables. I could try rolling unsafe versions of the lookup functions, since that would allow me to essentially force it to skip the bounds checks.
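The unsafe variants aren't shown here, but an UnsafeReverse8 along these lines is the obvious shape (a sketch, assuming the same smReverse8 table and a build with /unsafe enabled):

private static unsafe ulong UnsafeReverse8( ulong n )
{
    // Pinning the table and indexing through a raw pointer skips the
    // CLR's per-access array bounds checks.
    fixed( byte* t = smReverse8 )
    {
        return ( (ulong) t[n & 0xFF] << 56
            | (ulong) t[( n >> 8 ) & 0xFF] << 48
            | (ulong) t[( n >> 16 ) & 0xFF] << 40
            | (ulong) t[( n >> 24 ) & 0xFF] << 32
            | (ulong) t[( n >> 32 ) & 0xFF] << 24
            | (ulong) t[( n >> 40 ) & 0xFF] << 16
            | (ulong) t[( n >> 48 ) & 0xFF] << 8
            | (ulong) t[n >> 56] );
    }
}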

But at least the cache misses still have an obvious and significant impact on performance. That's the only thing I actually set out to demonstrate in the first place, after all.

EDIT:

Well, that's not it:

Initializing...
8-bit table built.
16-bit table built.
24-bit table built.
Testing...
Tests passed.

Profiling array access...
Result = 0x35B74AB9323BAF0A
Base time: 2.114285 ns/call

Profiling Reverse8...
Result = 0x50F5DC4C9D52EDAC
Reverse8: 7.892897 ns/call

Profiling Reverse16...
Result = 0x50F5DC4C9D52EDAC
Reverse16: 4.123271 ns/call

Profiling Reverse24...
Result = 0x50F5DC4C9D52EDAC
Reverse24: 32.213506 ns/call

Profiling UnsafeReverse8...
Result = 0x50F5DC4C9D52EDAC
UnsafeReverse8: 8.589045 ns/call

Profiling UnsafeReverse16...
Result = 0x50F5DC4C9D52EDAC
UnsafeReverse16: 5.796279 ns/call

Profiling UnsafeReverse24...
Result = 0x50F5DC4C9D52EDAC
UnsafeReverse24: 36.317731 ns/call

It closed the gap somewhat, but not nearly to the extent necessary to say that's where the majority of the performance difference is coming in.

Edited by Yourself

I did a quick C++ conversion of the code:

This is very good to know; it means there is not too high a price to be paid for using C#.

I noticed that earlier it looked like you were using the Visual Studio IDE, but now your C# looks like you are using MonoDevelop.

I take it you are using Mono on Windows, or are you also comparing OSes at the same time?


I noticed that earlier it looked like you were using the Visual Studio IDE, but now your C# looks like you are using MonoDevelop.

I take it you are using Mono on Windows, or are you also comparing OSes at the same time?

I'm using g++ 5.2.0 on OS X. The C# code was compiled on Mono 4.2.0. I didn't even know that I had Mono installed, and this was the first time I compiled any C# code. The last time I used Visual Studio was 15 years ago, and the last version of Windows I used was Windows 2000.


I'm using g++ 5.2.0 on OS X. The C# code was compiled on Mono 4.2.0. I didn't even know that I had Mono installed, and this was the first time I compiled any C# code. The last time I used Visual Studio was 15 years ago, and the last version of Windows I used was Windows 2000.

Sorry, I seem to be confusing your posts with Yourself's; he appears to be using a Windows environment. The g++ compiler was the lead I saw. This is kind of wow: here we have all kinds of OSes, and hardware even; who knows if AMD is represented.

I have an i5-4590S with apparently 6 MB of smart cache, a 16 GB base which I can expand to 32, and SSE4.2/AVX2.0. This is what Intel says: the L3 cache is shared by all four cores, so essentially each core has access to 6 MB, but I doubt this is actually the case; each core has its own L1 cache.

According to this site, its L1 is 128 KB for instructions and 128 KB for data, L2 proper is 1 MB, and L3 is 6 MB, which I guess muddles with L2 as needed.


Yes, I'm on Windows. I primarily develop in Visual Studio, though I'll use LINQPad if I don't want to go through all the trouble of actually creating a new solution.

I also ran the whole test suite again, but this time I targeted x64. Previously I was running with the default, which is Any CPU; that would run as x64 on a 64-bit machine and x86 on a 32-bit machine, except that by default it also prefers 32-bit, so essentially it was running as a 32-bit process before. If I specifically target 64-bit, I get a different JIT and different results:

Initializing...
8-bit table built.
16-bit table built.
24-bit table built.
Testing...
Tests passed.

Profiling array access...
Result = 0x563765EE46B64B32
Base time: 0.608915 ns/call

Profiling Reverse8...
Result = 0x4CD26D6277A6EC6A
Reverse8: 8.348577 ns/call

Profiling Reverse16...
Result = 0x4CD26D6277A6EC6A
Reverse16: 5.116408 ns/call

Profiling Reverse24...
Result = 0x4CD26D6277A6EC6A
Reverse24: 33.064754 ns/call

Profiling UnsafeReverse8...
Result = 0x4CD26D6277A6EC6A
UnsafeReverse8: 6.70843 ns/call

Profiling UnsafeReverse16...
Result = 0x4CD26D6277A6EC6A
UnsafeReverse16: 5.058065 ns/call

Profiling UnsafeReverse24...
Result = 0x4CD26D6277A6EC6A
UnsafeReverse24: 29.71604 ns/call

Notably the baseline array access and XORs are much faster (presumably because the XORs are operating on 64-bit integers) and the unsafe code actually edges out the safe code for performance.
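For reference, explicitly targeting x64 corresponds to project settings like these in the .csproj (standard MSBuild properties; the IDE's "Prefer 32-bit" checkbox maps to Prefer32Bit):

<PropertyGroup>
  <PlatformTarget>x64</PlatformTarget>
  <Prefer32Bit>false</Prefer32Bit>
</PropertyGroup>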


Yeah, Studio was throwing me errors on the snippet. Though looking at the safe versus unsafe numbers, it may simply be statistical variation; did you use a fixed random seed, or seed from the system time?

Does Studio's C++ have a 32-bit limitation on the compile? It keeps throwing errors on the 64-bit ints.


  • 2 weeks later...
... but eventually found I could code directly in hexadecimal.... why do I want to waste the clock cycles trying to achieve 20 places.... across many platforms.

2cp regarding those few pointers:

https://en.wikipedia.org/wiki/Reference_(computer_science)

you also got the tralalala blablablabla of defense-related versions; that's why some languages get popularized and some don't: hardware deeply shared architecture

i do love how the "external data reference" pages disappeared (maybe vanished is more accurate... anyway) a few months ago ;) no tralalalala, no blablablabla; you know, some might have found them useful ;)

Edited by WinkAllKerb''

  • 2 weeks later...

https://msdn.microsoft.com/en-us/library/yyaad03b%28v=vs.90%29.aspx

Just to note from this: C# and VB are basically now two 'ways' of saying the same thing.

Arrays: In C++ an array is merely a pointer. In C#, arrays are objects that include methods and properties. For example, the size of an array can be queried via the Length property. C# arrays also employ indexers that verify each index used to access the array. The syntax for declaring C# arrays is different from that for C++ arrays: the tokens "[]" appear following the array type in C#, not the variable.

In C# and VB an array is an object inheriting from System.Array. It is in essence a slightly weaker brother of the System.Collections set.
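A quick sketch of what that buys you compared to a raw C++ pointer (names are illustrative):

using System;

class ArrayDemo
{
    static void Main()
    {
        int[] a = new int[4];          // brackets follow the type, not the variable
        Console.WriteLine( a.Length ); // arrays are objects carrying properties

        try
        {
            int x = a[7];              // every access is index-verified...
        }
        catch( IndexOutOfRangeException )
        {
            Console.WriteLine( "bounds-checked" ); // ...unlike a raw pointer
        }
    }
}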

Here is the problem: coming from some of the features of C, you might find yourself wanting to implement a jagged array or a structure with certain properties, for example when programming controls. Setting up a control array is somewhat hindered in C++ (note: neither is as easy as setting up control arrays in VB6). You end up with a massive collection class dealing with rather simple implementations. But there is a built-in process for dumping the trash on anything you might generate, so it is possible to silence a class once it is no longer needed.

Inheritance: In C++, classes and structs are virtually identical whereas in C#, they are quite different. C# classes can implement any number of interfaces, but can inherit from only one base class. Furthermore, C# structs do not support inheritance, and do not support explicit default constructors (one is provided by default).
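A minimal sketch of those inheritance rules (all type names made up):

interface IFlyer { void Fly(); }
interface ISwimmer { void Swim(); }

class Animal { }

// A class gets exactly one base class but any number of interfaces.
class Duck : Animal, IFlyer, ISwimmer
{
    public void Fly() { }
    public void Swim() { }
}

// A struct can implement interfaces but cannot inherit from a class,
// and cannot declare an explicit parameterless constructor; the
// default one is always provided.
struct Pebble : ISwimmer
{
    public void Swim() { }
}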
