
Non-comparative sorting algorithm - CurveSort


sevenperforce


Way back in my undergrad comp sci I class, we went through the usual steps of learning to code various sorting algorithms: bubble sort, quicksort, mergesort, and so forth. If you aren't familiar, all these sorting algorithms are comparison sorts, meaning that they sort a list of elements by successive greater-than/less-than comparison of individual elements. Conceptually, they function like a person with a balance scale and a set of unmarked weights, successively comparing individual weights against each other until all values have been sorted. Lots of energy in the computer science and mathematical logic worlds has been expended trying to design efficient sorting algorithms for data.

220px-Sorting_quicksort_anim.gif | Bubble-sort-example-300px.gif

(Quicksort, left, and Bubble Sort, right)

It is mathematically proven that no purely comparison-based sorting algorithm can guarantee better than n*log(n) performance: in the worst case, the number of comparisons required is at least some constant (usually disregarded) multiplied by the number of list items, multiplied by the logarithm of the number of list items. Some algorithms run faster on certain datasets but have worse average performance, while other algorithms have better average performance but terrible performance (up to n² or worse) on certain datasets. As long as you're sorting by successive comparisons, n*log(n) is the shortest sequence of comparisons you can be sure will properly sort the set. In the absolute best case (say, a list that is already nearly sorted), the number of comparisons you need is simply proportional to n, the number of items in the list.

During my comp sci class, I came up with a half-baked idea for a sorting algorithm that wouldn't rely on individual comparisons at all, but I was having trouble figuring out how to make the recursion work, and it just stayed in the back of my head for more than a decade. At the time I had the notion of calling it "bucket sort" on the theory that it would "toss" the list items into a series of buckets that were arranged by size (an O(n) operation) and then sort those buckets individually. I had the vague sense that this approach could work well for many real-world datasets, especially ones with a normal distribution or a weighted distribution.

This idea came back to me last night, and I couldn't sleep, so I worked it out in my head and I think I figured out how to actually make it work: not necessarily with buckets, but by defining a two-dimensional curve and then placing the list elements along that curve. Both of these are O(n) operations, so it would have a best-case runtime of 2n (which in big-O terms is simply O(n)) for data with linear, exponential, or normal distributions. You could also conceptualize it as a "histogram sort," because it essentially creates a new histogram of the dataset with each successive pass and uses that histogram to refine the curve being used to sort the data. There are still comparisons, of course, but you never do pairwise comparisons between individual list items.

Linear Case

As an example, I'll start by doing a purely linear-curve version of the algorithm. Consider the following set of ten random list items:

S = {14,39,11,41,30,37,19,18,6,27}

We can depict this set graphically:

Random-10.png

Once sorted, we know that these list items (arranged graphically) will form some two-dimensional monotonic curve. Let's pretend for a moment that the curve they form when sorted is defined by your standard y = mx + b slope-intercept equation, where y is the value of each list item and x is its ordinal position in the sorted version of the list. If we knew the values of m and b, the constants in that equation, we wouldn't have to do ANY pairwise comparisons at all; we could just run through the list a single time, plug in the list value for y, and solve for x to figure out where to put it on the list. We can guess at m and b if we know the highest value, the lowest value, and the number of list items, so let's run through the list once (1*n) and get those values. We run through the list and find that the lowest value is 6, the highest value is 41, and the number of list items is 10, so we take b = 6 and m = (41 - 6) / 10 = 3.5. We define a new array and project a curve over that array using our equation y = 3.5x + 6:

slope-intercept-1.png

We solve for x in our equation (giving us x = 2(y - 6)/7), and then we go back to the unsorted list and run through it a second time (now up to 2*n), using our equation to place the list elements along this curve. For example, the first element is 14, which gives x = 2.29, which rounds to 2, so we move it to the second slot on our new array:

slope-intercept-first-pass.png
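In code, that placement step is just the inverted line equation rounded to the nearest slot. Here's a minimal sketch in Python (the helper name slot_for is mine; note that with b = 6 and m = 3.5 the predicted slots run from 0 up to 10):

```python
def slot_for(y, b, m):
    # invert y = m*x + b and round to the nearest whole slot
    return round((y - b) / m)

values = [14, 39, 11, 41, 30, 37, 19, 18, 6, 27]
b, m = min(values), (max(values) - min(values)) / len(values)  # b = 6, m = 3.5

print(slot_for(14, b, m))  # 2.29 rounds to 2 -> the second slot, as in the picture
print(slot_for(39, b, m))  # 9.43 rounds to 9
```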

We repeat for each successive list item:

slope-intercept-first-pass-step-2.png

If our list had a perfectly linear distribution, each list item would land perfectly on the line in perfect order, and we'd be done. But it's a random set, and so eventually we will run into a value that lands in a slot we already filled. Here, our sixth list element gives us x(37) = 8.86 ≈ 9, but we already put our second list item in that slot because x(39) = 9.43 ≈ 9, so we simply tack it onto the end of our second list item in a linked list (noting for future reference the length and position of our linked list):

slope-intercept-first-pass-step-3.png

Finally, we continue until we complete our run through the list, still keeping track of where we are creating linked lists due to overflow (we are still only at 2*n operations):

slope-intercept-first-pass-complete.png

If the values had been perfectly linear, then they would have dropped perfectly into order, the number of new linked lists would have been zero, and the computer would know that the list was sorted. But of course this was a random list and there were three places where the algorithm dropped multiple values into the same slot. (We can visually see that each of those happens to be out of order, but the computer doesn't know that; it only knows that there was overflow in those spots.) We have now essentially done the following:

slope-intercept-before-recursion.png

The computer knows that the green items are sorted, and it knows that the sequences of red items are unsorted sub-lists that are nevertheless sorted relative to the green items that bound them. Here, the algorithm can simply be called recursively on each unsorted sub-list until the entire list is sorted. (In operation, you would call a simpler comparison algorithm for a two-item unsorted list, but of course you would also have plenty of sub-lists that were longer than just two items.)
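Putting the two passes and the recursion together, here is a minimal sketch of linear curvesort in Python, under a few assumptions of my own: plain Python lists stand in for the per-slot linked lists, I allocate n + 1 slots so the rounding can never fall off either end, and the built-in sorted() plays the role of the "simpler comparison algorithm" for tiny sub-lists. The name linear_curvesort is mine, and this is an illustration of the idea rather than a tuned implementation:

```python
def linear_curvesort(values):
    """Sort by predicting each item's position along an assumed line y = m*x + b."""
    n = len(values)
    if n <= 2:
        return sorted(values)              # tiny sub-lists: plain comparison sort

    b, hi = min(values), max(values)       # first pass through the list (1*n)
    if b == hi:
        return list(values)                # every value identical; already sorted
    m = (hi - b) / n                       # slope of the predicted line

    slots = [[] for _ in range(n + 1)]     # one bucket per predicted position
    for y in values:                       # second pass (2*n): drop each item in place
        slots[round((y - b) / m)].append(y)

    result = []
    for bucket in slots:
        if len(bucket) <= 1:
            result.extend(bucket)                    # landed alone: already in order
        else:
            result.extend(linear_curvesort(bucket))  # overflow: recurse on the sub-list
    return result

print(linear_curvesort([14, 39, 11, 41, 30, 37, 19, 18, 6, 27]))
# [6, 11, 14, 18, 19, 27, 30, 37, 39, 41]
```

Because items are appended in the order they arrive and sorted() is stable, equal values keep their original relative order, which matches the stability claim below.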

This algorithm (I'll call it "linear curvesort") has some conceptual commonalities with heapsort and smoothsort although it is not a comparison sort and it provides a stable sort (that is, if two items have the same value, they remain in the same order they started in). It is also adaptive, meaning that it will run much faster on a dataset that is almost entirely sorted than it will on a dataset that is more randomized. Determining best-case, worst-case, and average-case performance is beyond my ken.

Beyond Linear

Linear curvesort works great for lists with items that have a basically linear distribution, but what about a weighted or normal distribution? For example, the following set of randomized values, when sorted, is heavily weighted to the larger end:

upper-weighted.png

Because the set is weighted toward larger values, running the linear version of this algorithm would end up having a bunch of successive recursions on the right-hand side of the data:

too-many-recursions.png

With a list this small, that's not so bad, but for larger lists you would end up reshuffling the top end of the list (or any areas where the list values were weighted) many, many times (approaching n²) to get it sorted. Conceptually, the linear curvesort predicts a line, then refines sections of that line which don't fit the data, then refines the remaining sections, over and over, not unlike a monotonic moving average.
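To make that concrete, here's a quick experiment (my own numbers, not the dataset in the picture): a list skewed toward large values, pushed through the same linear slot formula. Most of the items pile into a few slots near the top, and every one of those piles becomes a recursive call:

```python
import random

random.seed(1)
# 30 values skewed toward the high end of the 0-100 range
values = [round(100 * random.random() ** 0.3) for _ in range(30)]

b, hi, n = min(values), max(values), len(values)
m = (hi - b) / n

slots = {}
for y in values:
    slots.setdefault(round((y - b) / m), []).append(y)

overflow = {slot: len(items) for slot, items in slots.items() if len(items) > 1}
print(overflow)  # the multi-item slots cluster near the top of the range
```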

Can we improve on this?

I think so. When we do our initial linear sort, we set the size of the sorting array to some number that is much smaller than the original list. For example, in this version, there are thirty list items but I'm setting my sorting array to a length of just 5. Because of this, all of the list items end up in a linked sub-list, but that is by design. As before, we keep track of the sizes and locations of each linked list, which creates a histogram of the dataset (note that you don't actually have to write the list items into new locations here; you can just keep track of how many would fit into each slot as you go through):

histogram.png
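A sketch of that counting pass, assuming k = 5 coarse slots and equal-width bins (a small simplification of the rounded-slot formula used above); nothing is moved, we only count:

```python
def coarse_histogram(values, k=5):
    """One O(n) pass that counts how many items would land in each of k coarse slots."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k                      # value range covered by each slot
    counts = [0] * k
    for y in values:
        i = min(int((y - lo) / width), k - 1)  # clamp the maximum value into the last slot
        counts[i] += 1
    return counts

# For a 30-item list weighted toward large values, this might come back as
# something like [2, 3, 5, 9, 11] -- a rough picture of the distribution.
```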

In theory (depending on the size of my sorting array), this histogram will be able to tell me whether the dataset distribution is normal, inverted normal, weighted high, weighted low, or something else entirely, meaning that I can make predictions about the shape of the curve formed by the sorted set. This is clearly a normal distribution that is weighted high, so I can predict the curve and use the corresponding equation to shift my list items accordingly:

curvy.png

 

As shown, the new curve is generated by drawing through the average value within each sub-list (I did it in MS Paint but of course a computer would do it mathematically). I'm sure that some empirical testing could yield a good measure of how many sub-lists to use based on the size of the dataset to minimize the amount of time it takes to reach a trivially sortable state; using 5 sub-lists for a 30-item list got this particular dataset there in one iteration.
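Here's one way I could imagine doing that step in code, as a sketch rather than a tested algorithm: track each coarse slot's count and mean during the counting pass, turn those into (value, position) breakpoints using cumulative counts, and then predict each item's position by piecewise-linear interpolation through those breakpoints instead of through a single straight line. The names curved_positions and position are mine:

```python
def curved_positions(values, k=5):
    """Predict each item's sorted position from a k-slot histogram of counts and means."""
    lo, hi, n = min(values), max(values), len(values)
    if lo == hi:
        return [0] * n                          # all values equal; everything maps to slot 0
    width = (hi - lo) / k

    counts, sums = [0] * k, [0.0] * k
    for y in values:                            # counting pass: counts and running sums
        i = min(int((y - lo) / width), k - 1)
        counts[i] += 1
        sums[i] += y

    # breakpoints: (mean value of each non-empty slot, its cumulative midpoint position)
    xs, ps = [float(lo)], [0.0]
    seen = 0
    for c, s in zip(counts, sums):
        if c:
            xs.append(s / c)
            ps.append(seen + c / 2)
        seen += c
    xs.append(float(hi))
    ps.append(float(n))

    def position(y):
        # piecewise-linear interpolation of the predicted position for value y
        for j in range(len(xs) - 1):
            x0, x1, p0, p1 = xs[j], xs[j + 1], ps[j], ps[j + 1]
            if y <= x1:
                return p0 if x1 == x0 else p0 + (y - x0) * (p1 - p0) / (x1 - x0)
        return ps[-1]

    return [round(position(y)) for y in values]
```

From there, placement and recursion would proceed exactly as in the linear case, with the curved prediction deciding which slot each item lands in.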

It feels like this could be a very fast sorting algorithm for many real-world datasets. Unfortunately I don't quite have the math skills to figure out how fast it would be. Anybody here have ideas? Is this similar to other sorting algorithms I just don't know about?


So... you look at the element value, see where it falls on the curve, and then use the corresponding x-value to place it in a new array that uses floats as index values? Wouldn't you then have to sort that list as well to get things in the right order? Also, how expensive is it to do the reverse lookup on the curve?

I'm not saying you're wrong, but sorting is one field in computer science that has been exhaustively researched. It's also way beyond my capacity to understand, but my suspicion is that there's just a step with a "hidden cost" involved that makes the algorithm in practice not as fast as you think.


It looks like you still have a best case of O(n), a worst case of O(n^2), and an average case of O(n*log(n)), just with different constants applied to those calculations.

You may have managed to trade a 2x cost on your best case for a smaller number on your average case (perhaps 0.5x), but the big O would be the same.


27 minutes ago, Kerbart said:

So... you look at the element value, see where it falls on the curve, and then use the corresponding x-value to place it in a new array that uses floats as index values? Wouldn't you then have to sort that list as well to get things in the right order?

For linear curvesort, the new array doesn't need floats as index values; the indices would be ints and the array would be roughly the same length as your original list. And you don't need to sort it; the whole idea is that you are moving each element in your original list, one by one, directly into the index on the new array where it SHOULD sit once the list is sorted. Then you can just read the new array from start to finish to get the sorted list.

A good real-world example would be if you had the pages to a book (let's say pages 25-34) and they were all jumbled up, and you wanted to arrange them properly. You could leaf through and sort them page by page (bubble sort), but the easier way to do it would be to eyeball 10 page-sized spaces on your table, put each page in its proper space, then pick them all up in sequence. For example, if the first page was 31, you'd take 31-25 = 6 and put that page in the 6th space on the table. If the second page was 34, you'd take 34-25 = 9 and put that page in the ninth space, and so forth.

But of course it's rare that you need to sort a series of ordinal numbers. Let's suppose it's more complicated: you need to put a stack of invoices in order from lowest billing to highest billing. If you're clever, you might flip through the stack quickly and note that (a) there are 10 total invoices, (b) the lowest invoice amount is $201, and (c) the highest invoice amount is $5,022. The difference between $5,022 and $201 is $4,821, and since there are 10 total invoices, you know that on average there will be $482.10 between individual invoices. Assuming you're REALLY good at mental math, you decide to divide each value by $482.10 in your head in order to space them out on the table by that amount. The first invoice is for $3,016, and 3016/482.10 = 6.3, which rounds to 6, so you move it to array_table[6]. The next invoice is for $3,715, and 3715/482.10 = 7.7, which rounds to 8, so you move it to array_table[8], and so forth:

invoice-sort.png

When you encountered $2,659, the math told you to go to array_table[6], but when you got there, there was already something there. So you group them together and move on. Same with $4,086 going to array_table[8].

Then you can simply scoop up the whole list, running across your table, skipping any empty spots, and sub-sorting any stacks you encounter. Nothing further needed.

(Note that when I say "table" here I mean a literal table. Probably should have said "desk" instead.)

27 minutes ago, Kerbart said:

Also, how expensive is it to do the reverse lookup on the curve?

That's a good question. The computational cost to define a straight line is negligible, but the computational cost to define a curve is not. I'm not sure where the breaking point is, honestly.

I wonder if it would be just as efficient (or nearly so) to make the first pass without actually writing the values, merely keeping track of the running average of each sub-space with more than one value in it, and then define a series of straight lines linking those averaged points. 

1 hour ago, Terwin said:

It looks like you still have a best case of O(n), a worst case of O(n^2), and an average case of O(n*log(n)), just with different constants applied to those calculations.

I think that I may have a worst case of O(n^1.5). Still not as good as a worst case of O(n*log(n)), but better than O(n^2).


The more I think about this sorting method, the more it looks like a hash sort.

The linear method might even be a standard hashing function.

The only real difference that I can see is that you are sorting into a smaller target array.

I believe that hash sorts work better when your destination array is large enough to be sparsely populated in order to reduce collisions.

 


On 10/6/2023 at 1:58 AM, monophonic said:

A very interesting approach. It feels like your approach would share a lot of traits with radix sort. But the ability to tweak the curve to fit the expected data set might provide an advantage in some uses.

https://en.wikipedia.org/wiki/Radix_sort

Well, looking at radix sort led me to the actual bucket sort, which is essentially what I proposed, at least in the linear curvesort case.

While unsourced, the Wikipedia article notes in passing:

If the input distribution is known or can be estimated, buckets can often be chosen which contain constant density (rather than merely having constant size). This allows O(n) average time complexity even without uniformly distributed input.

That seems very close to what I was proposing, where the initial low-k bucket sort is used to estimate the distribution. It's also reminiscent of a note concerning radix sort:

Computerized radix sorts had previously been dismissed as impractical because of the perceived need for variable allocation of buckets of unknown size. Seward's innovation was to use a linear scan to determine the required bucket sizes and offsets beforehand, allowing for a single static allocation of auxiliary memory.

Seems quite similar on the whole. Both appear to share a relationship to counting sort.

