Lisias

Members
  • Posts

    7,364
Everything posted by Lisias

  1. But that would mangle the bitstream, breaking one of the premises of the project: absolutely no derivatives. Whatever I write to the WARC file must be exactly what Forum sent me. I'm currently working on some pywb extensions that will help me with this task. Your idea is very good, to tell the truth - I'm wondering if I can use it somehow in the extensions I'm writing now. At the very worst, I think your idea would work beautifully as an output filter on the pywb replay! https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/issues/8
  2. Lisias

    LLC’s

    Given the current trend, soon there will be no gaming companies left to be sued in the first place...
  3. Fun Facts:

    The very first publicly available post is dated 2011-0618:2349 (@HarvesteR was sleepless that day, huh?)
    Forum reached 1.000 posts at 2011-0724:2203
    Forum reached 10.000 posts at 2012-0428:1833
    Forum reached 100.000 posts at 2015-0218:0917
    Forum reached 200.000 posts at 2021-0201:1321

    Forum took about a month to reach 1.000 posts, 10 months to reach 10k posts and about 4 years (a bit less) to reach 100K posts. And then it took another 6 years to reach 200K posts - slightly slower, but not that much... This kinda matches the Steam Charts: this Forum was pretty crowded up to 2015, more or less matching the concurrent players on Steam Charts.

    You see, we will be able to do some interesting stats using data from the Forum - having all that data offline on your hard disk lets us munch the data way faster, and without hitting Forum all the time.

    NEWS FROM THE FRONT

    I came to terms with the fact that a collaborative, decentralized effort for scraping the Forum would degrade relatively fast into everybody scraping everything, and then someone deduplicating the WARC files offline.

    The reason is that Forum doesn't exactly favour our efforts, due to the way it handles timestamps on the posts! Instead of a full timestamp, Forum uses relative time deltas (seconds ago, minutes ago, days ago, etc.) and this changes every second, changing the page contents, changing the digest, and making it pretty hard to deduplicate them. At least until the page/comment reaches an age at which Forum decides to use full timestamps.

    For people that love gory details, this is the HTML tag from the previous post above:

    <time datetime="2024-07-31T21:59:13Z" title="07/31/2024 06:59 PM" data-short="16 hr">16 hours ago</time>

    This means that this very page will render differently in an hour, even if I had not added this post now.

    So the digest (on redis) deduplication ends up being pretty useless at this time for anything newer than thread 222455, as anything newer will start to get those pesky time deltas - exactly where the work will be focused - defeating the purpose of the idea in the first place.

    So the idea is degenerating from a Federated approach to a classic Master/Slaves one, which means that someone will need to host that Master, and that some kind of authentication method will be needed to defend against adversarial slaves pretending to help... Trying to find my way through it.
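To make the digest problem concrete, here is a minimal Python sketch. It assumes (this is not how the project's crawler works, as far as the posts say) that a separate lookup key could be computed over a normalized copy of the page, while the bytes written to the WARC stay exactly as received. The function names are hypothetical, and treating only the `datetime` attribute as stable is an assumption based on the single tag quoted above.

```python
import hashlib
import re

# Matches a whole <time> element; group 1 holds its attributes.
TIME_TAG = re.compile(r"<time\b([^>]*)>.*?</time>", re.DOTALL)
DATETIME_ATTR = re.compile(r'datetime="([^"]*)"')


def normalize_time_tags(html: str) -> str:
    """Reduce every <time> element to its absolute `datetime` attribute."""
    def stable(match: re.Match) -> str:
        found = DATETIME_ATTR.search(match.group(1))
        if found is None:
            return match.group(0)  # No absolute timestamp: leave it untouched.
        return f'<time datetime="{found.group(1)}"></time>'
    return TIME_TAG.sub(stable, html)


def dedup_key(payload: bytes) -> str:
    """SHA-1 over the normalized payload; a lookup key only, never stored content."""
    text = payload.decode("utf-8", errors="replace")
    return hashlib.sha1(normalize_time_tags(text).encode("utf-8")).hexdigest()


# The same post fetched one hour apart (second sample is illustrative).
then = (b'<p>post</p><time datetime="2024-07-31T21:59:13Z" '
        b'title="07/31/2024 06:59 PM" data-short="16 hr">16 hours ago</time>')
later = (b'<p>post</p><time datetime="2024-07-31T21:59:13Z" '
         b'title="07/31/2024 06:59 PM" data-short="17 hr">17 hours ago</time>')

print(hashlib.sha1(then).hexdigest() == hashlib.sha1(later).hexdigest())
print(dedup_key(then) == dedup_key(later))
```

The raw digests differ because of the relative delta, while the normalized keys agree. Whether such a key could be plugged into the existing redis index is an open question; the sketch only shows why the plain payload digest cannot work for recent threads.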
  4. Lisias

    LLC’s

    Well... There's at least one positive point in favor of our LLCs for now... CrowdStrike sued by shareholders over global outage https://www.bbc.com/news/articles/cy08ljxndr4o With an LLC, there are way fewer people going to sue you, at the very least. If this trend catches on and reaches the gaming industry, some interesting things are going to happen...
  5. ANNOUNCE

    Release 2.2.0.0 is available for downloading, with the following changes:

      • Adds (transparent) support for Kopernicus
      • Adds a blacklist to prevent some bodies from being flared (and labeled)
      • Allows customising the Fly Over Labels for vessels and bodies
      • Allows per savegame settings
      • Finally fixes that pesky Settings Dialog being too tall.

    Kopernicus support wasn't tested yet, but Stock is still working fine. [Tested on Kopernicus!! #HURRAY!!!] Being able to choose the Fly Over Labels size and font is nice! See OP for the links.

    - - - - -

    This Release will be published using the following Schedule:

      • GitHub. Right Now.
      • CurseForge. Right Now.
      • SpaceDock. Right Now.
  6. On the other hand... We can create our own customized command seat and try to avoid some of the drawbacks by manipulating some of the PART settings, like:

    maximum_drag = 0.05
    minimum_drag = 0.05
    angularDrag = 1
    crashTolerance = 6
    breakingForce = 20
    breakingTorque = 20
    maxTemp = 1200 // = 2900

    By making this customized command seat incredibly resistant to torque and crashes (and removing drag if inside another part), you may be able to keep the Kerbals in their seats during high G maneuvers and hard bumps. You will lose all those marvelous instruments from IVA mods, however. What you gain in external looks, you lose twice in realism while flying first person.
  7. I've been desperately debugging a problem that doesn't exist since Monday, due to a borked deploy on the QAS servers. Apparently, an important client was using QAS for some critical task, and things escalated pretty quickly, pressuring me to quickly fix the problem at any cost - which I did, and at a cost. Those costs screwed other clients, who then escalated things again pretty quickly, pressuring me to quickly fix the problem at any cost - which I refused. This time I will diagnose the problem properly, and the clients will have to wait. This is the freaking QAS, it's the reason we have it: to detect nasty problems before they reach Production.

    After another 3 painstaking hours of diagnosing, the root cause was (not surprisingly) the half baked fix for the borked deploy on Monday, followed by an unexpected HTTP 30x that one partner thought would be a good idea to issue instead of an HTTP 404 or plain HTTP 405 on a URL - as if automated processes would be interested in reading their help page. Dude...

    The fix for this second borked fix? Redeploying the code that was deployed on Monday!!!!!

    I code reviewed everything, reran the unit tests, you name it, and nothing wrong was detected. I reran the integration tests on the DEV servers, and everything is working as expected. So the problem was environmental.

    TL;DR: we should not have moved the code from DEV to Public QAS without probing it first on our internal QAS (yeah, I use an Internal QAS before shipping code to Public QAS), but since that project was being developed by a 3rd party without access to our internal QAS servers, the P.O. decided to push the code to Public QAS. And this, my friends, is the reason we do staging deployment and should not deviate from due development process.

    Is CrowdStrike hiring?

    --- -- - POST EDIT - -- ---

    Now I will spend the rest of the day debugging the servers' configuration. If this fails (and it probably will), I will spend the rest of the week monitoring the requests and responses from our partners and clients to understand what the hell happened on Monday, which I still don't understand.
  8. Well... reworking the interior overlays mask should do the trick. @ColdJ, what do you think?
  9. Well... Now, we literally have a real life Kerbalism on Reddit!
  10. I think it's important to explain something I just realized today! The free versions were the ones up to 0.13.3, and from version 0.14 onward KSP became a paid product. 0.18.3, 1.0 and 1.3.1 were "forked" into free demos (all of them with restrictions compared with the full, paid release), but other than these 3, everything else is a paid version! Well... Better late than never, now I see the problem with mentioning anything newer than 0.13.3 here on Forum!
  11. I find your lack of faith disturbing...
  12. Thx! Yes! Nope. The Legislation is saying it, I'm just the messenger!

    First, we need to understand the difference between "authorship" and "ownership". They are two different things.

    Ownership is plain and straightforward: it's the exact same concept as owning a videogame console or a car. You have a set of rights over a property by owning it.

    Authorship is a bit less straightforward, but in essence it is "who wrote the damned thing in the first place". You also have a (different) set of rights over an intellectual property by authoring it.

    In the USA, these two concepts are somewhat blurred because under USA legislation it's possible to "transfer" the authorship of an IP to someone else - for example, ghost writers. In some other countries (like mine), authorship is nontransferable, period. So a ghost writer will forever be the author of the book and nobody can, legally, take that from him. But the ghost writer can still sell all the ownership rights to whoever is paying him - so for all that matters from the commercial point of view, it works the same.

    About ownership: you automatically own anything you author by default, and additional terms of service or contracts are needed to transfer such rights. It's the reason one needs to agree with Forum's ToS to post here, otherwise Forum would be subject to the whims of copyright trolls around the World.

    And what does the ToS say? So:

    UGC is "user generated content", or your posts.

    "You retain whatever rights, if any, you may have under applicable law in Your UGC. " Forum is not claiming ownership of your UGC. PERIOD. Forum never did since I've been here, and I've been here since 2018.

    "you hereby grant us an irrevocable, worldwide, royalty-free, non-exclusive and sublicensable right to" (yada yada yada) So, okay, they will use your UGC the way they want, and they may make some money from it, and they don't owe you a penny. But that's all, you are still the author and the owner of your content!

    And that's it. All your posts here can be used by TTWO the way they want as a licensee, because you granted such a license to them by posting things here. But this license, besides being irrevocable, is also non-exclusive, i.e., they are not claiming that only they have the right to do such things with your posts. You are entitled to give such rights to anyone else if you want - but you can't revoke the rights you already granted to Forum.

    And it's as simple as that, once you decode all that legalese into plain English. All that drama was FUD. There were reasons to criticize TTWO at that time (and now we know it pretty well), but this one - definitely - wasn't one of them, and I dare to say that this helped to sweep the real troubles under the carpet.
  13. Oh, now I see! I gave it a peek anyway. I looked in "my" WARCs for something from discord that would not be working anymore. Found this:

    https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png

    This link currently leads to a 404 page from discord: This content is no longer available.

    Your code would change it to:

    https://web.archive.org/web/0if_/https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png

    And IA would rewrite it to:

    https://web.archive.org/web/20240616102212if_/https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png

    But this link returns a "This content is no longer available" error message all the same, because IA revisited this link after it had expired. There's no previous visit for this link, so that specific content is lost for good now.

    On content that IA fetched successfully at least once, like this one (IA fetched it in the past, but on revisiting it got a 404):

    https://cdn.discordapp.com/attachments/252199919316631553/261800273071177728/image.jpg

    Your plugin should work fine.

    Perhaps the script should fail nicely when IA returns a 404? The rationale is showing the user that the extension is working, and that the problem is IA not having the image either. The way it works now, people will come back saying the extension is not working!
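The rewrite and the "fail nicely" suggestion can be sketched in a few lines of Python. This is not the plugin's code (which is not shown in this thread), only the logic. The `0if_` prefix is the one quoted above. The `head_status` callable is a hypothetical stand-in for whatever HTTP check the real extension would use, injected so the sketch runs without network access.

```python
from typing import Callable, Optional, Tuple

WAYBACK_PREFIX = "https://web.archive.org/web/0if_/"


def to_wayback(url: str) -> str:
    """Prefix a URL so that IA resolves the `0` timestamp to a capture."""
    return WAYBACK_PREFIX + url


def resolve_image(url: str, head_status: Callable[[str], int]) -> Tuple[Optional[str], str]:
    """Return (archived_url, message); archived_url is None when IA has nothing usable."""
    archived = to_wayback(url)
    status = head_status(archived)
    if status == 404:
        # Tell the user the extension did its job and the capture is the problem.
        return None, "The Internet Archive has no working copy of this image."
    return archived, "Served from the Internet Archive."


dead = "https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png"
print(to_wayback(dead))
print(resolve_image(dead, lambda _url: 404))
```

Returning a message alongside the URL lets the caller render a placeholder instead of a broken image, which is the point of the suggestion: the user sees that the lookup happened and where it failed.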
  14. Things can be way simpler with pywb. Just add the Internet Archive as a Collection in your proxy, and it will hit IA for anything missing in your Collections. As a matter of fact, the hard part is hitting IA only for resources sourced from Forum, as this thing is pretty broad.

    Ideally, the pywb proxy should try to hit a source under this scheme:

      1. Check if it exists in the local WARCs. If yes, serve it and finish the request.
      2. Check if it exists in the live web. If yes, serve it and finish the request.
      3. Check if it exists in the IA collection. If yes, serve it and finish the request.
      4. Otherwise, throw a 404.

    I updated the issue.
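The lookup order in the scheme above can be sketched as a plain fallback chain. This is deliberately not pywb configuration or pywb API. It is a self-contained Python illustration where each source is a hypothetical lookup function returning the body or `None`, and the URLs are made up for the demo.

```python
from typing import Callable, Iterable, Optional, Tuple

# A source is a (name, lookup) pair; lookup returns the body or None on a miss.
Source = Tuple[str, Callable[[str], Optional[bytes]]]


def serve(url: str, sources: Iterable[Source]) -> Tuple[int, str, bytes]:
    """Try each source in order and stop at the first hit; 404 if all miss."""
    for name, lookup in sources:
        body = lookup(url)
        if body is not None:
            return 200, name, body
    return 404, "none", b""


# In-memory stand-ins for the three sources (illustrative data only).
local_warcs = {"https://forum.example/topic/1": b"from local WARC"}
live_web = {"https://forum.example/topic/2": b"from live web"}
ia_collection = {"https://forum.example/topic/3": b"from IA"}

CHAIN = [
    ("local", local_warcs.get),
    ("live", live_web.get),
    ("ia", ia_collection.get),
]

for topic in (1, 2, 3, 4):
    status, source, _body = serve(f"https://forum.example/topic/{topic}", CHAIN)
    print(topic, status, source)
```

Keeping the order in a single list makes the policy easy to change, for example dropping the live web step once the Forum is gone.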
  15. Yes. And nice idea, I'm registering it on the issue tracker so I don't forget about it: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/issues/5

    I want to stress that this last deliverable is just the first one, and I'm pretty sure I made some mistakes - I tend to err on the safe side, so I probably lost something instead of archiving something unwanted by accident. I'm especially concerned about copyrights, which is the reason I split the content into 4 different WARC files, shielding the really important content.

    I probably could have used some more time to polish it a bit, but - frankly - good enough is enough for an intermediate deliverable. I need more brains on this project, and I would not get them by trying to do everything right (and failing) by myself. Thanks for jumping in!

    Which reminds me: a WARC file is a pretty simple, straightforward format, and it's really easy (although terribly cumbersome in the extreme cases) to manipulate its contents using grep, sed, perl or even the old and faithful mcedit. So rest assured that we can forge WARC files to inject content into the archives. Obviously, this will be done in a 5th WARC file to avoid tampering with the "official" ones.

    There are really very few things we can't accomplish with this (somewhat convoluted, I admit) scheme of scraping. The sky, bandwidth and disk space are the limit.
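To show how simple the format is, here is a minimal Python sketch that walks the records of an uncompressed WARC stream: a `WARC/x.y` version line, `Name: value` header lines, a blank line, then exactly `Content-Length` bytes of content followed by two CRLFs. It is not the project's tooling and it ignores folded header lines. The `make_record` helper is hypothetical and leaves out mandatory fields such as `WARC-Record-ID` and `WARC-Date` for brevity.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, content) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # End of file.
        if not version.strip():
            continue  # Blank lines separating two records.
        if not version.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # Blank line: the header block is over.
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        yield headers, stream.read(int(headers["Content-Length"]))


def make_record(uri: str, payload: bytes) -> bytes:
    """Build a stripped-down demo record (illustration only, not spec-complete)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


demo = io.BytesIO(
    make_record("https://forum.example/a", b"first")
    + make_record("https://forum.example/b", b"second\r\nline")
)
for headers, content in iter_warc_records(demo):
    print(headers["WARC-Target-URI"], len(content))
```

Because the content length is declared up front, the payload is read as opaque bytes, so it can contain blank lines or anything else without confusing the reader.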
  16. Something to try to do on KSP, for sure!!
  17. NEWS FROM THE FRONT

    Yeah, baby, finally a deliverable! https://archive.org/details/KSP-Forum-Preservation-Project

    This torrent has the following contents (at this time), apart from the informational and security boilerplate:

    -r--r--r-- 1 lisias staff 941M May 7 2023 forum.kerbalspaceprogram.com-00000.warc.lrz
    -r--r--r-- 1 lisias staff 413M May 7 2023 forum.kerbalspaceprogram.com-00001.warc.lrz
    -r--r--r-- 1 lisias staff 504M Jul 28 09:52 forum.kerbalspaceprogram.com-202407.warc.lrz
    -r--r--r-- 1 lisias staff 12M Jul 28 09:49 forum.kerbalspaceprogram.com-images-202407.warc.lrz
    -r--r--r-- 1 lisias staff 1.1G Jul 27 09:32 forum.kerbalspaceprogram.com-media-202407.warc.lrz
    -r--r--r-- 1 lisias staff 307K Jul 26 12:48 forum.kerbalspaceprogram.com-styles-202407.wrc.lrz
    -r--r--r-- 1 lisias staff 24M Jul 28 21:41 redis.dump.json.lrz

    For the sake of curiosity, here are the same files uncompressed:

    -r--r--r-- 1 deck deck 41G May 7 2023 forum.kerbalspaceprogram.com-00000.warc
    -r--r--r-- 1 deck deck 20G May 7 2023 forum.kerbalspaceprogram.com-00001.warc
    -r--r--r-- 1 deck deck 23G Jul 28 09:52 forum.kerbalspaceprogram.com-202407.warc
    -r--r--r-- 1 deck deck 19M Jul 28 09:49 forum.kerbalspaceprogram.com-images-202407.warc
    -r--r--r-- 1 deck deck 1.2G Jul 27 09:32 forum.kerbalspaceprogram.com-media-202407.warc
    -r--r--r-- 1 deck deck 1.7M Jul 26 12:48 forum.kerbalspaceprogram.com-styles-202407.warc
    -r--r--r-- 1 deck deck 236M Jul 28 21:41 redis.dump.json

    Except for images and movies, we get better than a 40 to 1 compression ratio on the page WARCs using lrz - you just can't beat this. The Internet Archive's infrastructure costs thank you for your understanding!

    You will also find some minimal documentation, as well as the crypto boilerplate I'm using to guarantee integrity and origin (me):

    -r--r--r-- 1 lisias staff 2.9K Jul 28 22:15 README.md
    -r--r--r-- 1 lisias staff 1.5K Jul 28 20:14 allowed_signers
    -r--r--r-- 1 lisias staff 2.9K Jul 28 20:27 allowed_signers.sig
    -r--r--r-- 1 lisias staff 3.0K Jul 28 19:58 forum.kerbalspaceprogram.com-00000.warc.lrz.sig
    -r--r--r-- 1 lisias staff 3.0K Jul 28 19:58 forum.kerbalspaceprogram.com-00001.warc.lrz.sig
    -r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-202407.warc.lrz.sig
    -r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-images-202407.warc.lrz.sig
    -r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-media-202407.warc.lrz.sig
    -r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-styles-202407.wrc.lrz.sig
    -r--r--r-- 1 lisias staff 2.9K Jul 28 21:48 redis.dump.json.lrz.sig
    -r-xr-xr-x 1 lisias staff 209 Jul 28 20:28 verify.sh

    Have openssh installed, run the verify.sh script, and it will validate the files' integrity.

    What to do with these WARC files is up to you, but you will find a lot of information about them in the Project's repository: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project (the project issue tracker is active)

    Unfortunately, proper documentation is - as usual - lacking, but I'm working on it. You will find, at the very least, links to the tools I'm using and some hints about how to use them, as well as the configuration files I'm using for them. You will also find the source code for the crawler there. More to come.

    Now, some bad news: I can't guarantee the WARC files above are up to date because, well, they aren't. Scraping Forum is a race against the clock, and the clock always wins - by the time you scrape the last page, there's a lot of new content to be scraped again. So I'm not even trying at this moment.

    The whole process, to tell you the truth, is terribly slow once you have the whole site in the database and you are just updating things. Even some pretty aggressive optimizations (such as caching the already visited pages in the spider's memory and avoiding them no matter what) didn't improve the situation to a point I would find comfortable. I'm currently studying how the deduplication works in the hope of finding some opportunities for performance improvements.

    Now, for the next steps:

      • Documentation
          • Proper documentation
      • Setting up a prototype for a content server
          • Including how to create a Federation of trusted content mirrors to round robin the requests, sharing the burden
      • Cooking something to allow collaborative scraping of this site
      • Setting up a Watch Dog to monitor the site's health (from an external point of view), so we can determine the best times to scrape it without causing trouble.

    Cheers!
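For anyone who wants to check the compression figures from the two listings in this post, here is a small Python sketch. It assumes the `ls -h` sizes use binary multiples (1024-based), and since those sizes are already rounded, the ratios are approximate.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> float:
    """Convert an `ls -h` style size such as '941M' or '1.1G' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def ratio(compressed: str, uncompressed: str) -> float:
    """How many uncompressed bytes each compressed byte stands for."""
    return parse_size(uncompressed) / parse_size(compressed)


# (compressed, uncompressed) pairs copied from the two listings in this post.
FILES = {
    "00000": ("941M", "41G"),
    "00001": ("413M", "20G"),
    "202407": ("504M", "23G"),
    "images-202407": ("12M", "19M"),
    "media-202407": ("1.1G", "1.2G"),
    "styles-202407": ("307K", "1.7M"),
    "redis.dump.json": ("24M", "236M"),
}

for name, (small, big) in FILES.items():
    print(f"{name:16} {ratio(small, big):5.1f} : 1")
```

The three page WARCs come out above 40 to 1, while the styles WARC and the redis dump compress noticeably less, and images and media barely compress at all.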
  18. Lisias

    LLC’s

    Do you think 737MAX was plain bad luck?
  19. And another pearl!! https://kerbalspace.tumblr.com/page/44 !! https://kerbalspace.tumblr.com/page/41
  20. There's this thread: But no 0.17. [snip] === == = POST EDIT = == === LOOK WHAT I FOUND!!! https://kerbalspace.tumblr.com/post/6711201090/ksp-dev-blogs-online
  21. Now that I have stabilized the scraping tool, it ended up also being a nice watchdog! I've been running it continuously since 2024-07-27 18:07:36 GMT-3, and this is what I got:

    From 2024-07-27 18:07:36 to 2024-07-28 13:45:36 nominal
    From 2024-07-28 13:46:24 to 2024-07-28 14:14:36 severe turbulence
    From 2024-07-28 14:18:36 to 2024-07-28 15:04:36 flawless
    From 2024-07-28 15:05:36 to 2024-07-28 16:41:35 severe turbulence
    From 2024-07-28 16:41:36 to 2024-07-28 17:54:36 some turbulence
    From 2024-07-28 18:00:36 to 2024-07-28 18:32:36 (right now) nominal

    I'm extracting about 60 pages/minute, so the data above is pretty accurate, without false negatives. You know, there's interesting data that can be extracted from these logs. On the next iteration (202408) I will preserve the logs...
  22. I'm experiencing them at about 14:00 GMT-3 on working days, but I can't say if it's only at that time because I'm usually working, and only now and then come to my personal rig to check the scraping logs. It happens that 14:00 is near the end of lunch time, and I still have some time to burn on a good day. On weekends, they happen almost all day.