Everything posted by Lisias

  1. I think it's important to explain something I just realized today! The free versions were the ones up to 0.13.3, and from version 0.14 onwards KSP became a paid product. 0.18.3, 1.0 and 1.3.1 were "forked" into free demos (all of them with restrictions compared to the full, paid release), but other than these 3, everything else is a paid version! Well... better late than never; now I see the problem with mentioning anything newer than 0.13.3 here on the Forum!
  2. I find your lack of faith disturbing...
  3. Thx! Yes! Nope. The legislation is saying it, I'm just the messenger! First, we need to understand the difference between "authorship" and "ownership". They are two different things. Ownership is plain straightforward: it's the exact same concept as owning a videogame console or a car. You have a set of rights over a property by owning it. Authorship is a bit less straightforward, but in essence it is "who wrote the damned thing in the first place". You also have a (different) set of rights over an intellectual property by authoring it. In the USA, these two concepts are somewhat blurred because under USA legislation it's possible to "transfer" the authorship of an IP to someone else - for example, ghost writers. In some other countries (like mine), authorship is nontransferable, period. So a ghost writer will forever be the author of the book, and nobody can legally take that from him. But the ghost writer can still sell all the ownership rights to whoever is paying him - so, for all that matters from the commercial point of view, it works the same. About ownership: you automatically own anything you author by default; additional terms of service or contracts are needed to transfer such rights. It's the reason one needs to agree to the Forum's ToS to post here, otherwise the Forum would be subject to the whims of copyright trolls around the world. And what does the ToS say? So: UGC is "user generated content", i.e. your posts. "You retain whatever rights, if any, you may have under applicable law in Your UGC." The Forum is not claiming ownership of your UGC. PERIOD. The Forum never has since I've been here, and I've been here since 2018. "you hereby grant us an irrevocable, worldwide, royalty-free, non-exclusive and sublicensable right to" (yada yada yada). So, okay, they will use your UGC the way they want, they may make some money from it, and they don't owe you a penny. But that's all; you are still the author and the owner of your content! And that's it.
All your posts here can be used by TTWO the way they want as a licensee, because you granted such a license to them by posting things here. But this license, besides being irrevocable, is also non-exclusive, i.e., they are not claiming that only they have the right to do such things with your posts; you are entitled to grant such rights to anyone else if you want - but you can't revoke the rights you already granted to the Forum. And it's as simple as that, once you decode all that legalese into plain English. All that drama was FUD. There were reasons to criticize TTWO at that time (and now we know it pretty well), but this one - definitely - wasn't one of them, and I dare to say that it helped to sweep the real troubles under the carpet.
  4. Oh, now I see! I gave it a peek anyway. I looked in "my" WARCs for something from Discord that would not be working anymore. Found this: https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png This link currently leads to a 404 page from Discord: "This content is no longer available." Your code would change it to: https://web.archive.org/web/0if_/https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png And IA would rewrite it to: https://web.archive.org/web/20240616102212if_/https://cdn.discordapp.com/attachments/198576615658094592/235746983732707328/unknown.png But this link returns the same "This content is no longer available" error message, because IA had revisited the link after it had expired. There's no earlier capture of this link, so that specific content is lost for good now. On content that IA did fetch successfully at least once, like this one (IA fetched it in the past, but on revisiting got a 404): https://cdn.discordapp.com/attachments/252199919316631553/261800273071177728/image.jpg your plugin should work fine. Perhaps the script should fail nicely when IA returns a 404? The rationale is showing the user that the extension is working, and that the problem is IA not having the image either. The way it works now, people will come back telling you the extension is not working!
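One way to fail nicely is to ask the Wayback Machine's Availability API first, and only rewrite the link when IA actually holds a good capture. A minimal sketch (the helper names are mine; the endpoint is the documented archive.org/wayback/available one):

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available?url="

def wayback_redirect(original_url):
    """Build the 'latest snapshot' Wayback URL used for the rewrite."""
    return "https://web.archive.org/web/0if_/" + original_url

def closest_snapshot(api_response):
    """Extract the closest snapshot URL from an Availability API response,
    or None when IA never captured the resource successfully."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

def check_and_rewrite(original_url):
    """Query the Availability API; return a snapshot URL only when one
    exists, so the caller can show an honest 'not archived' message."""
    with urlopen(AVAILABILITY_API + quote(original_url, safe="")) as resp:
        data = json.load(resp)
    return closest_snapshot(data)
```

With this, a dead Discord link that IA never captured yields `None` instead of a second 404, and the extension can tell the user what actually happened.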
  5. Things can be way simpler with pywb. Just add the Internet Archive as a source in your proxy's collection, and it will hit IA for anything missing in your local collections. As a matter of fact, the hard part is to hit IA only for resources sourced from the Forum; this thing links very broadly. Ideally, the pywb proxy should try to resolve a resource under this scheme:
  • Check if it exists in the local WARCs. If yes, serve it and finish the request.
  • Check if it exists on the live web. If yes, serve it and finish the request.
  • Check if it exists in the IA collection. If yes, serve it and finish the request.
  • Throw a 404.
I updated the issue.
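For reference, pywb can express that fallback order declaratively with a sequential collection. A sketch of a config.yaml under the scheme above (paths and collection names are illustrative; check the pywb docs for the exact remote-source syntax):

```yaml
collections:
  forum:
    sequence:
      # 1. local WARCs first
      - index_paths: ./collections/forum/indexes
        archive_paths: ./collections/forum/archive
        name: local
      # 2. then the live web
      - index: $live
        name: live
      # 3. then the Internet Archive; a miss everywhere yields the 404
      - index: memento+https://web.archive.org/web/
        name: ia
```

Each source is consulted in order and the first hit is served, which is exactly the local → live → IA → 404 cascade described above.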
  6. Yes. And nice idea, I'm registering it on the issue tracker to avoid forgetting about it: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project/issues/5 I want to stress that this last deliverable is just the first one, and I'm pretty sure I made some mistakes - I tend to err on the safe side, so I probably lost something instead of archiving something unwanted by accident. I'm especially concerned about copyrights, that being the reason I split the content into 4 different WARC files, shielding the really important content. I probably could have used some more time to polish it a bit, but - frankly - good enough is enough for an intermediate deliverable. I need more brains on this project, and I would not get them by trying to do everything right (and failing) by myself. Thanks for jumping in! Which reminds me: a WARC file is a pretty simple, straightforward format; it's really easy (besides terribly cumbersome in the extreme cases) to manipulate its contents using grep, sed, perl or even the old and faithful mcedit. So rest assured that we can forge WARC files to inject content into the archives. Obviously, this will be done in a 5th WARC file to avoid tampering with the "official" ones. There are really very few things we can't accomplish with this (somewhat convoluted, I admit) scheme of scraping. The sky, bandwidth and disk space are the limit.
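To illustrate how approachable the format is, here is a minimal sketch of WARC/1.0 framing in pure Python. It is illustrative only: real records also need WARC-Date, WARC-Record-ID and digests, and real tooling (warcio, for instance) handles gzip and validation:

```python
def forge_record(warc_type, uri, payload):
    """Build a minimal WARC/1.0 record: a header block, a blank line,
    Content-Length bytes of payload, then a CRLF CRLF terminator.
    (Not spec-complete: mandatory fields like WARC-Date are omitted.)"""
    header = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode()
    return header + payload + b"\r\n\r\n"

def read_records(data):
    """Walk a raw (uncompressed) WARC byte stream, returning a list of
    (headers_dict, payload) pairs, using Content-Length for framing."""
    pos, records = 0, []
    while True:
        end = data.find(b"\r\n\r\n", pos)
        if end < 0:
            break
        lines = data[pos:end].decode("utf-8", "replace").split("\r\n")
        # first line is the version line, e.g. "WARC/1.0"
        headers = dict(line.split(": ", 1) for line in lines[1:] if ": " in line)
        length = int(headers.get("Content-Length", "0"))
        body_start = end + 4
        records.append((headers, data[body_start:body_start + length]))
        pos = body_start + length + 4  # skip the record's trailing CRLF CRLF
    return records
```

Because framing is length-prefixed, injected records can simply be appended to a new file and merged at serving time, which is the "5th WARC" plan above.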
  7. Something to try to do on KSP, for sure!!
  8. NEWS FROM THE FRONT Yeah, baby, finally a deliverable! https://archive.org/details/KSP-Forum-Preservation-Project This torrent has the following contents (at this time), except for the informational and security boilerplate:
-r--r--r-- 1 lisias staff 941M May  7  2023 forum.kerbalspaceprogram.com-00000.warc.lrz
-r--r--r-- 1 lisias staff 413M May  7  2023 forum.kerbalspaceprogram.com-00001.warc.lrz
-r--r--r-- 1 lisias staff 504M Jul 28 09:52 forum.kerbalspaceprogram.com-202407.warc.lrz
-r--r--r-- 1 lisias staff  12M Jul 28 09:49 forum.kerbalspaceprogram.com-images-202407.warc.lrz
-r--r--r-- 1 lisias staff 1.1G Jul 27 09:32 forum.kerbalspaceprogram.com-media-202407.warc.lrz
-r--r--r-- 1 lisias staff 307K Jul 26 12:48 forum.kerbalspaceprogram.com-styles-202407.wrc.lrz
-r--r--r-- 1 lisias staff  24M Jul 28 21:41 redis.dump.json.lrz
For the sake of curiosity, here are the same files uncompressed:
-r--r--r-- 1 deck deck  41G May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-r--r--r-- 1 deck deck  20G May  7  2023 forum.kerbalspaceprogram.com-00001.warc
-r--r--r-- 1 deck deck  23G Jul 28 09:52 forum.kerbalspaceprogram.com-202407.warc
-r--r--r-- 1 deck deck  19M Jul 28 09:49 forum.kerbalspaceprogram.com-images-202407.warc
-r--r--r-- 1 deck deck 1.2G Jul 27 09:32 forum.kerbalspaceprogram.com-media-202407.warc
-r--r--r-- 1 deck deck 1.7M Jul 26 12:48 forum.kerbalspaceprogram.com-styles-202407.warc
-r--r--r-- 1 deck deck 236M Jul 28 21:41 redis.dump.json
Except for images and movies, we get a 40 to 1 compression ratio using lrz - you just can't beat this. The Internet Archive's infrastructure costs thank you for your understanding!
You will also find some minimal documentation, as well as the crypto boilerplate I'm using to guarantee integrity and origin (me):
-r--r--r-- 1 lisias staff 2.9K Jul 28 22:15 README.md
-r--r--r-- 1 lisias staff 1.5K Jul 28 20:14 allowed_signers
-r--r--r-- 1 lisias staff 2.9K Jul 28 20:27 allowed_signers.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:58 forum.kerbalspaceprogram.com-00000.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:58 forum.kerbalspaceprogram.com-00001.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-202407.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-images-202407.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-media-202407.warc.lrz.sig
-r--r--r-- 1 lisias staff 3.0K Jul 28 19:59 forum.kerbalspaceprogram.com-styles-202407.wrc.lrz.sig
-r--r--r-- 1 lisias staff 2.9K Jul 28 21:48 redis.dump.json.lrz.sig
-r-xr-xr-x 1 lisias staff  209 Jul 28 20:28 verify.sh
With openssh installed, run the verify.sh script and it will validate the files' integrity. What to do with these WARC files is up to you, but you will find a lot of information on the Project's repository: https://github.com/net-lisias-ksp/KSP-Forum-Preservation-Project (the project issue tracker is active). Unfortunately, proper documentation is - as usual - lacking, but I'm working on it: you will find at the very least links to the tools I'm using and some hints about how to use them, as well as the configuration files I'm using for them. You will also find the source code for the crawler there. More to come. Now, some bad news: I can't guarantee the WARC files above are up to date because, well, they aren't. Scraping the Forum is a race against the clock, and the clock always wins - by the time you scrape the last page, there's a lot of new content to be scraped again. So I'm not even trying at this moment.
The whole process, to tell you the truth, is terribly slow once you have the whole site in the database and you are just updating things. Even some pretty aggressive optimizations (such as caching in the spider's memory the pages already visited and avoiding them no matter what) didn't improve the situation to a point I would find comfortable. I'm currently studying how the deduplication works in the hope of finding some opportunities for performance improvements. Now, for the next steps:
  • Documentation
  • Proper documentation
  • Setting up a prototype for a content server, including how to create a Federation of trusted content mirrors to round-robin the requests, sharing the burden
  • Cooking something to allow collaborative scraping of this site
  • Setting up a Watch Dog to monitor the site's health (from an external point of view), so we can determine the best times to scrape it without causing trouble.
Cheers!
  9. Lisias

    LLC’s

    Do you think 737MAX was plain bad luck?
  10. And another pearl!! https://kerbalspace.tumblr.com/page/44 !! https://kerbalspace.tumblr.com/page/41
  11. There's this thread: But no 0.17. [snip] === == = POST EDIT = == === LOOK WHAT I FOUND!!! https://kerbalspace.tumblr.com/post/6711201090/ksp-dev-blogs-online
  12. Now that I have stabilized the scraping tool, it ended up being a nice watch dog too! I've been running it continuously since 2024-07-27 18:07:36 GMT-3, and this is what I got:
From 2024-07-27 18:07:36 to 2024-07-28 13:45:36 nominal
From 2024-07-28 13:46:24 to 2024-07-28 14:14:36 severe turbulence
From 2024-07-28 14:18:36 to 2024-07-28 15:04:36 flawless
From 2024-07-28 15:05:36 to 2024-07-28 16:41:35 severe turbulence
From 2024-07-28 16:41:36 to 2024-07-28 17:54:36 some turbulence
From 2024-07-28 18:00:36 to 2024-07-28 18:32:36 (right now) nominal
I'm extracting about 60 pages/minute, so the data above is pretty accurate, without false negatives. You know, there's interesting data that can be extracted from these logs. Next iteration (202408) I will preserve the logs...
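A report like the one above is easy to produce from a per-minute sample log by collapsing runs of equal status. A minimal sketch (the sample format and the `summarize` name are mine, not the actual tool's):

```python
from itertools import groupby

def summarize(samples):
    """Collapse consecutive (timestamp, status) samples into
    'From X to Y status' lines, like the report above."""
    report = []
    for status, run in groupby(samples, key=lambda s: s[1]):
        run = list(run)
        report.append(f"From {run[0][0]} to {run[-1][0]} {status}")
    return report
```

Feeding it one labeled sample per scraped page (about 60/minute here) yields the turbulence ranges directly, with one line per stretch of identical health.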
  13. I'm experiencing them at about 14:00 GMT-3 on working days, but I can't say whether it's only at that time, because I'm usually working and only now and then come to my personal rig to check the scraping logs. It happens that 14:00 is near the end of lunch time, and I still have some time to burn on a good day. On weekends, they happen almost all day.
  14. Yep, but most of them will be working, studying or commuting - so most of these awake people would not be browsing the Forum on a working day either. Interestingly enough, and again on a working day, the least bad time to hit the Forum's servers would indeed be about 16:00 GMT-5 (EST), because (assuming my axiom that most people hit the services between 19:00 and 22:00 local time is correct) at that time, the timezones inside the "playing time" window would cover some of the less densely populated areas of the world. So Gargamel is right about the time the IA scrapers hit the Forum being the least bad; he apparently just made a mistake in trying to explain why. Being awake is not enough; people need to have some time to burn in order to hit the Forum's servers with significant load. I think you got it right about most people being awake, but I think you made a mistake in assuming that these people would be available to hit the Forum while awake. They have a relatively tiny window in their waking hours to hit here in volume.
  15. Because the contractor's liability is limited to 100K! So if the contractor screws up the job, you will lose 900K USD, because their liability is only 100K. That's the very purpose of the LLC! Now it's up to you to take the risk or not. I'm losing you. I found this: https://www.doola.com/blog/llc-liability-protection/ Can you explain to me if this link is correct? Because if it is, it's exactly how a Sociedade Limitada (LTDA) works here in Brazil - but with a catch: frauds and crimes (mostly fiscal) "break" the limitation and can reach your personal assets, and the link above appears to say the same for LLCs.
  16. Limited Liability protects the owner's personal assets from the company, granted. But the company needs to have a thingy called share capital ("capital social" in pt-br), it being the maximum liability the LLC is responsible for. Hint: the company's owner needs to foot the share capital - so, if the worst happens, they just lose that money. Now, it's up to the clients to decide whether to do business with such an LLC, right? Would you contract an LLC for a job of 1M USD when the share capital of that company is 100K USD? Well, I would not - nor would anyone who knows how to do business. So, still, the LLC owners have their skin in the game - they need to foot the share capital of the LLC, and they risk losing that money if the company goes kaput.
  17. Taking the hint, and elaborating on it (I have some time to burn right now... ). Most people are students or workers, so they are not available for playing between 07:00 and 19:00 (I'm considering commuting) in their local time. Assuming most people enjoy sleeping from 22:00 to 06:00 local time, we have a time window for playing from 19:00 to 22:00 local time. So, assuming late afternoon as being 16:00 GMT-5 (in Summer), that playing window I mentioned, from 19:00 to 22:00 local time, would be at GMT-8 to GMT+11. Essentially the western parts of the USA and Canada, and Alaska (magenta on the map below): https://www.timeanddate.com/time/map/ And these are not exactly the areas with the highest population density in the world! 16:00 EST is around 12:00 in Greenwich (UK, Western Europe and Africa) and dawn to morning from India up to western China. So, yeah, it's exactly the opposite - most people in the world are awake, but studying or working (or preparing to) and, so, not hitting this Forum with their browsers.
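This kind of timezone-window arithmetic is easy to check mechanically. A small sketch (the 19:00-22:00 playing window is the assumption above; the function name is mine):

```python
def playing_zones(utc_hour, start=19, end=22):
    """Return the whole-hour UTC offsets (GMT-12 through GMT+14)
    where local time currently falls inside the assumed
    start..end 'playing window', given the hour in UTC."""
    zones = []
    for offset in range(-12, 15):
        local = (utc_hour + offset) % 24
        if start <= local < end:
            zones.append(offset)
    return zones
```

For instance, `playing_zones(21)` lists the offsets where it is evening when it's 21:00 UTC (i.e. 16:00 GMT-5), which is a quick way to see which slice of the map is actually inside the window at a given moment.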
  18. Perhaps we could send them that torrent I'm building? That would save you guys some serious bandwidth... These <insert your favorite non-forum-compliant expletive here> AI companies are literally using our money to make money - for them.
  19. If a server crashes in the weekend, and no user is connected, it still makes a 502 Bad Gateway?
  20. I've been trying to explain that to some "authors" around here for months, but all I get is mocking and disdain. Trademarks are a thing. Some people here are going to learn this the hard way.
  21. On Thursday afternoon (GMT-3), I noticed the Forum was getting even more 502 Bad Gateways than "normal". I noticed that at that time, in the few times I managed to load the front page, the number of guests was about 6.7K - way more than usual, which is about 1.3K +/- 1.5K at peak times. That was almost a DDoS attack, almost 5 times the usual guests... Sooo, yeah... I think the problem is not exactly the Forum, but the extra load of people trying to scrape the Forum by themselves. T2 probably cut some costs in their infrastructure, but given that right now the Forum is all right with 1.2K guests, I think there was some slack in that infrastructure in the first place.
  22. I will leave this note here for future reference, in case someone searches for the problem. The problem happened exactly after MM lists the loaded DLLs, and before KSP starts to build the part database:
[EXC 14:06:08.985] InvalidOperationException: Collection was modified; enumeration operation may not execute.
	System.ThrowHelper.ThrowInvalidOperationException (System.ExceptionResource resource) (at <9577ac7a62ef43179789031239ba8798>:0)
	System.Collections.Generic.List`1+Enumerator[T].MoveNextRare () (at <9577ac7a62ef43179789031239ba8798>:0)
	System.Collections.Generic.List`1+Enumerator[T].MoveNext () (at <9577ac7a62ef43179789031239ba8798>:0)
	KSPBurst.KSPBurst.FlushMessages () (at <c8b5fe2baff6415cb73ea658bdb44b4c>:0)
	KSPBurst.KSPBurst+<BurstCompile>d__20.MoveNext () (at <c8b5fe2baff6415cb73ea658bdb44b4c>:0)
	UnityEngine.SetupCoroutine.InvokeMoveNext (System.Collections.IEnumerator enumerator, System.IntPtr returnValueAddress) (at <12e76cd50cc64cf
	UnityEngine.DebugLogHandler:LogException(Exception, Object)
	ModuleManager.UnityLogHandle.InterceptLogHandler:LogException(Exception, Object)
	UnityEngine.Debug:CallOverridenDebugHandler(Exception, Object)
[LOG 14:06:09.164] PartLoader: Creating part database
It looks like something the Burst Compiler does not know how to cope with, and not a bug in US2's assembly. This should be reported to the Burst Compiler guys so they can find out if there's something they can do about it, such as adding US2 to an ignore list.
  23. Moderator's note: This topic was split off from: Companies care about money. Communities are important while they help secure the income, and it's really as simple as that. We are an asset, and we need to learn how to cope with that. This is not necessarily bad, besides being uncomfortable - being a sentient asset has some advantages that can be beneficial to us if we learn to play our cards right. Just remember: companies are not people, they are made of people. Some are good, some are evil, most of them are somewhere in the middle. We need to reach the good ones. Standard INC procedure. You know, LLCs have the utmost interest in knowing exactly what went wrong when a big project fails, because the company's owner ultimately has their skin in the game (pun not intended); there's no way out - they will pay for the failure directly from their pockets. PLC and INC companies work differently internally: the money's owners are an abstract mass of investors that are not directly dealing with the failure, only paying for it indirectly. So whoever is in charge has a special interest in hiding the problems that could tarnish their image, and tries to elect scapegoats instead - and screw the aftermath; who cares if this is going to be bad in the long run, these dudes only care about their image and how it affects their careers, and they have no problem pursuing such careers at the competition.
  24. NEWS FROM THE FRONT I just updated the scraping tool. Now it's scraping images and styles too, and exactly as I intended: html pages in one collection, images in a second one, anything related to CSS and styling in a third one. The anti-dupe code is also working fine, preventing the scrapy tool from visiting the same page twice in a session - exactly what had screwed me on July 13th, when I first tried it. I'm doing proper logging now, too. pywb allows us to dynamically merge the collections and serve them on a single front-end, as if they were just one. Pretty convenient. The rationale for this decision is simple, besides not exactly straightforward: images almost never change, and neither do styles. Scraping them separately will save a bit of the Forum's resources and scraping time while updating the collections, as the images will rarely (if ever) change. Same for styles. So we can just ignore them while refreshing the archive contents. There's an additional benefit to keeping textual info separated from images and styles: whoever owns the IP owns the images and styles, but not the textual contents. Posts on the Forum are almost unrestrictedly and perpetually licensed to the Forum's owner, but they still belong to the original poster. So whoever owns the IP, at least theoretically, has no legal grounds to take down that content - assuming the worst scenario, where this Forum goes belly up and a new owner decides to take down the Forum mirrors, they will be able to do so only for the material they own: images and styling. And those we can easily replace later, forging a new WARC file pretending to be that now lost content. Ok, ok, in Real Life™ things don't work exactly like that. But it costs very little (if anything) to take some preventive measures, no? Scrapy tells me: INFO: Crawled 316615 pages (at 58 pages/min), scraped 31847497 items (at 6128 items/min) at this moment, but the last time the WARC file was touched was Jul 20 20:19.
So, apparently, the sum of the older WARC files from 2023 and the new ones I'm building now has all the information since the last time I restarted the tool. Please note that the tool counts as a scraped item anything it crawls into, regardless of it being a dupe or not, or whether it was fetched or ignored. Right now, I'm trying to find my way around the Internet Archive to host the torrent file I will build with the material I already have. I hope it can be a mutable torrent, otherwise I will need to find some other way to host these damned huge files - I intend to have it updated at least once a month, which is the reason it needs to be a mutable torrent. People not willing to host anything can also find this stunt useful, as there are tools to extract the WARC contents and build a dump on their hard disks, just as you would get by using a 'dumb' crawler like HTTrack. So, really, there will be no need for everybody and the kitchen sink to hit the Forum all the time to scrape it. Finally, once I have the torrent hosted somewhere, I will start to cook a way to scrape the site cooperatively, so many people can share the burden, making things way faster and saving the Forum's resources - once people realize they don't need to scrape things themselves every time, I expect the load on the Forum to get way easier. This is what I currently have:
-r--r--r-- 1 deck deck  41G May  7  2023 forum.kerbalspaceprogram.com-00000.warc
-r--r--r-- 1 deck deck  20G May  7  2023 forum.kerbalspaceprogram.com-00001.warc
-r--r--r-- 1 deck deck  24G Jul 25 15:32 forum.kerbalspaceprogram.com-202407.warc
I expect future WARC files to be smaller and smaller.
Last, but not least: === == = POST EDIT = == === For the sake of curiosity, some stats I'm fetching from the archives I have at this moment:
forum.kerbalspaceprogram.com-00000.warc
 318421 Content-Type: application/http; msgtype=request
 318421 Content-Type: application/http; msgtype=response
      6 Content-Type: application/json
      2 Content-Type: application/json
      8 Content-Type: application/json
      2 Content-Type: application/json;charset=utf-8
      2 Content-Type: application/x-www-form-urlencoded
 122909 Content-Type: ;charset=UTF-8
     24 Content-Type: text/html; charset=UTF-8
 195482 Content-Type: text/html;charset=UTF-8
forum.kerbalspaceprogram.com-00001.warc
 121990 Content-Type: application/http; msgtype=request
 121990 Content-Type: application/http; msgtype=response
  40841 Content-Type: ;charset=UTF-8
  81149 Content-Type: text/html;charset=UTF-8
forum.kerbalspaceprogram.com-202407.warc
 553096 Content-Type: application/http; msgtype=request
 553096 Content-Type: application/http; msgtype=response
     27 Content-Type: application/json;charset=UTF-8
      1 Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;charset=UTF-8
      5 Content-Type: application/x-unknown;charset=UTF-8
      8 Content-Type: application/zip;charset=UTF-8
 342190 Content-Type: ;charset=UTF-8
   1497 Content-Type: text/html
      7 Content-Type: text/html; charset=UTF-8
 208622 Content-Type: text/html;charset=UTF-8
      3 Content-Type: text/plain; charset=UTF-8
      2 Content-Type: text/plain;charset=UTF-8
     93 Content-Type: text/xml;charset=UTF-8
     28 Content-Type: video/mp4;charset=UTF-8
      3 Content-Type: video/quicktime;charset=UTF-8
As we can see, some videos leaked into the WARC files... I'm working on it. === == = POST2 EDIT = == === This one gave me a run for my money. The Forum serves media (like movies) using an interface called /applications/core/interface/file/cfield.php, where the file is sent in a query.
But the crawler was looking only at the path, taking advantage of the fact that the Forum doesn't obfuscate its artefacts - which made my life simpler while routing html, image and style files to their respective collections. Until now. Since the crawler didn't parse the query, it thought it was a content file and stored the freaking videos together with the html files, screwing me because: 1. These videos are the Forum's IP, so shoving them in with the textual content would erode my chances in a hypothetical future copyright takedown attempt. 2. It royally screwed the compression ratio of the WARC file!! As a matter of fact, I thought it was weird that the 2023 WARC files were compressing at 44 to 1, while mine "only" at 22 to 1 - they should compress at similar ratios, as they hold similar content. This is the reason I made the stats above, already intuiting that some already-compressed content had leaked into the stream - but I was thinking of some images or even a zip file, not a whole freaking movie file! Anyway, I salvaged the files, moving the image/* content into the image collection and removing about 2G of binary data from the text stream. I'm double checking everything and recompressing the data files; I will pursue torrenting them tomorrow. Cheers! === == = POST3 EDIT = == === Well, I blew it. I made a really, REALLY, REALLY stupid mistake in the spider that ended up with a memory leak that kept growing and growing without me being aware. When I finally noticed the problem, I tried to salvage the situation (scrapy has a telnet interface, in which you can do whatever you want - including hot code change!!) but didn't have enough memory available. So I tried terminating one of the proxies (the styles one), as it had been idle since the start, to have enough memory to work on the system and try adding a swap file on the Steam Deck.
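One way to dodge this class of bug is to route on the response's Content-Type header instead of the URL path, so a video served through cfield.php lands in the media collection no matter what the URL looks like. A minimal sketch (the collection names mirror the WARC files above; the helper is illustrative, not the actual spider code):

```python
from urllib.parse import urlsplit

def pick_collection(url, content_type):
    """Choose the target collection from the Content-Type header first,
    falling back to the URL path only for style assets; the query
    string is deliberately irrelevant to the decision."""
    ctype = content_type.split(";")[0].strip().lower()
    if ctype.startswith("image/"):
        return "images"
    if ctype.startswith(("video/", "audio/")):
        return "media"
    path = urlsplit(url).path.lower()
    if ctype == "text/css" or path.endswith((".css", ".woff", ".woff2")):
        return "styles"
    return "html"
```

Header-first routing keeps already-compressed media out of the text stream, which is exactly what preserves the ~44:1 lrzip ratio on the html WARC.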
This would lift the pressure and allow me to try to salvage the session, not to mention stop grinding my m2, which probably lost half of its lifespan on this stunt... Problem: this dumbass typing this post decided to create the swapfile in the dumbest way possible: dd if=/dev/zero of=/run/media/deck/MIRROR0/SWAP1 bs=1G count=8 I was lazy (and still sleepy) and decided I didn't want to take a calculator to see how many 4K blocks I would need to reach 8G, so I told dd to write 8 blocks of 1G each and call it a day (something like bs=4M count=2048 would have done the same job with a small buffer). But by doing it, dd tried to malloc a 1G buffer on a system that was fighting for 32KB on the current swapfile. So the kernel decided to kill something, elected SSHD (probably), and the session was finished. However, SteamOS uses that terrible excuse of an init called systemd, and this crap automatically kills all the processes owned by a user when they log off, which is essentially what happens when you lose an SSH session. And then I had to restart scraping again this morning. I'm currently back at: 2024-07-27 19:15:36 [scrapy.extensions.logstats] INFO: Crawled 32474 pages (at 80 pages/min), scraped 839693 items (at 2080 items/min) There was no loss of data; the redis database is being hosted on another computer (I did some things right, after all), so the current scraping is just to be sure I fetched all the pages from the Forum, without leaving things behind. Oh, well... Sheet happens! I'm compressing what I have now (will add a new WARC file later with whatever is being scraped now) and will proceed with the creation of the torrent. === == = POST4 EDIT = == === It's about 90 minutes since the log above, and now I have: 2024-07-27 20:46:36 [scrapy.extensions.logstats] INFO: Crawled 39619 pages (at 73 pages/min), scraped 1025463 items (at 1924 items/min) I.e.: 185.770 items scraped, or about 2K items scraped per minute. 7145 pages crawled, or about 79 pages per minute.
I know from fellow scrapers that we have about 400K pages, so unless things go faster, I still have 84.38 hours to complete the task. Oukey, this settles the matter. I will publish the torrent tomorrow and update it later. [edit: I made the same mistake again - going by the Internet Archive's numbers, there are way more pages!] === == = BRUTE FORCE POST MERGING = == === By all means, no apologies! This is a brainstorm; we were essentially throwing... "things" at the wall to see what sticks. Jumping the gun is only a problem when there's no one around to tell you you're jumping the gun, so no harm, no foul. And it's good to know that someone is willing to go that extra mile if needed, and this message you delivered with success. So, thank you for the offer! Mine too. I'm really hoping for the best - but still on alert, expecting the worst. And trying hard not to cause it! Cheers! (and sorry for answering this so late, I had a hellish week...)