Create documentation #32
Comments
Yes, I would like to, just haven't had time thus far :) Are there particular areas that you are most interested in? How are you using wabac.js (if you are)? Would also welcome contributions in this area if you have time to help :)
Cheers for the quick reply!
@ikreymer Is there any possible timeline on documenting the API for wabac.js? Over at Kiwix, we are particularly interested in URL rewriting functions... I understand how busy you are with various aspects of the Replay.Web project, so this is just a gentle nudge. Many thanks.
@ikreymer I would like to add my own interest to this issue. We have terabytes of WARC files which we would like to present to our users, and we would like to develop a playback frontend. I'd like to know how to use this to determine if it suits our needs but, as @thisistaimur mentioned, I don't know where to start with it.
Maybe a simple test case / demo showing how to use the main entry points of the API that appear to be exposed in https://github.com/webrecorder/wabac.js/blob/main/src/api.js?
@mattbbc are there reasons that replayweb.page won't work for your use case for a frontend? We've built that specifically to act as a customizable frontend for replay, and have documentation on how to use it and embed it in other sites. If the main goal is just to replay WARC files, we recommend using replayweb.page, or building on top of that, rather than working with wabac.js directly. We also recommend packaging WARC files to WACZ, so you can benefit from efficient loading. I would recommend starting with replayweb.page and seeing from there if there are gaps in what you want it to do. Still, I agree that we should have more docs, especially for customizations that don't work with WARC/WACZ files, such as @Jaifroid's use case, which uses ZIM files. We'll see what we can do in creating a simple example that sets up replay and uses the API to get some basic info about the archive.
If you get a chance to do something along those lines, it would be really helpful! 😊
Thanks @ikreymer, we just started looking at replayweb.page yesterday and it's easy to get a basic embed working, thanks for your work on that. It works beautifully. Some of our historical WARC files are quite large, and we were considering an application that would load parts of WARCs via range requests rather than entire WARC files from a URL. This might be an Express API that would load the relevant parts of the WARC, inject any of our code, then send the rendered document to the user. I wasn't sure if this package would help with that idea or isn't the right thing. We were also considering something like a React application that could load in the WARC URL dynamically. Maybe the npm package for replayweb.page would be more appropriate for that, but I haven't gotten very far just yet - we're just starting to take stock of our options. I did try using the npm package in a React application to test, but I haven't figured that out yet. I get a few build errors with it e.g. an error in
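For anyone weighing up the "load only the relevant part of a large WARC" idea above, here is a minimal Python sketch of the general technique, not of anything wabac.js or replayweb.page does internally. It assumes the common convention that each record in a `.warc.gz` file is compressed as its own gzip member, so a record can be decompressed on its own once its byte offset and length are known (which is what a CDX-style index stores). The record contents and the helper names `build_demo_warc_gz` and `read_record` are made up for illustration, and the records are simplified placeholders rather than valid WARC records.

```python
import gzip


def build_demo_warc_gz(records):
    """Concatenate individually gzipped records.

    Returns the combined blob and an index of (offset, length) pairs,
    one per record, similar in spirit to what a CDX index holds.
    """
    blob = bytearray()
    index = []
    for record in records:
        member = gzip.compress(record)
        index.append((len(blob), len(member)))
        blob += member
    return bytes(blob), index


def read_record(blob, offset, length):
    """Decompress one record from its byte range only.

    Against remote storage, this slice would be a single HTTP request
    with the header "Range: bytes=<offset>-<offset + length - 1>".
    """
    return gzip.decompress(blob[offset:offset + length])


if __name__ == "__main__":
    # Placeholder payloads, not valid WARC records.
    demo_records = [
        b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\nfirst record\r\n",
        b"WARC/1.0\r\nWARC-Type: response\r\n\r\nsecond record\r\n",
        b"WARC/1.0\r\nWARC-Type: response\r\n\r\nthird record\r\n",
    ]
    blob, index = build_demo_warc_gz(demo_records)
    offset, length = index[1]
    print(read_record(blob, offset, length).decode())
```

The point of the sketch is only that the server or client never needs the bytes before or after the requested record, provided an index of offsets exists.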
Hey @mattbbc, I was able to set up my own WARC server with pywb, which has decent documentation. I was able to bypass the Shadow DOM issue by hosting the pywb server and client page on the same domain (different subdomains). That worked as far as accessing the WARC content via the browser went. After much digging, I figured out that WARCs with Shadow DOMs can be accessed from a parent page if the domain of the parent page and the Shadow DOM WARC match.
Where do you host your WARC files? Ours are all in S3. |
@mattbbc I keep them within the Docker container where I host pywb. pywb serves the files. pywb can also serve WARC files from S3 buckets.
We have two different tools for hosting web archives. The ReplayWeb.page system provides a 'serverless' replay system, where web archives can be loaded directly from static storage/S3 with a web-based system. The idea is that web archives are replayed in the browser, the way you'd replay a video or a PDF file. This library is a component of that system, with replayweb.page being the main user-facing tool. We also have our older tool, pywb, which is a more traditional web replay system, where you need to run it as a server and users access the web archives through the server. We are continuing to develop both. ReplayWeb.page also uses web components, so you can place a tag on a page and have a full web archive load. @thisistaimur I don't quite understand your concern with web components/Shadow DOM. What are you trying to do that you are having issues with? Are you injecting custom code into the web archive replay?
We have created the WACZ format to address this particular issue. With WACZ, the WARC files are packaged into a ZIP file which is then read via range requests. You can package existing WARCs into WACZ files using the py-wacz tool, and then load the resulting WACZ in replayweb.page instead of WARCs. Are the WARCs that you have mostly fixed, or are you doing continuous crawling?
We are using webpack and it currently predefines VERSION, so it needs to be defined to build. I realize this is a gap in documentation for replayweb.page - we should mention what to do if you're starting with a large set of WARC files. As this issue is about documentation in general, perhaps we can continue over email (info [at] webrecorder.net) or on our forum at https://forum.webrecorder.net/ to discuss this specific use case.
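To illustrate why packaging WARCs into a ZIP makes range-based loading possible, here is a small Python sketch using only the standard library. It is an illustration of the ZIP property, not the wabac.js or py-wacz implementation. `RangeLoggingReader` is a hypothetical stand-in for a remote file: every read it serves is recorded as an (offset, length) pair, the way each read would become one HTTP Range request. The member names (`archive/part-N.warc`) and sizes are invented for the demo.

```python
import io
import zipfile


class RangeLoggingReader(io.RawIOBase):
    """Seekable view over an in-memory blob that records each read.

    Every recorded (offset, length) pair stands in for one HTTP Range
    request against remote storage such as S3.
    """

    def __init__(self, blob):
        super().__init__()
        self._blob = blob
        self._pos = 0
        self.ranges = []

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        bases = {
            io.SEEK_SET: 0,
            io.SEEK_CUR: self._pos,
            io.SEEK_END: len(self._blob),
        }
        self._pos = max(0, bases[whence] + offset)
        return self._pos

    def read(self, size=-1):
        if size is None or size < 0:
            size = len(self._blob) - self._pos
        chunk = self._blob[self._pos:self._pos + size]
        if chunk:
            self.ranges.append((self._pos, len(chunk)))
        self._pos += len(chunk)
        return chunk


def build_demo_archive(member_size=100_000):
    """Build an uncompressed ZIP with three placeholder members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for i in range(3):
            zf.writestr(f"archive/part-{i}.warc", bytes([65 + i]) * member_size)
    return buf.getvalue()


def read_member_via_ranges(blob, name):
    """Read one member and report how many bytes were actually fetched."""
    reader = RangeLoggingReader(blob)
    with zipfile.ZipFile(reader) as zf:
        data = zf.read(name)
    fetched = sum(length for _, length in reader.ranges)
    return data, fetched


if __name__ == "__main__":
    archive = build_demo_archive()
    data, fetched = read_member_via_ranges(archive, "archive/part-1.warc")
    print(f"archive size: {len(archive)} bytes, fetched: {fetched} bytes")
```

The ZIP central directory sits at the end of the file and lists where each member starts, so a reader can fetch the directory first and then only the byte range of the member it needs, rather than downloading the whole archive.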
At the moment we have inherited several terabytes of historical WARC files that are currently in a couple of formats, as in I have a simple harness / HTML page with the example embedding from the replayweb.page docs and I tested it with one of our larger WARC files after using your tool to wrap it up into a .wacz file and it loads significantly faster, thanks for the information on that!
We're not keen on using a CDN for anything important, which was why I was looking into the npm module. I wasn't able to get the standard embedding code to work as React complains when you try and use the It might be more appropriate to take my queries to the forum instead of constantly harassing this GH issue, though!
@ikreymer I use the postMessage method on the window object of the Shadow DOM. So yeah, I inject custom code.
No description provided.