Only folder structure without videos #649
Comments
same here |
edx recently changed the structure of the website and this package isn't being maintained as it was before. It's going to be up to one of us to fix it I think. Not sure if they are still devoting time to this package |
I started playing with the code yesterday to see if I could get it to work. I haven't used Python much so my code isn't very pretty, but it's just about working. I just need to get PDF and other files downloading; the videos and subtitles work. I'm not sure if I can advertise a fork here, but when it's done I'll upload it. Let me know if it's OK to post here. |
I noticed that some videos are only loaded afterwards via JS introducing an iframe. |
Meanwhile, since I couldn't use this tool, I created my own chrome extension for that. You can find it here. |
There is an api call now that gets unit IDs and other info. The section code stays the same but to get the units you make an api call for each section, it returns unit IDs for each unit, then you use the prefix https://courses.edx.org/xblock/ to get what is loaded in the iframe. I'm testing my final code now. When I upload it you can take a look. |
Great, I think I used a similar approach for the chrome extension. |
I'd love to see what you've done. I'm working with the code myself and got to the point of correctly identifying the urls for all subsections. Then I ran into this regex on line #92 of parsing.py: re_units = re.compile('(<div?[^>]id="seq_contents_\d+".?>.?</div>)', re.DOTALL) As far as I can tell, this method and its associated regex are what is causing the script to fail to identify units. Using the following url as an example: [(https://courses.edx.org/courses/course-v1:BerkeleyX+Data8.1x+2T2020/jump_to/block-v1:BerkeleyX+Data8.1x+2T2020+type@sequential+block@851eafb36585493aa5ce5c54f8d56d4a)] Which part do you append to [(https://courses.edx.org/xblock)]? |
The advantage of the extension is that it can wait for JavaScript to load. The iframe src to I think it won't be possible to build a successful scraper without JS rendering. In my case I wait for the page to load, then redirect to the iframe src, then I take the From |
I published my code, I modified edx_dl.py and parsing.py https://github.com/RJFeddeler/edx-dl/ I decided to switch the way youtube-dl is used to the embedded method so I'm still playing around with the settings for that. It doesn't show progress or anything and downloads the best quality and muxes the audio/video together which is slower (ffmpeg is required for that, I forget if it defaults to the normal video file if not installed). I did that because I've been getting a lot of 500 errors when trying to download the default video+audio. |
The xblock has both the sequential block id of the sub-section and the vertical block id of the unit. The sequential block IDs (section/subsections) are still working as usual but you use api calls to get the vertical block IDs of the units of each section. |
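For anyone trying to follow the ID scheme described above, here is a minimal sketch that splits a block ID into its parts and builds the xblock URL. It assumes the full block ID is appended directly after the https://courses.edx.org/xblock/ prefix, which is my reading of the comments and not something confirmed here. The function names are made up for illustration, and the vertical ID in the usage note is a placeholder, not a real unit.

```python
from typing import Dict

# Prefix quoted earlier in this thread.
XBLOCK_PREFIX = "https://courses.edx.org/xblock/"


def parse_block_id(block_id: str) -> Dict[str, str]:
    """Split 'block-v1:ORG+COURSE+RUN+type@TYPE+block@HASH' into named parts.

    Hypothetical helper; raises ValueError if the ID does not have this shape.
    """
    scheme, sep, rest = block_id.partition(":")
    if scheme != "block-v1" or not sep:
        raise ValueError("unexpected block id: %r" % block_id)
    org, course, run, type_part, block_part = rest.split("+")
    return {
        "org": org,
        "course": course,
        "run": run,
        "type": type_part.partition("@")[2],   # e.g. 'sequential' or 'vertical'
        "block": block_part.partition("@")[2],  # the hex hash
    }


def xblock_url(block_id: str) -> str:
    """Assumption: the iframe content lives at prefix + full block ID."""
    parse_block_id(block_id)  # validate the shape before building the URL
    return XBLOCK_PREFIX + block_id
```

With the sequential ID from the URL posted above, parse_block_id(...)["type"] comes back as "sequential". A unit would look the same but with type@vertical, for example block-v1:BerkeleyX+Data8.1x+2T2020+type@vertical+block@<unit hash>, where the hash has to come from the API call mentioned earlier.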
@RJFeddeler Your code is working. I had one problem: with default settings it tries to download YouTube videos. It will download one video, then raise an exception and give an error while downloading the second video. I ran the script again and it downloaded the second video and raised the exception again for the third video. Something about a connection timeout. Also, it takes a long time to download a video. EDIT: I tried again and it was going fine without any problem. I will update on it.
Downloading courses before edx decides to update their system again |
@shad90 I'm not sure why youtube downloading isn't working for you. One thing you could try is to add the command line argument --ignore-errors (download latest version for that). I mentioned above why the downloads take a long time: it's downloading the best quality audio and video separately and muxing them together. You can add the argument --format "best" and it should download from youtube quicker. EDIT: It is --format "mp4" (or: -f "mp4") |
Hey @RJFeddeler I used your repository and got the following error message Logging into Open edX site: https://courses.edx.org/login_ajax |
@0n0n0m0uz No problem, I'm happy it worked! I am currently working on an update to make the output nicer and more useful (progress bars for everything, not just the current video) and to handle youtube errors better (downloading alternative videos) @vobisie I didn't modify that at all, extracting the sections still works from the old code, it's getting the units from the sections that was the problem, not sure why you are getting that error. |
(base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u *******@gmail.com https://courses.edx.org/courses/course-v1:MITx+JPAL102x+3T2020/course/ The above is the complete output I get. |
@vobisie I haven't looked at the section extraction code much. Your guess is as good as mine. You sure you have the right URL for the course? Do other courses work or same problem? EDIT: Working on some code now and I see it verifies the URL is in your list before starting, so guess it's not a problem of a wrong URL. |
I am getting different errors after using your code. `During handling of the above exception, another exception occurred: Traceback (most recent call last): |
@techfre Unable to extract video data is a problem with either youtube or youtube-dl, I get those somewhat frequently so I'm in the process of detecting those errors and downloading a different version of the video (youtube hosts multiple versions of each video with different encodings/resolutions/etc) You can see the error in regard to the exception comes from youtube-dl so I can't do anything about it. I believe if you just use the flag -i AKA --ignore-errors then it should skip that video and continue downloading. |
@RJFeddeler I can confirm that your code works well. Been able to download several courses. The only change I've made so far is to add downloading progress to the console. The output is not pretty, but it works. Add bolded text to the my_hook method of the MyLogger class starting at line 115 of edx_dl.py def my_hook(d): |
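Since the snippet above got cut off, here is a rough sketch of what a progress hook of this kind can look like. It is not the code from the comment above. It assumes the dictionary youtube-dl passes to progress hooks, with the keys status, filename, downloaded_bytes, total_bytes and total_bytes_estimate; format_progress is a made-up helper name.

```python
def format_progress(d: dict) -> str:
    """Turn one youtube-dl progress dict into a single console line."""
    status = d.get("status")
    name = d.get("filename", "?")
    if status == "downloading":
        done = d.get("downloaded_bytes") or 0
        # total_bytes can be missing; fall back to the estimate if present.
        total = d.get("total_bytes") or d.get("total_bytes_estimate")
        if total:
            return "%s: %.1f%%" % (name, 100.0 * done / total)
        return "%s: %d bytes" % (name, done)
    if status == "finished":
        return "%s: done" % name
    return "%s: %s" % (name, status)


def my_hook(d: dict) -> None:
    """Hook to pass in the 'progress_hooks' list of the youtube-dl options."""
    print(format_progress(d))
```

The hook would be registered through the options dict, something like {'progress_hooks': [my_hook]}, when youtube-dl is used in embedded mode.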
I updated my repository with my latest version. It isn't perfect but it displays progress for the course/section/unit/video. I thought it was worth posting even though it isn't finished. It uses tqdm for progress bars. I also added an additional argument which I haven't tested:
|
@RJFeddeler still having the same issues. (base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ***@gmail.com https://courses.edx.org/courses/course-v1:MITx+JPAL102x+3T2020/course/ (base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ***@gmail.com https://courses.edx.org/courses/course-v1:MITx+14.740x+3T2020/course/ (base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ****@gmail.com https://courses.edx.org/courses/course-v1:MITx+15.415.1x+1T2020/course/ However, it could be an issue with MIT courses because I was able to download this course without much hassle, while I struggled previously. (base) C:\Users\iobis\Desktop\edx-dl-master>python edx-dl.py -u ***@gmail.com https://courses.edx.org/courses/course-v1:edX+edx201+1T2020/course/ Processing units... Removed 0 duplicated urls from 24 in total Please advise and assist if possible. |
I used edx-dl 2 months ago and it worked smoothly. I came back to it today and discovered this issue; thanks for resolving it. However, I'm a novice with GitHub: how do I incorporate your code into my edx-dl folder on my PC? I tried Kindly help. |
@weirdsourcer I'm actually not sure where pip pulls stuff from. I can try to figure it out later but for now you just have to replace two files from my source in the edx-dl folder. The two modified files are edx-dl.py and parsing.py. I'm not sure exactly where pip installs your packages but you can type: pip show edx-dl to find out. |
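If pip show is not handy, a small sketch like the following can also print where an installed package lives, so you know which folder to copy the two files into. It only uses the standard library; package_dir is a made-up name, and edx_dl as the import name is taken from the traceback paths people posted in this thread.

```python
import importlib.util
import os
from typing import Optional


def package_dir(name: str) -> Optional[str]:
    """Return the directory of an importable top-level package or module, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None  # not installed in this interpreter
    if spec.submodule_search_locations:
        # Regular package: first search location is the package folder.
        return list(spec.submodule_search_locations)[0]
    if spec.origin and os.path.isfile(spec.origin):
        # Plain module: use the folder containing the file.
        return os.path.dirname(spec.origin)
    return None


if __name__ == "__main__":
    print(package_dir("edx_dl"))
```

Run it with the same python that runs edx-dl, otherwise it may report a different installation or None.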
I tried it exactly according to your guide but unfortunately it still maintains its behaviour of downloading only empty folders. I used the command below. I'm almost certain my command is correct, as it is what I used to download the Microsoft courses I took 2 months ago.
UPDATE: the code is working now after I removed --cache from the command, which makes me wonder whether it will work if I want to continue my course download later, as MITx releases course content every week. Update: it stops downloading after a while with the error Removed 3 duplicated urls from 330 in total
|
@RJFeddeler You are the man, thanks so much! Just in case someone else runs into the same problem I did, it seems for a course I was downloading the separated files and then merging was giving me problems. If anyone has a similar problem, with errors from youtubedl or ffmpeg, try the
|
@weirdsourcer --cache isn't very well implemented (by the original authors, I didn't touch it) and it really only saves you about a minute. You can resume downloading a course each week just by downloading the course without the --cache argument. It is supposed to skip files already downloaded, which it does for everything but youtube downloads, which I'm currently working on fixing. The original code relies on youtube-dl to skip the youtube download, which works okay but wastes some time. As for the error you get now where it stops downloading, that is something that's always happened for me. That's why I always use the --ignore-errors (or just -i) argument. At least then when it encounters that error it will keep going. As for a better solution, I am having it download alternate versions of the videos if one fails like that. I haven't tested it but it should be working; I'll publish the new code soon. You need to use the --ignore-errors argument though, or any error youtube-dl encounters is just going to end the program. @txeni Yeah, I've had it default to downloading the best quality audio and video separately and using ffmpeg to combine them, but I think it makes more sense to use the standard mp4 format argument as the default. I was just having trouble with the original code with the same error @weirdsourcer was having, so I was looking for a format source that was more reliable, but none are. I'll decide on a format order to go through when errors are encountered so that it doesn't just skip the file, but --ignore-errors must be specified. |
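To make the "format order" idea concrete, here is a minimal sketch of trying several formats in turn. This is not the code from the fork. The download argument is a stand-in for whatever actually fetches the video and is assumed to raise an exception on failure. The format strings are ordinary youtube-dl format selectors, but the order is only an example.

```python
from typing import Callable, Iterable, Optional

# Example order only: best separate streams first, then plain mp4, then anything.
DEFAULT_FORMATS = ("bestvideo+bestaudio", "mp4", "best")


def download_with_fallback(
    url: str,
    formats: Iterable[str],
    download: Callable[[str, str], None],
) -> Optional[str]:
    """Try each format until one download succeeds.

    Returns the format that worked, or None if every attempt failed,
    so the caller can skip this video and carry on with the rest.
    """
    for fmt in formats:
        try:
            download(url, fmt)
            return fmt
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            print("format %r failed for %s: %s" % (fmt, url, exc))
    return None
```

With youtube-dl in embedded mode, the download callable could wrap something like YoutubeDL({'format': fmt}).download([url]). Returning None instead of raising keeps the behaviour close to what --ignore-errors gives you: one bad video does not end the whole run.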
@RJFeddeler Thanks for your new code, however I always keep getting a traceback that seems different from the ones reported so far, even if I'm typing nothing but edx-dl without any arguments at all.
I upgraded youtube-dl and Crypto packages just to be safe, but nothing changed. :( BTW I'm running python 3.8 if that means anything. EDIT: Nvm, I reinstalled python (latest miniconda) and redid everything, and now the error is gone. |
@RJFeddeler thanks for your effort. Is it possible, with your latest commit, to download a list of courses with just one request? I noticed the progress bar counts course as |
@RJFeddeler I installed your version, but when I run edx-dl I'm getting this error... any ideas? :-) "C:\Users*****\AppData\Local\Programs\Python\Python38-32\lib\site-packages\edx_dl-0.1.13-py3.8.egg\edx_dl\edx_dl.py", line 27, in |
Great work. Thanks for sharing. |
@RJFeddeler Hello, appreciate all your work and effort in this. Have you maybe found a solution for the Youtube unable to extract video data error? I am getting the same like some guys above. Traceback (most recent call last): During handling of the above exception, another exception occurred: Traceback (most recent call last): |
Does anyone have any potential solutions to resolve this issue? (base) C:\Users*\Desktop\edx-dl-master>python edx-dl.py -u @gmail.com https://courses.edx.org/courses/course-v1:MITx+14.740x+3T2020/course/ |
@weirdsourcer you can list multiple course urls, or you can list no course urls and use the -a or --all flag to download all available courses. @diamneth use the --ignore-errors flag; my latest code will attempt to download it again in a different format, and if that fails it will at least continue downloading the rest of the videos. I'm guessing that error is a problem with youtube-dl, I've always gotten that error randomly. |
@vobisie I had the same problem. I fixed it in parsing.py by changing the line 431 from: |
Thank you @sorin71 . Do you have any idea how to fix the issue of there being no sound with the videos downloaded? Thank you |
Also, does the take down of youtube-dl impact edx-dl? Thank you |
The takedown of youtube-dl will have an impact on edx-dl, but probably only in the longer term, when youtube makes format changes. youtube-dlc might end up being taken down as well, as it seems to be a fork of youtube-dl. The problem with no sound in the downloaded videos is a false one. The video (mp4) and the audio (m4a) are in separate files, and you have to combine them into a single file using a tool like ffmpeg. |
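For the manual merge, a minimal sketch in Python that calls ffmpeg on one video/audio pair. It assumes ffmpeg is installed and on your PATH. The function names and the file names in the usage note are placeholders. The -c copy option copies both streams as they are, so nothing is re-encoded.

```python
import shutil
import subprocess
from typing import List


def build_merge_command(video: str, audio: str, output: str) -> List[str]:
    """ffmpeg arguments that mux one video file and one audio file without re-encoding."""
    return ["ffmpeg", "-i", video, "-i", audio, "-c", "copy", output]


def merge(video: str, audio: str, output: str) -> None:
    """Run ffmpeg; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_merge_command(video, audio, output), check=True)
```

For example, merge("lecture.mp4", "lecture.m4a", "lecture_merged.mp4") writes a new file and leaves the two originals in place.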
Same Error 403 problem: File "/Users/blahuser/.pyenv/versions/3.8.3/lib/python3.8/urllib/request.py", line 649, in http_error_default |
It is working great. One thing I noticed is that sometimes the video is separated from its audio track, but nevertheless it works great |
@sorin71 & @RJFeddeler do you have any idea why I get this output for python edx-dl -u *** -a -i Traceback (most recent call last): |
This is it! Yes, this worked for me. I had tried everything else but until I did 'pip uninstall edx-dl' nothing worked. Thanks! JM |
@jmfontana and @MissGorgeousTech, what do you mean by "zip code" in:
Is it possible to have the link? Thanks ! |
Hi. I refer to download the zipped code...search for a green button that says Code, click on it and you will see Download Zip...and click it...and you go from there. If you still have any difficulties feel free to tell me. I will try with screenshots. |
Thanks for your help @MissGorgeousTech ! I found the green button on the home page but I am still getting empty folders. 4 days ago, I was able to get some videos by following the @RJFeddeler code but at some point in the downloading I got an error I am on python 3.6 (can't update yet to 3.8) |
Could youtube-dl takedown be the culprit for the following error? is there a way to resolve it in case it is? I'm on python 3.8
|
After having seen that Thanks for your help! |
@RJFeddeler you rock!! Thanks for sharing your repository! |
🚨Please review the Troubleshooting section
Subject of the issue
After modifying my edx.py and parsing.py as prescribed by @RJFeddeler, I still get this error. Can someone figure out what's wrong?
Your environment
Steps to reproduce
edx-dl -u email -p password --ignore-errors --cache https://courses.edx.org/courses/course-v1:USMx+ENCE607.1x+3T2019/course/
Expected behaviour
Download should have started normally
Actual behaviour
I get this error message:
Building initial headers for future requests. Removed 0 duplicated urls from 76 in total edx_dl version 0.1.13 |
Anyone ever got this error and knows anything about it? Course : 0%| | 0/1Got SSL/Connection error: HTTPConnectionPool(host='www.math.umt.edu', port=80): Max retries exceeded with url: /bardsley/courses/495/Projects/HIV/PerelsonEtAl1996.pdf (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001DD37B2C910>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')) |
First a thank you to @RJFeddeler for the great work! When I started downloading courses using that version I was getting the split audio and video. I'm on Centos 8 and didn't have ffmpeg installed. Once I got that installed (something of a battle in itself), rerunning edx-dl not only merged the files, but got rid of the separated ones already there. |
Thank you for trying, but it doesn't work. |
Thank you for your work @RJFeddeler, but while downloading a course mine gets stuck at 24%. Up to that point it always works flawlessly, and after reaching it, it just stays stuck there |
🚨Please review the Troubleshooting section Subject of the issue Your environment Expected behaviour @RJFeddeler Can you please help me? |
I get the "No downloadable video found" message too. |
I have also noted that if I change in edx_dl.py the line n.63 from: 'url': 'https://courses.edx.org' I get again the "HTTP Error 403: Forbidden", while re-changing to the original value I get the "No downloadable video found" message. |
Subject of the issue
edx-dl creates the folder structure but does not download any video
Your environment
Steps to reproduce
edx-dl -u https://courses.edx.org/courses/course-v1:KTHx+PHSC01.1x+1T2020/course/
Expected behaviour
I would expect it to download all videos
Actual behaviour
Is creating folder structure only