I recently got an email from a friend with a PDF attached. The PDF had links to some of the books that Springer has published over time. I quickly turned to Google to check that the PDF was not malicious. That’s when I realised that Springer has released hundreds of textbooks across various domains and topics for free. WOAH!!
Some of the books in the PDF were written by authors with an h-index greater than 100. If someone were to read even 20% of these in a year or two, they would not be the same person they were before. I went through all the titles, and with each passing line item only one thought ran non-stop in my head: I should start reading this book first to make the most of the lockdown.
The books cover topics like Statistics, Machine Learning, AI, Chemistry, Manufacturing and many more. For anyone not interested in the usual academic subjects, the list also includes books on rarely discussed topics like Criminal Justice and Mental Health, International Humanitarian Action, Psychoeducational Assessment and Report Writing, The Sea Floor, Principles of Astrophysics, Survival Analysis and Handbook of LGBT Elders, to name a few. The list is varied and spans a plethora of domains.
This was a huge gold mine, and I did not want to risk losing out on it, just in case Springer decides to roll back the offer :P. Though I will surely end up not reading 90% of these books, I definitely wanted to add each one of them to my e-book collection.
Whenever life throws data at you, store it first! Especially if it is books!!
There were 400+ download links in the PDF. I, for one, am definitely not going to click on each link and download the books one at a time. You shouldn’t either! It is god damn boring!
Even if I had done that, at roughly 10 seconds to click a URL, download the book and move on to the next one, four hundred books would take about 4,000 seconds, which is over an hour. That is a ridiculous way of spending your precious time!
That’s when it hit me: write a simple Python script that iterates through all the links, downloads each book and stores it as a PDF in a folder on my laptop! Cool, isn’t it? The thought felt so awesome that I immediately jumped into writing the code.
Here is the code, with a brief explanation after each snippet :)
First, import the necessary libraries. The start_time variable records the current time, which later in the code is used to calculate the total time the code took to execute.
BeautifulSoup is a library generally used to read and parse the HTML of a web page. It is a great library for HTML parsing. If you are interested, you can find more in the official documentation.
Since it is not part of Python’s standard library, you need to install it on your machine. I recommend installing it with pip. Running the command below in your terminal (macOS / Linux) or CMD (Windows) installs the latest version of BeautifulSoup (BS4).
pip install beautifulsoup4
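The import step described above might look roughly like the sketch below. The exact set of imports is an assumption based on the libraries this post mentions (requests, BeautifulSoup, os) plus the time module for start_time; the original snippet may have differed.

```python
# Standard-library modules: file sizes/paths and timing.
import os
import time

# Third-party modules: HTTP requests and HTML parsing.
# (Assumption: both were installed with pip beforehand.)
import requests
from bs4 import BeautifulSoup

# Remember when the script started so the total run time
# can be printed once every book has been processed.
start_time = time.time()
```

time.time() returns seconds since the epoch as a float, so subtracting start_time from a later time.time() call gives the elapsed seconds.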
The download_pdf() function takes a URL as an argument. Using the requests library (a third-party package, installed with pip install requests), we send a GET request that returns the HTML of the page at that URL.
Now that we have the HTML, we’ll use BeautifulSoup to find the exact download link on the page. Inspecting the webpage shows that the anchor tag holding the link has the CSS class test-bookpdf-link, so we extract the PDF download link from that element. The H1 tag on the page holds the name of the book.
Once we have the download URL, we create a new file on our local machine, named after the book title we pulled from the H1 tag, and write all the data into it. Close the file once this operation is done! (All sorts of bad things can happen to you if you don’t close a file after opening it :P)
This function returns the name of the book, which we’ll use in the next and last step.
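A minimal sketch of such a function, not necessarily the original one, could look like this. The parse_book_page() helper, the books folder default, the timeouts, the "/" replacement in the file name and returning None when no PDF link is found are all assumptions added for illustration; only download_pdf(), the H1 tag and the test-bookpdf-link class come from the description above.

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def parse_book_page(html, page_url):
    """Hypothetical helper: return (book_name, pdf_url) for one book page.

    pdf_url is None when the page has no free PDF link.
    """
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")
    book_name = heading.get_text(strip=True) if heading else "untitled"
    # Replace path separators so the title is usable as a file name.
    book_name = book_name.replace("/", "-")
    link = soup.find("a", class_="test-bookpdf-link")
    if link is None or not link.get("href"):
        return book_name, None
    # The href may be relative, so resolve it against the page URL.
    return book_name, urljoin(page_url, link["href"])


def download_pdf(url, folder="books"):
    """Download the PDF linked from a book page; return the book name."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    book_name, pdf_url = parse_book_page(response.text, response.url)
    if pdf_url is None:
        return None  # assumption: skip books without a free PDF
    os.makedirs(folder, exist_ok=True)
    pdf = requests.get(pdf_url, timeout=60)
    pdf.raise_for_status()
    # The with-block closes the file automatically, even on errors.
    with open(os.path.join(folder, book_name + ".pdf"), "wb") as f:
        f.write(pdf.content)
    return book_name
```

Keeping the parsing in its own helper means it can be tried on a saved HTML string without touching the network.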
We are just one step away from downloading all the books! All we need to do is iterate over the list of URLs we created in step 1 and pass each one to our download_pdf() function.
Each call downloads one book and saves it in the books folder on your disk.
It is always good practice to print the progress of your code run. In this part, we print the iteration number, the name of the book, the size it occupies on disk and the time it took to download.
To get the size of the file on disk we use .st_size on the result of os.stat(), from Python’s built-in os module. At the end, after all the URLs have been iterated, the total run time is printed.
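The loop described above could be sketched as follows. The download_all() name, its download parameter and the exact output format are hypothetical; the sketch assumes the downloader returns the book name (or None for a skipped book) and saves the file as the book name plus .pdf inside folder.

```python
import os
import time


def download_all(urls, download, folder="books"):
    """Run download(url) for every URL, print progress, return files saved."""
    start_time = time.time()
    saved = 0
    for i, url in enumerate(urls, start=1):
        t0 = time.time()
        book_name = download(url)
        if book_name is None:
            # Assumption: the downloader signals a skipped book with None.
            print(f"{i}. skipped {url}")
            continue
        path = os.path.join(folder, book_name + ".pdf")
        # os.stat() reports the size in bytes; convert to megabytes.
        size_mb = os.stat(path).st_size / (1024 * 1024)
        elapsed = time.time() - t0
        print(f"{i}. {book_name} | {size_mb:.2f} MB | {elapsed:.1f} s")
        saved += 1
    print(f"Total run time: {time.time() - start_time:.1f} s")
    return saved
```

With a download_pdf() function and a list called urls in scope, this would be invoked as download_all(urls, download_pdf). Passing the downloader in as an argument also makes it easy to try the loop with a stand-in function first.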
The code would look like this while in action :)
That’s all, folks! Now that you have all this knowledge within easy reach, just pick an interesting book and start reading.
Books are infinite in number and time is short. The secret of knowledge is to take what is essential. Take that and try to live up to it.
- Swami Vivekananda
Reading furnishes the mind only with materials of knowledge; it is thinking that makes what we read ours.
- John Locke
PS 1: Some of the books on this list are paid books, and those won’t be downloaded. They make up only a very small percentage of the list though!
PS 2: You can transfer these PDFs to your mobile and use Aldiko Book Reader to read them on the go. Since none of us are moving around during this pandemic (I hope you are really staying home!), you can also read them on a tablet or a PC if you have access to one :)
PS 3: Below is the GitHub repo for the code. Please feel free to make some upgrades and keep those pull requests coming :)