Lawrence Gimenez

Weekend Project: ImageDownloader.py

My personal project for this weekend is to migrate my portfolio from my old WordPress blog to my current one.

Export

First, I exported my content from the WordPress admin. The export was saved in XML format to a folder named lwgmnzme.wordpress.com-2022-09-25-02_03_10.

WordPress XML to Markdown

I then used this handy library for migrating all the pages to Markdown.

Curl Option

At first I was downloading each image individually, but I have more than 50 images to process, so that is not an option.

curl https://lwgmnzme.files.wordpress.com/2021/02/simulator-screen-shot-iphone-12-pro-2021-02-27-at-17.22.11.png > image1.png

Python Script

Time to brush up on my Python skills once again. I installed the requests Python library.

sudo pip3 install requests

The plan is to be able to input several URLs and maybe press a shortcut key or something to cue the script that it’s time to download the images.

Or better yet, look for the png URLs inside the .md file? That seems like the most logical option to me.

Initial commit

import requests

file = open("index.md", "r")
line = file.read()
print("Read = %s" % (line))

So far, so good.

Find all the png files

The next step is to find all the image URLs with a .png extension. Time to use regex, I guess.

Updated script:

import requests
import re

file = open("index.md", "r")
line = file.read()

result = re.findall(r'(https?://[^\s]+)', line)
print(result)

Looks promising. Let’s loop through it.

import requests
import re

file = open("index.md", "r")
line = file.read()

results = re.findall(r'(https?://[^\s]+)', line)

for result in results:
	print(result)

Much clearer. Okay, let’s try downloading each one of them. But wait, I noticed an extra ) character at the end of each result. The hell.
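The stray ) most likely comes from Markdown's image syntax, ![alt](url): the pattern [^\s]+ keeps matching until the next whitespace, so the closing parenthesis gets swallowed along with the URL. Here is a minimal sketch of one way around it, assuming the URLs themselves never contain a literal ). The sample line is made up for illustration.

```python
import re

# Hypothetical Markdown line, not taken from the real index.md
sample = "![](https://example.files.wordpress.com/2021/02/shot.png?w=473)"

# Original pattern: runs to the next whitespace, so it includes the ")"
greedy = re.findall(r'(https?://[^\s]+)', sample)

# Adding ")" to the excluded characters stops the match before it
tight = re.findall(r'(https?://[^\s)]+)', sample)

print(greedy)
print(tight)
```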

Maybe I should remove the extra ?w= parameter while I’m at it too.

import requests
import re

file = open("index.md", "r")
line = file.read()

results = re.findall(r'(https?://[^\s]+)', line)

for result in results:
	indexOfPng = result.find("?")
	updatedResult = result[:indexOfPng]
	print(updatedResult)

Not the most elegant of solutions but it worked.
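One thing to watch with the find approach: when a URL has no ?, str.find returns -1 and the slice drops the last character instead. Here that happens to remove the trailing ), but on a clean URL it would eat a real character. Below is a hedged alternative using urlsplit from the standard library. strip_query is a hypothetical helper name, not part of the script, and the URLs are made up.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return url without its query string, fragment, or trailing ')'."""
    parts = urlsplit(url.rstrip(")"))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

clean = "https://example.com/a.png"
print(clean[:clean.find("?")])  # find returns -1, so the final "g" is lost
print(strip_query(clean))
print(strip_query("https://example.files.wordpress.com/2021/02/shot.png?w=473)"))
```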

Get the filename

The plan is to get the filename from the URL and use it when saving the file.

import requests
import re
import os
from urllib.parse import urlparse

file = open("index.md", "r")
line = file.read()

results = re.findall(r'(https?://[^\s]+)', line)

for result in results:
	# Cut the URL at the "?" to drop the query string and the trailing ")"
	indexOfPng = result.find("?")
	updatedResult = result[:indexOfPng]
	# Get the filename from the cleaned URL
	parse = urlparse(updatedResult)
	print(os.path.basename(parse.path))
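Since the goal was .png images specifically, the filename also gives a cheap way to skip everything else. A small sketch, assuming a case-insensitive check on the extension is good enough. png_filename is a hypothetical helper and the URLs are placeholders.

```python
import os
from urllib.parse import urlparse

def png_filename(url):
    """Return the filename if the URL path ends in .png, else None."""
    name = os.path.basename(urlparse(url).path)
    if os.path.splitext(name)[1].lower() == ".png":
        return name
    return None

print(png_filename("https://example.files.wordpress.com/2021/02/shot.png?w=473"))
print(png_filename("https://example.com/about/"))
```

Because urlparse keeps the query string separate from the path, the ?w= part never ends up in the filename.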

Time to download

I ran into an IsADirectoryError. It seems I need to filter the results and download only files coming from a wordpress.com domain.
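The error is consistent with links whose path ends in a slash (ordinary page links, say): os.path.basename returns an empty string for them, so the save path ends up being the images folder itself. This is a likely explanation rather than a confirmed one, and the URLs below are made up. save_path is a hypothetical helper.

```python
import os
from urllib.parse import urlparse

def save_path(url, folder="images"):
    """Build the local path a URL would be saved to."""
    return os.path.join(folder, os.path.basename(urlparse(url).path))

page = save_path("https://example.com/2021/02/some-post/")
image = save_path("https://example.files.wordpress.com/2021/02/shot.png")

print(repr(page))   # no filename part, so this names the folder
print(image)
```

Checking for an empty filename before opening the file would guard against this directly, whatever the domain.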

Final code

import requests
import re
import os
from urllib.parse import urlparse

# Read the exported Markdown file
with open("index.md", "r") as file:
	line = file.read()

results = re.findall(r'(https?://[^\s]+)', line)

for result in results:

	# Filter only WordPress domains
	if "wordpress.com" in result:
		# Create folder
		if not os.path.exists("images"):
			os.makedirs("images")

		# Remove the query string and the trailing ")" from the URL
		updatedResult = result.split("?")[0].rstrip(")")

		# Get the filename from the cleaned URL
		parse = urlparse(updatedResult)
		filename = os.path.basename(parse.path)

		# Create a file path
		filePath = os.path.join("images", filename)
		print(filePath)

		# Download the image and write it to disk
		request = requests.get(updatedResult)
		with open(filePath, "wb") as imageFile:
			imageFile.write(request.content)
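
The script writes whatever comes back, so an error page would be saved under a .png name. Here is a sketch of a slightly more defensive download step, assuming requests is installed. download_image, fetch_with_requests, and the fetch parameter are hypothetical names, added so the logic can be tried without touching the network.

```python
import os

def fetch_with_requests(url):
    """Download url and return its bytes; raises on HTTP error statuses."""
    import requests  # imported here so the rest works without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content

def download_image(url, path, fetch=fetch_with_requests):
    """Save url to path and return True, or return False if path exists."""
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    data = fetch(url)
    with open(path, "wb") as out:
        out.write(data)
    return True
```

Skipping files that already exist also makes it safe to rerun the script after a failed download.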