August 6, 2021:
With the Selenium library in Python, it was possible to take screenshots automatically by calling a headless Firefox, making it effortless to keep a double backup of the posts on my site. It could even take a full-page screenshot, by identifying the main class in my case.
The binary data of the image could either be saved directly or passed to another function, for example one from the Pillow library.
Somehow, I couldn’t call Chrome; there were perhaps some problems with the path or the version. Besides, to get a screenshot of higher resolution, one had to zoom in, but so far I hadn’t managed to do that either.
In addition, Chinese characters should have been shown in Noto Serif, but the result was in a sans-serif font, and I couldn’t tell whether it was Noto Sans or a default font in Windows.
Another problem was that sometimes Selenium timed out, or even failed to locate the ID for the post header, for just certain posts, which was strange. I therefore enclosed both the post header and the post content in a div given a class for Selenium to recognize, which greatly improved stability.
#! /usr/bin/env python3
import os
import io

from selenium.webdriver import Firefox
from PIL import Image

# The original URL was truncated; a placeholder is used here.
address = "https://www.example.com/"

# Assumed: run Firefox headlessly via geckodriver's environment switch.
os.environ["MOZ_HEADLESS"] = "1"

driver = Firefox()
driver.get(address)
# "main" is assumed here; the post content sat inside this element.
element = driver.find_element_by_tag_name("main")
# Screenshot of just that element, returned as PNG bytes.
binary = element.screenshot_as_png
driver.quit()

# Hand the PNG bytes to Pillow for further processing or saving.
graph = Image.open(io.BytesIO(binary))
graph.save("post.png")  # output filename was truncated; placeholder
August 13:
I was a bloody underrated, world-class genius! I was now able to make a PDF of all the blog posts, crawled as images. Presently there were about 40 posts, and the PDF was already almost 100 pages long.
First, as said above, I automatically saved images of the blog posts with Selenium. The resulting text appeared blurry, but Pillow’s image-processing functions, such as PIL.ImageEnhance.Contrast and PIL.ImageOps.autocontrast, were helpful. ImageMagick also had a variety of operations, such as -brightness-contrast, -morphology, -adaptive-sharpen, and -sharpen. I preferred ImageMagick, because it could be called from Bash to facilitate automation.
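A minimal sketch of both routes for a single slice might look as follows; the filenames, the contrast factor, and the ImageMagick option values are placeholders rather than the values actually used, and the ImageMagick call is shown from Python only to keep the snippets in one language.

from PIL import Image, ImageEnhance, ImageOps
import subprocess

# Pillow route (placeholder filenames and an illustrative factor).
img = Image.open("slice.png").convert("RGB")
img = ImageOps.autocontrast(img, cutoff=1)        # stretch the histogram slightly
img = ImageEnhance.Contrast(img).enhance(1.5)     # then boost contrast
img.save("slice-pillow.png")

# ImageMagick route, called as a subprocess (option values are guesses).
subprocess.run(["convert", "slice.png",
                "-brightness-contrast", "0x15",
                "-adaptive-sharpen", "0x1",
                "slice-magick.png"], check=True)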
Second, I sliced the pictures line by line and saved the slices to dedicated directories named by date. To let Pillow recognize the cuts, if you noticed, I even added yellow borders to displayed lines and green borders to table rows, all slightly set apart with a CSS margin so that adjacent borders would not merge.
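The cut detection went roughly as follows; this is a sketch only, and the target color, the tolerance, and the fraction of matching pixels required for a row to count as a border are guesses, not my actual numbers.

from PIL import Image

def border_rows(img, color=(255, 255, 0), tol=40, frac=0.3):
    # Indices of rows in which enough pixels are close to the border color.
    px = img.convert("RGB").load()
    w, h = img.size
    rows = []
    for y in range(h):
        hits = sum(1 for x in range(w)
                   if all(abs(px[x, y][c] - color[c]) <= tol for c in range(3)))
        if hits >= frac * w:
            rows.append(y)
    return rows

def merge_runs(rows):
    # Collapse consecutive row indices into [start, end] runs.
    runs = []
    for y in rows:
        if runs and y == runs[-1][1] + 1:
            runs[-1][1] = y
        else:
            runs.append([y, y])
    return runs

page = Image.open("post.png")                     # placeholder filename
runs = merge_runs(border_rows(page))
for i, (a, b) in enumerate(zip(runs, runs[1:])):
    # The content of one line sits between two consecutive border runs.
    page.crop((0, a[1] + 1, page.width, b[0])).save(f"line-{i:03d}.png")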
Third, I packed the slices greedily into pages: once the running total height exceeded the page height, I created a new page. When a post ended, I inserted a separator line and started packing the next post. Pillow would extend each page with a suitable margin.
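The packing itself was essentially the following greedy loop; again only a sketch, with the page geometry, the margin, and the filenames made up for illustration.

from PIL import Image

PAGE_W, PAGE_H, MARGIN = 1600, 2263, 40           # placeholder geometry

def pack(slices, prefix="page"):
    # Stack slice images top to bottom; start a new page when one is full.
    pages = []
    page, y = Image.new("RGB", (PAGE_W, PAGE_H), "white"), MARGIN
    for s in slices:
        if y + s.height > PAGE_H - MARGIN:
            pages.append(page)
            page, y = Image.new("RGB", (PAGE_W, PAGE_H), "white"), MARGIN
        page.paste(s, (MARGIN, y))
        y += s.height
    pages.append(page)
    for i, p in enumerate(pages):
        p.save(f"{prefix}-{i:03d}.png")           # zero-padded, so alphabetical order equals page order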
Fourth, I called ImageMagick from Bash to convert the pages, named so that alphabetical order matched page order, into a single PDF.
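I did this last step from Bash, but the equivalent call, sketched from Python here for consistency with the snippets above and with filenames assumed to match them, would be:

import glob
import subprocess

pages = sorted(glob.glob("page-*.png"))           # alphabetical, hence page order
subprocess.run(["convert", *pages, "blog.pdf"], check=True)  # "magick" in ImageMagick 7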
August 17:
Now I wanted to reduce the noise, which made the edges of the text blurry. If the document had been monochrome, I could have defined a threshold, so that pixels darker than the threshold were mapped to pure black, and lighter ones to white. However, since it was colored, such a crude process only gave poor results.
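For a monochrome page, that kind of thresholding would be nearly a one-liner in Pillow; the filename and the cut-off value of 128 below are arbitrary.

from PIL import Image

gray = Image.open("page.png").convert("L")          # placeholder filename
mono = gray.point(lambda v: 0 if v < 128 else 255)  # darker than 128 becomes black, else white
mono.save("page-mono.png")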
Formally, let the document be represented as \(\mathbf{P} = \langle \mathbf{p}_{0}, \dotsc, \mathbf{p}_{N-1} \rangle\), a list of pixels. A pixel \(\mathbf{p}_{n} = \langle x, y, z \rangle\) is a triple of red, green, and blue in the RGB color space. For simplicity, suppose \(\mathbf{p}_{n}\) is transformed so that \(-1/2 \leq x, y, z \leq 1/2\). It seems we have to conduct a regression to find several converging points, towards which noisy pixels are quantized. Assuming blending with neighboring white pixels doesn’t change a pixel’s color but merely its brightness, we could let \(\mathbf{p}_{0}, \dotsc, \mathbf{p}_{N-1}\) vote for a consensus of brightness.
For example, define the quantities \(\vartheta = x + y + z\), \(\varphi = 2\sqrt{x} - \sqrt{y} - \sqrt{z}\), and \(\psi = \sqrt{x} + \sqrt{y} - 2\sqrt{z}\); it is clear that \(\langle \vartheta, \varphi, \psi \rangle\) form a coordinate system. On an equipotential surface of \(\langle \varphi, \psi \rangle\), a regression can be done to find the consensual values of \(\vartheta\). Unfortunately, it strikes me that \(\vartheta\) may not have uniform luminosity; the human perception of color isn’t a linear function of \(\langle x, y, z \rangle\).
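As a quick check of the intent (restricting to nonnegative components so the roots are defined): a neutral pixel with \(x = y = z = t\) gives \(\varphi = 2\sqrt{t} - \sqrt{t} - \sqrt{t} = 0\) and \(\psi = \sqrt{t} + \sqrt{t} - 2\sqrt{t} = 0\), while \(\vartheta = 3t\); all grays collapse onto the single equipotential \(\langle \varphi, \psi \rangle = \langle 0, 0 \rangle\) and are distinguished only by \(\vartheta\), which is precisely the brightness to be voted on.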
August 18:
If the width of the body was more than 900px, the resolution of the resulting screenshots appeared somewhat acceptable. The setting would only activate on a really wide screen, say 2000px, tailored for Selenium, so it wouldn’t affect the normal browsing experience. (Otherwise, I would have had to use an alternative CSS style and serve the site locally to let Selenium take the screenshot, which would have been too much of a hassle.) I really wanted to put the matter aside for the moment.
❧ mostly August 13, 2021
References
❉ ImageMagick, ‹Annotated List of Command-line Options›
❉ Pillow, ‹ImageEnhance Module›
❉ Selenium Python, ‹WebDriver API›