Automated web scraping with RPi

Introduction – time to let the machine do the work

In the last installment, I covered commands for interacting with the RPi’s GPIO header.  Those examples were interactive and offered little in the way of full automation.  In this discussion, I would like to focus on automating processes – tasks that we define and schedule to run at times we specify.  The purpose of this exercise is to establish a foundation for the RPi to run tasks without the need for an operator.  This is useful for polling temperatures from a third-party website, gathering photos from web cameras, or checking the status of a device connected to the GPIO header.

I’ll revisit a few subjects from earlier posts, since they are key to enabling automated functionality on the RPi.  We will schedule jobs using Webmin, which is much simpler than setting up CRON from the CLI.  Most of our automation will revolve around CRON, so having a simple way to implement and debug jobs will be helpful.

Another aspect of this exercise will be image processing, so I’ll also cover that topic.  I won’t be going into great detail about the uses of OpenCV, but I will cover the steps to install it on the RPi.  Most of the image processing discussion will focus on ImageMagick, ffmpeg, avconv, and mencoder.

Purpose – a picture says a thousand words

Web scraping is a technique of automatically downloading web content at set intervals.  The content I’ll be downloading consists of images from traffic cameras, satellites, network cameras, and radar plotters.  The host providers don’t compile these images themselves, so I would like to add value to the data by rendering them into video.

The steps to accomplish this will provide an understanding of how to automate web scraping for other purposes.  It will also serve as an introduction to image processing and rendering.  Lastly, it offers a scalable way to manage automated tasks.  All of these skills will be useful for deploying the RPi in more advanced settings.

Detail – how it’s done

I was initially interested in this project as a way to create video from a web camera attached to the RPi.  I had no trouble finding a way to set up motion on the RPi.  The only problem was that the project was limited to my own hardware, and I felt it missed the point of web scraping.  So, with motion set up and running, I decided to take things further.

These were the steps I followed to render videos from the web-scraped images:

  1. Determine the resources to gather from the internet.
  2. Use the “wget” command to download and rename the images.
  3. Use Webmin to set CRON schedules for the scrape commands.
  4. Create video from the image repository.
  5. Install OpenCV on the RPi.

I had the sources I wanted to scrape; the trick was identifying the image resources.  The easiest was Unisys Weather – getting the URL of the images was as simple as loading the page and right-clicking the image to view its properties.  Getting the image URLs from the Seattle DOT and Weatherspark websites was more of a challenge.  For those sites, I used the debug function on my browser to list resources as they loaded.  Once I identified them, I tested each URL to verify it worked.  I did the same for my home internet camera.
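
One quick way to test a candidate URL from the RPi itself is wget’s spider mode, which checks that a resource exists without saving anything:

  • wget --spider "http://<the web site>/images/resource.jpg"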

With the resources picked out, I needed to set the parameters for the “wget” command.  Since the images would be processed into a video feed, I had to sequence the names.  Initially I wanted to just number them, but I settled on a time stamp in the name for simplicity.  One added benefit of doing this is that I now have a time I can reference later, if needed.  Here is the syntax I used:

  • wget -O resource_$(date +%Y%m%d%H%M%S).jpg "http://<the web site>/images/resource.jpg"
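
To make this easier to schedule, the same command can be wrapped in a small script for CRON to call.  Here is a rough sketch – the script name, output directory, and URL are placeholders, not my actual sources:

  #!/bin/bash
  # scrape.sh – download one image and give it a sortable, timestamped name
  OUTDIR="/home/myuser/myimgs"
  STAMP=$(date +%Y%m%d%H%M%S)
  wget -q -O "$OUTDIR/resource_${STAMP}.jpg" "http://<the web site>/images/resource.jpg"

One practical reason for the wrapper: the % characters in the date format have special meaning inside a crontab line and would need escaping there, while a script sidesteps the issue entirely.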

I knew Webmin would come in handy later on, and setting up CRON confirmed it.  There are all sorts of instructions for setting up CRON through the CLI, and I can still recall my first impression of the crontab syntax – it isn’t exactly welcoming.  For this reason, I wanted to show how Webmin can be used by others wanting to schedule tasks.

In the System section of Webmin, there is a sub-category called “Scheduled Cron Jobs”.  Clicking this link opens the page for setting scheduled jobs on the RPi.  I won’t repeat all the steps here, but I will say it is easier than the CLI.  I set the interval for my camera images to scrape every 5 minutes, while the other images scrape at 10-minute intervals.  Once the jobs were set, I tested them.  Webmin shows the output of each run, so you can tell if something isn’t right and why.
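
For reference, the crontab entries behind jobs like these look something like the following – the script names and paths are hypothetical placeholders:

  • */5 * * * * /home/myuser/scripts/scrape_camera.sh
  • */10 * * * * /home/myuser/scripts/scrape_weather.sh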

After running my automated scrapes, I decided to run a CRON job to open and close relays attached to my GPIO header.  The python script wasn’t tricky, and I’ve been able to automatically control the relays using these commands:

  • python /home/myuser/pythoncode/relay_on.py 
  • python /home/myuser/pythoncode/relay_off.py
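
Scheduling these through Webmin works exactly the same way.  A pair of crontab entries like the following – the times here are just an example – would cycle the relays once a day:

  • 0 8 * * * python /home/myuser/pythoncode/relay_on.py
  • 0 20 * * * python /home/myuser/pythoncode/relay_off.py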

This was all it took to get automation in place.  Now I can create more extensive conditions and have CRON do the rest.  That’s really all there is to automation on the RPi.
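
As a trivial sketch of what a more extensive condition might look like – the threshold, paths, and script name are all made up – a scheduled script could check free disk space before scraping:

  #!/bin/bash
  # skip the scrape if the root filesystem is over 90% full
  USED=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
  if [ "$USED" -lt 90 ]; then
    /home/myuser/scripts/scrape_camera.sh
  fi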

After a couple of days, my web-scraped image repository was starting to fill with content.  With enough images gathered, the video rendering step was next.  I tried avconv first, but had trouble getting it to handle the timestamped names.  I didn’t want to spend too much time creating a script, so I went ahead and used MEncoder.  First I installed it, then ran the encode commands.  It worked like a champ – sort of.

  • sudo apt-get install mencoder
  • mencoder mf://*.jpg -mf fps=25 -o jpg_movie.avi -ovc lavc -lavcopts vcodec=mpeg4
  • mencoder mf://*.png -mf fps=25 -o png_movie.avi -ovc lavc -lavcopts vcodec=mpeg4

It worked for jpg and png files, but failed to render gif images.  For that, I would need to install and use ImageMagick and ffmpeg.  The setup is simple:

  • sudo apt-get install imagemagick
  • sudo apt-get install ffmpeg

The first thing I had to do was create an animated gif from the gif files.

  • convert '/home/myuser/myimgs/ir/*.gif' '/home/myuser/myvids/video.gif'

Then I rendered the gif into a video file.

  • ffmpeg -r 60 -i '/home/myuser/myvids/video.gif' '/home/myuser/myvids/video.avi'

I had some issues rendering, which turned out to be caused by null image files.  On occasions when the web scrape failed, the result was an orphaned download – a new file of zero bytes.  After removing these zero-byte files, the render worked fine.
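
A one-liner like this will clear them out before each render – assuming, as above, that the images live under /home/myuser/myimgs:

  • find /home/myuser/myimgs -type f -size 0 -delete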

Since I want to process the images and video with OpenCV in the future, I determined it was worthwhile to go through the setup.  Be warned: this is a time-consuming process.  I was fortunate to find the setup steps online after a failed attempt.  Much thanks and appreciation go out to Francesco Piscani for taking the time to spell them out and provide support for folks that have had trouble.  You can find the install details online at Francesco’s website.

After about 12 hours, most of it spent compiling the OpenCV code, the install was complete.

Relations – enhanced image processing

The image processing features of OpenCV are beyond the scope of this post.  I had originally wanted to include them, but felt it would be too much – the installation alone is a big undertaking, and covering the features as well would have stretched this topic too far.  However, I do think it’s worth mentioning.

OpenCV has a tremendous feature set, and leveraging it against web-scraped image content seems like an ideal fit for the RPi.  Most of the examples I’ve seen revolve around a camera directly attached to the host, but it makes just as much sense to pull in third-party streams and process them with OpenCV.  I’m not entirely sure yet what the application would be, but it is clear the expanded functionality would open up a great deal.

Summary

Automation and web scraping are two fascinating topics in their own right.  Add them together, mix in the RPi, and you’ve got an embedded platform that’s more than a novelty.  Automating tasks using Webmin is the simplest and most scalable approach; it frees us from mind-numbing repetitive work and puts the RPi to use doing what it does best.  Web scraping content from third parties is an excellent way to gather data instead of re-creating it.  That data can then be processed and put to work performing tasks that would otherwise be too difficult or impractical.  When possibilities open up, the ideas that follow will flourish.
