Web crawling in F#

I’m working on a web crawler in F# and have some questions on the subject.

  • Do you need to create a script to crawl in F#, or can you create a console application?
    I have seen some tutorials, and all of them use #r (references), which is not possible in a console application?
  • What are the benefits of using F# for web crawling, for example compared to C#?

I hope someone can shed some light on these questions.

best regards and merry Christmas

Hi EdwinSilvaDK,

A late response, but have a look at the Type Providers feature in F#, in particular the HTML type provider.

I see that this is an old question, but I’ll give my brief thoughts anyway:

I do web crawling all the time with F#, typically by starting with a console app.
I usually use:

  • Http.fs “A simple, functional HTTP client library for F#” (I find it more Fsharp-ish than just using the standard Microsoft libraries, and more powerful than the HTTP utilities in F# Data).
  • Hopac “a Concurrent ML style concurrent programming library for F#” – Mainly because Http.fs uses it.
  • Selenium – because more and more these days you need to have JavaScript running in a headless browser to scrape a page correctly.
  • And most importantly AngleSharp – makes it easy to explore the DOM.

The first three are probably for after you’ve gained experience with simple scrapers. And for all of them I eventually wrote lots of helper functions and extension methods to make writing scrapers quicker and easier for ME.

And of course, before you can scrape web pages, you really need to be fluent at using the DevTools in Chrome/Firefox/Opera/Brave to poke around within a page. Being able to execute JavaScript code directly from a browser’s console also comes in handy when trying to figure out how to do stuff. Just keep in mind that downloading and examining static HTML is not the same thing as loading a page in a browser that then runs JavaScript on it.

As for F# vs C#, see “Comparing F# with C#: Downloading a web page”.

The main reason people create a .fsx script is so they can play around interactively in the F# Interactive (FSI) REPL. But it can be tricky to get the right #r references declared first. The equivalent step for console apps is to install the correct NuGet package references (which you normally have to do anyway before you can write scripts that reference them). I personally rarely bother using FSI or .fsx scripts.
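
To make the script vs. console app point concrete, here is a minimal sketch that downloads a page using only the .NET base library, so it needs neither an #r line nor a NuGet package. The same text should work as the Program.fs of a console project or as a .fsx script run with dotnet fsi. The names tryPickUrl and fetchHtml are made up for this illustration; they are not from any library.

```fsharp
open System
open System.Net.Http

/// Returns the first command-line argument that looks like an HTTP(S) URL, if any.
/// (Hypothetical helper, only for this sketch.)
let tryPickUrl (args: string[]) : string option =
    args |> Array.tryFind (fun a -> a.StartsWith("http://") || a.StartsWith("https://"))

/// Downloads the static HTML of a page. No JavaScript is executed,
/// so content that a browser would render via scripts will be missing.
let fetchHtml (url: string) : string =
    use client = new HttpClient()
    client.GetStringAsync(url).GetAwaiter().GetResult()

// Top-level code: runs as the implicit entry point of a console app,
// or line by line when the file is executed as a script.
match tryPickUrl (Environment.GetCommandLineArgs()) with
| Some url -> printfn "Downloaded %d characters from %s" (fetchHtml url).Length url
| None -> printfn "Usage: pass a URL as an argument"
```

Once you need a third-party library, the two routes differ only in how the reference is declared: a console project gets it via something like dotnet add package AngleSharp, while a script (with F# 5 or later, as far as I know) can put #r "nuget: AngleSharp" at the top of the file.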

And apologies to @kevinmcfarlane, but I also NEVER bother with type providers, especially the HTML Type Provider, since the things I want to scrape are rarely if ever sitting nicely in HTML tables. Instead they’re in complicated <div>/<a>/<span> hierarchies that I use AngleSharp to get at.
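
As a rough illustration of getting at such hierarchies, here is a small script that parses a static HTML string with AngleSharp and pulls links and titles out of nested <div>/<a>/<span> elements using CSS selectors. The markup, the class names and the extractItems helper are invented for this example, and the #r "nuget: ..." line assumes F# 5 or later; in a console app you would add the AngleSharp package reference to the project instead.

```fsharp
#r "nuget: AngleSharp"

open AngleSharp.Html.Parser

// Hypothetical markup standing in for a downloaded page.
let sampleHtml = """
<div class="listing">
  <div class="item"><a href="/p/1"><span class="title">First</span></a></div>
  <div class="item"><a href="/p/2"><span class="title">Second</span></a></div>
</div>"""

/// Returns (href, title) pairs for every link inside a div.item element.
/// Assumes each link contains a span.title; real pages need null checks.
let extractItems (html: string) : (string * string) list =
    let doc = HtmlParser().ParseDocument(html)
    doc.QuerySelectorAll("div.item a")
    |> Seq.map (fun a ->
        let title = a.QuerySelector("span.title").TextContent.Trim()
        (a.GetAttribute("href"), title))
    |> Seq.toList

extractItems sampleHtml
|> List.iter (fun (href, title) -> printfn "%s -> %s" title href)
```

The selectors are the same ones you would try out in the browser’s DevTools console with document.querySelectorAll, which makes it easy to work them out in the browser first and then paste them into the scraper.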
