Web scraping with F#


#1

Hi, does anyone know of any resources for learning web scraping with F#?


#2

If you are searching for a silver bullet then, alas, there isn't one.

So I suggest asking more specific questions, one by one, step by step.

Here are some libraries that you will probably find useful:

F# Data: Library for Data Access

OpenScraping: HTML Structured Data Extraction C# Library

AngleSharp

I have used only F# Data and AngleSharp. I chose AngleSharp when F# Data wasn’t able to parse pages correctly, but for simple cases F# Data also worked fine.
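
For a simple case, a minimal sketch of extracting links with F# Data might look like the following. It assumes an F# script with the FSharp.Data NuGet package available, and the inline HTML is a made-up sample so that the script runs without network access. For a real page, `HtmlDocument.Load` accepts a URL instead.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Made-up sample page (an assumption for illustration, not a real site).
let html =
    """<html><body>
         <a href="https://fsharp.org">FSharp</a>
         <a>no href here</a>
         <a href="/docs">Docs</a>
       </body></html>"""

// Parse the string into F# Data's HTML document model.
let doc = HtmlDocument.Parse html

// Collect (link text, href) pairs, skipping anchors without an href attribute.
let links =
    doc.Descendants ["a"]
    |> Seq.choose (fun a ->
        a.TryGetAttribute "href"
        |> Option.map (fun attr -> a.InnerText(), attr.Value()))
    |> Seq.toList

links |> List.iter (fun (text, href) -> printfn "%s -> %s" text href)
```

If a page is too malformed for this to work, the same extraction step is where you would swap in AngleSharp.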


#3

You may also want to look into Canopy:
https://lefthandedgoat.github.io/canopy/

This is a browser testing framework, a wrapper around Selenium, but there’s probably a good opportunity to hack this into doing what you’d like :slight_smile:


#4

Thanks. Does anyone want to build a framework for web scraping? :smile:


#5

I gave a talk at F# London a few years ago about building a simple search engine, which featured a little bit of web scraping: https://skillsmatter.com/skillscasts/8901-f-sharpunctional-londoners-meetup. The code for it is still on my GitHub, but it’s a private repo because of keys etc., so I’ve attached the code for scraping a webpage using HtmlAgilityPack. I also have a robots.txt parser, which I should probably open source at some point.

module Scraper

open System
open HtmlAgilityPack
open System.Text

let scrapePage (uriSink:string -> Async<unit>)
               (indexDocument:Indexer.HtmlDocument -> unit)
               (storeLink:LinkPair -> unit)
               (getDocument:string -> Async<Option<string>>)
               (resolveUri:string -> string -> string)
               (url:string) =
    async {
        let! content = getDocument url
        match content with
        | Some content ->
            let document = HtmlAgilityPack.HtmlDocument()
            document.LoadHtml(content)
            do!
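                // SelectNodes returns null (not an empty collection) when nothing matches,
                // so fall back to an empty sequence for pages without links.
                document.DocumentNode.SelectNodes("//a[@href]")
                |> Option.ofObj
                |> Option.map Seq.cast<HtmlNode>
                |> Option.defaultValue Seq.empty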
                |> Seq.map (fun t -> t.GetAttributeValue("href", url))
                |> Seq.map (resolveUri url)
                |> Seq.filter (fun t -> t <> url)
                |> Seq.filter (fun t ->
                    try
                        let uri = Uri(t)
                        uri.Scheme.StartsWith("http")
                    with
                    | _ -> false)
                |> Seq.map (fun t -> { SourceUrl = url; DestinationUrl = t })
                |> Seq.map (fun t -> async {
                    do storeLink t
                    //do! uriSink t.DestinationUrl
                })
                |> Async.Parallel
                |> Async.Ignore
            let pageTitle =
                // Pages without a <title> element get an empty title.
                let title = document.DocumentNode.SelectSingleNode("//title")
                if isNull title then "" else title.InnerText
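            let text =
                // SelectSingleNode returns null when the page has no <body>.
                match document.DocumentNode.SelectSingleNode("//body") with
                | null -> ""
                | body ->
                    // Keep visible text nodes only, skipping script and style content.
                    body.Descendants()
                    |> Seq.filter (fun t -> t.NodeType = HtmlNodeType.Text &&
                                            t.ParentNode.Name <> "script" &&
                                            t.ParentNode.Name <> "style")
                    |> Seq.map (fun t -> t.InnerText)
                    |> Seq.filter (String.IsNullOrWhiteSpace >> not)
                    |> String.concat Environment.NewLine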
            let domain, path =
                let uri = Uri(url)
                uri.Host, uri.AbsolutePath
            let doc =
                { documentId = Convert.ToBase64String(Encoding.UTF8.GetBytes(url))
                  domain = domain
                  path = path
                  pageTitle = pageTitle
                  textContent = text
                  pageRankScore = 0.0 }
            indexDocument doc
            return ()
        | None -> return ()
    }

#6

Thanks, I will use it.


#7

I started experimenting with monadic web scraping: https://github.com/gusty/ScrapeM
But I didn’t go further, because anonymous types were not available at the time. Now that F# 4.6 has anonymous records, I should pick it up again.


#8