Hi, does anyone know of any resources where I can learn web scraping with F#?
If you are searching for a silver bullet then, alas, there isn’t one.
So I suggest asking more specific questions, one by one, step by step.
Here are some libraries which you will probably find useful:
F# Data: Library for Data Access
OpenScraping: HTML Structured Data Extraction C# Library
I have only used F# Data and AngleSharp. I chose AngleSharp when F# Data wasn’t able to parse pages correctly, but for simple cases F# Data also worked fine.
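For the simple cases, a minimal F# Data sketch could look like the following. It parses an inline HTML string (made up here for illustration, so nothing is fetched from the network) and collects the text and `href` of every link that has one. It assumes the FSharp.Data NuGet package and is run as an F# script.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Sample markup invented for this sketch; a real page would come from
// HtmlDocument.Load(url) or from your own HTTP client.
let html =
    """<html><body>
         <a href="/a">First</a>
         <a>No href</a>
         <a href="https://example.com/b">Second</a>
       </body></html>"""

let doc = HtmlDocument.Parse(html)

// Keep only anchors that actually carry an href attribute,
// returning (link text, href value) pairs.
let links =
    doc.Descendants ["a"]
    |> Seq.choose (fun a ->
        a.TryGetAttribute("href")
        |> Option.map (fun attr -> a.InnerText(), attr.Value()))
    |> Seq.toList

links |> List.iter (fun (text, href) -> printfn "%s -> %s" text href)
```

If a page comes back with missing or mangled nodes, that is the point where switching to a stricter parser such as AngleSharp is worth trying.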
You may also want to look into Canopy:
https://lefthandedgoat.github.io/canopy/
This is a browser testing framework, a wrapper around Selenium, but there’s probably a good opportunity to hack it into doing what you’d like.
Thanks. Does anyone want to build a framework for web scraping?
I gave a talk at F# London a few years ago about building a simple search engine, which featured a little bit of web scraping: https://skillsmatter.com/skillscasts/8901-f-sharpunctional-londoners-meetup
The code for it is still on my GitHub, but it’s a private repo because of keys etc., so I’ve attached the code for scraping a webpage using HtmlAgilityPack. I also have a robots.txt parser which I should probably open source at some point.
module Scraper

open System
open HtmlAgilityPack
open System.Text

// Downloads one page, stores its outgoing links and indexes its visible text.
// Indexer.HtmlDocument and LinkPair are defined elsewhere in the project.
let scrapePage (uriSink: string -> Async<unit>)
               (indexDocument: Indexer.HtmlDocument -> unit)
               (storeLink: LinkPair -> unit)
               (getDocument: string -> Async<Option<string>>)
               (resolveUri: string -> string -> string)
               (url: string) =
    async {
        let! content = getDocument url
        match content with
        | Some content ->
            let document = HtmlAgilityPack.HtmlDocument()
            document.LoadHtml(content)

            // SelectNodes returns null (not an empty collection) when nothing matches.
            let anchors =
                match document.DocumentNode.SelectNodes("//a[@href]") with
                | null -> Seq.empty
                | nodes -> Seq.cast<HtmlNode> nodes

            // Resolve every link against the page URL, keep only http(s)
            // links that point somewhere else, and store each one.
            do!
                anchors
                |> Seq.map (fun t -> t.GetAttributeValue("href", url))
                |> Seq.map (resolveUri url)
                |> Seq.filter (fun t -> t <> url)
                |> Seq.filter (fun t ->
                    try
                        let uri = Uri(t)
                        uri.Scheme.StartsWith("http")
                    with
                    | _ -> false)
                |> Seq.map (fun t -> { SourceUrl = url; DestinationUrl = t })
                |> Seq.map (fun t -> async {
                    do storeLink t
                    //do! uriSink t.DestinationUrl
                })
                |> Async.Parallel
                |> Async.Ignore

            let pageTitle =
                let title = document.DocumentNode.SelectSingleNode("//title")
                if isNull title then ""
                else title.InnerText

            // Visible text only: skip script and style contents.
            let text =
                match document.DocumentNode.SelectSingleNode("//body") with
                | null -> ""   // page without a <body>
                | body ->
                    body.Descendants()
                    |> Seq.filter (fun t -> t.NodeType = HtmlNodeType.Text &&
                                            t.ParentNode.Name <> "script" &&
                                            t.ParentNode.Name <> "style")
                    |> Seq.map (fun t -> t.InnerText)
                    |> Seq.filter (String.IsNullOrWhiteSpace >> not)
                    |> String.concat Environment.NewLine

            let domain, path =
                let uri = Uri(url)
                uri.Host, uri.AbsolutePath

            let doc =
                { documentId = Convert.ToBase64String(Encoding.UTF8.GetBytes(url))
                  domain = domain
                  path = path
                  pageTitle = pageTitle
                  textContent = text
                  pageRankScore = 0.0 }

            indexDocument doc
            return ()
        | None -> return ()
    }
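scrapePage takes its dependencies as function arguments, and the implementations passed in are not part of this post. As a rough sketch only (these are not the originals), `getDocument` and `resolveUri` could be written with nothing but the .NET base library, assuming a shared `HttpClient` and that any download failure should simply yield `None`:

```fsharp
open System
open System.Net.Http

// One shared client for all requests (assumption of this sketch).
let client = new HttpClient()

// Fetch a page body; any exception (DNS, timeout, non-success status)
// is swallowed and reported as None, matching the Option the scraper expects.
let getDocument (url: string) : Async<string option> =
    async {
        try
            let! html = client.GetStringAsync(url) |> Async.AwaitTask
            return Some html
        with _ ->
            return None
    }

// Resolve an href against the page it was found on.
// Falls back to the raw href if it cannot be combined with the base URL.
// Note: Uri(baseUrl) throws if baseUrl itself is not an absolute URI.
let resolveUri (baseUrl: string) (href: string) : string =
    match Uri.TryCreate(Uri(baseUrl), href) with
    | true, resolved -> resolved.AbsoluteUri
    | _ -> href

printfn "%s" (resolveUri "https://example.com/docs/index.html" "../about")
```

Relative links get resolved against the page URL and absolute links pass through unchanged, so the scheme filter inside scrapePage can then drop anything that is not http(s), such as mailto: links.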
Thanks, I will use it.
I started experimenting with monadic web scraping: https://github.com/gusty/ScrapeM
But I didn’t go further, because anonymous types (anonymous records) weren’t available at the time. Now that F# 4.6 has them, I should pick it up again.