Module ::Utils
In: lib/rbot/core/utils/httputil.rb
lib/rbot/core/utils/utils.rb

Miscellaneous useful functions

Methods

Classes and Modules

Class ::Utils::HttpUtil

Constants

UNESCAPE_TABLE = { 'laquo' => '«', 'raquo' => '»', 'quot' => '"', 'apos' => '\'', 'micro' => 'µ', 'copy' => '©', 'trade' => '™', 'reg' => '®', 'amp' => '&', 'lt' => '<', 'gt' => '>', 'hellip' => '…', 'nbsp' => ' ', 'Agrave' => 'À', 'Aacute' => 'Á', 'Acirc' => 'Â', 'Atilde' => 'Ã', 'Auml' => 'Ä', 'Aring' => 'Å', 'AElig' => 'Æ', 'OElig' => 'Œ', 'Ccedil' => 'Ç', 'Egrave' => 'È', 'Eacute' => 'É', 'Ecirc' => 'Ê', 'Euml' => 'Ë', 'Igrave' => 'Ì', 'Iacute' => 'Í', 'Icirc' => 'Î', 'Iuml' => 'Ï', 'ETH' => 'Ð', 'Ntilde' => 'Ñ', 'Ograve' => 'Ò', 'Oacute' => 'Ó', 'Ocirc' => 'Ô', 'Otilde' => 'Õ', 'Ouml' => 'Ö', 'Oslash' => 'Ø', 'Ugrave' => 'Ù', 'Uacute' => 'Ú', 'Ucirc' => 'Û', 'Uuml' => 'Ü', 'Yacute' => 'Ý', 'THORN' => 'Þ', 'szlig' => 'ß', 'agrave' => 'à', 'aacute' => 'á', 'acirc' => 'â', 'atilde' => 'ã', 'auml' => 'ä', 'aring' => 'å', 'aelig' => 'æ', 'oelig' => 'œ', 'ccedil' => 'ç', 'egrave' => 'è', 'eacute' => 'é', 'ecirc' => 'ê', 'euml' => 'ë', 'igrave' => 'ì', 'iacute' => 'í', 'icirc' => 'î', 'iuml' => 'ï', 'eth' => 'ð', 'ntilde' => 'ñ', 'ograve' => 'ò', 'oacute' => 'ó', 'ocirc' => 'ô', 'otilde' => 'õ', 'ouml' => 'ö', 'oslash' => 'ø', 'ugrave' => 'ù', 'uacute' => 'ú', 'ucirc' => 'û', 'uuml' => 'ü', 'yacute' => 'ý', 'thorn' => 'þ', 'yuml' => 'ÿ'
AFTER_PAR_PATH = /^(?:div|span)$/
AFTER_PAR_EX = /^(?:td|tr|tbody|table)$/
AFTER_PAR_CLASS = /body|message|text/i
TITLE_REGEX = /<\s*?title\s*?>(.+?)<\s*?\/title\s*?>/im   Title
HX_REGEX = /<h(\d)(?:\s+[^>]*)?>(.*?)<\/h\1>/im   H1, H2, etc
PAR_REGEX = /<p(?:\s+[^>]*)?>.*?<\/?(?:p|div|html|body|table|td|tr)(?:\s+[^>]*)?>/im   A paragraph
AFTER_PAR1_REGEX = /<\w+\s+[^>]*(?:body|message|text)[^>]*>.*?<\/?(?:p|div|html|body|table|td|tr)(?:\s+[^>]*)?>/im   Some blogging and forum platforms use spans or divs with a ‘body’ or ‘message’ or ‘text’ in their class to mark actual text
AFTER_PAR2_REGEX = /<br(?:\s+[^>]*)?\/?>.*?<\/?(?:br|p|div|html|body|table|td|tr)(?:\s+[^>]*)?\/?>/im   As a last resort, we can try the text found between two <br> tags
SEC_PER_MIN = 60   Seconds per minute
SEC_PER_HR = SEC_PER_MIN * 60   Seconds per hour
SEC_PER_DAY = SEC_PER_HR * 24   Seconds per day
SEC_PER_MNTH = SEC_PER_DAY * 30   Seconds per (30-day) month
SEC_PER_YR = SEC_PER_MNTH * 12   Seconds per (30*12 = 360-day) year
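
The time constants above build on each other, so a "year" works out to 360 days rather than 365. The following sketch copies the definitions and checks that arithmetic; split_days is a hypothetical helper that is not part of the module.

```ruby
# Constants copied from the list above.
SEC_PER_MIN  = 60
SEC_PER_HR   = SEC_PER_MIN * 60
SEC_PER_DAY  = SEC_PER_HR * 24
SEC_PER_MNTH = SEC_PER_DAY * 30
SEC_PER_YR   = SEC_PER_MNTH * 12

# A "year" in these units is 12 months of 30 days each.
days_per_year = SEC_PER_YR / SEC_PER_DAY

# Hypothetical helper (not part of Utils): split a duration into
# whole days and the leftover seconds.
def split_days(secs)
  secs.divmod(SEC_PER_DAY)
end

puts days_per_year
p split_days(90_000)
```

Code that needs calendar-accurate results should not rely on SEC_PER_MNTH or SEC_PER_YR, since both are approximations.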

Public Class methods

The bot instance

Set up some Utils routines which depend on the associated bot.

HTML info filters often need to check if the webpage location of a passed DataStream ds matches a given Regexp.

Decode HTML entities in the String str, using HTMLEntities if the package was found, or UNESCAPE_TABLE otherwise.
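
As a rough illustration of the table-based fallback, the sketch below decodes entities with a small excerpt of UNESCAPE_TABLE. The helper name decode_entities_sketch, the handling of numeric entities, and the choice to leave unknown entities untouched are all assumptions; the actual method may behave differently.

```ruby
# Small excerpt of UNESCAPE_TABLE, for illustration only.
TABLE_EXCERPT = {
  'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'eacute' => 'é'
}.freeze

# Hypothetical fallback decoder (not the module's own code):
# named entities are looked up in the table, numeric ones are
# converted from their code point.
def decode_entities_sketch(str)
  str.gsub(/&(?:#(\d+)|#[xX](\h+)|(\w+));/) do |match|
    dec, hex, name = $1, $2, $3
    if dec
      [dec.to_i].pack('U')
    elsif hex
      [hex.to_i(16)].pack('U')
    else
      # Assumption: unknown entities are left as they are.
      TABLE_EXCERPT.fetch(name, match)
    end
  end
end

puts decode_entities_sketch('caf&eacute; &amp; t&#233; &lt;b&gt; &bogus;')
```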

Translates a number of minutes into a verbal distance, e.g.

     0.5 => less than a minute
     70 => about one hour

Get the first paragraphs of the first count URLs. The pages are downloaded using the bot's httputil service. Returns an array of the first paragraphs fetched. If the optional opts :message is specified, those paragraphs are also echoed as replies to that IRC message.

This method extracts title, content (first par) and extra information from the given document doc.

doc can be a URI, a Net::HTTPResponse or a String.

If doc is a String, only title and content information are retrieved (if possible), using standard methods.

If doc is a URI or a Net::HTTPResponse, additional information is retrieved, and special title/summary extraction routines are used if possible.

This method extracts title, content (first par) and extra information from the given Net::HTTPResponse resp.

Currently, the only accepted options (in opts) are

uri_fragment: the URI fragment of the original request
full_body: get the whole body instead of @@bot.config bytes only

Returns a DataStream with the following keys:

text: the (partial) body
title: the title of the document (if any)
content: the first paragraph of the document (if any)
headers: the headers of the Net::HTTPResponse. The value is a Hash whose keys are lowercase forms of the HTTP header fields, and whose values are Arrays.
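
The shape described for headers (lowercase keys, Array values) matches what Net::HTTPResponse#to_hash returns in Ruby's standard library. Whether the method builds the value with that exact call is an assumption; the sketch below only shows the shape, using a response object built by hand so that no network access is needed.

```ruby
require 'net/http'

# Build a response object by hand instead of performing a request.
resp = Net::HTTPOK.new('1.1', '200', 'OK')
resp['Content-Type'] = 'text/html; charset=utf-8'
resp.add_field('Set-Cookie', 'a=1')
resp.add_field('Set-Cookie', 'b=2')

# Keys are lowercased header names; values are Arrays because a
# header field may appear more than once.
headers = resp.to_hash

p headers['content-type']
p headers['set-cookie']
```

Because every value is an Array, callers interested in a single-valued header typically take the first element, e.g. headers['content-type'].first.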

This method extracts title and content (first par) from the given HTML or XML document text, using standard methods (String#ircify_html_title, Utils.ircify_first_html_par)

Currently, the only accepted option (in opts) is

uri_fragment: the URI fragment of the original request
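
To give an idea of how a title can be pulled out of a document, here is a minimal sketch built on the TITLE_REGEX constant listed earlier. The helper raw_title and its whitespace cleanup are hypothetical; the module itself relies on String#ircify_html_title, which is not reproduced here.

```ruby
# Copied from the constants list.
TITLE_REGEX = /<\s*?title\s*?>(.+?)<\s*?\/title\s*?>/im

# Hypothetical helper: return the title with whitespace collapsed,
# or nil when the document has no <title> element.
def raw_title(text)
  m = TITLE_REGEX.match(text)
  m && m[1].gsub(/\s+/, ' ').strip
end

html = "<html><head><TITLE>\n  rbot\n  docs </TITLE></head><body><p>Hi</p></body></html>"
p raw_title(html)
p raw_title('no markup at all')
```

The /i flag makes the match case-insensitive and /m lets the title span several lines, which is why the upper-case, multi-line title above is still found.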

Try to grab and IRCify the first HTML paragraph (<p> tag) in the given string. If possible, grab the one after the first heading.

It is possible to pass some options to determine how the stripping occurs. Currently supported options are

strip: Regex or String to strip at the beginning of the obtained text
min_spaces: minimum number of spaces a paragraph should have
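
The two options can be pictured with a simplified sketch based on PAR_REGEX from the constants list. Everything else here is an assumption: the helper name first_par_sketch, the default of 1 for min_spaces, the tag stripping, and the order in which the options are applied. The real method also looks at headings and the AFTER_PAR fallbacks, which are left out.

```ruby
# Copied from the constants list.
PAR_REGEX = /<p(?:\s+[^>]*)?>.*?<\/?(?:p|div|html|body|table|td|tr)(?:\s+[^>]*)?>/im

# Hypothetical, simplified paragraph grabber (not the module's code).
def first_par_sketch(html, opts = {})
  min_spaces = opts.fetch(:min_spaces, 1) # default value is an assumption
  strip = opts[:strip]
  strip = Regexp.escape(strip) if strip.is_a?(String)

  html.scan(PAR_REGEX) do |chunk|
    # Drop tags and collapse whitespace.
    text = chunk.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
    # Remove the :strip pattern only at the beginning of the text.
    text = text.sub(/\A(?:#{strip})/, '').strip if strip
    # Skip short fragments such as navigation labels.
    return text if text.count(' ') >= min_spaces
  end
  nil
end

html = '<div><p>Home</p><p>rbot is an IRC bot written in Ruby.</p></div>'
p first_par_sketch(html, min_spaces: 4)
p first_par_sketch(html, min_spaces: 4, strip: 'rbot ')
```

The min_spaces threshold is what lets the sketch skip the one-word "Home" paragraph and settle on the first paragraph that looks like a sentence.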

HTML first par grabber using hpricot

HTML first par grabber without hpricot

Execute an external program, returning a String obtained by redirecting the program's standard error and standard output.
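
One way to get the combined output of a program in Ruby is Open3.capture2e from the standard library, which merges standard error into standard output. The sketch below uses it as a stand-in; the helper name run_merged and its error message are hypothetical, and the module's own implementation may use a different mechanism.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical stand-in (not the module's code): run a command without
# going through a shell and return stdout and stderr as one String.
def run_merged(command, *args)
  output, _status = Open3.capture2e(command, *args)
  output
rescue SystemCallError => e
  # E.g. Errno::ENOENT when the program does not exist.
  "exec of #{command} failed: #{e.message}"
end

# Use the running Ruby interpreter as a program that is known to exist.
puts run_merged(RbConfig.ruby, '-e', 'puts "out"; warn "err"')
```

Passing the program and its arguments as separate values avoids shell interpretation of the arguments, which matters when they come from IRC users.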

Safely (atomically) save to file, by passing a tempfile to the block and then moving the tempfile to its final location when done.
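
The pattern described above (write to a tempfile, then move it into place) can be sketched with Ruby's Tempfile and File.rename. The helper atomic_save is a hypothetical re-implementation, and creating the tempfile in the target's own directory is an assumption made so that the rename stays on one filesystem.

```ruby
require 'tempfile'
require 'tmpdir'

# Hypothetical re-implementation of the pattern (not the module's code).
def atomic_save(path)
  # Assumption: the tempfile lives next to the target file.
  tmp = Tempfile.new(File.basename(path), File.dirname(path))
  begin
    yield tmp
    tmp.close
    # Readers see either the old file or the complete new one.
    File.rename(tmp.path, path)
  ensure
    # Removes the tempfile if the block raised; harmless after a rename.
    tmp.close!
  end
end

Dir.mktmpdir do |dir|
  target = File.join(dir, 'settings.txt')
  atomic_save(target) { |f| f.write('nick = rbot') }
  puts File.read(target)
end
```

If the block raises, the target file keeps its previous content, which is the main reason to prefer this over writing to the target directly.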

Turn a number of seconds into a short hours:minutes:seconds string, e.g. 3:18:10 or 5'12" or 7s
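
The three example formats can be reproduced with two divmod calls. This is a minimal sketch under assumptions: the helper name short_duration is hypothetical, and the zero-padding of minutes and seconds is a choice made here that the real method may not share.

```ruby
# Hypothetical helper (not the module's code).
def short_duration(seconds)
  secs = seconds.to_i
  mins, secs = secs.divmod(60)
  hours, mins = mins.divmod(60)
  if hours > 0
    format('%d:%02d:%02d', hours, mins, secs)
  elsif mins > 0
    format("%d'%02d\"", mins, secs)
  else
    format('%ds', secs)
  end
end

puts short_duration(11_890)
puts short_duration(312)
puts short_duration(7)
```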

Turn a number of seconds into a human-readable string, e.g. 2 days, 3 hours, 18 minutes and 10 seconds
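
A possible way to produce such a string is to peel off days, hours, minutes and seconds in turn and then join the non-zero parts. The helper long_duration and its behaviour for zero seconds are assumptions; larger units such as months and years are left out of this sketch.

```ruby
# Unit sizes in seconds; they correspond to SEC_PER_DAY, SEC_PER_HR
# and SEC_PER_MIN from the constants list.
UNITS = [
  [24 * 60 * 60, 'day'],
  [60 * 60, 'hour'],
  [60, 'minute'],
  [1, 'second']
].freeze

# Hypothetical helper (not the module's code).
def long_duration(secs)
  rest = secs.to_i
  parts = []
  UNITS.each do |size, name|
    count, rest = rest.divmod(size)
    next if count.zero?
    parts << "#{count} #{name}#{'s' unless count == 1}"
  end
  return '0 seconds' if parts.empty? # assumption for the empty case
  return parts.first if parts.size == 1
  "#{parts[0..-2].join(', ')} and #{parts.last}"
end

puts long_duration(184_690)
puts long_duration(61)
```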

Returns a human-readable time, like:

      5 days ago
      about one hour ago

Options:

start_date: sets the time to measure against, defaults to now
date_format: used with to_formatted_s, defaults to :default

This method runs an appropriately crafted DataStream ds through the filters in the :htmlinfo filter group, in order. If one of the filters returns non-nil, its results are merged into ds and returned. Otherwise nil is returned.

The input DataStream should have the downloaded HTML as primary key (:text) and possibly a :headers key holding the response headers.
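
The "first filter that returns non-nil wins" behaviour can be illustrated with plain Ruby objects. In this sketch a DataStream is modelled as a Hash and each filter as a lambda; both are stand-ins, the helper name first_matching_filter is hypothetical, and the real filter group API of the bot is not shown.

```ruby
# Hypothetical stand-in for the filter chain (not the module's code).
def first_matching_filter(ds, filters)
  filters.each do |filter|
    result = filter.call(ds)
    # Stop at the first filter that recognises the page.
    return ds.merge(result) if result
  end
  nil
end

filters = [
  # Does not recognise anything.
  ->(_ds) { nil },
  # Recognises pages that carry a <title> element.
  lambda do |ds|
    title = ds[:text][/<title>(.*?)<\/title>/im, 1]
    { title: title } if title
  end,
  # Would match anything, but is only consulted if the others fail.
  ->(_ds) { { title: 'fallback' } }
]

ds = { text: '<title>Hi</title>', headers: {} }
p first_matching_filter(ds, filters)
```

Because the filters run in order, more specific filters need to come before generic ones.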
