Module | ::Utils |
In: |
lib/rbot/core/utils/httputil.rb
lib/rbot/core/utils/utils.rb |
Miscellaneous useful functions
UNESCAPE_TABLE | = | { 'laquo' => '«', 'raquo' => '»', 'quot' => '"', 'apos' => '\'', 'micro' => 'µ', 'copy' => '©', 'trade' => '™', 'reg' => '®', 'amp' => '&', 'lt' => '<', 'gt' => '>', 'hellip' => '…', 'nbsp' => ' ', 'Agrave' => 'À', 'Aacute' => 'Á', 'Acirc' => 'Â', 'Atilde' => 'Ã', 'Auml' => 'Ä', 'Aring' => 'Å', 'AElig' => 'Æ', 'OElig' => 'Œ', 'Ccedil' => 'Ç', 'Egrave' => 'È', 'Eacute' => 'É', 'Ecirc' => 'Ê', 'Euml' => 'Ë', 'Igrave' => 'Ì', 'Iacute' => 'Í', 'Icirc' => 'Î', 'Iuml' => 'Ï', 'ETH' => 'Ð', 'Ntilde' => 'Ñ', 'Ograve' => 'Ò', 'Oacute' => 'Ó', 'Ocirc' => 'Ô', 'Otilde' => 'Õ', 'Ouml' => 'Ö', 'Oslash' => 'Ø', 'Ugrave' => 'Ù', 'Uacute' => 'Ú', 'Ucirc' => 'Û', 'Uuml' => 'Ü', 'Yacute' => 'Ý', 'THORN' => 'Þ', 'szlig' => 'ß', 'agrave' => 'à', 'aacute' => 'á', 'acirc' => 'â', 'atilde' => 'ã', 'auml' => 'ä', 'aring' => 'å', 'aelig' => 'æ', 'oelig' => 'œ', 'ccedil' => 'ç', 'egrave' => 'è', 'eacute' => 'é', 'ecirc' => 'ê', 'euml' => 'ë', 'igrave' => 'ì', 'iacute' => 'í', 'icirc' => 'î', 'iuml' => 'ï', 'eth' => 'ð', 'ntilde' => 'ñ', 'ograve' => 'ò', 'oacute' => 'ó', 'ocirc' => 'ô', 'otilde' => 'õ', 'ouml' => 'ö', 'oslash' => 'ø', 'ugrave' => 'ù', 'uacute' => 'ú', 'ucirc' => 'û', 'uuml' => 'ü', 'yacute' => 'ý', 'thorn' => 'þ', 'yuml' => 'ÿ' | ||
AFTER_PAR_PATH | = | /^(?:div|span)$/ | ||
AFTER_PAR_EX | = | /^(?:td|tr|tbody|table)$/ | ||
AFTER_PAR_CLASS | = | /body|message|text/i | ||
TITLE_REGEX | = | /<\s*?title\s*?>(.+?)<\s*?\/title\s*?>/im | Title | |
HX_REGEX | = | /<h(\d)(?:\s+[^>]*)?>(.*?)<\/h\1>/im | H1, H2, etc | |
PAR_REGEX | = | /<p(?:\s+[^>]*)?>.*?<\/?(?:p|div|html|body|table|td|tr)(?:\s+[^>]*)?>/im | A paragraph | |
AFTER_PAR1_REGEX | = | /<\w+\s+[^>]*(?:body|message|text)[^>]*>.*?<\/?(?:p|div|html|body|table|td|tr)(?:\s+[^>]*)?>/im | Some blogging and forum platforms use spans or divs with a ‘body’ or ‘message’ or ‘text’ in their class to mark actual text | |
AFTER_PAR2_REGEX | = | /<br(?:\s+[^>]*)?\/?>.*?<\/?(?:br|p|div|html|body|table|td|tr)(?:\s+[^>]*)?\/?>/im | At worst, we can try stuff which is comprised between two <br> | |
SEC_PER_MIN | = | 60 | Seconds per minute | |
SEC_PER_HR | = | SEC_PER_MIN * 60 | Seconds per hour | |
SEC_PER_DAY | = | SEC_PER_HR * 24 | Seconds per day | |
SEC_PER_MNTH | = | SEC_PER_DAY * 30 | Seconds per (30-day) month | |
SEC_PER_YR | = | SEC_PER_MNTH * 12 | Seconds per (30*12 = 360 day) year |
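The time constants build on each other, so the month and year values are approximations. The sketch below repeats the definitions from the table inside a hypothetical module named TimeSketch (the name is only for illustration) and prints the resulting values.

```ruby
# Hypothetical wrapper module; the definitions mirror the table above.
module TimeSketch
  SEC_PER_MIN  = 60
  SEC_PER_HR   = SEC_PER_MIN * 60     # 3600
  SEC_PER_DAY  = SEC_PER_HR * 24      # 86400
  SEC_PER_MNTH = SEC_PER_DAY * 30     # 2592000 (30-day month)
  SEC_PER_YR   = SEC_PER_MNTH * 12    # 31104000 (360-day year)
end

puts TimeSketch::SEC_PER_YR
# A 365-day year is 5 days (432000 seconds) longer than SEC_PER_YR.
puts 365 * TimeSketch::SEC_PER_DAY - TimeSketch::SEC_PER_YR
```

Durations expressed in months or years with these constants therefore drift from calendar values by about five days per year.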
HTML info filters often need to check if the webpage location of a passed DataStream ds matches a given Regexp.
Decode HTML entities in the String str, using HTMLEntities if the package was found, or UNESCAPE_TABLE otherwise.
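A minimal sketch of the table-based fallback, assuming named entities only. The helper name decode_with_table and the shortened TABLE_EXCERPT are illustrative and not part of the module; the real UNESCAPE_TABLE is listed above, and the actual method may also handle cases this sketch ignores.

```ruby
# Illustrative excerpt; the full UNESCAPE_TABLE has many more entries.
TABLE_EXCERPT = {
  'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'eacute' => 'é'
}.freeze

# Hypothetical helper: replace each named entity found in the table,
# leaving unknown entities untouched.
def decode_with_table(str, table = TABLE_EXCERPT)
  str.gsub(/&(\w+);/) { |entity| table.fetch($1, entity) }
end

puts decode_with_table('caf&eacute; &amp; &lt;b&gt;')
```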
Translates a number of minutes into a verbal distance, e.g.
0.5 => less than a minute
70 => about one hour
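The mapping can be pictured as a set of ranges. The sketch below reproduces the two documented examples; every other threshold and phrase, and the method name minutes_in_words, are assumptions made for illustration and may differ from the real method.

```ruby
# Hypothetical sketch: only the 0.5 and 70 cases come from the documentation.
def minutes_in_words(minutes)
  raise ArgumentError, 'minutes must be non-negative' if minutes < 0

  case minutes
  when 0...1   then 'less than a minute'
  when 1...2   then 'one minute'                  # assumed wording
  when 2...45  then "#{minutes.round} minutes"    # assumed threshold
  when 45...90 then 'about one hour'
  else "about #{(minutes / 60.0).round} hours"    # assumed wording
  end
end

puts minutes_in_words(0.5)
puts minutes_in_words(70)
```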
Get the first pars of the first count urls. The pages are downloaded using the bot httputil service. Returns an array of the first paragraphs fetched. If (optional) opts :message is specified, those paragraphs are echoed as replies to the IRC message passed as opts :message
This method extracts title, content (first par) and extra information from the given document doc.
doc can be a URI, a Net::HTTPResponse or a String.
If doc is a String, only title and content information are retrieved (if possible), using standard methods.
If doc is a URI or a Net::HTTPResponse, additional information is retrieved, and special title/summary extraction routines are used if possible.
This method extracts title, content (first par) and extra information from the given Net::HTTPResponse resp.
Currently, the only accepted options (in opts) are
uri_fragment: | the URI fragment of the original request |
full_body: | get the whole body instead of @@bot.config bytes only |
Returns a DataStream with the following keys:
text: | the (partial) body |
title: | the title of the document (if any) |
content: | the first paragraph of the document (if any) |
headers: | the headers of the Net::HTTPResponse. The value is a Hash whose keys are lowercase forms of the HTTP header fields, and whose values are Arrays. |
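The shape described for headers matches what Net::HTTPResponse#to_hash returns in the Ruby standard library. The sketch below builds a response by hand, so no network access is needed, and shows the lowercase keys and Array values. Whether the method above uses to_hash internally is an assumption; sample_headers is a hypothetical helper.

```ruby
require 'net/http'

# Hypothetical helper that builds a response object and returns its headers
# in the lowercase-key / Array-value form.
def sample_headers
  resp = Net::HTTPOK.new('1.1', '200', 'OK')
  resp['Content-Type'] = 'text/html; charset=utf-8'
  resp.add_field('Set-Cookie', 'a=1')
  resp.add_field('Set-Cookie', 'b=2')   # repeated field: appended to the Array
  resp.to_hash
end

p sample_headers
```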
This method extracts title and content (first par) from the given HTML or XML document text, using standard methods (String#ircify_html_title, Utils.ircify_first_html_par)
Currently, the only accepted option (in opts) is
uri_fragment: | the URI fragment of the original request |
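As an illustration of title extraction, the sketch below applies the TITLE_REGEX pattern from the constants table to a String. The helper extract_title and its whitespace cleanup are assumptions; the real String#ircify_html_title presumably does more (entity decoding, IRC formatting).

```ruby
# Pattern copied from the constants table above.
TITLE_REGEX = /<\s*?title\s*?>(.+?)<\s*?\/title\s*?>/im

# Hypothetical helper: return the title with whitespace collapsed, or nil.
def extract_title(html)
  match = TITLE_REGEX.match(html)
  match && match[1].gsub(/\s+/, ' ').strip
end

puts extract_title("<html><head><TITLE>\n Hello   World </TITLE></head></html>")
```

Note that the pattern allows only whitespace inside the opening tag, so a title element carrying attributes would not match.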
Try to grab and IRCify the first HTML par (<p> tag) in the given string. If possible, grab the one after the first heading
It is possible to pass some options to determine how the stripping occurs. Currently supported options are
strip: | Regex or String to strip at the beginning of the obtained text |
min_spaces: | minimum number of spaces a paragraph should have |
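A rough sketch of how the two options could interact, assuming a simplified paragraph pattern rather than PAR_REGEX and an arbitrary default for min_spaces. The method name first_paragraph is hypothetical, and the real method also considers headings and the AFTER_PAR fallbacks, which are omitted here.

```ruby
# Hypothetical sketch: return the first <p> whose plain text has at least
# :min_spaces spaces, after removing the optional :strip prefix.
def first_paragraph(html, opts = {})
  min_spaces = opts.fetch(:min_spaces, 8)   # default value is an assumption
  strip = opts[:strip]
  if strip
    # Anchor the pattern so only text at the beginning is stripped.
    strip = strip.is_a?(Regexp) ? /\A#{strip}/ : /\A#{Regexp.escape(strip.to_s)}/
  end

  html.scan(/<p(?:\s+[^>]*)?>(.*?)<\/p>/im) do |(body)|
    text = body.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip  # drop tags, squeeze spaces
    text = text.sub(strip, '').strip if strip
    return text if text.count(' ') >= min_spaces
  end
  nil
end

html = "<h1>T</h1><p>Short one</p><p class='x'>This is a <b>longer</b> paragraph with many words in it</p>"
puts first_paragraph(html, min_spaces: 5)
```

The min_spaces check is a cheap way to skip navigation snippets and captions, which tend to be only a few words long.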
Execute an external program, returning a String obtained by redirecting the program's standard error and standard output.
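One way to obtain a single String with both streams merged is Open3.capture2e from the standard library. This is a sketch of the idea, not necessarily how the method above is implemented; run_and_capture is a hypothetical name.

```ruby
require 'open3'

# Hypothetical helper: run a command without a shell and return stdout and
# stderr merged into one String.
def run_and_capture(command, *args)
  output, _status = Open3.capture2e(command, *args)
  output
end

# Run the current Ruby interpreter as a portable example program.
puts run_and_capture(RbConfig.ruby, '-e', 'puts "to stdout"; warn "to stderr"')
```

Passing the arguments separately avoids shell interpretation, which matters when any of them come from IRC users.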
Safely (atomically) save to file, by passing a tempfile to the block and then moving the tempfile to its final location when done.
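The tempfile-then-move pattern can be sketched as follows, assuming the temporary file is created in the destination directory so that the final rename stays on one filesystem. The name atomic_save is hypothetical and the real method may differ in details such as file permissions.

```ruby
require 'tempfile'

# Hypothetical helper: yield a tempfile to the block, then move it over the
# destination only if the block finished without raising.
def atomic_save(path)
  temp = Tempfile.new(File.basename(path), File.dirname(path))
  begin
    yield temp
    temp.close
    File.rename(temp.path, path)   # replaces the old file in a single step
  ensure
    temp.close!                    # removes the tempfile if it is still there
  end
end
```

If the block raises, the destination keeps its previous content. Note that Tempfile creates the file with restrictive permissions, and the renamed file keeps them.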
Turn a number of seconds into a human-readable string, e.g. 2 days, 3 hours, 18 minutes and 10 seconds
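A sketch of the conversion using repeated divmod. It handles days down to seconds only; whether the real method also emits months or years, and how it renders zero, are not stated here, so those choices and the name seconds_in_words are assumptions.

```ruby
# Hypothetical sketch: break a duration into days, hours, minutes and seconds.
def seconds_in_words(secs)
  units = [[86_400, 'day'], [3_600, 'hour'], [60, 'minute'], [1, 'second']]
  remaining = secs.to_i
  parts = []
  units.each do |size, name|
    count, remaining = remaining.divmod(size)
    next if count.zero?
    parts << "#{count} #{name}#{'s' unless count == 1}"
  end
  return '0 seconds' if parts.empty?      # assumed rendering of zero
  return parts.first if parts.size == 1
  "#{parts[0..-2].join(', ')} and #{parts.last}"
end

puts seconds_in_words(184_690)  # 2 days, 3 hours, 18 minutes and 10 seconds
```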
Returns a human-readable relative time, e.g. 5 days ago or about one hour ago.
Supported options are
start_date: | the time to measure against; defaults to now |
date_format: | format used with to_formatted_s; defaults to :default |
This method runs an appropriately-crafted DataStream ds through the filters in the :htmlinfo filter group, in order. If one of the filters returns non-nil, its results are merged in ds and returned. Otherwise nil is returned.
The input DataStream should have the downloaded HTML as primary key (:text) and possibly a :headers key holding the response headers.
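The first-match-wins behaviour can be sketched with a plain Hash standing in for the DataStream and lambdas standing in for the :htmlinfo filters. Both stand-ins, the sample filters, and the name run_filters are assumptions for illustration only.

```ruby
# Hypothetical sketch: try each filter in order; merge and return the first
# non-nil result, or return nil if no filter applies.
def run_filters(ds, filters)
  filters.each do |filter|
    result = filter.call(ds)
    return ds.merge(result) if result
  end
  nil
end

# Two illustrative filters keyed on the page text.
wiki  = ->(ds) { { title: 'Wiki page' } if ds[:text].include?('mw-content') }
forum = ->(ds) { { title: 'Forum thread' } if ds[:text].include?('postbody') }

ds = { text: '<div class="postbody">hello</div>', headers: {} }
p run_filters(ds, [wiki, forum])
```

Because the loop stops at the first non-nil result, the order of the filters decides which one wins when several could match.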