Frequently Asked Questions - Technical
- What is HTTP?
- What is HTML?
- What is character-set encoding?
- What data exchange formats do you support?
- What data source formats do you support?
- What is parsing?
- What is data validation?
- What's the difference between Zip Codes and Census Bureau tabulated regions?
- What are NAICS codes?
- What are SIC codes?
- Can you work with password-protected pages?
- Can you work with JavaScript pages?
- Can you work with secure web pages?
- What is "web scraping"?
TBD
HTML, or HyperText Markup Language, is the predominant code behind the pages you see on the web. It is written in the form of "tags" that are enclosed in angle brackets. These tags tell the web browser to create paragraphs, tables, links, bold or emphasized text, and so on.
It's important to note that each browser renders HTML in its own way, and all browsers follow relaxed rules for handling HTML code that may be non-standard or incorrect. At Aware Research we use a three-level hierarchy of fallback methods for parsing broken HTML while preserving the information content.
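As an illustration of what a fallback hierarchy for broken HTML can look like, here is a minimal Python sketch. It is not Aware Research's actual method; the three levels chosen here (strict XML parse, lenient HTML parse, regular-expression tag stripping) and the names `extract_text` and `_TextCollector` are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes; html.parser tolerates unclosed or misnested tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_text(markup):
    """Return (method, text) from the first level that yields a result."""
    # Level 1 (assumed): strict parse, succeeds only for well-formed markup.
    try:
        root = ET.fromstring(markup)
        words = [t.strip() for t in root.itertext() if t.strip()]
        return "strict", " ".join(words)
    except ET.ParseError:
        pass

    # Level 2 (assumed): lenient, event-based HTML parser.
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    if collector.chunks:
        return "lenient", " ".join(collector.chunks)

    # Level 3 (assumed): last resort, strip anything that looks like a tag.
    stripped = unescape(re.sub(r"<[^>]*>", " ", markup))
    return "regex", " ".join(stripped.split())
```

For example, `extract_text("<p>Hello <b>world</p>")` fails the strict level because the `<b>` tag is never closed, and the lenient level recovers the text instead.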
Character-set encoding tells the web browser how to display characters that appear only in sets specific to certain languages, regions or special purposes. In the early days of computing, a standard set known as ASCII contained only 128 distinct codes. Since then other common encodings have come into use, representing Western European, Eastern European and Asian characters, among many others. A newer standard called Unicode aims to reconcile and unify these many systems, but it's still far from universal.
In web data-mining we often come across pages where characters don't match their declared encoding, where encodings are mixed, or where no encoding is declared at all. Consequences include strangely garbled or accented characters, missing characters, or unknown characters replaced with the Unicode replacement character (U+FFFD), which usually appears as a question mark inside a diamond.
At Aware Research we pay attention to these issues, do whatever is possible to detect and correct encoding errors, and provide all text results encoded in UTF-8, which can represent every Unicode character and is compatible with modern computers.
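One simple way to handle a wrong or missing encoding declaration is to try a short chain of candidate encodings and re-encode whatever decodes cleanly as UTF-8. The Python sketch below shows that idea only; the function name `to_utf8` and the candidate order (declared encoding, then UTF-8, then Windows-1252, then Latin-1) are assumptions for illustration, not a description of our production pipeline.

```python
def to_utf8(raw, declared=None):
    """Decode bytes with a chain of candidates; return (utf8_bytes, encoding_used)."""
    candidates = []
    if declared:
        candidates.append(declared)      # trust the declaration first, if any
    candidates += ["utf-8", "cp1252"]    # assumed fallback order

    for name in candidates:
        try:
            text = raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Bytes don't fit this encoding, or the encoding name is unknown.
            continue
        return text.encode("utf-8"), name

    # Last resort: latin-1 maps every byte to a code point, so it never fails.
    return raw.decode("latin-1").encode("utf-8"), "latin-1"
```

A page declared as UTF-8 but actually saved as Windows-1252 is then recovered rather than garbled: `to_utf8(b"caf\xe9", declared="utf-8")` falls through to `cp1252`. Note that a clean decode does not prove the guess is right; it only rules out encodings that cannot fit the bytes.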
We can exchange data in virtually all common formats, including CSV, TSV, XML, Excel, and Access formats.
We can work with data in a variety of structured formats including TSV, CSV, XML, Excel, Access and dBase.
We also work with relatively unstructured data sources such as web pages, email, RSS/Atom feeds, plain text and scanned documents.
We parse raw data to identify meaningful segments. For example, a five-digit number might be a Zip Code, or it might be something else. Our address parsing software examines the surrounding text: Is there a subsequent hyphen and four-digit number? Is there a preceding state name or abbreviation, or a preceding city name matching that Zip Code? Likewise, for potential street addresses we check against our database of all valid street names within an area, flagging any unlikely or erroneous data.
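The context checks described above can be sketched in Python roughly as follows. This is a simplified illustration, not our address parsing software: the pattern, the scoring scheme, the function name `find_zip_candidates` and the tiny `STATE_ABBREVS` set are all assumptions, and the city-name and street-name database lookups are left out.

```python
import re

# Illustrative subset only; a real table would cover all states and territories.
STATE_ABBREVS = {"CA", "NY", "TX", "WA", "IL"}

ZIP_PATTERN = re.compile(
    r"(?:\b(?P<state>[A-Z]{2})[ ,]+)?"   # optional preceding two-letter token
    r"\b(?P<zip>\d{5})"                  # the five-digit candidate
    r"(?:-(?P<plus4>\d{4}))?\b"          # optional hyphen and four-digit extension
)


def find_zip_candidates(text):
    """Return five-digit candidates with a score based on surrounding evidence."""
    results = []
    for m in ZIP_PATTERN.finditer(text):
        score = 0
        if m.group("plus4"):                   # followed by hyphen + 4 digits
            score += 1
        if m.group("state") in STATE_ABBREVS:  # preceded by a known state code
            score += 1
        results.append({
            "zip": m.group("zip"),
            "plus4": m.group("plus4"),
            "state": m.group("state"),
            "score": score,
        })
    return results
```

A higher score means more supporting context. A bare five-digit number such as an invoice number still appears as a candidate, but with a score of 0, so it can be flagged as unlikely.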
We validate certain types of data by testing them against a particular set of rules. For example, given a Zip Code and an associated city name, we can easily validate each against the other according to our database. Likewise for valid street names within an area, and for items such as URLs, email addresses, 10- or 13-digit ISBNs for books, SIC and NAICS codes, etc.
We can also do partial validation, for example checking against lists of proper names, checking organization names against a directory, or checking product or technology terms and other special entities according to validation rules you might provide.
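As a concrete example of rule-based validation, ISBNs carry a check digit that can be verified without any database. The Python sketch below applies the standard ISBN-10 (weighted sum modulo 11) and ISBN-13 (alternating weights 1 and 3, modulo 10) checks; the function name `is_valid_isbn` is chosen for this example.

```python
import re


def is_valid_isbn(value):
    """Check the ISBN-10 or ISBN-13 check digit, ignoring hyphens and spaces."""
    digits = value.replace("-", "").replace(" ", "").upper()

    if re.fullmatch(r"[0-9]{9}[0-9X]", digits):
        # ISBN-10: weights 10 down to 1; a final "X" stands for the value 10.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch))
            for i, ch in enumerate(digits)
        )
        return total % 11 == 0

    if re.fullmatch(r"[0-9]{13}", digits):
        # ISBN-13: weights alternate 1, 3, 1, 3, ... across all 13 digits.
        total = sum(
            int(ch) * (3 if i % 2 else 1)
            for i, ch in enumerate(digits)
        )
        return total % 10 == 0

    return False
```

A passing check digit shows only that the number is well formed, not that a book with that ISBN exists; confirming that would require a lookup against a catalog.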
tbd
tbd
tbd
tbd
tbd
tbd
tbd