We convert back-file inventories and current patent text data from 15 different formats produced by the various patent authorities into a single, coherent, easy-to-use format that we call MAPS (Modified APS). APS was the original USPTO mainframe data storage format: 80 columns wide and line-oriented. The ASCII version of the original data files is still the only format available for the 1976 through 1999 U.S. patents (from various sources at no cost). We located and corrected over 1600 bibliographic data errors in the ASCII APS data. We also repaired over 1000 files where text in the description and/or claims sections was missing or damaged, manually inserting text from OCR copies and verifying these sections against the facsimile copies. If you intend to use the ASCII APS data, your custom parser should be able to detect many of these errors if you check for basic patent text formatting rules (e.g., the expected periods ending paragraphs and ending the last element of each numbered claim). Many errors have been corrected since we began working with the APS text in 2003. Several were reported by our customers, and we thank them for their assistance in improving the overall quality of the data.
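As a rough illustration of the kind of check described above, the following Python sketch flags paragraphs and claims that do not end with a period. The function names and the input shapes (a list of paragraph strings, and a dictionary mapping claim number to claim text) are hypothetical choices made for this example and are not defined by the APS format; a real parser would first have to split the APS records into these pieces, and would likely need to exempt headings and similar short lines.

```python
def check_paragraphs(paragraphs):
    """Return the 1-based positions of paragraphs that do not end with a period.

    Assumes `paragraphs` holds body paragraphs only (no headings).
    """
    return [
        position
        for position, text in enumerate(paragraphs, start=1)
        if not text.rstrip().endswith(".")
    ]


def check_claims(claims):
    """Return (claim_number, reason) pairs for claims that look damaged.

    Assumes `claims` maps each claim number to the full text of that claim,
    so the end of the text is the end of the claim's last element.
    """
    problems = []
    for number, text in sorted(claims.items()):
        if not text.strip():
            problems.append((number, "empty claim"))
        elif not text.rstrip().endswith("."):
            problems.append((number, "missing final period"))
    return problems


if __name__ == "__main__":
    sample_claims = {
        1: "1. A widget comprising a handle.",
        2: "2. The widget of claim 1, wherein the handle is",
    }
    for number, reason in check_claims(sample_claims):
        print(f"claim {number}: {reason}")
```

A truncated claim such as claim 2 in the sample is the sort of damage this check is meant to surface; anything it flags still has to be compared against the facsimile copy before it is corrected.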
Even though WIPO produced and maintains an agreed-upon XML standard for new patent text data (currently, the ST.36 Standard), the data sets from all of the authorities that follow ST.36 are sufficiently different to require completely different parsers to read, index or convert the XML data. We also convert all character data, including HTML/XHTML numeric and named entities, into UTF-8 binary codes. This allows us to easily edit and correct data in our line-oriented MAPS format by comparing the text to the facsimile image copies, which are considered the "legal" copies of these documents. MAPS is our main corrected storage format. Our MAPS-XML format is generated from the MAPS formatted files when needed for internal use or for delivery to customers.
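For readers doing a similar entity conversion themselves, here is a minimal Python sketch using the standard library's `html.unescape`, which decodes both named and numeric character references. The function name `entities_to_utf8` is a hypothetical choice for this example. Note the assumption that the input is already-extracted text content: `html.unescape` also decodes `&lt;` and `&amp;`, so running it over raw XML markup could corrupt the markup.

```python
import html


def entities_to_utf8(text: str) -> bytes:
    """Decode HTML/XHTML named and numeric entities, then encode as UTF-8.

    Assumes `text` is element content, not raw markup.
    """
    return html.unescape(text).encode("utf-8")


if __name__ == "__main__":
    raw = "a gap of 10 &Aring; at 25&#176;C (&#xB1;2)"
    converted = entities_to_utf8(raw)
    print(converted.decode("utf-8"))
```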
We handle the parsing and conversion for all of the different formats (now numbering 15 from four authorities for complete data sets) in a modular fashion. "Front end" modules handle initial parsing, conversion to UTF-8, patent element identification and element value standardization, after which the data is passed to standard modules used for all applications and patents in all languages. Below is a link to a PDF showing the basic flow of the conversion to reach the final MAPS format:
Our experience with bulk patent text and image data dates back to 1993, when we worked with the set of US text and image data on 5000 mainframe tape cartridges. At the time it was the only source available (at a cost of close to a quarter of a million dollars).
If you have data you need converted to or from various formats, we no doubt already have what you need to handle the job. We can also tailor the conversion so that it is easy to add to your workflow.
Contact us and let us know the following particulars:
- Source Format (specification),
- Source character set and language,
- Number of Source publications,
- How they are grouped when stored (single files or multiple publications per physical file),
- Total storage size on disk of Source data,
- Destination Format (specification),
- Destination character set desired,
- Any additional translations required such as:
  - HTML Entities to UTF-8 characters or HTML format (ex: X² to X<sup>2</sup>),
  - HTML Entities to Plain Text (words for scientific symbols or characters),
  - UTF or ISO characters to Text or HTML Entities,
  - Additional character data translations and insertions for indexing, such as Scientific Symbols with the plain text name following them parenthetically, for example: Å (Angstrom), Ø (Phase),
- Any reports required such as lists of all scientific symbols or conversions, and
- Anything else you can think of that you may require.
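To make the translation options in the list above more concrete, here is a small Python sketch of three of them: inserting a plain-text name after a scientific symbol, converting Unicode superscript digits to `<sup>` markup, and producing a simple symbol report. All function names are hypothetical, the symbol table contains only the two examples given above, and the superscript handling covers digits only; a real job would use tables agreed on for the specific data set.

```python
import re
from collections import Counter

# Symbol names taken from the examples in the list above; extend as needed.
SYMBOL_NAMES = {"Å": "Angstrom", "Ø": "Phase"}

# Unicode superscript digits mapped to ordinary digits (illustrative subset).
SUPERSCRIPTS = dict(zip("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789"))


def annotate_symbols(text, names=SYMBOL_NAMES):
    """Insert the plain-text name in parentheses after each known symbol."""
    pattern = re.compile("|".join(re.escape(symbol) for symbol in names))
    return pattern.sub(lambda m: f"{m.group(0)} ({names[m.group(0)]})", text)


def superscripts_to_html(text):
    """Replace each run of superscript digits with <sup>digits</sup>."""
    pattern = re.compile("[" + "".join(SUPERSCRIPTS) + "]+")

    def to_sup(match):
        digits = "".join(SUPERSCRIPTS[ch] for ch in match.group(0))
        return f"<sup>{digits}</sup>"

    return pattern.sub(to_sup, text)


def symbol_report(text, names=SYMBOL_NAMES):
    """Count how often each known symbol occurs in the text."""
    return Counter(ch for ch in text if ch in names)


if __name__ == "__main__":
    print(annotate_symbols("a spacing of 5 Å"))
    print(superscripts_to_html("X² + Y³"))
    print(dict(symbol_report("5 Å to 7 Å at Ø")))
```

Keeping the tables as plain dictionaries means the same code can drive both the conversion and the report of which symbols were actually found.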
The bottom line is that we can probably save you money and time. Give us a call at one of the above numbers, or send an email with a brief description to IPDataCorp.com with the user name Support and put Data Conversion Info Request in the subject line. We will let you know what we can do for you and provide you with a "ballpark" cost and estimated completion time; then you can decide if you'd like a formal quote.
* * * * *