704 W Park Ave Suite C
 Edgewater FL 32132-1409
 Ph 800-832-2823
 Fx 208-631-6381
 Outside the US:
 01-386-426-5393

IP Data Corporation

Patent SEARCH SYSTEMS & DATA SERVICES

We convert back-file inventories and current patent text data from 15 different formats produced by the various patent authorities into a single, coherent, easy-to-use format that we call MAPS (Modified APS). APS was the original USPTO mainframe, 80-column, line-oriented data storage format, and the ASCII version of the original data files is still the only format available for the 1976 through 1999 U.S. patents (from various sources at no cost). We located and corrected over 1600 bibliographic data errors in the ASCII APS data, and we repaired over 1000 files where text in the description and/or claims sections was missing or damaged, manually inserting text from OCR copies and verifying these sections against the facsimile copies. If you intend to use the ASCII APS data, your custom parser should be able to detect many of these errors if you check for basic patent text formatting rules (e.g., a period ending each paragraph and ending the last element of each numbered claim). Many errors have been corrected since we began working with the APS text in 2003, and several errors were reported by our customers; we thank them for their assistance in improving the overall quality of the data.
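As a rough illustration of the formatting check mentioned above, the sketch below flags numbered claims that do not end with a period. It assumes the claims section has already been extracted as plain text with each claim starting on its own line as "N. text"; the function name and the input layout are our assumptions for the example, not part of the APS specification.

```python
import re

# Assumed layout: each claim begins on its own line as "<number>. <text>".
CLAIM_START = re.compile(r"^[ \t]*(\d+)[ \t]*\.[ \t]+", re.MULTILINE)

def find_unterminated_claims(claims_text):
    """Return the numbers of claims whose text does not end with a period."""
    # split() with one capturing group yields:
    # [preamble, number1, body1, number2, body2, ...]
    parts = CLAIM_START.split(claims_text)
    suspects = []
    for number, body in zip(parts[1::2], parts[2::2]):
        if not body.rstrip().endswith("."):
            suspects.append(int(number))
    return suspects

sample = (
    "1. A widget comprising:\n"
    "a frame; and\n"
    "a handle.\n"
    "2. The widget of claim 1, wherein the handle is steel\n"
    "3. The widget of claim 2.\n"
)
suspect_claims = find_unterminated_claims(sample)
```

A claim flagged this way is only a candidate for review; truncated text and a simple missing period look the same to this check.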

Even though WIPO produced and maintains an agreed-upon XML standard for new patent text data (currently, the ST.36 Standard), the data sets from all of the authorities that follow ST.36 are sufficiently different to require completely different parsers to read, index or convert the XML data. We also convert all character data, including HTML/XHTML numeric and named entities, into UTF-8 binary codes. This allows us to easily edit and correct data in our line-oriented MAPS format by comparing the text to the facsimile image copies, which are considered the "Legal" copies of these documents. MAPS is our main corrected storage format. Our MAPS-XML format is generated from the MAPS formatted files when needed for internal use, or for delivery to customers.
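For readers doing their own conversion, the entity-to-UTF-8 step can be approximated with the Python standard library. This is a minimal sketch, not our production converter: it relies on html.unescape, which resolves HTML named and numeric character references, and the function name is hypothetical.

```python
import html

def entities_to_utf8(text):
    """Resolve HTML named/numeric character references, then encode as UTF-8."""
    return html.unescape(text).encode("utf-8")

# Named (&Aring;), decimal (&#177;) and hexadecimal (&#xB5;) references.
converted = entities_to_utf8("10 &Aring; &#177; 2 &#xB5;m")
```

Entities defined only in an authority's own DTD would not be known to html.unescape and would need a separate lookup table.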

We handle the parsing and conversion for all of the different formats (now numbering 15 from four authorities for complete data sets) in a modular fashion with "front end" modules that handle initial parsing, conversion to UTF-8, patent-element-identification and element-value-standardization, after which the data is passed to standard modules used for all applications and patents in all languages.  Below is a link to a PDF showing the basic flow of the conversion to reach the final MAPS format:

MAPS Creation Flowchart (simplified)
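The modular arrangement described above can be pictured as a table of format-specific front ends feeding one shared stage. The sketch below is only a schematic of that idea: the format keys, function names and record layout are all hypothetical and do not reflect the actual MAPS modules.

```python
from typing import Callable, Dict

# Hypothetical front ends: each one knows a single source format and
# returns a record in a common shape for the shared stage.
def parse_aps(raw: bytes) -> dict:
    return {"source": "APS", "text": raw.decode("ascii", errors="replace")}

def parse_st36(raw: bytes) -> dict:
    return {"source": "ST.36", "text": raw.decode("utf-8")}

FRONT_ENDS: Dict[str, Callable[[bytes], dict]] = {
    "aps": parse_aps,
    "st36": parse_st36,
}

def standard_stage(record: dict) -> dict:
    """Shared step applied to every record regardless of source format."""
    cleaned = dict(record)
    cleaned["text"] = " ".join(cleaned["text"].split())  # collapse whitespace
    return cleaned

def convert(fmt: str, raw: bytes) -> dict:
    """Pick the front end for fmt, then run the shared stage."""
    if fmt not in FRONT_ENDS:
        raise ValueError(f"no front end registered for format {fmt!r}")
    return standard_stage(FRONT_ENDS[fmt](raw))

result = convert("aps", b"TTL  Widget\n")
```

Supporting another source format in such a layout means adding one front end to the table; the shared stage is untouched.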

Our experience with bulk patent text and image data dates back to 1993, when we worked with the set of US text and image data on 5000 mainframe tape cartridges. At the time it was the only source available (at a cost of close to a quarter of a million dollars).

If you have data you need converted to or from various formats, we no doubt already have what you need to handle the job. We can also tailor it to make it easy to add to your work flow.

Contact us and let us know the following particulars:

  • Source Format (specification), 
  • Source character set and language,
  • Number of Source publications,
  • How they are grouped when stored (single files or multiple pubs per physical file),
  • Total storage size on disk of Source data,
  • Destination Format (specification),
  • Destination character set desired,
  • Any additional translations required such as:
  •    HTML Entities to UTF-8 characters or HTML format (ex: X&sup2; to X<sup>2</sup> )
  •    HTML Entities to Plain Text (words for scientific symbols or characters)
  •    UTF or ISO characters to Text or HTML Entities,
  • Additional character data translations and insertions for indexing, such as scientific symbols followed parenthetically by their plain-text names, for example:  Å (Angstrom)   Ø (Phase),
  • Any reports required such as lists of all scientific symbols or conversions, and
  • Anything else you can think of that you may require.
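Two of the translations listed above (superscript entities to HTML tags, and symbols followed by their plain-text names) can be sketched as follows. The lookup tables hold only the examples given in the list plus the two sibling superscript entities; the function and table names are hypothetical, and a real job would use tables agreed with the customer.

```python
import re

# Illustrative tables only; a real conversion would use far larger ones.
SUPERSCRIPT_ENTITIES = {"&sup1;": "1", "&sup2;": "2", "&sup3;": "3"}
SYMBOL_NAMES = {"\u00c5": "Angstrom", "\u00d8": "Phase"}  # Å and Ø, per the list above

SYMBOL_PATTERN = re.compile("|".join(re.escape(s) for s in SYMBOL_NAMES))

def entities_to_sup_tags(text):
    """Rewrite superscript entities as <sup> markup, e.g. X&sup2; -> X<sup>2</sup>."""
    for entity, digit in SUPERSCRIPT_ENTITIES.items():
        text = text.replace(entity, f"<sup>{digit}</sup>")
    return text

def annotate_symbols(text):
    """Insert the plain-text name in parentheses after each known symbol."""
    return SYMBOL_PATTERN.sub(
        lambda m: f"{m.group(0)} ({SYMBOL_NAMES[m.group(0)]})", text
    )

html_form = entities_to_sup_tags("X&sup2;")
indexed_form = annotate_symbols("a thickness of 5 \u00c5")
```

The parenthetical name lets a full-text index match a query for the word even when the document contains only the symbol.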

The bottom line is, we can probably save you money and time. Give us a call at one of the above numbers, or send an email with a brief description to IPDataCorp.com with the user name Support and put Data Conversion Info Request in the subject line. We will let you know what we can do for you and provide you with a "ballpark" cost and estimated completion time; then you can decide if you'd like a formal quote.

                                                             * * * * *


ITEMS OF INTEREST


USPTO MCF  vs.  EPO DOCDB
Same US Patents, Different CPC Classifications.   WHY? 

The USPTO MCF data began in late 2015, and almost immediately we noticed differences between the MCF and DOCDB for the same US patents. The differences were not trivial. 30 days later we started our project to track and analyze these differences. 


We mistakenly assumed that the USPTO sent DOCDB updates to the EPO with CPC classes in them, and the EPO used them.  But as it turns out, US patents are classified by both authorities, and not always in the same Groups and Sub-Groups. A small number were even found to be in totally different sub-classes and a few were not even in the same class  (all of these appeared to be mistakes and were fixed fairly quickly). 
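To make the levels of disagreement concrete, the sketch below reports the coarsest level at which two CPC symbols differ. It assumes symbols written in the usual form, such as "H04L 9/32" (section H, class H04, sub-class H04L, group 9, sub-group 32); the pattern and function names are ours, and the sample symbols are chosen only to exercise each level.

```python
import re

# Section letter, two-digit class, sub-class letter, main group, "/", sub-group.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")
LEVELS = ("section", "class", "sub-class", "group", "sub-group")

def parse_cpc(symbol):
    """Split a CPC symbol into cumulative prefixes, one per hierarchy level."""
    match = CPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized CPC symbol: {symbol!r}")
    section, cls, subclass, group, subgroup = match.groups()
    sub_class = section + cls + subclass
    main_group = f"{sub_class} {int(group)}"
    return (section, section + cls, sub_class, main_group, f"{main_group}/{subgroup}")

def first_divergence(a, b):
    """Return the coarsest level at which a and b differ, or None if identical."""
    for level, left, right in zip(LEVELS, parse_cpc(a), parse_cpc(b)):
        if left != right:
            return level
    return None

level = first_divergence("H04L 9/32", "H04L 63/08")
```

Counting results of this kind per level is one simple way to separate sub-group disagreements from the rarer class-level ones.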


Early results indicated it was a learning curve. The EPO had a decent head start with the new CPC since it has its roots in ECLA, the EPO's previous system, with both based on the ST.8 standard (with minor differences). For the USPTO, it was a brand new ball game with a different set of rules. Frankly, the thought of training 9000 or so examiners on a new Class system in 12 to 18 months conjured up images of herding cats...  while wearing a blindfold. To the USPTO's credit, we began to see far fewer differences in less than 8 or 9 months (summer of 2016).  We are now headed into the winter of 2017, and thankfully, they are growing even closer.


If you build your own search system, this could be a problem for your searchers. If not, does your current search provider index both sets?


Searching by classification is the most popular method for professional searchers, and depending on the type of search, a good searcher will often "eyeball" EVERY document in the Sub-Groups of interest. This can be hundreds of documents, or even a thousand or two depending on the technology.


The following questions remain:

1)  Whose CPC data is more accurate?

2) Will they ever match exactly? No. So how close will they get?

3) How will this affect your class searches?

4) Is it wise to index both sets of CPC data into one system?  (we think so - just ignore the duplicates!)

5)  CPC data for Reissue patents is still not included in the U.S. MCF. Will it ever be?


We will continue to acquire Reissue CPC data from DOCDB for our subscribers in our standard CSV file format until the USPTO supplies it.
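For those weighing question 4, indexing both sets and ignoring the duplicates amounts to a per-patent union of symbols. The sketch below assumes each source has already been loaded into a dictionary keyed by patent number; the function name, the sample patent number and the sample symbols are made up for the example.

```python
def merge_cpc(mcf, docdb):
    """Union the CPC symbols per patent, remembering which source supplied each."""
    merged = {}
    for source, table in (("MCF", mcf), ("DOCDB", docdb)):
        for patent, symbols in table.items():
            for symbol in symbols:
                # Normalize spacing and case so "H04L 9/32" and "H04L9/32" collapse.
                key = "".join(symbol.split()).upper()
                merged.setdefault(patent, {}).setdefault(key, set()).add(source)
    return merged

# Made-up sample records for illustration only.
mcf = {"US1234567": ["H04L 9/32"]}
docdb = {"US1234567": ["H04L9/32", "H04L 63/08"]}
merged = merge_cpc(mcf, docdb)
```

Keeping the source tag on each symbol costs little and lets a searcher restrict a class search to one authority's classification when needed.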