Which MIME types are associated with the ".hocr" file extension?

The .hocr file extension is associated with two MIME types:

text/html, text/vnd.hocr+html.

A MIME type is a string that tells browsers and other tools how to handle a particular kind of file.

About .hocr Files

HOCR files store OCR results as HTML. The MIME types text/html and text/vnd.hocr+html reflect this: the content is an HTML text document with layout data embedded in it.

These files hold both the recognized text and its position on the page. Text structure, positions, and formatting are recorded in ordinary HTML tags annotated with extra metadata, such as bounding-box coordinates stored in title attributes. This makes it possible to map the recognized text back onto the original scanned image.
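As an illustration of how that metadata can be read back, here is a minimal Python sketch that pulls each word and its bounding box out of an HOCR document using only the standard library. The sample markup is made up for this example; it assumes words are marked with the class ocrx_word and carry a "bbox x0 y0 x1 y1" entry in the title attribute, as Tesseract output typically does. The names SAMPLE_HOCR, WordBoxParser, and extract_words are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample; real files are produced by an OCR engine.
SAMPLE_HOCR = """<html><body>
<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 10 20 300 50'>
  <span class='ocrx_word' title='bbox 10 20 120 50; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 130 20 300 50; x_wconf 91'>world</span>
 </span>
</div>
</body></html>"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of (text, (x0, y0, x1, y1))
        self._depth = 0     # > 0 while inside an ocrx_word element
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested tag inside a word (assumes it has a matching end tag).
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._chunks = []
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                self.words.append((text, self._bbox))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_words(hocr_text):
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

Each bounding box is given in pixel coordinates of the scanned image, which is what lets a viewer or PDF converter place the text over the right region.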


According to FilExt.com, HOCR files suit scenarios where both the text and its layout must be preserved for further processing or review.

Relationship between file extension and MIME type

A file extension is a suffix at the end of a filename that indicates what type of file it is. File extensions help both users and operating systems identify what application should be used to open the file.

File extensions are typically separated from the filename by a period (dot) and consist of 2-4 characters, though they can be longer. For example, in the filename "document.pdf", ".pdf" is the file extension.

File extensions are closely related to MIME types, as they both serve to identify the format of a file. However, while MIME types are used primarily by web browsers and servers, file extensions are used by operating systems and applications.

FAQs

What is an .hocr file used for?

An .hocr file stores Optical Character Recognition (OCR) results. It represents the recognized text and its layout information (coordinates on the page) using standard HTML tags, allowing the text to be mapped precisely over the original image.

How do I open an .hocr file?

Since HOCR files are based on HTML, you can open them in any standard web browser like Google Chrome, Mozilla Firefox, or Microsoft Edge. To view the underlying code and coordinate data, you can open the file with a text editor like Notepad++ or Visual Studio Code.

How can I convert an .hocr file to a searchable PDF?

You can convert an .hocr file, together with its source image, to PDF using command-line tools like hocr2pdf. Alternatively, the OCR engine Tesseract can write a searchable PDF directly from the scanned image, without an intermediate .hocr file. Either way, the result is a 'sandwich' PDF in which an invisible text layer is placed exactly over the original scanned image, making the document searchable.

Why does my .hocr file look like a regular web page?

The HOCR format is essentially valid HTML with additional metadata embedded in tags (like <span> or <div>). Because it uses the standard text/html structure, web browsers render the text content just like a normal webpage, often ignoring the hidden layout attributes unless specific stylesheets are applied.

Which software creates .hocr files?

The most common software for generating these files is the open-source Tesseract OCR engine. Other OCR libraries and scanning software may also support exporting to HOCR to preserve layout analysis alongside the recognized text.
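For readers who want to script this, the following Python sketch wraps the Tesseract command line. It assumes the tesseract executable is installed and on PATH, and that the installed version writes its result to OUTPUTBASE.hocr when given the hocr config name (recent versions do). The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_hocr_command(image_path, output_base):
    """Build the argv for `tesseract IMAGE OUTPUTBASE hocr`."""
    return ["tesseract", str(image_path), str(output_base), "hocr"]


def image_to_hocr(image_path, output_base):
    """Run Tesseract and return the path of the .hocr file it should write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_hocr_command(image_path, output_base), check=True)
    # Assumption: Tesseract appends ".hocr" to the output base name.
    return Path(f"{output_base}.hocr")
```

Calling image_to_hocr("scan.png", "scan") would then be expected to leave scan.hocr next to the image.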

What is the correct MIME type for serving .hocr files?

Because HOCR is based on HTML, it is often served as text/html. However, the specific media type text/vnd.hocr+html is used to explicitly indicate that the content contains HOCR-specific metadata.
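Many tools do not know the .hocr extension out of the box, so the mapping usually has to be registered. The sketch below shows one way to do that with Python's standard mimetypes module and to build a Content-Type header value from it. The helper name content_type_for and the UTF-8 default charset are assumptions made for this example.

```python
import mimetypes

# Private database so the process-wide mimetypes tables stay untouched.
_MIME_DB = mimetypes.MimeTypes()
_MIME_DB.add_type("text/vnd.hocr+html", ".hocr")


def content_type_for(filename, charset="utf-8"):
    """Return a Content-Type header value guessed from the file name."""
    mime, _encoding = _MIME_DB.guess_type(filename)
    if mime is None:
        return "application/octet-stream"  # generic fallback for unknown types
    if mime.startswith("text/"):
        return f"{mime}; charset={charset}"
    return mime


if __name__ == "__main__":
    print(content_type_for("page.hocr"))
```

If a client does not recognize the vendor-specific type, serving the same file as text/html remains a reasonable fallback, since the content is still HTML.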

Can I edit the text inside an .hocr file?

Yes, you can edit the text using any plain text editor. However, you must be careful not to alter the title attributes inside the HTML tags, as these contain the bounding box coordinates required to map the text back to its position on the original image.

General FAQ

What is a MIME type?

A MIME (Multipurpose Internet Mail Extensions) type is a standardized label that indicates the nature and format of a document, file, or assortment of bytes. The rules for defining and registering MIME types (formally called media types) are specified in IETF RFC 6838.

MIME types are important because they help browsers and servers understand how to process a file. When a browser receives a file from a server, it uses the MIME type to determine how to display or handle the content, whether it's an image to display, a PDF to open in a viewer, or a video to play.

MIME types consist of a type and a subtype, separated by a slash (e.g., text/html, image/jpeg, application/pdf). Some MIME types also include optional parameters, such as the charset in "text/html; charset=utf-8".
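To make that structure concrete, here is a short Python sketch that splits a MIME type string into its type, subtype, and parameters. It reuses the header parsing in the standard library's email package; the function name split_media_type is hypothetical.

```python
from email.message import Message


def split_media_type(value):
    """Split e.g. 'text/html; charset=utf-8' into (type, subtype, params)."""
    msg = Message()
    msg["Content-Type"] = value
    # get_params() returns the media type itself first, then the parameters.
    params = dict(msg.get_params()[1:])
    return msg.get_content_maintype(), msg.get_content_subtype(), params


if __name__ == "__main__":
    print(split_media_type("text/html; charset=utf-8"))
    print(split_media_type("text/vnd.hocr+html"))
```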

How do I find the MIME type for a file?

The file extension gives a first hint. For a more reliable answer, use a file identification tool such as file --mime-type on the command line, which inspects the file's contents rather than its name. Many programming languages also provide libraries to detect MIME types.
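As one example of such a library, Python's standard mimetypes module looks up a MIME type from the file name alone. The sketch below is a minimal illustration; guess_mime is a hypothetical helper, and because the lookup never opens the file, a wrong or missing extension yields a wrong or missing answer.

```python
import mimetypes

# Instance built from Python's built-in extension table.
_DB = mimetypes.MimeTypes()


def guess_mime(filename):
    """Return the MIME type implied by the extension, or None if unknown."""
    mime, _encoding = _DB.guess_type(filename)
    return mime


if __name__ == "__main__":
    for name in ("document.pdf", "index.html", "notes.unknownext"):
        print(name, guess_mime(name))
```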

Why can one extension have multiple MIME types?

Different programs and historical usage may assign various MIME identifiers to the same file format. Listing them together helps maintain compatibility across tools.