This package contains an OCR engine, libtesseract, and a command line program, tesseract.
Tesseract 4 adds a new neural net (LSTM) based OCR engine which is focused on line recognition, but also still supports the legacy Tesseract OCR engine of Tesseract 3 which works by recognizing character patterns. Compatibility with Tesseract 3 is enabled by using the Legacy OCR Engine mode (--oem 0). It also needs traineddata files which support the legacy engine, for example those from the tessdata repository.
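The legacy mode described above can be sketched as a small Python helper that builds the command line. This is a minimal sketch, not the project's own tooling: the function name `legacy_ocr_command`, the file names, and the default language are hypothetical, and it assumes the `tesseract` program is on the PATH with legacy-capable traineddata installed.

```python
import shutil
import subprocess


def legacy_ocr_command(image, output_base, lang="eng", tessdata_dir=None):
    """Build a tesseract argv list that selects the legacy engine (--oem 0).

    Hypothetical helper: the CLI shape is
    `tesseract IMAGE OUTPUTBASE [options...]`.
    """
    cmd = ["tesseract", str(image), str(output_base), "--oem", "0", "-l", lang]
    if tessdata_dir is not None:
        # Point at a directory whose traineddata files support the legacy
        # engine, for example a checkout of the tessdata repository.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


if __name__ == "__main__":
    cmd = legacy_ocr_command("page.png", "page_legacy")  # assumed file names
    print(" ".join(cmd))
    # Only run the command when the binary is actually available.
    if shutil.which("tesseract"):
        subprocess.run(cmd, check=False)
```

If the installed traineddata files contain only LSTM models, a run with `--oem 0` is expected to fail, which is why the optional `tessdata_dir` argument is included.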
Stefan Weil is the current lead developer. Ray Smith was the lead developer until 2018. The maintainer is Zdenko Podobny. For a list of contributors see AUTHORS and GitHub's log of contributors.
Tesseract supports Unicode (UTF-8) and can recognize more than 100 languages "out of the box".
Tesseract supports various image formats including PNG, JPEG and TIFF.
Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV, ALTO and PAGE.
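As an illustration of selecting output formats from the command line, the sketch below builds a `tesseract` invocation that names one config file per requested format. The mapping `CONFIG_FOR_FORMAT` and the helper `output_command` are hypothetical; the config names are assumed to match the stock config files shipped with a typical installation, and PAGE is left out because its availability is assumed to depend on the installed version.

```python
# Assumed mapping from output format to the stock config-file name that is
# passed as a trailing CLI argument; check your installation's configs.
CONFIG_FOR_FORMAT = {
    "text": "txt",
    "hocr": "hocr",
    "pdf": "pdf",
    "tsv": "tsv",
    "alto": "alto",
}


def output_command(image, output_base, formats, text_only_pdf=False):
    """Build a tesseract argv list producing one output file per format."""
    if text_only_pdf and "pdf" not in formats:
        raise ValueError("text_only_pdf requires the 'pdf' format")
    cmd = ["tesseract", str(image), str(output_base)]
    if text_only_pdf:
        # textonly_pdf=1 asks for a PDF that holds only the invisible text layer.
        cmd += ["-c", "textonly_pdf=1"]
    cmd += [CONFIG_FOR_FORMAT[name] for name in formats]
    return cmd


if __name__ == "__main__":
    # Assumed file names; prints the command rather than running it.
    print(" ".join(output_command("page.png", "page", ["hocr", "pdf"])))
```

Several formats can be requested in one run, so the image is recognized once and each renderer writes its own file next to the given output base.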
In many cases, getting better OCR results requires improving the quality of the image you give Tesseract.
This project does not include a GUI application. If you need one, please see the 3rdParty documentation.
Tesseract can be trained to recognize other languages. See Tesseract Training for more information.