jahtz/xmltools

xmltools

Toolset for working with PageXML files.

Setup

git clone --recurse-submodules https://github.com/jahtz/xmltools
pip install -r xmltools/requirements.txt

Usage

python xmltools --help

Modules

convert

Convert different OCR file formats (currently only ABBYY) to PageXML.

python xmltools convert --help

abbyy

> python xmltools convert abbyy --help
Usage: xmltools convert abbyy [OPTIONS] FILES...

  Convert different XML ocr formats to PageXML.

  FILES: List of ABBYY XML files to convert. Supports multiple file paths,
  wildcards or directories (when used with the -g option).

Options:
  --help                   Show this message and exit.
  -o, --output DIRECTORY   Directory to save the output PageXML files.
                           Defaults to the parent directory of each input
                           file.
  -g, --glob TEXT          Specify a glob pattern to match image files when
                           processing directories in FILES.  [default: *.xml]
  -i, --image-suffix TEXT  The suffix appended to the imageFilename tag within
                           the output PageXML file.  [default: .png]
  -x, --xml-suffix TEXT    Append a suffix to the output PageXML file names.
                           For example, using `.seg.xml` results in filenames
                           like `imagename.seg.xml`  [default: .xml]
  -lp, --line-polygon      When calculating the masks for each TextLine, use
                           the coordinates from each character to build a
                           polygon. If not set, use a bounding box.
  -rp, --region-polygon    When calculating the masks for each TextRegion, use
                           the coordinates from each TextLine to build a
                           polygon. If not set, use a bounding box.
  --use-polygons           When calculating the TextRegion polygon, use the
                           polygon coordinates from each TextLine. If not set,
                           use the lines bounding boxes. Output depends on the
                           input file.
  --padding INTEGER        Set inner padding for region polygons in px (only
                           used if --region-polygon is set).  [default: 10]
  --creator TEXT           Specify the creator of the PageXML file. This can
                           be useful for tracking the origin of segmented
                           files.
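The difference between a bounding box and a polygon mask for a TextLine can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function names are hypothetical, and each character is assumed to be given as a `(left, top, right, bottom)` box.

```python
def bounding_box(char_boxes):
    """Smallest axis-aligned rectangle around all character boxes."""
    left = min(box[0] for box in char_boxes)
    top = min(box[1] for box in char_boxes)
    right = max(box[2] for box in char_boxes)
    bottom = max(box[3] for box in char_boxes)
    return [(left, top), (right, top), (right, bottom), (left, bottom)]


def line_polygon(char_boxes):
    """Polygon that follows the top edges left to right, then the bottom
    edges right to left (hypothetical construction, for illustration)."""
    boxes = sorted(char_boxes)  # sort by left coordinate
    upper = []
    for left, top, right, _bottom in boxes:
        upper += [(left, top), (right, top)]
    lower = []
    for left, _top, right, bottom in reversed(boxes):
        lower += [(right, bottom), (left, bottom)]
    return upper + lower


chars = [(10, 20, 30, 50), (30, 15, 50, 55)]
print(bounding_box(chars))
print(line_polygon(chars))
```

The polygon hugs each character more closely, while the bounding box is cheaper to compute and always has four points.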

shrink

Shrink PageXML regions.

> python xmltools shrink --help
Usage: xmltools shrink [OPTIONS] FILES...

  Shrink regions of a PageXML file to its content.

  FILES: List of PageXML files to shrink regions in. Supports multiple file
  paths, wildcards or directories (when used with the -g option).  Images
  should be in the same directory as the XML files with matching
  `imageFilename` tags.

Options:
  --help                  Show this message and exit.
  -o, --output DIRECTORY  Directory to save the output PageXML files.
                          Overwrite original files if not set.
  -g, --glob TEXT         Specify a glob pattern to match image files when
                          processing directories in FILES.  [default: *.xml]
  -p, --padding INTEGER   Set inner padding for shrunk region polygon.
                          [default: 5]
  -a, --alpha FLOAT       Set alpha value for creating the alphashape around
                          the regions content. The lower the value, the faster
                          computation but the less precise output shape.
                          [default: 0.02]
  -e, --epsilon FLOAT     Set the epsilon value to simplify the contours where
                          the alphashape is computed from. The lower the
                          value, the slower the computation but the more
                          precise output shape. A good starting point is
                          `5.0`.
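To get a feeling for the epsilon trade-off, here is a small pure-Python sketch of contour simplification. It assumes a Ramer-Douglas-Peucker style algorithm, which the help text does not confirm, and the function names are hypothetical.

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from point to the line through start and end."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - sx, py - sy)
    return abs(dy * (px - sx) - dx * (py - sy)) / length


def simplify(points, epsilon):
    """Drop points that deviate from the chord by at most epsilon."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    distances = [perpendicular_distance(p, start, end) for p in points[1:-1]]
    max_dist = max(distances)
    if max_dist <= epsilon:
        return [start, end]
    # Split at the farthest point and simplify both halves recursively.
    split = distances.index(max_dist) + 1
    left = simplify(points[:split + 1], epsilon)
    right = simplify(points[split:], epsilon)
    return left[:-1] + right


contour = [(0, 0), (10, 1), (20, 0), (30, 1), (40, 0)]
print(len(simplify(contour, 0.5)))  # small epsilon keeps more points
print(len(simplify(contour, 5.0)))  # large epsilon keeps fewer points
```

A smaller epsilon keeps more contour points, so whatever is computed from them has more input to process but follows the content more closely.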

transform

Transform PageXML files.

python xmltools transform --help

rotate

Rotate all coordinates of PageXML files.

> python xmltools transform rotate --help
Usage: xmltools transform rotate [OPTIONS] FILES...

  Rotate all coordinates of PageXML files.

  Negative coordinates resulting from the rotation will be set to 0 to
  maintain valid PageXML formatting.

  FILES: List of PageXML files to rotate. Supports multiple file paths,
  wildcards or directories (when used with the -g option).

Options:
  --help                         Show this message and exit.
  -a, --angle FLOAT              Angle to rotate points by. A positive value
                                 rotates points counterclockwise.  [required]
  -g, --glob TEXT                Specify a glob pattern to match image files
                                 when processing directories in FILES.
                                 [default: *.xml]
  -o, --output DIRECTORY         Directory to save the output PageXML files.
                                 If not specified, original files are
                                 overwritten
  --origin [center|tr|tl|br|bl]  Origin point for rotation: center, tr (top
                                 right), tl (top left), br (bottom right), or
                                 bl (bottom left).  [default: center]
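The rotation described above can be sketched in a few lines. This is an illustration under assumptions, not the tool's code: the helper names are hypothetical, coordinates are rounded to integers, and the y-axis is assumed to point downward (image coordinates), so "counterclockwise" means counterclockwise as seen on screen.

```python
import math


def origin_point(name, width, height):
    """Map an --origin choice to a point on a page of the given size."""
    origins = {
        "center": (width / 2, height / 2),
        "tl": (0, 0),
        "tr": (width, 0),
        "bl": (0, height),
        "br": (width, height),
    }
    return origins[name]


def rotate_points(points, angle_deg, origin):
    """Rotate (x, y) points around origin; clamp negative results to 0."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    ox, oy = origin
    rotated = []
    for x, y in points:
        dx, dy = x - ox, y - oy
        # y grows downward, so the sine terms are flipped compared to the
        # textbook formula to get a counterclockwise turn on screen.
        rx = ox + dx * cos_t + dy * sin_t
        ry = oy - dx * sin_t + dy * cos_t
        rotated.append((max(0, round(rx)), max(0, round(ry))))
    return rotated


center = origin_point("center", 200, 200)
print(rotate_points([(200, 100)], 90, center))
```

Rotating around a corner such as `tl` pushes many points into negative territory, which is where the clamping to 0 takes effect.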

scale

Scale all coordinates of PageXML files.

> python xmltools transform scale --help
Usage: xmltools transform scale [OPTIONS] FILES...

  Scale all coordinates of PageXML files.

  FILES: List of PageXML files to scale. Supports multiple file paths,
  wildcards or directories (when used with the -g option).

Options:
  --help                  Show this message and exit.
  -f, --factor FLOAT      Scale factor to resize pages. Ignored if --width or
                          --height is set.
  -w, --width INTEGER     Set a fixed output width (in pixels) for all pages.
                          Overrides --factor.
  -h, --height INTEGER    Set a fixed output height (in pixels) for all pages.
                          Overrides --factor and --width.
  -g, --glob TEXT         Specify a glob pattern to match image files when
                          processing directories in FILES.  [default: *.xml]
  -o, --output DIRECTORY  Directory to save the output PageXML files. If not
                          specified, original files are overwritten.
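The precedence of `--height`, `--width` and `--factor` can be sketched like this. The function names are hypothetical, and the sketch assumes that a fixed width or height is turned into a single factor applied to both axes (preserving the aspect ratio), which the help text does not state explicitly. The `points` string follows the PageXML `x1,y1 x2,y2 ...` convention.

```python
def resolve_factor(page_width, page_height, factor=None, width=None, height=None):
    """Pick the effective scale factor: height beats width beats factor."""
    if height is not None:
        return height / page_height
    if width is not None:
        return width / page_width
    if factor is not None:
        return factor
    raise ValueError("one of factor, width or height is required")


def scale_points(points_attr, factor):
    """Scale a PageXML points string and round to integer pixels."""
    scaled = []
    for pair in points_attr.split():
        x, y = pair.split(",")
        scaled.append(f"{round(int(x) * factor)},{round(int(y) * factor)}")
    return " ".join(scaled)


factor = resolve_factor(1000, 2000, factor=0.5, width=500, height=500)
print(factor)
print(scale_points("100,200 300,400", 0.5))
```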

ZPD

Developed at the Centre for Philology and Digitality (ZPD), University of Würzburg.
