Toolset for working with PageXML files.


git clone --recurse-submodules
pip install -r xmltools/requirements.txt


python xmltools --help



Convert OCR file formats (currently only ABBYY) to PageXML.

python xmltools convert --help


> python xmltools convert abbyy --help
Usage: xmltools convert abbyy [OPTIONS] FILES...

  Convert different XML ocr formats to PageXML.

  FILES: List of ABBYY XML files to convert. Supports multiple file paths,
  wildcards or directories (when used with the -g option).

  --help                   Show this message and exit.
  -o, --output DIRECTORY   Directory to save the output PageXML files.
                           Defaults to the parent directory of each input
  -g, --glob TEXT          Specify a glob pattern to match image files when
                           processing directories in FILES.  [default: *.xml]
  -i, --image-suffix TEXT  The suffix appended to the imageFilename tag within
                           the output PageXML file.  [default: .png]
  -x, --xml-suffix TEXT    Append a suffix to the output PageXML file names.
                           For example, using `.seg.xml` results in filenames
                           like `imagename.seg.xml`  [default: .xml]
  -lp, --line-polygon      When calculating the masks for each TextLine, use
                           the coordinates from each character to build a
                           polygon. If not set, use a bounding box.
  -rp, --region-polygon    When calculating the masks for each TextRegion, use
                           the coordinates from each TextLine to build a
                           polygon. If not set, use a bounding box.
  --use-polygons           When calculating the TextRegion polygon, use the
                           polygon coordinates from each TextLine. If not set,
                           use the lines bounding boxes. Output depends on the
                           input file.
  --padding INTEGER        Set inner padding for region polygons in px (only
                           used if --region-polygon is set).  [default: 10]
  --creator TEXT           Specify the creator of the PageXML file. This can
                           be useful for tracking the origin of segmented


Shrink PageXML regions.

> python xmltools shrink --help
Usage: xmltools shrink [OPTIONS] FILES...

  Shrink regions of a PageXML file to its content.

  FILES: List of PageXML files to shrink regions in. Supports multiple file
  paths, wildcards or directories (when used with the -g option).  Images
  should be in the same directory as the XML files with matching
  `imageFilename` tags.

  --help                  Show this message and exit.
  -o, --output DIRECTORY  Directory to save the output PageXML files.
                          Overwrite original files if not set.
  -g, --glob TEXT         Specify a glob pattern to match image files when
                          processing directories in FILES.  [default: *.xml]
  -p, --padding INTEGER   Set inner padding for shrunk region polygon.
                          [default: 5]
  -a, --alpha FLOAT       Set alpha value for creating the alphashape around
                          the regions content. The lower the value, the faster
                          computation but the less precise output. shape.
                          [default: 0.02]
  -e, --epsilon FLOAT     Set the epsilon value to simplify the contours where
                          the alphashape is computed from. The lower the
                          value, the slower the computation but the more
                          precise output shape. A good starting point is


Transform PageXML files.

python xmltools transform --help


Rotate all coordinates of PageXML files.

> python xmltools transform rotate --help
Usage: xmltools transform rotate [OPTIONS] FILES...

  Rotate all coordinates of PageXML files.

  Negative coordinates resulting from the rotation will be set to 0 to
  maintain valid PageXML formatting.

  FILES: List of PageXML files to rotate. Supports multiple file paths,
  wildcards or directories (when used with the -g option).

  --help                         Show this message and exit.
  -a, --angle FLOAT              Angle to rotate points by. A positive value
                                 rotates points counterclockwise.  [required]
  -g, --glob TEXT                Specify a glob pattern to match image files
                                 when processing directories in FILES.
                                 [default: *.xml]
  -o, --output DIRECTORY         Directory to save the output PageXML files.
                                 If not specified, original files are
  --origin [center|tr|tl|br|bl]  Origin point for rotation: center, tr (top
                                 right), tl (top left), br (bottom right), or
                                 bl (bottom left).  [default: center]
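
The rotation described above can be illustrated with a minimal sketch. This is not the code used by xmltools: the function names `origin_point` and `rotate_points` are hypothetical, and the sketch assumes image coordinates (y axis pointing down), integer output coordinates, and that "counterclockwise" refers to the direction as seen on the page image.

```python
import math


def origin_point(name, width, height):
    """Map an --origin keyword to pixel coordinates (hypothetical helper)."""
    corners = {
        "center": (width / 2, height / 2),
        "tl": (0, 0),
        "tr": (width, 0),
        "bl": (0, height),
        "br": (width, height),
    }
    return corners[name]


def rotate_points(points, angle_deg, origin):
    """Rotate (x, y) points around origin by angle_deg degrees.

    Assumption: the y axis points down, so a visually counterclockwise
    rotation uses the transposed form of the usual rotation matrix.
    Negative results are clamped to 0, as the help text describes.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    ox, oy = origin
    rotated = []
    for x, y in points:
        dx, dy = x - ox, y - oy
        rx = ox + dx * cos_t + dy * sin_t
        ry = oy - dx * sin_t + dy * cos_t
        # Clamp negative coordinates to 0 to keep the PageXML valid.
        rotated.append((max(0, round(rx)), max(0, round(ry))))
    return rotated


center = origin_point("center", 100, 100)
print(rotate_points([(60, 50)], 90, center))  # point right of the center moves above it
print(rotate_points([(10, 0)], 90, (0, 0)))   # negative y is clamped to 0
```

Clamping keeps the file valid but distorts shapes that leave the page, so rotating around `center` is usually the safer choice for small angle corrections.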


Scale all coordinates of PageXML files.

> python xmltools transform scale --help
Usage: xmltools transform scale [OPTIONS] FILES...

  Scale all coordinates of PageXML files.

  FILES: List of PageXML files to scale. Supports multiple file paths,
  wildcards or directories (when used with the -g option).

  --help                  Show this message and exit.
  -f, --factor FLOAT      Scale factor to resize pages. Ignored if --width or
                          --height is set.
  -w, --width INTEGER     Set a fixed output width (in pixels) for all pages.
                          Overrides --factor.
  -h, --height INTEGER    Set a fixed output height (in pixels) for all pages.
                          Overrides --factor and --width.
  -g, --glob TEXT         Specify a glob pattern to match image files when
                          processing directories in FILES.  [default: *.xml]
  -o, --output DIRECTORY  Directory to save the output PageXML files. If not
                          specified, original files are overwritten.
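
The precedence between `--factor`, `--width` and `--height` can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: `resolve_scale` and `scale_points` are hypothetical names, and the sketch assumes a single uniform factor (aspect ratio preserved) and rounding to integer pixel coordinates.

```python
def resolve_scale(page_width, page_height, factor=None, width=None, height=None):
    """Pick one uniform scale factor following the documented precedence.

    --height overrides --factor and --width; --width overrides --factor.
    Assumption: with no option given, the page is left unchanged.
    """
    if height is not None:
        return height / page_height
    if width is not None:
        return width / page_width
    if factor is not None:
        return factor
    return 1.0


def scale_points(points, scale):
    """Scale (x, y) points and round to integer pixel coordinates."""
    return [(round(x * scale), round(y * scale)) for x, y in points]


# A 1000x2000 page with all three options set: --height wins.
scale = resolve_scale(1000, 2000, factor=0.5, width=500, height=500)
print(scale)
print(scale_points([(100, 200), (400, 800)], scale))
```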


Developed at the Centre for Philology and Digitality (ZPD), University of Würzburg.


