Toolset for working with PageXML files.
git clone --recurse-submodules https://github.com/jahtz/xmltools
pip install -r xmltools/requirements.txt
python xmltools --help
Convert different OCR file formats (currently only ABBYY) to PageXML.
python xmltools convert --help
> python xmltools convert abbyy --help
Usage: xmltools convert abbyy [OPTIONS] FILES...
Convert different XML ocr formats to PageXML.
FILES: List of ABBYY XML files to convert. Supports multiple file paths,
wildcards or directories (when used with the -g option).
Options:
--help Show this message and exit.
-o, --output DIRECTORY Directory to save the output PageXML files.
Defaults to the parent directory of each input
file.
-g, --glob TEXT Specify a glob pattern to match image files when
processing directories in FILES. [default: *.xml]
-i, --image-suffix TEXT The suffix appended to the imageFilename tag within
the output PageXML file. [default: .png]
-x, --xml-suffix TEXT Append a suffix to the output PageXML file names.
For example, using `.seg.xml` results in filenames
like `imagename.seg.xml` [default: .xml]
-lp, --line-polygon When calculating the masks for each TextLine, use
the coordinates from each character to build a
polygon. If not set, use a bounding box.
-rp, --region-polygon When calculating the masks for each TextRegion, use
the coordinates from each TextLine to build a
polygon. If not set, use a bounding box.
--use-polygons When calculating the TextRegion polygon, use the
polygon coordinates from each TextLine. If not set,
use the lines bounding boxes. Output depends on the
input file.
--padding INTEGER Set inner padding for region polygons in px (only
used if --region-polygon is set). [default: 10]
--creator TEXT Specify the creator of the PageXML file. This can
be useful for tracking the origin of segmented
files.
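As a rough illustration of the default (bounding box) behaviour described above, the following sketch merges character boxes into one TextLine box and formats it as a PageXML `points` string. It is not the converter's actual code: the function names `bounding_box` and `to_points` are hypothetical, and the input is assumed to be `(l, t, r, b)` tuples as found on ABBYY `charParams` elements.

```python
from typing import Iterable, Tuple

# (left, top, right, bottom), assumed to mirror the l/t/r/b attributes of ABBYY charParams
Box = Tuple[int, int, int, int]


def bounding_box(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that encloses all given boxes."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def to_points(box: Box) -> str:
    """Format a box as a PageXML Coords points string (clockwise from top left)."""
    left, top, right, bottom = box
    return f"{left},{top} {right},{top} {right},{bottom} {left},{bottom}"


# Two hypothetical character boxes belonging to one line
chars = [(10, 20, 30, 40), (32, 18, 50, 41)]
line_box = bounding_box(chars)
line_points = to_points(line_box)
```

With `--line-polygon` or `--region-polygon` set, the tool builds a polygon instead; that step is not covered by this sketch.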
Shrink PageXML regions to their content.
> python xmltools shrink --help
Usage: xmltools shrink [OPTIONS] FILES...
Shrink regions of a PageXML file to its content.
FILES: List of PageXML files to shrink regions in. Supports multiple file
paths, wildcards or directories (when used with the -g option). Images
should be in the same directory as the XML files with matching
`imageFilename` tags.
Options:
--help Show this message and exit.
-o, --output DIRECTORY Directory to save the output PageXML files.
Overwrite original files if not set.
-g, --glob TEXT Specify a glob pattern to match image files when
processing directories in FILES. [default: *.xml]
-p, --padding INTEGER Set inner padding for shrunk region polygon.
[default: 5]
-a, --alpha FLOAT Set alpha value for creating the alphashape around
the region's content. The lower the value, the faster
the computation but the less precise the output shape.
[default: 0.02]
-e, --epsilon FLOAT Set the epsilon value to simplify the contours where
the alphashape is computed from. The lower the
value, the slower the computation but the more
precise output shape. A good starting point is
`5.0`.
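The idea behind shrinking can be sketched in a much simplified form. The example below deliberately swaps the alphashape for a plain bounding box around the foreground pixels, so `--alpha` and `--epsilon` have no counterpart here. The function name `shrink_box`, the binary image representation, and the reading of "inner padding" as a margin kept around the content are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# (left, top, right, bottom); right and bottom are exclusive
Box = Tuple[int, int, int, int]


def shrink_box(image: List[List[int]], region: Box, padding: int = 5) -> Optional[Box]:
    """Shrink `region` to the foreground pixels inside it, keeping `padding` px around them.

    `image` is a binary image as nested lists (non-zero = foreground).
    Returns None if the region contains no foreground pixels.
    """
    left, top, right, bottom = region
    xs, ys = [], []
    for y in range(top, bottom):
        for x in range(left, right):
            if image[y][x]:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # nothing to shrink to
    # Never grow beyond the original region
    return (
        max(left, min(xs) - padding),
        max(top, min(ys) - padding),
        min(right, max(xs) + 1 + padding),
        min(bottom, max(ys) + 1 + padding),
    )


# 10x10 blank page with two foreground pixels
page = [[0] * 10 for _ in range(10)]
page[4][4] = 1
page[5][5] = 1
shrunk = shrink_box(page, (0, 0, 10, 10), padding=1)
```

A bounding box is the coarsest possible outline. The alphashape used by the tool can follow the content more closely, which is what the `--alpha` and `--epsilon` trade-offs above control.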
Transform PageXML files.
python xmltools transform --help
Rotate all coordinates of PageXML files.
> python xmltools transform rotate --help
Usage: xmltools transform rotate [OPTIONS] FILES...
Rotate all coordinates of PageXML files.
Negative coordinates resulting from the rotation will be set to 0 to
maintain valid PageXML formatting.
FILES: List of PageXML files to rotate. Supports multiple file paths,
wildcards or directories (when used with the -g option).
Options:
--help Show this message and exit.
-a, --angle FLOAT Angle to rotate points by. A positive value
rotates points counterclockwise. [required]
-g, --glob TEXT Specify a glob pattern to match image files
when processing directories in FILES.
[default: *.xml]
-o, --output DIRECTORY Directory to save the output PageXML files.
If not specified, original files are
overwritten.
--origin [center|tr|tl|br|bl] Origin point for rotation: center, tr (top
right), tl (top left), br (bottom right), or
bl (bottom left). [default: center]
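The rotation described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the helper names `rotation_origin` and `rotate_point` are hypothetical, "counterclockwise" is taken to mean counterclockwise as seen on the image (y axis pointing down), and results are rounded to integers before negative values are clamped to 0.

```python
import math
from typing import Tuple


def rotation_origin(width: int, height: int, origin: str = "center") -> Tuple[float, float]:
    """Map an --origin keyword to page coordinates (assumed mapping)."""
    origins = {
        "center": (width / 2, height / 2),
        "tl": (0.0, 0.0),
        "tr": (float(width), 0.0),
        "bl": (0.0, float(height)),
        "br": (float(width), float(height)),
    }
    return origins[origin]


def rotate_point(x: float, y: float, angle: float, cx: float, cy: float) -> Tuple[int, int]:
    """Rotate (x, y) by `angle` degrees around (cx, cy).

    Positive angles rotate counterclockwise on screen, assuming image
    coordinates with the y axis pointing down. Negative results are set to 0.
    """
    theta = math.radians(angle)
    dx, dy = x - cx, y - cy
    new_x = cx + dx * math.cos(theta) + dy * math.sin(theta)
    new_y = cy - dx * math.sin(theta) + dy * math.cos(theta)
    return max(0, round(new_x)), max(0, round(new_y))


# 10x10 page: the point right of the center moves to the top edge
cx, cy = rotation_origin(10, 10, "center")
rotated = rotate_point(10, 5, 90, cx, cy)

# Rotating around the top left corner pushes y below 0, so it is clamped
clamped = rotate_point(10, 10, 90, *rotation_origin(10, 10, "tl"))
```

Clamping changes the shape of any polygon that leaves the page, so rotating around `center` is usually the safer choice for small angle corrections.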
Scale all coordinates of PageXML files.
> python xmltools transform scale --help
Usage: xmltools transform scale [OPTIONS] FILES...
Scale all coordinates of PageXML files.
FILES: List of PageXML files to scale. Supports multiple file paths,
wildcards or directories (when used with the -g option).
Options:
--help Show this message and exit.
-f, --factor FLOAT Scale factor to resize pages. Ignored if --width or
--height is set.
-w, --width INTEGER Set a fixed output width (in pixels) for all pages.
Overrides --factor.
-h, --height INTEGER Set a fixed output height (in pixels) for all pages.
Overrides --factor and --width.
-g, --glob TEXT Specify a glob pattern to match image files when
processing directories in FILES. [default: *.xml]
-o, --output DIRECTORY Directory to save the output PageXML files. If not
specified, original files are overwritten.
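The precedence between `--factor`, `--width` and `--height` can be sketched like this. The helper names `resolve_factor` and `scale_points` are hypothetical, and the sketch assumes a single uniform factor (aspect ratio preserved), integer rounding of the results, and a no-op when no option is given; the tool may handle these details differently.

```python
from typing import Optional


def resolve_factor(
    page_width: int,
    page_height: int,
    factor: Optional[float] = None,
    width: Optional[int] = None,
    height: Optional[int] = None,
) -> float:
    """Pick the effective scale factor: height beats width, width beats factor."""
    if height is not None:
        return height / page_height
    if width is not None:
        return width / page_width
    if factor is not None:
        return factor
    return 1.0  # assumption: nothing set means no scaling


def scale_points(points: str, factor: float) -> str:
    """Scale a PageXML points string such as '100,200 300,400'."""
    scaled = []
    for pair in points.split():
        x, y = pair.split(",")
        scaled.append(f"{round(int(x) * factor)},{round(int(y) * factor)}")
    return " ".join(scaled)


# Page of 1000x2000 px: --height wins over --width and --factor
effective = resolve_factor(1000, 2000, factor=3.0, width=500, height=500)
scaled_points = scale_points("100,200 300,400", effective)
```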
Developed at the Centre for Philology and Digitality (ZPD), University of Würzburg.