2.3.3 Image processing
After document images have been acquired, they need to be processed before further steps in the digitisation pipeline. The key steps of image processing are binarisation, enhancement, and normalisation (Gatos, 2014).
[Figure 2.3 diagram. Labels: Acquisition; Image processing; Layout analysis (geometric regions, logical regions); Character recognition; Article segmentation (visual approaches, textual approaches). Data labels: physical media (microfilm); scanned greyscale image; processed binary image; digitised text. The example page contains two newspaper articles, "China keen on Qld spaceport" and "Gold-diggers may be charged with murder".]
Figure 2.3: The digitisation pipeline and where article segmentation fits in. Excerpt from the SMH 1988-03-09, p. 19.
Images are usually acquired in greyscale or colour. In an 8-bit greyscale image, each pixel is represented as a number between 0 and 255 indicating intensity of light. Colour images have multiple numbers for each pixel, each representing intensity of light of a certain colour. Binarisation is the conversion of greyscale or colour images into a binary image, one that has only two possible values for each pixel. This significantly reduces storage requirements, improves legibility of text, and improves the performance of layout analysis and recognition. Approaches to binarisation are typically threshold based, with global thresholding (Otsu, 1979), local thresholding (Gatos et al., 2006) and hybrid approaches (Tseng and Lee, 2008).
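To make global thresholding concrete, the following is a minimal sketch of the idea behind Otsu's method: choose the single intensity threshold that maximises the variance between the dark and light pixel classes. It assumes an 8-bit greyscale image stored as a list of rows of integers; the function names `otsu_threshold` and `binarise` are illustrative, not taken from any cited work or library.

```python
def otsu_threshold(pixels):
    """Return an 8-bit threshold that maximises between-class variance.

    `pixels` is a flat sequence of integers in the range 0..255 (assumed).
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0     # number of pixels with intensity <= t
    sum0 = 0   # summed intensity of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue   # no pixels in the dark class yet
        if w1 == 0:
            break      # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(image, threshold):
    """Map pixels at or below the threshold to 1 (ink) and the rest to 0."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Usage on a tiny synthetic image with dark text on a light background.
image = [[30, 35, 200], [210, 32, 205]]
t = otsu_threshold([p for row in image for p in row])
binary = binarise(image, t)
```

Because a single threshold is applied to the whole page, this kind of approach can struggle when background intensity varies across the image, which is one motivation for local and hybrid methods.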
Image enhancement aims to improve image quality by improving contrast and background uniformity, removing noise in image backgrounds, and reducing bleed-through. Lack of contrast can make it difficult to distinguish between text and background pixels. Images that have been scanned with an illumination source can often have non-uniform background intensity. Leung et al. (2005), Likforman-Sulem et al. (2011), and Nomura et al. (2009) discuss techniques for improving image contrast and background uniformity. Techniques for removing bleed-through in document images include those that have access to the reverse side of the page (non-blind, e.g. Dubois and Pathak, 2001), and those that do not have access to the reverse side (blind, e.g. Kim et al., 2002).
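As a simple illustration of contrast improvement, the sketch below applies linear contrast stretching with percentile clipping. This is a generic textbook operation chosen for brevity, not the method of any of the works cited above, and it does not address background non-uniformity or bleed-through. The function name `stretch_contrast` and its parameters are assumptions made for this example.

```python
def stretch_contrast(image, low_pct=2.0, high_pct=98.0):
    """Linearly rescale intensities so the chosen percentiles map to 0 and 255.

    `image` is a list of rows of 8-bit greyscale values (assumed).
    Values outside the percentile range are clipped.
    """
    flat = sorted(p for row in image for p in row)
    n = len(flat)
    lo = flat[int((n - 1) * low_pct / 100.0)]
    hi = flat[int((n - 1) * high_pct / 100.0)]
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in image]
    scale = 255.0 / (hi - lo)
    return [
        [min(255, max(0, round((p - lo) * scale))) for p in row]
        for row in image
    ]


# Usage: a low-contrast patch whose values span only 100..150.
patch = [[100, 110], [120, 150]]
stretched = stretch_contrast(patch, low_pct=0.0, high_pct=100.0)
```

After stretching, the darkest and lightest pixels in the patch occupy the full 8-bit range, which widens the gap between text and background intensities before binarisation.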
Image normalisation determines page orientation and de-skews images. The orientation of document images is not known implicitly and must be determined so that lines of text are horizontal. This involves detection of portrait or landscape orientation and is typically accomplished using projection histograms (such as Le et al., 1994) or by counting black-to-white transitions (such as Yin, 2001).
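The intuition behind projection histograms can be sketched as follows. When text lines run horizontally, the count of ink pixels per row alternates sharply between text lines and the gaps between them, while the count per column is comparatively flat. The heuristic below compares the variance of the two normalised profiles. It is a simplified illustration under the assumption of a binary image with 1 for ink, not a reimplementation of the cited methods, and the function names are hypothetical.

```python
def projection_profiles(binary):
    """Return (row_profile, column_profile) as fractions of ink per row/column."""
    height = len(binary)
    width = len(binary[0])
    rows = [sum(row) / width for row in binary]
    cols = [sum(col) / height for col in zip(*binary)]
    return rows, cols


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def text_lines_are_horizontal(binary):
    """Guess whether text lines run horizontally (True) or vertically (False)."""
    rows, cols = projection_profiles(binary)
    return variance(rows) >= variance(cols)


# Usage: a toy page with two solid "text lines" separated by blank rows.
page = [
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
upright = text_lines_are_horizontal(page)
rotated = text_lines_are_horizontal([list(col) for col in zip(*page)])
```

A page judged to have vertical text lines would then be rotated by 90 degrees before further processing. Distinguishing upright from upside-down text needs additional cues that this sketch does not cover.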
Additionally, scanning or digital photography can capture skewed images if the imaging sensor is not perfectly parallel to the physical document. Approaches to correct image skew include using projection profiles (Baird, 1995), Hough transforms (Amin and Fischer, 2000), and nearest-neighbour clustering (Lu and Tan, 2003).