IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 21, NO. 11, NOVEMBER 1999

TextFinder: An Automatic System to Detect and Recognize Text In Images

Victor Wu, Raghavan Manmatha, Member, IEEE, and Edward M. Riseman, Senior Member, IEEE

Abstract: A robust system is proposed to automatically detect and extract text in images from different sources, including video, newspapers, advertisements, stock certificates, photographs, and checks. Text is first detected using multiscale texture segmentation and spatial cohesion constraints, then cleaned up and extracted using a histogram-based binarization algorithm. An automatic performance evaluation scheme is also proposed.

Index Terms: Text reading system, character recognition, multimedia indexing, text detection, texture segmentation, filters, hierarchical processing, binarization, connected components.

The authors are with the Department of Computer Science, University of Massachusetts, Amherst, MA. Manuscript received 27 Dec. [year illegible]; revised 18 Aug. [year illegible].

1 Introduction

Most information available today is either on paper or in the form of photographs and videos. To build digital libraries, this information needs to be digitized into images and the text converted to ASCII for storage, retrieval, and manipulation. However, current optical character recognition (OCR) technology [1], [10] is restricted to finding text printed against clean backgrounds, and cannot handle text printed against shaded or textured backgrounds or embedded in images. More sophisticated text reading systems employ document analysis (page segmentation) schemes to identify text regions before applying OCR, so that time is not spent trying to interpret nontext items.

Etemad et al. [2] used a neural net to classify the output of wavelets into text and nontext regions. The neural net requires a rich set of training examples to work effectively. Other schemes require clean, binary input [3], [13], [14], [15]; some assume specific document layouts such as video frames [12], newspapers [6], or technical journals [9]; others are domain specific, like mail address blocks [1]. Thus, there is a need for robust systems which extract and recognize text from general backgrounds.

2 A New System

The system takes advantage of the distinctive characteristics of text which make it stand out from other image material. For example, by looking at the comic page of a newspaper a few feet away, one can probably tell quickly where the text is without actually recognizing individual characters. Intuitively, text has the following distinguishing characteristics: 1) text possesses certain frequency and orientation information; 2) text shows spatial cohesion: characters of the same text string (a word, or words in the same line) are of similar heights, orientation, and spacing.

2.1 Step 1: Texture Segmentation Module

The first characteristic suggests that text may be treated as a distinctive texture. The first phase of the system, therefore, adapts a standard texture segmentation scheme (Fig. 1) to segment text regions. This consists of a linear filtering stage followed by a nonlinear stage [7]. Nine second-order derivatives of Gaussians at scales of σ = (1, √2, 2) are used as filters. Each filter output is passed through the nonlinear transformation tanh(αt). For each pixel location, local energy estimates are computed using the outputs of the nonlinear transformation, and these form a feature vector for that pixel. The set of feature vectors is clustered using the K-means algorithm (with K = 3). Fig. 2b shows the result of texture segmentation applied to Fig. 2a.

The texture segmentation scheme used is not sufficient for text detection and extraction if images more complicated than clean newspaper scans have to be dealt with. Nevertheless, the segmentation result can be used as a focus of attention for further processing called Chip Generation.

2.2 Step 2: Chip Generation Module

The basic idea for chip generation is to apply a set of appropriate heuristics to find text strings within/near the segmented regions.
The heuristics are designed to reflect the spatial cohesion of the text. The algorithm uses a bottom-up approach: significant edges form strokes (connected components); strokes are aggregated to form chips (regions) corresponding to text strings. The rectangular bounding boxes of the chips are used to indicate the locations of the hypothesized (detected) text strings. Chip Generation consists of the following ordered steps:

1. Stroke Generation by grouping edge pixels using connected components.
2. Stroke Filtering to eliminate strokes unlikely to belong to any horizontal text string.
3. Stroke Aggregation to form chips of connected strokes likely to belong to the same text string.
4. Chip Filtering to eliminate chips unlikely to correspond to horizontal text strings.
5. Chip Extension, where filtered chips are treated as strokes and aggregated again to form chips which cover the text strings more completely.

The purpose of Stroke Filtering is to eliminate the false positive strokes with heuristics that capture the fact that neighboring characters/words in the same text string usually have similar heights and are horizontally aligned. It is reasonable to assume that similarity of character heights causes heights of corresponding strokes to be similar. These heuristics can be described using connectability, defined as:

Definition. Strokes A and B are connectable if they are of similar height and horizontally aligned, and there is a path between A and B, where a path is a horizontal sequence of consecutive pixels in the segmented region which connects A and B through adjacent pixels.

Here, two strokes are considered to be of similar height if the height of the shorter stroke is at least 40 percent of that of the taller one. To determine the horizontal alignment, strokes are projected onto the Y-axis. If the overlap of the projections of two strokes is at least 50 percent of the shorter stroke, they are considered to be horizontally aligned.
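The two geometric tests in the connectability definition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Box type and all function names are hypothetical, each stroke is reduced to its bounding box, and the path condition (a horizontal run of pixels inside the segmented region) is left out because it needs the segmentation mask.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a stroke, in inclusive pixel coordinates."""
    top: int
    bottom: int
    left: int
    right: int

    @property
    def height(self) -> int:
        return self.bottom - self.top + 1


def similar_height(a: Box, b: Box, min_ratio: float = 0.4) -> bool:
    # The shorter stroke must be at least 40 percent as tall as the taller one.
    shorter, taller = sorted((a.height, b.height))
    return shorter >= min_ratio * taller


def horizontally_aligned(a: Box, b: Box, min_overlap: float = 0.5) -> bool:
    # Project both strokes onto the Y-axis and measure the shared rows.
    overlap = max(min(a.bottom, b.bottom) - max(a.top, b.top) + 1, 0)
    # The shared rows must cover at least 50 percent of the shorter stroke.
    return overlap >= min_overlap * min(a.height, b.height)


def geometrically_compatible(a: Box, b: Box) -> bool:
    # Both tests must hold; the path test of the definition is omitted here.
    return similar_height(a, b) and horizontally_aligned(a, b)
```

For example, two strokes of height 10 whose rows overlap by 7 pixels pass both tests, while a stroke of height 3 next to a stroke of height 10 fails the height test even if it lies entirely within the taller stroke's rows.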
More formally, a stroke is eliminated if either 1) it does not sufficiently overlap with the segmented text regions, or 2) it has no connectable stroke. Condition 1 says that the strokes are expected to overlap the segmented regions. Since text segmentation is not perfect, one cannot expect total overlap. A minimum of 20 percent overlap rate worked well for all the test images. Condition 2 says that if no path leads to some connectable stroke(s), it is probably an isolated stroke or line which does not belong to any text string.

Since characters which belong to the same text string are expected to be of similar height and horizontally aligned, the concept of connectability can be used to aggregate strokes to generate chips that correspond to text strings. By empirical observation across a range of text sources, the spacing between the characters and words of a text string is usually less than twice the height of the tallest character, and so is the width of a character in most fonts. Therefore, the following criterion is used to generate chips: two strokes, A and B, are connected if they are connectable and there is a path between A and B whose length is less than twice the height of the taller stroke.

[Fig. 1. The top-level components of the text detection and extraction system: Texture Segmentation and Chip Generation at each resolution level, followed by Chip Scale Fusion, Text Clean-up, Chip Refinement, and Character Recognition.]

[Fig. 2. An example of Texture Segmentation and Chip Generation at one resolution level: (a) a portion of an input image; (b) the output of the texture segmentation; further panels show the text regions, the strokes produced by the Stroke Generation process, and the chips produced by Stroke Aggregation, Chip Filtering, and Chip Extension, mapped onto the input image.]

[Fig. 3. The scale problem and its solution: chips generated at each resolution level, and the chips from all levels mapped onto the input image with scale-redundant chips removed.]

Text strings are expected to have a certain height in order to be reliably recognized by an OCR system. Thus, one choice is to filter out chips whose height is small. Furthermore, since we are interested in text strings, not just isolated characters, the width of a chip is also used as a filter. Last, for horizontally aligned text strings, the width is usually large relative to the height.

Therefore, chips are filtered using the following constraints on their minimum bounding boxes: a chip is eliminated if the width of its box is less than c_w, or the height of its box is less than c_h, or the aspect ratio (height/width) of its box is larger than ratio.

It is usually difficult even for a human to read text when its height is less than seven pixels; thus, 7 has been used for c_h in the experiments. A horizontal text string is usually longer horizontally, hence setting c_w to at least twice the minimum height seems reasonable. Thus, in all of our experiments, c_w = 15 and c_h = 7 were used. Normally, the width of a text string should be larger than its height. But in some fonts, the height of a character is larger than its width. Therefore, ratio = 0.9 is used here, attempting to cover that case to some extent.

Some of the strokes may only cover fragments of the corresponding characters. Therefore, these strokes might violate the constraints used for stroke filtering, and hence be eliminated. Consequently, some chips may only cover part of the corresponding text strings. Fortunately, this fragmentation problem can usually be corrected.
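The chip filtering constraints can be illustrated with a short Python sketch. The function name keep_chip and its parameter names are hypothetical. The default thresholds are the values quoted in the text, and the aspect ratio is assumed here to be the box height divided by the box width; treat both as a reading of the text rather than a confirmed specification.

```python
def keep_chip(width: int, height: int,
              c_w: int = 15, c_h: int = 7, ratio: float = 0.9) -> bool:
    """Return True if a chip's bounding box survives Chip Filtering.

    Assumptions: width and height are in pixels, and the aspect ratio
    is height / width (hypothetical reading of the constraint).
    """
    if width < c_w:
        return False  # too narrow to be a text string
    if height < c_h:
        return False  # too short to be read reliably by OCR
    # Reject boxes that are too tall relative to their width.
    return height / width <= ratio


# A wide, readable box is kept; narrow, short, or tall boxes are dropped.
candidates = [(60, 12), (10, 12), (40, 5), (20, 19)]
kept = [box for box in candidates if keep_chip(*box)]
```

With these defaults, only the 60 x 12 box remains in kept: the others fail the width, height, and aspect ratio tests, respectively.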
Notice that the chips corresponding to the same text string are still horizontally aligned and of similar height. Thus, by treating the chips as strokes, the Stroke Aggregation procedure can be applied again to aggregate the chips into larger chips and capture more complete words in the extended chips.

2.3 Step 3: Chip Scale Fusion Module

The text detection procedures just outlined work well for text over a certain range of font sizes. To detect text whose font size varies significantly, the input image is processed at different resolutions. The output chip boxes generated at each resolution level are then mapped back onto the original image (Chip Scale Fusion). Fig. 3 shows an example of how chips at different scales are detected, with a hierarchy of three levels used to find text of fonts up to 160 pixels in height.

2.4 Step 4: Text Cleanup Module

Since each of the generated text bounding boxes usually contains text of similar intensities and background (usually around single words or groups of words), a single threshold suffices to clean up and binarize the corresponding region so that text stands out. A simple, effective histogram-based algorithm, as described in [16], is used to find the threshold value automatically for each text region. This algorithm is used for the Text Cleanup module in the system.

[Fig. 4. (a) An input image from the New Yorker magazine. (b) Binarization result before the refinement step. (c) Binarization result after refinement.]

2.5 Step 5: Chip Refinement Module

Nontext items might survive the previous processing and occur in the binarized output (Fig. 4b). Thus, a Chip Refinement phase is used in the system to filter them out. This is done by treating the