approaches used for language detection on images ...

Alussam
I have worked on corpus research, where text cleansing can be done relatively
straightforwardly. The problem is with images that contain text: how do you
detect which language the text is in, ...
Could you point me in the right direction? (I am a mathematician, so math is
not a problem for me at all.)
Thank you

--
JWein (via www.gimpusers.com/forums)
_______________________________________________
gimp-user-list mailing list
List address:    [hidden email]
List membership: https://mail.gnome.org/mailman/listinfo/gimp-user-list
List archives:   https://mail.gnome.org/archives/gimp-user-list

Re: approaches used for language detection on images ...

Liam R E Quin
On Tue, 2020-01-28 at 18:21 +0100, JWein wrote:
> I worked on corpora research and text cleansing can be done relatively
> straightforwardly. The problem is with images, images containing texts,
> which language, ...
> Could you point me in the right direction? (I am a mathematician, so Math
> is not a problem for me at all)
> Thank you

You need (1) feature extraction, finding the writing, (2) OCR of some
sort, to turn pictures of letters into letters, and then (3) the
linguistic analysis.
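
Once OCR has produced plain text, step (3) can start out very simply. Here is a minimal sketch in Python that counts stopword hits per language. The tiny STOPWORDS table and the guess_language name are made up for illustration and do not come from any library; a serious system would more likely use much larger word lists or character n-gram statistics.

```python
import re

# Hypothetical, deliberately tiny stopword lists; real ones would be much larger.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
    "fr": {"le", "la", "et", "les", "des", "est", "un", "une"},
}


def guess_language(text):
    """Return the language whose stopwords occur most often, or None if none occur."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("Der Hund ist nicht in das Haus gegangen"))
```

For texts that switch language mid-page, the same function could be applied per sentence or per line rather than per document, at the cost of less evidence per decision.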

However, many images contain metadata in plain text (OK, XML or
whatever) that may include language and location information.
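
As a rough illustration of the metadata route: XMP packets are RDF/XML, and language hints can appear either in a dc:language bag or as xml:lang attributes on localised values. The sketch below assumes the packet sits in the file as plain UTF-8 between "<x:xmpmeta" and "</x:xmpmeta>", which is common but not guaranteed. The xmp_languages name and the SAMPLE bytes are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def xmp_languages(data):
    """Collect language tags from the first XMP packet found in raw file bytes."""
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    if match is None:
        return set()
    root = ET.fromstring(match.group(0))
    langs = set()
    # dc:language holds an rdf:Bag of language codes.
    for li in root.findall(".//dc:language/rdf:Bag/rdf:li", NS):
        if li.text:
            langs.add(li.text.strip())
    # Localised values carry xml:lang; "x-default" names no language.
    for elem in root.iter():
        tag = elem.get(XML_LANG)
        if tag and tag != "x-default":
            langs.add(tag)
    return langs


# Made-up stand-in for the bytes of an image file with an embedded packet.
SAMPLE = b"""...image data...
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
   <dc:language><rdf:Bag><rdf:li>de</rdf:li></rdf:Bag></dc:language>
   <dc:title><rdf:Alt>
    <rdf:li xml:lang="x-default">Zeugnis</rdf:li>
    <rdf:li xml:lang="fr-FR">Bulletin</rdf:li>
   </rdf:Alt></dc:title>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
...more image data..."""

print(sorted(xmp_languages(SAMPLE)))  # ['de', 'fr-FR']
```

Such tags describe the metadata, or what whoever filled it in claimed; they say nothing certain about the text visible in the pixels.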

I'm interested in the text cleansing, can you tell me more (off list
maybe?)

Thank you!

slave liam

--
Liam Quin - web slave for https://www.fromoldbooks.org/
with fabulous vintage art and fascinating texts to read.
Click here to have the slave rewarded with extra work.


approaches used for language detection on images ...

Alussam
>You need (1) feature extraction, finding the writing, (2) OCR of some
>sort, to turn pictures of letters into letters, and then (3) the
>linguistic analysis.

Hey Liam:

Thank you, and yes, I guessed the way to go would be through the steps you
outline, but I am pretty sure some other GIMP developers have trodden those
paths before and may have some tips to share.
 
>However, many images contain metadata in plain text (OK, XML or
>whatever) that may include language and location information.

Most of the texts I work with are image-based PDF files, i.e. pages that were
scanned as images.

>I'm interested in the text cleansing, can you tell me more (off list
>maybe?)

"Text cleansing" (some also call it "text normalization", though to most
people that is just one phase of cleansing: making sure that the text is
normalized, e.g., in a java.text.Normalizer.Form way) means removing all the
BS visual distractions and the ephemeral commercial nonsense from pages.
 
 https://www.google.com/search?q="text+cleansing"
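
For the "normalized" part, Python's standard unicodedata module offers the same four forms as java.text.Normalizer.Form (NFC, NFD, NFKC, NFKD). A minimal sketch; the normalize_text helper and its whitespace-collapsing step are illustrative choices, not an established recipe:

```python
import unicodedata


def normalize_text(text, form="NFKC"):
    """Apply Unicode normalization, then collapse runs of whitespace."""
    normalized = unicodedata.normalize(form, text)
    return " ".join(normalized.split())


decomposed = "Cafe\u0301"      # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01le   name"   # starts with the "fi" ligature, U+FB01

print(len(decomposed), len(normalize_text(decomposed, "NFC")))  # 5 4
print(normalize_text(ligature))                                  # file name
```

NFKC folds compatibility characters such as ligatures, which is handy for searching OCR output but throws away information a faithful transcription may want; NFC is the gentler choice.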

For example, gutenberg.org has made the effort to turn lots of books into
plain text, but the files include a nonsensical header and footer and use hard
line breaks (something that was necessary back when people used mainframes
whose displays were 80 characters wide, ...).
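
A minimal sketch of that kind of cleanup in Python, assuming the file uses the "*** START OF THE/THIS PROJECT GUTENBERG EBOOK" and "*** END OF ..." marker lines (older files may differ) and that paragraphs are separated by blank lines. The function names are made up for the example.

```python
import re

START = re.compile(
    r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END = re.compile(
    r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_boilerplate(text):
    """Keep only what lies between the START and END marker lines, if both exist."""
    start, end = START.search(text), END.search(text)
    if start and end and start.end() < end.start():
        return text[start.end():end.start()].strip()
    return text.strip()


def unwrap(text):
    """Join hard-wrapped lines; blank lines remain paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


sample = (
    "Header junk\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n\n"
    "It was a dark\nand stormy night.\n\n"
    "The rain fell\nin torrents.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License junk\n"
)
print(unwrap(strip_boilerplate(sample)))
```

Unwrapping like this also flattens verse and any other layout where the line breaks carry meaning, so it only suits running prose.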

This kind of nonsense has become the new normal. I work as a teacher and I see
it as abusive, especially when done to students and people who are just trying
to get something done. Companies internally block certain sites, types of
content, pages and sections of pages; it is about time that people started
doing it more aggressively on their own. Some people will tell you about "user
agreements", "morality" and about "capitalism going down if people start doing
that more aggressively" ;-)

I do the same kinds of things you do, but these days I am more interested in
texts, especially if they relate to education. One of my research efforts
relates to a corpus of the Regents exams (going back to the 1860s). They
contain plenty of pictures embedded in the text, zero (nada) annotations, and
frequent language switches within the texts . . .

--
JWein (via www.gimpusers.com/forums)

approaches used for language detection on images ...

Alussam
ONE of my research efforts relates to a corpus of the Regents exams (going back
to the 1860's).

 https://en.wikipedia.org/wiki/Regents_Examinations

. . . frequent language switching mostly in the sentences of multilingual texts
. . .

--
JWein (via www.gimpusers.com/forums)

Re: approaches used for language detection on images ...

Ofnuts-2
In reply to this post by Alussam
On 1/29/20 1:52 PM, JWein wrote:
>> You need (1) feature extraction, finding the writing, (2) OCR of some
>> sort, to turn pictures of letters into letters, and then (3) the
>> linguistic analysis.
>   Hey Liam:
>
> Thank you, and yes, I could guess the way to go would be through the steps you
> outline, but I am pretty sure some other gimp developers have trodden those
> paths before and may have some tips to share.
>

Gimp is not really about OCR.

You would also have to define the range of languages you are interested
in. For instance, you can't OCR Cyrillic without knowing it's Cyrillic,
because many glyphs are indistinguishable from similar Latin glyphs but
have a different Unicode code point, and can be unrelated characters.
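
The look-alike problem is easy to see with Python's standard unicodedata module. In this small sketch the scripts_used helper is made up for illustration, and taking the first word of the character name is only a heuristic, since the standard library does not expose the real Unicode Script property.

```python
import unicodedata

print(unicodedata.name("a"))       # LATIN SMALL LETTER A
print(unicodedata.name("\u0430"))  # CYRILLIC SMALL LETTER A

latin_word = "cop"
cyrillic_word = "\u0441\u043e\u0440"  # renders almost identically to "cop"
print(latin_word == cyrillic_word)    # False


def scripts_used(word):
    """Return the first word of each letter's Unicode name, as a rough script label."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }


# A word mixing scripts is a hint that OCR picked the wrong alphabet somewhere.
print(sorted(scripts_used("c\u043ep")))  # ['CYRILLIC', 'LATIN']
```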



Re: approaches used for language detection on images ...

Liam R E Quin
In reply to this post by Alussam
On Wed, 2020-01-29 at 13:52 +0100, JWein wrote:

> > You need (1) feature extraction, finding the writing, (2) OCR of some
> > sort, to turn pictures of letters into letters, and then (3) the
> > linguistic analysis.
>
> Hey Liam:
>
> Thank you, and yes, I could guess the way to go would be through the steps
> you outline, but I am pretty sure some other gimp developers have trodden
> those paths before and may have some tips to share.

I doubt it.

There _are_ some people who use GIMP to clean up images preparatory to
running OCR on them, or who have done so in the past, but there are much
better programs for that.

I asked you about text cleansing (cleaning) because it has different
meanings in different contexts; i'm *certainly* not interested in
losing the page apparatus or hyphenation information, although in my
own work i mark them so software can skip them when wanted.

If you're doing an academic study of a book “manifestation” such things
are important, but i had rather use the Text Encoding Initiative as a
model than Michael Hart’s flailing Gutenberg project.

> I do the same kinds of things you do

I doubt that, at least from your description, but some of it may be a
language issue in reading the tone of your message. If you are doing
natural language processing and semantic-Web-style text mining your
needs for texts overlap with my personal projects but not so much with
GIMP, which is a bitmap image editor. For example, detecting Greek
words and phrases included in a 30,000-page OCR'd text by analyzing the
page images would interest me (and detecting italics for that matter);
if i ever have a spare few days i plan to try the (then) latest
Tesseract for that.
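
For comparison, the easy half of that problem, finding Greek once the OCR engine has already emitted Greek code points, can be sketched in a few lines of Python. It does nothing for the harder case of spotting Greek directly in page images. The greek_phrases name is made up, and the sketch assumes precomposed (NFC) text, since free-standing combining accents fall outside the two ranges used.

```python
import re

# Greek and Coptic (U+0370-U+03FF) plus Greek Extended (U+1F00-U+1FFF).
GREEK_WORD = r"[\u0370-\u03FF\u1F00-\u1FFF]+"
GREEK_RUN = re.compile(GREEK_WORD + r"(?:\s+" + GREEK_WORD + r")*")


def greek_phrases(text):
    """Return (offset, phrase) pairs for each run of consecutive Greek-script words."""
    return [(m.start(), m.group(0)) for m in GREEK_RUN.finditer(text)]


# "gnothi seauton" written with Greek code points, escaped to stay ASCII-safe.
text = (
    "He wrote \u03b3\u03bd\u1ff6\u03b8\u03b9 "
    "\u03c3\u03b5\u03b1\u03c5\u03c4\u03cc\u03bd on the wall."
)
for offset, phrase in greek_phrases(text):
    print(offset, phrase)
```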

--
Liam Quin - web slave for https://www.fromoldbooks.org/
