Euro symbol gets swallowed during TXT import?

Raspi Rascal

Raspi Rascal, novato, contributor, pseudo-wannabe (not even tryhard)

Posted 6 years ago

i have a short, simple plain-text file (.txt) containing a narrative text written in German, with words, sentences, the usual punctuation, numbers, and also the EUR currency symbol (€) in it. i import it with Import["test.txt", "Text"] , and then i notice that the resulting string lacks all instances of the Euro symbol. Apparently they got replaced by a space char, in other words: deleted!, blanked!, swallowed!, skipped!, omitted!, not imported!, not properly imported!, gone!, overridden!, etc.

FYI the symbol is a standard key on any German layout keyboard, it is invoked by pressing <Alt Gr>+<E>, it is also allowed as a char in file names (Windows OS), and it cannot be regarded as a special char such as: <>\|":/?\ The currency symbol is part of the keyboard alphabet, if you will! At least in Europe.

Maybe there is a normal explanation and a workaround for the observed behavior, e.g. an option to Import[] or general settings/preferences. Or maybe it could be considered a bug? Because, interestingly, when i import the text file with Import["test.txt", "HTML"] , the resulting string does contain all instances of the symbol, as expected.

By reporting this observation i am glad to have helped raise public awareness and make those responsible improve the software :-P

Attachments: test.txt

POSTED BY: Raspi Rascal

9 Replies

Sean Cheren

Sean Cheren, Wolfram

Posted 6 years ago

POSTED BY: Sean Cheren

Raspi Rascal

Raspi Rascal, novato, contributor, pseudo-wannabe (not even tryhard)

Posted 6 years ago

POSTED BY: Raspi Rascal

Raspi Rascal

Raspi Rascal, novato, contributor, pseudo-wannabe (not even tryhard)

Posted 6 years ago

POSTED BY: Raspi Rascal

John Doty

John Doty, Noqsi Aerospace Ltd

Posted 6 years ago

I see the € symbol from your file using either the default (UTF-8) or ISOLatin1 encodings. Mathematica 11.3 on Mac OS X.

POSTED BY: John Doty

Sean Cheren

Sean Cheren, Wolfram

Posted 6 years ago

While some WL functions depend on the computer system through $CharacterEncoding (WindowsANSI on Windows, UTF-8 on *NIX), Import[.., "Text"] is not one of them. For text in general these days, UTF-8 is the most common default on all platforms, and it is what WL uses by default for Text import.
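A minimal sketch of how one might act on this, assuming test.txt was saved in the Windows default encoding rather than UTF-8 (an assumption about the file, not something confirmed in this thread): check what the system reports, then name the encoding explicitly in the Import call.

```wolfram
(* The system default encoding: "WindowsANSI" on Windows, per the post above. *)
$CharacterEncoding

(* Text import does not consult $CharacterEncoding, so name the encoding explicitly.
   "WindowsANSI" is an assumption about how test.txt was saved. *)
Import["test.txt", "Text", CharacterEncoding -> "WindowsANSI"]
```

If the Euro symbols appear with this setting, the file is most likely not UTF-8, and the default Text import was decoding it with the wrong encoding.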

POSTED BY: Sean Cheren

Yihe Dong

Posted 6 years ago

And in the future, WL will be able to detect the encoding automatically when importing Text.

POSTED BY: Yihe Dong

John Doty

John Doty, Noqsi Aerospace Ltd

Posted 6 years ago

Your test.txt encodes the offending symbol as octal 200 (hex 80), a byte whose meaning depends on the encoding, so it is ambiguous in general. I suspect that the reason your HTML import works is that the HTML path includes information on which encoding you're using. I suggest you try: Import["test.txt", CharacterEncoding -> "ISOLatin1"]
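One way to check this, sketched under the assumption that test.txt sits in the current working directory: read the raw bytes and look for the value 128 (octal 200, hex 80). In Windows-1252 that byte is the Euro sign, in the ISO 8859-1 standard that position holds no printable character, and on its own it is not valid UTF-8, which is why the encoding has to be stated.

```wolfram
(* Read the file as raw bytes (integers 0..255), with no decoding applied. *)
bytes = BinaryReadList["test.txt"];

(* Positions of byte 128 (octal 200, hex 80), the suspected Euro symbol. *)
Position[bytes, 128]

(* The byte that Windows-1252 ("WindowsANSI" in WL) assigns to the Euro sign,
   for comparison with the value found in the file. *)
ToCharacterCode["\[Euro]", "WindowsANSI"]
```

If Position returns a non-empty list and the comparison matches, the file was probably written by a Windows editor in the system code page, and an explicit CharacterEncoding setting is the thing to try.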

POSTED BY: John Doty

Raspi Rascal

Raspi Rascal, novato, contributor, pseudo-wannabe (not even tryhard)

Posted 6 years ago

Hello Mr Huisman, thanks for the suggestion. I've tried your code, with the same (negative) result. The test.txt test file is attached to the OP. If you guys can import the file as Text or Plaintext without the EUR symbol getting swallowed, then it must be a problem with my computer system. In that case, never mind.