
Euro € symbol gets swallowed during TXT import?

I have a simple, short plain-text file (*.txt) containing narrative text written in German, with words, sentences, the usual punctuation, numbers, and also the EUR currency symbol (€). I import it with

Import["test.txt", "Text"]

, and then I notice that the resulting string lacks all instances of the Euro symbol. Apparently they got replaced by a space char, in other words: deleted!, blanked!, swallowed!, skipped!, omitted!, not imported!, not properly imported!, gone!, overridden!, etc.

FYI the € symbol is a standard key on any German layout keyboard; it is entered by pressing <Alt Gr>+<E>. It is also allowed as a char in file names (Windows OS), and it cannot be regarded as a special char such as: <>|":/?*\ The currency symbol is part of the keyboard alphabet, if you will! At least in Europe.

Maybe there is a normal explanation and a workaround for the observed behavior, e.g. an option to Import[] or a general setting/preference. Or maybe it could be considered a bug? Because, interestingly, when I import the text file with

Import["test.txt", "HTML"]

, the resulting string does contain all instances of the € symbol, as expected. By reporting this observation I am glad to have helped raise public awareness and make those responsible improve the software :-P

Attachments:
POSTED BY: Raspi Rascal
9 Replies
Posted 6 years ago
POSTED BY: Sean Cheren
POSTED BY: Raspi Rascal
POSTED BY: Raspi Rascal

I see the € symbol from your file using either the default (UTF-8) or ISOLatin1 encodings. *Mathematica* 11.3 on MacOSX.

POSTED BY: John Doty
Posted 6 years ago

While some WL functions depend on the computer system by using $CharacterEncoding (WindowsANSI on Windows, UTF-8 on *NIX), Import[.., "Text"] is not one of them. When importing text in general these days, UTF-8 is the most common default on all platforms, which is what WL uses internally for Text import by default.
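
To make the distinction concrete, here is a minimal sketch of checking the system-dependent setting and of naming an encoding explicitly on import. The choice of "WindowsANSI" is only an assumption about how test.txt was saved (e.g. by a Windows editor); it is not confirmed anywhere in this thread.

```wolfram
(* System-dependent default consulted by some WL functions;
   according to the explanation above, Import[..., "Text"] does not use it. *)
$CharacterEncoding

(* List the encoding names this installation knows about. *)
$CharacterEncodings

(* Name the encoding explicitly instead of relying on the UTF-8 default.
   "WindowsANSI" is an assumption about the file, not a confirmed fact. *)
Import["test.txt", "Text", CharacterEncoding -> "WindowsANSI"]
```

If the explicit option recovers the € symbol, the file was most likely saved in a single-byte Windows encoding rather than UTF-8.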

POSTED BY: Sean Cheren
Posted 6 years ago

And in the future, the WL will be able to automatically detect the encoding when Importing Text.

POSTED BY: Yihe Dong

Your test.txt encodes the offending symbol as octal 200, an ambiguous encoding in general. I suspect that the reason your HTML works is that the HTML file includes information on which encoding you're using.

I suggest you try:

Import["test.txt", CharacterEncoding -> "ISOLatin1"]
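
If it helps to confirm which byte the file really contains, a rough sketch along these lines may be useful. It assumes test.txt is in the current working directory. Byte 128 (octal 200) is the Euro sign in Windows code page 1252, which WL calls "WindowsANSI", whereas in UTF-8 the Euro sign is the three-byte sequence 226, 130, 172.

```wolfram
(* Read the file as raw bytes, with no character decoding applied. *)
bytes = BinaryReadList["test.txt"];

(* Positions of byte 128 (octal 200); a non-empty result suggests a
   single-byte Windows encoding rather than UTF-8. *)
Position[bytes, 128]

(* Decode that single byte under the Windows code page for comparison. *)
FromCharacterCode[128, "WindowsANSI"]
```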
POSTED BY: John Doty

Hello Mr Huisman, thanks for the suggestion. I've tried your code, with the same (negative) result. The test.txt test file is attached to the OP. If you guys can import the file as Text or Plaintext without the EUR symbol getting swallowed, then it must be a problem with my computer system. In that case, never mind.

POSTED BY: Raspi Rascal

What about:

Import["test.txt", "Plaintext"]

or

Import["test.txt", "Plaintext"]

?

POSTED BY: Sander Huisman