Euro € symbol gets swallowed during TXT import?

Posted 9 months ago | 1206 Views | 9 Replies | 9 Total Likes

I have a simple, short plain-text file (*.txt) containing a narrative text written in German, with words, sentences, the usual punctuation, numbers, and also the EUR currency symbol (€) in it. I import it with

Import["test.txt", "Text"]

, and then I notice that the resulting string lacks all instances of the Euro symbol. Apparently they got replaced by a space char, in other words: deleted!, blanked!, swallowed!, skipped!, omitted!, not imported!, not properly imported!, gone!, overridden!, etc.

FYI the € symbol is a standard key on any German-layout keyboard: it is typed by pressing <Alt Gr>+<E>. It is also allowed as a character in file names (Windows OS), and it cannot be regarded as a special char such as: <>|":/?*\ The currency symbol is part of the keyboard alphabet, if you will! At least in Europe.

Maybe there is a normal explanation and a workaround for the observed behavior, e.g. an option to Import[] or general settings/preferences. Or maybe it could be considered a bug? Because, interestingly, when I import the text file with

Import["test.txt", "HTML"]

, the resulting string does contain all instances of the € symbol, as expected. By reporting this observation I am glad to have helped raise public awareness and make those responsible improve the software :-P

9 Replies

What about:

Import["test.txt", "Plaintext"]

?

Hello Mr Huisman, thanks for the suggestion. I've tried your code, same (neg) result. The test.txt test file is attached to the OP. If you guys can import the file as Text or Plaintext with the EUR symbol not getting swallowed, then it must be a problem with my computer system. In such a case, never mind.

Your test.txt encodes the offending symbol as the single byte octal 200 (decimal 128), which is ambiguous in general: different encodings map that byte to different characters. I suspect that the reason your HTML import works is that the HTML importer handles the encoding differently.

I suggest you try:

Import["test.txt", CharacterEncoding -> "ISOLatin1"]
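To make the ambiguity concrete, here is a minimal sketch of how the same character maps to different bytes depending on the encoding. It assumes the encoding names "WindowsANSI" and "UTF8" as they appear in $CharacterEncodings; the values in the comments are what those encodings define for the Euro sign.

```wolfram
(* Unicode code point of the Euro sign *)
ToCharacterCode["\[Euro]"]                  (* {8364} *)

(* The same character expressed as bytes under two encodings *)
ToCharacterCode["\[Euro]", "WindowsANSI"]   (* {128}, i.e. octal 200 *)
ToCharacterCode["\[Euro]", "UTF8"]          (* {226, 130, 172} *)

(* Going the other way: what byte 128 means depends on the encoding *)
FromCharacterCode[128, "WindowsANSI"]       (* "\[Euro]" *)
```

A lone byte 128 is therefore a Euro sign only if the reader assumes a Windows code page; read as UTF-8, it is not a valid character at all.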

Hello Mr Doty, thanks for the suggestion. I've tried your code, same (neg) result on my systems. On my raspi the notebook (v11.2) displays a tiny box with a cross in it (a crossed checkbox, so to speak) instead of the € char, no matter which code variant from this thread I try.

Counter question: Are you guys able to import the text file (see OP attachment) with the € symbol being displayed/imported in the string in the Mathematica notebook on your system? If yes, which code did you use? If no, then it confirms my observation, the topic of this thread.

I see the €€€ symbol from your file using either the default (UTF-8) or ISOLatin1 encodings. Mathematica 11.3 on Mac OS X.

Posted 9 months ago

While some WL functions depend on the computer system by using $CharacterEncoding (WindowsANSI on Windows, UTF-8 on *NIX), Import[.., "Text"] is not one of them. For text import in general these days, UTF-8 is the most common default on all platforms, and it is what WL uses for Text import by default.

Posted 9 months ago

And in the future, the WL will be able to automatically detect the encoding when Importing Text.

Ah, ok! On the raspi I've re-saved the text file with Leafpad using the character encoding UTF-8. Now when I import the modified file on the raspi, the imported string displays the EUR currency symbol, as desired. Lesson learned: a text file is not just a text file, it comes with a character encoding. So, depending on the "text file settings" with which it is saved to disk, the file will actually contain different bytes. Mathematica seems to have no problems importing text files which the creator saved/encoded in UTF-8.

That's a good piece of information to know. Especially when distributing TXT files on the WWW or when importing TXT files in Mathematica (or other software applications), users should be aware of the file's character encoding and, if necessary, convert (i.e. re-save) the file to UTF-8. Modern text editors allow the user to do such a conversion.
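The same conversion can probably be done from within Mathematica instead of a text editor. A minimal sketch, assuming the file really is in a Windows code page and using test-utf8.txt as a made-up output name:

```wolfram
(* Read the file, telling Import which encoding the bytes are in *)
text = Import["test.txt", "Text", CharacterEncoding -> "WindowsANSI"];

(* Write it back out as UTF-8 under a new (hypothetical) file name *)
Export["test-utf8.txt", text, "Text", CharacterEncoding -> "UTF8"];

(* The re-encoded file should now import with the default settings *)
Import["test-utf8.txt", "Text"]
```

This keeps the original file untouched, which helps if the guess about the source encoding turns out to be wrong.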

Well, that closes the lesson for today. Federer playing 2nd round tomorrow, sweet! Thanks everyone!!

Posted 9 months ago

As others have suggested, and your last comment has indicated, there's something simply going on with the character encoding. What I found a bit amusing was that the last line of the file was:

This TEXT file has been saved with notepad.exe Encoding ANSI.

And sure enough, the original file imported correctly using:

Import["~/Downloads/test.txt", CharacterEncoding -> "WindowsANSI"]

If I hadn't been given that tip, I would have done something like this to start debugging:

Find an appropriate range of bytes from the file which contains the character in question. I could import using the "String" format, which is a binary format that applies no encoding, and get a byte string first:

In[12]:= str = StringTake[Import["~/Downloads/test.txt", "String"], 175 ;; 185]
Out[12]= "mbol (€) in"

It doesn't even look like the website is parsing the glyph I'm seeing above, but it's clearly not the EUR symbol... Now that I have the bytes, let's get their character codes and see what interpreting those bytes in different encodings looks like. With this I could visually inspect all results, but since I know which character I want to find, I can speed this process up by keeping only the results which contain the expected character:

In[16]:= 
Select[
    AssociationMap[FromCharacterCode[ToCharacterCode[str], #] &, $CharacterEncodings],
    StringContainsQ["\[Euro]"]
]

During evaluation of In[18]:= $CharacterEncoding::utf8: The byte sequence {128} could not be interpreted as a character in the UTF-8 character encoding.

During evaluation of In[18]:= $CharacterEncoding::utf8: The byte sequence {128} could not be interpreted as a character in the UTF-8 character encoding.

Out[18]= <|"WindowsANSI" -> "mbol (\[Euro]) in", 
 "WindowsBaltic" -> "mbol (\[Euro]) in", 
 "WindowsEastEurope" -> "mbol (\[Euro]) in", 
 "WindowsGreek" -> "mbol (\[Euro]) in", 
 "WindowsThai" -> "mbol (\[Euro]) in", 
 "WindowsTurkish" -> "mbol (\[Euro]) in"|>

Those messages are pretty useful: the data is certainly not UTF-8.
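For reuse, the search above could be wrapped in a small helper. This is only a sketch: candidateEncodings is a made-up name, and Quiet is used to suppress the decoding messages, which you may prefer to keep while debugging.

```wolfram
(* Hypothetical helper: names of all encodings under which the given
   byte values decode to a string containing the expected text *)
candidateEncodings[bytes_List, expected_String] :=
  Keys @ Select[
    AssociationMap[Quiet @ FromCharacterCode[bytes, #] &, $CharacterEncodings],
    StringQ[#] && StringContainsQ[#, expected] &
  ]

(* Bytes for "(", 128, ")" as found in the file; which encodings yield a Euro sign? *)
candidateEncodings[{40, 128, 41}, "\[Euro]"]
```

Going by the output above, the result should include "WindowsANSI" along with the other Windows code pages that place the Euro sign at byte 128.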
