Encoding/Decoding text in Python

Hello All,

So I’ve got Twitter Tools working now that SSL is built in, which is fun, and I’m currently writing a script that searches Twitter and returns 20 recent tweets. The problem is converting the data that comes back.

First off the code:

import sys

sys.path.append("C:/Python32/Lib/site-packages")
sys.path.append("C:/Python32/Lib/site-packages/twitter-1.10.0-py3.2.egg")

from twitter import *

consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

auth = OAuth(access_key, access_secret, consumer_key, consumer_secret)
t = Twitter(auth = auth)

search = t.search.tweets(q="Andy Murray",result_type='recent', count=20)

d = op('tweets')

for statuses in search['statuses']:
    for key, value in statuses.items():
        if(key == "text"):
            utf8val = value.encode('utf-8')
            d.appendRow([utf8val])
            #print(utf8val)

So this copies all of the tweets into the table named tweets, but each row shows up as a bytes literal, like this:

b'Celebs Watch Andy Murray Win Wimbledon 2013 http://t.co/8F3JIG2hq4'

Yet when I leave it as a string I get this error in the Table DAT:

<UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026' in position 139: ordinal not in range(256)>

and this in the textport:

td.error: Str value expected.

Does TouchDesigner’s string type only support certain characters? Python seems to think the returned data is a string, yet Touch refuses to accept certain characters into DATs. The UTF-8 bytes form does work, and I can use DATs to strip out the extra fluff, but it’d be much cleaner to send the plain text straight from the script. Do I need to write a script that manually removes the invalid characters before sending this to the table, or is there a particular way I can encode it?

It’s worth mentioning that encoding to ASCII still leaves the b' prefix and ' suffix, and encoding to latin-1 still throws the error unless I set it to ignore, in which case the b' prefix shows back up.
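
For anyone wondering where the b' prefix comes from, here is a minimal sketch in plain Python (no TouchDesigner calls). It assumes the table simply stringifies whatever it is handed, which I have not confirmed; the sample text is made up for illustration.

```python
# str.encode() always returns a bytes object, whatever codec is used.
text = "Murray wins Wimbledon\u2026"   # hypothetical tweet text ending in an ellipsis
encoded = text.encode("utf-8")

# Converting bytes to str without a codec gives its repr, b'...' wrapper included.
as_shown = str(encoded)
print(as_shown)

# Decoding is what turns bytes back into real text.
round_tripped = encoded.decode("utf-8")
print(round_tripped == text)
```

So any encode() call on its own will hand back bytes, and the prefix only goes away once the value is decoded to a str again.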

Hi Ennui.
When I try:

a = b'Celebs Watch Andy Murray Win Wimbledon 2013 http://t.co/8F3JIG2hq4'
c = a.decode("utf-8")
op('table1').appendRow(c)

It behaves as expected.
Also internally, we try to use ISO-8859-1 for our DAT encoding/decoding.
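
If it helps narrow things down, here is a rough sketch for finding which characters in a string fall outside ISO-8859-1. The helper name latin1_offenders is my own, not part of TouchDesigner, and it relies only on the fact that Python's iso-8859-1 codec covers code points 0 to 255.

```python
def latin1_offenders(text):
    """Return (position, character) pairs that ISO-8859-1 cannot encode."""
    # ISO-8859-1 maps exactly the code points 0x00-0xFF, so anything above fails.
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFF]

# Hypothetical sample: the accented e is fine, the ellipsis (U+2026) is not.
sample = "caf\u00e9 \u2026"
offenders = latin1_offenders(sample)
print(offenders)
```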

Does that help?
Rob.

It works provided there aren’t any illegal characters, but if there are it will error (or just not bring in the data if you set the ignore parameter).

So if I go to pull in 20 tweets and the 5th has an illegal character, it only creates 4 rows.

Can you send an example of a tweet with an illegal character, or the Python bytes output?

Aha, it was the quotation character causing problems. I managed to get tweets into the system without errors with the following:

encoded = value.encode('iso-8859-1','ignore')
decoded = encoded.decode('iso-8859-1')
d.appendRow(decoded)

It seems the textport has no problem printing everything, but if you append into a DAT row it will throw an error unless you manually set the encoding to ignore invalid characters.

So you have to encode and then decode again to make sure the DAT can handle it.

Dunno if this helps you guys out in any way with the encoding stuff?
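
The round trip can be wrapped up in a small helper. This is only a sketch in plain Python: the name to_latin1_safe and the sample text are mine, and the 'replace' option is an alternative I am suggesting, not something the DAT requires.

```python
def to_latin1_safe(text, errors="ignore"):
    """Drop or replace characters that ISO-8859-1 cannot represent."""
    # Encode with an error handler, then decode so the result is a str again.
    return text.encode("iso-8859-1", errors).decode("iso-8859-1")

# Hypothetical tweet text with an ellipsis and curly quotes.
sample = "Murray wins Wimbledon\u2026 \u201cat last\u201d"

dropped = to_latin1_safe(sample)              # unsupported characters removed
marked = to_latin1_safe(sample, "replace")    # unsupported characters become '?'
print(dropped)
print(marked)
```

With 'ignore' the offending characters silently disappear, while 'replace' leaves a ? in their place, which makes it easier to see that something was lost.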

Can you supply the exact contents of ‘value’ so we can reproduce it here and possibly streamline the process?

emailing you my file now

Thanks for the example.
I’ve changed the code to include the ‘ignore’ keyword from your example, so it no longer errors out.
That means you can avoid the encode/decode hack in future builds.
cheers,
Rob.