Hitting bottleneck controlling close to 4,000 LEDs!

lucasm · October 15, 2014, 5:59am

Hey everyone,

I’ve been working on a modular LED video panel system for several months and, despite being close to the end, I’ve hit a rather large roadblock and I’m at a loss as to what to do next…
The setup works great up to maybe 1,000 LEDs, then the slowdown gets quite noticeable.

In Short:

I’m trying to find a way to either optimize sending 11,520 bytes every frame (roughly 690 kbytes a second) over serial, OR use another protocol or approach altogether…
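For anyone checking the numbers, they follow from a bit of arithmetic. A quick sketch, assuming 3 bytes (r, g, b) per LED, 3,840 LEDs, and a 60 fps target (the 60 fps is an assumption; it is the frame rate at which the 690 kbytes figure works out):

```python
# Back-of-the-envelope serial bandwidth for the panel system.
# Assumptions: 3 bytes (r, g, b) per LED and a 60 fps target.
NUM_LEDS = 3840
BYTES_PER_LED = 3
TARGET_FPS = 60

bytes_per_frame = NUM_LEDS * BYTES_PER_LED
bytes_per_second = bytes_per_frame * TARGET_FPS

print(bytes_per_frame)          # 11520
print(bytes_per_second / 1000)  # 691.2 (kbytes per second)
```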

I’ve read through the forums in search of other similar issues and have only found this:
Seeking advice on using TD with display of 7000+ LED Pixels

A bit more detail on the setup:

Hardware:

Windows 8
Teensy 3.1 (arduino based MC with full usb speed serial communication)
OctoWs2811 (high speed led library for the teensy)
Adafruit’s Neopixel strips

My setup:

1) Touch designer instance that manages the UI and animations created. This sends the animation data pre mapped in a table (3 rows, r/g/b, the number of columns equal the number of addressable led’s), out through a Touch Out DAT.

2) Second Touch designer instance receives the table dat, and focuses solely on formatting that table data into several compact byte strings that are sent using python and the serial.sendBytes() command.

A side note here, I have to split the led bytes into chunks or packets of 255 or less due to a built in limit that sendBytes has… Is there a way around this limit? I’ve successfully sent entire byte strings using Processing and it’s considerably faster (35-45 fps).

3) Byte string is received over serial comport by the teensy, at this point it’s premapped and each 3 values are applied to the led’s 1:1 more or less.
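The chunking in step 2 can be sketched in plain Python with list slicing. This is only an illustration: `chunk_packets` is a hypothetical helper, not part of TouchDesigner, and the 255 limit is simply the one described above.

```python
def chunk_packets(values, packet_size=255):
    """Split a flat list of byte values into packets of at most packet_size."""
    # Round the packet size down to a multiple of 3 so that no LED's
    # r/g/b triple is split across two packets.
    step = packet_size - (packet_size % 3)
    return [values[i:i + step] for i in range(0, len(values), step)]

# 3,840 LEDs * 3 bytes each = 11,520 byte values per frame (dummy data here).
frame = [0] * (3840 * 3)
packets = chunk_packets(frame)

print(len(packets))      # 46
print(len(packets[0]))   # 255
print(len(packets[-1]))  # 45
```

Slicing avoids the per-value append calls, which tend to add up when a frame holds more than eleven thousand values.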

Python Code:

[code]
def receive(dat, rowIndex, message, bytes):
	print("asd")
	thisTeensysName = "%s"%(me.name.split("_")[0])
	panelEnabled = op("serialCom_enabler")["%s_enableLED"%(thisTeensysName)]

	if(panelEnabled == 1):

		n = op('%s_dataEnd'%(thisTeensysName))
		serialConnectorName = "%s_serialConnector"%(thisTeensysName)
		serialToggleVal = op("serialCom_enabler")["serialToggle"]

		packetSize = int(255)

		debugCharLength = 0 # when this is set to one, we build our debug array. If it is set to 2, we debug our time.

		if(debugCharLength == 2):
			import time
			millis0 = int(round(time.time() * 1000))

		colorTable_RGB = []
		#debugLengthArray = []
		executeArray = []
		subExArray = []

		numSamps = n.numSamples

		ch_r = n['r']
		ch_g = n['g']
		ch_b = n['b']

		if(debugCharLength == 2):
			millis1 = int(round(time.time() * 1000))
			print("      Initialization Stuff TIME: %i"%(millis1 - millis0))
		for i in range(0, numSamps):
			if((len(subExArray) + 5) < packetSize):
				subExArray.append(int(ch_r[i]))
				subExArray.append(int(ch_g[i]))
				subExArray.append(int(ch_b[i]))
			else:
				executeArray.append(subExArray)
				subExArray = []
				subExArray.append(int(ch_r[i]))
				subExArray.append(int(ch_g[i]))
				subExArray.append(int(ch_b[i]))
		executeArray.append(subExArray)
		if(debugCharLength == 2):
			millis2 = int(round(time.time() * 1000))
			print("      Building Exec Arrays TIME: %i"%(millis2 - millis1))
		if( int(serialToggleVal) == 1 ):
			for subArray in executeArray:
				#print(subArray)
				#print('op("%s").sendBytes(%s)'%(serialConnectorName, str(subArray).strip('[]')))
				exec('op("%s").sendBytes(%s)'%(serialConnectorName, str(subArray).strip('[]')))
		if(debugCharLength == 2):
			millis3 = int(round(time.time() * 1000))
			print("               Serial Send TIME: %i"%(millis3 - millis2))
	return
[/code]

My next course of action is trying out a couple of FadeCandy’s and using 1 per panel instead of 8 panels per teensy.

I’m not even sure this will remove the bottleneck, as just as many bytes will need to be written, just to different devices… but these devices are highly optimized so worth a shot I guess.

Any ideas, thoughts, criticisms are all hugely appreciated! I really want to get this part ironed out.

malcolm · October 15, 2014, 2:40pm

Have you measured the performance without print statements? Those can be quite heavy. We’ll try to take a look at this though.

lucasm · October 15, 2014, 5:11pm

Hi Malcolm,

Yep, I always check both when testing. Even with them off though, writing a lot via serial like this gets the OP’s cook time into the 20, 30, and sometimes even 40 ms range.

lucasm · October 15, 2014, 7:03pm

I just ordered a pixel pusher board as well as a few fade candy’s to try out as I read some really great things about driving lots of led’s with both although I’m leaning heavily towards pixel pusher at the moment.
Anyone have any positive (or not so great) experience with either?

From the research I’ve done~

Pixel Pusher:

Communicates via Ethernet
Can control up to 3,840 pixels per board (480 per each 8 channels which fits my setup exactly) @ 60 hz (this claim is from their website, however they are really against neopixels as being slow so I’d have to do some testing to find out if this refresh rate holds true for neopixels too)
Has support for receiving a TOP’s video stream via the Spout protocol in Processing (TouchDesigner has direct support for this, wooo!). This means the premapped LED data I generate in Touch for the custom panels would have to be remapped to a TOP in such a way that it looks mapped again when applied to the LEDs… Not impossible, but it would take some noodling.
Potentially even better: it has support for Art-Net via a Java Art-Net bridge app that allows Touch to talk to the Pixel Pusher/LEDs without going through Processing (only through the Java bridge app).
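One way to sanity-check the 60 Hz claim for Neopixels is to work out how long the data for one channel takes to clock out. A rough sketch, assuming the commonly quoted WS2811/WS2812 figures of 800 kHz signalling, 24 bits per pixel, and a reset gap of about 50 µs between frames. These values are assumptions taken from typical LED datasheets, not from Pixel Pusher’s documentation:

```python
# Rough upper bound on refresh rate for one WS2811/WS2812 channel.
# Assumed values: 800 kHz data rate, 24 bits per pixel, ~50 us reset gap.
BIT_RATE_HZ = 800_000
BITS_PER_PIXEL = 24
RESET_SECONDS = 50e-6
PIXELS_PER_CHANNEL = 480  # per-channel count quoted above

frame_seconds = PIXELS_PER_CHANNEL * BITS_PER_PIXEL / BIT_RATE_HZ + RESET_SECONDS
max_fps = 1.0 / frame_seconds

print(round(frame_seconds * 1000, 2))  # 14.45 (ms per frame)
print(round(max_fps, 1))               # 69.2
```

Under those assumptions 480 pixels per channel leaves only a little headroom above 60 Hz, even with all 8 channels driven in parallel.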

Fade Candy:

Communicates via USB
Can control up to 512 pixels per board via 8 channels
Would require a Fade Candy microcontroller per panel, and each panel would have to consist of 8 “strips” making up 480 pixels rather than 1 “strip” of 480 pixels. This would void a lot of the work I’ve done, though.
Would have to configure RGB data to go from Touch → Processing (Fade Candy library) → Fade Candy board → LEDs

- I would need 16 USB connections to the computer to control the 16-panel setup at its largest configuration… that might cause a world of problems of its own. Thoughts?

I’m going to keep this thread going with findings and results, as I’ve had trouble finding much info on this topic. If anyone else out there has anything to throw in please do!

michelchrome · October 17, 2014, 4:33pm

Hi,
I’m stuck with a very similar problem, I want to control around 2000 LEDs and I can’t find an efficient way to deal with this huge amount of data.
I’m using 3 Artnet controllers with 6 outputs each, and each output is assigned to a DMX universe. There are about 120 LEDs on each controller output.

My workflow is organized as follows:

[Generate a TOP to represent the ± 120 LEDs of one output (resolution 120x1)]==>[convert the TOP into a CHOP] ==> [reorder the chans (G0 R0 B0 G1 R1 B1 etc.)] ==> [sends chans via Artnet through a DMX out CHOP]
This chain is repeated 18 times (once per output)

I spent a lot of time trying many ways to optimize the workflow, but none of them worked. I inspected the performance monitor to find out which OPs are slow to cook, and it appears that the problem mainly comes from the TOP to CHOPs, the Reorder CHOPs and the DMX Out CHOPs.

I’m a bit desperate; I don’t know how to rethink the process, and even when the fps drops to 15, neither my GPU nor my CPU seems to be overloaded (I checked the CPU/GPU load monitor while running TD).

Any ideas ?

lucasm · October 18, 2014, 2:52am

Yikes,

I can’t speak for the dmx out / artnet speeds as I haven’t been able to get that to work anyways, but you SHOULD be able to get better speeds for the other OP’s…

Right now in my setup, I have a huge Line SOP that represents my LED locations (each point a physical location), and I use that in conjunction with a TOP to CHOP to sample specific coordinates on the TOP.

These values are passed through a few other chops performing some maths, then that chop is passed into a CHOP to TOP generating a very wide, 1 pixel tall image like you have (3,840 x 1)

Going from CHOP to TOP my cook time never goes above .5 ms, for 3 channels at 3,840 samples. and the rest of the network related to converting the data around is equally quick…

You said you are repeating that chain 18 times. I think you might get better speeds if you combine the data for as much of that chain as possible and split it up again at the end.

With CHOPs you can use a Trim to select a certain span of your CHOP. TOPs you can selectively crop, etc. Do this as close to your DMX Outs as possible, as I think a few OPs processing lots of data each are faster than lots of OPs processing a little data each… I think.
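As a rough illustration in plain Python (not TouchDesigner operators), the idea is to do the per-pixel work once on the combined data and only slice it into per-output pieces at the very end. The 18 outputs of 120 LEDs and the G, R, B channel order are taken from the post above; `reorder_grb` and `split_outputs` are hypothetical names made up for this sketch.

```python
def reorder_grb(pixels):
    """Flatten (r, g, b) tuples into one list in G, R, B order."""
    flat = []
    for r, g, b in pixels:
        flat.extend((g, r, b))
    return flat

def split_outputs(flat, leds_per_output):
    """Slice the combined channel list into one chunk per controller output."""
    step = leds_per_output * 3
    return [flat[i:i + step] for i in range(0, len(flat), step)]

# 18 outputs * 120 LEDs, with dummy colour data.
pixels = [(i % 256, 0, 255) for i in range(18 * 120)]
outputs = split_outputs(reorder_grb(pixels), leds_per_output=120)

print(len(outputs))     # 18
print(len(outputs[0]))  # 360 channels, within one 512-channel DMX universe
print(outputs[0][:6])   # [0, 0, 255, 0, 1, 255]
```

The reorder runs once over all 2,160 pixels instead of 18 times over 120 each, which is the same trade being suggested for the operator network.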

Maybe someone else can chime in on this?

lucasm · October 18, 2014, 3:07am

As far as my findings go with my project, I’ve definitely decided on going with Pixel Pusher.

I got mine in the mail today, and it’s really optimized for communicating with large amounts of led’s and works smoothly for the most part.

Good News:

Pixel Pusher’s Processing sketch for Spout (Touch has a Spout Out TOP) works incredibly well… with cook times of around 0.5 ms in TouchDesigner and way above 30 fps in Processing, the whole thing is really robust.
This is good news for scalability…

The Bad News:

According to Jas @ heroicrobotics, Neopixels are extremely slow (comparatively) and perform badly with their device. They claim that the driver is about an order of magnitude slower than anything else they support and that, depending on when you order them, the LEDs will actually behave differently, sometimes not working at all, because of manufacturing differences.

I confirmed this with the 4 panels I’ve built so far: 1 of them bugged out in a totally different way than the other 3, and they were all built within weeks of each other.

Anyways,
To further complicate my own situation at least, the 60 led per meter flavor of the strip that works well (apa102) isn’t widely sold, almost at all, so fun times! I get to redesign quite a lot of 3d printed parts.

I’m sure I’ll have more to report on this soon.

lucasm · October 18, 2014, 9:55am

Well I’m happy to say I’ve made some progress in the frame rate department:

TL/DR
I managed to get my fps @ 25 using Touch’s Spout Out → Processing Spout In → Processing Serial Out.

Here’s a video showing it all working:
25 FPS @ 3,840 Led's - Touch designer, Spout, and Processing! - YouTube

I was fooling around with Pixel Pusher and ended up tearing the Spout DLL and code out of the Processing sketch it had been incorporated into, and repurposed it to receive frames from Touch over Spout but send that data out over serial to the LEDs.

Since processing is able to batch send the entire set of data at once (or at least it appears to from a higher level) the FPS was a solid 25 the whole time… not quite 30, but it looks pretty good to the eye.

That’s with touch and processing moving 3,840 pixels worth of data too. I’ll post some more info on the how’s for those who are interested soon. Going to clean up some code first.

lucasm · October 19, 2014, 10:31am

Hey everyone,

I’ve put together a collection of attachments with all the things you might need to get rolling with a large number of LEDs.

Touch example (barebones)
Processing middle man sketch
Arduino / teensy 3.1 sketch

If you’re driving 3,840 LEDs you can expect roughly 25 fps. The fewer you use, the faster it gets, provided you configure the Processing and Arduino sketches accordingly!

I’m sure these aren’t anywhere near perfect, but it works and it’s quite robust in my experience. Hope it helps!
touch_to_serialArduino_11_octows8211Lib.zip (1.56 KB)
ReceiveFrames_R3_03.zip (34.3 KB)
simpleLedMappingAndDrivingWorkflow.toe (20 KB)

michelchrome · October 19, 2014, 6:40pm

Wooooow,
thanks a lot for sharing this stuff, it helps me so much! The way you are dealing with the LED mapping is so much more effective and smart than mine; I’ll redraw the whole project.
Thanks to your advice I finally managed to drive all the LEDs at 45-50 fps, but I think I could optimize even more. Like in your example, I’m using a Shuffle CHOP to split all the samples in order to send them to the DMX CHOPs, but I must use a Reorder CHOP for each DMX output (18x) to put the channels in the right order (G0 R0 B0 G1 R1 B1…).
It takes 0.35 ms for each Reorder CHOP to cook this task; isn’t that a bit long for such a simple operation?
If I convert the data to a DAT to reorder the channels, it’s much quicker, but then the conversion back to CHOP is soooooooo slow.
I’m wondering about a way to directly split the samples in the right order, or maybe using a CPlusPlus CHOP to write an optimized reorder…
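To put the 0.35 ms in context, here is a quick sketch of how much of a frame the 18 Reorder CHOPs take together. The 0.35 ms and the 18 outputs come from the figures above; the 60 fps target is an assumption.

```python
# Share of a 60 fps frame spent in the Reorder CHOPs.
# 0.35 ms per Reorder CHOP and 18 outputs come from the post above;
# the 60 fps target is an assumption.
REORDER_MS = 0.35
NUM_OUTPUTS = 18
TARGET_FPS = 60

frame_budget_ms = 1000 / TARGET_FPS
reorder_total_ms = REORDER_MS * NUM_OUTPUTS

print(round(frame_budget_ms, 2))   # 16.67
print(round(reorder_total_ms, 2))  # 6.3
print(round(100 * reorder_total_ms / frame_budget_ms))  # 38 (percent)
```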

lucasm · October 19, 2014, 6:57pm

Glad it helped!

Sounds like a lot of splitting and re ordering. If you can isolate the part of your TOE that you’re talking about I could take a look at it. I’m not hugely experienced at optimizing but there might be a few things you could do.

For example, if your rgb data is coming from a single top at some point in your network before splitting, you could try a reorder TOP and switch the channels there, it might be more efficient.

Any step you’re doing multiple times on multiple streams, see if there’s an alternate way of doing it in another OP type further back in your stream when things are merged.

lucasm · October 19, 2014, 7:24pm

FM64,
Actually, do you mind if I ask what your hardware setup is like?

What kind of Artnet controllers are you using? Leds?

michelchrome · October 20, 2014, 9:58am

Hi Lucas,
thanks for your support, here is a very simplified and commented .toe, it’s great if you have the time to take a look at it !
I’m currently using WS2812b LED strips. I know they’re kind of slow and inaccurate, but they do the job.
This is the controller : [url]http://www.electrondes.com/an6u1903_00/an6u1903_00.pdf[/url]
it looks cheap but works incredibly great !
LEDmapper_v2.10.toe (9.03 KB)

rob · October 20, 2014, 3:49pm

Hi Everyone.
Thanks for the great analysis and working solutions.

Some thoughts on the python script:

Lucasm, can you describe your issue with:
"A side note here, I have to split the led bytes into chunks or packets of 255 or less due to a built in limit that sendBytes has… "

Looking at your sample script, I notice you do things like:

exec('op("%s").sendBytes(%s)'%(serialConnectorName, str(subArray).strip('[]')))

Instead of:

op(serialConnectorName).sendBytes(subArray)

Though I suspect: .sendBytes(*subArray) may work in your case.
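To show why the two forms are equivalent, here is a small sketch in plain Python. `send_bytes` is a hypothetical stand-in that just records its arguments; it is not the Serial DAT’s real method, so this only demonstrates the calling convention, not the serial behaviour.

```python
sent = []

def send_bytes(*values):
    """Hypothetical stand-in for sendBytes: records each call's arguments."""
    sent.append(values)

sub_array = [255, 0, 128]

# String-building approach: format the list into source code, then exec it.
exec('send_bytes(%s)' % str(sub_array).strip('[]'))

# Direct approach: unpack the list into separate arguments.
send_bytes(*sub_array)

print(sent[0] == sent[1])  # True
```

Both calls hand over the same three arguments, but the direct form skips converting the list to a string and compiling it on every packet.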

Minimizing the amount of scripting altogether would likely improve performance.
(ie, using a Quantize CHOP to avoid the int() calls, and a shuffle CHOP to get everything in the right format).

It might boil down to:

n = op('some_resulting_chop') #everything interleaved and with shuffle CHOPs
v = n[0].vals #put everything into one float array
op('serial_dat').sendBytes(v)

Also, agreed, FM64, those reorder CHOPs seem unnecessarily slow.
Thanks for the examples, we will try to optimize them further.

We’re also researching the idea of changing the formats accepted by the DMX CHOPs to avoid the interleaving altogether.

Cheers,
Rob.

rob · October 20, 2014, 5:48pm

A couple more points:

The Shuffle CHOP in the latest official build has a relatively recent option (August 11th):
“Sequence All Samples”
Which would take three long channels, (example: r, g, b), and reshuffle it into one
single interleaved channel: (r0, g0, b0, r1, g1, b1, …)
This may be of use in some networks.
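In plain Python terms, the resulting sample order looks roughly like this. It only illustrates the interleaving itself, not how the Shuffle CHOP is implemented:

```python
# Three channels of equal length, one sample per LED (dummy values).
r = [10, 11, 12]
g = [20, 21, 22]
b = [30, 31, 32]

# zip() pairs up sample i of each channel; flattening gives r0, g0, b0, r1, ...
interleaved = [value for sample in zip(r, g, b) for value in sample]

print(interleaved)  # [10, 20, 30, 11, 21, 31, 12, 22, 32]
```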

Also, we’ve just optimized the Reorder CHOP to be 5 to 10 times faster when dealing with a few hundred channels, which is often the case with these LED networks.
That will be in build 25440 or later.

Cheers,
Rob.

michelchrome · October 20, 2014, 10:19pm

It’s awesome Rob, thanks a lot for your responsiveness, can’t wait to see my network run @60FPS

lucasm · October 21, 2014, 5:30am

Rob, thanks a bunch for sharing your thoughts on that part of the send code. I’ll be honest, I was looking to optimize everywhere else but there… Everything you suggested did LOADS of good…

Especially the bits about using a Shuffle CHOP to get the samples ordered correctly.
Building that ordered array via a loop in Python was also eating up a ton of time. The append function was probably being used WAY too much.

My FPS is around the 60 range and solidly sticking. All of it is split between two instances of TouchDesigner without the help of any other programs. Woo!

Not quite sure why I was approaching it the way I was with strings, but using a direct array with that asterisk in front of it did perfectly… and loads faster.
What’s with the asterisk anyways? I noticed it stripped away the commas when sent through a print statement, I’m curious, is that all it does?

All that being said, here’s the new Serial send code, it’s LOADS faster and WAY simpler.
You’re the man Rob:

[code]
def valueChange(channel, sampleIndex, val, prev):

	n = op('topto1') # CHOP to pull rgb data from
	serialConnector = op("serialConnector") # serial DAT

	vals = [int(i) for i in n[0].vals] # returns a list with values converted to ints.
	serialConnector.sendBytes(*vals) # Send data to leds!

	return
[/code]

I’ve attached a new TOE file to reflect the changes and updates.
simpleLedMappingAndDrivingWorkflow.6.toe (9.35 KB)

lucasm · October 21, 2014, 6:10am

FM64,

I took a look at your file, made a few changes and attached it.
I switched the order of operations up a bit and moved a few things to your ledstrips container.

I was able to cut it down by almost half, but there might be a way to squeeze a few more ms out of it. It’s getting by at just over 60 fps though.

I suspect that won’t hold if you have more going on in that file but if you split it off to another touch process I imagine it would hold fine!

You may already be doing this, but when you test fps, open your performance monitor, then go into performance mode, and then hit analyze. Touch’s UI takes a good chunk itself, understandably, and that can make it seem worse than it is.
LEDmapper_v2.12.toe (8.98 KB)

rob · October 21, 2014, 4:33pm

That’s awesome. I’m glad all our bits are making one fast system.
One more thing:

vals = [int(i) for i in n[0].vals] # returns a list with values converted to ints.
serialConnector.sendBytes(*vals) # Send data to leds!

The .sendBytes will cast each entry to int anyways, but to be extra sure, you can use the Limit CHOP to quantize/round a channel to whole numbers, eliminating that line entirely.
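The difference between casting and rounding is easy to see in plain Python. This only shows Python’s own int() and round() on made-up sample values; it says nothing about what sendBytes does internally.

```python
vals = [0.0, 127.6, 254.9, 255.0]

truncated = [int(v) for v in vals]   # int() drops the fractional part
rounded = [round(v) for v in vals]   # round() goes to the nearest whole number

print(truncated)  # [0, 127, 254, 255]
print(rounded)    # [0, 128, 255, 255]
```

If the channel is already quantized to whole numbers upstream, both give the same result and the conversion line becomes redundant.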

As for the Python * operator: it is apparently called the ‘splat’ operator, and it expands a list or tuple into separate arguments in a function call.
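A minimal sketch of what that means in practice; `count_args` is a hypothetical helper written just for this illustration. It also explains the print behaviour noticed earlier: with the asterisk, print receives separate arguments and joins them with spaces, so no brackets or commas appear.

```python
def count_args(*args):
    """Return how many separate positional arguments were received."""
    return len(args)

vals = [255, 0, 128]

print(count_args(vals))   # 1  (the whole list is a single argument)
print(count_args(*vals))  # 3  (each element becomes its own argument)

print(vals)   # [255, 0, 128]
print(*vals)  # 255 0 128
```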

snaut · October 21, 2014, 4:55pm

You can save a bit more by not using a select CHOP but a constant CHOP with a replace CHOP in the dmx_output COMPs.
The expression you are using in the select CHOP can be copied and pasted into the name parameter of the constant CHOP, then use the constant CHOP as the first input to the replace and the select2 CHOP as the second input. This little trick saves you another ~0.18 ms per dmx output…

cheers
Markus