Hi,

in data analysis, the visualization workflow usually is

1. load data into a (time)table T
2. process/transform the data
3. visualize the data

Step 3 usually involves filtering, grouping and indexing directly into T to avoid copies of large data sets.

I'm looking for general concepts for step 3 that are flexible, easy to understand and sufficiently performant.

A simple example: a table with 2 data variables and 1 categorical variable that is used for grouping. I'd like to plot each variable in its own subplot and each group as a single line plot, i.e. a figure with two subplots and 3 line plots per subplot (since there are 3 categories). You can find the whole script attached!

% T =
%
%   3×3 timetable
%
%             Time            cat      dat1       dat2
%     ____________________    ___    ________    _______
%
%     01-Jan-2021 12:00:00     2     -0.54518     -5.995
%     01-Jan-2021 12:00:00     3      0.37835     18.321
%     01-Jan-2021 12:00:00     1     -0.32751    -10.042

The script/class should be as readable as possible while maintaining performance. I dislike traditional for loops because they are error prone: one always has to know which array is being iterated through. I therefore prefer a "for each" that iterates directly through an array/list instead of using an integer index. But this does not work as soon as two lists have to be iterated through at the same time. In this example, the array of subplot handles and the list of variables to be plotted have the same size and are iterated through together, so there is a connection between them.

In my opinion, some kind of nested double loop is needed here: one has to iterate through the subplots anyway, and the plotting cannot be vectorized because the groups do not all have the same number of elements, so the data cannot be reshaped into a matrix.
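One base-MATLAB way to walk two lists in lockstep without an extra toolbox relies on the fact that a for loop iterates over the columns of its array: stacking the lists as rows of a cell array yields one pair per iteration. The following is only a minimal sketch; the names names, scales and visited and their values are made up for illustration and are not part of the attached script.

```matlab
names  = ["dat1", "dat2"];   % hypothetical list 1
scales = [1, 10];            % hypothetical list 2, same length as names

visited = strings(1, 0);
% 2-by-N cell array: each loop iteration receives one 2-by-1 column
for pair = [num2cell(names); num2cell(scales)]
    [name, scale] = pair{:};             % unpack the pair
    visited(end+1) = name + "*" + scale; %#ok<SAGROW>
end
disp(visited)
```

The pairing is visible in the loop header, which is the property the for-each style is after, at the cost of building a small temporary cell array.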

Let's start with the example, create data:

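% testSize (number of rows) is not defined in this excerpt; see the attached script
d = datetime("2021-01-01 12:00");
t = linspace(d, d+hours(1), testSize);
cat = categorical(randi(3, size(t))); % note: this variable name shadows the built-in function cat
dat1 = randn(size(t));
dat2 = 10*randn(size(t));
T = timetable(t', cat', dat1', dat2', 'VariableNames',["cat", "dat1", "dat2"]);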

Approach 1: manual grouping and plotting with simple for loops over integer indices. The user has to know which variables belong together, i.e. which ones need to be iterated through simultaneously (e.g. plotVars and haxes).

function naiveForLoopPlotTest1(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end

    % infer number of plots & subplots
    numPlotVars = length(plotVars); % #subplots
    grps = unique(T.cat); % should be fast due to categorical
    numGrps = length(grps); % #lines per subplot

    % we want/need to keep the handles for further processing
    haxes = gobjects(1, numPlotVars); % subplot handles (gobjects(N) alone would create an N-by-N array)
    hlines = gobjects(numPlotVars, numGrps); % line handles per subplot

    % create graphic objects, just called once
    figure
    tiledlayout('flow')
    for iPlotVar = 1:numPlotVars
        haxes(iPlotVar) = nexttile();
        hold(haxes(iPlotVar),'on');
        for iGrp = 1:numGrps
            hlines(iPlotVar, iGrp) = plot(haxes(iPlotVar), NaT, NaN); % empty placeholder line
        end
    end

    % plotting, could be called multiple times in real program - I hate it, it's so verbose
    % and one needs to keep track of the indices
    for iPlotVar = 1:numPlotVars
        plotVar = plotVars(iPlotVar); % current var to plot
        for iGrp = 1:numGrps % nested loop...
            grp = grps(iGrp); % current group
            idx = T.cat == grp; % find index -> filter
            TT = T(idx, :); % filtered (grouped) rows
            set(hlines(iPlotVar, iGrp),'XData', TT.Properties.RowTimes, 'YData', TT.(plotVar)); % update graphic data
        end
    end
end
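In Approach 1 the comparison T.cat == grp is repeated for every plotted variable, although group membership does not depend on the variable. A possible variation, shown here only as a sketch on a small made-up timetable, computes the group numbers once with findgroups and builds one logical mask per group that the loops can reuse. The data values, the variable name masks and the printed summary are assumptions for illustration; the mask matrix costs numRows-by-numGrps logical values, which may or may not be acceptable for very large tables.

```matlab
% Small made-up timetable with the same layout as T in the question
t    = datetime(2021,1,1,12,0,0) + minutes(0:5)';
grp  = categorical([1 2 3 1 2 3]');
dat1 = (1:6)';
dat2 = 10*(1:6)';
T = timetable(t, grp, dat1, dat2, 'VariableNames', ["cat", "dat1", "dat2"]);

[G, grps] = findgroups(T.cat);   % G: group number per row, grps: categories present
masks = G == (1:numel(grps));    % numRows-by-numGrps logical, computed once

plotVars = ["dat1", "dat2"];
for iVar = 1:numel(plotVars)
    y = T.(plotVars(iVar));
    for iGrp = 1:numel(grps)
        rows = masks(:, iGrp);   % reuse the mask instead of comparing T.cat again
        fprintf("%s, group %s: %d rows, sum = %g\n", ...
            plotVars(iVar), string(grps(iGrp)), nnz(rows), sum(y(rows)));
    end
end
```

In the plotting function, the fprintf call would be replaced by the set(...) call on the corresponding line handle, using T.Properties.RowTimes(rows) and y(rows).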

Approach 2: Reduce the number of numXXX variables and iterate through the elements directly instead of through integer indices. Drawback: this needs the for-each toolbox (with an awful license): https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each . arrayfun is used for handle initialization.

function maybeBetterForLoopPlotTest(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end

    % NO LONGER NEEDED: infer number of plots & subplots
    % numPlotVars = length(plotVars); % #subplots
    grps = unique(T.cat); % should be fast due to categorical
    % numGrps = length(grps); % #lines per subplot

    % combine / make graphic object creation shorter (?!)
    % same function as in naiveForLoopPlotTest1
    figure
    tiledlayout('flow')
    haxes = arrayfun(@(~)nexttile, plotVars); % subplots
    arrayfun(@(hax)hold(hax,'on'), haxes); % hold on
    hlines = arrayfun(@(hax)arrayfun(@(~)plot(hax, NaT, NaN), grps), ...
        haxes,'UniformOutput',false); % lines per subplot; this is not nice either, the result is a cell array
    % 1x2 cell array with line handles, each of size numGrps x 1

    % plotting, could be called multiple times in real program
    % use for-each instead of integer indices, see https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each
    % unfortunately, this for-each toolbox's license is way too restrictive!
    % Is there anything like for [hax, plotVar] = [haxes, plotVars] in MATLAB?
    for elem = eachTuple(plotVars, hlines) % advantage here: it's clearly seen that all these variables belong together!
        plotVar = elem{1};
        hlinesSub = elem{2};
        for subElem = eachTuple(grps, hlinesSub)
            grp = subElem{1};
            hline = subElem{2}; % handle class, can therefore be updated here!
            idx = T.cat == grp;
            TT = T(idx, :); % filtered (grouped) rows
            set(hline,'XData', TT.Properties.RowTimes, 'YData', TT.(plotVar)); % update plots
        end
    end
end

Approach 3: Replace the second (nested) for loop with splitapply.

function splitApplyPlotTest(T, plotVars)
    arguments
        T timetable
        plotVars (1,:) string
    end

    % find groups
    [G, grps] = findgroups(T.cat);

    % combine / make graphic object creation shorter (?!)
    % same function as in naiveForLoopPlotTest1
    figure
    tiledlayout('flow')
    haxes = arrayfun(@(~)nexttile, plotVars); % subplots
    arrayfun(@(hax)hold(hax,'on'), haxes); % hold on
    hlines = arrayfun(@(hax)arrayfun(@(n)plot(hax, NaT, NaN), grps), ...
        haxes,'UniformOutput',false); % lines per subplot; this is not nice either, the result is a cell array
    % 1x2 cell array with line handles, each of size numGrps x 1

    % plotting, could be called multiple times in real program
    % use for-each instead of integer indices, see https://www.mathworks.com/matlabcentral/fileexchange/48729-for-each
    % unfortunately, this for-each toolbox's license is way too restrictive!
    % Is there anything like for [hax, plotVar] = [haxes, plotVars] in MATLAB?
    for elem = eachTuple(plotVars, hlines) % advantage here: it's clearly visible that all these variables belong together!
        plotVar = elem{1};
        hlinesSub = elem{2};
        hline = hlinesSub(G); % this is ugly but needed - creation of a big handle array

        % remove 2nd for loop - is this clearer?
        % the problem here is: how to supply the hlinesSub handle array without
        % making it huge? All data variables must have the same number of rows in
        % splitapply.
        splitapply(@(h, t, dat) ...
            set(h(1),'XData', t, 'YData', dat), ...
            hline, T.Properties.RowTimes, T.(plotVar), G); % update plots
    end
end
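One possible way around expanding the handle array to the size of G, sketched here under assumptions and not taken from the attached script: let splitapply only collect the times and values of each group into cells, and then hand all groups to the line handles in a single set call. set accepts a 1-by-n cell array of property names together with an m-by-n cell array of values, where m is the number of handles. The small timetable, the invisible figure and the names tCells and yCells are made up for illustration.

```matlab
% Small made-up timetable with the same layout as T in the question
t    = datetime(2021,1,1,12,0,0) + minutes(0:5)';
grp  = categorical([1 2 3 1 2 3]');
dat1 = (1:6)';
T = timetable(t, grp, dat1, 'VariableNames', ["cat", "dat1"]);

[G, grps] = findgroups(T.cat);

% Collect times and values per group: one cell per group (numGrps-by-1)
tCells = splitapply(@(x) {x}, T.Properties.RowTimes, G);
yCells = splitapply(@(x) {x}, T.dat1, G);

% One placeholder line per group (invisible figure, only for this sketch)
fig = figure('Visible', 'off');
ax  = axes(fig);
hold(ax, 'on');
hlinesSub = gobjects(numel(grps), 1);
for k = 1:numel(grps)
    hlinesSub(k) = plot(ax, NaT, NaN);
end

% Row k of the value cell array is applied to hlinesSub(k)
set(hlinesSub, {'XData', 'YData'}, [tCells, yCells]);
```

With this layout the handle array keeps its natural size of numGrps-by-1; whether it is actually faster or leaner than the loop versions would need to be measured with the same benchmark.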

Benchmark results:

% bench =
%
%   3×4 table
%
%                                   TimeTimeit    TimeProfiler    TotalMemoryMb    PeakMemoryMb
%                                   __________    ____________    _____________    ____________
%
%     naiveForLoopPlotTest1          0.11933        0.17104           3962.1          65.221
%     maybeBetterForLoopPlotTest     0.10815        0.11897           141.47          41.275
%     splitApplyPlotTest             0.17568        0.18233            11215          185.27

This is very interesting: splitapply seems to need much more time and memory. That makes sense, especially because the hline array must be expanded (as with repmat) to the size of the group array G, which seems to be complete nonsense.

Interestingly, maybeBetterForLoopPlotTest is faster and needs less memory than the other solutions.

What do you prefer? Are there other ways to structure the code, or functions I'm not aware of yet? I guess this problem comes up every day in data analysis.

I'm looking forward to your suggestions, thank you very much!

Gaurav Garg
on 22 Feb 2021

Hi Jan,

The benchmark results you have tabulated seem correct and make sense for the case study you describe.

In addition to the approaches you have already mentioned, gscatter is a function that plots a classification data set by group. Although it isn't a general approach, gscatter and similar functions (such as scatter, grpstats and gplotmatrix) are worth being aware of.
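For reference, a minimal gscatter call might look like the sketch below. It assumes the Statistics and Machine Learning Toolbox is installed, uses made-up data with the variable names from the question, and draws a scatter plot of dat1 against dat2 rather than line plots over time, so it is not a drop-in replacement for the functions above.

```matlab
% Made-up data; three groups with ten points each
rng(0);
grp  = categorical(repmat([1; 2; 3], 10, 1));
dat1 = randn(30, 1);
dat2 = 10*randn(30, 1);

fig = figure('Visible', 'off');
h = gscatter(dat1, dat2, grp);   % requires Statistics and Machine Learning Toolbox
xlabel('dat1');
ylabel('dat2');
```

gscatter returns one line object per group, so the handles in h can be styled or updated afterwards in the same way as the hlines arrays in the question.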
