最近正在进行ETL后台系统数据的日志分析,查看运行耗时长的TASK,并找出耗时长的JOB,进行逻辑层面和数据库层面的优化.本文仅从数据库层面上的优化着手(包括SQL语句的调整以及greenplum table dk的调整).查看一个耗时30分钟左右的JOB,找到相应的源表,进行如下分析:
dw=#select
gp_segment_id,
count
(
*
)
from
tb_name
group
by
gp_segment_id
order
by
count
(
*
)
desc
gp_segment_id count
----------------------
65 16655
说明:gp_segment_id是greenplum table里面的一个隐藏列,用来标记该行属于哪个节点.由此可见,该表只分布在一个节点65上(节点信息请查看gp_segment_configuration),而我的gp总共有96个节点,这显然没有利用到gp多节点运算能力,该表的DK值设置的有问题.因此,使用alter table tb_name set distributed by (col1,...)对表的DK值进行重新设置.然后重新运行上面的语句,一方面观察节点数(是否每个节点都分布了),另一方面观察节点的条数(分布是否平衡)。在上述二项观察指标大致满足要求后,请vacuum full、vacuum analyze一样,彻底回收空间+收集统计信息。把耗时长JOB的源表抓出来,逐个分析,整个TASK的执行时长从3小时缩短到2小时左右(主要是之前表设计的太差,才导致有这么大的优化空间),后期就是对逻辑以及SQL的优化,以及提高并发度,这才是王道。
为了统计分析方便,设计了如下二张表和若干function,用来收集表的分布情况,并发现哪些表需要进行重新调整DK值。
--
二张表
CREATE
TABLE
"
public
"."table_segment_statistics" (
"table_name"
varchar
(
200
)
DEFAULT
NULL
,
"segment_count" int4
DEFAULT
NULL
,
"table_rows" int8
DEFAULT
NULL
);
CREATE
TABLE
"
public
"."table_segment_statistics_balance" (
"table_name"
varchar
(
200
)
DEFAULT
NULL
,
"segment_id" int4
DEFAULT
NULL
,
"segment_count" int8
DEFAULT
NULL
);
--
function
CREATE
OR
REPLACE
FUNCTION
"
public
"."analyze_table_dk_balance"(v_schemaname
varchar
)
RETURNS
"pg_catalog"."int4"
AS
$BODY$
DECLARE
v_tb
varchar
(
200
);
v_cur_tb
cursor
for
select
schemaname
||
'
.
'
||
tablename
from
pg_tables
where
schemaname
<>
'
information_schema
'
and
schemaname
<>
'
pg_catalog
'
and
schemaname
<>
'
gp_toolkit
'
and
tablename
not
like
'
%prt%
'
and
schemaname
=
v_schemaname;
BEGIN
truncate
table
public
.table_segment_statistics;
truncate
table
public
.table_segment_statistics_balance;
open
v_cur_tb;
loop
fetch
v_cur_tb
into
v_tb;
if
not
found
THEN
exit
;
end
if
;
execute
'
insert into public.table_segment_statistics select
'''
||
v_tb
||
'''
as table_name,count(*) as segment_id,sum(num) as table_rows from (select gp_segment_id,count(*) num from
'
||
v_tb
||
'
group by gp_segment_id) t
'
;
execute
'
insert into public.table_segment_statistics_balance select
'''
||
v_tb
||
'''
as table_name,gp_segment_id,count(*) as cnt from
'
||
v_tb
||
'
group by gp_segment_id order by gp_segment_id
'
;
end
loop;
RETURN
0
;
end
;
$BODY$
LANGUAGE
'
plpgsql
'
VOLATILE;
分析的语句如下:
--
96指的是greenplum的节点(我的机器是96个)
select
*
from
public
.table_segment_statistics
where
table_rows
is
not
null
and
segment_count
<
96
and
table_rows
>
10000
order
by
table_rows
desc
;
--
找出比平均值超出10%的节点,这个阀值可以自行调整,另:只统计超过1万行的表,小表没有太大的分析意义
select
a."table_name",b.segment_id,a.table_rows
/
a.segment_count
as
reldk,b.segment_count
from
"
public
".table_segment_statistics a
inner
join
"
public
".table_segment_statistics_balance b
on
a."table_name"
=
b."table_name"
where
a."table_name"
is
not
null
and
a.table_rows
>
10000
and
abs
(a.table_rows
/
a.segment_count
-
b.segment_count)
/
(a.table_rows
/
a.segment_count)
>
0.1

