If you're looking for Big Data Greenplum DBA Interview Questions & Answers for Experienced or Freshers, you are in the right place. There are a lot of opportunities from many reputed companies in the world. According to research, Big Data Greenplum DBA has a market share of about 0.3%.
So, you still have the opportunity to move ahead in your career in Big Data Greenplum DBA Administration. Mindmajix offers Advanced Big Data Greenplum DBA Interview Questions 2024 that help you crack your interview and acquire a dream career as a Big Data Greenplum DBA Administrator.
If you want to enrich your career and become a professional in Greenplum DBA, then visit Mindmajix - a global online training platform: "Big Data Greenplum DBA Training". This course will help you achieve excellence in this domain.
Describing the table shows the distribution details. For example, for a sales table:
psql> \d sales
Table "public.sales"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 date   | date    |
Distributed by: (id)
Use "\dn" at the psql prompt.
In 4.x, check pg_stat_operations for all actions performed on any object.
For example, a sales table:
gpdb=# select objname, actionname, statime from pg_stat_operations where objname like 'sales';
 objname | actionname |            statime
---------+------------+-------------------------------
 sales   | CREATE     | 2010-10-01 12:18:41.996244-07
 sales   | ANALYZE    | 2010-10-06 14:38:21.933281-07
 sales   | VACUUM     | 2010-10-06 14:38:33.881245-07
Table Level:
psql> select pg_size_pretty(pg_relation_size('schema.tablename'));
Replace schema.tablename with the table you want to check.
Table and Index:
psql> select pg_size_pretty(pg_total_relation_size('schema.tablename'));
Replace schema.tablename with the table you want to check.
Schema Level:
psql> select schemaname, round(sum(pg_total_relation_size(schemaname||'.'||tablename))/1024/1024) "Size_MB"
from pg_tables where schemaname='SCHEMANAME' group by 1;
To see the size of a specific database:
psql> select pg_size_pretty(pg_database_size('DATABASE_NAME'));
Example:
gpdb=# select pg_size_pretty(pg_database_size('gpdb'));
 pg_size_pretty
----------------
 24 MB
(1 row)
To see all database sizes:
psql> select datname, pg_size_pretty(pg_database_size(datname)) from pg_database;
Table size with partitions:
The following SQL gives you the employee_daily table size, including its partitions:
select schemaname, tablename, round(sum(pg_total_relation_size(schemaname || '.' || partitiontablename))/1024/1024) "MB"
from pg_partitions where tablename='employee_daily' group by 1,2;
 schemaname |   tablename    | MB
------------+----------------+-----
 public     | employee_daily | 254
In a psql session, type \h ALTER TABLE, which will display the syntax:
gpdb=# \h ALTER TABLE
From the master host:
PGOPTIONS='-c gp_session_role=utility' psql -p <port> -h <hostname>
Where:
port is the segment/master database port.
hostname is the segment/master hostname.
Master: The master gpdb log file is located in the $MASTER_DATA_DIRECTORY/pg_log/ directory, and the file name depends on the database "log_filename" parameter.
1. $MASTER_DATA_DIRECTORY/pg_log/gpdb-yyyy-mm-dd_000000.csv --> Log file format with the default installation.
2. ~gpadmin/gpAdminLogs/ --> gpstart, gpstop, gpstate, and other utility logs.
Segments:
1. For primary segments, run the SQL below to see the log file location:
select dbid, hostname, datadir||'/pg_log' from gp_configuration where content not in (-1) and isprimary is true;
2. For mirror segments, run the SQL below to see the log file location:
select dbid, hostname, datadir||'/pg_log' from gp_configuration where content not in (-1) and isprimary is false;
\df schemaname.functionname (schemaname and functionname support wildcard characters)
test=# \df pub*.*test*
List of functions
 Schema |  Name   | Result data type | Argument data types
--------+---------+------------------+---------------------
 public | bugtest | integer          |
 public | test    | boolean          | integer
 public | test    | void             |
(3 rows)
gpstate is the utility to check gpdb status.
Use gpstate -Q to show a quick status. Refer to gpstate --help for more options.
Sample output:
[gpadmin@stinger2]/export/home/gpadmin>gpstate -Q
gpadmin-[INFO]:-Obtaining GPDB array type, [Brief], please wait…
gpadmin-[INFO]:-Obtaining GPDB array type, [Brief], please wait…
gpadmin-[INFO]:-Quick Greenplum database status from Master instance only
gpadmin-[INFO]:———————————————————-
gpadmin-[INFO]:-GPDB fault action value = readonly
gpadmin-[INFO]:-Valid count in status view = 4
gpadmin-[INFO]:-Invalid count in status view = 0
gpadmin-[INFO]:———————————————————-
There are two ways to create a gpdb database: using a psql session or the Greenplum createdb utility.
Using a psql session:
gpdb=# \h CREATE DATABASE
Command: CREATE DATABASE
Description: create a new database
Syntax:
CREATE DATABASE name
[ [ WITH ] [ OWNER [=] dbowner ]
[ TEMPLATE [=] template ]
[ ENCODING [=] encoding ]
[ TABLESPACE [=] tablespace ]
[ CONNECTION LIMIT [=] connlimit ] ]
Using the createdb utility:
Run $GPHOME/bin/createdb --help for usage:
createdb [OPTION]... [DBNAME] [DESCRIPTION]
Options:
-D, --tablespace=TABLESPACE default tablespace for the database
-e, --echo show the commands being sent to the server
-E, --encoding=ENCODING encoding for the database
-O, --owner=OWNER database user to own the new database
-T, --template=TEMPLATE template database to copy
--help show this help, then exit
--version output version information, then exit
gpdb=# \l (backslash with lowercase letter "l")
List of databases
   Name    |  Owner  | Encoding
-----------+---------+----------
 gpdb      | gpadmin | UTF8
 gpperfmon | gpadmin | UTF8
 postgres  | gpadmin | UTF8
 template0 | gpadmin | UTF8
 template1 | gpadmin | UTF8
Check the SQL below for more details on the databases.
gpdb=# select * from pg_database;
gpdb=# \h DROP DATABASE
Command: DROP DATABASE
Description: remove a database
Syntax:DROP DATABASE [ IF EXISTS ] name
Also check the dropdb utility:
$GPHOME/bin/dropdb --help
dropdb removes a PostgreSQL database.
Usage:
dropdb [OPTION]… DBNAME
In a psql session:
"\?" - for all psql session help.
"\h <command>" - for any SQL syntax help.
Check the gpstart log file in ~gpadmin/gpAdminLogs/gpstart_yyyymmdd.log
Also take a look at the startup log file for more details:
$MASTER_DATA_DIRECTORY/pg_log/startup.log
The gpstart -m command allows you to start the master only and none of the data segments; it is used primarily by support to get system-level information and configuration. An end user would not normally use it.
There are no utilities available to remove mirrors from Greenplum. You need to make sure all primary segments are healthy; then you can remove the mirror configuration from gp_configuration in 3.x.
The gpcheckcat tool is used to check catalog inconsistencies between master and segments. It can be found in the $GPHOME/bin/lib directory:
Usage: gpcheckcat
[dbname]
-?
-B parallel: number of worker threads
-g dir : generate SQL to rectify catalog corruption, put it in dir
-h host : DB hostname
-p port : DB port number
-P passwd : DB password
-o : check OID consistency
-U uname : DB User Name
-v : verbose
Example:
gpcheckcat gpdb >gpcheckcat_gpdb_logfile.log
The gpdetective utility collects information from a running Greenplum Database system and creates a bzip2-compressed tar output file. This output file helps with the diagnosis of Greenplum Database errors or system failures. For more details, check the help:
gpdetective --help
To remove the currently configured standby master host from your Greenplum Database system, run the following command on the master only:
# gpinitstandby -r
Use this option if you already have a standby master configured, and just want to resynchronize the data between the primary and backup master host. The Greenplum system catalog tables will not be updated.
# gpinitstandby -n (resynchronize)
Use the gprecoverseg tool, which will recognize which segments need recovery and will initialize recovery.
The gpaddmirrors utility configures mirror segment instances for an existing Greenplum Database system that was initially configured with primary segment instances only.
For more details, check the help:
# gpaddmirrors --help
The following query on the database catalog lists the configuration ordered by content ID, from which you can identify the primary and mirror segment for each content:
gpdb=# select * from gp_configuration order by content;
Note: starting from GPDB 4.x, the gp_segment_configuration table is used instead.
gpdb=# select * from gp_segment_configuration order by dbid;
Admin mode:
gpstart with the -R option starts the database in admin (restricted) mode, where only superusers can connect to the database.
Utility mode:
Utility mode allows you to connect only to individual segments when the database is started using gpstart -m. For example, to connect to the master instance only:
PGOPTIONS='-c gp_session_role=utility' psql
Create a directory that has free space and exists on all hosts.
For a network I/O test on each NIC card:
gpcheckperf -f seg_host_file_nic-1 -r N -d /data/gpcheckperf > seg_host_file_nic_1.out
gpcheckperf -f seg_host_file_nic-2 -r N -d /data/gpcheckperf > seg_host_file_nic_2.out
For disk I/O:
gpcheckperf -f seg_host_file_nic-1 -r ds -D -d /data/gpdb_p1 -d /data/gpdb_p2 -d /data/gpdb_m1 -d /data/gpdb_m2
In the GP 4.0 version, use the gpconfig utility to change postgresql.conf parameters.
In the 3.x version, manually change the parameters in postgresql.conf; for more details, check the Greenplum Administrator's Guide.
The pg_hba.conf file of the master instance controls client access and authentication to your Greenplum system. Check Greenplum Administrator’s Guide for instructions to add/change contents of this file.
Use the createuser utility to create users. See createuser --help for more details.
You can also use SQL commands at the psql prompt to create users.
For example: CREATE USER ... or CREATE ROLE ...
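A minimal sketch of the SQL route, assuming a hypothetical role name and password:
gpdb=# CREATE ROLE analyst WITH LOGIN PASSWORD 'changeme' CREATEDB;  -- 'analyst' and the password are placeholders; grant only the attributes the user needs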
Use gpssh-exkeys:
gpssh-exkeys -h hostname1 -h hostname2 .. -h hostnameN
To check the version:
psql> select version();
or
postgres --gp-version
To check gp version at install:
psql> select * from gp_version_at_initdb;
Connect to the GPDB database using psql, then query the catalog or use the show command.
Example:
gpdb=# select name, setting from pg_settings where name='GUC';
or
gpdb=# show GUC;
cd $MASTER_DATA_DIRECTORY -- Master directory:
pg_hba.conf and postgresql.conf location and other GPDB internal directories.
cd $MASTER_DATA_DIRECTORY/pg_log -- Master database logfile location.
grep for ERROR, FATAL, SIGSEGV in the pg_log directory.
VACUUM reclaims storage occupied by deleted tuples. In normal GPDB operation, tuples that are deleted or obsoleted by an update are not physically removed from their table. They remain present on the disk until a VACUUM is done. Therefore, it is necessary to do VACUUM periodically, especially on a frequently updated table.
Unless you need to return space to the OS so that other tables or other parts of the system can use that space, you should use VACUUM instead of VACUUM FULL.
VACUUM FULL is only needed when you have a table that is mostly dead rows, that is, the vast majority of its contents have been deleted.
Even then, there is no point using VACUUM FULL unless you urgently need that disk space back for other things or you expect that the table will never again grow to its past size. Do not use it for table optimization or periodic maintenance as it is counterproductive.
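For routine maintenance, a plain VACUUM on a frequently updated table looks like this (sales is just an example table name):
gpdb=# VACUUM sales;  -- reclaims dead-row space without an exclusive lock; avoid VACUUM FULL for routine use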
ANALYZE collects statistics about the contents of tables in the database, and stores the results in the system table pg_statistic. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.
It is a good idea to run ANALYZE periodically, or just after making major changes in the contents of a table. Accurate statistics will help the query planner to choose the most appropriate query plan, and thereby improve the speed of query processing. A common strategy is to run VACUUM and ANALYZE once a day during a low-usage time of day.
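As an illustration of the daily maintenance strategy mentioned above (sales is a hypothetical table name):
gpdb=# VACUUM ANALYZE sales;  -- reclaim space and refresh planner statistics in one pass
gpdb=# ANALYZE sales;         -- statistics only, e.g. right after a large data load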
Resource queues are used for Greenplum Database workload management. All users and queries can be prioritized using resource queues. Refer to the Admin Guide for more details.
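A minimal sketch, using GPDB 4.x syntax and hypothetical names (queue adhoc, role jsmith); the limit shown is illustrative:
gpdb=# CREATE RESOURCE QUEUE adhoc WITH (ACTIVE_STATEMENTS=3);
gpdb=# ALTER ROLE jsmith RESOURCE QUEUE adhoc;  -- queries from jsmith now queue behind the active-statement limit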
gp_toolkit is a database schema that has many tables, views, and functions to better manage the Greenplum Database when the DB is up. In earlier 3.x versions, it was referred to as gp_jetpack.
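For example, one of the views shipped in the schema can be queried for table bloat (a sketch):
gpdb=# select * from gp_toolkit.gp_bloat_diag;  -- lists tables whose dead space suggests a VACUUM is needed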
Use the pg_dump utility to generate DDL.
Example:
pg_dump -t njonna.accounts -s -f ddl_accounts.sql
Where:
-f ddl_accounts.sql is the output file.
-t njonna.accounts is the table name with schema njonna.
-s dumps only the schema, no data.
For non-parallel backups:
Use postgres utilities (pg_dump, pg_dumpall for backup, and pg_restore for restore).
Another useful command for getting data out of the database is COPY ... TO (a sketch follows below).
For parallel backups:
Use gp_dump and gpcrondump for backup, and gp_restore for the restore process.
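A minimal COPY sketch, assuming a hypothetical sales table and a writable path on the master host (server-side COPY to a file requires a superuser):
gpdb=# COPY sales TO '/tmp/sales.csv' WITH CSV HEADER;  -- writes the file on the master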
If Prod and QA are on the same GPDB cluster, use CREATE DATABASE with a template.
If Prod and QA are on different clusters, use backup and restore utilities.
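A sketch of the same-cluster approach, assuming hypothetical database names prod_db and qa_db (the template database must have no active connections while it is copied):
gpdb=# CREATE DATABASE qa_db TEMPLATE prod_db;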
pg_dump - A non-parallel backup utility; you need a big file system because the backup is created on the master node only.
gp_dump - A parallel backup utility. The backup is created on the master and segment file systems.
gpcrondump - A wrapper utility for gp_dump, which can be called directly or from a crontab entry.
Example: gpcrondump -x <database_name>
Solaris: zfs snapshots at the file system level.
All OS: gpcrondump / gp_dump.
This can be done by checking for dropped packets on the interconnect ("netstat -i") and by running gpcheckperf. It is also possible that a segment is experiencing hardware problems, which can be found in the output of dmesg or in the system and database log files (the pg_log directory under each segment data directory).
When a session starts on the master, all of its child processes on the segments are identified by the master session ID in their connection string ("con" + sess_id).
For example:
gpdb=# select * from pg_stat_activity;
 datid | datname | procpid | sess_id | .. ..
-------+---------+---------+---------+
 16986 | gpdb    |   18162 |   76134 | .. ..
On all segments, the child processes for session 76134:
[gpadmin@stinger2]/export/home/gpadmin/gp40>gpssh -f host_file /usr/ucb/ps -auxww | grep con76134
[stinger2] gpadmin 18162 ... postgres: port 4000, gpadmin gpdb [local] con76134 [local] cmd3 CREATE DATABASE ...
[stinger2] gpadmin 18625 ... postgres: port 40000, gpadmin gpdb 10.5.202.12(18864) con76134 seg0 cmd4 MPPEXEC UTILITY ...
[stinger2] gpadmin 18669 ... grep con76134
[stinger3] gpadmin 22289 ... postgres: port 40000, gpadmin gpdb 10.5.202.12(18866) con76134 seg1 cmd4 MPPEXEC UTILITY ...
Check the "waiting" column in pg_stat_activity and the "granted" column in pg_locks for any object-level locks.
Look for locks that have been held for a very long time and that multiple other queries are waiting on.
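A sketch of a query combining the two views (column names as in GPDB 4.x, where pg_stat_activity exposes procpid and current_query):
select l.locktype, l.relation::regclass, l.granted, a.procpid, a.waiting, a.current_query
from pg_locks l join pg_stat_activity a on l.pid = a.procpid
where not l.granted or a.waiting;  -- sessions holding or waiting on contested locks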
Use the Greenplum Performance Monitor (gpperfmon), which has a GUI to monitor query performance and history.
It's a monitoring tool that collects statistics on system and query performance and builds historical data.
Master
There are many ways. The simplest steps are: unload the data into CSV files, create the corresponding tables in the Greenplum database, create an external table, start gpfdist pointing to the external table location, and load the data into Greenplum (see the sketch below). You can also use the gpload utility; gpload creates an external table at runtime.
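A minimal sketch of the external-table route, assuming a hypothetical customer table, an ETL host named etl1 serving the file customer.csv with gpfdist on port 8081:
CREATE EXTERNAL TABLE ext_customer (id int, name text)
LOCATION ('gpfdist://etl1:8081/customer.csv')
FORMAT 'CSV' (HEADER);
INSERT INTO customer SELECT * FROM ext_customer;  -- segments pull the file in parallel through gpfdist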
gp_dump, gpcrondump, pg_dump, pg_dumpall, copy
No. It is possible only if the gp_dump backup was taken for that one table alone.
Yes
-s (or --schema-only): dump only the object definitions (schema), not the data.
Segment nodes
Segment nodes
VACUUM FULL, CTAS.
A VACUUM FULL will reclaim all expired row space, but is a very expensive operation and may take an unacceptably long time to finish on large, distributed Greenplum Database tables. If you do get into a situation where the free space map has overflowed, it may be more timely to recreate the table with a CREATE TABLE AS statement and drop the old table.
A VACUUM FULL is not recommended in Greenplum Database.
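A sketch of the CTAS alternative mentioned above, assuming a hypothetical sales table distributed by id:
CREATE TABLE sales_new AS SELECT * FROM sales DISTRIBUTED BY (id);  -- rewrites only the live rows
DROP TABLE sales;
ALTER TABLE sales_new RENAME TO sales;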
There are two types of in-database compression available in the Greenplum Database for append-only tables: table-level compression, which is applied to an entire table, and column-level compression, which is applied to a specific column of a column-oriented table.
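A sketch of table-level compression on an append-only table, with illustrative settings (zlib at compression level 5) and a hypothetical table name:
CREATE TABLE sales_history (id int, amount numeric)
WITH (appendonly=true, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (id);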
Oracle is a relational database. Greenplum is MPP in nature, with a shared-nothing architecture. There are many other differences in terms of functionality and behavior.
Greenplum is built on top of PostgreSQL. It has a shared-nothing MPP architecture, best suited for data warehousing environments and big data analytics.
Oracle is an all-purpose database.
Look at the log files, and query the gpperfmon database and the gp_toolkit schema for various data and statistics.
Yes
No
Yes
Yes
Workload management is done by creating resource queues and assigning various limits to them.
Configuration tasks such as port changes, Active Directory authentication, pg_hba.conf changes, postgresql.conf changes, gpfdist setup, etc.
Vinod M is a Big Data expert writer at Mindmajix and contributes in-depth articles on various Big Data technologies. He also has experience in writing about Docker, Hadoop, Microservices, Commvault, and a few BI tools. You can get in touch with him via LinkedIn and Twitter.