Start networking and exchanging professional insights

Register now or log in to join your professional community.

Follow

I've 250 million row citizen data with field, name, gender, birth date, birth place, could u share best sql query to get potencial data duplication?

user-image
Question added by helmy yoga pranata , Database administrator , Ministry of Home Affairs
Date Posted: 2017/08/02
Wameedh Fawzi
by Wameedh Fawzi , Postpay CVM Manager , Vodafone

 

 

select t1.*

from

(Select

a.name

,a.gender

,a.birth_date

,a.birth_place

,row_number() over (partition by a.name,a.gender, a.birth_date order by a.name) as rnk

from Citizen_data a

) t1

where t1.rnk >1

Tahir Rehman
by Tahir Rehman , EHS System Analyst, Data Analyst, developer and trainer , Maaden

select name,gender, birthdate,birthplace,count(*) from Citizen group by name,gender, birthdate,birthplace having count(*) > 1;

Kasiviswanathan Selvakumar
by Kasiviswanathan Selvakumar , Full-Stack Developer , Freelance

Let us consider the table name be 'citizen_table' with 'name', 'gender', 'birth_date' and 'birth_place' as column names

The following sql query gets the potential data duplicates

select count(*) as number_of_repeated_rows, name, gender, birth_date, birth_place

from citizen_table

group by name, gender, birth_date, birth_place

 

More Questions Like This

Do you need help in adding the right keywords to your CV? Let our CV writing experts help you.