
Duplicate File Finder (C++)

Description: Duplicate File Finder is a Linux/Unix utility that scans a directory on your hard drive, including its subdirectories, for duplicates of a given file. The current code finds duplicates of one file at a time, but you can customize it to find duplicates of all files automatically. To understand the code, you need to be familiar with the Unix API.


Difficulty: Medium

Language: C++

Compiler/IDE: GCC



//Created By:   		Ibrahim Ahmed...
//Date: 		    	15 Oct, 2012...
//Compile: 			gcc DuplicateFileFinder.c -o duplicatefilefinder.exe
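//Run: 				./duplicatefilefinder.exe parent_dir filename...

#include <stdio.h>		//printf, snprintf, perror...
#include <stdlib.h>		//malloc, free...
#include <string.h>		//strcmp, memcmp...
#include <dirent.h>		//opendir, readdir, closedir...
#include <sys/types.h>
#include <sys/stat.h>		//stat, S_ISDIR, S_ISREG...
#include <fcntl.h>		//open, O_RDONLY...
#include <unistd.h>		//read, close...
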
#define false 0
#define true  1

int duplicateCount = 0;

int FindDuplicates(char* path, char* fileName);
int CompareFiles(char* originalFile, char* currFile);

int main(int argc, char *argv[])
{
	//Two additional arguments are expected: Parent dir, file to find duplicates of...

	if (argc != 3)
	{
		printf("Usage: %s 'Base Directory' 'File Name'\n", argv[0]);
		return -1;
	}

	//argv[1] = base dir, argv[2] = file to find duplicates of; e.g argv[1] = /home,
	//argv[2] = "file.txt"...

	FindDuplicates(argv[1], argv[2]);
	printf("\n\nFound %d duplicate(s)\n", duplicateCount);
	return 0;
}

int FindDuplicates(char* path, char* fileName)
{
	DIR *dir;
	struct dirent *dp;
	struct stat statp;

	char absoluteFilePath[4096];	//4096 matches PATH_MAX on Linux...

	if ((dir = opendir(path)) == NULL)
	{
		perror("Failed to open directory");
		return -1;
	}

	while ((dp = readdir(dir)) != NULL)
	{
		//readdir returns . and .. which we should ignore...
		if (strcmp(dp->d_name, ".") && strcmp(dp->d_name, ".."))
		{
			//build the entry's full path relative to base path, e.g. /home/file.txt...
			int len = snprintf(absoluteFilePath, sizeof(absoluteFilePath), "%s/%s", path, dp->d_name);
			if (len < 0 || len >= (int) sizeof(absoluteFilePath))
				continue;	//path does not fit in the buffer; skip it...

			//check if the current entry is actually file or dir...
			if (stat(absoluteFilePath, &statp) == -1)
				continue;	//cannot examine this entry; skip it...

			if (S_ISDIR(statp.st_mode))		//is a directory...
			{
				//recurse through this dir...
				FindDuplicates(absoluteFilePath, fileName);
			}
			else if (S_ISREG(statp.st_mode))	//is a file...
			{
				//check for duplicates here...
				//skip the file specified by user itself (same path string), then compare...
				if (strcmp(fileName, absoluteFilePath))
				{
					if (CompareFiles(fileName, absoluteFilePath))
					{
						//yes, duplicate; print it and count it...
						printf("%s\n", absoluteFilePath);
						duplicateCount++;
					}
				}
			}		//end else if (S_ISREG(statp.st_mode))...

		}	    //if (strcmp(dp->d_name, ".") && strcmp(dp->d_name,".."))...
	}	    //end while...

	closedir(dir);
	return 0;
}


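int CompareFiles(char* originalFile, char* currFile)
{
	//two step comparison: (1) first check size; if not same, return false.
	//If equal, (2) compare file content. If equal, return true, false otherwise...

	struct stat statOriginal, statCurr;
	if (stat(originalFile, &statOriginal) == -1 || stat(currFile, &statCurr) == -1)
		return false;		//cannot read file info, not sure if file is duplicate...

	//Step 1...
	if (statOriginal.st_size != statCurr.st_size)  //size not same...
		return false;

	//Step 2...
	//size matches, files can be same; confirm it by matching both file contents...

	int fdOriginal  = open(originalFile, O_RDONLY);
	int fdCurr	= open(currFile, O_RDONLY);

	if (fdOriginal == -1 || fdCurr == -1)
	{
		//error occurred, not sure if file is duplicate; close whichever file did open...
		if (fdOriginal != -1) close(fdOriginal);
		if (fdCurr != -1) close(fdCurr);
		return false;
	}

	//we will read file in small chunks and compare...

	int chunkSize = 1024, bytesRead, bytesReadCurr;
	int isDuplicate = true;
	char *bufferOriginal  = (char*) malloc(chunkSize * sizeof(char));
	char *bufferCurr      = (char*) malloc(chunkSize * sizeof(char));

	if (bufferOriginal == NULL || bufferCurr == NULL)
		isDuplicate = false;	//out of memory, cannot compare...

	while (isDuplicate)
	{
		//read file in chunk...
		bytesRead = read(fdOriginal, bufferOriginal, chunkSize);
		if (bytesRead < 0)
			isDuplicate = false;	//read error...
		if (bytesRead <= 0)
			break;		//end of file (or error)...

		bytesReadCurr = read(fdCurr, bufferCurr, bytesRead);

		//compare exactly the bytes read; memcmp, unlike strcmp, does not stop at a zero byte...
		if (bytesReadCurr != bytesRead || memcmp(bufferOriginal, bufferCurr, bytesRead))
			isDuplicate = false;	//content not matching...
	}

	//release buffers and file descriptors on every path...
	free(bufferOriginal);
	free(bufferCurr);
	close(fdOriginal);
	close(fdCurr);

	return isDuplicate;
}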

Notes: To run the program, you need to supply two inputs.

(1) Base Directory

This is the starting directory for scanning. The base directory is scanned completely, including all of its subdirectories, to find duplicates.

(2) File Name

This is the path of the file you want to find duplicates of.

Sample Run:

Suppose you have a directory named /home/user and you want to scan it for duplicates of the file /home/file.txt (note that "File Name" does not have to be inside "Base Directory"). You would run the program as follows:

./duplicatefilefinder.exe /home/user /home/file.txt

This assumes that duplicatefilefinder.exe is in the current directory.
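One caveat for a run like this: the listing skips the target file itself by comparing path strings, so if the target lies inside the base directory but is written differently on the command line (for example as a relative path), it may be reported as its own duplicate. A minimal sketch of a stricter check, assuming a POSIX system, compares device and inode numbers instead. The helper name IsSameFile is hypothetical and is not part of the original program.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper (not in the original listing).
   Returns 1 if both paths refer to the same file on disk,
   0 otherwise, including when either path cannot be examined. */
int IsSameFile(const char *pathA, const char *pathB)
{
	struct stat a, b;

	if (stat(pathA, &a) == -1 || stat(pathB, &b) == -1)
		return 0;

	/* On POSIX systems, the device number plus the inode number
	   identify a file regardless of how its path is spelled. */
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

Note that hard links to the same file share an inode, so this check would treat them as the same file and not as duplicates.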

To find duplicates, the program walks through the directory you are interested in and compares every file in it against the target file. Comparing the full contents of every file would be prohibitively slow on a large directory tree, so the program first compares file sizes: if the sizes differ, the files cannot be identical. Only when the sizes match does it compare the contents, and if the contents also match, it reports the file as a duplicate.
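The same size-first idea is one possible starting point for the customization mentioned in the description, where duplicates of all files are found in one pass. The sketch below is not the original author's code: the FileEntry record and the CountCandidatePairs function are hypothetical names, and the sketch assumes the paths and sizes have already been collected during the directory scan. It sorts the records by size and counts the pairs that share a size, which are the only pairs that would still need a content comparison.

```c
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical record: one scanned file and its size in bytes. */
typedef struct {
	const char *path;
	off_t size;
} FileEntry;

/* qsort comparator: order entries by size, smallest first. */
static int CompareBySize(const void *a, const void *b)
{
	off_t sa = ((const FileEntry *) a)->size;
	off_t sb = ((const FileEntry *) b)->size;
	return (sa > sb) - (sa < sb);
}

/* Sorts entries by size and returns the number of file pairs that
   share a size. Files with a unique size are ruled out without
   ever being opened. */
long CountCandidatePairs(FileEntry *entries, size_t count)
{
	long pairs = 0;
	size_t start = 0;

	qsort(entries, count, sizeof(FileEntry), CompareBySize);

	while (start < count)
	{
		/* find the end of the run of entries with the same size */
		size_t end = start + 1;
		while (end < count && entries[end].size == entries[start].size)
			end++;

		size_t n = end - start;	/* files in this equal-size run */
		pairs += (long) (n * (n - 1) / 2);
		start = end;
	}

	return pairs;
}
```

For example, five files with sizes 10, 20, 10, 10 and 5 form ten possible pairs, but only the three pairs among the 10-byte files share a size, so only those would go on to a content comparison such as CompareFiles.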