Given a list of integers that contains duplicate elements, the task is to generate another list containing only the duplicate elements. In simple words, the new list should contain the elements that appear more than once.

Method 1: Using the brute-force approach

Below is the implementation:
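The original code block appears to have been lost in extraction. Here is a minimal brute-force sketch; the input list is an assumption, chosen so that the result matches the output shown below.

```python
# Sample input (assumed); chosen so the result matches the output shown below.
numbers = [10, 20, 30, 20, 20, 30, 40, 50, -20, 60, 60, -20, -20]

duplicates = []
for i in range(len(numbers)):
    for j in range(i + 1, len(numbers)):
        # Record an element the first time a later copy of it is found,
        # skipping elements already recorded.
        if numbers[i] == numbers[j] and numbers[i] not in duplicates:
            duplicates.append(numbers[i])

print(duplicates)  # [20, 30, -20, 60]
```

The nested loop compares every pair of elements, which is why this approach is O(n**2) and only suitable for small lists.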
Output:

[20, 30, -20, 60]

Method 2: Using Counter() from the collections module
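The Counter-based code block was likewise lost in extraction. Below is a minimal sketch; the input list is an assumption, reconstructed so its element counts match the output shown below.

```python
from collections import Counter

# Sample input (assumed); reconstructed so the counts match the output below.
numbers = [1, 2, 1, 2, 3, 4, 1, 5, 5, 6, 7, 8, 9, 9, 1, 2]

# Counter tallies how many times each element occurs.
counts = Counter(numbers)
print(counts)

# Keep only the elements that occur more than once.
duplicates = [item for item, count in counts.items() if count > 1]
print(duplicates)  # [1, 2, 5, 9]
```

This is a single O(n) pass over the list, so it scales far better than the brute-force approach.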
Output:

Counter({1: 4, 2: 3, 5: 2, 9: 2, 3: 1, 4: 1, 6: 1, 7: 1, 8: 1})
[1, 2, 5, 9]
You can use iteration_utilities.duplicates:

```python
>>> from iteration_utilities import duplicates
>>> list(duplicates([1, 1, 2, 1, 2, 3, 4, 2]))
[1, 1, 2, 2]
```

or, if you only want one of each duplicate, this can be combined with iteration_utilities.unique_everseen:

```python
>>> from iteration_utilities import unique_everseen
>>> list(unique_everseen(duplicates([1, 1, 2, 1, 2, 3, 4, 2])))
[1, 2]
```

It can also handle unhashable elements (at the cost of performance):

```python
>>> list(duplicates([[1], [2], [1], [3], [1]]))
[[1], [1]]
>>> list(unique_everseen(duplicates([[1], [2], [1], [3], [1]])))
[[1]]
```

That's something only a few of the other approaches here can handle.

Benchmarks

I did a quick benchmark containing most (but not all) of the approaches mentioned here. The first benchmark covered only a small range of list lengths because some approaches have O(n**2) behavior. In the graphs the y-axis represents time, so a lower value means better. The plots are also log-log so the wide range of values can be visualized better.

Removing the O(n**2) approaches, I did another benchmark with up to half a million elements in a list.

As you can see, the iteration_utilities.duplicates approach is faster than any of the other approaches, and even chaining unique_everseen(duplicates(...)) was faster than or as fast as the others. One additional interesting thing to note is that the pandas approaches are very slow for small lists but can easily compete for longer lists. However, as these benchmarks show, most of the approaches perform roughly equally, so it doesn't matter much which one is used (except for the three with O(n**2) runtime).
```python
from iteration_utilities import duplicates, unique_everseen
from collections import Counter
import pandas as pd
import itertools


def georg_counter(it):
    return [item for item, count in Counter(it).items() if count > 1]

def georg_set(it):
    seen = set()
    uniq = []
    for x in it:
        if x not in seen:
            uniq.append(x)
            seen.add(x)
    return uniq

def georg_set2(it):
    seen = set()
    return [x for x in it if x not in seen and not seen.add(x)]

def georg_set3(it):
    seen = {}
    dupes = []
    for x in it:
        if x not in seen:
            seen[x] = 1
        else:
            if seen[x] == 1:
                dupes.append(x)
            seen[x] += 1
    return dupes

def RiteshKumar_count(l):
    return set([x for x in l if l.count(x) > 1])

def moooeeeep(seq):
    seen = set()
    seen_add = seen.add
    # adds all elements it doesn't know yet to seen and all others to seen_twice
    seen_twice = set(x for x in seq if x in seen or seen_add(x))
    # turn the set into a list (as requested)
    return list(seen_twice)

def F1Rumors_implementation(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    for k, g in zip(a, b):
        if k != g:
            continue
        if k != r:
            yield k
        r = k

def F1Rumors(c):
    return list(F1Rumors_implementation(c))

def Edward(a):
    d = {}
    for elem in a:
        if elem in d:
            d[elem] += 1
        else:
            d[elem] = 1
    return [x for x, y in d.items() if y > 1]

def wordsmith(a):
    return pd.Series(a)[pd.Series(a).duplicated()].values

def NikhilPrabhu(li):
    li = li.copy()
    for x in set(li):
        li.remove(x)
    return list(set(li))

def firelynx(a):
    vc = pd.Series(a).value_counts()
    return vc[vc > 1].index.tolist()

def HenryDev(myList):
    newList = set()
    for i in myList:
        if myList.count(i) >= 2:
            newList.add(i)
    return list(newList)

def yota(number_lst):
    seen_set = set()
    duplicate_set = set(x for x in number_lst if x in seen_set or seen_set.add(x))
    return seen_set - duplicate_set

def IgorVishnevskiy(l):
    s = set(l)
    d = []
    for x in l:
        if x in s:
            s.remove(x)
        else:
            d.append(x)
    return d

def it_duplicates(l):
    return list(duplicates(l))

def it_unique_duplicates(l):
    return list(unique_everseen(duplicates(l)))
```

Benchmark 1

```python
from simple_benchmark import benchmark
```
```python
import random

funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, RiteshKumar_count,
    moooeeeep, F1Rumors, Edward, wordsmith, NikhilPrabhu, firelynx,
    HenryDev, yota, IgorVishnevskiy, it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 12)}

b = benchmark(funcs, args, 'list size')
b.plot()
```

Benchmark 2

```python
funcs = [
    georg_counter, georg_set, georg_set2, georg_set3, moooeeeep, F1Rumors,
    Edward, wordsmith, firelynx, yota, IgorVishnevskiy,
    it_duplicates, it_unique_duplicates
]

args = {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(2, 20)}

b = benchmark(funcs, args, 'list size')
b.plot()
```

Disclaimer: this is from a third-party library I have written, iteration_utilities.
Using the set function:

```python
arr = [1, 4, 2, 5, 2, 3, 4, 1, 4, 5, 2, 3]
arr2 = list(set(arr))
print(arr2)
```
Using a loop, which also preserves the original order:

```python
arr = [1, 4, 2, 5, 2, 3, 4, 1, 4, 5, 2, 3]
arr3 = []
for i in arr:
    if i not in arr3:
        arr3.append(i)
print(arr3)
```
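A shorter order-preserving alternative, assuming Python 3.7 or later (where dicts are guaranteed to preserve insertion order), is dict.fromkeys; this one-liner is not from the original article, just a sketch:

```python
arr = [1, 4, 2, 5, 2, 3, 4, 1, 4, 5, 2, 3]

# dict keys are unique and (from Python 3.7) keep insertion order,
# so this removes duplicates while preserving first-seen order.
arr3 = list(dict.fromkeys(arr))
print(arr3)  # [1, 4, 2, 5, 3]
```

Unlike the loop above, this runs in O(n) because dict membership checks are constant time, whereas `i not in arr3` scans the list each iteration.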
Using a lambda (the input list needs to be defined first):

```python
arr = [1, 4, 2, 5, 2, 3, 4, 1, 4, 5, 2, 3]
rem_duplicate_func = lambda arr: set(arr)
print(rem_duplicate_func(arr))
```
Removing duplicates from the values of a dictionary:

```python
dict1 = {
    'car': ["Ford", "Toyota", "Ford", "Toyota"],
    'brand': ["Mustang", "Ranz", "Mustang", "Ranz"]
}
dict2 = {}
for key, value in dict1.items():
    dict2[key] = set(value)
print(dict2)
```
Using symmetric_difference on two sets. Note that this keeps the elements that appear in exactly one of the two sets, i.e. it drops the values common to both:

```python
set1 = {1, 2, 4, 5}
set2 = {2, 1, 5, 7}
rem_dup_ele = set1.symmetric_difference(set2)
print(rem_dup_ele)  # {4, 7}
```
Finding duplicates in a Python list and removing them are quite common tasks, because Python lists readily collect duplicates. Checking whether a list contains duplicates is something Python programmers do all the time, and fortunately it is relatively easy. Once you spot duplicates, you can take several actions, such as listing them or removing them.
But before we delve deeper into each of these tasks, it is better to quickly understand what lists are and why duplicates can exist in them. I also want you to know about the Set data type in Python. Once you know their unique points and their differences, you will better appreciate the methods used to identify and remove duplicates from a Python list.

What is a List in Python?

A list in Python is like an array: a collection of objects stored in a single variable. A list is changeable; you can add or remove elements from it. A list can be sorted too, but by default it is not sorted. A Python list can contain duplicates, and it can contain elements of different data types, so you can store integers, floating-point numbers, strings, and even boolean values in the same list. Python lists can also contain other lists, and can grow to any size. But lists are considered slower at element lookup than tuples, and some methods are more suited for small lists while others are better for large lists; it largely depends on the list size. You define a list by enclosing the elements in square brackets, separated by commas.

What is a Set in Python?

A Set is another data type available in Python. Here also you can store multiple items, but a set differs from a list in that a set cannot contain duplicates. You define a set with curly braces, as compared to a list, which is defined using square brackets. A set in Python is not ordered or indexed: you cannot access a set element by position, and you should not rely on its elements appearing in any particular order. Once you have created a set, you can add elements to it, but you can't change the existing elements.
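To make the difference concrete, here is a small sketch (the list values are reused from the examples later in this article) showing that a set silently drops duplicates while a list keeps them:

```python
mylist = [5, 3, 5, 2, 1, 6, 6, 4]   # a list keeps the duplicates
myset = {5, 3, 5, 2, 1, 6, 6, 4}    # a set silently drops them

print(len(mylist))  # 8
print(len(myset))   # 6

# You can add to a set; adding an already-present value changes nothing.
myset.add(7)
myset.add(7)
print(len(myset))   # 7
```

This length difference between a list and the set built from it is exactly what the first duplicate-detection method below relies on.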
Now that you have a basic understanding of lists and the Set data type in Python, we will explore how to identify and remove duplicates in Python lists.

Multiple Ways to Check if Duplicates Exist in a Python List
We will be using Python 3 as the language, so as long as you have any version of the Python 3 interpreter, you are good to go.

Method 1: Using the length of a list to identify if it contains duplicate elements

Let's write the Python program to check this.

```python
# this input list contains duplicates
mylist = [5, 3, 5, 2, 1, 6, 6, 4]  # 5 & 6 are duplicate numbers

# find the length of the list
print(len(mylist))

# create a set from the list
myset = set(mylist)

# find the length of the set
print(len(myset))
```

Output:

8
6

As you can see, the length of mylist is 8, while the length of myset is 6, because the set dropped the duplicates. Here's the full program, which can be copied into a Python file and used to check whether identical items exist in a list or not:

```python
# this input list contains duplicates
mylist = [5, 3, 5, 2, 1, 6, 6, 4]  # 5 & 6 are duplicate numbers

# find the length of the list
print(len(mylist))

# create a set from the list and find its length
myset = set(mylist)
print(len(myset))

# compare the lengths and report whether the list contains duplicates
if len(mylist) != len(myset):
    print("duplicates found in the list")
else:
    print("No duplicates found in the list")
```

Output:

8
6
duplicates found in the list

Alternatively, we can write a function that returns True or False to alert us of duplicates. Here is the complete function to check if duplicates exist in a Python list:

```python
def is_duplicate(anylist):
    if not isinstance(anylist, list):
        raise TypeError("Passed parameter is not a list")
    return len(anylist) != len(set(anylist))

mylist = [5, 3, 5, 2, 1, 6, 6, 4]  # you can see some repeated numbers in the list
```
```python
if is_duplicate(mylist):
    print("duplicates found in list")
else:
    print("no duplicates found in list")
```

The output of this Python code is:

duplicates found in list

Method 2: Listing Duplicates in a List & Listing Unique Values – Sorted

In this method, we build two separate lists: one to hold the duplicated values and another to hold the unique values. A few lines of code can do the magic.

```python
# the given list contains duplicates
mylist = [5, 3, 5, 2, 1, 6, 6, 4]  # the original list of integers with duplicates

newlist = []  # empty list to hold unique elements from the list
duplist = []  # empty list to hold the duplicate elements from the list

for i in mylist:
    if i not in newlist:
        newlist.append(i)
    else:
        # this catches each repeated occurrence and appends it to duplist
        duplist.append(i)

# print the duplicate entries and the unique entries
print("List of duplicates", duplist)
print("Unique Item List", newlist)
```

Output:

List of duplicates [5, 6]
Unique Item List [5, 3, 2, 1, 6, 4]

And if you want to sort the list items after removing the duplicates, you can use the built-in sort method on the list:

```python
# the sort method sorts the list in place
newlist.sort()
print("The sorted list", newlist)
```

Output:

The sorted list [1, 2, 3, 4, 5, 6]

Method 3: Listing only Duplicate Values with the Count Method

This method iterates over each element of the list and checks whether the count of that element is greater than 1. If so, the item is added to a set, and since a set cannot contain duplicates by design, each repeated element appears only once in the result.

```python
# the mylist variable represents a list with duplicates
mylist = [5, 3, 5, 2, 1, 6, 6, 4]  # the original input list with repeated elements
```
```python
dup = {x for x in mylist if mylist.count(x) > 1}
print(dup)

# to count how many distinct elements were duplicated, you can run
print(len(dup))
```

Output:

{5, 6}
2

Keep in mind that each listed duplicate value might have occurred twice, or even more times, in the original list.

The Fastest Way to Remove Duplicates From Python Lists

One of the fastest ways to remove duplicates is to create a set from the list variable, and all of this can be done in a single Python statement. Because it is the fastest method, it is well suited for large lists. Here's the final code in Python:

```python
# this list contains the duplicate numbers 5 & 6
mylist = [5, 3, 5, 2, 1, 6, 6, 4]
myunique = set(mylist)

# prints the final collection without any duplicates
print(myunique)
```

Output:

{1, 2, 3, 4, 5, 6}

How to Avoid Duplicates in a Python List

The first thing to ask is: why am I using a list in Python at all, given that lists collect duplicates? If you are absolutely sure that duplicates should not exist in whatever you are collecting or storing, then don't use a list; use a set instead. A set is built to reject duplicates, so it is the better solution, and exploring sets further can be a real time saver. If you don't care about order, then just using set(mylist) will do the job of removing any duplicates, even in the worst case where the incoming list is full of duplicate elements. Alternatively, if you really must use a list because of the things you can do with the list data type, then do a simple check before you add any element.
So before you add any new element to a list, just do a quick check for the existence of the value; if the element already exists, don't store it. Simple! The methods discussed above work on any list of elements, so whether you want to find duplicate strings, duplicate integers, duplicate floating-point numbers, or any other kind of duplicate objects, you can use these Python programs. I hope these different ways to find duplicates, list them, and finally remove them altogether from any Python list will come in handy for your list processing.
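The check-before-adding approach described above can be sketched as follows; the helper name add_unique is hypothetical, not a standard library function:

```python
def add_unique(target, value):
    # hypothetical helper: append value only if it is not already present
    if value not in target:
        target.append(value)

result = []
for value in [5, 3, 5, 2, 1, 6, 6, 4]:
    add_unique(result, value)

print(result)  # [5, 3, 2, 1, 6, 4]
```

Note that `value not in target` scans the list on every call, so for large collections a set-based membership check is the faster choice.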