Tag Archives: Python pandas merge error

Python pandas merge cannot merge two data frames based on column names (Key Error)?

toss

I am used to R for data analysis, and I feel that pandas is a bit anti-human to use. When processing data with pandas in python today, the two data frames simply cannot be merged.

I have two data frames, the column names are unknown, I can only know the index, and which two indexes are used as the primary key merge. (Don’t ask me why the listing is unknown, because I am a development tool).

The idea is to find the primary key column, rename it, and merge it.

df1.columns.values[args.marker1-1]="markerID"
df2.columns.values[args.marker2-1]="markerID"
pd.merge(df1,df2,on='markerID')

But the result is that it cannot be merged anyway. Report an error KeyError:'markerID'.

I checked it online and said that columns.valuesassignments cannot be used directly , and they must be separated. then:

colnames_df1 = df1.columns
colnames_df1[args.marker1-1] = "markerID"
df1.columns = colnames_df1
colnames_df2 = df2.columns
colnames_df2[args.marker1-1] = "markerID"
df2.columns = colnames_df2
pd.merge(df1,df2,on='markerID')

There is actually no difference.TypeError: Index does not support mutable operations

Print out all the column names of the data frame. It seems normal. Why can’t it be merged?

Use a very rough method:

pd.merge(df1,df2,left_index=True,right_index=True)

For most of the data, but some data is not, and no error is reported, the result is wrong, there is a great risk.

Solution

Finally, I checked on the Internet and finally found the answer.

Reason:
The primary key names of the two data frames look the same, but may actually be different because they may contain spaces. how come? No solution.

The solution is to remove any spaces that may exist in the column names.

method one:

df1.columns = df1.columns.str.strip() 
df2.columns=df2.columns.str.strip()

Method 2:
Remove when the data is read in.

pd.read_csv(file,sep='\s*,\s*') 
# delimiter includes x*whitespace before and after

The data corresponding to me is:

df1.columns = df1.columns.str.strip()
df2.columns = df2.columns.str.strip()
mrkid = df1.columns.values[args.marker-1]
df1.columns.values[args.marker-1]="markerID"
mergesnp = pd.merge(df1,df2,on='markerID')
mergesnp.columns.values[args.marker-1]=mrkid