Category Archives: python

Convert nested spark table Row structs to python dicts in functions, and then convert processed Rows to dataframes

Its been a while since I used flatMaps in spark, and most often it can be replaced by using Explode. But in my case today, I constructed a list of Rows and then converted them to an actual DF without the painful schema registry using rdd operations.

now, input is kinda messy and I used my convert functions to convert everything into dictionaries should they come in:

Of course, there are many other processing steps, this is just converting the massively nested data into the ones that I like to see, into easy python dictionaries instead of the nasty spark Row objects.

Now I can do my processing based on the converted data.

After that when I have to save the data, I did something like

def process_df(a,b,c):

....

return [T.Row(**item) for item in processed_data.values()]

so they become one list of Rows. Then, in order to convert the list of Row back to the df, I did this

df = data.rdd.flatMap(lambda x:process_df(x[0], x[1], x[2])).toDF()

there it is! The nested ugly dataframe now has been turned into a flat structure and much easier to get analysis going.

Heat map subplots sharing same color bar pandas with seaborn

Leave a reply

Make sure you have pandas and seaborn installed
plt.cla() plt.close() fig, (ax0,ax1) = plt.subplots(1, 2, sharex=True, sharey=True) cbar_ax = fig.add_axes([.91,.3,.03,.4]) sns.heatmap(pd1.corr(),ax=ax0,cbar=True,vmin=-1,vmax=1,cbar_ax = cbar_ax) ax0.set_title('title1') sns.heatmap(pd2.corr(),ax=ax1,cbar=True,vmin=-1,vmax=1,cbar_ax = cbar_ax) ax1.set_title('title 2') fig.suptitle('big title',fontsize=20) #saving figure for publication if needed plt.savefig('save.tif', dpi=300) plt.show()

thats it! I guess the most important thing here is cbar_ax = fig.add_axes([.91,.3,.03,.4]) and make sure you have a fixed vmin and vmax.

source: http://stackoverflow.com/questions/24653986/saving-matplotlib-figure-with-add-axes

How to make Custom estimator class and custom scorer to do cross validation using sklearn api on your custom model

Leave a reply

I made a combined weak classifier model, needed a custom estimator and custom scorer. I went through a few stack overflow articles however none actually targeted specifically for cross validation in sklearn.

Then I figured I would try to implement baseestimator class, and make my own scorer. It WORKED. :>

Therefore, I am posting instructions here on how to use it, hopefully its gonna be useful to you.

Steps:

Write your own estimator class, just make sure to implement base estimator (or extend I am not sure how this works in python but its similar. base estimator is like an interface or abstract class provides basic functionalities for estimator)
Write your loss function or gain function, and then make your own scorer
Use the sklearn api to do cross validation. Using whatever you have created in 1 and 2.

Code: Please read comments. Important.


#create a custom estimator class
#Keep in mind. This is just a simplified version. You can treat it as any other class, just make sure the signitures should stay same, or you should add default value to other parameters
from sklearn.base import BaseEstimator

class custom_classifier(BaseEstimator):

  from sklearn import tree

  from sklearn.cluster import KMeans

  import numpy as np

  from sklearn.cluster import KMeans

  #Kmeans clustering model

  __clusters = None

  #decision tree model

    __tree = None

    #x library.

    __X = None

    #y library.

    __y = None

    #columns selected from pandas dataframe.

    __columns = None
    def fit(self, X, y, **kwargs):

        self.fit_kmeans(self.__X,self.__y)

        self.fit_decisiontree(self.__X,self.__y)
    def predict(self,X):

        result_kmeans = self.__clusters.predict(X)

        result_tree = self.__tree.predict(X)

        result = result_tree

        return np.array(result)
    def fit_kmeans(self,X,y):

        clusters = KMeans(n_clusters=4, random_state=0).fit(X)

        #the error center should have the lowest number of labels.(implementation not shown here)

        self.__clusters = clusters
    def fit_decisiontree(self,X,y):

        temp_tree = tree.DecisionTreeClassifier(criterion='entropy',max_depth=3)

        temp_tree.fit(X,y)

        self.__tree = temp_tree

Now we have our class. We need to build hit/loss function:


#again, feel free to change any thing in the hit function. As long as the function signature remain the same.
def seg_tree_hit_func(ground_truth, predictions):

    total_hit = 0

    total_number = 0

    for i in xrange(len(predictions)):

        if predictions[i]==2:

            continue

        else:

            total_hit += (1-abs(ground_truth[i]-predictions[i]))

            total_number+=1.0

        print 'skipped: ',len(predictions)-    total_number,'/',len(predictions),'instances'

    return total_hit/total_number if total_number!=0 else 0

Now we still need to build scorer.


from sklearn.metrics.scorer import make_scorer
#make our own scorer

score = make_scorer(seg_tree_hit_func, greater_is_better=True)

We have our scorer, our estimator, and so we can start doing cross-validation task:


#change the 7 to whatever fold validation you are running.
scores = cross_val_score(custom_classifier(), X, Y, cv=7, scoring=score)

There it is! You have your own scorer and estimator, and you can use sklearn api to plug it in anything from sklearn easily.

Hope this helps.

Python threading vs multiprocessing package

Leave a reply

1. Threading package actually slows down your system because of GIL
2. Multithreading packages make use of processes and can have the expected speed gain
If you want to pass global dictionary, use Manager.dict()

Install new version of python and set up virtualenvwrapper under your user account (e.g. you don’t have admin rights)

Leave a reply

I ran into a problem today trying to install python libraries on server which I don’t have admin rights. Not happy but got a solution around. This way its easier to maintain the environment myself. Hopefully someone can make a software that manages software environment in user space on linux.

Install python 2.7.10 under your user/.localpython
Replace the USER_NAME with your own username.
mkdir ~/src mkdir ~/.localpython cd ~/src wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz tar -zxvf Python-2.7.10.tgz cd Python-2.7.10

./configure -prefix=/home/USER_NAME/.localpython
make
make install

Install virtualenv
cd ~/src wget https://pypi.python.org/packages/source/v/virtualenv/virtualenv-13.1.2.tar.gz#md5=b989598f068d64b32dead530eb25589a tar -zxvf virtualenv-13.1.2.tar.gz cd virtualenv-13.1.2 ~/.localpython/bin/python setup.py install
install per
cd ~/src wget https://pypi.python.org/packages/source/p/pbr/pbr-1.8.1.tar.gz#md5=c8f9285e1a4ca6f9654c529b158baa3a tar -zxvf pbr-1.8.1.tar.gz cd pbr-1.8.1 ~/.localpython/bin/python setup.py install

install virtualenvwrapper
cd ~/src wget https://pypi.python.org/packages/source/v/virtualenvwrapper/virtualenvwrapper-4.7.1.tar.gz#md5=3789e0998818d9a8a4ec01cfe2a339b2 tar -zxvf virtualenvwrapper-4.7.1.tar.gz cd virtualenvwrapper-4.7.1 ~/.localpython/bin/python setup.py install

install stevedore (dependency for virtualenvwrapper
cd ~/src wget https://pypi.python.org/packages/source/s/stevedore/stevedore-1.9.0.tar.gz#md5=53e2bc3b49dd9c920cfce7f63822b1a5 tar -zxvf stevedore-1.9.0.tar.gz cd stevedore-1.9.0 ~/.localpython/bin/python setup.py install

Now last step:
Edit your ~/.bashrc file so your python distribution is the one when you type which python command. Add the following lines at the top.

export PATH="/home/USER_NAME/.localpython/bin:$PATH" export WORKON_HOME=~/.virtualenvs export PROJECT_HOME=~/Devel source /home/USER_NAME/.localpython/bin/virtualenvwrapper.sh
exit your server and log in again. You should be good to go to make mkvirtualenv and do whatever you want.

The above guide is tested on my University cluster where python2.6 was installed however i don’t have control over custom libraries. I could use virtualenv though.

Tkinter text search and annotation

Leave a reply

textbox search in python

This function can be used for any other python program.

Python filter array based on multiple masks

Leave a reply

I used this in my fft transformed data to filter some wavelengths.

mask = [all(tup) for tup in zip(fftFreq>=32,fftFreq<=45)]

fftData[mask] then will output the filtered results.

Installing scipy on redhat with error “no lapack/blas resources found”

Leave a reply

Update Feb21 2016:
For centos: 
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/lapack-devel-3.2.1-4.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/blas-devel-3.2.1-4.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/texinfo-tex-4.13a-8.el6.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/libicu-devel-4.2.1-9.1.el6_2.x86_64.rpm
sudo yum localinstall *.rpm

sudo pip install scipy
This is copied from http://stackoverflow.com/questions/24708213/install-r-on-redhat-errors-on-dependencies-that-dont-exist

For Ubuntu 14.04:
sudo apt-get install gfortran

sudo apt-get install libblas-dev liblapack-dev