computing star: Python


November 19, 2015

python == and is operators

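In Python, the == operator tests whether two variables have equal values, while the is operator tests whether two variables refer to the same object.

For example:

a = [1, 2, 3]
b = [1, 2, 3]

a == b  # True
a is b  # False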

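When the value being compared is None, True, or False, I habitually use the is operator. Recently, however, I was very surprised to find that this does not always work for True and False. My program ran into the following situation: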


a=func(...) # a is False

if a is False:

   # do something

Here the test a is False evaluated to False instead of True; the condition had to be changed to a == False before it evaluated to True.
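One common way to end up in this situation, assuming func returns a NumPy boolean rather than a built-in bool (the post does not say what func returns), is sketched below for Python 3. A numpy.bool_ value compares equal to False but is a different object, so the identity test fails. The array and variable names are illustrative only.

```python
import numpy as np

values = np.array([1, 2, 3])

# any() on a NumPy array returns a NumPy boolean, not the built-in bool
flag = (values > 5).any()

is_identical = flag is False   # identity test against the built-in singleton
is_equal = flag == False       # value comparison
is_falsy = not flag            # truthiness test, the usual idiom

print(type(flag), is_identical, bool(is_equal), is_falsy)
```

If this is the cause, testing truthiness with not a (or converting with bool(a)) avoids depending on which boolean type the function returns.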

November 16, 2015

using numpy in cython

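Lately I have been writing a lot of code with Cython + NumPy, and there are many small details worth noting.

The official documentation, http://docs.cython.org/src/tutorial/numpy.html, gives the basic usage as follows: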


cimport cython
cimport numpy as cnp

ctypedef cnp.float64_t FLOAT_t

@cython.wraparound(False)
@cython.boundscheck(False)
def myfunc(cnp.ndarray[FLOAT_t, ndim=2] f):
    pass

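The C-level data types defined by NumPy are named after the Python-level dtypes with a _t suffix; the supported types are listed at the link.
wraparound(False) disables negative indexing.
boundscheck(False) speeds up array element access by skipping bounds checks, so out-of-range indices go undetected.
Other available options are listed under compiler directives.
Declaring the argument as cnp.ndarray[FLOAT_t, ndim=2] bypasses Python's slow indexing and accesses the C buffer directly.
If a function does not use NumPy broadcasting or related functions on the array and only reads and writes elements, the variable can be declared as a memoryview instead.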

June 18, 2015

ipython parallel callback

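Calling apply_async() returns an AsyncResult object, so I originally expected to be able to attach a callback function to process the result when a task finishes, as with apply_async in the multiprocessing module.

However, apply_async in ipython parallel does not seem to support callbacks, so I later found a simple way to handle task completion on stackoverflow: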

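from IPython.parallel import Client

c = Client()
dview = c[:]
# f and args are the function and the list of arguments to dispatch
asyncs = [dview.map_async(f, [arg]) for arg in args]
while asyncs:
    # iterate over a copy so finished items can be removed;
    # the loop variable is not named async, which is a reserved word in Python 3.7+
    for ar in asyncs[:]:
        if ar.ready():
            asyncs.remove(ar)
            print(ar.result[0])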

June 15, 2015

Today I tested the simplex projection algorithm from J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, "Efficient projections onto the l1-ball for learning in high dimensions," in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 272-279.

I compared the algorithm's results with solving the same problem directly in pyomo. The problem is a quadratic program: the objective is to minimize the Euclidean distance between the original vector and the projected vector, and the constraint set is convex.

With pyomo I tried three solvers: cplex, gurobi, and ipopt. The cplex and gurobi results agreed with the algorithm, whereas the ipopt answer was surprisingly far off. It seems commercial solvers are still more reliable.
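For reference, here is a minimal NumPy sketch (Python 3) of the sort-based projection onto the simplex as I understand it from the paper. It is not the code used in the test above; the function name project_to_simplex and the parameter z (the required sum of the projected vector) are names chosen here for illustration.

```python
import numpy as np

def project_to_simplex(v, z=1.0):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = z} (sort-based variant)."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                  # entries in decreasing order
    css = np.cumsum(u)                    # running sums of the sorted entries
    ks = np.arange(1, v.size + 1)
    # largest index whose sorted entry stays positive after shifting
    rho = np.nonzero(u * ks > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)  # shift that makes the result sum to z
    return np.maximum(v - theta, 0.0)

w = project_to_simplex([0.8, 0.6, -0.1])
print(w, w.sum())
```

A result like this can be compared entry by entry against the solver output to see which solver deviates.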

ipython parallel import

If packages need to be imported on the engines for parallel processing, the following approach works:

rc = Client(profile='ssh')
rc.block = True
dview = rc.direct_view()
dview.use_dill()

# run the import locally and on every engine
with dview.sync_imports():
    import numpy as np

# update the engines' namespace dict
dview['np'] = np
dview['MyDerive'] = MyDerive

If you have your own package (e.g. ipro), PYTHONPATH must be set on every machine so that the package can be imported directly in Python.

ipython parallel serialization

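In parallel processing, whether you use a direct view or a load-balanced view, object serialization is a troublesome problem. While looking for a solution, I found that dill can handle it.

Only the direct view supports the use_dill() function, so to use dill with a load-balanced view as well, use the following approach (from stackoverflow).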

rc = Client()
rc.direct_view().use_dill()
lv = rc.load_balanced_view()

To use a user-defined class, you also have to take care of data movement via the DirectView; otherwise the class instance's methods cannot be called on the engines:
from IPython.parallel import Client

class MyObject(object):

    def __init__(self, val):
        import os
        import platform
        self.name = platform.node()
        self.pid = os.getpid()
        self.val = val

    def run(self):
        return "{}, myobject {}_{}".format(self.val, self.name, self.pid)

def MyObjectrun(val):
    obj = MyObject(val)
    return obj.run()

def test_parallel():
    rc = Client(profile='ssh')
    rc.block = True
    print("nodes: {}".format(len(rc.ids)))
    rc.direct_view().use_dill()

    # push the class definition to every engine
    rc.direct_view()['MyObject'] = MyObject
    lview = rc.load_balanced_view()
    res = lview.apply_sync(MyObjectrun, 10)
    print(res)

ipython parallel

automatic setting: ipcluster
manual setting: ipcontroller and ipengine

Because we compute on multiple machines, we use the ipcontroller plus ssh setup.

Setting up the profile

$ipython profile create --parallel --profile=ssh
This creates a number of configuration files in /home/chenhh/.ipython/profile_ssh/.

Running ipcontroller

$ipcontroller --profile=ssh --reuse

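profile=ssh must be specified for the settings in profile_ssh to be read; otherwise the settings in profile_default are read by default.
--reuse tells ipcontroller to reuse the existing json files; otherwise new json files are generated every time ipcontroller runs.

Two files, ipcontroller-client.json and ipcontroller-engine.json, are generated in the /home/chenhh/.ipython/profile_ssh/security/ directory.
Send the engine file to the engine machine with scp: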

scp /home/chenhh/.ipython/profile_ssh/security/ipcontroller-engine.json ./

The engine is then started with the following command:

ipengine --file=./ipcontroller-engine.json

ipcontroller_config.py settings

HubFactory

c.HubFactory.ip = u'*'

# listen on all interfaces; this setting corresponds to the location entry in ipcontroller-engine.json.

IPControllerApp

c.IPControllerApp.work_dir = u'/home/chenhh/.ipython/profile_ssh'
c.IPControllerApp.profile = u'ssh'
c.IPControllerApp.reuse_files = True

engine

$ipython profile create --parallel --profile=ssh
Copy the file into the /tmp directory with scp chenhh@192.168.1.1:/home/chenhh/.ipython/profile_ssh/security/ipcontroller-engine.json /tmp
$ipengine --file=/tmp/ipcontroller-engine.json --ssh=chenhh@192.168.1.1 --profile=ssh

Note that one ipengine uses only one CPU; to use several CPUs, run the command above several times.

ipython

from IPython.parallel import Client
rc = Client(profile="ssh")
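Once the profile is specified, the client is ready to use. (From IPython 4.0 on, this module is provided by the separate ipyparallel package.)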

debug

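The controller host uses zmq 4.0.4, while one of the engine hosts had zmq 2.0.2, so that host kept getting heartbeat timeouts when connecting.

Solution:

First upgrade the engine's zmq development package to libzmq3-dev:amd64.
pip uninstall pyzmq
Then download the pyzmq source code, run python setup.py install, and start ipython.

import zmq
zmq.zmq_version_info()  # check that the version is correct; once it is, the problem is solved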

November 21, 2014

django manual release query objects

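I used to call Django's ORM directly from standalone programs, but after querying a large amount of data I found that memory had been used up. The reason, explained at

https://docs.djangoproject.com/en/dev/faq/models/#why-is-django-leaking-memory

is that when the ORM is not run through a web request, you have to clear the cached query objects yourself:

from django.db import reset_queries
reset_queries()

After adding this call, memory usage returned to normal.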

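December 21, 2013

Compiling scipy on Gentoo (with Intel MKL)

After installing MKL on Gentoo, I followed the official instructions at
http://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl
to install numpy and then scipy.
The numpy part completed without trouble, but after installing scipy,
import scipy failed with: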

ImportError: /usr/lib64/python2.7/site-packages/scipy/sparse/sparsetools/_csr.so: undefined symbol: __intel_sse4_strlen

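The cause is as follows. C/C++ has three distinct stages, compile -> link -> run, and an undefined symbol can occur in any of them. Before tackling an annoying undefined symbol, you need to understand the whole build process:

1. Compiling a .c / .cpp file into a .o (object file) requires the header files (gcc option -I). In fact, when compiling a single file, gcc/g++ does not care whether the symbol really exists; a declaration is enough, so including the right header suffices. This is also why compilation can be distributed: there are no dependencies between source files when they are compiled into .o files.
2. Linking the *.o files into a dynamic library or an executable with the linker (ld or gold) requires the libraries to link against (gcc option -L for the directory and -l for the library name). Unlike the previous step, the symbol must exist at this point.
3. At run time, the shared library is opened dynamically and the symbol is read from it. In other words, the previous step only checks that the symbol exists. Even if the check passes and the executable or shared library is produced, the symbol will still be missing at run time if the corresponding file has disappeared. If the location is not set up properly, LD_LIBRARY_PATH may be needed to point to the shared libraries, but this is not recommended; it is better to specify the location when running the linker.

What to do depends on the stage in which the undefined symbol occurs. If it occurs when building an object file, the compiler was not told where the header is, so pass it with -I. If it occurs during linking, set both -L and -l. The hard part is finding where the undefined symbol comes from.

My problem occurred at run time, so it belongs to the third case.
First, use nm _csr.so or objdump -x _csr.so to inspect the linkage.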

__intel_sse4_strlen is defined in
/opt/intel/lib/intel64/libirc.so
/opt/intel/lib/intel64/libiomp5.so

# use ldconfig -p to check the MKL setup (requires root)
# ldconfig -p|grep mkl
libiomp5.so (libc6,x86-64) => /opt/intel/composerxe/lib/intel64/libiomp5.so
libirc.so (libc6,x86-64) => /opt/intel/composerxe/lib/intel64/libirc.so

# check the file format
$file _csr.so

_csr.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, not stripped

# use ldd -r to list the missing symbols
$ldd -r _csr.so

linux-vdso.so.1 => (0x00007fff207ff000)
        libpython2.7.so.1.0 => /usr/lib64/libpython2.7.so.1.0 (0x00007f3757ffe000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/libstdc++.so.6 (0x00007f3757cf4000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f3757a72000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/libgcc_s.so.1 (0x00007f375785c000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f37574d2000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f37572b5000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f37570b1000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f3756ead000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f37587e8000)
undefined symbol: __intel_sse4_strlen   (./_csr.so)
undefined symbol: _intel_fast_memset    (./_csr.so)
undefined symbol: _intel_fast_memcpy    (./_csr.so)
undefined symbol: __intel_sse4_strncmp (./_csr.so)
undefined symbol: __intel_sse4_strcpy   (./_csr.so)
undefined symbol: __intel_sse4_strncpy (./_csr.so)
undefined symbol: __intel_sse4_strcat   (./_csr.so)

Solution:
First, build scipy with the following command:

python setup.py config --compiler=intelem --fcompiler=intelem build_clib --compiler=intelem --fcompiler=intelem build_ext --compiler=intelem --fcompiler=intelem install >>build.log
This writes the build-time log to build.log.
Running cat build.log|grep _csr.so then gives:
c++ -shared build/temp.linux-x86_64-2.7/scipy/sparse/sparsetools/csr_wrap.o -L/usr/lib64 -Lbuild/temp.linux-x86_64-2.7 -lpython2.7 -o build/lib.linux-x86_64-2.7/scipy/sparse/sparsetools/_csr.so
copying build/lib.linux-x86_64-2.7/scipy/sparse/sparsetools/_csr.so -> /usr/lib64/python2.7/site-packages/scipy/sparse/sparsetools
Since we already know the problem is in linking, change the link command above as follows and run it in the scipy directory:
c++ -shared build/temp.linux-x86_64-2.7/scipy/sparse/sparsetools/csr_wrap.o -L/opt/intel/lib/intel64 -L/usr/lib64 -Lbuild/temp.linux-x86_64-2.7 -lpython2.7 -lirc -o build/lib.linux-x86_64-2.7/scipy/sparse/sparsetools/_csr.so
After it finishes, copy the resulting file to the system directory (requires root):
cp build/lib.linux-x86_64-2.7/scipy/sparse/sparsetools/_csr.so /usr/lib64/python2.7/site-packages/scipy/sparse/sparsetools
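Once everything is done, run import scipy.sparse (or scipy.stats) to confirm that the problem is solved.

Besides _csr.so, several other files (_csc.so, _coo.so, _dia.so, _bsr.so, and _csgraph) were fixed the same way.

_bsr.so needs -lsvml in addition to -lirc.

December 18, 2013

A fast way to access the diagonal elements of a matrix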

import numpy as np

n = 1000
c = 20
a = np.random.rand(n, n)

# divide the diagonal elements by c in place
a[np.diag_indices_from(a)] /= c # 119 microseconds
a.flat[::n+1] /= c # 25.3 microseconds

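# extract the off-diagonal elements (as a flattened array);
# np.delete expects indices, so pass the flat positions of the diagonal

np.delete(a, np.arange(0, a.size, n+1))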
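The indexing tricks above can be checked on a small matrix. The following Python 3 sketch, with an illustrative 4x4 array, compares the strided flat view against np.diag and shows two equivalent ways to get the off-diagonal entries; the boolean-mask variant is an alternative added here, not part of the original timing comparison.

```python
import numpy as np

n = 4
a = np.arange(n * n, dtype=float).reshape(n, n)

# every (n + 1)-th element of the flattened array lies on the diagonal
diag = a.flat[::n + 1].copy()

# off-diagonal entries: delete the flat diagonal positions ...
off_diag = np.delete(a, np.arange(0, a.size, n + 1))

# ... or select with a boolean mask that is False on the diagonal
off_diag_mask = a[~np.eye(n, dtype=bool)]

print(diag)
print(off_diag)
```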

December 12, 2013

fast rolling window

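Suppose there are two arrays, S1 (size m) and S2 (size n) with m > n, and we want to compute a sliding function (such as mean or std) of S2 over S1. The usual approach is a loop, but Python loops are quite slow, so I recommend using NumPy strides to speed things up. In my tests, computing the standard deviation was about 100 times faster. The code is as follows: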

import time
import numpy as np

def rolling_window(a, window):
    # view of a with an extra trailing axis holding each window
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def test_rolling_window(winsize=3):
    observations = np.random.randn(1000)

    t = time.time()
    np.std(rolling_window(observations, winsize), 1)
    print("strided %.6f secs" % (time.time() - t))

    t = time.time()
    # + 1 so the loop covers the same windows as the strided version
    for idx in range(winsize, len(observations) + 1):
        np.std(observations[idx-winsize: idx])
    print("loop %.6f secs" % (time.time() - t))

The principle is described in detail in this post:
http://chintaksheth.wordpress.com/2013/07/31/numpy-the-tricks-of-the-trade-part-ii/
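Because a strided view shares memory with the original array, it is worth checking that it yields the same numbers as a plain loop. Below is a small Python 3 sketch of that check; the writeable=False argument and the ascontiguousarray call are precautions added here, not part of the original function.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling_window(a, window):
    """Read-only view of a whose last axis enumerates each length-`window` slice."""
    a = np.ascontiguousarray(a)
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return as_strided(a, shape=shape, strides=strides, writeable=False)

obs = np.arange(10, dtype=float)
win = 3

strided_std = np.std(rolling_window(obs, win), axis=1)
loop_std = np.array([np.std(obs[i:i + win]) for i in range(len(obs) - win + 1)])

print(strided_std.shape, np.allclose(strided_std, loop_std))
```

Recent NumPy releases (1.20 and later) also provide numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of view with bounds checking.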

October 1, 2013

numpy multiply broadcast by column

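Suppose array a has shape (2,5) and array c has shape (2,). To multiply them elementwise, scaling each row of a by the corresponding entry of c, I used to write:
(a.T*c).T
This requires two transposes, which becomes hard to read as the expression grows.

Later, while tracing someone else's code, I came across this form:
a*c[:, np.newaxis]
It achieves the same result and is easier to understand.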
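A quick Python 3 check with illustrative values confirms that the two expressions agree. Indexing with np.newaxis turns c from shape (2,) into shape (2, 1), which then broadcasts across the 5 columns of a.

```python
import numpy as np

a = np.arange(10, dtype=float).reshape(2, 5)
c = np.array([2.0, 10.0])

via_transpose = (a.T * c).T          # two transposes
via_newaxis = a * c[:, np.newaxis]   # c reshaped to (2, 1), broadcast over columns

print(c[:, np.newaxis].shape)
print(via_newaxis)
```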

August 10, 2013

[Theano] Running programs on the GPU from Eclipse

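First create a .theanorc file under {HOME} with the following contents:
[global]
floatX = float32
device = gpu0

[nvcc]
fastmath = True
After saving, exit back to {HOME}.
Run the nvcc command to check that the CUDA toolkit is installed.

Then create any Python file in Eclipse, import theano, and run it. The following error may appear: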
ERROR (theano.sandbox.cuda): nvcc compiler not found on $PATH. Check your nvcc installation and try again.

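This is because the PATH variable used by Eclipse's PyDev plugin differs from the one in the Linux shell.
Fix it with the following setting:
window->preferences->PyDev->Interpreter-Python->Environment->New-> Name: PATH, Value: ${env_var:PATH}:/usr/local/cuda-5.0/bin

Once this is set, run the following in a Python file
import os
print os.environ['PATH']
to confirm that the environment variable has been updated.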

March 9, 2013

Switching the BLAS implementation

Replacing ATLAS with OpenBLAS (Ubuntu 12.04)

sudo apt-get install libopenblas-base libopenblas-dev
sudo update-alternatives --all
set liblapack.so.3gf to /usr/lib/lapack/liblapack.so.3gf

Other BLAS alternatives are described on the Debian wiki: http://wiki.debian.org/DebianScience/LinearAlgebraLibraries

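December 17, 2012

Theano scan function

http://deeplearning.net/software/theano/tutorial/loop.html The benefits of scan are already explained at this link, but its usage still needs some explanation. Starting with the first example, the two programs below both compute A to the k-th power: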

result = 1
for i in xrange(k):
    result = result * A

import theano
import theano.tensor as T
theano.config.warn.subtensor_merge_bug = False

k = T.iscalar("k")
A = T.vector("A")

def inner_fct(prior_result, A):
    return prior_result * A

# Symbolic description of the result
result, updates = theano.scan(fn=inner_fct,
                            outputs_info=T.ones_like(A),
                            non_sequences=A, n_steps=k)

'''
Scan has provided us with A ** 1 through A ** k.  
Keep only the last value. Scan notices this and 
does not waste memory saving them.
'''
final_result = result[-1]

power = theano.function(inputs=[A, k], outputs=final_result,
                      updates=updates)

print power(range(10),2)
#[  0.   1.   4.   9.  16.  25.  36.  49.  64.  81.]

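fn is the function scan executes at each step; it can also be defined with a lambda.
The second parameter, outputs_info, is set to an array with the same shape as A whose entries are all 1.
non_sequences holds values that do not change during the scan; here A stays fixed throughout the loop.
n_steps is the number of iterations to run.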
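To make the roles of these four arguments concrete, here is a plain NumPy sketch (Python 3) of the recurrence that scan evaluates in this example. The helper scan_like is hypothetical and written only for illustration; it is not part of Theano and does none of Theano's symbolic compilation.

```python
import numpy as np

def scan_like(fn, outputs_info, non_sequences, n_steps):
    """Hypothetical eager stand-in for the recurrence scan performs here."""
    results = []
    prior = outputs_info                  # initial value fed to the first step
    for _ in range(n_steps):
        prior = fn(prior, non_sequences)  # each step receives the previous output
        results.append(prior)
    return np.array(results)              # one row per step, like scan's result

A = np.arange(10, dtype=float)
steps = scan_like(lambda prior_result, A: prior_result * A,
                  outputs_info=np.ones_like(A),
                  non_sequences=A,
                  n_steps=2)
final_result = steps[-1]                  # keep only the last step, as in the example

print(final_result)
```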

Theano shared variable

http://deeplearning.net/software/theano/tutorial/examples.html#using-shared-variables
The reference example there is as follows:

import theano.tensor as T
from theano import function, shared

state = shared(0)
inc = T.iscalar('inc')
accumulator = function([inc], state,
                              updates=[(state, state+inc)])

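The notable part is that state is a shared variable with 0 as its initial value.
The value can be shared among multiple functions. In a program, state.get_value() reads it and state.set_value(val) sets it.

The other part that needs explanation is the updates argument of function, updates=[(shared-variable, new-expression)]. It must be given as a list of pairs, or alternatively as a dict of key: value entries.
Its meaning is that on every call, the value of shared-variable is replaced by the result of new-expression.

Running the example therefore gives the following results:

state.get_value() # nothing has run yet, array(0)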
accumulator(1)    #array(0)->array(1)
state.get_value() #array(1)
accumulator(300)  #array(1)->array(301)
state.get_value() #array(301)
#reset shared variable
state.set_value(-1)
accumulator(3)
state.get_value() #array(-1)->array(2)

As noted above, a shared variable can be used by multiple functions, so we define another function, decrementor, that also accesses state:

decrementor = function([inc], state, updates=[(state, state-inc)])
decrementor(2)
state.get_value() #array(2)->array(0)
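The call sequence above can be mimicked in plain Python 3 to show the order of events: each call returns the old state first and then applies the update. SharedValue and make_function are hypothetical names written for this illustration only; they are not Theano's API.

```python
class SharedValue:
    """Hypothetical stand-in for a Theano shared variable (not Theano's API)."""
    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value

def make_function(state, update):
    """Return a callable that outputs the old state, then applies the update."""
    def call(inc):
        old = state.get_value()
        state.set_value(update(old, inc))
        return old
    return call

state = SharedValue(0)
accumulator = make_function(state, lambda s, inc: s + inc)
decrementor = make_function(state, lambda s, inc: s - inc)

history = []
history.append(accumulator(1))    # returns the old state; state becomes 1
history.append(accumulator(300))  # state becomes 301
state.set_value(-1)               # reset the shared state
accumulator(3)                    # state becomes 2
decrementor(2)                    # state becomes 0

print(history, state.get_value())
```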

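If you have an expression built from a shared variable but want to evaluate it with another value in place of the shared variable, use the givens argument of function, as in this example:

fn_of_state = state * 2 + inc
# the type (lscalar) must match the shared variable we
# are replacing with the ``givens`` list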
foo = T.lscalar() 
skip_shared = function([inc, foo], fn_of_state,
                                   givens=[(state, foo)])
skip_shared(1, 3)  # we're using 3 for the state, not state.value
state.get_value()  # old state still there, but we didn't use it
#array(0)

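Although the functions above are very convenient, the documentation does not say whether race conditions can occur.
http://deeplearning.net/software/theano/tutorial/aliasing.html
The section "Understanding Memory Aliasing for Speed and Correctness" explains that Theano has its own memory management mechanism (a pool) and that Theano manages changes to the variables in the pool.

Variables in Theano's pool live in a different memory space from Python variables, so the two do not conflict: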
Theano functions only modify buffers that are in Theano’s memory space.
Theano's memory space includes the buffers allocated to store shared variables and the temporaries used to evaluate functions.
Physically, Theano's memory space may be spread across the host, a GPU device(s), and in the future may even include objects on a remote machine.
The memory allocated for a shared variable buffer is unique: it is never aliased to another shared variable.
Theano's managed memory is constant while Theano functions are not running and Theano's library code is not running.
The default behaviour of a function is to return user-space values for outputs, and to expect user-space values for inputs.

The distinction between Theano-managed memory and user-managed memory can be broken down by some Theano functions (e.g. shared, get_value and the constructors for In and Out) by using a borrow=True flag. This can make those methods faster (by avoiding copy operations) at the expense of risking subtle bugs in the overall program (by aliasing memory).

Theano gpu setting

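According to the configuration notes on the official site:

http://deeplearning.net/software/theano/library/config.html#libdoc-config

To compute functions on the GPU instead of the CPU, the setting must be made before import theano. There are two ways:

1. Set it in $HOME/.theanorc
2. Set it in the THEANO_FLAGS environment variable

In the Eclipse development environment, method 2 is more flexible if different files need different settings. The steps are as follows:

Switch to the file to run, choose Run->Run Configurations->Environment->New from the top menu, enter THEANO_FLAGS as the Name and floatX=float32,device=gpu as the Value, then click Apply at the bottom.
Because Eclipse does not seem to read the PATH setting from the environment correctly, also add name=PATH, value=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/cuda/bin in the same dialog (depending on where CUDA is installed).

After that, the program can use the GPU for computation.

You can also run print theano.config to confirm that the settings are correct.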
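Outside Eclipse, the same environment variable can be set from within the script itself, as long as it happens before Theano is imported. The Python 3 sketch below only sets and parses the variable; the theano import is left commented out so the snippet runs on machines without Theano, and the flags dict is just an illustrative way to inspect the value.

```python
import os

# THEANO_FLAGS is read when theano is first imported, so set it beforehand
os.environ["THEANO_FLAGS"] = "floatX=float32,device=gpu"

# import theano  # uncomment where Theano is installed; must come after the line above

# split the comma-separated key=value pairs to inspect what was set
flags = dict(item.split("=") for item in os.environ["THEANO_FLAGS"].split(","))
print(flags)
```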

April 15, 2011

Computing eigenvalues with SciPy

A simple example of computing eigenvalues:

#-*-coding:utf-8-*-
'''
Created on 2011/4/15
@author: Hung-Hsin Chen
simple example for solving eigenvalues
'''
import numpy as np
from scipy import linalg
#A=[[1 2 3] [4 5 6] [7 8 9]]
A = np.arange(1,10).reshape(3,3)
eigvals, eigvecs = linalg.eig(A)
#eigen vector is column vector
for i in range(0,3):
    print eigvals[i], eigvecs[:,i]
#A*X = lambda * X
print np.dot(A, eigvecs[:,0].transpose())
print eigvals[0]*eigvecs[:, 0]
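Rather than comparing the two printed vectors by eye, the relation A x = lambda x can be checked numerically for every eigenpair. This Python 3 sketch uses numpy.linalg.eig instead of scipy.linalg.eig so that it depends on NumPy only; the residuals list is an illustrative name.

```python
import numpy as np

A = np.arange(1, 10, dtype=float).reshape(3, 3)
eigvals, eigvecs = np.linalg.eig(A)

# each column of eigvecs is an eigenvector; measure ||A x - lambda x|| for each pair
residuals = [
    np.linalg.norm(A @ eigvecs[:, i] - eigvals[i] * eigvecs[:, i])
    for i in range(3)
]

print(residuals)
```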

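January 19, 2011

Intel MKL

While searching for information just now, I came across the article Using Intel MKL in your Python program, which explains how to use MKL for numerical computation from Python. Being able to combine the two is great.

In the Intel MKL link advisor article, users fill in a form according to their needs and get the parameters required at link time, which can be used directly or written into a Makefile. That is also very convenient.

Compiling 64-bit R 2.10.1 with MKL in Linux describes how to compile R with MKL, which is very helpful for speeding up mathematical computation.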

December 2, 2010

Python programming tips

Some practical tricks that cut down on unnecessary steps

Swap

//the C++ and Java way:
temp = x
x = y
y = temp

#recommended in Python
x, y = y, x

Avoid checking for the key when reading a dict

d = { 'key': 'value' }
#the usual way
if 'key' in d: print d['key']
else: print 'not found'

#recommended
print d.get('key', 'not found')

Finding the minimum value and its position

s = [ 4,1,8,3 ]
#the usual way
mval, mpos = float('inf'), 0
for i in xrange(len(s)):
    if s[i] < mval: mval, mpos = s[i], i

#recommended
mval, mpos = min([ (s[i], i) for i in xrange(len(s)) ])

Reading a file

#the usual way
text = ''
fp = open('text.txt', 'r')
for line in fp: text += line

#recommendation 1
import string
text = string.join([ line for line in open('text.txt')], '')

#recommendation 2
text = ''.join([ line for line in open('text.txt')])

#recommendation 3
text = file('text.txt').read()

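Ternary expressions in Python

#the usual way; C++ would use ?: for this
if n >= 0: print 'positive'
else: print 'negative'

#recommendation 1, but it breaks when the middle expression is falsy (e.g. None)
print (n >= 0) and 'positive' or 'negative'

#recommendation 2, which solves the None problem
print (n >= 0 and ['positive'] or ['negative'])[0]

#recommendation 3
print ('negative', 'positive')[n >= 0]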
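Python 2.5 and later also have a built-in conditional expression, which avoids the falsy-value pitfall of the and/or trick altogether. The Python 3 sketch below contrasts the two; the function names sign_label and first_or_default are made up for this illustration.

```python
def sign_label(n):
    # conditional expression: only the selected branch is evaluated
    return 'non-negative' if n >= 0 else 'negative'

def first_or_default(value, default):
    # falls back only when value is None, so falsy values such as 0 survive
    return value if value is not None else default

# the and/or trick silently replaces a falsy middle value with the fallback
and_or_result = True and 0 or 5

print(sign_label(3), sign_label(-1), first_or_default(0, 5), and_or_result)
```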

Initializing a dict whose values are complex types

#the usual way
if y not in d: d[y] = { }
d[y][x] = 3

#recommended
d.setdefault(y, { })[x] = 3

#when the value is a list
d.setdefault(key, []).append(val)
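When every value in the dict has the same container type, collections.defaultdict from the standard library is an alternative to calling setdefault at each access. A short Python 3 sketch with made-up keys and values:

```python
from collections import defaultdict

# missing keys get a fresh dict automatically
nested = defaultdict(dict)
nested['y']['x'] = 3

# missing keys get a fresh list automatically
groups = defaultdict(list)
for key, val in [('a', 1), ('b', 2), ('a', 3)]:
    groups[key].append(val)

print(dict(nested), dict(groups))
```

One difference from setdefault is that merely reading a missing key from a defaultdict creates the entry, so setdefault or get remains the safer choice when lookups should not modify the dict.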
